Some Hardware Issues: the compute node

N. D. Hari Dass
Institute of Mathematical Sciences, Chennai
Efficiency Issues

- Here, efficiency will mean not only computational efficiency but also cost efficiency.
- An efficient cluster design must pay attention to both these factors.
Compute Nodes

- Performance is a combination of clock speed and the number of instructions per cycle (a rough peak estimate is sketched below).
- Type and amount of memory: RDRAM is fast but expensive; DDR SDRAM is optimal, about 50% cheaper but only about 20% slower.
- 266 MHz memory is readily available, but 400 MHz is already in the market.
- The amount of memory depends on the application. In our case it is 2 GB/node.
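A rough way to read the first bullet (a hedged sketch: the figure of 2 double-precision FLOPs per cycle via SSE2 on these Xeons is an assumption, not stated in the slides):

```latex
P_{\text{peak}} = \text{clock speed} \times \text{FLOPs per cycle},
\qquad 2.4~\text{GHz} \times 2~\text{FLOPs/cycle} \approx 4.8~\text{GFLOPs per CPU}.
```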
Compute Nodes: FSB

- FSB speed puts the real constraint on performance.
- 533 MHz is common, but 800 MHz is already in the market.
- Memory operates at a maximum of half the FSB speed.
- FSB speed determines memory bandwidth.
- To get the maximum memory bandwidth in MB/s, multiply the FSB speed in MHz by 8 (worked example below).
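Applying this rule to the 400 MHz FSB of the pilot cluster reproduces the 3.2 GB/s figure quoted later:

```latex
B_{\max} = 400~\text{MHz} \times 8~\text{bytes} = 3200~\text{MB/s} = 3.2~\text{GB/s}.
```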
The Front Side Bus (FSB)
FSB Bottleneck:

- Let P be the peak CPU performance in GFLOPs, B the memory bandwidth in GB/s, and F the number of bytes of data required per FLOP. The sustained performance is then limited by the smaller of P and B/F.
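Written out, the relation made precise on the next slide:

```latex
P_{\text{sustained}} \approx \min\!\left(P,\; \frac{B}{F}\right).
```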
FSB Bottleneck contd.

- F is application dependent. For lattice calculations it is around 1.
- When B/F >> P, P determines the performance.
- When B/F << P, the maximum sustained performance is B/F, independent of P!
- Example: for a Xeon with a 400 MHz FSB, the maximum B is 3.2 GB/s. If F = 2, B/F is 1.6 GFLOPs. Even if it were an 8 GHz CPU, this is all that one can get!
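Spelling out the arithmetic in that example:

```latex
\frac{B}{F} = \frac{3.2~\text{GB/s}}{2~\text{bytes/FLOP}} = 1.6~\text{GFLOPs}.
```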
More on FSB:

- Pitfalls of benchmarking: benchmarking with an application that has a low F would grossly overestimate performance.
- Actual memory performance is very different due to memory latency, etc.
- Example: for 1.5 GHz Intel processors, the load latencies for L1 and L2 are about 2 and 6 clock cycles respectively.
- These latencies increase drastically once main memory is accessed, due to bus limitations.
- To overcome these problems, techniques such as prefetching and data restructuring (more later) should be used; a small prefetch sketch follows.
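As an illustration of software prefetching (a minimal sketch assuming a GCC-style compiler; the function, names and prefetch distance are hypothetical, not from the slides):

```c
#include <stddef.h>

/* Scale one array into another, prefetching a few elements ahead so that
   the next cache lines are already on their way while the current data
   is being used. */
void scale(double *a, const double *b, size_t n, double s)
{
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&b[i + 8], 0, 1);  /* read hint, low temporal locality */
        a[i] = s * b[i];
    }
}
```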
"Wherever possible
the B/F..."
|
|
|
Wherever possible the B/F should be
reduced by modifying the graininess such that while an atom is resident in
the L2 cache as many instructions are carried out as possible. |
|
It is desirable to have as large a cache
as possible. |
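A minimal sketch of the idea (illustrative functions, not the author's code): fusing passes means each element is loaded from memory once and reused while it is cache-resident, so fewer bytes are moved per FLOP.

```c
#include <stddef.h>

/* Two separate passes: the array is streamed from main memory twice. */
void two_passes(double *x, size_t n, double a, double b)
{
    for (size_t i = 0; i < n; i++) x[i] *= a;
    for (size_t i = 0; i < n; i++) x[i] += b;
}

/* One fused pass: each element is loaded once and both operations are
   applied while it is still in a register/cache, roughly halving the
   memory traffic for the same number of FLOPs. */
void fused_pass(double *x, size_t n, double a, double b)
{
    for (size_t i = 0; i < n; i++) x[i] = a * x[i] + b;
}
```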
Networking Issues

- Application dependent.
- The important parameter is the ratio of communication time to computation time.
- If this ratio is high, it is important to go for efficient networking.
- Major factors: latency, bandwidth and the topology of the network.
- In a typical Gigabit Ethernet network, latencies are in the hundreds of microseconds and the sustained bandwidth is about 0.5 Gbps.
- Wulfkits have a latency of a few microseconds, and the sustained node-to-node bandwidth can be 2 Gbps.
Single or multiple CPUs per node?

- There are many performance overheads in multiple-CPU nodes.
- But single-CPU nodes increase the cost of networking.
- Scalability deteriorates with the number of nodes.
- 2-CPU nodes are a good compromise.
- Even with 2-CPU nodes, if there is a need for all-to-all communication, channels can get choked, leading to poor scalability.
Some solutions:

- Optimise the serial code in such a way that it exploits shared-memory features optimally. Then run MPI with one processor per node.
- Run MPI with as many processors as required. From the mynode assignments, create new communicators as follows: group all processors on a given node into a communicator, then create a collective communicator with one processor from each node (a sketch follows this list).
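A minimal MPI sketch of that second scheme (assumptions: 2-CPU nodes with ranks placed so that consecutive rank pairs share a node; the variable names are illustrative):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int world_rank, node_id, local_rank;
    MPI_Comm node_comm, leader_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* All processors on one physical node get the same colour
       (here rank/2, assuming 2 ranks per node). */
    node_id = world_rank / 2;
    MPI_Comm_split(MPI_COMM_WORLD, node_id, world_rank, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);

    /* Collective communicator with one processor (the local rank 0)
       from each node; the other ranks get MPI_COMM_NULL. */
    MPI_Comm_split(MPI_COMM_WORLD,
                   (local_rank == 0) ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    /* ... intra-node traffic on node_comm, inter-node collectives on leader_comm ... */

    MPI_Finalize();
    return 0;
}
```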
Software Issues

- Loop unrolling, inlining, modifying data structures, ...
- Fortran matrices are stored in column-major order: the order of storage is a(1,1), a(2,1), ...
- Array of structures:

    struct AOS { double x, y, z; };
    struct AOS Vertex[n];
"Structure of
Arrays:"
|
|
|
Structure of Arrays: |
|
Struct SOA { double x [n], y
[n], z [n]}; |
|
SOA Vertex; |
|
|
|
X_0, X_1, X_2,………….. |
|
Y_0, Y_1, Y_2,………….. |
|
Z_0, Z_1, Z_2,…………… |
|
|
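A side-by-side sketch of why the layout matters (illustrative functions; N and the loop bodies are assumptions for the example): with the structure of arrays, all x-values are contiguous, so a loop over one field makes unit-stride accesses that stream through cache and vectorise well, whereas the array of structures strides over 24 bytes per element.

```c
#include <stddef.h>

#define N 1024

struct AOS { double x, y, z; };
struct SOA { double x[N], y[N], z[N]; };

static struct AOS vertex_aos[N];
static struct SOA vertex_soa;

/* Array of structures: successive x values are 24 bytes apart. */
double sum_x_aos(void)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++) s += vertex_aos[i].x;
    return s;
}

/* Structure of arrays: successive x values are contiguous in memory. */
double sum_x_soa(void)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++) s += vertex_soa.x[i];
    return s;
}
```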
Exploiting Register Structure

- The Intel 32-bit architecture has 8 additional 64-bit registers called MMX and 8 128-bit registers called XMM.
- An XMM register can hold 4 single-precision or 2 double-precision floating point numbers.
- A single operation like

    addps xmm1, xmm2

  will simultaneously add the four single-precision numbers in xmm2 to those in xmm1 and store the result in xmm1.
- This can in principle give a speed-up of 4 for single precision and 2 for double precision (an intrinsics sketch follows).
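A minimal C sketch of the same idea using SSE2 intrinsics (assumptions: an SSE2-capable compiler and, for the aligned loads, 16-byte aligned arrays; the function name is illustrative):

```c
#include <emmintrin.h>
#include <stddef.h>

/* c[i] = a[i] + b[i], two double-precision additions per instruction. */
void add_arrays(double *c, const double *a, const double *b, size_t n)
{
    size_t i;
    for (i = 0; i + 2 <= n; i += 2) {
        __m128d va = _mm_load_pd(&a[i]);          /* load 2 doubles into an XMM register */
        __m128d vb = _mm_load_pd(&b[i]);
        _mm_store_pd(&c[i], _mm_add_pd(va, vb));  /* add them pairwise and store */
    }
    for (; i < n; i++)                            /* scalar remainder */
        c[i] = a[i] + b[i];
}
```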
KABRU – The Massive Cluster at IMSc
How it was built...
The team that built it...
"It was decided to
build..."
|
|
|
It was decided to build the cluster in
stages. |
|
A 9-node Pilot cluster as the first
stage. |
|
Actual QCD codes as well as extensive
benchmarkings were run. |
Pilot Cluster Node Configuration

- 1U rack-mountable servers
- Dual Xeon processors @ 2.4 GHz
- E7500 chipset with 400 MHz FSB
- 1 GB of 266 MHz ECC DDR memory
- 40 GB IDE hard disk
- 64-bit/66 MHz PCI slot with riser card
- Dolphin Wulfkit 2D networking
- MPI for communication
System Interconnects
Phase II
Phase II (also the final) Configuration

- 80 nodes (144 in the final form)
- 1U rack-mountable dual Xeon @ 2.4 GHz, E7501 chipset with 533 MHz FSB
- 2 GB ECC DDR memory per node (in the final form, 120 nodes x 2 GB and 24 nodes x 4 GB)
- 40 GB IDE HDD
- Dolphin 3D networking in a 5x4x4 topology
- 1.5 terabytes of network attached storage
KABRU in Final Form
Where does it stand internationally?

- On Oct 13, 2004, Kabru reached 1002 GFlops sustained performance.
- It is among the top 500 supercomputers of the world.
- It is the fastest academic computer in India.
- It is the only indigenously built Indian entry.