|
|
|
N.D. Hari Dass
|
Institute of Mathematical Sciences
|
Chennai
|
|
|
|
|
Here efficiency will mean not only computational efficiency but also cost efficiency.

An efficient cluster design must pay attention to both these factors.
|
|
|
|
CPU performance is a combination of clock speed and the number of instructions per cycle.

Type and amount of memory: RDRAM is fast but expensive. DDR is optimal: 50% cheaper but only 20% slower.

266 MHz is common, but 400 MHz is already in the market.

The amount of memory depends on the application. In our case it is 2 GB/node.
|
|
|
|
The FSB speed puts the real constraint on performance.

533 MHz is common, but 800 MHz is already in the market.

Memory operates at a maximum of half the FSB speed.

The FSB speed determines the memory bandwidth.

To get the maximum memory bandwidth in MB/s, multiply the FSB speed in MHz by 8 (the data bus is 8 bytes wide).
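
Worked example: a 400 MHz FSB gives at most 400 x 8 = 3200 MB/s = 3.2 GB/s, the figure used in the Xeon example below.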
|
|
|
|
|
If P is the peak CPU performance in GFlops, B the memory bandwidth in GB/s, and F the number of bytes of data required per FLOP, then the maximum sustained performance is

    min(P, B/F)
|
F is application dependent. For lattice calculations it is around 1.

When B/F >> P, P determines the performance.

When B/F << P, the maximum sustained performance is B/F, independent of P!

Example: for a Xeon with a 400 MHz FSB, the maximum B is 3.2 GB/s. If F = 2, B/F is 1.6 GFlops. Even if it is an 8 GHz CPU, this is all that one can get!
|
|
|
|
Pitfalls of benchmarking: benchmarking with an application that has a low F will grossly overestimate performance.

Actual memory performance is very different due to memory latency etc.

Example: for 1.5 GHz Intel processors the load latencies for L1 and L2 are about 2 and 6 clock cycles respectively.

These latencies increase drastically once main memory is accessed, due to bus limitations.

To overcome these problems, techniques such as prefetching and data restructuring (more later) should be used; a prefetch sketch follows.
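
A minimal sketch of software prefetching using GCC's __builtin_prefetch; the prefetch distance PFDIST is a hypothetical tuning parameter, not from the original:

    #define PFDIST 16   /* hypothetical prefetch distance, in elements */

    void scale(double *a, const double *b, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            /* request b[i + PFDIST] into cache ahead of its use;
               arguments: address, 0 = read, 1 = low temporal locality */
            if (i + PFDIST < n)
                __builtin_prefetch(&b[i + PFDIST], 0, 1);
            a[i] = 2.0 * b[i];
        }
    }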
|
|
|
|
Wherever possible, B/F should be reduced by modifying the graininess of the computation so that as many instructions as possible are carried out while an atom is resident in the L2 cache (see the blocking sketch below).

It is desirable to have as large a cache as possible.
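
One standard way to do this is cache blocking (tiling). A sketch for matrix multiplication follows, where the tile size BS is a hypothetical parameter tuned to the L2 cache and n is assumed to be a multiple of BS:

    #define BS 64   /* hypothetical tile size, tuned to the L2 cache */

    void matmul_blocked(int n, const double *a, const double *b, double *c)
    {
        int ii, jj, kk, i, j, k;
        /* c += a * b, computed tile by tile, so each BS x BS block
           of a, b and c stays resident in L2 while it is reused */
        for (ii = 0; ii < n; ii += BS)
            for (kk = 0; kk < n; kk += BS)
                for (jj = 0; jj < n; jj += BS)
                    for (i = ii; i < ii + BS; i++)
                        for (k = kk; k < kk + BS; k++)
                            for (j = jj; j < jj + BS; j++)
                                c[i*n + j] += a[i*n + k] * b[k*n + j];
    }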
|
|
|
|
The networking requirement is application dependent.

The important parameter is the ratio of communication time to computation time.

If this ratio is high, it is important to invest in an efficient network.

Major factors: latency, bandwidth and the topology of the network.

In a typical Gigabit Ethernet network, latencies are around a hundred microseconds and the sustained bandwidth is about 0.5 Gbps.

Wulfkit has a latency of a few microseconds, and the sustained node-to-node bandwidth can be 2 Gbps.
|
|
|
|
There are many performance overheads in multiple-CPU nodes.

But single-CPU nodes increase the cost of networking.

Scalability deteriorates with the number of nodes.

2-CPU nodes are a good compromise.

Even with 2-CPU nodes, if there is a need for all-to-all communication, channels can get choked, leading to poor scalability.
|
|
|
|
Optimise the serial code in such a way that it exploits shared-memory features optimally. Then run MPI with one processor per node.

Run MPI with as many processors as required. From the mynode assignments create new communicators as follows: group all processors on a given node into a communicator. Then create a collective communicator with one processor from each node (a sketch follows).
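
A minimal sketch of this communicator splitting in MPI, assuming ranks are numbered consecutively within each node; PROCS_PER_NODE is a hypothetical parameter (2 for dual-CPU nodes):

    #include <mpi.h>

    #define PROCS_PER_NODE 2   /* hypothetical: dual-CPU nodes */

    int main(int argc, char **argv)
    {
        int rank, mynode, node_rank;
        MPI_Comm node_comm, cross_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* mynode: index of the node this rank lives on */
        mynode = rank / PROCS_PER_NODE;

        /* group all processors on a given node into one communicator */
        MPI_Comm_split(MPI_COMM_WORLD, mynode, rank, &node_comm);
        MPI_Comm_rank(node_comm, &node_rank);

        /* collective communicator with one processor from each node;
           non-leader ranks pass MPI_UNDEFINED and get MPI_COMM_NULL */
        MPI_Comm_split(MPI_COMM_WORLD,
                       node_rank == 0 ? 0 : MPI_UNDEFINED,
                       rank, &cross_comm);

        /* ... intra-node traffic on node_comm, inter-node on cross_comm ... */

        MPI_Finalize();
        return 0;
    }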
|
|
|
|
Loop unrolling, inlining, modifying data structures, ... (an unrolling sketch follows).
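
A minimal sketch of 4-way loop unrolling for a dot product, assuming n is a multiple of 4; the four independent accumulators expose instruction-level parallelism:

    double dot(const double *a, const double *b, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;
        /* 4 elements per iteration: fewer loop tests, independent adds */
        for (i = 0; i < n; i += 4) {
            s0 += a[i]   * b[i];
            s1 += a[i+1] * b[i+1];
            s2 += a[i+2] * b[i+2];
            s3 += a[i+3] * b[i+3];
        }
        return s0 + s1 + s2 + s3;
    }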
|
Fortran matrices are stored in column-major order: the order of storage is a(1,1), a(2,1), ..., so inner loops should run over the leftmost index.
|
Array of structures:

    struct AOS { double x, y, z; };
    struct AOS Vertex[n];

Memory layout: x_0, y_0, z_0, x_1, y_1, z_1, ...
|
|
|
|
|
|
Structure of Arrays:

    struct SOA { double x[n], y[n], z[n]; };
    struct SOA Vertex;
|
|
|
Memory layout:

    x_0, x_1, x_2, ...
    y_0, y_1, y_2, ...
    z_0, z_1, z_2, ...
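
Note: the SOA layout keeps each component contiguous in memory, so a loop over one component runs at unit stride, which suits the SIMD loads described next.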
|
|
|
|
|
|
The Intel 32-bit architecture has 8 additional 64-bit registers called MMX and 8 128-bit registers called XMM.

An XMM register can hold 4 single precision or 2 double precision floating point numbers.

A single instruction like

    addps xmm1, xmm2

will simultaneously add the four single precision numbers in xmm2 to those in xmm1 and store the results in xmm1.

This can in principle give a speed-up of 4 for single precision and 2 for double precision.
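
A minimal sketch of the same operation from C, using SSE compiler intrinsics (the xmmintrin.h interface of Intel and GNU compilers); it assumes n is a multiple of 4 and 16-byte aligned arrays:

    #include <xmmintrin.h>

    void vadd(const float *a, const float *b, float *c, int n)
    {
        int i;
        for (i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);   /* load 4 single precision numbers */
            __m128 vb = _mm_load_ps(b + i);
            __m128 vc = _mm_add_ps(va, vb);   /* one addps: 4 adds at once */
            _mm_store_ps(c + i, vc);
        }
    }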
|
|
|
|
|
|
|
It was decided to build the cluster in stages.

A 9-node pilot cluster was built as the first stage.

Actual QCD codes as well as extensive benchmarks were run.
|
|
|
|
1U rackmountable servers

Dual Xeon processors @ 2.4 GHz

E7500 chipset with 400 MHz FSB

1 GB of 266 MHz ECC DDR memory

40 GB IDE hard disk

64-bit/66 MHz PCI slot with riser card

Dolphin Wulfkit 2D networking

MPI for communication
|
|
|
|
|
|
|
80 nodes (144 in the final form)

1U rackmountable dual Xeon @ 2.4 GHz, E7501 chipset with 533 MHz FSB

2 GB of ECC DDR memory per node (in the final form, 120 nodes with 2 GB and 24 nodes with 4 GB)

40 GB IDE HDD

Dolphin 3D networking in a 5x4x4 configuration

1.5 TB of network attached storage
|
|
|
|
|
|
On October 13, 2004, Kabru reached 1002 GFlops sustained performance.

It is among the Top500 supercomputers of the world.

It is the fastest academic computer in India.

It is the only indigenously built Indian entry.
|
|
|