Some Hardware Issues: the compute node
N.D. Hari Dass
Institute of Mathematical Sciences
Chennai
Efficiency Issues
Here efficiency will mean not only computational efficiency but also cost efficiency.
An efficient cluster design must pay attention to both these factors.
Compute Nodes
CPU performance depends on the combination of clock speed and the number of instructions per cycle.
Type and amount of memory: RDRAM is fast but expensive; DDR SDRAM is optimal, being about 50% cheaper but only about 20% slower.
266 MHz DDR is widely available, but 400 MHz is already in the market.
The amount of memory depends on the application; in our case it is 2 GB/node.
Compute Nodes: FSB
FSB speed puts the real constraint on performance.
533 MHz is common, but 800 MHz is already in the market.
Memory operates at a maximum of half the FSB speed.
FSB speed determines memory bandwidth.
To get the maximum memory bandwidth in MB/s, multiply the FSB speed in MHz by 8.
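As a quick check (the factor of 8 is the 64-bit, i.e. 8-byte, width of the bus), a 400 MHz FSB gives

\[
B_{\max} = 400\ \mathrm{MHz} \times 8\ \mathrm{B} = 3200\ \mathrm{MB/s} = 3.2\ \mathrm{GB/s},
\]

which is consistent with the Xeon example on the next slides.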
The Front Side Bus (FSB)
FSB Bottleneck:
If P is the peak CPU performance in GFLOPs, B the memory bandwidth in GB/s, and F the number of bytes of data required per FLOP, then the sustained performance cannot exceed the smaller of P and B/F.
FSB Bottleneck contd….
F is application dependent. For lattice calculations it is around 1.
When B/F >> P, P determines the performance.
When B/F << P, the maximum sustained performance is B/F, independent of P!
Example: for a Xeon with a 400 MHz FSB, the maximum B is 3.2 GB/s. If F = 2, B/F is 1.6 GFLOPs. Even if it were an 8 GHz CPU, this is all that one can get!
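Put compactly, the argument of these two slides is

\[
P_{\mathrm{sustained}} \approx \min\!\left(P,\ \frac{B}{F}\right),
\qquad \text{e.g. } \min\!\left(8\ \mathrm{GFLOPs},\ \frac{3.2\ \mathrm{GB/s}}{2\ \mathrm{B/FLOP}}\right) = 1.6\ \mathrm{GFLOPs}.
\]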
More on FSB:
Pitfalls of benchmarking: if it is done with an application with a low F, it grossly overestimates performance.
Actual memory performance is very different due to memory latency, etc.
Example: for 1.5 GHz Intel processors the load latencies for L1 and L2 are about 2 and 6 clock cycles respectively.
These latencies increase drastically once main memory is accessed due to bus limitations.
To overcome these problems, techniques such as prefetching and data restructuring (more later) should be used.
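As an illustration of prefetching, here is a minimal sketch in C using the GCC __builtin_prefetch intrinsic; the loop, the function name and the prefetch distance are assumptions for illustration, not taken from the talk:

   #include <stddef.h>

   /* Hide main-memory latency by requesting data a few iterations
      ahead of its use.  PREFETCH_AHEAD is an assumed distance that
      would be tuned per machine.                                   */
   void scale(double *a, const double *b, double s, size_t n)
   {
       size_t i;
       const size_t PREFETCH_AHEAD = 16;
       for (i = 0; i < n; i++) {
           if (i + PREFETCH_AHEAD < n)
               __builtin_prefetch(&b[i + PREFETCH_AHEAD], 0, 1);  /* read, low temporal locality */
           a[i] = s * b[i];                                       /* the actual work */
       }
   }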
"Wherever possible the B/F..."
Wherever possible, B/F should be reduced by adjusting the granularity so that as many operations as possible are carried out while a chunk of data is resident in the L2 cache (see the sketch below).
It is desirable to have as large a cache as possible.
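As a sketch of what adjusting the granularity can mean in code (the arrays, the passes and the block size are illustrative assumptions): instead of sweeping the full arrays once per operation, do all the operations on one L2-resident block before moving on, so fewer bytes cross the FSB per FLOP.

   #include <stddef.h>

   #define BLOCK 4096   /* assumed block size (in doubles) chosen to fit in L2 */

   /* Do all passes over one cache-resident block before moving to the
      next block; the data crosses the FSB once instead of once per pass. */
   void blocked_update(double *a, const double *b, size_t n)
   {
       size_t start, end, i;
       for (start = 0; start < n; start += BLOCK) {
           end = (start + BLOCK < n) ? start + BLOCK : n;
           for (i = start; i < end; i++) a[i] += b[i];         /* pass 1 */
           for (i = start; i < end; i++) a[i] *= 0.5;          /* pass 2: block still in L2 */
           for (i = start; i < end; i++) a[i] += b[i] * b[i];  /* pass 3 */
       }
   }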
Networking Issues
Application dependent.
The important parameter is the ratio of communication time to computation time.
If this is high, it is important to go for efficient networking.
Major factors: latency, bandwidth and the network topology.
In a typical Gigabit Ethernet network, latencies are of the order of a hundred microseconds and the sustained bandwidth is about 0.5 Gbps.
Wulfkits have latencies of a few microseconds, and the sustained node-to-node bandwidth can be 2 Gbps.
Single or multiple CPUs per node?
Many performance overheads in multiple CPU nodes.
But single CPU nodes increase the cost of networking.
Scalability deteriorates with the number of nodes.
2-CPU nodes a good compromise.
Even in 2-CPU nodes, if there is a need for all-to-all communication, channels can get choked, leading to poor scalability.
Some solutions:
Optimise the serial code in such a way that it exploits shared-memory features optimally; then run MPI with one processor per node.
Run MPI with as many processors as required. From the mynode assignments create new communicators as follows: group all processors on a given node into a communicator, then create a collective communicator with one processor from each node.
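A minimal sketch in C of the communicator construction described above, assuming MPI ranks are assigned consecutively within each node so that rank / procs_per_node labels the node; the function and variable names are hypothetical, not from the KABRU code:

   #include <mpi.h>

   void build_communicators(int procs_per_node,
                            MPI_Comm *node_comm, MPI_Comm *cross_comm)
   {
       int world_rank, node_id, node_rank;

       MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
       node_id = world_rank / procs_per_node;      /* which node this rank sits on */

       /* Group all processors on the same node into one communicator. */
       MPI_Comm_split(MPI_COMM_WORLD, node_id, world_rank, node_comm);

       /* Collective communicator with one processor from each node:
          only the local rank-0 process of every node joins it; the
          other ranks receive MPI_COMM_NULL.                           */
       MPI_Comm_rank(*node_comm, &node_rank);
       MPI_Comm_split(MPI_COMM_WORLD,
                      (node_rank == 0) ? 0 : MPI_UNDEFINED,
                      world_rank, cross_comm);
   }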
Software Issues
Loop unrolling, inlining, modifying data structures, … (a small sketch follows the SOA layout below)
Fortran matrices are stored in column-major order: the order of storage is a(1,1), a(2,1), …
Array of structures:
   struct AOS { double x, y, z; };
   struct AOS Vertex[n];
"Structure of Arrays:"
Structure of Arrays:
   struct SOA { double x[n], y[n], z[n]; };
   struct SOA Vertex;
x_0, x_1, x_2, …
y_0, y_1, y_2, …
z_0, z_1, z_2, …
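As a small sketch (the names and the fixed size N are illustrative): with the SOA layout each component is a contiguous, unit-stride stream, and the loop body here is unrolled two-way by hand, the kind of loop unrolling mentioned above.

   enum { N = 1024 };

   struct SOA { double x[N], y[N], z[N]; };

   /* Unit-stride sweep over the three component arrays, 2-way unrolled. */
   double sum_lengths_squared(const struct SOA *v)
   {
       double s0 = 0.0, s1 = 0.0;
       int i;
       for (i = 0; i + 1 < N; i += 2) {
           s0 += v->x[i]   * v->x[i]   + v->y[i]   * v->y[i]   + v->z[i]   * v->z[i];
           s1 += v->x[i+1] * v->x[i+1] + v->y[i+1] * v->y[i+1] + v->z[i+1] * v->z[i+1];
       }
       for (; i < N; i++)   /* remainder when N is odd */
           s0 += v->x[i] * v->x[i] + v->y[i] * v->y[i] + v->z[i] * v->z[i];
       return s0 + s1;
   }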
Exploiting Register Structure
Intel 32-bit architecture has 8 additional 64-bit registers called MMX and 8 128-bit registers called XMM.
Each XMM register can hold 4 single-precision or 2 double-precision floating point numbers.
A single instruction like
   addps xmm1, xmm2
will simultaneously add the 4 single-precision numbers in xmm2 to those in xmm1 and store the results in xmm1.
This can in principle give a speed-up of 4 for single precision and 2 for double precision.
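A minimal sketch in C using the SSE intrinsics from xmmintrin.h; _mm_add_ps is the intrinsic form of the packed addps instruction above. The function name is illustrative, and it assumes n is a multiple of 4 and the arrays are 16-byte aligned.

   #include <xmmintrin.h>

   /* a[i..i+3] += b[i..i+3], four single-precision numbers per instruction. */
   void add_packed(float *a, const float *b, int n)
   {
       int i;
       for (i = 0; i < n; i += 4) {
           __m128 va = _mm_load_ps(&a[i]);   /* load 4 floats into an XMM register */
           __m128 vb = _mm_load_ps(&b[i]);
           _mm_store_ps(&a[i], _mm_add_ps(va, vb));
       }
   }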
KABRU – The Massive Cluster at IMSc
How it was built…..
The Team that built it ..
"It was decided to build..."
It was decided to build the cluster in stages.
A 9-node Pilot cluster as the first stage.
Actual QCD codes were run on it, as well as extensive benchmarks.
Pilot Cluster Node Configuration
1U rackmountable servers
Dual Xeon Processors @ 2.4 GHz
E7500 chipset with 400 MHz FSB
1 GB of 266 MHz ECC DDR memory
40 GB IDE Hard disk
64-bit/66 MHz PCI slot with riser card
Dolphin Wulfkit 2D networking
MPI for communication
System Interconnects
Phase II
Phase II (Also the Final) Configuration
80 nodes (in final form 144)
1U rack-mountable Dual Xeon @ 2.4 GHz, E7501 chipset with 533 MHz FSB
2 GB ECC DDR memory (in final form, 120 nodes with 2 GB and 24 nodes with 4 GB)
40 GB IDE HDD
Dolphin 3D networking in a 5x4x4 topology
1.5 Terabytes of Network Attached Storage
KABRU in Final Form
Where does it stand internationally?
On Oct 13, 2004, Kabru reached a sustained performance of 1002 GFLOPs.
It is among the top 500 supercomputers of the world.
It is the fastest academic computer in India.
It is the only indigenously built Indian entry.