N.D. Hari Dass
Institute of Mathematical Sciences
Chennai
Fortran:
include 'mpif.h'
call MPI_XXX(..., ierr)
Case insensitive
Array storage column major:
a(1,1), a(2,1), a(3,1), ... contiguous

C:
#include "mpi.h"
rc = MPI_Xxx(...)
rc = MPI_SUCCESS on success
Case sensitive
Array storage row major:
a(1,1), a(1,2), a(1,3), ... contiguous
Fortran data types:
MPI_CHARACTER
MPI_INTEGER
MPI_REAL
MPI_DOUBLE_PRECISION
MPI_COMPLEX
MPI_DOUBLE_COMPLEX
MPI_LOGICAL

C data types:
MPI_CHAR
MPI_INT
MPI_LONG, MPI_SHORT
MPI_UNSIGNED_LONG, ...
MPI_FLOAT
MPI_DOUBLE
MPI_LONG_DOUBLE
In addition to the basic data types shown above, it is possible to create derived data types:
Contiguous
Vector
Indexed
Struct
MPI_TYPE_COMMIT(datatype, ierr)
MPI_TYPE_FREE(datatype, ierr)
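For example, here is a minimal sketch (the program and variable names are illustrative) that builds a Vector derived type describing one row of a double precision Fortran array, commits it, and frees it afterwards:

      program rowtype
      include 'mpif.h'
      integer n
      parameter (n=8)
      double precision a(n,n)
      integer rowtyp, ierr
      call MPI_INIT(ierr)
c     a row of a(n,n) is n elements separated by a stride of n,
c     because Fortran stores columns contiguously
      call MPI_TYPE_VECTOR(n, 1, n,
     &     MPI_DOUBLE_PRECISION, rowtyp, ierr)
      call MPI_TYPE_COMMIT(rowtyp, ierr)
c     ... rowtyp can now be used like any other datatype in
c     MPI_SEND / MPI_RECV calls, e.g. with buffer a(i,1) ...
      call MPI_TYPE_FREE(rowtyp, ierr)
      call MPI_FINALIZE(ierr)
      end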
#ifdef MPI
MPI routines
#endif
#ifdef SINGLE
Serial routines
#endif
Give the source files that contain #ifdef ... #endif blocks the suffix .F (so that they are run through the preprocessor) and the rest the suffix .f.
Let us say we have two source files, a1.f and a2.F:
ifort -c -O2 a1.f
ifort -DMPI -c -O2 a2.F
ifort -o all a1.o a2.o
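As an illustration, a2.F might contain something like the following sketch (the routine is purely illustrative, and an #else branch is used here instead of a separate SINGLE block):

c     a2.F : passed through the preprocessor because of the .F suffix
      subroutine sumall(x, n, total)
#ifdef MPI
      include 'mpif.h'
      integer ierr
#endif
      integer n, i
      double precision x(n), total, partial
      partial = 0.0d0
      do i = 1, n
         partial = partial + x(i)
      enddo
#ifdef MPI
c     combine the partial sums from all processes
      call MPI_ALLREDUCE(partial, total, 1,
     &     MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
#else
      total = partial
#endif
      end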
LDFLAGS = $(F77FLAGS)
.SUFFIXES: .o .f
.SUFFIXES: .o .F
.f.o:
	$(F77) -c $(F77FLAGS) $<
.F.o:
	$(F77) -D$(ARC) -c $(F77FLAGS) $<
all: qcd
qcd: a1.o a2.o
	$(F77) -o qcd a1.o a2.o \
	$(LDFLAGS)
qcd.o: qcd.F params2
Let cwd be the current working directory on the frontend, say kabru, where the executable 'qcd' has been successfully compiled.
Let input be the file containing the various input parameters.
Let kabru1 and kabru2 be the nodes (processors) on which the parallel job is to be run.
Copy the executable to the corresponding cwd on kabru1 and kabru2:
scp qcd user@kabru1:cwd
scp qcd user@kabru2:cwd
If cwd does not exist on these nodes, first create it on them:
rsh kabru1 "mkdir -p cwd"
Another way: first create a file nodes containing the list of nodes, then
scarcp -f nodes qcd cwd
scash -f nodes mkdir -p cwd
There are essentially two ways of launching jobs:
mpimon -stdin all qcd -- k1 k2 < input > output &
mpirun -np 2 -machinefile nodes qcd < input > output &
MPI jobs will hang for a variety of reasons, or not behave the way they should, e.g. take too long or give absurd results.
In such cases they should be terminated.
This should be done in such a way as to ensure a clean environment for subsequent jobs.
Scali MPI:
scakill -f nodes -s qcd
Danger: in Scali all jobs matching the given string are killed!
There are a number of important routines for monitoring and controlling an MPI environment.
MPI_INIT(ierr), MPI_Init(&argc, &argv)
This initialises the MPI environment, e.g. setting up the node id.
MPI_INITIALIZED(flag, ierr)
This tells whether MPI_INIT has been called or not.
MPI_COMM_SIZE(comm, size, ierr)
This returns the total number of processors in the communicator 'comm'.
MPI_COMM_RANK(comm, rank, ierr)
This returns the rank of the processor on which it is called.
MPI_GET_PROCESSOR_NAME(name, resultlength, ierr)
This is akin to the hostname command.
MPI_WTIME(), MPI_Wtime()
This returns the wall-clock time in seconds as a double precision number.
MPI_WTICK()
Returns the clock resolution in real*8.
MPI_ABORT(comm, errorcode, ierr)
This can be used to halt the job cleanly when it is behaving erroneously.
MPI_FINALIZE(ierr)
This should be the last call of an MPI code. It winds up the MPI environment cleanly.
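Putting several of these together, a minimal sketch of an MPI "hello world" (the program and variable names are illustrative):

      program envdemo
      include 'mpif.h'
      integer ierr, rank, nprocs, namelen
      character*(MPI_MAX_PROCESSOR_NAME) pname
      double precision t0, t1
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_GET_PROCESSOR_NAME(pname, namelen, ierr)
      t0 = MPI_WTIME()
c     ... the real work would go here ...
      t1 = MPI_WTIME()
      print *, 'rank', rank, ' of', nprocs, ' on ',
     &     pname(1:namelen), ' took', t1 - t0, ' s'
      call MPI_FINALIZE(ierr)
      end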
As already mentioned, there are basically two types of MPI communication routines: point-to-point and collective.
As the names indicate, the first is used when communication is only between two processes and the second when several processes are involved.
Send and receive are the most basic routines of this type. More evolved variants are send-receive, send-receive-replace, etc.
When processor 1 sends to processor 2, processor 2 must be ready to receive it.
What happens if the send request is posted before the receiving node is ready?
In that case the data sent is put into a buffer by the system on the receiving node, and is subsequently delivered to the application when it is ready. We shall discuss buffered sends later.
Point-to-point communications can be broadly categorised as:
Synchronous: the send completes only when the receiving processor has started to receive the matching message.
Blocking send and receive
Non-blocking send and receive
Combined send/receive
Ready send
The synchronous blocking send sends a message and blocks until the application buffer in the sending task is free for reuse and the destination process has started to receive the message.
MPI_SSEND(buf, count, type, dest, tag, comm, ierr)
A blocking send returns only after the send buffer is safe for reuse.
This can be synchronous.
It can be asynchronous if buffering is used.
A blocking receive returns only after the data has arrived fully and is ready to use.
MPI_SEND(buf, count, type, dest, tag, comm, ierr)
MPI_RECV(buf, count, type, src, tag, comm, status, ierr)
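A minimal sketch of a blocking exchange between ranks 0 and 1 (the message length 100 and the tag 17 are arbitrary choices):

      program blocking
      include 'mpif.h'
      integer ierr, rank, i, status(MPI_STATUS_SIZE)
      double precision buf(100)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      if (rank .eq. 0) then
         do i = 1, 100
            buf(i) = dble(i)
         enddo
c        blocking send: returns once buf is safe for reuse
         call MPI_SEND(buf, 100, MPI_DOUBLE_PRECISION, 1, 17,
     &        MPI_COMM_WORLD, ierr)
      else if (rank .eq. 1) then
c        blocking receive: returns once the data has fully arrived
         call MPI_RECV(buf, 100, MPI_DOUBLE_PRECISION, 0, 17,
     &        MPI_COMM_WORLD, status, ierr)
      endif
      call MPI_FINALIZE(ierr)
      end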
Non-blocking routines return immediately.
They do not wait for the actual arrival of the message, nor for the sending protocols to be finished.
The MPI library performs these tasks whenever it is able to, with the user having no control over when.
These routines are intrinsically unsafe, as there is the danger that the application buffers may be modified by the program without ascertaining whether the non-blocking operation has completed or not.
Non-blocking routines allow communications and computations to overlap.
MPI_ISEND(buf, count, type, dest, tag, comm, request, ierr)
MPI_IRECV(buf, count, type, src, tag, comm, request, ierr)
Notice that the non-blocking receive routine has the same type of argument list as the non-blocking send (no status argument).
If the user wants protection against unintended modifications of the application buffer before the communication calls have completed, use must be made of the MPI_WAIT routines.
MPI_WAIT(request, status, ierr)
MPI_WAITALL(count, array_of_requests, array_of_statuses, ierr)
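A sketch of a ring exchange that posts the non-blocking receive and send first, does unrelated computation, and only then calls MPI_WAITALL before the buffers are used again (message length and tag are arbitrary):

      program nonblock
      include 'mpif.h'
      integer ierr, rank, nprocs, left, right, i
      integer req(2), stats(MPI_STATUS_SIZE,2)
      double precision sbuf(100), rbuf(100)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      left  = mod(rank-1+nprocs, nprocs)
      right = mod(rank+1, nprocs)
      do i = 1, 100
         sbuf(i) = dble(rank)
      enddo
c     post the communications first
      call MPI_IRECV(rbuf, 100, MPI_DOUBLE_PRECISION, left, 7,
     &     MPI_COMM_WORLD, req(1), ierr)
      call MPI_ISEND(sbuf, 100, MPI_DOUBLE_PRECISION, right, 7,
     &     MPI_COMM_WORLD, req(2), ierr)
c     ... do computation that does not touch sbuf or rbuf ...
c     then wait for completion before the buffers are reused
      call MPI_WAITALL(2, req, stats, ierr)
      call MPI_FINALIZE(ierr)
      end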
We have already discussed the concept of a buffered send.
The corresponding routine is
MPI_BSEND(buf, count, type, dest, tag, comm, ierr)
This is a blocking send which permits the programmer to allocate the required amount of buffer space into which the data can be copied until it is delivered.
The routine returns after the data has been copied from the application buffer to the allocated buffer.
It must be used in conjunction with the MPI_BUFFER_ATTACH and MPI_BUFFER_DETACH routines.
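A sketch (the buffer size and tag are illustrative: 100 double precision numbers need 100*8 bytes plus the MPI_BSEND_OVERHEAD bytes required by the library):

      program buffered
      include 'mpif.h'
      character*1 workbuf(100*8 + MPI_BSEND_OVERHEAD)
      double precision vals(100)
      integer ierr, rank, bufsize, status(MPI_STATUS_SIZE)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      bufsize = 100*8 + MPI_BSEND_OVERHEAD
      if (rank .eq. 0) then
c        ... fill vals here ...
c        hand the buffer space to MPI, then do the buffered send
         call MPI_BUFFER_ATTACH(workbuf, bufsize, ierr)
         call MPI_BSEND(vals, 100, MPI_DOUBLE_PRECISION, 1, 5,
     &        MPI_COMM_WORLD, ierr)
c        the detach call waits until the buffered data has gone out
         call MPI_BUFFER_DETACH(workbuf, bufsize, ierr)
      else if (rank .eq. 1) then
         call MPI_RECV(vals, 100, MPI_DOUBLE_PRECISION, 0, 5,
     &        MPI_COMM_WORLD, status, ierr)
      endif
      call MPI_FINALIZE(ierr)
      end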
Suppose task 0 sends data to task 1 and at the same time receives data from task 2. This can be achieved through a combined send/receive call.
Technically this sends a message and posts a receive before blocking.
It will block till the sending application buffer is free for reuse and the receiving application buffer contains the sent message.
MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status, ierr)
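A sketch of a cyclic shift around a ring of processes, in which every rank sends to its right neighbour and receives from its left neighbour in a single call (tag and message length are arbitrary):

      program ring
      include 'mpif.h'
      integer ierr, rank, nprocs, left, right, i
      integer status(MPI_STATUS_SIZE)
      double precision sbuf(10), rbuf(10)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      left  = mod(rank-1+nprocs, nprocs)
      right = mod(rank+1, nprocs)
      do i = 1, 10
         sbuf(i) = dble(rank)
      enddo
c     send to the right neighbour and receive from the left one
      call MPI_SENDRECV(sbuf, 10, MPI_DOUBLE_PRECISION, right, 1,
     &     rbuf, 10, MPI_DOUBLE_PRECISION, left, 1,
     &     MPI_COMM_WORLD, status, ierr)
      call MPI_FINALIZE(ierr)
      end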
Another very useful combined send/receive routine is send-receive-replace:
MPI_SENDRECV_REPLACE(buf, count, type, dest, sendtag, source, recvtag, comm, status, ierr)
This routine uses a common buffer for the send and the receive operation, and is also of the blocking type.
A blocking send to be used only when a matching receive has already been posted.
MPI_RSEND(buf, count, type, dest, tag, comm, ierr)
In addition to the point-to-point routines discussed so far, MPI has many routines for collective communications.
These are typically of the type one-to-many, many-to-one and many-to-many.
All the collective routines are blocking.
The most frequently used routines are BCAST, REDUCE, SCATTER and ALLTOALL.
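A sketch combining a broadcast and a reduction (the parameter n and the partial sums are purely illustrative):

      program collect
      include 'mpif.h'
      integer ierr, rank, n
      double precision mysum, total
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
c     rank 0 reads a parameter and broadcasts it to everybody
      if (rank .eq. 0) n = 1000
      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
c     each rank computes a partial sum, rank 0 gets the grand total
      mysum = dble(rank)
      call MPI_REDUCE(mysum, total, 1, MPI_DOUBLE_PRECISION,
     &     MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      if (rank .eq. 0) print *, 'n =', n, '  total =', total
      call MPI_FINALIZE(ierr)
      end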
Matrix multiplications, either with a vector or with another matrix, are important cases where parallel computing becomes necessary.
This happens when the size of the matrix becomes very large.
For an NxN matrix one needs N² elements.
For example, for a complex matrix in double precision one requires 16 N² bytes of memory.
Each element of a matrix-vector product takes N multiplications and N-1 additions, while a matrix-matrix multiplication takes N³ multiplications and N³ - N² additions.
If the numbers are complex, each multiplication is 4 real multiplications and 2 real additions, while each addition is 2 real additions.
On a 32⁴ lattice even the simplest algorithm involves 5 billion double precision multiplications per sweep of the lattice, and one needs thousands of sweeps.
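As a rough illustration (the particular value of N is mine, not from the slides): for N = 8192 a complex double precision matrix already occupies 16 N² = 2³⁰ bytes = 1 GiB, and a single matrix-matrix multiplication then costs N³ ≈ 5.5 x 10¹¹ complex multiplications, so both the memory and the arithmetic have to be distributed over many processors.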
[Figure: an 8x8 matrix of elements x multiplying a vector y to give the vector z.]
[Figure: the same product decomposed between two processors (green and red): each holds a block of rows of the matrix and computes the corresponding half of z.]
Call MPI_SEND and MPI_RECV to combine the green and red half-vectors.
Estimate the TOTAL memory and time requirements.
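One possible realisation of this first solution on exactly two processes, as a sketch (n = 8 and the tags are illustrative; each rank stores its nloc rows of the matrix and the full vector y):

      program matvec1
      include 'mpif.h'
      integer n, nloc
      parameter (n=8, nloc=n/2)
      double precision a(nloc,n), y(n), z(n)
      integer ierr, rank, other, i, j, off
      integer status(MPI_STATUS_SIZE)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      other = 1 - rank
      off = rank*nloc
c     ... fill a (my nloc rows of the matrix) and the full y ...
c     each rank computes its own half of z
      do i = 1, nloc
         z(off+i) = 0.0d0
         do j = 1, n
            z(off+i) = z(off+i) + a(i,j)*y(j)
         enddo
      enddo
c     exchange the two halves so that both ranks hold the full z
      if (rank .eq. 0) then
         call MPI_SEND(z(1), nloc, MPI_DOUBLE_PRECISION,
     &        other, 3, MPI_COMM_WORLD, ierr)
         call MPI_RECV(z(nloc+1), nloc, MPI_DOUBLE_PRECISION,
     &        other, 4, MPI_COMM_WORLD, status, ierr)
      else
         call MPI_RECV(z(1), nloc, MPI_DOUBLE_PRECISION,
     &        other, 3, MPI_COMM_WORLD, status, ierr)
         call MPI_SEND(z(nloc+1), nloc, MPI_DOUBLE_PRECISION,
     &        other, 4, MPI_COMM_WORLD, ierr)
      endif
      call MPI_FINALIZE(ierr)
      end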
[Figure: the alternative decomposition: each processor holds a share of the matrix and of y and computes a full-length partial vector z (the two z columns).]
Now one uses MPI_REDUCE with MPI_SUM to combine the red and green full vectors.
Make a comparative study of the two solutions.
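A corresponding sketch of the second solution on two processes (again n = 8 is illustrative; each rank stores nloc columns of the matrix and its own half of y):

      program matvec2
      include 'mpif.h'
      integer n, nloc
      parameter (n=8, nloc=n/2)
      double precision a(n,nloc), y(nloc), zpart(n), z(n)
      integer ierr, rank, i, j
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
c     ... fill a (my nloc columns of the matrix) and my half of y ...
c     each rank computes a full-length partial z from its columns
      do i = 1, n
         zpart(i) = 0.0d0
         do j = 1, nloc
            zpart(i) = zpart(i) + a(i,j)*y(j)
         enddo
      enddo
c     sum the partial vectors; the full z ends up on rank 0
      call MPI_REDUCE(zpart, z, n, MPI_DOUBLE_PRECISION,
     &     MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      call MPI_FINALIZE(ierr)
      end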
Jacobi Iteration – A Grand Synthesis
Jacobi iteration provides a splendid example where many concepts of parallel computation in general, and MPI in particular, can be nicely illustrated.
The d-dimensional version of this problem requires solving Laplace's equation subject to a boundary condition on the solution.
The solution at the centre is the average of the solution over the d-dimensional sphere.
"Consider approximating the manifold by..."
Consider approximating the manifold by a discrete set of points.
Consider the sum of f at the points 1, 2, 3, 4:
Choose any trial solution and choose a new solution at every point according to the update algorithm, keeping the boundary values fixed:
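In two dimensions this means replacing the value at every interior point by the average of its four neighbours (the points 1, 2, 3, 4 above). A minimal sketch of one such sweep (array names are illustrative):

      subroutine jacobi_sweep(fold, fnew, n)
c     one Jacobi sweep on an n x n grid: each interior point is
c     replaced by the average of its four neighbours; the boundary
c     rows and columns are left untouched
      integer n, i, j
      double precision fold(n,n), fnew(n,n)
      do j = 2, n-1
         do i = 2, n-1
            fnew(i,j) = 0.25d0*( fold(i+1,j) + fold(i-1,j)
     &                         + fold(i,j+1) + fold(i,j-1) )
         enddo
      enddo
      end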
Green lines show how to domain decompose the problem.
Every d-dimensional hypercubic lattice admits a bipartite structure.
This allows independent updates at the even and odd sites. OpenMP?
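One possible answer to the OpenMP question is sketched below (my sketch, not from the slides): since a site of one parity depends only on its neighbours of the opposite parity, all sites of a given parity can be updated in place and in parallel, e.g. with an OpenMP directive over the outer loop:

      subroutine eo_sweep(f, n, parity)
c     update, in place, all interior sites (i,j) with
c     mod(i+j,2) equal to parity; they depend only on sites
c     of the opposite parity, so the iterations are independent
      integer n, parity, i, j, i0
      double precision f(n,n)
!$OMP PARALLEL DO PRIVATE(i, i0)
      do j = 2, n-1
         i0 = 2 + mod(j + parity, 2)
         do i = i0, n-1, 2
            f(i,j) = 0.25d0*( f(i+1,j) + f(i-1,j)
     &                      + f(i,j+1) + f(i,j-1) )
         enddo
      enddo
!$OMP END PARALLEL DO
      end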