Introduction to MPI
N.D. Hari Dass
Institute of Mathematical Sciences
Chennai
Fortran and C
Fortran:
   include 'mpif.h'
   call MPI_XXX(…, ierr)
   Case insensitive
   Array storage column major: a(1,1), a(2,1), a(3,1), … contiguous
C:
   #include <mpi.h>
   rc = MPI_Xxx(…); rc = MPI_SUCCESS on success
   Case sensitive
   Array storage row major: a(1,1), a(1,2), a(1,3), … contiguous
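A minimal C sketch of these conventions; any error handling beyond checking MPI_SUCCESS is omitted:

   #include <mpi.h>

   int main(int argc, char **argv)
   {
       int rc = MPI_Init(&argc, &argv);   /* C calls return an error code   */
       if (rc != MPI_SUCCESS)             /* instead of a Fortran ierr      */
           return 1;
       /* ... */
       MPI_Finalize();
       return 0;
   }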
DATA TYPES
Fortran:
   MPI_CHARACTER, MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION,
   MPI_COMPLEX, MPI_DOUBLE_COMPLEX, MPI_LOGICAL
C:
   MPI_CHAR, MPI_INT, MPI_LONG, MPI_SHORT, MPI_UNSIGNED_LONG, …,
   MPI_FLOAT, MPI_DOUBLE, MPI_LONG_DOUBLE
Derived Data Types
In addition to the data types shown above, it is possible to create derived data types:
Contiguous
Vector
Indexed
Struct
MPI_TYPE_COMMIT(datatype, ierr)
MPI_TYPE_FREE(datatype, ierr)
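As a hedged illustration of the Vector constructor in C (the 4x4 block size and the name build_column_type are assumptions of this sketch, not part of the slides):

   #include <mpi.h>

   /* Describe one column of a 4x4 row-major array of doubles:      */
   /* 4 blocks of 1 element each, separated by a stride of 4.       */
   void build_column_type(MPI_Datatype *column)
   {
       MPI_Type_vector(4, 1, 4, MPI_DOUBLE, column);
       MPI_Type_commit(column);    /* must be committed before use  */
   }
   /* ... use the type in sends/receives, then MPI_Type_free(column); */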
Compiling and running
#ifdef MPI
   MPI routines
#endif
#ifdef SINGLE
   Serial routines
#endif
Give source files containing such #ifdef/#endif blocks the suffix .F (so they are passed through the preprocessor) and the rest the suffix .f.
Compiling contd….
Let us say we have two source files a1.f and a2.F
ifort -c -O2 a1.f
ifort -DMPI -c -O2 a2.F
ifort -o all a1.o a2.o
Compiling contd: a makefile
# (variable definitions inferred from the ifort commands on the previous slide)
F77      = ifort
F77FLAGS = -O2
ARC      = MPI
LDFLAGS  = $(F77FLAGS)
.SUFFIXES: .o .f
.SUFFIXES: .o .F
.f.o:
	$(F77) -c $(F77FLAGS) $<
.F.o:
	$(F77) -D$(ARC) -c $(F77FLAGS) $<
all: qcd
qcd: a1.o a2.o
	$(F77) -o qcd a1.o a2.o \
	$(LDFLAGS)
qcd.o: qcd.F params2
Running the job
Let cwd be the current working directory on the frontend, say kabru, where the executable ‘qcd’ has been successfully compiled.
Let input be the file containing various input parameters.
Let kabru1 and kabru2 be the nodes or processors on which the parallel job is to be run.
Running (contd)
Copy the executable to the corresponding cwd on kabru1 and kabru2:
   scp qcd user@kabru1:cwd
   scp qcd user@kabru2:cwd
If cwd does not exist on these nodes, first create cwd on them:
   rsh kabru1 "mkdir -p cwd"
Another way: first create a file nodes which contains the list of nodes
    scarcp -f nodes qcd cwd
    scash -f nodes mkdir -p cwd
Running (contd)
There are essentially two ways of launching jobs:
mpimon -stdin all qcd -- kabru1 kabru2 < input > output &
mpirun -np 2 -machinefile nodes qcd <input > output &
Killing jobs cleanly
MPI jobs may hang for a variety of reasons, or may not behave the way they should, e.g. take too long or give absurd results.
In such cases they should be terminated.
This should be done in such a way as to ensure a clean environment for subsequent jobs.
Scali MPI:
  scakill -f nodes -s qcd
Dangers: in Scali all jobs matching the given string are killed!
Environment Routines
There are a number of important routines for monitoring and controlling an MPI environment.
MPI_INIT(ierr),MPI_Init(&argc,&argv)
   This initialises the MPI environment, e.g. setting up the node id.
MPI_INITIALIZED(flag,ierr)
    This tells whether MPI_INIT has been called or not.
MPI_COMM_SIZE(comm, size, ierr)
   This returns the total number of processors in the communicator ‘comm’
MPI_COMM_RANK(comm,rank,ierr)
    This returns the rank of the processor on which it is called.
More environment routines
MPI_GET_PROCESSOR_NAME(name,resultlength,ierr)
   This is akin to the hostname command.
MPI_WTIME(), MPI_Wtime()
   This returns the wall clock time in seconds as a double precision number.
MPI_WTICK()
   Returns the clock resolution in real*8.
MPI_ABORT(comm,errorcode,ierr)
    This can be used to halt the job cleanly when it is behaving erroneously.
MPI_FINALIZE(ierr), MPI_Finalize()
     This should be the last call of an MPI code. It winds up the MPI environment cleanly.
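A hedged C sketch exercising the environment routines above; the timed "work" section is only a placeholder:

   #include <mpi.h>
   #include <stdio.h>

   int main(int argc, char **argv)
   {
       int size, rank, namelen;
       char name[MPI_MAX_PROCESSOR_NAME];
       double t0, t1;

       MPI_Init(&argc, &argv);
       MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many processes     */
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* my rank: 0 .. size-1   */
       MPI_Get_processor_name(name, &namelen);  /* akin to 'hostname'     */

       t0 = MPI_Wtime();
       /* ... work ... */
       t1 = MPI_Wtime();
       printf("rank %d of %d on %s: %f s (tick %g s)\n",
              rank, size, name, t1 - t0, MPI_Wtick());

       MPI_Finalize();
       return 0;
   }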
Back to MPI Routines
As already mentioned, there are basically two types of MPI routines: point to point and collective.
As the names indicate, the first is used when communication is only between two processes and the second when several processes are involved.
Point to Point Routines
Send and receive are the most basic routines of this type. More evolved variants are send-receive, send-receive-replace, etc.
When processor 1 sends to processor 2, processor 2 must be ready to receive it.
What happens if the send request is sent before the receiving node is ready?
Buffering
In this mode the data sent is put into a buffer by the system on the receiving node and is handed to the application when it is ready. We shall discuss buffered sends later.
Point to Point (contd)
The point to point communications can be broadly categorised as:
Synchronous: the send does not complete until the destination has started to receive the message
Blocking send and receive
Non-blocking send and receive
Combined send/receive
Ready send
Synchronous Routines
The synchronous blocking send sends a message and blocks until the application buffer in the sending task is free for reuse and the destination process has started to receive the message.
MPI_SSEND(buf, count, type, dest, tag, comm, ierr)
Blocking Routines
A blocking send returns only after the send buffer is safe for reuse.
This can be synchronous.
It can be asynchronous if buffering is used.
A blocking receive returns only after the data has arrived fully and is ready to use.
MPI_SEND (buf, count, type, dest, tag, comm, ierr)
MPI_RECV (buf, count, type, src, tag, comm, status, ierr)
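A hedged C fragment of a blocking exchange, assuming MPI has been initialised and at least two processes are running; buf and tag are illustrative names:

   double buf[100];
   int rank, tag = 7;
   MPI_Status status;

   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   if (rank == 0)          /* rank 0 sends 100 doubles to rank 1              */
       MPI_Send(buf, 100, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
   else if (rank == 1)     /* rank 1 blocks until the data has fully arrived  */
       MPI_Recv(buf, 100, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &status);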
Non-blocking Routines
Non-blocking routines return immediately.
They do not wait for the actual arrival of the message nor for the sending protocols to be finished.
The MPI library performs these tasks whenever it is able to, with the user having no direct control.
These routines are intrinsically unsafe as there is the danger that the application buffers may be modified by the program without ascertaining whether the non-blocking operation has been completed or not.
Non-blocking routines allow communications and computations to overlap.
Non-blocking Routines(contd)
MPI_ISEND(buf, count, type, dest, tag, comm, request, ierr)
MPI_IRECV(buf, count, type, src, tag, comm, request, ierr)
Notice that the non-blocking receive routine has the same type of argument list as the non-blocking send (no status argument).
If the user wants protection against unintended modifications of the application buffer before communication calls are completed, use must be made of the MPI_WAIT routines.
MPI_WAIT(request, status, ierr)
MPI_WAITALL(count, array_of_requests, array_of_statuses, ierr)
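A hedged C sketch of a ring exchange that overlaps communication with computation; MPI is assumed to be initialised and the buffer names are illustrative:

   double sendbuf[100], recvbuf[100];
   MPI_Request reqs[2];
   MPI_Status  stats[2];
   int rank, size;

   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);
   int left  = (rank - 1 + size) % size;
   int right = (rank + 1) % size;

   MPI_Irecv(recvbuf, 100, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
   MPI_Isend(sendbuf, 100, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

   /* ... computation that touches neither sendbuf nor recvbuf ...          */

   MPI_Waitall(2, reqs, stats);   /* only now are the buffers safe to reuse */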
Buffered Send
We have already discussed the concept of a buffered send .
The corresponding routine is
   MPI_BSEND (buf, count, type, dest, tag, comm, ierr)
This is a blocking send which permits the programmer to allocate the required amount of buffer space, into which the data is copied until it is delivered.
The routine returns after the data has been copied from the application buffer to the allocated buffer.
This must be used in conjunction with MPI_BUFFER_ATTACH and MPI_BUFFER_DETACH routines.
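A hedged C sketch of the attach / Bsend / detach sequence; the message size, dest and tag are illustrative:

   /* requires <mpi.h> and <stdlib.h> */
   double data[100];
   int    dest = 1, tag = 0;
   int    bufsize = 100 * sizeof(double) + MPI_BSEND_OVERHEAD;
   char  *buffer  = malloc(bufsize);

   MPI_Buffer_attach(buffer, bufsize);    /* hand the buffer to MPI             */
   MPI_Bsend(data, 100, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
   MPI_Buffer_detach(&buffer, &bufsize);  /* waits until buffered data is sent  */
   free(buffer);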
Combined Send/Receive
Suppose task 0 sends data to task 1 and at the same time receives data from task 2. This can be achieved through a combined send/receive call.
Technically this sends a message and posts a receive before blocking.
It will block till the sending application buffer is free for reuse and the receiving application buffer contains the sent message.
MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status, ierr)
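A hedged C sketch of a cyclic shift around a ring using the combined call; the buffer names are illustrative:

   double sendbuf[100], recvbuf[100];
   MPI_Status status;
   int rank, size;

   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);
   int left  = (rank - 1 + size) % size;
   int right = (rank + 1) % size;

   /* send to the right neighbour while receiving from the left one */
   MPI_Sendrecv(sendbuf, 100, MPI_DOUBLE, right, 0,
                recvbuf, 100, MPI_DOUBLE, left,  0,
                MPI_COMM_WORLD, &status);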
Another Combined Send/Recv
Another very useful combined send/recv routine is the send_recv_replace :
MPI_SENDRECV_REPLACE (buf, count, type, dest, sendtag, source, recvtag, comm, status, ierr)
This routine uses a common buffer for the send and the receive; it too is of the blocking type.
Ready Send
A blocking send to be used when a matching receive has already been posted.
MPI_RSEND (buf, count, type, dest, tag, comm, ierr)
Collective Routines
In addition to the point to point routines discussed so far, MPI has many routines for collective communications.
These are typically of the type one-to-many, many-to-one and many-to-many.
All the collective routines are blocking.
The most frequently used routines are BCAST, REDUCE, SCATTER and ALLTOALL.
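A hedged C fragment using the two most common collectives; beta and mysum are illustrative names:

   double beta, mysum, total;

   /* rank 0 broadcasts an input parameter (beta) to everybody            */
   MPI_Bcast(&beta, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

   /* ... each rank computes its partial sum mysum ...                    */

   /* the partial sums are added up and the result lands on rank 0        */
   MPI_Reduce(&mysum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);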
Matrix Multiplication
Matrix multiplications either with a vector or another matrix are important cases where parallel computing becomes necessary.
This happens when the size of the matrix becomes very large.
For an N×N matrix one needs N² elements.
For example, a complex matrix in double precision requires 16N² bytes of memory.
A matrix-vector multiplication takes N² multiplications and N² - N additions, while a matrix-matrix multiplication takes N³ multiplications and N³ - N² additions.
If the numbers are complex, each multiplication requires 4 real multiplications and 2 real additions, while each addition requires 2 real additions.
On a 32⁴ lattice even the simplest algorithm involves 5 billion double precision multiplications per sweep of the lattice, and one needs thousands of sweeps.
Matrix multiplication
[Figure: an 8×8 matrix of elements x acting on a vector y to give the result vector z, i.e. z = A y.]
Solution - 1
[Figure: the first four rows of the matrix, together with the full vector y, are held on one processor, which computes the upper half of z; the last four rows and the full y are held on the other processor, which computes the lower half of z.]
Final step
Call MPI_SEND and MPI_RECV to combine the two half-vectors of z computed on the two processors.
Estimate the TOTAL memory and time requirements.
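A hedged C sketch of Solution 1 for two processes and an 8×8 real matrix (the complex, general-N case follows the same pattern; all names are illustrative):

   #define N 8
   double a[N/2][N];          /* my N/2 rows of the matrix             */
   double y[N], z[N];         /* full input vector, full result        */
   MPI_Status status;
   int i, j, rank;

   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   int off = rank * (N/2);            /* where my half of z starts     */

   for (i = 0; i < N/2; i++) {        /* local half of z = A_local * y */
       z[off + i] = 0.0;
       for (j = 0; j < N; j++)
           z[off + i] += a[i][j] * y[j];
   }

   /* exchange the two halves so both ranks hold the full z;           */
   /* the send/receive order is fixed by rank to avoid a deadlock      */
   if (rank == 0) {
       MPI_Send(&z[0],   N/2, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
       MPI_Recv(&z[N/2], N/2, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &status);
   } else {
       MPI_Recv(&z[0],   N/2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
       MPI_Send(&z[N/2], N/2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
   }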
Second Solution
[Figure: the columns of the matrix and the corresponding elements of y are split between the two processors; each processor computes a full-length partial result vector z, shown as the two z columns.]
Second Soln (contd)
Now one uses MPI_REDUCE with MPI_SUM to add up the two partial full-length vectors.
Make a comparative study of the two solutions.
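For comparison, a hedged C sketch of the second solution for the same two-process, 8×8 case (names again illustrative):

   #define N 8
   double a[N][N/2];          /* my N/2 columns of the matrix              */
   double y[N/2];             /* the matching N/2 elements of the vector   */
   double zpart[N], z[N];     /* full-length partial result, final result  */
   int i, j;

   for (i = 0; i < N; i++) {          /* partial z from my columns         */
       zpart[i] = 0.0;
       for (j = 0; j < N/2; j++)
           zpart[i] += a[i][j] * y[j];
   }

   /* add the two partial vectors element by element onto rank 0 */
   MPI_Reduce(zpart, z, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);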
Jacobi Iteration – A Grand Synthesis
Jacobi iteration provides a splendid example where many concepts of parallel computation in general and MPI in particular can be nicely illustrated.
The d-dimensional version of this problem requires solving Laplace's equation, ∇²f = 0, subject to boundary conditions on the solution.
Jacobi iteration…
Mean value theorem
The solution at the centre is the average of the solution over the d-dimensional sphere .
"Consider approximating the manifold by..."
Consider approximating the manifold by a discrete set of points.
Consider the sum of f at the points 1, 2, 3, 4, the nearest neighbours of the central point 0: in this discrete approximation the mean value theorem becomes f(0) ≈ [f(1) + f(2) + f(3) + f(4)]/4.
Relaxation Algorithm
Choose any trial solution and generate a new solution at every point according to the update rule, keeping the boundary values fixed: the new value at a point is the average of the old values at its nearest neighbours, e.g. in 2-d f_new(i,j) = [f_old(i+1,j) + f_old(i-1,j) + f_old(i,j+1) + f_old(i,j-1)]/4.
2-d example
Green lines show how to domain decompose the problem.
Further parallelism
Every d-dimensional hypercubic lattice admits a bipartite structure.
This allows independent updates at the even and odd sites. OpenMP?
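A hedged C sketch of one Jacobi sweep with a 1-d strip decomposition; NX, NLOC and the neighbour bookkeeping are assumptions of the sketch, and the physical boundary sites are assumed to sit in the ghost rows and in the first and last columns, so the loop limits leave them fixed:

   #define NX   64                    /* grid points per row                */
   #define NLOC 16                    /* rows owned by this process         */
   double f[NLOC+2][NX], fnew[NLOC+2][NX];  /* rows 0 and NLOC+1 are ghosts */
   MPI_Status status;
   int i, j, rank, size;

   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);
   int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
   int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

   /* halo exchange: send my first row up while receiving my lower ghost    */
   /* row from below, then the same in the other direction                  */
   MPI_Sendrecv(f[1],      NX, MPI_DOUBLE, up,   0,
                f[NLOC+1], NX, MPI_DOUBLE, down, 0,
                MPI_COMM_WORLD, &status);
   MPI_Sendrecv(f[NLOC],   NX, MPI_DOUBLE, down, 1,
                f[0],      NX, MPI_DOUBLE, up,   1,
                MPI_COMM_WORLD, &status);

   /* Jacobi update of the interior points                                  */
   /* (copying fnew back to f for the next sweep is omitted)                */
   for (i = 1; i <= NLOC; i++)
       for (j = 1; j < NX-1; j++)
           fnew[i][j] = 0.25 * (f[i+1][j] + f[i-1][j] +
                                f[i][j+1] + f[i][j-1]);

The even/odd (bipartite) structure mentioned above can then be exploited by updating the two sublattices alternately, or by sharing the local loops among OpenMP threads.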
Improved algorithms