|
|
|
N.D. Hari Dass |
|
Institute of Mathematical Sciences |
|
Chennai |
|
|
|
|
Fortran bindings:

include 'mpif.h'

call MPI_XXX(…, ierr)

Case insensitive

Array storage column major: a(1,1), a(2,1), a(3,1) … contiguous

C bindings:

#include <mpi.h>

rc = MPI_Xxx(….); on success rc = MPI_SUCCESS

Case sensitive

Array storage row major: a(1,1), a(1,2), a(1,3) … contiguous
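A minimal C sketch (not from the slides) of the calling convention above: every C binding returns an error code that can be tested against MPI_SUCCESS, whereas the Fortran bindings return the code through the final ierr argument.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* in C, every MPI call returns an error code */
    int rc = MPI_Init(&argc, &argv);
    if (rc != MPI_SUCCESS) {
        fprintf(stderr, "MPI_Init failed\n");
        MPI_Abort(MPI_COMM_WORLD, rc);
    }
    MPI_Finalize();
    return 0;
}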
|
|
|
|
Fortran MPI datatypes:

MPI_CHARACTER

MPI_INTEGER

MPI_REAL

MPI_DOUBLE_PRECISION

MPI_COMPLEX

MPI_DOUBLE_COMPLEX

MPI_LOGICAL

C MPI datatypes:

MPI_CHAR

MPI_INT

MPI_LONG

MPI_SHORT

MPI_UNSIGNED_LONG…

MPI_FLOAT

MPI_DOUBLE

MPI_LONG_DOUBLE
|
|
|
|
|
|
In addition to the data types shown, it is possible to create derived data types:
|
Contiguous |
|
Vector |
|
Indexed |
|
Struct |
|
MPI_TYPE_COMMIT(datatype, ierr) |
|
MPI_TYPE_FREE(datatype, ierr) |
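A minimal C sketch (an assumed example, not the author's code) of the derived-datatype life cycle: build a vector type that picks out one column of an N x N row-major array, commit it, use it once, and free it. Run on 2 processes.

#include <mpi.h>

#define N 8   /* assumed example size */

int main(int argc, char **argv)
{
    double a[N][N];
    MPI_Datatype column;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            a[i][j] = i + 0.1 * j;

    /* N blocks of 1 double, separated by a stride of N doubles: one column */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);          /* MPI_TYPE_COMMIT in Fortran */

    if (rank == 0)                     /* ship column 2 from rank 0 to rank 1 */
        MPI_Send(&a[0][2], 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&a[0][2], 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);            /* MPI_TYPE_FREE in Fortran */
    MPI_Finalize();
    return 0;
}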
|
|
|
|
#ifdef MPI
    MPI routines
#endif

#ifdef SINGLE
    Serial routines
#endif

Give the suffix .F to source files that contain #ifdef … #endif blocks, and the suffix .f to the rest.
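The same pattern in C-preprocessor syntax, as a sketch: global_sum is a hypothetical routine used only for illustration, and the file is assumed to be compiled with exactly one of -DMPI or -DSINGLE, mirroring the slide.

#ifdef MPI
#include <mpi.h>
#endif

/* Sum a local value over all processes when built with -DMPI;
   with -DSINGLE the serial code path is compiled instead. */
double global_sum(double local)
{
#ifdef MPI
    double global;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
#endif
#ifdef SINGLE
    return local;     /* serial build: nothing to combine */
#endif
}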
|
|
|
|
|
|
Let us say we have two source files a1.f and a2.F:

ifort -c -O2 a1.f

ifort -DMPI -c -O2 a2.F

ifort -o all a1.o a2.o
|
|
|
|
|
|
LDFLAGS = $(F77FLAGS)

.SUFFIXES: .o .f .F

# note: command lines in a Makefile must begin with a TAB
.f.o:
	$(F77) -c $(F77FLAGS) $<

.F.o:
	$(F77) -D$(ARC) -c $(F77FLAGS) $<

all: qcd

qcd: a1.o a2.o
	$(F77) -o qcd a1.o a2.o $(LDFLAGS)

qcd.o: qcd.F params2
|
|
|
|
Let cwd be the current working directory on the frontend, say kabru, where the executable 'qcd' has been successfully compiled.
|
Let input be the file containing various input
parameters. |
|
Let kabru1 and kabru2 be the nodes or processors
on which the parallel job is to be run. |
|
|
|
|
Copy the executable to the corresponding cwd on kabru1 and kabru2:

scp qcd user@kabru1:cwd

scp qcd user@kabru2:cwd

If cwd does not exist on these nodes, first create cwd on them:

rsh kabru1 "mkdir -p cwd"

Another way: first create a file 'nodes' which contains the list of nodes, then

scarcp -f nodes qcd cwd

scash -f nodes mkdir -p cwd
|
|
|
|
There are essentially two ways of launching
jobs: |
|
mpimon -stdin all qcd -- k1 k2 < input > output &

mpirun -np 2 -machinefile nodes qcd < input > output &
|
|
|
|
MPI jobs will hang for a variety of reasons, or not behave the way they should, e.g. take too long or give absurd results.
|
In such cases they should be terminated. |
|
This should be done in such a way as to ensure a
clean environment for subsequent jobs. |
|
Scali MPI: |
|
scakill -f nodes -s qcd |
|
Dangers: in Scali all jobs matching the given
string are killed! |
|
|
|
|
There are a number of important routines for monitoring and controlling an MPI environment.

MPI_INIT(ierr), MPI_Init(&argc,&argv)

This initialises the MPI environment, e.g. setting up the node id.

MPI_INITIALIZED(flag, ierr)

This tells whether MPI_INIT has been called or not.

MPI_COMM_SIZE(comm, size, ierr)

This returns the total number of processes in the communicator 'comm'.

MPI_COMM_RANK(comm, rank, ierr)

This returns the rank of the process on which it is called.
|
|
|
|
MPI_GET_PROCESSOR_NAME(name, resultlength, ierr)

This is akin to the hostname command.

MPI_WTIME( ), MPI_Wtime( )

This returns wall time in seconds as a double precision number.

MPI_WTICK( )

Returns the clock resolution in real*8.

MPI_ABORT(comm, errorcode, ierr)

This can be used to halt the job cleanly when it is behaving erroneously.

MPI_FINALIZE( )

This should be the last call of an MPI code. It winds up the MPI environment cleanly.
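A minimal C sketch (not from the slides) exercising these environment-management routines:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int size, rank, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);     /* how many processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* who am I           */
    MPI_Get_processor_name(name, &namelen);   /* akin to hostname   */

    double t0 = MPI_Wtime();
    /* ... work would go here ... */
    double t1 = MPI_Wtime();

    printf("rank %d of %d on %s, elapsed %.6f s (tick %.3e s)\n",
           rank, size, name, t1 - t0, MPI_Wtick());

    MPI_Finalize();
    return 0;
}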
|
|
|
|
As already mentioned, there are basically two types of MPI routines: point-to-point and collective.
|
As the names indicate, the first is used when
communication is only between two processes and the second when several
processes are involved. |
|
|
|
|
Send and receive are the most basic routines of this type. More evolved routines of this type are send-receive, send-receive-replace, etc.

When processor 1 sends to processor 2, processor 2 must be ready to receive it.
|
What happens if the send request is sent before
the receiving node is ready? |
|
|
|
|
In this mode of receiving, the data sent is put into a buffer by the system of the receiving node and is subsequently passed to the application when it is ready. We shall discuss buffered sends later.
|
|
|
|
The point to point communications can be broadly
categorised as: |
|
Synchronous: the send does not complete until the matching receive has started.
|
Blocking send and receive |
|
Non-blocking send and receive |
|
Combined send/receive |
|
Ready send |
|
|
|
|
The synchronous blocking send sends a message and blocks until the application buffer in the sending task is free for reuse and the destination process has started to receive the message.
|
MPI_SSEND(buf, count, type, dest, tag, comm,
ierr) |
|
|
|
|
A blocking send returns only after the send
buffer is safe for reuse. |
|
This can be synchronous. |
|
It can be asynchronous if buffering is used. |
|
A blocking receive returns only after the data
has arrived fully and is ready to use. |
|
MPI_SEND (buf, count, type, dest, tag, comm,
ierr) |
|
MPI_RECV (buf, count, type, src, tag, comm, status,
ierr) |
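A minimal C sketch of a blocking send/receive pair (assumed buffer size and tag; run on 2 processes):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { COUNT = 100, TAG = 7 };
    double buf[COUNT];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < COUNT; ++i) buf[i] = i;
        /* returns only when buf may safely be reused */
        MPI_Send(buf, COUNT, MPI_DOUBLE, 1, TAG, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        /* returns only when the full message has arrived */
        MPI_Recv(buf, COUNT, MPI_DOUBLE, 0, TAG, MPI_COMM_WORLD, &status);
        printf("rank 1 received, last element = %g\n", buf[COUNT-1]);
    }

    MPI_Finalize();
    return 0;
}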
|
|
|
|
Non-blocking routines return immediately.

They do not wait for the actual arrival of the message, nor for the sending protocols to be finished.

The MPI library performs these tasks whenever it is able to, without the user having any control over when.

These routines are intrinsically unsafe, as there is the danger that the application buffers may be modified by the program without ascertaining whether the non-blocking operation has completed or not.

Non-blocking routines allow communications and computations to overlap.
|
|
|
|
MPI_ISEND(buf, count, type, dest, tag, comm, request, ierr)

MPI_IRECV(buf, count, type, src, tag, comm, request, ierr)

Notice that the non-blocking receive routine has the same type of argument list as the non-blocking send (there is no status argument).

If the user wants protection against unintended modification of the application buffer before communication calls have completed, use must be made of the MPI_WAIT routines.

MPI_WAIT(request, status, ierr)

MPI_WAITALL(count, array_of_requests, array_of_statuses, ierr)
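A minimal C sketch (assumed message size; run on exactly 2 processes) of the non-blocking pattern: post MPI_Irecv and MPI_Isend, overlap independent work, and call MPI_Waitall before touching the buffers again.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { COUNT = 1000 };
    double sendbuf[COUNT], recvbuf[COUNT];
    MPI_Request req[2];
    MPI_Status  st[2];
    int rank, other;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                       /* my partner (0 <-> 1) */

    for (int i = 0; i < COUNT; ++i) sendbuf[i] = rank;

    MPI_Irecv(recvbuf, COUNT, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sendbuf, COUNT, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &req[1]);

    /* ... computation that does not touch sendbuf/recvbuf can overlap here ... */

    MPI_Waitall(2, req, st);                /* now the buffers are safe to use */
    printf("rank %d got data from rank %d: %g\n", rank, other, recvbuf[0]);

    MPI_Finalize();
    return 0;
}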
|
|
|
|
We have already discussed the concept of a buffered send.

The corresponding routine is

MPI_BSEND (buf, count, type, dest, tag, comm, ierr)

This is a blocking send which permits the programmer to allocate the required amount of buffer space, into which the data can be copied until it is delivered.

The routine returns after the data has been copied from the application buffer to the allocated buffer.
|
This must be used in conjunction with
MPI_BUFFER_ATTACH and MPI_BUFFER_DETACH routines. |
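A minimal C sketch (assumed message size; run on 2 processes) of the buffered-send life cycle with MPI_Buffer_attach, MPI_Bsend and MPI_Buffer_detach:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    enum { COUNT = 100 };
    double msg[COUNT];
    int rank, bufsize;
    char *buffer;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < COUNT; ++i) msg[i] = i;

        /* room for one message plus the bookkeeping overhead */
        MPI_Pack_size(COUNT, MPI_DOUBLE, MPI_COMM_WORLD, &bufsize);
        bufsize += MPI_BSEND_OVERHEAD;
        buffer = malloc(bufsize);

        MPI_Buffer_attach(buffer, bufsize);
        MPI_Bsend(msg, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        /* detach blocks until all buffered messages have been transmitted */
        MPI_Buffer_detach(&buffer, &bufsize);
        free(buffer);
    } else if (rank == 1) {
        MPI_Recv(msg, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}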
|
|
|
|
Suppose task 0 sends data to task 1 and at the same time receives data from task 2. This can be achieved through a combined send/receive call.
|
Technically this sends a message and posts a
receive before blocking. |
|
It will block till the sending application
buffer is free for reuse and the receiving application buffer contains the
sent message. |
|
MPI_SENDRECV (sendbuf, sendcount, sendtype,
dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm,
status,ierr) |
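A minimal C sketch of MPI_Sendrecv used for a ring shift: every rank sends its value to the right neighbour and receives from the left one in a single call, with no risk of deadlock.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, right, left;
    double sendval, recvval;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    right = (rank + 1) % size;           /* destination */
    left  = (rank - 1 + size) % size;    /* source      */
    sendval = (double)rank;

    MPI_Sendrecv(&sendval, 1, MPI_DOUBLE, right, 0,
                 &recvval, 1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, &status);

    printf("rank %d received %g from rank %d\n", rank, recvval, left);

    MPI_Finalize();
    return 0;
}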
|
|
|
|
Another very useful combined send/recv routine is the send_recv_replace:
|
MPI_SENDRECV_REPLACE (buf, count, type, dest,
sendtag, source, recvtag, comm, status, ierr) |
|
This routine uses a common buffer for both the send and the receive; the operation is of the blocking type.
|
|
|
|
A blocking send that may be used only when a matching receive has already been posted.
|
MPI_RSEND (buf, count, type, dest, tag, comm,
ierr) |
|
|
|
|
In addition to the point-to-point routines discussed so far, MPI has many routines for collective communications.

These are typically of the type one-to-many, many-to-one and many-to-many.

All the collective routines are blocking.

The most frequently used routines are BCAST, REDUCE, SCATTER and ALLTOALL.
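A minimal C sketch (not from the slides) combining a one-to-many BCAST with a many-to-one REDUCE:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, n;
    double partial, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) n = 1000;                        /* e.g. read from an input file */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* one-to-many */

    partial = (double)n * rank;                     /* stand-in for real work */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);                     /* many-to-one */

    if (rank == 0)
        printf("sum over %d processes = %g\n", size, total);

    MPI_Finalize();
    return 0;
}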
|
|
|
|
|
|
|
|
|
|
|
Matrix multiplications, either with a vector or with another matrix, are important cases where parallel computing becomes necessary.

This happens when the size of the matrix becomes very large.

For an NxN matrix one needs N^2 elements.

For example, a complex matrix in double precision requires 16 N^2 bytes of memory.

A matrix-vector multiplication takes N^2 multiplications and N^2 - N additions, while a matrix-matrix multiplication takes N^3 multiplications and N^3 - N^2 additions.

If the numbers are complex, each multiplication requires 4 real multiplications and 2 real additions, while each addition requires 2 real additions.

On a 32^4 lattice even the simplest algorithm involves 5 billion double precision multiplications per sweep of the lattice, and one needs thousands of sweeps.
|
|
|
|
[Figure: an 8x8 matrix of elements x multiplying the vector y to give the result vector z.]
|
|
|
|
[Figure: row decomposition. Each of two processors (green and red) holds half the rows of the matrix and the full vector y, and computes its half of the result vector z.]
|
|
|
|
Call MPI_SEND and MPI_RECV to combine the green and red half-vectors (see the sketch below).
|
Estimate the TOTAL memory and time requirements. |
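A minimal C sketch of this row decomposition for two processes; N and all names are illustrative assumptions, not the author's code. Each rank holds N/2 rows of the matrix and the whole vector y, computes its half of z, and rank 1 ships its half to rank 0.

#include <mpi.h>
#include <stdio.h>

#define N 8   /* assumed example size, divisible by 2 */

int main(int argc, char **argv)
{
    double a[N/2][N], y[N], z[N];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* fill the local rows and the full vector with test values */
    for (int i = 0; i < N/2; ++i)
        for (int j = 0; j < N; ++j)
            a[i][j] = rank * (N/2) + i + j;
    for (int j = 0; j < N; ++j) y[j] = 1.0;

    /* each rank computes its half of z = A y */
    for (int i = 0; i < N/2; ++i) {
        z[i] = 0.0;
        for (int j = 0; j < N; ++j) z[i] += a[i][j] * y[j];
    }

    if (rank == 1)
        MPI_Send(z, N/2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    else {
        MPI_Recv(z + N/2, N/2, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("z[0] = %g, z[N-1] = %g\n", z[0], z[N-1]);
    }

    MPI_Finalize();
    return 0;
}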
|
|
|
|
[Figure: column decomposition. Each of two processors holds half the columns of the matrix and the corresponding half of y, and computes a full-length partial result vector z.]
|
|
|
|
Now one uses MPI_REDUCE with MPI_SUM to combine
the red and green full vectors. |
|
Make a comparative study of the two solutions.
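For comparison, a minimal C sketch of the column decomposition combined with MPI_REDUCE and MPI_SUM (again with an assumed N and illustrative names, for two processes):

#include <mpi.h>
#include <stdio.h>

#define N 8   /* assumed example size, divisible by 2 */

int main(int argc, char **argv)
{
    double a[N][N/2], yhalf[N/2], zpart[N], z[N];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* local columns of A and the corresponding half of y */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N/2; ++j)
            a[i][j] = i + rank * (N/2) + j;
    for (int j = 0; j < N/2; ++j) yhalf[j] = 1.0;

    /* full-length partial result using only the local columns */
    for (int i = 0; i < N; ++i) {
        zpart[i] = 0.0;
        for (int j = 0; j < N/2; ++j) zpart[i] += a[i][j] * yhalf[j];
    }

    /* z = sum of the partial vectors, collected on rank 0 */
    MPI_Reduce(zpart, z, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("z[0] = %g, z[N-1] = %g\n", z[0], z[N-1]);

    MPI_Finalize();
    return 0;
}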
|
|
|
|
Jacobi iteration provides a splendid example
where many concepts of parallel computation in general and MPI in
particular can be nicely illustrated. |
|
The d-dimensional version of this problem requires solving Laplace's equation subject to a boundary condition on the solution.
|
|
|
|
|
|
The solution at the centre is the average of the solution over the d-dimensional sphere.
|
|
|
|
Consider approximating the manifold by a discrete
set of points. |
|
Consider the sum of f at the points 1,2,3,4: |
|
|
|
|
[Figure: a lattice point 0 and its four nearest neighbours 1, 2, 3, 4 in two dimensions.]

For lattice spacing a, a Taylor expansion of f about the point 0 gives

f(1) + f(2) + f(3) + f(4) = 4 f(0) + a^2 ( f_xx(0) + f_yy(0) ) + O(a^4).

Since f satisfies Laplace's equation, f_xx + f_yy = 0, and therefore

f(0) = [ f(1) + f(2) + f(3) + f(4) ] / 4 + O(a^4).

In d dimensions the same argument gives f(0) as the average of f over its 2d nearest neighbours.
|
Choose any trial solution, and obtain a new solution at every point according to the update rule, keeping the boundary values fixed: each interior point is replaced by the average of its nearest neighbours.
|
|
|
|
Green lines (in the accompanying figure) show how to domain decompose the problem; a halo-exchange sketch for such a decomposition follows below.
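A minimal C sketch (assumed grid sizes and a 1-D strip decomposition) of one domain-decomposed Jacobi sweep: each process owns NLOC interior rows plus two ghost rows, exchanges boundary rows with its neighbours via MPI_Sendrecv, then updates its interior points.

#include <mpi.h>
#include <stdio.h>

#define NX   16   /* points per row                      */
#define NLOC 8    /* interior rows owned by each process */

int main(int argc, char **argv)
{
    double u[NLOC + 2][NX], unew[NLOC + 2][NX];
    int rank, size, up, down;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;   /* no neighbour -> no-op */
    down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int i = 0; i < NLOC + 2; ++i)
        for (int j = 0; j < NX; ++j)
            u[i][j] = (double)rank;          /* some trial solution */

    /* halo exchange: send first owned row up, receive ghost row from below;
       then send last owned row down, receive ghost row from above */
    MPI_Sendrecv(u[1],      NX, MPI_DOUBLE, up,   0,
                 u[NLOC+1], NX, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &st);
    MPI_Sendrecv(u[NLOC],   NX, MPI_DOUBLE, down, 1,
                 u[0],      NX, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &st);

    /* Jacobi update of the interior points (boundary columns kept fixed) */
    for (int i = 1; i <= NLOC; ++i)
        for (int j = 1; j < NX - 1; ++j)
            unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);

    if (rank == 0) printf("one Jacobi sweep done on %d processes\n", size);

    MPI_Finalize();
    return 0;
}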
|
|
|
|
Every d-dimensional hypercubic lattice admits a bipartite
structure. |
|
This allows independent updates at the even and odd sites. OpenMP? (See the sketch below.)
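A minimal C sketch of one even/odd (red-black) sweep, with each coloured half-sweep parallelised by an OpenMP loop (assumed array layout, illustrative only):

#include <omp.h>

/* One even/odd sweep over an n x n grid: all sites of one colour can be
   updated independently, so each coloured half-sweep is a single
   OpenMP parallel loop. */
void redblack_sweep(int n, double u[n][n])
{
    for (int colour = 0; colour < 2; ++colour) {
        #pragma omp parallel for
        for (int i = 1; i < n - 1; ++i)
            for (int j = 1 + (i + colour) % 2; j < n - 1; j += 2)
                u[i][j] = 0.25 * (u[i-1][j] + u[i+1][j]
                                + u[i][j-1] + u[i][j+1]);
    }
}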
|
|