C++ functions and MPI programming - mpi

From what I have learned in my supercomputing class I know that MPI is a communicating (and data passing) interface.
I'm confused about what happens when you run a function in a C++ program and want each processor to perform a specific task.
For example, take a prime number search (very popular for supercomputers). Say I have a range of values (531-564, some arbitrary range) and 50 processes on which I could run a series of evaluations for each number. If root (process 0) wants to examine 531, I can use 8 processes (1-8) to evaluate whether it is prime: if the number is divisible by any number 2-9 with a remainder of 0, then it is not prime.
Is it possible in MPI, which passes data to each process, to have these processes perform these actions?
The hardest part for me is understanding this: if I perform an action in the original C++ program, the work could be spread across several different processes, so how do I structure that in MPI? Or is my understanding completely wrong? If so, how should I think about this correctly?
The big idea is passing data to a process versus having a function sent to a process. I'm fairly certain I'm wrong, but I'm trying to backtrack to fix my thinking.

Each MPI process is running the same program, but that doesn't mean that they are doing the same thing. Different processes can be running different branches of the code, depending on the id (or "rank") of the process, and in effect be completely independent. Like any distributed computation, the actors do need to agree on how they will communicate.
The most basic strategy in MPI is scatter-gather. The "master" process (usually the one with rank 0) splits an array of work equally amongst the peers (including the master process itself) by having them all call scatter. The peers do the work, and then all peers call gather to send the results back to the master.
In your prime algorithm example, build an array of integers and "scatter" it to all the peers. Each peer runs through its part of the array, saving 1 if an element is prime and 0 if it is not, and then the results are "gathered" on the master. [In this particular example, since the input data is completely predictable based on process rank, the scatter step is unnecessary but we will do it anyway.]
As pseudo-code:
main():
    n = 100
    int x[n], local[n/k], result[n/k]   // k = number of processes
    MPI_Init()
    // prepare data on master
    if rank == 0:
        for i in 1 ... n, x[i] = i
    // send n/k elements of x on root to local on each process in world
    MPI_Scatter(x, n/k, int, local, n/k, int, root, world)
    for i in 1 ... n/k
        result[i] = 1 // assume prime
        if local[i] < 2, result[i] = 0
        // trial division by 2, 3, 5, 7 is enough for n <= 100 (7*7 < 100 < 11*11)
        for p in 2, 3, 5, 7
            if p divides local[i] and p != local[i], result[i] = 0
    // gather n/k results from each process in world into x on root
    MPI_Gather(result, n/k, int, x, n/k, int, root, world)
    // print results
    if rank == 0:
        for i in 1 ... n, print i if x[i] == 1
    MPI_Finalize()
There are lots of details to fill in, such as proper declarations, dealing with the fact that some ranks will have fewer elements than others, using proper C syntax, etc., but getting them right doesn't help explain the overall picture.
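One of those details, uneven chunk sizes, can be sketched without any MPI calls. The snippet below is only an illustration: the names BlockLayout and block_layout are made up here, and the function merely computes the per-rank counts and starting offsets that one would typically pass to MPI_Scatterv and MPI_Gatherv when n is not a multiple of the number of processes k.

```cpp
#include <vector>

// Hypothetical helper (not part of MPI): per-rank element counts and
// starting offsets for n items spread over k ranks.
struct BlockLayout {
    std::vector<int> counts;  // how many items each rank receives
    std::vector<int> displs;  // index of each rank's first item
};

BlockLayout block_layout(int n, int k) {
    BlockLayout layout;
    layout.counts.resize(k);
    layout.displs.resize(k);
    const int base = n / k;   // every rank gets at least this many
    const int extra = n % k;  // the first `extra` ranks get one more
    int offset = 0;
    for (int r = 0; r < k; ++r) {
        layout.counts[r] = base + (r < extra ? 1 : 0);
        layout.displs[r] = offset;
        offset += layout.counts[r];
    }
    return layout;
}
```

With n = 10 and k = 4 this gives counts 3, 3, 2, 2 and offsets 0, 3, 6, 8, so no rank holds more than one element beyond any other.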
More fine-grained synchronization and communication is possible using direct send/recv between processes. Such programs are harder to write since the different processes may be in different states. In particular, if process a is calling MPI_Send to process b, then process b had better be calling MPI_Recv from a.

Related

How to distribute a number range to each process?

I am doing a learning exercise. I want to calculate the number of primes in a range from 0 to N. What function of mpi can I use to distribute ranges of numbers to each process? In other words, each process calculates the number of primes within a number range of the main range.
You could simply use a for loop and MPI_Send on the root (and MPI_Recv on receivers) to send to each process the number at which it should start and how many numbers it should check.
Another possibility, even better, is to send N to each process with MPI_Bcast (called on root and receivers alike) and let each process compute which numbers it should check using its own rank (something like start = N/size*rank and length = N/size, where size and rank come from MPI_Comm_size and MPI_Comm_rank, plus some adequate rounding so that no number is skipped).
You can probably optimize load balancing even more, but you should get it working first.
At the end you should call MPI_Reduce with MPI_SUM to add up the per-process counts on the root.
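The "adequate rounding" can be sketched in plain C++ with no MPI calls. This is a minimal illustration under assumptions: the function names are invented here, rank and size would really come from MPI_Comm_rank and MPI_Comm_size, and the final sum over ranks stands in for what MPI_Reduce with MPI_SUM would deliver to the root.

```cpp
#include <utility>

// Trial-division primality test; adequate for a learning exercise.
bool is_prime(int v) {
    if (v < 2) return false;
    for (int d = 2; d * d <= v; ++d)
        if (v % d == 0) return false;
    return true;
}

// Half-open range [first, last) of the numbers 0..N assigned to `rank`.
// Deriving both bounds from (N+1)*rank/size leaves no gaps or overlaps.
std::pair<int, int> rank_range(int N, int size, int rank) {
    const long long total = static_cast<long long>(N) + 1;  // 0..N inclusive
    const int first = static_cast<int>(total * rank / size);
    const int last = static_cast<int>(total * (rank + 1) / size);
    return {first, last};
}

// What a single rank would compute locally before the reduction.
int count_primes_for_rank(int N, int size, int rank) {
    const std::pair<int, int> range = rank_range(N, size, rank);
    int count = 0;
    for (int v = range.first; v < range.second; ++v)
        if (is_prime(v)) ++count;
    return count;
}
```

Summing count_primes_for_rank(N, size, r) over all ranks r gives the same total as a serial count, whatever the number of processes.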

MPI neighbor reduce operation

This is the moment I feel I need something like MPI_Neighbor_allreduce, but I know it doesn't exist.
Foreword
Given a 3D MPI cartesian topology describing how a 3D physical domain is distributed among processes, I wrote a function probe that asks for a scalar value (which is supposed to be put in a simple REAL :: val) given the 3 coordinates of a point inside the domain.
There can only be 1, 2, 4, or 8 process(es) that are actually involved in the computation of val.
1 if the point is internal to a process subdomain (and it has no neighbors involved),
2 if the point is on a face between 2 processes' subdomains (and each of them has 1 neighbor involved),
4 if the point is on a side between 4 processes' subdomains (and each of them has 2 neighbors involved),
8 if the point is a vertex between 8 processes' subdomains (and each of them has 3 neighbors involved).
After the call to probe as it is now, each process holds val, which is some value for involved processes, 0 or NaN (I decide by (de)commenting the proper lines) for not-involved processes. And each process knows if it is involved or not (through a LOGICAL :: found variable), but does not know if it is the only one involved, nor who are the involved neighbors if it is not.
In the case of 1 involved process, that only value of that only process is enough, and the process can write it, use it, or whatever is needed.
In the latter three cases, the sum of the different scalar values of the processes involved must be computed (and divided by the number of neighbors +1, i.e. self included).
The question
What is the best strategy to accomplish this communication and computation?
What solutions I'm thinking about
I'm thinking about the following possibilities.
Every process executes val = 0 before the call to probe, then MPI_(ALL)REDUCE can be used, (the involved processes participating with val /= 0 in general, all others with val == 0), but this would mean that if more points are asked for val, those points would be treated serially, even if the set of process(es) involved for each of them does not overlap with other sets.
Every process calls MPI_Neighbor_allgather to share found among neighboring processes, so that each involved process knows which of the 6 neighbors participate in the sum, and then performs individual MPI_Send and MPI_Recv calls to communicate val. But this would still involve every process (even though each communicates only with the 6 neighbors).
Maybe the best choice is that each process defines a communicator made up of itself plus the 6 neighbors and then use.
EDIT
Regarding the risk of deadlock mentioned by @JorgeBellón, I initially solved it by calling MPI_SEND before MPI_RECV for communications in the positive direction, i.e. those corresponding to even indices in who_is_involved, and vice-versa in the negative direction. As a special case, this could not deal with a periodic direction with only two processes along it (since each of the two would see the other one as a neighbor in both positive and negative directions, thus resulting in both processes calling MPI_SEND and MPI_RECV in the same order, thus causing a deadlock); the solution to this special case was the following ad-hoc edit to who_is_involved (which I called found_neigh in my code):
DO id = 1, ndims
   IF (ALL(found_neigh(2*id - 1:2*id))) found_neigh(2*id -1 + mycoords(id)) = .FALSE.
END DO
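Why two processes along a periodic direction form a special case can be seen from the neighbor arithmetic alone. The sketch below is an illustration in C++ rather than Fortran, with a made-up helper name; in an MPI program the neighbor ranks would come from MPI_Cart_shift instead.

```cpp
#include <utility>

// Coordinates of the minus and plus neighbors along one periodic
// direction that holds `extent` processes. Hypothetical helper.
std::pair<int, int> periodic_neighbors(int coord, int extent) {
    const int minus = (coord - 1 + extent) % extent;  // wrap below zero
    const int plus = (coord + 1) % extent;            // wrap past the end
    return {minus, plus};
}
```

With three processes, coordinate 0 has distinct neighbors 2 and 1. With two processes, both neighbors of coordinate 0 are coordinate 1, so the "send first in the positive direction, receive first in the negative direction" rule pairs the same two processes twice in the same order.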
As a reference for the readers, the solution that I implemented so far (a solution I'm not so satisfied with) is the following.
found = ... ! .TRUE. or .FALSE. depending whether the process is/isn't involved in computation of val
IF ( found) val = ... ! compute own contribution
IF (.NOT. found) val = NaN
! share found among neighbors
found_neigh(:) = .FALSE.
CALL MPI_NEIGHBOR_ALLGATHER(found, 1, MPI_LOGICAL, found_neigh, 1, MPI_LOGICAL, procs_grid, ierr)
found_neigh = found_neigh .AND. found
! modify found_neigh to deal with special case of TWO processes along PERIODIC direction
DO id = 1, ndims
   IF (ALL(found_neigh(2*id - 1:2*id))) found_neigh(2*id -1 + mycoords(id)) = .FALSE.
END DO
! exchange contribution with neighbors
val_neigh(:) = NaN
IF (found) THEN
   DO id = 1, ndims
      IF (found_neigh(2*id)) THEN
         CALL MPI_SEND(val, 1, MPI_DOUBLE_PRECISION, idp(id), 999, MPI_COMM_WORLD, ierr)
         CALL MPI_RECV(val_neigh(2*id), 1, MPI_DOUBLE_PRECISION, idp(id), 666, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
      END IF
      IF (found_neigh(2*id - 1)) THEN
         CALL MPI_RECV(val_neigh(2*id - 1), 1, MPI_DOUBLE_PRECISION, idm(id), 999, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
         CALL MPI_SEND(val, 1, MPI_DOUBLE_PRECISION, idm(id), 666, MPI_COMM_WORLD, ierr)
      END IF
   END DO
END IF
! combine own contribution with others
val = somefunc(val, val_neigh)
As you said, MPI_Neighbor_allreduce does not exist.
You can create derived communicators that only include your adjacent processes and then perform a regular MPI_Allreduce on them. Each process can belong to up to 7 communicators in a 3D grid:
The communicator in which the process itself sits at the center of the stencil.
The respective communicator centered on each of the adjacent processes.
Creating them can be quite an expensive process, but that does not mean it cannot be worthwhile (HPLinpack makes extensive use of derived communicators, for example).
If you already have a Cartesian topology, a good approach is to use MPI_Neighbor_allgather. This way you will know not only how many neighbors are involved but also which ones they are.
int found; // logical: either 1 or 0
int num_neighbors; // how many neighbors I have (2*ndims in a Cartesian topology)
int who_is_involved[num_neighbors]; // unknown, to be received
double val_neigh[num_neighbors]; // separate receive buffer, one slot per neighbor (type assumed; must match val)
MPI_Neighbor_allgather( &found, ..., who_is_involved, ..., comm );
int actually_involved = 0;
int r = 0;
MPI_Request reqs[2*num_neighbors];
for( int i = 0; i < num_neighbors; i++ ) {
    if( who_is_involved[i] != 0 ) {
        actually_involved++;
        MPI_Isend( &val, ..., &reqs[r++] );          // requests are passed by address
        MPI_Irecv( &val_neigh[i], ..., &reqs[r++] ); // do not receive into the buffer being sent
    }
}
MPI_Waitall( r, reqs, MPI_STATUSES_IGNORE );
Note that I'm using non-blocking point-to-point routines. This is important in most cases because MPI_Send may wait for the receiver to call MPI_Recv. Unconditionally calling MPI_Send and then MPI_Recv in all processes may cause a deadlock (see MPI 3.1 standard section 3.4).
Another possibility is to send both the real value and found in a single communication, so that the number of transfers is reduced. Since all processes are involved in the MPI_Neighbor_allgather anyway, you could use it to get everything done (for a small increase in the amount of data transferred it really pays off).
INTEGER :: neighbor, num_neighbors, found
REAL :: val
REAL :: sendbuf(2)
REAL :: recvbuf(2,num_neighbors)
sendbuf(1) = found
sendbuf(2) = val
! one MPI_2REAL pair is sent to, and received from, each neighbor
CALL MPI_Neighbor_allgather( sendbuf, 1, MPI_2REAL, recvbuf, 1, MPI_2REAL, ...)
DO neighbor = 1,num_neighbors
   IF (recvbuf(1,neighbor) .EQ. 1) THEN
      ! use neighbor val, placed in recvbuf(2,neighbor)
   END IF
END DO
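What happens after the gather can be sketched as follows. This is a hedged illustration in C++, not the answerer's code: the function name combine is invented, each pair mirrors the (found, val) layout of the two-element buffer above, and the averaging rule (sum divided by the number of involved neighbors plus one) is the one stated in the question.

```cpp
#include <utility>
#include <vector>

// Averages this process's value with those of its involved neighbors.
// Each entry is (found, val) as received from one neighbor.
// Assumes the calling process itself is involved.
double combine(double own_val, const std::vector<std::pair<int, double>>& received) {
    double sum = own_val;
    int contributors = 1;  // count self
    for (const auto& entry : received) {
        if (entry.first != 0) {  // this neighbor reported found
            sum += entry.second;
            ++contributors;
        }
    }
    return sum / contributors;
}
```

Values from neighbors that are not involved are ignored, so it does not matter whether they hold 0 or NaN. The sketch does not treat the case where the same neighbor shows up twice, as happens with only two processes along a periodic direction.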

Using Fortran90 and MPI, new to both, trying to use MPI_Gather to collect from a loop 3 different variables in each process

I am new to both Fortran90 and MPI. I have a loop that iterates differently in each individual process. Inside of that, I have a nested loop, and it is here that I make the computations that I want, using the elements of the respective loops. However, I want to send all of this data (the x, the y, and the values computed from x and y) to my root process, 0. From there, I want to write all of the data to the same file in the format 'x y computation'.
program fortranMPI
   use mpi
   !GLOBAL VARIABLE DECLARATION
   real :: step = 0.5, x, y, comput
   integer :: count = 0, finalCount = 5, outFile = 20, i
   !MPI
   integer :: ierr, myrank, mysize, status(MPI_STATUS_SIZE)
   call MPI_INIT(ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD,myrank,ierr)
   call MPI_COMM_SIZE(MPI_COMM_WORLD,mysize,ierr)
   if(myrank == 0) then
      !I want to gather my data here?
   end if
   do i = 1, mysize, 1
      if(myrank == i) then
         x = -2. + (myrank - 1.)*step
         do while (x<= 2.)
            y= -2.
            do while (y<=2.)
               !Here is where I am trying to send my data!
               y = y + step
            end do
            x = x + (mysize-1)*(step)
         end do
      end if
   end do
   call MPI_FINALIZE(ierr)
end program fortranMPI
I keep getting stuck trying to pass the data! If someone could help me out, that would be great! Sorry if this is simpler than I am making it, I am still trying to figure Fortran/MPI out. Thanks in advance!
First of all, it is not clear from the program what it is trying to do. If you can be more specific about what you want to achieve, I can help further.
Usually, the if (myrank==0) block before the calculations is where rank 0 sends data to the rest of the processes. Since process 0 will be sending data, you have to add code right after that for the other processes to receive it. A blocking MPI_RECV returns only once its data has arrived, so a barrier is not strictly required, but you can add one (call MPI_BARRIER) right before the start of the calculations if you want every process to begin from the same point.
As for the calculation part, you also have to decide, not only where you send data, but also where the data is received and if you also need any synchronization on the communication. This has to do with the design of your program so you are the one who knows what exactly you want to do.
The most common commands for sending and receiving data are MPI_SEND and MPI_RECV.
Those are blocking commands, which means that the communication should be synchronized. One Send command should be matched with one Receive command before both processes can continue.
There are also non blocking commands, you can find them all here:
http://www.mpich.org/static/docs/v3.1/www3/
As for the MPI_GATHER command, it is used to gather data from a group of processes. It will only help you when you use more than 2 processes to further accelerate your program. Apart from that, MPI_GATHER is used when you want to gather data and store it in an array, and of course it is worth using only when you are going to receive lots of data, which is definitely not your case here.
Finally, about printing out the results, I'm not sure that what you are asking is possible. Trying to open the same file from 2 processes is probably going to lead to OS errors. Usually you have rank 0 print the results, right after every other process has finished.
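The overall data flow (workers produce 'x y computation' triples, rank 0 collects them and is the only writer) can be sketched without MPI. This is an illustration in C++ under assumptions: the names are invented, compute is a placeholder because the post does not say what the real computation is, and the loop over ranks stands in for rank 0 receiving each worker's data, for instance with MPI_RECV or MPI_GATHERV.

```cpp
#include <vector>

struct Sample { double x, y, value; };

// Placeholder for the real computation, which the post does not specify.
double compute(double x, double y) { return x * x + y * y; }

// Samples that worker `rank` (1..size-1) would produce; rank 0 only collects.
std::vector<Sample> samples_for_rank(int rank, int size, double step) {
    std::vector<Sample> out;
    if (rank == 0 || size < 2) return out;
    const int n = static_cast<int>(4.0 / step) + 1;  // grid points in [-2, 2]
    // Integer indices avoid accumulating rounding error in x and y.
    for (int ix = rank - 1; ix < n; ix += size - 1) {  // cyclic split over workers
        for (int iy = 0; iy < n; ++iy) {
            const double x = -2.0 + ix * step;
            const double y = -2.0 + iy * step;
            out.push_back({x, y, compute(x, y)});
        }
    }
    return out;
}

// Stand-in for the gather step: concatenate per-rank results in rank order.
std::vector<Sample> collect_on_root(int size, double step) {
    std::vector<Sample> all;
    for (int r = 1; r < size; ++r) {
        const std::vector<Sample> part = samples_for_rank(r, size, step);
        all.insert(all.end(), part.begin(), part.end());
    }
    return all;
}
```

With step 0.5 and four processes, each of the three workers produces 27 samples and the root ends up with all 81 grid points, which it can then write to a single file on its own.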

Iterating results back into an OpenCL kernel

I have written an OpenCL kernel that takes 25 million points and checks them relative to two lines (A & B). It then outputs two lists, i.e. set A of all of the points found to be beyond line A, and likewise set B for line B.
I'd like to run the kernel repeatedly, updating the input points with each of the line results sets in turn (and also updating the checking line). I'm guessing that reading the two result sets out of the kernel, forming them into arrays and then passing them back in one at a time as inputs is quite a slow solution.
As an alternative, I've tested keeping a global index in the kernel that logs which points relate to which line. This is updated at each line checking cycle. During each iteration, the index for each point in the overall set is switched to 0 (no line), A or B or so forth (i.e. the related line id). In subsequent iterations only points with an index that matches the 'live' set being checked in that cycle (i.e. tagged with A for set A) are tested further.
The problem is that, in each iteration, the kernels still have to check through the full index (i.e. all 25m points) to discover whether or not they are in the 'live' set. As a result, the speed of each cycle does not significantly improve as the size of the result sets decreases over time. Again, this seems a slow solution; whilst avoiding passing too much information between GPU and CPU, it instead means that a large number of the work items aren't doing very much work at all.
Is there an alternative solution to what I am trying to do here?
You could use atomics to sort the outputs into two arrays, i.e. if a point belongs to A, get its position by incrementing the A counter and write it into A, and do the same for B.
Using global atomics on everything might be horribly slow (fast on AMD, slow on NVIDIA, no idea about other devices). Instead you can use a local atomic_inc on a zeroed local integer to do exactly the same thing (but only for the local set of x work-items), and then at the end do one atomic_add on each global counter based on your local counters.
To put this more clearly in code (my explanation is not great):
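// local_a and local_b are __local ints, zeroed before this point
int id;
if(is_a)
    id = atomic_inc(&local_a); // position within this work-group's A items
else
    id = atomic_inc(&local_b); // position within this work-group's B items
barrier(CLK_LOCAL_MEM_FENCE);
__local int a_base, b_base;
int lid = get_local_id(0);
if(lid == 0)
{
    // one work-item reserves a slice of each output for the whole group
    a_base = atomic_add(a_counter, local_a);
    b_base = atomic_add(b_counter, local_b);
}
barrier(CLK_LOCAL_MEM_FENCE);
if(is_a)
    a_buffer[id + a_base] = data;
else
    b_buffer[id + b_base] = data;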
This involves faffing around with atomics which are inherently slow, but depending on how quickly your dataset reduces it might be much faster. Additionally if B data is not considered live, you can omit getting the b ids and all the atomics involving b, as well as the write back
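The bookkeeping of that scheme can be checked on the CPU. The following is a sequential C++ emulation, not OpenCL: the names are invented for illustration, each chunk of group_size items plays the role of one work-group, and plain additions stand in for atomic_inc and atomic_add. It assumes group_size is greater than zero.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Partitioned {
    std::vector<int> a;  // items for which the predicate holds
    std::vector<int> b;  // everything else
};

// Count locally, reserve a slice of each output with one add per group,
// then write every item at base + local id.
template <typename Pred>
Partitioned partition_by_groups(const std::vector<int>& data, std::size_t group_size, Pred is_a) {
    Partitioned out;
    out.a.resize(data.size());
    out.b.resize(data.size());
    std::size_t a_counter = 0, b_counter = 0;  // the global counters
    for (std::size_t start = 0; start < data.size(); start += group_size) {
        const std::size_t end = std::min(start + group_size, data.size());
        std::size_t local_a = 0, local_b = 0;  // zeroed per work-group
        std::vector<std::size_t> id(end - start);
        for (std::size_t i = start; i < end; ++i)  // the local atomic_inc step
            id[i - start] = is_a(data[i]) ? local_a++ : local_b++;
        const std::size_t a_base = a_counter;  // atomic_add returns the old value
        const std::size_t b_base = b_counter;
        a_counter += local_a;
        b_counter += local_b;
        for (std::size_t i = start; i < end; ++i) {  // the write-back step
            if (is_a(data[i])) out.a[a_base + id[i - start]] = data[i];
            else               out.b[b_base + id[i - start]] = data[i];
        }
    }
    out.a.resize(a_counter);  // the counters double as the output sizes
    out.b.resize(b_counter);
    return out;
}
```

The final counter values give the sizes of the two compacted sets, which is what the next iteration needs to launch only as many work-items as there are live points. On a real device the order inside each output is not deterministic, unlike in this sequential emulation.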

Error in MPI broadcast

Sorry for the long post. I did read some other MPI broadcast related errors but I couldn't find out why my program is failing.
I am new to MPI and I am facing this problem. First I will explain what I am trying to do:
My declarations :
ROWTAG 400
COLUMNTAG 800
Create a 2 X 2 Cartesian topology.
Rank 0 has the whole matrix. It wants to distribute parts of the matrix to all the processes in the 2 X 2 Cartesian topology. For now, instead of a matrix I am just dealing with integers. So for process P(i,j) in the 2 X 2 Cartesian topology (i - row, j - column), I want it to receive (ROWTAG + i) in one message and (COLUMNTAG + j) in another message.
My strategy to do so is:
Processes: P(0,0) , P(0,1), P(1,0), P(1,1)
P(0,0) has all the initial data.
P(0,0) sends (ROWTAG+1) (in this case 401) to P(1,0) - In essence P(1,0) is responsible for distributing information related to row 1 to all the processes in row 1 - I just used a blocking send
P(0,0) sends (COLUMNTAG+1) (in this case 801) to P(0,1) - In essence P(0,1) is responsible for distributing information related to column 1 to all the processes in column 1 - Used a blocking send
For each process, I made a row_group containing all the processes in that row and out of this created a row_comm (communicator object)
For each process, I made a col_group containing all the processes in that column and out of this created a col_comm (communicator object)
At this point, P(0,0) has given information related to row 'i' to process P(i,0) and information related to column 'j' to P(0,j). I call P(i,0) and P(0,j) row_head and col_head respectively.
For Process P(i,j) , P(i,0) gives information related to row i, and P(0,j) gives information related to column j.
I used a broadcast call:
MPI_Bcast(&row_data,1,MPI_INT,row_head,row_comm)
MPI_Bcast(&col_data,1,MPI_INT,col_head,col_comm)
Please find my code here: http://pastebin.com/NpqRWaWN
Here is the error I see:
* An error occurred in MPI_Bcast
on communicator MPI COMMUNICATOR 5 CREATE FROM 3
MPI_ERR_ROOT: invalid root
* MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
Also please let me know if there is any better way to distribute the matrix data.
There are several errors in your program. First, row_Ranks is declared with one element too few, so when writing to it you possibly overwrite other stack variables:
int col_Ranks[SIZE], row_Ranks[SIZE-1];
//                             ^^^^^^
On my test system the program just hangs because of that.
Second, you create new subcommunicators out of matrixComm but you use rank numbers from the latter to address processes in the former when performing the broadcast. That doesn't work. For example, in a 2x2 Cartesian communicator ranks range from 0 to 3. In any column- or row-wise subgroup there are only two processes with ranks 0 and 1 - there is neither rank 2 nor rank 3. If you take a look at the value of row_head across the ranks, it is 2 in two of them, hence the error.
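The rank mismatch can be illustrated with a few lines of arithmetic. This is a sketch under stated assumptions, with invented names: it assumes the row-major rank ordering that MPI Cartesian communicators use, and a row sub-communicator whose members are ordered by column (as with MPI_Cart_sub, or MPI_Comm_split with the column as key). A communicator built from a hand-made group follows whatever order the group's rank list has.

```cpp
// Coordinates of `rank` in a grid with `cols` columns, row-major order.
struct Coords { int row, col; };

Coords cart_coords(int rank, int cols) {
    return {rank / cols, rank % cols};
}

// Rank of the same process inside its row sub-communicator, assuming the
// members of that sub-communicator are ordered by column.
int rank_in_row_comm(int rank, int cols) {
    return cart_coords(rank, cols).col;
}
```

In the 2x2 case, P(1,0) has rank 2 in the Cartesian communicator but rank 0 in its row communicator, so the root argument of the broadcast on row_comm has to be 0, not 2.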
For a much better way to distribute the data, you should refer to this extremely informative answer.

Resources