MPI neighbor reduce operation

This is the moment I feel I need something like MPI_Neighbor_allreduce, but I know it doesn't exist.
Foreword
Given a 3D MPI Cartesian topology describing how a 3D physical domain is distributed among processes, I wrote a function probe that returns a scalar value (supposed to be stored in a simple REAL :: val) given the 3 coordinates of a point inside the domain.
There can only be 1, 2, 4, or 8 process(es) that are actually involved in the computation of val.
1 if the point is internal to a process subdomain (and it has no neighbors involved),
2 if the point is on a face between 2 processes' subdomains (and each of them has 1 neighbor involved),
4 if the point is on a side between 4 processes' subdomains (and each of them has 2 neighbors involved),
8 if the point is a vertex between 8 processes' subdomains (and each of them has 3 neighbors involved).
After the call to probe as it is now, each process holds val, which is some meaningful value for the involved processes and 0 or NaN (I decide which by (de)commenting the proper lines) for the non-involved processes. Each process knows whether it is involved (through a LOGICAL :: found variable), but it does not know whether it is the only one involved, nor which of its neighbors are involved if it is not.
In the case of 1 involved process, the value held by that single process is enough, and the process can write it, use it, or do whatever is needed.
In the latter three cases, the sum of the different scalar values of the processes involved must be computed (and divided by the number of neighbors +1, i.e. self included).
The question
What is the best strategy to accomplish this communication and computation?
The solutions I'm thinking about
I'm thinking about the following possibilities.
Every process executes val = 0 before the call to probe; then MPI_(ALL)REDUCE can be used (the involved processes participating with val /= 0 in general, all the others with val == 0). But this would mean that, if values are requested at several points, those points would be treated serially, even if the set of process(es) involved for each of them does not overlap with the other sets. (A minimal sketch of this option is shown right after this list.)
Every process calls MPI_Neighbor_allgather to share found among neighboring processes, so that each involved process knows which of its 6 neighbors participate in the sum, and then performs individual MPI_SEND(s) and MPI_RECV(s) to communicate val. But this would still involve every process (even though each communicates only with its 6 neighbors).
Maybe the best choice is that each process defines a communicator made up of itself plus its 6 neighbors and then uses a regular reduction on it.
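For concreteness, a minimal sketch of the first option could look like the following (using the same val, found, procs_grid and ierr as in the code at the end of this post; n_involved is just a helper for the final average):

INTEGER :: n_involved

IF (.NOT. found) val = 0          ! non-involved processes contribute zero to the sum
n_involved = MERGE(1, 0, found)   ! count the involved processes to divide the sum afterwards
CALL MPI_ALLREDUCE(MPI_IN_PLACE, val,        1, MPI_DOUBLE_PRECISION, MPI_SUM, procs_grid, ierr)
CALL MPI_ALLREDUCE(MPI_IN_PLACE, n_involved, 1, MPI_INTEGER,          MPI_SUM, procs_grid, ierr)
IF (n_involved > 0) val = val/n_involved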
EDIT
Concerning the risk of deadlock mentioned by @JorgeBellón, I initially solved it by calling MPI_SEND before MPI_RECV for communications in the positive direction, i.e. those corresponding to even indices in who_is_involved, and vice versa in the negative direction. As a special case, this could not deal with a periodic direction having only two processes along it (since each of the two would see the other as a neighbor in both the positive and the negative direction, resulting in both processes calling MPI_SEND and MPI_RECV in the same order and thus causing a deadlock); the solution to this special case was the following ad-hoc edit to who_is_involved (which I called found_neigh in my code):
DO id = 1, ndims
    IF (ALL(found_neigh(2*id - 1:2*id))) found_neigh(2*id - 1 + mycoords(id)) = .FALSE.
END DO
As a reference for the readers, the solution I have implemented so far (a solution I'm not so satisfied with) is the following.
found = ... ! .TRUE. or .FALSE. depending whether the process is/isn't involved in computation of val
IF ( found) val = ... ! compute own contribution
IF (.NOT. found) val = NaN
! share found among neighbors
found_neigh(:) = .FALSE.
CALL MPI_NEIGHBOR_ALLGATHER(found, 1, MPI_LOGICAL, found_neigh, 1, MPI_LOGICAL, procs_grid, ierr)
found_neigh = found_neigh .AND. found
! modify found_neigh to deal with special case of TWO processes along PERIODIC direction
DO id = 1, ndims
    IF (ALL(found_neigh(2*id - 1:2*id))) found_neigh(2*id - 1 + mycoords(id)) = .FALSE.
END DO
! exchange contribution with neighbors
val_neigh(:) = NaN
IF (found) THEN
    DO id = 1, ndims
        IF (found_neigh(2*id)) THEN
            CALL MPI_SEND(val, 1, MPI_DOUBLE_PRECISION, idp(id), 999, MPI_COMM_WORLD, ierr)
            CALL MPI_RECV(val_neigh(2*id), 1, MPI_DOUBLE_PRECISION, idp(id), 666, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
        END IF
        IF (found_neigh(2*id - 1)) THEN
            CALL MPI_RECV(val_neigh(2*id - 1), 1, MPI_DOUBLE_PRECISION, idm(id), 999, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
            CALL MPI_SEND(val, 1, MPI_DOUBLE_PRECISION, idm(id), 666, MPI_COMM_WORLD, ierr)
        END IF
    END DO
END IF
! combine own contribution with others
val = somefunc(val, val_neigh)

As you said, MPI_Neighbor_allreduce does not exist.
You can create derived communicators that only include your adjacent processes and then perform a regular MPI_Allreduce on them. Each process can have up to 7 communicators in a 3D grid.
One communicator in which the process itself is placed at the center of the stencil.
The respective communicator for each of the adjacent processes (in which that neighbor is the center).
Creating these communicators can be quite expensive, but that does not mean it is not worthwhile (HPLinpack makes extensive use of derived communicators, for example).
If you already have a Cartesian topology, a good approach is to use MPI_Neighbor_allgather. This way you will not only know how many neighbors are involved, but also which ones they are.
int found;                          // logical: either 1 or 0
int num_neighbors;                  // how many neighbors I have (known from the topology)
int who_is_involved[num_neighbors]; // unknown, to be received
MPI_Neighbor_allgather( &found, ..., who_is_involved, ..., comm );

int actually_involved = 0;
int r = 0;
MPI_Request reqs[2*num_neighbors];
double neighbor_vals[num_neighbors]; // assuming val is a double; each neighbor's value lands in its own slot
for( int i = 0; i < num_neighbors; i++ ) {
    if( who_is_involved[i] != 0 ) {
        actually_involved++;
        MPI_Isend( &val, ..., &reqs[r++]);
        MPI_Irecv( &neighbor_vals[i], ..., &reqs[r++]); // do not receive into val itself, it is still being sent
    }
}
MPI_Waitall( r, reqs, MPI_STATUSES_IGNORE );
Note that I'm using non-blocking point-to-point routines. This is important in most cases because MPI_Send may wait for the receiver to call MPI_Recv. Unconditionally calling MPI_Send and then MPI_Recv in all processes may cause a deadlock (see MPI 3.1 standard, section 3.4).
Another possibility is to send both the real value and found in a single communication, so that the number of transfers is reduced. Since all processes are involved in the MPI_Neighbor_allgather anyway, you can use it to get everything done (for a small increase in the amount of data transferred, it really pays off).
INTEGER :: neighbor, num_neighbors, found
REAL :: val
REAL :: sendbuf(2)
REAL :: recvbuf(2,num_neighbors)

sendbuf(1) = found
sendbuf(2) = val

! recvcount is the number of MPI_2REAL elements received from EACH neighbor, hence 1
CALL MPI_Neighbor_allgather( sendbuf, 1, MPI_2REAL, recvbuf, 1, MPI_2REAL, ...)

DO neighbor = 1, num_neighbors
    IF ( recvbuf(1,neighbor) == 1.0 ) THEN
        ! use neighbor val, placed in recvbuf(2,neighbor)
    END IF
END DO

Related

MPI - Deal with special case of zero sized subarray for MPI_TYPE_CREATE_SUBARRAY

I use MPI_TYPE_CREATE_SUBARRAY to create a type used to communicate portions of 3D arrays between neighboring processes in a Cartesian topology. Specifically, each process communicates with the two processes on the two sides along each of the three directions.
Referring for simplicity to a one-dimensional grid, there are two parameters nL and nR that define how many values each process has to receive from the left and send to the right, and how many each has to receive from the right and send to the left.
Unaware (or maybe just forgetful) of the fact that all elements of the array_of_subsizes array parameter of MPI_TYPE_CREATE_SUBARRAY must be positive, I wrote my code in a way that can't deal with the case nR = 0 (or nL = 0, either can be zero).
(By the way, I see that MPI_TYPE_VECTOR does accept zero count and blocklength arguments and it's sad MPI_TYPE_CREATE_SUBARRAY can't.)
How would you suggest facing this problem? Do I really have to convert each call to MPI_TYPE_CREATE_SUBARRAY into multiple MPI_TYPE_VECTORs called in a chain?
The following code is minimal but not working as-is (it works in the larger program, and I haven't had time to extract the minimum number of declarations and prints); still, it should give a better idea of what I'm talking about.
INTEGER :: ndims = 3, DBS, ierr, temp, sub3D
INTEGER, DIMENSION(ndims) :: aos, aoss
CALL MPI_TYPE_SIZE(MPI_DOUBLE_PRECISION, DBS, ierr)
! doesn't work if ANY(aoss == 0)
CALL MPI_TYPE_CREATE_SUBARRAY(ndims, aos, aoss, [0,0,0], MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, sub3D, ierr)
! does work if ANY(aoss == 0)
CALL MPI_TYPE_HVECTOR(aoss(2), aoss(1), DBS*aos(1), MPI_DOUBLE_PRECISION, temp, ierr)
CALL MPI_TYPE_HVECTOR(aoss(3), 1, DBS*PRODUCT(aos(1:2)), temp, sub3D, ierr)
In the end it wasn't hard to replace MPI_TYPE_CREATE_SUBARRAY with two MPI_TYPE_HVECTORs. Maybe this is the best solution, after all.
On that note, a side question comes naturally to me: why is MPI_TYPE_CREATE_SUBARRAY so limited? There are a lot of examples in the MPI standard of things that correctly fall back on "do nothing" (when a sender or receiver is MPI_PROC_NULL) or "there's nothing in this" (when aoss has a zero dimension, as in my example). Should I post a feature request somewhere?
The MPI 3.1 standard (chapter 4.1, page 95) makes it crystal clear:
For any dimension i, it is erroneous to specify array_of_subsizes[i] < 1 [...].
You are free to send your comment to the appropriate Mailing List.
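In the meantime, a simple workaround is to guard both the type creation and the corresponding transfer, skipping them whenever a subsize is zero (a sketch reusing the names from the question; array and dest stand in for the actual buffer and destination rank):

IF (ALL(aoss > 0)) THEN
    ! the subarray type is well defined only when every subsize is at least 1
    CALL MPI_TYPE_CREATE_SUBARRAY(ndims, aos, aoss, [0,0,0], MPI_ORDER_FORTRAN, &
                                  MPI_DOUBLE_PRECISION, sub3D, ierr)
    CALL MPI_TYPE_COMMIT(sub3D, ierr)
    CALL MPI_SEND(array, 1, sub3D, dest, 0, MPI_COMM_WORLD, ierr)
    CALL MPI_TYPE_FREE(sub3D, ierr)
ELSE
    ! nothing to transfer: either skip the communication on both sides,
    ! or send an empty message so that an unconditionally posted receive still completes
    CALL MPI_SEND(array, 0, MPI_DOUBLE_PRECISION, dest, 0, MPI_COMM_WORLD, ierr)
END IF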

C++ functions and MPI programming

From what I have learned in my supercomputing class I know that MPI is a communicating (and data passing) interface.
I'm confused about what happens when you run a function in a C++ program and want each processor to perform a specific task.
For example, a prime number search (very popular for supercomputers). Say I have a range of values (531-564, some arbitrary range) and say I have 50 processes on which I could run a series of evaluations for each number. If root (process 0) wants to examine 531, then, knowing prime numbers, I can use 8 processes (1-8) to evaluate its prime status. If the number is divisible by any number 2-9 with a remainder of 0, then it is not prime.
Is it possible, with MPI passing data to each process, to have these processes perform these actions?
The hardest part for me is understanding that if I perform an action in the original C++ program, the work taking place could be allocated across several different processes; how can I structure this in MPI? Or is my understanding completely wrong? If so, how am I supposed to truly go about this path of thinking in a correct manner?
The big idea is passing data to a process versus having a function sent to a process. I'm fairly certain I'm wrong, but I'm trying to backtrack to fix my thinking.
Each MPI process is running the same program, but that doesn't mean that they are doing the same thing. Different processes can be running different branches of the code, depending on the id (or "rank") of the process, and in effect be completely independent. Like any distributed computation, the actors do need to agree on how they will communicate.
The most basic strategy in MPI is scatter-gather, where the "master" process (usually the one with rank 0) will split an array of work equally amongst the peers (including the master process itself) by having them all call scatter, the peers will do the work, then all peers will call gather to send the results back to master.
In your prime algorithm example, build an array of integers, "scatter" it to all the peers, have each peer run through its slice saving 1 if an entry is prime and 0 if it is not, then "gather" the results to master. [In this particular example, since the input data is completely predictable based on process rank, the scatter step is unnecessary, but we will do it anyway.]
As pseudo-code:
main():
    int x[n], n = 100
    MPI_init()
    // prepare data on master
    if rank == 0:
        for i in 1 ... n, x[i] = i
    // send data from x on root to local on each process in world
    MPI_scatter(x, n, int, local, n/k, int, root, world)
    for i in 1 ... n/k
        result[i] = 1 // assume prime
        if 2 divides local[i], result[i] = 0
        if 3 divides local[i], result[i] = 0
        if 5 divides local[i], result[i] = 0
        if 7 divides local[i], result[i] = 0
    // gather results from local on each process in world to x on root
    MPI_gather(result, n/k, int, x, n, int, root, world)
    // print results
    if rank == 0:
        for i in 1 ... n, print i if x[i] == 1
    MPI_finalize()
There are lots of details to fill in, such as proper declarations, dealing with the fact that some ranks will have fewer elements than others, using proper C syntax, etc., but getting them right doesn't help explain the overall picture.
More fine-grained synchronization and communication is possible using direct send/recv between processes. Such programs are harder to write since the different processes may be in different states. In particular, it is important that if process a is calling MPI_send to process b, then process b had better be calling MPI_recv from a.

Error in MPI broadcast

Sorry for the long post. I did read some other MPI broadcast related errors but I couldn't find out why my program is failing.
I am new to MPI and I am facing this problem. First I will explain what I am trying to do:
My declarations:
ROWTAG 400
COLUMNTAG 800
Create a 2 X 2 Cartesian topology.
Rank 0 has the whole matrix. It wants to dissipate parts of the matrix to all the processes in the 2 x 2 Cartesian topology. For now, instead of a matrix I am just dealing with integers. So for process P(i,j) in the 2 x 2 Cartesian topology (i - row, j - column), I want it to receive (ROWTAG + i) in one message and (COLUMNTAG + j) in another message.
My strategy to do so is:
Processes: P(0,0) , P(0,1), P(1,0), P(1,1)
P(0,0) has all the initial data.
P(0,0) sends (ROWTAG+1) (in this case 401) to P(1,0) - in essence, P(1,0) is responsible for dissipating information related to row 1 to all the processes in row 1 - I just used a blocking send
P(0,0) sends (COLUMNTAG+1) (in this case 801) to P(0,1) - in essence, P(0,1) is responsible for dissipating information related to column 1 to all the processes in column 1 - I used a blocking send
For each process, I made a row_group containing all the processes in that row and out of this created a row_comm (communicator object)
For each process, I made a col_group containing all the processes in that column and out of this created a col_comm (communicator object)
At this point, P(0,0) has given information related to row 'i' to process P(i,0), and information related to column 'j' to P(0,j). I call P(i,0) and P(0,j) row_head and col_head respectively.
For Process P(i,j) , P(i,0) gives information related to row i, and P(0,j) gives information related to column j.
I used a broadcast call:
MPI_Bcast(&row_data,1,MPI_INT,row_head,row_comm)
MPI_Bcast(&col_data,1,MPI_INT,col_head,col_comm)
Please find my code here: http://pastebin.com/NpqRWaWN
Here is the error I see:
* An error occurred in MPI_Bcast
on communicator MPI COMMUNICATOR 5 CREATE FROM 3
MPI_ERR_ROOT: invalid root
* MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
Also please let me know if there is any better way to distribute the matrix data.
There are several errors in your program. First, row_Ranks is declared with one element too few, and when writing to it you possibly overwrite other stack variables:
int col_Ranks[SIZE], row_Ranks[SIZE-1];
// ^^^^^^
On my test system the program just hangs because of that.
Second, you create new subcommunicators out of matrixComm, but you use rank numbers from matrixComm to address processes in those subcommunicators when performing the broadcast. That doesn't work. For example, in a 2x2 Cartesian communicator the ranks range from 0 to 3. In any column- or row-wise subgroup there are only two processes, with ranks 0 and 1 - there is neither rank 2 nor rank 3. If you take a look at the value of row_head across the ranks, it is 2 in two of them, hence the error.
For a much better way to distribute the data, you should refer to this extremely informative answer.

Breaking down a Matrix so that every process gets its share of the matrix using MPI

I am fairly new to using MPI. My question is the following: I have a matrix with 2000 rows and 3 columns stored as a 2D array (not contiguous data). Without changing the structure of the array, depending on the number of processes np, each process should get a portion of the matrix.
Example:
A: 2D array of 2000 rows by 3 columns, np = 2; then P0 gets the first half of A, which would be a 2D array of the first 1000 rows by 3 columns, and P1 gets the second half, which would be the second 1000 rows by 3 columns.
Now np can be any number (as long as it divides the number of rows). Any easy way to go about this?
I will have to use FORTRAN 90 for this assignment.
Thank you
Row-wise distribution of 2D arrays in Fortran is tricky (but not impossible) using scatter/gather operations directly because of the column-major storage. Two possible solutions follow.
Pure Fortran 90 solution: With Fortran 90 you can specify array sections like A(1:4,2:3), which would take a small 4x2 block out of the matrix A. You can pass array slices to MPI routines. Note that with current MPI implementations (conforming to the now old MPI-2.2 standard), the compiler would create a temporary contiguous copy of the section data and pass it to the MPI routine (since the lifetime of the temporary storage is not well defined, one should not pass array sections to non-blocking MPI operations like MPI_ISEND). MPI-3.0 introduces a new and very modern Fortran 2008 interface that allows MPI routines to directly take array sections (without intermediate arrays) and supports passing sections to non-blocking calls.
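To illustrate this caveat, here is a minimal sketch (with A a REAL array and dest a placeholder destination rank):

! Fine with the old mpi module: the compiler may pass a temporary copy of the section,
! but MPI_SEND returns only once the buffer can be reused
CALL MPI_SEND(A(1:4,2:3), 8, MPI_REAL, dest, 0, MPI_COMM_WORLD, ierr)

! Dangerous with the old mpi module: the temporary copy may be deallocated as soon as
! MPI_ISEND returns, i.e. possibly before the transfer has actually taken place
! CALL MPI_ISEND(A(1:4,2:3), 8, MPI_REAL, dest, 0, MPI_COMM_WORLD, req, ierr)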
With array sections you only have to implement a simple DO loop in the root process:
INTEGER :: i, rows_per_proc
rows_per_proc = 2000/nproc
IF (rank == root) THEN
    DO i = 0, nproc-1
        IF (i /= root) THEN
            start_row = 1 + i*rows_per_proc
            end_row = (i+1)*rows_per_proc
            CALL MPI_SEND(mat(start_row:end_row,:), 3*rows_per_proc, MPI_REAL, &
                          i, 0, MPI_COMM_WORLD, ierr)
        END IF
    END DO
ELSE
    CALL MPI_RECV(submat(1,1), 3*rows_per_proc, MPI_REAL, ...)
END IF
Pure MPI solution (also works with FORTRAN 77): First, you have to declare a vector datatype with MPI_TYPE_VECTOR. The number of blocks would be 3, the block length would be the number of rows that each process should get (e.g. 1000), the stride should be equal to the total height of the matrix (e.g. 2000). If this datatype is called blktype, then the following would send the top half of the matrix:
REAL, DIMENSION(2000,3) :: mat
CALL MPI_SEND(mat(1,1), 1, blktype, p0, ...)
CALL MPI_SEND(mat(1001,1), 1, blktype, p1, ...)
Calling MPI_SEND with blktype would take 1000 elements from the specified starting address, then skip the next 2000 - 1000 = 1000 elements, take another 1000 and so on, 3 times in total. This would form a 1000-row sub-matrix of your big matrix.
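For concreteness, blktype as described above could be created like this (a sketch; the type must be committed before use):

INTEGER :: blktype, ierr

! 3 blocks (one per column), each of 1000 contiguous REALs,
! with consecutive blocks starting 2000 elements (one full column) apart
CALL MPI_TYPE_VECTOR(3, 1000, 2000, MPI_REAL, blktype, ierr)
CALL MPI_TYPE_COMMIT(blktype, ierr)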
You can now run a loop to send a different sub-block to each process in the communicator, effectively performing a scatter operation. In order to receive this sub-block, the receiving process could simply specify:
REAL, DIMENSION(1000,3) :: submat
CALL MPI_RECV(submat(1,1), 3*1000, MPI_REAL, root, ...)
If you are new to MPI, this is all you need to know about scattering matrices by rows in Fortran. If you know well how the MPI type system works, then read ahead for a more elegant solution.
(See here for an excellent description by Jonathan Dursi of how to do that with MPI_SCATTERV. His solution deals with splitting a C matrix in columns, which essentially poses the same problem as the one here, as C stores matrices in row-major fashion. A Fortran version follows.)
You could also make use of MPI_SCATTERV, but it is quite involved. It builds on the pure MPI solution presented above. First you have to resize the blktype datatype into a new type that has an extent equal to that of MPI_REAL, so that offsets in array elements can be specified. This is needed because offsets in MPI_SCATTERV are specified in multiples of the extent of the datatype, and the extent of blktype is the size of the matrix itself. But because of the strided storage, both sub-blocks start only 4000 bytes apart (1000 times the typical extent of MPI_REAL). To modify the extent of the type, one uses MPI_TYPE_CREATE_RESIZED:
INTEGER(KIND=MPI_ADDRESS_KIND) :: lb, extent
! Get the extent of MPI_REAL
CALL MPI_TYPE_GET_EXTENT(MPI_REAL, lb, extent, ierr)
! Bestow the same extent upon the brother of blktype
CALL MPI_TYPE_CREATE_RESIZED(blktype, lb, extent, blk1b, ierr)
This creates a new datatype, blk1b, which has all characteristics of blktype, e.g. can be used to send whole sub-blocks, but when used in array operations, MPI would only advance the data pointer with the size of a single MPI_REAL instead of with the size of the whole matrix. With this new type, you could now position the start of each chunk for MPI_SCATTERV on any element of mat, including the start of any matrix row. Example with two sub-blocks:
INTEGER, DIMENSION(2) :: sendcounts, displs
! First sub-block
sendcounts(1) = 1
displs(1) = 0
! Second sub-block
sendcounts(2) = 1
displs(2) = 1000
CALL MPI_SCATTERV(mat(1,1), sendcounts, displs, blk1b, &
submat(1,1), 3*1000, MPI_REAL, &
root, MPI_COMM_WORLD, ierr)
Here the displacement of the first sub-block is 0, which coincides with the beginning of the matrix. The displacement of the second sub-block is 1000, i.e. it would start on the 1000-th row of the first column. On the receiver's side the data count argument is 3*1000 elements, which matches the size of the sub-block type.

MPI - broadcast to all processes to printf something

I have declared int value in my main, and all of the processes have initialized this value. All of them store value, which I want to write to the screen after the computation is finished. Is a broadcast the solution? If so, how do I implement it?
int i;
int value;
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
left = (myrank - 1); if (left < 0) left = numtasks - 1;
right = (myrank + 1); if (right >= numtasks) right = 0;
if (myrank == 0) {
    value = value + myrank;
    MPI_Send(&value, 1, MPI_INT, right, 99, MPI_COMM_WORLD);
    MPI_Recv(&value, 1, MPI_INT, left, 99, MPI_COMM_WORLD, &status);
}
else if (myrank == (numtasks - 1)) {
    MPI_Recv(&value, 1, MPI_INT, left, 99, MPI_COMM_WORLD, &status);
    value = value + myrank;
    MPI_Send(&value, 1, MPI_INT, right, 99, MPI_COMM_WORLD);
}
else {
    MPI_Recv(&value, 1, MPI_INT, left, 99, MPI_COMM_WORLD, &status);
    value = value + myrank;
    MPI_Send(&value, 1, MPI_INT, right, 99, MPI_COMM_WORLD);
}
These should make a logical circle. I do one computation (the sum of all ranks), and in process 0 I get the result. I want this result (for 4 processes it will be 6) to be printed by each of the processes after this computation. But I don't see exactly how and where to use a barrier.
There is one more thing: after all N-1 sends (where N is the number of processes) I should have the sum of all ranks in each of the processes. In my code I get this sum only in process 0... It might be a bad approach :-(
Some more detail about the structure of your code would help, but it sounds like you can just use MPI_Barrier. Your processes don't need to exchange any data, they just have to wait until everyone reached the point in your code where you want the printing to happen, which is exactly what Barrier does.
EDIT: In the code you posted, the Barrier would go at the very end (after the if statement), followed by printf(value).
However, your code will not compute the total sum of all ranks in all nodes, since process i only receives the summed ranks of the first i-1 processes. If you want each process to have the total sum at the end, then replacing Barrier with Broadcast is indeed the best option. (In fact, the entire if statement AND the Broadcast could be replaced by a single MPI_Reduce() call, but that wouldn't really help you learn MPI. :) )
