MPI Send/Recv millions of messages - mpi

I have this loop over NT (millions of iterations) for procs greater than 0. Messages of 120 bytes are sent to proc 0 in each iteration, and proc 0 receives them (I have the same loop over NT for proc 0).
I want proc 0 to receive them in order so I can store them in array nhdr1.
The problem is that proc 0 does not receive the messages properly and I often get 0 values in array nhdr.
How can I modify the code so that the messages are received in the same order as they were sent?
[...]
if (rank == 0) {
    nhdr = malloc((unsigned long)15*sizeof(*nhdr));
    nhdr1 = malloc((unsigned long)NN*15*sizeof(*nhdr1));
    itr = 0;
    jnode = 1;
    for (l=0; l<NT; l++) {
        MPI_Recv(nhdr, 15, MPI_LONG, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        if (l == status.MPI_TAG) {
            for (i=0; i<nkeys; i++)
                nhdr1[itr*15+i] = nhdr[i];
        }
        itr++;
        if (itr == NN) {
            ipos = (unsigned long)(jnode-1)*NN*15*sizeof(*nhdr1);
            fseek(ismfh, ipos, SEEK_SET);
            nwrite += fwrite(nhdr1, sizeof(*nhdr1), NN*15, ismfh);
            itr = 0;
            jnode++;
        }
    }
    free(nhdr);
    free(nhdr1);
} else {
    nhdr = malloc(15*sizeof(*nhdr));
    irecmin = (rank-1)*NN+1;
    irecmax = rank*NN;
    for (l=0; l<NT; l++) {
        if (jrec[l] >= irecmin && jrec[l] <= irecmax) {
            indx1 = (unsigned long)(jrec[l]-irecmin) * 15;
            for (i=0; i<15; i++)
                nhdr[i] = nhdr1[indx1+i]; // nhdr1 is allocated earlier for rank > 0!
            MPI_Send(nhdr, 15, MPI_LONG, 0, l, MPI_COMM_WORLD);
        }
    }
    free(nhdr);
}

There is no way to guarantee that your messages will arrive at rank 0 in the same order they were sent from different ranks. For example, if you have a scenario like this (S1 means send message 1):
rank 0 ----------------
rank 1 ---S1------S3---
rank 2 ------S2------S4
There is no guarantee that the messages will arrive at rank 0 in the order S1, S2, S3, S4. The only guarantee made by MPI is that the messages from each rank that are sent on the same communicator with the same tag (which you are doing) will arrive in the same order they were sent. This means that the resulting order could be:
S1, S2, S3, S4
Or it could be:
S1, S3, S2, S4
or:
S2, S1, S3, S4
...and so on.
For most applications, this doesn't really matter. The ordering that's important is the logical ordering, not the real time ordering. You might take another look at your application and make sure you can't relax your requirements a bit.
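Not part of the original answers, but a minimal sketch that makes both points visible: every rank above 0 sends two tagged messages to rank 0, and rank 0 receives with MPI_ANY_SOURCE and prints the arrival order. The two tags from any single sender always come out in the order they were sent; the interleaving across senders is whatever that particular run happens to produce.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int payload;
        MPI_Status status;
        /* two messages are expected from every other rank */
        for (int n = 0; n < 2 * (size - 1); n++) {
            MPI_Recv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("got value %d from rank %d with tag %d\n",
                   payload, status.MPI_SOURCE, status.MPI_TAG);
        }
    } else {
        for (int tag = 0; tag < 2; tag++) {
            int payload = rank * 10 + tag;
            MPI_Send(&payload, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}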

What do you mean by "messages are received in the same order as they were sent"?
In the code now, the messages ARE received in (roughly) the order that they are actually sent... but that order has nothing to do with the rank numbers, or really anything else. See Wesley Bland's response for more on this.
If you mean "receive the messages in rank order"...then there are a few options.
First, a collective like MPI_Gather or MPI_Gatherv would be an "obvious" choice to ensure that the data is ordered by the rank that produced it. This only works if each rank does the same number of iterations, and those iterations stay roughly sync'd.
Second, you could remove the MPI_ANY_SOURCE and post a set of MPI_Irecv calls with the buffers supplied "in order". When a message arrives, it will be in the correct buffer location "automatically". For each message that is received, a new MPI_Irecv could be posted with the correct receive buffer location supplied. Any unmatched MPI_Irecv requests would need to be cancelled at the end of the job. A sketch of this second option follows.
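A minimal sketch of that second option, under two simplifying assumptions: each record carries a unique tag l (as in the question), and the total number of records is small enough to keep every request pending at once (the answer's suggestion of re-posting receives as messages complete avoids that). The names TOTAL and receive_in_order are invented; one could equally match on the sender rank instead of keeping MPI_ANY_SOURCE.

#include <mpi.h>
#include <stdlib.h>

#define TOTAL 1000   /* total number of records expected (assumption) */

/* Post one nonblocking receive per record, with the receive buffer already
 * pointing at that record's final slot in nhdr1, so no reordering is needed
 * when the messages arrive. */
void receive_in_order(long *nhdr1)   /* nhdr1 holds TOTAL*15 longs */
{
    MPI_Request *reqs = malloc(TOTAL * sizeof(*reqs));

    for (int l = 0; l < TOTAL; l++)
        MPI_Irecv(&nhdr1[(long)l * 15], 15, MPI_LONG,
                  MPI_ANY_SOURCE, l, MPI_COMM_WORLD, &reqs[l]);

    /* complete them all; the data lands in order regardless of arrival order */
    MPI_Waitall(TOTAL, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}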

Keeping in mind that:
messages from a given rank are received in the order in which they were sent, and
each message carries the originating rank in the status structure (status.MPI_SOURCE) returned by MPI_Recv(),
you can use these two elements to place the received data into nhdr1 correctly.
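A minimal sketch of that bookkeeping, reusing the question's layout (rank r owns NN consecutive records and sends them in increasing order, so the k-th message received from rank r belongs at row (r-1)*NN + k). The function name, the counters array and the NTOTAL parameter are invented for illustration.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void gather_records(long *nhdr1, int nprocs, long NN, long NTOTAL)
{
    long *counters = calloc(nprocs, sizeof(*counters)); /* records seen per sender */
    long buf[15];
    MPI_Status status;

    for (long l = 0; l < NTOTAL; l++) {
        MPI_Recv(buf, 15, MPI_LONG, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        int src = status.MPI_SOURCE;                        /* who sent it */
        long row = (long)(src - 1) * NN + counters[src];    /* where it belongs */
        memcpy(&nhdr1[row * 15], buf, 15 * sizeof(long));
        counters[src]++;                                    /* next record from src */
    }
    free(counters);
}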

Related

How is the length determined in sendcount and recvcount in MPI.COMM_WORLD.Gather

After I Bcast the data (a clusters[10][5] 2D array) to every other process, each one calculates its new local values, and then I want to send them back to process 0.
But sometimes some of the data is missing (depending on the number of clusters), or the data does not match what I have in the sending processes.
I don't know why, but the maximum value that sendcount and recvcount can take has to be divided by size or by some factor; it can't be the array size (10, or 10*5 = the number of elements).
If I put in the full size, for instance clusters.length (10), I get an index-out-of-bounds at 19, and if I run with more processes (mpjrun.bat -np 11 name) the out-of-bounds index gets higher; it always goes up or down by 2 with a higher/lower number of processes (for example, with 5 processes I get out-of-bounds at 9, and on the next run with 6 I get 11).
Can someone explain why Gather's count is connected to the number of processes, or why it can't accept the array size?
Also, the program doesn't end after the data is calculated correctly. It only ends if I use 1 process; otherwise it leaves the loop, prints something to the terminal, and after that I have MPI.Finalize but nothing happens, and I have to Ctrl+C the bat job to get the terminal back.
The clusterget variable is sized to number of clusters * number of processes so that it stores all the new clusters from the other processes and I can then use them all in the first process, so the problem isn't in the clusterget variable, or maybe it is? There isn't really anything documented about sending a 2D array of floats (and yes, I need to use MPI.OBJECT, because Java doesn't like float; if I use float it says Float can't be cast to Float).
MPI.COMM_WORLD.Bcast(clusters, 0, clusters.length, MPI.OBJECT, 0);
// calculate and then send back to 0
MPI.COMM_WORLD.Gather(clusters, 0, clusters.length / size, MPI.OBJECT, clusterget, 0, clusters.length / size, MPI.OBJECT, 0);
if (me == 0) {
    for (int j = 0; j < clusters.length; j++) { // adds clusters from each other process to the first ones
        for (int i = 0; i < size - 1; i++) {
            System.out.println(clusterget[j+i*cluster][4]+" tock "+clusters[j][4]);
            clusters[j][2] += clusterget[j + i * cluster][2]; // add
            clusters[j][3] += clusterget[j + i * cluster][3];
            clusters[j][4] += clusterget[j + i * cluster][4];
        }
    }
}
In summary:
The data from each process isn't the same as the data collected after the Gather, in which I can't put the full size of the 2D float array.
I've changed the Gather to Send and Recv and it works; I needed to add a Barrier so that the data is in sync before sending. But this only works for 2 processes.
MPI.COMM_WORLD.Barrier();
if (me != 0) {
    MPI.COMM_WORLD.Send(clusters, 0, clusters.length, MPI.OBJECT, 0, MPI.ANY_TAG);
}
if (me == 0) {
    for (int i = 1; i < size; i++) {
        MPI.COMM_WORLD.Recv(clusterget, 0, clusters.length, MPI.OBJECT, i, MPI.ANY_TAG);
        for (int j = 0; j < clusters.length; j++) {
            clusters[j][2] += clusterget[j][2];
            clusters[j][3] += clusterget[j][3];
            clusters[j][4] += clusterget[j][4];
        }
    }
}

MPI_Allreduce() in a recursive Tree walk

I have a tree structure (a quadtree) on every processor and walk through these trees using the following function:
void treeWalk(TreeNode *t){
    if(t != NULL){
        for(int i=0; i<4; i++){
            treeWalk(t->child[i]);
        }
        // Start operation on *t
        if (isSpecificNode(t) == TRUE){
            // Do stuff
        }
        // End operation on *t
    }
}
The tree structure on every processor is not the same. If I reach a certain node (which every processor has, e.g. node number x) in a tree, I want to use MPI_Allreduce() to sum up this node's value over all processors and send the result to all the other processors so that they can save the result of this operation like t->nodeValue = sumOfAllNodes;. If I do the following it does not work:
void treeWalk(TreeNode *t){
    if(t != NULL){
        for(int i=0; i<4; i++){
            treeWalk(t->child[i]);
        }
        // Start operation on *t
        if (isSpecificNode(t) == TRUE){
            double thisNodeValue = t->nodeValue;
            double sumOfAllNodes = 0;
            MPI_Allreduce(&thisNodeValue, &sumOfAllNodes, 1,
                          MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
            t->nodeValue = sumOfAllNodes;
        }
        // End operation on *t
    }
}
In this case, after calling MPI_Allreduce, the value of t->nodeValue is not the desired result. How do I use MPI_Allreduce() in a recursive function? Can I use other functions like MPI_Barrier() or MPI_Wait() to get the job done?
EDIT: To make my question clearer:
Let us assume that I have two trees with only two nodes. The left tree is on processor 1 and the right one on processor 2. Both processors do a recursive tree walk. If the processors reach the left node (which normally does not happen simultaneously) they call MPI_Allreduce() and compute 1+0 = 1. After this operation the left node on every processor should be equal to 1 (or 0+2 = 2 in the case of the right nodes).

Could MPI_Bcast lead to the issue of data uncertainty?

If different processes broadcast different values to the other processes in the group of a certain communicator, what would happen?
Take the following program run by two processes as an example,
int rank, size;
int x;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0)
{
    x = 0;
    MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
}
else if (rank == 1)
{
    x = 1;
    MPI_Bcast(&x, 1, MPI_INT, 1, MPI_COMM_WORLD);
}
cout << "Process " << rank << "'s value is:" << x << endl;
MPI_Finalize();
I think there could be different possibilities for the printed result at the end of the program. If process 0 runs faster than process 1, it would broadcast its value earlier, so process 1 would have the same value as process 0 by the time it starts broadcasting its own, making the printed value of x 0 on both processes. But if process 0 runs slower than process 1, process 0 would end up with the same value as process 1, which is 1. Does what I described actually happen?
I think you don't understand the MPI_Bcast function well. MPI_Bcast is an MPI collective communication operation in which every process that belongs to the given communicator needs to take part. So not only the process that sends the data to be broadcast, but also the processes that receive the broadcast data, must call the same MPI_Bcast in order for the broadcast to happen among all participating processes.
In your given program, especially this part:
if (rank == 0)
{
    x = 0;
    MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
}
else if (rank == 1)
{
    x = 1;
    MPI_Bcast(&x, 1, MPI_INT, 1, MPI_COMM_WORLD);
}
I think you meant for the process whose rank is 0 (process 0) to broadcast its value of x to the other processes, but in your code only process 0 calls that MPI_Bcast because of the if-else. So what do the other processes do? The process whose rank is 1 (process 1) doesn't call the same MPI_Bcast that process 0 calls; it calls another MPI_Bcast to broadcast its own value of x (the root argument differs between those two MPI_Bcast calls). Thus if only process 0 calls its MPI_Bcast, it actually broadcasts the value of x only to itself, and the value of x stored in the other processes isn't affected at all. The same holds for process 1. As a result, in your program the printed value of x for each process should be the same as it was initially assigned, and there is no data uncertainty issue of the kind you were concerned about.
MPI_Bcast is used primarily so that rank 0 [root] can calculate values and broadcast them so everybody starts with the same values.
Here is a valid usage:
int x;
// not all ranks may get the same time value ...
srand(time(NULL));
// so we must get the value once ...
if (rank == 0)
    x = rand();
// and broadcast it here ...
MPI_Bcast(&x,1,MPI_INT,0,MPI_COMM_WORLD);
Notice the difference from your usage: the same MPI_Bcast call is made by all ranks. The root does the send and the others do the receive.

MPI call and receive in a nested loop

I have a nested loop, and from inside the loop I call MPI send, which I want to send a specific value to the receiver; the receiver then takes the data and again sends MPI messages to another set of CPUs... I used something like this, but it looks like there is a problem in the receive and I can't see where I went wrong... "the machine goes into an infinite loop somewhere...".
I am trying to make it work like this:
master CPU >> send to other CPUs >> send to slave CPUs
.
.
.
int currentCombinationsCount;
int mp;
if (rank == 0)
{
    for (int pr = 0; pr < combinationsSegmentSize; pr++)
    {
        int CblockBegin = CombinationsSegementsBegin[pr];
        int CblockEnd = CombinationsSegementsEnd[pr];
        currentCombinationsCount = numOfCombinationsEachLoop[pr];
        prossessNum = 1; // specify which processor we are sending to
        // now substitute and send to the main Processors
        for (mp = CblockBegin; mp <= CblockEnd; mp++)
        {
            MPI_Send(&mp, 1, MPI_INT, prossessNum, TAG, MPI_COMM_WORLD);
            prossessNum++;
        }
    } // this loop goes through all the specified blocks for the combinations
} // end of rank 0
else if (rank > currentCombinationsCount)
{
    // here I want to put other receives that will take values from the else below
}
else
{
    MPI_Recv(&mp, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, &stat);
    // the code stuck here in infinite loop
}
You've only initialised currentCombinationsCount within the if (rank == 0) branch, so all other procs will see an uninitialised variable. That results in undefined behaviour and the outcome depends on your compiler. Your program may crash, or the value may be set to 0 or to an undetermined value.
If you're lucky, the value may be set to 0, in which case your branching reduces to:
if (rank == 0)     { /* rank == 0 will enter this */ }
else if (rank > 0) { /* all other procs enter this */ }
else               { /* never entered! Recvs are never called to match the sends */ }
You therefore end up with sends that are not matched by any receives. Since MPI_Send is potentially blocking, the sending proc may stall indefinitely. With procs blocking on sends, it can certainly look as though "...the machine goes to infinite loop somewhere...".
If currentCombinationsCount is given an arbitrary value (instead of 0) then the rank != 0 procs will enter arbitrary branches (with a higher chance of all entering the final else). You then end up with the second set of receives not being called, resulting in the same issue as above.
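Not part of the original answer, but one way to make sure every rank agrees on currentCombinationsCount before branching on it is to broadcast the value from rank 0 first. A stripped-down sketch (the value 3 merely stands in for whatever rank 0 computes; the per-iteration updates of the original loop are left out):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, currentCombinationsCount = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        currentCombinationsCount = 3;   /* stand-in for the value computed on rank 0 */

    /* after this call every rank holds the same value and branches consistently */
    MPI_Bcast(&currentCombinationsCount, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("master would send the work\n");
    else if (rank > currentCombinationsCount)
        printf("rank %d: no work this round\n", rank);
    else
        printf("rank %d: would post the matching receive\n", rank);

    MPI_Finalize();
    return 0;
}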

C MPI multiple dynamic array passing

I'm trying to Isend() two arrays, arr1 and arr2, plus an integer n which is the size of arr1 and arr2. I understood from this post that sending a struct that contains all three is not an option, since n is only known at run time. Obviously, I need n to be received first, since otherwise the receiving process wouldn't know how many elements to receive. What's the most efficient way to achieve this without using the blocking Send()?
Sending the size of the array is redundant (and inefficient) as MPI provides a way to probe for incoming messages without receiving them, which provides just enough information in order to properly allocate memory. Probing is performed with MPI_PROBE, which looks a lot like MPI_RECV, except that it takes no buffer related arguments. The probe operation returns a status object which can then be queried for the number of elements of a given MPI datatype that can be extracted from the content of the message with MPI_GET_COUNT, therefore explicitly sending the number of elements becomes redundant.
Here is a simple example with two ranks:
if (rank == 0)
{
    MPI_Request req;
    // Send a message to rank 1
    MPI_Isend(arr1, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    // Do not forget to complete the request!
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
else if (rank == 1)
{
    MPI_Status status;
    // Wait for a message from rank 0 with tag 0
    MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
    // Find out the number of elements in the message -> size goes to "n"
    MPI_Get_count(&status, MPI_DOUBLE, &n);
    // Allocate memory
    arr1 = malloc(n*sizeof(double));
    // Receive the message, ignore the status
    MPI_Recv(arr1, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
MPI_PROBE also accepts the wildcard rank MPI_ANY_SOURCE and the wildcard tag MPI_ANY_TAG. One can then consult the corresponding entry in the status structure in order to find out the actual sender rank and the actual message tag.
Probing for the message size works because every message carries a header, called the envelope. The envelope consists of the sender's rank, the receiver's rank, the message tag and the communicator. It also carries information about the total message size. Envelopes are sent as part of the initial handshake between the two communicating processes.
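A sketch of the wildcard variant described above (not from the original answer): probe for any message, then read the sender and the tag back from the status before posting the matching receive.

MPI_Status status;
int n;
double *arr1;

MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_DOUBLE, &n);      /* number of doubles in the message */
arr1 = malloc(n*sizeof(double));
/* receive from exactly the sender and with exactly the tag that was probed */
MPI_Recv(arr1, n, MPI_DOUBLE, status.MPI_SOURCE, status.MPI_TAG,
         MPI_COMM_WORLD, MPI_STATUS_IGNORE);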
First, you need to allocate memory (the full n elements) for arr1 and arr2 on rank 0, i.e. your front-end processor.
Divide the array into parts depending on the number of processors, and determine the element count for each processor.
Send this element count to the other processors from rank 0.
The second send is for the arrays themselves, i.e. arr1 and arr2.
On the other processors, allocate arr1 and arr2 according to the element count received from the main processor (rank 0). After receiving the element count, receive the two arrays into the allocated memory.
This is a sample C++ implementation, but C follows the same logic; just interchange Send with Isend.
#include <mpi.h>
#include <iostream>
using namespace std;

int main(int argc, char *argv[])
{
    MPI::Init(argc, argv);
    int rank = MPI::COMM_WORLD.Get_rank();
    int no_of_processors = MPI::COMM_WORLD.Get_size();
    MPI::Status status;

    double *arr1;

    if (rank == 0)
    {
        // Setting some Random n
        int n = 10;
        arr1 = new double[n];
        for (int i = 0; i < n; i++)
        {
            arr1[i] = i;
        }

        int part = n / no_of_processors;
        int offset = n % no_of_processors;
        // cout << part << "\t" << offset << endl;

        for (int i = 1; i < no_of_processors; i++)
        {
            int start = i * part;
            int end = start + part - 1;
            if (i == (no_of_processors - 1))
            {
                end += offset;
            }
            // cout << i << " Start: " << start << " END: " << end;

            // Element_Count
            int e_count = end - start + 1;
            // cout << " e_count: " << e_count << endl;

            // Sending
            MPI::COMM_WORLD.Send(&e_count, 1, MPI::INT, i, 0);

            // Sending Arr1
            MPI::COMM_WORLD.Send((arr1 + start), e_count, MPI::DOUBLE, i, 1);
        }
    }
    else
    {
        // Element Count
        int e_count;

        // Receiving elements count
        MPI::COMM_WORLD.Recv(&e_count, 1, MPI::INT, 0, 0, status);

        arr1 = new double[e_count];

        // Receiving First Array
        MPI::COMM_WORLD.Recv(arr1, e_count, MPI::DOUBLE, 0, 1, status);

        for (int i = 0; i < e_count; i++)
        {
            cout << arr1[i] << endl;
        }
    }

    // if(rank == 0)
    delete[] arr1;

    MPI::Finalize();
    return 0;
}
@Histro The point I want to make is that Irecv/Isend are functions managed internally by the MPI library. The question you asked depends completely on the rest of your code, i.e. on what you do after the Send/Recv. There are 2 cases:
Master and Worker
You send part of the problem (say arrays) to the workers (all other ranks except 0 = master). The workers do some work on the arrays and then return the results to the master. The master then adds up the results and conveys new work to the workers. Here you would want the master to wait for all the workers to return their results (the modified arrays), so you cannot use Isend and Irecv, but rather multiple sends as used in my code and the corresponding receives. If your code goes in this direction you want to use MPI_Bcast and MPI_Reduce.
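Not from the original answer, but a minimal sketch of that collective pattern under made-up names (N and local_work are invented for illustration): every rank gets the same input via MPI_Bcast, computes a partial result on its own share of the indices, and MPI_Reduce sums the partial results on the master.

#include <mpi.h>

#define N 1000   /* problem size, assumed */

/* each rank sums a round-robin share of the data (illustrative work) */
static double local_work(const double *data, int n, int rank, int size)
{
    double s = 0.0;
    for (int i = rank; i < n; i += size)
        s += data[i];
    return s;
}

int main(int argc, char **argv)
{
    int rank, size;
    double data[N], partial, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        for (int i = 0; i < N; i++) data[i] = i;   /* master prepares the data */

    MPI_Bcast(data, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);   /* everyone gets the data */
    partial = local_work(data, N, rank, size);            /* everyone works on it  */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}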
Lazy Master
The master divides the work but doesn't care about the results from its workers. Say you want to compute patterns of different kinds over the same data: given characteristics of the population of some city, you want to calculate things like how many people are above 18, how many have jobs, and how many of them work in some company. These results don't have anything to do with one another. In this case you don't have to worry about whether the data has been received by the workers or not; the master can continue to execute the rest of the code. This is where it is safe to use Isend/Irecv.
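Again not from the original answer, just a sketch of the "lazy master" pattern with invented names (hand_out_work, chunks, chunk_len): the master fires off the work with MPI_Isend, is free to keep computing, and only completes the requests when the send buffer is about to be reused or freed.

#include <mpi.h>
#include <stdlib.h>

/* send one chunk of doubles to each worker without waiting for delivery */
void hand_out_work(double *chunks, int chunk_len, int size)
{
    int nworkers = size - 1;
    MPI_Request *reqs = malloc(nworkers * sizeof(*reqs));

    for (int w = 1; w < size; w++)
        MPI_Isend(&chunks[(w - 1) * chunk_len], chunk_len, MPI_DOUBLE,
                  w, 0, MPI_COMM_WORLD, &reqs[w - 1]);

    /* ... the master can do unrelated work here ... */

    /* the buffer may only be reused or freed once the requests complete */
    MPI_Waitall(nworkers, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}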
