Why is MPI_Bsend not returning error even when the buffer is insufficient to accommodate all the messages - mpi

I am trying to implement a logical ring with MPI; where every process receives a message from process with id one less than that of current process and forwards to the next process in cyclic fashion. My aim is to traffic buffer such that it loses some messages or maybe puts them in out of order.
A cycle of communication will finish when a message dispatched by root node comes back again to root.
here is the code that I have tried:
I am just including relevant parts of it.
if(procId!=root)
{
sleep(100);
while(1)
{
tm = MPI_Wtime();
MPI_Irecv( &message, STR_LEN, MPI_CHAR,
((procId-1)>=0?(procId-1):(numProc-1)),RETURN_DATA_TAG,
MPI_COMM_WORLD,&receiveRequest);
MPI_Wait(&receiveRequest,&status);
printf("%d: Received\n",procId);
if(!strncmp(message,"STOP",4)&&(procId==(numProc-1)))
break;
MPI_Ssend( message, STR_LEN, MPI_CHAR,
(procId+1)%numProc, SEND_DATA_TAG, MPI_COMM_WORLD);
if(!strncmp(message,"STOP",4))
break;
printf("%d: Sent\n",procId);
}
}
else
{
for(iter=0;iter<benchmarkSize;iter++)
{
//Synthesize the message
message[STR_LEN-1] = '\0';
iErr = MPI_Bsend( message, STR_LEN, MPI_CHAR,
(root+1)%numProc, SEND_DATA_TAG, MPI_COMM_WORLD);
if (iErr != MPI_SUCCESS) {
char error_string[BUFSIZ];
int length_of_error_string;
MPI_Error_string(iErr, error_string, &length_of_error_string);
fprintf(stderr, "%3d: %s\n", procId, error_string);
}
tm = MPI_Wtime();
while(((MPI_Wtime()-tm)*1000)<delay);
printf("Root: Sending\n");
}
for(iter=0;iter<benchmarkSize;iter++)
{
MPI_Recv(message,STR_LEN,MPI_CHAR,
(numProc-1),RETURN_DATA_TAG,MPI_COMM_WORLD,&status);
//We should not wait for the messages to be received but wait for certain amount of time
//Extract the fields in the message
if(((prevRcvdSeqNum+1)!=atoi(seqNum))&&(prevRcvdSeqNum!=0))
outOfOrderMsgs++;
prevRcvdSeqNum = atoi(seqNum);
printf("Seq Num: %d\n",atoi(seqNum));
rcvdMsgs++;
printf("Root: Receiving\n");
}
MPI_Isend( "STOP", 4, MPI_CHAR,
(root+1)%numProc, SEND_DATA_TAG, MPI_COMM_WORLD,&sendRequest);
MPI_Wait(&sendRequest,&status);
/*This is to ask all other processes to terminate, when the work is done*/
}
Now, I have these questions:
1) Why is it that when I inject some sleep in the other processes(I mean other than root) of
the ring; NO receive is taking place?
2) Even when the buffer size is only one, how is it that root node is able to dispatch messages through MPI_Bsend without an error? for example the case when it needs to send total 10 messages at a rate of 1000 per second and with buffer size of 1. MPI_Bsend is able to dispatch all the messages without any error of "buffer full"; irrespective of the presence of sleep() in other processes of the ring!
Thanks a ton!

Related

Developing an MCA component for OpenMPI: how to progress MPI_Recv using custom BTL for NIC under development?

I'm trying to develop a btl for a custom NIC. I studied the btl.h file to understand the flow of calls that are expected to be implemented in my component. I'm using a simple test (which works like a charm with the TCP btl) to test my development, the code is a simple MPI_Send + MPI_Recv:
int ping_pong_count = 1;
int partner_rank = (world_rank + 1) % 2;
printf("MY RANK: %d PARTNER: %d\n",world_rank,partner_rank);
if (world_rank == 0) {
// Increment the ping pong count before you send it
ping_pong_count++;
MPI_Send(&ping_pong_count, 1, MPI_INT, partner_rank, 0, MPI_COMM_WORLD);
printf("%d sent and incremented ping_pong_count %d to %d\n",
world_rank, ping_pong_count, partner_rank);
} else {
MPI_Recv(&ping_pong_count, 1, MPI_INT, partner_rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("%d received ping_pong_count %d from %d\n",
world_rank, ping_pong_count, partner_rank);
}
MPI_Finalize();
}
I see that in my component's btl code the functions called during the "MPI_send" phase are:
mca_btl_mycomp_add_procs
mca_btl_mycomp_prepare_src
mca_btl_mycomp_send (where I set the return to 1, so the send phase
should be finished)
I see then the print inside the test:
0 sent and incremented ping_pong_count 2 to 1
and this should conclude the MPI_Send phase.
Then I implemented in the btl_mycomp_component_progress function a call to:
mca_btl_active_message_callback_t *reg = mca_btl_base_active_message_trigger + tag;
reg->cbfunc(&my_btl->super, &desc);
I saw the same code in all the other BTLs and I thought this was enough to unlock the MPI_Recv "polling". But actually I see my test hangs, probably "waiting" for something that never happens (?).
I also took a look in the ob1 mca_pml_ob1_recv_frag_callback_match function, and it seems to get to the end of the function, actually matching my frag.
So my question is: how can I say to the framework that I finished my work and so the function can return to the user application? What am I doing wrong?
Is there a way to understand where and what my code is waiting for?

Wrong synchronization of RMA calls in MPI

I am trying to MPIs RMA scheme with Fences. In some cases it works fine, but for systems with multiple nodes I get the following error:
Error message: MPI failed with Error_code = 71950898
Wrong synchronization of RMA calls , error stack:
MPI_Rget(176): MPI_Rget(origin_addr=0x2ac7b10, origin_count=1, MPI_INTEGER, target_rank=0, target_disp=0, target_count=1, MPI_INTEGER, win=0xa0000000, request=0x7ffdc1efe634) failed
(unknown)(): Wrong synchronization of RMA calls
Error from PE:0/4
This is a schematic of how I setup the code:
call MPI_init(..)
CALL MPI_WIN_CREATE(..)
do i =1,10
MPI_Win_fence(0, handle, err)
calc_values()
MPI_Put(values)
MPI_Put(values)
MPI_Put(values)
MPI_Win_fence(0, handle, err)
MPI_Rget(values, req)
MPI_WAIT(req)
do_something(values)
MPI_Rget(values, req)
MPI_WAIT(req)
do_something(values)
enddo
call MPI_finalize()
I know that MPI_Put is non-blocking. Is it guaranteed, that the MPI_Put is finished after MPI_Win_fence(0, handle, err) or do I have to use MPI_RPUT?
What does this error even mean: Wrong synchronization of RMA calls ?
How do I fix my communication scheme?
Make sure you add the following call as necessary to ensure synchronization (you need to make sure your window(s) are created before putting data in them):
MPI_Win_fence(0, window);
Please look at the example below (source) and note that they are making two fence calls.
// Create the window
int window_buffer = 0;
MPI_Win window;
MPI_Win_create(&window_buffer, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &window);
if(my_rank == 1)
{
printf("[MPI process 1] Value in my window_buffer before MPI_Put: %d.\n", window_buffer);
}
MPI_Win_fence(0, window);
if(my_rank == 0)
{
// Push my value into the first integer in MPI process 1 window
int my_value = 12345;
MPI_Put(&my_value, 1, MPI_INT, 1, 0, 1, MPI_INT, window);
printf("[MPI process 0] I put data %d in MPI process 1 window via MPI_Put.\n", my_value);
}
// Wait for the MPI_Put issued to complete before going any further
MPI_Win_fence(0, window);
if(my_rank == 1)
{
printf("[MPI process 1] Value in my window_buffer after MPI_Put: %d.\n", window_buffer);
}
// Destroy the window
MPI_Win_free(&window);

What is the right way to "notify" processors without blocking?

Suppose I have a very large array of things and I have to do some operation on all these things.
In case operation fails for one element, I want to stop the work [this work is distributed across number of processors] across all the array.
I want to achieve this while keeping the number of sent/received messages to a minimum.
Also, I don't want to block processors if there is no need to.
How can I do it using MPI?
This seems to be a common question with no easy answer. Both other answer have scalability issues. The ring-communication approach has linear communication cost, while in the one-sided MPI_Win-solution, a single process will be hammered with memory requests from all workers. This may be fine for low number of ranks, but pose issues when increasing the rank count.
Non-blocking collectives can provide a more scalable better solution. The main idea is to post a MPI_Ibarrier on all ranks except on one designated root. This root will listen to point-to-point stop messages via MPI_Irecv and complete the MPI_Ibarrier once it receives it.
The tricky part is that there are four different cases "{root, non-root} x {found, not-found}" that need to be handled. Also it can happen that multiple ranks send a stop message, requiring an unknown number of matching receives on the root. That can be solved with an additional reduction that counts the number of ranks that sent a stop-request.
Here is an example how this can look in C:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
const int iter_max = 10000;
const int difficulty = 20000;
int find_stuff()
{
int num_iters = rand() % iter_max;
for (int i = 0; i < num_iters; i++) {
if (rand() % difficulty == 0) {
return 1;
}
}
return 0;
}
const int stop_tag = 42;
const int root = 0;
int forward_stop(MPI_Request* root_recv_stop, MPI_Request* all_recv_stop, int found_count)
{
int flag;
MPI_Status status;
if (found_count == 0) {
MPI_Test(root_recv_stop, &flag, &status);
} else {
// If we find something on the root, we actually wait until we receive our own message.
MPI_Wait(root_recv_stop, &status);
flag = 1;
}
if (flag) {
printf("Forwarding stop signal from %d\n", status.MPI_SOURCE);
MPI_Ibarrier(MPI_COMM_WORLD, all_recv_stop);
MPI_Wait(all_recv_stop, MPI_STATUS_IGNORE);
// We must post some additional receives if multiple ranks found something at the same time
MPI_Reduce(MPI_IN_PLACE, &found_count, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
for (found_count--; found_count > 0; found_count--) {
MPI_Recv(NULL, 0, MPI_CHAR, MPI_ANY_SOURCE, stop_tag, MPI_COMM_WORLD, &status);
printf("Additional stop from: %d\n", status.MPI_SOURCE);
}
return 1;
}
return 0;
}
int main()
{
MPI_Init(NULL, NULL);
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
srand(rank);
MPI_Request root_recv_stop;
MPI_Request all_recv_stop;
if (rank == root) {
MPI_Irecv(NULL, 0, MPI_CHAR, MPI_ANY_SOURCE, stop_tag, MPI_COMM_WORLD, &root_recv_stop);
} else {
// You may want to use an extra communicator here, to avoid messing with other barriers
MPI_Ibarrier(MPI_COMM_WORLD, &all_recv_stop);
}
while (1) {
int found = find_stuff();
if (found) {
printf("Rank %d found something.\n", rank);
// Note: We cannot post this as blocking, otherwise there is a deadlock with the reduce
MPI_Request req;
MPI_Isend(NULL, 0, MPI_CHAR, root, stop_tag, MPI_COMM_WORLD, &req);
if (rank != root) {
// We know that we are going to receive our own stop signal.
// This avoids running another useless iteration
MPI_Wait(&all_recv_stop, MPI_STATUS_IGNORE);
MPI_Reduce(&found, NULL, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
MPI_Wait(&req, MPI_STATUS_IGNORE);
break;
}
MPI_Wait(&req, MPI_STATUS_IGNORE);
}
if (rank == root) {
if (forward_stop(&root_recv_stop, &all_recv_stop, found)) {
break;
}
} else {
int stop_signal;
MPI_Test(&all_recv_stop, &stop_signal, MPI_STATUS_IGNORE);
if (stop_signal)
{
MPI_Reduce(&found, NULL, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
printf("Rank %d stopping after receiving signal.\n", rank);
break;
}
}
};
MPI_Finalize();
}
While this is not the simplest code, it should:
Introduce no additional blocking
Scale with the implementation of a barrier (usually O(log N))
Have a worst-case-latency from one found, to all stop of 2 * loop time ( + 1 p2p + 1 barrier + 1 reduction).
If many/all ranks find a solution at the same time, it still works but may be less efficient.
A possible strategy to derive this global stop condition in a non-blocking fashion is to rely on MPI_Test.
scenario
Consider that each process posts an asynchronous receive of type MPI_INT to its left rank with a given tag to build a ring. Then start your computation. If a rank encounters the stop condition it sends its own rank to its right rank. In the meantime each rank uses MPI_Test to check for the completion of the MPI_Irecv during the computation if it is completed then enter a branch first waiting the message and then transitively propagating the received rank to the right except if the right rank is equal to the payload of the message (not to loop).
This done you should have all processes in the branch, ready to trigger an arbitrary recovery operation.
Complexity
The topology retained is a ring as it minimizes the number of messages at most (n-1) however it augments the propagation time. Other topologies could be retained with more messages but lower spatial complexity for example a tree with a n.ln(n) complexity.
Implementation
Something like this.
int rank, size;
MPI_Init(&argc,&argv);
MPI_Comm_rank( MPI_COMM_WORLD, &rank);
MPI_Comm_size( MPI_COMM_WORLD, &size);
int left_rank = (rank==0)?(size-1):(rank-1);
int right_rank = (rank==(size-1))?0:(rank+1)%size;
int stop_cond_rank;
MPI_Request stop_cond_request;
int stop_cond= 0;
while( 1 )
{
MPI_Irecv( &stop_cond_rank, 1, MPI_INT, left_rank, 123, MPI_COMM_WORLD, &stop_cond_request);
/* Compute Here and set stop condition accordingly */
if( stop_cond )
{
/* Cancel the left recv */
MPI_Cancel( &stop_cond_request );
if( rank != right_rank )
MPI_Send( &rank, 1, MPI_INT, right_rank, 123, MPI_COMM_WORLD );
break;
}
int did_recv = 0;
MPI_Test( &stop_cond_request, &did_recv, MPI_STATUS_IGNORE );
if( did_recv )
{
stop_cond = 1;
MPI_Wait( &stop_cond_request, MPI_STATUS_IGNORE );
if( right_rank != stop_cond_rank )
MPI_Send( &stop_cond_rank, 1, MPI_INT, right_rank, 123, MPI_COMM_WORLD );
break;
}
}
if( stop_cond )
{
/* Handle the stop condition */
}
else
{
/* Cleanup */
MPI_Cancel( &stop_cond_request );
}
That is a question I've asked myself a few times without finding any completely satisfactory answer... The only thing I thought of (beside MPI_Abort() that does it but which is a bit extreme) is to create an MPI_Win storing a flag that will be raise by whichever process facing the problem, and checked by all processes regularly to see if they can go on processing. This is done using non-blocking calls, the same way as described in this answer.
The main weaknesses of this are:
This depends on the processes to willingly check the status of the flag. There is no immediate interruption of their work to notifying them.
The frequency of this checking must be adjusted by hand. You have to find the trade-off between the time you waste processing data while there's no need to because something happened elsewhere, and the time it takes to check the status...
In the end, what we would need is a way of defining a callback action triggered by an MPI call such as MPI_Abort() (basically replacing the abort action by something else). I don't think this exists, but maybe I overlooked it.

MPI call and receive in a nested loop

I have a nested loop and from inside the loop I call the MPI send which I want it to
send to the receiver a specific value then at the receiver takes the data and again sends MPI messages
to another set of CPUs ... I used something like this but it looks like there is a problem in the receive ... and I cant see where I went wrong ..."the machine goes to infinite loop somewhere ...
I am trying to make it work like this :
master CPU >> send to other CPUs >> send to slave CPUs
.
.
.
int currentCombinationsCount;
int mp;
if (rank == 0)
{
for (int pr = 0; pr < combinationsSegmentSize; pr++)
{
int CblockBegin = CombinationsSegementsBegin[pr];
int CblockEnd = CombinationsSegementsEnd [pr];
currentCombinationsCount = numOfCombinationsEachLoop[pr];
prossessNum = 1; //specify which processor we are sending to
// now substitute and send to the main Processors
for (mp = CblockBegin; mp <= CblockEnd; mp++)
{
MPI_Send(&mp , 1, MPI_INT , prossessNum, TAG, MPI_COMM_WORLD);
prossessNum ++;
}
}//this loop goes through all the specified blocks for the combinations
} // end of rank 0
else if (rank > currentCombinationsCount)
{
// here I want to put other receives that will take values from the else below
}
else
{
MPI_Recv(&mp , 1, MPI_INT , 0, TAG, MPI_COMM_WORLD, &stat);
// the code stuck here in infinite loop
}
You've only initialised currentCombinationsCount within the if(rank==0) branch so all other procs will see an uninitialised variable. That will result in undefined behaviour and the outcome depends on your compiler. Your program may crash or the value may be set to 0 or an undetermined value.
If you're lucky, the value may be set to 0 in which case your branch reduces to:
if (rank == 0) { /* rank == 0 will enter this }
else if (rank > 0) { /* all other procs enter this }
else { /* never entered! Recvs are never called to match the sends */ }
You therefore end up with sends that are not matched by any receives. Since MPI_Send is potentially blocking, the sending proc may stall indefinitely. With procs blocking on sends, it can certainly look as thought "...the machine goes to infinite loop somewhere...".
If currentCombinationsCount is given an arbitrary value (instead of 0) then rank!=0 procs will enter arbitrary branchss (with a higher chance of all entering the final else). You then end up with second set of receives not being called resulting in the same issue as above.

C MPI multiple dynamic array passing

I'm trying to ISend() two arrays: arr1,arr2 and an integer n which is the size of arr1,arr2. I understood from this post that sending a struct that contains all three is not an option since n is only known at run time. Obviously, I need n to be received first since otherwise the receiving process wouldn't know how many elements to receive. What's the most efficient way to achieve this without using the blokcing Send() ?
Sending the size of the array is redundant (and inefficient) as MPI provides a way to probe for incoming messages without receiving them, which provides just enough information in order to properly allocate memory. Probing is performed with MPI_PROBE, which looks a lot like MPI_RECV, except that it takes no buffer related arguments. The probe operation returns a status object which can then be queried for the number of elements of a given MPI datatype that can be extracted from the content of the message with MPI_GET_COUNT, therefore explicitly sending the number of elements becomes redundant.
Here is a simple example with two ranks:
if (rank == 0)
{
MPI_Request req;
// Send a message to rank 1
MPI_Isend(arr1, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
// Do not forget to complete the request!
MPI_Wait(&req, MPI_STATUS_IGNORE);
}
else if (rank == 1)
{
MPI_Status status;
// Wait for a message from rank 0 with tag 0
MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
// Find out the number of elements in the message -> size goes to "n"
MPI_Get_count(&status, MPI_DOUBLE, &n);
// Allocate memory
arr1 = malloc(n*sizeof(double));
// Receive the message. ignore the status
MPI_Recv(arr1, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
MPI_PROBE also accepts the wildcard rank MPI_ANY_SOURCE and the wildcard tag MPI_ANY_TAG. One can then consult the corresponding entry in the status structure in order to find out the actual sender rank and the actual message tag.
Probing for the message size works as every message carries a header, called envelope. The envelope consists of the sender's rank, the receiver's rank, the message tag and the communicator. It also carries information about the total message size. Envelopes are sent as part of the initial handshake between the two communicating processes.
Firstly you need to allocate memory (full memory = n = elements) to arr1 and arr2 with rank 0. i.e. your front end processor.
Divide the array into parts depending on the no. of processors. Determine the element count for each processor.
Send this element count to the other processors from rank 0.
The second send is for the array i.e. arr1 and arr2
In other processors allocate arr1 and arr2 according to the element count received from main processor i.e. rank = 0. After receiving element count, receive the two arrays in the allocated memories.
This is a sample C++ Implementation but C will follow the same logic. Also just interchange Send with Isend.
#include <mpi.h>
#include <iostream>
using namespace std;
int main(int argc, char*argv[])
{
MPI::Init (argc, argv);
int rank = MPI::COMM_WORLD.Get_rank();
int no_of_processors = MPI::COMM_WORLD.Get_size();
MPI::Status status;
double *arr1;
if (rank == 0)
{
// Setting some Random n
int n = 10;
arr1 = new double[n];
for(int i = 0; i < n; i++)
{
arr1[i] = i;
}
int part = n / no_of_processors;
int offset = n % no_of_processors;
// cout << part << "\t" << offset << endl;
for(int i = 1; i < no_of_processors; i++)
{
int start = i*part;
int end = start + part - 1;
if (i == (no_of_processors-1))
{
end += offset;
}
// cout << i << " Start: " << start << " END: " << end;
// Element_Count
int e_count = end - start + 1;
// cout << " e_count: " << e_count << endl;
// Sending
MPI::COMM_WORLD.Send(
&e_count,
1,
MPI::INT,
i,
0
);
// Sending Arr1
MPI::COMM_WORLD.Send(
(arr1+start),
e_count,
MPI::DOUBLE,
i,
1
);
}
}
else
{
// Element Count
int e_count;
// Receiving elements count
MPI::COMM_WORLD.Recv (
&e_count,
1,
MPI::INT,
0,
0,
status
);
arr1 = new double [e_count];
// Receiving FIrst Array
MPI::COMM_WORLD.Recv (
arr1,
e_count,
MPI::DOUBLE,
0,
1,
status
);
for(int i = 0; i < e_count; i++)
{
cout << arr1[i] << endl;
}
}
// if(rank == 0)
delete [] arr1;
MPI::Finalize();
return 0;
}
#Histro The point I want to make is, that Irecv/Isend are some functions themselves manipulated by MPI lib. The question u asked depend completely on your rest of the code about what you do after the Send/Recv. There are 2 cases:
Master and Worker
You send part of the problem (say arrays) to the workers (all other ranks except 0=Master). The worker does some work (on the arrays) then returns back the results to the master. The master then adds up the result, and convey new work to the workers. Now, here you would want the master to wait for all the workers to return their result (modified arrays). So you cannot use Isend and Irecv but a multiple send as used in my code and corresponding recv. If your code is in this direction you wanna use B_cast and MPI_Reduce.
Lazy Master
The master divides the work but doesn't care of the result from his workers. Say you want to program a pattern of different kinds for same data. Like given characteristics of population of some city, you want to calculate the patterns like how many are above 18, how
many have jobs, how much of them work in some company. Now these results don't have anything to do with one another. In this case you don't have to worry about whether the data is received by the workers or not. The master can continue to execute the rest of the code. This is where it is safe to use Isend/Irecv.

Resources