Developing an MCA component for OpenMPI: how to progress MPI_Recv using custom BTL for NIC under development? - mpi

I'm trying to develop a BTL for a custom NIC. I studied the btl.h file to understand the flow of calls that are expected to be implemented in my component. I'm using a simple test (which works like a charm with the TCP BTL) to exercise my development; the code is a simple MPI_Send + MPI_Recv:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    int ping_pong_count = 1;
    int partner_rank = (world_rank + 1) % 2;
    printf("MY RANK: %d PARTNER: %d\n", world_rank, partner_rank);
    if (world_rank == 0) {
        // Increment the ping pong count before you send it
        ping_pong_count++;
        MPI_Send(&ping_pong_count, 1, MPI_INT, partner_rank, 0, MPI_COMM_WORLD);
        printf("%d sent and incremented ping_pong_count %d to %d\n",
               world_rank, ping_pong_count, partner_rank);
    } else {
        MPI_Recv(&ping_pong_count, 1, MPI_INT, partner_rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%d received ping_pong_count %d from %d\n",
               world_rank, ping_pong_count, partner_rank);
    }
    MPI_Finalize();
}
I can see that, in my component's BTL code, the functions called during the MPI_Send phase are:
mca_btl_mycomp_add_procs
mca_btl_mycomp_prepare_src
mca_btl_mycomp_send (where I set the return value to 1, so the send phase should be finished)
I then see the print from inside the test:
0 sent and incremented ping_pong_count 2 to 1
and this should conclude the MPI_Send phase.
Then, in the btl_mycomp_component_progress function, I implemented a call to:
mca_btl_active_message_callback_t *reg = mca_btl_base_active_message_trigger + tag;
reg->cbfunc(&my_btl->super, &desc);
I saw the same code in all the other BTLs, and I thought this was enough to unblock the MPI_Recv "polling". But in practice my test hangs, probably waiting for something that never happens.
I also took a look at the ob1 mca_pml_ob1_recv_frag_callback_match function, and it seems to reach the end of the function, actually matching my frag.
So my question is: how can I tell the framework that I have finished my work, so that the function can return to the user application? What am I doing wrong?
Is there a way to understand where, and on what, my code is waiting?

Related

Wrong synchronization of RMA calls in MPI

I am trying to use MPI's RMA scheme with fences. In some cases it works fine, but on systems with multiple nodes I get the following error:
Error message: MPI failed with Error_code = 71950898
Wrong synchronization of RMA calls , error stack:
MPI_Rget(176): MPI_Rget(origin_addr=0x2ac7b10, origin_count=1, MPI_INTEGER, target_rank=0, target_disp=0, target_count=1, MPI_INTEGER, win=0xa0000000, request=0x7ffdc1efe634) failed
(unknown)(): Wrong synchronization of RMA calls
Error from PE:0/4
This is a schematic of how I set up the code:
call MPI_init(..)
CALL MPI_WIN_CREATE(..)
do i = 1, 10
   MPI_Win_fence(0, handle, err)
   calc_values()
   MPI_Put(values)
   MPI_Put(values)
   MPI_Put(values)
   MPI_Win_fence(0, handle, err)
   MPI_Rget(values, req)
   MPI_WAIT(req)
   do_something(values)
   MPI_Rget(values, req)
   MPI_WAIT(req)
   do_something(values)
enddo
call MPI_finalize()
I know that MPI_Put is non-blocking. Is it guaranteed that the MPI_Put is finished after MPI_Win_fence(0, handle, err), or do I have to use MPI_Rput?
What does this error even mean: "Wrong synchronization of RMA calls"?
How do I fix my communication scheme?
Make sure you add the following call as necessary to ensure synchronization (you need to make sure your window(s) are created before putting data in them):
MPI_Win_fence(0, window);
Please look at the example below (source) and note that they are making two fence calls.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    // Create the window
    int window_buffer = 0;
    MPI_Win window;
    MPI_Win_create(&window_buffer, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &window);
    if (my_rank == 1)
    {
        printf("[MPI process 1] Value in my window_buffer before MPI_Put: %d.\n", window_buffer);
    }
    MPI_Win_fence(0, window);
    if (my_rank == 0)
    {
        // Push my value into the first integer in MPI process 1 window
        int my_value = 12345;
        MPI_Put(&my_value, 1, MPI_INT, 1, 0, 1, MPI_INT, window);
        printf("[MPI process 0] I put data %d in MPI process 1 window via MPI_Put.\n", my_value);
    }
    // Wait for the MPI_Put issued to complete before going any further
    MPI_Win_fence(0, window);
    if (my_rank == 1)
    {
        printf("[MPI process 1] Value in my window_buffer after MPI_Put: %d.\n", window_buffer);
    }
    // Destroy the window
    MPI_Win_free(&window);

    MPI_Finalize();
    return 0;
}
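One hedged reading of the "Wrong synchronization of RMA calls" error is that request-based operations such as MPI_Rget are meant for passive-target epochs (MPI_Win_lock / MPI_Win_lock_all) rather than for fence epochs. The sketch below therefore applies the same two-fence pattern to the loop from the question, but it uses plain MPI_Put / MPI_Get instead of MPI_Rget + MPI_Wait, so the fences alone open and complete every RMA call; the buffer names and the neighbour scheme are placeholders, not taken from the question.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size;       /* each rank targets its right neighbour */

    int win_buf = 0;                     /* one int exposed by every rank */
    MPI_Win win;
    MPI_Win_create(&win_buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    for (int i = 0; i < 10; i++) {
        int value = rank + i;            /* stand-in for calc_values() */
        int fetched = 0;

        MPI_Win_fence(0, win);           /* open the epoch for the puts */
        MPI_Put(&value, 1, MPI_INT, right, 0, 1, MPI_INT, win);
        MPI_Win_fence(0, win);           /* all puts are complete after this fence */

        MPI_Get(&fetched, 1, MPI_INT, right, 0, 1, MPI_INT, win);
        MPI_Win_fence(0, win);           /* gets are complete; fetched is now safe to use */

        printf("rank %d, iteration %d: fetched %d\n", rank, i, fetched);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
If the request-based MPI_Rget really is needed, a passive-target epoch (MPI_Win_lock_all plus MPI_Wait on the request) would be the thing to try instead, but that is outside what the fence-based example above shows.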

How to check completion of non-blocking reduce?

It is not clear to me how to properly use non-blocking collectives in MPI, particularly MPI_Ireduce() in this case:
Say I want to collect a sum at the root rank:
int local_cnt;
int total_cnt;
MPI_Request request;
MPI_Ireduce(&local_cnt, &total_cnt, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &request);
/* now I want to check if the reduce is finished */
if (rank == 0) {
    int flag = 0;
    MPI_Status status;
    MPI_Test(&request, &flag, &status);
    if (flag) {
        /* reduce is finished? */
    }
}
Is this the correct way to check whether a non-blocking reduce is done? My confusion comes from two aspects: first, can or should just the root process check for it using MPI_Test(), since the result is meaningful only to the root? Second, since MPI_Test() is a local operation, how can it know that the operation is complete? It does require all processes to finish, right?
You must check for completion on all participating ranks, not just the root.
From a user perspective, you need to know about the completion of the communication because you must not do anything with the memory provided to a non-blocking operation until it completes. I.e. if you send a local-scope variable like local_cnt, you cannot write to it or leave its scope before you have confirmed that the operation is complete.
One option to ensure completion is calling MPI_Test until it eventually returns flag==true. Use this only if you can do something useful between the calls to MPI_Test:
{
    int local_cnt;
    int total_cnt;
    // fill local_cnt on all ranks
    MPI_Request request;
    MPI_Ireduce(&local_cnt, &total_cnt, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &request);
    int flag;
    do {
        // perform some useful computation
        MPI_Status status;
        MPI_Test(&request, &flag, &status);
    } while (!flag);
}
Do not call MPI_Test in a loop if you have nothing useful to do in between calls. Instead use MPI_Wait, which blocks until completion.
{
    int local_cnt;
    int total_cnt;
    // fill local_cnt on all ranks
    MPI_Request request;
    MPI_Ireduce(&local_cnt, &total_cnt, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &request);
    // perform some useful computation
    MPI_Status status;
    MPI_Wait(&request, &status);
}
Remember, if you have no useful computation at all and don't need non-blocking communication for deadlock reasons, use blocking communication in the first place. If you have multiple ongoing non-blocking communications, there are MPI_Waitany, MPI_Waitsome, MPI_Waitall and their Test variants, as sketched below.
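For instance, here is a minimal sketch (the function and variable names are illustrative, not from the question) of overlapping two non-blocking collectives and then waiting for both with a single MPI_Waitall:
#include <mpi.h>

/* Overlap two non-blocking collectives with computation, then complete both. */
static void reduce_and_max(int local_cnt, int *total_cnt, int *global_max, MPI_Comm comm)
{
    int local_max = local_cnt;
    MPI_Request reqs[2];
    MPI_Ireduce(&local_cnt, total_cnt, 1, MPI_INT, MPI_SUM, 0, comm, &reqs[0]);
    MPI_Iallreduce(&local_max, global_max, 1, MPI_INT, MPI_MAX, comm, &reqs[1]);

    /* ... useful computation overlapping both collectives ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* blocks until both have completed */
}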
Zulan brilliantly answered the first part of your question.
MPI_Reduce() returns when:
- the send buffer can be overwritten on a non-root rank;
- the result is available on the root rank (which implies all the ranks have completed).
So there is no way for a non-root rank to know whether the root rank has completed. If you do need this information, then you need to manually add an MPI_Barrier(). That being said, you generally do not require this information, and if you believe you do need it, there might be something wrong with your app.
This remains true if you use non-blocking collectives: when the MPI_Wait() corresponding to an MPI_Ireduce() completes on a non-root rank, that simply means the send buffer can be overwritten. A minimal sketch of the barrier variant follows.
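Here is that sketch (names are illustrative): if a non-root rank really does need to know that the root holds the result, a barrier after the reduction provides that guarantee.
#include <mpi.h>

/* The barrier is the only point at which a non-root rank knows the root has the result. */
static void reduce_then_sync(int local_cnt, int *total_cnt, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Ireduce(&local_cnt, total_cnt, 1, MPI_INT, MPI_SUM, 0, comm, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* send buffer reusable; result is at the root only */
    MPI_Barrier(comm);                  /* after this, every rank knows the root has completed */
}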

What is the right way to "notify" processors without blocking?

Suppose I have a very large array of things and I have to do some operation on all these things.
In case the operation fails for one element, I want to stop the work (which is distributed across a number of processors) on the whole array.
I want to achieve this while keeping the number of sent/received messages to a minimum.
Also, I don't want to block processors if there is no need to.
How can I do it using MPI?
This seems to be a common question with no easy answer. Both other answers have scalability issues. The ring-communication approach has linear communication cost, while in the one-sided MPI_Win solution a single process will be hammered with memory requests from all workers. This may be fine for a low number of ranks, but poses issues as the rank count increases.
Non-blocking collectives can provide a more scalable solution. The main idea is to post an MPI_Ibarrier on all ranks except one designated root. This root listens for point-to-point stop messages via MPI_Irecv and completes the MPI_Ibarrier once it receives one.
The tricky part is that there are four different cases "{root, non-root} x {found, not-found}" that need to be handled. Also it can happen that multiple ranks send a stop message, requiring an unknown number of matching receives on the root. That can be solved with an additional reduction that counts the number of ranks that sent a stop-request.
Here is an example of how this can look in C:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

const int iter_max = 10000;
const int difficulty = 20000;

int find_stuff()
{
    int num_iters = rand() % iter_max;
    for (int i = 0; i < num_iters; i++) {
        if (rand() % difficulty == 0) {
            return 1;
        }
    }
    return 0;
}

const int stop_tag = 42;
const int root = 0;

int forward_stop(MPI_Request* root_recv_stop, MPI_Request* all_recv_stop, int found_count)
{
    int flag;
    MPI_Status status;
    if (found_count == 0) {
        MPI_Test(root_recv_stop, &flag, &status);
    } else {
        // If we find something on the root, we actually wait until we receive our own message.
        MPI_Wait(root_recv_stop, &status);
        flag = 1;
    }
    if (flag) {
        printf("Forwarding stop signal from %d\n", status.MPI_SOURCE);
        MPI_Ibarrier(MPI_COMM_WORLD, all_recv_stop);
        MPI_Wait(all_recv_stop, MPI_STATUS_IGNORE);
        // We must post some additional receives if multiple ranks found something at the same time
        MPI_Reduce(MPI_IN_PLACE, &found_count, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
        for (found_count--; found_count > 0; found_count--) {
            MPI_Recv(NULL, 0, MPI_CHAR, MPI_ANY_SOURCE, stop_tag, MPI_COMM_WORLD, &status);
            printf("Additional stop from: %d\n", status.MPI_SOURCE);
        }
        return 1;
    }
    return 0;
}

int main()
{
    MPI_Init(NULL, NULL);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    srand(rank);

    MPI_Request root_recv_stop;
    MPI_Request all_recv_stop;
    if (rank == root) {
        MPI_Irecv(NULL, 0, MPI_CHAR, MPI_ANY_SOURCE, stop_tag, MPI_COMM_WORLD, &root_recv_stop);
    } else {
        // You may want to use an extra communicator here, to avoid messing with other barriers
        MPI_Ibarrier(MPI_COMM_WORLD, &all_recv_stop);
    }

    while (1) {
        int found = find_stuff();
        if (found) {
            printf("Rank %d found something.\n", rank);
            // Note: We cannot post this as blocking, otherwise there is a deadlock with the reduce
            MPI_Request req;
            MPI_Isend(NULL, 0, MPI_CHAR, root, stop_tag, MPI_COMM_WORLD, &req);
            if (rank != root) {
                // We know that we are going to receive our own stop signal.
                // This avoids running another useless iteration
                MPI_Wait(&all_recv_stop, MPI_STATUS_IGNORE);
                MPI_Reduce(&found, NULL, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
                MPI_Wait(&req, MPI_STATUS_IGNORE);
                break;
            }
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        if (rank == root) {
            if (forward_stop(&root_recv_stop, &all_recv_stop, found)) {
                break;
            }
        } else {
            int stop_signal;
            MPI_Test(&all_recv_stop, &stop_signal, MPI_STATUS_IGNORE);
            if (stop_signal) {
                MPI_Reduce(&found, NULL, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
                printf("Rank %d stopping after receiving signal.\n", rank);
                break;
            }
        }
    }
    MPI_Finalize();
}
While this is not the simplest code, it should:
- introduce no additional blocking;
- scale with the implementation of a barrier (usually O(log N));
- have a worst-case latency, from one rank finding something to all ranks stopping, of 2 * loop time (+ 1 p2p + 1 barrier + 1 reduction).
If many/all ranks find a solution at the same time, it still works, but it may be less efficient.
A possible strategy to derive this global stop condition in a non-blocking fashion is to rely on MPI_Test.
Scenario
Consider that each process posts an asynchronous receive of type MPI_INT from its left rank with a given tag, building a ring, and then starts its computation. If a rank encounters the stop condition, it sends its own rank to its right rank. In the meantime, each rank uses MPI_Test during the computation to check for the completion of the MPI_Irecv. If it has completed, the rank enters a branch where it first waits for the message and then transitively propagates the received rank to the right, except if the right rank is equal to the payload of the message (so as not to loop).
Once this is done, you should have all processes in that branch, ready to trigger an arbitrary recovery operation.
Complexity
The topology retained is a ring, as it minimizes the number of messages (at most n-1); however, it increases the propagation time. Other topologies could be retained, with more messages but a lower propagation time, for example a tree with an n·ln(n) message complexity.
Implementation
Something like this:
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

int left_rank = (rank == 0) ? (size - 1) : (rank - 1);
int right_rank = (rank == (size - 1)) ? 0 : (rank + 1);

int stop_cond_rank;
MPI_Request stop_cond_request;
int stop_cond = 0;

/* Post the receive from the left neighbour once, before the loop; re-posting it
   every iteration would leak pending requests and could swallow the stop message. */
MPI_Irecv(&stop_cond_rank, 1, MPI_INT, left_rank, 123, MPI_COMM_WORLD, &stop_cond_request);

while (1)
{
    /* Compute here and set the stop condition accordingly */
    if (stop_cond)
    {
        /* Cancel the left recv */
        MPI_Cancel(&stop_cond_request);
        if (rank != right_rank)
            MPI_Send(&rank, 1, MPI_INT, right_rank, 123, MPI_COMM_WORLD);
        break;
    }

    int did_recv = 0;
    MPI_Test(&stop_cond_request, &did_recv, MPI_STATUS_IGNORE);
    if (did_recv)
    {
        stop_cond = 1;
        MPI_Wait(&stop_cond_request, MPI_STATUS_IGNORE);
        if (right_rank != stop_cond_rank)
            MPI_Send(&stop_cond_rank, 1, MPI_INT, right_rank, 123, MPI_COMM_WORLD);
        break;
    }
}

if (stop_cond)
{
    /* Handle the stop condition */
}
else
{
    /* Cleanup */
    MPI_Cancel(&stop_cond_request);
}
That is a question I've asked myself a few times without finding any completely satisfactory answer... The only thing I thought of (besides MPI_Abort(), which does it but is a bit extreme) is to create an MPI_Win storing a flag that will be raised by whichever process faces the problem, and checked by all processes regularly to see whether they can go on processing. This is done using non-blocking calls, the same way as described in this answer.
The main weaknesses of this are:
- It depends on the processes voluntarily checking the status of the flag; there is no immediate interruption of their work to notify them.
- The frequency of this checking must be tuned by hand. You have to find the trade-off between the time you waste processing data when there is no need to (because something happened elsewhere) and the time it takes to check the status...
In the end, what we would need is a way of defining a callback action triggered by an MPI call such as MPI_Abort() (basically replacing the abort action by something else). I don't think this exists, but maybe I overlooked it.
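A hedged sketch of that flag-in-a-window idea, using passive-target synchronization (MPI_Win_lock_all / MPI_Fetch_and_op); do_some_work() is a hypothetical stand-in for the application's work loop, and the polling frequency is left entirely to the reader:
#include <stdio.h>
#include <mpi.h>

/* Hypothetical stand-in for a chunk of application work; returns 1 on failure. */
static int do_some_work(int step)
{
    (void)step;
    return 0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *flag_mem = NULL;
    MPI_Win win;
    /* Rank 0 hosts a single int flag; the other ranks expose no memory. */
    MPI_Win_allocate(rank == 0 ? (MPI_Aint)sizeof(int) : 0, sizeof(int),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &flag_mem, &win);
    if (rank == 0)
        *flag_mem = 0;
    MPI_Barrier(MPI_COMM_WORLD);       /* flag is initialised before anyone polls it */
    MPI_Win_lock_all(0, win);          /* one long passive-target epoch */

    for (int step = 0; step < 1000; step++) {
        if (do_some_work(step)) {      /* something went wrong here: raise the flag */
            int one = 1, old;
            MPI_Fetch_and_op(&one, &old, MPI_INT, 0, 0, MPI_REPLACE, win);
            MPI_Win_flush(0, win);
            break;
        }
        int stop = 0, dummy = 0;       /* poll the flag without modifying it */
        MPI_Fetch_and_op(&dummy, &stop, MPI_INT, 0, 0, MPI_NO_OP, win);
        MPI_Win_flush(0, win);
        if (stop) {
            printf("Rank %d stopping: someone raised the flag.\n", rank);
            break;
        }
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}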

Why is MPI_Bsend not returning an error even when the buffer is insufficient to accommodate all the messages?

I am trying to implement a logical ring with MPI, where every process receives a message from the process whose id is one less than that of the current process and forwards it to the next process in cyclic fashion. My aim is to load the buffer with traffic so that it loses some messages or maybe delivers them out of order.
A cycle of communication finishes when a message dispatched by the root node comes back to the root.
Here is the code that I have tried (I am including just the relevant parts of it):
if (procId != root)
{
    sleep(100);
    while (1)
    {
        tm = MPI_Wtime();
        MPI_Irecv(&message, STR_LEN, MPI_CHAR,
                  ((procId - 1) >= 0 ? (procId - 1) : (numProc - 1)), RETURN_DATA_TAG,
                  MPI_COMM_WORLD, &receiveRequest);
        MPI_Wait(&receiveRequest, &status);
        printf("%d: Received\n", procId);
        if (!strncmp(message, "STOP", 4) && (procId == (numProc - 1)))
            break;
        MPI_Ssend(message, STR_LEN, MPI_CHAR,
                  (procId + 1) % numProc, SEND_DATA_TAG, MPI_COMM_WORLD);
        if (!strncmp(message, "STOP", 4))
            break;
        printf("%d: Sent\n", procId);
    }
}
else
{
    for (iter = 0; iter < benchmarkSize; iter++)
    {
        // Synthesize the message
        message[STR_LEN - 1] = '\0';
        iErr = MPI_Bsend(message, STR_LEN, MPI_CHAR,
                         (root + 1) % numProc, SEND_DATA_TAG, MPI_COMM_WORLD);
        if (iErr != MPI_SUCCESS) {
            char error_string[BUFSIZ];
            int length_of_error_string;
            MPI_Error_string(iErr, error_string, &length_of_error_string);
            fprintf(stderr, "%3d: %s\n", procId, error_string);
        }
        tm = MPI_Wtime();
        while (((MPI_Wtime() - tm) * 1000) < delay);
        printf("Root: Sending\n");
    }
    for (iter = 0; iter < benchmarkSize; iter++)
    {
        MPI_Recv(message, STR_LEN, MPI_CHAR,
                 (numProc - 1), RETURN_DATA_TAG, MPI_COMM_WORLD, &status);
        // We should not wait for the messages to be received but wait for a certain amount of time
        // Extract the fields in the message
        if (((prevRcvdSeqNum + 1) != atoi(seqNum)) && (prevRcvdSeqNum != 0))
            outOfOrderMsgs++;
        prevRcvdSeqNum = atoi(seqNum);
        printf("Seq Num: %d\n", atoi(seqNum));
        rcvdMsgs++;
        printf("Root: Receiving\n");
    }
    MPI_Isend("STOP", 4, MPI_CHAR,
              (root + 1) % numProc, SEND_DATA_TAG, MPI_COMM_WORLD, &sendRequest);
    MPI_Wait(&sendRequest, &status);
    /* This is to ask all other processes to terminate, when the work is done */
}
Now, I have these questions:
1) Why is it that, when I inject some sleep into the other processes of the ring (I mean other than the root), NO receive takes place?
2) Even when the buffer size is only 1, how is it that the root node is able to dispatch messages through MPI_Bsend without an error? For example, in the case where it needs to send a total of 10 messages at a rate of 1000 per second with a buffer size of 1, MPI_Bsend is able to dispatch all the messages without any "buffer full" error, irrespective of the presence of sleep() in the other processes of the ring!
Thanks a ton!
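For reference, here is a hedged sketch (not the asker's code; PENDING_MSGS is an illustrative constant) of how a buffer for MPI_Bsend is typically attached and sized with MPI_Buffer_attach, MPI_Pack_size and MPI_BSEND_OVERHEAD:
#include <stdlib.h>
#include <mpi.h>

#define STR_LEN 64          /* assumed message length, as in the question */
#define PENDING_MSGS 10     /* illustrative: messages that may be buffered at once */

/* Attach a buffer big enough for PENDING_MSGS outstanding MPI_Bsend calls. */
static void *attach_bsend_buffer(void)
{
    int pack_size;
    MPI_Pack_size(STR_LEN, MPI_CHAR, MPI_COMM_WORLD, &pack_size);
    int total = PENDING_MSGS * (pack_size + MPI_BSEND_OVERHEAD);
    void *buf = malloc(total);
    MPI_Buffer_attach(buf, total);
    return buf;  /* detach with MPI_Buffer_detach() and free() before MPI_Finalize */
}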

MPI call and receive in a nested loop

I have a nested loop, and from inside the loop I call MPI send, which I want to send a specific value to the receiver; the receiver then takes the data and again sends MPI messages to another set of CPUs... I used something like this, but it looks like there is a problem in the receive, and I can't see where I went wrong... "the machine goes into an infinite loop somewhere"...
I am trying to make it work like this:
master CPU >> send to other CPUs >> send to slave CPUs
.
.
.
int currentCombinationsCount;
int mp;

if (rank == 0)
{
    for (int pr = 0; pr < combinationsSegmentSize; pr++)
    {
        int CblockBegin = CombinationsSegementsBegin[pr];
        int CblockEnd   = CombinationsSegementsEnd[pr];
        currentCombinationsCount = numOfCombinationsEachLoop[pr];
        prossessNum = 1; // specify which processor we are sending to

        // now substitute and send to the main Processors
        for (mp = CblockBegin; mp <= CblockEnd; mp++)
        {
            MPI_Send(&mp, 1, MPI_INT, prossessNum, TAG, MPI_COMM_WORLD);
            prossessNum++;
        }
    } // this loop goes through all the specified blocks for the combinations
} // end of rank 0
else if (rank > currentCombinationsCount)
{
    // here I want to put other receives that will take values from the else below
}
else
{
    MPI_Recv(&mp, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, &stat);
    // the code gets stuck here in an infinite loop
}
You've only initialised currentCombinationsCount within the if (rank == 0) branch, so all other procs will see an uninitialised variable. That will result in undefined behaviour, and the outcome depends on your compiler. Your program may crash, or the value may be set to 0 or to an undetermined value.
If you're lucky, the value may be set to 0, in which case your branches reduce to:
if (rank == 0) { /* rank == 0 will enter this */ }
else if (rank > 0) { /* all other procs enter this */ }
else { /* never entered! Recvs are never called to match the sends */ }
You therefore end up with sends that are not matched by any receives. Since MPI_Send is potentially blocking, the sending proc may stall indefinitely. With procs blocking on sends, it can certainly look as though "...the machine goes to infinite loop somewhere...".
If currentCombinationsCount is given an arbitrary value (instead of 0), then rank != 0 procs will enter arbitrary branches (with a higher chance of all entering the final else). You then end up with the second set of receives not being called, resulting in the same issue as above. A sketch of one possible fix follows.
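Here is that sketch (a hedged suggestion, reusing the question's variable name): broadcast the count chosen by rank 0 so that every rank branches on the same, initialised value.
#include <mpi.h>

/* Make rank 0's count visible to all ranks before anyone evaluates
   `rank > currentCombinationsCount`. */
static int share_combinations_count(int rank, int count_on_root)
{
    int currentCombinationsCount = (rank == 0) ? count_on_root : 0;
    MPI_Bcast(&currentCombinationsCount, 1, MPI_INT, 0, MPI_COMM_WORLD);
    return currentCombinationsCount;
}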
