MPI_Sendrecv deadlock

Could anyone help me fix the bug in the following simple MPI program? I am trying to use MPI_Sendrecv to send the value of c from rank 1 to rank 2, and then print it from rank 2.
But the following code ends in a deadlock.
What is the mistake, and how do I use MPI_Sendrecv correctly in this situation?
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hi dear, I am printing from rank %d\n", rank);
    double a, b, c;
    MPI_Status status, status2;
    if (rank == 0)
    {
        a = 10.1;
        MPI_Send(&a, 1, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
    }
    if (rank == 1)
    {
        b = 20.1;
        MPI_Recv(&a, 1, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, &status);
        c = a + b;
        printf("\nThe value of c is %f \n", c);
    }
    MPI_Sendrecv(&c, 1, MPI_DOUBLE, 2, 100,
                 &c, 1, MPI_DOUBLE, 1, 100, MPI_COMM_WORLD, &status2);
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 2)
    {
        printf("\n Printing from rank %d, c is %f\n", rank, c);
    }
    MPI_Finalize();
    return 0;
}

When a process calls MPI_Sendrecv, it always executes both the send part and the receive part. For instance, process 2 will not look at the dest argument (the fourth argument), see a "2", and think, "Oh, I don't have to do any sending. I'll just do the receive." Instead, process 2 will see the "2" and think, "Ah, I have to send something to myself." Further, as this code is written, every process reaches the MPI_Sendrecv and thinks, "Oh, I have to send something to process 2 (the fourth argument) and receive something from process 1 (the ninth argument). So here we go ..." The problem is that process 1 is never given a command to send anything, so every process, including process 1, waits forever for process 1 to send something.
MPI_Sendrecv is a very useful function; I find uses for it all the time. But it is intended for sending in a "chain": for instance, 0 sends to 1 while 1 sends to 2 while 2 sends to 0, or some similar pattern. I think in this case you are better off with the usual MPI_Send and MPI_Recv.

Related

Developing an MCA component for OpenMPI: how to progress MPI_Recv using custom BTL for NIC under development?

I'm trying to develop a BTL for a custom NIC. I studied the btl.h file to understand the flow of calls that my component is expected to implement. I'm using a simple test (which works like a charm with the TCP BTL) to exercise my development; the code is a simple MPI_Send + MPI_Recv:
int ping_pong_count = 1;
int partner_rank = (world_rank + 1) % 2;
printf("MY RANK: %d PARTNER: %d\n", world_rank, partner_rank);
if (world_rank == 0) {
    // Increment the ping pong count before you send it
    ping_pong_count++;
    MPI_Send(&ping_pong_count, 1, MPI_INT, partner_rank, 0, MPI_COMM_WORLD);
    printf("%d sent and incremented ping_pong_count %d to %d\n",
           world_rank, ping_pong_count, partner_rank);
} else {
    MPI_Recv(&ping_pong_count, 1, MPI_INT, partner_rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("%d received ping_pong_count %d from %d\n",
           world_rank, ping_pong_count, partner_rank);
}
MPI_Finalize();
I see that in my component's BTL code the functions called during the MPI_Send phase are:
mca_btl_mycomp_add_procs
mca_btl_mycomp_prepare_src
mca_btl_mycomp_send (where I set the return to 1, so the send phase should be finished)
I see then the print inside the test:
0 sent and incremented ping_pong_count 2 to 1
and this should conclude the MPI_Send phase.
Then I implemented in the btl_mycomp_component_progress function a call to:
mca_btl_active_message_callback_t *reg = mca_btl_base_active_message_trigger + tag;
reg->cbfunc(&my_btl->super, &desc);
I saw the same code in all the other BTLs and I thought this was enough to unlock the MPI_Recv "polling". But actually my test hangs, probably waiting for something that never happens.
I also took a look in the ob1 mca_pml_ob1_recv_frag_callback_match function, and it seems to get to the end of the function, actually matching my frag.
So my question is: how can I tell the framework that I have finished my work, so that the function can return to the user application? What am I doing wrong?
Is there a way to understand where and what my code is waiting for?

Unexpected result from MPI isend and irecv

My goal was to send a vector from process 0, to process 1. Then, send it back from process 1 to process 0.
I have two questions from my implementation,
1- Why does sending back from process 1 to process 0 take longer than the other way around?
The first send-recv takes ~1e-4 seconds in total and the second send-recv takes ~1 second.
2- When I increase the size of the vector, I get the following error. What is the reason for this issue?
mpirun noticed that process rank 0 with PID 11248 on node server1 exited on signal 11 (Segmentation fault).
My updated C++ code is as follows:
#include <mpi.h>
#include <stdio.h>
#include <iostream>
#include <vector>
#include <boost/timer/timer.hpp>
#include <math.h>
using namespace std;

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);
    MPI_Request request, request2, request3, request4;
    MPI_Status status;
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    srand(world_rank);
    int n = 1e3;
    double *myvector = new double[n];
    if (world_rank == 0) {
        myvector[n-1] = 1;
    }
    MPI_Barrier(MPI_COMM_WORLD);
    if (world_rank == 0) {
        boost::timer::cpu_timer timer;
        MPI_Isend(myvector, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &request);
        boost::timer::cpu_times elapsedTime1 = timer.elapsed();
        cout << " Wallclock time on Process 1:"
             << elapsedTime1.wall / 1e9 << " (sec)" << endl;
        MPI_Irecv(myvector, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &request4);
        MPI_Wait(&request4, &status);
        printf("Test if data is recieved from node 1: %1.0f\n", myvector[n-1]);
        boost::timer::cpu_times elapsedTime2 = timer.elapsed();
        cout << " Wallclock time on Process 1:"
             << elapsedTime2.wall / 1e9 << " (sec)" << endl;
    } else {
        boost::timer::cpu_timer timer;
        MPI_Irecv(myvector, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &request2);
        MPI_Wait(&request2, &status);
        boost::timer::cpu_times elapsedTime1 = timer.elapsed();
        cout << " Wallclock time on Process 2:"
             << elapsedTime1.wall / 1e9 << " (sec)" << endl;
        printf("Test if data is recieved from node 0: %1.0f\n", myvector[n-1]);
        myvector[n-1] = 2;
        MPI_Isend(myvector, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &request3);
        boost::timer::cpu_times elapsedTime2 = timer.elapsed();
        cout << " Wallclock time on Process 2:"
             << elapsedTime1.wall / 1e9 << " (sec)" << endl;
    }
    MPI_Finalize();
}
The output is:
Wallclock time on Process 1:2.484e-05 (sec)
Wallclock time on Process 2:0.000125325 (sec)
Test if data is recieved from node 0: 1
Wallclock time on Process 2:0.000125325 (sec)
Test if data is recieved from node 1: 2
Wallclock time on Process 1:1.00133 (sec)
Timing discrepancies
First of all, you don't measure the time to send the message. This is why posting the actual code you use for timing is essential.
You measure four times. For the two sends, you only time the call to MPI_Isend. This is the immediate version of the API call; as the name suggests, it returns immediately. The timing has nothing to do with the actual time to transfer the message.
For the receive operations, you measure MPI_Irecv and a corresponding MPI_Wait. This is the time between initiating the receive and the local availability of the message. This is again different from the message transfer time, as it does not account for the time difference between posting the receive and posting the corresponding send. In general, you have to consider the late-sender and late-receiver cases. Further, even for blocking send operations, local completion does not imply a completed transfer, remote completion, or even initiation.
Timing MPI transfers is difficult.
Checking for completion
There is still the question as to why anything in this code could take an entire second. That is certainly not a sensible time unless your network uses IPoAC. The likely reason is that you do not check for completion of all messages. MPI implementations are often single threaded and can only make progress on communication during the respective API calls. To use immediate messages, you must either periodically call MPI_Test* until the request is finished or complete the request using MPI_Wait*.
I don't know why you chose to use immediate MPI functions in the first place. If you call MPI_Wait right after starting an MPI_Isend/MPI_Irecv, you might as well just call MPI_Send/MPI_Recv. You need immediate functions for concurrent communication and computation, to allow concurrent irregular communication patterns, and to avoid deadlocks in certain situations. If you don't need the immediate functions, use the blocking ones instead.
Segfault
While I cannot reproduce it, I suspect this is caused by using the same buffer (myvector) for two concurrently running MPI operations. Don't do that. Either use a separate buffer, or make sure the first operation completes. In general, you are not allowed to touch a buffer in any way after passing it to MPI_Isend/MPI_Irecv until you know the request has completed via MPI_Test*/MPI_Wait*.
P.S.
If you think you need immediate operations to avoid deadlocks while sending and receiving, consider MPI_Sendrecv instead.

Could MPI_Bcast lead to the issue of data uncertainty?

If different processes broadcast different values to the other processes in the group of a certain communicator, what will happen?
Take the following program, run by two processes, as an example:
int rank, size;
int x;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0)
{
    x = 0;
    MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
}
else if (rank == 1)
{
    x = 1;
    MPI_Bcast(&x, 1, MPI_INT, 1, MPI_COMM_WORLD);
}
cout << "Process " << rank << "'s value is:" << x << endl;
MPI_Finalize();
I think there could be different possible printed results at the end of the program. If process 0 runs faster than process 1, it will broadcast its value before process 1 does, so process 1 would already hold process 0's value when it starts broadcasting its own; both processes would then print 0 for x. But if process 0 runs slower than process 1, process 0 would end up with process 1's value, which is 1. Does what I described actually happen?
I think you don't understand the MPI_Bcast function well. MPI_Bcast is an MPI collective communication call, in which every process belonging to the given communicator must take part. For MPI_Bcast, not only the process that sends the data to be broadcast, but also the processes that receive it, must call the function, with a matching root argument, for the data to be broadcast among all participating processes.
In your given program, especially this part:
if (rank == 0)
{
    x = 0;
    MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
}
else if (rank == 1)
{
    x = 1;
    MPI_Bcast(&x, 1, MPI_INT, 1, MPI_COMM_WORLD);
}
I think you meant for the process whose rank is 0 (process 0) to broadcast its value of x to the other processes, but with your if-else arrangement only process 0 calls that particular MPI_Bcast. So what do the other processes do? The process whose rank is 1 (process 1) does not call the same MPI_Bcast that process 0 calls; it calls a different one, with a different root argument, to broadcast its own value of x. If only process 0 calls an MPI_Bcast with root 0, that broadcast effectively delivers x only to process 0 itself, and the value of x stored in the other processes is not affected at all. The same applies to process 1. As a result, the printed value of x for each process is the value it assigned initially, and the data-uncertainty issue you were worried about does not arise. (Strictly speaking, calling a collective with mismatched root arguments across the processes of a communicator is erroneous under the MPI standard, so an implementation is also free to hang or fail here.)
MPI_Bcast is used primarily so that rank 0 [root] can calculate values and broadcast them so everybody starts with the same values.
Here is a valid usage:
int x;
// not all ranks may get the same time value ...
srand(time(NULL));
// so we must get the value once ...
if (rank == 0)
x = rand();
// and broadcast it here ...
MPI_Bcast(&x,1,MPI_INT,0,MPI_COMM_WORLD);
Notice the difference from your usage: the same MPI_Bcast call is executed by all ranks. The root does a send and the others do a receive.

C MPI multiple dynamic array passing

I'm trying to Isend() two arrays, arr1 and arr2, plus an integer n which is the size of arr1 and arr2. I understood from this post that sending a struct that contains all three is not an option, since n is only known at run time. Obviously, I need n to be received first, since otherwise the receiving process wouldn't know how many elements to receive. What's the most efficient way to achieve this without using the blocking Send()?
Sending the size of the array is redundant (and inefficient), as MPI provides a way to probe for incoming messages without receiving them, which yields just enough information to allocate memory properly. Probing is done with MPI_PROBE, which looks a lot like MPI_RECV except that it takes no buffer-related arguments. The probe operation returns a status object, which can then be queried with MPI_GET_COUNT for the number of elements of a given MPI datatype contained in the message; explicitly sending the number of elements therefore becomes redundant.
Here is a simple example with two ranks:
if (rank == 0)
{
    MPI_Request req;
    // Send a message to rank 1
    MPI_Isend(arr1, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    // Do not forget to complete the request!
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
else if (rank == 1)
{
    MPI_Status status;
    // Wait for a message from rank 0 with tag 0
    MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
    // Find out the number of elements in the message -> size goes to "n"
    MPI_Get_count(&status, MPI_DOUBLE, &n);
    // Allocate memory
    arr1 = malloc(n * sizeof(double));
    // Receive the message; ignore the status
    MPI_Recv(arr1, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
MPI_PROBE also accepts the wildcard rank MPI_ANY_SOURCE and the wildcard tag MPI_ANY_TAG. One can then consult the corresponding entry in the status structure in order to find out the actual sender rank and the actual message tag.
Probing for the message size works because every message carries a header, called the envelope. The envelope contains the sender's rank, the receiver's rank, the message tag, and the communicator. It also carries information about the total message size. Envelopes are sent as part of the initial handshake between the two communicating processes.
Firstly, you need to allocate memory for the full arrays (all n elements) of arr1 and arr2 on rank 0, i.e. your front-end processor.
Divide the arrays into parts depending on the number of processors, and determine the element count for each one.
Send this element count from rank 0 to the other processors.
The second send is for the arrays themselves, i.e. arr1 and arr2.
On the other processors, allocate arr1 and arr2 according to the element count received from the main processor (rank 0). After receiving the element count, receive the two arrays into the allocated memory.
This is a sample C++ implementation, but C follows the same logic. To make it non-blocking, just interchange Send with Isend. (Note that the C++ bindings used below were deprecated in MPI-2.2 and removed in MPI-3.0.)
#include <mpi.h>
#include <iostream>
using namespace std;

int main(int argc, char *argv[])
{
    MPI::Init(argc, argv);
    int rank = MPI::COMM_WORLD.Get_rank();
    int no_of_processors = MPI::COMM_WORLD.Get_size();
    MPI::Status status;

    double *arr1;
    if (rank == 0)
    {
        // Setting some random n
        int n = 10;
        arr1 = new double[n];
        for (int i = 0; i < n; i++)
        {
            arr1[i] = i;
        }

        int part = n / no_of_processors;
        int offset = n % no_of_processors;
        for (int i = 1; i < no_of_processors; i++)
        {
            int start = i * part;
            int end = start + part - 1;
            if (i == (no_of_processors - 1))
            {
                end += offset;
            }
            // Element count
            int e_count = end - start + 1;
            // Sending the element count
            MPI::COMM_WORLD.Send(&e_count, 1, MPI::INT, i, 0);
            // Sending the slice of arr1
            MPI::COMM_WORLD.Send((arr1 + start), e_count, MPI::DOUBLE, i, 1);
        }
    }
    else
    {
        // Element count
        int e_count;
        // Receiving the element count
        MPI::COMM_WORLD.Recv(&e_count, 1, MPI::INT, 0, 0, status);
        arr1 = new double[e_count];
        // Receiving the first array
        MPI::COMM_WORLD.Recv(arr1, e_count, MPI::DOUBLE, 0, 1, status);
        for (int i = 0; i < e_count; i++)
        {
            cout << arr1[i] << endl;
        }
    }
    delete[] arr1;
    MPI::Finalize();
    return 0;
}
@Histro: The point I want to make is that Irecv/Isend are functions manipulated by the MPI library itself. The answer to your question depends completely on the rest of your code, i.e. on what you do after the Send/Recv. There are two cases:
Master and Worker
You send part of the problem (say, arrays) to the workers (all ranks except 0, the master). The workers do some work on the arrays, then return the results to the master. The master adds up the results and hands new work to the workers. Here you want the master to wait for all the workers to return their results (the modified arrays), so you cannot use Isend and Irecv, but rather multiple sends as used in my code and the corresponding receives. If your code goes in this direction, you want to use MPI_Bcast and MPI_Reduce.
Lazy Master
The master divides the work but doesn't care about the results from its workers. Say you want to compute several different kinds of statistics over the same data: given the characteristics of the population of some city, you want to calculate things like how many people are above 18, how many have jobs, and how many work at some company. These results have nothing to do with one another. In this case you don't have to worry about whether the data has been received by the workers or not; the master can continue to execute the rest of the code. This is where it is safe to use Isend/Irecv.

MPI Barrier C++

I want to use MPI (MPICH2) on Windows. I write this command:
MPI_Barrier(MPI_COMM_WORLD);
and I expect it to block all processes until all group members have called it. But that is not what happens. I add a schematic of my code:
int a;
if (myrank == RootProc)
    a = 4;
MPI_Barrier(MPI_COMM_WORLD);
cout << "My Rank = " << myrank << "\ta = " << a << endl;
(With 2 processors:) The root processor (0) acts correctly, but the processor with rank 1 doesn't know the a variable, so it displays -858993460 instead of 4.
Can anyone help me?
Regards
You're only assigning a in process 0. MPI doesn't share memory, so if you want the a in process 1 to get the value 4, you need to call MPI_Send from process 0 and MPI_Recv in process 1.
Variable a is not initialized, which is probably why it displays that number. In MPI, variable a is duplicated across the processes, so there are two copies of a, one of which is uninitialized. You want to write:
int a = 4;
if (myrank == RootProc)
...
Or, alternatively, do an MPI_Send in the root (id 0) and an MPI_Recv in the slave (id 1) so the value in the root is also set in the slave.
Note: that code triggers a small alarm in my head, so I need to check something and I'll edit this with more info. Until then, though, the uninitialized value is almost certainly a problem for you.
OK, I've checked the facts: your code was not properly indented and I missed the missing {}. The barrier looks fine now, although the snippet you posted does not do much, and it is not a very good example of a barrier, because the slave enters it directly, whereas the root sets the variable to 4 and then enters it. To test that it actually works, you probably want some sort of sleep mechanism in one of the processes; that will yield to the other process as well, preventing it from reaching the cout until the sleep is over.
Blocking is not enough; you also have to send the data to the other processes (memory is not shared between processes).
To share data across ALL processes use:
int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
so in your case:
MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
Here you send one integer, pointed to by &a, from process 0 to all the others.
// MPI_Bcast acts as a send for the root process and as a receive for the non-root processes
You can also send data to a specific process with:
int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm )
and then receive by:
int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
