Could MPI_Bcast lead to the issue of data uncertainty?

If different processes broadcast different values to the other processes in the group of a certain communicator, what would happen?
Take the following program, run by two processes, as an example:
int rank, size;
int x;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0)
{
x = 0;
MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
}
else if (rank==1)
{
x = 1;
MPI_Bcast(&x, 1, MPI_INT, 1, MPI_COMM_WORLD);
}
cout << "Process " << rank << "'s value is:" << x << endl;
MPI_Finalize();
I think there could be different possibilities for the printed result at the end of the program. If process 0 runs faster than process 1, it would broadcast its value earlier, so process 1 would already hold the same value as process 0 by the time it starts broadcasting its own, making the printed value of x equal to 0 on both. But if process 0 runs slower than process 1, process 0 would end up with process 1's value, which is 1. Is this what actually happens?

I think you don't understand the MPI_Bcast function well. MPI_Bcast is an MPI collective communication routine in which every process that belongs to the given communicator needs to get involved. So for MPI_Bcast, not only the process that sends the data to be broadcast, but also the processes that receive the broadcast data, must all make the matching call in order for the data to be broadcast among all participating processes.
In your given program, especially this part:
if (rank == 0)
{
x = 0;
MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
}
else if (rank==1)
{
x = 1;
MPI_Bcast(&x, 1, MPI_INT, 1, MPI_COMM_WORLD);
}
I think you meant to let the process whose rank is 0 (process 0) broadcast its value of x to the other processes, but in your code only process 0 calls that MPI_Bcast, because of the if-else. So what do the other processes do? The process whose rank is 1 (process 1) does not call the same MPI_Bcast that process 0 calls; it calls a different MPI_Bcast to broadcast its own value of x (the root argument differs between the two calls). Thus, when only process 0 calls its MPI_Bcast, it effectively broadcasts the value of x only to itself, and the value of x stored in the other processes is not affected at all. The same holds for process 1. As a result, in your program the printed value of x for each process should be the same as the value it was initially assigned, and there is no data-uncertainty issue of the kind you were worried about.
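For completeness, here is a minimal sketch (not part of the original answer, and using printf instead of the cout in the question) of how the snippet could be rewritten so that both ranks participate in the same broadcast with root 0; after the call, every rank prints 0:
int rank, size;
int x = -1;                      /* dummy value; only the root's value matters */
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0)
    x = 0;                       /* the root sets the value to be broadcast */
/* every rank makes the same MPI_Bcast call with the same root (0) */
MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
printf("Process %d's value is: %d\n", rank, x);   /* prints 0 on every rank */
MPI_Finalize();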

MPI_Bcast is used primarily so that rank 0 (the root) can calculate values and broadcast them so that everybody starts with the same values.
Here is a valid usage:
int x;
// not all ranks may get the same time value ...
srand(time(NULL));
// so we must get the value once ...
if (rank == 0)
x = rand();
// and broadcast it here ...
MPI_Bcast(&x,1,MPI_INT,0,MPI_COMM_WORLD);
Notice the difference from your usage: the same MPI_Bcast call is made by all ranks. The root will do a send and the others will do a receive.

Related

Developing an MCA component for OpenMPI: how to progress MPI_Recv using custom BTL for NIC under development?

I'm trying to develop a btl for a custom NIC. I studied the btl.h file to understand the flow of calls that are expected to be implemented in my component. I'm using a simple test (which works like a charm with the TCP btl) to check my development; the code is a simple MPI_Send + MPI_Recv:
int ping_pong_count = 1;
int partner_rank = (world_rank + 1) % 2;
printf("MY RANK: %d PARTNER: %d\n",world_rank,partner_rank);
if (world_rank == 0) {
// Increment the ping pong count before you send it
ping_pong_count++;
MPI_Send(&ping_pong_count, 1, MPI_INT, partner_rank, 0, MPI_COMM_WORLD);
printf("%d sent and incremented ping_pong_count %d to %d\n",
world_rank, ping_pong_count, partner_rank);
} else {
MPI_Recv(&ping_pong_count, 1, MPI_INT, partner_rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("%d received ping_pong_count %d from %d\n",
world_rank, ping_pong_count, partner_rank);
}
MPI_Finalize();
}
I see that in my component's btl code the functions called during the "MPI_send" phase are:
mca_btl_mycomp_add_procs
mca_btl_mycomp_prepare_src
mca_btl_mycomp_send (where I set the return to 1, so the send phase should be finished)
I see then the print inside the test:
0 sent and incremented ping_pong_count 2 to 1
and this should conclude the MPI_Send phase.
Then I implemented in the btl_mycomp_component_progress function a call to:
mca_btl_active_message_callback_t *reg = mca_btl_base_active_message_trigger + tag;
reg->cbfunc(&my_btl->super, &desc);
I saw the same code in all the other BTLs and thought this was enough to unlock the MPI_Recv "polling". But my test actually hangs, probably waiting for something that never happens.
I also took a look in the ob1 mca_pml_ob1_recv_frag_callback_match function, and it seems to get to the end of the function, actually matching my frag.
So my question is: how can I tell the framework that I have finished my work, so that the function can return to the user application? What am I doing wrong?
Is there a way to find out where, and on what, my code is waiting?

MPI message received in different communicator

It was my understanding that MPI communicators restrict the scope of communication, such that messages sent from one communicator should never be received in a different one.
However, the program inlined below appears to contradict this.
I understand that the MPI_Send call returns before a matching receive is posted because of the internal buffering it does under the hood (as opposed to MPI_Ssend). I also understand that MPI_Comm_free doesn't destroy the communicator right away, but merely marks it for deallocation and waits for any pending operations to finish. I suppose that my unmatched send operation will be forever pending, but then I wonder how come the same object (integer value) is reused for the second communicator!?
Is this normal behaviour, a bug in the MPI library implementation, or is it that my program is just incorrect?
Any suggestions are much appreciated!
LATER EDIT: posted follow-up question
#include "stdio.h"
#include "unistd.h"
#include "mpi.h"
int main(int argc, char* argv[]) {
int rank, size;
MPI_Group group;
MPI_Comm my_comm;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_group(MPI_COMM_WORLD, &group);
MPI_Comm_create(MPI_COMM_WORLD, group, &my_comm);
if (rank == 0) printf("created communicator %d\n", my_comm);
if (rank == 1) {
int msg = 123;
MPI_Send(&msg, 1, MPI_INT, 0, 0, my_comm);
printf("rank 1: message sent\n");
}
sleep(1);
if (rank == 0) printf("freeing communicator %d\n", my_comm);
MPI_Comm_free(&my_comm);
sleep(2);
MPI_Comm_create(MPI_COMM_WORLD, group, &my_comm);
if (rank == 0) printf("created communicator %d\n", my_comm);
if (rank == 0) {
int msg;
MPI_Recv(&msg, 1, MPI_INT, 1, 0, my_comm, MPI_STATUS_IGNORE);
printf("rank 0: message received\n");
}
sleep(1);
if (rank == 0) printf("freeing communicator %d\n", my_comm);
MPI_Comm_free(&my_comm);
MPI_Finalize();
return 0;
}
outputs:
created communicator -2080374784
rank 1: message sent
freeing communicator -2080374784
created communicator -2080374784
rank 0: message received
freeing communicator -2080374784
The number you're seeing is simply a handle for the communicator. It's safe to reuse the handle since you've freed it.
As to why you're able to send the message, look at how you're creating the communicator. When you use MPI_Comm_group, you get a group containing the ranks associated with the specified communicator; in this case you get all of the ranks, since you are getting the group for MPI_COMM_WORLD. Then you use MPI_Comm_create to create a communicator based on a group of ranks. You are using the same group you just got, which contains all of the ranks, so your new communicator has all of the ranks from MPI_COMM_WORLD.
If you want your communicator to contain only a subset of ranks, you'll need to use a different function (or multiple functions) to make the desired group(s). I'd recommend reading through Chapter 6 of the MPI Standard; it includes all of the functions you'll need. Pick what you need to build the communicator you want.
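For instance, one common way (a sketch, not from the answer above) is to carve a smaller group out of the world group with MPI_Group_incl and pass that to MPI_Comm_create; here only ranks 0 and 1 end up in the new communicator:
MPI_Group world_group, sub_group;
MPI_Comm sub_comm;
int members[2] = {0, 1};                    /* the ranks to keep */
MPI_Comm_group(MPI_COMM_WORLD, &world_group);
MPI_Group_incl(world_group, 2, members, &sub_group);
/* ranks not listed in members[] get MPI_COMM_NULL back */
MPI_Comm_create(MPI_COMM_WORLD, sub_group, &sub_comm);
if (sub_comm != MPI_COMM_NULL) {
    /* communication here involves only ranks 0 and 1 */
    MPI_Comm_free(&sub_comm);
}
MPI_Group_free(&sub_group);
MPI_Group_free(&world_group);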

MPI_Sendrecv deadlock

Could anyone help me fix the bug in the following simple MPI program? I was trying to use MPI_Sendrecv to send the value of c from rank 1 to rank 2, and then print it from rank 2.
However, the following code ends in a deadlock.
What is the mistake, and how do I correctly use MPI_Sendrecv in this situation?
#include<stdio.h>
#include"mpi.h"
int main (int argc, char **argv)
{
int size, rank;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&size);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
printf("Hi dear, I am printing from rank %d\n",rank);
double a, b, c;
MPI_Status status, status2;
if (rank == 0)
{
a = 10.1;
MPI_Send(&a,1,MPI_DOUBLE,1,99,MPI_COMM_WORLD);
}
if (rank == 1)
{
b = 20.1;
MPI_Recv(&a,1,MPI_DOUBLE,0,99,MPI_COMM_WORLD,&status);
c = a + b;
printf("\nThe value of c is %f \n",c);
}
MPI_Sendrecv(&c,1,MPI_DOUBLE,2,100,
&c,1,MPI_DOUBLE,1,100,MPI_COMM_WORLD,&status2);
MPI_Barrier(MPI_COMM_WORLD);
if(rank == 2)
{
printf("\n Printing from rank %d, c is %f\n",rank, c);
}
MPI_Finalize();
return 0;
}
When a process calls MPI_Sendrecv, it will always try to execute both the send and the receive part. For instance, process 2 will not look at the dest (fourth argument), see a "2" and think, "Oh, I don't have to do any sending. I'll just do the receive." Instead, process 2 will see the "2" and think, "Ah, I have to send something to myself." Further, as you have written this code, all processes see the MPI_Sendrecv and think, "Oh, I have to send something to process 2 (the fourth argument) and receive something from process 1 (the ninth argument). So here we go..." The problem is that process 1 isn't getting a command to send anything, so everyone, even process 1, is waiting for process 1 to send something.
MPI_Sendrecv is a very useful function. I find uses for it all the time. But it is intended for sending in a "chain". For instance, 0 sends to 1 while 1 sends to 2 while 2 sends to 0 or whatever. I think in this case you are better off with the usual MPI_Send and MPI_Recv.
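For illustration, a minimal sketch of that point-to-point replacement for the MPI_Sendrecv line (assuming the rest of the posted program stays the same, at least three ranks are launched, and rank 2 still prints c in the existing block afterwards):
if (rank == 1)
{
    /* rank 1 has just computed c, so it sends it on to rank 2 */
    MPI_Send(&c, 1, MPI_DOUBLE, 2, 100, MPI_COMM_WORLD);
}
else if (rank == 2)
{
    /* rank 2 receives c from rank 1 and can then print it */
    MPI_Recv(&c, 1, MPI_DOUBLE, 1, 100, MPI_COMM_WORLD, &status2);
}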

C MPI multiple dynamic array passing

I'm trying to ISend() two arrays, arr1 and arr2, and an integer n which is the size of arr1 and arr2. I understood from this post that sending a struct that contains all three is not an option, since n is only known at run time. Obviously, I need n to be received first, since otherwise the receiving process wouldn't know how many elements to receive. What's the most efficient way to achieve this without using the blocking Send()?
Sending the size of the array is redundant (and inefficient) as MPI provides a way to probe for incoming messages without receiving them, which provides just enough information in order to properly allocate memory. Probing is performed with MPI_PROBE, which looks a lot like MPI_RECV, except that it takes no buffer related arguments. The probe operation returns a status object which can then be queried for the number of elements of a given MPI datatype that can be extracted from the content of the message with MPI_GET_COUNT, therefore explicitly sending the number of elements becomes redundant.
Here is a simple example with two ranks:
if (rank == 0)
{
MPI_Request req;
// Send a message to rank 1
MPI_Isend(arr1, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
// Do not forget to complete the request!
MPI_Wait(&req, MPI_STATUS_IGNORE);
}
else if (rank == 1)
{
MPI_Status status;
// Wait for a message from rank 0 with tag 0
MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
// Find out the number of elements in the message -> size goes to "n"
MPI_Get_count(&status, MPI_DOUBLE, &n);
// Allocate memory
arr1 = malloc(n*sizeof(double));
// Receive the message. ignore the status
MPI_Recv(arr1, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
MPI_PROBE also accepts the wildcard rank MPI_ANY_SOURCE and the wildcard tag MPI_ANY_TAG. One can then consult the corresponding entry in the status structure in order to find out the actual sender rank and the actual message tag.
Probing for the message size works because every message carries a header, called the envelope. The envelope consists of the sender's rank, the receiver's rank, the message tag, and the communicator. It also carries information about the total message size. Envelopes are sent as part of the initial handshake between the two communicating processes.
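For example, a probe with wildcards (a sketch; buf and count are illustrative names, not from the answer above) can be matched to exactly the message that arrived by reusing the fields of the returned status:
MPI_Status status;
int count;
double *buf;
/* wait for any message from any rank with any tag */
MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_DOUBLE, &count);   /* number of doubles in the message */
buf = malloc(count * sizeof(double));
/* status.MPI_SOURCE and status.MPI_TAG pin down exactly that message */
MPI_Recv(buf, count, MPI_DOUBLE, status.MPI_SOURCE, status.MPI_TAG,
         MPI_COMM_WORLD, MPI_STATUS_IGNORE);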
Firstly you need to allocate the full memory (all n elements) for arr1 and arr2 on rank 0, i.e. your front-end processor.
Divide the array into parts depending on the number of processors and determine the element count for each processor.
Send this element count from rank 0 to the other processors.
The second send is for the arrays themselves, i.e. arr1 and arr2.
On the other processors, allocate arr1 and arr2 according to the element count received from the main processor (rank 0). After receiving the element count, receive the two arrays into the allocated memory.
This is a sample C++ implementation, but C follows the same logic. Also, just interchange Send with Isend.
#include <mpi.h>
#include <iostream>
using namespace std;
int main(int argc, char*argv[])
{
MPI::Init (argc, argv);
int rank = MPI::COMM_WORLD.Get_rank();
int no_of_processors = MPI::COMM_WORLD.Get_size();
MPI::Status status;
double *arr1;
if (rank == 0)
{
// Setting some Random n
int n = 10;
arr1 = new double[n];
for(int i = 0; i < n; i++)
{
arr1[i] = i;
}
int part = n / no_of_processors;
int offset = n % no_of_processors;
// cout << part << "\t" << offset << endl;
for(int i = 1; i < no_of_processors; i++)
{
int start = i*part;
int end = start + part - 1;
if (i == (no_of_processors-1))
{
end += offset;
}
// cout << i << " Start: " << start << " END: " << end;
// Element_Count
int e_count = end - start + 1;
// cout << " e_count: " << e_count << endl;
// Sending
MPI::COMM_WORLD.Send(
&e_count,
1,
MPI::INT,
i,
0
);
// Sending Arr1
MPI::COMM_WORLD.Send(
(arr1+start),
e_count,
MPI::DOUBLE,
i,
1
);
}
}
else
{
// Element Count
int e_count;
// Receiving elements count
MPI::COMM_WORLD.Recv (
&e_count,
1,
MPI::INT,
0,
0,
status
);
arr1 = new double [e_count];
// Receiving FIrst Array
MPI::COMM_WORLD.Recv (
arr1,
e_count,
MPI::DOUBLE,
0,
1,
status
);
for(int i = 0; i < e_count; i++)
{
cout << arr1[i] << endl;
}
}
// if(rank == 0)
delete [] arr1;
MPI::Finalize();
return 0;
}
@Histro The point I want to make is that Irecv/Isend are themselves handled internally by the MPI library. The question you asked depends completely on the rest of your code, i.e. on what you do after the Send/Recv. There are two cases:
Master and Worker
You send part of the problem (say arrays) to the workers (all other ranks except 0, the master). The worker does some work on the arrays and then returns the results to the master. The master then adds up the results and conveys new work to the workers. Here you would want the master to wait for all the workers to return their results (modified arrays), so you cannot use Isend and Irecv; instead you would use multiple sends as in my code, and the corresponding receives. If your code goes in this direction, you want to use MPI_Bcast and MPI_Reduce.
Lazy Master
The master divides the work but doesn't care about the results from his workers. Say you want to compute different kinds of patterns over the same data: given characteristics of the population of some city, you want to calculate patterns like how many people are above 18, how many have jobs, and how many work for some company. These results don't have anything to do with one another. In this case you don't have to worry about whether the data has been received by the workers or not, and the master can continue to execute the rest of the code. This is where it is safe to use Isend/Irecv.
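As a rough sketch of that lazy-master case (using the MPI C bindings; chunk is an illustrative name, not from the answer above), the master can post non-blocking sends and keep going, as long as it completes the requests before reusing or freeing the buffer:
MPI_Request *reqs = malloc((no_of_processors - 1) * sizeof(MPI_Request));
for (int i = 1; i < no_of_processors; i++)
{
    /* post the send and move on immediately; the buffer must not be
       modified or freed until the request completes */
    MPI_Isend(arr1 + (i - 1) * chunk, chunk, MPI_DOUBLE,
              i, 1, MPI_COMM_WORLD, &reqs[i - 1]);
}
/* ... the master continues with unrelated work here ... */
/* complete all outstanding sends before touching arr1 again */
MPI_Waitall(no_of_processors - 1, reqs, MPI_STATUSES_IGNORE);
free(reqs);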

MPI Barrier C++

I want to use MPI (MPICH2) on Windows. I wrote this command:
MPI_Barrier(MPI_COMM_WORLD);
I expect it to block all processes until all group members have called it, but that is not what happens. Here is a schematic of my code:
int a;
if(myrank == RootProc)
a = 4;
MPI_Barrier(MPI_COMM_WORLD);
cout << "My Rank = " << myrank << "\ta = " << a << endl;
(With 2 processes:) The root process (0) acts correctly, but the process with rank 1 doesn't know the value of a, so it displays -858993460 instead of 4.
Can anyone help me?
Regards
You're only assigning a in process 0. MPI doesn't share memory, so if you want the a in process 1 to get the value of 4, you need to call MPI_Send from process 0 and MPI_Recv from process 1.
Variable a is not initialized; that is probably why it displays that number. In MPI, variable a is duplicated across the processes, so there are two copies of a, one of which is uninitialized. You want to write:
int a = 4;
if (myrank == RootProc)
...
Or, alternatively, do an MPI_Send in the root (rank 0) and an MPI_Recv in the slave (rank 1), so the value in the root is also set in the slave.
Note: that code triggers a small alarm in my head, so I need to check something and I'll edit this with more info. Until then though, the uninitialized value is most certainly a problem for you.
OK, I've checked the facts: your code was not properly indented and I missed the missing {}. The barrier looks fine now, although the snippet you posted does not do much and is not a very good example of a barrier, because the slave enters it directly, whereas the root sets the variable to 4 and then enters it. To test that the barrier actually works, you probably want some sort of sleep mechanism in one of the processes; that will hold up the other process as well, preventing it from printing the cout until the sleep is over.
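For example, a small test along those lines (a sketch, assuming a POSIX sleep(); on Windows with MPICH2 you would use Sleep() from <windows.h> instead) could look like:
if (myrank == 0)
    sleep(2);                /* delay the root on purpose */
MPI_Barrier(MPI_COMM_WORLD);
/* with the barrier in place, neither rank prints until rank 0 wakes up */
printf("Rank %d passed the barrier\n", myrank);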
Blocking is not enough; you have to send data to the other processes (memory is not shared between processes).
To share data across ALL processes use:
int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
so in your case:
MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
Here you send one integer, pointed to by &a, from process 0 to all the others.
// MPI_Bcast acts as the sender on the root process and as the receiver on non-root processes
You can also send data to a specific process with:
int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm )
and then receive by:
int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
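Putting the pieces together, a hedged sketch of the corrected schematic (using MPI_Bcast, printf in place of the cout from the schematic above, and assuming RootProc is rank 0 as in that schematic) could be:
int a = 0;                               /* give a a defined value on every rank */
if (myrank == RootProc)
    a = 4;                               /* only the root computes the real value */
/* every rank makes the same call; after it, a == 4 everywhere */
MPI_Bcast(&a, 1, MPI_INT, RootProc, MPI_COMM_WORLD);
printf("My Rank = %d\ta = %d\n", myrank, a);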
