broadcasting structs without knowing attributes - mpi

I'm trying to use MPI_Bcast to share an instance of cudaIpcMemHandle_t, but I cannot figure out how to create the corresponding MPI_Datatype needed for the Bcast. I do not know the underlying structure of the CUDA type, hence methods like this one don't seem to work. Am I missing something?

Following up from the useful comment, I have a solution that seems to work, though it's only been tested in a toy program:
// Broadcast the memory handle
cudaIpcMemHandle_t memHand[1];
if (rank == 0) {
    // set the handle contents, e.g. by calling cudaIpcGetMemHandle
}
MPI_Barrier(MPI_COMM_WORLD);
// share the size of the handle with the other ranks
int hand_size[1];
if (rank == 0)
    hand_size[0] = sizeof(memHand[0]);
MPI_Bcast(&hand_size[0], 1, MPI_INT, 0, MPI_COMM_WORLD);
// broadcast the handle as a byte array
char memHand_C[hand_size[0]];
if (rank == 0)
    memcpy(&memHand_C, &memHand[0], hand_size[0]);
MPI_Bcast(&memHand_C, hand_size[0], MPI_BYTE, 0, MPI_COMM_WORLD);
if (rank > 0)
    memcpy(&memHand[0], &memHand_C, hand_size[0]);
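Since sizeof(cudaIpcMemHandle_t) is a compile-time constant known on every rank, the size broadcast can also be dropped entirely. A minimal sketch of that shorter variant (assuming d_ptr is a device allocation owned by rank 0; not tested here):
cudaIpcMemHandle_t memHand;
if (rank == 0)
    cudaIpcGetMemHandle(&memHand, d_ptr);   // d_ptr: device pointer valid on rank 0
// the handle is plain data, so it can travel as raw bytes
MPI_Bcast(&memHand, sizeof(cudaIpcMemHandle_t), MPI_BYTE, 0, MPI_COMM_WORLD);
if (rank > 0) {
    void *peer_ptr;
    cudaIpcOpenMemHandle(&peer_ptr, memHand, cudaIpcMemLazyEnablePeerAccess);
}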

Related

How to check completion of non-blocking reduce?

It is not clear to me how to properly use non-blocking collectives in MPI, particularly MPI_Ireduce() in this case:
Say I want to collect a sum at the root rank:
int local_cnt;
int total_cnt;
MPI_Request request;
MPI_Ireduce(&local_cnt, &total_cnt, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &request);
/* now I want to check if the reduce is finished */
if (rank == 0) {
    int flag = 0;
    MPI_Status status;
    MPI_Test(&request, &flag, &status);
    if (flag) {
        /* reduce is finished? */
    }
}
Is this the correct way to check whether a non-blocking reduce is done? My confusion comes from two aspects: first, can (or should) only the root process check for completion using MPI_Test(), since the result is meaningful only to the root? Second, since MPI_Test() is a local operation, how can it know that the operation is complete? It does require all processes to finish, right?
You must check for completion on all participating ranks, not just the root.
From a user's perspective, you need to know when the communication has completed because you must not touch the memory handed to a non-blocking operation until then. I.e. if you pass a local-scope variable like local_cnt, you cannot write to it or leave its scope before you have confirmed that the operation is complete.
One option to ensure completion is to call MPI_Test repeatedly until it eventually returns flag == true. Use this only if you can do something useful between the calls to MPI_Test:
{
    int local_cnt;
    int total_cnt;
    // fill local_cnt on all ranks
    MPI_Request request;
    MPI_Ireduce(&local_cnt, &total_cnt, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &request);
    int flag;
    do {
        // perform some useful computation
        MPI_Status status;
        MPI_Test(&request, &flag, &status);
    } while (!flag);
}
Do not call MPI_Test in a loop if you have nothing useful to do in between calls. Instead use MPI_Wait, which blocks until completion.
{
    int local_cnt;
    int total_cnt;
    // fill local_cnt on all ranks
    MPI_Request request;
    MPI_Ireduce(&local_cnt, &total_cnt, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &request);
    // perform some useful computation
    MPI_Status status;
    MPI_Wait(&request, &status);
}
Remember: if you have no useful computation at all, and don't need the non-blocking form to avoid deadlocks, use a blocking collective in the first place. If you have multiple non-blocking communications in flight, there are MPI_Waitany, MPI_Waitsome, MPI_Waitall and their Test variants.
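For the multiple-request case, a minimal sketch (hypothetical buffers cnt_a/cnt_b, sums collected at rank 0) of waiting on two outstanding reductions at once:
MPI_Request reqs[2];
MPI_Ireduce(&cnt_a, &sum_a, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Ireduce(&cnt_b, &sum_b, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &reqs[1]);
// ... useful computation that touches neither cnt_a/cnt_b nor sum_a/sum_b ...
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);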
Zulan brilliantly answered the first part of your question.
MPI_Reduce() returns when:
- the send buffer can be overwritten on a non-root rank;
- the result is available on the root rank (which implies all the ranks have completed).
So there is no way for a non-root rank to know whether the root rank has completed. If you do need this information, then you have to manually add an MPI_Barrier(). That being said, you generally do not need this information, and if you believe you do, there might be something wrong with your app.
This remains true if you use non-blocking collectives: e.g. when the MPI_Wait() corresponding to an MPI_Ireduce() completes on a non-root rank, that simply means the send buffer can be overwritten.
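If a non-root rank really does need that guarantee, a minimal sketch of the barrier approach mentioned above (reusing local_cnt/total_cnt from before; not from the original answer):
MPI_Request request;
MPI_Ireduce(&local_cnt, &total_cnt, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &request);
MPI_Wait(&request, MPI_STATUS_IGNORE);  // local buffers are now reusable on every rank
MPI_Barrier(MPI_COMM_WORLD);            // once this returns, the root has entered the barrier,
                                        // i.e. its MPI_Wait finished and total_cnt is available there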

When to use tags when sending and receiving messages in MPI?

I'm not sure when I have to use different numbers for the tag field in MPI send and receive calls. I've read this, but I can't understand it:
Sometimes there are cases when A might have to send many different types of messages to B. Instead of B having to go through extra measures to differentiate all these messages, MPI allows senders and receivers to also specify message IDs with the message (known as tags). When process B only requests a message with a certain tag number, messages with different tags will be buffered by the network until B is ready for them.
Do I have to use tags, for example, when I have multiple "isend" calls (with different tags) from process A and only one "ireceive" call in process B?
Message tags are optional. You can use arbitrary integer values for them, with whichever semantics seem useful to you.
Like you suggested, tags can be used to differentiate between messages that consist of different types (MPI_INTEGER, MPI_REAL, MPI_BYTE, etc.). You could also use tags to add some information about what the data actually represents (if you have an nxn matrix, a message to send a row of this matrix will consist of n values, as will a message to send a column of that matrix; nevertheless, you may want to treat row and column data differently).
Note that the receive operation has to match the tag of the message it wants to receive. This, however, does not mean that you have to specify the same tag: you can also use the wildcard MPI_ANY_TAG as the message tag, and the receive operation will then match arbitrary message tags. You can find out which tag the sender used with the help of MPI_Probe.
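As a rough sketch of that idea (a fragment assuming MPI is initialised, an integer message is in flight, and <stdlib.h> is included for malloc), probing first lets the receiver size its buffer and inspect the tag before the actual receive:
MPI_Status status;
MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
int count;
MPI_Get_count(&status, MPI_INT, &count);          // how many MPI_INTs are waiting
int *buf = malloc(count * sizeof(int));
MPI_Recv(buf, count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
// status.MPI_TAG tells you which kind of message arrived
free(buf);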
In general, I tend to avoid them. There is no requirement that you use tags. If you need to get the message size before parsing the message, you can use MPI_Probe; that way you can distinguish different messages without specifying tags.
I typically use tags because MPI_Recv requires that you know the message size before fetching the data. If you have different sizes and types, tags can help you differentiate between them by having multiple threads or processes listen for different subsets of tags. Tag 1 can mean messages of type X and tag 2 messages of type Y. Tags also let you have multiple "channels" of communication without having to do the work of creating unique communicators and groups.
#include <mpi.h>
#include <iostream>
using namespace std;

// Note: this example assumes exactly 3 processes (rank 0 sends to ranks 1 and 2)
int main( int argc, char* argv[] )
{
    // Init MPI
    MPI_Init( &argc, &argv );

    // Get the rank and size
    int rank, size;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    // If master
    if( rank == 0 ){
        char message_r1[] = "Hello Rank 1";
        char message_r2[] = "Hello Rank 2";

        // Send a message over tag 0
        MPI_Send( message_r1, 13, MPI_CHAR, 1, 0, MPI_COMM_WORLD );

        // Send a message over tag 1
        MPI_Send( message_r2, 13, MPI_CHAR, 2, 1, MPI_COMM_WORLD );
    }
    else{
        // Buffer
        char buffer[256];
        MPI_Status status;

        // Wait for your own message
        MPI_Recv( buffer, 13, MPI_CHAR, 0, rank-1, MPI_COMM_WORLD, &status );
        cout << "Rank: " << rank << ", Message: " << buffer << std::endl;
    }

    // Finalize MPI
    MPI_Finalize();
    return 0;
}
Tags can be useful in distributed computing algorithms where there can be multiple types of messages. Consider the leader election problem, where a process (election candidate) sends a message of type requestVote and the other processes respond with a message of type voteGrant.
There are many such algorithms that distinguish between types of messages, and tags can be useful for categorizing those messages.
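A rough sketch of that idea (the tag values and integer payload are made up for illustration):
// Hypothetical tag values for a leader-election style protocol
#define REQUEST_VOTE_TAG 1
#define VOTE_GRANT_TAG   2

int payload;
MPI_Status status;
MPI_Recv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
if (status.MPI_TAG == REQUEST_VOTE_TAG) {
    // a candidate at status.MPI_SOURCE is asking for our vote
} else if (status.MPI_TAG == VOTE_GRANT_TAG) {
    // a process granted us its vote
}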

MPI message received in different communicator

It was my understanding that MPI communicators restrict the scope of communication, such that messages sent from one communicator should never be received in a different one. However, the program inlined below appears to contradict this.
I understand that the MPI_Send call returns before a matching receive is posted because of the internal buffering it does under the hood (as opposed to MPI_Ssend). I also understand that MPI_Comm_free doesn't destroy the communicator right away, but merely marks it for deallocation and waits for any pending operations to finish. I suppose that my unmatched send operation will be forever pending, but then I wonder how come the same object (integer value) is reused for the second communicator!?
Is this normal behaviour, a bug in the MPI library implementation, or is it that my program is just incorrect?
Any suggestions are much appreciated!
LATER EDIT: posted follow-up question
#include "stdio.h"
#include "unistd.h"
#include "mpi.h"
int main(int argc, char* argv[]) {
int rank, size;
MPI_Group group;
MPI_Comm my_comm;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_group(MPI_COMM_WORLD, &group);
MPI_Comm_create(MPI_COMM_WORLD, group, &my_comm);
if (rank == 0) printf("created communicator %d\n", my_comm);
if (rank == 1) {
int msg = 123;
MPI_Send(&msg, 1, MPI_INT, 0, 0, my_comm);
printf("rank 1: message sent\n");
}
sleep(1);
if (rank == 0) printf("freeing communicator %d\n", my_comm);
MPI_Comm_free(&my_comm);
sleep(2);
MPI_Comm_create(MPI_COMM_WORLD, group, &my_comm);
if (rank == 0) printf("created communicator %d\n", my_comm);
if (rank == 0) {
int msg;
MPI_Recv(&msg, 1, MPI_INT, 1, 0, my_comm, MPI_STATUS_IGNORE);
printf("rank 0: message received\n");
}
sleep(1);
if (rank == 0) printf("freeing communicator %d\n", my_comm);
MPI_Comm_free(&my_comm);
MPI_Finalize();
return 0;
}
outputs:
created communicator -2080374784
rank 1: message sent
freeing communicator -2080374784
created communicator -2080374784
rank 0: message received
freeing communicator -2080374784
The number you're seeing is simply a handle for the communicator. It's safe to reuse the handle since you've freed it. As to why you're able to send the message, look at how you're creating the communicator. When you use MPI_Comm_group, you're getting a group containing the ranks associated with the specified communicator. In this case, you get all of the ranks, since you are getting the group for MPI_COMM_WORLD. Then, you are using MPI_Comm_create to create a communicator based on a group of ranks. You are using the same group you just got, which contains all of the ranks. So your new communicator has all of the ranks from MPI_COMM_WORLD.
If you want your communicator to only contain a subset of ranks, you'll need to use a different function (or multiple functions) to make the desired group(s). I'd recommend reading through Chapter 6 of the MPI Standard; it includes all of the functions you'll need. Pick what you need to build the communicator you want.
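For example, a minimal sketch (not from the original answer; assumes size >= 2, rank/size as in the program above, and <stdlib.h> for malloc) that uses MPI_Group_incl to build a communicator containing only the first half of the ranks:
// Build a communicator from a subset of MPI_COMM_WORLD's ranks
MPI_Group world_group, sub_group;
MPI_Comm sub_comm;
int n = size / 2;
int *members = malloc(n * sizeof(int));
for (int i = 0; i < n; i++)
    members[i] = i;                       // ranks 0 .. n-1 form the subset
MPI_Comm_group(MPI_COMM_WORLD, &world_group);
MPI_Group_incl(world_group, n, members, &sub_group);
MPI_Comm_create(MPI_COMM_WORLD, sub_group, &sub_comm);
// sub_comm is MPI_COMM_NULL on ranks that are not in the group;
// messages on sub_comm can only match other messages on sub_comm
free(members);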

MPI in C and pointer transmissions

I am having another small problem. I have a pointer:
int *a;
Now, somewhere inside my main method, I allocate some space for it and assign a value:
a = (int *) malloc(sizeof(int));
*a=5;
...and then I attempt to transmit it (say, to process 1):
MPI_Bsend(a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
On the other end, if I try to receive into a pointer
int *b;
MPI_Recv(b, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
printf("This is what I received: %d \n", *b);
I get an error about the buffer!
However, if instead of declaring b as a pointer I do the following:
int b;
MPI_Recv(&b, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
printf("This is what I received: %d \n", b);
...all seems to be good! Could someone help me figure out what's happening, and how I can use only a pointer?
Thanks in advance!
The meaning of the line
MPI_Bsend(a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
is the following: "a is a location in memory where I have 1 integer. Send it."
In the code you posted above, this is absolutely true: a does point to an integer, and so it is sent. This is why you can receive it using your second method, since the meaning of the line
MPI_Recv(&b, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
is "receive 1 integer, and store it at &b. b is a regular int, so it's all good. In the first receive, you're trying to receive an integer into an int* variable there is no allocated memory that b is pointing to, so Recv has nowhere to write to. However, I should point out:
NEVER send a pointer's value (the address it stores) to another process in MPI
MPI processes cannot read each others' memory, and virtual addressing makes one process' pointer completely meaningless to another.
This problem is related to handling pointers and allocating memory; it's not an MPI-specific issue.
In your second variant, int b automatically allocates memory for one integer. By passing &b you are passing a pointer to an allocated memory segment. In your first variant, memory for the pointer itself is allocated, but NOT the memory the pointer is pointing to. Thus, when you pass in the pointer, MPI tries to write to non-allocated memory, which causes the error.
It would work this way though:
int *b = (int *) malloc(sizeof(int));
MPI_Recv(b, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
The error you are getting is that MPI_Recv is writing its result to the memory b points to, which you don't own and haven't initialised.
I'm not an expert on MPI, but surely you can't transfer a pointer (i.e. a memory address) to a process that could be running on another machine!

MPI Barrier C++

I want to use MPI (MPICH2) on Windows. I write this command:
MPI_Barrier(MPI_COMM_WORLD);
and I expect it to block all processes until all group members have called it. But that is not what happens. Here is a schematic of my code:
int a;
if (myrank == RootProc)
    a = 4;
MPI_Barrier(MPI_COMM_WORLD);
cout << "My Rank = " << myrank << "\ta = " << a << endl;
(With 2 processes:) The root process (rank 0) acts correctly, but the process with rank 1 doesn't know the a variable, so it displays -858993460 instead of 4.
Can anyone help me?
Regards
You're only assigning a in process 0. MPI doesn't share memory, so if you want the a in process 1 to get the value of 4, you need to call MPI_Send from process 0 and MPI_Recv from process 1.
Variable a is not initialized - that is probably why it displays that number. In MPI, variable a is duplicated across the processes - so there are two copies of a, one of which is uninitialized. You want to write:
int a = 4;
if (myrank == RootProc)
...
Or, alternatively, do an MPI_Send in the root (rank 0) and an MPI_Recv in the slave (rank 1), so the value set in the root is also set in the slave.
Note: that code triggers a small alarm in my head, so I need to check something and I'll edit this with more info. Until then, though, the uninitialized value is most certainly a problem for you.
OK, I've checked the facts - your code was not properly indented and I missed the missing {}. The barrier looks fine now, although the snippet you posted does not do much, and is not a very good example of a barrier, because the slave enters it directly, whereas the root sets the value of the variable to 4 and then enters it. To test that it actually works, you probably want some sort of sleep mechanism in one of the processes - that will yield to the other process as well, preventing it from printing the cout until the sleep is over.
Blocking is not enough; you also have to send data to the other processes (memory is not shared between processes).
To share data across ALL processes use:
int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
so in your case:
MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
Here you send one integer pointed to by &a from process 0 to all the others. (MPI_Bcast acts as the sender on the root process and as the receiver on non-root processes.)
You can also send data to a specific process with:
int MPI_Send(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
and then receive by:
int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
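Putting the pieces together, a minimal sketch (in C, assuming it is run with at least one process) of the Bcast approach applied to the original snippet:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int myrank, a;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0)
        a = 4;                                     // only the root sets the value
    MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);  // now every rank has a == 4
    printf("My Rank = %d\ta = %d\n", myrank, a);
    MPI_Finalize();
    return 0;
}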
