When to use tags when sending and receiving messages in MPI? - mpi

I'm not sure when I have to use different numbers for the tag field in MPI send, receive calls. I've read this, but I can't understand it.
Sometimes there are cases when A might have to send many different
types of messages to B. Instead of B having to go through extra
measures to differentiate all these messages, MPI allows senders and
receivers to also specify message IDs with the message (known as
tags). When process B only requests a message with a certain tag
number, messages with different tags will be buffered by the network
until B is ready for them.
Do I have to use tags, for example, when I have multiple "isend" calls (with different tags) from process A and only one "ireceive" call in process B?

Message tags are optional. You can use arbitrary integer values for them and use whichever semantics you like and seem useful to you.
Like you suggested, tags can be used to differentiate between messages that consist of different types (MPI_INTEGER, MPI_REAL, MPI_BYTE, etc.). You could also use tags to add some information about what the data actually represents (if you have an nxn matrix, a message to send a row of this matrix will consist of n values, as will a message to send a column of that matrix; nevertheless, you may want to treat row and column data differently).
Note that the receive operation has to match the tag of a message it wants to receive. This, however, does not mean that you have to specify the same tag, you can also use the wildcard MPI_ANY_TAG as message tag; the receive operation will then match arbitrary message tags. You can find out which tag the sender used with the help of MPI_Probe.
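For illustration, here is a minimal sketch (assuming at least two ranks; the ROW_TAG/COL_TAG values are arbitrary choices for this example, not anything mandated by MPI) of receiving with MPI_ANY_TAG via MPI_Probe and then looking at the tag the sender actually used:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { ROW_TAG = 1, COL_TAG = 2 };   /* arbitrary tag values */
    double value = 3.14;

    if (rank == 0) {
        /* the tag tells the receiver what the payload represents */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, ROW_TAG, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        /* match any tag; the message envelope tells us which one was used */
        MPI_Probe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        double recv_value;
        MPI_Recv(&recv_value, 1, MPI_DOUBLE, 0, status.MPI_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Got %f, tag %d (%s data)\n", recv_value, status.MPI_TAG,
               status.MPI_TAG == ROW_TAG ? "row" : "column");
    }

    MPI_Finalize();
    return 0;
}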

In general, I tend to avoid them; there is no requirement that you use tags. If you need to get the message size before parsing the message, you can use MPI_Probe, which lets you distinguish different messages without relying on tags. That said, I typically use tags because MPI_Recv requires that you know the message size before fetching the data. If you have different sizes and types, tags can help you differentiate between them by having multiple threads or processes each listen for a different subset: tag 1 can mean messages of type X and tag 2 messages of type Y. Tags also give you multiple "channels" of communication without the work of creating separate communicators and groups.
#include <mpi.h>
#include <iostream>
using namespace std;

// Note: this example expects exactly 3 processes (rank 0 sends to ranks 1 and 2)
int main( int argc, char* argv[] )
{
    // Init MPI
    MPI_Init( &argc, &argv );

    // Get the rank and size
    int rank, size;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    // If Master
    if( rank == 0 ){
        const char* message_r1 = "Hello Rank 1";   // 12 chars + terminating '\0' = 13
        const char* message_r2 = "Hello Rank 2";
        // Send a message to rank 1 over tag 0
        MPI_Send( message_r1, 13, MPI_CHAR, 1, 0, MPI_COMM_WORLD );
        // Send a message to rank 2 over tag 1
        MPI_Send( message_r2, 13, MPI_CHAR, 2, 1, MPI_COMM_WORLD );
    }
    else{
        // Buffer
        char buffer[256];
        MPI_Status status;
        // Wait for your own message (tag = rank-1)
        MPI_Recv( buffer, 13, MPI_CHAR, 0, rank-1, MPI_COMM_WORLD, &status );
        cout << "Rank: " << rank << ", Message: " << buffer << endl;
    }

    // Finalize MPI
    MPI_Finalize();
}

Tags can be useful in distributed computing algorithms where there can be multiple types of messages. Consider the leader election problem, where a process (election candidate) sends a message of type requestVote and the other processes respond with a message of type voteGrant.
There are many such algorithms that distinguish between types of messages, and the tag can be useful to categorize such messages.
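As a hedged illustration of that idea (the tag values, payload, and the choice of rank 0 as the candidate below are all made up for this sketch, not taken from any particular election algorithm), the two message kinds could simply travel under two different tags:
#include <mpi.h>
#include <stdio.h>

/* Illustrative only: hypothetical tag values for an election protocol */
#define REQUEST_VOTE_TAG 10
#define VOTE_GRANT_TAG   11

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int candidate = 0;            /* rank 0 plays the candidate */
    int term = 1;                       /* example payload */

    if (rank == candidate) {
        int votes = 0, granted;
        /* ask every other rank for a vote */
        for (int p = 0; p < size; ++p)
            if (p != rank)
                MPI_Send(&term, 1, MPI_INT, p, REQUEST_VOTE_TAG, MPI_COMM_WORLD);
        /* collect the replies, which arrive under a different tag */
        for (int p = 0; p < size; ++p)
            if (p != rank) {
                MPI_Recv(&granted, 1, MPI_INT, p, VOTE_GRANT_TAG,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                votes += granted;
            }
        printf("candidate got %d votes\n", votes);
    } else {
        int received_term, granted = 1;
        MPI_Recv(&received_term, 1, MPI_INT, candidate, REQUEST_VOTE_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&granted, 1, MPI_INT, candidate, VOTE_GRANT_TAG, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}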

Related

broadcasting structs without knowing attributes

I'm trying to use MPI_Bcast to share an instance of cudaIpcMemHandle_t, but I cannot figure out how to create the corresponding MPI_Datatype needed for Bcast. I do not know the underlying structure of the cuda type, hence methods like this one don't seem to work. Am I missing something?
Following up from the useful comment, I have a solution that seems to work, though it's only been tested in a toy program:
// Broadcast the memory handle
cudaIpcMemHandle_t memHandle[1];
if (rank == 0){
    // set the handle content, e.g. call cudaIpcGetMemHandle
}
MPI_Barrier(MPI_COMM_WORLD);

// share the size of the handle with the other ranks
int hand_size[1];
if (rank == 0)
    hand_size[0] = sizeof(memHandle[0]);
MPI_Bcast(&hand_size[0], 1, MPI_INT, 0, MPI_COMM_WORLD);

// broadcast the handle as a byte array
char memHand_C[hand_size[0]];
if (rank == 0)
    memcpy(memHand_C, &memHandle[0], hand_size[0]);
MPI_Bcast(memHand_C, hand_size[0], MPI_BYTE, 0, MPI_COMM_WORLD);
if (rank > 0)
    memcpy(&memHandle[0], memHand_C, hand_size[0]);
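Since cudaIpcMemHandle_t has a fixed size known at compile time, sizeof evaluates to the same value on every rank as long as all ranks run the same binary, so (as an untested simplification of the snippet above) the size exchange could arguably be skipped and the handle broadcast directly as raw bytes:
// Sketch only: broadcast the handle itself as raw bytes from rank 0
cudaIpcMemHandle_t memHandle;
if (rank == 0) {
    // fill memHandle here, e.g. via cudaIpcGetMemHandle(&memHandle, devPtr);
}
MPI_Bcast(&memHandle, sizeof(memHandle), MPI_BYTE, 0, MPI_COMM_WORLD);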

How to check completion of non-blocking reduce?

It is not clear to me how to properly use non-blocking collectives in MPI, particularly MPI_Ireduce() in this case:
Say I want to collect a sum at the root rank:
int local_cnt;
int total_cnt;
MPI_Request request;
MPI_Ireduce(&local_cnt, &total_cnt, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &request);
/* now I want to check if the reduce is finished */
if (rank == 0) {
int flag = 0;
MPI_Status status;
MPI_Test(&request, &flag, &status);
if (flag) {
/* reduce is finished? */
}
}
Is this the correct way to check whether a non-blocking reduce is done? My confusion comes from two aspects: first, can or should only the root process check for completion using MPI_Test(), since the result is meaningful only to the root? Second, since MPI_Test() is a local operation, how can it know that the operation is complete? It does require all processes to finish, right?
You must check for completion on all participating ranks, not just the root.
From a user perspective, you need to know about the completion of the communication because you must not do anything with the memory provided to a non-blocking operation until that operation has completed. I.e. if you send a local-scope variable like local_cnt, you cannot write to it or leave its scope before you have confirmed that the operation is complete.
One option to ensure completion is calling MPI_Test until it eventually returns flag==true. Use this only if you can do something useful between the calls to MPI_Test:
{
    int local_cnt;
    int total_cnt;
    // fill local_cnt on all ranks
    MPI_Request request;
    MPI_Ireduce(&local_cnt, &total_cnt, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &request);
    int flag;
    do {
        // perform some useful computation
        MPI_Status status;
        MPI_Test(&request, &flag, &status);
    } while (!flag);
}
Do not call MPI_Test in a loop if you have nothing useful to do in between calls. Instead use MPI_Wait, which blocks until completion.
{
    int local_cnt;
    int total_cnt;
    // fill local_cnt on all ranks
    MPI_Request request;
    MPI_Ireduce(&local_cnt, &total_cnt, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &request);
    // perform some useful computation
    MPI_Status status;
    MPI_Wait(&request, &status);
}
Remember, if you have no useful computation at all, and don't need to be non-blocking for deadlock reasons, use a blocking communication in the first place. If you have multiple ongoing non-blocking communications, there are MPI_Waitany, MPI_Waitsome, MPI_Waitall and their Test variants.
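As an illustration (assuming an MPI-3 implementation for MPI_Ibarrier; the two operations here are just placeholders, and the variables are the ones from the snippets above), several outstanding requests can be completed in a single call with MPI_Waitall:
// Sketch: complete two outstanding non-blocking operations at once
MPI_Request requests[2];
MPI_Status  statuses[2];
MPI_Ireduce(&local_cnt, &total_cnt, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &requests[0]);
MPI_Ibarrier(MPI_COMM_WORLD, &requests[1]);
// ... useful computation ...
MPI_Waitall(2, requests, statuses);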
Zulan brilliantly answered the first part of your question.
MPI_Reduce() returns when:
- the send buffer can be overwritten on a non-root rank
- the result is available on the root rank (which implies all the ranks have completed)
So there is no way for a non-root rank to know whether the root rank has completed. If you do need this information, then you need to manually add an MPI_Barrier(). That being said, you generally do not require this information, and if you believe you do need it, there might be something wrong with your app.
This remains true if you use non-blocking collectives: e.g. when the MPI_Wait() corresponding to an MPI_Ireduce() completes on a non-root rank, that simply means the send buffer can be overwritten.
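If you really do need every rank to know that the root has the result, a minimal sketch of the pattern described above (a reduce followed by an explicit barrier; variable names reused from the earlier snippets) would be:
MPI_Reduce(&local_cnt, &total_cnt, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
// Only after the barrier does every rank know the root holds the reduced result
MPI_Barrier(MPI_COMM_WORLD);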

Using MPI_Scatter with 3 processes

I am new to MPI, and my question is: how does the root (for example rank 0) initialize all its values (in the array) before the other processes receive their i-th value from the root?
for example:
in the root I initialize: arr[0]=20, arr[1]=90, arr[2]=80.
My question is: if, for example, process number 2 starts a little bit before the root process, can MPI_Scatter send an incorrect value instead of 80?
How can I ensure that the root has initialized all its memory before the others use Scatter?
Thank you !
The MPI standard specifies that
If comm is an intracommunicator, the outcome is as if the root executed n send operations, MPI_Send(sendbuf + i*sendcount*extent(sendtype), sendcount, sendtype, i, ...), and each process executed a receive, MPI_Recv(recvbuf, recvcount, recvtype, i, ...).
This means that each non-root process will wait until its respective recvcount elements have been transmitted. This is also known as a blocking routine: the process waits until the communication it takes part in is completed.
You as the programmer are responsible for ensuring that the data being sent is correct by the time you call any communication routine, and that it stays that way until the send buffer is available again (in this case, until MPI_Scatter returns). In an MPI-only program this is as simple as placing the initialization code before the call to MPI_Scatter, as each process executes the program sequentially.
The following is an example based on the document's Example 5.11:
MPI_Comm comm = MPI_COMM_WORLD;
int grank, gsize, *sendbuf;
int root, *rbuf;
MPI_Comm_rank( comm, &grank );
MPI_Comm_size( comm, &gsize );
root = 0;
if( grank == root ) {
    sendbuf = (int *)malloc(gsize*100*sizeof(int));
    // Initialize sendbuf; before this loop its contents are indeterminate.
    for( int i = 0; i < gsize * 100; i++ )
        sendbuf[i] = i;
}
rbuf = (int *)malloc(100*sizeof(int));
// Distribute the sendbuf data:
//   at the root process, all sendbuf values are valid by now;
//   in non-root processes, the sendbuf argument is ignored.
MPI_Scatter(sendbuf, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);
MPI_Scatter() is a collective operation, so the MPI library takes care of everything, and the outcome of a collective operation does not depend on which rank called it before another.
In this specific case, a non-root rank will block (at least) until the root rank calls MPI_Scatter().
This is no different from an MPI_Send() / MPI_Recv() pair:
MPI_Recv() blocks if it is called before the remote peer has MPI_Send()'d a matching message.

C MPI multiple dynamic array passing

I'm trying to Isend() two arrays, arr1 and arr2, and an integer n which is the size of arr1 and arr2. I understood from this post that sending a struct that contains all three is not an option since n is only known at run time. Obviously, I need n to be received first since otherwise the receiving process wouldn't know how many elements to receive. What's the most efficient way to achieve this without using the blocking Send()?
Sending the size of the array is redundant (and inefficient) as MPI provides a way to probe for incoming messages without receiving them, which provides just enough information in order to properly allocate memory. Probing is performed with MPI_PROBE, which looks a lot like MPI_RECV, except that it takes no buffer related arguments. The probe operation returns a status object which can then be queried for the number of elements of a given MPI datatype that can be extracted from the content of the message with MPI_GET_COUNT, therefore explicitly sending the number of elements becomes redundant.
Here is a simple example with two ranks:
if (rank == 0)
{
    MPI_Request req;
    // Send a message to rank 1
    MPI_Isend(arr1, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    // Do not forget to complete the request!
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
else if (rank == 1)
{
    MPI_Status status;
    // Wait for a message from rank 0 with tag 0
    MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
    // Find out the number of elements in the message -> size goes to "n"
    MPI_Get_count(&status, MPI_DOUBLE, &n);
    // Allocate memory
    arr1 = malloc(n*sizeof(double));
    // Receive the message, ignoring the status
    MPI_Recv(arr1, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
MPI_PROBE also accepts the wildcard rank MPI_ANY_SOURCE and the wildcard tag MPI_ANY_TAG. One can then consult the corresponding entry in the status structure in order to find out the actual sender rank and the actual message tag.
Probing for the message size works because every message carries a header, called the envelope. The envelope consists of the sender's rank, the receiver's rank, the message tag and the communicator. It also carries information about the total message size. Envelopes are sent as part of the initial handshake between the two communicating processes.
First, you need to allocate memory (the full n elements) for arr1 and arr2 on rank 0, i.e. your front-end processor.
Divide the arrays into parts depending on the number of processors and determine the element count for each processor.
Send this element count to the other processors from rank 0.
The second send is for the arrays themselves, i.e. arr1 and arr2.
On the other processors, allocate arr1 and arr2 according to the element count received from the main processor, i.e. rank 0. After receiving the element count, receive the two arrays into the allocated memory.
This is a sample C++ implementation, but C follows the same logic; to make it non-blocking, just interchange Send with Isend.
#include <mpi.h>
#include <iostream>
using namespace std;

int main(int argc, char* argv[])
{
    MPI::Init(argc, argv);
    int rank = MPI::COMM_WORLD.Get_rank();
    int no_of_processors = MPI::COMM_WORLD.Get_size();
    MPI::Status status;

    double *arr1;
    if (rank == 0)
    {
        // Setting some Random n
        int n = 10;

        arr1 = new double[n];
        for(int i = 0; i < n; i++)
        {
            arr1[i] = i;
        }

        int part   = n / no_of_processors;
        int offset = n % no_of_processors;
        // cout << part << "\t" << offset << endl;

        for(int i = 1; i < no_of_processors; i++)
        {
            int start = i*part;
            int end   = start + part - 1;
            if (i == (no_of_processors-1))
            {
                end += offset;
            }
            // cout << i << " Start: " << start << " END: " << end;

            // Element_Count
            int e_count = end - start + 1;
            // cout << " e_count: " << e_count << endl;

            // Sending the element count
            MPI::COMM_WORLD.Send(
                &e_count,
                1,
                MPI::INT,
                i,
                0
            );

            // Sending Arr1
            MPI::COMM_WORLD.Send(
                (arr1+start),
                e_count,
                MPI::DOUBLE,
                i,
                1
            );
        }
    }
    else
    {
        // Element Count
        int e_count;

        // Receiving the element count
        MPI::COMM_WORLD.Recv(
            &e_count,
            1,
            MPI::INT,
            0,
            0,
            status
        );

        arr1 = new double[e_count];
        // Receiving First Array
        MPI::COMM_WORLD.Recv(
            arr1,
            e_count,
            MPI::DOUBLE,
            0,
            1,
            status
        );

        for(int i = 0; i < e_count; i++)
        {
            cout << arr1[i] << endl;
        }
    }

    // if(rank == 0)
    delete [] arr1;
    MPI::Finalize();
    return 0;
}
@Histro The point I want to make is that Irecv/Isend are functions handled internally by the MPI library. The answer to the question you asked depends completely on the rest of your code, i.e. on what you do after the Send/Recv. There are 2 cases:
Master and Worker
You send part of the problem (say arrays) to the workers (all other ranks except 0 = master). The workers do some work (on the arrays) and then return the results to the master. The master then adds up the results and conveys new work to the workers. Here you would want the master to wait for all the workers to return their results (modified arrays), so you cannot use Isend and Irecv, but rather multiple sends as used in my code and the corresponding receives. If your code goes in this direction, you want to use MPI_Bcast and MPI_Reduce.
Lazy Master
The master divides the work but doesn't care about the results from its workers. Say you want to compute several different kinds of patterns over the same data: given the characteristics of a city's population, you want to calculate how many people are above 18, how many have jobs, and how many work for some company. These results don't have anything to do with one another. In this case you don't have to worry about whether the data has been received by the workers or not; the master can continue to execute the rest of the code. This is where it is safe to use Isend/Irecv.

MPI Barrier C++

I want to use MPI (MPICH2) on Windows. I write this command:
MPI_Barrier(MPI_COMM_WORLD);
And I expect it to block all processes until all group members have called it. But that is not what happens. Here is a schematic of my code:
int a;
if(myrank == RootProc)
a = 4;
MPI_Barrier(MPI_COMM_WORLD);
cout << "My Rank = " << myrank << "\ta = " << a << endl;
(With 2 processes:) The root process (0) acts correctly, but the process with rank 1 doesn't know the a variable, so it displays -858993460 instead of 4.
Can anyone help me?
Regards
You're only assigning a in process 0. MPI doesn't share memory, so if you want the a in process 1 to get the value of 4, you need to call MPI_Send from process 0 and MPI_Recv from process 1.
Variable a is not initialized - that is probably why it displays that number. In MPI, variable a is duplicated between the processes - so there are two values of a, one of which is uninitialized. You want to write:
int a = 4;
if (myrank == RootProc)
...
Or, alternatively, do an MPI_Send in the root (id 0) and an MPI_Recv in the slave (id 1) so the value set in the root is also set in the slave.
Note: that code triggers a small alarm in my head, so I need to check something and I'll edit this with more info. Until then though, the uninitialized value is most certainly a problem for you.
Ok I've checked the facts - your code was not properly indented and I missed the missing {}. The barrier looks fine now, although the snippet you posted does not do too much, and is not a very good example of a barrier because the slave enters it directly, whereas the root will set the value of the variable to 4 and then enter it. To test that it actually works, you probably want some sort of a sleep mechanism in one of the processes - that will yield (hope it's the correct term) the other process as well, preventing it from printing the cout until the sleep is over.
Blocking is not enough, you have to send data to the other processes (memory is not shared between processes).
To share data across ALL processes use:
int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
so in your case:
MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
here you send one integer pointed to by &a from process 0 to all others.
//MPI_Bcast is sender for root process and receiver for non-root processes
You can also send some data to a specific process with:
int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm )
and then receive by:
int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
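For completeness, a minimal sketch (reusing the variable names from the question, assuming exactly two processes, and picking tag 0 arbitrarily) of the point-to-point alternative:
int a = 0;                       // give every rank a defined value first
if (myrank == 0) {
    a = 4;
    MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (myrank == 1) {
    MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
printf("My Rank = %d\ta = %d\n", myrank, a);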
