MPI recv from an unknown source

I am implementing in MPI a program in which the main process (with rank=0) should be able to receive requests from the other processes who ask for values of variables that are only known by the root.
If rank 0 calls MPI_Recv(...), I have to specify the rank of the process that sends the request to the root, but I cannot control that, since the processes do not run in the order 1, 2, 3, ....
How can I receive the request from any rank and use the number of the emitting process to send it the necessary information?

This assumes you are using C. There are similar concepts in C++ and Fortran. You would just specify MPI_ANY_SOURCE as the source in the MPI_Recv(). The status struct contains the actual source of the message.
int buf[32];
MPI_Status status;

// receive a message from any source, with any tag
MPI_Recv(buf, 32, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

int replybuf[32];
// ... fill replybuf with the requested values ...

// send the reply back to the sender of the message received above
MPI_Send(replybuf, 32, MPI_INT, status.MPI_SOURCE, status.MPI_TAG, MPI_COMM_WORLD);

MPI_ANY_SOURCE is the obvious answer.
However, if all the ranks will be sending a request to rank 0, then MPI_Irecv combined with MPI_Testall might also work as a pattern. This will allow the MPI_Send calls to be executed in any order, and the information can be received and processed in the order that the MPI_Irecv calls are matched.
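Here is a minimal sketch of that pattern, assuming every worker rank 1..nprocs-1 sends exactly one integer request to rank 0: post one MPI_Irecv per worker up front, then poll with MPI_Testall until all requests have arrived. The function name, buffer sizes, and tag are illustrative assumptions.

void gather_requests(int nprocs, MPI_Comm comm)
{
    int nworkers = nprocs - 1;
    int *reqdata = malloc(nworkers * sizeof(int));
    MPI_Request *reqs = malloc(nworkers * sizeof(MPI_Request));

    // post a nonblocking receive for each worker rank
    for (int i = 0; i < nworkers; i++)
        MPI_Irecv(&reqdata[i], 1, MPI_INT, i + 1, 0, comm, &reqs[i]);

    int done = 0;
    while (!done) {
        // do other useful work here, then check whether all requests arrived
        MPI_Testall(nworkers, reqs, &done, MPI_STATUSES_IGNORE);
    }

    // reqdata[i] now holds the request from rank i+1
    free(reqs);
    free(reqdata);
}

(Requires <mpi.h> and <stdlib.h>.) MPI_Waitall can be used instead of the polling loop if rank 0 has nothing else to do in the meantime.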

Related

Is there a way to cancel a blocking `MPI_Probe` call?

The MPI_Irecv and MPI_Isend operations return an MPI_Request that can later be marked as cancelled with MPI_Cancel. Is there a similar mechanism for the blocking MPI_Probe and MPI_Mprobe?
The context of the question is the latest implementation of Boost.MPI request handlers using Probe.
EDIT - Here is an example of how a hypothetical MPI_Probecancel could be used:
#include <mpi.h>
#include <chrono>
#include <future>
#include <thread>

using namespace std::literals::chrono_literals;

// Executed in a thread
void async_cancel(MPI_Probe *probe)
{
    std::this_thread::sleep_for(1s);
    int res = MPI_Probecancel(probe);
}

int main(int argc, char* argv[])
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
        // A handle to the probe (similar to a request)
        MPI_Probe probe;

        // Start a thread
        // `probe` will be filled by the next call, pretty ugly
        // Ideally, this should be done in two steps like MPI_Irecv / MPI_Wait
        auto res = std::async(std::launch::async, &async_cancel, &probe);

        MPI_Message message;
        MPI_Status status;
        MPI_Mprobe(1, 123, MPI_COMM_WORLD, &message, &status, &probe);

        if (!probe.cancelled)
        {
            int buffer;
            MPI_Mrecv(&buffer, 1, MPI_INT, &message, &status);
        }
    }
    else
        std::this_thread::sleep_for(2s);

    MPI_Finalize();
    return 0;
}
First, the premise / nomenclature of your question is wrong. It is the nonblocking calls MPI_Irecv and MPI_Isend that return a request object you may cancel. For these calls, you cancel the local operation.
MPI_Probe and MPI_Mprobe are in fact blocking. You cannot possibly cancel these operations in the sense that control flow will only leave when a message is available.
On the other hand, MPI_Iprobe and MPI_Improbe are nonblocking, meaning they always complete immediately, telling you whether a message is available.
Neither kind of probe call leaves any local state behind after completion, so there is nothing that could be cancelled locally after the functions return.
That said, if a probe tells you that a message is available, you should definitely receive it. Otherwise a send operation may block and you would leak resources on all sides. But that is just a receive operation.
Edit: Regarding your idea to cancel an ongoing local MPI_Probe in a concurrent thread: this is not directly supported.
Theoretically, you could emulate this on a conforming implementation with MPI_THREAD_MULTIPLE by running the probe on MPI_ANY_SOURCE and sending a message to the same rank from the other thread. That, of course, has the consequence that you must then probe for messages from any incoming rank.
Realistically, if you have to do this, you would probably just use a loop like while(!cancelled) MPI_Iprobe();.
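A minimal sketch of that polling loop, with a hypothetical `cancelled` flag set by another thread; the flag and the function name are illustrative assumptions, not part of MPI:

#include <mpi.h>
#include <stdbool.h>
#include <stdatomic.h>

atomic_bool cancelled = false;   /* another thread sets this to true to cancel */

/* returns 1 if a message is available, 0 if the wait was cancelled */
int probe_or_cancel(int source, int tag, MPI_Comm comm, MPI_Status *status)
{
    int flag = 0;
    while (!flag && !atomic_load(&cancelled))
        MPI_Iprobe(source, tag, comm, &flag, status);
    return flag;
}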
That said, I would again question the scenario: How would another thread on your rank suddenly know to cancel a local MPI_Probe operation? It would probably have to be based on information received from a remote rank - in which case that would be covered by actually being able to receive information from it, i.e. the actual Probe would complete.
Maybe for some high-level abstraction it makes sense to offer a local cancel, but in an actual practical situation I believe you could design an idiomatic pattern without needing this.

When to use MPI_BUFFER_ATTACH?

As far as I know, MPI_BUFFER_ATTACH must be called by a process if it is going to do buffered communication. But does this include the standard MPI_SEND as well? We know that MPI_SEND may behave either as a synchronous send or as a buffered send.
You need to call MPI_Buffer_attach() only if you plan to perform (explicitly) buffered sends via MPI_Bsend().
If you only plan to MPI_Send() or MPI_Isend(), then you do not need to invoke MPI_Buffer_attach().
FWIW, buffered sends are error prone and I strongly encourage you not to use them.
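For completeness, here is a minimal sketch of the explicitly buffered pattern that the reference material below describes: size the buffer with MPI_Pack_size plus MPI_BSEND_OVERHEAD, attach it, MPI_Bsend, then detach (which waits for the buffered message to drain). The destination, tag, and message size are illustrative assumptions.

#include <mpi.h>
#include <stdlib.h>

void bsend_example(int dest, MPI_Comm comm)
{
    int data[20] = {0};
    int packsize, bufsize;
    void *buf;

    /* compute the space one Bsend of 20 ints needs */
    MPI_Pack_size(20, MPI_INT, comm, &packsize);
    bufsize = packsize + MPI_BSEND_OVERHEAD;

    buf = malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);

    MPI_Bsend(data, 20, MPI_INT, dest, 0, comm);   /* returns once data is copied into buf */

    MPI_Buffer_detach(&buf, &bufsize);             /* blocks until buffered sends complete */
    free(buf);
}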
MPI_Buffer_attach
Attaches a user-provided buffer for sending
Synopsis
int MPI_Buffer_attach(void *buffer, int size)
Input Parameters
buffer
initial buffer address (choice)
size
buffer size, in bytes (integer)
Notes
The size given should be the sum of the sizes of all outstanding
Bsends that you intend to have, plus MPI_BSEND_OVERHEAD for each Bsend
that you do. For the purposes of calculating size, you should use
MPI_Pack_size. In other words, in the code
MPI_Buffer_attach( buffer, size );
MPI_Bsend( ..., count=20, datatype=type1, ... );
...
MPI_Bsend( ..., count=40, datatype=type2, ... );
the value of size in the MPI_Buffer_attach call should be greater than the value computed by
MPI_Pack_size( 20, type1, comm, &s1 );
MPI_Pack_size( 40, type2, comm, &s2 );
size = s1 + s2 + 2 * MPI_BSEND_OVERHEAD;
The MPI_BSEND_OVERHEAD gives the maximum amount of space that may be used in the buffer for use by the BSEND routines in using the buffer. This value is in mpi.h (for C) and mpif.h (for Fortran).
Thread and Interrupt Safety
The user is responsible for ensuring that multiple threads do not try to update the same MPI object from different threads. This routine should not be used from within a signal handler.
The MPI standard defined a thread-safe interface but this does not mean that all routines may be called without any thread locks. For example, two threads must not attempt to change the contents of the same MPI_Info object concurrently. The user is responsible in this case for using some mechanism, such as thread locks, to ensure that only one thread at a time makes use of this routine. Because the buffer for buffered sends (e.g., MPI_Bsend) is shared by all threads in a process, the user is responsible for ensuring that only one thread at a time calls this routine or MPI_Buffer_detach.
Notes for Fortran
All MPI routines in Fortran (except for MPI_WTIME and MPI_WTICK) have an additional argument ierr at the end of the argument list. ierr is an integer and has the same meaning as the return value of the routine in C. In Fortran, MPI routines are subroutines, and are invoked with the call statement.
All MPI objects (e.g., MPI_Datatype, MPI_Comm) are of type INTEGER in Fortran.
Errors
All MPI routines (except MPI_Wtime and MPI_Wtick) return an error value; C routines as the value of the function and Fortran routines in the last argument. Before the value is returned, the current MPI error handler is called. By default, this error handler aborts the MPI job. The error handler may be changed with MPI_Comm_set_errhandler (for communicators), MPI_File_set_errhandler (for files), and MPI_Win_set_errhandler (for RMA windows). The MPI-1 routine MPI_Errhandler_set may be used but its use is deprecated. The predefined error handler MPI_ERRORS_RETURN may be used to cause error values to be returned. Note that MPI does not guarantee that an MPI program can continue past an error; however, MPI implementations will attempt to continue whenever possible.
MPI_SUCCESS
No error; MPI routine completed successfully.
MPI_ERR_BUFFER
Invalid buffer pointer. Usually a null buffer where one is not valid.
MPI_ERR_INTERN
An internal error has been detected. This is fatal. Please send a bug report to mpi-bugs@mcs.anl.gov.
See Also MPI_Buffer_detach, MPI_Bsend
Refer Here For More
Buffer allocation and usage
Programming with MPI
MPI - Bsend usage

Do all processes need all the data when invoking MPI_Comm_spawn?

I wonder if it is necessary that all the processes have the needed data when, for instance, an MPI_Comm_spawn is invoked.
When invoking this function, a root process is defined to drive the operation and this rank, obviously, must give the appropriate parameters to the function. I.e.:
MPI_Comm_spawn("./a.out", &argvs, maxprocs, info, 0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
If we know that rank 0 is the root, do the other processes need to have set the variables: argvs, maxprocs and info? Or would it be enough to have that information in the rank 0?
No, this seems clear from the documentation:
MPI_Comm_spawn
where
argv: arguments to command (array of strings, significant only at root)
maxprocs: maximum number of processes to start (integer, significant only at root)
info: a set of key-value pairs telling the runtime system where and how to start the processes (handle, significant only at root)
So the "other" processes don't need to set those above as far as I understand from the documentation.
Just keep in mind that it is a collective call.
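Since the call is collective, here is a minimal sketch of that form: every rank of MPI_COMM_WORLD makes the call, but only the designated root (0 here) needs meaningful values for the first four arguments. The "./a.out" command and maxprocs = 4 are illustrative assumptions.

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Comm intercomm;
    int maxprocs = (rank == 0) ? 4 : 0;   /* significant only at root */

    /* collective over MPI_COMM_WORLD; argv, maxprocs, info matter only at root */
    MPI_Comm_spawn("./a.out", MPI_ARGV_NULL, maxprocs, MPI_INFO_NULL, 0,
                   MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Finalize();
    return 0;
}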
Though the first four arguments to MPI_Comm_spawn are only significant at the designated root process, all processes in the communicator must make the call. They thereby acquire the ability to communicate with the child job via the intercommunicator returned by MPI_Comm_spawn. If you do not need all ranks in the initial job to be able to communicate with the child job, but only the root rank, you can use MPI_COMM_SELF instead of MPI_COMM_WORLD:
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    MPI_Comm_spawn("./a.out", args, maxprocs, info, 0, MPI_COMM_SELF,
                   &intercomm, MPI_ERRCODES_IGNORE);
}

MPI cannot send data to itself with MPI_Send and MPI_Recv

I'm trying to implement MPI_Bcast, and I'm planning to do that with MPI_Send and MPI_Recv, but it seems I cannot send a message to myself?
The code is as follows:
void My_MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm) {
    int comm_rank, comm_size, i;
    MPI_Comm_rank(comm, &comm_rank);
    MPI_Comm_size(comm, &comm_size);
    if(comm_rank=root){
        for(i = 0; i < comm_size; i++){
            MPI_Send(buffer, count, datatype, i, 0, comm);
        }
    }
    MPI_Recv(buffer, count, datatype, root, 0, comm, MPI_STATUS_IGNORE);
}
Any suggestion on that? Or I should never send message to oneself and just do a memory copy?
Your program is erroneous on multiple levels. First of all, there is an error in the conditional:
if(comm_rank=root){
This does not compare comm_rank to root but rather assigns root to comm_rank and the loop would then only execute if root is non-zero and besides it would be executed by all ranks.
Second, the root process does not need to send data to itself, since the data is already there. Even if you'd like to send and receive anyway, note that both MPI_Send and MPI_Recv use the same buffer space, which is not correct. Some MPI implementations use direct memory copy for self-interaction, i.e. the library might use memcpy() to transfer the message, and calling memcpy() with overlapping buffers (including using the same buffer) leads to undefined behaviour.
The proper way to implement linear broadcast is:
void My_MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
{
    int comm_rank, comm_size, i;
    MPI_Comm_rank(comm, &comm_rank);
    MPI_Comm_size(comm, &comm_size);
    if (comm_rank == root)
    {
        for (i = 0; i < comm_size; i++)
        {
            if (i != comm_rank)
                MPI_Send(buffer, count, datatype, i, 0, comm);
        }
    }
    else
        MPI_Recv(buffer, count, datatype, root, 0, comm, MPI_STATUS_IGNORE);
}
The usual ways for a process to talk to itself without deadlocking are:
using a combination of MPI_Isend and MPI_Recv or a combination of MPI_Send and MPI_Irecv;
using buffered send MPI_Bsend;
using MPI_Sendrecv or MPI_Sendrecv_replace.
The combination of MPI_Irecv and MPI_Send works well in cases where multiple sends are done in a loop like yours. For example:
MPI_Request req;

// Start a non-blocking receive
MPI_Irecv(buff2, count, datatype, root, 0, comm, &req);

// Send to everyone
for (i = 0; i < comm_size; i++)
    MPI_Send(buff1, count, datatype, i, 0, comm);

// Complete the non-blocking receive
MPI_Wait(&req, MPI_STATUS_IGNORE);
Note the use of separate buffers for send and receive. Probably the only point-to-point MPI communication call that allows the same buffer to be used for both sending and receiving is MPI_Sendrecv_replace, along with the in-place modes of the collective MPI calls. But these are implemented internally in such a way that at no time is the same memory area used for both sending and receiving.
This is an incorrect program. You cannot rely on doing a blocking MPI_Send to yourself...because it may block. MPI does not guarantee that your MPI_Send returns until the buffer is available again. In some cases this could mean it will block until the message has been received by the destination. In your program, the destination may never call MPI_Recv, because it is still trying to send.
Now in your My_MPI_Bcast example, the root process already has the data. Why need to send or copy it at all?
The MPI_Send / MPI_Recv block on the root node can be a deadlock.
Converting to MPI_Isend could resolve the issue. However, there may be problems because the send buffer is being reused and root is very likely to reach the MPI_Recv "early" and may then alter that buffer before it is transmitted to the other ranks. This is especially likely on large jobs. Also, if this routine is ever called from Fortran, there could be issues with the buffer being corrupted on each MPI_Send call.
MPI_Sendrecv could be used for the root process only. That would allow the MPI_Send calls to all non-root ranks to "complete" (i.e. the send buffer can be safely altered) before the root process enters a dedicated MPI_Sendrecv. The loop would simply begin at 1 instead of 0, and the MPI_Sendrecv call would be added after that loop, as in the sketch below. (Why is a better question, since the data is in "buffer" and is going to "buffer".)
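A minimal sketch of that variant (assuming root == 0, as the "begin at 1" wording implies); the temporary receive buffer `tempbuf` is an illustration-only detail added to avoid overlapping send/receive memory:

#include <mpi.h>
#include <stdlib.h>

void My_MPI_Bcast_sendrecv(void *buffer, int count, MPI_Datatype datatype, MPI_Comm comm)
{
    int comm_rank, comm_size, i, typesize;
    MPI_Comm_rank(comm, &comm_rank);
    MPI_Comm_size(comm, &comm_size);

    if (comm_rank == 0)
    {
        /* send to all non-root ranks first */
        for (i = 1; i < comm_size; i++)
            MPI_Send(buffer, count, datatype, i, 0, comm);

        /* self-exchange with a distinct receive buffer */
        MPI_Type_size(datatype, &typesize);
        void *tempbuf = malloc((size_t)typesize * count);
        MPI_Sendrecv(buffer, count, datatype, 0, 0,
                     tempbuf, count, datatype, 0, 0,
                     comm, MPI_STATUS_IGNORE);
        free(tempbuf);
    }
    else
        MPI_Recv(buffer, count, datatype, 0, 0, comm, MPI_STATUS_IGNORE);
}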
However, all this begs the question: why are you doing this at all? If this is a simple "academic exercise" in writing a collective with point-to-point calls, so be it. But your approach is naive at best; this overall strategy would be beaten by any of the MPI_Bcast algorithms in any reasonably implemented MPI.
I think you should execute MPI_Recv(buffer, count, datatype, root, 0, comm, MPI_STATUS_IGNORE); only on ranks other than root, otherwise it will probably hang.

Will MPI_BCAST broadcast the master node variable and change the slave node variables?

In the following codes:
if (rank==0) master();
else slave();
...
void master()
{
    int i=0;
}
...
void slave()
{
    int i=1;
    MPI_BCAST(&i,1,MPI_INT,0,COMM);
}
Will this broadcast i (== 0) from the master node and set the value of i in all the slave nodes to 0?
It's a little unclear from what you've posted that you have the semantics right -- all processes in the communicator have to call MPI_BCAST; it's one of MPI's collective operations. Your program would then behave as if the process designated as the root in the call to MPI_BCAST sends the message to all the other processes in the designated communicator, which, in turn, receive it.
Your snippet suggests that you think that the call to MPI_BCAST would be successful if called only on what you call the 'slave' process(es), which would be incorrect.
Your syntax for the call is, however, correct.
EDIT in response to comment
I believe that all processes have to execute the piece of code which calls MPI_BCAST. If, as you suggest, the pseudo-code is like this:
if (myrank == master) then
    do_master_stuff ...
    call mpi_bcast(...)
end if
if (myrank /= master) then
    call mpi_bcast(...)
    do_worker_stuff ...
end if
then the call will fail; it's likely that your program will stall until the job management or operating system notices and chucks it out. There is no mechanism within MPI for 'matching' calls to MPI_BCAST (or any of the other collective communications routines) across scopes.
Your pseudo-code ought to be like this
if (myrank == master) then
    do_master_stuff ...
end if
if (myrank /= master) then
    do_worker_stuff ...
end if
! all together now
call mpi_bcast(...)
or whatever variant your program requires
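A minimal sketch of the correct pattern in C (assuming rank 0 is the root): every rank calls MPI_Bcast, and after the call i equals 0 everywhere.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int i = (rank == 0) ? 0 : 1;   /* master has 0, workers start with 1 */

    /* collective: called by every rank; the root's value overwrites the others */
    MPI_Bcast(&i, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d: i = %d\n", rank, i);   /* prints 0 on every rank */

    MPI_Finalize();
    return 0;
}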
