I wonder whether all the processes need to have the required data when, for instance, MPI_Comm_spawn is invoked.
When invoking this function, a root process is designated to drive the operation, and that rank obviously must pass the appropriate parameters to the function, e.g.:
MPI_Comm_spawn("./a.out", &argvs, maxprocs, info, 0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
If we know that rank 0 is the root, do the other processes need to set the variables argvs, maxprocs and info, or is it enough for rank 0 alone to have that information?
No, this seems clear from the documentation:
MPI_Comm_spawn
where
argv : arguments to command (array of strings, significant only at root)
maxprocs : maximum number of processes to start (integer, significant only at root)
info : a set of key-value pairs telling the runtime system where and how to start the processes (handle, significant only at root)
So the "other" processes don't need to set those above as far as I understand from the documentation.
Just keep in mind that it is a collective call.
Though the first four arguments to MPI_Comm_spawn are only significant at the designated root process, all processes in the communicator must make the call; in return, they all gain the ability to communicate with the child job via the intercommunicator returned by MPI_Comm_spawn. If only the root rank, rather than every rank in the initial job, needs to communicate with the child job, you can use MPI_COMM_SELF instead of MPI_COMM_WORLD:
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    MPI_Comm_spawn("./a.out", args, maxprocs, info, 0, MPI_COMM_SELF,
                   &intercomm, MPI_ERRCODES_IGNORE);
}
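On the child side, the spawned program (the ./a.out above) can obtain its half of the same intercommunicator via MPI_Comm_get_parent. A minimal sketch, just for illustration:
#include <mpi.h>
#include <stdio.h>

/* Sketch of the child side: the spawned job retrieves the
   intercommunicator to its parent job. */
int main(int argc, char *argv[])
{
    MPI_Comm parent;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        printf("not started via MPI_Comm_spawn\n");
    } else {
        int nparents;
        /* remote size = number of ranks in the communicator the parent
           passed to MPI_Comm_spawn (1 if MPI_COMM_SELF was used) */
        MPI_Comm_remote_size(parent, &nparents);
        printf("spawned by a job with %d rank(s)\n", nparents);
    }

    MPI_Finalize();
    return 0;
}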
Related
As far as I know, MPI_BUFFER_ATTACH must be called by a process if it is going to do buffered communication. But does this include the standard MPI_SEND as well? We know that MPI_SEND may behave either as a synchronous send or as a buffered send.
You need to call MPI_Buffer_attach() only if you plan to perform (explicitly) buffered sends via MPI_Bsend().
If you only plan to MPI_Send() or MPI_Isend(), then you do not need to invoke MPI_Buffer_attach().
FWIW, buffered sends are error prone and I strongly encourage you not to use them.
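For completeness, here is a minimal sketch of the explicitly buffered path that does need MPI_Buffer_attach(); the message size, destination rank and tag are made up for illustration, and the buffer is sized as described in the man page excerpt below:
#include <mpi.h>
#include <stdlib.h>

/* Sketch: explicitly buffered send of 100 ints to rank 1 (assumes the
   communicator has at least 2 ranks). Only MPI_Bsend needs the attached
   buffer; MPI_Send and MPI_Isend do not. */
void bsend_example(MPI_Comm comm)
{
    int data[100] = {0};
    int packed_size, bufsize;
    char *buf;

    /* packed size of the message plus MPI_BSEND_OVERHEAD */
    MPI_Pack_size(100, MPI_INT, comm, &packed_size);
    bufsize = packed_size + MPI_BSEND_OVERHEAD;
    buf = malloc(bufsize);

    MPI_Buffer_attach(buf, bufsize);
    MPI_Bsend(data, 100, MPI_INT, 1, 0, comm);

    /* detach blocks until all buffered messages have been delivered */
    MPI_Buffer_detach(&buf, &bufsize);
    free(buf);
}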
MPI_Buffer_attach
Attaches a user-provided buffer for sending
Synopsis
int MPI_Buffer_attach(void *buffer, int size)
Input Parameters
buffer : initial buffer address (choice)
size : buffer size, in bytes (integer)
Notes
The size given should be the sum of the sizes of all outstanding Bsends that you intend to have, plus MPI_BSEND_OVERHEAD for each Bsend that you do. For the purposes of calculating size, you should use MPI_Pack_size. In other words, in the code
MPI_Buffer_attach( buffer, size );
MPI_Bsend( ..., count=20, datatype=type1, ... );
...
MPI_Bsend( ..., count=40, datatype=type2, ... );
the value of size in the MPI_Buffer_attach call should be greater than the value computed by
MPI_Pack_size( 20, type1, comm, &s1 );
MPI_Pack_size( 40, type2, comm, &s2 );
size = s1 + s2 + 2 * MPI_BSEND_OVERHEAD;
MPI_BSEND_OVERHEAD gives the maximum amount of space in the buffer that the BSEND routines may use for their own bookkeeping. This value is defined in mpi.h (for C) and mpif.h (for Fortran).
Thread and Interrupt Safety
The user is responsible for ensuring that multiple threads do not try to update the same MPI object from different threads. This routine should not be used from within a signal handler.
The MPI standard defined a thread-safe interface but this does not mean that all routines may be called without any thread locks. For example, two threads must not attempt to change the contents of the same MPI_Info object concurrently. The user is responsible in this case for using some mechanism, such as thread locks, to ensure that only one thread at a time makes use of this routine. Because the buffer for buffered sends (e.g., MPI_Bsend) is shared by all threads in a process, the user is responsible for ensuring that only one thread at a time calls this routine or MPI_Buffer_detach.
Notes for Fortran
All MPI routines in Fortran (except for MPI_WTIME and MPI_WTICK) have an additional argument ierr at the end of the argument list. ierr is an integer and has the same meaning as the return value of the routine in C. In Fortran, MPI routines are subroutines, and are invoked with the call statement.
All MPI objects (e.g., MPI_Datatype, MPI_Comm) are of type INTEGER in Fortran.
Errors
All MPI routines (except MPI_Wtime and MPI_Wtick) return an error value; C routines as the value of the function and Fortran routines in the last argument. Before the value is returned, the current MPI error handler is called. By default, this error handler aborts the MPI job. The error handler may be changed with MPI_Comm_set_errhandler (for communicators), MPI_File_set_errhandler (for files), and MPI_Win_set_errhandler (for RMA windows). The MPI-1 routine MPI_Errhandler_set may be used but its use is deprecated. The predefined error handler MPI_ERRORS_RETURN may be used to cause error values to be returned. Note that MPI does not guarantee that an MPI program can continue past an error; however, MPI implementations will attempt to continue whenever possible.
MPI_SUCCESS
No error; MPI routine completed successfully.
MPI_ERR_BUFFER
Invalid buffer pointer. Usually a null buffer where one is not valid.
MPI_ERR_INTERN
An internal error has been detected. This is fatal. Please send a bug report to mpi-bugs@mcs.anl.gov.
See Also MPI_Buffer_detach, MPI_Bsend
Refer Here For More
Buffer allocation and usage
Programming with MPI
MPI - Bsend usage
I have some MPI processes which should write to the same file after they finish their task. The problem is that the length of the results is variable and I cannot assume that each process will write at a certain offset.
A possible approach would be to open the file in every process, write the output at the end of the file and then close it. But this way a race condition could occur.
How can I open and write to that file so that the result would be the expected one?
You might think you want the shared file or ordered mode routines. But these routines get little use and so are not well optimized (so they get little use... quite the cycle...)
I hope you intend to do this collectively. Then you can use MPI_SCAN to compute the offsets and call MPI_FILE_WRITE_AT_ALL to have the MPI library optimize the I/O for you.
(If you are doing this independently, then you will have to do something like... master slave? passing a token? fall back to the shared file pointer routines even though I hate them?)
Here's an approach for a good collective method:
MPI_Offset offset = 0, new_offset, incr;
MPI_Status status;
int ret;

incr = count * datatype_size;   /* number of bytes this rank will write */

/* you can skip this call and assume 'offset' is zero if you don't care
   about the current contents of the file */
MPI_File_get_position(mpi_fh, &offset);

/* turn the inclusive prefix sum into an exclusive one: each rank's offset
   is the total written by all lower ranks, plus the starting position */
MPI_Scan(&incr, &new_offset, 1, MPI_OFFSET, MPI_SUM, MPI_COMM_WORLD);
new_offset -= incr;
new_offset += offset;

ret = MPI_File_write_at_all(mpi_fh, new_offset, buf, count,
                            datatype, &status);
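For context, here is a self-contained sketch of the same pattern, assuming each rank appends a variable-length text chunk to a shared file; the file name output.txt and the function name write_shared are made up:
#include <mpi.h>
#include <string.h>

/* Sketch: every rank writes its chunk to a shared file, using MPI_Scan
   to compute its byte offset and a collective write. */
void write_shared(const char *text, MPI_Comm comm)
{
    MPI_File fh;
    MPI_Offset incr = strlen(text), my_offset;

    MPI_File_open(comm, "output.txt",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* inclusive prefix sum of chunk lengths, minus this rank's own length,
       gives the exclusive prefix sum = this rank's starting offset */
    MPI_Scan(&incr, &my_offset, 1, MPI_OFFSET, MPI_SUM, comm);
    my_offset -= incr;

    MPI_File_write_at_all(fh, my_offset, text, (int)incr, MPI_CHAR,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}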
I'm trying to implement MPI_Bcast, and I'm planning to do it with MPI_Send and MPI_Recv, but it seems I cannot send a message to myself?
The code is as follows:
void My_MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm) {
    int comm_rank, comm_size, i;
    MPI_Comm_rank(comm, &comm_rank);
    MPI_Comm_size(comm, &comm_size);
    if(comm_rank=root){
        for(i = 0; i < comm_size; i++){
            MPI_Send(buffer, count, datatype, i, 0, comm);
        }
    }
    MPI_Recv(buffer, count, datatype, root, 0, comm, MPI_STATUS_IGNORE);
}
Any suggestions? Or should I never send a message to myself and just do a memory copy instead?
Your program is erroneous on multiple levels. First of all, there is an error in the conditional:
if(comm_rank=root){
This does not compare comm_rank to root; it assigns root to comm_rank. The body of the if (and hence the loop) would then execute only if root is non-zero, and it would be executed by all ranks, not just the root.
Second, the root process does not need to send data to itself since the data is already there. Even if you'd like to send and receive anyway, you should notice that both MPI_Send and MPI_Recv use the same buffer space, which is not correct. Some MPI implementations use direct memory copy for self-interaction, i.e. the library might use memcpy() to transfer the message. Using memcpy() with overlapping buffers (including using the same buffer) leads to undefined behaviour.
The proper way to implement linear broadcast is:
void My_MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
{
    int comm_rank, comm_size, i;
    MPI_Comm_rank(comm, &comm_rank);
    MPI_Comm_size(comm, &comm_size);
    if (comm_rank == root)
    {
        for (i = 0; i < comm_size; i++)
        {
            if (i != comm_rank)
                MPI_Send(buffer, count, datatype, i, 0, comm);
        }
    }
    else
        MPI_Recv(buffer, count, datatype, root, 0, comm, MPI_STATUS_IGNORE);
}
The usual ways for a process to talk to itself without deadlocking are:
using a combination of MPI_Isend and MPI_Recv or a combination of MPI_Send and MPI_Irecv;
using buffered send MPI_Bsend;
using MPI_Sendrecv or MPI_Sendrecv_replace.
The combination of MPI_Irecv and MPI_Send works well in cases where multiple sends are done in a loop like yours. For example:
MPI_Request req;
// Start a non-blocking receive
MPI_Irecv(buff2, count, datatype, root, 0, comm, &req);
// Send to everyone
for (i = 0; i < comm_size; i++)
    MPI_Send(buff1, count, datatype, i, 0, comm);
// Complete the non-blocking receive
MPI_Wait(&req, MPI_STATUS_IGNORE);
Note the use of separate buffers for send and receive. Probably the only point-to-point MPI communication call that allows the same buffer to be used both for sending and receiving is MPI_Sendrecv_replace; the in-place modes of the collective MPI calls allow it as well. But these are implemented internally in such a way that the same memory area is never used for sending and receiving at the same time.
This is an incorrect program. You cannot rely on doing a blocking MPI_Send to yourself...because it may block. MPI does not guarantee that your MPI_Send returns until the buffer is available again. In some cases this could mean it will block until the message has been received by the destination. In your program, the destination may never call MPI_Recv, because it is still trying to send.
Now in your My_MPI_Bcast example, the root process already has the data. Why send or copy it at all?
The MPI_Send / MPI_Recv sequence on the root node can deadlock.
Converting to MPI_Isend could resolve the issue. However, there may be issues because the send buffer is being reused and root is VERY likely to reach the MPI_Recv "early" and then may alter that buffer before it is transmitted to the other ranks. This is especially likely on large jobs. Also, if this routine is ever called from Fortran, there could be issues with the buffer being corrupted on each MPI_Send call.
MPI_Sendrecv could be used for the root process alone. That would allow the MPI_Send's to all non-root ranks to "complete" (e.g. the send buffer can be safely altered) before the root process enters a dedicated MPI_Sendrecv. The for loop would simply begin with "1" instead of "0", and the MPI_Sendrecv call added at the bottom of that loop. (Why is a better question, since the data is already in "buffer" and is going to "buffer".)
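A rough sketch of that variant, reusing the names from the code above and assuming root is rank 0; MPI_Sendrecv_replace is used for the self-exchange instead of MPI_Sendrecv so that the single buffer can serve as both send and receive buffer:
if (comm_rank == root) {
    /* loop starts at 1, skipping the root itself */
    for (i = 1; i < comm_size; i++)
        MPI_Send(buffer, count, datatype, i, 0, comm);
    /* self-exchange at the root: the send and receive match each other,
       and the data ends up unchanged */
    MPI_Sendrecv_replace(buffer, count, datatype,
                         root, 0,   /* destination, send tag */
                         root, 0,   /* source, receive tag */
                         comm, MPI_STATUS_IGNORE);
} else {
    MPI_Recv(buffer, count, datatype, root, 0, comm, MPI_STATUS_IGNORE);
}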
However, all this raises the question: why are you doing this at all? If this is a simple "academic exercise" in writing a collective with point-to-point calls, so be it. BUT, your approach is naive at best. This overall strategy would be beaten by any of the MPI_Bcast algorithms in any reasonably implemented MPI.
I think you should call MPI_Recv(buffer, count, datatype, root, 0, comm, MPI_STATUS_IGNORE); only on the ranks other than root, otherwise it will probably hang.
In the following code:
if (rank==0) master();
else slave();
...
void master()
{
    int i = 0;
}
...
void slave()
{
    int i = 1;
    MPI_BCAST(&i, 1, MPI_INT, 0, COMM);
}
Will the broadcast pick up i (== 0) from the master node and set the value of i in all slave nodes to 0?
It's a little unclear from what you've posted whether you have the semantics right -- all processes in the communicator have to call MPI_BCAST, since it's one of MPI's collective operations. Your program would then behave as if the process designated as the root in the call to MPI_BCAST sends the message to all the other processes in the designated communicator, which, in turn, receive it.
Your snippet suggests that you think that the call to MPI_BCAST would be successful if called only on what you call the 'slave' process(es), which would be incorrect.
Your syntax for the call is, however, correct.
EDIT in response to comment
I believe that all processes have to execute the piece of code which calls MPI_BCAST. If, as you suggest, the pseudo-code is like this:
if (myrank == master) then
do_master_stuff ...
call mpi_bcast(...)
end if
if (myrank /= master) then
call mpi_bcast(...)
do_worker_stuff ...
end if
then the call will fail; it's likely that your program will stall until the job management or operating system notices and chucks it out. There is no mechanism within MPI for 'matching' calls to MPI_BCAST (or any of the other collective communications routines) across scopes.
Your pseudo-code ought to be like this
if (myrank == master) then
do_master_stuff ...
end if
if (myrank /= master) then
do_worker_stuff ...
end if
! all together now
call mpi_bcast(...)
or whatever variant your program requires
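In C, and tying it back to your snippet, the corrected structure might look like this (just a sketch):
int rank, i;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

if (rank == 0) {
    i = 0;   /* master-only work */
} else {
    i = 1;   /* slave-only work */
}

/* all together now: every rank calls MPI_Bcast, root 0 supplies the value */
MPI_Bcast(&i, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* after the call, i == 0 on every rank */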
I am implementing an MPI program in which the main process (with rank = 0) should be able to receive requests from the other processes, which ask for values of variables that are only known by the root.
If rank 0 does an MPI_Recv(...), I have to specify the rank of the process that sends the request to the root, but I cannot control that since the processes don't run in the order 1, 2, 3, ....
How can I receive the request from any rank and use the number of the emitting process to send it the necessary information?
This assumes you are using C. There are similar concepts in C++ and Fortran. You would just specify MPI_ANY_SOURCE as the source in the MPI_Recv(). The status struct contains the actual source of the message.
int buf[32];
MPI_Status status;
// receive a message from any source, with any tag
MPI_Recv(buf, 32, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
int replybuf[32];   /* reply size chosen for illustration */
// send the reply back to the sender of the message received above,
// using the source (and tag) recorded in the status
MPI_Send(replybuf, 32, MPI_INT, status.MPI_SOURCE, status.MPI_TAG, MPI_COMM_WORLD);
MPI_ANY_SOURCE is the obvious answer.
However, if all the ranks will be sending a request to rank 0, then MPI_Irecv combined with MPI_Testall might also work as a pattern. This will allow the MPI_Send calls to be executed in any order, and the information can be received and processed in the order that the MPI_Irecv calls are matched.
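A sketch of that pattern, assuming every non-root rank sends exactly one int request to rank 0 (the function name serve_requests and the reply computation are made up):
#include <mpi.h>
#include <stdlib.h>

/* Sketch: rank 0 posts one MPI_Irecv per worker up front and completes
   them with MPI_Waitall; the workers' MPI_Send calls may arrive in any
   order. An MPI_Testall polling loop could replace MPI_Waitall if rank 0
   has other work to interleave. */
void serve_requests(MPI_Comm comm)
{
    int size, i;
    MPI_Comm_size(comm, &size);

    int         *req_data = malloc((size - 1) * sizeof(int));
    MPI_Request *reqs     = malloc((size - 1) * sizeof(MPI_Request));
    MPI_Status  *stats    = malloc((size - 1) * sizeof(MPI_Status));

    for (i = 1; i < size; i++)
        MPI_Irecv(&req_data[i - 1], 1, MPI_INT, i, 0, comm, &reqs[i - 1]);

    MPI_Waitall(size - 1, reqs, stats);

    for (i = 0; i < size - 1; i++) {
        int reply = req_data[i] * 2;   /* made-up reply value */
        MPI_Send(&reply, 1, MPI_INT, stats[i].MPI_SOURCE, 0, comm);
    }

    free(req_data);
    free(reqs);
    free(stats);
}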