MVAPICH2 RDMA-based communication without explicit PUT/GET use? - mpi

My cluster uses MVAPICH2 over InfiniBand FDR and I am considering the
use of RDMA for my simulations. I am aware of the MPI_Put and MPI_Get calls for explicitly invoking RDMA operations; however, I would like to know whether this is the only way to use RDMA within MPI.
My current implementation uses channel semantics (send/receive) for communication, along with MPI_Reduce and MPI_Gatherv. I know that MVAPICH2 has configuration parameters that can be used to enable RDMA. If a program using MPI has send/receive calls and RDMA is enabled, does MPI automatically convert from channel semantics to memory semantics (put/get), or is the explicit use of MPI_Put and MPI_Get the only method for implementing RDMA in MVAPICH2?
MPI_Send requires a corresponding MPI_Recv; whether they are blocking or non-blocking does not matter, as a send must be matched by a receive. RDMA has no such requirement and instead uses one-sided operations: MPI_Put (write to remote memory) or MPI_Get (read from remote memory). I am trying to find out whether enabling RDMA while still using sends and receives allows MVAPICH2 to somehow automatically convert the send/receive calls into the appropriate RDMA operations.

If MVAPICH2 has been built with the correct options, it will use RDMA for all MPI operations including MPI_Send and MPI_Recv on supported hardware, which includes InfiniBand. So, you do not need to use MPI_Put/Get to take advantage of RDMA-capable hardware. In fact, using MPI_Send/Recv might be faster because they are often better optimized.
MPI libraries use various internal designs (for example, eager and rendezvous protocols) to translate MPI_Send/MPI_Recv operations to RDMA semantics. The details can be found in the literature.
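For illustration, a plain send/receive exchange needs no source changes to benefit from this: the library picks the transport underneath. A minimal sketch in standard MPI C (nothing MVAPICH2-specific; the process ranks and message contents are arbitrary):

    #include <mpi.h>
    #include <stdio.h>

    /* Ordinary MPI_Send/MPI_Recv: if the library was built with InfiniBand
       support, these calls are carried over RDMA internally -- the
       application code itself does not change. */
    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }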

Related

How do the different types of point-to-point communication in MPI differ in terms of performance?

The latest version of MPI includes these types of point-to-point (p-p) communication:
blocking
ordinary non-blocking
persistent non-blocking
partitioned non-blocking
As far as I know, historically blocking p-p communication was the first type that existed. Then different types of non-blocking p-p communication were introduced one after the other to increase performance; for example, they allow overlap of computation and communication. But are there cases where blocking p-p communication is actually faster than the non-blocking alternatives? If not, what justifies their existence? Simply backward compatibility and their simplicity of use?
It is a misconception that non-blocking communication was motivated by performance: it was mostly to be able to express deadlock/serialization-free communication patterns. (Indeed, actually getting overlap of communication and computation was only possible through the "Iprobe trick". Only recently, with "progress threads", has it become a more systematic possibility.)
Partitioned communication is intended for multi-threaded contexts. You may call that a performance argument, or a completely new use case.
Persistent sends do indeed have the potential for performance improvement, since various setup steps and buffer allocations can be amortized. However, I don't see much evidence that MPI implementations actually do this. https://arxiv.org/abs/1809.10778
Finally, you're missing buffered, synchronous, and ready sends. These do indeed have the potential to improve performance, though again I don't see much evidence that they do.
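To make the persistent-send point above concrete, here is a sketch of the pattern, with a purely illustrative buffer size and iteration count:

    /* Persistent send: the request is set up once and reused, so any
       per-message setup cost can be amortized over the loop. */
    MPI_Request req;
    double buf[1024];                        /* illustrative buffer */

    MPI_Send_init(buf, 1024, MPI_DOUBLE, /*dest=*/1, /*tag=*/0,
                  MPI_COMM_WORLD, &req);

    for (int iter = 0; iter < 100; ++iter) {
        /* ... fill buf for this iteration ... */
        MPI_Start(&req);                     /* reuse the prepared request */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* the request stays allocated */
    }
    MPI_Request_free(&req);                  /* release it once at the end */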

Advantage of MPI_SEND over MPI_ISEND?

Using MPI_SEND (the standard blocking send) is simpler than using MPI_ISEND (the standard non-blocking send), because the latter has to be used together with another MPI function to ensure that the communication has "completed", so that the send buffer can be reused. But apart from that, does MPI_SEND have any advantages over MPI_ISEND? It seems that, in general, MPI_ISEND prevents deadlock and also allows better performance (because the calling process can do other things while the communication is progressed in the background by the MPI implementation).
So, is it a good idea to use the blocking version at all?
Performance-wise, MPI_Send() has the potential of being faster than MPI_Isend() immediately followed by MPI_Wait() (and it is faster in Open MPI).
But most importantly, if your MPI library does not provide a progress thread, your message might be sitting on the sender node until MPI is progressed by your code (that typically occurs when an MPI subroutine is invoked, and definitely happens when MPI_Wait() is called).
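To illustrate that progress point, a sketch of a non-blocking send where the sender keeps computing and pokes the library with MPI_Test so the message can actually move (buf, n, dest, tag and the work functions are placeholders, not part of any real API):

    /* Non-blocking send with manual progress: without a progress thread,
       periodic MPI_Test calls give the library a chance to move the
       message while we compute. */
    MPI_Request req;
    int done = 0;
    MPI_Isend(buf, n, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);

    while (!done && work_remaining()) {            /* placeholder loop condition */
        do_some_computation();                     /* placeholder local work */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* also progresses MPI */
    }
    if (!done)
        MPI_Wait(&req, MPI_STATUS_IGNORE);         /* complete before reusing buf */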

Why should there be minimal work before MPI_Init?

The documentation for both MPICH and Open MPI mentions that there should be a minimal amount of work done before MPI_Init or after MPI_Finalize:
The MPI standard does not say what a program can do before an MPI_INIT or after an MPI_FINALIZE. In the MPICH implementation, you should do as little as possible.
What is the reason behind this?
To me it seems perfectly reasonable for processes to do a significant amount of calculations before starting the communication with each other.
I believe it was worded like that in order to allow MPI implementations that spawn their ranks within MPI_Init. That means not all ranks are technically guaranteed to exist before MPI_Init. If you had opened file descriptors or performed other actions with side effects on the process state, it would become a huge mess.
As far as I know, no major current MPI implementation does that; nevertheless, an MPI implementation might use this requirement for other tricks.
EDIT: I found no evidence of this and only remember it from way back, so I'm not sure about it. I can't seem to find the formulation in the MPI standard that you quoted from MPICH. However, the MPI standard does regulate which MPI functions you may call before MPI_Init:
The only MPI functions that may be invoked before the MPI initialization routines are called are MPI_GET_VERSION, MPI_GET_LIBRARY_VERSION, MPI_INITIALIZED, MPI_FINALIZED, and any function with the prefix MPI_T_.
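For instance, the version and state queries in that list may legitimately run before initialization; a small sketch (inside main, before any other MPI call):

    /* Allowed before MPI_Init: querying the library version and the
       initialization state. */
    int major, minor, initialized;
    MPI_Get_version(&major, &minor);
    MPI_Initialized(&initialized);
    if (!initialized)
        MPI_Init(&argc, &argv);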
The MPI_Init documentation of MPICH gives some hints:
The MPI standard does not say what a program can do before an MPI_INIT or after an MPI_FINALIZE. In the MPICH implementation, you should do as little as possible. In particular, avoid anything that changes the external state of the program, such as opening files, reading standard input or writing to standard output.
BTW, I would not expect MPI_Init to do communications. These would happen later.
And the mpich/init.c implementation is free software; you can study its source code and understand that it is initializing some timers, some threads, etc... (and that should indeed happen really early).
To me it seems perfectly reasonable for processes to do a significant amount of calculations before starting the communication with each other.
Of course, but these should happen after MPI_Init (but before some MPI_Send etc).
On some supercomputers, MPI might use dedicated hardware (like InfiniBand, Fibre Channel, etc...), and there might be hardware or operating-system reasons to initialize it very early, so it makes sense to call MPI_Init very early. BTW, it is also given pointers to main's arguments, and I guess it may modify them before your main processes them further. So the call to MPI_Init is probably the first statement of your main.
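In practice that usually looks like the following minimal skeleton, with MPI_Init(&argc, &argv) as the first statement so the library can inspect (and possibly modify) the arguments before the application uses them:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        /* Initialize MPI first: it receives argc/argv and may adjust them. */
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Heavy local computation is fine here, after MPI_Init but
           before any communication calls. */
        printf("rank %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }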

MPI one-sided communication with user callbacks

To overlap MPI communications and computations, I am working on issuing asynchronous I/O (MPI calls) with a user-defined computation function applied to the data coming from the I/O.
MS Windows' "Overlapped I/O" is no friend of MPI (it supports overlapped I/O only for file I/O and socket communication, not for MPI operations...).
I cannot find an appropriate MPI API for this; does anyone have any insight?
There are no completion callbacks in MPI. Non-blocking operations always return a request handle that must either be waited on synchronously using MPI_Wait and family or periodically tested using the non-blocking MPI_Test and family.
With the help of either MPI_Waitsome or MPI_Testsome, it is possible to implement a dispatch mechanism that monitors multiple requests and calls specific functions upon their completion. None of the MPI calls has any timeout characteristics though - it is either "wait forever" (MPI_Wait...) or "check without waiting" (MPI_Test...).
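A sketch of such a dispatch loop, assuming an array of outstanding requests and a matching array of user callbacks (the callback table and its signature are application-defined, not part of MPI):

    #include <mpi.h>
    #include <stdlib.h>

    /* Poll a set of requests and invoke a user callback when each completes.
       Completed entries are set to MPI_REQUEST_NULL by MPI_Testsome, so the
       same array can be polled repeatedly. */
    void dispatch_loop(int n, MPI_Request reqs[], void (*callbacks[])(int index))
    {
        int *indices = malloc(n * sizeof(int));
        int remaining = n;

        while (remaining > 0) {
            int completed;
            MPI_Testsome(n, reqs, &completed, indices, MPI_STATUSES_IGNORE);
            for (int i = 0; i < completed; ++i)
                callbacks[indices[i]](indices[i]);  /* "completion callback" */
            remaining -= completed;

            /* ... overlap application computation here ... */
        }
        free(indices);
    }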

What is the standard way to use MPI_Isend to send the same message to many processors?

I was originally using MPI_Send paired with MPI_Irecv, but I recently found out that MPI_Send may block until the message is received. So, I'm changing to MPI_Isend and I need to send the same message to N different processors. Assuming the buffer will get destroyed later, should I have a for loop with MPI_Isend and MPI_Wait in the loop, or should I make an array of requests and have only MPI_Isend in the loop with MPI_Waitall after the loop?
For distributing the same buffer to "n" remote ranks, MPI_Bcast is the "obvious" choice. Unless you have some "overwhelming" reason to avoid MPI_Bcast, it would be advisable to use it. In general, MPI_Bcast is very well optimized by all the major MPI implementations.
If blocking is an issue, the MPI 3.0 standard introduced MPI_Ibcast along with other non-blocking collectives. The initial implementations of non-blocking collectives appear to be "naive" and built as wrappers around non-blocking point-to-point routines (e.g. MPI_Ibcast implemented as a wrapper around calls to MPI_Isend and MPI_Irecv). The implementations are likely to improve in quality over the next year or two, depending partly on the speed of adoption by the MPI application developer community.
MPI_Send will "block" until the send buffer can be safely re-used by the calling application. Nothing is guaranteed about the state of the corresponding MPI_[I]Recv's.
If you need non-blocking, then the best advice would be to call MPI_Isend in a loop. Alternatively, persistent communication requests could be used with MPI_Start or MPI_Startall if this is a message pattern that will be repeated over the course of the program.
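A sketch of the request-array variant asked about in the question, which posts every send before waiting on any of them (N, dest[], count and tag are placeholders):

    /* Post all the sends, then complete them together; the buffer must
       not be modified or freed until MPI_Waitall returns. */
    MPI_Request reqs[N];
    for (int i = 0; i < N; ++i)
        MPI_Isend(buf, count, MPI_DOUBLE, dest[i], tag, MPI_COMM_WORLD, &reqs[i]);

    /* ... unrelated work can go here ... */

    MPI_Waitall(N, reqs, MPI_STATUSES_IGNORE);
    /* buf may now be reused or freed */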
Since it's the same message, you should be able to use MPI_Bcast. You'll just have to create a new communicator to define a subgroup of processes.
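A hedged sketch of that approach, assuming the sender and its targets can be identified by a color value (is_sender_or_target() is an illustrative placeholder):

    /* Split off a sub-communicator containing the sender and its targets,
       then broadcast once inside it. */
    int color = is_sender_or_target(rank) ? 1 : MPI_UNDEFINED;
    MPI_Comm subcomm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &subcomm);

    if (subcomm != MPI_COMM_NULL) {
        int root = 0;   /* rank of the sender within subcomm */
        MPI_Bcast(buf, count, MPI_DOUBLE, root, subcomm);
        MPI_Comm_free(&subcomm);
    }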
