Timeline of asynchronous MPI communication

To optimize MPI communication it is important to understand the flow of the whole communication process. This is rather straightforward for synchronous communication, but what about asynchronous communication? As I understand it, it works in one of these two ways:
1. Rank0 -> Isend -> Rank1 and Rank1 -> Isend -> Rank0
2. Rank0 -> Irecv -> Rank1 and Rank1 -> Irecv -> Rank0
3. Rank0 and Rank1 do some computation
4. Messages are dispatched to their respective target locations
5. Matching Recv call found! -> write into the given recv-buffer
6. Rank0 and Rank1 finish their computation and call MPI_Wait for the send and the receive
7. MPI_Wait -> communication completed
or
1. Rank0 -> Isend -> Rank1 and Rank1 -> Isend -> Rank0
2. Rank0 and Rank1 do some computation
3. Messages are dispatched to their respective target locations
4. No matching Recv call found! -> allocate a temporary buffer and write into that
5. Rank0 and Rank1 finish their computation and call MPI_Recv
6. Matching MPI_Recv call is found -> the temporary buffer is copied into the recv-buffer
7. Rank0 and Rank1 call MPI_Wait
8. MPI_Wait -> communication is completed -> the temporary buffer is freed
Is this correct? Is there anything else MPI runs in the background that I need to be aware of in order to optimize its usage?

In general, when sending data with MPI, you should always pre-post your receives if possible. That means if you're trying to do communication between two processes, you should do something like this (lots of important arguments left out for brevity):
if (rank == 0) {
    MPI_Irecv(rdata, ..., 1, ..., &req[0]);
    ...
    MPI_Isend(sdata, ..., 1, ..., &req[1]);
} else {
    MPI_Irecv(rdata, ..., 0, ..., &req[0]);
    ...
    MPI_Isend(sdata, ..., 0, ..., &req[1]);
}
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
You can do other stuff between the Irecv and the Isend if you like, but by pre-posting the receives you save memory and time: the user buffers are already available for the MPI library to put incoming data into. If you don't post in this order and the messages arrive before you call Irecv (or any other flavor of receive), they have to be stored in some internal buffer first until the receive is posted, and then copied again from the MPI buffer to the user buffer. It may even mean that a message isn't sent at all until the receive is posted, if the message is too large to fit in the pre-allocated internal buffers.
You can call the Irecv as early as you like too. If you want to put the Irecv at the beginning of an iteration, do a bunch of calculation, then call Isend at the end of the iteration when the data is ready, that's fine too.
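A minimal C sketch of that iteration pattern (hedged: the peer rank, message parameters, and compute_step function are illustrative placeholders, not from the original post):

for (int it = 0; it < niters; it++) {
    MPI_Request req[2];
    /* pre-post the receive at the top of the iteration */
    MPI_Irecv(rdata, count, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD, &req[0]);
    compute_step(sdata);   /* hypothetical work that produces the outgoing data */
    /* send at the end of the iteration, once the data is ready */
    MPI_Isend(sdata, count, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}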
Cross-talk between other processes is usually not a problem unless you have lots of processes sending messages to a single process. In that case you might run into flow-control issues, but that situation doesn't come up often; most of the time, collectives are used there instead of point-to-point communication.
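To make that last point concrete, here is a hedged sketch of replacing N point-to-point sends to rank 0 with a single collective (the variable names are illustrative):

int local = compute_local_value();        /* hypothetical per-rank result */
int *all = NULL;
if (rank == 0)
    all = malloc(nprocs * sizeof(int));   /* only the root needs the gather buffer */
MPI_Gather(&local, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

The library can then use a tree or another optimized pattern internally instead of funneling N separate messages into one rank.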

Related

Concurrent write with OCaml Async

I'm reading data from the network and I'd like to write it to a file whenever data arrives. The writes are concurrent and non-sequential (think P2P file sharing). In C, I would get a file descriptor to the file (for the duration of the program), then use lseek followed by write, and eventually close the fd. These operations could be protected by a mutex in a multithreaded setting (in particular, the lseek/write pair should be atomic).
I don't really see how to get that behavior in Async. My initial idea is to have something like this.
let write fd s pos =
  let posl = Int64.of_int pos in
  Async_unix.Unix_syscalls.lseek fd ~mode:`Set posl
  >>| fun _ ->
  let wr = Writer.create fd in
  let len = String.length s in
  Writer.write wr s ~pos:0 ~len
Then, the writes are scheduled asynchronously when data is received.
My solution isn't correct. For one thing, this write task needs to be atomic, but it is not: two lseeks can be executed before the first Writer.write. And even if I could schedule the writes sequentially, it wouldn't help, since Writer.write doesn't return a Deferred.t. Any idea?
BTW, this is a follow-up to a previously answered question.
The basic approach would be to have a queue of workers, where each worker performs an atomic seek/write[1] operation. The invariant is that only one worker runs at a time. A more complicated strategy would employ a priority queue, where writes are ordered by some criterion that maximizes throughput, e.g., writes to subsequent positions. If you observe lots of small writes, you may also implement a sophisticated buffering strategy and coalesce them into larger chunks.
But let's start with a simple non-prioritized queue, implemented via Async.Pipe.t. For the positional write we can't use the Writer interface, as it is designed for buffered sequential writes. Instead we will use Unix.lseek from Async_unix.Std and the Bigstring.really_write function. really_write is a regular non-asynchronous function, so we need to lift it into the Async interface using the Fd.syscall_in_thread function, e.g.,
let really_pwrite fd offset bytes =
  (* offset is an int; lseek takes an Int64 position *)
  Unix.lseek fd (Int64.of_int offset) ~mode:`Set >>= fun (_ : int64) ->
  Fd.syscall_in_thread fd (fun desc ->
      Bigstring.really_write desc bytes)
Note: this function will write as many bytes as the system decides, but no more than the length of bytes. So you might be interested in implementing a really_pwrite function that writes all the bytes.
The overall scheme includes one master thread that owns the file descriptor and accepts write requests from multiple clients via an Async.Pipe. Suppose that each write request is a message of the following type:
type chunk = {
  offset : int;
  bytes  : Bigstring.t;
}
Then your master thread will look like this:
let process_requests fd requests =
  Async.Pipe.iter requests ~f:(fun {offset; bytes} ->
      really_pwrite fd offset bytes)
Where really_pwrite is a function that really writes all the bytes and handles all the errors. You may also use the Async.Pipe.iter' function and presort and coalesce the writes before actually executing the pwrite syscall.
One more optimization note: allocating a bigstring is a rather expensive operation, so you may consider preallocating one big bigstring and serving small chunks from it. This turns the buffer into a limited resource, so clients will wait until other clients finish their writes and release their chunks. As a result, you will have a throttled system with a limited memory footprint.
[1] Ideally we would use pwrite, but Jane Street provides only a pwrite_assume_fd_is_nonblocking function, which doesn't release the OCaml runtime while the system pwrite runs and would therefore block the whole program. So we use a combination of seek and write; the latter does release the OCaml runtime, so the rest of the program can continue. (Also, given their definition of a nonblocking fd, this function doesn't really make much sense, as only sockets and FIFOs are considered non-blocking, and as far as I know they do not support the seek operation. I will file an issue on their bug tracker.)

MPI - Equivalent of MPI_SENDRECV with asynchronous functions

I know that MPI_SENDRECV allows one to avoid the deadlocks that can arise with the classic MPI_SEND and MPI_RECV functions.
I would like to know if MPI_SENDRECV(sent_to_process_1, receive_from_process_0) is equivalent to:
MPI_ISEND(sent_to_process_1, request1)
MPI_IRECV(receive_from_process_0, request2)
MPI_WAIT(request1)
MPI_WAIT(request2)
with asynchronous MPI_ISEND and MPI_IRECV functions?
From what I have seen, MPI_ISEND and MPI_IRECV create a fork (i.e. 2 processes). So if I follow this logic, the first call of MPI_ISEND generates 2 processes. One does the communication and the other calls MPI_IRECV, which forks itself into 2 processes.
But once the communication of the first MPI_ISEND is finished, does the second process call MPI_IRECV again? With this logic, the above equivalence doesn't seem to be valid...
Maybe I should change to this:
MPI_ISEND(sent_to_process_1, request1)
MPI_WAIT(request1)
MPI_IRECV(receive_from_process_0, request2)
MPI_WAIT(request2)
But I think that this could also create deadlocks.
Could anyone give me another solution using MPI_ISEND, MPI_IRECV and MPI_WAIT that achieves the same behaviour as MPI_SENDRECV?
There are some dangerous lines of thought in the question and other answers. When you start a non-blocking MPI operation, the MPI library doesn't create a new process/thread/etc. You're thinking of something more like an OpenMP parallel region, I believe, where new threads/tasks are created to do some work.
In MPI, starting a non-blocking operation is like telling the MPI library that you have some things that you'd like to get done whenever MPI gets a chance to do them. There are lots of equally valid options for when they are actually completed:
It could be that they all get done later when you call a blocking completion function (like MPI_WAIT or MPI_WAITALL). These functions guarantee that when the blocking completion call returns, all of the requests you passed in as arguments are finished (in your case, the MPI_ISEND and the MPI_IRECV). Regardless of when the operations actually take place (see the next few points), you as an application can't consider them done until they are actually marked as completed by a function like MPI_WAIT or MPI_TEST.
The operations could get done "in the background" during another MPI operation. For instance, if you do something like the code below:
MPI_Isend(..., MPI_COMM_WORLD, &req[0]);
MPI_Irecv(..., MPI_COMM_WORLD, &req[1]);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
The MPI_ISEND and the MPI_IRECV would probably actually do the data transfers in the background during the MPI_BARRIER. This is because, as an application, you are transferring "control" of your application to the MPI library during the MPI_BARRIER call. This lets the library make progress on any ongoing MPI operations it wants. Most likely, by the time the MPI_BARRIER completes, most other outstanding operations will have completed too.
Some MPI libraries allow you to specify that you want a "progress thread". This tells the MPI library to start up another thread (note that a thread is not a process) in the background that will actually do the MPI operations for you while your application continues in the main thread.
Remember that all of these in the end require that you actually call MPI_WAIT or MPI_TEST or some similar function to ensure that your operation is actually complete. None of the nonblocking calls spawn new threads or processes to do the work for you; they really just put the operation on a list of things to do (which, in reality, is how most MPI libraries implement them).
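If you don't have a progress thread, a common way to hand the library those chances to progress is to poll with MPI_Test between chunks of computation. A minimal sketch (do_work_chunk and the message parameters are hypothetical placeholders):

MPI_Request req;
int done = 0;
MPI_Isend(sdata, count, MPI_INT, peer, tag, MPI_COMM_WORLD, &req);
while (!done) {
    do_work_chunk();                           /* your computation */
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* each call gives MPI a chance to progress */
}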
The best way to think of how MPI_SENDRECV is implemented is to do two non-blocking calls with one completion function:
MPI_Isend(..., MPI_COMM_WORLD, &req[0]);
MPI_Irecv(..., MPI_COMM_WORLD, &req[1]);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
How I usually do this on node i communicating with node i+1:
mpi_isend(send_to_process_iPlus1, requests(1))
mpi_irecv(recv_from_process_iPlus1, requests(2))
...
mpi_waitall(2, requests)
You can see how ordering your commands this way with non-blocking communication allows you (during the ... above) to perform any computation that does not rely on the send/recv buffers while the communication is in flight. Overlapping computation with communication is often crucial for maximizing performance.
mpi_sendrecv, on the other hand (while avoiding any deadlock issues), is still a blocking operation: your program must remain in that routine for the entire duration of the send/recv.
Final points: you can initialize more than 2 requests and wait on all of them the same way as with 2 requests. For instance, it's quite easy to start communication with node i-1 as well and wait on all 4 of the requests (see the sketch below). With mpi_sendrecv you must always have a paired send and receive; what if you only want to send?
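A hedged C sketch of that four-request exchange, assuming a periodic 1-D decomposition in which every rank has both neighbors (non-periodic boundary ranks would substitute MPI_PROC_NULL or add guards; buffer names are illustrative):

MPI_Request req[4];
MPI_Irecv(recv_prev, count, MPI_DOUBLE, rank - 1, tag, MPI_COMM_WORLD, &req[0]);
MPI_Irecv(recv_next, count, MPI_DOUBLE, rank + 1, tag, MPI_COMM_WORLD, &req[1]);
MPI_Isend(send_prev, count, MPI_DOUBLE, rank - 1, tag, MPI_COMM_WORLD, &req[2]);
MPI_Isend(send_next, count, MPI_DOUBLE, rank + 1, tag, MPI_COMM_WORLD, &req[3]);
/* ... computation that touches none of the four buffers ... */
MPI_Waitall(4, req, MPI_STATUSES_IGNORE);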

Is gen_tcp:send/2 blocking?

Is gen_tcp:send() asynchronous? Assume I send some byte array using gen_tcp:send/2. Will the process continue working:
a) Immediately
b) At the time data will arrive in target's inner buffer
c) When the target gets the data from buffer
Thank You in advance.
gen_tcp:send/2 is synchronous: the call returns only after the given packet has really been sent. Usually that happens immediately, but if the TCP window is full, gen_tcp:send/2 blocks until the data is sent. This means the call can theoretically block forever (for example, when the receiver does not read data from the socket on its side).
Fortunately, there are some options to avoid this situation. Two socket options, {send_timeout, Integer} and {send_timeout_close, Boolean}, can be set with inet:setopts/2. The first one lets you specify the longest time to wait for a send operation.
When the limit is exceeded, the send operation returns {error, timeout}. The default value of that option is infinity (which is why the call can block indefinitely). Unfortunately, it is also unknown how much of the data was sent when {error, timeout} is returned, so in that case it is better to close the socket. If the second option, {send_timeout_close, Boolean}, is set to true, the socket will be closed automatically when {error, timeout} occurs.

What is the difference between isend and issend?

I need clarification of my understanding of isend and issend as given in Send Types.
My understanding is that isend will return once the send buffer is free, i.e. when all the data has been released. Issend, on the other hand, returns only when it receives an ack from the receiver about getting/not getting the entire data. Is this all there is to it?
Both MPI_Isend() and MPI_Issend() return immediately, but in both cases you can't use the send buffer immediately.
Think of the difference that there is between MPI_Send() and MPI_Ssend():
MPI_Send() can be buffered, or it can be synchronous if the message is too large to be buffered locally; in that case it waits to complete sending the data to the corresponding receive operation.
MPI_Ssend() is always synchronous: it always waits to complete sending the data to the corresponding receive operation.
The inner workings of the corresponding "I"-operations are very similar, except that neither of them blocks (both return immediately). The difference lies only in when the MPI library signals to the user program that the send buffer can be reused, that is, when MPI_Wait() returns or MPI_Test() returns true (the so-called send-complete operation of the non-blocking send):
With MPI_Isend(), this can happen either when the data has been copied into a local buffer owned by the MPI library (if the message is below the "synchronous threshold") or when the data has actually been moved to the sibling task: the send-complete operation can be local, in case the underlying send operation is buffered.
With MPI_Issend(), MPI never buffers data locally, and the "buffer-free condition" is signaled only after the data has actually been transferred (and probably ack'ed at a low level): the send-complete operation is non-local.
The MPI standard document is quite pedantic on these aspects. See section 3.7 Nonblocking Communication.
Correct. Obviously, both of those will only be true once the request you get back from the call to MPI_ISEND or MPI_ISSEND has been completed via an MPI_WAIT* or MPI_TEST* function.
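A small complete program can make the difference observable. This is a hedged sketch: with MPI_Issend, the MPI_Test right after the send should report incomplete because the matching receive is deliberately delayed; swapping in MPI_Isend would typically let this small message complete immediately via local buffering (timing and buffering thresholds vary by implementation):

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    int rank, flag = 0, data = 42;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Issend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
        printf("complete before the receive is posted? %d\n", flag); /* expect 0 */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* completes once rank 1 receives */
    } else if (rank == 1) {
        sleep(2);                            /* delay the matching receive */
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}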

How can the "packet" option of socket in Erlang accelerate the tcp transmission so much?

It takes only 8 seconds to transfer 1 GB of data through two different ports on localhost using {packet,4}, while the same task can't be finished within 30 seconds using {packet,raw}. I know that with the latter method the data arrives in tens of thousands of small pieces (on Arch Linux the size is 1460 bytes). I've learned some aspects of the TCP/IP protocol and have been thinking about this question for days, but I still can't figure out the EXACT difference. I sincerely look forward to a bottom-up explanation.
-module(test).
-export([main/1]).

-define(SOCKOPT, [binary, {active,true}, {packet,4}]).

main(_) ->
    {ok, LSock} = gen_tcp:listen(6677, ?SOCKOPT),
    spawn(fun() -> send() end),
    recv(LSock).

recv(LSock) ->
    {ok, Sock} = gen_tcp:accept(LSock),
    inet:setopts(Sock, ?SOCKOPT),
    loop(Sock).

loop(Sock) ->
    receive
        {tcp, Sock, Data} ->
            io:fwrite("~p~n", [bit_size(Data)]),
            loop(Sock);
        {tcp_closed, Sock} -> ok
    end.

send() ->
    timer:sleep(500),
    {ok, Sock} = gen_tcp:connect("localhost", 6677, ?SOCKOPT),
    gen_tcp:send(Sock, binary:copy(<<"1">>, 1073741824)),
    gen_tcp:close(Sock).
$ time escript test.erl
8589934592
real 0m8.919s
user 0m6.643s
sys 0m2.257s
When you use {packet,4}, Erlang first reads 4 bytes to get the length of your data, allocates a buffer to hold it, and reads data into that buffer as each TCP packet arrives. It then delivers the buffer as a single message to your process. This all happens inside built-in read code, which is rather fast.
When you use {packet,raw}, Erlang sends a message to your process after receiving each TCP packet of data, so for each TCP packet it does many more things.
When the data is received in small pieces, the kernel buffer at the receiving end fills up fast. That shrinks the window advertised to the sender, forcing it to push data at a lower rate.
Try
-define(SOCKOPT, [binary, {active,true}, {recbuf, 16#FFFFFF}, {sndbuf, 16#1FFFFFF}]).
