Order of non-blocking operations in MPI

I have some code where I move data around using MPI non-blocking operations. What I require is that the operations follow an order that I specify, but I'm not sure that's happening.
For example, the operations look like this:
MPI_Rget(buffer, source, &req[0]); // A
MPI_Ibarrier(&req[1]); // B
MPI_Rput(dest, buffer, &req[2]); // C
MPI_Ibarrier(&req[3]); // D
for (int i = 0; i < 4; ++i)
    MPI_Wait(&req[i], MPI_STATUS_IGNORE);
I need operation A to be guaranteed to complete before C, so that the buffer has the data I "get" before I "put" it.
I also need a guarantee that operation C isn't started by any process until all processes finish operation A, which I hoped would be provided by operation B.
I was wondering whether this code was correct, and if not, what I could do to provide the ordering guarantees.

A barrier is not enough synchronization for request-based operations: if one needs to complete before the other starts, you need an MPI_Wait (or one of its variants) on the request of the first.
MPI_Rget is a one-sided call and is only valid within a passive-target epoch. Is that what you're doing?
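If it is indeed a passive-target epoch, one way to get both guarantees is sketched below. This is a sketch only, not the asker's code: it assumes the epoch is already open on a window win (e.g. via MPI_Win_lock_all), and buffer, count, source_rank and dest_rank are placeholder names.
MPI_Request get_req, put_req;

/* A: get into buffer, then complete the request so buffer actually holds the data */
MPI_Rget(buffer, count, MPI_DOUBLE, source_rank, 0, count, MPI_DOUBLE, win, &get_req);
MPI_Wait(&get_req, MPI_STATUS_IGNORE);

/* B: no process goes on to its put until every process has finished its get */
MPI_Barrier(MPI_COMM_WORLD);

/* C: put from buffer; Wait only means buffer may be reused, the flush completes it at the target */
MPI_Rput(buffer, count, MPI_DOUBLE, dest_rank, 0, count, MPI_DOUBLE, win, &put_req);
MPI_Wait(&put_req, MPI_STATUS_IGNORE);
MPI_Win_flush_all(win);

/* D */
MPI_Barrier(MPI_COMM_WORLD);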

How to synchronize (specific) work-items based on data, in OpenCL?

Context:
The need is to simulate a net of related discrete elements (a complex electronic circuit). Each component receives input from several other components and outputs to several others.
The intended design is to have a kernel with a configuration argument defining which component it shall represent. Each component of the circuit is represented by a work-item, and the whole circuit fits in a single work-group (or the circuit will be split adequately so that each work-group can manage all of its components as work-items).
The problem:
Is it possible (and if so, how) to have some work-items wait for other work-items' data?
A work-item generates an output into an array (at a data-driven position). Another work-item needs to wait for this to happen before it can start its own processing.
The net has no loops, so no single work-item ever needs to run twice.
Attempts:
In the following example, each component can have at most one input (to simplify), making the circuit a tree where the input to the circuit is the root and the 3 outputs are the leaves.
inputIndex models this tree by indicating, for each component, which other component provides its input. The first component takes itself as input, but the kernel handles this case (for simplification).
result stores the result of each component (voltage, intensity, etc.).
inputModified indicates whether the given component has already calculated its output.
// where the data comes from (index into result)
constant int inputIndex[5] = {0, 0, 0, 2, 2};

kernel void update_component(
    local int *result,       // each work-item's result
    local int *inputModified // whether the input is ready (only one input in this example)
) {
    int id = get_local_id(0);
    int size = get_local_size(0);
    int barrierCount = 0;

    // inputModified is a boolean indicating whether the input is ready;
    // make sure all inputs are false by default (except the first one).
    inputModified[id] = (id != 0 ? 0 : 1);
    barrier(CLK_LOCAL_MEM_FENCE);

    // Wait until all inputs are ready (only one in this example)
    while (!inputModified[inputIndex[id]] && size > barrierCount++)
    {
        // If the input is not ready, wait for it
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // all inputs are ready, compute the output
    if (id != 0) result[id] = result[inputIndex[id]] + 1;
    else         result[0] = 42;

    // make sure any other work-item depending on this one is unblocked
    inputModified[id] = 1;

    // Even when finished, we need to keep hitting the barrier for the other work-items.
    while (size > barrierCount++)
    {
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
This example has N barriers for N components, making it worse than a sequential solution.
Note: this is only the kernel, as the minimal C++ host is quite long. If needed, I could find a way to add it.
Question:
Is it possible, efficiently and from within the kernel itself, to have the different work-items wait for their data to be provided by other work-items? If not, what solution would be efficient?
This problem is (for me) not trivial to explain, and I am far from an expert in OpenCL. Please be patient and feel free to ask if anything is unclear.
From the documentation of barrier:
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/barrier.html
If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.
But the while loop (containing a barrier) in the kernel has this condition:
inputModified[inputIndex[id]]
which can evaluate differently for different work-item ids and so lead to undefined behavior. Besides, the barrier before it
barrier(CLK_LOCAL_MEM_FENCE);
already synchronizes all work-items in the work-group, so the while loop is redundant even if it works.
The last barrier loop is also redundant:
while (size > barrierCount++)
{
    barrier(CLK_LOCAL_MEM_FENCE);
}
When the kernel ends, all work-items are synchronized anyway.
If you mean to send a message to work-items outside the work-group, then you can only use atomic variables. Even when using atomics, you should not assume any execution/issue order between any two work-items.
Your question
how to have some work-items wait for other work-items' data? A work-item generates an output into an array (at a data-driven position). Another work-item needs to wait for this to happen before it can start its own processing. The net has no loops, so no single work-item ever needs to run twice.
can be answered with the OpenCL 2.x feature "dynamic parallelism", which lets a work-item spawn new work-groups/kernels from inside a kernel. It is much more efficient than a spin-wait loop and far more hardware-independent than relying on the number of in-flight threads a GPU supports (when the GPU can't keep that many threads in flight, any spin-wait will deadlock, regardless of thread order).
When you use barrier, you don't need to inform other work-items through inputModified; the data in result is already visible within the work-group after the barrier.
If you can't use OpenCL 2.x, then you should process the tree using BFS (breadth-first, level by level):
start 1 work-item for the top node
process it, prepare its K outputs, and push them into a queue
end the kernel
start K work-items (each pops an element from the queue)
process them, prepare their N outputs, and push them into the queue
end the kernel
repeat until the queue has no more elements
The number of kernel calls equals the maximum depth of the tree, not the number of nodes.
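Purely as an illustration (not code from the question), a sketch of what one BFS level could look like as a kernel. It assumes the tree is stored CSR-style (childStart/childList), that the host enqueues one launch per level with the global size equal to the current frontier size, resetting nextCount and swapping the frontier buffers between launches, and that global int atomics are available (OpenCL 1.1+ or the corresponding 1.0 extension).
kernel void process_level(global const int *frontier,      // node ids of the current level
                          global int *nextFrontier,        // node ids of the next level
                          volatile global int *nextCount,  // atomic size of nextFrontier (reset to 0 by the host)
                          global const int *inputIndex,    // parent of each node (root is its own parent)
                          global const int *childStart,    // CSR offsets into childList
                          global const int *childList,     // CSR child ids
                          global int *result)
{
    int node = frontier[get_global_id(0)];

    // The parent's result was written by the previous launch, so it is visible here.
    result[node] = (node == inputIndex[node]) ? 42 : result[inputIndex[node]] + 1;

    // Push this node's children onto the next frontier for the next launch.
    for (int c = childStart[node]; c < childStart[node + 1]; ++c) {
        int slot = atomic_inc(nextCount);
        nextFrontier[slot] = childList[c];
    }
}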
If you need quicker synchronization than kernel launches, then use a single work-group for the whole tree and use barrier instead of kernel re-launches. Or process the first few steps on the CPU to obtain multiple sub-trees and send them to different OpenCL work-groups. Computing on the CPU until there are N sub-trees, where N = the number of compute units of the GPU, could be better for faster, workgroup-barrier-based asynchronous computation of the sub-trees.
There is also a barrierless, atomicless, single-kernel-call way to do this: start from the bottom of the tree and go up.
Map all deepest-level child nodes to work-items. Move each of them up to the top while recording their path (node ids, etc.) in private memory or some other fast memory. Then have them traverse back top-down through that recorded path, computing as they go, without any synchronization or even atomics. This is less work-efficient than the barrier/kernel-call versions, but the lack of barriers and the totally asynchronous paths should make it fast enough.
If the tree has depth 10, this means 10 node pointers to save, which is not much for private registers. If the tree depth is around 30-40, use local memory with fewer work-items in each work-group; if it is even more, allocate global memory.
But you may need to sort the work-items by their spatiality / the tree's topology to make them work together faster, with less branching.
This way looks simplest to me, so I suggest you try this barrierless version first.
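Purely as an illustration of that bottom-up/top-down idea applied to the question's toy "+1" model: leafIds, MAX_DEPTH and the choice to write only each work-item's own leaf are assumptions made here to keep the sketch free of duplicate writes; a real component model would evaluate each recorded node instead of adding 1.
#define MAX_DEPTH 32   // assumed upper bound on the tree depth

kernel void compute_from_leaves(global const int *leafIds,    // one leaf node id per work-item
                                global const int *inputIndex, // parent of each node (root is its own parent)
                                global int *result)
{
    int path[MAX_DEPTH];
    int depth = 0;
    int node = leafIds[get_global_id(0)];

    // Walk up to the root, recording the path in private memory.
    while (depth < MAX_DEPTH) {
        path[depth++] = node;
        if (node == inputIndex[node]) break;   // reached the root
        node = inputIndex[node];
    }

    // Replay the path top-down, recomputing every ancestor privately and asynchronously.
    int value = 42;                            // root value, as in the question's example
    for (int d = depth - 2; d >= 0; --d)
        value = value + 1;                     // stand-in for evaluating component path[d]

    result[path[0]] = value;                   // each work-item writes only its own leaf
}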
If you only want data visibility per work-item instead of per work-group or per kernel, use a fence: https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/mem_fence.html

Concurrent write with OCaml Async

I'm reading data from the network and I'd like to write it to a file as I receive it. The writes are concurrent and non-sequential (think P2P file sharing). In C, I would get a file descriptor to the file (for the duration of the program), then use lseek followed by write, and eventually close the fd. These operations could be protected by a mutex in a multithreaded setting (in particular, the lseek and write pair should be atomic).
I don't really see how to get that behavior in Async. My initial idea is to have something like this.
let write fd s pos =
  let posl = Int64.of_int pos in
  Async_unix.Unix_syscalls.lseek fd ~mode:`Set posl
  >>| fun _ ->
  let wr = Writer.create fd in
  let len = String.length s in
  Writer.write wr s ~pos:0 ~len
Then, the writes are scheduled asynchronously when data is received.
My solution isn't correct. For one thing, this write task needs to be atomic, but it is not: two lseeks can be executed before the first Writer.write. Even if I could schedule the writes sequentially, it wouldn't help, since Writer.write doesn't return a Deferred.t. Any ideas?
BTW, this is a follow-up to a previous answered question.
The basic approach would be to have a queue of workers, where each worker performs an atomic seek/write [1] operation. The invariant is that only one worker runs at a time. A more complicated strategy would employ a priority queue, where writes are ordered by some criterion that maximizes throughput, e.g., writes to subsequent positions. You may also implement a sophisticated buffering strategy if you observe lots of small writes; in that case a good idea would be to coalesce them into larger chunks.
But let's start with a simple non-prioritized queue, implemented via Async.Pipe.t. For the positional write, we can't use the Writer interface, as it is designed for buffered sequential writes. So we will use Unix.lseek from Async_unix.Std and the Bigstring.really_write function. really_write is a regular non-asynchronous function, so we need to lift it into the Async interface using the Fd.syscall_in_thread function, e.g.,
let really_pwrite fd offset bytes =
  Unix.lseek fd offset ~mode:`Set >>= fun (_ : int64) ->
  Fd.syscall_in_thread fd (fun desc ->
      Bigstring.really_write desc bytes)
Note: this function will write as many bytes as the system decides, but no more than the length of bytes. So you might be interested in implementing a really_pwrite function that writes all the bytes.
The overall scheme would include one master thread that owns the file descriptor and accepts write requests from multiple clients via an Async.Pipe. Suppose that each write request is a message of the following type:
type chunk = {
  offset : int;
  bytes : Bigstring.t;
}
Then your master thread will look like this:
let process_requests fd requests =
  Async.Pipe.iter requests ~f:(fun {offset; bytes} ->
      really_pwrite fd offset bytes)
where really_pwrite is a function that really writes all the bytes and handles all the errors. You may also use the Async.Pipe.iter' function to presort and coalesce the writes before actually executing the pwrite syscall.
One more optimization note: allocating a bigstring is a rather expensive operation, so you may consider preallocating one big bigstring and serving small chunks from it. This creates a limited resource, so your clients will wait until other clients finish their writes and release their chunks. As a result, you will have a throttled system with a limited memory footprint.
[1] Ideally we should use pwrite, but Jane Street provides only the pwrite_assume_fd_is_nonblocking function, which doesn't release the OCaml runtime while the system pwrite call is made and will actually block the whole system. So we need to use a combination of a seek and a write; the latter releases the OCaml runtime so that the rest of the program can continue. (Also, given their definition of a non-blocking fd, this function doesn't really make much sense, as only sockets and FIFOs are considered non-blocking, and as far as I know they do not support the seek operation. I will file an issue on their bug tracker.)

Memory considerations when enqueing a long sequence of kernels and reads

I have a long sequence of kernels I need to run on some data like
data -> kernel1 -> data1 -> kernel2 -> data2 -> kernel3 -> data3 etc.
I need all the intermediate results to be copied back to the host as well, so the idea would be something like this (pseudocode):
inputdata = clCreateBuffer(...hostBuffer[0]);
for (int i = 0; i < N; ++i)
{
    // create output buffer
    outputdata = clCreateBuffer(...);
    // run kernel
    kernel = clCreateKernel(...);
    kernel.setArg(0, inputdata);
    kernel.setArg(1, outputdata);
    enqueueNDRangeKernel(kernel);
    // read intermediate result
    enqueueReadBuffer(outputdata, hostBuffer[i]);
    // output of operation becomes input of next
    inputdata = outputdata;
}
There are several ways to schedule these operations:
The simplest is to always wait on the event of the previous enqueue operation, so we wait for a read operation to complete before proceeding with the next kernel. That way I can release buffers as soon as they are no longer needed.
Or, make everything as asynchronous as possible, where kernel and read enqueues only wait on previous kernels, so buffer reads can happen while another kernel is running.
In the second (asynchronous) case I have a few questions:
Do I have to keep references to all cl_mem objects in the long chain of actions and release them after everything is complete?
Importantly, how does OpenCL handle the case when the sum of all memory objects exceeds the total memory available on the device? At any point a kernel only needs its input and output buffers (which should fit in memory), but what if 4 or 5 of these buffers exceed the total? How does OpenCL allocate/deallocate these memory objects behind the scenes? How does this affect the reads?
I would be grateful if someone could clarify what happens in these situations, and perhaps there is something relevant to this in the OpenCL spec.
Thank you.
Your second case is the way to go.
In the second (asynchronous) case I have a few questions: Do I have to keep references to all cl_mem objects in the long chain of actions and release them after everything is complete?
Yes. But if all the data arrays are of the same size, I would use just 2 and overwrite one after the other on each iteration.
Then you will only need 2 memory zones, and the release and allocation only occur at the beginning/end.
Don't worry about the data having bad values; if you set proper events, the processing will wait for the I/O to finish, i.e.:
data -> kernel1 -> data1 -> kernel2 -> data -> kernel3 -> data1
                     \-> I/O operation    \-> I/O operation
To do that, just set an event dependency that forces kernel3 to start only after the first I/O has finished. You can chain all the events that way.
NOTE: Using 2 queues, one for I/O and another for processing, will give you parallel I/O, which is 2 times faster.
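A host-side sketch of that two-buffer / two-queue scheme with event chaining. This is a sketch under assumptions, not the asker's code: buf, computeQ, ioQ, kernels, gsize, bytes and hostBuffer are placeholders created elsewhere, error checking is omitted, and the initial input is assumed to already be in buf[0].
cl_mem buf[2];                 /* two device buffers, reused alternately (ping-pong) */
cl_command_queue computeQ;     /* queue for the kernels */
cl_command_queue ioQ;          /* queue for the reads back to the host */
cl_event kernelDone[N], readDone[N];

for (int i = 0; i < N; ++i) {
    cl_mem in  = buf[i % 2];
    cl_mem out = buf[(i + 1) % 2];
    clSetKernelArg(kernels[i], 0, sizeof(cl_mem), &in);
    clSetKernelArg(kernels[i], 1, sizeof(cl_mem), &out);

    /* kernel i waits for kernel i-1 (its input) and for the read of iteration i-2,
     * which still uses the buffer we are about to overwrite */
    cl_event waits[2];
    cl_uint nwaits = 0;
    if (i >= 1) waits[nwaits++] = kernelDone[i - 1];
    if (i >= 2) waits[nwaits++] = readDone[i - 2];
    clEnqueueNDRangeKernel(computeQ, kernels[i], 1, NULL, &gsize, NULL,
                           nwaits, nwaits ? waits : NULL, &kernelDone[i]);

    /* read back kernel i's result on the I/O queue while the next kernel runs */
    clEnqueueReadBuffer(ioQ, out, CL_FALSE, 0, bytes, hostBuffer[i],
                        1, &kernelDone[i], &readDone[i]);
}
clFinish(computeQ);
clFinish(ioQ);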
Importantly, how does OpenCL handle the case when the sum of all memory objects exceeds the total memory available on the device?
It gives an error such as CL_OUT_OF_RESOURCES when allocating.
At any point a kernel only needs its input and output buffers (which should fit in memory), but what if 4 or 5 of these buffers exceed the total? How does OpenCL allocate/deallocate these memory objects behind the scenes? How does this affect the reads?
It will not do this automatically, unless you have set the memory up with a host pointer. But I'm unsure whether the OpenCL driver will handle it properly that way. I would not allocate more than the maximum if I were you.
I was under the impression (sorry, I was going to cite the specification but can't find it today, so I have downgraded the strength of my assertion) that when you enqueue a kernel with cl_mem references, the runtime takes a retain on those objects and releases them when the kernel is done.
This could allow you to release these objects after enqueueing a kernel without actually having to wait for the kernel to finish running. This is how the asynchronous "clEnqueue" operations are reconciled with the synchronous operations (i.e., memory release), and it prevents the use of released memory objects by the runtime and kernel.
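In code, the pattern being described would look like the fragment below, with the same caveat as above (it assumes the runtime really does keep a buffer alive while enqueued commands still reference it); queue, kernel, gsize and outputdata stand for the objects in the question's pseudocode.
clSetKernelArg(kernel, 1, sizeof(cl_mem), &outputdata);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL, 0, NULL, NULL);
/* drop our own reference right away; the runtime keeps the buffer alive
 * until the commands that reference it have finished */
clReleaseMemObject(outputdata);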

Is Open MPI's reduce synchronized?

I am looking at the code here, which I am working through for practice.
http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples/simplempi/main.html
I am confused about the part shown here.
MPI::COMM_WORLD.Reduce(&mypi, &pi, 1, MPI::DOUBLE, MPI::SUM, 0);
if (rank == 0)
    cout << "pi is approximately " << pi
         << ", Error is " << fabs(pi - PI25DT)
         << endl;
My question is: does the MPI reduce function know when all the other processes (in this case the ranks 1-3) have finished, so that its result is complete?
All collective communication calls (Reduce, Gather, Scatter, etc) are blocking.
@g.inozemtsev is correct. The MPI collective calls -- including those in Open MPI -- are "blocking" in the MPI sense of the word, meaning that you can use the buffer when the call returns. In an operation like MPI_REDUCE, that means the root process will have the answer in its buffer when it returns. Further, it means that non-root processes in an MPI_REDUCE can safely overwrite their buffer when MPI_REDUCE returns (which usually means that their part in the reduction is complete).
However, note that, as mentioned above, the return from a collective operation such as MPI_REDUCE in one process has no bearing on the return of the same collective operation in a peer process. The only exception to this rule is MPI_BARRIER, because barrier is defined as an explicit synchronization, whereas all the other MPI-2.2 collective operations do not necessarily need to synchronize explicitly.
As a concrete example, say that all non-root processes call MPI_REDUCE at time X. The root finally calls MPI_REDUCE at time X+N (for this example, assume N is large). Depending on the implementation, the non-root processes may return much earlier than X+N, or they may block until X+N(+M). The MPI standard is intentionally vague on this point, to allow MPI implementations to do what they want/need (which may also be dictated by resource consumption/availability).
Hence, @g.inozemtsev's point that "you cannot rely on synchronization" (except with MPI_BARRIER) is correct.
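To illustrate that last point with the C API (the question uses the deprecated C++ bindings): if you actually need all ranks to be synchronized after the reduction, you have to add an explicit barrier yourself, for example:
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
/* Reduce alone only guarantees that each rank's buffer is safe to use on return;
 * it does not guarantee that the ranks return at the same time. */
MPI_Barrier(MPI_COMM_WORLD);   /* explicit synchronization, only if you really need it */
if (rank == 0)
    printf("pi is approximately %.16f\n", pi);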

Add received data to an existing receive buffer in MPI_Sendrecv

I am trying to send data (forces) between 2 processes using MPI_Sendrecv. Usually the data in the receive buffer is overwritten; I do not want to overwrite the data in the receive buffer, but instead add to it the data that was received.
I can do the following: store the data from the previous time step in a different array and then add it after receiving. But I have a huge number of nodes, and I do not want to allocate memory for that storage every time step (or keep overwriting the same array).
My question is: is there a way, using MPI, to add the received data directly into the existing buffer?
Any help in this direction would be really appreciated.
I am sure collective communication calls (MPI_Reduce) cannot be used here. Are there any other commands that can do this?
In short: no, but you should be able to do this.
In long: Your suggestion makes a great deal of sense and the MPI Forum is currently considering new features that would enable essentially what you want.
It is incorrect to suggest that the data must be received before it can be accumulated; MPI_Accumulate, for instance, does a remote accumulation in a one-sided fashion. What you want is an MPI_Sendrecv_accumulate (which does not exist yet) rather than MPI_Sendrecv_replace. This makes perfect sense, and an implementation could internally do much better than you can, because it can buffer on a per-packet basis, for example.
For suszterpatt: MPI buffers internally in the eager protocol, and in the rendezvous protocol it can set up a pipeline to minimize buffering.
A (hypothetical) implementation of MPI_Recv_accumulate (for simplicity, as the MPI_Send part need not be considered) would look like this:
int MPI_Recv_accumulate(void *buf, int count, MPI_Datatype datatype, MPI_Op op,
                        int source, int tag, MPI_Comm comm, MPI_Status *status)
{
    if (eager)
        MPI_Reduce_local(_eager_buffer, buf, count, datatype, op);
    else /* rendezvous */
    {
        malloc _buffer
        while (mycount < count)
        {
            receive part of the incoming data into _buffer
            reduce_local from _buffer into buf
        }
    }
}
In short: no.
In long: your suggestion doesn't really make sense. The machine can't perform any operations on your received value without first putting it into local memory somewhere. You'll need a buffer to receive the newest value, and a separate sum that you will increment by the buffer's content after every receive.
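One way to implement what this second answer describes while avoiding a per-time-step allocation is to keep a scratch buffer for the whole run and let MPI_Reduce_local do the accumulation. This is a sketch with placeholder names (forces, scratch, partner), not the asker's code:
/* Exchange forces with a partner rank and fold the received values into the
 * persistent buffer; 'scratch' is allocated once, outside the time-step loop. */
void exchange_and_accumulate(double *forces, double *scratch, int n,
                             int partner, MPI_Comm comm)
{
    MPI_Sendrecv(forces,  n, MPI_DOUBLE, partner, 0,
                 scratch, n, MPI_DOUBLE, partner, 0,
                 comm, MPI_STATUS_IGNORE);
    /* forces[i] += scratch[i] for all i, done by the MPI library */
    MPI_Reduce_local(scratch, forces, n, MPI_DOUBLE, MPI_SUM);
}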
