How to check if MPI one-sided communication has finished?

I am using the MPI_Raccumulate function, which is a one-sided communication from source to destination with a predefined aggregation function.
I want to check, at the end of the program, whether all MPI_Raccumulate calls have finished (the sender sent the data and the receiver received it successfully). MPI_Wait, however, does not seem to be the solution to this problem; it only waits until the source buffer is available to the user again.
Is there any way (1) to check whether a specific MPI one-sided communication call has completely finished (on both the sender and the receiver side), and (2) to confirm that there are no outstanding send/receive MPI requests on each processor?
My application must use one-sided communication, but it needs to confirm that there is no more communication in flight at the end of a specific task.

Completing RMA requests only ensures local completion and thus buffer reuse. Remote completion requires one of:
MPI_Win_complete, in the PSCW usage model
MPI_Win_fence, in the BSP usage model
MPI_Win_unlock(_all) or MPI_Win_flush(_all) in the passive-target usage model.
You probably don't want to use request-based RMA. The regular functions are sufficient for nearly all usage models. The only request-based RMA operation that is obviously useful is MPI_Rget (or MPI_Rget_accumulate with MPI_NO_OP, which is the atomic equivalent of a get), because completing that request means the data has arrived in the local buffer. And I say this as the person most responsible for these features being part of MPI-3.


Asynchronous GRPC?

I am designing a new system that will take an array of hashes of car data and use this data to call a separate API that returns a Boolean, after which I will return to the original caller the car model and either true or false.
The system needs to be callable from other applications, so I am looking into gRPC to solve the problem. My question is how to implement this solution in gRPC, and whether something like RabbitMQ would be better.
Would it make sense to build a bidirectional streaming gRPC solution where the client streams in the list of cars and the server spawns, say, a delayed job for each request? Then, when each delayed job finishes processing, I would return its value to the original caller in a stream.
Is this an elegant solution or are there better ways to achieve my goal? Thanks.
The streaming system of gRPC is designed for asynchronous communication, so it should fit your use case neatly.
The general design philosophy in this case is to consider each individual message sent in the stream as independent. Basically, make sure your proto message contains all the information it needs to be parsed and processed by your application without needing any context from previous calls.
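The idea of self-contained stream messages can be sketched without any gRPC code at all. The example below uses plain Python asyncio, not the gRPC library; CarRequest, CarResult, check_car, handle_stream, check_all and the year-based rule are all hypothetical names invented for illustration. It starts one task per incoming message and yields results in completion order, which only works because each result carries the car model it belongs to.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class CarRequest:
    model: str
    year: int


@dataclass(frozen=True)
class CarResult:
    model: str  # repeated in the reply so it needs no context from earlier messages
    ok: bool


async def check_car(request):
    """Stand-in for the separate Boolean API (hypothetical rule, not a real service)."""
    await asyncio.sleep(0)
    return CarResult(model=request.model, ok=request.year >= 2000)


async def handle_stream(requests):
    """Start one task per incoming message, then yield results as they finish."""
    tasks = []
    async for request in requests:
        tasks.append(asyncio.create_task(check_car(request)))
    for finished in asyncio.as_completed(tasks):
        yield await finished


def check_all(cars):
    """Synchronous driver that plays the role of the streaming client."""
    async def source():
        for car in cars:
            yield car

    async def collect():
        return [result async for result in handle_stream(source())]

    return asyncio.run(collect())
```

Because results may arrive in any order, a caller would match them to requests by the model field rather than by position.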

Dynamically Creating Communicators

I have a small communication problem that has consumed hours of searching. I am using MPICH2 to communicate between different workers. At some points in my program a process needs to multicast a message to a fraction of the workers (2 or 3 out of a total of 20). Therefore, I temporarily need to create a group that includes the ranks of all those workers and then use MPI_Bcast. However, this seems to be impossible!
I have tried MPI_Comm_create, but the program simply hangs because it requires every worker to call MPI_Comm_create. I also cannot use MPI_Comm_split, because I do not know the ranks of the recipient workers in advance and hence cannot color-code them.
Could you please help me?
Why do you need to create a new communicator at all?
Your description of what you actually want to achieve, and of what the constraints are, is a little lacking, but here are some hints that might be applicable to your problem.
Sticking to classical two-sided communication, I guess you need, at some point, a communication that involves all processes in order to identify the recipients. You could, for example, broadcast to everybody the list of recipients, and subsequently send the actual message to those recipients with point-to-point communication. (If this relation is going to change over time, I would not bother with creating a new communicator each time.)
You could use MPI's one-sided communication concepts, and simply write messages from the broadcasting rank into dedicated memory areas of the receiving ranks. However, one-sided communication has a reputation for being awkward to use and for performing poorly.
With MPI-3 you could make use of a non-blocking barrier (MPI_Ibarrier). All processes open the barrier. Every process other than the broadcasting rank also posts a non-blocking receive from any source, then proceeds as usual while regularly testing both the barrier and the receive for completion. The broadcasting rank, however, sends its message to the actual recipients and, once those sends have completed, waits for the non-blocking barrier to complete. When a process finds that the barrier has completed, it can stop listening for the receive: those that did not get a message can simply send a message to themselves to properly close the pending receive, and then proceed with their computation.

OpenCL clEnqueueReadBuffer During Kernel Execution?

Can queued kernels continue to execute while an OpenCL clEnqueueReadBuffer operation is occurring?
In other words, is clEnqueueReadBuffer a blocking operation on the device?
From a host API point of view, clEnqueueReadBuffer can be blocking or not, depending on if you set the blocking_read parameter to CL_TRUE or CL_FALSE.
If you set it to not block, then the read just gets queued and you should use an event (or subsequent blocking call) to determine when it has finished (i.e., before you access the memory that you are reading to).
If you set it to block, the call won't return until the read is done, at which point the destination memory holds valid data. Also (and answering your actual question), any operations you queued prior to the clEnqueueReadBuffer will have to finish before the read starts (see the exception note below).
All clEnqueue* API calls are asynchronous, but some have "blocking" parameters you can set. Setting one is equivalent to using the non-blocking version and then calling clFinish. The command queue will be flushed to the device and your host thread won't continue until the work has finished. Of course, it is hard to keep the GPU always busy this way, since it has no work queued once the blocking call returns, but if you queue up new work fast enough you can still keep it reasonably busy.
This all assumes a single, in-order command queue. If your command queue is out-of-order and your device supports out-of-order queues then enqueued items can execute in any order that doesn't violate the event_wait_list parameters you provided. Likewise, you can have multiple command queues, which can again be executed in any order that doesn't violate the event_wait_list parameters you provided. Typically, they are used to overlap memory transfers and compute, and to keep multiple compute units busy. Out-of-order command queues and multiple command queues are both advanced OpenCL concepts and shouldn't be attempted until you fully understand and have experience with in-order command queues.
Clarification added later after DarkZeros pointed out the "on the device" part of the OP's question: My answer was from the host thread API point of view. On the device, with an in-order command queue all downstream commands are blocked by the current command. With an out-of-order queue they are only blocked by the event_wait_list. However, out-of-order command queues are not well supported in today's drivers. With multiple command queues, in theory commands are only blocked by prior commands (if in-order) and the event_wait_list. In reality, there are sometimes special vendor rules that prevent the free flowing of potentially non-blocked commands that you might like. This is often because the multiple OpenCL command queues get transferred to device-side memory and compute queues, and get executed in-order there. So depending on the order that you add commands to your multiple command queues, they might get interleaved in such a way that they block in sub-optimal ways. The best solution I'm aware of is to either be careful about the order you enqueue (based on knowledge of this implementation detail), or use one queue for memory and one for compute, which matches the device-side queueing.
If overlap of memory and compute is your goal, AMD and NVIDIA both provide examples of how to overlap memory and compute operations and, for GPUs that support multiple concurrent compute operations, how to do that too. The NVIDIA examples are hard to get ahold of, but they are out there (from the CUDA 4 days).

How do I create a memory bound message queue in Erlang?

I want the speed of asynchronous messages but still have some flow control. How can I accomplish this in Erlang?
There is no process memory limit right now; it has been discussed on the mailing list and elsewhere, and you can look at those threads.
On the upside, when you use an implementation of the OTP patterns such as gen_server, you have a lot of freedom in retrieving messages from the process queue and measuring the length of the queue.
gen_server2, used in RabbitMQ, optimized this by moving messages into an internal data structure.
With that in place, you can discard any new incoming message when the internal queue is too long.
You can do this silently, or notify the sender that the message was rejected.
All of that is at a very low level.
RabbitMQ provides this functionality at the AMQP level.
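The drop-when-full policy described above is language-agnostic, so here is a minimal sketch of it in Python rather than Erlang. BoundedMailbox, offer and take are hypothetical names made up for this illustration; this is not how gen_server2 is implemented. The boolean returned by offer is what a server would use to decide whether to drop silently or send a rejection back to the sender.

```python
from collections import deque


class BoundedMailbox:
    """Internal message queue that refuses new messages once it is full."""

    def __init__(self, max_len):
        self.max_len = max_len
        self._queue = deque()

    def __len__(self):
        return len(self._queue)

    def offer(self, message):
        """Enqueue the message and return True, or return False if the queue is full."""
        if len(self._queue) >= self.max_len:
            return False  # caller may drop silently or notify the sender
        self._queue.append(message)
        return True

    def take(self):
        """Remove and return the oldest message; raises IndexError when empty."""
        return self._queue.popleft()
```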
A common and quite good way of enforcing flow control is to turn well-selected messages into calls. This limits each client to one outstanding request against the server, effectively providing feedback in an extremely simple way. The trick, of course, is picking which communications use synchronous calls :-)

TCP Socket Piping

Suppose you have two sockets in the same process, each connected to a different TCP peer. How can these sockets be bound together, so that the input stream of each feeds the output stream of the other? The sockets carry data continuously, with no waiting. Threads would normally solve this problem, but is there a more efficient way of piping sockets than creating threads?
If you need to connect both ends of the socket to the same process, use the pipe() function instead. This function returns two file descriptors, one used for writing and the other used for reading. There isn't really any need to involve TCP for this purpose.
Update: Based on your clarification of your use case, no, there isn't any way to tell the OS to connect the ends of two different sockets together. You will have to write code to read from one socket and write the same data to the other. Depending on the architecture of your process, you may or may not need an additional thread to do this work. For example, if your application is based on a select() loop, then creating another thread is not necessary.
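A single-threaded read-and-forward loop of the kind described above might look like the following minimal sketch. It uses Python's standard selectors module (a wrapper over select() and its relatives); the relay function name and the chunk_size parameter are hypothetical. It assumes two already-connected blocking sockets and stops as soon as either side reaches end of stream. A production relay would also need to handle a slow receiver, since sendall can block here.

```python
import selectors
import socket


def relay(sock_a, sock_b, chunk_size=4096):
    """Copy bytes in both directions between two connected sockets.

    Returns when either socket reports end of stream. No extra thread is used;
    one selector waits on both sockets at once.
    """
    peer = {sock_a: sock_b, sock_b: sock_a}
    selector = selectors.DefaultSelector()
    for sock in peer:
        selector.register(sock, selectors.EVENT_READ)
    try:
        while True:
            for key, _events in selector.select():
                source = key.fileobj
                data = source.recv(chunk_size)
                if not data:  # empty read means the remote side closed
                    return
                peer[source].sendall(data)
    finally:
        selector.close()
```

In an application that already runs its own select() loop, the two sockets would be registered with the existing selector instead of a dedicated one.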
You can avoid threads with an event queue within the process. The Wikipedia Message queue article assumes you want interprocess message passing, but if you are using sockets, you are in effect doing interprocess message passing within the same process.
