Is it okay to pass MPI_Request by value, or to make a copy of it? In a function like MPI_Test, why is the MPI_Request passed as a pointer, as opposed to just a pass-by-value? MPI_Request isn't modified within MPI_Test, is it? It's only 8 bytes in size.
MPI_Request is an opaque type and you should not assume anything about its size (since it might differ between various MPI libraries).
I believe MPI_Request can be modified by MPI_Test: for example, MPI_Test marks the request as completed when appropriate, and on completion it deallocates the request object and sets the handle to MPI_REQUEST_NULL. That write-back is why the request is passed as a pointer.
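To make the write-back concrete, here is a minimal sketch (the function name and the buf/count/dest/tag parameters are placeholders):

#include <mpi.h>

void send_and_poll(int *buf, int count, int dest, int tag)
{
    MPI_Request req;
    MPI_Status status;
    int done = 0;

    MPI_Isend(buf, count, MPI_INT, dest, tag, MPI_COMM_WORLD, &req);
    while (!done) {
        /* MPI_Test writes back through the pointer: on completion it
           deallocates the request object and sets req to MPI_REQUEST_NULL.
           A by-value copy could not be updated and would be left dangling. */
        MPI_Test(&req, &done, &status);
    }
}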
Recently I saw in a book on computational physics that the following piece of code was used to shift data across a chain of processes:
MPI_Isend(data_send, length, datatype, lower_process, tag, comm, &request);
MPI_Recv(data_receive, length, datatype, upper_process, tag, comm, &status);
MPI_Wait(&request, &status);
I think the same can be achieved by a single call to MPI_Sendrecv and I don't think there is any reason to believe the former is faster. Does it have any advantages?
I believe there is no real difference between the fragment you give and an MPI_Sendrecv call. The sendrecv combo is fully compatible with regular sends and receives: when shifting data through a (non-periodic!) chain, you could for instance use sendrecv everywhere except at the end points, and do a regular send/isend/recv/irecv there.
You can think of two variants on your code fragment: an Isend/Recv combination or an Isend/Irecv one. There are probably minor differences in how these are treated in the underlying protocols, but I wouldn't worry about them.
Your fragment can of course be generalized more easily to patterns other than shifting along a chain, but if you indeed have a setup where each process sends to at most one process and receives from at most one, then I'd use MPI_Sendrecv just for cleanliness. The explicit Isend/Recv/Wait version just makes the reader wonder: "is there a deep reason for this?"
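For reference, a sketch of the single-call equivalent, using the same variables as the fragment in the question. At the end points of a non-periodic chain you can also pass MPI_PROC_NULL as the peer rank, which turns that half of the call into a no-op:

/* Shift along the chain in one call: send to lower_process while
   receiving from upper_process. */
MPI_Sendrecv(data_send, length, datatype, lower_process, tag,
             data_receive, length, datatype, upper_process, tag,
             comm, &status);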
Usually, one would have to define a new type and register it with MPI in order to send it. I am wondering whether I can instead use protobuf to serialize an object and send it over MPI as a byte stream. I have two questions:
(1) do you foresee any problem with this approach?
(2) do I need to send length information through a separate MPI_Send(), or can I probe and use MPI_Get_count(&status, MPI_BYTE, &count)?
An example would be:
// sender
MyObj myobj;
...
size_t size = myobj.ByteSizeLong();
void *buf = malloc(size);
myobj.SerializePartialToArray(buf, size);
MPI_Isend(buf, size, MPI_BYTE, ... )
...
// receiver
MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
if (flag) {
    MPI_Get_count(&status, MPI_BYTE, &size);
    MPI_Recv(buf, size, MPI_BYTE, ... , &status);
    MyObj obj;
    obj.ParseFromArray(buf, size);
    ...
}
Generally you can do that. Your code sketch also looks fine (except for the omitted buf allocation on the receiver side). As Gilles points out, make sure to use status.MPI_SOURCE and status.MPI_TAG for the actual MPI_Recv, not the MPI_ANY_* wildcards.
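A sketch of the receive side with those points applied (blocking MPI_Probe for simplicity; the variable names are illustrative):

MPI_Status status;
int size;

/* Block until a message is available, then size the buffer from it. */
MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_BYTE, &size);
void *buf = malloc(size);

/* Receive exactly the message that was probed: use the source and tag
   from the status, not the MPI_ANY_* wildcards, so that no other
   message can be matched in between. */
MPI_Recv(buf, size, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

MyObj obj;
obj.ParseFromArray(buf, size);
free(buf);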
However, there are some performance limitations.
Protobuf isn't very fast, particularly due to en-/decoding. How much that matters depends on your performance expectations; if you run on a high-performance network, expect a significant impact. Here are some basic benchmarks.
Not knowing the message size ahead of time, and thus always posting the receive after the send, also has performance implications. It means the actual transmission will likely start later, which may or may not have an impact on the sender's side, since you are using non-blocking sends. There could also be cases where you run into practical limitations regarding the number of unexpected messages. That is not a general correctness issue, but it might require some configuration tuning.
If you go ahead with your approach, remember to do some performance analysis on the implementation. Use an MPI-aware performance analysis tool to make sure your approach doesn't introduce critical bottlenecks.
Is it safe to re-use a finished MPI_Request for another request? I have been using a pool of MPI_Request objects to improve performance and there have been no errors. But it would be good to know for sure.
Variables of type MPI_Request are not request objects themselves but rather just opaque handles (something like an abstract pointer) to the real MPI request objects. Assigning a new value to such a variable in no way affects the MPI object and only breaks the association to it. Therefore the object might become inaccessible in the sense that if no handle to it exists in your program, it can no longer be passed to MPI calls. This is the same as losing a pointer to a dynamically allocated memory block, thus leaking it.
When it comes to asynchronous request handles, once the operation is completed, MPI destroys the request object and MPI_Wait* / MPI_Test* set the passed handle variable to MPI_REQUEST_NULL on return. Also, a call to MPI_Request_free will mark the request for deletion and set the handle to MPI_REQUEST_NULL on return. At that point you can reuse the variable and store a different request handle in it.
The same applies to handles to communicators (of type MPI_Comm), handles to datatypes (of type MPI_Datatype), handles to reduce operations (of type MPI_Op), and so on.
It's just fine to reuse your MPI_Request variables as long as the operation they are associated with has completed before you use them again (either by completing the request with MPI_Wait*/MPI_Test* or by freeing the request object manually using MPI_Request_free).
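A minimal sketch of that reuse pattern (buf, count, peer, and nsteps are placeholders):

MPI_Request req = MPI_REQUEST_NULL;

for (int step = 0; step < nsteps; step++) {
    /* Complete the previous send before reusing the handle; a wait on
       MPI_REQUEST_NULL returns immediately. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    /* req is now MPI_REQUEST_NULL, so the variable can safely be
       associated with a new request object. */
    MPI_Isend(buf, count, MPI_BYTE, peer, step, MPI_COMM_WORLD, &req);
}
MPI_Wait(&req, MPI_STATUS_IGNORE);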
I would like to understand how to correctly use the async_work_group_copy() call in OpenCL. Let's have a look at a simplified example:
__kernel void test(__global float *x) {
    __local float xcopy[GROUP_SIZE];
    int globalid = get_global_id(0);
    int localid = get_local_id(0);
    event_t e = async_work_group_copy(xcopy, x + globalid - localid, GROUP_SIZE, 0);
    wait_group_events(1, &e);
}
The reference http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/async_work_group_copy.html says "Perform an async copy of num_elements gentype elements from src to dst. The async copy is performed by all work-items in a work-group and this built-in function must therefore be encountered by all work-items in a workgroup executing the kernel with the same argument values; otherwise the results are undefined."
But that doesn't clarify my questions...
I would like to know, if the following assumptions are correct:
The call to async_work_group_copy() must be executed by all work-items in the group.
The call should be in a way, that the source address is identical for all work-items and points to the first element of the memory area to be copied.
Since my source address is based on the global work-item id, which differs between work-items, I have to subtract the local id so that the address is identical for all work-items and points to the first element of the work-group's block...
Is the third parameter really the number of elements (not the size in bytes)?
Bonus questions:
a. Can I just use barrier(CLK_LOCAL_MEM_FENCE) instead of wait_group_events() and ignore the event return value? If so, would that be faster?
b. Does a local copy also make sense for processing on CPUs or is that overhead as they share a cache anyway?
Regards,
Stefan
One of the main reasons for this function existing is to allow the driver/kernel compiler to efficiently copy the memory without the developer having to make assumptions about the hardware.
You describe what memory you need copied as if it were a single-threaded copy, and async_work_group_copy gets it done for you using the parallel hardware.
For your specific questions:
I have never seen async_work_group_copy used by only some of the work-items in a group. I always assumed this is because it is required. I think the blocking nature of wait_group_events forces all work-items to be part of the copy.
Yes. Source (and destination) addresses need to be the same for all work items.
You could subtract your local id to get the correct address, but I find that basing the address on the group id (get_group_id) solves this problem as well.
Yes. The last param is the number of elements, not the size in bytes.
a. No. The copy is event-based: you will find that your barrier is hit almost immediately by the work-items, while the data won't necessarily have been copied yet. This makes sense because some OpenCL hardware might not even use the compute units at all to do the actual copy operation.
b. I think that CPU OpenCL implementations might guarantee L1 cache usage when you use local memory. The only way to know for sure whether this performs better is to benchmark your application with various settings.
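Putting points 1-3 together, here is a sketch of the kernel from the question with the source address derived from the group id (assuming GROUP_SIZE matches the actual work-group size):

__kernel void test(__global float *x) {
    __local float xcopy[GROUP_SIZE];

    /* The source address is the same for every work-item in the group:
       the first element of this group's block of x. */
    event_t e = async_work_group_copy(xcopy, x + get_group_id(0) * GROUP_SIZE,
                                      GROUP_SIZE, 0);
    /* Wait until the copy has actually finished before using xcopy. */
    wait_group_events(1, &e);
}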
I am trying to send data (forces) between 2 processes using MPI_Sendrecv. Usually the data gets overwritten in the receive buffer; I do not want to overwrite the data in the receive buffer, but instead add the received data to it.
I can do the following: store the data from the previous time step in a different array and then add it after receiving. But I have a huge number of nodes and I do not want to allocate memory for that storage every time step (or overwrite the same).
My question: is there a way to add the received data directly into the receive buffer and store it in the received memory using MPI?
Any help in this direction would be really appreciated.
I am sure collective communication calls (MPI_Reduce) cannot be worked out here. Are there any other commands that can do this?
In short: no, but you should be able to do this.
In long: Your suggestion makes a great deal of sense and the MPI Forum is currently considering new features that would enable essentially what you want.
It is incorrect to suggest that the data must be received before it can be accumulated. MPI_Accumulate does a remote accumulation in a one-sided fashion. What you want is an MPI_Sendrecv_accumulate rather than the existing MPI_Sendrecv_replace. This makes perfect sense, and an implementation can internally do much better than you can, because it can buffer on a per-packet basis, for example.
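For reference, a minimal sketch of a one-sided accumulation with the existing MPI_Accumulate, using fence synchronization (forces, my_contrib, n, and peer are placeholders):

MPI_Win win;
/* Expose the local forces array in a window so peers can add into it. */
MPI_Win_create(forces, n * sizeof(double), sizeof(double),
               MPI_INFO_NULL, MPI_COMM_WORLD, &win);

MPI_Win_fence(0, win);
/* Adds my_contrib element-wise into the forces window on rank peer,
   without an explicit receive on the target side. */
MPI_Accumulate(my_contrib, n, MPI_DOUBLE, peer, 0, n, MPI_DOUBLE,
               MPI_SUM, win);
MPI_Win_fence(0, win);

MPI_Win_free(&win);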
For suszterpatt: MPI internally buffers in the eager protocol, and in the rendezvous protocol it can set up a pipeline to minimize buffering.
A sketch of the receive side of such an MPI_Recv_accumulate (for simplicity, the MPI_Send part need not be considered) looks like this:
int MPI_Recv_accumulate(void *buf, int count, MPI_Datatype datatype, MPI_Op op,
                        int source, int tag, MPI_Comm comm, MPI_Status *status)
{
    if (eager) {
        /* message already delivered into the eager buffer: fold it in */
        MPI_Reduce_local(_eager_buffer, buf, count, datatype, op);
    } else { /* rendezvous */
        malloc _buffer
        while (mycount < count) {
            receive part of the incoming data into _buffer
            reduce_local from _buffer into buf
        }
    }
}
In short: no.
In long: your suggestion doesn't really make sense. The machine can't perform any operations on your received value without first putting it into local memory somewhere. You'll need a buffer to receive the newest value, and a separate sum that you will increment by the buffer's content after every receive.
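Since no built-in accumulate-on-receive exists, the standard pattern both answers point at is a single reusable staging buffer combined with MPI_Reduce_local, which applies an MPI reduction op between two local buffers. A sketch with placeholder names (forces, n, peer, tag):

/* Allocate the staging buffer once, outside the time-step loop. */
double *tmp = malloc(n * sizeof(double));

/* Each time step: exchange forces, then fold the received values in. */
MPI_Sendrecv(forces, n, MPI_DOUBLE, peer, tag,
             tmp, n, MPI_DOUBLE, peer, tag,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

/* forces += tmp, element-wise: MPI_Reduce_local(inbuf, inoutbuf, ...) */
MPI_Reduce_local(tmp, forces, n, MPI_DOUBLE, MPI_SUM);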