Sending class instances using MPI_Send() or MPI_Bcast() in C++

How can I transmit instances of my class or a std::vector using MPI_Send() or MPI_Bcast() in C++?

You cannot simply transmit instances of arbitrary classes: MPI_Send() and MPI_Bcast() are C calls and know nothing about the structure of those classes. You can send instances of std::vector (since it uses contiguous memory storage) by passing &vector[0] (or vector.data()) to MPI_Send(), but the receive operation then has to be implemented in several steps: MPI_Probe() -> get the number of elements in the message with MPI_Get_count() -> resize the vector instance -> MPI_Recv() into the resized instance. For all other cases, you should either use something like Boost.MPI, or use MPI_Pack() and MPI_Unpack() to serialise and deserialise your class instances to and from MPI messages.
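For the std::vector case, a minimal sketch of that probe/resize/receive sequence might look like this (assuming a vector of doubles and ranks 0 and 1; the payload and tags are illustrative, and the snippets belong inside main() after MPI_Init):
#include <mpi.h>
#include <vector>

// On the sending rank (e.g. rank 0): the storage is contiguous, so data() can be passed directly.
std::vector<double> v(100, 1.0);   // example payload
MPI_Send(v.data(), static_cast<int>(v.size()), MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);

// On the receiving rank (e.g. rank 1): probe first to learn the element count.
MPI_Status status;
MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
int count = 0;
MPI_Get_count(&status, MPI_DOUBLE, &count);   // number of doubles in the pending message
std::vector<double> recv(count);              // resize before receiving
MPI_Recv(recv.data(), count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);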

MPI doesn't operate on objects, it operates on memory locations. So to send an object of your own class, you need to know the memory layout of that class. You can then use this to build an MPI datatype. There is an entire chapter of the MPI specification (chapter 4) devoted to how to do this. The basic premise is that you build a datatype from the standard MPI types, arranged in a specified memory layout. Once this type is built and committed, you can use it in MPI operations.
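For example, a derived datatype for a small POD struct might be built like this (a sketch; the Particle struct and its fields are made up for illustration):
#include <mpi.h>
#include <cstddef>   // offsetof

struct Particle {    // hypothetical POD type used for illustration
    double x, y, z;
    int    id;
};

// Describe the memory layout of Particle to MPI, then commit the new type.
MPI_Datatype particle_type;
int          blocklengths[2]  = {3, 1};
MPI_Aint     displacements[2] = {offsetof(Particle, x), offsetof(Particle, id)};
MPI_Datatype types[2]         = {MPI_DOUBLE, MPI_INT};

MPI_Type_create_struct(2, blocklengths, displacements, types, &particle_type);
MPI_Type_commit(&particle_type);

// The committed type can now be used like any built-in type:
Particle p;
MPI_Send(&p, 1, particle_type, 1, 0, MPI_COMM_WORLD);

MPI_Type_free(&particle_type);
If you later send arrays of such a struct, you may also need MPI_Type_create_resized so that the type's extent matches sizeof(Particle), in case the struct has trailing padding.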

OK, I've already found a solution:
http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report/node83.htm#Node83
(see the 5th example there)

You could try sending it as a byte array like this:
MPI_Send(vec.data(), vec.size() * sizeof(VectorElement), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
And receive it like this:
MPI_Recv(&vec.front(), vec.size() * sizeof(VectorElement), MPI_CHAR, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
Where vec is an instance of std::vector<VectorElement>.
If the size of the receiving vector could differ, you can simply send the size before the vector (as sketched below).
I can't guarantee that this method is safe, but I believe it should be as long as every process runs on the same architecture and is built with the same compiler, so that both the padding and the order of the fields of the class are identical everywhere.
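For the case where the receiver does not know the size in advance, a sketch of the "send the size first" idea (vec and VectorElement as above; the destination rank and tags are illustrative, and this is only safe for trivially copyable element types):
// Sender: transmit the element count first, then the raw bytes.
int n = static_cast<int>(vec.size());
MPI_Send(&n, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
MPI_Send(vec.data(), static_cast<int>(n * sizeof(VectorElement)), MPI_CHAR, 0, 1, MPI_COMM_WORLD);

// Receiver: read the count, resize, then receive the bytes.
int n = 0;
MPI_Status status;
MPI_Recv(&n, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
vec.resize(n);
MPI_Recv(vec.data(), static_cast<int>(n * sizeof(VectorElement)), MPI_CHAR,
         status.MPI_SOURCE, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);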

Related

MPI_Sendrecv vs combination of MPI_Isend and MPI_Recv

Recently I saw in a book on computational physics that the following piece of code was used to shift data across a chain of processes:
MPI_Isend(data_send, length, datatype, lower_process, tag, comm, &request);
MPI_Recv(data_receive, length, datatype, upper_process, tag, comm, &status);
MPI_Wait(&request, &status);
I think the same can be achieved by a single call to MPI_Sendrecv and I don't think there is any reason to believe the former is faster. Does it have any advantages?
I believe there is no real difference between the fragment you give and an MPI_Sendrecv call. The sendrecv combination is fully compatible with regular sends and receives: when shifting data through a (non-periodic!) chain, for instance, you could use sendrecv everywhere except at the end points, and do a regular send/isend/recv/irecv there.
You can think of two variants on your code fragment: use an Isend/Recv combination or use Isend/Irecv. There are probably minor differences in how these are treated in the underlying protocols, but I wouldn't worry about them.
Your fragment can of course be generalized more easily to patterns other than shifting along a chain, but if you indeed have a setup where each process sends to at most one process and receives from at most one, then I'd use MPI_Sendrecv just for cleanliness. The original fragment just makes the reader wonder: "is there a deep reason for this?"
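For reference, a sketch of the same shift written as a single MPI_Sendrecv (lower_process, upper_process, datatype and the other names are taken from the fragment above):
// Sends to lower_process while receiving from upper_process; MPI handles the
// buffering internally, so no deadlock and no explicit request/wait is needed.
MPI_Status status;
MPI_Sendrecv(data_send,    length, datatype, lower_process, tag,
             data_receive, length, datatype, upper_process, tag,
             comm, &status);
At the ends of a non-periodic chain you can pass MPI_PROC_NULL as the partner rank; that half of the call then becomes a no-op, which avoids special-casing the boundary processes.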

Is this an advantage of MPI_PACK over derived datatype?

Suppose a process is going to send a number of arrays of different sizes but of the same type to another process in a single communication, so that the receiver builds the same arrays in its memory. Prior to the communication the receiver doesn't know the number of arrays and their sizes. So it seems to me that though the task can be done quite easily with MPI_Pack and MPI_Unpack, it cannot be done by creating a new datatype because the receiver doesn't know enough. Can this be regarded as an advantage of MPI_PACK over derived datatypes?
There is a passage in the official MPI document which may be referring to this:
The pack/unpack routines are provided for compatibility with previous libraries. Also, they provide some functionality that is not otherwise available in MPI. For instance, a message can be received in several parts, where the receive operation done on a later part may depend on the content of a former part.
You are absolutely right. The way I phrase it is that with MPI_Pack I can make "self-documenting messages". First you store an integer that says how many elements are coming up, then you pack those elements. The receiver inspects that first int, then unpacks the elements. The only catch is that the receiver needs to know an upper bound on the number of bytes in the pack buffer, but you can do that with a separate message or an MPI_Probe.
There is of course the matter that unpacking a packed message is way slower than straight copying out of a buffer.
Another advantage to packing is that it makes heterogeneous data much easier to handle. The MPI_Type_struct is rather a bother.
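A sketch of such a self-documenting message (the element count, buffer size, tags and ranks are made up for illustration; error checking omitted):
// Sender: pack the element count first, then the elements themselves.
int n = 100;
double values[100];
char buffer[4096];                 // upper bound on the packed size
int position = 0;
MPI_Pack(&n, 1, MPI_INT, buffer, sizeof(buffer), &position, MPI_COMM_WORLD);
MPI_Pack(values, n, MPI_DOUBLE, buffer, sizeof(buffer), &position, MPI_COMM_WORLD);
MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);

// Receiver: unpack the count, then unpack exactly that many elements.
char buffer[4096];
MPI_Status status;
MPI_Recv(buffer, sizeof(buffer), MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);
int position = 0, n = 0;
MPI_Unpack(buffer, sizeof(buffer), &position, &n, 1, MPI_INT, MPI_COMM_WORLD);
double values[100];
MPI_Unpack(buffer, sizeof(buffer), &position, values, n, MPI_DOUBLE, MPI_COMM_WORLD);
On the sender, MPI_Pack_size can be used to compute a safe upper bound for the buffer instead of hard-coding one.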

sending data: MPI_Type_contiguous vs primitive types

I am trying to exchange data (30 chars) between two processes in order to understand the MPI_Type_contiguous API:
char data[30];
MPI_Type_contiguous(10, MPI_CHAR, &mytype);
MPI_Type_commit(&mytype);
MPI_Send(data, 3, mytype, 1, 99, MPI_COMM_WORLD);
But the same task could have been accomplished via:
MPI_Send(data, 30, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
I guess there is no latency advantage, as I am using only a single function call in both cases (or is there?).
Can anyone share a use case where MPI_Type_contiguous is advantageous over primitive types (in terms of performance or ease of accomplishing a task)?
One use that immediately comes to mind is sending very large messages. Since the count argument to MPI_Send is of type int, on a typical LP64 (Unix-like OSes) or LLP64 (Windows) 64-bit OS it is not possible to directly send more than 2^31 - 1 elements, even if the MPI implementation internally uses 64-bit lengths. With modern compute nodes having hundreds of GiBs of RAM, this is becoming a nuisance. The solution is to create a new datatype of length m and send n elements of the new type for a total of n*m data elements, where n*m can now be up to (2^31 - 1)^2 = 2^62 - 2^32 + 1. The method is future-proof and can also be used on 128-bit machines, as MPI datatypes can be nested even further. This workaround, and the fact that registering a datatype is way cheaper (in execution time) than the time it takes such large messages to traverse the network, was used by the MPI Forum to reject the proposal for adding new large-count APIs or modifying the argument types of the existing ones. Jeff Hammond (#Jeff) has written a library to simplify the process.
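A sketch of that workaround (the buffer name big_buffer, the chunk size and the total count are illustrative, and the total is assumed to divide evenly into chunks):
// Send more doubles than fit in an int count by sending 'nchunks' elements of a
// contiguous type that covers 'chunk' doubles each.
const long long total   = 1LL << 32;             // > 2^31 - 1 elements
const int       chunk   = 1 << 20;               // doubles per derived-type instance
const int       nchunks = (int)(total / chunk);  // assumes 'total' divides evenly

MPI_Datatype chunk_type;
MPI_Type_contiguous(chunk, MPI_DOUBLE, &chunk_type);
MPI_Type_commit(&chunk_type);

MPI_Send(big_buffer, nchunks, chunk_type, 1, 0, MPI_COMM_WORLD);  // nchunks * chunk doubles in one message
MPI_Type_free(&chunk_type);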
Another use is in MPI-IO. When setting the view of file with MPI_File_set_view, a contiguous datatype can be provided as the elementary datatype. It allows one to e.g. treat in a simpler way binary files of complex numbers in languages that do not have a built-in complex datatype (like the earlier versions of C).
MPI_Type_contiguous is for making a new datatype which is count copies of the existing one. This is useful to simplify the process of sending a number of data elements together, as you don't need to keep track of their combined size (the count in MPI_Send can be replaced by 1).
For your case, I think it is exactly the same. The text from Using MPI, adapted slightly to match your example, is:
When a count argument is used in an MPI operation, it is the same as if a contiguous type of that size had been constructed.
MPI_Send(data, count, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
is exactly the same as
MPI_Type_contiguous(count, MPI_CHAR, &mytype);
MPI_Type_commit(&mytype);
MPI_Send(data, 1, mytype, 1, 99, MPI_COMM_WORLD);
MPI_Type_free(&mytype);
You are correct: since there is only one actual communication call, the latency will be identical (and so will the bandwidth, as the same number of bytes are sent).

Proper way to inform OpenCL kernels of many memory objects?

In my OpenCL program, I am going to end up with 60+ global memory buffers that each kernel is going to need to access. What's the recommended way of letting each kernel know the location of each of these buffers?
The buffers themselves are stable throughout the life of the application -- that is, we will allocate the buffers at application start, call multiple kernels, then only deallocate the buffers at application end. Their contents, however, may change as the kernels read/write from them.
In CUDA, the way I did this was to create 60+ program scope global variables in my CUDA code. I would then, on the host, write the address of the device buffers I allocated into these global variables. Then kernels would simply use these global variables to find the buffer it needed to work with.
What would be the best way to do this in OpenCL? It seems that CL's global variables are a bit different from CUDA's, but I can't find a clear answer on whether my CUDA method will work and, if so, how to go about transferring the buffer pointers into global variables. If that won't work, what's the best way otherwise?
60 global variables sure is a lot! Are you sure there isn't a way to refactor your algorithm a bit to use smaller data chunks? Remember, each kernel should be a minimum work unit, not something colossal!
However, there is one possible solution. Assuming your 60 arrays are of known size, you could store them all into one big buffer, and then use offsets to access various parts of that large array. Here's a very simple example with three arrays:
A is 100 elements
B is 200 elements
C is 100 elements
big_array = A[0:100] B[0:200] C[0:100]
offsets = [0, 100, 300]
Then, you only need to pass big_array and offsets to your kernel, and you can access each array. For example:
A[50] = big_array[offsets[0] + 50]
B[20] = big_array[offsets[1] + 20]
C[0] = big_array[offsets[2] + 0]
I'm not sure how this would affect caching on your particular device, but my initial guess is "not well." This kind of array access is a little nasty, as well. I'm not sure if it would be valid, but you could start each of your kernels with some code that extracts each offset and adds it to a copy of the original pointer.
On the host side, in order to keep your arrays more accessible, you can use clCreateSubBuffer: http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateSubBuffer.html which would also allow you to pass references to specific arrays without the offsets array.
I don't think this solution will be any better than passing 60 kernel arguments, but depending on the overhead of your OpenCL implementation's clSetKernelArg, it might be faster. It will certainly reduce the length of your argument list.
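Following the clCreateSubBuffer suggestion, a sketch for the B region of the example above (context is assumed to exist, error handling is omitted, and this requires OpenCL 1.1 or later):
#include <CL/cl.h>

cl_int err;
// One large buffer holding A (100), B (200) and C (100) floats back to back.
cl_mem big_array = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                  400 * sizeof(cl_float), NULL, &err);

// A sub-buffer aliasing the 200-float region of B, which starts at element 100.
// Note: the origin must respect the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN.
cl_buffer_region region_B;
region_B.origin = 100 * sizeof(cl_float);   // offset in bytes
region_B.size   = 200 * sizeof(cl_float);   // length in bytes
cl_mem B = clCreateSubBuffer(big_array, CL_MEM_READ_WRITE,
                             CL_BUFFER_CREATE_TYPE_REGION, &region_B, &err);

// B can now be passed to clSetKernelArg like any stand-alone buffer.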
You need to do two things. First, each kernel should declare an argument for every global memory buffer it uses, something like this:
kernel void awesome_parallel_stuff(global float* buf1, ..., global float* buf60)
so that every buffer used by that kernel is listed. Then, on the host side, you need to create each buffer and use clSetKernelArg to attach a given memory buffer to a given kernel argument before calling clEnqueueNDRangeKernel to get the party started.
Note that if the kernels keep using the same buffers from one kernel execution to the next, you only need to set up the kernel arguments once. A common mistake I see, which can bleed host-side performance, is to repeatedly call clSetKernelArg in situations where it is completely unnecessary.
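A sketch of the "set the arguments once, enqueue many times" pattern (kernel, queue, buffers, num_buffers, num_steps and global_size are assumed to exist already; error checking omitted):
// Bind each buffer to its kernel argument once, at setup time.
for (cl_uint i = 0; i < num_buffers; ++i)
    clSetKernelArg(kernel, i, sizeof(cl_mem), &buffers[i]);

// Enqueue the kernel as often as needed; the arguments stay bound until changed.
for (int step = 0; step < num_steps; ++step)
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);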

Add received data to existing receive buffer in MPI_Sendrecv

I am trying to send data (forces) between 2 processes using MPI_Sendrecv. Normally the data is overwritten in the receive buffer; I do not want to overwrite the data in the receive buffer, but rather add to it the data that is received.
I could do the following: store the data from the previous time step in a different array and then add it after receiving. But I have a huge number of nodes and I do not want to allocate memory for that storage every time step (or keep overwriting the same array).
My question is: is there a way to add the received data directly into the receive buffer, accumulating it in place, using MPI?
Any help in this direction would be really appreciated.
I am sure collective communication calls (MPI_Reduce) cannot be made to work here. Are there any other commands that can do this?
In short: no, but you should be able to do this.
In long: Your suggestion makes a great deal of sense and the MPI Forum is currently considering new features that would enable essentially what you want.
It is incorrect to suggest that the data must be received before it can be accumulated. MPI_Accumulate does a remote accumulation in a one-sided fashion. You want MPI_Sendrecv_accumulate rather than MPI_Sendrecv_replace. This makes perfect sense and an implementation can internally do much better than you can because it can buffer on a per-packet basis, for example.
For suszterpatt: MPI internally buffers in the eager protocol, and in the rendezvous protocol it can set up a pipeline to minimize buffering.
The implementation of MPI_Recv_accumulate (for simplicity, as the MPI_Send part need not be considered) would look something like this in pseudocode:
int MPI_Recv_accumulate(void *buf, int count, MPI_Datatype datatype, MPI_Op op,
                        int source, int tag, MPI_Comm comm, MPI_Status *status)
{
    if (eager)
    {
        /* the message already sits in an internal eager buffer: fold it into buf */
        MPI_Reduce_local(_eager_buffer, buf, count, datatype, op);
    }
    else /* rendezvous */
    {
        /* malloc _buffer */
        while (mycount < count)
        {
            /* receive part of the incoming data into _buffer */
            /* reduce_local from _buffer into buf */
        }
    }
    return MPI_SUCCESS;
}
In short: no.
In long: your suggestion doesn't really make sense. The machine can't perform any operations on your received value without first putting it into local memory somewhere. You'll need a buffer to receive the newest value, and a separate sum that you will increment by the buffer's content after every receive.
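For completeness, the closest you can get with today's API is exactly that: a scratch buffer plus MPI_Reduce_local (available since MPI 2.2). A sketch, where forces is the array being accumulated into and partner and n are illustrative names:
// Exchange with the partner into a scratch buffer, then fold it into 'forces'.
// The scratch buffer can be allocated once and reused every time step.
std::vector<double> incoming(n);
MPI_Sendrecv(forces, n, MPI_DOUBLE, partner, 0,
             incoming.data(), n, MPI_DOUBLE, partner, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Reduce_local(incoming.data(), forces, n, MPI_DOUBLE, MPI_SUM);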
