Recently I saw in a book on computational physics that the following piece of code was used to shift data across a chain of processes:
MPI_Isend(data_send, length, datatype, lower_process, tag, comm, &request);
MPI_Recv(data_receive, length, datatype, upper_process, tag, comm, &status);
MPI_Wait(&request, &status);
I think the same can be achieved by a single call to MPI_Sendrecv and I don't think there is any reason to believe the former is faster. Does it have any advantages?
I believe there is no real difference between the fragment you give and an MPI_Sendrecv call. The sendrecv combo is fully compatible with regular sends and receives: when shifting data through a (non-periodic!) chain, for instance, you could use sendrecv everywhere except at the end points and do a regular send/isend/recv/irecv there.
You can think of two variants on your code fragment: use an Isend/Recv combination or use Isend/Irecv. There are probably minor differences in how these are treated in the underlying protocols, but I wouldn't worry about them.
Your fragment can of course be generalized more easily to patterns other than shifting along a chain, but if you indeed have a setup where each process sends to at most one process and receives from at most one, then I'd use MPI_Sendrecv, just for cleanliness. The fragment you quote only makes the reader wonder: "is there a deep reason for this?"
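For reference, a minimal sketch of the single-call equivalent, reusing the names from the question (lower_process and upper_process are assumed to be valid ranks, or MPI_PROC_NULL at the ends of a non-periodic chain, which turns the corresponding half of the call into a no-op):
/* Send to the lower neighbour and receive from the upper one in one call. */
MPI_Sendrecv(data_send,    length, datatype, lower_process, tag,
             data_receive, length, datatype, upper_process, tag,
             comm, &status);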
Suppose a process is going to send a number of arrays of different sizes but of the same type to another process in a single communication, so that the receiver builds the same arrays in its memory. Prior to the communication the receiver doesn't know the number of arrays and their sizes. So it seems to me that though the task can be done quite easily with MPI_Pack and MPI_Unpack, it cannot be done by creating a new datatype because the receiver doesn't know enough. Can this be regarded as an advantage of MPI_PACK over derived datatypes?
There is a passage in the official MPI document that may be referring to this:
The pack/unpack routines are provided for compatibility with previous libraries. Also, they provide some functionality that is not otherwise available in MPI. For instance, a message can be received in several parts, where the receive operation done on a later part may depend on the content of a former part.
You are absolutely right. The way I phrase it is that with MPI_Pack I can make "self-documenting messages". First you store an integer that says how many elements are coming up, then you pack those elements. The receiver inspects that first int, then unpacks the elements. The only catch is that the receiver needs to know an upper bound on the number of bytes in the pack buffer, but you can get that with a separate message, or with an MPI_Probe.
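A minimal sketch of that pattern in C (MAX_BYTES, nelems, elems, sender, receiver, tag and comm are illustrative names, not from the posts above):
/* Sender: pack a count followed by that many doubles into one message. */
char buf[MAX_BYTES];
int pos = 0;
MPI_Pack(&nelems, 1, MPI_INT, buf, MAX_BYTES, &pos, comm);
MPI_Pack(elems, nelems, MPI_DOUBLE, buf, MAX_BYTES, &pos, comm);
MPI_Send(buf, pos, MPI_PACKED, receiver, tag, comm);

/* Receiver: probe for the actual byte count, then unpack the int first. */
MPI_Status status;
int nbytes, rpos = 0, n;
MPI_Probe(sender, tag, comm, &status);
MPI_Get_count(&status, MPI_PACKED, &nbytes);
MPI_Recv(buf, nbytes, MPI_PACKED, sender, tag, comm, MPI_STATUS_IGNORE);
MPI_Unpack(buf, nbytes, &rpos, &n, 1, MPI_INT, comm);
MPI_Unpack(buf, nbytes, &rpos, elems, n, MPI_DOUBLE, comm);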
There is of course the matter that unpacking a packed message is way slower than straight copying out of a buffer.
Another advantage of packing is that it makes heterogeneous data much easier to handle: building an MPI_Type_struct is rather a bother.
Usually, one would have to define a new type and register it with MPI in order to use it. I am wondering about using protobuf to serialize an object and sending it over MPI as a byte stream. I have two questions:
(1) do you foresee any problem with this approach?
(2) do I need to send length information through a separate MPI_Send(), or can I probe and use MPI_Get_count(&status, MPI_BYTE, &count)?
An example would be:
// sender
MyObj myobj;
...
size_t size = myobj.ByteSizeLong();
void *buf = malloc(size);
myobj.SerializePartialToArray(buf, size);
MPI_Isend(buf, size, MPI_BYTE, ... )
...
// receiver
MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
if (flag) {
MPI_Get_count(&status, MPI_BYTE, &size);
MPI_Recv(buf, size, MPI_BYTE, ... , &status);
MyObj obj;
obj.ParseFromArray(buf, size);
...
}
Generally you can do that. Your code sketch also looks fine (except for the omitted buf allocation on the receiver side). As Gilles points out, make sure to use status.MPI_SOURCE and status.MPI_TAG for the actual MPI_Recv, not the MPI_ANY_* wildcards.
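Something along those lines could look as follows (just a sketch, reusing the names from the question; the blocking MPI_Probe is used instead of the MPI_Iprobe/flag loop for brevity):
// Receiver sketch: size the buffer from the probed message and pin the
// Recv to the probed source/tag so no other message can be matched.
MPI_Status status;
int size;
MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_BYTE, &size);
void *buf = malloc(size);
MPI_Recv(buf, size, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MyObj obj;
obj.ParseFromArray(buf, size);
free(buf);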
However, there are some performance limitations.
Protobuf isn't very fast, particularly because of the en-/decoding. How much that matters depends on your performance expectations; if you run on a high-performance network, assume a significant impact. Here are some basic benchmarks.
Not knowing the message size ahead of time, and thus always posting the receive after the send, also has performance implications. It means the actual transmission will likely start later, which may or may not have an impact on the sender's side, since you are using non-blocking sends. There could be cases where you run into practical limitations regarding the number of unexpected messages. That is not a general correctness issue, but it might require some configuration tuning.
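If the late-posted receive does turn out to matter, one common workaround (not from the answer above, just a sketch; dest, src and the tags are illustrative) is to ship the size in a small separate message so the receiver can allocate and post the large receive up front:
// Sender: announce the size first, then send the serialized bytes.
int nbytes = (int)size;
MPI_Send(&nbytes, 1, MPI_INT, dest, TAG_SIZE, MPI_COMM_WORLD);
MPI_Send(buf, nbytes, MPI_BYTE, dest, TAG_DATA, MPI_COMM_WORLD);

// Receiver: the data receive can be posted as soon as the size is known.
int nbytes;
MPI_Recv(&nbytes, 1, MPI_INT, src, TAG_SIZE, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
void *buf = malloc(nbytes);
MPI_Recv(buf, nbytes, MPI_BYTE, src, TAG_DATA, MPI_COMM_WORLD, MPI_STATUS_IGNORE);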
If you go ahead with your approach, remember to do some performance analysis on the implementation. Use an MPI-aware performance analysis tool to make sure your approach doesn't introduce critical bottlenecks.
Reading Intel's big manual, I see that if you want to return from a far call, that is, a call to a procedure in another code segment, you simply issue a return instruction (possibly with an immediate operand that moves the stack pointer up n bytes after the return address has been popped).
This, apparently, if I'm interpreting things correctly, is enough for the hardware to pop both the segment selector and the offset into the correct registers.
But, how does the system know that the return should be a far return and that both an offset AND a selector need to be popped?
If the hardware just pops the offset pointer and not the selector after it, then you'll be pointing to the right offset but wrong segment.
There is nothing special about the far return command compared to the near return version.
They both look identical as far as I can tell.
I assume then that the processor, perhaps at the micro-architecture level, keeps track of which calls are far and which are near, so that when they're returned from, the system knows how many bytes to pop and where to pop them (pointer registers and segment selector registers).
Is my assumption correct?
What do you guys know about this mechanism?
The processor doesn't track whether or not a call should be far or near; the compiler decides how to encode the function call and return using either far or near opcodes.
As it is, FAR calls have no use on modern processors because you don't need to change any segment register values; that's the point of a flat memory model. Segment registers still exist, but the OS sets them up with base=0 and limit=0xffffffff so just a plain 32-bit pointer can access all memory. Everything is NEAR, if you need to put a name on it.
Normally you just don't even think about segmentation, so you don't usually bother calling it NEAR either. But the manual still describes the call/ret opcodes we use for normal code as the NEAR versions.
FAR and NEAR were used on old x86 processors, which used a segmented memory model. Programs at that time needed to choose which memory model they wished to support, ranging from "tiny" to "large". If your program was small enough to fit in a single segment, it could be compiled using NEAR calls and returns exclusively. If it was "large", the opposite was true. For anything in between, you could choose whether individual functions needed to be callable/returnable from code in another segment.
Most modern programs (besides bootloaders and the like) run on a different construct: they expect a flat memory model. Behind the scenes the OS will swap out memory as needed (with paging not segmentation), but as far as the program is concerned, it has its virtual address space all to itself.
But, to answer your question, the difference in the call/return is the opcode used; the processor obeys the instruction it is given. If you make a mistake (say, give it a FAR return opcode when in flat mode), it'll fail.
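For concreteness (not part of the original answer, just a reference sketch): the near and far variants really are distinct opcodes in the Intel SDM, and that byte alone tells the CPU whether to pop just the offset or the offset plus the CS selector.
#include <stdio.h>

int main(void) {
    /* Relevant encodings (Intel SDM, volume 2):
     *   E8 cd    CALL rel32       near call
     *   FF /2    CALL r/m32       near indirect call
     *   9A cp    CALL ptr16:32    far call  (pushes CS, then EIP)
     *   FF /3    CALL m16:32      far indirect call
     *   C3       RET              near return (pops EIP only)
     *   C2 iw    RET imm16        near return, then adds imm16 to ESP
     *   CB       RETF             far return  (pops EIP, then CS)
     *   CA iw    RETF imm16       far return, then adds imm16 to ESP
     * The CPU does not remember how the call was made; it simply decodes
     * whichever return opcode it finds. */
    unsigned char near_ret = 0xC3, far_ret = 0xCB;
    printf("near RET opcode: 0x%02X, far RETF opcode: 0x%02X\n",
           near_ret, far_ret);
    return 0;
}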
I am trying to exchange data (30 chars) between two processes in order to understand the MPI_Type_contiguous API:
char data[30];
MPI_Datatype mytype;
MPI_Type_contiguous(10, MPI_CHAR, &mytype);
MPI_Type_commit(&mytype);
MPI_Send(data, 3, mytype, 1, 99, MPI_COMM_WORLD);
But a similar task could have been accomplished via:
MPI_Send(data, 30,MPI_CHAR,1, 99,MPI_COMM_WORLD);
I guess there is no latency advantage, as I am using only a single function call in both cases (or is there?).
Can anyone share a use case where MPI_Type_contiguous is advantageous over primitive types(in terms of performance/ease of accomplishing a task)?
One use that immediately comes to mind is sending very large messages. Since the count argument to MPI_Send is of type int, on a typical LP64 (Unix-like OSes) or LLP64 (Windows) 64-bit OS it is not possible to directly send more than 2^31 - 1 elements, even if the MPI implementation internally uses 64-bit lengths. With modern compute nodes having hundreds of GiB of RAM, this is becoming a nuisance.
The solution is to create a new datatype of length m and send n elements of the new type for a total of n*m data elements, where n*m can now be up to (2^31 - 1)^2 = 2^62 - 2^32 + 1. The method is future-proof and can also be used on 128-bit machines, as MPI datatypes can be nested even further. This workaround, and the fact that registering a datatype is way cheaper (in execution time) than the time it takes such large messages to traverse the network, was used by the MPI Forum to reject the proposal for adding new large-count APIs or modifying the argument types of the existing ones. Jeff Hammond (@Jeff) has written a library to simplify the process.
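A sketch of that trick (bigbuf, dest and tag are assumed to exist; the sizes are made up, chosen so that the total is an exact multiple of the chunk length):
/* Send 2^32 doubles in one call, which exceeds the 2^31 - 1 limit of an
 * int count, by wrapping 2^20 doubles into a contiguous "chunk" type. */
MPI_Datatype chunk;
const int m = 1 << 20;                 /* doubles per chunk */
const long long total = 1LL << 32;     /* total doubles, a multiple of m */
const int n = (int)(total / m);        /* 4096 chunks, fits in an int */

MPI_Type_contiguous(m, MPI_DOUBLE, &chunk);
MPI_Type_commit(&chunk);
MPI_Send(bigbuf, n, chunk, dest, tag, MPI_COMM_WORLD);
MPI_Type_free(&chunk);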
Another use is in MPI-IO. When setting the view of a file with MPI_File_set_view, a contiguous datatype can be provided as the elementary datatype. This allows one, for example, to treat binary files of complex numbers in a simpler way in languages that do not have a built-in complex datatype (like the earlier versions of C).
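A sketch of that idea (the file name, buf and ncomplex are illustrative): one complex number is modelled as two contiguous doubles, and the file view is then expressed in whole complex numbers.
/* Read double-precision complex numbers stored as (re, im) pairs. */
MPI_Datatype complex_t;
MPI_File fh;

MPI_Type_contiguous(2, MPI_DOUBLE, &complex_t);
MPI_Type_commit(&complex_t);

MPI_File_open(MPI_COMM_WORLD, "field.bin",
              MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
/* Both etype and filetype are the complex type: offsets and counts in
 * subsequent I/O calls are now expressed in whole complex numbers.   */
MPI_File_set_view(fh, 0, complex_t, complex_t, "native", MPI_INFO_NULL);
MPI_File_read(fh, buf, ncomplex, complex_t, MPI_STATUS_IGNORE);
MPI_File_close(&fh);
MPI_Type_free(&complex_t);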
MPI_Type_contiguous is for making a new datatype which is count copies of the existing one. This is useful for simplifying the process of sending a number of datatypes together, as you don't need to keep track of their combined size (the count in MPI_Send can be replaced by 1).
For your case, I think it is exactly the same. The text from Using MPI, adapted slightly to match your example, is:
When a count argument is used in an MPI operation, it is the same as if a contiguous type of that size had been constructed.
MPI_Send(data, count, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
is exactly the same as
MPI_Type_contiguous(count, MPI_CHAR, &mytype);
MPI_Type_commit(&mytype);
MPI_Send(data, 1, mytype, 1, 99, MPI_COMM_WORLD);
MPI_Type_free(&mytype);
You are correct: as there is only one actual communication call, the latency will be identical (and so will the bandwidth, since the same number of bytes is sent).
I am looking at the code here which I am doing for practice.
http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples/simplempi/main.html
I am confused about the part shown here.
MPI::COMM_WORLD.Reduce(&mypi, &pi, 1, MPI::DOUBLE, MPI::SUM, 0);
if (rank == 0)
cout << "pi is approximately " << pi
<< ", Error is " << fabs(pi - PI25DT)
<< endl;
My question is: does the MPI reduce function know when all the other processes (in this case the programs with ranks 1-3) have finished and that its result is complete?
All collective communication calls (Reduce, Gather, Scatter, etc) are blocking.
@g.inozemtsev is correct. The MPI collective calls -- including those in Open MPI -- are "blocking" in the MPI sense of the word, meaning that you can use the buffer when the call returns. In an operation like MPI_REDUCE, it means that the root process will have the answer in its buffer when it returns. Further, it means that non-root processes in an MPI_REDUCE can safely overwrite their buffer when MPI_REDUCE returns (which usually means that their part in the reduction is complete).
However, note that as mentioned above, the return from a collective operation such as an MPI_REDUCE in one process has no bearing on the return of the same collective operation in a peer process. The only exception to this rule is MPI_BARRIER, because barrier is defined as an explicit synchronization, whereas all the other MPI-2.2 collective operations do not necessarily need to explicitly synchronize.
As a concrete example, say that all non-root processes call MPI_REDUCE at time X. The root finally calls MPI_REDUCE at time X+N (for this example, assume N is large). Depending on the implementation, the non-root processes may return much earlier than X+N or they may block until X+N(+M). The MPI standard is intentionally vague on this point to allow MPI implementations to do what they want / need (which may also be dictated by resource consumption/availability).
Hence, @g.inozemtsev's point of "You cannot rely on synchronization" (except for with MPI_BARRIER) is correct.
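If every rank really does need to know that the reduction has completed everywhere, the explicit way is to follow it with a barrier. A minimal sketch using the C API, reusing the variable names from the question:
/* Only the root has the reduced value; the barrier is what guarantees
 * that all ranks have passed this point before anyone continues. */
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0)
    printf("pi is approximately %.16f\n", pi);
MPI_Barrier(MPI_COMM_WORLD);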