I have a routine named transfer(int) that calls MPI routines. In my main program, transfer() is called twice:
... // do some work
transfer(1); // first transfer
... // do some work again
transfer(2); // second transfer
The transfer(int) function looks like this:
... // do some work
MPI_Barrier(MPI_COMM_WORLD);
t0 = clock();
for (int k = 0; k < mpisize; k++) {
    MPI_Irecv((void*)rbuffer[k], rsize[k], MPI_BYTE, k, 0, MPI_COMM_WORLD, reqs + 2*k);
    MPI_Isend((void*)sbuffer[k], ssize[k], MPI_BYTE, k, 0, MPI_COMM_WORLD, reqs + 2*k + 1);
}
MPI_Waitall(2*mpisize, reqs, status);
MPI_Barrier(MPI_COMM_WORLD);
if (mpirank==0) cerr << "Transfer took "<< (double)(clock()-t0)/CLOCKS_PER_SEC << " secs" << endl;
Note that I only measure the communication time, excluding the pre-processing.
For transfer(1), all send and receive buffers have size 0 for each k. So essentially, there's no communication going on. Yet, the transfer took 1.95 seconds.
For transfer(2), each processor has to send/receive about 20KB to/from every other processor. Yet, the whole transfer took only 0.07 seconds.
I ran the experiment many times with 1024 processors and the measurements are consistent. Can you explain this phenomenon or what could possibly be wrong?
Thanks.
You could use an MPI performance analysis tool such as Vampir or Scalasca to better understand what is happening:
Are all communications slow, or are only a few slow while the rest wait at the barrier?
What is the influence of the barrier?
The actual answer depends highly on your system and MPI implementation. Anycorn's comment that zero-size messages still require communication, and that the first communication can have additional overhead, is a good starting point for investigation. So another question you should try to answer is:
How does a second zero size message behave?
Also, MPI implementations can handle messages of different sizes in fundamentally different ways, e.g. by using unexpected-message buffers, but that again is implementation and system dependent.
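To probe that second question, here is a minimal sketch (my own code, not your transfer() routine) that times two identical rounds of zero-size Isend/Irecv with MPI_Wtime, which gives wall-clock time, whereas clock() measures CPU time:

// Minimal sketch (not the original transfer() code): time two identical rounds
// of zero-size Isend/Irecv to see whether only the first round is slow.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int mpirank, mpisize;
    MPI_Comm_rank(MPI_COMM_WORLD, &mpirank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpisize);

    std::vector<MPI_Request> reqs(2 * mpisize);
    char dummy = 0;  // zero-size messages: the buffer contents are never read

    for (int round = 0; round < 2; round++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();  // wall-clock time, unlike clock()
        for (int k = 0; k < mpisize; k++) {
            MPI_Irecv(&dummy, 0, MPI_BYTE, k, 0, MPI_COMM_WORLD, &reqs[2*k]);
            MPI_Isend(&dummy, 0, MPI_BYTE, k, 0, MPI_COMM_WORLD, &reqs[2*k + 1]);
        }
        MPI_Waitall(2 * mpisize, reqs.data(), MPI_STATUSES_IGNORE);
        MPI_Barrier(MPI_COMM_WORLD);
        if (mpirank == 0)
            std::printf("Round %d took %f secs\n", round, MPI_Wtime() - t0);
    }

    MPI_Finalize();
    return 0;
}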
I'm completely stuck. This code
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if(myrank==0) i=1;
if(myrank==1) i=0;
MPI_Send(sendbuf, 1, MPI_INT, i, 99, MPI_COMM_WORLD);
MPI_Recv(recvbuf, 1, MPI_INT, i, 99, MPI_COMM_WORLD,&status);
works when run on two processes. Why is there no deadlock?
The same thing happens with the non-blocking version:
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if(myrank==0) i=1;
if(myrank==1) i=0;
MPI_Isend(sendbuf, 1, MPI_INT, i, 99, MPI_COMM_WORLD,&req);
MPI_Wait(&req, &status);
MPI_Irecv(recvbuf, 1, MPI_INT, i, 99, MPI_COMM_WORLD,&req);
MPI_Wait(&req, &status);
Logically there should be a deadlock, but there is none. Why?
How do I force MPI to block?
MPI has two kinds of point-to-point operations - blocking and non-blocking. This might be a bit confusing initially, especially when it comes to sending messages, but the blocking behaviour does not relate to the physical data transfer operation. It rather relates to the period during which MPI might still access the buffer (either send or receive) and therefore the application is not supposed to modify its content (with send operations) or might still read old garbage (with receive operations).
When you make a blocking call to MPI, it only returns once it has finished using the send or the receive buffer. With receive operations this means that the message has been received and stored in the buffer, but with send operations things are more complex. There are several modes for the send operation:
buffered -- data isn't sent immediately, but rather copied to a local user-supplied buffer, and the call returns right after. The actual message transfer happens at some point in the future when MPI decides to do so. There is a special MPI call, MPI_Bsend, which always behaves this way. There are also calls for attaching and detaching the user-supplied buffer (see the sketch after this list).
synchronous -- MPI waits until a matching receive operation is posted and the data transfer is in progress. It returns once the entire message is in transit and the send buffer is free to be reused. There is a special MPI call, MPI_Ssend, which always behaves this way.
ready -- MPI tries to send the message and it only succeeds if the receive operation has already been posted. The idea is to skip the handshake between the ranks, which may reduce latency, but it is unspecified what exactly happens if the receiver is not ready. There is a special call for that, MPI_Rsend, but it is advised not to use it unless you really know what you are doing.
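As an illustration of the buffered mode (a minimal sketch of my own, not taken from the question): the user buffer is attached with MPI_Buffer_attach, must include MPI_BSEND_OVERHEAD per pending message, and detaching blocks until all buffered messages have actually been delivered.

// Minimal sketch (illustration only): buffered send with a user-supplied buffer.
#include <mpi.h>
#include <cstdlib>

void bsend_one_int(int* value, int dest, MPI_Comm comm) {
    int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;  // room for one buffered int message
    void* buffer = std::malloc(bufsize);

    MPI_Buffer_attach(buffer, bufsize);
    MPI_Bsend(value, 1, MPI_INT, dest, 99, comm);    // returns once the data is copied locally

    void* detached;
    int detached_size;
    MPI_Buffer_detach(&detached, &detached_size);    // blocks until buffered messages are sent
    std::free(detached);
}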
MPI_Send invokes what is known as the standard send mode, which could be any combination of the synchronous and the buffered mode, with the latter using a buffer supplied by the MPI system and not the user-supplied one. The actual details are left to the MPI implementation and hence differ wildly.
It is most often the case that small messages are buffered while larger messages are sent synchronously, which is what you observe in your case, but one should never ever rely on this behaviour and the definition of "small" varies with the implementation and the network kind. A correct MPI program will not deadlock if all MPI_Send calls are replaced with MPI_Ssend, which means you must never assume that small messages are buffered. But a correct program will also not expect MPI_Send to be synchronous for larger messages and rely on this behaviour for synchronisation between the ranks, i.e., replacing MPI_Send with MPI_Bsend and providing a large enough buffer should not alter the program behaviour.
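To answer the last part of your question (how to force MPI to block): replace MPI_Send with MPI_Ssend. In your first snippet both ranks would then wait inside the send for a matching receive that has not been posted yet, which deadlocks deterministically. A sketch, reusing the variables from your code:

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank==0) i=1;
if (myrank==1) i=0;
// Both ranks block here forever: MPI_Ssend waits for the matching receive,
// but neither rank ever reaches its MPI_Recv.
MPI_Ssend(sendbuf, 1, MPI_INT, i, 99, MPI_COMM_WORLD);
MPI_Recv(recvbuf, 1, MPI_INT, i, 99, MPI_COMM_WORLD, &status);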
There is a very simple solution that always works and it frees you from having to remember to not rely on any assumptions -- simply use MPI_Sendrecv. It is a combined send and receive operation that never deadlocks, except when the send or the receive operation (or both) is unmatched. With send-receive, your code will look like this:
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank==0) i=1;
if (myrank==1) i=0;
MPI_Sendrecv(sendbuf, 1, MPI_INT, i, 99,
recvbuf, 1, MPI_INT, i, 99, MPI_COMM_WORLD, &status);
Usually, one would have to define a new datatype and register it with MPI in order to send a custom object. I am wondering if I can instead use protobuf to serialize the object and send it over MPI as a byte stream. I have two questions:
(1) do you foresee any problem with this approach?
(2) do I need to send length information through a separate MPI_Send(), or can I probe and use MPI_Get_count(&status, MPI_BYTE, &count)?
An example would be:
// sender
MyObj myobj;
...
size_t size = myobj.ByteSizeLong();
void *buf = malloc(size);
myobj.SerializePartialToArray(buf, size);
MPI_Isend(buf, size, MPI_BYTE, ... )
...
// receiver
MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
if (flag) {
    MPI_Get_count(&status, MPI_BYTE, &size);
    MPI_Recv(buf, size, MPI_BYTE, ... , &status);
    MyObj obj;
    obj.ParseFromArray(buf, size);
    ...
}
Generally you can do that. Your code sketch also looks fine (except for the omitted buf allocation on the receiver side). As Gilles points out, make sure to use status.MPI_SOURCE and status.MPI_TAG for the actual MPI_Recv, not MPI_ANY_SOURCE/MPI_ANY_TAG.
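For your second question, a minimal receiver sketch along those lines (my own illustration, using the blocking MPI_Probe for brevity and assuming MPI_BYTE messages as in your example):

// Minimal sketch (illustration only): probe to learn the size, allocate,
// then receive from exactly the probed source and tag.
MPI_Status status;
MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

int size;
MPI_Get_count(&status, MPI_BYTE, &size);

void *buf = malloc(size);
MPI_Recv(buf, size, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

MyObj obj;                     // MyObj as in your sketch
obj.ParseFromArray(buf, size);
free(buf);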
However, there are some performance limitations.
Protobuf isn't very fast, particularly due to en-/decoding. It very much depends on what your performance expectations are. If you run on a high-performance network, expect a significant impact. Here are some basic benchmarks.
Not knowing the message size ahead of time, and thus always posting the receive after the send, also has performance implications. This means the actual transmission will likely start later, which may or may not have an impact on the sender's side, since you are using non-blocking sends. There could be cases where you run into practical limitations regarding the number of unexpected messages. That is not a general correctness issue, but it might require some configuration tuning.
If you go ahead with your approach, remember to do some performance analysis on the implementation. Use an MPI-aware performance analysis tool to make sure your approach doesn't introduce critical bottlenecks.
I'm new to HPC and I am curious about a point regarding the performance of MPI over InfiniBand. For reference, I am using Open MPI on two different machines connected through IB.
I've coded a very simple benchmark to see how fast I can transfer data over IB using MPI calls. Below you can see the code.
The issue is that when I run this, I get a throughput of ~1.4 gigabytes/s. However, when I use standard ib benchmarks like ib_write_bw, I get nearly 6 GB/s. What might account for this sizable discrepancy? Am I being naive about Gather, or is this just a result of OpenMPI overheads that I can't overcome?
In addition to the code, I am providing a plot to show the results of my simple benchmark.
Thanks in advance!
Results: (throughput plot omitted)
Code:
#include <iostream>
#include <mpi.h>
#include <stdint.h>
#include <ctime>

using namespace std;

void server(unsigned int size, unsigned int n) {
    uint8_t* recv = new uint8_t[size * n];
    uint8_t* send = new uint8_t[size];
    std::clock_t s = std::clock();
    MPI_Gather(send, size, MPI_CHAR, recv, size, MPI_CHAR, 0, MPI_COMM_WORLD);
    std::clock_t e = std::clock();
    cout << size << " " << (e - s) / double(CLOCKS_PER_SEC) << endl;
    delete [] recv;
    delete [] send;
}

void client(unsigned int size, unsigned int n) {
    uint8_t* send = new uint8_t[size];
    MPI_Gather(send, size, MPI_CHAR, NULL, 0, MPI_CHAR, 0, MPI_COMM_WORLD);
    delete [] send;
}

int main(int argc, char **argv) {
    int ierr, size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    cout << "Rank " << rank << " of " << size << endl;

    unsigned int min = 1, max = (1u << 31), n = 1000;
    for (unsigned int i = 1; i < n; i++) {
        unsigned int s = i * ((max - min) / n);
        if (rank == 0) server(s, size); else client(s, size);
    }

    MPI_Finalize();
}
In your code you are executing a single collective operation per message size.
This involves huge overhead in comparison with tests that were written for performance measurement (e.g. ib_write_bw), which run many iterations per message size and average the results.
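A minimal sketch of a fairer measurement (my own illustration, not code from those benchmarks): run a few warm-up iterations, then time many iterations with MPI_Wtime and report the average.

// Minimal sketch (illustration only): time MPI_Gather the way dedicated
// benchmarks do it - warm up first, then average over many iterations.
double time_gather(void* sendbuf, void* recvbuf, int size, int root,
                   int warmup, int iters, MPI_Comm comm) {
    for (int i = 0; i < warmup; i++)                       // warm-up: registration, caches, ...
        MPI_Gather(sendbuf, size, MPI_CHAR, recvbuf, size, MPI_CHAR, root, comm);

    MPI_Barrier(comm);                                     // start everyone together
    double t0 = MPI_Wtime();                               // wall-clock time, unlike clock()
    for (int i = 0; i < iters; i++)
        MPI_Gather(sendbuf, size, MPI_CHAR, recvbuf, size, MPI_CHAR, root, comm);
    return (MPI_Wtime() - t0) / iters;                     // average time per gather
}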
In general, comparing MPI collectives to ib_write_bw is not an apples-to-apples comparison:
RDMA opcode
ib_write_bw uses RDMA_WRITE operations, which don't use the CPU at all - once the initial handshake is done, it is pure RDMA, constrained only by network and PCIe capabilities.
MPI will use different RDMA opcodes for different collectives and different message sizes, and if you do it the way you did in your code, there are lots of things MPI has to do for each message (hence the huge overhead).
Data overhead
ib_write_bw transfers almost pure data (there's a local routing header and a payload)
MPI has more data (headers) added to each packet to allow the receiver to identify the message
Zero copy
ib_write_bw does what is called "zero-copy" - data is sent directly from a user buffer and written directly to a user buffer on the receiving side, without intermediate copies.
MPI will copy the message from your client's buffer to its internal buffers on the sending side, then copy it again from its internal buffers on the receiving side to your server's buffer. Again, this behaviour depends on the message size, the MPI configuration and the MPI implementation, but you get the general idea.
Memory registration
ib_write_bw registers the required memory region and exchanges this info between client and server before starting measuring performance
If MPI needs to register some memory region during the collective execution, it will do so while you are measuring time.
There are many more
even the "small" things like warming up the cache lines on the HCAs...
So, now that we've covered why you shouldn't compare these things, here's what you should do:
There are two benchmark suites that are regarded as the de facto standard for MPI performance measurement:
IMB (Intel MPI Benchmarks) - it says Intel, but it is written as a standard MPI application and will work with any MPI implementation.
OSU Micro-Benchmarks - again, they come from the MVAPICH team, but they will work with any MPI.
Download those, compile with your MPI, run your benchmarks, see what you get.
This is as high as you can get with MPI.
If you get much better results than with your small program (and you surely will) - these are open source, so see how the pros are doing it :)
Have fun!
You have to consider that the full payload size for the collective call received on rank 0 depends on the number of ranks. So with, say, 4 processes sending 1000 bytes you actually receive 4000 bytes on the root rank. That includes a memory copy from rank 0's input buffer into the output buffer (possibly with a detour through the network stack). And that is before you add the overheads of MPI and the lower networking protocols.
On Nvidia GPUs, when I call clEnqueueNDRangeKernel, the program waits for it to finish before continuing. More precisely, I'm calling its equivalent C++ binding, cl::CommandQueue::enqueueNDRangeKernel, but this shouldn't make a difference. This only happens on Nvidia hardware (3 Tesla M2090s) remotely; on our office workstations with AMD GPUs, the call is non-blocking and returns immediately. I don't have local Nvidia hardware to test on - we used to, and I remember similar behavior then, too, but it's a bit hazy.
This makes spreading the work across multiple GPUs harder. I've tried starting a new thread for each call to enqueueNDRangeKernel using std::async/std::future from the new C++11 standard, but that doesn't seem to work either - monitoring the GPU usage in nvidia-smi, I can see that the memory usage on GPU 0 goes up, then it does some work, then the memory on GPU 0 goes down and the memory on GPU 1 goes up, that one does some work, and so on. My gcc version is 4.7.0.
Here's how I'm starting the kernels, where increment is the desired global work size divided by the number of devices, rounded up to the nearest multiple of the desired local work size:
std::vector<cl::CommandQueue> queues;
/* Population of queues happens somewhere */
cl::NDRange offset, increment, local;
std::vector<std::future<cl_int>> enqueueReturns;
int numDevices = queues.size();
/* Calculation of increment (local is gotten from the function parameters) */

//Distribute the job among each of the devices in the context
for(int i = 0; i < numDevices; i++)
{
    //Update the offset for the current device
    offset = cl::NDRange(i*increment[0], i*increment[1], i*increment[2]);

    //Start a new thread for each call to enqueueNDRangeKernel
    enqueueReturns.push_back(std::async(
        std::launch::async,
        &cl::CommandQueue::enqueueNDRangeKernel,
        &queues[i],
        kernels[kernel],
        offset,
        increment,
        local,
        (const std::vector<cl::Event>*)NULL,
        (cl::Event*)NULL));
    //Without those last two casts, the program won't even compile
}

//Wait for all threads to join before returning
for(int i = 0; i < numDevices; i++)
{
    execError = enqueueReturns[i].get();
    if(execError != CL_SUCCESS)
        std::cerr << "Informative error omitted due to length" << std::endl;
}
The kernels definitely should be running on the call to std::async, since I can create a little dummy function, set a breakpoint on it in GDB and have it step into it the moment std::async is called. However, if I make a wrapper function for enqueueNDRangeKernel, run it there, and put in a print statement after the run, I can see that it takes some time between prints.
P.S. The Nvidia dev zone is down due to hackers and such, so I haven't been able to post the question there.
EDIT: Forgot to mention - the buffer that I'm passing to the kernel as an argument (and the one I mention above that seems to get passed between the GPUs) is declared as using CL_MEM_COPY_HOST_PTR. I had been using CL_MEM_READ_WRITE, with the same effect.
I emailed the Nvidia guys and actually got a pretty fair response. There's a sample in the Nvidia SDK that shows that, for each device, you need to create separate:
queues - so you can represent each device and enqueue work to it
buffers - One buffer for each array you need to pass to the device, otherwise the devices will pass around a single buffer, waiting for it to become available and effectively serializing everything.
kernel - I think this one's optional, but it makes specifying arguments a lot easier.
Furthermore, you have to call enqueueNDRangeKernel for each queue in a separate thread. That's not in the SDK sample, but the Nvidia guy confirmed that the calls are blocking.
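A rough sketch of that per-device setup (my own illustration; context, devices, program, bytesPerDevice and the kernel name are assumed to exist elsewhere):

// Rough sketch (illustration only): one queue, one buffer and one kernel object
// per device, so the devices don't serialize on shared resources.
std::vector<cl::CommandQueue> queues;
std::vector<cl::Buffer>       buffers;
std::vector<cl::Kernel>       kernels;

for (size_t d = 0; d < devices.size(); d++) {
    queues.push_back(cl::CommandQueue(context, devices[d]));                   // per-device queue
    buffers.push_back(cl::Buffer(context, CL_MEM_READ_WRITE, bytesPerDevice)); // per-device buffer
    kernels.push_back(cl::Kernel(program, "my_kernel"));                       // "my_kernel" is a placeholder name
    kernels[d].setArg(0, buffers[d]);                                          // each kernel uses its own buffer
}
// Each queues[d] then gets its enqueueNDRangeKernel call from its own
// std::async thread, as in the loop from the question.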
After doing all this, I achieved concurrency on multiple GPUs. However, there's still a bit of a problem. On to the next question...
Yes, you're right. AFAIK the Nvidia implementation has a synchronous clEnqueueNDRangeKernel. I have noticed this when using my library (Brahma) as well. I don't know if there is a workaround or a way of preventing this, save using a different implementation (and hence a different device).
I am looking at the code here, which I am working through for practice.
http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples/simplempi/main.html
I am confused about the part shown here.
MPI::COMM_WORLD.Reduce(&mypi, &pi, 1, MPI::DOUBLE, MPI::SUM, 0);
if (rank == 0)
    cout << "pi is approximately " << pi
         << ", Error is " << fabs(pi - PI25DT)
         << endl;
My question is: does the MPI Reduce call know when all the other processes (in this case the programs with ranks 1-3) have sent their contributions, so that its result is complete?
All collective communication calls (Reduce, Gather, Scatter, etc) are blocking.
@g.inozemtsev is correct. The MPI collective calls -- including those in Open MPI -- are "blocking" in the MPI sense of the word, meaning that you can use the buffer when the call returns. In an operation like MPI_REDUCE, it means that the root process will have the answer in its buffer when it returns. Further, it means that non-root processes in an MPI_REDUCE can safely overwrite their buffer when MPI_REDUCE returns (which usually means that their part in the reduction is complete).
However, note that as mentioned above, the return from a collective operation such as an MPI_REDUCE in one process has no bearing on the return of the same collective operation in a peer process. The only exception to this rule is MPI_BARRIER, because barrier is defined as an explicit synchronization, whereas all the other MPI-2.2 collective operations do not necessarily need to explicitly synchronize.
As a concrete example, say that all non-root processes call MPI_REDUCE at time X. The root finally calls MPI_REDUCE at time X+N (for this example, assume N is large). Depending on the implementation, the non-root processes may return much earlier than X+N or they may block until X+N(+M). The MPI standard is intentionally vague on this point to allow MPI implementations to do what they want / need (which may also be dictated by resource consumption/availability).
Hence, @g.inozemtsev's point that "you cannot rely on synchronization" (except with MPI_BARRIER) is correct.
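If you actually need all ranks to be synchronized after the reduction, a minimal sketch (my own illustration, reusing the variables from the question's snippet) is to follow the reduce with an explicit barrier:

// Minimal sketch (illustration only): MPI_Reduce completes locally when your
// buffer is safe to reuse; only the explicit barrier guarantees that all ranks
// have reached this point.
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);   // now every rank is known to have passed the reduction
if (rank == 0)
    cout << "pi is approximately " << pi << endl;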