I'm new to HPC and I am curious about a point regarding the performance of MPI over Infiniband. For reference, I am using OpenMPI over two different machines connected through IB.
I've coded a very simple benchmark to see how fast I can transfer data over IB using MPI calls. Below you can see the code.
The issue is that when I run this, I get a throughput of ~1.4 GB/s. However, when I use standard IB benchmarks like ib_write_bw, I get nearly 6 GB/s. What might account for this sizable discrepancy? Am I being naive about Gather, or is this just a result of OpenMPI overheads that I can't overcome?
In addition to the code, I am providing a plot to show the results of my simple benchmark.
Thanks in advance!
Results:
Code:
#include <iostream>
#include <mpi.h>
#include <stdint.h>
#include <ctime>

using namespace std;

void server(unsigned int size, unsigned int n) {
    uint8_t* recv = new uint8_t[size * n];
    uint8_t* send = new uint8_t[size];
    std::clock_t s = std::clock();
    MPI_Gather(send, size, MPI_CHAR, recv, size, MPI_CHAR, 0, MPI_COMM_WORLD);
    std::clock_t e = std::clock();
    cout << size << " " << (e - s) / double(CLOCKS_PER_SEC) << endl;
    delete [] recv;
    delete [] send;
}

void client(unsigned int size, unsigned int n) {
    uint8_t* send = new uint8_t[size];
    MPI_Gather(send, size, MPI_CHAR, NULL, 0, MPI_CHAR, 0, MPI_COMM_WORLD);
    delete [] send;
}

int main(int argc, char **argv) {
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    cout << "Rank " << rank << " of " << size << endl;
    unsigned int min = 1, max = (1u << 31), n = 1000;
    for (unsigned int i = 1; i < n; i++) {
        unsigned int s = i * ((max - min) / n);
        if (rank == 0) server(s, size); else client(s, size);
    }
    MPI_Finalize();
}
In your code you are executing a single collective operation per message size.
This incurs huge overhead compared with tests written specifically for performance measurement (e.g. ib_write_bw); a sketch of a more typical measurement loop follows.
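For comparison, here is a minimal sketch (reusing the buffers and size variable from your server() code, so those names are assumptions carried over from the question) of how measurement-oriented benchmarks typically structure a timed run: a few warm-up iterations, many timed iterations, and wall-clock timing with MPI_Wtime instead of clock():
// Sketch only: every rank must execute the same loop.
const int warmup = 10, iters = 100;
for (int i = 0; i < warmup; i++)          // warm up caches, connections, memory registrations
    MPI_Gather(send, size, MPI_CHAR, recv, size, MPI_CHAR, 0, MPI_COMM_WORLD);

MPI_Barrier(MPI_COMM_WORLD);              // start everyone together
double t0 = MPI_Wtime();                  // wall-clock time, unlike clock()
for (int i = 0; i < iters; i++)
    MPI_Gather(send, size, MPI_CHAR, recv, size, MPI_CHAR, 0, MPI_COMM_WORLD);
double avg = (MPI_Wtime() - t0) / iters;  // average time per collective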
In general, comparing MPI collectives to ib_write_bw is not an apples-to-apples comparison:
RDMA opcode
ib_write_bw uses RDMA_WRITE operations, which don't involve the CPU at all: once the initial handshake is done, it is pure RDMA, constrained only by network and PCIe capabilities.
MPI uses different RDMA opcodes for different collectives and different message sizes, and when driven the way your code does it, MPI does a lot of work for each message (hence the huge overhead)
Data overhead
ib_write_bw transfers almost pure data (there's a local routing header and a payload)
MPI has more data (headers) added to each packet to allow the receiver to identify the message
Zero copy
ib_write_bw does what is called "zero-copy": data is sent directly from a user buffer and written directly into a user buffer on the receiving side, without intermediate copies.
MPI will copy the message from your client's buffer into its internal buffers on the sender side, then copy it again from its internal buffers into your server's buffer on the receiving side. Again, this behaviour depends on the message size, the MPI configuration and the MPI implementation, but you get the general idea.
Memory registration
ib_write_bw registers the required memory regions and exchanges this information between client and server before it starts measuring performance
If MPI needs to register a memory region during the collective execution, it does so while you are measuring time
There are many more differences
even "small" things like warming up the cache lines on the HCAs...
So, now that we've covered why you shouldn't compare these things, here's what you should do:
There are two libraries that are regarded as a de-facto standard for MPI performance measurement:
IMB (Intel MPI Benchmark) - it says Intel, but it is written as a standard MPI application and will work with any MPI implementation.
OSU benchmarks - again, it says MVAPICH, but it will work with any MPI.
Download those, compile with your MPI, run your benchmarks, see what you get.
This is as high as you can get with MPI.
If you get much better results than with your small program (and you surely will) - this is open source, so see how the pros are doing it :)
Have fun!
You have to consider that the full payload size for the collective call received on rank 0 depends on the number of ranks. So with, say, 4 processes sending 1000 bytes you actually receive 4000 bytes on the root rank. That includes a memory copy from rank 0's input buffer into the output buffer (possibly with a detour through the network stack). And that is before you add the overheads of MPI and the lower networking protocols.
Related
Usually, one would have to define a new type and register it with MPI in order to send it. I am wondering about using protobuf to serialize an object and sending it over MPI as a byte stream. I have two questions:
(1) do you foresee any problem with this approach?
(2) do I need to send length information through a separate MPI_Send(), or can I probe and use MPI_Get_count(&status, MPI_BYTE, &count)?
An example would be:
// sender
MyObj myobj;
...
size_t size = myobj.ByteSizeLong();
void *buf = malloc(size);
myobj.SerializePartialToArray(buf, size);
MPI_Isend(buf, size, MPI_BYTE, ... )
...
// receiver
MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
if (flag) {
    MPI_Get_count(&status, MPI_BYTE, &size);
    MPI_Recv(buf, size, MPI_BYTE, ... , &status);
    MyObj obj;
    obj.ParseFromArray(buf, size);
    ...
}
Generally you can do that. Your code sketch also looks fine (except for the omitted buf allocation on the receiver side). As Gilles points out, make sure to use status.MPI_SOURCE and status.MPI_TAG for the actual MPI_Recv, not MPI_*_ANY.
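A minimal sketch of the receive side along those lines (buffer allocation added; it assumes MPI is already initialized and the sender posts MPI_Isend with MPI_BYTE as in your example, and uses the MyObj protobuf type from the question):
#include <mpi.h>
#include <vector>

// ...
MPI_Status status;
MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);  // blocking probe

int size = 0;
MPI_Get_count(&status, MPI_BYTE, &size);   // length of the serialized message

std::vector<char> buf(size);               // allocate only after the size is known
MPI_Recv(buf.data(), size, MPI_BYTE,
         status.MPI_SOURCE, status.MPI_TAG,  // not MPI_ANY_* for the actual receive
         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

MyObj obj;
obj.ParseFromArray(buf.data(), size);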
However, there are some performance limitations.
Protobuf isn't very fast, particularly due to the encoding/decoding. It very much depends on what your performance expectations are. If you run on a high-performance network, assume a significant impact. Here are some basic benchmarks.
Not knowing the message size ahead of time, and therefore always posting the receive after the send, also has performance implications. It means the actual transmission will likely start later, which may or may not have an impact on the sender's side since you are using non-blocking sends. There could be cases where you run into practical limits on the number of unexpected messages. That is not a general correctness issue, but it might require some configuration tuning.
If you go ahead with your approach, remember to do some performance analysis on the implementation. Use an MPI-aware performance analysis tool to make sure your approach doesn't introduce critical bottlenecks.
I'm learning OpenCL and attempting to use it in a low-latency scenario, so I'm really concerned about memory transfer delays.
According to NVIDIA's OpenCL Best Practices Guide, and also according to many other sources, direct reads/writes on a buffer object should be avoided. Instead, we should use the map/unmap facility. In that guide, demonstration code like this is given:
cl_mem cmPinnedBufIn = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, memSize, NULL, NULL);
cl_mem cmDevBufIn = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, memSize, NULL, NULL);
unsigned char* cDataIn = (unsigned char*) clEnqueueMapBuffer(cqCommandQue, cmPinnedBufIn, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, NULL);
for(unsigned int i = 0; i < memSize; i++)
{
    cDataIn[i] = (unsigned char)(i & 0xff);
}
clEnqueueWriteBuffer(cqCommandQue, cmDevBufIn, CL_FALSE, 0, memSize, cDataIn, 0, NULL, NULL);
In this code snippet, two buffer objects are generated explicitly, and a write-to-device operation is also explicitly called.
If my understanding is correct, when you call clCreateBuffer with CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR, the storage for the buffer object is created on the host side, probably in DMA-able memory, and no storage is allocated on the device side. So the above code actually creates two separate storages. If so:
What would happen if I call map buffer on cmDevBufIn, which does not have host-side memory?
For CPU-integrated GPUs there is no separate graphics memory. In particular, for newer AMD APUs the memory address space is unified. So it seems that creating two buffer objects is not a good idea. What is the best practice for integrated platforms?
Is there any way to write a single version of the memory-transfer code for all platforms? Or must I write several different suites of memory-transfer code to achieve the best performance for NVIDIA, AMD discrete GPUs, old AMD APUs, new AMD APUs and Intel HD Graphics......
Unfortunately, it's different for each vendor.
NVIDIA claims their best bandwidth comes from using read/write buffer operations where the host memory is "pinned", which can be achieved by creating a buffer with CL_MEM_ALLOC_HOST_PTR and mapping it (I think that is what your example does). You should also compare that to just mapping and unmapping the device memory; their more recent drivers have gotten better at that.
With AMD you can just map/unmap the device buffer to get full speed. They also have a bunch of vendor-specific buffer flags which can make certain scenarios faster; you should study them but more importantly create benchmarks that try out everything to see what actually works best for your task.
With both discrete devices you should use separate command queues for the transfer operations so they can overlap with other (non-dependent) compute operations (look up various compute overlap examples). Furthermore, some higher end discrete GPUs can be downloading one buffer at the same time they are uploading another (using dual DMA engines), so you could be uploading one batch of work while you're computing another while you're downloading the result of a third. When written elegantly, this isn't even much more code than the strictly sequential version, but you have to use OpenCL events to synchronize between command queues. NVIDIA has a GTC talk you can watch that shows how to do this for video frames every 16 ms.
With AMD's APU and with Intel's Integrated Graphics, the map/unmap of the "device" buffer is "free" since it is in main memory. Don't use read/write buffer here or you'll be paying for unneeded transfers.
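To make the map/unmap path concrete, here is a rough sketch (identifier names such as context, queue and memSize are illustrative, error handling omitted). On an APU or Intel HD Graphics the map is essentially free; on a discrete GPU it may trigger a transfer, so benchmark both routes on your hardware:
cl_int err;
cl_mem devBuf = clCreateBuffer(context, CL_MEM_READ_ONLY, memSize, NULL, &err);

// Map the device buffer for writing from the host.
unsigned char* ptr = (unsigned char*) clEnqueueMapBuffer(
    queue, devBuf, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &err);

for (unsigned int i = 0; i < memSize; i++)   // fill the buffer in place
    ptr[i] = (unsigned char)(i & 0xff);

clEnqueueUnmapMemObject(queue, devBuf, ptr, 0, NULL, NULL);  // hand it back to the device
// ... enqueue kernels that read devBuf ...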
What is the best way (in any sense) of allocating memory for OpenCL output data? Is there a solution what works reasonably with both discrete and integrated graphics?
As a super-simplified example, consider the following C++ (host) code:
std::vector<float> generate_stuff(size_t num_elements) {
    std::vector<float> result(num_elements);
    for(size_t i = 0; i < num_elements; ++i)
        result[i] = i;
    return result;
}
This can be implemented using an OpenCL kernel:
__kernel void gen_stuff(__global float* result) {
    result[get_global_id(0)] = get_global_id(0);
}
The most straightforward solution is to allocate an array on both the device and the host, then copy the data after the kernel has finished:
std::vector<float> generate_stuff(size_t num_elements) {
    // global context/kernel/queue objects set up appropriately
    cl_mem result_dev = clCreateBuffer(context, CL_MEM_WRITE_ONLY, num_elements*sizeof(float), nullptr, nullptr);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &result_dev);
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &num_elements, nullptr, 0, nullptr, nullptr);
    std::vector<float> result(num_elements);
    clEnqueueReadBuffer(queue, result_dev, CL_TRUE, 0, num_elements*sizeof(float), result.data(), 0, nullptr, nullptr);
    return result;
}
This works reasonably well with discrete cards. But with shared-memory graphics, it means allocating twice and doing an extra copy. How can one avoid this? One thing is for sure: one should drop clEnqueueReadBuffer and use clEnqueueMapBuffer/clEnqueueUnmapMemObject instead.
Some alternative scenarios:
Deal with an extra memory copy. Acceptable if memory bandwidth is not an issue.
Allocate a normal memory array on the host and use CL_MEM_USE_HOST_PTR when creating the buffer. The allocation should use device-specific alignment - it is 4k with Intel HD Graphics: https://software.intel.com/en-us/node/531272 I am not aware whether this can be queried from the OpenCL environment. The results should be mapped (with CL_MAP_READ) after the kernel finishes, to flush caches. But when is it possible to unmap? Immediately after the mapping has finished (it seems that does not work with AMD discrete graphics)? Deallocating the array also requires modifying client code on Windows (because _aligned_free is different from free).
Allocate using CL_MEM_ALLOC_HOST_PTR and map after the kernel finishes. The cl_mem object has to be kept alive until the buffer is used (and probably even while it is mapped?), so it requires polluting client code. This also keeps the array in pinned memory, which might be undesirable.
Allocate on the device without CL_MEM_*_HOST_PTR, and map it after the kernel finishes. From the deallocation perspective this is the same as the previous (CL_MEM_ALLOC_HOST_PTR) option; it just avoids pinned memory. (Actually, I am not sure that mapped memory isn't pinned anyway.)
???
How are you dealing with this problem? Is there any vendor-specific solution?
You can do it with a single buffer, for both discrete and integrated hardware (a sketch follows the steps below):
Allocate with CL_MEM_WRITE_ONLY (since your kernel only writes to the buffer). Optionally also use CL_MEM_ALLOC_HOST_PTR or vendor-specific (e.g., AMD) flags if it helps performance on certain platforms (read the vendor guidance and do benchmarking).
Enqueue your kernel that writes to the buffer.
clEnqueueMapBuffer with CL_MAP_READ and blocking. On discrete hardware this will copy over PCIe; on integrated hardware it's "free".
Use the results on the CPU using the returned pointer.
clEnqueueUnmapMemObject.
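Putting those steps together, a sketch of generate_stuff() with a single buffer (same global context/kernel/queue assumption as in the question, in-order queue, error handling omitted). Note that returning a std::vector still forces one host-side copy out of the mapped region; a pure zero-copy consumer would work on the mapped pointer directly before unmapping:
std::vector<float> generate_stuff(size_t num_elements) {
    size_t bytes = num_elements * sizeof(float);
    cl_mem result_dev = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                       bytes, nullptr, nullptr);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &result_dev);
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &num_elements, nullptr,
                           0, nullptr, nullptr);

    // Blocking map: copies over PCIe on discrete GPUs, essentially free on integrated ones.
    float* mapped = (float*) clEnqueueMapBuffer(queue, result_dev, CL_TRUE, CL_MAP_READ,
                                                0, bytes, 0, nullptr, nullptr, nullptr);
    std::vector<float> result(mapped, mapped + num_elements);  // consume the results on the CPU

    clEnqueueUnmapMemObject(queue, result_dev, mapped, 0, nullptr, nullptr);
    clReleaseMemObject(result_dev);
    return result;
}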
Depends on the use case:
For minimal memory footprint and IO efficiency: (Dithermaster's answer)
Create with the CL_MEM_WRITE_ONLY flag, or maybe CL_MEM_ALLOC_HOST_PTR (depending on the platform). Do a blocking map for reading, use the data, then unmap it. This option requires that the data handler (consumer) knows about OpenCL and unmaps the buffer using CL calls.
For situations where you have to provide the buffer data to a third party (i.e. libraries that need a C pointer or a class buffer and are agnostic to CL), see the sketch after this list:
In this case it may not be good to use mapped memory. Mapped-memory access time is typically longer than normal CPU memory access. So, instead of mapping, then memcpy() and then unmapping, it is easier to perform a clEnqueueReadBuffer() directly to the CPU address where the output should be copied. With some vendors this does not give you pinned memory and the copy is slow, so it is better to fall back to option 1. But in some other cases where there is no pinned memory I found it faster.
Any other different condition for reading the kernel output? I think not...
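A sketch of that second scenario, reading straight into a pointer owned by a CL-agnostic consumer (function and parameter names are illustrative):
void fill_output(cl_command_queue queue, cl_mem result_dev,
                 float* consumer_ptr, size_t num_elements) {
    // Blocking read directly into the consumer's buffer: no map/unmap and
    // no intermediate memcpy() on the host side.
    clEnqueueReadBuffer(queue, result_dev, CL_TRUE, 0,
                        num_elements * sizeof(float), consumer_ptr,
                        0, NULL, NULL);
}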
MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
This function does not take a rank parameter. How does it know the rank of each process?
We are supposed to call MPI_Comm_rank() before the broadcast; does some data structure (like the communicator) store the rank of each process?
Perhaps you didn't think it possible, but functions inside the MPI library can internally make the same MPI calls that you use to obtain the process rank or the size of a communicator. MPI_Bcast() doesn't need the rank of the calling process because it simply calls the internal implementation of MPI_Comm_rank() in order to obtain it. Here is a small sample from one of the MPI_Bcast() implementations in Open MPI (more specifically, from the split binary tree implementation in the tuned module of the coll framework, which provides the algorithms implementing the collective operations):
int
ompi_coll_tuned_bcast_intra_split_bintree ( void* buffer,
                                            int count,
                                            struct ompi_datatype_t* datatype,
                                            int root,
                                            struct ompi_communicator_t* comm,
                                            mca_coll_base_module_t *module,
                                            uint32_t segsize )
{
    ...
    int rank, size;
    ...
    size = ompi_comm_size(comm);
    rank = ompi_comm_rank(comm);
    ...
}
As you can see, it calls the internal implementation of MPI_Comm_size() and MPI_Comm_rank(). These are very cheap calls in Open MPI. The rank of the process is stored in the process group that is associated with the communicator and is copied to a field in the communicator structure (to save a few CPU cycles dereferencing the pointer to the group) during the creation of a communicator (for more information refer to openmpi-source/ompi/communicator/communicator.h and openmpi-source/ompi/group/group.h).
As a matter of fact, no MPI communication primitive ever explicitly takes the rank of the calling process - it is always resolved internally. You only specify where to send the data (e.g. in MPI_SEND), where to receive the data from (e.g. in MPI_RECV), or the root in those collective operations that have one.
Consider three possible implementations of MPI_Bcast():
The root sends to root+1, then to root+2, then to root+3, etc. This takes a number of steps linear in the number of processes.
Starting with the root, each process that has a copy of the data at iteration N forwards the data to rank xor 2^N. This takes a logarithmic number of steps.
The root uses the router to perform a multicast to each process on the network. This takes a constant number of steps.
In each of these scenarios, the MPI_Bcast() function knows which process will get the next message. In the first and third case, any non-root process simply receives the data; in the second, each process continues the forwarding once it has received the data. In all implementations, though, the order of sends and receives is deterministic, based on which process is the root. (That's why all processes must invoke MPI_Bcast(), whether root or not.) A sketch of the second scheme follows.
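This is not Open MPI's actual implementation, just a minimal sketch of that "rank xor 2^N" forwarding written with plain point-to-point calls, assuming root 0 for simplicity. Note that each process derives everything it needs from its own rank, which it obtains from the communicator rather than from an argument:
void hypercube_bcast(void* buf, int count, MPI_Datatype type, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;
        if (rank < mask) {
            // this process already has the data: forward it
            if (partner < size) MPI_Send(buf, count, type, partner, 0, comm);
        } else if (rank < (mask << 1)) {
            // this process receives the data in this round
            MPI_Recv(buf, count, type, partner, 0, comm, MPI_STATUS_IGNORE);
        }
    }
}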
You are right, the rank is stored in the communicator, and is available to the implementation of MPI_Bcast internally. Ranks are assigned when a communicator is created. For example, MPI_COMM_WORLD is created by MPI_Init.
MPI_Comm_rank simply gets the rank value from the communicator. There's no requirement to call it before a broadcast. However, knowing the rank is usually necessary to do any meaningful programming.
Note that since MPI_Bcast is a collective call, it needs to be performed by all processes in the communicator.
int root is the rank of the broadcast root; essentially, MPI_Bcast sends a message from rank root to all the other ranks.
I would also consider it a "best practice" to call the following after MPI_Init:
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
This assigns each process an int rank value from 0 to n-1,
and
MPI_Comm_size(MPI_COMM_WORLD, &Numprocs);
This sets the int Numprocs to the total number of processes.
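Putting it together, a minimal sketch: every rank calls MPI_Bcast with the same root argument, and the library works out internally whether the calling rank sends or receives.
#include <mpi.h>
#include <iostream>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, Numprocs = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &Numprocs);

    int value = (rank == 0) ? 42 : 0;   // only the root has the data initially
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    std::cout << "Rank " << rank << " of " << Numprocs
              << " has value " << value << std::endl;

    MPI_Finalize();
    return 0;
}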
I have a routine named transfer(int) which calls MPI routines. In my main program, transfer() is called twice.
... // do some work
transfer(1); // first transfer
... // do some work again
transfer(2); // second transfer
The transfer(int) function looks like this
... // do some work
MPI_Barrier(MPI_COMM_WORLD);
t0 = clock();
for (int k = 0; k < mpisize; k++) {
    MPI_Irecv((void*)rbuffer[k], rsize[k], MPI_BYTE, k, 0, MPI_COMM_WORLD, reqs + 2*k);
    MPI_Isend((void*)sbuffer[k], ssize[k], MPI_BYTE, k, 0, MPI_COMM_WORLD, reqs + 2*k + 1);
}
MPI_Waitall(2*mpisize, reqs, status);
MPI_Barrier(MPI_COMM_WORLD);
if (mpirank == 0) cerr << "Transfer took " << (double)(clock() - t0)/CLOCKS_PER_SEC << " secs" << endl;
Note that I only measure the communication time, excluding the pre-processing.
For transfer(1), all send and receive buffers have size 0 for each k. So essentially, there's no communication going on. Yet, the transfer took 1.95 seconds.
For transfer(2), each processor has to send/receive about 20KB to/from every other processor. Yet, the whole transfer took only 0.07 seconds.
I ran the experiment many times with 1024 processors and the measurements are consistent. Can you explain this phenomenon or what could possibly be wrong?
Thanks.
You could use an MPI performance analysis tool such as Vampir or Scalasca to better understand what is happening:
Are all communications slow, or only a few, with the rest waiting at the barrier?
What is the influence of the barrier?
The actual answer depends heavily on your system and MPI implementation. Anycorn's comment that zero-size messages still require communication, and that the first communication can carry additional overhead, is a good starting point for investigation. So another question you should try to answer is:
How does a second zero-size exchange behave?
Also, MPI implementations can handle messages of different sizes in fundamentally different ways, e.g. by using unexpected-message buffers, but that again is implementation- and system-dependent.
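One way to answer that question is a small variation of the code from the question (a sketch, reusing the reqs, sbuffer, rbuffer, mpisize and mpirank variables from your transfer() routine): run the zero-size exchange several times and time each round separately with MPI_Wtime, which measures wall-clock time, unlike clock(), which measures CPU time. If only the first round is slow, the cost is one-time setup (connection establishment, memory registration) rather than the messages themselves.
for (int rep = 0; rep < 3; rep++) {
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();                  // wall clock, unlike clock()
    for (int k = 0; k < mpisize; k++) {
        MPI_Irecv(rbuffer[k], 0, MPI_BYTE, k, 0, MPI_COMM_WORLD, reqs + 2*k);
        MPI_Isend(sbuffer[k], 0, MPI_BYTE, k, 0, MPI_COMM_WORLD, reqs + 2*k + 1);
    }
    MPI_Waitall(2*mpisize, reqs, MPI_STATUSES_IGNORE);
    MPI_Barrier(MPI_COMM_WORLD);
    if (mpirank == 0)
        cerr << "Zero-size round " << rep << ": " << MPI_Wtime() - t0 << " secs" << endl;
}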