What is the cost of MPI_Reduce and MPI_allreduce (in terms of network latency and bandwidth).
I'd appreciate if you can also provide reference to algorithm. Do the cost model changes with message size?
These papers describe MPI reduction algorithms:
http://www.mcs.anl.gov/~thakur/papers/ijhpca-coll.pdf
https://fs.hlrs.de/projects/rabenseifner/publ/myreduce_iccs2004_2.pdf
In general, there are many different protocols used for MPI collections, with selection based upon message size, how many processes are involved, etc.
In the case of MPI reductions, some form of reduction tree is instantiated on the MPI communicator. For large messages, Rabenseifner's algorithm may be more efficient.
I would recommend downloading a benchmark to test this. The IntelĀ® MPI Benchmarks can be run only using MPI_Reduce and MPI_Allreduce to show expected times for each. This is open source, and you can compile it with your preferred MPI implementation and test with various parameters. By default, you'll get a set of results spanning message sizes up to 4 MB. You can download it at https://software.intel.com/en-us/articles/intel-mpi-benchmarks.
Related
I'm new to MPI and I'm trying to understand how MPI (and specifically OpenMPI) work in order to reason about the performance of my system.
I've tried to find resources online to help me understand things a little better, but haven't had much luck. I thought I'd come here.
Right now my question is simple: if I have 3 nodes (1 master, 2 clients) and I issue an MPI_Gather, does the root process handle incoming data sequentially or concurrently? In other words, if processes 1 is the first to make a connection with processes 0, will process 2 have to wait until processes 1 is done sending its data before it can start to send its data?
Thanks!
There are multiple components in Open MPI that implement collective operations and some of them provide multiple algorithms for the implementation of each operation.
What you are most likely interested in is the tuned component of the coll framework as that is what Open MPI uses by default. tuned implements all collectives using point-to-point operations and provides several algorithms for gather:
linear with synchronisation - used when messages are large to mid-size
binomial - used when the number of processes is large or the message size is small
basic linear - used in all other cases
The performance of each algorithm depends strongly on the particular combination of message size and number of ranks, therefore the library comes with a set of heuristics that tries to determine the best algorithm based on the data size and the size of the communicator (as indicated above). There are several mechanisms to override the heuristics and either force a certain algorithm or provide a list of custom algorithm selection rules.
The basic linear algorithm simply has the root loop over all other ranks receiving their messages in sequence. In that case, rank 2 won't be able to send its chunk before rank 1 since the root will first receive the message from rank 1 and only then move on to rank 2.
The linear with synchronisation algorithm splits the chunks into two pieces each. The first pieces are collected in sequence just like in the basic linear algorithm. The second pieces are collected asynchronously using non-blocking receives.
The binomial algorithm arranges the ranks as a binomial tree. The processes at the nodes of the tree receive the chunks from the lower levels and aggregate them into larger chunks that then get passed to the upper levels until they reach the root rank.
You can find the source code of the tuned module in the ompi/mca/coll/tuned folder of the Open MPI source tree. In the development branch, part of the tuned component got promoted to the base implementation of the collective framework and the code for the gather is to be found in ompi/mca/coll/base instead.
Hristo's answer is of course excellent, but I would like to offer a different point of view.
Contrary to your expectation, the question is not simple. It isn't even possible to specifically answer it without knowing more system specifics, as Hristo pointed out. That doesn't mean the question is invalid, but you should start to reason about performance on a different level.
First, consider the complexity of a the gather operation: The total network transfer to the root as well as the memory requirements are linearly growing with the number of processes in the communicator. This naturally limits scalability.
Second, you may assume that your MPI implementation does implement MPI_Gather in the most efficient way possible - better than you could do it by hand. This assumption may very well be wrong, but it is the best starting point to write your program.
Now when you have your program, you should measure and see where time is spent - or wasted. For that you should an MPI performance analysis tools. Now if you have identified that your Gather has a significant impact on performance, you can go ahead and try to optimize that: But to do so, first consider if you can structure your communication conceptually better, e.g. by somehow removing the computation all together or using a clever reduction instead. If you still need to stick to the gather: go ahead and tune your MPI implementation. Afterwards verify that your optimization did indeed improve performance on your specific system.
I have a C program (acoustic wave solver) that is parallelized with MPI. However, I've been testing the speed up on various numbers of cores and I've noticed something strange. If I use N processes where N is the number of available cores in the machine, then I do not see a performance improvement over the next step down.
So on my 8 core machine then I see speedup from 1 process to 2 processes to 4 processes, but not from 4 to 8. Similarly on my 4 core laptop I see speedup from 1 to 2, but not from 2 to 4.
Any idea what could be causing this?
Many modern (Intel-)cpu run two hyperthreads on a single physical core. The number of cores you are referencing are actually the number of hardware threads that are available, not the number of physical execution units.
As long as you are using a number of processes that is smaller or equal to the number of physical cores, the processes will (or at least should) be distributed to use all of the available codes. But as soon as all physical cores are taken, additional processes will share a physical core with another process.
It is not possible to give a definitive answer on if using all threads will increase your performance at all or by how much. That strongly depends on the code you are running. A very nice answer to a similar question is given on superuser.com. Essentially, if your process is memory-bound or uses different parts of your cpu (Integer/Floating point arithmetic, Video encoding, vector processing, ...) and communication overhead is small you might even get perfect scaling. Code that is cpu-bound and only does one type of computation might not give any improvement, or might even take longer due to communication overhead.
My program is well-suited for MPI. Each CPU does its own, specific (sophisticated) job, produces a single double, and then I use an MPI_Reduce to multiply the result from every CPU.
But I repeat this many, many times (> 100,000). Thus, it occurred to me that a GPU would dramatically speed things up.
I have google'd around, but can't find anything concrete. How do you go about mixing MPI with GPUs? Is there a way for the program to query and verify "oh, this rank is the GPU, all other are CPUs" ? Is there a recommended tutorial or something?
Importantly, I don't want or need a full set of GPUs. I really just need a lot of CPUs, and then a single GPU to speed up the frequently-used MPI_Reduce operation.
Here is a schematic example of what I'm talking about:
Suppose I have 500 CPUs. Each CPU somehow produces, say, 50 doubles. I need to multiply all 250,00 of these doubles together. Then I repeat this between 10,000 and 1 million times. If I could have one GPU (in addition to the 500 CPUs), this could be really efficient. Each CPU would compute its 50 doubles for all ~1 million "states". Then, all 500 CPUs would send their doubles to the GPU. The GPU would then multiply the 250,000 doubles together for each of the 1 million "states", producing 1 million doubles.
These numbers are not exact. The compute is indeed very large. I'm just trying to convey the general problem.
This isn't the way to think about these things.
I like to say that MPI and GPGPU stuff are orthogonal(*). You use MPI between tasks (for which think nodes, although you can have multiple tasks per node), and each task may or may not use an accelerator like a GPU to accelerate the computation within task. There is no MPI rank on a GPU.
Regardless, Talonmies is right; this particular example doesn't sound like it would benefit much from a GPU. And it won't be helped by having tens of thousands of doubles per task; if you're only doing one or a few FLOPs per double, the cost of sending the data to the GPU will exceed the benefit of having all those cores operate on them.
(*) This used to be more clearly true; now with, for instance, GPUDirect being able to copy memory to remote GPUs over infiniband, the distinction is fuzzier. However, I maintain that this is still the most useful way to think about things, with such things as RDMA to GPUs being an important optimization but conceptually a minor tweak.
Here I have found some news about the topic:
"MPI, the Message Passing Interface, is a standard API for communicating data via messages between distributed processes that is commonly used in HPC to build applications that can scale to multi-node computer clusters. As such, MPI is fully compatible with CUDA, which is designed for parallel computing on a single computer or node. There are many reasons for wanting to combine the two parallel programming approaches of MPI and CUDA. A common reason is to enable solving problems with a data size too large to fit into the memory of a single GPU, or that would require an unreasonably long compute time on a single node. Another reason is to accelerate an existing MPI application with GPUs or to enable an existing single-node multi-GPU application to scale across multiple nodes. With CUDA-aware MPI these goals can be achieved easily and efficiently. In this post I will explain how CUDA-aware MPI works, why it is efficient, and how you can use it."
Suppose that I've two big functions. Is it better to write them in a separate kernels and call them sequentially, or is better to write only one kernel? (I don't want to read the data back and force form between host and device in between). What about the speed up if I want to call the kernel many times?
One thing to consider is the effect of register pressure on hardware utilization and performance.
As a general rule, big kernels have big register footprints. Typical OpenCL devices (ie. GPUs) have very finite register file sizes and large kernels can result in lower concurrency (fewer concurrent warps/wavefronts), less opportunities for latency hiding, and poorer overall performance. On the other hand, kernel launch overheads are pretty low on most platforms, so if your algorithm doesn't have an enormous amount of state to save between "phases" of execution, the penalty of using multiple kernels can be rather low.
Using multiple kernels also has another side benefit -- you get implicit synchronization between all work units for free. Often that can eliminate the need for atomic memory operations and synchronization primitives which can have a negative impact on code performance.
The ultimate guide should be measured performance. There is no universal rule-of-thumb for this sort of things. Benchmarking is the only way to know for sure.
In general this is a question of (maybe) slightly better performance vs. readibility of your code. Copying buffers is no issue as long as you keep them within the same context. E.g. you could set one output buffer of a kernel to be an input buffer of the next kernel, which would not involve any copying.
The proper way to code in OpenCL is to separate your code into parallel tasks, and each of them is a kernel. This is, each "for loop" should be a kernel. Some times one single CPU code function could result in a 4 kernel implementation in OCL.
If you need to store data between kernel executions just use OpenCL buffers and do not copy to host (this solves the DEVICE<->HOST bottleneck).
If both functions act to different data you could propably write a single kernel, but that depends on the complexity of the operation being run.
I'm using OpenCL and have ATI 4850 card. It has:
CL_DEVICE_MAX_COMPUTE_UNITS: 10
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
CL_DEVICE_MAX_WORK_ITEM_SIZES:(256, 256, 256)
CL_DEVICE_AVAILABLE: 1
CL_DEVICE_NAME: ATI RV770
How many tasks can it execute simultaneously?
Is it CL_DEVICE_MAX_COMPUTE_UNITS * CL_DEVICE_MAX_WORK_ITEM_SIZES = 2560?
To be more specific: a single core processor can execute only one task in the one moment, dual-core can execute 2 tasks... How many tasks can execute my GPU at one moment? Or rephrased: How many processors does my GPU have?
The RV770 has 10 SIMD cores, each consisting of 16 shader cores, each consisting of 5 ALUs (VLIW5 architecture). A total of 800 ALUs that can do parallel computations. I don't think there's a way to get all these numbers out of OpenCL. I'm also not sure what you would equate to a CPU core. Perhaps a shader core? You can read about VLIW at Wikipedia. It's an interesting design.
If you say a CPU core is only executing one "task" at any given time, even though it has multiple ALUs working in parallel, then I guess you can say the RV770 would be working on 160 tasks. But with the differences in how different chips work, I think "core" and "task" can become difficult to define. A CPU with hyperthreading can even execute two sets of code at the same time. With OpenCL I don't believe it is possible yet to execute more than one kernel at any given time - unless recent driver updates have changed that.
Anyway, I think it is more important to present your work to the GPU in a way that gives the best performance. Unfortunately there's no way to find the best work group size other than experimenting. At least not that I know of. One help is that if the drivers support OpenCL 1.1 you can query the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and set your work size to a multiple of that. Otherwise, going for a multiple of 64 is probably a safe bet.
GPU work ends up becoming wavefronts/warps.
Using a GPU for UI and compute is effectively using it for many programs without being aware of it. Many for the GUI drawing, plus whatever compute kernels you are executing. Fast OpenCL clients are asynchronous and overlap multiple instance of work so they won't be latency-bound. It is expected that you'll use multiple kernels in parallel.
There doesn't seem to be a "hard" limit other than memory limiting the number of buffers you can use. When using the same GPU for UI and for compute, you must throttle your work. In my experience, issuing too much work will cause starvation of the GUI and/or your compute kernels. There doesn't seem to be anything in the way of ensuring that you won't have starvation (long delays before a work item begins actually executing). Some work item(s) may sit for a very long time (10s seconds or more in bad cases) while the GPU does other work items. I speculate that items are dispatched to pipelines based on data availability and little or nothing is there to prevent starvation of work items.
Limiting how far ahead work is enqueued greatly improves GUI responsiveness by letting the GPU drain its work queue almost/sometimes to empty, reducing GUI drawing workitem starvation delays.