MPI_Send/Recv vs. MPI_Reduce - mpi

I was given a little excercise where I had to implement a Monte Carlo algorithm by using MPI to estimate the total volume of n spheres, having the coordinates of their center and radius in 3 dimensions. Even if we must use MPI, we can launch all the processes on our local machine, so there's no network overhead. I implemented two versions of this excericse:
One, using MPI_Send and MPI_Recv (where the process of rank 0 only waits for partial results from the others to perform the final sum)
http://pastebin.com/AV41hJqn
The other, using MPI_Reduce, also here process of rank 0 waits for partial results.
http://pastebin.com/8b0czv6a
I expected that both the programs would take the same time to finish, but I see that the one using MPI_Reduce is faster. Why this? Where's the difference?

There could be a lot of reasons depending on which MPI implementation you're using, what kind of hardware you're running on and how optimized the implementation is to take advantage of that. This Google Scholar search gives some idea of the variety of work done on this. To give you a few ideas of what it could be:
Since reductions can be completed in intermediate steps, it may be possible to use a different topology than the basic rank 0 collect-from-all approach, with tradeoffs in latency and bandwidth.
Within a compute node (or on your desktop or laptop if you're trying this with a toy problem), it may be possible to exploit locality within cores, between cores on a CPU socket or between sockets to order the computations and communication in a way that's more efficient for the hardware. It sounds from the abstract like this paper from IBM may give some concrete details about some of these design decisions. Alternatively, the implementation might choose a cache-oblivious scheme for better performance within a general compute node.
Persistent communication (MPI_Send_init and MPI_Recv_init) can be used under the hood in the MPI_Reduce implementation. These routines can perform better than their blocking and non-blocking counterparts due to providing the MPI implementation and hardware with extra details about how the program is grouping its communications.
This is not a comprehensive list, but hopefully it gets you started and provides some ideas for how to search out more details if you're interested.

Related

GPU Brute-Force Implementation

I am asking for an advise for the following problem:
For a research-project I am writing a brute-force algorithm based on a GPU with (py)OpenCl.
(I know JTR is out there)
Right now I do have a Brute-Force-Generator in Python which is filling up for each round the buffer with words (amount=1024*64).I pass the buffer to the GPU Kernel. The GPU is calculating for each value in the buffer a MD5 Hash Value and compares it with a given one. Great that it works!
BUT:
I don't think this is really the full performance i can get from the GPU - or is it? Isn't there a bottleneck when i have to fill up the buffer by the CPU and pass it to the GPU 'just' for a Hash calculation an comparison - or am i wrong and this is already the fastet or almost the fastet performance i can get?
I have done a lot of Research before I consider to ask this question here. I couldn't find a brute force implementation on the GPU kernel so far - why?
Thx
EDIT 1:
I try to explain it in a different way what I want to know. Lets say I have an average computer. Performing an brute-force-algorithm on a GPU is faster than on a CPU (if you do it right). I have looked through some GPU brute-force tools and couldn't find one with the whole brute-force implementation on the GPU Kernel.
Right now I am passing "word packages" to the GPU and let them do the work (hash & compare) there - looks like this is the common way . Isn't it faster to 'split' the brute-force algorithm so each Unit on the GPU will generate its own "word packages" by itself.
All I do is wondering why the common way is to pass packages with values from the CPU to the GPU instead of doing the CPU work also on the GPU work! Is it because it is not possible to split a brute-force algorithm on a GPU or isn't the improvement worth the effort to port it to the GPU?
About the performance of the "brute-force" approach.
All i do is wondering why the common way is to pass packages with values from the CPU to the GPU instead of doing the CPU work also on the GPU work! Is it because it is not possible to split a brute-force algorithm on a GPU or isn't the improvement worth the effort to port it to the GPU?
I do not know the details of your algorithm, but, in general, there are some points to consider before creating a hybrid CPU-GPU algorithm. Just to name a few:
Different architectures (best CPU algorithm probably is not the best
GPU algorithm).
Extra synchronization points.
Different memory spaces (implies PCIe/network transfers).
More complex algorithms
More complex fine tuning.
Vendors policy.
Nevertheless, there are quite a few examples out there that combines the power of the GPU and the CPU at the same time. Typically, sequential or highly divergent parts of the algorithm will run on the CPU while the homogeneous, computing intensive part runs on the GPU. Other applications, uses the CPU to preprocess the input data to a format that is more amenable to GPU processing (for instance, changing the data layout). Finally, there are applications targeting pure performance that really do a significant amount of work on the CPU, like the MAGMA project.
In summary, the answer it that it really depends on the details of your algorithm if it is really possible or if it worth it to design a hybrid algorithm that takes the most of your CPU-GPU system as a whole.
About the performance of your current approach
I think you should break down your question in two parts:
It is my GPU kernel efficient?
How much time am I actually doing work at the GPU?
Regarding the first one, you didn't provide any information about your GPU kernel so we could not really help you with it, but general optimization approaches apply:
Is it your computation memory/compute bound?
How far are you from your GPU peak memory bandwidth?
You need to start from these question in order to known what kind of optimization/algorithm you should apply. Take a look at the roofline performance model.
As for the second question, even though you don't go into detail, it seems like your application spend so much time on small memory transfers (take a look at this article about how to optimize memory transfers). The overhead of starting the PCIe just to send a few words would kill any performance benefit you get from using a GPU device. Thus, sending a bunch of small buffers instead of large chunks of memory packing a large number of them is not, in general, the way to go.
If you're looking for performance, you may want to overlap the computation and the memory transfer. Read this article for more information.
As a general recommendation, before implementing any optimization, take some time to profile your application. It would save you a lot of time.

How to use non-blocking point-to-point MPI routines instead of collectives

In my programm, I would like to heavily parallelize many mathematical calculations, the results of which are then written to an output file.
I successfully implemented that using collective communication (gather, scatter etc.) but I noticed that using these synchronizing routines, the slowest among all processors dominates the execution time and heavily reduces overall computation time, as fast processors spend a lot of time waiting.
So I decided to switch to the scheme, where one (master) processor is dedicated to receiving chunks of results and handling the file output, and alle the other processors calculate these results and send them to the master using non-blocking send routines.
Unfortunately, I don't really know how to implement the master code; Do I need to run an infinite loop with MPI_Recv(), listening for incoming messages? How do I know when to stop the loop? Can I combine MPI_Isend() and MPI_Recv(), or do both method need to be non-blocking? How is this typically done?
MPI 3.1 provides non-blocking collectives. I would strongly recommend that instead of implementing it on your own.
However, it may not help you after all. Eventually you need the data from all processes, even the slow ones. So you are likely to wait at some point again. Non-blocking communication overlaps communication and computation, but it doesn't fix your load imbalances.
Update (more or less a long clarification comment)
There are several layers to your question, I might have been confused by the title as to what kind of answer you were expecting. Maybe the question is rather
How do I implement a centralized work queue in MPI?
This pops up regularly, most recently here. But that is actually often undesirable because a central component quickly becomes a bottleneck in large scale programs. So the actual problem you have, is that your work decomposition & mapping is imbalanced. So the more fundamental "X-question" is
How do I load balance an MPI application?
At that point you must provide more information about your mathematical problem and it's current implementation. Preferably in form of an [mcve]. Again, there is no standard solution. Load balancing is a huge research area. It may even be a topic for CS.SE rather than SO.

How much can MPI_Alltoall outperform MPI_Alltoallv?

I wonder what is the difference in terms of running time between executing the MPI_Alltoallv and MPI_Alltoall functions when the amount of transferred data is approximately the same? I couldn't find any such benchmark results. I am interested in large-scale instances, where tens of thousands or better hundreds of thousand of MPI processes are used and where these processes correspond to a substantial part of a given HPC system (considering at best some modern ones, such as BG/Q, Cray XC30, Cray XE6, ...).
Overview
One of the big advantages of MPI_Alltoall is that protocol decisions can be made quickly because they depend on a handful of scalars. In contrast, if a library implementer wants to optimize MPI_Alltoallv, they have to scan four vectors to determine if, for example, the communication is nearly homogeneous, highly sparse, or some other pattern.
The other issue is that MPI_Alltoall can easily use the output buffer as scratch space because every process provides and consumes the same amount of data. For MPI_Alltoallv, it's not practical to do all the bookkeeping, so any scratch space is going to be allocated. I can't remember the specifics of this issue, but I think I've read it somewhere in the MPI canon.
Implementation Skeletons
There are at least two special cases of alltoallv for which one can optimize better than the MPI library can:
Nearly homogeneous communication, i.e. the count vectors are nearly constant. This can happen when you have a distributed array that doesn't divide evenly across the process grid. In this case, you can:
Pad your arrays and use MPI_Alltoall directly.
Use MPI_Alltoall for the subset of processes that have homogeneous communication and either MPI_Alltoallv or a batch of Send-Recv for the remainder. This works best if you can cache the associated communicators. Using nonblocking communication should help too.
Write your own implementation of Bruck that handles the cases where the count varies, which is likely at the end of your vector. Having not done this myself, I don't know how difficult or worthwhile this one is.
Sparse communication, i.e. the count vector contains a large number of zeros. For this case, just use a batch of nonblocking Send-Recv and Waitall, because that's likely the best the MPI library will ever do and doing it yourself allows you to tune the batch size if you want.
Papers
MPI on a Million Processors describes the scalabillity issue associated with vector collectives. Granted, you may not see the cost of scanning the vector arguments on most CPUs, but it is an O(n) problem that motivates implementers to not touch the vector arguments more than necessary.
HykSort: a new variant of hypercube quicksort on distributed memory architectures describes a custom implementation that performs much better than optimized libraries. Such an optimization is rather difficult to implement inside of an MPI library, because it may be rather specialized. (This reference is targeted at Hristo's comment, not your question, by the way.)
Code
You can discover some interesting things by comparing the implementations of these operations in MPICH (https://github.com/pmodels/mpich/blob/main/src/mpi/coll/alltoall.c and https://github.com/pmodels/mpich/blob/main/src/mpi/coll/alltoallv.c). Only MPI_Alltoall uses Bruck's algorithm and pairwise exchange. Similar conclusions can be drawn from the available options for I_MPI_ADJUST_ALLTOALL and I_MPI_ADJUST_ALLTOALLV on https://software.intel.com/en-us/node/528906. Whether these limitations are fundamental or merely practical is left as an exercise for the reader.
Practical Experience
When MPI_Alltoall on Blue Gene/P used DCMF_Alltoallv (source code), so there was no difference relative to MPI_Alltoallv, and the latter might have even been better since the application pre-populated the vector arguments.
I wrote a version of all-to-all exchange for Blue Gene/Q that was as fast as MPI_Alltoall. My version was agnostic to constant versus vector arguments so this result implies that MPI_Alltoallv would perform similarly to MPI_Alltoall. However, I can't find the code now to be absolutely sure of the details.
However, Blue Gene networks were rather special, particularly w.r.t. all-to-all, so the behavior on fat-tree or dragonly networks on systems where the CPU is much faster than the network will be quite different.
I suggest you write a benchmark and measure it where you intend to run your application. Once you have some data, it will be much easier to figure out what optimizations may be missed.

MPI vs openMP for a shared memory

Lets say there is a computer with 4 CPUs each having 2 cores, so totally 8 cores. With my limited understanding I think that all processors share same memory in this case. Now, is it better to directly use openMP or to use MPI to make it general so that the code could work on both distributed and shared settings. Also, if I use MPI for a shared setting would performance decrease compared with openMP?
Whether you need or want MPI or OpenMP (or both) heavily depends the type of application you are running, and whether your problem is mostly memory-bound or CPU-bound (or both). Furthermore, it depends on the type of hardware you are running on. A few examples:
Example 1
You need parallelization because you are running out of memory, e.g. you have a simulation and the problem size is so large that your data does not fit into the memory of a single node anymore. However, the operations you perform on the data are rather fast, so you do not need more computational power.
In this case you probably want to use MPI and start one MPI process on each node, thereby making maximum use of the available memory while limiting communication to the bare minimum.
Example 2
You usually have small datasets and only want to speed up your application, which is computationally heavy. Also, you do not want to spend much time thinking about parallelization, but more your algorithms in general.
In this case OpenMP is your first choice. You only need to add a few statements here and there (e.g. in front of your for loops that you want to accelerate), and if your program is not too complex, OpenMP will do the rest for you automatically.
Example 3
You want it all. You need more memory, i.e. more computing nodes, but you also want to speed up your calculations as much as possible, i.e. running on more than one core per node.
Now your hardware comes into play. From my personal experience, if you have only a few cores per node (4-8), the performance penalty created by the general overhead of using OpenMP (i.e. starting up the OpenMP threads etc.) is more than the overhead of processor-internal MPI communication (i.e. sending MPI messages between processes that actually share memory and would not need MPI to communicate).
However, if you are working on a machine with more cores per node (16+), it will become necessary to use a hybrid approach, i.e. parallelizing with MPI and OpenMP at the same time. In this case, hybrid parallelization will be necessary to make full use of your computational resources, but it is also the most difficult to code and to maintain.
Summary
If you have a problem that is small enough to be run on just one node, use OpenMP. If you know that you need more than one node (and thus definitely need MPI), but you favor code readability/effort over performance, use only MPI. If using MPI only does not give you the speedup you would like/require, you have to do it all and go hybrid.
To your second question (in case that did not become clear):
If you setup is such that you do not need MPI at all (because your will always run on only one node), use OpenMP as it will be faster. But If you know that you need MPI anyways, I would start with that and only add OpenMP later, when you know that you've exhausted all reasonable optimization options for MPI.
With most distributed memory platforms nowadays consisting of SMP or NUMA nodes it just makes no sense to not use OpenMP. OpenMP and MPI can perfectly work together; OpenMP feeds the cores on each node and MPI communicates between the nodes. This is called hybrid programming. It was considered exotic 10 years ago but now it is becoming mainstream in High Performance Computing.
As for the question itself, the right answer, given the information provided, has always been one and the same: IT DEPENDS.
For use on a single shared memory machine like that, I'd recommend OpenMP. It make some aspects of the problem simpler and might be faster.
If you ever plan to move to a distributed memory machine, then use MPI. It'll save you solving the same problem twice.
The reason I say OpenMP might be faster is because a good implementation of MPI could be clever enough to spot that it's being used in a shared memory environment and optimise its behaviour accordingly.
Just for a bigger picture, hybrid programming has become popular because OpenMP benefits from cache topology, by using the same address space. As MPI might have the same data replicated over the memory (because process can't share data) it might suffer from cache cancelation.
On the other hand, if you partition your data correctly, and each processor has a private cache, it might come to a point were your problem fit completely in cache. In this case you have super linear speedups.
By talking in cache, there are very different cache topology on recent processors, and has always: IT DEPENDS...

Can someone suggest a good way to understand how MPI works?

Can someone suggest a good way to understand how MPI works?
If you are familiar with threads, then you treat each node as a thread (to an extend)
You send a message (work) to a node and it does some work and then returns you some results.
Similar behaviors between thread & MPI:
They all involve partitioning a work and process it separately.
They all would have overhead when more node/threads involved, MPI overhead is more significant compared to thread, passing messages around nodes would cause significant overhead if work is not carefully partitioned, you might end up with the time passing messages > computational time required to process job.
Difference behaviors:
They have different memory models, each MPI node does not share memory with others and does not know anything about the rest of world unless you send something to it.
Here you can find some learning materials http://www.mcs.anl.gov/research/projects/mpi/
Parallel programming is one of those subjects that is "intrinsically" complex (as opposed to the "accidental" complexity, as noted by Fred Brooks).
I used Parallel Programming in MPI by Peter Pacheco. This book gives a good overview of the basic MPI topics, available API's, and common patterns for parallel program construction.

Resources