Suggest a benchmark program to compare MPICH and OpenMPI

I am new to HPC and the task in hand is to do a performance analysis and comparison between MPICH and OpenMPI on a cluster which comprises IBM servers equipped with dual-core AMD Opteron processors, running ClusterVisionOS.
Which benchmark program should I pick to compare between MPICH and OpenMPI implementations?
I am not sure if the High-Performance Linpack Benchmark can help, as I am not attempting to measure the performance of the cluster itself. Kindly suggest.
The classic examples are:
NAS Parallel Benchmarks - they are representative numerical kernels that you'd see in a lot of scientific computing applications. These admittedly have a lot of computation but also have the communications patterns you'd expect to see in real applications, so they are fairly
Or, if you really just want MPI "microbenchmarks", the OSU benchmarks or the Intel MPI Benchmarks are well-known choices. These run zillions of tests (ping-pong, broadcast, etc.) of various sizes and configurations, so you end up with a very large amount of data. The good news is that if you run these with the two MPIs, you'll know exactly where each one is stronger or weaker.
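To make concrete what a ping-pong microbenchmark measures, here is a minimal sketch of the timing loop. It is an illustration only: a local socket pair and a thread stand in for the two MPI ranks, and the names `recv_exact`, `echo_peer` and `ping_pong` are made up for this example. The real suites do the equivalent with MPI point-to-point calls between nodes, with warm-up rounds and many message sizes.

```python
import socket
import threading
import time


def recv_exact(sock, nbytes):
    """Receive exactly nbytes from sock (a single recv may return less)."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def echo_peer(sock, nbytes, iterations):
    """The 'pong' side: send every received message straight back."""
    for _ in range(iterations):
        sock.sendall(recv_exact(sock, nbytes))


def ping_pong(nbytes, iterations=100):
    """Return the mean one-way time in seconds for messages of nbytes."""
    a, b = socket.socketpair()
    peer = threading.Thread(target=echo_peer, args=(b, nbytes, iterations))
    peer.start()
    payload = b"x" * nbytes
    start = time.perf_counter()
    for _ in range(iterations):
        a.sendall(payload)       # ping
        recv_exact(a, nbytes)    # wait for the pong
    elapsed = time.perf_counter() - start
    peer.join()
    a.close()
    b.close()
    # Each iteration is one round trip, i.e. two one-way messages.
    return elapsed / (2 * iterations)
```

Running this for a range of sizes (say 1 byte up to a few megabytes) gives the latency-versus-size curve that the microbenchmark suites report; comparing two such curves, one per MPI library, shows where each one is ahead.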
MPICH and OpenMPI are both actively maintained and very solid, and have a long-standing friendly rivalry, so I'd be very surprised if you found one to be consistently faster than the other. We have had both on our system, and there were differences with the default settings on real applications, but they were usually fairly small, some favouring one and some the other. But to really find out which is better for a particular application, you need to do more than run with the default parameters; both implementations expose a large number of tunable variables, for example controlling how they handle collectives (OpenMPI 1.5.x has very interesting-looking hierarchical collectives I haven't played with yet).

What I would do is to search in the ACM Digital Library. You will get objective stuff there.
Some tips for the search:
Sort by relevance.
Read the Abstract (at the bottom) to see if it matches what you are looking for.
If a paper matches your search, buy that paper; it is usually cheap. Another option is to subscribe to ACM if you plan to search often, as you will get a better price.
Hope this helps someone.


GPU Brute-Force Implementation

I am asking for advice on the following problem:
For a research project I am writing a brute-force algorithm that runs on a GPU with (Py)OpenCL.
(I know JTR is out there)
Right now I have a brute-force generator in Python which fills a buffer with words each round (amount = 1024*64). I pass the buffer to the GPU kernel. The GPU calculates an MD5 hash value for each value in the buffer and compares it with a given one. Great that it works!
I don't think this is really the full performance I can get from the GPU, or is it? Isn't there a bottleneck when I have to fill the buffer on the CPU and pass it to the GPU 'just' for a hash calculation and comparison? Or am I wrong, and this is already the fastest, or almost the fastest, performance I can get?
I did a lot of research before deciding to ask this question here. I couldn't find a brute-force implementation that runs entirely in a GPU kernel so far. Why?
Let me try to explain what I want to know in a different way. Let's say I have an average computer. Running a brute-force algorithm on a GPU is faster than on a CPU (if you do it right). I have looked through some GPU brute-force tools and couldn't find one with the whole brute-force implementation in the GPU kernel.
Right now I am passing "word packages" to the GPU and letting it do the work (hash & compare) there; this looks like the common way. Wouldn't it be faster to 'split' the brute-force algorithm so that each unit on the GPU generates its own "word packages"?
I am just wondering why the common way is to pass packages of values from the CPU to the GPU instead of doing the CPU work on the GPU as well. Is it because it is not possible to split a brute-force algorithm across a GPU, or is the improvement not worth the effort of porting it to the GPU?
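One way to picture "each unit generates its own words" is to give every candidate a number and let each work-item decode its own number into a word. The sketch below models that on the host in Python; it is not taken from any existing tool, and `index_to_word` and `search` are hypothetical names. In a kernel, the index would presumably be a start offset plus the work-item's global id, so the host would only need to send the offset and the target hash instead of a full word buffer.

```python
import hashlib


def index_to_word(index, alphabet, length):
    """Decode a candidate number into a fixed-length word.

    The index is read as a number in base len(alphabet); each digit
    selects one character. Index 0 maps to the first character repeated.
    """
    base = len(alphabet)
    chars = []
    for _ in range(length):
        index, digit = divmod(index, base)
        chars.append(alphabet[digit])
    return "".join(reversed(chars))


def search(target_hex, alphabet, length):
    """Try every index in turn; on a GPU each work-item would take one index."""
    for index in range(len(alphabet) ** length):
        word = index_to_word(index, alphabet, length)
        if hashlib.md5(word.encode("ascii")).hexdigest() == target_hex:
            return word
    return None
```

The decode step is a handful of integer divisions per candidate, which is cheap next to an MD5 computation, so moving it into the kernel mainly removes the host-side generation and the buffer transfer.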
About the performance of the "brute-force" approach.
I do not know the details of your algorithm, but, in general, there are some points to consider before creating a hybrid CPU-GPU algorithm. Just to name a few:
Different architectures (the best CPU algorithm is probably not the best GPU algorithm).
Extra synchronization points.
Different memory spaces (implies PCIe/network transfers).
More complex algorithms.
More complex fine tuning.
Vendor policies.
Nevertheless, there are quite a few examples out there that combine the power of the GPU and the CPU at the same time. Typically, sequential or highly divergent parts of the algorithm run on the CPU while the homogeneous, compute-intensive part runs on the GPU. Other applications use the CPU to preprocess the input data into a format that is more amenable to GPU processing (for instance, changing the data layout). Finally, there are applications targeting pure performance that really do a significant amount of work on the CPU, like the MAGMA project.
In summary, whether it is possible, or worth it, to design a hybrid algorithm that makes the most of your CPU-GPU system as a whole really depends on the details of your algorithm.
About the performance of your current approach
I think you should break down your question into two parts:
Is my GPU kernel efficient?
How much of the time is the GPU actually doing work?
Regarding the first one, you didn't provide any information about your GPU kernel, so we can't really help you with it, but the general optimization approaches apply:
Is your computation memory-bound or compute-bound?
How far are you from your GPU's peak memory bandwidth?
You need to start from these questions in order to know what kind of optimization/algorithm you should apply. Take a look at the roofline performance model.
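The roofline model boils down to one formula: attainable performance is the smaller of the compute peak and the memory bandwidth times the kernel's arithmetic intensity (flops per byte moved). A minimal sketch follows; the function names are made up here, and the peak and bandwidth figures have to come from your own device's data sheet or measurements.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline bound: min(compute peak, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


def is_memory_bound(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """True when the bandwidth roof lies below the compute roof."""
    return bandwidth_gb_s * flops_per_byte < peak_gflops
```

If a kernel turns out to be memory-bound, the useful optimizations are the ones that move fewer bytes or move them more efficiently; if it is compute-bound, instruction-level tuning is where the time goes.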
As for the second question, even though you don't go into detail, it seems like your application spends a lot of its time on small memory transfers (take a look at this article about how to optimize memory transfers). The overhead of initiating a PCIe transfer just to send a few words would kill any performance benefit you get from using a GPU device. Thus, sending many small buffers, instead of packing them into a few large chunks of memory, is not, in general, the way to go.
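A rough cost model shows why many small transfers hurt. Assume, as a simplification, that every transfer pays a fixed start-up latency and then moves data at a constant bandwidth; `transfer_time` is a hypothetical helper and the latency and bandwidth values must be measured on your own system.

```python
def transfer_time(total_bytes, num_transfers, latency_s, bandwidth_bytes_s):
    """Estimated time to move total_bytes split into num_transfers copies.

    Simplified model: a fixed latency per transfer plus bytes / bandwidth.
    """
    return num_transfers * latency_s + total_bytes / bandwidth_bytes_s
```

With the same total number of bytes, the second term is identical, so the difference between one large copy and thousands of small ones is entirely the per-transfer latency multiplied by the number of transfers.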
If you're looking for performance, you may want to overlap the computation and the memory transfer. Read this article for more information.
As a general recommendation, before implementing any optimization, take some time to profile your application. It would save you a lot of time.

How much can MPI_Alltoall outperform MPI_Alltoallv?

What is the difference in running time between MPI_Alltoallv and MPI_Alltoall when the amount of transferred data is approximately the same? I couldn't find any such benchmark results. I am interested in large-scale instances, where tens of thousands, or better hundreds of thousands, of MPI processes are used and where these processes correspond to a substantial part of a given HPC system (ideally a modern one, such as BG/Q, Cray XC30, Cray XE6, ...).
One of the big advantages of MPI_Alltoall is that protocol decisions can be made quickly because they depend on a handful of scalars. In contrast, if a library implementer wants to optimize MPI_Alltoallv, they have to scan four vectors to determine if, for example, the communication is nearly homogeneous, highly sparse, or some other pattern.
The other issue is that MPI_Alltoall can easily use the output buffer as scratch space because every process provides and consumes the same amount of data. For MPI_Alltoallv, it's not practical to do all the bookkeeping, so any scratch space is going to be allocated. I can't remember the specifics of this issue, but I think I've read it somewhere in the MPI canon.
Implementation Skeletons
There are at least two special cases of alltoallv for which one can optimize better than the MPI library can:
Nearly homogeneous communication, i.e. the count vectors are nearly constant. This can happen when you have a distributed array that doesn't divide evenly across the process grid. In this case, you can:
Pad your arrays and use MPI_Alltoall directly.
Use MPI_Alltoall for the subset of processes that have homogeneous communication and either MPI_Alltoallv or a batch of Send-Recv for the remainder. This works best if you can cache the associated communicators. Using nonblocking communication should help too.
Write your own implementation of Bruck that handles the cases where the count varies, which is likely at the end of your vector. Having not done this myself, I don't know how difficult or worthwhile this one is.
Sparse communication, i.e. the count vector contains a large number of zeros. For this case, just use a batch of nonblocking Send-Recv and Waitall, because that's likely the best the MPI library will ever do and doing it yourself allows you to tune the batch size if you want.
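The padding option for the nearly homogeneous case can be sketched as follows. This only models the buffer handling in plain Python, not the MPI call itself, and the helper names are invented for the illustration. It assumes the receiver already knows the true counts (for example because they follow from the array distribution) so it can strip the filler again.

```python
def pad_chunks(chunks, fill=0):
    """Pad per-destination chunks to one common length.

    After padding, every destination gets the same count, which is what
    MPI_Alltoall requires. Returns the padded chunks and the true counts.
    """
    counts = [len(c) for c in chunks]
    width = max(counts, default=0)
    padded = [list(c) + [fill] * (width - len(c)) for c in chunks]
    return padded, counts


def unpad_chunks(padded, counts):
    """Drop the filler again, given the true count of each chunk."""
    return [row[:n] for row, n in zip(padded, counts)]


def padding_overhead(counts):
    """Fraction of the padded buffer that is filler rather than data."""
    width = max(counts, default=0)
    total = width * len(counts)
    return 0.0 if total == 0 else 1.0 - sum(counts) / total
```

Checking `padding_overhead` first is a cheap way to decide between the options: if the filler fraction is small, padding costs little extra traffic; if it is large, the sparse strategy is probably the better fit.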
MPI on a Million Processors describes the scalability issue associated with vector collectives. Granted, you may not see the cost of scanning the vector arguments on most CPUs, but it is an O(n) problem that motivates implementers to not touch the vector arguments more than necessary.
HykSort: a new variant of hypercube quicksort on distributed memory architectures describes a custom implementation that performs much better than optimized libraries. Such an optimization is rather difficult to implement inside of an MPI library, because it may be rather specialized. (This reference is targeted at Hristo's comment, not your question, by the way.)
You can discover some interesting things by comparing the implementations of these operations in MPICH. Only MPI_Alltoall uses Bruck's algorithm and pairwise exchange. Similar conclusions can be drawn from the available options for I_MPI_ADJUST_ALLTOALL and I_MPI_ADJUST_ALLTOALLV in Intel MPI. Whether these limitations are fundamental or merely practical is left as an exercise for the reader.
Practical Experience
MPI_Alltoall on Blue Gene/P used DCMF_Alltoallv, so there was no difference relative to MPI_Alltoallv, and the latter might even have been better since the application pre-populated the vector arguments.
I wrote a version of all-to-all exchange for Blue Gene/Q that was as fast as MPI_Alltoall. My version was agnostic to constant versus vector arguments so this result implies that MPI_Alltoallv would perform similarly to MPI_Alltoall. However, I can't find the code now to be absolutely sure of the details.
However, Blue Gene networks were rather special, particularly w.r.t. all-to-all, so the behavior on fat-tree or dragonfly networks, on systems where the CPU is much faster than the network, will be quite different.
I suggest you write a benchmark and measure it where you intend to run your application. Once you have some data, it will be much easier to figure out what optimizations may be missed.

MPI vs OpenMP for shared memory

Let's say there is a computer with 4 CPUs, each having 2 cores, so 8 cores in total. With my limited understanding, I think that all processors share the same memory in this case. Now, is it better to use OpenMP directly, or to use MPI to keep the code general so that it could work in both distributed and shared settings? Also, if I use MPI in a shared-memory setting, would performance decrease compared with OpenMP?
Whether you need or want MPI or OpenMP (or both) depends heavily on the type of application you are running, and on whether your problem is mostly memory-bound or CPU-bound (or both). Furthermore, it depends on the type of hardware you are running on. A few examples:
Example 1
You need parallelization because you are running out of memory, e.g. you have a simulation and the problem size is so large that your data does not fit into the memory of a single node anymore. However, the operations you perform on the data are rather fast, so you do not need more computational power.
In this case you probably want to use MPI and start one MPI process on each node, thereby making maximum use of the available memory while limiting communication to the bare minimum.
Example 2
You usually have small datasets and only want to speed up your application, which is computationally heavy. Also, you do not want to spend much time thinking about parallelization, but rather about your algorithms in general.
In this case OpenMP is your first choice. You only need to add a few statements here and there (e.g. in front of your for loops that you want to accelerate), and if your program is not too complex, OpenMP will do the rest for you automatically.
Example 3
You want it all. You need more memory, i.e. more computing nodes, but you also want to speed up your calculations as much as possible, i.e. running on more than one core per node.
Now your hardware comes into play. From my personal experience, if you have only a few cores per node (4-8), the performance penalty created by the general overhead of using OpenMP (i.e. starting up the OpenMP threads etc.) is more than the overhead of processor-internal MPI communication (i.e. sending MPI messages between processes that actually share memory and would not need MPI to communicate).
However, if you are working on a machine with more cores per node (16+), it will become necessary to use a hybrid approach, i.e. parallelizing with MPI and OpenMP at the same time, to make full use of your computational resources. It is also the most difficult approach to code and to maintain.
If you have a problem that is small enough to be run on just one node, use OpenMP. If you know that you need more than one node (and thus definitely need MPI), but you favor code readability/effort over performance, use only MPI. If using MPI only does not give you the speedup you would like/require, you have to do it all and go hybrid.
To your second question (in case that did not become clear):
If your setup is such that you do not need MPI at all (because you will always run on only one node), use OpenMP, as it will be faster. But if you know that you need MPI anyway, I would start with that and only add OpenMP later, when you know that you've exhausted all reasonable optimization options for MPI.
With most distributed memory platforms nowadays consisting of SMP or NUMA nodes it just makes no sense to not use OpenMP. OpenMP and MPI can perfectly work together; OpenMP feeds the cores on each node and MPI communicates between the nodes. This is called hybrid programming. It was considered exotic 10 years ago but now it is becoming mainstream in High Performance Computing.
As for the question itself, the right answer, given the information provided, has always been one and the same: IT DEPENDS.
For use on a single shared-memory machine like that, I'd recommend OpenMP. It makes some aspects of the problem simpler and might be faster.
If you ever plan to move to a distributed memory machine, then use MPI. It'll save you solving the same problem twice.
The reason I say OpenMP only might be faster is that a good implementation of MPI could be clever enough to spot that it's being used in a shared-memory environment and optimise its behaviour accordingly.
Just for the bigger picture: hybrid programming has become popular because OpenMP benefits from the cache topology by using a single address space. As MPI might have the same data replicated in memory (because processes can't share data), it might suffer from cache cancellation.
On the other hand, if you partition your data correctly, and each processor has a private cache, you might reach a point where your problem fits completely in cache. In this case you get super-linear speedups.
Speaking of cache, cache topologies differ a lot between recent processors, so, as always: IT DEPENDS...
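For reference, "super-linear" has a precise meaning: the speedup, serial time divided by parallel time, exceeds the number of workers, so the parallel efficiency is above 1. A small sketch with made-up helper names:

```python
def speedup(t_serial, t_parallel):
    """Classic speedup: how many times faster the parallel run is."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, workers):
    """Speedup per worker; 1.0 means perfect linear scaling."""
    return speedup(t_serial, t_parallel) / workers


def is_superlinear(t_serial, t_parallel, workers):
    """True when the speedup exceeds the number of workers."""
    return speedup(t_serial, t_parallel) > workers
```

When a measurement does come out super-linear, the cache effect described above is the usual suspect: each worker's share of the data fits in its private cache while the serial run's working set did not.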

Tuning Mathematical Parallel Codes

Assuming that I am interested in performance rather than portability of my linear algebra iterative multi-threaded solver and that I have the results of profiling my code in hand, how do I go about tuning my code to run optimally on that machine of my choice?
The algorithm involves Matrix-Vector multiplications, norms and dot-products. (FWIW, I am working on CG and GMRES).
I am working on codes which are of matrix size roughly equivalent to the full size of the RAM (~6GB). I'll be working on Intel i3 Laptop. I'll be linking my codes using Intel MKL.
Is there a good resource (PDF/book/paper) for learning manual tuning? There are numerous things that I learnt by doing, for instance that manual unrolling isn't always optimal, or how compiler flags behave, but I would prefer a centralized resource.
I need something to translate profiler information to improved performance. For instance, my profiler tells me that my stacks of one processor are being accessed by another or that my mulpd ASM is taking too much time. I have no clue what these mean and how I could use this information for improving my code.
My intention is to spend as much time as needed to squeeze out as much compute power as possible. It's more of a learning experience than for actual use or distribution as of now.
(I am concerned about manual tuning not auto-tuning)
Misc Details:
This differs from usual performance tuning since the major portions of the code are linked to Intel's proprietary MKL library.
Because of Memory Bandwidth issues in O(N^2) matrix-vector multiplications and dependencies, there is a limit to what I could manage on my own through simple observation.
I write in C and Fortran. I have tried both and, as discussed a million times on SO, I found no difference between them if I tweak them appropriately.
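The memory bandwidth limit for matrix-vector multiplication can be put into numbers. For a dense n x n matrix, each entry is loaded once and used for one multiply and one add, so the arithmetic intensity is about 2 flops per 8-byte double, roughly 0.25 flop/byte. The sketch below assumes a dense matrix in double precision streamed once from RAM; the function names are invented, and the bandwidth value has to be measured on the actual machine.

```python
def matvec_intensity(n, bytes_per_value=8):
    """Flops per byte for a dense n x n matrix-vector product."""
    flops = 2 * n * n                              # one multiply + one add per entry
    bytes_moved = bytes_per_value * (n * n + 2 * n)  # matrix, input vector, output vector
    return flops / bytes_moved


def matvec_time_lower_bound(n, bandwidth_bytes_s, bytes_per_value=8):
    """Fastest possible time if memory traffic is the only cost."""
    return bytes_per_value * (n * n + 2 * n) / bandwidth_bytes_s
```

Comparing the measured time per product against this lower bound tells you how much headroom is left; once the two are close, further tuning of the arithmetic will not help.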
Gosh, this still has no answers. After you've read this you'll still have no useful answers ...
You imply that you've already done all the obvious and generic things to make your codes fast. Specifically you have:
chosen the fastest algorithm for your problem (either that, or your problem is to optimise the implementation of an algorithm rather than to optimise the finding of a solution to a problem);
worked your compiler like a dog to squeeze out the last drop of execution speed;
linked in the best libraries you can find which are any use at all (and tested to ensure that they do in fact improve the performance of your program);
hand-crafted your memory access to optimise r/w performance;
done all the obvious little tricks that we all do (e.g. when comparing the norms of 2 vectors you don't need to take a square root to determine that one is 'larger' than another, ...);
hammered the parallel scalability of your program to within a gnat's whisker of the S==P line on your performance graphs;
always executed your program on the right size of job, for a given number of processors, to maximise some measure of performance;
and still you are not satisfied!
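As a concrete instance of the "obvious little tricks" item in the list above: the square root is monotonic, so comparing squared Euclidean norms gives the same ordering as comparing the norms themselves. A minimal sketch, with function names chosen for the example:

```python
def squared_norm(v):
    """Sum of squares; the Euclidean norm without the final square root."""
    return sum(x * x for x in v)


def is_longer(u, v):
    """True if u has the larger Euclidean norm; no square root needed."""
    return squared_norm(u) > squared_norm(v)
```

The same idea applies to convergence tests in iterative solvers: compare the squared residual norm against the squared tolerance.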
Now, unfortunately, you are close to the bleeding edge and the information you seek is not to be found easily in books or on web-sites. Not even here on SO. Part of the reason for this is that you are now engaged in optimising your code on your platform and you are in the best position to diagnose problems and to fix them. But these problems are likely to be very local indeed; you might conclude that no-one outside your immediate research group would be interested in what you do. I know you wouldn't be interested in any of the micro-optimisations I do on my code on my platform.
The second reason is that you have stepped into an area that is still an active research front and the useful lessons (if any) are published in the academic literature. For that you need access to a good research library, if you don't have one nearby then both the ACM and IEEE-CS Digital Libraries are good places to start. (Post or comment if you don't know what these are.)
In your position I'd be looking at journals on 2 topics: peta- and exa-scale computing for science and engineering, and compiler developments. I trust that the former is obvious, the latter may be less obvious: but if your compiler already did all the (useful) cutting-edge optimisations you wouldn't be asking this question and compiler-writers are working hard so that your successors won't have to.
You're probably looking for optimisations which, like loop unrolling, say, were relatively difficult to find implemented in compilers 25 years ago and were therefore bleeding-edge back then, and which themselves will be old and established in another 25 years.
First, let me make explicit something that was originally only implicit in my 'answer': I am not prepared to spend long enough on SO to guide you through even a summary of the knowledge I have gained in 25+ years in scientific/engineering and high-performance computing. I am not given to writing books, but many are and Amazon will help you find them. This answer was way longer than most I care to post before I added this bit.
Now, to pick up on the points in your comment:
on 'hand-crafted memory access' start at the Wikipedia article on 'loop tiling' (see, you can't even rely on me to paste the URL here) and read out from there; you should be able to quickly pick up the terms you can use in further searches.
on 'working your compiler like a dog' I do indeed mean becoming familiar with its documentation and gaining a detailed understanding of the intentions and realities of the various options; ultimately you will have to do a lot of testing of compiler options to determine which are 'best' for your code on your platform(s).
on 'micro-optimisations', well here's a start: Performance Optimization of Numerically Intensive Codes. Don't run away with the idea that you will learn all (or even much) of what you want to learn from this book. It's now about 10 years old. The take away messages are:
performance optimisation requires intimacy with machine architecture;
performance optimisation is made up of 1001 individual steps and it's generally impossible to predict which ones will be most useful (and which ones actually harmful) without detailed understanding of a program and its run-time environment;
performance optimisation is a participation sport, you can't learn it without doing it;
performance optimisation requires obsessive attention to detail and good record-keeping.
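To give the 'loop tiling' pointer above something concrete to hang on to, here is the loop structure applied to a matrix-vector product. This is only a sketch of the index pattern: in pure Python it brings no speed-up, the tile size of 64 is an arbitrary placeholder, and in C or Fortran the right tile size has to be found by measurement on the target machine.

```python
def matvec_naive(a, x):
    """Reference: y[i] = sum over j of a[i][j] * x[j]."""
    return [sum(a[i][j] * x[j] for j in range(len(x))) for i in range(len(a))]


def matvec_tiled(a, x, tile=64):
    """Same result, but the matrix is visited in tile x tile blocks.

    Within one block only `tile` entries of x are touched, so they can
    stay in cache while several rows reuse them.
    """
    rows, cols = len(a), len(x)
    y = [0.0] * rows
    for i0 in range(0, rows, tile):          # block row
        for j0 in range(0, cols, tile):      # block column
            for i in range(i0, min(i0 + tile, rows)):
                row = a[i]
                acc = 0.0
                for j in range(j0, min(j0 + tile, cols)):
                    acc += row[j] * x[j]
                y[i] += acc
    return y
```

Keeping the naive version around as a reference, and checking the tiled one against it, is also a small example of the record-keeping point: every optimisation should be easy to verify and easy to remove.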
Oh, and never write a clever piece of optimisation that you can't easily un-write when the next compiler release implements a better approach. I spend a fair amount of time removing clever tricks from 20-year-old Fortran that were justified (if at all) on the grounds of boosting execution performance but which now just confuse the programmer (they annoy the hell out of me too) and get in the way of the compiler doing its job.
Finally, one piece of wisdom I am prepared to share: these days I do very little optimisation that is not under one of the items in my first list above; I find that the cost/benefit ratio of micro-optimisations is unfavourable to my employers.

Developing with OpenCL on ATI and Nvidia at the same time

Our workgroup is slowly trying out a little bit of OpenCL in a side project. So far 'everybody' is working on an NVIDIA Quadro FX 580. Now we are planning to buy new computers for new colleagues, and instead of the FX 580 we could buy an ATI FirePro V4800, which costs only 15 EUR more and gives us 1 GB instead of 512 MB of RAM, which will be beneficial for our data-intensive tasks.
So, how much trouble is it to develop OpenCL code for Nvidia and ATI at the same time?
I read the following SO question, Running OpenCL on hardware from mixed vendors, which was very pessimistic about developing on/for different vendors. On the other side, the question is already a year old.
What do you recommend?
I have previously worked extensively with the CUDA programming language.
I have been planning to start developing apps using OpenCL. As you mentioned one of the best features with OpenCL is running on many vendor hardware (Intel, AMD and Nvidia).
One project that I came across that used OpenCL extensively for large scale development is It might be a good idea to look at the source code from this group and understand how they have developed their application for so many hardware platforms, including the Sony Cell processor.
Another approach would be to use PyOpenCL, which provides a higher level of abstraction than OpenCL and can significantly reduce the coding effort.
Do you need the code to run unchanged on both bits of hardware? If so you may have to develop for a limited subset of common functions.
If you can run slightly different code on each you will probably get better performance. In CUDA/OpenCL you generally have to tune the algorithms for the amount of RAM and the number of GPU engines anyway, so it shouldn't be much more work to also tweak for NVidia/AMD.
The biggest problem is workgroup sizes. Some ATI cards I have used crash at above 64, but then it may be the Apple OSX 10.6 drivers I am using.
Developing for both ATI and NVIDIA is actually not too difficult so long as you avoid using any part of either vendor's SDK. Stick to OpenCL as it is defined in the OpenCL spec and your code will stay syntax portable. Due to differences in the underlying architectures, performance portability may be an issue. Local and global work sizes really have to be determined independently for each card to maximize performance. Another thing to pay attention to is the types being used. Vector types (float2, float4) are especially useful on ATI cards, as each processing element actually contains 4 execution units (one for each RGB color channel, plus alpha).
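One way to keep work sizes out of the kernel source is to compute them on the host from a per-device limit. The helper below is a sketch with invented names: `device_max_local` stands for whatever limit the device or kernel reports (OpenCL exposes such limits through queries like CL_DEVICE_MAX_WORK_GROUP_SIZE and CL_KERNEL_WORK_GROUP_SIZE), and `preferred_local` is a tuning value you would determine per card.

```python
def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple


def choose_work_sizes(num_items, preferred_local, device_max_local):
    """Return (global_size, local_size) for a 1-D kernel launch.

    The local size is capped by the device limit, and the global size
    is padded up to a whole number of work-groups.
    """
    local = min(preferred_local, device_max_local)
    return round_up(num_items, local), local
```

Because the global size may be larger than the number of items, the kernel then needs a bounds check on its global id so the padding work-items do nothing.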
