Combine OpenMP and MPI

I am working on a particle physics program and now want to parallelize it on a cluster. I attended an HPC lecture, so I managed to create parallel tasks within a node using OpenMP, but this still does not scale well enough.
Hence I want to use several nodes that exchange messages over the network, while within each node I parallelize with OpenMP.
So far I only know about MPI, but I have heard about something called OpenMPI as a combined approach. I would be thankful for good sources on such an approach that are understandable for someone with no background in HPC. I would also love to hear other suggestions regarding my idea.
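As a rough illustration of the hybrid pattern asked about here (MPI for the message exchange between nodes, OpenMP for the threads within each node), below is a minimal hello-world sketch. It is illustrative only, not code from the question; the file name and launch command are assumptions.

/* Minimal hybrid MPI + OpenMP sketch (illustrative; hello_hybrid.c is a made-up name).
   Typically one MPI rank per node handles the network messages, while OpenMP
   threads fill the cores of that node. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;
    /* Request an MPI mode that tolerates OpenMP threads
       (MPI_THREAD_FUNNELED: only the main thread makes MPI calls). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
    printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
           rank, nranks, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}

Something like mpicc -fopenmp hello_hybrid.c -o hello_hybrid and mpirun -np 2 ./hello_hybrid (with OMP_NUM_THREADS set to the cores per node) would run it; the exact compiler wrapper and launch flags depend on the MPI installation and the cluster's scheduler.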

Related

Overview of the different Parallel programming methods [closed]

I'm learning how to use parallel programming (specifically in R, but I'm trying to make this question as general as possible). There are many different libraries that work with it, and understanding their differences is hard without knowing the computer science terms used in their descriptions.
I identified some attributes that define these categories, such as: fine-grained and coarse-grained, explicit and implicit, levels of parallelism (bit-level etc.), classes of parallel computers (multi-core computing, grid computing, etc.), and what I call "methods" (I will explain what I mean by that later).
First question: is that list complete, or are there other relevant attributes that define categories of parallel programming?
Secondary question: for each attribute, what are the pros and cons of the different options, and when should each option be used?
About the "methods": I saw materials talking about socket and forking; other talking about Parallel Virtual Machine (PVM) and Message Passing Interface (MPI) (and another option called "NWS"); and some other specific methods like Hadoop/map-reduce and futures (from R's "future" package).
Second question: I don't understand some of them, and I'm not sure if it makes sense to join them in this list that I called (parallel processing) "methods". Also, are there other "methods" that I left out?
Secondary question: what are the pros and cons of each of these methods, and when is it better to use each?
Third question: in light of this categorization and the discussion on the pros and cons of each category, how can we make an overview of the parallel computing libraries in R?
My post might be too broad and ask too much at once, so I'll answer it with what I have found so far, and maybe you can correct it/add to it in your own answer. The points I feel are most lacking are the pros/cons of each "method" and of each R package.
OP here. I can't answer the "is that list complete?" part of the questions, but here are the explanations of each attribute and the pros and cons of each option. To reiterate, I'm new to this subject and might write something false/misleading.
Categories
Fine-grained, coarse-grained, and embarrassing parallelism (ref):
Attribute: how often their subtasks need to synchronize or communicate with each other.
Fine-grained parallelism: subtasks must communicate many times per second;
Coarse-grained parallelism: subtasks do not communicate many times per second;
Embarrassing parallelism: subtasks rarely or never have to communicate.
When to use each: self-explanatory
Explicit and implicit parallelism (ref):
Attribute: the need to write code that directly instructs the computer to parallelize
Explicit: needs it
Implicit: automatically detects a task that needs parallelism
When to use each: parallelizing a task can introduce a lot of complexity, and implicit parallelism can lead to inefficiencies in some cases.
Types/levels of parallelism (ref):
Attribute: the code level where parallelism happens.
Bit-level, instruction-level, task and data-level, superword-level
When to use each: From what I understood, the most common statistics/R applications use task- and data-level parallelism, so I didn't look into when to use the other ones.
Classes of parallel computers (ref)
Attribute: the level at which the hardware supports parallelism
Multi-core computing, Symmetric multiprocessing, Distributed computing, Cluster computing, Massively parallel computing, Grid computing, Specialized parallel computers.
When to use each: From what I understood, unless you have a really big task that needs several/external computers, you can stick with multi-core computing (using only your own machine).
Parallelism "method":
Attribute: different methods (there probably is a better word for this).
This post makes a distinction between the socket approach (which launches a fresh copy of the code on each core) and the forking approach (which copies the entire current version of your project to each core). Forking is not supported on Windows.
This post distinguishes between Parallel Virtual Machine (PVM) and Message Passing Interface (MPI) (which I couldn't quite understand). Apparently there is another option called "NWS", which I couldn't find information about.
This CRAN task view contains a list of packages, grouped by topic, that are useful for high-performance computing in R. It refers to a method called "Hadoop" which was built upon "map-reduce", and to the "future" package, that uses "futures" to introduce parallelism to R.
When to use each: With sockets, each worker starts clean (avoiding cross-contamination) and runs on any system, while forking is faster and lets each process access your entire workspace. I couldn't find enough information to discuss PVM, MPI, and NWS, and didn't go in depth into Hadoop and futures, so there is a lot of room to contribute to this paragraph.
R packages
The CRAN task view is a great reference for this. It separates packages that do explicit and implicit parallel processing, most of which work with multi-core processing (while also pointing out some that do grid processing). It points to a specific group of packages that use Hadoop, and to other groups of parallel processing tools and specific applications. As they're more common, I'll list the explicit parallelism ones. Package names inside parentheses are upgrades to the listed package.
rpvm (no longer actively maintained): PVM dedicated;
Rmpi (pdbMPI): MPI dedicated;
snow (snowFT, snowfall): works with PVM, MPI, NWS and socket, but not forking;
parallel (parallelly) was built upon multicore (forking focused and no longer actively maintained) and snow, and it is in base R;
future (future.apply and furrr): "parallel evaluations via abstraction of futures"
foreach: needs a "parallel backend", which can be doMPI (using Rmpi), doMC (using multicore), doSNOW (using snow), doParallel (using parallel), or doFuture (using future)
RHIPE, rmr, segue, and RProtoBuf: use Hadoop and other map-reduce techniques.
I also didn't get in depth into how each package works, and there is room to add pros/cons and when to use each.

Memory virtualization with R on cluster

I know almost nothing about parallel computing, so this question might be very stupid and it may be impossible to do what I would like to.
I am using a Linux cluster with 40 nodes; however, since I don't know how to write parallel code in R, I am limited to using only one. On this node I am trying to analyse data that floods the memory (around 64 GB). So my problem isn't a lack of computational power but rather a memory limitation.
My question is whether it is even possible to use some R package (like doSNOW) for implicit parallelisation across 2-3 nodes to increase the RAM limit, or would I have to rewrite the script from the ground up to parallelise it explicitly?
Sorry if my question is naive, any suggestions are welcomed.
Thanks,
Simon
I don't think there is such a package. The reason is that it would not make much sense to have one. Memory access is very fast, and accessing data from another computer over the network is very slow compared to that. So if such a package existed it would be almost useless, since the processor would need to wait for data over the network all the time, and this would make the computation very very slow.
This is true for common computing clusters, built from off-the-shelf hardware. If you happen to have a special cluster where remote memory access is fast, and is provided as a service of the operating system, then of course it might be not that bad.
Otherwise, what you need to do is to try to divide up the problem into multiple pieces, manually, and then parallelize, either using R, or another tool.
An alternative to this would be to keep some of the data on the disk, instead of loading all of it into the memory. You still need to (kind of) divide up the problem, to make sure that the part of the data in the memory is used for a reasonable amount of time for computation, before loading another part of the data.
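To make the chunking idea concrete, here is a minimal out-of-core sketch in C (the language choice, the file name data.bin, and the simple sum are all assumptions for illustration): only one chunk of the data is in memory at any time, and each chunk gets a full pass of computation before the next one is loaded.

/* Illustrative out-of-core sketch: stream a large binary file of doubles
   in fixed-size chunks instead of loading it all into memory. */
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (1 << 20)   /* 1M doubles, about 8 MB resident at a time */

int main(void) {
    FILE *f = fopen("data.bin", "rb");   /* hypothetical input file */
    if (!f) { perror("data.bin"); return 1; }

    double *buf = malloc(CHUNK * sizeof(double));
    double sum = 0.0;
    size_t n, total = 0;

    /* Do a full pass of computation over the chunk in memory
       before loading the next chunk, as suggested above. */
    while ((n = fread(buf, sizeof(double), CHUNK, f)) > 0) {
        for (size_t i = 0; i < n; i++)
            sum += buf[i];
        total += n;
    }

    printf("mean of %zu values: %f\n", total, total ? sum / total : 0.0);
    free(buf);
    fclose(f);
    return 0;
}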
Whether it is worth (or possible at all) doing either of these options, depends completely on your application.
Btw. a good list of high performance computing tools in R is here:
http://cran.r-project.org/web/views/HighPerformanceComputing.html
For future inquiry:
You may want to have a look at the two packages "snow" and "parallel".
The "snow" library extends the functionality of apply/lapply/sapply... to work on more than one core and/or more than one node.
Of course, you can perform simple parallel computing using more than one core:
#SBATCH --cpus-per-task= (enter some number here)
You can also perform parallel computing using more than one node (preferably with the previously mentioned libraries) using:
#SBATCH --ntasks-per-node= (enter some number here)
However, in some situations you may want to consider using Python instead of R, where parallelism can be much more efficient using "Dask" workers.
You might want to take a look at TidalScale, which can allow you to aggregate nodes on your cluster to run a single instance of Linux with the collective resources of the underlying nodes (www.tidalscale.com). Though your R application may be inherently single-threaded, you'll be able to provide it with a single, simple, coherent memory space across the nodes that is transparent to your application.
Good luck with your project!

MPI vs OpenMP for shared memory

Let's say there is a computer with 4 CPUs, each having 2 cores, so 8 cores in total. With my limited understanding, I think all processors share the same memory in this case. Now, is it better to use OpenMP directly, or to use MPI to keep the code general so that it works in both distributed and shared settings? Also, if I use MPI in a shared-memory setting, would performance decrease compared with OpenMP?
Whether you need or want MPI or OpenMP (or both) heavily depends on the type of application you are running, and on whether your problem is mostly memory-bound or CPU-bound (or both). Furthermore, it depends on the type of hardware you are running on. A few examples:
Example 1
You need parallelization because you are running out of memory, e.g. you have a simulation and the problem size is so large that your data does not fit into the memory of a single node anymore. However, the operations you perform on the data are rather fast, so you do not need more computational power.
In this case you probably want to use MPI and start one MPI process on each node, thereby making maximum use of the available memory while limiting communication to the bare minimum.
Example 2
You usually have small datasets and only want to speed up your application, which is computationally heavy. Also, you do not want to spend much time thinking about parallelization, but rather about your algorithms in general.
In this case OpenMP is your first choice. You only need to add a few statements here and there (e.g. in front of your for loops that you want to accelerate), and if your program is not too complex, OpenMP will do the rest for you automatically.
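To illustrate "a few statements here and there", a typical OpenMP loop looks like the sketch below (the array sizes and the loop body are placeholders, not from the answer):

/* Illustrative OpenMP sketch: a single pragma in front of a hot loop. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];
    double sum = 0.0;

    /* Split the iterations across the available threads; the reduction
       clause combines each thread's partial sum safely. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
        sum += c[i];
    }

    printf("sum = %f\n", sum);
    return 0;
}

Compiled with an OpenMP flag (e.g. -fopenmp for GCC), the loop runs on as many threads as OMP_NUM_THREADS allows; without the flag, the pragma is simply ignored and the code stays serial.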
Example 3
You want it all. You need more memory, i.e. more computing nodes, but you also want to speed up your calculations as much as possible, i.e. running on more than one core per node.
Now your hardware comes into play. From my personal experience, if you have only a few cores per node (4-8), the performance penalty created by the general overhead of using OpenMP (i.e. starting up the OpenMP threads etc.) is more than the overhead of processor-internal MPI communication (i.e. sending MPI messages between processes that actually share memory and would not need MPI to communicate).
However, if you are working on a machine with more cores per node (16+), it will become necessary to use a hybrid approach, i.e. parallelizing with MPI and OpenMP at the same time. In this case, hybrid parallelization will be necessary to make full use of your computational resources, but it is also the most difficult to code and to maintain.
Summary
If you have a problem that is small enough to be run on just one node, use OpenMP. If you know that you need more than one node (and thus definitely need MPI), but you favor code readability/effort over performance, use only MPI. If using MPI only does not give you the speedup you would like/require, you have to do it all and go hybrid.
To your second question (in case that did not become clear):
If your setup is such that you do not need MPI at all (because you will always run on only one node), use OpenMP, as it will be faster. But if you know that you need MPI anyway, I would start with that and only add OpenMP later, once you know that you've exhausted all reasonable optimization options for MPI.
With most distributed memory platforms nowadays consisting of SMP or NUMA nodes it just makes no sense to not use OpenMP. OpenMP and MPI can perfectly work together; OpenMP feeds the cores on each node and MPI communicates between the nodes. This is called hybrid programming. It was considered exotic 10 years ago but now it is becoming mainstream in High Performance Computing.
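A hedged sketch of that division of labour (the work inside the loop is a placeholder; the pattern, not the computation, is the point): MPI splits the iteration range across ranks, OpenMP threads work on each rank's share, and MPI combines the partial results.

/* Illustrative hybrid sketch: MPI between nodes, OpenMP within a node. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const long n = 100000000;                     /* total iterations (made up) */
    long chunk = n / nranks;
    long start = rank * chunk;
    long end = (rank == nranks - 1) ? n : start + chunk;

    double local = 0.0, global = 0.0;

    /* OpenMP feeds the cores of this rank's node. */
    #pragma omp parallel for reduction(+:local)
    for (long i = start; i < end; i++)
        local += 1.0 / (1.0 + (double)i);         /* placeholder work */

    /* MPI communicates between the nodes: combine the per-rank sums. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("result = %f\n", global);

    MPI_Finalize();
    return 0;
}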
As for the question itself, the right answer, given the information provided, has always been one and the same: IT DEPENDS.
For use on a single shared-memory machine like that, I'd recommend OpenMP. It makes some aspects of the problem simpler and might be faster.
If you ever plan to move to a distributed memory machine, then use MPI. It'll save you solving the same problem twice.
The reason I say OpenMP might be faster is because a good implementation of MPI could be clever enough to spot that it's being used in a shared memory environment and optimise its behaviour accordingly.
Just for the bigger picture: hybrid programming has become popular because OpenMP benefits from the cache topology by using the same address space. Because MPI processes cannot share data, the same data may be replicated in memory across processes, which can make cache use less effective.
On the other hand, if you partition your data correctly and each processor has a private cache, you may reach the point where your problem fits completely in cache. In this case you get superlinear speedups.
Speaking of caches, recent processors have very different cache topologies, so as always: IT DEPENDS...

Suggest a benchmark program to compare MPICH and OpenMPI

I am new to HPC, and the task at hand is to do a performance analysis and comparison between MPICH and OpenMPI on a cluster that comprises IBM servers equipped with dual-core AMD Opteron processors, running ClusterVisionOS.
Which benchmark program should I pick to compare the MPICH and OpenMPI implementations?
I am not sure if the High-Performance Linpack benchmark can help, as I am not attempting to measure the performance of the cluster itself. Kindly suggest.
Thank you
The classic examples are:
NAS Parallel Benchmarks - they are representative numerical kernels that you'd see in a lot of scientific computing applications. These admittedly have a lot of computation but also have the communications patterns you'd expect to see in real applications, so they are fairly relevant.
Or, if you really just want MPI "microbenchmarks", the OSU benchmarks or the Intel MPI Benchmarks are well-known choices. These run zillions of tests -- ping-pong, broadcast, etc. -- of various sizes and configurations, so you end up with a very large amount of data. The good news is that if you run these with the two MPIs, you'll know exactly where each one is stronger or weaker.
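For a sense of what such a microbenchmark does internally, here is a hedged ping-pong sketch (this is not the OSU or Intel code; the message size and repetition count are arbitrary): two ranks bounce a message back and forth and the average round-trip time is reported.

/* Illustrative ping-pong sketch between ranks 0 and 1. */
#include <mpi.h>
#include <stdio.h>

#define REPS 1000
#define MSG_SIZE 1024   /* bytes; real benchmark suites sweep many sizes */

int main(int argc, char **argv) {
    int rank;
    char buf[MSG_SIZE] = {0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("average round-trip time: %g us\n", (t1 - t0) / REPS * 1e6);

    MPI_Finalize();
    return 0;
}

Built and run with two ranks under each implementation in turn (e.g. mpirun -np 2 ./pingpong), the same source gives directly comparable numbers, which is essentially what the established suites do at much larger scale.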
MPICH and OpenMPI are both actively maintained and very solid, and have a long-standing friendly rivalry, so I'd be very surprised if you found one to be consistently faster than the other. We have had both on our system, and there were differences with the default settings on real applications, but they were usually fairly small, some favouring one and some favouring the other. But to really find out which is better for a particular application, you need to do more than run with the default parameters; both implementations have a large number of tunable parameters dealing with how they handle collectives (OpenMPI 1.5.x has very interesting-looking hierarchical collectives I haven't played with yet), etc.
What I would do is search the ACM Digital Library. You will find objective material there.
Some tips for the search:
Sort by relevance.
Read the Abstract (at the bottom) to see if it matches what you are looking for.
If a paper matches your search, buy that paper; it is usually cheap. Another option is to subscribe to the ACM if you plan to search often, as you will get a better price.
Hope this helps someone.

Can someone suggest a good way to understand how MPI works?

If you are familiar with threads, then you can treat each node as a thread (to an extent).
You send a message (work) to a node, it does some work, and then it returns some results to you.
Similar behaviours between threads and MPI:
Both involve partitioning the work and processing the parts separately.
Both incur overhead as more nodes/threads get involved. MPI overhead is more significant than thread overhead: passing messages between nodes is costly, and if the work is not carefully partitioned you might end up with the time spent passing messages exceeding the computational time required to process the job.
Different behaviours:
They have different memory models: each MPI process does not share memory with the others and knows nothing about the rest of the world unless you send something to it.
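A minimal sketch of that "send work, get results back" model (illustrative only; doubling a number stands in for real work): rank 0 hands a value to rank 1, which computes and sends the result back; neither rank can see the other's memory.

/* Illustrative MPI send/receive sketch. Needs at least two ranks,
   e.g. mpirun -np 2 ./sendrecv (program name is made up). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double work = 21.0, result;
        MPI_Send(&work, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);    /* hand out work */
        MPI_Recv(&result, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                              /* collect result */
        printf("rank 0 got back %f\n", result);
    } else if (rank == 1) {
        double work, result;
        MPI_Recv(&work, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        result = work * 2.0;                                      /* do the "work" */
        MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}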
Here you can find some learning materials http://www.mcs.anl.gov/research/projects/mpi/
Parallel programming is one of those subjects that is "intrinsically" complex (as opposed to the "accidental" complexity, as noted by Fred Brooks).
I used Parallel Programming with MPI by Peter Pacheco. This book gives a good overview of the basic MPI topics, available APIs, and common patterns for parallel program construction.

Resources