Is a hybrid OpenMP and MPI program faster than pure MPI?

I am developing a program that runs on a 4-node cluster with 4 cores on each node. I have a fairly fast OpenMP version of the program that runs on only one node, and I am trying to scale it using MPI. Given my limited experience, I am wondering which would give me better performance: a hybrid OpenMP/MPI architecture or an MPI-only architecture? I have seen this slide claiming that the hybrid approach generally cannot outperform the pure MPI one, but it does not give supporting evidence and seems counter-intuitive to me.
BTW, my platform uses InfiniBand to interconnect the nodes.
Thanks a lot,
Bob

Shared memory is usually more efficient than message passing, as the latter usually requires increased data movement (moving data from the source to its destination) which is costly both performance-wise and energy-wise. This cost is predicted to keep growing with every generation.
The material states that MPI-only applications usually perform on par with or better than hybrid applications, although they usually have larger memory requirements.
However, that conclusion rests on the fact that most of the large hybrid applications shown alternate parallel computation with serialized communication.
This kind of implementation is usually susceptible to the following problems:
Non-uniform memory access: having two sockets in a single node is a popular setup in HPC. Since modern processors have their memory controller on chip, half of the memory is directly accessible through the local memory controller, while the other half must be reached through the remote memory controller (i.e., the one in the other socket). How the program allocates memory therefore matters a great deal: if the memory is reserved during the serialized phase (on the closest possible memory), then half of the cores will suffer longer main memory accesses.
Load balance: each parallel-computation-to-serialized-communication phase implies a synchronization barrier. These barriers force the fastest cores to wait for the slowest cores in a parallel region. The imbalance between fastest and slowest cores may be caused by OS preemption (time is shared with other system processes), dynamic frequency scaling, etc.
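The cost of such a barrier can be put in numbers with a toy model. The sketch below is only an illustration: the function name and the per-core times are made up, and it assumes every core waits at the barrier until the slowest one arrives.

```python
def phase_idle_fraction(core_times):
    """Fraction of total core time spent waiting at the barrier ending a phase.

    core_times: time each core needs for its share of the parallel phase.
    The phase lasts as long as the slowest core; everyone else idles.
    """
    slowest = max(core_times)
    busy = sum(core_times)
    total = slowest * len(core_times)
    return (total - busy) / total


# Hypothetical numbers for one 4-core node.
balanced = [1.0, 1.0, 1.0, 1.0]
perturbed = [1.0, 1.0, 1.0, 1.5]  # one core delayed, e.g. by OS preemption

print(phase_idle_fraction(balanced))   # 0.0
print(phase_idle_fraction(perturbed))  # 0.25
```

In this model a single core running 50% slower leaves a quarter of the node's core time idle in every phase, which is why overlapping communication with computation matters.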
Some of these issues are more straightforward to solve than others. For example, the multiple-socket NUMA problem can be mitigated by placing different MPI processes on different sockets inside the same node.
To really exploit the efficiency of shared memory parallelism, the best option is trying to overlap communication with computation and ensure load balance between all processes, so that the synchronization cost is mitigated.
However, developing hybrid applications which are both load balanced and do not impose big synchronization barriers is very difficult, and nowadays there is a strong research effort to address this complexity.

Related

OpenCL shared memory optimisation

I am solving a 2D Laplace equation using OpenCL.
The version that uses global memory runs faster than the one using shared memory.
The algorithm used for shared memory is the same as that in the OpenCL Game of Life code.
https://www.olcf.ornl.gov/tutorials/opencl-game-of-life/
If anyone has faced the same problem please help. If anyone wants to see the kernel I can post it.
If your global-memory version really runs faster than your local-memory version (assuming both are equally well optimized for the memory space they use), maybe this paper could answer your question.
Here's a summary of what it says:
Using local memory in a kernel adds another constraint on the number of concurrent workgroups that can run on the same compute unit.
Thus, in certain cases, it may be more efficient to remove this constraint and live with the high latency of global memory accesses. More wavefronts (warps in NVIDIA parlance; each workgroup is divided into wavefronts/warps) running on the same compute unit allow your GPU to hide latency better: if one is waiting for a memory access to complete, another can compute during this time.
In the end, each kernel will take more wall time to complete, but your GPU will be completely busy because it is running more of them concurrently.
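The local-memory constraint summarized above can be illustrated with simple arithmetic. This is a rough sketch, not taken from the paper: the function name and the hardware figures (32 KiB of local memory per compute unit, at most 8 resident workgroups) are hypothetical, and real devices have further limits such as registers.

```python
def resident_workgroups(local_mem_per_cu, local_mem_per_wg, max_wg_per_cu):
    """Workgroups that can be resident on one compute unit at the same time.

    Only two limits are modelled: a fixed hardware cap on workgroups and the
    local memory each workgroup reserves.
    """
    if local_mem_per_wg == 0:
        return max_wg_per_cu
    return min(max_wg_per_cu, local_mem_per_cu // local_mem_per_wg)


# Hypothetical device figures, for illustration only.
LOCAL_MEM_PER_CU = 32 * 1024  # bytes of local memory per compute unit
MAX_WG_PER_CU = 8             # hardware cap on resident workgroups

# Kernel that uses global memory only: limited by the hardware cap.
print(resident_workgroups(LOCAL_MEM_PER_CU, 0, MAX_WG_PER_CU))          # 8
# Kernel that reserves 16 KiB of local memory per workgroup.
print(resident_workgroups(LOCAL_MEM_PER_CU, 16 * 1024, MAX_WG_PER_CU))  # 2
```

With only 2 resident workgroups instead of 8, there are far fewer wavefronts available to switch to while one waits on memory.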
No, it doesn't. It only says that ALL OTHER THINGS BEING EQUAL, an access from local memory is faster than an access from global memory. It seems to me that the global accesses in your kernel are being coalesced, which yields better performance.
Using shared memory (memory shared with the CPU) isn't always going to be faster. On a modern graphics card, it would only be faster when the GPU and CPU are both performing operations on the same data and need to share information with each other, as memory wouldn't have to be copied from the card to the system and vice versa.
However, if your program is running entirely on the GPU, it could very well execute faster by running in local memory (GDDR5) exclusively: not only is the GPU's memory likely to be much faster than your system's, there will also be no latency caused by reading memory over the PCIe bus.
Think of the graphics card's memory as a type of "L3 cache" and your system's memory as a resource shared by the entire system; you only use it when multiple devices need to share information (or if your cache is full). I'm not a CUDA or OpenCL programmer, and I've never even written Hello World in them. I've only read a few white papers; it's just common sense (or maybe my Computer Science degree is useful after all).

MPI and OpenMP on Desktop CPUs

I was just wondering how it is possible that OpenMP (shared memory) and MPI (distributed memory) can run on normal desktop CPUs such as an i7. Is there some kind of virtual machine that simulates shared and distributed memory on these CPUs? I am asking because, when learning OpenMP and MPI, the structure of supercomputers is shown, with shared memory or with different nodes for distributed memory, each node with its own processor and memory.
MPI assumes nothing about how and where MPI processes run. As far as MPI is concerned, processes are just entities that have a unique address, known as their rank, and MPI gives them the ability to send and receive data in the form of messages. How exactly the messages are transferred is left to the implementation. The model is so general that MPI can run on virtually any platform imaginable.
OpenMP deals with shared memory programming using threads. Threads are just concurrent instruction flows that can access a shared memory space. They can execute in a timesharing fashion on a single CPU core, or they can execute on multiple cores inside a single CPU chip, or they can be distributed among multiple CPUs connected together by some sophisticated network that allows them to access each other's memory.
Given all that, MPI does not require that each process execute on a dedicated CPU core, or that millions of cores be put on separate boards connected by some high-speed network; performance considerations and technical limitations do. You can happily run a 100-process MPI job on a single CPU core: performance will be very bad, but it will still work (given that enough memory is available). The same applies to OpenMP: it does not require that each thread be scheduled on a dedicated CPU core, but doing so gives the best performance.
That's why MPI and OpenMP are called abstractions - they are general enough that the execution hardware can vary greatly while source code is kept the same.
A modern multicore-CPU-based PC is a shared-memory computer. It is a sensible approximation to think of each core as a processor, and that they all have equal access to the same RAM. This approximation hides a lot of details of processor and chip architectures.
It has always (well, perhaps not always, but for almost as long as MPI has been around) been possible to use message-passing (of which MPI is one standard) on a shared-memory computer so that you can run the same MPI-enabled program as you would on a genuinely distributed-memory machine.
At the application level, a programmer only cares about calls to MPI routines. At the systems level, on a cluster or supercomputer, the MPI run-time translates these calls into instructions to send stuff over the interconnect. On a shared-memory computer it could instead translate these calls into instructions to send stuff over the internal bus.
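To make the idea concrete, here is a toy sketch of message passing layered on top of shared memory. It is not MPI and does not use any MPI library: the "ranks" are ordinary Python threads, the mailboxes are standard-library queues, and the names `run_ranks` and `rank_main` are invented for this example.

```python
import queue
import threading


def run_ranks(n_ranks):
    """Pass a token around a ring of 'ranks' using only send and receive."""
    # One mailbox per rank; put() acts as send, get() as a blocking receive.
    mailboxes = [queue.Queue() for _ in range(n_ranks)]
    visited = []

    def rank_main(rank):
        if rank == 0:
            mailboxes[1 % n_ranks].put([0])     # send the token to the next rank
            visited.extend(mailboxes[0].get())  # receive it back from the last rank
        else:
            token = mailboxes[rank].get()       # receive from the previous rank
            token.append(rank)
            mailboxes[(rank + 1) % n_ranks].put(token)

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return visited


print(run_ranks(4))  # [0, 1, 2, 3]
```

The code in `rank_main` only ever sends and receives, so it does not care whether the transport underneath is a queue in shared memory or a network link. One simplification to keep in mind: the queue hands over a reference to the same list, whereas a real message-passing runtime would copy the data.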
This is by no means a comprehensive introduction to the topics you've raised, but that's what Google and all the published sources out there are for.

OpenCL - how to spawn a separate math process on each core

I am new to OpenCL and I am writing an RSA factoring application. Ideally the application should work across both NV and AMD GPU targets, but I am not finding an easy way to determine the total number of cores/stream procs on each GPU.
Is there an easy way to determine how many total cores/stream procs there are on any hardware platform, and then spawn a factoring thread on each available core? The target RSA modulus would be in shared memory, with each factoring thread using a Rho factoring attack against the modulus.
Also, any idea if OpenCL supports multi-precision math libraries similar to GNU MP, for storing large semiprime numbers?
Thanks in advance
On the GPU, you don't spawn one thread for each core, like you would on a CPU. Instead, you want to start many more threads than there are cores. I wouldn't worry about the exact number of cores available on a given target platform. Instead, focus on what fits best for your problem.
To add to Roger's answer, the reason why you would want to have many more threads than cores is because GPUs implement very efficient context switching to hide memory latency. Generally, each memory access is a very expensive operation in terms of the amount of time it takes for a processor to receive the requested data. But if a thread is waiting on a memory transaction, it can be "paused" and another thread can be activated to do computations (or other memory accesses) in the meantime. So if you have enough threads, you can essentially hide the memory access latency, and your software can run at the full computational capacity of the hardware (which would rarely happen otherwise).
I would have put this in a comment to Roger's post, but its size is beyond the limit.
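The latency-hiding argument above can be turned into a back-of-the-envelope model. This is a deliberately idealized sketch: the function names and cycle counts are hypothetical, context switches are assumed to be free, and every thread is assumed to alternate a fixed amount of computation with one memory access.

```python
import math


def utilization(n_threads, compute_cycles, memory_latency_cycles):
    """Fraction of time a core spends computing in the idealized model."""
    cycle = compute_cycles + memory_latency_cycles
    return min(1.0, n_threads * compute_cycles / cycle)


def threads_to_hide_latency(compute_cycles, memory_latency_cycles):
    """Smallest thread count that keeps the core busy all the time."""
    return math.ceil((compute_cycles + memory_latency_cycles) / compute_cycles)


# Hypothetical figures: 20 cycles of arithmetic per 380-cycle memory access.
print(utilization(1, 20, 380))           # 0.05
print(threads_to_hide_latency(20, 380))  # 20
print(utilization(20, 20, 380))          # 1.0
```

Under these assumptions a single thread keeps the core busy only 5% of the time, and about 20 threads per core are needed before the memory latency is fully hidden.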

Parallel computing: Distributed systems vs multicore processors?

I was just wondering why there is a need to go through all the trouble of creating distributed systems for massively parallel processing when we could just create individual machines that support hundreds or thousands of cores/CPUs (or even GPGPUs) per machine?
So basically, why should you do parallel processing over a network of machines when it can rather be done at much lower cost and much more reliably on 1 machine that supports numerous cores?
I think it is simply cheaper. Those machines are available today; there is no need to invent something new.
The next problem is the complexity of the motherboard: imagine 10 CPUs on one motherboard, with so many links! And if one of those CPUs dies, it could take down the whole machine.
You can write a program for a GPGPU, of course, but it is not as easy as writing it for a CPU. There are many limitations, e.g. the cache per core is really small, if there is any, and you cannot communicate between cores (or you can, but it is very costly).
Linking many computers is more stable, more scalable, and cheaper, thanks to its long history of use.
What Petr said. As you add cores to an individual machine, communication overhead increases. If memory is shared between cores then the locking architecture for shared memory, and caching, generates increasingly large overheads.
If you don't have shared memory, then effectively you're working with different machines, even if they're all in the same box.
Hence it's usually better to develop very large scale apps without shared memory. And usually possible as well - although communications overhead is often still large.
Given that this is the case, there's little use in building highly multicore individual machines, though some do exist, e.g. NVIDIA Tesla...

MPI overhead in shared memory setup

I want to parallelize a program. It's not that difficult with threads working on one big data structure in shared memory.
But I also want to be able to distribute it over a cluster, and I have to choose a technology to do that. MPI is one idea.
The question is: what overhead will MPI (or another technology) have if I skip implementing a specialized version for shared memory and let MPI handle all cases?
Update:
I want to grow a large data structure (game tree) simultaneously on many computers.
Most parts of it will be on only one cluster node, but some of it (the irregular top of the tree) will be shared and synchronized from time to time.
On a shared-memory machine I would like to achieve this through shared memory.
Can this be done generically?
All the popular MPI implementations will communicate locally via shared memory. The performance is very good as long as you don't spend all your time packing and unpacking buffers (i.e. your design is reasonable). In fact, the design imposed upon you by MPI can perform better than most threaded implementations because the separate address space improves cache coherence. To consistently beat MPI, the threaded implementations have to be aware of the cache hierarchy and what the other cores are working on.
With good network hardware (like InfiniBand) the HCA is responsible for getting your buffers on and off the network so the CPU can do other things. Also, since many jobs are memory bandwidth limited, they will perform better using, e.g. 1 core on each socket across multiple nodes than when using multiple cores per socket.
It depends on the algorithm. Clearly, inter-node communication within a cluster is orders of magnitude slower than shared memory, whether as inter-process communication or as multiple threads within a process. Therefore you want to minimize inter-node traffic, e.g. by duplicating data where possible and practicable, or by breaking the problem down in a way that minimizes inter-node communication.
For 'embarrassingly' parallel algorithms with little inter-node communication it's an easy choice. These are problems like brute-force searching for an encryption key, where each node can crunch numbers for long periods and report back to a central node periodically, but no communication is required to test keys.
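The brute-force key search mentioned above can be sketched as follows. This is a minimal illustration with made-up names (`rank_range`, `search`) and a made-up 16-bit "key"; the ranks are simulated by a plain loop here, whereas in a real job each rank would be a separate process and only the final result would travel over the network.

```python
def rank_range(rank, n_ranks, keyspace_size):
    """Half-open [start, stop) slice of the key space owned by one rank."""
    base, extra = divmod(keyspace_size, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def search(rank, n_ranks, keyspace_size, is_key):
    """Test every candidate in this rank's slice; no communication needed."""
    start, stop = rank_range(rank, n_ranks, keyspace_size)
    for candidate in range(start, stop):
        if is_key(candidate):
            return candidate
    return None


SECRET = 48_611  # hypothetical key, for illustration only
N_RANKS = 16     # e.g. 4 nodes with 4 cores each

# Each rank works independently; only the (tiny) result is reported back.
results = [search(r, N_RANKS, 1 << 16, lambda k: k == SECRET) for r in range(N_RANKS)]
found = [r for r in results if r is not None]
print(found)  # [48611]
```

Because each rank derives its slice from nothing more than its own rank number, the only communication is the final report of a hit, which is what makes this kind of problem scale so easily across nodes.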

Resources