MPI overhead in shared memory setup - mpi

I want parallelize a program. It's not that difficult with threads working on one big data-structure in shared memory.
But I want to be able to use distribute it over cluster and I have to choose a technology to do that. MPI is one idea.
The question is what overhead will have MPI (or other technology) if I skip implementation of specialized version for shared memory and let MPI handle all cases ?
Update:
I want to grow a large data structure (game tree) simultaneously on many computers.
Most parts of it will be only on one cluster node but some of it (unregular top of the tree) will be shared and synchronized from time to time.
On shared memory machine I would like to have this achieved through shared memory.
Can this be done generically?

All the popular MPI implementations will communicate locally via shared memory. The performance is very good as long as you don't spend all your time packing and unpacking buffers (i.e. your design is reasonable). In fact, the design imposed upon you by MPI can perform better than most threaded implementations because the separate address space improves cache coherence. To consistently beat MPI, the threaded implementations have to be aware of the cache hierarchy and what the other cores are working on.
With good network hardware (like InfiniBand) the HCA is responsible for getting your buffers on and off the network so the CPU can do other things. Also, since many jobs are memory bandwidth limited, they will perform better using, e.g. 1 core on each socket across multiple nodes than when using multiple cores per socket.

It depends on the algorithm. Clealy inter-cluster communication is orders of magnitude slower than shared memory either as inter-process communication or multiple threads within a process. Therefore you want to minimize inter-cluster traffic, E.g. by duplicating data where possible and practicable or breaking the problem down in such a way that minimizes inter node communication.
For 'embarrisngly' parallel algorithms with little inter-node communication it's an easy choice - these are problems like brute force searching for encryption key where each node can crunch numbers for long periods and report back to a central node periodically but no communication is required to test keys.

Related

Difference between processor and process in parallel computing?

Every time I come across something like "process 0 does x task" , I am inclined to think they mean processor.
After reading a bit more about it, I find that there are two memory classifications, shared memory and distributed memory:
A shared memory executes something like a thread (implying same data is available to all processors- hence it makes sense to call it a process) However, even for distributed memory it is called a process instead of a processor. For example: "Process 0 is computing the partial dot product"
Why is this so? Why is it called a process and not a processor?
PS. I hope this question is not trivial :)
These other answers are all pretty spot on. Processors are physical, processes are software. So a quad core CPU will have 4 processors, but can run many more processes.
Your confusion around distributed terminology is fair though. In distributed computing, typically X number of processes will be executed equal to the number of hardware processors. In this scenario, each processes gets an ID in software often called a rank. Ranks are independent of processors, and different ranks will have different tasks. So when you report a status, information is relative to the process rank, and not the physical processor.
To rephrase, in distributed computing there will usually be one process running on each processor. The process will have a unique id that is more important in the software than the physical processor it is running on, so status information is given about the process. As the number of processes and processors are equal, this distinction can get a bit blurred.
The distinction is hardware vs software.
The process is the logical instance of your program. The processor is the hardware entity that runs the process. Most of the time, you don't care about the actual processor, only the process that's executing.
For instance, the OS may decide to temporarily put your processes to sleep in order to give other applications runtime, and later it may awaken them on different processors. As long as your processes produce the expected results, this should not be of any interest to you: all you care about is the computation, not where it's happening.
For me, processor refers to machine, that is responsible for computing operations. Process is a single instance of some program. (I hope i understood what you meant).
I would say that they use the terms indistinctly because most of the time the context allows it and the difference may be subtle to some extent. That is, since each process (when it is single threaded) executes on a processor, people typically does not want to make the distinction between the physical entity (processor) and the logical entity (process).
This assumption might be wrong when considering processors with multithreading capabilities (SMT, and Hyper-Threading for Intel processors) and/or executing multi-threaded applications because processes run on any available processor (or thread). In those situations, people should be stricter when making this affirmations. Still, since it is possible to bind one process (and even one thread) to a processor (or processor thread) using affinity commands, they can use indistinctly both terms under these circumstances.

Is OpenMP and MPI hybrid program faster than pure MPI?

I am developing some program than runs on 4 node cluster with 4 cores on each node. I have a quite fast version of OpenMP version of the program that only runs on one cluster and I am trying to scale it using MPI. Due to my limited experience I am wondering which one would give me faster performance, a OpenMP hybrid architecture or a MPI only architecture? I have seen this slide claiming that the hybrid one generally cannot out perform the pure MPI one, but it does not give supporting evidence and is kind of counter-intuitive for me.
BTW, My platform use infiniband to interconnect nodes.
Thank a lot,
Bob
Shared memory is usually more efficient than message passing, as the latter usually requires increased data movement (moving data from the source to its destination) which is costly both performance-wise and energy-wise. This cost is predicted to keep growing with every generation.
The material states that MPI-only applications are usually on-par or better than hybrid applications, although they usually have larger memory requirements.
However, they are based on the fact that most of the large hybrid applications shown were based on parallel computation then serial communication.
This kind of implementations are usually susceptible to the following problems:
Non uniform memory access: having two sockets in a single node is a popular setup in HPC. Since modern processors have their memory controller on chip, half of the memory will be easily accessible from the local memory controller, meanwhile the other half has to pass through the remote memory controller (i.e., the one present in the other socket). Therefore, how the program allocates memory is very important: if the memory is reserved in the serialized phase (on the closest possible memory), then half of the cores will suffer longer main memory accesses.
Load balance: each *parallel computation to serialized communication** phase implies a synchronization barrier. This barriers force the fastest cores to wait for the slowest cores in a parallel region. Fastest/slowest unbalance may be affected by OS preemption (time is shared with other system processes), dynamic frequency scaling, etc.
Some of this issues are more straightforward to solve than others. For example,
the multiple-socket NUMA problem can be mitigated placing different MPI processes in different sockets inside the same node.
To really exploit the efficiency of shared memory parallelism, the best option is trying to overlap communication with computation and ensure load balance between all processes, so that the synchronization cost is mitigated.
However, developing hybrid applications which are both load balanced and do not impose big synchronization barriers is very difficult, and nowadays there is a strong research effort to address this complexity.

OpenCL - how to spawn a separate math process on each core

I am new to OpenCL and I am writing an RSA factoring application. Ideally the application should work across both NV and AMD GPU targets, but I am not finding an easy way to determine the total number of cores/stream procs on each GPU.
Is there an easy way to determine how many total cores/stream procs there are on any hardware platform, and then spawn a factoring thread on each available core? The target RSA modulus would be in shared memory, and with each factoring thread using a Rho factoring attack against the modulus.
Also, any idea if OpenCL support multi-precision math libraries similar to GNU MP, to store large semi prime numbers?
Thanks in advance
On the GPU, you don't spawn one thread for each core, like you would on a CPU. Instead, you want to start many more threads than there are cores. I wouldn't worry about the exact number of cores available on a given target platform. Instead, focus no what fits best for your problem.
To add to Roger's answer, the reason why you would want to have many more threads than cores is because GPUs implement very efficient context switching to hide memory latency. Generally, each memory access is a very expensive operation in terms of the amount of time it takes for a processor to receive the requested data. But if a thread is waiting on a memory transaction, it can be "paused" and another thread can be activated to do computations (or other memory accesses) in the meantime. So if you have enough threads, you can essentially hide the memory access latency, and your software can run at the full computational capacity of the hardware (which would rarely happen otherwise).
I would have put this in a comment to Roger's post, but its size is beyond the limit.

Parallel computing: Distributed systems vs multicore processors?

I was just wondering why there is a need to go through all the trouble of creating distributed systems for massive parallel processing when, we could just create individual machines that support hundreds or thousands of cores/CPUs (or even GPGPUs) per machine?
So basically, why should you do parallel processing over a network of machines when it can rather be done at much lower cost and much more reliably on 1 machine that supports numerous cores?
I think it is simply cheaper. Those machines are available today, no need of inventing something new.
Next problem will be in complexity of the motherboard, imagine 10 CPUs on one MB - so much links! And if one of those CPUs dies, it could destroy whole machine..
You can write a program for GPGPU of course, but it is not as easy as write it for CPU. There are many limitations, eg. cache per core is really small if any, you can not communicate between cores (or you can, but it is very costly) etc.
Linking many computers is more stable, more scalable and cheaper due to long usage history.
What Petr said. As you add cores to an individual machine, communication overhead increases. If memory is shared between cores then the locking architecture for shared memory, and caching, generates increasingly large overheads.
If you don't have shared memory, then effectively you're working with different machines, even if they're all in the same box.
Hence it's usually better to develop very large scale apps without shared memory. And usually possible as well - although communications overhead is often still large.
Given that this is the case, there's little use for building highly multicore individual machines - though some do exist e.g. nvidia tesla...

UNIX Domain sockets vs Shared Memory (Mapped File)

Can anyone tell, how slow are the UNIX domain sockets, compared to Shared Memory (or the alternative memory-mapped file)?
Thanks.
It's more a question of design, than speed (Shared Memory is faster), domain sockets are definitively more UNIX-style, and do a lot less problems. In terms of choice know beforehand:
Domain Sockets advantages
blocking and non-blocking mode and switching between them
you don't have to free them when tasks are completed
Domain sockets disadvantages
must read and write in a linear fashion
Shared Memory advantages
non-linear storage
will never block
multiple programs can access it
Shared Memory disadvantages
need locking implementation
need manual freeing, even if unused by any program
That's all I can think of now. However, I'd go with domain sockets any day -- not to mention that it's a lot easier then to reimplement them to do distributed computing. The speed gain of Shared Memory will be lost because of the need of a safe design. However, if you know exactly what you're doing, and use the proper kernel calls, you can achieve greater speed with Shared Memory.
In terms of speed shared memory is definitely the winner. With sockets you will have at least two copies of the data - from sending process to the kernel buffer, then from the kernel to the receiving process. With shared memory the latency will only be bound by the cache consistency algorithm between the cores on the box.
As Kornel notes though, dealing with shared memory is more involved since you have to come up with your own synchronization/signalling scheme, which might add a delay depending on which route you go. Definitely use semaphores in shared memory (implemented with futex on Linux) to avoid system calls in non-contended case.
Both are inter process communication (IPC) mechanisms.
UNIX domain sockets are uses for communication between processes on one host similar as TCP-Sockets are used between different hosts.
Shared memory (SHM) is a piece of memory where you can put data and share this between processes.
SHM provides you random access by using pointers, Sockets can be written or read but you cannot rewind or do positioning.
#Kornel Kisielewicz 's answer is good IMO. Just adding my own results here for sockets, not only Unix domain sockets.
Shared Memory
Performance is very high. No copies with RAW access data. Fastest access for sure.
Synchronization needed. Design not so easy to setup for complex cases.
Fixed size. Growing shared memory is doable but memory has to be unmapped first, growed, and then remapped.
Signaling mechanism can be quite slow, see here : Boost.Interprocess notify() performance. Especially if you want to do lots of exchanges between processes. Signaling mechanism not so easy to setup also.
Sockets
Easy to setup.
Can be used on different machines.
No complex synchronisation needed.
Size is not a problem if you use TCP. Simple design with header containing the packet size and then send the data.
Ping/Pong exchange is fast because it can be treated as hardware interruption by the OS.
Performance is average: a few copies of data are made.
High CPU consumption compared to shared memory. Sockets calls are not that cheap if you use them a lot.
In my tests, exchanges of small chunks of data (around 1MByte/second) shows no real advantage for shared memory. I would even say that ping/pong exchanges were faster using TCP (due to simple and efficient signaling mechanism). BUT when exchanging large amount of data (around 200MBytes/second), I had 20% CPU consumption with sockets, compared to 3% CPU using shared memory. So a huge win for shared memory in terms of CPU because read and write socket calls are not cheap.
In this case - sockets are faster. Writing to shared memory is faster then any IPC but writing to a memory mapped file and writing to shared memory are 2 completely different things.
when writing to a memory mapped file you need to "flush" what was written to the shared memory to an actual binded file (not exactly, the flush is being done for you), so you copy your data first to the shared memory, and then you copy it again (flush) to the actual file and that is super duper expansive - more then anything, even more then writing to socket, you are gaining nothing by doing that.

Resources