Overview of the different Parallel programming methods

I'm learning how to use parallel programming (specifically in R, but I'm trying to make this question as general as possible). There are many different libraries that work with it, and understanding their differences is hard without knowing the computer science terms used in their descriptions.
I identified some attributes that define these categories, such as: fine-grained and coarse-grained, explicit and implicit, levels of parallelism (bit-level etc.), classes of parallel computers (multi-core computing, grid computing, etc.), and what I call "methods" (I will explain what I mean by that later).
First question: is that list complete? Or are there other relevant attributes that define categories of parallel programming?
Secondary question: for each attribute, what are the pros and cons of the different options? When to use each option?
About the "methods": I saw materials talking about socket and forking; others talking about Parallel Virtual Machine (PVM) and Message Passing Interface (MPI) (and another option called "NWS"); and some other specific methods like Hadoop/map-reduce and futures (from R's "future" package).
Second question: I don't understand some of them, and I'm not sure if it makes sense to join them in this list that I called (parallel processing) "methods". Also, are there other "methods" that I left out?
Secondary question: what are the pros and cons of each of these methods, and when is it better to use each?
Third question: in light of this categorization and the discussion on the pros and cons of each category, how can we make an overview of the parallel computing libraries in R?
My post might be too broad and ask too much at once, so I'll answer it with what I have found so far, and maybe you can correct it or add to it in your own answer. The points I feel are most lacking are the pros/cons of each "method" and of each R package.

OP here. I can't answer the "is that list complete?" part of the questions, but here are the explanations to each attribute and the pros and cons of each option. Just to reiterate that I'm new to this subject and might write something false/misleading.
Fine-grained, coarse-grained, and embarrassing parallelism (ref):
Attribute: how often their subtasks need to synchronize or communicate with each other.
Fine-grained parallelism: if its subtasks must communicate many times per second;
Coarse-grained parallelism: if they do not communicate many times per second;
Embarrassing parallelism: if they rarely or never have to communicate.
When to use each: self-explanatory
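To make the embarrassingly parallel case concrete, here is a minimal Python sketch (not R, and not taken from any of the references). The function names `simulate` and `run_all` are made up for illustration. Each subtask depends only on its own seed, so the workers never communicate and the results are simply collected at the end. A thread pool is used only to keep the example portable; for CPU-bound work in Python you would probably swap in `ProcessPoolExecutor`, which has the same interface.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(seed):
    """One independent subtask: a tiny pseudo-random walk driven by its seed.
    It needs nothing from the other subtasks, so no communication occurs."""
    state = seed
    for _ in range(1000):
        state = (1103515245 * state + 12345) % 2**31
    return state % 100

def run_all(seeds, workers=4):
    # map() hands each seed to a worker; results come back in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate, seeds))

if __name__ == "__main__":
    print(run_all(range(8)))
```

A fine-grained problem would look different: the subtasks would have to exchange intermediate values inside the loop, and that exchange is where the synchronization cost comes from.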
Explicit and implicit parallelism (ref):
Attribute: the need to write code that directly instructs the computer to parallelize
Explicit: needs it
Implicit: automatically detects a task that needs parallelism
When to use each: Implicit parallelism saves you from writing parallel code yourself, but parallelizing a task automatically can add overhead and complexity when the task does not benefit from it, so implicit parallelism can lead to inefficiencies in some cases.
Types/levels of parallelism (ref):
Attribute: the code level where parallelism happens.
Bit-level, instruction-level, task and data-level, superword-level
When to use each: From what I understood, the most common statistics/R applications use task- and data-level parallelism, so I didn't research when to use the other ones.
Classes of parallel computers (ref)
Attribute: the level at which the hardware supports parallelism
Multi-core computing, Symmetric multiprocessing, Distributed computing, Cluster computing, Massively parallel computing, Grid computing, Specialized parallel computers.
When to use each: From what I understood, unless you have a really big task that needs several/external computers working, you can use multi-core computing (using only your own machine).
Parallelism "method":
Attribute: different methods (there probably is a better word for this).
This post makes a distinction between the socket approach (launches a new version of the code on each core) and the forking approach (copies the entire current version of your project on each core). Forking isn't supported by Windows.
This post distinguishes between Parallel Virtual Machine (PVM) and Message Passing Interface (MPI) (a distinction I couldn't quite understand). Apparently, there is another option called "NWS", which I couldn't find information about.
This CRAN task view contains a list of packages, grouped by topic, that are useful for high-performance computing in R. It refers to a method called "Hadoop" which was built upon "map-reduce", and to the "future" package, that uses "futures" to introduce parallelism to R.
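Since map-reduce is mentioned here without explanation, this is a rough single-machine sketch of the idea in Python. It is not how Hadoop is implemented, and the function names are my own. The point is only the three stages: a map step that can run independently per input, a grouping step by key, and a reduce step that can run independently per key.

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    # Emit one (word, 1) pair per word; each document is mapped independently.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Group all values that share a key, as a framework would between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Combine each key's values; each key could be reduced on a different node.
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))

if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```

In a real cluster framework the map and reduce calls would be distributed over machines and the shuffle would move data over the network; here everything runs in one process.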
When to use each: With sockets, each node is unique (avoiding cross-contamination) and the approach runs on any system, while forking is faster and gives each process access to your entire workspace. I couldn't find information about PVM, MPI, and NWS, and didn't go in depth into Hadoop and futures, so there is a lot of room to contribute to this paragraph.
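The socket/forking difference can be seen outside R as well. The following Python sketch is only an analogy, and `WORKSPACE` and both function names are hypothetical. A forked child starts as a copy of the parent and therefore already has the parent's objects, while a freshly launched interpreter (roughly what a socket worker is) starts empty and would need everything sent to it. It assumes a Unix-like system, because `os.fork` does not exist on Windows.

```python
import os
import subprocess
import sys

# Stands in for objects already loaded in the parent session's workspace.
WORKSPACE = {"x": 42}

def forked_child_sees(name):
    """Fork a child and report what it finds under `name` in WORKSPACE."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()                      # Unix only; not available on Windows
    if pid == 0:                         # child: a copy of the parent process
        os.close(read_fd)
        os.write(write_fd, repr(WORKSPACE.get(name)).encode())
        os._exit(0)
    os.close(write_fd)                   # parent: read the child's report
    report = os.read(read_fd, 1024).decode()
    os.close(read_fd)
    os.waitpid(pid, 0)
    return report

def fresh_process_sees_workspace():
    """Start a new interpreter and ask whether it has a WORKSPACE object."""
    probe = "print('WORKSPACE' in globals())"
    done = subprocess.run([sys.executable, "-c", probe],
                          capture_output=True, text=True, check=True)
    return done.stdout.strip()

if __name__ == "__main__":
    if hasattr(os, "fork"):
        print("forked child sees x =", forked_child_sees("x"))
    print("fresh process has WORKSPACE:", fresh_process_sees_workspace())
```

This matches the trade-off described above: the fresh process is isolated and portable but must be given its data explicitly, while the forked one gets the workspace for free.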
R packages
The CRAN task view is a great reference for this. It separates packages that deal with explicit parallel processing from those that deal with implicit parallel processing; most of them work with multicore processing, and it also points out some that do grid processing. It points to a specific group of packages that use Hadoop, and to other groups of parallel processing tools and specific applications. As they're more common, I'll list the explicit parallelism ones. Package names inside parentheses are upgrades to the listed package.
rpvm (no longer actively maintained): PVM dedicated;
Rmpi (pbdMPI): MPI dedicated;
snow (snowFT, snowfall): works with PVM, MPI, NWS and socket, but not forking;
parallel (parallelly) was built upon multicore (forking focused and no longer actively maintained) and snow, and it is in base R;
future (future.apply and furrr): "parallel evaluations via abstraction of futures"
foreach: needs a "parallel backend", which can be doMPI (using Rmpi), doMC (using multicore), doSNOW (using snow), doParallel (using parallel), and doFuture (using future)
RHIPE, rmr, segue, and RProtoBuf: use Hadoop and other map-reduce techniques.
I also didn't go in depth into how each package works, and there is room to add pros/cons and when to use each.
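For readers unfamiliar with the "futures" abstraction that the future package is built on, here is a small sketch using Python's standard `concurrent.futures` module. This is an analogy and not R's API; `slow_square` and `sum_of_squares` are made-up names. A future is a placeholder for a value that is being computed elsewhere, and the code that creates it does not need to know which backend does the computing, which is loosely similar to how foreach separates the loop from its parallel backend.

```python
from concurrent.futures import ThreadPoolExecutor

def slow_square(x):
    # Stand-in for an expensive computation.
    return x * x

def sum_of_squares(a, b, make_executor=ThreadPoolExecutor):
    """Start two computations as futures, then collect their values.
    `make_executor` is the swappable backend (a thread pool by default)."""
    with make_executor(max_workers=2) as pool:
        # submit() returns immediately with a future, not with the value.
        first = pool.submit(slow_square, a)
        second = pool.submit(slow_square, b)
        # Other work could happen here while both run.
        # result() blocks until the corresponding value is available.
        return first.result() + second.result()

if __name__ == "__main__":
    print(sum_of_squares(3, 4))
```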


Could you implement async-await by memcopying stack frames rather than creating state machines?

I am trying to understand all the low-level stuff Compilers / Interpreters / the Kernel do for you (because I'm yet another person who thinks they could design a language that's better than most others)
One of the many things that sparked my curiosity is Async-Await.
I've checked the under-the-hood implementation for a couple languages, including C# (the compiler generates the state machine from sugar code) and Rust (where the state machine has to be implemented manually from the Future trait), and they all implement Async-Await using state machines.
I've not found anything useful by googling ("async copy stack frame" and variations) or in the "Similar questions" section.
To me, this method seems rather complicated and overhead-heavy;
Could you not implement Async-Await by simply memcopying the stack frames of async calls to/from heap?
I'm aware that it is architecturally impossible for some languages (I think the CLR can't do it, so C# can't either).
Am I missing something that makes this logically impossible? I would expect less complicated code and a performance boost from doing it that way; am I mistaken? I suppose when you have a deep stack hierarchy after an async call (e.g. a recursive async function) the amount of data you would have to memcopy is rather large, but there are probably ways to work around that.
If this is possible, then why isn't it done anywhere?
Yes, an alternative to converting code into state machines is copying stacks around. This is the way that the Go language does it now, and the way that Java will do it when Project Loom is released.
It's not an easy thing to do for real-world languages.
It doesn't work for C and C++, for example, because those languages let you make pointers to things on the stack. Those pointers can be used by other threads, so you can't move the stack away, and even if you could, you would have to copy it back into exactly the same place.
For the same reason, it doesn't work when your program calls out to the OS or native code and gets called back in the same thread, because there's a portion of the stack you don't control. In Java, Project Loom's 'virtual threads' will not release the thread as long as there's native code on the stack.
Even in situations where you can move the stack, it requires dedicated support in the runtime environment. The stack can't just be copied into a byte array. It has to be copied off in a representation that allows the garbage collector to recognize all the pointers in it. If C# were to adopt this technique, for example, it would require significant extensions to the common language runtime, whereas implementing state machines can be accomplished entirely within the C# compiler.
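To illustrate what "implementing state machines entirely within the compiler" amounts to, here is a hand-written Python sketch. It is not what the C# compiler actually emits, and all names are hypothetical. The first function is the source-level form, with each `yield` playing the role of an `await`. The class is roughly what a lowering pass would produce: the local that must survive a suspension becomes a field, and an integer records where to resume. No stack frame has to be copied, because nothing that needs to survive lives on the stack.

```python
def fetch_sum_generator():
    """Source-level form: each `yield` is a suspension point (an 'await')."""
    a = yield "request a"
    b = yield "request b"
    return a + b

class FetchSumStateMachine:
    """Hand-lowered form: locals live in fields, `state` is the resume point."""
    def __init__(self):
        self.state = 0
        self.a = None

    def resume(self, value=None):
        if self.state == 0:
            self.state = 1
            return ("suspended", "request a")
        if self.state == 1:
            self.a = value               # local `a` survives as a field
            self.state = 2
            return ("suspended", "request b")
        if self.state == 2:
            self.state = 3
            return ("done", self.a + value)
        raise RuntimeError("already finished")

def drive_machine(machine, replies):
    status, payload = machine.resume()
    for reply in replies:
        status, payload = machine.resume(reply)
    return payload

def drive_generator(gen, replies):
    next(gen)                            # run to the first suspension point
    try:
        for reply in replies:
            gen.send(reply)
    except StopIteration as stop:
        return stop.value                # the generator's return value

if __name__ == "__main__":
    print(drive_machine(FetchSumStateMachine(), [10, 32]))
    print(drive_generator(fetch_sum_generator(), [10, 32]))
```

Both drivers feed the same replies and get the same result, which is the sense in which the two forms are equivalent.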
I would first like to begin by saying that this answer is only meant to serve as a starting point to go in the actual direction of your exploration. This includes various pointers and building up on the work of various other authors
I've checked the under-the-hood implementation for a couple languages, including C# (the compiler generates the state machine from sugar code) and Rust (where the state machine has to be implemented manually from the Future trait), and they all implement Async-Await using state machines
You understood correctly that the Async/Await implementations for C# and Rust use state machines. Let us now look at why those implementations were chosen.
To put the general structure of stack frames in very simple terms, whatever we put inside a stack frame are temporary allocations which are not going to outlive the method which resulted in the addition of that stack frame (including, but not limited to local variables). It also contains the information of the continuation, ie. the address of the code that needs to be executed next (in other words, the control has to return to), within the context of the recently called method. If this is a case of synchronous execution, the methods are executed one after the other. In other words, the caller method is suspended until the called method finishes execution. This, from a stack perspective fits in intuitively. If we are done with the execution of a called method, the control is returned to the caller and the stack frame can be popped off. It is also cheap and efficient from a perspective of the hardware that is running this code as well (hardware is optimised for programming with stacks).
In the case of asynchronous code, the continuation of a method might have to trigger several other methods that might get called from within the continuation of callers. Take a look at this answer, where Eric Lippert outlines the entirety of how the stack works for an asynchronous flow. The problem with asynchronous flow is that the method calls do not exactly form a stack, and trying to handle them like pure stacks may get extremely complicated. As Eric says in the answer, that is why C# uses a graph of heap-allocated tasks and delegates that represents a workflow.
However, if you consider languages like Go, asynchrony is handled in a different way altogether. Go has goroutines, and there is no need for await statements. Each goroutine runs as a lightweight thread with its own stack (8 KB by default in older Go releases, 2 KB since Go 1.4), and synchronization between goroutines is achieved by communicating through channels. These lightweight threads can wait asynchronously for an operation on a channel and suspend themselves. The earlier implementation in Go used the split stacks technique. That implementation had its own problems, as listed out here, and was replaced by contiguous stacks. The article also talks about the newer implementation.
One important thing to note here is that it is not just the complexity of handling continuations between tasks that contributes to the approach chosen to implement Async/Await; other factors, like garbage collection, play a role too. The GC process should be as performant as possible. If we move stacks around, GC becomes inefficient because accessing an object would then require thread synchronization.
Could you not implement Async-Await by simply memcopying the stack frames of async calls to/from heap?
In short, you can. As this answer states here, Chicken Scheme uses something similar to what you are exploring. It begins by allocating everything on the stack and moves the stack values to the heap when the stack becomes too large for the GC activities (Chicken Scheme uses generational GC). However, there are certain caveats with this kind of implementation. Take a look at this FAQ of Chicken Scheme. There is also a lot of academic research in this area (linked in the answer referred to at the beginning of the paragraph, which I shall summarise under further reading) that you may want to look at.
Further Reading
Continuation Passing Style
The classic SICP book
This answer (contains few links to academic research in this area)
The decision of which approach to take depends on factors that affect the overall usability and performance of the language. State machines, as used in C# and Rust, are not the only way to implement Async/Await functionality. A few languages like Go implement a contiguous stack approach coordinated over channels for asynchronous operations. Chicken Scheme allocates everything on the stack and moves the recent stack values to the heap in case the stack becomes too heavy for its GC algorithm's performance. Moving stacks around has its own set of implications that affect garbage collection negatively. Going through the research done in this space will help you understand the advancements and rationale behind each of the approaches. At the same time, you should also give some thought to how you are planning to design and implement the other parts of your language for it to be anywhere close to usable in terms of performance and overall usability.
PS: Given the length of this answer, will be happy to correct any inconsistencies that may have crept in.
I have been looking into various strategies for doing this myself, because I naturally think I can design a language better than anybody else - same as you. I just want to emphasize that when I say better, I actually mean better as in tastes better for my liking, and not objectively better.
I have come to a few different approaches, and to summarize: It really depends on many other design choices you have made in the language.
It is all about compromises; each approach has advantages and disadvantages.
It feels like the compiler design community are still very focused on garbage collection and minimizing memory waste, and perhaps there is room for some innovation for more lazy and less purist language designers given the vast resources available to modern computers?
How about not having a call stack at all?
It is possible to implement a language without using a call stack.
Pass continuations. The function currently running is responsible for keeping and resuming the state of the caller. Async/await and generators come naturally.
Preallocated static memory addresses for all local variables in all declared functions in the entire program. This approach causes other problems, of course.
If this is your design, then async functions seem trivial.
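As a rough sketch of the "pass continuations" option, here is a factorial written in continuation-passing style in Python (the names `factorial_cps` and `trampoline` are mine). No function ever returns a result to its caller. Instead it hands the result to a continuation `k`, and every step is wrapped in a zero-argument function so that a small driver loop can run the steps one at a time without the call stack growing.

```python
def factorial_cps(n, k):
    """Continuation-passing factorial. Returns a thunk (the next step)
    instead of a value; the final value is delivered to continuation k."""
    if n == 0:
        return lambda: k(1)
    # The new continuation remembers n and, when called with the result r
    # of the smaller problem, schedules the call to the outer continuation.
    return lambda: factorial_cps(n - 1, lambda r: (lambda: k(n * r)))

def trampoline(step):
    # Run one step at a time until something that is not a thunk comes back.
    while callable(step):
        step = step()
    return step

if __name__ == "__main__":
    print(trampoline(factorial_cps(5, lambda result: result)))
```

Because the whole computation is just "the next step to run", the driver loop could stop between any two steps and resume later, which is why async/await and generators come naturally in such a design. The cost is that every pending step is a heap-allocated closure.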
Tree shaped stack
With a tree shaped stack, you can keep all stack frames until the function is completely done. It does not matter if you allow progress on any ancestor stack frame, as long as you let the async frame live on until it is no longer needed.
Linear stack
How about serializing the function state? It seems like a variant of continuations.
Independent stack frames on the heap
Simply treat invocations like you treat other pointers to any value on the heap.
All of the above are trivialized approaches, but one thing they have in common related to your question:
Just find a way to store any locals needed to resume the function. And don't forget to store the program counter in the stack frame as well.

Memory virtualization with R on cluster

I know almost nothing about parallel computing, so this question might be very stupid, and it may be impossible to do what I would like to.
I am using a Linux cluster with 40 nodes; however, since I don't know how to write parallel code in R, I am limited to using only one. On this node I am trying to analyse data which floods the memory (around 64GB). So my problem isn't lack of computational power but rather memory limitation.
My question is whether it is even possible to use some R package (like doSNOW) for implicit parallelisation across 2-3 nodes to increase the RAM limit, or whether I would have to rewrite the script from the ground up to make it explicitly parallelised?
Sorry if my question is naive, any suggestions are welcomed.
I don't think there is such a package. The reason is that it would not make much sense to have one. Memory access is very fast, and accessing data from another computer over the network is very slow compared to that. So if such a package existed it would be almost useless, since the processor would need to wait for data over the network all the time, and this would make the computation very very slow.
This is true for common computing clusters, built from off-the-shelf hardware. If you happen to have a special cluster where remote memory access is fast, and is provided as a service of the operating system, then of course it might be not that bad.
Otherwise, what you need to do is to try to divide up the problem into multiple pieces, manually, and then parallelize, either using R, or another tool.
An alternative to this would be to keep some of the data on the disk, instead of loading all of it into the memory. You still need to (kind of) divide up the problem, to make sure that the part of the data in the memory is used for a reasonable amount of time for computation, before loading another part of the data.
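As a sketch of the keep-it-on-disk idea, here is a Python example (not R) that computes a mean over a file of numbers while holding only one chunk in memory at a time. The file format (one number per line) and the name `chunked_mean` are assumptions made for illustration. Whether your analysis can be expressed as such a running summary depends entirely on the analysis.

```python
from itertools import islice

def chunked_mean(path, chunk_lines=1000):
    """Mean of a file with one number per line, read chunk by chunk.
    Only `chunk_lines` values are held in memory at any time."""
    total, count = 0.0, 0
    with open(path) as handle:
        while True:
            chunk = [float(line) for line in islice(handle, chunk_lines)]
            if not chunk:
                break                    # end of file reached
            total += sum(chunk)          # fold this chunk into the summary
            count += len(chunk)
    if count == 0:
        raise ValueError("file contains no numbers")
    return total / count
```

The chunk size is the knob that trades memory for the number of passes over the disk: each chunk should be large enough that the time spent computing on it is worth the time spent loading it.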
Whether it is worth (or possible at all) doing either of these options, depends completely on your application.
Btw. a good list of high performance computing tools in R is here:
For future inquiry:
You may want to have a look at two packages "snow" and "parallel".
Library "snow" extends the functionality of apply/lapply/sapply... to work on more than one core and/or one node.
Of course, you can perform simple parallel computing using more than one core:
#SBATCH --cpus-per-task= (enter some number here)
You can also perform parallel computing using more than one node (preferably with the previously mentioned libraries) using:
#SBATCH --ntasks-per-node= (enter some number here)
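Requesting cores from the scheduler does not by itself make a script use them; the script has to start that many workers. A minimal sketch, in Python for illustration, of reading the allocation from inside a job: Slurm exports `SLURM_CPUS_PER_TASK` when `--cpus-per-task` is given, and the helper name `worker_count` is my own. The same idea applies in R by reading the variable with `Sys.getenv` before creating a cluster.

```python
import os

def worker_count(environ=os.environ):
    """Number of workers to start: the Slurm allocation if present,
    otherwise the number of CPUs the machine reports."""
    value = environ.get("SLURM_CPUS_PER_TASK", "")
    if value.isdigit() and int(value) > 0:
        return int(value)
    return os.cpu_count() or 1

if __name__ == "__main__":
    print("starting", worker_count(), "workers")
```

Sizing the pool from the allocation avoids starting more workers than the scheduler granted, which on a shared cluster would slow down your own job and possibly other people's.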
However, for several reasons, you may want to consider using Python instead of R, where parallelism can be much more efficient using "Dask" workers.
You might want to take a look at TidalScale, which can allow you to aggregate nodes on your cluster to run a single instance of Linux with the collective resources of the underlying nodes. www.tidalscale.com. Though the R application may be inherently single threaded, you'll be able to provide your R application with a single, simple coherent memory space across the nodes that will be transparent to your application.
Good luck with your project!

MPI vs openMP for a shared memory

Let's say there is a computer with 4 CPUs each having 2 cores, so 8 cores in total. With my limited understanding I think that all processors share the same memory in this case. Now, is it better to directly use OpenMP, or to use MPI to make it general so that the code could work in both distributed and shared settings? Also, if I use MPI for a shared setting, would performance decrease compared with OpenMP?
Whether you need or want MPI or OpenMP (or both) depends heavily on the type of application you are running, and on whether your problem is mostly memory-bound or CPU-bound (or both). Furthermore, it depends on the type of hardware you are running on. A few examples:
Example 1
You need parallelization because you are running out of memory, e.g. you have a simulation and the problem size is so large that your data does not fit into the memory of a single node anymore. However, the operations you perform on the data are rather fast, so you do not need more computational power.
In this case you probably want to use MPI and start one MPI process on each node, thereby making maximum use of the available memory while limiting communication to the bare minimum.
Example 2
You usually have small datasets and only want to speed up your application, which is computationally heavy. Also, you do not want to spend much time thinking about parallelization, but more about your algorithms in general.
In this case OpenMP is your first choice. You only need to add a few statements here and there (e.g. in front of your for loops that you want to accelerate), and if your program is not too complex, OpenMP will do the rest for you automatically.
Example 3
You want it all. You need more memory, i.e. more computing nodes, but you also want to speed up your calculations as much as possible, i.e. running on more than one core per node.
Now your hardware comes into play. From my personal experience, if you have only a few cores per node (4-8), the performance penalty created by the general overhead of using OpenMP (i.e. starting up the OpenMP threads etc.) is more than the overhead of processor-internal MPI communication (i.e. sending MPI messages between processes that actually share memory and would not need MPI to communicate).
However, if you are working on a machine with more cores per node (16+), it will become necessary to use a hybrid approach, i.e. parallelizing with MPI and OpenMP at the same time. In this case, hybrid parallelization will be necessary to make full use of your computational resources, but it is also the most difficult to code and to maintain.
If you have a problem that is small enough to be run on just one node, use OpenMP. If you know that you need more than one node (and thus definitely need MPI), but you favor code readability/effort over performance, use only MPI. If using MPI only does not give you the speedup you would like/require, you have to do it all and go hybrid.
To your second question (in case that did not become clear):
If your setup is such that you do not need MPI at all (because you will always run on only one node), use OpenMP as it will be faster. But if you know that you need MPI anyway, I would start with that and only add OpenMP later, when you know that you've exhausted all reasonable optimization options for MPI.
With most distributed memory platforms nowadays consisting of SMP or NUMA nodes it just makes no sense to not use OpenMP. OpenMP and MPI can perfectly work together; OpenMP feeds the cores on each node and MPI communicates between the nodes. This is called hybrid programming. It was considered exotic 10 years ago but now it is becoming mainstream in High Performance Computing.
As for the question itself, the right answer, given the information provided, has always been one and the same: IT DEPENDS.
For use on a single shared memory machine like that, I'd recommend OpenMP. It makes some aspects of the problem simpler and might be faster.
If you ever plan to move to a distributed memory machine, then use MPI. It'll save you solving the same problem twice.
The reason I say OpenMP only "might" be faster is that a good implementation of MPI could be clever enough to spot that it's being used in a shared memory environment and optimise its behaviour accordingly.
Just for a bigger picture, hybrid programming has become popular because OpenMP benefits from cache topology, by using the same address space. As MPI might have the same data replicated over the memory (because processes can't share data), it might suffer from cache cancelation.
On the other hand, if you partition your data correctly, and each processor has a private cache, it might come to a point where your problem fits completely in cache. In this case you have super linear speedups.
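The arithmetic behind that last point is simple enough to write down. A small Python sketch with made-up sizes (a 2 MiB working set and a hypothetical 256 KiB private cache per core; real values depend on the CPU):

```python
def per_core_working_set(total_bytes, n_cores):
    # Ceiling division: the largest slice any one core has to hold.
    return -(-total_bytes // n_cores)

def fits_in_private_cache(total_bytes, n_cores, cache_bytes_per_core):
    return per_core_working_set(total_bytes, n_cores) <= cache_bytes_per_core

if __name__ == "__main__":
    data = 2 * 2**20          # hypothetical 2 MiB problem
    cache = 256 * 2**10       # hypothetical 256 KiB private cache per core
    for cores in (1, 2, 4, 8):
        print(cores, fits_in_private_cache(data, cores, cache))
```

With these numbers the slice per core only fits once the data is split eight ways. From that point each core works out of its own cache instead of main memory, so the speedup over the single-core run can exceed the number of cores.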
Speaking of cache, recent processors have very different cache topologies, so, as always: IT DEPENDS...

Best Practices for cache locality in Multicore Parallelism in F#

I'm studying multicore parallelism in F#. I have to admit that immutability really helps to write correct parallel implementations. However, it's hard to achieve good speedup and good scalability when the number of cores grows. For example, my experience with the Quick Sort algorithm is that many attempts to implement parallel Quick Sort in a purely functional way, using List or Array as the representation, have failed. Profiling those implementations shows that the number of cache misses increases significantly compared to those of sequential versions. However, if one implements parallel Quick Sort using mutation inside arrays, a good speedup could be obtained. Therefore, I think mutation might be a good practice for optimizing multicore parallelism.
I believe that cache locality is a big obstacle for multicore parallelism in a functional language. Functional programming involves creating many short-lived objects; destruction of those objects may destroy the coherence property of CPU caches. I have seen many suggestions on how to improve cache locality in imperative languages, for example, here and here. But it's not clear to me how they would be done in functional programming, especially with recursive data structures such as trees, etc., which appear quite often.
Are there any techniques to improve cache locality in an impure functional language (specifically F#)? Any advice or code examples are more than welcome.
As far as I can make out, the key to cache locality (multithreaded or otherwise) is
Keep work units in a contiguous block of RAM that will fit into the cache
To this end:
Avoid objects where possible
Objects are allocated on the heap, and might be sprayed all over the place, depending on heap fragmentation, etc.
You have essentially zero control over the memory placement of objects, to the extent that the GC might move them at any time.
Use arrays. Arrays are interpreted by most compilers as a contiguous block of memory.
Other collection datatypes might distribute things all over the place - linked lists, for example, are composed of pointers.
Use arrays of primitive types. Object types are allocated on the heap, so an array of objects is just an array of pointers to objects that may be distributed all over the heap.
Use arrays of structs, if you can't use primitives. Structs have their fields arranged sequentially in memory, and are treated as primitives by the .NET compilers.
Work out the size of the cache on the machine you'll be executing it on
CPUs have different size L2 caches
It might be prudent to design your code to scale with different cache sizes
Or more simply, write code that will fit inside the lowest common cache size your code will be running on
Work out what needs to sit close to each datum
In practice, you're not going to fit your whole working set into the L2 cache
Examine (or redesign) your algorithms so that the data structures you are using hold data that's needed "next" close to data that was previously needed.
In practice this means that you may end up using data structures that are not theoretically perfect examples of computer science - but that's all right, computers aren't theoretically perfect examples of computer science either.
A good academic paper on the subject is Cache-Efficient String Sorting Using Copying
Allowing mutability within functions in F# is a blessing, but it should only be used when optimizing code. Purely-functional style often yields more intuitive implementation, and hence is preferred.
Here's what a quick search returned: Parallel Quicksort in Haskell. Let's keep the discussion about performance focused on performance. Choose a processor, then bench it with a specific algorithm.
To answer your question without specifics, I'd say that Clojure's approach to implementing STM could be a lesson, in the general case, on how to decouple paths of execution on multicore processors and improve cache locality. But it's only effective when the number of reads outweighs the number of writes.
I am no parallelism expert, but here is my advice anyway.
I would expect that a locally mutable approach where each core is allocated an area of memory which is both read and written will always beat a pure approach.
Try to formulate your algorithm so that it works sequentially on a contiguous area of memory. This means that if you are working with graphs, it may be worth "flattening" nodes into arrays and replace references by indices before processing. Regardless of cache locality issues, this is always a good optimisation technique in .NET, as it helps keep garbage collection out of the way.
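Here is a sketch of the "flatten nodes into arrays and replace references by indices" idea. It is written in Python rather than F#, with the standard `array` module standing in for .NET arrays of primitives, and the `Node` class and function names are made up for illustration. The tree is laid out in preorder in three parallel arrays, and an index of -1 means "no child".

```python
from array import array

class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def flatten(root):
    """Lay a binary tree out in preorder as (values, left, right) arrays.
    Child references become indices into the same arrays; -1 means none."""
    values, left, right = array("d"), array("l"), array("l")

    def visit(node):
        index = len(values)
        values.append(node.value)
        left.append(-1)
        right.append(-1)
        if node.left is not None:
            left[index] = visit(node.left)
        if node.right is not None:
            right[index] = visit(node.right)
        return index

    visit(root)
    return values, left, right

def subtree_sum(values, left, right, root=0):
    # Walk the flat layout with an explicit stack of indices, no objects.
    total, pending = 0.0, [root]
    while pending:
        index = pending.pop()
        if index == -1:
            continue
        total += values[index]
        pending.append(left[index])
        pending.append(right[index])
    return total
```

After flattening, a whole-tree operation such as a sum can also just scan `values` from start to end, which is the sequential, contiguous access pattern recommended above.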
A great approach is to split the work into smaller sections and iterate over each section on each core.
One option I would start with is to look for cache locality improvements on a single core before going parallel; after that, it should simply be a matter of subdividing the work again for each core. For example, if you are doing matrix calculations with large matrices, then you could split up the calculations into smaller sections.
Here's a great example of that: Cache Locality For Performance
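To show what "splitting the calculation into smaller sections" looks like for matrices, here is a blocked (tiled) matrix multiply sketched in Python. It is not taken from the linked article, and the names are mine. Python lists of lists are not contiguous in memory, so this shows only the loop structure; the cache benefit itself would appear with flat arrays in F#, C#, or C. The block size is a tuning parameter you would pick so that the blocks being worked on fit in cache.

```python
def matmul_naive(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def matmul_blocked(a, b, block=2):
    """Same result as matmul_naive, but the work is done one
    block-by-block tile at a time so each tile is reused while it is hot."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0] * cols for _ in range(rows)]
    for i0 in range(0, rows, block):
        for k0 in range(0, inner, block):
            for j0 in range(0, cols, block):
                # Multiply one tile of a by one tile of b into a tile of c.
                for i in range(i0, min(i0 + block, rows)):
                    for k in range(k0, min(k0 + block, inner)):
                        a_ik = a[i][k]
                        for j in range(j0, min(j0 + block, cols)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

Once the single-core version is tiled like this, going parallel is mostly a matter of handing different output tiles (different `i0`, `j0` pairs) to different cores, since those tiles are written independently.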
There were some great sections in Tomas Petricek's book Real-World Functional Programming; check out Chapter 14, Writing Parallel Functional Programs, where you might find Parallel processing of a binary tree of particular interest.
To write scalable apps, cache locality is paramount to your application's speed. The principles are well explained in Scott Meyers' talk. Immutability does not play well with cache locality, since you create new objects in memory, which forces the CPU to reload the data from the new object again.
As noted in the talk, even on modern CPUs the L1 data cache is only about 32 KB, and it is typically private to each core rather than shared between cores. If you go multithreaded you should try to consume as little memory as possible (goodbye immutability) to stay in the fastest cache. The larger shared cache (L2 on older CPUs, L3 on newer ones) is typically several MB, which is much bigger but still tiny compared to the data you are trying to sort.
If you manage to write an application which consumes as little memory as possible (data cache locality) you can get speedups of 20 or more. But if you manage this for 1 core it might be very well be that scaling to more cores will hurt performance since all cores are competing for the same L2 cache.
To get the most out of it, the C++ guys use PGO (Profile Guided Optimization), which allows them to profile their application and use the profile as input data for the compiler, so that it emits better optimized code for the specific use case.
You can improve things to a certain extent in managed code, but since so many factors influence your cache locality, it is not likely that you will ever see a speedup of 20 in the real world due to total cache locality. This remains the regime of C++ and compilers which use profiling data.
You may get some ideas from these:
Cache-Oblivious Search Trees Project: http://supertech.csail.mit.edu/cacheObliviousBTree.html
DSpace@MIT, Cache coherence strategies in a many-core processor: http://dspace.mit.edu/handle/1721.1/61276
describes the revolutionary idea of cache oblivious algorithms via the elegant and efficient implementation of a matrix multiply in F#.

Can someone suggest a good way to understand how MPI works?

Can someone suggest a good way to understand how MPI works?
If you are familiar with threads, then you can treat each node as a thread (to an extent).
You send a message (work) to a node and it does some work and then returns you some results.
Similar behaviors between thread & MPI:
Both involve partitioning the work and processing the parts separately.
Both incur overhead as more nodes/threads are involved, but MPI overhead is more significant than thread overhead: passing messages between nodes is expensive, and if the work is not carefully partitioned, you might end up spending more time passing messages than computing.
Different behaviors:
They have different memory models, each MPI node does not share memory with others and does not know anything about the rest of world unless you send something to it.
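The send-work, get-results pattern can be sketched without MPI itself. The following Python example uses operating system processes and pipes from the standard library as a stand-in; it is not MPI code, and the function names are hypothetical. Each worker only uses what arrives over its connection, and the only way for it to report back is to send a message. It assumes a Unix-like system because it explicitly asks for the "fork" start method.

```python
import multiprocessing as mp

def worker(conn):
    # The worker uses only what arrives over its connection.
    chunk = conn.recv()
    conn.send(sum(chunk))                # the result goes back as a message
    conn.close()

def scatter_gather(data, n_workers=2):
    """Split `data`, send one piece to each worker, add up the replies."""
    ctx = mp.get_context("fork")         # assumption: Unix-like OS
    links = []
    for rank in range(n_workers):
        parent_end, child_end = ctx.Pipe()
        proc = ctx.Process(target=worker, args=(child_end,))
        proc.start()
        child_end.close()                # the parent keeps only its own end
        parent_end.send(data[rank::n_workers])   # "scatter" one piece
        links.append((proc, parent_end))
    partial_sums = []
    for proc, parent_end in links:
        partial_sums.append(parent_end.recv())   # "gather" one reply
        proc.join()
    return sum(partial_sums)

if __name__ == "__main__":
    print(scatter_gather(list(range(10))))
```

With threads, the workers could simply have read slices of one shared list. Here every piece of data a worker touches has to be serialized and sent, which is where the extra overhead mentioned above comes from.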
Here you can find some learning materials http://www.mcs.anl.gov/research/projects/mpi/
Parallel programming is one of those subjects that is "intrinsically" complex (as opposed to the "accidental" complexity, as noted by Fred Brooks).
I used Parallel Programming with MPI by Peter Pacheco. This book gives a good overview of the basic MPI topics, available APIs, and common patterns for parallel program construction.
