Why is graph processing difficult to distribute?

Recently I read the paper Scalability! But at what COST?. In this paper, the authors use graph computation as an example, measuring its performance on a single thread against its performance on several distributed frameworks.
In section 2, the authors state that graph computation represents one of the simplest classes of data-parallel computation that is not trivially parallelized. Can anybody tell me what the main barriers to parallelizing graph computation are?

The main barrier is that most graph operations lack commutativity and associativity; these two properties determine whether an algorithm is trivially parallelizable or not. In the paper you linked, the authors state the following:
The updates are commutative and associative, and consequently admit a scalable implementation [7].
Actually, the work cited at [7] is a PhD dissertation, which explains it quite well:
At the core of this dissertation's approach is this scalable commutativity rule: In any situation where several operations commute—meaning there's no way to distinguish their execution order using the interface—they have an implementation that is conflict-free during those operations—meaning no core writes a cache line that was read or written by another core.
Empirically, conflict-free operations scale, so this implementation scales. Or, more concisely, whenever interface operations commute, they can be implemented in a way that scales. This rule makes intuitive sense: when operations commute, their results (return values and effect on system state) are independent of order. Hence, communication between commutative operations is unnecessary, and eliminating it yields a conflict-free implementation. On modern shared-memory multicores, conflict-free operations can execute entirely from per-core caches, so the performance of a conflict-free implementation will scale linearly with the number of cores.
For example, the Cartesian graph product is a commutative and associative operation: the resulting vertices can be computed in any order, which makes parallelization easy in this case. However, most graph operations lack one or both of these properties.
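To see why these two properties make parallelization trivial, consider a plain sum, which is both commutative and associative. A minimal C sketch (the two-chunk split and the chunk boundaries are arbitrary choices for illustration):

#include <stdio.h>

/* Because + is commutative and associative, partial sums over disjoint
   chunks can be combined in any order and still give the same result.
   Each chunk could run on a different core with no communication until
   the final combine step. */
long sum_chunk(const int *a, int lo, int hi) {
    long s = 0;
    for (int i = lo; i < hi; i++) s += a[i];
    return s;
}

int main(void) {
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    long s0 = sum_chunk(a, 0, 4);  /* conceptually: core 0 */
    long s1 = sum_chunk(a, 4, 8);  /* conceptually: core 1 */
    printf("%ld\n", s0 + s1);      /* s0 + s1 == s1 + s0 == 36 */
    return 0;
}

An operation lacking these properties (say, a traversal where each step depends on the previous one) offers no such order-free decomposition.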

Related

Float-related numerical stability issues for parallel reduction

I have been looking at some online resources on floating-point summation and the accuracy issues associated with it.
E.g.:
https://devtalk.nvidia.com/default/topic/1044661/cuda-programming-and-performance/how-to-improve-float-array-summation-precision-and-stability-/
https://hal.archives-ouvertes.fr/hal-00949355v4/document
Most of them recommend some form of manual intervention when handling floating-point summation on modern hardware, e.g. (1) using Kahan's algorithm for float summation, or (2) sorting and summing numbers of similar magnitude together.
Are these kinds of nuances handled by MPI_Allreduce or OpenMP reduction kernels?
Speaking only for OpenMP: the standard says nothing about the order in which reduction operations are applied, and, indeed, that order can even differ between executions of the same code. (Some OpenMP runtimes, such as the LLVM/Intel one, implement a deterministic reduction, but only guarantee determinism between runs that use the same number of threads.)
If you want to sort the values, or perform the reduction in some other way, you will need to implement it yourself; a sketch follows below...
See https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-supported-environment-variables and search for KMP_DETERMINISTIC_REDUCTION for details.
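For reference, here is a minimal C sketch of Kahan's compensated summation, the first of the manual interventions the question mentions. An OpenMP reduction(+:...) clause will not do this for you; you would have to apply something like this inside each thread's partial sum yourself:

/* Kahan (compensated) summation: tracks the low-order bits lost when
   adding a small value to a large running sum. Must be compiled without
   aggressive FP reordering (e.g. gcc's -ffast-math), which would
   optimize the compensation away. */
float kahan_sum(const float *x, int n) {
    float sum = 0.0f;
    float c = 0.0f;              /* running compensation */
    for (int i = 0; i < n; i++) {
        float y = x[i] - c;      /* apply the compensation first */
        float t = sum + y;       /* low-order bits of y are lost here... */
        c = (t - sum) - y;       /* ...and recovered here */
        sum = t;
    }
    return sum;
}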

How much can MPI_Alltoall outperform MPI_Alltoallv?

I wonder what the difference in running time is between the MPI_Alltoallv and MPI_Alltoall functions when the amount of transferred data is approximately the same. I couldn't find any such benchmark results. I am interested in large-scale instances, where tens of thousands or, better, hundreds of thousands of MPI processes are used, and where these processes correspond to a substantial part of a given HPC system (at best a modern one, such as BG/Q, Cray XC30, Cray XE6, ...).
Overview
One of the big advantages of MPI_Alltoall is that protocol decisions can be made quickly, because they depend on a handful of scalars. In contrast, if a library implementer wants to optimize MPI_Alltoallv, they have to scan four vectors to determine whether, for example, the communication is nearly homogeneous, highly sparse, or follows some other pattern.
The other issue is that MPI_Alltoall can easily use the output buffer as scratch space, because every process provides and consumes the same amount of data. For MPI_Alltoallv it's not practical to do all the bookkeeping, so any scratch space has to be allocated. I can't remember the specifics of this issue, but I think I've read it somewhere in the MPI canon.
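The two prototypes (as in the MPI-3 standard) make the scalar-versus-vector point concrete:

#include <mpi.h>

/* MPI_Alltoall: one send count and one recv count for every rank, so
   the implementation can choose an algorithm from a handful of scalars. */
int MPI_Alltoall(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype,
                 MPI_Comm comm);

/* MPI_Alltoallv: per-rank counts and displacements; detecting structure
   (homogeneous, sparse, ...) means scanning four vectors of length nproc. */
int MPI_Alltoallv(const void *sendbuf, const int sendcounts[],
                  const int sdispls[], MPI_Datatype sendtype,
                  void *recvbuf, const int recvcounts[],
                  const int rdispls[], MPI_Datatype recvtype,
                  MPI_Comm comm);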
Implementation Skeletons
There are at least two special cases of alltoallv for which one can optimize better than the MPI library can:
Nearly homogeneous communication, i.e. the count vectors are nearly constant. This can happen when you have a distributed array that doesn't divide evenly across the process grid. In this case, you can:
Pad your arrays and use MPI_Alltoall directly.
Use MPI_Alltoall for the subset of processes whose communication is homogeneous, and either MPI_Alltoallv or a batch of Send-Recv for the remainder. This works best if you can cache the associated communicators. Using nonblocking communication should help too.
Write your own implementation of Bruck's algorithm that handles the cases where the count varies, which is likely at the end of your vector. Having not done this myself, I don't know how difficult or worthwhile it is.
Sparse communication, i.e. the count vector contains a large number of zeros. For this case, just use a batch of nonblocking Send-Recv and Waitall, because that's likely the best the MPI library will ever do, and doing it yourself allows you to tune the batch size if you want (a sketch follows this list).
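A minimal sketch of the sparse case, assuming an MPI-3 library and count/displacement vectors shaped exactly as they would be passed to MPI_Alltoallv; the batching logic (capping the number of requests in flight) is omitted for brevity:

#include <mpi.h>
#include <stdlib.h>

/* Sparse all-to-all: post nonblocking operations only for the nonzero
   entries of the count vectors, then wait on the whole batch. */
void sparse_alltoallv(const char *sendbuf, const int sendcounts[],
                      const int sdispls[],
                      char *recvbuf, const int recvcounts[],
                      const int rdispls[], MPI_Comm comm)
{
    int nproc;
    MPI_Comm_size(comm, &nproc);

    MPI_Request *reqs = malloc(2 * nproc * sizeof(MPI_Request));
    int nreq = 0;

    for (int r = 0; r < nproc; r++)        /* post receives first */
        if (recvcounts[r] > 0)
            MPI_Irecv(recvbuf + rdispls[r], recvcounts[r], MPI_CHAR,
                      r, 0, comm, &reqs[nreq++]);

    for (int r = 0; r < nproc; r++)
        if (sendcounts[r] > 0)
            MPI_Isend(sendbuf + sdispls[r], sendcounts[r], MPI_CHAR,
                      r, 0, comm, &reqs[nreq++]);

    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}

For the nearly homogeneous case, the padding option amounts to copying each message into a buffer of nproc * maxcount elements and making a single MPI_Alltoall call with count maxcount.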
Papers
MPI on a Million Processors describes the scalability issues associated with vector collectives. Granted, you may not see the cost of scanning the vector arguments on most CPUs, but it is an O(n) problem that motivates implementers not to touch the vector arguments more than necessary.
HykSort: a new variant of hypercube quicksort on distributed memory architectures describes a custom implementation that performs much better than optimized libraries. Such an optimization is difficult to implement inside an MPI library, because it is rather specialized. (This reference is targeted at Hristo's comment, not your question, by the way.)
Code
You can discover some interesting things by comparing the implementations of these operations in MPICH (https://github.com/pmodels/mpich/blob/main/src/mpi/coll/alltoall.c and https://github.com/pmodels/mpich/blob/main/src/mpi/coll/alltoallv.c). Only MPI_Alltoall uses Bruck's algorithm and pairwise exchange. Similar conclusions can be drawn from the available options for I_MPI_ADJUST_ALLTOALL and I_MPI_ADJUST_ALLTOALLV on https://software.intel.com/en-us/node/528906. Whether these limitations are fundamental or merely practical is left as an exercise for the reader.
Practical Experience
MPI_Alltoall on Blue Gene/P used DCMF_Alltoallv (source code), so there was no difference relative to MPI_Alltoallv, and the latter might even have been better, since the application pre-populated the vector arguments.
I wrote a version of all-to-all exchange for Blue Gene/Q that was as fast as MPI_Alltoall. My version was agnostic to constant versus vector arguments, so this result implies that MPI_Alltoallv would perform similarly to MPI_Alltoall. However, I can't find the code now and so can't be absolutely sure of the details.
However, Blue Gene networks were rather special, particularly w.r.t. all-to-all, so the behavior on fat-tree or dragonfly networks, on systems where the CPU is much faster than the network, will be quite different.
I suggest you write a benchmark and measure it where you intend to run your application. Once you have some data, it will be much easier to figure out what optimizations are being missed.

Best Practices for cache locality in Multicore Parallelism in F#

I'm studying multicore parallelism in F#. I have to admit that immutability really helps with writing correct parallel implementations. However, it's hard to achieve good speedup and good scalability as the number of cores grows. For example, my experience with the Quick Sort algorithm is that many attempts to implement parallel Quick Sort in a purely functional way, using List or Array as the representation, have failed. Profiling those implementations shows that the number of cache misses increases significantly compared to the sequential versions. However, if one implements parallel Quick Sort using mutation inside arrays, a good speedup can be obtained. Therefore, I think mutation might be a good practice for optimizing multicore parallelism.
I believe that cache locality is a big obstacle to multicore parallelism in a functional language. Functional programming involves creating many short-lived objects, and the destruction of those objects can destroy the coherence of CPU caches. I have seen many suggestions on how to improve cache locality in imperative languages, for example here and here. But it's not clear to me how this would be done in functional programming, especially with recursive data structures such as trees, which appear quite often.
Are there any techniques to improve cache locality in an impure functional language (specifically F#)? Any advice or code examples are more than welcome.
As far as I can make out, the key to cache locality (multithreaded or otherwise) is
Keep work units in a contiguous block of RAM that will fit into the cache
To this end:
Avoid objects where possible
Objects are allocated on the heap, and might be sprayed all over the place, depending on heap fragmentation, etc.
You have essentially zero control over the memory placement of objects, to the extent that the GC might move them at any time.
Use arrays. Arrays are interpreted by most compilers as a contiguous block of memory.
Other collection datatypes might distribute things all over the place - linked lists, for example, are composed of pointers.
Use arrays of primitive types. Object types are allocated on the heap, so an array of objects is just an array of pointers to objects that may be distributed all over the heap.
Use arrays of structs, if you can't use primitives. Structs have their fields arranged sequentially in memory, and are treated as primitives by the .NET compilers.
Work out the size of the cache on the machine you'll be executing it on
CPUs have different size L2 caches
It might be prudent to design your code to scale with different cache sizes
Or, more simply, write code that will fit inside the smallest cache it will be running on
Work out what needs to sit close to each datum
In practice, you're not going to fit your whole working set into the L2 cache
Examine (or redesign) your algorithms so that the data structures you are using hold data that's needed "next" close to data that was previously needed.
In practice this means that you may end up using data structures that are not theoretically perfect examples of computer science - but that's all right, computers aren't theoretically perfect examples of computer science either.
A good academic paper on the subject is Cache-Efficient String Sorting Using Copying
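To make the "use arrays of structs" advice concrete, here is a small C sketch; the same layout argument applies to .NET struct arrays in F#. The Particle type and its fields are just illustrative:

#include <stddef.h>

/* A struct array keeps every field of every element in one contiguous
   block, so a linear sweep touches memory in cache-line order and the
   hardware prefetcher can keep up. An array of references to
   heap-allocated objects would scatter these accesses instead. */
typedef struct {
    float x, y, z;     /* position */
    float vx, vy, vz;  /* velocity */
} Particle;

void step(Particle *p, size_t n, float dt) {
    for (size_t i = 0; i < n; i++) {  /* sequential, cache-friendly */
        p[i].x += p[i].vx * dt;
        p[i].y += p[i].vy * dt;
        p[i].z += p[i].vz * dt;
    }
}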
Allowing mutability within functions in F# is a blessing, but it should only be used when optimizing code. A purely functional style often yields a more intuitive implementation, and hence is preferred.
Here's what a quick search returned: Parallel Quicksort in Haskell. Let's keep the discussion focused on performance: choose a processor, then benchmark it with a specific algorithm.
To answer your question without specifics, I'd say that Clojure's approach to implementing STM could be a lesson in the general case on how to decouple paths of execution on multicore processors and improve cache locality. But it's only effective when the number of reads outweighs the number of writes.
I am no parallelism expert, but here is my advice anyway.
I would expect that a locally mutable approach where each core is allocated an area of memory which is both read and written will always beat a pure approach.
Try to formulate your algorithm so that it works sequentially on a contiguous area of memory. This means that if you are working with graphs, it may be worth "flattening" nodes into arrays and replacing references with indices before processing (a sketch follows below). Regardless of cache locality issues, this is always a good optimisation technique in .NET, as it helps keep garbage collection out of the way.
A great approach is to split the work into smaller sections and iterate over each section on each core.
One option I would start with is to look for cache locality improvements on a single core before going parallel; then it should simply be a matter of subdividing the work again for each core. For example, if you are doing calculations with large matrices, you could split the calculations into smaller sections.
Here's a great example of that: Cache Locality For Performance
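Here is a hedged C sketch of the flattening idea mentioned above, using a hypothetical binary-tree layout: children are identified by indices into one contiguous array rather than by object references, so a traversal stays inside one compact block (and, in .NET, creates no GC pressure):

#include <stdio.h>

/* A binary tree flattened into one contiguous array. Children are
   indices rather than pointers; -1 marks a missing child. The whole
   tree lives in one block of memory, which is friendlier to the cache
   than pointer-linked, individually allocated nodes. */
typedef struct {
    int value;
    int left;   /* index of left child, or -1 */
    int right;  /* index of right child, or -1 */
} Node;

int tree_sum(const Node *nodes, int root) {
    if (root < 0) return 0;
    return nodes[root].value
         + tree_sum(nodes, nodes[root].left)
         + tree_sum(nodes, nodes[root].right);
}

int main(void) {
    /* The tree 2 <- 1 -> 3, laid out flat in index order: */
    Node nodes[3] = { {1, 1, 2}, {2, -1, -1}, {3, -1, -1} };
    printf("%d\n", tree_sum(nodes, 0));  /* prints 6 */
    return 0;
}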
There are some great sections in Tomas Petricek's book Real-World Functional Programming; check out Chapter 14, Writing Parallel Functional Programs. You might find "Parallel processing of a binary tree" of particular interest.
To write scalable apps, cache locality is paramount to your application's speed. The principles are well explained in Scott Meyers' talk. Immutability does not play well with cache locality, since you create new objects in memory, which forces the CPU to reload the data from the new object.
As noted in the talk, even on modern CPUs the L1 cache is only around 32 KB per core, split between code and data. If you go multithreaded, you should try to consume as little memory as possible (goodbye immutability) so as to stay in the fastest cache. The L2 cache is around 4-8 MB, which is much bigger but still tiny compared to the data you are trying to sort.
If you manage to write an application that consumes as little memory as possible (data cache locality), you can get speedups of 20× or more. But even if you manage this for one core, it may very well be that scaling to more cores will hurt performance, since all cores are competing for the same L2 cache.
To get the most out of it, the C++ guys use PGO (Profile-Guided Optimization), which lets them profile their application and feed the profile back to the compiler as input for emitting better-optimized code for the specific use case.
You can get some of this benefit in managed code, but since so many factors influence your cache locality, it is not likely that you will ever see a speedup of 20× in the real world from cache locality alone. This remains the regime of C++ and of compilers that use profiling data.
You may get some ideas from these:
Cache-Oblivious Search Trees Project: http://supertech.csail.mit.edu/cacheObliviousBTree.html
DSpace@MIT, Cache coherence strategies in a many-core processor: http://dspace.mit.edu/handle/1721.1/61276
There is also material that describes the idea of cache-oblivious algorithms via an elegant and efficient implementation of matrix multiply in F#.
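To give a flavor of the cache-oblivious idea, here is a hedged C sketch of a divide-and-conquer matrix multiply; for brevity it assumes square, row-major matrices whose dimension n is a power of two. The recursion keeps halving the problem until some block fits in some cache level, without ever knowing the cache sizes:

/* Cache-oblivious C += A * B for n x n row-major blocks with row
   stride ld. Once blocks fit in a cache level, all three operands stay
   resident, whatever that cache's size happens to be. */
#define BASE 32  /* below this size, fall through to the naive loops */

void mm(const double *A, const double *B, double *C, int n, int ld) {
    if (n <= BASE) {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
        return;
    }
    int h = n / 2;  /* split each matrix into four h x h quadrants */
    const double *A11 = A,        *A12 = A + h,
                 *A21 = A + h*ld, *A22 = A + h*ld + h;
    const double *B11 = B,        *B12 = B + h,
                 *B21 = B + h*ld, *B22 = B + h*ld + h;
    double       *C11 = C,        *C12 = C + h,
                 *C21 = C + h*ld, *C22 = C + h*ld + h;
    /* Eight recursive block products: C11 += A11*B11 + A12*B21, etc. */
    mm(A11, B11, C11, h, ld); mm(A12, B21, C11, h, ld);
    mm(A11, B12, C12, h, ld); mm(A12, B22, C12, h, ld);
    mm(A21, B11, C21, h, ld); mm(A22, B21, C21, h, ld);
    mm(A21, B12, C22, h, ld); mm(A22, B22, C22, h, ld);
}

Call it as mm(A, B, C, n, n) with C zero-initialized.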

Big data structures in functional programming

I'm a newbie in functional programming.
I have a huge neural network with thousands of neurons, and every connection between neurons has a weight. I have to update these weights very often, several thousand times per learning session.
Is FP still applicable here? I mean, in FP we can't modify variables; we can only return new values, not change previous ones. Does this mean I have to recreate the whole network on every weight update?
Is FP still applicable here?
You can certainly write this in a functional style with decent asymptotic algorithmic efficiency, but you are unlikely to get within 10× of the performance of a decent imperative solution, because purely functional programming makes it difficult to use CPU caches effectively.
I mean, in FP we can't modify variables; we can only return new values, not change previous ones. Does this mean I have to recreate the whole network on every weight update?
No, for two reasons:
Purely functional data structures can be updated efficiently because they decompose large structures (e.g. a hash table) into many small, recursively-defined structures (e.g. a balanced binary tree). When you update a single node within an immutable tree, you copy data from every node on the path from the root to the destination, but refer back to all other branches by reference, safe in the knowledge that they cannot be changed under you because they are immutable. So you only do O(log n) work instead of O(n) work.
Purely functional data structures usually offer functions like map that allow every element to be updated in the same way and avoid rebalancing by copying the structure of the source tree. So the time for n updates is O(n) instead of O(n log n).
So you should be able to achieve similar or even equal asymptotic time complexity, but, in absolute terms, you will be using several times as much space and time as an imperative solution. I described these basics in detail in my book Visual F# 2010 for Technical Computing, and I wrote the article Artificial Intelligence: Neural Networks (8th May 2010) for the OCaml Journal.
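To make the path-copying point concrete, here is a minimal C sketch of an immutable binary search tree (unbalanced, with memory reclamation omitted for brevity): insert returns a new root, allocates only the O(depth) nodes along the search path, and shares every untouched subtree with the old version:

#include <stdlib.h>

/* Path copying: nodes are never mutated, so an updated tree can safely
   point back into the old one for every subtree off the search path. */
typedef struct Node {
    int key;
    const struct Node *left, *right;
} Node;

static const Node *mk(int key, const Node *l, const Node *r) {
    Node *n = malloc(sizeof *n);
    n->key = key; n->left = l; n->right = r;
    return n;
}

const Node *insert(const Node *t, int key) {
    if (!t) return mk(key, NULL, NULL);
    if (key < t->key)   /* copy this node, share the right subtree */
        return mk(t->key, insert(t->left, key), t->right);
    if (key > t->key)   /* copy this node, share the left subtree */
        return mk(t->key, t->left, insert(t->right, key));
    return t;           /* key already present: share everything */
}

Both the old root and the new root remain valid after every insert, which is exactly what makes the update cheap: only the path is copied, never the whole structure.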
Look into Haskell arrays, which include mutable variants in a monad.
You should not need to recreate the entire network every time a weight is updated. Presumably your neurons are modeled as individual objects; this means that to "update" an individual neuron, you actually create a new neuron with the updated weight. This neuron is then inserted into the network in place of the old one, which in turn becomes free for reclamation by the garbage collector.
I do not agree with the idea of using mutable state. Functional language implementations know they are functional, so they make optimizations for functional programming. If a functional language really is the best tool for the job, then take advantage of its benefits.
If you structure your data in such a way that you can use a persistent data structure to model your neural network, functional updates to the neural network will be cheap (at least compared to copying the whole thing).
If that is still not fast enough, your language may allow other techniques (such as careful use of mutation) to speed things up; for example, if you were using Clojure, you could use transients to gain some additional speed.

query language for graph sets: data modeling question

Suppose I have a set of directed graphs. I need to query those graphs. I would like to get a feeling for my best choice for the graph modeling task. So far I have these options, but please don't hesitate to suggest others:
Proprietary implementation (adjacency matrix) and graph traversal algorithms.
RDBMS and SQL (too space-consuming).
RDF and SPARQL (too slow).
What would you guys suggest? Regards.
EDIT: Just to answer Mad's questions:
Each one is relatively small, no more than 200 vertices, 400 edges. However, there are hundreds of them.
Frequency of querying: hard to say, it's an experimental system.
Speed: not real time, but practical, say 4-5 seconds tops.
You didn't give us enough information to respond with a well-thought-out answer. For example: what size are these graphs? How frequently do you expect to query them? Do you need real-time responses to these queries? More information on what your application is for, and what your purpose is, would be helpful.
Anyway, to counter the usual responses that assume SQL-based DBMSes are unable to handle graph structures effectively, I will give some references:
Graph Transformation in Relational Databases (.pdf), by G. Varro, K. Friedl, and D. Varro, presented at the International Workshop on Graph-Based Tools (GraBaTs) 2004:
5 Conclusion and Future Work
In the paper, we proposed a new graph transformation engine based on off-the-shelf relational databases. After sketching the main concepts of our approach, we carried out several test cases to evaluate our prototype implementation by comparing it to the transformation engines of the AGG [5] and PROGRES [18] tools.
The main conclusion that can be drawn from our experiments is that relational databases provide a promising candidate as an implementation framework for graph transformation engines. We call attention to the fact that our promising experimental results were obtained using a worst-case assessment method, i.e. by recalculating the views of the next rule to be applied from scratch, which is still highly inefficient, especially for model transformations with a large number of independent matches of the same rule. ...
They used PostgreSQL as the DBMS, which is probably not particularly good for this kind of application. You can try LucidDB and see if it does better, as I suspect it will.
Incremental SQL Queries (more than one paper here; you should concentrate on "Maintaining Transitive Closure of Graphs in SQL"):
... we showed that transitive closure, alternating paths, same generation, and other recursive queries can be maintained in SQL if some auxiliary relations are allowed. In fact, they can all be maintained using auxiliary relations of arity at most 2. ...
Incremental Maintenance of Shortest Distance and Transitive Closure in First Order Logic and SQL.
Edit: you have given more details, so... I think the best way is to experiment a little with both a dedicated main-memory graph library and a DBMS-based solution, and then carefully evaluate the pros and cons of both.
For example: a DBMS needs to be installed (unless you use an "embeddable" DBMS like SQLite); only you know if/where your application needs to be deployed and who your users are. On the other hand, a DBMS gives you immediate benefits, like persistence (I don't know what support graph libraries give for persisting their graphs), transaction management, and countless others. Are these relevant for your application? Again, only you know.
The first option you mentioned seems best. If your graph won't have many edges (|E| = O(|V|)), then you might get better time and space complexity using a Dictionary:
// map each vertex to the set of its out-neighbours
var graph = new Dictionary<Vertex, HashSet<Vertex>>();
An interesting graph library is QuickGraph. Never used it but it seems promising :)
I wrote and designed quite a few graph algorithms, for various programming contests and in production code. And I noticed that every time I need one, I have to develop it from scratch, assembling concepts from graph theory (BFS, DFS, topological sorting, etc.).
Perhaps a lack of experience is the reason, but it seems to me that there's still no reasonable general-purpose query language for graph problems. Pick a couple of general-purpose graph libraries and solve your particular task in a programming (not query!) language. That will give you the best performance and space consumption, but it will also require an understanding of basic graph theory concepts and their limitations.
And the last one: do not use SQL for graphs.
