Graph partitioning algorithms with the Neo4j graph database

I know there are well-known graph partitioning tools such as METIS, implemented by the Karypis Lab (http://glaros.dtc.umn.edu/gkhome/metis/metis/overview),
but is there any way to partition a graph stored in Neo4j?
Or do I have to dump the Neo4j data and manually transform the nodes and edges to fit the METIS input format?

Regarding new-ish and interesting algorithms, this is by no means exhaustive or state of the art, but these are the first places I would look:
Specific Algorithm: DiDiC (Distributed Diffusive Clustering) - I used it once in my thesis (Partitioning Graph Databases)
- You iterate over all nodes and, for each node, retrieve all of its neighbors in order to spread a portion of some "unit" of load to each of them.
- Easy to implement.
- Can be made deterministic.
- Iterative: because it is based on iterations (like supersteps in Pregel), you can stop it at any time. In theory, the longer you let it run, the better the result (though on certain graph shapes it can be unstable).
- When we implemented this, we ran it for 100 iterations on a machine with ~30GB RAM, on graphs of up to ~4 million nodes. It took no more than two days to complete.
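To make the "spread some unit to your neighbors" idea concrete, here is a minimal sketch of a plain diffusion scheme. It is not DiDiC itself (the real algorithm is more elaborate), just the basic iteration pattern. The function name, the `seeds` parameter and the `alpha` exchange rate are all assumptions of mine, and `alpha` times the largest node degree should stay below 1 or the scheme becomes unstable.

```python
def diffuse_clusters(adjacency, seeds, iterations=20, alpha=0.2):
    """Assign each node to the cluster whose diffused load is largest.

    adjacency: dict mapping node -> list of neighbors (undirected graph).
    seeds: one starting node per cluster (hypothetical initialisation).
    """
    k = len(seeds)
    # load[c][v] = amount of cluster c's "unit" currently held by node v
    load = [{v: 0.0 for v in adjacency} for _ in range(k)]
    for c, seed in enumerate(seeds):
        load[c][seed] = 1.0
    for _ in range(iterations):
        for c in range(k):
            old = load[c]
            # every node exchanges a fraction alpha of the load difference
            # with each of its neighbors
            load[c] = {
                v: old[v] + alpha * sum(old[u] - old[v] for u in neighbors)
                for v, neighbors in adjacency.items()
            }
    # a node joins the cluster it holds the most load for
    return {v: max(range(k), key=lambda c: load[c][v]) for v in adjacency}
```

On two triangles joined by a single edge, seeding one node in each triangle splits the graph along that edge. Because pure diffusion eventually flattens every load to the same value, the number of iterations matters in this simplified form.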
Specific Algorithm: EvoCut ("Finding sparse cuts locally using evolving sets") - local probabilistic algorithm from Microsoft - related to these papers
- Difficult to implement.
- Local algorithm: BFS-like access patterns (random walks).
- It's been a while since I read that paper, but I remember it was built on clean abstractions:
  - EvoNibble (pluggable; decides how much of the neighborhood to add to the current cluster)
  - EvoCut (calls EvoNibble multiple times to find the local cluster)
  - EvoPartition (calls EvoCut repeatedly to partition the entire graph)
- Not deterministic.
General Algorithm Family: Hierarchical Graph Clustering
From a high level:
- Coarsen the graph by collapsing nodes into aggregate nodes (the coarsening strategy is selectable).
- Find clusters in the coarsened, smaller graph (the clustering strategy is selectable).
- Incrementally decoarsen the graph, refining the clustering at each step (the refining strategy is selectable).
Notes:
- If the graph changes slowly (or results don't need to be fully up to date), it may be possible to coarsen once (or infrequently) and then work with the coarsened graph to save computation.
- I don't know of a specific algorithm to recommend.
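As an illustration of the coarsening step only, here is a rough sketch of one possible strategy: greedily match nodes along edges and collapse each matched pair into one aggregate node. The function name and the greedy matching are my own assumptions, not something a particular algorithm prescribes; real implementations usually pick matches more carefully (for example by edge weight).

```python
def coarsen_once(nodes, edges):
    """Collapse greedily matched node pairs.

    Returns (node -> aggregate id, coarse edge -> number of original edges).
    """
    aggregate = {}
    next_id = 0
    for u, v in edges:
        # match two endpoints only if neither has been collapsed yet
        if u != v and u not in aggregate and v not in aggregate:
            aggregate[u] = aggregate[v] = next_id
            next_id += 1
    for n in nodes:
        # unmatched nodes become aggregates of their own
        if n not in aggregate:
            aggregate[n] = next_id
            next_id += 1
    coarse_edges = {}
    for u, v in edges:
        a, b = aggregate[u], aggregate[v]
        if a != b:  # edges inside an aggregate disappear
            key = (min(a, b), max(a, b))
            coarse_edges[key] = coarse_edges.get(key, 0) + 1
    return aggregate, coarse_edges
```

Applying this repeatedly gives the hierarchy of smaller graphs. The returned mapping is what you would use later to project a clustering of the coarse graph back onto the original nodes.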
General limitations - things few clustering algorithms take into account:
- Node types: all nodes are treated equally
- Relationship types: all relationships are treated equally
- Relationship direction: relationships are treated as undirected

Having worked independently with METIS and Neo4j in the past, I am not aware of any tool for generating a METIS file from Neo4j. That being said, writing such a tool should be an easy task and would be a great community contribution.
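For what it's worth, the conversion itself is small. Below is a minimal sketch that assumes you have already pulled the node ids and relationship endpoints out of Neo4j by whatever means you prefer (that export step is not shown, and the function name is made up). It writes the plain unweighted METIS graph format: a header line with the vertex and edge counts, then one line per vertex listing its neighbors with 1-based indices.

```python
def to_metis(node_ids, edges):
    """Return (METIS graph-file text, mapping original id -> 1-based METIS index)."""
    # Neo4j ids are not contiguous, so renumber them 1..n
    index = {node: i for i, node in enumerate(node_ids, start=1)}
    neighbors = {node: set() for node in node_ids}
    for u, v in edges:
        if u != v:  # assumption: self-loops are dropped, METIS does not expect them
            # METIS graphs are undirected: store each edge in both directions
            neighbors[u].add(v)
            neighbors[v].add(u)
    edge_count = sum(len(ns) for ns in neighbors.values()) // 2
    lines = [f"{len(node_ids)} {edge_count}"]
    for node in node_ids:
        ordered = sorted(index[n] for n in neighbors[node])
        lines.append(" ".join(str(i) for i in ordered))
    return "\n".join(lines) + "\n", index
```

Keep the returned index mapping around: the partition file METIS produces has one line per vertex in the same order, so you need it to write the partition ids back onto the Neo4j nodes.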
Another approach for integrating METIS with Neo4j might be to connect METIS to Neo4j from C++ via JNI. However, this is going to be much more involved, as it would have to take care of things like transactions, concurrency, etc.
On the more general question of partitioning graphs, it is quite possible to implement some of the better-known, simpler algorithms with reasonable effort.

Related

Given an unsorted list of edges, how would I detect if they form a cycle (Graphs)

I am given a list of edges (each of which has 'From' and 'To' properties that specify which vertices it connects).
I want to either return null if they don't form a cycle in this (undirected) graph, or return the list of edges forming a cycle.
Does anyone know how I would go about such a problem? I am clueless.
The way I've been taught to do this involves storing a list of visited vertices.
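Navigate through the graph, adding each vertex you reach to the list. At each step, compare the current vertex to the list: if it is already present, you've visited it before and are therefore in a cycle. (In an undirected graph, stepping straight back along the edge you just arrived by does not count.)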
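Here is a minimal sketch of that visited-list idea for an undirected graph. I'm assuming the edges arrive as (from, to) tuples, and the function names are my own. It returns the vertices of one cycle (not necessarily the shortest), or None if there is none.

```python
def _cycle_through(v, w, parent):
    # Walk up from v to the root, then up from w until the two paths meet.
    path_v = []
    x = v
    while x is not None:
        path_v.append(x)
        x = parent[x][0]
    position = {x: i for i, x in enumerate(path_v)}
    path_w = []
    x = w
    while x not in position:
        path_w.append(x)
        x = parent[x][0]
    # v ... meeting point, then back down to w; the edge w-v closes the cycle
    return path_v[: position[x] + 1] + path_w[::-1]


def find_cycle(edges):
    """Return a list of vertices forming a cycle in an undirected graph, or None."""
    adjacency = {}
    for i, (a, b) in enumerate(edges):
        adjacency.setdefault(a, []).append((b, i))
        adjacency.setdefault(b, []).append((a, i))
    parent = {}  # visited vertex -> (previous vertex, index of the edge used)
    for root in adjacency:
        if root in parent:
            continue
        parent[root] = (None, None)
        stack = [root]
        while stack:
            v = stack.pop()
            for w, edge_id in adjacency[v]:
                if edge_id == parent[v][1]:
                    continue  # don't walk back along the edge we arrived by
                if w in parent:
                    return _cycle_through(v, w, parent)  # seen before: cycle
                parent[w] = (v, edge_id)
                stack.append(w)
    return None
```

Tracking the edge index rather than just the parent vertex means two parallel edges between the same pair of vertices are correctly reported as a cycle.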
Algorithms of this type are called Graph Cycle Detection Algorithms. There are some intricacies about which algorithm to select based on the needs of the application or the context of the problem. For example, do you want to find the first cycle, the shortest cycle, the longest cycle or all of the cycles? Is the graph directed or undirected?
There are numerous cycle detection algorithms to select from, depending on the need and the allowable computational complexity and the nature of the cycle (e.g. finding first, longest, etc.). Some common algorithms include the following:
Floyd's Algorithm (for sequences in which each element has a single successor, such as linked lists)
Brent's Algorithm (same setting as Floyd's; see also here)
Tarjan's Strongly Connected Components Algorithm (for directed graphs; see also this Stack Overflow post)
The specific algorithm you select will depend on your need and how efficient the algorithm must be. If serious efficiency is needed, I would suggest looking through some scholarly articles on the topic and comparing the trade-offs of the various algorithms.

Some Common Questions about Neo4j 3.0

1. Does the enterprise version support distributed graph algorithms? Or can Neo4j graph data and graph computation be distributed over cloud infrastructure? And how does it work?
2. If I have a server (16-core CPU, 256G memory, 2TB HDD) and each node or relationship has 1K of data, how many nodes and relationships can the server hold? The ratio of nodes to relationships is 1:5.
If we want to import more data, what should we do?
3. For fast importing we used BatchInserter, but a single Lucene index has a size limit of 2^32, so we can import fewer than 2^32 nodes. What can we do to get around this limit, other than using more indexes?
4. After two days of importing, the import speed is unacceptably slow (200-600 nodes per second). That is only 1% of the initial speed! I can see that the memory is full. What should we do to improve the speed?
It has imported about 0.2B nodes and 0.5B relationships. That's half of my data. And my server has 32GB memory.
Thanks a lot.
You have many questions here that might be better suited as individual questions or asked on the Neo4j slack channel.
I started to write this as a comment, but ran out of chars so I'll try to point you to some resources:
1) Neo4j distributed graph model
I'm not sure exactly what you're asking here. See this document for general information on Neo4j scalability. If you're asking if graph traversals can be distributed across machines in a Neo4j cluster, the answer is no.
2) Hardware sizing
This depends a bit on your access patterns. See the hardware sizing calculator that can take workload / access patterns into account.
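As a very rough upper bound only, you can do the disk arithmetic from the numbers in the question. The sketch below assumes decimal units (2 TB = 2 * 10^12 bytes, 1K = 1000 bytes) and that the disk holds nothing but the 1K payloads. It ignores Neo4j's own store records, indexes and transaction logs, so the real figure will be lower. The function name is made up.

```python
def capacity_estimate(disk_bytes, bytes_per_record, rels_per_node):
    """Upper bound on (nodes, relationships) if the disk held only payload data."""
    records = disk_bytes // bytes_per_record
    # each node comes with rels_per_node relationships, all the same size
    nodes = records // (1 + rels_per_node)
    return nodes, nodes * rels_per_node


nodes, rels = capacity_estimate(2 * 10**12, 1000, 5)
print(nodes, rels)
```

That works out to roughly 0.33 billion nodes and 1.67 billion relationships before any overhead.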
3-4) Import
Can you create a new question and share your code for this? You should be able to achieve much better performance than this.

What is the difference between a node and a vertex?

What is the difference (if any) between a node and a vertex? I can't find the answer after looking at countless sites! Even my book doesn't specify it so I am kind of lost!
It is worth mentioning that I am looking for the difference besides the fact that it is called a 'vertex' when used in a graph and a 'node' when used in a tree.
There are no differences between the words node and vertex. Even some books that explain graph theory and graph algorithms put it this way:
Vertex, denoted by v, is sometimes also called a node
There are neither major nor minor differences between them.
This is mentioned in the book Data Structures and Algorithms with Object-Oriented Design Patterns in C# by Bruno R. Preiss.
In "The Practitioner's Guide to Graph Data", the authors avoid the term "node/nodes" and use only vertex/vertices. They explain it as follows:
...because we are focusing on distributed graphs, and nodes has different meanings in distributed systems, graph theory and computer science.
In distributed systems, a node can be a client, server or peer, while in computer networks it can be a computer or a modem. In computer science, as you already pointed out, it can be used for either graphs or trees.
So in the context of graph theory, node and vertex are used interchangeably. But if you would like to make it clear and avoid any misunderstanding, vertex/vertices is the way to go.
I think the two terminologies come from different perceptions of graphs and networks. Albert-László Barabási writes in his recent textbook:
"In the scientific literature the terms network and graph are used interchangeably:
Network science    Graph theory
Network            Graph
Node               Vertex
Link               Edge
Yet, there is a subtle distinction between the two terminologies: the {network, node, link} combination often refers to real systems: The WWW is a network of web documents linked by URLs; society is a network of individuals linked by family, friendship or professional ties; the metabolic network is the sum of all chemical reactions that take place in a cell. In contrast, we use the terms {graph, vertex, edge} when we discuss the mathematical representation of these networks: We talk about the web graph, the social graph (a term made popular by Facebook), or the metabolic graph. Yet, this distinction is rarely made, so these two terminologies are often synonyms of each other."
<tl;dr> Same, same, but different.
There is no difference between a node and a vertex. Most books use V to represent the set of vertices of a graph. I've seen node mostly associated with trees.
For instance, you may have come across O(V + E) being used to represent the time complexity for depth first search and breadth first search graph traversals.
Similarly, V is used as part of time complexity analysis for other graph algorithms like Prim's, Kruskal's, etc.
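To see where the V + E comes from, here is a small sketch of breadth-first search that counts its own work. The function name and the counters are just for illustration. Each reachable vertex is dequeued once, and each adjacency-list entry is scanned once.

```python
from collections import deque


def bfs_counts(adjacency, start):
    """Run BFS from start; return (vertices dequeued, adjacency entries scanned)."""
    seen = {start}
    queue = deque([start])
    vertices = scanned = 0
    while queue:
        v = queue.popleft()
        vertices += 1            # the V part: each vertex is handled once
        for w in adjacency[v]:
            scanned += 1         # the E part: each list entry is looked at once
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return vertices, scanned
```

For an undirected graph stored as adjacency lists, every edge appears in two lists, so the second counter comes out as 2E, which is still O(E).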

Detecting all cycles in a directed graph with millions of nodes in Ocaml

I have graphs with thousands of nodes to millions of nodes. I want to detect all possible cycles in such graphs.
I use a hash table to store the edges: ( (source node, edge weight) -> (target node) ).
What would be an efficient way of implementing this in OCaml?
It looks like Tarjan's algorithm is the best one.
What would be the most efficient implementation of it?
Yes, Tarjan's algorithm for strongly connected components is a good solution. You may also use so-called path-based strong component algorithms which have (when done carefully) comparable linear complexity.
If you pick reasonable data structures, they should work. It's hard to say much more before you have implemented and profiled a prototype.
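For reference, here is a compact sketch of Tarjan's algorithm, written in Python to keep it short; the structure carries over to OCaml with hash tables and a mutable stack. The function name and the successor-map representation are my assumptions. It is recursive, so for graphs with millions of nodes you would need an explicit stack instead (Python in particular will hit its recursion limit on long paths).

```python
def strongly_connected_components(successors):
    """Tarjan's algorithm; successors maps each node to a list of its out-neighbors."""
    index = {}       # discovery order of each visited node
    lowlink = {}     # smallest index reachable from the node's DFS subtree
    on_stack = set()
    stack = []
    components = []

    def visit(v):
        index[v] = lowlink[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in successors.get(v, ()):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:  # v is the root of a component
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            components.append(component)

    for v in successors:
        if v not in index:
            visit(v)
    return components
```

A node lies on some cycle exactly when its component has more than one node, or when it has an edge to itself. Note that this tells you where the cycles are, not what each individual cycle is.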
I don't understand what your graph representation is: are your hash keys really a (node, weight) pair? Then how do you find all neighbors of a given node? For a large graph structure you should optimize access time, of course, but also memory efficiency.
If you really want to find all possible cycles, the problem seems at least exponential in the worst case. For a complete graph, every nonempty subset of nodes gives you a different cycle (including a link from the last back to the first). Furthermore, every cyclic permutation of every subset gives you a different cycle. Depending on the sparsity of your graphs, the problem could be tractable in practice.
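To put a number on that blow-up, here is a small sketch that counts the simple cycles of a complete directed graph. It assumes there are no self-loops, so only cycles through at least two nodes are counted, and the function name is made up. A subset of k nodes can be arranged into (k - 1)! distinct directed cycles, and there are C(n, k) such subsets.

```python
from math import comb, factorial


def simple_cycles_in_complete_digraph(n):
    """Number of directed simple cycles (length >= 2) in a complete digraph on n nodes."""
    return sum(comb(n, k) * factorial(k - 1) for k in range(2, n + 1))


for n in (3, 4, 10, 20):
    print(n, simple_cycles_in_complete_digraph(n))
```

The count grows faster than exponentially in n, which is why enumerating every cycle is only realistic on sparse graphs.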

Distributed physics simulation help/advice

I'm working in a distributed memory environment. My task is to simulate big 3D objects, modeled as particles tied by springs, by dividing them into smaller pieces, with each piece simulated by a different computer. I'm using a 3rd party physics engine to run the simulation. The problem I am facing is how to transmit the particle information at the extremities where the object is divided. This information is needed to compute the forces between interacting particles. The line in the image shows where the cut has been made. Because the number of particles is big, the communication overhead will be big as well. Is there a good way to transmit such information, or is there a way to transmit another value which helps me determine the information I need? Any help is much appreciated. Thank you.
PS: by particle information I mean the new positions from which to compute a resulting force to be applied to the particles simulated on the local machine
"Big" means lots of things. Here the number of points with data being communicated may be "big" in that it's much more than one, but if you have say a million particles in a lattice, and are dividing it between 4 processors (say) by cutting it into squares, you're only communicating 500 particles across each boundary; big compared to one but very small compared to 1,000,000.
A library very commonly used for these sorts of distributed-memory computations (which is somewhat different than distributed computing, which suggests nodes scattered all over the internet; this sort of computation, involving tightly-coupled elements, is usually best done with a series of nearby computers in a lab or in a cluster) is MPI. This pattern of communication is very common, and is called "halo exchange" or "guardcell exchange" or "ghostzone exchange" or some combination; you should be able to find lots of examples of such things by searching for those terms. (There are a few questions on this site on the topic, but they're typically focussed on very specific implementation questions).
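To show the idea without any MPI, here is a toy sketch of a halo exchange for a 1D chain of particles joined by springs, split into chunks. Everything here is an assumption for illustration: the function names, the spring constant and rest length, and the fact that the "exchange" is a plain list copy. In a real run each chunk would live on its own process and the two boundary positions would travel as messages.

```python
def spring_forces(positions, left_ghost, right_ghost, k=1.0, rest=1.0):
    """Net spring force on each local particle of a 1D chain.

    The ghosts are the neighboring chunks' boundary positions (None at a free end).
    """
    padded = [left_ghost] + list(positions) + [right_ghost]
    forces = []
    for i in range(1, len(padded) - 1):
        force = 0.0
        if padded[i + 1] is not None:  # spring to the particle on the right
            force += k * ((padded[i + 1] - padded[i]) - rest)
        if padded[i - 1] is not None:  # spring to the particle on the left
            force -= k * ((padded[i] - padded[i - 1]) - rest)
        forces.append(force)
    return forces


def halo_exchange(chunks):
    """Return (left_ghost, right_ghost) per chunk: one boundary position from each side."""
    ghosts = []
    for i in range(len(chunks)):
        left = chunks[i - 1][-1] if i > 0 else None
        right = chunks[i + 1][0] if i + 1 < len(chunks) else None
        ghosts.append((left, right))
    return ghosts
```

Each chunk only ever needs the positions of the particles directly across the cut, so the data exchanged per step scales with the size of the boundary, not with the size of the piece. With the ghosts filled in, the per-chunk forces match what you would compute on the undivided chain.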

Resources