Best way to store very large weighted DAG on disk? - bigdata

Assume a graph database used to store a very large DAG on disk:
There are many things that are not required, which allows for optimization.
Basically what I do need is:
store a directed acyclic graph, no cycles, at most one edge per node-pair
fromID, toID, weight (can be INT,INT,FLOAT)
return connected components efficiently and conveniently
return all zero-indegree nodes efficiently and conveniently
return all descendants of a node efficiently and conveniently
manage sizes of up to 100 million nodes, with up to 10 billion edges
modest resource requirements
free / open-source
Do you have experience that allows you to give recommendations?
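To pin down what I mean by those query operations, here is a toy sketch in Python using NetworkX (purely in-memory, so obviously not a candidate at 100 million nodes; the node IDs and weights are made up):

# toy semantics of the three required queries (NetworkX, in-memory only)
import networkx as nx

g = nx.DiGraph()
g.add_weighted_edges_from([(1, 2, 0.5), (2, 3, 1.5), (4, 3, 0.25)])  # fromID, toID, weight

# connected components (ignoring edge direction)
components = list(nx.weakly_connected_components(g))

# all zero-indegree nodes (the "roots" of the DAG)
roots = [n for n, deg in g.in_degree() if deg == 0]

# all descendants of a node
below = nx.descendants(g, 1)

print(components, roots, below)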

Related

Can I have O(1000s) of vertices connecting to a single vertex and O(1000s) of properties off a vertex for Cosmos DB and/or graph databases?

I have a graph with the following pattern:
- Workflow:
-- Step #1
--- Step execution #1
--- Step execution #2
[...]
--- Step execution #n
-- Step #2
--- Step execution #1
--- Step execution #2
[...]
--- Step execution #n
[...]
-- Step #m
--- Step execution #1
--- Step execution #2
[...]
--- Step execution #n
I have a couple of design questions here:
How many execution documents can hang off a single vertex without affecting performance? For example, each "step" could have hundreds of 'executions' off it. I'm using two edges to connect them: 'has_runs' (from step → execution) and 'execution_step' (from execution → step).
Are graph databases (Cosmos DB or any graph database) designed to handle thousands of vertices and edges associated with a single vertex?
Each 'execution' has (theoretically) unlimited properties associated with it, but it is probably 10 < x < 100 properties. Is that OK? Are graph databases made to support such a large number of properties off a vertex?
All the demos I've seen seem to have < 10 total properties.
Is it appropriate to have so many execution documents hanging off a single vertex? E.g. each "step" could have 100s of 'executions' off it.
Having 100s of edges from a single vertex is not atypical and sounds reasonable. In practice, you can easily find yourself with models that have millions of edges and dig yourself into the problem of supernodes at which point you would need to make some design choices to deal with such things based on your expected query patterns.
Each 'execution' has (theoretically) unlimited properties associated with it, but is probably 10 < x < 100 properties. Is that ok? Are graph databases made to support many, many properties off a vertex?
In designing a schema, I think graph modelers tend to think of graph elements (i.e. vertices/edges) as having the ability to hold unlimited properties, but in practice they have to consider the capabilities of the graph system and not assume them all to be the same. Some graphs, like TinkerGraph, will be limited only by available memory. Other graphs, like JanusGraph, will be limited by the underlying data store (e.g. Cassandra, HBase, etc.).
I'm not aware of any graph system that would have trouble storing 100 properties. Of course, there are caveats to all such generalities - a few examples:
100 separate simple primitive properties of integers and Booleans is different than 100 byte arrays each holding 100 megabytes of data.
Storing 100 properties is fine on most systems, but do you intend to index all 100? On some systems that might be an issue. Since you tagged your question with "CosmosDB", I will offer that I don't think they are too worried about that since they auto-index everything.
If any of those 100 properties are multi-properties you could put yourself in a position to create a different sort of supernode - a fat vertex (a vertex with millions of properties).
All that said, generally speaking, your schema sounds reasonable for any graph system out there.
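To make that step/execution modeling concrete, a rough sketch with the TinkerPop gremlin_python client might look like the following (the endpoint, labels, and property values are placeholders, and note that Cosmos DB's Gremlin endpoint has historically required string-based Gremlin rather than this bytecode style):

# rough sketch only; endpoint and property names are placeholders
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')
g = traversal().withRemote(conn)

step = g.addV('step').property('name', 'Step #1').next()
run = g.addV('execution').property('status', 'succeeded').next()

# the two edges described in the question, one in each direction
g.V(step).addE('has_runs').to(__.V(run)).iterate()
g.V(run).addE('execution_step').to(__.V(step)).iterate()

# e.g. count the executions hanging off a step
n_runs = g.V(step).out('has_runs').count().next()

conn.close()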

Some Common Questions about Neo4j 3.0

1. Does the Enterprise version support distributed graph algorithms? Or can Neo4j graph data and graph computation be distributed over cloud infrastructure? How does that work?
2. If I have a server (16-core CPU, 256 GB memory, 2 TB HDD) and each node or relationship holds 1 KB of data, how many nodes and relationships can the server hold? The ratio of nodes to relationships is 1:5.
If we want to import more data, what should we do?
3. For fast importing we used the BatchInserter, but a single Lucene index has a limit of 2^32 entries, so we can import fewer than 2^32 nodes. What can we do about this limit other than using more indexes?
4. After two days of importing, the import speed has become too slow to accept (200-600 nodes per second) - only 1% of what it was at the beginning. I can see that memory is full; what should we do to improve the speed?
It has imported about 0.2 billion nodes and 0.5 billion relationships so far - that's half of my data. And my server has 32 GB of memory.
Thanks a lot.
You have many questions here that might be better suited as individual questions or asked on the Neo4j slack channel.
I started to write this as a comment, but ran out of chars so I'll try to point you to some resources:
1) Neo4j distributed graph model
I'm not sure exactly what you're asking here. See this document for general information on Neo4j scalability. If you're asking if graph traversals can be distributed across machines in a Neo4j cluster, the answer is no.
2) Hardware sizing
This depends a bit on your access patterns. See the hardware sizing calculator that can take workload / access patterns into account.
3-4) Import
Can you create a new question and share your code for this? You should be able to achieve much better performance than this.

Detecting short cycles in neo4j property graph

What is the best way to detect cycles in a graph of considerable size using Cypher?
I have a graph with about 90,000 nodes and about 320,000 relationships, and I would like to detect cycles in a subgraph of about 10k nodes involving about 100k relationships. The Cypher I have written is:
start
n = node:node_auto_index(some lucene query that returns about 10k nodes)
match
p = n-[:r1|r2|r3*]->n
return p
However, this is not turning out to be very efficient.
Can somebody suggest a better way to do this?
Unlimited-length path searches are well-known to be slow, since the number of operations grows exponentially with the depth of the search.
If you are willing to limit your cycle search to a reasonably small maximum path depth, then you can speed up the query (although it might still take some time). For instance, to only look at paths up to 5 steps deep:
MATCH p = n-[:r1|r2|r3*..5]->n
RETURN p;

Graph partition algo with Neo4j graph database

I know there are some famous graph partitioning tools like METIS, implemented by the Karypis Lab (http://glaros.dtc.umn.edu/gkhome/metis/metis/overview),
but I want to know: is there any way to partition a graph stored in Neo4j?
Or do I have to dump Neo4j's data and manually transform the nodes and edges to fit the METIS input format?
Regarding new-ish and interesting algorithms, this is by no means exhaustive or state of the art, but these are the first places I would look:
Specific Algorithm: DiDiC (Distributed Diffusive Clustering) - I used it once in my thesis (Partitioning Graph Databases)
You iterate over all nodes, and for each node retrieve its neighbors, in order to spread some of a diffused quantity ("load") to those neighbors (a toy sketch of this diffusion step follows after this list)
Easy to implement.
Can be made deterministic
Iterative - as it's based on iterations (like Super Steps in Pregel) you can stop it at any time. The longer you leave it the better the result, in theory (though in some cases, on certain graph shapes it can be unstable)
When we implemented this we ran it for 100 iterations on a machine with ~30GB RAM, for up to ~4 million nodes - it took no more than two days to complete.
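To illustrate the diffusion step only: each node holds a load vector, repeatedly gives a fraction of it to its neighbors, and finally joins the partition whose load dominates. This is a toy sketch, not the real DiDiC (which maintains two coupled load systems per partition precisely so the diffusion does not simply homogenize); the adjacency list and parameters below are made up.

# toy illustration of diffusive clustering -- not the real DiDiC algorithm
import random

def diffusive_partition(adj, k=2, iterations=10, alpha=0.5):
    """adj: dict node -> list of neighbors (treated as undirected here).
    Returns dict node -> partition index in range(k)."""
    nodes = list(adj)
    load = {n: [0.0] * k for n in nodes}
    for n in nodes:
        load[n][random.randrange(k)] = 1.0      # seed each node with one unit of load
    for _ in range(iterations):
        new_load = {n: [x * (1 - alpha) for x in load[n]] for n in nodes}
        for n in nodes:
            nbrs = adj[n]
            if not nbrs:                        # nothing to diffuse to: keep it all
                for p in range(k):
                    new_load[n][p] += alpha * load[n][p]
                continue
            share = alpha / len(nbrs)
            for m in nbrs:                      # spread a fraction of the load around
                for p in range(k):
                    new_load[m][p] += share * load[n][p]
        load = new_load
    # each node joins the partition whose load dominates at that node
    return {n: max(range(k), key=lambda p: load[n][p]) for n in nodes}

# tiny example: two triangles joined by a single edge
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(diffusive_partition(adj, k=2, iterations=10))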
Specific Algorithm: EvoCut "Finding sparse cuts locally using evolving sets" - local probabilistic algorithm from Microsoft - related to these papers
Difficult to implement
Local algorithm - BFS-like access patterns (random walks)
It's been a while since I read that paper, but I remember it was built on clean abstractions:
EvoNibble (pluggable - decides how much of the neighborhood to add to the current cluster)
EvoCut (calls EvoNibble multiple times to find the local cluster)
EvoPartition (calls EvoCut repeatedly to partition entire graph)
Not deterministic
General Algorithm Family: Hierarchical Graph Clustering
From a high level:
Coarsen the graph by collapsing nodes into aggregate nodes (a minimal coarsening sketch follows below)
coarsening strategy is selectable
Find clusters in the coarsened/smaller graph
clustering strategy is selectable
Incrementally decoarsen the graph, refining the clustering at each step
refining strategy is selectable
Notes:
If the graph changes slowly (or results don't need to be right up to date) it may be possible to coarsen once (or infrequently) then work with the coarsened graph - to save computation
I don't know of a specific algorithm to recommend
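To make the coarsening step concrete, here is a minimal sketch of one coarsening pass using greedy heavy-edge matching (one common selectable coarsening strategy, in the spirit of METIS-style partitioners). The graph is treated as undirected, weights default to 1, and the helper name is mine, not from any particular library:

import networkx as nx

def coarsen_once(G):
    """One coarsening pass: greedy heavy-edge matching, then collapse each
    matched pair into a single aggregate node, summing parallel edge weights."""
    matched = set()
    mapping = {}                                   # original node -> coarse node
    edges = sorted(G.edges(data="weight", default=1.0),
                   key=lambda e: e[2], reverse=True)
    for u, v, w in edges:                          # heaviest edges first
        if u not in matched and v not in matched and u != v:
            matched.update((u, v))
            mapping[u] = mapping[v] = (u, v)       # aggregate node
    for n in G.nodes:
        mapping.setdefault(n, n)                   # unmatched nodes carry over
    coarse = nx.Graph()
    coarse.add_nodes_from(set(mapping.values()))
    for u, v, w in G.edges(data="weight", default=1.0):
        cu, cv = mapping[u], mapping[v]
        if cu == cv:
            continue                               # edge inside an aggregate vanishes
        prev = coarse.get_edge_data(cu, cv, default={"weight": 0.0})["weight"]
        coarse.add_edge(cu, cv, weight=prev + w)
    return coarse, mapping

Repeating coarsen_once until the graph is small enough, clustering the small graph, and then projecting the result back through the recorded mappings (refining at each level) gives the hierarchical scheme described above.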
General limitations - the things few clustering algorithms do:
Node types not acknowledged - i.e., all nodes treated equally
Relationship types not acknowledged - i.e., all relationships treated equally
Relationship direction not acknowledged - i.e., relationships treated as undirected
Having worked independently with METIS and Neo4j in the past, I am not aware of any tool for generating a METIS file from Neo4j. That being said, writing such a tool should be an easy task and would be a great community contribution.
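To give a rough idea of what such a tool involves, here is a sketch that converts an edge list dumped from Neo4j (assumed here to be a CSV with source,target,weight columns; the file and column names are placeholders) into the METIS graph-file format: a header line with the vertex count, edge count, and the fmt flag for edge weights, followed by one line per vertex listing its 1-based neighbors with integer edge weights.

# sketch of a Neo4j -> METIS exporter; assumes edges were already dumped to CSV
import csv
from collections import defaultdict

def csv_to_metis(edge_csv, out_path):
    ids = {}                                   # original node id -> 1-based METIS id
    adj = defaultdict(dict)                    # metis id -> {neighbor metis id: weight}

    def metis_id(node):
        return ids.setdefault(node, len(ids) + 1)

    with open(edge_csv, newline="") as f:
        for row in csv.DictReader(f):
            u, v = metis_id(row["source"]), metis_id(row["target"])
            if u == v:
                continue                       # METIS does not allow self-loops
            w = max(1, int(round(float(row.get("weight", 1)))))  # weights must be positive integers
            adj[u][v] = w                      # METIS graphs are undirected, so
            adj[v][u] = w                      # record the edge in both directions

    n = len(ids)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2   # each undirected edge counted once
    with open(out_path, "w") as out:
        out.write(f"{n} {m} 001\n")            # fmt=001 -> edge weights present
        for i in range(1, n + 1):
            out.write(" ".join(f"{j} {w}" for j, w in sorted(adj[i].items())) + "\n")

csv_to_metis("edges.csv", "graph.metis")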
Another approach for integrating METIS with Neo4j might be to connect METIS to Neo4j from C++ via JNI. However, this is going to be much more involved, as it would have to take care of things like transactions, concurrency, etc.
On the more general question of partitioning graphs, it is quite possible to implement some of the more known and simple algorithms with reasonable effort.

Detecting all cycles in a directed graph with millions of nodes in Ocaml

I have graphs ranging from thousands to millions of nodes. I want to detect all possible cycles in such graphs.
I use a hash table to store the edges: (source node, edge weight) -> (target node).
What would be an efficient way of implementing this in OCaml?
It looks like Tarjan's algorithm is the best one.
What would be the best implementation of it?
Yes, Tarjan's algorithm for strongly connected components is a good solution. You may also use so-called path-based strong component algorithms which have (when done carefully) comparable linear complexity.
If you pick reasonable data structures, they should work. It's hard to say much more before you have implemented and profiled a prototype.
I don't understand what your graph representation is: are your hashed keys really a (node, weight) pair? Then how do you find all neighbors of a given node? For a large graph structure you should optimize access time, of course, but also memory efficiency.
If you really want to find all possible cycles, the problem seems at least exponential in the worst case. In a complete graph, every subset of at least two nodes gives you a cycle (including a link from the last node back to the first). Furthermore, every cyclic permutation of every subset gives you a different cycle. Depending on the sparsity of your graphs, the problem could be tractable in practice.
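To put a back-of-the-envelope number on that worst case (my estimate, not from the question): choosing any k >= 2 of the n nodes and any of the (k-1)! cyclic orderings of them yields a distinct simple directed cycle in the complete graph, so the total count is

    \sum_{k=2}^{n} \binom{n}{k} (k-1)!

which already exceeds 10^9 at around n = 13, so enumerating every cycle is only realistic for very sparse graphs.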
