Detecting short cycles in neo4j property graph

What is the best way to detect cycles in a graph of considerable size using Cypher?
I have a graph with about 90000 nodes and about 320000 relationships, and I would like to detect cycles in a subgraph of about 10k nodes involving 100k relationships. The Cypher I have written is like
start
n = node:node_auto_index(some lucene query that returns about 10k nodes)
match
p = n-[:r1|r2|r3*]->n
return p
However, this is not turning out to be very efficient.
Can somebody suggest a better way to do this?

Unlimited-length path searches are well-known to be slow, since the number of operations grows exponentially with the depth of the search.
If you are willing to limit your cycle search to a reasonably small maximum path depth, then you can speed up the query (although it might still take some time). For instance, to only look at paths up to 5 steps deep:
MATCH p = n-[:r1|r2|r3*..5]->n
RETURN p;

Related

Improve performance removing TinkerGraph vertices

I have a graph g with 600k vertices and 950k edges. After some processing, I need to clean up about 350k+ vertices with this query:
g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).drop().iterate();
Even though I'm excluding vertices with no "depend" edges, they are still connected with other edges.
Using Java, tinkerpop/tinkergraph 3.4.6.
Currently it is taking about 45 minutes to drop all those vertices.
I did some Java profiling, and the results show 73% of the time spent in the TinkerVertex.remove method, and the rest in ExpandableStepIterator.next.
Is there something like "bulk drop"? Would a JanusGraph or other graph provider be much faster?
It's unlikely that there are graphs much faster than TinkerGraph, as TinkerGraph is a pure in-memory implementation. You might find one that uses that memory more efficiently, like OverflowDB, which originally forked from TinkerGraph, but I don't know that it will make this particular query go faster.
Neither TinkerGraph nor any graph I know of has a filtered "bulk drop" operation.
The "not" style global query here is simply expensive, as you're having to touch a large portion of the graph. That said, I'm a bit surprised that TinkerGraph is taking that long for a graph with less than one million edges. You didn't mention whether you were experiencing a lot of GC while doing your profile. Perhaps that is an issue? If so, I would try to adjust your JVM memory configuration; maybe you just need a larger -Xmx value or something simple like that.
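For example (a purely illustrative invocation; the flag values and class name are assumptions to be tuned to your dataset and machine):
java -Xms8g -Xmx8g MyCleanupJob
Sizing the heap comfortably above the in-memory graph's footprint avoids GC churn during the mass removal.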
From a query perspective, you could try to invert the not() portion of the traversal to positively find the things you want to remove, as sketched below. It might make the query less concise to read but could perhaps speed things up; on the other hand, you are still trying to delete roughly half of your data, so the cost may not be just in finding the vertices to get rid of.
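A hedged sketch of that inversion (the two-step shape is my assumption and untested against this schema): first collect the ids of vertices that do have an incoming depend edge, then drop the labeled vertices outside that set.
// positively collect vertices that DO have an incoming depend edge,
// then drop labeled vertices whose id is not in that set (illustrative only)
Set<Object> keep = g.E().hasLabel(EDGE_DEPEND).inV().id().toSet();
g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN)
     .filter(t -> !keep.contains(t.get().id()))
     .drop().iterate();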
One other thought would be to try to parallelize the drop(). You might hit concurrency errors, so you could need a retry strategy, but you could consider taking the Iterator of g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)) and then delegating calls to each (or batches of) Vertex.remove() to a separate worker thread.
Based on the accepted answer, a simple parallelization improved things enough that this operation is no longer the most time-critical one.
For future reference, this:
g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).drop().iterate();
is now something like this:
// assumes the usual TinkerPop imports, a static import for inE, and java.util.concurrent
ExecutorService executor = Executors.newFixedThreadPool(4);
final int batchSize = 10000;
long count = g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).count().next();
List<Callable<Object>> callableList = new ArrayList<>();
// split the matching vertices into id batches, collected up front in the main
// thread, so the parallel drop tasks stay independent of each other
for (int batch = 0; (long) batch * batchSize < count; batch++) {
    final Set<Object> vSet = g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND))
            .skip((long) batch * batchSize).limit(batchSize).id().toSet();
    callableList.add(() -> g.V(vSet.toArray()).drop().iterate());
}
List<Future<Object>> results = executor.invokeAll(callableList);
After some tests, I decided to keep the id-collecting iteration in a single thread. That way the distributed tasks are really independent of each other (e.g. one task completing won't affect another task's query).
Keep in mind that the actual removal is still single-threaded, as the vertex map modification is behind concurrent-access locks.
The effect is that adding threads won't get better results (I personally tried 8). And based on some thread dumps, even 4 might be too many (there is always at least one thread in a waiting state), although I did get a dump with 3 threads running!

Best way to store very large weighted DAG on disk?

Assume a graph database to store a very large DAG on disk:
There are many things that are not required, which allows for optimization.
Basically what I do need is:
store a directed acyclic graph, no cycles, at most one edge per node-pair
fromID, toID, weight (can be INT,INT,FLOAT)
return connected components efficiently and conveniently
return all zero-indegree nodes efficiently and conveniently
return all descendants of a node efficiently and conveniently
manage sizes of up to 100 million nodes, with up to 10 billion edges
modest resource requirements
free / open-source
Do you have experience that allows you to give recommendations?
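For a sense of scale, a back-of-the-envelope sizing based only on the numbers above (the arithmetic is mine, not from the question):
public class EdgeSizing {
    public static void main(String[] args) {
        long edges = 10_000_000_000L;          // up to 10 billion edges
        int bytesPerEdge = 4 + 4 + 4;          // INT fromID + INT toID + FLOAT weight
        long rawBytes = edges * bytesPerEdge;  // ~120 GB of raw edge data
        System.out.println("raw edge data: ~" + rawBytes / 1_000_000_000L + " GB");
    }
}
Before any adjacency index is added, the edge list alone is around 120 GB, so whatever store is chosen has to be disk-backed or memory-mapped rather than a naive in-memory graph.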

Neo4j breadth-first search is too slow

I need breadth-first search in my database. There are 3,863 nodes, 2,830,471 properties and 1,355,783 relationships. n is my start point and m is my end point in the query, but it's too slow; I can't get a result from the following query:
start n=node(42),m=node(31)
match p=n-[*1..]->m
return p,length(p)
order by length(p) asc
limit 1
How can I optimize that query? It must finish in at most 20 seconds. I have 8 GB of RAM in my own computer, but I bought a dedicated server with 24 GB of RAM for this. My heap size is 4096-5120 MB. My other query-related settings are in the following segment:
neostore.nodestore.db.mapped_memory=2024M
neostore.relationshipstore.db.mapped_memory=614M
neostore.propertystore.db.mapped_memory=128M
neostore.propertystore.db.strings.mapped_memory=2024M
neostore.propertystore.db.arrays.mapped_memory=614M
How can I solve this problem?
Your query basically collects all paths of any length, sorts them, and returns the shortest one. There is a much better way to do this:
start n=node(42),m=node(31)
match p=shortestPath(n-[*..1000]->m)
return p,length(p)
It's a best practice to always supply an upper limit on variable-depth patterns.

Detecting cycles in neo4j property graph using cypher

What is the best way to detect cycles in a graph of considerable size using Cypher?
I have a graph with about 250000 nodes and about 270000 relationships, and I would like to detect cycles in a subgraph of about 10k nodes involving 100k relationships. The Cypher I have written is like
start
n = node:node_auto_index(some lucene query that returns about 10k nodes)
match
p = n-[:r1|r2|r3*]->n
return p
However, this is not turning out to be very efficient.
Can somebody suggest a better way to do this?
1) Count the un-flagged nodes
2) Flag nodes with no outgoing relationships (leaves)
3) Flag nodes with no incoming relationships (roots)
4) If any nodes were flagged in steps 2 or 3, return to step 1
5) If un-flagged nodes remain, you have at least one cycle
The cycle(s) will be in the set of un-flagged nodes; if that set is still too large to pick out a cycle, looking at the nodes with the fewest incoming or outgoing edges may help. A sketch of this pruning loop follows below.
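A minimal sketch of the pruning loop in Java (the adjacency-map representation and all names are illustrative assumptions, not from the answer):
import java.util.*;

// Iterative pruning: repeatedly remove roots and leaves; whatever survives
// lies on at least one cycle.
public class CyclePruner {
    static Set<Integer> nodesOnCycles(Set<Integer> nodes,
                                      Map<Integer, Set<Integer>> out,   // successors
                                      Map<Integer, Set<Integer>> in) {  // predecessors
        Set<Integer> unflagged = new HashSet<>(nodes);
        boolean flaggedSomething = true;
        while (flaggedSomething) {                // step 4: loop while progress is made
            flaggedSomething = false;
            for (Iterator<Integer> it = unflagged.iterator(); it.hasNext(); ) {
                Integer n = it.next();
                // steps 2 and 3: a node with no un-flagged successor (leaf) or
                // no un-flagged predecessor (root) cannot lie on a cycle
                boolean hasOut = out.getOrDefault(n, Collections.emptySet())
                                    .stream().anyMatch(unflagged::contains);
                boolean hasIn = in.getOrDefault(n, Collections.emptySet())
                                   .stream().anyMatch(unflagged::contains);
                if (!hasOut || !hasIn) {
                    it.remove();                  // "flagging" = removal here
                    flaggedSomething = true;
                }
            }
        }
        return unflagged;  // step 5: non-empty iff at least one cycle exists
    }
}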
The query you are using at the moment will return all paths for each node. Changing it to
start
n = node:node_auto_index(some lucene query that returns about 10k nodes)
where (n)-[*]->(n)
return distinct n
will let Neo4j stop once it finds a path for a given node. It will probably still be slow, but definitely less so.

Detecting all cycles in a directed graph with millions of nodes in OCaml

I have graphs ranging from thousands of nodes to millions of nodes, and I want to detect all possible cycles in such graphs.
I use a hash table to store the edges: ((source node, edge weight) -> (target node)).
What would be an efficient way of implementing this in OCaml?
It looks like Tarjan's algorithm is the best one. What would be the best way to implement it?
Yes, Tarjan's algorithm for strongly connected components is a good solution. You may also use so-called path-based strong component algorithms, which have (when done carefully) comparable linear complexity.
If you pick reasonable data structures, they should work. It's hard to say much more before you have implemented and profiled a prototype.
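For reference, a minimal sketch of Tarjan's SCC algorithm (in Java here, as a language-neutral illustration; porting it to OCaml with Hashtbl and a list-based stack is straightforward). Any SCC of size greater than one, or a single node with a self-loop, contains a cycle. Note that the recursion will overflow the call stack on graphs with millions of nodes, so a production version needs an explicit stack.
import java.util.*;

public class TarjanScc {
    private final Map<Integer, List<Integer>> adj;           // adjacency lists
    private final Map<Integer, Integer> index = new HashMap<>();
    private final Map<Integer, Integer> lowlink = new HashMap<>();
    private final Deque<Integer> stack = new ArrayDeque<>();
    private final Set<Integer> onStack = new HashSet<>();
    private final List<List<Integer>> sccs = new ArrayList<>();
    private int counter = 0;

    public TarjanScc(Map<Integer, List<Integer>> adj) { this.adj = adj; }

    public List<List<Integer>> run() {
        for (Integer v : adj.keySet())
            if (!index.containsKey(v)) strongConnect(v);
        return sccs;
    }

    private void strongConnect(int v) {
        index.put(v, counter);
        lowlink.put(v, counter);
        counter++;
        stack.push(v);
        onStack.add(v);
        for (int w : adj.getOrDefault(v, Collections.emptyList())) {
            if (!index.containsKey(w)) {
                strongConnect(w);                            // tree edge
                lowlink.put(v, Math.min(lowlink.get(v), lowlink.get(w)));
            } else if (onStack.contains(w)) {                // back edge
                lowlink.put(v, Math.min(lowlink.get(v), index.get(w)));
            }
        }
        if (lowlink.get(v).equals(index.get(v))) {           // v roots an SCC
            List<Integer> scc = new ArrayList<>();
            int w;
            do { w = stack.pop(); onStack.remove(w); scc.add(w); } while (w != v);
            sccs.add(scc);
        }
    }
}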
I don't understand what your graph representation is: are your hashed keys really a (node, weight) pair? Then how do you find all the neighbors of a given node? For a large graph structure you should optimize access time, of course, but also memory efficiency.
If you really want to find all possible cycles, the problem seems at least exponential in the worst case. For a complete graph, every subset of two or more nodes gives you a different cycle (including a link from the last node back to the first), and every cyclic permutation of every subset gives you yet another cycle. Depending on the sparsity of your graphs, the problem could still be tractable in practice.
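To make the counting argument concrete (my arithmetic, not from the answer): in a complete directed graph on n nodes, each k-node subset contributes (k-1)! distinct directed cycles, so the total explodes even for small n.
import java.math.BigInteger;

// Counts simple directed cycles in a complete directed graph K_n:
// sum over k = 2..n of C(n, k) * (k-1)! cyclic orders per subset.
public class CycleCount {
    public static void main(String[] args) {
        int n = 20;
        BigInteger total = BigInteger.ZERO;
        for (int k = 2; k <= n; k++)
            total = total.add(choose(n, k).multiply(factorial(k - 1)));
        System.out.println("simple cycles in K_" + n + ": " + total); // > 10^17
    }

    static BigInteger factorial(int m) {
        BigInteger f = BigInteger.ONE;
        for (int i = 2; i <= m; i++) f = f.multiply(BigInteger.valueOf(i));
        return f;
    }

    static BigInteger choose(int n, int k) {
        return factorial(n).divide(factorial(k).multiply(factorial(n - k)));
    }
}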
