Detecting cycles in neo4j property graph using cypher

What is the best way to detect cycles in a graph of considerable size using Cypher?
I have a graph with about 250000 nodes and about 270000 relationships, and I would like to detect cycles in a subgraph of about 10k nodes involving 100k relationships. The Cypher I have written looks like:
start
n = node:node_auto_index(some lucene query that returns about 10k nodes)
match
p = n-[:r1|r2|r3*]->n
return p
However, this is not turning out to be very efficient.
Can somebody suggest a better way to do this?

1) Count un-flagged nodes
2) Flag nodes with no outgoing relationships (leaves)
3) Flag nodes with no incoming relationships (roots)
4) If any nodes were flagged in step 2 or 3, return to step 1
5) If un-flagged nodes remain, you have at least one cycle; the cycle(s) will be in the set of un-flagged nodes. Looking at the nodes with the fewest in/out edges may help if there are still too many to pin down the cycle. (A sketch of this pruning loop follows below.)
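A minimal single-process sketch of this pruning loop, assuming the subgraph has already been pulled out of the database into an in-memory edge list (the node IDs and edges below are made up for illustration):

from collections import defaultdict

def has_cycle(nodes, edges):
    # Build degree counts and adjacency for the directed edge list.
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    outgoing, incoming = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        outgoing[src].add(dst)
        incoming[dst].add(src)

    unflagged = set(nodes)
    while True:
        # Steps 2 and 3: flag current leaves (no outgoing) and roots (no incoming).
        flagged = {n for n in unflagged if out_deg[n] == 0 or in_deg[n] == 0}
        if not flagged:
            break  # step 4: nothing new was flagged, so stop iterating
        unflagged -= flagged
        # Removing a flagged node lowers the degrees of its neighbours.
        for n in flagged:
            for m in outgoing[n]:
                in_deg[m] -= 1
            for m in incoming[n]:
                out_deg[m] -= 1

    # Step 5: any un-flagged survivors imply at least one cycle among them.
    return len(unflagged) > 0, unflagged

# Example: 1 -> 2 -> 3 -> 1 is a cycle, 4 is a dangling leaf.
print(has_cycle([1, 2, 3, 4], [(1, 2), (2, 3), (3, 1), (3, 4)]))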

The query you are using at the moment will return all paths for each node. Changing it to
start
n = node:node_auto_index(some lucene query that returns about 10k nodes)
where (n)-[*]->(n)
return distinct n
will let Neo4j stop once it finds a path for a given node. It will probably still be slow, but definitely less so.
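The START/auto-index syntax above no longer runs on current Neo4j versions. As a hedged sketch, the same existence check could be issued from the official Python driver, assuming the 10k candidate nodes carry a label such as :Entity (a stand-in for the Lucene query) and that r1/r2/r3 are the relationship types of interest:

from neo4j import GraphDatabase

# The connection details, the :Entity label, and the relationship types are
# assumptions standing in for the original auto-index setup.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# The pattern predicate in WHERE is an existence check, so Neo4j can stop as
# soon as it finds one cycle through n instead of enumerating every path.
query = """
MATCH (n:Entity)
WHERE (n)-[:r1|r2|r3*]->(n)
RETURN DISTINCT n
"""

with driver.session() as session:
    cyclic_nodes = [record["n"] for record in session.run(query)]

driver.close()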

Related

Best way to store very large weighted DAG on disk?

Assume a graph database is used to store a very large DAG on disk:
There are many things that are not required, which allows for optimization.
Basically what I do need is:
store a directed acyclic graph, no cycles, at most one edge per node-pair
fromID, toID, weight (can be INT,INT,FLOAT)
return connected components efficiently and conveniently
return all zero-indegree nodes efficiently and conveniently
return all descendants of a node efficiently and conveniently
manage sizes of up to 100 million nodes, with up to 10 billion edges
modest resource requirements
free / open-source
Do you have some experience that allows you to give recommendations?
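Not a recommendation for any particular database, but for concreteness, here is a small in-memory illustration of the requested operations over the (fromID, toID, weight) layout; at the stated scale (100 million nodes, 10 billion edges) these would of course be answered by the store itself, and the sample data below is made up:

from collections import defaultdict, deque

# Toy edge list in the (fromID, toID, weight) layout described above.
edges = [(1, 2, 0.5), (1, 3, 1.0), (3, 4, 2.5), (5, 6, 0.1)]
nodes = {1, 2, 3, 4, 5, 6}

children = defaultdict(list)
indegree = defaultdict(int)
for src, dst, _weight in edges:
    children[src].append(dst)
    indegree[dst] += 1

# All zero-indegree nodes (the roots of the DAG).
roots = sorted(n for n in nodes if indegree[n] == 0)

# All descendants of a node, via breadth-first search over outgoing edges.
def descendants(start):
    seen, queue = set(), deque([start])
    while queue:
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(roots)           # [1, 5]
print(descendants(1))  # {2, 3, 4}
# Connected components would be a similar search (or union-find) over the
# undirected view of the same edge list.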

Gremlin query which only returns vertices having edges

Is there a more efficient way of returning the id of the first 100 vertices which have edges and do not have a specific property?
g.V()
.filter(hasNot("SOME_PROPERTY").bothE())
.limit(100)
.id()
I don't think you can write that in a much more optimal fashion. That traversal will only be as fast as the underlying graph's ability to optimize on the absence of a property, which typically isn't that fast. It is generally treated as a global operation that has to iterate every vertex in the graph (or until it finds 100 matches), and I don't think any graph allows indices that can help in this sort of case.
If this traversal is meant to be a real-time traversal (OLTP), then you should probably consider defaulting that "SOME_PROPERTY" so that it can be indexed in some way to detect the negative values. If it is more of an administrative traversal (e.g. detecting bad data to clean up) (OLAP), then you should probably execute that traversal with Gremlin Spark.
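A hedged sketch of the "defaulted property" idea using gremlinpython; the sentinel value "NONE", the server address, and the assumption that a composite index exists on SOME_PROPERTY are illustrative, not part of the original question:

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Assumes a Gremlin Server at this address, and that vertices are written with
# SOME_PROPERTY set to the sentinel "NONE" whenever the real value is missing.
conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

ids = (g.V()
        .has("SOME_PROPERTY", "NONE")  # equality lookup, so an index can serve it
        .where(__.bothE())             # keep only vertices with at least one edge
        .limit(100)
        .id_()
        .toList())

conn.close()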

Detecting short cycles in neo4j property graph

What is the best way to detect cycles in a graph of considerable size using Cypher?
I have a graph with about 90000 nodes and about 320000 relationships, and I would like to detect cycles in a subgraph of about 10k nodes involving 100k relationships. The Cypher I have written looks like:
start
n = node:node_auto_index(some lucene query that returns about 10k nodes)
match
p = n-[:r1|r2|r3*]->n
return p
However, this is not turning out to be very efficient.
Can somebody suggest a better way to do this?
Unlimited-length path searches are well-known to be slow, since the number of operations grows exponentially with the depth of the search.
If you are willing to limit your cycle search to a reasonably small maximum path depth, then you can speed up the query (although it might still take some time). For instance, to only look at paths up to 5 steps deep:
MATCH p = n-[:r1|r2|r3*..5]->n
RETURN p;

How to compute the average(or sum) of node values in a network?

Consider a network (graph) of N nodes, each holding a value. How do you design a program/algorithm (run by each node) that allows every node to compute the average (or sum) of all the node values in the network?
Assumptions are:
Direct communication between nodes is constrained by the graph topology, which is not a complete graph. Any other assumptions, if necessary for your algorithm, are allowed. The weakest one I assume is that there is a loop in the graph that contains all the nodes.
N is finite.
N is sufficiently large that you can't store all the values and then compute their average (or sum). For the same reason, you can't "remember" whose value you've received (thus you can't just redistribute the values you've received, add the ones you haven't seen to a buffer, and get a result).
(The tags may not be right, since I don't know which field this kind of problem belongs to, or whether it's some kind of general problem.)
That is an interesting question. Here are some assumptions I've made before presenting a partial solution:
The graph is connected (in case of a directed graph, strongly connected)
The nodes only communicate with their direct neighbours
It is possible to hold and send the sum of all numbers; this means the sum either won't overflow a long, or you have a sufficiently large data structure that it won't overflow
I'd go with depth-first search. Node N0 would initiate the algorithm and send its value + the count to the first neighbour (N0.1). N0.1 would add its own value + count and forward the message to the next neighbour (N0.1.1). In case the message comes back to either N0 or N0.1, they just forward it to another neighbour of theirs (N0.2 or N0.1.2).
The problem now is knowing when to terminate the algorithm. Preferably you want to terminate it as soon as you've reached all nodes, and afterwards just broadcast the final message. If you know how many nodes there are in the graph, just keep forwarding it to the next node until every node has eventually been reached. The last node will know that it has been reached (it can compare the count variable with the number of nodes in the graph) and broadcast the message.
If you don't know how many nodes there are, and it's an undirected graph, then it will just be a depth-first implementation in a graph. This means that if N0.1 gets a message from anyone other than N0.1.1, it will just bounce the message back, as you can't send messages to the parent when performing depth-first search. If it is a directed graph and you don't know the number of nodes, well, then you either come up with a mathematical model to prove when the algorithm has finished, or you learn the number of nodes.
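A minimal single-process simulation of this token-passing idea, assuming an undirected, connected graph given as an adjacency list (the graph and values below are made up). One token carrying (sum, count) walks the graph depth-first, and each node adds its own value exactly once:

def dfs_sum_and_count(graph, values, start):
    # graph: {node: [neighbours]}, values: {node: number}.
    visited = set()
    total = count = 0

    def visit(node):
        nonlocal total, count
        visited.add(node)
        total += values[node]      # the node adds its own value to the token
        count += 1                 # and increments the node counter
        for neighbour in graph[node]:
            if neighbour not in visited:
                visit(neighbour)   # forward the token; returning models the bounce-back

    visit(start)
    return total, count

graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
values = {0: 4, 1: 8, 2: 15, 3: 16}
s, n = dfs_sum_and_count(graph, values, 0)
print(s / n)  # average of all node values: 10.75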
I've found a paper proposing a gossip-based algorithm to count the number of nodes in a dynamic network: https://gnunet.org/sites/default/files/Gossipico.pdf; maybe that will help. Maybe you can even use it to sum up the node values.

Performance issue with Graph Traversal in ArangoDB

I've set up a simple test case to finally learn some graph databases.
I have a simple tree structure based on a collection of roughly 80000 vertices/documents, with around 25 attributes each. The only edges are outbound "is_parent" edges, so to find the children of each node, I can simply pick up all inbound edges. I have not set up any specific indices on any fields.
The tree is 20 levels deep and I'm grabbing a random node on the fifth level, to then pick up all descendants of that node using a graph traversal:
FOR t IN GRAPH_TRAVERSAL("sample_tree", "sampleunit/2565130142666", "inbound", {"maxDepth":20}) RETURN t
This takes a little more than 3 seconds on my dev machine and I feel I might be doing something wrong. Is there any way to speed things up or do I have any conceptual issues?
I set up an example tree-like graph as you described and ran the query on it.
Interestingly, the following query was executing much faster than your query:
FOR t IN TRAVERSAL(sampleunit, unitlinks, "sampleunit/2565130142666", "inbound", {"maxDepth":20}) RETURN t
The query above uses the "older" traversal function in AQL. We checked why there is a performance difference between the two traversal types, and finally found something that can be improved.
The fix for this has been pushed into the 2.2 and devel branches. It is included in commit 9a1eb149aa4da514d709c43a4ebdfd8819ba2f1d if you prefer cherry-picking.
I see a similar issue between NEIGHBORS and GRAPH_NEIGHBORS on version 2.6.3; the first is 30 times faster than the second.
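For what it's worth, in ArangoDB 3.x both TRAVERSAL and GRAPH_TRAVERSAL were replaced by native AQL traversal syntax. A hedged sketch of the equivalent query issued through python-arango, with the connection details assumed:

from arango import ArangoClient

# Connection details and credentials are assumptions for this sketch.
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("_system", username="root", password="")

# Native AQL traversal: all descendants of the start vertex, up to 20 levels,
# following edges in the inbound direction through the named graph.
query = """
FOR v IN 1..20 INBOUND 'sampleunit/2565130142666' GRAPH 'sample_tree'
  RETURN v
"""
results = list(db.aql.execute(query))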
