How to explain the execution path of this Cypher query?

Imagine the following graph:
And this query:
MATCH(p:Person {id:1})
MATCH (p)-[:KNOWS]-(s)
CREATE (p)-[:LIVE_IN]->(:Place {name: 'Some Place'})
Now, why are five LIVE_IN relationships and five Place nodes created, even though s is not involved in the CREATE statement? Is there any place in the docs that explains this behavior?
Note: this is not about MERGE vs CREATE, although MERGE can solve it.
EDIT: In response to @Tomaz's answer: I deliberately placed MATCH (p)-[:KNOWS]-(s) in the query and I know how it will behave. I am asking for a documented explanation. For example, will CREATE execute for each path or row in the matched patterns regardless of the nodes involved in the CREATE? What if you have complex matched patterns such as disconnected graphs, trees, etc.?
Also note that the direction of the KNOWS relationship (- vs ->) will affect the number of returned rows (9 vs 1), but CREATE will execute five times regardless of the direction.
Update:
I have added three Office nodes and issued the following query:
MATCH(p:Person {id:1})
MATCH (p)-[:KNOWS]-(s)
MATCH (o:Office)
CREATE (p)-[:LOVE]->(:Place {name: 'Any Place'})
As a result, 15 LOVE relationships and 15 Place nodes were created, so it seems to me that Cypher performs a Cartesian product between all matched nodes:
p refers to 1 node, s refers to 5 nodes, o refers to 3 nodes => 1 * 5 * 3 = 15
Unfortunately, I cannot confirm this from the Neo4j docs.

This is because the Person with id 1 has five neighbors.
In your query you start with:
MATCH(p:Person {id:1})
This produces a single row, where it finds the node you are looking for.
The next step is:
MATCH (p)-[:KNOWS]-(s)
This statement finds five neighbors, so the cardinality (the number of rows) increases to five. The CREATE statement then runs once per row, which in turn creates five Places. You could, for example, lower the cardinality back to 1 before doing the CREATE, and you'll create only a single Place:
MATCH(p:Person {id:1})
MATCH (p)-[:KNOWS]-(s)
// aggregation reduces cardinality to 1
WITH p, collect(s) as neighbors
CREATE (p)-[:LIVE_IN]->(:Place {name: 'Some Place'})
When writing a Cypher query, always keep in mind the cardinality you are operating on.
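The row-by-row behaviour described above can be imitated outside the database. The following is only a simplified model in Python, not how Neo4j is implemented: the helper `match` and the node names are made up for illustration, and the candidates for each variable are treated as independent lists (in the real query, the candidates for s depend on p).

```python
from itertools import product


def match(rows, name, candidates):
    """Extend every existing row with every candidate binding.

    This mimics how each MATCH multiplies the number of rows
    (hypothetical helper, not a Neo4j API).
    """
    return [{**row, name: value} for row, value in product(rows, candidates)]


# A query starts with a single empty row.
rows = [{}]
rows_p = match(rows, "p", ["person1"])                             # 1 row
rows_ps = match(rows_p, "s", ["n1", "n2", "n3", "n4", "n5"])       # 5 rows
rows_pso = match(rows_ps, "o", ["office1", "office2", "office3"])  # 15 rows

# CREATE runs once per row, whether or not it uses s or o.
created_live_in = [(row["p"], "LIVE_IN", "Place") for row in rows_ps]
created_love = [(row["p"], "LOVE", "Place") for row in rows_pso]

# "WITH p, collect(s) AS neighbors" groups the rows by p: one row per distinct p.
grouped = {}
for row in rows_ps:
    grouped.setdefault(row["p"], []).append(row["s"])
created_after_with = [(p, "LIVE_IN", "Place") for p in grouped]

print(len(created_live_in), len(created_love), len(created_after_with))
```

In this model the three counts are 5, 15 and 1, which lines up with the numbers reported in the question and with the effect of the aggregation step.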

Related

Cypher query - Combine 2 queries by pipe result of one to other

I am a beginner with Cypher and want to write a query that finds all nodes whose neighbours of one kind are all connected to a node of another kind.
See the example:
I need to get all the brown nodes that connect only to red nodes that the blue node also connects to.
For this example I want to get the brown nodes with Ids 2, 3 and 1 (1 qualifies because it is not connected to any red node).
For now I do this with 2 separate queries and use Python to combine them, but now I need to do it in 1 query.
Match (r:R)-[]-(a:A) return collect(a.Id)
and the second query:
Match (b:B) Optional Match (b)-[]-(a:A) return b.Id, collect(a.Id)
In my script I then check whether each record's list from the second query is a subset of the first list, i.e. of all a.Id values that connect to R.
Can I do it in 1 query?
Thanks!
Improved answer:
Start with the :B nodes and check whether all their :A neighbours have a link to :R.
The ALL() function also returns true if a :B node does not have any neighbours.
MATCH (b:B)
WHERE ALL(n IN [(b)--(a:A) | a] WHERE EXISTS ((n)--(:R)) )
RETURN b
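For comparison, the same check can be written in Python, which is roughly what the two-query approach in the question does on the client side. This is a sketch over made-up data: the function name `b_nodes_fully_covered` and both inputs are assumptions for illustration. Note that Python's `all()` over an empty list is also true, matching the behaviour of ALL() described above.

```python
def b_nodes_fully_covered(b_neighbours, a_linked_to_r):
    """Return the ids of B nodes whose A neighbours are all linked to an R node.

    b_neighbours: dict mapping each B id to the list of its A neighbour ids.
    a_linked_to_r: set of A ids that have a relationship to some R node.
    """
    return [
        b_id
        for b_id, neighbours in b_neighbours.items()
        if all(a_id in a_linked_to_r for a_id in neighbours)
    ]


# Hypothetical data: B node 1 has no A neighbours at all,
# B node 4 has one A neighbour (30) that is not linked to R.
b_neighbours = {1: [], 2: [10, 11], 3: [11], 4: [10, 30]}
a_linked_to_r = {10, 11}

print(b_nodes_fully_covered(b_neighbours, a_linked_to_r))
```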

How can I optimize this query in neo4j?

I have a unidirectional graph.
The structure is as follows:
There are about 20,000 nodes in the graph.
I make the simplest request: MATCH (b1)-[:NEXT_BAR*10]->(b2) RETURN b1.id, b2.id LIMIT 5
The request is processed quickly.
But if I increase the number of relationships, the query takes much longer to process. In other words, the speed depends on the number of relationships.
This request takes longer than 5 minutes to complete: MATCH (b1)-[:NEXT_BAR*10000]->(b2) RETURN b1.id, b2.id LIMIT 5
This is still a simplified version. The request can have more than two nodes and the number of relationships can still be a range.
How can I optimize a query with a large number of relationships?
Perhaps there are other graph DBMS where there is no such problem?
Variable-length relationship queries have exponential time and memory complexity.
If R is the average number of suitable relationships per node, and D is the depth of the search, then the complexity is O(R ** D). This complexity will exist in any DBMS.
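To get a feel for the O(R ** D) bound, here is a small arithmetic sketch in Python. The function name `paths_to_explore` is made up, and the numbers are only the worst-case count from the formula above, not measurements from any database.

```python
def paths_to_explore(avg_rels_per_node, depth):
    """Worst-case number of candidate paths from one start node: R ** D."""
    return avg_rels_per_node ** depth


for r, d in [(1, 10000), (2, 10), (2, 100), (3, 20)]:
    count = paths_to_explore(r, d)
    # Print the number of decimal digits so very large values stay readable.
    print(f"R={r}, D={d}: {len(str(count))} digit(s)")
```

With R = 1 (a pure chain) there is only one candidate path per start node, so in that case the cost comes from walking D hops for every start node rather than from branching.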
The theory is simple here, but there are a couple of intricacies in the query execution.
-[:NEXT_BAR*10000]- matches paths that are exactly 10000 edges long, so the query engine spends some time finding these paths. Another thing to mention is that in (b1)-[...]->(b2), b1 and b2 are not specific, which means that the query engine has to scan all nodes. If there is a LIMIT, the scan should stop once the limited number of results has been produced. The whole execution also depends on the efficiency of the variable-length path implementation.
Some of the following might help:
Is it feasible to start from a specific node?
If there are branches, the only hope is aggressive filtering because of the exponential complexity (as cybersam explained well).
Use a smaller number in the variable expand, or a range, e.g., [:NEXT_BAR*..10000]. In this case, the query engine will match any path up to 10000 edges long (different semantics, but maybe applicable).
* means the DFS type of execution. On the other hand, BFS might be the right approach. Memgraph (DISCLAIMER: I'm the co-founder and CTO) also supports BFS type of execution with filtering lambda.
Here is a Python script I've used to generate and import data into Memgraph. By using small nodes_no you can quickly notice the execution patterns.
import mgclient
# Make a connection to the database.
connection = mgclient.connect(
    host='127.0.0.1',
    port=7687,
    sslmode=mgclient.MG_SSLMODE_REQUIRE)
connection.autocommit = True
cursor = connection.cursor()
# Clean and setup database instance.
cursor.execute("""MATCH (n) DETACH DELETE n;""")
cursor.execute("""CREATE INDEX ON :Node(id);""")
# Import dataset.
nodes_no = 10
# Create nodes.
for identifier in range(0, nodes_no):
    cursor.execute("""CREATE (:Node {id: "%s"});""" % identifier)
# Create edges: link each node to the next one, forming a chain.
for identifier in range(1, nodes_no):
    cursor.execute("""
        MATCH (start_node:Node {id: "%s"})
        MATCH (end_node:Node {id: "%s"})
        CREATE (start_node)-[:NEXT_BAR]->(end_node);
    """ % (identifier - 1, identifier))

How to return top n biggest cluster in Neo4j?

In my database, the graph looks something like this:
I want to find the 3 biggest clusters in my data. A cluster is a collection of nodes connected to each other; the direction of the connection is not important. As can be seen from the picture, the expected result should have 3 clusters with sizes 3, 2 and 2 respectively.
Here is what I came up with so far:
MATCH (n)
RETURN n, size((n)-[*]-()) AS cluster_size
ORDER BY cluster_size DESC
LIMIT 100
However, it has 2 problems:
I think the query is wrong because the size() function does not return the number of nodes in a cluster, as I want, but the number of sub-graphs matching the pattern instead.
The LIMIT clause limits the number of nodes returned rather than selecting the top clusters. That's why I put 100 there.
What should I do now? I'm stuck :( Thank you for your help.
UPDATE
Thanks to Bruno Peres' answer, I'm able to try algo.unionFind query in Neo4j Graph Algorithm. I can find the size of my connected components using this query:
CALL algo.unionFind.stream()
YIELD nodeId,setId
RETURN setId,count(*) as size_of_component
ORDER BY size_of_component DESC LIMIT 20;
And here is the result:
But that's all I know. I cannot get any information about the nodes in each component to visualize them. The collect(nodeId) takes forever because the top 2 components are too large. And I know it doesn't make sense to visualize those large components, but how about the third one? 235 nodes are fine to render.
I think you are looking for Connected Components. The section about connected components in the Neo4j Graph Algorithms User Guide says:
Connected Components or UnionFind basically finds sets of connected
nodes where each node is reachable from any other node in the same
set. In graph theory, a connected component of an undirected graph is
a subgraph in which any two vertices are connected to each other by
paths, and which is connected to no additional vertices in the graph.
If this is your case you can install Neo4j Graph Algorithms and use algo.unionFind. I reproduced your scenario with this sample data set:
create (x), (y),
(a), (b), (c),
(d), (e),
(f), (g),
(a)-[:type]->(b), (b)-[:type]->(c), (c)-[:type]->(a),
(d)-[:type]->(e),
(f)-[:type]->(g)
Then running algo.unionFind:
// call unionFind procedure
CALL algo.unionFind.stream('', ':type', {})
YIELD nodeId,setId
// groupBy setId, storing all node ids of the same set id into a list
WITH setId, collect(nodeId) as nodes
// order by the size of nodes list descending
ORDER BY size(nodes) DESC
LIMIT 3 // limiting to 3
RETURN setId, nodes
The result will be:
╒═══════╤══════════╕
│"setId"│"nodes" │
╞═══════╪══════════╡
│2 │[11,12,13]│
├───────┼──────────┤
│5 │[14,15] │
├───────┼──────────┤
│7 │[16,17] │
└───────┴──────────┘
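To make the idea behind the procedure concrete, here is a minimal union-find sketch in plain Python over the same sample data. It is an illustration of the general technique only, not the code that algo.unionFind runs; the function name `connected_components` and the use of node names instead of internal ids are choices made for this example.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into connected components, ignoring edge direction."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the root, halving the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two sets

    groups = defaultdict(list)
    for node in nodes:
        groups[find(node)].append(node)
    # Largest components first; sorted() is stable, so ties keep input order.
    return sorted(groups.values(), key=len, reverse=True)


nodes = ["x", "y", "a", "b", "c", "d", "e", "f", "g"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e"), ("f", "g")]

top3 = connected_components(nodes, edges)[:3]
print(top3)
```

The three largest groups have sizes 3, 2 and 2, the same shape as the result table above; the isolated nodes x and y form components of size 1 and fall outside the top 3.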
EDIT
From comments:
how can I get all nodeId of a specific setId? For example, from my
screenshot above, how can I get all nodeId of the setId 17506? That
setId has 235 nodes and I want to visualize them.
Run CALL algo.unionFind('', ':type', {write:true, partitionProperty:"partition"}) YIELD nodes RETURN *. This statement will create a partition property on each node, containing the ID of the partition the node is part of.
Run this statement to get the top 3 partitions:
match (node)
with node.partition as partition, count(node) as ct order by ct desc
limit 3 return partition, ct
Now you can get all nodes of each of the top 3 partitions individually with match (node {partition : 17506}) return node, using a partition ID returned by the second query.

Traverse Graph With Directed Cycles using Relationship Properties as Filters

I have a Neo4j graph with directed cycles. I have had no issue finding all descendants of A assuming I don't care about loops using this Cypher query:
match (n:TEST{name:"A"})-[r:MOVEMENT*]->(m:TEST)
return n,m,last(r).movement_time
The relationships between my nodes have a timestamp property on them, movement_time. I've simulated that in my test data below using numbers that I've imported as floats. I would like to traverse the graph using the timestamp as a constraint. Only follow relationships that have a greater movement_time than the movement_time of the relationship that brought us to this node.
Here is the CSV sample data:
from,to,movement_time
A,B,0
B,C,1
B,D,1
B,E,1
B,X,2
E,A,3
Z,B,5
C,X,6
X,A,7
D,A,7
Here is what the graph looks like:
I would like to calculate the descendants of every node in the graph and include the timestamp from the last relationship using Cypher; so I'd like my output data to look something like this:
Node:[{Descendant,Movement Time},...]
A:[{B,0},{C,1},{D,1},{E,1},{X,2}]
B:[{C,1},{D,1},{E,1},{X,2},{A,7}]
C:[{X,6},{A,7}]
D:[{A,7}]
E:[{A,3}]
X:[{A,7}]
Z:[{B,5}]
This non-Neo4J implementation looks similar to what I'm trying to do: Cycle enumeration of a directed graph with multi edges
This one is not 100% what you want, but very close:
MATCH (n:TEST)-[r:MOVEMENT*]->(m:TEST)
WITH n, m, r, [x IN range(0,length(r)-2) |
(r[x+1]).movement_time - (r[x]).movement_time] AS deltas
WHERE ALL (x IN deltas WHERE x>0)
RETURN n, collect(m), collect(last(r).movement_time)
ORDER BY n.name
We basically find all the paths between any of your nodes (beware: cartesian products get very expensive on non-trivial datasets). In the WITH we build a collection, deltas, that holds the differences between each pair of subsequent movement_time properties.
The WHERE applies an ALL predicate to filter out those having any non-positive value - aka we guarantee increasing values of movement_time along the path.
The RETURN then just assembles the results - but not as a map, instead one collection for the reachable nodes and the last value of movement_time.
The current issue is that we have duplicates since e.g. there are multiple paths from B to A.
As a general note: this problem can be solved more elegantly and with better performance by using the Java traversal API (http://neo4j.com/docs/stable/tutorial-traversal.html). There you would have a PathExpander that skips paths with decreasing movement_time early, instead of collecting all paths and filtering afterwards (as the Cypher query does).
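The early-pruning idea can be sketched outside Neo4j as well. The following Python example is not the Java traversal API; it is a stand-alone illustration over the CSV sample data from the question, with made-up function names (`load_edges`, `descendants`). It only follows a relationship when its movement_time is strictly greater than that of the relationship used to arrive at the current node, and it reports each distinct (descendant, movement_time) pair once, which also avoids the duplicates mentioned above.

```python
import csv
import io
from collections import defaultdict

CSV_DATA = """from,to,movement_time
A,B,0
B,C,1
B,D,1
B,E,1
B,X,2
E,A,3
Z,B,5
C,X,6
X,A,7
D,A,7
"""


def load_edges(text):
    """Build an adjacency list: node -> [(target, movement_time), ...]."""
    graph = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        graph[row["from"]].append((row["to"], float(row["movement_time"])))
    return graph


def descendants(graph, start):
    """Return all (node, movement_time) pairs reachable with increasing times."""
    seen = set()
    stack = [(start, float("-inf"))]  # no incoming relationship at the start
    while stack:
        node, last_time = stack.pop()
        for target, time in graph.get(node, []):
            # Prune early: never follow a relationship that goes back in time.
            if time > last_time and (target, time) not in seen:
                seen.add((target, time))
                stack.append((target, time))
    return seen


graph = load_edges(CSV_DATA)
for name in ["C", "D", "E", "X", "Z"]:
    print(name, sorted(descendants(graph, name)))
```

Because the times must strictly increase along a path, the traversal always terminates even though the graph contains cycles. A start node can appear among its own descendants here (for example, A is reached again from A); whether to keep or drop such entries depends on the output you want.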

Find path for N levels with repeating pattern of directional relationships in Neo4J

I'm trying to use Neo4j to analyze relationships in a family tree. I've modeled it like so:
(p1:Person)-[:CHILD]->(f:Family)<-[:FATHER|MOTHER]-(p2)
I know I could have left out the family label and just had children connected to each parent, but that's not practical for my purposes. Here's an example of my graph and the black line is the path I want it to generate:
I can query for it with
MATCH p=(n {personID:3})-[:CHILD]->()<-[:FATHER|MOTHER]-()-[:CHILD]->()<-[:FATHER|MOTHER]-()-[:CHILD]->()<-[:FATHER|MOTHER]-() RETURN p
but there's a repeating pattern to the relationships. Could I do something like:
MATCH p=(n {personID:3})(-[:CHILD]->()<-[:FATHER|MOTHER]-())* RETURN p
where the * means repeat the :CHILD then :FATHER|MOTHER relationships, with the directions being different? Obviously if the relationships were all the same direction, I could use
-[:CHILD|FATHER|MOTHER*]->
I want to be able to query it from Person #3 all the way to the top of the graph like a pedigree chart, but also be specific about how many levels if needed (like 3 generations, as opposed to end-of-line).
Another issue I'm having with this, is if I don't put directions on the relationships like -[:CHILD|FATHER|MOTHER*]-, then it will start at Person #3, and go both in the direction I want (alternating arrows), but also descend back down the chain finding all the other "cousins, aunts, uncles, etc.".
Are there any seasoned Cypher experts who can help me?
I am facing the same problem, and I found that the APOC path expander procedures accomplish exactly what we want.
Applied to your example, you could use apoc.path.subgraphNodes to get all ancestors of Person #3:
MATCH (p1:Person {personID:3})
CALL apoc.path.subgraphNodes(p1, {
sequence: '>Person,CHILD>,Family,<MOTHER|<FATHER'
}) YIELD node
RETURN node
Or, if you want only ancestors up to 3 generations from the start person, add maxLevel: 6 to the config (as one generation is defined by 2 relationships, 3 generations are 6 levels):
MATCH (p1:Person {personID:3})
CALL apoc.path.subgraphNodes(p1, {
sequence: '>Person,CHILD>,Family,<MOTHER|<FATHER',
maxLevel: 6
}) YIELD node
RETURN node
And if you want only ancestors of 3rd generation, i.e. only great-grandparents, you can also specify minLevel (using apoc.path.expandConfig):
MATCH (p1:Person {personID:3})
CALL apoc.path.expandConfig(p1, {
sequence: '>Person,CHILD>,Family,<MOTHER|<FATHER',
minLevel: 6,
maxLevel: 6
}) YIELD path
WITH last(nodes(path)) AS person
RETURN person
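The level arithmetic (2 relationships per generation) may be easier to see in a small model. The Python sketch below is not APOC; it walks a hypothetical family tree stored in two dictionaries, alternating one CHILD step up to the family with one FATHER|MOTHER step to the parents. The function name `ancestors`, the dictionaries and all ids are invented for illustration, and it assumes each person is a child in at most one family and that the data contains no cycles.

```python
def ancestors(child_of, parents_of, person, max_generations=None):
    """Return (generation, ancestor) pairs, nearest generation first.

    child_of: person -> family, i.e. (person)-[:CHILD]->(family)
    parents_of: family -> list of persons, i.e. (family)<-[:FATHER|MOTHER]-(person)
    """
    result = []
    frontier = [person]
    generation = 0
    while frontier and (max_generations is None or generation < max_generations):
        next_frontier = []
        for current in frontier:
            family = child_of.get(current)  # step 1: CHILD relationship
            if family is None:
                continue  # top of the known pedigree
            for parent in parents_of.get(family, []):  # step 2: FATHER|MOTHER
                result.append((generation + 1, parent))
                next_frontier.append(parent)
        frontier = next_frontier
        generation += 1
    return result


# Hypothetical pedigree for person 3.
child_of = {3: "F1", 1: "F2", 2: "F3", 4: "F4"}
parents_of = {"F1": [1, 2], "F2": [4, 5], "F3": [6, 7], "F4": [8, 9]}

print(ancestors(child_of, parents_of, 3))                     # all ancestors
print(ancestors(child_of, parents_of, 3, max_generations=1))  # parents only
great_grandparents = [p for g, p in ancestors(child_of, parents_of, 3, 3) if g == 3]
print(great_grandparents)
```

Here max_generations plays the role of maxLevel divided by 2, and filtering on the generation number corresponds to setting minLevel equal to maxLevel.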
You could reverse the directionality of the CHILD relationships in your model, as in:
(p1:Person)<-[:CHILD]-(f:Family)<-[:FATHER|MOTHER]-(p2)
This way, you can use a simple -[:CHILD|FATHER|MOTHER*]-> pattern in your queries.
Reversing the directionality is actually intuitive as well, since you can then more naturally visualize the graph as a family tree, with all the arrows flowing "downwards" from ancestors to descendants.
Yeah, that's an interesting case. I'm pretty sure (though I'm open to correction) that this is just not possible. Would it be possible for you to have and maintain both? You could have a simple cypher query create the extra relationships:
MATCH (parent)-[:MOTHER|FATHER]->()<-[:CHILD]-(child)
CREATE (child)-[:CHILD_OF]->(parent)
Ok, so here's a thought:
MATCH path=(child:Person {personID: 3})-[:CHILD|FATHER|MOTHER*]-(ancestor:Person)
WHERE (ancestor)-[:MOTHER|FATHER]->()
RETURN path
Normally I'd use a second clause in the MATCH like this:
MATCH
path=(child:Person {personID: 3})-[:CHILD|FATHER|MOTHER*]-(ancestor:Person),
(ancestor)-[:MOTHER|FATHER]->()
RETURN path
But Neo4j (at least by default, I think) doesn't traverse back through the path. Maybe comma-separating would be fine and this would be a problem:
MATCH path=(child:Person {personID: 3})-[:CHILD|FATHER|MOTHER]-(ancestor:Person)-[:MOTHER|FATHER]->()
I'm curious to know what you find!

Resources