Creating edges in neo4j based on query results - graph

I'm modelling a search term transition graph in a e-commerce software as a graph of nodes (terms) and edges (transitions). If a user types e.g. iphone in the search bar and then refines the query to iphone 6s this will be modeled as two nodes and a edge between those nodes. The same transition of terms of the different users will result in several edges between the nodes.
I'd now like to create an edge with a cumulated weight of 4 to represent that 4 users did this specific transition. How can I combine the results of a count(*) query with a create query to produce an edge with a property weight = 4
My count(*) query is:
MATCH (n:Term)-[r]->(n1:Term)
RETURN type(r), count(*)
I'd expect the combined query to look like this, but this kind of sql like composition seems not to be possible in cypher:
MATCH (n:Term), (n1:Term)
WHERE (n)-[tr:TRANSITION]->(n1)
CREATE (n)-[actr:ACC_TRANSITION {count:
MATCH (n:Term)-[r]->(n1:Term) RETURN
count(*)}
]->(n1)
RETURN n, n1
A non generic query to produce the accumulated transition that works is:
MATCH (n:Term), (n1:Term)
WHERE n.term = 'iphone' AND n1.term ='iphone 6s'
CREATE (n)-[actr:ACC_TRANSITION {count: 4}]->(n1)
RETURN n, n1
Any other ideas on how to approach and model this problem?

Use WITH like this:
MATCH (n:Term)-[r]->(n1:Term)
WITH n as n, count(*) as rel_count, n1
CREATE (n)-[:ACC_TRANSITION {count:rel_count}]->(n1)
RETURN n, n1

If you match the nodes and relationship first and then use set, you will not produce duplicate nodes or relationships
Match (n:Term)-[r]->(n1.Term)
with n as nn,count(r) as rel_count,n1 as nn1
set r.ACC_TRANSITION=rel_count
return nn,nn1,r
The create function will create duplicates.

Related

How to return top n biggest cluster in Neo4j?

in my database, the graph looks somehow like this:
I want to find the top 3 biggest cluster in my data. A cluster is a collection of nodes connected to each other, the direction of the connection is not important. As can be seen from the picture, the expected result should have 3 clusters with size 3 2 2 respectively.
Here is what I came up with so far:
MATCH (n)
RETURN n, size((n)-[*]-()) AS cluster_size
ORDER BY cluster_size DESC
LIMIT 100
However, it has 2 problems:
I think the query is wrong because the size() function does not return the number of nodes in a cluster as I want, but the number of sub-graph matching the pattern instead.
The LIMIT clause limits the number of nodes to return, not taking the top result. That's why I put 100 there.
What should I do now? I'm stuck :( Thank you for your help.
UPDATE
Thanks to Bruno Peres' answer, I'm able to try algo.unionFind query in Neo4j Graph Algorithm. I can find the size of my connected components using this query:
CALL algo.unionFind.stream()
YIELD nodeId,setId
RETURN setId,count(*) as size_of_component
ORDER BY size_of_component DESC LIMIT 20;
And here is the result:
But that's all I know. I cannot get any information about the nodes in each component to visualize them. The collect(nodeId) takes forever because the top 2 components are too large. And I know it doesn't make sense to visualize those large components, but how about the third one? 235 nodes are fine to render.
I think you are looking for Connected Componentes. The section about connected components of Neo4j Graph Algorithms User Guide says:
Connected Components or UnionFind basically finds sets of connected
nodes where each node is reachable from any other node in the same
set. In graph theory, a connected component of an undirected graph is
a subgraph in which any two vertices are connected to each other by
paths, and which is connected to no additional vertices in the graph.
If this is your case you can install Neo4j Graph Algorithms and use algo.unionFind. I reproduced your scenario with this sample data set:
create (x), (y),
(a), (b), (c),
(d), (e),
(f), (g),
(a)-[:type]->(b), (b)-[:type]->(c), (c)-[:type]->(a),
(d)-[:type]->(e),
(f)-[:type]->(g)
Then running algo.unionFind:
// call unionFind procedure
CALL algo.unionFind.stream('', ':type', {})
YIELD nodeId,setId
// groupBy setId, storing all node ids of the same set id into a list
WITH setId, collect(nodeId) as nodes
// order by the size of nodes list descending
ORDER BY size(nodes) DESC
LIMIT 3 // limiting to 3
RETURN setId, nodes
The result will be:
╒═══════╤══════════╕
│"setId"│"nodes" │
╞═══════╪══════════╡
│2 │[11,12,13]│
├───────┼──────────┤
│5 │[14,15] │
├───────┼──────────┤
│7 │[16,17] │
└───────┴──────────┘
EDIT
From comments:
how can I get all nodeId of a specific setId? For example, from my
screenshot above, how can I get all nodeId of the setId 17506? That
setId has 235 nodes and I want to visualize them.
Run call CALL algo.unionFind('', ':type', {write:true, partitionProperty:"partition"}) YIELD nodes RETURN *. This statement will create apartition` property for each node, containing the partition ID the node is part of.
Run this statement to get the top 3 partitions: match (node)
with node.partition as partition, count(node) as ct order by ct desc
limit 3 return partition, ct.
Now you can get all nodes of each top 3 partitions individually with match (node {partition : 17506}) return node, using the partition ID returned in the second query.

Recursive multiple relationships

I'm attempting to recursively perform alternate match statements with 2 specific relationships.
For example, Pets are owned by a Person. Pets LIKE other people (not owner) Those people have pets owned by them, who like other people etc.
match (n.Person {id.123})<-[r.OwnedBy]-(p.Pet) Return n, r, p
match (p.Pet {id.123})-[r.Likes]->(n.Person) Return p, r, n
Notice the directional relationships involved - #1 is backwards, #2 is forwards.
What I want to do is to, given a person(id),
1. Display pets [OwnedBy] this person(id)
2. Display people [Liked] by those pets
3. Display pets [OwnedBy] the people in 2.
etc. recursively
Independently, these Match statements work. together, they do not.
I tried adding the 2nd match statement, using different variables, then it will go down 2 levels and stop.
In the real data set, there are dozens of nodes and relationships. I'm trying to limit the display to a 'tree' view of only these 2 relationships/nodes.
Thanks!
How about this?
match (n:Person {id:123})<-[:OwnedBy]-(p:Pet)-[:Likes]->(n2:Person)<-[:OwnedBy]-(p2:Pet)
return n, collect(distinct p) as pets, collect(distinct n2) as peopleLiked, collect(distinct p2) as petsOfPeopleLiked
Though if you're only interested in the graph display, this should work:
match path = (n:Person {id:123})<-[:OwnedBy]-(p:Pet)-[:Likes]->(n2:Person)<-[:OwnedBy]-(p2:Pet)
return path, n, p, n2, p2
You can also utilize APOC Procedures. This can handle showing these paths, using only these two types of relationships:
match (n:Person {id:123})
call apoc.path.expandConfig(n, {relationshipFilter:'<OwnedBy|Likes>'}) yield path
return path

Traverse Graph With Directed Cycles using Relationship Properties as Filters

I have a Neo4j graph with directed cycles. I have had no issue finding all descendants of A assuming I don't care about loops using this Cypher query:
match (n:TEST{name:"A"})-[r:MOVEMENT*]->(m:TEST)
return n,m,last(r).movement_time
The relationships between my nodes have a timestamp property on them, movement_time. I've simulated that in my test data below using numbers that I've imported as floats. I would like to traverse the graph using the timestamp as a constraint. Only follow relationships that have a greater movement_time than the movement_time of the relationship that brought us to this node.
Here is the CSV sample data:
from,to,movement_time
A,B,0
B,C,1
B,D,1
B,E,1
B,X,2
E,A,3
Z,B,5
C,X,6
X,A,7
D,A,7
Here is what the graph looks like:
I would like to calculate the descendants of every node in the graph and include the timestamp from the last relationship using Cypher; so I'd like my output data to look something like this:
Node:[{Descendant,Movement Time},...]
A:[{B,0},{C,1},{D,1},{E,1},{X,2}]
B:[{C,1},{D,1},{E,1},{X,2},{A,7}]
C:[{X,6},{A,7}]
D:[{A,7}]
E:[{A,3}]
X:[{A,7}]
Z:[{B,5}]
This non-Neo4J implementation looks similar to what I'm trying to do: Cycle enumeration of a directed graph with multi edges
This one is not 100% what you want, but very close:
MATCH (n:TEST)-[r:MOVEMENT*]->(m:TEST)
WITH n, m, r, [x IN range(0,length(r)-2) |
(r[x+1]).movement_time - (r[x]).movement_time] AS deltas
WHERE ALL (x IN deltas WHERE x>0)
RETURN n, collect(m), collect(last(r).movement_time)
ORDER BY n.name
We basically find all the paths between any of your nodes (beware cartesian products get very expensive on non-trivial datasets). In the WITH we're building a collection delta's that holds the difference between two subsequent movement_time properties.
The WHERE applies an ALL predicate to filter out those having any non-positive value - aka we guarantee increasing values of movement_time along the path.
The RETURN then just assembles the results - but not as a map, instead one collection for the reachable nodes and the last value of movement_time.
The current issue is that we have duplicates since e.g. there are multiple paths from B to A.
As a general notice: this problem is much more elegantly and more performant solvable by using Java traversal API (http://neo4j.com/docs/stable/tutorial-traversal.html). Here you would have a PathExpander that skips paths with decreasing movement_time early instead of collection all and filter out (as Cypher does).

Cypher query to stop graph traversal when reaching a hub

I have a graph database that contains highly connected nodes (hubs). These nodes can have more than 40000 relationships.
When I want to traverse the graph starting from a node, I would like to stop traversal at these hubs not to retrieve too many nodes.
I think I should use aggregation function and conditional stop based on the count of relationship for each node, but I didn't manage to write the good cypher query.
I tried:
MATCH p=(n)-[r*..10]-(m)
WHERE n.name='MyNodeName' AND ALL (x IN nodes(p) WHERE count(x) < 10)
RETURN p;
and also:
MATCH (n)-[r*..10]-(m) WHERE n.name='MyNodeName' AND COUNT(r) < 10 RETURN p;
I think you can't stop the query at some node if you MATCH a path of length 10. You could count the number of relationships for all nodes in the path, but only after the path is matched.
You could solve this by adding an additional label to the hub nodes and filter that in your query:
MATCH (a:YourLabel)
OPTIONAL MATCH (a)-[r]-()
WITH a, count(r) as count_rels
CASE
WHEN count_rels > 20000
THEN SET a :Hub
END
Your query:
MATCH p=(n)-[r*..10]-(m)
WHERE n.name='MyNodeName' AND NONE (x IN nodes(p) WHERE x:Hub)
RETURN p
I used this approach in a similar case.
Since Neo4j 2.2 there is a cool trick to use the internal getDegree() function to determine if a node is a dense node.
You also forgot the label (and probably index) for n
For your case that would mean:
MATCH p=(n:Label)-[r*..10]-(m)
WHERE n.name='MyNodeName' AND size((m)--()) < 10
RETURN p;

MATCH on nodes in a two-dimensional COLLECTION in Neo4j / Cypher

In order to limit traversal through a Neo4j graph db, I am collecting scores from a subgraph. Imagine this (simplified)
MATCH (a)-[:r1 {prop1:123}]->()-[]->()-[]->()-[]->(b {prop2:456})
WITH b,b.prop2*r1.prop1 as score ORDER BY score DESC LIMIT 10
WITH COLLECT ([b,score]) AS bscore
so far, so good. To avoid the long traversal, I want to limit the next match to the nodes b stored in bscore and sum the scores in bscore[1], but I couldn't find the correct syntax. Even wondering whether it's possible in cypher. Conceptually I'd like to do this:
MATCH bscore[0]-[:r2]->(c)
RETURN c, SUM(bscore[1])
Any hints/ pointers highly appreciated.
Could you do something like this perhaps?
MATCH (a)-[:r1 {prop1:123}]->()-[]->()-[]->()-[]->(b {prop2:456})
WITH b,b.prop2*r1.prop1 as score ORDER BY score DESC LIMIT 10
MATCH b-[:r2]->(c)
RETURN c, sum(score)

Resources