Neo4j: match with multiple relations in timely manner - graph

Consider nodes that are connected to each other by two types of edges: direct and intersect. The query needs to discover all possible paths between two nodes that satisfy all of the following rules:
0..N direct edges
0..1 intersect edge
intersect edge can be between direct edges
These paths are considered valid between nodeA and nodeZ:
(nodeA)-[:direct]->(nodeB)-[:direct]->(nodeC)-[:direct]->(nodeZ)
(nodeA)-[:intersect]->(nodeB)-[:direct]->(nodeC)-[:direct]->(nodeZ)
(nodeA)-[:direct]->(nodeB)-[:intersect]->(nodeC)-[:direct]->(nodeZ)
(nodeA)-[:direct]->(nodeB)-[:direct]->(nodeC)-[:intersect]->(nodeZ)
Basically, the intersect edge can occur anywhere in the path, but only once.
My ideal Cypher query, in a non-existent Neo4j version, would be this:
MATCH (from)-[:direct*0..N|:intersect*0..1]->(to)
But Neo4j doesn't support multiple constraints on edge types :(.
UPDATE 23.04.16
There are 6609 nodes (out of 550k total), 5184 edges of type direct (out of 440k total) and 34119 of type intersect (out of 37289 total). Some circular references are expected (which Neo4j avoids, doesn't it?).
The query that looked promising but failed to finish in a matter of seconds:
MATCH p = (from {from: 1})-[:direct|intersect*0..]->(to {to: 99})
WHERE
123 < from.departureTS < 123 + 86400 //next day
AND REDUCE(s = 0, x IN RELATIONSHIPS(p) | CASE TYPE(x) WHEN 'intersect' THEN s + 1 ELSE s END) <= 1
return p;

Here is a query that conforms to the stated requirements:
MATCH p = (from)-[:direct|intersect*0..]->(to)
WHERE REDUCE(s = 0, x IN RELATIONSHIPS(p) |
CASE WHEN TYPE(x) = 'intersect' THEN s + 1 ELSE s END) <= 1
return p;
It returns all paths with 0 or more direct relationships and 0 or 1 intersect relationships.
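To make the rule easier to reason about outside of Cypher, here is a minimal Python sketch of the same constraint. It is not how Neo4j evaluates the query; it is a toy depth-first search over a hypothetical in-memory graph (the `Graph` type, `find_paths` and `toy_graph` are made up for illustration). It uses each edge at most once per path and stops extending a path as soon as a second intersect edge would be added.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy graph: node -> list of (edge_id, edge_type, target_node).
Graph = Dict[str, List[Tuple[str, str, str]]]


def find_paths(graph: Graph, start: str, goal: str,
               max_intersect: int = 1) -> List[List[str]]:
    """Return every path (as a list of edge ids) from start to goal that
    uses each edge at most once and at most max_intersect intersect edges."""
    results: List[List[str]] = []

    def walk(node: str, used: Set[str], path: List[str], intersects: int) -> None:
        if node == goal:
            results.append(list(path))
        for edge_id, edge_type, target in graph.get(node, []):
            if edge_id in used:
                continue  # never reuse an edge within one path
            extra = 1 if edge_type == "intersect" else 0
            if intersects + extra > max_intersect:
                continue  # prune: this would exceed the intersect budget
            used.add(edge_id)
            path.append(edge_id)
            walk(target, used, path, intersects + extra)
            path.pop()
            used.remove(edge_id)

    walk(start, set(), [], 0)
    return results


toy_graph: Graph = {
    "A": [("d1", "direct", "B"), ("i1", "intersect", "B")],
    "B": [("d2", "direct", "Z"), ("i2", "intersect", "Z")],
}
paths = find_paths(toy_graph, "A", "Z")
print(paths)  # the path i1, i2 is excluded because it has two intersect edges
```

The point of the sketch is the early `continue`: counting intersect edges while expanding keeps invalid paths from ever being built, whereas a filter applied to finished paths has to look at all of them first.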

This will do what you want:
// Cybersam's correction:
MATCH p = ((from)-[:direct*0..]->(middle)-[:intersect*0..1]->(middle2)-[:direct*0..]->(to)) return DISTINCT p;
Here's the test scenario I used:
create (a:nodeA {name: "A"})
create (b:nodeB {name: "B"})
create (c:nodeC {name: "C"})
create (z:nodeZ {name: "Z"})
merge (a)-[:direct {name: "D11"}]->(b)-[:direct {name: "D21"}]->(c)-[:direct {name: "D31"}]->(z)
merge (a)-[:intersect {name: "I12"}]->(b)-[:direct {name: "D22"}]->(c)-[:direct {name: "D32"}]->(z)
merge (a)-[:direct {name: "D13"}]->(b)-[:intersect {name: "I23"}]->(c)-[:direct {name: "D33"}]->(z)
merge (a)-[:direct {name: "D14"}]->(b)-[:direct {name: "D24"}]->(c)-[:intersect {name: "I34"}]->(z)
merge (a)-[:intersect {name: "I15"}]->(z)
// Cybersam's correction:
MATCH p = ((from)-[:direct*0..]->(middle)-[:intersect*0..1]->(middle2)-[:direct*0..]->(to)) return DISTINCT p;
I made the mistake of thinking the graph view in the browser reflected the data that was returned in "p". It did not; you have to look at the "rows" part of the report to get all the details.
This query will also return single nodes, which fits the requirements.

Related

How is the number of random walks determined in GDS/Neo4j?

I am running the random walk algorithm on my Neo4j graph named 'example', with the minimum allowed walk length (2) and walks per node (1). Namely,
CALL gds.beta.randomWalk.stream(
'example',
{
walkLength: 2,
walksPerNode: 1,
randomSeed: 42,
concurrency: 1
}
)
YIELD nodeIds, path
RETURN nodeIds, [node IN nodes(path) | node.name ] AS event_name
And I get 41 walks. How is this number determined? I checked the graph and it contains 161 nodes and 574 edges. Any insights?
Added later: Here is more info on the projected graph that I am constructing. Basically, I am filtering on nodes and relationships and just projecting the subgraph and doing nothing else. Here is the code -
// Filter for only IDH Codel recurrent events
WITH [path=(m:IDHcodel)--(n:Tissue)
WHERE (m.node_category = 'molecular' AND n.event_class = 'Recurrence')
AND NOT EXISTS((m)--(:Tissue{event_class:'Primary'})) | m] AS recur_events
// Obtain the sub-network with 2 or more patients in edges
MATCH p=(m1)-[r:hasIDHcodelPatients]-(m2)
WHERE (m1 IN recur_events AND m2 IN recur_events AND r.total_common_patients >= 2)
WITH COLLECT(p) AS all_paths
WITH [p IN all_paths | nodes(p)] AS path_nodes, [p IN all_paths | relationships(p)] AS path_rels
WITH apoc.coll.toSet(apoc.coll.flatten(path_nodes)) AS subgraph_nodes, apoc.coll.flatten(path_rels) AS subgraph_rels
// Form the GDS Cypher projection
CALL gds.graph.create.cypher(
'example',
'MATCH (n) where n in $sn RETURN id(n) as id',
'MATCH ()-[r]-() where r in $sr RETURN id(startNode(r)) as source , id(endNode(r)) as target, { LINKS: { orientation: "UNDIRECTED" } }',
{parameters: {sn: subgraph_nodes, sr: subgraph_rels} }
)
YIELD graphName AS graph, nodeQuery, nodeCount AS nodes, relationshipQuery, relationshipCount AS rels
RETURN graph, nodes, rels
Thanks.
It seems that the documentation is missing a description of the sourceNodes parameter, which would tell you how many walks will be created.
We don't know its default value, but we can use the parameter to set the source nodes that the walks should start from.
For example, you could treat every node in the graph as a source node (a random walk will start from each of them).
MATCH (n)
WITH collect(n) AS nodes
CALL gds.beta.randomWalk.stream(
'example',
{ sourceNodes:nodes,
walkLength: 2,
walksPerNode: 1,
randomSeed: 42,
concurrency: 1
}
)
YIELD nodeIds, path
RETURN nodeIds, [node IN nodes(path) | node.name ] AS event_name
This way you should get 161 walks as there are 161 nodes in your graph and the walksPerNode is set to 1, so a single random walk will start from every node in the graph. In essence, the number of source nodes times the walks per node will determine the number of random walks.
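The counting argument above can be illustrated with a small Python sketch. This is a toy model of a random walk, not the GDS implementation; `random_walks`, the adjacency dictionary and the convention that `walk_length` counts nodes are all assumptions made for the example.

```python
import random
from typing import Dict, List


def random_walks(adjacency: Dict[str, List[str]], source_nodes: List[str],
                 walk_length: int, walks_per_node: int,
                 seed: int = 42) -> List[List[str]]:
    """Start walks_per_node walks from every source node (toy model)."""
    rng = random.Random(seed)
    walks: List[List[str]] = []
    for source in source_nodes:
        for _ in range(walks_per_node):
            walk = [source]
            while len(walk) < walk_length:
                neighbours = adjacency.get(walk[-1], [])
                if not neighbours:
                    break  # dead end: the walk stops early
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks


adjacency = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
walks = random_walks(adjacency, list(adjacency), walk_length=2, walks_per_node=1)
print(len(walks))  # one walk per source node
```

In this model the number of walks is always `len(source_nodes) * walks_per_node`, which is the relationship the answer describes.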

Creating a subgraph using Cypher projection

I am trying to create a subgraph of my graph using a Cypher projection because I want to use the GDS library. First, I create the subgraph using a Cypher query, which works perfectly fine. Here is the query:
// Filter for only recurrent events
WITH [path=(m:IDHcodel)--(n:Tissue)
WHERE (m.node_category = 'molecular' AND n.event_class = 'Recurrence')
AND NOT EXISTS((m)--(:Tissue{event_class:'Primary'})) | m] AS recur_events
// Obtain the sub-network with 2 or more patients in edges
MATCH p=(m1)-[r:hasIDHcodelPatients]->(m2)
WHERE (m1 IN recur_events AND m2 IN recur_events AND r.total_common_patients >= 2)
WITH COLLECT(p) AS all_paths
WITH [p IN all_paths | nodes(p)] AS path_nodes, [p IN all_paths | relationships(p)] AS path_rels
RETURN apoc.coll.toSet(apoc.coll.flatten(path_nodes)) AS subgraph_nodes, apoc.coll.flatten(path_rels) AS subgraph_rels
So far so good. Now all I am trying to do is a Cypher projection by sending the subgraph nodes and subgraph rels as parameters in the GDS create query and this gives me a null pointer exception:
// All the above lines, except using WITH instead of RETURN in the last line, i.e.,
...
WITH apoc.coll.toSet(apoc.coll.flatten(path_nodes)) AS subgraph_nodes, apoc.coll.flatten(path_rels) AS subgraph_rels
// Call gds library to create a graph by sending subgraph_nodes and subgraph_rels as parameters
CALL gds.graph.create.cypher(
'example',
'MATCH (n) where n in $sn RETURN id(n) as id',
'MATCH ()-[r]-() where r in $sr RETURN r.start as source , r.end as target',
{parameters: {sn: subgraph_nodes, sr: subgraph_rels} }
) YIELD graphName AS graph, nodeQuery, nodeCount AS nodes, relationshipQuery, relationshipCount AS rels
RETURN graph
What could be wrong? Thanks.
To access the start and end node of a relationship, the syntax is slightly different from what you are using:
WITH apoc.coll.toSet(apoc.coll.flatten(path_nodes)) AS subgraph_nodes, apoc.coll.flatten(path_rels) AS subgraph_rels
// Call gds library to create a graph by sending subgraph_nodes and subgraph_rels as parameters
CALL gds.graph.create.cypher(
'example',
'MATCH (n) where n in $sn RETURN id(n) as id',
'MATCH ()-[r]-() where r in $sr RETURN id(startNode(r)) as source , id(endNode(r)) as target',
{parameters: {sn: subgraph_nodes, sr: subgraph_rels} }
) YIELD graphName AS graph, nodeQuery, nodeCount AS nodes, relationshipQuery, relationshipCount AS rels
RETURN graph
This is what I noticed, hopefully this is the only error.

Adding new node connected to set of newly added nodes

I need a query which does the following things:
1) Insert a variable number of nodes if they don't exist
2) If there isn't already a node that has a relation to all nodes added in step 1, create this node and connect it to the nodes from step 1
The general idea is that the variable number of nodes describe a unique event which I want to aggregate by inserting the new node.
So if I first insert 4 nodes by this
MERGE (k:type1 {data: "data1"})
MERGE (a:type2 {data: "data2"})
MERGE (m:type3 {data: "data3l"})
MERGE (p:type4 {data: "data4s"})
WITH [a, m, p, k] AS myList
CALL apoc.lock.nodes(myList) // let's lock ahead this time
WITH head(myList) as first, myList
OPTIONAL MATCH (d:SomeLabel)-[:REL]->(first)
WHERE all(node in tail(myList) WHERE (d)-[:REL]->(node))
WITH first, myList
WHERE d IS NULL
MERGE (d:SomeLabel)-[:REL]->(first)
FOREACH (node in tail(myList) | MERGE (d)-[:REL]->(node))
If I change the first node the graph looks as expected:
MERGE (a:type2 {data: "data2"})
MERGE (m:type3 {data: "data3l"})
MERGE (p:type4 {data: "data4s"})
WITH [a, m, p, k] AS myList
CALL apoc.lock.nodes(myList) // let's lock ahead this time
WITH head(myList) as first, myList
OPTIONAL MATCH (d:SomeLabel)-[:REL]->(first)
WHERE all(node in tail(myList) WHERE (d)-[:REL]->(node))
WITH first, myList
WHERE d IS NULL
MERGE (d:SomeLabel)-[:REL]->(first)
FOREACH (node in tail(myList) | MERGE (d)-[:REL]->(node))
Correct graph
However when changing for example the content of the second node, a new common node is not added
MERGE (k:type1 {data: "data1"})
MERGE (a:type2 {data: "data22"})
MERGE (m:type3 {data: "data3l"})
MERGE (p:type4 {data: "data4s"})
WITH [a, m, p, k] AS myList
CALL apoc.lock.nodes(myList) // let's lock ahead this time
WITH head(myList) as first, myList
OPTIONAL MATCH (d:SomeLabel)-[:REL]->(first)
WHERE all(node in tail(myList) WHERE (d)-[:REL]->(node))
WITH first, myList
WHERE d IS NULL
MERGE (d:SomeLabel)-[:REL]->(first)
FOREACH (node in tail(myList) | MERGE (d)-[:REL]->(node))
Incorrect graph
Also, after this has been done I want to add another node connected to the new common node
I used @Graphileon's answer for my solution
MERGE (a:type2 {data: "data2"})
MERGE (m:type3 {data: "data3l"})
MERGE (p:type4 {data: "data4s"})
WITH [a, m, p, k] AS things
OPTIONAL MATCH (c:Collection)
WHERE apoc.coll.isEqualCollection([(c)-[:REL]->(thing)|thing],things)
WITH things,COALESCE(id(c),-1) AS idC, id(c) as center
CALL apoc.do.when(
idC = -1,
'CREATE (c:Collection) '
+ 'FOREACH(m IN $things | MERGE (c)-[:REL]->(m) ) '
+ 'MERGE (s:sample)-[:REL]->(c)',
'',
{things:things}
) YIELD value
RETURN value.node as node;
Did you consider this approach:
WITH ['a','b','c','d'] AS thingNames
FOREACH( thingName in thingNames |
MERGE (n:Thing {name:thingName})
)
WITH thingNames
MATCH (n:Thing) WHERE n.name IN thingNames
WITH COLLECT(n) AS things
OPTIONAL MATCH (c:Collection)
WHERE apoc.coll.isEqualCollection([(c)-[:REL]->(thing)|thing],things)
WITH things,c,COALESCE(id(c),-1) AS idC
CALL apoc.do.when(
idC = -1,
'CREATE (c:Collection) '
+' FOREACH(m IN $things | MERGE (c)-[:REL]->(m) )'
+' RETURN c',
'',
{things:things}
) YIELD value
RETURN COALESCE(c, value.c) AS collectionNode
It creates a new :Collection node for every new combination of Things.
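The idea behind the query (look for a collection whose members are exactly the given things, and only create one if none exists) can be sketched outside the database like this. The `CollectionRegistry` class is hypothetical and only models the logic; it uses a `frozenset` as the key, so unlike a multiset comparison it ignores duplicate members.

```python
from typing import Dict, FrozenSet, Iterable


class CollectionRegistry:
    """Toy model: one collection id per distinct set of member names."""

    def __init__(self) -> None:
        self._by_members: Dict[FrozenSet[str], int] = {}

    def get_or_create(self, things: Iterable[str]) -> int:
        key = frozenset(things)  # order of the members does not matter
        if key not in self._by_members:
            # No collection has exactly these members yet, so create one.
            self._by_members[key] = len(self._by_members) + 1
        return self._by_members[key]


registry = CollectionRegistry()
first = registry.get_or_create(["data1", "data2", "data3l", "data4s"])
same = registry.get_or_create(["data4s", "data3l", "data2", "data1"])
changed = registry.get_or_create(["data1", "data22", "data3l", "data4s"])
print(first, same, changed)
```

Comparing the whole member set, rather than checking each member separately, is what makes a changed second node produce a new common node.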

Dynamic generation of subtour elimination constraints in AMPL for a PVRP

I am trying to code a Periodic Vehicle Routing Problem with some inventory constraints in AMPL. I would like to add the subtour constraints dynamically. In order to do this I was inspired by this formulation for a TSP:
https://groups.google.com/d/msg/ampl/mVsFg4mAI1c/ZdfRHHRijfUJ
However, I cannot get it to eliminate subtours in my model. I used the following in my model file.
param T; # Number of time-periods
param V; # Number of vehicles
param F; # Number of fuel types
set P ordered; # Set of gas stations
param hpos {P} >= 0;
param vpos {P} >= 0;
set PAIRS := {p in P, j in P};
param dist {(p,j) in PAIRS}
:= sqrt((hpos[j]-hpos[p])**2 + (vpos[j]-vpos[p])**2);
# A binary variable to determine if an arc is traversed.
var H{(p,j) in PAIRS, v in 1..V, t in 1..T} binary;
# A binary variable to determine if a delivery of fuel is made to a station in a given time period.
var StationUsed{p in P, f in 1..F, v in 1..V, t in 1..T} binary;
minimize TransportationCost:
sum {(p,j) in PAIRS} sum {v in 1..V, t in 1..T} dist[p,j] * H[p,j,v,t];
param nSubtours >= 0 integer;
set SUB {1..nSubtours} within P;
subject to Subtour_Elimination {k in 1..nSubtours, m in SUB[k], v in 1..V, t in 1..T, f in 1..F}:
sum {p in SUB[k], j in P diff SUB[k]}
if (p,j) in PAIRS then H[p,j,v,t] else H[j,p,v,t] >=2 * StationUsed[m,f,v,t] ;
I added the StationUsed variable, as my problem unlike TSP does not have to visit all nodes in every timeperiod. H is my binary decision variable declaring whether a vehicle travels the arc (p,j) in a time period.
Then I used a formulation similar to the TSP in my run file:
set NEWSUB;
set EXTEND;
let nSubtours := 0;
repeat {
solve;
let NEWSUB := {};
let EXTEND := {member(ceil(Uniform(0,card(P))),P)};
repeat {
let NEWSUB := NEWSUB union EXTEND;
let EXTEND := {j in P diff NEWSUB: exists {p in NEWSUB, v in 1..V, t in 1..T}
((p,j) in PAIRS and H[p,j,v,t] = 1 or (j,p) in PAIRS and H[j,p,v,t] = 1)};
} until card(EXTEND) = 0;
if card(NEWSUB) < card(P) then {
let nSubtours := nSubtours + 1;
let SUB[nSubtours] := NEWSUB;
display SUB;
} else break;
};
# Display the routes
display {t in 1..T, v in 1..V}: {(p,j) in PAIRS} H[p,j,v,t];
I am not sure whether the above is applicable to my problem with multiple vehicles and multiple time periods. I have tried defining v and t in let EXTEND, as they are needed to use H, but I am not sure whether this is a correct method. My model runs when formulated as above; however, it does not eliminate the subtours. Do you have any suggestions in this regard?
ADDED QUESTION:
I found some inspiration in this model formulated in SAS/OR:
(A bit extensive to read and not necessary for my questions)
http://support.sas.com/documentation/cdl/en/ormpex/67518/HTML/default/viewer.htm#ormpex_ex23_sect009.htm
It eliminates subtours dynamically over d days and I figured it could be translated to my problem with multiple vehicles and multiple periods (days).
To specify my problem a little: a node can be visited only once, by one vehicle, within a time period. Not all nodes have to be visited in every time period, which is a major difference from the TSP formulation, where all nodes are in the cycle.
I tried with the following approach:
The constraint in the model file is the same as before.
set P ordered; # Set of nodes
set PAIRS := {p in P, j in P: ord(p) != ord(j)};
param nSubtours >= 0 integer;
param iter >= 0 integer;
set SUB {1..nSubtours} within P;
subject to Subtour_Elimination {s in 1..nSubtours, k in SUB[s], f in F, v in V, t in T}:
sum {p in SUB[s], j in P diff SUB[s]}
if (p,j) in PAIRS then H[p,j,v,t] else H[j,p,v,t] >= 2 * StationUsed[k,f,v,t];
My run file looks like this:
let nSubtours := 0;
let iter := 0;
param num_components {V, T};
set P_TEMP;
set PAIRS_SOL {1..iter, V, T} within PAIRS;
param component_id {P_TEMP};
set COMPONENT_IDS;
set COMPONENT {COMPONENT_IDS};
param cp;
param cj;
# loop until each day's and each vehicle's support graph is connected
repeat {
let iter := iter + 1;
solve;
# Find connected components for each day
for {v in V, t in T} {
let P_TEMP := {p in P: exists {f in F} StationUsed[p,f,v,t] > 0.5};
let PAIRS_SOL[iter, v, t] := {(p,j) in PAIRS: H[p, j, v, t] > 0.5};
# Set each node to its own component
let COMPONENT_IDS := P_TEMP;
let num_components[v, t] := card(P_TEMP);
for {p in P_TEMP} {
let component_id[p] := p;
let COMPONENT[p] := {p};
};
# If p and j are in different components, merge the two components
for {(p,j) in PAIRS_SOL[iter, v, t]} {
let cp := component_id[p];
let cj := component_id[j];
if cp != cj then {
# update smaller component
if card(COMPONENT[cp]) < card(COMPONENT[cj]) then {
for {k in COMPONENT[cp]} let component_id[k] := cj;
let COMPONENT[cj] := COMPONENT[cj] union COMPONENT[cp];
let COMPONENT_IDS := COMPONENT_IDS diff {cp};
} else {
for {k in COMPONENT[cj]} let component_id[k] := cp;
let COMPONENT[cp] := COMPONENT[cp] union COMPONENT[cj];
let COMPONENT_IDS := COMPONENT_IDS diff {cj};
};
};
};
let num_components[v, t] := card(COMPONENT_IDS);
display num_components[v, t];
# create subtour from each component not containing depot node
for {k in COMPONENT_IDS: 1 not in COMPONENT[k]} { #***
let nSubtours := nSubtours + 1;
let SUB[nSubtours] := COMPONENT[k];
display SUB[nSubtours];
};
};
display num_components;
} until (forall {v in V, t in T} num_components[v,t] = 1);
I get a lot of "invalid subscript discarded" errors when running the model:
Error at _cmdno 43 executing "if" command
(file amplin, line 229, offset 5372):
error processing set COMPONENT:
invalid subscript COMPONENT[4] discarded.
Error at _cmdno 63 executing "for" command
(file amplin, line 245, offset 5951):
error processing set COMPONENT:
invalid subscript COMPONENT[3] discarded.
(...)
Bailing out after 10 warnings.
I think the script is doing what I am looking for, but it stops when it has discarded 10 invalid subscripts.
When trying to debug I tested the second for loop.
for {p in P_TEMP} {
let component_id[p] := p;
let COMPONENT[p] := {p};
display component_id[p];
display COMPONENT[p];
};
This displays correctly, but not before a few errors with "invalid subscript discarded". It seems that p runs through some p not in P_TEMP. For example, when P_TEMP is a set consisting of nodes "1 3 4 5", I get "invalid subscript discarded" for component_id[2] and COMPONENT[2]. My guess is that something similar happens again later on in the IF-ELSE statement.
How do I avoid this?
Thank you,
Kristian
(previous answer text deleted because I misunderstood the implementation)
I'm not sure if this fully explains your issue, but I think there are a couple of problems with how you're identifying subtours.
repeat {
solve;
let NEWSUB := {};
let EXTEND := {member(ceil(Uniform(0,card(P))),P)};
repeat {
let NEWSUB := NEWSUB union EXTEND;
let EXTEND := {j in P diff NEWSUB: exists {p in NEWSUB, v in 1..V, t in 1..T}
((p,j) in PAIRS and H[p,j,v,t] = 1 or (j,p) in PAIRS and H[j,p,v,t] = 1)};
} until card(EXTEND) = 0;
if card(NEWSUB) < card(P) then {
let nSubtours := nSubtours + 1;
let SUB[nSubtours] := NEWSUB;
display SUB;
} else break;
};
What this does:
solves the problem
sets NEWSUB as empty
randomly picks one node from P as the starting point for EXTEND and adds this to NEWSUB
looks for any nodes not currently in NEWSUB which are connected to a node within NEWSUB by any vehicle journey on any day, and adds them to NEWSUB
repeats this process until there are no more to add (i.e. either NEWSUB equals P, the entire set of nodes, or until there are no journeys between NEWSUB and non-NEWSUB nodes)
checks whether NEWSUB is smaller than P (in which case it identifies NEWSUB as a new subtour, appends it to SUB, and goes back to the start).
if NEWSUB has the same size as P (i.e. is equal to P) then it stops.
This should work for a single-vehicle problem with only a single day, but I don't think it's going to work for your problem. There are two reasons for this:
If your solution has different subtours on different days, it may not recognise them as subtours.
For example, consider a single-vehicle problem with two days, where your cities are A, B, C, D, E, F.
Suppose that the day 1 solution selects AB, BC, CD, DE, EF, FA, and the day 2 solution selects AB, BC, CA, DE, EF, FD. Day 1 has no subtour, but day 2 has two length-3 subtours, so this should not be a legal solution.
However, your implementation won't identify this. No matter which node you select as the starting point for NEWSUB, the day-1 routes connect it to all other nodes, so you end up with card(NEWSUB) = card(P). It doesn't notice that Day 2 has a subtour so it will accept this solution.
I'm not sure whether your problem allows for multiple vehicles to visit the same node on the same day. If it does, then you're going to run into the same sort of problem there, where a subtour for vehicle 1 isn't identified because vehicle 2 links that subtour to the rest of P.
Some of this could be fixed by doing subtour checking separately for each day and for each vehicle. But for the problem as you've described it, there's another issue...
Once the program has identified a closed route (i.e. a set of nodes that are all linked to one another, and not to any other nodes) then it needs to figure out whether this subtour should be prohibited.
For the basic TSP, this is straightforward. We have one vehicle that needs to visit every node - hence, if the cardinality of the subtour is smaller than the cardinality of all nodes, then we have an illegal subtour. This is handled by if card(NEWSUB) < card(P).
However, you state:
my problem unlike TSP does not have to visit all nodes in every timeperiod
Suppose Vehicle 1 travels A-B-C-A and Vehicle 2 travels D-E-F-D. In this case, these routes will look like illegal subtours because ABC and DEF are each smaller than ABCDEF and there are no routes that link them. If you use if card(NEWSUB) < card(P) as your criterion for a subloop that should be forbidden, you'll end up forcing every vehicle to visit all nodes, which is fine for basic TSP but not what you want here.
This one can be fixed by identifying how many nodes vehicle v visits on day t, and then comparing the length of the subtour to that total: e.g. if there are 10 cities total, vehicle 1 only visits 6 of them on day 1, and a "subtour" for vehicle 1 visits 6 cities, then that's fine, but if it visits 8 and has a subtour that visits 6, that implies it's travelling two disjoint subloops, which is bad.
One trap to watch out for here:
Suppose Day 1 requires vehicle 1 to visit ABCDEF. If we get a "solution" that has vehicle 1 ABCA and DEFD on one day, we might identify ABCA as a subtour that should be prevented.
However, if Day 2 has different requirements, it might be that having vehicle 1 travel ABCA (and no other nodes) is a legitimate solution for day 2. In this case, you don't want to forbid it on day 2 just because it was part of an illegal solution for day 1.
Similarly, you might have a "subroute" that is a legal solution for one vehicle but illegal for another.
To avoid this, you might need to maintain a different list of prohibited subroutes for each vehicle x day, instead of using one list for all. Unfortunately this is going to make your implementation a bit more complex.
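As a rough illustration of the two suggestions above (check each vehicle and day separately, and keep a separate list of prohibited subroutes per vehicle and day), here is a minimal Python sketch. It is not AMPL and not a drop-in replacement for the run file; `components`, `find_subtours` and the `solution` layout are assumptions made for the example. A vehicle-day is flagged only when the arcs it uses fall into more than one connected component.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Arc = Tuple[str, str]


def components(arcs: List[Arc]) -> List[Set[str]]:
    """Group the nodes touched by arcs into connected components (undirected)."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for p, j in arcs:
        parent[find(p)] = find(j)
    groups: Dict[str, Set[str]] = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return list(groups.values())


def find_subtours(solution: Dict[Tuple[int, int], List[Arc]]
                  ) -> Dict[Tuple[int, int], List[Set[str]]]:
    """Map (vehicle, day) -> disjoint loops, only where there is more than one."""
    prohibited: Dict[Tuple[int, int], List[Set[str]]] = {}
    for key, arcs in solution.items():
        parts = components(arcs)
        if len(parts) > 1:  # the vehicle's route for that day is not connected
            prohibited[key] = parts
    return prohibited


# The two-day example from above: day 1 is one tour, day 2 has two subtours.
solution = {
    (1, 1): [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"), ("F", "A")],
    (1, 2): [("A", "B"), ("B", "C"), ("C", "A"), ("D", "E"), ("E", "F"), ("F", "D")],
}
subtours = find_subtours(solution)
print(subtours)
```

Because the result is keyed by (vehicle, day), a loop such as A-B-C-A that is flagged for one day is not automatically forbidden on another day where it may be a legitimate route.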

How can I find groups of nodes sharing common traits in a graph

Let's say I have a graph that relates food items to traits such as sour, sweet, spicy, tangy, ...
How can I query the graph to give me a set of food items matching each possible combination of traits?
i.e.
all foods that are sweet and spicy
all foods that are sweet and sour
all foods that are sweet, sour, and spicy
The graph tuples would look as follows:
F1 > Spicy
F1 > Sweet
F2 > Sour
F2 > Sweet
F3 > Sour
...
The query should output sets of food matching each possible combination of traits.
Spicy => F1, F2, F3, F4, F5
Spicy & Sweet => F1, F3, F5
Spicy & Sweet & Sour => F3
Spicy & Sweet & Sour & Tangy => F3
Spicy & Sour => ...
Spicy & Sour & Tangy => ...
Spicy & Tangy => ...
1) Assume the following inputs:
UNWIND [ {name: 'F1', traits: ['Spicy', 'Sweet' ]},
{name: 'F2', traits: ['Sour' , 'Sweet' ]},
{name: 'F3', traits: ['Tangy', 'Sour', 'Spicy' ]},
{name: 'F4', traits: ['Tangy', 'Sour', 'Spicy', 'Tart']} ] AS food
MERGE (F:Food {name: food.name}) WITH F, food
UNWIND food.traits as trait
MERGE (T:Trait {name: trait})
MERGE (F)-[:hasTrait]->(T)
RETURN F, T
2) Now we need to get all combinations of traits. For this we need the APOC library:
MATCH (T:Trait)
WITH collect(T) as traits
// Here we count the number of combinations of traits as a power of two
WITH traits, toInt(round(exp( log(2) * size(traits) )))-1 as combCount
// Go through all the combinations
UNWIND RANGE(1, combCount) as combIndex
UNWIND RANGE(0, size(traits)-1 ) as p
// Check whether the trait is present in the combination
CALL apoc.bitwise.op( toInt(round( exp(log(2) * p) )),'&',combIndex) YIELD value
WITH combIndex, collect(CASE WHEN value > 0 THEN traits[p] END) as comb
// Return all combinations of traits
RETURN comb ORDER BY size(comb)
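The bit-mask trick used in the query above may be easier to follow in a few lines of Python. This is only an illustration of the enumeration idea, with a made-up function name: every integer from 1 to 2^n - 1 is read as a bit pattern, and bit p decides whether trait p belongs to the combination.

```python
from typing import List


def combinations_by_bitmask(traits: List[str]) -> List[List[str]]:
    """Enumerate every non-empty combination of traits via bit masks."""
    n = len(traits)
    combos: List[List[str]] = []
    for index in range(1, 2 ** n):  # 1 .. 2^n - 1, like combIndex
        # Trait p is in the combination when bit p of index is set.
        combos.append([traits[p] for p in range(n) if index & (1 << p)])
    return sorted(combos, key=len)  # like ORDER BY size(comb)


combos = combinations_by_bitmask(["Spicy", "Sweet", "Sour"])
print(combos)
```

With n traits this produces 2^n - 1 combinations, so the approach is only practical for a small number of traits.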
3) Now, for each combination we need to find the intersection for food:
MATCH (T:Trait)
WITH collect(T) as traits
// Here we count the number of combinations of traits as a power of two
WITH traits, toInt(round(exp( log(2) * size(traits) )))-1 as combCount
// Go through all the combinations
UNWIND RANGE(1, combCount) as combIndex
UNWIND RANGE(0, size(traits)-1 ) as p
// Check whether the trait is present in the combination
CALL apoc.bitwise.op( toInt(round( exp(log(2) * p) )),'&',combIndex) YIELD value
WITH combIndex, collect(CASE WHEN value > 0 THEN traits[p] END) as comb
// Take foods for the first trait:
WITH comb, head(comb) as ft
OPTIONAL MATCH (ft)<-[:hasTrait]-(F:Food)
// We find the intersection of each food with other traits
WITH comb, collect(F) as testFoods
UNWIND testFoods as food
UNWIND comb as trait
OPTIONAL MATCH p = (food)-[:hasTrait]->(trait)
WITH comb, food, trait, size(collect(p)) as pairs
// Check that the number of traits matched by the food
// equals the number of traits in the combination
WITH comb, food, collect(CASE WHEN pairs > 0 THEN trait END) as pairs
WITH comb, collect(CASE WHEN size(pairs)=size(comb) THEN food END) as pairs
// Return combinations where there is a common food
WITH comb, pairs WHERE size(pairs)>0
RETURN comb, pairs ORDER BY size(comb)
Keep in mind that the format of neo4j query output is designed for rows with columns, not your desired output format, so this makes things a little tricky.
I would highly recommend just outputting your food items on each row, with boolean columns for membership in each distinct simple trait, then in your application code, insert the food objects into sets for each trait. Then using application logic you can calculate all possible combinations of traits you need, and perform set intersection to generate them.
This would make the neo4j query very easy:
MATCH (f:Food)
WITH f
RETURN f.name, EXISTS((f)-[:IS]->(:Trait{name:'tangy'})) AS tangy,
EXISTS((f)-[:IS]->(:Trait{name:'sweet'})) AS sweet,
EXISTS((f)-[:IS]->(:Trait{name:'sour'})) AS sour,
EXISTS((f)-[:IS]->(:Trait{name:'spicy'})) AS spicy
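The application-side step could look roughly like the following Python sketch. The row format (one dictionary per food with a boolean per trait) is an assumption about how the query result reaches your code, and `foods_by_trait_combo` is a hypothetical helper name. The sample rows follow the tuples from the question.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def foods_by_trait_combo(rows: List[dict], traits: List[str]
                         ) -> Dict[Tuple[str, ...], Set[str]]:
    """Map each trait combination to the foods that have all of its traits."""
    # One set of food names per individual trait.
    by_trait = {t: {r["name"] for r in rows if r[t]} for t in traits}
    result: Dict[Tuple[str, ...], Set[str]] = {}
    for size in range(1, len(traits) + 1):
        for combo in combinations(traits, size):
            common = set.intersection(*(by_trait[t] for t in combo))
            if common:  # skip combinations that no food satisfies
                result[combo] = common
    return result


rows = [
    {"name": "F1", "spicy": True, "sweet": True, "sour": False},
    {"name": "F2", "spicy": False, "sweet": True, "sour": True},
    {"name": "F3", "spicy": False, "sweet": False, "sour": True},
]
groups = foods_by_trait_combo(rows, ["spicy", "sweet", "sour"])
for combo, foods in groups.items():
    print(" & ".join(combo), "=>", sorted(foods))
```

Keeping the combination logic in application code means the query stays a single pass over the Food nodes, and adding a trait only requires one more boolean column.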
That said, if you're determined to do the entire thing with a neo4j query, it's going to be messy, since you'll need to track and generate all the combinations you need yourself. For intersection operations, you'll want to install the APOC procedures library.
Seems to me that the best start is to create sets of food nodes according to each individual trait.
MATCH (f:Food)-[:IS]->(:Trait{name:'spicy'})
WITH COLLECT(f) AS spicyFood
MATCH (f:Food)-[:IS]->(:Trait{name:'sour'})
WITH COLLECT(f) AS sourFood, spicyFood
MATCH (f:Food)-[:IS]->(:Trait{name:'sweet'})
WITH COLLECT(f) AS sweetFood, sourFood, spicyFood
MATCH (f:Food)-[:IS]->(:Trait{name:'tangy'})
WITH COLLECT(f) AS tangyFood, sweetFood, sourFood, spicyFood
Now that you have these, you can do your intersections with every combination you're interested in.
CALL apoc.coll.intersection(tangyFood, sweetFood) YIELD value AS tangySweetFood
CALL apoc.coll.intersection(tangyFood, sourFood) YIELD value AS tangySourFood
CALL apoc.coll.intersection(tangyFood, spicyFood) YIELD value AS tangySpicyFood
CALL apoc.coll.intersection(tangySweetFood, sourFood) YIELD value AS tangySweetSourFood
CALL apoc.coll.intersection(tangySweetFood, spicyFood) YIELD value AS tangySweetSpicyFood
CALL apoc.coll.intersection(tangySourFood, spicyFood) YIELD value AS tangySourSpicyFood
CALL apoc.coll.intersection(tangySweetSourFood, spicyFood) YIELD value AS tangySweetSourSpicyFood
CALL apoc.coll.intersection(sweetFood, sourFood) YIELD value AS sweetSourFood
CALL apoc.coll.intersection(sweetFood, spicyFood) YIELD value AS sweetSpicyFood
CALL apoc.coll.intersection(sweetSourFood, spicyFood) YIELD value AS sweetSourSpicyFood
CALL apoc.coll.intersection(sourFood, spicyFood) YIELD value AS sourSpicyFood
RETURN tangyFood, sweetFood, sourFood, spicyFood,
tangySweetFood, tangySourFood, tangySpicyFood,
tangySweetSourFood, tangySweetSpicyFood, tangySourSpicyFood,
tangySweetSourSpicyFood,
sweetSourFood, sweetSpicyFood,
sweetSourSpicyFood,
sourSpicyFood

Resources