How to exclude some paths in Neo4j Cypher - collections

I have 2 sets of paths
Collection 1
A->B->C->D
A->E->F->D
A->G->J->H
I->B->C->D
Collection 2
E->D
I->D
The Cypher query output should be the paths of Collection 1 that do not contain any node combination from Collection 2.
In the example above, nodes E and D (the 1st element of Collection 2) both occur in the 2nd path of Collection 1, so the 2nd path should be dropped. Similarly, nodes I and D (the 2nd element of Collection 2) both occur in the 4th path of Collection 1, so the 4th path should also be dropped.
Then the output should be
Collection 3
A->B->C->D
A->G->J->H
Through Cypher, I'm able to find the paths of Collection 1 that contain the nodes of Collection 2 paths, but I'm not able to do a 'minus' operation between the collections.
How can I write a Cypher query to achieve the above?
Thanks in advance
Rasyq

Without your Cypher queries it's not easy to answer. But in general you can get the nodes from a path with nodes(your_path) and check if all of those nodes are included in another path with the all() predicate.
MATCH p2 = (the paths you check against)
WITH collect(p2) AS excluded
MATCH p1 = (your first paths)
// keep p1 only if no excluded path has all of its nodes in p1
WHERE none(p2 IN excluded WHERE all(node2 IN nodes(p2) WHERE node2 IN nodes(p1)))
RETURN p1
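The filtering rule itself (keep a path from the first set only if no path from the second set has all of its nodes in it) can be checked outside the database. Below is a minimal Python sketch of that rule, assuming paths are modelled as plain lists of node names taken from the question's example; the function name exclude_paths is hypothetical and not part of any Neo4j API.

```python
# Paths are modelled as plain lists of node names (an assumption for illustration).
collection1 = [
    ["A", "B", "C", "D"],
    ["A", "E", "F", "D"],
    ["A", "G", "J", "H"],
    ["I", "B", "C", "D"],
]
collection2 = [["E", "D"], ["I", "D"]]


def exclude_paths(paths, excluded):
    """Keep each path that does not contain all nodes of any excluded path."""
    kept = []
    for path in paths:
        path_nodes = set(path)
        # A path is dropped as soon as one excluded path is fully covered by it.
        covered = any(set(ex) <= path_nodes for ex in excluded)
        if not covered:
            kept.append(path)
    return kept


collection3 = exclude_paths(collection1, collection2)
for path in collection3:
    print("->".join(path))
```

With the example data this prints A->B->C->D and A->G->J->H, which matches Collection 3 in the question. Note that the check is per excluded path: comparing a path against only one excluded path at a time is not enough, because the path is dropped if any excluded path matches.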

Related

How to check for a property in a vertex or its child using gremlin-cosmos

I am traversing a path a,b,c,d in my graph where a,b,c,d are the nodes along the traversed path.
(a)->(b)->(c)->(d)
My traversal will end at node d as node d is a leaf node.
I want my traversal to cover all nodes from a to d and it must not come to a halt before it reaches node d.
The goal of the traversal is to find out if a certain property exists in node c or d. If, for example, the property does not exist in node c, I'd like to check whether it exists in its child node d before the path is rendered invalid.
How do I achieve this with gremlin query on cosmos? Is there such a thing as an optional has() in Gremlin?
Nodes c and d are labeled differently however they have the same properties.
Currently, I have this query, but it would fail if node c didn't have the property of interest even though node d might have it.
G.V().hasLabel('a').out().as('b').out().as('c').has('status', 'patched').out().as('d').has('status', 'patched')
It is good enough if either node c or node d has the property 'status': 'patched'; the path is then valid (i.e., I want that path in my query result).
I am doing this on CosmosDB, so please only provide answers using supported steps found here: https://learn.microsoft.com/en-us/azure/cosmos-db/gremlin/support
The goal of the traversal is to find out if a certain property exists in node c or d.
This sounds like an or() step:
G.V().hasLabel('a').out().out().or(
__.has('status', 'patched').out(),
__.out().has('status', 'patched')
).path()

Generalized subqueries in Cypher 9

I have a graph in Neo4j, where all nodes and edges have a property p. With a cypher query as follows, I get a subgraph of the whole graph that fulfills several conditions on p for vertices and edges in general.
Q1
MATCH (n)
WHERE n.p ... // some conditions for nodes
OPTIONAL MATCH (n)-[r]-()
WHERE r.p ... // some conditions for relationships
RETURN n,r
Now I want to execute a second query Q2 on the resulting subgraph of the first query Q1. For example
Q2
MATCH (a:Person)-[:knows*5]-(b:Person)-[s:studiedAt]-(u:University),
(b)-[:owns]-(i:Business)
WHERE i.city = "Boston"
RETURN DISTINCT a.name, i.name, u.name
I do not want to apply the general conditions of Q1 for each vertex and edge in Q2. That complicates the query, loses the generalization, and needs path variables for variable paths like [:knows*5].
Is there a smarter way to execute a query Q2 on the result of a query Q1? Or is this impossible in Cypher because of its missing composability and the fact, that the result is always a table and never a graph?
It's not present directly in Cypher as far as I know, but if you are using Neo4j Enterprise, you can use the Neo4j Graph Data Science Library to create named subgraphs and then query those subgraphs. The disadvantage of this approach is that your subgraph won't be updated as the original graph gets updated. Please go through the following documentation on how to create a subgraph:
Projecting a subgraph
Querying your subgraph

Gremlin query: Find all the related vertices till end which match with edge properties

I need to start with one vertex and find all the related vertices until the end. The criterion is that any one of the edge properties (attributes) must match in the edge's inV vertex. If the edge attribute ‘value’ doesn’t match the inV vertex ‘attribute’ name, I should skip the vertex. The attribute value of an edge is propagated as the attribute name in the inV vertex.
I am using the query below; however, it gives me JSON output of the parent node, the next node, and the edges between them. From that output I am writing logic to pick only the next attributes that match the edge attributes. If the attribute matching can be done within the Gremlin query, that would be great.
var graphQuery = "g.V().has('sPath', '/Assisted/CSurvey/CSurvey.ss').as('ParentStream').outE.inV().as('Edges').map(select('Edges').inV().fold()).as ('NextStream').select('ParentStream', 'NextStream','Edges')";
In the image below, I need to get vertex1 and vertex2 and skip vertex3, as it has no attributes matching the edge.
image link
Use graph traversal and filter
Example in Scala:
graph.traversal().V().has().bothE('sPath').filter{it:Edge =>
(it.property('sPath').value() == '/Assisted/CSurvey/CSurvey.ss')}.toList()
Hope this helps

Get path lengths for every relationship neo4j

So I have a graph that looks like this (starting from the rightmost side), with relationships that have a unique number attribute called Isnad. I want to write a query to get the length of every Isnad from the start node to the end node, but I can't figure it out. I don't know how to traverse every path for every Isnad separately. Any help?
I don't know if it is the most elegant solution, but I think it works. First, I get all unique Isnad values of the relationships outgoing from the rightmost node, which is looked up by an identifier. Then I use variable-length pattern matching in which all relationships have the same value for the Isnad property. Finally, the Isnad value and the path length are returned.
match ({id:'unique-identifier-of-rightmost-side-node'})-[r]->()
with distinct r.Isnad as Isnad
match p = ()-[*{Isnad : Isnad}]->()
return Isnad, length(p) as Length
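To make the intended result concrete, here is a small Python sketch of the same idea: group relationships by their Isnad value and report one length per Isnad. The edge data is hypothetical, and the sketch assumes that each Isnad value labels exactly one simple chain, so the chain length equals the number of relationships carrying that value. A variable-length pattern with unbound end nodes also matches shorter sub-paths of a chain, so under the same assumption the longest path returned per Isnad would correspond to the count below.

```python
from collections import Counter

# Hypothetical edges: (source, target, isnad). Each Isnad value is assumed
# to label one simple chain, so the chain length equals its edge count.
edges = [
    ("n1", "n2", 1), ("n2", "n3", 1), ("n3", "n4", 1),
    ("n1", "n5", 2), ("n5", "n4", 2),
]


def isnad_lengths(edge_list):
    """Count relationships per Isnad value."""
    return dict(Counter(isnad for _, _, isnad in edge_list))


lengths = isnad_lengths(edges)
print(lengths)
```

For the sample edges this yields a length of 3 for Isnad 1 and 2 for Isnad 2.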

Neo4j Cypher query to find nodes that are not connected too slow

Given we have the following Neo4j schema (simplified but it shows the important point). There are two types of nodes NODE and VERSION. VERSIONs are connected to NODEs via a VERSION_OF relationship. VERSION nodes do have two properties from and until that denote the validity timespan - either or both can be NULL (nonexistent in Neo4j terms) to denote unlimited. NODEs can be connected via a HAS_CHILD relationship. Again these relationships have two properties from and until that denote the validity timespan - either or both can be NULL (nonexistent in Neo4j terms) to denote unlimited.
EDIT: The validity dates on VERSION nodes and HAS_CHILD relations are independent (even though the example coincidentally shows them being aligned).
The example shows two NODEs A and B. A has two VERSIONs AV1 until 6/30/17 and AV2 starting from 7/1/17 while B only has one version BV1 that is unlimited. B is connected to A via a HAS_CHILD relationship until 6/30/17.
The challenge now is to query the graph for all nodes that aren't a child (that are root nodes) at one specific moment in time. Given the example above, the query should return just B if the query date is e.g. 6/1/17, but it should return B and A if the query date is e.g. 8/1/17 (because A isn't a child of B as of 7/1/17 any more).
The current query is roughly similar to this one:
MATCH (n1:NODE)
OPTIONAL MATCH (n1)<-[c]-(n2:NODE), (n2)<-[:VERSION_OF]-(nv2:ITEM_VERSION)
WHERE (c.from <= {date} <= c.until)
AND (nv2.from <= {date} <= nv2.until)
WITH n1 WHERE c IS NULL
MATCH (n1)<-[:VERSION_OF]-(nv1:ITEM_VERSION)
WHERE nv1.from <= {date} <= nv1.until
RETURN n1, nv1
ORDER BY toLower(nv1.title) ASC
SKIP 0 LIMIT 15
This query works relatively fine in general but it starts getting slow as hell when used on large datasets (comparable to real production datasets). With 20-30k NODEs (and about twice the number of VERSIONs) the (real) query takes roughly 500-700 ms (on a small docker container running on Mac OS X), which is acceptable. But with 1.5M NODEs (and about twice the number of VERSIONs) the (real) query takes a little more than 1 minute on a bare-metal server (running nothing other than Neo4j). This is not really acceptable.
Do we have any option to tune this query? Are there better ways to handle the versioning of NODEs (which I doubt is the performance problem here) or the validity of relationships? I know that relationship properties cannot be indexed, so there might be a better schema for handling the validity of these relationships.
Any help or even the slightest hint is greatly appreciated.
EDIT after answer from Michael Hunger:
Percentage of root nodes:
With the current example data set (1.5M nodes) the result set contains about 2k rows. That's less than 1%.
ITEM_VERSION node in first MATCH:
We're using the ITEM_VERSION nv2 to filter the result set to ITEM nodes that have no connection to other ITEM nodes at the given date. That means that either no relationship valid for the given date may exist, or the connected item must not have an ITEM_VERSION that is valid for the given date. I'm trying to illustrate this:
// date 6/1/17
// n1 returned because relationship not valid
(nv1 ...)->(n1)-[X_HAS_CHILD ...6/30/17]->(n2)<-(nv2 ...)
// n1 not returned because relationship and connected item n2 valid
(nv1 ...)->(n1)-[X_HAS_CHILD ...]->(n2)<-(nv2 ...)
// n1 returned because connected item n2 not valid even though relationship is valid
(nv1 ...)->(n1)-[X_HAS_CHILD ...]->(n2)<-(nv2 ...6/30/17)
No use of relationship-types:
The problem here is that the software features a user-defined schema and ITEM nodes are connected by custom relationship-types. As we can't have multiple types/labels on a relationship, the only common characteristic of these kinds of relationships is that they all start with X_. That's been left out of the simplified example here. Would searching with the predicate type(r) STARTS WITH 'X_' help here?
What Neo4j version are you using?
What percentage of your 1.5M nodes will be found as roots at your example date, and if you don't have the limit how much data comes back? Perhaps the issue is not in the match so much as in the sorting at the end?
I'm not sure why you had the VERSION nodes in your first part, at least you don't describe them as relevant for determining a root node.
You didn't use relationship-types.
MATCH (n1:NODE) // matches 1.5M nodes
// has to do 1.5M * degree optional matches
OPTIONAL MATCH (n1)<-[c:HAS_CHILD]-(n2) WHERE (c.from <= {date} <= c.until)
WITH n1 WHERE c IS NULL
// how many root nodes are left?
// # root nodes * version degree (1..2)
MATCH (n1)<-[:VERSION_OF]-(nv1:ITEM_VERSION)
WHERE nv1.from <= {date} <= nv1.until
// has to sort all those
WITH n1, nv1, toLower(nv1.title) as title
RETURN n1, nv1
ORDER BY title ASC
SKIP 0 LIMIT 15
I think a good start for improvement would be to match on nodes using an index so you can quickly get a smaller relevant subset of nodes to search. Your approach right now must inspect all your :NODEs and all their relationships and patterns off of them every single time, which, as you've found, won't scale with your data.
Right now the only nodes in your graph with date/time properties are your :ITEM_VERSION nodes, so let's start with those. You'll need an index on :ITEM_VERSION's from and until properties for fast lookup.
The nulls are going to be problematic for your lookups, as any inequality against a null value returns null, and most workarounds to working with nulls (using COALESCE() or several ANDs/ORs for null cases) seem to prevent usage of index lookups, which is the point of my particular suggestion.
I would encourage you to replace your nulls in from and until with min and max values, which should let you take advantage of finding nodes by index lookup:
MATCH (version:ITEM_VERSION)
WHERE version.from <= {date} <= version.until
MATCH (version)-[:VERSION_OF]->(node:NODE)
...
That should at least provide quick access to a smaller subset of nodes at the start for continuing your query.
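The effect of swapping nulls for min and max values can be illustrated with a short Python sketch. The version data mirrors the question's example (AV1 until 6/30/17, AV2 from 7/1/17, BV1 unlimited); the sentinel values and the helper names normalize and valid_at are assumptions for illustration, not anything Neo4j prescribes.

```python
from datetime import date

# Sentinels standing in for "unlimited" (assumed values; pick ones that suit your data).
MIN_DATE = date.min
MAX_DATE = date.max


def normalize(version):
    """Replace missing from/until bounds with sentinel values."""
    normalized = dict(version)
    if normalized.get("from") is None:
        normalized["from"] = MIN_DATE
    if normalized.get("until") is None:
        normalized["until"] = MAX_DATE
    return normalized


def valid_at(version, when):
    """Plain range check; no null handling is needed after normalization."""
    return version["from"] <= when <= version["until"]


versions = [
    {"name": "AV1", "from": None, "until": date(2017, 6, 30)},
    {"name": "AV2", "from": date(2017, 7, 1), "until": None},
    {"name": "BV1", "from": None, "until": None},
]
normalized = [normalize(v) for v in versions]
valid_june = [v["name"] for v in normalized if valid_at(v, date(2017, 6, 1))]
valid_august = [v["name"] for v in normalized if valid_at(v, date(2017, 8, 1))]
print(valid_june)
print(valid_august)
```

Once every bound holds a real value, the validity test is a single range comparison, which is the kind of predicate an index lookup can serve.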
