Gremlin strategy to find vertex peers sharing exactly the same neighbourhood

Using AWS Neptune, I need a traversal that takes one reference vertex and, by traversing along one edge type, finds other vertices that have exactly the same neighbors, i.e. no more and no fewer.
g.V('1').as('ref_vertex').out('created').as('creations').in('created')
This finds vertices that also created the same things as "1", but it also includes vertices (a) that created something else too, and (b) that did not create everything that "1" created.
g.V('1').as('ref_vertex')
.out('created').as('creations').in('created')
.not(out('created').where(neq('creations')))
This helps only with problem (a), getting rid of persons who created something extra.
How do I continue this query to exclude the (b) vertices from the result?

g.V('1').aggregate('ref_vertex').
out('created').
sideEffect(aggregate('neighbors')). /* get neighbors of 'ref_vertex' */
in('created').groupCount().unfold(). /* group count in('created') by its occurrence times */
as('candidate_shared_neighbor_cnt_pair').
where('candidate_shared_neighbor_cnt_pair', eq('neighbors')). /* select only the vertices have the same occurrence times as 'ref_vertex' */
by(select(values)).
by(unfold().count()).
select(keys).
where(without('ref_vertex'))
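To make the counting idea behind this traversal easier to check, here is a minimal plain-Python sketch of the same logic on an in-memory mapping. The function name `exact_peers` and the `created` dictionary are hypothetical and are not part of Gremlin or Neptune. The sketch also compares each candidate's own neighbor count with the reference's, which is one way to rule out case (a).

```python
from collections import Counter

def exact_peers(created, ref):
    """Return vertices whose 'created' set equals that of ref (hypothetical helper)."""
    target = created[ref]
    # Reverse index: thing -> set of creators (plays the role of in('created')).
    creators = {}
    for person, things in created.items():
        for thing in things:
            creators.setdefault(thing, set()).add(person)
    # Count how many of ref's neighbors each candidate shares (plays the role of groupCount()).
    shared = Counter(p for thing in target for p in creators.get(thing, ()))
    return {
        p for p, n in shared.items()
        if p != ref
        and n == len(target)                 # created everything ref created: rules out (b)
        and len(created[p]) == len(target)   # created nothing extra: rules out (a)
    }

created = {
    "1": {"a", "b"},
    "2": {"a", "b"},        # exact peer
    "3": {"a", "b", "c"},   # case (a): created something extra
    "4": {"a"},             # case (b): missing one creation
}
print(exact_peers(created, "1"))
```

A candidate that shares all of the reference's neighbors and has the same number of neighbors must have exactly the same neighbor set, which is why two counts are enough.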

Related

How do I produce output even when there is no edge and when using select for projection

Can someone help me please with this simple query...Many thanks in advance...
I am using the following Gremlin query and it works well, giving me the original vertex (v) (with id=12345), its edges (e) and the child vertex (id property). However, if the original vertex 'v' (with id=12345) has no outgoing edges, the query returns nothing. I still want the properties of the original vertex ('v') even if it has no outgoing edges and no child. How can I do that?
g.V().has('id', '12345').as('v').
outE().as('e').
inV().
as('child_v').
select('v', 'e', 'child_v').
by(valueMap()).by(id).by(id)
There are a couple of things going on here, but the major update you need to make to the traversal is to use a project() step instead of a select().
select() and project() steps are similar in that they both allow you to format the results of a traversal; however, they differ in (at least) one significant way. select() lets you access previously traversed and labeled elements (via as()). project() lets you take the current traverser and branch it to shape the output moving forward.
In your original traversal, when the original v has no outgoing edges, all the traversers are filtered out at the outE() step. Since no traversers remain after outE(), the remainder of the traversal has no input stream, so there is no data to return. If you use a project() step after the original v, you can return the original traverser as well as its edges and incident vertices. This does lead to a slight complication when no out edges exist: Gremlin does not handle the missing values that result, so you need to return some constant value for those by() modulators using a coalesce() step.
Here is a functioning version of this traversal:
g.V().hasId(3).
project('v', 'e', 'child_v').
by(valueMap()).
by(coalesce(outE().id(), constant(''))).
by(coalesce(out().id(), constant('')))
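The difference between the two approaches can be illustrated outside Gremlin. The following is only a sketch in plain Python over a hypothetical in-memory `graph` dictionary; `select_style` and `project_style` are made-up names that mimic the row-producing behavior described above, not TinkerPop APIs.

```python
def select_style(graph, vid):
    # Mimics as('v').outE().inV().select(...): one row per outgoing edge,
    # so a vertex without outgoing edges produces no rows at all.
    v = graph[vid]
    return [{"v": v["props"], "e": eid, "child_v": child} for eid, child in v["out"]]

def project_style(graph, vid, default=""):
    # Mimics project(...).by(coalesce(..., constant(''))): always one row per
    # start vertex; each by() keeps only its first result, or the constant.
    v = graph[vid]
    first = v["out"][0] if v["out"] else (default, default)
    return [{"v": v["props"], "e": first[0], "child_v": first[1]}]

graph = {
    1: {"props": {"name": "a"}, "out": [("e1", 2)]},
    2: {"props": {"name": "b"}, "out": []},   # no outgoing edges
}
print(select_style(graph, 2))    # no rows
print(project_style(graph, 2))   # one row with constant placeholders
```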
Currently you will get a lot of duplicate data: in the above query you will get the vertex properties E times (once per edge). It is probably better to use project():
g.V('12345').project('v', 'children').
by(valueMap()).
by(outE().as('e').
inV().as('child').
select('e', 'child').by(id).fold())
example: https://gremlify.com/a1
You can get the original data format if you do something like this:
g.V('12345').as('v').
coalesce(
outE().as('e').
inV().
as('child_v').
select('v', 'e', 'child_v').
by(valueMap()).by(id).by(id),
project('v').by(valueMap())
)
example: https://gremlify.com/a2

Gremlin - simultaneously limit traverse iterations and search for edge property

I have a (broken) piece of Gremlin code to generate the shortest path from a given vertex to one which has the property test_property. If that property is not found on an edge, no paths should be returned.
s.V(377524408).repeat(bothE().has('date', between(1554076800, 1556668800)).otherV()) /* date filter on edges */
.until(or(__.bothE().has('test_property', gt(0)),
loops().is(4))) /* broken logic! */
.path()
.local(unfold().filter(__.has('entity_id')).fold()) /* remove edges from output paths*/
The line that's broken is .until(or(__.bothE().has('test_property', gt(0)), loops().is(4))).
At present, and it makes sense why, it returns all paths that are 4 hops from the starting vertex.
I'm trying to adapt it so that if the traversal reaches 4 iterations without finding the property test_property, it returns no paths. If test_property is found, it should return only the path(s) to that vertex.
I've attempted to add a times(4) constraint and remove the loops() condition, but I don't know how to combine the times(4) constraint with the .has('test_property', gt(0)) constraint.
Daniel's answer has a few issues (see comments).
This query returns the correct result:
g.V(377524408)
.repeat(bothE().has('date', between(1554076800, 1556668800)).otherV().simplePath().as("v"))
.until(and(bothE().has('test_property', gt(0)), loops().is(lte(4))))
.select(all, "v")
.limit(1)
The simplePath() is required so that we don't go back and forth, and so that we avoid cycles.
The repeat loop runs until the condition is met AND we have not exceeded the max hop count.
The limit(1) returns only the first (shortest) path. Omit it to get all paths.
Note that if the graph is directed, it is better to use outE() rather than bothE().
This should work:
s.V(377524408).
repeat(bothE().has('date', between(1554076800, 1556668800)).otherV().as('v')).
times(4).
filter(bothE().has('test_property', gt(0))).
select(all, 'v')
Also note that I replaced your local(unfold().filter(__.has('entity_id')).fold()) with something much simpler (assuming that the sole purpose was the removal of edges from the path).
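The intended behavior (stop at the first hit, give up after 4 hops, never revisit a vertex) can be sketched as a level-by-level breadth-first search. This is a plain-Python illustration under stated assumptions, not a Gremlin translation: `adj` is a hypothetical adjacency map that already reflects the date filter, and `flagged` is a hypothetical precomputed set of vertices that have an incident edge with test_property > 0.

```python
def shortest_paths_to_flagged(adj, flagged, start, max_hops=4):
    """Return the shortest simple paths (at most max_hops long) ending at a flagged vertex."""
    frontier = [[start]]
    for _ in range(max_hops):
        next_frontier = []
        for path in frontier:
            for nxt in adj.get(path[-1], ()):
                if nxt not in path:          # like simplePath(): never revisit a vertex
                    next_frontier.append(path + [nxt])
        hits = [p for p in next_frontier if p[-1] in flagged]
        if hits:
            return hits                      # first level with a hit holds the shortest paths
        frontier = next_frontier
    return []                                # nothing found within max_hops: return no paths

adj = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4, 6], 6: [5]}
print(shortest_paths_to_flagged(adj, {4}, 1))
print(shortest_paths_to_flagged(adj, {6}, 1, max_hops=3))
```

Like repeat()...until(), the sketch checks the condition only after taking at least one hop, so the start vertex itself is never returned as a hit.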

Time Complexity of Adding Edge to Graph using Adjacency List

I've been studying up on graphs using the Adjacency List implementation, and I am reading that adding an edge is an O(1) operation.
This makes sense if you are just tacking an edge on to the Vertex's Linked List of edges, but I don't understand how this could be the case so long as you care about removing the old edge if one already exists. Finding that edge would take O(V) time.
If you don't do this, and you add an edge that already exists, you would have duplicate entries for that edge, which means they could have different weights, etc.
Can anyone explain what I seem to be missing? Thanks!
You're right in your complexity analysis. Finding whether an edge already exists is indeed O(V). But notice that adding the edge, even if it already exists, is still O(1).
Remember that two edges with the same source and destination are valid input to a graph, even with different weights (or perhaps precisely because they have different weights).
That way, adding an edge to an adjacency-list graph is O(1).
What people usually do to have both optimal search time complexity and the advantages of adjacency lists is to use an array of hashsets instead of an array of lists.
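A minimal Python sketch of that idea follows. It uses one dictionary per vertex rather than a plain set so that a weight can be stored with each neighbor; that choice, and the `Graph` class itself, are assumptions made for illustration.

```python
class Graph:
    """Adjacency structure with one hash map per vertex (hypothetical example class)."""

    def __init__(self, n):
        # adj[u] maps neighbor -> weight for every edge u -> neighbor.
        self.adj = [dict() for _ in range(n)]

    def add_edge(self, u, v, weight=1):
        # Average-case O(1): the insert overwrites any existing u -> v edge,
        # so no separate O(V) scan for duplicates is needed.
        self.adj[u][v] = weight

    def has_edge(self, u, v):
        # Average-case O(1) membership test.
        return v in self.adj[u]

g = Graph(3)
g.add_edge(0, 1, weight=5)
g.add_edge(0, 1, weight=7)   # replaces the earlier edge instead of duplicating it
print(g.adj[0])
```

The O(1) bounds here are average-case, as usual for hash tables; a plain list keeps a worst-case O(1) append but allows duplicates.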
Alternatively,
If you want a worst-case optimal solution, use RadixSort to order the
list of all edges in O(v+e) time, remove duplicates, and then build
the adjacency list representation in the usual way.
source: https://www.quora.com/What-are-the-various-approaches-you-can-use-to-build-adjacency-list-representation-of-a-undirected-graph-having-time-complexity-better-than-O-V-*-E-and-avoiding-duplicate-edges
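One way to realize the quoted suggestion is a two-pass bucket sort, which is a radix sort with the vertex count as the base. The sketch below is an illustration under assumptions: vertices are integers 0..n-1, edges are directed (source, destination) pairs, and `build_adjacency` is a made-up name. For an undirected graph, each edge would be added in both directions first.

```python
def build_adjacency(n, edges):
    """Build duplicate-free adjacency lists in O(V + E) using two bucket passes."""
    # Pass 1: bucket edges by destination.
    by_dst = [[] for _ in range(n)]
    for u, v in edges:
        by_dst[v].append((u, v))
    # Pass 2: distribute by source in destination order. Each adj[u] receives
    # its neighbors in non-decreasing order, so duplicates arrive back to back.
    adj = [[] for _ in range(n)]
    for bucket in by_dst:
        for u, v in bucket:
            if not adj[u] or adj[u][-1] != v:   # skip an immediate repeat
                adj[u].append(v)
    return adj

print(build_adjacency(3, [(0, 1), (0, 1), (2, 0), (0, 2), (2, 0)]))
```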

Traverse Graph With Directed Cycles using Relationship Properties as Filters

I have a Neo4j graph with directed cycles. I have had no issue finding all descendants of A assuming I don't care about loops using this Cypher query:
match (n:TEST{name:"A"})-[r:MOVEMENT*]->(m:TEST)
return n,m,last(r).movement_time
The relationships between my nodes have a timestamp property on them, movement_time. I've simulated that in my test data below using numbers that I've imported as floats. I would like to traverse the graph using the timestamp as a constraint. Only follow relationships that have a greater movement_time than the movement_time of the relationship that brought us to this node.
Here is the CSV sample data:
from,to,movement_time
A,B,0
B,C,1
B,D,1
B,E,1
B,X,2
E,A,3
Z,B,5
C,X,6
X,A,7
D,A,7
Here is what the graph looks like:
I would like to calculate the descendants of every node in the graph and include the timestamp from the last relationship using Cypher; so I'd like my output data to look something like this:
Node:[{Descendant,Movement Time},...]
A:[{B,0},{C,1},{D,1},{E,1},{X,2}]
B:[{C,1},{D,1},{E,1},{X,2},{A,7}]
C:[{X,6},{A,7}]
D:[{A,7}]
E:[{A,3}]
X:[{A,7}]
Z:[{B,5}]
This non-Neo4J implementation looks similar to what I'm trying to do: Cycle enumeration of a directed graph with multi edges
This one is not 100% what you want, but very close:
MATCH (n:TEST)-[r:MOVEMENT*]->(m:TEST)
WITH n, m, r, [x IN range(0,length(r)-2) |
(r[x+1]).movement_time - (r[x]).movement_time] AS deltas
WHERE ALL (x IN deltas WHERE x>0)
RETURN n, collect(m), collect(last(r).movement_time)
ORDER BY n.name
We basically find all the paths between any of your nodes (beware: cartesian products get very expensive on non-trivial datasets). In the WITH we build a collection, deltas, that holds the differences between subsequent movement_time properties.
The WHERE applies an ALL predicate to filter out paths having any non-positive delta, i.e. we guarantee increasing values of movement_time along the path.
The RETURN then just assembles the results, not as a map but as one collection of the reachable nodes and one of the last values of movement_time.
The current issue is that we have duplicates, since e.g. there are multiple paths from B to A.
As a general note: this problem can be solved much more elegantly and efficiently using the Java traversal API (http://neo4j.com/docs/stable/tutorial-traversal.html). There you would have a PathExpander that skips paths with decreasing movement_time early, instead of collecting all paths and filtering afterwards (as the Cypher query does).
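The early-pruning idea can be sketched in plain Python on the sample CSV data from the question. This is not the Neo4j traversal API, and `descendants` is a hypothetical helper. It returns every distinct (descendant, last movement_time) pair reachable along strictly increasing timestamps, which may include more pairs than the table the question asks for.

```python
EDGES = [("A", "B", 0), ("B", "C", 1), ("B", "D", 1), ("B", "E", 1), ("B", "X", 2),
         ("E", "A", 3), ("Z", "B", 5), ("C", "X", 6), ("X", "A", 7), ("D", "A", 7)]

def descendants(edges, start):
    """Collect (node, last movement_time) pairs reachable via strictly increasing times."""
    out = {}
    for src, dst, t in edges:
        out.setdefault(src, []).append((dst, t))
    seen = set()
    stack = [(start, float("-inf"))]          # no constraint on the first hop
    while stack:
        node, last_t = stack.pop()
        for dst, t in out.get(node, ()):
            # Prune immediately: only follow edges newer than the one we arrived on.
            if t > last_t and (dst, t) not in seen:
                seen.add((dst, t))
                stack.append((dst, t))
    return seen

for name in ["C", "E", "Z"]:
    print(name, sorted(descendants(EDGES, name)))
```

Because timestamps must strictly increase along a path, the search terminates even though the graph contains directed cycles, and the `seen` set removes the duplicates that arise from multiple paths to the same node.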

Find path for N levels with repeating pattern of directional relationships in Neo4J

I'm trying to use Neo4j to analyze relationships in a family tree. I've modeled it like so:
(p1:Person)-[:CHILD]->(f:Family)<-[:FATHER|MOTHER]-(p2)
I know I could have left out the family label and just had children connected to each parent, but that's not practical for my purposes. Here's an example of my graph and the black line is the path I want it to generate:
I can query for it with
MATCH p=(n {personID:3})-[:CHILD]->()<-[:FATHER|MOTHER]-()-[:CHILD]->()<-[:FATHER|MOTHER]-()-[:CHILD]->()<-[:FATHER|MOTHER]-() RETURN p
but there's a repeating pattern to the relationships. Could I do something like:
MATCH p=(n {personID:3})(-[:CHILD]->()<-[:FATHER|MOTHER]-())* RETURN p
where the * means repeat the :CHILD then :FATHER|MOTHER relationships, with the directions being different? Obviously if the relationships were all the same direction, I could use
-[:CHILD|FATHER|MOTHER*]->
I want to be able to query it from Person #3 all the way to the top of the graph like a pedigree chart, but also be specific about how many levels if needed (like 3 generations, as opposed to end-of-line).
Another issue I'm having with this, is if I don't put directions on the relationships like -[:CHILD|FATHER|MOTHER*]-, then it will start at Person #3, and go both in the direction I want (alternating arrows), but also descend back down the chain finding all the other "cousins, aunts, uncles, etc.".
Are there any seasoned Cypher experts who can help me?
I am facing the same problem, and I found that the APOC path expander procedures accomplish exactly what you/we want.
Applied to your example, you could use apoc.path.subgraphNodes to get all ancestors of Person #3:
MATCH (p1:Person {personID:3})
CALL apoc.path.subgraphNodes(p1, {
sequence: '>Person,CHILD>,Family,<MOTHER|<FATHER'
}) YIELD node
RETURN node
Or, if you want only ancestors up to 3 generations from the start person, add maxLevel: 6 to the config (since one generation spans 2 relationships, 3 generations are 6 levels):
MATCH (p1:Person {personID:3})
CALL apoc.path.subgraphNodes(p1, {
sequence: '>Person,CHILD>,Family,<MOTHER|<FATHER',
maxLevel: 6
}) YIELD node
RETURN node
And if you want only ancestors of 3rd generation, i.e. only great-grandparents, you can also specify minLevel (using apoc.path.expandConfig):
MATCH (p1:Person {personID:3})
CALL apoc.path.expandConfig(p1, {
sequence: '>Person,CHILD>,Family,<MOTHER|<FATHER',
minLevel: 6,
maxLevel: 6
}) YIELD path
WITH last(nodes(path)) AS person
RETURN person
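To show why one generation corresponds to two relationship hops, here is a plain-Python sketch of the repeating CHILD, then FATHER|MOTHER pattern. The dictionaries `child_of` and `parents_of`, the sample family data, and the function `ancestors` are all hypothetical and stand in for the graph; none of this is APOC code.

```python
def ancestors(child_of, parents_of, person, max_generations=None):
    """Follow (person)-[:CHILD]->(family)<-[:FATHER|MOTHER]-(parent), one generation per round."""
    current, found, generation = {person}, set(), 0
    while current and (max_generations is None or generation < max_generations):
        nxt = set()
        for p in current:
            for family in child_of.get(p, ()):          # hop 1: CHILD
                nxt.update(parents_of.get(family, ()))  # hop 2: FATHER|MOTHER (reversed)
        nxt -= found              # do not expand anyone twice
        found |= nxt
        current = nxt
        generation += 1
    return found

child_of = {3: ["F1"], 1: ["F2"], 2: ["F3"]}
parents_of = {"F1": [1, 2], "F2": [10, 11], "F3": [20, 21]}
print(sorted(ancestors(child_of, parents_of, 3)))
print(sorted(ancestors(child_of, parents_of, 3, max_generations=1)))
```

Each loop iteration performs exactly two hops, so a limit of N generations here plays the same role as maxLevel: 2N in the APOC config.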
You could reverse the directionality of the CHILD relationships in your model, as in:
(p1:Person)<-[:CHILD]-(f:Family)<-[:FATHER|MOTHER]-(p2)
This way, you can use a simple -[:CHILD|FATHER|MOTHER*]-> pattern in your queries.
Reversing the directionality is actually intuitive as well, since you can then more naturally visualize the graph as a family tree, with all the arrows flowing "downwards" from ancestors to descendants.
Yeah, that's an interesting case. I'm pretty sure (though I'm open to correction) that this is just not possible. Would it be possible for you to have and maintain both? You could have a simple Cypher query create the extra relationships:
MATCH (parent)-[:MOTHER|FATHER]->()<-[:CHILD]-(child)
CREATE (child)-[:CHILD_OF]->(parent)
Ok, so here's a thought:
MATCH path=(child:Person {personID: 3})-[:CHILD|FATHER|MOTHER*]-(ancestor:Person)
WHERE (ancestor)-[:MOTHER|FATHER]->()
RETURN path
Normally I'd use a second clause in the MATCH like this:
MATCH
path=(child:Person {personID: 3})-[:CHILD|FATHER|MOTHER*]-(ancestor:Person),
(ancestor)-[:MOTHER|FATHER]->()
RETURN path
But Neo4j (at least by default, I think) doesn't traverse back through the path. Maybe comma-separating would be fine and this would be a problem:
MATCH path=(child:Person {personID: 3})-[:CHILD|FATHER|MOTHER]-(ancestor:Person)-[:MOTHER|FATHER]->()
I'm curious to know what you find!

Resources