Gremlin, get two vertices that both have an edge to each other - graph

So imagine you have 2000 people, they can choose to like someone which creates an edge between them, for example A likes B, now this doesn't necessarily mean that B likes A. How would I write a gremlin query to figure out everyone who likes each other? So where A likes B AND B likes A?
I've been looking around the internet and I've found .both('likes') however from what I understand is that this will get everyone who likes someone or who has someone who likes them, not both at the same time.
I've also found this
g.V().hasId('1234567').as('y').
out('likes').
where(__.in('likes').as('y'))
This works for 1 person, however I can't figure out how to get this to work for multiple people.
To me this seems like a simple enough problem for graph however I can't seem to find any solution online. From everything I've been reading it seems to infer that the data should be structured such that, if A likes B, that also means that B likes A. Which is achievable, when you create the edge that A likes B you can check if B already likes A, and if that's the case insert a special edge which is like... A inRelationshipWith B
The query for this would be g.V().both('inRelationshipWith') which would make things easier.
Is this an issue with how the data is structured and I am potentially using a graph database incorrectly, or is there actually a simple way to achieve what I want that I am missing?

You almost had it. Remember from the other vertex the relationship back to the starting vertex is also an out relationship from that vertex's point of view. The following query uses the air-routes data set to find all airports that have a route in both directions (analogous to your mutual friendship case)
g.V().
hasLabel('airport').as('a').
out().as('b').
where(out().as('a')).
select('a','b').
by('code')
This will return pairs of relationships. It will include each airport (friend) twice for example:
[a:DFW,b:AUS]
[a:AUS,b:DFW]
If you only want one of each pair adding a dedup step will reduce the result set to just one pair per relationships.
g.V().
hasLabel('airport').as('a').
out().as('b').
where(out().as('a')).
select('a','b').
by('code').
order(local).
by(values).
dedup().
by(values)
Finding the inverse case (where there is not a mutual relationship) is just a case of adding a not step to the query.
g.V().
hasLabel('airport').as('a').
out().as('b').
where(__.not(out().as('a'))).
select('a','b').
by('code')

Another possible solution would be:
g.V().as('y').
out('likes').where(__.out('likes').as('y')).
path().dedup().
by(unfold().
order().by(id).
dedup().fold())
You can try it out here on a sample graph:
https://gremlify.com/radpwsh80o

Related

gremlin intersection with `select` and `as`

I'm following up with these 2 questions --
gremlin intersection operation
JanusGraph Gremlin graph traversal with `as` and `select` provides unexpected result
I'm viewing StackOverflow intensively(wanted to thank the community!) but unfortunately I didn't post/write a lot, so I don't even have enough reputation for posting a comment on the posts above...therefore I'm asking my questions here..
In 2nd post above, Hieu and I work together, and I want to provide a bit more background on the question.
As Stephen asked in the comment(for 2nd post), the reason that I want to chain V() in the middle is simply because I want to start the traversal from the beginning, i.e. each and every node of the whole graph just like what g.V() does, which appears at the beginning of most of the queries in gremlin documentation.
A bit more illustration: suppose I need 2 conditional filters on the results. Basically I want to write
g.V().(Condition-A).as('setA')
.V().(Condition-B).as('setB')
select('setA').
where('setA',eq('setB'))
which borrows the last answer from Stephen's answer in the 1st post. Here Condition-A and Condition-B is just a chaining of different filter steps like has or hasLabel etc.
What should I write at the place of .V() in the middle? Or is there some other way to write the query so that Condition-B is completely independent of Condition-A?
Finally, I've read the section for chaining V() in the middle of a query at https://tinkerpop.apache.org/docs/3.5.0/reference/#graph-step. I still cannot fully understand the weird consequences for 2nd post, maybe I should read more about how traversers work?
Thanks Kelvin and Stephen again. Glad and excited to connect with you who wrote a book/wrote the source code for gremlin.
In the middle of a traversal, a V() is applied to every traverser that has been created by the prior steps. Consider this example using the air-routes data set:
g.V(1,2,3)
This will yield three results:
v[1]
v[2]
v[3]
and if we count all vertices in the graph:
gremlin> g.V().count()
==>3747
we get 3,747 results. If we now do:
gremlin> g.V(1,2,3).V().count()
==>11241
we get 11,241 results (exactly 3 times 3747). This is because for each result from g.V(1,2,3) we counted every vertex in the graph.
EDITED to add:
If you need to aggregate some results and then explore the graph again using those results as a filter, one way is to introduce a fold step. This will collapse all of the traversers back into one again. This ensures that the second V step will not be repeated multiple times by any prior fan out.
gremlin> g.V(1,2,3).fold().as('a').V().where(within('a'))
==>v[1]
==>v[2]
==>v[3]
gremlin> g.V(1,2,3).fold().as('a').V().where(without('a')).limit(5)
==>v[0]
==>v[4]
==>v[5]
==>v[6]
==>v[7]
EDITED again to add:
The key part I think people sometimes struggle with is how Gremlin traversals flow. You can think of a query as containing/spawning one or more parallel streams (it may not be executed that way but conceptually it helps me to think of it that way). So g.V('1') creates one stream (we often refer to them as traversers). However g.V('1').out() might create multiple traversers if there is more than one outgoing edge originating from V('1'). When a fold is encountered the traversers are all collapsed back down to one again.

How To Exclude Certain Relationship for Real-Time Prediction in Neo4j

So I'm working on a real-time prediction matter, for example, I have a node (A) (:Person) and he has friends and node (B) as (:Games)
so node (A) has liked a certain Game and his friends liked other games so I recommend those other games for him But the matter is that I need to exclude the games which he is already liked or played.
it seems to be easy around the 'NOT' command but I couldn't find the right code for it yet although I've tried a lot of ways
the one seems closest for me is like:
match (A:Person)-[:Friend]-(n:Person)
where A <> n
with distinct n
match (n)-[:LIKED]-(B:Game)-[:ON]-(:steam), (k:Person{name:'John'})
where not ((k)-[:LIKED]-(:Game)-[:ON]-(:steam))
return B
which has to recommend the games John's friends liked without the games which John already liked.
anyway, when I Run this, the Graph just freezes for a while and then shutdown which is another problem I want to ask for.
Thanks for help
The last WHERE clause has very few constraints on it, and probably explains the hang/timeout. It may help to have a variable name for each label, either to constrain the query or to receive the nodes. more like this
where not ((k:Person{name:'John'})-[:LIKED]->(B:Game)-[:ON]->(C:steam))
return B
specify directional -> relationships (as above) in cypher queries if possible, usually it provides the answer you want, and is faster.
adding the variable name, and relationship direction also makes the query easier to read, to understand what it is doing, and if you need to look at the nodes/relationships values when debugging.
I may be wrong, but the :steam label doesn't look right to me. What are example values? I'm wondering if you meant to have a :service node, and steam would be a node instance?
Note: if you provide create nodes/rels script to create a small example of this database (e.g. a dozen nodes with these relationships) it would be easier to provide a working cypher example.
If you want to find the distinct games on Steam that John's friends liked but John has not yet liked or played, something this should work:
MATCH (j:Person{name:'John'})-[:FRIEND]-(:Person)-[:LIKED]->(g:Game)-[:ON]-(:steam)
WHERE NOT (j)-[:LIKED|PLAYED]->(g)
RETURN DISTINCT g

Jagged Result Array Gremlin Query

Please may you help me to write a query that returns each source vertex in my traversal along with its associated edges and vertices as arrays on each such source vertex? In short, I need a result set comprising an array of 3-tuples with item 1 of each tuple being the source vertex and items 2 and 3 being the associated arrays.
Thanks!
EDIT 1: Expanded on the graph data and added my current problem query.
EDIT 2: Improved Gremlin sample graph code (apologies, didn't think anyone would actually run it.)
Sample Graph
g.addV("blueprint").property("name","Mall").
addV("blueprint").property("name","HousingComplex").
addV("blueprint").property("name","Airfield").
addV("architect").property("name","Tom").
addV("architect").property("name","Jerry").
addV("architect").property("name","Sylvester").
addV("buildingCategory").property("name","Civil").
addV("buildingCategory").property("name","Commercial").
addV("buildingCategory").property("name","Industrial").
addV("buildingCategory").property("name","Military").
addV("buildingCategory").property("name","Resnameential").
V().has("name","Tom").addE("designed").to(V().has("name","HousingComplex")).
V().has("name","Tom").addE("assisted").to(V().has("name","Mall")).
V().has("name","Jerry").addE("designed").to(V().has("name","Airfield")).
V().has("name","Jerry").addE("assisted").to(V().has("name","HousingComplex")).
V().has("name","Sylvester").addE("designed").to(V().has("name","Mall")).
V().has("name","Sylvester").addE("assisted").to(V().has("name","Airfield")).
V().has("name","Sylvester").addE("assisted").to(V().has("name","HousingComplex")).
V().has("name","Mall").addE("classification").to(V().has("name","Commercial")).
V().has("name","HousingComplex").addE("classification").to(V().has("name","Resnameential")).
V().has("name","Airfield").addE("classification").to(V().has("name","Civil"))
Please note that the above is a very simplified rendering of our data.
Needed Query Results
I need to bring back each blueprint vertex as a base with each of its associated edges / vertices as arrays.
My Current Solution
Currently I do this very cumbersome query that gets the blueprints and assigns a label, gets the architects and assigns a label, then selects both labels. The solution is ok; however, it gets messy when I need to include edges or I need to get blueprint classification vertices (industrial, military, residential, commercial, etc.). In effect, the more associated data that I need to pull back for each blueprint, the sloppier my solution becomes.
My current query looks something like this:
g.V().hasLabel("blueprint").as("blueprints").
outE().or(hasLabel("designed"),hasLabel("assisted")).inV().as("architects").
select("blueprints").coalesce(out("classification"),constant()).as("classifications").
select("blueprints","architects","classifications")
The above produces a lot of duplication. If the number of: blueprints is b, architects is a, and classifications is c, the result set comprises b * a * c results. I'd like one blueprint with an array of its associated architects and an array of its associated classifications, if any.
Complications
I'm trying to do this in one query so that I can get all blueprint data from the graph to populate a filtered list. Once I have the list comprising all of the vertices, edges, and their properties, users can then click links to blobs, browse to project sites, etc. Accordingly, I've got pagination as well as filtering to think about and I'd prefer to make one trip to the server each time I get a new page or the filters change.
I figured out an answer; however, it quadruples the compute charge for the query. Not sure if this can be optimized further.
g.V().hasLabel("blueprint").
project("blueprints","architects").
by().
by(outE().or(hasLabel("designed"),hasLabel("assisted")).inV().dedup().fold())
I just solved for blueprints and architects, but classifications just needs another by(...traversal...) and projection label.
I may have to just get the blueprints in one query, get each of their associated items in parallel queries, then put it all together in the API. That would be very bad design for the API data layer but may be necessary for performance reasons.

Gremlin search for vertexes related to 2 or more specific nodes

I'm trying to produce a Gremlin query whereby I need to find vertexes which have edges from specific other vertexes. The less abstract version of this query is I have user vertexes, and those are related to group vertexes (i.e subjects in a school, so students who are in "Year 6 Maths" and "Year 6 English" etc). An extra difficulty is the ability for subgroups to exist in this query.
The query I need to find those users who are in 2 or more groups specified by the user.
Currently I have a brief solution, but in production usage using Amazon Netpune this query performs way too poorly, even with a small amount of data. I'm sure there's a simpler way of achieving this :/
g.V()
.has('id', 'group_1')
.repeat(out("STUDENT", "SUBGROUP"))
.until(hasLabel("USER"))
.aggregate("q-1")
.V()
.has('id', 'group_2')
.repeat(out("STUDENT", "SUBGROUP"))
.until(hasLabel("USER"))
.where(within("q-1"))
.aggregate("q-2")
.V()
.hasLabel(USER)
.where(within("q-2"))
# We add some more filtering here, such as search terms
.dedup()
.range(0, 10)
.values("id")
.toList()
The first major change you can do is to not bother iterating all of V() again for "USER" - that's already that output from the prior steps so collecting "q-2" just to use it for a filter doesn't seem necessary:
g.V().
has('id', 'group_1').
repeat(out("STUDENT", "SUBGROUP")).
until(hasLabel("USER")).
aggregate("q-1").
V().
has('id', 'group_2').
repeat(out("STUDENT", "SUBGROUP")).
until(hasLabel("USER")).
where(within("q-1")).
# We add some more filtering here, such as search terms
dedup().
range(0, 10).
values("id")
That should already be a huge savings to your query because that change avoids iterating the entire graph in memory (i.e. full scan of all vertices) as there was no index lookup there.
I don't know what your additional filters are here:
# We add some more filtering here, such as search terms
but I would definitely look to try to filter the users earlier in your query rather than later. Perhaps consider using emit() on your repeats() to filter better. You should probably also dedup() your "q-1" and reduce the size of the list there.
I'd be curious to know how much just the initial change I suggested works as that was probably the biggest bulk of your query cost (unless you have a really deep/wide tree of student/subgroups I guess). Perhaps there is more that could be tweaked here though, but it would be nice to know that you at least have a traversal with satisfying performance at this point.

How Can I Return Meaningful Errors In Gremlin?

Let's say I have a huge gremlin query with 100 or more steps. One part of this query has a failure and I want it to return a meaningful error message. With a short and sweet query this would not be too difficult, as we can do something like this:
g.V().coalesce(hasId("123"), constant("ERROR - ID does not exist"))
Of course we're asking if a Vertex with an ID of 123 exists. If it does not exist we return a string.
So now let's take this example and make it more complex
g.V().coalesce(hasId("123"), constant("ERROR - ID does not exist")).as("a").V().coalesce(hasId("123"), constant("ERROR - ID does not exist")).as("b").select("a").valueMap(false)
If a vertex with ID: "123" exists we return all properties stored on the vertex.
Lets say a vertex with ID: "123" does not exist in the database. How can I get a meaningful error returned without getting a type error for trying to do a .valueMap() on a string?
First of all, if you have a single line of Gremlin with 100 or more steps (not counting anonymous child traversals steps of course), I'd suggest you re-examine your approach in general. When I encounter Gremlin of that size, it usually means that someone is generating a large traversal for purpose of mutating the graph in some way. That's considered an anti-pattern and something to avoid as the larger the Gremlin grows the greater the chance of hitting the Xss JVM limits for a StackOverflowException and traversal compilation times tend to add up and get expensive. All of that can be avoided in many cases by using inject() or withSideEffect() in some way to pass the data in on the traversal itself and then use Gremlin to be the loop that iterates that data into mutation steps. The result is a slightly more complex Gremlin statement, but one that will perform better and avoid the StackOverflowException.
Second, note that this traversal will likely not behave as you want on any graph provider - see this example on TinkerGraph:
gremlin> g.V().coalesce(hasId(1),constant('x'))
==>v[1]
==>x
==>x
==>x
==>x
==>x
gremlin> g.V().hasId(1)
==>v[1]
The hasId() inside the coalesce() won't be optimized by the graph as an fast id lookup but will instead be treated as a full table scan with a filter.
In answer to your question though, I'd say that the easiest option open to you is to just move the valueMap() inside the coalesce():
g.V().coalesce(hasId("123").valueMap(false),
constant("ERROR - ID does not exist")).as("a").
V().coalesce(hasId("123").valueMap(false),
constant("ERROR - ID does not exist")).as("b").
select("a")
I see why that might be bad if you lots of steps other than valueMap() because then you have replicate the same steps over and over again making the code even larger. I guess that goes back to my first point.
I suppose you could use a lambda though not all graph providers support that - note that I've modified your code to ensure a lookup by id for purpose of demonstration:
gremlin> g.V(1).fold().coalesce(unfold(),map{throw new IllegalStateException("bad")})
==>v[1]
gremlin> g.V(10).fold().coalesce(unfold(),map{throw new IllegalStateException("bad")})
bad
At this time, I'm not sure there's much else you can do. Maybe you could make a "error" Vertex that you could return in constant() that way valueMap() would work but it's hard to say if that would be helpful given what I know about the overall intent of your traversal. I suppose you could maybe come up with a fancy evaluation of an if-then using choose() but that might be hard to read and look awkward. The only other option I can think of is to store the error as a side-effect:
gremlin> g.V(10).fold().coalesce(unfold(),store('error').by(constant('x'))).cap('error')
==>[x]
I don't think Gremlin gives you any really elegant way to do what you want right now.

Resources