Gremlin groupCount() return only gt 2 - azure-cosmosdb

I have the following Gremlin traversal, running on Azure CosmosDB, and I only want to return URLs with a count greater than 1. I'm not sure how to limit the return from the groupCount().
g.V().hasLabel('article').values('url').groupCount()

Here's an example from the modern toy graph:
gremlin> g = TinkerFactory.createModern().traversal()
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> g.V().hasLabel('software').in().
......1> groupCount().
......2> by('name').
......3> unfold().
......4> filter(select(values).unfold().is(gt(1)))
==>josh=2
So you do the groupCount() and then unfold() the resulting Map then filter() the individual values from the Map.
In your case you would probably have something like:
g.V().hasLabel('article').
groupCount().
by('url').
unfold().
filter(select(values).unfold().is(gt(1)))

Per my comment on the answer from Stephen Mallette, Azure CosmosDB Graph (https://learn.microsoft.com/en-us/azure/cosmos-db/gremlin-support) doesn't support the filter step, so I used the where step to achieve the desired result.
g.V().hasLabel('article').groupCount().by('url').unfold().where(select(values).is(gt(1)))
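To make the data flow concrete, here is a minimal plain-Python sketch of what the groupCount(), unfold() and where() steps do to the data. It is only a model of the semantics, not Gremlin or a Cosmos DB client call; the function name and the sample URLs are made up for illustration.

```python
from collections import Counter

def urls_seen_more_than_once(urls):
    """Plain-Python model of groupCount().by('url').unfold().where(select(values).is(gt(1)))."""
    counts = Counter(urls)    # groupCount(): a single map of url -> count
    entries = counts.items()  # unfold(): one (key, value) entry per url
    # where(select(values).is(gt(1))): keep entries whose count is above 1
    return {url: n for url, n in entries if n > 1}

# Hypothetical sample data, not taken from the question
sample = ["a.com/1", "a.com/2", "a.com/1", "a.com/3", "a.com/2", "a.com/1"]
print(urls_seen_more_than_once(sample))
```

The point of the unfold() is visible here: the filter has to see individual entries, not the one Map that groupCount() emits.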

Related

Add an edge failing after creating a vertex (Neptune InternalFailureException)

I am trying to create a user vertex and a city vertex (if they do not already exist in the graph), and then add an edge between the two of them. When I execute the command in the same traversal, I run into this InternalFailureException from Neptune.
g.V("user-12345").
fold().
coalesce(unfold(),addV("user").property(id, "user-12345")).as('user').
V("city-ATL").
fold().
coalesce(unfold(), addV("city").property(id, "city-ATL")).as("city").
addE("lives_in").
from("user").
to("city")
{"code":"InternalFailureException","detailedMessage":"An unexpected error has occurred in Neptune.","requestId":"xxx"}
(Note in the case above, both user-12345 and city-ATL do not exist in the graph).
However, when I create the city before executing the command, it works just fine:
gremlin> g.V("city-ATL").
fold().
coalesce(unfold(),
addV("city").property(id, "city-ATL"))
==>v[city-ATL]
gremlin> g.V("user-12345").
fold().
coalesce(unfold(),addV("user").property(id, "user-12345")).as('user').
V("city-ATL").
fold().
coalesce(unfold(), addV("city").property(id, "city-ATL")).as("city").
addE("lives_in").from("user").to("city")
==>e[1abd87d6-6f54-9e42-ae0a-47401c9dcfe6][user-12345-lives_in-city-ATL]
I am trying to build a traversal that can do them both together. Does anyone know why Neptune might be throwing this InternalFailureException when the city doesn't exist?
I will investigate further why you did not get a more useful error message, but I can see that the Gremlin query will need to change. After a fold() step, any prior as() labels are lost, because fold() reduces the traversers down to one. A fold() is both a barrier and a map. Since you have to use fold() for each coalesce(), you should be able to use store() or aggregate(local) instead of as():
gremlin> g.V('user-1234').
......1> fold().
......2> coalesce(unfold(),addV('person').property(id,'user-1234')).store('a').
......3> V('city-ATL').
......4> fold().
......5> coalesce(unfold(),addV('city').property(id,'city-ATL')).store('b').
......6> addE('lives_in').
......7> from(select('a').unfold()).
......8> to(select('b').unfold())
==>e[0][user-1234-lives_in->city-ATL]
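The difference between an as() label and a store() side-effect can be sketched with a toy model in plain Python. This is a simplified illustration under my own assumptions (the Traverser class and helper names are hypothetical), not how TinkerPop or Neptune implement these steps.

```python
class Traverser:
    """Toy traverser: a current value plus the as() labels on its path."""
    def __init__(self, value, labels=None):
        self.value = value
        self.labels = dict(labels or {})

def as_label(traversers, name):
    # as(name): remember the current value under a step label on the path
    for t in traversers:
        t.labels[name] = t.value
    return traversers

def store(traversers, side_effects, key):
    # store(key): append the current value to a traversal-wide side-effect
    for t in traversers:
        side_effects.setdefault(key, []).append(t.value)
    return traversers

def fold(traversers):
    # fold(): a reducing barrier; all incoming traversers collapse into one
    # new traverser holding a list, and the old paths (labels) are dropped
    return [Traverser([t.value for t in traversers])]

side_effects = {}
stream = [Traverser("user-1234")]
stream = as_label(stream, "user")
stream = store(stream, side_effects, "a")
stream = fold(stream)  # stands in for the fold() before the second coalesce()

print(stream[0].labels)  # labels carried by the traverser after fold()
print(side_effects)      # side-effects still available after fold()
```

In this model the label is gone after the fold while the side-effect survives, which is why from(select('a').unfold()) can still find the user vertex.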

How can I make complex Gremlin queries in AWS Neptune without variables?

I'm using Amazon Neptune, which does not support variables. For complex queries, however, I need to use a variable in multiple places. How can I do this without querying twice for the same data?
Here's the problem I'm trying to tackle:
Given a start Person, find Persons that the start Person is connected to by at most 3 steps via the knows relationship. Return each Person's name and email, as well as the distance (1-3).
How would I write this query in Gremlin without variables, since variables are unsupported in Neptune?
I don't see any reason why you would need variables for your traversal and there are many ways you could get an answer. Assuming this graph:
g = TinkerGraph.open().traversal()
g.addV('person').property('name','A').property('age',20).as('a').
addV('person').property('name','B').property('age',21).as('b').
addV('person').property('name','C').property('age',22).as('c').
addV('person').property('name','D').property('age',19).as('d').
addV('person').property('name','E').property('age',22).as('e').
addV('person').property('name','F').property('age',24).as('f').
addE('next').from('a').to('b').
addE('next').from('b').to('c').
addE('next').from('b').to('d').
addE('next').from('c').to('e').
addE('next').from('d').to('e').
addE('next').from('e').to('f').iterate()
You could do something like:
gremlin> g.V().has('person','name','A').
......1> repeat(out().
......2> group('m').
......3> by(loops()).
......4> by(valueMap('name','age').by(unfold()).fold())).
......5> times(3).
......6> cap('m')
==>[0:[[name:B,age:21]],1:[[name:C,age:22],[name:D,age:19]],2:[[name:E,age:22],[name:E,age:22]]]
Find a particular "person" vertex by their name, in this case "A", then repeatedly traverse out() and group those vertices you come across by loops() which is how deep you have traversed. I use valueMap() in this case to extract the properties you wanted. The times(3) is the limit to the depth of your search. Finally you cap() out the side-effect Map held in "m" from our group(). That approach was meant to just give you a bit of basic structure to how you would accomplish this. You could perhaps polish it further this way:
gremlin> g.V().has('person','name','A').
......1> repeat(out().
......2> group('m').
......3> by(loops())).
......4> times(3).
......5> cap('m').unfold().select(values).unfold().
......6> dedup().
......7> valueMap('name','age').by(unfold())
==>[name:B,age:21]
==>[name:C,age:22]
==>[name:D,age:19]
==>[name:E,age:22]
The above example extracts the values from the Map in "m", removes the duplicates with dedup() and then converts to the result you want. Maybe you don't need the Map in the first place (I just have it on my mind because of this answer actually) - you could simply store() your results as follows:
gremlin> g.V().has('person','name','A').
......1> repeat(out().store('m')).
......2> times(3).
......3> cap('m').unfold().
......4> dedup().
......5> valueMap('name','age').by(unfold())
==>[name:B,age:21]
==>[name:C,age:22]
==>[name:D,age:19]
==>[name:E,age:22]
You might look at using something like simplePath() as well to help avoid re-traversing the same paths over and over again. You can read about that step in the Reference Documentation.
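For readers who want to see the traversal logic outside of Gremlin, here is a rough plain-Python model of repeat(out()...).times(3) over the sample graph above. It is a sketch, not a Neptune or TinkerPop API; EDGES, group_by_depth and reachable are names I made up, and only vertex names are tracked.

```python
from collections import defaultdict

# Adjacency list for the sample 'next' edges created above
EDGES = {"A": ["B"], "B": ["C", "D"], "C": ["E"], "D": ["E"], "E": ["F"], "F": []}

def group_by_depth(start, max_depth):
    """Model of repeat(out().group('m').by(loops())).times(max_depth).cap('m')."""
    groups = defaultdict(list)
    frontier = [start]
    for loop in range(max_depth):                               # times(max_depth)
        frontier = [nxt for v in frontier for nxt in EDGES[v]]  # out()
        groups[loop].extend(frontier)                           # group('m').by(loops())
    return dict(groups)

def reachable(start, max_depth):
    """Model of the store('m') variant: collect everything, then dedup()."""
    groups = group_by_depth(start, max_depth)
    seen = [v for depth in sorted(groups) for v in groups[depth]]
    return list(dict.fromkeys(seen))                            # dedup(), first-seen order

print(group_by_depth("A", 3))  # keys are loops() values; distance is key + 1
print(reachable("A", 3))
```

Note that "E" appears twice at loop 2 because it is reached over two different paths, which matches the duplicate in the first Gremlin result and is what dedup() removes.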

Clone an Edge and target vertex in Cosmos Gremlin

g.addV('test').property('id','1').property('name','test 1')
g.addV('test').property('id','2').property('name','test 2')
g.V('1').addE('owns').to(g.addV('another').property('id','3'))
Is there any way I can clone this owns edge and its target another vertex from test 1 onto the test 2 vertex, with all properties? This is just sample data. I have vertices with at least 10 properties.
NOTE : Query needs to support cosmos db gremlin api.
The answer to this one is mostly presented in this other StackOverflow question, which explains how to clone a vertex and all its edges. Since this question is slightly different, I thought I'd adapt it a bit rather than suggesting closing this as a duplicate.
gremlin> g.V().has('test','name','test 1').as('t1').
......1> outE('owns').as('e').inV().as('source').
......2> V().has('test','name','test 2').as('target').
......3> sideEffect(
......4> select('source').properties().as('p').
......5> select('target').
......6> property(select('p').key(), select('p').value())).
......7> sideEffect(
......8> select('t1').
......9> addE(select('e').label()).as('eclone').
.....10> to(select('target')).
.....11> select('e').properties().as('p').
.....12> select('eclone').
.....13> property(select('p').key(), select('p').value()))
==>v[3]
gremlin> g.E()
==>e[8][0-owns->6]
==>e[10][0-owns->3]
gremlin> g.V().valueMap(true)
==>[id:0,label:test,name:[test 1],id:[1]]
==>[id:3,label:test,name:[test 2],id:[3]]
==>[id:6,label:another,id:[3]]
Note that since labels are immutable, you are stuck with the vertex label being "another" given the way that you laid out your sample data. Also, I know it is just sample data, but note that overloading "id" isn't a good choice as it can lead to confusion with T.id.
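As a rough illustration of what the two sideEffect() blocks do, here is a plain-Python model using dictionaries. It is a sketch under stated assumptions: vertices are keyed by name for readability, properties are treated as single-valued (so a copied value overwrites an existing one), and all names are hypothetical rather than part of any Gremlin API.

```python
# Toy graph: vertex properties keyed by a readable name, edges as tuples
vertices = {
    "test 1": {"id": "1", "name": "test 1"},
    "test 2": {"id": "2", "name": "test 2"},
    "another": {"id": "3"},
}
edges = [("test 1", "owns", "another", {})]  # (from, label, to, edge properties)

def clone_edge_and_target(edges, vertices, owner, label, target):
    for src, lbl, dst, props in list(edges):  # outE(label).inV()
        if src == owner and lbl == label:
            # first sideEffect(): copy every property of the source vertex
            # onto the target (assumed single cardinality, so it overwrites)
            vertices[target].update(vertices[dst])
            # second sideEffect(): add an edge with the same label and
            # properties from the owner to the target
            edges.append((owner, lbl, target, dict(props)))

clone_edge_and_target(edges, vertices, "test 1", "owns", "test 2")
print(vertices["test 2"])
print(edges)
```

In this model the copied "id" property replaces the one on "test 2", which is consistent with the valueMap(true) output above and shows why overloading "id" gets confusing.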
Run this query to get the vertex ids: g.V().has('name','test 1').id()
Then loop over the results in Java code and run the add-edge query for each one:
g.V(<the id of vertex loop>).addE('owns').to(<the id of vertex 'test2'>)
If there are multiple 'test 2' vertices, you could use a nested (two-dimensional) loop.

Janusgraph Gremlin query

I am relatively new to graphs.
I'm looking for help building a Gremlin query equivalent to the SQL below.
Select a.x1,a.x2,b.y1,b.y2 from table1 a, table b where a.x1=b.y1 and a.x2=b.y2.
Consider the tables as vertices and x1, x2, y1, y2 as properties.
In JanusGraph there are no edges between these vertices, and the property labels are also different. Before getting the result, I need to check that the vertices have no edges.
If there are no edges, then this isn't a terribly "graphy" query so this might look a little clumsy. I think you would have to use some form of mid-traversal V(). I demonstrated here with a little data:
gremlin> g.addV('a').property('x1',1).property('x2',2).
......1> addV('b').property('y1',1).property('y2',2).
......2> addV('b').property('y1',2).property('y2',3).iterate()
gremlin> g.V().hasLabel('a').as('a').
......1> V().hasLabel('b').as('b').
......2> where('a', eq('b')).
......3> by('x1').
......4> by('y1').
......5> where('a', eq('b')).
......6> by('x2').
......7> by('y2').
......8> select('a','b').
......9> by(valueMap(true))
==>[a:[label:a,id:0,x1:[1],x2:[2]],b:[label:b,id:3,y1:[1],y2:[2]]]
I'm not sure if there isn't a nicer way to do this. Depending on how large your dataset is, this could be a tremendously expensive traversal and would probably be a better candidate for a form of OLAP traversal using Gremlin Spark.
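To see why this can get expensive, here is a small plain-Python model of the mid-traversal V() approach using the same sample data. It is only a sketch of the pairing logic (the function name is made up, and a real graph provider may optimize differently): every 'a' vertex is compared against every 'b' vertex.

```python
# Same sample data as the Gremlin example above
a_vertices = [{"x1": 1, "x2": 2}]
b_vertices = [{"y1": 1, "y2": 2}, {"y1": 2, "y2": 3}]

def join_on_properties(a_rows, b_rows):
    """Model of V().hasLabel('a').as('a').V().hasLabel('b').as('b') plus both where() steps."""
    matches, comparisons = [], 0
    for a in a_rows:          # outer traversers labelled 'a'
        for b in b_rows:      # mid-traversal V(): all 'b' vertices per 'a'
            comparisons += 1
            # where('a', eq('b')).by('x1').by('y1') and .by('x2').by('y2')
            if a["x1"] == b["y1"] and a["x2"] == b["y2"]:
                matches.append((a, b))
    return matches, comparisons

matches, comparisons = join_on_properties(a_vertices, b_vertices)
print(matches)
print(comparisons)  # grows with len(a_rows) * len(b_rows)
```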

Why do you need to fold/unfold using coalesce for a conditional insert?

I'm trying to understand how this pattern for a conditional insert works:
g.V()
.hasLabel('person').has('name', 'John')
.fold()
.coalesce(
__.unfold(),
g.addV('person').property('name', 'John')
).next();
What is the purpose of the fold/unfold? Why are these necessary, and why does this not work:
g.V()
.coalesce(
__.hasLabel('person').has('name', 'John'),
g.addV('person').property('name', 'John')
).next();
The fold-then-unfold pattern seems redundant to me and yet the above does not yield the same result.
Consider what happens when you just do the following:
gremlin> g = TinkerFactory.createModern().traversal()
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> g.V().has('name','marko')
==>v[1]
gremlin> g.V().has('name','stephen')
gremlin>
For "marko" you return something and for "stephen" you do not. The "stephen" case is the one to pay attention to because that is the one where the fold() truly becomes important in this pattern. When that traversal returns nothing, any steps you add after that will not have a Traverser present to trigger actions in those steps. Therefore even the following will not add a vertex:
gremlin> g.V().has('name','stephen').addV('person')
gremlin>
But look what happens if we fold():
gremlin> g.V().has('name','stephen').fold()
==>[]
fold() is a reducing barrier step and will thus eagerly evaluate the traversal up to that point and return the contents as a List even if the contents of that traversal up to that point yield nothing (in which case, as you can see, you get an empty list). And if you have an empty List that empty List is a Traverser flowing through the traversal and therefore future steps will fire:
gremlin> g.V().has('name','stephen').fold().addV('person')
==>v[13]
So that explains why we fold(). In your example we are checking for the existence of "John". If he is found, he will be in the List, and when that List hits coalesce() its first check will be to unfold() the List and return that Vertex - done. If the List is empty, unfold() returns nothing because "John" does not exist, so coalesce() falls through to the second option and adds the vertex (by the way, you don't need the "g." in front of addV(); it should just be an anonymous traversal and thus __.addV('person')).
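The same reasoning can be sketched in plain Python, treating a traversal as a list of traversers. This is only a toy model of the semantics with made-up helper names and a graph reduced to a list of names; it is not TinkerPop code.

```python
def add_v_after(traversers, graph, name):
    # A step only fires for traversers that actually reach it
    for _ in traversers:
        graph.append(name)
    return graph

def get_or_create(graph, name):
    matches = [v for v in graph if v == name]  # g.V().has('name', name)
    folded = [matches]                         # fold(): always exactly one traverser
    results = []
    for found in folded:                       # coalesce() runs once, on the list
        if found:
            results.extend(found)              # unfold(): return the existing vertex
        else:
            graph.append(name)                 # addV(): nothing was found
            results.append(name)
    return results

graph = ["marko"]
# Without fold(): no "stephen" traverser exists, so addV() never fires
add_v_after([v for v in graph if v == "stephen"], graph, "stephen")
print(graph)

print(get_or_create(graph, "stephen"))  # first call creates the vertex
print(get_or_create(graph, "stephen"))  # second call finds it
print(graph)
```

The empty list produced by the fold is still one item flowing through the pipeline, which is all coalesce() needs in order to run.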
Turning to your example, I would first point out that I think you wanted to ask about this:
g.V().
coalesce(
__.has('person','name', 'John'),
__.addV('person').property('name', 'John'))
This is a completely different query. In this traversal, you're saying iterate all the vertices and for each one execute what is in the coalesce(). You can see this fairly plainly by replacing the addV() with constant('x'):
gremlin> g = TinkerFactory.createModern().traversal()
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> g.V().
......1> coalesce(
......2> has('person','name', 'John'),
......3> constant('x'))
==>x
==>x
==>x
==>x
==>x
==>x
gremlin> g.V().
......1> coalesce(
......2> has('person','name', 'marko'),
......3> constant('x'))
==>v[1]
==>x
==>x
==>x
==>x
==>x
Now, imagine what happens with addV() and "John". It will call addV() 6 times, once for each vertex it comes across that is not "John":
gremlin> g.V().
......1> coalesce(
......2> __.has('person','name', 'John'),
......3> __.addV('person').property('name', 'John'))
==>v[13]
==>v[15]
==>v[17]
==>v[19]
==>v[21]
==>v[23]
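A plain-Python model makes the per-vertex behaviour easy to check. This sketch assumes the six vertex names of the modern toy graph, ignores the label check in has(), and iterates over the vertices that existed when the traversal started; the function name is hypothetical.

```python
# Names of the six vertices in the modern toy graph
MODERN_NAMES = ["marko", "vadas", "lop", "josh", "ripple", "peter"]

def coalesce_per_vertex(names, wanted):
    """Model of g.V().coalesce(has('name', wanted), addV().property('name', wanted))."""
    results, added = [], []
    for name in names:       # g.V(): one traverser per existing vertex
        if name == wanted:   # first coalesce() branch matched this vertex
            results.append(name)
        else:                # second branch fires for every non-matching vertex
            added.append(wanted)
            results.append(wanted)
    return results, added

results, added = coalesce_per_vertex(MODERN_NAMES, "John")
print(len(added))  # one new "John" per existing vertex

results_marko, added_marko = coalesce_per_vertex(MODERN_NAMES, "marko")
print(len(added_marko))  # still adds a vertex for each non-"marko" vertex
```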
Personally, I like the idea of wrapping up this kind of logic in a Gremlin DSL - there is a good example of doing so here.
Nice question - I've described the "Element Existence" issue as part of a Gremlin Recipe that can be read here.
