Gremlin: limit by vertex label - gremlin

Hello dear gremlin jedi,
I have a bunch of nodes with different labels in my graph:
g.addV('book')
.addV('book')
.addV('book')
.addV('movie')
.addV('movie')
.addV('movie')
.addV('album')
.addV('album')
.addV('album').iterate()
There also may be vertices with other labels.
and a hash map describing what labels and how many vertices of each label I want to get:
LIMITS = {
"book": 2,
"movie": 2,
"album": 2,
}
I'd like to write a query that returns a list of vertices consisting of vertices with specified labels whete amount of vertices with each label is limited in according to the LIMITS hash map. In this case there should be 2 books, 2 movies and 2 albums in the result.
The limits and requested labels are calculated independently for every query so they cannot be hardcoded.
As far as I can see the limit step does not support passing traversals as an argument.
What trick can I use to write such query? The only option I see is to build the query using capabilities of the client side programming language (Ruby with grumlin as a gremlin client in my case):
nodes = LIMITS.map do |label, limit|
__.hasLabel(label).limit(limit)
end
g.V().union(*nodes).toList
But I believe there is a better solution.
Thank you!

The most direct way would be to use group() I think:
gremlin> g.V().group().by(label)
==>[software:[v[3],v[5]],person:[v[1],v[2],v[4],v[6]]]
gremlin> g.V().group().by(label).by(unfold().limit(2).fold())
==>[software:[v[3],v[5]],person:[v[1],v[2]]]
You can filter the vertices going to group() with hasLabel() if you need those sorts of restrictions. Depending upon how you use this, the traversal could be expensive in the sense that you have to traverse a fair bit of data to filter away all but two (in this case) vertices. If that is a concern, your approach to dynamically construct the traversal and the piecing it together with union() doesn't seem so bad. While I could probably think up a way to write that in just Gremlin, it probably wouldn't not be as readable as your approach.

Related

Gremlin: Add edges to multiple vertices

I have vertices [song1, song2, song3, user].
I want to add edges listened from user to the songs.
I have the following:
g.V().is(within(song1, song2, song3)).addE('listened').from(user)
However I'm getting the following error:
No signature of method: org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.DefaultGraphTraversal.from() is applicable for argument types: (org.janusgraph.graphdb.vertices.CacheVertex) values: [v[4344]]
Possible solutions: sort(), drop(int), sum(), find(), grep(), sort(groovy.lang.Closure)
Of course, I can iterate through them one at a time instead but a single query would be nice:
user.addEdge('listened', song1)
user.addEdge('listened', song2)
user.addEdge('listened', song3)
The from() modulator accepts two things:
a step label or
a traversal
A single vertex or a list of vertices can easily be turned into a traversal by wrapping it in V(). Also, note that g.V().is(within(...)) will most likely end up being a full scan over all vertices; it pretty much depends on the provider implementation, but you should prefer to use g.V(<list of vertices>) instead. Thus your traversal should look more like any of these:
g.V().is(within(song1, song2, song3)).
addE('listened').from(V(user)) // actually bad, as it's potentially a full scan
g.V(song1, song2, song3).
addE('listened').from(V(user))
g.V(user).as('u').
V(within(song1, song2, song3)).
addE('listened').from('u')

Add edge if not exist using gremlin

I'm using cosmos graph db in azure.
Does anyone know if there is a way to add an edge between two vertex only if it doesn't exist (using gremlin graph query)?
I can do that when adding a vertex, but not with edges. I took the code to do so from here:
g.Inject(0).coalesce(__.V().has('id', 'idOne'), addV('User').property('id', 'idOne'))
Thanks!
It is possible to do with edges. The pattern is conceptually the same as vertices and centers around coalesce(). Using the "modern" TinkerPop toy graph to demonstrate:
gremlin> g.V().has('person','name','vadas').as('v').
V().has('software','name','ripple').
coalesce(__.inE('created').where(outV().as('v')),
addE('created').from('v').property('weight',0.5))
==>e[13][2-created->5]
Here we add an edge between "vadas" and "ripple" but only if it doesn't exist already. the key here is the check in the first argument to coalesce().
The performance of the accepted answer isn't great since it use inE(...), which is an expensive operation.
This query is what I use for my work in CosmosDB:
g.E(edgeId).
fold().
coalesce(
unfold(),
g.V(sourceId).
has('pk', sourcePk).
as('source').
V(destinationId).
has('pk', destinationPk).
addE(edgeLabel).
from('source').
property(T.id, edgeId)
)
This uses the id and partition keys of each vertex for cheap lookups.
I have been working on similar issues, trying to avoid duplication of vertices or edges. The first is a rough example of how I check to make sure I am not duplicating a vertex:
"g.V().has('word', 'name', '%s').fold()"
".coalesce(unfold(),"
"addV('word')"
".property('name', '%s')"
".property('pos', '%s')"
".property('pk', 'pk'))"
% (re.escape(category_),re.escape(category_), re.escape(pos_))
The second one is the way I can make sure that isn't a directional edge in either direction. I make use of two coalesce statements, one nested inside the other:
"x = g.V().has('word', 'name', '%s').next()\n"
"y = g.V().has('word', 'name', '%s').next()\n"
"g.V(y).bothE('distance').has('weight', %f).fold()"
".coalesce("
"unfold(),"
"g.addE('distance').from(x).to(y).property('weight', %f)"
")"
% (word_1, word_2, weight, weight)
So, if the edge exists y -> x, it skips producing another one. If y -> x doesn't exist, then it tests to see if x -> y exists. If not, then it goes to the final option of creating x -> y
Let me know if anyone here knows of a more concise solution. I am still very new to gremlin, and would love a cleaner answer. Though, this one appears to suffice.
When I implemented the previous solutions provided, when I ran my code twice, it produced an edge for each try, because it only tests one direction before creating a new edge.

ArangoDB Graph-Traversal: Exclude Edges

I'm executing a query similar to:
FOR v, e IN 1..10 ANY #start GRAPH #graph
FILTER e.someCondition
RETURN v
What I expected to happen was that if e.someCondition was false, then the edge in question wouldn't be traversed (and transitively, all other vertexes and edges reachable solely through e would never be visited).
However, it seems that what happens is that e is merely skipped and then traversal continues down that path.
So, how can I set boundaries on graph-traversal by edge-properties using AQL?
The query supports v, e and p, where p is the path it takes.
The ArangoDB documentation shows some examples.
I have used this to exclude specific nodes at specified depths in the path but you have to specify depth of nodes e.g. p.vertices[0].something != 'value').
Another thing you might want to look at is working with 'Custom Visitor' functions which are evaluated as the query traverses down a path.
This good blog post and this ArangoDB guide show some real world examples of it and it's well worth the read and effort to get a sample working. I've used these functions to summarise data in a path, aggregated by properties on the vertices in the path, but you can also use it to have custom paths followed.
This is worth the effort because it gives you huge flexibility over the computation it follows when traversing the graph. You can exclude branches, only include branches that meet specific requirements, or aggregate data about the paths it took.
I hope that helps.

Traverse Graph With Directed Cycles using Relationship Properties as Filters

I have a Neo4j graph with directed cycles. I have had no issue finding all descendants of A assuming I don't care about loops using this Cypher query:
match (n:TEST{name:"A"})-[r:MOVEMENT*]->(m:TEST)
return n,m,last(r).movement_time
The relationships between my nodes have a timestamp property on them, movement_time. I've simulated that in my test data below using numbers that I've imported as floats. I would like to traverse the graph using the timestamp as a constraint. Only follow relationships that have a greater movement_time than the movement_time of the relationship that brought us to this node.
Here is the CSV sample data:
from,to,movement_time
A,B,0
B,C,1
B,D,1
B,E,1
B,X,2
E,A,3
Z,B,5
C,X,6
X,A,7
D,A,7
Here is what the graph looks like:
I would like to calculate the descendants of every node in the graph and include the timestamp from the last relationship using Cypher; so I'd like my output data to look something like this:
Node:[{Descendant,Movement Time},...]
A:[{B,0},{C,1},{D,1},{E,1},{X,2}]
B:[{C,1},{D,1},{E,1},{X,2},{A,7}]
C:[{X,6},{A,7}]
D:[{A,7}]
E:[{A,3}]
X:[{A,7}]
Z:[{B,5}]
This non-Neo4J implementation looks similar to what I'm trying to do: Cycle enumeration of a directed graph with multi edges
This one is not 100% what you want, but very close:
MATCH (n:TEST)-[r:MOVEMENT*]->(m:TEST)
WITH n, m, r, [x IN range(0,length(r)-2) |
(r[x+1]).movement_time - (r[x]).movement_time] AS deltas
WHERE ALL (x IN deltas WHERE x>0)
RETURN n, collect(m), collect(last(r).movement_time)
ORDER BY n.name
We basically find all the paths between any of your nodes (beware cartesian products get very expensive on non-trivial datasets). In the WITH we're building a collection delta's that holds the difference between two subsequent movement_time properties.
The WHERE applies an ALL predicate to filter out those having any non-positive value - aka we guarantee increasing values of movement_time along the path.
The RETURN then just assembles the results - but not as a map, instead one collection for the reachable nodes and the last value of movement_time.
The current issue is that we have duplicates since e.g. there are multiple paths from B to A.
As a general notice: this problem is much more elegantly and more performant solvable by using Java traversal API (http://neo4j.com/docs/stable/tutorial-traversal.html). Here you would have a PathExpander that skips paths with decreasing movement_time early instead of collection all and filter out (as Cypher does).

Reachable vertices from each other

Given a directed graph, I need to find all vertices v, such that, if u is reachable from v, then v is also reachable from u. I know that, the vertex can be find using BFS or DFS, but it seems to be inefficient. I was wondering whether there is a better solution for this problem. Any help would be appreciated.
Fundamentally, you're not going to do any better than some kind of search (as you alluded to). I wouldn't worry too much about efficiency: these algorithms are generally linear in the number of nodes + edges.
The problem is a bit underspecified, so I'll make some assumptions about your data structure:
You know vertex u (because you didn't ask to find it)
You can iterate both the inbound and outbound edges of a node (efficiently)
You can iterate all nodes in the graph
You can (efficiently) associate a couple bits of data along with each node
In this case, use a convenient search starting from vertex u (depth/breadth, doesn't matter) twice: once following the outbound edges (marking nodes as "reachable from u") and once following the inbound edges (marking nodes as "reaching u"). Finally, iterate through all nodes and compare the two bits according to your purpose.
Note: as worded, your result set includes all nodes that do not reach vertex u. If you intended the conjunction instead of the implication, then you can save a little time by incorporating the test in the second search, rather than scanning all nodes in the graph. This also relieves assumption 3.

Resources