Number of nodes/edges in a large graph via Gremlin?

What is the easiest & most efficient way to count the number of nodes/edges in a large graph via Gremlin? The best I have found is using the V iterator:
gremlin> g.V.gather{it.size()}
However, this is not a viable option for large graphs, per the documentation for V:
The vertex iterator for the graph. Utilize this to iterate through all
the vertices in the graph. Use with care on large graphs unless used
in combination with a key index lookup.

I think the preferred way to do a count of all vertices would be:
gremlin> g = TinkerGraphFactory.createTinkerGraph()
==>tinkergraph[vertices:6 edges:6]
gremlin> g.V.count()
==>6
gremlin> g.E.count()
==>6
Though I think that on a very large graph g.V or g.E just breaks down no matter what you do. On a very large graph, the best option for doing a count is to use a tool like Faunus (http://thinkaurelius.github.io/faunus/) so that you can leverage the power of Hadoop to do the counts in parallel.
UPDATE: The original answer above was for TinkerPop 2.x. For TinkerPop 3.x the answer is largely the same and implies use of Gremlin Spark or some provider specific tooling (like DSE GraphFrames for DataStax Graph) that is optimized to do those kinds of large scale traversals.

I tried the above and it didn't work for me. For some of you, this may work:
gremlin> g.V.count()
{"detailedMessage":"Query parsing failed at line 1, character position at 3, error message : no viable alternative at input 'g.V.'","code":"MalformedQueryException","requestId":"99f749db-c240-9834-aa12-e17bb21e598e"}
Type ':help' or ':h' for help.
Display stack trace? [yN]
gremlin> g.V().count()
==>37
gremlin> g.E().count()
==>45
gremlin>
Use g.V().count() instead of g.V.count() (for those where the other command errors out).

Via Python:
from gremlin_python.structure.graph import Graph
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
# Connect to the Gremlin Server; 'g' is the name of the remote traversal source.
graph = Graph()
graph_db_uri = 'ws://localhost/gremlin'
g = graph.traversal().withRemote(DriverRemoteConnection(graph_db_uri, 'g'))
# Count the vertices and the edges that carry a given label.
count = g.V().hasLabel('node_label').count().next()
print("vertex count: ", count)
count = g.E().hasLabel('edge_label').count().next()
print("edge count: ", count)

Related

Gremlin: limit by vertex label

Hello dear gremlin jedi,
I have a bunch of nodes with different labels in my graph:
g.addV('book')
.addV('book')
.addV('book')
.addV('movie')
.addV('movie')
.addV('movie')
.addV('album')
.addV('album')
.addV('album').iterate()
There may also be vertices with other labels.
I also have a hash map describing which labels I want and how many vertices of each label to return:
LIMITS = {
"book": 2,
"movie": 2,
"album": 2,
}
I'd like to write a query that returns a list of vertices with the specified labels, where the number of vertices with each label is limited according to the LIMITS hash map. In this case there should be 2 books, 2 movies and 2 albums in the result.
The limits and requested labels are calculated independently for every query so they cannot be hardcoded.
As far as I can see the limit step does not support passing traversals as an argument.
What trick can I use to write such a query? The only option I see is to build the query using the capabilities of the client-side programming language (Ruby with grumlin as a Gremlin client in my case):
nodes = LIMITS.map do |label, limit|
__.hasLabel(label).limit(limit)
end
g.V().union(*nodes).toList
But I believe there is a better solution.
Thank you!
The most direct way would be to use group() I think:
gremlin> g.V().group().by(label)
==>[software:[v[3],v[5]],person:[v[1],v[2],v[4],v[6]]]
gremlin> g.V().group().by(label).by(unfold().limit(2).fold())
==>[software:[v[3],v[5]],person:[v[1],v[2]]]
You can filter the vertices going to group() with hasLabel() if you need those sorts of restrictions. Depending upon how you use this, the traversal could be expensive in the sense that you have to traverse a fair bit of data to filter away all but two (in this case) vertices per label. If that is a concern, your approach of dynamically constructing the traversal and then piecing it together with union() doesn't seem so bad. While I could probably think up a way to write that in just Gremlin, it probably wouldn't be as readable as your approach.
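As a rough illustration of what the group() traversal computes once each label has its own limit, here is a plain-Python model (not Gremlin, and not part of the original answer). The vertex list and the limit_per_label helper are hypothetical; the LIMITS map mirrors the one in the question.

```python
from collections import defaultdict


def limit_per_label(vertices, limits):
    """Keep at most limits[label] vertex IDs per label.

    vertices: iterable of (vertex_id, label) pairs (hypothetical sample data).
    limits:   dict mapping a requested label to its maximum count.
    """
    kept = defaultdict(list)
    for vertex_id, label in vertices:
        limit = limits.get(label)
        if limit is None:
            continue  # label was not requested, similar to filtering with hasLabel()
        if len(kept[label]) < limit:
            kept[label].append(vertex_id)
    return dict(kept)


# Hypothetical vertices: three of each requested label plus one other label.
vertices = [
    (1, "book"), (2, "book"), (3, "book"),
    (4, "movie"), (5, "movie"), (6, "movie"),
    (7, "album"), (8, "album"), (9, "album"),
    (10, "podcast"),
]
LIMITS = {"book": 2, "movie": 2, "album": 2}

result = limit_per_label(vertices, LIMITS)
print(result)  # {'book': [1, 2], 'movie': [4, 5], 'album': [7, 8]}
```

Like the group() version, this model looks at every vertex before discarding the extras, which is the cost concern raised in the answer.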

How can I write a query in the Gremlin console to return the pairs of vertices that have parallel edges?

I would like to translate this Cypher query to Gremlin:
(n:Person)-[:friend]->(t:Person)-[:friend]->(n:Person)
Thanks
Using the air-routes data set, one way to do this is to use the cyclicPath step as follows.
gremlin> g.V('44').outE().inV().outE().inV().cyclicPath().path()
==>[v[44],e[5019][44-route->8],v[8],e[3975][8-route->44],v[44]]
==>[v[44],e[5020][44-route->13],v[13],e[4158][13-route->44],v[44]]
==>[v[44],e[5021][44-route->20],v[20],e[4387][20-route->44],v[44]]
==>[v[44],e[5022][44-route->31],v[31],e[4736][31-route->44],v[44]]
gremlin> g.V('44').outE().inV().outE().inV().cyclicPath().path().by('code').by()
==>[SAF,e[5019][44-route->8],DFW,e[3975][8-route->44],SAF]
==>[SAF,e[5020][44-route->13],LAX,e[4158][13-route->44],SAF]
==>[SAF,e[5021][44-route->20],PHX,e[4387][20-route->44],SAF]
==>[SAF,e[5022][44-route->31],DEN,e[4736][31-route->44],SAF]
Or if you just want the edge IDs
gremlin> g.V('44').outE().inV().outE().inV().cyclicPath().path().by('code').by(id)
==>[SAF,5019,DFW,3975,SAF]
==>[SAF,5020,LAX,4158,SAF]
==>[SAF,5021,PHX,4387,SAF]
==>[SAF,5022,DEN,4736,SAF]
Another way to write this query involves a where step
gremlin> g.V('44').as('a').outE().inV().outE().inV().where(eq('a')).path().by('code').by()
==>[SAF,e[5019][44-route->8],DFW,e[3975][8-route->44],SAF]
==>[SAF,e[5020][44-route->13],LAX,e[4158][13-route->44],SAF]
==>[SAF,e[5021][44-route->20],PHX,e[4387][20-route->44],SAF]
==>[SAF,e[5022][44-route->31],DEN,e[4736][31-route->44],SAF]
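To make explicit what both traversals look for, namely pairs of vertices joined by an edge in each direction, here is a small plain-Python sketch over an edge list. It is only a model of the result, not Gremlin. The mutual_pairs helper is hypothetical, and the edge list reuses the vertex IDs from the output above plus one made-up one-way edge (8 to 13) for contrast.

```python
def mutual_pairs(edges):
    """Return sorted (a, b) pairs, a < b, that have both an a->b and a b->a edge.

    edges: iterable of (source, target) vertex IDs.
    """
    edge_set = set(edges)
    pairs = set()
    for source, target in edge_set:
        # Skip self-loops; require the reverse edge to exist as well.
        if source != target and (target, source) in edge_set:
            pairs.add((min(source, target), max(source, target)))
    return sorted(pairs)


# Routes taken from the console output, plus a hypothetical one-way edge 8 -> 13.
routes = [
    (44, 8), (8, 44),
    (44, 13), (13, 44),
    (44, 20), (20, 44),
    (44, 31), (31, 44),
    (8, 13),
]

print(mutual_pairs(routes))  # [(8, 44), (13, 44), (20, 44), (31, 44)]
```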

How to limit the number of times a branch is traversed

Starting with the toy graph, I can find which vertices are creators by looking for vertices that have 'created' out edges:
gremlin> graph = TinkerFactory.createModern()
==>tinkergraph[vertices:6 edges:6]
gremlin> graph.traversal().V().as('a').out('created').select('a').values('name')
==>marko
==>josh
==>josh
==>peter
I can filter out the duplicates with the dedup step...
gremlin> graph.traversal().V().as('a').out('created').select('a').dedup().values('name')
==>marko
==>josh
==>peter
...but this only alters the output, not the path followed by the Gremlin. If creators can be supernodes I'd like to tell the query to output 'a' once it finds its first 'created' edge and to then stop traversing the out step for the current 'a' and proceed to the next 'a'. Can this be done?
These two queries produce the desired output. Do they behave as I intend?
graph.traversal().V().where(out('created').count().is(gt(0))).values('name')
graph.traversal().V().where(out('created').limit(1).count().is(gt(0))).values('name')
Is there a better recipe?
EDIT: I just found an example in the where doc (example 2) that shows the presence of a link being evaluated as truth (may not be wording this correctly):
graph.traversal().V().where(out('created')).values('name')
There's a warning about the star-graph problem, which I think doesn't apply here because, and I'm guessing, there is only one where step that tests a branch?
Your last example is the way to go.
g.V().where(out('created')).values('name')
Strategies will optimize that for you and turn it into:
g.V().where(outE('created')).values('name')
Also, .where(outE('created')) will not iterate through all the out-edges, it's just like a .hasNext(), hence no supernode problem.
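The difference between an existence check and a full count can be sketched in plain Python with a lazy iterator. This is only an analogy for the hasNext()-style behaviour described above, not TinkerPop code; edge_stream, exists and count_is_positive are hypothetical helpers, and the count-based version ignores any optimization a traversal strategy might apply.

```python
_MISSING = object()  # sentinel so that falsy items still count as "present"


def edge_stream(edge_count, visited):
    """Yield edge indexes lazily, recording each one that is actually touched."""
    for index in range(edge_count):
        visited.append(index)
        yield index


def exists(edges):
    """Existence check: pull at most one item, like calling hasNext()."""
    return next(iter(edges), _MISSING) is not _MISSING


def count_is_positive(edges):
    """Naive check: walk every edge, then compare the total with zero."""
    return sum(1 for _ in edges) > 0


touched_by_exists = []
touched_by_count = []
found_fast = exists(edge_stream(100_000, touched_by_exists))
found_slow = count_is_positive(edge_stream(100_000, touched_by_count))

print(found_fast, len(touched_by_exists))  # True 1
print(found_slow, len(touched_by_count))   # True 100000
```

Both checks give the same answer, but the existence check touches a single edge regardless of how many the vertex has.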

Gremlin-Server Add Vertex with Multiple Properties (Titan 1.0.0)

I'm creating a Titan graph (backed by DynamoDB); I'm using Titan 1.0.0 and running Gremlin Server 3 (on TinkerPop 3).
I'm trying to add a vertex to my graph with a label and multiple properties in a single line. I'm able to add a vertex with a label and a single property, and I can add multiple properties to a vertex after it has been created, but it seems that I can't do it all at once.
For testing I'm running commands in the gremlin shell, but the end use case is interacting with it via REST api (which is already working fine).
As a note, I'm rolling back after each of these transactions so I have a clean slate.
Here is how I'm initiating my session:
gremlin> graph = TitanFactory.open('conf/gremlin-server/dynamodb.properties')
==>standardtitangraph[com.amazon.titan.diskstorage.dynamodb.DynamoDBStoreManager:[127.0.0.1]]
gremlin> g = graph.traversal()
==>graphtraversalsource[standardtitangraph[com.amazon.titan.diskstorage.dynamodb.DynamoDBStoreManager:[127.0.0.1]], standard]
I can create a vertex with a label and a single property like this:
gremlin> graph.addVertex('date_of_birth').property('date_of_birth','1949-01-01')
==>vp[date_of_birth->1949-01-01]
gremlin> g.V().hasLabel('date_of_birth').has('date_of_birth','1949-01-01').valueMap()
==>[date_of_birth:[1949-01-01]]
I can also create a vertex and then append many properties afterward with a traversal starting at the vertex I just created:
gremlin> v1 = graph.addVertex('date_of_birth')
==>v[409608296]
gremlin> g.V(v1).property('date_of_birth','1949-01-01').property('year_of_birth',1949).property('date_of_birth','1949-01-01').property('day_of_birth',1).property('age',67).property('month_of_birth',1)
==>v[409608296]
gremlin> g.V(v1).valueMap()
==>[day_of_birth:[1], date_of_birth:[1949-01-01], month_of_birth:[1], age:[67], year_of_birth:[1949]]
This is all well and good, but I'm trying to avoid making 2 calls to achieve this result, so I'd like to create the vertex with all of these properties at once. Essentially, I want to be able to do something like the following, but it fails with more than 1 .property():
gremlin> graph.addVertex('date_of_birth').property('date_of_birth','1949-01-01').property('year_of_birth',1949).property('date_of_birth','1949-01-01').property('day_of_birth',1).property('age',67).property('month_of_birth',1)
No signature of method: com.thinkaurelius.titan.graphdb.relations.SimpleTitanProperty.property() is applicable for argument types: (java.lang.String, java.lang.String) values: [date_of_birth, 1949-01-01]
I've also tried using 1 .property() with multiple properties (along with all other syntax variations I could think of), but it only seems to catch the first one:
gremlin> graph.addVertex('date_of_birth').property('date_of_birth','1949-01-01','year_of_birth',1949,'date_of_birth','1949-01-01','day_of_birth',1,'age',67,'month_of_birth',1)
gremlin> g.V().hasLabel('date_of_birth').has('date_of_birth','1949-01-01').valueMap()
==>[date_of_birth:[1949-01-01]]
I've looked through all of the documentation I can get my hands on from all sources I can find and I can't find anything on this "all at once" method. Has anyone done this before or know how it could be done?
Thanks in advance!
As described in Chapter 3 Getting Started of the Titan docs, the GraphOfTheGodsFactory.java source code shows how to add a vertex with a label and multiple properties.
saturn = graph.addVertex(T.label, "titan", "name", "saturn", "age", 10000);
The method addVertex(Object... keyValues) ultimately comes from the Graph interface defined by Apache TinkerPop. Titan 1.0.0 uses TinkerPop 3.0.1, and you can find more documentation on the addVertex method (and on many traversal steps) in the TinkerPop docs.
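Since the question mentions the REST API as the end use case, here is a minimal Python sketch that builds the JSON body for a single request, using the same alternating key/value form of addVertex. It assumes that the server exposes a variable named graph, that T is available to server-side scripts, and that the endpoint accepts a body with "gremlin" and "bindings" fields. The add_vertex_request helper and the binding names (vertexLabel, k0, v0, ...) are hypothetical, and the sketch only builds the payload without sending it.

```python
import json


def add_vertex_request(label, properties):
    """Build a request body that creates one vertex with all properties at once.

    Keys and values are passed as bindings rather than pasted into the script,
    so the script text stays fixed for a given number of properties.
    """
    bindings = {"vertexLabel": label}
    args = ["T.label", "vertexLabel"]
    for index, (key, value) in enumerate(properties.items()):
        bindings[f"k{index}"] = key
        bindings[f"v{index}"] = value
        args += [f"k{index}", f"v{index}"]
    script = "graph.addVertex(" + ", ".join(args) + ")"
    return {"gremlin": script, "bindings": bindings}


body = add_vertex_request(
    "date_of_birth",
    {"date_of_birth": "1949-01-01", "year_of_birth": 1949},
)
print(json.dumps(body, indent=2))
```

For the two properties above, the generated script is graph.addVertex(T.label, vertexLabel, k0, v0, k1, v1).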

Use Gremlin to find the shortest path in a graph avoiding a given list of vertices?

I need to use Gremlin to find the shortest path between two nodes (vertices) while avoiding a list of given vertices.
I already have:
v.bothE.bothV.loop(2){!it.object.equals(y)}.paths>>1
To get my shortest path.
I was attempting something like:
v.bothE.bothV.filter{it.name!="ignored"}.loop(3){!it.object.equals(y)}.paths>>1
but it does not seem to work.
Please HELP!!!
The second solution you have looks correct. However, to be clear about what you are trying to accomplish: if x and y are the vertices that you want to find the shortest path between, and a vertex should be ignored during the traversal if it has the property name:"ignored", then the query is:
x.both.filter{it.name!="ignored"}.loop(2){!it.object.equals(y)}.paths>>1
If the "list of given vertices" you want filtered is actually a list, then the traversal is described as such:
list = [ ... ] // construct some list
x.both.except(list).loop(2){!it.object.equals(y)}.paths>>1
Moreover, I tend to use a range filter just to be safe as this will go into an infinite loop if you forget the >>1 :)
x.both.except(list).loop(2){!it.object.equals(y)}[1].paths>>1
Also, if there is a potential for no path, then to avoid an infinitely long search, you can do a loop limit (e.g. no more than 4 steps):
x.both.except(list).loop(2){!it.object.equals(y) & it.loops < 5}.filter{it.object.equals(y)}.paths>>1
Note why the last filter step before paths is needed: there are two reasons the loop can be broken out of. Thus, you might not be at y when you break out of the loop (instead, you broke out because it.loops < 5 no longer held).
Here is your solution implemented over the Grateful Dead graph distributed with Gremlin. First some setup code, where we load the graph and define two vertices x and y:
gremlin> g = new TinkerGraph()
==>tinkergraph[vertices:0 edges:0]
gremlin> g.loadGraphML('data/graph-example-2.xml')
==>null
gremlin> x = g.v(89)
==>v[89]
gremlin> y = g.v(100)
==>v[100]
gremlin> x.name
==>DARK STAR
gremlin> y.name
==>BROWN EYED WOMEN
Now your traversal. Note that there is no name:"ignored" property in this graph, so instead I altered the filter to use the number of performances of each song along the path. Thus, this finds the shortest path through songs played more than 10 times in concert:
gremlin> x.both.filter{it.performances > 10}.loop(2){!it.object.equals(y)}.paths>>1
==>v[89]
==>v[26]
==>v[100]
If you use Gremlin 1.2+, then you can use a path closure to provide the names of those vertices (for example) instead of just the raw vertex objects:
gremlin> x.both.filter{it.performances > 10}.loop(2){!it.object.equals(y)}.paths{it.name}>>1
==>DARK STAR
==>PROMISED LAND
==>BROWN EYED WOMEN
I hope that helps.
Good luck!
Marko.
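For readers who want to see the same idea outside the Gremlin 1.x pipe syntax, here is a breadth-first search sketch in plain Python. It is a different technique from the loop-based traversal in the answer (it also remembers visited vertices), and the shortest_path helper, the adjacency map and the vertex names are all hypothetical.

```python
from collections import deque


def shortest_path(adjacency, start, goal, avoid=(), max_hops=4):
    """Breadth-first search from start to goal that never steps onto a vertex
    in avoid and gives up on paths longer than max_hops edges.

    adjacency: dict mapping a vertex to an iterable of neighbouring vertices.
    Returns the path as a list of vertices, or None if no path qualifies.
    """
    avoid = set(avoid)
    if start == goal:
        return [start]
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if len(path) - 1 >= max_hops:
            continue  # hop limit reached, do not extend this path
        for neighbor in adjacency.get(path[-1], ()):
            if neighbor in avoid or neighbor in seen:
                continue
            if neighbor == goal:
                return path + [neighbor]
            seen.add(neighbor)
            queue.append(path + [neighbor])
    return None


# Hypothetical undirected toy graph.
adjacency = {
    "x": ["a", "b"],
    "a": ["x", "y"],
    "b": ["x", "c"],
    "c": ["b", "y"],
    "y": ["a", "c"],
}

print(shortest_path(adjacency, "x", "y"))                           # ['x', 'a', 'y']
print(shortest_path(adjacency, "x", "y", avoid=["a"]))              # ['x', 'b', 'c', 'y']
print(shortest_path(adjacency, "x", "y", avoid=["a"], max_hops=2))  # None
```

The max_hops argument plays the role of the loop limit in the answer: without it, a search for an unreachable target in a large graph could run for a very long time.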
