gremlin python clone traversal

I'm using gremlin-python to connect to gremlin-server and I'm trying to build up a query incrementally but I'm getting stuck. I have an initial part of my query like the following:
query = g.V().hasLabel('<some_label>')
Now I would like to do multiple things with this query, firstly I just want a count:
query.count().next()
Now if I do anything else using the query variable, the count step is still on the traversal, so something like the following doesn't work:
query.out('<some_edge_label>').valueMap().toList()
Looking at the docs, it seems like I need to clone the traversal, so I replaced the above with:
query = g.V().hasLabel('<some_label>')
count_query = query.clone()
count_query.count().next()
But query still has the count() step on it when I print the bytecode, even though I cloned it. Is this the expected behaviour for gremlin-python? Here is a complete example of what I'm talking about, printing the bytecode at each step:
query = g.V().hasLabel('alabel')
print(query)
q_count = query.clone()
print(q_count.count())
print(query)
[['V'], ['hasLabel', 'alabel']]
[['V'], ['hasLabel', 'alabel'], ['count']]
[['V'], ['hasLabel', 'alabel'], ['count']]
What do I do to clone/copy the start of the traversal so I can reuse it in gremlin-python?

There were some fixes in the area of deep cloning traversals in the Apache TinkerPop 3.4.7 (and 3.3.11) releases of June 2020 [1] [2]. Upgrading to one of those driver versions should help.
[1] https://github.com/apache/tinkerpop/blob/master/CHANGELOG.asciidoc
[2] https://issues.apache.org/jira/browse/TINKERPOP-2350
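With an upgraded driver, the asker's pattern works as expected. Here is a minimal sketch, assuming gremlinpython >= 3.4.7; the server URL and the labels are placeholders, not taken from the question:
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# hypothetical server URL
conn = DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')
g = traversal().withRemote(conn)

query = g.V().hasLabel('alabel')

# clone() now yields an independent copy, so count() no longer leaks
# back into the original traversal's bytecode
count_query = query.clone()
print(count_query.count().next())

# the original traversal is untouched and can be extended separately
print(query.out('anedge').valueMap().toList())

conn.close()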

It looks like this issue was a bug in gremlin-python and has been fixed in version 3.4.7. Updating the version solved my issue.

Related

Adobe AEM Querybuilder Debugger - Multiple Paths and Multiple Nodenames

I am using the QueryBuilder debugger and want to do a search where "nodename=*.pdf OR nodename=*.doc*" and "path=/content/dam/1 OR path=/content/dam/2".
I have been trying to find an example on the web, but no luck. What I have below is not quite right; I'm just wondering what I am missing.
The query does work, but it takes far longer to run than when I query using just one nodename instead of two.
Thanks in advance,
Jerry
type=dam:asset
mainasset=true
1_group.p.or=true
1_group.1.nodename=*.pdf
1_group.2.nodename=*.doc*
2_group.p.or=true
2_group.1_path=/content/dam/1
2_group.2_path=/content/dam/2
p.limit=-1
orderby=path
I thought maybe something as simple as this might work, but no luck:
type=dam:asset
mainasset=true
group.p.or=true
group.1_nodename=*.doc*
group.1_path=/content/dam/1
group.2_nodename=*.doc*
group.2_path=/content/dam/2
group.3_nodename=*.pdf
group.3_path=/content/dam/1
group.4_nodename=*.pdf
group.4_path=/content/dam/2
p.limit=-1
orderby=path
Try splitting your query, if that won't affect the behaviour you're trying to achieve (a sketch for merging the two result sets follows the queries below):
path=/content/dam/1
type=dam:asset
mainasset=true
group.1.nodename=*.pdf
group.2.nodename=*.doc*
p.limit=-1
orderby=path
path=/content/dam/2
type=dam:asset
mainasset=true
group.1.nodename=*.pdf
group.2.nodename=*.doc*
p.limit=-1
orderby=path
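If you go the split route, the two result sets can be merged client-side. Here is a minimal sketch, assuming the standard QueryBuilder JSON servlet at /bin/querybuilder.json; the host, credentials, the added group.p.or (taken from the asker's original query), and the merge step are illustrative, not part of the answer above:
import requests

# Minimal sketch: run the two split queries against the QueryBuilder JSON servlet
# and merge the hits. Host and credentials are hypothetical.
BASE = 'http://localhost:4502/bin/querybuilder.json'
AUTH = ('admin', 'admin')

common = {
    'type': 'dam:asset',
    'mainasset': 'true',
    'group.p.or': 'true',         # OR the two nodename patterns, as in the original query
    'group.1.nodename': '*.pdf',
    'group.2.nodename': '*.doc*',
    'p.limit': '-1',
    'orderby': 'path',
}

hits = []
for path in ('/content/dam/1', '/content/dam/2'):
    response = requests.get(BASE, params=dict(common, path=path), auth=AUTH)
    hits.extend(response.json().get('hits', []))

print(len(hits))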

Issue getting latest version of artifact from Nexus

I am trying to get the latest snapshot version of an artifact from Nexus, and I just can't seem to make it work. The artifacts are uploaded as zips (in case that matters) and are deployed in the correct time order (lower versions earlier, higher versions later). Based on a couple of other SO answers, here are some of the things I've tried.
The given artifact has a latest version of 0.8.400-SNAPSHOT, but all these searches return something else.
Here is what I have for 0.8.400-SNAPSHOT:
<artifact-resolution>
<data>
<presentLocally>true</presentLocally>
<groupId>x.y</groupId>
<artifactId>artifactId</artifactId>
<version>0.8.400-20160509.154907-1</version>
<baseVersion>0.8.400-SNAPSHOT</baseVersion>
<extension>zip</extension>
<snapshot>true</snapshot>
<snapshotBuildNumber>1</snapshotBuildNumber>
<snapshotTimeStamp>1462808947000</snapshotTimeStamp>
<sha1>61e08f995e9626ce67060af89798c37ff852d475</sha1>
<repositoryPath>/x/y/artifactId/0.8.400-SNAPSHOT/artifactId-0.8.400-20160509.154907-1.zip
</repositoryPath>
</data>
</artifact-resolution>
Using Maven Resolve
1) /nexus/service/local/artifact/maven/resolve?r=snapshots&g=com.x.y&a=artifactId&e=zip&v=LATEST
returns 0.8.385-SNAPSHOT
<artifact-resolution>
<data>
<presentLocally>true</presentLocally>
<groupId>x.y</groupId>
<artifactId>artifactId</artifactId>
<version>0.8.385-20160506.162638-3</version>
<baseVersion>0.8.385-SNAPSHOT</baseVersion>
<extension>zip</extension>
<snapshot>true</snapshot>
<snapshotBuildNumber>3</snapshotBuildNumber>
<snapshotTimeStamp>1462551998000</snapshotTimeStamp>
<sha1>f70809137eb87a8dce98d7c4f746176a1305adfb</sha1>
<repositoryPath>/com/x/y/artifactId/0.8.385-SNAPSHOT/artifactId-0.8.385-20160506.162638-3.zip
</repositoryPath>
</data>
</artifact-resolution>
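For reference, the resolve call is easy to script. Here is a minimal sketch; the host and credentials are hypothetical, while the endpoint, parameters, and XML fields are taken from the responses shown here:
import requests
import xml.etree.ElementTree as ET

# Hypothetical host and credentials; endpoint and parameters as in the resolve call above.
url = 'http://localhost:8081/nexus/service/local/artifact/maven/resolve'
params = {'r': 'snapshots', 'g': 'com.x.y', 'a': 'artifactId', 'e': 'zip', 'v': 'LATEST'}

response = requests.get(url, params=params, auth=('user', 'password'))
data = ET.fromstring(response.content).find('data')
print(data.findtext('baseVersion'))      # e.g. 0.8.385-SNAPSHOT in the case above
print(data.findtext('repositoryPath'))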
Using Lucene Search
1) While a blanket search on all artifacts returns the correct latest version, it fails to do so when I add any parameters to narrow it down (by fetching less data). The query /nexus/service/local/lucene/search?r=snapshots&g=x.y&a=artifactId&e=zip returns all artifacts (count=14) with latestSnapshot as 0.8.400-SNAPSHOT:
<artifact>
<groupId>x.y</groupId>
<artifactId>artifactId</artifactId>
<version>0.8.385-SNAPSHOT</version>
<latestSnapshot>0.8.400-SNAPSHOT</latestSnapshot>
<latestSnapshotRepositoryId>snapshots</latestSnapshotRepositoryId>
<repositoryId>snapshots</repositoryId>
...<snip>...
</artifact>
2) A latest search does not return any data:
/nexus/service/local/lucene/search?r=snapshots&g=x.y&a=artifactId&e=zip&v=LATEST
3) A search to get one result shows the wrong latestSnapshot:
/nexus/service/local/lucene/search?r=snapshots&g=x.y&a=artifactId&e=zip&count=1 returns
<latestSnapshot>0.6.265-SNAPSHOT</latestSnapshot>
4) A search with a higher count shows a different version, but still with 0.6.265 as the latest, until the count nearly reaches the total number of artifacts.
With a count of 10: /nexus/service/local/lucene/search?r=snapshots&g=x.y&a=artifactId&e=zip&count=10
<version>0.6.261-SNAPSHOT</version>
<latestSnapshot>0.6.265-SNAPSHOT</latestSnapshot>
With a count of 12: /nexus/service/local/lucene/search?r=snapshots&g=x.y&a=artifactId&e=zip&count=12
<version>0.6.265-SNAPSHOT</version>
<latestSnapshot>0.8.400-SNAPSHOT</latestSnapshot>
Is there a bug, or am I doing something wrong here?

Creating graph in titan from data in csv - example wiki.Vote gives error

I am new to Titan. I installed Titan and successfully ran the Graph of the Gods example, including the queries given. Next I went on to try bulk loading a CSV file to create a graph, following the steps in "Powers of Ten - Part I" (http://thinkaurelius.com/2014/05/29/powers-of-ten-part-i/).
I am getting an error when loading wiki-Vote.txt:
gremlin> g = TitanFactory.open("/tmp/1m")
Backend shorthand unknown: /tmp/1m
I tried:
g = TitanFactory.open('conf/titan-berkeleydb-es.properties')
but I get an error at the next step in load-1m.groovy:
==>titangraph[berkeleyje:/titan-0.5.4-hadoop2/conf/../db/berkeley]
No signature of method: groovy.lang.MissingMethodException.makeKey() is applicable for argument types: () values: [] Possible solutions: every(), any()
Any hints on what to do next? I am using Groovy for the first time. What kind of Groovy expertise is needed to work with Gremlin?
That blog post is meant for Titan 0.4.x. The API shifted when Titan went to 0.5.x. The same principles discussed in the posts generally apply to data loading but the syntax is different in places. The intention is to update those posts in some form when Titan 1.0 comes out with full support of TinkerPop3. Until then, you will need to convert those code examples to the revised API.
For example, an easy way to create a berkeleydb database is with:
g = TitanFactory.build().
    set("storage.backend", "berkeleyje").
    set("storage.directory", "/tmp/1m").
    open();
Please see the Titan docs for details. Most of the schema creation code (which is the biggest change) is now described in the schema and indexing sections of those docs.
After much experimenting today, I finally figured it out. A lot of changes were needed:
Use makePropertyKey() instead of makeKey(), and makeEdgeLabel() instead of makeLabel()
Use cardinality(Cardinality.SINGLE) instead of unique()
Building the index is quite a bit more complicated. Use the management system instead of the graph both to make the keys and labels, as well as build the index (see https://groups.google.com/forum/#!topic/aureliusgraphs/lGA3Ye4RI5E)
For posterity, here's the modified script that should work (as of 0.5.4):
g = TitanFactory.build().set("storage.backend", "berkeleyje").set("storage.directory", "/tmp/1m").open()
// schema and the composite index are now created through the management system
m = g.getManagementSystem()
k = m.makePropertyKey('userId').dataType(String.class).cardinality(Cardinality.SINGLE).make()
m.buildIndex('byId', Vertex.class).addKey(k).buildCompositeIndex()
m.makeEdgeLabel('votesFor').make()
m.commit()
// return the vertex with the given userId, creating it on first encounter
getOrCreate = { id ->
    def p = g.V('userId', id)
    if (p.hasNext()) {
        p.next()
    } else {
        g.addVertex([userId:id])
    }
}
new File('wiki-Vote.txt').eachLine {
    if (!it.startsWith("#")) {
        (fromVertex, toVertex) = it.split('\t').collect(getOrCreate)
        fromVertex.addEdge('votesFor', toVertex)
    }
}
g.commit()

Titan Graph Queries taking too long to execute

I have a problem with the execution speed of Titan queries.
To be more specific:
I created a property file for my graph using BerkeleyJE, which looks like this:
storage.backend=berkeleyje
storage.directory=/finalGraph_script/graph
Afterwards, I opened gremlin.bat to open my graph.
I set up all the necessary index keys for my nodes:
m = g.getManagementSystem();
username = m.makePropertyKey('username').dataType(String.class).make()
m.buildIndex('byUsername',Vertex.class).addKey(username).unique().buildCompositeIndex()
m.commit()
g.commit()
(all other keys are created the same way...)
I imported a CSV file containing about 100,000 lines; each line produces at least 2 nodes and some edges. All this is done via batch loading.
That works without a problem.
Then I execute a groupBy query, which looks like this:
m = g.V.has("imageLink").groupBy{it.imageLink}{it.in("is_on_image").out("is_species")}{it._().species.groupCount().cap.next()}.cap.next()
With this query I want, for every node with the property key "imageLink", the number of distinct "species". "Species" are also nodes, and can be reached by going back along the edge "is_on_image" and following the edge "is_species".
Well, this also works like a charm for my current nodes. The query takes about 2 minutes on my local PC.
But now to the problem.
My whole dataset is a CSV with 10 million entries. The structure is the same as above, and each line also creates at least 2 nodes and some edges.
On my local PC I can't even import this set; it causes a memory exception after 3 days of loading.
So I tried the same on a server with much more RAM and memory. There the import works and takes about 1 day, but the groupBy fails after about 3 days.
I actually don't know whether the groupBy itself fails, or just the connection to the server after such a long time.
So my first Question:
In my opinion, about 15 million nodes shouldn't be that big a deal for a graph database, should it?
Second Question:
Is it normal that it takes so long? Or is there any way to speed it up using indices? I configured the indices as listed above :(
I don't know exactly which information you need to help me, so please just tell me what else you need.
Thanks a lot!
Best regards,
Ricardo
EDIT 1: The way I'm loading the CSV into the graph:
I'm using this code; I deleted some unnecessary properties, which are also set as properties on some nodes and loaded the same way.
bg = new BatchGraph(g, VertexIDType.STRING, 10000)
new File("annotation_nodes_wNothing.csv").eachLine({ final String line ->def (annotationId,species,username,imageLink) = line.split('\t')*.trim();def userVertex = bg.getVertex(username) ?: bg.addVertex(username);def imageVertex = bg.getVertex(imageLink) ?: bg.addVertex(imageLink);def speciesVertex = bg.getVertex(species) ?: bg.addVertex(species);def annotationVertex = bg.getVertex(annotationId) ?: bg.addVertex(annotationId);userVertex.setProperty("username",username);imageVertex.setProperty("imageLink", imageLink);speciesVertex.setProperty("species",species);annotationVertex.setProperty("annotationId", annotationId);def classifies = bg.addEdge(null, userVertex, annotationVertex, "classifies");def is_on_image = bg.addEdge(null, annotationVertex, imageVertex, "is_on_image");def is_species = bg.addEdge(null, annotationVertex, speciesVertex, "is_species");})
bg.commit()
g.commit()

Faunus graph not printing nodes without using side effect from gremlin shell

I'm trying to print the nodes of a graph in Faunus (v0.4.0) that have any edges (incoming or outgoing). From the gremlin shell, I tried:
g = FaunusFactory.open('faunus.properties')
g.V.filter("{it.bothE.hasNext()}").sideEffect("{println it}")
When I do this, I get a printout of all the nodes, as I expected, but without the println I do not.
According to "How do I write a for loop in gremlin?", the Gremlin terminal should print this info out for me, but it does not seem to.
Is there something specific I need to do to enable the printing from the console?
Faunus and Gremlin are close to each other in terms of purpose and functionality, but they are not identical. Your filter isn't producing a side-effect, and Faunus writes its output to HDFS rather than printing it to the console. If you did:
g.V.filter("{it.bothE.hasNext()}").id
You could then view the list of ids matching that filter with something like:
hdfs.head('output',100)
to see the first 100 lines of the output. If you need more than just the element identifier, you could do a transform to get some of the element properties in there as well. You might find the hdfs helper tips in the Faunus documentation helpful.
