Trouble retrieving newly created Vertices by label using DSE Graph - gremlin

I am trying to write a simple node application that uses the dse node driver to create some vertices and then to retrieve the created vertices for use in creating edges. The actual vertex retrieval traversals are in the groovy code that I'm submitting to DSE. My code for retrieving the vertices looks like:
g.V().hasLabel('someVertex').has('id', 'myId').next();
There is a vertex search index on my id property. Unfortunately, I get an error:
FastNoSuchElementException
The same groovy query works perfectly in the Gremlin REPL. The query also works if I take out the hasLabel call.
I thought that perhaps there was an eventual consistency issue, so I wrote a while loop in groovy that checks for the count of the traversal to be greater than zero. It never returns.
This same application works perfectly against my Titan graph.
Am I perhaps not understanding something about implicit transactions in DSE?
Edit: The code works if I wait 10 seconds after creating the vertices.

Related

Gremlin query hangs with and/or condition

I have a Graph model region (vertex) -> has_person (edge) -> person (vertex). I want to get region vertices that has person with name Tom.
This query works fine:
g.V().hasLabel("person").has("name", "Tom").inE("has_person").outV().hasLabel("region").
But why following queries hang:
g.V().hasLabel("region").and(
__.hasLabel("person").has("name", "Tom").inE("has_person").outV().hasLabel("region")
)
g.V().and(
__.hasLabel("person").has("name", "Tom").inE("has_person").outV().hasLabel("region")
).hasLabel("region")
When writing graph traversals with Gremlin you need to think about how the graph database you are using is optimizing your traversal (e.g. is a global index being used)?
You should consider the indexing capability of your graph database and examine the output of the profile() step. It will tell you if indices are being used and where. My guess is that the query that works "fine" is using and index to find "Tom" and then is able to quickly traverse that one index to find the regions that have "has_person" edges related to him. Most every graph will be capable of optimizing that sort of pattern. Your following queries that "hang" will typically not be optimized by most graphs to utilize an index and it's mostly because of the pattern you've chosen with and() step which isn't a pattern most optimizations seek. My guess is that both of those traversals are filtering almost completely in-memory.
Fwiw, your query that works "fine" is the optimal way to write that I think given what you state as your desired output. Your first hanging query I don't think will ever return results because it requires that the vertex have a label that is both "region" and "person" which is not possible. The second hanging query seems to not require the and() in the first place and is double filtering the "region" label.

Gremlin query to find the entire sub-graph that a specific node is connected in any way to

I am brand new to Gremlin and am using gremlin-python to traverse my graph. The graph is made up of many clusters or sub-graphs which are intra-connected, and not inter-connected with any other cluster in the graph.
A simple example of this is a graph with 5 nodes and 3 edges:
Customer_1 is connected to CreditCard_A with 1_HasCreditCard_A edge
Customer_2 is connected to CreditCard_B with 2_HasCreditCard_B edge
Customer_3 is connected to CreditCard_A with 3_HasCreditCard_A edge
I want a query that will return a sub-graph object of all nodes and edges connected (in or out) to the queried node. I can then store this sub-graph as a variable and then run different traversals on it to calculate different things.
This query would need to be recursive as these clusters could be made up of nodes which are many (inward or outward) hops away from each other. There are also many different types of nodes and edges, and they all must be returned.
For example:
If I specified Customer_1 in the query, the resulting sub-graph would contain Customer_1, Customer_3, CreditCardA, 1_HasCreditCard_A, and 3_HasCreditCard_A.
If I specififed Customer_2, the returned sub-graph would consist of Customer_2, CreditCard_B, 2_HasCreditCard_B.
If I queried Customer_3, the exact same subgraph object as returned from the Customer_1 query would be returned.
I have used both Neo4J with Cypher and Dgraph with GraphQL and found this task quite easy in these two langauges, but am struggling a bit more with understanding gremlin.
EDIT:
From, this question, the selected answer should achieve what I want, but without specifying the edge type by changing .both('created') to just .both().
However, the loop syntax: .loop{true}{true} is invalid in Python of course. Is this loop function available in gremlin-python? I cannot find anything.
EDIT 2:
I have tried this and it seems to be working as expected, I think.
g.V(node_id).repeat(bothE().otherV().simplePath()).emit()
Is this a valid solution to what I am looking for? Is it also possible to include the queried node in this result?
Regarding the second edit, this looks like a valid solution that returns all the vertices connected to the starting vertex.
Some small fixes:
you can change the bothE().otherV() to both()
if you want to get also the starting vertex you need to move the emit step before the repeat
I would add a dedup step to remove all duplicate vertices (can be more than 1 path to a vertex)
g.V(node_id).emit().repeat(both().simplePath()).dedup()
exmaple: https://gremlify.com/jngpuy3dwg9

How to do efficient bulk upserts (insert new vertex, or update property(ies)) in Gremlin?

Context:
I do have a graph with about 2000 vertices, and 6000 edges, this over time might grow to 10000 vertices and 100000 edges. Currently I am upserting the new vertices using the following traversal query:
Upserting Vertices & Edges
queryVertex = "g.V().has(label, name, foo).fold().coalesce(
unfold(), addV(label).property(name, foo).property(model, 2)
).property(model, 2)"
The intent here is to look for vertex, named foo, and if found update its model property, otherwise create a new vertex and set the model property. this is issued twice: once for the source vertex and then for the target vertex.
Once the two related vertices are created, another query is issued to create the edge between them:
queryEdge = "g.V('id_of_source_vertex').coalesce(
outE(edge_label).filter(inV().hasId('id_of_target_vertex')),
addE(edge_label).to(V('id_of_target_vertex'))
).property(model, 2)"
here, if there is an edge between the two vertices, the model property on edge is updated, otherwise it creates the edge between them.
And the pseudocode that does this, is something as follows:
for each edge in the list of new edges:
//upsert source and target vertices:
execute queryVertex for edge.source
execute queryVertex for edge.target
// upsert edge:
execute queryEdge
This works, but it is highly inefficient; for example for the mentioned graph size it takes several minutes to finish, and with some in-app concurrency, it reduces the time only by couple of minutes. Surely, there must be a more efficient way of doing this for such a small graph size.
Question
* How can I make these upserts faster?
Bulk loading should typically be relegated to the provider specific tools that are optimized to handle such tasks. Gremlin really doesn't provide abstractions to cover the diverse group of bulk loader tools that are out there for each of the various graph database systems that implement TinkerPop. For Neptune, which is how you tagged your question, that would mean using the Neptune Bulk Loader.
Speaking specifically to your question, though you might see some optimizations to what you described as your approach. From a Gremlin perspective, I imagine you would see some savings here by submitting a single Gremlin request per edge by combining your existing traversals:
g.V().has(label, name, foo).fold().
coalesce(unfold(),
addV(label).property(name, foo)).
property(model, 2).as('source').
V().has(label, name, bar).fold().
coalesce(unfold(),
addV(label).property(name, bar)).
property(model, 2).as('target').
coalesce(inE(edge_label).where(outV().as('source')),
addE(edge_label).from('source').to('target')).
property(model, 2)
I think I got that right - untested, but hopefully you get the idea. Basically, we just reference the vertices already in memory via step labels so that we don't need to requery them. You might try other tactics as well if you continue with Gremlin-style bulk loading like ordering your edges so that you could batch together more edge loads to reduce the amount of vertex lookups and submit vertex/edge data in a more dynamic fashion as described here.

Optimize gremlin queries in Cosmos DB

I run the queries likes the following in gremlin in Cosmos Graph:
g.V().hasLabel('vertex_label').limit(1)
This query is problematic in concern of size of data which returned from DB as this query returns all inE and outE of the selected vertex. The question is how can I optimize this query in notion of size of query result?
As mentioned, the mentioned query returns a vertex with all its dependencies and connections. Therefore, it can be problematic in high volume of data (when there are a lot of connection with the specified vertex). Hence, We can optimize our queries using something likes properties, properyMap, values,and valueMap. In sum, valueMap(true) in the end of related queries can be useful and minimize the size of the transferred data from Cosmos. For example:
g.V().hasLabel('play').limit(1).valueMap(true)
The boolean value is for getting id and label of the vertex in addition to vertex properties.
Also, If there is any notion of optimization in the structure of a query, you can find more in this link.
How are you using CosmosDB Graph, via Microsoft.Azure.Graphs SDK or Gremlin server?
If you are using Microsoft.Azure.Graphs, the latest version (0.2.4-preview as of posting) supports specifying the GraphSONFormat as a parameter on DocumentClient.CreateGremlinRequest(..). You can choose from either GraphSONFormat.Normal or GraphSONFormat.Compact and Compact should be the default if it is not supplied.
For the CosmosDB Gremlin server, Compact is also the default behavior.
With GraphSONFormat.Compact, vertex results won't include edges and as a result, outE and inE fetches can be skipped when fetching the vertex. GraphSONFormat.Normal will return the full GraphSON response if this is desired.
Additional Note: There are optimizations on limit() that will be included in the next release of SDK/server, so I would expect additional perf gains on the traversal example that you provided when the release becomes available.

Titan + d3 for computer network visualisation

I've been experimenting with Titan over the past few weeks and would like some pointers on the way forward, plus a few specific questions. The purpose of the project is to store log data on a Cassandra cluster (for this question let's use the example of web traffic) and represent relationships in a Titan graph. All nodes are modelled as having an entity value and type (e.g. "google.com","hostname"), and edges have a label (e.g. "connects") as well as several attributes of the relationship (timestamp, flow length and so on).
Once this data is stored in cassandra and represented as a Titan graph, I plan to use d3 code to generate visualisations. At the end of the tunnel I am hoping to be able to build large-scale, interactive, complex graph networks that look something like this: http://goo.gl/CVEd55
My current setup is as follows:
A python script to convert log files into vertices.csv and edges.csv files for Gremlin to load in
Titan Server 0.4 (using CassandraThrift as the storage backend) - gremlin script to load converted data into Titan
Python script that uses NetworkX to open a RexPro connection, allowing the analyst to enter a custom Gremlin query, outputting the result as a JSON
Local web front-end that uses the generated JSON and d3 to display the results of the query as a graph
Ideally as a test base case, I would like the user to be able to type a Gremlin query into the web front-end and be directed to a page containing an interactive d3 graph of the result.
My specific questions are are follows:
What is the process for assigning attributes to edges? I have had trouble finding sample code that helps me represent the graph using the model listed above.
My gremlin script to load data into Titan uses bg.commit() to create a batch graph which is later referenced in the RexPro connection conn= RexProConnection('localhost,8184,'bg'). This was working originally but after changing my load script, clearing the graph in Gremlin and then reloading, the RexPro connection cannot be opened due to the graph bg apparently not existing. What is the process of updating graphs in Titan? Presumably running a load script twice using the same graph will only add nodes/vertices to the existing one, so how would I go about generating a new graph with the same name every time I update my model, and have RexPro be able to reference it when running a query?
How easy would it be to extend the interface to allow an analyst to enter SQL queries into the front end, using RexPro to access the graph in a similar way to the one described?
Apologies for the long post, but if anyone could share their expertise that would be much appreciated!
For d3 visualization, you can use force directed graph. There are a few variations of them.
Relationship Graph
https://vida.io/documents/qZ5SJdRJfj3XmSXYJ
Force Layout Tree
https://vida.io/documents/sy7vzWW7BJEvKdZeL
If your network contains a large number of node and edges, you'll need to cluster data before visualizing. You can use tools like Gephi, NodeXL to perform clustering. Then use clustered data to build force directed visualization.
What is the process for assigning attributes to edges?
The process is the same as adding properties to vertices. Get an Edge instance then do:
Edge e = g.addEdge(v1,v2,'label')
e.setProperty('weight',0.1d)
As for:
What is the process of updating graphs in Titan? Presumably running a load script twice using the same graph will only add nodes/vertices to the existing one, so how would I go about generating a new graph with the same name every time I update my model, and have RexPro be able to reference it when running a query?
You don't want a reference to a BatchGraph after loading as it comes with limitations that will prevent you from querying. It sounds like you should just configure "yourgraph" in rexster.xml, when you load through your script, simply wrap your rexster.xml configured Graph in your code, and perform your load operations against it. When you want to query it, simply reference "yourgraph" instead of "bg".
conn = RexProConnection('localhost,8184,'yourgraph')
How easy would it be to extend the interface to allow an analyst to enter SQL queries into the front end, using RexPro to access the graph in a similar way to the one described?
It's hard to say if that's "easy" as that depends on factors outside of just the technology. I'll say that it's possible to to build an interface that accepts Gremlin queries (your wrote SQL, but I assume you meant Gremlin), passes them to Rexster and gets back an answer. What you do with that answer is up to you, but as far as Rexster's part plays into it, I don't see why that would be a problem.

Resources