Jagged Result Array Gremlin Query - gremlin

Please may you help me to write a query that returns each source vertex in my traversal along with its associated edges and vertices as arrays on each such source vertex? In short, I need a result set comprising an array of 3-tuples with item 1 of each tuple being the source vertex and items 2 and 3 being the associated arrays.
Thanks!
EDIT 1: Expanded on the graph data and added my current problem query.
EDIT 2: Improved Gremlin sample graph code (apologies, didn't think anyone would actually run it.)
Sample Graph
g.addV("blueprint").property("name","Mall").
addV("blueprint").property("name","HousingComplex").
addV("blueprint").property("name","Airfield").
addV("architect").property("name","Tom").
addV("architect").property("name","Jerry").
addV("architect").property("name","Sylvester").
addV("buildingCategory").property("name","Civil").
addV("buildingCategory").property("name","Commercial").
addV("buildingCategory").property("name","Industrial").
addV("buildingCategory").property("name","Military").
addV("buildingCategory").property("name","Resnameential").
V().has("name","Tom").addE("designed").to(V().has("name","HousingComplex")).
V().has("name","Tom").addE("assisted").to(V().has("name","Mall")).
V().has("name","Jerry").addE("designed").to(V().has("name","Airfield")).
V().has("name","Jerry").addE("assisted").to(V().has("name","HousingComplex")).
V().has("name","Sylvester").addE("designed").to(V().has("name","Mall")).
V().has("name","Sylvester").addE("assisted").to(V().has("name","Airfield")).
V().has("name","Sylvester").addE("assisted").to(V().has("name","HousingComplex")).
V().has("name","Mall").addE("classification").to(V().has("name","Commercial")).
V().has("name","HousingComplex").addE("classification").to(V().has("name","Resnameential")).
V().has("name","Airfield").addE("classification").to(V().has("name","Civil"))
Please note that the above is a very simplified rendering of our data.
Needed Query Results
I need to bring back each blueprint vertex as a base with each of its associated edges / vertices as arrays.
My Current Solution
Currently I do this very cumbersome query that gets the blueprints and assigns a label, gets the architects and assigns a label, then selects both labels. The solution is ok; however, it gets messy when I need to include edges or I need to get blueprint classification vertices (industrial, military, residential, commercial, etc.). In effect, the more associated data that I need to pull back for each blueprint, the sloppier my solution becomes.
My current query looks something like this:
g.V().hasLabel("blueprint").as("blueprints").
outE().or(hasLabel("designed"),hasLabel("assisted")).inV().as("architects").
select("blueprints").coalesce(out("classification"),constant()).as("classifications").
select("blueprints","architects","classifications")
The above produces a lot of duplication. If the number of: blueprints is b, architects is a, and classifications is c, the result set comprises b * a * c results. I'd like one blueprint with an array of its associated architects and an array of its associated classifications, if any.
Complications
I'm trying to do this in one query so that I can get all blueprint data from the graph to populate a filtered list. Once I have the list comprising all of the vertices, edges, and their properties, users can then click links to blobs, browse to project sites, etc. Accordingly, I've got pagination as well as filtering to think about and I'd prefer to make one trip to the server each time I get a new page or the filters change.

I figured out an answer; however, it quadruples the compute charge for the query. Not sure if this can be optimized further.
g.V().hasLabel("blueprint").
project("blueprints","architects").
by().
by(outE().or(hasLabel("designed"),hasLabel("assisted")).inV().dedup().fold())
I just solved for blueprints and architects, but classifications just needs another by(...traversal...) and projection label.
I may have to just get the blueprints in one query, get each of their associated items in parallel queries, then put it all together in the API. That would be very bad design for the API data layer but may be necessary for performance reasons.

Related

Does an Increased Number of Node Types Impact Performance of Graph DBs?

I am in the process of creating a graph database, a simple one for movies with several types of information like the actors, producers, directors and so on.
What I would like to know is, is it better to break down your nodes to a more granular level? For example, is it better to have two kinds of nodes for 'actors' and 'directors' or is it better to have one node, say 'person' and use different kinds of relationships like 'acted_in' and 'directed'? Does this even matter at all?
Further, is there any impact on the traversal queries? Does having more types of nodes mean that the traversal is slower?
Note: I intend to implement this using the Gremlin console in Amazon Neptune.
The answer really is it depends. If I were building such a model I would break out the key "nouns" into their own nodes. I would also label the edges appropriately such as ACTED_IN or DIRECTED.
The performance of any graph query depends on how much data it will need to touch (the fan out factor as you go from depth to depth).
The best advice I can give you is think about the questions you will need the graph to answer and try to design your data model so that writing those queries is as easy as possible. Don't be afraid to iterate multiple times on your data model also. That is common and expected.
Properties can be useful when you want to add a unique piece of information to a node - perhaps the birthday of the director.
Edge properties can be useful for filtering out unneeded edges but edge labels can also. In some cases you may find a label such as DIRECTED-IN-2005 is a useful short cut to avoid checking a label and a property on an edge.

Gremlin query hangs with and/or condition

I have a Graph model region (vertex) -> has_person (edge) -> person (vertex). I want to get region vertices that has person with name Tom.
This query works fine:
g.V().hasLabel("person").has("name", "Tom").inE("has_person").outV().hasLabel("region").
But why following queries hang:
g.V().hasLabel("region").and(
__.hasLabel("person").has("name", "Tom").inE("has_person").outV().hasLabel("region")
)
g.V().and(
__.hasLabel("person").has("name", "Tom").inE("has_person").outV().hasLabel("region")
).hasLabel("region")
When writing graph traversals with Gremlin you need to think about how the graph database you are using is optimizing your traversal (e.g. is a global index being used)?
You should consider the indexing capability of your graph database and examine the output of the profile() step. It will tell you if indices are being used and where. My guess is that the query that works "fine" is using and index to find "Tom" and then is able to quickly traverse that one index to find the regions that have "has_person" edges related to him. Most every graph will be capable of optimizing that sort of pattern. Your following queries that "hang" will typically not be optimized by most graphs to utilize an index and it's mostly because of the pattern you've chosen with and() step which isn't a pattern most optimizations seek. My guess is that both of those traversals are filtering almost completely in-memory.
Fwiw, your query that works "fine" is the optimal way to write that I think given what you state as your desired output. Your first hanging query I don't think will ever return results because it requires that the vertex have a label that is both "region" and "person" which is not possible. The second hanging query seems to not require the and() in the first place and is double filtering the "region" label.

How to do efficient bulk upserts (insert new vertex, or update property(ies)) in Gremlin?

Context:
I do have a graph with about 2000 vertices, and 6000 edges, this over time might grow to 10000 vertices and 100000 edges. Currently I am upserting the new vertices using the following traversal query:
Upserting Vertices & Edges
queryVertex = "g.V().has(label, name, foo).fold().coalesce(
unfold(), addV(label).property(name, foo).property(model, 2)
).property(model, 2)"
The intent here is to look for vertex, named foo, and if found update its model property, otherwise create a new vertex and set the model property. this is issued twice: once for the source vertex and then for the target vertex.
Once the two related vertices are created, another query is issued to create the edge between them:
queryEdge = "g.V('id_of_source_vertex').coalesce(
outE(edge_label).filter(inV().hasId('id_of_target_vertex')),
addE(edge_label).to(V('id_of_target_vertex'))
).property(model, 2)"
here, if there is an edge between the two vertices, the model property on edge is updated, otherwise it creates the edge between them.
And the pseudocode that does this, is something as follows:
for each edge in the list of new edges:
//upsert source and target vertices:
execute queryVertex for edge.source
execute queryVertex for edge.target
// upsert edge:
execute queryEdge
This works, but it is highly inefficient; for example for the mentioned graph size it takes several minutes to finish, and with some in-app concurrency, it reduces the time only by couple of minutes. Surely, there must be a more efficient way of doing this for such a small graph size.
Question
* How can I make these upserts faster?
Bulk loading should typically be relegated to the provider specific tools that are optimized to handle such tasks. Gremlin really doesn't provide abstractions to cover the diverse group of bulk loader tools that are out there for each of the various graph database systems that implement TinkerPop. For Neptune, which is how you tagged your question, that would mean using the Neptune Bulk Loader.
Speaking specifically to your question, though you might see some optimizations to what you described as your approach. From a Gremlin perspective, I imagine you would see some savings here by submitting a single Gremlin request per edge by combining your existing traversals:
g.V().has(label, name, foo).fold().
coalesce(unfold(),
addV(label).property(name, foo)).
property(model, 2).as('source').
V().has(label, name, bar).fold().
coalesce(unfold(),
addV(label).property(name, bar)).
property(model, 2).as('target').
coalesce(inE(edge_label).where(outV().as('source')),
addE(edge_label).from('source').to('target')).
property(model, 2)
I think I got that right - untested, but hopefully you get the idea. Basically, we just reference the vertices already in memory via step labels so that we don't need to requery them. You might try other tactics as well if you continue with Gremlin-style bulk loading like ordering your edges so that you could batch together more edge loads to reduce the amount of vertex lookups and submit vertex/edge data in a more dynamic fashion as described here.

Gremlin: how to get all of the graph structure surrounding a single vertex into a subgraph

I would like to get all of the graph structure surrounding a single vertex into a subgraph.
The TinkerPop documentation shows how to do this for a fixed number of traversal steps.
My question: are there any recipes for getting the entire surrounding structure (which may include cycles) without knowing ahead of time how many traversal steps are needed?
I have updated this question for the benefit of anyone who might land on this question, here is a gremlin statement that will capture an arbitrary graph structure that surrounds a vertex with id='xxx'
g.V('xxx').repeat(out().simplePath()).until(outE().count().is(eq(0))).path()
-- this incorporates Stephen Mallete's suggestion to use simplePath within the repeat step.
That example uses repeat()...times() but you could easily replace times() with until() and not know the number of steps ahead of time. See more information in the Reference Documentation on how repeat() works to see how to express different types of flow control and traversal stop conditions.

Relationships with relationships in Neo4j

In Neo4j, is it possible for a relationship to have a relationship?
To illustrate: Imagine a domain model that encompasses a collection of geometric planes. Each plane has a collection of lines on it, and each line has a collection of points on it. Each point on a line is connected to the point after it by an outgoing -[NEXT]-> relationship, and to the point preceding it by an incoming one. The way I have it now, each of these NEXT relationships contains a property lineID, which identifies the line on which it exists: The node entities representing lines in the database contain only an id, and perhaps a bit of metadata, and we return line X by traversing the graph, finding all -[NEXT{lineID:X}]-> relationships, fetching the start and end nodes of each and returning an list of them along with the line's metadata.
I was a bit more longwinded there than I intended to be, but my question is this: What if, rather than having a lineID property on each [NEXT] relationship, I wanted to create an -[ON]-> relationship between each [NEXT] and the node entity representing the line it is on?
To illustrate: Rather than doing
CREATE (:point)-[:NEXT{lineID:x}]->(:point)-[:NEXT{lineID:x}-> ...
, what about something like:
CREATE (:point)-[z:NEXT]->(:point), (z)-[:ON]->(:line)`
That's some ugly cypher, but I hope it clarifies my point. Intuitively, it seems like this would make line traversals more efficient (because we'd be playing to neo4j's strength by asking it to traverse all [ON] relationships from a line node rather than simply searching for a (presumably indexed) property. It would also make it easier to specify nested relationships:
(z)-[:ON]->(:line), (z)-[:ON]->(:plane)
Is this intuition misconceived? If not, would something like this be possible? I don't think it is, but am contemplating a workaround that would involve creating a node entity for each "relationship". Something like this:
(:point)<-[:FROM]-(x:next)-[:TO]->(:point), (x)-[:ON]->(:line)
, which would have the added advantage of facilitating hypergraph structures, which is something else I'm interested in. Leaving that conversation for another day (and another post), would such an approach be more trouble/expensive than its worth the purposes elucidated here? Might there be any dis/advantages (aside from plain cost) I'm not considering? Or am I reinventing the wheel here - is there an extant solution in this situation that I'm unaware of?
There are no relationships that you can link to other relations. I think that when you ask yourself this kind of questions, you may have a modelling problem for your data, and the next thing to do is try to model the data differently. For instance, why the relationship that links two points knows the line on which the points are ? Wouldn't it be more natural that the point knows the line, therefore having the property lineID on the points? This way you may have points on several lines, which you can't model properly if the lineID is on the NEXT relationship. Perhaps even better, you can have a node Line that has a relationship CONTAINS with all the points on that particular line instead of using lineID property.
This is not possible.
Restructure your model so any data on your relationship which needs linking is a node

Resources