I have a graph model: region (vertex) -> has_person (edge) -> person (vertex). I want to get the region vertices that have a person named Tom.
This query works fine:
g.V().hasLabel("person").has("name", "Tom").inE("has_person").outV().hasLabel("region")
But why do the following queries hang?
g.V().hasLabel("region").and(
__.hasLabel("person").has("name", "Tom").inE("has_person").outV().hasLabel("region")
)
g.V().and(
__.hasLabel("person").has("name", "Tom").inE("has_person").outV().hasLabel("region")
).hasLabel("region")
When writing graph traversals with Gremlin you need to think about how the graph database you are using is optimizing your traversal (e.g. is a global index being used?).
You should consider the indexing capability of your graph database and examine the output of the profile() step. It will tell you if indices are being used and where. My guess is that the query that works "fine" is using an index to find "Tom" and is then able to quickly traverse from that one vertex to find the regions that have "has_person" edges related to him. Almost every graph will be capable of optimizing that sort of pattern. Your following queries that "hang" will typically not be optimized by most graphs to utilize an index, mostly because the pattern you've chosen with the and() step isn't one that most optimizations look for. My guess is that both of those traversals are filtering almost completely in-memory.
Fwiw, I think your query that works "fine" is the optimal way to write this, given what you state as your desired output. I don't think your first hanging query will ever return results, because it requires that the vertex have a label that is both "region" and "person", which is not possible. The second hanging query doesn't seem to need the and() in the first place and filters on the "region" label twice.
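As a rough sketch of the two points above, assuming a TinkerPop-compatible graph that supports profile() (the output format, and whether index use is reported at all, varies by provider): the first traversal profiles the working query, and the second shows one way to express "regions that have a person named Tom" as a filter without and(). The filter form starts from every "region" vertex, so it may well be slower than the first; it returns each matching region once, whereas the first returns a region once per matching person.

```groovy
// Profile the working traversal to see per-step timings and, where the
// provider reports it, whether an index served the name lookup.
g.V().has("person", "name", "Tom").
  in("has_person").hasLabel("region").
  profile()

// Filter form: keep region vertices with at least one outgoing
// has_person edge to a person named Tom.
g.V().hasLabel("region").
  where(__.out("has_person").has("person", "name", "Tom"))
```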
This might be a pretty obscure question but I'll try my best here.
Assuming I have a very simple query, for instance:
g.addV('Person').property('name', 'Marko');
And then I run the same query again.
Of course the graph created two different nodes, but regardless of the id, are they "the same"?
Same for querying the graph:
g.V()
Will the graph produce the results in the same order for any run (assuming it didn't change)?
What I'm trying to ask - can I count on the order of the Gremlin execution?
Thanks!
Gremlin does not enforce iteration order unless you explicitly specify it. It is up to the underlying graph to determine order and most that I'm familiar with do not make such guarantees. Therefore, if you want an order, you need to specify it as in: g.V().order().by('name').
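A minimal sketch of a fully specified order for the 'Person' vertices from the question. Two vertices named 'Marko' tie on name, so a second by() breaks the tie on the vertex id; this assumes the ids your graph generates are comparable with each other.

```groovy
// Sort by name, then by vertex id so equal names still come back
// in a repeatable order.
g.V().hasLabel('Person').
  order().
    by('name').
    by(T.id)
```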
Suppose I have a person vertex with multiple edges, and I want to project properties from each of those traversals. What is the most efficient way to write such a query in the Cosmos DB Gremlin API?
I tried the following, but it is slow.
g.V().
hasLabel('person').
project('Name', 'Language', 'Address').
by('name').
by(out('speaks').values('language')).
by(out('residesAt').values('city'))
Also, I have multiple filters and sorting for each traversal.
I don't think you can write that specific traversal any more efficiently than you've shown it, especially if you've added filters to the out('speaks') and out('residesAt') traversals to further limit those paths. As it stands, your example returns only the first "language" or "city" found, which is obviously faster than traversing all of those possible paths.
It does stand out to me that you are trying to retrieve all the "person" vertices. You don't say that you have additional filters there specifically, but if you do not then the cost of this traversal could be steep if you have millions of "person" vertices coming back. Typically, traversals that only filter on a vertex label will be costly as most graphs do not optimize those sorts of lookups. In the worst case, such a situation could mean that you have to do a full graph scan to get that initial set of vertices.
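To illustrate both remarks, here is a sketch under two assumptions: that you want every language and city rather than just the first (hence fold(), which collects all values into a list), and that some property filter can narrow the starting vertices. The name filter below is purely hypothetical and stands in for whatever filter your data allows.

```groovy
g.V().
  has('person', 'name', 'Marko').              // hypothetical filter; avoids a label-only scan
  project('Name', 'Language', 'Address').
    by('name').
    by(out('speaks').values('language').fold()).   // all languages as a list
    by(out('residesAt').values('city').fold())     // all cities as a list
```

Collecting every value costs more than returning the first one, so keep the original by() modulators if a single value per person is all you need.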
Problem
I've been optimizing the performance of our app that's built on a graph (Gremlin API on Cosmos DB) and I've been generally having a bad time of it. After quite a lot of digging, I realize that a bunch of the pain is being caused by unneeded traversal processing. If I want to get all foo vertices along with any bar edges that those vertices may have (think left join from SQL), I would write the following in Gremlin:
g.V().hasLabel("foo").as("foos").
coalesce(out("bar"),constant()).as("bars").
select("foos","bars")
The results would comprise a set of tuples with item 1 named foos and item 2 named bars. Foos would always have a vertex and bars would have either an edge or a [].
Instead of processing a new anonymous traversal with one constant() step for each vertex, a null would make the whole thing a lot more efficient.
I've looked everywhere but I can't find null in Gremlin. Anyone have any ideas?
There isn't the concept of returning null in Gremlin for a value. It is one of the more annoying parts but you have to return a value.
Additionally, I think you can simplify your traversal down by using a project() statement instead of traversing the graph and then selecting the values. It would be something like this:
g.V().hasLabel("foo").
project('foo', 'bars').
by(__.id()).
by(__.outE('bar').fold().coalesce(unfold(), constant('')))
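If an empty list is an acceptable stand-in for null, a variation on the same idea is to let fold() do the work on its own, since fold() always emits exactly one list, even when nothing reaches it. This sketch uses the question's bar edge label and returns all of a vertex's bar edges rather than only the first:

```groovy
g.V().hasLabel("foo").
  project("foo", "bars").
    by(__.id()).
    by(__.outE("bar").fold())   // [] when the vertex has no bar edges
```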
Could you please help me write a query that returns each source vertex in my traversal along with its associated edges and vertices as arrays on each such source vertex? In short, I need a result set comprising an array of 3-tuples, with item 1 of each tuple being the source vertex and items 2 and 3 being the associated arrays.
Thanks!
EDIT 1: Expanded on the graph data and added my current problem query.
EDIT 2: Improved Gremlin sample graph code (apologies, didn't think anyone would actually run it.)
Sample Graph
g.addV("blueprint").property("name","Mall").
addV("blueprint").property("name","HousingComplex").
addV("blueprint").property("name","Airfield").
addV("architect").property("name","Tom").
addV("architect").property("name","Jerry").
addV("architect").property("name","Sylvester").
addV("buildingCategory").property("name","Civil").
addV("buildingCategory").property("name","Commercial").
addV("buildingCategory").property("name","Industrial").
addV("buildingCategory").property("name","Military").
addV("buildingCategory").property("name","Residential").
V().has("name","Tom").addE("designed").to(V().has("name","HousingComplex")).
V().has("name","Tom").addE("assisted").to(V().has("name","Mall")).
V().has("name","Jerry").addE("designed").to(V().has("name","Airfield")).
V().has("name","Jerry").addE("assisted").to(V().has("name","HousingComplex")).
V().has("name","Sylvester").addE("designed").to(V().has("name","Mall")).
V().has("name","Sylvester").addE("assisted").to(V().has("name","Airfield")).
V().has("name","Sylvester").addE("assisted").to(V().has("name","HousingComplex")).
V().has("name","Mall").addE("classification").to(V().has("name","Commercial")).
V().has("name","HousingComplex").addE("classification").to(V().has("name","Residential")).
V().has("name","Airfield").addE("classification").to(V().has("name","Civil"))
Please note that the above is a very simplified rendering of our data.
Needed Query Results
I need to bring back each blueprint vertex as a base with each of its associated edges / vertices as arrays.
My Current Solution
Currently I do this very cumbersome query that gets the blueprints and assigns a label, gets the architects and assigns a label, then selects both labels. The solution is ok; however, it gets messy when I need to include edges or I need to get blueprint classification vertices (industrial, military, residential, commercial, etc.). In effect, the more associated data that I need to pull back for each blueprint, the sloppier my solution becomes.
My current query looks something like this:
g.V().hasLabel("blueprint").as("blueprints").
outE().or(hasLabel("designed"),hasLabel("assisted")).inV().as("architects").
select("blueprints").coalesce(out("classification"),constant()).as("classifications").
select("blueprints","architects","classifications")
The above produces a lot of duplication. If the number of: blueprints is b, architects is a, and classifications is c, the result set comprises b * a * c results. I'd like one blueprint with an array of its associated architects and an array of its associated classifications, if any.
Complications
I'm trying to do this in one query so that I can get all blueprint data from the graph to populate a filtered list. Once I have the list comprising all of the vertices, edges, and their properties, users can then click links to blobs, browse to project sites, etc. Accordingly, I've got pagination as well as filtering to think about and I'd prefer to make one trip to the server each time I get a new page or the filters change.
I figured out an answer; however, it quadruples the compute charge for the query. Not sure if this can be optimized further.
g.V().hasLabel("blueprint").
project("blueprints","architects").
by().
by(outE().or(hasLabel("designed"),hasLabel("assisted")).inV().dedup().fold())
I just solved for blueprints and architects, but classifications just needs another by(...traversal...) and projection label.
I may have to just get the blueprints in one query, get each of their associated items in parallel queries, then put it all together in the API. That would be very bad design for the API data layer but may be necessary for performance reasons.
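For reference, a sketch of what the extended projection might look like against the sample graph. Two things here are assumptions: in the sample graph the designed and assisted edges point from architect to blueprint, so this version walks them with in() (swap to out() if your real data points the other way), and the order() and range() steps with their page bounds are hypothetical, added only to show where paging could go.

```groovy
g.V().hasLabel("blueprint").
  order().by("name").          // a stable order so pages do not shift between calls
  range(0, 10).                // hypothetical first page of 10 blueprints
  project("blueprints", "architects", "classifications").
    by().
    by(__.in("designed", "assisted").dedup().fold()).
    by(__.out("classification").fold())
```

Each result is one map per blueprint, with empty lists where a blueprint has no architects or classifications, which avoids the b * a * c row multiplication of the select()-based query.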
I'm trying to produce a Gremlin query that finds vertexes which have edges from specific other vertexes. The less abstract version is that I have user vertexes which are related to group vertexes (i.e. subjects in a school, so students who are in "Year 6 Maths" and "Year 6 English" etc.). An extra difficulty is that subgroups can exist in this query.
The query needs to find those users who are in 2 or more of the groups specified by the user.
Currently I have a brief solution, but in production usage on Amazon Neptune this query performs far too poorly, even with a small amount of data. I'm sure there's a simpler way of achieving this :/
g.V()
.has('id', 'group_1')
.repeat(out("STUDENT", "SUBGROUP"))
.until(hasLabel("USER"))
.aggregate("q-1")
.V()
.has('id', 'group_2')
.repeat(out("STUDENT", "SUBGROUP"))
.until(hasLabel("USER"))
.where(within("q-1"))
.aggregate("q-2")
.V()
.hasLabel("USER")
.where(within("q-2"))
# We add some more filtering here, such as search terms
.dedup()
.range(0, 10)
.values("id")
.toList()
The first major change you can make is to not bother iterating all of V() again for "USER". Those vertices are already the output from the prior steps, so collecting "q-2" just to use it for a filter doesn't seem necessary:
g.V().
has('id', 'group_1').
repeat(out("STUDENT", "SUBGROUP")).
until(hasLabel("USER")).
aggregate("q-1").
V().
has('id', 'group_2').
repeat(out("STUDENT", "SUBGROUP")).
until(hasLabel("USER")).
where(within("q-1")).
# We add some more filtering here, such as search terms
dedup().
range(0, 10).
values("id")
That should already be a huge savings to your query because that change avoids iterating the entire graph in memory (i.e. full scan of all vertices) as there was no index lookup there.
I don't know what your additional filters are here:
# We add some more filtering here, such as search terms
but I would definitely look to filter the users earlier in your query rather than later. Perhaps consider using emit() on your repeat() steps to filter better. You should probably also dedup() before aggregating into "q-1" to reduce the size of that list.
I'd be curious to know how much just the initial change I suggested helps, as that was probably the bulk of your query cost (unless you have a really deep/wide tree of student/subgroups, I guess). Perhaps there is more that could be tweaked here, but it would be nice to know that you at least have a traversal with satisfying performance at this point.
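A sketch of the dedup() suggestion, plus one further tweak that is my own assumption rather than something measured on Neptune: a mid-traversal V() runs once for every traverser that reaches it, so after aggregate("q-1") the group_2 lookup would be repeated for each user found in group_1. Adding fold() collapses the stream to a single traverser, so the second half runs once, while the "q-1" side effect stays available to where(within("q-1")).

```groovy
g.V().has('id', 'group_1').
  repeat(out("STUDENT", "SUBGROUP")).
    until(hasLabel("USER")).
  dedup().                     // keep each user once before collecting
  aggregate("q-1").
  fold().                      // single traverser, so the next V() runs once
  V().has('id', 'group_2').
  repeat(out("STUDENT", "SUBGROUP")).
    until(hasLabel("USER")).
  where(within("q-1")).
  dedup().
  range(0, 10).
  values("id")
```

Comparing profile() output for this and the earlier version should show whether the extra fold() actually makes a difference on your data.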