Goal:
Sort the vertices of a given label by the descending number of incoming edges of a given type in the most efficient way.
Configuration:
Ubuntu VM, 23 GB RAM
JanusGraph 0.6.1 (full distribution)
local graph (default conf/remote.yaml file used)
~400k vertices (~2k with the label)
~1.5m relationships (~1m with the type)
some indexing work has been done, nothing on the relationships though
What I am doing:
g.V().hasLabel(<label>).
order().
by(inE(<type>).count(),desc).
limit(10).
project("name","score").
by(<property_name>).
by(inE(<type>).count())
Issue:
While this query gives the expected results, it is really slow (7+ minutes), and that execution time is not affordable. Is there a way to improve it, whether by changing the query itself or by adding an index somewhere that could help?
What I have seen:
I have looked into composite indexes and mixed indexes, and neither seems to help with my problem.
I have considered vertex-centric indexes: I think they wouldn't be useful here, as I don't want a subset of the incoming edges of type <type> but the total number of them. Assuming I understand them correctly, would a vertex-centric index still be a beneficial side effect?
Thanks to everyone reading/replying!
Solutions:
Simply changing the query to the one rewritten in Kelvin Lawrence's answer makes it quite a bit faster! Without any other changes, the execution time went from 7+ minutes to ~2 minutes.
Once the vertex_label property had been added and indexed, and hasLabel(<label>) changed into has("vertex_label", <label>), the execution time went down again, from ~2 minutes to ~5 seconds. The vertex_label property has no purpose other than to enable indexing on the type of vertex; since JanusGraph does not support indexing on labels, it provides a workaround.
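Putting both changes together, the final query looks roughly like this (a sketch using the same placeholders as above):
g.V().has("vertex_label", <label>).
  group().
    by(<key>).
    by(inE(<type>).count()).
  order(local).
    by(values, desc).
  unfold().
  limit(10)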
It seems that perhaps your query is doing quite a bit of repeated work there. What about something like this:
g.V().hasLabel('<label>').
group().
by('<key>').
by(inE('<type>').count()).
order(local).
by(values,desc).
unfold().
limit(10)
From memory, I don't think JanusGraph supports indexing on a label, which may be part of the problem here. If so, storing the label also as a property and creating an index on that property may help the initial finding of the 2K vertices.
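If you take that approach, a sketch of the JanusGraph schema setup might look like this (the vertex_label property name and byVertexLabel index name are just placeholders):
mgmt = graph.openManagement()
vertexLabel = mgmt.makePropertyKey('vertex_label').dataType(String.class).make()
mgmt.buildIndex('byVertexLabel', Vertex.class).addKey(vertexLabel).buildCompositeIndex()
mgmt.commit()
On a graph that already contains data, the new index also needs to be enabled and reindexed before JanusGraph will use it.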
UPDATED 2022-06-21 to show an actual example.
Using the air-routes data set, the query might look like this:
gremlin> g.V().hasLabel('airport').
......1> group().
......2> by('code').
......3> by(inE('route').count()).
......4> order(local).
......5> by(values,desc).
......6> unfold().
......7> limit(10)
==>FRA=307
==>IST=307
==>CDG=294
==>AMS=284
==>MUC=271
==>ORD=263
==>DFW=251
==>PEK=249
==>DXB=247
==>ATL=242
Related
I'm looking to get more data back in my Jupyter visualization on Neptune than just the node ID:
g.V("specific-id").emit().repeat(both().simplePath()).dedup().out().path().by(T.id)
In particular, it would be nice to know the label as well, and maybe any other information. How can I modify the above query to achieve that?
You could do a number of things because the by() modulator on path() can take a Traversal. An easy one would be to just do elementMap() or valueMap() as in:
g.V("specific-id").
emit().repeat(both().simplePath()).
dedup().
out().
path().by(valueMap())
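Note that elementMap() includes the element's id and label along with its properties, which covers the label requirement directly. If you only want the label plus a few specific properties, you can pass keys to elementMap() (the "code" key here is just an illustration):
g.V("specific-id").
  emit().repeat(both().simplePath()).
  dedup().
  out().
  path().by(elementMap("code"))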
I am trying to create a customer identity graph to track customers across devices, logged in or out.
Currently, I am trying to do two things:
1. For each disconnected graph, create a mapping between a unique generated key and the entities (adobe, lead, account, checkout). For example, the bottom graph will randomly pick one of the entities to be used as the unique key. If we pick adobe, we will create a new value "adobe:<adobe_id>" and use that as the key to identify this disconnected graph. Perhaps we could traverse the entire disconnected graph and set the value "adobe:<adobe_id>" as a property.
2. (Nice to have) Be able to handle situations where two different accounts may use the same ip_address. For example, the top graph should have 2 unique generated keys.
I'm fairly new to Gremlin, but I understand the basics of graph traversal. I'm not sure where to start, so any insights are appreciated!
I've created a simplified version below. Gremlify link will allow code execution as well.
https://gremlify.com/jmg2fzo34om/3
It sounds as though you are after some form of community detection or perhaps just trying to identify components of the graph. With Gremlin, this is likely best done with a Graph Computer (OLAP) and perhaps the connectedComponent() step:
gremlin> g.withComputer().
......1> V().connectedComponent().
......2> with(ConnectedComponent.propertyName, 'component').
......3> elementMap().
......4> order().by('component')
==>[id:0,label:account,component:0,account_id:1]
==>[id:2,label:lead,component:0,lead_id:1]
==>[id:6,label:account,component:0,account_id:3]
==>[id:8,label:checkout,checkout_id:1,component:0]
==>[id:10,label:lead,component:0,lead_id:2]
==>[id:12,label:ip_addr,component:0,ip_addr:1.1.1.1]
==>[id:16,label:page,page_id:1,component:14]
==>[id:18,label:lead,component:14,lead_id:4]
==>[id:4,label:adobe,component:14,adobe_id:1]
==>[id:14,label:ip_addr,component:14,ip_addr:3.3.3.3]
You can use Gremlin without a Graph Computer (OLTP) to find connected components, but unless you have a really small graph it simply won't perform as well. If you are interested in learning more about that, there is a Gremlin Recipe that goes into the subject in greater detail.
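If you do want to stay OLTP on a small graph, a minimal sketch (not the full recipe) that collects the connected component around a single starting vertex might look like this, where startId is a placeholder:
// expand outward; dedup() prevents revisiting vertices,
// so the repeat() ends once no new neighbors are found
g.V(startId).
  emit().
  repeat(both().dedup()).
  dedup().
  fold()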
In TinkerPop 3, how do you perform pagination? I want to fetch the first 10 elements of a query, then the next 10, without having to load them all in memory. For example, the query below returns 100,000 records. I want to fetch them 10 by 10 without loading all 100,000 at once.
g.V().has("key", value).limit(10)
Edit
A solution that works through HttpChannelizer on Gremlin Server would be ideal.
From a functional perspective, a nice looking bit of Gremlin for paging would be:
gremlin> g.V().hasLabel('person').fold().as('persons','count').
select('persons','count').
by(range(local, 0, 2)).
by(count(local))
==>[persons:[v[1],v[2]],count:4]
gremlin> g.V().hasLabel('person').fold().as('persons','count').
select('persons','count').
by(range(local, 2, 4)).
by(count(local))
==>[persons:[v[4],v[6]],count:4]
In this way you get the total count of vertices with the result. Unfortunately, the fold() forces you to count all the vertices, which requires iterating them all (i.e. bringing them all into memory).
There really is no way to avoid iterating all 100,000 vertices in this case as long as you intend to execute your traversal in multiple separate attempts. For example:
gremlin> g.V().hasLabel('person').range(0,2)
==>v[1]
==>v[2]
gremlin> g.V().hasLabel('person').range(2,4)
==>v[4]
==>v[6]
The first statement is the same as if you'd terminated the traversal with limit(2). The second traversal, which only wants the second two vertices, doesn't magically skip iterating the first two, as it is a new traversal. I'm not aware of any TinkerPop graph database implementation that will do that efficiently - they all have that behavior.
The only way to do ten vertices at a time without having them all in memory is to use the same Traversal instance as in:
gremlin> t = g.V().hasLabel('person');[]
gremlin> t.next(2)
==>v[1]
==>v[2]
gremlin> t.next(2)
==>v[4]
==>v[6]
With that model you only iterate the vertices once and don't bring them all into memory at a single point in time.
Some other thoughts on this topic can be found in this blog post.
Why not add order().by() and apply the range() function in your Gremlin query?
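For example (assuming the vertices have a sortable 'name' property to give the pages a stable order):
g.V().hasLabel('person').
  order().by('name').
  range(0, 10)
and then range(10, 20) for the next page, and so on. Note that each page is still a new traversal, with the re-iteration cost described above.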
Let's say I have a huge gremlin query with 100 or more steps. One part of this query has a failure and I want it to return a meaningful error message. With a short and sweet query this would not be too difficult, as we can do something like this:
g.V().coalesce(hasId("123"), constant("ERROR - ID does not exist"))
Of course we're asking if a Vertex with an ID of 123 exists. If it does not exist we return a string.
So now let's take this example and make it more complex
g.V().coalesce(hasId("123"), constant("ERROR - ID does not exist")).as("a").
  V().coalesce(hasId("123"), constant("ERROR - ID does not exist")).as("b").
  select("a").
  valueMap(false)
If a vertex with ID: "123" exists we return all properties stored on the vertex.
Lets say a vertex with ID: "123" does not exist in the database. How can I get a meaningful error returned without getting a type error for trying to do a .valueMap() on a string?
First of all, if you have a single line of Gremlin with 100 or more steps (not counting anonymous child traversal steps, of course), I'd suggest you re-examine your approach in general. When I encounter Gremlin of that size, it usually means that someone is generating a large traversal for the purpose of mutating the graph in some way. That's considered an anti-pattern and something to avoid: the larger the Gremlin grows, the greater the chance of hitting the JVM's Xss limits with a StackOverflowException, and traversal compilation times tend to add up and get expensive. All of that can be avoided in many cases by using inject() or withSideEffect() to pass the data in on the traversal itself, and then letting Gremlin be the loop that iterates that data into mutation steps. The result is a slightly more complex Gremlin statement, but one that will perform better and avoid the StackOverflowException.
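As a rough illustration of that inject() pattern (the 'person' label, property keys, and sample data are all hypothetical, and supplying property values from a traversal assumes a reasonably recent TinkerPop version):
g.inject([[name: 'alice', age: 30],
          [name: 'bob',   age: 40]]).
  unfold().as('row').
  addV('person').
    property('name', select('row').select('name')).
    property('age', select('row').select('age'))
A single traversal like this adds all the vertices, with Gremlin itself acting as the loop over the injected data.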
Second, note that this traversal will likely not behave as you want on any graph provider - see this example on TinkerGraph:
gremlin> g.V().coalesce(hasId(1),constant('x'))
==>v[1]
==>x
==>x
==>x
==>x
==>x
gremlin> g.V().hasId(1)
==>v[1]
The hasId() inside the coalesce() won't be optimized by the graph as a fast id lookup; it will instead be treated as a full table scan with a filter.
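If the fast id lookup matters, a common workaround is to move the id into V() and fold() the result, so that coalesce() still receives a traverser when nothing matches (the same fold()/unfold() idiom appears in the lambda example further down):
g.V("123").
  fold().
  coalesce(unfold().valueMap(false),
           constant("ERROR - ID does not exist"))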
In answer to your question though, I'd say that the easiest option open to you is to just move the valueMap() inside the coalesce():
g.V().coalesce(hasId("123").valueMap(false),
constant("ERROR - ID does not exist")).as("a").
V().coalesce(hasId("123").valueMap(false),
constant("ERROR - ID does not exist")).as("b").
select("a")
I can see why that might be bad if you have lots of steps other than valueMap(), because then you have to replicate the same steps over and over, making the code even larger. I guess that goes back to my first point.
I suppose you could use a lambda, though not all graph providers support that. Note that I've modified your code to ensure a lookup by id for the purpose of demonstration:
gremlin> g.V(1).fold().coalesce(unfold(),map{throw new IllegalStateException("bad")})
==>v[1]
gremlin> g.V(10).fold().coalesce(unfold(),map{throw new IllegalStateException("bad")})
bad
At this time, I'm not sure there's much else you can do. Maybe you could make an "error" vertex that you could return in constant(), so that valueMap() would work, but it's hard to say if that would be helpful given what I know about the overall intent of your traversal. I suppose you could come up with a fancy if-then evaluation using choose(), but that might be hard to read and look awkward. The only other option I can think of is to store the error as a side-effect:
gremlin> g.V(10).fold().coalesce(unfold(),store('error').by(constant('x'))).cap('error')
==>[x]
I don't think Gremlin gives you any really elegant way to do what you want right now.