Gremlin query - how to get vertex properties together with edge metadata

Let's assume that we have the following model.
We have Permissions which may have Grants. The connection between a Permission and a Grant is an edge called hasGrant, which has an additional property Type that can be either Allow or Deny. How can I write a query that returns PermissionId, GrantId and Type without actually traversing to the Grant vertex? I'd like to avoid the traversal as it seems to be very expensive, and I only need the Type and GrantId properties (which I can take from the edge).
I've tried something like:
g.V().hasLabel('Permission').has('name','Column_Commit')
.project('name','id','grant')
.by('name')
.by('permissionId')
.by(outE("hasGrant").
project("id","type").
by(inV().id()).
by("type").
fold())
Unfortunately, this code traverses to the Grant vertex, which results in bad performance.

If the values you need are on the edge, you can leave inV() out of the query entirely. However, I am not familiar with how CosmosDB is implemented, and I can imagine that fetching a lot of edge properties is where the cost actually is. In any case, you could write the query as:
g.V().hasLabel('Permission').has('name','Column_Commit')
.project('name','id','grant')
.by('name')
.by('permissionId')
.by(outE("hasGrant").values("type").fold())
In general, I am surprised that your original query is causing problems as it seems perfectly reasonable Gremlin. The only issue I could envision, in general, is if any of the starting nodes are supernodes.
UPDATED 2022-07-01
An alternative approach is to use the elementMap step.
g.V().hasLabel('Permission').has('name','Column_Commit')
.project('name','id','grant')
.by('name')
.by('permissionId')
.by(outE("hasGrant").elementMap().fold())
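As a rough illustration of how the result of the elementMap() variant could be flattened on the client, here is a minimal Python sketch. It assumes your provider supports elementMap() and that each edge map arrives with the string keys "id", "label", "IN" and "OUT" as the Gremlin Console prints them; real drivers typically use enum keys instead, so the key names, the grant_rows helper and the sample data are all hypothetical.

```python
def grant_rows(result):
    """Flatten one project('name','id','grant') result into
    (permissionId, grantId, type) tuples.

    Assumption: each entry of result["grant"] is an edge element map whose
    "IN" entry describes the vertex the hasGrant edge points at.
    """
    rows = []
    for edge in result["grant"]:
        grant_id = edge["IN"]["id"]  # id of the Grant vertex, read from the edge
        rows.append((result["id"], grant_id, edge.get("type")))
    return rows


# Hypothetical sample shaped like the query result for one Permission.
sample = {
    "name": "Column_Commit",
    "id": "p-1",
    "grant": [
        {"id": "e-1", "label": "hasGrant",
         "IN": {"id": "g-1", "label": "Grant"},
         "OUT": {"id": "v-1", "label": "Permission"},
         "type": "Allow"},
        {"id": "e-2", "label": "hasGrant",
         "IN": {"id": "g-2", "label": "Grant"},
         "OUT": {"id": "v-1", "label": "Permission"},
         "type": "Deny"},
    ],
}

rows = grant_rows(sample)
```

The grant id here is the id of the adjacent vertex as recorded on the edge. Whether the provider can return it without loading the Grant vertex is implementation specific.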

Related

Is there any execution order guarantee for SideEffect in Gremlin

I have a bit of Gremlin.Net code that copies a Vertex into a new one, marks the old edge as 'ended', links the new vertex (clone) and updates some of the properties in it. Ignoring my own DSL code in the snippet, my question is: can I rely on the order of these SideEffects? I need the 'update' step to run last.
...AddV().Property("label", __.Select<Vertex>("existing").Label()).As("clone") // new vertex
.SideEffect(__
.Select<Vertex>("existing").Properties<VertexProperty>().As("p")
.Select<Vertex>("clone")
.Property(__.Select<VertexProperty>("p").Key(), __.Select<VertexProperty>("p").Value<object>()))
.SideEffect(__
.Select<Vertex>("existing").InE(DbLabels.ComponentEdge).As("ine")
.MarkToDate(operationTime)
.AddE(linkedEdgeLabel).From("parent").To("clone")
.MarkFromDate(operationTime))
.SideEffect(__
.Select<Vertex>("clone").UpdateVertexPropertiesUnchecked(propertyDict));
If not, is there a better way to do this?
I am not sure if it has ever been documented that they are executed in order but I think it is a reasonable assumption as queries such as the one below absolutely need to execute in sideEffect step order. I am not sure if there are any cases where a Gremlin Strategy might rewrite/re-order that query. If I find anything I will update this answer.
gremlin> g.V(44).values('runways')
==>3
gremlin> g.V(44).sideEffect(properties('runways').drop()).sideEffect(property(set,'runways',99)).values('runways')
==>99
gremlin> g.V(44).values('runways')
==>99
At the end of the day, it is probably always best to check with the provider of the graph store you are using. That said, within the TinkerPop community we are working on significantly improving the documentation around the semantics of each Gremlin step, and this is definitely something we should clarify there.

Gremlin query to traverse nodes and edges based on user permissions (stored as node/edge property) [duplicate]

We are stamping user permission as a property (of SET cardinality) on each node and edge. What is the best way to apply the has step to all the visited nodes/edges for a given Gremlin traversal?
Take a very simple traversal query:
// Flights from London Heathrow (LHR) to airports in the USA
g.V().has('code','LHR').out('route').has('country','US').values('code')
I want to add has('permission', 'team1') to all the vertices and edges visited during the traversal of the above query.
There are two approaches you may consider.
Write a custom TraversalStrategy
Develop a Gremlin DSL
For a TraversalStrategy you would develop one similar to SubgraphStrategy or PartitionStrategy which would take your user permissions on construction and then automatically inject the necessary has() steps after out() / in() sorts of steps. The drawback here is that your TraversalStrategy must be written in a JVM language and if using Gremlin Server must be installed on the server. If you intend to configure this TraversalStrategy from the client-side in any way you would need to build custom serializers to make that possible.
For a DSL you would create new navigational steps for out() / in() sorts of steps and they would insert the appropriate combination of navigation step and has() step. The DSL approach is nice because you could write it in any programming language and it would work, but it doesn't allow server-side configuration and you must always ensure clients use the DSL when querying the graph.
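To make the DSL idea more concrete, here is a minimal Python sketch that builds a Gremlin query string and inserts the permission filter after every navigation step. The PermissionedTraversal class and its method names are hypothetical and not part of any TinkerPop library. A real DSL would normally extend the driver's traversal classes rather than concatenate strings, and string building like this needs care with untrusted input.

```python
def _q(value):
    """Render a Python string as a single-quoted Gremlin string literal."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


class PermissionedTraversal:
    """Hypothetical builder: every navigation step is followed by a has() filter."""

    def __init__(self, team, steps=("g",)):
        self._team = team
        self._steps = tuple(steps)

    def _add(self, *steps):
        return PermissionedTraversal(self._team, self._steps + steps)

    def _allowed(self):
        return f"has('permission',{_q(self._team)})"

    def V(self):
        return self._add("V()", self._allowed())

    def has(self, key, value):
        return self._add(f"has({_q(key)},{_q(value)})")

    def out(self, label):
        # Expand out() so that both the edge and the target vertex are checked.
        return self._add(f"outE({_q(label)})", self._allowed(),
                         "inV()", self._allowed())

    def in_(self, label):
        return self._add(f"inE({_q(label)})", self._allowed(),
                         "outV()", self._allowed())

    def values(self, key):
        return self._add(f"values({_q(key)})")

    def build(self):
        return ".".join(self._steps)


query = (PermissionedTraversal("team1")
         .V().has("code", "LHR")
         .out("route").has("country", "US")
         .values("code")
         .build())
```

Callers write the same traversal as before, and the builder takes care of the filtering, which is why the DSL only helps if every client goes through it.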
We are stamping user permission as a property (of SET cardinality) on each node and edge.
As a final note, by "SET cardinality" I assume that you mean multi-properties. Edges don't allow for those so you would only be able to stamp such a property on vertices.

Strange execution behavior of Cosmos Gremlin query

I have the simple query below, which creates a new vertex and adds an edge between an existing vertex and the new vertex in the same query. This query works well most of the time. The strange behavior kicks in when there is heavy load on the system and RUs are exhausted.
g.V('2f9d5fe8-6270-4928-8164-2580ad61e57a').AddE('likes').to(g.AddV('fruit').property('id','1').property('name','apple'))
Under low/normal load, the above query creates fruit vertex 1 and creates the likes edge between user and fruit. This is the expected behavior.
Under heavy load (available RUs are limited), the above query creates the fruit vertex but doesn't create the likes edge between user and fruit. The query throws a 429 status code. If I try to replay the query, I get a 409 since the fruit vertex already exists. This behavior is corrupting the data.
In many places I have g.AddV inside the query, so all those queries might break under heavy load.
Does it make any difference if I use __.addV instead of g.AddV?
UPDATED: using __.addV doesn't make any difference.
So, is my query wrong? Do I need to do an upsert wherever I need to add an edge?
I don't know how Microsoft implemented TinkerPop and thus I'm not sure if the following will help, but you could try to create the new vertex first and then add an edge to/from the existing vertex.
g.addV('fruit').
property('id','1').
property('name','apple').
addE('likes').
from(V('2f9d5fe8-6270-4928-8164-2580ad61e57a'))
If that also fails, then yes, an upsert is probably your best bet, as you can retry the same query indefinitely. However, since I have no deep knowledge of CosmosDB, I can't tell if its upserts can prevent edge duplication.
In Cosmos DB Gremlin API, the transactional scope is limited to write operations on an entity (a Vertex or Edge). So for Gremlin requests that need to perform multiple write operations, it is possible that on failure a partial state will be committed.
Given this, it is recommended that you use idempotent gremlin traversals, such that the request can be retried on errors like RequestRateTooLarge (429) without becoming blocked by conflict errors on retry.
Here is the traversal re-written using coalesce() step so that it is idempotent (I assumed that 'name' is the partition key).
g.V('1').has('name', 'apple').fold().
coalesce(
__.unfold(),
__.addV('fruit').
property('id','1').
property('name','apple')).
addE('likes').
from(V('2f9d5fe8-6270-4928-8164-2580ad61e57a'))
Note: I did not wrap the addE() in a coalesce() as it is the last operation to be performed during execution. You may want to consider doing this if there will be additional write ops after the edge in the same request, or if you need to prevent duplicate edges for concurrent add edge requests.
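The retry behaviour this answer relies on can be sketched in plain Python. This is only a model under stated assumptions: FakeGraph, RequestRateTooLarge, like_fruit and with_retry are hypothetical names, the fake store commits the vertex write before the edge write fails (mimicking the partial state described above), and edges are kept in a set so the edge write is idempotent as well.

```python
import time


class RequestRateTooLarge(Exception):
    """Stands in for a 429 response from the server."""


class FakeGraph:
    """Hypothetical in-memory store that throttles the first few edge writes."""

    def __init__(self, throttle_edge_writes=1):
        self.vertices = {}
        self.edges = set()
        self._throttle = throttle_edge_writes

    def upsert_vertex(self, vertex_id, label, **props):
        # Get-or-create: an existing vertex is returned unchanged,
        # so replaying the request does not raise a conflict.
        return self.vertices.setdefault(vertex_id, {"label": label, **props})

    def upsert_edge(self, out_id, label, in_id):
        if self._throttle > 0:
            self._throttle -= 1
            raise RequestRateTooLarge("429")
        self.edges.add((out_id, label, in_id))


def like_fruit(graph, user_id, fruit_id, name):
    # The vertex write may already be committed when the edge write fails.
    graph.upsert_vertex(fruit_id, "fruit", name=name)
    graph.upsert_edge(user_id, "likes", fruit_id)


def with_retry(operation, attempts=5, base_delay=0.0, sleep=time.sleep):
    """Retry an idempotent operation on 429, with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except RequestRateTooLarge:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


graph = FakeGraph(throttle_edge_writes=2)
with_retry(lambda: like_fruit(graph, "user-1", "1", "apple"))
```

Because both writes are get-or-create operations, the third attempt completes the request without duplicating the vertex left behind by the first two.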

The Gremlin coalesce step is inconsistent (Cosmos DB / in general?)

Coalesce doesn't work as the first step in a traversal or if a traversal leading up to the coalesce step doesn't yield at least one result. Before you dismiss the question, please hear me out.
If I have a vertex with label = 'foo' and id = 'bar' in my graph database and I'd like to add a vertex with label = 'baz' and id = 'caz', the following Gremlin query works beautifully.
g.V('bar').coalesce(__.V('caz'), __.addV('baz').property('id', 'caz'))
If, however, I get rid of the first part of the query, the query fails.
g.coalesce(__.V('caz'), __.addV('baz').property('id', 'caz'))
Similarly, if I rework the query as follows, it also fails.
g.V('caz').coalesce(__.V('caz'), __.addV('baz').property('id', 'caz'))
For coalesce to work, it must have an input set of one or more elements. I understand why such an approach makes sense when the steps within a coalesce step are has and hasLabel for example; however, it makes no sense for V and addV. I'm guessing that the server implementation of coalesce has a check/return for a null or empty input step, which cancels processing on the step.
If this is a bug or improvement request with Gremlin in general, it would be awesome to have this addressed. If it's a Cosmos DB only issue, I'll log a call with Microsoft directly.
In the interim, I'm desperately looking for a solution to the challenge of only creating an element if it doesn't exist. I'm aware of using fold/unfold with coalesce; however, that kills my traversal context making previously defined aliases (using as('xyz')) unusable. Given the complexity of the queries we're writing, we can't afford to lose the context; we also can't afford the compute of folding just to unfold when processing data at scale.
Any advice on the above is gratefully received.
Warm regards,
Seb
You can't start a traversal with just any step in the Gremlin language. There are specific start steps that trigger a traversal, and by "trigger" I mean that they place traversers in the pipeline for processing. There are really just a handful of start steps: V(), E(), inject(), addV() and addE().
I'm aware of using fold/unfold with coalesce; however, that kills my traversal context making previously defined aliases (using as('xyz')) unusable
You typically shouldn't rely too heavily on as() if it can be avoided. Many traversals that have heavy use of as() usually can be re-written in other forms. Since you don't have more details on that, I can't address it further.
we also can't afford the compute of folding just to unfold when processing data at scale.
I can't imagine fold() and unfold() carrying a ton of cost. In the worst case it creates a List with a single item in it and in the best case it creates an empty list. You'll probably have tons of other performance optimizations to sort out before something like that would become anything you would focus on for radical improvements.
All that said, I guess that you could do this:
gremlin> g = TinkerGraph.open().traversal()
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> g.inject(1).coalesce(V().has('id','caz'),addV('baz').property('id','caz'))
==>v[0]
gremlin> g.inject(1).coalesce(V().has('id','caz'),addV('baz').property('id','caz'))
==>v[0]
You start the traversal with inject() and a throwaway value to just get something into the pipeline. I think that I prefer the fold() and unfold() method myself as I believe it's more readable. I also would be sure to validate that the graph I was using was actually using an index for that embedded mid-traversal V() inside the coalesce(). I would hope all graphs are smart about such optimizations but I can't say that with complete certainty. In that sense, fold() and unfold() work better as they present a more platform independent way to execute your query.
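Why coalesce() does nothing without an incoming traverser can be modelled in a few lines of plain Python. This is a simplified sketch, not TinkerPop's implementation: traversers are modelled as list items, and coalesce, lookup and adder are hypothetical helpers written for this example.

```python
def coalesce(traversers, *branches):
    """For each incoming traverser, emit the results of the first
    branch that produces anything. No traversers in, nothing out."""
    for traverser in traversers:
        for branch in branches:
            results = list(branch(traverser))
            if results:
                yield from results
                break


def lookup(graph, vertex_id):
    """Model of g.V(id): zero or one results."""
    return [graph[vertex_id]] if vertex_id in graph else []


def adder(graph, vertex_id, label):
    """Model of addV(label).property('id', vertex_id)."""
    def create(_traverser):
        graph[vertex_id] = {"id": vertex_id, "label": label}
        return [graph[vertex_id]]
    return create


# g.V('caz').coalesce(V('caz'), addV(...)): V('caz') finds nothing,
# so coalesce never runs and nothing is created.
g1 = {}
no_input = list(coalesce(lookup(g1, "caz"),
                         lambda t: lookup(g1, "caz"),
                         adder(g1, "caz", "baz")))

# g.inject(1).coalesce(...): the throwaway value gives coalesce one traverser.
g2 = {}
injected = list(coalesce([1],
                         lambda t: lookup(g2, "caz"),
                         adder(g2, "caz", "baz")))

# g.V('caz').fold().coalesce(unfold(), addV(...)): fold() always emits
# exactly one list, even when it is empty.
g3 = {}
folded = list(coalesce([lookup(g3, "caz")],
                       lambda t: t,
                       adder(g3, "caz", "baz")))
```

In the first case the vertex is never created because the loop body never executes. The inject() and fold() variants both guarantee exactly one traverser reaches coalesce().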
After some digging, I realized that the issue is Gremlin language specific and not server implementation specific (as in, not a Cosmos DB issue). Accordingly, I've resorted to using two flavors of the "add if not exists" pattern.
For context, we use a Gremlin recipe provider pattern, which ensures that common conventions are maintained throughout the product for common tasks. Accordingly, when I have an element (edge or vertex) to create, I pass it to the recipe provider to return the traversal with addE/addV and property semantics generated. This issue stems from generating recipes that support the "add if not exists" pattern.
To solve the issue, I pass a boolean flag to the recipe provider that tells the provider whether to use fold/unfold semantics. That way, if the add recipe occurs at the beginning of the traversal, the app uses fold/unfold semantics; if not at the beginning, no fold/unfold. While it is very much putting lipstick on a pig as a workaround, most of the add recipes our app uses don't occur at the beginnings of traversals.
To provide an example, assuming I have three vertices using label vTest and IDs v1-id, v2-id, and v3-id, the Gremlin query generated by the Gremlin recipe provider will look like this:
g.V('v1-id')
.has('partitionKey','v1')
.fold()
.coalesce(
__.unfold(),
__.addV('vTest')
.property('id','v1-id')
.property('partitionKey','v1')
).coalesce(
__.V('v2-id')
.has('partitionKey','v2'),
__.addV('vTest')
.property('id','v2-id')
.property('partitionKey','v2')
).coalesce(
__.V('v3-id')
.has('partitionKey','v3'),
__.addV('vTest')
.property('id','v3-id')
.property('partitionKey','v3')
)
Because each part of the query is guaranteed to return one result, coalesce() works throughout. But, as I'm sure you'll agree, lipstick on a pig.
Unfortunately for us, all user registrations in our app will be affected by the fold() / unfold() approach because that process involves creating the first vertices. I certainly hope to see an update to Gremlin in future, either to coalesce or some other step to handle conditionals.

Neo4j Design: Property vs "Node & Relationship"

I have a node type that has a string property that will very often have the same value, e.g. millions of nodes with only 5 possible values for that string. I will be doing searches by that property.
My question would be what is better in terms of performance and memory:
a) Implement it as a node property and have lots of duplicates (and search using WHERE).
b) Implement it as 5 additional nodes, where all original nodes reference one of them (and search using additional MATCH).
Without knowing further details it's hard to give a general purpose answer.
From a performance perspective it's better to limit the search as early as possible. It is even more beneficial if you do not have to look into properties during a traversal.
Given that, I assume it's better to move the lookup property into a separate node and use the value as the relationship type.
Use labels; this blog post is a good intro to this new Neo4j 2.0 feature:
Labels and Schema Indexes in Neo4j
I've thought about this problem a little as well. In my case, I had to represent state:
STARTED
IN_PROGRESS
SUBMITTED
COMPLETED
Overall the Node + Relationship approach looks more appealing in that only a single relationship reference needs to be maintained each time rather than a property string and you don't need to scan an extra additional index which has to be maintained on the property (memory and performance would intuitively be in favor of this approach).
Another advantage is that it easily supports the ability of a node being linked to multiple "special nodes". If you foresee a situation where this should be possible in your model, this is better than having to use a property array (and searching using "in").
In practice I found that the problem then became how to access these special nodes each time. Either you maintain some sort of constants reference holding the node IDs of these special nodes, so you can jump right into them in your START statement (this is what we do), or you need to do a search against a property of the special node each time (name, perhaps) and then traverse down its relationships. This doesn't make for the prettiest of Cypher queries.
