Azure Cosmos DB, Gremlin-API and atomic write-operations - azure-cosmosdb

We are starting a larger project and have to decide on the datastore technology.
For various reasons, we would like to use Cosmos DB through the Gremlin API.
What we are not sure about is how to handle atomic writes. Cosmos DB's consistency levels (from strong to eventual) are fine for us, but we haven't found a way to perform atomic write operations through the Gremlin API. We have already written quite complex Gremlin queries that create vertices and edges, navigate edges, delete edges, use side effects, etc., all in one Gremlin statement. If part of such a statement fails, we have no way to recover from it. Splitting the statements into several smaller ones is not an option, because in case of an error we would have to "roll back" every statement up to the failing one.
I have found the following question, but it has no answer so far: Azure Cosmos Gremlin API: transactions and efficient graph traversal.
Other sources suggest writing idempotent Gremlin statements, but given the complexity mentioned above, that is not a valid option for us.
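For readers unfamiliar with the "idempotent statement" advice mentioned above, here is a minimal sketch of what it usually refers to: the common fold()/coalesce()/unfold() "get or create" pattern, which can be re-run after a failure without creating duplicates. The helper name `upsert_vertex` and the binding name `val` are hypothetical, and the sketch only builds the query text; it does not make a multi-step statement atomic, and a partitioned container may need additional properties on the new vertex.

```python
import re

# Labels and property keys are inlined into the query text, so restrict them
# to plain identifiers; only the value travels as a binding.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def upsert_vertex(label, key, value):
    """Return (query, bindings) for a 'get or create' vertex traversal.

    Hypothetical helper: the traversal finds the vertex if it exists and
    adds it only otherwise, so submitting it twice yields one vertex.
    """
    for name in (label, key):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"not a plain identifier: {name!r}")
    query = (
        f"g.V().has('{label}', '{key}', val)"
        ".fold()"
        f".coalesce(unfold(), addV('{label}').property('{key}', val))"
    )
    return query, {"val": value}


query, bindings = upsert_vertex("person", "email", "a@example.com")
print(query)
print(bindings)
```

The resulting string and bindings would then be passed to whichever Gremlin driver is in use. Each such statement is safe to retry on its own; a long statement that mixes creates, deletes, and side effects is exactly the case where building this property becomes hard, which is the concern raised in the question.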

Related

Gremlin: Does calling expensive steps after cheaper ones work as an optimization?

I have a big Gremlin query that basically filters results. It is made of many has() and where() steps that can be written in any order and give the same result; some of them are expensive and some are cheap.
If I call the cheaper steps first, I guess the expensive ones will be executed with fewer iterations because many vertices have already been filtered out. That is true when coding in any language, but in a database implementation I don't know whether the Gremlin steps are executed in the order in which they are written.
I know this kind of thing usually depends on the Gremlin database implementation, but maybe you can give me a general answer. I've also tried to write some benchmarks, but building good ones for my specific case is too time-consuming, so maybe you can help with your knowledge of how databases are implemented internally.
As you mention, it really does depend on the query engine and the way optimized query plans are developed. Some engines will try to reorder parts of queries based on the estimated cardinality of elements being tested. Amazon Neptune works that way for example. In general it is best to filter out as much as possible as soon as possible. So in a social network you would not want to start with something like g.V().hasLabel('person') unless you are confident the query engine is able to reorder such queries.
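The effect of filter order when an engine does not reorder steps can be illustrated outside any database. The sketch below is plain Python, not Gremlin: `cheap` and `expensive` are made-up predicates standing in for has()/where() steps, and the function simply counts how often each one is evaluated when filters are applied strictly in the order given.

```python
def apply_in_order(items, filters):
    """Apply filters left to right, stopping at the first one that rejects.

    Returns the surviving items and how many times each filter was called.
    """
    calls = [0] * len(filters)
    survivors = []
    for item in items:
        for i, predicate in enumerate(filters):
            calls[i] += 1
            if not predicate(item):
                break
        else:
            survivors.append(item)
    return survivors, calls


def cheap(n):
    # Stand-in for a selective, inexpensive filter.
    return n % 10 == 0


def expensive(n):
    # Stand-in for a costly filter (imagine a nested traversal here).
    return sum(int(digit) for digit in str(n)) % 2 == 0


items = list(range(1000))
cheap_first, calls_cheap_first = apply_in_order(items, [cheap, expensive])
costly_first, calls_costly_first = apply_in_order(items, [expensive, cheap])

print(cheap_first == costly_first)   # same result either way
print(calls_cheap_first)             # expensive filter sees only the survivors
print(calls_costly_first)            # expensive filter sees every item
```

Both orders return the same items, but with the selective filter first the expensive predicate runs only on what is left. Whether a given graph engine behaves like this naive loop or reorders the steps itself is exactly the implementation-specific part of the answer.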

Gremlin correlated queries kill performance

I understand that implementation specifics factor into this question, but I also realize that I may be doing something wrong here. If so, what could I do better? If Gremlin has some multi-resultset submit queries batch feature I don't know about, that would solve the problem. As in, hey Gremlin, run these three queries in parallel and give me their results.
Essentially, I need to know when a vertex has a certain edge and if it doesn't have that edge, I need to pull a blank. So...
g.V().as("v").coalesce(outE("someLabel").has("someProperty","someValue"),constant()).as("e").select("v","e")
That query is 10x more expensive than simply getting the edges using:
g.V().outE("someLabel").has("someProperty","someValue")
So if I want to get a set of vertices with their edges or blank placeholders, I have two options: Make two discrete queries and "join" the data in the API or make one very expensive query.
I'm working from the assumption that in Gremlin, we "do it in one trip", and that may in fact be quite wrong. That said, I also know that pulling back chunks of data and effectively doing joins in the API is bad practice because it breaks the encapsulation principle. It also adds round-trip overhead.
OK, so I found a solution that is ridiculous but fast. It involves fudging the traversal so let me apologize up front if there's a better way...
g.inject(true).
  union(
    __.V().not(outE("someLabel")).constant().as("ridiculous"),
    __.V().outE("someLabel").as("ridiculous")
  ).
  select("ridiculous")
In essence, I have to write the query twice. Once for the traversal with the edge I want and once more for the traversal where the edge is missing. So, if I have n present / not present checks I'm going to need 2 ^ n copies of the query each with its own combination of checks so that I get the most optimal performance. Unfortunately, taking that approach runs the risk of a stack overflow not to mention making code impossible to manage reliably.
Your original query returned vertex-edge pairs, whereas your answer returns only edges.
You could just run g.E().hasLabel("someLabel") to get the same result.
A probably faster alternative to your original query might be:
g.E().hasLabel("someLabel").as("e").outV().as("v").select("v","e")
Or
g.V().as("v").outE("someLabel").as("e").select("v","e")
If Gremlin has some multi-resultset submit queries batch feature I don't know about, that would solve the problem
Gremlin/TinkerPop do not have such functionality built in. There is at least one graph that does have some form of Gremlin batching - DataStax Graph...not sure about others.
I'm also not sure I really have any answer that you might find useful, but while I wouldn't expect a 10x difference in performance between those two traversals, I would expect the first to perform worse on most graph databases. Basically, the use of named steps with as() enables path calculation requirements on the traversal which increases costs. When optimizing Gremlin, one of my earliest steps is to try to look for ways to factor out anything that might do that.
This question seems related to your other question on Jagged Result Array, but I'm having trouble carrying the context from one question over to the other well enough to expound further.
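The "two discrete queries joined in the API" option from the question can be sketched as a client-side left join. This is only an illustration under assumptions: the result shapes (a list of vertex ids, and edge dictionaries carrying an `outV` field) are hypothetical stand-ins for whatever the driver actually returns, and `None` plays the role of the blank placeholder.

```python
from collections import defaultdict


def left_join_edges(vertex_ids, edges):
    """Pair every vertex with its matching edges, or with None if it has none.

    vertex_ids: result of a query such as g.V().id()
    edges: result of a query such as
           g.V().outE("someLabel").has("someProperty", "someValue"),
           assumed here to be dicts with an "outV" key.
    """
    edges_by_source = defaultdict(list)
    for edge in edges:
        edges_by_source[edge["outV"]].append(edge)

    rows = []
    for vertex_id in vertex_ids:
        matches = edges_by_source.get(vertex_id)
        if matches:
            rows.extend({"v": vertex_id, "e": edge} for edge in matches)
        else:
            rows.append({"v": vertex_id, "e": None})  # blank placeholder
    return rows


vertex_ids = ["a", "b", "c"]
edges = [
    {"id": "e1", "outV": "a", "inV": "b"},
    {"id": "e2", "outV": "a", "inV": "c"},
]
for row in left_join_edges(vertex_ids, edges):
    print(row)
```

This trades one expensive traversal for two cheaper ones plus an extra round trip, which is the trade-off the question describes.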

Making direct gremlin queries from gremlin-python

I've encountered several issues with gremlin-python that don't occur in plain Gremlin:
I can't directly select a given vertex type (g.V('customer')) without iterating over all vertices (g.V().hasLabel('customer'))
I get "Maximum Recursion reached" errors from Python. The same query in gremlin works smooth and fast
The ".next()" command works really slow in gremlin-python while in gremlin takes 1 sec
So, from Python/gremlin-python, I would like to be able to make a pure gremlin query to the server and directly store its result in a Python variable. Is that possible?
(I'm using gremlin-python on Apache Zeppelin if that matters)
I can't directly select a given vertex type (g.V('customer')) without iterating over all vertices (g.V().hasLabel('customer'))
g.V('customer') in Gremlin means "find a vertex with the id 'customer'", not "find vertices with label 'customer'". For the latter you need what you wrote: g.V().hasLabel('customer'). Those rules are the same in every variation of Gremlin, including Python. And you are correct that a query like g.V().hasLabel('customer') will be expensive, as there aren't many graphs that optimize this type of operation. On large graphs this would typically be considered an OLAP query that you would run with Gremlin Spark.
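The id-versus-label distinction can be pictured with a toy in-memory structure. This is not how any particular graph database stores data; the dictionary, the ids, and the two function names are invented for illustration only.

```python
# Toy graph: vertex id -> vertex record (all data here is made up).
VERTICES = {
    "customer": {"label": "page", "name": "customer landing page"},
    "c-1": {"label": "customer", "name": "Ada"},
    "c-2": {"label": "customer", "name": "Grace"},
    "o-1": {"label": "order", "name": "order 1"},
}


def v_by_id(vertex_id):
    """Analogue of g.V(id): a single keyed lookup, no scan."""
    record = VERTICES.get(vertex_id)
    return [] if record is None else [record]


def v_has_label(label):
    """Analogue of g.V().hasLabel(label) without a label index: visits every vertex."""
    return [record for record in VERTICES.values() if record["label"] == label]


print(v_by_id("customer"))       # the vertex whose id is 'customer'
print(v_has_label("customer"))   # the vertices whose label is 'customer'
```

The first call returns the one vertex whose id happens to be 'customer'; the second has to look at every vertex, which is why it gets expensive on large graphs.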
I get "Maximum Recursion reached" errors from Python. The same query in gremlin works smooth and fast
That was a bug. It is resolved now, but the fix is not released to pypi. A release is currently being prepared and so you will see this on 3.2.10 and 3.3.4. If you need an immediate patch, you can see that the fix was fairly trivial.
The ".next()" command works really slow in gremlin-python while in gremlin takes 1 sec
I'm not sure what you're seeing exactly. I think you might want to detail more about your environment with specifics as to how to recreate the difference. Perhaps you should bring that question to the gremlin-users mailing list.
So, from Python/gremlin-python, I would like to be able to make a pure gremlin query to the server and directly store its result in a Python variable. Is that possible?
That's perfectly possible and is exactly what gremlin-python is meant to do. It enables you to write Gremlin in Python and get results back from the server to do with as you need on the client side.
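If the goal is specifically to send a Gremlin string to the server and keep the result in a Python variable, gremlin-python's driver client can submit scripts directly. The sketch below assumes a Gremlin Server reachable at `ws://localhost:8182/gremlin` with a traversal source named `g`; the wrapper `submit_gremlin` and the binding name `lbl` are hypothetical. The import sits inside the function so the snippet can be loaded even where gremlin-python is not installed.

```python
def submit_gremlin(query, bindings=None,
                   url="ws://localhost:8182/gremlin", traversal_source="g"):
    """Send a Gremlin script to the server and return the results as a list.

    The URL and traversal source are assumptions; adjust them to your server.
    """
    # Lazy import: only needed when a query is actually submitted.
    from gremlin_python.driver import client

    gremlin_client = client.Client(url, traversal_source)
    try:
        # submit() returns a result set; all().result() blocks until it is complete.
        return gremlin_client.submit(query, bindings).all().result()
    finally:
        gremlin_client.close()


# Example usage (requires a running Gremlin Server):
# customers = submit_gremlin(
#     "g.V().hasLabel(lbl).limit(10).valueMap()", {"lbl": "customer"})
# print(customers)
```

Opening and closing a client per call keeps the sketch short; a long-lived notebook session would more likely create the client once and reuse it.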

Difference between traversal using gremlin and methods from Graph

Suppose I've the following simple graph.
I see two ways of traversing this graph
Option 1
I can use the following API provided by the Graph class
Graph factory = ...
factory.getVertices("type", "Response");
Option 2
I can also use GremlinPipeline API as documented here
Graph g = ... // a reference to a Blueprints graph
GremlinPipeline pipe = new GremlinPipeline();
pipe.start(g.getVertex(1))
My questions are:
Why two APIs?
When to use which one?
Does GremlinPipeline take advantage of the indices created using the index-related methods of TinkerGraph?
There are two APIs for getting data because they sit at different levels of abstraction. The Blueprints level is the lower one, with utility-level functions for accessing graphs; the Gremlin level is the higher one, with a much higher degree of expressivity when traversing graphs. The design follows from the fact that Blueprints is an abstraction layer over different graph databases (e.g. Neo4j, OrientDB) and needs to be simple enough for implementations to be developed quickly. Gremlin, however, is a graph traversal language that works over the Blueprints API, which makes it possible for Gremlin to operate over multiple graph databases.
Your examples don't allow Gremlin to shine at all. Sure, in those cases, there really isn't a reason to choose one over the other. Here's a similar example which I think is better:
// blueprints
g.getVertices()
// gremlin
g.V
Other than saving a few characters, Gremlin really isn't getting me anything. However, consider this Gremlin:
g.V.out('knows').outE('bought').has('weight',T.gt,100).inV.groupCount().cap()
I won't supply the Blueprints equivalent of that because it's a lot of code to type. You'll have to trust that this single line of Gremlin is worth many lines of code with tons of ugly looping.
Most of the time, usage of the raw Blueprints API isn't really necessary for traversals. You'll find yourself using it for loading data, but other than that, use Gremlin.
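To give a feel for the "ugly looping" that the one-line traversal replaces, here is a rough sketch of the same computation written imperatively. It is plain Python over a toy edge list, not the Blueprints Java API; the data, the tuple layout, and the function name are all invented for illustration.

```python
from collections import Counter

# Toy graph: edges as (out_vertex, label, in_vertex, properties). Made-up data.
VERTICES = ["marko", "vadas", "josh", "laptop", "pen", "phone"]
EDGES = [
    ("marko", "knows", "vadas", {}),
    ("marko", "knows", "josh", {}),
    ("vadas", "bought", "laptop", {"weight": 150}),
    ("josh", "bought", "laptop", {"weight": 200}),
    ("josh", "bought", "pen", {"weight": 5}),
    ("josh", "bought", "phone", {"weight": 120}),
]


def heavy_purchases_of_friends(vertices, edges):
    """Hand-written analogue of:
    g.V.out('knows').outE('bought').has('weight',T.gt,100).inV.groupCount().cap()
    """
    counts = Counter()
    for vertex in vertices:                                   # g.V
        for source, label, friend, _ in edges:                # .out('knows')
            if source != vertex or label != "knows":
                continue
            for buyer, edge_label, item, props in edges:      # .outE('bought')
                if buyer != friend or edge_label != "bought":
                    continue
                if "weight" in props and props["weight"] > 100:  # .has('weight',T.gt,100)
                    counts[item] += 1                         # .inV.groupCount()
    return counts


print(heavy_purchases_of_friends(VERTICES, EDGES))
```

Even on this tiny model the nesting grows with every step of the traversal, which is the point the answer makes about preferring Gremlin for anything beyond simple access.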

Tinkerpop - Is it better to use Redis for key-value property indexes or to use KeyIndexableGraph

Pretty straightforward question but I can't find the info that I want - is it advisable to use the KeyIndexableGraph of tinkerpop or to roll your own super performant key/index solution on the most performant and specialized stores like redis to get the node/edge locations you need?
It would appear to me that Redis should be better here as a technology that only focuses on key/value lookups and then pass the address in to the graph but I'd like to justify the costs.
The promise from tinkerpop is that index lookups should be log(n) on articles that are indexed with the property which is pretty good. Is it possible to do better in redis, or that the n*constant is much better than in the graph lookup?
Edit: I realized later this isn't really an intelligent question - Redis is an in memory store so is bounded by memory. Looking up a graph node location is still going to require a second lookup of the node in the graph.
It is important to remember that aside from TinkerGraph (an in-memory graph), TinkerPop is not a graph database on its own. KeyIndexableGraph is an interface that is implemented by the underlying graph databases (Titan, Neo4j, OrientDB, etc.) using each graph's own index capability. Therefore, you should make your indexing choice based on the capabilities of the underlying graph database itself.
Generally speaking, implementing Redis for indexing purposes for the graphs that do implement KeyIndexableGraph seems like an unnecessary layer. I would guess that it will complicate your programming without much benefit.
Here is the difference:
Databases like OrientDB have approximately O(log2 n) lookup times on an index.
Redis has O(1), i.e. constant-time, lookup.
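The two complexity classes can be compared with standard-library stand-ins: a sorted list searched with `bisect` behaves like an ordered (tree-style) index with logarithmic lookups, and a `dict` behaves like a hash-based key-value store with constant-time lookups on average. This is only an analogy; the class name, keys, and vertex ids are invented, and neither structure models OrientDB or Redis internals.

```python
import bisect


class SortedIndex:
    """Ordered index: binary search over sorted keys, O(log n) per lookup."""

    def __init__(self, pairs):
        self._pairs = sorted(pairs)
        self._keys = [key for key, _ in self._pairs]

    def lookup(self, key):
        position = bisect.bisect_left(self._keys, key)
        if position < len(self._keys) and self._keys[position] == key:
            return self._pairs[position][1]
        return None


# Made-up data: property value -> vertex id, plus the "graph" holding the vertices.
pairs = [(f"user{n}@example.com", n) for n in range(10_000)]
graph = {n: {"id": n, "email": email} for email, n in pairs}

ordered_index = SortedIndex(pairs)   # stand-in for a graph database index
hash_index = dict(pairs)             # stand-in for an external key-value store

key = "user4242@example.com"
vertex_id_from_tree = ordered_index.lookup(key)
vertex_id_from_hash = hash_index.get(key)

# Either way, the index only yields an id; fetching the vertex is a second lookup.
print(graph[vertex_id_from_tree] == graph[vertex_id_from_hash])
```

For ten thousand keys the binary search needs roughly fourteen comparisons, so the asymptotic gap is small in practice, and as the question's edit notes, an external index still requires a second lookup in the graph itself.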
