Batching operations in Gremlin on Neptune

In the AWS Neptune documentation (best-practices-gremlin-java-batch-add) they recommend batching operations together.
How can I batch a few operations together when one of them may end the traversal stream?
For example if I want to batch together the following:
g.V(2).drop().addV('test').property(id,1)
The problem is that the addV() won't be called.
Is there a way to batch the drop() and the addV() together while making sure the addV() will be called?
I tried to put fold() in between, but it isn't supported natively in Neptune and will probably create performance issues.
sideEffect() isn't a good option with Neptune either, for performance reasons (see the drop() documentation in gremlin-step-support).

You can accomplish this by using the sideEffect() step to perform the drop() as shown in this recipe. In your case this might look something like:
g.addV('test').property(id,1).sideEffect(V(2).drop())
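If you are driving this from gremlin-python, the same pattern carries over directly. A minimal sketch, assuming a Gremlin Server or Neptune endpoint at ws://localhost:8182/gremlin and the numeric ids from the question (note that Neptune itself uses string ids, so there you would write '1' and '2'):

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import T
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Hypothetical endpoint; replace with your own cluster address.
conn = DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# addV() drives the traversal, so it always runs; the drop() happens
# as a side effect and cannot end the main stream.
g.addV('test').property(T.id, 1).sideEffect(__.V(2).drop()).iterate()

conn.close()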

Related

Gremlin: Does calling expensive steps after cheaper ones work as an optimization?

I have a big Gremlin query that basically filters results. It is made of many has() and where() steps that can be written in any order and give the same result; some of them are expensive and some are cheap.
If I call the cheaper steps first, I guess the expensive ones will be executed with fewer iterations because many vertices have already been filtered out. This is true when coding in any language, but in a database implementation I don't know whether the Gremlin steps are executed in the order they are written.
I know this kind of thing usually depends on the Gremlin database implementation, but maybe you can give me some kind of general answer. I've also tried to run some benchmarks, but building good ones for my specific case is too time consuming, so maybe you can help me with your knowledge of how these databases are implemented internally.
As you mention, it really does depend on the query engine and the way optimized query plans are developed. Some engines will try to reorder parts of queries based on the estimated cardinality of the elements being tested. Amazon Neptune works that way, for example. In general it is best to filter out as much as possible as early as possible. So in a social network you would not want to start with something like g.V().hasLabel('person') unless you are confident the query engine is able to reorder such queries.
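As an illustration, sketched in gremlin-python with hypothetical property names: if only a few vertices pass the cheap equality check, putting it first means the expensive correlated filter only runs on those few traversers.

from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import P

# g is a traversal source obtained via traversal().withRemote(...).
# Cheap, highly selective filter first...
result = (g.V().has('person', 'city', 'Oslo')
           # ...expensive correlated filter afterwards, on far fewer traversers.
           .where(__.out('purchased').count().is_(P.gt(100)))
           .toList())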

Azure Cosmos DB, Gremlin-API and atomic write-operations

We are starting a larger project and have to decide on the datastore technology.
For various reasons, we would like to use Cosmos DB through the Gremlin API.
The thing we are not sure about is how to handle atomic writes. Cosmos' consistency levels (from strong to eventual) are fine for us, but we haven't found a way to get atomic write operations through the Gremlin API. We have already written quite complex Gremlin queries that create vertices and edges, navigate edges, delete edges, use side effects, etc. in one Gremlin statement. So if part of the statement goes wrong, we have no way to recover from it. It's not an option to split the statement into several smaller ones because, in case of an error, we would have to "roll back" the statements up to the erroneous one.
I have found the following question, but there is no answer so far: Azure Cosmos Gremlin API: transactions and efficient graph traversal.
Other sources suggest writing idempotent Gremlin statements, but due to the complexity mentioned above, that's not a valid option for us.
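For reference, the "idempotent statement" approach usually means the fold()/coalesce() get-or-create idiom, sketched here in gremlin-python with a hypothetical userId property. Re-running the statement leaves the graph unchanged, which is what makes blind retries after a partial failure safe:

from gremlin_python.process.graph_traversal import __

# fold() turns "no match" into an empty list, so coalesce() always
# receives a traverser; unfold() returns the existing vertex, otherwise
# the addV() branch creates it.
person = (g.V().has('person', 'userId', 'u-123').fold()
           .coalesce(__.unfold(),
                     __.addV('person').property('userId', 'u-123'))
           .next())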

Gremlin correlated queries kill performance

I understand that implementation specifics factor into this question, but I also realize that I may be doing something wrong here. If so, what could I do better? If Gremlin has some multi-result-set batch-submit feature I don't know about, that would solve the problem. As in: hey Gremlin, run these three queries in parallel and give me their results.
Essentially, I need to know when a vertex has a certain edge, and if it doesn't have that edge, I need to pull back a blank. So...
g.V().as("v").coalesce(outE("someLabel").has("someProperty","someValue"),constant()).as("e").select("v","e")
That query is 10x more expensive than simply getting the edges using:
g.V().outE("someLabel").has("someProperty","someValue")
So if I want to get a set of vertices with their edges or blank placeholders, I have two options: make two discrete queries and "join" the data in the API, or make one very expensive query.
I'm working from the assumption that in Gremlin we "do it in one trip", and that may in fact be quite wrong. That said, I also know that pulling back chunks of data and effectively doing joins in the API is bad practice because it breaks the encapsulation principle. It also adds round-trip overhead.
OK, so I found a solution that is ridiculous but fast. It involves fudging the traversal so let me apologize up front if there's a better way...
g.inject(true).
  union(
    __.V().not(outE("someLabel")).constant("").as("ridiculous"),
    __.V().outE("someLabel").as("ridiculous")
  ).
  select("ridiculous")
In essence, I have to write the query twice: once for the traversal with the edge I want, and once more for the traversal where the edge is missing. So, if I have n present/not-present checks, I'm going to need 2^n copies of the query, each with its own combination of checks, to get optimal performance. Unfortunately, that approach runs the risk of a stack overflow, not to mention making the code impossible to manage reliably.
Your original query returned vertex-edge pairs, whereas your answer returns only edges.
You could just run g.E().hasLabel("somelabel") to get the same result.
Probably a faster alternative to your original query might be:
g.E().hasLabel("somelabel").as("e").outV().as("v").select("v","e")
Or
g.V().as("v").outE("somelabel").as("e").select("v","e")
If Gremlin has some multi-resultset submit queries batch feature I don't know about, that would solve the problem
Gremlin/TinkerPop do not have such functionality built in. There is at least one graph that does have some form of Gremlin batching - DataStax Graph - but I'm not sure about others.
I'm also not sure I really have any answer that you might find useful, but while I wouldn't expect a 10x difference in performance between those two traversals, I would expect the first to perform worse on most graph databases. Basically, the use of named steps with as() places path-calculation requirements on the traversal, which increases its cost. When optimizing Gremlin, one of my earliest steps is to look for ways to factor out anything that does that.
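One path-free way to get "each vertex plus its matching edges or a blank" in a single query is project(), where folding the edges inside the by() modulator yields an empty list for vertices without a match instead of filtering them out. A sketch in gremlin-python, under the same hypothetical labels:

from gremlin_python.process.graph_traversal import __

# Each result is a map {'v': vertex, 'e': list of edges (possibly empty)};
# no as()/select() path bookkeeping is involved.
rows = (g.V()
         .project('v', 'e')
         .by(__.identity())
         .by(__.outE('someLabel').has('someProperty', 'someValue').fold())
         .toList())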
This question seems related to your other question on Jagged Result Array, but I'm having trouble maintaining the context from one question to the other to understand how to expound further.

Making direct gremlin queries from gremlin-python

I've encountered several issues with gremlin-python that don't occur in pure Gremlin:
I can't directly select a given vertex type (g.V('customer')) without iterating over all vertices (g.V().hasLabel('customer'))
I get "Maximum Recursion reached" errors from Python. The same query in gremlin works smooth and fast
The ".next()" command works really slow in gremlin-python while in gremlin takes 1 sec
So, from Python/gremlin-python, I would like to be able to make a pure gremlin query to the server and directly store its result in a Python variable. Is that possible?
(I'm using gremlin-python on Apache Zeppelin if that matters)
I can't directly select a given vertex type (g.V('customer')) without iterating over all vertices (g.V().hasLabel('customer'))
g.V('customer') in Gremlin means "find a vertex with the id 'customer'", not "find vertices with the label 'customer'". For the latter you need what you wrote: g.V().hasLabel('customer'). Those rules are the same in every variation of Gremlin, including Python. And you are correct that a query like g.V().hasLabel('customer') will be expensive, as there aren't many graphs that optimize this type of operation. On large graphs this would typically be considered an OLAP query that you would do with Gremlin Spark.
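Side by side in gremlin-python (with g wired up as in the example further down):

# Looks up the single vertex whose *id* is the string 'customer'.
by_id = g.V('customer').toList()

# Filters by *label*, which scans vertices unless the graph indexes labels.
by_label = g.V().hasLabel('customer').toList()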
I get "Maximum Recursion reached" errors from Python. The same query in gremlin works smooth and fast
That was a bug. It is resolved now, but the fix is not released to pypi. A release is currently being prepared and so you will see this on 3.2.10 and 3.3.4. If you need an immediate patch, you can see that the fix was fairly trivial.
The ".next()" command works really slow in gremlin-python while in gremlin takes 1 sec
I'm not sure what you're seeing exactly. I think you might want to detail more about your environment with specifics as to how to recreate the difference. Perhaps you should bring that question to the gremlin-users mailing list.
So, from Python/gremlin-python, I would like to be able to make a pure gremlin query to the server and directly store its result in a Python variable. Is that possible?
That's perfectly possible and is exactly what gremlin-python is meant to do. It enables you to write Gremlin in Python and get results back from the server to do with as you need on the client side.
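A minimal end-to-end sketch, assuming a Gremlin Server (or compatible endpoint) at ws://localhost:8182/gremlin:

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# The traversal executes on the server; the results land directly in an
# ordinary Python list.
customers = g.V().hasLabel('customer').valueMap(True).toList()
print(customers)

conn.close()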

How does executemany() work?

I have been using C++ to work with SQLite. In Python, the sqlite3 library has an executemany() operation, but the C++ library I am using does not.
I was wondering how the executemany() operation optimizes queries to make them faster.
I was looking at the SQLite C/C++ API and saw that there are two functions, sqlite3_reset() and sqlite3_clear_bindings(), that can be used to clear and reuse prepared statements.
Is this what Python does to batch and speed up executemany() queries (at least for inserts)? Thanks for your time.
executemany() just binds the parameters, executes the statement, and calls sqlite3_reset(), in a loop.
Python does not give you direct access to the statement after it has been prepared, so this is the only way to reuse it.
However, SQLite does not take much time for preparing statements, so this is unlikely to have much of an effect on performance.
The most important thing for performance is to batch statements in a transaction; Python tries to be clever and do this automatically (independently of executemany()).
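In Python terms, the loop described above is roughly the following, and the with conn: block supplying one enclosing transaction is where most of the speed comes from:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (a INTEGER, b TEXT)')
rows = [(i, 'name-%d' % i) for i in range(100000)]

# executemany(): prepare once, then bind/step/reset per row,
# all inside a single transaction.
with conn:
    conn.executemany('INSERT INTO t VALUES (?, ?)', rows)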
I looked into some of the related posts and found the following, which is very detailed on ways to improve SQLite batch-insert performance. These principles could effectively be used to create an executemany() function.
Improve INSERT-per-second performance of SQLite?
The biggest improvement was indeed, as @CL. said, turning it all into one transaction. The author of the other post also found significant improvement by creating and reusing prepared statements and by playing with some pragma settings.
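The settings experimented with in that post are along these lines; they trade durability for load speed, so they belong in a bulk-load script you can rerun after a crash, not in normal operation:

import sqlite3

conn = sqlite3.connect('bulk.db')
# Don't fsync after every write; a crash may corrupt the database file.
conn.execute('PRAGMA synchronous = OFF')
# Keep the rollback journal in memory instead of on disk.
conn.execute('PRAGMA journal_mode = MEMORY')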
