Making direct gremlin queries from gremlin-python

I've encountered several issues with gremlin-python that don't occur in plain Gremlin:
I can't directly select a given vertex type (g.V('customer')) without iterating over all vertices (g.V().hasLabel('customer'))
I get "Maximum Recursion reached" errors from Python. The same query in Gremlin runs smoothly and fast
The ".next()" command is really slow in gremlin-python, while in Gremlin it takes 1 sec
So, from Python/gremlin-python, I would like to be able to make a pure gremlin query to the server and directly store its result in a Python variable. Is that possible?
(I'm using gremlin-python on Apache Zeppelin if that matters)

I can't directly select a given vertex type (g.V('customer')) without iterating over all vertices (g.V().hasLabel('customer'))
g.V('customer') in Gremlin means "find a vertex with the id 'customer'", not "find vertices with the label 'customer'". For the latter you need what you wrote: g.V().hasLabel('customer'). Those rules are the same in every variant of Gremlin, including Python. And you are correct that a query like g.V().hasLabel('customer') will be expensive, as there aren't many graphs that optimize this type of operation. On large graphs this would typically be considered an OLAP query that you would run with Gremlin Spark.
I get "Maximum Recursion reached" errors from Python. The same query in Gremlin runs smoothly and fast
That was a bug. It is resolved now, but the fix has not yet been released to PyPI. A release is currently being prepared, so you will see the fix in 3.2.10 and 3.3.4. If you need an immediate patch, you can see that the fix was fairly trivial.
The ".next()" command is really slow in gremlin-python, while in Gremlin it takes 1 sec
I'm not sure what you're seeing exactly. I think you might want to detail more about your environment with specifics as to how to recreate the difference. Perhaps you should bring that question to the gremlin-users mailing list.
So, from Python/gremlin-python, I would like to be able to make a pure gremlin query to the server and directly store its result in a Python variable. Is that possible?
That's perfectly possible and is exactly what gremlin-python is meant to do. It enables you to write Gremlin in Python and get results back from the server to do with as you need on the client side.
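If the goal is specifically to send a Gremlin script as a string and keep the result in a Python variable, a rough sketch with the gremlin-python driver client could look like the following. The helper names `run_script` and `query_customers`, the URL, and the `customer` label are assumptions for illustration; `Client`, `submit`, `all` and `result` are the driver calls as I understand them, so check them against the version you have installed.

```python
def run_script(client, script, bindings=None):
    """Send a Gremlin script string to the server and return all results as a list.

    `client` is expected to behave like gremlin_python.driver.client.Client.
    """
    result_set = client.submit(script, bindings)
    return result_set.all().result()  # block until every result has arrived


def query_customers(url="ws://localhost:8182/gremlin"):
    """Hypothetical usage: the URL and the label are assumptions, not from the question."""
    from gremlin_python.driver import client as gremlin_client  # imported lazily

    client = gremlin_client.Client(url, "g")
    try:
        # The script is plain Gremlin; `lbl` is passed as a binding.
        return run_script(
            client,
            "g.V().hasLabel(lbl).limit(10).valueMap()",
            {"lbl": "customer"},
        )
    finally:
        client.close()
```

Calling `customers = query_customers()` against a running Gremlin Server would then leave the results in an ordinary Python list.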

Related

Batching operations in gremlin on Neptune

In the AWS Neptune documentation (best-practices-gremlin-java-batch-add) they recommend batching operations together.
How can I batch a few operations together when one of them may end the stream?
For example if I want to batch together the following:
g.V(2).drop().addV('test').property(id,1)
The problem is that the addV won't be called.
Is there a way to batch the drop and the addV together and make sure the addV will be called?
I tried putting fold() in between, but it isn't supported natively in Neptune and will probably create performance issues.
sideEffect() isn't a good option for performance reasons with Neptune either (see the drop documentation in gremlin-step-support).
You can accomplish this by using the sideEffect() step to perform the drop() as shown in this recipe. In your case this might look something like:
g.addV('test').property(id,1).sideEffect(V(2).drop())
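To see why the order matters, here is a toy model in plain Python (not TinkerPop code, and not how any real engine is implemented) in which each step is a generator that pulls traversers from the previous step. All names are hypothetical, and the graph is reduced to a set of vertex ids. The point it illustrates is that drop() emits nothing, so a step placed after it never runs, whereas sideEffect() passes its traverser on.

```python
def v(graph, vertex_id):
    """Toy V(id): emit the vertex id if it exists in the graph."""
    if vertex_id in graph:
        yield vertex_id


def drop(traversers, graph):
    """Toy drop(): remove each incoming vertex and emit nothing downstream."""
    for vertex_id in traversers:
        graph.discard(vertex_id)
    return
    yield  # unreachable; only makes this function a generator


def add_v(traversers, graph, new_id):
    """Toy mid-traversal addV(): runs once per incoming traverser."""
    for _ in traversers:
        graph.add(new_id)
        yield new_id


def side_effect(traversers, action):
    """Toy sideEffect(): run the action, then pass the traverser on unchanged."""
    for traverser in traversers:
        action()
        yield traverser


def drop_then_add(graph):
    """Models g.V(2).drop().addV('test').property(id, 1)."""
    return list(add_v(drop(v(graph, 2), graph), graph, 1))


def add_with_side_effect_drop(graph):
    """Models g.addV('test').property(id, 1).sideEffect(V(2).drop())."""
    def add_start():
        graph.add(1)  # addV as a start step runs exactly once
        yield 1

    return list(side_effect(add_start(), lambda: list(drop(v(graph, 2), graph))))
```

In this model `drop_then_add` removes vertex 2 but never adds vertex 1, while `add_with_side_effect_drop` does both.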

Gremlin: Does calling expensive steps after cheaper ones work as an optimization?

I have a big Gremlin query that basically filters results. It is made of many has() and where() steps that can be written in any order and give the same result; some of them are expensive and some are cheaper.
If I call the cheaper steps first, I guess the expensive ones will be executed fewer times because many vertices have already been filtered out. That is true when coding in any language, but in a database implementation I don't know whether the Gremlin steps are executed in the order in which they are written.
I know this kind of thing usually depends on the Gremlin database implementation, but maybe you can give me some kind of general answer. I've also tried to write some benchmarks, but building good ones for my specific case is too time-consuming, so maybe you can help me with your knowledge of how databases are implemented internally.
As you mention, it really does depend on the query engine and the way optimized query plans are developed. Some engines will try to reorder parts of queries based on the estimated cardinality of elements being tested. Amazon Neptune works that way for example. In general it is best to filter out as much as possible as soon as possible. So in a social network you would not want to start with something like g.V().hasLabel('person') unless you are confident the query engine is able to reorder such queries.
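The effect of "filter as much as possible as soon as possible" can be illustrated with a small plain-Python model. It assumes an engine that applies filters strictly in the order written, which, as noted, is not true of every implementation. The data and the two predicates are made up for illustration; `has_many_friends` stands in for an expensive step.

```python
def run_filters(items, predicates):
    """Apply predicates in the given order; return survivors and per-predicate call counts."""
    calls = [0] * len(predicates)
    survivors = []
    for item in items:
        for i, predicate in enumerate(predicates):
            calls[i] += 1
            if not predicate(item):
                break  # later predicates are never evaluated for this item
        else:
            survivors.append(item)
    return survivors, calls


def is_person(vertex):
    """Cheap and selective stand-in for something like hasLabel('person')."""
    return vertex["label"] == "person"


def has_many_friends(vertex):
    """Stand-in for an expensive where() step."""
    return len(vertex["friends"]) >= 3


# Made-up data: 1,000 vertices, of which every hundredth one is a person.
vertices = [
    {"id": i, "label": "person" if i % 100 == 0 else "other", "friends": list(range(i % 7))}
    for i in range(1000)
]

cheap_first, cheap_first_calls = run_filters(vertices, [is_person, has_many_friends])
expensive_first, expensive_first_calls = run_filters(vertices, [has_many_friends, is_person])
```

Both orders return the same vertices, but with this data the expensive check runs 10 times when it is written last and 1,000 times when it is written first.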

Azure Cosmos DB, Gremlin-API and atomic write-operations

We are starting a larger project and have to decide the datastore-technology.
For various reasons, we would like to use Cosmos-DB through the Gremlin-API.
The thing which we are not sure about is how to handle atomic writes. Cosmos' consistency levels (from strong to eventual) are fine for us, but we haven't found a way to have atomic write operations through the Gremlin API. We have already written quite complex Gremlin-queries like creating vertices and edges, navigating edges, deleting edges, using side-effects etc. in one Gremlin-statement. So if parts of the statement go wrong, we have no chance to recover from it. It's not an option to split the statements into several smaller ones because in case of an error, we would have to "rollback" the statements up to the erroneous one.
I have found the following question, but there is no answer so far: Azure Cosmos Gremlin API: transactions and efficient graph traversal.
Other sources suggest writing idempotent Gremlin statements, but due to the complexity mentioned above, that's not a valid option for us.

Gremlin correlated queries kill performance

I understand that implementation specifics factor into this question, but I also realize that I may be doing something wrong here. If so, what could I do better? If Gremlin has some multi-resultset submit queries batch feature I don't know about, that would solve the problem. As in, hey Gremlin, run these three queries in parallel and give me their results.
Essentially, I need to know when a vertex has a certain edge and if it doesn't have that edge, I need to pull a blank. So...
g.V().as("v").coalesce(outE("someLabel").has("someProperty","someValue"),constant()).as("e").select("v","e")
That query is 10x more expensive than simply getting the edges using:
g.V().outE("someLabel").has("someProperty","someValue")
So if I want to get a set of vertices with their edges or blank placeholders, I have two options: Make two discrete queries and "join" the data in the API or make one very expensive query.
I'm working from the assumption that in Gremlin, we "do it in one trip" and that may in fact be quite wrong. That said, I also know that pulling back chunks of data and effectively doing joins in the API is bad practice because it breaks the encapsulation principle. It also adds roundtrip overhead.
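For comparison, the "two discrete queries, joined in the API" option might look roughly like this on the client side. This is only a sketch: the function name and the shape of the inputs (a list of vertex ids from one query, and (out-vertex id, edge) pairs from the other) are assumptions, not something Gremlin prescribes.

```python
def attach_edges(vertex_ids, edge_rows, blank=None):
    """Pair every vertex with its matching edges, or with a blank placeholder.

    vertex_ids: ids returned by the first query (all vertices)
    edge_rows:  (out_vertex_id, edge) pairs returned by the second query
    """
    edges_by_vertex = {}
    for out_id, edge in edge_rows:
        edges_by_vertex.setdefault(out_id, []).append(edge)

    rows = []
    for vertex_id in vertex_ids:
        # A vertex without a matching edge still produces one row with the placeholder.
        for edge in edges_by_vertex.get(vertex_id, [blank]):
            rows.append({"v": vertex_id, "e": edge})
    return rows
```

The result has the same "v"/"e" shape as the select("v","e") rows, at the cost of the extra roundtrip mentioned above.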
OK, so I found a solution that is ridiculous but fast. It involves fudging the traversal so let me apologize up front if there's a better way...
g.inject(true).
  union(
    __.V().not(outE("someLabel")).constant().as("ridiculous"),
    __.V().outE("someLabel").as("ridiculous")
  ).
  select("ridiculous")
In essence, I have to write the query twice. Once for the traversal with the edge I want and once more for the traversal where the edge is missing. So, if I have n present / not present checks I'm going to need 2 ^ n copies of the query each with its own combination of checks so that I get the most optimal performance. Unfortunately, taking that approach runs the risk of a stack overflow not to mention making code impossible to manage reliably.
Your original query returned vertex-edge pairs, whereas your answer returns only edges.
You could just run g.E().hasLabel("someLabel") to get the same result.
A probably faster alternative to your original query might be:
g.E().hasLabel("someLabel").as("e").outV().as("v").select("v","e")
Or
g.V().as("v").outE("someLabel").as("e").select("v","e")
If Gremlin has some multi-resultset submit queries batch feature I don't know about, that would solve the problem
Gremlin/TinkerPop do not have such functionality built in. There is at least one graph that does have some form of Gremlin batching - DataStax Graph...not sure about others.
I'm also not sure I really have any answer that you might find useful, but while I wouldn't expect a 10x difference in performance between those two traversals, I would expect the first to perform worse on most graph databases. Basically, the use of named steps with as() enables path calculation requirements on the traversal which increases costs. When optimizing Gremlin, one of my earliest steps is to try to look for ways to factor out anything that might do that.
This question seems related to your other question on Jagged Result Array, but I'm having trouble carrying the context from one question into the other well enough to expound further.

Cypher and Gremlin querying

I'm doing research on graph query languages. My understanding is that Gremlin is dedicated to traversal querying and that Cypher is efficient and easier, but I can't find a concrete example that differentiates them.
Can someone give me examples of queries that we can do with Cypher and not with Gremlin, or the opposite?
Thanks
It's not a matter of what one language can do that the other language can't. They're both complete enough that you can do any kind of graph query in either. The question is simply how hard you'll have to work to make it happen, and whether the result will be performant, readable, and easy to change.
Cypher is a declarative language, meaning that you declare what you want to see, and the engine figures out how to get that data for you. Gremlin is largely imperative, meaning that you specify how to traverse the graph. This tends to make Gremlin more brittle,
