Cypher and Gremlin querying - graph

I'm doing research on graph query languages. My understanding is that Gremlin is geared toward traversal-style querying while Cypher is efficient and easier to use, but I can't find a concrete example that differentiates them.
Can someone give me examples of queries that can be done with Cypher but not with Gremlin, or the other way around?
Thanks

It's not a matter of what one language can do that the other language can't. They're both complete enough that you can do any kind of graph query in either. The question is simply how hard you'll have to work to make it happen, and whether the result will be performant, readable, and easy to change.
Cypher is a declarative language, meaning that you declare what you want to see and the engine figures out how to get that data for you. Gremlin is largely imperative, meaning that you spell out how to traverse the graph. This tends to make Gremlin queries more verbose and more brittle when the data model changes, while giving you explicit control over the traversal.
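For a concrete feel for the difference, here is the same question ("who does Bob know?") written in both languages; the schema (person vertices with a name property, knows edges) is just an assumed example:
Cypher (declarative - state the pattern, let the planner find it):
MATCH (p:Person {name: 'Bob'})-[:KNOWS]->(friend) RETURN friend.name
Gremlin (imperative - spell out each hop of the traversal):
g.V().has('person', 'name', 'Bob').out('knows').values('name')
Both return the same data; the difference is in who decides the access path, you or the engine.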

Related

Gremlin: Does calling expensive steps after cheaper ones work as an optimization?

I have a big Gremlin query that basically filters results. It is made of many has() and where() steps that can be written in any order and give the same result; some of them are expensive and some are cheap.
If I call the cheaper steps first, I would expect the expensive ones to execute over fewer traversers because many vertices have already been filtered out. That is true when coding in any ordinary language, but in a database implementation I don't know whether Gremlin steps are executed in the order in which they are written.
I know this kind of thing usually depends on the Gremlin database implementation, but maybe you can give me some kind of general answer. I've also tried to build benchmarks, but building good ones for my specific case is too time consuming, so perhaps you can help with your knowledge of how databases are implemented internally.
As you mention, it really does depend on the query engine and the way optimized query plans are developed. Some engines will try to reorder parts of queries based on the estimated cardinality of the elements being tested; Amazon Neptune works that way, for example. In general it is best to filter out as much as possible as early as possible. So in a social network you would not want to start with something like g.V().hasLabel('person') unless you are confident the query engine is able to reorder such queries.
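For example, a sketch of that idea (the property names here are made up) - put the cheap, highly selective filter first so the expensive step only runs on the survivors:
g.V().has('person', 'country', 'IS').
  where(__.out('purchased').count().is(gt(100)))
On an engine that executes steps in the order written, the where() filter now iterates over a small set of vertices instead of every person in the graph.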

Gremlin correlated queries kill performance

I understand that implementation specifics factor into this question, but I also realize that I may be doing something wrong here. If so, what could I do better? If Gremlin has some multi-resultset submit queries batch feature I don't know about, that would solve the problem. As in, hey Gremlin, run these three queries in parallel and give me their results.
Essentially, I need to know when a vertex has a certain edge and if it doesn't have that edge, I need to pull a blank. So...
g.V().as("v").coalesce(outE("someLabel").has("someProperty","someValue"),constant()).as("e").select("v","e")
That query is 10x more expensive than simply getting the edges using:
g.V().outE("someLabel").has("someProperty","someValue")
So if I want to get a set of vertices with their edges or blank placeholders, I have two options: Make two discrete queries and "join" the data in the API or make one very expensive query.
I'm working from the assumption that in Gremlin we "do it in one trip", and that may in fact be quite wrong. That said, I also know that pulling back chunks of data and effectively doing joins in the API is bad practice because it breaks the encapsulation principle. It also adds round-trip overhead.
OK, so I found a solution that is ridiculous but fast. It involves fudging the traversal so let me apologize up front if there's a better way...
g.inject(true).
  union(
    __.V().not(outE("someLabel")).constant("").as("ridiculous"),
    __.V().outE("someLabel").as("ridiculous")
  ).
  select("ridiculous")
In essence, I have to write the query twice: once for the traversal with the edge I want and once more for the traversal where the edge is missing. So if I have n present/not-present checks, I'm going to need 2^n copies of the query, each with its own combination of checks, to get the most optimal performance. Unfortunately, that approach risks a stack overflow, not to mention making the code impossible to manage reliably.
Your original query returned vertex-edge pairs, whereas your answer returns only edges.
You could just run g.E().hasLabel("somelabel") to get the same result.
Probably a faster alternative to your original query might be:
g.E().hasLabel("somelabel").as("e").outV().as("v").select("v","e")
Or
g.V().as("v").outE("somelabel").as("e").select("v","e")
If Gremlin has some multi-resultset submit queries batch feature I don't know about, that would solve the problem
Gremlin/TinkerPop do not have such functionality built in. There is at least one graph that does have some form of Gremlin batching - DataStax Graph - but I'm not sure about others.
I'm also not sure I really have an answer you will find useful, but while I wouldn't expect a 10x difference in performance between those two traversals, I would expect the first to perform worse on most graph databases. Basically, the use of named steps with as() imposes path calculation requirements on the traversal, which increases cost. When optimizing Gremlin, one of my earliest steps is to look for ways to factor out anything that does that.
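For instance, the pair-returning traversal could be written with project() instead of as()/select(), which avoids the path bookkeeping - just a sketch, and whether it is actually faster still depends on the engine:
g.E().hasLabel("someLabel").has("someProperty", "someValue").
  project("v", "e").
    by(outV()).
    by(identity())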
This question seems related to your other question on Jagged Result Array, but I'm having trouble maintaining the context from one question to the other to understand how to expound further.

Making direct gremlin queries from gremlin-python

I've encountered several issues with gremlin-python that don't occur in pure Gremlin:
I can't directly select a given vertex type (g.V('customer')) without iterating over all vertices (g.V().hasLabel('customer'))
I get "Maximum Recursion reached" errors from Python. The same query in Gremlin runs smoothly and quickly
The .next() command runs really slowly in gremlin-python, while in Gremlin it takes about 1 second
So, from Python/gremlin-python, I would like to be able to make a pure gremlin query to the server and directly store its result in a Python variable. Is that possible?
(I'm using gremlin-python on Apache Zeppelin if that matters)
I can't directly select a given vertex type (g.V('customer')) without iterating over all vertices (g.V().hasLabel('customer'))
g.V('customer') in Gremlin means "find a vertex with the id 'customer'", not "find vertices with the label 'customer'". For the latter you need what you wrote: g.V().hasLabel('customer'). Those rules are the same in every variation of Gremlin, including Python. And you are correct that a query like g.V().hasLabel('customer') will be expensive, as there aren't many graphs that optimize this type of operation. On large graphs this would typically be considered an OLAP query that you would do with Gremlin Spark.
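As a rough sketch of what that switch looks like in the Gremlin Console (assuming SparkGraphComputer is configured for your graph), the same query becomes an OLAP traversal:
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().hasLabel('customer').count()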
I get "Maximum Recursion reached" errors from Python. The same query in gremlin works smooth and fast
That was a bug. It is resolved now, but the fix has not yet been released to PyPI. A release is currently being prepared, so you will see this fixed in 3.2.10 and 3.3.4. If you need an immediate patch, you can see that the fix was fairly trivial.
The ".next()" command works really slow in gremlin-python while in gremlin takes 1 sec
I'm not sure what you're seeing exactly. I think you should provide more detail about your environment, with specifics on how to recreate the difference. Perhaps you should bring that question to the gremlin-users mailing list.
So, from Python/gremlin-python, I would like to be able to make a pure gremlin query to the server and directly store its result in a Python variable. Is that possible?
That's perfectly possible and is exactly what gremlin-python is meant to do. It enables you to write Gremlin in Python and get results back from the server to do with as you need on the client side.
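A minimal sketch of that round trip (the endpoint URL and the 'customer' label are placeholders for your setup):
from gremlin_python.structure.graph import Graph
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# the traversal is built in Python, executed on the server,
# and the results come back as ordinary Python objects
g = Graph().traversal().withRemote(
        DriverRemoteConnection('ws://localhost:8182/gremlin', 'g'))
customers = g.V().hasLabel('customer').valueMap(True).toList()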

Best practice: How to specify a vertex's domain 'type' in a graph database

When building a graph, it is usually necessary to specify the 'type' of vertices. Conceptually I see this could be done by applying a vertex label or property to each vertex (i.e. Bob, label: Man), or alternatively by linking a vertex to another 'type' vertex (i.e. Bob --IS A--> Man).
To find a list of all vertices of type 'Man' I can write Gremlin queries that work for both of these approaches. But what is best practice?
Best practice: keep your data model simple and make sure it is compatible with efficient indexing by the underlying graph database. There is no one size fits all solution at the TinkerPop level.
It really depends on your data model as well as the indexing capabilities of the underlying database, not to mention the way the data is actually serialized on disk. Ultimately, it all boils down to the way you expect to query your graph and the kind of performance you wish to have.
This being said, people typically use vertex labels, sometimes in conjunction with a type property of some kind. Graph implementers should be able to provide efficient indexes for answering such queries. It should also give you a simpler graph model, which is an important thing to consider.
Depending on the size of your graph, you could get performance issues when modeling types with vertices since a man type vertex could quickly become a supernode.
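To illustrate (the labels and property names are made up), the two modeling choices read like this in Gremlin:
// label / property approach: relies on the database's label or property index
g.V().hasLabel('man')
// 'type vertex' approach: every man is linked to one shared Man vertex
g.V().has('name', 'Man').in('isA')
With the second approach the shared Man vertex accumulates one 'isA' edge per man in the graph, which is the supernode risk mentioned above.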

TinkerPop - Is it better to use Redis for key-value property indexes or to use KeyIndexableGraph

Pretty straightforward question, but I can't find the info that I want - is it advisable to use TinkerPop's KeyIndexableGraph, or to roll your own highly performant key/index solution on a specialized store like Redis to get the node/edge locations you need?
It would appear to me that Redis should be better here, as a technology that focuses solely on key/value lookups, passing the resulting address into the graph - but I'd like to justify the cost.
The promise from TinkerPop is that index lookups should be log(n) on elements that are indexed with the property, which is pretty good. Is it possible to do better in Redis, or is the constant factor much better than in the graph lookup?
Edit: I realized later this isn't really an intelligent question - Redis is an in-memory store and so is bounded by memory. Looking up a graph node's location would still require a second lookup of the node in the graph.
It is important to remember that aside from TinkerGraph (an in-memory graph), TinkerPop is not a graph database on its own. KeyIndexableGraph is an interface implemented by the underlying graph database (Titan, Neo4j, OrientDB, etc.) using that graph's index capability. Therefore, you should make your indexing choice based on the capabilities of the underlying graph database itself.
Generally speaking, implementing Redis for indexing purposes for the graphs that do implement KeyIndexableGraph seems like an unnecessary layer. I would guess that it will complicate your programming without much benefit.
Here is the difference:
Databases like OrientDB have approximately O(log2 n) lookup times on an index.
Redis has O(1) - constant-time lookup.
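To put rough numbers on that: for a graph with a million indexed elements, an O(log2 n) index lookup is on the order of 20 comparisons, versus a single hash lookup in Redis - a real but usually small difference compared to the cost of the extra round trip between the two systems.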
