Gremlin correlated queries kill performance

I understand that implementation specifics factor into this question, but I also realize that I may be doing something wrong here. If so, what could I do better? If Gremlin has some multi-resultset submit queries batch feature I don't know about, that would solve the problem. As in, hey Gremlin, run these three queries in parallel and give me their results.
Essentially, I need to know when a vertex has a certain edge and if it doesn't have that edge, I need to pull a blank. So...
g.V().as("v").coalesce(outE("someLabel").has("someProperty","someValue"),constant()).as("e").select("v","e")
That query is 10x more expensive than simply getting the edges using:
g.V().outE("someLabel").has("someProperty","someValue")
So if I want to get a set of vertices with their edges or blank placeholders, I have two options: Make two discrete queries and "join" the data in the API or make one very expensive query.
I'm working from the assumption that in Gremlin, we "do it in one trip", and that may in fact be quite wrong. That said, I also know that pulling back chunks of data and effectively doing joins in the API is bad practice because it breaks the encapsulation principle. It also adds round-trip overhead.

OK, so I found a solution that is ridiculous but fast. It involves fudging the traversal so let me apologize up front if there's a better way...
g.inject(true).
  union(
    __.V().not(outE("someLabel")).constant("").as("ridiculous"),
    __.V().outE("someLabel").as("ridiculous")
  ).
  select("ridiculous")
In essence, I have to write the query twice: once for the traversal with the edge I want and once more for the traversal where the edge is missing. So, if I have n present/not-present checks, I'm going to need 2^n copies of the query, each with its own combination of checks, to get optimal performance. Unfortunately, taking that approach runs the risk of a stack overflow, not to mention making the code impossible to manage reliably.

Your original query returned vertex-edge pairs, whereas your answer returns only edges.
You could just run g.E().hasLabel("someLabel") to get the same result.
A probably faster alternative to your original query might be:
g.E().hasLabel("someLabel").has("someProperty","someValue").as("e").outV().as("v").select("v","e")
Or
g.V().as("v").outE("someLabel").has("someProperty","someValue").as("e").select("v","e")

If Gremlin has some multi-resultset submit queries batch feature I don't know about, that would solve the problem
Gremlin/TinkerPop do not have such functionality built in. There is at least one graph database that does have some form of Gremlin batching (DataStax Graph), but I'm not sure about others.
I'm also not sure I really have any answer that you might find useful, but while I wouldn't expect a 10x difference in performance between those two traversals, I would expect the first to perform worse on most graph databases. Basically, the use of named steps with as() enables path calculation requirements on the traversal, which increases cost. When optimizing Gremlin, one of my first steps is to look for ways to factor out anything that might do that.
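For what it's worth, one way to keep the "blank when missing" behavior without labeled steps is project() with a fold() on the inner traversal, so every vertex comes back paired with a (possibly empty) list of matching edges. This is just a sketch against the property names from your question, and whether it actually avoids the path-calculation cost will depend on the implementation:
g.V().project("v","e").
      by(identity()).
      by(outE("someLabel").has("someProperty","someValue").fold())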
This question seems related to your other question on Jagged Result Array, but I'm having trouble maintaining the context from one question to the other to understand how to expound further.

Related

Gremlin: Does calling expensive steps after cheaper ones work as an optimization?

I have a big Gremlin query that basically filters results. It is made of many has() and where() steps that can be written in any order and give the same result; some of them are expensive and some of them are cheap.
If I call the cheaper steps first, I guess the expensive ones will be executed over fewer iterations because many vertices have already been filtered out. This is true when coding in any language, but in a database implementation I don't know whether the Gremlin steps are executed in the order they are written.
I know this kind of thing usually depends on the Gremlin database implementation, but maybe you can give me some kind of general answer. I've also tried to run some benchmarks, but building good ones in my specific case is too time-consuming, so maybe you can help me with your knowledge of how these databases are implemented internally.
As you mention, it really does depend on the query engine and the way optimized query plans are developed. Some engines will try to reorder parts of queries based on the estimated cardinality of the elements being tested. Amazon Neptune works that way, for example. In general it is best to filter out as much as possible as soon as possible. So in a social network you would not want to start with something like g.V().hasLabel('person') unless you are confident the query engine is able to reorder such queries.
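To illustrate, here is a hedged sketch (the labels and property names are made up): both traversals return the same people, but the first narrows with a selective, index-backed predicate before running the expensive where() filter, while the second pays the expensive count for every person vertex:
// selective filter first, expensive filter afterwards
g.V().has("person", "city", "Reykjavik").
  where(out("follows").count().is(gt(1000)))
// same result, but the count runs once per person vertex
g.V().hasLabel("person").
  where(out("follows").count().is(gt(1000))).
  has("city", "Reykjavik")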

Azure Cosmos DB, Gremlin-API and atomic write-operations

We are starting a larger project and have to decide on the datastore technology.
For various reasons, we would like to use Cosmos DB through the Gremlin API.
The thing we are not sure about is how to handle atomic writes. Cosmos' consistency levels (from strong to eventual) are fine for us, but we haven't found a way to get atomic write operations through the Gremlin API. We have already written quite complex Gremlin queries that create vertices and edges, navigate edges, delete edges, use side effects, etc. in one Gremlin statement. So if part of the statement goes wrong, we have no chance to recover from it. It's not an option to split the statements into several smaller ones, because in case of an error we would have to "roll back" the statements up to the erroneous one.
I have found the following question, but there is no answer so far: Azure Cosmos Gremlin API: transactions and efficient graph traversal.
Other sources suggest writing idempotent Gremlin statements, but due to the mentioned complexity that's not a valid option for us.
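For reference, the idempotent pattern such sources usually mean is the fold()/coalesce()/unfold() "upsert"; a minimal sketch with made-up labels and properties is shown here. Because it only creates the vertex when it doesn't already exist, re-running the statement after a partial failure does no harm:
g.V().has("person", "userId", "1234").
  fold().
  coalesce(unfold(),
           addV("person").property("userId", "1234"))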

Gremlin: How can I merge groups of vertices when they are similar

My query returns groups of user vertices like this:
[
[Pedro, Sabrina, Macka, Fer]
[Pedro, Sabrina, Macka, Fer, Britney]
[Britney, Fred, Christina]
]
The first two groups are similar; they contain mostly the same vertices. I need to merge them.
I need to merge groups that are, for example, 80% similar (80% of the elements are the same).
Is this possible in Gremlin? How can I do this?
Edit:
https://gremlify.com/2ykos4047g5
This Gremlify project creates fake output similar to what I have in my query. I need the first two lists merged into a single one because they contain almost the same vertices, but not the third one because it's completely different from the others.
So what I'm asking is how to write a query that compares all the lists, checking how many vertices they have in common, and based on that decides whether or not to merge them into a single one.
The expected output for the gremlify project is:
[
[
"Pedro",
"Sabrina",
"Macka",
"Fer",
"Britney"
],
[
"Garry",
"Dana",
"Lily"
]
]
Gremlin doesn't have steps that merge lists based on how much they are alike. Gremlin is fairly flexible, so I imagine there might be ways to use its steps in creative ways to get what you want, but the added complexity may not be worth it. My personal preference is to use Gremlin to retrieve my data, filter away whatever is extraneous, and then transform it as close as possible to its final result while maintaining a balance with readability.
Given that thinking, if your result from Gremlin is simply a list of lists of strings and your Gremlin up to that point is well structured and performant, then perhaps Gremlin has gotten you far enough and its job is done. Take that result and post-process it on your application side by writing some code to take you to your final result. With that approach you have your full programming language environment at your disposal, with all the libraries available to you to make that final step easier.
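As a minimal sketch of that post-processing step, here is some Groovy (the Gremlin console's host language); the Jaccard similarity measure and the 80% threshold are my assumptions, not anything built into Gremlin:
def jaccard = { a, b -> a.intersect(b).size() / (a + b).unique().size() }

def groups = [["Pedro", "Sabrina", "Macka", "Fer"],
              ["Pedro", "Sabrina", "Macka", "Fer", "Britney"],
              ["Britney", "Fred", "Christina"]]

def merged = []
groups.each { group ->
    // merge into the first already-collected group that is at least 80% similar
    def target = merged.find { jaccard(it, group) >= 0.8 }
    if (target) {
        target.addAll(group)
        target.unique(true)              // de-duplicate in place
    } else {
        merged << new ArrayList(group)
    }
}
// merged == [[Pedro, Sabrina, Macka, Fer, Britney], [Britney, Fred, Christina]]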
I'd further add that your example is a bit contrived and focuses on an arbitrary result, which reduces your Gremlin question to a collection manipulation question. With graphs and Gremlin I often find that a heavy focus on collection manipulation to improve the quality of a result (rather than just the format of a result) implies that I should go back to the core of my traversal algorithm rather than try to tack on extra manipulation at the end of the traversal.
For example, if the output you're asking about in this question relates back to your previous questions here and here, then I'd wonder if you shouldn't rethink the rules of your algorithm. Perhaps you really aren't "detecting triangles and then trying to group them accordingly", as I put it in one of my answers there. Maybe there is a completely different algorithm that will solve your problem which is even more effective and performant.
This blog post, "Reducing Computational Complexity with Correlate Traversals", does an excellent job in explaining this general notion. Though it focuses on centrality algorithms, the general message is quite clear:
All centrality measures share a similar conceptual theme — they all score the vertices in the graph according to how “central” they are relative to all other vertices. It is this unifying concept which can lead different algorithms to yield the same or similar results. Strong, positive correlations can be taken advantage of by the graph system architect, enabling them to choose a computationally less complex metric when possible.
In your case, perhaps you need more flexibility in the rules you stated for your algorithm, to allow a better (i.e. less rigid) grouping in your results. In any case, it is something to think about, and in the worst case you can obviously just take the brute-force approach that you describe in your question and get your result.

Symmetric(or undirected) Hamiltonian Cycle data sets

I would like to test my recently created algorithm on large (50+ node) graphs. Preferably, they would be specifically challenging graphs, and known tours would exist (for at least most of them).
Problem sets for this problem do not seem as easy to find as for the TSP. I am aware of the Flinders challenge set available at http://www.flinders.edu.au/science_engineering/csem/research/programs/flinders-hamiltonian-cycle-project/fhcpcs.cfm
However, they seem to be directed. I can probably alter my algorithm to work on directed graphs, but it will take time and likely introduce bugs. I'd prefer to know if it can work for undirected graphs first.
Does anyone know where problem sets are available? Thank you.
quick edit:
Now I am unsure if the Flinders set is directed or not... It doesn't say. The examples make it seem like it may actually be undirected.
Check this video:
https://www.youtube.com/watch?v=G1m7goLCJDY
Also check the in-depth sequel to the video.
You can determine yourself how many nodes you want to add to the graph.
It does require you to construct the data yourself, which should be doable.
One note: the problem is about a path, not a cycle, but you can overcome this by connecting the start and end nodes.

Graph data structures in LabVIEW

What's the best way to represent graph data structures in LabVIEW?
I'm doing some basic algorithm review over the holiday, and I'd prefer to not implement all of the storage and traversals myself, if possible.
(I'm aware that there was a thread a few years ago on LAVA; is that my best bet?)
I've never had a need to do this myself, so I never really looked into it, but there are some people who have done some work on it as far as I know.
Brian K. has posted something over here, although it's been a long time since I looked at it:
https://decibel.ni.com/content/docs/DOC-12668
If that doesn't help, I would suggest you read this and then try sending a PM to Daklu there, as he's the most likely candidate to have something.
https://decibel.ni.com/content/thread/8179?tstart=0
If not, I would suggest posting a question on LAVA, as you're more likely to find the relevant people there.
Well, you don't have that many options for graphs, from a simple point of view. It really depends on the types of algorithms you are doing, in order to choose the most convenient representation.
An adjacency matrix is simple, but it can be slow for some tasks, and it can be wasteful if the graph is not dense.
Alternatively, you can keep a couple of lists and hash maps of your edges and vertices. With each edge or vertex created assigned a unique index into the list, it's pretty simple to keep things under control. Each vertex could then be associated with a list of its neighbors. Depending on your needs, you could divide that neighbor list into in-edges and out-edges. Also, depending on your lookup needs, you could choose to index edges by their in-vertex, their out-vertex, or both, or simply by a unique index number.
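To make that indexing scheme concrete, here is a small Groovy sketch (the shape is what matters; in LabVIEW the same layout would map onto arrays, or arrays of clusters):
// vertices and edges live in flat lists; everything else is integer indices
def vertices = ["A", "B", "C"]               // index = vertex id
def edges    = [[0, 1], [0, 2], [1, 2]]      // [out-vertex id, in-vertex id]

// per-vertex edge-index lists, split into out-edges and in-edges
def outEdges = [0: [0, 1], 1: [2], 2: []]
def inEdges  = [0: [],     1: [0], 2: [1, 2]]

// e.g. the out-neighbors of vertex 0 ("A")
def outNeighbors = outEdges[0].collect { vertices[edges[it][1]] }
assert outNeighbors == ["B", "C"]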
I had a glance at the LabVIEW quick reference, and while it was not obvious from there how you would do that, as long as there are arrays of some sort, you can implement a graph. I'm sure you'll be fine.
