Gremlin: How can I merge groups of vertices when they are similar

My query returns groups of users vertices like this:
[
[Pedro, Sabrina, Macka, Fer]
[Pedro, Sabrina, Macka, Fer, Britney]
[Brintey, Fred, Christina]
]
The first 2 groups are similar; they contain mostly the same vertices. I need to merge them.
I need to merge groups that are, for example, 80% similar (80% of the elements are the same).
Is this possible in Gremlin? How can I do this?
Edit:
https://gremlify.com/2ykos4047g5
This gremlify project creates a fake output similar to what I have in my query. I need the first 2 lists merged into a single one, because they contain almost the same vertices, but not the third one, because it's completely different from the others.
So what I'm asking is how to write a query that compares all the lists, checks how many vertices they have in common, and based on that decides whether or not to merge them into a single one.
The expected output for the gremlify project is:
[
[
"Pedro",
"Sabrina",
"Macka",
"Fer",
"Britney"
],
[
"Garry",
"Dana",
"Lily"
]
]

Gremlin doesn't have steps that merge lists based on how much they are alike. Gremlin is fairly flexible, so I imagine there might be ways to use its steps in creative ways to get what you want, but the added complexity may not be worth it. My personal preference is to use Gremlin to retrieve my data, filter away whatever is extraneous, and then transform it as close as possible to its final result while maintaining a balance with readability.
Given that thinking, if your result from Gremlin is simply a list of lists of strings and your Gremlin up to that point is well structured and performant, then perhaps Gremlin has gotten you far enough and its job is done. Take that result and post-process it on your application side by writing some code to take you to your final result. With that approach you have your full programming language environment at your disposal, with all its libraries available to make that final step easier.
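For example, here is a minimal post-processing sketch in Python (the function name and the merge rule, intersection size over the smaller group's size, are my assumptions; adjust the similarity measure to whatever "80% similar" means for your data):

def merge_similar_groups(groups, threshold=0.8):
    # Overlap ratio: shared vertices divided by the size of the smaller group.
    merged = []
    for group in map(set, groups):
        for existing in merged:
            overlap = len(group & existing) / min(len(group), len(existing))
            if overlap >= threshold:
                existing |= group   # similar enough: fold this group into the existing one
                break
        else:
            merged.append(group)    # no similar group found: keep it as its own group
    return [sorted(g) for g in merged]

groups = [
    ["Pedro", "Sabrina", "Macka", "Fer"],
    ["Pedro", "Sabrina", "Macka", "Fer", "Britney"],
    ["Garry", "Dana", "Lily"],
]
print(merge_similar_groups(groups))
# [['Britney', 'Fer', 'Macka', 'Pedro', 'Sabrina'], ['Dana', 'Garry', 'Lily']]

This greedy single pass is enough for small outputs like the gremlify example; for larger inputs you may want a more careful pairwise comparison.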
I'd further add that your example is a bit contrived and focuses on an arbitrary result, which reduces your Gremlin question to a collection manipulation question. With graphs and Gremlin I often find that a heavy focus on collection manipulation to improve the quality of a result (rather than just the format of a result) implies that I should go back to the core of my traversal algorithm rather than try to tack on extra manipulation at the end of the traversal.
For example, if this output you're asking about in this question relates back to your previous questions here and here, then I'd wonder if you shouldn't rethink the rules of your algorithm. Perhaps, you really aren't "detecting triangles and then trying to group them accordingly" as I put it in one of my answers there. Maybe there is a completely different algorithm that will solve your problem which is even more effective and performant.
This blog post, "Reducing Computational Complexity with Correlate Traversals", does an excellent job in explaining this general notion. Though it focuses on centrality algorithms, the general message is quite clear:
All centrality measures share a similar conceptual theme — they all score the vertices in the graph according to how “central” they are relative to all other vertices. It is this unifying concept which can lead different algorithms to yield the same or similar results. Strong, positive correlations can be taken advantage of by the graph system architect, enabling them to choose a computationally less complex metric when possible.
In your case, perhaps you need more flexibility in the rules you stated for your algorithm to thus allow a better (i.e. less rigid) grouping in your results. In any case, it is something to think about and in the worst case you can obviously just take the brute force approach that you describe in your question and get your result.

Related

Gremlin correlated queries kill performance

I understand that implementation specifics factor into this question, but I also realize that I may be doing something wrong here. If so, what could I do better? If Gremlin has some multi-resultset submit queries batch feature I don't know about, that would solve the problem. As in, hey Gremlin, run these three queries in parallel and give me their results.
Essentially, I need to know when a vertex has a certain edge and if it doesn't have that edge, I need to pull a blank. So...
g.V().as("v").coalesce(outE("someLabel").has("someProperty","someValue"),constant()).as("e").select("v","e")
That query is 10x more expensive than simply getting the edges using:
g.V().outE("someLabel").has("someProperty","someValue")
So if I want to get a set of vertices with their edges or blank placeholders, I have two options: make two discrete queries and "join" the data in the API, or make one very expensive query.
I'm working from the assumption that in Gremlin we "do it in one trip", and that may in fact be quite wrong. That said, I also know that pulling back chunks of data and effectively doing joins in the API is bad practice because it breaks the encapsulation principle. It also adds round-trip overhead.
OK, so I found a solution that is ridiculous but fast. It involves fudging the traversal, so let me apologize up front if there's a better way...
g.inject(true).
  union(
    __.V().not(outE("someLabel")).constant().as("ridiculous"),
    __.V().outE("someLabel").as("ridiculous")
  ).
  select("ridiculous")
In essence, I have to write the query twice: once for the traversal with the edge I want and once more for the traversal where the edge is missing. So, if I have n present/not-present checks I'm going to need 2 ^ n copies of the query, each with its own combination of checks, to get the most optimal performance. Unfortunately, taking that approach runs the risk of a stack overflow, not to mention making the code impossible to manage reliably.
Your original query returned vertex-edge pairs, whereas your answer returns only edges.
You could just run g.E().hasLabel("somelabel") to get the same result.
A probably faster alternative to your original query might be:
g.E().hasLabel("somelabel").as("e").outV().as("v").select("v","e")
Or
g.V().as("v").outE("somelabel").as("e").select("v","e")
If Gremlin has some multi-resultset submit queries batch feature I don't know about, that would solve the problem
Gremlin/TinkerPop do not have such functionality built in. There is at least one graph that does have some form of Gremlin batching - DataStax Graph...not sure about others.
I'm also not sure I really have any answer that you might find useful, but while I wouldn't expect a 10x difference in performance between those two traversals, I would expect the first to perform worse on most graph databases. Basically, the use of named steps with as() enables path calculation requirements on the traversal which increases costs. When optimizing Gremlin, one of my earliest steps is to try to look for ways to factor out anything that might do that.
This question seems related to your other question on Jagged Result Array, but I'm having trouble maintaining the context from one question into the other to understand how to expound further.

Best practice: How to specify a vertex's domain 'type' in a graph database

When building a graph, it is usually necessary to specify the 'type' of vertices. Conceptually I see this could be done by applying a vertex label or property to each vertex (i.e. Bob, Label: Man), or alternatively by linking a vertex to another 'type' vertex (i.e. Bob --IS A--> Man).
To find a list of all vertices of type 'Man' I can write gremlin queries that work for both of these approaches. But what is best practice?
Best practice: keep your data model simple and make sure it is compatible with efficient indexing by the underlying graph database. There is no one size fits all solution at the TinkerPop level.
It really depends on your data model as well as the indexing capabilities of the underlying database, not to mention the way the data is actually serialized on disk. Ultimately, it all boils down to the way you expect to query your graph and the kind of performance you wish to have.
This being said, people typically use vertex labels, sometimes in conjunction with a type property of some kind. Graph implementers should be able to provide efficient indexes for answering such queries. It should also give a simpler graph model, which is an important thing to consider.
Depending on the size of your graph, you could get performance issues when modeling types with vertices since a man type vertex could quickly become a supernode.
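To make the two modeling approaches concrete, here is a minimal sketch using gremlin-python (the server URL, labels, and property names are placeholders I've made up, not anything prescribed by TinkerPop):

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

# Assumes a Gremlin Server reachable at this address.
g = traversal().withRemote(DriverRemoteConnection('ws://localhost:8182/gremlin', 'g'))

# Option 1: the type is the vertex label (typically the simplest to index)
men = g.V().hasLabel('Man').toList()

# Option 2: the type is a separate vertex linked by an IS_A edge
men = g.V().hasLabel('Type').has('name', 'Man').in_('IS_A').toList()

Whichever shape you pick, check that the underlying graph database can index it efficiently; the label-based version is the one most providers optimize for.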

Graph data structures in LabVIEW

What's the best way to represent graph data structures in LabVIEW?
I'm doing some basic algorithm review over the holiday, and I'd prefer to not implement all of the storage and traversals myself, if possible.
(I'm aware that there was a thread a few years ago on LAVA, is that my best bet?)
I've never had a need to do this myself, so I never really looked into it, but there are some people who did do some work as far I know.
Brian K. has posted something over here, although it's been a long time since I looked at it:
https://decibel.ni.com/content/docs/DOC-12668
If that doesn't help, I would suggest you read this and then try sending a PM to Daklu there, as he's the most likely candidate to have something.
https://decibel.ni.com/content/thread/8179?tstart=0
If not, I would suggest posting a question on LAVA, as you're more likely to find the relevant people there.
Well, you don't have that many options for graphs, from a simple point of view. It really depends on the types of algorithms you are doing, in order to choose the most convenient representation.
Adjacency matrix is simple, but can be slow for some tasks, and can be wasteful if the graph is not dense.
You can keep a couple of lists and hash maps of your edges and vertices. With each edge or vertex assigned a unique index into the list as it is created, it's pretty simple to keep things under control. Each vertex could then be associated with a list of its neighbors. Depending on your needs you could divide that neighbors list into in and out edges. Also, depending on your lookup needs, you could choose to index edges by their in or out vertex, or both, or simply by a unique index number.
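For illustration, here is roughly what that bookkeeping looks like, sketched in Python (purely illustrative; in LabVIEW the lists become arrays and the hash map becomes whatever key-value lookup your version offers):

class Graph:
    def __init__(self):
        self.vertices = []        # index -> vertex payload
        self.index_of = {}        # payload -> index, for O(1) lookup
        self.out_edges = []       # index -> list of neighbor indices (outgoing)
        self.in_edges = []        # index -> list of neighbor indices (incoming)

    def add_vertex(self, payload):
        if payload not in self.index_of:
            self.index_of[payload] = len(self.vertices)
            self.vertices.append(payload)
            self.out_edges.append([])
            self.in_edges.append([])
        return self.index_of[payload]

    def add_edge(self, src, dst):
        s, d = self.add_vertex(src), self.add_vertex(dst)
        self.out_edges[s].append(d)
        self.in_edges[d].append(s)

g = Graph()
g.add_edge("A", "B")
g.add_edge("A", "C")
print([g.vertices[i] for i in g.out_edges[g.index_of["A"]]])   # ['B', 'C']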
I had a glance at the LabVIEW quick reference, and while it was not obvious from there how you would do that, as long as they have arrays of some sort, you can implement a graph. I'm sure you'll be fine.

How to serialize a graph?

This is an interview question: How do you serialize a graph? I saw this answer but I am not sure if this is enough.
It looks like a very confusing "open question" and the candidates are probably expected to ask more questions about the requirements: what the nodes and edges are, how they are serialized themselves, whether the graph is weighted, directed, etc., and how many nodes/edges are in the graph. What about the infrastructure? Is it a plain file system, or should/can we use a database?
So, how would you answer this question ?
I think the answer you provided is quite reasonable. IMO, you basically need to know the application background; I would ask at least:
is it directed or not?
what are the properties associated with the vertices, edges and the graph itself?
is the graph sparse (if so, we'd better not use an adjacency matrix)?
The simplest way will be storing it as an edge list.
However, in different applications there are some classical ways to do it.
For example, if you are doing circuit simulation then the graph is sparse and the resulting graph/matrix can be stored in column-compressed form. If you are solving a (min-cost) max-flow problem then there is already a DIMACS format that public solvers can read and write. A structured format is also a good choice if you want something human-readable; XML can provide self-validation (GraphML already exists as a standard). By the way, the DOT format is quite self-contained.
Meh. Whatever you store it in, it's basically:
Output each vertex in the graph. If you don't have all the vertices first, it's a PITA to rebuild the graph when you're reading it back in.
Now you can store edges between vertices. Hopefully your vertices have some form of ID to uniquely identify them. The version of this I've seen is "store a (graph|tree) in a database". So, read in the nodes and store them in a hashtable or similar for O(1) amortized lookup. Then, for each edge, look up the source ID and the destination ID, and link them.
Voila, you've deserialized it. If it's not a DB, the same idea generally holds - serialize nodes first, then edges.
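Here's a small sketch of that order of operations in Python, with JSON standing in for whatever storage you actually use (the format and field names are just illustrative):

import json

def serialize(nodes, edges):
    # nodes: {node_id: payload}, edges: list of (source_id, target_id) pairs
    return json.dumps({"nodes": nodes, "edges": edges})

def deserialize(text):
    data = json.loads(text)
    nodes = data["nodes"]                        # read every vertex first...
    adjacency = {node_id: [] for node_id in nodes}
    for src, dst in data["edges"]:               # ...then link edges via O(1) ID lookups
        adjacency[src].append(dst)
    return nodes, adjacency

text = serialize({"1": "A", "2": "B", "3": "C"}, [["1", "2"], ["1", "3"]])
print(deserialize(text))
# ({'1': 'A', '2': 'B', '3': 'C'}, {'1': ['2', '3'], '2': [], '3': []})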

query language for graph sets: data modeling question

Suppose I have a set of directed graphs. I need to query those graphs. I would like to get a feeling for my best choice for the graph modeling task. So far I have these options, but please don't hesitate to suggest others:
Proprietary implementation (matrix) and graph traversal algorithms.
RDBMS and SQL option (too space consuming)
RDF and SPARQL option (too slow)
What would you guys suggest? Regards.
EDIT: Just to answer Mad's questions:
Each one is relatively small, no more than 200 vertices, 400 edges. However, there are hundreds of them.
Frequency of querying: hard to say, it's an experimental system.
Speed: not real time, but practical, say 4-5 seconds tops.
You didn't give us enough information to respond with a well-thought-out answer. For example: what size are these graphs? With what frequency do you expect to query these graphs? Do you need real-time responses to these queries? More information on what your application is for, and what your purpose is, would be helpful.
Anyway, to counter the usual responses that suppose SQL-based DBMSes are unable to handle graph structures effectively, I will give some references:
Graph Transformation in Relational Databases (.pdf), by G. Varro, K. Friedl, D. Varro, presented at International Workshop on Graph-Based Tools (GraBaTs) 2004;
5 Conclusion and Future Work
In the paper, we proposed a new graph transformation engine based on off-the-shelf relational databases. After sketching the main concepts of our approach, we carried out several test cases to evaluate our prototype implementation by comparing it to the transformation engines of the AGG [5] and PROGRES [18] tools.
The main conclusion that can be drawn from our experiments is that relational databases provide a promising candidate as an implementation framework for graph transformation engines. We call attention to the fact that our promising experimental results were obtained using a worst-case assessment method i.e. by recalculating the views of the next rule to be applied from scratch which is still highly inefficient, especially, for model transformations with a large number of independent matches of the same rule. ...
They used PostgreSQL as the DBMS, which is probably not particularly good for this kind of application. You can try LucidDB and see if it is better, as I suspect it would be.
Incremental SQL Queries (more than one paper here; you should concentrate on "Maintaining Transitive Closure of Graphs in SQL"):
"... we showed that transitive closure, alternating paths, same generation, and other recursive queries, can be maintained in SQL if some auxiliary relations are allowed. In fact, they can all be maintained using at most auxiliary relations of arity 2. ..."
Incremental Maintenance of Shortest Distance and Transitive Closure in First Order Logic and SQL.
Edit: you gave more details, so... I think the best way is to experiment a little both with a dedicated main-memory graph library and with a DBMS-based solution, then carefully evaluate the pros and cons of both.
For example: a DBMS needs to be installed (unless you use an "embeddable" DBMS like SQLite), and only you know if/where your application needs to be deployed and who your users are. On the other hand, a DBMS gives you immediate benefits, like persistence (I don't know what support graph libraries give for persisting their graphs), transaction management and countless others. Are these relevant for your application? Again, only you know.
The first option you mentioned seems best. If your graph won't have many edges (|E|=O(|V|)) then you might get better time and space complexity using a Dictionary:
var graph = new Dictionary<Vertex, HashSet<Vertex>>();
An interesting graph library is QuickGraph. Never used it but it seems promising :)
I wrote and designed quite a few graph algorithms for various programming contests and in production code. And I noticed that every time I need one, I have to develop it from scratch, assembling together concepts from graph theory (BFS, DFS, topological sorting, etc.).
Perhaps a lack of experience is a reason, but it seems to me that there's still no reasonable general-purpose query language for solving graph problems. Pick a couple of general-purpose graph libraries and solve your particular task in a programming (not query!) language. That will give you the best performance and space consumption, but will also require an understanding of basic graph theory concepts and their limitations.
And the last one: do not use SQL for graphs.
