How to list all partitions in Gremlin?

Is there any way to list all partitions with the Gremlin API?
I know that I can create a PartitionStrategy when creating a graph. However, how can I know which partitions already exist in a graph?

In the context of PartitionStrategy, there are only two ways to know what partitions are in the graph:
1. You know and keep track of the various partitions yourself as they are created.
2. You query the graph for the unique list of partition names in the partition key: g.V().values("nameOfYourPartitionKey").dedup()
Obviously, the second approach listed above could be very expensive since it is a global traversal. For especially large graphs you may need to use an OLAP-style traversal with SparkGraphComputer.
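For reference, here is a minimal Gremlin Console sketch of both sides of this, assuming an embedded TinkerGraph and a partition key named '_partition' (use whatever key name you chose when building the strategy):

graph = TinkerGraph.open()
strategy = PartitionStrategy.build().
             partitionKey('_partition').
             writePartition('a').
             readPartitions('a').create()
g = graph.traversal().withStrategies(strategy)
g.addV('person').property('name', 'marko')   // written into partition 'a'

// a plain traversal source without the strategy sees every partition
gAll = graph.traversal()
gAll.V().values('_partition').dedup()        // lists the partition names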

Related

Let A be a Gremlin query, and let t1, t2 be two points in time. Excluding the ids of edges and nodes, is A(t1) = A(t2)?

This might be a pretty obscure question but I'll try my best here.
Assuming I have a very simple query, for instance:
g.addV('Person').property('name', 'Marko');
And then I run the same query again.
Of course the graph created two different nodes, but regardless of the id, are they "the same"?
Same for querying the graph:
g.V()
Will the graph produce the results in the same order for any run (assuming it didn't change)?
What I'm trying to ask - can I count on the order of the Gremlin execution?
Thanks!
Gremlin does not enforce iteration order unless you explicitly specify it. It is up to the underlying graph to determine order and most that I'm familiar with do not make such guarantees. Therefore, if you want an order, you need to specify it as in: g.V().order().by('name').
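If you also need a deterministic order when two vertices share the same name, you can add the element id as a tie-breaker, since ids are unique:
g.V().order().by('name').by(T.id)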

DynamoDB: Querying all similar items of a certain type

Keeping in mind the best practices of having a single table and of evenly distributing items across partitions by using partition keys that are as unique as possible in DynamoDB, I am stuck on one problem.
Say my table stores items such as users, items and devices. I am storing the id for each of these items as the partition key. Each id is prefixed with its type such as user-XXXX, item-XXXX & device-XXXX.
Now the problem is: how can I query only a certain type of object? For example, if I want to retrieve all users, how do I do that? It would have been possible if the begins_with operator were allowed for partition keys, so I could search by prefix, but partition keys only allow the equality operator.
If I instead use the type as the partition key, for example user as the partition key and the user id as the sort key, it would work, but it would result in only a few partition keys and thus lead to the hot-key issue. And creating multiple tables is a bad practice.
Any suggestions are welcome.
This is a great question. I'm also interested to hear what others are doing to solve this problem.
If you're storing your data with a Partition Key of <type>-<id>, you're supporting the access pattern "retrieve an item by ID". You've correctly noted that you cannot use begins_with on a Partition Key, leaving you without a clear cut way to get a collection of items of that type.
I think you're on the right track with creating a Partition Key of <type> (e.g. Users, Devices, etc) with a meaningful Sort Key. However, since your items aren't evenly distributed across the table, you're faced with the possibility of a hot partition.
One way to solve the problem of a hot partition is to use an external cache, which would prevent your DB from being hit every time. This comes with added complexity that you may not want to introduce to your application, but it's an option.
You also have the option of distributing the data across partitions in DynamoDB, effectively implementing your own cache. For example, let's say you have a web application that has a list of "top 10 devices" directly on the homepage. You could create partitions DEVICES#1, DEVICES#2, DEVICES#3, ..., DEVICES#N that each store the top 10 devices. When your application needs to fetch the top 10 devices, it could randomly select one of these partitions to get the data. This may not work for a partition as large as Users, but it is a pretty neat pattern to consider.
Extending this idea further, you could partition Devices by some other meaningful attribute (e.g. <manufactured_date> or <created_at>). This would more uniformly distribute your Device items throughout the database. Your application would be responsible for querying all the partitions and merging the results, but you'd reduce or eliminate the hot-partition problem. The AWS DynamoDB docs discuss this pattern in greater depth.
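A rough sketch of the read side of that sharding pattern, using the AWS SDK for Java v2; the table name app-table, the key attribute pk, and the shard count are all hypothetical:

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ShardedQuery {
    // Query every DEVICES#<n> shard and merge the results client-side.
    public static List<Map<String, AttributeValue>> queryAllDeviceShards(
            DynamoDbClient ddb, int shardCount) {
        List<Map<String, AttributeValue>> items = new ArrayList<>();
        for (int shard = 1; shard <= shardCount; shard++) {
            QueryRequest request = QueryRequest.builder()
                .tableName("app-table")                       // hypothetical table name
                .keyConditionExpression("pk = :pk")
                .expressionAttributeValues(Map.of(
                    ":pk", AttributeValue.builder().s("DEVICES#" + shard).build()))
                .build();
            items.addAll(ddb.query(request).items());         // merge shard results
        }
        return items;
    }
}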
There's hardly a one-size-fits-all approach to DynamoDB data modeling, which can make the modeling super tricky! Your specific access patterns will dictate which solution fits your scenario best.
Keeping in mind the best practices of having a single table and to evenly distribute items across partitions
Quickly highlighting the two things mentioned here.
Even distribution of partition keys is definitely a best practice.
Keeping the records in a single table is, generally speaking, about avoiding the normalization of a relational database. In other words, it is fine to build with duplicate/redundant information. It is not necessarily a mandate to club all possible data into a single table.
Now the problem is how can I query only a certain type of object? For example I want to retrieve all users, how do I do that?
Let's imagine that you had this table with only "user" data in it. Would that allow you to retrieve all users? Of course not, unless there is a single partition whose key is the type user, with the rest of the data behind a sort key of, say, the user id.
And creating multiple tables is a bad practice
I don't think it is considered bad to have more than one table. It is bad if we store data just like normalized relational tables and then have to use a JOIN to get the data back together.
Having said that, what would be a better approach to follow?
The fundamental difference is to think about the queries first and derive the table design from them. That will even suggest whether DynamoDB is the right choice. For example, the requirement to select every user might altogether be a bad use case for DynamoDB to solve.
The query patterns will further suggest the best partition key to have in hand. Is DynamoDB chosen here because of high ingest and mostly immutable writes?
Do I always have the partition key in hand to perform the select that I need to perform?
What would the update statements look like, will it have again the partition key to perform updates?
Do I need to further filter by additional columns and can that be the default sort order?
As you start answering some of these questions, a better model might appear altogether.

Gremlin : What is the most efficient way to write multiple traversals?

Suppose I have a person vertex with multiple edges, and I want to project properties from all of the traversals. What is the most efficient way to write this query with the Cosmos DB Gremlin API?
I tried the following, but its performance is slow.
g.V().
hasLabel('person').
project('Name', 'Language', 'Address').
by('name').
by(out('speaks').values('language')).
by(out('residesAt').values('city'))
Also, I have multiple filters and sorting for each traversal.
I don't think you can write that specific traversal any more efficiently than it is already written, especially if you've added filters to the out('speaks') and out('residesAt') traversals to further limit those paths. Note also that, as it stands, your example only returns the first "language" or "city" found, which is obviously faster than traversing all of the possible paths.
It does stand out to me that you are trying to retrieve all the "person" vertices. You don't say that you have additional filters there specifically, but if you do not then the cost of this traversal could be steep if you have millions of "person" vertices coming back. Typically, traversals that only filter on a vertex label will be costly as most graphs do not optimize those sorts of lookups. In the worst case, such a situation could mean that you have to do a full graph scan to get that initial set of vertices.
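If the initial set can be narrowed by an indexed property rather than the label alone, the traversal usually gets much cheaper. A sketch, where the 'name' property and its value are hypothetical and assumed to be backed by an index:

g.V().has('person', 'name', 'marko').
  project('Name', 'Language', 'Address').
    by('name').
    by(out('speaks').values('language')).
    by(out('residesAt').values('city'))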

Gremlin query to get in and out edges for a given Vertex

I'm just playing with the Graph API in Cosmos DB, which uses the Gremlin syntax for queries.
I have a number of users (vertices) in the graph and each has 'knows' edges to other users. Some of these are out edges (outE) and others are in edges (inE), depending on how the relationship was created.
I’m now trying to create a query which will return all ‘knows’ relationships for a given user (Vertex).
I can easily get either the in edges or the out edges via:
g.V('7112138f-fae6-4272-92d8-4f42e331b5e1').inE('knows')
g.V('7112138f-fae6-4272-92d8-4f42e331b5e1').outE('knows')
where '7112138f-fae6-4272-92d8-4f42e331b5e1' is the id of the user I'm querying, but I don't know ahead of time whether a given edge is an in or an out edge, so I want to get both (e.g. if the user has both in and out edges with the 'knows' label).
I’ve tried using a projection and OR operator and various combinations of things e.g.:
g.V('7112138f-fae6-4272-92d8-4f42e331b5e1').where(outE('knows').or().inE('knows'))
but it's not getting me back the data I want.
All I want out is a list of the ids of all in and out edges that have the label 'knows' for a given vertex.
Or is there a simpler/better way to model bi-directional associations such as ‘knows’ or ‘friendOf’?
Thanks
You can use the bothE step in this case:
g.V('7112138f-fae6-4272-92d8-4f42e331b5e1').bothE('knows')
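And since the question asks only for the edge ids, you can append the id() step:
g.V('7112138f-fae6-4272-92d8-4f42e331b5e1').bothE('knows').id()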

Passing the results of multiple sequential HBase queries to a Mapreduce job

I have an HBase database that stores adjacency lists for a directed graph, with the edges in each direction stored in a pair of column families, where each row denotes a vertex. I am writing a MapReduce job that takes as its input all vertices that are pointed to by the same vertices that point at some other vertex (nominated as the subject of the query). This is a little difficult to explain, but in the following diagram, the set of nodes taken as the input when querying on vertex 'A' would be {A, B, C}, by virtue of their all having edges from vertex '1':
To perform this query in HBase, I first look up the vertices with edges to 'A' in the reverse-edges column family, yielding {1}, and then, for every element in that set, look up the vertices with edges from that element in the forward-edges column family.
This should yield a set of key-value pairs: {1: {A,B,C}}.
Now, I would like to take the output of this set of queries and pass it to a Hadoop MapReduce job. However, I can't find a way of 'chaining' HBase queries together to provide the input to a TableMapper in the HBase MapReduce API. So far, my only idea has been to provide another initial mapper which takes the results of the first query (on the reverse edges) and, for each result, performs the query on the forward edges, yielding the results to be passed to a second map job. However, performing IO from within a map job makes me uneasy, as it seems rather counter to the MapReduce paradigm (and could lead to a bottleneck if several mappers are all trying to access HBase at once). Therefore, can anyone suggest an alternative strategy for performing this sort of query, or offer any advice about best practices for working with HBase and MapReduce in such a way? I'd also be interested to know if there are any improvements to my database schema that could mitigate this problem.
Thanks,
Tim
Your problem does not flow so well with the MapReduce paradigm. I've seen the shortest-path problem solved by many MapReduce jobs chained together. This is not very efficient, but it is needed to get the global view at the reducer level.
In your case, it seems that you could perform all the requests within your mapper by following the edges and keeping a list of seen nodes.
However, performing IO from within a map job makes me uneasy
You should not worry about that. Your access pattern is essentially random, and trying to achieve data locality would be extremely hard, so you don't have much choice but to query this data across the network. HBase is designed to handle large parallel queries. Having multiple mappers query disjoint data will yield a good distribution of requests and high throughput.
Make sure to keep a small block size in your HBase tables to optimize your reads, and have as few HFiles as possible per region. I'm assuming your data is fairly static here, so running a major compaction will merge the HFiles together and reduce the number of files to read.
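A rough sketch of that single-job approach, doing the second hop with the HBase client from inside a TableMapper; the table name graph and the column family names reverse and forward are assumptions based on the schema described above:

import java.io.IOException;
import java.util.NavigableMap;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class TwoHopMapper extends TableMapper<Text, Text> {
    private static final byte[] REVERSE = Bytes.toBytes("reverse"); // edges pointing at a vertex
    private static final byte[] FORWARD = Bytes.toBytes("forward"); // edges pointing away from a vertex

    private Connection connection;
    private Table table;

    @Override
    protected void setup(Context context) throws IOException {
        connection = ConnectionFactory.createConnection(context.getConfiguration());
        table = connection.getTable(TableName.valueOf("graph"));
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result reverseEdges, Context context)
            throws IOException, InterruptedException {
        // The input scan over the subject vertex delivers its reverse-edge family,
        // i.e. the set of predecessors such as {1} in the example above.
        NavigableMap<byte[], byte[]> predecessors = reverseEdges.getFamilyMap(REVERSE);
        if (predecessors == null) {
            return;
        }
        for (byte[] predecessor : predecessors.keySet()) {
            // Second hop: fetch the forward edges of each predecessor,
            // yielding the set {A, B, C} for predecessor 1.
            Get get = new Get(predecessor);
            get.addFamily(FORWARD);
            NavigableMap<byte[], byte[]> successors = table.get(get).getFamilyMap(FORWARD);
            if (successors == null) {
                continue;
            }
            for (byte[] successor : successors.keySet()) {
                context.write(new Text(Bytes.toString(predecessor)),
                              new Text(Bytes.toString(successor)));
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        table.close();
        connection.close();
    }
}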
