Pruning large amounts of stale records in Neptune - Gremlin

I am following the best practices for pruning stale data from our Neptune graph database, as described below.
https://docs.aws.amazon.com/neptune/latest/userguide/best-practices-gremlin-prune.html
g.V().has('timestamp', lt(datetime('2021-02-23'))).drop()
This works fine on small datasets, but my graph generates about a million vertices a day. Am I supposed to have a service running continuously just dropping vertices in batches like below? What's the best approach for pruning large datasets?
pruneCount = g.V().has('timestamp', lt(datetime('2021-02-23'))).count().next()
while pruneCount > 0:
    g.V().has('timestamp', lt(datetime('2021-02-23'))).limit(1000).drop().iterate()
    pruneCount = g.V().has('timestamp', lt(datetime('2021-02-23'))).count().next()

If you need to drop say a million vertices, one strategy that I find works is to retrieve the IDs of all the vertices (and edges) that you need to drop and then drop them in batches across multiple threads. In this way you can drop 1M elements fairly efficiently. It's generally best to drop the edges before you drop the vertices to avoid possible concurrent modification exceptions when two threads try to drop two adjacent vertices.
You may be able to adapt the algorithm used here to your purposes: https://github.com/awslabs/amazon-neptune-tools/tree/master/drop-graph
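For what it's worth, here is a rough sketch of that "collect IDs, then drop in parallel batches" approach using the gremlinpython client. The endpoint, property name, cutoff date, batch size and thread count are all assumptions you would need to adapt, and retries/error handling are omitted.

from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.traversal import P

ENDPOINT = 'wss://your-neptune-endpoint:8182/gremlin'  # placeholder
CUTOFF = datetime(2021, 2, 23)                         # passed as a native Python datetime
BATCH_SIZE = 500
THREADS = 8

def drop_batch(ids):
    # One connection per worker keeps the threads independent of each other.
    # Drop edge IDs in a separate pass first if two threads might otherwise
    # touch adjacent vertices, as noted above.
    conn = DriverRemoteConnection(ENDPOINT, 'g')
    g = traversal().withRemote(conn)
    try:
        g.V(*ids).drop().iterate()
    finally:
        conn.close()

# Collect the IDs of all stale vertices up front. For very large result sets you
# may want to page through this query rather than pull everything in one go.
conn = DriverRemoteConnection(ENDPOINT, 'g')
g = traversal().withRemote(conn)
stale_ids = [v.id for v in g.V().has('timestamp', P.lt(CUTOFF)).toList()]
conn.close()

batches = [stale_ids[i:i + BATCH_SIZE] for i in range(0, len(stale_ids), BATCH_SIZE)]
with ThreadPoolExecutor(max_workers=THREADS) as pool:
    list(pool.map(drop_batch, batches))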

Related

Finding outliers in Gremlin: nodes with more than N edges?

I'm trying to figure out how to find outliers in our graph, in particular nodes with more than N edges, where N could be some high number. Our graph has over 2 billion nodes. Is there an efficient way to do this?
At that scale you are probably going to want to multi-thread the queries and send requests to the server in batches. A good approximation for the number of client threads is 2 times the number of vCPUs on the server. If you are able to send lists of IDs, that will be most efficient; otherwise you will need to do a lot of range steps. Each thread would then run something like the query below for multiple sets of IDs or ID ranges:
g.V(<list of IDs>).filter(out().count().is(gt(x)))
You would then collect all the outliers in the application. I think you should approach this as a bit of a batch task that may take a while to complete.
The alternative would be to use Neptune Export to export the graph and load it into Spark and run a degree query using something like GraphFrames.
With a reasonably large instance I think the technique of using multiple threads will work, especially if you are able to easily generate the lists of vertex IDs you are looking for in each query. Spreading the queries across multiple read replicas will also speed things up.
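Here is a rough sketch of that threaded approach with the gremlinpython client. It assumes you can already generate the lists of vertex IDs to check; the endpoint, threshold and thread count are placeholders, and where() is used in place of the filter() form above (they behave the same for this kind of predicate).

from concurrent.futures import ThreadPoolExecutor

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import P

ENDPOINT = 'wss://your-read-replica:8182/gremlin'  # placeholder; spread across replicas if available
N = 10000                                          # degree threshold for "outlier"
THREADS = 16                                       # roughly 2 x vCPUs on the server

def find_outliers(ids):
    # Keep only the vertices in this batch whose out-degree exceeds N.
    conn = DriverRemoteConnection(ENDPOINT, 'g')
    g = traversal().withRemote(conn)
    try:
        return g.V(*ids).where(__.out().count().is_(P.gt(N))).toList()
    finally:
        conn.close()

def scan(id_batches):
    # id_batches: an iterable of ID lists that you generate elsewhere.
    outliers = []
    with ThreadPoolExecutor(max_workers=THREADS) as pool:
        for found in pool.map(find_outliers, id_batches):
            outliers.extend(found)
    return outliers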

Does an Increased Number of Node Types Impact Performance of Graph DBs?

I am in the process of creating a graph database, a simple one for movies with several types of information like actors, producers, directors, and so on.
What I would like to know is: is it better to break down your nodes to a more granular level? For example, is it better to have two kinds of nodes for 'actors' and 'directors', or is it better to have one node type, say 'person', and use different kinds of relationships like 'acted_in' and 'directed'? Does this even matter at all?
Further, is there any impact on the traversal queries? Does having more types of nodes mean that the traversal is slower?
Note: I intend to implement this using the Gremlin console in Amazon Neptune.
The answer really is: it depends. If I were building such a model I would break out the key "nouns" into their own nodes. I would also label the edges appropriately, such as ACTED_IN or DIRECTED.
The performance of any graph query depends on how much data it needs to touch (the fan-out factor as you go from depth to depth).
The best advice I can give you is to think about the questions you will need the graph to answer and try to design your data model so that writing those queries is as easy as possible. Don't be afraid to iterate on your data model multiple times, either; that is common and expected.
Properties can be useful when you want to add a unique piece of information to a node - perhaps the birthday of the director.
Edge properties can be useful for filtering out unneeded edges, but edge labels can be as well. In some cases you may find that a label such as DIRECTED-IN-2005 is a useful shortcut to avoid checking both a label and a property on an edge.
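To make the comparison concrete, here is a small hypothetical sketch (gremlinpython syntax; the endpoint, labels and property names are made up) of what the two modelling choices look like at query time.

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

conn = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g')  # placeholder
g = traversal().withRemote(conn)

# Option A: one 'person' vertex label, with the role carried on the edge label.
titles_a = g.V().has('person', 'name', 'Clint Eastwood').out('directed').values('title').toList()

# Option B: separate 'actor' / 'director' vertex labels, with a generic edge label.
titles_b = g.V().has('director', 'name', 'Clint Eastwood').out('worked_on').values('title').toList()

# Either way, the work done is driven by how many elements the traversal touches
# (the fan-out), not by how many labels exist in the schema.

# Edge property vs. specialised edge label: filter on a 'year' edge property, or
# bake the year into the label as a shortcut (e.g. 'DIRECTED_IN_2005').
by_prop = g.V().has('person', 'name', 'Clint Eastwood').outE('directed').has('year', 2005).inV().values('title').toList()
by_label = g.V().has('person', 'name', 'Clint Eastwood').out('DIRECTED_IN_2005').values('title').toList()

conn.close()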

How to export/store all paths in graph

I have a read-heavy graph with a million nodes and 100 million edges. My use case is to fetch all paths (of depth 2, 3, and 4) between any two nodes. I tried doing this in Neo4j, OrientDB, and Postgres. Though it works in all three databases, I'm facing the following issues.
Not very fast.
Added slowness for supernodes.
Unable to do pagination/sorting effectively.
One way to address all of these issues is to pre-calculate all the paths. What is the best way to do this pre-calculation effectively, and where should these paths be stored?
(Also, how to handle changes in the graph?)

Gremlin count() query in Datastax is too slow

I have 3 nodes in DataStax Enterprise and loaded 65 million vertices and edges onto them. When I use DSE Studio or the Gremlin console and run a Gremlin query against my graph, the query is too slow. I defined every kind of index and tested again, but it had no effect.
When I run a query such as g.V().count(), CPU usage and load average barely change, whereas when I run a CQL query it is distributed across all nodes and CPU usage and load average change significantly on every node.
What are the best practices or configurations for running efficient Gremlin queries in this case?
count()-based traversals should be executed via OLAP with Spark for graphs of the size you are working with. If you are using standard OLTP-based traversals, you can expect long wait times for this type of query.
Note that this rule holds true for any graph computation that must do a "table scan" (i.e. touch all or a very large portion of vertices/edges in the graph). This issue is not specific to DSE Graph either and will apply to virtually any graph database.
After many tests with different queries, I have come to the conclusion that Gremlin seems to have a problem with count queries over millions of vertices, whereas if you define an index on a vertex property and look up a specific vertex, for example g.V().hasLabel('member').has('C_ID','4242833'), the query takes less than 1 second, which is acceptable. The question is: why does Gremlin have a problem with count queries over millions of vertices?
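To make the contrast concrete, here is a rough gremlinpython sketch of the two query shapes being discussed; the endpoint is a placeholder, and the point is independent of the client used.

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

conn = DriverRemoteConnection('ws://your-graph-endpoint:8182/gremlin', 'g')  # placeholder
g = traversal().withRemote(conn)

# A count is a "table scan": every vertex (or every 'member' vertex) has to be
# touched, so indexes do not help and OLTP execution is slow at 65M elements.
total = g.V().count().next()
members = g.V().hasLabel('member').count().next()

# A point lookup on an indexed property only touches the matching vertex,
# which is why it comes back in under a second.
member = g.V().hasLabel('member').has('C_ID', '4242833').next()

conn.close()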

Passing the results of multiple sequential HBase queries to a Mapreduce job

I have an HBase database that stores adjacency lists for a directed graph, with the edges in each direction stored in a pair of column families, where each row denotes a vertex. I am writing a MapReduce job which takes as its input all nodes that receive an edge from the same vertices that have an edge pointing at some other vertex (nominated as the subject of the query). This is a little difficult to explain, but as an example, the set of nodes taken as the input when querying on vertex 'A' would be {A, B, C}, by virtue of their all having edges from vertex '1'.
To perform this query in HBase, I first look up the vertices with edges to 'A' in the reverse-edges column family, yielding {1}, and then, for every element in that set, look up the vertices with edges from that element in the forward-edges column family.
This should yield a set of key-value pairs: {1: {A,B,C}}.
Now, I would like to take the output of this set of queries and pass it to a Hadoop MapReduce job; however, I can't find a way of 'chaining' HBase queries together to provide the input to a TableMapper in the HBase MapReduce API. So far, my only idea has been to add another initial mapper which takes the results of the first query (on the reverse-edges family), performs the query on the forward-edges family for each result, and yields the results to be passed to a second map job. However, performing IO from within a map job makes me uneasy, as it seems rather counter to the MapReduce paradigm (and could lead to a bottleneck if several mappers all try to access HBase at once). Therefore, can anyone suggest an alternative strategy for performing this sort of query, or offer any advice about best practices for working with HBase and MapReduce in such a way? I'd also be interested to know if there are any improvements to my database schema that could mitigate this problem.
Thanks,
Tim
Your problem does not fit the Map/Reduce paradigm very well. I've seen the shortest-path problem solved by chaining many M/R jobs together. That is not very efficient, but it is needed to get the global view at the reducer level.
In your case, it seems that you could perform all the requests within your mapper by following the edges and keeping a list of seen nodes.
"However, performing IO from within a map job makes me uneasy"
You should not worry about that. Your data model is essentially random, and trying to exploit data locality will be extremely hard, so you don't have much choice but to query all this data across the network. HBase is designed to handle large parallel queries. Having multiple mappers query disjoint data will give you a good distribution of requests and high throughput.
Make sure to keep the block size small in your HBase tables to optimize reads, and keep as few HFiles as possible per region. I'm assuming your data is fairly static here, so running a major compaction will merge the HFiles together and reduce the number of files to read.
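As a concrete illustration of the chained lookup itself (not of the MapReduce wiring), here is a rough Python sketch using the happybase HBase client; the table name, the 'forward'/'reverse' column-family names and the Thrift gateway host are all assumptions. A mapper could call a function like this for each subject vertex, following the edges and keeping a set of seen nodes as suggested above.

import happybase

def co_referenced_vertices(connection, subject_key):
    # Hypothetical table with one row per vertex and two column families:
    # 'reverse' (incoming edges) and 'forward' (outgoing edges).
    table = connection.table('graph')

    # Step 1: reverse column family -> vertices that have an edge *to* the subject.
    reverse_cells = table.row(subject_key, columns=[b'reverse'])
    sources = [column.split(b':', 1)[1] for column in reverse_cells]  # b'reverse:1' -> b'1'

    # Step 2: forward column family -> vertices each of those sources points *at*.
    result = {}
    for src in sources:
        forward_cells = table.row(src, columns=[b'forward'])
        result[src] = {column.split(b':', 1)[1] for column in forward_cells}
    return result

# Example (assumed Thrift gateway and row keys):
# connection = happybase.Connection('hbase-thrift-host')
# print(co_referenced_vertices(connection, b'A'))   # -> {b'1': {b'A', b'B', b'C'}}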
