Redis: Implement Weighted Directed Graph

What's the best way to implement a weighted graph using Redis?
We will mostly search for shortest paths over the graph (probably using Dijkstra's algorithm).
Currently we are considering adding the edges to Redis.
For each node, we will have the nodeId as the key and a sorted set of the nodeIds of referenced nodes;
the score of each nodeId in the sorted set is the weight of the edge.
What do you think? Correct me if I am wrong, but the only bummer here is that for each query for the next node in a sorted set we pay O(log n) instead of O(1)...
http://redis.io/commands/zrange
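For concreteness, a minimal sketch of that layout using the Jedis client (key names, node ids and weights are invented for the example):
import redis.clients.jedis.Jedis;

Jedis jedis = new Jedis("localhost", 6379);
// one sorted set per node; each member is a neighbour's nodeId, the score is the edge weight
jedis.zadd("node:A", 5.0, "B");   // edge A -> B with weight 5
jedis.zadd("node:A", 2.0, "C");   // edge A -> C with weight 2
jedis.zadd("node:B", 1.0, "C");   // edge B -> C with weight 1
jedis.close();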

Getting the next item in a sorted set is only O(log(n)) if you are getting them out one at a time, in which case the latency of the connection to Redis will be more of an issue than the complexity of the operation.
For most operations on a graph you need to look at all the edges from a node, so it makes sense to load the whole set (or at least those with a suitable score) into local memory when you process the node. This will of course mean loading some edges that will not be followed because you have already found a suitable path, but since the sets are fairly small the cost of this will be far less than making a new call to Redis for every edge that you do need.
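As a rough sketch of that load-it-all approach with the Jedis client (assuming an older Jedis version where zrangeWithScores returns a Set<Tuple>; ZRANGEBYSCORE could be used instead to fetch only edges within a score range):
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Tuple;
import java.util.Set;

Jedis jedis = new Jedis("localhost", 6379);
// pull the whole adjacency list for node "A" in a single round trip, weights included
Set<Tuple> edges = jedis.zrangeWithScores("node:A", 0, -1);
for (Tuple edge : edges) {
    String neighbourId = edge.getElement();
    double weight = edge.getScore();
    // feed neighbourId/weight into the local Dijkstra bookkeeping here
}
jedis.close();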

Sorry for being late :), but I recently came across the same problem, and I modeled it using hashes. I agree with Tom Clarkson that (almost) everything should be loaded into local memory, and I would add that a space-efficient way is to use hashes and store your graph information like this:
Graph = { node1 : { nodeX : edge_weight, nodeY : edge_weight, other_info: bla.. },
node2 : { nodeZ : edge_weight, nodeE : edge_weight, other_info: bla.. },
... }
If you need to save more space, compress every value (which can be a JSON string...) and decompress/import/deserialize in your client code.
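A minimal sketch of that hash layout with the Jedis client (key names, fields and weights are invented for the example):
import redis.clients.jedis.Jedis;
import java.util.Map;

Jedis jedis = new Jedis("localhost", 6379);
// one hash per node; each field is a neighbour's id, the value is the serialized edge weight
jedis.hset("graph:node1", "nodeX", "4.2");
jedis.hset("graph:node1", "nodeY", "1.7");
// load the whole adjacency map for node1 in one call
Map<String, String> adjacency = jedis.hgetAll("graph:node1");
adjacency.forEach((neighbour, weight) ->
        System.out.println(neighbour + " -> " + Double.parseDouble(weight)));
jedis.close();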

Related

AWS Neptune: More performant to drop Edges before Vertices?

Using Neptune Engine 1.0.5.1 and Apache TinkerPop 3.5.2.
My question is in regard to the performance of Vertex removal - it is not about the loading of the Vertices.
We have a cron job that clears out a limited number (1000) of "expired" Vertices.
We get hold of and store the vertices to be removed in a Set.
We then remove these via g.V([vertices]).sideEffect(drop()).next().
This works fine.
All of the Vertices to be removed will have 1 inE and 1 outE.
These Edges obviously get automatically removed when the linked Vertex is removed.
I am wondering though if Neptune (under the hood) would be more performant if we got hold of and removed the Edges first, and then removed the Vertices.
Just wondered if anyone out there (mainly using Neptune, but it is probably a "thing" with other graph databases too) has looked into this and has any hard evidence either way.
Many thanks
As far as using Amazon Neptune goes - if you are just doing a single-threaded drop of 1,000 vertices where each only has one incident edge, then what you are doing is fine. If you were dropping thousands (or more) of vertices in a multi-threaded fashion, then dropping the edges first can avoid collisions, as different threads may try to get locks on the same object in the database. In such cases, to avoid conflicts, and therefore avoid retries, dropping the edges first can improve performance.
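For the multi-threaded scenario, a rough Gremlin sketch of the edges-first ordering, reusing the vertices Set from the question (not Neptune-specific):
// drop the incident edges first to reduce lock contention between threads
g.V(vertices).bothE().drop().iterate();
// then drop the now edge-free vertices
g.V(vertices).drop().iterate();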

Improve performance removing TinkerGraph vertices

I have a graph g with 600k vertices and 950k edges. After some processing, I need to clean up about 350k+ vertices with this query:
g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).drop().iterate();
Even though the vertices being dropped have no incoming "depend" edges, they are still connected to the rest of the graph by other edges.
Using Java, TinkerPop/TinkerGraph 3.4.6.
Currently it is taking about 45 minutes to drop all those vertices.
I did some Java profiling and the results show 73% of the time spent in the TinkerVertex.remove method, and the rest in ExpandableStepIterator.next.
Is there something like a "bulk drop"? Would JanusGraph or another graph provider be much faster?
It's unlikely that there are graphs that are much faster than TinkerGraph, as TinkerGraph is a pure in-memory implementation. You might find one that is more efficient in how it uses that memory, like OverflowDB, which originally forked from TinkerGraph, but I don't know that it will make this particular query go faster.
Neither TinkerGraph nor any graph I know of has a filtered "bulk drop" operation.
The "not" style global query here is simply expensive, as you're having to touch a large portion of the graph. Of course, I'm a bit surprised that TinkerGraph is taking that long for a graph with less than one million edges. You didn't mention whether you were experiencing a lot of GC while doing your profiling. Perhaps that is an issue? If so, I would try to adjust your JVM memory configuration - maybe you just need a larger -Xmx value or something simple like that.
From a query perspective, you could try to invert the not() portion of the traversal to positively find the things you want to remove. It might lead to a less concise query to read, but it could perhaps speed things up; on the other hand, you are still trying to delete 50% of your data, so the cost may not be just in finding the vertices to get rid of.
One other thought would be to try to parallelize the drop(). You might hit concurrency errors, so you could need a retry strategy, but you could consider taking the Iterator of g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)) and then delegating calls to each (or batches of) Vertex.remove() to a separate worker thread.
Based on the accepted answer, a simple parallelization improved things enough that this operation is no longer the most critical one time-wise.
For future reference, this:
g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).drop().iterate();
is now something like this:
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

ExecutorService executor = Executors.newFixedThreadPool(4);
int iterator = 0;
final int batchsize = 10000;
Long count = g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).count().next();
List<Callable<Object>> callableList = new ArrayList<Callable<Object>>();
// split the matching vertices into batches of ids, each batch dropped by its own task
while (iterator * batchsize < count) {
    final Set<Object> vSet = g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).skip(iterator * batchsize).limit(batchsize).id().toSet();
    callableList.add(() -> g.V(vSet).drop().iterate());
    iterator++;
}
// invokeAll blocks until every batch has finished (it declares InterruptedException)
List<Future<Object>> results = executor.invokeAll(callableList);
executor.shutdown();
After some tests, I decided to keep the iteration that collects the ids in a single thread. That way the submitted tasks are really independent of each other (e.g. one task completing won't affect another task's query).
Keep in mind that the actual removal is still effectively single-threaded, as the modification of the internal vertex map is behind concurrent-access locks.
The effect is that adding threads won't get better results (I personally tried 8). Based on some thread dumps, even 4 might be too many (there is always one or more threads in a waiting state) - although I did get a dump with 3 threads running!

Limiting depth of shortest path query using Gremlin on JanusGraph

I have a fairly large graph (currently 3,806,702 vertices and 7,774,654 edges, all edges with the same label) in JanusGraph. I am interested in shortest path searches in it. The Gremlin recipes mention this query:
g.V(startId).until(hasId(targetId)).repeat(out().simplePath()).path().limit(1)
This immediately returns a path that I know to be correct, but then hangs the console (top shows janusgraph and scylla processing stuff furiously though, so I guess it's working in the background, but it takes forever). It does the right thing and returns the first (correct) shortest path if used like this:
g.V(startId).until(hasId(targetId)).repeat(out().simplePath()).path().next()
I would like to limit this query so that Gremlin/JanusGraph stops searching for paths beyond, let's say, 100 hops (so I want a max depth of 100 edges, basically). I have tried to use .times(100) in multiple positions, but if .until() is used with .times() in the same query it always crashes with a NullPointerException in the Gremlin traversal classes, i.e.:
java.lang.NullPointerException
at org.apache.tinkerpop.gremlin.process.traversal.util.TraversalHelper.hasStepOfAssignableClassRecursively(TraversalHelper.java:351)
at org.apache.tinkerpop.gremlin.process.traversal.strategy.optimization.RepeatUnrollStrategy.apply(RepeatUnrollStrategy.java:61)
at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversalStrategies.applyStrategies(DefaultTraversalStrategies.java:86)
at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.applyStrategies(DefaultTraversal.java:119)
at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.next(DefaultTraversal.java:198)
at java_util_Iterator$next.call(Unknown Source)
...
Does anyone have any idea how I can apply such a limit? I need this to return the first result or fail, fast.
Thanks!
Add another break condition in your until() and also make sure to limit() the result before you ask for paths:
g.V(startId).
until(__.hasId(targetId).or().loops().is(100)).
repeat(__.both().simplePath()).
hasId(targetId).limit(1).path()
Calling tryNext() on this traversal will give you an Optional<Path>. If it's empty, then no path was found within the given distance.
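For example, a rough Java sketch of that check (startId, targetId and g are the placeholders from the question):
import java.util.Optional;
import org.apache.tinkerpop.gremlin.process.traversal.Path;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.*;

Optional<Path> shortest = g.V(startId).
        until(hasId(targetId).or().loops().is(100)).
        repeat(both().simplePath()).
        hasId(targetId).limit(1).path().
        tryNext();
if (shortest.isPresent()) {
    System.out.println("Found: " + shortest.get());
} else {
    System.out.println("No path within 100 hops");
}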

Setting a new label on all nodes takes too long in a huge graph

I'm working on a graph containing about 50 million nodes and 40 million relationships.
I need to update every node.
I'm trying to set a new label on these nodes, but it's taking too long.
The label applies to all 50 million nodes, so the operation never finishes.
After some research, I found out that Neo4j treats this operation as a single transaction (I don't know if optimistic or not), keeping the changes uncommitted until the end (which will never happen in this fashion).
I'm currently using Neo4j 2.1.4, which has a feature called "USING PERIODIC COMMIT" (already present in earlier versions). Unfortunately, this feature is coupled to the "LOAD CSV" feature and is not available to every Cypher command.
The Cypher is quite simple:
match n set n:Person;
I decided to use a workaround, and make some sort of block update, as follows:
match n
where not n:Person
with n
limit 500000
set n:Person;
It's ugly, but I haven't come up with a better solution yet.
Here are some of my configs:
== neo4j.properties =========
neostore.nodestore.db.mapped_memory=250M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=900M
neostore.propertystore.db.strings.mapped_memory=1300M
neostore.propertystore.db.arrays.mapped_memory=1300M
keep_logical_logs=false
node_auto_indexing=true
node_keys_indexable=name_autocomplete,document
relationship_auto_indexing=true
relationship_keys_indexable=role
execution_guard_enabled=true
cache_type=weak
=============================
== neo4j-server.properties ==
org.neo4j.server.webserver.limit.executiontime=20000
org.neo4j.server.webserver.maxthreads=200
=============================
The hardware spec is:
RAM: 24GB
PROC: Intel(R) Xeon(R) X5650 @ 2.67GHz, 32 cores
HDD1: 1.2TB
In this environment, each block update of 500000 nodes took from 200 to 400 seconds. I think this is because every node satisfies the query at the start, but as the updates take place, more nodes need to be scanned to find the unlabeled ones (but again, it's a hunch).
So what's the best course of action whenever an operation needs to touch every node in the graph?
Any help towards a better solution to this will be appreciated!
Thanks in advance.
The most performant way to achieve this is to use the batch inserter API. You might use the following recipe:
take a look at http://localhost:7474/webadmin and note the "node count". In fact it's not the number of nodes, it's more the highest node id in use - we'll need that later on.
make sure to cleanly shut down your graph database.
take a backup copy of your graph.db directory.
write a short piece of Java/Groovy/(whatever JVM language you prefer...) program that performs the following tasks (see the sketch after this list):
open your graph.db folder using the batch inserter API
in a loop from 0 to <node count> (from the step above), check if the node with the given id exists; if so, grab its current labels, amend the list with the new label, and use setNodeLabels to write it back
make sure you call shutdown on the batch inserter
start up your Neo4j instance again
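A minimal sketch of such a program against Neo4j 2.1's batch inserter API; the store path and the highest node id are placeholders for your own values:
import java.util.ArrayList;
import java.util.List;
import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.Label;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class AddPersonLabel {
    public static void main(String[] args) {
        long highestNodeId = 50_000_000L;            // placeholder: the "node count" noted from webadmin
        Label person = DynamicLabel.label("Person"); // the new label to add
        BatchInserter inserter = BatchInserters.inserter("/path/to/graph.db"); // placeholder store path
        try {
            for (long id = 0; id <= highestNodeId; id++) {
                if (!inserter.nodeExists(id)) {
                    continue; // node ids can have gaps, skip unused ones
                }
                List<Label> labels = new ArrayList<>();
                for (Label existing : inserter.getNodeLabels(id)) {
                    labels.add(existing);
                }
                labels.add(person);
                inserter.setNodeLabels(id, labels.toArray(new Label[0]));
            }
        } finally {
            inserter.shutdown(); // flushes the store files so the database starts cleanly
        }
    }
}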

Performance issue with Graph Traversal in ArangoDB

I've set up a simple test case to finally learn some graph databases.
I have a simple tree structure based on a collection of roughly 80,000 vertices/documents, with around 25 attributes each. The only edges are outbound "is_parent" edges, so to find the children of each node, I can simply pick up all inbound edges. I have not set up any specific indexes on any fields.
The tree is 20 levels deep and I'm grabbing a random node on the fifth level, then picking up all descendants of that node using a graph traversal:
FOR t IN GRAPH_TRAVERSAL("sample_tree", "sampleunit/2565130142666", "inbound", {"maxDepth":20}) RETURN t
This takes a little more than 3 seconds on my dev machine, and I feel I might be doing something wrong. Is there any way to speed things up, or do I have a conceptual issue?
I set up an example tree-like graph as you described and ran the query on it.
Interestingly, the following query executed much faster than yours:
FOR t IN TRAVERSAL(sampleunit, unitlinks, "sampleunit/2565130142666", "inbound", {"maxDepth":20}) RETURN t
The query above uses the "older" traversal function in AQL. We checked why there is a performance difference between the two traversal types, and finally found something that can be improved.
The fix for this has been pushed into the 2.2 and devel branches. It is included in commit 9a1eb149aa4da514d709c43a4ebdfd8819ba2f1d if you prefer cherry-picking.
I see a similar issue between NEIGHBORS and GRAPH_NEIGHBORS on version 2.6.3; the first is 30 times faster than the second.
