I've set up a simple test case to finally learn some graph databases.
I have a simple tree structure based on a collection of roughly 80000 vertices/documents, with around 25 attributes each. The only edges are outbound "is_parent" edges, so to find the children of each node, I can simply pick up all inbound edges. I have not set up any specific indices on any fields.
The tree is 20 levels deep and I'm grabbing a random node on the fifth level, to then pick up all descendants of that node using a graph traversal:
FOR t IN GRAPH_TRAVERSAL("sample_tree", "sampleunit/2565130142666", "inbound", {"maxDepth":20}) RETURN t
This takes a little more than 3 seconds on my dev machine and I feel I might be doing something wrong. Is there any way to speed things up or do I have any conceptual issues?
I set up an example tree-like graph as you described and ran the query on it.
Interestingly, the following query was executing much faster than your query:
FOR t IN TRAVERSAL(sampleunit, unitlinks, "sampleunit/2565130142666", "inbound", {"maxDepth":20}) RETURN t
The query above uses the "older" traversal function in AQL. We checked why there is a performance difference between the two traversal types, and finally found something that can be improved.
The fix for this has been pushed into the 2.2 and devel branches. It is included in commit 9a1eb149aa4da514d709c43a4ebdfd8819ba2f1d if you prefer cherry-picking.
I see a similar issue between NEIGHBORS and GRAPH_NEIGHBORS on version 2.6.3: the first is 30 times faster than the second.
Using - Neptune Engine: 1.0.5.1, Apache Tinkerpop: 3.5.2
My question is in regard to the performance of Vertex removal - it is not about the loading of the Vertices.
We have a cron job that clears out a limited number (1000) of "expired" Vertices.
We get hold of and store the vertices to be removed in a Set.
We then remove these via a g.V([vertices]).sideEffect(drop()).next().
This works fine.
All of the Vertices to be removed will have 1 inE and 1 outE.
These Edges obviously get automatically removed when the linked Vertex is removed.
I am wondering though if Neptune (under the hood) would be more performant if we got hold of and removed the Edges first, and then removed the Vertices.
Just wondered if anyone out there (mainly using Neptune, but it is probably a "thing" with other graph databases too) has looked into this and has any hard evidence either way.
Many thanks
As far as using Amazon Neptune goes: if you are just doing a single-threaded drop of 1,000 vertices where each only has one incident edge, then what you are doing is fine. If you were dropping thousands (or more) of vertices in a multi-threaded fashion, different threads may try to get locks on the same object in the database and collide. In such cases, dropping the edges first can avoid those conflicts, and therefore the retries, which can improve performance.
I am using NeptuneDB with 2M edges and vertices. The graph can have cycles of length 3-10 and is highly connected.
While fetching the downstream for a particular NodeId, I am running the query
g.V(currentNode).repeat(out().simplePath()).until(outE().count().is(0).or().loops().is(12)).path().toList();
The issue here is that by using simplePath() the cyclic nodes are getting filtered out.
For example, in the case of 1->2->3->1, I am only getting 1->2->3 in the pathList, but I want the pathList to contain the first node again in case of cycles, i.e. 1->2->3->1.
I have been looking for a way to write a query that returns both cyclic and non-cyclic paths for the downstream, but no luck.
I am also running into memory and timeout issues. I know the simplePath() and path() steps are costly operations, but I can't seem to find my way around this.
If you want to find cyclic paths as well as non-cyclic ones, rather than do
g.V(currentNode).
repeat(out().simplePath()).
until(outE().count().is(0).or().loops().is(12)).
path().
toList();
You might try something like
g.V(currentNode).
repeat(out()).
until(or(__.not(out()),loops().is(12),cyclicPath())).
path().
toList();
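This will include cyclic paths in the result. You will be able to spot them because the last vertex in the path repeats a vertex that appears earlier in the same path (for a cycle back to the start, the first and last vertex will be the same).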
In a highly connected graph you may need to add a limit step to stop trying to find all possible results as there could be many.
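To make the stopping rule concrete, here is a minimal Python sketch of the same idea on a plain in-memory adjacency map. It is not Gremlin and not how Neptune evaluates the query. The names `downstream_paths`, `adj` and `max_loops` are made up for illustration. A path is emitted as soon as it reaches a vertex with no outgoing edges, hits the loop limit, or revisits a vertex already on the path.

```python
from itertools import islice
from typing import Dict, Iterator, List


def downstream_paths(
    adj: Dict[int, List[int]], start: int, max_loops: int = 12
) -> Iterator[List[int]]:
    """Yield paths from `start`, stopping each one at a leaf, at
    `max_loops` hops, or as soon as the path becomes cyclic."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        for nxt in adj.get(path[-1], []):
            new_path = path + [nxt]
            loops = len(new_path) - 1      # number of hops taken so far
            is_cyclic = nxt in path        # vertex already on this path
            is_leaf = not adj.get(nxt)     # no outgoing edges
            if is_cyclic or is_leaf or loops >= max_loops:
                yield new_path             # the cyclic path is kept, not filtered
            else:
                stack.append(new_path)


# Hypothetical example graph: 1 -> 2 -> 3 -> 1 (cycle) and 3 -> 4 (leaf).
example = {1: [2], 2: [3], 3: [1, 4], 4: []}

# islice plays the role of a limit step: stop after the first N paths.
for p in islice(downstream_paths(example, 1), 10):
    print(p)
```

Because the function is a generator, taking only the first N paths with `islice` never materializes the rest, which is the point of adding a limit in a highly connected graph.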
I have a graph with the following pattern:
- Workflow:
-- Step #1
--- Step execution #1
--- Step execution #2
[...]
--- Step execution #n
-- Step #2
--- Step execution #1
--- Step execution #2
[...]
--- Step execution #n
[...]
-- Step #m
--- Step execution #1
--- Step execution #2
[...]
--- Step execution #n
I have a couple of design questions here:
How many execution documents can hang off a single vertex without affecting performance? For example, each "step" could have hundreds of 'executions' off it. I'm using two edges to connect them: 'has_runs' (from step → execution) and 'execution_step' (from execution → step).
Are graph databases (Cosmos DB or any graph database) designed to handle thousands of vertices and edges associated with a single vertex?
Each 'execution' has (theoretically) unlimited properties associated with it, but it is probably 10 < x < 100 properties. Is that OK? Are graph databases made to support such a large number of properties off a vertex?
All the demos I've seen seem to have < 10 total properties.
Is it appropriate to have so many execution documents hanging off a single vertex? E.g. each "step" could have 100s of 'executions' off it.
Having 100s of edges from a single vertex is not atypical and sounds reasonable. In practice, you can easily find yourself with models that have millions of edges and dig yourself into the problem of supernodes at which point you would need to make some design choices to deal with such things based on your expected query patterns.
Each 'execution' has (theoretically) unlimited properties associated with it, but is probably 10 < x < 100 properties. Is that ok? Are graph databases made to support many, many properties off a vertex?
In designing a schema, I think graph modelers tend to think of graph elements (i.e. vertices/edges) as having the ability to hold unlimited properties, but in practice they have to consider the capabilities of the graph system and not assume them all to be the same. Some graphs, like TinkerGraph, will be limited only by available memory. Other graphs, like JanusGraph, will be limited by the underlying data store (e.g. Cassandra, HBase, etc).
I'm not aware of any graph system that would have trouble storing 100 properties. Of course, there are caveats to all such generalities. A few examples:
100 separate simple primitive properties of integers and Booleans is different than 100 byte arrays each holding 100 megabytes of data.
Storing 100 properties is fine on most systems, but do you intend to index all 100? On some systems that might be an issue. Since you tagged your question with "CosmosDB", I will offer that I don't think they are too worried about that since they auto-index everything.
If any of those 100 properties are multi-properties you could put yourself in a position to create a different sort of supernode - a fat vertex (a vertex with millions of properties).
All that said, generally speaking, your schema sounds reasonable for any graph system out there.
So I have an undirected, unweighted graph. It contains cycles. I would like to find the path which visits the most nodes with no repeat visits to any node. Since this is a graph traversal, you can start and end at any node you like.
Background Research:
I have looked at Travelling Salesman Problem (TSP); this problem is different and does NOT allow you to finish where you started from and there are no weights. I have looked at several other algorithms, but have found none suitable for this problem.
Graph Size: There are 100 nodes in the graph; with 10 disconnected nodes.
UPDATE: I have moved this to: https://math.stackexchange.com/questions/243375/what-is-the-maximum-number-of-nodes-i-can-traverse-in-an-undirected-graph-visiti
Look for the Hamiltonian Cycle problem
http://en.wikipedia.org/wiki/Hamiltonian_cycle
You should take a look at the wikipedia entry which has an algorithm for acyclic graphs. Your graph has cycles which makes your problem NP-hard.
I would try and create a DAG with nodes representing strongly connected components. Then you could at least find the path that visits the most strongly connected components. You could then expand that path by replacing the individual (strongly connected components) nodes with the longest paths in each of the subgraphs.
Finding the longest paths in the subgraphs is now the same as your original problem, but at least your graphs are smaller. If you're in luck, the subproblems are easy and you're done. In the general case they might not be so small and you could use some advanced heuristics. Maybe have a look at this paper or this question (you could use the answer there to solve your problem completely, but I'm not sure).
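For what it's worth, here is a rough Python sketch of the "split into components, then search each one" idea. It rests on one assumption: since the graph in the question is undirected, the strongly connected components are just the ordinary connected components, so the path has to stay inside one of them. The search inside each component is plain exhaustive backtracking, which is exponential in the worst case, so treat it as an illustration for small components rather than a practical solver. The function names are made up.

```python
from typing import Dict, List, Set


def connected_components(adj: Dict[str, List[str]]) -> List[List[str]]:
    """Group the nodes of an undirected graph into connected components."""
    seen: Set[str] = set()
    components = []
    for source in adj:
        if source in seen:
            continue
        seen.add(source)
        stack, component = [source], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


def longest_simple_path(adj: Dict[str, List[str]]) -> List[str]:
    """Exhaustive backtracking search for a path visiting the most nodes."""
    best: List[str] = []

    def extend(path: List[str], visited: Set[str]) -> None:
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for neighbor in adj[path[-1]]:
            if neighbor not in visited:
                visited.add(neighbor)
                path.append(neighbor)
                extend(path, visited)
                path.pop()
                visited.remove(neighbor)

    for component in connected_components(adj):
        if len(component) <= len(best):
            continue  # this component cannot beat the best path found so far
        for start in component:
            extend([start], {start})
    return best


# Hypothetical example: a triangle a-b-c, a tail c-d, and an isolated node e.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
    "e": [],
}
print(longest_simple_path(graph))
```

Skipping components that are no larger than the best path found so far is the only pruning here. Anything beyond a few dozen well-connected nodes per component would need the heuristics mentioned above.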
What's the best way to implement weighted graph using Redis?
We will mostly search for shortest paths over the graph (probably using the Dijkstra algorithm)
Currently we considered adding the edges to Redis
For each node, we will have the nodeId as the key and a sortedset of keys of referenced nodes
the score of each nodeId in the sortedSet is the weight of the edge.
What do you think? Correct me if I am wrong, but the only bummer here is that for each query for the next node in a sorted set we pay O(log n) instead of O(1)...
http://redis.io/commands/zrange
Getting the next item in a sorted set is only O(log(n)) if you are getting them out one at a time, in which case the latency of the connection to redis will be more of an issue than the complexity of the operation.
For most operations on a graph you need to look at all the edges from a node, so it makes sense to load the whole set (or at least those with a suitable score) into local memory when you process the node. This will of course mean loading some edges that will not be followed because you have already found a suitable path, but since the sets are fairly small the cost of this will be far less than making a new call to redis for every edge that you do need.
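As a rough illustration of "load the whole edge set when you process the node", here is a Dijkstra sketch in Python where each node's edges are fetched in one call. To keep it runnable without a server, `FAKE_REDIS` and `fetch_edges` are stand-ins I made up. With the redis-py client the fetch would presumably be a single `zrange(key, 0, -1, withscores=True)` per node, but that wiring is an assumption and is not shown here.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# Stand-in for the sorted sets: node id -> [(neighbor id, edge weight), ...].
FAKE_REDIS: Dict[str, List[Tuple[str, float]]] = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 2.0), ("D", 5.0)],
    "C": [("D", 1.0)],
    "D": [],
}


def fetch_edges(node: str) -> List[Tuple[str, float]]:
    """One round trip per node: return all outgoing edges at once."""
    return FAKE_REDIS.get(node, [])


def shortest_path(
    source: str,
    target: str,
    fetch: Callable[[str], List[Tuple[str, float]]] = fetch_edges,
) -> Tuple[float, List[str]]:
    """Dijkstra that calls `fetch` once per settled node."""
    dist = {source: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, source)]
    settled = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in settled:
            continue  # stale heap entry
        settled.add(node)
        if node == target:
            break
        for neighbor, weight in fetch(node):  # whole edge set, one call
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path


print(shortest_path("A", "D"))
```

The number of round trips is bounded by the number of settled nodes rather than the number of edges examined, which is the trade-off described above.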
Sorry for being late :), but I recently came across the same problem, and I modeled it using hashes. I agree with Tom Clarkson that (almost) everything should be loaded into local memory and I augment by saying that an efficient way in terms of space is to use hashes, and store your graph information like this:
Graph = { node1 : { nodeX : edge_weight, nodeY : edge_weight, other_info: bla..},
node2 : { nodeZ : edge_weight, nodeE : edge_weight, other_info: bla..},
bla bla...
}
If you need more space and efficiency, compress every value (which can be a JSON string) and decompress/import/deserialize it in your client code.
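A small sketch of that compress/decompress step in Python, using only the standard library. `pack` and `unpack` are names I made up. Storing the resulting bytes as a hash field value is left out so the example runs without a Redis server.

```python
import json
import zlib
from typing import Dict


def pack(neighbors: Dict[str, float]) -> bytes:
    """Serialize one node's adjacency map to compact JSON, then compress it."""
    raw = json.dumps(neighbors, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw)


def unpack(blob: bytes) -> Dict[str, float]:
    """Reverse of pack(): decompress, then parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical node with many outgoing edges.
node1 = {f"node{i}": float(i % 7) for i in range(500)}
blob = pack(node1)
print(len(json.dumps(node1)), "bytes as JSON,", len(blob), "bytes compressed")
restored = unpack(blob)
```

Compression only pays off for reasonably large values. For a node with a handful of edges the compressed form can be bigger than the JSON itself, so it may be worth measuring on real data first.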