Neo4j breadth first search is too slow

I need breadth first search in my database. There are 3,863 nodes, 2,830,471 properties and 1,355,783 relationships. n is my start point and m is my end point in the query, but it's far too slow: I never get a result back from the query shown in the following segment:
start n=node(42),m=node(31)
match p=n-[*1..]->m
return p,length(p)
order by length(p) asc
limit 1
How can I optimize that query? It must finish in at most 20 seconds. I have 8 GB of RAM in my own computer, but I bought a dedicated server with 24 GB of RAM for this. My heap size is 4096-5120 MB. The rest of my query-related configuration is shown in the following segment:
neostore.nodestore.db.mapped_memory=2024M
neostore.relationshipstore.db.mapped_memory=614M
neostore.propertystore.db.mapped_memory=128M
neostore.propertystore.db.strings.mapped_memory=2024M
neostore.propertystore.db.arrays.mapped_memory=614M
How can I solve this problem?

Your query collects all paths of any length, sorts them, and returns the shortest one. There is a much better way to do this:
start n=node(42),m=node(31)
match p=shortestPath(n-[*..1000]->m)
return p,length(p)
It's a best practice to always supply an upper limit on variable-depth patterns.
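If you are using Neo4j embedded from Java rather than the shell, the same bounded shortest-path search is available through the GraphAlgoFactory helper. This is only a minimal sketch, assuming the Neo4j 2.x embedded API; the store path is a placeholder, and the node ids and outgoing direction mirror the question:
import org.neo4j.graphalgo.GraphAlgoFactory;
import org.neo4j.graphalgo.PathFinder;
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Path;
import org.neo4j.graphdb.PathExpanders;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class ShortestPathSketch {
    public static void main(String[] args) {
        GraphDatabaseService db = new GraphDatabaseFactory()
                .newEmbeddedDatabase("/path/to/graph.db");   // placeholder path

        try (Transaction tx = db.beginTx()) {
            Node n = db.getNodeById(42);
            Node m = db.getNodeById(31);

            // bounded shortest-path search, capped at depth 1000 like the Cypher above
            PathFinder<Path> finder = GraphAlgoFactory.shortestPath(
                    PathExpanders.forDirection(Direction.OUTGOING), 1000);

            Path p = finder.findSinglePath(n, m);
            System.out.println(p == null ? "no path found" : "length: " + p.length());
            tx.success();
        }
        db.shutdown();
    }
}
Either way, the explicit depth bound keeps the search from exploring arbitrarily long paths.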

Related

How can I improve order().by() performance in Neptune?

I am trying to solve a performance issue with a traversal and have tracked it down to the order().by() step. It seems that order().by() greatly increases the number of statement index ops required (per the profiler) and dramatically slows down execution.
A non-ordered traversal is very fast:
g.V().hasLabel("post").limit(40)
execution time: 2 ms
index ops: 1
Adding a single ordering step adds thousands of index ops and runs much slower.
g.V().hasLabel("post").order().by("createdDate", desc).limit(40)
execution time: 62 ms
index ops: 3909
Adding a single filtering step adds thousands more index ops and runs even slower:
g.V().hasLabel("post").has("isActive", true).order().by("createdDate", desc).limit(40)
execution time: 113 ms
index ops: 7575
However the same filtered traversal without ordering runs just as fast as the original unfiltered traversal:
g.V().hasLabel("post").has("isActive", true).limit(40)
execution time: 1 ms
index ops: 49
By the time we build out the actual traversal we run in production, there are around 12 filtering steps and 4 by() step-modulators, causing the traversal to take over 6000 ms to complete with over 33000 index ops. Removing the order().by() steps makes the same traversal run fast (500 ms).
The issue seems to be with order().by() and the number of index ops required to sort. I have seen the performance issue noted here, but adding barrier() did not help. The traversal is also fully optimized, requiring no TinkerPop conversion.
I am running engine version 1.1.0.0 R1. There are about 5000 post vertices.
How can I improve the performance of this traversal?
So, generally, the only way you are going to increase the performance of ordering in any graph database (including Neptune) is to filter the items down to a minimal set prior to performing the ordering.
An order().by() step requires that all elements matching the criteria be retrieved before they can be ordered as specified. When using only a limit(N) step, the traversal terminates as soon as N items are returned. This is why you are seeing significantly faster times for the limit()-only option: it has to process and return less data, since it returns the first N records in whatever order they happen to be found.
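If the data model allows it, the usual mitigation is therefore to make the filters ahead of order() as selective as possible, for example by bounding createdDate itself so that far fewer posts ever reach the ordering step. A rough TinkerPop sketch of that shape (the 30-day cutoff and the in-memory TinkerGraph stand-in are assumptions; against Neptune, g would come from a remote connection instead):
import java.util.List;

import org.apache.tinkerpop.gremlin.process.traversal.Order;
import org.apache.tinkerpop.gremlin.process.traversal.P;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

public class OrderAfterFilterSketch {
    public static void main(String[] args) {
        // stand-in graph; with Neptune you would build g from a remote connection
        GraphTraversalSource g = TinkerGraph.open().traversal();

        // hypothetical cutoff: only consider posts from the last 30 days
        long cutoff = System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000;

        List<Vertex> page = g.V().hasLabel("post")
                .has("isActive", true)
                .has("createdDate", P.gt(cutoff))       // shrink the candidate set before order()
                .order().by("createdDate", Order.desc)  // order only the remaining candidates
                .limit(40)
                .toList();

        System.out.println("first page: " + page.size() + " posts");
    }
}
The less data survives the has() filters, the fewer elements order() must pull and sort before limit(40) can cut the result down.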

Improve performance removing TinkerGraph vertices

I have a graph g with 600k vertices and 950k edges. After some processing, I need to clean up about 350k+ vertices with this query:
g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).drop().iterate();
Even though I'm excluding vertices with no "depend" edges, they are still connected with other edges.
Using Java, tinkerpop/tinkergraph 3.4.6.
Currently it is taking about 45 minutes to drop all those vertices.
I did some Java profiling and the results show 73% of the time spent in the TinkerVertex.remove method and the rest in ExpandableStepIterator.next.
Is there something like a "bulk drop"? Would JanusGraph or another graph provider be much faster?
It's unlikely that there are graphs much faster than TinkerGraph, as TinkerGraph is a pure in-memory implementation. You might find one that is more efficient with that memory, like OverflowDB, which originally forked from TinkerGraph, but I don't know that it will make this particular query go faster.
Neither TinkerGraph nor any graph I know of has a filtered "bulk drop" operation.
The not()-style global query here is simply expensive, as you have to touch a large portion of the graph. Of course, I'm a bit surprised that TinkerGraph is taking that long for a graph with less than one million edges. You didn't mention whether you were seeing a lot of GC while doing your profiling. Perhaps that is the issue? If so, I would try to adjust your JVM memory configuration - maybe you just need a larger -Xmx value or something simple like that.
From a query perspective, you could try to invert the not() portion of the traversal to positively find the things you want to remove. It might lead to a less concise query to read, but could perhaps speed things up; on the other hand, you are still trying to delete 50% of your data, so the cost may not lie only in finding the vertices to get rid of.
One other thought would be to try to parallelize the drop(). You might hit concurrency errors, so you might need a retry strategy, but you could consider taking the Iterator from g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)) and then delegating calls to Vertex.remove() for each vertex (or batch of vertices) to a separate worker thread.
Based on the accepted answer, a simple parallelization improved things enough that this operation is no longer the most time-critical one.
For future reference, this:
g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).drop().iterate();
is now something like this:
// requires java.util and java.util.concurrent (ExecutorService, Callable, Future, List, Set)
ExecutorService executor = Executors.newFixedThreadPool(4);
int iterator = 0;
final int batchsize = 10000;
Long count = g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).count().next();
List<Callable<Object>> callableList = new ArrayList<>();
// split the current set of matching ids into batches to be dropped in parallel
while (iterator * batchsize < count) {
    final Set<Object> vSet = g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND))
            .skip(iterator * batchsize).limit(batchsize).id().toSet();
    callableList.add(() -> g.V(vSet).drop().iterate());
    iterator++;
}
// invokeAll blocks until every batch has been dropped (declares InterruptedException)
List<Future<Object>> results = executor.invokeAll(callableList);
After some tests, I decided to keep the iteration (collecting the id batches) in a single thread. That way the distributed tasks are really independent of each other (e.g. one task completing won't affect another task's query).
Keep in mind that the actual removal is still effectively single-threaded, as the modification of the vertex map sits behind concurrent-access locks.
The effect is that increasing the thread count won't get better results (I personally tried 8). Based on some thread dumps, even 4 might be too many (there is always at least one thread in a waiting state), although I did get a dump with 3 threads running!

How can I limit the depth to search in a convoluted graph that has a goal node

I have a convoluted graph that I need to search. The paths that are found need to always end with the goal node; this node has no deeper nodes beyond it. Furthermore, the length of the paths is limited, so the search has to find the goal node before reaching that limit. Here is an example graph:
graph example
In this case, for the following limits I would expect the results next to them.
2 => nothing
3 or 4 => I,1,F
6 => I,2,3,I,1,F and all of the above
7 or 8 => I,1,2,3,I,1,F and all of the above
9 => I,2,3,I,2,3,I,1,F and all of the above
Once I increase the limit further, I get more loops, and so on. I know that depth first search would work for me with the goal state, but I don't know how to take the limit into account in a smart way. I can do the search and then stop it when the depth limit is reached. Is there a better way of doing it?
There is also iterative deepening DFS, which you can use. See Wikipedia.
It is nothing crazy: you just pass your desired limit to the DFS algorithm and decrease it each time you go one level deeper; once it hits 0, your depth limit has been reached and you can't go any further.
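A minimal sketch of that depth-limited recursion, here over a plain adjacency map with the limit counted in nodes as in the example above (the map representation and all names are assumptions, not tied to any particular graph library):
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DepthLimitedSearch {

    // collects every path that ends at the goal and uses at most `remaining` nodes
    static void dls(Map<String, List<String>> adj, String node, String goal,
                    int remaining, List<String> path, List<List<String>> results) {
        if (node.equals(goal)) {
            results.add(new ArrayList<>(path));   // a path ending at the goal node
            return;                               // the goal has no deeper nodes
        }
        if (remaining == 1) {
            return;                               // node budget exhausted before the goal
        }
        for (String next : adj.getOrDefault(node, Collections.emptyList())) {
            path.add(next);
            dls(adj, next, goal, remaining - 1, path, results);
            path.remove(path.size() - 1);         // backtrack and try the next branch
        }
    }

    public static void main(String[] args) {
        // tiny graph shaped like the example: I -> {1, 2}, 1 -> {F, 2}, 2 -> 3, 3 -> I
        Map<String, List<String>> adj = new HashMap<>();
        adj.put("I", List.of("1", "2"));
        adj.put("1", List.of("F", "2"));
        adj.put("2", List.of("3"));
        adj.put("3", List.of("I"));

        List<List<String>> results = new ArrayList<>();
        List<String> path = new ArrayList<>();
        path.add("I");
        dls(adj, "I", "F", 6, path, results);     // limit of 6 nodes
        System.out.println(results);              // [[I, 1, F], [I, 2, 3, I, 1, F]]
    }
}
Wrapping this in a loop that raises the limit one step at a time gives the iterative deepening variant mentioned above.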

Detecting short cycles in neo4j property graph

What is the best way to detect cycles in a graph of considerable size using Cypher?
I have a graph with about 90,000 nodes and about 320,000 relationships, and I would like to detect cycles in a subgraph of about 10k nodes involving 100k relationships. The Cypher I have written is like:
start
n = node:node_auto_index(some lucene query that returns about 10k nodes)
match
p = n-[:r1|r2|r3*]->n
return p
However this is not turning out to be very efficient.
Can somebody suggest a better way to do this?
Unlimited-length path searches are well-known to be slow, since the number of operations grows exponentially with the depth of the search.
If you are willing to limit your cycle search to a reasonably small maximum path depth, then you can speed up the query (although it might still take some time). For instance, to only look at paths up to 5 steps deep:
MATCH p = n-[:r1|r2|r3*..5]->n
RETURN p;

Setting a new label on all nodes takes too long in a huge graph

I'm working on a graph containing about 50 million nodes and 40 million relationships.
I need to update every node.
I'm trying to set a new label to these nodes, but it's taking too long.
The label applies to all 50 million nodes, so the operation never ends.
After some research, I found out that Neo4j treats this operation as a single transaction (I don't know whether optimistic or not), keeping the changes uncommitted until the end (which will never happen in this fashion).
I'm currently using Neo4j 2.1.4, which has a feature called "USING PERIODIC COMMIT" (already present in earlier versions). Unfortunately, this feature is coupled to the "LOAD CSV" feature and is not available to every Cypher command.
The Cypher is quite simple:
match n set n:Person;
I decided to use a workaround, and make some sort of block update, as follows:
match n
where not n:Person
with n
limit 500000
set n:Person;
It's ugly, but I couldn't come up with a better solution yet.
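A block update like this can be driven from a small loop that re-runs the statement until no nodes are left to label. A rough sketch, assuming the Neo4j 2.1 embedded ExecutionEngine API (the store path is a placeholder):
import java.util.Map;

import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.cypher.javacompat.ExecutionResult;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class LabelInBlocks {
    public static void main(String[] args) {
        GraphDatabaseService db = new GraphDatabaseFactory()
                .newEmbeddedDatabase("/path/to/graph.db");   // placeholder path
        ExecutionEngine engine = new ExecutionEngine(db);

        long labelled;
        do {
            labelled = 0;
            // each pass labels at most 500000 nodes in its own transaction
            try (Transaction tx = db.beginTx()) {
                ExecutionResult result = engine.execute(
                        "MATCH (n) WHERE NOT n:Person WITH n LIMIT 500000 SET n:Person RETURN count(*) AS c");
                for (Map<String, Object> row : result) {
                    labelled = (Long) row.get("c");
                }
                tx.success();
            }
        } while (labelled > 0);                              // stop once every node carries the label

        db.shutdown();
    }
}
Each pass commits its own transaction, so memory use stays bounded even though the whole graph is eventually touched.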
Here are some of my confs:
== neo4j.properties =========
neostore.nodestore.db.mapped_memory=250M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=900M
neostore.propertystore.db.strings.mapped_memory=1300M
neostore.propertystore.db.arrays.mapped_memory=1300M
keep_logical_logs=false
node_auto_indexing=true
node_keys_indexable=name_autocomplete,document
relationship_auto_indexing=true
relationship_keys_indexable=role
execution_guard_enabled=true
cache_type=weak
=============================
== neo4j-server.properties ==
org.neo4j.server.webserver.limit.executiontime=20000
org.neo4j.server.webserver.maxthreads=200
=============================
The hardware spec is:
RAM: 24GB
PROC: Intel(R) Xeon(R) X5650 @ 2.67GHz, 32 cores
HDD1: 1.2TB
In this environment, each block update of 500000 nodes took from 200 to 400 seconds. I think this is because every node satisfies the query at the start, but as the updates take place, more nodes need to be scanned to find the unlabeled ones (but again, it's a hunch).
So what's the best course of action whenever an operation needs to touch every node in the graph?
Any help towards a better solution to this will be appreciated!
Thanks in advance.
The most performant way to achieve this is using the batch inserter API. You might use the following recipe:
Take a look at http://localhost:7474/webadmin and note the "node count". In fact it's not the number of nodes; it's rather the highest node id in use - we'll need that later on.
Make sure to cleanly shut down your graph database.
Take a backup copy of your graph.db directory.
Write a short program in Java/Groovy/(whatever JVM language you prefer...) that performs the following tasks (a sketch follows this list):
Open your graph.db folder using the batch inserter API.
In a loop from 0 to <node count> (from the step above), check whether a node with the given id exists; if so, grab its current labels, amend the list with the new label, and use setNodeLabels to write it back.
Make sure you call shutdown on the BatchInserter.
Start up your Neo4j instance again.
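A rough sketch of that program, assuming the Neo4j 2.1 batch inserter API (the store path and the highest node id are placeholders to be replaced with your own values):
import java.util.ArrayList;
import java.util.List;

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.Label;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class AddPersonLabel {
    public static void main(String[] args) throws Exception {
        long highestNodeId = 50_000_000L;                // the "node count" read from webadmin
        Label person = DynamicLabel.label("Person");

        // the database must be cleanly shut down before opening it with the batch inserter
        BatchInserter inserter = BatchInserters.inserter("/path/to/graph.db");
        try {
            for (long id = 0; id <= highestNodeId; id++) {
                if (!inserter.nodeExists(id)) {
                    continue;                            // node ids can have gaps
                }
                // collect the existing labels and amend the list with the new one
                List<Label> labels = new ArrayList<>();
                for (Label l : inserter.getNodeLabels(id)) {
                    labels.add(l);
                }
                labels.add(person);
                inserter.setNodeLabels(id, labels.toArray(new Label[labels.size()]));
            }
        } finally {
            inserter.shutdown();                         // flushes everything to the store files
        }
    }
}
Because the batch inserter bypasses transactions entirely, this runs far faster than the transactional Cypher approach, but the database must stay offline while it runs.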