Improve performance removing TinkerGraph vertices

I have a graph g with 600k vertices and 950k edges. After some processing, I need to clean up about 350k+ vertices with this query:
g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).drop().iterate();
Even though I'm excluding vertices with no "depend" edges, they are still connected with other edges.
Using Java, tinkerpop/tinkergraph 3.4.6.
Currently it is taking about 45 minutes to drop all those vertices.
I profiled the Java process, and the results show 73% of the time spent in the TinkerVertex.remove method and the rest in ExpandableStepIterator.next.
Is there something like "bulk drop"? Would a JanusGraph or other graph provider be much faster?

It's unlikely that any graph is much faster than TinkerGraph, as TinkerGraph is a pure in-memory implementation. You might find one that uses that memory more efficiently, like OverflowDB, which was originally forked from TinkerGraph, but I don't know that it would make this particular query go faster.
Neither TinkerGraph nor any graph I know of has a filtered "bulk drop" operation.
The "not" style global query here is simply expensive, as you're having to touch a large portion of the graph. Of course, I'm a bit surprised that TinkerGraph is taking that long for a graph with less than one million edges. You didn't mention whether you were seeing a lot of GC while profiling. Perhaps that is an issue? If so, I would try adjusting your JVM memory configuration; maybe you just need a larger -Xmx value or something simple like that.
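One way to check whether garbage collection is a factor is to read the JVM's own counters before and after the drop. The sketch below uses only the standard java.lang.management API. The class name GcReport and the spot where the traversal would run are assumptions for illustration, not part of the original setup.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcReport {

    // Maximum heap the JVM will attempt to use (roughly the -Xmx value), in MiB.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // Approximate time spent in GC so far, summed over all collectors, in milliseconds.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();

        // Assumption: the expensive traversal would run here, e.g.
        // g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).drop().iterate();

        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        long gcMillis = totalGcMillis() - gcBefore;
        System.out.println("max heap (MiB): " + maxHeapMiB());
        System.out.println("elapsed (ms): " + elapsedMillis + ", of which GC (ms): " + gcMillis);
    }
}
```

If the GC share turns out to be a large fraction of the elapsed time, raising -Xmx is probably the cheapest thing to try first.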
From a query perspective, you could try to invert the not() portion of the traversal to positively find the things you want to remove. It might make the query less concise to read, but it could speed things up. On the other hand, you are still trying to delete 50% of your data, so the cost may not be just in finding the vertices to get rid of.
One other thought would be to try to parallelize the drop(). You might hit concurrency errors, so you could need a retry strategy, but you could consider taking the Iterator of g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)) and then delegating the Vertex.remove() calls, individually or in batches, to separate worker threads.

Based on the accepted answer, a simple parallelization improved things enough that this operation is no longer the most time-critical one.
For future reference, this:
g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).drop().iterate();
is now something like this:
// needs java.util.*, java.util.concurrent.* and a static import of __.inE
ExecutorService executor = Executors.newFixedThreadPool(4);
int iterator = 0;
final int batchsize = 10000;
Long count = g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).count().next();
List<Callable<Object>> callableList = new ArrayList<Callable<Object>>();
// split the current set into tasks to be executed in parallel
while (iterator * batchsize < count) {
    // collect this batch's vertex ids on the calling thread, before any task runs
    final Set<Object> vSet = g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).skip(iterator * batchsize).limit(batchsize).id().toSet();
    callableList.add(() -> g.V(vSet).drop().iterate());
    iterator++;
}
// blocks until every batch has finished
List<Future<Object>> results = executor.invokeAll(callableList);
executor.shutdown();
After some tests, I decided to keep the iteration in a single thread. That way the distributed tasks are really independent of each other (e.g. one task completing won't affect another task's query).
Keep in mind that the actual removal is still effectively single-threaded, as the vertex node map modification is behind concurrent access locks.
The effect is that increasing threads won't get better results (personally tried 8). And based on some thread dumps, even 4 might be too much (there is always 1 or more thread in a waiting status) - although I did get a dump with 3 threads running!
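The batching itself does not depend on Gremlin, so it can be sketched with the standard library alone. The version below is a variation, not the code used above: it assumes the ids are collected once (for example with a single `...id().toList()` call) instead of re-running the filtered traversal with skip() for every batch. BatchedDrop, partition and dropInParallel are hypothetical names, and the Consumer stands in for a call such as `g.V(batch.toArray()).drop().iterate()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

class BatchedDrop {

    // Split ids into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < ids.size(); from += batchSize) {
            int to = Math.min(from + batchSize, ids.size());
            batches.add(new ArrayList<>(ids.subList(from, to)));
        }
        return batches;
    }

    // Run dropBatch once per batch on a fixed-size pool; returns the number of batches.
    static <T> int dropInParallel(List<T> ids, int batchSize, int threads, Consumer<List<T>> dropBatch) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (List<T> batch : partition(ids, batchSize)) {
                tasks.add(() -> { dropBatch.accept(batch); return null; });
            }
            for (Future<Void> done : executor.invokeAll(tasks)) {
                done.get(); // surfaces any failure raised inside a worker
            }
            return tasks.size();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("batched drop failed", e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in for the graph: a concurrent set of vertex ids.
        Set<Integer> store = ConcurrentHashMap.newKeySet();
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            ids.add(i);
            store.add(i);
        }
        // Assumption: with TinkerPop the lambda would be
        // batch -> g.V(batch.toArray()).drop().iterate()
        int batches = dropInParallel(ids, 10, 4, batch -> store.removeAll(batch));
        System.out.println(batches + " batches, " + store.size() + " ids left");
    }
}
```

Collecting the ids up front keeps every task independent of the others, which matches the reasoning given above for doing the iteration on a single thread.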

Related

Neptune and Gremlin - wild CPU utilization

Issue
I'm seeing extremely high CPU utilization when making [what seems like] a fairly common ask to a graph database. In fact, the utilization is so large that Amazon Neptune seems to "tap out" when I execute multiple of the queries in rapid succession / simultaneously.
I've experimented with even the largest (db.r5d.24xlarge) instance, which costs $14k/month (😅), and I still see the same general behavior -- memory is fine, CPU utilization goes wild.
Further, the query isn't exactly quick (~5 seconds), so I'm kind of getting the worst of both worlds...
Ask
Can someone help me understand what I can do to address this / review my query? My hope was, just as a relational database can handle thousands of concurrent requests, Amazon Neptune could do similar.
I suspect I might be using .limit() in the wrong spot. That said, the results produced are correct. If the limit() is moved up before the .and(), I'm concerned I'd end up with suboptimal results, since -- due to the until() clauses -- it could just be paths that stopped w/o making it to the toPortId.
Update: I’m wondering if it might make sense to add a timeLimit() clause to the repeat() clause … anecdotally, after some testing, it seems like I’m paying a significant time penalty to rigorously verify that there are, in fact, no routes to certain places.
Detail
I'm new to Gremlin / Neptune / graph databases in general, so it's possible/likely that I'm either doing something wrong (or have something incorrectly configured) -or- I overestimated the ability of these tools to handle queries.
My use-case requires that I "qualify" (or disqualify) a large number of candidate destinations -- that is:
Starting at ORIGIN (fromPortId) are there / what are the paths to CANDIDATE (toPortId) for which the duration properties sums to less than LIMIT (travelTimeBudget)?
Offending Query
const result = await g.withSack(INITIAL_CHECKIN_PENALTY)
  .V(fromPortId)
  .repeat(
    (__.outE().hasLabel('HAS_VOYAGE_TO').sack(operator.sum).by('duration'))
      .sack(operator.sum).by(__.constant(LAYEROVER_PENALTY))
      .inV()
      .simplePath()
  )
  .until(
    __.or(
      __.has('code', toPortId),
      __.sack().is(gte(travelTimeBudget)),
      __.loops().is(maxHops))
  )
  .and(
    __.has('code', toPortId),
    __.sack().is(lte(travelTimeBudget))
  )
  .limit(4)
  .order().by(__.sack(), order.asc)
  .local(
    __.union(
      __.path().by('code').by('duration'),
      __.sack()
    ).fold()
  )
  .local(
    __.unfold().unfold().fold()
  )
  .toList()
  .then(data => {
    return data
  })
  .catch(error => {
    console.log('ERROR', error);
    return false
  });
Execution
Database: Amazon Neptune (db.r5.large)
Querying from / runtime: AWS Lambda running Node.js
Graph
5,300 vertices (Port)
42,000 edges (HAS_VOYAGE_TO)
Data Model
Note: For this issue, I'm only executing the "upper plane" / top half.

To address the CPU discussion, perhaps a small explanation of the Neptune instance architecture will be beneficial. Each Neptune instance has a worker thread pool. The number of workers in the pool is twice the number of vCPUs the instance has. That controls the maximum number of queries that can be running concurrently on a given instance. Each instance also has a query queue that holds queries sent to the instance until a worker becomes available. For a somewhat complex query against a small instance type (like the db.r5.large), seeing 20 to 30% CPU utilization is not unexpected. A significant portion of the instance memory is used to cache graph data locally in a most recently used fashion. The remaining memory is allocated to the worker threads for query execution. So larger instances have (a) more workers, (b) more memory for caching graph data locally, and (c) more memory for use during query execution.
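As a rough illustration of the "twice the vCPUs" rule, the arithmetic can be written down as follows. The vCPU counts used here (2 for db.r5.large, 96 for db.r5d.24xlarge) are assumptions taken from the general instance-size naming, and the class and method names are hypothetical.

```java
class NeptuneWorkers {

    // Worker threads per instance: twice the vCPU count, as described above.
    static int workerThreads(int vCpus) {
        return 2 * vCpus;
    }

    // Queries that must wait in the queue when `concurrent` queries arrive at once.
    static int queued(int concurrent, int vCpus) {
        return Math.max(0, concurrent - workerThreads(vCpus));
    }

    public static void main(String[] args) {
        // Assumed vCPU counts: 2 for db.r5.large, 96 for db.r5d.24xlarge.
        System.out.println("db.r5.large workers: " + workerThreads(2));
        System.out.println("db.r5d.24xlarge workers: " + workerThreads(96));
        System.out.println("queued on db.r5.large with 50 concurrent queries: " + queued(50, 2));
    }
}
```

Under these assumptions, a db.r5.large runs at most 4 queries at a time, so a burst of 5-second queries mostly sits in the queue rather than executing.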
Without having the data it's hard to attempt to modify your query and test it, but it may well be possible to optimize various parts of it to improve its overall efficiency.
As there are a lot of topics that we could discuss here, and I am more than happy to do that, it might be easiest, if you are able, to open an AWS support case and ask that the support engineer create a ticket and have it assigned to me. I will be happy to then get on the phone with you and we can spend some time discussing your use case if that would help.
If you are not able to open a support case, we can connect other ways. I'm easy to find on LinkedIn if you prefer to reach out that way.
I very much want to help you with this and your other questions, but I feel the most expeditious way might be for us to get on the phone.
I'm also happy to keep discussing here of course.

Gremlin console keeps returning "Connection to server is no longer active" error

I tried to run a Gremlin query adding a property to vertex through Gremlin console.
g.V().hasLabel("user").has("status", "valid").property(single, "type", "valid")
I constantly get this error:
org.apache.tinkerpop.gremlin.jsr223.console.RemoteException: Connection to server is no longer active
This error happens after the query has been running for one or two minutes.
I tried some simple queries like g.V().limit(10) and they work fine.
Since the affected vertex count is more than 4 million, I'm not sure whether it is failing due to a timeout.
I also tried to split it into small batches:
g.V().hasLabel("user").has("status", "valid").hasNot("type").limit(200000).property(single, "type", "valid")
It succeeded for first few batches and started failing again.
Are there any recommendations for updating millions of vertices?

The precise approach you take may vary depending on the backend graph database and storage you are using as well as the capacity of the hardware being used.
The capacity of the hardware where Gremlin Server is running in terms of number of CPUs and most importantly, memory, will also be a factor as will the setting of the query timeout value.
To do this in Gremlin, if you had a way to identify distinct ranges of vertices easily you could split this up into multiple threads each doing batches of updates. If the example you show is representative of your actual need then that is likely not possible in this case.
Likewise some graph databases provide a bulk load capability that is often a good way to do large batch updates but probably not an option here as you need to do essentially a conditional update based on looking at the current presence (or not) of a property.
Without more information about your data model and hardware etc. the best answer is probably to do two things:
Use smaller limits. Maybe try 5K or even just 1K at first and work up from there until you find a reliable sweet spot.
Increase the query timeout settings.
You may need to experiment to find the sweet spot for your environment as the capacity of the hardware will definitely play a role in situations like this as well as how you write your query.
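The "smaller limits" advice amounts to a loop that keeps issuing bounded updates until one of them changes nothing. Here is a minimal sketch of that loop, with the Gremlin call replaced by a callback so it runs on its own. BatchedUpdate and updateAll are hypothetical names, and the assumption is that each real batch would be one query of the form shown in the question, for example ended with count() so it reports how many vertices it touched.

```java
import java.util.function.IntToLongFunction;

class BatchedUpdate {

    // Calls updateBatch with the batch size until a call reports zero changes.
    // Returns the total number of elements changed.
    static long updateAll(int batchSize, IntToLongFunction updateBatch) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        long total = 0;
        long changed;
        do {
            changed = updateBatch.applyAsLong(batchSize);
            total += changed;
        } while (changed > 0);
        return total;
    }

    public static void main(String[] args) {
        // Stand-in for the graph: a counter of vertices still missing the property.
        long[] remaining = {4_500};
        long total = updateAll(1_000, limit -> {
            // Assumption: a real implementation would submit one bounded query here, e.g.
            // g.V().hasLabel("user").has("status", "valid").hasNot("type")
            //  .limit(limit).property(single, "type", "valid").count()
            long n = Math.min(limit, remaining[0]);
            remaining[0] -= n;
            return n;
        });
        System.out.println("updated " + total + " vertices");
    }
}
```

Because every batch is its own short request, a single slow batch can be retried with a smaller limit without redoing the earlier ones.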

When are gremlin sessions better?

I understand that sessionless operations are the preferred method of using gremlin. I'm wondering when is the sessioned approach better?
So I might be doing something like...
graph.addVertex("foo").property("name","bar")
graph.traversal().V().has("name","bar").as("f").addV("foo").property("name","baz").as("g").addE("test").from("f").to("g")
I'm doing this type of operation a lot. Often there's also a query (usually involving a coalesce) beforehand to check if a node (g in my example) exists, and create it if not.
So I'm wondering if a session might be better because I could hold a handle to the previous vertices and just attach new nodes to them without the expense of the lookup.
Feel free to tell me why I'm wrong in anything else that I'm doing.. Just trying to make things faster.

First of all, I would avoid use of addVertex() and stick to addV() - see more details here.
As to your question, I think the only time to leverage sessions is if you have some sort of loading operation that requires explicit control over transactions and you're not using a JVM based language. Even then, I might consider other options for dealing with that and just avoid sessions completely. You end up with a less portable solution as there are a number of graph systems which don't even support them directly (e.g. Neptune).
The cost to do a T.id based lookup should be really fast, so saving a vertex between requests in a session really shouldn't vastly improve performance. Even if you keep the vertex between requests you will still need to pass the vertex into your traversal so you still have the lookup anyway - I'm not sure I see the difference in cost there.
// first request
v = g.addV(...).property(...).next()
// second request
g.V(v).addE(....
// third request
g.V(v).addE(....
The above should not be that much faster than:
// first request - returns id=1
g.addV(...).property(...).id().next()
// second request - where "1" is just passed in on the next request as a parameter
g.V(1).addE(....
// third request
g.V(1).addE(....

Setting a new label on all nodes takes too long in a huge graph

I'm working on a graph containing about 50 million nodes and 40 million relationships.
I need to update every node.
I'm trying to set a new label to these nodes, but it's taking too long.
The label applies to all 50 million nodes, so the operation never ends.
After some research, I found out that Neo4j treats this operation as a single transaction (I don't know if optimistic or not), keeping the changes uncommitted until the end (which will never happen in this fashion).
I'm currently using Neo4j 2.1.4, which has a feature called "USING PERIODIC COMMIT" (already present in earlier versions). Unfortunately, this feature is coupled to the "LOAD CSV" feature, and not available to every cypher command.
The cypher is quite simple:
match n set n:Person;
I decided to use a workaround, and make some sort of block update, as follows:
match n
where not n:Person
with n
limit 500000
set n:Person;
It's ugly, but I couldn't come up with a better solution yet.
Here are some of my confs:
== neo4j.properties =========
neostore.nodestore.db.mapped_memory=250M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=900M
neostore.propertystore.db.strings.mapped_memory=1300M
neostore.propertystore.db.arrays.mapped_memory=1300M
keep_logical_logs=false
node_auto_indexing=true
node_keys_indexable=name_autocomplete,document
relationship_auto_indexing=true
relationship_keys_indexable=role
execution_guard_enabled=true
cache_type=weak
=============================
== neo4j-server.properties ==
org.neo4j.server.webserver.limit.executiontime=20000
org.neo4j.server.webserver.maxthreads=200
=============================
The hardware spec is:
RAM: 24GB
PROC: Intel(R) Xeon(R) X5650 @ 2.67GHz, 32 cores
HDD1: 1.2TB
In this environment, each block update of 500000 nodes took from 200 to 400 seconds. I think this is because every node satisfies the query at the start, but as the updates take place, more nodes need to be scanned to find the unlabeled ones (but again, it's a hunch).
So what's the best course of action whenever an operation needs to touch every node in the graph?
Any help towards a better solution to this will be appreciated!
Thanks in advance.

The most performant way to achieve this is using the batch inserter API. You might use the following recipe:
1. Take a look at http://localhost:7474/webadmin and note the "node count". In fact, this is not the number of nodes but rather the highest node id in use; we'll need that later on.
2. Make sure to cleanly shut down your graph database.
3. Take a backup copy of your graph.db directory.
4. Write a short program in Java, Groovy, or whatever JVM language you prefer that performs the following tasks:
   a. Open your graph.db folder using the batch inserter API.
   b. In a loop from 0..<node count> (from the first step), check whether the node with the given id exists; if so, grab its current labels, append the new label to the list, and use setNodeLabels to write it back.
   c. Make sure you call shutdown on the batch inserter.
5. Start up your Neo4j instance again.
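The loop at the heart of this recipe can be sketched without Neo4j on the classpath. In the sketch below, LabelStore is a hypothetical interface loosely modelled on the operations the recipe names (existence check, read labels, setNodeLabels). It is not the Neo4j batch inserter API, and the in-memory implementation exists only so the example runs on its own.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class LabelAllNodes {

    // Hypothetical stand-in for the store operations named in the recipe.
    interface LabelStore {
        boolean nodeExists(long id);
        Set<String> getNodeLabels(long id);
        void setNodeLabels(long id, Set<String> labels);
    }

    // In-memory implementation, only so the sketch can run on its own.
    static class InMemoryStore implements LabelStore {
        private final Map<Long, Set<String>> nodes = new HashMap<>();

        void addNode(long id, String... labels) {
            Set<String> set = new LinkedHashSet<>();
            for (String label : labels) {
                set.add(label);
            }
            nodes.put(id, set);
        }

        public boolean nodeExists(long id) {
            return nodes.containsKey(id);
        }

        public Set<String> getNodeLabels(long id) {
            return new LinkedHashSet<>(nodes.get(id));
        }

        public void setNodeLabels(long id, Set<String> labels) {
            nodes.put(id, new LinkedHashSet<>(labels));
        }
    }

    // Walk ids 0..highestId, skip unused ids, and add newLabel to the existing labels.
    // Returns the number of nodes whose label set was changed.
    static long addLabelToAll(LabelStore store, long highestId, String newLabel) {
        long updated = 0;
        for (long id = 0; id <= highestId; id++) {
            if (!store.nodeExists(id)) {
                continue; // ids can have gaps, e.g. after deletions
            }
            Set<String> labels = new LinkedHashSet<>(store.getNodeLabels(id));
            if (labels.add(newLabel)) {
                store.setNodeLabels(id, labels);
                updated++;
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        store.addNode(0, "Company");
        store.addNode(2);
        store.addNode(5, "Person");
        System.out.println("updated " + addLabelToAll(store, 5, "Person") + " nodes");
    }
}
```

Reading the current labels before writing matters because the write replaces the whole label set, so writing only the new label would drop the existing ones.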

Barriers in OpenCL

In OpenCL, my understanding is that you can use the barrier() function to synchronize threads in a work group. I do (generally) understand what they are for and when to use them. I'm also aware that all threads in a work group must hit the barrier, otherwise there are problems. However, every time I've tried to use barriers so far, it seems to result in either my video driver crashing, or an error message about accessing invalid memory of some sort. I've seen this on 2 different video cards so far (1 ATI, 1 NVIDIA).
So, my questions are:
Any idea why this would happen?
What is the difference between barrier(CLK_LOCAL_MEM_FENCE) and barrier(CLK_GLOBAL_MEM_FENCE)? I read the documentation, but it wasn't clear to me.
Is there general rule about when to use barrier(CLK_LOCAL_MEM_FENCE) vs. barrier(CLK_GLOBAL_MEM_FENCE)?
Is there ever a time that calling barrier() with the wrong parameter type could cause an error?

As you have stated, barriers may only synchronize threads in the same workgroup. There is no way to synchronize different workgroups in a kernel.
Now to answer your question, the specification was not clear to me either, but it seems to me that section 6.11.9 contains the answer:
CLK_LOCAL_MEM_FENCE – The barrier function will either flush any
variables stored in local memory or queue a memory fence to ensure
correct ordering of memory operations to local memory.
CLK_GLOBAL_MEM_FENCE – The barrier function will queue a memory fence
to ensure correct ordering of memory operations to global memory.
This can be useful when work-items, for example, write to buffer or
image memory objects and then want to read the updated data.
So, to my understanding, you should use CLK_LOCAL_MEM_FENCE when writing to and reading from the __local memory space, and CLK_GLOBAL_MEM_FENCE when writing to and reading from the __global memory space.
I have not tested whether this is any slower, but most of the time, when I need a barrier and I am in doubt about which memory space is affected, I simply use a combination of the two, i.e.:
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
This way you should not have any memory read/write ordering problems (as long as you are sure that every thread in the group goes through the barrier, but you are aware of that).
Hope it helps.

Reviving an old-ish thread here. I have had a little bit of trouble with barrier() myself.
Regarding your crash problem, one potential cause could be if your barrier is inside a condition. I read that when you use barrier, ALL work items in the group must be able to reach that instruction, or it will hang your kernel - usually resulting in a crash.
if(someCondition){
    //do stuff
    barrier(CLK_LOCAL_MEM_FENCE);
    //more stuff
}else{
    //other stuff
}
My understanding is that if one or more work items satisfies someCondition, ALL work items must satisfy that condition, or there will be some that will skip the barrier. Barriers wait until ALL work items reach that point. To fix the above code, I need to restructure it a bit:
if(someCondition){
    //do stuff
}
barrier(CLK_LOCAL_MEM_FENCE);
if(someCondition){
    //more stuff
}else{
    //other stuff
}
Now all work items will reach the barrier.
I don't know to what extent this applies to loops; if a work item breaks from a for loop, does it hit barriers? I am unsure.
UPDATE: I have successfully crashed a few ocl programs with a barrier in a for-loop. Make sure all work items exit the for loop at the same time - or better yet, put the barrier outside the loop.
(source: Heterogeneous Computing with OpenCL Chapter 5, p90-91)
