AWS Neptune times out for large graph drop (Gremlin)

There are already some threads on this subject, particularly this one, but is there any recommended solution for dropping a large graph other than batching? I tried increasing the timeout and it doesn't work. Below is an example:
gremlin> g.V().count()
==>5230885
gremlin> g.V().drop().iterate()
{"requestId":"77c64369-45fa-462f-91d7-5712e3308497","detailedMessage":"A timeout occurred within the script during evaluation of [RequestMessage{, requestId=77c64369-45fa-462f-91d7-5712e3308497, op='eval', processor='', args={gremlin=g.V().drop().iterate(), bindings={}, batchSize=64}}] - consider increasing the timeout","code":"TimeLimitExceededException"}
Type ':help' or ':h' for help.
Display stack trace? [yN]N
gremlin> g.E().count()
==>83330550
gremlin> :remote config timeout none
==>Remote timeout is disabled
gremlin> g.E().drop().iterate()
{"requestId":"d418fa03-72ce-4154-86d8-42225e4b9eca","detailedMessage":"A timeout occurred within the script during evaluation of [RequestMessage{, requestId=d418fa03-72ce-4154-86d8-42225e4b9eca, op='eval', processor='', args={gremlin=g.E().drop().iterate(), bindings={}, batchSize=64}}] - consider increasing the timeout","code":"TimeLimitExceededException"}
Type ':help' or ':h' for help.
Display stack trace? [yN]N

You can increase the timeout for your Neptune cluster using the neptune_query_timeout parameter group option.
If you are using version 3.3.7 of the Java client, you can also specify the timeout for individual requests:
Set Timeouts at a Per-Query Level
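For example, with the 3.3.7 Java driver a per-request timeout can be passed via RequestOptions. A minimal sketch, assuming a plain Gremlin Java driver connection; the endpoint and the 10-minute value are placeholders:

import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.RequestOptions;

public class PerRequestTimeout {
    public static void main(String[] args) throws Exception {
        Cluster cluster = Cluster.build("your-neptune-endpoint").port(8182).enableSsl(true).create();
        Client client = cluster.connect();
        // Allow this one request up to 10 minutes, regardless of the server default.
        RequestOptions options = RequestOptions.build().timeout(600_000L).create();
        client.submit("g.V().drop().iterate()", options).all().get();
        cluster.close();
    }
}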
Hopefully soon you will be able to run:
g.with("scriptEvaluationTimeout", 600).V().drop()

You currently have two options to drop an entire graph that is large. One option, of course, is to delete the current cluster and create a new one. To delete the existing graph, the best approach is to use multiple threads that drop chunks of the graph in batches. I have been working on some Python code that can do just that. It currently lives on a branch at this location:
https://github.com/awslabs/amazon-neptune-tools/tree/master/drop-graph
For a graph of the size you have, the tool should work fine as is. It does have some limitations currently, which are documented in the code.
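The linked tool is Python, but to illustrate the multi-threaded batching idea it implements, here is a rough sketch using the TinkerPop Java driver. The endpoint, thread count, and batch size are assumptions, not values from the tool, and concurrent drops can occasionally conflict, so real code needs retry handling:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import static org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource.traversal;

public class ParallelDrop {
    public static void main(String[] args) throws Exception {
        Cluster cluster = Cluster.build("your-neptune-endpoint").port(8182).enableSsl(true).create();
        GraphTraversalSource g = traversal().withRemote(DriverRemoteConnection.using(cluster, "g"));
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            pool.submit(() -> {
                // Drop edges first, then vertices, in batches small enough
                // to finish well inside the query timeout.
                while ((long) g.E().limit(5_000).count().next() > 0) {
                    g.E().limit(5_000).drop().iterate();
                }
                while ((long) g.V().limit(5_000).count().next() > 0) {
                    g.V().limit(5_000).drop().iterate();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        cluster.close();
    }
}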
Update (2021-12-08):
Since this question was asked, Amazon Neptune has added a Fast Reset API that can be used to delete all of the data in a cluster. More details here: https://docs.aws.amazon.com/neptune/latest/userguide/manage-console-fast-reset.html
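Fast Reset is a two-step call against the cluster's /system endpoint: one request to obtain a reset token, and a second to perform the reset with that token. A rough sketch with Java 11's HttpClient; the endpoint is a placeholder and the token extraction is deliberately naive, so check the linked page for the exact request and response shapes:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FastReset {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        String system = "https://your-neptune-endpoint:8182/system";

        // Step 1: request a one-time reset token.
        HttpRequest initiate = HttpRequest.newBuilder(URI.create(system))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"action\":\"initiateDatabaseReset\"}"))
                .build();
        String body = http.send(initiate, HttpResponse.BodyHandlers.ofString()).body();

        // Step 2: confirm the reset using the token from the first response.
        String token = body.replaceAll(".*\"token\"\\s*:\\s*\"([^\"]+)\".*", "$1");
        HttpRequest perform = HttpRequest.newBuilder(URI.create(system))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"action\":\"performDatabaseReset\",\"token\":\"" + token + "\"}"))
                .build();
        System.out.println(http.send(perform, HttpResponse.BodyHandlers.ofString()).body());
    }
}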

If you run into timeouts on LOAD or UNLOAD requests, note that the neptune_query_timeout configuration setting applies there too, but it has to be set for every DB instance that you address, not just once for the cluster. Different parameter groups can be applied to the cluster and to the individual instances, so make sure the parameter group for the instance in question has the right timeout setting.
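As an illustration, changing the parameter on an instance-level DB parameter group might look like the following sketch with the AWS SDK for Java; the group name and the 10-minute value are assumptions:

import com.amazonaws.services.neptune.AmazonNeptune;
import com.amazonaws.services.neptune.AmazonNeptuneClientBuilder;
import com.amazonaws.services.neptune.model.ModifyDBParameterGroupRequest;
import com.amazonaws.services.neptune.model.Parameter;

public class SetInstanceQueryTimeout {
    public static void main(String[] args) {
        AmazonNeptune neptune = AmazonNeptuneClientBuilder.defaultClient();
        // neptune_query_timeout is expressed in milliseconds.
        neptune.modifyDBParameterGroup(new ModifyDBParameterGroupRequest()
                .withDBParameterGroupName("my-instance-parameter-group")
                .withParameters(new Parameter()
                        .withParameterName("neptune_query_timeout")
                        .withParameterValue("600000")
                        .withApplyMethod("immediate")));
    }
}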

Related

How to prevent high memory consumption caused by an AQL query returning a large result set?

In our Artifactory Pro 7.38 instance I discovered very high memory usage that I never saw in Artifactory 6. I now have a memory dump showing a stack trace that reveals the cause of the memory consumption. When a certain AQL query is used to filter all artifacts by a date, the JDBC result set seems to become very large (20+ million items). While there are probably options to limit the result, I wonder how I can protect the instance against such situations. Is there a way to generally limit the size of the result set in terms of number of results? I read that there is at least support for passing a limit along with the AQL query, but is there something that can be done on the server side, such as enforcing pagination?
In Artifactory 7.41.x an improvement was introduced that allows the system to kill long-running AQL queries, exactly for this scenario, to avoid performance issues.
By default, the system will kill any query that runs for more than 15 minutes. If you want to change this default, you can add the following property to the system.properties file:
artifactory.aql.query.timeout.seconds - the query timeout for AQL; the default is 15 minutes (900 seconds)
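For example, to lower the cutoff to five minutes (300 is an assumed value, not a recommendation), the entry in system.properties would be:

artifactory.aql.query.timeout.seconds=300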
In addition, as you mentioned, the query itself can probably be improved. I recommend reading the wiki page on Limits and Pagination.
I hope this clarifies and helps.

Gremlin console keeps returning "Connection to server is no longer active" error

I tried to run a Gremlin query that adds a property to vertices through the Gremlin console.
g.V().hasLabel("user").has("status", "valid").property(single, "type", "valid")
I constantly get this error:
org.apache.tinkerpop.gremlin.jsr223.console.RemoteException: Connection to server is no longer active
This error appears after the query has been running for one or two minutes.
I tried some simple queries like g.V().limit(10) and they work fine.
Since the affected vertex count is more than 4 million, I am not sure whether it is failing due to a timeout.
I also tried to split it into small batches:
g.V().hasLabel("user").has("status", "valid").hasNot("type").limit(200000).property(single, "type", "valid")
It succeeded for the first few batches and then started failing again.
Are there any recommendations for updating millions of vertices?
The precise approach you take may vary depending on the backend graph database and storage you are using, as well as the capacity of the hardware involved.
The capacity of the hardware where Gremlin Server is running, in terms of the number of CPUs and, most importantly, memory, will also be a factor, as will the query timeout setting.
To do this in Gremlin, if you had a way to easily identify distinct ranges of vertices, you could split the work across multiple threads, each doing batches of updates. If the example you show is representative of your actual need, then that is likely not possible in this case.
Likewise, some graph databases provide a bulk load capability that is often a good way to do large batch updates, but that is probably not an option here, as you essentially need a conditional update based on the current presence (or absence) of a property.
Without more information about your data model, hardware, etc., the best answer is probably to do two things:
Use smaller limits. Maybe try 5K or even just 1K at first and work up from there until you find a reliable sweet spot.
Increase the query timeout settings.
You may need to experiment to find the right values for your environment, as hardware capacity will definitely play a role in situations like this, as will how you write your query.
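As a minimal sketch of that batching loop, using the TinkerPop Java driver with the labels and properties from the question (the endpoint and the 1K batch size are assumptions):

import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.VertexProperty;
import static org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource.traversal;

public class BatchedPropertyUpdate {
    public static void main(String[] args) throws Exception {
        Cluster cluster = Cluster.build("your-gremlin-server").port(8182).create();
        GraphTraversalSource g = traversal().withRemote(DriverRemoteConnection.using(cluster, "g"));
        long updated;
        do {
            // Each pass updates at most 1,000 vertices; count() reports how many
            // were actually touched, so the loop stops once none are left.
            updated = g.V().hasLabel("user").has("status", "valid").hasNot("type")
                    .limit(1000)
                    .property(VertexProperty.Cardinality.single, "type", "valid")
                    .count().next();
        } while (updated > 0);
        cluster.close();
    }
}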

gremlin.net with Cosmos DB: RequestRateTooLarge

I get a RequestRateTooLarge exception when running queries that involve around 60 vertices. The problem seems to be related to the number of vertices and edges involved in the query (it does not happen with smaller queries). Increasing the throughput does not solve the problem; it just makes it happen less frequently.
Would it be useful to wait some time between retrievals of the results of the query, i.e. doing a Thread.Sleep() between calls to something like query.ExecuteNextAsync() of the Graph API? I could not find an equivalent in Gremlin.Net, so I haven't tried it yet.
If this is not a solution, what can I do?
Introducing a wait period will address the RequestRateTooLarge exception you are experiencing.
Another option is to increase the reserved throughput for the container. You can find additional information about the exceeded-reserved-throughput exception here.
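The question uses Gremlin.Net, but the retry-with-backoff pattern looks the same in any client. A minimal sketch with the Gremlin Java driver; the initial delay, retry cap, and the way the throttling error is detected are all assumptions to adapt to your client:

import java.util.List;
import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Result;

public class ThrottledSubmit {
    static List<Result> submitWithBackoff(Client client, String gremlin) throws InterruptedException {
        long delayMs = 500;                                   // initial wait (assumed)
        for (int attempt = 0; attempt < 6; attempt++) {
            try {
                return client.submit(gremlin).all().join();
            } catch (RuntimeException e) {
                String message = e + " " + e.getCause();
                if (!message.contains("RequestRateTooLarge")) {
                    throw e;                                  // not a throttling error
                }
                Thread.sleep(delayMs);                        // back off before retrying
                delayMs *= 2;                                 // exponential backoff
            }
        }
        throw new IllegalStateException("still throttled after retries");
    }
}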

Setting a new label on all nodes takes too long in a huge graph

I'm working on a graph containing about 50 million nodes and 40 million relationships.
I need to update every node.
I'm trying to set a new label on these nodes, but it's taking too long.
The label applies to all 50 million nodes, so the operation never ends.
After some research, I found out that Neo4j treats this operation as a single transaction (I don't know if optimistic or not), keeping the changes uncommitted until the end (which will never happen in this fashion).
I'm currently using Neo4j 2.1.4, which has a feature called "USING PERIODIC COMMIT" (already present in earlier versions). Unfortunately, this feature is coupled to LOAD CSV and is not available to every Cypher command.
The Cypher is quite simple:
match n set n:Person;
I decided to use a workaround and do some sort of block update, as follows:
match n
where not n:Person
with n
limit 500000
set n:Person;
It's ugly, but I couldn't come up with a better solution yet.
Here are some of my confs:
== neo4j.properties =========
neostore.nodestore.db.mapped_memory=250M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=900M
neostore.propertystore.db.strings.mapped_memory=1300M
neostore.propertystore.db.arrays.mapped_memory=1300M
keep_logical_logs=false
node_auto_indexing=true
node_keys_indexable=name_autocomplete,document
relationship_auto_indexing=true
relationship_keys_indexable=role
execution_guard_enabled=true
cache_type=weak
=============================
== neo4j-server.properties ==
org.neo4j.server.webserver.limit.executiontime=20000
org.neo4j.server.webserver.maxthreads=200
=============================
The hardware spec is:
RAM: 24GB
PROC: Intel(R) Xeon(R) X5650 @ 2.67GHz, 32 cores
HDD1: 1.2TB
In this environment, each block update of 500,000 nodes took from 200 to 400 seconds. I think this is because every node satisfies the query at the start, but as the updates take place, more nodes need to be scanned to find the unlabeled ones (but again, it's just a hunch).
So what's the best course of action whenever an operation needs to touch every node in the graph?
Any help towards a better solution will be appreciated!
Thanks in advance.
The most performant way to achieve this is by using the batch inserter API. You might use the following recipe:
take a look at http://localhost:7474/webadmin and note the "node count". In fact, it's not the number of nodes; it's more the highest node id in use - we'll need that later on.
make sure to cleanly shut down your graph database.
take a backup copy of your graph.db directory.
write a short piece of Java/Groovy/(whatever JVM language you prefer) code that performs the following tasks (a sketch follows this list):
open your graph.db folder using the batch inserter API
in a loop from 0 to <node count> (from the step above), check if the node with the given id exists; if so, grab its current labels, amend the list with the new label, and use setNodeLabels to write it back.
make sure you run shutdown with the batch inserter
start up your Neo4j instance again
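A rough sketch of that program against the Neo4j 2.x batch inserter API; the store path is a placeholder, the label comes from the question, and the highest node id would come from step 1:

import java.util.ArrayList;
import java.util.List;
import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.Label;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class AddLabelToAllNodes {
    public static void main(String[] args) {
        long highestNodeId = 50_000_000L;              // the webadmin "node count" from step 1
        BatchInserter inserter = BatchInserters.inserter("/path/to/graph.db");
        try {
            Label person = DynamicLabel.label("Person");
            for (long id = 0; id <= highestNodeId; id++) {
                if (!inserter.nodeExists(id)) {
                    continue;                          // node ids can have gaps
                }
                // Amend the node's existing labels with the new one and write them back.
                List<Label> labels = new ArrayList<>();
                for (Label l : inserter.getNodeLabels(id)) {
                    labels.add(l);
                }
                labels.add(person);
                inserter.setNodeLabels(id, labels.toArray(new Label[0]));
            }
        } finally {
            inserter.shutdown();                       // mandatory; skipping it corrupts the store
        }
    }
}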

Graphite + Collectd - How to plot memory used percentage for each host?

I have a Graphite + collectd setup to collect system-related metrics. This question concerns the memory plugin for collectd.
My infrastructure uses this naming scheme for the memory usage data collected by collectd:
<cluster>.<host>.memory.memory-{buffered,cached,free,used}
I want to plot the percentage of memory used for each host.
So basically, I have to do something like this:
divideSeries(sumSeriesWithWildcards(*.*.memory.memory-{buffered,cached,free},1),sumSeriesWithWildcards(*.*.memory.memory-{buffered,cached,free,used},1))
But I am not able to, because divideSeries requires the divisor to be a single series.
I basically want a single target to monitor all hosts in a cluster.
How can I do this?
Try this one:
asPercent(host.memory.memory-used, sumSeries(host.memory.memory-{used,free,cached,buffered}))
You'll get a graph of memory usage in percent for a single host. Unfortunately, I wasn't able to make it work with wildcards (multiple hosts).
Try this for multiple nodes, using wildcards:
alias(asPercent(sumSeries(collectd.nodexx*_internal_cloudapp_net.memory.memory.used), sumSeries(collectd.nodexx*_internal_cloudapp_net.memory.memory.{used,free,cached,buffered})),"Memory Used")
alias(asPercent(sumSeries(collectd.nodexx*_internal_cloudapp_net.memory.memory.{cached,buffered}), sumSeries(collectd.nodexx*_internal_cloudapp_net.memory.memory.{used,free,cached,buffered})),"Memory Cached")
alias(asPercent(sumSeries(collectd.nodexx*_internal_cloudapp_net.memory.memory.free), sumSeries(collectd.nodexx*_internal_cloudapp_net.memory.memory.{used,free,cached,buffered})),"Memory Free")
