Riak unrecoverable disk failure - riak

I have a 3-node Riak cluster with each node using approx. 1 TB of disk. All of a sudden, one node's hard disk failed unrecoverably.
So, I added a new node using the following steps:
1) riak-admin cluster join
2) down the failed node
3) riak-admin force-replace failed-node new-node
4) riak-admin cluster plan
5) riak-admin cluster commit.
This almost fixed the problem, except that after lots of data transfers and handoffs, not all three nodes have 1 TB of disk usage anymore. Only two of them do; the third is almost empty. This means there are no longer 3 copies on disk. What commands should I run to force three replicas onto disk without waiting for read repair or anti-entropy to recreate them?

Answer obtained by posting the same question to riak-users@lists.basho.com:
(0) Three nodes are insufficient; you should have 5 nodes.
(1) You could iterate over and read every object in the cluster - this would also trigger read repair for every object.
(2) (copied from Engel Sanchez's response to a similar question, April 10th 2014):
* If AAE is disabled, you don't have to stop the node to delete the data in
the anti_entropy directories
* If AAE is enabled, deleting the AAE data in a rolling manner may trigger
an avalanche of read repairs between nodes with the bad trees and nodes
with good trees as the data seems to diverge.
If your nodes are already up, with AAE enabled and with old incorrect trees
in the mix, there is a better way. You can dynamically disable AAE with
some console commands. At that point, without stopping the nodes, you can
delete all AAE data across the cluster. At a convenient time, re-enable
AAE. I say convenient because all trees will start to rebuild, and that
can be problematic in an overloaded cluster. Doing this over the weekend
might be a good idea unless your cluster can take the extra load.
To dynamically disable AAE from the Riak console, you can run this command:
riak_core_util:rpc_every_member_ann(riak_kv_entropy_manager, disable, [], 60000).
and enable with the similar:
riak_core_util:rpc_every_member_ann(riak_kv_entropy_manager, enable, [], 60000).
That last number is just a timeout for the RPC operation. I hope this saves you some extra load on your clusters.
(3) That's going to be:
(3a) List all keys using the client of your choice
(3b) Fetch each object (see the sketch after the links below)
https://www.tiot.jp/riak-docs/riak/kv/2.2.3/developing/usage/reading-objects/
https://www.tiot.jp/riak-docs/riak/kv/2.2.3/developing/usage/secondary-indexes/
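A minimal sketch of (3a)+(3b) using the Riak 2.x Java client (the host names, the bucket name "mybucket" and the class name are placeholders, not anything from the question; with ~1 TB of data, listing keys is expensive, so run it during a quiet period). Each fetch compares the replicas and triggers read repair if they differ:

import com.basho.riak.client.api.RiakClient;
import com.basho.riak.client.api.commands.kv.FetchValue;
import com.basho.riak.client.api.commands.kv.ListKeys;
import com.basho.riak.client.core.query.Location;
import com.basho.riak.client.core.query.Namespace;

public class ForceReadRepair {
    public static void main(String[] args) throws Exception {
        // placeholder host names for the cluster members
        RiakClient client = RiakClient.newClient("node1.example.com", "node2.example.com");
        Namespace ns = new Namespace("mybucket");                 // placeholder bucket
        ListKeys.Response keys = client.execute(new ListKeys.Builder(ns).build());
        for (Location loc : keys) {
            // fetching the object forces the replicas to be compared and repaired
            client.execute(new FetchValue.Builder(loc).build());
        }
        client.shutdown();
    }
}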

Related

AWS Neptune Query gremlins slowness on cold call

I'm currently running some queries with a big performance gap between the first call (up to 2 minutes) and the following ones (around 5 seconds).
This duration difference can be seen through the Gremlin REST API in both execution and profile mode.
As the query loads a big amount of data, I expect the issue comes from the caching functionality of Neptune in its default configuration. I was not able to find any way to improve this behavior through configuration and would be glad to have some advice on reducing the length of the first call.
Context:
The Neptune database is running on a db.r5.8xlarge instance, and during execution CPU always stays below 20%. I'm also the only user on this instance during the tests.
As we don't have differential inputs, the database is recreated on a weekly basis and switched to production once the loader has loaded everything. Our database therefore has a short lifetime.
The database contains slightly over 1,000,000,000 nodes and far more edges (probably around 10,000,000,000). Those edges are split across 10 edge labels, and most of them are not used in the current query.
Query:
// recordIds is a list of 50 ids.
g.V(recordIds).hasLabel('record')
// Convert local id to neptune id.
.out('local_id')
// Go to the tree parent link (either myself if the edge comes back, or the real parent).
.bothE('tree_top_parent').inV()
// Clean duplicates.
.dedup()
// Follow the tree parent link backward to get all children; this step loads a big number of nodes belonging to the same tree.
.in('tree_top_parent')
.not(values('some flag').is('Q'))
// Limit not reached, result is between 80k and 100k nodes.
.limit(200000)
// Convert back to local id for the 80k to 100k selected nodes.
.in('local_id')
.id()
Neptune's architecture comprises a shared cluster "volume" (where all data is persisted and replicated 6 times across 3 availability zones) and a series of decoupled compute instances (one writer and up to 15 read replicas in a single cluster). No data is persisted on the instances; however, approximately 65% of the memory capacity on an instance is reserved for a buffer pool cache. As data is read from the underlying cluster volume, it is stored in the buffer pool cache until the cache fills. Once the cache fills, a least-recently-used (LRU) eviction policy clears buffer pool cache space for any newer reads.
It is common to see first reads be slower due to the need to fetch objects from the underlying storage. One can improve this by writing and issuing "prefetch" queries that pull in the objects you expect to need in the near future.
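As a rough sketch of that idea with the TinkerPop Java driver (the endpoint, the example ids and the class name are placeholders; the warm-up traversal simply touches the record vertices and their tree parents so they land in the buffer pool cache before the heavy query runs):

import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class NeptunePrefetch {
    public static void main(String[] args) throws Exception {
        // placeholder endpoint; Neptune listens on 8182 with TLS enabled
        Cluster cluster = Cluster.build("your-neptune-endpoint")
                                 .port(8182)
                                 .enableSsl(true)
                                 .create();
        Client client = cluster.connect();

        List<String> recordIds = Arrays.asList("id-1", "id-2");   // placeholder ids

        // Warm-up query: touch the record vertices and their tree parents so the
        // buffer pool cache is populated before the real (heavier) query runs.
        client.submit("g.V(recordIds).out('local_id').bothE('tree_top_parent').inV().id()",
                      Collections.singletonMap("recordIds", recordIds))
              .all().get();

        cluster.close();
    }
}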
If you have a use case that is filling buffer pool cache and constantly seeing buffer pool cache misses (a metric one can see in the CloudWatch metrics for Neptune), then you may also want to consider using one of the "d" instance types (ex: r5d.8xlarge) and enabling the Lookup Cache feature [1]. This feature specifically focuses on improving access to property values/literals at query time by keeping them in a directly attached NVMe store on the instance.
[1] https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-lookup-cache.html

JanusGraph OOM Exception w/ High Degree Vertex

This is a question regarding a java.lang.OutOfMemoryError: GC overhead limit exceeded issue observed while executing operations against JanusGraph.
Setup
I am running JanusGraph version 0.2.0 with Cassandra as the underlying storage. I am using default configuration values as described here, except for configuration describing my storage backend as cql and the storage host.
I'm inserting a large list of users into the graph - each one has (1) an email, (2) a user ID, and (3) an associated grouping label. For this scenario, all the users are in the same grouping, which we'll call groupA.
The inserts are done sequentially, each executed immediately after the previous completes.
During each insert operation, I create one vertex representing the email, one vertex representing the user ID, then create or update the vertex representing groupA. I create an edge between (1) the email <-> userID vertex, and an edge between (2) the userID <-> groupA vertex.
Observed Problem
I used a profiler to observe the heap space usage during the process. The insertions ran without a problem in the beginning, but as more insertions were made, the heap space used increased. Eventually, as more memory was used, I reached an out of memory exception after running for several hours.
--
I then repeated the insertions a second time, but this time I did not include the groupA vertex. This time, the memory usage showed a standard sawtooth pattern over time.
This leads me to conclude that the out of memory issues I observed were due to operations involving this high degree groupA vertex.
Potential Leads
I currently suspect that there is some caching process in JanusGraph which stores recently accessed elements. Since the adjacency list of the high-degree group vertex is large, there may be a large amount of data cached, which only increases as I create more and more edges from the group vertex to user ID vertices.
Using my profiler, I noticed that there was a relatively high memory usage by the org.janusgraph.graphdb.relations.RelationCache class, so this seems to be relevant.
My Question
My question is: what is the cause of this increasingly high memory usage over time with JanusGraph?
Consider frequent commits. When you insert data, it is held in the transaction cache until the transaction is committed. Also, take a look at your configuration for the db cache. You will find more information about the cache here.
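A minimal sketch of the frequent-commit approach using the JanusGraph Java API (the config path, the labels, property keys, edge names, batch size and the User type are placeholders, not your actual schema; the group vertex is also looked up or created once outside the loop for simplicity):

import org.apache.tinkerpop.gremlin.structure.T;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;
import java.util.List;

public class BatchedUserLoader {
    static class User { String email; String id; }            // placeholder for your input records

    static void load(List<User> users) throws Exception {
        JanusGraph graph = JanusGraphFactory.open("conf/janusgraph-cql.properties");
        // create (or look up) the single high-degree group vertex once
        Vertex group = graph.traversal().V().has("group", "name", "groupA").tryNext()
                .orElseGet(() -> graph.addVertex(T.label, "group", "name", "groupA"));
        int n = 0;
        for (User u : users) {
            Vertex email = graph.addVertex(T.label, "email", "address", u.email);
            Vertex userId = graph.addVertex(T.label, "user", "userId", u.id);
            email.addEdge("identifies", userId);
            userId.addEdge("memberOf", group);
            if (++n % 10_000 == 0) {
                graph.tx().commit();   // flush the transaction cache periodically
                // re-read the group vertex in the new transaction to avoid a stale reference
                group = graph.traversal().V().has("group", "name", "groupA").next();
            }
        }
        graph.tx().commit();
        graph.close();
    }
}

Committing every few thousand inserts keeps the transaction cache bounded, which is the part that was growing without limit in the single long-running transaction.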

Apache Ignite 2.4 uneven partitioning of data causing nodes to run out of memory and crash

Environment:
Apache Ignite 2.4 running on Amazon Linux. VM is 16CPUs/122GB ram. There is plenty of room there.
5 nodes, 12GB each
cacheMode = PARTITIONED
backups = 0
OnheapCacheEnabled = true
atomicityMode = ATOMIC
rebalanceMode = SYNC
rebalanceBatchSize = 1MB
copyOnRead = false
rebalanceThrottle = 0
rebalanceThreadPoolSize = 4
Basically we have a process that populates the cache on startup and then receives periodic updates from Kafka, propagating them to the cache.
The number of elements in the cache is more or less stable over time (there is just a little fluctuation since we have a mixture of create, update and delete events), but what we have noticed is that the distribution of data across the different nodes is very uneven, with one of the nodes having at least double the number of keys (and memory utilization) as the others. Over time, that node either runs out of memory, or starts doing very long GCs and loses contact with the rest of the cluster.
My expectation was that Ignite would balance the data across the different nodes, but reality shows something completely different. Am I missing something here? Why do we see this imbalance and how do we fix it?
Thanks in advance.
Bottom line, although our hash function had good distribution, the default affinity function was not yielding a good distribution of keys (and, consequently, memory) across the nodes in the cluster. We replaced it with a very naive one (partition # % # of nodes), and that improved the distribution quite a bit (less than 2% variance).
This is not a generic solution; it works for us because our entire cluster is in one VM and we don't use replication. For massive clusters crossing VM boundaries and using replication, keeping the replicated data on separate servers is mandatory, and the naive approach won't cut it.
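For illustration, a rough sketch of such a naive affinity function against Ignite's AffinityFunction interface (the class name, the partition count of 1024 and the single-copy assignment are assumptions matching the backups = 0 setup above; this is not our exact code):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.UUID;
import org.apache.ignite.cache.affinity.AffinityFunction;
import org.apache.ignite.cache.affinity.AffinityFunctionContext;
import org.apache.ignite.cluster.ClusterNode;

public class RoundRobinAffinityFunction implements AffinityFunction {
    private static final int PARTS = 1024;   // assumed partition count

    @Override public void reset() { /* stateless */ }

    @Override public void removeNode(UUID nodeId) { /* nothing cached per node */ }

    @Override public int partitions() { return PARTS; }

    @Override public int partition(Object key) {
        // spread keys over partitions using the key's own hash code
        return (key.hashCode() & Integer.MAX_VALUE) % PARTS;
    }

    @Override public List<List<ClusterNode>> assignPartitions(AffinityFunctionContext ctx) {
        List<ClusterNode> nodes = ctx.currentTopologySnapshot();
        List<List<ClusterNode>> assignment = new ArrayList<>(PARTS);
        for (int p = 0; p < PARTS; p++)
            // backups = 0, so each partition lives on exactly one node: partition # % # of nodes
            assignment.add(Collections.singletonList(nodes.get(p % nodes.size())));
        return assignment;
    }
}

It would be plugged in via CacheConfiguration.setAffinity(new RoundRobinAffinityFunction()). Note that a pure "partition % node count" mapping reshuffles many partitions whenever the topology changes, which is another reason it is not a generic solution.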

Setting a new label on all nodes takes too long in a huge graph

I'm working on a graph containing about 50 million nodes and 40 million relationships.
I need to update every node.
I'm trying to set a new label to these nodes, but it's taking too long.
The label applies to all 50 million nodes, so the operation never ends.
After some research, I found out that Neo4j treats this operation as a single transaction (I don't know whether optimistic or not), keeping the changes uncommitted until the end (which will never happen in this fashion).
I'm currently using Neo4j 2.1.4, which has a feature called "USING PERIODIC COMMIT" (already present in earlier versions). Unfortunately, this feature is coupled to the "LOAD CSV" feature, and not available to every cypher command.
The cypher is quite simple:
match n set n:Person;
I decided to use a workaround, and make some sort of block update, as follows:
match n
where not n:Person
with n
limit 500000
set n:Person;
It's ugly, but I couldn't come up with a better solution yet.
Here are some of my confs:
== neo4j.properties =========
neostore.nodestore.db.mapped_memory=250M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=900M
neostore.propertystore.db.strings.mapped_memory=1300M
neostore.propertystore.db.arrays.mapped_memory=1300M
keep_logical_logs=false
node_auto_indexing=true
node_keys_indexable=name_autocomplete,document
relationship_auto_indexing=true
relationship_keys_indexable=role
execution_guard_enabled=true
cache_type=weak
=============================
== neo4j-server.properties ==
org.neo4j.server.webserver.limit.executiontime=20000
org.neo4j.server.webserver.maxthreads=200
=============================
The hardware spec is:
RAM: 24GB
PROC: Intel(R) Xeon(R) X5650 @ 2.67GHz, 32 cores
HDD1: 1.2TB
In this environment, each block update of 500000 nodes took from 200 to 400 seconds. I think this is because every node satisfies the query at the start, but as the updates take place, more nodes need to be scanned to find the unlabeled ones (but again, it's a hunch).
So what's the best course of action whenever an operation needs to touch every node in the graph?
Any help towards a better solution to this will be appreciated!
Thanks in advance.
The most performant way to achieve this is using the batch inserter API. You might use the following recipe:
take a look at http://localhost:7474/webadmin and note the "node count". In fact it's not the number of nodes; it's rather the highest node id in use - we'll need that later on.
make sure to cleanly shut down your graph database.
take a backup copy of your graph.db directory.
write a short piece of Java/Groovy/(whatever JVM language you prefer...) program that performs the following tasks (a sketch follows below)
open your graph.db folder using the batch inserter api
in a loop from 0..<node count> (from the step above), check if the node with the given id exists; if so, grab its current labels, add the new label to the list, and use setNodeLabels to write it back.
make sure you shut down the batch inserter when done
start up your Neo4j instance again
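A minimal sketch of that program in Java against the Neo4j 2.x batch inserter API (the store path, the highest node id and the class name are placeholders to fill in from the steps above):

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.Label;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;
import java.util.ArrayList;
import java.util.List;

public class AddPersonLabel {
    public static void main(String[] args) throws Exception {
        long highestNodeId = 50_000_000L;                  // the "node count" from webadmin
        Label person = DynamicLabel.label("Person");

        BatchInserter inserter = BatchInserters.inserter("/path/to/graph.db");
        try {
            for (long id = 0; id <= highestNodeId; id++) {
                if (!inserter.nodeExists(id))
                    continue;                              // ids can have gaps
                List<Label> labels = new ArrayList<>();
                for (Label l : inserter.getNodeLabels(id))
                    labels.add(l);                         // keep the existing labels
                labels.add(person);                        // amend with the new one
                inserter.setNodeLabels(id, labels.toArray(new Label[labels.size()]));
            }
        } finally {
            inserter.shutdown();                           // mandatory clean shutdown
        }
    }
}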

How are the chances of getting "read-your-writes" consistency increased in Dynamo?

In Section 5 of Dynamo paper, there is the following content:
In particular, since each write usually follows a read operation, the
coordinator for a write is chosen to be the node that replied fastest to the
previous read operation which is stored in the context information of the
request. This optimization enables us to pick the node that has the data that
was read by the preceding read operation thereby increasing the chances of
getting "read-your-writes" consistency.
How are the chances of getting "read-your-writes" consistency increased?
"read-your-writes" means that a read following a write gets the value set by the write. In this context, the read and the write are performed by two different clients; the reason is that the choice of the write coordinator does not affect the chances of getting "read-your-writes" for the same client.
But the above text is talking about a write following a read. Here is my guess.
The read coordinator will try to do syntactic reconciliation if possible. If syntactic reconciliation is impossible because of divergent versions, the client needs to do semantic reconciliation before doing a write. Either way, the versions on all the nodes involved in the read operation are ancestors of the reconciled version, so the following write can be sent to any of them to get applied. The earliest time a write can be seen by a read is after the following steps are finished:
The client contacts the write coordinator.
The write coordinator generates the version clock for the new version.
The write coordinator writes the new version locally.
The shorter the time to perform the above steps, the more likely a following read sees the new version. Since the node that replied fastest to the previous read can very likely perform these steps in a shorter time, such a node is chosen as the write coordinator.
Section 2.3 talks about performing the reconciliation at read time rather than write time.
Data versioning - "One can determine whether two versions of an object are on parallel branches or have a causal ordering, by examining their vector clocks."
This paragraph from section 4. [emphasis mine]
In Dynamo, when a client wishes to update an object, it must specify
which version it is updating. This is done by passing the context it
obtained from an earlier read operation, which contains the vector
clock information. Upon processing a read request, if Dynamo has
access to multiple branches that cannot be syntactically reconciled,
it will return all the objects at the leaves, with the corresponding
version information in the context. An update using this context is
considered to have reconciled the divergent versions and the
branches are collapsed into a single new version
So by performing the read first, you're effectively reconciling all divergent versions prior to writing. By writing to that same node, the version you've updated is marked with the context and vector clock of the most up-to-date version, and all divergent branches can be collapsed. This is sent to the top N nodes (as you've stated) as fast as possible. But by removing the divergent branches, you reduce the chance that multiple values could be returned. You only need one of the N nodes read in the next read to get the reconciled write, i.e. the node, as part of the quorum of R reads, says: "I am the reconciled version, and all others must bow to me." (And if that version has already been distributed to another of the R nodes, then there's an even greater chance of getting the reconciled version in the quorum.)
But if you wrote to a different node, one that you hadn't read from, the vector clock being updated may not necessarily be a reconciled version of the object. Therefore, you could still have divergent branches. The following read will try to reconcile it, but it's more probable that you could have multiple divergent values and no reconciliation.
If you've made it this far, I think the most interesting part is that, per Section 6, client applications can dictate the values of N, R and W, i.e. the number of nodes that constitute the pool to draw from, and the number of nodes that must agree on a read or write for it to be successful.
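As a concrete illustration (numbers mine, not from the paper): with N = 3, R = 2 and W = 2, R + W > N, so any read quorum of 2 nodes must overlap any successful write quorum of 2 nodes in at least one node, which is exactly what makes the "one reconciled node in the read quorum" argument above work.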
Geez - my head hurts now.
I re-read the Dynamo paper. I have a new understanding of "read-your-writes" consistency: "read-your-writes" involves only one client. Imagine the following requests performed by one client on the same key:
read-1
write-1
read-2
"read-your-writes" means that read-2 sees write-1. The write coordinator has the best chance of having write-1. To ensure "read-your-writes", it is desirable that the write coordinator replies fastest to read-2. It is highly likely that the node that replied fastest to read-1 will also reply fastest to read-2, so the node that replied fastest to read-1 is chosen as the write coordinator.
And what is the node that replied fastest to the previous read operation? Such a node only makes sense if client-driven coordination is used. With server-side coordination, the coordinator node replies to the client and the other involved nodes reply to the coordinator node; "replied fastest" is meaningless in that case.
