Riak: Using n_val = 3 and only 3 nodes

I'm starting with Riak and so far everything is going great. I'm not concerned about performance at the moment because I'm mainly using it as a backup store. I've read all the docs I could find (the littleriakbook.com was great for explaining the concepts) but I still don't seem to grasp some parts.
The situation is that I can only use 3 physical nodes/servers at the moment (instead of the recommended 5). I want all data to be replicated to all three nodes. Essentially, if up to 2 nodes go down, I want to still be able to read and write to the remaining node. And when the nodes come back up, they should synchronise again.
I've set it all up and riak-admin diag shows me that not all data is fulfilling the n_val requirement. How can I make sure that all three nodes are (eventually) identical copies? Is it possible to trigger a redistribution of the data that doesn't fulfil the requirements?

With only 3 nodes, it is not possible to fulfil the n_val requirement and ensure that the three copies stored of any object will always be on different nodes. The reason for this lies in how Riak distributes replicas.
When storing or retrieving an object, Riak will calculate a hash value based on the bucket and key, and map this value to a specific partition on the ring. Once this partition has been determined, the other N-1 replicas are always placed on the following N-1 partitions. If we assume we have a ring size of 64 and name these partitions 1-64, an object that hashes into partition 10 and belongs to a bucket with n_val set to 3 will also be stored in partitions 11 and 12.
With 3 nodes you will often see the partitions spread out, alternating between the physical nodes. This means that for most partitions, the replicas will be on different physical nodes. For the last 2 partitions of the ring, 63 and 64 in our case, placement will however need to wrap around onto partitions 1 and 2. As 64 cannot be evenly divided by 3, objects that hash into these last partitions will therefore only be stored on 2 different physical nodes.
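To make the wrap-around concrete, here is a minimal sketch in Python (my own illustration, not Riak code), assuming a 64-partition ring owned by 3 nodes in alternating order and replicas placed on the next N-1 consecutive partitions as described above:

# Minimal sketch: consecutive-partition replica placement on a 64-partition ring.
RING_SIZE = 64
NODES = ['node1', 'node2', 'node3']
N_VAL = 3

# Partitions are named 1-64 as in the text; ownership alternates between nodes.
owner = {p: NODES[(p - 1) % len(NODES)] for p in range(1, RING_SIZE + 1)}

def replica_nodes(primary_partition):
    """Physical nodes holding the N_VAL replicas placed on consecutive partitions."""
    partitions = [((primary_partition - 1 + i) % RING_SIZE) + 1 for i in range(N_VAL)]
    return [owner[p] for p in partitions]

print(replica_nodes(10))  # partitions 10, 11, 12 -> ['node1', 'node2', 'node3']
print(replica_nodes(63))  # partitions 63, 64, 1  -> ['node3', 'node1', 'node1'] (only 2 distinct nodes)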
When a node fails or becomes unavailable in Riak, the remaining nodes will temporarily take responsibility for the partitions belonging to the lost node. These are known as fallback partitions and will initially be empty. As data is updated or inserted, these partitions will keep track of it and hand it back to the owning node once it becomes available. If Active Anti-Entropy is enabled, it will over time synchronise the fallback partition with the other partitions in the background.

Related

Raft protocol split brain

While studying Raft, there is one thing I can't understand. For example, I have a cluster of 6 nodes and 3 partitions with a replication factor of 3. Say a network error occurs, and now 3 nodes cannot see the remaining 3 nodes, while both halves stay available to clients. If the write SET 5 arrives at the first half of the cluster, will it succeed, given that the replication factor is 3 and the majority of replicas would be 2? Does that mean you can get split brain with the Raft protocol?
In the case of 6 nodes, the majority is 4. So if you end up with two partitions of three nodes each, neither of those partitions will be able to elect a leader or commit new values.
When a Raft cluster is created, it is configured with a specific number of nodes, and a majority of those nodes is required to either elect a leader or commit a log entry.
In a Raft cluster, every node holds a replica of the data. I guess we could say the replication factor is equal to the cluster size, but I don't think I've ever seen the term replication factor used in a consensus context.
A few notes on cluster size.
Traditionally, cluster size is 2*N+1, where N is the number of nodes the cluster can lose and still be operational, since the remaining nodes still form a majority that can elect a leader and commit log entries. Based on that, a cluster of 3 nodes may lose 1 node; a cluster of 5 may lose 2.
There is not much point (from a consensus point of view) in having a cluster of size 4 or 6. With 4 nodes in total, the cluster can survive only one node going offline; it cannot survive two, because the remaining two are not a majority and cannot elect a leader or agree on progress. The same logic applies to 6 nodes: that cluster can survive only two nodes going offline. A cluster of 4 is also more expensive, since the same single-node outage can be tolerated with just 3 nodes, so a cluster of 4 simply costs more with no availability benefit.
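The quorum arithmetic behind these numbers can be illustrated with a short sketch (my own illustration, assuming nothing beyond the majority rule described above):

# Quorum math: majority needed, and how many failures a cluster of a given size tolerates.
def majority(cluster_size):
    return cluster_size // 2 + 1

def tolerated_failures(cluster_size):
    return cluster_size - majority(cluster_size)

for size in range(3, 8):
    print(f"{size} nodes: majority = {majority(size)}, tolerates {tolerated_failures(size)} failure(s)")

# 3 nodes: majority = 2, tolerates 1 failure(s)
# 4 nodes: majority = 3, tolerates 1 failure(s)
# 5 nodes: majority = 3, tolerates 2 failure(s)
# 6 nodes: majority = 4, tolerates 2 failure(s)
# 7 nodes: majority = 4, tolerates 3 failure(s)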
There is a case where cluster designers do pick a cluster of size 4 or 6: when the system allows stale reads and those reads can be executed by any node in the cluster. To support a larger volume of potentially stale reads, the cluster owner adds more nodes to handle the load.

AWS Neptune Query gremlins slowness on cold call

I'm currently running some queries with a big performance gap between the first call (up to 2 minutes) and the following ones (around 5 seconds).
This duration difference can be seen through the gremlin REST API in both execution and profile mode.
As the query loads a large amount of data, I expect the issue comes from Neptune's caching behaviour in its default configuration. I was not able to find any way to improve this behaviour through configuration and would be glad to get some advice on how to reduce the duration of the first call.
Context:
The Neptune database is running on a db.r5.8xlarge instance, and during execution the CPU always stays below 20%. I'm also the only user on this instance during the tests.
As we don't have differential inputs, the database is recreated on a weekly basis and switched to production once the loader has loaded everything. Our database therefore has a short lifetime.
The database contains slightly over 1,000,000,000 nodes and far more edges (probably around 10,000,000,000). Those edges are split across 10 edge labels, and most of them are not used in the current query.
Query:
// recordIds is a list of 50 ids.
g.V(recordIds).hasLabel("record")
// Convert local id to Neptune id.
.out('local_id')
// Go to the tree parent link (either the node itself if the edge comes back, or the real parent).
.bothE('tree_top_parent').inV()
// Clean duplicates.
.dedup()
// Follow the tree parent link backward to get all children; this step loads a large number of nodes belonging to the same tree.
.in('tree_top_parent')
.not(values('some flag').is('Q'))
// Limit not reached; the result is between 80k and 100k nodes.
.limit(200000)
// Convert back to local id for the 80k to 100k selected nodes.
.in('local_id')
.id()
Neptune's architecture consists of a shared cluster "volume" (where all data is persisted and replicated 6 times across 3 availability zones) and a set of decoupled compute instances (one writer and up to 15 read replicas in a single cluster). No data is persisted on the instances; however, approximately 65% of an instance's memory capacity is reserved for a buffer pool cache. As data is read from the underlying cluster volume, it is stored in the buffer pool cache until the cache fills. Once the cache is full, a least-recently-used (LRU) eviction policy clears buffer pool cache space for newer reads.
It is common for first reads to be slower due to the need to fetch objects from the underlying storage. One way to improve this is to write and issue "prefetch" queries that pull in the objects you expect to need in the near future.
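For instance, a warm-up query could be issued ahead of the latency-sensitive call. A minimal sketch using the gremlinpython client follows; the endpoint URL and the ids are placeholders, and the traversal is only loosely based on the query in the question:

# Hypothetical warm-up/prefetch sketch (endpoint and ids are placeholders).
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

conn = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g')
g = traversal().withRemote(conn)

record_ids = ['id-1', 'id-2']  # the ids the real query will use shortly

# Touch the vertices and edges the real query will need so they are pulled into
# the buffer pool cache ahead of time; iterate() executes and discards the results.
g.V(*record_ids).hasLabel('record').out('local_id').bothE('tree_top_parent').inV().iterate()

conn.close()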
If you have a use case that is filling buffer pool cache and constantly seeing buffer pool cache misses (a metric one can see in the CloudWatch metrics for Neptune), then you may also want to consider using one of the "d" instance types (ex: r5d.8xlarge) and enabling the Lookup Cache feature [1]. This feature specifically focuses on improving access to property values/literals at query time by keeping them in a directly attached NVMe store on the instance.
[1] https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-lookup-cache.html

May two DynamoDB scan segments contain the same hash key?

I'm scanning a huge table (> 1B docs) so I'm using a parallel scan (using one segment per worker).
The table has a hash key and a sort key.
Intuitively a segment should contain a set of hash keys (including all their sort keys), so one hash key shouldn't appear in more than one segment, but I haven't found any documentation indicating this.
Does anyone know how DynamoDB behaves in this scenario?
Thanks
This is an interesting question. I thought it would be easy to find a document stating that each segment contains a disjoint range of hash keys, and the same hash key cannot appear in more than one segment - but I too failed to find any such document. I am curious if anyone else can find such a document. In the meantime, I can try to offer additional intuitions on why your conjecture is likely correct - but also might be wrong:
My first intuition would be that you are right:
DynamoDB uses the hash key, also known as the partition key, to decide on which of the many storage nodes to store a copy of this data. All items sharing the same partition key (with different sort key values) are stored together, in sort-key order, so they can be queried (with Query) together in order. DynamoDB uses a hash function on the partition key to decide the placement of each item (hence the name "hash key").
Now, if DynamoDB needs to divide the task of scanning all the data into "segments", the most sensible thing for it to do is to divide the space of hash values (i.e., the hash function of the hash keys) into equal-sized pieces. This division is easy to do (just a numeric division by TotalSegments), it ensures roughly the same number of items in each segment (assuming there are many different partitions), and it ensures that scanning each segment involves a different storage node, so the parallel scan can proceed faster than a single storage node is capable of.
However, there is one indication that this might not be the entire story.
The DynamoDB documentation claims that
In general, there is no practical limit on the number of distinct sort key values per partition key value.
This means that in theory at least, your entire database, perhaps a petabyte of it, may be in a single partition with billions of different sort keys. Since Amazon's individual storage nodes do have a size limit, DynamoDB must (unless the above statement is false) support splitting a single huge partition across multiple storage nodes. This means that when GetItem is looking for a particular item, DynamoDB needs to know which sort key is on which storage node. It also means that a parallel scan might, possibly, divide this huge partition into pieces, all scanning the same partition but different sort-key ranges within it. I am not sure we can completely rule out this possibility. I am guessing it will never happen when you only have smallish partitions.
Every DynamoDB table has a "hashspace", and data is partitioned according to the hash value of the partition key. When a parallel scan is intended and the TotalSegments and Segment values are provided, the table's complete hashspace is logically divided into these segments such that the TotalSegments cover the complete hash space without overlapping. It is quite possible that some segments do not have any data corresponding to them, since there may not be any data in the hashspace allocated to the segment. This can be observed, for instance, if the TotalSegments value chosen is very high.
And for each Segment value passed in the Scan request (with the TotalSegments value being constant), each segment will return distinct items without any overlap.
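As an illustration of non-overlapping segments, here is a minimal parallel-scan sketch with boto3 (the table name, region, and segment count are placeholders, not from the original question):

# Minimal parallel scan sketch; table name, region and TOTAL_SEGMENTS are placeholders.
import boto3
from concurrent.futures import ThreadPoolExecutor

TOTAL_SEGMENTS = 8
table = boto3.resource('dynamodb', region_name='us-east-1').Table('my-huge-table')

def scan_segment(segment):
    """Scan one segment completely, following pagination."""
    items = []
    kwargs = {'Segment': segment, 'TotalSegments': TOTAL_SEGMENTS}
    while True:
        page = table.scan(**kwargs)
        items.extend(page['Items'])
        if 'LastEvaluatedKey' not in page:
            return items
        kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    results = list(pool.map(scan_segment, range(TOTAL_SEGMENTS)))

# Each segment covers a disjoint slice of the hash space, so no item should
# appear in more than one segment's results.
print(sum(len(items) for items in results), "items scanned across", TOTAL_SEGMENTS, "segments")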
FAQs
Q. What is the ideal number for TotalSegments?
-> You might need to experiment with values, find the sweet spot for your table, and the number of workers you use, until your application achieves its best performance.
Q. One or more segments do not return any records. Why?
-> This is possible if the hash range that is allocated as per the TotalSegments value does not have any items. In this case, the TotalSegments value can be decreased, for better performance.
Q. A Scan for a segment failed midway. Can the Scan for that segment alone be retried now?
-> As long as the TotalSegments value remains the same, a Scan for one of the segments can be re-run, since it would have the same hash range allocated at any given time.
Q. Can I perform a Scan for a single segment, without performing the Scan for other segments as per TotalSegments value?
-> Yes. Multiple Scan operations for different Segments are not linked/do not depend on previous/other Segment Scans.

Who can explain 'Replication' in the Dynamo paper?

In the Dynamo paper: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
the Replication section says:
To account for node failures, preference list contains more than N
nodes.
I want to know why. And does this 'node' mean a virtual node?
It is for increasing Dynamo's availability. If the top N nodes in the preference list are healthy, the other nodes will not be used. But if some of those N nodes are unavailable, nodes further down the list will be used in their place. For write operations, this is called hinted handoff.
The diagram makes sense both for physical nodes and virtual nodes.
I also don't understand the part you're talking about.
Background:
My understanding of the paper is that, since Dynamo's default replication factor is 3, each node N is responsible for the ring range from N-3 to N (while also being the coordinator for the range N-1 to N).
That explains why:
node B holds keys from F to B
node C holds keys from G to C
node D holds keys from A to D
And since range A-B falls within all those ranges, nodes B, C and D are the ones that have that range of key hashes.
The paper states in section 4.3, Replication:
To address this, the preference list for a key is constructed by skipping positions in the ring to ensure that the list contains only distinct physical nodes.
How can the preference list contain more than N nodes if it is constructed by skipping virtual ones?
IMHO they should have stated something like this:
To account for node failures, ring range N-3 to N may contain more than N nodes, N physical nodes plus x virtual nodes.
As a distributed DBMS, Dynamo DB falls in the class that sacrifices consistency (the CAP trade-off). So the system can be inconsistent even though it is highly available. Because network partitions are a given in distributed systems, you cannot opt out of partition tolerance.
Addressing your questions:
To account for node failures, preference list contains more than N nodes. I want to know why?
One fact of large-scale distributed systems is that in a system of thousands of nodes, failure of nodes is the norm.
You are bound to have a few nodes failing in such a big system. You don't treat it as an exceptional condition. You prepare for such situations. How do you prepare?
For Data: You simply replicate your data on multiple nodes.
For Execution: You perform the same execution on multiple nodes. This is called speculative execution. As soon as you get the first result from the multiple executions you ran, you cancel the other executions.
That's the answer right there - you replicate your data to prepare for the case when node(s) may fail.
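As an aside, the speculative-execution pattern mentioned above can be sketched like this (my own illustration; read_from_replica is a hypothetical placeholder for a network call):

# Speculative execution sketch: issue the same read to several replicas and
# take whichever answers first.
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def read_from_replica(replica):
    # Placeholder for a network call to one replica.
    return f"value from {replica}"

replicas = ['replica-1', 'replica-2', 'replica-3']

with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
    futures = [pool.submit(read_from_replica, r) for r in replicas]
    done, pending = wait(futures, return_when=FIRST_COMPLETED)
    for f in pending:
        f.cancel()  # cancel requests we no longer need (only succeeds if not started yet)
    result = next(iter(done)).result()

print(result)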
To account for node failures, preference list contains more than N nodes. Does this 'node' mean virtual node?
I wanted to ensure that I always have access to my house, so I copied my house keys and gave them to another family member. He put those keys in a safe inside our house. Now when we all go out, I'm under the illusion that we have another set of keys, so in case I lose mine we can still get into the house. But... those keys are in the house itself. Losing my keys simply means losing access to my house. This is what would happen if we replicated the data onto virtual nodes instead of physical nodes.
A virtual node is not a separate physical node, so when the physical node that the virtual node is mapped to fails, the virtual node goes away as well.
This 'node' cannot mean virtual node if the aim is high availability, which is the aim in Dynamo DB.
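To see what "skipping positions in the ring" buys you, here is a minimal sketch (my own illustration, not code from the paper): walk the ring clockwise from the key's hash and skip any virtual node whose physical node is already in the preference list.

# Preference-list sketch: each physical node owns several virtual nodes (tokens);
# the list is built by walking the ring clockwise and skipping tokens that map
# to a physical node already chosen.
import hashlib
from bisect import bisect_right

def h(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

physical_nodes = ['A', 'B', 'C', 'D']
ring = sorted((h(f'{node}-vnode-{i}'), node)
              for node in physical_nodes for i in range(4))
tokens = [token for token, _ in ring]

def preference_list(key, n=3):
    start = bisect_right(tokens, h(key))
    prefs, seen = [], set()
    for i in range(len(ring)):
        _, node = ring[(start + i) % len(ring)]
        if node not in seen:  # skip vnodes of physical nodes already in the list
            seen.add(node)
            prefs.append(node)
        if len(prefs) == n:
            break
    return prefs

print(preference_list('some-key'))  # three distinct physical nodes, e.g. ['C', 'A', 'D']

Because the walk skips duplicates, the list can span more than N ring positions, but its top N entries are always N distinct physical nodes; the positions beyond the top N are the ones that step in when a node fails.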

Pregel BSP: Difference between partitioning and assignment of user input by master to worker

The pregel paper mentions:
a) The Pregel library divides a graph into partitions, each consisting
of a set of vertices and all of those vertices’ outgoing edges...The
master determines how many partitions the graph will have, and assigns
one or more partitions to each worker machine.
and
b) The master assigns a portion of the user’s input to each worker. The
input is treated as a set of records, each of which contains an
arbitrary number of vertices and edges. The division of inputs is
orthogonal to the partitioning of the graph itself, and is typically
based on file boundaries.
I have two questions here:
1) In b), how is the master assigning a "portion of the user's input to each worker" different from it assigning "one or more partitions to each worker machine"? Do they have different functions?
I thought we have to figure out the partitions and then feed one or more partitions to each worker machine, and that is all. What am I missing?
2) If the division of inputs is based solely on file boundaries, does that mean vertices of a partition can reside on different machines (because two vertices of a partition may reside in different files and hence be processed by different worker machines)?
Question 1:
The master assigning the user input to each worker amounts to the same thing as assigning one or more partitions to each worker machine.
The user input is a graph. This graph is split into several partitions, and these partitions are distributed among the workers.
The worker is where the partitions are processed; each worker may hold one or more partitions, and each partition contains vertices. Within each partition, the active vertices are selected and their superstep computations are run.
Question 2:
No. All vertices inside a partition are on the same worker. When a worker loads an input record containing vertices that belong to partitions owned by other workers, it forwards those vertices to their owning workers. If a vertex were moved to another machine (and thus another worker), it would belong to another partition.
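A minimal sketch (my own illustration, not Pregel code; the worker names are hypothetical and the partitioning function follows the paper's default hash(ID) mod N scheme) of how the two assignments interact:

# Two separate mappings: input records -> loading worker (by file), and
# vertex -> partition -> owning worker (by hash). A loader that reads a vertex
# it does not own forwards it to the owner.
import zlib

NUM_PARTITIONS = 8
WORKERS = ['worker-0', 'worker-1', 'worker-2']

# The master assigns partitions to workers (round-robin here for simplicity).
partition_owner = {p: WORKERS[p % len(WORKERS)] for p in range(NUM_PARTITIONS)}

def partition_of(vertex_id):
    # Default Pregel partitioning: hash(ID) mod number of partitions.
    return zlib.crc32(vertex_id.encode()) % NUM_PARTITIONS

def loading_worker(file_index):
    # Input files are handed out by file boundaries, independently of partitions.
    return WORKERS[file_index % len(WORKERS)]

vertex = 'v42'
loader = loading_worker(0)
owner = partition_owner[partition_of(vertex)]
if loader == owner:
    print(f"{loader} loaded {vertex} and keeps it")
else:
    print(f"{loader} loaded {vertex} but forwards it to {owner}")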
