studying raft, I can’t understand one thing, for example, I have a cluster of 6 nodes, and I have 3 partitions with a replication factor of 3, let’s say that a network error has occurred, and now 3 nodes do not see the remaining 3 nodes, until they remain available for clients, and for example, the record SET 5 came to the first formed cluster , and in this case it will pass? because replication factor =3 and majority will be 2? it turns out you can get split brain using raft protocol?
In case of 6 nodes, the majority is 4. So if you have two partitions of three nodes, neither of those partitions will be able to elect a leader or commit new values.
When a raft cluster is created, it is configured with specific number of nodes. And the majority of those nodes is required to either elect a leader or commit a log message.
In case of a raft cluster, every node has a replica of data. I guess we could say that the replication factor is equal to cluster side in a raft cluster. But I don't think I've ever seen replication factor term being used in consensus use case.
Few notes on cluster size.
Traditionally, cluster size is 2*N+1, where N is number of nodes a cluster can lose and still be operational - as the rest of nodes still have majority to elect a leader or commit log entries. Based on that, a cluster of 3 nodes may lose 1 node; a cluster of 5 may lose 2.
There is no much point (from consensus point of view) to have cluster of size 4 or 6. In case of 4 nodes total, the cluster may survive only one node going offline - it can't survive two as the other two are not the majority and they won't be able to elect a leader or agree on progress. Same logic applies for 6 nodes - that cluster can survive only two nodes going off. Having a cluster of 4 nodes is more expensive as we can have support the same single node outage with just 3 nodes - so cluster of 4 is a just more expensive with no availability benefit.
There is a case when cluster designers do pick cluster of size 4 or 6. This is when the system allows stale reads and those reads can be executed by any node in a cluster. To support larger scale of potentially stale reads, a cluster owner adds more nodes to handle the load.
Related
I have implemented a data-structure that is kind of a non-binary tree and my algorithm basically goes through all of it's branches (I will provide more details on how it works below).
So, for a base case of a mother-node with three children the data structure would look something like this:
And the algorithm would go through three interactions until it eventually stopped (the function that goes through each node is recursive and the recursion stops when it finds a node that has no children/no nodes below it):
So, in a second case scenario, if the the tree was a little more complex, something like this:
Obs: In the case above, more than one parent node is arriving at the same child node, but this second interaction/arriving at the same child node through a different parent node is necessary, since it could add additional information/features to the child node and thus cannot be skipped.
And the algorithm would run through seven interactions in this second case:
So, basically, the algorithm enters in a for loop every time it needs to access the sons of a node. In case two, for example, there's a first for that access nodes 2, 3 and 4 from node 1 and then a second for the access nodes 5 and 6 from node 2. When all nodes below node 2 have already been accessed, and since nodes 5 and 6 do not have child-nodes, the for from node 1 access node 3 and then a third for begins, which access node 6 from node 3. Finally, when all nodes below node 3 have already been accessed, the for loop from node 1 will access node 4 and a fourth for loop begins, which eventually access node 7 and then the algorithm as a whole stops.
In addition to the complexity of the algorithm I would also like to know if something like this could run in a web application or would it be too slow or too complex to run in a server? I don't know, it seems to me that the computational cost is too high and either running this in a large scale would fry my PC or it would eventually run but after a considerable amount of time.
I hope that I was able to explain the problem I'm dealing with, but if more information is needed, do not hesitate to ask me in the comments. Thanks to all of you for your attention in advance.
I came across this intersting database the other day, and have read some doc on its offical site, I have some questions regarding Raft Group in TiKV (here),
Suppose that we have a cluster which has like 100 nodes, and the replication factor is 3, does that mean we will end up with a lot of tiny Raft "bubbles", each of them contains only 3 members, and they do leader election and log replication inside the "buble".
Or, we only have one single fat Raft "buble" which contains 100 nodes?
Please help to shed some light here, thank you!
a lot of tiny Raft "bubbles", each of them contains only 3 members,
The tiny Raft bubble in your context is a Raft group in TiKV, comprised of 3 replicas(by default). Data is auto-sharded in Regions in TiKV, each Region corresponding to a Raft group. To support large data volume, Multi-Raft is implemented. So you can perceive Multi-raft as tiny Raft "bubbles" that are distributed evenly across your nodes.
Check the image for Raft in TiKV here
we only have one single fat Raft "buble" which contains 100 nodes?
No, a Raft group does not contain nodes, they are contained in nodes.
For more details, see: What is Multi-raft in TiKV
In this case it means that you have 33 shards ("bubbles") of 3 nodes each.
A replication factors of 3 is quite common in distributed systems. In my experience, databases use replication factors of 3 (in 3 different locations) as a sweet spot between durability and latency; 6 (in 3 locations) when they lean heavily towards durability; and 9 (in 3 locations) when they never-ever want to lose data. The 9-node databases are extremely stable (paxos/raft-based) and I have only seen them used as configuration for the 3-node and 6-node databases which can use a more performant protocol (though, raft is pretty performant, too).
I understand that the bench mark for commodity hardware is around 10 nodes able to process 1 million tuples (each of size 10 MB?) per second. However, the term commodity hardware is vague and just to add a pinch of salt, is each node 8 core?
Also is this bench mark speed for a fully transactional base Storm cluster or a cluster that is configured for maximum efficiency?
I'm starting with Riak and so far everything is going great. I'm not concerned about performance at the moment because I'm mainly using it as a backup store. I've read all docs I could find (the littleriakbook.com was great to explain the concepts) but I still seem to not grasp some parts.
The situation is that I can only use 3 physical nodes/servers at the moment (instead of the recommended 5). I want that all data is replicated to all three nodes. Essentially if up to 2 nodes go down, I want to still be able to read and write to the remaining node. And if the nodes are coming up again they should synchronise again.
I've set it all up and riak-admin diag shows me that not all data is fulfilling the n_val requirement. How can I make sure that all three nodes are (eventually) identical copies? Is it possible to trigger a redristribution of the data that doesn't fulfil the requirements?
With only 3 nodes, it is not possible to fulfil the n_val requirement and ensure that the three copies stored of any object always will be on different nodes. The reason for this is in how Riak distributes replicas.
When storing or retrieving an object, Riak will calculate a hash value based on the bucket and key, and map this value to a specific partition on the ring. Once this partition has been determined the other N-1 replicas are always placed on the following N-1 partitions. If we assume we have a ring size of 64 and name these partitions 1-64, an object that hashes into partition 10 and belongs to a bucket with n_val set to 3 will also be stored in partitions 11 and 12.
With 3 nodes you will often see that the partitions are spread out alternating between the physical nodes. This means that for most partitions, the replicas will be on different physical nodes. For the last 2 partitions of the ring, 63 and 64 in our case, storage will however need to wrap around onto partitions 1 and 2. As 64 can not be evenly divided by 3, objects that do hash into these last partitions will therefore only be stored on 2 different physical nodes.
When a node fails or becomes unavailable in Riak, the remaining nodes will temporarily take responsibility for the partitions belonging to the lost node. These are known as fallback partitions and will initially be empty. As data is updated or inserted, these partitions will keep track of it and hand it back to the owning node once it becomes available. If Active Anti-Entropy is enabled, it will over time synchronise the fallback partition with the other partitions in the background.
In dynamo paper : http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
The Replication section said:
To account for node failures, preference list contains more than N
nodes.
I want to know why ? and does this 'node' mean virtual node ?
It is for increasing Dynamo availability. If the top N nodes in the preference list are good, the other nodes will not be used. But if all the N nodes are unavailable, the other nodes will be used. For write operation, this is called hinted handoff.
The diagram makes sense both for physical nodes and virtual nodes.
I also don't understand the part you're talking about.
Background:
My understanding of the paper is that, since Dynamo's default replication factor is 3, each node N is responsible for circle ranging from N-3 to N (while also being the coordinator to ring range N-1 to N).
That explains why:
node B holds keys from F to B
node C holds keys from G to C
node D holds keys from A to D
And since range A-B falls within all those ranges, nodes B, C and D are the ones that have that range of key hashes.
The paper states:
The section 4.3 Replication:
To address this, the preference list for a key is constructed by skipping positions in the ring to ensure that the list contains only distinct physical nodes.
How can preference list contain more than N nodes if it is constructed by skipping virtual ones?
IMHO they should have stated something like this:
To account for node failures, ring range N-3 to N may contain more than N nodes, N physical nodes plus x virtual nodes.
Distributed DBMS Dyanmo DB falls in that class which sacrifices Consistency. Refer to the image below:
So, the system is inconsistent even though it is highly available. Because network partitions are a given in Distributed Systems you cannot not pick Partition Tolerance.
Addressing your questions:
To account for node failures, preference list contains more than N nodes. I want to know why?
One fact of Large Scale Distributed Systems is that in a system of thousands of nodes, failure of a nodes is a norm.
You are bound to have a few nodes failing in such a big system. You don't treat it as an exceptional condition. You prepare for such situations. How do you prepare?
For Data: You simply replicate your data on multiple nodes.
For Execution: You perform the same execution on multiple nodes. This is called speculative execution. As soon as you get the first result from the multiple executions you ran, you cancel the other executions.
That's the answer right there - you replicate your data to prepare for the case when node(s) may fail.
To account for node failures, preference list contains more than N nodes. Does this 'node' mean virtual node?
I wanted to ensure that I always have access to my house. So I copied my house's keys and gave them to another family member of mine. This guy put those keys in a safe in our house. Now when we all go out, I'm under the illusion that we have other set of keys so in case I lose mine, we can still get in the house. But... those keys are in the house itself. Losing my keys simply means I lose the access to my house. This is what would happen if we replicate the data on virtual nodes instead of physical nodes.
A virtual node is not a separate physical node and so when the real node on which this virtual node has been mapped to will fail, the virtual node will go away as well.
This 'node' cannot mean virtual node if the aim is high availability, which is the aim in Dynamo DB.