How does a Raft node in candidate state process a RequestVote from another candidate?

Suppose a Raft node is in candidate state with currentTerm 4 and has voted for itself, and it then receives a RequestVote whose term is 5.
Will this node vote for the candidate whose term is 5 and set its currentTerm to 5?

A Raft node always updates its term when a higher one is announced. Whether it will also vote for the other node depends on log state. To grant the vote, two conditions must be met: the current node has not yet voted in the new current term, and the candidate's log is at least as up-to-date as the current node's.
In the original Raft paper every node carries persistent state, and one of the registers in that state is currentTerm, defined as the latest term ever seen by the node.
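The rule above (adopt the higher term first, then decide the vote) can be sketched in Python. This is an illustrative sketch only; the class and method names are made up, not taken from any particular implementation:

```python
# Illustrative sketch only; Node and handle_request_vote are invented names.
class Node:
    def __init__(self):
        self.current_term = 4        # the scenario from the question
        self.voted_for = "self"      # candidate voted for itself in term 4
        self.state = "candidate"
        self.log = []                # list of (term, command) entries

    def last_log_term(self):
        return self.log[-1][0] if self.log else 0

    def handle_request_vote(self, term, candidate_id,
                            last_log_index, last_log_term):
        # A higher term always wins: step down and clear the old vote.
        if term > self.current_term:
            self.current_term = term
            self.state = "follower"
            self.voted_for = None
        # Reject candidates from older terms.
        if term < self.current_term:
            return self.current_term, False
        # Grant the vote only if we haven't voted in this term (or voted
        # for this same candidate) AND the candidate's log is at least
        # as up-to-date as ours.
        log_ok = (last_log_term, last_log_index) >= \
                 (self.last_log_term(), len(self.log))
        if self.voted_for in (None, candidate_id) and log_ok:
            self.voted_for = candidate_id
            return self.current_term, True
        return self.current_term, False
```

So for the question's scenario: receiving term 5 makes the node adopt currentTerm = 5 and drop back to follower, and it grants the vote only if the log comparison also passes.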


Raft: how does a candidate's term change?

In the picture, at step (b), S5 is elected as leader with term 3. Where does this 3 come from?
In the paper, when a follower times out waiting for messages from the leader, it increments its term and becomes a candidate. So I think the term should be 2, and S5 could still win the election and become leader, because nodes S3 and S4 would vote for it.
The description of the picture has a single indicator for the term: at step (b) the term is 3, which means that at step (a) the term is 2. As simple as that. Based on the description, it seems this picture clarifies some previous example, and in that example the term got to 2 somehow.
A few observations on the picture, to avoid any confusion.
The picture itself has no indication of the current term for either step. The horizontal numbers are indexes, not terms. The picture would be clearer if the author used [term, value] for each index.
We know the term is 3 at step (b). This means the term became 4 when S1 was re-elected. The rest of the picture explains why scenario (d) won't happen and why the value (2), accepted by S1 in term 4 and accepted by a majority, won't ever be lost - which is the main property of the protocol: if a value is accepted by a majority, the value won't be lost.
Quick note: every index holds a tuple [term, value]. If a specific entry is replicated to a majority, then the leader and a majority of followers have the same [term, value] at the same index. If the leader fails before the entry is committed, a new leader that still has the entry will re-replicate it, keeping its original term, and the entry becomes committed as soon as the new leader commits an entry from its own term (section 5.4.2 of the paper). So once an entry from the current leader's term is accepted by a majority, the value is there to stay.

Raft: will the term keep increasing all the time if partitioned?

Will a partitioned server keep increasing its term all the time?
If so, I have another point of confusion.
Chapter 3.6 (safety) in the Raft paper says:
Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.
It got me thinking about a scenario where one server from a partitioned network wins the election because of its huge term, causing inconsistency. Can that happen?
Edit:
When a node with a larger term rejoins the cluster, it will force the current leader to step down: when the leader sends a request to a follower whose term is larger, the follower rejects the request, the leader sees the larger term and steps down, and a new election happens.
Some Raft implementations add an extra step before a follower becomes a candidate: if a follower does not hear from the leader, it first tries to contact the other followers, and only if a quorum is reachable does it become a candidate.
(I read your link.) Yes, an implementation may keep increasing the term; or it can be more practical and wait until a majority is reachable - there is no point initiating an election if no majority is available.
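That extra "check connectivity before campaigning" step (often called pre-vote) can be illustrated with a minimal sketch; the function name is invented:

```python
# Minimal sketch of the connectivity check some Raft implementations
# perform before starting an election; should_start_election is an
# invented name.
def should_start_election(reachable_peers, cluster_size):
    # This node plus the peers it can currently contact must form a
    # majority; otherwise campaigning just burns through term numbers.
    return reachable_peers + 1 > cluster_size // 2
```

In a 5-node cluster, a follower that can reach only one other node stays put instead of incrementing its term in vain.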
I decided to comment because of this line in your question: "win the election because of the huge term". A long time ago, when I read the Raft paper (https://raft.github.io/raft.pdf), I was quite confused by the election process.
Section 5.2 talks about leader election, and it has these words: "Each server will vote for at most one candidate in a given term, on a first-come-first-served basis (note: Section 5.4 adds an additional restriction on votes)."
Section 5.4: "The previous sections described how Raft elects leaders and replicates log entries. However, the mechanisms described so far are not quite sufficient to ensure that each state machine executes exactly the same commands in the same order..."
Basically, if a reader reads the paper section by section and stops to think after each one, the reader will be a bit confused. At least I was :)
As a general conclusion: the absolute value of the new term does not make any difference; it is only used to reject older terms. For the actual leader election (and the new term being started), it is the state of the log that actually matters.
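The up-to-date comparison from the paper can be sketched as follows. Note it compares the terms of the *last log entries*, not currentTerm, which is exactly why a partitioned node with a huge currentTerm but a stale log still loses the election (the function name is illustrative):

```python
# Sketch of the log comparison Raft uses when granting votes
# (section 5.4.1); candidate_log_up_to_date is an invented name.
def candidate_log_up_to_date(cand_last_term, cand_last_index,
                             my_last_term, my_last_index):
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term    # later last term wins
    return cand_last_index >= my_last_index     # same term: longer log wins
```

A node may come back from a partition with currentTerm = 100, but if its last log entry is still from term 2, it loses the vote to any node whose log ends in a later term.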

Could Raft elect a leader with an uncommitted log entry?

Suppose a cluster of 5 nodes (A, B, C, D, E). Node A is elected leader at the beginning, and while leader A issues AppendEntries RPCs to the followers (B, C, D, E) to replicate a log entry (log-X), only node B receives it and returns success; at this point leader A crashes.
If node C (or D, or E) wins the next leader election then everything is fine, because only node B has log-X, which means log-X is not committed.
My question is: could node B (which has the highest term and the longest log) win the next leader election? If so, will node B spread log-X to the other nodes?
Yes, B could win the election. If it does become leader, the first thing it does is create a log entry with the new term in its log and start replicating its log to all the followers. As B's log includes log-X, if all goes well, the log-X entry will eventually be replicated and considered committed.
If node C wins the election, then when it becomes leader it won't have the log-X entry, and it will end up overwriting that entry on node B.
See section 5.4.2 of the raft paper for more details.
Also, this means you can't treat a failure as meaning the attempted entry definitely doesn't exist - only that the caller doesn't know the outcome. Section 8 has some suggestions for handling this.
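The first step described above - a newly elected leader appending an entry from its own new term - can be sketched as follows (all names are invented for the example):

```python
# Sketch (invented names): a newly elected leader appends an entry of
# its own new term before replicating. Raft only counts replication of
# current-term entries toward commitment, so committing this new entry
# also commits earlier entries such as log-X (section 5.4.2).
def become_leader(log, new_term):
    log.append((new_term, "no-op"))
    return log

log_b = [(1, "log-X")]            # B's log: the not-yet-committed entry
become_leader(log_b, new_term=2)  # B wins the election in term 2
```

Once the (2, "no-op") entry reaches a majority, log-X is committed as a side effect, which is exactly why B spreading its log is safe.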

In Raft leader election, how does a live leader respond to a RequestVote RPC from a candidate?

I am reading the Raft paper.
For the RequestVote RPC, the receiver implementation is:
1. Reply false if term < currentTerm (§5.1)
2. If votedFor is null or candidateId, and candidate's log is at least as up-to-date as receiver's log, grant vote (§5.2, §5.4)
In some situations the candidate's term is equal to the leader's currentTerm, so how does the leader respond to a RequestVote RPC from such a candidate?
Let's pick this apart in more human terms:
If the vote is from an older term (term < currentTerm), ignore it.
If we haven't voted in this term (votedFor is null), or if it's a vote for the same candidate we voted for last time in this term (votedFor == candidateId), then grant the vote as long as the candidate log is up-to-date.
Remember that leaders vote for themselves in a given term.
That means that for term == currentTerm, the leader will have votedFor equal to itself. This is not null, so the only way it will grant this vote is if candidateId is itself - i.e., it's casting a vote for itself in the current term. In all other cases, it will not grant the vote.
The high-level thing to remember (in fact, the key invariant in all of this) is that a server never votes more than once in the same term. Once it has cast its vote for a term, that vote is final. And since a leader votes for itself, when it receives other requests for the same term it won't grant them.
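The reasoning above can be condensed into a small sketch (names are invented): a leader that voted for itself in the current term denies equal-term requests from anyone else, and would only "grant" a vote to its own candidateId.

```python
# Condensed sketch of the two receiver rules discussed above;
# receive_request_vote is an invented name.
def receive_request_vote(current_term, voted_for, term, candidate_id,
                         candidate_log_ok=True):
    if term < current_term:
        return False      # rule 1: reject requests from older terms
    if voted_for in (None, candidate_id) and candidate_log_ok:
        return True       # rule 2: at most one vote per term
    return False
```

For a leader "L" with currentTerm 3 and votedFor "L": a request from candidate "C" with term 3 is denied, while term 3 with candidateId "L" would be granted.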

Who can explain 'Replication' in the Dynamo paper?

In dynamo paper : http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
The Replication section says:
To account for node failures, the preference list contains more than N nodes.
I want to know why. And does this 'node' mean a virtual node?
It is for increasing Dynamo's availability. If the top N nodes in the preference list are healthy, the other nodes are not used. But if some of those N nodes are unavailable, nodes further down the list are used instead. For write operations this is called hinted handoff.
This reasoning holds both for physical nodes and for virtual nodes.
I also don't understand the part you're talking about.
Background:
My understanding of the paper is that, since Dynamo's default replication factor is 3, each node N is responsible for the ring range from N-3 to N (while also being the coordinator for the ring range N-1 to N).
That explains why:
node B holds keys from F to B
node C holds keys from G to C
node D holds keys from A to D
And since range A-B falls within all those ranges, nodes B, C and D are the ones that have that range of key hashes.
The paper states in section 4.3, Replication:
To address this, the preference list for a key is constructed by skipping positions in the ring to ensure that the list contains only distinct physical nodes.
How can the preference list contain more than N nodes if it is constructed by skipping virtual ones?
IMHO they should have stated something like this:
To account for node failures, the ring range N-3 to N may contain more than N nodes: N physical nodes plus x virtual nodes.
The distributed DBMS Dynamo falls into the class of systems that sacrifice consistency (in CAP terms).
So the system can be inconsistent even though it is highly available. Because network partitions are a given in distributed systems, you cannot avoid picking partition tolerance.
Addressing your questions:
To account for node failures, preference list contains more than N nodes. I want to know why?
One fact of large-scale distributed systems is that in a system of thousands of nodes, failure of nodes is the norm.
You are bound to have a few nodes failing in such a big system. You don't treat it as an exceptional condition; you prepare for such situations. How do you prepare?
For Data: You simply replicate your data on multiple nodes.
For Execution: You perform the same execution on multiple nodes. This is called speculative execution. As soon as you get the first result from the multiple executions you ran, you cancel the other executions.
That's the answer right there - you replicate your data to prepare for the case when node(s) may fail.
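The "first result wins" idea behind speculative execution can be illustrated with a toy Python sketch; the node names and the task are made up:

```python
# Toy illustration of speculative execution: run the same task on
# several "nodes" and keep the first result that completes.
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_on(node):
    # stand-in for the real remote call
    return "result-from-" + node

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(run_on, n) for n in ("n1", "n2", "n3")]
    first = next(as_completed(futures)).result()
    for f in futures:
        f.cancel()   # cancel the executions we no longer need
```

Whichever "node" answers first determines the result; the other attempts are simply discarded.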
To account for node failures, preference list contains more than N nodes. Does this 'node' mean virtual node?
I wanted to ensure that I always have access to my house, so I copied my house keys and gave them to another family member. He put those keys in a safe in our house. Now when we all go out, I'm under the illusion that we have another set of keys, so if I lose mine we can still get into the house. But... those keys are in the house itself. Losing my keys simply means I lose access to my house. This is what would happen if we replicated the data on virtual nodes instead of physical nodes.
A virtual node is not a separate physical node, so when the physical node that a virtual node is mapped to fails, the virtual node goes away as well.
This 'node' cannot mean a virtual node if the aim is high availability, which is the aim in Dynamo.
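Putting the two answers together, the preference-list construction from section 4.3 can be sketched like this. The ring layout, names, and function are invented for the example:

```python
# Invented example of the section 4.3 construction: walk the ring
# clockwise from the key's position, skipping virtual nodes whose
# physical node is already in the list, until N distinct physical
# nodes are collected.
def preference_list(ring, key_pos, n):
    """ring: list of (position, virtual_node, physical_node), sorted."""
    picked, seen = [], set()
    # first ring position at or after the key's hash position
    start = next((i for i, (pos, _, _) in enumerate(ring)
                  if pos >= key_pos), 0)
    for i in range(len(ring)):
        _, _vnode, pnode = ring[(start + i) % len(ring)]
        if pnode not in seen:          # skip duplicate physical nodes
            seen.add(pnode)
            picked.append(pnode)
        if len(picked) == n:
            break
    return picked

# Two virtual nodes of physical node "A" sit next to each other:
ring = [(10, "A1", "A"), (20, "A2", "A"), (30, "B1", "B"), (40, "C1", "C")]
```

For a key hashing just below position 10 with N = 3, the walk visits A1, A2, B1, C1 but returns only ["A", "B", "C"]: the list spans more than N ring positions precisely because duplicate virtual nodes are skipped, which is how it "contains more than N nodes" while naming only N distinct physical ones.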
