In Raft leader election, how does a live leader respond to a RequestVote RPC from a candidate? - raft

I am reading the Raft paper. For the RequestVote RPC, it says:
Receiver implementation:
1. Reply false if term < currentTerm (§5.1)
2. If votedFor is null or candidateId, and candidate’s log is at least as up-to-date as receiver’s log, grant vote (§5.2, §5.4)
In some situations, the candidate's term is equal to the leader's currentTerm. How does the leader respond to a RequestVote RPC from such a candidate?

Let's pick this apart in more human terms:
If the vote request is from an older term (term < currentTerm), reject it (reply false).
If we haven't voted in this term (votedFor is null), or if it's a vote for the same candidate we voted for last time in this term (votedFor == candidateId), then grant the vote as long as the candidate log is up-to-date.
Remember that leaders vote for themselves in a given term.
That means that for term == currentTerm, the leader will have votedFor equal to itself. This is not null, so the only way it will grant this vote is if candidateId is itself - i.e., it's casting a vote for itself in the current term. In all other cases, it will not grant the vote.
The high-level thing to remember (in fact, the key invariant in all of this) is that a server never votes for more than one candidate in the same term. Once it has made its vote for a term, it's final. And since a leader votes for itself, when it receives other requests for the same term it won't grant them.
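The receiver rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Server class, its field names, and handle_request_vote are hypothetical names made up for this example, not taken from the paper or from any library, and persistence and networking are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Server:
    # Hypothetical minimal server state, named after the paper's variables.
    server_id: str
    current_term: int = 0
    voted_for: Optional[str] = None
    # Each log entry is a (term, command) pair.
    log: List[Tuple[int, str]] = field(default_factory=list)

    def last_log(self) -> Tuple[int, int]:
        """Return (last_term, last_index); (0, 0) for an empty log."""
        if not self.log:
            return (0, 0)
        return (self.log[-1][0], len(self.log))

    def handle_request_vote(self, term: int, candidate_id: str,
                            last_log_index: int, last_log_term: int) -> bool:
        # Rule 1: reject requests from an older term.
        if term < self.current_term:
            return False
        # A newer term starts a fresh vote (a real server would also
        # step down to follower at this point).
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        # Rule 2a: at most one candidate per term.
        if self.voted_for not in (None, candidate_id):
            return False
        # Rule 2b: the candidate's log must be at least as up-to-date.
        if (last_log_term, last_log_index) < self.last_log():
            return False
        self.voted_for = candidate_id
        return True


# A leader in term 3 has already voted for itself in that term.
leader = Server("S1", current_term=3, voted_for="S1", log=[(3, "x")])
same_term_vote = leader.handle_request_vote(3, "S2", 1, 3)
print(same_term_vote)  # False
```

The same-term request from S2 is refused purely because votedFor is already set to S1; the log comparison is never reached.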

Related

Raft: how does a candidate's term change?

In step b of the picture, S5 is elected as leader with term 3. Where does this 3 come from?
According to the paper, when a follower times out waiting for a message from the leader, it increments its term and becomes a candidate. So I think the term is 2, and it can still win the election to become leader, because nodes S3 and S4 will vote for it.
The description of the picture has a single indicator for the term: at step B the term is three; this means that at step A the term is 2. As simple as that. Based on the description, it seems this picture clarifies some previous example, and in that example the term got to 2 somehow.
Few observations on the picture to avoid any confusion.
The picture itself has no indication of the current term for either step. The horizontal numbers are indexes, not terms. The picture would be better if the author used [term, value] for each index.
We know the term is 3 at step B. This means that the term became 4 when S1 was reelected. The rest of the picture explains why d1 won't happen and why value (2), accepted by S1 in term 4 and accepted by a majority, won't ever be lost. That is the main property of the protocol: if a value is accepted by a majority, the value won't be lost.
Quick note: a value accepted by a majority won't be lost, but the term of that value may change. In other words, every index contains a tuple [term, value]. If a specific value is accepted by a majority, then the leader and a majority of followers will have the same [term, value] for the same index. If the leader fails before issuing a commit, then a new leader will emerge and will re-propose the same value with a new term. So as soon as a value is accepted by a majority, the value is there to stay, even if the term changes.

How does a Raft node in candidate state process a RequestVote from another candidate?

Suppose a Raft node is in candidate state, its currentTerm is 4, and it has voted for itself. Then it receives a RequestVote whose term is 5.
Will this node vote for the node whose term is 5 and set its currentTerm to 5?
A node in Raft always updates its term when a higher one is announced. Whether or not it votes for the other node depends on the log state. To vote for another node, two conditions must be met: the current node has not yet voted in the current term, and the candidate's log is at least as up-to-date as the current node's. Because the node has just moved to term 5, its earlier vote for itself in term 4 no longer counts.
In the original Raft paper, every node has a state, and one of the registers in that state is "currentTerm". Its definition is: the latest term ever seen by the node.
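As a rough sketch of the term-4 candidate receiving a term-5 RequestVote: the NodeState class, its fields, and on_request_vote below are hypothetical names chosen for this example, and the node's log is reduced to the term and index of its last entry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeState:
    # Hypothetical, simplified view of one node.
    role: str
    current_term: int
    voted_for: Optional[str]
    last_log_term: int
    last_log_index: int


def on_request_vote(node: NodeState, term: int, candidate_id: str,
                    cand_last_term: int, cand_last_index: int) -> bool:
    # A higher term is always adopted, whatever happens to the vote.
    if term > node.current_term:
        node.current_term = term
        node.voted_for = None      # the old vote belonged to the old term
        node.role = "follower"     # a candidate abandons its own election
    if term < node.current_term:
        return False
    if node.voted_for not in (None, candidate_id):
        return False
    # Grant only if the candidate's log is at least as up-to-date.
    if (cand_last_term, cand_last_index) < (node.last_log_term, node.last_log_index):
        return False
    node.voted_for = candidate_id
    return True


# Candidate in term 4 that voted for itself; logs are equally up-to-date.
n1 = NodeState("candidate", 4, "n1", last_log_term=3, last_log_index=7)
granted = on_request_vote(n1, 5, "n2", cand_last_term=3, cand_last_index=7)

# Same situation, but this node's log is ahead of the requester's.
n3 = NodeState("candidate", 4, "n3", last_log_term=4, last_log_index=9)
refused = on_request_vote(n3, 5, "n2", cand_last_term=3, cand_last_index=7)
```

In both cases the node ends up in term 5 as a follower; only the first one grants the vote.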

Raft: will the term keep increasing if a server is partitioned?

Will the partitioned server increase term all the time?
If so, that leads to another point of confusion.
Chapter 3.6 (safety) in the Raft paper says:
Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.
It got me thinking about a scenario where a server from a partitioned network wins the election because of its huge term, causing inconsistency. Can that happen?
Edit part:
When a node with a larger term rejoins the cluster, that will force the current leader to step down. When a leader sends a request to a follower that has a larger term, the follower rejects the request and the leader sees the larger term. That forces the leader to step down, and a new election will happen.
Some Raft implementations have an extra step before a follower becomes a candidate: if a follower does not hear from the leader, it tries to connect to the other followers, and only if there is a quorum does the follower become a candidate.
(I read your link), and yes, an implementation may keep increasing the term, or be more practical and wait until a majority is reachable. There is no point in initiating an election if no majority is available.
I decided to comment because of this line in your question: "win the election because of the huge term". A long time ago, when I read the Raft paper (https://raft.github.io/raft.pdf), I was quite confused by the election process.
Section 5.2 talks about leader election, and it has these words: "Each server will vote for at most one candidate in a given term, on a first-come-first-served basis (note: Section 5.4 adds an additional restriction on votes)."
Section 5.4: "The previous sections described how Raft elects leaders and replicates log entries. However, the mechanisms described so far are not quite sufficient to ensure that each state machine executes exactly the same commands in the same order..."
Basically, if a reader reads the paper section by section and stops to think after each one, the reader will be a bit confused. At least I was :)
As a general conclusion: the absolute value of a new term does not make any difference. It is only used to reject older terms. For the actual leader election (and a new term being started), it's the state of the log that actually matters.
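To make that conclusion concrete, here is a small sketch of the "more up-to-date" comparison quoted in the question. The function name and the example numbers are made up for illustration. Note that the candidate's currentTerm is not even an input: only the term and index of the last log entry are compared.

```python
def candidate_log_is_up_to_date(cand_last_term: int, cand_last_index: int,
                                my_last_term: int, my_last_index: int) -> bool:
    """True if the candidate's log is at least as up-to-date as the voter's."""
    # Different last terms: the log with the later last term wins.
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    # Same last term: the longer log wins (equal length is good enough).
    return cand_last_index >= my_last_index


# Hypothetical numbers: a partitioned node has bumped currentTerm to 50,
# but its last log entry is still (term 2, index 5). A voter that kept
# receiving entries has a last entry of (term 2, index 8).
stale_candidate_ok = candidate_log_is_up_to_date(2, 5, 2, 8)
print(stale_candidate_ok)  # False: the huge currentTerm does not help

# In the other direction, the voter's log is good enough for the stale node.
fresh_candidate_ok = candidate_log_is_up_to_date(2, 8, 2, 5)
print(fresh_candidate_ok)  # True
```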

How does Raft handle a prolonged network partition?

Consider that we are running Raft on 3 machines, A, B, and C, and let A be the leader. There is a network partition that splits C from A and B. Say the current term is 2. A and B remain on term 2, with no additional messages besides periodic heartbeats. At this time, C enters candidate state and increments its term to 3, votes for itself, times out, and repeats. After, say, 10 cycles, the network partition is resolved. Now the state is A[2], B[2], C[12]. C will reject AppendEntries RPCs from A because the term 2 is less than its current term, 12. C cannot assemble a quorum and will continue to run the leader election protocol as a candidate, diverging further and further from the current term value of A and B.
The question is then, how does Raft (or Raft-derived implementations) handle this issue? Some thoughts I had included:
Such a situation is an availability issue, rather than a safety violation. Ignore and let human operators handle by killing or resetting C
Exponential backoff to decrease the divergence of C per elections
Have C use lastApplied instead of currentTerm as the basis for rejecting or accepting the AppendEntries RPC. That is, we trust the log as the source of truth for terms, rather than currentTerm value. This is already used to ensure that C would not win as per the Election Restriction, however the paper seems to indicate that this "up-to-date" property is a grounds for not voting for C, but is not grounds for C to acquiesce and reset to a follower.
Note: terminology as per In Search of an Understandable Consensus Algorithm (Extended Version)
When C rejects an AppendEntries RPC from the leader A, it will return its own term, which is now greater than 2. Raft replicas always recognize greater terms, so that in turn will cause the leader to step down and start a new election. Eventually, the cluster will converge on a new term that’s > 2 and which is >= C’s term.
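The step-down described here might look roughly like the following, using the A[2] and C[12] numbers from the question. LeaderState and on_append_entries_reply are hypothetical names for this sketch; a real implementation would also persist the new term and reset its election timer.

```python
from dataclasses import dataclass


@dataclass
class LeaderState:
    # Hypothetical minimal state for the node that is currently leader.
    current_term: int
    role: str = "leader"


def on_append_entries_reply(node: LeaderState, reply_term: int, success: bool) -> None:
    # Any reply carrying a higher term means this node's term is stale:
    # adopt the higher term and fall back to follower.
    if reply_term > node.current_term:
        node.current_term = reply_term
        node.role = "follower"
        return
    # Otherwise the normal success/failure handling (advancing or backing
    # off the follower's next index) would go here; omitted in this sketch.


a = LeaderState(current_term=2)
# C rejects A's heartbeat and reports its own term, 12.
on_append_entries_reply(a, reply_term=12, success=False)
print(a.role, a.current_term)  # follower 12
```

After this, the next election among A, B, and C starts from a term above 12, which is how the cluster converges.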
This is an oft discussed (in the Raft dev community) somewhat inconvenient scenario that can cause unnecessary churn in Raft clusters. To guard against it, the Raft dissertation — and most real-world implementations — introduce and use the so-called “pre-vote protocol.” The pre-vote protocol essentially dictates that before becoming a candidate, a follower must first determine whether it can win an election by asking its peers. In the scenario you described above, C would ask for a pre-vote from A and B, and because of the network partition it would not receive any votes. So, C would never transition to the candidate role, never increment the term, and thus never present a term > 2 after the partition heals. Thus, you’ve eliminated the churn.
You can read more about the pre-vote protocol in Diego’s dissertation.
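A very rough sketch of the pre-vote idea follows. All names are hypothetical, and counting the node's own vote toward the majority is an assumption of this example rather than something stated above. The point is that the term is only incremented after a majority has agreed to a pre-vote.

```python
from typing import List, Tuple


def won_pre_vote(granted_by_peers: List[bool], cluster_size: int) -> bool:
    """granted_by_peers has one entry per peer that actually replied."""
    votes = 1 + sum(granted_by_peers)  # assumption: the node counts itself
    return votes > cluster_size // 2


def on_election_timeout(current_term: int, granted_by_peers: List[bool],
                        cluster_size: int) -> Tuple[int, str]:
    # Only a successful pre-vote lets the node bump its term.
    if won_pre_vote(granted_by_peers, cluster_size):
        return current_term + 1, "candidate"
    return current_term, "follower"


# C is partitioned from A and B: nobody replies to its pre-vote requests.
term, role = 2, "follower"
for _ in range(10):
    term, role = on_election_timeout(term, [], cluster_size=3)
print(term, role)  # 2 follower

# Once the partition heals and a peer grants the pre-vote, C may proceed.
healed_term, healed_role = on_election_timeout(term, [True], cluster_size=3)
```

With this in place C is still on term 2 when the partition heals, so it accepts A's AppendEntries instead of forcing a new election.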

Could Raft elect a leader with uncommitted log?

Suppose a cluster of 5 nodes (A, B, C, D, E). Node A is elected leader at the beginning, and while leader A issues AppendEntries RPCs to the followers (B, C, D, E) to replicate a log entry (log-X), only node B receives it and returns success. At this point leader A crashes.
If node C (or D or E) wins the next leader election then everything's fine, because only node B has log-X, and that means log-X is not committed.
My question is, could node B (which has the highest term and longest log) win the next leader election? If so, will node B spread log-X to the other nodes?
Yes, B could win the election. If it does become leader, the first thing it does is create a log entry with the new term in its log and start replicating its log to all the followers. As B's log includes log-X, if all goes well, the log-X entry will eventually be replicated and considered committed.
If node C wins the election, then when it becomes leader, it won't have the log-X entry, and it'll end up overwriting that entry on Node B.
See section 5.4.2 of the raft paper for more details.
Also this means you can't treat failure as meaning the attempted entry definitely doesn't exist, only that the caller doesn't know the outcome. Section 8 has some suggestions for handling this.
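The two outcomes can be checked with a toy model of the scenario. Everything below is illustrative: the log contents and helper names are invented, A is treated as crashed, and only the log comparison is modelled (not who has already voted in the term).

```python
from typing import Dict, List, Tuple

Log = List[Tuple[int, str]]  # (term, command) pairs


def last(log: Log) -> Tuple[int, int]:
    """(last_term, last_index) of a log; (0, 0) if it is empty."""
    return (log[-1][0], len(log)) if log else (0, 0)


def up_to_date(cand_log: Log, voter_log: Log) -> bool:
    return last(cand_log) >= last(voter_log)


# Surviving nodes after A crashed; only B received log-X.
logs: Dict[str, Log] = {
    "B": [(1, "w"), (1, "log-X")],
    "C": [(1, "w")],
    "D": [(1, "w")],
    "E": [(1, "w")],
}


def possible_voters(candidate: str) -> List[str]:
    """Nodes whose log check would allow a vote for this candidate."""
    return sorted(n for n in logs if up_to_date(logs[candidate], logs[n]))


majority = 5 // 2 + 1  # 3 votes out of the original 5 nodes

voters_for_b = possible_voters("B")
voters_for_c = possible_voters("C")
print(voters_for_b)  # ['B', 'C', 'D', 'E']
print(voters_for_c)  # ['C', 'D', 'E']
```

Both B and C can reach the majority of 3, so which one wins depends on timing; B would never vote for C, because C's log is shorter.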

Resources