A node-rejoin contradiction in Raft

A contradiction in Raft confuses me. The scenario is:
There are 3 nodes in a raft cluster: n1, n2, n3.
n1, n2, n3 are running, n1 becomes leader and accepts value v1 from client.
n1, n2, n3 commit v1.
n1 disconnects and tries to start an election, so its term increases.
n2, n3 keep running, n2 becomes leader and accepts values v2 and v3; n2, n3 commit v2, v3.
n1 reconnects and n2 disconnects.
And here is the problem: n1 doesn't have the committed values (v2, v3), so it can't become leader; n3's term is smaller than n1's, so it can't become leader either. The Raft cluster can't make progress.
Is there something wrong in the description above?

To understand where the steps go wrong, it is worth stating a few rules of Raft.
1. Not every term will have a leader. If a node initiates an election, the vote may be split or a majority of nodes may be unreachable. In either case, if an election fails, the node waits for a random delay, then increases the term and retries the election for the new term (unless another leader has been elected in the meantime).
2. A leader stays a leader until it hears about a higher term. As stated in item 1, the higher term might have no leader, and that's OK. But as soon as a leader sees a message (a vote request or a heartbeat) with a higher term, it turns itself into a follower.
3. As usual, a follower listens for heartbeats from the leader. When a follower hears nothing for a while (a preset interval plus a randomized delay), it turns into a candidate. The candidate increases its term and starts an election.
4. As soon as any node hears about a larger term, it turns itself into a follower and uses that new term as its base. The new term may come from another leader or from a candidate.
4.1 A common issue with disconnected followers in a vanilla implementation of Raft: a disconnected follower will think that the leader is dead and will become a candidate at some point. It will increase its term and start an election. Of course, the election will fail, since the node is disconnected, so the node will keep increasing its term. Assuming the rest of the network is stable, as soon as this node rejoins, one of two things may happen: a) it will initiate a new election, or b) it will receive a heartbeat from the leader carrying a smaller term and reject it, and its reply with the larger term will force the leader to become a follower according to rule 2 above. Either way, a new election will happen.
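The term rules above can be sketched in a few lines of Python. This is only an illustrative model, not code from any Raft implementation: the `Node` class and its method names are made up for this example, and networking, logs, and voting are left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical, heavily simplified Raft node: only term and role."""
    name: str
    term: int = 0
    role: str = "follower"  # "follower", "candidate", or "leader"

    def observe_term(self, msg_term: int) -> bool:
        """Adopt any larger term and fall back to follower.

        Returns True if the node changed its term (and stepped down).
        """
        if msg_term > self.term:
            self.term = msg_term
            self.role = "follower"
            return True
        return False

    def election_timeout(self) -> None:
        """No heartbeat arrived in time: bump the term and become a candidate."""
        self.term += 1
        self.role = "candidate"


leader = Node("n2", term=2, role="leader")
isolated = Node("n1", term=2)

# While disconnected, the follower runs three elections that cannot succeed.
for _ in range(3):
    isolated.election_timeout()

# After the rejoin, the leader sees the larger term in a reply and steps down.
stepped_down = leader.observe_term(isolated.term)
print(isolated.term, leader.role, leader.term, stepped_down)
```

In this model the leader ends up as a follower in the isolated node's term, which is exactly the disruption described in 4.1.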
With these rules in hand, we can go through your steps:
While N1 is disconnected, it keeps considering itself the leader. There is no term increase here.
When N1 reconnects (and N2 disconnects), one of two things may happen:
a) N1 sends a heartbeat to N3. N3 replies with a higher term, which forces N1 to become a follower.
b) N3 sends a message to N1 first (for example, a vote request once its election timeout fires). N1 learns that a higher term exists and turns itself into a follower.
In either case, the final configuration is N3 as the leader and N1 as a follower. And, as usual, values from N3 will flow to N1.
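To make the outcome concrete, here is a rough replay of case (a) in Python. The dictionaries, the concrete term numbers, and the `log_is_up_to_date` helper are assumptions made for illustration; replication is reduced to copying the log, which a real implementation would do through AppendEntries.

```python
def log_is_up_to_date(candidate_log, voter_log):
    """Compare (last entry term, log length), in the spirit of Raft's election restriction."""
    cand_key = (candidate_log[-1][0] if candidate_log else 0, len(candidate_log))
    voter_key = (voter_log[-1][0] if voter_log else 0, len(voter_log))
    return cand_key >= voter_key


# Assumed state at the moment N1 reconnects and N2 disconnects.
n1 = {"term": 1, "role": "leader", "log": [(1, "v1")]}
n3 = {"term": 2, "role": "follower", "log": [(1, "v1"), (2, "v2"), (2, "v3")]}

# Case (a): N1's heartbeat carries term 1; N3 rejects it and replies with term 2.
if n3["term"] > n1["term"]:
    n1["term"], n1["role"] = n3["term"], "follower"

# N3 hears nothing from N2, times out, and asks N1 for a vote in term 3.
n3["term"] += 1
n3["role"] = "candidate"
votes = 1  # N3 votes for itself

if n3["term"] > n1["term"]:
    n1["term"] = n3["term"]  # a higher term is always adopted
if log_is_up_to_date(n3["log"], n1["log"]):
    votes += 1  # N1 has not voted in term 3 yet, so it grants the vote

if votes >= 2:  # majority of a 3-node cluster
    n3["role"] = "leader"
    n1["log"] = list(n3["log"])  # simplified stand-in for AppendEntries

print(n3["role"], n1["role"], n1["log"])
```

Under these assumptions N3 wins with two of three votes and N1 ends up holding v1, v2, and v3.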

Related

Raft: how does a candidate's term change?

In step (b) of the picture, s5 is elected leader with term 3. Where does this 3 come from?
According to the paper, when a follower times out waiting for a message from the leader, it increments its term and becomes a candidate. So I think the term should be 2, and s5 could still win the election and become leader, because nodes s3 and s4 would vote for it.
The description of the picture has a single indicator for the term: at step B the term is 3, which means that at step A the term is 2. As simple as that. Based on the description, it seems this picture clarifies a previous example, and in that example the term got to 2 somehow.
A few observations on the picture, to avoid any confusion.
The picture itself has no indication of the current term at either step. The horizontal numbers are indexes, not terms. The picture would be clearer if the author used [term, value] for each index.
We know the term is 3 at step B. This means that the term became 4 when S1 was re-elected. The rest of the picture explains why d1 won't happen, and why the value (2) accepted by S1 in term 4 and accepted by a majority won't ever be lost. That is the main property of the protocol: if a value is accepted by a majority, it won't be lost.
Quick note: a value accepted by a majority won't be lost, but the term of that value may change. In other words, every index contains a tuple [term, value]. If a specific value is accepted by a majority, then the leader and a majority of followers will have the same [term, value] at the same index. If the leader fails before issuing the commit, a new leader will emerge and will re-propose the same value with a new term. So as soon as a value is accepted by a majority, the value is there to stay, even if the term changes.

How does a Raft node in candidate state process a RequestVote from another candidate?

Suppose a Raft node is in candidate state with currentTerm 4 and has voted for itself, and it then receives a RequestVote whose term is 5.
Will this node vote for the node whose term is 5 and set its currentTerm to 5?
A Raft node always updates its term when a higher one is announced. Whether it votes for the other node depends on the log state. To vote for another node, two conditions must be met: the current node has not yet voted in the current term, and the candidate's log is at least as up-to-date as the current node's. Because adopting term 5 starts a new term for the node, its earlier vote for itself in term 4 does not count against the first condition.
In the original Raft paper, every node has a state, and one of the fields in that state is "currentTerm". Its definition is: the latest term the server has ever seen.
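A minimal sketch of such a RequestVote handler, assuming a hypothetical `RaftNode` class written for this answer (the field names loosely mirror the paper's currentTerm and votedFor, but the class itself is not taken from any real library):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class RaftNode:
    node_id: str
    current_term: int = 0
    voted_for: Optional[str] = None
    role: str = "follower"
    log: List[Tuple[int, str]] = field(default_factory=list)  # (term, value)

    def last_log(self) -> Tuple[int, int]:
        """Return (last log term, last log index); (0, 0) for an empty log."""
        return (self.log[-1][0], len(self.log)) if self.log else (0, 0)

    def handle_request_vote(self, term, candidate_id, last_log_term, last_log_index):
        """Return (current_term, vote_granted)."""
        if term < self.current_term:
            return self.current_term, False  # stale candidate
        if term > self.current_term:
            # A higher term is always adopted; the old vote belongs to the old term.
            self.current_term = term
            self.voted_for = None
            self.role = "follower"
        log_ok = (last_log_term, last_log_index) >= self.last_log()
        if log_ok and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return self.current_term, True
        return self.current_term, False


# Candidate in term 4 that voted for itself, then hears from a term-5 candidate.
node = RaftNode("s1", current_term=4, voted_for="s1", role="candidate",
                log=[(1, "a"), (3, "b")])
reply = node.handle_request_vote(term=5, candidate_id="s2",
                                 last_log_term=3, last_log_index=2)

# Same starting state, but the term-5 candidate's log is behind.
other = RaftNode("s1", current_term=4, voted_for="s1", role="candidate",
                 log=[(1, "a"), (3, "b")])
reply_behind = other.handle_request_vote(term=5, candidate_id="s3",
                                         last_log_term=1, last_log_index=1)
print(reply, reply_behind)
```

In both calls the node moves to term 5 and becomes a follower; only the first call grants the vote, because only there is the requester's log at least as up-to-date.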

How does Raft handle a prolonged network partition?

Consider that we are running Raft on 3 machines, A, B, and C, and let A be the leader. A network partition splits C from A and B. Say the current term is 2. A and B remain on term 2, with no additional messages besides periodic heartbeats. Meanwhile, C enters the candidate state, increments its term to 3, votes for itself, times out, and repeats. After, say, 10 cycles, the network partition is resolved. Now the state is A[2], B[2], C[12]. C will reject the AppendEntries RPC from A, as the term 2 is less than its current term, 12. C cannot assemble a quorum and will continue to run the leader election protocol as a candidate, and its term will diverge further and further from the current term of A and B.
The question is then, how does Raft (or Raft-derived implementations) handle this issue? Some thoughts I had included:
Such a situation is an availability issue rather than a safety violation. Ignore it and let human operators handle it by killing or resetting C.
Use exponential backoff to reduce how far C's term diverges with each election.
Have C use lastApplied instead of currentTerm as the basis for rejecting or accepting the AppendEntries RPC. That is, we trust the log as the source of truth for terms, rather than currentTerm value. This is already used to ensure that C would not win as per the Election Restriction, however the paper seems to indicate that this "up-to-date" property is a grounds for not voting for C, but is not grounds for C to acquiesce and reset to a follower.
Note: terminology as per In Search of an Understandable Consensus Algorithm (Extended Version)
When C rejects an AppendEntries RPC from the leader A, it returns its own term, which is now > 2. Raft replicas always recognize greater terms, so that in turn causes the leader to step down, and a new election starts. Eventually, the cluster converges on a new term that's > 2 and >= C's term.
This is an oft discussed (in the Raft dev community) somewhat inconvenient scenario that can cause unnecessary churn in Raft clusters. To guard against it, the Raft dissertation — and most real-world implementations — introduce and use the so-called “pre-vote protocol.” The pre-vote protocol essentially dictates that before becoming a candidate, a follower must first determine whether it can win an election by asking its peers. In the scenario you described above, C would ask for a pre-vote from A and B, and because of the network partition it would not receive any votes. So, C would never transition to the candidate role, never increment the term, and thus never present a term > 2 after the partition heals. Thus, you’ve eliminated the churn.
You can read more about the pre-vote protocol in Diego Ongaro's dissertation.
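A toy model of the pre-vote idea is sketched below. All names are invented for this example, and the `hears_leader` check (a peer refuses a pre-vote while it still receives heartbeats) is an assumption about how an implementation might decide, not a quote from the dissertation.

```python
def majority(cluster_size):
    return cluster_size // 2 + 1


def pre_vote_succeeds(node, peers, reachable):
    """Ask peers whether they would vote for `node`, without touching any term."""
    grants = 1  # the node counts its own pre-vote
    for peer in peers:
        if peer["name"] not in reachable:
            continue  # partitioned: the request is lost, no reply arrives
        log_ok = node["last_log"] >= peer["last_log"]  # (last term, last index)
        if log_ok and not peer["hears_leader"]:
            grants += 1
    return grants >= majority(len(peers) + 1)


def on_election_timeout(node, peers, reachable):
    """Only a successful pre-vote allows the real election (and the term bump)."""
    if pre_vote_succeeds(node, peers, reachable):
        node["term"] += 1
        node["role"] = "candidate"


a = {"name": "A", "term": 2, "role": "leader", "last_log": (2, 5), "hears_leader": True}
b = {"name": "B", "term": 2, "role": "follower", "last_log": (2, 5), "hears_leader": True}
c = {"name": "C", "term": 2, "role": "follower", "last_log": (2, 3), "hears_leader": False}

# C is partitioned away and times out ten times: no pre-votes, no term change.
for _ in range(10):
    on_election_timeout(c, [a, b], reachable=set())

# The partition heals; A and B still have a live leader, so C stays a follower.
on_election_timeout(c, [a, b], reachable={"A", "B"})
print(c["term"], c["role"])

# If A really crashes, B stops hearing a leader and C would grant B's pre-vote.
b["hears_leader"] = False
on_election_timeout(b, [a, c], reachable={"C"})
print(b["term"], b["role"])
```

In this model C stays at term 2 throughout, while B can still start an election once the leader is genuinely gone.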

What if a follower loses its leader in a two-node situation (only a leader and a candidate) in Raft?

I'm learning the Raft algorithm. My implementation runs into the following situation:
1. a 1-leader-1-follower situation is established;
2. the leader is shut down;
3. the follower gets no heartbeat, so it becomes a candidate;
4. the candidate keeps sending VoteRequest to its peer (already shut down) and fails;
5. the election times out without any leader elected;
6. the candidate starts another election session, effectively repeating steps 4-6 ...
I don't see how to solve this situation in the Raft papers (maybe I missed something).
In my opinion, I could check the granted votes in step 5 before starting a new election. Since the candidate votes for itself at the beginning of an election session, with this check the candidate would become the new leader.
But I worry that this solution would break Raft, especially the initial phase when all nodes are candidates.
Another idea is to treat a network error on a RequestVote request as "Vote Granted" (I still worry that it breaks something).
I know this situation could be caused by having only 2 nodes. However, even with 3 nodes (so a 1-leader-2-follower situation is established), if 2 leaders are shut down one after another, the remaining follower may still behave like this.
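What you are describing as a problem is actually a legitimate situation.
Raft cannot make progress without a majority of nodes present, and there is no way around this other than bringing a majority of nodes back into operation. Both proposed workarounds (electing a leader on its own vote alone, or counting network errors as granted votes) would let two partitioned nodes each become leader in the same term, which breaks Raft's safety guarantee.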
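The majority requirement is simple arithmetic, shown here as a small Python sketch (the function names are chosen for this example only):

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def can_elect_leader(cluster_size: int, nodes_alive: int) -> bool:
    """A candidate needs votes from a quorum, counting its own vote."""
    return nodes_alive >= quorum(cluster_size)


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes may be down while the cluster can still elect a leader."""
    return cluster_size - quorum(cluster_size)


for size in (2, 3, 5):
    print(size, quorum(size), tolerated_failures(size))
```

A 2-node cluster needs both nodes to elect a leader, so it tolerates no failures; a 3-node cluster tolerates one, which is why losing two of three nodes leaves the survivor stuck as a candidate.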

Could Raft elect a leader with uncommitted log?

Suppose a cluster of 5 nodes (ABCDE). Node A is elected leader at the beginning, and while leader A is issuing AppendEntries RPCs to the followers (BCDE) to replicate a log entry (log-X), only node B receives it and returns success. At this point leader A crashes.
If node C (or D or E) wins the next leader election, then everything is fine, because only node B has log-X, which means log-X is not committed.
My question is: could node B (which has the highest term and longest log) win the next leader election? If so, will node B spread log-X to the other nodes?
Yes, B could win the election. If it does become leader, the first thing it does is create a log entry with the new term in its log and start replicating its log to all the followers. As B's log includes log-X, if all goes well, the log-X entry will eventually be replicated and considered committed.
If node C wins the election, it won't have the log-X entry when it becomes leader, and it will end up overwriting that entry on node B.
See section 5.4.2 of the Raft paper for more details.
This also means you can't treat a failure as meaning the attempted entry definitely doesn't exist, only that the caller doesn't know the outcome. Section 8 has some suggestions for handling this.
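To see why both B and C are electable, here is a small Python check of who could collect a majority. The concrete logs are assumed for illustration (one earlier entry from term 1 on every node, plus log-X from term 1 on B only), and the check ignores the one-vote-per-term rule, so it shows which outcomes are possible rather than which one happens.

```python
def last_key(log):
    """(term of last entry, log length); `log` is a list of entry terms."""
    return (log[-1], len(log)) if log else (0, 0)


def would_vote(voter_log, candidate_log):
    """A voter only grants a vote to a candidate whose log is at least as up-to-date."""
    return last_key(candidate_log) >= last_key(voter_log)


# A has crashed; B holds the extra, uncommitted entry log-X.
logs = {"B": [1, 1], "C": [1], "D": [1], "E": [1]}
CLUSTER_SIZE = 5


def possible_votes(candidate):
    """Live nodes (including the candidate itself) that could vote for it."""
    return sorted(n for n in logs if would_vote(logs[n], logs[candidate]))


def can_win(candidate):
    return len(possible_votes(candidate)) >= CLUSTER_SIZE // 2 + 1


for name in logs:
    print(name, possible_votes(name), can_win(name))
```

B can get votes from every live node, while C can still reach three votes from C, D, and E because only B would refuse it.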

Resources