Suppose a cluster of 5 nodes (A, B, C, D, E). Node A is elected leader at the beginning. While leader A issues AppendEntries RPCs to the followers (B, C, D, E) to replicate a log entry (log-X), only node B receives it and returns success; at this point leader A crashes.
If node C (or D or E) wins the next leader election, then everything is fine, because only node B has log-X, which means log-X is not committed.
My question is: could node B (which has the highest term and the longest log) win the next leader election? If so, will node B spread log-X to the other nodes?
Yes, B could win the election. If it does become leader, the first thing it does is create a log entry with the new term in its log and start replicating its log to all the followers. Since B's log includes log-X, if all goes well, the log-X entry will eventually be replicated and considered committed.
If node C wins the election instead, then when it becomes leader it won't have the log-X entry, and it will end up overwriting that entry on node B.
See section 5.4.2 of the raft paper for more details.
Also this means you can't treat failure as meaning the attempted entry definitely doesn't exist, only that the caller doesn't know the outcome. Section 8 has some suggestions for handling this.
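To make that commitment rule from section 5.4.2 concrete, here is a minimal Go sketch (purely illustrative; names such as matchIndex and commitIndex follow the paper's terminology, not any particular library): a new leader like B only advances its commit index via entries from its own term, which is what eventually makes log-X committed as well.

```go
package raft

// LogEntry is one slot in the replicated log: a [term, value] tuple.
type LogEntry struct {
	Term  int
	Value string
}

// advanceCommitIndex sketches the rule from section 5.4.2: a leader only
// marks index n committed if log[n].Term == currentTerm and a majority of
// the cluster stores index n. Committing such an entry implicitly commits
// every earlier entry, which is how log-X (from the old term) eventually
// becomes committed once B, as the new leader, replicates an entry from
// its own term to a majority.
func advanceCommitIndex(log []LogEntry, matchIndex []int, currentTerm, commitIndex int) int {
	clusterSize := len(matchIndex) + 1 // followers plus the leader itself
	for n := len(log) - 1; n > commitIndex; n-- {
		if log[n].Term != currentTerm {
			continue // never count old-term entries toward commitment directly
		}
		replicas := 1 // the leader always holds its own entry
		for _, m := range matchIndex {
			if m >= n {
				replicas++
			}
		}
		if replicas > clusterSize/2 {
			return n // entries 0..n are now committed
		}
	}
	return commitIndex
}
```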
In the picture, step b, s5 is elected as leader with term 3. Where does this 3 come from?
In the paper, when a follower times out waiting for a message from the leader, it increments its term and becomes a candidate. So I think the term should be 2, and it can still win the election and become leader, because nodes s3 and s4 will vote for it.
The description of the picture has a single indicator for the term: at step b the term is 3; this means that at step a the term is 2. As simple as that. Based on the description, it seems this picture clarifies some previous example, and in that example the term got to 2 somehow.
A few observations on the picture to avoid any confusion.
The picture itself has no indication of the current term for either step. The horizontal numbers are indexes, not terms. The picture would be clearer if the author used [term, value] for each index.
We know the term is 3 at step b. This means that the term became 4 when S1 was re-elected. The rest of the picture explains why d1 won't happen and why the value (2), accepted by S1 in term 4 and replicated to a majority, won't ever be lost. That is the main property of the protocol: once a value is committed, it won't be lost.
Quick note: a committed value won't be lost, but an entry that has merely been accepted by a majority is not yet safe on its own. Every index contains a tuple [term, value], and in Raft an entry never changes its term. If the leader fails before committing an entry, a new leader that still has that entry will re-replicate it with its original term, but it only counts as committed once an entry from the new leader's own term has also reached a majority (section 5.4.2). From that point on, the value is there to stay.
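To illustrate the [term, value] tuples, here is a small sketch in Go (illustrative only, not tied to any real implementation) of the follower-side consistency check that keeps the leader and its followers in agreement on [term, value] at every index:

```go
package raft

// LogEntry is a [term, value] tuple, one per log index.
type LogEntry struct {
	Term  int
	Value string
}

// acceptAppendEntries sketches the consistency check from section 5.3:
// new entries are accepted only if the follower already has an entry at
// prevLogIndex with term prevLogTerm. Because entries never change their
// term, two logs that agree on [term, value] at an index also agree on
// everything before it (the Log Matching property).
func acceptAppendEntries(log []LogEntry, prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex < 0 {
		return true // appending from the very start always matches
	}
	if prevLogIndex >= len(log) {
		return false // this follower's log is too short
	}
	return log[prevLogIndex].Term == prevLogTerm
}
```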
Will the partitioned server keep increasing its term all the time?
If so, I have another point of confusion.
Chapter 3.6 (safety) in the Raft paper says:
Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.
It got me thinking about a scenario where one server from a partitioned network wins the election because of its huge term and then causes inconsistency. Will that happen?
Edit:
When a node with a larger term rejoins the cluster, that will force the current leader to step down. When a leader sends a request to a follower and the follower has a larger term, the follower rejects the request; the leader sees the larger term, which forces it to step down, and a new election will happen.
Some Raft implementations have an extra step before a follower becomes a candidate: if a follower does not hear from the leader, it tries to contact the other followers, and only if a quorum is reachable does it become a candidate.
(I read your link.) Yes, an implementation may keep increasing the term, or it may be more practical and wait until a majority is reachable: there is no point in initiating an election if no majority is available.
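A minimal sketch of that step-down rule, assuming a toy node struct of my own (not any specific library): every RPC request and response carries a term, and observing a higher one always demotes the node to follower.

```go
package raft

type role int

const (
	follower role = iota
	candidate
	leader
)

type node struct {
	currentTerm int
	votedFor    int // -1 means no vote cast in currentTerm
	role        role
}

// observeTerm is invoked for every incoming RPC request and response.
// Seeing a higher term always makes the node adopt it and fall back to
// follower; this is how a rejoining node with a huge term forces the
// current leader to step down and triggers a new election.
func (n *node) observeTerm(term int) {
	if term > n.currentTerm {
		n.currentTerm = term
		n.votedFor = -1
		n.role = follower
	}
}
```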
I decided to comment because of this line in your question: "win the election because of the huge term". A long time ago, when I read the Raft paper (https://raft.github.io/raft.pdf), I was quite confused by the election process.
Section 5.2 talks about leader election, and it has these words: "Each server will vote for at most one candidate in a given term, on a first-come-first-served basis (note: Section 5.4 adds an additional restriction on votes)."
Section 5.4: "The previous sections described how Raft elects leaders and replicates log entries. However, the mechanisms described so far are not quite sufficient to ensure that each state machine executes exactly the same commands in the same order..."
Basically, if a reader goes through the paper section by section and stops to think after each one, the reader will be a bit confused. At least I was :)
As a general conclusion: the absolute value of the new term does not make any difference; it is only used to reject older terms. For the actual leader election (and for a new term being started), it is the state of the log that actually matters.
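To illustrate that last point, here is a rough Go sketch of the voting restriction (the names are mine, not from any particular implementation): the voter looks only at the candidate's log, not at how large its term is.

```go
package raft

// candidateLogIsUpToDate sketches the election restriction (section 5.4.1
// of the paper, section 3.6 of the dissertation): the voter compares
// last-entry terms first and uses log length only as a tiebreaker. The
// candidate's currentTerm plays no role here; only the log decides.
func candidateLogIsUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}
```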
Consider that we are running Raft on 3 machines, A, B, and C, and let A be the leader. There is a network partition that splits C from A and B. Say the current term is 2. A and B remain on term 2, with no additional messages besides periodic heartbeats. During this time, C enters the candidate state, increments its term, votes for itself, times out, and repeats. After, say, 10 cycles, the network partition is resolved. Now the state is A[2], B[2], C[12]; C will reject AppendEntries RPCs from A because term 2 is less than its current term of 12; C cannot assemble a quorum and will continue to run the leader election protocol as a candidate, becoming increasingly divergent from the current term of A and B.
The question is then: how do Raft (or Raft-derived) implementations handle this issue? Some thoughts I had:
- Such a situation is an availability issue rather than a safety violation. Ignore it and let human operators handle it by killing or resetting C.
- Use exponential backoff to decrease C's divergence per election.
- Have C use lastApplied instead of currentTerm as the basis for rejecting or accepting the AppendEntries RPC. That is, we trust the log as the source of truth for terms, rather than the currentTerm value. This is already used to ensure that C would not win, per the Election Restriction; however, the paper seems to indicate that this "up-to-date" property is grounds for not voting for C, but not grounds for C to acquiesce and reset to a follower.
Note: terminology as per In Search of an Understandable Consensus Algorithm (Extended Version)
When C rejects an AppendEntries RPC from the leader A, it will return its now > 2 term. Raft replicas always recognize greater terms, so that in turn will cause the leader to step down and start a new election. Eventually, the cluster will converge on a new term that’s > 2 and which is >= C’s term.
This is an oft-discussed (in the Raft dev community) and somewhat inconvenient scenario that can cause unnecessary churn in Raft clusters. To guard against it, the Raft dissertation, and most real-world implementations, introduce the so-called "pre-vote protocol." The pre-vote protocol essentially dictates that before becoming a candidate, a follower must first determine whether it can win an election by asking its peers. In the scenario you described above, C would ask for a pre-vote from A and B, and because of the network partition it would not receive any votes. So C would never transition to the candidate role, never increment the term, and thus never present a term > 2 after the partition heals. Thus, you've eliminated the churn.
You can read more about the pre-vote protocol in Diego’s dissertation.
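For a sense of how pre-vote works, here is a simplified sketch in Go (illustrative only; real implementations such as etcd/raft differ in the details): the would-be candidate asks its peers whether they would vote for it at the next term, and only increments its term if a majority says yes.

```go
package raft

// preVoteRequest asks a peer whether it would vote for us at the *next*
// term. Neither side changes its persistent term during this exchange.
type preVoteRequest struct {
	nextTerm     int // currentTerm + 1, not yet adopted
	lastLogIndex int
	lastLogTerm  int
}

// shouldStartElection runs the pre-vote phase: only if a majority of the
// cluster says it would grant the vote does the node actually become a
// candidate and increment its term. The partitioned node C above would
// reach neither A nor B, so it would never bump its term at all.
func shouldStartElection(ask []func(preVoteRequest) bool, req preVoteRequest, clusterSize int) bool {
	granted := 1 // we would vote for ourselves
	for _, peer := range ask {
		if peer(req) {
			granted++
		}
	}
	return granted > clusterSize/2
}
```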
I'm learning the Raft algorithm. My implementation runs into the following situation:
1. A 1-leader-1-follower situation is established;
2. the leader is shut down;
3. the follower gets no heartbeat, so it becomes a candidate;
4. the candidate keeps sending VoteRequests to the peer (already shut down) and fails;
5. the election times out without any leader elected;
6. the candidate starts another candidate session, actually repeating 4-6 ...
I don't see how to solve this situation in the Raft papers (maybe I missed something).
In my opinion, I could check the granted votes in step 5 before starting a new election. Since the candidate votes for itself at the beginning of the election session, with this check the candidate would become the new leader.
But I worry this solution would break Raft, especially the initial process when all nodes are candidates.
Another idea is treating a network error on RequestVote requests as "vote granted" (I still worry about whether it breaks something).
I know this situation can be caused by having only 2 nodes. However, even with 3 nodes (so a 1-leader-2-follower situation is established), if two leaders are shut down one after another, the remaining follower may still behave like this.
What you are describing as a problem is actually a legitimate situation.
Raft will not work if a majority of the nodes is not present, and there is no way to avoid this besides getting a majority of the nodes back in operation.
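To spell out the arithmetic (a plain illustrative snippet, not from any specific implementation):

```go
package raft

// quorum is the number of votes needed to win an election (or the number
// of replicas needed to commit an entry) in a cluster of the given size.
func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}
```

With 2 nodes, quorum(2) == 2, so the surviving candidate's single self-vote is never enough; with 3 nodes, quorum(3) == 2, so losing two of them leaves the last one equally stuck. Counting a network error as a granted vote would let a minority elect a leader, which is exactly what Raft must never allow.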
Issue description
I have an OpenStack system with an HA management network (VIP) via an ovs (Open vSwitch) port. In this system, under high load (concurrent volume-from-glance-image creation), the VIP port (an ovs port) goes missing.
Analysis
For now, with the default log level, the only thing observed in the log file is the "Unreasonably long 62741ms poll interval" warning below.
2017-12-29T16:40:38.611Z|00001|timeval(revalidator70)|WARN|Unreasonably long 62741ms poll interval (0ms user, 0ms system)
Idea for now
I will turn on debug logging for the file target and try to reproduce the issue:
sudo ovs-appctl vlog/set file:dbg
Question
What else should I do during/after reproducing the issue?
Is this issue typical? If so, what causes it?
I googled "Open vSwitch troubleshooting" and other related keywords, but the information was all at the data flow/table level rather than this ovs-vswitchd level (am I right?).
This issue was never reproduced, and so I forgot about it, until recently, two years later, when I had a chance to run into it again in a different environment; this time I have more ideas about its root cause.
It could be caused by the hash shifts that happen in bonding: for some reason the traffic pattern keeps triggering shifts again and again (the condition is quite strict, I would say, but there is a chance it gets hit anyway, right?).
The condition for the shift is quoted below; please refer to the full doc here: https://docs.openvswitch.org/en/latest/topics/bonding/
Bond Packet Output
When a packet is sent out a bond port, the bond member actually used is selected based on the packet’s source MAC and VLAN tag (see bond_choose_output_member()). In particular, the source MAC and VLAN tag are hashed into one of 256 values, and that value is looked up in a hash table (the “bond hash”) kept in the bond_hash member of struct port. The hash table entry identifies a bond member. If no bond member has yet been chosen for that hash table entry, vswitchd chooses one arbitrarily.
Every 10 seconds, vswitchd rebalances the bond members (see bond_rebalance()). To rebalance, vswitchd examines the statistics for the number of bytes transmitted by each member over approximately the past minute, with data sent more recently weighted more heavily than data sent less recently. It considers each of the members in order from most-loaded to least-loaded. If highly loaded member H is significantly more heavily loaded than the least-loaded member L, and member H carries at least two hashes, then vswitchd shifts one of H’s hashes to L. However, vswitchd will only shift a hash from H to L if it will decrease the ratio of the load between H and L by at least 0.1.
Currently, “significantly more loaded” means that H must carry at least 1 Mbps more traffic, and that traffic must be at least 3% greater than L’s.
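As a rough illustration of the mechanism described above (a simplified model in Go; the real logic is C inside bond_choose_output_member() and bond_rebalance(), and the hash function here is just a stand-in): the source MAC and VLAN are hashed into one of 256 buckets, each bucket is pinned to a bond member, and the rebalancer may shift a bucket from a heavily loaded member to a lightly loaded one, which is the kind of shift that could move the VIP's traffic between members.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucketFor models hashing the source MAC and VLAN tag into one of 256
// hash-table entries (the "bond hash"). OVS uses its own hash; FNV here
// is only for illustration.
func bucketFor(srcMAC string, vlan uint16) uint8 {
	h := fnv.New32a()
	h.Write([]byte(srcMAC))
	h.Write([]byte{byte(vlan >> 8), byte(vlan)})
	return uint8(h.Sum32() % 256)
}

// shouldShift models only the "significantly more loaded" test quoted
// above: member H must carry at least 1 Mbps more traffic than L, and at
// least 3% more, before one of H's hash buckets may be moved to L. The
// real rebalancer applies further conditions (e.g. the 0.1 ratio check).
func shouldShift(loadH, loadL float64) bool { // loads in bits per second
	const oneMbps = 1_000_000.0
	return loadH-loadL >= oneMbps && loadH >= loadL*1.03
}

func main() {
	fmt.Println("bucket:", bucketFor("fa:16:3e:00:00:01", 100)) // hypothetical source MAC and VLAN
	fmt.Println("shift?:", shouldShift(50_000_000, 20_000_000))
}
```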