How does Raft handle a prolonged network partition? - networking

Consider that we are running Raft on 3 machines: A, B, C and let A be the leader. There is a network partition that splits C from A and B. Say the current term is 2. A and B remain on term 2, with no additional messages besides periodic heartbeats. Meanwhile, C enters the candidate state, increments its term to 3, votes for itself, times out, and repeats. After, say, 10 cycles, the network partition is resolved. Now the state is A[2], B[2], C[12]. C will reject AppendEntries RPCs from A because the term 2 is less than its current term, 12. C cannot assemble a quorum, so it will continue to run the leader election protocol as a candidate and diverge further and further from the current term of A and B.
The question is then, how does Raft (or Raft-derived implementations) handle this issue? Some thoughts I had included:
Such a situation is an availability issue rather than a safety violation. Ignore it and let human operators handle it by killing or resetting C.
Use exponential backoff to decrease the divergence of C per election.
Have C use lastApplied instead of currentTerm as the basis for rejecting or accepting the AppendEntries RPC. That is, we trust the log as the source of truth for terms, rather than the currentTerm value. The log is already used to ensure that C would not win, as per the Election Restriction; however, the paper seems to indicate that this "up-to-date" property is grounds for not voting for C, but is not grounds for C to acquiesce and reset to a follower.
Note: terminology as per In Search of an Understandable Consensus Algorithm (Extended Version)

When C rejects an AppendEntries RPC from the leader A, it returns its own term, which is now greater than 2. Raft replicas always recognize greater terms, so that in turn causes the leader to step down and a new election to start. Eventually, the cluster will converge on a new term that’s > 2 and >= C’s term.
This is an oft-discussed (in the Raft dev community) and somewhat inconvenient scenario that can cause unnecessary churn in Raft clusters. To guard against it, the Raft dissertation introduces, and most real-world implementations use, the so-called “pre-vote protocol.” The pre-vote protocol essentially dictates that before becoming a candidate, a follower must first determine whether it could win an election by asking its peers. In the scenario you described above, C would ask for a pre-vote from A and B, and because of the network partition it would not receive any votes. So C would never transition to the candidate role, never increment its term, and thus never present a term > 2 after the partition heals. That eliminates the churn.
You can read more about the pre-vote protocol in Diego’s dissertation.
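To make the pre-vote idea concrete, here is a minimal sketch in Python. It is not taken from any particular Raft implementation: the names `Node`, `grants_pre_vote`, `try_start_election` and the `heard_from_leader_recently` flag are hypothetical, and the RPC layer is replaced by a plain list of reachable peers. The key property it tries to show is that the term is only incremented after a majority has indicated it would vote.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    current_term: int
    last_log_term: int = 0
    last_log_index: int = 0
    # Assumption for this sketch: True while the node is leader or has
    # received a heartbeat within its election timeout.
    heard_from_leader_recently: bool = False


def grants_pre_vote(voter, proposed_term, cand_last_log_term, cand_last_log_index):
    """Answer a pre-vote request. The voter changes none of its own state."""
    if voter.heard_from_leader_recently:
        return False  # a live leader exists, so an election is not needed
    if proposed_term < voter.current_term:
        return False  # the candidate is behind on terms
    # The candidate's log must be at least as up-to-date as the voter's.
    return (cand_last_log_term, cand_last_log_index) >= (
        voter.last_log_term,
        voter.last_log_index,
    )


def try_start_election(candidate, reachable_peers, cluster_size):
    """Increment the term only if a pre-vote majority is reachable and agrees."""
    proposed_term = candidate.current_term + 1
    votes = 1  # the candidate counts itself
    for peer in reachable_peers:
        if grants_pre_vote(
            peer, proposed_term, candidate.last_log_term, candidate.last_log_index
        ):
            votes += 1
    if votes > cluster_size // 2:
        candidate.current_term = proposed_term  # now a real election can start
        return True
    return False  # stay a follower, term unchanged
```

With this sketch, a partitioned C calling `try_start_election(c, [], 3)` ten times stays on term 2, so when the partition heals it simply accepts A's heartbeats again.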

Related

Raft: Will term increasing all the time if partitioned?

Will a partitioned server keep increasing its term the whole time?
If so, that leads me to another point of confusion.
Chapter 3.6 (safety) in the Raft paper says:
Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.
It got me thinking about a scenario in which a server from a partitioned network wins the election because of its huge term, causing an inconsistency. Can that happen?
Edit part:
When a node with a larger term rejoins the cluster, it forces the current leader to step down. When a leader sends a request to a follower that has a larger term, the follower rejects the request, the leader sees the larger term, and that forces the leader to step down so that a new election happens.
Some Raft implementations add an extra step before a follower becomes a candidate: if a follower does not hear from the leader, it tries to contact the other followers, and only if a quorum is reachable does it become a candidate.
(I read your link.) And yes, an implementation may keep increasing the term, or be more practical and wait until a majority is reachable; there is no point in initiating an election if no majority is available.
I decided to comment because of this line in your question: "win the election because of the huge term". A long time ago, when I first read the Raft paper (https://raft.github.io/raft.pdf), I was quite confused by the election process.
Section 5.2 talks about leader election, and it has these words: "Each server will vote for at most one candidate in a given term, on a first-come-first-served basis (note: Section 5.4 adds an additional restriction on votes)."
Section 5.4: "The previous sections described how Raft elects leaders and replicates log entries. However, the mechanisms described so far are not quite sufficient to ensure that each state machine executes exactly the same commands in the same order..."
Basically, if a reader reads the paper section by section and stops to think after each one, the reader will be a bit confused. At least I was :)
As a general conclusion: the absolute value of a new term does not by itself decide anything. A term is only used to reject messages from older terms. For the actual leader election (and a new term actually getting a leader), it is the state of the log that matters.
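The interplay between terms and logs can be sketched as a RequestVote handler. This is an illustrative Python sketch following the voting rules described in the paper (reject stale terms, adopt newer terms, one vote per term, log up-to-date check); the function names and the dictionary layout of `state` are assumptions made for this example only.

```python
def log_is_up_to_date(cand_last_term, cand_last_index, my_last_term, my_last_index):
    """True if the candidate's log is at least as up-to-date as the voter's."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term  # later last term wins
    return cand_last_index >= my_last_index  # same term: longer log wins


def handle_request_vote(state, cand_id, cand_term, cand_last_term, cand_last_index):
    """Process a vote request and return (current_term, vote_granted).

    `state` is a dict with the keys current_term, voted_for,
    last_log_term and last_log_index (a layout assumed for this sketch).
    """
    if cand_term < state["current_term"]:
        return state["current_term"], False  # stale candidate is rejected
    if cand_term > state["current_term"]:
        # A larger term is always adopted, but adopting it is not a vote.
        state["current_term"] = cand_term
        state["voted_for"] = None
    can_vote = state["voted_for"] in (None, cand_id)
    if can_vote and log_is_up_to_date(
        cand_last_term, cand_last_index,
        state["last_log_term"], state["last_log_index"],
    ):
        state["voted_for"] = cand_id
        return state["current_term"], True
    return state["current_term"], False
```

In this sketch, a candidate arriving with term 12 but an old log makes the voter move to term 12 and still receives no vote, which is why a huge term alone cannot win an election.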

What if follower loses its leader in a two-node situation (only a leader and a candidate) in Raft?

I'm learning the Raft algorithm. My implementation runs into the following situation:
1. a 1-leader-1-follower situation is established;
2. the leader is shut down;
3. the follower gets no heartbeat and so becomes a candidate;
4. the candidate keeps sending VoteRequest to the peer (already shut down) and fails;
5. the election times out without any leader elected;
6. the candidate starts another candidate session, which actually repeats 4-6 ...
I don't see how to solve this situation in the Raft papers (maybe I missed something).
In my opinion, I can check the granted votes in step 5 before starting a new election. Since the candidate votes for itself at the beginning of the election session, with this check the candidate would become the new leader.
But I worry that this solution will break Raft, especially the initial process when all nodes are candidates.
Another idea is to treat a network error on a RequestVote request as "Vote Granted" (I still worry about whether it breaks something).
I know this situation could be caused by having only 2 nodes. However, even if there are 3 nodes (so a 1-leader-2-follower situation is established), if 2 leaders are shut down consecutively, the remaining follower may still behave like this.
What you are describing as a problem is actually legitimate, expected behavior.
Raft cannot make progress while a majority of nodes is unavailable, and there is no way around this other than bringing a majority of nodes back into operation.
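The majority rule behind this answer can be written down in a few lines. This is only an illustrative Python sketch; `quorum_size` and `can_elect_leader` are hypothetical helper names, not part of any Raft library.

```python
def quorum_size(cluster_size):
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def can_elect_leader(cluster_size, live_nodes):
    """A leader can only be elected if a majority of the configured cluster is up."""
    return live_nodes >= quorum_size(cluster_size)
```

For a 2-node cluster the quorum is 2, so a lone survivor can never win; the same holds for 1 survivor out of 3. Counting an unreachable peer as "Vote Granted" would let two nodes on opposite sides of a partition both become leader in the same term, which is exactly what the majority requirement prevents.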

Is there any termination guarantee for raft algorithm leader election?

I was trying to understand how the Raft algorithm gets around the FLP theorem.
For FLP to apply, the consensus algorithm must be deterministic, guarantee termination, and run in an asynchronous model.
As the PhD dissertation on the Raft consensus algorithm says:
Unfortunately, it is difficult to put a bound on the time or number of messages leader election will take. According to the FLP impossibility result [28], no fault-tolerant consensus protocol can deterministically terminate in a purely asynchronous model. This manifests itself in split votes in Raft, which can potentially impede progress repeatedly during leader election. Raft also makes use of randomized timeouts during leader election, which makes its analysis probabilistic. Thus, we can only say that leader election performs well with high likelihood, and even then only under various assumptions. For example, servers must choose timeouts from a random distribution (they are not somehow synchronized), clocks must proceed at about the same rates, and servers and networks must be timely (or stopped). If these assumptions are not met for some period of time, the cluster might not be able to elect a leader during that period (though safety will always be maintained).
So even if it is difficult to put a bound on the time or number of messages leader election will take, is it possible to guarantee that in all cases eventually a leader will be elected?
If not, in which cases will leader election never terminate?
You don't have a guarantee that a leader will eventually be elected in all cases. There is only a high likelihood that leader election will perform well.
The quote you posted already contains an example of such a case:
For example, servers must choose timeouts from a random distribution (they are not somehow synchronized), clocks must proceed at about the same rates, and servers and networks must be timely (or stopped). If these assumptions are not met for some period of time, the cluster might not be able to elect a leader during that period (though safety will always be maintained).

Could Raft elect a leader with uncommitted log?

Suppose a cluster of 5 nodes (A, B, C, D, E). Node A is elected leader at the beginning, and while leader A issues AppendEntries RPCs to the followers (B, C, D, E) to replicate a log entry (log-X), only node B receives it and returns success. At this point leader A crashes.
If node C(or D or E) wins the next leader election then everything's fine, because only node B has log-X, and that means log-X is not committed.
My question is, could node-B (which has the highest term and longest log) win the next leader election? If so, will node-B spread log-X to other nodes?
Yes, B could win the election. If it does become leader, the first thing it does is create a log entry with the new term in its log and start replicating its log to all the followers. As B's log includes log-X, if all goes well, the log-X entry will eventually be replicated and considered committed.
If node C wins the election, then when it becomes leader, it won't have the log-X entry, and it'll end up overwriting that entry on Node B.
See section 5.4.2 of the raft paper for more details.
This also means you can't treat a failure as meaning the attempted entry definitely doesn't exist, only that the caller doesn't know the outcome. Section 8 has some suggestions for handling this.

A peer-to-peer and privacy-aware data mining/aggregation algorithm: is it possible?

Suppose I have a network of N nodes, each with a unique identity (e.g. public key) communicating with a central-server-less protocol (e.g. DHT, Kad). Each node stores a variable V. With reference to e-voting as an easy example, that variable could be the name of a candidate.
Now I want to execute an "aggregation" function on all V variables available in the network. With reference to e-voting example, I want to count votes.
My question is completely theoretical (I have to prove a statement, details at the end of the question), so please don't focus on the e-voting and all of its security aspects. Do I have to say it again? Don't answer me that "a node may have any number identities by generating more keys", "IPs can be traced back" etc. because that's another matter.
Let's see the distributed aggregation only from the privacy point of view.
THE question
Is it possible, in a general case, for a node to compute a function of variables stored at other nodes without getting their value associated to the node's identity? Did researchers design such a privacy-aware distributed algorithm?
I'm only dealing with privacy aspects, not general security!
Current thoughts
My current answer is no, so I say that a central server, obtaining all Vs and processing them without storing them, is necessary, and that there are more legal than technical means to ensure that no individual node's data is either stored or retransmitted by the central server. I'm asking you to prove that my previous statement is false :)
In the e-voting example, I think it's impossible to count how many people voted for Alice and Bob without asking all the nodes, one by one "Hey, who do you vote for?"
Real case
I'm doing research in the Personal Data Store (PDS) field. Suppose you store your call log in the PDS and somebody wants to compute statistics about the phone calls (i.e. mean duration, number of calls per day, variance, standard deviation) without learning either aggregated or individual data about any one person (that is, nobody must know whom I call, nor my own mean call duration).
If a trusted broker exists, and everybody trusts it, that node can expose a double getMeanCallDuration() API that first invokes CallRecord[] getCalls() on every PDS in the network and then operates statistics on all rows. Without the central trusted broker, each PDS exposing double getMyMeanCallDuration() isn't statistically usable (the mean of the means shouldn't be the mean of all...) and most importantly reveals the identity of the single user.
Yes, it is possible. There is work that answers your question and solves the problem, given some assumptions. Check the following paper: Privacy, efficiency & fault tolerance in aggregate computations on massive star networks.
You can do some computation (for example, summing) over the values of a group of nodes at another node without the participating nodes revealing any data to each other, or even to the node that does the computing. After the computation, everyone learns the result (but no one learns any individual data besides their own, which they knew already anyway). The paper describes the protocol and proves its security (and the protocol itself gives you the privacy level I just described).
As for protecting the identity of the nodes, so that their value cannot be linked to their identity, that is another problem. You could use anonymous credentials (check this: https://idemix.wordpress.com/2009/08/18/quick-intro-to-credentials/) or something similar to show that you are who you are without revealing your identity (in a distributed scenario).
The catch of this protocol is that you need a semi-trusted node to do the computation. A fully distributed protocol (for example, in a P2P network scenario) is not that easy, though. That is not because of a lack of storage (you can have a DHT, for example) but because you need to replace that trusted or semi-trusted node with the network, and that is where the issues start: who does the computation? Why that node and not another one? What if there is collusion? Etc.
How about having each node publish two sets of data, x and y, such that
x - y = v
Assuming that x and y can be emitted independently (so that they cannot be linked to each other), you can correctly compute the overall mean and sum, while every single message is largely worthless on its own.
So for the voting example and candidates X, Y, Z, I might have one identity publishing the vote
+2 -1 +3
and my second identity publishes the vote (here already negated, i.e. -y, so that adding the two messages gives 0 +1 0, a single vote for Y):
-2 +2 -3
But of course you cannot verify that I didn't vote multiple times anymore.
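The x - y = v idea can be sketched in Python as follows. This is only an illustration of the arithmetic: `split_value`, `aggregate` and the `spread` parameter are hypothetical names, and the sketch assumes that the x and y messages really cannot be linked to each other or to the sender. A bounded random mask like this still leaks some information about v; additive secret sharing schemes usually work modulo a large number for that reason.

```python
import random


def split_value(v, rng, spread=1000):
    """Split v into two published values (x, y) with x - y == v."""
    y = rng.randint(-spread, spread)  # random mask, meaningless on its own
    x = v + y
    return x, y


def aggregate(xs, ys):
    """Recover the sum of all hidden values from the two published sets.

    The xs and ys may arrive in any order and need not be paired up.
    """
    return sum(xs) - sum(ys)


rng = random.Random()
values = [3, 5, 7]  # private per-node values
pairs = [split_value(v, rng) for v in values]
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]
rng.shuffle(ys)  # pairing between x and y is not needed for the sum
total = aggregate(xs, ys)
mean = total / len(values)
```

Here `total` equals the true sum of the private values and `mean` is their true mean, even though the aggregator never sees which y belongs to which x.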

Resources