According to the FLP result, consensus cannot be solved deterministically in an asynchronous network if even one node may crash, and electing a unique leader is a form of consensus. Therefore, in theory, leader election is unsolvable in an asynchronous network.
However, I have since learned about "reliable broadcast", in which every non-faulty node takes responsibility for rebroadcasting every value it receives from other nodes. This makes it possible to guarantee that every non-faulty node gets the same set of messages (ignoring order). So, if every node uses reliable broadcast to send its node id to the other nodes, doesn't every non-faulty node eventually get the same set of node ids, and can't they all then decide on the same leader (say, the node with the largest id)?
If so, why is leader election said to be unsolvable? Or am I confusing something?
Reliable broadcast relies on a (perfect) failure detector to learn which processes in the cluster have crashed. Such a detector does not exist in an asynchronous network, because there you cannot distinguish a slow process from a faulty one.
Thus, you cannot rely on reliable broadcast to solve this problem.
Your algorithm would never terminate once a single process crashes, because the remaining nodes cannot know whether its id is still on the way. It therefore tolerates no faults at all, so it does not bypass FLP.
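A minimal sketch of the dilemma, assuming three nodes with ids 1 to 3. The two decision rules and their names are hypothetical and only illustrate the argument: waiting for every id blocks once a node has crashed, while deciding on a timeout can make nodes disagree when a node is merely slow.

```python
ALL_NODES = {1, 2, 3}  # assumed membership, known to every node


def decide_when_complete(delivered):
    """Wait-for-everyone rule: decide only once all ids have arrived."""
    if delivered == ALL_NODES:
        return max(delivered)
    return None  # still waiting; this never changes if a node crashed


def decide_on_timeout(delivered):
    """Timeout rule: decide on whatever has been delivered so far."""
    return max(delivered)


# Case 1: node 3 crashed before sending anything.
# The survivors only ever see {1, 2}, so they wait forever.
stuck = decide_when_complete({1, 2})

# Case 2: node 3 is only slow. Its id has reached node 1
# but is still in flight to node 2 when both time out.
leader_seen_by_node1 = decide_on_timeout({1, 2, 3})
leader_seen_by_node2 = decide_on_timeout({1, 2})

print(stuck, leader_seen_by_node1, leader_seen_by_node2)
```

In case 2 the two nodes pick different leaders, which is exactly the disagreement that consensus forbids.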
In our Corda network we work with Accounts. We have a network with well-defined nodes.
To show the problem, imagine 3 nodes, PartyA, PartyB and Notary.
We created the accounts (AccountA, for example) on PartyA. We have flows that can be executed at PartyB and that have AccountA as a participant in the transaction.
Now imagine that PartyA is down for any reason, or communication between nodes is not available.
When I request a new AccountA key from PartyA, the flow gets stuck trying to communicate and does not throw any exception. This happens in any situation that involves communicating with another node, for example when running CollectSignaturesFlow, or ShareStateAndSyncAccounts to share account states.
Is there any configuration or mechanism that makes the flow throw an exception when it is unable to communicate with another node?
Timeouts can be handled differently depending on where you need to manage them.
There is the flowTimeout:
When a flow implementing the TimedFlow interface and setting the
isTimeoutEnabled flag does not complete within a defined elapsed time,
it is restarted from the initial checkpoint.
Currently only used for notarisation requests with clustered notaries
Otherwise, it can be done programmatically in your flow. I suggest you take a look at the "Concurrency, Locking and Waiting" part of the Corda documentation, which contains several suggestions that you could try to implement.
Looking through the documentation I can't find a way to get the current status of the nodes (online/offline) using the network map service.
Is this already implemented?
I can find this information using OS tools but I would like to know if there is a Corda way for this task.
Thanks in advance.
This feature is not implemented as of Corda V3. However, you can implement this functionality yourself. For example, see the Ping Pong sample, which allows you to ping other nodes.
In the future, it is expected that the network map will regularly poll each node on the network. Nodes that did not respond for a certain period of time (as defined by the network operator) would be evicted from the network map. However, this period of time is expected to be long (e.g. a month).
Please also note that:
In Corda, communication between nodes uses message acknowledgments. If a node is offline when you send it a message, no acknowledgment will be received, and the node will store the message to disk and retry delivery later. It will continue to retry until the counterparty acknowledges receipt of the message
Corda is designed with "always-on" nodes in mind. A node being offline will generally correspond to a disaster scenario, and the situation should not be long-lasting
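The retry-until-acknowledged behaviour described above can be sketched generically. This is not Corda's API: `send_until_acked`, `FlakyPeer` and the `max_attempts` parameter are hypothetical names used only to show why an unbounded retry loop never raises, and how a bound turns an unreachable peer into an exception.

```python
import time


def send_until_acked(message, try_send, retry_delay=0.0, max_attempts=None):
    """Retry delivery until try_send(message) reports an acknowledgment.

    With max_attempts=None the loop retries forever, so the caller simply
    blocks while the peer is offline. A bound makes it raise instead.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        if try_send(message):
            return attempts
        time.sleep(retry_delay)
    raise TimeoutError(f"no acknowledgment after {attempts} attempts")


class FlakyPeer:
    """Hypothetical peer that is offline for its first few delivery attempts."""

    def __init__(self, offline_attempts):
        self.remaining = offline_attempts
        self.inbox = []

    def receive(self, message):
        if self.remaining > 0:
            self.remaining -= 1
            return False  # offline: no acknowledgment
        self.inbox.append(message)
        return True  # acknowledged


peer = FlakyPeer(offline_attempts=2)
attempts = send_until_acked("new key request", peer.receive)
print(attempts, peer.inbox)
```

Calling the same function with, say, `max_attempts=3` against a peer that stays offline longer raises `TimeoutError`, which is the kind of bounded behaviour you would have to build into your own flow logic.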
I'm looking for a highly available SQL solution. One of the articles I read was about "virtual synchrony" in Galera Cluster: https://www.percona.com/blog/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/
The author says:
When the writeset is actually applied on a given node, any locking
conflicts it detects with open (not-yet-committed) transactions on
that node cause that open transaction to get rolled back.
and
Writesets being applied by replication threads always win
What happens if the writeset conflicts with an already committed transaction?
The author also says:
Writesets are then “certified” on every node (in order).
How does Galera Cluster order writesets across the cluster? Is there a hidden master node that orders them, something like ZooKeeper, or is it done some other way?
This is for the second question (about how Galera orders the writesets).
Galera implements Extended Virtual Synchrony (EVS) based on the Totem protocol. The Totem protocol implements a form of token passing, where only the node with the token is allowed to send out new requests (as I understand it). So the writes are ordered since only one node at a time has the token.
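A toy illustration of the token idea described above, not Galera or Totem code. The `Token` and `Node` classes are hypothetical, and the sketch ignores membership changes, token loss and retransmission. It only shows how a circulating token that carries the next sequence number yields a single total order without a fixed master.

```python
class Token:
    """The circulating token; it carries the next global sequence number."""

    def __init__(self):
        self.seq = 0


class Node:
    def __init__(self, name):
        self.name = name
        self.pending = []  # writesets waiting for the token

    def on_token(self, token, log):
        """Only the token holder may send: stamp and publish pending writesets."""
        for writeset in self.pending:
            log.append((token.seq, self.name, writeset))
            token.seq += 1
        self.pending.clear()


nodes = [Node("a"), Node("b"), Node("c")]
nodes[1].pending.append("UPDATE t SET x = 1")
nodes[0].pending.append("UPDATE t SET x = 2")
nodes[1].pending.append("DELETE FROM t")

token, log = Token(), []
for node in nodes:  # one trip of the token around the ring
    node.on_token(token, log)

for entry in log:
    print(entry)
```

Every node that receives the stamped writesets can apply them in sequence-number order, so all nodes see the same order even though the writes originated on different nodes.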
For the academic background, you can look at these:
The Totem Single-Ring Ordering and Membership Protocol
The database state machine and group communication issues
(This Answer does not directly tackle your Question, but it may give you confidence that Galera is 'good'.)
In Galera (PXC, etc), there are two general times when a transaction can fail.
On the node where the transaction is being run, the actions are compared to what is currently running on the same node. If there is a conflict, either one of the transactions is stalled (think innodb_lock_wait_timeout) or is deadlocked (and rolled back).
At COMMIT time, info is sent to all the other nodes; they check your transaction against anything on the node or pending (in gcache). If there is a conflict, a message is sent back saying that there would be trouble. So, the originating node has the COMMIT fail. For this reason, you must check for errors even on the COMMIT statement.
As with single-node systems, a deadlock is usually resolved by replaying the entire transaction.
In the case of autocommit, there is a small, configurable, number of retries, after which the statement will fail. So, again, check for errors. However, since a retry has already been tried, you may want to abort the program.
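A minimal sketch of the advice to check for errors on COMMIT and replay the whole transaction. `DeadlockError` and `FakeConnection` are hypothetical stand-ins so the example runs without a database. With a real driver you would catch its own deadlock exception instead, and the `commit()` and `rollback()` methods are the usual Python DB-API ones.

```python
class DeadlockError(Exception):
    """Stand-in for the driver-specific deadlock / certification-failure error."""


def run_transaction(conn, work, max_retries=3):
    """Run work(conn) and COMMIT, replaying the whole transaction on conflict."""
    for attempt in range(1, max_retries + 1):
        try:
            work(conn)
            conn.commit()  # under Galera, the COMMIT itself may fail
            return attempt
        except DeadlockError:
            conn.rollback()  # discard partial work before replaying
    raise RuntimeError(f"transaction failed after {max_retries} attempts")


class FakeConnection:
    """Hypothetical connection whose first few commits fail."""

    def __init__(self, failing_commits):
        self.failing_commits = failing_commits
        self.rollbacks = 0

    def commit(self):
        if self.failing_commits > 0:
            self.failing_commits -= 1
            raise DeadlockError("conflict detected at commit time")

    def rollback(self):
        self.rollbacks += 1


conn = FakeConnection(failing_commits=1)
attempts = run_transaction(conn, work=lambda c: None)
print(attempts, conn.rollbacks)
```

The retry limit is deliberate: if the transaction keeps failing, giving up and surfacing the error is usually better than looping forever.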
Currently (in my opinion) Galera, with at least 3 nodes in at least 3 different physical locations, is the best available HA solution for MySQL. It can effectively survive any single-point-of-failure. (Group Replication / InnoDB Cluster, from Oracle, is coming soon, and is very promising.)
One thing to note is that the "critical read" problem has a solution in Galera, but you have to take action. See wsrep_sync_wait. (As of this writing, InnoDB Cluster has no solution.)
See http://mysql.rjweb.org/doc.php/galera for tips (some of which are included above) on coding differences when moving to PXC/Galera.
Section 6.4 of the Raft dissertation gives the following steps for bypassing the Raft log for read-only queries while still preserving linearizability:
1. If the leader has not yet marked an entry from its current term committed, it waits until it has done so. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. As soon as this no-op entry is committed, the leader’s commit index will be at least as large as any other servers’ during its term.
2. The leader saves its current commit index in a local variable readIndex. This will be used as a lower bound for the version of the state that the query operates against.
3. The leader needs to make sure it hasn’t been superseded by a newer leader of which it is unaware. It issues a new round of heartbeats and waits for their acknowledgments from a majority of the cluster. Once these acknowledgments are received, the leader knows that there could not have existed a leader for a greater term at the moment it sent the heartbeats. Thus, the readIndex was, at the time, the largest commit index ever seen by any server in the cluster.
4. The leader waits for its state machine to advance at least as far as the readIndex; this is current enough to satisfy linearizability.
5. Finally, the leader issues the query against its state machine and replies to the client with the results.
My questions:
a) Does step 1 only matter right after the leader has been elected? Only a new leader has no entry committed in its current term. And since the no-op entry is needed to find out which entries are committed, this step is in fact always required once an election completes, and is not specific to read-only queries. In other words, once the leader has been active for a while, it must already have entries committed in its term (including the no-op entry).
b) Does step 3 mean that whenever the leader needs to serve a read-only query, it sends one extra heartbeat, regardless of any outstanding heartbeat (sent, but not yet acknowledged by a majority) or the next scheduled heartbeat?
c) Is step 4 only for followers (for cases where followers help offload the processing of read-only queries)? On the leader, a committed index already means the entry has been applied to the local state machine.
All in all, a leader that has been active for a while normally only needs to do steps 3 and 5, right?
a: This is indeed only the case when the leader is first elected. In practice, when a read-only query is received, you check whether an entry has been committed from the leader's current term and queue or reject the query if not.
b: In practice, most implementations batch read-only queries for more efficiency. You don't need to send many concurrent heartbeats. If a heartbeat is outstanding, the leader can enqueue any new reads to be evaluated after that heartbeat is completed. Once a heartbeat is completed, if any additional queries are enqueued then the leader starts another heartbeat. This has the effect of batching linearizable read-only queries for better efficiency.
c: It is not true that the leader's lastApplied index (the index of its state machine) is always equivalent to its commitIndex. Indeed, this is why there is a lastApplied index in Raft in the first place. Leaders do not necessarily have to synchronously apply an index at the same time as committing that index. This is really implementation specific. In practice, Raft implementations usually apply entries in a different thread. So, an entry can be committed and then enqueued for application to the state machine. Some implementations put entries on a queue to be applied to the state machine and allow the state machine to pull entries from that queue to be applied at the state machine's own pace, so when an entry may be applied is unspecified. It's just critical that a read-only query be applied after the last command committed by the leader.
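The five steps, including the distinction between commitIndex and lastApplied, can be sketched as follows. This is a single-threaded toy under stated assumptions: the `Leader` class, the dictionary-based log, and the `heartbeat_acked_by_majority` callback are all hypothetical, and a real implementation would wait for its apply thread rather than apply entries inline.

```python
class Leader:
    def __init__(self, current_term, log, commit_index):
        self.current_term = current_term
        self.log = log  # index -> (term, command); command None is the no-op
        self.commit_index = commit_index
        self.last_applied = 0  # may lag behind commit_index
        self.state = {}  # the state machine: a key/value map

    def has_committed_in_current_term(self):
        entry = self.log.get(self.commit_index)
        return entry is not None and entry[0] == self.current_term

    def apply_up_to(self, index):
        """Advance the state machine until last_applied reaches index."""
        while self.last_applied < index:
            self.last_applied += 1
            _term, command = self.log[self.last_applied]
            if command is not None:
                key, value = command
                self.state[key] = value

    def linearizable_read(self, key, heartbeat_acked_by_majority):
        # Step 1: an entry from the current term must be committed.
        if not self.has_committed_in_current_term():
            raise RuntimeError("no entry from the current term committed yet")
        # Step 2: remember the commit index as the read index.
        read_index = self.commit_index
        # Step 3: confirm leadership with a round of heartbeats.
        if not heartbeat_acked_by_majority():
            raise RuntimeError("leadership could not be confirmed")
        # Step 4: make sure the state machine has caught up to read_index.
        self.apply_up_to(read_index)
        # Step 5: answer from the state machine.
        return self.state.get(key)


log = {1: (1, ("x", 1)), 2: (2, None), 3: (2, ("x", 5))}
leader = Leader(current_term=2, log=log, commit_index=3)
value = leader.linearizable_read("x", heartbeat_acked_by_majority=lambda: True)
print(value, leader.last_applied)
```

Note that `last_applied` starts at 0 although `commit_index` is 3, which is the situation step 4 exists for.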
Also, you ask if this only applies to followers. Linearizable queries can only be evaluated through the leader. I suppose there's some algorithm with which you could do linearizable reads on followers, but it would be inefficient. Followers can only maintain sequential consistency for queries. In that case, servers respond to client operations with the index of the state machine when the operation was evaluated. Clients send their last received index with each operation, and when a server receives an operation, it uses the same algorithm to ensure that its state machine's lastApplied index is at least as great as the client's index. This is necessary to ensure that the client does not see state go back in time when switching servers.
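The client-index scheme for sequentially consistent reads can be sketched like this. The `Server`, `Client` and `StaleServerError` names are hypothetical, and where this toy raises an error a real server would more likely wait until it has caught up.

```python
class StaleServerError(Exception):
    """Raised when a server's state machine is behind what the client has seen."""


class Server:
    def __init__(self):
        self.last_applied = 0
        self.state = {}

    def apply(self, index, key, value):
        self.state[key] = value
        self.last_applied = index

    def query(self, key, client_index):
        # Refuse to answer from a state older than one the client already saw.
        if self.last_applied < client_index:
            raise StaleServerError(
                f"applied {self.last_applied} < client index {client_index}"
            )
        return self.state.get(key), self.last_applied


class Client:
    def __init__(self):
        self.index = 0  # highest state machine index seen so far

    def read(self, server, key):
        value, index = server.query(key, self.index)
        self.index = max(self.index, index)
        return value


up_to_date, lagging = Server(), Server()
for server in (up_to_date, lagging):
    server.apply(1, "x", "old")
up_to_date.apply(2, "x", "new")  # the lagging server has not applied this yet

client = Client()
first = client.read(up_to_date, "x")
try:
    second = client.read(lagging, "x")
except StaleServerError:
    second = None  # the client was protected from going back in time
print(first, second, client.index)
```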
There are some other complexities to read-only queries beyond what's described in the Raft literature if you want to support FIFO consistency for concurrent operations from a single client. Some of these are described in Copycat's architecture documentation.
I have a small communication problem that has consumed hours of searching. I am using MPICH2 to communicate between different workers. At some points in my program, a process needs to multicast a message to a fraction of the workers (2 or 3 out of a total of 20). Therefore, I temporarily need to create a group that includes the ranks of all those workers and then use MPI_Bcast. However, this seems to be impossible!
I have tried MPI_Comm_create, but the program simply hangs because it requires every worker to call MPI_Comm_create. I also cannot use MPI_Comm_split, because the other workers do not know the ranks of the recipients in advance and hence cannot choose their color.
Could you please help me?
Why do you need to create a new communicator at all?
Your description of what you actually want to achieve and what the constraints are is a little lacking, but here are some hints that might be applicable to your problem.
Sticking to classical two-sided communication, you need at some point a communication that involves all processes to identify the recipients, I guess. You could for example broadcast to everybody who is to be a recipient, and subsequently send the actual message to those with peer-to-peer communication (If this relation is going to change over time, I would not bother with creating a new communicator each time).
You could use MPI's one-sided communication and simply write messages from the broadcasting rank into dedicated memory areas of the receiving ranks. However, one-sided communication is often considered awkward to use and not great for performance.
With MPI-3 you could make use of a non-blocking barrier. All processes open the barrier. Those that are not the broadcasting rank immediately start testing for completion of this barrier, open a non-blocking receive from any source, and regularly test that as well; otherwise they proceed as usual. The broadcasting rank, however, starts sending its message to the actual recipients, and once it has completed that, it waits for the non-blocking barrier to complete. Now all processes will find that the barrier has completed and can stop listening for receives. Those that did not get a message can simply send a message to themselves to properly close the outstanding receive and then proceed with their computation.
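The first suggestion (tell everybody who the recipients are, then use point-to-point sends) can be sketched in Python. This assumes the mpi4py bindings, whose communicator objects provide `Get_rank`, `bcast`, `send` and `recv`. The `multicast` helper, its parameters and the tag value are hypothetical, and the function only relies on those four methods of whatever `comm` it is given.

```python
def multicast(comm, root, payload=None, recipients=None, tag=7):
    """Send payload from root to the ranks in recipients.

    Every rank of comm must call this. Only root needs to pass payload
    and recipients. Returns the payload on recipient ranks, else None.
    """
    rank = comm.Get_rank()

    # Collective step: all ranks learn who the recipients are.
    recipients = comm.bcast(recipients if rank == root else None, root=root)

    if rank == root:
        # Point-to-point step: only the chosen ranks get the actual message.
        for dest in recipients:
            if dest != root:
                comm.send(payload, dest=dest, tag=tag)
        return None

    if rank in recipients:
        return comm.recv(source=root, tag=tag)

    return None  # not a recipient: nothing to receive
```

With mpi4py installed, each process would call something like `multicast(MPI.COMM_WORLD, root=0, payload=data, recipients=[3, 7])` after `from mpi4py import MPI`. Only the small recipient list goes to all 20 ranks, and no new communicator is created.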