What happens if a TiDB leader goes down? How does TiDB use Raft to ensure data security and consistency? - distributed-database

If one leader node in TiDB goes down, will my data be lost or the service be affected? How long does it take for the service to recover (i.e. for a new leader to be elected)?

TiDB uses Raft to synchronize data among multiple replicas and guarantees strong data consistency. If one replica goes down, the remaining replicas still hold a complete copy of the data, so nothing is lost. The default number of replicas in each Region is 3. Based on the Raft protocol, a leader is elected in each Region, and if a Region leader fails, a new Region leader is elected within at most 2 * lease time (the lease time is 10 seconds), i.e. about 20 seconds.
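During that re-election window, requests to the affected Region can fail transiently, so clients usually just retry. Below is a minimal Go sketch, assuming a client connecting over TiDB's MySQL-compatible protocol; the DSN, the table name, and the 30-second retry budget are hypothetical and only illustrate waiting out the worst-case ~20-second window.

    package main

    import (
        "database/sql"
        "fmt"
        "time"

        _ "github.com/go-sql-driver/mysql" // TiDB speaks the MySQL wire protocol
    )

    // queryWithRetry retries a read for up to maxWait, which should comfortably
    // exceed the worst-case re-election window (2 * lease time, i.e. about 20s).
    func queryWithRetry(db *sql.DB, maxWait time.Duration) (int, error) {
        deadline := time.Now().Add(maxWait)
        for {
            var n int
            err := db.QueryRow("SELECT COUNT(*) FROM t").Scan(&n) // hypothetical table t
            if err == nil {
                return n, nil
            }
            if time.Now().After(deadline) {
                return 0, fmt.Errorf("still failing after %v: %w", maxWait, err)
            }
            time.Sleep(500 * time.Millisecond) // back off while a new Region leader is elected
        }
    }

    func main() {
        db, err := sql.Open("mysql", "root@tcp(tidb.example.com:4000)/test") // hypothetical DSN
        if err != nil {
            panic(err)
        }
        defer db.Close()
        fmt.Println(queryWithRetry(db, 30*time.Second))
    }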

Related

Syncing up between leader & follower ADX cluster

Is there a continuous low-latency 'sync up' process going on between a leader & follower cluster so that the follower databases are up to date with the copy stored in the leader cluster? Basically, I am trying to understand whether the follower maintains its own read-only copy (not referring to the follower cache) of the follower databases.
The follower cluster periodically synchronizes with changes in the database(s) it follows, so it has some data lag with respect to the leader cluster.
The lag can vary from a few seconds to a few minutes, depending on the overall size of the followed database(s)' metadata.
Once the follower cluster becomes aware of these changes in metadata (data and/or schema objects [e.g. tables] get added/removed), there's a background process that 'warms' the relevant data artifacts from the leader's persistent storage to the SSD of the follower's nodes.
Data is cached (according to the effective caching policy) on the nodes of the leader and on the nodes of the follower, but is persisted only in the leader's persistent storage.

MariaDB Galera cluster does not come up after killing the mysql process

I have a MariaDB Galera cluster with 2 nodes and it is up and running.
Before moving to production, I want to make sure that if a node crashes abruptly, it comes back up on its own.
I tried using systemd "restart", but after killing the mysql process the mariadb service does not come up. Is there any tool or method I can use to automate bringing the nodes back up after a crash?
Galera clusters need to have quorum (3 nodes).
In order to avoid a split-brain condition, the minimum recommended number of nodes in a cluster is 3. Blocking state transfer is yet another reason to require a minimum of 3 nodes in order to enjoy service availability in case one of the members fails and needs to be restarted. While two of the members will be engaged in state transfer, the remaining member(s) will be able to keep on serving client requests.
You can read more here.
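The recommendation comes down to majority arithmetic: a 2-node cluster that loses one member ungracefully cannot tell a crash from a network partition and drops out of the primary component, while a 3-node cluster still has a majority. Below is a minimal Go sketch of that arithmetic, assuming plain majority voting with every node weighted 1 (Galera's default); it illustrates the principle only, not Galera's actual weighted-quorum implementation.

    package main

    import "fmt"

    // majorityQuorum returns the minimum number of live nodes needed to keep a
    // primary component under plain majority voting (the default, with every node
    // weighted 1).
    func majorityQuorum(clusterSize int) int {
        return clusterSize/2 + 1
    }

    func main() {
        for _, size := range []int{2, 3, 5} {
            q := majorityQuorum(size)
            fmt.Printf("%d-node cluster: quorum=%d, tolerates %d failure(s)\n", size, q, size-q)
        }
        // Output:
        // 2-node cluster: quorum=2, tolerates 0 failure(s)
        // 3-node cluster: quorum=2, tolerates 1 failure(s)
        // 5-node cluster: quorum=3, tolerates 2 failure(s)
    }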

MariaDB Galera cluster and the CAP theorem

Where does a MariaDB Galera cluster lie according to the CAP theorem, CP or AP? A brief explanation of how it works would help.
Consistency -- For handling the "critical read" problem, Galera needs a little help. See http://mysql.rjweb.org/doc.php/galera#critical_reads
Otherwise, one can state that Galera survives "any" single point of failure.
Galera is normally deployed in 3 nodes, one in each of 3 geographic locations. That means that no single machine failure, data center failure, earthquake, tornado, network outage, etc., can take out more than one node at a time. The other two nodes (whichever two survive and still talk to each other) will declare that they "have a quorum" and continue to accept writes and deliver reads. Further, "split brain" is not possible; this is what keeps any attempt at dual-master, even with monitoring, from surviving any SPOF.
If the third node or the network is repaired, the Cluster goes about patching up the data as needed, so that the 3 nodes again have identical data.
Granted, this is not quite the same as the definition of CAP, but it is a reasonable goal for a computer cluster.
How it works (in a tiny nutshell)... Each node talks to each other node. It does this only during the COMMIT of a transaction. (Hence, it is reasonably efficient even when spread across a WAN, as needed to survive natural disasters.) The COMMIT says to the other nodes "I am about to do this write; is it OK?" Without actually doing the write, they check Galera's magic sauce to see if it will succeed. Once everyone says "yes", the COMMIT returns success to the client. (That gives you a hint of the "critical" read issue.)
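Here is a toy Go sketch of that commit-time certification round, modelled the way the paragraph above describes it: the committing node asks every node "I am about to do this write; is it OK?" before COMMIT returns. All types and names are hypothetical, and real Galera certifies write sets deterministically against a replicated total order rather than by a literal per-commit vote.

    package main

    import "fmt"

    // WriteSet is a simplified stand-in for the rows a transaction intends to change.
    type WriteSet struct {
        Keys []string
    }

    // Node is a hypothetical cluster member that certifies incoming write sets
    // against keys touched by transactions it has already certified.
    type Node struct {
        Name            string
        recentlyChanged map[string]bool
    }

    // Certify answers "I am about to do this write; is it OK?" without applying it.
    func (n *Node) Certify(ws WriteSet) bool {
        for _, k := range ws.Keys {
            if n.recentlyChanged[k] {
                return false // conflict with a concurrent, already-certified transaction
            }
        }
        return true
    }

    // Commit runs the certification round: the write set is sent to every node at
    // COMMIT time, and the commit succeeds only if all of them certify it.
    func Commit(ws WriteSet, cluster []*Node) error {
        for _, n := range cluster {
            if !n.Certify(ws) {
                return fmt.Errorf("certification failed on %s: transaction must be rolled back", n.Name)
            }
        }
        // Only after a successful round does each node apply the write set locally.
        for _, n := range cluster {
            for _, k := range ws.Keys {
                n.recentlyChanged[k] = true
            }
        }
        return nil
    }

    func main() {
        cluster := []*Node{
            {Name: "node1", recentlyChanged: map[string]bool{}},
            {Name: "node2", recentlyChanged: map[string]bool{}},
            {Name: "node3", recentlyChanged: map[string]bool{"user:42": true}}, // concurrent update in flight
        }
        fmt.Println(Commit(WriteSet{Keys: []string{"user:7"}}, cluster))  // <nil>
        fmt.Println(Commit(WriteSet{Keys: []string{"user:42"}}, cluster)) // certification failed on node3
    }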

raft: some questions about read only queries

In chapter 6.4 of the Raft thesis, the following steps are given to bypass the Raft log for read-only queries while still preserving linearizability:
1. If the leader has not yet marked an entry from its current term committed, it waits until it has done so. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. As soon as this no-op entry is committed, the leader’s commit index will be at least as large as any other servers’ during its term.
2. The leader saves its current commit index in a local variable readIndex. This will be used as a lower bound for the version of the state that the query operates against.
3. The leader needs to make sure it hasn’t been superseded by a newer leader of which it is unaware. It issues a new round of heartbeats and waits for their acknowledgments from a majority of the cluster. Once these acknowledgments are received, the leader knows that there could not have existed a leader for a greater term at the moment it sent the heartbeats. Thus, the readIndex was, at the time, the largest commit index ever seen by any server in the cluster.
4. The leader waits for its state machine to advance at least as far as the readIndex; this is current enough to satisfy linearizability.
5. Finally, the leader issues the query against its state machine and replies to the client with the results.
My questions:
a) For step 1, does it apply only at the time the leader has just been elected? Only a new leader has no entry committed for the current term, and since the no-op entry is needed to find out which entries are currently committed, this step is in fact always needed once an election completes, not only for read-only queries, right? In other words, once a leader has been active for a while, it must have entries committed in its term (including the no-op entry).
b) For step 3, does it mean that whenever the leader needs to serve a read-only query, one extra heartbeat is sent, regardless of any currently outstanding heartbeat (sent but not yet acknowledged by a majority) or the next scheduled heartbeat?
c) For step 4, is it only relevant for followers (for cases where followers help offload the processing of read-only queries)? On the leader, a committed index already means the entry has been applied to the local state machine, doesn't it?
All in all, a leader that has been active for a while normally only needs to do steps 3 and 5, right?
a: This is indeed only the case when the leader is first elected. In practice, when a read-only query is received, you check whether an entry has been committed from the leader's current term and queue or reject the query if not.
b: In practice, most implementations batch read-only queries for more efficiency. You don't need to send many concurrent heartbeats. If a heartbeat is outstanding, the leader can enqueue any new reads to be evaluated after that heartbeat is completed. Once a heartbeat is completed, if any additional queries are enqueued then the leader starts another heartbeat. This has the effect of batching linearizable read-only queries for better efficiency.
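A minimal Go sketch of that batching pattern, with all names hypothetical: reads arriving while a heartbeat round is outstanding are queued, one majority acknowledgment releases the whole batch, and a new round starts only if more reads came in meanwhile.

    package main

    import "fmt"

    // pendingRead is a hypothetical read-only query waiting for a leadership check.
    type pendingRead struct {
        readIndex uint64
        query     string
    }

    // readBatcher collects reads that arrive while a heartbeat round is in flight,
    // so one round of acknowledgments can confirm leadership for all of them.
    type readBatcher struct {
        heartbeatInFlight bool
        queued            []pendingRead
    }

    // submit enqueues a read; it starts a new heartbeat round only if none is outstanding.
    func (b *readBatcher) submit(r pendingRead) {
        b.queued = append(b.queued, r)
        if !b.heartbeatInFlight {
            b.heartbeatInFlight = true
            fmt.Println("starting heartbeat round for", len(b.queued), "read(s)")
        }
    }

    // onMajorityAck is called once a majority has acknowledged the round: every
    // queued read can now be evaluated against the state machine.
    func (b *readBatcher) onMajorityAck() []pendingRead {
        ready := b.queued
        b.queued = nil
        b.heartbeatInFlight = false
        return ready
    }

    func main() {
        var b readBatcher
        b.submit(pendingRead{readIndex: 10, query: "GET x"})
        b.submit(pendingRead{readIndex: 10, query: "GET y"}) // piggybacks on the outstanding round
        fmt.Println("ready after ack:", b.onMajorityAck())
    }
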
c: It is not true that the leader's lastApplied index (the index of its state machine) is always equivalent to its commitIndex. Indeed, this is why there is a lastApplied index in Raft in the first place. Leaders do not necessarily have to synchronously apply an index at the same time as committing that index. This is really implementation specific. In practice, Raft implementations usually apply entries in a different thread. So, an entry can be committed and then enqueued for application to the state machine. Some implementations put entries on a queue to be applied to the state machine and allow the state machine to pull entries from that queue to be applied at the state machine's own pace, so when an entry may be applied is unspecified. It's just critical that a read-only query be applied after the last command committed by the leader.
Also, you ask if this only applies to followers. Linearizable queries can only be evaluated through the leader. I suppose there's some algorithm with which you could do linearizable reads on followers, but it would be inefficient. Followers can only maintain sequential consistency for queries. In that case, servers respond to client operations with the index of the state machine when the operation was evaluated. Clients send their last received index with each operation, and when a server receives an operation, it uses the same algorithm to ensure that its state machine's lastApplied index is at least as great as the client's index. This is necessary to ensure that the client does not see state go back in time when switching servers.
There are some other complexities to read-only queries beyond what's described in the Raft literature if you want to support FIFO consistency for concurrent operations from a single client. Some of these are described in Copycat's architecture documentation.
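Putting the pieces together, here is a minimal Go sketch of the leader-side read path that the five steps and the answers above describe. The leader type and its fields are hypothetical, and a real implementation would run the heartbeat round and the apply loop asynchronously rather than blocking as shown here.

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    // leader is a hypothetical Raft leader; only the fields relevant to the
    // read-only query path from chapter 6.4 are shown.
    type leader struct {
        currentTerm        uint64
        commitIndex        uint64
        lastApplied        uint64
        termOfCommittedTip uint64 // term of the entry at commitIndex
    }

    // confirmLeadership stands in for step 3: one heartbeat round acknowledged by
    // a majority. Reads queued while a round is in flight can all reuse the same
    // round (the batching described in answer b).
    func (l *leader) confirmLeadership() bool { /* send heartbeats, count acks */ return true }

    // linearizableRead follows steps 1-5 for a single query.
    func (l *leader) linearizableRead(query func() []byte) ([]byte, error) {
        // Step 1: an entry from the current term (e.g. the startup no-op) must be
        // committed first; otherwise queue or reject the read, as in answer a.
        if l.termOfCommittedTip != l.currentTerm {
            return nil, errors.New("no entry from current term committed yet; retry later")
        }

        // Step 2: remember the lower bound for this read.
        readIndex := l.commitIndex

        // Step 3: prove we are still leader before answering from local state.
        if !l.confirmLeadership() {
            return nil, errors.New("leadership not confirmed")
        }

        // Step 4: the state machine may lag commitIndex (answer c), so wait for it.
        for l.lastApplied < readIndex {
            time.Sleep(time.Millisecond)
        }

        // Step 5: evaluate the query against the local state machine.
        return query(), nil
    }

    func main() {
        l := &leader{currentTerm: 3, commitIndex: 10, lastApplied: 10, termOfCommittedTip: 3}
        val, err := l.linearizableRead(func() []byte { return []byte("state-at-index-10") })
        fmt.Println(string(val), err)
    }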

Which node should I push data to in a cluster?

I've setup a kafka cluster with 3 nodes.
kafka01.example.com
kafka02.example.com
kafka03.example.com
Kafka does replication so that any node in the cluster can be removed without losing data.
Normally I would send all data to kafka01; however, that will break the entire pipeline if that one node goes down.
What is industry best practice when dealing with clusters? I'm evaluating setting up an NGINX reverse proxy with round robin load balancing. Then I can point all data producers at the proxy and it will divvy up between the nodes.
I need to ensure that no data is lost if one of the nodes becomes unavailable.
Is an nginx reverse proxy an appropriate tool for this use case?
Is my assumption correct that a round robin reverse proxy will distribute the data and increase reliability without data loss?
Is there a different approach that I haven't considered?
Normally your producer takes care of distributing the data to all (or a selected set of) nodes that are up and running by using a partitioning function, either in round-robin mode or using some semantics of your choice. The producer publishes to a partition of a topic, and different nodes are leaders for different partitions of one topic. If a broker node becomes unavailable, it falls out of the in-sync replica set (ISR) and new leaders are elected for the partitions it led. Through metadata requests/responses, your producer becomes aware of this and pushes messages to other nodes which are currently up.
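In practice that means pointing the producer at a bootstrap list containing all three brokers and letting the client do the routing, rather than fronting Kafka with a reverse proxy. Below is a minimal sketch using the sarama Go client (any Kafka client exposes equivalent settings); the topic name is hypothetical, and acks=all (WaitForAll) means a write is acknowledged only once the in-sync replicas have it, which addresses the no-data-loss requirement when a broker dies.

    package main

    import (
        "fmt"

        "github.com/IBM/sarama" // Kafka client for Go
    )

    func main() {
        cfg := sarama.NewConfig()
        cfg.Producer.RequiredAcks = sarama.WaitForAll // wait until the in-sync replicas have the write
        cfg.Producer.Return.Successes = true          // required by SyncProducer

        // The bootstrap list is only used to discover the cluster; after that the
        // producer talks to whichever broker currently leads each partition.
        brokers := []string{"kafka01.example.com:9092", "kafka02.example.com:9092", "kafka03.example.com:9092"}

        producer, err := sarama.NewSyncProducer(brokers, cfg)
        if err != nil {
            panic(err)
        }
        defer producer.Close()

        // Hypothetical topic; with a replication factor of 3 and acks=all, losing
        // one broker does not lose acknowledged messages.
        partition, offset, err := producer.SendMessage(&sarama.ProducerMessage{
            Topic: "events",
            Value: sarama.StringEncoder("hello"),
        })
        fmt.Println(partition, offset, err)
    }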
