MariaDB Galera Cluster and the CAP theorem

Where does MariaDB Galera Cluster lie according to the CAP theorem, CP or AP, based on a brief explanation of how it works?

Consistency -- For handling the "critical read" problem, Galera needs a little help. See http://mysql.rjweb.org/doc.php/galera#critical_reads
Otherwise, one can state that Galera survives "any" single-point-of-failure.
Galera is normally deployed as 3 nodes, one in each of 3 geographic locations. That means that no single machine failure, data center failure, earthquake, tornado, network outage, etc., can take out more than one node at a time. The other two nodes (whichever two survive and still talk to each other) will declare that they "have a quorum" and continue to accept writes and deliver reads. Further, "split brain" is not possible; the risk of split brain is what keeps any attempt at dual-master, even with monitoring, from surviving every SPOF.
If the third node or the network is repaired, the Cluster goes about patching up the data as needed, so that the 3 nodes again have identical data.
Granted, this is not quite the same as the definition of CAP, but it is a reasonable goal for a computer cluster.
How it works (in a tiny nutshell)... Each node talks to each other node. It does this only during the COMMIT of a transaction. (Hence, it is reasonably efficient even when spread across a WAN, as needed to survive natural disasters.) The COMMIT says to the other nodes "I am about to do this write; is it OK?" Without actually doing the write, they check Galera's magic sauce to see if it will succeed. Once everyone says "yes", the COMMIT returns success to the client. (That gives you a hint of the "critical" read issue.)
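To make that commit-time handshake concrete, here is a minimal Python sketch of the idea. It is not Galera's real API; the Node class, the certify() check, and the global sequence counter are purely illustrative assumptions.

    # Conceptual sketch of certification-based replication at COMMIT time.
    # Node, certify(), and the global sequence counter are illustrative
    # assumptions, not Galera's actual implementation.
    GLOBAL_SEQNO = 0  # stand-in for the cluster-wide total order of writesets

    class Node:
        def __init__(self, name):
            self.name = name
            self.pending = []  # certified writesets waiting to be applied locally

        def certify(self, writeset):
            # Certification fails if the writeset touches rows already touched
            # by a pending (not yet applied) writeset on this node.
            touched = {row for ws in self.pending for row in ws["rows"]}
            return not (writeset["rows"] & touched)

    def commit(other_nodes, rows):
        """Broadcast the writeset at COMMIT; succeed only if every node certifies it."""
        global GLOBAL_SEQNO
        GLOBAL_SEQNO += 1
        writeset = {"seqno": GLOBAL_SEQNO, "rows": set(rows)}
        if all(node.certify(writeset) for node in other_nodes):
            for node in other_nodes:
                node.pending.append(writeset)  # applied asynchronously afterwards
            return "COMMIT OK"
        return "DEADLOCK"  # the originating node rolls back; the client retries

    others = [Node("b"), Node("c")]
    print(commit(others, rows=["user:42"]))  # COMMIT OK on an idle cluster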

Related

MariaDB Spider with Galera Clusters failover solutions

I am having problems trying to build a database solution for the experiment, to ensure HA and performance (sharding).
Right now I have a Spider node and two Galera clusters (3 nodes in each cluster), as shown in the figure below, and this configuration works well in general cases.
However, as far as I know, when the Spider engine performs sharding, it must be given a primary IP per shard to which it distributes SQL statements, i.e. one node in each of the two different Galera clusters.
So my first question here is:
Q1): When machine .12 shuts down due to destruction, how can I make .13 or .14 (one of them) automatically replace .12?
(Figure: the servers that the Spider engine knows about)
Q2): Are there any open source tools (or technologies) that can help me deal with this situation? If so, please explain how they work. (Maybe MaxScale? But I don't really know what it is or what it can do.)
Q3): The motivation for this experiment is as follows. An automated factory has many machines, and each machine generates data that must be recorded during the production process (maybe hundreds or thousands of records per second) in order to observe the operation of the machines and make the quality of each batch of products as good as possible.
So my question is: what do you think of this architecture (Figure 1)? Or please provide your suggestions.
You could use MaxScale in front of the Galera cluster to make the individual nodes appear like a combined cluster. This way Spider will be able to seamlessly access the shard even if one of the nodes fails. You can take a look at the MaxScale tutorial for instructions on how to configure it for a Galera cluster.
Something like this should work:
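For illustration only, a minimal maxscale.cnf along these lines might do it. The 192.168.0.x addresses stand in for your .12/.13/.14 machines, and the credentials and ports are placeholders; check the MaxScale tutorial for the exact option names of your version.

    # Hedged sketch of a maxscale.cnf for one Galera cluster.
    # All addresses, credentials, and ports are placeholders.
    [server1]
    type=server
    address=192.168.0.12
    port=3306
    protocol=MariaDBBackend

    [server2]
    type=server
    address=192.168.0.13
    port=3306
    protocol=MariaDBBackend

    [server3]
    type=server
    address=192.168.0.14
    port=3306
    protocol=MariaDBBackend

    [Galera-Monitor]
    type=monitor
    module=galeramon
    servers=server1,server2,server3
    user=maxscale
    password=maxscale_pw

    [Galera-Service]
    type=service
    router=readwritesplit
    servers=server1,server2,server3
    user=maxscale
    password=maxscale_pw

    [Galera-Listener]
    type=listener
    service=Galera-Service
    protocol=MariaDBClient
    port=4006
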
This of course has the same limitation that a single database node has: if the MaxScale server goes down, you'll have to switch to a different MaxScale for that cluster. The benefit of using MaxScale is that it is in some sense stateless which means it can be started and stopped almost instantly. A network load balancer (e.g. ELB) can already provide some form of protection from this problem.

What is the difference between innodb_flush_log_at_trx_commit and sync_binlog?

I am reading what affects the durability of InnoDB and find 2 config fields. I have the following questions:
What is the difference between the 2 config fields? My first guess is that innodb_flush_log_at_trx_commit affects InnoDB redo log, and sync_binlog affects the MySQL standard binlog. Am I right?
My second question: if the logging process is divided into 3 phases (write to the log buffer, write to the OS cache, flush to disk), in which phase does replication to the secondary happen?
About sync_binlog, I have another question. If sync_binlog is set to 0, then according to the InnoDB docs the flush does not happen on commit but is delegated to the OS. Could it be the case that the binlog is synced before the commit, so that replication sees data that is not committed?
Add innodb_flush_method to the list.
innodb_flush_log_at_trx_commit should be 1 for security (won't lose any data in a crash) or 2 for speed. (0, I think, is there for historic reasons and has no advantage.)
sync_binlog prevents a Slave from trying to read off the end of the Master's binlog (after a crash). Since the data was already sent to the Slaves, it is not a "data loss" issue, but an annoying error that is easily rectified by hand (point the Slave at the next binlog).
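As a quick, hedged illustration (assuming the PyMySQL driver; host and credentials are placeholders), you can check whether a server is running the fully durable combination of the two settings:

    # Illustrative check of the two durability settings using PyMySQL.
    # Host and credentials are placeholders; adjust for your environment.
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", user="root", password="secret")
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL VARIABLES WHERE Variable_name IN "
                    "('innodb_flush_log_at_trx_commit', 'sync_binlog')")
        settings = dict(cur.fetchall())

    # Fully durable: redo log flushed at every COMMIT, binlog fsync'ed at every COMMIT.
    durable = (settings.get("innodb_flush_log_at_trx_commit") == "1"
               and settings.get("sync_binlog") == "1")
    print(settings, "durable" if durable else "may lose recent transactions on a crash")
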
I think this is the order of the replication steps. Note: in the case of a transaction, nothing happens until the COMMIT. (See also "binlog_cache_size".)
1. Send the data to the Slaves, and flush the copy to the binlog if sync_binlog is on (else let it eventually be flushed). (I don't know which happens first, or even whether they are done by separate threads.)
2. The I/O thread on the Slave copies the data to its "relay log".
3. Eventually (usually right away), the execute thread performs the query.
(Multi-source replication and parallel execution on the Slave add further complications.)
Semi-sync gets involved somewhere.
Galera -- see "gcache", etc.
What do you mean by "secondary replication"?
Your item 3 may be referring to an obscure case where something could drop through the cracks because too many pieces of hardware are involved.
If you want reliability, see Galera Cluster and/or InnoDB Cluster. This goes beyond what is available with simple Master-Slave replication even with semi-sync.

MariaDB Galera cluster does not come up after killing the mysql process

I have a Mariadb Galera cluster with 2 nodes and it is up and running.
Before moving to production, I want to make sure that if a node crashes abruptly, it comes back up on its own.
I tried using systemd "restart", but after killing the mysql process the mariadb service does not come up. Is there any tool or method that I can use to automate bringing up the nodes after crashes?
A Galera cluster needs to have quorum (at least 3 nodes).
In order to avoid a split-brain condition, the minimum recommended number of nodes in a cluster is 3. Blocking state transfer is yet another reason to require a minimum of 3 nodes in order to enjoy service availability in case one of the members fails and needs to be restarted. While two of the members will be engaged in state transfer, the remaining member(s) will be able to keep on serving client requests.
You can read more here.
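To see the arithmetic behind that recommendation, here is a minimal Python sketch of the majority rule (Galera actually uses a weighted quorum; equal node weights are assumed here for illustration):

    # Simplified quorum check: a partition keeps serving clients only if it can
    # still see a strict majority of the cluster. Galera uses weighted quorum;
    # equal weights are assumed for this illustration.
    def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
        return 2 * reachable_nodes > cluster_size

    print(has_quorum(1, 2))  # False: in a 2-node cluster the survivor loses quorum
    print(has_quorum(2, 3))  # True: in a 3-node cluster two survivors keep quorum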

How does Galera Cluster guarantee consistency?

I'm searching for a highly available SQL solution! One of the articles I read was about "virtually synchronous" replication in Galera Cluster: https://www.percona.com/blog/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/
He says
When the writeset is actually applied on a given node, any locking conflicts it detects with open (not-yet-committed) transactions on that node cause that open transaction to get rolled back.
and
Writesets being applied by replication threads always win
What will happen if the WriteSet conflicts with a committed transaction?
He also says:
Writesets are then “certified” on every node (in order).
How does Galera Cluster keep WriteSets ordered across the cluster? Is there a hidden master node that orders the WriteSets, something like ZooKeeper? Or what?
This is for the second question (about how Galera orders the writesets).
Galera implements Extended Virtual Synchrony (EVS) based on the Totem protocol. The Totem protocol implements a form of token passing, where only the node with the token is allowed to send out new requests (as I understand it). So the writes are ordered since only one node at a time has the token.
For the academic background, you can look at these:
The Totem Single-Ring Ordering and Membership Protocol
The database state machine and group communication issues
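To picture the token-passing idea from above, here is a rough conceptual sketch in Python. It is not the real Totem/EVS implementation; the node names, pending queues, and sequence counter are assumptions for illustration. The token carries the next global sequence number, so every node ends up delivering messages in the same total order.

    # Toy illustration of token-based total ordering (Totem-style), not the
    # actual Galera/EVS implementation. The token circulates around the ring;
    # only the holder may broadcast, and it stamps each message with the next
    # global sequence number, so every node sees the same order.
    nodes = ["node1", "node2", "node3"]
    pending = {"node1": ["w1"], "node2": ["w2", "w3"], "node3": []}
    delivered = {n: [] for n in nodes}

    token_seq = 0
    for ring_pos in range(2 * len(nodes)):        # let the token go around twice
        holder = nodes[ring_pos % len(nodes)]     # token passes to the next node
        while pending[holder]:
            msg = pending[holder].pop(0)
            token_seq += 1                        # sequence number taken from the token
            for n in nodes:                       # broadcast in that global order
                delivered[n].append((token_seq, msg))

    print(delivered["node1"] == delivered["node2"] == delivered["node3"])  # True
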
(This Answer does not directly tackle your Question, but it may give you confidence that Galera is 'good'.)
In Galera (PXC, etc), there are two general times when a transaction can fail.
On the node where the transaction is being run, the actions are compared to what is currently running on the same node. If there is a conflict, either one of the transactions is stalled (think innodb_lock_wait_timeout) or is deadlocked (and rolled back).
At COMMIT time, info is sent to all the other nodes; they check your transaction against anything on the node or pending (in gcache). If there is a conflict, a message is sent back saying that there would be trouble. So, the originating node has the COMMIT fail. For this reason, you must check for errors even on the COMMIT statement.
As with single-node systems, a deadlock is usually resolved by replaying the entire transaction.
In the case of autocommit, there is a small, configurable, number of retries, after which the statement will fail. So, again, check for errors. However, since a retry has already been tried, you may want to abort the program.
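As a hedged sketch of "check for errors even on the COMMIT" (assuming the PyMySQL driver; the accounts table, connection details, and retry limit are placeholder assumptions):

    # Illustrative retry loop for Galera: a certification conflict surfaces as a
    # deadlock error on COMMIT, so the whole transaction is replayed.
    # Table name, connection, and MAX_RETRIES are placeholder assumptions.
    import pymysql

    MAX_RETRIES = 3

    def transfer(conn):
        for attempt in range(MAX_RETRIES):
            try:
                with conn.cursor() as cur:
                    cur.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")
                    cur.execute("UPDATE accounts SET balance = balance + 10 WHERE id = 2")
                conn.commit()                      # certification happens here
                return True
            except pymysql.err.OperationalError:   # e.g. deadlock reported at COMMIT
                conn.rollback()                    # replay the entire transaction
        return False

    # Usage (placeholders):
    # conn = pymysql.connect(host="node1", user="app", password="secret", db="test")
    # transfer(conn)
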
Currently (in my opinion) Galera, with at least 3 nodes in at least 3 different physical locations, is the best available HA solution for MySQL. It can effectively survive any single-point-of-failure. (Group Replication / InnoDB Cluster, from Oracle, is coming soon, and is very promising.)
One thing to note is that the "critical read" problem has a solution in Galera, but you have to take action. See wsrep_sync_wait. (As of this writing, InnoDB Cluster has no solution.)
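For example (a sketch assuming the PyMySQL driver; the connection details and the accounts table are placeholders), a session that needs read-your-writes across nodes can do:

    # Illustrative "critical read" pattern: before the read, ask the node to wait
    # until it has applied everything already committed cluster-wide.
    # Connection details and the query are placeholder assumptions.
    import pymysql

    conn = pymysql.connect(host="node2.example.com", user="app", password="secret", db="test")
    with conn.cursor() as cur:
        cur.execute("SET SESSION wsrep_sync_wait = 1")   # 1 = wait before READ statements
        cur.execute("SELECT balance FROM accounts WHERE id = 1")
        print(cur.fetchone())
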
See http://mysql.rjweb.org/doc.php/galera for tips (some of which are included above) on coding differences when moving to PXC/Galera.

Raft: some questions about read-only queries

Chapter 6.4 of the Raft thesis gives steps to bypass the Raft log for read-only queries while still preserving linearizability:
1. If the leader has not yet marked an entry from its current term committed, it waits until it has done so. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. As soon as this no-op entry is committed, the leader's commit index will be at least as large as any other servers' during its term.
2. The leader saves its current commit index in a local variable readIndex. This will be used as a lower bound for the version of the state that the query operates against.
3. The leader needs to make sure it hasn't been superseded by a newer leader of which it is unaware. It issues a new round of heartbeats and waits for their acknowledgments from a majority of the cluster. Once these acknowledgments are received, the leader knows that there could not have existed a leader for a greater term at the moment it sent the heartbeats. Thus, the readIndex was, at the time, the largest commit index ever seen by any server in the cluster.
4. The leader waits for its state machine to advance at least as far as the readIndex; this is current enough to satisfy linearizability.
5. Finally, the leader issues the query against its state machine and replies to the client with the results.
My questions:
a) For step 1, does this apply only at the time the leader has just been elected? Because only a new leader has no entry committed for the current term. And since the no-op entry is necessary to find out which entries are currently committed, this step is in fact always needed once an election completes, not only for read-only queries? In other words, normally, once the leader has been active for a while, it must have entries committed for its term (including the no-op entry).
b) For step 3, does it mean that whenever the leader needs to serve a read-only query, one extra heartbeat is sent, regardless of any currently outstanding heartbeat (sent but without a majority of responses yet) or the next scheduled heartbeat?
c) For step 4, is it only for followers (for cases where followers help offload the processing of read-only queries)? Because on the leader, a committed index already means it was applied to the local state machine.
All in all, normally, the leader (active for a while) only needs to do step 3 and step 5, right?
a: This is indeed only the case when the leader is first elected. In practice, when a read-only query is received, you check whether an entry has been committed from the leader's current term and queue or reject the query if not.
b: In practice, most implementations batch read-only queries for more efficiency. You don't need to send many concurrent heartbeats. If a heartbeat is outstanding, the leader can enqueue any new reads to be evaluated after that heartbeat is completed. Once a heartbeat is completed, if any additional queries are enqueued then the leader starts another heartbeat. This has the effect of batching linearizable read-only queries for better efficiency.
c: It is not true that the leader's lastApplied index (the index of its state machine) is always equivalent to its commitIndex. Indeed, this is why there is a lastApplied index in Raft in the first place. Leaders do not necessarily have to synchronously apply an index at the same time as committing that index. This is really implementation specific. In practice, Raft implementations usually apply entries in a different thread. So, an entry can be committed and then enqueued for application to the state machine. Some implementations put entries on a queue to be applied to the state machine and allow the state machine to pull entries from that queue to be applied at the state machine's own pace, so when an entry may be applied is unspecified. It's just critical that a read-only query be applied after the last command committed by the leader.
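Putting (a), the heartbeat in (b), and the lastApplied/commitIndex distinction in (c) together, a minimal Python sketch of the leader-side read path might look like this. The leader object, its fields, and helpers such as broadcast_heartbeat_and_wait_for_majority are assumptions for illustration, not from any particular Raft library.

    # Illustrative leader-side handling of a linearizable read-only query.
    # All names and helpers here are assumptions for the sketch, not a real Raft library.
    import time

    def linearizable_read(leader, query):
        # Step 1: a no-op from the current term must already be committed.
        if not leader.has_committed_entry_from_current_term():
            raise RuntimeError("queue or reject the read until the no-op commits")

        # Step 2: remember the commit index as the read's lower bound.
        read_index = leader.commit_index

        # Step 3: confirm leadership with a heartbeat round (reads can share one round).
        if not leader.broadcast_heartbeat_and_wait_for_majority():
            raise RuntimeError("superseded by a newer leader; retry elsewhere")

        # Step 4: the state machine may lag commit_index, so wait for it to catch up.
        while leader.last_applied < read_index:
            time.sleep(0.001)

        # Step 5: evaluate the query against the state machine and reply.
        return leader.state_machine.apply_read(query)
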
Also, you ask if this only applies to followers. Linearizable queries can only be evaluated through the leader. I suppose there's some algorithm with which you could do linearizable reads on followers, but it would be inefficient. Followers can only maintain sequential consistency for queries. In that case, servers respond to client operations with the index of the state machine when the operation was evaluated. Clients send their last received index with each operation, and when a server receives an operation, it uses the same algorithm to ensure that its state machine's lastApplied index is at least as great as the client's index. This is necessary to ensure that the client does not see state go back in time when switching servers.
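A small sketch of that session-consistency handshake, with all names being illustrative assumptions:

    # Illustrative session consistency for follower reads: the client carries the
    # last index it has seen; a server only answers once its state machine has
    # caught up to that index, so the client never observes state going backwards.
    import time

    def follower_read(server, query, client_last_index):
        while server.last_applied < client_last_index:
            time.sleep(0.001)                  # wait for the state machine to catch up
        result = server.state_machine.apply_read(query)
        return result, server.last_applied    # client stores this index for its next request
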
There are some other complexities to read-only queries beyond what's described in the Raft literature if you want to support FIFO consistency for concurrent operations from a single client. Some of these are described in Copycat's architecture documentation.
