I was reading the mariaDD knowledge base on Galera Cluster and i came across this:
The basic difference between synchronous and asynchronous replication is that "synchronous" guarantees that if changes happened on one node of the cluster, they happened on other nodes "synchronously", or at the same time. "Asynchronous" gives no guarantees about the delay between applying changes on "master" node and the propagation of changes to "slave" nodes. The delay can be short or long. This also implies that if master node crashes, some of the latest changes may be lost
With the last sentence, i have always understood that even though the updates on the slave in the asynchronous cluster setup is not performed at the same time, it logs these updates to a bin log file as the updates are being made on the master. So in the case that the master crashes before all the data is passed on to the slave, the updates will still go ahead when the master is restored since the bin log file logged the updates. Can somebody please tell me if my understanding is wrong and clarify on the matter for me please. Thanks.
In your example of a normal replication pair, the slave would catch up after the master comes back. Assuming the master does come back, you wouldn't really lose the data but if the master is permanently dead, the data is lost. The knowledge base article you mention is talking about the replication delay and not the overall integrity of the replication stream.
With normal replication, if the slave io thread (the part that gets the replication events from the master) is able to keep up with the master, then the slave may only lose a couple seconds if the master crashes. However, if it cannot keep up and is for example 1 hour behind, the slave would lose access to 1 hour of data. Another way you could lose access to data on the slave is if you have a max relay log size set and that is reached.
Galera makes sure that the write is sent to every node in the cluster before it is actually committed on any of the nodes so once the node that the write is done on commits the write, all of the other nodes will commit the same write. With galera, all writes basically happen at the same time on every node. Losing any node at any time during normal operation will not cause any data loss.
Related
I am reading what affects the durability of InnoDB and find 2 config fields. I have the following questions:
What is the difference between the 2 config fields? My first guess is that innodb_flush_log_at_trx_commit affects InnoDB redo log, and sync_binlog affects the MySQL standard binlog. Am I right?
My second question is if the logging process is divided into 3 phases: write to buffer, write to os cache, flush to disk. Then in which phase does the secondary replication happen?
about sync_binlog, I have another question. If sync_binlog is set to 0, according to innodb doc, the flush is not on commits but delegated to OS. Could it be the case that the binlog is synced before commit so that replication see data not committed?
Add innodb_flush_method to the list.
innodb_flush_log_at_trx_commit should be 1 for security (won't lose any data in a crash) or 2 for speed. (0, I think, is there for historic reasons and has no advantage.)
sync_binlog avoids a Slave from trying to read off the end of the Master's binlog (after a crash). Since the data was already sent to the Slaves, it is not a "data loss" issue, but an annoying error that is easily rectified by hand (move it to the next binlog).
I think this is the order of the replication steps. Note: In the case of a transaction, nothing happens until the COMMIT. (See also "binlog_cache_size".)
Send data to Slaves and flush the copy to the binlog if sync_binlog is on (else let it eventually be flushed). (I don't know which happens first, or even whether they are done by separate threads.)
The I/O thread on the Slave copies the data to its "relay log".
Eventually (usually right away), the execute thread performs the query.
(Multi-source replication and parallel execution on the Slave add further complications.)
Semi-sync gets involved somewhere.
Galera -- see "gcache", etc.
What do you mean by "secondary replication".
Your item 3 may be referring to an obscure case where something could drop through the cracks because too many pieces of hardware are involved.
If you want reliability, see Galera Cluster and/or InnoDB Cluster. This goes beyond what is available with simple Master-Slave replication even with semi-sync.
I'm searching for a high-available SQL solution! One of the articles that I read was about "virtually synchronized" in Galera Cluster: https://www.percona.com/blog/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/
He says
When the writeset is actually applied on a given node, any locking
conflicts it detects with open (not-yet-committed) transactions on
that node cause that open transaction to get rolled back.
and
Writesets being applied by replication threads always win
What will happen if the WriteSet conflicts with a committed transaction?
He also says:
Writesets are then “certified” on every node (in order).
How does Galera Cluster make WriteSets ordered over a cluster? Is there any hidden master node who make WriteSets ordered; something like Zookeeper? or what?
This is for the second question (about how Galera orders the writesets).
Galera implements Extended Virtual Synchrony (EVS) based on the Totem protocol. The Totem protocol implements a form of token passing, where only the node with the token is allowed to send out new requests (as I understand it). So the writes are ordered since only one node at a time has the token.
For the academic background, you can look at these:
The Totem Single-Ring Ordering and Membership Protocol
The database state machine and group communication issues
(This Answer does not directly tackle your Question, but it may give you confidence that Galera is 'good'.)
In Galera (PXC, etc), there are two general times when a transaction can fail.
On the node where the transaction is being run, the actions are compared to what is currently running on the same node. If there is a conflict, either one of the transactions is stalled (think innodb_lock_wait_timeout) or is deadlocked (and rolled back).
At COMMIT time, info is sent to all the other nodes; they check your transaction against anything on the node or pending (in gcache). If there is a conflict, a message is sent back saying that there would be trouble. So, the originating node has the COMMIT fail. For this reason, you must check for errors even on the COMMIT statement.
As with single-node systems, a deadlock is usually resolved by replaying the entire transaction.
In the case of autocommit, there is a small, configurable, number of retries, after which the statement will fail. So, again, check for errors. However, since a retry has already been tried, you may want to abort the program.
Currently (in my opinion) Galera, with at least 3 nodes in at least 3 different physical locations, is the best available HA solution for MySQL. It can effectively survive any single-point-of-failure. (Group Replication / InnoDB Cluster, from Oracle, is coming soon, and is very promising.)
One thing to note is that the "critical read" problem has a solution in Galera, but you have to take action. See wsrep_sync_wait. (As of this writing, InnoDB Cluster has no solution.)
See http://mysql.rjweb.org/doc.php/galera for tips (some of which are included above) on coding differences when moving to PXC/Galera.
In the raft's thesis document chapter 6.4, it gives steps to bypass the Raft log for read-only queries and still preserve linearizability:
If the leader has not yet marked an entry from its current term committed, it waits until it has done so. The Leader Completeness
Property guarantees that a leader has all committed entries, but at
the start of its term, it may not know which those are. To find out,
it needs to commit an entry from its term. Raft handles this by having
each leader commit a blank no-op entry into the log at the start of
its term. As soon as this no-op entry is committed, the leader’s
commit index will be at least as large as any other servers’ during
its term.
The leader saves its current commit index in a local variable readIndex. This will be used as a lower bound for the version of the
state that the query operates against.
The leader needs to make sure it hasn’t been superseded by a newer leader of which it is unaware. It issues a new round of heartbeats and
waits for their acknowledgments from a majority of the cluster. Once
these acknowledgments are received, the leader knows that there could
not have existed a leader for a greater term at the moment it sent the
heartbeats. Thus, the readIndex was, at the time, the largest commit
index ever seen by any server in the cluster.
The leader waits for its state machine to advance at least as far as the readIndex; this is current enough to satisfy linearizability.
Finally, the leader issues the query against its state machine and replies to the client with the results.
My questions:
a) for step 1, is it only for case at the time of the leader is just elected? Because only new leader has no entry committed for current term. And since the no-op entry is necessary to find out the current committed entries, then this step in fact is always needed upon election done, but not only specific to read-only query? In other words, normally, when the leader is active for a while, it must has entries committed for its term (including the no-op entry).
b) for step 3, does it mean as long as leader needs to serve read only query, then one extra heartbeat would be sent, regardless of current outstanding heartbeat (sent but no major responses received yet) or the next scheduled heartbeat?
c) for step 4, is it only for followers (for cases where followers help offload the processing of read-only queries)? Because on leader, committed index already means it was applied to local state machine.
All in all, normally, the leader (active for a while) only needs to do step 3 and step 5, right?
a: This is indeed only the case when the leader is first elected. In practice, when a read-only query is received, you check whether an entry has been committed from the leader's current term and queue or reject the query if not.
b: In practice, most implementations batch read-only queries for more efficiency. You don't need to send many concurrent heartbeats. If a heartbeat is outstanding, the leader can enqueue any new reads to be evaluated after that heartbeat is completed. Once a heartbeat is completed, if any additional queries are enqueued then the leader starts another heartbeat. This has the effect of batching linearizable read-only queries for better efficiency.
c: It is not true that the leader's lastApplied index (the index of its state machine) is always equivalent to its commitIndex. Indeed, this is why there is a lastApplied index in Raft in the first place. Leaders do not necessarily have to synchronously apply an index at the same time as committing that index. This is really implementation specific. In practice, Raft implementations usually apply entries in a different thread. So, an entry can be committed and then enqueued for application to the state machine. Some implementations put entries on a queue to be applied to the state machine and allow the state machine to pull entries from that queue to be applied at the state machine's own pace, so when an entry may be applied is unspecified. It's just critical that a read-only query be applied after the last command committed by the leader.
Also, you ask if this only applies to followers. Linearizable queries can only be evaluated through the leader. I suppose there's some algorithm with which you could do linearizable reads on followers, but it would be inefficient. Followers can only maintain sequential consistency for queries. In that case, servers respond to client operations with the index of the state machine when the operation was evaluated. Clients send their last received index with each operation, and when a server receives an operation, it uses the same algorithm to ensure that its state machine's lastApplied index is at least as great as the client's index. This is necessary to ensure that the client does not see state go back in time when switching servers.
There are some other complexities to read-only queries beyond what's described in the Raft literature if you want to support FIFO consistency for concurrent operations from a single client. Some of these are described in Copycat's architecture documentation.
Say I have a multi-step, asynchronous process with these restrictions:
Individual steps can be performed by any worker
Steps must be performed in-order
The approach I'm considering:
Insert a db row that represents the entire process, with a "Steps completed" column to keep track of the progress.
Subscribe to a queue that will receive a message when the entire process is done.
Upon completion of each step, update the db row and queue the next step in the process.
After the last step is completed, queue the "process is complete" message.
Delete the db row.
Thoughts? Pitfalls? Smarter ways to do it?
I've built a system very similar to what you've described in a large, task-intensive document processing system, and have had to live with both the pros and the cons for the last 7 years now. Your approach is solid and workable, but I see some drawbacks:
Potentially vulnerable to state change (i.e., what if process inputs change before all steps are queued, then the later steps could have inputs inconsistent with earlier steps)
More infrastructural than you'd like, involving both a DB and a queue = more points of failure, harder to set up, more documentation required = doesn't quite feel right
How do you keep multiple workers from acting on the same step concurrently? In other words, the DB row says 4 steps are completed, how does a worker process know if it can take #5 or not? Doesn't it need to know whether another process is already working on this? One way or another (DB or MQ) you need to include additional state for locking.
Your example is robust to failure, but doesn't address concurrency. When you add state to address concurrency, then failure handling becomes a serious problem. For example, a process takes step 5, and then puts the DB row into "Working" state. Then when that process fails, step 5 is stuck in "Working" state.
Your orchestrator is a bit heavy, as it is doing a lot of synchronous DB operations, and I would worry that it might not scale as well as the rest of the architecture, as there can be only one of those...this would depend on how long-running your steps were compared to a database transaction--this would probably only become an issue at very massive scale.
If I had it to do over again, I would definitely push even more of the orchestration onto the worker processes. So, the orchestration code is common and could be called by any worker process, but I would keep the central, controlling process as light as possible. I would also use only message queues and not any database to keep the architecture simple and less synchronous.
I would create an exchange with 2 queues: IN and WIP (work in progress)
The central process is responsible for subscribing to process requests, and checking the WIP queue for timed out steps.
1) When the central process received a request for a given processing (X), it invokes the orchestration code, and it loads the first task (X1) into the IN queue
2) The first available worker process (P1) transactionally dequeues X1, and enqueues it into the WIP queue, with a conservative time-to-live (TTL) timeout value. This dequeueing is atomic, and there are no other X tasks in IN, so no second process can work on an X task.
3) If P1 terminates suddenly, no architecture on earth can save this process except for a timeout. At the end of the timeout period, the central process will find the timed out X1 in WIP, and will transactionally dequeue X1 from WIP and enqueue it back into IN, providing the appropriate notifications.
4) If P1 terminates abnormally but gracefully, then the worker process will transactionally dequeue X1 from WIP and enqueue it back into IN, providing the appropriate notifications. Depending on the exception, the worker process could also choose to reset the TTL and retry the step.
5) If P1 hangs indefinitely, or exceeds its TTL, same result as #3. The central process handles it, and presumably the worker process will at some point be recycled--or the rule could be to recycle the worker process anytime there's a timeout.
6) If P1 succeeds, then the worker process will determine the next step, either X2 or X-done. If the next step is X2, then the worker process will transactionally dequeue X1 from WIP, and enqueue X2 into IN. If the next step is X-done, then the processing is complete, and the appopriate action can be taken, perhaps this would be enqueueing X-done into IN for subsequent processing by the orchestrator.
The benefits of my suggested approach are:
Contention between worker processes is specified
All possible failure scenarios (crash, exception, hang, and success) are handled
Simple architecture can be completely implemented with RabbitMQ and no database, which makes it more scalable
Since workers handle determining and enqueueing the next step, there is a more lightweight orchestrator, leading to a more scalable system
The only real drawback is that it is potentially vulnerable to state change, but often this is not a cause for concern. Only you can know whether this would be an issue in your system.
My final thought on this is: you should have a good reason for this orchestration. After all, if process P1 finishes task X1 and now it is time for some process to work on next task X2, it seems P1 would be a very good candidate, as it just finished X1 and is now available. By that logic, a process should just gun through all the steps until completion--why mix and match processes if the tasks need to be done serially? The only async boundary really would be between the client and the worker process. But I will assume that you have a good reason to do this, for example, the processes can run on different and/or resource-specialized machines.
This question is the result of two other questions I've asked in the last few days.
I'm creating a new question because I think it's related to the "next step" in my understanding of how to control the flow of my send/receive, something I didn't get a full answer to yet.
The other related questions are:
An IOCP documentation interpretation question - buffer ownership ambiguity
Non-blocking TCP buffer issues
In summary, I'm using Windows I/O Completion Ports.
I have several threads that process notifications from the completion port.
I believe the question is platform-independent and would have the same answer as if to do the same thing on a *nix, *BSD, Solaris system.
So, I need to have my own flow control system. Fine.
So I send send and send, a lot. How do I know when to start queueing the sends, as the receiver side is limited to X amount?
Let's take an example (closest thing to my question): FTP protocol.
I have two servers; One is on a 100Mb link and the other is on a 10Mb link.
I order the 100Mb one to send to the other one (the 10Mb linked one) a 1GB file. It finishes with an average transfer rate of 1.25MB/s.
How did the sender (the 100Mb linked one) knew when to hold the sending, so the slower one wouldn't be flooded? (In this case the "to-be-sent" queue is the actual file on the hard-disk).
Another way to ask this:
Can I get a "hold-your-sendings" notification from the remote side? Is it built-in in TCP or the so called "reliable network protocol" needs me to do so?
I could of course limit my sendings to a fixed number of bytes but that simply doesn't sound right to me.
Again, I have a loop with many sends to a remote server, and at some point, within that loop I'll have to determine if I should queue that send or I can pass it on to the transport layer (TCP).
How do I do that? What would you do? Of course that when I get a completion notification from IOCP that the send was done I'll issue other pending sends, that's clear.
Another design question related to this:
Since I am to use a custom buffers with a send queue, and these buffers are being freed to be reused (thus not using the "delete" keyword) when a "send-done" notification has been arrived, I'll have to use a mutual exlusion on that buffer pool.
Using a mutex slows things down, so I've been thinking; Why not have each thread have its own buffers pool, thus accessing it , at least when getting the required buffers for a send operation, will require no mutex, because it belongs to that thread only.
The buffers pool is located at the thread local storage (TLS) level.
No mutual pool implies no lock needed, implies faster operations BUT also implies more memory used by the app, because even if one thread already allocated 1000 buffers, the other one that is sending right now and need 1000 buffers to send something will need to allocated these to its own.
Another issue:
Say I have buffers A, B, C in the "to-be-sent" queue.
Then I get a completion notification that tells me that the receiver got 10 out of 15 bytes. Should I re-send from the relative offset of the buffer, or will TCP handle it for me, i.e complete the sending? And if I should, can I be assured that this buffer is the "next-to-be-sent" one in the queue or could it be buffer B for example?
This is a long question and I hope none got hurt (:
I'd loveeee to see someone takes the time to answer here. I promise I'll double-vote for him! (:
Thank you all!
Firstly: I'd ask this as separate questions. You're more likely to get answers that way.
I've spoken about most of this on my blog: http://www.lenholgate.com but then since you've already emailed me to say that you read my blog you know that...
The TCP flow control issue is such that since you are posting asynchronous writes and these each use resources until they complete (see here). During the time that the write is pending there are various resource usage issues to be aware of and the use of your data buffer is the least important of them; you'll also use up some non-paged pool which is a finite resource (though there is much more available in Vista and later than previous operating systems), you'll also be locking pages in memory for the duration of the write and there's a limit to the total number of pages that the OS can lock. Note that both the non-paged pool usage and page locking issues aren't something that's documented very well anywhere, but you'll start seeing writes fail with ENOBUFS once you hit them.
Due to these issues it's not wise to have an uncontrolled number of writes pending. If you are sending a large amount of data and you have a no application level flow control then you need to be aware that if you send data faster than it can be processed by the other end of the connection, or faster than the link speed, then you will begin to use up lots and lots of the above resources as your writes take longer to complete due to TCP flow control and windowing issues. You don't get these problems with blocking socket code as the write calls simply block when the TCP stack can't write any more due to flow control issues; with async writes the writes complete and are then pending. With blocking code the blocking deals with your flow control for you; with async writes you could continue to loop and more and more data which is all just waiting to be sent by the TCP stack...
Anyway, because of this, with async I/O on Windows you should ALWAYS have some form of explicit flow control. So, you either add application level flow control to your protocol, using an ACK, perhaps, so that you know when the data has reached the other side and only allow a certain amount to be outstanding at any one time OR if you cant add to the application level protocol, you can drive things by using your write completions. The trick is to allow a certain number of outstanding write completions per connection and to queue the data (or just don't generate it) once you have reached your limit. Then as each write completes you can generate a new write....
Your question about pooling the data buffers is, IMHO, premature optimisation on your part right now. Get to the point where your system is working properly and you have profiled your system and found that the contention on your buffer pool is the most important hot spot and THEN address it. I found that per thread buffer pools didn't work so well as the distribution of allocations and frees across threads tends not to be as balanced as you'd need to that to work. I've spoken about this more on my blog: http://www.lenholgate.com/blog/2010/05/performance-comparisons-for-recent-code-changes.html
Your question about partial write completions (you send 100 bytes and the completion comes back and says that you have only sent 95) isn't really a problem in practice IMHO. If you get to this position and have more than the one outstanding write then there's nothing you can do, the subsequent writes may well work and you'll have bytes missing from what you expected to send; BUT a) I've never seen this happen unless you have already hit the resource problems that I detail above and b) there's nothing you can do if you have already posted more writes on that connection so simply abort the connection - note that this is why I always profile my networking systems on the hardware that they will run on and I tend to place limits in MY code to prevent the OS resource limits ever being reached (bad drivers on pre Vista operating systems often blue screen the box if they can't get non paged pool so you can bring a box down if you don't pay careful attention to these details).
Separate questions next time, please.
Q1. Most APIs will give you "write is possible" event, after you last wrote and writing is available again (can happen immediately if you failed to fill major part of send buffer with the last send).
With completion port, it will arrive just as "new data" event. Think of new data as "read Ok", so there's also a "write ok" event. Names differ between the APIs.
Q2. If a kernel mode transition for mutex acquisition per chunk of data hurts you, I recommend rethinking what you are doing. It takes 3 microseconds at most, while your thread scheduler slice may be as big as 60 milliseconds on windows.
It may hurt in extreme cases. If you think you are programming extreme communications, please ask again, and I promise to tell you all about it.
To address your question about when it knew to slow down, you seem to lack an understanding of TCP congestion mechanisms. "Slow start" is what you're talking about, but it's not quite how you've worded it. Slow start is exactly that -- starts off slow, and gets faster, up to as fast as the other end is willing to go, wire line speed, whatever.
With respect to the rest of your question, Pavel's answer should suffice.