Is there a continuous low-latency 'sync up' process going on between a leader and a follower cluster so that the follower databases are up to date with the copy stored in the leader cluster? Basically, I am trying to understand whether the follower maintains its own read-only copy (not referring to the follower cache) of the follower databases.
The follower cluster periodically synchronizes to changes in the database(s) it follows, so it has some data lag with respect to the leader cluster.
The lag can vary from a few seconds to a few minutes, depending on the overall size of the metadata of the followed database(s).
Once the follower cluster becomes aware of these changes in metadata (data and/or schema objects [e.g. tables] get added/removed), a background process 'warms' the relevant data artifacts from the leader's persistent storage to the SSD of the follower's nodes.
Data is cached (according to the effective caching policy) on the nodes of the leader and on the nodes of the follower, but is persisted only in the leader's persistent storage.
I am using Vault on AWS with the DynamoDB backend. The backend supports HA.
storage "dynamodb" {
ha_enabled = "true"
region = "us-west-2"
table = "vault-data"
}
Reading the HA concept documentation:
https://www.vaultproject.io/docs/concepts/ha.html
To be highly available, one of the Vault server nodes grabs a lock within the data store. The successful server node then becomes the active node; all other nodes become standby nodes. At this point, if the standby nodes receive a request, they will either forward the request or redirect the client depending on the current configuration and state of the cluster -- see the sections below for details. Due to this architecture, HA does not enable increased scalability.
I am not interested in having a fleet of EC2 instances behind an ELB, where only one instance behaves like a master and talks to DynamoDB.
I would like to run N EC2 instances running Vault that read from and write to DynamoDB independently.
Because DynamoDB supports reads and writes from multiple EC2 instances, I would expect to be able to unseal Vault from multiple instances simultaneously and perform read and write operations. This should work even with ha_enabled = "false", without doing the leader election.
Why is this architecture not suggested in the documentation? Why should it not work? Is there a cryptographic limitation that I am missing?
thank you
It is a feature of Vault Enterprise. With it, you can set up a primary cluster and as many "secondary" clusters, better known as performance replicas, as you need. Each cluster has its own storage and unseal mechanism. So you could have one cluster on DynamoDB and the other on Raft. If both are on DynamoDB, then you'll need a DynamoDB table for each.
But keep in mind that performance replicas will always forward write operations to the primary cluster. A write operation is something that affects the global state of Vault. In that sense, a POST to /transit is not considered a write operation.
Another possibility is to have your kv store mounted locally (with the -local flag). Then it will behave like a primary even when mounted on a performance replica, at the price of not being able to replicate to the other cluster.
A final note: DR clusters are an exact copy of any cluster. Each cluster, whether a primary or a replica, can have its DR cluster.
If one leader node in TiDB is down, will my data get lost or the service be affected? How long will it be until the service recovers (i.e. a new leader is re-elected)?
TiDB uses Raft to synchronize data among multiple replicas and guarantees strong consistency of data. If one replica goes down, the other replicas keep the data safe. The default number of replicas in each Region is 3. Based on the Raft protocol, a leader is elected in each Region, and if a single Region leader fails, a new Region leader is elected after at most 2 * lease time (lease time is 10 seconds), that is, within 20 seconds.
Section 6.4 of the Raft thesis gives steps to bypass the Raft log for read-only queries and still preserve linearizability:
1. If the leader has not yet marked an entry from its current term committed, it waits until it has done so. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. As soon as this no-op entry is committed, the leader’s commit index will be at least as large as any other servers’ during its term.
2. The leader saves its current commit index in a local variable readIndex. This will be used as a lower bound for the version of the state that the query operates against.
3. The leader needs to make sure it hasn’t been superseded by a newer leader of which it is unaware. It issues a new round of heartbeats and waits for their acknowledgments from a majority of the cluster. Once these acknowledgments are received, the leader knows that there could not have existed a leader for a greater term at the moment it sent the heartbeats. Thus, the readIndex was, at the time, the largest commit index ever seen by any server in the cluster.
4. The leader waits for its state machine to advance at least as far as the readIndex; this is current enough to satisfy linearizability.
5. Finally, the leader issues the query against its state machine and replies to the client with the results.
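The five steps quoted above can be sketched in Python roughly as follows. This is a minimal single-process illustration, not taken from the thesis or from any Raft library: the Leader class, its field names, and the heartbeat_acks callable (which stands in for a real heartbeat round and returns the number of followers that acknowledged) are all assumptions made for the example, and step 4 applies entries synchronously instead of waiting on a separate apply thread.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class LogEntry:
    term: int
    command: Optional[tuple]  # (key, value), or None for the blank no-op entry


@dataclass
class Leader:
    # Hypothetical in-memory leader; all names here are illustrative.
    current_term: int
    cluster_size: int
    log: List[LogEntry] = field(default_factory=list)
    commit_index: int = 0  # number of committed entries (1-based index of the last one)
    last_applied: int = 0  # number of entries applied to the state machine
    state: Dict[str, str] = field(default_factory=dict)

    def committed_in_current_term(self) -> bool:
        # Step 1: has any entry from this term (e.g. the no-op) been committed?
        committed = self.log[: self.commit_index]
        return any(entry.term == self.current_term for entry in committed)

    def apply_committed(self) -> None:
        # Advance the state machine up to commit_index.
        while self.last_applied < self.commit_index:
            entry = self.log[self.last_applied]
            self.last_applied += 1
            if entry.command is not None:
                key, value = entry.command
                self.state[key] = value

    def linearizable_read(self, key: str, heartbeat_acks: Callable[[], int]):
        # Step 1: a real implementation might queue the query instead of failing.
        if not self.committed_in_current_term():
            raise RuntimeError("no entry committed in the current term yet")
        # Step 2: remember the commit index as the lower bound for this read.
        read_index = self.commit_index
        # Step 3: confirm leadership with a majority (the leader counts itself).
        if heartbeat_acks() + 1 <= self.cluster_size // 2:
            raise RuntimeError("no majority acknowledged the heartbeat")
        # Step 4: make sure the state machine has reached read_index.
        self.apply_committed()
        assert self.last_applied >= read_index
        # Step 5: answer from the local state machine.
        return self.state.get(key)
```

For example, a leader in term 2 whose log holds one term-1 write and its own no-op can serve the read only once the no-op is committed.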
My questions:
a) For step 1, does this only apply right after the leader is elected? Only a new leader has no entry committed in its current term. And since the no-op entry is needed to find out which entries are committed, isn't this step in fact always needed once an election completes, and not specific to read-only queries? In other words, a leader that has been active for a while must already have entries committed in its term (including the no-op entry).
b) For step 3, does it mean that whenever the leader needs to serve a read-only query, one extra heartbeat is sent, regardless of any currently outstanding heartbeat (sent, but responses from a majority not yet received) or the next scheduled heartbeat?
c) For step 4, does this only apply to followers (for cases where followers help offload the processing of read-only queries)? Because on the leader, a committed index already means the entry was applied to the local state machine.
All in all, a leader that has been active for a while normally only needs to do step 3 and step 5, right?
a: This is indeed only the case when the leader is first elected. In practice, when a read-only query is received, you check whether an entry has been committed from the leader's current term and queue or reject the query if not.
b: In practice, most implementations batch read-only queries for more efficiency. You don't need to send many concurrent heartbeats. If a heartbeat is outstanding, the leader can enqueue any new reads to be evaluated after that heartbeat is completed. Once a heartbeat is completed, if any additional queries are enqueued then the leader starts another heartbeat. This has the effect of batching linearizable read-only queries for better efficiency.
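A rough Python sketch of that batching scheme, assuming a hypothetical send_heartbeat callable that starts a heartbeat round and an on_heartbeat_majority callback that the surrounding Raft code invokes once a majority has acknowledged it (neither name comes from a real library):

```python
class ReadBatcher:
    """Batches read-only queries so at most one heartbeat round is outstanding."""

    def __init__(self, send_heartbeat):
        self._send_heartbeat = send_heartbeat  # hypothetical: starts a heartbeat round
        self._in_flight = None  # reads covered by the outstanding heartbeat
        self._pending = []      # reads that arrived while a heartbeat was outstanding

    def submit(self, read):
        # 'read' is a zero-argument callable that evaluates the query.
        if self._in_flight is None:
            self._start([read])
        else:
            self._pending.append(read)

    def _start(self, reads):
        self._in_flight = reads
        self._send_heartbeat()

    def on_heartbeat_majority(self):
        # The outstanding heartbeat completed: its reads may now be evaluated.
        done = self._in_flight or []
        self._in_flight = None
        if self._pending:
            # Reads queued in the meantime share a single new heartbeat.
            batch, self._pending = self._pending, []
            self._start(batch)
        return [read() for read in done]
```

Reads that arrive while a heartbeat is outstanding wait for the next round instead of the current one, because the current round was sent before they were received.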
c: It is not true that the leader's lastApplied index (the index of its state machine) is always equivalent to its commitIndex. Indeed, this is why there is a lastApplied index in Raft in the first place. Leaders do not necessarily have to synchronously apply an index at the same time as committing that index. This is really implementation specific. In practice, Raft implementations usually apply entries in a different thread. So, an entry can be committed and then enqueued for application to the state machine. Some implementations put entries on a queue to be applied to the state machine and allow the state machine to pull entries from that queue to be applied at the state machine's own pace, so when an entry may be applied is unspecified. It's just critical that a read-only query be applied after the last command committed by the leader.
Also, you ask if this only applies to followers. Linearizable queries can only be evaluated through the leader. I suppose there's some algorithm with which you could do linearizable reads on followers, but it would be inefficient. Followers can only maintain sequential consistency for queries. In that case, servers respond to client operations with the index of the state machine when the operation was evaluated. Clients send their last received index with each operation, and when a server receives an operation, it uses the same algorithm to ensure that its state machine's lastApplied index is at least as great as the client's index. This is necessary to ensure that the client does not see state go back in time when switching servers.
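The client-index mechanism described above could look roughly like this in Python. The Server and Client classes are hypothetical stand-ins, and a server that is behind simply refuses the query here, whereas a real implementation would more likely wait until it has caught up:

```python
class Server:
    """Hypothetical server holding a state machine and its lastApplied index."""

    def __init__(self):
        self.last_applied = 0
        self.state = {}

    def apply(self, index, key, value):
        self.state[key] = value
        self.last_applied = index

    def query(self, key, client_index):
        # Refuse to answer from a state older than one the client already saw.
        if self.last_applied < client_index:
            return None
        return self.state.get(key), self.last_applied


class Client:
    def __init__(self):
        self.last_index = 0  # highest state machine index seen so far

    def read(self, server, key):
        result = server.query(key, self.last_index)
        if result is None:
            raise RuntimeError("server is behind this client; retry or switch")
        value, index = result
        self.last_index = max(self.last_index, index)
        return value
```

A client that has read at index 2 from one server cannot then read from a server that has only applied index 1, so it never sees state go back in time.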
There are some other complexities to read-only queries beyond what's described in the Raft literature if you want to support FIFO consistency for concurrent operations from a single client. Some of these are described in Copycat's architecture documentation.
Amazon DynamoDB allows the customer to provision the throughput of reads and writes independently. I have read the Amazon Dynamo paper about the system that preceded DynamoDB and read about how Cassandra and Riak implemented these ideas.
I understand how it is possible to increase the throughput of these systems by adding nodes to the cluster which then divides the hash keyspace of tables across more nodes, thereby allowing greater throughput as long as access is relatively random across hash keys. But in systems like Cassandra and Riak this adds throughput to both reads and writes at the same time.
How is DynamoDB architected differently that they are able to scale reads and write independently? Or are they not and Amazon is just charging for them independently even though they essentially have to allocate enough nodes to cover the greater of the two?
You are correct that adding nodes to a cluster should increase the amount of available throughput but that would be on a cluster basis, not a table basis. The DynamoDB cluster is a shared resource across many tables across many accounts. It's like an EC2 node: you are paying for a virtual machine but that virtual machine is hosted on a real machine that is shared among several EC2 virtual machines and depending on the instance type, you get a certain amount of memory, CPU, network IO, etc.
What you are paying for when you pay for throughput is IO, and reads and writes can be throttled independently. Paying for more throughput does not cause Amazon to partition your table across more nodes. The only thing that causes a table to be partitioned further is the table growing to the point where more partitions are needed to store its data. The maximum size of a partition, from what I have gathered talking to DynamoDB engineers, is based on the size of the SSDs of the nodes in the cluster.
The trick with provisioned throughput is that it is divided among the partitions. So if you have a hot partition, you could get throttling and ProvisionedThroughputExceededExceptions even if your total requests aren't exceeding the total read or write throughput. This is contrary to what your question assumes. You would expect that if your table is divided among more partitions/nodes, you'd get more throughput, but in reality it is the opposite unless you scale your throughput with the size of your table.
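A small worked example of the hot-partition effect, in Python. It assumes the provisioned throughput is split evenly across partitions, which is how I read the description above; the helper names and the numbers are made up for illustration:

```python
def per_partition_capacity(provisioned_units, partitions):
    # Assumption: the table's provisioned throughput is split evenly.
    return provisioned_units / partitions


def throttled_partitions(requests_per_partition, provisioned_units):
    # Return the indexes of partitions whose load exceeds their share.
    limit = per_partition_capacity(provisioned_units, len(requests_per_partition))
    return [i for i, load in enumerate(requests_per_partition) if load > limit]


# 1000 provisioned units over 4 partitions gives 250 units per partition.
load = [400, 50, 50, 50]  # total of 550, well under the 1000 provisioned
hot = throttled_partitions(load, provisioned_units=1000)
```

Here hot is [0]: the first partition is throttled even though the table as a whole uses only 550 of its 1000 units. Growing the same table to 8 partitions without raising the provisioned throughput would cut each partition's share to 125 units.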
I want to ask about the Graphite carbon daemons.
https://graphite.readthedocs.org/en/latest/carbon-daemons.html
When running carbon-relay.py, should I also run carbon-cache.py, or is the relay on its own enough?
Regards
Murtaza
Carbon relay is used when you set up a cluster of Graphite instances. However, carbon-cache does not need a cluster.
Regarding carbon-cache: write operations are expensive, so Graphite holds collected data in a cache. The Graphite webapp can read and display the most recently recorded data from that cache, with or without a cluster, and irrespective of whether the data has been written to disk yet.
Hope this answers your question.
Carbon-relay only resends data to one or more destinations, so it is needed only if you want to fork data to several destinations. Example setups:
save locally and resend to another node (cache or temporary-storage and relay)
resend all data into multiple remote daemons (multiple remote storages)
save all data in multiple local daemons (parallel storage & redundancy)
save different data sets in multiple local daemons (performance)
... other cases ...
So,
if you need to store data locally, you have to use carbon-cache.
if you need to fork the data flow on the node, you have to use carbon-relay before or instead of carbon-cache.
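To make the division of labour concrete, here is a toy Python model. It is not Carbon's code or API: the Cache and Relay classes are invented for illustration, and the relay simply copies every datapoint to every destination, whereas a real carbon-relay can also split metrics between destinations.

```python
class Cache:
    """Stands in for carbon-cache: the only component that keeps datapoints."""

    def __init__(self):
        self.stored = []

    def receive(self, metric, value, timestamp):
        self.stored.append((metric, value, timestamp))


class Relay:
    """Stands in for carbon-relay: keeps nothing, only forwards."""

    def __init__(self, destinations):
        self.destinations = destinations

    def receive(self, metric, value, timestamp):
        for destination in self.destinations:
            destination.receive(metric, value, timestamp)


# "Relay before cache": one relay fanning out to two local caches.
cache_a, cache_b = Cache(), Cache()
relay = Relay([cache_a, cache_b])
relay.receive("servers.web1.load", 0.42, 1700000000)
```

After the call both caches hold the datapoint and the relay holds nothing, which is why a relay on its own does not store any data.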