What is "mult-raft" in TiKV? - raft

I came across this interesting database the other day and have read some docs on its official site. I have some questions regarding Raft groups in TiKV (here).
Suppose we have a cluster of, say, 100 nodes, and the replication factor is 3. Does that mean we end up with a lot of tiny Raft "bubbles", each containing only 3 members, which do leader election and log replication inside the "bubble"?
Or do we have one single fat Raft "bubble" which contains all 100 nodes?
Please help shed some light here, thank you!

a lot of tiny Raft "bubbles", each containing only 3 members
The tiny Raft bubble in your context is a Raft group in TiKV, comprising 3 replicas by default. Data in TiKV is automatically sharded into Regions, and each Region corresponds to one Raft group. To support large data volumes, Multi-Raft is implemented. So you can think of Multi-Raft as many tiny Raft "bubbles" distributed evenly across your nodes.
Check the image for Raft in TiKV here
we only have one single fat Raft "bubble" which contains 100 nodes?
No. A Raft group does not contain the nodes; rather, Raft groups are contained in nodes: each node hosts replicas from many different groups.
For more details, see: What is Multi-raft in TiKV
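To make the "many tiny bubbles" picture concrete, here is a minimal Python sketch of Multi-Raft placement. The round-robin placement policy is an illustrative assumption on my part; in TiKV the Placement Driver (PD) balances Region replicas far more dynamically.

```python
# Toy model of Multi-Raft: many small Raft groups (one per Region),
# each with 3 replicas, spread across a large cluster of nodes.
# Round-robin placement is a simplification for illustration only.

def place_regions(num_nodes, num_regions, replicas=3):
    """Return {region_id: [node_ids]} with replicas on distinct nodes."""
    placement = {}
    for region in range(num_regions):
        # pick `replicas` distinct nodes, rotating the start per region
        placement[region] = [(region + i) % num_nodes for i in range(replicas)]
    return placement

groups = place_regions(num_nodes=100, num_regions=1000)

# Every group is a tiny, independent Raft "bubble" of 3 members...
assert all(len(set(nodes)) == 3 for nodes in groups.values())

# ...and each node participates in many groups at once.
node_load = {}
for nodes in groups.values():
    for n in nodes:
        node_load[n] = node_load.get(n, 0) + 1
print(min(node_load.values()), max(node_load.values()))  # 30 30
```

Each node ends up a member of many small groups rather than one big one, which is why adding nodes scales capacity instead of growing a single consensus group.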

In this case it means that you have roughly 33 shards ("bubbles") of 3 nodes each.
A replication factor of 3 is quite common in distributed systems. In my experience, databases use a replication factor of 3 (in 3 different locations) as a sweet spot between durability and latency; 6 (in 3 locations) when they lean heavily towards durability; and 9 (in 3 locations) when they never, ever want to lose data. The 9-node databases are extremely stable (Paxos/Raft-based), and I have only seen them used as the configuration store for the 3-node and 6-node databases, which can use a more performant protocol (though Raft is pretty performant, too).


Raft protocol split brain

Studying Raft, I can't understand one thing. For example, I have a cluster of 6 nodes and 3 partitions with a replication factor of 3. Say a network error occurs, and now 3 nodes cannot see the remaining 3 nodes, but both halves remain available to clients. A write, say SET 5, arrives at the first sub-cluster. Will it succeed, because the replication factor is 3 and the majority is 2? Does it turn out you can get split brain using the Raft protocol?
In case of 6 nodes, the majority is 4. So if you have two partitions of three nodes, neither of those partitions will be able to elect a leader or commit new values.
When a Raft cluster is created, it is configured with a specific number of nodes, and a majority of those nodes is required either to elect a leader or to commit a log message.
In a Raft cluster, every node has a replica of the data. I guess we could say that the replication factor is equal to the cluster size in a Raft cluster, but I don't think I've ever seen the term "replication factor" used in a consensus context.
A few notes on cluster size.
Traditionally, cluster size is 2*N+1, where N is the number of nodes a cluster can lose and still be operational, as the remaining nodes still have a majority to elect a leader or commit log entries. Based on that, a cluster of 3 nodes may lose 1 node; a cluster of 5 may lose 2.
There is not much point (from a consensus point of view) in having a cluster of size 4 or 6. In the case of 4 nodes total, the cluster can survive only one node going offline: it cannot survive two, as the remaining two are not a majority and will not be able to elect a leader or agree on progress. The same logic applies to 6 nodes: that cluster can survive only two nodes going off. A cluster of 4 nodes supports the same single-node outage as a cluster of 3, so it is just more expensive with no availability benefit.
There is a case when cluster designers do pick a cluster of size 4 or 6: when the system allows stale reads and those reads can be executed by any node in the cluster. To support a larger volume of potentially stale reads, a cluster owner adds more nodes to handle the load.
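The quorum arithmetic above can be sketched in a few lines of Python:

```python
# Quorum arithmetic behind the 2*N+1 rule: a cluster of size s needs a
# majority of s // 2 + 1 nodes, so it tolerates s - majority failures.

def majority(size):
    return size // 2 + 1

def tolerated_failures(size):
    return size - majority(size)

for size in (3, 4, 5, 6, 7):
    print(size, majority(size), tolerated_failures(size))

# 4 nodes tolerate the same single failure as 3, and 6 the same two as 5,
# which is why even-sized clusters buy no extra availability:
assert tolerated_failures(4) == tolerated_failures(3) == 1
assert tolerated_failures(6) == tolerated_failures(5) == 2
```

This also shows why neither half of a 6-node split can make progress: each half has 3 nodes, but `majority(6)` is 4.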

P2P Network Bootstrapping

For P2P networks, I know that some networks have initial bootstrap nodes. However, one would assume that with all new nodes learning of peers from said bootstrap nodes, the network would have difficulty adding new peers and would end up with many cliques - unbalanced, for lack of a better word.
Are there any methods to prevent this from occurring? I'm aware that some DHTs structure their routing tables to be a bit less susceptible to this, but I would think the problem would still hold.
To clarify, I'm asking about what sort of peer mixing algorithms exist/are commonly used for peer to peer networks.
However, one would assume that with all new nodes learning of peers from said bootstrap nodes, the network would have difficulty adding new peers and would end up with many cliques - unbalanced, for lack of a better word.
If the bootstrap nodes were the only source of peers and no further mixing occurred, that might be an issue. But in practice, bootstrap nodes only exist for bootstrapping (possibly only ever once), and then other peer-discovery mechanisms take over.
Natural mixing caused by connection churn should be enough to randomize graphs over time, but proactive measures such as a globally agreed-on mixing algorithm to drop certain neighbors in favor of others can speed up that process.
I'm aware that some DHTs structure their routing tables to be a bit less susceptible to this, but I would think the problem would still hold.
The local buckets in Kademlia should provide an exhaustive view of the neighborhood; medium-distance buckets will cover different parts of the keyspace for different nodes; and the farthest buckets will preferentially contain long-lived nodes, which should have a good view of the network.
This doesn't leave much room for clique-formation.
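A minimal sketch of the XOR-distance bucketing behind that claim, using a toy 8-bit ID space (real Kademlia uses 160-bit IDs):

```python
# Kademlia-style bucket assignment: a peer goes into the bucket indexed
# by the position of the highest differing bit of XOR(self, peer).
# Low buckets hold the sparse local neighborhood (known exhaustively);
# high buckets cover huge, node-specific slices of the keyspace, so
# different nodes see very different far peers.

ID_BITS = 8  # tiny ID space for illustration

def bucket_index(self_id, peer_id):
    distance = self_id ^ peer_id
    return distance.bit_length() - 1  # index of the highest differing bit

me = 0b10100000
assert bucket_index(me, 0b10100001) == 0  # nearest neighborhood
assert bucket_index(me, 0b00100000) == 7  # farthest half of the keyspace

# Bucket k covers 2**k IDs, so half of all peers fall into the top bucket:
assert sum(2**k for k in range(ID_BITS)) == 2**ID_BITS - 1
```

Because every node's buckets slice the keyspace relative to its own ID, routing tables overlap only partially from node to node, which works against clique formation.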

Neo4j design performance: Do I have to avoid large node degrees?

I'm in the middle of designing a data model to be implemented using Neo4j. It is about a transportation system that has some stations, with vehicles traveling between them.
There is a huge amount of travel from some stations, say one million trips each month. So I want to know: is there any performance penalty in case of having some nodes with millions of edges coming out of them? Is it better to keep degrees lower with some design tricks (even if it makes the design a little worse)?
Relationship degrees really matter most when the relationships are traversed: expansions that traverse any relationship type and direction, or that traverse the specific type (and direction) of relationship with the high degree, will pay the cost.
So if there are 100k :TRAVELS_TO relationships to a specific location, 100 :VISITED relationships to the location, and only 1 :TRAVELS_TO relationship from that location, then you'll only be paying a high cost when traversing those :TRAVELS_TO relationships to the location. If you're traversing relationships of a different type and/or direction, you won't be paying any higher cost because of the 100k other relationships.
So diversifying your types and/or directions can certainly help out.
You may want to check Max De Marzi's blog for his approaches to constructing a flight/air-travel graph; you may find good approaches to use there.
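A toy Python model of the per-type/direction point above. The relationship names and counts are illustrative (taken from the example in this answer), not Neo4j internals:

```python
# Toy model of why only the traversed (type, direction) pays for a high
# degree: edges are grouped by (node, type, direction), so expanding one
# group never touches the others, much like per-type relationship chains.

from collections import defaultdict

edges = defaultdict(list)  # (node, rel_type, direction) -> neighbor list
loc = "Airport"
edges[(loc, "TRAVELS_TO", "in")] = [f"t{i}" for i in range(100_000)]
edges[(loc, "VISITED", "in")] = [f"v{i}" for i in range(100)]
edges[(loc, "TRAVELS_TO", "out")] = ["dest"]

def expand(node, rel_type, direction):
    """Cost is proportional only to the degree of the selected group."""
    return edges[(node, rel_type, direction)]

assert len(expand(loc, "TRAVELS_TO", "in")) == 100_000  # the expensive one
assert len(expand(loc, "VISITED", "in")) == 100         # cheap, unaffected
assert len(expand(loc, "TRAVELS_TO", "out")) == 1       # cheap, unaffected
```

The design takeaway is the same as in the prose: splitting a hot relationship into several more specific types (or relying on direction) keeps each traversed group small.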

Riak: Using n_val = 3 and only 3 nodes

I'm starting with Riak and so far everything is going great. I'm not concerned about performance at the moment because I'm mainly using it as a backup store. I've read all the docs I could find (littleriakbook.com was great at explaining the concepts), but I still don't seem to grasp some parts.
The situation is that I can only use 3 physical nodes/servers at the moment (instead of the recommended 5). I want all data to be replicated to all three nodes. Essentially, if up to 2 nodes go down, I want to still be able to read from and write to the remaining node, and when the nodes come up again they should synchronise.
I've set it all up, and riak-admin diag shows me that not all data fulfils the n_val requirement. How can I make sure that all three nodes are (eventually) identical copies? Is it possible to trigger a redistribution of the data that doesn't fulfil the requirements?
With only 3 nodes, it is not possible to fulfil the n_val requirement and guarantee that the three copies of any object will always be stored on different nodes. The reason lies in how Riak distributes replicas.
When storing or retrieving an object, Riak will calculate a hash value based on the bucket and key, and map this value to a specific partition on the ring. Once this partition has been determined the other N-1 replicas are always placed on the following N-1 partitions. If we assume we have a ring size of 64 and name these partitions 1-64, an object that hashes into partition 10 and belongs to a bucket with n_val set to 3 will also be stored in partitions 11 and 12.
With 3 nodes you will often see that the partitions are spread out alternating between the physical nodes. This means that for most partitions, the replicas will be on different physical nodes. For the last 2 partitions of the ring, 63 and 64 in our case, storage will however need to wrap around onto partitions 1 and 2. As 64 can not be evenly divided by 3, objects that do hash into these last partitions will therefore only be stored on 2 different physical nodes.
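The wrap-around effect can be sketched in Python, assuming the simple alternating partition ownership described above:

```python
# Sketch of Riak's preference-list wrap-around: 64 partitions owned
# alternately by 3 nodes, with replicas placed on N consecutive
# partitions starting at the primary.

RING_SIZE, N_VAL, NODES = 64, 3, 3

def owner(partition):
    """Partitions 1..64, handed out to nodes in alternating order."""
    return (partition - 1) % NODES

def preference_list(primary):
    """The primary partition plus the next N_VAL - 1, wrapping the ring."""
    return [(primary - 1 + i) % RING_SIZE + 1 for i in range(N_VAL)]

def distinct_nodes(primary):
    return len({owner(p) for p in preference_list(primary)})

# Most partitions place the 3 replicas on 3 different physical nodes...
assert distinct_nodes(10) == 3

# ...but 64 is not divisible by 3, so preference lists that wrap past
# the end of the ring land on partitions owned by a node that already
# holds a replica:
assert preference_list(63) == [63, 64, 1]
assert distinct_nodes(63) == 2
assert distinct_nodes(64) == 2
```

This reproduces the situation in the answer: objects hashing into the last partitions of the ring end up with replicas on only 2 physical nodes, which is what riak-admin diag is warning about.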
When a node fails or becomes unavailable in Riak, the remaining nodes will temporarily take responsibility for the partitions belonging to the lost node. These are known as fallback partitions and will initially be empty. As data is updated or inserted, these partitions will keep track of it and hand it back to the owning node once it becomes available. If Active Anti-Entropy is enabled, it will over time synchronise the fallback partition with the other partitions in the background.

Distributed physics simulation help/advice

I'm working in a distributed-memory environment. My task is to simulate big 3D objects made of particles tied by springs, by dividing them into smaller pieces so that each piece is simulated on a different computer. I'm using a 3rd-party physics engine to achieve the simulation. The problem I am facing is how to transmit the particle information at the extremities where the object is divided; this information is needed to compute the forces between interacting particles. The line in the image shows where the cut has been made. Because the number of particles is big, the communication overhead will be big as well. Is there a good way to transmit such information, or is there another value I could transmit that would let me determine the information I need? Any help is much appreciated. Thank you!
PS: by particle information I mean the new positions, from which I compute a resulting force to be applied to the particles simulated on the local machine.
"Big" means lots of things. Here the number of points with data being communicated may be "big" in that it's much more than one, but if you have say a million particles in a lattice, and are dividing it between 4 processors (say) by cutting it into squares, you're only communicating 500 particles across each boundary; big compared to one but very small compared to 1,000,000.
A library very commonly used for these sorts of distributed-memory computations is MPI. (Distributed-memory computing is somewhat different from "distributed computing", which suggests nodes scattered all over the internet; this sort of tightly-coupled computation is usually best done on a set of nearby machines in a lab or a cluster.) This pattern of communication is very common and is called "halo exchange", "guard cell exchange", "ghost zone exchange", or some combination; you should be able to find lots of examples of such things by searching for those terms. (There are a few questions on this site on the topic, but they're typically focused on very specific implementation questions.)
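A serial Python sketch of the halo-exchange pattern. Real MPI codes do the same copies with send/receive pairs between processes; here the two subdomains are just lists in one process so the data flow is easy to see:

```python
# Halo ("ghost cell") exchange between two neighboring 1D subdomains.
# Each subdomain keeps one extra ghost slot on each side that holds a
# copy of the neighbor's border particle position.

def exchange_halos(left, right):
    """Copy each subdomain's border value into the other's ghost slot.

    left/right are lists of positions: [ghost, interior..., ghost].
    """
    left[-1] = right[1]   # left's right ghost <- right's leftmost interior
    right[0] = left[-2]   # right's left ghost <- left's rightmost interior

left = [None, 1.0, 2.0, 3.0, None]   # ghost slots at both ends
right = [None, 4.0, 5.0, 6.0, None]
exchange_halos(left, right)

assert left == [None, 1.0, 2.0, 3.0, 4.0]
assert right == [3.0, 4.0, 5.0, 6.0, None]
# After the exchange, each machine can compute the spring forces on its
# own border particles using the freshly copied neighbor positions.
```

Only the border positions cross the wire each timestep, which is exactly why the communication volume stays small relative to the total particle count.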
