Consider a Riak cluster with 5 nodes (A, B, C, D, E) and n_val = 3:
1) The coordinator node stores a (k, v) pair with w=2. According to consistent hashing, the write should go to node A and the replicas to nodes B and C.
Consider that node C is down. Riak is able to perform writes to two nodes - A and B - thus satisfying w=2. However, (k, v) should eventually be replicated to 3 nodes.
Does this mean that Riak will send this write to D, and D will perform hinted handoff when C is back? Or will only the writes to A and B be performed, with C synchronizing with these nodes later via Active Anti-Entropy and read repair?
2) Consider that I would like to decommission node C from the cluster, and I simply shut this node down. This node contained data that is replicated on nodes D and E, as well as replicas for nodes A and B. Now n_val = 3 is no longer satisfied; we only have two replicas. Will Riak automatically create new replicas for the node that is down, or should I execute a special command to mark node C as permanently down?
3) Consider a Riak cluster with 3 nodes (A, B, C), n_val=3, and node C is down. Will it be able to satisfy a write with w=2?
1) Riak will make use of fallback vnodes, so in the event node C is down during the write, node D will start a fallback vnode to handle requests until C becomes available again. As soon as C becomes available, D will initiate hinted handoff to bring the vnodes on node C up to date. The use of fallbacks is described in the Riak documentation.
2) If you are removing node C while it is still able to function and wish to run a smaller cluster, use cluster leave to cause Riak to reassign ownership and transfer the data before shutting down node C.
If you are removing node C to replace it with new hardware, first join the new node, but use replace before plan or commit.
If node C has failed such that its data is unrecoverable, you can use force-remove or force-replace to have new empty vnodes started to replace the lost ones, which will then be populated via AAE or read repair.
3) Yes, Riak uses sloppy quorums where a fallback vnode can be used to satisfy a read or write quorum. If you want to only consider primary vnodes, use pr or pw in the request instead of r or w. See Eventual Consistency for more detail.
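To illustrate the difference, here is a minimal sketch using the Python Riak client (the host, port, bucket, and key names are placeholders, and exact keyword support can vary slightly between client versions):

import riak

# Connect to any node in the cluster (address/port are just example values).
client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
bucket = client.bucket('example_bucket')

# Sloppy quorum: succeeds once any 2 vnodes (primary or fallback) acknowledge.
obj = bucket.new('example_key', data={'value': 42})
obj.store(w=2)

# Strict quorum: require 2 *primary* vnodes, so this fails while 2 primaries are down.
obj.store(pw=2)

# The same distinction applies to reads: r allows fallbacks, pr requires primaries.
bucket.get('example_key', r=2)
bucket.get('example_key', pr=2)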
I have implemented a data structure that is a kind of non-binary tree, and my algorithm basically goes through all of its branches (I will provide more details on how it works below).
So, for a base case of a mother-node with three children the data structure would look something like this:
And the algorithm would go through three iterations until it eventually stopped (the function that visits each node is recursive, and the recursion stops when it finds a node that has no children/no nodes below it):
So, in a second scenario, if the tree were a little more complex, something like this:
Note: In the case above, more than one parent node arrives at the same child node, but this second visit to the same child node through a different parent node is necessary, since it could add additional information/features to the child node, and thus cannot be skipped.
And the algorithm would run through seven iterations in this second case:
So, basically, the algorithm enters a for loop every time it needs to access the children of a node. In case two, for example, there is a first for loop that accesses nodes 2, 3 and 4 from node 1, and then a second for loop that accesses nodes 5 and 6 from node 2. When all nodes below node 2 have been accessed, and since nodes 5 and 6 do not have child nodes, the for loop from node 1 accesses node 3 and a third for loop begins, which accesses node 6 from node 3. Finally, when all nodes below node 3 have been accessed, the for loop from node 1 accesses node 4 and a fourth for loop begins, which eventually accesses node 7, and then the algorithm as a whole stops.
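A minimal Python sketch of that traversal (the node numbers and the dictionary layout are assumptions based on the description above; the real data structure may differ). The body of the for loop runs once per parent-child edge, which matches the seven iterations in the second example:

# Children of each node in the second example (the shared child 6 appears twice).
tree = {
    1: [2, 3, 4],
    2: [5, 6],
    3: [6],
    4: [7],
    5: [], 6: [], 7: [],
}

def visit(node, children):
    # ... per-node work goes here ...
    for child in children[node]:    # each pass of this loop is one "iteration"
        visit(child, children)      # recursion stops at nodes with no children

visit(1, tree)   # the loop body runs 7 times: once for every parent -> child edge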
In addition to the complexity of the algorithm, I would also like to know whether something like this could run in a web application, or whether it would be too slow or too expensive to run on a server. It seems to me that the computational cost is high, and either running this at a large scale would fry my PC or it would eventually finish, but only after a considerable amount of time.
I hope I was able to explain the problem I'm dealing with, but if more information is needed, do not hesitate to ask in the comments. Thanks in advance for your attention.
I have a 3 node Riak cluster with each having approx. 1 TB disk usage. All of a sudden, one node's hard disk failed unrecoverably.
So, I added a new node using the following steps:
1) riak-admin cluster join
2) down the failed node
3) riak-admin force-replace failed-node new-node
4) riak-admin cluster plan
5) riak-admin cluster commit.
This almost fixed the problem, except that after lots of data transfers and handoffs, not all three nodes have 1 TB of disk usage anymore. Only two of them
have 1 TB of disk usage; the other one is almost empty. This means there are no longer 3 copies on disk. What commands should I run to force three replicas back onto disk without waiting for read repair or anti-entropy to recreate the third copy?
Answer obtained by posting the same question to riak-users@lists.basho.com:
(0) Three nodes are insufficient, you should have 5 nodes
(1) You could iterate and read every object in the cluster - this would also trigger read repair for every object
(2) (copied from Engel Sanchez's response to a similar question, April 10th 2014)
* If AAE is disabled, you don't have to stop the node to delete the data in the anti_entropy directories
* If AAE is enabled, deleting the AAE data in a rolling manner may trigger an avalanche of read repairs between nodes with the bad trees and nodes with good trees as the data seems to diverge.
If your nodes are already up, with AAE enabled and with old incorrect trees in the mix, there is a better way. You can dynamically disable AAE with some console commands. At that point, without stopping the nodes, you can delete all AAE data across the cluster. At a convenient time, re-enable AAE. I say convenient because all trees will start to rebuild, and that can be problematic in an overloaded cluster. Doing this over the weekend might be a good idea unless your cluster can take the extra load.
To dynamically disable AAE from the Riak console, you can run this command:
riak_core_util:rpc_every_member_ann(riak_kv_entropy_manager, disable, [], 60000).
and enable with the similar:
riak_core_util:rpc_every_member_ann(riak_kv_entropy_manager, enable, [], 60000).
That last number is just a timeout for the RPC operation. I hope this saves you some extra load on your clusters.
(3) That's going to be:
(3a) List all keys using the client of your choice
(3b) Fetch each object (a minimal client sketch follows the links below)
https://www.tiot.jp/riak-docs/riak/kv/2.2.3/developing/usage/reading-objects/
https://www.tiot.jp/riak-docs/riak/kv/2.2.3/developing/usage/secondary-indexes/
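A minimal sketch of steps (3a) and (3b) with the Python Riak client (host, port, and bucket name are placeholders; listing keys is expensive in Riak, so run this off-peak):

import riak

client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
bucket = client.bucket('example_bucket')

# stream_keys() yields the keys in chunks instead of buffering the full list.
for keys in bucket.stream_keys():
    for key in keys:
        # Each read compares the replicas and triggers read repair for any
        # that are missing or out of date.
        bucket.get(key)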
I'm writing an application using MPI (mpi4py, actually). The application may spawn some new processes using MPI_Comm_spawn() (collectively on all current processes), and some nodes from the parent group/communicator may send data to some nodes in the child group/communicator and vice versa. (Note that MPI_Comm_spawn() and data sending/receiving happen in different threads, both for functionality [there are other features not directly relevant to this question, so I haven't described them] and for performance.)
Because MPI_Comm_spawn() may be called several times and I expect all nodes to be able to communicate with each other, I currently plan to use MPI_Intercomm_merge() to merge the two groups (parent and child) into one intracommunicator, and then send data through the new intracommunicator (and the next MPI_Comm_spawn() will happen on the new intracommunicator).
However, because the spawn and merge happen while the program is running, some data will already have been sent through the old communicator (but may not yet have been received by the destination). How can I safely switch from the old communicator to the new one (e.g. be able to free the old communicator(s) at some point) while losing the least performance? MPI_Intercomm_merge() is the only way I know of to guarantee that all processes can send data to each other (because if we don't merge, after the next MPI_Comm_spawn() some processes can't directly send data to each other), and I don't mind changing it to another method as long as it works well.
For example, in the following chart, processes A, B, C are the initial processes (mpiexec -np 3) and D is a spawned process:
A and B send continuous data to C; during the sending, D is spawned; then C sends data to D. Suppose the old communicator that A, B and C use is comm1 and the merged intracommunicator is comm2.
What I want to achieve is to send data through comm1 initially and have all processes switch to comm2 after D is spawned. What is missing is a mechanism for C to know when it can safely switch from comm1 to comm2 to receive data from A and/or B, so that I can then safely call MPI_Comm_free(comm1).
Simply sending a special tag through comm1 at the time of the switch would be the last option, because C doesn't know how many processes will send data to it. It does know how many groups of processes will send data to it, so this could be achieved by introducing local leaders (but I'd like to know about other options).
Because A, B and C run in parallel and send/recv and spawn happen in different threads, we can't guarantee there is no pending data when we call MPI_Comm_spawn(). E.g. if we imagine that A and B send and C receives at the same rate, then when they call comm_spawn, C has only received half of the data from A and B, so we can't drop comm1 at C yet, but have to wait until C has received all pending data on comm1 (which is an unknown number of messages).
Are there any mechanisms provided by MPI or mpi4py (e.g. error codes or exceptions) to achieve this?
By the way, if my approach is clearly bad, or if I misunderstand what MPI_Comm_free() does, please point it out.
(What I understand is that MPI_Comm_free() is not a collective call; after calling MPI_Comm_free(comm1), no more send/recv calls on comm1 are allowed on the node that called MPI_Comm_free(comm1).)
So basically, C invokes MPI_Comm_spawn(..., MPI_COMM_SELF, ...).
Why don't you have {A, B, C} invoke MPI_Comm_spawn(..., comm1, ...) instead?
MPI_Intercomm_merge() is a collective operation, so you need to "synchronize" your tasks somehow; so why not "synchronize" them before MPI_Comm_spawn() instead?
Then switching to the new communicator is trivial.
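For illustration, a minimal mpi4py sketch of that pattern (the file name and process counts are made up; the point is that the spawn is collective over comm1, so there is a well-defined moment after which comm1 carries no new traffic and can be freed):

# spawn_merge.py -- run with: mpiexec -np 3 python spawn_merge.py
from mpi4py import MPI

parent = MPI.Comm.Get_parent()

if parent == MPI.COMM_NULL:
    # Original processes A, B, C.
    comm1 = MPI.COMM_WORLD.Dup()   # stand-in for the communicator carrying traffic

    # ... A and B send to C over comm1 here ...

    # Every process reaches the spawn together, so nothing new is sent on comm1 afterwards.
    intercomm = comm1.Spawn('python', args=['spawn_merge.py'], maxprocs=1, root=0)
    comm2 = intercomm.Merge(high=False)   # A, B, C and D in one intracommunicator
    intercomm.Free()

    # Once the pending comm1 messages have been received, comm1 can be released.
    # (Note: MPI_Comm_free is collective over the communicator's group.)
    comm1.Free()
else:
    # Spawned process D.
    comm2 = parent.Merge(high=True)
    parent.Free()

comm2.Barrier()
print(comm2.rank, 'of', comm2.size, 'on the merged communicator')
comm2.Free()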
I'm starting with Riak and so far everything is going great. I'm not concerned about performance at the moment because I'm mainly using it as a backup store. I've read all docs I could find (the littleriakbook.com was great to explain the concepts) but I still seem to not grasp some parts.
The situation is that I can only use 3 physical nodes/servers at the moment (instead of the recommended 5). I want all data to be replicated to all three nodes. Essentially, if up to 2 nodes go down, I want to still be able to read and write on the remaining node, and when the failed nodes come back up they should synchronise again.
I've set it all up, and riak-admin diag shows me that not all data fulfils the n_val requirement. How can I make sure that all three nodes are (eventually) identical copies? Is it possible to trigger a redistribution of the data that doesn't fulfil the requirement?
With only 3 nodes, it is not possible to fulfil the n_val requirement and ensure that the three stored copies of any object will always be on different nodes. The reason for this lies in how Riak distributes replicas.
When storing or retrieving an object, Riak will calculate a hash value based on the bucket and key, and map this value to a specific partition on the ring. Once this partition has been determined the other N-1 replicas are always placed on the following N-1 partitions. If we assume we have a ring size of 64 and name these partitions 1-64, an object that hashes into partition 10 and belongs to a bucket with n_val set to 3 will also be stored in partitions 11 and 12.
With 3 nodes you will often see that the partitions alternate between the physical nodes. This means that for most partitions, the replicas will be on different physical nodes. For the last 2 partitions of the ring, 63 and 64 in our case, the replica placement will however need to wrap around onto partitions 1 and 2. As 64 cannot be evenly divided by 3, objects that hash into these last partitions will therefore only be stored on 2 different physical nodes.
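A small model of that placement rule (a simplification for illustration, not Riak's actual claim algorithm), assuming a 64-partition ring whose partitions are claimed round-robin by the three nodes:

RING_SIZE = 64
NODES = ['A', 'B', 'C']
N_VAL = 3

# Partitions 1..64 claimed round-robin by the three physical nodes.
owner = {p: NODES[(p - 1) % len(NODES)] for p in range(1, RING_SIZE + 1)}

def replica_nodes(first_partition):
    # The hashed partition plus the next N_VAL - 1 partitions, wrapping at 64.
    partitions = [((first_partition - 1 + i) % RING_SIZE) + 1 for i in range(N_VAL)]
    return [owner[p] for p in partitions]

print(replica_nodes(10))   # ['A', 'B', 'C'] -> three distinct physical nodes
print(replica_nodes(63))   # ['C', 'A', 'A'] -> the wrap lands two copies on node A
print(replica_nodes(64))   # ['A', 'A', 'B'] -> again only two distinct nodes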
When a node fails or becomes unavailable in Riak, the remaining nodes will temporarily take responsibility for the partitions belonging to the lost node. These are known as fallback partitions and will initially be empty. As data is updated or inserted, these partitions will keep track of it and hand it back to the owning node once it becomes available. If Active Anti-Entropy is enabled, it will over time synchronise the fallback partition with the other partitions in the background.
In the Dynamo paper: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
the Replication section says:
To account for node failures, preference list contains more than N nodes.
I want to know why, and does this 'node' mean a virtual node?
It is for increasing Dynamo's availability. If the top N nodes in the preference list are healthy, the other nodes will not be used. But if some of those N nodes are unavailable, the nodes that follow them in the list will be used instead. For write operations, this is called hinted handoff.
The diagram makes sense both for physical nodes and virtual nodes.
I also don't understand the part you're talking about.
Background:
My understanding of the paper is that, since Dynamo's default replication factor is 3, each node N is responsible for the ring range from N-3 to N (while also being the coordinator for the ring range N-1 to N).
That explains why:
node B holds keys from F to B
node C holds keys from G to C
node D holds keys from A to D
And since range A-B falls within all those ranges, nodes B, C and D are the ones that have that range of key hashes.
The paper states in section 4.3, Replication:
To address this, the preference list for a key is constructed by skipping positions in the ring to ensure that the list contains only distinct physical nodes.
How can preference list contain more than N nodes if it is constructed by skipping virtual ones?
IMHO they should have stated something like this:
To account for node failures, ring range N-3 to N may contain more than N nodes, N physical nodes plus x virtual nodes.
Distributed DBMSs like Dynamo fall in the class that sacrifices Consistency in the CAP triangle.
So the system is inconsistent even though it is highly available. Because network partitions are a given in distributed systems, you cannot not pick Partition Tolerance.
Addressing your questions:
To account for node failures, preference list contains more than N nodes. I want to know why?
One fact of large-scale distributed systems is that in a system of thousands of nodes, failure of nodes is the norm.
You are bound to have a few nodes failing in such a big system. You don't treat it as an exceptional condition. You prepare for such situations. How do you prepare?
For Data: You simply replicate your data on multiple nodes.
For Execution: You perform the same execution on multiple nodes. This is called speculative execution. As soon as you get the first result from the multiple executions you ran, you cancel the other executions.
That's the answer right there - you replicate your data to prepare for the case when node(s) may fail.
To account for node failures, preference list contains more than N nodes. Does this 'node' mean virtual node?
I wanted to ensure that I always have access to my house. So I copied my house's keys and gave them to another family member of mine. This guy put those keys in a safe in our house. Now when we all go out, I'm under the illusion that we have other set of keys so in case I lose mine, we can still get in the house. But... those keys are in the house itself. Losing my keys simply means I lose the access to my house. This is what would happen if we replicate the data on virtual nodes instead of physical nodes.
A virtual node is not a separate physical node, so when the physical node that a virtual node is mapped to fails, the virtual node goes away as well.
This 'node' cannot mean virtual node if the aim is high availability, which is the aim in Dynamo DB.
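To make that concrete, here is a small sketch (a simplified model, not Dynamo's actual code; the ring layout is invented) of building a preference list by walking the ring and skipping virtual nodes whose physical node is already in the list, which is why the list can cover more than N ring positions while naming only distinct physical nodes:

# Ring of virtual nodes in clockwise order: (position, physical node).
ring = [(10, 'A'), (20, 'B'), (30, 'B'), (40, 'C'), (50, 'D')]

def preference_list(key_position, n=3):
    # Start at the first virtual node at or after the key's position (wrap to 0).
    start = next((i for i, (pos, _) in enumerate(ring) if pos >= key_position), 0)
    ordered = ring[start:] + ring[:start]

    nodes, seen = [], set()
    for _, physical in ordered:
        if physical not in seen:       # skip vnodes of already-chosen physical nodes
            seen.add(physical)
            nodes.append(physical)
        if len(nodes) == n:
            break
    return nodes

print(preference_list(15))   # ['B', 'C', 'D'] -- walked 4 ring positions for 3 nodes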