Running Riak on Heterogeneous cluster - riak

From what I've read, Riak treats all nodes in a cluster equal. However we'd like to have a heterogeneous cluster, where cpu/mem/hd is not always equal - in fact they can be very different. Each node would of course meet minimal requirements that's required for a node.
Questions:
1) What is the consequence of creating a cluster composed of machines with wildly varying specifications (cpu, diskspace, disk speed, amount and speed of memory, network speed) and
2) Can the cluster detect and compensate for such differences automatically? (assuming no here)
3) Are there ways to take care of this problem in other ways? Think of: prioritizing nodes in load balancer, based on the hardware. Something else?

I'll answer your questions, however, operating Riak in this manner is strongly discouraged as Riak assumes identical capabilities among nodes.
You could possibly have wildly varying performance characteristics
for operations against your node. In general, the "weakest node" in
the system could affect operations throughout the cluster. For
instance, during the PUT phase of an operation, a replica of the
data could be routed to the weakest node and the duration of that
operation could affect the entire PUT operation based on the PUT
operation's quorum value.
No, the cluster assumes identical hardware.
There really is no way to compensate for this.

Related

What's an elegant/scalable way to dispatch hash queries, whose key is a string, to multiple machines?

I want to make it scalable. Suppose letters are all in lower case. For example, if I only have two machines, queries whose first character is within a ~ m can be dispatched to the first machine, while the n ~ z queries can be dispatched to the second machine.
However, when the third machine comes, to make the queries spread as even as possible, I have to re-calculate the rules and re-distribute the contents stored in the previous two machines. I feel it could be messy. For example, the more complex case, when I already have 26 machines, what should I do when the 27th one comes? What do people usually do to achieve the scalability here?
The process of (self-) organizing machines in a DHT to split the load of handling queries to a pool of objects is called Consistent Hashing:
https://en.wikipedia.org/wiki/Consistent_hashing
I don't think there's a definitive answer to your question.
First is the question of balance. The DHT is balanced when:
each node is under similar load? (load balancing is probably what you're after)
each node is responsible for similar amounts of objects? (this is what you seemed to suggest)
(less likely) each node is responsible for similar amount of the addressing space?
I believe your objective is to make sure none of the machines is overloaded. Unless queries to a single object are enough to saturate a single machine, this is unlikely to happen if you rebalance properly.
If one of the machines is under significantly lower load than the other, you can make the less-load machine take over some of the objects of the higher-load machine by shifting their positions in the ring.
Another way of rebalancing is through virtual nodes -- each machine can simulate being k machines. If its load is low, it can increase the amount of virtual nodes (and take over more more objects). If its load is high, it can remove some of its virtual nodes.

divide workload on different hardware using MPI

I have a small network with computers of different hardware. Is it possible to optimize workload division between these hardware using MPI? ie. give nodes with larger ram and better cpu more data to compute? minimizing waiting time between different nodes for final reduction.
Thanks!
In my program data are divided into equal-sized batches. Each node in the network will process some of them. The result of each batch will be summed up after all batches are processed.
Can you divide the work into more batches than there are processes? If so, change your program so that instead of each process receiving one batch, the master keeps sending batches to whichever node is available, for as long as there are unassigned batches. It should be a fairly easy modification, and it will make faster nodes process more data, leading to a lower overall completion time. There are further enhancements you can make, e.g. once all batches have been assigned and a fast node is available, you could take an already assigned batch away from a slow node and reassign it to said fast node. But these may not be worth the extra effort.
If you absolutely have to work with as many batches as you have nodes, then you'll have to find some way of deciding which nodes are fast and which ones are slow. Perhaps the most robust way of doing this is to assign small, equally sized test batches to each process, and have them time their own solutions. The master can then divide the real data into appropriately sized batches for each node. The biggest downside to this approach is that if the initial speed measurement is inaccurate, then your efforts at load balancing may end up doing more harm than good. Also, depending on the exact data and algorithm you're working with, runtimes with small data sets may not be indicative of runtimes with large data sets.
Yet another way would be to take thorough measurements of each node's speed (i.e. multiple runs with large data sets) in advance, and have the master balance batch sizes according to this precompiled information. The obvious complication here is that you'll somehow have to keep this registry up to date and available.
All in all, I would recommend the very first approach: divide the work into many smaller chunks, and assign chunks to whichever node is available at the moment.

Hadoop suitability for recursive data processing

I have a filtering algorithm that needs to be applied recursively and I am not sure if MapReduce is suitable for this job. W/o giving too much away, I can say that each object that is being filtered is characterized by a collection if ordered list or queue.
The data is not huge, just about 250MB when I export from SQL to
CSV.
The mapping step is simple: the head of the list contains an object that can classify the list as belonging to one of N mapping nodes. the filtration algorithm at each node works on the collection of lists assigned to the node and at the end of the filtration, either a list remains the same as before the filtration or the head of the list is removed.
The reduce function is simple too: all the map jobs' lists are brought together and may have to be written back to disk.
When all the N nodes have returned their output, the mapping step is repeated with this new set of data.
Note: N can be as much as 2000 nodes.
Simple, but it requires perhaps up to a 1000 recursions before the algorithm's termination conditions are met.
My question is would this job be suitable for Hadoop? If not, what are my options?
The main strength of Hadoop is its ability to transparently distribute work on a large number of machines. In order to fully benefit from Hadoop your application has to be characterized, at least by the following three things:
work with large amounts of data (data which is distributed in the cluster of machines) - which would be impossible to store on one machine
be data-parallelizable (i.e. chunks of the original data can be manipulated independently from other chunks)
the problem which the application is trying to solve lends itself nicely to the MapReduce (scatter - gather) model.
It seems that out of these 3, your application has only the last 2 characteristics (with the observation that you are trying to recursively use a scatter - gather procedure - which means a large number of jobs - equal to the recursion depth; see last paragraph why this might not be appropriate for hadoop).
Given the amount of data you're trying to process, I don't see any reason why you wouldn't do it on a single machine, completely in memory. If you think you can benefit from processing that small amount of data in parallel, I would recommend focusing on multicore processing than on distributed data intensive processing. Of course, using the processing power of a networked cluster is tempting but this comes at a cost: mainly the time inefficiency given by the network communication (network being the most contended resource in a hadoop cluster) and by the I/O. In scenarios which are well-fitted to the Hadoop framework these inefficiency can be ignored because of the efficiency gained by distributing the data and the associated work on that data.
As I can see, you need 1000 jobs. The setup and the cleanup of all those jobs would be an unnecessary overhead for your scenario. Also, the overhead of network transfer is not necessary, in my opinion.
Recursive algos are hard in the distributed systems since they can lead to a quick starvation. Any middleware that would work for that needs to support distributed continuations, i.e. the ability to make a "recursive" call without holding the resources (like threads) of the calling side.
GridGain is one product that natively supports distributed continuations.
THe litmus test on distributed continuations: try to develop a naive fibonacci implementation in distributed context using recursive calls. Here's the GridGain's example that implements this using continuations.
Hope it helps.
Q&D, but I suggest you read a comparison of MongoDB and Hadoop:
http://www.osintegrators.com/whitepapers/MongoHadoopWP/index.html
Without knowing more, it's hard to tell. You might want to try both. Post your results if you do!

How many mappers/reducers should be set when configuring Hadoop cluster?

When configuring a Hadoop Cluster whats the scientific method to set the number of mappers/reducers for the cluster?
There is no formula. It depends on how many cores and how much memory do you have. The number of mapper + number of reducer should not exceed the number of cores in general. Keep in mind that the machine is also running Task Tracker and Data Node daemons. One of the general suggestion is more mappers than reducers. If I were you, I would run one of my typical jobs with reasonable amount of data to try it out.
Quoting from "Hadoop The Definite Guide, 3rd edition", page 306
Because MapReduce jobs are normally
I/O-bound, it makes sense to have more tasks than processors to get better
utilization.
The amount of oversubscription depends on the CPU utilization of jobs
you run, but a good rule of thumb is to have a factor of between one and two more
tasks (counting both map and reduce tasks) than processors.
A processor in the quote above is equivalent to one logical core.
But this is just in theory, and most likely each use case is different than another, some tests need to be performed. But this number can be a good start to test with.
Probably, you should also look at reducer lazy loading, which allows reducers to start later when required, so basically, number of maps slots can be increased. Don't have much idea on this though but, seems useful.
Taken from Hadoop Gyan-My blog:
No. of mappers is decided in accordance with the data locality principle as described earlier. Data Locality principle : Hadoop tries its best to run map tasks on nodes where the data is present locally to optimize on the network and inter-node communication latency. As the input data is split into pieces and fed to different map tasks, it is desirable to have all the data fed to that map task available on a single node.Since HDFS only guarantees data having size equal to its block size (64M) to be present on one node, it is advised/advocated to have the split size equal to the HDFS block size so that the map task can take advantage of this data localization. Therefore, 64M of data per mapper. If we see some mappers running for a very small period of time, try to bring down the number of mappers and make them run longer for a minute or so.
No. of reducers should be slightly less than the number of reduce slots in the cluster (the concept of slots comes in with a pre-configuration in the job/task tracker properties while configuring the cluster) so that all the reducers finish in one wave and make full utilisation of the cluster resources.

What are the tradeoffs when generating unique sequence numbers in a distributed and concurrent environment?

I am curious about the contraints and tradeoffs for generating unique sequence numbers in a distributed and concurrent environment.
Imagine this: I have a system where all it does is give back an unique sequence number every time you ask it. Here is an ideal spec for such a system (constraints):
Stay up under high-load.
Allow as many concurrent connections as possible.
Distributed: spread load across multiple machines.
Performance: run as fast as possible and have as much throughput as possible.
Correctness: numbers generated must:
not repeat.
be unique per request (must have a way break ties if any two request happens at the exact same time).
in (increasing) sequential order.
have no gaps between requests: 1,2,3,4... (effectively a counter for total # requests)
Fault tolerant: if one or more, or all machines went down, it could resume to the state before failure.
Obviously, this is an idealized spec and not all constraints can be satisfied fully. See CAP Theorem. However, I would love to hear your analysis on various relaxation of the constraints. What type of problems will we left with and what algorithms would we use to solve the remaining problems. For example, if we rid of the counter constraint, then the problem becomes much easier: since gaps are allowed, we can just partition the numeric ranges and map them onto different machines.
Any references (papers, books, code) are welcome. I'd also like to keep a list of existing software (open source or not).
Software:
Snowflake: a network service for generating unique ID numbers at high scale with some simple guarantees.
keyspace: a publicly accessible, unique 128-bit ID generator, whose IDs can be used for any purpose
RFC-4122 implementations exist in many languages. The RFC spec is probably a really good base, as it prevents the need for any inter-system coordination, the UUIDs are 128-bit, and when using IDs from software implementing certain versions of the spec, they include a time code portion that makes sorting possible, etc.
If you must be sequential (per machine) but can drop the gap/counter requirments look for an implementation of the Version 1 UUID as specified in RFC 4122.
If you're working in .NET and can eliminate the sequential and gap/counter requirements, just use System.Guids. They implement RFC 4122 Version 4 and are already unique (very low collision probability) across machines and requests. This could be easily implemented as a web service or just used locally.
Here's a high-level idea for an approach that may fulfill all the requirements, albeit with a significant caveat that may not match many use cases.
If you can tolerate having two sequence numbers - a logical one returned immediately; guaranteed unique and ordered but with gaps - and a separate physical one guaranteed to be in sequential order with no gaps and available a short while later - then the solution seems straightforward:
One distributed system that can serve up a high resolution clock + machine id as the logical sequence number
Stream all the logical sequence numbers into a separate distributed system that orders the logical sequence numbers and maps them to the physical sequence numbers.
The mapping from logical to physical can happen on-demand as soon as the second system is done with processing.

Resources