Hadoop reduce becomes slower when there are fewer reduce tasks

I'm running into a really weird case while doing some performance tuning of Hadoop. I was running a job with large intermediate output (like InvertedIndex or WordCount without a combiner) on a cluster whose network and computation resources are homogeneous. According to how MapReduce works, the overall runtime should be slower when there are more WAVES of reduce tasks, since there is less overlap between the map and shuffle phases; but that is not the case. It turns out that the job with 5 WAVES of reduce tasks is about 10% faster than the one with only one WAVE.
I checked the logs, and it turns out that the map tasks' execution time is longer when there are fewer reduce tasks; the pure computation time during the reduce phase (not shuffle or merge) is also longer when there are fewer tasks. I tried to rule out other factors: I set the reduce slow-start factor to 1 so that there is no overlap between map and shuffle, I limited execution to one reduce task at a time so there is no overlap between reduce tasks, and I modified the scheduler to force mappers and reducers onto different machines so there is no I/O congestion. Even with all of the above, the same thing still happens. (I also set the map memory buffer to be large enough, io.sort.factor to 32 or even larger, and io.sort.mb to larger than 320 accordingly.)
I really can't think of anything else that could cause this, so any suggestions would be greatly appreciated!
To avoid confusion, the problem I am experiencing is:
0. I'm comparing the performance of running the same job with 1 reduce task vs. 5 reduce tasks, with all other configuration identical. There is only one tasktracker for reduce computation.
1. I have forced all reduce tasks to execute sequentially by having only one tasktracker for reduce tasks in both cases, with mapred.tasktracker.reduce.tasks.maximum=1, so there is no parallelism during the reduce phase.
2. I have set mapred.reduce.slowstart.completed.maps=1 so none of the reducers start pulling data before all maps are done.
3. It turns out that having one reduce task is slower than having 5 SEQUENTIAL reduce tasks!
4. Even if I set mapred.reduce.slowstart.completed.maps=0.05 to allow overlap between map & shuffle (in which case the single-reduce-task job should overlap more and run faster, because the 5 reduce tasks execute SEQUENTIALLY), the 5-reduce-task job is still faster, and the map phase of the 1-reduce-task job becomes slower!
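In mapred-site.xml terms (Hadoop 1.x property names), the baseline configuration for the two runs looks roughly like this; mapred.reduce.tasks is the per-job value that differs between the runs:

    <property><name>mapred.reduce.slowstart.completed.maps</name><value>1.0</value></property>
    <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>1</value></property>
    <property><name>io.sort.factor</name><value>32</value></property>
    <property><name>io.sort.mb</name><value>320</value></property>
    <property><name>mapred.reduce.tasks</name><value>1</value></property>  <!-- 1 in one run, 5 in the other -->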

This is not a problem. The more reduce tasks you have, the faster your data gets processed.
The outputs of the map phase are sent to the reducers. If you have two reducers, the load is distributed between them.
In the case of the wordcount example, you will have two separate output files with the counts divided between them. So you will have to add up the totals manually, or run another MapReduce job to compute the total if you had lots of reduce tasks.

This is as expected: if you only have a single reducer, then your job has a single point of failure. Your number of reducers should be set to about 90% of your reduce capacity. You can find your reduce capacity by multiplying the number of reduce slots per node by the total number of nodes. I have found that it is also good practice to use a combiner where applicable.

If you have just 1 reduce task, then that reducer has to wait for all mappers to finish, and the shuffle phase has to collect all intermediate data to be redirected to just that one reducer. So, it's natural that the map and shuffle times are larger, and so is the overall time, if you have just one reducer.
However, if you have more reducers, your data gets processed in parallel, and that makes it faster. Then again, if you have too many reducers, there's too much data being shuffled around, resulting in increased network traffic. So you have to find the optimal number of reducers that gives you a good balance.

The right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum). At 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. At 1.75 the faster nodes will finish their first round of reduces and launch a second round of reduces, doing a much better job of load balancing.
courtesy:
http://wiki.apache.org/hadoop/HowManyMapsAndReduces
Setting the number of map tasks and reduce tasks
(similar question with a resolved answer)
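As a worked example (numbers made up for illustration): a cluster of 10 nodes with mapred.tasktracker.reduce.tasks.maximum=2 has a reduce capacity of 20 slots, so 0.95 * 20 = 19 reduces would finish in a single wave, while 1.75 * 20 = 35 reduces would run in roughly two waves with better load balancing.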
Hope this helps!

Related

Why wouldn't a small Firebase Functions app just use a single Function to handle logic?

...aside from the benefit in separate performance monitoring and logging.
For logging, I am confident I can get granularity by manually adding the name of the "routine" to each call. This is how it works now, with several discrete Functions for different parts of the system:
There are multiple automatic logs: start and finish of the routine, for example. It would be more challenging to find out how expensive certain routines are, but it would not be impossible.
The reason I want the entire logic of the application handled by a single handler function is to reduce cold starts: one function means only one container that needs to be kept persistently alive when there are very few users of the app.
If a month is ~2.6 million seconds and we assume the system uses 1 GB of RAM and 1 GHz of CPU frequency at all times, that's:
2600000 * 0.0000025 + 2600000 * 0.000001042 = USD$9.21 a month
...for one minimum instance.
I should also state that all of my functions have the bare minimum amount of global scope code; it just sets up Firebase assets (RTDB and Firestore).
From a billing, performance (based on user wait time), and user/developer experience perspective, is there any reason why it would be smart to keep all my functions discrete?
I'd also accept an answer saying "one single function for all logic is reasonable" as long as there's a reason for it.
Thanks!
If you have a very small app with ~5 endpoints and very low traffic, sure, you could do something like this. But here is why I would not do it:
Billing and performance
The important thing to realize is that a function instance handles only one request at a time, so under concurrent load new instances of your function are created; there could be tens of them running at the same time.
If you would like to have just one instance handling all the traffic, you should explore GCP Cloud Run, where you have one container handling multiple requests, scaling out only when that is no longer sufficient.
Imagine you have several endpoints, and every one of them has different performance requirements:
One might need only 128 MB of RAM
Another might need 1 GB of RAM
(FYI: you can also control the CPU MHz of the function via the RAM setting, which can speed up execution in some cases.)
If you had only one function with 1 GB of RAM, every request would be served by that function, and in some cases most of the memory would go to waste.
But if you split it into multiple functions, some requests will require far fewer resources, which can save you money once you reach a larger number of executions per month (tens of thousands and up).
Let's imagine a function with a 3-second execution time and 10k executions/month:
128MB would cost you $0.0693
1024MB would cost you $0.495
As you can see, with a small app the difference can be negligible, but at scale it matters. (The exact cost can vary based on the datacenter.)
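For transparency, those figures can be reproduced from the 1st-gen Cloud Functions rates, roughly $0.0000025 per GB-second plus $0.0000100 per GHz-second, with the 128 MB tier running at about 0.2 GHz and the 1024 MB tier at about 1.4 GHz (check the current pricing page, as rates change):

    10,000 executions * 3 s = 30,000 compute-seconds
    128 MB:  30,000 * 0.125 GB * 0.0000025 + 30,000 * 0.2 GHz * 0.0000100 = 0.009 + 0.060 ≈ $0.069
    1024 MB: 30,000 * 1 GB * 0.0000025 + 30,000 * 1.4 GHz * 0.0000100 = 0.075 + 0.420 = $0.495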
As for the logging, I don't think it matters. In bigger systems there are usually messages traveling through several functions anyway, so you have to deal with that regardless.
As for the cold start: you just need a good UI to hide it. At first I was worried about it in our apps, but later on you just get used to the fact that some actions can take ~2 s to execute (cold start). And you should show a "loading" state in the UI regardless, because you don't know whether the function will take ~100 ms or 3 s due to a bad connection.

Intel MPI distributed memory: building a wall out of M*N blocks using q<M processors

Imagine I have M independent jobs, each with N steps. The jobs are independent of each other, but the steps of each job must be serial: J(i,j) can be started only after J(i,j-1) is finished (i is the job index and j the step). This is isomorphic to building a wall with a width of M and a height of N blocks.
Each block of work should be executed exactly once. The time it takes one CPU to do one block of work differs from block to block (though it is of the same order) and is not known in advance.
The simple way of doing this with MPI is to assign blocks of work to processors and wait until all of them finish their blocks before the next assignment. This way we can ensure that the ordering constraints are enforced, but there will be a lot of waiting time.
Is there a more efficient way of doing this? I mean, when a processor finishes its block, it could decide, using some kind of environment variable or shared memory, which block of work to do next, without waiting for the other processors to finish their blocks and making a collective decision through communication.
You have M jobs with N steps each. You also have a set of worker processes of size W, somewhere between 2 and M.
If W is close to M, the best you can do is simply assign them 1:1. If one worker finishes early that's fine.
If W is much smaller than M, and N is also fairly large, here is an idea:
Estimate some average or typical time for one step to complete. Call this T. You can adjust this estimate as you go in case you have a very poor estimator at the outset.
Divide your M jobs evenly in number among the workers, and start them. Tell the workers to run as many steps of their assigned jobs as possible before a timeout, say T*N/K. Overrunning the timeout slightly to finish the current step is allowed, to ensure forward progress.
Have the workers communicate to each other which steps they completed.
Repeat, dividing the jobs evenly again taking into account how complete each one is (e.g. two 50% complete jobs count the same as one 0% complete job).
The idea is to give all the workers enough time to complete roughly 1/K of the total work each time. If no job takes much more than K*T, this will be quite efficient.
It's up to you to find a reasonable K. Maybe try 10.
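For instance (made-up numbers): with T = 2 s, N = 50 steps and K = 10, each round's timeout would be 2*50/10 = 10 s, i.e. enough time for roughly five steps per round before the jobs are redistributed.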
Here's an idea, IDK if it's good:
Maintain one shared variable: n = the progress of the farthest-behind task. i.e. the lowest step-number that any of the M tasks has completed. It starts out at 0, because all tasks start at the first step. It stays at 0 until all tasks have completed at least 1 step each.
When a processor finishes a step of a job, check the progress of the step it's currently working on against n. If n < current_job_step - 4, switch tasks because the one we're working on is too far ahead of the farthest-behind one.
I picked 4 to give a balance between too much switching vs. having too much serial work in only a couple tasks. Adjust as necessary, and maybe make it adaptive as you near the end.
Switching tasks without having two threads both grab the same work unit is non-trivial unless you have a scheduler thread that makes all the decisions. If this is on a single shared-memory machine, you could use locking to protect a priority queue.
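If you do want to try the distributed-memory version, here is a minimal sketch of the shared progress counter using MPI one-sided operations. It makes a big simplification: each rank owns exactly one task (so M equals the number of ranks), and a rank that gets too far ahead simply waits instead of switching tasks. do_step and the constants are placeholders:

    #include <mpi.h>
    #include <unistd.h>

    #define N    100   /* steps per task (placeholder) */
    #define LEAD 4     /* max allowed lead over the farthest-behind task */

    /* placeholder for the real block of work */
    static void do_step(int task, int step) { usleep(1000 * (1 + (task + step) % 3)); }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *my_progress;   /* window contents: steps this rank has completed */
        MPI_Win win;
        MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &my_progress, &win);
        *my_progress = 0;
        MPI_Barrier(MPI_COMM_WORLD);   /* everyone initialized before reads begin */
        MPI_Win_lock_all(0, win);      /* passive-target access to all ranks */

        for (int step = 0; step < N; step++) {
            do_step(rank, step);
            int done = step + 1, dummy = 0, n;
            /* atomically publish our progress */
            MPI_Fetch_and_op(&done, &dummy, MPI_INT, rank, 0, MPI_REPLACE, win);
            MPI_Win_flush(rank, win);
            do {   /* n = progress of the farthest-behind rank */
                n = done;
                for (int r = 0; r < size; r++) {
                    int p;
                    MPI_Fetch_and_op(&dummy, &p, MPI_INT, r, 0, MPI_NO_OP, win);
                    MPI_Win_flush(r, win);
                    if (p < n) n = p;
                }
                if (done - n > LEAD) usleep(1000);  /* too far ahead: wait */
            } while (done - n > LEAD);  /* a real scheduler would switch tasks here */
        }

        MPI_Win_unlock_all(win);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }

The farthest-behind rank is its own minimum, so it never waits; that guarantees forward progress for the whole group.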

advice on using parallel computing on different levels

We know that it is better to use parallel computing for longer tasks than for shorter ones, to avoid the OS overhead of switching between many small tasks. I am looking for your comments and advice on the following scenario: I have 6 tasks I can parallelize, and each of them contains a smaller task that can also be parallelized. Let's say I have 64 cores I can use.
Would it be prudent to parallelize the 6 larger tasks, and then to parallelize again within each task?
You already answered your own question. If the individual tasks are short, the overhead of parallelisation will cause the total calculation time to go up instead of down. This does not really change when you have this kind of hierarchy of parallel jobs; I would expect the overhead to be even bigger, as the information has to be passed from the smallest jobs, via the intermediate jobs, up to the top level. If your smallest tasks do not take a significant amount of time, say at least a few seconds, parallelisation is not going to help.

divide workload on different hardware using MPI

I have a small network with computers of different hardware. Is it possible to optimize the division of the workload across this hardware using MPI, i.e., give nodes with more RAM and a better CPU more data to compute, minimizing the time nodes wait for each other before the final reduction?
Thanks!
In my program, the data is divided into equal-sized batches. Each node in the network processes some of them, and the results of all batches are summed up after everything has been processed.
Can you divide the work into more batches than there are processes? If so, change your program so that instead of each process receiving one batch, the master keeps sending batches to whichever node is available, for as long as there are unassigned batches. It should be a fairly easy modification, and it will make faster nodes process more data, leading to a lower overall completion time. There are further enhancements you can make, e.g. once all batches have been assigned and a fast node is available, you could take an already assigned batch away from a slow node and reassign it to said fast node. But these may not be worth the extra effort.
If you absolutely have to work with as many batches as you have nodes, then you'll have to find some way of deciding which nodes are fast and which ones are slow. Perhaps the most robust way of doing this is to assign small, equally sized test batches to each process, and have them time their own solutions. The master can then divide the real data into appropriately sized batches for each node. The biggest downside to this approach is that if the initial speed measurement is inaccurate, then your efforts at load balancing may end up doing more harm than good. Also, depending on the exact data and algorithm you're working with, runtimes with small data sets may not be indicative of runtimes with large data sets.
Yet another way would be to take thorough measurements of each node's speed (i.e. multiple runs with large data sets) in advance, and have the master balance batch sizes according to this precompiled information. The obvious complication here is that you'll somehow have to keep this registry up to date and available.
All in all, I would recommend the very first approach: divide the work into many smaller chunks, and assign chunks to whichever node is available at the moment.
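To make that first approach concrete, here is a minimal master/worker sketch in C with MPI. It assumes rank 0 is the master, batches are identified by an integer index, each batch produces one double that is summed on the master, and process_batch stands in for your real computation (it also assumes at least one worker rank):

    #include <mpi.h>
    #include <stdio.h>

    #define NUM_BATCHES 100   /* placeholder: more batches than workers */
    #define TAG_WORK 1
    #define TAG_STOP 2

    /* placeholder for the real per-batch computation */
    static double process_batch(int b) { return (double)b; }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {   /* master: hand out batches on demand */
            double total = 0.0;
            int next = 0;
            MPI_Status st;
            /* give every worker either a first batch or an immediate stop */
            for (int w = 1; w < size; w++) {
                if (next < NUM_BATCHES) {
                    MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                    next++;
                } else {
                    MPI_Send(&next, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
                }
            }
            /* collect results; fast workers come back sooner and get more batches */
            for (int received = 0; received < NUM_BATCHES; received++) {
                double partial;
                MPI_Recv(&partial, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                total += partial;
                int tag = (next < NUM_BATCHES) ? TAG_WORK : TAG_STOP;
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, tag, MPI_COMM_WORLD);
                if (tag == TAG_WORK) next++;
            }
            printf("total = %f\n", total);
        } else {           /* worker: loop until the master says stop */
            for (;;) {
                int batch;
                MPI_Status st;
                MPI_Recv(&batch, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                double partial = process_batch(batch);
                MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }

Because the master replies to whichever worker reports back first, faster nodes automatically end up processing more batches, which is exactly the load balancing you are after.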

How many mappers/reducers should be set when configuring a Hadoop cluster?

When configuring a Hadoop cluster, what's the scientific method for setting the number of mappers/reducers?
There is no formula. It depends on how many cores and how much memory you have. In general, the number of mappers plus the number of reducers should not exceed the number of cores. Keep in mind that each machine is also running the TaskTracker and DataNode daemons. One general suggestion is to have more mappers than reducers. If I were you, I would run one of my typical jobs with a reasonable amount of data and try it out.
Quoting from "Hadoop: The Definitive Guide", 3rd edition, page 306:
Because MapReduce jobs are normally I/O-bound, it makes sense to have more tasks than processors to get better utilization. The amount of oversubscription depends on the CPU utilization of jobs you run, but a good rule of thumb is to have a factor of between one and two more tasks (counting both map and reduce tasks) than processors.
A processor in the quote above is equivalent to one logical core.
But this is just theory; most likely each use case is different from another, and some tests need to be performed. Still, this number can be a good starting point to test with.
You should probably also look at reducer lazy loading, which lets reducers start later when required, so that more slots are available for map tasks in the meantime. I don't have much experience with this, but it seems useful.
Taken from Hadoop Gyan, my blog:
The number of mappers is decided in accordance with the data-locality principle: Hadoop tries its best to run map tasks on nodes where the data is present locally, to optimize the network and inter-node communication latency. As the input data is split into pieces and fed to different map tasks, it is desirable to have all the data fed to a given map task available on a single node. Since HDFS only guarantees that a chunk of data equal to its block size (64 MB) is present on one node, it is advised to make the split size equal to the HDFS block size so that the map task can take advantage of this data localization. Therefore, 64 MB of data per mapper. If we see some mappers running for a very short period of time, we should try to bring down the number of mappers and make them run longer, for a minute or so.
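As a quick illustration: a 6.4 GB input stored in 64 MB blocks yields about 100 splits, and therefore about 100 map tasks, each reading data that is local to some node.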
The number of reducers should be slightly less than the number of reduce slots in the cluster (slots are pre-configured via the job/task tracker properties when setting up the cluster), so that all the reducers finish in one wave and make full utilisation of the cluster resources.
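For example (illustrative numbers): a cluster of 8 nodes with 4 reduce slots each has 32 reduce slots, so configuring around 28-30 reducers lets them all finish in one wave while leaving a little slack for failed or speculative tasks.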
