advise on using parallel computing on different levels - r

We know that it is better to use parallel computing for longer tasks than shorter tasks, in order to save OS time when switching many small tasks. I am looking for your comments and advise on the following scenario. I got 6 task I can parallel, each of them has a small task that can also be paralleled. Let's say I have 64 cores I can use.
Would it be prudent to use parallel for the 6 larger tasks, and then to parallel again within each task?

You already answered your own question. If the individual tasks are short, the overhead of parallelisation will cause the total calculation time to go up in stead of down. This does not really change when having this kind of hierarchy of parallel jobs. I would think the overhead is even bigger as the information has to be passed down form the smallest jobs, via the intermediate jobs to the top-level. If your smallest tasks do not take a significant amount of time, say at least a few seconds, parallelisation is not going help.

Related

Intel MPI distributed memory: building a wall out of M*N blocks using q<M processors

Imagine I have M independent jobs, each job has N steps. Jobs are independent from each other but steps of each job should be serial. In other words J(i,j) should be started only after J(i,j-1) is finished (i indicates the job index and j indicates the step). This is isomorphic to building a wall with width of M and hight of N blocks.
Each block of job should be executed only once. The time that it takes to do one block of work using one CPU (also the same order) is different for different blocks and is not known in advance.
The simple way of doing this using MPI is to assign blocks of work to processors and wait until all of them finish their blocks before the next assignment. This way we can make ensure that priorities are enforced, but there will be a lot of waiting time.
Is there a more efficient way of doing this? I mean when a processor finishes its job, using some kind of environmental variables or shared memory, could decide which block of job it should do next, without waiting for other processors to finish their jobs and make a collective decision using communications.
You have M jobs with N steps each. You also have a set of worker processes of size W, somewhere between 2 and M.
If W is close to M, the best you can do is simply assign them 1:1. If one worker finishes early that's fine.
If W is much smaller than M, and N is also fairly large, here is an idea:
Estimate some average or typical time for one step to complete. Call this T. You can adjust this estimate as you go in case you have a very poor estimator at the outset.
Divide your M jobs evenly in number among the workers, and start them. Tell the workers to run as many steps of their assigned jobs as possible before a timeout, say T*N/K. Overrunning the timeout slightly to finish the current job is allowed to ensure forward progress.
Have the workers communicate to each other which steps they completed.
Repeat, dividing the jobs evenly again taking into account how complete each one is (e.g. two 50% complete jobs count the same as one 0% complete job).
The idea is to give all the workers enough time to complete roughly 1/K of the total work each time. If no job takes much more than K*T, this will be quite efficient.
It's up to you to find a reasonable K. Maybe try 10.
Here's an idea, IDK if it's good:
Maintain one shared variable: n = the progress of the farthest-behind task. i.e. the lowest step-number that any of the M tasks has completed. It starts out at 0, because all tasks start at the first step. It stays at 0 until all tasks have completed at least 1 step each.
When a processor finishes a step of a job, check the progress of the step it's currently working on against n. If n < current_job_step - 4, switch tasks because the one we're working on is too far ahead of the farthest-behind one.
I picked 4 to give a balance between too much switching vs. having too much serial work in only a couple tasks. Adjust as necessary, and maybe make it adaptive as you near the end.
Switching tasks without having two threads both grab the same work unit is non-trivial unless you have a scheduler thread that makes all the decisions. If this is on a single shared-memory machine, you could use locking to protect a priority queue.

What sort of scaling can be expected with mpiblast?

On our HPC cluster, one of the users runs mpiblast jobs on upward of 30 cores. These will typically end up on about 10 different nodes, the nodes normally being shared between users. Although these jobs occasionally scale fairly well and can effectively use about 90% of the cores available, often scaling is very bad with jobs only accumulating CPU-time corresponding to around 10% of the cores available.
Should mpiblast scale better in general? Does anyone know what factors might lead to poor scaling?
mpiblast should work faster in general but there's no guarantee that the scaling would be better. Few factors are there:
For parallel processing, you need to make sure that the nodes that are being used are not idle/not being used properly. That is one of the main reason for poor scaling!
Also, it depends on the files you are using for BLAST. For instance, there are some parameters in mpiblast, you should go through them first.
But in general, mpiblast should scale well when the nodes are equally being used that means greatly load-balanced :)

Hadoop reduce become slower when there are less reduce task

I'm experiencing a really weird case when I am doing some performance tuning of Hadoop. I was running a job with large intermediate output (like InvertedIndex or WordCount without combiner), the network and computation resources are all homogeneous. According to how mapreduce work, when there is more WAVES of reduce task, the overall run time should be slower as there is less overlap between map and shuffle, but it is not the case. It turns out that the job with 5 WAVES of reduce task is about 10% faster than the one with only one WAVE of task. And I checked the log and it turns out that the map tasks' execution time is longer when there is less reduce tasks, also, the overall computation time(not shuffle or merge) during reduce phase is longer when there is less task. I tried to rule out other factors by setting reduce slow-start factor to be 1 so that there is no overlap between map and shuffle, I also limited it to be only one reduce task to be executed at the same time so there is no overlap between reduce tasks, and I modified the scheduler to force mapper and reducer to locate on different machines so there is no I/O congestion. Even with above approach, the same thing still happen. (I also set the map memory buffer to be large enough and the io.sort.factor to be 32 or even larger and io.sort.mb to be larger than 320 accordingly)
I really can't think of any other reason that cause this problem, so any suggestions would be greatly appreciated!
Just in case of confusion, the problem I am experiencing is:
0. I'm comparing the performance of running 1 reduce task vs 5 reduce task of the same job under all other same configurations. There is only one tasktracker for reduce computation.
1. I have forced all reduce task to be executed sequentially by having only one tasktracker for redcue task in both cases, and mapred.tasktracker.reduce.tasks.maximum=1, so there won't be any parallelism during reduce phase
2. I have set mapred.reduce.slowstart.completed.maps=1 so none of the reducer will start to pull data before all map is done
3. It turns out that having one reduce task is slower than having 5 SEQUENTIAL reduce tasks!
4. Even if I set set mapred.reduce.slowstart.completed.maps=0.05 to allow overlap between map & shuffle, (thus when there is only one reduce task, the overlap should be more and it should run faster, because the 5 reduce task are SEQUENTIALLY executing) the 5-reduce-task is still faster than 1-reduce task and the map phase of 1-reduce task become slower!
This is not a problem. The more reduce tasks you have, the faster your data gets processed.
The outputs of the map phase are sent to the reducers . If you have two reducers, the load is distributed between the two reducers.
Incase of the wordcount example, you will have two seperate files with count divided between them. So you will have to manually add the total, or run another map reduce job to calculate the total if you had lots of reduce tasks.
This is as expected, if you only have a single reducer than your job has a single point of failure. Your number of reducers should be set to about 90% capacity. You can find your reduce capacity by multiplying your number of reduce slots with your total number of nodes. I have found that it is also good practice to use a combiner if it is applicable.
If you have just 1 reduce task, then that reducer has to wait for all mappers to finish, and the shuffle phase has to collect all intermediate data to be redirected to just that one reducer. So, it's natural that the map and shuffle times are larger, and so is the overall time, if you have just one reducer.
However if you have more reducers, your data gets processed in parallel, and that makes it faster. Again, if you have too many reducers, then there's too much data being shuffled around, resulting in increase in network traffic. So you have to find that optimal number of reducers which gives you a good balance.
The right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum). At 0.95 all of the reduces can launch immediately and start transfering map outputs as the maps finish. At 1.75 the faster nodes will finish their first round of reduces and launch a second round of reduces doing a much better job of load balancing.
courtesy:
http://wiki.apache.org/hadoop/HowManyMapsAndReduces
Setting the number of map tasks and reduce tasks
(similar question wirth resolved answer)
Hope this helps!

MPI + GPU : how to mix the two techniques

My program is well-suited for MPI. Each CPU does its own, specific (sophisticated) job, produces a single double, and then I use an MPI_Reduce to multiply the result from every CPU.
But I repeat this many, many times (> 100,000). Thus, it occurred to me that a GPU would dramatically speed things up.
I have google'd around, but can't find anything concrete. How do you go about mixing MPI with GPUs? Is there a way for the program to query and verify "oh, this rank is the GPU, all other are CPUs" ? Is there a recommended tutorial or something?
Importantly, I don't want or need a full set of GPUs. I really just need a lot of CPUs, and then a single GPU to speed up the frequently-used MPI_Reduce operation.
Here is a schematic example of what I'm talking about:
Suppose I have 500 CPUs. Each CPU somehow produces, say, 50 doubles. I need to multiply all 250,00 of these doubles together. Then I repeat this between 10,000 and 1 million times. If I could have one GPU (in addition to the 500 CPUs), this could be really efficient. Each CPU would compute its 50 doubles for all ~1 million "states". Then, all 500 CPUs would send their doubles to the GPU. The GPU would then multiply the 250,000 doubles together for each of the 1 million "states", producing 1 million doubles.
These numbers are not exact. The compute is indeed very large. I'm just trying to convey the general problem.
This isn't the way to think about these things.
I like to say that MPI and GPGPU stuff are orthogonal(*). You use MPI between tasks (for which think nodes, although you can have multiple tasks per node), and each task may or may not use an accelerator like a GPU to accelerate the computation within task. There is no MPI rank on a GPU.
Regardless, Talonmies is right; this particular example doesn't sound like it would benefit much from a GPU. And it won't be helped by having tens of thousands of doubles per task; if you're only doing one or a few FLOPs per double, the cost of sending the data to the GPU will exceed the benefit of having all those cores operate on them.
(*) This used to be more clearly true; now with, for instance, GPUDirect being able to copy memory to remote GPUs over infiniband, the distinction is fuzzier. However, I maintain that this is still the most useful way to think about things, with such things as RDMA to GPUs being an important optimization but conceptually a minor tweak.
Here I have found some news about the topic:
"MPI, the Message Passing Interface, is a standard API for communicating data via messages between distributed processes that is commonly used in HPC to build applications that can scale to multi-node computer clusters. As such, MPI is fully compatible with CUDA, which is designed for parallel computing on a single computer or node. There are many reasons for wanting to combine the two parallel programming approaches of MPI and CUDA. A common reason is to enable solving problems with a data size too large to fit into the memory of a single GPU, or that would require an unreasonably long compute time on a single node. Another reason is to accelerate an existing MPI application with GPUs or to enable an existing single-node multi-GPU application to scale across multiple nodes. With CUDA-aware MPI these goals can be achieved easily and efficiently. In this post I will explain how CUDA-aware MPI works, why it is efficient, and how you can use it."

Are OpenCL work items executed in parallel?

I know that work items are grouped into the work groups, and you cannot synchronize outside of a work group.
Does it mean that work items are executed in parallel?
If so, is it possible/efficient to make 1 work group with 128 work items?
The work items within a group will be scheduled together, and may run together. It is up to the hardware and/or drivers to choose how parallel the execution actually is. There are different reasons for this, but one very good one is to hide memory latency.
On my AMD card, the 'compute units' are divided into 16 4-wide SIMD units. This means that 16 work items can technically be run at the same time in the group. It is recommended that we use multiples of 64 work items in a group, to hide memory latency. Clearly they cannot all be run at the exact time. This is not a problem, because most kernels are in fact, memory bound, so the scheduler (hardware) will swap the work items waiting on the memory controller out, while the 'ready' items get their compute time. The actual number of work items in the group is set by the host program, and limited by CL_DEVICE_MAX_WORK_GROUP_SIZE. You will need to experiment with the optimal work group size for your kernel.
The cpu implementation is 'worse' when it comes to simultaneous work items. There are only ever as many work items running as you have cores available to run them on. They behave more sequentially in the cpu.
So do work items run at the exactly same time? Almost never really. This is why we need to use barriers when we want to be sure they pause at a given point.
In the (abstract) OpenCL execution model, yes, all work items execute in parallel, and there can be millions of them.
Inside a GPU, all work items of the same work group must be executed on a single "core". This puts a physical restriction on the number of work items per work group (256 or 512 is the max, but it can be smaller for large kernels using a lot of registers). All work groups are then scheduled on the (usually 2 to 16) cores of the GPU.
You can synchronize threads (work items) inside a work group, because they all are resident in the same core, but you can't synchronize threads from different work groups, since they may not be scheduled at the same time, and could be executed on different cores.
Yes, it is possible to have 128 work items inside a work group, unless it consumes too many resources. To reach maximum performance, you usually want to have the largest possible number of threads in a work group (at least 64 are required to hide memory latency, see Vasily Volkov's presentations on this subject).
The idea is that they can be executed in parallel if possible (whether they actually will be executed in parallel depends).
Yes, work items are executed in parallel.
To get the maximal possible number of work items, use clGetDeviceInfo with CL_DEVICE_MAX_WORK_GROUP_SIZE. It depends on the hardware.
Whether it's efficient or not primarily depends on the task you want to implement. If you need a lot of synchronization, it may be that OpenCL does not fit your task. I can't say much more without knowing what you actually want to do.
The work-items in a given work-group execute concurrently on the processing elements of a sigle processing unit.

Resources