What sort of scaling can be expected with mpiBLAST?

On our HPC cluster, one of the users runs mpiBLAST jobs on upward of 30 cores. These typically end up spread over about 10 different nodes, the nodes normally being shared between users. Although these jobs occasionally scale fairly well and can effectively use about 90% of the available cores, scaling is often very bad, with jobs accumulating CPU time corresponding to only around 10% of the available cores.
Should mpiBLAST scale better in general? Does anyone know what factors might lead to poor scaling?

mpiBLAST should run faster in general, but there is no guarantee that it will scale well. A few factors are involved:
For parallel processing, you need to make sure that the nodes being used are not sitting idle or being used improperly; uneven node utilization is one of the main causes of poor scaling.
It also depends on the database and query files you are using for BLAST. mpiBLAST has a number of tuning parameters, and you should go through them first.
In general, though, mpiBLAST should scale well when the nodes are utilized evenly, that is, when the job is well load-balanced.

Related

advice on using parallel computing at different levels

We know that it is better to use parallel computing for longer tasks than for shorter ones, to avoid the OS overhead of switching between many small tasks. I am looking for your comments and advice on the following scenario. I have 6 tasks I can parallelize, each of which contains smaller tasks that can also be parallelized. Let's say I have 64 cores I can use.
Would it be prudent to parallelize across the 6 larger tasks, and then parallelize again within each task?
You already answered your own question. If the individual tasks are short, the overhead of parallelisation will cause the total calculation time to go up instead of down. This does not really change with this kind of hierarchy of parallel jobs. I would think the overhead is even bigger, as information has to be passed up from the smallest jobs, via the intermediate jobs, to the top level. If your smallest tasks do not take a significant amount of time, say at least a few seconds, parallelisation is not going to help.
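If it helps to see this concretely, here is a minimal sketch in C with OpenMP (the task counts and do_subtask() are hypothetical placeholders). Rather than nesting parallel regions, it collapses both loop levels into one iteration space, so the 64 cores share all of the subtasks while the parallel-region overhead is paid only once:

```c
#include <omp.h>
#include <stdio.h>

#define NTASKS    6
#define NSUBTASKS 100

/* Hypothetical unit of work: one small task inside one of the 6 big tasks. */
static void do_subtask(int task, int sub)
{
    (void)task; (void)sub;   /* placeholder for real work */
}

int main(void)
{
    /* Instead of nesting parallel regions (6 outer threads, each spawning
       an inner team), collapse the two loops into one iteration space of
       6 * 100 = 600 subtasks and let a single thread team share them. */
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int t = 0; t < NTASKS; t++)
        for (int s = 0; s < NSUBTASKS; s++)
            do_subtask(t, s);

    printf("done\n");
    return 0;
}
```

With only 6 outer tasks and 64 cores, a single-level parallel loop over the outer tasks would leave most cores idle; collapsing (or parallelizing only the inner level) avoids that without the overhead of a two-level hierarchy.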

divide workload on different hardware using MPI

I have a small network with computers of different hardware. Is it possible to optimize the workload division between these machines using MPI? I.e., give nodes with more RAM and a better CPU more data to compute, minimizing the waiting time between the different nodes before the final reduction?
Thanks!
In my program, the data are divided into equal-sized batches. Each node in the network processes some of them. The results of the batches are summed up after all batches are processed.
Can you divide the work into more batches than there are processes? If so, change your program so that instead of each process receiving one batch, the master keeps sending batches to whichever node is available, for as long as there are unassigned batches. It should be a fairly easy modification, and it will make faster nodes process more data, leading to a lower overall completion time. There are further enhancements you can make, e.g. once all batches have been assigned and a fast node is available, you could take an already assigned batch away from a slow node and reassign it to said fast node. But these may not be worth the extra effort.
If you absolutely have to work with as many batches as you have nodes, then you'll have to find some way of deciding which nodes are fast and which ones are slow. Perhaps the most robust way of doing this is to assign small, equally sized test batches to each process, and have them time their own solutions. The master can then divide the real data into appropriately sized batches for each node. The biggest downside to this approach is that if the initial speed measurement is inaccurate, then your efforts at load balancing may end up doing more harm than good. Also, depending on the exact data and algorithm you're working with, runtimes with small data sets may not be indicative of runtimes with large data sets.
Yet another way would be to take thorough measurements of each node's speed (i.e. multiple runs with large data sets) in advance, and have the master balance batch sizes according to this precompiled information. The obvious complication here is that you'll somehow have to keep this registry up to date and available.
All in all, I would recommend the very first approach: divide the work into many smaller chunks, and assign chunks to whichever node is available at the moment.
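For concreteness, here is a minimal sketch of that first approach in C with MPI (NBATCHES, the tags, and process_batch() are hypothetical placeholders). Rank 0 acts as the master and hands one batch index at a time to whichever worker reports back, so faster nodes automatically end up processing more batches:

```c
#include <mpi.h>
#include <stdio.h>

#define NBATCHES 100   /* hypothetical number of equal-sized batches */
#define TAG_WORK 1
#define TAG_STOP 2

/* Placeholder for the real computation on one batch. */
static double process_batch(int batch) { return (double)batch; }

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                      /* master: hand out batches */
        int next = 0, active = 0, dummy = 0;
        double part, total = 0.0;
        MPI_Status st;

        /* Prime every worker with one batch (or stop it immediately
           if there are more workers than batches). */
        for (int w = 1; w < size; w++) {
            if (next < NBATCHES) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&dummy, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        /* Collect results; reassign work while batches remain, so faster
           nodes automatically receive more batches. */
        while (active > 0) {
            MPI_Recv(&part, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            total += part;
            if (next < NBATCHES) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                active--;
            }
        }
        printf("sum of all batches: %f\n", total);
    } else {                              /* worker: loop until told to stop */
        int batch;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&batch, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            double r = process_batch(batch);
            MPI_Send(&r, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```

Note that rank 0 does no batch processing itself in this sketch; with very few machines you may want the master to work on batches between receives as well.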

Estimating the heat generated by a process or job

Is it possible to estimate the heat generated by an individual process at runtime?
Temperature readings of the processor are easily accessible, but what I need is process-specific information.
Is it possible to map information such as CPU utilization, I/O, running time, memory usage, etc. to get some kind of an estimate?
I'm going to say no, because the overall temperature of your system components isn't a simple mathematical function of everything that's moving and switching.
Heat generated by and inside a computer depends on many external factors: the hardware setup, the ambient temperature of the room, possibly the age of the components, whether there is dust on them or in the fans, whether the thermal paste was correctly applied on the CPU or elsewhere, where heat sinks are present, how heat is being dissipated, etc. In short, again, no.
Additionally, your computer runs a LOT of processes at any given time apart from the ones that you control (and "control" is a relative term). Even if it is possible to access certain sensor data for individual components (as you can see to some extent in the BIOS), isolating a single process's share of the total generated heat is, well, impossible.
At the lowest levels (gate networks, control signalling, etc.), an outside observer no longer has any means to measure what's going on, but there as well things are in a constantly changing state: a variable amount of electricity is being used, and thus a variable amount of heat is generated.
Pertaining to your second question: that's basically what your task manager does. There are countless examples and articles on the internet on how to get that done in a plethora of programming languages.
That is, unless some of the actually smart people in this merry little community of keytappers and screengazers say that it IS actually possible, at which point I will be thoroughly amazed...
EDIT: Monitoring the processes is a first step toward what you're looking for. Take a look at How to detect a process start & end using c# in windows? and be sure to follow up on duplicates like the one mentioned by Hans.
You could take a look at PowerTOP or some other tool that monitors power usage. I am not sure how accurate it is across different systems, but a power estimate should provide at least some relative information about the heat generated, assuming the processes you are comparing run in similar manners on similar hardware. In reality there are just too many factors to predict power, much less heat, effectively, but you may be able to get an idea of the usage.
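If you do want to collect the kind of per-process utilization data mentioned above as input to a rough model, a minimal sketch in C using the standard POSIX getrusage() call might look like this (to be clear: it reports CPU time and memory, not heat):

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    /* ... the work you want to account for goes here ... */

    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        /* CPU time is the closest readily available proxy for how much
           a process keeps the processor busy (and thus dissipating). */
        printf("user CPU : %ld.%06ld s\n",
               (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
        printf("sys  CPU : %ld.%06ld s\n",
               (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
        printf("max RSS  : %ld kB\n", ru.ru_maxrss);
    }
    return 0;
}
```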

Are OpenCL work items executed in parallel?

I know that work items are grouped into work groups, and you cannot synchronize outside of a work group.
Does it mean that work items are executed in parallel?
If so, is it possible/efficient to make 1 work group with 128 work items?
The work items within a group will be scheduled together, and may run together. It is up to the hardware and/or drivers to choose how parallel the execution actually is. There are different reasons for this, but one very good one is to hide memory latency.
On my AMD card, the 'compute units' are divided into 16 4-wide SIMD units. This means that 16 work items can technically run at the same time within the group. It is recommended that we use multiples of 64 work items in a group to hide memory latency. Clearly they cannot all run at the exact same time. This is not a problem, because most kernels are, in fact, memory-bound, so the scheduler (hardware) will swap out the work items waiting on the memory controller while the 'ready' items get their compute time. The actual number of work items in the group is set by the host program and limited by CL_DEVICE_MAX_WORK_GROUP_SIZE. You will need to experiment to find the optimal work group size for your kernel.
The CPU implementation is 'worse' when it comes to simultaneous work items: there are only ever as many work items running as you have cores available to run them on, so they behave more sequentially on the CPU.
So do work items run at exactly the same time? Almost never, really. This is why we need to use barriers when we want to be sure they pause at a given point.
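To make the barrier point concrete, here is a minimal OpenCL C kernel sketch (the kernel name and the tile-reversal example are illustrative only). The barrier guarantees that every work item in the group has finished its write to local memory before any work item reads another item's element:

```c
/* OpenCL C kernel: each work item stages one element into local memory;
   the barrier guarantees the whole work group has finished writing
   before any work item reads a neighbour's element. */
__kernel void reverse_tile(__global const float *in,
                           __global float *out,
                           __local float *tile)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    tile[lid] = in[gid];              /* every item writes its element  */
    barrier(CLK_LOCAL_MEM_FENCE);     /* wait for the entire work group */

    out[gid] = tile[lsz - 1 - lid];   /* now safe to read other items'
                                         writes: reverse within the tile */
}
```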
In the (abstract) OpenCL execution model, yes, all work items execute in parallel, and there can be millions of them.
Inside a GPU, all work items of the same work group must be executed on a single "core". This puts a physical restriction on the number of work items per work group (256 or 512 is the max, but it can be smaller for large kernels using a lot of registers). All work groups are then scheduled on the (usually 2 to 16) cores of the GPU.
You can synchronize threads (work items) inside a work group, because they all are resident in the same core, but you can't synchronize threads from different work groups, since they may not be scheduled at the same time, and could be executed on different cores.
Yes, it is possible to have 128 work items inside a work group, unless it consumes too many resources. To reach maximum performance, you usually want to have the largest possible number of threads in a work group (at least 64 are required to hide memory latency, see Vasily Volkov's presentations on this subject).
The idea is that they can be executed in parallel if possible (whether they actually will be executed in parallel depends on the hardware and the driver).
Yes, work items are executed in parallel.
To get the maximum possible number of work items per work group, use clGetDeviceInfo with CL_DEVICE_MAX_WORK_GROUP_SIZE; the value depends on the hardware.
Whether it's efficient or not primarily depends on the task you want to implement. If you need a lot of synchronization, it may be that OpenCL does not fit your task. I can't say much more without knowing what you actually want to do.
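For reference, a minimal host-side sketch in C of that clGetDeviceInfo query (it takes the first platform and its default device, with only basic error handling):

```c
#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    size_t max_wg;

    /* Take the first platform and its default device for brevity. */
    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS) return 1;
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device,
                       NULL) != CL_SUCCESS) return 1;

    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_wg), &max_wg, NULL);
    printf("CL_DEVICE_MAX_WORK_GROUP_SIZE: %zu\n", max_wg);
    return 0;
}
```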
The work-items in a given work-group execute concurrently on the processing elements of a single compute unit.

Flex Profiling (Flex Builder): comparing two results

I am trying to use the Flex Profiler to improve the application performance (loading time, etc.). I have seen the profiler results for the current design. I want to compare these results with those of a new design for the same set of data. Is there some direct way to do it? I don't know of any way to save the current profiling results in a history and compare them later with the results of a new design.
Otherwise I have to do it manually: write the two sets of results down in a notepad and then compare them.
Thanks in advance.
Your stated goal is to improve aspects of the application performance (loading time, etc.). I have similar issues in other languages (C#, C++, C, etc.). I suggest that you focus not so much on the timing measurements that the Flex profiler gives you, but rather use it to extract a small number of samples of the call stack while the application is being slow. Don't deal in summaries; rather, examine those stack samples closely. This may bend your mind a little bit, because it will not give you particularly precise time measurements. What it will tell you is which lines of code you need to focus on to get your speedup, and it will give you a very rough idea of how much speedup you can expect. To get the exact amount of speedup, you can time it afterward. (I just use a stopwatch. If I'm getting the load time down from 2 minutes to 10 seconds, timing it is not a high-tech problem.)
(If you are wondering how/why this works, it works because the reason for the program being slower than it's going to be is that it's requesting work to be done, mostly by method calls, that you are going to avoid executing so much. For the amount of time being spent in those method calls, they are sitting exposed on the stack, where you can easily see them. For example, if there is a line of code that is costing you 60% of the time, and you take 5 stack samples, it will appear on 3 samples, plus or minus 1, roughly, regardless of whether it is executed once or a million times. So any such line that shows up on multiple stacks is a possible target for optimization, and targets for optimization will appear on multiple stack samples if you take enough.
The hard part about this is learning not to be distracted by all the profiling results that are irrelevant. Milliseconds, average or total, for methods, are irrelevant. Invocation counts are irrelevant. "Self time" is irrelevant. The call graph is irrelevant. Some packages worry about recursion - it's irrelevant. CPU-bound vs. I/O bound - irrelevant. What is relevant is the fraction of stack samples that individual lines of code appear on.)
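For the curious, the "3 samples, plus or minus 1" figure above is just binomial arithmetic: if a line is on the stack a fraction $p = 0.6$ of the time and you take $n = 5$ independent samples, the number of samples $k$ showing that line satisfies

$$k \sim \mathrm{Binomial}(n, p), \qquad \mathbb{E}[k] = np = 3, \qquad \sigma_k = \sqrt{np(1-p)} = \sqrt{5 \cdot 0.6 \cdot 0.4} \approx 1.1.$$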
ADDED: If you do this, you'll notice a "magnification effect". Suppose you have two independent performance problems, A and B, where A costs 50% and B costs 25%. If you fix A, total time drops by 50%, so now B takes 50% of the remaining time and is easier to find. On the other hand, if you happen to fix B first, time drops by 25%, so A is magnified to 67%. Any problem you fix makes the others appear bigger, so you can keep going until you just can't squeeze it any more.
