MPI send/recv latency jump at 32 MiB message size

MPI send/recv latency jump at 32 MiB message size - mpi

For a university project I'm currently working on a slurm supercomputer cluster and have written a number of C programs using MPI.
While profiling one of them I have observed that the time elapsed between an MPI_Send and an accompanying MPI_Recv operation is a mostly linear function of the message length. However, at around 32 MiB there is a sudden jump in latency from around 10ms to around 20ms. This happens both for two processes on the same node and two processes on separate nodes.
Now I would like to find out why this happens. I'm aware that this is not an MPI intrinsic phenomenon but must be related to the underlying hardware setup, but I'm not sure where to begin looking for an explanation. What are some possible explanations for this and how could I check whether they apply in my case?

Related

MPI - No performance gain when using every available core on the machine

I have a C program (acoustic wave solver) that is parallelized with MPI. However, I've been testing the speed up on various numbers of cores and I've noticed something strange. If I use N processes where N is the number of available cores in the machine, then I do not see a performance improvement over the next step down.
So on my 8 core machine then I see speedup from 1 process to 2 processes to 4 processes, but not from 4 to 8. Similarly on my 4 core laptop I see speedup from 1 to 2, but not from 2 to 4.
Any idea what could be causing this?

Many modern (Intel-)cpu run two hyperthreads on a single physical core. The number of cores you are referencing are actually the number of hardware threads that are available, not the number of physical execution units.
As long as you are using a number of processes that is smaller or equal to the number of physical cores, the processes will (or at least should) be distributed to use all of the available codes. But as soon as all physical cores are taken, additional processes will share a physical core with another process.
It is not possible to give a definitive answer on if using all threads will increase your performance at all or by how much. That strongly depends on the code you are running. A very nice answer to a similar question is given on superuser.com. Essentially, if your process is memory-bound or uses different parts of your cpu (Integer/Floating point arithmetic, Video encoding, vector processing, ...) and communication overhead is small you might even get perfect scaling. Code that is cpu-bound and only does one type of computation might not give any improvement, or might even take longer due to communication overhead.

OpenCL enqueued kernels using lots of host memory

I am executing monte carlo sweeps on a population of replicas of my system using OpenCL kernels. After the initial debugging phase I increased some of the arguments to more realistic values and noticed that the program is suddenly eating up large amounts of host memory. I am executing 1000 sweeps on about 4000 replicas, each sweep consists of 2 kernel invocations. That results in about 8 million kernel invocations.
The source of the memory usage was easy to find (see screenshot).
While the the kernel executions are enqueued the memory usage goes up.
While the kernels are executing the memory usage stays constant.
Once the kernels finish up the usage goes down to its original state.
I did not allocate any memory, as can be seen in the memory snapshots.
That means the OpenCL driver is using the memory. I understand that it must keep a copy of all the arguments to the kernel invocations and also the global and local workgroup size, but that does not add up.
The peak memory usage was 4.5GB. Before enqueuing the kernels about 250MB were used. That means OpenCL used about 4.25GB for 8 million invocations, i.e. about half a kilobyte per invocation.
So my questions are:
Is that kind of memory usage normal and to be expected?
Are there good/known techniques to reduce memory usage?
Maybe I should not enqueue so many kernels simultaneously, but how would I do that without causing synchronization, e.g. with clFinish()?

Enqueueing large number of kernel invocations needs to be done in a bit controlled manner so that command queue does not eat too much memory. First, clFlush may help to some degree then clWaitForEvents is necessary to make a synchronization point in the middle such that for example 2000 kernel invocations is enqueued and clWaitForEvents waits for the 1000th one. Device is not going to pause because we have another 1000 invocations of work pre-batched already. Then similar thing needs to be repeated again and again. This could be illustrated this way:
enqueue 999 kernel commands
while(invocations < 8000000)
{
enqueue 1 kernel command with an event
enqueue 999 kernel commands
wait for the event
}
The optimal number of kernel invocations after which we should wait may be different than presented here so it needs to be worked out for the given scenario.

OpenCl Statement , true or false?

I was reading some results. And there I saw that they used 5120 work-groups and a local-size of 1. I have a limited knowledge about OpenCl and I was wondering if this statement is correct:
As can be seen for the GPU, the first test has 5120 work-groups, with
1 work-item each. This means that the threads which are executed in
parallel are limited to the amount of computing units there are in the
machine. For example if a GPU has 20 computing units there can only be
a maximum of 20 threads which are working in parallel. Though when the
local size is increased to 2, twice the amount of threads are run
simultaneously
From reading some info on OpenCl, it seems about right. Though I need a second opinion.

update. Hmm, nat chouf's comment is right, I understood the question as "in flight at the same time" instead of "physically executed at the same time".
As I wrote, several work-groups can be scheduled at a given time in a single compute unit. The number of such "in-flight" work-groups is limited by the available resources (local memory, registers, etc.) on each compute unit.
In existing implementations (afaik) a compute unit will pick a block (warp/wavefront) of work-items from the same work-group for execution, among all blocks in flight in the compute unit. One "instruction" of this block is inserted in the pipeline (it may take several cycles, and each "instruction" may correspond to several operations in each work-item), and then another block is picked.
So, yes, if work-group size is 1, only 1 work-item per compute unit will be physically started simultaneously. But potentially all work-items may be in-flight in the GPU at the same time.

MPI + GPU : how to mix the two techniques

My program is well-suited for MPI. Each CPU does its own, specific (sophisticated) job, produces a single double, and then I use an MPI_Reduce to multiply the result from every CPU.
But I repeat this many, many times (> 100,000). Thus, it occurred to me that a GPU would dramatically speed things up.
I have google'd around, but can't find anything concrete. How do you go about mixing MPI with GPUs? Is there a way for the program to query and verify "oh, this rank is the GPU, all other are CPUs" ? Is there a recommended tutorial or something?
Importantly, I don't want or need a full set of GPUs. I really just need a lot of CPUs, and then a single GPU to speed up the frequently-used MPI_Reduce operation.
Here is a schematic example of what I'm talking about:
Suppose I have 500 CPUs. Each CPU somehow produces, say, 50 doubles. I need to multiply all 250,00 of these doubles together. Then I repeat this between 10,000 and 1 million times. If I could have one GPU (in addition to the 500 CPUs), this could be really efficient. Each CPU would compute its 50 doubles for all ~1 million "states". Then, all 500 CPUs would send their doubles to the GPU. The GPU would then multiply the 250,000 doubles together for each of the 1 million "states", producing 1 million doubles.
These numbers are not exact. The compute is indeed very large. I'm just trying to convey the general problem.

This isn't the way to think about these things.
I like to say that MPI and GPGPU stuff are orthogonal(*). You use MPI between tasks (for which think nodes, although you can have multiple tasks per node), and each task may or may not use an accelerator like a GPU to accelerate the computation within task. There is no MPI rank on a GPU.
Regardless, Talonmies is right; this particular example doesn't sound like it would benefit much from a GPU. And it won't be helped by having tens of thousands of doubles per task; if you're only doing one or a few FLOPs per double, the cost of sending the data to the GPU will exceed the benefit of having all those cores operate on them.
(*) This used to be more clearly true; now with, for instance, GPUDirect being able to copy memory to remote GPUs over infiniband, the distinction is fuzzier. However, I maintain that this is still the most useful way to think about things, with such things as RDMA to GPUs being an important optimization but conceptually a minor tweak.

Here I have found some news about the topic:
"MPI, the Message Passing Interface, is a standard API for communicating data via messages between distributed processes that is commonly used in HPC to build applications that can scale to multi-node computer clusters. As such, MPI is fully compatible with CUDA, which is designed for parallel computing on a single computer or node. There are many reasons for wanting to combine the two parallel programming approaches of MPI and CUDA. A common reason is to enable solving problems with a data size too large to fit into the memory of a single GPU, or that would require an unreasonably long compute time on a single node. Another reason is to accelerate an existing MPI application with GPUs or to enable an existing single-node multi-GPU application to scale across multiple nodes. With CUDA-aware MPI these goals can be achieved easily and efficiently. In this post I will explain how CUDA-aware MPI works, why it is efficient, and how you can use it."

How many tasks can be executed simultaneously on GPU device?

I'm using OpenCL and have ATI 4850 card. It has:
CL_DEVICE_MAX_COMPUTE_UNITS: 10
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
CL_DEVICE_MAX_WORK_ITEM_SIZES:(256, 256, 256)
CL_DEVICE_AVAILABLE: 1
CL_DEVICE_NAME: ATI RV770
How many tasks can it execute simultaneously?
Is it CL_DEVICE_MAX_COMPUTE_UNITS * CL_DEVICE_MAX_WORK_ITEM_SIZES = 2560?
To be more specific: a single core processor can execute only one task in the one moment, dual-core can execute 2 tasks... How many tasks can execute my GPU at one moment? Or rephrased: How many processors does my GPU have?

The RV770 has 10 SIMD cores, each consisting of 16 shader cores, each consisting of 5 ALUs (VLIW5 architecture). A total of 800 ALUs that can do parallel computations. I don't think there's a way to get all these numbers out of OpenCL. I'm also not sure what you would equate to a CPU core. Perhaps a shader core? You can read about VLIW at Wikipedia. It's an interesting design.
If you say a CPU core is only executing one "task" at any given time, even though it has multiple ALUs working in parallel, then I guess you can say the RV770 would be working on 160 tasks. But with the differences in how different chips work, I think "core" and "task" can become difficult to define. A CPU with hyperthreading can even execute two sets of code at the same time. With OpenCL I don't believe it is possible yet to execute more than one kernel at any given time - unless recent driver updates have changed that.
Anyway, I think it is more important to present your work to the GPU in a way that gives the best performance. Unfortunately there's no way to find the best work group size other than experimenting. At least not that I know of. One help is that if the drivers support OpenCL 1.1 you can query the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and set your work size to a multiple of that. Otherwise, going for a multiple of 64 is probably a safe bet.

GPU work ends up becoming wavefronts/warps.
Using a GPU for UI and compute is effectively using it for many programs without being aware of it. Many for the GUI drawing, plus whatever compute kernels you are executing. Fast OpenCL clients are asynchronous and overlap multiple instance of work so they won't be latency-bound. It is expected that you'll use multiple kernels in parallel.
There doesn't seem to be a "hard" limit other than memory limiting the number of buffers you can use. When using the same GPU for UI and for compute, you must throttle your work. In my experience, issuing too much work will cause starvation of the GUI and/or your compute kernels. There doesn't seem to be anything in the way of ensuring that you won't have starvation (long delays before a work item begins actually executing). Some work item(s) may sit for a very long time (10s seconds or more in bad cases) while the GPU does other work items. I speculate that items are dispatched to pipelines based on data availability and little or nothing is there to prevent starvation of work items.
Limiting how far ahead work is enqueued greatly improves GUI responsiveness by letting the GPU drain its work queue almost/sometimes to empty, reducing GUI drawing workitem starvation delays.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

MPI send/recv latency jump at 32 MiB message size - mpi

Related

MPI - No performance gain when using every available core on the machine

OpenCL enqueued kernels using lots of host memory

OpenCl Statement , true or false?

MPI + GPU : how to mix the two techniques

How many tasks can be executed simultaneously on GPU device?

Categories

Resources