cuDF low GPU utilization - cudf

I have a task that involves running many queries on a dataframe. I compared the performance of running these queries on a Xeon CPU (pandas) vs. an RTX 2080 (cuDF). For a dataframe of 100k rows, the GPU is faster, but not by much. Looking at nvidia-smi output, GPU utilization sits around 3-4% while the queries are running.
My question is: what can I do to speed up the cuDF task and achieve high GPU utilization?
For comparison, in the CPU case I can run 8 of these queries in parallel on 8 CPU cores.
import cudf
import cupy as cp
import numpy as np

NUM_ELEMENTS = 100000

# Build a three-column dataframe of random floats directly on the GPU.
df = cudf.DataFrame()
df['value1'] = cp.random.sample(NUM_ELEMENTS)
df['value2'] = cp.random.sample(NUM_ELEMENTS)
df['value3'] = cp.random.sample(NUM_ELEMENTS)

# Random thresholds for one query.
c1 = np.random.random()
c2 = np.random.random()
c3 = np.random.random()

res = df.query('((value1 < @c1) & (value2 > @c2) & (value3 < @c3))')
The sample code above doesn't take many GPU cycles on its own; however, I want to run thousands of such queries on the data, and I don't want to run them sequentially. Is there a way to run multiple query() calls on a cuDF dataframe in parallel to maximize GPU utilization?

We're working towards enabling this in cuDF, but it is currently a limitation of the library. The parallelism mechanism you're looking for is CUDA streams (https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/). We don't yet support CUDA streams in the cuDF Python library, but we're actively working on it.
You may be able to work around this using a combination of CuPy and Numba along with their support for CUDA streams (https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.Stream.html, https://numba.pydata.org/numba-doc/dev/cuda-reference/host.html#stream-management), but you'd be in a very experimental area.
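For what it's worth, a rough, untested sketch of that workaround might look like the following: it pulls the columns out as CuPy arrays and evaluates each filter as a boolean mask on its own CUDA stream, so independent queries can overlap on the GPU. The stream handling and the final boolean indexing back into cuDF are my own assumptions about the CuPy/cuDF interop, not an official cuDF API.

import cudf
import cupy as cp
import numpy as np

NUM_ELEMENTS = 100000
df = cudf.DataFrame()
df['value1'] = cp.random.sample(NUM_ELEMENTS)
df['value2'] = cp.random.sample(NUM_ELEMENTS)
df['value3'] = cp.random.sample(NUM_ELEMENTS)

# Zero-copy views of the columns as CuPy arrays.
v1, v2, v3 = df['value1'].values, df['value2'].values, df['value3'].values

NUM_QUERIES = 8
streams = [cp.cuda.Stream(non_blocking=True) for _ in range(NUM_QUERIES)]
masks = []
for stream in streams:
    c1, c2, c3 = np.random.random(3)
    with stream:
        # Kernels launched on different streams may overlap on the GPU.
        masks.append((v1 < c1) & (v2 > c2) & (v3 < c3))

for stream in streams:
    stream.synchronize()

# Apply the masks back to the dataframe (this part runs on the default stream).
results = [df[cudf.Series(mask)] for mask in masks]

Whether the comparison kernels actually overlap depends on how busy the GPU is; for 100k-row columns each kernel is tiny, so batching more work per stream may matter more than the number of streams.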

Related

R's gc() on parallel runs seems to dramatically under-report peak memory

In R I have a task that I'm trying to parallelize. Part of this is comparing run-times and peak memory usage for different implementations of the task at hand. I'm using the peakRAM library to determine peak memory, which I think just uses gc() under the hood, since if I do it manually I get the same peak memory results.
The problem is that the results from peakRAM differ from the computer's task manager (or top on Linux). If I run on a single core, these numbers are in the same ballpark, but even with 2 cores they are wildly different.
I'm parallelizing using pblapply in a manner similar to this.
library(peakRAM)
library(pbapply)
library(parallel)

times_parallel = peakRAM(
  pblapply(X = 1:10,
           FUN = \(x) data[iteration == x] %>% parallel_task(),
           cl = makeCluster(numcores, type = "FORK"))
)
With a single core, this process peaks at about 30 GB of memory. But with 2 cores, peakRAM reports only about 3 GB. Looking at top, however, shows each of the 2 worker processes using around 20-30 GB of memory at a time.
The only thing I can think of is that peakRAM is only reporting the memory of the main process, but I see nothing in the gc() details that suggests this is happening.
The time reported by peakRAM seems reasonable, with sub-linear gains at higher core counts.

Chainer: ParallelUpdater performance vs MultiprocessParallelUpdater

I'd like to train a CNN on the CIFAR-10 dataset with Chainer on multiple GPUs on a single node. I tried adapting this example to use ParallelUpdater, in a manner identical to the MNIST data-parallel example, but training performance was very poor: slower than training on one GPU, even though all 8 GPUs were being utilized. I changed to MultiprocessParallelUpdater and performance (iters/sec) was much better.
Bad:
num_gpus = 8
chainer.cuda.get_device_from_id(0).use()
train_iter = chainer.iterators.SerialIterator(train, batch_size)
if num_gpus > 0:
    updater = training.updater.ParallelUpdater(
        train_iter,
        optimizer,
        devices={('main' if device == 0 else str(device)): device
                 for device in range(num_gpus)},
    )
else:
    updater = training.updater.StandardUpdater(train_iter, optimizer, device=0)
Good:
num_gpus = 8
devices = range(num_gpus)
train_iters = [chainer.iterators.MultiprocessIterator(i, batch_size, n_processes=num_gpus)
               for i in chainer.datasets.split_dataset_n_random(train, len(devices))]
test_iter = chainer.iterators.MultiprocessIterator(test, batch_size, repeat=False, n_processes=num_gpus)
device = 0 if num_gpus > 0 else -1  # -1 indicates CPU, 0 indicates the first GPU device.
if num_gpus > 0:
    updater = training.updaters.MultiprocessParallelUpdater(train_iters, optimizer, devices=range(num_gpus))
else:
    updater = training.updater.StandardUpdater(train_iters[0], optimizer, device=device)
I also ran this benchmarking script with 8 GPUs using ParallelUpdater, but performance was also very poor: https://github.com/mitmul/chainer-cifar10/blob/master/train.py
My question is: how can I get good performance from ParallelUpdater, and what might I be doing wrong with it?
Thanks!
With multiple GPUs there is some communication overhead, so each iteration can be slower.
If you use the data-parallel method, you can use a much larger batch size and a larger learning rate, which can accelerate your training, roughly as in the sketch below.
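As a concrete illustration of that suggestion (the base values here are made-up examples, not numbers from the question), the common linear scaling rule looks something like this:

# Hypothetical sketch: scale batch size and learning rate with the GPU count.
num_gpus = 8
base_batch_size = 64    # per-GPU batch size that worked on a single GPU
base_lr = 0.05          # learning rate tuned for a single GPU

effective_batch_size = base_batch_size * num_gpus  # 512 samples per update step
scaled_lr = base_lr * num_gpus                     # 0.4, usually combined with a warm-up period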
I am not so familiar with ParallelUpdater, so my understanding might be wrong.
I guess the purpose of ParallelUpdater is not speed; instead, its main purpose is to use memory efficiently when computing gradients for a large batch.
Reading the source code, the model update is done in a Python for loop, so due to the GIL (Global Interpreter Lock) I guess the computation itself is not done in parallel.
https://github.com/chainer/chainer/blob/master/chainer/training/updaters/parallel_updater.py#L118
As written above, you can use MultiprocessParallelUpdater if you want the speed benefit of multiple GPUs.
Also, you can consider using ChainerMN, which is an extension library for multi-GPU training with Chainer.
github
documentation

gpgpu: how to estimate speed gains based on gpu and cpu specifications

I am a complete beginner at GPGPU and OpenCL, and I am unable to answer the following two questions about GPGPU in general.
a) Suppose I have a piece of code suitable to be run on a GPU (it executes the exact same set of instructions on multiple data). Assume I already have my data on the GPU. Is there any way to look at the specifications of the CPU and GPU and estimate the potential speed gains? For example, how can I estimate the speed gains (excluding the time taken to transfer data to the GPU) if I ran the code on AMD's R9 295X2 GPU (http://www.amd.com/en-us/products/graphics/desktop/r9/2...) instead of an Intel i7-4770K processor (http://ark.intel.com/products/75123)?
b) Is there any way to estimate the amount of time it would take to transfer data to the gpu?
Thank you!
Thank you for the responses! Given the large number of factors influencing speed gains, trying and testing is certainly a good idea. However, I do have a question about the GFLOPS approach mentioned in some responses; the GFLOPS metric was what I was looking at before posting the question.
I would think that GFLOPS would be a good way to estimate potential performance gains for SIMD-type operations, given that it takes into account differences in clock speed, core count, and floating-point operations per cycle. However, when I crunch the numbers using the GFLOPS specifications, something does not seem right.
The Good:
The GFLOPS-based estimate seems to match the observed speed gains for the toy kernel below. For an input integer n, the kernel computes the sum (1+2+3+...+n) in a brute-force way, so for large integers it involves a lot of compute. I ran the kernel for all ints from 1000 to 60000 on the GPU and on the CPU (sequentially on the CPU, without threading) and measured the timings.
__kernel void calculate(__global int* input, __global int* output) {
    size_t id = get_global_id(0);
    int inp_num = input[id];
    int sum = 0;
    for (int i = 0; i <= inp_num; ++i)
        sum += i;
    output[id] = sum;
}
GPU on my laptop:
NVS 5400M (www.nvidia.com/object/nvs_techspecs.html)
GFLOPS, single precision: 253.44 (en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units)
CPU on my Laptop:
intel i7-3720QM, 2.6 GHz
GFLOPS (assuming single precision): 83.2 (download.intel.com/support/processors/corei7/sb/core_i7-3700_m.pdf). The Intel document does not specify whether this is single or double precision.
CPU Time: 3.295 sec
GPU Time: 0.184 sec
Observed speed gain (GPU vs. a single CPU core): 3.295/0.184 ≈ 18
Estimated speed gain if all 4 CPU cores were used: 18/4 ≈ 4.5
Speed gain predicted from FLOPS: (GPU FLOPS)/(CPU FLOPS) = 253.44/83.2 ≈ 3.0
For the above example, the GFLOPS-based estimate seems consistent with the one obtained from experimentation, provided the Intel documentation indeed specifies FLOPS for single rather than double precision. I did try to find more references on the FLOPS specification of the Intel processor in my laptop. The observed speed gain also seems good, given that I have a modest GPU.
The Problem:
The FLOPS-based approach seems to give much lower speed gains than expected, once GPU price is factored in, when comparing AMD's R9 295X2 GPU (www.amd.com/en-us/products/graphics/desktop/r9/295x2#) with Intel's i7-4770K (ark.intel.com/products/75123):
AMD's FLOPS, single precision: 11.5 TFLOPS (from the above-mentioned link)
Intel's FLOPS, single precision: (number of cores) x (FLOPS per cycle per core) x (clock speed) = 4 x 32 (peak, www.pcmag.com/article2/0,2817,2419798,00.asp) x 3.5 GHz = 448 GFLOPS
Speed gain based on FLOPS = (11.5 TFLOPS)/(448 GFLOPS) ≈ 26
AMD GPUs price: $1500
Intel CPUs price: $300
For the price of one AMD R9 295X2 GPU I can buy 5 Intel i7-4770K CPUs, which reduces the effective, price-normalized speed gain to 26/5 ≈ 5. However, this estimate is not at all consistent with the 100-200x increase in speed one would expect. The low speed-gain estimate from the GFLOPS approach makes me think something is wrong in my analysis, but I am not sure what.
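To make the arithmetic above explicit, here is the same estimate as a small Python calculation (the 32 FLOPs/cycle/core figure and the prices are just the numbers quoted above, not authoritative specifications):

# Peak single-precision throughput estimates from the figures quoted above.
amd_r9_295x2_gflops = 11500.0              # 11.5 TFLOPS
intel_i7_4770k_gflops = 4 * 32 * 3.5       # cores * FLOPs/cycle/core * GHz = 448

raw_gain = amd_r9_295x2_gflops / intel_i7_4770k_gflops       # ~25.7x

# Price-normalized gain: one GPU (~$1500) vs. five CPUs (~$300 each).
gpu_price, cpu_price = 1500.0, 300.0
price_normalized_gain = raw_gain / (gpu_price / cpu_price)   # ~5.1x

print(raw_gain, price_normalized_gain)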
You need to examine the kernel(s). I am learning CUDA myself, so I couldn't tell you exactly what you'd do with OpenCL.
But I would figure out roughly how many floating-point operations a single instance of the kernel performs, then find the number of floating-point operations per second each device can handle:
(number of kernel instances launched) x (floating-point operations per instance) / (device throughput in FLOPS) = time to execute
The number of kernels launched will depend on your data.
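A sketch of that back-of-the-envelope formula in Python (the per-instance FLOP count below is a made-up placeholder you would have to estimate from your own kernel; the device figure is the NVS 5400M number quoted earlier):

# Rough execution-time estimate: total FLOPs divided by device throughput.
num_work_items = 60000 - 1000      # one kernel instance per input integer
flops_per_work_item = 30000        # hypothetical rough operation count per instance
device_gflops = 253.44             # quoted peak single-precision GFLOPS

total_flops = num_work_items * flops_per_work_item
estimated_seconds = total_flops / (device_gflops * 1e9)
print(estimated_seconds)

Real kernels rarely reach the peak throughput, so treat the result as an optimistic lower bound on run time.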
A) Normally this question is never answered precisely, since we are not talking about 1.05x speed gains. When the problem is suitable, big enough to hide any overheads (100k work-items), and the data is already on the GPU, then we are talking about speedups of 100-300x. Normally nobody cares whether it is 250x or 251x.
The estimate is difficult to make, since the platforms are completely different: not only in clock speed, but in memory latency, caches, bus speeds, and processing elements.
I cannot give you a clear answer on this, other than: try it and measure.
B) The time to copy the memory is completely dependent on the GPU-CPU bus speed (the PCI bus). That is the hardware limit; in practice you will always see less than that when copying. Generally you can apply the rule of three to work out the time needed, but there is always a small driver overhead that depends on the platform and device. So copying 100 bytes is usually very slow, but copying some MB runs at close to the bus speed.
The memory copy speed is usually not a design constraint when creating a GPGPU app, since it can be hidden in many ways (pinned memory, overlapping copies with compute, etc.), so nobody will notice any slowdown due to memory operations.
You should not decide whether a problem is suitable for the GPU just by looking at the time lost on memory copies. Better measures are whether the problem itself is suitable and whether you have enough data to keep the GPU busy (otherwise it is faster to do it directly on the CPU).
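A back-of-the-envelope version of that rule of three, with an assumed effective PCIe bandwidth and a made-up fixed driver overhead (both numbers are placeholders, not measurements):

# Transfer-time estimate: fixed per-call overhead plus size / effective bus bandwidth.
pcie_bandwidth_gb_s = 12.0     # assumed effective PCIe 3.0 x16 bandwidth, below the theoretical peak
per_call_overhead_s = 10e-6    # assumed ~10 microseconds of driver/launch overhead

def estimated_copy_time(num_bytes):
    return per_call_overhead_s + num_bytes / (pcie_bandwidth_gb_s * 1e9)

print(estimated_copy_time(100))          # tiny copy: dominated by the fixed overhead
print(estimated_copy_time(100_000_000))  # ~100 MB: dominated by the bus speed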
Potential speed gain depends heavily on the algorithm implementation. It is difficult to forecast the performance level unless you are developing some very simple application (like the simplest image filter). In some cases estimates can be made using memory system performance as the basis, since many algorithms are bandwidth-bound.
You can calculate the transfer time by dividing the data size by the GPU memory bandwidth for device-internal operations; look at the hardware specifications to get it, or calculate it yourself if you know the memory frequency and bus width. For host-device operations, the PCI-E bus speed is usually the limit.
If the code is easy (the kind of lightweight work GPU cores want) and is not memory-bound, then you can approximate as follows.
Sample kernel: read two 32-bit floats from memory, do calculations on them at least 20-30 times, then write to memory once.
New: GPU
Old: CPU
Gain ratio = ((New/Old) - 1) * 100 (%)
New = 5000 cores * 2 ALU-FPUs per core * 1.0 GHz = 10000 GFLOPS
Old = 10 cores * 8 ALU-FPUs per core * 4.0 GHz = 320 GFLOPS
((New/Old) - 1) * 100 ≈ 3000% speed gain
This holds when the code mostly uses registers and local memory and rarely hits global memory.
If the code is hard (heavy branching + fake recursion + non-uniformity), the gain is only 3-5x; for linear code it can of course be equal to or less than CPU performance.
When the code is memory-bound, the gain will be roughly 1 TB/s (GPU) divided by 40 GB/s (CPU).
If each iteration needs to upload data to the GPU, there will be a PCI-E bandwidth bottleneck too.
Workloads are usually classified into 2 categories:
bandwidth-bound - more time is spent on fetches from global memory, so even increasing the clock frequency doesn't help; problems like sorting. Bandwidth capacity is measured in GB/s.
compute-bound - performance is directly proportional to compute horsepower; problems like matrix multiplication. Compute capacity is measured in GFLOPS.
There is a tool, clpeak, which tries to measure these programmatically.
It is very important to classify your problem in order to measure its performance and choose the right device (knowing its limits); a rough way to do this is sketched after this answer.
Say you compare the Intel HD 4000 and the i7-3630 (both on the same chip) in https://github.com/krrishnarraj/clpeak/tree/master/results/Intel%28R%29_OpenCL:
the i7 is comparatively better at bandwidth (plus no transfer overheads)
in terms of compute, the GPU is 4-5 times faster than the i7
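One hedged way to make that classification concrete is a roofline-style estimate: compare the kernel's arithmetic intensity (FLOPs per byte moved) with the device's ratio of peak FLOPS to memory bandwidth. The device figures below are placeholders you would take from the spec sheet or from clpeak.

# Roofline-style classification sketch (device figures are placeholders).
peak_gflops = 1000.0         # device peak compute, GFLOPS
peak_bandwidth_gb_s = 100.0  # device peak memory bandwidth, GB/s

machine_balance = peak_gflops / peak_bandwidth_gb_s  # FLOPs per byte the device can sustain

def classify(kernel_flops, kernel_bytes_moved):
    arithmetic_intensity = kernel_flops / kernel_bytes_moved
    return "compute-bound" if arithmetic_intensity > machine_balance else "bandwidth-bound"

print(classify(kernel_flops=2e9, kernel_bytes_moved=8e9))   # low intensity -> bandwidth-bound
print(classify(kernel_flops=2e12, kernel_bytes_moved=8e9))  # high intensity -> compute-bound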

GPU programming via JOCL uses only 6 out of 80 shader cores?

I am trying to get a program to run on my GPU, and to start with an easy sample I modified the first sample on http://www.jocl.org/samples/samples.html to run the following little script: I run n simultaneous "threads" (what is the correct name for the GPU equivalent of a thread?), each of which performs 20000000/n independent tanh() computations. You can see my code here: http://pastebin.com/DY2pdJzL
The speed is by far not what I expected:
for n=1 it takes 12.2 seconds
for n=2 it takes 6.3 seconds
for n=3 it takes 4.4 seconds
for n=4 it takes 3.4 seconds
for n=5 it takes 3.1 seconds
for n=6 and beyond, it takes 2.7 seconds.
So beyond n=6 (be it n=8, n=20, n=100, n=1000 or n=100000) there is no further performance increase, which suggests only 6 of these are computed in parallel. However, according to the specifications of my card there should be 80 cores: http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5450-overview/pages/hd-5450-overview.aspx#2
It is not a matter of overhead, since increasing or decreasing the 20000000 only changes all the execution times by a linear factor.
I have installed the AMD APP SDK and drivers that support OpenCL: see http://dl.dropbox.com/u/3060536/prtscr.png and http://dl.dropbox.com/u/3060536/prtsrc2.png for details (or at least I conclude from these that OpenCL is running correctly).
So I'm a bit clueless now, where to search for answer. Why can JOCL only do 6 parallel executions on my ATI Radeon HD 5450?
You are hard-coding the local work size to 1. Use a larger size or let the driver choose one for you (by passing null for the local work size).
Also, your kernel is not designed in an OpenCL style. You should take out the for loop and let the runtime handle the iteration for you, launching one work-item per computation, roughly as in the sketch below.
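The original code is Java/JOCL, but here is an equivalent sketch of the idea in Python with pyopencl: one work-item per tanh() computation (no inner loop in the kernel), and None for the local work size so the implementation picks a suitable work-group size. The kernel and sizes here are illustrative, not the poster's actual code.

import numpy as np
import pyopencl as cl

N = 20_000_000  # one work-item per element instead of a loop inside a few work-items

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

kernel_src = """
__kernel void tanh_all(__global const float* in, __global float* out) {
    size_t gid = get_global_id(0);
    out[gid] = tanh(in[gid]);   // each work-item does exactly one computation
}
"""
prg = cl.Program(ctx, kernel_src).build()

host_in = np.random.rand(N).astype(np.float32)
mf = cl.mem_flags
buf_in = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=host_in)
buf_out = cl.Buffer(ctx, mf.WRITE_ONLY, host_in.nbytes)

# Global size = N, local size = None lets the driver choose the work-group size.
prg.tanh_all(queue, (N,), None, buf_in, buf_out)

host_out = np.empty_like(host_in)
cl.enqueue_copy(queue, host_out, buf_out)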

Limit the number of compute units used by OpenCL

I need to limit the number of compute units used by my OpenCL application.
I'm running it on a CPU that has 8 compute units; I've confirmed that with CL_DEVICE_MAX_COMPUTE_UNITS.
The execution time I get with OpenCL is much lower than one eighth of the time of the normal algorithm without OpenCL (it is something like 600 times faster). I want to use just 1 compute unit because I need to see the real improvement from the same code being optimized by OpenCL.
It's just for testing; the real application will continue to use all the compute units.
Thanks for your help
If you are using CPUs, why don't you try the OpenCL device fission extension?
Device fission allows you to split a compute device into sub-devices. You can then create a command queue on a sub-device and enqueue kernels only to that subset of your CPU cores.
You can divide your 8-core device into 8 sub-devices of 1 core each, for example.
Take a look at the device fission example in the AMD APP SDK; a rough Python sketch follows below.
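A minimal sketch of what that might look like from Python with pyopencl, assuming an OpenCL 1.2 CPU runtime that supports device partitioning (older SDKs that only expose the cl_ext_device_fission extension may need different property constants, so check the pyopencl documentation for the exact property-list form):

import pyopencl as cl

platform = cl.get_platforms()[0]
cpu_device = platform.get_devices(device_type=cl.device_type.CPU)[0]

# Partition the CPU into sub-devices of 1 compute unit each,
# then build a context and queue on just one of them.
sub_devices = cpu_device.create_sub_devices(
    [cl.device_partition_property.EQUALLY, 1])
single_core = sub_devices[0]

ctx = cl.Context([single_core])
queue = cl.CommandQueue(ctx)
print(single_core.max_compute_units)  # should report 1

Kernels enqueued on this queue are then restricted to the single compute unit, which gives you the 1-core OpenCL baseline you want for the comparison.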
