Why do multiple processes accessing single GPU increase performance? - opencl

I am using PyOpenCL in combination with Python 3.7.
When I call the same kernel from multiple processes, each with its own context pointing to the same GPU device, performance improves almost linearly with the number of processes.
I can imagine that running processes in parallel makes some overlapping transfers possible, where a kernel of process A executes while process B sends data to the graphics card. But this alone should not be responsible for such a boost in performance.
Below is a code example: a dummy application in which some data is decoded.
With n_processes=1 I get around 12 Mbit/sec, while with n_processes=4 I get 45 Mbit/sec.
I am using a single AMD Radeon VII graphics card.
Does anyone have a good explanation for this phenomenon?
Update:
I profiled the script using CodeXL. It seems that a lot of time is wasted between kernel executions, and multiple processes are able to make use of it.
import logging
import multiprocessing as mp
import pyopencl as cl
import pyopencl.array as cl_array
from mako.template import Template
import numpy as np
import time

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(process)d %(levelname)-8s [%(filename)s:%(lineno)d] %(message)s')

kernelsource = """
float boxplus(float a,float b)
{
    float boxp=log((1+exp(a+b))/(exp(a)+exp(b)));
    return boxp;
}

void kernel test(global const float* in,
                 global const int* permutation_vector,
                 global float* out)
{
    int gid = get_global_id(0);
    int p = gid; // permutation index
    float t = 0.0;
    for(int k=1; k<10;k++){
        p = permutation_vector[p];
        t= boxplus(in[p],in[gid]);
    }
    out[gid] = t;
}
"""


class MyProcess(mp.Process):
    def __init__(self, q):
        super().__init__()
        self.q = q

    def run(self) -> None:
        # every process creates its own context and queue on the same GPU
        platform = cl.get_platforms()
        my_gpu_devices = [platform[0].get_devices(device_type=cl.device_type.GPU)[0]]
        ctx = cl.Context(devices=my_gpu_devices)
        queue = cl.CommandQueue(ctx)
        tpl = Template(kernelsource)
        rendered_tp = tpl.render()
        prg = cl.Program(ctx, str(rendered_tp)).build()

        size = 100000  # shape of random input array
        # note: the kernel declares float (32-bit) pointers, the host uses float64
        dtype = np.float64
        output_buffer = cl_array.empty(queue, size, dtype=dtype)
        input_buffer = cl_array.empty(queue, size, dtype=dtype)
        permutation = np.random.permutation(size)
        # np.int was removed in NumPy 1.24; int32 matches the kernel's int
        permutation_buffer = cl_array.to_device(queue, permutation.astype(np.int32))

        def decode(data_in):
            input_buffer.set(data_in)
            for i in range(10):
                prg.test(queue, input_buffer.shape, None,
                         input_buffer.data,
                         permutation_buffer.data,
                         output_buffer.data)
            queue.finish()
            return output_buffer.get()

        counter = 1
        while True:
            data_in = np.random.normal(size=size).astype(dtype)
            data_out = decode(data_in)
            if counter % 100 == 0:
                # report the number of symbols decoded since the last report
                self.q.put(size * 100)
                counter = 1
            else:
                counter += 1


def run_test_multi_cpu_single_gpu():
    q = mp.Queue()
    n_processes = 4
    for i in range(n_processes):
        MyProcess(q).start()
    t0 = time.time()
    symbols_sum = q.get()
    i = 0
    while True:
        i += 1
        print('{} Mbit/sec'.format(1 / 1e6 * symbols_sum / (time.time() - t0 + 1e-15)))
        symbols = q.get()
        symbols_sum += symbols


if __name__ == '__main__':
    run_test_multi_cpu_single_gpu()

The kernel loop has too little work. Its run time must be almost comparable to the kernel launch overhead, which in turn is comparable to the overhead of a function call in Python.
for(int k=1; k<10;k++){
p = permutation_vector[p];
t= boxplus(in[p],in[gid]);
}
This latency is probably hidden behind another process's kernel launch latency, and that launch latency is probably hidden behind a third process's function call overhead. The GPU can take even more: there are only about 10 iterations of the for loop, with O(N) complexity. Even low-end GPUs only get saturated by at least thousands of iterations with O(N*N) complexity.
Also, the buffer reads/writes and compute are overlapping, as you said.
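A toy model can make this latency-hiding argument concrete. The sketch below assumes that each decode call costs a fixed host-side overhead (Python call plus kernel launch) that overlaps freely across processes, while GPU time is serialised. The function name and the 9:1 overhead-to-GPU ratio are made-up illustration values, not measurements.

```python
def throughput(n_processes, host_overhead, gpu_time):
    """Calls completed per time unit in a toy model.

    Host overhead of different processes overlaps,
    GPU time of different processes does not.
    """
    per_process = 1.0 / (host_overhead + gpu_time)  # one process alone
    gpu_limit = 1.0 / gpu_time                      # GPU busy all the time
    return min(n_processes * per_process, gpu_limit)


# hypothetical numbers: 9 time units of host overhead per 1 unit of GPU work
for n in (1, 2, 4, 10, 20):
    print(n, throughput(n, host_overhead=9.0, gpu_time=1.0))
```

With these assumed numbers, four processes give four times the single-process throughput, and the curve only flattens once the GPU itself is busy all the time.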
So if the kernel takes all time in that profiling window, there is no
capacity left on the graphic card?
The GPU can also overlap multiple computations if it has the capability and if each piece of work is small enough to leave some in-flight threads for the others. The number of in-flight threads can be as high as 40*shaders: 40*3840 = 153600 instructions issued/pipelined per cycle (or every few cycles), or let's say 3.46 TFLOPS.
At 3.46 TFLOPS, even with 1000 FLOP per 64-bit data element, it can stream 3.46 billion elements per second (about 27.7 GB/s). This is without pipelining anything in the kernel (read element 1, compute, write result, read element 2). But it does pipeline: just after starting to compute the first element, the next batch of items is mapped onto the same shaders and loads new data, so it can take hundreds of GB/s, which is more than the PCI-e bandwidth.
Also, the CPU can't pre-process/post-process at that rate. So the buffer copies and the CPU are bottlenecks, which are hidden behind each other when there are multiple processes.
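The arithmetic in this answer can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 3840 shaders, 40 in-flight threads per shader, 3.46 TFLOPS and 1000 FLOP per element are the figures quoted above, and the function names are invented for illustration.

```python
def in_flight_threads(shaders, threads_per_shader):
    # upper bound on threads resident on the GPU at once
    return shaders * threads_per_shader


def elements_per_second(flops, flop_per_element):
    # elements the GPU could process if compute were the only limit
    return flops / flop_per_element


threads = in_flight_threads(3840, 40)
rate = elements_per_second(3.46e12, 1000)
bytes_per_second = rate * 8  # 64-bit elements are 8 bytes each

print(threads)
print(rate / 1e9, "G elements/s")
print(bytes_per_second / 1e9, "GB/s")
```

Counting 8 bytes per 64-bit element, the compute-limited rate comes out at roughly 27.7 GB/s, which is already in the range of what a PCI-e link can deliver.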

Related

Random NaN and incorrect results with OpenCL kernel

I am trying to implement a general matrix-matrix multiplication OpenCL kernel, one that conforms to C = α*A*B + β*C.
The Kernel
I did some research online and decided to use a modified kernel from this website as a starting point. The main modification I have made is that allocation of local memory as working space is now dynamic. Below is the kernel I have written:
__kernel
void clkernel_gemm(const uint M, const uint N, const uint K, const float alpha,
                   __global const float* A, __global const float* B, const float beta,
                   __global float* C, __local float* Asub, __local float* Bsub) {
    const uint row = get_local_id(0);
    const uint col = get_local_id(1);
    const uint TS = get_local_size(0); // Tile size
    const uint globalRow = TS * get_group_id(0) + row; // Row ID of C (0..M)
    const uint globalCol = TS * get_group_id(1) + col; // Col ID of C (0..N)
    // Initialise the accumulation register
    float acc = 0.0f;
    // Loop over all tiles
    const int numtiles = K / TS;
    for (int t = 0; t < numtiles; t++) {
        // Load one tile of A and B into local memory
        const int tiledRow = TS * t + row;
        const int tiledCol = TS * t + col;
        Asub[col * TS + row] = A[tiledCol * M + globalRow];
        Bsub[col * TS + row] = B[globalCol * K + tiledRow];
        barrier(CLK_LOCAL_MEM_FENCE);
        // Accumulate the partial product for this tile
        for(int k = 0; k < TS; k++) {
            acc += Asub[k * TS + row] * Bsub[col * TS + k] * alpha;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[globalCol * M + globalRow] = fma(beta, C[globalCol * M + globalRow], acc);
}
Tile Size (TS) is now a value defined in the calling code, which looks like this:
// A, B and C are 2D matrices, their cl::Buffers have already been set up
// and values appropriately set.
kernel.setArg(0, (cl_int)nrowA);
kernel.setArg(1, (cl_int)ncolB);
kernel.setArg(2, (cl_int)ncolA);
kernel.setArg(3, alpha);
kernel.setArg(4, A_buffer);
kernel.setArg(5, B_buffer);
kernel.setArg(6, beta);
kernel.setArg(7, C_buffer);
kernel.setArg(8, cl::Local(sizeof(float) * nrowA * ncolB));
kernel.setArg(9, cl::Local(sizeof(float) * nrowA * ncolB));
cl::NDRange global(nrowA, ncolB);
cl::NDRange local(nrowA, ncolB);
status = cmdq.enqueueNDRangeKernel(kernel, cl::NDRange(0), global, local);
The Problem
The problem I am encountering is, unit tests (written with Google's gtest) I have written will randomly fail, but only for this particular kernel. (I have 20 other kernels in the same .cl source file that pass tests 100% of the time)
I have a test that multiplies a 1x4 float matrix {0.0, 1.0, 2.0, 3.0} with a transposed version of itself {{0.0}, {1.0}, {2.0}, {3.0}}. The expected output is {14.0}.
However, I can get this correct result maybe just 75% of the time.
Sometimes, I can get 23.0 (GTX 970), 17.01 (GTX 750) or just -nan and 0.0 (all 3 devices). The curious part is, the respective incorrect results seem to be unique to the devices; I cannot seem to, for example, get 23.0 on the Intel CPU or the GTX 750.
I am baffled because if I have made an algorithmic or mathematical mistake, the mistake should be consistent; instead I am getting incorrect results only randomly.
What am I doing wrong here?
Things I have tried
I have verified that the data going into the kernels are correct.
I have tried to initialize both __local memory to 0.0, but this causes all results to become wrong (but frankly, I'm not really sure how to initialize it properly)
I have written a test program that only executes this kernel to rule out any race conditions interacting with the rest of my program, but the bug still happens.
Other points to note
I am using the C++ wrapper retrieved directly from the Github page.
To use the wrapper, I have defined CL_HPP_MINIMUM_OPENCL_VERSION 120 and CL_HPP_TARGET_OPENCL_VERSION 120.
I am compiling the kernels with the -cl-std=CL1.2 flag.
All cl::Buffers are created with only the CL_MEM_READ_WRITE flag.
I am testing this on Ubuntu 16.04, Ubuntu 14.04, and Debian 8.
I have tested this on Intel CPUs with the Intel OpenCL Runtime 16.1 for Ubuntu installed. The runtime reports that it supports up to OpenCL 1.2
I have tested this on both Nvidia GTX 760 and 970. Nvidia only supports up to OpenCL 1.2.
All 3 platforms exhibit the same problem with varying frequency.
This looks like a complicated one. There are several things to address and they won't fit into comments, so I'll post all this as an answer even though it does not solve your problem (yet).
I am baffled because if I have made an algorithmic or mathematical
mistake, the mistake should be consistent; instead I am getting
incorrect results only randomly.
Such a behavior is a typical indicator of race conditions.
I have tried to initialize both __local memory to 0.0, but this causes
all results to become wrong (but frankly, I'm not really sure how to
initialize it properly)
Actually this is a good thing. Finally we have some consistency.
Initializing local memory
Initializing local memory can be done using the work items, e.g. if you have a 1D workgroup of 16 items and your local memory consists of 16 floats, just do this:
local float* ptr = ... // your pointer to local memory
int idx = get_local_id(0); // get the index for the current work-item
ptr[idx] = 0.f; // init with value 0
barrier(CLK_LOCAL_MEM_FENCE); // synchronize local memory access within workgroup
If your local memory is larger, e.g. 64 floats, you will have to use a loop where each work item initializes 4 values, at least that is the most efficient way. However, no one will stop you from using every work item to initialize every value in the local memory, even though that is complete nonsense since you're essentially initializing it multiple times.
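The strided initialisation described above can be emulated on the host to see which indices each work item touches. This is a plain Python sketch, not OpenCL: the helper name is hypothetical, and the 16 work items and 64 floats are the example sizes from the paragraph above. In a kernel the equivalent loop would run from get_local_id(0) in steps of get_local_size(0), followed by a barrier.

```python
def init_indices(local_id, local_size, n_values):
    """Indices one work item zeroes when striding through local memory."""
    return list(range(local_id, n_values, local_size))


local_size, n_values = 16, 64
local_mem = [None] * n_values
writes = [0] * n_values

for lid in range(local_size):  # emulate the 16 work items one after another
    for idx in init_indices(lid, local_size, n_values):
        local_mem[idx] = 0.0
        writes[idx] += 1

print(init_indices(0, local_size, n_values))
print(all(w == 1 for w in writes))
```

Each emulated work item initialises 4 values, and every slot is written exactly once.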
Your changes
The original algorithm looks like it is designed specifically for square tiles.
__local float Asub[TS][TS];
__local float Bsub[TS][TS];
Not only that but the size of local memory matches the workgroup size, in their example 32x32.
When I look at your kernel parameters for local memory, I can see that you use parameters that are defined as M and N in the original algorithm. This doesn't seem correct.
Update 1
Since you have not said whether the original algorithm works for you, this is what you should do to find your error:
Create a set of test data. Make sure you only use data sizes that are actually supported by the original algorithm (e.g. minimum size, multiples of x, etc.). Also, use large data sets, since some errors only show up when multiple workgroups are dispatched.
Run the original, unaltered algorithm on your test data sets and verify the results.
Change the algorithm only so that it uses dynamically sized local memory instead of fixed-size local memory, but make sure the size is the same as in the fixed-size approach. This is what you tried, but I think it failed because of what I have described under "Your changes".
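For verifying results in the steps above, a slow host-side reference is handy. The following is a minimal pure-Python sketch, assuming the column-major indexing used in the kernel from the question (A[k * M + row], B[col * K + k], C[col * M + row]). The function name is hypothetical, and alpha = 1 and beta = 0 are assumed because the question does not state them.

```python
def gemm_reference(M, N, K, alpha, A, B, beta, C):
    """Return alpha*A*B + beta*C for flat, column-major lists.

    A is M x K, B is K x N, C is M x N.
    """
    out = list(C)
    for col in range(N):
        for row in range(M):
            acc = 0.0
            for k in range(K):
                acc += A[k * M + row] * B[col * K + k]
            out[col * M + row] = alpha * acc + beta * C[col * M + row]
    return out


# the 1x4 times 4x1 test case from the question
v = [0.0, 1.0, 2.0, 3.0]
result = gemm_reference(1, 1, 4, 1.0, v, v, 0.0, [0.0])
print(result)  # [14.0]
```

Comparing the kernel output against such a reference for many sizes makes it easier to see whether failures depend on the matrix shape or on the tile size.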

Is the device address of a buffer the same for different kernels/programs in OpenCL

When passing buffers as arguments to OpenCL kernels, will the address of a buffer as seen by the kernel code remain the same for the same buffer?
I used the code below to check, and it seems that the addresses are indeed the same. However, I can't find anything in the standard that guarantees this.
import pyopencl as cl
import numpy as np


def main():
    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    buf = cl.Buffer(ctx, mf.READ_ONLY, 1000)
    buf2 = cl.Buffer(ctx, mf.READ_WRITE, 8)

    # first program: write the device address of buf into buf2
    prg = cl.Program(ctx, """
    __kernel void
    get_addr(__global const int *in, __global long *out)
    {
        *out = (long)in;
    }
    """).build()
    knl = prg.get_addr
    knl.set_args(buf, buf2)
    cl.enqueue_task(queue, knl)
    b = np.empty([1], dtype=np.int64)
    cl.enqueue_copy(queue, b, buf2).wait()
    print(b[0])

    # second, separately built program: same check again
    prg = cl.Program(ctx, """
    __kernel void
    get_addr(__global const int *in, __global long *out)
    {
        *out = (long)in;
    }
    """).build()
    knl = prg.get_addr
    knl.set_args(buf, buf2)
    cl.enqueue_task(queue, knl)
    b = np.empty([1], dtype=np.int64)
    cl.enqueue_copy(queue, b, buf2).wait()
    print(b[0])


if __name__ == '__main__':
    main()
The use case is that I am running a simulation using OpenCL which has many parameters (arrays). To avoid passing these arrays around as arguments, I put them in a struct and pass a pointer to the struct around instead. Since this struct will be used many times (and by all work items), I would like to avoid filling it in on every run of every kernel, and would like to know whether the pointers can change between different runs/work items.
It is not guaranteed for OpenCL 1.x. This is why it is unsafe to store pointers in buffers. The runtime is allowed to move the allocation for each kernel launch. There is no guarantee that it will move it, and of course it is reasonable to expect that the buffer will not often need to move so it isn't surprising that you'd see the result you see. If you allocate a lot more buffers and cycle through them to force the runtime to move them around you will be more likely to see the issue.
For OpenCL 2.0 the shared virtual memory feature guarantees this by definition: the address couldn't be shared if it kept changing.
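One possible way around storing pointers under OpenCL 1.x, which the answer above does not prescribe, is to pack all parameter arrays into a single buffer and store integer offsets instead of addresses. Offsets stay valid even if the runtime moves the allocation. The sketch below shows only the host-side bookkeeping in plain Python; the function name and the parameter names "mass" and "charge" are hypothetical.

```python
def pack_parameters(arrays):
    """Concatenate named float lists into one flat list plus an offset table."""
    flat, offsets = [], {}
    for name, values in arrays.items():
        offsets[name] = (len(flat), len(values))  # (start index, length)
        flat.extend(values)
    return flat, offsets


# hypothetical simulation parameters
flat, offsets = pack_parameters({
    "mass": [1.0, 2.0],
    "charge": [0.5, -0.5, 0.25],
})

start, length = offsets["charge"]
print(offsets)
print(flat[start:start + length])
```

A kernel would then receive the flat buffer once, plus the small offset table, and index into the buffer relative to its own base pointer.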

Working around pyopencl array offset limitation

Is there a way to work around the limitation in PyOpenCL whereby:
array.data
fails with
pyopencl.array.ArrayHasOffsetError: The operation you are attempting does not yet support arrays that start at an offset from the beginning of their buffer.
I tried:
a.base_data[a.offset: a.offset + a.nbytes]
This seems to work sometimes, but other times I get:
pyopencl.LogicError: clCreateSubBuffer failed: invalid value
clCreateSubBuffer requires the offset (called the origin in this case) to be aligned, and origin + size to fall within the limits of the buffer.
CL_INVALID_VALUE is returned in errcode_ret if the region specified by
(origin, size) is out of bounds in buffer.
CL_MISALIGNED_SUB_BUFFER_OFFSET is returned in errcode_ret if there
are no devices in context associated with buffer for which the origin
value is aligned to the CL_DEVICE_MEM_BASE_ADDR_ALIGN value.
For the particular error you are seeing it looks like either your program or pyopencl is miscalculating the size of the array after the offset. Even if you fixed this you may still have problems if the original offset is not aligned to CL_DEVICE_MEM_BASE_ADDR_ALIGN.
Having said that NVIDIA seems to break from spec and allow arbitrary offsets. So your mileage may vary depending on the hardware.
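The two conditions quoted above can be checked on the host before asking for a sub-buffer. This is a minimal sketch with a hypothetical helper name; it assumes the alignment is given in bits, as CL_DEVICE_MEM_BASE_ADDR_ALIGN is defined, and uses 1024 bits purely as an example value. In PyOpenCL the real value should be available as the device attribute mem_base_addr_align.

```python
def check_sub_buffer(origin, size, buffer_size, base_addr_align_bits):
    """Name the error clCreateSubBuffer is expected to report, or None."""
    if origin + size > buffer_size:
        return "CL_INVALID_VALUE"  # region is out of bounds
    if origin % (base_addr_align_bits // 8) != 0:
        return "CL_MISALIGNED_SUB_BUFFER_OFFSET"
    return None


ALIGN_BITS = 1024  # example value only, query the device for the real one

print(check_sub_buffer(96, 8, 100, ALIGN_BITS))   # out of bounds
print(check_sub_buffer(24, 4, 100, ALIGN_BITS))   # misaligned origin
print(check_sub_buffer(128, 4, 256, ALIGN_BITS))  # acceptable
```

As noted above, some drivers accept misaligned origins anyway, so passing this check is the conservative case rather than a description of every implementation.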
If you're just looking to get a buffer that marks the start of the array data, to pass to a kernel, you don't have to worry about the size. Here's a function that gets a size-1 buffer that points to the start of the offset data:
def data_ptr(array):
    if array.offset:
        return array.base_data.get_sub_region(array.offset, 1)
    else:
        return array.data
You can use this to pass to a kernel, if you need a pointer to the start of the offset data. Here's an example, where I want to set a sub-region clV of array clA to the value 3. I use data_ptr to get a pointer to the start of clV's data.
import numpy as np
import pyopencl as cl
import pyopencl.array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

m, n = 5, 5
A = np.random.uniform(size=(m, n)).astype(np.float32)
clA = cl.array.Array(queue, A.shape, A.dtype)
clA.set(A)
clV = clA[1::2, 1::2]  # strided view that starts at an offset


def data_ptr(array):
    # size-1 sub-buffer that starts at the array's offset
    if array.offset:
        return array.base_data.get_sub_region(array.offset, 1)
    else:
        return array.data


source = """
__kernel void fn(long si, long sj, __global float *Y)
{
    const int i = get_global_id(0);
    const int j = get_global_id(1);
    Y[i*si + j*sj] = 3;
}
"""
kernel = cl.Program(ctx, source).build().fn
gsize = clV.shape
lsize = None
# element strides as 64-bit integers, to match the kernel's long arguments
estrides = (np.array(clV.strides) // clV.dtype.itemsize).astype(np.int64)
kernel(queue, gsize, lsize, estrides[0], estrides[1], data_ptr(clV))
print(clA.get())

speedup when using float4, opencl

I have the following OpenCL kernel function to compute the column sums of an image.
__kernel void columnSum(__global float* src,__global float* dst,int srcCols,
                        int srcRows,int srcStep,int dstStep)
{
    const int x = get_global_id(0);
    srcStep >>= 2;
    dstStep >>= 2;
    if (x < srcCols)
    {
        int srcIdx = x ;
        int dstIdx = x ;
        float sum = 0;
        for (int y = 0; y < srcRows; ++y)
        {
            sum += src[srcIdx];
            dst[dstIdx] = sum;
            srcIdx += srcStep;
            dstIdx += dstStep;
        }
    }
}
I assign one column to each thread, so that many threads can compute the column sums in parallel.
I also rewrote the above kernel with float4, so that each thread reads 4 elements of a row at a time from the source image, as shown below.
__kernel void columnSum(__global float* src,__global float* dst,int srcCols,
                        int srcRows,int srcStep,int dstStep)
{
    const int x = get_global_id(0);
    srcStep >>= 2;
    dstStep >>= 2;
    if (x < srcCols/4)
    {
        int srcIdx = x ;
        int dstIdx = x ;
        float4 sum = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
        for (int y = 0; y < srcRows; ++y)
        {
            float4 temp2;
            temp2 = vload4(0, &src[4 * srcIdx]);
            sum = sum + temp2;
            vstore4(sum, 0, &dst[4 * dstIdx]);
            srcIdx += (srcStep/4);
            dstIdx += (dstStep/4);
        }
    }
}
In this case, theoretically, I think the time the second kernel takes to process an image should be 1/4 of the time the first kernel takes. However, no matter how large the image is, the two kernels take almost the same time. I don't know why. Can you give me some ideas?
OpenCL vector data types like float4 were a better fit for older GPU architectures, especially AMD's GPUs. Modern GPUs don't have SIMD registers available to individual work-items; they are scalar in that respect. CL_DEVICE_PREFERRED_VECTOR_WIDTH_* equals 1 for the OpenCL driver on an NVIDIA Kepler GPU and on Intel HD integrated graphics. So adding float4 vectors on a modern GPU should require 4 operations. On the other hand, the OpenCL driver on an Intel Core CPU has CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT equal to 4, so these vectors could be added in a single step.
You are reading the values directly from the "src" array (global memory), which is typically around 400 times slower than private memory. Your bottleneck is definitely the memory access, not the "add" operation itself.
When you move from float to float4, the vector operation (add/multiply/...) is more efficient thanks to the ability of the GPU to operate on vectors. However, the reads/writes to global memory remain the same.
And since that is the main bottleneck, you will not see any speedup at all.
If you want to speed up your algorithm, you should move to local memory. However, you then have to handle the memory management and choose a proper block size yourself.
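The point about unchanged memory traffic can be checked with simple counting. The sketch below assumes each kernel reads every source element once and writes every running sum once, as in the two kernels from the question; the function name and the 1024x768 image size are hypothetical.

```python
def global_traffic_bytes(cols, rows, vector_width, bytes_per_float=4):
    """Total bytes read plus written in global memory by the column-sum kernel."""
    work_items = cols // vector_width
    per_item_per_row = 2 * vector_width * bytes_per_float  # one load + one store
    return work_items * rows * per_item_per_row


cols, rows = 1024, 768  # hypothetical image size
scalar = global_traffic_bytes(cols, rows, vector_width=1)
vector = global_traffic_bytes(cols, rows, vector_width=4)
print(scalar, vector)
```

The float4 version launches a quarter of the work items, but each one moves four times as much data, so a memory-bound kernel has the same amount of work to do either way.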
Which architecture do you use?
Using float4 gives higher instruction-level parallelism (and then requires 4 times fewer threads), so theoretically it should be faster (see http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf).
However, did I understand correctly that in your kernel you are doing a prefix sum (you store the partial sum at every iteration of y)? If so, because of the stores, the bottleneck is at the memory writes.
I think on the GPU float4 is not a SIMD operation in OpenCL. In other words if you add two float4 values the sum is done in four steps rather than all at once. Floatn is really designed for the CPU. On the GPU floatn serves only as a convenient syntax, at least on Nvidia cards. Each thread on the GPU acts as if it is scalar processor without SIMD. But the threads in a warp are not independent like they are on the CPU. The right way to think of the GPGPU models is Single Instruction Multiple Threads (SIMT).
http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
Have you tried running your code on the CPU? I think the code with float4 should run quicker (potentially four times quicker) than the scalar code on the CPU. Also, if you have a CPU with AVX, then you should try float8. If the float4 code is faster on the CPU, then float8 should be even faster on a CPU with AVX.
Try adding an __attribute__ to the kernel and see how the run time changes.
For example, try:
__kernel void __attribute__((vec_type_hint(int)))
or
__kernel void __attribute__((vec_type_hint(int4)))
or some floatN, as you want.
read more:
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/functionQualifiers.html

Asynchronous execution of commands from two command queues in OpenCL

I am trying to work out an application that can utilize both CPU and GPU at the same time by OpenCL. Specifically, I have two kernels, one for CPU executing, and one for GPU. CPU kernel will change the content of one buffer, and GPU will do other things when GPU detects that the buffer has been changed by CPU.
__kernel void cpuKernel(__global uint * dst1, const uint size)
{
    uint tid = get_global_id(0);
    // renamed from "size": the original redeclared the kernel parameter
    uint stride = get_global_size(0);
    while(tid < size)
    {
        atomic_xchg(&dst1[tid], 10);
        tid += stride;
    }
}
__kernel void gpuKernel(__global uint * dst1, __global uint * dst2, const uint size)
{
    uint tid = get_global_id(0);
    uint stride = get_global_size(0);
    // the original used undefined vectorSize/vectorOffset;
    // "size" and an offset of 0 are assumed here
    while(tid < size)
    {
        // spin until the CPU kernel has written this element
        while(dst1[tid] != 10)
            ;
        dst2[tid] = dst1[tid];
        tid += stride;
    }
}
As shown above, cpuKernel will change each element of dst1 buffer to 10, correspondingly, after GPU detect such changes, it will assign the element value (10) to the same place of another buffer dst2. cpuKernel is queued in command1 which is associated with CPU device, and gpuKernel is queued in command2 which is associated with GPU device, two command queues have been set CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE flag.
Then I make two cases:
case 1:
clEnqueueNDRangeKernel(command2,gpuKernel);
clEnqueueNDRangeKernel(command1,cpuKernel);
clFinish(command1);
clFinish(command2);
case 2:
clEnqueueNDRangeKernel(command1,cpuKernel);
clFinish(command1);
clEnqueueNDRangeKernel(command2,gpuKernel);
clFinish(command2);
The results show that the time consumed in the two cases is nearly the same. I expected some overlap in case 1, but there is none. Can anyone help me? Thanks!
Or, can anyone help to explain how to implement two kernels running on two devices asynchronously in OpenCL?
You are asking too much. As you have probably noticed, buffer objects belong to a context, while command queues belong to devices.
If a kernel operates on a buffer object, the corresponding data must be on that device. If you do not transfer it explicitly with clEnqueueWriteBuffer(), OpenCL will do that for you.
Hence, if you modify a buffer object with a kernel on one device (for example the CPU) and right afterwards with a kernel on another device (for example the GPU), the OpenCL driver will wait for the first kernel to finish, transfer the data, and then run the second kernel.
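The behaviour described here can be written down as a tiny timing model. This is only a sketch under the assumption stated above, namely that the GPU kernel cannot start before the CPU kernel has finished and the buffer has been transferred. The function name and the durations (5, 1 and 3 time units) are invented for illustration.

```python
def finish_time(cpu_time, transfer_time, gpu_time, gpu_enqueued_at):
    """Time at which the GPU kernel finishes, with the CPU kernel starting at 0."""
    cpu_done = cpu_time
    # the GPU kernel waits for both its enqueue time and the transferred buffer
    gpu_start = max(gpu_enqueued_at, cpu_done + transfer_time)
    return gpu_start + gpu_time


# case 1: both kernels enqueued up front
case1 = finish_time(5.0, 1.0, 3.0, gpu_enqueued_at=0.0)
# case 2: GPU kernel enqueued only after the CPU queue has finished
case2 = finish_time(5.0, 1.0, 3.0, gpu_enqueued_at=5.0)
print(case1, case2)
```

Under this model both cases finish at the same time, which matches the measurement in the question: enqueueing earlier does not help when the second kernel depends on the buffer written by the first.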
