OpenCL Sub Buffer Host pointer - opencl

I created a buffer with the flags CL_MEM_READ_WRITE and CL_MEM_ALLOC_HOST_PTR and enqueued it to GPU kernels, which process the input and fill the buffer. During this time the CPU has to wait. I changed the design by partitioning the buffer into three equal sections using sub-buffers. Now, once the GPU has filled one sub-buffer, the CPU can start processing it. This reduces the CPU wait to one sub-buffer rather than one full frame.
The problem I am facing is that the mapped (CPU-side) pointers of the sub-buffers and the buffer are strange. The mapped pointer of the first sub-buffer and that of the buffer are the same, which is fine. But the mapped pointer of the second sub-buffer is not equal to the mapped pointer of the buffer plus the offset of the second sub-buffer. On an integrated GPU (Intel HD Graphics 4000) it works as expected, but on a dedicated graphics card (an NVIDIA card from Zotac) I see this problem. Have you encountered such a scenario before? Can you provide some pointers on where to look to fix it?
typedef struct opencl_buffer {
    cl_mem opencl_mem;
    void *mapped_pointer;
    int size;
} opencl_buffer;

// alloc gpu output buffers
opencl->opencl_mem = clCreateBuffer(
    opencl->context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
    3 * alloc_size, NULL, &status);
if (status != CL_SUCCESS)
    goto fail;

// create output sub buffers
for (sub_idx = 0; sub_idx < 3; ++sub_idx) {
    cl_buffer_region sf_region;
    SubFrameInfo subframe;
    sf_region.origin = alloc_size * sub_idx;
    sf_region.size = alloc_size;
    opencl->gpu_output_sub_buf[sub_idx].size = sf_region.size;
    opencl->gpu_output_sub_buf[sub_idx].opencl_mem =
        clCreateSubBuffer(opencl->opencl_mem,
                          CL_MEM_READ_WRITE,
                          CL_BUFFER_CREATE_TYPE_REGION,
                          &sf_region, &status);
    if (status != CL_SUCCESS)
        goto fail;
}
Now, when I map gpu_output_sub_buf[0].opencl_mem and gpu_output_sub_buf[1].opencl_mem, I expect the difference between the CPU-side pointers to be alloc_size (assuming char pointers). This is the case on Intel HD Graphics, but the NVIDIA platform gives a different result.

There is no specification-based reason a mapped sub-buffer should be at an address that is a known offset from the mapped main buffer (or mapped sub-buffer that aligns with same). Mapping only creates a range of host memory that you can use, and then you unmap to get it back on the device. It doesn't have to even be at the same address each time.
Of course OpenCL 2.0 SVM changes all this, but you didn't say you're using SVM, and NVIDIA doesn't support OpenCL 2.0 today anyway.

Related

Code never runs for arrays larger than 8000 entries with errors

I just started building a code for parallel computation with OpenCL.
As far as I understand, the data generated on the CPU side (host) is transferred through buffers (clCreateBuffer -> clEnqueueWriteBuffer -> clSetKernelArg) and then processed by the device.
I mainly have to deal with arrays (or matrices) of large size with double precision.
However, I realized that the code fails with errors for arrays larger than 8000 entries.
(This makes sense because 64kb is equivalent to 8000 double precision numbers.)
The error codes were either -6 (CL_OUT_OF_HOST_MEMORY) or -31 (CL_INVALID_VALUE).
One more thing: when I set the argument to a 2-dimensional array, I could set the size up to 8000 x 8000.
So far, my guess is that the maximum data size for double precision is 8000 entries (64kb) for 1D arrays, but I have no idea what happens for 2D or 3D arrays.
Is there any other way to transfer the data larger than 64kb?
If I did something wrong for OpenCL setup in data transfer, what would be recommended?
I appreciate your kind answer.
The hardware that I'm using is Tesla V100 which is installed on the HPC cluster in my school.
The following is a part of my code snippet that I'm testing the data transfer.
bfr(0) = clCreateBuffer(context,
& CL_MEM_READ_WRITE + CL_MEM_COPY_HOST_PTR,
& sizeof(a), c_loc(a), err);
if(err.ne.0) stop "Couldn't create a buffer";
err=clEnqueueWriteBuffer(queue,bfr(0),CL_TRUE,0_8,
& sizeof(a),c_loc(a),0,C_NULL_PTR,C_NULL_PTR)
err = clSetKernelArg(kernel, 0,
& sizeof(bfr(0)), C_LOC(bfr(0)))
print*, err
if(err.ne.0)then
print *, "clSetKernelArg kernel"
print*, err
stop
endif
The code is written in Fortran and uses the clfortran module.
Thank you again for your answer.
You can use much larger arrays in OpenCL, as large as the available memory allows. For example, I commonly work with linearized 4D arrays of 2 billion floats in a CFD application.
Arrays need to be 1D only; if you have 2D or 3D arrays, linearize them, for example with n=x+y*size_x for 2D->1D coordinates. Some older devices only allow arrays 1/4 the size of the device memory. However modern devices typically have an extension to the OpenCL specification to enable larger buffers.
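The linearization mentioned above can be sketched in a few lines of C. The function names are hypothetical, and the 3D form is an assumed extension of the 2D formula with x varying fastest; 64-bit indices are used so that large arrays do not overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Map 2D coordinates to a 1D offset with x varying fastest:
   n = x + y * size_x (the formula from the answer). */
uint64_t linear_index_2d(uint64_t x, uint64_t y, uint64_t size_x)
{
    assert(x < size_x); /* x must stay inside its row */
    return x + y * size_x;
}

/* Assumed 3D extension: n = x + (y + z * size_y) * size_x. */
uint64_t linear_index_3d(uint64_t x, uint64_t y, uint64_t z,
                         uint64_t size_x, uint64_t size_y)
{
    assert(x < size_x && y < size_y);
    return x + (y + z * size_y) * size_x;
}
```

With size_x = 10, the element at (x = 3, y = 2) lands at offset 23; the kernel would use the same formula to recover its position from get_global_id.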
Here is a quick overview of what the OpenCL C bindings do:
clCreateBuffer allocates memory on the device side (video memory for GPUs, RAM for CPUs). Buffers can be as large as host/device memory allows or on some older devices 1/4 of device memory.
clEnqueueWriteBuffer copies memory over PCIe from RAM to video memory. Both on CPU and GPU side buffers must be allocated beforehand. There is no limit on transfer size; it can be as large as the entire buffer or only a subrange of a buffer.
clSetKernelArg links the GPU buffers to the Input parameters of the kernel, so it knows which kernel parameter corresponds to which buffer. Make sure data types of the buffers and kernel arguments match as you won't get an error if they don't. Also make sure the order of kernel arguments matches.
In your case there are several possible causes for the error:
1. Maybe you have an integer overflow when computing the array size. In this case, use 64-bit integers to compute the array size and indices.
2. You are out of memory because other buffers already take up too much memory. Do some bookkeeping to keep track of total (video) memory utilization.
3. You have selected the wrong device, for example integrated graphics instead of the dedicated GPU, in which case much less memory is available and you end up with cause 2.
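To make the first cause concrete, here is a minimal C sketch of how a byte count wraps when it is computed in 32 bits. The function names are hypothetical; the 2 billion float figure is the one mentioned earlier in this answer.

```c
#include <stdint.h>

/* Byte count as 32-bit arithmetic would produce it:
   only the low 32 bits of the product survive. */
uint32_t buffer_bytes_32(uint32_t count, uint32_t elem_size)
{
    return (uint32_t)((uint64_t)count * elem_size);
}

/* Byte count computed in 64 bits: no wrap-around for realistic sizes. */
uint64_t buffer_bytes_64(uint64_t count, uint64_t elem_size)
{
    return count * elem_size;
}
```

For 2,000,000,000 floats of 4 bytes each, the 64-bit version gives 8,000,000,000 bytes, while the 32-bit version wraps to 3,705,032,704, so a buffer sized from it would be silently too small.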
To give you a more definitive answer, please provide some additional details:
What hardware do you use?
Show a code snippet of how you allocate device memory.
UPDATE
I see some errors in your code:
The length argument in clCreateBuffer and clEnqueueWriteBuffer requires the number of bytes that your array a has. If a is of type double, then this is a_length*sizeof(double), where a_length is the number of elements in the array a. sizeof(double) returns the number of bytes for one double number which is 8. So the length argument is 8 bytes times the number of elements in the array.
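For multiple flags, you typically use bitwise or | instead of +. It shouldn't be an issue here, but it is unconventional.
The buffer offset is written as "0_8". In Fortran that is the literal 0 of integer kind 8, so it already evaluates to zero; in C you would simply pass 0.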
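// C host-code version of the calls (needs <stdio.h>, <stdlib.h> and <CL/cl.h>;
// context, queue and kernel are assumed to exist already)
const size_t a_length = 8000;
double *a = malloc(a_length * sizeof(double));
cl_int err;
cl_mem bfr = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, a_length * sizeof(double), a, &err);
if (err != CL_SUCCESS) { fprintf(stderr, "Couldn't create a buffer\n"); exit(EXIT_FAILURE); }
// blocking write of a_length doubles, starting at byte offset 0
err = clEnqueueWriteBuffer(queue, bfr, CL_TRUE, 0, a_length * sizeof(double), a, 0, NULL, NULL);
// the argument value is the cl_mem handle itself, so its size is sizeof(cl_mem)
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &bfr);
if (err != CL_SUCCESS) {
    fprintf(stderr, "clSetKernelArg kernel: %d\n", err);
    exit(EXIT_FAILURE);
}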

OpenCL - dynamic shared memory allocation

I am trying to translate some existing CUDA kernels to OpenCL and the problem is that I am bound to use OpenCL 1.2, so it is not possible to use non-uniform work-group size, meaning that I should let enqueueNDRangeKernel decide local work-group size(to avoid non-divisible workgroup size w.r.t. global work size).
As it is mentioned in this presentation, I use __local int * which is an argument of kernel function as shared memory pointer with the size that is defined in the host code using the <Kernel>.setArg.
In some of these CUDA kernels, I allocate dynamic shared memory whose size depends on the thread-block (local work-group) size. When I try to translate these kernels to OpenCL, I don't know how to get the local work-group size chosen by enqueueNDRangeKernel when I pass NULL for the local argument to let it decide the size automatically.
To make it more clear, all I want is to translate this piece of CUDA code to OpenCL:
dim3 block_size, grid_size;
unsigned int smem_size;
block_size.x = 10;
block_size.y = 10;
block_size.z = 2;
// smem_size is dependent on ThreadBlock size.
smem_size = (block_size.x * block_size.y * block_size.z) * 5;
myCudaKernel<<< grid_size, block_size, smem_size >>>(...);
*By the way, I need a general solution that is usable for 3-D workgroups as well.
*I have an Nvidia graphics card.
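No answer is recorded for this question here. One common OpenCL 1.2 workaround, offered as an assumption rather than as the asker's eventual solution, is to choose the local size on the host, pad the global size up to a multiple of it, and size the __local argument from the chosen local size. The kernel then needs a bounds check against the real problem size, and inside the kernel get_local_size() reports the work-group size actually in use. The helper names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest multiple of `local` that is >= `global`, so the padded
   global size is divisible by the work-group size in that dimension. */
size_t round_up_global(size_t global, size_t local)
{
    assert(local > 0);
    return ((global + local - 1) / local) * local;
}

/* Bytes of __local memory for a 3-D work-group, mirroring
   smem_size = (x * y * z) * 5 from the CUDA snippet. */
size_t local_mem_bytes(size_t lx, size_t ly, size_t lz, size_t bytes_per_item)
{
    return lx * ly * lz * bytes_per_item;
}
```

For the 10 x 10 x 2 block in the question this gives 1000 bytes, which would be passed as the size of the __local kernel argument (with a NULL value) before enqueueing with the same explicit local size.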

OpenCL - The difference between Buffer and global memory

In OpenCL, buffers are the conduit through which data is communicated from the host application.
cl_mem clCreateBuffer (cl_context context, cl_mem_flags flags, size_t size,
void *host_ptr, cl_int *errcode_ret);
Now if I have a buffer a_buffer flagged as READ_ONLY, and the kernel is:
__kernel void two_buffer_double(__global float* a)
{
int i = get_global_id(0);
float b = a[i] * 2;
}
My question is: is a_buffer global memory or constant memory? Should I use the __constant qualifier for a? What is the connection between cl_mem_flags (READ_ONLY and READ_WRITE) and the memory qualifiers (__global and __constant)?
The __constant qualifier is used for constant memory. Some cards serve it from the texture cache, so it gets bandwidth independent of __global memory, but it is very limited in size.
__global __read_only * float means the OpenCL implementation will try to put the data in cache (or use some other data path) if the hardware sees fit, but because it is __global it is limited only by VRAM size, or a fraction of it, rather than by just 64kB (for example) as __constant is.
These qualifiers are for device-side optimization.
For host-side optimization, you should supply CL_MEM_READ_ONLY as a flag at buffer creation. This means the device will only read from the buffer (probably using some DMA/PCIe access or caching optimizations), but the host (your C# or C++ code, not the device) can still write to it using enqueue-write or map/unmap operations.
__constant is for parametric constant definitions, not for the data being processed.
If you are writing filter code, the data could be __global and the filter mask could be __constant if the mask cannot fit in __private memory (which has the highest bandwidth) or __local memory (slower than private), so that accessing mask bytes does not reduce data bandwidth.
Now answering your question:
" is a_buffer a global memory or constant memory? "
It is global on the device (kernel) side because you declared it as __global, but on the host side it could reside anywhere in hardware.
Edit: on the host side it depends on which other flags are used. For example, CL_MEM_USE_HOST_PTR makes the buffer directly accessible from system RAM, with only a virtual buffer on the device side. Without it, with just CL_MEM_READ_WRITE, device memory holds the real buffer and there is a mapped shadow in RAM (used as a sub-step by clEnqueueReadBuffer or clEnqueueWriteBuffer); copies go through this shadow first and are then uploaded to the GPU.
An example device: Intel(R) HD (TM) GRAPHICS 400 in a 4GB DDR3L laptop:
Query value
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE 65536 bytes
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE 262144 bytes
CL_DEVICE_GLOBAL_MEM_SIZE 1636414260 bytes
CL_DEVICE_GLOBAL_MEM_CACHE_TYPE CL_READ_WRITE_CACHE
CL_DEVICE_LOCAL_MEM_SIZE 65536(vs constant, benchmark it)
CL_DEVICE_LOCAL_MEM_TYPE CL_LOCAL(so is faster than global)
You cannot query the private memory size, but for a mid-range AMD gaming card it is 256kB per thread group. If you set 64 threads per group, each thread can use 4kB of register space, or half of that (because of compiler optimizations), before it slows down due to spilling to global memory.
Note that in some OpenCL platforms (e.g. AMD) access qualifiers can only be applied to images, not buffers.
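The size limits quoted above lend themselves to a quick host-side check. The following C sketch uses hypothetical helper names and takes the limits as parameters, since real values come from clGetDeviceInfo and differ per device.

```c
#include <stddef.h>

/* Returns 1 if n_elems elements of elem_size bytes fit within the
   device's CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, else 0. */
int fits_in_constant(size_t n_elems, size_t elem_size, size_t max_constant_bytes)
{
    return elem_size != 0 && n_elems <= max_constant_bytes / elem_size;
}

/* Private (register) bytes available per work-item when a group of
   group_size work-items shares private_bytes_per_group. */
size_t private_bytes_per_item(size_t private_bytes_per_group, size_t group_size)
{
    return group_size ? private_bytes_per_group / group_size : 0;
}
```

With the 65536-byte constant limit listed above, a filter mask of up to 16384 floats would qualify for __constant, and 256kB shared by 64 work-items works out to 4096 bytes each, matching the 4kB figure in the answer.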

dynamic allocation in shared memory in opencl on Nvidia

I'm following the example here to create a variable-length local memory array.
The kernel signature is something like this:
__kernel void foo(__global float4* ex_buffer,
int ex_int,
__local void *local_var)
Then I call clSetKernelArg for the local memory kernel argument as follows:
clSetKernelArg(*kern, 2, sizeof(char) * MaxSharedMem, NULL)
Where MaxSharedMem is set from querying CL_DEVICE_LOCAL_MEM_SIZE.
Then inside the kernel I split up the allocated local memory into several arrays and other data structures and use them as I see fit. All of this works fine with AMD (gpu and cpu) and Intel devices. However, on Nvidia, I get the error CL_INVALID_COMMAND_QUEUE when I enqueue this kernel and then run clFinish on the queue.
This is a simple kernel that generates the mentioned error (local work size is 32):
__kernel
void s_Kernel(const unsigned int N, __local void *shared_mem_block )
{
const ushort thread_id = get_local_id(0);
__local double *foo = shared_mem_block;
__local ushort *bar = (__local ushort *) &(foo[1000]);
foo[thread_id] = 0.;
bar[thread_id] = 0;
}
The kernel runs fine if I allocate the same arrays and data structures in local memory statically. Could somebody provide an explanation for this behavior, and/or workarounds?
For those interested, I finally received an explanation from Nvidia. When the chunk of shared memory is passed in via a void pointer, the actual alignment does not match the expected alignment for a pointer to double (8-byte aligned). The GPU device throws an exception due to the misalignment.
As one of the comments pointed out, a way to circumvent the problem is to have the kernel parameter be a pointer to something that the compiler would properly align to at least 8 bytes (double, ulong, etc).
Ideally, the compiler would take responsibility for any alignment issues specific to the device, but because there is an implicit pointer cast in the little kernel featured in my question, I think it gets confused.
Once the memory is 8-byte aligned, a cast to a pointer type that assumes a shorter alignment (e.g. ushort) works without issues. So, if you're chaining the memory allocation like I'm doing, and the pointers are to different types, make sure to have the pointer to the largest type in the kernel signature.
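The alignment rule described above can be illustrated with a small offset calculation in C. This is a sketch with a hypothetical helper name, assuming the OpenCL C sizes of 8 bytes for double and 2 bytes for ushort; it computes where each chained array may start inside one local memory block.

```c
#include <assert.h>
#include <stddef.h>

/* Round `offset` up to the next multiple of `alignment`.
   `alignment` must be a nonzero power of two. */
size_t align_up(size_t offset, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

Placing the 1000 doubles first, as in the kernel from the question, the ushort array can start at align_up(1000 * 8, 2) = 8000 with no padding. In the opposite order, 3 ushorts occupy 6 bytes and a following double array would have to start at align_up(6, 8) = 8, which is why putting the largest type first keeps the layout simple.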

OpenCL and Tesla M1060

I'm using the Tesla m1060 for GPGPU computation. It has the following specs:
# of Tesla GPUs 1
# of Streaming Processor Cores (XXX per processor) 240
Memory Interface (512-bit per GPU) 512-bit
When I use OpenCL, I can display the following board information:
available platform OpenCL 1.1 CUDA 6.5.14
device Tesla M1060 type:CL_DEVICE_TYPE_GPU
max compute units:30
max work item dimensions:3
max work item sizes (dim:0):512
max work item sizes (dim:1):512
max work item sizes (dim:2):64
global mem size(bytes):4294770688 local mem size:16383
How can I relate the GPU card information to the OpenCL memory information?
For example:
What does "Memory Interface" mean? Is it linked to a work-item?
How can I relate the "240 cores" of the GPU to work-groups/work-items?
How can I map work-groups to it (how many work-groups should I use)?
Thanks
EDIT:
After the following answers, there is a thing that is still unclear to me:
The CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE value is 32 for the kernel I use.
However, my device has a CL_DEVICE_MAX_COMPUTE_UNITS value of 30.
In the OpenCL 1.1 Api, it is written (p. 15):
Compute Unit: An OpenCL device has one or more compute units. A work-group executes on a single compute unit
It seems that either something is incoherent here, or that I didn't fully understand the difference between Work-Groups and Compute Units.
As previously stated, when I set the number of Work Groups to 32, the program fails with the following error:
Entry function uses too much shared data (0x4020 bytes, 0x4000 max).
The value 16 works.
Addendum
Here is my Kernel signature:
// enable double precision (not enabled by default)
#ifdef cl_khr_fp64
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#else
#error "IEEE-754 double precision not supported by OpenCL implementation."
#endif
#define BLOCK_SIZE 16 // --> this is what defines the WG size to me
__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))
void mmult(__global double * A, __global double * B, __global double * C, const unsigned int q)
{
__local double A_sub[BLOCK_SIZE][BLOCK_SIZE];
__local double B_sub[BLOCK_SIZE][BLOCK_SIZE];
// stuff that does matrix multiplication with __local
}
In the host code part:
#define BLOCK_SIZE 16
...
const size_t local_work_size[2] = {BLOCK_SIZE, BLOCK_SIZE};
...
status = clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_work_size, local_work_size, 0, NULL, NULL);
The memory interface doesn't mean anything to an OpenCL application. It is the number of bits the memory controller has for reading from and writing to memory (the GDDR5 part in modern GPUs). The formula for maximum global memory speed is approximately pipelineWidth * memoryClockSpeed, but since OpenCL is meant to be cross-platform, you won't really need this value unless you are trying to establish an upper bound for memory performance. Knowing about the 512-bit interface is somewhat useful when you're dealing with memory coalescing. wiki: Coalescing (computer science)
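As a rough illustration of the pipelineWidth * memoryClockSpeed estimate, here is a small C sketch. The function name is hypothetical, and the clock term is expressed as effective transfers per second, which is an assumption about how the formula is meant to be applied.

```c
#include <stdint.h>

/* Upper bound on memory bandwidth in bytes per second:
   (bus width in bits / 8) bytes per transfer, times transfers per second. */
uint64_t peak_bandwidth_bytes_per_s(uint64_t bus_width_bits,
                                    uint64_t transfers_per_s)
{
    return (bus_width_bits / 8) * transfers_per_s;
}
```

A 512-bit interface moves 64 bytes per transfer, so at a made-up rate of 1.6 billion transfers per second (an illustrative figure, not one taken from the thread or a datasheet) the bound would be 102.4 GB/s. Measured throughput will be lower.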
The max work item sizes have to do with 1) how the hardware schedules computations, and 2) the amount of low-level memory on the device -- eg. private memory and local memory.
The 240 figure doesn't matter much to OpenCL either. You can determine that each of the 30 compute units is made up of 8 streaming processor cores on this GPU architecture (because 240/30 = 8). If you query for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, it will very likely be a multiple of 8 for this device. See: clGetKernelWorkGroupInfo
I have answered similar questions about work-group sizing. See here, and here
Ultimately, you need to tune your application and kernels based on your own benchmarking results. I find it worth the time to write many tests with various work-group sizes and eventually hard-code the optimal size.
Adding another answer to address your local memory issue.
Entry function uses too much shared data (0x4020 bytes, 0x4000 max)
Since you are allocating A_sub and B_sub, each of 32*32*sizeof(double) bytes, you run out of local memory. The device should allow you to allocate 16kB, or 0x4000 bytes, of local memory without an issue.
0x4020 is 32 bytes, or 4 doubles, more than your device allows. There are only two things I can think of that may cause the error: 1) there could be a bug in your device or drivers preventing you from allocating the full 16kB, or 2) you are allocating local memory somewhere else in your kernel.
You will have to use a BLOCK_SIZE value less than 32 to work around this for now.
There's good news though. If you only want the work-group size to be a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, BLOCK_SIZE=16 already does this for you (16*16 = 256 = 32*8). To take better advantage of local memory, try BLOCK_SIZE=24 (576 = 32*18), provided 576 work-items does not exceed the device's CL_DEVICE_MAX_WORK_GROUP_SIZE (not shown in the output above, where the largest per-dimension limit is 512).
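The arithmetic behind these numbers can be checked with a short C sketch. The helper names are hypothetical, and the factor 8 assumes sizeof(double) as defined by OpenCL C.

```c
#include <stddef.h>

/* Local memory needed by the two BLOCK_SIZE x BLOCK_SIZE double
   tiles (A_sub and B_sub) declared in the kernel. */
size_t tile_local_bytes(size_t block_size)
{
    return 2 * block_size * block_size * 8; /* 8 bytes per double */
}

/* Returns 1 if a square work-group of block_size x block_size
   work-items is a multiple of the preferred size, else 0. */
int is_preferred_multiple(size_t block_size, size_t preferred_multiple)
{
    return preferred_multiple != 0 &&
           (block_size * block_size) % preferred_multiple == 0;
}
```

BLOCK_SIZE=32 needs exactly 16384 bytes (0x4000), which leaves no room for the extra 32 bytes in the error message; BLOCK_SIZE=16 needs 4096 bytes and BLOCK_SIZE=24 needs 9216 bytes, and both work-group sizes are multiples of 32.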
