I'm following the example here to create a variable-length local memory array.
The kernel signature is something like this:
__kernel void foo(__global float4* ex_buffer,
int ex_int,
__local void *local_var)
Then I call clSetKernelArg for the local memory kernel argument as follows:
clSetKernelArg(*kern, 2, sizeof(char) * MaxSharedMem, NULL)
Where MaxSharedMem is set from querying CL_DEVICE_LOCAL_MEM_SIZE.
Then inside the kernel I split up the allocated local memory into several arrays and other data structures and use them as I see fit. All of this works fine with AMD (GPU and CPU) and Intel devices. However, on Nvidia, I get the error CL_INVALID_COMMAND_QUEUE when I enqueue this kernel and then call clFinish on the queue.
This is a simple kernel that generates the mentioned error (local work size is 32):
__kernel
void s_Kernel(const unsigned int N, __local void *shared_mem_block )
{
const ushort thread_id = get_local_id(0);
__local double *foo = shared_mem_block;
__local ushort *bar = (__local ushort *) &(foo[1000]);
foo[thread_id] = 0.;
bar[thread_id] = 0;
}
The kernel runs fine if I allocate the same arrays and data structures in local memory statically. Could somebody provide an explanation for this behavior, and/or workarounds?
For those interested, I finally received an explanation from Nvidia. When the chunk of shared memory is passed in via a void pointer, the actual alignment does not match the expected alignment for a pointer to double (8-byte aligned). The GPU device throws an exception due to the misalignment.
As one of the comments pointed out, a way to circumvent the problem is to have the kernel parameter be a pointer to something that the compiler would properly align to at least 8 bytes (double, ulong, etc).
Ideally, the compiler would take responsibility for any alignment issues specific to the device, but because there is an implicit pointer cast in the little kernel featured in my question, I think it gets confused.
Once the memory is 8-byte aligned, a cast to a pointer type that assumes a shorter alignment (e.g. ushort) works without issues. So, if you're chaining the memory allocation like I'm doing, and the pointers are to different types, make sure to have the pointer to the largest type in the kernel signature.
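To make the chaining concrete, here is a minimal host-side sketch in plain C that plans byte offsets for sub-arrays carved out of one local-memory block. The names align_up, local_layout and plan_local_layout are hypothetical, and the element sizes assume OpenCL's 8-byte double and 2-byte ushort. It also assumes the kernel parameter is declared as __local double * so that the block base is 8-byte aligned.

```c
#include <stddef.h>

#define DEVICE_DOUBLE_BYTES 8  /* size of double in OpenCL C */
#define DEVICE_USHORT_BYTES 2  /* size of ushort in OpenCL C */

/* Round `offset` up to the next multiple of `alignment` (a power of two). */
size_t align_up(size_t offset, size_t alignment)
{
    return (offset + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical layout: a double array followed by a ushort array,
 * both placed inside a single local-memory block. */
typedef struct {
    size_t foo_offset;   /* byte offset of the double array */
    size_t bar_offset;   /* byte offset of the ushort array */
    size_t total_bytes;  /* size to request through clSetKernelArg */
} local_layout;

local_layout plan_local_layout(size_t n_double, size_t n_ushort)
{
    local_layout l;
    /* Largest-alignment type first; the block base is assumed 8-byte aligned. */
    l.foo_offset = 0;
    /* The ushort array only needs 2-byte alignment. */
    l.bar_offset = align_up(l.foo_offset + n_double * DEVICE_DOUBLE_BYTES,
                            DEVICE_USHORT_BYTES);
    /* Round the total up so the request is a whole number of doubles. */
    l.total_bytes = align_up(l.bar_offset + n_ushort * DEVICE_USHORT_BYTES,
                             DEVICE_DOUBLE_BYTES);
    return l;
}
```

For the kernel in the question, plan_local_layout(1000, 32) places bar at byte 8000, which matches &(foo[1000]), and the returned total_bytes could be used as the size argument of clSetKernelArg instead of the full CL_DEVICE_LOCAL_MEM_SIZE.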
I have some OpenCL kernels which interface with Python through the pyopencl library. The kernels are used to speed up commutative operations (like addition or multiplication), where the order of the input operands does not matter. A kernel doing addition might look like this:
__kernel void addition(__global const float *a_g,
__global const float *b_g,
__global float *res_g)
{
int gid = get_global_id(0);
res_g[gid] = a_g[gid] + b_g[gid];
}
For simplicity, assume all the operating buffers (a_g, b_g, res_g) are of the same size and 1 dimensional. The global work size is set to the size of the buffers before launching the kernel and the result is stored in the res_g buffer.
These operations run sequentially: the output of one kernel is used as the input to the next. Given that all these kernels look like the code snippet above, I could simply "chain" the addition of 4 inputs by writing the following kernel:
__kernel void addition_chained(__global const float *a_g,
__global const float *b_g,
__global const float *c_g,
__global const float *d_g,
__global float *res_g)
{
int gid = get_global_id(0);
res_g[gid] = a_g[gid] + b_g[gid] + c_g[gid] + d_g[gid];
}
With this, no intermediate result buffers need to be allocated and there is no overhead from launching additional kernels.
Is this a common optimization? What are the pros and cons of doing this?
Is there any canonical way to chain kernels in OpenCL? The amount of operations that need chaining might not be known at compile-time.
Reducing kernel calls by combining the work of multiple kernels into one means fewer memory allocations and fewer transfers to and from global memory, which can significantly reduce execution time.
If the number of additions is constant throughout your program, you can take advantage of the fact that OpenCL code is compiled from a string at runtime. That means you can modify the string containing the OpenCL code at runtime, and then compile and run it. This way, you can add a variable number of kernel arguments and summation terms via string concatenation.
If, however, the number of summation terms changes many times within one execution of your program and is unpredictable, the two-argument kernel is the way to go. Otherwise you would have to recompile the OpenCL code many times, which has significant overhead.
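As a rough sketch of the string-concatenation idea, the C function below writes the source of an n-input addition kernel into a caller-supplied buffer. The function name build_addition_kernel and the argument names in0, in1, ... are hypothetical; only the kernel shape follows the addition_chained example from the question.

```c
#include <stdio.h>
#include <stddef.h>

/* Write the source of an n-input addition kernel into `out`.
 * Returns the number of characters written, or -1 if n < 1 or
 * the buffer of `cap` bytes is too small. */
int build_addition_kernel(char *out, size_t cap, int n)
{
    size_t used = 0;
    int w;

    if (n < 1)
        return -1;

/* Append formatted text and bail out on truncation. */
#define APPEND(...)                                            \
    do {                                                       \
        w = snprintf(out + used, cap - used, __VA_ARGS__);     \
        if (w < 0 || (size_t)w >= cap - used)                  \
            return -1;                                         \
        used += (size_t)w;                                     \
    } while (0)

    APPEND("__kernel void addition_chained(");
    for (int i = 0; i < n; i++)
        APPEND("__global const float *in%d, ", i);
    APPEND("__global float *res_g)\n{\n"
           "    int gid = get_global_id(0);\n"
           "    res_g[gid] = ");
    for (int i = 0; i < n; i++)
        APPEND("%sin%d[gid]", i ? " + " : "", i);
    APPEND(";\n}\n");

#undef APPEND
    return (int)used;
}
```

The resulting string would then be handed to clCreateProgramWithSource (or to pyopencl's Program constructor) and built once per distinct value of n, so caching the built program per n keeps the recompilation cost down.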
Here is some sample code:
__kernel void my_kernel(__global float* src,
__global float* dst){
float4 a = vload4(0,src);
//do something to a
...
vstore4(a,0,dst);
}
According to the OpenCL 1.2 reference, the addresses of the global buffers src and dst must be 4-byte aligned when using vloadn and vstoren; otherwise the results are undefined. My question is whether OpenCL automatically aligns the global device address when clCreateBuffer returns. If not, how do I ensure proper alignment? (And what about local memory objects?)
See the data types section of the OpenCL specification. The OpenCL compiler is responsible for aligning data items to the alignment required by their data type. So I think the answer is basically yes.
Buffers are surely aligned to a boundary larger than 4 bytes, unless you are using CL_MEM_USE_HOST_PTR.
By the way: in your code it could be better to declare the parameters as float4* instead of using vload4 and vstore4.
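For the CL_MEM_USE_HOST_PTR case, where alignment depends on the pointer you supply, a small host-side check can catch problems early. The helper below is a hypothetical sketch in plain C, not part of the OpenCL API. It assumes the alignment is a power of two, such as 4 bytes for vload4 on float data or 16 bytes for a float4 pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 when `p` is a multiple of `alignment` bytes, 0 otherwise.
 * `alignment` must be a power of two. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (uintptr_t)(alignment - 1)) == 0;
}
```

A host program could call is_aligned(host_ptr, 16) before creating a buffer with CL_MEM_USE_HOST_PTR that a kernel will access through float4*. The device's own base-address requirement can be queried with CL_DEVICE_MEM_BASE_ADDR_ALIGN, which is reported in bits.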
I am using an OpenCL kernel solely to copy one array to another (as part of a project), using a custom memcpy function:
void myMemCpy(__global void *dest,__global void *src, size_t n) {
__global char *csrc = (__global char *)src;
__global char *cdest = (__global char *)dest;
for (size_t i=0; i<n; i++)
cdest[i] = csrc[i];
}
I am using the OpenCL SVM feature with OpenCL version 2.1.
Is there any way to optimize the copying routine, or any other way to copy inside the kernel?
I think putting a clEnqueueCopyBuffer into the command queue on the host side would be the best option.
In your OpenCL function, each work-item does the whole copy, which doesn't make sense unless your ND-Range only has a single workitem. It's similar to starting the same memcpy() in multiple threads in a non-OpenCL program.
You would have to use at least get_global_id() inside your kernel to split the work among the work-items in your ND-Range. Depending on your ND-Range and your actual device, different memory access patterns might be better suited to the hardware. The hardware vendor's OpenCL optimisation guide is a good starting point here.
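The split can be sketched as follows. This is a host-side simulation in plain C, not kernel code: copy_work_item and simulate_copy are hypothetical names, and the gid and gsize parameters stand in for get_global_id(0) and get_global_size(0), which a real kernel would call directly on __global pointers.

```c
#include <stddef.h>

/* Body of one work-item: copy every `gsize`-th byte, starting at `gid`.
 * Adjacent work-items therefore touch adjacent addresses. */
void copy_work_item(char *dest, const char *src, size_t n,
                    size_t gid, size_t gsize)
{
    for (size_t i = gid; i < n; i += gsize)
        dest[i] = src[i];
}

/* Stand-in for an ND-range launch: runs the work-items one after another. */
void simulate_copy(char *dest, const char *src, size_t n, size_t gsize)
{
    for (size_t gid = 0; gid < gsize; gid++)
        copy_work_item(dest, src, n, gid, gsize);
}
```

Each byte is copied by exactly one work-item, so the total work no longer grows with the number of work-items. Whether this strided pattern or a blocked one is faster depends on the device, which is where the vendor's guide comes in.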
In OpenCL, buffers are the conduit through which data is communicated from the host application.
cl_mem clCreateBuffer (cl_context context, cl_mem_flags flags, size_t size,
void *host_ptr, cl_int *errcode_ret);
Now if I have a buffer a_buffer flagged as READ_ONLY, and the kernel is:
__kernel void two_buffer_double(__global float* a)
{
int i = get_global_id(0);
float b = a[i] * 2;
}
My question is: is a_buffer in global memory or constant memory? Should I use the __constant qualifier for a? What is the connection between cl_mem_flags (READ_ONLY and READ_WRITE) and the memory qualifiers (global and constant)?
The __constant qualifier is used for constant memory. Some cards serve it from the texture cache, so it gets bandwidth independent of __global, but it is very limited in size.
Declaring the argument as __global with __read_only means the OpenCL implementation will try to put the data in a cache (or use some other data path) if the hardware sees fit. Because it is __global, it is limited only by the VRAM size (or a fraction of it) rather than by, for example, the 64 kB available for __constant.
These qualifiers are for device-side optimization.
For host-side optimization, you should supply CL_MEM_READ_ONLY as a flag at buffer creation. This means the device will only read from the buffer (probably using some DMA, PCIe access or caching optimizations), but the buffer can still be written from the host side (the host being your C# or C++ code, not the device) using clEnqueueWriteBuffer or map/unmap operations.
__constant is for parametric constant definitions, not for the data to be processed.
If you are writing filter code, the data could be __global and the filter mask could be __constant if the mask cannot fit in __private memory (which has the highest bandwidth) or __local memory (slower than private), so that accessing the mask bytes does not reduce the data bandwidth.
Now answering your question:
" is a_buffer a global memory or constant memory? "
It is global on the device side (kernel side) because you declared it as __global, but on the host side it could be anywhere in hardware.
Edit: on the host side, it depends on which other flags are used. For example, USE_HOST_PTR makes the buffer directly accessible from system RAM, with only a virtual buffer on the device side. Without it, and with just CL_MEM_READ_WRITE, device memory holds a real buffer with a mapped shadow in RAM (as a sub-step of clEnqueueReadBuffer or clEnqueueWriteBuffer), and a copy goes through this shadow first before being uploaded to the GPU.
An example device: Intel(R) HD (TM) GRAPHICS 400 in a 4GB DDR3L laptop:
Query                                 Value
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE    65536 bytes
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE       262144 bytes
CL_DEVICE_GLOBAL_MEM_SIZE             1636414260 bytes
CL_DEVICE_GLOBAL_MEM_CACHE_TYPE       CL_READ_WRITE_CACHE
CL_DEVICE_LOCAL_MEM_SIZE              65536 (vs constant, benchmark it)
CL_DEVICE_LOCAL_MEM_TYPE              CL_LOCAL (so is faster than global)
You cannot query the private memory size, but for a mid-range AMD gaming card it is 256 kB per thread group. If you set 64 threads per group, each thread can use 4 kB of register space, or half of that (because of compiler optimizations), before slowing down because of spilling to global memory.
Note that in some OpenCL platforms (e.g. AMD) access qualifiers can only be applied to images, not buffers.
I've been playing with OpenCL recently, and I'm able to write simple kernels that use only global memory. Now I'd like to start using local memory, but I can't seem to figure out how to use get_local_size() and get_local_id() to compute one "chunk" of output at a time.
For example, let's say I wanted to convert Apple's OpenCL Hello World example kernel to something that uses local memory. How would you do it? Here's the original kernel source:
__kernel void square(
__global float *input,
__global float *output,
const unsigned int count)
{
int i = get_global_id(0);
if (i < count)
output[i] = input[i] * input[i];
}
If this example can't easily be converted into something that shows how to make use of local memory, any other simple example will do.
Check out the samples in the NVIDIA or AMD SDKs; they should point you in the right direction. Matrix transpose, for example, uses local memory.
Using your squaring kernel, you could stage the data in an intermediate buffer. Remember to pass in the additional parameter.
__kernel void square(
__global float *input,
__global float *output,
__local float *temp,
const unsigned int count)
{
int gtid = get_global_id(0);
int ltid = get_local_id(0);
if (gtid < count)
{
temp[ltid] = input[gtid];
// if the threads were reading data from other threads, then we would
// want a barrier here to ensure the write completes before the read
output[gtid] = temp[ltid] * temp[ltid];
}
}
There is another way to do this if the size of the local memory is constant. Instead of using a pointer in the kernel's parameter list, the local buffer can be declared inside the kernel simply by declaring it __local:
__local float localBuffer[1024];
This reduces the host code, because fewer clSetKernelArg calls are needed.
In OpenCL, local memory is meant for sharing data across all work-items in a work-group. It usually requires a barrier call before the local memory data can be used (for example, when one work-item wants to read local memory data written by other work-items). Barriers are costly in hardware. Keep in mind that local memory should be used for data that is read or written repeatedly. Bank conflicts should be avoided as much as possible.
If you are not careful with local memory, you may sometimes end up with worse performance than with global memory.
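To illustrate why the barrier matters, here is a small host-side simulation in plain C of one work-group in which each work-item reads a value staged by its neighbour. It is only a sketch: GROUP_SIZE and simulate_group are hypothetical, the temp array stands in for __local memory, and the boundary between the two loops plays the role that barrier(CLK_LOCAL_MEM_FENCE) would play in a real kernel.

```c
#include <stddef.h>

#define GROUP_SIZE 8  /* hypothetical work-group size */

/* Simulate one work-group: stage inputs in `temp`, then let every
 * work-item combine its own value with its neighbour's. */
void simulate_group(const float *input, float *output)
{
    float temp[GROUP_SIZE];  /* stands in for a __local buffer */

    /* Phase 1: every work-item writes its own slot. */
    for (size_t ltid = 0; ltid < GROUP_SIZE; ltid++)
        temp[ltid] = input[ltid];

    /* Barrier point: no work-item may start phase 2 before all
     * phase 1 writes are complete. */

    /* Phase 2: every work-item reads a slot written by another one. */
    for (size_t ltid = 0; ltid < GROUP_SIZE; ltid++)
        output[ltid] = temp[ltid] * temp[(ltid + 1) % GROUP_SIZE];
}
```

On a real device the work-items run concurrently, so without the barrier a work-item could read temp[(ltid + 1) % GROUP_SIZE] before its neighbour has written it. The squaring kernel above avoids this only because each work-item reads back its own slot.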