I am just wondering what the semantics of the following kernel are:
#define T float
__kernel void foo() {
    __local T bar[32];
    __local T a;
}
Are bar and a shared across a work-group, or will every work-item create its own copy of bar and a?
They are both shared within the work-group, so there will only be one copy of bar and one copy of a per work-group.
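For illustration, here is a minimal sketch (the kernel name, buffer arguments and the 32-element size are assumptions, not from the question) of how work-items in one group can communicate through such a shared __local array: every item writes one slot, then after the barrier reads a slot written by a different item.

__kernel void reverse_in_group(__global const float *in, __global float *out)
{
    __local float bar[32];              // one copy per work-group, shared by all its work-items
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    size_t n   = get_local_size(0);     // assumed to be <= 32 here

    bar[lid] = in[gid];                 // each work-item fills one slot
    barrier(CLK_LOCAL_MEM_FENCE);       // make all writes visible to the group
    out[gid] = bar[n - 1 - lid];        // read a value written by another work-item
}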
I have some OpenCL kernels which interface with Python through the pyopencl library. The kernels are used to speed up commutative operations (like addition or multiplication), where the order of the input operands does not matter. A kernel doing addition might look like this:
__kernel void addition(__global const float *a_g,
                       __global const float *b_g,
                       __global float *res_g)
{
    int gid = get_global_id(0);
    res_g[gid] = a_g[gid] + b_g[gid];
}
For simplicity, assume all the operand buffers (a_g, b_g, res_g) are one-dimensional and of the same size. The global work size is set to the size of the buffers before launching the kernel, and the result is stored in the res_g buffer.
These operations run sequentially: the output of one kernel is used as the input to the next. Given that all these kernels look like the code snippet above, I could simply "chain" the addition of four inputs by writing the following kernel:
__kernel void addition_chained(__global const float *a_g,
                               __global const float *b_g,
                               __global const float *c_g,
                               __global const float *d_g,
                               __global float *res_g)
{
    int gid = get_global_id(0);
    res_g[gid] = a_g[gid] + b_g[gid] + c_g[gid] + d_g[gid];
}
With this, no intermediate result buffers need to be allocated, and there is no overhead from launching new threads.
Is this a common optimization? What are the pros and cons of doing this?
Is there any canonical way to chain kernels in OpenCL? The number of operations that need chaining might not be known at compile time.
Reducing the number of kernel calls by combining the actions of multiple kernels into one means less memory allocation and fewer transfers to and from global memory, which significantly reduces execution time.
If the number of additions is constant throughout your program, you can use to your advantage the fact that OpenCL is compiled from a string at runtime. That means you can modify the string containing the OpenCL code at runtime, and then compile and run it. This way, you can add a variable number of kernel arguments and summation terms via string concatenation.
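As a sketch of that idea (plain C host code with illustrative names; error and overflow handling are simplified, and the buffer src is assumed to be large enough), the source of an n-input addition kernel could be assembled at runtime and then handed to clCreateProgramWithSource as usual:

#include <stdio.h>

/* Build the source string of an addition kernel with n_inputs input buffers. */
void build_addition_source(char *src, size_t cap, int n_inputs)
{
    size_t len = (size_t)snprintf(src, cap, "__kernel void addition(");
    for (int i = 0; i < n_inputs; ++i)                       /* one __global arg per input */
        len += snprintf(src + len, cap - len, "__global const float *in%d, ", i);
    len += snprintf(src + len, cap - len,
                    "__global float *res_g)\n{\n"
                    "    int gid = get_global_id(0);\n"
                    "    res_g[gid] = ");
    for (int i = 0; i < n_inputs; ++i)                       /* one summation term per input */
        len += snprintf(src + len, cap - len, "%sin%d[gid]", i ? " + " : "", i);
    snprintf(src + len, cap - len, ";\n}\n");
}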
If, however, the number of summation terms changes many times within one execution of your program and is unpredictable, the two-argument kernel is the way to go. Otherwise you would have to recompile the OpenCL code many times, which has significant overhead.
I am using an OpenCL kernel solely to copy one array to another (as part of a project), using a custom memcpy function:
void myMemCpy(__global void *dest, __global void *src, size_t n) {
    __global char *csrc = (__global char *)src;
    __global char *cdest = (__global char *)dest;
    for (int i = 0; i < n; i++)
        cdest[i] = csrc[i];
}
I am using the OpenCL SVM feature with OpenCL version 2.1.
Is there any way to optimize the copying routine, or any other way to do the copy inside the kernel?
I think putting a clEnqueueCopyBuffer call into the command queue on the host side would be the best option.
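As a rough sketch of that option (queue, src_buf, dst_buf and nbytes are placeholders for objects you already have on the host):

cl_int err = clEnqueueCopyBuffer(queue,
                                 src_buf,      /* source buffer          */
                                 dst_buf,      /* destination buffer     */
                                 0, 0,         /* source / dest offsets  */
                                 nbytes,       /* number of bytes        */
                                 0, NULL,      /* no wait list           */
                                 NULL);        /* no event returned      */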
In your OpenCL function, each work-item does the whole copy, which doesn't make sense unless your ND-range only has a single work-item. It's similar to starting the same memcpy() in multiple threads in a non-OpenCL program.
You would have to use at least get_global_id() inside your kernel to split the work among the work-items in your ND-range. Depending on your ND-range and your actual device, different memory access patterns might be better suited to the hardware. The hardware vendor's OpenCL optimisation guide is a good starting point here.
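For illustration, one possible split (a sketch, not a tuned version; the grid-stride loop is just one access pattern, and the kernel name is made up) is a kernel in which each work-item copies a strided subset of the bytes:

__kernel void parallel_copy(__global char *dest,
                            __global const char *src,
                            ulong n)   /* size_t is not a valid kernel-argument type */
{
    /* grid-stride loop: work-item i copies elements i, i+G, i+2G, ... */
    for (size_t i = get_global_id(0); i < n; i += get_global_size(0))
        dest[i] = src[i];
}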
In OpenCL, buffers are the conduit through which data is communicated from the host application to the device.
cl_mem clCreateBuffer(cl_context context, cl_mem_flags flags, size_t size,
                      void *host_ptr, cl_int *errcode_ret);
Now if I have a buffer a_buffer flagged as CL_MEM_READ_ONLY, and the kernel is:
__kernel void two_buffer_double(__global float* a)
{
    int i = get_global_id(0);
    float b = a[i] * 2;
}
My question is: is a_buffer in global memory or constant memory? Should I use the __constant qualifier for a? What is the connection between the cl_mem_flags (CL_MEM_READ_ONLY and CL_MEM_READ_WRITE) and the memory qualifiers (__global and __constant)?
The __constant qualifier is used for constant memory; on some cards it goes through the texture cache and gets bandwidth independent of __global, but it is very limited in size.
A read-only __global float pointer means the OpenCL implementation will try to put it in cache (or use some other data path) if the hardware sees fit, but it is __global, so it is limited only by VRAM size (or a fraction of it) instead of just 64 kB (for example) for __constant.
These qualifiers are for device-side optimization.
For host-side optimization, you should supply CL_MEM_READ_ONLY as the flag for buffer creation. This means the device will only read from it (probably with some DMA/PCIe access/caching optimizations), but it can still be written from the host side (i.e. from the host C#/C++ code, not from the device) using enqueue-write or map/unmap operations.
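A minimal host-side sketch of that (context, n and host_data are assumed to already exist; CL_MEM_COPY_HOST_PTR is just one way to get the initial contents in):

cl_int err;
cl_mem a_buffer = clCreateBuffer(context,
                                 CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 n * sizeof(float),
                                 host_data,   /* contents copied at creation time */
                                 &err);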
__constant is for parametric constant definitions, not for the data being processed.
If you are writing filter code, the data could be __global and the filter mask could be __constant, if the mask cannot fit in __private memory (which has the highest bandwidth) or __local memory (slower than private), so that accessing the mask bytes does not reduce the data bandwidth.
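A sketch of that filter pattern (the kernel name, MASK_N and the boundary handling are made up for illustration): the bulk data stays __global while the small, read-only mask is passed as __constant.

#define MASK_N 9

__kernel void filter1d(__global const float *data,
                       __constant float *mask,     /* small, read-only coefficients */
                       __global float *out,
                       const int n)
{
    int i = get_global_id(0);
    if (i <= n - MASK_N) {
        float acc = 0.0f;
        for (int k = 0; k < MASK_N; ++k)           /* mask reads served from constant memory */
            acc += data[i + k] * mask[k];
        out[i] = acc;
    }
}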
Now answering your question:
"Is a_buffer in global memory or constant memory?"
It is global on the device side (kernel side) because you declared it as __global, but on the host side (hardware) it could reside anywhere.
Edit: on the host side it depends on which other flags are used. For example, CL_MEM_USE_HOST_PTR makes the buffer directly accessible from system RAM, and there is only a virtual buffer on the device side; without it, and with just CL_MEM_READ_WRITE, device memory holds a real buffer plus a mapped shadow copy in RAM (as a sub-step of clEnqueueReadBuffer or clEnqueueWriteBuffer), and copies visit this shadow first before being uploaded to the GPU.
An example device: Intel(R) HD Graphics 400 in a laptop with 4 GB of DDR3L:
Query                               Value
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE  65536 bytes
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE     262144 bytes
CL_DEVICE_GLOBAL_MEM_SIZE           1636414260 bytes
CL_DEVICE_GLOBAL_MEM_CACHE_TYPE     CL_READ_WRITE_CACHE
CL_DEVICE_LOCAL_MEM_SIZE            65536 bytes (vs constant; benchmark it)
CL_DEVICE_LOCAL_MEM_TYPE            CL_LOCAL (so it is faster than global)
You cannot query the private memory size, but on a mid-range gaming AMD card it is 256 kB per thread group. If you set 64 threads per group, each thread can use 4 kB of register space (or half of that, because of compiler optimizations) before it slows down due to spilling into global memory.
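The limits in the table above can be queried at runtime with clGetDeviceInfo, for example (assuming device is a valid cl_device_id):

cl_ulong const_size = 0, local_size = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                sizeof(const_size), &const_size, NULL);   /* constant buffer limit in bytes */
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(local_size), &local_size, NULL);   /* local memory per work-group in bytes */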
Note that in some OpenCL platforms (e.g. AMD) access qualifiers can only be applied to images, not buffers.
I am implementing an algorithm on the GPU using OpenCL.
Currently I am launching the kernel with only one work-group containing 128 work-items. The data in global memory is used many times by every work-item. To take advantage of the speed of shared (local) memory, I copied it to local memory using the following code.
__kernel void kernel1(__global float2* input,
                      __global int* bin,
                      __global float2* DFT,
                      __local float2* localInput,
                      const int N){
    size_t itemId = get_local_id(0);
    localInput[itemId] = input[itemId];
    barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
    ........................................................
    /* Remaining algo here. */
    ........................................................
}
The above code works well if there is only one work-group. But if there is more than one work-group (say two work-groups with an equal number of items in each), the above kernel copies only the first half into the first work-group's shared memory and the second half into the latter's.
I also tried the kernel below:
__kernel void kernel1(__global float2* input,
                      __global int* bin,
                      __global float2* DFT,
                      __local float2* localInput,
                      const int N){
    size_t itemId = get_local_id(0);
    if(itemId == 0){
        for(int index = 0; index < N; index++){
            localInput[index] = input[index];
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
    ........................................................
    /* Remaining algo here. */
    ........................................................
}
But the above code has problems such as divergence caused by the conditional statement, which decreases performance.
What further modifications can be made to the code so that the entire array is copied efficiently to the shared memory of each work-group?
Any suggestions are appreciated.
Depending on what device you're running on, there's a good chance you can completely ignore local memory. Desktop GPUs used to have practically no cache whatsoever, which made using local memory very important, but these days they have a decent amount. If you're hitting the same portion of memory on a GPU, it'll all be in cache (generally the same size as shared memory), which is just as fast as local memory (they're the same block of memory, just partitioned). Manually copying it to local memory might additionally impose a minor performance penalty.
If you aren't on a desktop GPU (ARM, etc.) or your requirements make this impractical, async_work_group_copy might be what you are looking for.
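As a sketch of how that could look in the question's kernel (the argument list is shortened here, and every work-item in the group must reach the call with the same arguments):

__kernel void kernel1(__global float2* input,
                      __local float2* localInput,
                      const int N)
{
    event_t e = async_work_group_copy(localInput, input, (size_t)N, 0);
    wait_group_events(1, &e);            /* the whole group waits for the copy to finish */
    /* remaining algorithm using localInput */
}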
On an unrelated note, the above code only needs barrier(CLK_LOCAL_MEM_FENCE), as you presumably aren't modifying your input.
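That is, the copy in the question's first kernel could simply use the lighter barrier (everything else unchanged):

size_t itemId = get_local_id(0);
localInput[itemId] = input[itemId];
barrier(CLK_LOCAL_MEM_FENCE);    /* local fence only; input is never written by the kernel */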
I've been playing with OpenCL recently, and I'm able to write simple kernels that use only global memory. Now I'd like to start using local memory, but I can't seem to figure out how to use get_local_size() and get_local_id() to compute one "chunk" of output at a time.
For example, let's say I wanted to convert Apple's OpenCL Hello World example kernel to something that uses local memory. How would you do it? Here's the original kernel source:
__kernel void square(
    __global float *input,
    __global float *output,
    const unsigned int count)
{
    int i = get_global_id(0);
    if (i < count)
        output[i] = input[i] * input[i];
}
If this example can't easily be converted into something that shows how to make use of local memory, any other simple example will do.
Check out the samples in the NVIDIA or AMD SDKs; they should point you in the right direction. Matrix transpose, for example, would use local memory.
Using your squaring kernel, you could stage the data in an intermediate buffer. Remember to pass in the additional parameter.
__kernel void square(
    __global float *input,
    __global float *output,
    __local float *temp,
    const unsigned int count)
{
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    if (gtid < count)
    {
        temp[ltid] = input[gtid];
        // if the threads were reading data from other threads, then we would
        // want a barrier here to ensure the write completes before the read
        output[gtid] = temp[ltid] * temp[ltid];
    }
}
There is another way to do this if the size of the local memory is constant: instead of passing a pointer in the kernel's parameter list, the local buffer can be declared within the kernel simply by qualifying it with __local:
__local float localBuffer[1024];
This removes code, since fewer clSetKernelArg calls are needed.
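For example, the staging kernel above could be written with a fixed-size buffer (256 is an assumed work-group size here; it must match what the host actually enqueues):

__kernel void square(
    __global float *input,
    __global float *output,
    const unsigned int count)
{
    __local float temp[256];        /* statically sized, no clSetKernelArg needed for it */
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    if (gtid < count)
    {
        temp[ltid] = input[gtid];
        output[gtid] = temp[ltid] * temp[ltid];
    }
}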
In OpenCL, local memory is meant to share data across all work-items in a work-group. It usually requires a barrier call before the local-memory data can be used (for example, when one work-item wants to read local-memory data written by other work-items). Barriers are costly in hardware. Keep in mind that local memory should be used for data that is read and written repeatedly, and bank conflicts should be avoided as much as possible.
If you are not careful with local memory, you may sometimes end up with worse performance than using global memory.