I have implemented a simple OpenCL program that copies 64 MB of data using 16 M threads. The kernel is as follows:
__kernel void myKernel(
    __global float4* a,
    __global float4* c,
    int count
){
    int thread_idx = get_global_id(0);
    if (thread_idx >= count) return;
    c[thread_idx] = a[thread_idx];
}
My computer has two graphics cards (GTX 980 + GTX 970) and one display. I am using Windows 8.1. I created the c[...] buffer from the host in two ways: once with a plain clCreateBuffer, and once by creating a GL buffer with OpenGL and wrapping it in a CL buffer with clCreateFromGLBuffer. My comparisons showed that buffers created with clCreateBuffer are always equally slow, but the GL buffer created on the GPU that is connected to the display is much faster, as shown in the diagrams:
My investigation shows that clEnqueueAcquireGLObjects is what makes everything slower on the GPU that is not connected to the display. Has anyone seen this problem before, and does anyone know how I can fix it?
The complete code can be found here: https://github.com/mmostajab/TestGPUMem_CL
Related
I am using an OpenCL kernel solely to copy one array to another (as part of a project), using a custom memcpy function:
void myMemCpy(__global void *dest, __global const void *src, size_t n) {
    __global const char *csrc = (__global const char *)src;
    __global char *cdest = (__global char *)dest;
    // size_t index avoids a signed/unsigned comparison with n
    for (size_t i = 0; i < n; i++)
        cdest[i] = csrc[i];
}
I am using the OpenCL SVM feature with OpenCL version 2.1.
Is there any way to optimize the copying routine, or any other way to do the copy inside the kernel?
I think putting a clEnqueueCopyBuffer command into the command queue on the host side would be the best option.
In your OpenCL function, each work-item does the whole copy, which doesn't make sense unless your ND-Range only has a single workitem. It's similar to starting the same memcpy() in multiple threads in a non-OpenCL program.
You would have to use at least get_global_id() inside your kernel to split the work among the work-items in your ND-Range. Depending on your ND-Range and your actual device, different memory access patterns might be better suited to the hardware. The hardware vendor's OpenCL optimisation guide is a good starting point here.
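To make the partitioning concrete, here is a minimal sketch in plain C that emulates it on the host. It is not the asker's code: copy_slice, copy_all, gid and gsize are hypothetical names, with gid and gsize standing in for what a kernel would obtain from get_global_id(0) and get_global_size(0). The strided pattern is only one possible access pattern, so whether it suits a given device is something to benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Copy only the bytes that belong to one (emulated) work-item.
 * gid   : stands in for get_global_id(0), assuming a 1-D ND-Range
 * gsize : stands in for get_global_size(0)
 * Work-item gid handles bytes gid, gid + gsize, gid + 2*gsize, ... */
void copy_slice(char *dest, const char *src, size_t n,
                size_t gid, size_t gsize)
{
    assert(gsize > 0 && gid < gsize);
    for (size_t i = gid; i < n; i += gsize)
        dest[i] = src[i];
}

/* Host-side emulation: run every work-item once, one after another.
 * On a device the work-items would run concurrently instead. */
void copy_all(char *dest, const char *src, size_t n, size_t gsize)
{
    for (size_t gid = 0; gid < gsize; ++gid)
        copy_slice(dest, src, n, gid, gsize);
}
```

Each byte is written by exactly one work-item, so the total work stays at n bytes no matter how many work-items are launched.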
In OpenCL, buffers are the conduit through which data is communicated from the host application.
cl_mem clCreateBuffer (cl_context context, cl_mem_flags flags, size_t size,
                       void *host_ptr, cl_int *errcode_ret);
Now if I have a buffer a_buffer flagged as READ_ONLY, and the kernel is:
__kernel void two_buffer_double(__global float* a)
{
    int i = get_global_id(0);
    float b = a[i] * 2;
}
My question is: is a_buffer global memory or constant memory? Should I use the __constant qualifier for a? What is the connection between cl_mem_flags (READ_ONLY and READ_WRITE) and the memory qualifiers (global and constant)?
The __constant qualifier is used for constant memory. Some cards serve it from the texture cache, so it gets bandwidth independent of __global, but it is very limited in size.
__global __read_only float * means the OpenCL implementation will try to put the data in cache (or use some other data path) if the hardware sees fit, but because it is __global it is limited only by VRAM size (or a fraction of it) instead of just 64 kB (for example) for __constant.
These qualifiers are for device-side optimization.
For host-side optimization, you should pass CL_MEM_READ_ONLY as the flag at buffer creation. This means the device will only read from the buffer (probably using some DMA/PCIe access/caching optimizations), but the host (meaning C#, C++ or similar code, not the device) can still write to it using enqueue-write or map/unmap operations.
__constant is for parametric constant definitions, not for the data being processed.
If you are writing filter code, the data could be __global and the filter mask could be __constant if the mask cannot fit in __private memory (which has the highest bandwidth) or __local memory (slower than private), so that accessing mask bytes does not reduce data bandwidth.
Now answering your question:
"Is a_buffer a global memory or constant memory?"
It is global on the device side (kernel side) because you declared it as __global, but it could be anywhere on the host side (hardware).
Edit: on the host side it depends on which other flags are used. For example, CL_MEM_USE_HOST_PTR makes the buffer directly accessible from system RAM, with only a virtual buffer on the device side. Without it, and with just CL_MEM_READ_WRITE, device memory will hold a real buffer plus a mapped shadow in RAM (as a sub-step for clEnqueueReadBuffer or clEnqueueWriteBuffer), and a copy passes through this shadow before being uploaded to the GPU.
An example device: Intel(R) HD (TM) GRAPHICS 400 in a 4GB DDR3L laptop:
Query                                 Value
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE    65536 bytes
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE       262144 bytes
CL_DEVICE_GLOBAL_MEM_SIZE             1636414260 bytes
CL_DEVICE_GLOBAL_MEM_CACHE_TYPE       CL_READ_WRITE_CACHE
CL_DEVICE_LOCAL_MEM_SIZE              65536 (vs constant, benchmark it)
CL_DEVICE_LOCAL_MEM_TYPE              CL_LOCAL (so is faster than global)
You cannot query the private memory size, but for a mid-segment gaming AMD card it is 256 kB per thread group. If you set 64 threads per group, each thread can use 4 kB of register space, or half of that (because of compiler optimizations), before it slows down by spilling to global memory.
Note that in some OpenCL platforms (e.g. AMD) access qualifiers can only be applied to images, not buffers.
I'm following the example here to create a variable-length local memory array.
The kernel signature is something like this:
__kernel void foo(__global float4* ex_buffer,
                  int ex_int,
                  __local void *local_var)
Then I call clSetKernelArg for the local memory kernel argument as follows:
clSetKernelArg(*kern, 2, sizeof(char) * MaxSharedMem, NULL)
Where MaxSharedMem is set from querying CL_DEVICE_LOCAL_MEM_SIZE.
Then inside the kernel I split up the allocated local memory into several arrays and other data structures and use them as I see fit. All of this works fine with AMD (gpu and cpu) and Intel devices. However, on Nvidia, I get the error CL_INVALID_COMMAND_QUEUE when I enqueue this kernel and then run clFinish on the queue.
This is a simple kernel that generates the mentioned error (local work size is 32):
__kernel
void s_Kernel(const unsigned int N, __local void *shared_mem_block )
{
    const ushort thread_id = get_local_id(0);
    __local double *foo = shared_mem_block;
    __local ushort *bar = (__local ushort *) &(foo[1000]);
    foo[thread_id] = 0.;
    bar[thread_id] = 0;
}
The kernel runs fine if I allocate the same arrays and data structures in local memory statically. Could somebody provide an explanation for this behavior, and/or workarounds?
For those interested, I finally received an explanation from Nvidia. When the chunk of shared memory is passed in via a void pointer, the actual alignment does not match the expected alignment for a pointer to double (8-byte aligned). The GPU device throws an exception due to the misalignment.
As one of the comments pointed out, a way to circumvent the problem is to have the kernel parameter be a pointer to something that the compiler would properly align to at least 8 bytes (double, ulong, etc).
Ideally, the compiler would take responsibility for any alignment issues specific to the device, but because there is an implicit pointer cast in the little kernel featured in my question, I think it gets confused.
Once the memory is 8-byte aligned, a cast to a pointer type that assumes a shorter alignment (e.g. ushort) works without issues. So, if you're chaining the memory allocation like I'm doing, and the pointers are to different types, make sure to have the pointer to the largest type in the kernel signature.
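The offset arithmetic behind that advice can be sketched in plain C. This is an illustration, not Nvidia's explanation or the kernel from the question: align_up, carve and layout_t are hypothetical names, and the sketch assumes each sub-array must be aligned to its own element size, with element sizes that are powers of two (true for the OpenCL scalar types).

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to the next multiple of alignment (a power of two). */
size_t align_up(size_t offset, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

/* Byte offsets of two sub-arrays packed back to back in one block. */
typedef struct {
    size_t a_offset;  /* start of the first array  */
    size_t b_offset;  /* start of the second array */
    size_t total;     /* bytes needed for the whole block */
} layout_t;

/* Lay out count_a elements of size_a bytes, then count_b elements of
 * size_b bytes. Assumption: alignment of each array == its element size. */
layout_t carve(size_t size_a, size_t count_a, size_t size_b, size_t count_b)
{
    layout_t l;
    l.a_offset = 0;
    l.b_offset = align_up(l.a_offset + size_a * count_a, size_b);
    l.total    = l.b_offset + size_b * count_b;
    return l;
}
```

With the largest type first, as in the question (1000 doubles followed by ushorts), carve(8, 1000, 2, 32) places the second array at byte 8000 with no padding. Reversing the order, for example carve(2, 3, 8, 4), forces the second array up from byte 6 to byte 8. Note that these offsets are only meaningful if the base of the block is itself aligned for the largest type, which is the condition the answer above is about.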
I am implementing an algorithm on a GPU using OpenCL.
Currently I am launching the kernel with only one work-group containing 128 work-items. The data in global memory is used many times by every work-item. To take advantage of the speed of shared memory, I copy it to shared memory using the following code.
__kernel void kernel1(__global float2* input,
                      __global int* bin,
                      __global float2* DFT,
                      __local float2* localInput,
                      const int N){
    size_t itemId = get_local_id(0);
    localInput[itemId] = input[itemId];
    barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
........................................................
    /*Remaining algo here.*/
........................................................
}
The above code works well if there is only one work-group. But if there is more than one work-group, say two work-groups with an equal number of items in each, the above kernel copies only the first half into the first work-group's shared memory and the second half into the other's.
I also tried the kernel below:
__kernel void kernel1(__global float2* input,
                      __global int* bin,
                      __global float2* DFT,
                      __local float2* localInput,
                      const int N){
    size_t itemId = get_local_id(0);
    if(itemId == 0){
        for(int index = 0;index<N;index++){
            localInput[index] = input[index];
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
........................................................
    /*Remaining algo here.*/
........................................................
}
But the above code has problems such as divergence caused by the conditional statement, which decreases performance.
What further modifications can be made to the code so that the entire array is copied efficiently into the shared memory of each work-group?
Any suggestions are much appreciated.
Depending on what device you're running on, there's a good chance you can completely ignore local memory. Desktop GPUs used to have practically no cache whatsoever, which made using local memory very important, but these days they have a decent amount. If you're hitting the same portion of memory on a GPU, it'll all be in cache (it's generally the same size as shared memory), which is just as fast as local memory (they're the same block of memory, just split). Copying it manually to local memory might additionally impose a minor performance penalty.
If you aren't on a desktop GPU (ARM etc.) or your requirements make this impractical, async_work_group_copy might be what you are looking for.
On an unrelated note, the above code only needs to do a barrier(CLK_LOCAL_MEM_FENCE), as you presumably aren't modifying your input.
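If you do want a manual copy, one common alternative to both kernels in the question (and to async_work_group_copy) is a cooperative strided loop, where every work-item in a group copies a share of the array. Here is a minimal host-side sketch in plain C that emulates the pattern. load_share, load_group, lid and lsize are hypothetical names, with lid and lsize standing in for get_local_id(0) and get_local_size(0).

```c
#include <assert.h>
#include <stddef.h>

/* One emulated work-item copies its share of input into its group's
 * local array. Indexing by the local id (not the global id) means every
 * work-group ends up with all n elements, however many groups exist. */
void load_share(float *local_buf, const float *input, size_t n,
                size_t lid, size_t lsize)
{
    assert(lsize > 0 && lid < lsize);
    for (size_t i = lid; i < n; i += lsize)
        local_buf[i] = input[i];
}

/* Emulate one work-group of lsize items. In a real kernel the items run
 * concurrently and barrier(CLK_LOCAL_MEM_FENCE) must follow the loop
 * before any item reads local_buf. */
void load_group(float *local_buf, const float *input, size_t n, size_t lsize)
{
    for (size_t lid = 0; lid < lsize; ++lid)
        load_share(local_buf, input, n, lid, lsize);
}
```

There is no branch on the work-item id, so all items in the group do roughly the same amount of work. The loop also copes with n larger than the work-group size.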
I've been playing with OpenCL recently, and I'm able to write simple kernels that use only global memory. Now I'd like to start using local memory, but I can't seem to figure out how to use get_local_size() and get_local_id() to compute one "chunk" of output at a time.
For example, let's say I wanted to convert Apple's OpenCL Hello World example kernel to something that uses local memory. How would you do it? Here's the original kernel source:
__kernel void square(
    __global float *input,
    __global float *output,
    const unsigned int count)
{
    int i = get_global_id(0);
    if (i < count)
        output[i] = input[i] * input[i];
}
If this example can't easily be converted into something that shows how to make use of local memory, any other simple example will do.
Check out the samples in the NVIDIA or AMD SDKs, they should point you in the right direction. Matrix transpose would use local memory for example.
Using your squaring kernel, you could stage the data in an intermediate buffer. Remember to pass in the additional parameter.
__kernel void square(
    __global float *input,
    __global float *output,
    __local float *temp,
    const unsigned int count)
{
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    if (gtid < count)
    {
        temp[ltid] = input[gtid];
        // if the threads were reading data from other threads, then we would
        // want a barrier here to ensure the write completes before the read
        output[gtid] = temp[ltid] * temp[ltid];
    }
}
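To illustrate the case mentioned in that comment, where work-items read data written by other work-items, here is a minimal host-side sketch in plain C. It emulates one work-group that reverses its tile of the input. reverse_tile, group and lsize are hypothetical names (group and lsize stand in for get_group_id(0) and get_local_size(0)), and the sketch assumes the element count is a multiple of the work-group size.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates one work-group that reverses its tile of input.
 * temp is the group's local buffer and must hold at least lsize floats. */
void reverse_tile(float *output, const float *input, float *temp,
                  size_t group, size_t lsize)
{
    size_t base = group * lsize;
    assert(lsize > 0);

    /* Phase 1: every work-item stages its own element. */
    for (size_t ltid = 0; ltid < lsize; ++ltid)
        temp[ltid] = input[base + ltid];

    /* In a kernel, barrier(CLK_LOCAL_MEM_FENCE) goes here, because the
     * next phase reads elements written by other work-items. The two
     * sequential loops play that role in this emulation. */

    /* Phase 2: every work-item reads its mirror element. */
    for (size_t ltid = 0; ltid < lsize; ++ltid)
        output[base + ltid] = temp[lsize - 1 - ltid];
}
```

In a kernel, the barrier also has to be reached by every work-item in the group, so it should not sit inside a branch such as if (gtid < count) that only some work-items take.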
There is another way to do this if the size of the local memory is constant. Instead of passing a pointer in the kernel's parameter list, the local buffer can be declared inside the kernel simply by declaring it __local:
__local float localBuffer[1024];
This reduces the host code, because fewer clSetKernelArg calls are needed.
In OpenCL, local memory is meant for sharing data across all work-items in a work-group. It usually requires a barrier call before the local memory data can be used (for example, when one work-item wants to read local memory data written by other work-items). Barriers are costly in hardware. Keep in mind that local memory should be used for data that is read or written repeatedly. Bank conflicts should be avoided as much as possible.
If you are not careful with local memory, you may sometimes end up with worse performance than with global memory.