I am new to GPU development, and I was wondering how many workers (threads) concurrently execute my kernel, so I used the kernel below:
kernel void helloWorld(global int* result)
{
    int gid = get_local_id(0);
    if (gid > result[0])
    {
        result[0] = gid;
    }
}
But when running on an Intel Core i7, result[0] is always 0, and when running on an NVIDIA GPU it is also always 0.
OpenCL divides the threads on a device into work groups.
A work group executes on a single compute unit (note that several work groups can share the same compute unit).
You can let the implementation decide the size of the work groups.
However, you pick the total number of threads when enqueueing the kernel.
example:
clEnqueueNDRangeKernel(command_queue, cl_exec, 1, NULL, &tasksize, &local_size_in, 0, NULL, NULL);
tasksize = total number of threads
local_size = number of threads in a group
so tasksize / local_size is the number of workgroups you will have.
If you pass NULL instead of local_size, the implementation decides the work-group size.
There are several constraints on local_size.
Take a look at the clEnqueueNDRangeKernel documentation in the OpenCL API.
In my experience the optimal result is usually the one the implementation chooses.
This can differ in complex cases, where you have specific knowledge of what happens at runtime that is not available at compile time.
Also, every device has a maximum work-group size: CL_DEVICE_MAX_WORK_GROUP_SIZE.
Related
With OpenCL 1.1, is it possible to force work items to order their execution so they wait until higher priority items have finished?
I've tried various implementations and always seem to get stuck when executing my kernel on a GPU (NVIDIA, OpenCL 1.1), although running on a CPU is fine.
My most recent attempt, below, hangs on a GPU. My suspicion is that the GPU splits the global work size into local groups that get suspended, creating a deadlock. I typically run a global size several multiples of my local size, and this is important for scaling up my calculation.
kernel void ordered_workitem_kernel(global uint *min_active_id_g)
{
    uint i = get_global_id(0);
    min_active_id_g[0] = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);
    while (i >= min_active_id_g[0]) {
        // do something interesting here
        if (i == min_active_id_g[0])
            atomic_inc(&min_active_id_g[0]);
    }
}
Perhaps there's a better way to do this? Any suggestions?
You can often see OpenCL kernels such as
kernel void aKernel(global float* input, global float* output, const uint N)
{
    const uint global_id = get_global_id(0);
    if (global_id >= N) return;
    // ...
}
I am wondering if this if (global_id >= N) return; is really necessary, especially if you create your buffer with the global size.
In which cases is it mandatory?
Is it an OpenCL code convention?
This is not a convention - it's the same as in regular C/C++: return if you want to skip the rest of the function. It can speed up execution by not doing unnecessary work.
It may be necessary, if you have not padded your buffers to the size of the workgroup and you need to make sure that you are not accessing unallocated memory.
You have to be careful returning like this, because if there is a barrier in the kernel after the return you may deadlock the execution. A barrier has to be reached by all work items in a work group, so if there is a barrier, the condition must be either true for the whole work group or false for the whole work group.
It's very common to have this conditional in OpenCL 1.x kernels because of the requirement that your global work size be an integer multiple of your work group size. So if you want to specify a work group size of 64 but have 1000 items to process you make the global size 1024, pass 1000 as a parameter (N), and do the check.
In OpenCL 2.0 the integer multiple restriction has been lifted so OpenCL 2.0 kernels are less likely to need this conditional.
I'm using the Tesla m1060 for GPGPU computation. It has the following specs:
# of Tesla GPUs 1
# of Streaming Processor Cores (XXX per processor) 240
Memory Interface (512-bit per GPU) 512-bit
When I use OpenCL, I can display the following board information:
available platform OpenCL 1.1 CUDA 6.5.14
device Tesla M1060 type:CL_DEVICE_TYPE_GPU
max compute units:30
max work item dimensions:3
max work item sizes (dim:0):512
max work item sizes (dim:1):512
max work item sizes (dim:2):64
global mem size(bytes):4294770688 local mem size:16383
How can I relate the GPU card's specifications to the OpenCL device information?
For example:
What does "Memory Interface" mean? Is it linked to a work item?
How can I relate the 240 cores of the GPU to work groups/items?
How should I map work groups onto it (i.e., how many work groups should I use)?
Thanks
EDIT:
After the following answers, there is a thing that is still unclear to me:
The CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE value is 32 for the kernel I use.
However, my device has a CL_DEVICE_MAX_COMPUTE_UNITS value of 30.
In the OpenCL 1.1 API, it is written (p. 15):
Compute Unit: An OpenCL device has one or more compute units. A work-group executes on a single compute unit
It seems that either something is inconsistent here, or that I didn't fully understand the difference between work groups and compute units.
As previously stated, when I set the work-group dimension (BLOCK_SIZE) to 32, the program fails with the following error:
Entry function uses too much shared data (0x4020 bytes, 0x4000 max).
The value 16 works.
Addendum
Here is my kernel:
// enable double precision (not enabled by default)
#ifdef cl_khr_fp64
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#else
#error "IEEE-754 double precision not supported by OpenCL implementation."
#endif
#define BLOCK_SIZE 16 // --> this is what defines the WG size to me
__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))
void mmult(__global double * A, __global double * B, __global double * C, const unsigned int q)
{
    __local double A_sub[BLOCK_SIZE][BLOCK_SIZE];
    __local double B_sub[BLOCK_SIZE][BLOCK_SIZE];
    // stuff that does matrix multiplication with __local
}
In the host code part:
#define BLOCK_SIZE 16
...
const size_t local_work_size[2] = {BLOCK_SIZE, BLOCK_SIZE};
...
status = clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_work_size, local_work_size, 0, NULL, NULL);
The memory interface doesn't mean anything to an OpenCL application. It is the number of bits the memory controller can read/write to memory per transaction (the GDDR5 in modern GPUs). The approximate formula for maximum global memory bandwidth is pipelineWidth * memoryClockSpeed, but since OpenCL is meant to be cross-platform, you won't really need this value unless you are trying to figure out an upper bound for memory performance. Knowing about the 512-bit interface is somewhat useful when you're dealing with memory coalescing. wiki: Coalescing (computer science)
The max work item sizes have to do with (1) how the hardware schedules computations, and (2) the amount of low-level memory on the device -- e.g. private memory and local memory.
The 240 figure doesn't matter much to OpenCL either. You can determine that each of the 30 compute units is made up of 8 streaming processor cores for this GPU architecture (240 / 30 = 8). If you query CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, it will very likely be a multiple of 8 for this device. See: clGetKernelWorkGroupInfo
I have answered similar questions about work-group sizing; see here and here.
Ultimately, you need to tune your application and kernels based on your own benchmarking results. I find it worth the time to write many tests with various work-group sizes and eventually hard-code the optimal size.
Adding another answer to address your local memory issue.
Entry function uses too much shared data (0x4020 bytes, 0x4000 max)
Since you are allocating A_sub and B_sub, each of 32*32*sizeof(double) = 8192 bytes (with BLOCK_SIZE = 32), you run out of local memory. The device should allow you to allocate 16 KB, or 0x4000 bytes, of local memory without an issue.
0x4020 is 32 bytes or 4 doubles more than what your device allows. There are only two things I can think of that may cause the error: 1) there could be a bug with your device or drivers preventing you from allocating the full 16kb, or 2) you are allocating the memory somewhere else in your kernel.
You will have to use a BLOCK_SIZE value less than 32 to work around this for now.
There's good news though. If you only want to hit a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE as a work group size, BLOCK_SIZE=16 already does this for you. (16*16 = 256 = 32*8). To better take advantage of local memory, try BLOCK_SIZE=24. (576=32*18)
I'm trying to process an image using OpenCL 1.1 C++ on my AMD CPU.
The characteristics are:
using CPU: AMD Turion(tm) 64 X2 Mobile Technology TL-60
initCL:CL_DEVICE_IMAGE2D_MAX_WIDTH :8192
initCL:CL_DEVICE_IMAGE2D_MAX_HEIGHT :8192
initCL:timer resolution in ns:1
initCL:CL_DEVICE_GLOBAL_MEM_SIZE in bytes:1975189504
initCL:CL_DEVICE_GLOBAL_MEM_CACHE_SIZE in bytes:65536
initCL:CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE in bytes:65536
initCL:CL_DEVICE_LOCAL_MEM_SIZE in bytes:32768
initCL:CL_DEVICE_MAX_COMPUTE_UNITS:2
initCL:CL_DEVICE_MAX_WORK_GROUP_SIZE:1024
initCL:CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:3
initCL:CL_DEVICE_MAX_WORK_ITEM_SIZES:dim=0, size 1024
initCL:CL_DEVICE_MAX_WORK_ITEM_SIZES:dim=1, size 1024
initCL:CL_DEVICE_MAX_WORK_ITEM_SIZES:dim=2, size 1024
createCLKernel:mean_value
createCLKernel:CL_KERNEL_WORK_GROUP_SIZE:1024
createCLKernel:CL_KERNEL_LOCAL_MEM_SIZE used by the kernel in bytes:0
createCLKernel:CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE:1
The kernel is for the moment empty:
__kernel void mean_value(image2d_t p_image,
__global ulong4* p_meanValue)
{
}
The execution call is:
cl::NDRange l_globalOffset;
// The global worksize is the entire image
cl::NDRange l_globalWorkSize(l_width, l_height);
// Needs to be determined
cl::NDRange l_localWorkSize;//(2, 2);
// Computes the mean value
cl::Event l_profileEvent;
gQueue.enqueueNDRangeKernel(gKernelMeanValue, l_globalOffset, l_globalWorkSize,
l_localWorkSize, NULL, &l_profileEvent);
If l_width = 558 and l_height = 328, l_localWorkSize cannot be greater than (2, 2); otherwise I get this error: "Invalid work group size".
Is it because I only have 2 cores ?
Is there a rule to determine l_localWorkSize ?
You can check two things using the clGetDeviceInfo function:
CL_DEVICE_MAX_WORK_GROUP_SIZE, to check that 4 is not too big for your work group, and
CL_DEVICE_MAX_WORK_ITEM_SIZES, to check that the number of work items per dimension is not too big.
The fact that the group size may be limited to the number of cores makes sense: if you have inter-work-item communication or synchronization, you want the work items to execute at the same time; otherwise the OpenCL driver would have to emulate this, which would be hard at best and probably impossible in the general case.
I read in the OpenCL specs that enqueueNDRangeKernel() succeeds only if l_globalWorkSize is evenly divisible by l_localWorkSize. In my case, I can set it up to (2, 41).
Hello everyone,
I am new to OpenCL and trying to explore it further.
What is the role of local_work_size in an OpenCL program, and how does it affect performance?
I am working on an image processing algorithm, and for my OpenCL kernel I used:
size_t local_item_size = 1;
size_t global_item_size = (int) (ceil((float)(D_can_width*D_can_height)/local_item_size))*local_item_size; // Process the entire lists
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,&global_item_size, &local_item_size, 0, NULL, NULL);
Then for the same kernel, when I changed
size_t local_item_size = 16;
keeping everything else the same, I got around 4-5 times faster performance.
The local-work-size, aka work-group-size, is the number of work-items in each work-group.
Each work-group is executed on a compute-unit which is able to handle a bunch of work-items, not only one.
So when you use groups that are too small, you waste some computing power and only get coarse parallelization at the compute-unit level.
But if you have too many work items in a group, you can also lose some opportunity for parallelization, as some compute units may be left unused while others are overused.
So you could test with many values to find the best one or just let OpenCL pick a good one for you by passing NULL as the local-work-size.
PS: I'd be interested to know the performance with OpenCL's choice compared to your previous values, so could you please run a test and post the results?
Thanks :)