How to get more work items running concurrently in OpenCL? - opencl

I set the global work size to {100,10} and the local work size to {1,1}. I expected 100*10 work items to run concurrently, but it turned out that only ~50 work items did.
I wonder how I can get more work items running simultaneously? Does it depend on my code complexity?
Note: I only use ~100 MB global memory and ~100 KB private memory

Use at least a workgroup size (local size) of 32. In 2D this can be 1x32 or 2x16 or 4x8, but the number of work items in a workgroup must be at least 32, and also a multiple of 32.
The reason for this is that the GPU hardware computes work items in groups of 32 (so-called warps). If you set the workgroup size to only 16, then half of the hardware is idle because only 16 work items are computed per warp. If you set the workgroup size to 1, you get only 1/32 of the performance of your GPU.
You can also set the workgroup size to 64, 128, 256 or 512. This may be beneficial if you use local memory explicitly for communication between work items within a workgroup. If you don't use local memory, you can choose the workgroup size to tune for best performance. In my experience, across 32, 64, 128 and 256 there is almost no difference.
For good performance, you should also set the global size sufficiently large, at least a few million work items. Otherwise the GPU may not be saturated and the latency of the PCIe transfer is longer than the kernel compute time, so overall performance is poor.
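As a minimal host-side sketch of such a launch (the function name launch_2d and the queue/kernel arguments are placeholders, not taken from the question; error checking omitted):

// Hedged sketch: 2D launch with 32 work items per workgroup and a few million work items in total.
#include <CL/cl.h>

cl_int launch_2d(cl_command_queue queue, cl_kernel kernel) {
    size_t local[2]  = { 4, 8 };        // 4*8 = 32 work items per workgroup (one warp)
    size_t global[2] = { 2048, 1024 };  // ~2 million work items; each dimension is a multiple of the local size
    return clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);
}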

Related

OpenCL crash when Max work group size is used

So I have a problem.
I am writing an application which uses OpenCL and whenever I use the max work group size or anything above half of the max work group size, I get a crash (a black screen).
Does anyone know what the reason might be?
I would love to use the entire work group size, instead of just half.
Thanks in advance
The most likely reason is that you have a global array of a size that, at some point, is not evenly divisible by the workgroup size.
For example, take a global array of size 1920. Workgroup size 32/64/128 works: you get 60/30/15 workgroups. But if you choose workgroup size 256 (or larger), you get 7.5 workgroups. What then happens is that 8 workgroups are executed, and the last workgroup covers threads 1792-2047. Yet the global array ends at thread 1919, so access to array elements 1920-2047 reads from / writes to unallocated memory, and this can crash the entire program.
There are 2 possible solutions:
Only choose a workgroup size small enough that all global array sizes are divisible by it. Note that workgroup size must be 32 or a multiple of 32.
Alternatively, at the very beginning of the kernel code, use a so-called "guard clause": if(get_global_id(0)>=array_size) return;. This makes a thread end immediately if there is no memory address corresponding to its thread ID, preventing the crash (see the sketch below). With this, you can also use any array size that is not divisible by the workgroup size.
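A minimal kernel sketch of this guard clause (the kernel name, buffer and array_size parameter are placeholders, not from the question):

// Hedged example: the guard clause lets the global size be rounded up to a multiple of the workgroup size.
__kernel void scale_array(__global float* data, const uint array_size) {
    const uint n = get_global_id(0);
    if(n >= array_size) return;   // guard clause: excess threads in the last workgroup do nothing
    data[n] = 2.0f * data[n];     // safe: n is always within bounds here
}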

OpenCL maximum number of work groups for a device

I am learning OpenCL and using a RTX 2060.
Based on what I read online, the maximum number of work items for this device is 1024 and the maximum number of work items per work group is 64 (which means I can run 16 work groups of 64 work items, right?)
Question is : is there a limit to the number of work groups themselves? For example can I run 32 work groups of 32 work items? 64 work groups of 16 work items? 512 work groups of 2 work items? (you get the idea).
The vendor only specifies a value for the maximum size of a workgroup; for Nvidia this is usually 1024. But it is still allowed (although not really useful) to choose an even larger workgroup size. The larger the workgroup, the fewer registers (private variables) you can have per thread, and if you use too many (like many thousands in a table) they spill into global memory, which makes things very slow. For details see here.
Note that workgroup size should be 32 or a multiple of 32 to best utilize the hardware.
There is no limit on the number of workgroups. You will only eventually run out of memory. In general, the more workgroups the better, because then the device is fully saturated and no SMs are idle at any time. It is not uncommon to have 2 million workgroups of 32 threads each.
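The device-wide and kernel-specific workgroup limits can also be queried at runtime; a hedged sketch (assuming a valid cl_device_id and a built cl_kernel, function name is a placeholder; error checking omitted):

// Hedged sketch: query the workgroup size limits discussed above.
#include <CL/cl.h>
#include <stdio.h>

void print_workgroup_limits(cl_device_id device, cl_kernel kernel) {
    size_t device_max = 0, kernel_max = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(device_max), &device_max, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(kernel_max), &kernel_max, NULL);
    printf("device workgroup limit: %zu, limit for this kernel: %zu\n", device_max, kernel_max);
}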

OpenCL Computing Unit and Processing Element

I currently use AMD Hawaii GPU and have some question about it.
In the specification of AMD Hawaii, it has
2816 Processing Element
44 Computing Units
I understood that it then has 2816 threads and 44 work groups (64 threads in each group).
Is it correct?
I'm confused about the concept of cores, threads, computing units, work groups and processing elements.
No. You can (and should) have multiple work groups per CU and more than one thread per processing element. Each CU can hold up to 40 wavefronts of 64 threads each, so the maximum number of parallel threads is 44*40*64=112640. However, you often cannot use all of these threads, because other resources might limit the maximum possible number of threads per CU: there is only a limited number of registers per CU, and if each wavefront uses too many of them, the maximum number of parallel wavefronts is lower.
Each work group is executed on the same CU, as this allows access to shared memory (LDS) and easy synchronization between the different wavefronts of each workgroup. You can choose the work-group size within certain limits. There is a hard limit (more doesn't work) of 256 threads per work-group and a soft limit (reduced performance if you use less) of the wavefront size, i.e. 64 threads per work-group. Your work-group size should also be a multiple of the wavefront size, so 64, 128, 192 and 256 are the most common choices for work-group size. Anything else reduces the potential peak performance; however, depending on your problem, a different work-group size might still be better than forcing the problem into one of these choices.
Because each work group can only use up to 256 threads, multiple workgroups can be executed on each CU in parallel. If you use the maximum workgroup size of 256 threads, you need at least 112640/256=440 work groups in order to use all threads of the GPU. If you have more work groups, up to 440 of them will execute in parallel and the remaining groups will be executed once one of the older groups has finished. If you have fewer work groups, not all threads will be occupied, which can lead to decreased performance. If you pick smaller work-groups, you will need more of them, e.g. 1760 work-groups with a work-group size of 64.
Using too much of the shared memory (LDS) can also limit the number of work-groups per CU.
The processing elements execute the instructions. Under optimal conditions one instruction can be started per cycle.
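As a back-of-the-envelope check, the thread count quoted in the answer can be reproduced from a device query; a hedged sketch (the 40 wavefronts per CU and 64 threads per wavefront are the GCN figures from the answer, and the function name is a placeholder):

// Hedged sketch: estimate the maximum number of parallel threads on a GCN GPU.
#include <CL/cl.h>
#include <stdio.h>

void print_max_parallel_threads(cl_device_id device) {
    cl_uint compute_units = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(compute_units), &compute_units, NULL);
    const unsigned wavefronts_per_cu = 40;     // GCN limit quoted in the answer
    const unsigned threads_per_wavefront = 64;
    // For Hawaii: 44 * 40 * 64 = 112640
    printf("max parallel threads: %u\n", compute_units * wavefronts_per_cu * threads_per_wavefront);
}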

Determine max global work group size based on device memory in OpenCL?

I am able to list the following parameters which help in restricting the work items for a device based on the device memory:
CL_DEVICE_GLOBAL_MEM_SIZE
CL_DEVICE_LOCAL_MEM_SIZE
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
CL_DEVICE_MAX_MEM_ALLOC_SIZE
CL_DEVICE_MAX_WORK_GROUP_SIZE
CL_DEVICE_MAX_WORK_ITEM_SIZES
CL_KERNEL_WORK_GROUP_SIZE
I find the explanation for these parameters insufficient and hence I am not able to use these parameters properly.
Can somebody please tell me what these parameters mean and how they are used.
Is it necessary to check all these parameters?
PS: I have some brief understanding of some of the parameters but I am not sure whether my understanding is correct.
CL_DEVICE_GLOBAL_MEM_SIZE:
The amount of global memory on the device, in bytes. You typically don't care unless you use a high amount of data. In any case, the OpenCL implementation will complain with an OUT_OF_RESOURCES error if you use more than allowed.
CL_DEVICE_LOCAL_MEM_SIZE:
The amount of local memory available to each workgroup, in bytes. However, this limit only holds under ideal conditions: if your kernel uses a high number of WI per WG, some of the private WI data may be spilled out to local memory. So take it as the maximum amount available per WG.
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:
The maximum amount of constant memory that can be used by a single kernel, in bytes. If your constant buffers together exceed this amount, the call will either fail or fall back to normal global memory instead (and may therefore be slower).
CL_DEVICE_MAX_MEM_ALLOC_SIZE:
The maximum amount of memory you can allocate in one single piece (buffer) on the device, in bytes.
CL_DEVICE_MAX_WORK_GROUP_SIZE:
The maximum work group size of the device. This is the ideal maximum; depending on the kernel code, the limit may be lower.
CL_DEVICE_MAX_WORK_ITEM_SIZES:
The maximum number of work items per dimension. E.g. the device may have 1024 WI as its maximum size and 3 maximum dimensions, but you may not be able to use (1024,1,1) as a size, since it may be limited to (64,64,64); in that case you can only do, for example, (64,2,8).
CL_KERNEL_WORK_GROUP_SIZE:
The maximum workgroup size the implementation allows for this specific kernel; it can be lower than CL_DEVICE_MAX_WORK_GROUP_SIZE, e.g. because of register usage. The value provided is usually a good choice (a good tradeoff of GPU usage, memory spilling, etc.).
NOTE: All of these values are theoretical limits. If your kernel uses one resource more than another, e.g. local memory that scales with the work group size, you may not be able to reach the maximum work items per work group, since it is possible you hit the local memory limit first.
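A hedged sketch of how these values could be queried (assuming a valid device and a built kernel, and that the device has 3 work-item dimensions; function name is a placeholder; error checking omitted):

// Hedged sketch: query the limits discussed above.
#include <CL/cl.h>
#include <stdio.h>

void print_device_limits(cl_device_id device, cl_kernel kernel) {
    cl_ulong global_mem = 0, local_mem = 0, const_mem = 0, max_alloc = 0;
    size_t device_wg = 0, kernel_wg = 0, item_sizes[3] = {0, 0, 0};  // assumes 3 work-item dimensions (typical for GPUs)
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem), &local_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, sizeof(const_mem), &const_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(max_alloc), &max_alloc, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(device_wg), &device_wg, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(item_sizes), item_sizes, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(kernel_wg), &kernel_wg, NULL);
    printf("global: %llu B, local: %llu B, constant: %llu B, max alloc: %llu B\n",
           (unsigned long long)global_mem, (unsigned long long)local_mem,
           (unsigned long long)const_mem, (unsigned long long)max_alloc);
    printf("device max WG: %zu, kernel max WG: %zu, max item sizes: %zu x %zu x %zu\n",
           device_wg, kernel_wg, item_sizes[0], item_sizes[1], item_sizes[2]);
}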

How to verify wavefront/warp size in OpenCL?

I am using AMD Radeon HD 7700 GPU. I want to use the following kernel to verify the wavefront size is 64.
__kernel void kernel__test_warpsize(
    __global T* dataSet,
    uint size)
{
    size_t idx = get_global_id(0);
    T value = dataSet[idx];
    if (idx < size - 1)
        dataSet[idx + 1] = value;
}
In the main program, I pass an array with 128 elements. The initial values are dataSet[i]=i. After the kernel, I expect the following values:
dataSet[0]=0
dataSet[1]=0
dataSet[2]=1
...
dataSet[63]=62
dataSet[64]=63
dataSet[65]=63
dataSet[66]=65
...
dataSet[127]=126
However, I found that dataSet[65] is 64, not 63, which is not what I expected.
My understanding is that the first wavefront (64 threads) should change dataSet[64] to 63. So when the second wavefront is executed, thread #64 should get 63 and write it to dataSet[65]. But I see dataSet[65] is still 64. Why?
You are invoking undefined behaviour. If you wish to access memory that another thread in a workgroup is writing, you must use barriers.
In addition, assume that the GPU is running 2 wavefronts at once. Then dataSet[65] indeed contains the correct value; the first wavefront has simply not completed yet.
An output where all items end up as 0 is also a valid result according to the spec, because everything could also be performed completely serially and in order, in which case thread 0's value cascades through the whole array. That's why you need the barriers.
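A hedged kernel sketch of a correctly synchronized shift within a single workgroup (names are placeholders; a barrier only synchronizes the threads of one workgroup, so the last thread of each workgroup does not write across the boundary):

// Hedged sketch: shift values by one position within each workgroup, using a barrier.
__kernel void shift_within_workgroup(__global int* dataSet, uint size) {
    const size_t idx = get_global_id(0);
    int value = 0;
    if (idx < size) value = dataSet[idx];
    barrier(CLK_GLOBAL_MEM_FENCE);  // all reads in this workgroup complete before any write below starts
    if (idx + 1 < size && get_local_id(0) + 1 < get_local_size(0))
        dataSet[idx + 1] = value;   // never writes into the next workgroup's range, so no cross-workgroup race
}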
Based on your comments I edited this part:
Install http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
Read: http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
Optimizing branching within a certain number of threads is only a small part of optimization. You should read up on how AMD HW schedules the wavefronts within a workgroup and how it hides memory latency by interleaving the execution of wavefronts (within a workgroup). Branching also affects the execution of the whole workgroup, as the effective time to run it is basically the same as the time to execute the single longest-running wavefront (it cannot free local memory etc. until everything in the group is finished, so it cannot schedule another workgroup). But this also depends on your local memory and register usage, etc. To see what actually happens, just grab CodeXL and do a GPU profiling run. That will show exactly what happens on the device.
And even this applies only to the hardware of the current generation. That's why the concept is not in the OpenCL specification itself. These properties change a lot and depend heavily on the hardware.
But if you really want to know what the AMD wavefront size is, the answer is pretty much always 64 (see http://devgurus.amd.com/thread/159153 for a reference to their OpenCL programming guide). It's 64 for all GCN devices, which compose their whole current lineup. Maybe some older devices have 16 or 32, but right now everything is just 64 (for Nvidia it's 32 in general).
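If the goal is to query the warp/wavefront size at runtime rather than hard-code it, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE (OpenCL 1.1+) usually returns exactly this value; a hedged sketch assuming a built kernel (function name is a placeholder):

// Hedged sketch: the preferred work-group size multiple is the warp/wavefront size on most GPUs.
#include <CL/cl.h>
#include <stdio.h>

void print_wavefront_size(cl_kernel kernel, cl_device_id device) {
    size_t multiple = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(multiple), &multiple, NULL);
    printf("preferred work-group size multiple (warp/wavefront size): %zu\n", multiple);  // 64 on GCN, 32 on Nvidia
}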
CUDA model - what is warp size?
I think this is a good answer which explains the warp briefly.
But I am a bit confused about what sharpneli said, such as:
" [If you set it to 512 it will almost certainly fail, the spec doesn't require implementations to support arbitrary local sizes. In AMD HW the local size is exactly the wavefront size. Same applies to Nvidia. In general you don't really need to care how the implementation will handle it. ]".
I think the local size, which means the group size, is set by the programmer. But when the implementation executes it, the subdivision into groups like warps is done by the hardware.
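One concrete way to see the split of responsibilities: the programmer picks the workgroup size at enqueue time (or leaves it unspecified), while the subdivision of each workgroup into warps/wavefronts is done entirely by the hardware. A hedged host-side sketch (queue, kernel and the function name are placeholders):

// Hedged sketch: local_work_size = NULL lets the OpenCL implementation pick the workgroup size;
// the hardware then splits each workgroup into warps/wavefronts on its own.
#include <CL/cl.h>

cl_int launch_with_implementation_chosen_local_size(cl_command_queue queue, cl_kernel kernel, size_t n) {
    size_t global = n;  // total number of work items
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
}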
