I have seen many tutorials about configuring work dimensions in which the number of work items is conveniently easy to divide into 3 dimensions. I have a large number of work items, say 164052. What is the best way to configure an arbitrary number of work items? Since in my program the number of work items may vary, I need a way to calculate the configuration automatically.
What should I do when the number is prime, say 7879?
First off, by default you should only be using 1 dimension for your kernels. Some tasks require 2 or 3 dimensions (generally, image processing), but unless you are expressly working on one of those tasks, it probably doesn't benefit you to try to divide stuff up among multiple dimensions, since the benefits are largely about code organization, not about performance.
So that leaves the question of how to divide up work items among local groups. Given a task size of N work items, you have a few options for dividing them up into local groups.
The simplest solution is to specify N work items and let the driver decide for you how to divide those work items among the groups.
size_t work_items = 164052;
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &work_items, nullptr, 0, nullptr, nullptr);
If you're programming for a specific environment where you know in advance the ideal number of local work items (often 32 or 64 for NVidia/AMD architectures), you might get better performance by forcing your work item count to align to a multiple of that number.
size_t work_items = 164052;
size_t LOCAL_SIZE = 64;
work_items += (LOCAL_SIZE - (work_items % LOCAL_SIZE)) % LOCAL_SIZE; // round up; unchanged if already a multiple
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &work_items, &LOCAL_SIZE, 0, nullptr, nullptr);
Note, however, that this requires that you add a check to your kernel code to prevent processing on work items that don't actually exist, or that you pad your buffers to include space for the dummy items.
__kernel void process(..., int N) {
if(get_global_id(0) >= N) return;
...
}
Related
I have the following OpenCL kernel, which copies values from one buffer to another, optionally inverting the value (the 'invert' arg can be 1 or -1):
__kernel void extraction(__global const short* src_buff, __global short* dest_buff, const int record_len, const int invert)
{
int i = get_global_id(0); // Index of record in buffer
int j = get_global_id(1); // Index of value in record
dest_buff[(i * record_len) + j] = src_buff[(i * record_len) + j] * invert;
}
The source buffer contains one or more "records", each containing N (record_len) short values. All records in the buffer are of equal length, and record_len is always a multiple of 32.
The global size is 2D (number of records in the buffer, record length), and I chose this as it seemed to make best use of the GPU parallel processing, with each thread being responsible for copying just one value in one record in the buffer.
(The local work size is set to NULL by the way, allowing OpenCL to determine the value itself).
After reading about vectors recently, I was wondering if I could use these to improve on the performance? I understand the concept of vectors but I'm not sure how to use them in practice, partly due to lack of good examples.
I'm sure the kernel's performance is pretty reasonable already, so this is mainly out of curiosity to see what difference it would make using vectors (or other more suitable approaches).
At the risk of being a bit naive here, could I simply change the two buffer arg types to short16, and change the second value in the 2-D global size from "record length" to "record length / 16"? Would this result in each kernel thread copying a block of 16 short values between the buffers?
Your naive assumption is basically correct, though you may want to add a hint to the compiler that this kernel is optimized for the vector type (Section 6.7.2 of the spec). In your case, you would add
__attribute__((vec_type_hint(short16)))
above your kernel function. So in your example, you would have
__attribute__((vec_type_hint(short16)))
__kernel void extraction(__global const short16* src_buff, __global short16* dest_buff, const int record_len, const int invert)
{
int i = get_global_id(0); // Index of record in buffer
int j = get_global_id(1); // Index of value in record
dest_buff[(i * record_len) + j] = src_buff[(i * record_len) + j] * invert;
}
You are correct in that your 2nd global dimension should be divided by 16, and your record_len should also be divided by 16. Also, if you were to specify the local size instead of giving it NULL, you would also want to divide that by 16.
There are some other things to consider though.
You might think choosing the largest vector size should provide the best performance, especially with such a simple kernel. But in my experience, that rarely is the most optimal size. You may try asking clGetDeviceInfo for CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT, but for me this rarely is accurate (also, it may give you 1, meaning the compiler will try auto-vectorization or the device doesn't have vector hardware). It is best to try different vector sizes and see which is fastest.
If your device supports auto-vectorization, and you want to give it a go, it may help to remove your record_len parameter and replace it with get_global_size(1) so the compiler/driver can take care of dividing record_len by whatever vector size it picks. I would recommend doing this anyway, assuming record_len is equal to the global size you gave that dimension.
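For illustration, the kernel above could be rewritten without the record_len parameter along these lines (OpenCL C sketch, untested):

```c
__attribute__((vec_type_hint(short16)))
__kernel void extraction(__global const short16* src_buff, __global short16* dest_buff, const int invert)
{
    int i = get_global_id(0);            // Index of record in buffer
    int j = get_global_id(1);            // Index of short16 block in record
    int record_len = get_global_size(1); // Blocks per record, in short16 units
    dest_buff[(i * record_len) + j] = src_buff[(i * record_len) + j] * invert;
}
```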
Also, you gave NULL to the local size argument so that the implementation picks a size automatically. It is guaranteed to pick a size that works, but it will not necessarily pick the most optimal size.
Lastly, for general OpenCL optimizations, you may want to take a look at the NVIDIA OpenCL Best Practices Guide for NVidia hardware, or the AMD APP SDK OpenCL User Guide for AMD GPU hardware. The NVidia one is from 2009, and I'm not sure how much their hardware has changed since. Notice though that it actually says:
The CUDA architecture is a scalar architecture. Therefore, there is no performance benefit from using vector types and instructions. These should only be used for convenience.
Older AMD hardware (pre-GCN) benefited from using vector types, but AMD suggests not using them on GCN devices (see mogu's comment). Also if you are targeting a CPU, it will use AVX hardware if available.
You can often see OpenCL kernels such as
kernel void aKernel(global float* input, global float* output, const uint N)
{
const uint global_id = get_global_id(0);
if (global_id >= N) return;
// ...
}
I am wondering if this if (global_id >= N) return; is really necessary, especially if you create your buffer with the global size.
In which cases they are mandatory?
Is it an OpenCL code convention?
This is not a convention - it's the same as in regular C/C++: return early if you want to skip the rest of the function. It has the potential of speeding up execution by not doing unnecessary work.
It may be necessary, if you have not padded your buffers to the size of the workgroup and you need to make sure that you are not accessing unallocated memory.
You have to be careful returning like this, because if there is a barrier in the kernel after the return you may deadlock the execution. This is because a barrier has to be reached by all work items in a work group. So if there's a barrier, either the condition needs to be true for whole work group, or it needs to be false for the whole work group.
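To make the deadlock hazard concrete, here is an illustrative OpenCL C sketch (hypothetical kernel, untested): the early return is unsafe whenever some work-items of a group can take it while others reach the barrier.

```c
__kernel void reduce_example(__global const float* in, __local float* tmp, const int N)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    // UNSAFE if N is not a multiple of the work-group size: items with
    // gid >= N return here and never reach the barrier, so the rest of
    // their work-group waits on it forever.
    if (gid >= N) return;

    tmp[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);
    // ... reduction over tmp ...
}
```

A safer pattern is to keep every work-item alive past the barrier and guard only the memory accesses with the `gid < N` condition.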
It's very common to have this conditional in OpenCL 1.x kernels because of the requirement that your global work size be an integer multiple of your work group size. So if you want to specify a work group size of 64 but have 1000 items to process you make the global size 1024, pass 1000 as a parameter (N), and do the check.
In OpenCL 2.0 the integer multiple restriction has been lifted so OpenCL 2.0 kernels are less likely to need this conditional.
Hello everyone,
I am new to OpenCL and trying to explore more about it.
What is the role of local_work_size in an OpenCL program, and how does it affect performance?
I am working on some image processing algorithm, and for my OpenCL kernel I set the sizes as
size_t local_item_size = 1;
size_t global_item_size = (int) (ceil((float)(D_can_width*D_can_height)/local_item_size))*local_item_size; // Process the entire lists
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,&global_item_size, &local_item_size, 0, NULL, NULL);
and for the same kernel, when I changed
size_t local_item_size = 16;
keeping everything else the same,
I got around 4-5 times faster performance.
The local-work-size, aka work-group-size, is the number of work-items in each work-group.
Each work-group is executed on a compute-unit which is able to handle a bunch of work-items, not only one.
So when you are using groups that are too small, you waste some computing power and only get coarse parallelization at the compute-unit level.
But if you have too many work-items in a group, you can also lose some opportunity for parallelization, as some compute-units may sit unused while others are overloaded.
So you could test with many values to find the best one or just let OpenCL pick a good one for you by passing NULL as the local-work-size.
PS: I'd be interested to know the performance with OpenCL's choice compared to your previous values, so could you please run a test and post the results?
Thanks :)
I have noticed a number of kernel sources that look like this (found randomly by Googling):
__kernel void fill(__global float* array, unsigned int arrayLength, float val)
{
if(get_global_id(0) < arrayLength)
{
array[get_global_id(0)] = val;
}
}
My question is whether that if-statement is actually necessary (assuming that "arrayLength" in this example is the same as the global work size).
In some of the more "professional" kernels I have seen, it is not present. It also seems to me that the hardware would do well to not assign kernels to nonsense coordinates.
However, I also know that processors work in groups. Hence, I can imagine that some processors of a group must do nothing (for example, if you have 1 group of size 16 and a work size of 41, then the group would process the first 16 work items, then the next 16, then the next 9, with 7 processors not doing anything--do they get dummy kernels?).
I checked the spec., and the only relevant mention of "get_global_id" is the same as the online documentation, which reads:
The global work-item ID specifies the work-item ID based on the number of global work-items specified to execute the kernel.
. . . based how?
So what is it? Is it safe to omit iff the array's size is a multiple of the work group size? What?
You have the right answer already, I think. If the global size of your kernel execution is the same as the array length, then this if statement is useless.
In general, that type of check is only needed for cases where you've partitioned your data in such a way that you know you might execute extra work items relative to your array size. In my experience, you can almost always avoid such cases.
I'm trying to understand the difference between working with just long, long2, long3, long4, long8, long16. Let's assume that my CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG is 2.
When should I be working with long, long2, long3, long4, long8, long16? Assume that I want my kernel to XOR a bunch of bitvectors of, say, length 500.
If using long, I need to XOR ceil(500/64)=8 long.
If using long2, I need to XOR ceil(500/128)=4 long2.
If using long8, I need to XOR ceil(500/512)=1 long8.
So what would be the difference between xorring long[8], long2[4] or long8? Is there any advantage of going beyond the CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG at all?
Edit: I added a small sample script to make it more clear: the for (int j=0; j<8; j++) loops over a vector of length 8, and I wondered if I'd better do this for one long8, four long2s or eight longs.
while (i < to) {
    const uint64_t* row = rows[rowIndex];
    uint64_t bitchange = i++;
    bitchange ^= i;
    rowIndex = 63 - __builtin_clzll(bitchange);
    uint64_t cardinality = 0;
    for (int j = 0; j < 8; j++) {
        curr[j] ^= row[j];
        cardinality += __builtin_popcountll(curr[j]);
    }
    popcountpolynomial[cardinality]++;
}
mfa is correct about the preferred width, but using wider vectors is usually good anyway. The device will sequentially issue instructions to process a wide vector in chunks of its native width, which is good because it helps hide the latency of each operation. This is much more true on GPUs than on CPUs; GPUs tend to have a lot of registers (> 1000).
Think of the preferred width as the width that guarantees you won't "waste" vector lanes on vector architecture processors - if the GPU has vector ALUs, issuing instructions that don't use the whole width (say, only use the first item in the vector), then the other lanes might go unused in that instruction, wasting potential computing power. Think of SSE where it could be doing 4 adds with one instruction, but you're only getting one number as a result because you're not using 3 of the 4 pieces of the vector.
OpenCL compilers (on vector ALU hardware) try to restructure your code to "vectorize" if you aren't using the full vector width, but obviously there are limitations to that.
Of course, only use the wider vectors when it feels natural in your algorithm. Never contort your program to try to use really wide vectors.
Using less registers is a good thing too though, if you're using too many registers, it may restrict the number of wavefronts/warps that can be run in parallel.
Using vectors can actually reduce register pressure if the auto-vectorizer fails to find a vectorized solution in scalar code, in the case that the hardware does use a vector ALU - you'll "waste" less vector lanes because more will fit in each register.
CL_DEVICE_PREFERRED_VECTOR_WIDTH_(type) for any data type is usually the most effective size for memory access. Many current GPU devices use a 128-bit cache line structure, which is why CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG often evaluates to 2. If you use long4, the memory operation may be broken into two smaller reads/writes on the device -- effectively blocking some threads from executing. I don't think there is an advantage to using vectors larger than the preferred size, but I can imagine a disadvantage. You should benchmark it on your device to see if this is true for you.
If the only operation you are doing is XOR, I suggest using longN (N = preferred size) and the 64-bit atomics to do the job. I hope your device supports 64 bit extended atomics. (cl_khr_int64_extended_atomics)