The description is:
Return type: cl_uint
Maximum dimensions that specify the global and local work-item IDs
used by the data parallel execution model. (Refer to
clEnqueueNDRangeKernel). The minimum value is 3.
The description for work_dim in clEnqueueNDRangeKernel is:
work_dim: The number of dimensions used to specify the global
work-items and work-items in the work-group. work_dim must be greater
than zero and less than or equal to three.
So if work_dim can never be greater than three, the maximum dimensions will never be greater than three, right?
Most probably it's a typo in version 1.0, as suggested by Simon Richter, and it seems it was corrected. Note that starting with version 1.1 the explanation given for work_dim is:
The number of dimensions used to specify the global work-items and work-items in the work-group. work_dim must be greater than zero and less than or equal to CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS.
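For reference, a minimal host-side sketch of querying that limit; "device" is assumed to be a valid cl_device_id obtained elsewhere:

cl_uint max_dims = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS,
                sizeof(max_dims), &max_dims, NULL);
printf("CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = %u\n", max_dims);  /* guaranteed to be at least 3 */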
Related
I want to design a kernel in which I can pass an array of floats and have them all come out with the maximum being 1.0 and the minimum being 0.0. Theoretically, each element would be mapped to something like (x-min)/(max-min). How can I parallelize this?
A simple solution would be to split the problem into 2 kernels:
Reduction kernel
Divide your array into chunks of N * M elements each, where N is the number of work-items per group, and M is the number of array elements processed by each work-item.
Each work-item computes the min() and max() of its M items.
Within the workgroup, perform a parallel reduction of min and max across the N work-items, giving you the min/max for each chunk.
With those values obtained, one of the items in the group can use atomics to update global min/max values. Given that you are using floats, you will need to use the well-known workaround for the lack of atomic min/max/CAS operations on floats (a sketch follows below).
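To make the first kernel concrete, here is a rough OpenCL C sketch under a few assumptions: each work-item walks the array with a global stride rather than a contiguous chunk of M elements, the local size is a power of two, and the global results live in two uint buffers that the host has initialised to the bit patterns of +INFINITY and -INFINITY (0x7F800000 and 0xFF800000). All names (minmax_reduce, atomic_min_float, lmin, lmax, ...) are made up for this sketch.

/* CAS workaround for the missing float atomic min: reinterpret the bits and
   retry with atomic_cmpxchg until our candidate is no longer smaller. */
void atomic_min_float(volatile __global uint *addr, float val)
{
    uint old = *addr;
    while (as_float(old) > val) {
        uint prev = atomic_cmpxchg(addr, old, as_uint(val));
        if (prev == old) break;   /* our value was stored */
        old = prev;               /* someone else updated; re-check */
    }
}

/* Same idea with the comparison flipped. */
void atomic_max_float(volatile __global uint *addr, float val)
{
    uint old = *addr;
    while (as_float(old) < val) {
        uint prev = atomic_cmpxchg(addr, old, as_uint(val));
        if (prev == old) break;
        old = prev;
    }
}

__kernel void minmax_reduce(__global const float *data, uint count,
                            __global uint *gmin_bits, __global uint *gmax_bits,
                            __local float *lmin, __local float *lmax)
{
    uint lid = get_local_id(0);

    /* Each work-item computes a private min/max over its share of the array. */
    float vmin = INFINITY, vmax = -INFINITY;
    for (uint i = get_global_id(0); i < count; i += get_global_size(0)) {
        vmin = fmin(vmin, data[i]);
        vmax = fmax(vmax, data[i]);
    }
    lmin[lid] = vmin;
    lmax[lid] = vmax;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Parallel reduction across the N work-items of the group. */
    for (uint s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s) {
            lmin[lid] = fmin(lmin[lid], lmin[lid + s]);
            lmax[lid] = fmax(lmax[lid], lmax[lid + s]);
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* One item per group publishes the chunk's min/max atomically. */
    if (lid == 0) {
        atomic_min_float(gmin_bits, lmin[0]);
        atomic_max_float(gmax_bits, lmax[0]);
    }
}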
Application
After your first kernel has completed, you know that the global min and max values must be correct. You can compute your scale factor and normalisation offset, and then kick off as many work items as your array has elements, to multiply/add each array element to adjust it.
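The second kernel is then trivial. A sketch, where the kernel and parameter names are illustrative and scale is computed on the host as 1.0f / (max - min):

__kernel void normalise(__global float *data, uint count, float minval, float scale)
{
    uint i = get_global_id(0);
    if (i < count)                              /* guard in case the global size was padded */
        data[i] = (data[i] - minval) * scale;   /* maps min -> 0.0 and max -> 1.0 */
}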
Tweak your values for N and M to find an optimum for a given OpenCL implementation and hardware combination. (Note that M = 1 may be the optimum, i.e. launching straight into the parallel reduction.)
Having to synchronise between the two kernels is not ideal but I don't really see a way around that. If you have multiple independent arrays to process, you can hide the synchronisation overhead by submitting them all in parallel.
In clEnqueueNDRangeKernel, should the global_work_size parameter be a power of 2?
If it does not have to be, and it is not a power of two, which error (if any) is returned?
UPD
Based on the answers: global and local work sizes do not need to be powers of two.
What about the relation between the work-group size and the wavefront size?
If the wavefront size is 64 and local_work_size < 64, then in each lock-step 64 work-items will execute, while (64 - local_work_size) of them "do nothing".
If 128 > local_work_size > 64, how will the execution proceed? In one lock-step an entire wavefront (64 work-items) will execute, and in the other only local_work_size % 64.
It's not necessary that the global work size be a power of 2; it can be any positive integer not exceeding the maximum number of work-items allowed by the device.
The value doesn't need to be a power of 2, but it has to be divisible by the work-group size.
As already said, it does not have to be a power of 2. But in order to get good performance, you have to choose a local work size that is a multiple of 32 (see this related question: Questions about global and local work size).
Therefore, as your local work size must be a divisor of your global work size, you will likely end up with a power of two (one source of optimization is to choose a global work size bigger than necessary; to find a good local work size, you have to try a few). A host-side sketch of that idea follows.
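A small host-side sketch of the padding approach; queue, kernel and the element count n are assumed to exist already, and the kernel must then ignore work-items with an id beyond n:

size_t local_size  = 64;                                            /* a multiple of 32, within device/kernel limits */
size_t global_size = ((n + local_size - 1) / local_size) * local_size;  /* round n up to a multiple of local_size */
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &global_size, &local_size, 0, NULL, NULL);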
I cannot understand what work_dim is for in clEnqueueNDRangeKernel().
So, what is the difference between work_dim=1 and work_dim=2?
And why work items are grouped into work groups?
Is a work-item or a work-group a thread running on the device (or neither)?
Thanks ahead!
work_dim is the number of dimensions for the clEnqueueNDRangeKernel() execution.
If you specify work_dim = 1, then the global and local work sizes are unidimensional. Thus, inside the kernels you can only access info in the first dimension, e.g. get_global_id(0), etc.
If you specify work_dim = 2 or 3, then you must also specify 2 or 3 dimensional global and local worksizes; in such case, you can access info inside the kernels in 2 or 3 dimensions, e.g. get_global_id(1), or get_group_id(2).
In practice you can do everything in 1D, but for dealing with 2D or 3D data, it may be simpler to directly use 2/3-dimensional kernels; for example, in the case of 2D data such as an image, if each thread/work-item is to deal with a single pixel, each thread/work-item could deal with the pixel at coordinates (x,y), with x = get_global_id(0) and y = get_global_id(1), as in the sketch below.
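A tiny OpenCL C sketch of that 2D case; the kernel name and the flat row-major pixel layout are illustrative:

__kernel void invert(__global uchar *pixels, int width, int height)
{
    int x = get_global_id(0);   /* column index, first dimension */
    int y = get_global_id(1);   /* row index, second dimension */
    if (x < width && y < height)
        pixels[y * width + x] = (uchar)(255 - pixels[y * width + x]);
}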
A work-item is a thread, while work-groups are groups of work-items/threads.
I believe the division into work-groups / work-items is related to the hardware architecture of GPUs and other accelerators (e.g. Cell/BE); you can map the execution of work-groups to GPU Streaming Multiprocessors (in NVIDIA talk) or SPUs (in IBM/Cell talk), while the corresponding work-items would run inside the execution units of the Streaming Multiprocessors and/or SPUs. It's not uncommon to have work group size = 1 if you are executing kernels on a CPU (e.g. for a quad-core, you would have 4 work groups, each one with one work item - though in my experience it's usually better to have more workgroups than CPU cores).
Check the OpenCL reference manual, as well as the OpenCL manual for whichever device you are programming. The quick reference card is also very helpful.
When I change the work group size from 16 to 32 or something bigger, I get a CL_INVALID_WORK_GROUP_SIZE error. matrix_size is 64.
localWorkSize[0] = groupsize;
localWorkSize[1] = localWorkSize[0];
globalWorkSize[0] = matrix_size;
globalWorkSize[1] = globalWorkSize[0];
First I checked the documentation for clEnqueueNDRangeKernel, which states four (five) different causes of CL_INVALID_WORK_GROUP_SIZE, but I think none of them apply. Please check my conclusions. (I hope you don't mind my Q&A style.)
Q CL_INVALID_WORK_GROUP_SIZE if local_work_size is specified and the number of work-items specified by global_work_size is not evenly divisible by the size of the work-group given by local_work_size
A 64 % 32 = 0
Q or does not match the work-group size specified for kernel using the __attribute__((reqd_work_group_size(X, Y, Z))) qualifier in program source.
A As I understood the help, I did not use __attribute__.
Q CL_INVALID_WORK_GROUP_SIZE if local_work_size is specified and the total number of work-items in the work-group computed as local_work_size[0] *... local_work_size[work_dim - 1] is greater than the value specified by CL_DEVICE_MAX_WORK_GROUP_SIZE in the table of OpenCL Device Queries for clGetDeviceInfo.
A I queried clGetDeviceInfo and CL_DEVICE_MAX_WORK_GROUP_SIZE is 512, 512, 64
Q CL_INVALID_WORK_GROUP_SIZE if local_work_size is NULL and the __attribute__((reqd_work_group_size(X, Y, Z))) qualifier is used to declare the work-group size for kernel in the program source.
A local_work_size is not NULL.
Q CL_INVALID_WORK_ITEM_SIZE if the number of work-items specified in any of local_work_size[0], ... local_work_size[work_dim - 1] is greater than the corresponding values specified by CL_DEVICE_MAX_WORK_ITEM_SIZES[0], .... CL_DEVICE_MAX_WORK_ITEM_SIZES[work_dim - 1].
A 32 < 512
I hope I haven't overlooked something. Please tell me if you have an idea what could cause the CL_INVALID_WORK_GROUP_SIZE, or if you find an error in my conclusions.
Thanks for taking the time to read all this :)
CL_DEVICE_MAX_WORK_GROUP_SIZE should return a single size_t value (for example 512, but I don't know what it'd be on your system). This is the maximum number of work-items in a work-group, not the maximum in each dimension. So in your case you are trying to make a 2D work-group with 32*32 = 1024 work-items, and presumably CL_DEVICE_MAX_WORK_GROUP_SIZE is less than 1024 on your system.
See the OpenCL 1.1 spec, table 4.3, page 37, the definition of CL_DEVICE_MAX_WORK_GROUP_SIZE:
Maximum number of work-items in a work-group executing a kernel using the data parallel execution model.
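A quick sketch of the two queries, to show that the device-wide limit is a single size_t while the per-dimension limits come from CL_DEVICE_MAX_WORK_ITEM_SIZES; "device" is assumed to be a valid cl_device_id and 3 work-item dimensions are assumed:

size_t max_wg = 0, max_items[3] = {0};
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(max_wg), &max_wg, NULL);
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(max_items), max_items, NULL);
/* A 32x32 local size needs max_wg >= 1024; with max_wg == 512, 16x16 = 256 still fits. */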
I had the same problem when I was trying to run my kernel on the CPU. I couldn't set a work group size of more than 128, while CL_DEVICE_MAX_WORK_GROUP_SIZE was returning 1024.
After a bit of searching to find out where the 128 was coming from, it turned out that CL_KERNEL_WORK_GROUP_SIZE was giving the proper value.
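For reference, a sketch of that per-kernel query; kernel and device are assumed to be valid handles:

size_t kernel_wg = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(kernel_wg), &kernel_wg, NULL);
/* The usable work-group size is bounded by both this value and
   CL_DEVICE_MAX_WORK_GROUP_SIZE; take the smaller of the two. */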
Searching through the NVIDIA forums I found these questions, which are also of interest to me, but nobody had answered them in the last four days or so. Can you help?
Original Forum Post
Digging into OpenCL reading tutorials some things stayed unclear for me. Here is a collection of my questions regarding local and global work sizes.
Must the global_work_size be smaller than CL_DEVICE_MAX_WORK_ITEM_SIZES?
On my machine CL_DEVICE_MAX_WORK_ITEM_SIZES = 512, 512, 64.
Is CL_KERNEL_WORK_GROUP_SIZE the recommended work_group_size for the used kernel?
Or is this the only work_group_size the GPU allows?
On my machine CL_KERNEL_WORK_GROUP_SIZE = 512
Do I need to divide the work into work-groups, or can I have just one work-group by not specifying local_work_size?
To what do I have to pay attention, when I only have one work group?
What does CL_DEVICE_MAX_WORK_GROUP_SIZE mean?
On my machine CL_DEVICE_MAX_WORK_GROUP_SIZE = 512, 512, 64
Does this mean I can have one work group which is as large as CL_DEVICE_MAX_WORK_ITEM_SIZES?
Does global_work_size have to be a divisor of CL_DEVICE_MAX_WORK_ITEM_SIZES?
In my code global_work_size = 20.
In general you can choose global_work_size as big as you want, while local_work_size is constrained by the underlying device/hardware, so all the query results will tell you the possible dimensions for local_work_size rather than for global_work_size. The only constraint on global_work_size is that it must be a multiple of local_work_size (in each dimension).
The work-group sizes specify the sizes of the workgroups, so if CL_DEVICE_MAX_WORK_ITEM_SIZES is 512, 512, 64, that means your local_work_size can't be bigger than 512 for the x and y dimensions and 64 for the z dimension.
However, there is also a constraint on the local group size depending on the kernel. This is expressed through CL_KERNEL_WORK_GROUP_SIZE. Your cumulative workgroup size (as in the product of all dimensions, e.g. 256 if you have a local size of 16, 16, 1) must not be greater than that number. This is due to the limited hardware resources to be divided between the threads (from your query results I assume you are programming on an NVIDIA GPU, so the amount of local memory and registers used by a thread will limit the number of threads which can be executed in parallel).
CL_DEVICE_MAX_WORK_GROUP_SIZE defines the maximum size of a work group in the same manner as CL_KERNEL_WORK_GROUP_SIZE, but specific to the device instead of the kernel (and it should be a scalar value, i.e. 512).
You can choose not to specify local_work_size, in which case the OpenCL implementation will choose a local work group size for you (so it's not guaranteed that it uses only one workgroup). However, it's generally not advisable, since you don't know how your work is divided into workgroups, and furthermore it's not guaranteed that the chosen workgroup size will be optimal.
However, you should note that using only one workgroup is generally not a good idea performance-wise (and why use OpenCL if performance is not a concern?). In general a workgroup has to execute on one compute unit, while most devices will have more than one (modern CPUs have 2 or more, one for each core, while modern GPUs can have 20 or more). Furthermore, even the one compute unit on which your workgroup executes might not be fully used, since several workgroups can execute on one compute unit in an SMT style. To use NVIDIA GPUs optimally you need 768/1024/1536 threads (depending on the generation, meaning G80/GT200/GF100) executing on one compute unit, and while I don't know the numbers for AMD right now, they are of the same magnitude, so it's good to have more than one workgroup. Furthermore, for GPUs, it's typically advisable to have workgroups with at least 64 threads (and a number of threads per workgroup divisible by 32/64 (NVIDIA/AMD)), because otherwise you will again have reduced performance (32/64 is the minimum granularity of execution on GPUs, so if you have fewer items in a workgroup, it will still execute as 32/64 threads, but the results from the unused threads will be discarded). The usual pattern for handling the resulting padding is sketched below.
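A small OpenCL C sketch of that pattern: the host pads the global size up to a multiple of the work-group size (say 64), and the kernel discards the extra work-items; the kernel name and parameters are illustrative:

__kernel void scale_by_two(__global float *data, uint count)
{
    uint i = get_global_id(0);
    if (i >= count) return;   /* padded work-items past the end of the array do nothing */
    data[i] *= 2.0f;
}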