I currently use an AMD Hawaii GPU and have some questions about it.
In the specification of AMD Hawaii, it has:
2816 Processing Elements
44 Compute Units
I understood that this means it has 2816 threads and 44 work groups (64 threads in each group).
Is that correct?
I'm confused about the concepts of cores, threads, compute units, work groups, and processing elements.
No. You can (and should) have multiple work groups per CU and more than one thread per processing element. Each CU can hold up to 40 wavefronts of 64 threads each, so the maximum number of parallel threads is 44*40*64 = 112640. However, you often cannot use all of these threads: other resources might limit the maximum possible number of threads per CU. For example, there is only a limited number of registers per CU, and if each wavefront uses too many of them, the maximum number of parallel wavefronts is lower.
All threads of a work group are executed on the same CU, as this allows access to shared memory (LDS) and easy synchronization between the different wavefronts of each work group. You can choose the work-group size within certain limits. There is a hard limit (more does not work) of 256 threads per work group, and a soft limit (reduced performance if you use less) of one wavefront, i.e. 64 threads, per work group. Your work-group size should also be a multiple of the wavefront size, so 64, 128, 192, and 256 are the most common choices for work-group size. Everything else reduces the potential peak performance; however, depending on your problem, a different work-group size might still be better than forcing the problem into one of these choices.
Because each work group can only use up to 256 threads, multiple work groups can be executed on each CU in parallel. If you use the maximum work-group size of 256 threads, you need at least 112640/256 = 440 work groups in order to use all threads of the GPU. If you have more work groups, up to 440 of them will execute in parallel and the remaining groups will be executed once one of the earlier groups has finished. If you have fewer work groups, not all threads will be occupied, which can lead to decreased performance. If you pick smaller work groups, you will need more of them, e.g. 1760 work groups with a work-group size of 64.
Using too much of the shared memory (LDS) can also limit the number of work-groups per CU.
The processing elements execute the instructions. Under optimal conditions one instruction can be started per cycle.
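For illustration, the arithmetic above can be written out as a tiny standalone program; the numbers are specific to Hawaii and would need adjusting for other GPUs:

    #include <stdio.h>

    /* The occupancy arithmetic from the answer above, as a standalone
       program. All three constants are Hawaii-specific. */
    int main(void) {
        const int compute_units     = 44; /* CUs on Hawaii */
        const int wavefronts_per_cu = 40; /* hardware limit per CU */
        const int threads_per_wave  = 64; /* AMD wavefront size */

        const int max_threads = compute_units * wavefronts_per_cu * threads_per_wave;
        printf("max parallel threads:      %d\n", max_threads);       /* 112640 */
        printf("work-groups of 256 needed: %d\n", max_threads / 256); /* 440 */
        printf("work-groups of 64 needed:  %d\n", max_threads / 64);  /* 1760 */
        return 0;
    }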
I am learning OpenCL and using an RTX 2060.
Based on what I read online, the maximum number of work items for this device is 1024 and the maximum number of work items per work group is 64 (which means I can run 16 work groups of 64 work items, right?).
The question is: is there a limit on the number of work groups themselves? For example, can I run 32 work groups of 32 work items? 64 work groups of 16 work items? 512 work groups of 2 work items? (You get the idea.)
The vendor only specifies a value for the maximum size of a workgroup; for Nvidia this is usually 1024. But it is still allowed (although not really useful) to choose an even larger workgroup size. The larger the workgroup, the fewer registers (private variables) you can have per thread, and if you use too many (like many thousands, e.g. in a table), they spill into global memory, which makes things very slow. For details see here.
Note that workgroup size should be 32 or a multiple of 32 to best utilize the hardware.
There is no limit on the number of workgroups; you will only eventually run out of memory. In general, the more workgroups the better, because then the device is fully saturated and no SMs are idle at any time. It is not uncommon to have 2 million workgroups of 32 threads each.
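As a rough sketch (not taken from the original answer), enqueueing 2 million workgroups of 32 threads each could look like this on the host side; queue and kernel are assumed to come from the usual platform/context/program setup, and error handling is omitted:

    #include <CL/cl.h>

    /* Sketch: enqueue `kernel` as 2 million work-groups of 32 work-items
       each. `queue` and `kernel` are assumed to have been created by the
       usual platform/device/context/program boilerplate, omitted here. */
    cl_int launch_many_groups(cl_command_queue queue, cl_kernel kernel)
    {
        size_t global_size = (size_t)2 * 1024 * 1024 * 32; /* total work-items */
        size_t local_size  = 32;                           /* work-group size */
        return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                      &global_size, &local_size,
                                      0, NULL, NULL);
    }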
What happens when, for example, I set my number of
workgroups to 5120 and localsize 1
workgroups to 2560 and localsize 2
workgroups to 640 and localsize 4
How does this influence my number of work-items and access to resources?
You will have 5120 threads. 5120 groups. 1 thread per group. Each group (1 thread) will take one processor. You can't synchronize any of them (in the traditional sense).
You will have 2560 threads. 1280 groups. 2 threads in each group. Each group (2 threads) will take one processor. You can synchronize these two threads (in the traditional sense).
You will have 640 threads. 160 groups. 4 threads in each group. Each group (4 threads) will take one processor. You can synchronize these four threads (in the traditional sense).
In OpenCL you need to express the global Work Size in terms of the total number of threads. The underlying OpenCL API will look at the global Work Size and divide by the local Work Size to figure out your thread arrangement.
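To make that division concrete, here is a small standalone sketch of the three configurations from the question, with the first number interpreted as the global work size (as this answer does):

    #include <stdio.h>

    /* The three configurations from the question, interpreted as the
       answer does: the first number is the global work size. */
    int main(void) {
        const size_t global[3] = {5120, 2560, 640};
        const size_t local[3]  = {1, 2, 4};
        for (int i = 0; i < 3; ++i)
            printf("global=%zu local=%zu -> %zu groups of %zu threads\n",
                   global[i], local[i], global[i] / local[i], local[i]);
        return 0;
    }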
Now (this is a general suggestion; there might be cases where you need to do it, but for now...):
The first configuration is a terrible idea. Clearly. You are wasting your processor's time by giving it 1 thread at a time. While this might not be the end of the world for CPUs, it is for modern GPUs. Why? Because each processor on your GPU has a number of cores, all ready for action, and only one of them works in this case. Plus, you have no way of synchronizing threads if the need arises.
The second configuration: same thing.
The third configuration: same thing.
If I remember correctly, NVIDIA suggests at least 32 threads in a group to get the best performance.
I was reading some results, and there I saw that they used 5120 work-groups and a local size of 1. I have limited knowledge of OpenCL, and I was wondering if this statement is correct:
As can be seen for the GPU, the first test has 5120 work-groups, with 1 work-item each. This means that the threads which are executed in parallel are limited to the amount of computing units there are in the machine. For example if a GPU has 20 computing units there can only be a maximum of 20 threads which are working in parallel. Though when the local size is increased to 2, twice the amount of threads are run simultaneously.
From reading some info on OpenCL, it seems about right, though I need a second opinion.
Update: Hmm, nat chouf's comment is right; I understood the question as "in flight at the same time" instead of "physically executed at the same time".
As I wrote, several work-groups can be scheduled at a given time in a single compute unit. The number of such "in-flight" work-groups is limited by the available resources (local memory, registers, etc.) on each compute unit.
In existing implementations (afaik) a compute unit will pick a block (warp/wavefront) of work-items from the same work-group for execution, among all blocks in flight in the compute unit. One "instruction" of this block is inserted in the pipeline (it may take several cycles, and each "instruction" may correspond to several operations in each work-item), and then another block is picked.
So, yes, if work-group size is 1, only 1 work-item per compute unit will be physically started simultaneously. But potentially all work-items may be in-flight in the GPU at the same time.
I know that work items are grouped into work groups, and you cannot synchronize outside of a work group.
Does it mean that work items are executed in parallel?
If so, is it possible/efficient to make 1 work group with 128 work items?
The work items within a group will be scheduled together, and may run together. It is up to the hardware and/or drivers to choose how parallel the execution actually is. There are different reasons for this, but one very good one is to hide memory latency.
On my AMD card, the 'compute units' are each divided into 16 4-wide SIMD units. This means that 16 work items can technically be run at the same time in the group. It is recommended that we use multiples of 64 work items in a group, to hide memory latency. Clearly they cannot all be run at the exact same time. This is not a problem, because most kernels are in fact memory-bound, so the scheduler (hardware) will swap out the work items waiting on the memory controller while the 'ready' items get their compute time. The actual number of work items in a group is set by the host program and limited by CL_DEVICE_MAX_WORK_GROUP_SIZE. You will need to experiment to find the optimal work-group size for your kernel.
The CPU implementation is 'worse' when it comes to simultaneous work items: there are only ever as many work items running as you have cores available to run them on. They behave more sequentially on the CPU.
So do work items run at exactly the same time? Almost never. This is why we need to use barriers when we want to be sure they pause at a given point.
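To illustrate the barrier point, here is a minimal example kernel (a sketch; the kernel name and the in-group reversal are just for illustration). Each work item writes one element into local memory, and the barrier guarantees all writes have completed before any work item reads a neighbour's slot:

    /* Reverse each work-group's chunk of `data` via local memory.
       `tmp` must be sized by the host to one float per work item
       (clSetKernelArg with a NULL pointer and the byte size). */
    __kernel void reverse_in_group(__global float *data, __local float *tmp)
    {
        const size_t lid = get_local_id(0);
        const size_t gid = get_global_id(0);
        const size_t lsz = get_local_size(0);

        tmp[lid] = data[gid];           /* write own element */
        barrier(CLK_LOCAL_MEM_FENCE);   /* wait for the whole work group */
        data[gid] = tmp[lsz - 1 - lid]; /* read a neighbour's element */
    }

Without the barrier, a work item could read tmp[lsz - 1 - lid] before its neighbour has written it, producing stale or undefined data.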
In the (abstract) OpenCL execution model, yes, all work items execute in parallel, and there can be millions of them.
Inside a GPU, all work items of the same work group must be executed on a single "core". This puts a physical restriction on the number of work items per work group (256 or 512 is typically the maximum, but it can be smaller for large kernels that use a lot of registers). All work groups are then scheduled onto the (usually 2 to 16) cores of the GPU.
You can synchronize threads (work items) inside a work group, because they all are resident in the same core, but you can't synchronize threads from different work groups, since they may not be scheduled at the same time, and could be executed on different cores.
Yes, it is possible to have 128 work items inside a work group, unless it consumes too many resources. To reach maximum performance, you usually want to have the largest possible number of threads in a work group (at least 64 are required to hide memory latency, see Vasily Volkov's presentations on this subject).
The idea is that they can be executed in parallel if possible (whether they actually will be executed in parallel depends on the hardware and drivers).
Yes, work items are executed in parallel.
To get the maximum possible number of work items per work group, use clGetDeviceInfo with CL_DEVICE_MAX_WORK_GROUP_SIZE; it depends on the hardware.
Whether it's efficient or not primarily depends on the task you want to implement. If you need a lot of synchronization, it may be that OpenCL does not fit your task. I can't say much more without knowing what you actually want to do.
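For reference, a minimal query sketch might look like this, assuming dev was obtained earlier via clGetDeviceIDs (error handling omitted):

    #include <CL/cl.h>
    #include <stdio.h>

    /* Query the device-wide work-group size limit. `dev` is assumed to
       have been obtained earlier via clGetDeviceIDs. */
    void print_max_wg_size(cl_device_id dev)
    {
        size_t max_wg = 0;
        clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(max_wg), &max_wg, NULL);
        printf("CL_DEVICE_MAX_WORK_GROUP_SIZE: %zu\n", max_wg);
    }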
The work-items in a given work-group execute concurrently on the processing elements of a single compute unit.
I'm new to GPGPU programming and I'm working with the NVIDIA implementation of OpenCL.
My question is how to compute the limit of a GPU device (in number of threads).
From what I understood, there are a number of work-groups (the equivalent of blocks in CUDA) that each contain a number of work-items (~ CUDA threads).
How do I get the number of work-groups present on my card (and that can run at the same time), and the number of work-items present in one work-group?
To what does CL_DEVICE_MAX_COMPUTE_UNITS correspond?
The Khronos specification speaks of cores ("The number of parallel compute cores on the OpenCL device."). What is the difference between these and the CUDA cores given in the specification of my graphics card? In my case, OpenCL reports 14, while my GeForce 8800 GT has 112 cores according to the NVIDIA website.
Does CL_DEVICE_MAX_WORK_GROUP_SIZE (512 in my case) correspond to the total number of work-items given to a specific work-group, or to the number of work-items that can run at the same time in a work-group?
Any suggestions would be extremely appreciated.
The OpenCL standard does not specify how the abstract execution model provided by OpenCL is mapped to the hardware. You can enqueue any number T of threads (work items), and provide a workgroup size (WG), with at least the following constraints (see OpenCL spec 5.7.3 and 5.8 for details):
WG must divide T
WG must be at most DEVICE_MAX_WORK_GROUP_SIZE
WG must be at most the CL_KERNEL_WORK_GROUP_SIZE returned by clGetKernelWorkGroupInfo; it may be smaller than the device max workgroup size if the kernel consumes a lot of resources. (A query sketch follows after this list.)
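For illustration, the queries behind these limits might look like this; dev and kernel are assumed to exist already, and error checks are omitted:

    #include <CL/cl.h>
    #include <stdio.h>

    /* Sketch of the limit queries above. `dev` and `kernel` are assumed
       to have been created earlier; error handling is omitted. */
    void print_limits(cl_device_id dev, cl_kernel kernel)
    {
        cl_uint cus = 0;
        size_t kernel_max_wg = 0;

        clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(cus), &cus, NULL);
        clGetKernelWorkGroupInfo(kernel, dev, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(kernel_max_wg), &kernel_max_wg, NULL);

        printf("compute units:              %u\n", cus);
        printf("kernel max work-group size: %zu\n", kernel_max_wg);
    }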
The implementation manages the execution of the kernel on the hardware. All threads of a single workgroup must be scheduled on a single "multiprocessor", but a single multiprocessor can manage several workgroups at the same time.
Threads inside a workgroup are executed by groups of 32 (NVIDIA warp) or 64 (AMD wavefront). Each micro-architecture does this in a different way. You will find more details in NVIDIA and AMD forums, and in the various docs provided by each vendor.
To answer your question: there is no limit to the number of threads. In the real world, your problem is limited by the size of the inputs/outputs, i.e. the size of the device memory. To process a 4 GB buffer of floats, you can enqueue 1G threads with WG = 256, for example. The device will then have to schedule 4M workgroups on its small number (say, between 2 and 40) of multiprocessors.
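Written out, the arithmetic from that example (one work-item per float, purely illustrative):

    #include <stdio.h>

    /* The arithmetic from the example above: one work-item per float. */
    int main(void) {
        const unsigned long long buffer_bytes = 4ULL << 30;              /* 4 GB */
        const unsigned long long threads = buffer_bytes / sizeof(float); /* 1G  */
        const unsigned long long wg_size = 256;
        printf("work-items:  %llu\n", threads);
        printf("work-groups: %llu\n", threads / wg_size); /* 4M */
        return 0;
    }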