What happens when, for example, I set my number of
workgroups to 5120 and local size to 1,
workgroups to 2560 and local size to 2,
workgroups to 640 and local size to 4?
How does this influence the number of work-items and their access to resources?
You will have 5120 threads in 5120 groups, 1 thread per group. Each group (1 thread) will occupy one processor. You can't synchronize any of them (in the traditional sense).
You will have 2560 threads in 1280 groups, 2 threads per group. Each group (2 threads) will occupy one processor. You can synchronize those two threads (in the traditional sense).
You will have 640 threads in 160 groups, 4 threads per group. Each group (4 threads) will occupy one processor. You can synchronize those four threads (in the traditional sense).
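Within a work-group, that synchronization is normally expressed with a barrier. A minimal OpenCL C sketch (the kernel name, its arguments, and the per-group reduction it performs are made up for illustration):

    // Hypothetical kernel: the work-items of a group cooperate through
    // local memory; barrier() synchronizes all work-items in the group.
    __kernel void pair_sum(__global const float *in,
                           __global float *out,
                           __local float *scratch)
    {
        size_t lid = get_local_id(0);
        size_t gid = get_global_id(0);

        scratch[lid] = in[gid];
        // Every work-item in this work-group waits here before reading
        // the local-memory writes of the others.
        barrier(CLK_LOCAL_MEM_FENCE);

        if (lid == 0) {
            float sum = 0.0f;
            for (size_t i = 0; i < get_local_size(0); ++i)
                sum += scratch[i];
            out[get_group_id(0)] = sum;
        }
    }

(The __local buffer would be sized from the host via clSetKernelArg with a NULL argument value.)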
In OpenCL you need to express the global work size as the total number of threads. The OpenCL runtime then divides the global work size by the local work size to figure out your work-group arrangement.
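For example, the first case above might be launched from the host like this (a sketch; `queue` and `kernel` are assumed to have been created already):

    #include <CL/cl.h>

    /* Assumes `queue` and `kernel` were created earlier. */
    void launch_first_case(cl_command_queue queue, cl_kernel kernel)
    {
        size_t global_work_size = 5120; /* total work-items (threads) */
        size_t local_work_size  = 1;    /* work-items per work-group  */

        /* The runtime derives the group count:
           5120 / 1 = 5120 work-groups. */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_work_size, &local_work_size,
                               0, NULL, NULL);
    }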
Now, as a general suggestion (there might be cases where you need to do it, but for now...), the first configuration is clearly a terrible idea. You are wasting your processors' time by giving them 1 thread at a time. While this might not be the end of the world for CPUs, it is for modern GPUs. Why? Because each processor on your GPU has a number of cores, all ready for action, and only one of them works in this case. Plus, you have no way of synchronizing threads if the need arises.
The second and third configurations share the same problem, just less severely: 2 or 4 threads per group still leave most of each processor's cores idle.
If I remember correctly, NVIDIA suggests at least 32 threads in a group to get the best performance.
I currently use an AMD Hawaii GPU and have some questions about it.
Its specification lists:
2816 processing elements
44 compute units
I understood this to mean it has 2816 threads and 44 work-groups (64 threads in each group).
Is that correct?
I'm confused about the concepts of cores, threads, compute units, work-groups and processing elements.
No. You can (and should) have multiple work-groups per CU and more than one thread per processing element. Each CU can hold up to 40 wavefronts of 64 threads each, so the maximum number of parallel threads is 44 * 40 * 64 = 112640. However, you often cannot use all of these threads, because other resources limit the maximum possible number of threads per CU. There is only a limited number of registers per CU, and if each wavefront uses too many of them, the maximum number of parallel wavefronts is lower.
Each work-group is executed on a single CU, as this allows access to shared memory (LDS) and easy synchronization between the different wavefronts of the work-group. You can choose the work-group size within certain limits. There is a hard limit (more doesn't work) of 256 threads per work-group and a soft limit (reduced performance if you use fewer) of one wavefront, i.e. 64 threads, per work-group. Your work-group size should also be a multiple of the wavefront size, so 64, 128, 192 and 256 are the most common choices. Anything else reduces the potential peak performance; however, depending on your problem, a different work-group size might still be better than forcing the problem into one of these choices.
Because each work-group can use at most 256 threads, multiple work-groups can execute on each CU in parallel. If you use the maximum work-group size of 256 threads, you need at least 112640 / 256 = 440 work-groups in order to use all threads of the GPU. If you have more work-groups, up to 440 of them will execute in parallel and the remaining groups will be started as older groups finish. If you have fewer work-groups, not all threads will be occupied, which can lead to decreased performance. If you pick smaller work-groups, you will need more of them, e.g. 1760 work-groups with a work-group size of 64.
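The same back-of-envelope arithmetic as a small, self-contained C sketch, using the Hawaii figures from above:

    #include <stdio.h>

    int main(void)
    {
        /* Figures for AMD Hawaii, as discussed above. */
        const int compute_units     = 44;
        const int wavefronts_per_cu = 40;   /* upper limit per CU */
        const int wavefront_size    = 64;
        const int work_group_size   = 256;  /* hard maximum per work-group */

        int max_threads   = compute_units * wavefronts_per_cu * wavefront_size;
        int groups_needed = max_threads / work_group_size;

        printf("max parallel threads: %d\n", max_threads);   /* 112640 */
        printf("work-groups needed:   %d\n", groups_needed); /* 440 */
        return 0;
    }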
Using too much shared memory (LDS) per work-group can also limit the number of work-groups per CU.
The processing elements execute the instructions. Under optimal conditions one instruction can be started per cycle.
I have a C program (acoustic wave solver) that is parallelized with MPI. However, I've been testing the speed up on various numbers of cores and I've noticed something strange. If I use N processes where N is the number of available cores in the machine, then I do not see a performance improvement over the next step down.
So on my 8 core machine then I see speedup from 1 process to 2 processes to 4 processes, but not from 4 to 8. Similarly on my 4 core laptop I see speedup from 1 to 2, but not from 2 to 4.
Any idea what could be causing this?
Many modern (Intel) CPUs run two hyperthreads on a single physical core. The core counts you are referencing are actually the numbers of hardware threads that are available, not the numbers of physical execution units.
As long as you are using a number of processes that is smaller than or equal to the number of physical cores, the processes will (or at least should) be distributed to use all of the available cores. But as soon as all physical cores are taken, additional processes will share a physical core with another process.
It is not possible to give a definitive answer on whether using all hardware threads will improve your performance at all, or by how much; that strongly depends on the code you are running. A very nice answer to a similar question is given on superuser.com. Essentially, if your processes are memory-bound or use different parts of the CPU (integer/floating-point arithmetic, video encoding, vector processing, ...) and communication overhead is small, you might even get perfect scaling. Code that is CPU-bound and only does one type of computation might not improve at all, or might even take longer due to communication overhead.
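As an illustration, the processor count most tools report is the logical (hardware-thread) count, not the physical core count; a minimal C sketch for Linux/glibc:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Reports logical processors (hardware threads), not physical
           cores. On an "8-core" machine with 2-way hyperthreading this
           prints 8 even though there are only 4 physical cores. */
        long logical = sysconf(_SC_NPROCESSORS_ONLN);
        printf("logical processors: %ld\n", logical);
        return 0;
    }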
I have read documentation and books
(also these posts: OpenCL: query number of processing elements ; Understanding work-items and work-groups ; OpenCL: Work items, Processing elements, NDRange)
about the execution model and the theory of data partitioning with NDRange.
Do I build my work-items and work-groups based on my hardware? If so, how can I query how many work-items and work-groups are available on a device? Is there a good practice for how to divide work-items and work-groups to achieve good performance?
I would like to know how they work and interact in practice, for computation on a one-dimensional array and on a two-dimensional array like an image.
Good partitioning requires knowledge of your GPU hardware. For example, let's look at AMD cards like the Radeon 6970. The overall number of cores is 1536. They are packed into 24 SIMD units, each consisting of 16 stream processors with a VLIW4 architecture. So we have 16 * 4 (because of VLIW4) * 24 = 1536 cores. Every SIMD unit shares some resources (caches, etc.) among all the cores within it. Hence, a good size for a local group in the case of the Radeon 6970 is some multiple of 64. You can query your OpenCL device for the number of compute units; in our case, you should get 24. So, for OpenCL on the Radeon 6970, compute unit = SIMD unit. Please take into account that manual partitioning may cause performance drops on devices with a different architecture.
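That query might look like this (a sketch; `device` is assumed to have been obtained via clGetDeviceIDs):

    #include <stdio.h>
    #include <CL/cl.h>

    /* Assumes `device` was obtained earlier via clGetDeviceIDs(). */
    void print_compute_units(cl_device_id device)
    {
        cl_uint compute_units = 0;
        if (clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                            sizeof(compute_units), &compute_units,
                            NULL) == CL_SUCCESS)
            printf("compute units: %u\n", compute_units); /* 24 on a Radeon 6970 */
    }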
A good example of the benefits of local groups can be found in the Nvidia developer zone. Take a look at the bitonic sort sample code, which shows how to use local groups.
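For the two-dimensional (image-like) case asked about above, the usual pattern is a 2D NDRange in which each work-item handles one pixel. A minimal OpenCL C sketch (the kernel and its buffer layout are illustrative):

    // Each work-item processes one pixel of a width x height image
    // stored row-major in a flat buffer.
    __kernel void invert(__global const uchar *src,
                         __global uchar *dst,
                         const int width,
                         const int height)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        if (x < width && y < height)
            dst[y * width + x] = 255 - src[y * width + x];
    }

On the host you would enqueue this with work_dim = 2, a global size of at least {width, height} rounded up to a multiple of the local size, and a local size such as {8, 8}, i.e. 64 work-items, matching the multiple-of-64 advice above.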
I was reading some results, and there I saw that they used 5120 work-groups and a local size of 1. I have limited knowledge of OpenCL, and I was wondering if this statement is correct:
"As can be seen for the GPU, the first test has 5120 work-groups, with 1 work-item each. This means that the threads which are executed in parallel are limited to the amount of computing units there are in the machine. For example if a GPU has 20 computing units there can only be a maximum of 20 threads which are working in parallel. Though when the local size is increased to 2, twice the amount of threads are run simultaneously."
From reading some info on OpenCL, it seems about right, though I need a second opinion.
Update: nat chouf's comment is right; I understood the question as "in flight at the same time" instead of "physically executed at the same time".
As I wrote, several work-groups can be scheduled at a given time in a single compute unit. The number of such "in-flight" work-groups is limited by the available resources (local memory, registers, etc.) on each compute unit.
In existing implementations (afaik) a compute unit will pick a block (warp/wavefront) of work-items from the same work-group for execution, among all blocks in flight in the compute unit. One "instruction" of this block is inserted in the pipeline (it may take several cycles, and each "instruction" may correspond to several operations in each work-item), and then another block is picked.
So, yes, if work-group size is 1, only 1 work-item per compute unit will be physically started simultaneously. But potentially all work-items may be in-flight in the GPU at the same time.
I'm using OpenCL and have an ATI 4850 card. It has:
CL_DEVICE_MAX_COMPUTE_UNITS: 10
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
CL_DEVICE_MAX_WORK_ITEM_SIZES: (256, 256, 256)
CL_DEVICE_AVAILABLE: 1
CL_DEVICE_NAME: ATI RV770
How many tasks can it execute simultaneously?
Is it CL_DEVICE_MAX_COMPUTE_UNITS * CL_DEVICE_MAX_WORK_ITEM_SIZES = 2560?
To be more specific: a single-core processor can execute only one task at any one moment, a dual-core can execute 2 tasks... How many tasks can my GPU execute at one moment? Or rephrased: how many processors does my GPU have?
The RV770 has 10 SIMD cores, each consisting of 16 shader cores, each consisting of 5 ALUs (VLIW5 architecture). A total of 800 ALUs that can do parallel computations. I don't think there's a way to get all these numbers out of OpenCL. I'm also not sure what you would equate to a CPU core. Perhaps a shader core? You can read about VLIW at Wikipedia. It's an interesting design.
If you say a CPU core is only executing one "task" at any given time, even though it has multiple ALUs working in parallel, then I guess you can say the RV770 would be working on 160 tasks. But with the differences in how different chips work, I think "core" and "task" can become difficult to define. A CPU with hyperthreading can even execute two sets of code at the same time. With OpenCL I don't believe it is possible yet to execute more than one kernel at any given time - unless recent driver updates have changed that.
Anyway, I think it is more important to present your work to the GPU in a way that gives the best performance. Unfortunately, there's no way to find the best work-group size other than experimenting, at least none that I know of. One help is that if the drivers support OpenCL 1.1, you can query CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and set your work size to a multiple of that. Otherwise, going for a multiple of 64 is probably a safe bet.
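A sketch of that query (assuming `kernel` and `device` already exist; the parameter requires OpenCL 1.1):

    #include <stdio.h>
    #include <CL/cl.h>

    /* Assumes `kernel` and `device` were created/selected earlier. */
    void print_preferred_multiple(cl_kernel kernel, cl_device_id device)
    {
        size_t preferred = 0;
        if (clGetKernelWorkGroupInfo(kernel, device,
                CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                sizeof(preferred), &preferred, NULL) == CL_SUCCESS)
            printf("preferred work-group size multiple: %zu\n", preferred);
    }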
GPU work ends up becoming wavefronts/warps.
Using a GPU for both UI and compute effectively means using it for many programs without being aware of it: several for the GUI drawing, plus whatever compute kernels you are executing. Fast OpenCL clients are asynchronous and overlap multiple instances of work so they won't be latency-bound. It is expected that you'll use multiple kernels in parallel.
There doesn't seem to be a "hard" limit other than memory limiting the number of buffers you can use. When using the same GPU for UI and for compute, you must throttle your work. In my experience, issuing too much work will cause starvation of the GUI and/or your compute kernels. There doesn't seem to be anything in the way of ensuring that you won't have starvation (long delays before a work item begins actually executing). Some work item(s) may sit for a very long time (tens of seconds or more in bad cases) while the GPU does other work items. I speculate that items are dispatched to pipelines based on data availability, and that little or nothing prevents starvation of work items.
Limiting how far ahead work is enqueued greatly improves GUI responsiveness by letting the GPU drain its work queue (sometimes almost to empty), reducing starvation delays for GUI drawing work items.
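One way to implement such throttling is to cap the number of in-flight launches with events; a hedged sketch (the chunking scheme, the MAX_IN_FLIGHT value, and the function name are illustrative):

    #include <CL/cl.h>

    /* Enqueue `num_chunks` launches of `chunk` work-items each, keeping
       at most MAX_IN_FLIGHT launches outstanding so the GPU's queue
       stays short and GUI work can interleave. */
    #define MAX_IN_FLIGHT 2

    void enqueue_throttled(cl_command_queue queue, cl_kernel kernel,
                           size_t num_chunks, size_t chunk)
    {
        cl_event in_flight[MAX_IN_FLIGHT];

        for (size_t i = 0; i < num_chunks; ++i) {
            size_t offset = i * chunk;
            if (i >= MAX_IN_FLIGHT) {
                /* Block until the oldest outstanding chunk finishes. */
                clWaitForEvents(1, &in_flight[i % MAX_IN_FLIGHT]);
                clReleaseEvent(in_flight[i % MAX_IN_FLIGHT]);
            }
            clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &chunk,
                                   NULL, 0, NULL,
                                   &in_flight[i % MAX_IN_FLIGHT]);
            clFlush(queue); /* submit promptly instead of batching */
        }
        /* Wait for and release the chunks still outstanding at the end. */
        for (size_t i = 0; i < MAX_IN_FLIGHT && i < num_chunks; ++i) {
            clWaitForEvents(1, &in_flight[i]);
            clReleaseEvent(in_flight[i]);
        }
    }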