Hello, as I read in the OpenCL docs,
a compute unit has many processing elements.
Does a processing element contain only an ALU?
Within a processing element, does a single ALU perform SIMD operations, or do 4 ALUs together make up a SIMD unit?
I think most current devices map a single ALU to a processing element, and an ALU is a single SIMD core. Indeed, CPUs that don't support SIMD are not OpenCL compatible.
The thing about OpenCL is that you don't need to be concerned about the exact underlying architecture unless you are writing a kernel for very specific hardware. Devices in the future could use as many schedulers/ALUs/memory controllers/etc as the manufacturer chooses to implement the SIMD architecture.
If you want to follow the "write once, run anywhere" mantra, you need to stick to the properties exposed by the OpenCL API (e.g. CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and CL_DEVICE_PREFERRED_VECTOR_WIDTH_*).
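As a rough illustration of sticking to a queried property, the sketch below rounds a problem size up to a multiple of the value reported for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. The helper name `round_up_global_size` and the value 64 are assumptions for illustration only; a real program would obtain the value from the OpenCL API rather than hard-coding it.

```python
def round_up_global_size(n_items: int, multiple: int) -> int:
    """Return the smallest multiple of `multiple` that is >= n_items."""
    if n_items < 0 or multiple <= 0:
        raise ValueError("n_items must be >= 0 and multiple must be > 0")
    return ((n_items + multiple - 1) // multiple) * multiple


# Hypothetical value; in a real program this would be queried from the
# device via CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
preferred_multiple = 64

global_size = round_up_global_size(1000, preferred_multiple)
print(global_size)  # 1024
```

When the global size is padded like this, the kernel has to ignore the extra work-items, typically by comparing its global ID against the real element count.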
Some devices have special function units shared among several ALUs, while others have one FPU per ALU and no special units. There are addressing units and scalar units too. SIMD organization differs between AMD, NVIDIA and Intel: some have 16-wide groups, some 32-wide. Those groups are then joined into a 64-wide compute unit for one vendor and a 192-wide one for another. What those ALUs actually do is generally altered by driver optimizations. You just write single-instruction, multiple-data code and the driver takes care of the optimizations, unless you choose an optimization-killing execution argument.
You can query the necessary info using the OpenCL API.
Related
I am using the beta support for OpenCL 2.0 on NVIDIA and targeting a high-end GPU like the 1080 Ti. In my compute pipeline, I sometimes need to dispatch work that processes relatively small images independently. In theory, I think these images could be processed in parallel on a single GPU, because the number of work-groups for a single image won't saturate all the compute units of the GPU.
Is this possible in OpenCL? Does this have a name in OpenCL?
If it is possible, is using multiple queues for a single device the only way to do this? Or will the driver look at the "waitEventList" and decide which kernels can be processed in parallel?
Do I need CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE?
1- Yes, this is one way to achieve high occupancy of the compute units. A general name for it could be "pipelining" (with the help of asynchronous enqueueing and/or dynamic parallelism). There are different ways to do it. One is to use three queues coordinated with wait events: reads on one queue, writes on another, and compute on a third. A second way is to have M queues, each doing the read-compute-write work of a different image, without events.
2- You can even use a single queue, as long as it is an out-of-order queue, so that kernels are dispatched independently. But at least on some AMD cards, even an in-order queue can run independent kernels concurrently (according to AMD's CodeXL), although this may be outside the OpenCL spec. Wait events can be a constraint that stops this kind of driver-side optimization (again, at least on AMD).
From 2.x onwards there is device-side enqueueing, so you can enqueue one kernel from the host and that kernel can enqueue N kernels without host intervention (if all data is already uploaded to the card). This may not hide latency as well as multiple host-side queues do if data still has to move from host to device.
3- Vendors are not required to implement out-of-order execution, so this may not work.
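The "M queues, one image per queue" scheme from point 1 comes down to deciding which queue each image goes to. The sketch below shows only that bookkeeping, using round-robin assignment; `assign_to_queues` is a hypothetical helper, the lists stand in for real command queues, and no OpenCL calls are made.

```python
def assign_to_queues(n_images: int, n_queues: int) -> list[list[int]]:
    """Distribute image indices over n_queues in round-robin order."""
    if n_queues <= 0:
        raise ValueError("n_queues must be > 0")
    queues = [[] for _ in range(n_queues)]
    for image_index in range(n_images):
        # Each image's read-compute-write sequence stays on one queue,
        # so no events are needed to order it.
        queues[image_index % n_queues].append(image_index)
    return queues


print(assign_to_queues(7, 3))  # [[0, 3, 6], [1, 4], [2, 5]]
```

Whether work on different queues actually overlaps on the device is up to the driver, so this is worth measuring rather than assuming.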
I have read documentation and books
(also these posts: OpenCL: query number of processing elements ; Understanding work-items and work-groups ; OpenCL: Work items, Processing elements, NDRange)
about the execution model and the theory of data partitioning with NDRange.
Do I build my work-items and work-groups based on my hardware? If yes, how can I query how many work-items and work-groups are available on a device? Is there a good practice for dividing work into work-items and work-groups to achieve good performance?
I would like to know how they work and interact in practice, for computation on a one-dimensional array and on a two-dimensional array like an image.
Good partitioning requires knowledge of your GPU hardware. For example, let's look at an AMD card like the Radeon 6970. The overall number of cores is 1536. They are packed into 24 SIMD units. Each unit consists of 16 stream processors with a VLIW4 architecture. So we have 16 * 4 (because of VLIW4) * 24 = 1536 cores. Every SIMD unit shares some resources (caches, etc.) among all the cores within it. Hence, a good local group size on the Radeon 6970 is some multiple of 64. You can query your OpenCL device for its number of compute units. In our case, you should get 24. So, for OpenCL on the Radeon 6970, compute unit = SIMD unit. Please take into account that manual partitioning may cause performance drops on devices with a different architecture.
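The arithmetic above can be written out as a short sketch. The hardware figures are the ones quoted for the Radeon 6970; the helper `is_good_local_size` is a hypothetical name that just checks the "multiple of 64" rule of thumb.

```python
# Figures quoted above for the Radeon 6970.
simd_units = 24                   # reported by OpenCL as compute units
stream_processors_per_simd = 16
vliw_width = 4                    # VLIW4

lanes_per_simd = stream_processors_per_simd * vliw_width
total_cores = lanes_per_simd * simd_units

print(lanes_per_simd)  # 64
print(total_cores)     # 1536


def is_good_local_size(local_size: int, lanes: int = lanes_per_simd) -> bool:
    """Rule of thumb from the text: local size should be a multiple of 64."""
    return local_size > 0 and local_size % lanes == 0


print(is_good_local_size(128))  # True
print(is_good_local_size(96))   # False
```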
A good example of local group benefits can be found on Nvidia developer zone. Take a look at the bitonic sort sample code, which will show you how to use local groups.
From what I understand, the preferred work group size is roughly dependent on the SIMD width of a compute device (for NVidia, this is the Warp size, on AMD the term is Wavefront).
Logically, that would lead one to assume that the preferred work-group size is device dependent, not kernel dependent. However, this property must be queried for a particular kernel, using CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. Choosing a value which isn't a multiple of the underlying hardware's SIMD width would not fully load the hardware, resulting in reduced performance, and that should hold regardless of which kernel is being executed.
My question is why is this not the case? Surely this design decision wasn't completely arbitrary. Is there some underlying implementation limitations, or are there cases where this property really should be a kernel property?
The preferred work-group size multiple (PWGSM) is a kernel, rather than device, property, to account for vectorization.
Let's say that the hardware has 16-wide SIMD units. Then a fully scalar kernel could have a PWGSM of 16, assuming the compiler manages to do a full automatic vectorization. Similarly, for a kernel that uses float4s all around, the compiler could still find a way to coalesce work-items in groups of 4 and recommend a PWGSM of 4.
In practice the only compilers that do automatic vectorization (that I know of) are Intel's proprietary ICD, and the open source pocl. Everything else always just returns 1 (if on CPU) or the wavefront/warp width (on GPU).
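The idealised case in this answer (a compiler that vectorizes perfectly across work-items) amounts to dividing the SIMD width by the kernel's vector width. The sketch below is a toy model of that reasoning only; `ideal_pwgsm` is a hypothetical name, and real drivers may report something different.

```python
def ideal_pwgsm(simd_width: int, vector_width: int) -> int:
    """Work-items needed to fill one SIMD unit under perfect vectorization."""
    if simd_width <= 0 or vector_width <= 0:
        raise ValueError("widths must be > 0")
    if simd_width % vector_width != 0:
        raise ValueError("vector width must divide the SIMD width")
    return simd_width // vector_width


print(ideal_pwgsm(16, 1))  # scalar kernel on 16-wide SIMD: 16
print(ideal_pwgsm(16, 4))  # float4 kernel on 16-wide SIMD: 4
```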
Logically, what you are saying is right,
but you are only considering the data parallelism achieved by SIMD.
The SIMD width also changes with the data type: one value for char and another for double.
You are also forgetting that all the work-items in a work-group share memory resources through local memory. The local memory size is not necessarily a multiple of the SIMD capability of the underlying hardware, and the underlying hardware has multiple local memories.
After reading through section 6.7.2 of the OpenCL 1.2 specifications, I found that a kernel is allowed to provide compiler attributes which specify either required or recommended worksize hints using the __attribute__ keyword. This property can only be passed to the host if the preferred work group size multiple is a kernel property vs. a device property.
The theoretical best work-group size may be a device-specific property, but it won't necessarily work best for a specific kernel, or work at all. For example, what works best may be a multiple of 2*CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, or something else altogether.
The GPU has many processors, each with a queue of tasks/jobs to be computed.
We call tasks 'in flight' when they are waiting to execute, either because they are blocked by a RAM access or because they have not yet been executed.
To answer your question, the number of tasks in flight must be high enough to compensate for the waiting delay introduced by accesses to the graphics card's RAM.
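A toy model may make "high enough" more concrete. Assume each task computes for a fixed number of cycles and then stalls on one memory access of fixed latency; the processor stays busy only if other tasks can fill that stall. The function name and the cycle counts below are assumptions for illustration, not measurements of any real GPU.

```python
def tasks_needed(memory_latency_cycles: int, compute_cycles_per_task: int) -> int:
    """Tasks in flight needed to cover one task's memory stall (toy model)."""
    if memory_latency_cycles < 0 or compute_cycles_per_task <= 0:
        raise ValueError("latency must be >= 0 and compute cycles must be > 0")
    # Tasks required to fill the stall, rounded up, plus the stalled task.
    fillers = (memory_latency_cycles + compute_cycles_per_task - 1) // compute_cycles_per_task
    return fillers + 1


# Hypothetical numbers: 400-cycle memory latency, 20 cycles of compute
# between memory accesses.
print(tasks_needed(400, 20))  # 21
```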
Suppose that I have two big functions. Is it better to write them as separate kernels and call them sequentially, or is it better to write only one kernel? (I don't want to copy the data back and forth between host and device in between.) What about the speed-up if I want to call the kernel many times?
One thing to consider is the effect of register pressure on hardware utilization and performance.
As a general rule, big kernels have big register footprints. Typical OpenCL devices (i.e. GPUs) have very finite register file sizes, and large kernels can result in lower concurrency (fewer concurrent warps/wavefronts), fewer opportunities for latency hiding, and poorer overall performance. On the other hand, kernel launch overheads are pretty low on most platforms, so if your algorithm doesn't have an enormous amount of state to save between "phases" of execution, the penalty of using multiple kernels can be rather low.
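The register-pressure argument can be sketched with a simple model: the register file is split among all resident work-items, so more registers per work-item means fewer wavefronts fit at once. All numbers below (register file size, wavefront size, hardware limit) are hypothetical placeholders, not the specification of any particular GPU.

```python
def concurrent_wavefronts(
    registers_per_work_item: int,
    register_file_size: int = 65536,  # hypothetical registers per compute unit
    wavefront_size: int = 64,         # hypothetical work-items per wavefront
    hardware_limit: int = 40,         # hypothetical scheduler cap
) -> int:
    """Wavefronts that fit on one compute unit, limited by registers."""
    if registers_per_work_item <= 0:
        raise ValueError("registers_per_work_item must be > 0")
    fit = register_file_size // (registers_per_work_item * wavefront_size)
    return min(hardware_limit, fit)


for regs in (16, 64, 128):
    print(regs, concurrent_wavefronts(regs))
# 16 40
# 64 16
# 128 8
```

In this model, splitting a kernel that needs 128 registers into two that need 64 each doubles the number of resident wavefronts, which is the effect described above.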
Using multiple kernels also has another side benefit -- you get implicit synchronization between all work units for free. Often that can eliminate the need for atomic memory operations and synchronization primitives which can have a negative impact on code performance.
The ultimate guide should be measured performance. There is no universal rule of thumb for this sort of thing. Benchmarking is the only way to know for sure.
In general this is a question of (maybe) slightly better performance vs. readability of your code. Copying buffers is no issue as long as you keep them within the same context. E.g. you could set one output buffer of a kernel to be an input buffer of the next kernel, which would not involve any copying.
The proper way to code in OpenCL is to separate your code into parallel tasks, each of which is a kernel. That is, each "for loop" should be a kernel. Sometimes a single CPU function can turn into a 4-kernel implementation in OCL.
If you need to store data between kernel executions, just use OpenCL buffers and do not copy to the host (this solves the DEVICE<->HOST bottleneck).
If the two functions act on different data you could probably write a single kernel, but that depends on the complexity of the operation being run.
I'm using OpenCL and have ATI 4850 card. It has:
CL_DEVICE_MAX_COMPUTE_UNITS: 10
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
CL_DEVICE_MAX_WORK_ITEM_SIZES:(256, 256, 256)
CL_DEVICE_AVAILABLE: 1
CL_DEVICE_NAME: ATI RV770
How many tasks can it execute simultaneously?
Is it CL_DEVICE_MAX_COMPUTE_UNITS * CL_DEVICE_MAX_WORK_ITEM_SIZES = 2560?
To be more specific: a single-core processor can execute only one task at a time, a dual-core can execute 2 tasks... How many tasks can my GPU execute at one moment? Or rephrased: how many processors does my GPU have?
The RV770 has 10 SIMD cores, each consisting of 16 shader cores, each consisting of 5 ALUs (VLIW5 architecture). A total of 800 ALUs that can do parallel computations. I don't think there's a way to get all these numbers out of OpenCL. I'm also not sure what you would equate to a CPU core. Perhaps a shader core? You can read about VLIW at Wikipedia. It's an interesting design.
If you say a CPU core is only executing one "task" at any given time, even though it has multiple ALUs working in parallel, then I guess you can say the RV770 would be working on 160 tasks. But with the differences in how different chips work, I think "core" and "task" can become difficult to define. A CPU with hyperthreading can even execute two sets of code at the same time. With OpenCL I don't believe it is possible yet to execute more than one kernel at any given time - unless recent driver updates have changed that.
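To put the different counts side by side, the sketch below computes the figures quoted in this answer next to the product proposed in the question. The hardware numbers are those given above for the RV770; the variable names are illustrative only.

```python
# Hardware organisation quoted above for the RV770.
simd_cores = 10
shader_cores_per_simd = 16
alus_per_shader_core = 5          # VLIW5

shader_cores = simd_cores * shader_cores_per_simd
alus = shader_cores * alus_per_shader_core

# Values reported by the OpenCL API in the question.
max_compute_units = 10
max_work_item_size = 256
askers_guess = max_compute_units * max_work_item_size

print(shader_cores)   # 160
print(alus)           # 800
print(askers_guess)   # 2560
```

The product of the two API limits matches neither hardware count, which is why "how many tasks" depends on what is counted as a task.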
Anyway, I think it is more important to present your work to the GPU in a way that gives the best performance. Unfortunately there's no way to find the best work group size other than experimenting. At least not that I know of. One help is that if the drivers support OpenCL 1.1 you can query the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and set your work size to a multiple of that. Otherwise, going for a multiple of 64 is probably a safe bet.
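Since the best size has to be found by experiment, it helps to list the candidates first. The sketch below enumerates every multiple of the preferred multiple up to the maximum work-group size; `candidate_local_sizes` is a hypothetical helper, and the values 64 and 256 are taken from the discussion above rather than queried from a device. Timing each candidate on real hardware is left out.

```python
def candidate_local_sizes(preferred_multiple: int, max_work_group_size: int) -> list[int]:
    """All multiples of preferred_multiple that do not exceed the maximum."""
    if preferred_multiple <= 0 or max_work_group_size <= 0:
        raise ValueError("both arguments must be > 0")
    return list(range(preferred_multiple, max_work_group_size + 1, preferred_multiple))


# 64 is the "safe bet" multiple from the text; 256 is the
# CL_DEVICE_MAX_WORK_GROUP_SIZE reported in the question.
print(candidate_local_sizes(64, 256))  # [64, 128, 192, 256]
```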
GPU work ends up becoming wavefronts/warps.
Using a GPU for UI and compute is effectively using it for many programs without being aware of it: many for the GUI drawing, plus whatever compute kernels you are executing. Fast OpenCL clients are asynchronous and overlap multiple instances of work so they won't be latency-bound. It is expected that you'll use multiple kernels in parallel.
There doesn't seem to be a "hard" limit other than memory limiting the number of buffers you can use. When using the same GPU for UI and for compute, you must throttle your work. In my experience, issuing too much work will cause starvation of the GUI and/or your compute kernels. There doesn't seem to be anything that ensures you won't have starvation (long delays before a work-item actually begins executing). Some work-items may sit for a very long time (tens of seconds or more in bad cases) while the GPU processes other work-items. I speculate that items are dispatched to pipelines based on data availability and that little or nothing is there to prevent starvation of work-items.
Limiting how far ahead work is enqueued greatly improves GUI responsiveness by letting the GPU drain its work queue almost (or sometimes completely) to empty, which reduces starvation delays for GUI drawing work-items.
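One way to limit how far ahead work is enqueued is to cap the number of submissions outstanding at any time. The sketch below shows that pattern with plain Python callables. `run_throttled`, `submit` and `wait` are hypothetical names; in a real host program `submit` would wrap a kernel enqueue that returns an event, and `wait` would block on that event.

```python
from collections import deque


def run_throttled(work_items, max_in_flight, submit, wait):
    """Submit work while keeping at most max_in_flight items outstanding."""
    if max_in_flight <= 0:
        raise ValueError("max_in_flight must be > 0")
    in_flight = deque()
    for item in work_items:
        if len(in_flight) >= max_in_flight:
            # Block on the oldest submission before enqueueing more.
            wait(in_flight.popleft())
        in_flight.append(submit(item))
    while in_flight:
        wait(in_flight.popleft())


# Tiny simulation that records the peak number of outstanding items.
outstanding = set()
stats = {"peak": 0}
completed = []


def fake_submit(item):
    outstanding.add(item)
    stats["peak"] = max(stats["peak"], len(outstanding))
    return item  # stands in for an event handle


def fake_wait(handle):
    outstanding.discard(handle)
    completed.append(handle)


run_throttled(range(10), 3, fake_submit, fake_wait)
print(stats["peak"])  # 3
print(completed)      # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

A smaller cap leaves more gaps for GUI work at the cost of some compute throughput, so the right value is something to tune per application.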