I am a little bit confused on threads vs workitems. Are there any difference between the two?
I understand the unit of execution on the GPU Cores is workgroup, which consist of workitems. Is this not same as threads?
The terms 'threads' and 'cores' differ depending on the hardware you are running on. On NVIDIA hardware, a work-item is equivalent to a 'CUDA thread'. On CPUs, a thread usually executes an entire work-group, with individual work-items being packed into SIMD lanes.
So when talking about OpenCL programs, it's best to stick to the terms 'work-item', 'work-group', 'compute unit' and 'processing element', to avoid these naming issues.
Related
I have read documentation and books
(also these posts: OpenCL: query number of processing elements ; Understanding work-items and work-groups ; OpenCL: Work items, Processing elements, NDRange)
about the execution model and and theory about data partitioning with NDrange.
Do I build my work-items and work-groups based on my hardware? If yes how can I query how many work-items and work-groups are available on a device? Is there a good practice how to divide work-items and work-groups to achieve a good performance?
I would like to know how do they work and interact in practice, for computation of one dimensional array and for two-dimensional array like an image.
Good partitioning requires knowledge of your GPU hardware. For example, let's look on AMD cards like Radeon 6970. Overall number of cores is 1536. They are packed in 24 SIMD units. Each unit consists of 16 stream processors with VLIW4 architecture. So, we have 16 * 4 (because of VLIW4) * 24 = 1536 cores. Every SIMD unit share some resources (caches, etc) for all cores within it. Hence, a good size for local group in case of Radeon 6970 is some multiple of 64. You can query your OpenCL Device for number of Computing Units. In our case, you should get 24. So, for OpenCL on Radeon 6970 Computing Unit = SIMD Unit. Please, take into account that manual partitioning may cause performance drops on devices with different architecture.
A good example of local group benefits can be found on Nvidia developer zone. Take a look at the bitonic sort sample code, which will show you how to use local groups.
My program is well-suited for MPI. Each CPU does its own, specific (sophisticated) job, produces a single double, and then I use an MPI_Reduce to multiply the result from every CPU.
But I repeat this many, many times (> 100,000). Thus, it occurred to me that a GPU would dramatically speed things up.
I have google'd around, but can't find anything concrete. How do you go about mixing MPI with GPUs? Is there a way for the program to query and verify "oh, this rank is the GPU, all other are CPUs" ? Is there a recommended tutorial or something?
Importantly, I don't want or need a full set of GPUs. I really just need a lot of CPUs, and then a single GPU to speed up the frequently-used MPI_Reduce operation.
Here is a schematic example of what I'm talking about:
Suppose I have 500 CPUs. Each CPU somehow produces, say, 50 doubles. I need to multiply all 250,00 of these doubles together. Then I repeat this between 10,000 and 1 million times. If I could have one GPU (in addition to the 500 CPUs), this could be really efficient. Each CPU would compute its 50 doubles for all ~1 million "states". Then, all 500 CPUs would send their doubles to the GPU. The GPU would then multiply the 250,000 doubles together for each of the 1 million "states", producing 1 million doubles.
These numbers are not exact. The compute is indeed very large. I'm just trying to convey the general problem.
This isn't the way to think about these things.
I like to say that MPI and GPGPU stuff are orthogonal(*). You use MPI between tasks (for which think nodes, although you can have multiple tasks per node), and each task may or may not use an accelerator like a GPU to accelerate the computation within task. There is no MPI rank on a GPU.
Regardless, Talonmies is right; this particular example doesn't sound like it would benefit much from a GPU. And it won't be helped by having tens of thousands of doubles per task; if you're only doing one or a few FLOPs per double, the cost of sending the data to the GPU will exceed the benefit of having all those cores operate on them.
(*) This used to be more clearly true; now with, for instance, GPUDirect being able to copy memory to remote GPUs over infiniband, the distinction is fuzzier. However, I maintain that this is still the most useful way to think about things, with such things as RDMA to GPUs being an important optimization but conceptually a minor tweak.
Here I have found some news about the topic:
"MPI, the Message Passing Interface, is a standard API for communicating data via messages between distributed processes that is commonly used in HPC to build applications that can scale to multi-node computer clusters. As such, MPI is fully compatible with CUDA, which is designed for parallel computing on a single computer or node. There are many reasons for wanting to combine the two parallel programming approaches of MPI and CUDA. A common reason is to enable solving problems with a data size too large to fit into the memory of a single GPU, or that would require an unreasonably long compute time on a single node. Another reason is to accelerate an existing MPI application with GPUs or to enable an existing single-node multi-GPU application to scale across multiple nodes. With CUDA-aware MPI these goals can be achieved easily and efficiently. In this post I will explain how CUDA-aware MPI works, why it is efficient, and how you can use it."
Consider a simple example: vector addition.
If I build a program for CL_DEVICE_TYPE_GPU, and I build the same program for CL_DEVICE_TYPE_CPU, what is the difference between them(except that "CPU program" is running on CPU, and "GPU program" is running on GPU)?
Thanks for your help.
There are a few differences between the device types. The simple answer to your vector question is: Use a gpu for large vectors, and cpu for smaller workloads.
1) Memory copying. GPUs rely on the data you are working on to be passed into them, and the results are later read back to the host. This is done over PCI-e, which yields about 5GB/s for version 2.0 / 2.1. CPUs can use buffers 'in place' - in DDR3 - using either of the CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR flags. See here: clCreateBuffer. This is one of the big bottlenecks for many kernels.
2) Clock speed. cpus currently have a big lead over gpus in clock speed. 2Ghz on the low end for most cpus, vs 1Ghz as a top end for most gpus these days. This is one factor that really helps the cpu 'win' over a gpu for small workloads.
3) Concurrent 'threads'. High-end gpus usually have more compute units than their cpu counterparts. For example, the 6970 gpu (Cayman) has 24 opencl compute units, each of these is divided into 16 SIMD units. Most of the top desktop cpus have 8 cores, and server cpus currently stop at 16 cores. (cpu cores map 1:1 to compute unit count) A compute unit in opencl is a portion of the device which can do work that is different from the rest of the device.
4) Thread types. gpus have a SIMD architecture, with many graphic-oriented instructions. cpus have a lot of their area dedicated to branch prediction and general computations. A cpu may have a SIMD unit and/or floating point unit in every core, but the Cayman chip I mentioned above has 1536 units with the gpu instruction set available to each one. AMD calls them stream processors, and there are 4 in each of the SIMD units mentioned above (24x16x4 = 1536). No cpu will have that many sin(x) or dot-product-capable units unless the manufacturer wants to cut out some cache memory or branch prediction hardware. The SIMD layout of the gpus is probably the largest 'win' for large vector addition situations. That the also do other specialized functions is a big bonus.
5) Memory Bandwidth. cpus with DDR3: ~17GB/s. High-end gpus >100GB/s, speeds of over 200GB/s are becoming common lately. If your algorithm is not PCI-e limited (see #1), the gpu will outpace the cpu in raw memory access. The scheduling units in a gpu can hide memory latency further by running only tasks that aren't waiting on memory access. AMD calls this a wavefront, Nvidia calls it a warp. cpus have a large and complicated caching system to help hide their memory access times in the case where the program is reusing the data. For your vector add problem, you will likely be limited more by the PCI-e bus since the vectors are generally used only once or twice each.
6) Power efficiency. A gpu (used properly) will usually be more electrically efficient than a cpu. Because cpus dominate in clock speed, one of the only ways to really reduce power consumption is to down-clock the chip. This obviously leads to longer compute times. Many of the top systems on the Green 500 list are heavily gpu accelerated. see here: green500.org
I'm using OpenCL and have ATI 4850 card. It has:
CL_DEVICE_MAX_COMPUTE_UNITS: 10
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
CL_DEVICE_MAX_WORK_ITEM_SIZES:(256, 256, 256)
CL_DEVICE_AVAILABLE: 1
CL_DEVICE_NAME: ATI RV770
How many tasks can it execute simultaneously?
Is it CL_DEVICE_MAX_COMPUTE_UNITS * CL_DEVICE_MAX_WORK_ITEM_SIZES = 2560?
To be more specific: a single core processor can execute only one task in the one moment, dual-core can execute 2 tasks... How many tasks can execute my GPU at one moment? Or rephrased: How many processors does my GPU have?
The RV770 has 10 SIMD cores, each consisting of 16 shader cores, each consisting of 5 ALUs (VLIW5 architecture). A total of 800 ALUs that can do parallel computations. I don't think there's a way to get all these numbers out of OpenCL. I'm also not sure what you would equate to a CPU core. Perhaps a shader core? You can read about VLIW at Wikipedia. It's an interesting design.
If you say a CPU core is only executing one "task" at any given time, even though it has multiple ALUs working in parallel, then I guess you can say the RV770 would be working on 160 tasks. But with the differences in how different chips work, I think "core" and "task" can become difficult to define. A CPU with hyperthreading can even execute two sets of code at the same time. With OpenCL I don't believe it is possible yet to execute more than one kernel at any given time - unless recent driver updates have changed that.
Anyway, I think it is more important to present your work to the GPU in a way that gives the best performance. Unfortunately there's no way to find the best work group size other than experimenting. At least not that I know of. One help is that if the drivers support OpenCL 1.1 you can query the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and set your work size to a multiple of that. Otherwise, going for a multiple of 64 is probably a safe bet.
GPU work ends up becoming wavefronts/warps.
Using a GPU for UI and compute is effectively using it for many programs without being aware of it. Many for the GUI drawing, plus whatever compute kernels you are executing. Fast OpenCL clients are asynchronous and overlap multiple instance of work so they won't be latency-bound. It is expected that you'll use multiple kernels in parallel.
There doesn't seem to be a "hard" limit other than memory limiting the number of buffers you can use. When using the same GPU for UI and for compute, you must throttle your work. In my experience, issuing too much work will cause starvation of the GUI and/or your compute kernels. There doesn't seem to be anything in the way of ensuring that you won't have starvation (long delays before a work item begins actually executing). Some work item(s) may sit for a very long time (10s seconds or more in bad cases) while the GPU does other work items. I speculate that items are dispatched to pipelines based on data availability and little or nothing is there to prevent starvation of work items.
Limiting how far ahead work is enqueued greatly improves GUI responsiveness by letting the GPU drain its work queue almost/sometimes to empty, reducing GUI drawing workitem starvation delays.
I'm new in GPGPU programming and I'm working with NVIDIA implementation of OpenCL.
My question was how to compute the limit of a GPU device (in number of threads).
From what I understood a there are a number of work-group (equivalent of blocks in CUDA) that contain a number of work-item (~ cuda thread).
How do I get the number of work-group present on my card (and that can run at the same time) and the number of work-item present on one work group?
To what CL_DEVICE_MAX_COMPUTE_UNITS corresponds?
The khronos specification speeks of cores ("The number of parallel compute cores on the OpenCL device.") what is the difference with the CUDA core given in the specification of my graphic card. In my case openCL gives 14 and my GeForce 8800 GT has 112 core based on NVIDIA website.
Does CL_DEVICE_MAX_WORK_GROUP_SIZE (512 in my case) corresponds to the total of work-items given to a specific work-group or the number of work-item that can run at the same time in a work-group?
Any suggestions would be extremely appreciated.
The OpenCL standard does not specify how the abstract execution model provided by OpenCL is mapped to the hardware. You can enqueue any number T of threads (work items), and provide a workgroup size (WG), with at least the following constraints (see OpenCL spec 5.7.3 and 5.8 for details):
WG must divide T
WG must be at most DEVICE_MAX_WORK_GROUP_SIZE
WG must be at most KERNEL_WORK_GROUP_SIZE returned by GetKernelWorkGroupInfo ; it may be smaller than the device max workgroup size if the kernel consumes a lot of resources.
The implementation manages the execution of the kernel on the hardware. All threads of a single workgroup must be scheduled on a single "multiprocessor", but a single multiprocessor can manage several workgroups at the same time.
Threads inside a workgroup are executed by groups of 32 (NVIDIA warp) or 64 (AMD wavefront). Each micro-architecture does this in a different way. You will find more details in NVIDIA and AMD forums, and in the various docs provided by each vendor.
To answer your question: there is no limit to the number of threads. In the real world, your problem is limited by the size of inputs/outputs, i.e. the size of the device memory. To process a 4GB buffer of float, you can enqueue 1G threads, with WG=256 for example. The device will have to schedule 4M workgroups on its small number (say between 2 and 40) of multiprocessors.