I am using two graphics cards for OpenCL code.
According to profiling, my GTX 630 (Kepler) runs faster than the GTX 650 Ti for each method request.
After profiling I found some differences between the two cards, but I am not able to understand why occupancy, l1_global_load_hit, l1_global_load_miss, active_warps and active_cycles are lower for the GTX 650 Ti. Can anyone please help me understand these terms better?
Decrease the local work group size from 1024 down to 512 or 256, maybe even 64, then try again. This will leave more local memory per wave of threads, so more of them will execute simultaneously, hence occupying more ALUs.
Don't forget to make the total number of threads a multiple of 768 (the number of cores of your faster card) so that it is filled evenly across all cores (not just a multiple of 384, like 1k-ish, which is not good for your faster card).
Is there a way to check the number of stream processors and cores utilized by an OpenCL kernel?
No. However you can make guesses based on your application: If the number of work items is much larger than the number of CUDA cores and the work group size is 32 or larger, all stream processors are used at the same time.
If the number of work items is about the same as or lower than the number of CUDA cores, you won't have full utilization.
If you set the work group size to 16, only half of the CUDA cores will be used at any time, but the unused half is blocked and cannot do other work. (So always set the work group size to 32 or larger.)
Tools like nvidia-smi can tell you the time-averaged GPU usage. Strictly speaking, that figure is the fraction of time during which at least one kernel was executing, so if you run your kernel over and over without any delay in between, it gives only a rough indication of how busy the GPU is, not an exact fraction of used CUDA cores.
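The "16 work items uses only half" rule of thumb can be checked with a little arithmetic. The following is a minimal sketch in C, assuming a fixed scheduling width of 32 work items (one NVIDIA warp); the names WARP_SIZE, warps_per_group and lane_utilization are made up for illustration and are not part of any OpenCL or CUDA API.

```c
#include <assert.h>

/* Assumption: work items are scheduled in fixed-size bundles of 32
   (one warp on NVIDIA hardware). */
#define WARP_SIZE 32u

/* Number of warps needed to hold one work group (rounded up). */
unsigned warps_per_group(unsigned group_size)
{
    assert(group_size > 0);
    return (group_size + WARP_SIZE - 1u) / WARP_SIZE;
}

/* Fraction of warp lanes that carry a real work item for one work group. */
double lane_utilization(unsigned group_size)
{
    unsigned lanes = warps_per_group(group_size) * WARP_SIZE;
    return (double)group_size / (double)lanes;
}
```

For a work group size of 16, lane_utilization returns 0.5: one warp is allocated and half of its lanes are idle. For 32 or any multiple of 32 it returns 1.0.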
I measured the execution time of a vector adder with different group sizes, using only one group in this experiment.
groupsize    execution time
1            3.6
50           4.22
100          4.3
200          4.28
300          4.3
400          4.31
500          4.38
600          4.38
700          4.78
800          5.18
900          5.78
1000         6.4
Can I conclude that one SM can run about 600 work-items together?
I also have a question; could anybody help me?
Why does the execution time increase sharply when the group size increases from 1 to 50 and from 600 to 1000?
thank you very much
It would be helpful to see some code, both of the kernel and the host enqueueing parameters. The conclusions also depend on what sort of hardware you're running this on - GPU, CPU, accelerator, FPGA, …?
A few ideas:
GPUs typically can run power-of-2 number of threads in parallel in an execution unit. You will likely get better results if you try e.g. 16, 32, 64, 128, etc. CPUs and other accelerators typically have SIMD-widths which are powers of 2 too, for example x86-64 SSE registers can hold 4 floats, AVX 8, AVX512 16, etc. so it most likely will help there, too.
As you can vary group size so freely, I'm going to assume your work-items don't need to coordinate among each other via local memory or barriers. (The problem is embarrassingly parallel.) A group size of 1 in theory allows your compiler, driver, and hardware maximum flexibility for distributing work-items to threads and parallel execution units optimally. So it should not be a surprise that this is the fastest. (Depending on register pressure and memory access patterns it can still sometimes be helpful to manually increase group size for specific types of hardware in the embarrassingly parallel case.)
On GPUs, all items in a work group must run on the same execution unit, in order to be able to coordinate and share local memory. So by increasing the group size, you're limiting the number of execution units the workload can be spread across, and the execution units need to run your work-items serially - you're reducing parallelism. Above 600 you're probably submitting fewer workgroups than your hardware has execution units.
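To make the last point concrete, here is a small sketch in C of how group size limits the number of execution units that can receive work. The function names and the device with 8 compute units are assumptions made up for illustration; the real count would come from CL_DEVICE_MAX_COMPUTE_UNITS.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups launched for a 1-D range (rounded up). */
size_t num_work_groups(size_t global_size, size_t group_size)
{
    assert(group_size > 0);
    return (global_size + group_size - 1) / group_size;
}

/* Upper bound on the compute units that can receive work, given that
   one work-group never spans more than one compute unit. */
size_t busy_compute_units(size_t global_size, size_t group_size,
                          size_t compute_units)
{
    size_t groups = num_work_groups(global_size, group_size);
    return groups < compute_units ? groups : compute_units;
}
```

With 1000 work-items on a hypothetical device with 8 compute units, a group size of 50 gives 20 groups and can keep all 8 units busy, whereas a group size of 1000 gives a single group and at most one busy unit.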
I'm confused by CL_DEVICE_MAX_COMPUTE_UNITS. For instance, for my Intel GPU on a Mac, this number is 48. Does this mean the maximum number of parallel tasks running at the same time is 48, or a multiple of 48, maybe 96, 144...? (I know each compute unit is composed of one or more processing elements, and each processing element is actually in charge of a "thread". What if each of the 48 compute units is composed of more than one processing element?) In other words, for my Mac, is the "ideal" speedup, although impossible in reality, 48 times faster than a CPU core (assuming the single "core" computation speed of CPU and GPU is the same), or a multiple of 48, maybe 96, 144...?
Summary: Your speedup is a little complicated, but your machine's (Intel GPU, probably GEN8 or GEN9) fp32 throughput is 768 FLOPs per (GPU) clock, and 1536 for fp16. Let's assume fp32, so something less than 768x (maybe a third of this depending on CPU speed). See below for the reasoning and some very important caveats.
A Quick Aside on CL_DEVICE_MAX_COMPUTE_UNITS:
Intel does something wonky with CL_DEVICE_MAX_COMPUTE_UNITS in its GPU driver.
The clGetDeviceInfo documentation (OpenCL 2.0) describes CL_DEVICE_MAX_COMPUTE_UNITS as:
The number of parallel compute units on the OpenCL device. A
work-group executes on a single compute unit. The minimum value is 1.
However, the Intel Graphics driver does not actually follow this definition and instead returns the number of EUs (Execution Units). An EU is a grouping of the SIMD ALUs and slots for 7 different SIMD threads (registers and whatnot). Each SIMD thread represents 8, 16, or 32 work-items depending on what the compiler picks (we want higher, but register pressure can force us lower).
A workgroup is actually limited to a "Slice" (see the figure in section 5.5 "Slice Architecture"), which happens to be 24 EUs (in recent HW). Pick the GEN8 or GEN9 documents. Each slice has its own SLM, barriers, and L3. Given that your Apple laptop is reporting 48 EUs, I'd say that you have two slices.
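As a rough illustration of what those numbers imply, the following C sketch multiplies out how many work-items can be resident at once, using only the figures quoted above (7 SIMD thread slots per EU, SIMD width of 8, 16, or 32). The function name resident_work_items is made up for this example, and resident is not the same as executing in the same clock.

```c
#include <assert.h>

/* SIMD thread slots per EU, as stated in the text above. */
#define THREADS_PER_EU 7u

/* Work-items that can be resident on the GPU at once for a given
   compiler-chosen SIMD width (8, 16, or 32). */
unsigned resident_work_items(unsigned eus, unsigned simd_width)
{
    assert(simd_width == 8u || simd_width == 16u || simd_width == 32u);
    return eus * THREADS_PER_EU * simd_width;
}
```

For 48 EUs this gives 2688, 5376, or 10752 resident work-items for SIMD8, SIMD16, and SIMD32 respectively, which is why the reported value of 48 says little by itself about the available parallelism.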
Maximum Speedup:
Let's ignore this major annoyance and work with the EU number (and from those arch docs above). For "speedup" I'm comparing a single threaded FP32 calculation on the CPU. With good parallelization etc on the CPU, the speedup would be less, of course.
Each of the 48 EUs can issue two SIMD4 operations per clock in ideal circumstances. Assuming those are fused multiply-adds (so really two ops each), that gives us:
48 EUs * 2 SIMD4 ops per EU * 2 (if the op is a fused multiply add)
= 192 SIMD4 ops per clock
= 768 FLOPs per clock for single precision floating point
So your ideal speedup is actually ~768. But there's a bunch of things that chip into this ideal number.
Setup and teardown time. Let's ignore this (assume the WL time dominates the runtime).
The GPU clock maxes out around a gigahertz while the CPU runs faster. Factor that ratio in (crudely 1/3 maybe? 3 GHz on the CPU vs 1 GHz on the GPU).
If the computation is not heavily multiply-adds ("mads"), divide by 2 since I doubled above. Many important workloads are "mad"-dominated though.
The execution must be mostly non-divergent. If the work-items of a SIMD thread diverge at an if-then-else, the entire SIMD thread (8, 16, or 32 work-items) has to execute both sides.
Register bank collision delays can reduce EU ALU throughput. Typically the compiler does a great job avoiding this, but it can theoretically chew into your performance a bit (usually a few percent depending on register pressure).
Buffer address calculation can chew off a few percent too (EU must spend time doing integer compute to read and write addresses).
If one uses too much SLM or barriers, the GPU must leave some of the EU's idle so that there's enough SLM for each work item on the machine. (You can tweak your algorithm to fix this.)
We must keep the WL compute bound. If we blow out any cache in the data access hierarchy, we run into scenarios where no thread is ready to run on an EU and must stall. Assume we avoid this.
?. I'm probably forgetting other things that can go wrong.
We call the percentage of the theoretical peak the efficiency. So if our workload runs at ~530 FLOPs per clock, then we are at about 69% of the theoretical 768. I've seen very carefully tuned workloads exceed 90% efficiency, but it definitely can take some work.
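The peak and efficiency arithmetic above can be written out as a short C sketch. The function and parameter names are invented for illustration; the inputs (48 EUs, 2 SIMD4 ops per EU per clock, doubling for fused multiply-add) are the ones used in this answer.

```c
#include <assert.h>

/* Peak floating-point operations per GPU clock.
   fused_multiply_add != 0 counts each lane operation as two FLOPs. */
unsigned peak_flops_per_clock(unsigned eus, unsigned simd_ops_per_eu,
                              unsigned simd_lanes, int fused_multiply_add)
{
    unsigned flops_per_lane = fused_multiply_add ? 2u : 1u;
    return eus * simd_ops_per_eu * simd_lanes * flops_per_lane;
}

/* Efficiency as a fraction of the theoretical peak (0.0 to 1.0). */
double efficiency(double achieved_flops_per_clock, unsigned peak)
{
    assert(peak > 0);
    return achieved_flops_per_clock / (double)peak;
}
```

peak_flops_per_clock(48, 2, 4, 1) reproduces the 768 figure, and passing 0 as the last argument gives 384, the "divide by 2" case for workloads that are not multiply-add dominated.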
The ideal speedup you can get is the total number of processing elements which in your case corresponds to 48 * number of processing elements per compute unit. I do not know of a way to get the number of processing elements from OpenCL (that does not mean that it is not possible), however you can just google it for your GPU.
To my knowledge, a compute unit consists of one or multiple processing elements (for GPUs usually a lot), a register file, and some local memory. The threads of a compute unit are executed in a SIMD (single instruction multiple data) fashion. This means that the threads of a compute unit all execute the same operation but on different data.
Also, the speedup you get depends on how you execute a kernel function. Since a single work-group cannot run on multiple compute units, you need a sufficient number of work-groups in order to fully utilize all of the compute units. In addition, the work-group size should be a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
I don't have an HD 5850, so how can I find out its maximum work-group size for OpenCL? What is the preferred floating-point vector width for the HD 5850? I suspected it was 5, but that did not work on a friend's computer that has a 5850. I tried width 4, but it was not fast enough, so now I suspect the work-group size is not optimal. I am doing N-body for 25k, 50k and 100k particles, using float8 variables for x, y, z, vx, vy, vz.
Thanks.
If you need the OpenCL specifics at development time but don't have access to the hardware, try http://clbenchmark.com. For example, the HD 5850 page is here: http://clbenchmark.com/device-environment.jsp?config=11975982. It shows CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT=4.
Use clGetDeviceInfo to poll for CL_DEVICE_MAX_WORK_GROUP_SIZE. I think the 5850 will have this at 256, but that may not be optimal for your kernel.
Use the same technique to poll for CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT, which I think is 4 on your card.
I am wondering how to choose optimal local and global work sizes for different devices in OpenCL.
Is there any universal rule for AMD, NVIDIA, and Intel GPUs?
Should I analyze the physical build of the devices (number of multiprocessors, number of streaming processors per multiprocessor, etc.)?
Does it depend on the algorithm/implementation? I ask because I saw that some libraries (like ViennaCL) simply test many combinations of local/global work sizes and choose the best one.
NVIDIA recommends that your (local)workgroup-size is a multiple of 32 (equal to one warp, which is their atomic unit of execution, meaning that 32 threads/work-items are scheduled atomically together). AMD on the other hand recommends a multiple of 64(equal to one wavefront). Unsure about Intel, but you can find this type of information in their documentation.
So when you are doing some computation and, let's say, you have 2300 work-items (the global size), 2300 is divisible by neither 64 nor 32. If you don't specify the local size, OpenCL will choose a bad local size for you. What happens when your local size is not a multiple of the atomic unit of execution is that you get idle threads, which leads to bad device utilization. Thus, it can be beneficial to add some "dummy" threads so that you get a global size which is a multiple of 32/64, and then use a local size of 32/64 (the global size has to be divisible by the local size). For 2300 you can add 4 dummy threads/work-items, because 2304 is divisible by 32. In the actual kernel, you can write something like:
int globalID = get_global_id(0);
if (globalID >= realNumberOfThreads)
    globalID = 0;
This will make the four extra threads do the same as thread 0. (It is often faster to do some extra work than to have many idle threads.)
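On the host side, the padded global size can be computed rather than worked out by hand. This is a minimal C sketch of that rounding; padded_global_size is a hypothetical helper name, not an OpenCL function, and the result would be passed as the global work size when enqueueing the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Round the real number of work-items up to the next multiple of the
   local size, so the global size is divisible by the local size. */
size_t padded_global_size(size_t real_items, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = real_items % local_size;
    return remainder == 0 ? real_items
                          : real_items + (local_size - remainder);
}
```

For the example above, padded_global_size(2300, 32) gives 2304, i.e. 4 dummy work-items, and the same value happens to result for a local size of 64.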
Hope that answered your question. GL HF!
If your kernel is essentially doing computation and uses little memory (e.g. only to store kernel private state), you can choose the most intuitive global size for your problem and let OpenCL choose the local size for you.
See my answer here : https://stackoverflow.com/a/13762847/145757
If memory management is a central part of your algorithm and will have a great impact on performance, you should indeed go a little further and first check the maximum local size (which depends on the local/private memory usage of your kernel) using clGetKernelWorkGroupInfo, which in turn will determine your global size.