How to calculate peak FLOPS in GPGPU hardware? - opencl

I want to calculate the theoretical peak performance of graphics hardware. Well, actually I want to understand the calculation.
Example with a AMD Radeon HD 6670:
The AMD Accelerated Parallel Processing Programming Guide (http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf) tells me in the middle of page 6-42 to take the number of Stream Cores (96), multiply it by the number of operations per cycle for each Stream Core (let's take an ADD with Single Precision, which would be 5) and multiply that by the core clock (800 MHz). That results to:
96 * 5 FLOPS * 800MHz = 384,000 MFLOPS = 384 GFLOPS
The very same document tells me on page D-4 that this particular device has a peak throughput of 768 GFLOPS, which is twice of what I just calculated. Wikipedia and the AMD homepage state the same.
So my question is: Where am I missing the factor of two?

I am not sure about AMD hardware, but I remember that NVIDIA counted MAD (multiply-add) operation as two flops. Since MADs are performed in one cycle, the theoretical performance was multiplied by two.

480 processing elements * 2 operations per cycle(single addition pipeline + single multiplication pipeline per element) * 800MHz = 768 GFLOPS
When the code has too many levels of branching, it drops to 1-4 shader per compute unit which means 6-24 of them and this translates to as low as 10-40 GFlops which is even slower than a single cpu core.

Related

How do I plan this least-squares computation on a GPU?

I'm writing a program called PerfectTIN (https://github.com/phma/perfecttin) which does lots of least-squares adjustments of a TIN to fit a point cloud. Each adjustment takes some contiguous group of triangles and adjusts the elevations of up to 8 points, which are corners of the triangles, to fit the dots in the triangles. I have it working on SMP. At the start of processing, it does only one adjustment at a time, so it splits the adjustment into tasks, each of which takes some dots, all of which are in the same triangle. Each thread takes a task from a queue and computes a small square matrix and a small column vector. When they're all ready, the adjustment routine adds up the matrices and the vectors and finishes the least squares computation.
I'd like to process tasks on the GPU as well as the CPU. The data needed for a task are
The three corners of the triangle (x,y,z)
The coordinates of the dots (x,y,z).
The output data are
A symmetric matrix with up to nine nonzero entries (since it's symmetric, I need only compute six numbers)
A column vector with the same number of rows.
The number of dots is a multiple of 1024, except for a few tasks which I can handle in the CPU (the total number of dots in a triangle can be any nonnegative integer). For a fairly large point cloud of 56 million dots, some tasks are larger than 131072 dots.
Here is part of the output of clinfo (if you need other parts, let me know):
Platform Name Clover
Number of devices 1
Device Name Radeon RX 590 Series (POLARIS10, DRM 3.33.0, 5.3.0-7625-generic, LLVM 9.0.0)
Device Vendor AMD
Device Vendor ID 0x1002
Device Version OpenCL 1.1 Mesa 19.2.8
Driver Version 19.2.8
Device OpenCL C Version OpenCL C 1.1
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Max compute units 36
Max clock frequency 1545MHz
Max work item dimensions 3
Max work item sizes 256x256x256
Max work group size 256
Preferred work group size multiple 64
Preferred / native vector sizes
char 16 / 16
short 8 / 8
int 4 / 4
long 2 / 2
half 8 / 8 (cl_khr_fp16)
float 4 / 4
double 2 / 2 (cl_khr_fp64)
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
If I understand right, if I put one dot in each core of the GPU, the total number of dots I can process at once is 36×256=9×1024=9216. Could I put four dots in each core, since a work group would then have 1024 dots? In this case I could process 36864 dots at once. How many dots should each core process? What if a task is bigger than the GPU can hold? What if several tasks (possibly from different triangles) fit in the GPU?
Of course I want this code to run on other GPUs than mine. I'm going to use OpenCL for portability. What different GPUs (description rather than name) am I going to encounter?
If I understand right, if I put one dot in each core of the GPU, the total number of dots I can process at once is 36×256=9×1024=9216.
Not quite. The total number of dots is not limited by the maximum work group size. The work group size is the number of GPU threads working synchronously in a group. Within a work group, you can share data via local memory, which can be useful to speed up certain calculations like matrix multiplications for example (cache tiling).
The idea of GPU parallelization is to split the prioblem up into as many independent parts as there are. In C++ something like
void example(float* data, const int N) {
for(int n=0; n<N; n++) {
data[n] += 1.0f;
}
}
in OpenCL becomes this
kernel void example(global float* data ) {
const int n = get_global_id(0) ;
data[n] += 1.0f;
}
where the global range of the kernel is set to N.
Each GPU thread should process only a single dot. The number of threads (global range) can and should be much larger than the number of GPU cores available. If you don't explicitly need local memory (all threads can work independent), you can set the local work group size to either 32, 64, 128 or 256 - it doesn't matter - but there might be some performance difference between these values. However the global range (total number of threads / points) must be a multiple of the work group size.
What if a task is bigger than the GPU can hold?
I assume you mean when your data set does not fit into video memory all at once. In this case you can do the computation in several batches, so exchange the GPU buffers using PCIe transfers. But that comes at a large performance penalty.
Of course I want this code to run on other GPUs than mine. I'm going to use OpenCL for portability. What different GPUs (description rather than name) am I going to encounter?
OpenCL is excellent for portability across devices and operating systems. Other than AMD GPUs and CPUs, you will encounter Nvidia GPUs, which only support OpenCL 1.2, and Intel GPUs and CPUs. If the graphics drivers are installed, your code will run on all of them without issues. Just be aware that the amount of video memory can be vastly different.

What is the meaning of having a certain number of OpenCL work-items into a CPU?

I'm trying tu understand why I could have more work-items in a CPU than a GPU in one dimension.
PLATFORM 0 DEVICE 0
== CPU ==
DEVICE_VENDOR: Intel
DEVICE NAME: Intel(R) Core(TM) i5-5257U CPU # 2.70GHz
MAXIMUM NUMBER OF PARALLAEL COMPUTE UNITS: 4
MAXIMUM DIMENSIONS FOR THE GLOBAL/LOCAL WORK ITEM IDs: 3
MAXIMUM NUMBER OF WORK-ITEMS IN EACH DIMENSION: (1024 1 1 )
MAXIMUM NUMBER OF WORK-ITEMS IN A WORK-GROUP: 1024
PLATFORM 0 DEVICE 1
== GPU ==
DEVICE_VENDOR: Intel Inc.
DEVICE NAME: Intel(R) Iris(TM) Graphics 6100
MAXIMUM NUMBER OF PARALLAEL COMPUTE UNITS: 48
MAXIMUM DIMENSIONS FOR THE GLOBAL/LOCAL WORK ITEM IDs: 3
MAXIMUM NUMBER OF WORK-ITEMS IN EACH DIMENSION: (256 256 256 )
MAXIMUM NUMBER OF WORK-ITEMS IN A WORK-GROUP: 256
The above is the result of my test code to print the information of the actual hardware that the OpenCL framework can use.
I really do not understand why the value of 1024 in the Maximum number of work-items in the CPU section. What is the real meaning of having that amount of work-items?
CPUs are more general purpose than GPUs. Their OpenCL implementation looks like serialized(but interleaved on instructions) for workgroups since each compute unit is a physical core to issue workgroups as a whole. Since they are serialized/interleaved, they rely on instructions-in-flight. CPUs have 100-200 instructions in-flight and if those instructions are AVX/SSE, then you can expect 800-1600 scalar data operations in-flight. This is well within range of 1024 workitems per workgroup, if OpenCL implementation is vectorized under the hood.
Since GPUs use massive thread-level-parallelism to fill pipelines to have more instructions-in-flight, they don't need as much ILP as CPUs so they can work fine with just 256 threads per workgroup and these threads run in parallel. Thread-level-parallelism fills pipelines easier than instruction-level-parallelism. Intel has 7-way, Nvidia 16-way, Amd 40-way thread-level-parallelism, for each pipeline. Each subslice of Iris6100 has (8 EUs) 64 pipelines. 64 pipelines x 7 means it can have multiple workgroups in-flight too, just like Nvidia and Amd GPUs. Probably having more threads/workitems per workgroup doesn't yield more performance for that iGPU and having more than 1024 threads per workgroup doesn't yield more performance for that CPU.
CPU also has 256kB L2 cache for compute unit which may be another limiting factor on maximum 1024 workitems per workgroup for saving states of each workitem efficiently.
As an image processing example:
You can divide and conquer an image by having 32x32 patches of it, on CPU(1024 threads). But this needs re-computation of 2D indices in kernel since CPU supports 1D kernel.
You can divide and conquer an image by having 16x16 patches of it, on iGPU (256 threads).
256x1 on iGPU
1024x1 on CPU
8x8x4 on iGPU
1x256x1 on iGPU
1x1x256 on iGPU
but not 1x1024x1 on CPU
They are the number of workitems per workgroup and generally they are a fraction of maximum allowed in-flight workitems per compute unit.
For this image processing example, up to several thousands of pixels can be in-flight per compute unit or up to 50k-100k pixels in-flight for a high-end GPU.
Having only 1 on other dimensions for CPU (imo) is originated from CPU's OpenCL implementation being an emulation. It doesn't have hardware to accelerate computation of thread-id values for other dimensions. But GPUs probably have this kind of support on hardware so that they can have more dimensions without decreasing performance as 1D kernel on CPU has to compute some modulos and divisions to emulate 2nd and 3rd dimensions which is a bottleneck for simple kernels.
If CPUs had emulated 2nd and 3rd dimensions too, there would be some modulos and divisions going on background with further slow-downs inside kernel if developers flatten a 3d kernel into 1d indices unknowingly. But GPUs may not even be computing modules under the hood. They could be just some lookup tables as fast as registers or some other fast accessed constants.
This is just a limitation per workgroup. You can launch many workgroups per kernel launch so it shouldn't affect the maximum image size to process in different devices like CPU or GPU or iGPU. Each image is processed by multiple workgroups for tiling from 1x1x1 to 32x32x1 or some other size.

OpenCL: Confused by CL_DEVICE_MAX_COMPUTE_UNITS

I'm confused by this CL_DEVICE_MAX_COMPUTE_UNITS. For instance my Intel GPU on Mac, this number is 48. Does this mean the max number of parallel tasks run at the same time is 48 or the multiple of 48, maybe 96, 144...? (I know each compute unit is composed of 1 or more processing elements and each processing element is actually in charge of a "thread". What if these each of the 48 compute units is composed of more than 1 processing elements ). In other words, for my Mac, the "ideal" speedup, although impossible in reality, is 48 times faster than a CPU core (we assume the single "core" computation speed of CPU and GPU is the same), or the multiple of 48, maybe 96, 144...?
Summary: Your speedup is a little complicated, but your machine's (Intel GPU, probably GEN8 or GEN9) fp32 throughput 768 FLOPs per (GPU) clock and 1536 for fp16. Let's assume fp32, so something less than 768x (maybe a third of this depending on CPU speed). See below for the reasoning and some very important caveats.
A Quick Aside on CL_DEVICE_MAX_COMPUTE_UNITS:
Intel does something wonky when with CL_DEVICE_MAX_COMPUTE_UNITS with its GPU driver.
From the clGetDeviceInfo (OpenCL 2.0). CL_DEVICE_MAX_COMPUTE_UNITS says
The number of parallel compute units on the OpenCL device. A
work-group executes on a single compute unit. The minimum value is 1.
However, the Intel Graphics driver does not actually follow this definition and instead returns the number of EUs (Execution Units) --- An EU a grouping of the SIMD ALUs and slots for 7 different SIMD threads (registers and what not). Each SIMD thread represents 8, 16, or 32 workitems depending on what the compiler picks (we want higher, but register pressure can force us lower).
A workgroup is actually limited to a "Slice" (see the figure in section 5.5 "Slice Architecture"), which happens to be 24 EUs (in recent HW). Pick the GEN8 or GEN9 documents. Each slice has it's own SLM, barriers, and L3. Given that your apple book is reporting 48 EUs, I'd say that you have two slices.
Maximum Speedup:
Let's ignore this major annoyance and work with the EU number (and from those arch docs above). For "speedup" I'm comparing a single threaded FP32 calculation on the CPU. With good parallelization etc on the CPU, the speedup would be less, of course.
Each of the 48 EUs can issue two SIMD4 operations per clock in ideal circumstances. Assuming those are fused multiply-add's (so really two ops), that gives us:
48 EUs * 2 SIMD4 ops per EU * 2 (if the op is a fused multiply add)
= 192 SIMD4 ops per clock
= 768 FLOPs per clock for single precision floating point
So your ideal speedup is actually ~768. But there's a bunch of things that chip into this ideal number.
Setup and teardown time. Let's ignore this (assume the WL time dominates the runtime).
The GPU clock maxes out around a gigahertz while the CPU runs faster. Factor that ratio in. (crudely 1/3 maybe? 3Ghz on the CPU vs 1Ghz on the GPU).
If the computation is not heavily multiply-adds "mads", divide by 2 since I doubled above. Many important workloads are "mad"-dominated though.
The execution is mostly non-divergent. If a SIMD thread branches into an if-then-else, the entire SIMD thread (8,16,or 32 workitems) has to execute that code.
Register banking collisions delays can reduce EU ALU throughput. Typically the compiler does a great job avoiding this, but it can theoretically chew into your performance a bit (usually a few percent depending on register pressure).
Buffer address calculation can chew off a few percent too (EU must spend time doing integer compute to read and write addresses).
If one uses too much SLM or barriers, the GPU must leave some of the EU's idle so that there's enough SLM for each work item on the machine. (You can tweak your algorithm to fix this.)
We must keep the WL compute bound. If we blow out any cache in the data access hierarchy, we run into scenarios where no thread is ready to run on an EU and must stall. Assume we avoid this.
?. I'm probably forgetting other things that can go wrong.
We call the efficiency the percentage of theoretical perfect. So if our workload runs at ~530 FLOPs per clock, then we are 60% efficient of the theoretical 768. I've seen very carefully tuned workloads exceed 90% efficiency, but it definitely can take some work.
The ideal speedup you can get is the total number of processing elements which in your case corresponds to 48 * number of processing elements per compute unit. I do not know of a way to get the number of processing elements from OpenCL (that does not mean that it is not possible), however you can just google it for your GPU.
Up to my knowledge, a compute unit consists of one or multiple processing elements (for GPUs usually a lot), a register file, and some local memory. The threads of a compute unit are executed in a SIMD (single instruction multiple data) fashion. This means that the threads of a compute unit all execute the same operation but on different data.
Also, the speedup you get depends on how you execute a kernel function. Since a single work-group can not run on multiple compute units you need a sufficient number of work-groups in order to fully utilize all of the compute units. In addition, the work-group size should be a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.

gpgpu: how to estimate speed gains based on gpu and cpu specifications

I am a complete beginner to gpgpu and opencl. I am unable to answer the following two questions about GPGPU in general,
a) Suppose I have a piece of code suitable to be run on a gpu (executes the exact same set of instructions on multiple data). Assume I already have my data on the gpu. Is there any way to look at the specifications of the cpu and gpu, and estimate the potential speed gains? For example, how can I estimate the speed gains (excluding time taken to transfer data to the gpu) if I ran the piece of code (running exact same set of instructions on multiple data) on AMDs R9 295X2 gpu (http://www.amd.com/en-us/products/graphics/desktop/r9/2...) instead of intel i7-4770K processor (http://ark.intel.com/products/75123)
b) Is there any way to estimate the amount of time it would take to transfer data to the gpu?
Thank you!
Thank you for the responses! Given the large number of factors influencing speed gains, trying and testing is certainly a good idea. However, I do have a question on the GFLOPS approach mentioned some responses; GFLOPS metric was what I was looking at before posting the question.
I would think that GFLOPS would be a good way to estimate potential performance gains for SIMD type operations, given that it takes into account difference in clock speeds, cores, and floating point operations per cycle. However, when I crunch numbers using GFLOPS specifications something does not seem correct.
The Good:
GFLOPS based estimate seems to match the observed speed gains for the toy kernel below. The kernel for input integer "n" computes the sum (1+2+3+...+n) in a brute force way. I feel, the kernel below for large integers has a lot of computation operations. I ran the kernel for all ints from 1000 to 60000 on gpu and cpu (sequentially on cpu, without threading), and measured the timings.
__kernel void calculate(__global int* input,__global int* output){
size_t id=get_global_id(0);
int inp_num=input[id];
int si;
int sum;
sum=0;
for(int i=0;i<=inp_num;++i)
sum+=i;
output[id]=sum;
}
GPU on my laptop:
NVS 5400M (www.nvidia.com/object/nvs_techspecs.html)
GFLOPS, single precision: 253.44 (en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units)
CPU on my Laptop:
intel i7-3720QM, 2.6 GHz
GFLOPS (assuming single precision): 83.2 (download.intel.com/support/processors/corei7/sb/core_i7-3700_m.pdf). Intel document does not specify if it is single or double
CPU Time: 3.295 sec
GPU Time: 0.184 sec
Speed gains per core: 3.295/0.184 ~18
Theoretical Estimate of Speed gains with Using all 4 cores: 18/4 ~ 4.5
Speed Gains based on FLOPS: (GPU FLOPS)/(CPU FLOPS) = (253.44/83.2) = 3.0
For the above example GLOPS based estimate seems to be consistent with those obtained from experimentation, if the intel documentation indeed specifies FLOPS for single and not double precision. I did try to search for more links for flops specification for the intel processor on my laptop. The observed speed gain also seems good, given that I have a modest GPU
The Problem:
The FLOPS based approach seems to give a much lower than expected speed gains, after factoring gpu price, when comparing AMDs R9 295X2 gpu (www.amd.com/en-us/products/graphics/desktop/r9/295x2#) with intels i7-4770K (ark.intel.com/products/75123):
AMDs FLOPS, single precision: 11.5 TFLOPS (from above mentioned link)
Intels FLOPS, single precision: (num. of cores) x (FLOPS per cycle per core) x (clock speed) = (4) x (32 (peak) (www.pcmag.com/article2/0,2817,2419798,00.asp)) x (3.5) = 448 GFLOPS
Speed Gains Based on FLOPS = (11.5 TFLOPS)/(448) ~ 26
AMD GPUs price: $1500
Intel CPUs price: $300
For every AMD R9 295X2 gpu, I can buy 5 intel i7-4770K cpus, which reduces the effective speed gains to (26/5) ~ 5. However, this estimate is not at all consistent with the 100-200x, increase in speed one would expect. The low estimate in speed gains by the GFLOPS approach makes my think that something is incorrect with my analysis, but I am not sure what?
You need to examine the kernel(s). I myself am learning CUDA, so I couldn't tell you exactly what you'd do with OpenCL.
But I would figure out roughly how many floating point operations one single instance of the kernel will perform. Then find the number of floating point operations per second each device can handle.
number of kernels to be launched * (n floating-point operations of kernel / throughput of device (FLOPS)) = time to execute
The number of kernels launched will depend on your data.
A) Normally this question is never answered. Since we are not speaking at 1.05x speed gains. When the problem is suitable, the problem is BIG enough to hide any overheads (100k WI), and the data is already in the GPU, then we are speaking of speeds of 100-300x. Normally nobody cares if it is 250x or 251x.
The estimation is difficult to make, since the platforms are completely different. Not only on clock speeds, but memory latency and caches, as well as bus speeds and processing elements.
I cannot give you a clear answer on this, other than try it and measure.
B) The time to copy the memory is completely dependent on the GPU-CPU bus speed (PCI bus). And that is the HW limit, in practice you will always have less speed than that on copying. Generally you can apply the rule of three to solve the time needed, but there is always a small driver overhead that depends on the platform and device. So, copying 100 bytes is usually very slow, but copying some MB is as fast as the bus speed.
The memory copying speed is usually not a design constrain when creating a GPGPU app. Since it can be hided in many ways (pinned memory, etc..), that nodoby will notice any speed decrease due to memory operations.
You should not make any decisions on whether the problem is suitable or not for GPU, just by looking at the time lost at memory copy. Better measures are, if the problem is suitable, and if you have enough data to make the GPU busy (otherwise it is faster to do it in CPU directly).
Potential speed gain highly depends on algorithm implementation. It's difficult to forecast performance level unless you're developing come very simple applications (like simplest image filter). In some cases, estimations can be done, using memory system performance as basis, as many algorithms are bandwidth-bound.
You can calculate transmission time by dividing data amount on GPU memory bandwidth for Device-internal operations. Look at hardware characteristics to get it, or calculate if you know memory frequency & bus width. For Host-Device operations, PCI-E bus speed is the limit usually.
If code is easy(is what lightweight- cores of gpu need) and is not memory dependent then you can approximate to :
Sample kernel:
Read two 32-bit floats from memory and
do calcs on them for 20-30 times at least.
Then write to memory once.
New: GPU
Old: CPU
Gain ratio = ((New/Old) - 1 ) *100 (%)
New= 5000 cores * 2 ALU-FPU per core * 1.0 GHz frequency = 10000 gflops
Old = 10 cores * 8 ALU-FPU per core * 4.0GHz frequency = 320 gflops
((New/Old) - 1 ) *100 ===> 3000% speed gain.
This is when code uses registers and local memory mostly. Rarely hitting global mem.
If code is hard( heavy branching + fake recursivity + non-uniformity ) only 3-5 times speed gain. it can be equal or less than CPU performance for linear code ofcourse.
When code is memory dependant, it will be 1TB/s(GPU) divided by 40GB/s(CPU).
If each iteration needs to upload data to gpu, there will be pci-e bandwidth bottlenect too.
loads are usually classified into 2 categories
bandwidth bound - more time is spent on fetches from global-memory. Even increasing cpu clock freq doesn't help. problems like sorting. bandwidth capacity is measured using GBPS
compute bound - directly proportional to cpu horse-power. problems like matrix multiplication. compute capacity is measured using GFLOPS
there is a tool clpeak which tries to programmatically measure these
its very important to classify your problem to measure its performance & choose the right device(knowing their limits)
say if you compare intel-HD-4000 & i7-3630(both on same chip) in https://github.com/krrishnarraj/clpeak/tree/master/results/Intel%28R%29_OpenCL
i7 is comparatively better at bandwidth(plus no transfer overheads)
in terms of compute, gpu is 4-5 times faster than i7

Computing Maximum Concurrent Workgroups

I was wondering if there was a standard way to programatically determine the number of maximum concurrent workgroups that can run on a GPU.
For example, on a NVIDIA card with 5 compute units (or SMs), there can be a maximum of 8 workgroups (or blocks) per compute unit, so the maximum number of workgroups that can be run concurrently is 40.
Since I can find the number of compute units with clGetDeviceInfo, all I need is the maximum number of workgroups that can be run on a compute unit.
Thanks!
Max number of groups per execution unit/ SM are limited by the hardware resources. Let me take example of Intel Gen8 GPU. It contains 16 barrier registers per sub slice. So no more than 16 work groups can run simultaneously.
Also, The amount of shared local memory available per sub-slice (64KB). If for example a work-group requires 32KB of shared local memory, only 2 of those work-groups can run concurrently, regardless of work-group size.
I typically use the number of compute units as the number of work groups. I like to scale up the size of the groups to saturate the hardware, rather than force the gpu to schedule many work groups 'simultaneously'.
I don't know of a way to determine the max number of groups without looking it up on the vendor specs.

Resources