I'm using AMD GPU and OpenCL 1.2. All my calculations use double variables. Who has researched this topic, can you tell me please, is there any point in rewriting everything on float? Will there be a big difference in the calculation time?
How much faster floats are vs doubles depends on the exact hardware. Certain professional grade AMD GPUs are specifically optimised for doubles (e.g. AMD FirePro 9100). On such hardware, double multiplications are half as fast as float multiplications, and addition is the same speed.
On most regular consumer hardware (most Radeons), doubles run at about 1/16 (multiplication) and 1/8 (addition) of float speed. There are also some devices in between, where doubles run at about 1/4 and 1/2 of float speed.
You can get the exact details in AMD's OpenCL Optimisation Guide, section 2.7.1 "Instruction Bandwidths".
Basically, unless you've specifically bought a device optimised for doubles, you will probably see a huge speed improvement moving to floats, assuming you don't actually need the extra precision or can compensate for it easily.
Related
I am looking to take the log base n (10 would be fine) of a 256-bit unsigned integer as a floating point number in Rust, with no loss of precision. It would seem to me that I need to implement a 512-bit float type (8xf64) and use a Taylor series to approximate ln and then the log. I know there are assembly methods to obtain the log of an f64. I am wondering if anyone on Stack Overflow can think of a divide-and-conquer or other method which would be more efficient. I would be amenable to inline assembly operating on the 8xf64 512-bit array.
This might be a useful starting point / outline of an algorithm. IDK if it will get you exact results, like error <= 0.5ulp (i.e. the last bit of the mantissa of your 512-bit float correctly rounded), or even error <= 1 ulp. Perhaps worth looking into what extended-precision calculators like bc / dc / calc do.
I think log converges quickly, so if you're going to do Newton iterations to refine, this bit-scan method might be a fast way to get a good starting point. Even if you only really need about 256 mantissa bits correct, I don't know how big a polynomial it would take to get that, and each multiply / add / fma would be on 512-bit (8x) or 320-bit (5x double precision).
Start by converting integer to binary float
For normal-sized floating-point numbers, the usual method takes advantage of the logarithmic nature of binary floating point. Without 256-bit HW float, you'll want to find the ilog2(int) yourself, i.e. position of the highest set bit (Efficiently find least significant set bit in a large array?).
Then treat your 256-bit integer as the mantissa of a number in the [1..2) or [0.5 .. 1) range, and yes, use a polynomial approximation for log2() that's accurate over that limited range. (Before the actual soft-float stuff, you might want to left-shift the number so it's normalized, i.e. the highest set bit is at the top: x <<= clz(x).)
Then a polynomial approximation over the mantissa
And then add the integer exponent + log_approx(mantissa) => log2(x).
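The exponent-plus-mantissa split above can be sketched in a few lines. This is only a minimal Python sketch, not the Rust soft-float version the question needs: the function name is made up for illustration, Python's built-in big integers stand in for the 256-bit type, and math.log2 on a double stands in for the polynomial approximation, so the result only has double precision.

```python
import math

def log2_via_range_reduction(x: int) -> float:
    """Approximate log2(x) for a positive integer as exponent + log2(mantissa)."""
    if x <= 0:
        raise ValueError("x must be positive")
    e = x.bit_length() - 1     # ilog2(x): index of the highest set bit
    m = x / (1 << e)           # mantissa scaled into [1, 2], rounded to a double
    return e + math.log2(m)    # math.log2 stands in for the polynomial (assumption)

x = 10**70                     # fits comfortably in 256 bits
print(log2_via_range_reduction(x))
print(log2_via_range_reduction(x) * math.log10(2))   # log10(x), close to 70
```

For other bases, multiply the log2 result by a constant, as the last line does for base 10.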
Efficient implementation of log2(__m256d) in AVX2 has more detail on implementing log2(double) (with SIMD doing 4 at a time, very different from doing one extended precision calculation).
It includes some links to implementations, e.g. Agner Fog's VCL using the ratio of two polynomials instead of one larger polynomial, and various tricks to maintain as much precision as possible: https://github.com/vectorclass/version2/blob/9874e4bfc7a0919fda16596144d393da5f8bf6c0/vectormath_exp.h#L942. Such as further range reduction: if x > SQRT2*0.5, then increment the exponent and double the mantissa. (If 512-bit FP division is really expensive, you might just use more terms in one polynomial.) VCL is currently Apache licensed, so feel free to copy as much as you want from it into anything.
IDK if there are more tricks that might become more valuable for big extended precision, or for soft-float, which that implementation doesn't use. VCL's math functions spend more effort to maintain high precision than some faster approximations, but they're not exact.
Do you really need 512-bit float? Maybe only 320-bit (5x double)?
If you don't need more exponent-range than a double, you might be able to extend the double-double-arithmetic technique to wider floats, taking advantage of hardware FP to get 52 or 53 mantissa bits per 64-bit chunk. (From comments, apparently you're already planning to do that.)
You might not need 512-bit float to have sufficient precision. 256/52 = 4.92, so only 5x double chunks have more precision (mantissa bits) than your input, and could exactly represent any 256-bit integer. (IEEE double does have a large enough exponent range; -1022 .. +1023). And have enough to spare that log2(int) should map each 256-bit input to a unique monotonic output, even with some rounding error.
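To make the "5x double is enough" arithmetic concrete, here is a small Python sketch that splits a 256-bit integer into 52-bit pieces, stores each piece as a double, and checks that nothing was lost. The function name and the fixed, non-overlapping 52-bit layout are my own assumptions for illustration; a real double-double style library would use a normalized representation instead.

```python
def split_into_doubles(x: int, bits_per_chunk: int = 52) -> list[float]:
    """Split a non-negative integer into doubles whose exact sum is x."""
    chunks = []
    shift = 0
    mask = (1 << bits_per_chunk) - 1
    while x:
        limb = x & mask
        # limb << shift has at most 52 significant bits, so float() is exact.
        chunks.append(float(limb << shift))
        x >>= bits_per_chunk
        shift += bits_per_chunk
    return chunks

x = (1 << 256) - 1                        # largest 256-bit unsigned integer
chunks = split_into_doubles(x)
print(len(chunks))                        # 5
print(sum(int(c) for c in chunks) == x)   # True: the round trip is exact
```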
In terms of SIMD and parallelization, what is the difference between AVX2 and AVX-512? Are they the same thing or different? I just see that double8 is used in AVX-512 and double4 is used for AVX2?
I am using PyOpenCL to write kernel code in C and not sure what the difference would be.
AVX2 is a 256 bit vector instruction set. You have 256 bit registers which can be interpreted several ways (8 floats, 4 doubles, 32 bytes, etc). AVX1 supports only floating point operations, AVX2 adds 256 bit integer operations. AVX-512 is a set of 512 bit vector instructions. There are only 2 flavors of AVX, plain old AVX and AVX2. AVX-512 comes in many different flavors. You may find Intel's Intrinsics Guide interesting.
The biggest difference is simply getting twice as many operations processed per instruction. Though, there are certain instructions in AVX-512 which may make some specific things more optimal (exponent approximations, for example).
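The double4 vs double8 naming falls straight out of the register width. A tiny Python sketch of that arithmetic (the helper name is just for illustration):

```python
def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of a given size that fit in one vector register."""
    return register_bits // element_bits

for name, width in (("AVX/AVX2", 256), ("AVX-512", 512)):
    print(name, "floats:", lanes(width, 32), "doubles:", lanes(width, 64))
# AVX/AVX2 floats: 8 doubles: 4
# AVX-512 floats: 16 doubles: 8
```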
I'm confused by CL_DEVICE_MAX_COMPUTE_UNITS. For instance, for the Intel GPU on my Mac, this number is 48. Does this mean the max number of parallel tasks running at the same time is 48, or a multiple of 48, maybe 96, 144...? (I know each compute unit is composed of 1 or more processing elements and each processing element is actually in charge of a "thread". What if each of these 48 compute units is composed of more than 1 processing element?) In other words, for my Mac, is the "ideal" speedup, although impossible in reality, 48 times faster than a CPU core (assuming the single "core" computation speed of CPU and GPU is the same), or a multiple of 48, maybe 96, 144...?
Summary: Your speedup is a little complicated, but your machine's (Intel GPU, probably GEN8 or GEN9) fp32 throughput is 768 FLOPs per (GPU) clock, and 1536 for fp16. Let's assume fp32, so something less than 768x (maybe a third of this depending on CPU speed). See below for the reasoning and some very important caveats.
A Quick Aside on CL_DEVICE_MAX_COMPUTE_UNITS:
Intel does something wonky with CL_DEVICE_MAX_COMPUTE_UNITS in its GPU driver.
From the clGetDeviceInfo documentation (OpenCL 2.0), CL_DEVICE_MAX_COMPUTE_UNITS is:
The number of parallel compute units on the OpenCL device. A
work-group executes on a single compute unit. The minimum value is 1.
However, the Intel Graphics driver does not actually follow this definition and instead returns the number of EUs (Execution Units). An EU is a grouping of the SIMD ALUs and slots for 7 different SIMD threads (registers and whatnot). Each SIMD thread represents 8, 16, or 32 workitems depending on what the compiler picks (we want higher, but register pressure can force us lower).
A workgroup is actually limited to a "Slice" (see the figure in section 5.5 "Slice Architecture"), which happens to be 24 EUs (in recent HW). Pick the GEN8 or GEN9 documents. Each slice has its own SLM, barriers, and L3. Given that your Mac is reporting 48 EUs, I'd say that you have two slices.
Maximum Speedup:
Let's ignore this major annoyance and work with the EU number (and from those arch docs above). For "speedup" I'm comparing a single threaded FP32 calculation on the CPU. With good parallelization etc on the CPU, the speedup would be less, of course.
Each of the 48 EUs can issue two SIMD4 operations per clock in ideal circumstances. Assuming those are fused multiply-add's (so really two ops), that gives us:
48 EUs * 2 SIMD4 ops per EU * 2 (if the op is a fused multiply add)
= 192 SIMD4 ops per clock
= 768 FLOPs per clock for single precision floating point
So your ideal speedup is actually ~768. But there's a bunch of things that chip into this ideal number.
Setup and teardown time. Let's ignore this (assume the WL time dominates the runtime).
The GPU clock maxes out around a gigahertz while the CPU runs faster. Factor that ratio in (crudely 1/3 maybe? 3 GHz on the CPU vs 1 GHz on the GPU).
If the computation is not heavily multiply-adds "mads", divide by 2 since I doubled above. Many important workloads are "mad"-dominated though.
The execution is mostly non-divergent. If a SIMD thread branches into an if-then-else, the entire SIMD thread (8,16,or 32 workitems) has to execute that code.
Register banking collisions delays can reduce EU ALU throughput. Typically the compiler does a great job avoiding this, but it can theoretically chew into your performance a bit (usually a few percent depending on register pressure).
Buffer address calculation can chew off a few percent too (EU must spend time doing integer compute to read and write addresses).
If one uses too much SLM or barriers, the GPU must leave some of the EU's idle so that there's enough SLM for each work item on the machine. (You can tweak your algorithm to fix this.)
We must keep the WL compute bound. If we blow out any cache in the data access hierarchy, we run into scenarios where no thread is ready to run on an EU and must stall. Assume we avoid this.
?. I'm probably forgetting other things that can go wrong.
We call the efficiency the percentage of the theoretical peak. So if our workload runs at ~530 FLOPs per clock, then we are at about 69% of the theoretical 768. I've seen very carefully tuned workloads exceed 90% efficiency, but it definitely can take some work.
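The peak-throughput arithmetic and the efficiency definition can be written down as a short Python sketch. The function names are mine, the 1/3 clock ratio is the crude guess from the list above, and the "measured" value in the last line is purely hypothetical.

```python
def peak_flops_per_clock(eus, simd_ops_per_eu=2, simd_width=4, flops_per_op=2):
    """EUs x SIMD4 issues per clock x lanes x 2 FLOPs per fused multiply-add."""
    return eus * simd_ops_per_eu * simd_width * flops_per_op

def efficiency(measured_flops_per_clock, peak):
    """Fraction of the theoretical peak actually achieved."""
    return measured_flops_per_clock / peak

peak = peak_flops_per_clock(48)
clock_ratio = 1.0 / 3.0            # assumed: 1 GHz GPU vs 3 GHz CPU
print(peak)                        # 768
print(peak * clock_ratio)          # roughly 256x vs one scalar CPU thread
print(efficiency(384, peak))       # 0.5 for a hypothetical measured 384 FLOPs/clock
```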
The ideal speedup you can get is the total number of processing elements which in your case corresponds to 48 * number of processing elements per compute unit. I do not know of a way to get the number of processing elements from OpenCL (that does not mean that it is not possible), however you can just google it for your GPU.
To my knowledge, a compute unit consists of one or multiple processing elements (for GPUs usually a lot), a register file, and some local memory. The threads of a compute unit are executed in a SIMD (single instruction multiple data) fashion. This means that the threads of a compute unit all execute the same operation but on different data.
Also, the speedup you get depends on how you execute a kernel function. Since a single work-group cannot run on multiple compute units, you need a sufficient number of work-groups in order to fully utilize all of the compute units. In addition, the work-group size should be a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
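As a rough host-side sketch of that sizing rule (plain Python, no OpenCL calls): round the global size up to a multiple of the work-group size and check whether there are at least as many work-groups as compute units. The helper names and the example numbers (1000 items, local size 64) are assumptions; in real code the multiple would come from querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, and the kernel needs a bounds check for the padded work-items.

```python
def round_up(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple

def launch_shape(n_items: int, local_size: int, compute_units: int):
    """Return (global_size, num_work_groups, enough_groups_to_fill_device)."""
    global_size = round_up(n_items, local_size)
    num_groups = global_size // local_size
    return global_size, num_groups, num_groups >= compute_units

print(launch_shape(1000, 64, 48))      # (1024, 16, False): too few work-groups
print(launch_shape(100_000, 64, 48))   # (100032, 1563, True)
```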
I am a complete beginner to gpgpu and opencl. I am unable to answer the following two questions about GPGPU in general,
a) Suppose I have a piece of code suitable to be run on a gpu (executes the exact same set of instructions on multiple data). Assume I already have my data on the gpu. Is there any way to look at the specifications of the cpu and gpu, and estimate the potential speed gains? For example, how can I estimate the speed gains (excluding time taken to transfer data to the gpu) if I ran the piece of code (running exact same set of instructions on multiple data) on AMDs R9 295X2 gpu (http://www.amd.com/en-us/products/graphics/desktop/r9/2...) instead of intel i7-4770K processor (http://ark.intel.com/products/75123)
b) Is there any way to estimate the amount of time it would take to transfer data to the gpu?
Thank you!
Thank you for the responses! Given the large number of factors influencing speed gains, trying and testing is certainly a good idea. However, I do have a question on the GFLOPS approach mentioned in some responses; the GFLOPS metric was what I was looking at before posting the question.
I would think that GFLOPS would be a good way to estimate potential performance gains for SIMD type operations, given that it takes into account difference in clock speeds, cores, and floating point operations per cycle. However, when I crunch numbers using GFLOPS specifications something does not seem correct.
The Good:
The GFLOPS-based estimate seems to match the observed speed gains for the toy kernel below. For an input integer "n", the kernel computes the sum (1+2+3+...+n) in a brute force way. I feel the kernel below performs a lot of computation for large integers. I ran the kernel for all ints from 1000 to 60000 on gpu and cpu (sequentially on cpu, without threading), and measured the timings.
__kernel void calculate(__global int* input, __global int* output) {
    // One work-item per input value.
    size_t id = get_global_id(0);
    int inp_num = input[id];
    // Brute-force sum 1 + 2 + ... + inp_num (adding i = 0 changes nothing).
    int sum = 0;
    for (int i = 0; i <= inp_num; ++i)
        sum += i;
    output[id] = sum;
}
GPU on my laptop:
NVS 5400M (www.nvidia.com/object/nvs_techspecs.html)
GFLOPS, single precision: 253.44 (en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units)
CPU on my Laptop:
intel i7-3720QM, 2.6 GHz
GFLOPS (assuming single precision): 83.2 (download.intel.com/support/processors/corei7/sb/core_i7-3700_m.pdf). Intel document does not specify if it is single or double
CPU Time: 3.295 sec
GPU Time: 0.184 sec
Speed gains per core: 3.295/0.184 ~18
Theoretical Estimate of Speed gains with Using all 4 cores: 18/4 ~ 4.5
Speed Gains based on FLOPS: (GPU FLOPS)/(CPU FLOPS) = (253.44/83.2) = 3.0
For the above example, the GFLOPS-based estimate seems to be consistent with the one obtained from experimentation, if the Intel documentation indeed specifies FLOPS for single and not double precision. I did try to search for more links for the FLOPS specification of the Intel processor on my laptop. The observed speed gain also seems good, given that I have a modest GPU.
The Problem:
The FLOPS based approach seems to give a much lower than expected speed gains, after factoring gpu price, when comparing AMDs R9 295X2 gpu (www.amd.com/en-us/products/graphics/desktop/r9/295x2#) with intels i7-4770K (ark.intel.com/products/75123):
AMDs FLOPS, single precision: 11.5 TFLOPS (from above mentioned link)
Intels FLOPS, single precision: (num. of cores) x (FLOPS per cycle per core) x (clock speed) = (4) x (32 (peak) (www.pcmag.com/article2/0,2817,2419798,00.asp)) x (3.5) = 448 GFLOPS
Speed Gains Based on FLOPS = (11.5 TFLOPS)/(448 GFLOPS) ~ 26
AMD GPUs price: $1500
Intel CPUs price: $300
For every AMD R9 295X2 GPU, I can buy 5 Intel i7-4770K CPUs, which reduces the effective speed gain to (26/5) ~ 5. However, this estimate is not at all consistent with the 100-200x increase in speed one would expect. The low estimate of the speed gain from the GFLOPS approach makes me think that something is incorrect in my analysis, but I am not sure what.
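For reference, the arithmetic above as a few lines of Python. It only reproduces the numbers quoted in this post (peak figures and prices as given); the function name is made up for illustration.

```python
def cpu_peak_gflops(cores: int, flops_per_cycle: int, clock_ghz: float) -> float:
    """Peak = cores x FLOPs per cycle per core x clock in GHz."""
    return cores * flops_per_cycle * clock_ghz

cpu = cpu_peak_gflops(4, 32, 3.5)     # i7-4770K figures quoted above
gpu = 11.5e3                          # R9 295X2: 11.5 TFLOPS expressed in GFLOPS
ratio = gpu / cpu
price_ratio = 1500 / 300              # one GPU costs as much as 5 CPUs

print(cpu)                            # 448.0
print(round(ratio, 1))                # 25.7
print(round(ratio / price_ratio, 1))  # 5.1 per dollar spent
```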
You need to examine the kernel(s). I myself am learning CUDA, so I couldn't tell you exactly what you'd do with OpenCL.
But I would figure out roughly how many floating point operations one single instance of the kernel will perform. Then find the number of floating point operations per second each device can handle.
number of kernels to be launched * (n floating-point operations of kernel / throughput of device (FLOPS)) = time to execute
The number of kernels launched will depend on your data.
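A minimal Python sketch of that formula, with made-up numbers. It assumes the device actually sustains its peak FLOPS and that memory access is free, so treat the result as a lower bound rather than a prediction.

```python
def estimated_seconds(n_instances: int, flops_per_instance: float,
                      device_flops: float) -> float:
    """instances x (FLOPs per instance / device throughput in FLOPS)."""
    return n_instances * (flops_per_instance / device_flops)

# Hypothetical workload: one million kernel instances, 1000 FLOPs each,
# on a device with a peak of 250 GFLOPS.
print(estimated_seconds(1_000_000, 1000, 250e9))   # about 0.004 s
```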
A) Normally this question is never answered, since we are not talking about 1.05x speed gains. When the problem is suitable, BIG enough to hide any overheads (100k WI), and the data is already on the GPU, then we are talking about speedups of 100-300x. Normally nobody cares if it is 250x or 251x.
The estimation is difficult to make, since the platforms are completely different. Not only on clock speeds, but memory latency and caches, as well as bus speeds and processing elements.
I cannot give you a clear answer on this, other than try it and measure.
B) The time to copy the memory is completely dependent on the GPU-CPU bus speed (PCI bus). That is the HW limit; in practice you will always get less than that when copying. Generally you can apply the rule of three to work out the time needed, but there is always a small driver overhead that depends on the platform and device. So, copying 100 bytes is usually very slow, but copying some MB is as fast as the bus speed.
The memory copying speed is usually not a design constraint when creating a GPGPU app, since it can be hidden in many ways (pinned memory, etc.) so that nobody will notice any speed decrease due to memory operations.
You should not decide whether the problem is suitable for the GPU just by looking at the time lost in memory copies. Better criteria are whether the problem is suitable and whether you have enough data to keep the GPU busy (otherwise it is faster to do it on the CPU directly).
Potential speed gain highly depends on the algorithm implementation. It's difficult to forecast the performance level unless you're developing some very simple application (like the simplest image filter). In some cases, estimates can be made using memory system performance as a basis, as many algorithms are bandwidth-bound.
You can calculate transmission time by dividing the data amount by the GPU memory bandwidth for device-internal operations. Look at the hardware characteristics to get it, or calculate it if you know the memory frequency and bus width. For host-device operations, PCI-E bus speed is usually the limit.
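A rough Python sketch of the transfer-time estimate: a fixed per-transfer overhead plus bytes divided by bandwidth. The 12 GB/s effective bus bandwidth and the 10 microsecond overhead are illustrative assumptions, not measurements; measure your own platform before relying on them.

```python
def transfer_seconds(n_bytes: float, bandwidth_bytes_per_s: float,
                     overhead_s: float) -> float:
    """Fixed driver/launch overhead plus the time on the bus."""
    return overhead_s + n_bytes / bandwidth_bytes_per_s

BANDWIDTH = 12e9     # assumed effective host-to-device bandwidth, bytes/s
OVERHEAD = 10e-6     # assumed fixed cost per transfer, seconds

for n_bytes in (100, 100e6):
    t = transfer_seconds(n_bytes, BANDWIDTH, OVERHEAD)
    print(n_bytes, t, "overhead share:", OVERHEAD / t)
```

With these assumed numbers, the 100-byte copy is almost entirely overhead, while the 100 MB copy runs at essentially bus speed.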
If the code is easy (which is what the lightweight cores of a GPU need) and is not memory dependent, then you can approximate as follows:
Sample kernel:
Read two 32-bit floats from memory and
do calcs on them for 20-30 times at least.
Then write to memory once.
New: GPU
Old: CPU
Gain ratio = ((New/Old) - 1 ) *100 (%)
New= 5000 cores * 2 ALU-FPU per core * 1.0 GHz frequency = 10000 gflops
Old = 10 cores * 8 ALU-FPU per core * 4.0GHz frequency = 320 gflops
((New/Old) - 1 ) *100 ===> 3000% speed gain.
This is when code uses registers and local memory mostly. Rarely hitting global mem.
If the code is hard (heavy branching + fake recursivity + non-uniformity), expect only a 3-5x speed gain. It can be equal to or less than CPU performance for linear code, of course.
When the code is memory dependent, the ratio will be 1 TB/s (GPU) divided by 40 GB/s (CPU).
If each iteration needs to upload data to the GPU, there will be a PCIe bandwidth bottleneck too.
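The numbers in this answer can be reproduced with a few lines of Python. The function name is mine, and the core counts, clocks and bandwidths are the illustrative figures from above, not real devices.

```python
def gain_percent(new: float, old: float) -> float:
    """Gain ratio = ((New / Old) - 1) * 100, in percent."""
    return (new / old - 1.0) * 100.0

gpu_gflops = 5000 * 2 * 1.0       # cores x ALU-FPU per core x GHz
cpu_gflops = 10 * 8 * 4.0

print(gpu_gflops, cpu_gflops)                  # 10000.0 320.0
print(gain_percent(gpu_gflops, cpu_gflops))    # 3025.0, i.e. roughly 3000%

# Memory-dependent code: compare bandwidths instead (1 TB/s vs 40 GB/s).
print(1000 / 40)                               # 25.0x
```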
Loads are usually classified into 2 categories:
bandwidth bound - more time is spent on fetches from global memory. Even increasing the CPU clock frequency doesn't help. Problems like sorting. Bandwidth capacity is measured in GB/s.
compute bound - directly proportional to CPU horsepower. Problems like matrix multiplication. Compute capacity is measured in GFLOPS.
There is a tool, clpeak, which tries to measure these programmatically.
It's very important to classify your problem in order to measure its performance and choose the right device (knowing their limits).
say if you compare intel-HD-4000 & i7-3630(both on same chip) in https://github.com/krrishnarraj/clpeak/tree/master/results/Intel%28R%29_OpenCL
i7 is comparatively better at bandwidth(plus no transfer overheads)
in terms of compute, gpu is 4-5 times faster than i7
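One crude way to do that classification on paper: estimate the time the arithmetic would take at peak GFLOPS and the time the memory traffic would take at peak bandwidth, and see which is larger. This Python sketch uses a hypothetical device (250 GFLOPS, 25 GB/s) and idealized operation and byte counts; in practice you would plug in the figures clpeak reports for your device.

```python
def classify(flops: float, bytes_moved: float,
             peak_gflops: float, bandwidth_gb_per_s: float):
    """Return (label, compute_seconds, memory_seconds) for a workload."""
    t_compute = flops / (peak_gflops * 1e9)
    t_memory = bytes_moved / (bandwidth_gb_per_s * 1e9)
    label = "compute bound" if t_compute > t_memory else "bandwidth bound"
    return label, t_compute, t_memory

PEAK_GFLOPS, BANDWIDTH_GBS = 250.0, 25.0    # hypothetical device

n = 1_000_000    # vector add: 1 FLOP and 12 bytes (2 reads, 1 write) per element
print(classify(n, 12 * n, PEAK_GFLOPS, BANDWIDTH_GBS)[0])

m = 1024         # m x m float matrix multiply with ideal data reuse
print(classify(2 * m**3, 3 * m * m * 4, PEAK_GFLOPS, BANDWIDTH_GBS)[0])
```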
How is floating point math performed on a processor with no floating point unit ? e.g low-end 8 bit microcontrollers.
Have a look at this article: http://www.edwardrosten.com/code/fp_template.html
(from this article)
First you have to think about how to represent a floating point number in memory:
struct this_is_a_floating_point_number
{
    static const unsigned int mant = ???;
    static const int expo = ???;
    static const bool posi = ???;
};
Then you'd have to consider how to do basic calculations with this representation. Some might be easy to implement and be rather fast at runtime (multiply or divide by 2 come to mind)
Division might be harder; for instance, Newton's method could be used to calculate the answer.
Finally, smart approximations and generated values in tables might speed up the calculations at run time.
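As an illustration of the Newton idea for division, here is a minimal Python sketch that computes a reciprocal using only integer multiplies, subtracts and shifts on fixed-point numbers. The 30-bit fixed-point format, the function name and the linear starting guess (48/17 - 32/17 * d, a commonly used one for d in [0.5, 1]) are my own choices for the example, not taken from the article.

```python
FRAC_BITS = 30
ONE = 1 << FRAC_BITS                 # fixed-point representation of 1.0

# Constants for the initial guess x0 = 48/17 - 32/17 * d, precomputed once.
C1 = (48 * ONE) // 17
C2 = (32 * ONE) // 17

def reciprocal_fixed(d_fx: int, iterations: int = 5) -> int:
    """Approximate 1/d in fixed point for d in [0.5, 1], integer ops only."""
    if not (ONE // 2 <= d_fx <= ONE):
        raise ValueError("scale d into [0.5, 1] first (shift by the exponent)")
    x = C1 - ((C2 * d_fx) >> FRAC_BITS)
    for _ in range(iterations):
        t = (d_fx * x) >> FRAC_BITS           # t = d * x
        x = (x * (2 * ONE - t)) >> FRAC_BITS  # Newton step: x = x * (2 - d * x)
    return x

r = reciprocal_fixed(int(0.75 * ONE))
print(r / ONE)    # close to 1.3333 (float division here is only for display)
```

A full divide a/b would then be a multiplied by the reciprocal of b's mantissa, with the exponents subtracted separately.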
Many years ago C++ templates helped me get floating point calculations working on an Intel 386 SX.
In the end I learned a lot of math and C++ but decided at the same time to buy a co-processor.
The polynomial algorithms and the smart lookup tables in particular (who needs a cosine or tan function when you have a sine function?) helped a lot in thinking about using integers for floating point arithmetic. Taylor series were a revelation too.
In systems without any floating-point hardware, the CPU emulates it using a series of simpler fixed-point arithmetic operations that run on the integer arithmetic logic unit.
Take a look at the wikipedia page: Floating-point_unit#Floating-point_library as you might find more info.
It is not actually the CPU that emulates the instructions. On low-end CPUs, floating point operations are built out of integer arithmetic instructions, and the compiler is what generates those instructions. Basically the compiler (tool chain) comes with a floating point library containing floating point functions.
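To give a feel for what such a library routine does, here is a minimal Python sketch of a single-precision multiply built purely from integer operations on the IEEE 754 bit patterns. It is an illustration under simplifying assumptions, not any particular tool chain's code: it handles only normal numbers (no zeros, subnormals, infinities or NaNs), it rounds to nearest-even, and the function names are invented. The struct helpers are only there to convert to and from bit patterns.

```python
import struct

def f32_bits(x: float) -> int:
    """Bit pattern of x rounded to IEEE 754 single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    """Value of a single-precision bit pattern."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def soft_mul_f32(a: int, b: int) -> int:
    """Multiply two normal binary32 values given as bit patterns."""
    sign = (a ^ b) & 0x80000000
    ea, eb = (a >> 23) & 0xFF, (b >> 23) & 0xFF
    if ea in (0, 255) or eb in (0, 255):
        raise ValueError("zero, subnormal, infinity and NaN are not handled")
    ma = (a & 0x7FFFFF) | 0x800000        # restore the implicit leading 1
    mb = (b & 0x7FFFFF) | 0x800000
    prod = ma * mb                        # 24 x 24 -> up to 48 bits
    e = ea + eb - 127                     # add exponents, remove one bias
    shift = 23
    if prod & (1 << 47):                  # product in [2, 4): renormalize
        shift = 24
        e += 1
    mant = prod >> shift
    rem = prod & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if rem > half or (rem == half and (mant & 1)):   # round to nearest-even
        mant += 1
        if mant == 1 << 24:               # rounding carried out of the mantissa
            mant >>= 1
            e += 1
    if not 0 < e < 255:
        raise ValueError("overflow and underflow are not handled")
    return sign | (e << 23) | (mant & 0x7FFFFF)

print(bits_f32(soft_mul_f32(f32_bits(1.5), f32_bits(2.0))))   # 3.0
```

On an 8-bit microcontroller the 24 x 24 bit multiply itself would be a loop of shifts and adds over several bytes, which is where most of the time goes.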
The short answer is "slowly". Specialized hardware can do tasks like extracting groups of bits that are not necessarily byte-aligned very fast. Software can do everything that can be done by specialized hardware, but tends to take much longer to do it.
Read "The complete Spectrum ROM disassembly" at http://www.worldofspectrum.org/documentation.html to see examples of floating point computations on an 8 bit Z80 processor.
For things like sine functions, you precompute a few values then interpolate using Chebyshev polynomials.