While reading Physically Based Rendering in Filament, I found a few interesting paragraphs in section 4.4.1 about optimizing the implementation of a GGX NDF approximation for half precision floats. I understood that the calculation of 1 - dot(n, h) * dot(n, h) can cause so-called catastrophic cancellation and why using the cross product solves the problem, however I didn't get how any of this is related to half precision floats.
It seems that GLSL does not have any half specifier, unlike HLSL (which has simply mapped it to float since D3D10, most likely because desktop hardware of the time didn't support it anyway; though it seems that with the newest hardware it's back again). The thing with Filament is that it is primarily developed for mobile platforms like Android, where half precision floats are supported in hardware.
I understand that using half precision floats is important for performance on both mobile and the most modern desktop targets. As such, I would like to understand how the following code is optimized for half precision floats, as I can see no half specifier or similar, merely a constant and a macro:
#define MEDIUMP_FLT_MAX 65504.0
#define saturateMediump(x) min(x, MEDIUMP_FLT_MAX)
float D_GGX(float roughness, float NoH, const vec3 n, const vec3 h) {
    vec3 NxH = cross(n, h);
    float a = NoH * roughness;
    float k = roughness / (dot(NxH, NxH) + a * a);
    float d = k * k * (1.0 / PI);
    return saturateMediump(d);
}
For completeness, here is the unoptimized code:
float D_GGX(float NoH, float roughness) {
    float a = NoH * roughness;
    float k = roughness / (1.0 - NoH * NoH + a * a);
    return k * k * (1.0 / PI);
}
While GLSL does not have a half type, it does have precision qualifiers, whose effects are exclusive to and dependent on mobile platforms. I'm assuming that the complete optimized shader from your example contains a default precision qualifier setting floats to mediump, like so:
precision mediump float;
Note though that the actual precision remains unspecified: a mediump float might have 16 bits on one platform while it has 24 bits on another.
Here's the catch though: as stated in the linked article and the GLSL specification, precision qualifiers are only supported for portability and ought to have no effect on desktop platforms. That means that even desktop GPUs with float16 support would break with the specification if they honored the precision qualifier. On desktop platforms you'll have to enable and use the appropriate extension (e.g. GL_AMD_gpu_shader_half_float) and its specific syntax (e.g. its types) to utilize float16 capabilities.
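To see why mediump (roughly fp16) precision matters for the 1.0 - NoH * NoH term, here is a minimal C sketch (not shader code) that simulates 16-bit floats. It assumes a compiler with _Float16 support (recent GCC/Clang on a suitable target), and the small angle between n and h is chosen purely for illustration:
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Small angle between n and h, picked only to make the effect visible. */
    double theta = 0.02;

    /* Simulate the shader's mediump/half values. */
    _Float16 NoH  = (_Float16)cos(theta);   /* ~0.9998 rounds to exactly 1.0 in fp16 */
    _Float16 sinT = (_Float16)sin(theta);   /* ~0.02 is still represented well       */

    _Float16 oneMinus = (_Float16)1.0 - NoH * NoH;  /* catastrophic cancellation     */
    _Float16 crossSq  = sinT * sinT;                /* |n x h|^2 = sin^2(theta)      */

    printf("1 - NoH*NoH   = %.8f\n", (double)oneMinus);        /* prints 0.00000000  */
    printf("dot(NxH, NxH) = %.8f\n", (double)crossSq);         /* roughly 0.0004     */
    printf("exact sin^2   = %.8f\n", sin(theta) * sin(theta)); /* 0.00039995         */
    return 0;
}
At fp16, cos(theta) ~ 0.9998 rounds to exactly 1.0, so 1.0 - NoH * NoH collapses to zero, while sin^2(theta), which is what dot(NxH, NxH) computes for unit vectors, keeps a usable value. That is why the cross-product form survives at mediump precision; the MEDIUMP_FLT_MAX clamp then guards the other end, where a tiny denominator would push d past the largest representable half float (65504).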
I have the following OpenCL kernel, which copies values from one buffer to another, optionally inverting the value (the 'invert' arg can be 1 or -1):
__kernel void extraction(__global const short* src_buff, __global short* dest_buff, const int record_len, const int invert)
{
    int i = get_global_id(0); // Index of record in buffer
    int j = get_global_id(1); // Index of value in record
    dest_buff[(i * record_len) + j] = src_buff[(i * record_len) + j] * invert;
}
The source buffer contains one or more "records", each containing N (record_len) short values. All records in the buffer are of equal length, and record_len is always a multiple of 32.
The global size is 2D (number of records in the buffer, record length), and I chose this as it seemed to make best use of the GPU parallel processing, with each thread being responsible for copying just one value in one record in the buffer.
(The local work size is set to NULL by the way, allowing OpenCL to determine the value itself).
After reading about vectors recently, I was wondering if I could use these to improve on the performance? I understand the concept of vectors but I'm not sure how to use them in practice, partly due to lack of good examples.
I'm sure the kernel's performance is pretty reasonable already, so this is mainly out of curiosity to see what difference it would make using vectors (or other more suitable approaches).
At the risk of being a bit naive here, could I simply change the two buffer arg types to short16, and change the second value in the 2-D global size from "record length" to "record length / 16"? Would this result in each kernel thread copying a block of 16 short values between the buffers?
Your naive assumption is basically correct, though you may want to add a hint to the compiler that this kernel is optimized for the vector type (Section 6.7.2 of the spec). In your case, you would add
__attribute__((vec_type_hint(short16)))
above your kernel function. So in your example, you would have
__attribute__((vec_type_hint(short16)))
__kernel void extraction(__global const short16* src_buff, __global short16* dest_buff, const int record_len, const int invert)
{
    int i = get_global_id(0); // Index of record in buffer
    int j = get_global_id(1); // Index of value in record
    dest_buff[(i * record_len) + j] = src_buff[(i * record_len) + j] * invert;
}
You are correct that your 2nd global dimension should be divided by 16, and the record_len you pass to the kernel should also be divided by 16. Also, if you were to specify the local size instead of giving it NULL, you would want to divide that by 16 as well.
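For reference, here is a host-side C sketch of those changes (a fragment, not a full program; queue, kernel, num_records and record_len are assumed to already exist in your host code):
/* Global size: one work-item per record, one per short16 block within a record. */
size_t global[2];
global[0] = num_records;
global[1] = record_len / 16;          /* record length counted in short16 blocks */
/* The record_len the kernel indexes with must now be in short16 units as well. */
int vec_record_len = record_len / 16;
clSetKernelArg(kernel, 2, sizeof(int), &vec_record_len);
/* Local size left NULL as before; if you specify one later, divide it by 16 too. */
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);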
There are some other things to consider though.
You might think choosing the largest vector size should provide the best performance, especially with such a simple kernel, but in my experience that is rarely the optimal size. You may try asking clGetDeviceInfo for CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT, but for me this is rarely accurate (it may also give you 1, meaning the compiler will try auto-vectorization or the device doesn't have vector hardware). It is best to try different vector sizes and see which is fastest.
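If you do want to check the device's preferred width, the query looks like this (C fragment; device is an existing cl_device_id):
cl_uint preferred_short_width = 0;
clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT,
                sizeof(preferred_short_width), &preferred_short_width, NULL);
/* A value of 1 usually means scalar hardware or reliance on auto-vectorization. */
printf("preferred short vector width: %u\n", preferred_short_width);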
If your device supports auto-vectorization, and you want to give it a go, it may help to remove your record_len parameter and replace it with get_global_size(1) so the compiler/driver can take care of dividing record_len by whatever vector size it picks. I would recommend doing this anyway, assuming record_len is equal to the global size you gave that dimension.
Also, you gave NULL to the local size argument so that the implementation picks a size automatically. It is guaranteed to pick a size that works, but it will not necessarily pick the most optimal size.
Lastly, for general OpenCL optimizations, you may want to take a look at the NVIDIA OpenCL Best Practices Guide for NVidia hardware, or the AMD APP SDK OpenCL User Guide for AMD GPU hardware. The NVidia one is from 2009, and I'm not sure how much their hardware has changed since. Notice though that it actually says:
"The CUDA architecture is a scalar architecture. Therefore, there is no performance benefit from using vector types and instructions. These should only be used for convenience."
Older AMD hardware (pre-GCN) benefited from using vector types, but AMD suggests not using them on GCN devices (see mogu's comment). Also, if you are targeting a CPU, the OpenCL implementation will use AVX hardware if available.
I'm looking for the OpenCL sine implementation.
Well, I know the OpenCL implementation is hardware-vendor-specific, so the NVIDIA OpenCL implementation could look different from the AMD one. But I want to know whether I need to implement my own sine for speed reasons.
Accepting this, what is the difference between sin and native_sin?
Here is a test of an AMD implementation, applying the sin function repeatedly to its own output so that any error makes the result more chaotic as the iteration count increases (100 iterations in this example):
__kernel void sin_test_0(__global float *a)
{
    int id = get_global_id(0);
    float r = a[id];
    for(int i = 0; i < 100; i++)
        r = sin(r);
    a[id] = r;
}
a[id] was initialized to 1111 for all 16M elements.
sin() = -0.1692203; completed in 265 ms (320-core GPU) and 1950 ms (8-core CPU using float4)
C#'s implementation with the Math library = -0.1692202; completed in 55505 ms (single core), 12998 ms (4 threads) and 8200 ms (max threads with Parallel.For), without any explicit compiler hints about vectorization
native_sin() = -0.1692208; completed in 45 ms
half_sin() = -0.1692207; completed in 165 ms
series expansion of sine (for input in [-1, 1]) = -0.155202; completed in 40 ms
Only the 7th digit differs, and that may be because C# uses the double type for its computation; the native version is a bit farther from the original. half_sin seems to be even more accurate than native_sin but slower. half_sin has a range of -2^16 to 2^16.
Series expansion:
float sin_se(float x)
{
    x -= 6.28318530718f * (convert_int(x * 0.15915494309f));
    float xs = x * x;
    float xc = x * x * x;
    return ((x - xc * 0.166666f) + (xc * xs) * 0.0083333f) - (xc * xs * xs) * 0.0001984f;
}
If the input is between -1 and +1, the first line (the range reduction) is not necessary, and this becomes faster.
native_sin() probably uses hardware-based options to speed things up, such as lookup tables of magic numbers and a Newton-Raphson engine. You are unlikely to surpass the performance of these parts with software emulation at an equal error. The example above runs on a GPU; there is only a minor difference when using a CPU. Even though OpenCL dictates that all devices must have less than 100 ULP error, one device may have 90 ULP and another 70 ULP, and the accumulated error increases the gap between them. If you think you don't accumulate much error and you have safety digits, you could just use native_sin; otherwise, you can add your series-expansion-like algorithm so all devices compute the same way, but with more error.
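To get a feel for the accuracy of the series expansion on the host, here is a small plain-C check against the C library's sinf (convert_int from the kernel is replaced by a plain cast; the error figure in the comment is only a rough expectation):
#include <math.h>
#include <stdio.h>
/* Same polynomial as above, written in plain C. */
static float sin_se(float x)
{
    x -= 6.28318530718f * (int)(x * 0.15915494309f);  /* crude range reduction */
    float xs = x * x;
    float xc = x * x * x;
    return ((x - xc * 0.166666f) + (xc * xs) * 0.0083333f) - (xc * xs * xs) * 0.0001984f;
}
int main(void)
{
    float worst = 0.0f;
    for (float x = -1.0f; x <= 1.0f; x += 0.001f) {
        float err = fabsf(sin_se(x) - sinf(x));
        if (err > worst) worst = err;
    }
    printf("worst abs error on [-1, 1]: %g\n", worst);  /* roughly in the 1e-6..1e-5 range */
    return 0;
}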
I am working with the OpenCL reduction example provided by Apple here
After a few days of dissecting it, I understand the basics; I've converted it to a version that runs more or less reliably in C++ (openFrameworks) and finds the largest number in the input set.
However, in doing so, a few questions have arisen as follows:
Why are multiple passes used? The most I have been able to make the reduction require is two, with the latter pass taking only a very small number of elements and so being rather unsuitable for an OpenCL process (i.e. wouldn't it be better to stick to a single pass and then process its results on the CPU?)
When I set the 'count' number of elements to a very high number (24M and up) and the type to float4, I get inaccurate (or totally wrong) results. Why is this?
In the OpenCL kernels, can anyone explain what is being done here:
while (i < n){
    int a = LOAD_GLOBAL_I1(input, i);
    int b = LOAD_GLOBAL_I1(input, i + group_size);
    int s = LOAD_LOCAL_I1(shared, local_id);
    STORE_LOCAL_I1(shared, local_id, (a + b + s));
    i += local_stride;
}
as opposed to what is being done here?
#define ACCUM_LOCAL_I1(s, i, j) \
{ \
    int x = ((__local int*)(s))[(size_t)(i)]; \
    int y = ((__local int*)(s))[(size_t)(j)]; \
    ((__local int*)(s))[(size_t)(i)] = (x + y); \
}
Thanks!
S
To answer the first 2 questions:
why are multiple passes used?
Reducing millions of elements to a few thousand can be done in parallel with a device utilization of almost 100%, but the final step is quite tricky. So, instead of doing everything in one shot and leaving many threads idle, the Apple implementation does a first reduction pass, then adapts the work items to the now much smaller problem, and finally completes the reduction.
It is a very specific optimization for OpenCL, but it may not make sense for plain C++.
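To make the pass structure concrete, here is a plain-C sketch of the same idea with no OpenCL at all: each pass collapses every group of GROUP elements into one partial sum, and passes repeat until a single value remains (GROUP and the input size are arbitrary choices for illustration):
#include <stdio.h>
#include <stdlib.h>
#define GROUP 256   /* stand-in for the work-group size */
/* One "pass": every block of GROUP inputs becomes one partial sum.
   On the GPU each block would be one work-group; here it is just a loop. */
static size_t reduce_pass(const float *in, float *out, size_t n)
{
    size_t groups = (n + GROUP - 1) / GROUP;
    for (size_t g = 0; g < groups; ++g) {
        float s = 0.0f;
        for (size_t i = g * GROUP; i < n && i < (g + 1) * GROUP; ++i)
            s += in[i];
        out[g] = s;
    }
    return groups;   /* the next pass works on this many elements */
}
int main(void)
{
    size_t n = 1 << 20;
    float *a = malloc(n * sizeof *a);
    float *b = malloc(((n + GROUP - 1) / GROUP) * sizeof *b);
    for (size_t i = 0; i < n; ++i) a[i] = 1.0f;
    int passes = 0;
    while (n > 1) {                  /* keep launching passes until one value is left */
        n = reduce_pass(a, b, n);
        float *t = a; a = b; b = t;  /* ping-pong the two buffers */
        ++passes;
    }
    printf("result = %f after %d passes\n", a[0], passes);  /* 1048576.000000 after 3 passes */
    free(a); free(b);
    return 0;
}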
when I set the 'count' number of elements to a very high number (24M and up) and the type to a float4, I get inaccurate (or totally wrong) results. Why is this?
A float32 has a 23-bit mantissa, so it can only represent about 2^24 consecutive integers exactly. Values around 24M ≈ 1.43 x 2^24 have a spacing of 2 between representable values, so each operation there carries a rounding error of up to +/-(2^24/2^23)/2 ~= 1.
That means, if you do:
float A = 24000000;
float B = A + 1; // ~1 error here
The rounding error of a single operation is as large as the value being added, so you get big errors if you repeat that in a loop!
This often does not show up on desktop CPUs, because the compiler may carry out the intermediate 32-bit float math at higher precision (for example in double or extended-precision registers), which hides these errors; they reappear once the values approach the limit of that wider format. But that is not the typical case for normal "counting" integers.
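A minimal C demonstration of that limit, and of why a straight float accumulation over 24M elements drifts:
#include <stdio.h>
int main(void)
{
    float a = 16777216.0f;           /* 2^24: the last range where floats count integers exactly */
    printf("%.1f\n", a + 1.0f);      /* prints 16777216.0 - the +1 is lost */
    /* Summing 24M ones in float stalls at 2^24; double does not. */
    float  fsum = 0.0f;
    double dsum = 0.0;
    for (int i = 0; i < 24000000; ++i) { fsum += 1.0f; dsum += 1.0; }
    printf("float sum:  %.1f\n", fsum);   /* 16777216.0 */
    printf("double sum: %.1f\n", dsum);   /* 24000000.0 */
    return 0;
}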
The problem is with the precision of 32 bit floats. You're not the first person to ask about this either. OpenCL reduction result wrong with large floats
I am trying to build a mortgage calculator app, but my numbers are off a bit and I am wondering if anyone has any insight into what I am missing. The initial payment amount seems to be accurate, but as you increase the years and the interest rate the value is slightly off. This is for Canada, if that makes any difference. The payment amount also doesn't divide evenly into the amount borrowed. Here is the relevant code.
double r = interestAmountValue/1200;
double n = yearAmountValue * 12;
double rPower = pow(1+r, n);
double paymentAmt = loanAmountValue * r * rPower / (rPower - 1);
double totalPaymentd = paymentAmt * n;
double totalInterestd = totalPaymentd - loanAmountValue;
The type double is typically implemented as a binary (base-2) floating-point number. Although calculations using floating point can be fast, particularly when using dedicated hardware, the tradeoff is accuracy. This is a well documented problem: see volume II of The Art of Computer Programming: Semi-Numerical Algorithms for a thorough discussion.
If provided by your language, an arbitrary precision number type will probably result in better accuracy at the expense of some speed (with the loss of speed probably unnoticeable in most applications on modern hardware for typical problems). In Java this is the BigDecimal type.
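If a BigDecimal-like type is not available, the underlying issue is easy to demonstrate in plain C:
#include <stdio.h>
int main(void)
{
    /* Decimal fractions generally have no exact binary representation. */
    printf("%.20f\n", 0.01);        /* 0.01000000000000000021 */
    /* Adding 0.1 ten times does not give exactly 1.0 in double. */
    double s = 0.0;
    for (int i = 0; i < 10; ++i)
        s += 0.1;
    printf("%.17g (equal to 1.0? %s)\n", s, s == 1.0 ? "yes" : "no");  /* 0.99999999999999989, no */
    return 0;
}
An arbitrary-precision decimal type (or keeping currency as exact integer cents) avoids this class of error entirely.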
Does anyone have experience replacing floating point operations on ATmega (2560) based systems? There are a couple of very common situations that come up every day.
For example:
Are comparisons faster than divisions/multiplications?
Is a float-to-int type cast followed by an integer multiplication/division faster than the pure floating point operation without the cast?
I hope I don't have to make a benchmark just for me.
Example one:
int iPartialRes = (int)fArg1 * (int)fArg2;
iPartialRes *= iFoo;
Is that faster than this?:
float fPartialRes = fArg1 * fArg2;
fPartialRes *= iFoo;
And example two:
iSign = fVal < 0 ? -1 : 1;
Is that faster than this?:
iSign = fVal / fabs(fVal);
The questions can be answered just by thinking about them for a moment:
AVRs do not have an FPU, so all floating point work is done in software; a floating point multiplication involves much more than a simple integer multiplication.
Since AVRs also do not have an integer division unit, a simple branch is also much faster than a software division. Dividing floating point values is the worst case of all. :)
But please note that your first two examples produce very different results: the casts in the first one truncate the operands to integers before multiplying.
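A quick C illustration of that difference, with arbitrary example values:
#include <stdio.h>
int main(void)
{
    float fArg1 = 2.5f, fArg2 = 3.5f;
    int   iFoo  = 10;
    int   iPartialRes = (int)fArg1 * (int)fArg2;   /* 2 * 3 = 6 (truncation happens first) */
    iPartialRes *= iFoo;                           /* 60 */
    float fPartialRes = fArg1 * fArg2;             /* 8.75 */
    fPartialRes *= iFoo;                           /* 87.5 */
    printf("%d vs %.2f\n", iPartialRes, fPartialRes);   /* 60 vs 87.50 */
    return 0;
}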
This is an old question, but I will submit this elaborated answer for the curious.
Just typecasting a float truncates it, i.e. 3.7 becomes 3; there is no rounding.
The fastest math on a 2560 is (+, -, *), with division being the slowest because there is no hardware divide. Multiplying all operands by a pseudo decimal point (a scale constant) that suits the fractional range(1) your floats are expected to see, typecasting to an unsigned long int, and tracking the sign as a bool will give the best range/accuracy compromise.
If your loop needs to be as fast as possible, avoid even integer division; multiply by a pseudo fraction instead, and only do your typecast back into a float at the end, with myFloat (defined elsewhere) = float(myPseudoFloat) / myPseudoDecimalConstant;
Not sure if you came across the ShowInfo page in the Arduino Playground. It's basically a sketch that runs a benchmark on your (insert Arduino model here) and shows the actual compute times for various operations. The Mega 2560 will be very close to an ATmega328 as far as FLOPS go, up to about 12.5K/s (80 µs per float divide). Typecasting carelessly would likely handicap the CPU more, as it introduces more overhead and might even give erroneous results due to truncation and lack of precision.
(1) i.e. 543.509291 * 1000000 = 543,509,291 moves the decimal 6 places, roughly the maximum precision of a float on an 8-bit AVR. If you first multiply all values by the same constant, like 1000 or 100000, etc., the decimal point is preserved; you then cast back to a float by dividing by your decimal constant when you are ready to print or store the value.
float f = 3.1428;
int x;
x = f * 10000;
x now contains 31428
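And here is a slightly fuller C sketch of the scaled-integer approach described above (the scale constant and variable names are just illustrative; watch the intermediate range, since long is 32-bit on AVR):
#include <stdio.h>
#define SCALE 10000L   /* pseudo decimal point: 4 fractional digits */
int main(void)
{
    float a = 3.1428f, b = 2.5f;
    /* Convert once to scaled integers (truncation, as noted above). */
    long ia = (long)(a * SCALE);          /* 31428 */
    long ib = (long)(b * SCALE);          /* 25000 */
    /* Integer multiply, then rescale: ia*ib carries SCALE^2, so divide once. */
    long iprod = (ia * ib) / SCALE;       /* 78570, i.e. 7.8570 * SCALE */
    /* Convert back to float only when printing or storing. */
    float result = (float)iprod / SCALE;
    printf("%.4f (float reference: %.4f)\n", result, a * b);  /* 7.8570 vs 7.8570 */
    return 0;
}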