ATMega performance for different operations - Arduino

Does anyone have experience with replacing floating-point operations on ATMega (2560) based systems? There are a couple of very common situations that come up every day.
For example:
Are comparisons faster than divisions/multiplications?
Is a float-to-int type cast followed by integer multiplication/division faster than pure floating-point operations without the cast?
I hope I don't have to write a benchmark just for this.
Example one:
int iPartialRes = (int)fArg1 * (int)fArg2;
iPartialRes *= iFoo;
Is that faster than this?
float fPartialRes = fArg1 * fArg2;
fPartialRes *= iFoo;
And example two:
iSign = fVal < 0 ? -1 : 1;
Is that faster than this?
iSign = fVal / fabs(fVal);

These questions can be answered just by thinking about them for a moment.
AVRs do not have an FPU, so all floating-point work is done in software; an FP multiplication therefore involves much more than a simple int multiplication.
Since AVRs also do not have an integer division unit, a simple branch is also much faster than a software division; dividing floating-point values is the worst case of all :)
But please note that the two variants in your first example produce very different results.
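If you do end up wanting real numbers for your own board, a minimal micros()-based timing sketch along these lines will show the gap (loop counts and variable names here are just placeholders I chose, and the loop overhead is included, so treat the printed figures as rough, not calibrated):
volatile float fA = 3.7, fB = 1.3;   // volatile keeps the compiler from folding the loops away
volatile long iA = 37, iB = 13;

void setup() {
  Serial.begin(9600);

  unsigned long t0 = micros();
  for (long i = 0; i < 10000; i++) { volatile float r = fA * fB; (void)r; }
  unsigned long tFloat = micros() - t0;

  t0 = micros();
  for (long i = 0; i < 10000; i++) { volatile long r = iA * iB; (void)r; }
  unsigned long tInt = micros() - t0;

  Serial.print("10000 float muls, us: "); Serial.println(tFloat);
  Serial.print("10000 long muls, us:  "); Serial.println(tInt);
}

void loop() {}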

This is an old question, but I will submit this elaborated answer for the curious.
Just typecasting a float to an int will truncate it, i.e. 3.7 will become 3; there is no rounding.
The fastest math on a 2560 will be (+, -, *), with divide being the slowest due to the lack of a hardware divider. Typecasting to an unsigned long int after multiplying all operands by a pseudo decimal-point constant that suits the fractional range(1) your floats are expected to see, and tracking the sign as a bool, will give the best range/accuracy compromise.
If your loop needs to be as fast as possible, avoid even integer division; multiply by a pseudo fraction instead, and only cast back into a float at the end with myFloat (defined elsewhere) = float(myPseudoFloat) / myPseudoDecimalConstant;
Not sure if you came across the Show Info page in the Arduino Playground. It's basically a sketch that runs a benchmark on your (insert Arduino model here) and shows the actual compute times for various operations and subsystems. The Mega 2560 will be very close to an ATmega328 as far as FLOPS go: up to about 12.5K/s (80 µs per float divide). Typecasting would likely handicap the CPU more, as it introduces more overhead and might even give erroneous results due to rounding errors and lack of precision.
(1) e.g. 543.509291 * 1,000,000 = 543,509,291 moves the decimal 6 places, to the maximum precision of a float on an 8-bit AVR. If you first multiply all values by the same constant, like 1000 or 100000, etc., then the decimal point is preserved, and you cast back to a float by dividing by your decimal constant when you are ready to print or store the value.
float f = 3.1428;
int x;
x = f * 10000;
x now contains 31428
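Putting the scaled-integer idea together, a rough sketch (SCALE, toFixed and toFloat are names I made up; keep an eye on the ranges so intermediate products don't overflow a 32-bit long):
const long SCALE = 10000L;                               // pseudo decimal point: units of 1/10000

long  toFixed(float f)  { return (long)(f * SCALE); }    // 3.1428 -> 31428 (truncates)
float toFloat(long fx)  { return (float)fx / SCALE; }    // 31428  -> 3.1428

long fxA = toFixed(3.1428);
long fxB = toFixed(2.5);
long fxProduct = (fxA * fxB) / SCALE;   // rescale once so the result stays in units of 1/SCALE
float result = toFloat(fxProduct);      // convert back only when printing or storing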

Related

How to keep decimal accuracy when dividing with floating points

I am working on a project, and I need to divide a very large 64-bit long value. I absolutely do not care about the whole-number result, and only care about the decimal (fractional) part. The problem is that when dividing a large long by a small 64-bit double floating-point value, I lose accuracy in the floating-point value because it also has to store the whole numbers.
Essentially what I am trying to do is this:
double x = long_value / double_value % 1;
but without losing precision as long_value gets larger. Is there a way of writing this expression so that the whole numbers are discarded and floating-point accuracy is not lost? Thanks.
EDIT: By the way, I'm out here trying to upvote all these helpful answers, but I just made this account for this question and you need 15 reputation to cast a vote.
If your language provides an exact fmod implementation you can do something like this:
double rem = fmod(long_value, double_value);
return rem / double_value;
If long_value does not convert exactly to a double value, you could split it into two halves, fmod them individually, add these values together and divide that sum or sum - double_value by double_value.
If long_value or double_value is negative you may also need to consider different cases depending on how your fmod behaves and what result you expect.
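A sketch of that splitting idea in C++, assuming long_value is non-negative and double_value is positive and finite (the helper name is mine):
#include <cmath>
#include <cstdint>

double fractional_part_of_quotient(std::uint64_t long_value, double double_value) {
    // Split the 64-bit integer into 32-bit halves; each half converts to double exactly.
    double hi = static_cast<double>(long_value >> 32);
    double lo = static_cast<double>(long_value & 0xFFFFFFFFu);

    // (hi * 2^32) mod d and lo mod d are computed exactly by fmod, then recombined;
    // one extra subtraction keeps the recombined remainder below d.
    double rem = std::fmod(hi * 4294967296.0, double_value)
               + std::fmod(lo, double_value);
    if (rem >= double_value)
        rem -= double_value;

    return rem / double_value;   // fractional part of long_value / double_value
}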
long_value is congruent to:
long_value = long_value - double_value * static_cast<long>(long_value / double_value);
Then you can do this:
double fractionalPart = std::fmod(static_cast<double>(long_value) / double_value, 1.0);
(The % operator is not defined for double in C++, so std::fmod does the "% 1" here.)
Does the language you're using have a big integer/big rational library? To avoid loss of information, you'll have to "spread out" the information across more memory while you're transforming it, so you don't lose the part you're interested in preserving. This is essentially what a big integer library would do for you. You could employ this algorithm (I don't know what language you're using, so this is just pseudocode):
// e.g. 1.5 => (3, 2)
let (numerator, denominator) = double_value.ToBigRational().NumAndDenom();
// information-preserving version of long_value / double_value
let quotient = new BigRational(num: long_value * denominator, denom: numerator);
// information-preserving version of % 1
let remainder = quotient.FractionPart();
// some information could be lost here, but we saved it for the last step
return remainder.ToDouble();
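As a concrete C++ rendering of that idea, here is a sketch using Boost.Multiprecision's cpp_int as the big-integer type (the frexp-based decomposition of the double is my choice; it assumes long_value >= 0 and a positive, finite double_value):
#include <boost/multiprecision/cpp_int.hpp>
#include <cmath>
#include <cstdint>

using boost::multiprecision::cpp_int;

double fractional_quotient(std::uint64_t long_value, double double_value) {
    // Decompose the double exactly: double_value == mant * 2^e with a 53-bit integer mant.
    int e;
    double m = std::frexp(double_value, &e);                      // m in [0.5, 1)
    cpp_int mant = static_cast<std::int64_t>(std::ldexp(m, 53));
    e -= 53;

    // quotient = long_value / (mant * 2^e), expressed as a ratio of big integers
    cpp_int num = long_value;
    cpp_int den = mant;
    if (e < 0) num <<= -e; else den <<= e;

    // fractional part = (num mod den) / den; only this final division rounds
    cpp_int rem = num % den;
    return rem.convert_to<double>() / den.convert_to<double>();
}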

Is there a more efficient way to divide and conquer a uint 256 log on 64 bit hardware with rust or inline assembly than converging Taylor series?

I am looking to take the log base n (10 would be fine) of a 256-bit unsigned integer as a floating-point value in Rust, with no loss of precision. It would seem to me that I need to implement an 8×f64, 512-bit float512 type and use a Taylor series to approximate ln and then the log. I know there are assembly methods to obtain the log of an f64. I am wondering if anyone on Stack Overflow can think of a divide-and-conquer or other method that would be more efficient. I would be amenable to inline assembly operating on the 8×f64, 512-bit array.
This might be a useful starting point / outline of an algorithm. IDK if it will get you exact results, like error <= 0.5ulp (i.e. the last bit of the mantissa of your 512-bit float correctly rounded), or even error <= 1 ulp. Perhaps worth looking into what extended-precision calculators like bc / dc / calc do.
I think log converges quickly, so if you're going to do Newton iterations to refine, this bit-scan method might be a fast way to get a good starting point. Even if you only really need about 256 mantissa bits correct, I don't know how big a polynomial it would take to get that, and each multiply / add / fma would be on 512-bit (8x) or 320-bit (5x double precision).
Start by converting integer to binary float
For normal-sized floating-point numbers, the usual method takes advantage of the logarithmic nature of binary floating point. Without 256-bit HW float, you'll want to find the ilog2(int) yourself, i.e. position of the highest set bit (Efficiently find least significant set bit in a large array?).
Then treat your 256-bit integer as the mantissa of a number in the [1..2) or [0.5..1) range, and yes, use a polynomial approximation for log2() that's accurate over that limited range. (Before the actual soft-float stuff, you might want to left-shift the number so it's normalized, i.e. the highest set bit is at the top: x <<= clz(x).)
Then a polynomial approximation over the mantissa
And then add the integer exponent + log_approx(mantissa) => log2(x).
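As a rough illustration of that structure (C++ rather than Rust, and only double accuracy for the mantissa step, so it's just a starting point for refinement; the little-endian limb layout and the GCC/Clang __builtin_clzll intrinsic are my assumptions):
#include <array>
#include <cmath>
#include <cstdint>

// Approximate log2 of a non-zero 256-bit unsigned integer stored as 4 little-endian 64-bit limbs:
// find the highest set bit (the integer exponent), normalize the top bits into a 64-bit mantissa,
// and add log2 of that mantissa scaled into [1, 2).
double log2_u256_approx(const std::array<std::uint64_t, 4>& limbs) {
    int top = 3;
    while (top > 0 && limbs[top] == 0) --top;
    std::uint64_t hi = limbs[top];

    int lz  = __builtin_clzll(hi);            // hi != 0 assumed
    int msb = top * 64 + (63 - lz);           // ilog2(x)

    std::uint64_t mant = hi << lz;            // highest set bit moved to bit 63
    if (lz > 0 && top > 0)
        mant |= limbs[top - 1] >> (64 - lz);  // pull following bits from the next limb

    double frac = std::log2(std::ldexp(static_cast<double>(mant), -63));  // mant/2^63 is in [1, 2)
    return msb + frac;                        // log2(x) ~= exponent + log2(mantissa)
}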
Efficient implementation of log2(__m256d) in AVX2 has more detail on implementing log2(double) (with SIMD doing 4 at a time, very different from doing one extended precision calculation).
It includes some links to implementations, e.g. Agner Fog's VCL using the ratio of two polynomials instead of one larger polynomial, and various tricks to maintain as much precision as possible: https://github.com/vectorclass/version2/blob/9874e4bfc7a0919fda16596144d393da5f8bf6c0/vectormath_exp.h#L942. Such as further range reduction: if x > SQRT2*0.5, then increment the exponent and double the mantissa. (If 512-bit FP division is really expensive, you might just use more terms in one polynomial.) VCL is currently Apache licensed, so feel free to copy as much as you want from it into anything.
IDK if there are more tricks that might become more valuable for big extended precision, or for soft-float, which that implementation doesn't use. VCL's math functions spend more effort to maintain high precision than some faster approximations, but they're not exact.
Do you really need 512-bit float? Maybe only 320-bit (5x double)?
If you don't need more exponent-range than a double, you might be able to extend the double-double-arithmetic technique to wider floats, taking advantage of hardware FP to get 52 or 53 mantissa bits per 64-bit chunk. (From comments, apparently you're already planning to do that.)
You might not need a 512-bit float to have sufficient precision. 256/52 ≈ 4.92, so just 5 double chunks have more precision (mantissa bits) than your input and could exactly represent any 256-bit integer. (An IEEE double has a large enough exponent range: -1022 .. +1023.) And there is enough to spare that log2(int) should map each 256-bit input to a unique, monotonic output, even with some rounding error.
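If you do go the double-double route, the basic building block is an error-free transformation such as Knuth's TwoSum; a minimal sketch, not tied to any particular library:
#include <utility>

// Knuth's TwoSum: returns (s, err) such that a + b == s + err exactly,
// where s is the rounded sum and err is the rounding error that s lost.
// Chaining transformations like this is how double-double (and wider)
// arithmetic keeps the extra mantissa bits.
std::pair<double, double> two_sum(double a, double b) {
    double s   = a + b;
    double bb  = s - a;                     // the part of b actually absorbed into s
    double err = (a - (s - bb)) + (b - bb);
    return {s, err};
}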

Advantages and disadvantages of single numeric (float) data type [closed]

Why do we use various numeric data types in programming languages? Why not use float everywhere? I have heard some arguments like:
Arithmetic on int is faster (but why?)
It takes more memory to store a float. (I get it.)
What are the additional benefits of using the various numeric data types?
Arithmetic on integers has traditionally been faster because it's a simpler operation. It can be implemented in logic gates and, if properly designed, the whole thing can happen in a single clock cycle.
On most modern PCs floating-point support is actually quite fast, because loads of time has been invested into making it fast. It's only on lower-end processors (like Arduino, or some versions of the ARM platform) where floating point seriously suffers, or is absent from the CPU altogether.
A floating point number contains a few different pieces of data: there's a sign bit, and the mantissa, and the exponent. To put those three parts together to determine the value they represent, you do something like this:
value = sign * mantissa * 2^exponent
It's a little more complicated than that, because floating-point numbers optimize how they store the mantissa a bit. (For instance, the first bit of the mantissa is assumed to be 1, so it doesn't actually need to be stored. But this also means zero has to be stored a particular way, and there are various "special values" that can be stored in floats, like "not a number" and infinity, that have to be handled correctly when working with floats.)
So to store the number "3" you'd have a mantissa of 0.75 and an exponent of 2. (0.75 * 2^2 = 3).
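You can see that decomposition directly with std::frexp (my example, using the same mantissa-in-[0.5, 1) convention as above):
#include <cmath>
#include <cstdio>

int main() {
    int exponent;
    double mantissa = std::frexp(3.0, &exponent);          // 3.0 == mantissa * 2^exponent
    std::printf("3 = %g * 2^%d\n", mantissa, exponent);    // prints: 3 = 0.75 * 2^2
    return 0;
}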
But then to add two floats together, you first have to align them. For instance, 3 + 10:
m3 = 0.75 (stored as binary (1)1000000... the first (1) implicit and not actually stored)
e3 = 2
m10 = .625 (stored as binary (1)010000...)
e10 = 4 (.625 * 2^4 = 10)
You can't just add m3 and m10 together, 'cause you'd get the wrong answer. You first have to shift m3 over by a couple bits to get e3 and e10 to match, then you can add the mantissas together and reassemble the result into a new floating point number. A CPU with good floating-point implementation will do all that for you, of course, and do it fast.
So why else would you not want to use floating point values for everything? Well, for starters there's the problem of exactness. If you add or multiply two integers to get another integer, as long as you don't exceed the limits of your integer size, the answer you get will be exactly correct. This isn't the case with floating-point. For instance:
x = 1000000000.0
y = .0000000001
for (cc = 0; cc < 1000000000; cc++) { x += y; }
Logically you'd expect the final value of (x) to be 1000000000.1, but that's almost certainly not what you're going to get. When you add (y) to (x), the change to (x)'s mantissa may be so small that it doesn't even fit into the float, and so (x) may not change at all. And even if that's not the case, (y)'s value is not exact. There are no two integers (a, b) such that (a * 2^b = 10^-10). That's true for many common decimal values, actually. Even something simple like 0.3 can't be stored as an exact value in a binary floating-point number.
So (y) isn't exactly 10^-10; it's actually off by some small amount. For a 64-bit double it'll be off by about 10^-26:
y = 10^-10 + error, where error is about 10^-26
Then if you add (y) together a billion times, the error is magnified by about a billion times as well, so your final error is around 10^-17.
A good floating-point implementation will try to minimize these errors, but it can't always get it right. The problem is fundamental to how the numbers are stored, and to some extent unavoidable. As a result, for instance, even though it seems natural to store a money value in a float, it might be preferable to store it as an integer instead, to get that assurance that the value is always exact.
The "exactness" issue also means that when you test the value of a floating point number, generally speaking, you can't use exact comparisons. For instance:
x = 11.0 / 500
if (x * 50 == 1.1) { ... It doesn't!
for (float x = 0.0; x < 1.0; x += 0.01) { print x; }
// prints 101 values instead of 100, the last one being 0.9999999...
The test fails because (x) isn't exactly the value we specified, and 1.1, when encoded as a float, isn't exactly the value we specified either. They're both close but not exact. So you have to do inexact comparisons:
if (abs(x - expected_value) < small_value) {...
Choosing the correct "small_value" is a problem unto itself. It can depend on what you're doing with the values, what kind of behavior you're trying to achieve.
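One common shape for such a comparison helper, as a sketch (the mixed absolute/relative tolerance and the default values are my choice, not a universal rule):
#include <algorithm>
#include <cmath>

// Absolute tolerance handles values near zero; relative tolerance scales with magnitude.
bool nearly_equal(double a, double b, double abs_tol = 1e-12, double rel_tol = 1e-9) {
    double diff = std::fabs(a - b);
    if (diff <= abs_tol) return true;
    return diff <= rel_tol * std::max(std::fabs(a), std::fabs(b));
}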
Finally, if you look at the "it takes more memory" issue, you can also turn that around and think of it in terms of what you get for the memory you use.
If you can work with integer math for your problem, a 32-bit unsigned integer lets you work with (exact) values between 0 and around 4 billion.
If you're using 32-bit floats instead of 32-bit integers, you can store larger values than 4 billion, but you're still limited by the representation: of those 32 bits, one is used for the sign bit and eight for the exponent, so you get 23 bits (24, effectively) of mantissa. Once (x >= 2^24), you're beyond the range where integers are stored exactly in that float, so (x+1 == x). So a loop like this:
float i;
for (i = 16000000; i < 17000000; i += 1);
would never terminate: (i) would reach (2^24 = 16777216), where the least-significant bit of its mantissa has a magnitude greater than 1, so adding 1 to (i) would cease to have any effect.

Inaccurate results with OpenCL Reduction example

I am working with the OpenCL reduction example provided by Apple here
After a few days of dissecting it, I understand the basics; I've converted it to a version that runs more or less reliably in C++ (openFrameworks) and finds the largest number in the input set.
However, in doing so, a few questions have arisen as follows:
Why are multiple passes used? The most I have been able to make the reduction require is two, and the latter pass only handles a very small number of elements, making it poorly suited to an OpenCL pass (i.e. wouldn't it be better to stick to a single pass and then process its results on the CPU?)
When I set the 'count' number of elements to a very high number (24M and up) and the type to a float4, I get inaccurate (or totally wrong) results. Why is this?
In the OpenCL kernels, can anyone explain what is being done here:
while (i < n) {
    int a = LOAD_GLOBAL_I1(input, i);
    int b = LOAD_GLOBAL_I1(input, i + group_size);
    int s = LOAD_LOCAL_I1(shared, local_id);
    STORE_LOCAL_I1(shared, local_id, (a + b + s));
    i += local_stride;
}
as opposed to what is being done here?
#define ACCUM_LOCAL_I1(s, i, j) \
{ \
    int x = ((__local int*)(s))[(size_t)(i)]; \
    int y = ((__local int*)(s))[(size_t)(j)]; \
    ((__local int*)(s))[(size_t)(i)] = (x + y); \
}
Thanks!
S
To answer the first 2 questions:
why are multiple passes used?
Reducing millions of elements to a few thousand can be done in parallel with a device utilization of almost 100%. But the final step is quite tricky. So, instead of doing everything in one shot and leaving many threads idle, Apple's implementation does a first reduction pass, then adapts the work items to the new, smaller reduction problem, and finally completes it.
It is a very specific optimization for OpenCL, but it may not be worthwhile for a C++ port.
when I set the 'count' number of elements to a very high number (24M
and up) and the type to a float4, I get inaccurate (or totally wrong)
results. Why is this?
A float32 has 23 bits of mantissa precision. Values higher than 24M ≈ 1.43 × 2^24 (in float representation) have a representation error in the range ±(2^24/2^23)/2 ≈ 1.
That means, if you do:
float A=24000000;
float B= A + 1; //~1 error here
The error of that single operation is as large as the value being added, therefore... big errors if you repeat it in a loop!
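You can verify this on any host with a couple of lines (my example):
#include <cstdio>

int main() {
    float A = 24000000.0f;
    float B = A + 1.0f;    // 24000001 is not representable as a float; it rounds back to 24000000
    std::printf("%s\n", (A == B) ? "A + 1 == A" : "A + 1 != A");   // prints: A + 1 == A
    return 0;
}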
This may not show up on 64-bit desktop CPUs, where 32-bit float math is sometimes carried out internally at a higher precision, postponing these errors to much larger values. But that is not the typical case for normal "counting" integers.
The problem is with the precision of 32 bit floats. You're not the first person to ask about this either. OpenCL reduction result wrong with large floats

OpenCL reduction result wrong with large floats

I used AMD's two-stage reduction example to compute the sum of all numbers from 0 to 65 536 using floating point precision. Unfortunately, the result is not correct. However, when I modify my code, so that I compute the sum of 65 536 smaller numbers (for example 1), the result is correct.
I couldn't find any error in the code. Is it possible that I am getting wrong results, because of the float type? If this is the case, what is the best approach to solve the issue?
This is a "side effect" of summing floating-point numbers on finite-precision CPUs or GPUs. The accuracy depends on the algorithm and the order in which the values are summed. The theory and practice behind this are explained in Nicholas J. Higham's paper
The Accuracy of Floating Point Summation
http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=7AECC0D6458288CD6E4488AD63A33D5D?doi=10.1.1.43.3535&rep=rep1&type=pdf
The fix is to use a smarter algorithm like the Kahan Summation Algorithm
https://en.wikipedia.org/wiki/Kahan_summation_algorithm
And the Higham paper has some alternatives too.
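For reference, a minimal scalar Kahan summation sketch in C++ (not an OpenCL kernel, just to show the compensation idea; note that aggressive fast-math compiler flags can optimize the compensation away):
#include <vector>

// Kahan summation: carry a running compensation term that recovers the
// low-order bits lost each time a small value is added to a large sum.
float kahan_sum(const std::vector<float>& values) {
    float sum = 0.0f;
    float c   = 0.0f;          // running compensation for lost low-order bits
    for (float v : values) {
        float y = v - c;       // corrected next term
        float t = sum + y;     // big + small: low-order bits of y are lost here...
        c = (t - sum) - y;     // ...and algebraically recovered here
        sum = t;
    }
    return sum;
}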
This problem illustrates the nature of benchmarking: the first rule of a benchmark is to get the right answer, using realistic data!
There is probably no error in the coding of your kernel or host application. The issue is with the single-precision floating point.
The correct sum is 65537 * 32768 = 2147516416, and it takes 32 bits to represent it in binary (10000000000000001000000000000000). 32-bit floats can only hold integers accurately up to 2^24.
"Any integer with absolute value less than [2^24] can be exactly represented in the single precision format"
"Floating Point" article, wikipedia
This is why you are getting the correct sum when it is less than or equal to 2^24. If you are doing a complete sum using single-precision, you will eventually lose accuracy no matter which device you are executing the kernel on. There are a few things you can do to get the correct answer:
use double instead of float if your platform supports it
use int or unsigned int
sum a smaller set of numbers, e.g.: 0+1+2+...+4095+4096 = 2^23 + 2^11
Read more about single precision here.
