Loss of precision in float operations, due to exponent differences?

I have a program where I represent lengths (in cm) and angles (in radian) as floats. My lengths usually have values between 10 and 100, while my angles usually have values between 0 and 1.
I'm aware that precision will be lost in all floating point operations, but my question is:
Do I lose extra precision because of the magnitude gap between my two numerical realms? Would it be better if I changed my length unit to be meters, such that my usual length values lie between 0.1 and 1, which matches my usual angle values pretty evenly?

The point of floating point is that the point floats. Changing the magnitudes of numbers does not change the relative errors, except for quantization effects.
A floating point system represents a number x with a significand f and an exponent e for some fixed base b (e.g., 2 for binary floating point), so that x = f·b^e. (Often the sign is separated from f, but I am omitting that for simplicity.) If you multiply the numbers being worked with by any power of b, addition and subtraction will operate exactly the same (and so will multiplication and division if you correct for the additional factor), up to the bounds of the format.
If you multiply by other numbers, there can be small effects in rounding. When an operation is performed, the result has to be rounded to a fixed number of digits for the f portion. This rounding error is a fraction of the least significant digit of f. If f is near 1, it is larger relative to f than if f is near 2.
So, if you multiply your numbers by 256 (a power of 2), add, and divide by 256, the results will be the same as if you did the addition directly. If you multiply by 100, add, and divide by 100, there will likely be small changes. After multiplying by 100, some of your numbers will have their f parts moved closer to 1, and some will have their f parts moved closer to 2.
Generally, these changes are effectively random, and you cannot use such scaling to improve the results. Only in special circumstances can you control these errors.
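To illustrate (a minimal Python sketch with arbitrarily chosen values, not from the original answer): scaling by a power of two and back reproduces the direct sum exactly, while scaling by 100 is not guaranteed to.

    # Scaling by a power of the base (2 for binary floats) is exact, so the
    # addition behaves identically; scaling by 100 rounds and may not.
    a, b = 37.25, 0.3          # e.g. a length in cm and an angle in radians

    pow2 = 256.0
    print((a * pow2 + b * pow2) / pow2 == a + b)    # True (barring overflow/underflow)

    hundred = 100.0
    print((a * hundred + b * hundred) / hundred == a + b)   # not guaranteed to be True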

Related

When x = 10³⁰, y = -10³⁰ and z = 1, why do (x+y)+z and x+(y+z) differ?

What Every Computer Scientist Should Know About Floating-Point Arithmetic makes the following claim:
Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. For example, the expression (x+y)+z has a totally different answer than x+(y+z) when x = 10³⁰, y = -10³⁰ and z = 1 (it is 1 in the former case, 0 in the latter).
How does one reach the conclusion in their example? That is, that (x+y)+z=1 and x+(y+z)=0?
I am aware of the associative laws of algebra, but I do not see the issue in this case. To my mind, both x and y will overflow and therefore both have an integer value that is incorrect but nonetheless in range. As x and y will then be integers, they should add as if associativity applies.
Round off error, and other aspects of floating point arithmetic, apply to floating point arithmetic as a whole. While some of the values that a floating point variable can store are integers (in the sense that they are whole numbers), they are not integer-typed. A floating point variable cannot store arbitrarily large integers, any more than an integer variable can. And while wraparound integer arithmetic will make (a+b)-a=b for any unsigned integer-typed a and b, the same is not true for floating point arithmetic. The overflow rules are different.
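A small Python sketch of that contrast (my own illustration; the mask merely emulates 32-bit unsigned wraparound):

    MASK = (1 << 32) - 1                  # emulate 32-bit unsigned wraparound
    a, b = 4_000_000_000, 3_000_000_000
    s = (a + b) & MASK                    # the sum wraps past 2**32
    print((s - a) & MASK == b)            # True: the wraparound cancels exactly

    x, y = 1e30, 1.0
    print((x + y) - x == y)               # False: y is absorbed, so (x + y) == x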
In addition to @Sneftel's answer, note that the results do not have to differ either.
Even when the floating point type of the operands has insufficient precision to encode the sum 1 + 10³⁰ exactly, like binary64, some languages, like C, allow intermediate calculations to be computed using higher precision (like maybe long double as binary128), leading to a common sum of 1.
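For example, a Python sketch (exact rationals from the fractions module stand in for a wider intermediate format, since Python itself evaluates in binary64):

    from fractions import Fraction

    x, y, z = 1e30, -1e30, 1.0
    print((x + y) + z, x + (y + z))       # 1.0 and 0.0 in plain binary64 arithmetic

    # With exact intermediates the two groupings agree, which is the effect a
    # wider evaluation format (e.g. long double in C) can have.
    X, Y, Z = map(Fraction, (x, y, z))
    print(float((X + Y) + Z), float(X + (Y + Z)))   # both print 1.0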

Why do we get to approach the value 0 asymptotically "more" than the value 1?

Probably this is elementary for the people here. I am just a computer user.
I fooled around near the extreme values (0 and 1) for the Standard Normal cumulative distribution function (CDF), and I noticed that we can get very small probability values for large negative values of the variable, but we do not get the same reach towards the other end, for large positive values, where the value "1" appears already for much smaller (in absolute terms) values of the variable.
From a theoretical point of view, the tail probabilities of the Standard Normal distribution are symmetric around zero, so the probability mass to the left of, say, X=-10, is the same as the probability mass to the right of X=10. So at X=-10 the distance of the CDF from zero is the same as is its distance from unity at X=10.
But the computer/software complex doesn't give me this.
Is there something in the way our computers and software (usually) compute, that creates this asymmetric phenomenon, while the actual relation is symmetric?
Computations were done in R, on an ordinary laptop.
This post is related, Getting high precision values from qnorm in the tail
Floating-point formats represent numbers as a sign s (+1 or −1), a significand f, and an exponent e. Each format has some fixed base b, so the number represented is s•f•b^e, and f is restricted to be in [1, b) and to be expressible as a base-b numeral of some fixed number p of digits. These formats can represent numbers very close to zero by making e very small. But the closest they can get to 1 (aside from 1 itself) is where either f is as near 1 as it can get (aside from 1 itself) and e is 0, or f is as near b as it can get and e is −1.
For example, in the IEEE-754 binary64 format, commonly used for double in many languages and implementations, b is two, p is 53, and e can be as low as −1022 for normal numbers (there are subnormal numbers that can be smaller). This means the smallest representable normal number is 2^−1022. But near 1, either e is 0 and f is 1+2^−52 or e is −1 and f is 2−2^−52. The latter number is closer to 1; it is s•f•b^e = +1•(2−2^−52)•2^−1 = 1−2^−53.
So, in this format, we can get to a distance of 2^−1022 from zero (closer with subnormal numbers), but only to a distance of 2^−53 from 1.
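This is easy to verify directly; here is a Python sketch (rather than R, since the point is about the binary64 format itself; math.nextafter needs Python 3.9+):

    import math, sys

    print(sys.float_info.min)               # 2**-1022, smallest positive normal double
    below_one = math.nextafter(1.0, 0.0)    # largest double below 1: 1 - 2**-53
    print(1.0 - below_one)                  # 2**-53, about 1.1e-16

    # Consequence for a CDF: a tiny left-tail probability like 7.6e-24 is
    # representable, but 1 minus that same probability rounds to exactly 1.0.
    p = 7.6e-24
    print(p, 1.0 - p == 1.0)                # 7.6e-24  True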

How to determine error in floating-point calculations?

I have the following equation I want to implement in floating-point arithmetic:
Equation: sqrt((a-b)^2 + (c-d)^2 + (e-f)^2)
I am wondering how the width of the mantissa affects the accuracy of the result. What is the correct mathematical approach to determining this?
For instance, if I perform the following operations, how will the accuracy be affected after each step?
Here are the steps:
Step 1, Perform the following calculations in 32-bit single precision floating point: x=(a-b), y=(c-d), z=(e-f)
Step 2, Round the three results to have a mantissa of 16 bits (not including the hidden bit),
Step 3, Perform the following squaring operations: x2 = x^2, y2 = y^2, z2 = z^2
Step 4, Round x2, y2, and z2 to a mantissa of 10 bits (after the decimal point).
Step 5, Add the values: w = x2 + y2 + z2
Step 6, Round the results to 16 bits
Step 7, Take the square root: sqrt(w)
Step 8, Round to 20 mantissa bits (not including the hidden bit).
There are various ways of representing the error of a floating point number. There is relative error (a * (1 + ε)) and the subtly different ULP error (a + ulp(a) * ε). Each of them can be used in analysing the error, but both have shortcomings. To get sensible results you often have to take into account what happens precisely inside floating point calculations. I'm afraid that the 'correct mathematical approach' is a lot of work, and instead I'll give you the following.
simplified ULP based analysis
The following analysis is quite crude, but it does give a good 'feel' for how much error you end up with. Just treat these as examples only.
(a-b)
The operation itself gives you up to a 0.5 ULP error (if rounding RNE). The rounding error of this operation can be small compared to the inputs, but if the inputs are very similar and already contain error, you could be left with nothing but noise!
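For instance (a Python illustration with made-up values):

    # a carries a representation error of roughly 1e-16; subtracting the nearly
    # equal b leaves a result whose low bits are mostly that error.
    a = 1.0 + 1e-15
    b = 1.0
    print(a - b)          # ideally 1e-15, but prints about 1.1102230246251565e-15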
(a^2)
This operation multiplies not only the input, but also the input error. If dealing with relative error, that means at least multiplying errors by the other mantissa. Interestingly there is a little normalisation step in the multiplier, that means that the relative error is halved if the multiplication result crosses a power of two boundary. The worst case is where the inputs multiply just below that, e.g. having two inputs that are almost sqrt(2). In this case the input error is multiplied to 2*ε*sqrt(2). With an additional final rounding error of 0.5 ULP, the total is an error of ~2 ULP.
adding positive numbers
The worst case here is just the input errors added together, plus another rounding error. We're now at 3*2+0.5 = 6.5 ULP.
sqrt
The worst case for a sqrt is when the input is close to e.g. 1.0. The error roughly just gets passed through, plus an additional rounding error. We're now at 7 ULP.
intermediate rounding steps
It will take a bit more work to plug in your intermediate rounding steps.
You can model these as an error related to the number of bits you're rounding off. E.g. going from a 23-bit to a 10-bit mantissa with RNE introduces an additional 2^(13-1) ULP error relative to the 23-bit mantissa, or 0.5 ULP relative to the new mantissa (you'll have to scale down your other errors if you want to work with that).
I'll leave it to you to count the errors of your detailed example, but as the commenters noted, rounding to a 10-bit mantissa will dominate, and your final result will be accurate to roughly 8 mantissa bits.
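If you want to sanity-check such a count empirically, one rough approach is to simulate the pipeline with reduced-precision intermediates and compare against the plain double result. A Python sketch (round_to, distance_reduced and the chosen precisions are my own; precisions are given as total significant bits, i.e. the mantissa widths above plus the hidden bit):

    import math

    def round_to(x, p):
        """Round x to p significant binary digits (round-to-nearest)."""
        if x == 0.0:
            return 0.0
        m, e = math.frexp(x)               # x = m * 2**e with 0.5 <= |m| < 1
        return math.ldexp(round(m * 2**p), e - p)

    def distance_reduced(a, b, c, d, e, f, p_diff=17, p_sq=11, p_sum=17, p_out=21):
        """Mimic the question's steps with reduced-precision intermediates."""
        x, y, z = (round_to(v, p_diff) for v in (a - b, c - d, e - f))
        x2, y2, z2 = (round_to(v * v, p_sq) for v in (x, y, z))
        w = round_to(x2 + y2 + z2, p_sum)
        return round_to(math.sqrt(w), p_out)

    vals = (12.3, 4.56, 7.89, 0.12, 3.45, 6.78)
    ref = math.sqrt((vals[0] - vals[1])**2 + (vals[2] - vals[3])**2 + (vals[4] - vals[5])**2)
    approx = distance_reduced(*vals)
    print(approx, ref, abs(approx - ref) / ref)   # the 10-bit squares dominate the final error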

Simple 2 or 3 parameters float PRNG formula that changes faster than the float resolution and produces white noise?

I'm looking for a 2 or 3 parameters math formula with the following characteristics:
Simple (the fewest amount of operations the better)
Random output (non-periodic)
Normalized (Meaning the output will never be outside a given range; doesn't matter the range since once I know the range I can just divide and add/subtract to get it into the 0 to 1 range I'm looking for)
White noise (the more samples you get the more evenly distributed the outputs get across the range of possible output values, with no gaps or hotspots, to the extent permitted by the floating-point standard)
Random all the way down (no gradual changes between output values even if the inputs are changed by the smallest amount the float standard will allow. I understand that given the nature of randomness, it is possible two output values might be close together once in a while, but that must only happen by coincidence, and not because of smoothness or periodicity)
Uses only the operations listed below (but of course, any operations that can be done by a combination of the ones listed below are also allowed)
I need this because I need a good source of controllable randomness for some experiments I'm doing with Cycles material nodes in Blender. And since that is where the formula will be implemented, the only operations I have available are:
Addition
Subtraction
Multiplication
Division
Power (X to the power of Y)
Logarithm (I think it's X Log Y; I'm not very familiar with the logarithm operation, so I'm not 100% sure if that is enough to specify which type of logarithm it is; let me know if you need more information about it)
Sine
Cosine
Tangent
Arcsine
Arccosine
Arctangent (not Atan2, but that can be created by combining operations if necessary)
Minimum (Returns the lowest of 2 numbers)
Maximum (Returns the highest of 2 numbers)
Round (Returns the closest round number to the input)
Less-than (Returns 1 if X is less than Y, zero otherwise)
Greater-than (Returns 1 if X is more than Y, zero otherwise)
Modulo (Produces a sawtooth pattern of period Y; for positive X values it's in the 0 to Y range, and for negative values of X it's in the -Y to zero range)
Absolute (strips the sign of the input value, makes it positive if it was negative, doesn't do anything if it's already positive)
There is no iteration nor looping functionality available (and of course, branching can only be done by calculating all the branches and then doing something like multiplying the results of the branches not meant to be taken by zero and then adding the results of all of them together).
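For what it's worth, that multiply-and-add branching trick looks like this (a small Python sketch; less_than merely stands in for the Less-than node's 0/1 output):

    def less_than(x, y):
        return 1.0 if x < y else 0.0       # stand-in for the Less-than node (outputs 1 or 0)

    def select(x, y, a, b):
        """Branchless 'a if x < y else b' using only multiply and add."""
        t = less_than(x, y)
        return a * t + b * (1.0 - t)

    print(select(2.0, 3.0, 10.0, 20.0))    # 10.0, because 2 < 3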

efficiently determining if a polynomial has a root in the interval [0,T]

I have polynomials of nontrivial degree (4+) and need to robustly and efficiently determine whether or not they have a root in the interval [0,T]. The precise location or number of roots don't concern me, I just need to know if there is at least one.
Right now I'm using interval arithmetic as a quick check to see if I can prove that no roots can exist. If I can't, I'm using Jenkins-Traub to solve for all of the polynomial roots. This is obviously inefficient since it's checking for all real roots and finding their exact positions, information I don't end up needing.
Is there a standard algorithm I should be using? If not, are there any other efficient checks I could do before doing a full Jenkins-Traub solve for all roots?
For example, one optimization I could do is to check if my polynomial f(t) has the same sign at 0 and T. If not, there is obviously a root in the interval. If so, I can solve for the roots of f'(t) and evaluate f at all roots of f' in the interval [0,T]. f(t) has no root in that interval if and only if all of these evaluations have the same sign as f(0) and f(T). This reduces the degree of the polynomial I have to root-find by one. Not a huge optimization, but perhaps better than nothing.
Sturm's theorem lets you calculate the number of real roots in the range (a, b). Given the number of roots, you know if there is at least one. From the bottom half of page 4 of this paper:
Let f(x) be a real polynomial. Denote it by f_0(x) and its derivative f′(x) by f_1(x). Proceed as in Euclid's algorithm to find
f_0(x) = q_1(x) · f_1(x) − f_2(x),
f_1(x) = q_2(x) · f_2(x) − f_3(x),
...
f_{k−2}(x) = q_{k−1}(x) · f_{k−1}(x) − f_k,
where f_k is a constant, and for 1 ≤ i ≤ k, f_i(x) is of degree lower than that of f_{i−1}(x). The signs of the remainders are negated from those in the Euclid algorithm.
Note that the last non-vanishing remainder f_k (or f_{k−1} when f_k = 0) is a greatest common divisor of f(x) and f′(x). The sequence f_0, f_1, ..., f_k (or f_{k−1} when f_k = 0) is called a Sturm sequence for the polynomial f.
Theorem 1 (Sturm's Theorem) The number of distinct real zeros of a polynomial f(x) with real coefficients in (a, b) is equal to the excess of the number of changes of sign in the sequence f_0(a), ..., f_{k−1}(a), f_k over the number of changes of sign in the sequence f_0(b), ..., f_{k−1}(b), f_k.
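A minimal Python sketch of that procedure (helper names are my own; coefficients are listed highest degree first, and the exact-zero tests would need a tolerance for messy floating-point input):

    def poly_eval(p, x):
        """Horner evaluation; p lists coefficients from highest degree down."""
        acc = 0.0
        for c in p:
            acc = acc * x + c
        return acc

    def poly_deriv(p):
        n = len(p) - 1
        return [c * (n - i) for i, c in enumerate(p[:-1])] or [0.0]

    def poly_rem(u, v):
        """Remainder of u divided by v (both highest degree first)."""
        u = list(u)
        while len(u) >= len(v):
            q = u[0] / v[0]
            for i in range(len(v)):
                u[i] -= q * v[i]
            u.pop(0)                       # drop the cancelled leading term
        return u or [0.0]

    def sturm_sequence(p):
        """f0 = p, f1 = p', then negated Euclidean remainders (assumes deg p >= 1)."""
        seq = [list(p), poly_deriv(p)]
        while True:
            r = [-c for c in poly_rem(seq[-2], seq[-1])]
            if all(c == 0.0 for c in r):   # previous entry is (a multiple of) the gcd
                return seq
            seq.append(r)

    def sign_changes(seq, x):
        vals = [v for v in (poly_eval(f, x) for f in seq) if v != 0.0]
        return sum(1 for s, t in zip(vals, vals[1:]) if (s < 0.0) != (t < 0.0))

    def roots_in(p, a, b):
        """Number of distinct real roots in (a, b), assuming neither endpoint is a root."""
        seq = sturm_sequence(p)
        return sign_changes(seq, a) - sign_changes(seq, b)

    # x**2 - 1 has the two real roots -1 and 1:
    print(roots_in([1.0, 0.0, -1.0], -2.0, 2.0))     # 2
    print(roots_in([1.0, 0.0, -1.0], 0.0, 3.0) > 0)  # True: at least one root in (0, 3)

For the original problem you would check whether roots_in(p, 0.0, T) is positive, nudging an endpoint slightly if it happens to be a root.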
You could certainly do binary search on your interval arithmetic. Start with [0,T] and substitute it into your polynomial. If the result interval does not contain 0, you're done. If it does, divide the interval in 2 and recurse on each half. This scheme will find the approximate location of each root pretty quickly.
If you eventually get 4 separate intervals with a root, you know you are done. Otherwise, I think you need to get to intervals [x,y] where f'([x,y]) does not contain zero, meaning that the function is monotonically increasing or decreasing and hence contains at most one zero. Double roots might present a problem; I'd have to think more about that.
Edit: if you suspect a multiple root, find roots of f' using the same procedure.
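A rough Python sketch of that scheme (naive interval arithmetic with no outward rounding, so treat it as an illustration rather than a fully rigorous check; all names are mine):

    def horner(p, x):
        acc = 0.0
        for c in p:                        # coefficients highest degree first
            acc = acc * x + c
        return acc

    def iv_add(a, b):
        return (a[0] + b[0], a[1] + b[1])

    def iv_mul(a, b):
        prods = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
        return (min(prods), max(prods))

    def iv_eval(p, iv):
        """Horner's scheme in interval arithmetic: an enclosure of p over iv."""
        acc = (p[0], p[0])
        for c in p[1:]:
            acc = iv_add(iv_mul(acc, iv), (c, c))
        return acc

    def may_have_root(p, lo, hi, depth=30):
        """False only when the enclosure proves there is no root in [lo, hi]."""
        y_lo, y_hi = iv_eval(p, (lo, hi))
        if y_lo > 0.0 or y_hi < 0.0:
            return False                   # enclosure excludes zero
        if horner(p, lo) * horner(p, hi) <= 0.0:
            return True                    # sign change: a root is certain
        if depth == 0:
            return True                    # inconclusive (e.g. a possible double root)
        mid = 0.5 * (lo + hi)
        return may_have_root(p, lo, mid, depth - 1) or may_have_root(p, mid, hi, depth - 1)

    print(may_have_root([1.0, 0.0, -2.0], 0.0, 3.0))   # x**2 - 2: True (root at sqrt(2))
    print(may_have_root([1.0, 0.0, 2.0], 0.0, 3.0))    # x**2 + 2: False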
Use Descartes' rule of signs to glean some information. Just count the number of sign changes in the coefficients. This gives you an upper bound on the number of positive real roots. Consider the polynomial P.
P = 131.1 - 73.1*x + 52.425*x^2 - 62.875*x^3 - 69.225*x^4 + 11.225*x^5 + 9.45*x^6 + x^7
In fact, I've constructed P to have a simple list of roots. They are...
{-6, -4.75, -2, 1, 2.3, -i, +i}
Can we determine if there is a root in the interval [0,3]? Note that there is no sign change in the value of P at the endpoints.
P(0) = 131.1
P(3) = 4882.5
How many sign changes are there in the coefficients of P? There are 4 sign changes, so there may be as many as 4 positive roots.
But, now substitute x+3 for x into P. Thus
Q(x) = P(x+3) = ...
4882.5 + 14494.75*x + 15363.9*x^2 + 8054.675*x^3 + 2319.9*x^4 + 370.325*x^5 + 30.45*x^6 + x^7
See that Q(x) has NO sign changes in the coefficients. All of the coefficients are positive values. Therefore there can be no roots larger than 3.
So there MAY be 0, 2, or 4 roots in the interval [0,3].
At least this tells you whether to bother looking at all. Of course, if the function has opposite signs on each end of the interval, we know there are an odd number of roots in that interval.
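The coefficient bookkeeping is easy to script; a small Python sketch (helper names are mine; the shift computes P(x + a) by Horner-style composition):

    def coeff_sign_changes(coeffs):
        signs = [c for c in coeffs if c != 0]
        return sum(1 for s, t in zip(signs, signs[1:]) if (s < 0) != (t < 0))

    def shift_poly(coeffs, a):
        """Coefficients of P(x + a), highest degree first."""
        out = [coeffs[0]]
        for c in coeffs[1:]:
            # multiply the current polynomial by (x + a), then add the next coefficient
            out = [out[0]] + [out[i] + a * out[i - 1] for i in range(1, len(out))] + [a * out[-1] + c]
        return out

    P = [1, 9.45, 11.225, -69.225, -62.875, 52.425, -73.1, 131.1]   # highest degree first
    print(coeff_sign_changes(P))                 # 4: at most four positive roots
    print(coeff_sign_changes(shift_poly(P, 3)))  # 0: no roots greater than 3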
It's not that efficient, but it is quite reliable. You can construct the polynomial's companion matrix (a sparse matrix whose eigenvalues are the polynomial's roots).
There are efficient eigenvalue algorithms that can find eigenvalues in a given interval. One of them is inverse iteration (it can find the eigenvalue closest to some input value; just give the midpoint of the interval as that value).
If f(0)*f(T) <= 0 then you are guaranteed to have a root. Otherwise you can start splitting the domain into two parts (bisection) and check the values at the ends until you are confident there is no root in that segment.
If f(0)*f(T) > 0 you have either zero, two, four, ... roots (counted with multiplicity); the polynomial's degree is the limit. If f(0)*f(T) < 0 you have one, three, five, ... roots.

Resources