How to determine error in floating-point calculations? - math

I have the following equation I want to implement in floating-point arithmetic:
Equation: sqrt((a-b)^2 + (c-d)^2 + (e-f)^2)
How do I determine how the width of the mantissa affects the accuracy of the result? What is the correct mathematical approach to determining this?
For instance, if I perform the following operations, how will the accuracy be affected after each step?
Here are the steps:
Step 1, Perform the following calculations in 32-bit single precision floating point: x=(a-b), y=(c-d), z=(e-f)
Step 2, Round the three results to have a mantissa of 16 bits (not including the hidden bit),
Step 3, Perform the following squaring operations: x2 = x^2, y2 = y^2, z2 = z^2
Step 4, Round x2, y2, and z2 to a mantissa of 10 bits (after the decimal point).
Step 5, Add the values: w = x2 + y2 + z2
Step 6, Round the results to 16 bits
Step 7, Take the square root: sqrt(w)
Step 8, Round to 20 mantissa bits (not including the hidden bit).

There are various ways of representing the error of a floating-point number. There is relative error (a * (1 + ε)), the subtly different ULP error (a + ulp(a) * ε), and absolute error (a + ε). Each of them can be used in analysing the error, but all have shortcomings. To get sensible results you often have to take into account precisely what happens inside floating-point calculations. I'm afraid that the 'correct mathematical approach' is a lot of work, so instead I'll give you the following.
simplified ULP based analysis
The following analysis is quite crude, but it does give a good 'feel' for how much error you end up with. Just treat these as examples only.
(a-b)
The operation itself gives you up to a 0.5 ULP error (if rounding RNE). The rounding error of this operation can be small compared to the inputs, but if the inputs are very similar and already contain error, you could be left with nothing but noise!
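A minimal sketch of that cancellation effect, assuming Python and using `struct` to round a double to single precision. The helper name `to_f32` and the sample values are illustrative and not taken from the question:

```python
import struct

def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Two nearly equal inputs; the first is not exactly representable in 32 bits.
a_exact, b_exact = 1.0000001, 1.0
a32, b32 = to_f32(a_exact), to_f32(b_exact)

# The subtraction itself is exact here, but it exposes the input rounding error.
diff = a32 - b32
rel_err = abs(diff - 1e-7) / 1e-7
print(diff, rel_err)
```

The input error is tiny relative to `a`, yet after the subtraction it is a large fraction of the result.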
(a^2)
This operation multiplies not only the input, but also the input error. If dealing with relative error, that means at least multiplying errors by the other mantissa. Interestingly there is a little normalisation step in the multiplier, that means that the relative error is halved if the multiplication result crosses a power of two boundary. The worst case is where the inputs multiply just below that, e.g. having two inputs that are almost sqrt(2). In this case the input error is multiplied to 2*ε*sqrt(2). With an additional final rounding error of 0.5 ULP, the total is an error of ~2 ULP.
adding positive numbers
The worst case here is just the input errors added together, plus another rounding error. We're now at 3*2+0.5 = 6.5 ULP.
sqrt
The worst case for a sqrt is when the input is close to e.g. 1.0. The error roughly just gets passed through, plus an additional rounding error. We're now at 7 ULP.
intermediate rounding steps
It will take a bit more work to plug in your intermediate rounding steps.
You can model these as an error related to the number of bits you're rounding off. E.g. going from a 23 to a 10 bit mantissa with RNE introduces an additional 2^(13-1) = 2^12 ULP error relative to the 23-bit mantissa, or 0.5 ULP relative to the new mantissa (you'll have to scale down your other errors if you want to work with that).
I'll leave it to you to count the errors of your detailed example, but as the commenters noted, rounding to a 10-bit mantissa will dominate, and your final result will be accurate to roughly 8 mantissa bits.
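As a rough empirical companion to the analysis, the eight steps from the question can be simulated in Python. This is only a sketch under stated assumptions: `round_mantissa` and `distance_reduced` are hypothetical helper names, single precision is approximated by rounding to 23 mantissa bits (exponent range is ignored), step 4 is read as a 10-bit mantissa, and the sample inputs are made up. It reports the error for one sample, not a worst case.

```python
import math

def round_mantissa(x, bits):
    """Round x to `bits` explicit mantissa bits (hidden bit not counted), ties to even."""
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x
    m, e = math.frexp(x)           # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** (bits + 1)      # +1 accounts for the hidden bit
    return math.ldexp(round(m * scale) / scale, e)  # round() breaks ties to even

def distance_reduced(a, b, c, d, e, f):
    """Follow steps 1-8, assuming every rounding step counts mantissa bits."""
    # Steps 1-2: differences in simulated single precision, then 16 bits.
    x, y, z = (round_mantissa(round_mantissa(p - q, 23), 16)
               for p, q in ((a, b), (c, d), (e, f)))
    # Steps 3-4: squares rounded to 10 bits.
    x2, y2, z2 = (round_mantissa(v * v, 10) for v in (x, y, z))
    # Steps 5-6: sum rounded to 16 bits.
    w = round_mantissa(x2 + y2 + z2, 16)
    # Steps 7-8: square root rounded to 20 bits.
    return round_mantissa(math.sqrt(w), 20)

args = (3.7, 1.25, 9.4, 2.5, 6.1, 0.3)   # illustrative inputs
approx = distance_reduced(*args)
a, b, c, d, e, f = args
exact = math.sqrt((a - b) ** 2 + (c - d) ** 2 + (e - f) ** 2)
rel_err = abs(approx - exact) / exact
print(approx, exact, rel_err)
```

Running this over many random inputs, including ones where `a` and `b` nearly cancel, gives a feel for how the observed error compares with the ULP count above.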

Related

Loss of precision in float operations, due to exponent differences?

I have a program where I represent lengths (in cm) and angles (in radian) as floats. My lengths usually have values between 10 and 100, while my angles usually have values between 0 and 1.
I'm aware that precision will be lost in all floating point operations, but my question is:
Do I lose extra precision because of the magnitude gap between my two numerical realms? Would it be better if I changed my length unit to meters, such that my usual length values lie between 0.1 and 1, which matches my usual angle values pretty evenly?
The point of floating point is that the point floats. Changing the magnitudes of numbers does not change the relative errors, except for quantization effects.
A floating point system represents a number x with some value f and an exponent e with some fixed base b (e.g., 2 for binary floating point), so that x = f · b^e. (Often the sign is separated from f, but I am omitting that for simplicity.) If you multiply the numbers being worked with by any power of b, addition and subtraction will operate exactly the same (and so will multiplication and division if you correct for the additional factor), up to the bounds of the format.
If you multiply by other numbers, there can be small effects in rounding. When an operation is performed, the result has to be rounded to a fixed number of digits for the f portion. This rounding error is a fraction of the least significant digit of f. If f is near 1, it is larger relative to f than if f is near 2.
So, if you multiply your numbers by 256 (a power of 2), add, and divide by 256, the results will be the same as if you did the addition directly. If you multiply by 100, add, and divide by 100, there will likely be small changes. After multiplying by 100, some of your numbers will have their f parts moved closer to 1, and some will have their f parts moved closer to 2.
Generally, these changes are effectively random, and you cannot use such scaling to improve the results. Only in special circumstances can you control these errors.
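The claim about scaling by 256 versus 100 can be checked with a small experiment. This is a sketch in Python (double precision); the sample range, seed, and helper name `count_changed` are arbitrary choices for illustration:

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
pairs = [(random.uniform(0.1, 100.0), random.uniform(0.1, 100.0))
         for _ in range(1000)]

def count_changed(scale):
    """Count pairs whose sum changes when it is computed at a scaled magnitude."""
    return sum(1 for x, y in pairs
               if (x * scale + y * scale) / scale != x + y)

changed_pow2 = count_changed(256.0)   # power of the base: only exponents shift
changed_100 = count_changed(100.0)    # not a power of 2: extra roundings occur
print(changed_pow2, changed_100)
```

Scaling by a power of 2 changes no result, while scaling by 100 perturbs some sums in the last bit without making them systematically better.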

Data link layer - CRC what does divide by 1 + x mean?

Can someone please explain what this part about CRC codes from Tanenbaum's Computer Networks means?
If there has been a single-bit error, E(x) = x^i , where i determines which bit is
in error. If G(x) contains two or more terms, it will never divide into E(x), so all
single-bit errors will be detected.
And
If there have been two isolated single-bit errors, E(x) = x^i + x^j, where i > j.
Alternatively, this can be written as E(x) = x^j (x^(i − j) + 1). If we assume that G(x)
is not divisible by x, a sufficient condition for all double errors to be detected is
that G(x) does not divide x^k + 1 for any k up to the maximum value of i − j (i.e.,
up to the maximum frame length). Simple, low-degree polynomials that give protection
to long frames are known. For example, x^15 + x^14 + 1 will not divide
x^k + 1 for any value of k below 32,768.
Please post in simple terms so I can understand it a bit more! Examples are appreciated. Thanks in advance!
A message is a sequence of bits. You can convert any sequence of bits into a polynomial by just making each bit the coefficient of 1, x, x^2, etc. starting with the first bit. So 100101 becomes 1+x^3+x^5.
You can make these polynomials useful by considering their coefficients to be members of the simplest finite field, GF(2), which consists only of the elements 0 and 1. There addition is the exclusive-or operation and multiplication is the and operation.
Now you can do all the things you did with polynomials in high school, but where the coefficients are over GF(2). So 1+x added to x+x^2 becomes 1+x^2. 1+x times 1+x becomes 1+x^2. (Work it out.)
Cyclic Redundancy Checks (CRCs) are derived from this approach to binary message arithmetic, where a message converted to a polynomial is divided by a special constant polynomial whose degree is the number of bits in the CRC. Then the coefficients of the remainder of that polynomial division are the CRC of that message.
Read Ross Williams's CRC tutorial for more. (Real CRCs are not just that remainder, but you'll see.)
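To make the polynomial arithmetic concrete, here is a small Python sketch. It stores a polynomial as an integer whose bit i is the coefficient of x^i. The function names are made up for illustration, and this is plain GF(2) division, not a full real-world CRC. It reproduces the two small examples above and checks the quoted single-bit and double-bit claims for G(x) = x^15 + x^14 + 1, over a limited range only.

```python
def bits_to_poly(bits):
    """'100101' -> int whose bit i is the coefficient of x^i (first bit = constant term)."""
    return sum(1 << i for i, bit in enumerate(bits) if bit == "1")

def poly_add(a, b):
    """Addition over GF(2) is a bitwise exclusive-or."""
    return a ^ b

def poly_mul(a, b):
    """Carry-less multiplication: shift-and-xor instead of shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def poly_mod(a, g):
    """Remainder of a divided by g over GF(2), by long division."""
    while a.bit_length() >= g.bit_length():
        a ^= g << (a.bit_length() - g.bit_length())
    return a

print(bin(bits_to_poly("100101")))    # 1 + x^3 + x^5
print(bin(poly_add(0b011, 0b110)))    # (1+x) + (x+x^2) = 1 + x^2 -> 0b101
print(bin(poly_mul(0b011, 0b011)))    # (1+x) * (1+x)   = 1 + x^2 -> 0b101

G = (1 << 15) | (1 << 14) | 1         # x^15 + x^14 + 1 from the quoted text
# Single-bit errors E(x) = x^i always leave a nonzero remainder.
single_ok = all(poly_mod(1 << i, G) != 0 for i in range(64))
# Double-bit errors reduce to x^k + 1; checked here only for k < 500.
double_ok = all(poly_mod((1 << k) | 1, G) != 0 for k in range(1, 500))
print(single_ok, double_ok)
```

A nonzero remainder means the receiver's check fails, so the error is detected.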

Machine Arithmetic and Smearing: addition of a large and a small number

So to 10000 one will add the value 1/10000 10000 times. Logically this gives 10001.
However, this does not happen because of smearing, which stems from storage limitations. The result is 10000.999999992928.
I have located where the smearing occurs, which is in the second addition:
1: 10000.0001
2: 10000.000199999999
3: 10000.000299999998
4: 10000.000399999997
etc...
However, grasping why the smearing occurred is where the struggle lies.
I wrote code to generate floating point binary numbers to see whether smearing occurred here
So 10000 = 10011100010000 or 1.001110001*2**13 while
0.0001= 0.00000000000001101001 or
1.1010001101101110001011101011000111000100001100101101*2**(-14)
then 10000.0001 = 10011100010000.00000000000001101001
Now the smearing occurs in the next addition. Does it have to do with mantissa size? Why does it only occur in this step as well? Just interested to know. I am going to add all the 1/10000 first and then add it to the 10000 to avoid smearing.
The small "smearing" error for a single addition can be computed exactly as
a=10000; b=0.0001
err = ((a+b)-a)-b
print "err=",err
>>> err= -7.07223084891e-13
The rounding error of an addition is of size (abs(a)+abs(b))*mu/2 or around 1e4 * 1e-16 = 1e-12, which nicely fits the computed result.
In general you also have to test the expression ((a+b)-b)-a, but one of them is always zero, here the latter one.
And indeed this single step error accumulated over all the steps already gives the observed result, secondary errors relating to the slow increase in the sum as first term in each addition having a much lower impact.
print err*10000
>>> -7.072230848908026e-09
print 10001+err*10000
>>> 10000.999999992928
The main problem is that 1/10000, i.e. 0.0001, cannot be encoded exactly as a machine float value (see the IEEE 754 standard), since 10000 is not a power of 2. Also 1/10 = 0.1 cannot be encoded exactly as a machine float, so you will experience phenomena like 0.1 + 0.1 + 0.1 > 0.3.
When computing with double precision (64 bit) the following holds:
1.0001 - 1 < 0.0001
10000.0001 + 9999*0.0001 == 10001
So I assume you are computing with single precision (32 bit)?
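For comparison, here is a short Python 3 sketch (double precision) of three ways to form the sum: the naive loop from the question, the "add the small terms first" workaround the asker proposes, and `math.fsum`, which tracks the lost low-order bits. The variable names are illustrative only.

```python
import math

# Naive: each addition of 0.0001 to a value near 10000 loses low-order bits.
naive = 10000.0
for _ in range(10000):
    naive += 0.0001

# Workaround from the question: accumulate the small terms on their own first.
small_first = 10000.0 + sum(0.0001 for _ in range(10000))

# math.fsum returns the correctly rounded sum of the stored values.
compensated = math.fsum([10000.0] + [0.0001] * 10000)

print(naive, small_first, compensated)
```

The naive loop ends up several units of 1e-9 away from 10001, while the other two stay far closer, because the small terms are no longer rounded against a large running total at every step.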

exp function in Julia evaluating to 0

I want to calculate and plot the probability density of a wave function in Julia. I wrote a small snippet of Julia code for evaluating the following function:
The Julia (incomplete) code is:
set_bigfloat_precision(100)
A = 10
C = 5
m = BigFloat(9.10938356e-31)
ℏ = BigFloat(1.054571800e-34)
t = exp(-(sqrt(C * m) / ℏ))
The last line where I evaluate t gives 0.000000000000.... I tried to set the precision of the BigFloat as well. No luck! What am I doing wrong? Help appreciated.
While Chris Rackauckas has pointed out in the comments that you entered the formula wrong, I figured it was interesting enough to answer the question anyway.
Let's break it down so we can see what we are raising:
A = 10
C = 5
m = BigFloat(9.10938356e-31)
h = BigFloat(1.054571800e-34)
z = -sqrt(C * m)/h
t = exp(z)
So
z =-2.0237336022083455711032042949257e+19
so very roughly z = -2e19,
so roughly t = exp(-2e19) (i.e. t = 1/(e^(2*10^19))).
That is a very small number.
Consider that
exp(big"-1e+10") = 9.278...e-4342944820
and
exp(big"-1e+18") = 2.233...e-434294481903251828
and yes, julia says:
exp(big"-2e+19") = 0.0000
exp(big"-2e+19") is a very small number.
That puts us in context I hope. Very small number.
So julia depends on MPFR for BigFloats
You can try MPFR online. At precision 8192, exp(-2e10)=0
So same result.
Now, it is not the precision that we care about.
But rather the range of the exponent.
MPFR uses something kind of like IEEE-style floats, where precision is the length of the mantissa, and then you have an exponent: 2^exponent * mantissa.
So there is a limit on the range of the exponent.
See: MPFR docs:
Function: mpfr_exp_t mpfr_get_emin (void)
Function: mpfr_exp_t mpfr_get_emax (void)
Return the (current) smallest and largest exponents allowed for a floating-point variable. The smallest positive value of a floating-point variable is one half times 2 raised to the smallest exponent and the largest value has the form (1 - epsilon) times 2 raised to the largest exponent, where epsilon depends on the precision of the considered variable.
Now julia does set these to the maximum range that a fairly default MPFR build will allow. I've been digging around the MPFR source trying to find where this is set, but can't find it. I believe it is related to the max value an Int64 can hold.
Base.MPFR.get_emin() = -4611686018427387903 =typemin(Int64)>>1 + 1
You can adjust this but only up.
So anyway
0.5*big"2.0"^(Base.MPFR.get_emin()) = 8.5096913117408361391297879096205e-1388255822130839284
but
0.5*big"2.0"^(Base.MPFR.get_emin()-1) = 0.00000000000...
Now we know that
exp(x) = 2^(log(2,e)*x)
So we can exp(z) = 2^(log(2,e)*z)
log(2,e)*z = -29196304319863382016
Base.MPFR.get_emin() = -4611686018427387903
So the exponent (roughly -2.9e19) is less than the minimum allowed exponent (roughly -4.6e18), and an underflow occurs.
That is why you get zero.
It may (or may not) be possible to recompile MPFR with Int128 exponents, but julia hasn't.
Perhaps julia should throw an Underflow exception.
Feel encouraged to report that as an issue on the Julia Bug Tracker.
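The same exponent comparison can be reproduced outside Julia. The following Python sketch uses ordinary doubles, not MPFR, and stays in log space, so the size of the result can be read off even though `exp` itself underflows. The name `MPFR_EMIN` just holds the value quoted above, and `hbar` stands in for ℏ.

```python
import math

C = 5
m = 9.10938356e-31
hbar = 1.054571800e-34

z = -math.sqrt(C * m) / hbar
t = math.exp(z)                  # underflows to 0.0 in double precision

# Work with logarithms of t instead of t itself.
log10_t = z / math.log(10)       # decimal exponent of exp(z)
log2_t = z / math.log(2)         # binary exponent of exp(z)

MPFR_EMIN = -4611686018427387903   # Base.MPFR.get_emin() as quoted above
print(z, t, log10_t, log2_t < MPFR_EMIN)
```

If only the magnitude of t matters, carrying `log10_t` or `log2_t` through the rest of the calculation avoids the underflow entirely.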

Simple 2 or 3 parameters float PRNG formula that changes faster than the float resolution and produces white noise?

I'm looking for a 2 or 3 parameters math formula with the following characteristics:
Simple (the fewer operations the better)
Random output (non-periodic)
Normalized (Meaning the output will never be outside a given range; doesn't matter the range since once I know the range I can just divide and add/subtract to get it into the 0 to 1 range I'm looking for)
White noise (the more samples you get the more evenly distributed the outputs get across the range of possible output values, with no gaps or hotspots, to the extent permitted by the floating-point standard)
Random all the way down (no gradual changes between output values even if the inputs are changed by the smallest amount the float standard will allow. I understand that given the nature of randomness, it is possible two output values might be close together once in a while, but that must only happen by coincidence, and not because of smoothness or periodicity)
Uses only the operations listed below (but of course, any operations that can be done by a combination of the ones listed below are also allowed)
I need this because I need a good source of controllable randomness for some experiments I'm doing with Cycles material nodes in Blender. And since that is where the formula will be implemented, the only operations I have available are:
Addition
Subtraction
Multiplication
Division
Power (X to the power of Y)
Logarithm (I think it's X Log Y; I'm not very familiar with the logarithm operation, so I'm not 100% sure if that is enough to specify which type of logarithm it is; let me know if you need more information about it)
Sine
Cosine
Tangent
Arcsine
Arccosine
Arctangent (not Atan2, but that can be created by combining operations if necessary)
Minimum (Returns the lowest of 2 numbers)
Maximum (Returns the highest of 2 numbers)
Round (Returns the closest round number to the input)
Less-than (Returns 1 if X is less than Y, zero otherwise)
Greater-than (Returns 1 if X is more than Y, zero otherwise)
Modulo (Produces a sawtooth pattern of period Y; for positive X values it's in the 0 to Y range, and for negative values of X it's in the -Y to zero range)
Absolute (strips the sign of the input value, makes it positive if it was negative, doesn't do anything if it's already positive)
There is no iteration nor looping functionality available (and of course, branching can only be done by calculating all the branches and then doing something like multiplying the results of the branches not meant to be taken by zero and then adding the results of all of them together).
