Machine Arithmetic and Smearing: addition of a large and a small number - math

So to 10000 one will add the value 1/10000, 10,000 times. Mathematically this gives exactly 10001.
However, due to smearing, which stems from storage limitations, this does not occur. The result is 10000.999999992928.
I have located where the smearing occurs, which is in the second addition:
1: 10000.0001
2: 10000.000199999999
3: 10000.000299999998
4: 10000.000399999997
etc...
However, I am struggling to grasp why the smearing occurs.
I wrote code to generate the floating point numbers in binary to see whether the smearing shows up there.
So 10000 = 10011100010000 in binary, or 1.001110001 * 2**13, while
0.0001= 0.00000000000001101001 or
1.1010001101101110001011101011000111000100001100101101*2**(-14)
then 10000.0001 = 10011100010000.00000000000001101001
Now the smearing occurs in the next addition. Does it have to do with mantissa size? Why does it only occur from this step onward? Just interested to know. I am going to add all the 1/10000 values together first and then add that total to the 10000 to avoid smearing.
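For reference, a minimal sketch of the two approaches I mean (plain Python doubles; the exact printed digits may vary slightly by platform):
total = 10000.0
for _ in range(10000):
    total += 0.0001          # running sum: each addition rounds, and the error "smears"
print(total)                 # something like 10000.999999992928, not 10001.0
small = sum(0.0001 for _ in range(10000))   # add the 1/10000 terms together first
print(10000.0 + small)                      # much closer to 10001.0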

The small "smearing" error for a single addition can be computed exactly as
a=10000; b=0.0001
err = ((a+b)-a)-b
print "err=",err
>>> err= -7.07223084891e-13
The rounding error of an addition is bounded by (abs(a)+abs(b))*mu/2, where mu is the machine epsilon (about 2.2e-16 for double precision), or around 1e4 * 1e-16 = 1e-12, which nicely fits the computed result.
In general you also have to test the expression ((a+b)-b)-a, but one of them is always zero, here the latter one.
And indeed, this single-step error accumulated over all 10,000 steps already accounts for the observed result; secondary errors, caused by the first term of each addition slowly growing as the sum increases, have a much smaller impact.
print err*10000
>>> -7.072230848908026e-09
print 10001+err*10000
>>> 10000.999999992928

The main problem is that 1/10000, i.e. 0.0001, cannot be encoded exactly as a machine float value (see the IEEE 754 standard), since 10000 is not a power of 2. Also 1/10 = 0.1 cannot be encoded as a machine float, so you will experience phenomena like 0.1 + 0.1 + 0.1 > 0.3.
When computing with double precision (64 bit) the following holds:
1.0001 - 1 < 0.0001
10000.0001 + 9999*0.0001 == 10001
So I assume you are computing with single precision (32 bit)?
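These claims are easy to try out directly (a quick Python sketch; Python floats are IEEE 754 double precision):
print(0.1 + 0.1 + 0.1 > 0.3)               # True: 0.1 is not exactly representable
print(1.0001 - 1 < 0.0001)                  # the double-precision claims made above
print(10000.0001 + 9999 * 0.0001 == 10001)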

Related

Learning Cyclic Redundancy Check Failure Cases

I am trying to understand how likely a Cyclic Redundancy Check is to fail, given a particular divisor P(x). I am specifically interested in failures resulting from an odd number of bit flips in my message, an example to follow.
Some prerequisite info:
CRC is a very commonly used way to detect errors in computer networks.
P(x), G(x), R(x), and T(x) are all polynomials under binary field arithmetic (i.e., all coefficients are mod 2: 0 or 1).
P(x) is the polynomial that we are given and that we will divide by.
E(x) is an error pattern. It is XORed with T(x) to get T'(x). G(x) is the message that we want to send.
R(x) is the remainder of (x^k)G(x)/P(x), or just (x^k)G(x) mod P(x).
T(x) is our sent data, (x^k)G(x)+R(x), where k is the degree of P(x).
T'(x) is our received data but potentially with errors.
When T'(x) is received, if T'(x) mod P(x) = 0 then it is said to be error-free. It may not actually be error-free.
Proof:
Assume an error pattern E(x) with an odd number of errors has x + 1 as a factor.
Then E(x) = (x + 1)Q(x) for some polynomial Q(x).
Evaluate E(x) at x = 1 → E(1) = 1, since there is an odd number of terms.
But (x + 1)Q(x) evaluated at x = 1 is (1 + 1)Q(1) = 0.
Therefore, E(x) cannot have x + 1 as a factor.
Example:
Say my P(x) = x^7 + 1 = 10000001
Let G(x) = x^7 + x^6 + x^3 + x + 1 = 11001011
So, T(x) = 110010111001010
When E(x) = 011111011111111
T'(x) = E(x) XOR T(x) = 101101100110101
T'(x) mod P(x) = 0, a failure.
I simulated the results on a particular message, G(x) = 11001011, and found the CRC to fail on 42 of the 16384 possible odd-parity bit-flip patterns applied to T(x). Failure means that T'(x) mod P(x) = 0.
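For reference, here is one way such a brute-force check can be written, with polynomials held as Python integers (the helper name and bit layout are illustrative, not necessarily the exact code I ran):
def mod2_rem(value, poly, deg):
    """Remainder of value / poly under GF(2) (XOR) long division; deg is the degree of poly."""
    for shift in range(value.bit_length() - 1, deg - 1, -1):
        if value & (1 << shift):
            value ^= poly << (shift - deg)
    return value

P, k = 0b10000001, 7                      # P(x) = x^7 + 1, degree 7
G = 0b11001011                            # message G(x)
T = (G << k) | mod2_rem(G << k, P, k)     # sent codeword T(x) = (x^k)G(x) + R(x)

n = T.bit_length()
failures = sum(1 for E in range(1, 1 << n)
               if bin(E).count("1") % 2 == 1        # odd number of bit flips
               and mod2_rem(T ^ E, P, k) == 0)      # yet T'(x) mod P(x) == 0
print(failures, "undetected odd-weight error patterns out of", 1 << (n - 1))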
I expected odd parity bit errors to be caught based on the above proof.
Is the proof wrong, or am I doing my example calculation wrong?
What I really want to know is: given P(x) = x^7 + 1, what are the odds that a general message with an odd number of bit flips will be erroneous but not be caught as erroneous?
Sorry, this is so long-winded but I just want to make sure everything is super clear.

Mean function incorrect value

I have an 80-element array in which every entry is the same value: 176.01977965813853
If I use the mean function I get the value 176.01977965813842
Why is that?
Here is a minimal working example:
using Statistics
arr = fill(176.01977965813853, 80)
julia> mean(arr)
176.01977965813842
I expected this to return 176.01977965813853.
These are just expected floating point errors. But if you need very precise summations, you can use a bit more elaborate (and costly) summation scheme:
julia> using KahanSummation
[ Info: Precompiling KahanSummation [8e2b3108-d4c1-50be-a7a2-16352aec75c3]
julia> sum_kbn(fill(176.01977965813853, 80))/80
176.01977965813853
Ref: Wikipedia
The problem as I understand it can be reproduced as follows:
using Statistics
arr = fill(176.01977965813853, 80)
julia> mean(arr)
176.01977965813842
The reason for this is that Julia does all floating point arithmetic with 64 bits of precision by default (i.e. the Float64 type). Float64s cannot represent every real number exactly. There is a finite step between consecutive floating point numbers, and rounding errors are incurred when you do arithmetic on them. These rounding errors are usually fine, but if you're not careful, they can be catastrophic. For instance:
julia> 1e100 + 1.0 - 1e100
0.0
That says that if I do 10^100 + 1 - 10^100 I get zero! If we want an upper bound on the errors caused by floating point arithmetic, we can use IntervalArithmetic.jl:
using IntervalArithmetic
julia> 1e100 + interval(1.0) - 1e100
[0, 1.94267e+84]
That says that the operation 1e100 + 1.0 - 1e100 is at least equal to 0.0 and at most 1.94*10^84, so the error bounds are huge!
We can do the same for the operation you were interested in,
arr = fill(interval(176.01977965813853), 80);
julia> mean(arr)
[176.019, 176.02]
julia> mean(arr).lo
176.019779658138
julia> mean(arr).hi
176.0197796581391
which says that the actual mean could be as small as 176.019779658138 or as large as 176.0197796581391, but one can't be any more certain due to floating point error! So here, Float64 gave the answer with at most about 10^-13 percent error, which is actually quite small.
What if those are unacceptable error bounds? Use more precision! You can use the big string macro to get arbitrary precision number literals:
arr = fill(interval(big"176.01977965813853"), 80);
julia> mean(arr).lo
176.0197796581385299999999999999999999999999999999999999999999999999999999999546
julia> mean(arr).hi
176.019779658138530000000000000000000000000000000000000000000000000000000000043
That calculation was done using 256 bits of precision, but you can get even more precision using the setprecision function:
setprecision(1000)
arr = fill(interval(big"176.01977965813853"), 80);
julia> mean(arr).lo
176.019779658138529999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999599
julia> mean(arr).hi
176.019779658138530000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000579
Note that arbitrary precision arithmetic is sloooow compared to Float64s, so it's usually best to just use arbitrary precision arithmetic to validate your results to make sure you're converging to a good result within your desired accuracy.

exp function in Julia evaluating to 0

I want to calculate and plot the probability density of a wave function in Julia. I wrote a small snippet of Julia code for evaluating the following function:
The Julia (incomplete) code is:
set_bigfloat_precision(100)
A = 10
C = 5
m = BigFloat(9.10938356e-31)
ℏ = BigFloat(1.054571800e-34)
t = exp(-(sqrt(C * m) / ℏ))
The last line where I evaluate t gives 0.000000000000.... I tried to set the precision of the BigFloat as well. No luck! What am I doing wrong? Help appreciated.
While Chris Rackauckas has pointed out in the comments that you entered the formula wrong, I figured it was interesting enough to answer the question anyway.
Let's break it down so we can see what we are raising e to:
A = 10
C = 5
m = BigFloat(9.10938356e-31)
h = BigFloat(1.054571800e-34)
z = -sqrt(C * m)/h
t = exp(z)
So
z = -2.0237336022083455711032042949257e+19
so very roughly z = -2e19,
and roughly t = exp(-2e19), i.e. t = 1/(e^(2*10^19)).
That is a very small number.
Consider that
exp(big"-1e+10") = 9.278...e-4342944820
and
exp(big"-1e+18") = 2.233...e-434294481903251828
and yes, Julia says:
exp(big"-2e+19") = 0.0000
exp(big"-2e+19") is a very small number.
That puts us in context I hope. Very small number.
So, Julia depends on MPFR for BigFloats.
You can try MPFR online. At precision 8192, exp(-2e10)=0
So same result.
Now, it is not the precision that we care about here, but rather the range of the exponent.
MPFR uses something kinda like IEEE-style floats, where the precision is the length of the mantissa and then you have an exponent: 2^exponent * mantissa.
So there is a limit on the range of the exponent.
See: MPFR docs:
Function: mpfr_exp_t mpfr_get_emin (void)
Function: mpfr_exp_t mpfr_get_emax (void)
Return the (current) smallest and largest exponents allowed for a floating-point variable. The smallest positive value of a floating-point variable is one half times 2 raised to the smallest exponent and the largest value has the form (1 - epsilon) times 2 raised to the largest exponent, where epsilon depends on the precision of the considered variable.
Now, Julia does set these to the maximum range that a fairly default MPFR build will allow. I've been digging around the MPFR source trying to find where this is set, but can't find it. I believe it is related to the maximum value an Int64 can hold.
Base.MPFR.get_emin() = -4611686018427387903 = typemin(Int64)>>1 + 1
You can adjust this but only up.
So anyway
0.5*big"2.0"^(Base.MPFR.get_emin()) = 8.5096913117408361391297879096205e-1388255822130839284
but
0.5*big"2.0"^(Base.MPFR.get_emin()-1) = 0.00000000000...
Now we know that
exp(x) = 2^(log(2,e)*x)
So we can write exp(z) = 2^(log(2,e)*z).
log(2,e)*z = -29196304319863382016
Base.MPFR.get_emin() = -4611686018427387903
So since the required exponent (roughly -2.9e19) is less than the minimum allowed exponent (roughly -4.6e18), an underflow occurs.
That is why you get zero.
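The magnitudes involved can be checked with ordinary double-precision arithmetic (a small Python sketch; the emin value is the one quoted above):
import math
z = -math.sqrt(5 * 9.10938356e-31) / 1.054571800e-34   # roughly -2.02e19
needed_exponent = z * math.log2(math.e)                 # exponent e2 such that exp(z) = 2**e2
emin = -(2**62 - 1)                                     # Base.MPFR.get_emin() as quoted above
print(needed_exponent, emin, needed_exponent < emin)    # True: below the smallest allowed exponent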
It may (or may not) be possible to recompile MPFR with Int128 exponents, but Julia hasn't.
Perhaps Julia should throw an underflow exception.
Feel free to report that as an issue on the Julia bug tracker.

How to determine error in floating-point calculations?

I have the following equation I want to implement in floating-point arithmetic:
Equation: sqrt((a-b)^2 + (c-d)^2 + (e-f)^2)
I am wondering how to determine how the width of the mantissa affects the accuracy of the result, and what the correct mathematical approach to determining this is.
For instance, if I perform the following operations, how will the accuracy be affected after each step?
Here are the steps:
Step 1, Perform the following calculations in 32-bit single precision floating point: x=(a-b), y=(c-d), z=(e-f)
Step 2, Round the three results to have a mantissa of 16 bits (not including the hidden bit),
Step 3, Perform the following squaring operations: x2 = x^2, y2 = y^2, z2 = z^2
Step 4, Round x2, y2, and z2 to a mantissa of 10 bits (after the decimal point).
Step 5, Add the values: w = x2 + y2 + z2
Step 6, Round the results to 16 bits
Step 7, Take the square root: sqrt(w)
Step 8, Round to 20 mantissa bits (not including the hidden bit).
There are various ways of representing the error of a floating point number. There is relative error (a * (1 + ε)), the subtly different ULP error (a + ulp(a) * ε), and absolute error (a + ε). Each of them can be used in analysing the error, but all have shortcomings. To get sensible results you often have to take into account what happens precisely inside floating point calculations. I'm afraid that the 'correct mathematical approach' is a lot of work, and instead I'll give you the following.
simplified ULP based analysis
The following analysis is quite crude, but it does give a good 'feel' for how much error you end up with. Just treat these as examples only.
(a-b)
The operation itself gives you up to a 0.5 ULP error (if rounding RNE). The rounding error of this operation can be small compared to the inputs, but if the inputs are very similar and already contain error, you could be left with nothing but noise!
(a^2)
This operation multiplies not only the input, but also the input error. If dealing with relative error, that means at least multiplying errors by the other mantissa. Interestingly there is a little normalisation step in the multiplier, that means that the relative error is halved if the multiplication result crosses a power of two boundary. The worst case is where the inputs multiply just below that, e.g. having two inputs that are almost sqrt(2). In this case the input error is multiplied to 2*ε*sqrt(2). With an additional final rounding error of 0.5 ULP, the total is an error of ~2 ULP.
adding positive numbers
The worst case here is just the input errors added together, plus another rounding error. We're now at 3*2+0.5 = 6.5 ULP.
sqrt
The worst case for a sqrt is when the input is close to e.g. 1.0. The error roughly just gets passed through, plus an additional rounding error. We're now at 7 ULP.
intermediate rounding steps
It will take a bit more work to plug in your intermediate rounding steps.
You can model these as an error related to the number of bits you're rounding off. E.g. going from a 23 to a 10 bit mantissa with RNE introduces an additional 2^(13-1) ULP error relative to the 23-bit mantissa, or 0.5 ULP to the new mantissa (you'll have to scale down your other errors if you want to work with that).
I'll leave it to you to count the errors of your detailed example, but as the commenters noted, rounding to a 10-bit mantissa will dominate, and your final result will be accurate to roughly 8 mantissa bits.
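If you want to experiment rather than do the full pen-and-paper analysis, a small helper that rounds a double to a reduced mantissa width (round-to-nearest-even) lets you roughly replay the steps in the question (a Python sketch; the function name and sample inputs are mine, and the initial 32-bit step is approximated with ordinary doubles):
import math

def round_to_mantissa(x, bits):
    """Round x to `bits` stored mantissa bits (hidden bit not counted), round-half-to-even."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** (bits + 1)       # stored bits plus the hidden bit
    return math.ldexp(round(m * scale) / scale, e)

# replay the pipeline from the question with arbitrary sample inputs
a, b, c, d, e, f = 1.25, 0.1, 2.5, 0.3, 3.75, 0.7
x = round_to_mantissa(a - b, 16)
y = round_to_mantissa(c - d, 16)
z = round_to_mantissa(e - f, 16)
w = round_to_mantissa(round_to_mantissa(x * x, 10) + round_to_mantissa(y * y, 10)
                      + round_to_mantissa(z * z, 10), 16)
result = round_to_mantissa(math.sqrt(w), 20)
exact = math.sqrt((a - b) ** 2 + (c - d) ** 2 + (e - f) ** 2)
print(result, exact, abs(result - exact) / exact)   # relative error of the reduced-precision chain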

How to do hypot2(x,y) calculation when numbers can overflow

I'd like to do a hypot2 calculation on a 16-bit processor.
The standard formula is c = sqrt((a * a) + (b * b)). The problem with this is that with large inputs it overflows. E.g. 200 and 250: multiply 200 * 200 to get 40,000, which is higher than the maximum signed 16-bit value of 32,767, so it overflows, as does b * b; the numbers are added and the result may as well be useless; it might even signal an error condition because of a negative sqrt argument.
In my case, I'm dealing with 32-bit numbers, but 32-bit multiply on my processor is very fast, about 4 cycles. I'm using a dsPIC microcontroller. I'd rather not have to multiply with 64-bit numbers because that's wasting precious memory and undoubtedly will be slower. Additionally I only have sqrt for 32-bit numbers, so 64-bit numbers would require another function. So how can I compute a hypot when the values may be large?
Please note I can only really use integer math for this. Using anything like floating point math incurs a speed hit which I'd rather avoid. My processor has a fast integer/fixed point atan2 routine, about 130 cycles; could I use this to compute the hypotenuse length?
Depending on how much accuracy you need you may be able to avoid the squares and the square root operation. There is a section on this topic in Understanding Digital Signal Processing by Rick Lyons (section 10.2, "High-Speed Vector-Magnitude Approximation", starting at page 400 in my edition).
The approximation is essentially:
magnitude = alpha * min + beta * max
where max and min are the maximum and minimum absolute values of the real and imaginary components, and alpha and beta are two constants which are chosen to give a reasonable error distribution over the range of interest. These constants can be represented as fractions with power of 2 divisors to keep the arithmetic simple/efficient. In the book he suggests alpha = 15/16, beta = 15/32, and you can then simplify the formula to:
magnitude = (15 / 16) * (max + min / 2)
which might be implemented as follows using integer operations:
magnitude = 15 * (max + min / 2) / 16
and of course we can use shifts for the divides:
magnitude = (15 * (max + (min >> 1))) >> 4
Error is +/- 5% over a quadrant.
More information on this technique here: http://www.dspguru.com/dsp/tricks/magnitude-estimator
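As a quick sanity check of the estimate, here is an integer-only sketch (Python for illustration; the function name is mine, and on the dsPIC this would of course be native integer ops):
import math

def mag_estimate(re, im):
    """Alpha-max-plus-beta-min magnitude estimate with alpha = 15/16, beta = 15/32."""
    a, b = abs(re), abs(im)
    mx, mn = (a, b) if a >= b else (b, a)
    return (15 * (mx + (mn >> 1))) >> 4      # 15 * (max + min/2) / 16 using shifts

print(mag_estimate(200, 250), math.hypot(200, 250))   # estimate vs. exact, within a few percent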
This is taken verbatim from this John D. Cook blog post, hence CW:
Here’s how to compute sqrt(x*x + y*y)
without risking overflow.
max = maximum(|x|, |y|)
min = minimum(|x|, |y|)
r = min / max
return max*sqrt(1 + r*r)
If John D. Cook comes along and posts this you should give him the accept :)
Since you essentially can't do any multiplications without overflow you're likely going to lose some precision.
To get the numbers into an acceptable range, pull out some factor x and use
c = x*sqrt( (a/x)*(a/x) + (b/x)*(b/x) )
If x is a common factor, you won't lose precision, but if it's not, you will lose precision.
Update:
Even better, given that you can do some mild work with 64-bit numbers, with just one 64-bit addition you could do the rest of this problem in 32 bits with only a tiny loss of accuracy. To do this: do the two 32-bit multiplications to give you two 64-bit numbers, add these, and then bit-shift as needed to get the sum back down to 32 bits before taking the square root. If you always shift by an even number of bits, you can just multiply the final result by 2^(half the number of bits shifted), since sqrt(v * 2^(2k)) = 2^k * sqrt(v). The truncation should only cause a very small loss of accuracy, no more than 2^-31, or about 0.00000005%, of relative error.
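A rough sketch of that idea (Python integers standing in for the 32-bit and 64-bit registers; the helper name and example values are mine):
from math import isqrt

def hypot_shifted(a, b):
    """Square in 64 bits, shift the sum down to 32 bits by an even amount, sqrt, scale back."""
    s = a * a + b * b             # fits in 64 bits for signed 32-bit inputs
    shift = 0
    while s >> 32:                # reduce until the value fits a 32-bit sqrt routine
        s >>= 2                   # always shift by 2 so the square root scales by a power of two
        shift += 1
    return isqrt(s) << shift      # multiply back by 2**(half the total bits shifted)

print(hypot_shifted(2_000_000_000, 1_500_000_000))   # close to 2_500_000_000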
Aniko and John, it seems to me that you haven't addressed the OP's problem. If a and b are integers, then a*a + b*b is likely to overflow, because integer operations are being performed. The obvious solution is to convert a and b to floating-point values before computing a*a + b*b. But the OP hasn't let us know what language we should use, so we're a bit stuck.
The standard formula is c = sqrt((a * a) + (b * b)). The problem with this is with large inputs it overflows.
The solution for overflows (aside from throwing an error) is to saturate your intermediate calculations.
Calculate C = a*a + b*b. If a and b are signed 16-bit numbers, you will never have an overflow. If they are unsigned numbers, you'll need to right-shift the inputs first to get the sum to fit in a 32-bit number.
If C > (MAX_RADIUS)^2, return MAX_RADIUS, where MAX_RADIUS is the maximum value you can tolerate before encountering an overflow.
Otherwise, use either sqrt() or the CORDIC algorithm, which avoids the cost of square roots in favor of loop iteration + adds + shifts, to retrieve the amplitude of the (a,b) vector.
If you can constrain a and b to be at most 7 bits, you won't get any overflow. You can use a count-leading-zeros instruction to figure out how many bits to throw away.
Assume a>=b.
int bits = 16 - count_leading_zeros(a);
if (bits > 7) {
a >>= bits - 7;
b >>= bits - 7;
}
c = sqrt(a*a + b*b);
if (bits > 7) {
c <<= bits - 7;
}
Lots of processors have this instruction nowadays, and if not, you can use other fast techniques.
Although this won't give you the exact answer, it will be very close (at most ~1% low).
Do you need full precision? If you don't, you can increase your range a little bit by discarding a few least significant bits and multiplying them in afterwards.
Can a and b be anything? How about a lookup table if you only have a few a and b that you need to calculate?
A simple solution to avoid overflow is to divide both a and b by a+b before squaring, and then multiply the square root by a+b. Or do the same with max(a,b).
You can do a little simple algebra to bring the results back into range.
sqrt((a * a) + (b * b))
= 2 * sqrt(((a * a) + (b * b)) / 4)
= 2 * sqrt((a * a) / 4 + (b * b) / 4)
= 2 * sqrt((a/2 * a/2) + (b/2 * b/2))
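Applied repeatedly, that rule gives a simple way to keep the squares in range: halve the inputs until they fit, then double the result back afterwards. An illustrative Python sketch (not dsPIC code; the function name and the limit constant are mine):
from math import isqrt

def hypot_scaled(a, b, limit=1 << 15):
    """Halve a and b until their squares fit the working range, then scale the root back up."""
    shift = 0
    while max(abs(a), abs(b)) >= limit:
        a >>= 1                  # sqrt((a/2)^2 + (b/2)^2) is half of sqrt(a^2 + b^2)
        b >>= 1
        shift += 1
    return isqrt(a * a + b * b) << shift

print(hypot_scaled(200, 250))    # 320; larger inputs get halved first, at a small precision cost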
