How to choose epsilon value for floating point?

Since we know that 0.1 + 0.2 != 0.3 due to limited number representation, we need to instead check that abs(0.1+0.2 - 0.3) < ε. The question is, what ε value should we generally choose for different types? Is it possible to estimate it depending on the number of bits and the number and types of operations that are likely to be performed?

A baseline value for epsilon is the difference between 1.0 and the next highest representable value. In C++, this value is available as std::numeric_limits<T>::epsilon().
Note that, at the minimum, you need to scale this value as a proportion of the actual number you're testing. Also, since the precision scales only roughly with the numeric value, you may want to increase your margin by a small factor to prevent spurious errors:
double epsilon = std::numeric_limits<double>::epsilon();
// C++ floating-point literals are double by default; std::abs (from <cmath>) picks the double overload
bool is_near = std::abs(0.1 + 0.2 - 0.3) <= 0.3 * (2 * epsilon);
As a more complete example, a function for comparing doubles:
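// Requires <algorithm>, <cmath> and <limits>
bool is_approximately_equal(double a, double b) {
    const double epsilon = std::numeric_limits<double>::epsilon();
    double scale = std::max(std::abs(a), std::abs(b));
    return std::abs(a - b) <= scale * (2 * epsilon);
}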
In practice, the actual epsilon value you should use depends on what you're doing, and what kind of tolerance you actually need. Numeric algorithms will typically have precision tolerances (average and maximum) as well as time and space estimates. But the precision estimate typically starts with something like characteristic_value * epsilon.
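For comparison, the same scaled-tolerance idea can be sketched in Python. This is only an illustration of the approach above, not a recommended tolerance: the function name and the `factor` parameter are assumptions made for this example, and `sys.float_info.epsilon` plays the role of `std::numeric_limits<double>::epsilon()`.

```python
import sys

def is_approximately_equal(a, b, factor=2.0):
    """Relative comparison: tolerance scales with the larger magnitude."""
    # sys.float_info.epsilon is the gap between 1.0 and the next double.
    scale = max(abs(a), abs(b))
    return abs(a - b) <= scale * factor * sys.float_info.epsilon

if __name__ == "__main__":
    print(0.1 + 0.2 == 0.3)                        # exact comparison
    print(is_approximately_equal(0.1 + 0.2, 0.3))  # scaled comparison
```

Because the tolerance is proportional to the operands, this comparison only accepts a value near zero if the other operand is also near zero; an additional absolute tolerance is usually needed in that case (Python's built-in `math.isclose` takes both `rel_tol` and `abs_tol`).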

You can estimate the machine epsilon using the algorithm below. You need to multiply this epsilon with the integer value of 1+(log(number)/log(2)). After you have determined this value for all numbers in your equation, you can use error analysis to estimate the epsilon value for a specific calculation.
// Estimate the machine epsilon: halve until 1 + epsilon/2 is indistinguishable from 1
var epsilon = 1.0;
while (1.0 + (epsilon / 2.0) > 1.0) {
  epsilon = epsilon / 2.0;
}
// Calculate error using error analysis for a + b
var epsilon_equation = Math.sqrt(2 * epsilon * epsilon);
document.write('Epsilon: ' + epsilon_equation + '<br>');
document.write('Floating point error: ' + Math.abs(0.2 + 0.4 - 0.6) + '<br>');
document.write('Comparison using epsilon: ');
document.write(Math.abs(0.2 + 0.4 - 0.6) < epsilon_equation);
Following your comment, I have tried the same approach in C# and it seems to work:
using System;
namespace ConsoleApplication
{
    public class Program
    {
        public static void Main(string[] args)
        {
            double epsilon = 1.0;
            while (1.0 + (epsilon/2.0) > 1.0)
            {
                epsilon = epsilon/2.0;
            }
            double epsilon_equation = Math.Sqrt(2*epsilon*epsilon);
            Console.WriteLine(Math.Abs(1.0 + 2.0 - 3.0) < Math.Sqrt(3.0 * epsilon_equation * epsilon_equation));
        }
    }
}
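The halving loop used in both snippets can be checked against a language's built-in constant. The sketch below repeats it in Python and compares the result with `sys.float_info.epsilon`; the function name is made up for this example, and it assumes IEEE 754 doubles with round-to-nearest.

```python
import sys

def estimate_machine_epsilon():
    """Halve epsilon until adding half of it to 1.0 no longer changes 1.0."""
    epsilon = 1.0
    while 1.0 + epsilon / 2.0 > 1.0:
        epsilon /= 2.0
    return epsilon

if __name__ == "__main__":
    estimated = estimate_machine_epsilon()
    print(estimated, sys.float_info.epsilon)
```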

I am aware of the following approach to exact floating-point predicates computation: calculate the value, using standard floating point types, and calculate the error. Usually, the predicate can be stated as p(x) == 0 or p(x) < 0, etc. If the absolute value of p(x) is greater than the error, the computations are considered exact. Otherwise, interval-based or exact rational arithmetic is used.
It is possible to estimate the error from the expression used. I've heard of automatic generators of this, but failed to find any reference.
As far as I know, exact computations are mainly used for geometry, and googling for "exact geometric computations" gives a lot on the topic.
Here is an article that somehow explains error estimation.

Related

When have enough bits of my series with non-negative terms been calculated?

I have a power series with all terms non-negative which I want to evaluate to some arbitrarily set precision p (the length in binary digits of an MPFR floating-point mantissa). The result should be faithfully rounded. The issue is that I don't know when I should stop adding terms to the result variable; that is, how do I know when I already have p + 32 accurate summed bits of the series? 32 is just an arbitrarily chosen small natural number meant to facilitate more accurate rounding to p binary digits.
This is my original series
0 <= h <= 1
series_orig(h) := sum(n = 0, +inf, a(n) * h^n)
But I actually need to calculate an arbitrary derivative of the above series (m is the order of the derivative):
series(h, m) := sum(n = m, +inf, a(n) * (n - m + 1) * ... * n * h^(n - m))
The rational number sequence a is defined like so:
a(n) := binomial(1/2, n)^2
= (((2*n)!/(n!)) / (n! * 4^n * (2*n - 1)))^2
So how do I know when to stop summing up terms of series?
Is the following maybe a good strategy?
1. Compute in precision p * 4 (which is assumed to be greater than p + 32).
2. At each point, be able to recall the current partial sum and the previous one.
3. Stop looping when the previous and current partial sums are equal when rounded to precision p + 32.
4. Round to precision p and return.
Clarification
I'm doing this with MPFI, an interval arithmetic addon to MPFR. Thus the [mpfi] tag.
Attempts to get relevant formulas and equations
Guided by Eric in the comments, I have managed to derive a formula for the required working precision and an equation for the required number of terms of the series in the sum.
A problem, however, is that a nice formula for the required number of terms is not possible.
Someone more mathematically capable might instead be able to achieve a formula for a useful upper bound, but that seems quite difficult to do for all possible requested result precisions and for all possible values of m (the order of the derivative). Note that the formulas need to be easily computable so they're ready before I start computing the series.
Another problem is that it seems necessary to assume the worst case for h (h = 1) for there to be any chance of a nice formula, but this is wasteful if h is far from the worst case, that is if h is close to zero.

Calculating sqrt and arcTan in javacard without float type

I want to calculate sqrt and arctangent in Java Card. I don't have a math library to do this for me, and I don't have a float type to calculate it manually. I have some questions in mind:
1- Can I store a float number in byte-array form and work on it? How?
2- How are these operations usually calculated in Java Card?
I found some links, but they didn't help me:
http://stackoverflow.com/questions/15363244/math-library-for-javacard
http://javacardos.com/javacardforum/viewtopic.php?t=437
I should mention that I have to calculate these operations on the card. Thank you very much if anyone can help me.
The integer square root can be computed by the Babylonian method, if integer division is available.
Just iterate
R' = (R + S / R) / 2
with a suitable initial R.
Such a value can be found with
R= 1
while S > 2:
R*= 2
S/= 4
(preferably implemented with shifts, if available).
You can stop the iterations when the value of R stabilizes (you can also determine a priori a constant number of iterations that yields sufficient accuracy).
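The iteration above can be sketched with integer-only operations as follows. This is an illustration in Python rather than Java Card code, and the function name is an assumption. One detail differs from "stop when R stabilizes": with truncating division the iterate can end up alternating between two neighbouring values, so this sketch starts from a power of two that is at least sqrt(S) and stops as soon as the value no longer decreases.

```python
def isqrt_babylonian(s):
    """Integer square root (floor) using only shifts, adds and integer division."""
    if s < 0:
        raise ValueError("square root of a negative number")
    if s < 2:
        return s
    # Initial guess: a power of two that is >= sqrt(s), found from the bit length.
    r = 1 << ((s.bit_length() + 1) // 2)
    while True:
        nxt = (r + s // r) // 2   # R' = (R + S / R) / 2
        if nxt >= r:              # no longer decreasing: r is floor(sqrt(s))
            return r
        r = nxt

if __name__ == "__main__":
    print(isqrt_babylonian(1000000))
```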
The idea for CORDIC in the computation of atan is to have a table of values
angle[i] = atan(pow(2,-i));
It does not matter if the angles are precomputed in radians or degrees. Then use the tangent addition theorem
tan(a+b)=(tan(a)+tan(b) ) / ( 1-tan(a)*tan(b) )
to successively reduce the given tangent value
atan(x) {
  if(x<0) return -atan(-x);
  if(x>1) return 2*angle[0]-atan(1/x);
  pow2=1.0;
  phi=0;
  for(i=0;i<10; i++) {
    if(x>pow2) {
      phi += angle[i];
      x = (x-pow2)/(1+pow2*x);
    }
    pow2 /= 2;
  }
  return phi+x; // the remaining x is tiny, so atan(x) is approximated by x
}
Now one needs to translate these operations and constants into using some kind of fixed point format.
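Before moving to fixed point, the table-driven reduction can be checked in floating point. The sketch below is a direct Python transcription under that assumption (floats instead of a fixed-point format); the names `ANGLES` and `atan_table` are invented for this example, and on a card the table would be precomputed constants rather than calls to `math.atan`.

```python
import math

# Table of atan(2^-i); ANGLES[0] is atan(1) = pi/4.
ANGLES = [math.atan(2.0 ** -i) for i in range(10)]

def atan_table(x):
    """Arctangent via repeated subtraction of tabulated angles."""
    if x < 0:
        return -atan_table(-x)
    if x > 1:
        return 2 * ANGLES[0] - atan_table(1 / x)   # atan(x) = pi/2 - atan(1/x)
    pow2 = 1.0
    phi = 0.0
    for i in range(10):
        if x > pow2:
            phi += ANGLES[i]
            # tan(a - b) with tan(a) = x and tan(b) = pow2
            x = (x - pow2) / (1 + pow2 * x)
        pow2 /= 2
    return phi + x   # small-angle approximation for the remainder

if __name__ == "__main__":
    for value in (0.1, 1.0, 5.0, -2.5):
        print(value, atan_table(value), math.atan(value))
```

With ten table entries the remaining x is at most 2^-9, so the small-angle step contributes an error of roughly x^3/3, a few parts in 10^9.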

exp function in Julia evaluating to 0

I want to calculate and plot the probability density of a wave function in Julia. I wrote a small snippet of Julia code for evaluating the following function:
The Julia (incomplete) code is:
set_bigfloat_precision(100)
A = 10
C = 5
m = BigFloat(9.10938356e-31)
ℏ = BigFloat(1.054571800e-34)
t = exp(-(sqrt(C * m) / ℏ))
The last line where I evaluate t gives 0.000000000000.... I tried to set the precision of the BigFloat as well. No luck! What am I doing wrong? Help appreciated.
In the comments, Chris Rackauckas has pointed out that you entered the formula wrong. I figured it was interesting enough to answer the question anyway.
Let's break it down so we can see what we are raising:
A = 10
C = 5
m = BigFloat(9.10938356e-31)
h = BigFloat(1.054571800e-34)
z = -sqrt(C * m)/h
t = exp(z)
So
z =-2.0237336022083455711032042949257e+19
so very roughly z = -2e19,
so roughly t = exp(-2e19), i.e. t = 1/(e^(2*10^19)).
That is a very small number.
Consider that
exp(big"-1e+10") = 9.278...e-4342944820
and
exp(big"-1e+18") = 2.233...e-434294481903251828
and yes, julia says:
exp(big"-2e+19") = 0.0000
exp(big"-2e+19") is a very small number.
That puts us in context I hope. Very small number.
So julia depends on MPFR for BigFloats
You can try MPFR online. At precision 8192, exp(-2e19)=0
So same result.
Now, it is not the precision that we care about.
But rather the range of the exponent.
MPFR uses something rather like IEEE-style floats, where the precision is the length of the mantissa, and then you have an exponent: value = mantissa * 2^exponent.
So there is a limit on the range of the exponent.
See: MPFR docs:
Function: mpfr_exp_t mpfr_get_emin (void)
Function: mpfr_exp_t mpfr_get_emax (void)
Return the (current) smallest and largest exponents allowed for a floating-point variable. The smallest positive value of a floating-point variable is one half times 2 raised to the smallest exponent and the largest value has the form (1 - epsilon) times 2 raised to the largest exponent, where epsilon depends on the precision of the considered variable.
Now Julia does set these to the maximum range that a fairly default MPFR build will allow. I've been digging around the MPFR source trying to find where this is set, but can't find it. I believe it is related to the maximum value an Int64 can hold.
Base.MPFR.get_emin() = -4611686018427387903 =typemin(Int64)>>1 + 1
You can adjust this but only up.
So anyway
0.5*big"2.0"^(Base.MPFR.get_emin()) = 8.5096913117408361391297879096205e-1388255822130839284
but
0.5*big"2.0"^(Base.MPFR.get_emin()-1) = 0.00000000000...
Now we know that
exp(x) = 2^(log(2,e)*x)
So we can write exp(z) = 2^(log(2,e)*z)
log(2,e)*z = -29196304319863382016
Base.MPFR.get_emin() = -4611686018427387903
So since the exponent (roughly -2.9e19) is less than the minimum allowed exponent (roughly -4.6e18), an underflow occurs.
That is why you get zero.
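The comparison of exponents can be reproduced without BigFloats. The sketch below uses ordinary Python floats, takes the `emin` value quoted above as a constant (it is copied from this answer, not queried from MPFR), and checks that the binary exponent needed for exp(z) lies below it. The variable names are assumptions for this example.

```python
import math

# Minimum exponent as reported above by Base.MPFR.get_emin().
EMIN = -4611686018427387903

# z = -sqrt(C * m) / h from the answer, rounded to double precision.
z = -2.0237336022083455711e19

# exp(z) = 2^(log2(e) * z), so this is the binary exponent the result would need.
needed_exponent = math.log2(math.e) * z

print("needed exponent:", needed_exponent)
print("minimum exponent:", EMIN)
print("underflows:", needed_exponent < EMIN)

# Plain doubles have a far smaller exponent range, so they underflow to zero as well.
print(math.exp(z))
```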
It may (or may not) be possible to recompile MPFR with Int128 exponents, but Julia hasn't.
Perhaps Julia should throw an underflow exception.
Feel encouraged to report that as an issue on the Julia bug tracker.

How to do hypot2(x,y) calculation when numbers can overflow

I'd like to do a hypot2 calculation on a 16-bit processor.
The standard formula is c = sqrt((a * a) + (b * b)). The problem is that it overflows for large inputs. E.g. with 200 and 250 as 16-bit values: 200 * 200 gives 40,000, which is higher than the max signed value of 32,767, so it overflows, as does 250 * 250. The overflowed products are then added and the result is useless; it might even signal an error condition because of a negative sqrt argument.
In my case, I'm dealing with 32-bit numbers, but 32-bit multiply on my processor is very fast, about 4 cycles. I'm using a dsPIC microcontroller. I'd rather not have to multiply with 64-bit numbers because that's wasting precious memory and undoubtedly will be slower. Additionally I only have sqrt for 32-bit numbers, so 64-bit numbers would require another function. So how can I compute a hypot when the values may be large?
Please note I can only really use integer math for this. Using anything like floating point math incurs a speed hit which I'd rather avoid. My processor has a fast integer/fixed point atan2 routine, about 130 cycles; could I use this to compute the hypotenuse length?
Depending on how much accuracy you need you may be able to avoid the squares and the square root operation. There is a section on this topic in Understanding Digital Signal Processing by Rick Lyons (section 10.2, "High-Speed Vector-Magnitude Approximation", starting at page 400 in my edition).
The approximation is essentially:
magnitude = alpha * max + beta * min
where max and min are the maximum and minimum absolute values of the real and imaginary components, and alpha and beta are two constants which are chosen to give a reasonable error distribution over the range of interest. These constants can be represented as fractions with power-of-2 divisors to keep the arithmetic simple and efficient. In the book he suggests alpha = 15/16, beta = 15/32, and you can then simplify the formula to:
magnitude = (15 / 16) * (max + min / 2)
which might be implemented as follows using integer operations:
magnitude = 15 * (max + min / 2) / 16
and of course we can use shifts for the divides:
magnitude = (15 * (max + (min >> 1))) >> 4
Error is +/- 5% over a quadrant.
More information on this technique here: http://www.dspguru.com/dsp/tricks/magnitude-estimator
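As a quick check of the shift-based form, the estimator can be compared with the exact magnitude over a range of inputs. This is a Python sketch with an invented function name; note that on a 16-bit target the intermediate product 15 * (max + min/2) needs more than 16 bits of headroom, which Python's unbounded integers hide.

```python
import math

def magnitude_estimate(a, b):
    """Approximate sqrt(a*a + b*b) as (15/16) * (max + min/2), integers only."""
    hi = max(abs(a), abs(b))
    lo = min(abs(a), abs(b))
    return (15 * (hi + (lo >> 1))) >> 4

if __name__ == "__main__":
    worst = 0.0
    for a in range(1000, 30000, 997):
        for b in range(1000, 30000, 991):
            exact = math.hypot(a, b)
            worst = max(worst, abs(magnitude_estimate(a, b) - exact) / exact)
    print("worst relative error on this grid:", worst)
```

For small inputs the truncation in the shifts dominates, so the relative error there can be much larger than for inputs in the hundreds or thousands.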
This is taken verbatim from this John D. Cook blog post, hence CW:
Here’s how to compute sqrt(x*x + y*y)
without risking overflow.
max = maximum(|x|, |y|)
min = minimum(|x|, |y|)
r = min / max
return max*sqrt(1 + r*r)
If John D. Cook comes along and posts this you should give him the accept :)
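A floating-point sketch of the scaled formula, in Python, shows the difference from the naive version. The function name is made up for this example, and the guard for a zero maximum is an addition that the quoted pseudocode does not include. With pure integer arithmetic the ratio min / max would truncate to 0 or 1, so an integer version would need r in a fixed-point format.

```python
import math

def scaled_hypot(x, y):
    """sqrt(x*x + y*y) computed as max * sqrt(1 + (min/max)^2)."""
    big = max(abs(x), abs(y))
    small = min(abs(x), abs(y))
    if big == 0:
        return 0.0          # avoid dividing by zero when both inputs are zero
    r = small / big         # 0 <= r <= 1, so r * r cannot overflow
    return big * math.sqrt(1 + r * r)

if __name__ == "__main__":
    print(scaled_hypot(3.0, 4.0))
    print(math.sqrt(1e200 * 1e200 + 1e200 * 1e200))  # naive form overflows
    print(scaled_hypot(1e200, 1e200))
```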
Since you essentially can't do any multiplications without overflow you're likely going to lose some precision.
To get the numbers into an acceptable range, pull out some factor x and use
c = x*sqrt( (a/x)*(a/x) + (b/x)*(b/x) )
If x is a common factor, you won't lose precision, but if it's not, you will lose precision.
Update:
Even better, given that you can do some mild work with 64-bit numbers, with just one 64-bit addition, you could do the rest of this problem in 32 bits with only a tiny loss of accuracy. To do this: do the two 32-bit multiplications to give you two 64-bit numbers, add these, and then bit shift as needed to get the sum back down to 32 bits before taking the square root. If you always shift by an even number of bits, then just multiply the final result by 2^(half the number of bit shifts), based on the rule above. The truncation of the sum should only cause a very small loss of accuracy, no more than about 1 part in 2^31, or 0.00000005% error.
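The shift-and-rescale idea might look like the following Python sketch. The function name is an assumption, `math.isqrt` stands in for the 32-bit integer square root routine, and Python's unbounded integers stand in for the 64-bit sum. Keep in mind that an integer square root of a 32-bit value has only about 16 significant bits, so the final result is limited by that rather than by the truncation of the sum.

```python
import math

def hypot_shifted(a, b):
    """Integer hypot: 64-bit sum of squares, shifted down to 32 bits before the sqrt."""
    total = a * a + b * b                     # up to 64 bits for 32-bit inputs
    shift = max(total.bit_length() - 32, 0)   # bits to drop so the sum fits in 32 bits
    shift += shift & 1                        # round up to an even shift
    root = math.isqrt(total >> shift)         # 32-bit integer square root
    return root << (shift // 2)               # sqrt(2^shift) = 2^(shift / 2)

if __name__ == "__main__":
    print(hypot_shifted(3, 4))
    print(hypot_shifted(3000000, 4000000))
```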
Aniko and John, it seems to me that you haven't addressed the OP's problem. If a and b are integers, then a*a + b*b is likely to overflow, because integer operations are being performed. The obvious solution is to convert a and b to floating-point values before computing a*a + b*b. But the OP hasn't let us know what language we should use, so we're a bit stuck.
"The standard formula is c = sqrt((a * a) + (b * b)). The problem with this is with large inputs it overflows."
The solution for overflows (aside from throwing an error) is to saturate your intermediate calculations.
Calculate C = a*a + b*b. If a and b are signed 16-bit numbers, you will never have an overflow. If they are unsigned numbers, you'll need to right-shift the inputs first to get the sum to fit in a 32-bit number.
If C > (MAX_RADIUS)^2, return MAX_RADIUS, where MAX_RADIUS is the maximum value you can tolerate before encountering an overflow.
Otherwise, use either sqrt() or the CORDIC algorithm, which avoids the cost of square roots in favor of loop iteration + adds + shifts, to retrieve the amplitude of the (a,b) vector.
If you can constrain a and b to be at most 7 bits, you won't get any overflow. You can use a count-leading-zeros instruction to figure out how many bits to throw away.
Assume a>=b.
int bits = 16 - count_leading_zeros(a);
if (bits > 7) {
a >>= bits - 7;
b >>= bits - 7;
}
c = sqrt(a*a + b*b);
if (bits > 7) {
c <<= bits - 7;
}
Lots of processors have this instruction nowadays, and if not, you can use other fast techniques.
Although this won't give you the exact answer, it will be very close (at most ~1% low).
Do you need full precision? If you don't, you can increase your range a little bit by discarding a few least significant bits and multiplying them in afterwards.
Can a and b be anything? How about a lookup table if you only have a few a and b that you need to calculate?
A simple solution to avoid overflow is to divide both a and b by a+b before squaring, and then multiply the square root by a+b. Or do the same with max(a,b).
You can do a little simple algebra to bring the results back into range.
sqrt((a * a) + (b * b))
= 2 * sqrt(((a * a) + (b * b)) / 4)
= 2 * sqrt((a * a) / 4 + (b * b) / 4)
= 2 * sqrt((a/2 * a/2) + (b/2 * b/2))

How are exponents calculated?

I'm trying to determine the asymptotic run-time of one of my algorithms, which uses exponents, but I'm not sure of how exponents are calculated programmatically.
I'm specifically looking for the pow() algorithm used for double-precision, floating point numbers.
I've had a chance to look at fdlibm's implementation. The comments describe the algorithm used:
* Method: Let x = 2^n * (1+f)
*   1. Compute and return log2(x) in two pieces:
*        log2(x) = w1 + w2,
*      where w1 has 53-24 = 29 bit trailing zeros.
*   2. Perform y*log2(x) = n+y' by simulating multi-precision
*      arithmetic, where |y'|<=0.5.
*   3. Return x**y = 2**n*exp(y'*log2)
followed by a listing of all the special cases handled (0, 1, inf, nan).
The most intense sections of the code, after all the special-case handling, involve the log2 and 2** calculations. And there are no loops in either of those. So, the complexity of floating-point primitives notwithstanding, it looks like an asymptotically constant-time algorithm.
Floating-point experts (of which I'm not one) are welcome to comment. :-)
Unless they've discovered a better way to do it, I believe that approximate values for trig, logarithmic and exponential functions (for exponential growth and decay, for example) are generally calculated using arithmetic rules and Taylor series expansions to produce an approximate result accurate to within the requested precision. (See any calculus book for details on power series, Taylor series, and Maclaurin series expansions of functions.) Please note that it's been a while since I did any of this, so I couldn't tell you, for example, exactly how to calculate the number of terms of the series you need to include to guarantee an error that is small enough to be negligible in a double-precision calculation.
For example, the Taylor/Maclaurin series expansion for e^x is this:
$$e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!} = 1 + x + \frac{x^2}{2 \cdot 1} + \frac{x^3}{3 \cdot 2 \cdot 1} + \frac{x^4}{4 \cdot 3 \cdot 2 \cdot 1} + \frac{x^5}{5 \cdot 4 \cdot 3 \cdot 2 \cdot 1} + \cdots$$
If you take all of the terms (k from 0 to infinity), this expansion is exact and complete (no error).
However, if you don't take all the terms going to infinity, but you stop after say 5 terms or 50 terms or whatever, you produce an approximate result that differs from the actual e^x function value by a remainder which is fairly easy to calculate.
The good news for exponentials is that it converges nicely and the terms of its polynomial expansion are fairly easy to code iteratively, so you might (repeat, MIGHT - remember, it's been a while) not even need to pre-calculate how many terms you need to guarantee your error is less than precision because you can test the size of the contribution at each iteration and stop when it becomes close enough to zero. In practice, I do not know if this strategy is viable or not - I'd have to try it. There are important details I have long since forgotten about. Stuff like: machine precision, machine error and rounding error, etc.
Also, please note that if you are not using e^x, but you are doing growth/decay with another base like 2^x or 10^x, the approximating polynomial function changes.
The usual approach, to raise a to the b, for an integer exponent, goes something like this:
result = 1
while b > 0
    if b is odd
        result *= a
        b -= 1
    b /= 2
    a = a * a
It is generally logarithmic in the size of the exponent. The algorithm is based on the invariant "a^b*result = a0^b0", where a0 and b0 are the initial values of a and b.
For negative or non-integer exponents, logarithms and approximations and numerical analysis are needed. The running time will depend on the algorithm used and what precision the library is tuned for.
Edit: Since there seems to be some interest, here's a version without the extra multiplication.
result = 1
while b > 0
    while b is even
        a = a * a
        b = b / 2
    result = result * a
    b = b - 1
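A runnable version of the first loop, as a Python sketch for non-negative integer exponents (the function and parameter names are assumptions for this example). The number of loop iterations equals the number of bits in the exponent, which is where the logarithmic running time comes from.

```python
def power(base, exponent):
    """Exponentiation by squaring for a non-negative integer exponent."""
    if exponent < 0:
        raise ValueError("this sketch only handles non-negative integer exponents")
    result = 1
    while exponent > 0:
        if exponent & 1:        # exponent is odd: fold one factor into the result
            result *= base
        exponent >>= 1          # halve the exponent (the odd bit is dropped)
        base *= base            # square the base
    return result

if __name__ == "__main__":
    print(power(3, 13))
    print(power(2.0, 10))
```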
You can use exp(n*ln(x)) for calculating x^n, provided x is positive. Both x and n can be double-precision floating-point numbers. The natural logarithm and the exponential function can be calculated using Taylor series. Here you can find formulas: http://en.wikipedia.org/wiki/Taylor_series
If I were writing a pow function targeting Intel, I would return exp2(log2(x) * y). Intel's microcode for log2 is surely faster than anything I'd be able to code, even if I could remember my first year calculus and grad school numerical analysis.
e^x = (1 + fraction) * (2^exponent), 1 <= 1 + fraction < 2
x * log2(e) = log2(1 + fraction) + exponent, 0 <= log2(1 + fraction) < 1
exponent = floor(x * log2(e))
1 + fraction = 2^(x * log2(e) - exponent) = e^((x * log2(e) - exponent) * ln2) = e^(x - exponent * ln2), 0 <= x - exponent * ln2 < ln2
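The range reduction in these four lines can be tried out directly. The Python sketch below picks the exponent with `floor(x * log2(e))`, evaluates e^r for the reduced argument r = x - exponent * ln2 with a short Taylor series, and reassembles the result with `math.ldexp`. The function name and the fixed count of 20 terms are assumptions for this example; real libraries use more careful reductions and polynomial approximations.

```python
import math

def exp_range_reduced(x, terms=20):
    """e^x = 2^exponent * e^r with r = x - exponent * ln2 in roughly [0, ln2)."""
    exponent = math.floor(x * math.log2(math.e))
    r = x - exponent * math.log(2)
    # Taylor series for e^r; r is small, so a few terms suffice.
    term = 1.0
    total = 1.0
    for k in range(1, terms):
        term *= r / k
        total += term
    return math.ldexp(total, exponent)   # total * 2^exponent

if __name__ == "__main__":
    for value in (-10.0, 0.0, 1.0, 20.5):
        print(value, exp_range_reduced(value), math.exp(value))
```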
