Approximation of `atan` function in fixed point - math

I must do some calculations that need trigonometric functions, especially atan. The code will run on an Atmega328p, and for efficiency's sake I can't use floats: I'm using fixed-point numbers. Thus, I can't use the standard atan function.
I wish to have a function which takes a value in fixed-point format s16_10 (signed, 16 bits wide, binary point at position 10) and returns an s16_6 value. The input will be between 0 and 1 (so 0 and 2^10), so the output, in degrees, will be between -45 and 45 (so -45 * 2^6 and 45 * 2^6).
Let's say that Y is the fixed-point s16_6 representation of y, the real angle of the arc, x is such that atan(x) = y, and X is the s16_10 representation of x. I started by approximating the atan function, from (0,1) to (-45,45), with a 4th-degree polynomial, and found that we can use:
y ~= 8.11 * x^4 - 19.67 * x^3 - 0.93 * x^2 + 57.52 * x + 0.0096
Which leads to:
Y ~= (8.11 * X^4)/2^34 - (19.62* X^3)/2^24 - (0.93 * X^2)/2^14 + (57.52*X)/2^4 + 0.0069 * 2^6
And here I am stuck... On the one hand, computing the X^4 term will give 0 on the first fifth of the definition interval, and on the other hand the divisions X^n / 2^(10n-6) for n in {3, 2, 1} will also often come out to zero... How can I deal with this?

Some of the terms being truncated to zero is not necessarily a disaster; this doesn't substantially worsen your approximation. I simulated your fixed precision setup in Matlab by rounding each term of the polynomial to the nearest integer:
q4 = @(X) round((8.11 * X.^4)/2^34);
q3 = @(X) -round((19.62 * X.^3)/2^24);
q2 = @(X) -round((0.93 * X.^2)/2^14);
q1 = @(X) round((57.52 * X)/2^4);
q0 = @(X) round(0.0069 * 2^6);
It's true that on the first fifth of the interval [0, 2^10] the terms q4, q3, q2 look rather choppy, and q4 is essentially absent.
But these effects of rounding are of about the same size as the theoretical error of approximation of atan by your polynomial. Here is the plot where red is the difference (polynomial-atan) calculated without rounding to integers, and green is the difference (q4+q3+q2+q1+q0-atan):
As you can see, rounding does not make approximation much worse; in most cases it actually reduces the error by a happy accident.
I do notice that your polynomial systematically overestimates atan. When I fit a 4th degree polynomial to atan on [0,1] with Matlab, the coefficients are slightly different:
8.0927 -19.6568 -0.9257 57.5106 -0.0083
Even rounding these to two decimal places, as you did, I get a better approximation:
(8.09 * X^4)/2^34 - (19.66 * X^3)/2^24 - (0.93 * X^2)/2^14 + (57.52 * X)/2^4 - 0.0083 * 2^6
This time the truncation to integers does worsen things. But it is to be expected that the outcome of a calculation where several intermediate results are rounded to integers will be off by ±2 or so. The theoretical accuracy of ±0.5, shown by this polynomial, cannot be realized with the given arithmetical tools.
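For completeness, here is a minimal C sketch (not from the original thread) of one way to evaluate such a polynomial on the AVR while keeping every intermediate in 32 bits: each power of x is renormalized back to s16_10, and the polynomial coefficients are pre-scaled by 2^6, the output format. The function name, the rounded coefficients and the rounding offsets are illustrative choices, not the asker's exact term-by-term scheme.
#include <stdint.h>

/* atan in degrees for x in [0, 1]:
 * input  X : s16_10 (x = X / 2^10, so X in [0, 1024])
 * output   : s16_6  (y = Y / 2^6, degrees)
 * Coefficients are round(64 * {8.11, -19.67, -0.93, 57.52}). */
int16_t atan_s16_10_to_s16_6(int16_t X)
{
    int32_t x  = X;                       /* s16_10 */
    int32_t x2 = (x  * x  + 512) >> 10;   /* x^2, renormalized to s16_10 */
    int32_t x3 = (x2 * x  + 512) >> 10;   /* x^3, s16_10 */
    int32_t x4 = (x2 * x2 + 512) >> 10;   /* x^4, s16_10 */

    /* 2^6-scaled polynomial; the powers still carry a factor of 2^10 */
    int32_t acc = 519 * x4 - 1259 * x3 - 60 * x2 + 3681 * x;

    /* divide out the remaining 2^10 with rounding; +1 ~ round(0.0096 * 2^6) */
    return (int16_t)(((acc + 512) >> 10) + 1);
}
As a quick check, X = 1024 gives 2882, i.e. about 45.03 degrees in s16_6, close to the polynomial's value of 45.04 at x = 1.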

Related

Is there any way to bound the region searched by NLsolve in Julia?

I'm trying to find one of the roots of a nonlinear (roughly quartic) equation.
The equation always has four roots, a pair of them close to zero, a large positive, and a large negative root. I'd like to identify either of the near zero roots, but nlsolve, even with an initial guess very close to these roots, seems to always converge on the large positive or negative root.
A plot of the function essentially looks like a constant negative value, with a (very narrow) even-ordered pole near zero, and gradually rising to cross zero at the large positive and negative roots.
Is there any way I can limit the region searched by nlsolve, or do something to make it more sensitive to the presence of this pole in my function?
EDIT:
Here's some example code reproducing the problem:
using NLsolve
function f!(F,x)
    x = x[1]
    F[1] = -15000 + x^4 / (x+1e-5)^2
end
# nlsolve will find the root at -122
nlsolve(f!,[0.0])
As output, I get:
Results of Nonlinear Solver Algorithm
* Algorithm: Trust-region with dogleg and autoscaling
* Starting Point: [0.0]
* Zero: [-122.47447713915808]
* Inf-norm of residuals: 0.000000
* Iterations: 15
* Convergence: true
* |x - x'| < 0.0e+00: false
* |f(x)| < 1.0e-08: true
* Function Calls (f): 16
* Jacobian Calls (df/dx): 6
We can find the exact roots in this case by transforming the objective function into a polynomial:
using PolynomialRoots
roots([-1.5e-6,-0.3,-15000,0,1])
produces
4-element Array{Complex{Float64},1}:
122.47449713915809 - 0.0im
-122.47447713915808 + 0.0im
-1.0000000813048448e-5 + 0.0im
-9.999999186951818e-6 + 0.0im
I would love a way to identify the pair of roots around the pole at x = -1e-5 without knowing the exact form of the objective function.
EDIT2:
Trying out Roots.jl :
using Roots
f(x) = -15000 + x^4 / (x+1e-5)^2
find_zero(f,0.0) # finds +122... root
find_zero(f,(-1e-4,0.0)) # error, not a bracketing interval
find_zeros(f,-1e-4,0.0) # finds 0-element Array{Float64,1}
find_zeros(f,-1e-4,0.0,no_pts=6) # finds root slightly less than -1e-5
find_zeros(f,-1e-4,0.0,no_pts=10) # finds 0-element Array{Float64,1}, sensitive to value of no_pts
I can get find_zeros to work, but it's very sensitive to the no_pts argument and to the exact values of the endpoints I pick. Doing a loop over no_pts and taking the first non-empty result might work, but something that converges more deterministically would be preferable.
EDIT3 :
Here's an attempt at applying the tanh transformation suggested by Bogumił:
using NLsolve
function f_tanh!(F,x)
    x = x[1]
    x = -1e-4 * (tanh(x)+1) / 2
    F[1] = -15000 + x^4 / (x+1e-5)^2
end
nlsolve(f_tanh!,[100.0]) # doesn't converge
nlsolve(f_tanh!,[1e5]) # doesn't converge
using Roots
function f_tanh(x)
    x = -1e-4 * (tanh(x)+1) / 2
    return -15000 + x^4 / (x+1e-5)^2
end
find_zeros(f_tanh,-1e10,1e10) # 0-element Array
find_zeros(f_tanh,-1e3,1e3,no_pts=100) # 0-element Array
find_zero(f_tanh,0.0) # convergence failed
find_zero(f_tanh,0.0,max_evals=1_000_000,maxfnevals=1_000_000) # convergence failed
EDIT4 : This combination of techniques identifies at least one root somewhere around 95% of the time, which is good enough for me.
using Peaks
using Primes
using Roots
# randomize pole location
a = 1e-4*rand()
f(x) = -15000 + x^4 / (x+a)^2
# do an initial sample to find the pole location
l = 1000
minval = -1e-4
maxval = 0
m = []
sample_r = []
while l < 1e6
    sample_r = range(minval,maxval,length=l)
    rough_sample = f.(sample_r)
    m = maxima(rough_sample)
    if length(m) > 0
        break
    else
        l *= 10
    end
end
guess = sample_r[m[1]]
# functions to compress the range around the estimated pole
cube(x) = (x-guess)^3 + guess
uncube(x) = cbrt(x-guess) + guess
f_cube(x) = f(cube(x))
shift = l ÷ 1000
low = sample_r[m[1]-shift]
high = sample_r[m[1]+shift]
# search only over prime no_pts, so no samplings divide into each other
# possibly not necessary?
for i in primes(500)
    z = find_zeros(f_cube,uncube(low),uncube(high),no_pts=i)
    if length(z)>0
        println(i)
        println(cube.(z))
        break
    end
end
More specific comments could be given if you provided more information about your problem.
However in general:
It seems that your problem is univariate, in which case you can use Roots.jl, where find_zero and find_zeros give the interface you ask for (i.e. they allow you to specify the search region).
If a problem is multivariate, you have several options for handling it in the problem specification for nlsolve (which by default does not allow you to specify a bounding box, AFAICT). The simplest is to use a variable transformation, e.g. you can apply an a_i * tanh(x_i) + b_i transformation, selecting a_i and b_i for each variable so that it is bounded to the desired interval.
The first problem in your definition is that, the way you define f, it never crosses 0 near the two roots you are looking for, because Float64 does not have enough precision when you write 1e-5. You need to use greater precision in the computations:
julia> using Roots
julia> f(x) = -15000 + x^4 / (x+1/big(10.0^5))^2
f (generic function with 1 method)
julia> find_zeros(f,big(-2*10^-5), big(-8*10^-6), no_pts=100)
2-element Array{BigFloat,1}:
-1.000000081649671426108658262468117284940444265467160592853348997523986352593615e-05
-9.999999183503552405580084054429938261707450678661727461293670518591720605751116e-06
and set no_pts to be sufficiently large to find intervals bracketing the roots.

Numerical blowup problem in a fractional function in R

Ciao,
I am working with this function in R:
betaFun = function(x){
    if(x == 0){
        return(0.5)
    }
    return( ( 1+exp(x)*(x-1) )/( x*(exp(x)-1) ) )
}
The function is smooth and well defined for every x (at least from a theoretical point of view), and at 0 the limit approaches 0.5 (you can convince yourself of this using L'Hôpital's rule).
I have the following problem: due to the limit, R computes the values incorrectly and I get a blowup at 0.
Here I report the numerical issue:
x = c(1e-4, 1e-6, 1e-8, 1e-10, 1e-12, 1e-13)
sapply(x, betaFun)
[1] 5.000083e-01 5.000442e-01 2.220446e+00 0.000000e+00 0.000000e+00 1.111111e+10
As you can see, the evaluation is pretty weird, in particular the last one.
I thought that I could solve this problem by defining the missing value at 0 (as you can see from the code), but that does not fix it.
Do you know how can I solve this numerical blow up problem?
I need high precision for this function since I have to invert it around 0. I will do this using the nleqslv function from the nleqslv library. Of course the inversion will return wrong solutions if the function has numerical problems.
I think that you are losing accuracy in the evaluation of exp(x)-1 for x close to 0. In C if I evaluate your function as
#include <math.h>

double f2(double x)
{
    return (x == 0) ? 0.5
                    : (x*exp(x) - expm1(x)) / (x*expm1(x));
}
The problem goes away. Here expm1 is a math library function that computes exp(x) - 1, without losing accuracy for small x. I'm afraid I don't know if R has this, but you'd hope it would.
I think, though, that you would do better to test whether |x| is sufficiently small, rather than comparing against 0.0. The point is that for small enough x, both x*exp(x) and expm1(x) will be, as doubles, equal to x, so their difference will be 0. To keep maximum accuracy you may need to add a linear term to the 0.5 you return. I've not worked out precisely what 'sufficiently small' should be, but it's somewhere around 1e-16, I think.
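A minimal sketch of that suggestion (the 1e-4 cutoff and the function name are illustrative guesses, not a worked-out bound): inside a small window around 0, return the leading terms of the expansion at 0, which are 0.5 + x/12, and use the expm1 form everywhere else.
#include <math.h>

/* Same function, but with a small window around 0 instead of an exact test.
 * The 1e-4 threshold is only a rough choice. */
double f3(double x)
{
    if (fabs(x) < 1e-4)
        return 0.5 + x / 12.0;   /* first terms of the expansion around 0 */
    return (x*exp(x) - expm1(x)) / (x*expm1(x));
}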
Your problem is that you take the quotient of two numbers with very small absolute values. Such numbers are only represented to floating point precision.
You don't specify why you need these function values for x values close to zero. One easy option would be coercion to high precision numbers:
library(Rmpfr)
betaFun = function(x){
x <- mpfr(as.character(x), precBits = 256)
#if x is calculated, you should switch to high precision numbers for its calculation
#this step could be removed then
#do calculation with high precision,
#then coerce to normal precision (assuming that is necessary)
ifelse(x == 0, 0.5, as((1 + exp(x) * (x - 1)) / (x * (exp(x) - 1)), "numeric"))
}
x = c(1e-4, 1e-6, 1e-8, 1e-10, 1e-12, 1e-13, 0)
betaFun(x)
#[1] 0.5000083 0.5000001 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000
As you noticed, you are encountering the problem near zero, where both the numerator and the denominator have a root. And, as the OP mentioned, L'Hôpital shows that the limit there is f(x) = 1/2.
From a numerical point of view, things go slightly differently. Floating-point numbers always carry an error, as not every real number can be represented as a floating-point number. For example:
exp(1E-3) -1 = 0.0010005001667083845973138522822409868 # numeric
exp(1/1000)-1 = 0.001000500166708341668055753993058311563076200580... # true
(the two values agree only to about 14 significant digits)
The problem in evaluating numerically exp(1E-3)-1 already starts at the beginning, i.e. 1E-3
1E-3 = x = 0.0010000000000000000208166817117216851
exp(x) = 1.0010005001667083845973138522822409868
exp(x) - 1 = 0.0010005001667083845973138522822409868
1E-3 cannot be represented exactly as a floating point; the stored value is accurate to about 17 digits.
exp(x) is the closest IEEE floating-point value to the exponential of that stored x, so it already carries the representation error above and is itself accurate to only about 17 digits.
By subtracting 1, we get a bunch of zeros at the beginning, and now our result is only accurate to about 14 digits.
So, knowing that we cannot represent everything exactly as a floating point, you should realize that near zero things become awkward: both the numerator and the denominator become less and less accurate, especially near 1E-13.
numerator_numeric(1E-13) = 1.1102230246251565E-16
numerator_true(1E-13) = 5.00000000000033333333333...E-27
Generally, what you do near such a point is use a Taylor expansion around zero, and the normal function everywhere else:
betaFun = function(x){
    if(-1E-1 < x && x < 1E-1){
        return(0.5 + x/12. - x^3/720. + x^5/30240.)
    }
    return( ( 1+exp(x)*(x-1) )/( x*(exp(x)-1) ) )
}
The above expansion is accurate up to 13 digits for x in the small region.

Numerical strategy to calculate a fraction sometimes very close to zero

In the R function chisq.test() there is the following line:
PVAL <- (1 + sum(ss >= almost.1 * STATISTIC))/(B + 1) with
almost.1 <- 1 - 64 * .Machine$double.eps
This is clearly a computational adjustment to avoid getting round outputs for PVAL.
It doesn't really matter what is calculated, but the idea is that what we really, really want is sum(ss >= STATISTIC)/ B, where ss is the result of a bunch of simulations, STATISTIC is a fixed value to compare to, and B is the number of simulations. We are calculating the proportion of cases in which ss is greater than or equal to STATISTIC.
What is adding 1 to both the numerator and the denominator supposed to accomplish?
and
Why do we need to multiply by 1 - 64 * .Machine$double.eps?

Calculating log(sum of exp(terms) ) when "terms" are very small

I would like to compute log( exp(A1) + exp(A2) ).
The formula below
log(exp(A1) + exp(A2) ) = log[exp(A1)(1 + exp(A2)/exp(A1))] = A1 + log(1+exp(A2-A1))
is useful when A1 and A2 are large and numerically exp(A1)=Inf (or exp(A2)=Inf).
(this formula is discussed in this thread ->
How to calculate log(sum of terms) from its component log-terms). The formula also holds with the roles of A1 and A2 interchanged.
My concern with this formula is when A1 and A2 are very small. For example, when A1 and A2 are:
A1 <- -40000
A2 <- -45000
then the direct calculation of log(exp(A1) + exp(A2) ) is:
log(exp(A1) + exp(A2))
[1] -Inf
Using the formula above gives:
A1 + log(1 + exp(A2-A1))
[1] -40000
which is the value of A1.
Using the formula above with the roles of A1 and A2 flipped gives:
A2 + log(1 + exp(A1-A2))
[1] Inf
Which of the three values is closest to the true value of log(exp(A1) + exp(A2))? Is there a robust way to compute log(exp(A1) + exp(A2)) that can be used both when A1, A2 are small and when A1, A2 are large?
Thank you in advance
You should use something with more accuracy to do the direct calculation.
It’s not “useful when [they’re] large”. It’s useful when the difference is very negative.
When x is near 0, then log(1+x) is approximately x. So if A1>A2, we can take your first formula:
log(exp(A1) + exp(A2)) = A1 + log(1+exp(A2-A1))
and approximate it by A1 + exp(A2-A1) (and the approximation will get better as A2-A1 is more negative). Since A2-A1=-5000, this is more than negative enough to make the approximation sufficient.
Regardless, if y is too far from zero (either way) exp(y) will (over|under)flow a double and result in 0 or infinity (this is a double, right? what language are you using?). This explains your answers. But since exp(A2-A1)=exp(-5000) is close to zero, your answer is approximately -40000+exp(-5000), which is indistinguishable from -40000, so that one is correct.
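The points above amount to the standard log-sum-exp trick: factor out the larger argument so the remaining exponential is at most 1, and use log1p for the small correction. A minimal C sketch of that idea (the function name is made up for illustration):
#include <math.h>

/* Robust log(exp(a) + exp(b)) */
double logsumexp2(double a, double b)
{
    double m = fmax(a, b);
    if (isinf(m) && m < 0)
        return -INFINITY;                 /* both inputs are -inf */
    return m + log1p(exp(-fabs(a - b)));  /* exp argument is <= 0, so no overflow */
}
With a = -40000 and b = -45000 this returns -40000, and with huge positive arguments it never overflows, since the exponential's argument is never positive.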
With such huge exponent differences, the safest thing you can do without arbitrary precision is:
choose the biggest exponent, let it be Am = max(A1,A2)
so: log(exp(A1)+exp(A2)) -> log(exp(Am)) = Am
That is the closest you can get in such a case,
so in your example the result is -40000+delta,
where delta is something very small.
If you want to use the second formula, then it all comes down to computing log(1+exp(A)):
if A is positive, then the result is far from the real thing
if A is negative, then it will truncate to log(1)=0, so you get the same result as above
[Notes]
your exponent difference is on the order of base^5000
single precision 32-bit float can store numbers up to roughly (+/-)2^(+/-128)
double precision 64-bit float can store numbers up to roughly (+/-)2^(+/-1024)
so when your base is 10 or e, this is nowhere near enough for what you need
if you have quadruple precision, that should be enough, but once you start changing the exponent difference again you will quickly get to the same point as now
[PS] if you need more precision without arbitrary precision,
you can try to create your own number class
that internally stores numbers as number = a^b,
where a, b are floats,
but for that you would need to code all the basic functions:
*, / are easy;
+, - are a nightmare, but there may be approaches/algorithms out there even for this

Fast, inaccurate sin function without lookup

For an ocean shader, I need a fast function that computes a very approximate value for sin(x). The only requirements are that it is periodic, and roughly resembles a sine wave.
The Taylor series of sin is too slow, since I'd need to compute up to the 9th power of x just to get a full period.
Any suggestions?
EDIT: Sorry I didn't mention, I can't use a lookup table since this is on the vertex shader. A lookup table would involve a texture sample, which on the vertex shader is slower than the built in sin function.
It doesn't have to be in any way accurate, it just has to look nice.
Use a Chebyshev approximation for as many terms as you need. This is particularly easy if your input angles are constrained to be well behaved (-π .. +π or 0 .. 2π) so you do not have to reduce the argument to a sensible value first. You might use 2 or 3 terms instead of 9.
You can make a look-up table with sin values for selected inputs and use linear interpolation between those values.
A rational algebraic function approximation to sin(x), valid from zero to π/2 is:
f = (C1 * x) / (C2 * x^2 + 1.)
with the constants:
C1 = 1.043406062
C2 = 0.2508691922
These constants were found by least-squares curve fitting. (Using subroutine DHFTI, by Lawson & Hanson).
If the input is outside [0, 2π], you'll need to take x mod 2 π.
To handle negative numbers, you'll need to write something like:
t = MOD(t, twopi)
IF (t < 0.) t = t + twopi
Then, to extend the approximation from [0, π/2] to the full range [0, 2π], fold the reduced angle into the first quadrant with something like:
IF (t < pi) THEN
  IF (t < pi/2) THEN
    x = t
  ELSE
    x = pi - t
  END IF
ELSE
  IF (t < 1.5 * pi) THEN
    x = t - pi
  ELSE
    x = twopi - t
  END IF
END IF
Then calculate:
f = (C1 * x) / (C2 * x*x + 1.0)
IF (t > pi) f = -f
The results should be within about 5% of the real sine.
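Collected into a single C function (a sketch: the function name and the float types are my own choices; the constants and the range reduction are taken from the answer above):
#include <math.h>

float fast_sin(float t)
{
    const float pi    = 3.14159265f;
    const float twopi = 6.28318531f;
    const float c1    = 1.043406062f;
    const float c2    = 0.2508691922f;
    float x, f;

    /* reduce to [0, 2*pi) */
    t = fmodf(t, twopi);
    if (t < 0.0f) t += twopi;

    /* fold into the first quadrant */
    if (t < pi)
        x = (t < 0.5f * pi) ? t : pi - t;
    else
        x = (t < 1.5f * pi) ? t - pi : twopi - t;

    f = (c1 * x) / (c2 * x * x + 1.0f);
    return (t > pi) ? -f : f;
}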
Well, you don't say how accurate you need it to be. The sine can be approximated by straight lines of slopes 2/pi and -2/pi on intervals [0, pi/2], [pi/2, 3*pi/2], [3*pi/2, 2*pi]. This approximation can be had for the cost of a multiplication and an addition after reducing the angle mod 2*pi.
Using a lookup table is probably the best way to control the tradeoff between speed and accuracy.
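For reference, a sketch of the straight-line (triangle-wave) approximation described above, after reducing the angle mod 2*pi (the function name is illustrative):
#include <math.h>

float triangle_sin(float t)
{
    const float pi    = 3.14159265f;
    const float twopi = 6.28318531f;

    t = fmodf(t, twopi);
    if (t < 0.0f) t += twopi;

    if (t < 0.5f * pi)      return (2.0f / pi) * t;          /* rising line  */
    else if (t < 1.5f * pi) return 2.0f - (2.0f / pi) * t;   /* falling line */
    else                    return (2.0f / pi) * t - 4.0f;   /* rising again */
}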
