exp function in Julia evaluating to 0

I want to calculate and plot the probability density of a wave function in Julia. I wrote a small snippet of Julia code for evaluating the following function:
The Julia (incomplete) code is:
set_bigfloat_precision(100)
A = 10
C = 5
m = BigFloat(9.10938356e-31)
ℏ = BigFloat(1.054571800e-34)
t = exp(-(sqrt(C * m) / ℏ))
The last line where I evaluate t gives 0.000000000000.... I tried to set the precision of the BigFloat as well. No luck! What am I doing wrong? Help appreciated.

In the comments, Chris Rackauckas pointed out that you entered the formula wrong. I figured the question was interesting enough to answer anyway.
Let's break it down so we can see what we are raising e to:
A = 10
C = 5
m = BigFloat(9.10938356e-31)
h = BigFloat(1.054571800e-34)
z = -sqrt(C * m)/h
t = exp(z)
So
z = -2.0237336022083455711032042949257e+19
so very roughly z = -2e19,
and roughly t = exp(-2e19), i.e. t = 1/(e^(2*10^19)).
That is a very small number.
Consider that
exp(big"-1e+10") = 9.278...e-4342944820
and
exp(big"-1e+18") = 2.233...e-434294481903251828
and yes, Julia says:
exp(big"-2e+19") = 0.0000
exp(big"-2e+19") is a very small number.
That puts us in context I hope. Very small number.
So julia depends on MPFR for BigFloats
You can try MPFR online. At precision 8192, exp(-2e10)=0
So same result.
Now, it is not the precision that we care about, but rather the range of the exponent.
MPFR uses something like IEEE-style floats, where the precision is the length of the mantissa, and then you have an exponent: the value is mantissa * 2^exponent.
So there is a limit on the range of the exponent.
See: MPFR docs:
Function: mpfr_exp_t mpfr_get_emin (void)
Function: mpfr_exp_t mpfr_get_emax (void)
Return the (current) smallest and largest exponents allowed for a floating-point variable. The smallest positive value of a floating-point variable is one half times 2 raised to the smallest exponent and the largest value has the form (1 - epsilon) times 2 raised to the largest exponent, where epsilon depends on the precision of the considered variable.
Julia sets these to the widest range that a fairly default MPFR build will allow. I've been digging around the MPFR source trying to find where this is set, but can't find it. I believe it is related to the maximum value an Int64 can hold.
Base.MPFR.get_emin() = -4611686018427387903 = (typemin(Int64) >> 1) + 1
You can adjust this but only up.
So anyway
0.5*big"2.0"^(Base.MPFR.get_emin()) = 8.5096913117408361391297879096205e-1388255822130839284
but
0.5*big"2.0"^(Base.MPFR.get_emin()-1) = 0.00000000000...
Now we know that
exp(x) = 2^(log(2,e)*x)
So we can write exp(z) = 2^(log(2,e)*z)
log(2,e)*z = -29196304319863382016
Base.MPFR.get_emin() = -4611686018427387903
So the exponent (roughly -2.9e19) is less than the minimum allowed exponent (roughly -4.6e18), and an underflow occurs.
That is why you get zero.
It may (or may not) be possible to recompile MPFR with Int128 exponents, but Julia hasn't.
Perhaps Julia should throw an underflow exception.
Feel encouraged to report that as an issue on the Julia bug tracker.
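If only the order of magnitude of such a value is needed, one way around the underflow is to work with its logarithm instead of the value itself. The following is a minimal Python sketch of that idea and is not part of the original answer: exp_as_sci is a hypothetical helper name, and the standard-library decimal module is used only to carry enough digits of the base-10 logarithm.

```python
from decimal import Decimal, ROUND_FLOOR, getcontext

getcontext().prec = 60  # enough digits to keep a fractional part of log10(t)

def exp_as_sci(z):
    """Return (mantissa, exponent) such that exp(z) = mantissa * 10**exponent.

    Hypothetical helper: exp(z) itself is never formed, only its logarithm.
    """
    log10_t = z / Decimal(10).ln()  # log10(exp(z)) = z / ln(10)
    exponent = int(log10_t.to_integral_value(rounding=ROUND_FLOOR))
    mantissa = Decimal(10) ** (log10_t - exponent)  # lies in [1, 10)
    return mantissa, exponent

# z as computed in the answer above
z = Decimal("-2.0237336022083455711032042949257e+19")
mantissa, exponent = exp_as_sci(z)
print(f"exp(z) is about {mantissa:.4f}e{exponent}")
```

The decimal exponent comes out near -8.8e18, which is far below what the BigFloat exponent range quoted above can hold.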

Related

Computing harmonic series for very large N (arbitrary precision problems)

This is a followup question to a previous one I made.
I'm trying to compute the Harmonic series to very large terms, however when comparing to log(n)+γ I'm not getting the expected error.
I suspect the main problem is with the BigFloat julia type.
harmonic_bf = function(n::Int64)
x=BigFloat(0)
for i in n:-1:1
x += BigFloat(1/i)
end
x
end
For example it is well known that the lower bound for the formula: H_n - log(n) - γ is 1/2/(n+1).
However, this holds for n=10^7 then fails for n=10^8.
n=10^8
γ = big"0.57721566490153286060651209008240243104215933593992"
lower_bound(n) = 1/2/(n+1)
>>> harmonic_bf(n)-log(n)-γ > lower_bound(BigFloat(n))
false
It's driving me crazy, I can't seem to understand what is missing... BigFloat supposedly should get arithmetic precision problems out of the way, however that seems not to be the case.
Note: I tried with BigFloat with unset precision and with 256 bits of precision.
You have to make sure that you use BigFloat everywhere. First in your function (notice that BigFloat(1/i) is not the same as 1/BigFloat(i): the former divides in Float64 and only then widens the already rounded result):
function harmonic_bf(n::Int64)
x=BigFloat(0)
for i in n:-1:1
x += 1/BigFloat(i)
end
x
end
and then in the test (notice BigFloat under log):
julia> harmonic_bf(n)-log(BigFloat(n))-γ > lower_bound(BigFloat(n))
true
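The same pitfall can be reproduced outside Julia. Below is a small Python sketch using the standard-library decimal module as a stand-in for BigFloat (an assumption made purely for illustration); it contrasts widening an already rounded double with doing the division at high precision.

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # 50 significant digits, standing in for a BigFloat

# Division done in double precision first, then widened: the rounding
# error of the 53-bit result is carried along unchanged.
rounded_first = Decimal(1 / 3)

# Division carried out at 50 digits from the start.
wide_division = 1 / Decimal(3)

difference = abs(rounded_first - wide_division)
print(rounded_first)
print(wide_division)
print(difference)  # on the order of 1e-17, the double rounding error
```

Summed over 10^8 terms, per-term errors of this size are enough to swamp a bound as small as 1/2/(n+1).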

Seeding square roots on FPGA in VHDL for Fixed Point

I'm attempting to create a fixed-point square root function for a Xilinx FPGA (hence real types are out, and David Bishop's ieee_proposed library is also unsupported for XST synthesis).
I've settled on a Newton-Raphson method to calculate the reciprocal square root (as it involves fewer divisions).
One of the remaining dilemmas I have is how to generate the initial seed. I looked at the Fast Inverse Square Root, but it only appears to work for floating point arithmetic.
My best thoughts at the moment are, to take the length of the input value (ie. find the index of the most significant, non-zero bit), halve it crudely and use that as the power for a seed.
I wrote a short test script to quickly check the accuracy (it's in Matlab, but that's just so I could plot a graph...)
x = 1:2^24;
gen_result = zeros(1,length(x));
seed_vals = zeros(1,length(x));
for i = 1:length(x)
result = 2^-ceil(log2(x(i))/2); %effectively creates seed value from top bit index
seed_vals(i) = 1/result; %Store seed value
for j = 1:6
result = result*(1.5-0.5*x(i)*result^2); %reciprocal root
end
gen_result(i) = 1/result; %single division at the end
end
Unsurprisingly, the seed becomes wildly inaccurate each time the input moves up to the next seed step, and the inaccuracy grows with the magnitude of the input. As a graph this can be seen as:
The red line is the value of the seed and, as can be seen, it increases in powers of 2.
My question is very simple: are there any other simple methods I could use to generate a seed value for fixed-point square roots in VHDL, ideally ones which don't cause ever-increasing amounts of inaccuracy (and hence require more iterations each time the input increases in size)?
Any other incidental advice on how to approach finding fixed-point square roots in VHDL would be gratefully received!
I realize this is an old question but I did end up here and this was kind of useful so I want to add my bit.
Assuming your Xilinx chip has an embedded multiplier, you could consider this approach to help get a better starting seed. The basic premise is to convert the input integer to fixed point with all fraction bits, and then use the embedded multiplier to scale half of your initial seed value by 0.X (which in hindsight is probably what people mean when they say "normalize to the region [0.5..1)", now that I think about it). It's basically piecewise linear interpolation of your existing seed method. The steps below should translate relatively easily to RTL, as they're just bit-shifts, adds, and one unsigned multiply.
1) Begin with your existing seed value (e.g. for x=9e6, you would generate s=4096 as the seed for your first guess with your "crude halving" method)
2) Right-shift the existing seed value by 1 to get the previous seed value (s_half = s >> 1 = 2048)
3) Left-shift the input until the most significant bit is a 1. In the event you are sqrting 32-bit ints, x_scale would then be 2304000000 = 0x89544000
4) Slice the upper e.g. 18 bits off of x_scale and multiply by an 18-bit version of s_half (I suggest 18 because I happen to know some Xilinx chips have embedded 18x18 multipliers). For this case, the result, x_scale(31 downto 14) = 140625 = 0x22551.
At least, that's what the multiplier thinks - we're going to use fixed point so that it's actually 0b0.100010010101010001 = 0.53644 instead of 140625.
The result of this multiplication will be s_scale = s_half * x_scale(31 downto 14) = 2048 * 140625 = 288000000, but this output is in 18.18 format (18 integer bits, 18 fraction bits). Take the upper 18 bits, and you get s_scale(35 downto 18) = 1098
5) Add the upper 18 bits of s_scale to s_half to get your improved seed, in this case s_improved = 1098+2048 = 3146
Now you can do a few iterations of Newton-Raphson with this seed. For x=9e6, your crude halving approach would give an initial seed of 4096, the fixed-point scale outlined above gives you 3146, and the actual sqrt(9e6) is 3000. This value is half-way between your seed steps, and my napkin math suggests it saved about 3 iterations of Newton-Raphson
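The five steps can be checked with a short integer-only Python sketch before committing them to RTL. The function name improved_seed and the 32-bit / 18-bit widths are assumptions taken from the worked example above; the crude seed is computed from the bit length, which matches the "crude halving" method for inputs that are not exact powers of two.

```python
def improved_seed(x, width=32, mult_bits=18):
    """Piecewise-linear seed for sqrt(x) using shifts, one multiply and one add."""
    n = x.bit_length()                    # position of the most significant 1 bit
    s = 1 << ((n + 1) // 2)               # step 1: crude seed, 2^ceil(n/2)
    s_half = s >> 1                       # step 2: previous seed value
    x_scale = x << (width - n)            # step 3: normalise so the MSB is set
    frac = x_scale >> (width - mult_bits) # step 4: upper 18 bits, read as 0.xxx
    s_scale = (s_half * frac) >> mult_bits  # keep the integer part of the product
    return s_half + s_scale               # step 5: improved seed

print(improved_seed(9_000_000))  # 3146, versus 4096 for the crude seed
```

Each intermediate value for x = 9e6 (2304000000, 140625, 288000000, 1098) matches the numbers in the steps above.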

How to determine error in floating-point calculations?

I have the following equation I want to implement in floating-point arithmetic:
Equation: sqrt((a-b)^2 + (c-d)^2 + (e-f)^2)
I am wondering how the width of the mantissa affects the accuracy of the result. What is the correct mathematical approach to determining this?
For instance, if I perform the following operations, how will the accuracy be affected after each step?
Here are the steps:
Step 1, Perform the following calculations in 32-bit single precision floating point: x=(a-b), y=(c-d), z=(e-f)
Step 2, Round the three results to have a mantissa of 16 bits (not including the hidden bit),
Step 3, Perform the following squaring operations: x2 = x^2, y2 = y^2, z2 = z^2
Step 4, Round x2, y2, and z2 to a mantissa of 10 bits (after the decimal point).
Step 5, Add the values: w = x2 + y2 + z2
Step 6, Round the results to 16 bits
Step 7, Take the square root: sqrt(w)
Step 8, Round to 20 mantissa bits (not including the hidden bit).
There are various ways of representing the error of a floating-point number. There is relative error (a * (1 + ε)), the subtly different ULP error (a + ulp(a) * ε), and absolute error (a + ε). Each of them can be used in analysing the error, but all have shortcomings. To get sensible results you often have to take into account what happens precisely inside floating-point calculations. I'm afraid that the 'correct mathematical approach' is a lot of work, so instead I'll give you the following.
simplified ULP based analysis
The following analysis is quite crude, but it does give a good 'feel' for how much error you end up with. Just treat these as examples only.
(a-b)
The operation itself gives you up to a 0.5 ULP error (if rounding RNE). The rounding error of this operation can be small compared to the inputs, but if the inputs are very similar and already contain error, you could be left with nothing but noise!
(a^2)
This operation multiplies not only the input, but also the input error. If dealing with relative error, that means at least multiplying errors by the other mantissa. Interestingly there is a little normalisation step in the multiplier, that means that the relative error is halved if the multiplication result crosses a power of two boundary. The worst case is where the inputs multiply just below that, e.g. having two inputs that are almost sqrt(2). In this case the input error is multiplied to 2*ε*sqrt(2). With an additional final rounding error of 0.5 ULP, the total is an error of ~2 ULP.
adding positive numbers
The worst case here is just the input errors added together, plus another rounding error. We're now at 3*2+0.5 = 6.5 ULP.
sqrt
The worst case for a sqrt is when the input is close to e.g. 1.0. The error roughly just gets passed through, plus an additional rounding error. We're now at 7 ULP.
intermediate rounding steps
It will take a bit more work to plug in your intermediate rounding steps.
You can model these as an error related to the number of bits you're rounding off. E.g. going from a 23-bit to a 10-bit mantissa with RNE introduces an additional 2^(13-1) ULP error relative to the 23-bit mantissa, or 0.5 ULP relative to the new mantissa (you'll have to scale down your other errors if you want to work with that).
I'll leave it to you to count the errors of your detailed example, but as the commenters noted, rounding to a 10-bit mantissa will dominate, and your final result will be accurate to roughly 8 mantissa bits.
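An empirical check can complement the pen-and-paper count. The Python sketch below simulates the intermediate rounding steps from the question; round_mantissa and distance_reduced are hypothetical names, and the differences in step 1 are computed in double rather than single precision, which is an assumption that should be negligible next to the 16-bit rounding that follows.

```python
import math

def round_mantissa(x, bits):
    """Round x to `bits` fraction bits (hidden bit excluded), ties to even."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                    # x = m * 2**e with 0.5 <= |m| < 1
    scaled = round(m * 2.0 ** (bits + 1))   # Python's round() is ties-to-even
    return math.ldexp(scaled, e - (bits + 1))

def distance_reduced(a, b, c, d, e, f):
    """Steps 1-8 of the question with the stated mantissa widths."""
    x, y, z = (round_mantissa(v, 16) for v in (a - b, c - d, e - f))
    x2, y2, z2 = (round_mantissa(v * v, 10) for v in (x, y, z))
    w = round_mantissa(x2 + y2 + z2, 16)
    return round_mantissa(math.sqrt(w), 20)

args = (1.5, 0.25, 3.0, 1.0, 7.0, 2.5)
exact = math.sqrt(sum((p - q) ** 2 for p, q in zip(args[0::2], args[1::2])))
approx = distance_reduced(*args)
print(exact, approx, abs(approx - exact) / exact)
```

Running this over many random inputs and recording the largest relative error gives a practical estimate to compare against the worst-case bound.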

differentiation in matlab

I need to find the acceleration of an object. The formula given in the text is a = d^2(L)/d(T)^2, where L = length and T = time.
I calculated this in MATLAB using this equation:
a = (1/(T3-T1))*(((L3-L2)/(T3-T2))-((L2-L1)/(T2-T1)))
or
a = (v2-v1)/(T2-T1)
but I'm not getting the right answers. Can anybody tell me how to find a by any other method in MATLAB?
This has nothing to do with matlab, you are just trying to numerically differentiate a function twice. Depending on the behaviour of the higher (3rd, 4th) derivatives of the function this will or will not yield reasonable results. You will also have to expect an error of order |T3 - T1|^2 with a formula like the one you are using, assuming L is four times differentiable. Instead of using intervals of different size you may try to use symmetric approximations like
v(x) = (L(x+h) - L(x-h)) / (2h)
a(x) = (L(x-h) - 2 L(x) + L(x+h)) / h^2
From what I recall from my numerical math lectures this is better suited for numerical calculation of higher order derivatives. You will still get an error of order
C |h|^2, with C = O( ||d^4 L / dt^4 || )
with ||.|| denoting the supremum norm of a function (that is, the fourth derivative of L needs to be bounded). In case that's true you can use that formula to calculate how small h has to be chosen in order to produce a result you are willing to accept. Note, though, that this is just the theoretical error which is a consequence of an analysis of the Taylor approximation of L, see [1] or [2] (this is where I got it from a moment ago) or any other introductory book on numerical mathematics. You may get additional errors depending on the quality of the evaluation of L; also, if |L(x-h) - L(x)| is very small, numerical subtraction may be ill-conditioned.
[1] Knabner, Angermann; Numerik partieller Differentialgleichungen; Springer
[2] http://math.fullerton.edu/mathews/n2003/numericaldiffmod.html
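As a rough illustration, the symmetric differences can be written as below (Python is used here only for the sketch; the function names are hypothetical). The last function is the three-sample form for unevenly spaced times: the second derivative is twice the second divided difference, so it carries a factor of 2 that the formula in the question does not have.

```python
def first_derivative(L, x, h):
    """Central difference for L'(x), error of order h^2."""
    return (L(x + h) - L(x - h)) / (2 * h)

def second_derivative(L, x, h):
    """Central difference for L''(x), error of order h^2."""
    return (L(x - h) - 2 * L(x) + L(x + h)) / h**2

def second_derivative_3pt(T1, T2, T3, L1, L2, L3):
    """Second derivative from three samples at unevenly spaced times."""
    return 2.0 / (T3 - T1) * ((L3 - L2) / (T3 - T2) - (L2 - L1) / (T2 - T1))

cube = lambda t: t**3                    # L(t) = t^3, so L'(2) = 12 and L''(2) = 12
print(first_derivative(cube, 2.0, 0.01))
print(second_derivative(cube, 2.0, 0.01))
# Samples of L(t) = t^2 at t = 0, 1, 3; the true second derivative is 2
print(second_derivative_3pt(0.0, 1.0, 3.0, 0.0, 1.0, 9.0))
```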

How to do hypot2(x,y) calculation when numbers can overflow

I'd like to do a hypot2 calculation on a 16-bit processor.
The standard formula is c = sqrt((a * a) + (b * b)). The problem is that it overflows with large inputs. E.g. with 200 and 250, 200 * 200 gives 40,000, which is higher than the max signed 16-bit value of 32,767, so it overflows, as does 250 * 250. The two products are then added and the result is useless; it might even signal an error condition because of the square root of a negative number.
In my case, I'm dealing with 32-bit numbers, but 32-bit multiply on my processor is very fast, about 4 cycles. I'm using a dsPIC microcontroller. I'd rather not have to multiply with 64-bit numbers because that's wasting precious memory and undoubtedly will be slower. Additionally I only have sqrt for 32-bit numbers, so 64-bit numbers would require another function. So how can I compute a hypot when the values may be large?
Please note I can only really use integer math for this. Using anything like floating point math incurs a speed hit which I'd rather avoid. My processor has a fast integer/fixed point atan2 routine, about 130 cycles; could I use this to compute the hypotenuse length?
Depending on how much accuracy you need you may be able to avoid the squares and the square root operation. There is a section on this topic in Understanding Digital Signal Processing by Rick Lyons (section 10.2, "High-Speed Vector-Magnitude Approximation", starting at page 400 in my edition).
The approximation is essentially:
magnitude = alpha * max + beta * min
where max and min are the maximum and minimum absolute values of the real and imaginary components, and alpha and beta are two constants chosen to give a reasonable error distribution over the range of interest. These constants can be represented as fractions with power-of-2 divisors to keep the arithmetic simple and efficient. In the book he suggests alpha = 15/16, beta = 15/32, and you can then simplify the formula to:
magnitude = (15 / 16) * (max + min / 2)
which might be implemented as follows using integer operations:
magnitude = 15 * (max + min / 2) / 16
and of course we can use shifts for the divides:
magnitude = (15 * (max + (min >> 1))) >> 4
Error is +/- 5% over a quadrant.
More information on this technique here: http://www.dspguru.com/dsp/tricks/magnitude-estimator
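A quick way to get a feel for the estimator is to run the shift-based formula on a few inputs. The Python sketch below uses a hypothetical function name and plain Python integers; on the target the same expression would use the native integer width.

```python
import math

def approx_magnitude(x, y):
    """Estimate sqrt(x*x + y*y) as (15/16) * (max + min/2), integers only."""
    hi, lo = max(abs(x), abs(y)), min(abs(x), abs(y))
    return (15 * (hi + (lo >> 1))) >> 4

for x, y in [(300, 400), (1000, 1000), (1000, 0)]:
    print(x, y, approx_magnitude(x, y), round(math.hypot(x, y), 1))
```

Note that no intermediate value is larger than about 23 times the bigger input, so overflow is easy to rule out for a given input range.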
This is taken verbatim from this @John D. Cook blog post, hence CW:
Here’s how to compute sqrt(x*x + y*y)
without risking overflow.
max = maximum(|x|, |y|)
min = minimum(|x|, |y|)
r = min / max
return max*sqrt(1 + r*r)
If @John D. Cook comes along and posts this you should give him the accept :)
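A direct transcription of the quoted steps into floating-point Python might look as follows. The function name and the guard for the all-zero input, where min / max would divide by zero, are additions made for this sketch and are not part of the quoted post.

```python
import math

def hypot_scaled(x, y):
    """Compute sqrt(x*x + y*y) without squaring the larger input."""
    hi, lo = max(abs(x), abs(y)), min(abs(x), abs(y))
    if hi == 0:
        return 0.0             # both inputs are zero; avoids 0/0 below
    r = lo / hi                # 0 <= r <= 1, so r*r cannot overflow
    return hi * math.sqrt(1.0 + r * r)

print(hypot_scaled(3, 4))            # 5.0
print(hypot_scaled(1e200, 1e200))    # finite, although 1e200**2 would overflow
```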
Since you essentially can't do any multiplications without overflow you're likely going to lose some precision.
To get the numbers into an acceptable range, pull out some factor x and use
c = x*sqrt( (a/x)*(a/x) + (b/x)*(b/x) )
If x is a common factor, you won't lose precision, but if it's not, you will lose precision.
Update:
Even better, given that you can do some mild work with 64-bit numbers, with just one 64-bit addition, you could do the rest of this problem in 32 bits with only a tiny loss of accuracy. To do this: do the two 32-bit multiplications to give you two 64-bit numbers, add these, and then bit shift as needed to get the sum back down to 32 bits before taking the square root. If you always bit shift by an even number of bits, then just multiply the final result by 2^(half the number of bit shifts), based on the rule above. The truncation should only cause a very small loss of accuracy, no more than 2^-31, or 0.00000005% error.
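The shift-and-rescale idea from the update can be sketched as below. Python integers stand in for the 64-bit sum, math.isqrt stands in for the 32-bit integer square root routine, and the function name is hypothetical. One caveat is visible in the output: an integer square root of a 32-bit value has only about 16 significant bits, so the overall accuracy is limited by that rather than by the truncation of the sum.

```python
import math

def hypot_shifted(a, b):
    """Integer hypot: 64-bit sum of squares, even right shift, 32-bit sqrt."""
    total = a * a + b * b                      # fits in 64 bits for 32-bit inputs
    shift = max(0, total.bit_length() - 32)    # bits to drop so the sum fits in 32
    shift += shift & 1                         # round the shift up to an even count
    root = math.isqrt(total >> shift)          # square root of the reduced sum
    return root << (shift // 2)                # undo the shift: sqrt(2^shift) = 2^(shift/2)

print(hypot_shifted(3, 4))                            # 5
print(hypot_shifted(2_000_000_000, 1_500_000_000))    # close to 2_500_000_000
```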
Aniko and John, it seems to me that you haven't addressed the OP's problem. If a and b are integers, then a*a + b*b is likely to overflow, because integer operations are being performed. The obvious solution is to convert a and b to floating-point values before computing a*a + b*b. But the OP hasn't let us know what language we should use, so we're a bit stuck.
> The standard formula is c = sqrt((a * a) + (b * b)). The problem with this is with large inputs it overflows.
The solution for overflows (aside from throwing an error) is to saturate your intermediate calculations.
Calculate C = a*a + b*b. If a and b are signed 16-bit numbers, you will never have an overflow. If they are unsigned numbers, you'll need to right-shift the inputs first to get the sum to fit in a 32-bit number.
If C > (MAX_RADIUS)^2, return MAX_RADIUS, where MAX_RADIUS is the maximum value you can tolerate before encountering an overflow.
Otherwise, use either sqrt() or the CORDIC algorithm, which avoids the cost of square roots in favor of loop iteration + adds + shifts, to retrieve the amplitude of the (a,b) vector.
If you can constrain a and b to be at most 7 bits, you won't get any overflow. You can use a count-leading-zeros instruction to figure out how many bits to throw away.
Assume a>=b.
int bits = 16 - count_leading_zeros(a);
if (bits > 7) {
a >>= bits - 7;
b >>= bits - 7;
}
c = sqrt(a*a + b*b);
if (bits > 7) {
c <<= bits - 7;
}
Lots of processors have this instruction nowadays, and if not, you can use other fast techniques.
Although this won't give you the exact answer, it will be very close (at most ~1% low).
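For a worked example, here is the same reduction in Python applied to the 200 and 250 from the question. The function name is hypothetical, int.bit_length plays the role of 16 - count_leading_zeros for a 16-bit value, and math.isqrt stands in for the integer square root.

```python
import math

def hypot_7bit(a, b):
    """Shift both inputs down to at most 7 bits so the squares fit in 16 bits."""
    a, b = max(abs(a), abs(b)), min(abs(a), abs(b))   # ensure a >= b
    bits = a.bit_length()            # number of significant bits in the larger input
    shift = max(0, bits - 7)         # bits to throw away
    a >>= shift
    b >>= shift
    c = math.isqrt(a * a + b * b)    # a*a + b*b <= 2 * 127**2 = 32258
    return c << shift                # scale the result back up

print(hypot_7bit(200, 250))  # 320, the exact value is about 320.16
```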
Do you need full precision? If you don't, you can increase your range a little bit by discarding a few least significant bits and multiplying them in afterwards.
Can a and b be anything? How about a lookup table if you only have a few a and b that you need to calculate?
A simple solution to avoid overflow is to divide both a and b by a+b before squaring, and then multiply the square root by a+b. Or do the same with max(a,b).
You can do a little simple algebra to bring the results back into range.
sqrt((a * a) + (b * b))
= 2 * sqrt(((a * a) + (b * b)) / 4)
= 2 * sqrt((a * a) / 4 + (b * b) / 4)
= 2 * sqrt((a/2 * a/2) + (b/2 * b/2))

Resources