Hardware Floating Point Square Root - math

How do hardware implementations of a floating-point square root work? Which algorithm would they use and can anyone provide links to verilog/vhdl implementations?

AFAIK, hardware uses either a digit-recurrence algorithm (needs few resources) or Newton's iteration on the reciprocal square root (needs other operators: adder, multiplier, or FMA).
Concerning Newton's iteration, the choice of the initial approximation is not obvious. See Kornerup and Muller's article Choosing starting values for certain Newton–Raphson iterations.

You get the best bang for the money by implementing an approximation for 1 / sqrt (x) in hardware, giving maybe ten or twelve bits of precision, like Intel processors do. Then you use good old Newton iteration to improve that approximation using add/subtract/multiply only, and multiply the last approximation by x.
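As a rough software sketch of that scheme (not any particular processor's implementation): the crude linear starting value below stands in for the hardware lookup table, and the names rsqrt_newton and sqrt_newton, the constants 1.65 and 0.5, and the iteration count are all assumptions chosen for illustration. It assumes x is finite and greater than zero.

```c
#include <math.h>

/* Reciprocal square root by Newton's iteration.
 * Assumption: x is finite and > 0 (no special-case handling here). */
double rsqrt_newton(double x)
{
    int e;
    double m = frexp(x, &e);      /* x = m * 2^e with m in [0.5, 1) */
    if (e % 2 != 0) {             /* make the exponent even */
        m *= 2.0;                 /* m is now in [1, 2) */
        e -= 1;
    }
    /* Crude starting value for 1/sqrt(m) on [0.5, 2); real hardware
     * would use a table giving many more correct bits. */
    double y = 1.65 - 0.5 * m;
    /* Each step uses only multiply and subtract and roughly
     * doubles the number of correct bits. */
    for (int i = 0; i < 5; i++)
        y = y * (1.5 - 0.5 * m * y * y);
    return ldexp(y, -e / 2);      /* undo the exponent scaling */
}

/* sqrt(x) = x * (1/sqrt(x)), the final multiply mentioned above. */
double sqrt_newton(double x)
{
    return x * rsqrt_newton(x);
}
```

Calling sqrt_newton(2.0) should agree with the library sqrt to within a few units in the last place. With a 12-bit starting value, two or three iterations would be enough for double precision.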
Alternatively, consider that calculating the square root of x is the same as dividing x by the square root of x. You can implement something very similar to a division, giving one bit of precision each time, except that the number you are dividing by changes in every iteration.
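The one-bit-per-step idea can be sketched on integers. This is the classic restoring shift-and-subtract square root, shown as an illustration of digit recurrence rather than as a description of any specific hardware design; the function name isqrt_digit is made up for this example.

```c
#include <stdint.h>

/* Integer square root, one result bit per iteration.
 * Returns floor(sqrt(x)). */
uint32_t isqrt_digit(uint32_t x)
{
    uint32_t root = 0;
    uint32_t bit = 1u << 30;      /* highest power of four in 32 bits */

    while (bit > x)               /* skip leading zero digit pairs */
        bit >>= 2;

    while (bit != 0) {
        /* Trial subtraction: the "divisor" (root + bit) changes
         * every iteration, unlike in ordinary division. */
        if (x >= root + bit) {
            x -= root + bit;
            root = (root >> 1) + bit;   /* accept a 1 bit */
        } else {
            root >>= 1;                 /* accept a 0 bit */
        }
        bit >>= 2;
    }
    return root;
}
```

A floating-point unit would apply the same recurrence to the significand (after making the exponent even) and halve the exponent separately.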

Related

How many arithmetic operations should it take to calculate trig functions?

I'm trying to assess the expected performance of calculating trigonometry functions as a function of the required precision. Obviously the wall clock time depends on the speed of the underlying arithmetic, so factoring that out by just counting number of operations:
Using state-of-the-art algorithms, how many arithmetic operations (add, subtract, multiply, divide) should it take to calculate sin(x), as a function of the number of bits (or decimal digits) of precision required in the output?
... to assess the expected performance of calculating trigonometry functions as a function of the required precision.
Look at the first omitted term of the Taylor series for sine at x = π/4 as the order of the error.
Details: sin(x) usually has these phases:
Handling special cases: NaN, infinities.
Argument reduction to the primary range, say [-π/4...+π/4]. Really good reduction is hard because π is irrational, and the code for it can reach 50% of the sin() time, much of it spent emulating the needed extended precision. (See K.C. Ng's "ARGUMENT REDUCTION FOR HUGE ARGUMENTS: Good to the Last Bit".)
Low-quality reduction involves much less: /, truncate, -, *.
Calculation over a limited range. This is the only part many people consider. If done with a Taylor series and needing 53 bits, then about 10-11 terms are needed. Yet quality code often uses a pair of crafted polynomials, each of about 4-5 terms, to form the quotient p(x)/q(x).
Of course dedicated hardware support in any of these steps greatly increases performance.
Note: code for sin() is often paired with cos() code as extensive use of trig identities simplify the calculation.
I'd expect a software solution for sin() to cost on the order of 25 times an ordinary multiplication. This is a rough estimate.
To achieve a very low error rate in the ULP, code typically uses a tad more. sine_crap() could get by with only a few terms. So when assessing time performance, there is a trade-off with correctness. How good a sin() do you want?
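To make the trade-off between term count and accuracy concrete, here is a minimal sketch of the "calculation over a limited range" phase only, using a plain Taylor polynomial rather than the crafted polynomials quality libraries use. The name sin_taylor and the terms parameter are assumptions for illustration; the argument is assumed to be already reduced to [-π/4, +π/4].

```c
/* Taylor polynomial for sin(x) with `terms` terms
 * (x - x^3/3! + x^5/5! - ...), evaluated in nested form.
 * Assumption: |x| <= pi/4 and terms >= 1. */
double sin_taylor(double x, int terms)
{
    double x2 = x * x;
    double sum = 1.0;
    /* Builds 1 - x2/(2*3) * (1 - x2/(4*5) * (1 - ...)) from the
     * innermost factor outward: about 3 operations per term. */
    for (int k = terms - 1; k >= 1; k--)
        sum = 1.0 - x2 / ((2.0 * k) * (2.0 * k + 1.0)) * sum;
    return x * sum;
}
```

The truncation error is roughly the first omitted term, x^(2n+1)/(2n+1)! for n terms, so comparing sin_taylor(x, 5) and sin_taylor(x, 9) at x = π/4 shows how the operation count grows with the precision required.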
assess the expected performance of calculating trigonometry functions as a function of the required precision
Using the Taylor series as a predictor of the number of ops, with worst case x = π/4 (45°) and the error in the calculation on the order of the last term of the series:
For 32-bit float, order 6 float ops needed.
For 64-bit double, order 9 float ops needed.
So if time scales by the square of the FP width, double is predicted to take 9/6*2*2 = 6 times as long.
We can calculate any trigonometric function using a simple right-angled triangle or using the Maclaurin/Taylor series. So it really depends on which one you choose to implement. If you only pass an angle as an argument, and wish to calculate the sine of that particular angle, it would take about 4 to 6 steps to calculate the sine using a unit circle.

Calculating pi without arbitrary precision and only basic arithmetic

I want to calculate pi. But, I have quite a few limits. Variables can only hold up to 5 decimal places, and I only have the following operators:
Addition
Subtraction
Multiplication
Division
Exponents
Square roots
Sin
Cos
Basic Loops, Conditionals, and relational operators.
The BBP algorithm seems useless here, because even though it would not need arbitrary precision, I cannot convert between bases. I'm not aware of any other formulas that can find the nth digit of pi in base 10.
Would it even be possible to calculate pi using these constraints?
BBP can be modified to give π in Base 10. There's a Java implementation on Github. (I believe that the screenshot of the algorithm description is taken from Pi - Unleashed by Arndt/Haenel.)
You'll need the modulo operation and a means to calculate the closest integer to the logarithm of a number, but you can perform them using the operations you have and loops.

OpenCL reduction result wrong with large floats

I used AMD's two-stage reduction example to compute the sum of all numbers from 0 to 65 536 using floating point precision. Unfortunately, the result is not correct. However, when I modify my code, so that I compute the sum of 65 536 smaller numbers (for example 1), the result is correct.
I couldn't find any error in the code. Is it possible that I am getting wrong results, because of the float type? If this is the case, what is the best approach to solve the issue?
This is a "side effect" of summing floating point numbers on finite-precision CPUs or GPUs. The accuracy depends on the algorithm and the order in which the values are summed. The theory and practice behind this is explained in Nicholas J. Higham's paper
The Accuracy of Floating Point Summation
http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=7AECC0D6458288CD6E4488AD63A33D5D?doi=10.1.1.43.3535&rep=rep1&type=pdf
The fix is to use a smarter algorithm like the Kahan Summation Algorithm
https://en.wikipedia.org/wiki/Kahan_summation_algorithm
And the Higham paper has some alternatives too.
This problem illustrates the nature of benchmarking: the first rule of the benchmark is to get the right answer, using realistic data!
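For reference, a minimal sketch of Kahan (compensated) summation in plain C. The function name kahan_sum is an assumption for illustration; in an OpenCL kernel the same loop body would have to be applied within each work-item's partial sum. The compensation relies on strict floating-point evaluation, so it should not be compiled with options such as -ffast-math that permit reassociation.

```c
#include <stddef.h>

/* Kahan summation: carries the rounding error of each addition
 * in `c` and feeds it back into the next term. */
float kahan_sum(const float *v, size_t n)
{
    float sum = 0.0f;
    float c = 0.0f;               /* running compensation */
    for (size_t i = 0; i < n; i++) {
        float y = v[i] - c;       /* corrected next term */
        float t = sum + y;        /* low-order bits of y may be lost here */
        c = (t - sum) - y;        /* recover what was lost (negated) */
        sum = t;
    }
    return sum;
}
```

For example, summing 16777216.0f followed by four values of 1.0f gives 16777220 with this routine, whereas a plain float loop stays stuck at 16777216.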
There is probably no error in the coding of your kernel or host application. The issue is with the single-precision floating point.
The correct sum is 65537 * 32768 = 2147516416, and it takes 32 bits to represent it in binary (10000000000000001000000000000000). 32-bit floats can only represent every integer exactly up to 2^24.
"Any integer with absolute value less than [2^24] can be exactly represented in the single precision format"
"Floating Point" article, wikipedia
This is why you are getting the correct sum when it is less than or equal to 2^24. If you are doing a complete sum using single-precision, you will eventually lose accuracy no matter which device you are executing the kernel on. There are a few things you can do to get the correct answer:
use double instead of float if your platform supports it
use int or unsigned int
sum a smaller set of numbers eg: 0+1+2+...+4095+4096 = (2^23 + 2^11)
Read more about single precision here.
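The 2^24 limit can be checked on the host with a small sequential sketch. The function names are made up for this example, and a parallel reduction adds the values in a different order, so its float result will generally differ from this loop's.

```c
/* Sum the integers 0..n in single precision.
 * Assumption: n is well below UINT_MAX so the loop terminates. */
float sum_upto_float(unsigned n)
{
    float s = 0.0f;
    for (unsigned i = 0; i <= n; i++)
        s += (float)i;            /* exact only while s stays <= 2^24 */
    return s;
}

/* Same sum in double precision, exact for integers up to 2^53. */
double sum_upto_double(unsigned n)
{
    double s = 0.0;
    for (unsigned i = 0; i <= n; i++)
        s += (double)i;
    return s;
}
```

sum_upto_float(4096) is exact because every partial sum stays below 2^24, while sum_upto_double(65536) gives the full 2147516416.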

How to numerically compute nonlinear polynomials efficiently and accurately?

(I'm not sure whether I should post this problem on this site or on the math site. Please feel free to migrate this post if necessary.)
My problem at hand is that given a value of k I'd like to numerically compute a rational function of nonlinear polynomials in k which looks like the following: (sorry I don't know how to typeset equations here...)
where {a_0, ..., a_N; b_0, ..., b_N} are complex constants, {u_0, ..., u_N, v_0, ..., v_N} are real constants and i is the imaginary number. I learned from Numerical Recipes that there are whole bunch of ways to compute polynomials quickly, in the meanwhile keeping the rounding error small enough, if all coefficients were constant. But I do not think those ideas are useful in my case since the exponential prefactors also depend on k.
Currently I calculate it in a brute force way in C with complex.h (this is just a pseudo code):
double complex function(double k)
{
    return (a_0 + a_1*cexp(I*u_1*k)*k + a_2*cexp(I*u_2*k)*k*k + ...)
         / (b_0 + b_1*cexp(I*v_1*k)*k + b_2*cexp(I*v_2*k)*k*k + ...);
}
However when the number of calls of function increases (because this is just a part of my real calculation), it is very slow and inaccurate (only 6 valid digits). I appreciate any comments and/or suggestions.
I trust that this isn't a homework assignment!
Normally the trick is to use a loop: multiply the running sum by k and add the next coefficient (Horner's rule). However, in your case, I think the "e" term in the coefficient is going to overwhelm any savings from factoring out k. You can still do it, but the savings will probably be small.
Is u_i a constant? Depending on how many times you need to run this formula, maybe you could premultiply u_i * k (unless k changes each run). It's been so many decades since I took a Numerical Analysis course that I have only vague recollections of the tricks of the trade. Let's see... is e^(i*u_i*k) the same as (e^(i*u_i))^k? I don't remember the rules on imaginary numbers, or whether you'll save anything since you've got a real^real (assuming k is real) anyway (internally done using e^power).
If you're getting only 6 digits, that suggests that your math, and maybe your library, is working in single precision (32 bit) reals. Check your library and check your declarations that you are using at least double precision (64 bit) reals everywhere.

Approximating nonparametric cubic Bezier

What is the best way to approximate a cubic Bezier curve? Ideally I would want a function y(x) which would give the exact y value for any given x, but this would involve solving a cubic equation for every x value, which is too slow for my needs, and there may be numerical stability issues as well with this approach.
Would this be a good solution?
Just solve the cubic.
If you're talking about Bezier plane curves, where x(t) and y(t) are cubic polynomials, then y(x) might be undefined or have multiple values. An extreme degenerate case would be the line x= 1.0, which can be expressed as a cubic Bezier (control point 2 is the same as end point 1; control point 3 is the same as end point 4). In that case, y(x) has no solutions for x != 1.0, and infinite solutions for x == 1.0.
A method of recursive subdivision will work, but I would expect it to be much slower than just solving the cubic. (Unless you're working with some sort of embedded processor with unusually poor floating-point capacity.)
You should have no trouble finding code that solves a cubic that has already been thoroughly tested and debugged. If you implement your own solution using recursive subdivision, you won't have that advantage.
Finally, yes, there may be numerical stability problems, like when the point you want is near a tangent, but a subdivision method won't make those go away. It will just make them less obvious.
EDIT: responding to your comment, but I need more than 300 characters.
I'm only dealing with bezier curves where y(x) has only one (real) root. Regarding numerical stability, using the formula from http://en.wikipedia.org/wiki/Cubic_equation#Summary, it would appear that there might be problems if u is very small. – jtxx000
The Wikipedia article is math with no code. I suspect you can find some cookbook code that's more ready-to-use somewhere. Maybe Numerical Recipes or the ACM Collected Algorithms.
To your specific question, and using the same notation as the article, u is only zero or near zero when p is also zero or near zero. They're related by the equation:
u^6 + q u^3 = p^3 / 27
Near zero, you can use the approximation:
q u^3 = p^3 / 27
or p / (3u) = cube root of q
So the computation of x from u should contain something like:
(fabs(u) >= somesmallvalue) ? (p / u / 3.0) : cuberoot (q)
How "near" zero is near? Depends on how much accuracy you need. You could spend some quality time with Maple or Matlab looking at how much error is introduced for what magnitudes of u. Of course, only you know how much accuracy you need.
The article gives 3 formulas for u for the 3 roots of the cubic. Given the three u values, you can get the 3 corresponding x values. The 3 values for u and x are all complex numbers with an imaginary component. If you're sure that there has to be only one real solution, then you expect one of the roots to have a zero imaginary component, and the other two to be complex conjugates. It looks like you have to compute all three and then pick the real one. (Note that a complex u can correspond to a real x!) However, there's another numerical stability problem there: floating-point arithmetic being what it is, the imaginary component of the real solution will not be exactly zero, and the imaginary components of the non-real roots can be arbitrarily close to zero. So numeric round-off can result in you picking the wrong root. It would be helpful if there were some sanity check from your application that you could apply there.
If you do pick the right root, one or more iterations of Newton-Raphson can improve its accuracy a lot.
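A minimal sketch of that Newton-Raphson polish, applied to the curve parameter t so that x(t) matches the requested x. The function names, the px array holding the four control-point x coordinates, and the 1e-12 derivative threshold are assumptions for illustration; the starting t would come from the cubic solver.

```c
#include <math.h>

/* Cubic Bezier coordinate in Bernstein form for control values p[0..3]. */
double bezier_eval(const double p[4], double t)
{
    double s = 1.0 - t;
    return s*s*s*p[0] + 3.0*s*s*t*p[1] + 3.0*s*t*t*p[2] + t*t*t*p[3];
}

/* Derivative of that coordinate with respect to t. */
double bezier_deriv(const double p[4], double t)
{
    double s = 1.0 - t;
    return 3.0*s*s*(p[1] - p[0]) + 6.0*s*t*(p[2] - p[1])
         + 3.0*t*t*(p[3] - p[2]);
}

/* Refine t so that bezier_eval(px, t) is close to x. */
double refine_t(const double px[4], double x, double t, int iters)
{
    for (int i = 0; i < iters; i++) {
        double d = bezier_deriv(px, t);
        if (fabs(d) < 1e-12)      /* near a vertical tangent: stop */
            break;
        t -= (bezier_eval(px, t) - x) / d;
    }
    return t;
}
```

Once t is refined, y(x) is bezier_eval(py, t) with the control-point y coordinates. The early exit marks the tangent case mentioned above, where no amount of iteration helps.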
Yes, de Casteljau algorithm would work for you. However, I don't know if it will be faster than solving the cubic equation by Cardano's method.
