Hardware implementation of square root?

I'm trying to find a bit more information on efficient square root algorithms, most likely implemented on an FPGA. I've already found a lot of algorithms, but which ones are used by, for example, Intel or AMD?
By efficient I mean they are either really fast or they don't need much memory.
EDIT: I should probably mention that the question is generally about floating point numbers, and most hardware implements the IEEE 754 standard, where a single-precision number is represented as 1 sign bit, an 8-bit biased exponent and a 23-bit mantissa.
Thanks!

Not a full solution, but a couple of pointers.
I assume you're working in floating point, so point 1 is to remember that a floating point number is stored as a mantissa and an exponent. The exponent of the square root will be roughly half the exponent of the original number, since sqrt(m * 2^e) = sqrt(m) * 2^(e/2); for odd exponents you fold the spare factor of two back into the mantissa.
Then the mantissa can be approximated with a look-up table, and a couple of Newton-Raphson rounds afterwards add accuracy to the result from the LUT.
I haven't implemented anything like this for about 8 years, but I think this is how I did it and was able to get a result in 3 or 4 cycles.
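To make the idea concrete, here is a minimal C sketch of that scheme: pull the exponent and mantissa apart, halve the exponent, seed sqrt(mantissa) from a small table and polish it with two Newton-Raphson rounds. The function name, the 8-entry table and the iteration count are illustrative choices for this example only, not anything Intel/AMD or a real FPGA design is claimed to use, and special cases (zero, negatives, NaN, subnormals) are ignored.

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative only: split off exponent and mantissa, halve the exponent,
   seed sqrt(mantissa) from an 8-entry table, refine with Newton-Raphson.
   Assumes a normal, positive IEEE 754 float. */
static float approx_sqrt(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);              /* view the float as raw bits */

    int e = (int)((bits >> 23) & 0xFF) - 127;    /* unbiased exponent          */
    uint32_t frac = bits & 0x7FFFFFu;            /* 23-bit fraction            */
    double m = 1.0 + frac / 8388608.0;           /* mantissa in [1, 2)         */

    /* sqrt(m * 2^e) = sqrt(m) * sqrt(2)^(e&1) * 2^((e - (e&1))/2) */
    int odd = e & 1;
    int k   = (e - odd) / 2;

    /* Look-up table: sqrt of the midpoint of each of 8 sub-ranges of [1, 2). */
    static const double lut[8] = {
        1.03078, 1.08972, 1.14564, 1.19896,
        1.25000, 1.29904, 1.34629, 1.39194
    };
    double y = lut[frac >> 20];                  /* index by top 3 fraction bits */

    /* Two Newton-Raphson rounds for y^2 = m:  y <- (y + m/y) / 2 */
    y = 0.5 * (y + m / y);
    y = 0.5 * (y + m / y);

    if (odd) y *= 1.4142135623730951;            /* fold in the leftover 2^1     */
    return (float)ldexp(y, k);                   /* reattach the halved exponent */
}

int main(void)
{
    printf("%f vs %f\n", approx_sqrt(2.0f), sqrtf(2.0f));
    printf("%f vs %f\n", approx_sqrt(1234.5f), sqrtf(1234.5f));
    return 0;
}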

This is a great one for the fast inverse square root.
Have a look at it here. Notice it's pretty much all about the initial guess; rather amazing document :)
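For reference, the widely circulated fast inverse square root (presumably the Quake III routine that document discusses) looks roughly like this in C; the magic constant supplies the initial guess and one Newton-Raphson step refines it. The function name here is just for the example.

#include <stdint.h>
#include <string.h>

/* The widely-circulated fast inverse square root, written with memcpy
   instead of pointer punning to stay within strict-aliasing rules. */
static float fast_rsqrt(float x)
{
    float half = 0.5f * x;
    uint32_t i;
    memcpy(&i, &x, sizeof i);          /* bit pattern of x             */
    i = 0x5f3759df - (i >> 1);         /* the famous initial guess     */
    float y;
    memcpy(&y, &i, sizeof y);
    y = y * (1.5f - half * y * y);     /* one Newton-Raphson iteration */
    return y;                          /* ~1/sqrt(x), about 0.2% error */
}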

Related

Is there a more efficient way to divide and conquer a uint 256 log on 64 bit hardware with rust or inline assembly than converging Taylor series?

I am looking to take the log base n (10 would be fine) of a 256-bit unsigned integer as a floating point number in Rust, with no loss of precision. It would seem to me that I need to implement an 8xf64 512-bit float type and use a Taylor series to approximate ln and then the log. I know there are assembly methods to obtain the log of an f64. I am wondering if anyone on Stack Overflow can think of a divide and conquer or other method which would be more efficient. I would be amenable to inline assembly operating on the 8xf64 512-bit array.
This might be a useful starting point / outline of an algorithm. IDK if it will get you exact results, like error <= 0.5ulp (i.e. the last bit of the mantissa of your 512-bit float correctly rounded), or even error <= 1 ulp. Perhaps worth looking into what extended-precision calculators like bc / dc / calc do.
I think log converges quickly, so if you're going to do Newton iterations to refine, this bit-scan method might be a fast way to get a good starting point. Even if you only really need about 256 mantissa bits correct, I don't know how big a polynomial it would take to get that, and each multiply / add / fma would be on 512-bit (8x) or 320-bit (5x double precision).
Start by converting integer to binary float
For normal-sized floating-point numbers, the usual method takes advantage of the logarithmic nature of binary floating point. Without 256-bit HW float, you'll want to find the ilog2(int) yourself, i.e. position of the highest set bit (Efficiently find least significant set bit in a large array?).
Then treat your 256-bit integer as the mantissa of a number in the [1..2) or [0.5 .. 1) range, and yes, use a polynomial approximation for log2() that's accurate over that limited range. (Before the actual soft-float stuff, you might want to left-shift the number so it's normalized, i.e. the highest set bit is at the top, i.e. x <<= clz(x).)
Then a polynomial approximation over the mantissa
And then add the integer exponent + log_approx(mantissa) => log2(x).
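As a scalar sketch of those steps, shrunk to uint64_t so it stays self-contained, the structure looks like the code below. __builtin_clzll is the GCC/Clang count-leading-zeros builtin standing in for your bit-scan, and the library log2() stands in for the polynomial you would actually evaluate in extended precision; the function name is just for this example.

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Outline of the steps above, shrunk to uint64_t: bit-scan for the integer
   part of log2, normalize the rest into [1, 2), then add the two pieces.
   x must be nonzero. */
static double log2_u64(uint64_t x)
{
    int ilog2 = 63 - __builtin_clzll(x);     /* position of the highest set bit */
    double mant = ldexp((double)x, -ilog2);  /* x / 2^ilog2, in [1, 2)          */
    return (double)ilog2 + log2(mant);       /* exponent + log2(mantissa)       */
}

int main(void)
{
    uint64_t x = 123456789123456789ull;
    printf("%.12f vs %.12f\n", log2_u64(x), log2((double)x));
    return 0;
}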
Efficient implementation of log2(__m256d) in AVX2 has more detail on implementing log2(double) (with SIMD doing 4 at a time, very different from doing one extended precision calculation).
It includes some links to implementations, e.g. Agner Fog's VCL using the ratio of two polynomials instead of one larger polynomial, and various tricks to maintain as much precision as possible: https://github.com/vectorclass/version2/blob/9874e4bfc7a0919fda16596144d393da5f8bf6c0/vectormath_exp.h#L942. Such as further range reduction: if x > SQRT2*0.5, then increment the exponent and double the mantissa. (If 512-bit FP division is really expensive, you might just use more terms in one polynomial.) VCL is currently Apache licensed, so feel free to copy as much as you want from it into anything.
IDK if there are more tricks that might become more valuable for big extended precision, or for soft-float, which that implementation doesn't use. VCL's math functions spend more effort to maintain high precision than some faster approximations, but they're not exact.
Do you really need 512-bit float? Maybe only 320-bit (5x double)?
If you don't need more exponent-range than a double, you might be able to extend the double-double-arithmetic technique to wider floats, taking advantage of hardware FP to get 52 or 53 mantissa bits per 64-bit chunk. (From comments, apparently you're already planning to do that.)
You might not need 512-bit float to have sufficient precision. 256/52 = 4.92, so only 5x double chunks have more precision (mantissa bits) than your input, and could exactly represent any 256-bit integer. (IEEE double does have a large enough exponent range; -1022 .. +1023). And have enough to spare that log2(int) should map each 256-bit input to a unique monotonic output, even with some rounding error.
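If you do go the double-double (or n-double) route, the basic building blocks are the error-free transforms: each hardware add or multiply also yields its exact rounding error, which becomes the low half of the extended value. A minimal sketch of Knuth's TwoSum plus an FMA-based TwoProd (the function names are just for this example):

#include <math.h>
#include <stdio.h>

/* Error-free transforms, the building blocks of double-double (and wider)
   arithmetic: the rounded hardware result plus its exact error term. */

/* Knuth's TwoSum: s + e == a + b exactly, where s = fl(a + b). */
static void two_sum(double a, double b, double *s, double *e)
{
    *s = a + b;
    double bb = *s - a;
    *e = (a - (*s - bb)) + (b - bb);
}

/* TwoProd via fused multiply-add: p + e == a * b exactly. */
static void two_prod(double a, double b, double *p, double *e)
{
    *p = a * b;
    *e = fma(a, b, -*p);                 /* exact low part of the product */
}

int main(void)
{
    double s, e, p, pe;
    two_sum(1.0, 1e-20, &s, &e);         /* s = 1.0, e = 1e-20: nothing is lost */
    two_prod(1e10 + 1, 1e10 - 1, &p, &pe);
    printf("sum: %g + %g   prod: %.1f + %.1f\n", s, e, p, pe);
    return 0;
}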

OpenCL reduction result wrong with large floats

I used AMD's two-stage reduction example to compute the sum of all numbers from 0 to 65 536 using floating point precision. Unfortunately, the result is not correct. However, when I modify my code, so that I compute the sum of 65 536 smaller numbers (for example 1), the result is correct.
I couldn't find any error in the code. Is it possible that I am getting wrong results, because of the float type? If this is the case, what is the best approach to solve the issue?
This is a "side effect" of summing floating point numbers using finite precision CPU's or GPU's. The accuracy depends the algorithm and the order the values are summed. The theory and practice behind is explained in Nicholas J, Higham's paper
The Accuracy of Floating Point Summation
http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=7AECC0D6458288CD6E4488AD63A33D5D?doi=10.1.1.43.3535&rep=rep1&type=pdf
The fix is to use a smarter algorithm like the Kahan Summation Algorithm
https://en.wikipedia.org/wiki/Kahan_summation_algorithm
And the Higham paper has some alternatives too.
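For reference, a minimal Kahan summation loop looks like this in C (the function name is just for the example); the compensation variable c feeds the lost low-order bits back into the next addition. Note that it only helps if the compiler or driver is not allowed to reassociate the arithmetic, so avoid options like -ffast-math or OpenCL's -cl-fast-relaxed-math.

#include <stddef.h>

/* Kahan compensated summation: c carries the low-order error lost by each
   addition and feeds it back into the next term. */
float kahan_sum(const float *data, size_t n)
{
    float sum = 0.0f, c = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float y = data[i] - c;   /* apply the stored correction          */
        float t = sum + y;       /* low-order bits of y are lost here... */
        c = (t - sum) - y;       /* ...and recovered here                */
        sum = t;
    }
    return sum;
}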
This problem illustrates the nature of benchmarking: the first rule of benchmarking is to get the right answer, using realistic data!
There is probably no error in the coding of your kernel or host application. The issue is with single-precision floating point.
The correct sum is 65537 * 32768 = 2147516416, and it takes 32 bits to represent it in binary (10000000000000001000000000000000). 32-bit floats can only hold integers exactly up to 2^24.
"Any integer with absolute value less than [2^24] can be exactly represented in the single precision format"
"Floating Point" article, wikipedia
This is why you are getting the correct sum when it is less than or equal to 2^24. If you are doing a complete sum using single-precision, you will eventually lose accuracy no matter which device you are executing the kernel on. There are a few things you can do to get the correct answer:
use double instead of float if your platform supports it
use int or unsigned int
sum a smaller set of numbers, e.g. 0+1+2+...+4095+4096 = (2^23 + 2^11)
Read more about single precision here.
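A quick host-side illustration of the limit (independent of anything OpenCL-specific): the accumulator drifts once it passes 2^24, and consecutive integers above 2^24 are no longer distinguishable in a float.

#include <stdio.h>

int main(void)
{
    /* Sum 0..65536 sequentially in single precision vs. exactly. */
    float fsum = 0.0f;
    unsigned long long isum = 0;
    for (unsigned i = 0; i <= 65536u; ++i) {
        fsum += (float)i;
        isum += i;
    }
    printf("float sum = %.1f\n", fsum);   /* drifts once the accumulator passes 2^24 */
    printf("exact sum = %llu\n", isum);   /* 2147516416 */

    /* Above 2^24, consecutive integers collapse to the same float. */
    printf("%.1f %.1f\n", (float)16777216, (float)16777217);   /* both 16777216.0 */
    return 0;
}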

Why do programming contests want answers modulo some large prime?

I have been testing the waters of competitive programming and I have already seen this statement mentioned a lot of times:
Print the result modulo 10^9 + 7
Now I can figure out that this is some way of preventing overflow of digits when dealing with very large numbers. But how and why does it work? I would be grateful if someone could explain the mathematical reasoning behind this.
Many contest questions ask you to compute some very, very large number (say, the number of permutations of a 150-element sequence containing some large number of duplicates). Many programming languages don't natively support arbitrary-precision arithmetic, so in the interest of fairness it makes sense for those contests not to ask you for the exact value. The challenge, then, is the following: how can the contest site know when you have the right answer given that you can't exactly compute it?
One initially appealing option would be to just ask for the answer modulo some large power of two (say, 2^32 or 2^64) so that competitors working in languages like C or C++ could just use uint32_ts or uint64_ts to do all the computations, letting overflows occur normally, and then submit the results. However, this isn't particularly desirable. Suppose, for example, that the question is the following:
Compute 10,000!
This number is staggeringly huge and is way too big to fit into a 32-bit or 64-bit unsigned integer. However, if you just want to get the answer modulo 2^32 or 2^64, you could just use this program:
#include <stdio.h>
int main() {
puts("0");
}
The reason for this is that 10,000! is the product of at least 5,000 even numbers, so one of its factors is 2^5,000. Therefore, if you just want the answer modulo 2^32 or 2^64, you don't actually have to compute it at all. You can just say that the result is 0 mod 2^32 or 2^64.
The problem here is that working modulo 2^32 or 2^64 is troublesome if the resulting answer is cleanly divisible by either of those numbers. However, if we work modulo a large prime number, then this trick wouldn't work. As an example, the number 7,897,987 is prime. If you try to compute 10,000! mod 7,897,987, then you can't just say "the answer is 0" because none of the numbers multiplied together in 10,000! are multiples of 7,897,987. You'd actually have to do some work to figure out what this number is modulo that large prime. More generally, working modulo a large prime usually requires you to compute the actual answer modulo that large prime, rather than using number-theoretic tricks to skip all the work entirely.
So why work modulo 1,000,000,007? This number happens to be prime (so it's good to use as a modulus) and it's less than 2^31 - 1, the largest possible value you can fit in a signed 32-bit integer. The signedness is nice here because in some languages (like Java) there are no unsigned integer types and the default integer type is a 32-bit signed integer. This means that you can work modulo 1,000,000,007 without risking an integer overflow.
To summarize:
Working modulo a large prime makes it likely that if your program produces the correct output, it actually did some calculation and did so correctly.
Working modulo 1,000,000,007 allows a large number of languages to use their built-in integer types to store and calculate the result.
Hope this helps!
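For example, computing 10,000! mod 1,000,000,007 really does require doing all the multiplications, reducing after each step. Note the 64-bit intermediate: the product of two values below 10^9 + 7 does not fit in 32 bits.

#include <stdint.h>
#include <stdio.h>

#define MOD 1000000007ULL    /* 10^9 + 7 */

int main(void)
{
    uint64_t result = 1;
    for (uint64_t i = 2; i <= 10000; ++i) {
        /* Both factors are well below 2^32, so the product fits in 64 bits. */
        result = (result * i) % MOD;
    }
    printf("10000! mod 1000000007 = %llu\n", (unsigned long long)result);
    return 0;
}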

what options are there for representing numbers with more than 2^81 digits?

I came across an interesting math problem that would require me to do some arithmetic with numbers that have more than 2^81 digits. I know that it's impossible to represent a number this large with a system where there is one memory unit for each digit, but I wondered if there were any ways around this.
My initial thought was to use an extremely large base instead of base 10 (decimal). After some thought I believe (but can't verify) that the optimal base would be the square root of the number of digits (so for a number with 2^81 digits you'd use a base of around 2^40), which is an improvement but doesn't scale well and still isn't really practical.
So what options do I have? I know of many arbitrary precision libraries, but are there any that scale to support this sort of arithmetic?
Thanks o7
EDIT: after thinking some more I realize I may be completely wrong about the "optimal base would be the square root of the number of digits", but a) that's why I'm asking and b) I'm too tired to remember my initial reasoning for that assumption.
EDIT 2: 1,000,000 in base ten = F4240 in base 16 = 3641100 in base 8. In base 16 you need 20 bits to store the number; in base 8 you need 21. So it would seem that by increasing the base you decrease the total number of bits needed. (Again, this could be wrong.)
This is really a compression problem pretending to be an arithmetic problem. What you can do with such a large number depends entirely on its Kolmogorov complexity. If you're required to do computations on such a large number, it's obviously not going to arrive as 2^81 decimal digits; the Kolmogorov complexity would be too high in that case and you couldn't even finish reading the input before the sun goes out. The best way to deal with such a number is via delayed evaluation and the symbolic rational types that a language like Scheme provides. This way a program may be able to answer some questions about the result of computations on the number without actually having to write out all those digits to memory.
I think you should just use scientific notation. You will lose precision, but you cannot store numbers that large without losing precision, because storing 2^81 digits would require on the order of 10^24 bytes (about a thousand billion terabytes), which is far more than you can have nowadays.
that have more than 2^81 digits
A non-fractional number with 2^81 bits will take 3*10^11 terabytes of data. Per number.
That's assuming you want every single digit and data isn't compressible.
You could attempt to compress the data by storing it in some kind of sparse array that allocates memory only for non-zero elements, but that doesn't guarantee the data will fit anywhere.
Such precision is useless and impossible to handle on modern hardware. 2^81 bits would take an insane amount of time simply to walk through the number (9584 trillion years, assuming 1 byte takes 1 second), never mind multiplication/division. I also can't think of any problem that would require precision like that.
Your only option is to reduce precision to the first N significant digits and use floating point numbers. Since the data won't fit into a double, you'll have to use a bignum library with floating point support, which provides extremely large floating point numbers. Since the exponent (on the order of 2^81) easily fits in a few bytes, you can store the beginning of the number as a very big floating point value.
1,000,000 in base ten
Regardless of your base, a positive number will take at least floor(log2(number)) + 1 bits to store. If the base is not a power of 2, it will generally take more than floor(log2(number)) + 1 bits. Changing the numeric base won't reduce the number of bits required.
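A small sanity check of that claim, assuming each base-b digit is stored in ceil(log2(b)) bits; the helper functions below are hypothetical, written only for this example.

#include <stdio.h>

static unsigned digits_in_base(unsigned long long n, unsigned b)
{
    unsigned d = 0;
    do { n /= b; ++d; } while (n);
    return d;
}

static unsigned bits_per_digit(unsigned b)        /* ceil(log2(b)) */
{
    unsigned bits = 0;
    for (unsigned v = b - 1; v; v >>= 1) ++bits;
    return bits;
}

int main(void)
{
    unsigned long long n = 1000000;
    unsigned bases[] = { 2, 8, 10, 16 };
    for (int i = 0; i < 4; ++i) {
        unsigned b = bases[i];
        printf("base %2u: %u digits * %u bits = %u bits\n",
               b, digits_in_base(n, b), bits_per_digit(b),
               digits_in_base(n, b) * bits_per_digit(b));
    }
    return 0;   /* base 2: 20, base 8: 21, base 10: 28, base 16: 20 */
}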

Floating Point Algorithms in C

I have recently been thinking about how floating point math works on computers, and it is hard for me to understand all the technical details behind the formulas. I need to understand the basics of addition, subtraction, multiplication, division and remainder. With these I will be able to build trig functions and formulas.
I can guess something about it, but it's a bit unclear to me. I know that a fixed point value can be made by splitting a 4-byte integer into a sign flag, a radix and a mantissa. With this we have a 1-bit flag, a 5-bit radix and a 10-bit mantissa. A word of 32 bits is perfect for a floating point value :)
To add two floats, can I simply add the two mantissas and add the carry to the 5-bit radix? Is this a way to do floating point math (or fixed point math, to be honest), or am I completely wrong?
All the explanations I've seen use formulas, multiplications, etc., and they look very complex for something that I guess would be a bit simpler. I need an explanation aimed more at beginning programmers and less at mathematicians.
See Anatomy of a floating point number
The radix depends on the representation; if you use radix r=2 you can never change it, and the number doesn't even contain any data telling you which radix it has. I think you're wrong and you mean the exponent.
To add two floating point numbers you must make one exponent equal to the other by shifting the mantissa. Shifting one bit right means exponent + 1, and one bit left means exponent - 1; when the numbers have the same exponent you can add them.
Value(x) = mantissa * radix ^ exponent
adding these two numbers
101011 * 2 ^ 13
001011 * 2 ^ 12
would be the same as adding:
101011 * 2 ^ 13
000101 * 2 ^ 13
After making the exponents equal you can operate.
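A toy C version of that alignment step, using a made-up unpacked format (integer mantissa plus exponent) rather than real IEEE 754 bit fields, with no sign, no rounding and no implicit bit; the struct and function names are just for this example.

#include <stdio.h>

/* Toy unpacked float: value = mant * 2^exp. Not IEEE 754; just enough to
   show the alignment step described above. */
struct toyfloat { unsigned mant; int exp; };

static struct toyfloat toy_add(struct toyfloat a, struct toyfloat b)
{
    /* Shift the smaller number's mantissa right until the exponents match;
       every right shift means exponent + 1 (low bits fall off, as above). */
    while (a.exp < b.exp) { a.mant >>= 1; a.exp++; }
    while (b.exp < a.exp) { b.mant >>= 1; b.exp++; }

    struct toyfloat r = { a.mant + b.mant, a.exp };  /* same exponent: add mantissas */
    return r;
}

int main(void)
{
    struct toyfloat a = { 0x2B, 13 };   /* 101011b * 2^13 */
    struct toyfloat b = { 0x0B, 12 };   /* 001011b * 2^12 */
    struct toyfloat r = toy_add(a, b);
    printf("mant = %X, exp = %d\n", r.mant, r.exp);  /* 110000b * 2^13 */
    return 0;
}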
You also have to know whether the representation has an implicit bit. I mean that the most significant bit of the mantissa must be a 1, so usually, as in the IEEE standard, it's known to be there but isn't stored, although it is used when operating.
I know this can be a bit confusing and I'm not the best teacher, so if you have any doubts, just ask.
Run, don't walk, to get Knuth's Seminumerical Algorithms which contains wonderful intuition and algorithms behind doing multiprecision and floating point arithmetic.

Resources