Floating Point Algorithms in C - math

I have recently been thinking about how floating point math works on computers, and it is hard for me to understand all the technical details behind the formulas. I need to understand the basics of addition, subtraction, multiplication, division and remainder. With these I will be able to build trig functions and formulas.
I can guess something about it, but it's a bit unclear. I know that a fixed point number can be made by splitting a 4 byte integer into a sign flag, a radix and a mantissa. With this we have a 1 bit flag, a 5 bit radix and a 10 bit mantissa. A word of 32 bits is perfect for a floating point value :)
To add two floats, can I simply add the two mantissas and add the carry to the 5 bit radix? Is this a way to do floating point math (or fixed point math, to be honest), or am I completely wrong?
All the explanations I have seen use formulas, multiplications, etc., and they look very complex for something I would guess should be a bit simpler. I need an explanation directed more at beginning programmers and less at mathematicians.

See Anatomy of a floating point number

The radix depends on the representation: if you use radix r=2 you can never change it, and the number doesn't even contain any data telling you which radix it has. I think you're mistaken and you mean the exponent.
To add two floating point numbers you must make one exponent equal to the other by shifting the mantissa. Shifting one bit right means exponent + 1, and one bit left means exponent - 1; once the numbers have the same exponent you can add them.
Value(x) = mantissa * radix ^ exponent
Adding these two numbers:
101011 * 2 ^ 13
001011 * 2 ^ 12
would be the same as adding:
101011 * 2 ^ 13
000101 * 2 ^ 13
After making the exponents equal you can operate.
You also have to know whether the representation has an implicit bit: the most significant bit of the mantissa must be a 1, so usually, as in the IEEE standard, it is known to be there but isn't stored, although it is used when operating.
I know this can be a bit confusing and I'm not the best teacher, so if you have any doubts, just ask.
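To make this concrete, here is a minimal C sketch of that alignment step, using a made-up struct of plain integers rather than any standard format:
#include <stdio.h>

/* Toy, non-standard float: value = mantissa * 2^exponent, both plain ints.
   Only the alignment step is sketched; real formats also normalize the
   result, round the shifted-out bits and handle the sign. */
struct toyfloat { unsigned mantissa; int exponent; };

static struct toyfloat toy_add(struct toyfloat a, struct toyfloat b)
{
    if (b.exponent > a.exponent) { struct toyfloat t = a; a = b; b = t; }  /* a gets the larger exponent */
    b.mantissa >>= (unsigned)(a.exponent - b.exponent);   /* shift right until exponents match, losing low bits */
    return (struct toyfloat){ a.mantissa + b.mantissa, a.exponent };       /* same exponent: just add */
}

int main(void)
{
    struct toyfloat x = { 0x2B, 13 };               /* 101011 * 2^13 */
    struct toyfloat y = { 0x0B, 12 };               /* 001011 * 2^12 */
    struct toyfloat s = toy_add(x, y);
    printf("%u * 2^%d\n", s.mantissa, s.exponent);  /* 48 * 2^13, i.e. 110000 * 2^13 */
    return 0;
}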

Run, don't walk, to get Knuth's Seminumerical Algorithms which contains wonderful intuition and algorithms behind doing multiprecision and floating point arithmetic.

Related

Is there a more efficient way to divide and conquer a uint256 log on 64-bit hardware with Rust or inline assembly than a converging Taylor series?

I am looking to take the log base n (10 would be fine) of a 256-bit unsigned integer as a floating point number in Rust, with no loss of precision. It would seem that I need to implement an 8xf64, 512-bit float type and use a Taylor series to approximate ln and then the log. I know there are assembly methods to obtain the log of an f64. I am wondering if anyone on Stack Overflow can think of a divide-and-conquer or other method which would be more efficient. I would be amenable to inline assembly operating on the 8xf64 512-bit array.
This might be a useful starting point / outline of an algorithm. IDK if it will get you exact results, like error <= 0.5ulp (i.e. the last bit of the mantissa of your 512-bit float correctly rounded), or even error <= 1 ulp. Perhaps worth looking into what extended-precision calculators like bc / dc / calc do.
I think log converges quickly, so if you're going to do Newton iterations to refine, this bit-scan method might be a fast way to get a good starting point. Even if you only really need about 256 mantissa bits correct, I don't know how big a polynomial it would take to get that, and each multiply / add / fma would be on 512-bit (8x) or 320-bit (5x double precision).
Start by converting integer to binary float
For normal-sized floating-point numbers, the usual method takes advantage of the logarithmic nature of binary floating point. Without 256-bit HW float, you'll want to find the ilog2(int) yourself, i.e. position of the highest set bit (Efficiently find least significant set bit in a large array?).
Then treat your 256-bit integer as the mantissa of a number in the [1..2) or [0.5 .. 1) range, and yes, use a polynomial approximation for log2() that's accurate over that limited range. (Before the actual soft-float stuff, you might want to left-shift the number so it's normalized, i.e. the highest set bit is at the top: x <<= clz(x).)
Then a polynomial approximation over the mantissa
And then add the integer exponent + log_approx(mantissa) => log2(x).
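Here is a rough C sketch of that outline for a plain 64-bit integer (C for brevity even though the question is about Rust; the 256-bit soft-float version has the same structure, with libm's log2 standing in for the polynomial):
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* log2 of a nonzero 64-bit integer: integer part from the position of the
   highest set bit, fractional part from the mantissa scaled into [1, 2).
   __builtin_clzll is a GCC/Clang builtin. */
static double log2_u64(uint64_t x)
{
    int e = 63 - __builtin_clzll(x);     /* ilog2(x), x must be nonzero */
    double m = ldexp((double)x, -e);     /* mantissa in [1, 2), rounded to 53 bits */
    return (double)e + log2(m);          /* a polynomial approximation of log2(m) would go here */
}

int main(void)
{
    printf("%f\n", log2_u64(1000000));   /* ~19.931569 */
    return 0;
}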
Efficient implementation of log2(__m256d) in AVX2 has more detail on implementing log2(double) (with SIMD doing 4 at a time, very different from doing one extended precision calculation).
It includes some links to implementations, e.g. Agner Fog's VCL using the ratio of two polynomials instead of one larger polynomial, and various tricks to maintain as much precision as possible: https://github.com/vectorclass/version2/blob/9874e4bfc7a0919fda16596144d393da5f8bf6c0/vectormath_exp.h#L942. Such as further range reduction: if x > SQRT2*0.5, then increment the exponent and double the mantissa. (If 512-bit FP division is really expensive, you might just use more terms in one polynomial.) VCL is currently Apache licensed, so feel free to copy as much as you want from it into anything.
IDK if there are more tricks that might become more valuable for big extended precision, or for soft-float, which that implementation doesn't use. VCL's math functions spend more effort to maintain high precision than some faster approximations, but they're not exact.
Do you really need 512-bit float? Maybe only 320-bit (5x double)?
If you don't need more exponent-range than a double, you might be able to extend the double-double-arithmetic technique to wider floats, taking advantage of hardware FP to get 52 or 53 mantissa bits per 64-bit chunk. (From comments, apparently you're already planning to do that.)
You might not need a 512-bit float to have sufficient precision. 256/52 = 4.92, so just 5 double-sized chunks already have more precision (mantissa bits) than your input and could exactly represent any 256-bit integer. (IEEE double also has a large enough exponent range: -1022 .. +1023.) That leaves enough to spare that log2(int) should map each 256-bit input to a unique, monotonic output, even with some rounding error.
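As an illustration of that last point, here is a small C sketch (C rather than Rust, and a 32-bits-per-chunk split rather than the 52-bit packing, purely for simplicity) that turns a 256-bit integer given as 4 little-endian uint64 limbs into an exact unevaluated sum of doubles:
#include <math.h>
#include <stdint.h>

/* Each 32-bit chunk fits a 53-bit mantissa with room to spare, and scaling by
   a power of two is exact, so the multi-double form loses nothing. Summing
   the array into a single double would of course round. */
void u256_to_doubles(const uint64_t limb[4], double out[8])
{
    for (int i = 0; i < 4; i++) {
        out[2 * i]     = ldexp((double)(uint32_t)limb[i],         64 * i);       /* low 32 bits  */
        out[2 * i + 1] = ldexp((double)(uint32_t)(limb[i] >> 32), 64 * i + 32);  /* high 32 bits */
    }
}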

What determines which system is used to translate a base 10 number to binary and vice-versa?

There are a lot of ways to store a given number in a computer. This site lists 5:
unsigned
sign magnitude
one's complement
two's complement
biased (not commonly known)
I can think of another. Encode everything in ASCII and write the number with the negative sign (45) and period (46) if needed.
I'm not sure if I'm mixing apples and oranges but today I heard how computers store numbers using single and double precision floating point format. In this everything is written as a power of 2 multiplied by a fraction. This means numbers that aren't powers of 2 like 9 are written as a power of 2 multiplied by a fraction e.g. 9 ➞ 16*9/16. Is that correct?
Who decides which system is used? Is it up to the hardware of the computer or the program? How do computer algebra systems handle transcendental numbers like π on a finite machine? It seems like things would be a lot easier if everything were coded in ASCII and the negative sign and the decimal point were placed accordingly, e.g. -15.2 would be 45 49 53 46 50 (in base 10)
➞
101101 110001 110101 101110 110010
Well there are many questions here.
The main reason the system you imagined is bad is the lack of entropy. An ASCII character is 8 bits, so on 32 bits you could store only 4 characters instead of 2^32 possible integers: that is 10000 integer values (plus 1000 negative ones if you want a sign). Even if you reduce the alphabet to 12 codes (0-9, -, .), you still need 4 bits per code, which gives 10^8 + 10^7 integer values, still much less than 2^32 (remember, 2^10 ~ 10^3).
Using binary is optimal, because our bits only have 2 values. Any base that is a power of 2 also makes sense, hence octal and hex -- but ultimately they're just binary with bits grouped 3 or 4 at a time for readability. If you drop the sign (just use one bit) and the decimal separator, you get BCD: Binary Coded Decimal, usually stored as 4 bits per digit, though an 8-bits-per-digit version called uncompressed BCD also exists. With a bit of research you can find fixed and floating point formats that use BCD.
Putting the sign in front is exactly sign magnitude (without the entropy problem, since it has a constant size of 1 bit).
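For illustration, a tiny C sketch of the packed-BCD idea mentioned above (the function name is made up):
#include <stdint.h>
#include <stdio.h>

/* Pack a string of decimal digits into 4-bit BCD nibbles, most significant
   digit first: "1234" becomes the bit pattern 0001 0010 0011 0100. */
static uint64_t pack_bcd(const char *digits)
{
    uint64_t packed = 0;
    for (; *digits; digits++)
        packed = (packed << 4) | (uint64_t)(*digits - '0');
    return packed;
}

int main(void)
{
    printf("%#llx\n", (unsigned long long)pack_bcd("1234"));  /* prints 0x1234 */
    return 0;
}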
You're roughly right about the fraction in floating point numbers. These numbers are written with a mantissa m and an exponent e, and their value is m * 2^e. If you represent an integer that way, say 8, it would be 1 * 2^3, so the fraction is 1 = 8/2^3. With 9 the fraction is 9/8 = 1.001 in binary, which still fits exactly in the mantissa; but a value like 0.1 or 1/3 has no finite binary expansion, so instead of the exact fraction we write the closest number we can with the available bits. That is what we do as well with irrational (and thus transcendental) numbers like Pi: we approximate.
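A quick way to see this on a real machine is printf's %a format in C, which prints the exact binary fraction a double holds: 9 comes out exact, 0.1 does not.
#include <stdio.h>

int main(void)
{
    printf("%a\n", 9.0);   /* 0x1.2p+3: exactly 1.125 * 2^3 */
    printf("%a\n", 0.1);   /* 0x1.999999999999ap-4: only the closest double to 1/10 */
    return 0;
}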
You're not solving anything with this system, even for floating point values. The denominator is going to be a power of 10 instead of a power of 2, which seems more natural to you because it is the usual way we write rounded numbers, but it is not in any way more valid or more accurate.** Take 1/6 for example: you cannot represent it with a finite number of digits in the form a/10^b.*
The most popular representation for negative numbers is 2's complement, because of its nice properties when adding negative and positive numbers.
Standards committees (argue a lot internally and eventually) decide what complicated number formats like floating point look like, and how to treat corner cases consistently. E.g. should dividing by 0 yield NaN? Infinity? An exception? You should check out the IEEE: www.ieee.org. Some committees have not even reached agreement yet, for example on how to represent intervals for interval arithmetic. Eventually it is the people who make the processors who get the final word on how bits are interpreted as a number. But sticking to standards allows for portability and compatibility between different processors (or coprocessors: what if your GPU used a different number format? You'd have more to do than just copy data around).
Many alternatives to floating point values exist, like fixed point or arbitrary precision numbers, logarithmic number systems, rational arithmetic...
* Since 2 divides 10, you might argue that every number representable as a/2^b can be written as a*5^b/10^b, so fewer numbers need to be approximated. But that only covers a minuscule family (an ideal, really) of the rational numbers, which are an infinite set. So it still doesn't solve the need for approximations for many rationals, as well as for all irrational numbers (such as Pi).
** In fact, because we use powers of 2 we pack more significant digits after the decimal separator than we would with powers of 10 (for the same number of bits). That is, 2^-(53+e), the smallest bit of the mantissa of a double with exponent e, is much smaller than what you can reach with 53 bits of ASCII or of 4-bit base-10 digits: at best 10^-4 * 2^-e.

Truncating 64-bit IEEE doubles to 61-bits in a safe fashion

I am developing a programming language, September, which uses a tagged variant type as its main value type. 3 bits are used for the type (integer, string, object, exception, etc.), and 61 bits are used for the actual value (the actual integer, pointer to the object, etc.).
Soon, it will be time to add a float type to the language. I almost have the space for a 64-bit double, so I wanted to make use of doubles for calculations internally. Since I'm actually 3 bits short for storage, I would have to round the doubles off after each calculation - essentially resulting in a 61-bit double with a mantissa or exponent shorter by 3 bits.
But! I know floating point is fraught with peril and doing things which sound sensible on paper can produce disastrous results with FP math, so I have an open-ended question to the experts out there:
Is this approach viable at all? Will I run into serious error-accumulation problems in long-running calculations by rounding at each step? Is there some specific way in which I could do the rounding in order to avoid that? Are there any special values that I won't be able to treat that way (subnormals come to mind)?
Ideally, I would like my floats to be as well-behaved as a native 61-bit double would be.
I would recommend borrowing bits from the exponent field of the double-precision format. This is the method described in this article (that you would modify to borrow 3 bits from the exponent instead of 1). With this approach, all computations that do not use very large or very small intermediate results behave exactly as the original double-precision computation would. Even computations that run into the subnormal region of the new format behave exactly as they would if a 1+8+52 61-bit format had been standardized by IEEE.
By contrast, naively borrowing any number of bits at all from the significand introduces many double-rounding problems, which are all the more frequent because you are rounding from a 52-bit significand to a significand with only a few bits removed. Borrowing one bit from the significand, as you suggest in an edit to your question, would be the worst, with half the operations statistically producing double-rounded results that differ from what the ideal “native 61-bit double” would have produced. This means that instead of being accurate to 0.5 ULP, the basic operations would be accurate to 3/4 ULP, a dramatic loss of accuracy that would derail many existing, finely designed numerical algorithms that expect 0.5 ULP.
Three is a significant number of bits to borrow from an exponent that only has 11, though, and you could also consider using the single-precision 32-bit format in your language (calling the single-precision operations from the host).
Lastly, I give visibility here to another solution found by Jakub: borrow the three bits from the significand, and simulate round-to-odd for the intermediate double-precision computation before converting to the nearest number in 49-explicit-significand-bit, 11-exponent-bit format. If this way is chosen, it may be useful to remark that the rounding itself to 49 bits of significand can be achieved with the following operations:
if ((repr & 7) == 4)
    repr += (repr & 8) >> 1;    /* midpoint case: round to even */
else
    repr += 4;
repr &= ~(uint64_t)7;           /* round to the nearest */
Although it operates on an integer with the same representation as the double being considered, the above snippet works even if the number goes from normal to subnormal, from subnormal to normal, or from normal to infinite. You will of course want to set a tag in the three bits that have been freed as above. To recover a standard double-precision number from its unboxed representation, simply clear the tag with repr &= ~(uint64_t)7;.
This is a summary of my own research and of information from the excellent answer by Pascal Cuoq.
There are two places where we can truncate the 3-bits we need: the exponent, and the mantissa (significand). Both approaches run into problems which have to be explicitly handled in order for the calculations to behave as if we used a hypothetical native 61-bit IEEE format.
Truncating the mantissa
We shorten the mantissa by 3 bits, resulting in a 1s+11e+49m format. When we do that, performing calculations in double-precision and then rounding after each computation exposes us to double rounding problems. Fortunately, double rounding can be avoided by using a special rounding mode (round-to-odd) for the intermediate computations. There is an academic paper describing the approach and proving its correctness for all doubles - as long as we truncate at least 2 bits.
A portable implementation in C99 is straightforward. Since round-to-odd is not one of the available rounding modes, we emulate it by using fesetround(FE_TOWARDZERO) and then setting the last bit of the result if the FE_INEXACT flag is raised. After computing the final double this way, we simply round to nearest for storage.
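For concreteness, here is a minimal C99 sketch of that emulation for a single addition (the function name is made up; a real implementation would wrap every intermediate operation like this before the final rounding to 49 bits of significand):
#include <fenv.h>
#include <stdint.h>
#include <string.h>

#pragma STDC FENV_ACCESS ON

/* Add two doubles with round-to-odd: truncate toward zero, and if the result
   was inexact, force the last significand bit to 1. */
static double add_round_to_odd(double a, double b)
{
    int old_mode = fegetround();
    fesetround(FE_TOWARDZERO);
    feclearexcept(FE_INEXACT);
    double r = a + b;
    int inexact = fetestexcept(FE_INEXACT);
    fesetround(old_mode);
    if (inexact) {
        uint64_t bits;
        memcpy(&bits, &r, sizeof bits);
        bits |= 1;                       /* sticky last bit = round-to-odd */
        memcpy(&r, &bits, sizeof r);
    }
    return r;
}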
The format of the resulting float loses about 1 significant (decimal) digit compared to a full 64-bit double (from 15-17 digits to 14-16).
Truncating the exponent
We take 3 bits from the exponent, resulting in a 1s+8e+52m format. This approach (applied to a hypothetical introduction of 63-bit floats in OCaml) is described in an article. Since we reduce the range, we have to handle out-of-range exponents on both the positive side (by simply 'rounding' them to infinity) and the negative side. Doing this correctly on the negative side requires biasing the inputs to any operation in order to ensure that we get subnormals in the 64-bit computation whenever the 61-bit result needs to be subnormal. This has to be done a bit differently for each operation, since what matters is not whether the operands are subnormal, but whether we expect the result to be (in 61-bit).
The resulting format has significantly reduced range since we borrow a whopping 3 out of 11 bits of the exponent. The range goes down from 10^-308...10^308 to about 10^-38...10^38. Seems OK for computation, but we still lose a lot.
Comparison
Both approaches yield a well-behaved 61-bit float. I'm personally leaning towards truncating the mantissa, for three reasons:
the "fix-up" operations for round-to-odd are simpler, do not differ from operation to operation, and can be done after the computation
there is a proof of mathematical correctness of this approach
giving up one significant digit seems less impactful than giving up a big chunk of the double's range
Still, for some uses, truncating the exponent might be more attractive (especially if we care more about precision than range).

How do you perform floating point arithmetic on two floating point numbers?

Suppose I wanted to add, subtract, and/or multiply the following two floating point numbers that follow the format:
1 bit sign
3 bit exponent (bias 3)
6 bit mantissa
Can someone briefly explain how I would do that? I've tried searching online for helpful resources, but I haven't been able to find anything too intuitive. However, I know the procedure is generally supposed to be very simple. As an example, here are two numbers that I'd like to perform the three operations on:
0 110 010001
1 010 010000
To start, take the significand encoding and prefix it with a “1.”, and write the result with the sign determined by the sign bit. So, for your example numbers, we have:
+1.010001
-1.010000
However, these have different scales, because they have different exponents. The exponent of the second one is four less than the first one (010₂ compared to 110₂). So shift it right by four bits:
+1.010001
- .0001010000
Now both significands have the same scale (exponent 110₂), so we can perform normal arithmetic, in binary:
+1.010001
- .0001010000
_____________
+1.0011000000
Next, round the significand to the available bits (seven). In this case, the trailing bits are zero, so the rounding does not change anything:
+1.001100
At this point, we could have a significand that needed more shifting, if it were greater than 2 (10₂) or less than 1. However, this significand is just where we want it, between 1 and 2. So we can keep the exponent as is (110₂).
Convert the sign back to a bit, take the leading “1.” off the significand, and put the bits together:
0 110 001100
Exceptions would arise if the number overflowed or underflowed the normal exponent range, but those did not happen here.
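As a sanity check, the three encodings in this example can be decoded in C; decode() below is a throwaway helper for this 1+3+6 toy format (bias 3, implicit leading 1):
#include <math.h>
#include <stdio.h>

/* Decode the toy format used above: (-1)^sign * 1.mantissa * 2^(exp - 3). */
static double decode(unsigned sign, unsigned exp, unsigned mant)
{
    double significand = 1.0 + mant / 64.0;   /* 6 stored mantissa bits */
    return (sign ? -1.0 : 1.0) * ldexp(significand, (int)exp - 3);
}

int main(void)
{
    double a = decode(0, 6, 0x11);   /* 0 110 010001 ->  10.125 */
    double b = decode(1, 2, 0x10);   /* 1 010 010000 ->  -0.625 */
    double r = decode(0, 6, 0x0C);   /* 0 110 001100 ->   9.5   */
    printf("%g + %g = %g, encoded result decodes to %g\n", a, b, a + b, r);
    return 0;
}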

Hardware implementation of square root?

I'm trying to find a bit more information about efficient square root algorithms, most likely implemented on an FPGA. A lot of algorithms can already be found, but which ones are used, for example, by Intel or AMD?
By efficient I mean they are either really fast or they don't need much memory.
EDIT: I should probably mention that the input is generally a floating point number, and most hardware implements the IEEE 754 standard, where the number is represented as 1 sign bit, an 8-bit biased exponent and a 23-bit mantissa.
Thanks!
Not a full solution, but a couple of pointers.
I assume you're working in floating point, so point 1 is to remember that a floating point number is stored as a mantissa and an exponent. The exponent of the square root will be approximately half the exponent of the original number, thanks to logarithms.
Then the mantissa can be approximated with a look-up table, and you can use a couple of Newton-Raphson rounds to add some accuracy to the result from the LUT.
I haven't implemented anything like this for about 8 years, but I think this is how I did it and was able to get a result in 3 or 4 cycles.
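Here is a minimal software sketch of that recipe in C, not an FPGA design: halve the exponent for the initial guess (where hardware would also use a small LUT for the mantissa), then polish with a few Newton-Raphson rounds.
#include <math.h>
#include <stdio.h>

static double sqrt_newton(double x)
{
    if (x < 0.0)  return NAN;
    if (x == 0.0) return 0.0;

    int e;
    double m = frexp(x, &e);            /* x = m * 2^e, with 0.5 <= m < 1 */
    if (e & 1) { m *= 2.0; e -= 1; }    /* make the exponent even */

    double y = ldexp(0.5 + 0.5 * m, e / 2);   /* crude guess for sqrt(m) * 2^(e/2) */
    for (int i = 0; i < 6; i++)               /* each round roughly doubles the correct bits */
        y = 0.5 * (y + x / y);
    return y;
}

int main(void)
{
    printf("%.17g vs %.17g\n", sqrt_newton(2.0), sqrt(2.0));
    return 0;
}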
This is a great one for fast inverse square root.
Have a look at it here. Notice it's pretty much about the initial guess, rather amazing document :)
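For reference, the widely circulated trick looks roughly like this in C (a sketch, not production code): the magic constant supplies the initial guess mentioned above, and one Newton-Raphson step brings the error down to a fraction of a percent.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static float rsqrt_fast(float x)
{
    float half = 0.5f * x;
    uint32_t i;
    memcpy(&i, &x, sizeof i);          /* reinterpret the float's bits as an integer */
    i = 0x5f3759df - (i >> 1);         /* rough 1/sqrt(x) by manipulating the exponent bits */
    float y;
    memcpy(&y, &i, sizeof y);
    y = y * (1.5f - half * y * y);     /* one Newton-Raphson refinement */
    return y;
}

int main(void)
{
    printf("%f (exact: %f)\n", rsqrt_fast(4.0f), 0.5f);
    return 0;
}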
