I am using MT4, but this might be a general question about floating-point numbers.
I am using the NormalizeDouble function, which rounds a number to a given number of digits, like this:
double x = 1.33242;
double y = NormalizeDouble(x, 2); // y is 1.33
However, in some cases, even after rounding with NormalizeDouble, I get a number such as 0.800000001.
I have no idea why this happens or how to fix it. It might be a basic mathematical thing.
You are rounding to powers of 10, but the fractional part of a float/double can exactly express only powers of 2, like
0.5, 0.25, 0.125, ...
and numbers decomposable into them. Hence your case:
0.8 = 1/2 + 1/4 + 1/32 + 1/64 + 1/512 + 1/1024 + 1/8192 + 1/16384 + ...
    = 0.5 + 0.25 + 0.03125 + 0.015625 + 0.001953125 + 0.0009765625 + 0.0001220703125 + 0.00006103515625 + ...
    = 0.11001100110011... [bin]
As you can see, 0.8 is a periodic number in binary and will always leave some noise in the lower bits of the mantissa. The FPU tries to find the representable number closest to your desired value, hence the 0.800000001.
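NormalizeDouble itself is MQL4, but you can see the underlying double behavior with a small Python sketch (the decimal module prints the exact value actually stored):

```python
from decimal import Decimal

# The double closest to 0.8 -- the exact value actually stored:
print(Decimal(0.8))
# 0.8000000000000000444089209850062616169452667236328125

# Rounding to 2 digits does not help: the rounded result is again
# stored as the nearest representable double, e.g. for 0.3:
print(Decimal(round(0.3, 2)))
# 0.299999999999999988897769753748434595763683319091796875
```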
I'm facing a problem. When we want to subtract one number from another, we can use 2's complement, but I don't know how to subtract fractional numbers using 2's complement.
5 in binary is 101 and 2 is 10. If we want to subtract 2 from 5, we find the 2's complement of 2:
2's complement of 2 (in 8 bits) -> 11111110
If we now add this to the binary form of 5, we get the subtraction result. If I want the result of 5.5 - 2.125, what would be the procedure?
Fixed-point numbers can be used, and it is still common to find them in embedded code or hardware.
Their use is identical to integers, but you need to specify where your "point" is. For instance, assume that you want 3 bits after the point and that your data is 8 bits: bits 7..3 are the integer part (left of the point) and bits 2..0 the fractional part. The interpretation of the integer part is, as usual, the binary decomposition of that integer: bit 3 corresponds to 2^0, bit 4 to 2^1, etc.
For the fractional part, the decomposition is in negative powers of two: bit 2 corresponds to 2^-1, bit 1 to 2^-2, and bit 0 to 2^-3.
So for your problem, 5.5 = 4 + 1 + 1/2 = 2^2 + 2^0 + 2^-1, and its code is 00101(.)100. Similarly, 2.125 = 2 + 1/8, and its code is 00010(.)001 (note that "(.)" is just an aid to understanding the coding).
Indeed, they are just integers, but you must take into account that all your numbers are multiplied by 2^-3. This has no impact on addition, but the results of multiplication and division must be adjusted. Taking into account the position of the point and managing overflow and underflow are the difficulties of fixed-point arithmetic, but it allows fractional computations even when your hardware does not provide floating-point support (for instance on low-end microcontrollers or FPGA systems).
Two's complement works just as it does for integers, and its computation is identical. If the code of 2.125 is 00010(.)001, then -2.125 == 11101(.)111. Operations are as usual:
  +5.5    00101(.)100
-2.125    11101(.)111
          00011(.)011
and 00011(.)011 = 2 + 1 + 1/4 + 1/8 = 3.375
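Here is a minimal Python sketch of this Q5.3 arithmetic (8-bit values, 3 fractional bits; the function names are illustrative, not a standard API):

```python
FRAC_BITS = 3
WIDTH = 8
MASK = (1 << WIDTH) - 1                        # keep results in 8 bits

def to_fixed(x: float) -> int:
    return round(x * (1 << FRAC_BITS)) & MASK  # e.g. 5.5 -> 00101100

def negate(a: int) -> int:
    return (~a + 1) & MASK                     # two's complement

def to_float(a: int) -> float:
    if a & (1 << (WIDTH - 1)):                 # sign bit set: negative value
        a -= 1 << WIDTH
    return a / (1 << FRAC_BITS)

diff = (to_fixed(5.5) + negate(to_fixed(2.125))) & MASK
print(bin(diff), to_float(diff))               # 0b11011 3.375
```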
For the record, two's complement was first used for fixed-point fractional numbers, and the name "two's complement" comes from that. If a fractional number is represented by, say, 0(.)1100000 (0.75), its negative counterpart will be 1(.)0100000 (-0.75, or 1.25 if interpreted as unsigned), and we always have x + (unsigned)-x == 2. For this coding, the negative value of a fractional number x is the number y that must be added to x to get 2, hence the name: y is the 2's complement of x.
We know that floating point is broken, because decimal numbers can't always be perfectly represented in binary. They're rounded to a number that can be represented in binary; sometimes that number is higher, and sometimes it's lower. In this case using the ubiquitous IEEE 754 double format both 0.1 and 0.4 round higher:
0.1 = 0.1000000000000000055511151231257827021181583404541015625
0.4 = 0.40000000000000002220446049250313080847263336181640625
Since both of these numbers are high, you'd expect their sum to be high as well. Perfect addition should give you 0.5000000000000000277555756156289135105907917022705078125, but instead you get a nice exact 0.5. Why?
The question "Is floating point math broken?" was already identified above, but this question is different. It is asking for a further level of detail on a non-intuitive result when taking the answers of that question into consideration.
This calculation behaves this way because the addition pushes the result into another (binary) order of magnitude. This adds a significant bit to the left (most-significant side) and therefore has to drop a bit on the right (least-significant side).
The first number, 0.1, is stored in binary as a number between 2^-4 == 1/16 and 2^-3 == 1/8. The second number, 0.4, is stored in binary as a number between 2^-2 == 1/4 and 2^-1 == 1/2. The sum, 0.5, is the number 2^-1 == 1/2 or a little larger. This is a mis-match in magnitudes and can cause loss of digits.
Here is an example that is easier to understand. Let's say we are working on a decimal computer that can store four decimal digits in floating point. Let's also say we want to add the numbers 10/3 and 20/3. These may end up stored as
3.334
and
6.667
both of which are a little high. When we get those numbers, we expect the sum to be also a little high, namely
10.001
but notice that we have moved into a new order of magnitude with our result. The full result has five decimal digits, which will not fit. So the computer rounds the result to just four decimal digits and gets the sum
10.00
which surprisingly is the correct exact answer to 10/3 + 20/3.
I get the same kind of thing often in my U.S. high-school Chemistry and Physics classes. When a calculation moves to a new order of magnitude, strange things happen with precision and significant digits.
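You can reproduce the four-digit decimal computer with Python's decimal module (a small sketch; the stored values 3.334 and 6.667 are taken from the example above):

```python
from decimal import Decimal, getcontext

getcontext().prec = 4    # a "four decimal digit" computer

a = Decimal("3.334")     # 10/3, stored a little high
b = Decimal("6.667")     # 20/3, stored a little high
print(a + b)             # 10.00 -- the exact 10.001 is rounded to 4 digits
```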
Although most decimal numbers need to be rounded to fit into binary, some don't. 0.5 can be exactly represented in binary, since it's 2^-1.
Floating point isn't just binary, it also has limited precision. Here is the exact sum and the two closest IEEE 754 double representable numbers on either side:
0.5000000000000000277555756156289135105907917022705078125
0.5000000000000000000000000000000000000000000000000000000
0.5000000000000001110223024625156540423631668090820312500
It's clear that the exact 0.5 is closest to the true sum. IEEE 754 specifies how the basic operations round their results; in the default round-to-nearest mode you can rely on the closest representable value being returned.
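A short Python check (the decimal module prints the exact stored values):

```python
from decimal import Decimal

print(0.1 + 0.4 == 0.5)    # True
print(Decimal(0.1))        # 0.1000000000000000055511151231257827021181583404541015625
print(Decimal(0.4))        # 0.40000000000000002220446049250313080847263336181640625
print(Decimal(0.1 + 0.4))  # 0.5 -- the rounded sum is exactly representable
```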
I need to multiply X by a floating-point number, but I don't have floating-point operations on my processor. I understand the method, but I don't know why it works.
Suppose we want to multiply 2 * 4.5 in decimal. I do the following:
2 * 4.5 (4.5 is 100.1 in binary)
So I multiply 2 * 1001 (binary) = 2 * 9 = 18 and then shift right by 1:
18 >> 1 = 9
Is it that we represent 2 in fixed point and 4.5 in fixed point, and as we multiply a Q1.1 by a Q1.1 we get a Q2.2 result, which we shift right to get back to Q1.1? Is this right?
In decimal, your fixed-point example is actually:
2 * 4.5
2 * 45 (after multiplying by 10) = 90
90 / 10 = 9 (after dividing the 10 back out)
In binary, the same thing is being done, but with powers of 2 instead of powers of 10 (as the factors/divisors). Fixed-point operations occur in purely integral space after the appropriate multiplications, and multiplying or dividing by a power of 2 is just a left shift or right shift respectively on the binary number (very fast for the CPU). In fixed point, the number of bits to the left (integer) and right (fractional) of the binary point is fixed (predetermined), which means that some numbers cannot be represented on that scale without loss of precision.
Floating point extends the concept by allowing the number of bits assigned to the left and right of the point to be flexible. In floating point, every number is represented as an integral "significand" (or mantissa) scaled by a specified power (typically a power of 2). This representation maintains the same number of significant digits over a much greater dynamic range (for very small or very large magnitudes). Most of the bits are assigned to the significant digits of the mantissa, and fewer to the digits of the exponent. Floating-point calculations are more expensive (time-wise) than fixed-point, which is why fixed point remains popular in microcontrollers and embedded systems.
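As a concrete sketch of the fixed-point multiply (Python, with an assumed format of 3 fractional bits; the names are illustrative):

```python
FRAC_BITS = 3
SCALE = 1 << FRAC_BITS               # 8

def to_fixed(x: float) -> int:
    return round(x * SCALE)          # 4.5 -> 36 (100.100 in binary)

def fixed_mul(a: int, b: int) -> int:
    # the raw product carries 2*FRAC_BITS fractional bits;
    # shift right to return to FRAC_BITS
    return (a * b) >> FRAC_BITS

a, b = to_fixed(2.0), to_fixed(4.5)  # 16, 36
print(fixed_mul(a, b) / SCALE)       # 9.0
```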
If I didn't answer your question, please elaborate and I will edit this answer to include the information you desire.
I have a pretty easy question (I think). As much as I've tried, I cannot find an answer to it.
I am creating a function for which I want the user to enter two numbers. The first is the number of terms of a certain infinite series to add together. The second is the number of digits the user would like the truncated sum to be accurate to.
Say the terms of the sequence are a_i. How much precision n would be required in MPFR to guarantee that the sum of the a_i, from i = 0 up to the user's entered value, is accurate to the number of digits the user needs?
By the way, I'm adding the a_i in a naive way.
Any help will be much appreciated.
Thanks,
Rick
You can convert between decimal digits of precision, d, and binary digits of precision, b, with logarithms
b = d × log(10) / log(2)
A little rearranging shows why:
b × log(2) = d × log(10)
log(2^b) = log(10^d)
2^b = 10^d
Each term of the series (and each addition) will introduce a rounding error at the least significant digit, so, assuming each of the t terms involves n (two-argument) arithmetic operations, you will want to add an extra
log(t * (n+2))/log(2)
bits.
You'll need to round the number of bits of precision up to be sure that you have enough room for your decimal digits of precision
b = ceil((d*log(10.0) + log(t*(n+2)))/log(2.0));
Finally, you should be aware that the terms may introduce cancellation errors, in which case this simple calculation will dramatically underestimate the required number of bits, even assuming I've got it right in the first place ;-)
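Putting that together as a Python sketch of the formula above (d, t, n as defined in this answer):

```python
import math

def required_bits(d: int, t: int, n: int) -> int:
    """Bits of binary precision for d decimal digits, t terms,
    and n two-argument operations per term (per the estimate above)."""
    return math.ceil((d * math.log(10.0) + math.log(t * (n + 2))) / math.log(2.0))

print(required_bits(d=50, t=1000, n=3))  # 179
```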
If I enter a value, for example
1234567 ^ 98787878
into Wolfram Alpha it can provide me with a number of details. This includes decimal approximation, total length, last digits etc. How do you evaluate such large numbers? As I understand it a programming language would have to have a special data type in order to store the number, let alone add it to something else. While I can see how one might approach the addition of two very large numbers, I can't see how huge numbers are evaluated.
10^2 could be calculated through repeated addition. However, a number such as the example above would require a gigantic loop. Could someone explain how such large numbers are evaluated? Also, how could someone create a custom datatype to support large numbers, in C# for example?
Well, it's quite easy, and you could have done it yourself.
The number of digits can be obtained via a logarithm:
since `A^B = 10 ^ (B * log(A, 10))`
we can compute, in our case (A = 1234567, B = 98787878),
`B * log(A, 10) = 98787878 * log(1234567, 10) = 601767807.4709646...`
The integer part + 1 (601767807 + 1 = 601767808) is the number of digits.
The first, say, five digits can be obtained via the logarithm as well; now we analyze the fractional part of
B * log(A, 10) = 98787878 * log(1234567, 10) = 601767807.4709646...
f = 0.4709646...
The first digits are 10^f (with the decimal point removed) = 29577...
The last, say, five digits can be obtained as the corresponding remainder:
last five digits = A^B rem 10^5
A rem 10^5 = 1234567 rem 10^5 = 34567
A^B rem 10^5 = ((A rem 10^5)^B) rem 10^5 = (34567^98787878) rem 10^5 = 45009
last five digits are 45009
You may find BigInteger.ModPow (C#) very useful here
Finally
1234567^98787878 = 29577...45009 (601767808 digits)
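A compact Python version of this recipe (Python's built-in pow(a, b, m) plays the role of BigInteger.ModPow; note that the leading digits rely on double-precision log10, so only the first several of them are trustworthy):

```python
import math

A, B = 1234567, 98787878

x = B * math.log10(A)      # ~601767807.4709646
digits = math.floor(x) + 1 # 601767808

leading = 10 ** (x % 1.0)  # ~2.9577... -> first digits 29577...

trailing = pow(A, B, 10**5)  # modular exponentiation: 45009

print(digits, str(leading).replace(".", "")[:5], trailing)
```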
There are usually libraries providing a bignum datatype for arbitrarily large integers (e.g. mapping bits k*n ... (k+1)*n-1, for k = 0 .. some m depending on n and the number's magnitude, to machine words of size n, and redefining the arithmetic operations). For C#, you might be interested in BigInteger.
Exponentiation can be recursively broken down:
pow(a, 2*b)   = pow(a, b) * pow(a, b);
pow(a, 2*b+1) = pow(a, b) * pow(a, b) * a;
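A minimal sketch of that recursion in Python (computing the half power once, so the cost is O(log b) multiplications):

```python
def fast_pow(a: int, b: int) -> int:
    # exponentiation by squaring
    if b == 0:
        return 1
    half = fast_pow(a, b // 2)
    return half * half * (a if b % 2 else 1)

print(fast_pow(3, 5))  # 243
```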
There also are number-theoretic results that have engendered special algorithms to determine properties of large numbers without actually computing them (to be precise: without their full decimal expansion).
To compute how many digits there are, one uses the following expression:
decimal_digits(n) = 1 + floor(log_10(n))
This gives:
decimal_digits(1234567^98787878) = 1 + floor(log_10(1234567^98787878))
= 1 + floor(98787878 * log_10(1234567))
= 1 + floor(98787878 * 6.0915146640862625)
= 1 + floor(601767807.4709647)
= 601767808
The trailing k digits are computed by doing exponentiation mod 10^k, which keeps the intermediate results from ever getting too large.
The approximation will be computed using a (software) floating-point implementation that effectively evaluates a^(98787878 log_a(1234567)) to some fixed precision for some number a that makes the arithmetic work out nicely (typically 2 or e or 10). This also avoids the need to actually work with millions of digits at any point.
There are many libraries for this, and the capability is built in in the case of Python. You seem primarily concerned with the size of such numbers and the time it may take to do computations like the exponentiation in your example, so I'll explain a bit.
Representation
You might use an array to hold all the digits of large numbers. A more efficient way would be to use an array of 32 bit unsigned integers and store "32 bit chunks" of the large number. You can think of these chunks as individual digits in a number system with 2^32 distinct digits or characters. I used an array of bytes to do this on an 8-bit Atari800 back in the day.
Doing math
You can obviously add two such numbers by looping over all the digits and adding elements of one array to the other and keeping track of carries. Once you know how to add, you can write code to do "manual" multiplication by multiplying digits and putting the results in the right place and a lot of addition - but software will do all this fairly quickly. There are faster multiplication algorithms than the one you would use manually on paper as well. Paper multiplication is O(n^2) where other methods are O(n*log(n)). As for the exponent, you can of course multiply by the same number millions of times but each of those multiplications would be using the previously mentioned function for doing multiplication. There are faster ways to do exponentiation that require far fewer multiplies. For example you can compute x^16 by computing (((x^2)^2)^2)^2 which involves only 4 actual (large integer) multiplications.
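For instance, here is a toy sketch of the add-with-carry loop over 32-bit "digits" (little-endian limb order; purely illustrative, not how a real library is tuned):

```python
MASK = (1 << 32) - 1                 # one 32-bit "digit"

def big_add(a: list[int], b: list[int]) -> list[int]:
    """Add two little-endian lists of 32-bit limbs."""
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        s = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) + carry
        out.append(s & MASK)         # keep the low 32 bits
        carry = s >> 32              # propagate the carry
    if carry:
        out.append(carry)
    return out

print(big_add([MASK, 1], [1]))       # [0, 2] -> 2**33
```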
In practice
It's fun and educational to try writing these functions yourself, but in practice you will want to use an existing library that has been optimized and verified.
I think part of the answer is in the question itself :) To store these expressions, you can store the base (or mantissa) and the exponent separately, as scientific notation does. Beyond that, you cannot possibly evaluate the expression completely and store such a large number, although you can theoretically predict certain properties of the resulting value. I will take you through each of the properties you talked about:
Decimal approximation: Can be calculated by evaluating simple log values.
Total number of digits: for an expression a^b, this can be calculated by the formula
Digits = floor(1 + log10(a^b)), where floor is the greatest integer not exceeding the number. For example, the number of digits in 10^5 is 6.
Last digits: these can be calculated using the fact that the last digits of successive powers repeat in a fixed cycle. For example, at the units place, 7, 9, 3, 1 repeats for powers 7^x. So you can tell that if x % 4 is 0, the last digit is 1 (see the snippet below).
Whether someone can create a custom datatype for large numbers I can't say, but I am sure the full number won't be evaluated and stored.
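A one-liner to see the units-digit cycle mentioned above (plain Python, using pow with a modulus):

```python
# last digit of 7^x for x = 1..8 cycles with period 4
print([pow(7, x, 10) for x in range(1, 9)])  # [7, 9, 3, 1, 7, 9, 3, 1]
```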