The required number of digits (in base t) to represent a double in base t?

Theorem:
The number of digits required to represent the positive integer S in base t is ⌊log_t(S)⌋ + 1, where ⌊·⌋ denotes the floor function.
I wondered: what is the required number of digits (in base 2) to represent the maximum positive double (floating-point) number on a computer? I have a 64-bit OS with 32-bit R on it. Hence, I did:
.Machine$double.xmax # 1.797693e+308
typeof(.Machine$double.xmax) # double
floor(log(.Machine$double.xmax, 2))+1 # 1025
.Machine$integer.max # 2147483647
class(.Machine$integer.max) # integer
floor(log(.Machine$integer.max, 2))+1 # 31; (1 bit for sign bit)
So the theorem checks out for integers.
(1) But what about the double equivalent of the theorem? That is, what is the required number of digits (in base t) to represent a double in base t?
(2) This may be difficult for real numbers with fractional parts, so perhaps someone knows the equivalent of the theorem for reals without a fractional part (i.e., integer values > 2147483647).
In particular, where does the 1025 above come from?
(3) Would I get 63 if I used a 64-bit OS and 64-bit R for the following?
floor(log(.Machine$integer.max, 2))+1 # 63??; (1 bit for sign bit??)

Ad (3): I don't know about doubles, but the internal integer representation is still 32 bits, even on 64-bit systems. If you want to go bigger you need some sort of library for that, for example 'bit64'.
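A minimal sketch of what 'bit64' gives you (assuming the package is installed):
library(bit64)
x <- as.integer64(.Machine$integer.max) + 1L  # one past the 32-bit limit
x
# [1] 2147483648
lim.integer64()  # the representable 64-bit signed range
# [1] -9223372036854775807  9223372036854775807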
You will get more detailed information with help(double) and help(integer).
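As for where the 1025 comes from: .Machine$double.xmax equals (2 - 2^-52) * 2^1023, so its base-2 logarithm is mathematically just below 1024, and the theorem would give 1023 + 1 = 1024 binary digits. But the gap to 1024 (about 1.6e-16) is far smaller than the spacing of doubles near 1024, so on an IEEE 754 platform the computed logarithm rounds to exactly 1024 and the formula overshoots by one:
.Machine$double.xmax == (2 - 2^-52) * 2^1023  # TRUE
log2(.Machine$double.xmax)  # [1] 1024 -- rounds up to exactly 1024
floor(log(.Machine$double.xmax, 2)) + 1  # 1025, one more than the true count of 1024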

Related

R random number generator faulty?

I was looking into the RNG of base R and was curious whether the 32-bit implementation of the Mersenne-Twister might become limiting when large quantities of random numbers are needed, so I ran a simple test:
set.seed(8)
length(unique(runif(1e8)))
# [1] 98845641
1e8 - 98845641
# 1154359
So it turns out that there are indeed numerous duplicates among the 100 million draws.
When I switch to the 64-bit version of the MT RNG implemented by the dqrng package, the problem does not appear.
Question 1:
Does the "64 bit" here refer to the type of floating-point numbers used?
Question 2:
Am I right to conclude that, because of the larger span of possible numbers (64-bit FP vs. 32-bit FP), duplicates are less likely when using the 64-bit MT?
from ?Random:
Do not rely on randomness of low-order bits from RNGs. Most of the supplied uniform generators return 32-bit integer values that are converted to doubles, so they take at most 2^32 distinct values and long runs will return duplicated values.
Indeed, when we calculate the expected number of unique values lost to duplicates (each of the n draws collides with at least one of the other n - 1 with probability 1 - (1 - 1/M)^(n-1); collisions almost always come in pairs, each costing one unique value, hence the division by two), we get
M <- 2^32
n <- 1e8
(n * (1 - (1 - 1 / M)^(n - 1))) / 2
# [1] 1150705
which is very close to the result that you have.
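As a cross-check, the lost uniques can also be counted directly from the draws themselves (same seed as in the question; note the vector takes roughly 800 MB of memory):
set.seed(8)
x <- runif(1e8)
sum(duplicated(x))  # identical to 1e8 - length(unique(x))
# [1] 1154359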

R largest/smallest representable numbers

I'm trying to get the largest/smallest representable number in R.
After typing ".Machine"
I got:
$double.xmin
[1] 2.225074e-308
$double.xmax
[1] 1.797693e+308
However, even if I type 2.225074e-309 at the R command prompt, I get 2.225074e-309 back instead of the expected 0.
How can I find the largest/smallest number for which adding or subtracting 1 would lead to either Inf (adding 1 to the largest number) or 0 (subtracting 1 from the smallest number)?
.Machine$double.xmin gives the value of the smallest positive number whose representation meets the requirements of IEEE 754 technical standard for floating point computation. As is mentioned in the Wikipedia article on double-precision floating point numbers, that standard requires that:
If a decimal string with at most 15 significant digits is converted to IEEE 754 double precision representation and then converted back to a string with the same number of significant digits, then the final string should match the original. If an IEEE 754 double precision is converted to a decimal string with at least 17 significant digits and then converted back to double, then the final number must match the original.
The same article goes on to note that, by compromising precision, even smaller positive numbers (which do not meet the standards' precision requirements) can be represented:
The 11-bit width of the exponent allows the representation of numbers between 10^-308 and 10^308, with full 15–17 decimal digits of precision. By compromising precision, the subnormal representation allows even smaller values up to about 5 × 10^-324.
R's doubles behave in exactly this way, as is noted in the Details section of ?.Machine:
Note that on most platforms smaller positive values than
‘.Machine$double.xmin’ can occur. On a typical R platform the
smallest positive double is about ‘5e-324’.
To confirm that that is the smallest positive value that can be represented using R's doubles and to see the cost in loss of precision, try out a few operations like this:
5e-324
# [1] 4.940656e-324
2e-324
# [1] 0
1.4 * 5e-324
# [1] 4.940656e-324
1.6 * 5e-324
# [1] 9.881313e-324
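On the Inf side of the question asked above: adding 1 to the largest double leaves it unchanged, because 1 is far below the gap between adjacent doubles at that magnitude (2^971); it takes a much larger step, such as doubling, to overflow:
.Machine$double.xmax + 1  # [1] 1.797693e+308 -- unchanged
.Machine$double.xmax * 2  # [1] Inf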
Here are the corresponding representations using SAS (IEEE 754, big-endian):
data _null_;
y=constant('big');
put y hex16.;
put y E21.3;
run;quit;
Biggest
7FEFFFFFFFFFFFFF
1.79769313486230E+308
data _null_;
y=constant('small');
put y hex16.;
put y E21.3;
run;quit;
Smallest
0010000000000000
2.22507385850720E-308
I am not sure about the smallest value, because SAS may set aside some bit patterns for missing values.
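For comparison, the same IEEE 754 bit patterns can be inspected from R using C99 hex-float formatting (the "%a" format is documented in ?sprintf):
sprintf("%a", .Machine$double.xmax)  # "0x1.fffffffffffffp+1023" -- SAS hex16: 7FEFFFFFFFFFFFFF
sprintf("%a", .Machine$double.xmin)  # "0x1p-1022" -- SAS hex16: 0010000000000000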

How to do Division of two fixed point 64 bits variables in Synthesizable Verilog?

I'm implementing a math equation in Verilog in a combinational scheme (assign = ...). So far the synthesis tool (Quartus II) has handled add, sub and mul of 32-bit unsigned numbers easily, using the "+", "-" and "*" operators respectively.
However, one of the final steps of the equation is to divide two 64-bit unsigned fixed-point variables. The reason for such a large 64-bit width is that I'm devoting 16 bits to the integer part and 48 bits to the fraction (although the computer does everything in binary and doesn't care about fractions, I can separate the fraction from the integer at the end).
The problem is that the "/" operator is useless, since it auto-invokes a so-called "LPM_divide" library whose output gives me only the integer part, disregarding the fraction, and in the wrong position (at the least significant bit).
For example:
b1000111010000001_000000000000000000000000000000000000000000000000 / b1000111010000001_000000000000000000000000000000000000000000000000
should be 1, it gives me
b0000000000000000_000000000000000000000000000000000000000000000001
So, how can I make this division synthesizable in Verilog? What methods or algorithms should I follow? I'd like it to be fast, maybe fully combinational.
I'd like to keep the 16-bit integer / 48-bit fraction layout from the user's point of view. Thanks in advance.
First, assume you multiply two fixed-point numbers.
Let's call them X and Y, the first containing Xf fractional bits and the second Yf fractional bits.
If you multiply those numbers as integers, the least significant Xf + Yf bits of the integer result can be treated as the fractional bits of the resulting fixed-point number (and you still multiply them as integers).
Similarly, if you divide a number with Sf fractional bits by a number with Df fractional bits, the resulting integer can be treated as a fixed-point number with Sf - Df fractional bits -- hence your example, whose result is the plain integer 1 (48 - 48 = 0 fractional bits).
Thus, if you need 48 fractional bits from your division of one 16.48 number by another 16.48 number, append 48 zeroed fractional bits to the dividend, then divide the resulting 64 + 48 = 112-bit number by the 64-bit divisor, treating both as integers (and still using LPM_divide). The least significant 48 bits of the result will then be what you need: the 48 fractional bits of the resulting fixed-point quotient.
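The same scaling trick can be checked numerically. Here is a toy sketch in R with a hypothetical 4-bit fraction instead of 48 (small enough that plain integer arithmetic holds every intermediate exactly):
f <- 4  # fractional bits of the toy fixed-point format
a <- 2.5 * 2^f  # 40: fixed-point encoding of 2.5
b <- 0.5 * 2^f  # 8: fixed-point encoding of 0.5
a %/% b  # 5, which read as fixed point is 5/2^4 = 0.3125 -- the LPM_divide problem
((a * 2^f) %/% b) / 2^f  # 5: pre-shifting the dividend by f bits keeps f fractional bits, and 2.5/0.5 is indeed 5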

Do not want large numbers to be rounded off in R

options(scipen=999)
625075741017804800
625075741017804806
When I type the above two numbers into the R console, I get the same output for both: 625075741017804800.
How do I avoid that?
Numbers greater than 2^53 cannot be stored unambiguously in R's numeric-classed vectors. There was a recent change to allow exact integer storage within numerics, but your number is larger than even that increased capacity for precision:
625075741017804806 > 2^53
[1] TRUE
Prior to that change, integers could only be stored up to .Machine$integer.max == 2147483647. Numbers larger than that value get silently coerced to class 'numeric'. You will either need to work with them as character values or install a package capable of arbitrary precision. Rmpfr and gmp are two that come to mind.
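A minimal sketch with 'gmp' as one such alternative (assuming the package is installed); note the number is passed as a character string so it never gets rounded through a double:
library(gmp)
as.bigz("625075741017804806")
# Big Integer ('bigz') :
# [1] 625075741017804806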
You can use the package Rmpfr for arbitrary precision:
dig <- mpfr("625075741017804806")
print(dig, 18)
# 1 'mpfr' number of precision 60 bits
# [1] 6.25075741017804806e17

Calculations precision level in R

I am working in R with very small numbers that reflect probabilities in a Maximum Likelihood Estimation algorithm. Some of these numbers are as small as 1e-155 (or smaller). However, when something as simple as a summation takes place, the precision level seems truncated to the least precise term, which ruins the precision of my calculations and produces meaningless results.
Example:
> sum(c(7.831908e-70,6.002923e-26,6.372573e-36,5.025015e-38,5.603268e-38,1.118121e-14, 4.512098e-07,4.400717e-05,2.300423e-26,1.317602e-58))
[1] 4.445838e-05
As seen from the example, the result is on the scale of 1e-5, which very rudely rounds away the more sensitive terms.
Is there a way around this? Why is R choosing such strange automatic behavior? Or perhaps it is not really doing this and I just see the result in truncated form? In that case, is the actual number stored in the variable with the correct precision?
There is no precision loss in your sum. But if you're worried about it, you should use a multiple-precision library:
library("Rmpfr")
x <- c(7.831908e-70,6.002923e-26,6.372573e-36,5.025015e-38,5.603268e-38,1.118121e-14, 4.512098e-07,4.400717e-05,2.300423e-26,1.317602e-58)
sum(mpfr(x, 1024))
# 1 'mpfr' number of precision 1024 bits
# [1] 4.445837981118120898327314579322617633703674840117902103769961398533293289165193843930280422747754618577451267010103975610356319174778512980120125435961577770470993217990999166176083700886405875414277348471907198346293122011042229843450802884152750493740313686430454254150390625000000000000000000000000000000000e-5
Your results are only truncated in the display.
Try:
x <- sum(c(7.831908e-70,6.002923e-26,6.372573e-36,5.025015e-38,5.603268e-38,1.118121e-14, 4.512098e-07,4.400717e-05,2.300423e-26,1.317602e-58))
print(x, digits=22)
[1] 4.445837981118121081878e-05
You can read more about the behaviour of print at ?print.default
You can also set an option; this will affect all subsequent calls to print:
options(digits=22)
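With the option set, subsequent auto-printing shows the extra digits as well:
x
# [1] 4.445837981118121081878e-05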
Have you ever heard about floating-point numbers?
There is essentially no loss of precision (significant figures) in multiplication or division, as long as the result stays between about 4.9·10^-324 and 1.7976931348623157·10^308 (see the link below for details).
So if you do 1.0e-30 * 1.0e-10, the result will be 1.0e-40,
but if you do 1.0e-30 + 1.0e-10, the result will be 1.0e-10.
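A quick check of both claims in R:
1.0e-30 * 1.0e-10  # [1] 1e-40 -- the significant figures survive
1.0e-30 + 1.0e-10  # [1] 1e-10 -- the smaller addend is absorbed below the ULP of the larger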
Why? A computer word can represent only a finite set of numbers: with 64 bits there are at most 2^64 distinct representations.
One option is a direct encoding, as for integer numbers, which represents every integer in a fixed range (for 64-bit signed integers, roughly -2^63 to +2^63, i.e. about ±9.2·10^18).
The cleverer alternative is floating point: it spans roughly from ±4.9·10^-324 to ±1.7976931348623157·10^308 and can represent or approximate the rational numbers across that whole range.
So in floating point, to achieve the wider range, precision in sums is sacrificed. There is loss of precision during sums or subtractions because the significant figures that the fraction part of a 64-bit floating-point number can hold (52 stored bits plus one implicit leading bit) amount to only about log10(2^53) ≈ 16 decimal digits.
If you look for a basic everyday example, take summary() of an lm fit: when a parameter's p-value is near zero, summary() outputs < 2.2e-16 (no coincidence: that threshold is .Machine$double.eps).
Why stay limited to 64 bits? CPUs have execution units built specifically for 64-bit floating-point arithmetic (the 64-bit IEEE 754 standard). If you use higher precision, such as 128-bit floating point, performance can drop by a factor of 10 or more, as the CPU needs to split the data and operations into multiple 64-bit pieces.
https://en.wikipedia.org/wiki/Double-precision_floating-point_format
