I was looking into the RNG of base R and was curious whether the 32-bit implementation of the Mersenne-Twister might become a limitation when a large number of random numbers is drawn, so I ran a simple test:
set.seed(8)
length(unique(runif(1e8)))
# [1] 98845641
1e8 - 98845641
# 1154359
So there are indeed numerous duplicates among the 100 million draws.
When I switch to the 64-bit version of the Mersenne-Twister RNG implemented by the dqrng package, the problem does not appear.
Question 1:
Does the "64-bit" here refer to the type of floating-point numbers used?
Question 2:
Am I right to conclude that, because of the larger span of possible numbers (64-bit FP vs. 32-bit FP), duplicates are less likely when using the 64-bit MT?
from ?Random:
Do not rely on randomness of low-order bits from RNGs. Most of the supplied uniform generators return 32-bit integer values that are converted to doubles, so they take at most 2^32 distinct values and long runs will return duplicated values.
Indeed, when we approximate the expected number of redundant draws (each colliding pair involves two draws that share a value, but only one of them is dropped by unique(), hence the division by two), we get
M <- 2^32
n <- 1e8
(n * (1 - (1 - 1 / M)^(n - 1))) / 2
# [1] 1150705
which is very close to the result that you have.
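For completeness, here is a sketch of the dqrng check mentioned in the question (this assumes the dqrng package is installed and that dqRNGkind/dqset.seed/dqrunif behave as documented); with a 64-bit generator feeding 53-bit doubles, collisions at n = 1e8 become extremely unlikely:
library(dqrng)
dqRNGkind("Mersenne-Twister")   # the 64-bit Mersenne-Twister
dqset.seed(8)
length(unique(dqrunif(1e8)))    # expected to be 1e8, or within a handful of it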
I am trying to calculate the value of exp(1000) or larger in R. I looked at the solution provided in "Calculate the exponentials of big negative value" and followed it, replicating the code for negative values first, but I get the following output.
a <- mpfr(exp(-1000),precBits = 64)
a
1 'mpfr' number of precision 64 bits
[1] 0
I do not understand why my output differs from the provided solution. I understand that the linked solution is for negative values and that I am looking for positive exponents; regardless, it should work both ways.
You need to convert to extended precision before you exponentiate, otherwise R will try to compute exp(-1000) with its usual precision and underflow will occur (as you found out).
> a <- exp(mpfr(-1000, precBits = 64))
> a
1 'mpfr' number of precision 64 bits
[1] 5.07595889754945676548e-435
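For the positive exponent the question actually asks about, the same idea applies (a sketch, assuming Rmpfr is loaded as above): wrap the argument in mpfr() before calling exp(), so the exponentiation itself happens in extended precision.
b <- exp(mpfr(1000, precBits = 64))
b   # roughly 1.97e+434, i.e. the reciprocal of exp(-1000) shown above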
If you run code like:
length(unique(runif(10000000)))
length(unique(rnorm(10000000)))
you'll see that only about 99.8% of the runif values are unique, but 100% of the rnorm values are. I thought this might be because of the constrained range, but widening the range to (0, 100000) for runif doesn't change the result. A continuous distribution should have probability 0 of repeats, and I know that in floating-point arithmetic that isn't quite the case, but I'm curious why we don't see roughly the same number of repeats for the two functions.
This is due primarily to the properties of the default PRNG (the fact that runif has a smaller range than rnorm, and therefore fewer representable values, may also have a similar effect at some point, even if the RNG doesn't). It is discussed somewhat obliquely in ?Random:
Do not rely on randomness of low-order bits from RNGs. Most of the
supplied uniform generators return 32-bit integer values that are
converted to doubles, so they take at most 2^32 distinct values and
long runs will return duplicated values (Wichmann-Hill is the
exception, and all give at least 30 varying bits.)
With the example:
sum(duplicated(runif(1e6))) # around 110 for default generator
## and we would expect almost-sure duplicates beyond about
qbirthday(1 - 1e-6, classes = 2e9) # 235,000
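As a rough cross-check of the ~110 figure (a sketch; the approximation treats all 2^32 values as equally likely), the expected number of duplicate pairs among n draws from M possible values is about choose(n, 2) / M:
n <- 1e6
M <- 2^32
choose(n, 2) / M   # about 116, in line with the ~110 observed above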
Changing to the Wichmann-Hill generator indeed reduces the chance of duplicates:
RNGkind("Wich")
sum(duplicated(runif(1e6)))
[1] 0
sum(duplicated(runif(1e8)))
[1] 0
The documentation for random number generation says:
Do not rely on randomness of low-order bits from RNGs. Most of the
supplied uniform generators return 32-bit integer values that are
converted to doubles, so they take at most 2^32 distinct values and
long runs will return duplicated values (Wichmann-Hill is the
exception, and all give at least 30 varying bits.)
By the birthday paradox you would expect to see repeated values in a set of more than roughly 2^16 values, and 10000000 > 2^16. I haven't found anything in the documentation stating how many distinct values rnorm can return, but it is presumably far more than 2^32. It is interesting to note that set.seed has separate parameters: kind, which determines the uniform generator, and normal.kind, which determines the normal generator, so the latter is not a simple transformation of the former.
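A quick numerical check of that intuition (the exact figures are approximate; the point is the order of magnitude):
M <- 2^32
qbirthday(prob = 0.5, classes = M)   # around 77000 draws for a 50% chance of a repeat
RNGkind()                            # the uniform and normal generators are reported separately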
Theorem:
The number of digits required to represent the positive integer S in base t is ⌊log_t(S)⌋ + 1 (⌊·⌋: floor function).
I wondered: what is the required number of digits (in base 2) to represent the maximum positive double (floating-point) number on a computer? I have a 64-bit OS with 32-bit R on it, so I did:
.Machine$double.xmax # 1.797693e+308
typeof(.Machine$double.xmax) # double
floor(log(.Machine$double.xmax, 2))+1 # 1025
.Machine$integer.max # 2147483647
class(.Machine$integer.max) # integer
floor(log(.Machine$integer.max, 2))+1 # 31; (1 bit for sign bit)
So, the theory is OK for integers.
(1) But what about the double equivalent of the theorem? I.e., what is the required number of digits (in base t) to represent a double in base t?
(2) This may be difficult for real numbers with a fractional part, so perhaps someone knows the equivalent of the theorem for reals without decimals (that is, > 2147483647).
In particular, where does the 1025 above come from?
(3) Would I get 63 if I used a 64-bit OS and 64-bit R for the following?
floor(log(.Machine$integer.max, 2))+1 # 63??; (1 bit for sign bit??)
Ad (3): I don't know about doubles, but the internal representation of an integer is still 32 bits, even on 64-bit systems and in 64-bit R. If you want to go bigger, you need a library for that, for example 'bit64'.
You will get more detailed information with help(double) and help(integer).
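As a sketch of what 'bit64' buys you (this assumes the bit64 package is installed; its integer64 class stores signed 64-bit integers):
library(bit64)
x <- as.integer64("9223372036854775807")   # 2^63 - 1, the largest integer64 value
x
.Machine$integer.max                        # still 2147483647: base R integers stay 32-bit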
I am learning about the RSA algorithm. I perform the algorithm on very small prime numbers and use online Big Integer calculators to perform the encryption and decryption and everything works just fine.
My question is about the size of the exponent we create; when it comes to bigger numbers, the calculation seems infeasible.
For example, the algorithm starts by picking two prime numbers p and q. You compute n = p×q and then the totient of n. Next you pick a number e such that 1 < e < totient(n) and e is coprime to the totient.
Then, to perform an encryption, you take, say, the ASCII character 'A', which is 65, and raise it to the power of e (65^e).
The online big integer calculator started getting very slow and sluggish (over a minute to calculate) when e was bigger than about 100,000 (6 digits)
My question is then, for the working RSA algorithm, what size (number of digits) number does that algorithm pick?
One thought I had was that the online calculator I was using might not be using the best method for exponentiation. This is the calculator I am using: http://www.javascripter.net/math/calculators/100digitbigintcalculator.htm
Let's say M is the modulus. So yes, you could first compute intermediate = 65^e and finally compute intermediate mod M. And of course, intermediate would be a very, very big integer (if e equals 65537, the decimal representation of intermediate contains 118813 digits!).
BUT, thanks to a very basic modular arithmetic theorem,
(65^e) mod M = ((((65 mod M) * 65) mod M) * 65) mod M [...] (e times)
(the theorem states that in a quotient ring, the n-th power of the class of an element is the class of the n-th power of the element)
As you can see, this does not need any big-integer library: after each product you reduce mod M, which returns an integer between 0 and M-1, so you only ever have to compute products of integers less than M.
As an example, here is a simple shell script (bash) that computes 65^65537 mod 991*997. As you can see, there is no need for a big-number library:
#!/bin/bash
# set RSA parameters
m=65 # message to encode
M=$((991*997)) # modulus (both 991 and 997 are prime numbers)
e=65537 # public exponent (coprime with 990*996, thus compliant with RSA algorithm)
# compute (m^e) mod M
ret=1
for ((i = 1; i <= e; i++))   # note: {1..$e} would not work, since brace expansion ignores variables
do
    ret=$(((ret*m)%M))       # reduce mod M at every step, so ret never exceeds M-1
done
# display the result
echo $ret
It immediately returns 784933, thus 65^65537 mod 991*997 = 784933
The biggest integer computed with your method of calculation has 118813 digits, whereas the biggest integer handled by this shell script has at most 12 digits ((M-1)^2 has 12 digits).
According to these explanations, we can now answer your question:
My question is then, for the working RSA algorithm, what size (number of digits) number does that algorithm pick?
With the above explanations, you can see that the maximum number of digits in the decimal representation of the integers you have to manipulate is 1 + log10((M-1)^2), because you will, at most, compute a product of two integers between 0 and M-1.
Note that 1 + log10((M-1)^2) = 1 + 2·log10(M-1) < 2 + 2·log10(M) = 2·(1 + log10(M)). Also note that 1 + log10(M) is, up to rounding, the number of digits of M.
Therefore, in conclusion, the number of digits your library has to handle correctly is at most twice the number of digits of the modulus (if you compute the exponentiation with integer multiplications the way explained here).
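For reference, the number of multiplications can be cut from e down to about log2(e) with square-and-multiply exponentiation. This is not part of the answer above, just a sketch in R of the standard technique; all intermediates stay below (M-1)^2, so ordinary doubles are exact here:
modpow <- function(base, exp, mod) {
  # right-to-left binary (square-and-multiply) modular exponentiation
  result <- 1
  base <- base %% mod
  while (exp > 0) {
    if (exp %% 2 == 1) result <- (result * base) %% mod
    base <- (base * base) %% mod
    exp <- exp %/% 2
  }
  result
}
modpow(65, 65537, 991 * 997)   # same result as the shell-script loop, almost instantly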
I am working in R with very small numbers that represent probabilities in a Maximum Likelihood Estimation algorithm. Some of these numbers are as small as 1e-155 (or smaller). However, when something as simple as a summation takes place, the precision level gets truncated to that of the least precise term, which ruins the precision of my calculations and produces meaningless results.
Example:
> sum(c(7.831908e-70,6.002923e-26,6.372573e-36,5.025015e-38,5.603268e-38,1.118121e-14, 4.512098e-07,4.400717e-05,2.300423e-26,1.317602e-58))
[1] 4.445838e-05
As can be seen from the example, the result is reported at the 1e-5 scale, which rather crudely rounds away the more delicate contributions.
Is there a way around this? Why does R choose such strange automatic behavior? Perhaps it is not really doing this and I just see the result in truncated form? In that case, is the actual number stored in the variable at its correct precision?
There is no precision loss in your sum. But if you're worried about it, you should use a multiple-precision library:
library("Rmpfr")
x <- c(7.831908e-70,6.002923e-26,6.372573e-36,5.025015e-38,5.603268e-38,1.118121e-14, 4.512098e-07,4.400717e-05,2.300423e-26,1.317602e-58)
sum(mpfr(x, 1024))
# 1 'mpfr' number of precision 1024 bits
# [1] 4.445837981118120898327314579322617633703674840117902103769961398533293289165193843930280422747754618577451267010103975610356319174778512980120125435961577770470993217990999166176083700886405875414277348471907198346293122011042229843450802884152750493740313686430454254150390625000000000000000000000000000000000e-5
Your results are only truncated in the display.
Try:
x <- sum(c(7.831908e-70,6.002923e-26,6.372573e-36,5.025015e-38,5.603268e-38,1.118121e-14, 4.512098e-07,4.400717e-05,2.300423e-26,1.317602e-58))
print(x, digits=22)
[1] 4.445837981118121081878e-05
You can read more about the behaviour of print at ?print.default
You can also set an option - this will affect all calls to print:
options(digits=22)
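If you prefer not to change the global option, you can also format a single value (a minor aside, not part of the original answer):
sprintf("%.22e", x)   # print just this value with 22 digits after the decimal point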
Have you ever heard of floating-point numbers?
There is essentially no loss of relative precision (significant figures) in multiplication or division, as long as the result stays between about 4.9·10^−324 and 1.7976931348623157·10^308 in magnitude (see the link below for details).
So if you do 1.0e-30 * 1.0e-10, the result will be 1.0e-40,
but if you do 1.0e-30 + 1.0e-10, the result will be 1.0e-10.
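A quick demonstration of that absorption in R (a small sketch; the comparison holds because 1e-30 is far below the ~16-digit relative precision of a double):
1.0e-30 + 1.0e-10 == 1.0e-10   # TRUE: the smaller term is absorbed entirely
1.0e-30 * 1.0e-10              # about 1e-40: relative precision is kept
.Machine$double.eps            # 2.220446e-16, the relative precision of a double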
Why?
A computer can only represent a finite set of numbers: with 64 bits there are at most 2^64 distinct representations. You can spend those bits on a direct integer encoding (a signed 64-bit integer covers roughly -2^63 to +2^63, i.e. about ±9·10^18, and represents every integer in that range exactly), or you can use a cleverer scheme like floating point, which covers magnitudes from about 4.9·10^−324 up to 1.7976931348623157·10^308 but can only approximate most real numbers.
So in floating point, the wider range is bought by sacrificing precision in sums: there is loss of precision in additions and subtractions because the 52-bit fraction part of a 64-bit floating-point number (plus an implicit leading bit) can hold only about log10(2^53) ≈ 16 significant decimal digits.
For a basic everyday example, look at summary(lm(...)): when a parameter's p-value is near zero, summary() prints < 2.2e-16. That is no coincidence: 2.2e-16 is .Machine$double.eps, the relative precision of a 64-bit double.
Why limited to 64 bits? CPUs have execution units built specifically for 64-bit floating-point arithmetic (the 64-bit IEEE 754 standard). If you use higher precision, such as 128-bit floating point, performance drops by a factor of ten or more, because the CPU has to split the data and operations into multiple 64-bit pieces.
https://en.wikipedia.org/wiki/Double-precision_floating-point_format