Is there a way to store a large number precisely in R? - r

Is there a way to store a large number precicely in R?
double is stored as a binary fraction and its precision varies with the value, and integer has limited range of 4 bytes.
What if I wanted to store a very large number precisely?

You can try the bigz class from the gmppackage:
> library("gmp")
> 2^10000
[1] Inf
> 2^(as.bigz(10000))
[1] "199506.... and a LOT of more numbers!
It basically stores the number as a string and so avoiding the integer/double limits.

It depends on what you mean by large number:
If you want numbers above the top end of double precision arithmetic, there is the Brobdingnag package
If you want more precision there are the gmp and related Rmpfr packages.

Related

Why does R display wrong number with large numbers?

When I save a large number in R as an object the wrong number is saved? Why is that?
options("scipen"=100, "digits"=4)
num <- 201912030032451613
num
#> [1] 201912030032451616
Created on 2019-12-12 by the reprex package (v0.2.1.9000)
As #Roland says, this is a floating point issue (the Wikipedia page on floating point numbers is as good as anything). Unpacking it a bit though, R has specific integer format but it is limited to 32 bit integers:
> str(-2147483647L)
int -2147483647
> str(2147483647L)
int 2147483647
> str(21474836470L)
num 21474836470
Warning message:
non-integer value 21474836470L qualified with L; using numeric value
So, when R gets your number it is storing it as a floating point number not an integer. Floating point numbers are limited in how much precision they can store and typically only have about 17 significant digits. Because your number has more significant digits than that there is loss of precision. Losing precision in the smallest digits doesn't usually matter for computer arithmetic, but if your big number is a key of some kind (or a date stamp) then you are in more trouble. The bit64 package is designed with this kind of use case in mind, or you could import it as a string, depending on what you want to do.

The required number of digits (in base t) to represent the double in base t?

Theorem:
The required number of digits (in base t) to represent the positive integer S in base t is ⟦logtS⟧+1 (⟦.⟧: floor function).
I wondered, what is the required number of digits (in base 2) to represent the maximum positive double (floating point) number in computer. I have 64-bit OS and 32-bit R on it. Hence, I did:
.Machine$double.xmax # 1.797693e+308
typeof(.Machine$double.xmax) # double
floor(log(.Machine$double.xmax, 2))+1 # 1025
.Machine$integer.max # 2147483647
class(.Machine$integer.max) # integer
floor(log(.Machine$integer.max, 2))+1 # 31; (1 bit for sign bit)
So, the theory is OK for integers.
(1) But what about the double equivalent of the theorem? I.e., what is the required number of digits (in base t) to represent the double in base t?
(2) This may be difficult with real numbers with decimals. So, perhaps, one may know the equivalent of the theorem for decimalless reals (that is ">2147483647").
In particular, where does the 1025 above come from?
(3) Would I get 63 if I used 64-bit OS and 64-bit R for the following?
floor(log(.Machine$integer.max, 2))+1 # 63??; (1 bit for sign bit??)
Ad 3) I don't know about doubles but the integer internal representation is still 32 bits even on 64 bit systems. If you want to go bigger you need to use some sort of library for that for example 'bit64'
You will get more detailed information with help(double) and help(integer)

Donot want large numbers to be rounded off in R

options(scipen=999)
625075741017804800
625075741017804806
When I type the above in the R console, I get the same output for the two numbers listed above. The output being: 625075741017804800
How do I avoid that?
Numbers greater than 2^53 are not going to be unambiguously stored in the R numeric classed vectors. There was a recent change to allow integer storage in the numeric abscissa, however your number is larger that that increased capacity for precision:
625075741017804806 > 2^53
[1] TRUE
Prior to that change integers could only be stored up to Machine$integer.max == 2147483647. Numbers larger than that value get silently coerced to 'numeric' class. You will either need to work with them using character values or install a package that is capable of achieving arbitrary precision. Rmpfr and gmp are two that come to mind.
You can use package Rmpfr for arbitrary precision
dig <- mpfr("625075741017804806")
print(dig, 18)
# 1 'mpfr' number of precision 60 bits
# [1] 6.25075741017804806e17

What's the difference between integer class and numeric class in R

I want to preface this by saying I'm an absolute programming beginner, so please excuse how basic this question is.
I'm trying to get a better understanding of "atomic" classes in R and maybe this goes for classes in programming in general. I understand the difference between a character, logical, and complex data classes, but I'm struggling to find the fundamental difference between a numeric class and an integer class.
Let's say I have a simple vector x <- c(4, 5, 6, 6) of integers, it would make sense for this to be an integer class. But when I do class(x) I get [1] "numeric". Then if I convert this vector to an integer class x <- as.integer(x). It return the same exact list of numbers except the class is different.
My question is why is this the case, and why the default class for a set of integers is a numeric class, and what are the advantages and or disadvantages of having an integer set as numeric instead of integer.
There are multiple classes that are grouped together as "numeric" classes, the 2 most common of which are double (for double precision floating point numbers) and integer. R will automatically convert between the numeric classes when needed, so for the most part it does not matter to the casual user whether the number 3 is currently stored as an integer or as a double. Most math is done using double precision, so that is often the default storage.
Sometimes you may want to specifically store a vector as integers if you know that they will never be converted to doubles (used as ID values or indexing) since integers require less storage space. But if they are going to be used in any math that will convert them to double, then it will probably be quickest to just store them as doubles to begin with.
Patrick Burns on Quora says:
First off, it is perfectly feasible to use R successfully for years
and not need to know the answer to this question. R handles the
differences between the (usual) numerics and integers for you in the
background.
> is.numeric(1)
[1] TRUE
> is.integer(1)
[1] FALSE
> is.numeric(1L)
[1] TRUE
> is.integer(1L)
[1] TRUE
(Putting capital 'L' after an integer forces it to be stored as an
integer.)
As you can see "integer" is a subset of "numeric".
> .Machine$integer.max
[1] 2147483647
> .Machine$double.xmax
[1] 1.797693e+308
Integers only go to a little more than 2 billion, while the other
numerics can be much bigger. They can be bigger because they are
stored as double precision floating point numbers. This means that
the number is stored in two pieces: the exponent (like 308 above,
except in base 2 rather than base 10), and the "significand" (like
1.797693 above).
Note that 'is.integer' is not a test of whether you have a whole
number, but a test of how the data are stored.
One thing to watch out for is that the colon operator, :, will return integers if the start and end points are whole numbers. For example, 1:5 creates an integer vector of numbers from 1 to 5. You don't need to append the letter L.
> class(1:5)
[1] "integer"
Reference: https://www.quora.com/What-is-the-difference-between-numeric-and-integer-in-R
To quote the help page (try ?integer), bolded portion mine:
Integer vectors exist so that data can be passed to C or Fortran code which expects them, and so that (small) integer data can be represented exactly and compactly.
Note that current implementations of R use 32-bit integers for integer vectors, so the range of representable integers is restricted to about +/-2*10^9: doubles can hold much larger integers exactly.
Like the help page says, R's integers are signed 32-bit numbers so can hold between -2147483648 and +2147483647 and take up 4 bytes.
R's numeric is identical to an 64-bit double conforming to the IEEE 754 standard. R has no single precision data type. (source: help pages of numeric and double). A double can store all integers between -2^53 and 2^53 exactly without losing precision.
We can see the data type sizes, including the overhead of a vector (source):
> object.size(1:1000)
4040 bytes
> object.size(as.numeric(1:1000))
8040 bytes
To my understanding - we do not declare a variable with a data type so by default R has set any number without L to be a numeric.
If you wrote:
> x <- c(4L, 5L, 6L, 6L)
> class(x)
>"integer" #it would be correct
Example of Integer:
> x<- 2L
> print(x)
Example of Numeric (kind of like double/float from other programming languages)
> x<-3.4
> print(x)
Numeric is an umbrella term for several types of classes (e.g. double and integer). Integers are numbers which do not have decimal points and thus are stored with minimal space in memory. Use the integer class only when doing computations with such numbers, otherwise revert to numeric.

Calculations precision level in R

I am working in R with very small numbers which reflect probabilities in an Maximum Likelihood Estimation algorithm. Some of these numbers are as small as 1e-155 ( or smaller). However, when there is something as simple as summation taking place, the precision level gets truncated to the least precise one and thus ruins the precisions of my calculations and produces meaningless results.
Example:
> sum(c(7.831908e-70,6.002923e-26,6.372573e-36,5.025015e-38,5.603268e-38,1.118121e-14, 4.512098e-07,4.400717e-05,2.300423e-26,1.317602e-58))
[1] 4.445838e-05
As is seen from the example, the base for this calculation is 1e-5 , which in a very rude manner rounds up sensitive calculation.
Is there a way around this? Why is R choosing such a strange automatic behavior? Perhaps it is not really doing this, I just see the result in the truncated form? In this case, is the actual number with correct precision stored in the variable?
There is no precision loss in your sum. But if you're worried about it, you should use a multiple-precision library:
library("Rmpfr")
x <- c(7.831908e-70,6.002923e-26,6.372573e-36,5.025015e-38,5.603268e-38,1.118121e-14, 4.512098e-07,4.400717e-05,2.300423e-26,1.317602e-58)
sum(mpfr(x, 1024))
# 1 'mpfr' number of precision 1024 bits
# [1] 4.445837981118120898327314579322617633703674840117902103769961398533293289165193843930280422747754618577451267010103975610356319174778512980120125435961577770470993217990999166176083700886405875414277348471907198346293122011042229843450802884152750493740313686430454254150390625000000000000000000000000000000000e-5
Your results are only truncated in the display.
Try:
x <- sum(c(7.831908e-70,6.002923e-26,6.372573e-36,5.025015e-38,5.603268e-38,1.118121e-14, 4.512098e-07,4.400717e-05,2.300423e-26,1.317602e-58))
print(x, digits=22)
[1] 4.445837981118121081878e-05
You can read more about the behaviour of print at ?print.default
You can also set an option - this will affext all calls to print
options(digits=22)
have you ever heard about Floating point numbers?
there is no loss of precision (significant figures) in multiplication or division as far as the result stay between
1.7976931348623157·10^308 to 4.9·10^−324 (see the link for detail)
so if you do 1.0e-30 * 1.0e-10 result will be 1.0e-40
but if you do 1.0e-30 + 1.0e-10 result will be 1.0e-10
Why?
-> finite set of number rapresentable with computer works. (64 bits max 2^64 different representation of numbers with 64 bits)
instead of using a direct conversion like for integer numbers (they represent from ~ -2^62 to +2^62, every INTEGER number -> about from -10^16 to +10*16)
or there exist a clever way like floating point? from 1.7976931348623157·10^308 to - 4.9·10^−324 and it can represent /approximate rational numbers?
So in floating point, to achieve a wider range, precision in sums is sacrified, There is loss of precision during sums or subtractions as the significant figures that could be represented by (the 52 bits of) the fraction part (of a floating point number of 64 bits) are less than log10(2^52) ~ 16.
if you look for a basic everyday example, summary(lm), when the p-value of parameter is near zero, summary() output <2.2e-16 (what a coincidence).
why limited to 64 bits? CPU have the execution units specifically to 64bits floating point arithmetic (64 bit IEEE 754 standard), if you use higher precision like 128 bits floating point, the performances will be lowered by 10 times or more, as CPU need to split the data and operation in multiple 64 bits data and operations.
https://en.wikipedia.org/wiki/Double-precision_floating-point_format

Resources