R: Number precision, how to prevent rounding? - r

In R, I have the following vector of numbers:
numbers <- c(0.0193738397702257, 0.0206218006695066, 0.021931558829559,
0.023301378178208, 0.024728095594751, 0.0262069239112787, 0.0277310799996657,
0.0292913948762414, 0.0308758879014822, 0.0324693108459748, 0.0340526658271053,
0.03560271425176, 0.0370915716288017, 0.0384863653635563, 0.0397490272396821,
0.0408363289939899, 0.0417002577578561, 0.0422890917131629, 0.0425479537267193,
0.0424213884467212, 0.0418571402964338, 0.0408094991140723, 0.039243951482081,
0.0371450856007627, 0.0345208537496488, 0.0314091884865658, 0.0278854381969885,
0.0240607638577763, 0.0200808932436969, 0.0161193801903312, 0.0123615428382314,
0.00920410652651576, 0.00628125319205829, 0.0038816517651031,
0.00214210795679701, 0.00103919307280354, 0.000435532895812429,
0.000154730641092234, 4.56593150728962e-05, 1.09540661898799e-05,
2.08952167815574e-06, 3.10045314287095e-07, 3.51923218134997e-08,
3.02121734299694e-09, 1.95269500257237e-10, 9.54697530552714e-12,
3.5914029230041e-13, 1.07379981978647e-14, 2.68543048763588e-16,
6.03891613157815e-18, 1.33875697089866e-19, 3.73885699170518e-21,
1.30142752487978e-22, 5.58607581840324e-24, 2.92551478380617e-25,
1.85002124085815e-26, 1.39826890505611e-27, 1.25058972437096e-28,
1.31082961467944e-29, 1.59522437605631e-30, 2.23371981458205e-31,
3.5678974253211e-32, 6.44735482309705e-33, 1.30771083084868e-33,
2.95492180915218e-34, 7.3857554006177e-35, 2.02831084124162e-35,
6.08139499028838e-36, 1.97878175996974e-36, 6.94814886769478e-37,
2.61888070029751e-37, 1.05433608968287e-37, 4.51270543356897e-38,
2.04454840598946e-38, 9.76544451781597e-39, 4.90105271869773e-39,
2.5743371658684e-39, 1.41165292292001e-39, 8.06250933233367e-40,
4.78746160076622e-40, 2.94835809615626e-40, 1.87667170875529e-40,
1.22833908072915e-40, 8.21091993733535e-41, 5.53869254991177e-41,
3.74485710867631e-41, 2.52485401054841e-41, 1.69027430542613e-41,
1.12176290106797e-41, 7.38294520887852e-42, 4.8381070000246e-42,
3.20123319815522e-42, 2.16493953538386e-42, 1.50891804884267e-42,
1.09057070511506e-42, 8.1903023226717e-43, 6.3480235351625e-43,
5.13533594742621e-43, 4.25591269645348e-43, 3.57422485839717e-43,
3.0293235331048e-43, 2.58514651313175e-43, 2.21952686649801e-43,
1.91634521841049e-43, 1.66319240529025e-43, 1.45043336371471e-43,
1.27052593975384e-43, 1.11752052211757e-43, 9.86689196888877e-44,
8.74248543892126e-44)
I use cumsum to get the cumulative sum. Due to R's numerical precision, many of the numbers towards the end of the vector are now equivalent to 1 (even though technically they're not exactly = 1, just very close to it).
So then when I try to recover my original numbers by using diff(cumulative), I get a lot of 0s instead of a very small number. How can I prevent R from "rounding"?
cumulative <- cumsum(numbers)
diff(cumulative)

I think the Rmpfr package does what you want:
library(Rmpfr)
x <- mpfr(numbers,200) # set arbitrary precision that's greater than R default
cumulative <- cumsum(x)
diff(cumulative)
Here's the top and bottom of the output:
> diff(cumulative)
109 'mpfr' numbers of precision 200 bits
[1] 0.02062180066950659862445860426305443979799747467041015625
[2] 0.021931558829559001655429284483034280128777027130126953125
[3] 0.02330137817820800150148130569505156017839908599853515625
[4] 0.0247280955947510004688805196337852976284921169281005859375
...
[107] 1.117520522117570086014450710640040701536080790307716261438975e-43
[108] 9.866891968888769759087690539062888824928577731689952701181586e-44
[109] 8.742485438921260418707338389502002282130643811990663213422948e-44
You can adjust the precision as you like by changing the second argument to mpfr.

You might want to try out the package Rmpfr.

Related

How to get the center and scale after using the scale function in R

It seems a silly question, but I have searched on line, but still did not find any sufficient reply.
My question is: suppose we have a matrix M, then we use the scale() function, how can we extract the center and scale of each column by writing a line of code (I know we can see the centers and scales..), but my matrix has lots of columns, it is cumbersome to do it manually.
Any ideas? Many thanks!
you are looking for the attributes function:
set.seed(1)
mat = matrix(rnorm(1000),,10) # Suppose you have 10 columns
s = scale(mat) # scale your data
attributes(s)#This gives you the means and the standard deviations:
$`dim`
[1] 100 10
$`scaled:center`
[1] 0.1088873669 -0.0378080766 0.0296735350 0.0516018586 -0.0391342406 -0.0445193567 -0.1995797418
[8] 0.0002549694 0.0100772648 0.0040650015
$`scaled:scale`
[1] 0.8981994 0.9578791 1.0342655 0.9916751 1.1696122 0.9661804 1.0808358 1.0973012 1.0883612 1.0548091
These values can also be obtained as:
colMeans(mat)
[1] 0.1088873669 -0.0378080766 0.0296735350 0.0516018586 -0.0391342406 -0.0445193567 -0.1995797418
[8] 0.0002549694 0.0100772648 0.0040650015
sqrt(diag(var(mat)))
[1] 0.8981994 0.9578791 1.0342655 0.9916751 1.1696122 0.9661804 1.0808358 1.0973012 1.0883612 1.0548091
you get a list that you can subset the way you want:
or you can do
attr(s,"scaled:center")
[1] 0.1088873669 -0.0378080766 0.0296735350 0.0516018586 -0.0391342406 -0.0445193567 -0.1995797418
[8] 0.0002549694 0.0100772648 0.0040650015
attr(s,"scaled:scale")
[1] 0.8981994 0.9578791 1.0342655 0.9916751 1.1696122 0.9661804 1.0808358 1.0973012 1.0883612 1.0548091

Unique in data.table dropped some value by mistake

I need to remove duplicates from a big data frame that has 100 million rows. I am testing if data.table can help me on that. However, in the following code, unique() in data.table did not generate the same result as the unique() for data.frame. Is there a possible bug in setkey in data.table?
library(data.table)
tmp <- data.frame(id=c(1000000128152, 1000000228976, 1000000235508, 1000000294933, 1000000311288, 1000000353770, 1000000441585, 1000000466482, 1000000473521,
1000000491353, 1000000497787, 1000000534948, 1000000589071, 1000000622890, 1000000658287, 1000000695865, 1000000731674, 1000000780659,
1000000818218, 1000000834389, 1000000877189, 1000000937770, 1000000937770, 1000000996135, 1000001061831, 1000001062057, 1000001065241,
1000001097542, 1000001122242, 1000001177167, 1000001194078, 1000001216323, 1000001232155, 1000001294998, 1000001361126, 1000001361126,
1000001389830, 1000001411284, 1000001415793, 1000001417557, 1000001485326, 1000001565513, 1000001624601, 1000001650282, 1000001681805,
1000001683548, 1000001683548, 1000001693445, 1000001693455, 1000001693462, 1000001693466, 1000001693490, 1000001693490, 1000001703493,
1000001703511, 1000001703518, 1000001703546, 1000001703554, 1000001703613, 1000001703644))
unique(tmp$id)
DT <- data.table(tmp)
setkey(DT, id)
DTU <- unique(DT)
DTU$id
Results from the unique(tmp$id):
[1] 1000000128152 1000000228976 1000000235508 1000000294933 1000000311288 1000000353770 1000000441585 1000000466482 1000000473521 1000000491353 1000000497787 1000000534948
[13] 1000000589071 1000000622890 1000000658287 1000000695865 1000000731674 1000000780659 1000000818218 1000000834389 1000000877189 1000000937770 1000000996135 1000001061831
[25] 1000001062057 1000001065241 1000001097542 1000001122242 1000001177167 1000001194078 1000001216323 1000001232155 1000001294998 1000001361126 1000001389830 1000001411284
[37] 1000001415793 1000001417557 1000001485326 1000001565513 1000001624601 1000001650282 1000001681805 1000001683548 1000001693445 1000001693455 1000001693462 1000001693466
[49] 1000001693490 1000001703493 1000001703511 1000001703518 1000001703546 1000001703554 1000001703613 1000001703644
Result from DTU$id:
[1] 1000000128152 1000000228976 1000000235508 1000000294933 1000000311288 1000000353770 1000000441585 1000000466482 1000000473521 1000000491353 1000000497787 1000000534948
[13] 1000000589071 1000000622890 1000000658287 1000000695865 1000000731674 1000000780659 1000000818218 1000000834389 1000000877189 1000000937770 1000000996135 1000001061831
[25] 1000001062057 1000001065241 1000001097542 1000001122242 1000001177167 1000001194078 1000001216323 1000001232155 1000001294998 1000001361126 1000001389830 1000001411284
[37] 1000001415793 1000001417557 1000001485326 1000001565513 1000001624601 1000001650282 1000001681805 1000001683548 1000001693445 1000001693455 1000001693462 1000001693490
[49] 1000001703493 1000001703511 1000001703518 1000001703546 1000001703554 1000001703613 1000001703644
Comparing the two, we see that 1000001693466 got dropped in DTU by mistake. Any suggestions on why? I suspect it's the setkey because when I subtracted the 1000000000000 from all the numbers, the result is the same.
Edit (from Arun): The default rounding feature has been removed in the current development version of data.table, v1.9.7, and is likely stay that way moving forward. See here for installation instructions.
This also means that you're fully responsible for understanding limitations in representing floating point numbers and dealing with them :-).
help(setkey) says (data.table version 1.9.6):
Note that columns of numeric types (i.e., double) have their last two bytes rounded off while computing order, by default, to avoid any unexpected behaviour due to limitations in representing floating point numbers precisely. Have a look at setNumericRounding to learn more.
By changing rounding to 1 byte before keying
DT <- data.table(tmp)
setNumericRounding(1) # set rounding
setkey(DT, id)
the value no longer will be dropped.
However, help(setNumericRounding) says
For large numbers (integers > 2^31), we recommend using bit64::integer64 rather than setting rounding to 0.

Changing cell values from a raster

I have raster cell values with 5 digits, but I need to get rid of the first one, for instance, if the the cell value is "31345" I need to make it "1345".
I am trying to use the calc() function from the raster package to do that by subtracting different numbers from based on the raster cell value (since they are all numbers), like this:
correct.grid <- calc(grid, fun=function(x){ifelse(x < 40000, x-30000,
ifelse(x > 40000 & < 50000, x-40000,
ifelse(x > 50000 & < 60000, x-50000,
ifelse(x > 60000, x-60000, 0)))))})
I guess this is probably a terrible approach to the problem (I am not really good at programming), still, I ran into an error because I am using ">" and "<" inside the function I am guessing. Any ideas on how to make these "ifelse"s to work or maybe a smarter approach to the problem?
This is a piece of the unique values in my data if it helps:
> unique(grid)
[1] 30057 30084 30207 30230 30235 30237 30280 30283 30311 30319 30320 30326 30350 30351 30352 30360
[17] 30384 30396 30415 30420 30447 30449 30452 30456 30478 30481 30497 30507 30522 30560 30562 30605
[33] 30606 30612 30638 30639 30645 30654 30657 30658 30662 30665 30678 30682 30701 30707 30714 30736
[49] 30740 30743 30749 30750 30823 30824 30841 30852 30862 30892 30896 30898 30915 30920 30928 30934
[65] 30956 30962 30978 30986 30998 31021 31022 31031 31042 31053 31055 31081 31085 31092 31097 31099
[81] 31114 31115 31122 31126 31129 31130 31131 31141 31157 31168 31171 40019 40026 40075 40197 40217
[97] 50342 50360 50379 50496 50720 50725 50732 50766 50798 50837 51073 51092 51397 53096 53110 53117
[113] 53118 53120 53124 60003 60005 60041 60485 60516 60655 60661 60825 61039 61174 61185 61187 61210
[129] 61221 61224 61227 61259 61287 61289 61290 61295
If you just want to remove the leftmost digit of each value, how about this:
First, let's load a raster object to work with:
library(raster)
# Load a raster object to work with
grid = system.file("external/test.grd", package="raster")
grid = raster(grid)
# Set up values to be whole numbers
values(grid) = round(values(grid)*100)
Now, remove the leftmost digit from every value in the raster:
values(grid) = as.numeric(substr(values(grid), 2, nchar(values(grid))))
Note that a value with one or more zeros after the leftmost digit will get shortened by more than one digit. For example, 60661 will become 661 and 30001 will become 1.

how to generate very small numbers randomly in R

I want to generate very small numbers in the range of 1e-9 to 1.
if possible those numbers should be from all orders.
for example 1e-9, 2e-5, 3.2e-6 , 1.6e-4 .... etc
I tried this
set.seed(123)
kk <- runif(20,1e-9,1)
#min(kk)
#0.04205953
How can I do this?
EDIT
#RichardScriven suggested that decreasing the max number, so tried that
kk <- runif(20,1e-9,1e-5)
kk
#[1] 6.479287e-06 3.198886e-06 3.077892e-06 2.198457e-06 3.695519e-06 #9.842208e-06 1.542869e-06 9.113490e-07 1.419927e-06 6.900381e-06 #6.192946e-06 8.914050e-06 6.730318e-06 7.371040e-06
#[15] 5.211836e-06 6.598725e-06 8.218233e-06 7.863029e-06 9.798239e-06 #4.394876e-06
Maybe log the ranges, then exp them back to actual values:
set.seed(13031982)
exp(runif(10, log(1e-9),log(1)))
# [1] 1.758939e-02 1.343684e-06 1.803232e-06 1.564901e-04 5.603956e-07
# [6] 1.042067e-09 6.536568e-08 1.374840e-05 2.210080e-04 6.245864e-03

Calculate the exponentials of big negative value

I would like to know how can I get the exponential of big negative number in R? For example when I try :
> exp(-6400)
[1] 0
> exp(-1200)
[1] 0
> exp(-2000)
[1] 0
but I need the value of above expression, even if it is so small, how can I get it in R?
Those number are too small. To know the minimum value your computer can handle try:
> .Machine$double.xmin
[1] 2.225074e-308
Will give you (from ?.Machine)
the smallest non-zero normalized floating-point number, a power of the radix, i.e., double.base ^ double.min.exp. Normally 2.225074e-308.
In my case
> .Machine$double.base
[1] 2
> .Machine$double.min.exp
[1] -1022
Actually I can calculate powers up to
> exp(-745)
[1] 4.940656e-324
To go around this issue you need infinite precision arithmetic.
In R you can achieve that using package Rmpfr (PDF vignette)
library(Rmpfr)
# Calculate exp(-100)
> a <- mpfr(exp(-100), precBits=64)
# exp(-1000)
> a^10
1 'mpfr' number of precision 64 bits
[1] 5.07595889754945890823e-435
# exp(-6400)
> a^64
1 'mpfr' number of precision 64 bits
[1] 3.27578823787094497049e-2780
# use an array of powers
> ex <- c(10, 20, 50, 100, 500, 1000, 1e5)
> a ^ ex
7 'mpfr' numbers of precision 64 bits
[1] 5.07595889754945890823e-435 2.57653587296115182772e-869
[3] 3.36969414830892462745e-2172 1.13548386531474089222e-4343
[5] 1.88757769782054893243e-21715 3.56294956530952353784e-43430
[7] 1.51693678090513840149e-4342945
Note that Rmpfr is based on GNU MPFR and requires GNU GMP. Under Linux you will need gmp, gmp-devel, mpfr, and mpfr-devel to be installed in your system in order to install these packages, not sure how that works under Windows.

Resources