Edit: The floating-point issue has been discussed previously (not only on this site; please see the links below). This question is therefore essentially a duplicate, but it may still be useful to connect the well-known floating-point issue to floor division in R.
In this version of R:
R version 3.6.3 (2020-02-29) -- "Holding the Windsock"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
I have observed an inconsistency:
> 65 %/% 5
[1] 13
> 6.5 %/% .5
[1] 13
> .65 %/% .05
[1] 12
> .065 %/% .005
[1] 13
This may be a known issue, probably related to floating-point arithmetic.
How can I deal with this issue in everyday calculations in order to avoid wrong results?
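For illustration, printing more digits reveals what is actually stored. The true ratio of the two stored values is just below 13 (about 12.9999999999999997), and %/% floors that, hence 12:
> sprintf("%.20f", .65)
[1] "0.65000000000000002220"
> sprintf("%.20f", .05)
[1] "0.05000000000000000278"
In everyday calculations, rescaling to whole numbers, or rounding when the quotient is known to be whole, avoids the surprise:
> (.65 * 100) %/% (.05 * 100)
[1] 13
> round(.65 / .05)
[1] 13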
As suspected, my question is not new; the behaviour is intrinsic to floating-point numbers.
As far as I can tell, the best way to cope with this issue is to understand it and accept its existence.
More information can be found here, here and here.
Then we have the R-FAQ and the R Inferno; I found these links on the pages mentioned above.
Context
R's help() function indicates that round() adheres to a standard that is identical to IEEE 754: round halves to even.
Wikipedia describes this standard with the following example: 23.5 becomes 24 and 24.5 becomes 24.
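Both of these halves are exactly representable in binary, so R reproduces the Wikipedia example faithfully:
round(23.5) # [1] 24
round(24.5) # [1] 24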
help() also states, "...this is dependent on OS services and on representation error."
Experiment
System details
R version: 3.6.2 (2019-12-12)
Arch / OS: x86_64, darwin18.7.0 (MacOS)
On my machine, I see that
round(0.55, digits = 1) # [1] 0.6
round(1.55, digits = 1) # [1] 1.6
round(2.55, digits = 1) # [1] 2.5
round(3.55, digits = 1) # [1] 3.5
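Printing more digits of the stored values makes the representation error visible:
sprintf("%.20f", 0.55) # "0.55000000000000004441"
sprintf("%.20f", 1.55) # "1.55000000000000004441"
sprintf("%.20f", 2.55) # "2.54999999999999982236"
sprintf("%.20f", 3.55) # "3.54999999999999982236"
None of these is an exact half at the first decimal place, so the round-half-to-even rule never actually fires: the first two sit just above the half and round up, the last two sit just below and round down.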
Question
Should I attribute the discrepancy between the first two and the last two statements to errant OS services and/or to representation error?
R newbie here, re-building a Python pipeline in R.
d is a data.frame:
$ day (chr) "2016-10-13", ...
$ city_name (chr) "SF", ...
$ type (chr) "Green", ...
$ count (int) 10, ...
I'm doing a spread() on the data with the tidyr package:
d %>% spread(type,count)
Works fine running locally (Mac) with:
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)
tidyr_0.6.1
I run the identical command on a Linux box, on the same input d, but it returns an error:
Error in `[.data.table`(.variables, lengths != 0) : i evaluates to a logical vector length 3 but there are 930 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.
On Linux, I'm running:
R version 3.2.5 (2016-04-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu precise (12.04.5 LTS)
tidyr_0.6.0
Any idea what this error means, and why it would be thrown?
Edit: fixed this after updating tidyr on Linux, and re-starting the R session.
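For anyone hitting the same thing, the quick diagnostic is to compare package versions across the machines before digging into the error itself:
packageVersion("tidyr")   # 0.6.0 on the failing Linux box, 0.6.1 on the working Mac
install.packages("tidyr") # update, then restart the R session and retry the spread()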
Running R on Linux (see version output below), I experience weird behavior with sprintf() converting decimal to hex.
Does anybody know what could explain this? The first conversion works fine; the second returns an error about numeric objects:
> sprintf("%x",2109440182)
[1] "7dbb80b6"
> sprintf("%x",2151028214)
Error in sprintf("%x", 2151028214) :
invalid format '%x'; use format %f, %e, %g or %a for numeric objects
version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 0.1
year 2013
month 05
day 16
svn rev 62743
language R
version.string R version 3.0.1 (2013-05-16)
nickname Good Sport
Thanks, Michael
For comparison, gcc complains about the analogous C call: format ‘%x’ expects an argument of type ‘unsigned int’, but argument 2 has type ‘long int’.
I guess the number is larger than an int can hold. The maximum on my system is 2147483647 (R's .Machine$integer.max), so this is correct:
printf("%x\n", 2147483647);
I am running R on an Ubuntu workstation with 8 virtual cores and 8 GB of RAM. I was hoping to routinely use the multicore package to make use of the 8 cores in parallel; however, I find that the whole R process gets duplicated 8 times.
As R actually seems to use much more memory than is reported by gc() (by a factor of 5, even right after a gc() call), this means that even relatively mild memory usage (one 200 MB object) becomes intractably memory-heavy once duplicated 8 times.
I looked into bigmemory to have the child processes share the same memory space, but it would require major rewriting of my code, as it doesn't deal with data frames.
Is there a way to make R as lean as possible before forking, i.e. have the OS reclaim as much memory as possible?
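The obvious first step, sketched here with placeholder object names, is to drop unneeded references and force a collection immediately before forking:
rm(big_intermediate)  # placeholder: any large object no longer needed
invisible(gc())       # ask R to release what it can before mclapply() forks
res <- mclapply(work_items, process_fun)  # placeholders for the actual job
Whether the OS actually gets the freed pages back depends on the allocator, which is presumably part of the problem.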
EDIT:
I think I understand what is going on now. The problem is not where I thought it was -- objects that exist in the parent thread and are not manipulated do not get duplicated eight times. Instead my problem, I believe, came from the nature of the manipulation I am making each child process perform. Each has to manipulate a big factor with hundreds of thousands of levels, and I think this is the memory-heavy bit. As a result, it is indeed the case that the overall memory load is proportional to the number of cores; but not as dramatically as I thought.
Another lesson I learned is that with 4 physical cores + possibility of hyperthreading, hyperthreading is actually not typically a good idea for R. The gain is minimal, and the memory cost may be non-trivial. So I'll be working on 4 cores from now on.
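The factor cost is easy to reproduce in isolation (an illustrative check; exact sizes vary by machine):
f <- factor(sample(5e5, 1e6, replace = TRUE)) # ~500k levels across 1M values
print(object.size(f), units = "Mb")           # the levels vector alone makes this large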
For those who would like to experiment, this is the type of code I was running:
# Create data
sampdata <- data.frame(id = 1:1000000)
for (letter in letters) {
  sampdata[, letter] <- rnorm(1000000)
}
sampdata$groupid <- ceiling(sampdata$id / 2)
# Enable multicore
library(multicore)
options(cores = 4) # number of cores to distribute the job to
# Actual job
system.time(do.call("cbind",
  mclapply(subset(sampdata, select = c(a:z)),
           function(x) tapply(x, sampdata$groupid, sum))
))
Have you tried data.table?
> system.time(ans1 <- do.call("cbind",
lapply(subset(sampdata,select=c(a:z)),function(x)tapply(x,sampdata$groupid,sum))
))
user system elapsed
906.157 13.965 928.645
> require(data.table)
> DT = as.data.table(sampdata)
> setkey(DT,groupid)
> system.time(ans2 <- DT[,lapply(.SD,sum),by=groupid])
user system elapsed
186.920 1.056 191.582 # 4.8 times faster
> # massage minor diffs in results...
> ans2$groupid=NULL
> ans2=as.matrix(ans2)
> colnames(ans2)=letters
> rownames(ans1)=NULL
> identical(ans1,ans2)
[1] TRUE
Your example is very interesting. It is reasonably large (200MB), there are many groups (1/2 million), and each group is very small (2 rows). The 191s can probably be improved by quite a lot, but at least it's a start. [March 2011]
And now, this idiom (i.e. lapply(.SD, ...)) has been improved a lot. With v1.8.2, on a faster computer than the one used for the test above, and with the latest version of R, here is the updated comparison:
sampdata <- data.frame(id = 1:1000000)
for (letter in letters) sampdata[, letter] <- rnorm(1000000)
sampdata$groupid = ceiling(sampdata$id/2)
dim(sampdata)
# [1] 1000000 28
system.time(ans1 <- do.call("cbind",
lapply(subset(sampdata,select=c(a:z)),function(x)
tapply(x,sampdata$groupid,sum))
))
# user system elapsed
# 224.57 3.62 228.54
DT = as.data.table(sampdata)
setkey(DT,groupid)
system.time(ans2 <- DT[,lapply(.SD,sum),by=groupid])
# user system elapsed
# 11.23 0.01 11.24 # 20 times faster
# massage minor diffs in results...
ans2[,groupid:=NULL]
ans2[,id:=NULL]
ans2=as.matrix(ans2)
rownames(ans1)=NULL
identical(ans1,ans2)
# [1] TRUE
sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] data.table_1.8.2 RODBC_1.3-6
Things I've tried on Ubuntu 64 bit R, ranked in order of success:
Work with fewer cores, as you are doing.
Split the mclapply jobs into pieces, and save the partial results to a database using DBI with append=TRUE (see the sketch at the end of this answer).
Use the rm() function along with gc() often.
I have tried all of these, and mclapply still begins to create larger and larger processes as it runs, leading me to suspect each process is holding onto some sort of residual memory it really doesn't need.
P.S. I was using data.table, and it seems each child process copies the data.table.
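For completeness, the "split into pieces" idea from the list above looks roughly like this (the table and file names are illustrative; the per-slice work is the same job as in the question):
library(multicore)
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "partial_results.db")
# process the half-million groups in 10 slices, appending each slice's result
grp <- unique(sampdata$groupid)
slices <- split(grp, cut(seq_along(grp), 10))
for (s in slices) {
  sub  <- sampdata[sampdata$groupid %in% s, ]
  part <- do.call("cbind", mclapply(sub[letters], function(x) tapply(x, sub$groupid, sum)))
  dbWriteTable(con, "partial_sums", as.data.frame(part), append = TRUE)
  rm(sub, part); gc() # release the slice before the next fork
}
dbDisconnect(con)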