Edit: The floating-point issue has been discussed previously (not only on this site; please see the links below). This question is therefore essentially a duplicate, but it may still be useful to connect the well-known floating-point issue to floor division in R.
In this version of R:
R version 3.6.3 (2020-02-29) -- "Holding the Windsock"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
I have observed an inconsistency:
> 65 %/% 5
[1] 13
> 6.5 %/% .5
[1] 13
> .65 %/% .05
[1] 12
> .065 %/% .005
[1] 13
This may be a known issue, probably related to floating-point arithmetic.
How can I deal with this issue in everyday calculations in order to avoid wrong results?
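For illustration, printing more digits reveals what is actually stored. The true ratio of the two stored values is just below 13 (about 12.9999999999999997), and %/% floors that, hence 12:
> sprintf("%.20f", .65)
[1] "0.65000000000000002220"
> sprintf("%.20f", .05)
[1] "0.05000000000000000278"
In everyday calculations, rescaling to whole numbers, or rounding when the quotient is known to be whole, avoids the surprise:
> (.65 * 100) %/% (.05 * 100)
[1] 13
> round(.65 / .05)
[1] 13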
As suspected, my question is not new; the behaviour is intrinsic to floating-point numbers.
As far as I can tell, the best way to cope with this issue is to understand it and accept its existence.
More information can be found here, here and here.
Then we have the R-FAQ and the R Inferno; I found these links on the pages mentioned above.
Context
R's help() function indicates that round() adheres to a standard that is identical to IEEE 754: round halves to even.
Wikipedia describes this standard with the following example: 23.5 becomes 24 and 24.5 becomes 24.
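Both of these halves are exactly representable in binary, so R reproduces the Wikipedia example faithfully:
round(23.5) # [1] 24
round(24.5) # [1] 24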
help() also states, "...this is dependent on OS services and on representation error."
Experiment
System details
R version: 3.6.2 (2019-12-12)
Arch / OS: x86_64, darwin18.7.0 (MacOS)
On my machine, I see that
round(0.55, digits = 1) # [1] 0.6
round(1.55, digits = 1) # [1] 1.6
round(2.55, digits = 1) # [1] 2.5
round(3.55, digits = 1) # [1] 3.5
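Printing more digits of the stored values makes the representation error visible:
sprintf("%.20f", 0.55) # "0.55000000000000004441"
sprintf("%.20f", 1.55) # "1.55000000000000004441"
sprintf("%.20f", 2.55) # "2.54999999999999982236"
sprintf("%.20f", 3.55) # "3.54999999999999982236"
None of these is an exact half at the first decimal place, so the round-half-to-even rule never actually fires: the first two sit just above the half and round up, the last two sit just below and round down.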
Question
Should I attribute the discrepancy between the first two and the last two statements to errant OS services and/or to representation error?
R newbie here, re-building a Python pipeline in R.
d is a data.frame:
$ day (chr) "2016-10-13", ...
$ city_name (chr) "SF", ...
$ type (chr) "Green", ...
$ count (int) 10, ...
I'm doing a spread() on the data with the tidyr package:
d %>% spread(type,count)
Works fine running locally (Mac) with:
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)
tidyr_0.6.1
I run the identical command on a Linux box, on the same input d, but it returns an error:
Error in `[.data.table`(.variables, lengths != 0) : i evaluates to a logical vector length 3 but there are 930 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.
On Linux, I'm running:
R version 3.2.5 (2016-04-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu precise (12.04.5 LTS)
tidyr_0.6.0
Any idea what this error means, and why it would be thrown?
Edit: fixed this after updating tidyr on Linux, and re-starting the R session.
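For anyone hitting the same thing, the quick diagnostic is to compare package versions across the machines before digging into the error itself:
packageVersion("tidyr")   # 0.6.0 on the failing Linux box, 0.6.1 on the working Mac
install.packages("tidyr") # update, then restart the R session and retry the spread()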
Running R on Linux (see version output below), I experience weird behavior with sprintf() converting decimal to hex.
Does anybody know what could explain this? The first conversion works fine; the second returns an error about numeric objects:
> sprintf("%x",2109440182)
[1] "7dbb80b6"
> sprintf("%x",2151028214)
Error in sprintf("%x", 2151028214) :
invalid format '%x'; use format %f, %e, %g or %a for numeric objects
version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 0.1
year 2013
month 05
day 16
svn rev 62743
language R
version.string R version 3.0.1 (2013-05-16)
nickname Good Sport
Thanks, Michael
For comparison, gcc complains about the analogous C call: format ‘%x’ expects an argument of type ‘unsigned int’, but argument 2 has type ‘long int’.
I guess the number is larger than an int can hold. The maximum on my system is 2147483647 (R's .Machine$integer.max), so this is correct:
printf("%x\n", 2147483647);
I am running R on an Ubuntu workstation with 8 virtual cores and 8 GB of RAM. I was hoping to routinely use the multicore package to make use of the 8 cores in parallel; however, I find that the whole R process gets duplicated 8 times.
As R actually seems to use much more memory than is reported by gc() (by a factor of 5, even right after a gc() call), this means that even relatively mild memory usage (one 200 MB object) becomes intractably memory-heavy once duplicated 8 times.
I looked into bigmemory to have the child processes share the same memory space, but it would require major rewriting of my code, as it doesn't deal with data frames.
Is there a way to make R as lean as possible before forking, i.e. have the OS reclaim as much memory as possible?
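The obvious first step, sketched here with placeholder object names, is to drop unneeded references and force a collection immediately before forking:
rm(big_intermediate)  # placeholder: any large object no longer needed
invisible(gc())       # ask R to release what it can before mclapply() forks
res <- mclapply(work_items, process_fun)  # placeholders for the actual job
Whether the OS actually gets the freed pages back depends on the allocator, which is presumably part of the problem.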
EDIT:
I think I understand what is going on now. The problem is not where I thought it was -- objects that exist in the parent thread and are not manipulated do not get duplicated eight times. Instead my problem, I believe, came from the nature of the manipulation I am making each child process perform. Each has to manipulate a big factor with hundreds of thousands of levels, and I think this is the memory-heavy bit. As a result, it is indeed the case that the overall memory load is proportional to the number of cores; but not as dramatically as I thought.
Another lesson I learned is that with 4 physical cores + possibility of hyperthreading, hyperthreading is actually not typically a good idea for R. The gain is minimal, and the memory cost may be non-trivial. So I'll be working on 4 cores from now on.
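The factor cost is easy to reproduce in isolation (an illustrative check; exact sizes vary by machine):
f <- factor(sample(5e5, 1e6, replace = TRUE)) # ~500k levels across 1M values
print(object.size(f), units = "Mb")           # the levels vector alone makes this large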
For those who would like to experiment, this is the type of code I was running:
# Create data
sampdata <- data.frame(id = 1:1000000)
for (letter in letters) {
  sampdata[, letter] <- rnorm(1000000)
}
sampdata$groupid <- ceiling(sampdata$id / 2)
# Enable multicore
library(multicore)
options(cores = 4) # number of cores to distribute the job to
# Actual job
system.time(do.call("cbind",
  mclapply(subset(sampdata, select = c(a:z)),
           function(x) tapply(x, sampdata$groupid, sum))
))
Have you tried data.table?
> system.time(ans1 <- do.call("cbind",
lapply(subset(sampdata,select=c(a:z)),function(x)tapply(x,sampdata$groupid,sum))
))
user system elapsed
906.157 13.965 928.645
> require(data.table)
> DT = as.data.table(sampdata)
> setkey(DT,groupid)
> system.time(ans2 <- DT[,lapply(.SD,sum),by=groupid])
user system elapsed
186.920 1.056 191.582 # 4.8 times faster
> # massage minor diffs in results...
> ans2$groupid=NULL
> ans2=as.matrix(ans2)
> colnames(ans2)=letters
> rownames(ans1)=NULL
> identical(ans1,ans2)
[1] TRUE
Your example is very interesting. It is reasonably large (200MB), there are many groups (1/2 million), and each group is very small (2 rows). The 191s can probably be improved by quite a lot, but at least it's a start. [March 2011]
And now, this idiom (i.e. lapply(.SD, ...)) has been improved a lot. With v1.8.2, on a faster computer than the one used for the test above, and with the latest version of R, here is the updated comparison:
sampdata <- data.frame(id = 1:1000000)
for (letter in letters) sampdata[, letter] <- rnorm(1000000)
sampdata$groupid = ceiling(sampdata$id/2)
dim(sampdata)
# [1] 1000000 28
system.time(ans1 <- do.call("cbind",
lapply(subset(sampdata,select=c(a:z)),function(x)
tapply(x,sampdata$groupid,sum))
))
# user system elapsed
# 224.57 3.62 228.54
DT = as.data.table(sampdata)
setkey(DT,groupid)
system.time(ans2 <- DT[,lapply(.SD,sum),by=groupid])
# user system elapsed
# 11.23 0.01 11.24 # 20 times faster
# massage minor diffs in results...
ans2[,groupid:=NULL]
ans2[,id:=NULL]
ans2=as.matrix(ans2)
rownames(ans1)=NULL
identical(ans1,ans2)
# [1] TRUE
sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] data.table_1.8.2 RODBC_1.3-6
Things I've tried on Ubuntu 64 bit R, ranked in order of success:
Work with fewer cores, as you are doing.
Split the mclapply jobs into pieces, and save the partial results to a database using DBI with append=TRUE (see the sketch at the end of this answer).
Use the rm() function along with gc() often.
I have tried all of these, and mclapply still begins to create larger and larger processes as it runs, leading me to suspect each process is holding onto some sort of residual memory it really doesn't need.
P.S. I was using data.table, and it seems each child process copies the data.table.
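For completeness, the "split into pieces" idea from the list above looks roughly like this (the table and file names are illustrative; the per-slice work is the same job as in the question):
library(multicore)
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "partial_results.db")
# process the half-million groups in 10 slices, appending each slice's result
grp <- unique(sampdata$groupid)
slices <- split(grp, cut(seq_along(grp), 10))
for (s in slices) {
  sub  <- sampdata[sampdata$groupid %in% s, ]
  part <- do.call("cbind", mclapply(sub[letters], function(x) tapply(x, sub$groupid, sum)))
  dbWriteTable(con, "partial_sums", as.data.frame(part), append = TRUE)
  rm(sub, part); gc() # release the slice before the next fork
}
dbDisconnect(con)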