While learning how to handle large data sets in R (more than 1 or 2 GB), I am trying to use the ff package and the ffdfdply function.
(See this link on how to use ffdfdply: R language: problems computing "group by" or split with ff package )
My data have the following columns:
"id" "birth_date" "diagnose" "date_diagnose"
There are several rows for each "id", and I want to extract the first date at which there was a diagnosis.
I would apply this:
library(ffbase)
library(plyr)
load(file=file_name); # to load my ffdf database, called data.f .
my_fun <- function(x) {
  ddply(x, .(id), summarize,
        age = min(date_diagnose - birth_date, na.rm = TRUE))
}
result <- ffdfdply(x = data.f, split = data.f$id,
                   FUN = my_fun, trace = TRUE)
result[1:10, ]  # to check....
It is very strange, but this command, ffdfdply(x = data.f, ...), makes RStudio (and R) crash. Sometimes the same command will crash R and sometimes not.
For example, if I run the ffdfdply line again (even though it worked the first time), R will crash.
Using other functions, data, etc. has the same effect. There is no memory increase, and nothing shows up in log.txt.
The behaviour is the same when applying the summaryBy "technique"....
So if anybody has had the same problem and found a solution, that would be very helpful.
Also, ffdfdply gets very slow (slower than SAS...), and I am thinking about devising another strategy for this kind of task.
Does ffdfdply take into account that, for example, the data set is ordered by id (so it does not have to scan all the data to find the rows with the same id)?
So, if anybody knows other approaches to this ddply problem, it would be really great for all the "large data sets in R with low RAM" users.
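For illustration, here is a minimal sketch of one alternative (assuming the column names above and that data.f is the ffdf loaded earlier): keep ffdfdply for the chunked split, but do the per-chunk aggregation with data.table instead of plyr::ddply, which is usually considerably faster.
library(ffbase)
library(data.table)
# ffdfdply passes each chunk to FUN as a plain data.frame and expects a data.frame back
my_fun_dt <- function(d) {
  dt <- as.data.table(d)
  res <- dt[, list(age = min(date_diagnose - birth_date, na.rm = TRUE)), by = id]
  as.data.frame(res)
}
result <- ffdfdply(x = data.f, split = data.f$id, FUN = my_fun_dt, trace = TRUE)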
This is my sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Danish_Denmark.1252 LC_CTYPE=Danish_Denmark.1252
[3] LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C
[5] LC_TIME=Danish_Denmark.1252
attached base packages:
[1] tools stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] plyr_1.7.1 ffbase_0.6-1 ff_2.2-10 bit_1.1-9
I also noticed this when using the version of the package which we uploaded to CRAN recently. It seems to be caused by package ffbase overloading the "[.ff" and "[<-.ff" extractor and setter functions from package ff.
I will remove this feature from the package and will upload it to CRAN soon. In the meantime, you can use version 0.7 of ffbase, which you can get here:
http://dl.dropbox.com/u/25690064/ffbase_0.7.tar.gz
and install it as:
download.file("http://dl.dropbox.com/u/25690064/ffbase_0.7.tar.gz", "ffbase_0.7.tar.gz")
shell("R CMD INSTALL ffbase_0.7.tar.gz")
Let me know if that helped.
There seems to be a difference in speed depending on how you specify the columns to be selected from a data.table: x[, .(var)] vs x[, c('var')].
The reason may be completely obvious; however, in the help pages the .(), list() and c() notations seem to be used interchangeably.
I work with quite large datasets, so it is a bit important to me :-)
Example (the order of the calls does not affect the speed):
x <- as.data.table(as.character(rnorm(20000000,1,0.5)))
setkey(x, V1)
tic(); x[, .(V1)]; toc()
25.08 sec elapsed
tic(); x[, c('V1')]; toc()
0.28 sec elapsed
tic(); x[, 1]; toc()
0.02 sec elapsed
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tictoc_1.0 data.table_1.12.8
loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1 lifecycle_0.2.0 rlang_0.4.6
You have found a bug (issue filed here) -- data.table is trying to determine if the output of [] is keyed; in order to do so, it is running an internal is.sorted function. This is very slow on a huge table of unique strings.
Fortunately, we can do static analysis and realize that your output table is in fact keyed -- there's no subset, and the key column (V1) is unchanged. Therefore the sort order cannot have changed, and your output will also be sorted by V1.
This logic is built in to a PR to fix this issue -- you can test it out with remotes::install_github('Rdatatable/data.table#fix_sorting_on_sorted'), with the caveat that this is a bleeding edge version of the package, or you can wait till it's merged to master, or until a new version is released to CRAN.
In the meantime, here's a workaround:
setkey(x, NULL)
system.time(x[ , .(V1)])
# user system elapsed
# 0.120 0.087 0.213
Of course this blocks later processing from recognizing that your data is sorted & the efficiencies thereto...
In this case (!and this case only -- use with care!!!) -- where you are yourself certain that the data is already sorted by V1 -- you can restore the key instantly with:
setattr(x, 'sorted', 'V1')
More generally, there are small differences among selection with [, [[, $, etc. [ will tend to be the slowest, since we do a lot of "static query analysis" to help improve the efficiency of your code; this comes with a performance cost which we hope will be small almost every time. Any time this cost is not small, it should be a bug. There is also some work being done actively to try and offer shortcuts to reduce this overhead; see for example this PR.
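To make that last point concrete, here is a small timing sketch (mine, with smaller data than in the question, so the numbers are only illustrative): [[ and $ return a plain vector and bypass [.data.table entirely, while [ pays the query-analysis cost.
library(data.table)
library(tictoc)
x <- as.data.table(as.character(rnorm(1e6, 1, 0.5)))
setkey(x, V1)
tic(); v1 <- x[["V1"]]; toc()     # plain character vector, no [.data.table dispatch
tic(); v2 <- x$V1; toc()          # same, via $
tic(); d1 <- x[, c("V1")]; toc()  # one-column data.table, little query analysis
tic(); d2 <- x[, .(V1)]; toc()    # one-column data.table; triggers the is.sorted check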
When I create S3 methods in R 2.14.1 and then call them, the S3 objects fail to execute the methods in cases where the method has the same name as a function already loaded into the workspace (i.e. a base function). Instead, the base function is called and an error is returned. This example uses 'match'. I never had this problem before today. Since I last ran this code, I installed R 3.0.2, but kept my 2.14.1 version. I ran into some (different) trouble with 3.0.2 due to certain packages not being up to date on CRAN, so I reverted RStudio to 2.14.1, and then this problem cropped up. Here's an example:
rm(list=ls())
library(R.oo)
this.df<-data.frame(letter=c("A","B","C"),number=1:3)
setConstructorS3("TestClass", function(DF) {
if (missing(DF)) {
data= NA
} else {
data=DF
}
extend(Object(), "TestClass",
.data=data
)
})
setMethodS3(name="match", class="TestClass", function(this,letter,number,...){
ret = rep(TRUE,nrow(this$.data))
if (!missing(number))
ret = ret & (this$.data$number %in% number)
if (!missing(letter)){
ret = ret & (this$.data$letter %in% letter)
}
return(ret)
})
setMethodS3("get_data", "TestClass", function(this,...) {
return(this$.data[this$match(...),])
})
hope<-TestClass(this.df)
hope$match()
Error in match(this, ...) : argument "table" is missing, with no default
hope$get_data()
Here's the sessionInfo() for clues:
sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] R.oo_1.13.0 R.methodsS3_1.4.2
loaded via a namespace (and not attached):
[1] tools_2.14.1
I tried a lot of combinations of the arguments in setMethodS3 with no luck.
How can I fix this?
First, I highly recommend calling S3 methods the regular way and not via the <object>$method(...) way, e.g. match(hope) instead of hope$match(). If you do that, everything works as expected.
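For illustration, a minimal sketch of that call style, reusing TestClass and this.df from the question:
hope <- TestClass(this.df)
match(hope)                  # regular S3 dispatch to match.TestClass: TRUE TRUE TRUE
match(hope, letter = "A")    # should give TRUE FALSE FALSE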
Second, I can reproduce this issue with R 3.0.2 and R.oo 1.17.0. There appears to be some issue with using the particular method name match() here, because if you instead use match2(), calling hope$match2() works as expected. I've seen similar problems when trying to create S3 methods named assign() and get(). The latter actually generates an error if tried, e.g. "Trying to use an unsafe generic method name (trust us, it is for a good reason): get". I'll add assign() and most likely match() to the list of no-no names. DETAILS: Those functions are very special in R, so one should avoid using those names. If you do, S3 generic functions are created for them, all calls are then dispatched via those generic functions, and that is not compatible with how R uses these functions internally.
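A sketch of that renaming workaround (match2 is just an illustrative name; the body is copied from the question):
setMethodS3(name = "match2", class = "TestClass", function(this, letter, number, ...) {
  ret <- rep(TRUE, nrow(this$.data))
  if (!missing(number)) ret <- ret & (this$.data$number %in% number)
  if (!missing(letter)) ret <- ret & (this$.data$letter %in% letter)
  return(ret)
})
hope <- TestClass(this.df)
hope$match2()   # works, because match2 does not collide with base::match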
Finally, you should really, really update your R - it's literally ancient, and few people will bother trying to help you with issues unless you run the most recent stable R version (now R 3.0.2, soon to be R 3.1.0). At a minimum, you should make sure to run the latest versions of the packages (your R.methodsS3 and R.oo versions are nearly 2 and 1 years old by now, with important updates since).
Hope this helps
tl;dr: What are potential problems after a truelength over-allocation warning?
Recently I did something stupid like this:
m <- matrix(seq_len(1e4),nrow=10)
library(data.table)
DT <- data.table(id=rep(1:2,each=5),m)
DT[,id2:=id]
#Warning message:
# In `[.data.table`(DT, , `:=`(id2, id)) :
# tl (2002) is greater than 1000 items over-allocated (ncol = 1001).
# If you didn't set the datatable.alloccol option very large,
# please report this to datatable-help including the result of sessionInfo().
DT[,lapply(.SD,mean),by=id2]
After some searching, it became apparent that the warning resulted from adding a column by reference to a data.table with too many columns, and I found some rather technical explanations (e.g., this), which I probably don't fully understand.
I know that I can avoid the issue (e.g., use data.table(id=rep(1:2,each=5),stack(as.data.frame(m)))), but I wonder if I should expect problems subsequent to such a warning (other than the obvious performance disadvantage from working with a wide format data.table).
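For completeness, a sketch of that long-format workaround (my elaboration of the one-liner above): reshaping to long keeps the column count small, so := only ever adds one column.
library(data.table)
m <- matrix(seq_len(1e4), nrow = 10)
DT_long <- data.table(id = rep(1:2, each = 5), stack(as.data.frame(m)))  # columns: id, values, ind
DT_long[, id2 := id]                           # adds one column to a 3-column table, no warning
DT_long[, mean(values), by = list(id2, ind)]   # same means as the wide version, in long format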
R version 2.15.3 (2013-03-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 LC_NUMERIC=C LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] data.table_1.8.8 fortunes_1.5-0
Good question. By default in v1.8.8:
> options()$datatable.alloccol
max(100, 2 * ncol(DT))
That's probably not the best default. Try changing it:
options(datatable.alloccol = quote(max(100L, ncol(DT) + 64L)))
UPDATE: I've now changed the default in v1.8.9 to that.
That option just controls how many spare column pointer slots are allocated so that := can add columns by reference.
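For illustration (my own sketch; truelength() and alloc.col() are the relevant functions, and the numbers are just examples), you can inspect and resize those spare slots directly:
library(data.table)
DT <- data.table(a = 1:5, b = 6:10)
ncol(DT)                   # columns in use
truelength(DT)             # column pointer slots actually allocated
DT <- alloc.col(DT, 1024)  # over-allocate explicitly if you plan to add many columns by reference
truelength(DT)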
From NOTES in NEWS for v1.8.9
The default for datatable.alloccol has changed from max(100L, 2L*ncol(DT)) to max(100L, ncol(DT)+64L). And a pointer to ?truelength has been added to an error message, as suggested by and thanks to Roland:
Potential problems from over-allocating truelength more than 1000 times
My results of using splines::ns with a least-squares fit varied with no rhyme or reason that I could see, and I think I have traced the problem to the ns function itself.
I have reduced the problem to this:
require(splines)
N <- 0
set.seed(1)
for (i in 1:100) N <- N + identical(ns(1:10,3),ns(1:10,3))
N
My results average about 39, ranging from roughly 34 to 44, but I expected 100 every time. Why should the results of ns be random? If I substitute bs for ns in both places, I get 100, as expected. The set.seed(1) is there to demonstrate that the randomness I get is not what R intended.
In a clean session, using RStudio and R version 2.14.2 (2012-02-29), I get 39, 44, 38, etc. Everyone else seems to be getting 100.
Further info:
Substituting splines::ns for ns gives the same results. A clean vanilla session gives the same results. My computer has 8 cores.
The differences, when they happen, are generally or always 2^-54:
Max <- 0
for (i in 1:1000) Max <- max( Max, abs(ns(1:10,3)-ns(1:10,3)) )
c(Max,2^-54)
with result [1] 5.551115e-17 5.551115e-17. This variability causes me big problems down the line, because my optimize(...)$min now varies sometimes even in the first digit, making results not repeatable.
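One way to see that the discrepancies are pure floating-point noise (my addition, not part of the original question): compare with a tolerance instead of bit-for-bit identity.
require(splines)
# isTRUE(all.equal(...)) treats differences around 2^-54 as equal
# (the default tolerance is roughly 1.5e-8), so this counts to 100
N <- 0
for (i in 1:100) N <- N + isTRUE(all.equal(ns(1:10, 3), ns(1:10, 3)))
N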
My sessionInfo with a clean vanilla session:
I created what I understand to be known as a clean vanilla session using
> .Last <- function() system("R --vanilla")
> q("no")
This blows away the session, and when I restart it, I get my clean vanilla session. Then, in answer to Ben Bolker's clarifying question, I did this at the beginning of my clean vanilla session:
> sessionInfo()
R version 2.14.2 (2012-02-29)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Revobase_6.1.0 RevoMods_6.1.0 RevoScaleR_3.1-0 lattice_0.20-0
[5] rpart_3.1-51
loaded via a namespace (and not attached):
[1] codetools_0.2-8 foreach_1.4.0 grid_2.14.2 iterators_1.0.6
[5] pkgXMLBuilder_1.0 revoIpe_1.0 tools_2.14.2 XML_3.9-1.1
> require(splines)
Loading required package: splines
> N <- 0
> set.seed(1)
> for (i in 1:100) N <- N + identical(ns(1:10,3),ns(1:10,3))
> N
[1] 32
This is the answer I got from Revolution Technical Support (posted here with permission):
The problem here is an issue of floating point arithmetic. Revolution
R uses the Intel MKL BLAS library for some computations, which differs
from what CRAN R uses, and it uses this library for the 'ns()'
computation. In this case you will also get different results
depending on whether you are doing the computation on an
Intel-processor based machine or a machine with an AMD chipset.
We do ship the same BLAS and LAPACK DLLs that are shipped with
CRAN R, but they are not the default ones used with Revolution R.
Customers can revert to the CRAN-R DLLs if they so choose,
by doing the following:
1). Rename 'Rblas.dll' to 'Rblas.dll.bak' and 'Rlapack.dll' to
'Rlapack.dll.bak'
in the folder 'C:\Revolution\R-Enterprise-6.1\R-2.14.2\bin\x64'.
2). Rename the files 'Rblas.dll.0' and 'Rlapack.dll.0' in this folder
to 'Rblas.dll' and 'Rlapack.dll' respectively.
Their suggestion worked perfectly. I have renamed these files back and forth several times, using both RStudio (with Revolution R) and Revolution R's own IDE, always with the same result: the MKL BLAS DLLs give me N == 40 or so, and the CRAN-R DLLs give me N == 100.
I will probably go back to the MKL BLAS, because in my tests it is 8 times faster for %*% and 4 times faster for svd(), and that is using just one of my cores (verified via the CPU usage column of the Processes tab in Windows Task Manager).
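For reference, a rough sketch of how such a comparison could be timed (mine, with an arbitrary matrix size):
set.seed(1)
A <- matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(A %*% A)   # matrix multiplication, BLAS-bound
system.time(svd(A))    # singular value decomposition, LAPACK-bound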
I am hoping someone with better understanding can write a better answer, for I still don't really understand the full ramifications of this.
I have tried the following code, and then R exits unexpectedly:
temp <- rep(as.Date("2009-01-01")+1:365, 365)
print(temp)
Has anyone tried this before? Is it a bug, or is there anything I can do?
I have increased the memory available to R from 1024M to 2047M, but the same thing happens.
Thanks.
UPDATE #1
Here's my sessionInfo()
> sessionInfo()
R version 2.11.0 (2010-04-22)
i386-pc-mingw32
locale:
[1] LC_COLLATE=Chinese_Hong Kong S.A.R..950 LC_CTYPE=Chinese_Hong Kong S.A.R..950
[3] LC_MONETARY=Chinese_Hong Kong S.A.R..950 LC_NUMERIC=C
[5] LC_TIME=Chinese_Hong Kong S.A.R..950
attached base packages:
[1] stats graphics grDevices utils datasets methods base
And actually I am trying to do some format(foo, "%y%M"), but it also exits unexpectedly - by "unexpectedly" I mean R closes itself without any warning. Thanks again.
Looks to me like a known bug that was fixed in 2.11.1. Look at the 2.11.1 changelog, section BUG FIXES, 8th item.
Works fine for me (R 2.12.0, running on Fedora Core 13)
You may check the output of getOption("max.print") and lower it a little (using, for instance, options(max.print = 5000)).
In general, however, there is no need to print out such a vector, as you will not be able to read the complete output. Functions like str(temp) or head(temp) are your friends!
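A minimal sketch of that suggestion (the 5000 cap is just an example value):
getOption("max.print")       # limit on how many entries print() will show
options(max.print = 5000)    # lower it so huge vectors cannot flood the console
temp <- rep(as.Date("2009-01-01") + 1:365, 365)
head(temp)                   # inspect a few values instead of all 133,225
str(temp)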