Potential problems from over-allocating truelength more than 1000 times - r

tl;dr: What are potential problems after a truelength over-allocation warning?
Recently I did something stupid like this:
m <- matrix(seq_len(1e4),nrow=10)
library(data.table)
DT <- data.table(id=rep(1:2,each=5),m)
DT[,id2:=id]
#Warning message:
# In `[.data.table`(DT, , `:=`(id2, id)) :
# tl (2002) is greater than 1000 items over-allocated (ncol = 1001).
# If you didn't set the datatable.alloccol option very large,
# please report this to datatable-help including the result of sessionInfo().
DT[,lapply(.SD,mean),by=id2]
After some searching it became apparent that the warning resulted from adding a column by reference to a data.table with too many columns and I found some rather technical explanations (e.g., this), which I probably don't fully understand.
I know that I can avoid the issue (e.g., use data.table(id=rep(1:2,each=5),stack(as.data.frame(m)))), but I wonder if I should expect problems subsequent to such a warning (other than the obvious performance disadvantage from working with a wide format data.table).
R version 2.15.3 (2013-03-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 LC_NUMERIC=C LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] data.table_1.8.8 fortunes_1.5-0

Good question. By default in v1.8.8 :
> options()$datatable.alloccol
max(100, 2 * ncol(DT))
That's probably not the best default. Try changing it :
options(datatable.alloccol = quote(max(100L, ncol(DT)+64L))
UPDATE: I've now changed the default in v1.8.9 to that.
That option just controls how many spare column pointer slots are allocated so that := can add columns by reference.
From NOTES in NEWS for v1.8.9
The default for datatable.alloccol has changed from max(100L, 2L*ncol(DT)) to max(100L, ncol(DT)+64L). And a pointer to ?truelength has been added to an error message as sugggested and thanks to Roland :
Potential problems from over-allocating truelength more than 1000 times

Related

R - c() unexpectedly converts names of named vectors into UTF-8. Is this a bug?

I've faced a strange behavior of c() with R 3.3.2 on Windows with non-US-English locale. It converts the names of named vectors into UTF-8.
x <- "φ"
names(x) <- "φ"
Encoding(names(x))
#> [1] "unknown"
Encoding(names(c(x)))
#> [1] "UTF-8"
Thought this issue is not problematic for most people, it is critical for those who uses named vectors as lookup tables (example is here: http://adv-r.had.co.nz/Subsetting.html#applications). I am also the one who stuck with the behavior of dplyr's select() function.
I'm not quite sure whether this behavior is a bug or by design. Should I submit a bug report to R core?
Here's info about my R environment:
sessionInfo()
#> R version 3.3.2 (2016-10-31)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows >= 8 x64 (build 9200)
#>
#> locale:
#> [1] LC_COLLATE=Japanese_Japan.932 LC_CTYPE=Japanese_Japan.932 LC_MONETARY=Japanese_Japan.932
#> [4] LC_NUMERIC=C LC_TIME=Japanese_Japan.932
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] tools_3.3.2
You should still see names(c(x)) == names(x) on your system. The encoding change by c() may be unintentional, but shouldn't affect your code in most scenarios.
On Windows, which doesn't have a UTF-8 locale, your safest bet is to convert all strings to UTF-8 first via enc2utf8(), and then stay in UTF-8. This will also enable safe lookups.
Language symbols (as used in dplyr's group_by()) are an entirely different issue. For some reason they are always interpreted in the native encoding. (Try as.name(names(c(x))).) However, it's still best to have them in UTF-8, and convert to native just before calling as.name(). This is what dplyr should be doing, we're just not quite there yet.
My recommendation is to use ASCII-only characters for column names when using dplyr on Windows. This requires some discipline if you're relying on tidyr::spread() for non-ASCII column contents. You could also consider switching to a system (OS X or Linux) that works with UTF-8 natively.

setMethodS3 fails to correctly reassign default function

When I create s3 methods in R 2.14.1, and then call them, the s3 objects fail to execute the methods in cases where the method has the same name as functions already loaded into the workspace (i.e. base functions). Instead it calls the base function and returns an error. This example uses 'match'. I never had this problem before today. Since I last ran this code, I installed R 3.0.2, but kept my 2.14.1 version. I ran into some trouble (different trouble) with 3.0.2 due to certain packages not being up to date in CRAN, so I reverted RStudio to 2.14.1, and then this problem cropped up. Here's an example:
rm(list=ls())
library(R.oo)
this.df<-data.frame(letter=c("A","B","C"),number=1:3)
setConstructorS3("TestClass", function(DF) {
if (missing(DF)) {
data= NA
} else {
data=DF
}
extend(Object(), "TestClass",
.data=data
)
})
setMethodS3(name="match", class="TestClass", function(this,letter,number,...){
ret = rep(TRUE,nrow(this$.data))
if (!missing(number))
ret = ret & (this$.data$number %in% number)
if (!missing(letter)){
ret = ret & (this$.data$letter %in% letter)
}
return(ret)
})
setMethodS3("get_data", "TestClass", function(this,...) {
return(this$.data[this$match(...),])
})
hope<-TestClass(this.df)
hope$match()
Error in match(this, ...) : argument "table" is missing, with no default
hope$get_data()
Here's the sessionInfo() for clues:
sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] R.oo_1.13.0 R.methodsS3_1.4.2
loaded via a namespace (and not attached):
[1] tools_2.14.1
I tried a lot of combinations of the arguments in setMethodsS3 with no luck.
How can I fix this?
First, I highly recommend calling S3 methods the regular way and not via the <object>$method(...) way, e.g. match(hope) instead of hope$match(). If you do that, everything works as expected.
Second, I can reproduce this issue with R 3.0.2 and R.oo 1.17.0. There appears to be some issues using the particular method name match() here, because if you instead use match2(), calling hope$match2() works as expected. I've seen similar problems when trying to create S3 methods named assign() and get(). The latter actually generates a error if tried, e.g. "Trying to use an unsafe generic method name (trust us, it is for a good reason): get". I'll add assign() and most likely match() to the list of no-no names. DETAILS: Those functions are very special in R so one should avoid using those names. This is because, if done, S3 generic functions are created for them and all calls will be dispatched via generic functions and that is not compatible with
Finally, you should really really update your R - it's literally ancient and few people will not bother trying to help you with issues unless you run the most recent stable R version (now R 3.0.2 soon to be R 3.1.0). At a minimum, you should make sure to run the latest versions of the package (your R.methodsS3 and R.oo versions are nearly 2 and 1 years old by now with important updates since).
Hope this helps

ns varies for no apparent reason

My results of using splines::ns with a least-squares fit varied with no rhyme or reason that I could see, and I think I have traced the problem to the ns function itself.
I have reduced the problem to this:
require(splines)
N <- 0
set.seed(1)
for (i in 1:100) N <- N + identical(ns(1:10,3),ns(1:10,3))
N
My results average about 39, range 34--44 or so, but I expected 100 every time. Why should the results of ns be random? If I substitute bs for ns in both places, I get 100, as expected. My set.seed(1) hopes to demonstrate that the randomness I get is not what R intended.
In a clean session, using RStudio and R version 2.14.2 (2012-02-29), I get 39, 44, 38, etc. Everyone else seems to be getting 100.
Further info:
Substituing splines::ns for ns gives the same results. A clean vanilla session gives the same results. My computer has 8 cores.
The differences, when they happen, are generally or always 2^-54:
Max <- 0
for (i in 1:1000) Max <- max( Max, abs(ns(1:10,3)-ns(1:10,3)) )
c(Max,2^-54)
with result [1] 5.551115e-17 5.551115e-17. This variability causes me big problems down the line, because my optimize(...)$min now varies sometimes even in the first digit, making results not repeatable.
My sessionInfo with a clean vanilla session:
I created what I understand to be known as a clean vanilla session using
> .Last <- function() system("R --vanilla")
> q("no")
This blows away the session, and when I restart it, I get my clean vanilla session. Then, in answer to Ben Bolker's clarifying question, I did this at the beginning of my clean vanilla session:
> sessionInfo()
R version 2.14.2 (2012-02-29)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Revobase_6.1.0 RevoMods_6.1.0 RevoScaleR_3.1-0 lattice_0.20-0
[5] rpart_3.1-51
loaded via a namespace (and not attached):
[1] codetools_0.2-8 foreach_1.4.0 grid_2.14.2 iterators_1.0.6
[5] pkgXMLBuilder_1.0 revoIpe_1.0 tools_2.14.2 XML_3.9-1.1
> require(splines)
Loading required package: splines
> N <- 0
> set.seed(1)
> for (i in 1:100) N <- N + identical(ns(1:10,3),ns(1:10,3))
> N
[1] 32
This is the answer I got from REvolution Technical Suppoort (posted here with permission):
The problem here is an issue of floating point arithmetic. Revolution
R uses the Intel mkl BLAS library for some computations, which differs
from what CRAN-R uses and uses this library for the 'ns()'
computation. In this case you will also get different results
depending on whether you are doing the computation on a
Intel-processor based machine or a machine with an AMD chipset.
We do ship the same BLAS and Lapack DLL's that are shipped with
CRAN-R, but they are not the default ones used with Revolution R.
Customers can revert the installed DLL's if they so choose and prefer,
by doing the following:
1). Renaming 'Rblas.dll' to 'Rblas.dll.bak' and 'Rlapack.dll' to
'Rlapack.dll.bak'
in the folder 'C:\Revolution\R-Enterprise-6.1\R-2.14.2\bin\x64'.
2). Rename the files 'Rblas.dll.0' and 'Rlapack.dll.0' in this folder
to Rblas.dll and Rlpack.dll respectively.
Their suggestion worked perfectly. I have renamed these files back and forth several times, using both RStudio (with Revolution R) and Revolution R's own IDE, always with the same result: The BLAS dlls give me N==40 or so, and the CRAN-R dlls give me N==100.
I will probably go back to BLAS, because in my tests, it is 8-times faster for %*% and 4-times faster for svd(). And that is just using one of my cores (verified by the CPU usage column of the Processes tab of Windows Task Manager).
I am hoping someone with better understanding can write a better answer, for I still don't really understand the full ramifications of this.

ffdfdply function crashes R and is very slow

learning how to compute tasks in R for large data sets (more than 1 or 2 GB), I am trying to use ff package and ffdfdply function.
(See this link on how to use ffdfdply: R language: problems computing "group by" or split with ff package )
My data have the following columns:
"id" "birth_date" "diagnose" "date_diagnose"
There are several rows for each "id", and I want to extract the first date where there was a diagnose.
I would apply this :
library(ffbase)
library(plyr)
load(file=file_name); # to load my ffdf database, called data.f .
my_fun <- function(x){
ddply( x , .(id), summarize,
age = min(date_diagnose - birth_date, na.rm=TRUE))
}
result <- ffdfdply(x = data.f, split = data.f$id,
FUN = function(x) my_fun(x) , trace=TRUE) ;
result[1:10,] # to check....
It is very strange, but this command: ffdfdply(x = data.f, .... ) is making RStudio (and R) crash. Sometimes the same command will crash R and sometimes not.
For example, if I trigger again the ffdfdply line (which worked the first time), R will crash.
Also using other functions, data, etc. will have the same effect. There is no memory increase, or anything into log.txt.
Same behaviour when applying the summaryBy "technique"....
So if anybody has the same problem and found the solution, that would be very helpful.
Also ffdfdply gets very slow (slower than SAS...) , and I am thinking about making another strategy to make this kind of tasks.
Is ffdfdply taking into account that for example the data set is ordered by id? (so it does not have to look into all the data to take the same ids... ).
So, if anybody knows other approaches to this ddply problem, it would be really great for all the "large data sets in R with low RAM memory" users.
This is my sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Danish_Denmark.1252 LC_CTYPE=Danish_Denmark.1252
[3] LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C
[5] LC_TIME=Danish_Denmark.1252
attached base packages:
[1] tools stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] plyr_1.7.1 ffbase_0.6-1 ff_2.2-10 bit_1.1-9
I also noticed this when using the package which we uploaded to CRAN recently. It seems to be caused by overloading in package ffbase the "[.ff" and "[<-.ff" extractor and setter functions from package ff.
I will remove this feature from the package and will upload it to CRAN soon. In the mean time, you can use the version 0.7 of ffbase, which you can get here:
http://dl.dropbox.com/u/25690064/ffbase_0.7.tar.gz
and install it as:
download.file("http://dl.dropbox.com/u/25690064/ffbase_0.7.tar.gz", "ffbase_0.7.tar.gz")
shell("R CMD INSTALL ffbase_0.7.tar.gz")
Let me know if that helped.

R exits unexpectedly when trying to print a date vector with length more than 130K

I have tried the following code, and then R exits unexpectedly:
temp <- rep(as.Date("2009-01-01")+1:365, 365)
print(temp)
Anyone tried this before, is it a bug, or if there is anything I can do?
I have increase the memory available for R from 1024M to 2047M, but the same thing happens.
Thanks.
UPDATE #1
Here's my sessionInfo()
> sessionInfo()
R version 2.11.0 (2010-04-22)
i386-pc-mingw32
locale:
[1] LC_COLLATE=Chinese_Hong Kong S.A.R..950 LC_CTYPE=Chinese_Hong Kong S.A.R..950
[3] LC_MONETARY=Chinese_Hong Kong S.A.R..950 LC_NUMERIC=C
[5] LC_TIME=Chinese_Hong Kong S.A.R..950
attached base packages:
[1] stats graphics grDevices utils datasets methods base
And actually I am trying to do some format(foo, "%y%M"), but it also exit unexpectedly - for "unexpectedly", I mean R closes itself without any signs. Thanks again.
Looks like a known bug, fixed in 2.11.1 to me. Look at 2.11.1 changelog, section BUG FIXES, 8th item.
Works fine for me (R 2.12.0, running on Fedora Core 13)
You may check the output of getOption("max.print") and lower it a little bit (using, for instance, option(max.print=5000).
In general, however, there is no need to print out such a vector, as you will not be able to read the complete output. Functions like str(t) or head(t) are your friends!

Resources