Understanding the data.table invalid .selfref warning in R

I am trying to figure out the data.table 'invalid .selfref' warning that I am getting with the code below.
library(data.table)
library(dplyr)
DT <- data.table(aa=1:100, bb=rnorm(n=100), dd=gl(2,100))
DT <- DT %.% group_by(dd, aa) %.% summarize(m=mean(bb))
DT <- DT[, ee := 3]
The last line throws the warning. Here there is a suggestion to just write the last line as DT$ee <- 3, but that doesn't really explain why it works (and why := doesn't), and to a beginner data.table user it also doesn't feel like the proper data.table idiom.
It IS related to the dplyr line in there, which obviously changes the DT data.table. But when I change that line (and those following) into DDT <- DT %.% group_by() ..., I still get the .selfref warning from the DT[, ee := 3] line.
I have been checking the various sources, but the information there doesn't really sink in, so I am still confused.
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252
[3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C
[5] LC_TIME=Dutch_Netherlands.1252
attached base packages:
[1] graphics grDevices utils datasets stats methods base
other attached packages:
[1] dplyr_0.2 data.table_1.9.2 ggplot2_1.0.0
loaded via a namespace (and not attached):
[1] assertthat_0.1 colorspace_1.2-4 digest_0.6.4 grid_3.1.0
[5] gtable_0.1.2 MASS_7.3-31 munsell_0.4.2 parallel_3.1.0
[9] plyr_1.8.1 proto_0.3-10 Rcpp_0.11.2 reshape2_1.4
[13] scales_0.2.4 stringr_0.6.2 tools_3.1.0

I just ran your code, and I see the problem. data.table over-allocates the vector of column pointers (so that columns can be added efficiently by reference later on), and this warning occurs when an operation (most likely inadvertently) removes that over-allocation.
Let me try to explain over-allocation using slide 45 from Matt's useR 2014 presentation. The (blue and yellow) boxes on the top correspond to the vector of column pointers and the arrow shows the data each pointer is pointing to.
The figure on the left depicts pictorially how adding (or cbinding) a column to a data.frame works. cbinding a column basically results in a (deep or shallow) copy, giving a new location for the vector of column pointers (shown in yellow) and for the data (which now has one more column).
The figure on the right shows the data.table way, where there are more than 3 blue boxes to begin with, due to over-allocation during data.table creation. And by using :=, not even a shallow copy is made. The column pointers that were there before stay where they are, and the next unused over-allocated slot is used for your new column.
That is the difference, and that is what over-allocation means here.
Now the warning tells you that whatever operation you did has removed this over-allocation - meaning the extra blue boxes are gone! So we can't add columns by reference anymore until we over-allocate again (which should be unnecessary and avoided, but since the over-allocation is already gone, we do the next best thing).
My guess is that your dplyr syntax somehow removes this over-allocation, which is caught in the next step when you use :=, and data.table over-allocates once again before adding the new column by reference (which results in a shallow copy).
If I do it the data.table way:
DT <- DT[, list(m=mean(bb)), by=list(dd,aa)]
DT[, ee := 3]
it works just fine.
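To check whether the over-allocation and the self-reference are still intact, data.table exports a couple of helpers; here is a minimal diagnostic sketch (the table is just illustrative):
library(data.table)
DT <- data.table(aa = 1:100, bb = rnorm(100), dd = gl(2, 50))
selfrefok(DT)   # TRUE while the internal self-reference is intact
length(DT)      # 3 columns currently in use
truelength(DT)  # larger than length(DT): spare over-allocated column slots for :=
# If an operation rebuilds DT as a plain list/data.frame, selfrefok(DT) returns
# FALSE and the next := re-over-allocates (taking a shallow copy), which is
# exactly what the warning is reporting.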
I don't have the time to look into dplyr right now to verify or find out what's doing this.
Update: Have suggested necessary changes as a pull request here.

Related

Why does data.table notation for column selection affect speed

There seems to be a difference in speed depending on how you specify the columns to be selected from a data.table: x[, .(var)] vs x[, c('var')].
The reason may be completely obvious; however, in the help page the .(), list() and c() notations seem to be used interchangeably.
I work with quite large datasets, so it is a bit important to me :-)
Example (the order of call does not affect the speed):
x <- as.data.table(as.character(rnorm(20000000,1,0.5)))
setkey(x, V1)
tic(); x[, .(V1)]; toc()
25.08 sec elapsed
tic(); x[, c('V1')]; toc()
0.28 sec elapsed
tic(); x[, 1]; toc()
0.02 sec elapsed
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tictoc_1.0 data.table_1.12.8
loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1 lifecycle_0.2.0 rlang_0.4.6
You have found a bug (issue filed here) -- data.table is trying to determine whether the output of [] is keyed; in order to do so, it runs an internal is.sorted function. This is very slow on a huge table of unique strings.
Fortunately, we can do static analysis and realize that your output table is in fact keyed -- there's no subset, and the key column (V1) is unchanged. Therefore the sort order cannot have changed, and your output will also be sorted by V1.
This logic is built into a PR to fix this issue -- you can test it out with remotes::install_github('Rdatatable/data.table#fix_sorting_on_sorted'), with the caveat that this is a bleeding-edge version of the package, or you can wait until it's merged to master, or until a new version is released to CRAN.
In the meantime, here's a workaround:
setkey(x, NULL)
system.time(x[ , .(V1)])
# user system elapsed
# 0.120 0.087 0.213
Of course this prevents later processing from recognizing that your data is sorted, and from the efficiencies that come with that...
In this case (and this case only -- use with care!) -- where you yourself are certain that the data is already sorted by V1 -- you can restore the key instantly with:
setattr(x, 'sorted', 'V1')
More generally, there are small differences among selection with [, [[, $, etc. [ will tend to be the slowest, since we do a lot of "static query analysis" to help improve the efficiency of your code, and that comes with a performance cost which we hope is small almost every time. Any time this cost is not small, it should be a bug. There is also active work being done to offer shortcuts that reduce this overhead; see for example this PR.
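For completeness, a small sketch of the cheaper access paths mentioned above; note that [[ and $ return the bare column vector rather than a one-column data.table, so they are not drop-in replacements everywhere:
library(data.table)
x <- data.table(V1 = as.character(rnorm(1e6)))
a <- x[, .(V1)]    # goes through [ and its query analysis; returns a data.table
b <- x[, c("V1")]  # also returns a one-column data.table
v1 <- x[["V1"]]    # plain list access; returns the bare character vector
v2 <- x$V1         # same, via $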

How does R differentiate between the two filter functions, one in the dplyr package and the other for linear filtering of time series?

I wanted to filter a data set based on some conditions. When I looked at the help for the filter function, the result was:
filter {stats} R Documentation
Linear Filtering on a Time Series
Description
Applies linear filtering to a univariate time series or to each series separately of a multivariate time series.
After searching on the web I found the filter function I needed, i.e. the one from the dplyr package. How can R have two functions with the same name? What am I missing here?
At the moment the R interpreter would dispatch a call to filter to the dplyr environment, at least if the class of the object were among the available methods:
methods(filter)
[1] filter.data.frame* filter.default* filter.sf* filter.tbl_cube* filter.tbl_df* filter.tbl_lazy*
[7] filter.ts*
As you can see there is a ts method, so if the object were of that class, the interpreter would instead deliver the x values to it. However, it appears that the authors of dplyr have blocked that mechanism and instead put in a warning function. You would need to use:
getFromNamespace('filter', 'stats')
function (x, filter, method = c("convolution", "recursive"),
sides = 2L, circular = FALSE, init = NULL)
{ <omitting rest of function body> }
# same result also obtained with:
stats::filter
R functions are contained in namespaces, so a full designation of a function would be namespace_name::function_name. There is a hierarchy of namespace containers (actually "environments" in R terminology) arranged along a search path (which will vary depending on the order in which packages and their dependencies have been loaded). The ::-infix operator can be used to specify a namespace or package name that is further up the search path than might be found in the context of the calling function. The function search can display the names of currently loaded packages and their associated namespaces; see ?search. Here's mine at the moment (a rather bloated one, because I answer a lot of questions and don't usually start with a clean system):
> search()
[1] ".GlobalEnv" "package:kernlab" "package:mice" "package:plotrix"
[5] "package:survey" "package:Matrix" "package:grid" "package:DHARMa"
[9] "package:eha" "train" "package:SPARQL" "package:RCurl"
[13] "package:XML" "package:rnaturalearthdata" "package:rnaturalearth" "package:sf"
[17] "package:plotly" "package:rms" "package:SparseM" "package:Hmisc"
[21] "package:Formula" "package:survival" "package:lattice" "package:remotes"
[25] "package:forcats" "package:stringr" "package:dplyr" "package:purrr"
[29] "package:readr" "package:tidyr" "package:tibble" "package:ggplot2"
[33] "package:tidyverse" "tools:rstudio" "package:stats" "package:graphics"
[37] "package:grDevices" "package:utils" "package:datasets" "package:methods"
[41] "Autoloads"
At the moment I can find instances of 3 versions of filter using the help system:
?filter
# brings this up in the help panel
Help on topic 'filter' was found in the following packages:
Return rows with matching conditions
(in package dplyr in library /home/david/R/x86_64-pc-linux-gnu-library/3.5.1)
Linear Filtering on a Time Series
(in package stats in library /usr/lib/R/library)
Objects exported from other packages
(in package plotly in library /home/david/R/x86_64-pc-linux-gnu-library/3.5.1)
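In day-to-day use, the simplest way to sidestep the ambiguity is to qualify the call with the namespace you mean; a small sketch using built-in data sets:
library(dplyr)
dplyr::filter(mtcars, cyl == 6)      # dplyr: keep the rows matching a condition
stats::filter(ldeaths, rep(1/3, 3))  # stats: 3-point moving average of a time series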

R read.csv didn't load all rows of .tsv file

A little mystery. I have a .tsv file that contains 58936 rows. I loaded the file into R using this command:
dat <- read.csv("weekly_devdata.tsv", header=FALSE, stringsAsFactors=TRUE, sep="\t")
but nrow(dat) only shows this:
> nrow(dat)
[1] 28485
So I used the sed -n command to write the rows around where it stopped (before, including, and after that row) to a new file, and I was able to load that file into R, so I don't think there is any corruption in the file.
Is it an environment issue?
Here's my sessionInfo()
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] tcltk stats graphics grDevices utils datasets methods base
other attached packages:
[1] sqldf_0.4-10 RSQLite_1.0.0 DBI_0.3.1 gsubfn_0.6-6 proto_0.3-10 scales_0.2.4 plotrix_3.5-11
[8] reshape2_1.4.1 dplyr_0.4.1
loaded via a namespace (and not attached):
[1] assertthat_0.1 chron_2.3-45 colorspace_1.2-4 lazyeval_0.1.10 magrittr_1.5 munsell_0.4.2
[7] parallel_3.1.2 plyr_1.8.1 Rcpp_0.11.4 rpart_4.1-8 stringr_0.6.2 tools_3.1.2
Did I run out of memory? Is that why it didn't finish loading?
I had a similar problem lately, and it turned out I had two different problems.
1 - Not all rows had the right number of tabs. I ended up counting them using awk
2 - At some points in the file I had quotes that were not closed. This was causing it to skip over all the lines until it found a closing quote.
I will dig up the awk code I used to investigate and fix these issues and post it.
Since I am using Windows, I used the awk that came with git bash.
This counted the number of tabs in a line and printed out those lines that did not have the right number.
awk -F "\t" 'NF!=6 { print NF-1 ":" $0 } ' Catalog01.csv
I used something similar to count quotes, and I used tr to fix a lot of it.
I'm pretty sure this was not a memory issue. If the problem is unmatched quotes, then try this:
t <- read.csv("weekly_devdata.tsv", header=FALSE, stringsAsFactors=TRUE, sep="\t",
              quote="")
There is also the very useful function count.fields, which I use inside table() to get a high-level view of the consequences of various parameter settings. Take a look at the results of:
table( count.fields( "weekly_devdata.tsv", sep="\t"))
And compare to:
table( count.fields( "weekly_devdata.tsv", sep="\t", quote=""))
It's sometimes necessary to read in with readLines, then remove one or more lines, assigning the result to clean, and then send the cleaned-up lines to read.table(text=clean, sep="\t", quote="").
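A sketch of that readLines clean-up route (the file name and the expected field count of 6 are placeholders taken from the awk example above):
raw   <- readLines("weekly_devdata.tsv")
# count tab-separated fields per line, ignoring quotes entirely
nfld  <- count.fields(textConnection(raw), sep = "\t", quote = "")
clean <- raw[nfld == 6]   # keep only lines with the expected number of fields
dat   <- read.table(text = clean, sep = "\t", quote = "", header = FALSE,
                    stringsAsFactors = FALSE)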
It could well be some illegal characters in some of the entries... Check how many rows load and where the issue is occurring, then delve deeper into that row of the raw data. Webdings font, that kind of stuff!

Potential problems from over-allocating truelength more than 1000 times

tl;dr: What are potential problems after a truelength over-allocation warning?
Recently I did something stupid like this:
m <- matrix(seq_len(1e4),nrow=10)
library(data.table)
DT <- data.table(id=rep(1:2,each=5),m)
DT[,id2:=id]
#Warning message:
# In `[.data.table`(DT, , `:=`(id2, id)) :
# tl (2002) is greater than 1000 items over-allocated (ncol = 1001).
# If you didn't set the datatable.alloccol option very large,
# please report this to datatable-help including the result of sessionInfo().
DT[,lapply(.SD,mean),by=id2]
After some searching, it became apparent that the warning resulted from adding a column by reference to a data.table with too many columns, and I found some rather technical explanations (e.g., this) that I probably don't fully understand.
I know that I can avoid the issue (e.g., use data.table(id=rep(1:2,each=5),stack(as.data.frame(m)))), but I wonder if I should expect problems subsequent to such a warning (other than the obvious performance disadvantage from working with a wide format data.table).
R version 2.15.3 (2013-03-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 LC_NUMERIC=C LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] data.table_1.8.8 fortunes_1.5-0
Good question. By default in v1.8.8:
> options()$datatable.alloccol
max(100, 2 * ncol(DT))
That's probably not the best default. Try changing it:
options(datatable.alloccol = quote(max(100L, ncol(DT)+64L)))
UPDATE: I've now changed the default in v1.8.9 to that.
That option just controls how many spare column pointer slots are allocated so that := can add columns by reference.
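A quick way to see those spare slots in practice is to compare length() with truelength(); a short sketch (the exact numbers depend on your data.table version and options):
library(data.table)
DT <- data.table(a = 1:5, b = 6:10)
length(DT)       # 2 columns in use
truelength(DT)   # much larger: the over-allocated column pointer slots
DT <- alloc.col(DT, 2048L)  # explicitly over-allocate more slots if needed
truelength(DT)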
From NOTES in NEWS for v1.8.9
The default for datatable.alloccol has changed from max(100L, 2L*ncol(DT)) to max(100L, ncol(DT)+64L). And a pointer to ?truelength has been added to an error message, as suggested by and thanks to Roland.

ffdfdply function crashes R and is very slow

While learning how to handle computing tasks in R for large data sets (more than 1 or 2 GB), I am trying to use the ff package and the ffdfdply function.
(See this link on how to use ffdfdply: R language: problems computing "group by" or split with ff package )
My data have the following columns:
"id" "birth_date" "diagnose" "date_diagnose"
There are several rows for each "id", and I want to extract the first date on which there was a diagnosis.
I would apply this:
library(ffbase)
library(plyr)
load(file = file_name)  # to load my ffdf database, called data.f

my_fun <- function(x){
  ddply(x, .(id), summarize,
        age = min(date_diagnose - birth_date, na.rm = TRUE))
}

result <- ffdfdply(x = data.f, split = data.f$id,
                   FUN = function(x) my_fun(x), trace = TRUE)
result[1:10, ]  # to check....
It is very strange, but this command, ffdfdply(x = data.f, ....), makes RStudio (and R) crash. Sometimes the same command will crash R and sometimes it won't.
For example, if I run the ffdfdply line again (after it worked the first time), R will crash.
Using other functions, data, etc. has the same effect. There is no memory increase, and nothing shows up in log.txt.
I get the same behaviour when applying the summaryBy "technique"....
So if anybody has the same problem and found the solution, that would be very helpful.
Also, ffdfdply gets very slow (slower than SAS...), and I am thinking about another strategy for this kind of task.
Does ffdfdply take into account that, for example, the data set is ordered by id (so it does not have to look through all the data to find the rows with the same id)?
So, if anybody knows other approaches to this ddply-style problem, that would be really helpful for all the "large data sets in R with little RAM" users.
This is my sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Danish_Denmark.1252 LC_CTYPE=Danish_Denmark.1252
[3] LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C
[5] LC_TIME=Danish_Denmark.1252
attached base packages:
[1] tools stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] plyr_1.7.1 ffbase_0.6-1 ff_2.2-10 bit_1.1-9
I also noticed this when using the version of the package which we uploaded to CRAN recently. It seems to be caused by package ffbase overloading the "[.ff" and "[<-.ff" extractor and setter functions from package ff.
I will remove this feature from the package and will upload it to CRAN soon. In the meantime, you can use version 0.7 of ffbase, which you can get here:
http://dl.dropbox.com/u/25690064/ffbase_0.7.tar.gz
and install it as:
download.file("http://dl.dropbox.com/u/25690064/ffbase_0.7.tar.gz", "ffbase_0.7.tar.gz")
shell("R CMD INSTALL ffbase_0.7.tar.gz")
Let me know if that helped.
