What I'm looking for is a "best-practices-approved" alternative
to the following workaround / workflow. Consider that I have a bunch of columns of similar data and would like to perform a sequence of similar operations on these columns or sets of them, where the operations are of arbitrarily high complexity and the groups of column names passed to each operation are specified in a variable.
I realize this issue sounds contrived, but I run into it with surprising frequency. The examples are usually so messy that it is difficult to separate out the features relevant to this question, but I recently stumbled across one that was fairly straightforward to simplify for use as an MWE here:
library(data.table)
library(lubridate)
library(zoo)
the.table <- data.table(year=1991:1996, var1=floor(runif(6,400,1400)))
the.table[, `:=`(var2=var1/floor(runif(6,2,5)),
                 var3=var1/floor(runif(6,2,5)))]
# Replicate data across months
new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
                                           length.out=12,
                                           by="1 month")), by=year]
# Do a complicated procedure to each variable in some group.
var.names <- c("var1","var2","var3")
for(varname in var.names) {
  # As suggested in an answer to Link 3 above:
  # convert the column name to a 'quote' object
  quote.convert <- function(x) eval(parse(text=paste0('quote(', x, ')')))
  # Do this for every column name I'll need
  varname <- quote.convert(varname)
  anntot  <- quote.convert(paste0(varname, ".annual.total"))
  monthly <- quote.convert(paste0(varname, ".monthly"))
  rolling <- quote.convert(paste0(varname, ".rolling"))
  scaled  <- quote.convert(paste0(varname, ".scaled"))
  # Perform the relevant tasks, using eval()
  # around every variable column name I may want
  new.table[, eval(anntot) :=
              the.table[, rep(eval(varname), each=12)]]
  new.table[, eval(monthly) :=
              the.table[, rep(eval(varname)/12, each=12)]]
  new.table[, eval(rolling) :=
              rollapply(eval(monthly), mean, width=12,
                        fill=c(head(eval(monthly), 1),
                               tail(eval(monthly), 1)))]
  new.table[, eval(scaled) :=
              eval(anntot)/sum(eval(rolling))*eval(rolling),
            by=year]
}
Of course, the particular effect on the data and variables here is irrelevant, so please do not focus on it or suggest improvements to accomplishing what it accomplishes in this particular case. What I am looking for, rather, is a generic strategy for the workflow of repeatedly applying an arbitrarily complicated procedure of data.table actions to a list of columns (or list of lists-of-columns) specified in a variable or passed as an argument to a function. The procedure must refer programmatically to columns named in the variable/argument, and may include updates, joins, groupings, and calls to the data.table special objects .I, .SD, etc.; BUT it should be simpler, more elegant, shorter, or easier to design, implement, or understand than the one above or others that require frequent quote-ing and eval-ing.
In particular, please note that because the procedures can be fairly complex and involve repeatedly updating the data.table and then referencing the updated columns, the standard lapply(.SD, ...), .SDcols = ... approach is usually not a workable substitute (see the sketch below). Also, replacing each call of eval(a.column.name) with DT[[a.column.name]] neither simplifies much nor works completely in general, since that does not play nicely with the other data.table operations, as far as I am aware.
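For illustration, here is a minimal sketch (toy data and column names of my own) of where that one-shot idiom stops short:

library(data.table)
DT   <- data.table(a = 1:3, b = 4:6)
cols <- c("a", "b")
# Fine for a single, uniform transform:
DT[, paste0(cols, ".half") := lapply(.SD, `/`, 2), .SDcols = cols]
# But a follow-up step that must refer to the just-created *.half columns
# by constructed name cannot be expressed inside that same lapply(.SD) call.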
I am aware of many workarounds for various use cases of variable column
names in data.table, including:
Select / assign to data.table when variable names are stored in a character vector
Pass column name in data.table using variable
Referring to data.table columns by names saved in variables
passing column names to data.table programmatically
Data.table meta-programming
How to write a function that calls a function that calls data.table?
Using dynamic column names in `data.table`
Dynamic column names in data.table
Assign multiple columns using := in data.table, by group
Setting column name in "group by" operation with data.table
Summarizing multiple columns with data.table
and probably more I haven't referenced.
But: even if I learned all the tricks documented above to the point that I
never had to look them up to remind myself how to use them, I still would find
that working with column names that are passed as parameters to a function is
an extremely tedious task.
The problem you are describing is not strictly related to data.table.
Complex queries cannot easily be translated into code that a machine can parse, so we cannot escape complexity when writing a query for complex operations.
You can try to imagine how to programmatically construct the equivalent of the following data.table query using dplyr or SQL:
DT[, c(f1(v1, v2, opt=TRUE),
       f2(v3, v4, v5, opt1=FALSE, opt2=TRUE),
       lapply(.SD, f3, opt1=TRUE, opt2=FALSE))
   , by=.(id1, id2)]
Assume that all the columns (id1, id2, v1...v5) and even the options (opt, opt1, opt2) should be passed as variables.
Because of this complexity in expressing queries, I don't think you can easily meet the requirement stated in your question:
is simpler, more elegant, shorter, or easier to design or implement or understand than the one above or others that require frequent quote-ing and eval-ing.
Still, compared to other programming languages, base R provides very useful tools to deal with such problems.
You have already found suggestions to use get, mget, DT[[col_name]], parse, quote, and eval.
As you mentioned, DT[[col_name]] may not play well with data.table optimizations, so it is not that useful here.
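To illustrate (toy data; the grouped case is where it bites):

library(data.table)
DT  <- data.table(id = c("a","a","b"), v1 = 1:3)
col <- "v1"
DT[[col]]                      # fine as a plain read
DT[, sum(DT[[col]]), by = id]  # wrong: sums the entire column in every group
DT[, sum(.SD[[col]]), by = id, .SDcols = col]  # correct, but may forgo optimizations such as GForce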
parse is probably the easiest way to construct complex queries, as you can simply operate on strings, but it does not provide even basic language syntax validation, so you can end up trying to parse a string that the R parser does not accept. Additionally, there is a security concern, as presented in 2655#issuecomment-376781159.
get/mget are the ones most commonly suggested for such problems. Calls to get and mget are internally caught by [.data.table and translated to the expected columns, so you are assuming that your arbitrarily complex query can be decomposed by [.data.table and the expected columns substituted correctly.
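For example (toy data):

library(data.table)
DT   <- data.table(id = c("a","a","b"), v1 = 1:3, v2 = 4:6)
col  <- "v1"
cols <- c("v1","v2")
DT[, sum(get(col)), by = id]            # a single column via get
DT[, lapply(mget(cols), sum), by = id]  # several columns via mget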
Since you asked this question a few years ago, a new feature, the dot-dot prefix, has recently been rolled out. You prefix a variable name with dot-dot to refer to a variable outside the scope of the current data.table, much as you refer to the parent directory in a file system. The internals behind dot-dot will be quite similar to get: prefixed variables are de-referenced inside [.data.table. In future releases the dot-dot prefix may allow calls like:
col1="a"; col2="b"; col3="g"; col4="x"; col5="y"
DT[..col4==..col5, .(s1=sum(..col1), s2=sum(..col2)), by=..col3]
Personally, I prefer quote and eval instead. A query built with quote and run with eval is interpreted almost exactly as if it had been written by hand from scratch. This method does not rely on data.table's ability to manage references to columns, so we can expect all optimizations to work the same way as if the queries had been written by hand. I also find it easier to debug, since at any point you can simply print the quoted expression to see what is actually being passed to the data.table query, and there is less room for bugs to creep in. Constructing complex queries from R language objects can be tricky, but it is easy to wrap the procedure in a function so it can be applied to different use cases and re-used. Note that this method is independent of data.table: it uses base R language constructs. You can find more information in the official R Language Definition, in the Computing on the Language chapter.
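A minimal sketch of the pattern (toy data; substitute is used here to parameterize the quoted expression):

library(data.table)
DT  <- data.table(id = c("a","a","b"), v1 = 1:3)
col <- as.name("v1")
q <- substitute(sum(.col), list(.col = col))  # build the language object
print(q)              # sum(v1) -- inspect exactly what will be evaluated
DT[, eval(q), by = id]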
What else?
I submitted a proposal for a new concept called macro in #1579. In short, it is a wrapper around DT[eval(qi), eval(qj), eval(qby)], so you still have to operate on R language objects. You are welcome to put your comments there.
Recently I proposed another approach for a metaprogramming interface in PR#4304. In short, it plugs base R's substitute functionality into [.data.table using a new argument, env.
Coming to the example: below I will show two ways to solve it. The first uses base R metaprogramming; the second uses the data.table metaprogramming proposed in PR#4304 (see above).
Base R computing on the language
I will wrap all the logic into a do_vars function. Calling do_vars(donot=TRUE) will print the expressions to be computed on the data.table instead of eval-ing them. The code below should be run just after the OP's code (reproduced in the Data and updated OP code section at the end).
expected = copy(new.table)
new.table = the.table[, list(asofdate=seq(from=ymd((year)*10^4+101), length.out=12, by="1 month")), by=year]
do_vars = function(x, y, vars, donot=FALSE) {
  name.suffix = function(x, suffix) as.name(paste(x, suffix, sep="."))
  do_var = function(var, x, y) {
    substitute({
      x[, .anntot := y[, rep(.var, each=12)]]
      x[, .monthly := y[, rep(.var/12, each=12)]]
      x[, .rolling := rollapply(.monthly, mean, width=12,
                                fill=c(head(.monthly,1), tail(.monthly,1)))]
      x[, .scaled := .anntot/sum(.rolling)*.rolling, by=year]
    }, list(
      .var     = as.name(var),
      .anntot  = name.suffix(var, "annual.total"),
      .monthly = name.suffix(var, "monthly"),
      .rolling = name.suffix(var, "rolling"),
      .scaled  = name.suffix(var, "scaled")
    ))
  }
  ql = lapply(setNames(nm=vars), do_var, x, y)
  if (donot) return(ql)
  lapply(ql, eval.parent)
  invisible(x)
}
do_vars(new.table, the.table, c("var1","var2","var3"))
all.equal(expected, new.table)
#[1] TRUE
We can preview the queries:
do_vars(new.table, the.table, c("var1","var2","var3"), donot=TRUE)
#$var1
#{
# x[, `:=`(var1.annual.total, y[, rep(var1, each = 12)])]
# x[, `:=`(var1.monthly, y[, rep(var1/12, each = 12)])]
# x[, `:=`(var1.rolling, rollapply(var1.monthly, mean, width = 12,
# fill = c(head(var1.monthly, 1), tail(var1.monthly, 1))))]
# x[, `:=`(var1.scaled, var1.annual.total/sum(var1.rolling) *
# var1.rolling), by = year]
#}
#
#$var2
#{
# x[, `:=`(var2.annual.total, y[, rep(var2, each = 12)])]
# x[, `:=`(var2.monthly, y[, rep(var2/12, each = 12)])]
# x[, `:=`(var2.rolling, rollapply(var2.monthly, mean, width = 12,
# fill = c(head(var2.monthly, 1), tail(var2.monthly, 1))))]
# x[, `:=`(var2.scaled, var2.annual.total/sum(var2.rolling) *
# var2.rolling), by = year]
#}
#
#$var3
#{
# x[, `:=`(var3.annual.total, y[, rep(var3, each = 12)])]
# x[, `:=`(var3.monthly, y[, rep(var3/12, each = 12)])]
# x[, `:=`(var3.rolling, rollapply(var3.monthly, mean, width = 12,
# fill = c(head(var3.monthly, 1), tail(var3.monthly, 1))))]
# x[, `:=`(var3.scaled, var3.annual.total/sum(var3.rolling) *
# var3.rolling), by = year]
#}
#
Proposed data.table metaprogramming
expected = copy(new.table)
new.table = the.table[, list(asofdate=seq(from=ymd((year)*10^4+101), length.out=12, by="1 month")), by=year]
name.suffix = function(x, suffix) as.name(paste(x, suffix, sep="."))
do_var2 = function(var, x, y) {
  x[, .anntot := y[, rep(.var, each=12)],
    env = list(
      .anntot = name.suffix(var, "annual.total"),
      .var    = var
    )]
  x[, .monthly := y[, rep(.var/12, each=12)],
    env = list(
      .monthly = name.suffix(var, "monthly"),
      .var     = var
    )]
  x[, .rolling := rollapply(.monthly, mean, width=12,
                            fill=c(head(.monthly,1), tail(.monthly,1))),
    env = list(
      .rolling = name.suffix(var, "rolling"),
      .monthly = name.suffix(var, "monthly")
    )]
  x[, .scaled := .anntot/sum(.rolling)*.rolling, by=year,
    env = list(
      .scaled  = name.suffix(var, "scaled"),
      .anntot  = name.suffix(var, "annual.total"),
      .rolling = name.suffix(var, "rolling")
    )]
  TRUE
}
sapply(setNames(nm=var.names), do_var2, new.table, the.table)
#var1 var2 var3
#TRUE TRUE TRUE
all.equal(expected, new.table)
#[1] TRUE
Data and updated OP code
library(data.table)
library(lubridate)
library(zoo)
the.table <- data.table(year=1991:1996, var1=floor(runif(6,400,1400)))
the.table[, `:=`(var2=var1/floor(runif(6,2,5)),
                 var3=var1/floor(runif(6,2,5)))]
# Replicate data across months
new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
                                           length.out=12,
                                           by="1 month")), by=year]
# Do a complicated procedure to each variable in some group.
var.names <- c("var1","var2","var3")
for(varname in var.names) {
  # As suggested in an answer to Link 3 above:
  # convert the column name to a 'quote' object
  quote.convert <- function(x) eval(parse(text=paste0('quote(', x, ')')))
  # Do this for every column name I'll need
  varname <- quote.convert(varname)
  anntot  <- quote.convert(paste0(varname, ".annual.total"))
  monthly <- quote.convert(paste0(varname, ".monthly"))
  rolling <- quote.convert(paste0(varname, ".rolling"))
  scaled  <- quote.convert(paste0(varname, ".scaled"))
  # Perform the relevant tasks, using eval()
  # around every variable column name I may want
  new.table[, paste0(varname, ".annual.total") :=
              the.table[, rep(eval(varname), each=12)]]
  new.table[, paste0(varname, ".monthly") :=
              the.table[, rep(eval(varname)/12, each=12)]]
  new.table[, paste0(varname, ".rolling") :=
              rollapply(eval(monthly), mean, width=12,
                        fill=c(head(eval(monthly), 1),
                               tail(eval(monthly), 1)))]
  new.table[, paste0(varname, ".scaled") :=
              eval(anntot)/sum(eval(rolling))*eval(rolling),
            by=year]
}
Thanks for the question. Your original approach goes a long way towards solving most of the issues.
Here I've tweaked the quoting function slightly and changed the approach to parse and evaluate the entire RHS expression as a string, rather than the individual variables.
The reasoning being:
You probably don't want to repeat yourself by declaring every variable you need at the start of the loop.
Strings will scale better, since they can be generated programmatically. I've added an example below that calculates row-wise percentages to illustrate this.
library(data.table)
library(lubridate)
library(zoo)
set.seed(1)
the.table <- data.table(year=1991:1996, var1=floor(runif(6,400,1400)))
the.table[, `:=`(var2=var1/floor(runif(6,2,5)),
                 var3=var1/floor(runif(6,2,5)))]
# Replicate data across months
new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
                                           length.out=12,
                                           by="1 month")), by=year]
# function to paste, parse & evaluate arguments
evalp <- function(..., envir=parent.frame()) {eval(parse(text=paste0(...)), envir=envir)}
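# (illustration: evalp("1 + ", 1) pastes the pieces into the string "1 + 1",
#  parses it, and evaluates it in the caller's frame, returning 2)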
# Do a complicated procedure to each variable in some group.
var.names <- c("var1","var2","var3")
for(varname in var.names) {
  # 1. For LHS, use paste0 to generate new column name as string (from #eddi's comment)
  # 2. For RHS, use evalp
  new.table[, paste0(varname, '.annual.total') := evalp(
    'the.table[,rep(', varname, ',each=12)]'
  )]
  new.table[, paste0(varname, '.monthly') := evalp(
    'the.table[,rep(', varname, '/12,each=12)]'
  )]
  # Need to add envir=.SD when working within the table
  new.table[, paste0(varname, '.rolling') := evalp(
    'rollapply(', varname, '.monthly,mean,width=12,
     fill=c(head(', varname, '.monthly,1), tail(', varname, '.monthly,1)))'
    , envir=.SD
  )]
  new.table[, paste0(varname, '.scaled') := evalp(
    varname, '.annual.total / sum(', varname, '.rolling) * ', varname, '.rolling'
    , envir=.SD
  )
  , by=year
  ]
  # Since we're working with strings, more freedom
  # to work programmatically
  new.table[, paste0(varname, '.row.percent') := evalp(
    'the.table[,rep(', varname, '/ (', paste(var.names, collapse='+'), '), each=12)]'
  )]
}
I tried to do this in data.table thinking "this isn't so bad"... but after an embarrassing length of time, I gave up. Matt says something like "do it in pieces then join", but I couldn't figure out elegant ways to do these pieces, especially because the last one depends on the previous steps.
I have to say, this is a pretty brilliantly constructed question, and I too encounter similar issues frequently. I love data.table, but I still struggle sometimes. I don't know if I'm struggling with data.table or with the complexity of the problem.
Here is the incomplete approach I've taken (a possible sketch of the missing last step follows the code).
Realistically, I can imagine that in a normal process you would have more intermediate variables stored that would be useful for calculating these values.
library(data.table)
library(zoo)
## Example yearly data
set.seed(27)
DT <- data.table(year=1991:1996, var1=floor(runif(6,400,1400)))
DT[ , var2 := var1 / floor(runif(6,2,5))]
DT[ , var3 := var1 / floor(runif(6,2,5))]
setkeyv(DT,colnames(DT)[1])
DT
## Convenience function
nonkey <- function(dt){colnames(dt)[!colnames(dt)%in%key(dt)]}
## Annual data expressed monthly
NewDT <- DT[, j=list(asofdate=as.IDate(paste(year, 1:12, 1, sep="-"))), by=year]
setkeyv(NewDT, colnames(NewDT)[1:2])
## Create annual data
NewDT_Annual <- NewDT[DT]
setnames(NewDT_Annual,
         nonkey(NewDT_Annual),
         paste0(nonkey(NewDT_Annual), ".annual.total"))
## Compute monthly data
NewDT_Monthly <- NewDT[DT[, .SD / 12, keyby=list(year)]]
setnames(NewDT_Monthly,
         nonkey(NewDT_Monthly),
         paste0(nonkey(NewDT_Monthly), ".monthly"))
## Compute rolling stats
NewDT_roll <- NewDT_Monthly[j = lapply(.SD, rollapply, mean, width=12,
                                       fill=c(.SD[1], tail(.SD, 1))),
                            .SDcols=nonkey(NewDT_Monthly)]
NewDT_roll <- cbind(NewDT_Monthly[, 1:2, with=FALSE], NewDT_roll)
setkeyv(NewDT_roll, colnames(NewDT_roll)[1:2])
setnames(NewDT_roll,
         nonkey(NewDT_roll),
         gsub(".monthly$", ".rolling", nonkey(NewDT_roll)))
## Compute normalized values
## Compute "adjustment" table which is
## total of each variable, by year for rolling
## divided by
## original annual totals
## merge "adjustment values" in with monthly data, and then
## make a modified data.table which is each variable * annual adjustment factor
## Merge everything
NewDT_Combined <- NewDT_Annual[NewDT_roll][NewDT_Monthly]
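For completeness, here is a possible sketch of the missing normalization step, assuming the joins above produced the *.annual.total and *.rolling columns as named; note it falls back on get for the programmatic column names, which is part of what I was trying to avoid:

## scaled = annual.total / sum(rolling) * rolling, by year, per variable
for (v in c("var1", "var2", "var3")) {
  NewDT_Combined[, paste0(v, ".scaled") :=
                   get(paste0(v, ".annual.total")) /
                   sum(get(paste0(v, ".rolling"))) *
                   get(paste0(v, ".rolling")),
                 by = year]
}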
I am working with RMongoDB and I need to fill an empty data.frame with the values of a query. The results are quite long, about 2 million documents (rows).
While doing performance tests, I found that the time for writing values into a row increases with the dimension of the data frame. Maybe it is a well-known issue and I am the last one to notice it.
Some code example:
set.seed(20140430)
nreg <- 2e3
dfres <- as.data.frame(matrix(rep(NA,nreg*7),nrow=nreg,ncol=7))
system.time(dfres[1e3,] <- c(1:5,"a","b"))
summary(replicate(10,system.time(dfres[sample(1:nreg,1),] <- c(1:5,"a","b"))[3]))
nreg <- 2e6
dfres <- as.data.frame(matrix(rep(NA,nreg*7),nrow=nreg,ncol=7))
system.time(dfres[1e3,] <- c(1:5,"a","b"))
summary(replicate(10,system.time(dfres[sample(1:nreg,1),] <- c(1:5,"a","b"))[3]))
On my machine, the assignment into the 2-million-row data.frame takes about 0.4 seconds. That is a lot of time if I want to fill the whole dataset. Here is a second simulation to illustrate the issue:
nreg <- seq(2e1,2e7,length.out=10)
te <- NULL
for(i in nreg){
  dfres <- as.data.frame(matrix(rep(NA, i*7), nrow=i, ncol=7))
  te <- c(te, mean(replicate(10, {r <- sample(1:i, 1); system.time(dfres[r, ] <- c(1:5, "a", "b"))[3]})))
}
plot(nreg,te,xlab="Number of rows",ylab="Avg. time for 10 random assignments [sec]",type="o")
#rm(nreg,dfres,te)
Question: Why does this happen? Is there a quicker way to fill the data.frame in memory?
Let's start with "columns" first, see what goes on, and then return to rows.
R versions < 3.1.0 (unnecessarily) copy the entire data.frame when you operate on it. For example:
## R v3.0.3
df <- data.frame(x=1:5, y=6:10)
dplyr:::changes(df, transform(df, z=11:15)) ## requires dplyr to be available
# Changed variables:
# old new
# x 0x7ff9343fb4d0 0x7ff9326dfba8
# y 0x7ff9343fb488 0x7ff9326dfbf0
# z <added> 0x7ff9326dfc38
# Changed attributes:
# old new
# names 0x7ff934170c28 0x7ff934308808
# row.names 0x7ff934551b18 0x7ff934308970
# class 0x7ff9346c5278 0x7ff935d1d1f8
You can see that the addition of the "new" column has resulted in a copy of the "old" columns (the addresses are different). The attributes are copied as well. What bites most is that these copies are deep copies, as opposed to shallow copies.
A shallow copy only copies the vector of column pointers, not the entire data, whereas a deep copy copies everything (which is unnecessary here).
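As an aside, you can observe this yourself without dplyr; a minimal sketch comparing a column's address (via data.table::address) before and after adding a column:

library(data.table)   # for address()
df <- data.frame(x=1:5, y=6:10)
before <- address(df$x)
df$z <- 11:15
identical(before, address(df$x))
# FALSE on R < 3.1.0 (x was deep copied); TRUE on R >= 3.1.0 (x untouched)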
However, in R v3.1.0 there has been a nice, welcome change: the "old" columns are no longer deep copied. All credit to the R core dev team.
## R v3.1.0
df <- data.frame(x=1:5, y=6:10)
dplyr:::changes(df, transform(df, z=11:15)) ## requires dplyr to be available
# Changed variables:
# old new
# z <added> 0x7f85d328dda8
# Changed attributes:
# old new
# names 0x7f85d1459548 0x7f85d297bec8
# row.names 0x7f85d2c66cd8 0x7f85d2bfa928
# class 0x7f85d345cab8 0x7f85d2d6afb8
You can see that the columns x and y aren't changed at all (and are therefore not present in the output of the changes function call). This is a huge (and welcome) improvement!
So far, we have looked at the issue of adding columns in R < 3.1.0 and v3.1.0.
Now, coming to your question: what about "rows"? Let's consider older versions of R first and then come back to R v3.1.0.
## R v3.0.3
df <- data.frame(x=1:5, y=6:10)
df.old <- df
df$y[1L] <- -6L
dplyr:::changes(df.old, df)
# Changed variables:
# old new
# x 0x7f968b423e50 0x7f968ac6ba40
# y 0x7f968b423e98 0x7f968ac6bad0
#
# Changed attributes:
# old new
# names 0x7f968ab88a28 0x7f968abca8e0
# row.names 0x7f968abb6438 0x7f968ab22bb0
# class 0x7f968ad73e08 0x7f968b580828
Once again we see that changing column y resulted in copying column x as well, in older versions of R.
## R v3.1.0
df <- data.frame(x=1:5, y=6:10)
df.old <- df
df$y[1L] <- -6L
dplyr:::changes(df.old, df)
# Changed variables:
# old new
# y 0x7f85d3544090 0x7f85d2c9bbb8
#
# Changed attributes:
# old new
# row.names 0x7f85d35a69a8 0x7f85d35a6690
We see the nice improvement in R v3.1.0: only column y has been copied. Once again, great improvement! R's copy-on-modify has gotten wiser.
But still, using data.table's assignment-by-reference semantics, we can do one step better: not copy even the y column, as happens in R v3.1.0.
The idea being: as long as the type of the value you assign to a column at certain indices doesn't change (here, column y is integer, so as long as you assign an integer back to y), we really can do it without any copying, by modifying in place (by reference).
Why? Because we don't have to allocate or re-allocate anything. As an example, if you had assigned a double/numeric value, which requires 8 bytes of storage as opposed to the 4 bytes for the integer column y, then we would have to create a new column y and copy the values over.
That is, we can sub-assign by reference using data.table, with either := or set(). I'll demonstrate using set() here.
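A minimal sketch of the idea on the question's dimensions (an illustration of my own, with integer columns for the numbers and character columns for the letters; the full benchmark code is linked below):

library(data.table)
nreg  <- 2e6
## five integer columns and two character columns, to match c(1:5, "a", "b")
dtres <- as.data.table(matrix(NA_integer_, nrow=nreg, ncol=5))
dtres[, c("V6", "V7") := NA_character_]
system.time(
  set(dtres, i = 1000L, j = 1L:7L, value = list(1L, 2L, 3L, 4L, 5L, "a", "b"))
)   # effectively instantaneous: values are written in place, no column is copied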
Now, here's a comparison with base R and data.table on your data with 2,000 to 20,000,000 rows in multiples of 10, against R v3.0.3 and v3.1.0 separately. You can find the code here.
Plot for comparison against R v3.0.3:
Plot for comparison against R v3.1.0:
The min, median and max timings (in seconds) for R v3.0.3, R v3.1.0 and data.table on 20 million rows with 10 replications are:
type         min  median    max
base_3.0.3 10.05   10.70  18.51
base_3.1.0  1.67    1.97   5.20
data.table  0.04    0.04   0.05
Note: You can see the complete timings in this gist.
This clearly shows the improvement in R v3.1.0, but it also shows that the column being changed is still copied, which still consumes some time; that cost is overcome through sub-assignment by reference in data.table.
HTH