New columns in data.frame don't retain POSIXct class

I spent almost two days finding the reason for an error occurring - probably trivial for many, but I cannot figure out the cause, and I am thankful for help:
When I create a new data.frame and add columns with a specific class (POSIXct) using the ...$... syntax, it works nicely (the "p" columns in the code below become class POSIXct as intended).
However, if I do the same using the ...[..., ...] syntax, the POSIXct class is lost upon assignment (the "n" columns in the code below unintendedly become class numeric).
Even after setting the class explicitly, it remains numeric with the ...[..., ...] syntax, but not with the ...$... syntax.
What is the reasoning behind this behaviour? Obviously I have found a workaround, but it is more convenient to use vectors of column names, and I am afraid that I am missing something very basic but cannot figure out what, or where to look and by which keywords.
Basically I need to access the columns by a variable and then assign class and data.
rm(dfDummy) # just make sure there is no residual old data/columns leftover
dfDummy <- data.frame(a = 1:10, dummy = letters[1:10]) # `dummy` was undefined in the original; letters[1:10] is a placeholder
dfDummy$p <- as.POSIXct(NA)
dfDummy$p.rep <- as.POSIXct(rep(NA, 10))
dfDummy[ , c("n1", "n2")] <- as.POSIXct(NA)
dfDummy[ , c("n1.rep", "n2.rep")] <- as.POSIXct(rep(NA, 10))
sapply(X = c("p", "p.rep", "n1", "n2", "n1.rep", "n2.rep"), function(x) class(dfDummy[, x]))
# even after setting the class explicitly, it remains "numeric" - what is wrong?
class(dfDummy[ , c("n1", "n2", "n1.rep", "n2.rep")]) <- c("POSIXct", "POSIXt")
sapply(X = c("p", "p.rep", "n1", "n2", "n1.rep", "n2.rep"), function(x) class(dfDummy[, x]))

The issue has nothing really to do with using $ or [, except that when using $ a single column is being assigned, and when you're using [ multiple columns are.
Rather, when you assign into multiple columns, the POSIXct vector is recycled and simplified into a matrix - and matrices can't hold the POSIXct class.
If you instead pass a list, it will work:
dfDummy[ , c("n1.rep", "n2.rep")] <- list(as.POSIXct(NA))
lapply(dfDummy[ , c("n1.rep", "n2.rep")], class)
$n1.rep
[1] "POSIXct" "POSIXt"
$n2.rep
[1] "POSIXct" "POSIXt"

Related

Change data.table column classes to integer64, double respectively

Given data.table dt:
dt <- structure(list(V1 = c("1544018118438041139", "1544018118466235879",
"1544018118586849680", "1544018118601169211", "1544018118612947335",
"1544018118614422179"), V2 = c("162", "162", "161.05167", "158.01309",
"157", "157"), V3 = c("38", "38", "36.051697", "33.01306", "32",
"32"), V4 = c("0.023529414", "0.025490198", "0.023529414", "0.027450982",
"0.03137255", "0.03137255"), V5 = c("1", "1", "1", "1", "1",
"1"), V6 = c("2131230815", "2131230815", "2131230815", "2131230815",
"2131230815", "2131230815"), V7 = c("1", "0", "0", "0", "0",
"-1")), class = c("data.table", "data.frame"), row.names = c(NA,
-6L), .internal.selfref = <pointer: 0x2715f60>)
I want the first column converted with bit64::as.integer64() and the rest of the columns with as.numeric().
I am trying to do this:
dt <- dt[ ,V1 := bit64::as.integer64(V1)]
dt[, lapply(.SD, as.numeric), .SDcols = -c("V1")]
But it doesn't seem to do what I want. Please advise how to change specific columns to one class A (integer64) and the rest to another class B (say numeric).
From the comments above it seems like you want to be able to do this all in one step rather than convert the first to integer64 and then the rest to double. One way you can do this is with:
dt[, names(dt) := Map(function(fun, x) fun(x),
                      rep(list(bit64::as.integer64, as.numeric),
                          times = c(1, length(.SD) - 1)),
                      .SD),
   .SDcols = names(dt)]
The Map function iterates through your inputs together. That is, it takes the first elements of your first and second vectors and passes them as arguments to our function. Then it takes the second elements of both vectors and passes those to the function.
In our Map call we have:
A main function to apply. This is an anonymous function which takes two arguments: (1) fun, and (2) x. The result of our function is the result of applying fun to x, or fun(x). For a concrete example try:
myfun <- function(fun, x) {
  fun(x)
}
res <- myfun(as.numeric, c("1", "1")); class(res)
A list of functions to pass to our main function. These will be used as fun in our main function. In this case it's list(bit64::as.integer64, as.numeric, as.numeric, ...).
A list of vectors to pass to our main function. These will be used as x in our main function. In this case, each column of our dt.
A quick and dirty visual aid of how this works (assuming custom_function takes two arguments):
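In text form, with made-up placeholder functions and inputs:
# Map(custom_function, list(f1, f2), list(x1, x2)) evaluates to
# list(custom_function(f1, x1), custom_function(f2, x2))
funs <- list(as.integer, as.numeric)
vals <- list("1", "2.5")
str(Map(function(fun, x) fun(x), funs, vals))
# a list of two: an integer and a double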
It looks to me that you have a data.table object with integer64 nanosecond timestamps since the epoch. I use the same at work to represent high-resolution timestamps.
The good news is that data.table supports this -- by relying on our package nanotime which itself uses bit64 for the integer64 type. However, I create my timestamps differently, typically from compiled code where I retrieve the data.
I described this in some detail at the Rcpp Gallery in this post. So some good news: this can be done. Some bad news: I don't think you can do it the way you want, because we can only go via double, which has only 16 decimals of precision, not 19. But maybe I am missing a trick, so if a simpler solution exists I'd be all ears. (And I keep forgetting if there is a 'parse int64 from string' approach. I never went that route because you can't do that at scale -- I deal with pretty sizeable data sets too.)
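To make the precision point concrete (and, as the follow-up below shows, bit64 can parse int64 from strings): routing one of the 19-digit timestamps above through double loses the low digits, while bit64::as.integer64() on the character value keeps them.
library(bit64)
x <- "1544018118438041139"
format(as.numeric(x), digits = 20)  # double: the trailing digits no longer match the input
as.integer64(x)                     # integer64: the full 19-digit value, parsed exactly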
Thanks guys. @dirk_eddelbuettel, I managed to do this:
1) Load all the JSON files (in my case) and use bigint_as_char=TRUE within the fromJSON command.
2) Now you have a big table with all columns as characters.
3) Convert the timestamp column with bit64::as.integer64() - you get the numbers I want.
4) Convert the rest to the desired types.
5) When I want to perform calculations, for example timestamp - lag(timestamp), I add lag_timestamp = lag(timestamp) (with dplyr::mutate) as a new column, and add a diff column storing the result as.character().
6) You are almost done - the new diff column stores the value I want as a string / character, and you can convert it with as.numeric() where needed or apply ifelse() to deal with irrelevant values.
7) That's all - it works perfectly for me and doesn't crash RStudio.
Before applying this solution, RStudio crashed.
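A minimal sketch of that workflow; the file name and the surrounding jsonlite/data.table calls are illustrative, not from the original post:
library(jsonlite)
library(data.table)
library(bit64)
# 1)-2) read with big integers preserved as character
dt <- as.data.table(fromJSON("data.json", bigint_as_char = TRUE))  # hypothetical file
# 3) timestamp column to integer64 -- full 19-digit precision
dt[, V1 := bit64::as.integer64(V1)]
# 4) the remaining character columns to numeric
num_cols <- setdiff(names(dt), "V1")
dt[, (num_cols) := lapply(.SD, as.numeric), .SDcols = num_cols]
# 5) differences between consecutive timestamps stay integer64
dt[, diff := V1 - shift(V1)]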

R character variables in function

I’ve worked with SAS and SQL previously, and I’m trying to get into R via a course. I’ve been set the following task by my tutor:
“Using the Iris dataset, write an R function that takes as its arguments an Iris species and attribute name and returns the minimum and maximum values of the attribute for that species.”
Which sounded straightforward at first, but I’ve come unstuck trying to make the function. The below is as far as I've gotten:
#write the function
question_2 <- function(x, y, data){
  new_table <- subset(data, Species==x)
  themin <- min(new_table$y)
  themax <- max(new_table$y)
  return(themin)
  return(themax)}
#test the function - Species , Attribute, Data
question_2("setosa",Sepal.Width, iris)
I assumed I needed quotes around the species when running the function, but I get an error that there were "no non-missing arguments to min/max", which I'm guessing means my attempt at making 'new_table' has brought back zero observations.
Can anyone see where I'm going wrong?
Edit: thanks all for the swift and insightful responses. I'll take that reading on board. Thanks again!
Indeed, your teacher didn't give you the easiest thing to do in R. You were almost right, but you can't return twice from a function - only the first return is ever reached.
question_2 <- function(x, y, data){
  new_table <- subset(data, Species==x)
  themin <- min(new_table[[y]])
  themax <- max(new_table[[y]])
  return(list(themin, themax))}
question_2("setosa","Sepal.Width", iris)
df$colname cannot be used with a variable to the right of $, because it will search for a column literally named "colname" ("y" in your case) rather than the character value that the variable colname (if it even exists) represents.
The syntax df[["colname"]] is useful in this case because it allows for character input (which may also be a variable representing a character). This holds for both object types list and data.frame. In fact, a data.frame can be seen as a list of vectors.
Example
df <- data.frame(col1 = 5:7, col2 = letters[1:3])
a <- "col1"
# $ vs [[
df$col1 # works because "col1" is a column of df
df$a # returns NULL - $ looks for a column literally named "a", not the value of a
df[["col1"]] # works because "col1" is a column of df
df[[a]] # works because "col1" is a column of df
# dataframes can be seen as list of vectors
ls <- list(col1 = 5:7, col2 = letters[1:3])
ls$col1 # works
ls[[a]] # works
One problem is that Sepal.Width seems to be some object in the workspace - otherwise R would yell at you: Object "Sepal.Width" not found. Whatever Sepal.Width (the object) is, it is probably not a character string with the value "Sepal.Width". But even if it were, R would not know how to use the $ operator to get that named column from new_table, not without some needlessly advanced programming. @Flo.P's suggestion of using [[ is a good one.
You must pass y as "Sepal.Width".
Another approach: you can take advantage of subset by writing this:
question_2 <- function(x, y, data){
  newy <- subset(data, subset=Species==x, select=y)
  themin <- min(newy)
  themax <- max(newy)
  return(c(themin, themax))
}
question_2("setosa","Sepal.Width", iris)

Recoding data.table values in a loop using a vector of column names

I have a large data table that contains some categorical variables, where missing values have been coded as blank strings. I would like to recode them to NA.
I have a vector storing the names of the categorical variables:
categorical_variables = c("v3", etc.
The vector is definitely set up correctly - I have successfully used it to loop through plots of each column. However when I try to recode using this...
for (v in categorical_variables) myDataTable[get(v)=="",get(v):=NA]
...I get the following error:
Error in get(v) : object 'v3' not found
Yet this works OK:
myDataTable[v3=="",v3:=NA]
And this also works OK:
myDataTable[get("v3")=="",get("v3")]
So it's when I try to do the assignment using get() combined with := that it throws the error. What am I doing wrong?
The data.table is very large (hence my preference for using data.table), so ideally I don't want to convert to data.frame and use a base R approach. I feel like this should be a very straightforward procedure in data.table, but I've really struggled to find anything conclusive in the documentation, on Google, or on here! Is this a bug or am I missing something obvious?
We can use set. According to ?set, it is very fast as the overhead of [.data.table is avoided
library(data.table)
for (v in categorical_variables){
  set(myDataTable, i=which(myDataTable[[v]]==""), j=v, value=NA)
}
However, this can be avoided during the read itself, as fread has the na.strings option (just like read.csv/read.table). We can specify the characters that need to be read as NA, i.e. if we have "" and $ to read as NA:
myDataTable <- fread("yourfile.csv", na.strings=c("", "$"))
data
myDataTable <- data.table(v3 = c(letters[1:3], ''),
                          v5 = 1:4,
                          v7 = c('', '', letters[1:2]))
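A self-contained check of the fread route, writing a small temporary file in place of yourfile.csv:
library(data.table)
tmp <- tempfile(fileext = ".csv")
writeLines(c("v3,v5,v7", "a,1,", "b,2,$", ",3,x"), tmp)
fread(tmp, na.strings = c("", "$"))
# the blank strings and "$" come in as NA in v3 and v7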

Efficient Way to Convert to Numeric

I have converted a bunch of my columns from factor to numeric, but the code was very cumbersome. I had to individually convert each column, which ended up taking more time than it should. This is the code I used (only a short sample - I actually have many more columns):
city1$NY <-as.numeric(levels(city1$NY))[city1$NY]
city1$CHI<-as.numeric(levels(city1$CHI))[city1$CHI]
city1$LA <-as.numeric(levels(city1$LA))[city1$LA]
city1$ATL<-as.numeric(levels(city1$ATL))[city1$ATL]
city1$MIA<-as.numeric(levels(city1$MIA))[city1$MIA]
I was almost positive that instead of doing all of that, I could've just done:
city1[,CityNames]<-as.numeric(levels(city1[,CityNames]))[city1[,CityNames]]
Where CityNames is just all of the columns for the data that I would like to convert.. But that doesn't work, as I get:
Error in as.numeric(levels(city1[, CityNames]))[city1[, CityNames]] :
invalid subscript type 'list'
Can anyone tell me what I am doing wrong? Or is there simply no easier way to do this task other than my long, annoying first method?
I was almost positive that instead of doing all of that, I could've just done:
city1[,CityNames]<-as.numeric(levels(city1[,CityNames]))[city1[,CityNames]]
So, a small change is needed:
city1[,CityNames] <- lapply(city1[,CityNames], function(x) as.numeric(levels(x))[x] )
The original approach didn't work because:
1) levels are vector-specific, so it's not clear what myvec = levels(city1[,CityNames]) would mean.
2) myvec[ city1[,CityNames] ] throws an error because city1[,CityNames] is a data.frame and cannot be used to subset in this way.
This is typically what I do when I want to convert many columns in a data.frame to a different data type:
convNames <- c("NY", "CHI", "LA", "ATL", "MIA")
for(name in convNames) { city1[, name] <- as.numeric(as.character(city1[, name])) }
It's a nice two lines, and you just have to add the names of whatever columns you want to coerce to the convNames vector.
EDIT: Due to a factor issue, use the lapply method above.
I'm not sure if it is faster, but it may be, since the lookups may be what is slowing you down. Try converting each column with as.numeric(as.character(x)). The as.character() converts to the level values and then as.numeric() interprets those strings as their numeric equivalents. It may be significantly faster since it does not have to do any lookups into the levels vector for each value.
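For completeness, a tiny reproducible run of the lapply idiom on made-up data:
city1 <- data.frame(NY  = factor(c("10", "20")),
                    CHI = factor(c("1.5", "2.5")))
CityNames <- c("NY", "CHI")
city1[, CityNames] <- lapply(city1[, CityNames],
                             function(x) as.numeric(levels(x))[x])
sapply(city1, class)
#        NY       CHI
# "numeric" "numeric"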

How can one work fully generically in data.table in R with column names in variables

What I'm looking for is a "best-practices-approved" alternative to the following workaround / workflow. Consider that I have a bunch of columns of similar data, and would like to perform a sequence of similar operations on these columns or sets of them, where the operations are of arbitrarily high complexity, and the groups of column names passed to each operation are specified in a variable.
I realize this issue sounds contrived, but I run into it with surprising frequency. The examples are usually so messy that it is difficult to separate out the features relevant to this question, but I recently stumbled across one that was fairly straightforward to simplify for use as a MWE here:
library(data.table)
library(lubridate)
library(zoo)
the.table <- data.table(year=1991:1996,var1=floor(runif(6,400,1400)))
the.table[,`:=`(var2=var1/floor(runif(6,2,5)),
var3=var1/floor(runif(6,2,5)))]
# Replicate data across months
new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
length.out=12,
by="1 month")),by=year]
# Do a complicated procedure to each variable in some group.
var.names <- c("var1","var2","var3")
for(varname in var.names) {
#As suggested in an answer to Link 3 above
#Convert the column name to a 'quote' object
quote.convert <- function(x) eval(parse(text=paste0('quote(',x,')')))
#Do this for every column name I'll need
varname <- quote.convert(varname)
anntot <- quote.convert(paste0(varname,".annual.total"))
monthly <- quote.convert(paste0(varname,".monthly"))
rolling <- quote.convert(paste0(varname,".rolling"))
scaled <- quote.convert(paste0(varname,".scaled"))
#Perform the relevant tasks, using eval()
#around every variable columnname I may want
new.table[,eval(anntot):=
the.table[,rep(eval(varname),each=12)]]
new.table[,eval(monthly):=
the.table[,rep(eval(varname)/12,each=12)]]
new.table[,eval(rolling):=
rollapply(eval(monthly),mean,width=12,
fill=c(head(eval(monthly),1),
tail(eval(monthly),1)))]
new.table[,eval(scaled):=
eval(anntot)/sum(eval(rolling))*eval(rolling),
by=year]
}
Of course, the particular effect on the data and variables here is irrelevant, so please do not focus on it or suggest improvements for accomplishing it in this particular case. What I am looking for, rather, is a generic strategy for the workflow of repeatedly applying an arbitrarily complicated procedure of data.table actions to a list of columns, or a list of lists-of-columns, specified in a variable or passed as an argument to a function. The procedure must refer programmatically to the columns named in the variable/argument, and may include updates, joins, groupings, and calls to the data.table special objects .I, .SD, etc.; BUT it should be simpler, more elegant, shorter, or easier to design, implement, or understand than the one above or others that require frequent quote-ing and eval-ing.
In particular, please note that because the procedures can be fairly complex and involve repeatedly updating the data.table and then referencing the updated columns, the standard lapply(.SD, ...), ... .SDcols = ... approach is usually not a workable substitute (see the sketch below). Also, replacing each call of eval(a.column.name) with DT[[a.column.name]] neither simplifies much nor works completely in general, since that doesn't play nice with the other data.table operations, as far as I am aware.
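For contrast, a minimal illustration on toy data of where the lapply/.SDcols idiom does suffice - a single homogeneous step - versus the multi-step, self-referencing updates that break it:
library(data.table)
DT <- data.table(var1 = 1:3, var2 = 4:6)
cols <- c("var1", "var2")
# one homogeneous step works fine without any quoting:
DT[, (paste0(cols, ".monthly")) := lapply(.SD, `/`, 12), .SDcols = cols]
# but a follow-up step that must reference the columns just created
# (e.g. the rolling means above) forces you back into quote()/eval() territory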
I am aware of many workarounds for various use cases of variable column
names in data.table, including:
Select / assign to data.table when variable names are stored in a character vector
Pass column name in data.table using variable
Referring to data.table columns by names saved in variables
passing column names to data.table programmatically
Data.table meta-programming
How to write a function that calls a function that calls data.table?
Using dynamic column names in `data.table`
Dynamic column names in data.table
Assign multiple columns using := in data.table, by group
Setting column name in "group by" operation with data.table
Summarizing multiple columns with data.table
and probably more I haven't referenced.
But: even if I learned all the tricks documented above to the point that I
never had to look them up to remind myself how to use them, I still would find
that working with column names that are passed as parameters to a function is
an extremely tedious task.
The problem you are describing is not strictly related to data.table.
Complex queries cannot be easily translated to code that a machine can parse, thus we are not able to escape complexity when writing a query for complex operations.
You can try to imagine how you would programmatically construct the following data.table query using dplyr or SQL:
DT[, c(f1(v1, v2, opt=TRUE),
f2(v3, v4, v5, opt1=FALSE, opt2=TRUE),
lapply(.SD, f3, opt1=TRUE, opt2=FALSE))
, by=.(id1, id2)]
Assuming that all columns (id1, id2, v1...v5) or even options (opt, opt1, opt2) should be passed as variables.
Because of this complexity in expressing queries, I don't think you could easily accomplish the requirement stated in your question:
is simpler, more elegant, shorter, or easier to design or implement or understand than the one above or others that require frequent quote-ing and eval-ing.
Still, compared to other programming languages, base R provides very useful tools to deal with such problems.
You already found suggestions to use get, mget, DT[[col_name]], parse, quote, and eval.
As you mentioned, DT[[col_name]] might not play well with data.table optimizations, and thus is not that useful here.
parse is probably the easiest way to construct complex queries, as you can just operate on strings, but it doesn't provide basic language syntax validation. So you can end up trying to parse a string that the R parser does not accept. Additionally, there is a security concern, as presented in 2655#issuecomment-376781159.
get/mget are the ones most commonly suggested for such problems. get and mget are internally caught by [.data.table and translated to the expected columns, so you are assuming your arbitrarily complex query can be decomposed by [.data.table and the expected columns properly substituted in.
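For reference, a minimal get()-based parameterised query on toy data:
library(data.table)
DT <- data.table(id = c("a", "a", "b"), val = 1:3)
grpcol <- "id"; sumcol <- "val"
# get() is caught by [.data.table and resolved to the column it names
DT[, .(total = sum(get(sumcol))), by = grpcol]
#    id total
# 1:  a     3
# 2:  b     3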
Since you asked this question a few years back, a new feature, the dot-dot prefix, has recently been rolled out. You prefix a variable name with dot-dot to refer to a variable outside the scope of the current data.table, similar to how you refer to the parent directory in a file system. The internals behind dot-dot are quite similar to get: variables carrying the prefix will be dereferenced inside [.data.table. In future releases the dot-dot prefix may allow calls like:
col1="a"; col2="b"; col3="g"; col4="x"; col5="y"
DT[..col4==..col5, .(s1=sum(..col1), s2=sum(..col2)), by=..col3]
Personally I prefer quote and eval instead. A quoted expression is evaluated almost as if it had been written by hand from scratch. This method does not rely on data.table's ability to manage references to columns, so we can expect all optimizations to work the same way as if you had written those queries by hand. I also find it easier to debug, as at any point you can just print the quoted expression to see what is actually passed to the data.table query, and there is less room for bugs to occur. Constructing complex queries using R language objects is sometimes tricky, but it is easy to wrap the procedure into a function so it can be applied in different use cases and easily re-used. Importantly, this method is independent of data.table; it uses base R language constructs. You can find more information in the official R Language Definition, in the Computing on the Language chapter.
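A minimal sketch of that quote/eval style on a toy table (all names made up):
library(data.table)
DT <- data.table(x = 1:3)
col <- "x"; newcol <- "x.doubled"
q <- substitute(DT[, lhs := rhs * 2],
                list(lhs = as.name(newcol), rhs = as.name(col)))
print(q)  # inspect exactly what will be run: DT[, x.doubled := x * 2]
eval(q)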
What else?
I submitted a proposal for a new concept called macro in #1579. In short, it is a wrapper around DT[eval(qi), eval(qj), eval(qby)], so you still have to operate on R language objects. You are welcome to put your comments there.
Recently I proposed another approach for a metaprogramming interface in PR#4304. In short, it plugs base R substitute functionality into [.data.table using a new argument, env.
Turning to the example: below I show two ways to solve it. The first uses base R metaprogramming; the second uses the data.table metaprogramming proposed in PR#4304 (see above).
Base R computing on the language
I will wrap all the logic into a do_vars function. Calling do_vars(donot=TRUE) will print the expressions to be computed on the data.table instead of evaluating them. The code below should be run just after the OP code.
expected = copy(new.table)
new.table = the.table[, list(asofdate=seq(from=ymd((year)*10^4+101), length.out=12, by="1 month")), by=year]
do_vars = function(x, y, vars, donot=FALSE) {
  name.suffix = function(x, suffix) as.name(paste(x, suffix, sep="."))
  do_var = function(var, x, y) {
    substitute({
      x[, .anntot := y[, rep(.var, each=12)]]
      x[, .monthly := y[, rep(.var/12, each=12)]]
      x[, .rolling := rollapply(.monthly, mean, width=12, fill=c(head(.monthly,1), tail(.monthly,1)))]
      x[, .scaled := .anntot/sum(.rolling)*.rolling, by=year]
    }, list(
      .var=as.name(var),
      .anntot=name.suffix(var, "annual.total"),
      .monthly=name.suffix(var, "monthly"),
      .rolling=name.suffix(var, "rolling"),
      .scaled=name.suffix(var, "scaled")
    ))
  }
  ql = lapply(setNames(nm=vars), do_var, x, y)
  if (donot) return(ql)
  lapply(ql, eval.parent)
  invisible(x)
}
do_vars(new.table, the.table, c("var1","var2","var3"))
all.equal(expected, new.table)
#[1] TRUE
We can preview the queries:
do_vars(new.table, the.table, c("var1","var2","var3"), donot=TRUE)
#$var1
#{
# x[, `:=`(var1.annual.total, y[, rep(var1, each = 12)])]
# x[, `:=`(var1.monthly, y[, rep(var1/12, each = 12)])]
# x[, `:=`(var1.rolling, rollapply(var1.monthly, mean, width = 12,
# fill = c(head(var1.monthly, 1), tail(var1.monthly, 1))))]
# x[, `:=`(var1.scaled, var1.annual.total/sum(var1.rolling) *
# var1.rolling), by = year]
#}
#
#$var2
#{
# x[, `:=`(var2.annual.total, y[, rep(var2, each = 12)])]
# x[, `:=`(var2.monthly, y[, rep(var2/12, each = 12)])]
# x[, `:=`(var2.rolling, rollapply(var2.monthly, mean, width = 12,
# fill = c(head(var2.monthly, 1), tail(var2.monthly, 1))))]
# x[, `:=`(var2.scaled, var2.annual.total/sum(var2.rolling) *
# var2.rolling), by = year]
#}
#
#$var3
#{
# x[, `:=`(var3.annual.total, y[, rep(var3, each = 12)])]
# x[, `:=`(var3.monthly, y[, rep(var3/12, each = 12)])]
# x[, `:=`(var3.rolling, rollapply(var3.monthly, mean, width = 12,
# fill = c(head(var3.monthly, 1), tail(var3.monthly, 1))))]
# x[, `:=`(var3.scaled, var3.annual.total/sum(var3.rolling) *
# var3.rolling), by = year]
#}
#
Proposed data.table metaprogramming
expected = copy(new.table)
new.table = the.table[, list(asofdate=seq(from=ymd((year)*10^4+101), length.out=12, by="1 month")), by=year]
name.suffix = function(x, suffix) as.name(paste(x, suffix, sep="."))
do_var2 = function(var, x, y) {
  x[, .anntot := y[, rep(.var, each=12)],
    env = list(
      .anntot = name.suffix(var, "annual.total"),
      .var = var
    )]
  x[, .monthly := y[, rep(.var/12, each=12)],
    env = list(
      .monthly = name.suffix(var, "monthly"),
      .var = var
    )]
  x[, .rolling := rollapply(.monthly, mean, width=12, fill=c(head(.monthly,1), tail(.monthly,1))),
    env = list(
      .rolling = name.suffix(var, "rolling"),
      .monthly = name.suffix(var, "monthly")
    )]
  x[, .scaled := .anntot/sum(.rolling)*.rolling, by=year,
    env = list(
      .scaled = name.suffix(var, "scaled"),
      .anntot = name.suffix(var, "annual.total"),
      .rolling = name.suffix(var, "rolling")
    )]
  TRUE
}
sapply(setNames(nm=var.names), do_var2, new.table, the.table)
#var1 var2 var3
#TRUE TRUE TRUE
all.equal(expected, new.table)
#[1] TRUE
Data and updated OP code
library(data.table)
library(lubridate)
library(zoo)
the.table <- data.table(year=1991:1996,var1=floor(runif(6,400,1400)))
the.table[,`:=`(var2=var1/floor(runif(6,2,5)),
var3=var1/floor(runif(6,2,5)))]
# Replicate data across months
new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
length.out=12,
by="1 month")),by=year]
# Do a complicated procedure to each variable in some group.
var.names <- c("var1","var2","var3")
for(varname in var.names) {
#As suggested in an answer to Link 3 above
#Convert the column name to a 'quote' object
quote.convert <- function(x) eval(parse(text=paste0('quote(',x,')')))
#Do this for every column name I'll need
varname <- quote.convert(varname)
anntot <- quote.convert(paste0(varname,".annual.total"))
monthly <- quote.convert(paste0(varname,".monthly"))
rolling <- quote.convert(paste0(varname,".rolling"))
scaled <- quote.convert(paste0(varname,".scaled"))
#Perform the relevant tasks, using eval()
#around every variable columnname I may want
new.table[,paste0(varname,".annual.total"):=
the.table[,rep(eval(varname),each=12)]]
new.table[,paste0(varname,".monthly"):=
the.table[,rep(eval(varname)/12,each=12)]]
new.table[,paste0(varname,".rolling"):=
rollapply(eval(monthly),mean,width=12,
fill=c(head(eval(monthly),1),
tail(eval(monthly),1)))]
new.table[,paste0(varname,".scaled"):=
eval(anntot)/sum(eval(rolling))*eval(rolling),
by=year]
}
Thanks for the question. Your original approach goes a long way towards solving most of the issues.
Here I've tweaked the quoting function slightly, and changed the approach to parse and evaluate the entire RHS expression as a string instead of the individual variables.
The reasoning being:
You probably don't want to be repeating yourself by declaring every variable you need to use at the start of the loop.
Strings will scale better since they can be generated programmatically. I've added an example below that calculates row-wise percentages to illustrate this.
library(data.table)
library(lubridate)
library(zoo)
set.seed(1)
the.table <- data.table(year=1991:1996,var1=floor(runif(6,400,1400)))
the.table[,`:=`(var2=var1/floor(runif(6,2,5)),
var3=var1/floor(runif(6,2,5)))]
# Replicate data across months
new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
length.out=12,
by="1 month")),by=year]
# function to paste, parse & evaluate arguments
evalp <- function(..., envir=parent.frame()) {eval(parse(text=paste0(...)), envir=envir)}
# Do a complicated procedure to each variable in some group.
var.names <- c("var1","var2","var3")
for(varname in var.names) {
# 1. For LHS, use paste0 to generate new column name as string (from @eddi's comment)
# 2. For RHS, use evalp
new.table[, paste0(varname, '.annual.total') := evalp(
'the.table[,rep(', varname, ',each=12)]'
)]
new.table[, paste0(varname, '.monthly') := evalp(
'the.table[,rep(', varname, '/12,each=12)]'
)]
# Need to add envir=.SD when working within the table
new.table[, paste0(varname, '.rolling') := evalp(
'rollapply(',varname, '.monthly,mean,width=12,
fill=c(head(', varname, '.monthly,1), tail(', varname, '.monthly,1)))'
, envir=.SD
)]
new.table[,paste0(varname, '.scaled'):= evalp(
varname, '.annual.total / sum(', varname, '.rolling) * ', varname, '.rolling'
, envir=.SD
)
,by=year
]
# Since we're working with strings, more freedom
# to work programmatically
new.table[, paste0(varname, '.row.percent') := evalp(
'the.table[,rep(', varname, '/ (', paste(var.names, collapse='+'), '), each=12)]'
)]
}
I tried to do this in data.table thinking "this isn't so bad"... but after an embarrassing length of time, I gave up. Matt says something like 'do in pieces then join', but I couldn't figure out elegant ways to do these pieces, especially because the last one depends on previous steps.
I have to say, this is a pretty brilliantly constructed question, and I too encounter similar issues frequently. I love data.table, but I still struggle sometimes. I don't know if I'm struggling with data.table or the complexity of the problem.
Here is the incomplete approach I've taken.
Realistically I can imagine that in a normal process you would have more intermediate variables stored that would be useful for calculating these values.
library(data.table)
library(zoo)
## Example yearly data
set.seed(27)
DT <- data.table(year=1991:1996,
var1=floor(runif(6,400,1400)))
DT[ , var2 := var1 / floor(runif(6,2,5))]
DT[ , var3 := var1 / floor(runif(6,2,5))]
setkeyv(DT,colnames(DT)[1])
DT
## Convenience function
nonkey <- function(dt){colnames(dt)[!colnames(dt)%in%key(dt)]}
## Annual data expressed monthly
NewDT <- DT[, j=list(asofdate=as.IDate(paste(year, 1:12, 1, sep="-"))), by=year]
setkeyv(NewDT, colnames(NewDT)[1:2])
## Create annual data
NewDT_Annual <- NewDT[DT]
setnames(NewDT_Annual,
nonkey(NewDT_Annual),
paste0(nonkey(NewDT_Annual), ".annual.total"))
## Compute monthly data
NewDT_Monthly <- NewDT[DT[ , .SD / 12, keyby=list(year)]]
setnames(NewDT_Monthly,
nonkey(NewDT_Monthly),
paste0(nonkey(NewDT_Monthly), ".monthly"))
## Compute rolling stats
NewDT_roll <- NewDT_Monthly[j = lapply(.SD, rollapply, mean, width=12,
fill=c(.SD[1],tail(.SD, 1))),
.SDcols=nonkey(NewDT_Monthly)]
NewDT_roll <- cbind(NewDT_Monthly[,1:2,with=F], NewDT_roll)
setkeyv(NewDT_roll, colnames(NewDT_roll)[1:2])
setnames(NewDT_roll,
nonkey(NewDT_roll),
gsub(".monthly$",".rolling",nonkey(NewDT_roll)))
## Compute normalized values
## Compute "adjustment" table which is
## total of each variable, by year for rolling
## divided by
## original annual totals
## merge "adjustment values" in with monthly data, and then
## make a modified data.table which is each varaible * annual adjustment factor
## Merge everything
NewDT_Combined <- NewDT_Annual[NewDT_roll][NewDT_Monthly]
