data.table operations by column name - r

Suppose I have a data.table
a <- data.table(id=c(1,1,2,2,3),a=21:25,b=11:15,key="id")
I can add new columns like this:
a[, sa := sum(a), by="id"]
a[, sb := sum(b), by="id"]
> a
   id  a  b sa sb
1:  1 21 11 43 23
2:  1 22 12 43 23
3:  2 23 13 47 27
4:  2 24 14 47 27
5:  3 25 15 25 15
However, suppose that I have column names instead:
for (n in c("a","b")) {
s <- paste0("s",n)
a[, s := sum(n), by="id", with=FALSE] # ERROR: invalid 'type' (character) of argument
}
what do I do?

You can also do this:
a <- data.table(id=c(1,1,2,2,3),a=21:25,b=11:15,key="id")
a[, c("sa", "sb") := lapply(.SD, sum), by = id]
Or slightly more generally:
cols.to.sum = c("a", "b")
a[, paste0("s", cols.to.sum) := lapply(.SD, sum), by = id, .SDcols = cols.to.sum]

This is similar to:
How to generate a linear combination of variables and update table using data.table in a loop call?
but you want to combine this with by= too, so set() isn't flexible enough. That's a deliberate design decision, and set() is unlikely to change in that regard.
I sometimes use the EVAL helper at the end of that answer: https://stackoverflow.com/a/20808573/403310. Some wince at that approach, but I just think of it like constructing a dynamic SQL statement, which is quite common practice. The EVAL approach gives ultimate flexibility without head-scratching about eval() and quote(). To see the dynamic query that's been constructed (to check it), you can add a print inside your EVAL helper function.
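For reference, here is a minimal sketch of such an EVAL helper applied to this question's example (the helper in the linked answer may differ in detail, e.g. in which frame it evaluates):
EVAL <- function(...) {
  qry <- paste0(...)
  # print(qry)  # uncomment to inspect the constructed query before it runs
  eval(parse(text = qry), envir = parent.frame())
}
for (n in c("a","b")) {
  EVAL("a[, s", n, " := sum(", n, "), by = id]")
}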
However, in this simple example you can wrap the LHS of := with brackets to tell data.table to look up the value (clearer than with=FALSE), and the RHS needs a get().
for (n in c("a","b")) {
s <- paste0("s",n)
a[, (s) := sum(get(n)), by="id"]
}

Edit 2020-02-15 about ..
data.table also supports the .. syntax to "look up a level", obviating the need for with=FALSE in most cases; e.g. dt[, ..n1] and dt[, ..n2] in the sketch below.
have a look at with in ?data.table:
dt <- data.table(id=1:5,a=21:25,b=11:15,key="id")
n1 <- "a"; n2 <- "b"; n3 <- "c"  # the name variables referred to above
dt[, (n3) := dt[, n1, with = FALSE] * dt[, n2, with = FALSE]]
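The same update using the .. form mentioned above, as a sketch (no with=FALSE needed):
dt[, (n3) := dt[, ..n1] * dt[, ..n2]]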
EDIT:
Or you can just change the colnames back and forth:
dt <- data.table(id=1:5,a=21:25,b=11:15,key="id")
dt.names <- c( n1 = "a", n2 = "b", n3 = "c" )
dt[ , (dt.names["n3"]) := 1L ]              # create column "c" so it can be renamed
setnames( dt, dt.names, names(dt.names) )   # a, b, c -> n1, n2, n3
dt[ , n3 := n1 * n2, by = "id" ]
setnames( dt, names(dt.names), dt.names )   # and back again
which works together with by.

Here is an approach that does the call mangling and avoids the overhead of .SD:
# helper: build a call like sum(a) from a function name and a column symbol
makeCall <- function(x, fun) bquote(.(fun)(.(x)))
# the columns you wish to sum (apply the function to)
cols <- c('a','b')
new.cols <- paste0('s', cols)
# a named list of column symbols; the names become the new column names
name.cols <- setNames(sapply(cols, as.name), new.cols)
# build the call `:=`(sa = sum(a), sb = sum(b))
my_call <- as.call(c(as.name(':='), lapply(name.cols, makeCall, fun = as.name('sum'))))
(a[, eval(my_call), by = 'id'])
#    id  a  b sa sb
# 1:  1 21 11 43 23
# 2:  1 22 12 43 23
# 3:  2 23 13 47 27
# 4:  2 24 14 47 27
# 5:  3 25 15 25 15

Related

R loop to standardize variables using data.table

how can I loop over certain variables in order to standardize them? I am trying to set up the code, but it's not working; my idea was to use assign or eval, but those do not seem to work. Below is a reproducible example.
if (!require('data.table')) {install.packages('data.table'); library('data.table')}
a <- seq(0,10,1)
b <- seq(99,100,0.1)
dt <- data.table(a,b)
# Expected result
dt[,z_a:= ((a-mean(a,na.rm=TRUE))/sd(a,na.rm=TRUE)) ]
dt[,z_b:= ((b-mean(b,na.rm=TRUE))/sd(b,na.rm=TRUE)) ]
# Loop not working
stdvars <- c(a,b)
for (v in stdvars) {
dt[z_v:= ((v-mean(v,na.rm=TRUE))/sd(v,na.rm=TRUE)) ]
}
dt
I would advise against using explicit loops when working with data.table, as its internal functionality is many times more efficient. In particular, you can define a function which you call through lapply over a specified subset (.SD):
standardise = function(x){(x-mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE)} # Define a standardising function
oldcols = c('a', 'b') # Name of old columns
newcols = paste0('z_', oldcols) # Name of new columns ('z_a' and 'z_b')
dt[, (newcols) := lapply(.SD, standardise), .SDcols = oldcols]
Output:
> dt
     a     b        z_a        z_b
 1:  0  99.0 -1.5075567 -1.5075567
 2:  1  99.1 -1.2060454 -1.2060454
 3:  2  99.2 -0.9045340 -0.9045340
 4:  3  99.3 -0.6030227 -0.6030227
 5:  4  99.4 -0.3015113 -0.3015113
 6:  5  99.5  0.0000000  0.0000000
 7:  6  99.6  0.3015113  0.3015113
 8:  7  99.7  0.6030227  0.6030227
 9:  8  99.8  0.9045340  0.9045340
10:  9  99.9  1.2060454  1.2060454
11: 10 100.0  1.5075567  1.5075567
.SD means that you are calling lapply across a Subset of the Data, defined by the .SDcols argument. In this case, we apply the standardise function across the oldcols subset and assign the results to newcols.
There is a built-in function, scale, which standardizes variables; missing values are removed when standardizing. So it is more direct to proceed as follows:
cols <- c("a", "b")
dt[, paste0("z_", cols) := lapply(.SD, scale), .SDcols = cols]
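One caveat (my aside, not part of the original answer): scale() returns a one-column matrix, so z_a and z_b end up stored as matrix columns. A sketch that keeps plain numeric vectors instead:
dt[, paste0("z_", cols) := lapply(.SD, function(x) as.vector(scale(x))), .SDcols = cols]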
An option is to use non-standard evaluation:
for (v in c("a", "b")) {
eval(substitute(dt[, paste0("z_", v) := (V - mean(V, na.rm=TRUE)) / sd(V, na.rm=TRUE)],
list(V=as.name(v))))
}
dt
Or putting it in a function:
f <- function(DT, v) {
lhs <- paste0("z_", as.list(match.call())$v)
eval(substitute(
DT[, (lhs) := (v - mean(v, na.rm=TRUE)) / sd(v, na.rm=TRUE)]))
}
f(dt, a)
f(dt, b)
dt

R: data.table, set first and last value of a group to NA

I would like to set the first and the last value in a group to NA. Here is an example:
DT <- data.table(v = rnorm(12), class=rep(1:3, each=4))
DT[, v[c(1,.N)] := NA , by=class]
But this is not working. How can I do it?
At the moment, the way to go about this would be to first extract the indices, and then do one assignment by reference.
idx = DT[, .(idx = .I[c(1L, .N)]), by=class]$idx
DT[idx, v := NA]
I'll try and add this example to the Reference semantics vignette.
This may not be a one-liner, but it does have 'first' and 'last' in the code :)
> DT <- data.table(v = rnorm(12), class=rep(1:3, each=4))
> setkey(DT, class)
> classes = DT[, .(unique(class))]
> DT[classes, v := NA, mult='first']
> DT[classes, v := NA, mult='last']
> DT
          v class
 1:      NA     1
 2: -1.8191     1
 3: -0.6355     1
 4:      NA     1
 5:      NA     2
 6: -1.1771     2
 7: -0.8125     2
 8:      NA     2
 9:      NA     3
10:  0.2357     3
11:  0.3416     3
12:      NA     3
>
Order is also preserved for the non-key columns. I think that is a documented (committed to) feature.
With a helper function it's easy
set.na = function(x, i) { x[i] = NA; x }
DT[, v := set.na(v, c(1, .N)), by = class]
The canonical way to modify subsets of the data is to use i to define the subset; you cannot put a subsetted vector like v[c(1, .N)] on the left-hand side of :=. Either create a temporary i as suggested by #David Arenburg, or create the outcome vector yourself using a c(NA, v[-c(1, .N)], NA) construction.
DT[, v := c(NA, v[-c(1, .N)], NA)[1:.N], by = class]
The trailing [1:.N] keeps the length correct for one-row groups. However, you should also note that the row order can change when you, e.g., set a new key or use any number of functions, so you should be very careful with this operation.

R - How to run average & max on different data.table columns based on multiple factors & return original colnames

I am changing my R code from data.frame + plyr to data.table as I need a faster and more memory-efficient way to handle a big data set. Unfortunately, my R skills are woefully limited and I've been hitting a wall all day. I would appreciate it if the SO experts here could enlighten me.
My Goals
Aggregate rows in my data.table based on 2 functions - average and max - run on selected columns (with column names passed via vector) while grouping by columns also passed via vector.
The resulting DT should contain the original column names.
There should not be unnecessary copying of the DT in order to conserve memory
My Test Code
DT = data.table( a=LETTERS[c(1,1,1:4)],b=4:9, c=3:8, d = rnorm(6),
e=LETTERS[c(rep(25,3),rep(26,3))], key="a" )
GrpVar1 <- "a"
GrpVar2 <- "e"
VarToMax <- "b"
VarToAve <- c( "c", "d")
What I tried but didn't work for me
DT[, list( b=max( b ), c=mean(c), d=mean(d) ), by=c( GrpVar1, GrpVar2 ) ]
# Hard-code col name - not what I want
DT[, list( max( get(VarToMax) ), mean( get(VarToAve) )), by=c( GrpVar1, GrpVar2 ) ]
# Col names become 'V1', 'V2', worse, 1 column goes missing - Not what I want either
DT[, list( get(VarToMax)=max( get(VarToMax) ),
get(VarToAve)=mean( get(VarToAve) ) ), by=c( GrpVar1, GrpVar2 ) ]
# Above code gave Error!
Additional Question
Based on my very limited understanding of DTs, the with = F argument should instruct R to parse the values of VarToMax and VarToAve, but running the code below leads to error.
DT[, list( max(VarToMax), mean(VarToAve) ), by=c( GrpVar1, GrpVar2 ), with=F ]
# Error in `[.data.table`(DT, , list(max(VarToMax), mean(VarToAve)), by = c(GrpVar1, :
# object 'ansvals' not found
# In addition: Warning message:
# In mean.default(VarToAve) :
# argument is not numeric or logical: returning NA
Existing SO solutions can't help
Arun's solution was how I got to this point, but I am very stuck. His other solution using lapply and .SDcols involves creating two extra data.tables, which does not meet my memory-conserving requirement.
dt1 <- dt[, lapply(.SD, sum), by=ID, .SDcols=c(3,4)]
dt2 <- dt[, lapply(.SD, head, 1), by=ID, .SDcols=c(2)]
I am SO confused over data.table! Any help would be most appreciated!
In a similar fashion to #David Arenburg, but using .SDcols in order to simplify the notation. I also show the code up to and including the merge.
DTaves <- DT[, lapply(.SD, mean), .SDcols = VarToAve, by = c(GrpVar1, GrpVar2)]
DTmaxs <- DT[, lapply(.SD, max), .SDcols = VarToMax, by = c(GrpVar1, GrpVar2)]
merge(DTmaxs, DTaves)
##    a e b c          d
## 1: A Y 6 4  0.2230091
## 2: B Z 7 6  0.5909434
## 3: C Z 8 7 -0.4828223
## 4: D Z 9 8 -1.3591240
Alternatively, you can do this in one go by subsetting the .SD using the .. notation to look for VarToAve in the parent frame of .SD (as opposed to a column named VarToAve)
DT[, c(lapply(.SD[, ..VarToAve], mean),
lapply(.SD[, ..VarToMax], max)),
by = c(GrpVar1, GrpVar2)]
##    a e c          d b
## 1: A Y 4  0.2230091 6
## 2: B Z 6  0.5909434 7
## 3: C Z 7 -0.4828223 8
## 4: D Z 8 -1.3591240 9
Here's my humble attempt
DT[, as.list(c(setNames(max(get(VarToMax)), VarToMax),
lapply(.SD[, ..VarToAve], mean))),
c(GrpVar1, GrpVar2)]
#    a e b c          d
# 1: A Y 6 4 -0.8000173
# 2: B Z 7 6  0.2508633
# 3: C Z 8 7  1.1966517
# 4: D Z 9 8  1.7291615
Or, for maximum efficiency, you could use a colMeans and eval(as.name()) combination instead of lapply and get:
DT[, as.list(c(setNames(max(eval(as.name(VarToMax))), VarToMax),
colMeans(.SD[, ..VarToAve]))),
c(GrpVar1, GrpVar2)]
#    a e b c          d
# 1: A Y 6 4 -0.8000173
# 2: B Z 7 6  0.2508633
# 3: C Z 8 7  1.1966517
# 4: D Z 9 8  1.7291615

Apply different functions to different sets of columns by group

I have a data.table with the following features:
bycols: columns that divide the data into groups
nonvaryingcols: columns that are constant within each group (so that taking the first item from within each group and carrying that through would be sufficient)
datacols: columns to be aggregated / summarized (e.g. sum them within group)
I'm curious what the most efficient way is to do what you might call a mixed collapse, taking all three of the above inputs as character vectors. It doesn't have to be the absolute fastest, but fast enough with reasonable syntax would be ideal.
Example data, where the different sets of columns are stored in character vectors.
require(data.table)
set.seed(1)
bycols <- c("g1","g2")
datacols <- c("dat1","dat2")
nonvaryingcols <- c("nv1","nv2")
test <- data.table(
g1 = rep( letters, 10 ),
g2 = rep( c(LETTERS,LETTERS), each = 5 ),
dat1 = runif( 260 ),
dat2 = runif( 260 ),
nv1 = rep( seq(130), 2),
nv2 = rep( seq(130), 2)
)
Final data should look like:
   g1 g2      dat1      dat2 nv1 nv2
1:  a  A 0.8403809 0.6713090   1   1
2:  b  A 0.4491883 0.4607716   2   2
3:  c  A 0.6083939 1.2031960   3   3
4:  d  A 1.5510033 1.2945761   4   4
5:  e  A 1.1302971 0.8573135   5   5
6:  f  B 1.4964821 0.5133297   6   6
I have worked out two different ways of doing it, but one is horridly inflexible and unwieldy, and one is horridly slow. Will post tomorrow if no one has come up with something better by then.
As always with this sort of programmatic use of [.data.table, the general strategy is to construct an expression e that can be evaluated in the j argument. Once you understand that (as I'm sure you do), it just becomes a game of computing on the language to get a j-slot expression that looks like what you'd write at the command line.
Here, for instance, and given the particular values in your example, you'd like a call that looks like:
test[, list(dat1=sum(dat1), dat2=sum(dat2), nv1=nv1[1], nv2=nv2[1]),
by=c("g1", "g2")]
so the expression you'd like evaluated in the j-slot is
list(dat1=sum(dat1), dat2=sum(dat2), nv1=nv1[1], nv2=nv2[1])
Most of the following function is taken up with constructing just that expression:
f <- function(dt, bycols, datacols, nvcols) {
e <- c(sapply(datacols, function(x) call("sum", as.symbol(x))),
sapply(nvcols, function(x) call("[", as.symbol(x), 1)))
e <- as.call(c(as.symbol("list"), e))
dt[,eval(e), by=bycols]
}
f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
##      g1 g2      dat1      dat2 nv1 nv2
##   1:  a  A 0.8403809 0.6713090   1   1
##   2:  b  A 0.4491883 0.4607716   2   2
##   3:  c  A 0.6083939 1.2031960   3   3
##   4:  d  A 1.5510033 1.2945761   4   4
##   5:  e  A 1.1302971 0.8573135   5   5
##  ---
## 126:  v  Z 0.5627018 0.4282380 126 126
## 127:  w  Z 0.7588966 1.4429034 127 127
## 128:  x  Z 0.7060596 1.3736510 128 128
## 129:  y  Z 0.6015249 0.4488285 129 129
## 130:  z  Z 1.5304034 1.6012207 130 130
Here's what I had come up with. It works, but very slowly.
test[, {
cbind(
as.data.frame( t( sapply( .SD[, ..datacols], sum ) ) ),
.SD[, ..nonvaryingcols][1]
)
}, by = bycols ]
Benchmarks
FunJosh <- function() {
f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
}
FunAri <- function() {
test[, {
cbind(
as.data.frame( t( sapply( .SD[, ..datacols], sum ) ) ),
.SD[, ..nonvaryingcols][1]
)
}, by = bycols ]
}
FunEddi <- function() {
cbind(
test[, lapply(.SD, sum), by = bycols, .SDcols = datacols],
test[, lapply(.SD, "[", 1), by = bycols, .SDcols = nonvaryingcols][, ..nonvaryingcols]
)
}
library(microbenchmark)
identical(FunJosh(), FunAri())
# [1] TRUE
microbenchmark(FunJosh(), FunAri(), FunEddi())
#Unit: milliseconds
#       expr        min         lq     median         uq        max neval
#  FunJosh()   2.749164   2.958478   3.098998   3.470937   6.863933   100
#   FunAri() 246.082760 255.273839 284.485654 360.471469 509.740240   100
#  FunEddi()   5.877494   6.229739   6.528205   7.375939 112.895573   100
At least two orders of magnitude slower than #joshobrien's solution. Edit: #Eddi's solution is much faster as well, and shows that cbind wasn't optimal but could be fairly fast in the right hands. It might be all the transforming and sapplying I was doing, rather than just directly using lapply.
Just for a bit of variety, here is a variant of #Josh O'Brien's solution that uses the bquote operator instead of call. I did try to replace the final as.call with a bquote, but because bquote doesn't support list splicing (e.g., see this question), I couldn't get that to work.
f <- function(dt, bycols, datacols, nvcols) {
datacols = sapply(datacols, as.symbol)
nvcols = sapply(nvcols, as.symbol)
e = c(lapply(datacols, function(x) bquote(sum(.(x)))),
lapply(nvcols, function(x) bquote(.(x)[1])))
e = as.call(c(as.symbol("list"), e))
dt[,eval(e), by=bycols]
}
> f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
     g1 g2   dat1   dat2 nv1 nv2
  1:  a  A 0.8404 0.6713   1   1
  2:  b  A 0.4492 0.4608   2   2
  3:  c  A 0.6084 1.2032   3   3
  4:  d  A 1.5510 1.2946   4   4
  5:  e  A 1.1303 0.8573   5   5
 ---
126:  v  Z 0.5627 0.4282 126 126
127:  w  Z 0.7589 1.4429 127 127
128:  x  Z 0.7061 1.3737 128 128
129:  y  Z 0.6015 0.4488 129 129
130:  z  Z 1.5304 1.6012 130 130
>
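An update beyond the answer above: since R 4.0.0, bquote() does support splicing a computed list into a call, via splice = TRUE and the ..() marker, so the final as.call can in principle be avoided. A sketch, assuming R >= 4.0.0 (f2 is my name for this variant, not from the answer):
f2 <- function(dt, bycols, datacols, nvcols) {
  datacols = sapply(datacols, as.symbol)
  nvcols = sapply(nvcols, as.symbol)
  e = c(lapply(datacols, function(x) bquote(sum(.(x)))),
        lapply(nvcols, function(x) bquote(.(x)[1])))
  # splice the list of calls directly into list(...);
  # the element names become the output column names
  j = bquote(list(..(e)), splice = TRUE)
  dt[, eval(j), by = bycols]
}
f2(test, bycols = bycols, datacols = datacols, nvcols = nonvaryingcols)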

Dynamic column names in data.table

I am trying to add columns to my data.table, where the names are dynamic. In addition, I need to use the by argument when adding these columns. For example:
test_dtb <- data.table(a = sample(1:100, 100), b = sample(1:100, 100), id = rep(1:10,10))
cn <- parse(text = "blah")
test_dtb[ , eval(cn) := mean(a), by = id]
# Error in `[.data.table`(test_dtb, , `:=`(eval(cn), mean(a)), by = id) :
# LHS of := must be a single column name when with=TRUE. When with=FALSE the LHS may be a vector of column names or positions.
Another attempt:
cn <- "blah"
test_dtb[ , cn := mean(a), by = id, with = FALSE]
# Error in `[.data.table`(test_dtb, , `:=`(cn, mean(a)), by = id, with = FALSE) : 'with' must be TRUE when 'by' or 'keyby' is provided
Update from Matthew:
This now works in v1.8.3 on R-Forge. Thanks for highlighting!
See this similar question for new examples:
Assign multiple columns using data.table, by group
From data.table 1.9.4, you can just do this:
## A parenthesized symbol, `(cn)`, gets evaluated to "blah" before `:=` is carried out
test_dtb[, (cn) := mean(a), by = id]
head(test_dtb, 4)
#     a  b id blah
# 1: 41 19  1 54.2
# 2:  4 99  2 50.0
# 3: 49 85  3 46.7
# 4: 61  4  4 57.1
See Details in ?:=:
DT[i, (colvector) := val]
[...] NOW PREFERRED [...] syntax. The parens are enough to stop the LHS being a symbol; same as c(colvector)
Original answer:
You were on exactly the right track: constructing an expression to be evaluated within the call to [.data.table is the data.table way to do this sort of thing. Going just a bit further, why not construct an expression that evaluates to the entire j argument (rather than just its left hand side)?
Something like this should do the trick:
## Your code so far
library(data.table)
test_dtb <- data.table(a=sample(1:100, 100),b=sample(1:100, 100),id=rep(1:10,10))
cn <- "blah"
## One solution
expr <- parse(text = paste0(cn, ":=mean(a)"))
test_dtb[,eval(expr), by=id]
## Checking the result
head(test_dtb, 4)
#     a  b id blah
# 1: 30 26  1 38.4
# 2: 83 82  2 47.4
# 3: 47 66  3 39.5
# 4: 87 23  4 65.2
The expression can also be constructed with bquote.
cn <- "blah"
expr <- bquote(.(as.name(cn)):=mean(a))
test_dtb[,eval(expr), by=id]
I believe renaming afterwards with setnames(DT, old, new) yields the most readable code: compute under a fixed temporary name, then rename to the dynamic one.
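A hypothetical illustration of that renaming idea (the temporary name tmp is mine, not from the answer):
test_dtb[, tmp := mean(a), by = id]
setnames(test_dtb, "tmp", cn)  # cn is "blah", as defined above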
