I am trying to add columns to my data.table, where the names are dynamic. In addition, I need to use the by argument when adding these columns. For example:
test_dtb <- data.table(a = sample(1:100, 100), b = sample(1:100, 100), id = rep(1:10,10))
cn <- parse(text = "blah")
test_dtb[ , eval(cn) := mean(a), by = id]
# Error in `[.data.table`(test_dtb, , `:=`(eval(cn), mean(a)), by = id) :
# LHS of := must be a single column name when with=TRUE. When with=FALSE the LHS may be a vector of column names or positions.
Another attempt:
cn <- "blah"
test_dtb[ , cn := mean(a), by = id, with = FALSE]
# Error in `[.data.table`(test_dtb, , `:=`(cn, mean(a)), by = id, with = FALSE) : 'with' must be TRUE when 'by' or 'keyby' is provided
Update from Matthew:
This now works in v1.8.3 on R-Forge. Thanks for highlighting!
See this similar question for new examples:
Assign multiple columns using data.table, by group
From data.table 1.9.4, you can just do this:
## A parenthesized symbol, `(cn)`, gets evaluated to "blah" before `:=` is carried out
test_dtb[, (cn) := mean(a), by = id]
head(test_dtb, 4)
# a b id blah
# 1: 41 19 1 54.2
# 2: 4 99 2 50.0
# 3: 49 85 3 46.7
# 4: 61 4 4 57.1
See Details in ?:=:
DT[i, (colvector) := val]
[...] NOW PREFERRED [...] syntax. The parens are enough to stop the LHS being a symbol; same as c(colvector)
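The parenthesized form also accepts a character vector of several names. Here is a minimal sketch (the cols vector and the column names mean_a, mean_b are made up for illustration):
cols <- c("mean_a", "mean_b")
test_dtb[, (cols) := .(mean(a), mean(b)), by = id]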
Original answer:
You were on exactly the right track: constructing an expression to be evaluated within the call to [.data.table is the data.table way to do this sort of thing. Going just a bit further, why not construct an expression that evaluates to the entire j argument (rather than just its left hand side)?
Something like this should do the trick:
## Your code so far
library(data.table)
test_dtb <- data.table(a=sample(1:100, 100),b=sample(1:100, 100),id=rep(1:10,10))
cn <- "blah"
## One solution
expr <- parse(text = paste0(cn, ":=mean(a)"))
test_dtb[,eval(expr), by=id]
## Checking the result
head(test_dtb, 4)
# a b id blah
# 1: 30 26 1 38.4
# 2: 83 82 2 47.4
# 3: 47 66 3 39.5
# 4: 87 23 4 65.2
The expression can also be constructed with bquote:
cn <- "blah"
expr <- bquote(.(as.name(cn)):=mean(a))
test_dtb[,eval(expr), by=id]
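The same bquote idea extends to a dynamic column name on the right-hand side as well; a small sketch, where valcol is just an illustrative variable:
valcol <- "a"
expr <- bquote(.(as.name(cn)) := mean(.(as.name(valcol))))
test_dtb[, eval(expr), by = id]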
I believe setnames(DT, c(col.names)) yields the most readable code
Related
I want to use a list external to my data.table to determine what a new column of data should be in that data.table; in this case, the new column should hold the length of the list element indicated by an attribute of the data.table.
# dummy list. I am interested in extracting the vector length of each list element
l <- list(a=c(3,5,6,32,4), b=c(34,5,6,34,2,4,6,7), c = c(3,4,5))
# dummy dt; the number after the underscore in Attri2 indicates which element of the list I want the length of
dt <- data.table(Attri1 = c("t","y","h","g","d","e","d"),
Attri2 = c("fghd_1","sdafsf_3","ser_1","fggx_2","sada_2","sfesf_3","asdas_2"))
# extract that number to a new attribute, just for clarity
dt[, list_gp := tstrsplit(Attri2, "_", fixed=TRUE, keep=2)]
# then calculate the lengths of the vectors in the list, and attempt to subset by the index taken above
dt[, list_len := '[['(lapply(l, length), list_gp)]
Error in lapply(l, length)[[list_gp]] : no such index at level 1
I envisaged the list_len column to be 5,3,5,8,8,3,8
A couple of things:
tstrsplit gives you a string, so convert it to a number.
I'm not quite sure about the '[[' construct there; see the proposed solution:
dt[, list_gp := as.numeric( tstrsplit(Attri2, "_", fixed=TRUE, keep=2)[[1]] )]
dt[, list_len := sapply( l[ list_gp ], length ) ]
Output:
> dt
Attri1 Attri2 list_gp list_len
1: t fghd_1 1 5
2: y sdafsf_3 3 3
3: h ser_1 1 5
4: g fggx_2 2 8
5: d sada_2 2 8
6: e sfesf_3 3 3
7: d asdas_2 2 8
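As an aside, assuming l stays as defined above, base R's lengths() gives the element lengths directly, so the second step could also be written as:
dt[, list_len := unname(lengths(l)[list_gp])]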
We would like to summarize a data table to create a lot of new variables that result from the combination of column names and values from the original data.
Here is a reproducible example illustrating the result we would like to achieve, using only two columns for the sake of brevity:
library(data.table)
data('mtcars')
setDT(mtcars)
# Desired output
mtcars[, .(
acm_hp_carb2 = mean(hp[which( carb <= 2)], na.rm=T),
acm_wt_am1 = mean(wt[which( am== 1)], na.rm=T)
), by= .(cyl, gear)]
Because we want to summarize a lot of columns, we created a function that returns all the strings that we would use to create each summary variable. In this example, we have this:
a <- 'acm_hp_carb2 = mean(hp[which( carb <= 2)], na.rm=T)'
b <- 'acm_wt_am1 = mean(wt[which( am== 1)], na.rm=T)'
And here is our failed attempt. Note that the new columns created do not receive the names we want to assign to them.
mtcars[, .(
eval(parse(text=a)),
eval(parse(text=b))
), by= .(cyl, gear)]
It seems like the only part that isn't working is the column names. If you put a and b in a vector and add names to them, you can use lapply to do the eval(parse(...)) and keep the names from the vector. I used regex to get the names, but presumably in your real code you can assign the names from whatever variable you're using to construct the strings in the first place.
The result has many NaNs, but it matches your desired output.
to_make <- c(a, b)
to_make <- setNames(to_make, sub('^(.*) =.*', '\\1', to_make))
mtcars[, lapply(to_make, function(x) eval(parse(text = x)))
, by= .(cyl, gear)]
# cyl gear acm_hp_carb2 acm_wt_am1
# 1: 6 4 NaN 2.747500
# 2: 4 4 76.0 2.114167
# 3: 6 3 107.5 NaN
# 4: 8 3 162.5 NaN
# 5: 4 3 97.0 NaN
# 6: 4 5 102.0 1.826500
# 7: 8 5 NaN 3.370000
# 8: 6 5 NaN 2.770000
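The NaN entries simply come from groups where the subset is empty (no rows with carb <= 2, or none with am == 1), because the mean of a zero-length vector is NaN:
mean(numeric(0), na.rm = TRUE)
# [1] NaN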
You can make one call and eval it:
f = function(...){
ex = parse(text = sprintf(".(%s)", paste(..., sep=", ")))[[1]]
print(ex)
mtcars[, eval(ex), by=.(cyl, gear)]
}
f(a,b)
a2 <- 'acm_hp_carb2 = mean(hp[carb <= 2], na.rm=T)'
b2 <- 'acm_wt_am1 = mean(wt[am == 1], na.rm=T)'
f(a2, b2)
I guess the which() is not needed.
I would like to set the first and the last value in a group to NA. Here is an example:
DT <- data.table(v = rnorm(12), class=rep(1:3, each=4))
DT[, v[c(1,.N)] := NA , by=class]
But this is not working. How can I do it?
At the moment, the way to go about this would be to first extract the indices, and then do one assignment by reference.
idx = DT[, .(idx = .I[c(1L, .N)]), by=class]$idx
DT[idx, v := NA]
I'll try and add this example to the Reference semantics vignette.
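If you prefer a single expression, the same two steps can be chained (this should be equivalent, just without the intermediate idx variable):
DT[DT[, .I[c(1L, .N)], by = class]$V1, v := NA]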
This may not be a one-liner, but it does have 'first' and 'last' in the code :)
> DT <- data.table(v = rnorm(12), class=rep(1:3, each=4))
> setkey(DT, class)
> classes = DT[, .(unique(class))]
> DT[classes, v := NA, mult='first']
> DT[classes, v := NA, mult='last']
> DT
v class
1: NA 1
2: -1.8191 1
3: -0.6355 1
4: NA 1
5: NA 2
6: -1.1771 2
7: -0.8125 2
8: NA 2
9: NA 3
10: 0.2357 3
11: 0.3416 3
12: NA 3
Order is also preserved for the non-key columns. I think that is a documented (committed to) feature.
With a helper function it's easy:
set.na = function(x,y) {x[y] = NA; x}
DT[, set.na(v,c(1,.N)) , by=class]
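Note that as written this returns a new grouped result rather than modifying DT; to update the column by reference you would combine the helper with :=, along these lines:
DT[, v := set.na(v, c(1, .N)), by = class]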
The canonical way to modify a subset of the data is to use i to define the subset; you cannot subset a column inside j and assign to that subset with := (which is why v[c(1, .N)] := NA fails). Either create a temporary index i as suggested by @David Arenburg, or construct the full outcome vector yourself using a c(NA, v[-c(1, .N)], NA) construction.
DT[, v := c(NA, v[-c(1, .N)], NA)[1:.N], by = class]
However, you should also note that row order can change when you, for example, set a new key or apply any number of functions, so you should be careful with this position-based operation.
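A closely related variant, if you would rather not rebuild the whole vector by hand, is to use replace() for the position-based assignment (a sketch of the same idea):
DT[, v := replace(v, c(1L, .N), NA), by = class]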
I am struggling a bit with my tables. I am trying to split some variables (using R), but I am having difficulties with one specific column.
My dataset is like this:
test<-data.frame(
Chrom_no=c(1,1,2,3),
Region=c('12..13','22..23','100','34..36'),
Ref=c('AT','CG','A','AAA'),
Alt=c('TA','GA','T','CGG'),
Prob=c(99,98.7,99,99.9))
I want to separate all the regions that are grouped together. So far, I have solved it for all the columns but the 'Region' one:
ref2 <- strsplit(as.character(test$Ref), '')
alt2<-strsplit(as.character(test$Alt), '')
test2<-data.frame(
Chrom_no=rep(test$Chrom_no, vapply(ref2, FUN=length, FUN.VALUE=integer(1))),
Region=rep(test$Region, vapply(ref2, FUN=length, FUN.VALUE=integer(1))),
Ref=unlist(ref2),
Alt=unlist(alt2),
Prob=rep(test$Prob, vapply(ref2, FUN=length, FUN.VALUE=integer(1))))
I don't know how to fix that column: e.g. for '12..13', 12 should go with Ref=A and 13 with Ref=T (first and second character, respectively). Things get complicated, as some of the entries have 3 characters (and a corresponding range, e.g. 22..24), and some will have more.
How could I solve this? I have been looking for a solution for the last couple of days, but I am still not sure how to proceed. I apologize if this has already been solved somewhere else.
P.S.: I am aware that in order to strsplit on the 'Region' column I need to use '\\..' as the separator.
If I understand your end goal correctly, you can look into using the "data.table" package. With it, you can set up your problem like the following:
library(data.table)
## Change your data.frame to a data.table
DT <- as.data.table(test)
## Convert the relevant columns to be characters instead of factors
DT[, c("Region", "Ref", "Alt") := lapply(.SD, as.character),
.SDcols = c("Region", "Ref", "Alt")]
DT[, list(Chrom_no = rep(Chrom_no, nchar(Ref)), # Expand the Chrom_no
Region = unlist(lapply( # Split Region and use
strsplit(Region, "..", TRUE), # the result to create
function(x) { # the range of values
x <- as.numeric(x) # needed
if (length(x) > 1) seq(x[1], x[2]) else x
})),
Ref = unlist(strsplit(Ref, "")), # Split Ref
Alt = unlist(strsplit(Alt, "")), # Split Alt
Prob = rep(Prob, nchar(Ref)))] # Expand Prob
# Chrom_no Region Ref Alt Prob
# 1: 1 12 A T 99.0
# 2: 1 13 T A 99.0
# 3: 1 22 C G 98.7
# 4: 1 23 G A 98.7
# 5: 2 100 A T 99.0
# 6: 3 34 A C 99.9
# 7: 3 35 A G 99.9
# 8: 3 36 A G 99.9
The above code can probably be streamlined a bit, but I thought this should be enough to get you started.
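For example, one possible streamlining (just a sketch; the expand_region helper and the temporary row column are illustrative names, not part of the original answer) is to expand row by row:
expand_region <- function(region) {
  x <- as.numeric(strsplit(region, "..", fixed = TRUE)[[1]])
  if (length(x) > 1L) x[1]:x[2] else x
}
DT[, row := .I]   # temporary row id so each row expands independently
DT[, .(Chrom_no = Chrom_no,
       Region   = expand_region(Region),
       Ref      = strsplit(Ref, "")[[1]],
       Alt      = strsplit(Alt, "")[[1]],
       Prob     = Prob),
   by = row][, row := NULL][]
Afterwards, the helper column can be removed from DT itself with DT[, row := NULL] if it is not wanted.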
Suppose I have a data.table
a <- data.table(id=c(1,1,2,2,3),a=21:25,b=11:15,key="id")
I can add new columns like this:
a[, sa := sum(a), by="id"]
a[, sb := sum(b), by="id"]
> a
id a b sa sb
1: 1 21 11 43 23
2: 1 22 12 43 23
3: 2 23 13 47 27
4: 2 24 14 47 27
5: 3 25 15 25 15
However, suppose that I have column names instead:
for (n in c("a","b")) {
s <- paste0("s",n)
a[, s := sum(n), by="id", with=FALSE] # ERROR: invalid 'type' (character) of argument
}
What do I do?
You can also do this:
a <- data.table(id=c(1,1,2,2,3),a=21:25,b=11:15,key="id")
a[, c("sa", "sb") := lapply(.SD, sum), by = id]
Or slightly more generally:
cols.to.sum = c("a", "b")
a[, paste0("s", cols.to.sum) := lapply(.SD, sum), by = id, .SDcols = cols.to.sum]
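The same .SDcols pattern generalizes to other functions; for example, group means under hypothetical m-prefixed names:
a[, paste0("m", cols.to.sum) := lapply(.SD, mean), by = id, .SDcols = cols.to.sum]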
This is similar to:
How to generate a linear combination of variables and update table using data.table in a loop call?
but you want to combine this with by= too, so set() isn't flexible enough. That's a deliberate design decision, and set() is unlikely to change in that regard.
I sometimes use the EVAL helper at the end of that answer: https://stackoverflow.com/a/20808573/403310. Some wince at that approach, but I just think of it like constructing a dynamic SQL statement, which is quite common practice. The EVAL approach gives ultimate flexibility without head-scratching about eval() and quote(). To see the dynamic query that has been constructed (to check it), you can add a print inside your EVAL helper function.
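For reference, such a helper is essentially a thin wrapper around paste0() and eval(parse()); a sketch of the idea (not necessarily the exact code in the linked answer):
EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame())
# e.g. EVAL("a[, s", n, " := sum(", n, "), by = id]")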
However, in this simple example you can wrap the LHS of := in parentheses to tell data.table to look up the value (clearer than with=FALSE), and the RHS needs a get().
for (n in c("a","b")) {
s <- paste0("s",n)
a[, (s) := sum(get(n)), by="id"]
}
Edit 2020-02-15, regarding the .. prefix:
data.table also supports the .. syntax to "look up a level", obviating the need for with=FALSE in most cases, e.g. dt[, ..n1] and dt[, ..n2] in the answer below.
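A minimal illustration of the .. prefix (the variable n1 and the toy table are just for demonstration):
n1 <- "a"
dt <- data.table(id = 1:5, a = 21:25, b = 11:15)
dt[, ..n1]   # a one-column data.table; same as dt[, "a", with = FALSE]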
Have a look at with in ?data.table:
dt <- data.table(id=1:5, a=21:25, b=11:15, key="id")
n1 <- "a"; n2 <- "b"; n3 <- "c"   # variables holding the column names
dt[, n3 := dt[, n1, with = FALSE] * dt[, n2, with = FALSE], with = FALSE]
EDIT:
Or you can just change the column names back and forth:
dt <- data.table(id=1:5, a=21:25, b=11:15, key="id")
dt.names <- c(n1 = "a", n2 = "b", n3 = "c")
dt[, dt.names["n3"] := 1L, with = FALSE]   # create the new column under its real name, "c"
setnames(dt, dt.names, names(dt.names))    # temporarily rename to the generic names
dt[, n3 := n1 * n2, by = "id"]
setnames(dt, names(dt.names), dt.names)    # rename back
which works together with by.
Here is an approach that does the call mangling and avoids any overhead from .SD:
# a helper function
makeCall <- function(x,fun) bquote(.(fun)(.(x)))
# the columns you wish to sum (apply function to)
cols <- c('a','b')
new.cols <- paste0('s',cols)
# create a named list of column symbols
name.cols <- setNames(sapply(cols,as.name), new.cols)
# create the call
my_call <- as.call(c(as.name(':='), lapply(name.cols, makeCall, fun = as.name('sum'))))
(a[, eval(my_call), by = 'id'])
# id a b sa sb
# 1: 1 21 11 43 23
# 2: 1 22 12 43 23
# 3: 2 23 13 47 27
# 4: 2 24 14 47 27
# 5: 3 25 15 25 15
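Printing the constructed call before evaluating it is a handy sanity check; the output should look roughly like this:
print(my_call)
# `:=`(sa = sum(a), sb = sum(b))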