Pass character vectors and column names to data.table as a list of columns?
I want to be able to produce a subset of columns in R using data.table in a way that I can determine some of them earlier on and pass the predetermined list on as a character vector, then combine with a static list of columns.
That is, given this:
a <- 1:4
b <- 5:8
c <- c('aa','bb','cc','dd')
e <- 1:4
z <- data.table(a,b,c,e)
I want to do this:
z[, list(a,b)]
Which produces this output:
a b
1: 1 5
2: 2 6
3: 3 7
4: 4 8
But I want to do it in some way similar to this (which works, almost):
cols <- "b"
z[, list(get(cols), a)]
Results:
Note that it doesn't return the name of the column stored in cols
V1 a
1: 5 1
2: 6 2
3: 7 3
4: 8 4
but I need to do it with more than one element of cols (which does not work):
cols <- c('a', 'b')
z[, list(mget(cols), c)]
The above produces the following error:
Error: value for ‘a’ not found
I think my problem lies with scoping and which environments mget is looking in, but I can't figure out what exactly I am doing wrong. Also, how do I preserve the column titles?
Here are two (pretty much equivalent) options. One using lapply:
z[, c(lapply(cols, get), list(c))]
# V1 V2 V3
#1: 1 5 aa
#2: 2 6 bb
#3: 3 7 cc
#4: 4 8 dd
And one using mget:
z[, c(mget(cols, inherits = TRUE), c = list(c))]
# a b c
#1: 1 5 aa
#2: 2 6 bb
#3: 3 7 cc
#4: 4 8 dd
Note that get returns a vector which loses the information about column name (and there isn't much you can do about it besides manually adding it back in), while mget returns a named list.
Attempting to mix standard and non-standard evaluation within a single call will probably end in tears / frustration / obfuscated code.
There are a number of options in data.table
Use .. notation to "look up one level" to find the vector of column names
cols <- c('a','b')
z[, ..cols]
Use .SDcols
z[, .SD, .SDcols = cols]
But if you really want to combine the two ways of referencing, then you can use something like (introducing another option, with=FALSE, which allows more general expressions for column names than a simple vector)
ll <- function(char = NULL, uneval = NULL) {
  Call <- match.call()
  # turn the unevaluated column symbol(s) into character names
  cols <- lapply(Call$uneval, as.character)
  unlist(c(char, cols))
}
z[, ll(cols,c), with=FALSE]
# a b c
# 1: 1 5 aa
# 2: 2 6 bb
# 3: 3 7 cc
# 4: 4 8 dd
z[, ll(char=cols), with=FALSE]
# a b
# 1: 1 5
# 2: 2 6
# 3: 3 7
# 4: 4 8
z[, ll(uneval=c), with=FALSE]
# c
# 1: aa
# 2: bb
# 3: cc
# 4: dd
Combining a variable with column names with hard-coded column names in data.table
Given z and cols from the example above:
To combine the column names stored in the variable cols with a hard-coded column name c, we build a new character vector c(..cols, 'c') inside the call to data.table. The "up-one-level" prefix .. lets us refer to cols from within j (the second argument within []):
z[, c(..cols, 'c')]
Thank you to @thelatemail for providing the basis for the solution above.
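For comparison, here is a minimal sketch of two other ways to select the same combined set of columns, using with = FALSE and .SDcols. The variable name all_cols is only illustrative, and z is rebuilt here so the snippet runs on its own.

```r
library(data.table)

z <- data.table(a = 1:4, b = 5:8, c = c("aa", "bb", "cc", "dd"), e = 1:4)

cols <- c("a", "b")         # names determined earlier
all_cols <- c(cols, "c")    # plus one hard-coded name (illustrative variable)

# with = FALSE tells data.table to treat j as a vector of column names
by_with <- z[, all_cols, with = FALSE]

# .SDcols restricts .SD to the named columns
by_sdcols <- z[, .SD, .SDcols = all_cols]

names(by_with)
names(by_sdcols)
```

Both calls keep the original column names, because the selection is made by name rather than by evaluating each column as an expression.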
I am trying to find the data.table equivalent of select(one_of(...)), where a select statement is passed and if some columns do not exist in the data.table, they are skipped, instead of producing an error and failing.
The current method I know of selecting columns using data.table syntax would be df[, ..select_cols], where select_cols is a character vector containing the names of the columns I want to select. However, this method leads to an error message when one of the column names does not exist in the data.table.
Any advice on a data.table statement that would just skip columns that don't exist in the data.table would be really appreciated!
You can use intersect to keep the columns which are available in data.table.
library(data.table)
dt <- data.table(a = 1:5, b = 2:6)
select_cols <- function(dt, cols) {
cols <- intersect(names(dt), cols)
dt[, ..cols]
}
select_cols(dt, c('a', 'b'))
# a b
#1: 1 2
#2: 2 3
#3: 3 4
#4: 4 5
#5: 5 6
select_cols(dt, c('a', 'c'))
# a
#1: 1
#2: 2
#3: 3
#4: 4
#5: 5
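One detail that may matter: intersect returns its result in the order of its first argument, so swapping the arguments changes the column order of the result. The sketch below assumes a small example table and an illustrative vector wanted that includes a column name which does not exist.

```r
library(data.table)

dt <- data.table(a = 1:5, b = 2:6, c = 3:7)
wanted <- c("c", "zz", "a")   # "zz" is deliberately not a column of dt

# Order follows the table: columns come back as they appear in dt
table_order <- intersect(names(dt), wanted)

# Order follows the request: columns come back as listed in `wanted`
request_order <- intersect(wanted, names(dt))

res_table <- dt[, table_order, with = FALSE]
res_request <- dt[, request_order, with = FALSE]

names(res_table)
names(res_request)
```

Which order is preferable depends on whether the caller expects the table's layout or the layout of the requested vector.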
I'm trying to pass two separate lists of variable names into a data.table (v1.9.4). It returns the correct columns, but it strips the variable names. This works as expected:
dt <- data.table(a=1:3, b=4:6, c=7:9, d=10:12)
dt
a b c d
1: 1 4 7 10
2: 2 5 8 11
3: 3 6 9 12
It also works fine to pass a single list of names:
dt[,list(a,b)]
a b
1: 1 4
2: 2 5
3: 3 6
But when I need to pass multiple lists, it returns the correct columns but strips the variable names:
dt[,c(list(a,b), list(c,d))]
V1 V2 V3 V4
1: 1 4 7 10
2: 2 5 8 11
3: 3 6 9 12
Why two lists? I'm using multiple quote()'d lists of variables. I've read FAQ question 1.6, and I know that one workaround is to use a character vector using with=FALSE. But my real use case involves passing a mix of names and expressions to a function, e.g.,
varnames <- quote(list(a,b))
expr <- quote(list(a*b, c+d))
function(dt, varnames, expr) {
dt[,c(varnames, expr)]
}
And I'd like the "varnames" columns to have their proper names, as they do if you just pass a single list:
dt[,list(a,b,a*b,c+d)]
a b V3 V4
1: 1 4 4 17
2: 2 5 10 19
3: 3 6 18 21
How can I combine multiple lists in a data.table such that it still returns the proper column names? (I'm not completely sure if this is a data.table issue or if I'm just doing something silly in the way I'm trying to combine lists in R, but c() seems to do what I want.)
Another option is to construct the full call ahead of time:
varnames[4:5] <- expr[2:3] # this results in `list(a, b, a * b, c + d)`
dt[, eval(varnames)]
produces:
a b V3 V4
1: 1 4 4 17
2: 2 5 10 19
3: 3 6 18 21
More generically, suppose you have a list of quoted lists of expressions:
exprlist <- list(quote(list(a, b)), quote(list(c, c %% a)), quote(list(a + b)))
expr <- as.call(Reduce(function(x, y) c(as.list(x), as.list(y)[-1]), exprlist)) # credit: @eddi
dt[, eval(expr)]
Here's a possible workaround using .SD
varnames <- quote(list(a,b))
expr <- quote(list(a*b, c+d))
myFunc <- function(dt, varnames, expr) {
dt[, c(.SD[, eval(varnames)], eval(expr))]
}
myFunc(dt, varnames, expr)
# a b V1 V2
# 1: 1 4 4 17
# 2: 2 5 10 19
# 3: 3 6 18 21
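If the expression columns should also get meaningful names instead of V3 and V4, one possibility is to name them inside the quoted list and merge the quoted calls before evaluating. This is only a sketch: combine_quoted is a hypothetical helper written for this illustration, not part of data.table, and the names product and total are made up.

```r
library(data.table)

dt <- data.table(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Hypothetical helper: merge several quoted list(...) calls into one call,
# keeping any argument names that were supplied.
combine_quoted <- function(...) {
  parts <- list(...)
  # drop the leading `list` symbol of each call and pool the arguments
  args <- do.call(c, lapply(parts, function(p) as.list(p)[-1]))
  as.call(c(list(as.name("list")), args))
}

varnames <- quote(list(a, b))
expr <- quote(list(product = a * b, total = c + d))

full <- combine_quoted(varnames, expr)
res <- dt[, eval(full)]
names(res)
```

The plain symbols a and b keep their own names, and the named expressions carry the names given in the quoted list.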
I'd like to order a data.table by a variable holding the name of a column:
I've tried every combination of eval, get and c without success:
I have colVar = "someColumnName"
I'd like to apply this to: DT[order(colVar)]
data.table has dedicated functions for this, which modify your data set by reference instead of copying it to a new object.
You can either use setkey or (in versions >= 1.9.4) setorder, which can also sort in decreasing order.
Note the difference between setkey vs. setkeyv and setorder vs. setorderv. The v suffix means you can pass either a quoted column name or a variable containing one.
Using @andrewzm's data set
dtbl
# x y
# 1: 1 5
# 2: 2 4
# 3: 3 3
# 4: 4 2
# 5: 5 1
setorderv(dtbl, colVar)[] # or `setkeyv(dtbl, colVar)[]` or `setorderv(dtbl, "y")[]`
# x y
# 1: 5 1
# 2: 4 2
# 3: 3 3
# 4: 2 4
# 5: 1 5
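Because setorderv reorders the table in place, a small sketch of sorting in decreasing order on a copy may be useful when the original order must be kept. The data values and the name sorted are assumptions for illustration.

```r
library(data.table)

dtbl <- data.table(x = 1:5, y = c(3L, 1L, 5L, 2L, 4L))
colVar <- "y"

sorted <- copy(dtbl)                     # work on a copy, leave dtbl untouched
setorderv(sorted, colVar, order = -1L)   # order = -1L sorts in decreasing order

sorted
dtbl
```

Without copy(), the assignment would only create a second reference to the same table, and dtbl would be reordered as well.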
You can use double brackets for data tables:
library(data.table)
dtbl <- data.table(x = 1:5, y = 5:1)
colVar = "y"
dtbl_sorted <- dtbl[order(dtbl[[colVar]])]
dtbl_sorted
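The same non-modifying idea can be extended to several sort columns held in a character vector. This is a sketch under the assumption that the column names live in a variable called sortCols; the example data is made up.

```r
library(data.table)

dtbl <- data.table(g = c("b", "a", "b", "a"), y = c(2L, 4L, 1L, 3L))
sortCols <- c("g", "y")   # illustrative: sort by g, then by y

# Pull the sort columns out as an unnamed list so they are passed to
# order() positionally rather than as named arguments
key_columns <- unname(as.list(dtbl[, sortCols, with = FALSE]))
row_order <- do.call(order, key_columns)

dtbl_sorted <- dtbl[row_order]
dtbl_sorted
```

This returns a new sorted table and leaves dtbl as it was.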
I have the following data.table.
ts,id
1,a
2,a
3,a
4,a
5,a
6,a
7,a
1,b
2,b
3,b
4,b
I want to split this data.table into two. The criterion is to have approximately the first half of each group (in this case defined by the column "id") in one data.table and the remainder in another data.table. So the expected result is two data.tables as follows
ts,id
1,a
2,a
3,a
4,a
1,b
2,b
and
ts,id
5,a
6,a
7,a
3,b
4,b
I tried the following,
z1 = x[,.SD[.I < .N/2,],by=id]
z1
and got just the following
id ts
a 1
a 2
a 3
Somehow, .I within the .SD isn't working the way I think it should. Any help appreciated.
Thanks in advance.
.I gives the row locations with respect to the whole data.table. Thus it can't be used like that within .SD.
Something like
DT[, subset := seq_len(.N) > .N/2,by='id']
subset1 <- DT[(subset)][,subset:=NULL]
subset2 <- DT[!(subset)][,subset:=NULL]
subset1
# ts id
# 1: 4 a
# 2: 5 a
# 3: 6 a
# 4: 7 a
# 5: 3 b
# 6: 4 b
subset2
# ts id
# 1: 1 a
# 2: 2 a
# 3: 3 a
# 4: 1 b
# 5: 2 b
should work.
For more than 2 groups, you could use cut to create a factor with the appropriate number of levels
Something like
DT[, subset := cut(seq_len(.N), 3, labels= FALSE),by='id']
# you could copy to the global environment a subset for each, but this
# will not be memory efficient!
list2env(setattr(split(DT, DT[['subset']]),'names', paste0('s',1:3)), .GlobalEnv)
Here's the corrected version of your expression:
dt[, .SD[, .SD[.I <= .N/2]], by = id]
# id ts
#1: a 1
#2: a 2
#3: a 3
#4: b 1
#5: b 2
The reason yours does not work is that .I and .N are not available in the i-expression (i.e. the first argument of [), so the parent data.table's .I and .N are used (i.e. dt's).
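Since .I indexes rows of the whole table, it can also be used to collect the row numbers of each group's first half and then split on them, without adding a helper column. The sketch below rebuilds the example data and uses ceiling(.N / 2) as an assumption about what "approximately the first half" means, which happens to reproduce the two tables shown in the question.

```r
library(data.table)

dt <- data.table(ts = c(1:7, 1:4), id = rep(c("a", "b"), times = c(7, 4)))

# For each id, keep the global row numbers (.I) of the first ceiling(.N / 2) rows
first_rows <- dt[, .I[seq_len(.N) <= ceiling(.N / 2)], by = id]$V1

first_half <- dt[first_rows]     # rows in the first half of each group
second_half <- dt[-first_rows]   # everything else

first_half
second_half
```

Here seq_len(.N) is the within-group position, while .I supplies the matching row numbers in dt, so the two pieces never overlap.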