Select columns in a data.table using vector of characters - r

I am trying to select some columns from a data.table but getting unexpected results.
For the following, I want to select columns y and z and this works as expected
library(data.table)
dt <- data.table(x=1:4, y=5:8, z=9:6)
dt[, c("y", "z")]
When I try to do this using setdiff it returns nonsense
omit_var <- "x"
dt[, setdiff(c("x","y","z"), omit_var)]
Even though they are equivalent all.equal(setdiff(c("x","y","z"), omit_var), c("y", "z"))
Why is this happening -- I guessing a scoping issue but can I avoid it while keeping the code similar?
(I realise I can do i <- setdiff(c("x","y","z"), omit); dt[,..i])

Following comments from #Roland: "If you call a function in j, data.table can't know that you want to subset. It assumes that you want the return value of the function.". So can use either of
dt[, mget(setdiff(c("x","y","z"), omit_var))]
dt[, setdiff(c("x","y","z"), omit_var), with = FALSE]
#sindri_baldur gave another alternative using .SDcols
dt[, .SD, .SDcols = setdiff(c("x","y","z"), omit_var)]
And more options from #DavidArenburg
dt[, .SD, .SDcols = -omit_var]
If you want to keep all columns except the one in a character string you can use
dt[, !"x"]
Or you could use the .. operator
cols <- setdiff(names(dt), omit_var)
dt[, ..cols]

Related

How to select data.table columns by partial string match and update them by a constant multiplication?

I have a large data.table with several columns, where some contain values in Cubic Feet.
These are marked by an added "_cft" at the end of the column name. I want to convert the values of these columns to m³ b multiplying them with a constant and returning the updated value.
I can already select the columns and multiply them, but am not able to replace the existing values.
My code looks like follows:
dt <- dt[, lapply(.SD, function(x) x * 0.0283168), .SDcols= grepl("_cft", names(dt))]
This however only returns me the columns I converted to the data.table, but I want to keep all the columns in the original data.table.
I have already tried using the :=operator, but it results in an error:
"Error: unexpected symbol in dt <- dt[, `:=` lapply"
How can I do this?
Note that you should not combine <- with := because the latter works by reference.
Your error message suggests that you did not do the assignment properly. you need to specify the columns you want to assign to. Doing something like
dt[, `:=` lapply(.SD, function(x) x * 0.0283168), .SDcols= grepl("_cft", names(dt))]
will not work, and that's why you got that error message.
Try the following code:
cols = grep("_cft", names(dt))
dt[, (cols) := lapply(.SD, function(x) x * 0.0283168), .SDcols=cols]
# or simply
dt[, (cols) := lapply(.SD, `*`, 0.0283168), .SDcols=cols]

.SD and .SDcols for the i expression in data.table join

i'm trying to copy a subset of columns from Y to X based on a join, where the subset of columns is dynamic
I can identify the columns quite easily: names(Y)[grep("xxx", names(Y))]
but when i try to use that code in the j expression, it just gives me the column names, not the values of the columns. the .SD and .SDcols gets pretty close, but they only apply to the x expression. I'm trying to do something like this:
X[Y[names(Y)[grep("xxx", names(Y))] := .SD, .SDcols = names(Y)[grep("xxx", names(Y)), on=.(zzz)]
is there an equivalent set of .SD and .SDcols constructs that apply to the i expression? Or, do I need to build up a string for the j expression and eval that string?
Perhaps this will help you get started:
library(data.table)
X <- as.data.table(mtcars[1:5], keep.rownames = "id")
Y <- as.data.table(mtcars, keep.rownames = "id")
cols <- c("gear", "carb")
# copy cols from Y to X based on common "id":
X[Y, (cols) := mget(cols), on = "id"]
As Frank notes in his comment, it might be safer to prefix the column names with i. to ensure the assigned columns are indeed from Y:
X[Y, (cols) := mget(paste0("i.", cols)), on = "id"]

using mean with .SD and .SDcols in data.table

I am writing a very simple function to summarize columns of data.tables. I am passing one column at a time to the function, and then doing some diagnostics to figure out the options for summarization, and then doing the summarization. I am doing this in data.table to allow for some very large datasets.
So, I am using .SDcols to pass in the column to summarize, and using functions on .SD in the j part of a data.table expression. Since I am passing in one column at a time, I am not using lapply. And what I am finding is that some functions work and others do not. Below is a test dataset I am working with and the results I see:
dt <- data.table(
a=1:10,
b=as.factor(letters[1:10]),
c=c(TRUE, FALSE),
d=runif(10, 0.5, 100),
e=c(0,1),
f=as.integer(c(0,1)),
g=as.numeric(1:10),
h=c("cat1", "cat2", "cat3", "cat4", "cat5"))
mean(dt$a)
[1] 5.5
dt[, mean(.SD), .SDcols = "a"]
[1] NA
Warning message:
In mean.default(.SD) : argument is not numeric or logical: returning NA
dt[, sum(.SD), .SDcols = "a"]
[1] 55
dt[, max(.SD), .SDcols = "a"]
[1] 10
dt[, colMeans(.SD), .SDcols = "a"]
a
5.5
dt[, lapply(.SD, mean), .SDcols = "a"]
a
1: 5.5
Interestingly, weighted.mean gives the wrong answer (55, the sum) when I use weighted.mean(.SD) in j. But when I use lapply(.SD, weighted.mean) in j, it gives the right answer (5.5, the mean).
I tried turning off data.table optimizations to see if it was the internal data.table mean function, but that didn't change things.
Maybe this is just a problem with using mean() on a list (which seems to be what .SD returns)? I guess there is never a reason to NOT use the lapply paradigm with .SD? It seems that only the lapply option returns a data.table. The others seem to return vectors, except for colMeans which is returning something else (list?).
My main question is why mean(.SD) does not work. And the corollary is whether .SD can be used in the absence of one of the apply functions.
Thanks.
I think the appropriate way of approaching what you want is to just use the standard syntax:
dt[ , lapply(.SD, mean), .SDcols = "a"]
Alternatively, you can pass a variable by name as follows:
col_to_pass = "a"
dt[ , mean(get(col_to_pass)) ]
Eventually, you can generalized this approach to multiple columns as follows:
col_to_pass = c("a", "d")
dt[ , lapply( mget(col_to_pass), mean) ]

Updating data.table with get(x) [duplicate]

I'm trying to apply a function to a group of columns in a large data.table without referring to each one individually.
a <- data.table(
a=as.character(rnorm(5)),
b=as.character(rnorm(5)),
c=as.character(rnorm(5)),
d=as.character(rnorm(5))
)
b <- c('a','b','c','d')
with the MWE above, this:
a[,b=as.numeric(b),with=F]
works, but this:
a[,b[2:3]:=data.table(as.numeric(b[2:3])),with=F]
doesn't work. What is the correct way to apply the as.numeric function to just columns 2 and 3 of a without referring to them individually.
(In the actual data set there are tens of columns so it would be impractical)
The idiomatic approach is to use .SD and .SDcols
You can force the RHS to be evaluated in the parent frame by wrapping in ()
a[, (b) := lapply(.SD, as.numeric), .SDcols = b]
For columns 2:3
a[, 2:3 := lapply(.SD, as.numeric), .SDcols = 2:3]
or
mysubset <- 2:3
a[, (mysubset) := lapply(.SD, as.numeric), .SDcols = mysubset]

Apply a function to a subset of data.table columns, by column-indices instead of name

I'm trying to apply a function to a group of columns in a large data.table without referring to each one individually.
a <- data.table(
a=as.character(rnorm(5)),
b=as.character(rnorm(5)),
c=as.character(rnorm(5)),
d=as.character(rnorm(5))
)
b <- c('a','b','c','d')
with the MWE above, this:
a[,b=as.numeric(b),with=F]
works, but this:
a[,b[2:3]:=data.table(as.numeric(b[2:3])),with=F]
doesn't work. What is the correct way to apply the as.numeric function to just columns 2 and 3 of a without referring to them individually.
(In the actual data set there are tens of columns so it would be impractical)
The idiomatic approach is to use .SD and .SDcols
You can force the RHS to be evaluated in the parent frame by wrapping in ()
a[, (b) := lapply(.SD, as.numeric), .SDcols = b]
For columns 2:3
a[, 2:3 := lapply(.SD, as.numeric), .SDcols = 2:3]
or
mysubset <- 2:3
a[, (mysubset) := lapply(.SD, as.numeric), .SDcols = mysubset]

Resources