I'm trying to apply a function to a group of columns in a large data.table without referring to each one individually.
a <- data.table(
a=as.character(rnorm(5)),
b=as.character(rnorm(5)),
c=as.character(rnorm(5)),
d=as.character(rnorm(5))
)
b <- c('a','b','c','d')
with the MWE above, this:
a[,b=as.numeric(b),with=F]
works, but this:
a[,b[2:3]:=data.table(as.numeric(b[2:3])),with=F]
doesn't work. What is the correct way to apply the as.numeric function to just columns 2 and 3 of a without referring to them individually.
(In the actual data set there are tens of columns so it would be impractical)
The idiomatic approach is to use .SD and .SDcols
You can force the RHS to be evaluated in the parent frame by wrapping in ()
a[, (b) := lapply(.SD, as.numeric), .SDcols = b]
For columns 2:3
a[, 2:3 := lapply(.SD, as.numeric), .SDcols = 2:3]
or
mysubset <- 2:3
a[, (mysubset) := lapply(.SD, as.numeric), .SDcols = mysubset]
Related
I have a large data.table with several columns, where some contain values in Cubic Feet.
These are marked by an added "_cft" at the end of the column name. I want to convert the values of these columns to m³ b multiplying them with a constant and returning the updated value.
I can already select the columns and multiply them, but am not able to replace the existing values.
My code looks like follows:
dt <- dt[, lapply(.SD, function(x) x * 0.0283168), .SDcols= grepl("_cft", names(dt))]
This however only returns me the columns I converted to the data.table, but I want to keep all the columns in the original data.table.
I have already tried using the :=operator, but it results in an error:
"Error: unexpected symbol in dt <- dt[, `:=` lapply"
How can I do this?
Note that you should not combine <- with := because the latter works by reference.
Your error message suggests that you did not do the assignment properly. you need to specify the columns you want to assign to. Doing something like
dt[, `:=` lapply(.SD, function(x) x * 0.0283168), .SDcols= grepl("_cft", names(dt))]
will not work, and that's why you got that error message.
Try the following code:
cols = grep("_cft", names(dt))
dt[, (cols) := lapply(.SD, function(x) x * 0.0283168), .SDcols=cols]
# or simply
dt[, (cols) := lapply(.SD, `*`, 0.0283168), .SDcols=cols]
Question 1: line 1 throws an error. Why and how to multiply all columns by DT[i,j]?
Question 2: line 2 works but are there better ways to multiply all other columns by one column?
df=data.table(matrix(1:15,3,5))
df[ , lapply(.SD, function(x) {x*df$V5), .SDcols = c("V1","V2","V3","V4")] #line 1
df[ , lapply(.SD, function(x) {x*df[1,"V5"})}, .SDcols = c("V1","V2","V3","V4")] #line 2
As we are multiplying one column with the rest, either do the multiplication of the Subset of Data.table directly
df[, .SD * V5, .SDcols = V1:V4]
Or with lapply
df[, lapply(.SD, `*`, V5), .SDcols = V1:V4]
Note that in both cases, we are not updating the original dataset columns. For that we need :=
df[, paste0("V", 1:4) := .SD * V5, .SDcols = V1:V4]
In the OP's code, there is a closing } missing in the line 1
df[ , lapply(.SD, function(x) {x*df$V5), .SDcols = c("V1","V2","V3","V4")]
^^
It would be
df[, lapply(.SD, function(x) { x* V5 }), .SDcols = V1:V4]
Here, we don't really need those curlies as well as within the data.table, column names can be referenced as unquoted instead of df$ along with the shortened version of .SDcols where column names are represented as a range (:)
i'm trying to copy a subset of columns from Y to X based on a join, where the subset of columns is dynamic
I can identify the columns quite easily: names(Y)[grep("xxx", names(Y))]
but when i try to use that code in the j expression, it just gives me the column names, not the values of the columns. the .SD and .SDcols gets pretty close, but they only apply to the x expression. I'm trying to do something like this:
X[Y[names(Y)[grep("xxx", names(Y))] := .SD, .SDcols = names(Y)[grep("xxx", names(Y)), on=.(zzz)]
is there an equivalent set of .SD and .SDcols constructs that apply to the i expression? Or, do I need to build up a string for the j expression and eval that string?
Perhaps this will help you get started:
library(data.table)
X <- as.data.table(mtcars[1:5], keep.rownames = "id")
Y <- as.data.table(mtcars, keep.rownames = "id")
cols <- c("gear", "carb")
# copy cols from Y to X based on common "id":
X[Y, (cols) := mget(cols), on = "id"]
As Frank notes in his comment, it might be safer to prefix the column names with i. to ensure the assigned columns are indeed from Y:
X[Y, (cols) := mget(paste0("i.", cols)), on = "id"]
In case of assignment (by reference), with = FALSE can be replaced by LHS in parentheses, (). This nice feature does not work when simply subsetting the column without assignment. Of course there is workarount with .SD/.SDcols or get()/mget(), but it would be nice to subset a column just the same way, with or without assignment.
dt <- data.table(A = 1:3, B = 4:6 )
col <- "A"
cols <- c("A","B")
# assign the old way
dt[, col := 9 , with=FALSE]
dt[, cols := .(9,8), with=FALSE]
# assign the new way
dt[, (col) := 8 ]
dt[, (cols) := .(8,7)]
# But the above syntax does not work for subsetting
dt[, (col)]
dt[, (cols)]
# I know how I can subset col and cols, but that is not the question here,
# e.g.:
dt[, col, with=FALSE]
dt[, cols, with=FALSE]
dt[, .SD, .SDcols=col]
dt[, .SD, .SDcols=cols]
# Below, further (there are even more) types of subsetting but they are not
# the same for col and cols, which is importent for looping where I dont
# know how many cols I call in advance.
dt[, get(col)]
dt[, mget(cols)]
dt[[col]] # Returns a vector, nor running: dt[[cols]]
In other words: if dt[ , (col) := 8] runs, as a naive user I expect df[ , (col)] to run as well. Probably there would be a conflict in [data.table so that cannot be implemented?
I'm trying to apply a function to a group of columns in a large data.table without referring to each one individually.
a <- data.table(
a=as.character(rnorm(5)),
b=as.character(rnorm(5)),
c=as.character(rnorm(5)),
d=as.character(rnorm(5))
)
b <- c('a','b','c','d')
with the MWE above, this:
a[,b=as.numeric(b),with=F]
works, but this:
a[,b[2:3]:=data.table(as.numeric(b[2:3])),with=F]
doesn't work. What is the correct way to apply the as.numeric function to just columns 2 and 3 of a without referring to them individually.
(In the actual data set there are tens of columns so it would be impractical)
The idiomatic approach is to use .SD and .SDcols
You can force the RHS to be evaluated in the parent frame by wrapping in ()
a[, (b) := lapply(.SD, as.numeric), .SDcols = b]
For columns 2:3
a[, 2:3 := lapply(.SD, as.numeric), .SDcols = 2:3]
or
mysubset <- 2:3
a[, (mysubset) := lapply(.SD, as.numeric), .SDcols = mysubset]