Question 1: line 1 throws an error. Why and how to multiply all columns by DT[i,j]?
Question 2: line 2 works but are there better ways to multiply all other columns by one column?
df=data.table(matrix(1:15,3,5))
df[ , lapply(.SD, function(x) {x*df$V5), .SDcols = c("V1","V2","V3","V4")] #line 1
df[ , lapply(.SD, function(x) {x*df[1,"V5"})}, .SDcols = c("V1","V2","V3","V4")] #line 2
As we are multiplying one column with the rest, either do the multiplication of the Subset of Data.table directly
df[, .SD * V5, .SDcols = V1:V4]
Or with lapply
df[, lapply(.SD, `*`, V5), .SDcols = V1:V4]
Note that in both cases, we are not updating the original dataset columns. For that we need :=
df[, paste0("V", 1:4) := .SD * V5, .SDcols = V1:V4]
In the OP's code, there is a closing } missing in the line 1
df[ , lapply(.SD, function(x) {x*df$V5), .SDcols = c("V1","V2","V3","V4")]
^^
It would be
df[, lapply(.SD, function(x) { x* V5 }), .SDcols = V1:V4]
Here, we don't really need those curlies as well as within the data.table, column names can be referenced as unquoted instead of df$ along with the shortened version of .SDcols where column names are represented as a range (:)
Related
Often, I want to manipulate several variables in a DT and I need to select the column names based on their names or class.
d <- data.table(x = 1:10, y= letters[1:10])
# My usual approach
col <- str_subset(names(d), '^x')
d[, (col) := 2:11]
However, it would be very useful and less verbose to do this:
d[, (names(.SD)) := 2:11, .SDcols = patterns('^x')]
But this throws an error:
Error in `[.data.table`(d, , `:=`((names(.SD)), 2:11), .SDcols = patterns("^x")) :
LHS of := isn't column names ('character') or positions ('integer' or 'numeric')
>
The column names of .SD are available, though:
> d[, names(.SD), .SDcols = patterns('^x')]
[1] "x"
Why aren't the names of .SD available for assignment on the LHS of :=?
As noted this is not yet possible. The workaround only adds one line of code:
cols = grep('^x', names(d))
d[ , (cols) := 2:11, .SDcols = cols]
In case of assignment (by reference), with = FALSE can be replaced by LHS in parentheses, (). This nice feature does not work when simply subsetting the column without assignment. Of course there is workarount with .SD/.SDcols or get()/mget(), but it would be nice to subset a column just the same way, with or without assignment.
dt <- data.table(A = 1:3, B = 4:6 )
col <- "A"
cols <- c("A","B")
# assign the old way
dt[, col := 9 , with=FALSE]
dt[, cols := .(9,8), with=FALSE]
# assign the new way
dt[, (col) := 8 ]
dt[, (cols) := .(8,7)]
# But the above syntax does not work for subsetting
dt[, (col)]
dt[, (cols)]
# I know how I can subset col and cols, but that is not the question here,
# e.g.:
dt[, col, with=FALSE]
dt[, cols, with=FALSE]
dt[, .SD, .SDcols=col]
dt[, .SD, .SDcols=cols]
# Below, further (there are even more) types of subsetting but they are not
# the same for col and cols, which is importent for looping where I dont
# know how many cols I call in advance.
dt[, get(col)]
dt[, mget(cols)]
dt[[col]] # Returns a vector, nor running: dt[[cols]]
In other words: if dt[ , (col) := 8] runs, as a naive user I expect df[ , (col)] to run as well. Probably there would be a conflict in [data.table so that cannot be implemented?
I'm trying to apply a function to a group of columns in a large data.table without referring to each one individually.
a <- data.table(
a=as.character(rnorm(5)),
b=as.character(rnorm(5)),
c=as.character(rnorm(5)),
d=as.character(rnorm(5))
)
b <- c('a','b','c','d')
with the MWE above, this:
a[,b=as.numeric(b),with=F]
works, but this:
a[,b[2:3]:=data.table(as.numeric(b[2:3])),with=F]
doesn't work. What is the correct way to apply the as.numeric function to just columns 2 and 3 of a without referring to them individually.
(In the actual data set there are tens of columns so it would be impractical)
The idiomatic approach is to use .SD and .SDcols
You can force the RHS to be evaluated in the parent frame by wrapping in ()
a[, (b) := lapply(.SD, as.numeric), .SDcols = b]
For columns 2:3
a[, 2:3 := lapply(.SD, as.numeric), .SDcols = 2:3]
or
mysubset <- 2:3
a[, (mysubset) := lapply(.SD, as.numeric), .SDcols = mysubset]
Suppose I have a data.table as follows -:
data = data.table(c("a","a","b","b","c"),c(1,2,3,4,5))
I would like to sum the numeric vector, only when the factor vector has more than one entry.
The problem I have will require the use of .SD. I understand that I could create a N field via
data[ , N := .N, by = V1]
and then sum via
data[N > 1, lapply(.SD,sum), by = V1, .SDcols = 2]
However, is there a one step call to do this?
Referencing .SD in the call doesn't return an answer -
data[, lapply(.SD[which(length(.SD)>1)],sum), by = V1, .SDcols = 2]
I would like to understand why this doesn't work. Neither does -:
data[, lapply(.SD[which(.N>1)],sum), by = V1, .SDcols = 2]
Thanks!
data <- data.table(c("a","a","b","b","c"),c(1,2,3,4,5))
data[, if(.N > 1) lapply(.SD, sum) else NULL, by=V1]
# V1 V2
# 1: a 3
# 2: b 7
I'm trying to apply a function to a group of columns in a large data.table without referring to each one individually.
a <- data.table(
a=as.character(rnorm(5)),
b=as.character(rnorm(5)),
c=as.character(rnorm(5)),
d=as.character(rnorm(5))
)
b <- c('a','b','c','d')
with the MWE above, this:
a[,b=as.numeric(b),with=F]
works, but this:
a[,b[2:3]:=data.table(as.numeric(b[2:3])),with=F]
doesn't work. What is the correct way to apply the as.numeric function to just columns 2 and 3 of a without referring to them individually.
(In the actual data set there are tens of columns so it would be impractical)
The idiomatic approach is to use .SD and .SDcols
You can force the RHS to be evaluated in the parent frame by wrapping in ()
a[, (b) := lapply(.SD, as.numeric), .SDcols = b]
For columns 2:3
a[, 2:3 := lapply(.SD, as.numeric), .SDcols = 2:3]
or
mysubset <- 2:3
a[, (mysubset) := lapply(.SD, as.numeric), .SDcols = mysubset]