Use parentheses () to subset a data.table column - r

In case of assignment (by reference), with = FALSE can be replaced by LHS in parentheses, (). This nice feature does not work when simply subsetting the column without assignment. Of course there is workarount with .SD/.SDcols or get()/mget(), but it would be nice to subset a column just the same way, with or without assignment.
dt <- data.table(A = 1:3, B = 4:6 )
col <- "A"
cols <- c("A","B")
# assign the old way
dt[, col := 9 , with=FALSE]
dt[, cols := .(9,8), with=FALSE]
# assign the new way
dt[, (col) := 8 ]
dt[, (cols) := .(8,7)]
# But the above syntax does not work for subsetting
dt[, (col)]
dt[, (cols)]
# I know how I can subset col and cols, but that is not the question here,
# e.g.:
dt[, col, with=FALSE]
dt[, cols, with=FALSE]
dt[, .SD, .SDcols=col]
dt[, .SD, .SDcols=cols]
# Below, further (there are even more) types of subsetting but they are not
# the same for col and cols, which is importent for looping where I dont
# know how many cols I call in advance.
dt[, get(col)]
dt[, mget(cols)]
dt[[col]] # Returns a vector, nor running: dt[[cols]]
In other words: if dt[ , (col) := 8] runs, as a naive user I expect df[ , (col)] to run as well. Probably there would be a conflict in [data.table so that cannot be implemented?

Related

R data.table: names of .SD not available for assignment

Often, I want to manipulate several variables in a DT and I need to select the column names based on their names or class.
d <- data.table(x = 1:10, y= letters[1:10])
# My usual approach
col <- str_subset(names(d), '^x')
d[, (col) := 2:11]
However, it would be very useful and less verbose to do this:
d[, (names(.SD)) := 2:11, .SDcols = patterns('^x')]
But this throws an error:
Error in `[.data.table`(d, , `:=`((names(.SD)), 2:11), .SDcols = patterns("^x")) :
LHS of := isn't column names ('character') or positions ('integer' or 'numeric')
>
The column names of .SD are available, though:
> d[, names(.SD), .SDcols = patterns('^x')]
[1] "x"
Why aren't the names of .SD available for assignment on the LHS of :=?
As noted this is not yet possible. The workaround only adds one line of code:
cols = grep('^x', names(d))
d[ , (cols) := 2:11, .SDcols = cols]

.SD and .SDcols for the i expression in data.table join

i'm trying to copy a subset of columns from Y to X based on a join, where the subset of columns is dynamic
I can identify the columns quite easily: names(Y)[grep("xxx", names(Y))]
but when i try to use that code in the j expression, it just gives me the column names, not the values of the columns. the .SD and .SDcols gets pretty close, but they only apply to the x expression. I'm trying to do something like this:
X[Y[names(Y)[grep("xxx", names(Y))] := .SD, .SDcols = names(Y)[grep("xxx", names(Y)), on=.(zzz)]
is there an equivalent set of .SD and .SDcols constructs that apply to the i expression? Or, do I need to build up a string for the j expression and eval that string?
Perhaps this will help you get started:
library(data.table)
X <- as.data.table(mtcars[1:5], keep.rownames = "id")
Y <- as.data.table(mtcars, keep.rownames = "id")
cols <- c("gear", "carb")
# copy cols from Y to X based on common "id":
X[Y, (cols) := mget(cols), on = "id"]
As Frank notes in his comment, it might be safer to prefix the column names with i. to ensure the assigned columns are indeed from Y:
X[Y, (cols) := mget(paste0("i.", cols)), on = "id"]

Convert *some* column classes in data.table

I want to convert a subset of data.table cols to a new class. There's a popular question here (Convert column classes in data.table) but the answer creates a new object, rather than operating on the starter object.
Take this example:
dat <- data.frame(ID=c(rep("A", 5), rep("B",5)), Quarter=c(1:5, 1:5), value=rnorm(10))
cols <- c('ID', 'Quarter')
How best to convert to just the cols columns to (e.g.) a factor? In a normal data.frame you could do this:
dat[, cols] <- lapply(dat[, cols], factor)
but that doesn't work for a data.table, and neither does this
dat[, .SD := lapply(.SD, factor), .SDcols = cols]
A comment in the linked question from Matt Dowle (from Dec 2013) suggests the following, which works fine, but seems a bit less elegant.
for (j in cols) set(dat, j = j, value = factor(dat[[j]]))
Is there currently a better data.table answer (i.e. shorter + doesn't generate a counter variable), or should I just use the above + rm(j)?
Besides using the option as suggested by Matt Dowle, another way of changing the column classes is as follows:
dat[, (cols) := lapply(.SD, factor), .SDcols = cols]
By using the := operator you update the datatable by reference. A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
As suggeted by #MattDowle in the comments, you can also use a combination of for(...) set(...) as follows:
for (col in cols) set(dat, j = col, value = factor(dat[[col]]))
which will give the same result. A third alternative is:
for (col in cols) dat[, (col) := factor(dat[[col]])]
On a smaller datasets, the for(...) set(...) option is about three times faster than the lapply option (but that doesn't really matter, because it is a small dataset). On larger datasets (e.g. 2 million rows), each of these approaches takes about the same amount of time. For testing on a larger dataset, I used:
dat <- data.table(ID=c(rep("A", 1e6), rep("B",1e6)),
Quarter=c(1:1e6, 1:1e6),
value=rnorm(10))
Sometimes, you will have to do it a bit differently (for example when numeric values are stored as a factor). Then you have to use something like this:
dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
WARNING: The following explanation is not the data.table-way of doing things. The datatable is not updated by reference because a copy is made and stored in memory (as pointed out by #Frank), which increases memory usage. It is more an addition in order to explain the working of with = FALSE.
When you want to change the column classes the same way as you would do with a dataframe, you have to add with = FALSE as follows:
dat[, cols] <- lapply(dat[, cols, with = FALSE], factor)
A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
If you don't add with = FALSE, datatable will evaluate dat[, cols] as a vector. Check the difference in output between dat[, cols] and dat[, cols, with = FALSE]:
> dat[, cols]
[1] "ID" "Quarter"
> dat[, cols, with = FALSE]
ID Quarter
1: A 1
2: A 2
3: A 3
4: A 4
5: A 5
6: B 1
7: B 2
8: B 3
9: B 4
10: B 5
You can use .SDcols:
dat[, cols] <- dat[, lapply(.SD, factor), .SDcols=cols]

Updating data.table with get(x) [duplicate]

I'm trying to apply a function to a group of columns in a large data.table without referring to each one individually.
a <- data.table(
a=as.character(rnorm(5)),
b=as.character(rnorm(5)),
c=as.character(rnorm(5)),
d=as.character(rnorm(5))
)
b <- c('a','b','c','d')
with the MWE above, this:
a[,b=as.numeric(b),with=F]
works, but this:
a[,b[2:3]:=data.table(as.numeric(b[2:3])),with=F]
doesn't work. What is the correct way to apply the as.numeric function to just columns 2 and 3 of a without referring to them individually.
(In the actual data set there are tens of columns so it would be impractical)
The idiomatic approach is to use .SD and .SDcols
You can force the RHS to be evaluated in the parent frame by wrapping in ()
a[, (b) := lapply(.SD, as.numeric), .SDcols = b]
For columns 2:3
a[, 2:3 := lapply(.SD, as.numeric), .SDcols = 2:3]
or
mysubset <- 2:3
a[, (mysubset) := lapply(.SD, as.numeric), .SDcols = mysubset]

Apply a function to a subset of data.table columns, by column-indices instead of name

I'm trying to apply a function to a group of columns in a large data.table without referring to each one individually.
a <- data.table(
a=as.character(rnorm(5)),
b=as.character(rnorm(5)),
c=as.character(rnorm(5)),
d=as.character(rnorm(5))
)
b <- c('a','b','c','d')
with the MWE above, this:
a[,b=as.numeric(b),with=F]
works, but this:
a[,b[2:3]:=data.table(as.numeric(b[2:3])),with=F]
doesn't work. What is the correct way to apply the as.numeric function to just columns 2 and 3 of a without referring to them individually.
(In the actual data set there are tens of columns so it would be impractical)
The idiomatic approach is to use .SD and .SDcols
You can force the RHS to be evaluated in the parent frame by wrapping in ()
a[, (b) := lapply(.SD, as.numeric), .SDcols = b]
For columns 2:3
a[, 2:3 := lapply(.SD, as.numeric), .SDcols = 2:3]
or
mysubset <- 2:3
a[, (mysubset) := lapply(.SD, as.numeric), .SDcols = mysubset]

Resources