Often, I want to manipulate several variables in a DT and I need to select the column names based on their names or class.
d <- data.table(x = 1:10, y= letters[1:10])
# My usual approach
col <- str_subset(names(d), '^x')
d[, (col) := 2:11]
However, it would be very useful and less verbose to do this:
d[, (names(.SD)) := 2:11, .SDcols = patterns('^x')]
But this throws an error:
Error in `[.data.table`(d, , `:=`((names(.SD)), 2:11), .SDcols = patterns("^x")) :
LHS of := isn't column names ('character') or positions ('integer' or 'numeric')
>
The column names of .SD are available, though:
> d[, names(.SD), .SDcols = patterns('^x')]
[1] "x"
Why aren't the names of .SD available for assignment on the LHS of :=?
As noted this is not yet possible. The workaround only adds one line of code:
cols = grep('^x', names(d))
d[ , (cols) := 2:11, .SDcols = cols]
Related
So I'm new to data.table and don't understand now I can modify by reference at the same time that I perform an operation on chosen columns using the .SD symbol? I have two examples.
Example 1
> DT <- data.table("group1:1" = 1, "group1:2" = 1, "group2:1" = 1)
> DT
group1:1 group1:2 group2:1
1: 1 1 1
Let's say for example I simply to choose only columns which contain "group1:" in the name. I know it's pretty straightforward to just reassign the result of operation to the same object like so:
cols1 <- names(DT)[grep("group1:", names(DT))]
DT <- DT[, .SD, .SDcols = cols1]
From reading the data.table vignette on reference-semantics my understanding is that the above does not modify by reference, whereas a similar operation that would use the := would do so. Is this accurate? If that's correct Is there a better way to do this operation that does modify by reference? In trying to figure this out I got stuck on how to combine the .SD symbol and the := operator. I tried
DT[, c(cols1) := .SD, .SDcols = cols1]
DT[, c(cols1) := lapply(.SD,function(x)x), .SDcols = cols1]
neither of which gave the result I wanted.
Example 2
Say I want to perform a different operation dcast that uses .SD as input. Example data table:
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> DT
x y z
1: 1 A 5
2: 2 A 6
3: 1 B 7
4: 2 B 8
Again, I know I can just reassign like so:
> DT <- dcast(DT, x ~ y, value.var = "z")
> DT
x A B
1: 1 5 7
2: 2 6 8
But don't understand why the following does not work (or whether it would be preferable in some circumstances):
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> cols <- c("x", unique(DT$y))
> DT[, cols := dcast(.SD, x ~ y, value.var = "z")]
In your example,
cols1 <- names(DT)[grep("group1:", names(DT))]
DT[, c(cols1) := .SD, .SDcols = cols1] # not this
DT[, (cols1) := .SD, .SDcols = cols1] # this will work
Below is other example to set 0 values on numeric columns .SDcols by reference.
The trick is to assign column names vector before :=.
colnames = DT[, names(.SD), .SDcols = is.numeric] # column name vector
DT[, (colnames) := lapply(.SD, nafill, fill = 0), .SDcols= is.numeric]
The 1.12.2 documentation mentions with=FALSE is not necessary anymore to select columns dynamically. Note that x[, cols] is equivalent to x[, ..cols] and to x[, cols, with=FALSE]. But when I use it, I get a warning
https://cran.r-project.org/web/packages/data.table/data.table.pdf
tmp = data.table(x = numeric(0), y = numeric(0))
cols = c('x', 'y')
tmp[, cols]
Error in [.data.table(tmp, , cols) :
j (the 2nd argument inside [...]) is a single symbol but column name 'cols' is
not found. Perhaps you intended DT[, ..cols]. This difference to data.frame
is deliberate and explained in FAQ 1.1.
tmp[, ..cols]
# Empty data.table (0 rows) of 2 cols: x,y
tmp[, cols, with = FALSE]
# Empty data.table (0 rows) of 2 cols: x,y
I would have expected cols to behave similarly as per documentation. Also the variations tmp[, ..cols] gives object usage lint issues (which is useful for static analysis of code).
i'm trying to copy a subset of columns from Y to X based on a join, where the subset of columns is dynamic
I can identify the columns quite easily: names(Y)[grep("xxx", names(Y))]
but when i try to use that code in the j expression, it just gives me the column names, not the values of the columns. the .SD and .SDcols gets pretty close, but they only apply to the x expression. I'm trying to do something like this:
X[Y[names(Y)[grep("xxx", names(Y))] := .SD, .SDcols = names(Y)[grep("xxx", names(Y)), on=.(zzz)]
is there an equivalent set of .SD and .SDcols constructs that apply to the i expression? Or, do I need to build up a string for the j expression and eval that string?
Perhaps this will help you get started:
library(data.table)
X <- as.data.table(mtcars[1:5], keep.rownames = "id")
Y <- as.data.table(mtcars, keep.rownames = "id")
cols <- c("gear", "carb")
# copy cols from Y to X based on common "id":
X[Y, (cols) := mget(cols), on = "id"]
As Frank notes in his comment, it might be safer to prefix the column names with i. to ensure the assigned columns are indeed from Y:
X[Y, (cols) := mget(paste0("i.", cols)), on = "id"]
In case of assignment (by reference), with = FALSE can be replaced by LHS in parentheses, (). This nice feature does not work when simply subsetting the column without assignment. Of course there is workarount with .SD/.SDcols or get()/mget(), but it would be nice to subset a column just the same way, with or without assignment.
dt <- data.table(A = 1:3, B = 4:6 )
col <- "A"
cols <- c("A","B")
# assign the old way
dt[, col := 9 , with=FALSE]
dt[, cols := .(9,8), with=FALSE]
# assign the new way
dt[, (col) := 8 ]
dt[, (cols) := .(8,7)]
# But the above syntax does not work for subsetting
dt[, (col)]
dt[, (cols)]
# I know how I can subset col and cols, but that is not the question here,
# e.g.:
dt[, col, with=FALSE]
dt[, cols, with=FALSE]
dt[, .SD, .SDcols=col]
dt[, .SD, .SDcols=cols]
# Below, further (there are even more) types of subsetting but they are not
# the same for col and cols, which is importent for looping where I dont
# know how many cols I call in advance.
dt[, get(col)]
dt[, mget(cols)]
dt[[col]] # Returns a vector, nor running: dt[[cols]]
In other words: if dt[ , (col) := 8] runs, as a naive user I expect df[ , (col)] to run as well. Probably there would be a conflict in [data.table so that cannot be implemented?
I'm trying to apply a function to a group of columns in a large data.table without referring to each one individually.
a <- data.table(
a=as.character(rnorm(5)),
b=as.character(rnorm(5)),
c=as.character(rnorm(5)),
d=as.character(rnorm(5))
)
b <- c('a','b','c','d')
with the MWE above, this:
a[,b=as.numeric(b),with=F]
works, but this:
a[,b[2:3]:=data.table(as.numeric(b[2:3])),with=F]
doesn't work. What is the correct way to apply the as.numeric function to just columns 2 and 3 of a without referring to them individually.
(In the actual data set there are tens of columns so it would be impractical)
The idiomatic approach is to use .SD and .SDcols
You can force the RHS to be evaluated in the parent frame by wrapping in ()
a[, (b) := lapply(.SD, as.numeric), .SDcols = b]
For columns 2:3
a[, 2:3 := lapply(.SD, as.numeric), .SDcols = 2:3]
or
mysubset <- 2:3
a[, (mysubset) := lapply(.SD, as.numeric), .SDcols = mysubset]