In R I have a data.table df with an integer column X. I want to convert this column from integer to a character.
This is really easy with df[, X:=as.character(X)].
Now for the question: I have the name of the column (X) stored in a variable like this:
col_name <- 'X'.
How do I access the column (and convert it to a character column) while only knowing the variable?
I tried numerous things, all yielding either nothing useful or a column of NAs. Which syntax will get me the result I want?
library(data.table)
DT <- as.data.table(iris)
col_name <- "Petal.Length"
Wrap the LHS of := in parentheses so that col_name is evaluated instead of being taken as a literal column name, and use list subsetting ([[ ]]) to select the column:
DT[, (col_name) := as.character(DT[[col_name]])]
class(DT[[col_name]])
#[1] "character"
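To see why the parentheses matter, here is a small sketch contrasting the two forms. The toy table DT2 is made up for illustration and is not part of the question.

```r
library(data.table)

DT2 <- data.table(X = 1:3)   # illustrative table, an assumption
col_name <- "X"

# Without parentheses, := takes the symbol as a literal column name,
# so this adds a new column called "col_name" and leaves X untouched.
DT2[, col_name := as.character(X)]
names(DT2)
class(DT2$X)

# Drop the accidental column, then use the parenthesised form,
# which evaluates the variable and converts column X itself.
DT2[, col_name := NULL]
DT2[, (col_name) := as.character(DT2[[col_name]])]
class(DT2$X)
```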
We can specify the column in .SDcols and assign the converted column back:
df[, (col_name) := as.character(.SD[[1L]]), .SDcols = col_name]
If there is more than one column, use lapply (col_names is then a character vector of column names):
df[, (col_names) := lapply(.SD, as.character), .SDcols = col_names]
data
df <- data.table(X = as.integer(1:5), Y = LETTERS[1:5])
col_name <- "X"
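As a minimal sketch of the multi-column case, assuming a table df2 with a second integer column Z and a col_names vector (both made up here, since the example data only defines col_name):

```r
library(data.table)

# Illustrative data; Z and col_names are assumptions for this sketch
df2 <- data.table(X = 1:5, Z = 6:10, Y = LETTERS[1:5])
col_names <- c("X", "Z")

# .SD holds only the columns named in .SDcols; lapply converts each one,
# and the parenthesised LHS assigns the results back by reference.
df2[, (col_names) := lapply(.SD, as.character), .SDcols = col_names]
sapply(df2, class)
```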
Related
Often, I want to manipulate several variables in a DT and I need to select the columns based on their names or class.
d <- data.table(x = 1:10, y= letters[1:10])
# My usual approach (str_subset() comes from the stringr package)
col <- str_subset(names(d), '^x')
d[, (col) := 2:11]
However, it would be very useful and less verbose to do this:
d[, (names(.SD)) := 2:11, .SDcols = patterns('^x')]
But this throws an error:
Error in `[.data.table`(d, , `:=`((names(.SD)), 2:11), .SDcols = patterns("^x")) :
LHS of := isn't column names ('character') or positions ('integer' or 'numeric')
>
The column names of .SD are available, though:
> d[, names(.SD), .SDcols = patterns('^x')]
[1] "x"
Why aren't the names of .SD available for assignment on the LHS of :=?
As noted this is not yet possible. The workaround only adds one line of code:
cols = grep('^x', names(d))
d[ , (cols) := 2:11, .SDcols = cols]
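A variation on this workaround, sketched under the assumption that there are two matching columns (x2 is made up here): grep(value = TRUE) returns column names instead of positions, and .SDcols is only needed because the RHS refers to .SD.

```r
library(data.table)

# Illustrative data; the x2 column is an assumption for this sketch
d <- data.table(x = 1:10, x2 = 11:20, y = letters[1:10])

# Resolve the pattern to column names once, outside the data.table call
cols <- grep("^x", names(d), value = TRUE)

# Reuse the same vector on the LHS of := and in .SDcols
d[, (cols) := lapply(.SD, function(v) v + 1L), .SDcols = cols]
```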
When iterating through all columns in an R data.table using reference semantics, what makes more sense from a memory usage standpoint:
(1) dt[, (all_cols) := lapply(.SD, my_fun)]
or
(2) lapply(colnames(dt), function(col) dt[, (col) := my_fun(dt[[col]])])[[1]]
My question is: In (2), I am forcing data.table to overwrite dt on a column by column basis, so I would assume to need extra memory on the order of column size. Is this also the case for (1)? Or is all of lapply(.SD, my_fun) evaluated before the original columns are overwritten?
Some sample code to run the above variants:
library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)
Following the suggestion of @Frank, the most memory-efficient way to replace a data.table column by column, applying a function my_fun to each column, is:
library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)
for (col in all_cols) set(dt, j = col, value = my_fun(dt[[col]]))
This currently (v1.11.4) is not handled in the same way as an expression like dt[, lapply(.SD, my_fun)], which internally is optimised to dt[, list(my_fun(a), my_fun(b), ...)], where a, b, ... are columns in .SD (see ?datatable.optimize). This might change in the future and is being tracked by #1414.
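One way to check that the set() loop updates the table in place is to compare address(dt) before and after the loop. This is only a sketch: it shows that the data.table object itself is not reallocated, not how much temporary memory each call to my_fun needs.

```r
library(data.table)

dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1

addr_before <- address(dt)

# Replace one column at a time; only one new column vector exists at once
for (col in colnames(dt)) set(dt, j = col, value = my_fun(dt[[col]]))

addr_after <- address(dt)

# TRUE here means the table was modified by reference, not copied
identical(addr_before, addr_after)
```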
I'm trying to copy a subset of columns from Y to X based on a join, where the subset of columns is dynamic.
I can identify the columns quite easily: names(Y)[grep("xxx", names(Y))]
But when I try to use that code in the j expression, it just gives me the column names, not the values of the columns. .SD and .SDcols get pretty close, but they only apply to the x expression. I'm trying to do something like this:
X[Y[names(Y)[grep("xxx", names(Y))] := .SD, .SDcols = names(Y)[grep("xxx", names(Y)), on=.(zzz)]
Is there an equivalent set of .SD and .SDcols constructs that apply to the i expression? Or do I need to build up a string for the j expression and eval that string?
Perhaps this will help you get started:
library(data.table)
X <- as.data.table(mtcars[1:5], keep.rownames = "id")
Y <- as.data.table(mtcars, keep.rownames = "id")
cols <- c("gear", "carb")
# copy cols from Y to X based on common "id":
X[Y, (cols) := mget(cols), on = "id"]
As Frank notes in his comment, it might be safer to prefix the column names with i. to ensure the assigned columns are indeed from Y:
X[Y, (cols) := mget(paste0("i.", cols)), on = "id"]
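To tie this back to the dynamic selection in the question, cols can be built with grep(value = TRUE). The following is a sketch in which the pattern "^(gear|carb)$" merely stands in for the asker's "xxx" and is an assumption.

```r
library(data.table)

X <- as.data.table(mtcars[1:5], keep.rownames = "id")
Y <- as.data.table(mtcars, keep.rownames = "id")

# Pick the columns of Y by pattern; the pattern is illustrative only
cols <- grep("^(gear|carb)$", names(Y), value = TRUE)

# Update join: copy the i.-prefixed (i.e. Y's) columns into X by reference
X[Y, (cols) := mget(paste0("i.", cols)), on = "id"]
```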
I'm trying to use data.table rather than data.frame (for faster code). Beyond the syntax differences between them, I'm having problems when I need to extract a specific character column and use it as a character vector. When I call:
library(data.table)
DT <- fread("file.txt")
vect <- as.character(DT[, 1, with = FALSE])
class(vect)
###[1] "character"
head(vect)
It returns:
[1] "c(\"uc003hzj.4\", \"uc021ofx.1\", \"uc021olu.1\", \"uc021ome.1\", \"uc021oov.1\", \"uc021opl.1\", \"uc021osl.1\", \"uc021ovd.1\", \"uc021ovp.1\", \"uc021pdq.1\", \"uc021pdv.1\", \"uc021pdw.1\")
Any ideas of how to avoid these "\" in the output?
as.character works on vectors; applied to a data.frame/data.table object it does not behave the way the OP expected. So, to get the first column as a character vector, subset it with .SD[[1L]] and apply as.character:
DT[, as.character(.SD[[1L]])]
If there are multiple columns, we can specify the column index with .SDcols and loop over the .SD to convert to character and assign (:=) the output back to the particular columns.
DT[, (1:2) := lapply(.SD, as.character), .SDcols= 1:2]
data
DT <- data.table(Col1 = 1:5, Col2= 6:10, Col3= LETTERS[1:5])
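The difference between extracting a column and subsetting the table can be sketched as follows, reusing the example data above. The variable names vect and deparsed are only illustrative.

```r
library(data.table)

DT <- data.table(Col1 = 1:5, Col2 = 6:10, Col3 = LETTERS[1:5])

# [[ ]] extracts the column itself as a plain vector, one element per row
vect <- as.character(DT[[1L]])
length(vect)

# [ , with = FALSE] keeps a one-column data.table (a list), so
# as.character() deparses the whole column into a single string
deparsed <- as.character(DT[, 1, with = FALSE])
length(deparsed)
```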
I'm trying to apply a function to a group of columns in a large data.table without referring to each one individually.
a <- data.table(
a=as.character(rnorm(5)),
b=as.character(rnorm(5)),
c=as.character(rnorm(5)),
d=as.character(rnorm(5))
)
b <- c('a','b','c','d')
With the MWE above, this:
a[,b=as.numeric(b),with=F]
works, but this:
a[,b[2:3]:=data.table(as.numeric(b[2:3])),with=F]
doesn't work. What is the correct way to apply the as.numeric function to just columns 2 and 3 of a without referring to them individually?
(In the actual data set there are tens of columns so it would be impractical)
The idiomatic approach is to use .SD and .SDcols.
You can force the LHS of := to be evaluated in the parent frame by wrapping it in ():
a[, (b) := lapply(.SD, as.numeric), .SDcols = b]
For columns 2:3
a[, 2:3 := lapply(.SD, as.numeric), .SDcols = 2:3]
or
mysubset <- 2:3
a[, (mysubset) := lapply(.SD, as.numeric), .SDcols = mysubset]
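When the columns are better identified by class than by position, the same pattern applies once the names have been collected. This is a sketch with made-up data (a2, v1, v2 and n are assumptions), using vapply to find the character columns.

```r
library(data.table)

set.seed(1)
# Illustrative data: two numeric-looking character columns and an integer one
a2 <- data.table(
  v1 = as.character(rnorm(5)),
  v2 = as.character(rnorm(5)),
  n  = 1:5
)

# Collect the names of all character columns
char_cols <- names(a2)[vapply(a2, is.character, logical(1))]

# Convert just those columns by reference
a2[, (char_cols) := lapply(.SD, as.numeric), .SDcols = char_cols]
sapply(a2, class)
```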