I have a data.table with column names that are not valid R names
DT = data.table(a = c(1, 2), `0b` = c(4, 5))
And I want to use something like this
my_column <- "0b"
DT[, mean(eval(parse(text = my_column)))]
but I get an error
Error in parse(text = my_column) : <text>:1:2: unexpected symbol
1: 0b
^
Is there a way to do this, i.e. to use a syntactically invalid column name, stored in a variable, inside j?
We can either specify the column in .SDcols and get the mean using .SD
DT[, mean(.SD[[1L]]),.SDcols=my_column]
Or we can subset the column using [[ and then get the mean.
mean(DT[[my_column]])
In R, syntactically invalid names need back-ticks in order to be evaluated. Although .SDcols is probably the proper way to go, you can use as.name() or as.symbol() to turn the character my_column into a back-ticked name.
DT[, mean(eval(as.name(my_column)))]
# [1] 4.5
Or, a bit more clunkily:
with(DT, do.call(mean, list(as.name(my_column))))
# [1] 4.5
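To make the difference between these approaches concrete, here is a minimal sketch using the DT and my_column from the question. The helper names parse_ok, via_symbol, via_get and via_extract are made up for illustration. It assumes the point above holds: parse() treats the string as R source code, while as.name(), get() and [[ treat it as a name.

```r
library(data.table)

DT <- data.table(a = c(1, 2), `0b` = c(4, 5))
my_column <- "0b"

# parse() reads the string as R source code, so "0b" is a syntax error
parse_ok <- tryCatch({
  parse(text = my_column)
  TRUE
}, error = function(e) FALSE)

# as.name() builds the symbol directly, with no parsing step involved
via_symbol <- DT[, mean(eval(as.name(my_column)))]

# get() looks the column up by the string it is given
via_get <- DT[, mean(get(my_column))]

# [[ extracts the column as a plain vector, outside of j
via_extract <- mean(DT[[my_column]])
```

All three working variants should agree on the mean of the `0b` column.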
As you specified in declaring your example, using backticks (`) is a common way to handle strange column names:
DT[ , mean(`0b`)]
Though get also works:
DT[ , mean(get("0b"))]
We can also do it the data.frame way
sapply(DT[ , "0b"], mean)
Though you may just want to setnames to get rid of the pesky column names altogether (by reference)
setnames(DT, "0b", "something_digestible")
Related
Is it possible to select a column of a data.table and get back a vector? In base R, the argument drop=TRUE would do the trick. For example,
library(data.table)
dat <- as.data.table(iris)
dat[,"Species"] # returns data.table
dat[,"Species", drop=TRUE] # same
iris[, "Species", drop=TRUE] # a factor, wanted result
Is there a way to do this with data.table?
EDIT: the dat[,Species] method is fine, however I need a method where I can pass the column name in a variable:
x <- "Species"
dat[,x, drop=TRUE]
With data.frame, the default is drop = TRUE; data.table does the opposite, and the drop argument is ignored. According to ?data.table:
drop - Never used by data.table. Do not use. It needs to be here because data.table inherits from data.frame.
In order to get the same behavior, we can use [[ to extract the column by passing a string
identical(dat[["Species"]], iris[, "Species"])
#[1] TRUE
Or
dat$Species
Using [[ or $ extracts the column as a vector and also bypasses the overhead of [.data.table.
See data.table FAQ #1.1. This has been a feature since 2006.
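Since the edited question asks for a method that takes the column name from a variable, here is a short sketch of three ways that should all return the plain vector. The names v1, v2 and v3 are only illustrative.

```r
library(data.table)

dat <- as.data.table(iris)
x <- "Species"

# [[ with a string: plain list-style extraction, no j evaluation
v1 <- dat[[x]]

# get() inside j looks the column up by name and returns it as a vector
v2 <- dat[, get(x)]

# ..x selects the column named in x as a one-column data.table;
# [[1]] then pulls the vector out of it
v3 <- dat[, ..x][[1]]
```

Of the three, dat[[x]] is probably the simplest choice when no other data.table features (i, by) are needed in the same call.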
I am working through practice exercises (problems and solutions) for data.table in R. The problem was: get the row and column positions of missing values in a data.table. The solution code uses [..., with=F][[1]]. I do not understand this part of the code and would appreciate an expert explanation.
for(i in 1:NROW(DT)){
for(j in 1:NCOL(DT)){
curr_value <- DT[i, j,with=F][[1]]
I understand the first two lines, but not with=F followed by [[1]].
What does with=F mean, and why is [[1]] used after it? Why the double brackets around 1?
Generally in data.table, with = FALSE allows you to select columns named in a variable.
Consider the following minimal example,
library(data.table)
dt <- data.table(mtcars)
Let's select the following columns from dt
cols <- c(1, 7)
The following command will produce an error
dt[, cols]
Instead you can use with = F
dt[, cols, with = F]
From ?data.table
When with=TRUE (default), j is evaluated within the frame of the data.table;
i.e., it sees column names as if they are variables.
A shorter alternative is to use
dt[, ..cols]
See also Why does “..” work to pass column names in a character vector variable?
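The [[1]] part of the question can be illustrated with a small sketch. The toy table and the names cell and value below are assumptions for illustration, not the exercise's actual data. With with = FALSE, j is treated as a column position, so DT[i, j, with = FALSE] is still a data.table (one row, one column), and [[1]] extracts its first column as a plain value.

```r
library(data.table)

# Toy table with one missing value (hypothetical data)
DT <- data.table(a = c(1, NA), b = c("x", "y"))

i <- 2  # row position
j <- 1  # column position

# with = FALSE: j is used as a column number, not evaluated as an expression
cell <- DT[i, j, with = FALSE]

# cell is a 1 x 1 data.table; [[1]] pulls out its only column as a vector
value <- cell[[1]]

is_missing <- is.na(value)
```

Without [[1]], a test such as is.na() would be applied to a one-cell data.table rather than to the value itself.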
I have some data:
library(data.table)
data(mtcars)
setDT(mtcars)
I have some vectors of names of columns I would like to keep:
some_keep <- c('mpg','cyl')
more_keep <- c('disp','hp')
I want to drop-in-place every column except those named in some_keep and more_keep.
I know that I can use setdiff with names:
mtcars[, ( setdiff(names(mtcars),c(some_keep,more_keep)) ) := NULL] # this works
But this seems not very readable. I know that to select all but these, I can use with=FALSE:
mtcars[,-c(some_keep,more_keep), with=FALSE] # returns the columns I want to drop
But then this doesn't work:
mtcars[,(-c(some_keep,more_keep)):=NULL] # invalid argument to unary operator
Nor do these:
mtcars[,-c(some_keep,more_keep)] <- NULL # invalid argument to unary operator
mtcars[,-c(some_keep,more_keep), with=FALSE] <- NULL # unused argument (with = FALSE)
Is there a simpler data.table expression that doesn't require writing the table's name twice?
Please note that this seemingly duplicate question is really asking about selecting (not dropping) all but specified as shown above.
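For reference, here is the working setdiff approach from the question, split into named steps. This is only a readability sketch: it still refers to the table twice, so it does not answer the question as asked. The names keep and drop_cols are made up, and the sketch works on a copy so the built-in mtcars stays untouched.

```r
library(data.table)

DT <- as.data.table(mtcars)  # a copy, converted to data.table

some_keep <- c("mpg", "cyl")
more_keep <- c("disp", "hp")
keep <- c(some_keep, more_keep)

# Everything not listed in keep is to be removed
drop_cols <- setdiff(names(DT), keep)

# Parentheses make := treat drop_cols as a variable holding column names;
# assigning NULL deletes those columns by reference
DT[, (drop_cols) := NULL]
```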
I'd like to create a new data.table column by adding columns whose names match a certain expression. I don't know how many columns I'll be adding. The catch is, the columns are of class 'integer64' so rowSums(.) does not appear to work on them.
For instance, this works for two (known) integer64 columns:
library(data.table)
library(bit64) # provides as.integer64()
DT <- data.table(a=as.integer64(1:4),b=as.integer64(5:8),c=as.integer64(9:12))
DT[, y := .SD[, 1] + .SD[, 2], .SDcols=c("a", "b")]
And this works for my case, any number of columns, but not if their class is integer64:
DT[, y := rowSums(.SD), .SDcols=c("a", "b")] # gives incomprehensible data if class of a and b is integer64
One way I can work around it is by defining the data type of the column y beforehand. Is there a simpler way to do it? I may be missing something simple here and I apologize if that's so.
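One possible approach, sketched here under the assumption that the bit64 package is installed, is to fold the selected columns together with Reduce() and `+`. Unlike rowSums(), which presumably converts the columns to a plain numeric matrix and loses the integer64 class, `+` dispatches to the integer64 method for each pair of columns. The variable cols is illustrative and could hold any number of matching column names.

```r
library(data.table)
library(bit64)

DT <- data.table(a = as.integer64(1:4),
                 b = as.integer64(5:8),
                 c = as.integer64(9:12))

# Columns to add up; this could come from grep() on names(DT)
cols <- c("a", "b")

# .SD is a list of the selected columns; Reduce() adds them pairwise,
# so the result keeps the integer64 class
DT[, y := Reduce(`+`, .SD), .SDcols = cols]
```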
I have a data.table containing a name column, and I'm trying to extract a regular expression from this name. The most obvious way to do it in this case is with the := operator, as I'm assigning this extracted string as the actual name of the data. In doing so, I find that this doesn't actually apply the function in the way that I would expect. I'm not sure if it's intentional, and I was wondering if there's a reason it does what it does or if it's a bug.
library(data.table)
dt <- data.table(name = c('foo123', 'bar234'))
Searching for the desired expression in a simple character vector behaves as expected:
name <- dt[1, name]
pattern <- '(.*?)\\d+'
regmatches(name, regexec(pattern, name))
[[1]]
[1] "foo123" "foo"
I can easily subset this to get what I want
regmatches(name, regexec(pattern, name))[[1]][2]
[1] "foo"
However, I run into issues when I try to apply this to the entire data.table:
dt[, name_final := regmatches(name, regexec(pattern, name))[[1]][2]]
dt
name name_final
1: foo123 foo
2: bar234 foo
I don't know how data.table works internally, but I would guess that the function was applied to the entire name column first, and that the result was then coerced into a vector somehow and assigned to the new name_final column. However, the behavior I would expect here would be on a row-by-row basis. I can emulate this behavior by adding a dummy id column:
dt[, id := seq_along(name)]
dt[, name_final := regmatches(name, regexec(pattern, name))[[1]][2], by = list(id)]
dt
name name_final id
1: foo123 foo 1
2: bar234 bar 2
Is there a reason that this isn't the default behavior? If so, I would guess that it had to do with columns being atomic to the data.table rather than the rows, but I'd like to understand what's going on there.
Pretty much nothing in R runs on a row-by-row basis. It's usually better to work with whole columns at a time, so you can assume that the entire column vector will be passed to your function. Here's a way to extract the second element of each item in the regmatches list:
dt[, name_final := sapply(regmatches(name, regexec(pattern, name)), `[`, 2)]
Functions like sapply() or Vectorize() can "fake" a per-row type call for functions that aren't meant to be run on a vector/list of data at a time.
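The behavior in the question can be reproduced without data.table, which may make it clearer where the repeated "foo" comes from. The sketch below uses plain vectors; the names m, first_only and per_element are illustrative. regmatches() returns one list element per input string, so [[1]][2] picks a single value from the first element only, and := then recycles that one value down the column.

```r
name <- c("foo123", "bar234")
pattern <- "(.*?)\\d+"

# One list element per input string: full match first, then the capture group
m <- regmatches(name, regexec(pattern, name))

# [[1]][2] looks only at the first string, giving a single value
first_only <- m[[1]][2]

# Taking the second entry of every list element gives one value per string;
# vapply() also checks that each result is a single character value
per_element <- vapply(m, `[`, character(1), 2)
```

vapply() is used here instead of sapply() only because it fixes the result type; either should work for this data.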