I have some data:
library(data.table)
data(mtcars)
setDT(mtcars)
I have some vectors of names of columns I would like to keep:
some_keep <- c('mpg','cyl')
more_keep <- c('disp','hp')
I want to drop-in-place every column except those named in some_keep and more_keep.
I know that I can use setdiff with names:
mtcars[, ( setdiff(names(mtcars),c(some_keep,more_keep)) ) := NULL] # this works
But this seems not very readable. I know that to select all but these, I can use with=FALSE:
mtcars[,-c(some_keep,more_keep), with=FALSE] # returns the columns I want to drop
But then this doesn't work:
mtcars[,(-c(some_keep,more_keep)):=NULL] # invalid argument to unary operator
Nor do these:
mtcars[,-c(some_keep,more_keep)] <- NULL # invalid argument to unary operator
mtcars[,-c(some_keep,more_keep), with=FALSE] <- NULL # unused argument (with = FALSE)
Is there a simpler data.table expression that doesn't require writing the table's name twice?
Please note that this seemingly duplicate question is really asking about selecting (not dropping) all but specified as shown above.
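One way to avoid writing the table's name twice is to move the setdiff step into a small helper. The sketch below is only an illustration: keep_cols is a hypothetical name, not a data.table function, and it works on a copy called cars so the built-in mtcars is left alone. Names in keep that are not columns are simply ignored.

```r
library(data.table)

# Hypothetical helper (not part of data.table): drop, by reference,
# every column whose name is not listed in `keep`.
keep_cols <- function(DT, keep) {
  drop <- setdiff(names(DT), keep)
  if (length(drop) > 0L) DT[, (drop) := NULL]
  invisible(DT)
}

cars <- as.data.table(mtcars)   # a data.table copy of the built-in data
some_keep <- c("mpg", "cyl")
more_keep <- c("disp", "hp")

keep_cols(cars, c(some_keep, more_keep))
names(cars)
```

Because := works by reference, the caller's table is modified without reassigning the result.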
Is it possible to select a column of a data.table and get back a vector? In base R, the argument drop=TRUE would do the trick. For example,
library(data.table)
dat <- as.data.table(iris)
dat[,"Species"] # returns data.table
dat[,"Species", drop=TRUE] # same
iris[, "Species", drop=TRUE] # a factor, wanted result
Is there a way to do this with data.table?
EDIT: the dat[,Species] method is fine, however I need a method where I can pass the column name in a variable:
x <- "Species"
dat[,x, drop=TRUE]
With a data.frame, selecting a single column with [ drops the result to a vector by default (drop = TRUE). data.table does the opposite: [ always returns a data.table and the drop argument is ignored. According to ?data.table
drop - Never used by data.table. Do not use. It needs to be here because data.table inherits from data.frame.
In order to get the same behavior, we can use [[ to extract the column by passing a string
identical(dat[["Species"]], iris[, "Species"])
#[1] TRUE
Or
dat$Species
Using [[ or $ extracts the column as a vector and also bypasses the overhead of [.data.table.
See data.table FAQ #1.1. This has been a feature since 2006.
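For the case in the EDIT, where the column name arrives in a variable, a minimal sketch could look like this. The variable name col_name is only illustrative, and the .SDcols form is shown purely as an alternative that stays inside j.

```r
library(data.table)

dat <- as.data.table(iris)
col_name <- "Species"                        # column name held in a variable

v1 <- dat[[col_name]]                        # list-style extraction, returns a vector
v2 <- dat[, .SD[[1L]], .SDcols = col_name]   # same column, extracted inside j

is.factor(v1)
identical(v1, iris$Species)
```

dat[[col_name]] is usually the simplest choice when no other data.table machinery (grouping, filtering) is needed.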
I am working through practice exercises, with solutions, for data.table in R. The problem was: get the row and column positions of missing values in a data table. The solution uses [..., with=F][[1]]. I do not understand this part of the code and would appreciate an expert explanation.
for(i in 1:NROW(DT)){
for(j in 1:NCOL(DT)){
curr_value <- DT[i, j,with=F][[1]]
I understand the first two lines, but not with=F and the [[1]] that follows.
What does with=F mean, and why is [[1]] used after it? Why is there a double bracket with 1?
Generally in data.table, with = FALSE allows you to select columns named in a variable.
Consider the following minimal example,
library(data.table)
dt <- data.table(mtcars)
Let's select the following columns from dt
cols <- c(1, 7)
The following command will produce an error
dt[, cols]
Instead you can use with = F
dt[, cols, with = F]
From ?data.table
When with=TRUE (default), j is evaluated within the frame of the data.table;
i.e., it sees column names as if they are variables.
A shorter alternative is to use
dt[, ..cols]
See also Why does “..” work to pass column names in a character vector variable?
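To tie this back to the [[1]] in the question, here is a small sketch on made-up data (the table DT and its values are assumptions for illustration only). It shows that with = FALSE returns a one-cell data.table and that [[1]] then pulls the value out of it. The last lines show a vectorised base R alternative to the double loop.

```r
library(data.table)

DT <- data.table(a = c(1, NA, 3), b = c(NA, 5, 6))

cell  <- DT[2, 1, with = FALSE]  # j = 1 is a column position; result is a 1x1 data.table
value <- cell[[1]]               # [[1]] extracts the first column as a plain vector

is.data.table(cell)
is.na(value)

# Vectorised alternative: which() on the logical matrix returned by is.na()
na_pos <- which(is.na(DT), arr.ind = TRUE)
na_pos
```

na_pos is a matrix with one row per missing cell and columns named row and col.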
Using this dummy dataset
setDT(mtcars_copy<-copy(mtcars))
new_col<- "sum_carb" # for dynamic column referencing
Why does Case 1 work but not Case 2?
# Case 1 - Works fine
mtcars_copy[,eval(new_col):=sum(carb)] # Works fine
# Case 2:Doesnt work
aggregate_mtcars<-mtcars_copy[,(eval(new_col)=sum(carb))] # error
aggregate_mtcars<-mtcars_copy[,eval(new_col)=sum(carb))] # error
aggregate_mtcars<-mtcars_copy[,c(eval(new_col)=sum(carb))] # Error
How can I get Case 2 to work? I don't want the main table (mtcars_copy in this case) to hold the new columns; the results should be stored in a separate aggregation table (aggregate_mtcars).
One option is to use the base R function setNames
aggregate_mtcars <- mtcars_copy[, setNames(.(sum(carb)), new_col)]
Or you could use data.table::setnames
aggregate_mtcars <- setnames(mtcars_copy[, .(sum(carb))], new_col)
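As a self-contained check of the setNames approach, the sketch below rebuilds the dummy table and confirms that the original table does not gain the new column. It uses list() rather than .() inside setNames, which is a stylistic choice here and not a requirement.

```r
library(data.table)

mtcars_copy <- as.data.table(mtcars)
new_col <- "sum_carb"            # column name supplied dynamically

# Name the aggregated column from the variable; mtcars_copy is left unchanged
aggregate_mtcars <- mtcars_copy[, setNames(list(sum(carb)), new_col)]

names(aggregate_mtcars)
new_col %in% names(mtcars_copy)
```

Since no := is involved, nothing is assigned by reference and the source table stays as it was.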
I think what you want is to simply make a copy when doing case 1.
aggregate_mtcars <- copy(mtcars_copy)[, eval(new_col) := sum(carb)]
That keeps mtcars_copy, without the new columns, as a separate dataset from the new aggregate_mtcars.
The reason is that Case 2 tries to create the column the data.frame way (assigning with = as if building a new list), which does not work inside j.
data.table has a lesser-known argument, with, that controls how j is evaluated and therefore whether the result is a data.table or a vector.
?data.table :
By default with=TRUE and j is evaluated within the frame of x; column names can be used as variables. In case of overlapping variables names inside dataset and in parent scope you can use double dot prefix ..cols to explicitly refer to 'cols variable parent scope and not from your dataset.
When j is a character vector of column names, a numeric vector of column positions to select or of the form startcol:endcol, and the value returned is always a data.table. with=FALSE is not necessary anymore to select columns dynamically. Note that x[, cols] is equivalent to x[, ..cols] and to x[, cols, with=FALSE] and to x[, .SD, .SDcols=cols].
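The forms in the quoted help text that take a variable holding column names can be compared directly. This is only a sketch: dt, cols and the result names a, b and d are illustrative.

```r
library(data.table)

dt <- as.data.table(mtcars)
cols <- c("mpg", "hp")           # column names held in a variable

a <- dt[, ..cols]                # ".." looks cols up in the calling scope
b <- dt[, cols, with = FALSE]    # older spelling of the same selection
d <- dt[, .SD, .SDcols = cols]   # .SD restricted to the chosen columns

isTRUE(all.equal(a, b))
isTRUE(all.equal(a, d))
```

All three return a data.table containing just the selected columns.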
# Case 2 :
aggregate_mtcars<-mtcars_copy[,(get(new_col)=sum(carb))] # error
aggregate_mtcars<-mtcars_copy[,eval(new_col)=sum(carb))] # error
aggregate_mtcars<-mtcars_copy[,c(eval(new_col)=sum(carb))] # Error
mtcars_copy[, new_col, with = FALSE ] # gives a data.table
mtcars_copy[, eval(new_col), with = FALSE ] # this works and create a data.table
mtcars_copy[, eval(new_col), with = TRUE ] # the default that is used here with error
mtcars_copy[, get(new_col), with = TRUE ] # works and gives a vector
# Case 2 solution: assigning values the data.frame way
mtcars_copy[, eval(new_col) ] <- sum(mtcars_copy$carb) # or any vector
mtcars_copy[[eval(new_col)]] <- sum(mtcars_copy$carb) # or any vector
I have a data.table with column names that are not valid R names
DT = data.table(a = c(1, 2), `0b` = c(4, 5))
And I want to use something like this
my_column <- "0b"
DT[, mean(eval(parse(text = my_column)))]
but I get an error
Error in parse(text = my_column) : <text>:1:2: unexpected symbol
1: 0b
    ^
Is there any way to do this, i.e. use a syntactically invalid column name held in a variable inside j?
We can either specify the column in .SDcols and get the mean using .SD
DT[, mean(.SD[[1L]]),.SDcols=my_column]
Or we can subset the column using [[ and then get the mean.
mean(DT[[my_column]])
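The .SDcols form also extends to several such columns at once. A minimal sketch, assuming a second made-up column named 1c purely for illustration:

```r
library(data.table)

# `0b` and `1c` are syntactically invalid names; `1c` is added for illustration
DT <- data.table(a = c(1, 2), `0b` = c(4, 5), `1c` = c(6, 8))
my_columns <- c("0b", "1c")

# Column names are passed as strings, so no back-ticks or parse() are needed
res <- DT[, lapply(.SD, mean), .SDcols = my_columns]
res
```

The result keeps the original column names, so res[["0b"]] retrieves the mean of the first column.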
In R, syntactically invalid names need back-ticks in order to be evaluated. Although .SDcols is probably the proper way to go, you can use as.name() or as.symbol() to turn the character my_column into a back-ticked name.
DT[, mean(eval(as.name(my_column)))]
# [1] 4.5
A slightly clunkier way would be:
with(DT, do.call(mean, list(as.name(my_column))))
# [1] 4.5
As you specified in declaring your example, using backticks (`) is a common way to handle strange column names:
DT[ , mean(`0b`)]
Though get also works:
DT[ , mean(get("0b"))]
We can also do it the data.frame way
sapply(DT[ , "0b"], mean)
Though you may just want to setnames to get rid of the pesky column names altogether (by reference)
setnames(DT, "0b", "something_digestible")
I'm wondering how to use the subset function if I don't know the name of the column I want to test. The scenario is this: I have a Shiny app where the user can pick a variable on which to filter (subset) the data table. I receive the column name from the webapp as input, and I want to subset based on the value of that column, like so:
subset(myData, THECOLUMN == someValue)
Except where both THECOLUMN and someValue are variables. Is there a syntax for passing the desired column name as a string?
subset seems to want a bareword that is the column name, not a variable that holds the column name.
Both subset and with are designed for interactive use, and warnings against their use within other functions will be found in their help pages. This stems from their strategy of evaluating arguments as expressions within an environment constructed from the names of their data arguments. These column/element names would otherwise not be "objects" in the R sense.
If THECOLUMN is the name of an object whose value is the name of the column and someValue is the name of an object whose value is the target, then you should use:
dfrm[ dfrm[[THECOLUMN]] == someValue , ]
The fact that "[[" evaluates its argument is why it is superior to "$" for programming. Using joran's example:
d <- data.frame(x = letters[1:5],y = runif(5))
THECOLUMN= "x"
someValue= "c"
d[ d[[THECOLUMN]] == someValue , ]
#   x         y
# 3 c 0.7556127
So in this case all these return the same atomic vector:
d[[ THECOLUMN ]]
d[[ 'x' ]]
d[ , 'x' ]
d[, THECOLUMN ]
d$x # of the three extraction functions: `$`, `[[`, and `[`,
# only `$` is unable to evaluate its argument
This is precisely why subset is a bad tool for anything other than interactive use:
d <- data.frame(x = letters[1:5],y = runif(5))
d[d[,'x'] == 'c',]
#   x         y
# 3 c 0.3080524
Fundamentally, extracting things in R is built around [. Use it.
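For the Shiny scenario in the question, the [[ approach can be wrapped in a function that takes the column name as a string. This is a sketch under assumptions: filter_by is a hypothetical helper name, and the use of which() is a design choice so that NA comparisons do not produce rows of NAs.

```r
# Hypothetical helper: keep the rows of `df` where the column named by
# `column` (a string) equals `value`.
filter_by <- function(df, column, value) {
  stopifnot(is.character(column), length(column) == 1L, column %in% names(df))
  hits <- which(df[[column]] == value)   # which() drops NA comparisons
  df[hits, , drop = FALSE]
}

d <- data.frame(x = letters[1:5], y = 1:5)
THECOLUMN <- "x"     # e.g. received from the web app as a string
someValue <- "c"

res <- filter_by(d, THECOLUMN, someValue)
res
```

The stopifnot check fails early when the requested column does not exist, which is useful when the name comes from user input.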
I think you could use the following one-liner:
myData[ , grep(someValue, colnames(myData))]
where
colnames(myData)
outputs a vector containing all column names and
grep(someValue, colnames(myData))
should result in a numeric vector of length 1 (given the column name is unique) pointing to your column. See ?grep for information about pattern matching in R.