I was trying to write a function that contains two actions, and the first one includes subsetting a dataframe.
Let's say I have these two dataframes, df1 and df2:
df1
  ID Var1
1  5    3
2  6    1
df2
  ID Var2
1  5    9
2  6    2
And my function is like this
mu_fuc = function(df, condition) {
  workingdf = subset(df, condition < 3)
  ### pass workingdf on to the other action
}
I am aware that for this first action I can use conditional slicing as a workaround. However, as I worked on the other action, I realized I still have to refer to a column name in a dataframe for another existing function.
I tried as.name(condition), but it did not work.
Thank you for your time. Any suggestion is highly appreciated.
---------Update on 12/26-----------
The method using eval worked well in one function, as below.
meta = function(sub_data, y) {
  y <- eval(as.list(match.call())$y, sub_data)
  workingdf <- subset(sub_data, y != 999)  ## this successfully grabs the column named y in the dataframe ##
  meta1 <- metacor(y,                      ## this successfully grabs the column named y in the dataframe ##
                   n,
                   data = workingdf,
                   studlab = workingdf$Author_year,
                   sm = "ZCOR",
                   method.tau = "SJ",
                   comb.fixed = F)
  return(meta1)
}
But somehow the same approach did not work in the following code.
mod_analysis = function(meta, moderator){
  workingdf <- meta$data
  moderator <- eval(as.list(match.call())$moderator, workingdf)
  output = metareg(meta, moderator)
  return(output)
}
Then it was this error message:
Error in eval(predvars, data, env) : object 'moderator' not found
I don't know why it worked for the first function but not the second.
One can convert the condition from a simple column reference to an expression, enabling the function argument to carry the right-hand side of an expression instead of hard-coding it into the function. This can be accomplished with a couple of functions from the rlang package, enquo() and eval_tidy().
We'll illustrate this with a subsetting function and the mtcars data frame.
aSubsetFunction <- function(df, condition){
  require(rlang)
  condition <- enquo(condition)
  rows_value <- eval_tidy(condition, df)
  stopifnot(is.logical(rows_value))
  df[rows_value, , drop = FALSE]
}
The condition <- enquo(condition) line quotes the condition expression. The eval_tidy() function evaluates the quoted expression, using df as a data mask. The output from eval_tidy(), rows_value, is a vector of logical values (TRUE / FALSE), which we use on the row dimension of the input data frame with the [ form of the extract operator. We use stopifnot() to generate an error if rows_value is not a vector of logical values.
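To make the intermediate step concrete, here is a minimal sketch (assuming only that rlang is installed; the names cond and rows_value are illustrative) showing that evaluating a quoted condition with eval_tidy() against a data frame as data mask yields a plain logical vector:
library(rlang)
cond <- quo(mpg > 25)                  # a quoted expression, analogous to enquo(condition) inside the function
rows_value <- eval_tidy(cond, mtcars)  # evaluated with mtcars as the data mask
str(rows_value)                        # logi [1:32] ... TRUE for rows where mpg > 25, FALSE otherwise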
We call the function twice to illustrate that it works with conditions on different columns of the data frame.
aSubsetFunction(mtcars,mpg > 25)
aSubsetFunction(mtcars,carb > 4)
...and the output:
> aSubsetFunction(mtcars,mpg > 25)
Loading required package: rlang
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
> aSubsetFunction(mtcars,carb > 4)
mpg cyl disp hp drat wt qsec vs am gear carb
Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
Maserati Bora 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
Using data from the original post, the solution works as follows.
df1 <- read.csv(text="ID,Var1
5,3
6,1")
df2 <- read.csv(text="ID,Var2
5,9
6,2")
aSubsetFunction(df1, Var1 < 3)
aSubsetFunction(df2, Var2 < 3)
...and the output:
> aSubsetFunction(df1, Var1 < 3)
  ID Var1
2  6    1
> aSubsetFunction(df2, Var2 < 3)
  ID Var2
2  6    2
Having illustrated the approach, we can use the order of object evaluation in R to simplify the function down to a single line of R code:
aSubsetFunction <- function(df, condition){
  require(rlang)
  df[eval_tidy(enquo(condition), df), , drop = FALSE]
}
...which produces the same output as listed above.
> aSubsetFunction(mtcars,mpg > 25)
Loading required package: rlang
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
> aSubsetFunction(mtcars,carb > 4)
mpg cyl disp hp drat wt qsec vs am gear carb
Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
Maserati Bora 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
>
Epilogue: why can't we use variable substitution with subset()?
From an initial look at the question, one might expect that we could resolve the question with the following code.
subset2 <- function(df, condition){
  subset(df, df[[condition]] > 4)
}
subset2(mtcars,carb)
However, this fails with an object not found error:
Error in (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]] else .subset2(x, :
object 'carb' not found
Once again Advanced R provides an explanation, directly quoting from the documentation for subset().
This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
Bottom line: it's important to understand Base R non-standard evaluation when writing functions to automate one's analysis because the assumptions coded into various R functions can produce unexpected results. This is especially true of modeling functions like lm() that rely on formula(), as Wickham describes in Advanced R: wrapping modeling functions.
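As one illustration of that point (not from the original post; the helper fit_one() below is hypothetical), a wrapper around lm() can sidestep these non-standard evaluation pitfalls by building the formula explicitly from character strings with reformulate():
fit_one <- function(df, response, predictor) {
  # response and predictor are plain character strings, so no quoting tricks are needed
  lm(reformulate(predictor, response), data = df)
}
fit_one(mtcars, "mpg", "wt")   # fits mpg ~ wt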
References: Advanced R, Chapter 20 section 4, Chapter 20 section 6
I guess you can try match.call with eval
mu_fuc <- function(df, condition) {
  condition <- eval(as.list(match.call())$condition, df)
  workingdf <- subset(df, condition < 3)
  workingdf
}
which enables
> mu_fuc(df1, Var1)
ID Var1
2 6 1
> mu_fuc(df2, Var2)
ID Var2
2 6 2
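For what it's worth, a small sketch of why this works (the helper show_arg() is illustrative, not part of the original answer): match.call() captures the unevaluated call, so its condition element is the bare symbol the caller typed, and eval() then looks that symbol up inside df.
show_arg <- function(df, condition) as.list(match.call())$condition
show_arg(df1, Var1)   # returns the symbol Var1, not its value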
Try this with indexing:
# Function
mu_fuc = function(df, condition) {
  workingdf <- df[df[[condition]] < 3, ]
  return(workingdf)
}
#Apply
mu_fuc(df1,'Var1')
Output:
mu_fuc(df1,'Var1')
ID Var1
2 6 1
Some data used:
#Data
df1 <- structure(list(ID = 5:6, Var1 = c(3L, 1L)), class = "data.frame",
                 row.names = c("1", "2"))
Related
I want to randomly sample a dataset. If I already have that dataset loaded, I can do something like this:
library(dplyr)
set.seed(-1)
mtcars %>% slice_sample(n = 3)
# mpg cyl disp hp drat wt qsec vs am gear carb
# AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.3 0 0 3 2
# Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.0 1 0 4 2
But my dataset is stored as a parquet file. As an example, I'll create a parquet from mtcars:
library(arrow)
# Create parquet file
write_dataset(mtcars, "~/mtcars", format = "parquet")
open_dataset("~/mtcars") %>%
slice_sample(n = 3) %>%
collect()
# Error in UseMethod("slice_sample") :
# no applicable method for 'slice_sample' applied to an object of class "c('FileSystemDataset', 'Dataset', 'ArrowObject', 'R6')"
Clearly, slice_sample isn't implemented for parquet files and neither is slice:
open_dataset("~/mtcars") %>% nrow() -> n
subsample <- sample(1:n, 3)
open_dataset("~/mtcars") %>%
slice(subsample) %>%
collect()
# Error in UseMethod("slice") :
# no applicable method for 'slice' applied to an object of class "c('FileSystemDataset', 'Dataset', 'ArrowObject', 'R6')"
Now, I know filter is implemented, so I tried that:
open_dataset("~/mtcars") %>%
filter(row_number() %in% subsample) %>%
collect()
# Error: Filter expression not supported for Arrow Datasets: row_number() %in% subsample
# Call collect() first to pull data into R.
(This also doesn't work if I create a filtering vector first, e.g., foo <- rep(FALSE, n); foo[subsample] <- TRUE and use that in filter.)
This error offers some helpful advice, though: collect the data and then subsample. The issue is that the file is ginormous. So much so that it crashes my session.
Question: is there a way to randomly subsample a parquet file before loading it with collect?
It turns out that there is an example in the documentation that pretty much fulfils my goal. That example is a smidge dated, as it uses the superseded sample_frac rather than slice_sample, but the general principle holds, so I've updated it here. As I don't know how many batches there will be, I show how it can be done with proportions, as Pace suggested, instead of pulling a fixed number of rows.
One issue with this approach is that (as far as I understand) it does require that the entire dataset be read in; it just does so in batches rather than in one go.
open_dataset("~/mtcars") %>%
map_batches(~ as_record_batch(slice_sample(as.data.frame(.), prop = 0.1))) %>%
collect()
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# 2 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
# 3 15.8 8 351 264 4.22 3.170 14.50 0 1 5 4
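If an (approximately) fixed number of rows is wanted rather than a proportion, one possible sketch (my own untested variant, assuming arrow and dplyr are loaded as above and that nrow() works on the Dataset, as in the question) is to derive the per-batch proportion from the total row count:
ds <- open_dataset("~/mtcars")
n_total <- nrow(ds)      # row count is available without collecting
target <- 3              # desired (approximate) sample size
ds %>%
  map_batches(~ as_record_batch(slice_sample(as.data.frame(.), prop = target / n_total))) %>%
  collect()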
This question already has an answer here: Indexing dataframes in R
I am working on an R function. I have the following code:
best <- function(state, outcome) {
  ## Read outcome data
  data <- read.csv("outcome-of-care-measures.csv")
  # rename some of the columns
  names(data)[names(data) == "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack"] <- "heart_attack_rate"
  names(data)[names(data) == "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure"] <- "heart_failure_rate"
  names(data)[names(data) == "Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia"] <- "pneumonia_rate"
  print(names(data))
  states <- unique(data$State)
  ## check that the state and the outcome are valid
  outcomes <- c("heart attack", "heart failure", "pneumonia")
  if(!is.element(state, states)) {
    stop("invalid state")
  }
  if(!is.element(outcome, outcomes)) {
    stop("invalid outcome")
  }
  ## Return hospital name in that state with the lowest 30-day death rate
  newdata <- NULL
  if(outcome == "heart attack") {
    newdata <- subset(data, State == state & heart_attack_rate != "Not Available", select = c(Hospital.Name, State, heart_attack_rate))
    sorted_data <- newdata[order(heart_attack_rate), ]
  } else if(outcome == "heart failure") {
    newdata <- subset(data, State == state & heart_failure_rate != "Not Available", select = c(Hospital.Name, State, heart_failure_rate))
    sorted_data <- newdata[order(heart_failure_rate), ]
  } else {
    newdata <- subset(data, State == state & pneumonia_rate != "Not Available", select = c(Hospital.Name, State, pneumonia_rate))
    sorted_data <- newdata[order(pneumonia_rate), ]
  }
}
The above function takes a state and outcome as parameters. Depending on these parameters, I am making a subset of the original data frame. I have renamed some of the columns in the data frame in order for the names to be more readable.
I want to sort the data frame by the values of the columns heart_attack_rate, heart_failure_rate and pneumonia_rate. For example, this is done in the following line:
sorted_data <- newdata[order(heart_attack_rate), ]
However, when I run the function with the following inputs:
best("TX", "heart attack")
I get the following error:
Error in order(heart_attack_rate) : object 'heart_attack_rate' not found
I am not sure why I am getting this error or how to resolve it. Any insights are appreciated.
I think that makes complete sense: since newdata is a data.frame, there is no object named heart_attack_rate visible on its own anywhere in the function; the column exists only inside the data frame.
Consider this example using built-in mtcars dataset
mtcars[order(cyl), ]
Error in order(cyl) : object 'cyl' not found
You need to refer to the column using $:
mtcars[order(mtcars$cyl), ]
# mpg cyl disp hp drat wt qsec vs am gear carb
#Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#....
Or use with():
mtcars[with(mtcars, order(cyl)), ]
As a side note, if the data were a data.table, your attempt would have worked.
library(data.table)
df <- mtcars
setDT(df)
df[order(cyl),]
#Or
#df[order(cyl)]
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# 2: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 3: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# 4: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# 5: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#....
One of the quirks of subsetting a data frame is that you have to repeatedly type the name of that data frame when mentioning columns. For example, the data frame cars is mentioned 3 times here:
cars[cars$speed == 4 & cars$dist < 10, ]
## speed dist
## 1 4 2
The data.table package solves this.
library(data.table)
dt_cars <- as.data.table(cars)
dt_cars[speed == 4 & dist < 10]
As does dplyr.
library(dplyr)
cars %>% filter(speed == 4, dist < 10)
I'd like to know if a solution exists for standard-issue data.frames (that is, not resorting to data.table or dplyr).
I think I'm looking for something like
cars[MAGIC(speed == 4 & dist < 10), ]
or
MAGIC(cars[speed == 4 & dist < 10, ])
where MAGIC is to be determined.
I tried the following, but it gave me an error.
library(rlang)
cars[locally(speed == 4 & dist < 10), ]
# Error in locally(speed == 4 & dist < 10) : object 'speed' not found
1) subset This only requires that cars be mentioned once. No packages are used.
subset(cars, speed == 4 & dist < 10)
## speed dist
## 1 4 2
2) sqldf This uses a package but does not use dplyr or data.table which were the only two packages excluded by the question:
library(sqldf)
sqldf("select * from cars where speed = 4 and dist < 10")
## speed dist
## 1 4 2
3) assignment Not sure if this counts but you could assign cars to some other variable name such as . and then use that. In that case cars would only be mentioned once. This uses no packages.
. <- cars
.[.$speed == 4 & .$dist < 10, ]
## speed dist
## 1 4 2
or
. <- cars
with(., .[speed == 4 & dist < 10, ])
## speed dist
## 1 4 2
With respect to these two solutions you might want to check out this article on the Bizarro Pipe: http://www.win-vector.com/blog/2017/01/using-the-bizarro-pipe-to-debug-magrittr-pipelines-in-r/
4) magrittr This could also be expressed in magrittr and that package was not excluded by the question. Note we are using the magrittr %$% operator:
library(magrittr)
cars %$% .[speed == 4 & dist < 10, ]
## speed dist
## 1 4 2
subset is the base function that solves this problem. However, like all base R functions that use non-standard evaluation, subset does not perform fully hygienic code expansion, so subset() can evaluate the wrong variable when used within non-global scopes (such as inside lapply() loops).
As an example, here we define the variable var in two places: first in the global scope with value 40, then in a local scope with value 30. The use of local() here is for simplicity; this would behave equivalently inside a function. Intuitively, we would expect subset to use the value 30 in the evaluation. However, upon executing the following code we see that the value 40 is used instead (so no rows are returned).
var <- 40
local({
  var <- 30
  dfs <- list(mtcars, mtcars)
  lapply(dfs, subset, mpg > var)
})
#> [[1]]
#> [1] mpg cyl disp hp drat wt qsec vs am gear carb
#> <0 rows> (or 0-length row.names)
#>
#> [[2]]
#> [1] mpg cyl disp hp drat wt qsec vs am gear carb
#> <0 rows> (or 0-length row.names)
This happens because the parent.frame() used in subset() is the environment within the body of lapply() rather than the local block. Because all environments eventually inherit from the global environment, the variable var is found there with value 40.
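A tiny sketch of that lookup chain (illustrative, not from the original answer; lookup_var() is a hypothetical helper that mimics subset()'s eval(e, x, parent.frame())):
var <- 40
lookup_var <- function(...) eval(quote(var), mtcars, parent.frame())  # data mask first, then the caller's frame
local({
  var <- 30
  lapply(1, lookup_var)   # returns 40, not 30: parent.frame() here is lapply()'s frame, which falls through to the global environment
})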
Hygienic variable expansion via quasiquotation (as implemented in the rlang package) solves this problem. We can define a variant of subset using tidy evaluation that works properly in all contexts. The code is derived from and largely identical to that of base::subset.data.frame().
subset2 <- function (x, subset, select, drop = FALSE, ...) {
  r <- if (missing(subset))
    rep_len(TRUE, nrow(x))
  else {
    r <- rlang::eval_tidy(rlang::enquo(subset), x)
    if (!is.logical(r))
      stop("'subset' must be logical")
    r & !is.na(r)
  }
  vars <- if (missing(select))
    TRUE
  else {
    nl <- as.list(seq_along(x))
    names(nl) <- names(x)
    rlang::eval_tidy(rlang::enquo(select), nl)
  }
  x[r, vars, drop = drop]
}
This version of subset behaves identically to base::subset.data.frame().
subset2(mtcars, gear > 4, disp:wt)
#> disp hp drat wt
#> Porsche 914-2 120.3 91 4.43 2.140
#> Lotus Europa 95.1 113 3.77 1.513
#> Ford Pantera L 351.0 264 4.22 3.170
#> Ferrari Dino 145.0 175 3.62 2.770
#> Maserati Bora 301.0 335 3.54 3.570
However subset2() does not suffer the scoping issues of subset. In our previous example the value 30 is used for var, as we would expect from lexical scoping rules.
local({
  var <- 30
  dfs <- list(mtcars, mtcars)
  lapply(dfs, subset2, mpg > var)
})
#> [[1]]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#>
#> [[2]]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
This allows non-standard evaluation to be used robustly in all contexts, not just in top level contexts as with previous approaches.
This makes functions which use non-standard evaluation much more useful. Previously, while they were nice to have for interactive use, you needed to fall back on more verbose standard-evaluation functions when writing functions and packages. Now the same function can be used in all contexts without needing to modify the code!
For more details on non-standard evaluation please see Lionel Henry's Tidy evaluation (hygienic fexprs) presentation, the rlang vignette on tidy evaluation and the programming with dplyr vignette.
I understand I'm totally cheating, but technically it works :):
with(cars, data.frame(speed=speed,dist=dist)[speed == 4 & dist < 10,])
# speed dist
# 1 4 2
More horror:
`[` <- function(x, i, j){
  rm(`[`, envir = parent.frame())
  eval(parse(text = paste0("with(x,x[", deparse(substitute(i)), ",])")))
}
cars[speed == 4 & dist < 10, ]
# speed dist
# 1 4 2
A solution overriding the [ method for data.frame: in the new method we check the class of the i argument and, if it is an expression or a formula, we evaluate it in the data.frame context.
##### override subsetting method
`[.data.frame` = function (x, i, j, ...) {
  if(!missing(i) && (is.language(i) || is.symbol(i) || inherits(i, "formula"))) {
    if(inherits(i, "formula")) i = as.list(i)[[2]]
    i = eval(i, x, enclos = baseenv())
  }
  base::`[.data.frame`(x, i, j, ...)
}
#####
data(cars)
cars[cars$speed == 4 & cars$dist < 10, ]
# speed dist
# 1 4 2
# cars[speed == 4 & dist < 10, ] # error
cars[quote(speed == 4 & dist < 10),]
# speed dist
# 1 4 2
# ,or
cars[~ speed == 4 & dist < 10,]
# speed dist
# 1 4 2
Another solution with more magic. Please restart your R session to avoid interference with the previous solution:
locally = function(expr){
  curr_call = as.list(sys.call(1))
  if(as.character(curr_call[[1]]) == "["){
    possibly_df = eval(curr_call[[2]], parent.frame())
    if(is.data.frame(possibly_df)){
      expr = substitute(expr)
      expr = eval(expr, possibly_df, enclos = baseenv())
    }
  }
  expr
}
cars[locally(speed == 4 & dist < 10), ]
# speed dist
# 1 4 2
Using attach()
attach(cars)
cars[speed == 4 & dist < 10,]
# speed dist
# 1 4 2
Very early in my R learning I was dissuaded from using attach(), but as long as you're careful not to introduce name conflicts, I think it should be OK.
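If you do go the attach() route, one safer pattern (an aside, not part of the original answer; subset_cars() is a hypothetical helper) is to confine it to a function and detach on exit so the search path is not left polluted:
subset_cars <- function() {
  attach(cars)
  on.exit(detach(cars))   # always clean up the search path, even on error
  cars[speed == 4 & dist < 10, ]
}
subset_cars()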
It seems to me that subset and filter (from dplyr) give the same result.
But my question is: is there a potential difference at some point, e.g. in speed or the data sizes they can handle? Are there occasions when it is better to use one or the other?
Example:
library(dplyr)
df1<-subset(airquality, Temp>80 & Month > 5)
df2<-filter(airquality, Temp>80 & Month > 5)
summary(df1$Ozone)
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 9.00 39.00 64.00 64.51 84.00 168.00 14
summary(df2$Ozone)
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 9.00 39.00 64.00 64.51 84.00 168.00 14
They are, indeed, producing the same result, and they are very similar in concept.
The advantage of subset is that it is part of base R and doesn't require any additional packages. With small sample sizes, it seems to be a bit faster than filter (6 times faster in your example, but that's measured in microseconds).
As the data sets grow, filter seems to gain the upper hand in efficiency. At 15,000 records, filter outpaces subset by about 300 microseconds. And at 153,000 records, filter is three times faster (measured in milliseconds).
So in terms of human time, I don't think there's much difference between the two.
The other advantage (and this is a bit of a niche advantage) is that filter can operate on SQL databases without pulling the data into memory. subset simply doesn't do that.
Personally, I tend to use filter, but only because I'm already using the dplyr framework. If you aren't working with out-of-memory data, it won't make much of a difference.
library(dplyr)
library(microbenchmark)
# Original example
microbenchmark(
df1<-subset(airquality, Temp>80 & Month > 5),
df2<-filter(airquality, Temp>80 & Month > 5)
)
Unit: microseconds
expr min lq mean median uq max neval cld
subset 95.598 107.7670 118.5236 119.9370 125.949 167.443 100 a
filter 551.886 564.7885 599.4972 571.5335 594.993 2074.997 100 b
# 15,300 rows
air <- lapply(1:100, function(x) airquality) %>% bind_rows
microbenchmark(
df1<-subset(air, Temp>80 & Month > 5),
df2<-filter(air, Temp>80 & Month > 5)
)
Unit: microseconds
expr min lq mean median uq max neval cld
subset 1187.054 1207.5800 1293.718 1216.671 1257.725 2574.392 100 b
filter 968.586 985.4475 1056.686 1023.862 1036.765 2489.644 100 a
# 153,000 rows
air <- lapply(1:1000, function(x) airquality) %>% bind_rows
microbenchmark(
df1<-subset(air, Temp>80 & Month > 5),
df2<-filter(air, Temp>80 & Month > 5)
)
Unit: milliseconds
expr min lq mean median uq max neval cld
subset 11.841792 13.292618 16.21771 13.521935 13.867083 68.59659 100 b
filter 5.046148 5.169164 10.27829 5.387484 6.738167 65.38937 100 a
One additional difference not yet mentioned is that filter discards rownames, while subset doesn't:
filter(mtcars, gear == 5)
mpg cyl disp hp drat wt qsec vs am gear carb
1 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
2 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
3 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
4 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
5 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
subset(mtcars, gear == 5)
mpg cyl disp hp drat wt qsec vs am gear carb
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
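If the row names matter, one common workaround (an aside, not from the original answer) is to copy them into a real column before filtering, for example with tibble::rownames_to_column():
library(dplyr)
library(tibble)
mtcars %>% rownames_to_column("car") %>% filter(gear == 5)   # row names preserved in the "car" column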
In the main use cases they behave the same:
library(dplyr)
identical(
filter(starwars, species == "Wookiee"),
subset(starwars, species == "Wookiee"))
# [1] TRUE
But they have quite a few differences, including (I was as exhaustive as possible but might have missed some):
subset can be used on matrices
filter can be used on databases
filter drops row names
subset drops attributes other than class, names and row names
subset has a select argument
subset recycles its condition argument
filter supports conditions as separate arguments
filter preserves the class of the column
filter supports the .data pronoun
filter supports some rlang features
filter supports grouping
filter supports n() and row_number()
filter is stricter
filter is a bit faster when it counts
subset has methods in other packages
subset can be used on matrices
subset(state.x77, state.x77[,"Population"] < 400)
# Population Income Illiteracy Life Exp Murder HS Grad Frost Area
# Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
# Wyoming 376 4566 0.6 70.29 6.9 62.9 173 97203
Though columns can't be used directly as variables in the subset argument
subset(state.x77, Population < 400)
Error in subset.matrix(state.x77, Population < 400) : object
'Population' not found
Neither works with filter
filter(state.x77, state.x77[,"Population"] < 400)
Error in UseMethod("filter_") : no applicable method for 'filter_'
applied to an object of class "c('matrix', 'double', 'numeric')"
filter(state.x77, Population < 400)
Error in UseMethod("filter_") : no applicable method for 'filter_'
applied to an object of class "c('matrix', 'double', 'numeric')"
filter can be used on databases
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
tbl(con,"mtcars") %>%
filter(hp < 65)
# # Source: lazy query [?? x 11]
# # Database: sqlite 3.19.3 [:memory:]
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
subset can't
tbl(con,"mtcars") %>%
subset(hp < 65)
Error in subset.default(., hp < 65) : object 'hp' not found
filter drops row names
filter(mtcars, hp < 65)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
subset doesn't
subset(mtcars, hp < 65)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
subset drops attributes other than class, names and row names
cars_head <- head(cars)
attr(cars_head, "info") <- "head of cars dataset"
attributes(subset(cars_head, speed > 0))
#> $names
#> [1] "speed" "dist"
#>
#> $row.names
#> [1] 1 2 3 4 5 6
#>
#> $class
#> [1] "data.frame"
attributes(filter(cars_head, speed > 0))
#> $names
#> [1] "speed" "dist"
#>
#> $row.names
#> [1] 1 2 3 4 5 6
#>
#> $class
#> [1] "data.frame"
#>
#> $info
#> [1] "head of cars dataset"
subset has a select argument
dplyr, by contrast, follows the tidyverse principle of having each function do one thing, so select is a separate function.
identical(
subset(starwars, species == "Wookiee", select = c("name", "height")),
filter(starwars, species == "Wookiee") %>% select(name, height)
)
# [1] TRUE
It also has a drop argument, which mostly makes sense when combined with the select argument.
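For illustration (a minimal example, not in the original answer): with drop = TRUE and a single selected column, subset() returns a vector instead of a one-column data frame.
subset(mtcars, gear == 5, select = mpg, drop = TRUE)
# [1] 26.0 30.4 15.8 19.7 15.0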
subset recycles its condition argument
half_iris <- subset(iris,c(TRUE,FALSE))
dim(iris) # [1] 150 5
dim(half_iris) # [1] 75 5
filter doesn't
half_iris <- filter(iris,c(TRUE,FALSE))
Error in filter_impl(.data, quo) : Result must have length 150, not 2
filter supports conditions as separate arguments
Conditions are fed to ..., so we can have several conditions as separate arguments, which is the same as using & but is sometimes more readable thanks to logical operator precedence and automatic indentation.
identical(
subset(starwars,
(species == "Wookiee" | eye_color == "blue") &
mass > 120),
filter(starwars,
species == "Wookiee" | eye_color == "blue",
mass > 120)
)
filter preserves the class of the column
df <- data.frame(a=1:2, b = 3:4, c= 5:6)
class(df$a) <- "foo"
class(df$b) <- "Date"
# subset preserves the Date, but strips the "foo" class
str(subset(df,TRUE))
#> 'data.frame': 2 obs. of 3 variables:
#> $ a: int 1 2
#> $ b: Date, format: "1970-01-04" "1970-01-05"
#> $ c: int 5 6
# filter keeps both
str(dplyr::filter(df,TRUE))
#> 'data.frame': 2 obs. of 3 variables:
#> $ a: 'foo' int 1 2
#> $ b: Date, format: "1970-01-04" "1970-01-05"
#> $ c: int 5 6
filter supports the use of the .data pronoun
mtcars %>% filter(.data[["hp"]] < 65)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
filter supports some rlang features
x <- "hp"
library(rlang)
mtcars %>% filter(!!sym(x) < 65)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
filter65 <- function(data, var){
  data %>% filter(!!enquo(var) < 65)
}
mtcars %>% filter65(hp)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
filter supports grouping
iris %>%
group_by(Species) %>%
filter(Petal.Length < quantile(Petal.Length,0.01))
# # A tibble: 3 x 5
# # Groups: Species [3]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 4.6 3.6 1.0 0.2 setosa
# 2 5.1 2.5 3.0 1.1 versicolor
# 3 4.9 2.5 4.5 1.7 virginica
iris %>%
group_by(Species) %>%
subset(Petal.Length < quantile(Petal.Length,0.01))
# # A tibble: 2 x 5
# # Groups: Species [1]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 4.3 3.0 1.1 0.1 setosa
# 2 4.6 3.6 1.0 0.2 setosa
filter supports n() and row_number()
filter(iris, row_number() < n()/30)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
filter is stricter
It triggers errors if the input is suspicious.
filter(iris, Species = "setosa")
# Error: `Species` (`Species = "setosa"`) must not be named, do you need `==`?
identical(subset(iris, Species = "setosa"), iris)
# [1] TRUE
df1 <- setNames(data.frame(a = 1:3, b=5:7),c("a","a"))
# df1
# a a
# 1 1 5
# 2 2 6
# 3 3 7
filter(df1, a > 2)
#Error: Column `a` must have a unique name
subset(df1, a > 2)
# a a.1
# 3 3 7
filter is a bit faster when it counts
Borrowing the dataset that Benjamin built in his answer (153k rows), filter is about twice as fast, though this should rarely be a bottleneck.
air <- lapply(1:1000, function(x) airquality) %>% bind_rows
microbenchmark::microbenchmark(
subset = subset(air, Temp>80 & Month > 5),
filter = filter(air, Temp>80 & Month > 5)
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# subset 8.771962 11.551255 19.942501 12.576245 13.933290 108.0552 100 b
# filter 4.144336 4.686189 8.024461 6.424492 7.499894 101.7827 100 a
subset has methods in other packages
subset is an S3 generic, just as dplyr::filter is, but subset, being a base function, is more likely to have methods developed in other packages; one prominent example is zoo:::subset.zoo.
Interesting. I was trying to see the difference in terms of the resulting dataset, and I couldn't find an explanation for why the "[" operator behaved differently (i.e., why it also returned NAs):
# Subset for year=2013
sub<-brfss2013 %>% filter(iyear == "2013")
dim(sub)
#[1] 486088 330
length(which(is.na(sub$iyear))==T)
#[1] 0
sub2<-filter(brfss2013, iyear == "2013")
dim(sub2)
#[1] 486088 330
length(which(is.na(sub2$iyear))==T)
#[1] 0
sub3<-brfss2013[brfss2013$iyear=="2013", ]
dim(sub3)
#[1] 486093 330
length(which(is.na(sub3$iyear))==T)
#[1] 5
sub4<-subset(brfss2013, iyear=="2013")
dim(sub4)
#[1] 486088 330
length(which(is.na(sub4$iyear))==T)
#[1] 0
Another difference is that subset does more than filter: you can also select and drop columns with it, whereas dplyr uses two separate functions for that:
subset(df, select=c("varA", "varD"))
dplyr::select(df,varA, varD)
An additional advantage of filter is that it plays nice with grouped data. subset ignores groupings.
So when the data is grouped, subset will still make reference to the whole data, but filter will only reference the group.
# setup
library(tidyverse)
data.frame(a = 1:2) %>% group_by(a) %>% subset(length(a) == 1)
# returns empty table
data.frame(a = 1:2) %>% group_by(a) %>% filter(length(a) == 1)
# returns all rows
As the subset() manual states:
Warning: This is a convenience function intended for use interactively
I learned from this great article not only the secret behind this warning, but also gained a good understanding of substitute(), match.call(), eval(), quote(), calls, promises and other related R topics, which are a little bit complicated.
Now I understand what the warning above is for. A super-simple implementation of subset() could be as follows:
subset = function(x, condition) x[eval(substitute(condition), envir=x),]
While subset(mtcars, cyl==4) returns the table of rows in mtcars that satisfy cyl==4, enveloping subset() in another function fails:
sub = function(x, condition) subset(x, condition)
sub(mtcars, cyl == 4)
# Error in eval(expr, envir, enclos) : object 'cyl' not found
Using the original version of subset() also produces exactly the same error. This is due to a limitation of the substitute()-eval() pair: it works fine while condition is cyl==4, but when the condition is passed through the enveloping function sub(), the condition argument of subset() is no longer cyl==4 but the nested condition symbol from the sub() body, so the eval() fails; it's a bit complicated.
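A small sketch of what substitute() actually captures in the nested case (the helpers capture() and wrapper() are illustrative, not part of the original post):
capture <- function(x, condition) substitute(condition)
wrapper <- function(x, condition) capture(x, condition)
capture(mtcars, cyl == 4)   # returns cyl == 4 (the expression we want)
wrapper(mtcars, cyl == 4)   # returns condition (only the inner argument's symbol is seen)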
But does any other implementation of subset() exist, with exactly the same arguments, that would be programming-safe, i.e. able to evaluate its condition while being called by another function?
The [ function is what you're looking for; see ?"[". mtcars[mtcars$cyl == 4, ] is equivalent to the subset command and is "programming" safe.
sub = function(x, condition) {
  x[condition, ]
}
sub(mtcars, mtcars$cyl==4)
This does what you're asking without the implicit with() in the function call. The specifics are complicated; however, a function like:
sub = function(x, quoted_condition) {
  x[with(x, eval(parse(text = quoted_condition))), ]
}
sub(mtcars, 'cyl==4')
Sorta does what you're looking for, but there are edge cases where this will have unexpected results.
Using data.table and its [ subsetting function, you can get the implicit with(...) you're looking for.
library(data.table)
MT = data.table(mtcars)
MT[cyl==4]
There are better, faster ways to do this subsetting in data.table, but this illustrates the point well.
Using data.table, you can also construct expressions to be evaluated later:
cond = expression(cyl==4)
MT[eval(cond)]
These two can now be passed through functions:
wrapper = function(DT, condition) {
  DT[eval(condition)]
}
Here's an alternative version of subset() which continues to work even when it's nested -- at least as long as the logical subsetting expression (e.g. cyl == 4) is supplied to the top-level function call.
It works by climbing up the call stack, substitute()ing at each step to ultimately capture the logical subsetting expression passed in by the user. In the call to sub2() below, for example, the for loop works up the call stack from expr to x to AA and finally to cyl == 4.
SUBSET <- function(`_dat`, expr) {
  ff <- sys.frames()
  ex <- substitute(expr)
  ii <- rev(seq_along(ff))
  for(i in ii) {
    ex <- eval(substitute(substitute(x, env = sys.frames()[[n]]),
                          env = list(x = ex, n = i)))
  }
  `_dat`[eval(ex, envir = `_dat`), ]
}
## Define test functions that nest SUBSET() more and more deeply
sub <- function(x, condition) SUBSET(x, condition)
sub2 <- function(AA, BB) sub(AA, BB)
## Show that it works, at least when the top-level function call
## contains the logical subsetting expression
a <- SUBSET(mtcars, cyl == 4) ## Direct call to SUBSET()
b <- sub(mtcars, cyl == 4) ## SUBSET() called one level down
c <- sub2(mtcars, cyl == 4) ## SUBSET() called two levels down
identical(a,b)
# [1] TRUE
identical(a,c)
# [1] TRUE
a[1:5,]
# mpg cyl disp hp drat wt qsec vs am gear carb
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
** For some explanation of the construct inside the for loop, see Section 6.2, paragraph 6 of the R Language Definition manual.
Just because it's such mind-bending fun (??), here is a slightly different solution that addresses a problem Hadley pointed to in comments to my accepted solution.
Hadley posted a gist demonstrating a situation in which my accepted function goes awry. The twist in that example (copied below) is that a symbol passed to SUBSET() is defined in the body (rather than the arguments) of one of the calling functions; it thus gets captured by substitute() instead of the intended global variable. Confusing stuff, I know.
f <- function() {
  cyl <- 4
  g()
}
g <- function() {
  SUBSET(mtcars, cyl == 4)$cyl
}
f()
Here is a better function that will only substitute the values of symbols found in calling functions' argument lists. It works in all of the situations that Hadley or I have so far proposed.
SUBSET <- function(`_dat`, expr) {
  ff <- sys.frames()
  n <- length(ff)
  ex <- substitute(expr)
  ii <- seq_len(n)
  for(i in ii) {
    ## 'which' is the frame number, and 'n' is # of frames to go back.
    margs <- as.list(match.call(definition = sys.function(n - i),
                                call = sys.call(sys.parent(i))))[-1]
    ex <- eval(substitute(substitute(x, env = ll),
                          env = list(x = ex, ll = margs)))
  }
  `_dat`[eval(ex, envir = `_dat`), ]
}
## Works in Hadley's counterexample ...
f()
# [1] 4 4 4 4 4 4 4 4 4 4 4
## ... and in my original test cases.
sub <- function(x, condition) SUBSET(x, condition)
sub2 <- function(AA, BB) sub(AA, BB)
a <- SUBSET(mtcars, cyl == 4) ## Direct call to SUBSET()
b <- sub(mtcars, cyl == 4) ## SUBSET() called one level down
c <- sub2(mtcars, cyl == 4)
all(identical(a, b), identical(b, c))
# [1] TRUE
IMPORTANT: Please note that this still is not (nor can it be made into) a generally useful function. There's simply no way for the function to know which symbols you want it to use in all of the substitutions it performs as it works up the call stack. There are many situations in which users would want it to use the values of symbols assigned to within function bodies, but this function will always ignore those.
Update:
Here is a new version which fixes two problems:
a) The previous version simply traversed sys.frames() backwards. This version follows the chain of parent frames (sys.parents()) until it reaches .GlobalEnv. This is important in, e.g., subscramble, where scramble's frame should be ignored.
b) This version has a single substitute per level. This prevents the second substitute call from substituting symbols from one level higher that were introduced by the first substitute call.
subset <- function(x, condition) {
  call <- substitute(condition)
  frames <- sys.frames()
  parents <- sys.parents()
  # starting one frame up, keep climbing until we get to .GlobalEnv
  i <- tail(parents, 1)
  while(i != 0) {
    f <- sys.frames()[[i]]
    # copy x into f, except for variables with conflicting names
    xnames <- setdiff(ls(x), ls(f))
    for (n in xnames) assign(n, x[[n]], envir = f)
    call <- eval(substitute(substitute(expr, f), list(expr = call)))
    # leave f the way we found it
    rm(list = xnames, envir = f)
    i <- parents[i]
  }
  r <- eval(call, x, .GlobalEnv)
  x[r, ]
}
This version passes Hadley's test from the comments:
mtcars$condition <- 4; subscramble(mtcars, cyl == 4)
Unfortunately the following two examples now behave differently:
cyl <- 6; subset(mtcars, cyl==4)
local({cyl <- 6; subset(mtcars, cyl==4)})
This is a slight modification of Josh's first function. At each frame in the stack, we substitute from x before substituting from the frame. This means that symbols in the data frame take precedence at every step. We can avoid pseudo-gensyms like _dat by skipping subset's frame in the for loop.
subset <- function(x, condition) {
  call <- substitute(condition)
  frames <- rev(sys.frames())[-1]
  for(f in frames) {
    call <- eval(substitute(substitute(expr, x), list(expr = call)))
    call <- eval(substitute(substitute(expr, f), list(expr = call)))
  }
  r <- eval(call, x, .GlobalEnv)
  x[r, ]
}
This version works in the simple case (it's worth checking that we haven't had a regression):
subset(mtcars, cyl == 4)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
It also works with subscramble and f:
scramble <- function(x) x[sample(nrow(x)), ]
subscramble <- function(x, condition) scramble(subset(x, condition))
subscramble(mtcars, cyl == 4) $ cyl
# [1] 4 4 4 4 4 4 4 4 4 4 4
f <- function() {cyl <- 4; g()}
g <- function() subset(mtcars, cyl == 4) $ cyl
g()
# [1] 4 4 4 4 4 4 4 4 4 4 4
And even works in some trickier situations:
gear5 <- function(z, condition) {
  x <- 5
  subset(z, condition & (gear == x))
}
x <- 4
gear5(mtcars, cyl == x)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
The lines inside the for loop might require some explanation. Suppose call is assigned as follows:
call <- quote(y == x)
str(call)
# language y == x
We want to substitute the value 4 for x in call. But the straightforward way doesn't work, since we want the contents of call, not the symbol call.
substitute(call, list(x=4))
# call
So we build the expression we need, using another substitute call.
substitute(substitute(expr, list(x=4)), list(expr=call))
# substitute(y == x, list(x = 4))
Now we have a language object that describes what we want to do. All that's left is to actually do it:
eval(substitute(substitute(expr, list(x=4)), list(expr=call)))
# y == 4