Generate summarise function parameters using a loop - r

I have data that looks like this:
data = data.frame(GENDER = c("1", "1", "1", "2", "2"),
ZSCORE_0 = c(12.12, 12.67, 13.72, 13.79, 14.78),
ZSCORE_3 = ...,
ZSCORE_6 = ...,
...
ZSCORE_60 = ...)
I tried summarizing this data using the summarise function from the dplyr package, but the problem is that there are too many parameters to write out.
For example:
data %>%
group_by(GENDER) %>%
summarise(MIN_ZSCORE_0 = min(ZSCORE_0),
MIN_ZSCORE_3 = min(ZSCORE_3),
...,
MIN_ZSCORE_60 = min(ZSCORE_60),
MAX_ZSCORE_0 = max(ZSCORE_0),
MAX_ZSCORE_3 = max(ZSCORE_3),
...,
MAX_ZSCORE_60 = max(ZSCORE_60),
MEAN,
MEDIAN,
n,
...)
I want to simplify this work, so I used a loop to create the parameters:
interval = seq(3, 60, 3)
data %>%
group_by(GENDER) %>%
summarise(for (i in interval) {
target = paste0("ZSCORE_", i)
min(target)
max(target)
...
n(target)
})
However, it does not work:
Error: Column `for (... in NULL) NULL` is of unsupported type NULL

You cannot use a loop inside summarise(). Instead, try summarise_all():
require(tidyverse)
mtcars %>%
summarise_all(c("min", "max"))
Result:
mpg_min cyl_min disp_min hp_min drat_min wt_min qsec_min
1 10.4 4 71.1 52 2.76 1.513 14.5
vs_min am_min gear_min carb_min mpg_max cyl_max disp_max
1 0 0 3 1 33.9 8 472
hp_max drat_max wt_max qsec_max vs_max am_max gear_max
1 335 4.93 5.424 22.9 1 1 5
carb_max
1 8
Edit
There is a problem using n() inside summarise_all/summarise_if: it automatically tries to force the argument na.rm = TRUE into n(), which in turn raises an error, as n() doesn't have this argument. However, you can use this hack (taken from here):
require(tidyverse)
mtcars %>%
summarise_if(is.numeric, c("min", "max")) %>%
cbind(summarise_if(mtcars, is.numeric, funs(n())))
Result:
mpg_min cyl_min disp_min hp_min drat_min wt_min qsec_min
1 10.4 4 71.1 52 2.76 1.513 14.5
vs_min am_min gear_min carb_min mpg_max cyl_max disp_max
1 0 0 3 1 33.9 8 472
hp_max drat_max wt_max qsec_max vs_max am_max gear_max
1 335 4.93 5.424 22.9 1 1 5
carb_max mpg cyl disp hp drat wt qsec vs am gear carb
1 8 32 32 32 32 32 32 32 32 32 32 32
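Worth noting as an aside (not part of the original answer): in dplyr 1.0.0 and later, summarise_all()/summarise_if() are superseded by across(), which combines cleanly with n() and avoids the cbind() hack above. A sketch of the same min/max/count summary:

```r
library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  summarise(
    across(where(is.numeric), list(min = min, max = max)),
    n = n(),
    .groups = "drop"
  )
```

The list names (min, max) become the suffixes of the output columns, matching the mpg_min/mpg_max naming shown above.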

Related

How to move R code into functions to generalise behaviour

I have a huge messy piece of R code with loads of ugly repetition. There is an opportunity to massively reduce it. Starting with this piece of code:
table <-
risk_assigned %>%
group_by(rental_type, room_type) %>%
summarise_all(funs( sum(!is.na(.)) / length(.) ) ) %>%
select(-c(device_id, ts, room, hhi, temp)) %>%
adorn_pct_formatting()
I would like to generalise it into a function so it can be reused.
LayKable = function(kableDetails) {
table <-
risk_assigned %>%
group_by(kableDetails$group1 , kableDetails$group2) %>%
summarise_all(funs( sum(!is.na(.)) / length(.) ) ) #%>%
select(-c(device_id, ts, room, hhi, temp)) %>%
adorn_pct_formatting()
...
kable <- table
return(kable)
}
kableDetails <- list(
group1 = "rental_type",
group2 = "room_type"
)
newKable <- LayKable(kableDetails)
This rather half-hearted attempt serves to explain what I want to do. How can I pass stuff into this function inside a list (I'm a C programmer, pretending it's a struct)?
When passing function arguments to a dplyr verb inside a function, you have to use rlang terms. But it should be simple to define a function you can pass a number of grouping terms to:
library(dplyr)
test_func <- function(..., data = mtcars) {
# Passing `data` as a default argument as it's nice to be flexible!
data %>%
group_by(!!!enquos(...)) %>%
summarise(across(.fns = sum), .groups = "drop")
}
test_func(cyl, gear)
#> # A tibble: 8 x 11
#> cyl gear mpg disp hp drat wt qsec vs am carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 3 21.5 120. 97 3.7 2.46 20.0 1 0 1
#> 2 4 4 215. 821 608 32.9 19.0 157. 8 6 12
#> 3 4 5 56.4 215. 204 8.2 3.65 33.6 1 2 4
#> 4 6 3 39.5 483 215 5.84 6.68 39.7 2 0 2
#> 5 6 4 79 655. 466 15.6 12.4 70.7 2 2 16
#> 6 6 5 19.7 145 175 3.62 2.77 15.5 0 1 6
#> 7 8 3 181. 4291. 2330 37.4 49.2 206. 0 0 37
#> 8 8 5 30.8 652 599 7.76 6.74 29.1 0 2 12
Update - adding a list
I see your ideal would be to write a list of arguments for each function call and pass these, rather than write out the arguments in each call. You can do this using do.call to pass a list of named arguments to a function. Again, when using dplyr verbs you can quote variable names while constructing your list (so that R doesn't try to find them in the global environment when compiling the list) and !!enquo() each one inside the function to use them there:
library(dplyr)
test_func2 <- function(.summary_var, .group_var, data = mtcars) {
data %>%
group_by(!!enquo(.group_var)) %>%
summarise(mean = mean(!!enquo(.summary_var)))
}
# Test with bare arguments
test_func2(hp, cyl)
#> # A tibble: 3 x 2
#> cyl mean
#> <dbl> <dbl>
#> 1 4 82.6
#> 2 6 122.
#> 3 8 209.
# Construct and pass list
args <- list(.summary_var = quote(hp), .group_var = quote(cyl))
do.call(test_func2, args = args)
#> # A tibble: 3 x 2
#> cyl mean
#> <dbl> <dbl>
#> 1 4 82.6
#> 2 6 122.
#> 3 8 209.
There is a handy guide to tidy evaluation where most of these ideas are explained more clearly.
Created on 2021-12-21 by the reprex package (v2.0.1)
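As a further sketch (the names summarise_by and groups are illustrative, not from the question): if the list holds the grouping columns as strings, across(all_of()) lets you skip quosures entirely:

```r
library(dplyr)

summarise_by <- function(details, data = mtcars) {
  data %>%
    # all_of() resolves a character vector of column names safely
    group_by(across(all_of(details$groups))) %>%
    summarise(across(everything(), sum), .groups = "drop")
}

summarise_by(list(groups = c("cyl", "gear")))
```

This mirrors the C-struct idea from the question: the list carries plain strings, and no rlang quoting is needed inside the function.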

How to refer to a column in a function

I was trying to write a function that contains two actions, and the first one includes subsetting a dataframe.
Let's say I have these two dataframes.
ID Var1
1 5 3
2 6 1
ID Var2
1 5 9
2 6 2
And my function is like this
mu_fuc = function(df, condition) {
workingdf = subset(df, condition < 3)
###pass working df to the other action
}
I am aware that for this first action, I can use conditional slicing as a workaround. However, as I tried to work on the other action, I realized I still have to refer to a column name in a dataframe for another existing function.
I tried as.name(condition), but it did not work.
Thank you for your time. Any suggestion is highly appreciated.
---------Update on 12/26-----------
The method using eval worked well for the function once, as below.
meta = function(sub_data, y) {
y <- eval(as.list(match.call())$y, sub_data)
workingdf <- subset(sub_data, y != 999) ## this successfully grabs the column named y in the dataframe ##
meta1 <- metacor(y, ## this successfully grabs the column named y in the dataframe ##
n,
data = workingdf,
studlab = workingdf$Author_year,
sm = "ZCOR",
method.tau = "SJ",
comb.fixed = F)
return(meta1)
}
But somehow the same approach did not work in the following code.
mod_analysis = function(meta, moderator){
workingdf <- meta$data
moderator <- eval(as.list(match.call())$moderator, workingdf)
output = metareg(meta, moderator)
return(output)
}
Then I got this error message:
Error in eval(predvars, data, env) : object 'moderator' not found
I don't know why it worked for the first function but not the second.
One can convert the condition from a simple column reference to an expression, enabling the function argument to include the right hand side of an expression instead of hard coding it into the function. This can be accomplished with a couple of functions from the rlang package, enquo() and eval_tidy().
We'll illustrate this with a subsetting function and the mtcars data frame.
aSubsetFunction <- function(df,condition){
require(rlang)
condition <- enquo(condition)
rows_value <- eval_tidy(condition, df)
stopifnot(is.logical(rows_value))
df[rows_value, ,drop = FALSE]
}
The condition <- enquo(condition) line quotes the condition expression. The eval_tidy() function evaluates the quoted expression, using df as a data mask. The output from eval_tidy(), rows_value, is a vector of logical values (TRUE / FALSE), which we use on the row dimension of the input data frame with the [ form of the extract operator. We use stopifnot() to generate an error if rows_value is not a vector of logical values.
We call the function twice to illustrate that it works with multiple columns in the data frame.
aSubsetFunction(mtcars,mpg > 25)
aSubsetFunction(mtcars,carb > 4)
...and the output:
> aSubsetFunction(mtcars,mpg > 25)
Loading required package: rlang
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
> aSubsetFunction(mtcars,carb > 4)
mpg cyl disp hp drat wt qsec vs am gear carb
Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
Maserati Bora 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
Using data from the original post, the solution works as follows.
df1 <- read.csv(text="ID,Var1
5,3
6,1")
df2 <- read.csv(text="ID,Var2
5,9
6,2")
aSubsetFunction(df1, Var1 < 3)
aSubsetFunction(df2, Var2 < 3)
...and the output:
> aSubsetFunction(df1, Var1 < 3)
ID Var1
2 6 1
> aSubsetFunction(df2, Var2 < 3)
ID Var2
2 6 2
Having illustrated the approach, we can use the order of object evaluation in R to simplify the function down to a single line of R code:
aSubsetFunction <- function(df,condition){
require(rlang)
df[eval_tidy(enquo(condition), df), ,drop = FALSE]
}
...which produces the same output as listed above.
> aSubsetFunction(mtcars,mpg > 25)
Loading required package: rlang
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
> aSubsetFunction(mtcars,carb > 4)
mpg cyl disp hp drat wt qsec vs am gear carb
Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
Maserati Bora 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
>
Epilogue: why can't we use variable substitution with subset()?
From an initial look at the question, one might expect that we could resolve the question with the following code.
subset2 <- function(df,condition){
subset(df,df[[condition]] > 4)
}
subset2(mtcars,carb)
However, this fails with an object not found error:
Error in (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]] else .subset2(x, :
object 'carb' not found
Once again Advanced R provides an explanation, directly quoting from the documentation for subset().
This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
Bottom line: it's important to understand Base R non-standard evaluation when writing functions to automate one's analysis because the assumptions coded into various R functions can produce unexpected results. This is especially true of modeling functions like lm() that rely on formula(), as Wickham describes in Advanced R: wrapping modeling functions.
References: Advanced R, Chapter 20 section 4, Chapter 20 section 6
I guess you can try match.call with eval:
mu_fuc <- function(df, condition) {
condition <- eval(as.list(match.call())$condition, df)
workingdf <- subset(df, condition < 3)
workingdf
}
which enables
> mu_fuc(df1, Var1)
ID Var1
2 6 1
> mu_fuc(df2, Var2)
ID Var2
2 6 2
Try this with indexing:
#Function
mu_fuc = function(df, condition) {
workingdf <- df[df[[condition]]<3,]
return(workingdf)
}
#Apply
mu_fuc(df1,'Var1')
Output:
mu_fuc(df1,'Var1')
ID Var1
2 6 1
Some data used:
#Data
df1 <- structure(list(ID = 5:6, Var1 = c(3L, 1L)), class = "data.frame", row.names = c("1",
"2"))

R describeby function subscript out of bounds error

I'm fairly new to R and I'm trying to get descriptive statistics grouped by multiple variables using the describeby function from the psych package.
Here's what I'm trying to run:
JL <- describeBy(df$JL, group=list(df$Time, df$Cohort, df$Gender), digits=3, skew=FALSE, mat=TRUE)
And I get the error message Error in `[<-`(`*tmp*`, var, group + 1, value = dim.names[[group]][[groupi]]) :
subscript out of bounds
I only get this error message with my Gender variable (which is dichotomous in this dataset). I'm able to run the code when I take out the mat=TRUE argument, and I see that it's generating groupings with NULL for Gender. I saw in other answers that this has something to do with the array being out of bounds, but I'm not sure how to troubleshoot. Any advice is appreciated.
Thanks so much.
You could use dplyr, with some custom functions added.
library(dplyr)
se <- function(x) sd(x, na.rm=TRUE)/sqrt(length(na.omit(x)))
rnge <- function(x) diff(range(x, na.rm=TRUE))
group_by(df, Time, Cohort, Gender) %>%
summarise_at(vars(JL), .funs=list(n=length, mean=mean, sd=sd, min=min, max=max, range=rnge, se=se)) %>%
as.data.frame()
Using the mtcars dataset:
group_by(mtcars, vs, am, cyl) %>%
summarise_at(vars(mpg), .funs=list(n=length, mean=mean, sd=sd, min=min, max=max, range=rnge, se=se)) %>% as.data.frame()
vs am cyl n mean sd min max range se
1 0 0 8 12 15.1 2.774 10.4 19.2 8.8 0.801
2 0 1 4 1 26.0 NA 26.0 26.0 0.0 NA
3 0 1 6 3 20.6 0.751 19.7 21.0 1.3 0.433
4 0 1 8 2 15.4 0.566 15.0 15.8 0.8 0.400
5 1 0 4 3 22.9 1.453 21.5 24.4 2.9 0.839
6 1 0 6 4 19.1 1.632 17.8 21.4 3.6 0.816
7 1 1 4 7 28.4 4.758 21.4 33.9 12.5 1.798
Using the describeBy function from the psych package returns your error:
library(psych)
describeBy(mtcars$mpg, group=list(mtcars$vs, mtcars$am, mtcars$cyl), digits=3, skew=FALSE, mat=TRUE)
Error in [<-(*tmp*, var, group + 1, value =
dim.names[[group]][[groupi]]) : subscript out of bounds
Because not all combinations of the three groups exist in the data.
with(mtcars,
ftable(table(vs,am,cyl)))
# cyl 4 6 8
#vs am
#0 0 0 0 12
# 1 1 3 2
#1 0 3 4 0
# 1 7 0 0
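A possible workaround (a sketch, not from the original answer): collapse the three grouping variables into a single factor with interaction(..., drop = TRUE), so combinations that don't occur in the data are dropped before describeBy() sees them:

```r
library(psych)

# drop = TRUE removes the empty vs/am/cyl combinations
grp <- with(mtcars, interaction(vs, am, cyl, drop = TRUE))
describeBy(mtcars$mpg, group = grp, mat = TRUE, skew = FALSE, digits = 3)
```

The group labels then come out as a single dotted factor (e.g. 0.0.8) rather than three separate columns.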

Non-standard subsetting of data.frames

One of the quirks of subsetting a data frame is that you have to repeatedly type the name of that data frame when mentioning columns. For example, the data frame cars is mentioned 3 times here:
cars[cars$speed == 4 & cars$dist < 10, ]
## speed dist
## 1 4 2
The data.table package solves this.
library(data.table)
dt_cars <- as.data.table(cars)
dt_cars[speed == 4 & dist < 10]
As does dplyr.
library(dplyr)
cars %>% filter(speed == 4, dist < 10)
I'd like to know if a solution exists for standard-issue data.frames (that is, not resorting to data.table or dplyr).
I think I'm looking for something like
cars[MAGIC(speed == 4 & dist < 10), ]
or
MAGIC(cars[speed == 4 & dist < 10, ])
where MAGIC is to be determined.
I tried the following, but it gave me an error.
library(rlang)
cars[locally(speed == 4 & dist < 10), ]
# Error in locally(speed == 4 & dist < 10) : object 'speed' not found
1) subset This only requires that cars be mentioned once. No packages are used.
subset(cars, speed == 4 & dist < 10)
## speed dist
## 1 4 2
2) sqldf This uses a package but does not use dplyr or data.table which were the only two packages excluded by the question:
library(sqldf)
sqldf("select * from cars where speed = 4 and dist < 10")
## speed dist
## 1 4 2
3) assignment Not sure if this counts but you could assign cars to some other variable name such as . and then use that. In that case cars would only be mentioned once. This uses no packages.
. <- cars
.[.$speed == 4 & .$dist < 10, ]
## speed dist
## 1 4 2
or
. <- cars
with(., .[speed == 4 & dist < 10, ])
## speed dist
## 1 4 2
With respect to these two solutions you might want to check out this article on the Bizarro Pipe: http://www.win-vector.com/blog/2017/01/using-the-bizarro-pipe-to-debug-magrittr-pipelines-in-r/
4) magrittr This could also be expressed in magrittr and that package was not excluded by the question. Note we are using the magrittr %$% operator:
library(magrittr)
cars %$% .[speed == 4 & dist < 10, ]
## speed dist
## 1 4 2
subset is the base function which solves this problem. However, like all base R functions which use non-standard evaluation, subset does not perform fully hygienic code expansion, so subset() evaluates the wrong variable when used within non-global scopes (such as in lapply loops).
As an example, here we define the variable var in two places, first in the global scope with value 40, then in a local scope with value 30. The use of local() here is for simplicity, however this would behave equivalently inside a function. Intuitively, we would expect subset to use the value 30 in the evaluation. However upon executing the following code we see instead the value 40 is used (so no rows are returned).
var <- 40
local({
var <- 30
dfs <- list(mtcars, mtcars)
lapply(dfs, subset, mpg > var)
})
#> [[1]]
#> [1] mpg cyl disp hp drat wt qsec vs am gear carb
#> <0 rows> (or 0-length row.names)
#>
#> [[2]]
#> [1] mpg cyl disp hp drat wt qsec vs am gear carb
#> <0 rows> (or 0-length row.names)
This happens because the parent.frame() used in subset() is the environment within the body of lapply() rather than the local block. Because all environments eventually inherit from the global environment the variable var is found there with value 40.
Hygienic variable expansion via quasiquotation (as implemented in the rlang package) solves this problem. We can define a variant of subset using tidy evaluation that works properly in all contexts. The code is derived from and largely identical to that of base::subset.data.frame().
subset2 <- function (x, subset, select, drop = FALSE, ...) {
r <- if (missing(subset))
rep_len(TRUE, nrow(x))
else {
r <- rlang::eval_tidy(rlang::enquo(subset), x)
if (!is.logical(r))
stop("'subset' must be logical")
r & !is.na(r)
}
vars <- if (missing(select))
TRUE
else {
nl <- as.list(seq_along(x))
names(nl) <- names(x)
rlang::eval_tidy(rlang::enquo(select), nl)
}
x[r, vars, drop = drop]
}
This version of subset behaves identically to base::subset.data.frame().
subset2(mtcars, gear > 4, disp:wt)
#> disp hp drat wt
#> Porsche 914-2 120.3 91 4.43 2.140
#> Lotus Europa 95.1 113 3.77 1.513
#> Ford Pantera L 351.0 264 4.22 3.170
#> Ferrari Dino 145.0 175 3.62 2.770
#> Maserati Bora 301.0 335 3.54 3.570
However subset2() does not suffer the scoping issues of subset. In our previous example the value 30 is used for var, as we would expect from lexical scoping rules.
local({
var <- 30
dfs <- list(mtcars, mtcars)
lapply(dfs, subset2, mpg > var)
})
#> [[1]]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#>
#> [[2]]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
This allows non-standard evaluation to be used robustly in all contexts, not just in top level contexts as with previous approaches.
This makes functions which use non-standard evaluation much more useful. Before while they were nice to have for interactive use, you needed to use more verbose standard evaluation functions when writing functions and packages. Now the same function can be used in all contexts without needing to modify the code!
For more details on non-standard evaluation please see Lionel Henry's Tidy evaluation (hygienic fexprs) presentation, the rlang vignette on tidy evaluation and the programming with dplyr vignette.
I understand I'm totally cheating, but technically it works :):
with(cars, data.frame(speed=speed,dist=dist)[speed == 4 & dist < 10,])
# speed dist
# 1 4 2
More horror:
`[` <- function(x,i,j){
rm(`[`,envir = parent.frame())
eval(parse(text=paste0("with(x,x[",deparse(substitute(i)),",])")))
}
cars[speed == 4 & dist < 10, ]
# speed dist
# 1 4 2
A solution overriding the [ method for data.frame. In the new method we check the class of the i argument, and if it is an expression or a formula we evaluate it in the context of the data.frame.
##### override subsetting method
`[.data.frame` = function (x, i, j, ...) {
if(!missing(i) && (is.language(i) || is.symbol(i) || inherits(i, "formula"))) {
if(inherits(i, "formula")) i = as.list(i)[[2]]
i = eval(i, x, enclos = baseenv())
}
base::`[.data.frame`(x, i, j, ...)
}
#####
data(cars)
cars[cars$speed == 4 & cars$dist < 10, ]
# speed dist
# 1 4 2
# cars[speed == 4 & dist < 10, ] # error
cars[quote(speed == 4 & dist < 10),]
# speed dist
# 1 4 2
# ,or
cars[~ speed == 4 & dist < 10,]
# speed dist
# 1 4 2
Another solution with more magic. Please restart your R session to avoid interference with the previous solution:
locally = function(expr){
curr_call = as.list(sys.call(1))
if(as.character(curr_call[[1]])=="["){
possibly_df = eval(curr_call[[2]], parent.frame())
if(is.data.frame(possibly_df)){
expr = substitute(expr)
expr = eval(expr, possibly_df, enclos = baseenv())
}
}
expr
}
cars[locally(speed == 4 & dist < 10), ]
# speed dist
# 1 4 2
Using attach()
attach(cars)
cars[speed == 4 & dist < 10,]
# speed dist
# 1 4 2
Early in my R learning I was dissuaded from using attach(), but as long as you're careful not to introduce name conflicts I think it should be OK.

Difference between subset and filter from dplyr

It seems to me that subset and filter (from dplyr) give the same result.
But my question is: is there any potential difference, for example in speed or in the data sizes they can handle? Are there occasions when it is better to use one or the other?
Example:
library(dplyr)
df1<-subset(airquality, Temp>80 & Month > 5)
df2<-filter(airquality, Temp>80 & Month > 5)
summary(df1$Ozone)
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 9.00 39.00 64.00 64.51 84.00 168.00 14
summary(df2$Ozone)
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 9.00 39.00 64.00 64.51 84.00 168.00 14
They are, indeed, producing the same result, and they are very similar in concept.
The advantage of subset is that it is part of base R and doesn't require any additional packages. With small sample sizes, it seems to be a bit faster than filter (6 times faster in your example, but that's measured in microseconds).
As the data sets grow, filter gains the upper hand in efficiency. At 15,300 records, filter outpaces subset by about 300 microseconds. And at 153,000 records, filter is three times faster (measured in milliseconds).
So in terms of human time, I don't think there's much difference between the two.
The other advantage (and this is a bit of a niche advantage) is that filter can operate on SQL databases without pulling the data into memory. subset simply doesn't do that.
Personally, I tend to use filter, but only because I'm already using the dplyr framework. If you aren't working with out-of-memory data, it won't make much of a difference.
library(dplyr)
library(microbenchmark)
# Original example
microbenchmark(
df1<-subset(airquality, Temp>80 & Month > 5),
df2<-filter(airquality, Temp>80 & Month > 5)
)
Unit: microseconds
expr min lq mean median uq max neval cld
subset 95.598 107.7670 118.5236 119.9370 125.949 167.443 100 a
filter 551.886 564.7885 599.4972 571.5335 594.993 2074.997 100 b
# 15,300 rows
air <- lapply(1:100, function(x) airquality) %>% bind_rows
microbenchmark(
df1<-subset(air, Temp>80 & Month > 5),
df2<-filter(air, Temp>80 & Month > 5)
)
Unit: microseconds
expr min lq mean median uq max neval cld
subset 1187.054 1207.5800 1293.718 1216.671 1257.725 2574.392 100 b
filter 968.586 985.4475 1056.686 1023.862 1036.765 2489.644 100 a
# 153,000 rows
air <- lapply(1:1000, function(x) airquality) %>% bind_rows
microbenchmark(
df1<-subset(air, Temp>80 & Month > 5),
df2<-filter(air, Temp>80 & Month > 5)
)
Unit: milliseconds
expr min lq mean median uq max neval cld
subset 11.841792 13.292618 16.21771 13.521935 13.867083 68.59659 100 b
filter 5.046148 5.169164 10.27829 5.387484 6.738167 65.38937 100 a
One additional difference not yet mentioned is that filter discards rownames, while subset doesn't:
filter(mtcars, gear == 5)
mpg cyl disp hp drat wt qsec vs am gear carb
1 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
2 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
3 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
4 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
5 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
subset(mtcars, gear == 5)
mpg cyl disp hp drat wt qsec vs am gear carb
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
In the main use cases they behave the same :
library(dplyr)
identical(
filter(starwars, species == "Wookiee"),
subset(starwars, species == "Wookiee"))
# [1] TRUE
But they have quite a few differences, including (I was as exhaustive as possible but might have missed some):
subset can be used on matrices
filter can be used on databases
filter drops row names
subset drops attributes other than class, names and row names
subset has a select argument
subset recycles its condition argument
filter supports conditions as separate arguments
filter preserves the class of the column
filter supports the .data pronoun
filter supports some rlang features
filter supports grouping
filter supports n() and row_number()
filter is stricter
filter is a bit faster when it counts
subset has methods in other packages
subset can be used on matrices
subset(state.x77, state.x77[,"Population"] < 400)
# Population Income Illiteracy Life Exp Murder HS Grad Frost Area
# Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
# Wyoming 376 4566 0.6 70.29 6.9 62.9 173 97203
Though columns can't be used directly as variables in the subset argument
subset(state.x77, Population < 400)
Error in subset.matrix(state.x77, Population < 400) : object
'Population' not found
Neither works with filter
filter(state.x77, state.x77[,"Population"] < 400)
Error in UseMethod("filter_") : no applicable method for 'filter_'
applied to an object of class "c('matrix', 'double', 'numeric')"
filter(state.x77, Population < 400)
Error in UseMethod("filter_") : no applicable method for 'filter_'
applied to an object of class "c('matrix', 'double', 'numeric')"
filter can be used on databases
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
tbl(con,"mtcars") %>%
filter(hp < 65)
# # Source: lazy query [?? x 11]
# # Database: sqlite 3.19.3 [:memory:]
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
subset can't
tbl(con,"mtcars") %>%
subset(hp < 65)
Error in subset.default(., hp < 65) : object 'hp' not found
filter drops row names
filter(mtcars, hp < 65)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
subset doesn't
subset(mtcars, hp < 65)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
subset drops attributes other than class, names and row names
cars_head <- head(cars)
attr(cars_head, "info") <- "head of cars dataset"
attributes(subset(cars_head, speed > 0))
#> $names
#> [1] "speed" "dist"
#>
#> $row.names
#> [1] 1 2 3 4 5 6
#>
#> $class
#> [1] "data.frame"
attributes(filter(cars_head, speed > 0))
#> $names
#> [1] "speed" "dist"
#>
#> $row.names
#> [1] 1 2 3 4 5 6
#>
#> $class
#> [1] "data.frame"
#>
#> $info
#> [1] "head of cars dataset"
subset has a select argument
dplyr follows tidyverse principles, which aim at having each function do one thing, so select is a separate function.
identical(
subset(starwars, species == "Wookiee", select = c("name", "height")),
filter(starwars, species == "Wookiee") %>% select(name, height)
)
# [1] TRUE
It also has a drop argument, that makes mostly sense in the context of using the select argument.
subset recycles its condition argument
half_iris <- subset(iris,c(TRUE,FALSE))
dim(iris) # [1] 150 5
dim(half_iris) # [1] 75 5
filter doesn't
half_iris <- filter(iris,c(TRUE,FALSE))
Error in filter_impl(.data, quo) : Result must have length 150, not 2
filter supports conditions as separate arguments
Conditions are fed to ..., so we can have several conditions as different arguments, which is the same as using & but might sometimes be more readable due to logical operator precedence and automatic indentation.
identical(
subset(starwars,
(species == "Wookiee" | eye_color == "blue") &
mass > 120),
filter(starwars,
species == "Wookiee" | eye_color == "blue",
mass > 120)
)
filter preserves the class of the column
df <- data.frame(a=1:2, b = 3:4, c= 5:6)
class(df$a) <- "foo"
class(df$b) <- "Date"
# subset preserves the Date, but strips the "foo" class
str(subset(df,TRUE))
#> 'data.frame': 2 obs. of 3 variables:
#> $ a: int 1 2
#> $ b: Date, format: "1970-01-04" "1970-01-05"
#> $ c: int 5 6
# filter keeps both
str(dplyr::filter(df,TRUE))
#> 'data.frame': 2 obs. of 3 variables:
#> $ a: 'foo' int 1 2
#> $ b: Date, format: "1970-01-04" "1970-01-05"
#> $ c: int 5 6
filter supports the .data pronoun
mtcars %>% filter(.data[["hp"]] < 65)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
filter supports some rlang features
x <- "hp"
library(rlang)
mtcars %>% filter(!!sym(x) < 65)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
filter65 <- function(data,var){
data %>% filter(!!enquo(var) < 65)
}
mtcars %>% filter65(hp)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
filter supports grouping
iris %>%
group_by(Species) %>%
filter(Petal.Length < quantile(Petal.Length,0.01))
# # A tibble: 3 x 5
# # Groups: Species [3]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 4.6 3.6 1.0 0.2 setosa
# 2 5.1 2.5 3.0 1.1 versicolor
# 3 4.9 2.5 4.5 1.7 virginica
iris %>%
group_by(Species) %>%
subset(Petal.Length < quantile(Petal.Length,0.01))
# # A tibble: 2 x 5
# # Groups: Species [1]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 4.3 3.0 1.1 0.1 setosa
# 2 4.6 3.6 1.0 0.2 setosa
filter supports n() and row_number()
filter(iris, row_number() < n()/30)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
filter is stricter
It triggers errors if the input is suspicious.
filter(iris, Species = "setosa")
# Error: `Species` (`Species = "setosa"`) must not be named, do you need `==`?
identical(subset(iris, Species = "setosa"), iris)
# [1] TRUE
df1 <- setNames(data.frame(a = 1:3, b=5:7),c("a","a"))
# df1
# a a
# 1 1 5
# 2 2 6
# 3 3 7
filter(df1, a > 2)
#Error: Column `a` must have a unique name
subset(df1, a > 2)
# a a.1
# 3 3 7
filter is a bit faster when it counts
Borrowing the dataset that Benjamin built in his answer (153k rows), it's twice as fast, though it should rarely be a bottleneck.
air <- lapply(1:1000, function(x) airquality) %>% bind_rows
microbenchmark::microbenchmark(
subset = subset(air, Temp>80 & Month > 5),
filter = filter(air, Temp>80 & Month > 5)
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# subset 8.771962 11.551255 19.942501 12.576245 13.933290 108.0552 100 b
# filter 4.144336 4.686189 8.024461 6.424492 7.499894 101.7827 100 a
subset has methods in other packages
subset is an S3 generic, just as dplyr::filter is, but subset, as a base function, is more likely to have methods developed in other packages; one prominent example is zoo:::subset.zoo.
Interesting. I was trying to see the difference in terms of the resulting dataset, and I couldn't find an explanation for why the [ operator behaved differently (i.e., why it also returned NAs):
# Subset for year=2013
sub<-brfss2013 %>% filter(iyear == "2013")
dim(sub)
#[1] 486088 330
length(which(is.na(sub$iyear))==T)
#[1] 0
sub2<-filter(brfss2013, iyear == "2013")
dim(sub2)
#[1] 486088 330
length(which(is.na(sub2$iyear))==T)
#[1] 0
sub3<-brfss2013[brfss2013$iyear=="2013", ]
dim(sub3)
#[1] 486093 330
length(which(is.na(sub3$iyear))==T)
#[1] 5
sub4<-subset(brfss2013, iyear=="2013")
dim(sub4)
#[1] 486088 330
length(which(is.na(sub4$iyear))==T)
#[1] 0
Another difference is that subset does more than filter: it lets you select and drop columns as well, whereas dplyr has two separate functions for that.
subset(df, select=c("varA", "varD"))
dplyr::select(df,varA, varD)
An additional advantage of filter is that it plays nice with grouped data. subset ignores groupings.
So when the data is grouped, subset will still make reference to the whole data, but filter will only reference the group.
# setup
library(tidyverse)
data.frame(a = 1:2) %>% group_by(a) %>% subset(length(a) == 1)
# returns empty table
data.frame(a = 1:2) %>% group_by(a) %>% filter(length(a) == 1)
# returns all rows

Resources