Summary Statistics table with factors and continuous variables - r

I am trying to create a simple summary statistics table (min, max, mean, n, etc.) that handles both factor variables and continuous variables, even when there is more than one factor variable. I'm trying to produce good-looking HTML output, e.g. stargazer or huxtable output.
For a simple reproducible example, I'll use mtcars but change two of the variables to factors, and simplify to three variables.
library(tidyverse)
library(stargazer)
mtcars_df <- mtcars
mtcars_df <- mtcars_df %>%
mutate(vs = factor(vs),
am = factor(am)) %>%
select(mpg, vs, am)
head(mtcars_df)
So the data has two factor variables, vs and am. mpg is left as a double:
#> mpg vs am
#> <dbl> <fctr> <fctr>
#> 1 21.0 0 1
#> 2 21.0 0 1
#> 3 22.8 1 1
#> 4 21.4 1 0
#> 5 18.7 0 0
#> 6 18.1 1 0
My desired output would look something like this (format only, the numbers aren't all correct for am0):
======================================================
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg 32 20.091 6.027 10 15.4 22.8 34
vs0 32 0.562 0.504 0 0 1 1
vs1 32 0.438 0.504 0 0 1 1
am0 32 0.594 0.499 0 0 1 1
am1 32 0.406 0.499 0 0 1 1
------------------------------------------------------
A straight call to stargazer does not handle factors (but we have a solution for summarising one factor, below)
# this doesn't give factors
stargazer(mtcars_df, type = "text")
======================================================
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg 32 20.091 6.027 10 15.4 22.8 34
------------------------------------------------------
This previous answer from @jake-fisher works very well to summarise one factor variable.
https://stackoverflow.com/a/26935270/8742237
The code below from the previous answer gives both values of the first factor vs, i.e. vs0 and vs1 but when it comes to the second factor, am, it only lists summary statistics for one value of am:
am0 is missing.
I do realise that this is because we want to avoid the dummy variable trap when modeling, but my issue is not about modeling, it's about creating a summary table with all values of all factor variables.
options(na.action = "na.pass") # so that we keep missing values in the data
X <- model.matrix(~ . - 1, data = mtcars_df)
X.df <- data.frame(X) # stargazer only does summary tables of data.frame objects
#names(X) <- colnames(X)
stargazer(X.df, type = "text")
======================================================
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg 32 20.091 6.027 10 15.4 22.8 34
vs0 32 0.562 0.504 0 0 1 1
vs1 32 0.438 0.504 0 0 1 1
am1 32 0.406 0.499 0 0 1 1
------------------------------------------------------
While use of stargazer or huxtable would be preferred, if there's an easier way to produce this sort of summary table with a different library, that would still be very helpful.

In the end, instead of using model.matrix(), which is designed to drop the base case when creating dummy variables, a simple fix is to use mlr::createDummyFeatures(), which creates a dummy variable for every level, including the base case.
library(tidyverse)
library(stargazer)
library(mlr)
mtcars_df <- mtcars
mtcars_df <- mtcars_df %>%
mutate(vs = factor(vs),
am = factor(am)) %>%
select(mpg, vs, am)
head(mtcars_df)
X <- mlr::createDummyFeatures(obj = mtcars_df)
X.df <- data.frame(X) # stargazer only does summary tables of data.frame objects
#names(X) <- colnames(X)
stargazer(X.df, type = "text")
which does give the desired output:
======================================================
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg 32 20.091 6.027 10 15.4 22.8 34
vs.0 32 0.562 0.504 0 0 1 1
vs.1 32 0.438 0.504 0 0 1 1
am.0 32 0.594 0.499 0 0 1 1
am.1 32 0.406 0.499 0 0 1 1
------------------------------------------------------
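If you would rather not load mlr just for this, base R can probably do the same job. A minimal sketch, assuming the same three-column mtcars_df as above: passing contrasts.arg to model.matrix() with contrasts = FALSE for every factor asks for one indicator column per level, so the base level is kept. The names factor_cols and full_contrasts are only illustrative, and the columns come out as vs0, vs1, am0, am1 rather than vs.0 etc.

```r
# Rebuild the example data with base R only
mtcars_df <- mtcars[c("mpg", "vs", "am")]
mtcars_df$vs <- factor(mtcars_df$vs)
mtcars_df$am <- factor(mtcars_df$am)

# Names of the factor columns (illustrative helper, not from the original post)
factor_cols <- names(mtcars_df)[vapply(mtcars_df, is.factor, logical(1))]

# contrasts = FALSE gives an identity matrix per factor,
# i.e. one indicator column for every level, including the base level
full_contrasts <- lapply(mtcars_df[factor_cols], contrasts, contrasts = FALSE)

X <- model.matrix(~ . - 1, data = mtcars_df, contrasts.arg = full_contrasts)
X.df <- data.frame(X)

colnames(X.df)
colMeans(X.df[c("am0", "am1")])
```

X.df can then be passed to stargazer(X.df, type = "text") exactly as before.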

Related

How do I only report selected summary statistics in a table that lists variables as rows using R?

I have a dataset and I need to create a simple table with the number of observations, means, and standard deviations of all the variables (columns). I can't find a way to get only the required 3 summary statistics. Everything I tried keeps giving me min, max, median, 1st and 3rd quartiles, etc. The table should look something like this (with a title):
Table 1: Table Title
_______________________________________
Variables Observations Mean Std.Dev
_______________________________________
Age 30 24 2
... . . .
... . . .
_______________________________________
summary() does not work because it gives too many other summary statistics. I have done this:
sapply(dataset, function(x) list(means=mean(x,na.rm=TRUE), sds=sd(x,na.rm=TRUE)))
But how do I form the table from this? And is there a better way to do this than using "sapply"?
sapply does return the values that you want but it is not properly structured.
Using mtcars data as an example :
#Get the required statistics and convert the data into dataframe
summ_data <- data.frame(t(sapply(mtcars, function(x)
list(means = mean(x,na.rm=TRUE), sds = sd(x,na.rm=TRUE)))))
#Change rownames to new column
summ_data$variables <- rownames(summ_data)
#Remove rownames
rownames(summ_data) <- NULL
#Make variable column as 1st column
cbind(summ_data[ncol(summ_data)], summ_data[-ncol(summ_data)])
Another way would be using dplyr functions :
library(dplyr)
mtcars %>%
summarise(across(everything(), .fns = list(means = mean, sds = sd),
.names = '{col}_{fn}')) %>%
tidyr::pivot_longer(cols = everything(),
names_to = c('variable', '.value'),
names_sep = '_')
# A tibble: 11 x 3
# variable means sds
# <chr> <dbl> <dbl>
# 1 mpg 20.1 6.03
# 2 cyl 6.19 1.79
# 3 disp 231. 124.
# 4 hp 147. 68.6
# 5 drat 3.60 0.535
# 6 wt 3.22 0.978
# 7 qsec 17.8 1.79
# 8 vs 0.438 0.504
# 9 am 0.406 0.499
#10 gear 3.69 0.738
#11 carb 2.81 1.62
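Neither version above reports the number of observations, which the question also asked for. A minimal base R sketch that adds it, assuming "observations" means the non-missing count per column (the column names Variables, Observations, Mean and Std.Dev are copied from the table layout in the question):

```r
# One row per variable: non-missing count, mean and standard deviation
summ <- data.frame(
  Variables    = names(mtcars),
  Observations = vapply(mtcars, function(x) sum(!is.na(x)), integer(1)),
  Mean         = vapply(mtcars, mean, numeric(1), na.rm = TRUE),
  Std.Dev      = vapply(mtcars, sd, numeric(1), na.rm = TRUE)
)

# Drop the row names inherited from the named vectors
rownames(summ) <- NULL

head(summ, 3)
```

vapply() is used instead of sapply() so that every column is guaranteed to be a plain vector of the declared type, which keeps the result a regular data frame.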

R describeby function subscript out of bounds error

I'm fairly new to R and I'm trying to get descriptive statistics grouped by multiple variables using the describeBy function from the psych package.
Here's what I'm trying to run:
JL <- describeBy(df$JL, group=list(df$Time, df$Cohort, df$Gender), digits=3, skew=FALSE, mat=TRUE)
And I get the error message Error in `[<-`(`*tmp*`, var, group + 1, value = dim.names[[group]][[groupi]]) :
subscript out of bounds
I only get this error message with my Gender variable (which is dichotomous in this dataset). I'm able to run the code when I take out the mat=TRUE argument, and I see that it's generating groupings with NULL for Gender. I saw in other answers that this has something to do with the array being out of bounds but I'm not sure how to troubleshoot. Any advice is appreciated.
Thanks so much.
You could use dplyr, with some custom functions added.
library(dplyr)
se <- function(x) sd(x, na.rm=TRUE)/sqrt(length(na.omit(x)))
rnge <- function(x) diff(range(x, na.rm=TRUE))
group_by(df, Time, Cohort, Gender) %>%
summarise_at(vars(JL), .funs=list(n=length, mean=mean, sd=sd, min=min, max=max, range=rnge, se=se)) %>%
as.data.frame()
Using the mtcars dataset:
group_by(mtcars, vs, am, cyl) %>%
summarise_at(vars(mpg), .funs=list(n=length, mean=mean, sd=sd, min=min, max=max, range=rnge, se=se)) %>% as.data.frame()
vs am cyl n mean sd min max range se
1 0 0 8 12 15.1 2.774 10.4 19.2 8.8 0.801
2 0 1 4 1 26.0 NA 26.0 26.0 0.0 NA
3 0 1 6 3 20.6 0.751 19.7 21.0 1.3 0.433
4 0 1 8 2 15.4 0.566 15.0 15.8 0.8 0.400
5 1 0 4 3 22.9 1.453 21.5 24.4 2.9 0.839
6 1 0 6 4 19.1 1.632 17.8 21.4 3.6 0.816
7 1 1 4 7 28.4 4.758 21.4 33.9 12.5 1.798
Using the describeBy function from the psych package returns your error:
library(psych)
describeBy(mtcars$mpg, group=list(mtcars$vs, mtcars$am, mtcars$cyl), digits=3, skew=FALSE, mat=TRUE)
Error in `[<-`(`*tmp*`, var, group + 1, value = dim.names[[group]][[groupi]]) :
subscript out of bounds
This happens because not all combinations of the three grouping variables exist in the data:
with(mtcars,
ftable(table(vs,am,cyl)))
# cyl 4 6 8
#vs am
#0 0 0 0 12
# 1 1 3 2
#1 0 3 4 0
# 1 7 0 0
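To check your own data for the same problem, you can list the empty cells directly. A small sketch with mtcars standing in for df (swap in Time, Cohort and Gender for your case; the object names tab, cells and missing_cells are just illustrative):

```r
# Cross-tabulate the three grouping variables
tab <- with(mtcars, table(vs, am, cyl))

# Long format: one row per combination, with its count in Freq
cells <- as.data.frame(tab)

# Combinations that never occur in the data
missing_cells <- cells[cells$Freq == 0, ]
missing_cells
```

If missing_cells has any rows, at least one group combination is empty, which is the situation described above.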

Obtain importance of individual trees in a RandomForest

Question: Is there a way to extract the variable importance for each individual CART model from a randomForest object?
rf_mod$forest doesn't seem to have this information, and the docs don't mention it.
In R's randomForest package, the average variable importance for the entire forest of CART models is given by importance(rf_mod).
library(randomForest)
df <- mtcars
set.seed(1)
rf_mod = randomForest(mpg ~ .,
data = df,
importance = TRUE,
ntree = 200)
importance(rf_mod)
%IncMSE IncNodePurity
cyl 6.0927875 111.65028
disp 8.7730959 261.06991
hp 7.8329831 212.74916
drat 2.9529334 79.01387
wt 7.9015687 246.32633
qsec 0.7741212 26.30662
vs 1.6908975 31.95701
am 2.5298261 13.33669
gear 1.5512788 17.77610
carb 3.2346351 35.69909
We can also extract individual tree structure with getTree. Here's the first tree.
head(getTree(rf_mod, k = 1, labelVar = TRUE))
left daughter right daughter split var split point status prediction
1 2 3 wt 2.15 -3 18.91875
2 0 0 <NA> 0.00 -1 31.56667
3 4 5 wt 3.16 -3 17.61034
4 6 7 drat 3.66 -3 21.26667
5 8 9 carb 3.50 -3 15.96500
6 0 0 <NA> 0.00 -1 19.70000
One workaround is to grow many CARTs (i.e. forests with ntree = 1), get the variable importance of each tree, and average the resulting %IncMSE:
# number of trees to grow
nn <- 200
# function to run nn CART models
run_rf <- function(rand_seed){
set.seed(rand_seed)
one_tr = randomForest(mpg ~ .,
data = df,
importance = TRUE,
ntree = 1)
return(one_tr)
}
# list holding the output of each model
l <- lapply(1:nn, run_rf)
The extraction, averaging, and comparison step.
# extract importance of each CART model
library(dplyr); library(purrr)
map(l, importance) %>%
map(as.data.frame) %>%
map( ~ { .$var = rownames(.); rownames(.) <- NULL; return(.) } ) %>%
bind_rows() %>%
group_by(var) %>%
summarise(`%IncMSE` = mean(`%IncMSE`)) %>%
arrange(-`%IncMSE`)
# A tibble: 10 x 2
var `%IncMSE`
<chr> <dbl>
1 wt 8.52
2 cyl 7.75
3 disp 7.74
4 hp 5.53
5 drat 1.65
6 carb 1.52
7 vs 0.938
8 qsec 0.824
9 gear 0.495
10 am 0.355
# compare to the RF model above
importance(rf_mod)
%IncMSE IncNodePurity
cyl 6.0927875 111.65028
disp 8.7730959 261.06991
hp 7.8329831 212.74916
drat 2.9529334 79.01387
wt 7.9015687 246.32633
qsec 0.7741212 26.30662
vs 1.6908975 31.95701
am 2.5298261 13.33669
gear 1.5512788 17.77610
carb 3.2346351 35.69909
I'd like to be able to extract the variable importance of each tree directly from a randomForest object, without this roundabout method that involves completely re-running the RF. The goal is to facilitate reproducible cumulative variable importance plots like this one, and the one below shown for mtcars. Minimal example here.
I'm aware that a single tree's variable importance is not statistically meaningful, and it's not my intention to interpret trees in isolation. I want them for the purpose of visualization and communicating that as trees increase in a forest, the variable importance measures jump around before stabilizing.
When training a randomForest model, the importance scores are computed for the entire forest and stored directly inside the object. Tree-specific scores are not kept and so cannot be directly retrieved from a randomForest object.
Unfortunately, you are correct about having to incrementally construct a forest. The good news is that a randomForest object is self-contained, and you don't need to implement your own run_rf. Instead, you can use stats::update to re-fit the random forest model with a single tree and randomForest::grow to add additional trees one at a time:
## Starting with a random forest having a single tree,
## grow it 9 times, one tree at a time
rfs <- purrr::accumulate( .init = update(rf_mod, ntree=1),
rep(1,9), randomForest::grow )
## Retrieve the importance scores from each random forest
imp <- purrr::map( rfs, ~importance(.x)[,"%IncMSE"] )
## Combine all results into a single data frame
dplyr::bind_rows( !!!imp )
# # A tibble: 10 x 10
# cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0 18.8 8.63 1.05 0 1.17 0 0 0 0.194
# 2 0 10.0 46.4 0.561 0 -0.299 0 0 0.543 2.05
# 3 0 22.4 31.2 0.955 0 -0.199 0 0 0.362 5.1
# 4 1.55 24.1 23.4 0.717 0 -0.150 0 0 0.272 5.28
# 5 1.24 22.8 23.6 0.573 0 -0.178 0 0 -0.0259 4.98
# 6 1.03 26.2 22.3 0.478 1.25 0.775 0 0 -0.0216 4.1
# 7 0.887 22.5 22.5 0.406 1.79 -0.101 0 0 -0.0185 3.56
# 8 0.776 19.7 21.3 0.944 1.70 0.105 0 0.0225 -0.0162 3.11
# 9 0.690 18.4 19.1 0.839 1.51 1.24 1.01 0.02 -0.0144 2.77
# 10 0.621 18.4 21.2 0.937 1.32 1.11 0.910 0.0725 -0.114 2.49
The data frame shows how feature importance changes with each additional tree. This is the right panel of your plot example. The trees themselves (for the left panel) can be retrieved from the final forest, which is given by dplyr::last( rfs ).
Disclaimer: This is not really an answer, but too long to post as a comment. Will remove if deemed not appropriate.
While I (think I) understand your question, to be honest I am unsure whether your question makes sense from a statistics/ML point-of-view. The following is based on my obviously limited understanding of RF and CART. Perhaps my comment-post will lead to some insights.
Let's start with some general random forest (RF) theory on variable importance from Hastie, Tibshirani, Friedman, The Elements of Statistical Learning, p. 593 (bold-face mine):
At each split in each tree, the improvement in the split-criterion is the
importance measure attributed to the splitting variable, and is accumulated
over all the trees in the forest separately for each variable. [...]
Random forests also use the oob samples to construct a different variable-importance measure, apparently to measure the prediction strength of each variable.
So the variable importance measure in RF is defined as a measure accumulated over all trees.
In traditional single classification trees (CARTs), variable importance is characterised through the Gini index that measures node impurity (see e.g. How to measure/rank “variable importance” when using CART? (specifically using {rpart} from R) and Carolin Strobl's PhD thesis)
More complex measures to characterise variable importance in CART-like models exist; for example in rpart:
An overall measure of variable importance is the sum of the goodness of split
measures for each split for which it was the primary variable, plus goodness * (adjusted
agreement) for all splits in which it was a surrogate. In the printout these are scaled to sum
to 100 and the rounded values are shown, omitting any variable whose proportion is less
than 1%.
So the bottom line here is the following: At the very least it won't be easy (and in the worst case it won't make sense) to compare variable measures from single classification trees with variable importance measures applied to ensemble-based methods like RF.
Which leads me to ask: Why do you want to extract variable importance measures for individual trees from an RF model? Even if you came up with a method to calculate variable importances from individual trees, I believe they wouldn't be very meaningful, and they wouldn't have to "converge" to the ensemble-accumulated values.
We can simplify it by
library(tidyverse)
out <- map(seq_len(nn), ~
run_rf(.x) %>%
importance) %>%
reduce(`+`) %>%
magrittr::divide_by(nn)
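The reduce-then-divide step is just an element-wise mean over a list of equally shaped matrices. A toy illustration in base R, using made-up 2 x 2 matrices as stand-ins for the output of importance() (the numbers are invented purely to show the mechanics, they are not real importance scores):

```r
# Hypothetical stand-ins for three importance() matrices
dn <- list(c("wt", "hp"), c("%IncMSE", "IncNodePurity"))
imp_list <- list(
  matrix(c(1, 2, 3, 4), nrow = 2, dimnames = dn),
  matrix(c(3, 4, 5, 6), nrow = 2, dimnames = dn),
  matrix(c(5, 6, 7, 8), nrow = 2, dimnames = dn)
)

# Sum the matrices element-wise, then divide by how many there are
avg <- Reduce(`+`, imp_list) / length(imp_list)
avg
```

Because the sum is element-wise, this only makes sense if every matrix has the same variables in the same row order, which should hold when they all come from models fitted with the same formula.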

Difference between subset and filter from dplyr

It seems to me that subset and filter (from dplyr) give the same result.
But my question is: is there at some point a potential difference, e.g. in speed or the data sizes they can handle? Are there occasions when it is better to use one or the other?
Example:
library(dplyr)
df1<-subset(airquality, Temp>80 & Month > 5)
df2<-filter(airquality, Temp>80 & Month > 5)
summary(df1$Ozone)
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 9.00 39.00 64.00 64.51 84.00 168.00 14
summary(df2$Ozone)
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 9.00 39.00 64.00 64.51 84.00 168.00 14
They are, indeed, producing the same result, and they are very similar in concept.
The advantage of subset is that it is part of base R and doesn't require any additional packages. With small sample sizes, it seems to be a bit faster than filter (6 times faster in your example, but that's measured in microseconds).
As the data sets grow, filter seems to gain the upper hand in efficiency. At 15,300 records, filter outpaces subset by about 300 microseconds. And at 153,000 records, filter is three times faster (measured in milliseconds).
So in terms of human time, I don't think there's much difference between the two.
The other advantage (and this is a bit of a niche advantage) is that filter can operate on SQL databases without pulling the data into memory. subset simply doesn't do that.
Personally, I tend to use filter, but only because I'm already using the dplyr framework. If you aren't working with out-of-memory data, it won't make much of a difference.
library(dplyr)
library(microbenchmark)
# Original example
microbenchmark(
df1<-subset(airquality, Temp>80 & Month > 5),
df2<-filter(airquality, Temp>80 & Month > 5)
)
Unit: microseconds
expr min lq mean median uq max neval cld
subset 95.598 107.7670 118.5236 119.9370 125.949 167.443 100 a
filter 551.886 564.7885 599.4972 571.5335 594.993 2074.997 100 b
# 15,300 rows
air <- lapply(1:100, function(x) airquality) %>% bind_rows
microbenchmark(
df1<-subset(air, Temp>80 & Month > 5),
df2<-filter(air, Temp>80 & Month > 5)
)
Unit: microseconds
expr min lq mean median uq max neval cld
subset 1187.054 1207.5800 1293.718 1216.671 1257.725 2574.392 100 b
filter 968.586 985.4475 1056.686 1023.862 1036.765 2489.644 100 a
# 153,000 rows
air <- lapply(1:1000, function(x) airquality) %>% bind_rows
microbenchmark(
df1<-subset(air, Temp>80 & Month > 5),
df2<-filter(air, Temp>80 & Month > 5)
)
Unit: milliseconds
expr min lq mean median uq max neval cld
subset 11.841792 13.292618 16.21771 13.521935 13.867083 68.59659 100 b
filter 5.046148 5.169164 10.27829 5.387484 6.738167 65.38937 100 a
One additional difference not yet mentioned is that filter discards rownames, while subset doesn't:
filter(mtcars, gear == 5)
mpg cyl disp hp drat wt qsec vs am gear carb
1 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
2 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
3 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
4 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
5 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
subset(mtcars, gear == 5)
mpg cyl disp hp drat wt qsec vs am gear carb
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
In the main use cases they behave the same :
library(dplyr)
identical(
filter(starwars, species == "Wookiee"),
subset(starwars, species == "Wookiee"))
# [1] TRUE
But they have quite a few differences, including (I was as exhaustive as possible but might have missed some) :
subset can be used on matrices
filter can be used on databases
filter drops row names
subset drops attributes other than class, names and row names.
subset has a select argument
subset recycles its condition argument
filter supports conditions as separate arguments
filter preserves the class of the column
filter supports the .data pronoun
filter supports some rlang features
filter supports grouping
filter supports n() and row_number()
filter is stricter
filter is a bit faster when it counts
subset has methods in other packages
subset can be used on matrices
subset(state.x77, state.x77[,"Population"] < 400)
# Population Income Illiteracy Life Exp Murder HS Grad Frost Area
# Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
# Wyoming 376 4566 0.6 70.29 6.9 62.9 173 97203
Though columns can't be used directly as variables in the subset argument
subset(state.x77, Population < 400)
Error in subset.matrix(state.x77, Population < 400) : object
'Population' not found
Neither works with filter
filter(state.x77, state.x77[,"Population"] < 400)
Error in UseMethod("filter_") : no applicable method for 'filter_'
applied to an object of class "c('matrix', 'double', 'numeric')"
filter(state.x77, Population < 400)
Error in UseMethod("filter_") : no applicable method for 'filter_'
applied to an object of class "c('matrix', 'double', 'numeric')"
filter can be used on databases
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
tbl(con,"mtcars") %>%
filter(hp < 65)
# # Source: lazy query [?? x 11]
# # Database: sqlite 3.19.3 [:memory:]
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
subset can't
tbl(con,"mtcars") %>%
subset(hp < 65)
Error in subset.default(., hp < 65) : object 'hp' not found
filter drops row names
filter(mtcars, hp < 65)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
subset doesn't
subset(mtcars, hp < 65)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
subset drops attributes other than class, names and row names.
cars_head <- head(cars)
attr(cars_head, "info") <- "head of cars dataset"
attributes(subset(cars_head, speed > 0))
#> $names
#> [1] "speed" "dist"
#>
#> $row.names
#> [1] 1 2 3 4 5 6
#>
#> $class
#> [1] "data.frame"
attributes(filter(cars_head, speed > 0))
#> $names
#> [1] "speed" "dist"
#>
#> $row.names
#> [1] 1 2 3 4 5 6
#>
#> $class
#> [1] "data.frame"
#>
#> $info
#> [1] "head of cars dataset"
subset has a select argument
dplyr, by contrast, follows tidyverse principles, which aim at having each function do one thing, so select is a separate function.
identical(
subset(starwars, species == "Wookiee", select = c("name", "height")),
filter(starwars, species == "Wookiee") %>% select(name, height)
)
# [1] TRUE
It also has a drop argument, which mostly makes sense in the context of using the select argument.
subset recycles its condition argument
half_iris <- subset(iris,c(TRUE,FALSE))
dim(iris) # [1] 150 5
dim(half_iris) # [1] 75 5
filter doesn't
half_iris <- filter(iris,c(TRUE,FALSE))
Error in filter_impl(.data, quo) : Result must have length 150, not 2
filter supports conditions as separate arguments
Conditions are fed to ... so we can have several conditions as different arguments, which is the same as using & but might be more readable sometimes due to logical operator precedence and automatic indentation.
identical(
subset(starwars,
(species == "Wookiee" | eye_color == "blue") &
mass > 120),
filter(starwars,
species == "Wookiee" | eye_color == "blue",
mass > 120)
)
filter preserves the class of the column
df <- data.frame(a=1:2, b = 3:4, c= 5:6)
class(df$a) <- "foo"
class(df$b) <- "Date"
# subset preserves the Date, but strips the "foo" class
str(subset(df,TRUE))
#> 'data.frame': 2 obs. of 3 variables:
#> $ a: int 1 2
#> $ b: Date, format: "1970-01-04" "1970-01-05"
#> $ c: int 5 6
# filter keeps both
str(dplyr::filter(df,TRUE))
#> 'data.frame': 2 obs. of 3 variables:
#> $ a: 'foo' int 1 2
#> $ b: Date, format: "1970-01-04" "1970-01-05"
#> $ c: int 5 6
filter supports the use of the .data pronoun
mtcars %>% filter(.data[["hp"]] < 65)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
filter supports some rlang features
x <- "hp"
library(rlang)
mtcars %>% filter(!!sym(x) < 65)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
filter65 <- function(data,var){
data %>% filter(!!enquo(var) < 65)
}
mtcars %>% filter65(hp)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
filter supports grouping
iris %>%
group_by(Species) %>%
filter(Petal.Length < quantile(Petal.Length,0.01))
# # A tibble: 3 x 5
# # Groups: Species [3]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 4.6 3.6 1.0 0.2 setosa
# 2 5.1 2.5 3.0 1.1 versicolor
# 3 4.9 2.5 4.5 1.7 virginica
iris %>%
group_by(Species) %>%
subset(Petal.Length < quantile(Petal.Length,0.01))
# # A tibble: 2 x 5
# # Groups: Species [1]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 4.3 3.0 1.1 0.1 setosa
# 2 4.6 3.6 1.0 0.2 setosa
filter supports n() and row_number()
filter(iris, row_number() < n()/30)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
filter is stricter
It triggers errors if the input is suspicious.
filter(iris, Species = "setosa")
# Error: `Species` (`Species = "setosa"`) must not be named, do you need `==`?
identical(subset(iris, Species = "setosa"), iris)
# [1] TRUE
df1 <- setNames(data.frame(a = 1:3, b=5:7),c("a","a"))
# df1
# a a
# 1 1 5
# 2 2 6
# 3 3 7
filter(df1, a > 2)
#Error: Column `a` must have a unique name
subset(df1, a > 2)
# a a.1
# 3 3 7
filter is a bit faster when it counts
Borrowing the dataset that Benjamin built in his answer (153 k rows), it's twice as fast, though it should rarely be a bottleneck.
air <- lapply(1:1000, function(x) airquality) %>% bind_rows
microbenchmark::microbenchmark(
subset = subset(air, Temp>80 & Month > 5),
filter = filter(air, Temp>80 & Month > 5)
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# subset 8.771962 11.551255 19.942501 12.576245 13.933290 108.0552 100 b
# filter 4.144336 4.686189 8.024461 6.424492 7.499894 101.7827 100 a
subset has methods in other packages
subset is an S3 generic, just as dplyr::filter is, but subset as a base function is more likely to have methods developed in other packages; one prominent example is zoo:::subset.zoo.
Interesting. I was trying to see the difference in terms of the resulting dataset and I couldn't get an explanation as to why the "[" operator behaved differently (i.e., why it also returned NAs):
# Subset for year=2013
sub<-brfss2013 %>% filter(iyear == "2013")
dim(sub)
#[1] 486088 330
length(which(is.na(sub$iyear))==T)
#[1] 0
sub2<-filter(brfss2013, iyear == "2013")
dim(sub2)
#[1] 486088 330
length(which(is.na(sub2$iyear))==T)
#[1] 0
sub3<-brfss2013[brfss2013$iyear=="2013", ]
dim(sub3)
#[1] 486093 330
length(which(is.na(sub3$iyear))==T)
#[1] 5
sub4<-subset(brfss2013, iyear=="2013")
dim(sub4)
#[1] 486088 330
length(which(is.na(sub4$iyear))==T)
#[1] 0
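The extra rows from "[" are most likely down to how NA is handled in a logical index. Comparing NA to "2013" gives NA, and indexing a data frame with NA returns a row filled with NA, whereas subset and filter both treat NA in the condition as FALSE. A small sketch on a made-up four-row data frame (the column names iyear and x are only there to mirror the example above):

```r
# Toy data: one missing year among four rows
df <- data.frame(iyear = c("2013", NA, "2012", "2013"), x = 1:4)

# The condition itself contains an NA
cond <- df$iyear == "2013"
cond

# "[" keeps a placeholder row for the NA in the index
by_bracket <- df[cond, ]

# subset() drops rows where the condition is NA
by_subset <- subset(df, iyear == "2013")

# Wrapping the condition in which() makes "[" behave like subset()
by_which <- df[which(cond), ]

nrow(by_bracket)
nrow(by_subset)
nrow(by_which)
```

So the five NA rows in sub3 would correspond to five records where iyear is missing in the original data.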
Another difference is that subset does more than filter: it can also select and drop columns, whereas dplyr has a separate function for that.
subset(df, select=c("varA", "varD"))
dplyr::select(df,varA, varD)
An additional advantage of filter is that it plays nice with grouped data. subset ignores groupings.
So when the data is grouped, subset will still make reference to the whole data, but filter will only reference the group.
# setup
library(tidyverse)
data.frame(a = 1:2) %>% group_by(a) %>% subset(length(a) == 1)
# returns empty table
data.frame(a = 1:2) %>% group_by(a) %>% filter(length(a) == 1)
# returns all rows
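If you need a per-group condition without dplyr, one option is ave(), which computes a statistic within each group and returns it aligned with the original rows. A rough base R sketch of a grouped filter on iris (the names cutoff and low_petal are just illustrative):

```r
# Per-species 1% quantile of Petal.Length, repeated for every row of that species
cutoff <- ave(iris$Petal.Length, iris$Species,
              FUN = function(x) quantile(x, 0.01))

# Keep rows below their own group's cutoff
low_petal <- iris[iris$Petal.Length < cutoff, ]
low_petal
```

This compares each row against its own group's value, which is what filter does on grouped data and what subset does not.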

In the "Tables"-package: How to get column percentages of a subset of a variable?

In the table below the column named "Percent" shows the total column percent. How do I get it to show the column percent of each level of "am" within each level of "vs"?
This is what I've got:
This is what I'm looking for:
Knitr chunk below:
<<echo=FALSE,results='asis'>>=
#
# library(tables)
# library(Hmisc)
# library(Formula)
## This gives me column percentages for the total table.
latex( tabular( Factor(vs)*Factor(am) ~ gear*Percent("col"), data=mtcars ) )
## I am trying to get column percentages for each level of "vs"
#
@
I think you would need to change your formula to do this. Like this for example:
tabular(Factor(vs) ~ gear*Percent("row")*Factor(am), data = mtcars)
# gear
# Percent
# am
#vs 0 1
#0 66.67 33.33
#1 50.00 50.00
You can use the Equal() pseudofunction for the denom option to make levels of factor vs the denominator.
library(tables)
tabular( Factor(vs)*Factor(am) ~ gear*Percent(denom = Equal(vs)), data=mtcars)
#>
#> gear
#> vs am Percent
#> 0 0 66.67
#> 1 33.33
#> 1 0 50.00
#> 1 50.00
Created on 2020-09-07 by the reprex package (v0.3.0)
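As a sanity check on those percentages, the same numbers can be reproduced in base R with prop.table(), taking vs as the denominator (margin = 1 means each row sums to 100). This is only a cross-check, not a replacement for the tabular() layout; the object names counts and pct are illustrative:

```r
# Counts of am within each level of vs
counts <- table(vs = mtcars$vs, am = mtcars$am)

# Row percentages: each level of vs sums to 100
pct <- round(100 * prop.table(counts, margin = 1), 2)
pct
```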
