understanding difference in results between dplyr group_by vs tapply - r

I was expecting to see the same results from these two runs, but they are different. This makes me question whether I really understand how the dplyr code works (I have read pretty much everything I can find about dplyr in the package and online). Can anyone explain why the results are different, or how to obtain similar results?
library(dplyr)
x <- iris
x <- x %.%
group_by(Species, Sepal.Width) %.%
summarise (freq=n()) %.%
summarise (mean_by_group = mean(Sepal.Width))
print(x)
x <- iris
x <- tapply(x$Sepal.Width, x$Species, mean)
print(x)
Update: I don't think this is the most efficient way to do this, but the following code gives a result that matches the tapply approach. Per Hadley's suggestion, I scrutinized the results line by line, and this is the best I could come up with using dplyr
library(dplyr)
x <- iris
x <- x %.%
group_by(Species, Sepal.Width) %.%
summarise (freq=n()) %.%
mutate (mean_by_group = sum(Sepal.Width*freq)/sum(freq))
print(x)
Update: for some reason I thought I had to group all variables I wanted to analyse, which is what sent things in the wrong direction. This is all I needed, which is closer to the examples in the package.
x <- iris %.%
group_by(Species) %.%
summarise(Sepal.Width = mean(Sepal.Width))
print(x)
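For anyone comparing the two approaches, here is a minimal sketch of why the first attempt differs, assuming a current dplyr with %>%. The column name freq comes from the code above; distinct(), count() and weighted.mean() are substitutions made for illustration, not the original code. As far as I can tell, grouping by Species and Sepal.Width collapses the data to one row per distinct width, so a second mean() averages the distinct widths and ignores how often each one occurs:

```r
library(dplyr)

# Unweighted mean of the distinct widths per species
# (what the first attempt effectively computes)
unweighted <- iris %>%
  distinct(Species, Sepal.Width) %>%
  group_by(Species) %>%
  summarise(mean_by_group = mean(Sepal.Width))

# Frequency-weighted mean, which equals the plain per-species mean
weighted <- iris %>%
  count(Species, Sepal.Width, name = "freq") %>%
  group_by(Species) %>%
  summarise(mean_by_group = weighted.mean(Sepal.Width, freq))

plain <- tapply(iris$Sepal.Width, iris$Species, mean)

# Compare the weighted version with tapply(), ignoring names and dims
all.equal(weighted$mean_by_group, as.numeric(plain))
```

The weighted version agrees with tapply(), while the unweighted one does not, which matches the difference seen in the question.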

Maybe this...
- dplyr:
require(dplyr)
iris %>% group_by(Species) %>% summarise(mean_width = mean(Sepal.Width))
# Source: local data frame [3 x 2]
#
# Species mean_width
# 1 setosa 3.428
# 2 versicolor 2.770
# 3 virginica 2.974
- tapply:
tapply(iris$Sepal.Width, iris$Species, mean)
# setosa versicolor virginica
# 3.428 2.770 2.974
NOTE: tapply() simplifies output by default whereas summarise() does not:
typeof(tapply(iris$Sepal.Width, iris$Species, mean, simplify=TRUE))
# [1] "double"
it returns a list otherwise:
typeof(tapply(iris$Sepal.Width, iris$Species, mean, simplify=FALSE))
# [1] "list"
So to actually get the same type of output from tapply() you would need:
tbl_df(
data.frame(
mean_width = tapply( iris$Sepal.Width,
iris$Species,
mean )))
# Source: local data frame [3 x 1]
#
# mean_width
# setosa 3.428
# versicolor 2.770
# virginica 2.974
And this still isn't the same! The species labels are row names (an attribute) here, not a column of the df...
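If you really want the two results to line up, one option is to move the names of the tapply() result into a real column. This is only a sketch; the object names by_tapply, tapply_df and by_dplyr are made up for the example:

```r
library(dplyr)

by_tapply <- tapply(iris$Sepal.Width, iris$Species, mean)

# Move the names of the tapply() result into a real Species column
tapply_df <- data.frame(
  Species    = factor(names(by_tapply), levels = levels(iris$Species)),
  mean_width = as.numeric(by_tapply)
)

by_dplyr <- iris %>%
  group_by(Species) %>%
  summarise(mean_width = mean(Sepal.Width))

# Compare as plain data frames, since summarise() returns a tibble
all.equal(as.data.frame(by_dplyr), tapply_df)
```

Converting the tibble with as.data.frame() before comparing keeps the check about the contents rather than the classes.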

Related

take mean of variable defined by string in dplyr

Seems like this should be easy but I'm stumped. I've gotten the rough hang of programming with dplyr 0.7, but struggling with this: How do I program in dplyr if the variable I want to program with will be a string?
I am scraping a database, and for a variety of reasons want to summarize a variable that I will know the position of but not the name of (the thing I want is always the first column of the supplied table, but the name of the variable stored in that column will vary depending on the database being scraped). To use iris as an example, suppose that I know that the variable that I want is in the first column
library(tidyverse)
desired_var <- colnames(iris)[1]
print(desired_var)
"Sepal.Length"
I now want to group by Species, and take the mean of desired_var, i.e. what I want is to perform
iris %>%
group_by(Species) %>%
summarise(desired_mean = mean(Sepal.Length))
But, now I want to take the mean of a column which is defined by a string stored in desired_var
I get how to do this with a "bare" Sepal.Length
desired_var <- quo(Sepal.Length)
iris %>%
group_by(Species) %>%
summarise(desired_mean = mean(!!desired_var))
But how in the world do I deal with the fact that I have "Sepal.Length" not Sepal.Length , i.e. that desired_var <- "Sepal.Length" ?
You're wandering into tidyeval, which is a rather new feature of the tidyverse (see here), mostly used for writing functions that wrap tidyverse functions. For now it is only available with dplyr but the plan is to extend it to the other tidyverse packages.
For your needs, though, you don't have to get into that: summarize_at will do. This function lets you apply a manipulation that you specify across any variables of your choosing:
iris %>%
group_by(Species) %>%
summarise_at(vars(one_of("Sepal.Length", "Sepal.Width")), funs(desired_mean = mean))
# A tibble: 3 x 3
Species Sepal.Length_desired_mean Sepal.Width_desired_mean
<fctr> <dbl> <dbl>
1 setosa 5.006 3.428
2 versicolor 5.936 2.770
3 virginica 6.588 2.974
You can store the list of variables into a vector, and then use that vector instead:
selected_vectors <- c("Sepal.Length", "Sepal.Width")
iris %>%
group_by(Species) %>%
summarise_at(vars(one_of(selected_vectors)), funs(desired_mean = mean))
1) dynamic variable with !!sym: Use sym (or parse_expr) like this:
library(dplyr)
library(rlang)
desired_var <- "Sepal.Length"
iris %>%
group_by(Species) %>%
summarise(desired_mean = mean(!!sym(desired_var))) %>%
ungroup
giving:
# A tibble: 3 x 2
Species desired_mean
<fctr> <dbl>
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
2) summarise_at: As @Phil points out in the comments, in the particular case of summarise this can be done without using any rlang facilities:
library(dplyr)
desired_var <- "Sepal.Length"
iris %>%
group_by(Species) %>%
summarise_at(desired_var, funs(mean)) %>%
ungroup
giving:
# A tibble: 3 x 2
Species Sepal.Length
<fctr> <dbl>
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
3) dynamic variable and name with !!: If you need to set the name dynamically in (1) then try this:
library(dplyr)
library(rlang)
desired_var <- "Sepal.Length"
desired_var_name <- paste("mean", desired_var, sep = "_")
iris %>%
group_by(Species) %>%
summarise(!!desired_var_name := mean(!!sym(desired_var))) %>%
ungroup
giving:
# A tibble: 3 x 2
Species mean_Sepal.Length
<fctr> <dbl>
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
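For completeness, a sketch of two alternatives that avoid both funs() and an explicit sym(), assuming dplyr 1.0.0 or later: the .data pronoun, and across() with all_of(). The result names res_pronoun and res_across are only for illustration:

```r
library(dplyr)

desired_var <- colnames(iris)[1]  # "Sepal.Length", known only as a string

# Option A: index the .data pronoun with the string
res_pronoun <- iris %>%
  group_by(Species) %>%
  summarise(desired_mean = mean(.data[[desired_var]]))

# Option B: select the column by name with all_of() inside across()
res_across <- iris %>%
  group_by(Species) %>%
  summarise(across(all_of(desired_var), mean, .names = "desired_mean"))

# Both options give the same per-species means
all.equal(res_pronoun$desired_mean, res_across$desired_mean)
```

Option A reads more naturally for a single column; option B extends to a character vector of several column names, in which case .names would need a pattern such as "{.col}_mean" to keep the output names unique.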

Is dplyr easier than data.table to be used within functions and loops? [duplicate]

I want to use the dplyr::group_by function inside another function, but I do not know how to pass the arguments to this function.
Can someone provide a working example?
library(dplyr)
data(iris)
iris %.% group_by(Species) %.% summarise(n = n()) #
## Source: local data frame [3 x 2]
## Species n
## 1 virginica 50
## 2 versicolor 50
## 3 setosa 50
mytable0 <- function(x, ...) x %.% group_by(...) %.% summarise(n = n())
mytable0(iris, "Species") # OK
## Source: local data frame [3 x 2]
## Species n
## 1 virginica 50
## 2 versicolor 50
## 3 setosa 50
mytable1 <- function(x, key) x %.% group_by(as.name(key)) %.% summarise(n = n())
mytable1(iris, "Species") # Wrong!
# Error: unsupported type for column 'as.name(key)' (SYMSXP)
mytable2 <- function(x, key) x %.% group_by(key) %.% summarise(n = n())
mytable2(iris, "Species") # Wrong!
# Error: index out of bounds
For programming, group_by_ is the counterpart to group_by:
library(dplyr)
mytable <- function(x, ...) x %>% group_by_(...) %>% summarise(n = n())
mytable(iris, "Species")
# or iris %>% mytable("Species")
which gives:
Species n
1 setosa 50
2 versicolor 50
3 virginica 50
Update At the time this was written dplyr used %.% which is what was originally used above but now %>% is favored so have changed above to that to keep this relevant.
Update 2 regroup is now deprecated, use group_by_ instead.
Update 3 group_by_(list(...)) now becomes group_by_(...) in new version of dplyr as per Roberto's comment.
Update 4 Added minor variation suggested in comments.
Update 5: With rlang/tidyeval it is now possible to do this:
library(rlang)
mytable <- function(x, ...) {
group_ <- syms(c(...)) # collect the strings first so that several grouping columns also work
x %>%
group_by(!!!group_) %>%
summarise(n = n())
}
mytable(iris, "Species")
or passing Species unevaluated, i.e. no quotes around it:
library(rlang)
mytable <- function(x, ...) {
group_ <- enquos(...)
x %>%
group_by(!!!group_) %>%
summarise(n = n())
}
mytable(iris, Species)
Update 6: There is now a {{...}} notation that works if there is just one grouping variable:
mytable <- function(x, group) {
x %>%
group_by({{group}}) %>%
summarise(n = n())
}
mytable(iris, Species)
UPDATE: As of dplyr 0.7.0 you can use tidy eval to accomplish this.
See http://dplyr.tidyverse.org/articles/programming.html for more details.
library(tidyverse)
data("iris")
my_table <- function(df, group_var) {
group_var <- enquo(group_var) # Create quosure
df %>%
group_by(!!group_var) %>% # Use !! to unquote the quosure
summarise(n = n())
}
my_table(iris, Species)
> my_table(iris, Species)
# A tibble: 3 x 2
Species n
<fctr> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
As a complement to Update 6 in the answer by @G. Grothendieck: if you want to pass a string as the argument to your summary function, then instead of embracing the argument with doubled braces ({{) you should use the .data pronoun, as described in the Programming vignette (section "Loop over multiple variables"):
mytable <- function( x, group ) {
x %>%
group_by( .data[[group]] ) %>%
summarise( n = n() )
}
group_string <- 'Species'
mytable( iris, group_string )
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 2
Species n
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
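The .data[[group]] form takes one string at a time. If the grouping columns arrive as a character vector, a possible variant (a sketch assuming dplyr 1.0.0 or later; the function name count_by and the use of mtcars are made up for the example) is to select them with across(all_of()):

```r
library(dplyr)

# Group by any number of columns given as a character vector
count_by <- function(df, keys) {
  df %>%
    group_by(across(all_of(keys))) %>%
    summarise(n = n(), .groups = "drop")
}

res <- count_by(mtcars, c("cyl", "gear"))
res
```

all_of() raises an error if one of the names is missing from the data, which is usually what you want inside a function.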
Ugly as they come, but she works:
mytable3 <- function(x, key) {
my.call <- bquote(summarise(group_by(.(substitute(x)), NULL), n = n()))
my.call[[2]][[3]] <- as.name(key)
eval(my.call, parent.frame())
}
mytable3(iris, "Species")
# Source: local data frame [3 x 2]
#
# Species n
# 1 virginica 50
# 2 versicolor 50
# 3 setosa 50
There are almost certainly cases that will cause this to break, but you get the idea. I don't think you can get around messing with the call. One other thing that did work but was even uglier is:
mytable4 <- function(x, key) summarise(group_by(x, x[[key]]), n = n())

R dplyr summarise multiple functions to selected variables

I have a dataset for which I want to summarise by mean, but also calculate the max to just 1 of the variables.
Let me start with an example of what I would like to achieve:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise_at("Sepal.Length:Petal.Width",funs(mean))
which give me the following result
# A tibble: 3 × 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fctr> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.8 4.4 1.9 0.5
2 versicolor 7.0 3.4 5.1 1.8
3 virginica 7.9 3.8 6.9 2.5
Is there an easy way to add, for example, max(Petal.Width) to summarise?
So far I have tried the following:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise_at("Sepal.Length:Petal.Width",funs(mean)) %>%
mutate(Max.Petal.Width = max(iris$Petal.Width))
But with this approach I lose both the group_by and the filter from the code above, and I get the wrong results.
The only solution I have been able to achieve is the following:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise_at("Sepal.Length:Petal.Width",funs(mean,max)) %>%
select(Species:Petal.Width_mean,Petal.Width_max) %>%
rename(Max.Petal.Width = Petal.Width_max) %>%
rename_(.dots = setNames(names(.), gsub("_.*$","",names(.))))
Which is a bit convoluted and involves a lot of typing to just add a column with a different summarisation.
Thank you
Although this is an old question, it remains an interesting problem for which I have two solutions that I believe should be available to whoever finds this page.
Solution one
My own take:
mapply(summarise_at,
.vars = lst(names(iris)[!names(iris)%in%"Species"], "Petal.Width"),
.funs = lst(mean, max),
MoreArgs = list(.tbl = iris %>% group_by(Species) %>% filter(Sepal.Length > 5))) %>%
reduce(merge, by = "Species")
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
# 1 setosa 5.314 3.714 1.509 0.2773 0.5
# 2 versicolor 5.998 2.804 4.317 1.3468 1.8
# 3 virginica 6.622 2.984 5.573 2.0327 2.5
Solution two
An elegant solution using package purrr from the tidyverse itself, inspired by this discussion:
list(.vars = lst(names(iris)[!names(iris)%in%"Species"], "Petal.Width"),
.funs = lst("mean" = mean, "max" = max)) %>%
pmap(~ iris %>% group_by(Species) %>% filter(Sepal.Length > 5) %>% summarise_at(.x, .y)) %>%
reduce(inner_join, by = "Species")
# A tibble: 3 x 6
Species Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.31 3.71 1.51 0.277 0.5
2 versicolor 6.00 2.80 4.32 1.35 1.8
3 virginica 6.62 2.98 5.57 2.03 2.5
Short discussion
The data.frame and tibble are the desired result, the last column being the max of Petal.Width and the other ones the means (by group and filter) of all other columns.
Both solutions hinge on three realizations:
summarise_at accepts two lists as arguments, one of n variables and one of m functions, and applies all m functions to all n variables, therefore producing m x n columns in a tibble. The solution is thus to force this function to loop over "couples", each formed by a set of variables and the one function to apply to them, then another set of variables and their own function, and so on.
What does that in R? What applies an operation to corresponding elements of two lists? Functions such as mapply, or map2, pmap and their variants from purrr, dplyr's fellow tidyverse package. Both accept two lists of l elements and perform a given operation on corresponding elements (matched by position) of the two lists.
Because the product is not a tibble or a data.frame but a list, you simply need to use reduce with inner_join or just merge.
Note that the means I obtain are different from those of the OP, but they are the means I obtain with his reproducible example as well (maybe we have two different versions of the iris dataset?).
If you wanted to do something more complex like that, you could write your own version of summarize_at. With this version you supply triplets of column names, functions, and naming rules. Here's a rough start:
my_summarise_at<-function (.tbl, ...)
{
dots <- list(...)
stopifnot(length(dots)%%3==0)
vars <- do.call("append", Map(function(.cols, .funs, .name) {
cols <- select_colwise_names(.tbl, .cols)
funs <- as.fun_list(.funs, .env = parent.frame())
val<-colwise_(.tbl, funs, cols)
names <- sapply(names(val), function(x) gsub("%", x, .name))
setNames(val, names)
}, dots[seq_along(dots)%%3==1], dots[seq_along(dots)%%3==2], dots[seq_along(dots)%%3==0]))
summarise_(.tbl, .dots = vars)
}
environment(my_summarise_at)<-getNamespace("dplyr")
And you can call it with
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
my_summarise_at("Sepal.Length:Petal.Width", mean, "%_mean",
"Petal.Width", max, "%_max")
For the names we just replace the "%" with the default name. The idea is just to dynamically build the summarize_ expression. The summarize_at function is really just a convenience wrapper around that basic function.
If you are trying to do everything with dplyr (which might be easier to remember), then you can leverage the new across function which will be available from dplyr 1.0.0.
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarize(across(Sepal.Length:Petal.Width, mean)) %>%
cbind(iris %>%
group_by(Species) %>%
summarize(across(Petal.Width, max)) %>%
select(-Species)
)
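This shows that the only difficulty is combining two calculations on the same column, Petal.Width, of a grouped data frame: you have to do the grouping again, but you can nest it inside the cbind.
This returns the result below. Note that the nested pipeline does not repeat the filter, so its maximum is taken over all rows of each species: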
Species Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Width
1 setosa 5.313636 3.713636 1.509091 0.2772727 0.6
2 versicolor 5.997872 2.804255 4.317021 1.3468085 1.8
3 virginica 6.622449 2.983673 5.573469 2.0326531 2.5
If the task required only one calculation on the column Petal.Width instead of two, it could be written elegantly as:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarize(
across(Sepal.Length:Petal.Length, mean),
across(Petal.Width, max)
)
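Two calculations on the same column can also stay inside a single summarise() if the second across() gets a named list of functions. This is a sketch assuming dplyr 1.0.0 or later; the output names Petal.Width_mean and Petal.Width_max come from the .names pattern chosen here, not from the question:

```r
library(dplyr)

res <- iris %>%
  filter(Sepal.Length > 5) %>%
  group_by(Species) %>%
  summarise(
    # plain means for the first three measurement columns
    across(Sepal.Length:Petal.Length, mean),
    # both mean and max for Petal.Width, named <column>_<function>
    across(Petal.Width, list(mean = mean, max = max), .names = "{.col}_{.fn}")
  )

res
```

Because Petal.Width is only touched by the second across(), both functions see the original filtered values, so no second grouping or cbind is needed.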
I was looking for something similar and tried the following. It works well and is much easier to read than the suggested solutions.
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise(MeanSepalLength=mean(Sepal.Length),
MeanSepalWidth = mean(Sepal.Width),
MeanPetalLength=mean(Petal.Length),
MeanPetalWidth=mean(Petal.Width),
MaxPetalWidth=max(Petal.Width))
# A tibble: 3 x 6
Species MeanSepalLength MeanSepalWidth MeanPetalLength MeanPetalWidth MaxPetalWidth
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 3.43 1.46 0.246 0.6
2 versicolor 5.94 2.77 4.26 1.33 1.8
3 virginica 6.59 2.97 5.55 2.03 2.5
In the summarise() call, give each output column a name and pass the column to summarise to your function of choice.

use dplyr's summarise_each to return one row per function?

I'm using dplyr's summarise_each to apply a function to multiple columns of data. One thing that's nice is that you can apply multiple functions at once. Thing is, it's annoying that the output is a dataframe with a single row. It seems like it should return as many rows as functions, with as many columns as columns that were summarised.
library(dplyr)
default <-
iris %>%
summarise_each(funs(min, max), matches("Petal"))
this returns
> default
Petal.Length_min Petal.Width_min Petal.Length_max Petal.Width_max
1 1 0.1 6.9 2.5
I'd prefer something like
library(reshape2)
desired <-
iris %>%
select(matches("Petal")) %>%
melt() %>%
group_by(variable) %>%
summarize(min=min(value),max=max(value)) %>%
t()
which returns something close (not a dataframe, but you all get the idea)
> desired
[,1] [,2]
variable "Petal.Length" "Petal.Width"
min "1.0" "0.1"
max "6.9" "2.5"
Is there an option in summarise_each to do this? If not, Hadley, would you mind adding it?
You can achieve a similar output combining the dplyr and tidyr packages.
Something along these lines can help
library(dplyr)
library(tidyr)
iris %>%
select(matches("Petal")) %>%
summarise_each(funs(min, max)) %>%
gather(variable, value) %>%
separate(variable, c("var", "stat"), sep = "\\_") %>%
spread(var, value)
## stat Petal.Length Petal.Width
## 1 max 6.9 2.5
## 2 min 1.0 0.1
To my knowledge there's no such argument. Anyhow, here's a workaround that outputs tidy data, I think that would be even better than having as many rows as functions and as many columns as summarised columns. (note that add_rownames requires dplyr 0.4.0)
library("dplyr")
library("tidyr")
iris %>%
summarise_each(funs(min, max, mean, median), matches("Petal")) %>%
t %>%
as.data.frame %>%
add_rownames %>%
separate(rowname, into = c("feature", "fun"), sep = "_")
returns:
feature fun V1
1 Petal.Length min 1.000000
2 Petal.Width min 0.100000
3 Petal.Length max 6.900000
4 Petal.Width max 2.500000
5 Petal.Length mean 3.758000
6 Petal.Width mean 1.199333
7 Petal.Length median 4.350000
8 Petal.Width median 1.300000
One option is to use purrr::map_dfc (the map_df variant that binds the results column-wise with bind_cols, giving back a data frame) with a function that returns a vector holding the result of each summary function, i.e.
library(tidyverse)
iris %>% select(contains('Petal')) %>%
map_dfc(~c(min(.x), max(.x))) %>%
mutate(stat = c('min', 'max')) # to add column of function names
#> # A tibble: 2 × 3
#> Petal.Length Petal.Width stat
#> <dbl> <dbl> <chr>
#> 1 1.0 0.1 min
#> 2 6.9 2.5 max
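With newer package versions the same one-row-per-function shape can be reached without summarise_each or gather. This is a sketch assuming dplyr 1.0.0 or later and tidyr 1.0.0 or later; the column name stat is chosen here for illustration:

```r
library(dplyr)
library(tidyr)

res <- iris %>%
  # one column per variable/function pair, e.g. Petal.Length_min
  summarise(across(matches("Petal"),
                   list(min = min, max = max),
                   .names = "{.col}_{.fn}")) %>%
  # split the names at "_": the first part becomes a column, the second a row label
  pivot_longer(everything(),
               names_to = c(".value", "stat"),
               names_sep = "_")

res
```

The special ".value" entry in names_to tells pivot_longer() to keep the first part of each name as an output column, so the function name ends up in stat with one row per function.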

Refactor R code when library functions use non-standard evaluation

I have some R code that looks like this:
library(dplyr)
library(datasets)
iris %.% group_by(Species) %.% filter(rank(Petal.Length, ties.method = 'random')<=2) %.% ungroup()
Giving:
Source: local data frame [6 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.3 3.0 1.1 0.1 setosa
2 4.6 3.6 1.0 0.2 setosa
3 5.0 2.3 3.3 1.0 versicolor
4 5.1 2.5 3.0 1.1 versicolor
5 4.9 2.5 4.5 1.7 virginica
6 6.0 3.0 4.8 1.8 virginica
This groups by species, and for each group keeps only the two with the shortest Petal.Length. I have some duplication in my code, because I do this several times for different columns and numbers. E.g.:
iris %.% group_by(Species) %.% filter(rank(Petal.Length, ties.method = 'random')<=2) %.% ungroup()
iris %.% group_by(Species) %.% filter(rank(-Petal.Length, ties.method = 'random')<=2) %.% ungroup()
iris %.% group_by(Species) %.% filter(rank(Petal.Width, ties.method = 'random')<=3) %.% ungroup()
iris %.% group_by(Species) %.% filter(rank(-Petal.Width, ties.method = 'random')<=3) %.% ungroup()
I want to extract this into a function. The naive approach doesn't work:
keep_min_n_by_species <- function(expr, n) {
iris %.% group_by(Species) %.% filter(rank(expr, ties.method = 'random') <= n) %.% ungroup()
}
keep_min_n_by_species(Petal.Width, 2)
Error in filter_impl(.data, dots(...), environment()) :
object 'Petal.Width' not found
As I understand it, the expression rank(Petal.Length, ties.method = 'random') <= 2 is evaluated in a different context, introduced by the filter function, that provides a meaning for the Petal.Length expression. I can't just swap in a variable for Petal.Length, because it will be evaluated in the wrong context. I've tried using different combinations of substitute and eval, having read this page: Non-standard evaluation. I can't figure out an appropriate combination. I think the problem might be that I don't just want to pass through an expression from the caller (Petal.Length) through to filter for it to evaluate - I want to construct a new bigger expression (rank(Petal.Length, ties.method = 'random') <= 2) and then pass that whole expression through to filter for it to evaluate.
How can I refactor this expression into a function?
More generally, how should I go about extracting an R expression into a function?
Even more generally, am I approaching this with the wrong mindset? In more mainstream languages I'm familiar with (e.g. Python, C++, C#), this is a relatively straightforward operation that I want to do all the time to remove duplication in my code. In R it seems (to me, at least) that non-standard evaluation can make it a very non-obvious operation. Should I be doing something else entirely?
dplyr version 0.3 is beginning to address this using the lazyeval package, as @baptiste mentioned, and a new family of functions that use standard evaluation (same function names as the NSE versions, but ending in _). There is a vignette here: https://github.com/hadley/dplyr/blob/master/vignettes/nse.Rmd
All that being said, I don't know best practices for what you're trying to do (though I'm trying to do the same thing). I have something working, but like I said, I don't know if it's the best way to do it. Note the use of filter_() instead of filter(), and passing in the argument as a quoted character string:
devtools::install_github("hadley/dplyr")
devtools::install_github("hadley/lazyeval")
library(dplyr)
library(lazyeval)
keep_min_n_by_species <- function(expr, n, rev = FALSE) {
iris %>%
group_by(Species) %>%
filter_(interp(~rank(if (rev) -x else x, ties.method = 'random') <= y, # filter_, not filter
x = as.name(expr), y = n)) %>%
ungroup()
}
keep_min_n_by_species("Petal.Width", 3) # "Petal.Width" as character string
keep_min_n_by_species("Petal.Width", 3, rev = TRUE)
Update based on @hadley's comment:
keep_min_n_by_species <- function(expr, n) {
expr <- lazy(expr)
formula <- interp(~rank(x, ties.method = 'random') <= y,
x = expr, y = n)
iris %>%
group_by(Species) %>%
filter_(formula) %>%
ungroup()
}
keep_min_n_by_species(Petal.Width, 3)
keep_min_n_by_species(-Petal.Width, 3)
How about
keep_min_n_by_species <- function(expr, n) {
mc <- match.call()
fx <- bquote(rank(.(mc$expr), ties.method = 'random') <= .(mc$n))
iris %.% group_by(Species) %.% filter(fx) %.% ungroup()
}
That seems to allow all the statements to run without error
keep_min_n_by_species(Petal.Width, 2)
keep_min_n_by_species(-Petal.Width, 2)
keep_min_n_by_species(Petal.Width, 3)
keep_min_n_by_species(-Petal.Width, 3)
The idea is that we use match.call() to capture the unevaluated expressions passed to the function. Then we use bquote() to build the filter as a call object.
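For readers on a recent dplyr, here is a sketch of the same refactoring with the embrace operator {{ }}, assuming dplyr 1.0.0 or later (rlang 0.4.0 or later). Unlike the versions above, this one takes the data frame as a first argument, which is a change made here for illustration:

```r
library(dplyr)

# Keep the n rows per species with the smallest value of `expr`;
# `expr` is passed unquoted and forwarded into filter() with {{ }}
keep_min_n_by_species <- function(data, expr, n) {
  data %>%
    group_by(Species) %>%
    filter(rank({{ expr }}, ties.method = "random") <= n) %>%
    ungroup()
}

smallest <- keep_min_n_by_species(iris, Petal.Width, 3)
largest  <- keep_min_n_by_species(iris, -Petal.Width, 3)
```

Because ties are broken at random, the ranks within each species are a permutation of 1 to 50, so each call keeps exactly n rows per species, although which tied rows survive can change between runs.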

Resources