I'd like to use dplyr's group_by to group by multiple columns, which is simple enough. The complication is that I want to write a function in which one or more columns are always in the grouping and the user can choose one additional column to group by. So far I've tried passing the always-present columns as bare (unquoted) names and the user-selected column as a string, but nothing I've tried works. This combination seems to work fine in select(), but not in group_by(). Ideally, I'd rather not switch to all strings, because I want to keep using the dplyr functionality that lets me select a range of columns. Below is an example.
To make a simple example, I started with the iris data set and added a couple more columns; their exact meanings are not important.
test_tbl <- iris %>%
mutate(extra_var1 = ifelse(Sepal.Length >= 5.0, "Yes", "No"),
extra_var2 = "What")
Here's an example that uses the non-string specification for all variables, which works just fine:
test_tbl %>%
select(Species, extra_var1, Sepal.Length, Petal.Width) %>%
group_by(Species, extra_var1) %>%
summarize(average.Sepal.Length = mean(Sepal.Length),
average.Petal.Width = mean(Petal.Width))
But, I'd like to be able to, within a function, have the user specify whether they want to group by extra_var1 or extra_var2. Here's my attempt, which doesn't work. Again, I believe the select part works fine, but the group_by part does not.
group_and_summarize <- function(var) {
test_tbl %>%
select(Species, var, Sepal.Length, Petal.Width) %>%
group_by(Species, var) %>%
summarize(average.Sepal.Length = mean(Sepal.Length),
average.Petal.Width = mean(Petal.Width))
}
group_and_summarize("extra_var1")
One way to do it is to wrap the argument in the embrace operator {{ }} and have the user pass an unquoted column name:
library(dplyr)
group_and_summarize <- function(var) {
test_tbl %>%
select(Species, {{var}}, Sepal.Length, Petal.Width) %>%
group_by(Species, {{var}}) %>%
summarize(average.Sepal.Length = mean(Sepal.Length),
average.Petal.Width = mean(Petal.Width))
}
group_and_summarize(extra_var1)
#> `summarise()` regrouping output by 'Species' (override with `.groups` argument)
#> # A tibble: 6 x 4
#> # Groups: Species [3]
#> Species extra_var1 average.Sepal.Length average.Petal.Width
#> <fct> <chr> <dbl> <dbl>
#> 1 setosa No 4.67 0.195
#> 2 setosa Yes 5.23 0.28
#> 3 versicolor No 4.9 1
#> 4 versicolor Yes 5.96 1.33
#> 5 virginica No 4.9 1.7
#> 6 virginica Yes 6.62 2.03
Created on 2021-05-11 by the reprex package (v0.3.0)
If you want the user to enter strings then we can use !!! syms():
group_and_summarize <- function(vars) {
test_tbl %>%
select(Species, !!! syms(vars), Sepal.Length, Petal.Width) %>%
group_by(Species, !!! syms(vars)) %>%
summarize(average.Sepal.Length = mean(Sepal.Length),
average.Petal.Width = mean(Petal.Width))
}
group_and_summarize(c("extra_var1", "extra_var2"))
#> `summarise()` regrouping output by 'Species', 'extra_var1' (override with `.groups` argument)
#> # A tibble: 6 x 5
#> # Groups: Species, extra_var1 [6]
#> Species extra_var1 extra_var2 average.Sepal.Length average.Petal.Width
#> <fct> <chr> <chr> <dbl> <dbl>
#> 1 setosa No What 4.67 0.195
#> 2 setosa Yes What 5.23 0.28
#> 3 versicolor No What 4.9 1
#> 4 versicolor Yes What 5.96 1.33
#> 5 virginica No What 4.9 1.7
#> 6 virginica Yes What 6.62 2.03
Created on 2021-05-11 by the reprex package (v0.3.0)
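As another option for string input, a tidyselect helper can be wrapped in across() inside group_by(). The following is a minimal sketch, assuming dplyr 1.0.0 or later; the function name group_and_summarize_str is only illustrative and does not come from the answers above.

```r
library(dplyr)

# Same example data as in the question
test_tbl <- iris %>%
  mutate(extra_var1 = ifelse(Sepal.Length >= 5.0, "Yes", "No"),
         extra_var2 = "What")

# vars is a character vector of extra grouping columns (hypothetical helper)
group_and_summarize_str <- function(vars) {
  test_tbl %>%
    # Species is always grouped; all_of() turns the strings into a selection
    group_by(Species, across(all_of(vars))) %>%
    summarize(average.Sepal.Length = mean(Sepal.Length),
              average.Petal.Width = mean(Petal.Width),
              .groups = "drop")
}

res <- group_and_summarize_str("extra_var1")
res
```

Because all_of() accepts a character vector, the same call should also work with several names, for example c("extra_var1", "extra_var2").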
cdata is a tibble (I used haven to import a .sav file into the cdata object).
Why does using cdata$WEIGHT instead of WEIGHT produce such a radical difference in the output below?
This code uses cdata$WEIGHT:
cdata %>% group_by(as.factor(state)) %>%
summarise(n = n(), weighted_n = sum(cdata$WEIGHT))
produces an unwanted table:
This code uses WEIGHT:
cdata %>% group_by(as.factor(state)) %>%
summarise(n = n(), weighted_n = sum(WEIGHT))
produces the correct table:
I realize that tibble has a different mental model than base R. However, the above difference doesn't make intuitive sense to me. What's the intent behind this difference in output when using a common column identification technique (cdata$WEIGHT)?
When there is a grouping variable, cdata$WEIGHT extracts the whole column, so the sum is taken over the entire column and is the same for every group. A bare WEIGHT, by contrast, is evaluated within each group and returns only that group's values.
If we really want to use $, then use the .data pronoun:
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(.data$Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 250.
2 versicolor 297.
3 virginica 329.
which is identical to
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 250.
2 versicolor 297.
3 virginica 329.
Or use cur_data() (deprecated since dplyr 1.1.0 in favour of pick()):
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(cur_data()$Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 250.
2 versicolor 297.
3 virginica 329.
Whereas .$ or iris$ extracts the whole column and ignores the grouping, so every group gets the overall sum:
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(.$Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 876.
2 versicolor 876.
3 virginica 876.
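To see both behaviours side by side, here is a small sketch (the column names per_group and whole are made up for illustration). It computes the grouped sum via .data and the whole-column sum via iris$ in the same summarise() call.

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  summarise(
    per_group = sum(.data$Sepal.Length),  # evaluated inside each group
    whole     = sum(iris$Sepal.Length),   # always the full, ungrouped column
    .groups = "drop"
  )
res
```

The whole column repeats the same total on every row, while the per_group values add up to that total.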
I am trying to make a data.frame which displays the average time an individual displays a behaviour.
I have been using group_by and summarise to calculate the averages across groups. But the output is in long format, running to many rows. See an example using the iris dataset...
data(iris)
x <- iris %>%
group_by(Species, Petal.Length) %>%
summarise(mean(Sepal.Length))
I would like to get an output that has, for this example, one row per 'Species' and a column of averages per 'Petal.Length'.
I have resorted to creating multiple outputs and then using left_join to combine them into the desired data.frame. See example below...
a <- iris %>%
group_by(Species) %>%
filter(Petal.Length == 0.1) %>%
summarise(mean(Sepal.Length))
b <- iris %>%
group_by(Species) %>%
filter(Petal.Length == 0.2) %>%
summarise(mean(Sepal.Length))
left_join(a, b)
However, doing this twelve or more times at a time is tedious and I am sure there must be an easy way to get the mean(Sepal.Length) for the 'Petal.Length' 0.1, and 0.2, and 0.3 (etc) in the one output.
n.b. in my data Petal.Length would actually be characters that represent behaviours and Sepal.Length would be the duration of time
Some ideas:
library(tidyverse)
data(iris)
mutate(iris, Petal.Length_discrete = cut(Petal.Length, 5)) %>%
group_by(Species, Petal.Length_discrete) %>%
summarise(mean(Sepal.Length))
#> `summarise()` has grouped output by 'Species'. You can override using the `.groups` argument.
#> # A tibble: 7 x 3
#> # Groups: Species [3]
#> Species Petal.Length_discrete `mean(Sepal.Length)`
#> <fct> <fct> <dbl>
#> 1 setosa (0.994,2.18] 5.01
#> 2 versicolor (2.18,3.36] 5
#> 3 versicolor (3.36,4.54] 5.81
#> 4 versicolor (4.54,5.72] 6.43
#> 5 virginica (3.36,4.54] 4.9
#> 6 virginica (4.54,5.72] 6.32
#> 7 virginica (5.72,6.91] 7.25
iris %>%
group_split(Species, Petal.Length) %>%
map(~ summarise(.x, mean(Sepal.Length))) %>%
head(3)
#> [[1]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 4.6
#>
#> [[2]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 4.3
#>
#> [[3]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 5.4
Created on 2021-06-28 by the reprex package (v2.0.0)
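If the goal is literally one row per Species and one column of averages per Petal.Length value, another possibility is to summarise in long format and then reshape. This is a sketch assuming tidyr 1.0.0 or later for pivot_wider(); the names mean_sepal_length and the "pl_" prefix are arbitrary choices for the example.

```r
library(dplyr)
library(tidyr)

wide <- iris %>%
  group_by(Species, Petal.Length) %>%
  summarise(mean_sepal_length = mean(Sepal.Length), .groups = "drop") %>%
  # one column per distinct Petal.Length value, one row per Species
  pivot_wider(names_from = Petal.Length,
              values_from = mean_sepal_length,
              names_prefix = "pl_")
wide
```

Combinations that do not occur in the data come out as NA. With character behaviour labels in place of Petal.Length, the labels would become the column names.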
I'd like to collect multiple t-test results into one table. Originally my code looked like this:
tt_data <- iris %>%
group_by(Species) %>%
summarise(p = t.test(Sepal.Length,Petal.Length,alternative="two.sided",paired=T)$p.value,
estimate = t.test(Sepal.Length,Petal.Length,alternative="two.sided",paired=T)$estimate
)
tt_data
# Species p estimate
# setosa 2.542887e-51 3.544
# versicolor 9.667914e-36 1.676
# virginica 7.985259e-28 1.036
However, based on the idea that I should only perform the statistical test once, is there a way for me to run the t test once per group and collect the intended table? I think there is some combination of broom and purrr that does this, but I am unfamiliar with the syntax.
# code idea (I know this won't work!)
tt_data <- iris %>%
group_by(Species) %>%
summarise(tt = t.test(Sepal.Length,Petal.Length,alternative="two.sided",paired=T)) %>%
select(Species, tt.p, tt.estimate)
tt_data
# Species tt.p tt.estimate
# setosa 2.542887e-51 3.544
# versicolor 9.667914e-36 1.676
# virginica 7.985259e-28 1.036
You can use broom::tidy() to transform the result of the t.test to a tidy 'tibble':
library(dplyr)
library(broom)
iris %>%
group_by(Species) %>%
group_modify(~{
t.test(.$Sepal.Length,.$Petal.Length,alternative="two.sided",paired=T) %>%
tidy()
}) %>%
select(estimate, p.value)
#> Adding missing grouping variables: `Species`
#> # A tibble: 3 x 3
#> # Groups: Species [3]
#> Species estimate p.value
#> <fct> <dbl> <dbl>
#> 1 setosa 3.54 2.54e-51
#> 2 versicolor 1.68 9.67e-36
#> 3 virginica 1.04 7.99e-28
Created on 2020-09-02 by the reprex package (v0.3.0)
You can use map to select the desired values from the list generated by t.test and by tidying it up to a data frame via broom::tidy, i.e.
library(dplyr)
library(tidyr)  # provides unnest()
iris %>%
  group_by(Species) %>%
  # run the test once per group and keep the tidied result in a list column
  summarise(p = list(broom::tidy(t.test(Sepal.Length, Petal.Length, alternative = "two.sided", paired = TRUE)))) %>%
  mutate(p.value = purrr::map(p, ~select(.x, c('p.value', 'estimate')))) %>%
  select(-p) %>%
  unnest(cols = p.value)
# A tibble: 3 x 3
# Species p.value estimate
# <fct> <dbl> <dbl>
#1 setosa 2.54e-51 3.54
#2 versicolor 9.67e-36 1.68
#3 virginica 7.99e-28 1.04
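A variant without broom is to store the test object itself in a list column and pull out the pieces afterwards, so t.test() still runs only once per group. This is a sketch rather than a recommendation; the intermediate column name tt is arbitrary.

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  # one htest object per group, wrapped in a list column
  summarise(tt = list(t.test(Sepal.Length, Petal.Length,
                             alternative = "two.sided", paired = TRUE)),
            .groups = "drop") %>%
  mutate(p = vapply(tt, function(x) x$p.value, numeric(1)),
         estimate = vapply(tt, function(x) unname(x$estimate), numeric(1))) %>%
  select(-tt)
res
```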
Suppose we want to group_by() and summarise a massive data.frame with very many columns, but that there are some large groups of consecutive columns that will have the same summarise condition (e.g. max, mean etc)
Is there a way to avoid having to specify the summarise condition for each and every column, and instead do it for ranges of columns?
Example
Suppose we want to do this:
iris %>%
group_by(Species) %>%
summarise(max(Sepal.Length), mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width))
but note that 3 consecutive columns have the same summarise condition, mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width)
Is there a way to use something like mean(Sepal.Width:Petal.Width) to specify the condition for the range of columns, and hence avoid having to type out the summarise condition for every column in between?
Note
The iris example above is a small and manageable example that has a range of 3 consecutive columns, but actual use case has ~hundreds.
dplyr 1.0.0 and later provide the across() function, which does what you are asking for.
Basic usage
across() has two primary arguments:
The first argument, .cols, selects the columns you want to operate on. It uses tidy selection (like select()) so you can pick variables by position, name, and type.
The second argument, .fns, is a function or list of functions to apply to each column. This can also be a purrr style formula (or list of formulas) like ~ .x / 2. (This argument is optional, and you can omit it if you just want to get the underlying data; you'll see that technique used in vignette("rowwise").)
### across() needs dplyr >= 1.0.0; update from CRAN if your installed version is older
# install.packages("dplyr")
library(dplyr, warn.conflicts = FALSE)
Control how the names are created with the .names argument which takes a glue spec:
iris %>%
group_by(Species) %>%
summarise(
across(c(Sepal.Width:Petal.Width), ~ mean(.x, na.rm = TRUE), .names = "mean_{col}"),
across(c(Sepal.Length), ~ max(.x, na.rm = TRUE), .names = "max_{col}")
)
#> # A tibble: 3 x 5
#> Species mean_Sepal.Width mean_Petal.Leng~ mean_Petal.Width max_Sepal.Length
#> * <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 3.43 1.46 0.246 5.8
#> 2 versicolor 2.77 4.26 1.33 7
#> 3 virginica 2.97 5.55 2.03 7.9
Using multiple functions
my_func <- list(
mean = ~ mean(., na.rm = TRUE),
max = ~ max(., na.rm = TRUE)
)
iris %>%
group_by(Species) %>%
summarise(across(where(is.numeric), my_func, .names = "{fn}.{col}"))
#> # A tibble: 3 x 9
#> Species mean.Sepal.Length max.Sepal.Length mean.Sepal.Width max.Sepal.Width
#> * <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.01 5.8 3.43 4.4
#> 2 versicolor 5.94 7 2.77 3.4
#> 3 virginica 6.59 7.9 2.97 3.8
#> mean.Petal.Length max.Petal.Length mean.Petal.Width max.Petal.Width
#> * <dbl> <dbl> <dbl> <dbl>
#> 1 1.46 1.9 0.246 0.6
#> 2 4.26 5.1 1.33 1.8
#> 3 5.55 6.9 2.03 2.5
Created on 2020-03-06 by the reprex package (v0.3.0)
summarise collapses each group to a single row, so we cannot chain a second summarising step onto its output. Instead we can use mutate_at (superseded by across() since dplyr 1.0.0, but still available) to apply each function to a range of columns, and then keep the first row of every group.
library(dplyr)
iris %>%
group_by(Species) %>%
mutate_at(vars(Sepal.Width:Petal.Width), mean) %>%
mutate_at(vars(Sepal.Length), max) %>%
slice(1L)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fct>
#1 5.8 3.43 1.46 0.246 setosa
#2 7 2.77 4.26 1.33 versicolor
#3 7.9 2.97 5.55 2.03 virginica
We can use pmap from purrr to apply different functions to different sets of columns and then join the results back together at the end. Note the use of lst (from tibble, attached by library(tidyverse)), which lets us refer to previously named elements while constructing the list. This allows us to analyze the same column with multiple functions, such as Sepal.Length below.
library(tidyverse)
lst(a = list("Sepal.Length", names(select(iris, Sepal.Length:Petal.Width))),
b = list("max" = max, "mean" = mean),
c = names(b)) %>%
pmap(function(a, b, c) {
iris %>%
group_by(Species) %>%
summarize_at(a, b) %>%
rename_at(a, paste0, "_", c)
}) %>%
reduce(inner_join, by = "Species")
#> # A tibble: 3 x 6
#> Species Sepal.Length_max Sepal.Length_me~ Sepal.Width_mean Petal.Length_me~
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.8 5.01 3.43 1.46
#> 2 versic~ 7 5.94 2.77 4.26
#> 3 virgin~ 7.9 6.59 2.97 5.55
#> # ... with 1 more variable: Petal.Width_mean <dbl>
I'm creating a bunch of basic status reports, and one of the things I'm finding tedious is adding a total row to all my tables. I'm currently using the Tidyverse approach, and this is an example of my current code. What I'm looking for is an option to have a few different levels included by default.
#load into RStudio viewer (not required)
iris = iris
#summary at the group level
summary_grouped = iris %>%
group_by(Species) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))
#summary at the overall level
summary_overall = iris %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
mutate(Species = "Overall")
#append results for report
summary_table = rbind(summary_grouped, summary_overall)
Doing this multiple times over is very tedious. I kind of want:
summary_overall = iris %>%
group_by(Species, total = TRUE) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))
FYI - if you're familiar with SAS I'm looking for the same type of functionality available via a class, ways or types statements in proc means that let me control the level of summarization and get multiple levels in one call.
Any help is appreciated. I know I can create my own function, but was hoping there is something that already exists. I would also prefer to stick with the tidyverse style of programming though I'm not set on that.
Another alternative:
library(tidyverse)
iris %>%
mutate_at("Species", as.character) %>%
list(group_by(.,Species), .) %>%
map(~summarize(.,mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))) %>%
bind_rows() %>%
replace_na(list(Species="Overall"))
#> # A tibble: 4 x 3
#> Species mean_s_length max_s_width
#> <chr> <dbl> <dbl>
#> 1 setosa 5.01 4.4
#> 2 versicolor 5.94 3.4
#> 3 virginica 6.59 3.8
#> 4 Overall 5.84 4.4
You can write a function that runs the same summarize on the ungrouped tibble and binds the result to the end.
summarize2 <- function(df, ...){
bind_rows(summarise(df, ...), summarize(ungroup(df), ...))
}
iris %>%
group_by(Species) %>%
summarize2(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)
)
# # A tibble: 4 x 3
# Species mean_s_length max_s_width
# <fct> <dbl> <dbl>
# 1 setosa 5.01 4.4
# 2 versicolor 5.94 3.4
# 3 virginica 6.59 3.8
# 4 NA 5.84 4.4
You could add some logic for how the "Overall" group should be labelled if you want:
summarize2 <- function(df, ...){
s1 <- summarise(df, ...)
s2 <- summarize(ungroup(df), ...)
for(v in group_vars(s1)){
if(is.factor(s1[[v]]))
s1[[v]] <- as.character(s1[[v]])
if(is.character(s1[[v]]))
s2[[v]] <- 'Overall'
else if(is.numeric(s1[[v]]))
s2[[v]] <- -Inf
}
bind_rows(s1, s2)
}
iris %>%
group_by(Species, g = Petal.Length %/% 1) %>%
summarize2(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)
)
# # Groups: Species [4]
# Species g mean_s_length max_s_width
# <chr> <dbl> <dbl> <dbl>
# 1 setosa 1 5.01 4.4
# 2 versicolor 3 5.35 2.9
# 3 versicolor 4 6.09 3.4
# 4 versicolor 5 6.35 3
# 5 virginica 4 5.85 3
# 6 virginica 5 6.44 3.4
# 7 virginica 6 7.43 3.8
# 8 Overall -Inf 5.84 4.4
library(dplyr)
iris %>%
group_by(Species) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
ungroup() %>%
mutate_at(vars(Species), as.character) %>%
# Note: averaging the group means gives the overall mean only because all groups are the same size
{rbind(.,c("Overall",mean(.$mean_s_length),max(.$max_s_width)))} %>%
mutate_at(vars(-Species), as.double) %>%
mutate_at(vars(Species), as.factor)
#> # A tibble: 4 x 3
#> Species mean_s_length max_s_width
#> <fct> <dbl> <dbl>
#> 1 setosa 5.01 4.4
#> 2 versicolor 5.94 3.4
#> 3 virginica 6.59 3.8
#> 4 Overall 5.84 4.4
Created on 2019-06-21 by the reprex package (v0.3.0)
One way, also tedious but in one longer pipe, is to put the second summarise instructions in bind_rows.
The as.character call avoids a warning:
Warning messages:
1: In bind_rows_(x, .id) :
binding factor and character vector, coercing into character vector
2: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
library(tidyverse)
summary_grouped <- iris %>%
mutate(Species = as.character(Species)) %>%
group_by(Species) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
bind_rows(iris %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
mutate(Species = "Overall"))
## A tibble: 4 x 3
# Species mean_s_length max_s_width
# <chr> <dbl> <dbl>
#1 setosa 5.01 4.4
#2 versicolor 5.94 3.4
#3 virginica 6.59 3.8
#4 Overall 5.84 4.4
Maybe something like this:
As you want to perform different operations on the same input (iris), best to map over the different summary functions and apply to the data.
map_dfr combines the list outputs using bind_rows
library(dplyr)
library(purrr)
pipe <- . %>%
group_by(Species) %>%
summarize(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))
map_dfr(
list(pipe, . %>% mutate(Species = "Overall") %>% pipe),
exec,
iris)
#> Warning in bind_rows_(x, .id): binding factor and character vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> # A tibble: 4 x 3
#> Species mean_s_length max_s_width
#> <chr> <dbl> <dbl>
#> 1 setosa 5.01 4.4
#> 2 versicolor 5.94 3.4
#> 3 virginica 6.59 3.8
#> 4 Overall 5.84 4.4
A solution in which the summary function is applied only once, to a doubled dataset:
library(tidyverse)
iris %>%
rbind(mutate(., Species = "Overall")) %>%
group_by(Species) %>%
summarize(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)
)
# A tibble: 4 x 3
Species mean_s_length max_s_width
<chr> <dbl> <dbl>
1 Overall 5.84 4.4
2 setosa 5.01 4.4
3 versicolor 5.94 3.4
4 virginica 6.59 3.8
The trick is to append a copy of the original dataset whose group ID (here Species) is set to a single new value: mutate(iris, Species = "Overall").
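Since the question mentions repeating this across many reports, the doubling step could be wrapped in a small helper. The sketch below is only one way to do that: add_overall is a hypothetical name, and the "{{ }}" := naming syntax assumes a reasonably recent rlang (0.4.3 or later) and dplyr.

```r
library(dplyr)

# Hypothetical helper: append a copy of the data whose grouping column
# is replaced by a single label, so one summarize() yields the total row.
add_overall <- function(df, group_col, label = "Overall") {
  # convert to character first so a factor column can hold the new label
  df <- mutate(df, "{{ group_col }}" := as.character({{ group_col }}))
  bind_rows(df, mutate(df, "{{ group_col }}" := label))
}

res <- iris %>%
  add_overall(Species) %>%
  group_by(Species) %>%
  summarize(mean_s_length = mean(Sepal.Length),
            max_s_width = max(Sepal.Width),
            .groups = "drop")
res
```

Because the total row is computed from the raw rows rather than from the group summaries, it stays correct when the groups have different sizes.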