In R, working in the tidyverse:
My data sources change. There's a column which is only present some weeks. When it is, I want to summarize it. Using iris as an example, suppose that Sepal.Width is sometimes missing. Conceptually, I want a function like this
library(tidyverse)
summIris <- function(irisDf){
irisDf %>%
group_by(Species) %>%
summarise_ifPresent(
Sepal.Length = mean(Sepal.Length),
Sepal.Width = mean(Sepal.Width))
}
which would return:
R > summIris(iris )
# A tibble: 3 x 3
Species Sepal.Length Sepal.Width
<fct> <dbl> <dbl>
1 setosa 5.01 3.43
2 versicolor 5.94 2.77
3 virginica 6.59 2.97
> summIris(iris %>% select(- Sepal.Width ))
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
I could work around this by wrapping the logic in if/else, but is there something more concise and elegant?
summarize_at allows you to define on which columns you execute the summary, and you can use starts_with, ends_with, matches, or contains to dynamically select columns.
library(dplyr)
iris %>%
group_by(Species) %>%
summarize_at(vars(starts_with("Sepal")), mean) # funs() is deprecated; pass the function directly
# # A tibble: 3 x 3
# Species Sepal.Length Sepal.Width
# <fct> <dbl> <dbl>
# 1 setosa 5.01 3.43
# 2 versicolor 5.94 2.77
# 3 virginica 6.59 2.97
iris %>%
select(-Sepal.Length) %>%
group_by(Species) %>%
summarize_at(vars(starts_with("Sepal")), mean) # funs() is deprecated; pass the function directly
# # A tibble: 3 x 2
# Species Sepal.Width
# <fct> <dbl>
# 1 setosa 3.43
# 2 versicolor 2.77
# 3 virginica 2.97
one_of() also works, but it gives a warning for columns it cannot find:
iris %>%
select(-Sepal.Length) %>%
group_by(Species) %>%
summarize_at(vars(one_of(c("Sepal.Width", "Sepal.Length"))), mean) # funs() is deprecated; pass the function directly
# Warning: Unknown columns: `Sepal.Length`
# # A tibble: 3 x 2
# Species Sepal.Width
# <fct> <dbl>
# 1 setosa 3.43
# 2 versicolor 2.77
# 3 virginica 2.97
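If the warning is unwanted, here is a minimal sketch of another option. It assumes dplyr >= 1.0.0 (for across()) and a tidyselect version that provides any_of(), which selects whichever of the listed columns exist and silently skips the rest. The function name summIris is taken from the question; the object names with_both and without_width are only for illustration.

```r
library(dplyr)

# Summarise only the columns that are actually present.
# any_of() ignores names that are missing from the data, without a warning.
summIris <- function(irisDf) {
  irisDf %>%
    group_by(Species) %>%
    summarise(across(any_of(c("Sepal.Length", "Sepal.Width")), mean))
}

with_both     <- summIris(iris)
without_width <- summIris(iris %>% select(-Sepal.Width))

names(with_both)
names(without_width)
```

Unlike starts_with(), this keeps an explicit list of the columns you expect, so an unrelated column that happens to share the prefix is not summarized by accident.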
I hope this is a reasonably straightforward problem. I am writing a custom function that performs a battery of summary statistics and formats the data consistently for some reports. The function takes three arguments: a dataset, a grouping variable, and a response variable of interest. I compute the summary statistics with dplyr::summarize(). I know that to use dplyr::summarize() in a custom function, I have to embrace the grouping variable and the response variable with curly-curly notation ({{ }}) inside the dplyr functions. I also want to record the name of the response variable in the output tibble. Outside the tidyverse I would use deparse(substitute()) for this, but that apparently does not work inside dplyr verbs. Here is my reproducible example. I will walk through it piecemeal and then post the uninterrupted code at the end of my question.
For my first attempt, I tried the deparse(substitute({{}})) approach
library(tidyverse)
data("iris")
fxn1<-function(DF, grp, var){
out<-DF %>%
group_by({{grp}}) %>%
summarize(Mean_Val=mean({{var}}, na.rm=TRUE),
Var=deparse(substitute({{var}})))
}
Demo1<-fxn1(iris, Species, Petal.Width)
Demo1
Unfortunately, this puts some sort of deparsed expression in the Var column and messes up the summarization:
# A tibble: 12 x 3
# Groups: Species [3]
Species Mean_Val Var
<fct> <dbl> <chr>
1 setosa 0.246 "(function (...) "
2 setosa 0.246 "{"
3 setosa 0.246 " .External2(ffi_tilde_eval, sys.call(), environment(), ~
4 setosa 0.246 "})(Petal.Width)"
5 versicolor 1.33 "(function (...) "
6 versicolor 1.33 "{"
7 versicolor 1.33 " .External2(ffi_tilde_eval, sys.call(), environment(), ~
8 versicolor 1.33 "})(Petal.Width)"
9 virginica 2.03 "(function (...) "
10 virginica 2.03 "{"
11 virginica 2.03 " .External2(ffi_tilde_eval, sys.call(), environment(), ~
12 virginica 2.03 "})(Petal.Width)"
For my second attempt, I got rid of the curly curly notation in deparse(substitute())
fxn2<-function(DF, grp, var){
out<-DF %>%
group_by({{grp}}) %>%
summarize(Mean_Val=mean({{var}}, na.rm=TRUE),
Var=deparse(substitute(var)))
}
Demo2<-fxn2(iris, Species, Petal.Width)
Demo2
This is almost correct, but instead of putting "Petal.Width" in the Var column, it puts "var".
# A tibble: 3 x 3
Species Mean_Val Var
<fct> <dbl> <chr>
1 setosa 0.246 var
2 versicolor 1.33 var
3 virginica 2.03 var
What I want my data to look like is this
Desired<-iris %>% group_by(Species) %>% summarize(Mean_Val=mean(Petal.Width), Var="Petal.Width")
Desired
Which looks like this:
# A tibble: 3 x 3
Species Mean_Val Var
<fct> <dbl> <chr>
1 setosa 0.246 Petal.Width
2 versicolor 1.33 Petal.Width
3 virginica 2.03 Petal.Width
Does anyone know how to get dplyr:: to do the equivalent of deparse(substitute), but actually return the name of the variable, and not the argument name? Any guidance would be much appreciated.
Here is the uninterrupted reproducible example:
library(tidyverse)
data("iris")
fxn1<-function(DF, grp, var){
out<-DF %>%
group_by({{grp}}) %>%
summarize(Mean_Val=mean({{var}}, na.rm=TRUE),
Var=deparse(substitute({{var}})))
}
Demo1<-fxn1(iris, Species, Petal.Width)
Demo1
fxn2<-function(DF, grp, var){
out<-DF %>%
group_by({{grp}}) %>%
summarize(Mean_Val=mean({{var}}, na.rm=TRUE),
Var=deparse(substitute(var)))
}
Demo2<-fxn2(iris, Species, Petal.Width)
Demo2
Desired<-iris %>% group_by(Species) %>% summarize(Mean_Val=mean(Petal.Width), Var="Petal.Width")
Desired
This more or less gives your expected output, although "Var" ends up as a list column, which is not ideal. Does it solve your problem?
library(tidyverse)
data("iris")
fxn1<-function(DF, grp, var){
out<-DF %>%
group_by({{grp}}) %>%
summarize(Mean_Val=mean({{var}}, na.rm=TRUE),
Var=deparse(substitute({{var}})))
}
Demo1<-fxn1(iris, Species, Petal.Width)
#> `summarise()` has grouped output by 'Species'. You can override using the
#> `.groups` argument.
Demo1
#> # A tibble: 12 × 3
#> # Groups: Species [3]
#> Species Mean_Val Var
#> <fct> <dbl> <chr>
#> 1 setosa 0.246 "(function (...) "
#> 2 setosa 0.246 "{"
#> 3 setosa 0.246 " .External2(ffi_tilde_eval, sys.call(), environment(…
#> 4 setosa 0.246 "})(Petal.Width)"
#> 5 versicolor 1.33 "(function (...) "
#> 6 versicolor 1.33 "{"
#> 7 versicolor 1.33 " .External2(ffi_tilde_eval, sys.call(), environment(…
#> 8 versicolor 1.33 "})(Petal.Width)"
#> 9 virginica 2.03 "(function (...) "
#> 10 virginica 2.03 "{"
#> 11 virginica 2.03 " .External2(ffi_tilde_eval, sys.call(), environment(…
#> 12 virginica 2.03 "})(Petal.Width)"
fxn2<-function(DF, grp, var){
out<-DF %>%
group_by({{grp}}) %>%
summarize(Mean_Val=mean({{var}}, na.rm=TRUE),
Var=deparse(substitute(var)))
}
Demo2<-fxn2(iris, Species, Petal.Width)
Demo2
#> # A tibble: 3 × 3
#> Species Mean_Val Var
#> <fct> <dbl> <chr>
#> 1 setosa 0.246 var
#> 2 versicolor 1.33 var
#> 3 virginica 2.03 var
Desired<-iris %>% group_by(Species) %>% summarize(Mean_Val=mean(Petal.Width), Var="Petal.Width")
Desired
#> # A tibble: 3 × 3
#> Species Mean_Val Var
#> <fct> <dbl> <chr>
#> 1 setosa 0.246 Petal.Width
#> 2 versicolor 1.33 Petal.Width
#> 3 virginica 2.03 Petal.Width
fxn3 <- function(DF, grp, var){
DF %>%
group_by({{grp}}) %>%
summarize(Mean_Val=mean({{var}}, na.rm=TRUE),
Var=c(ensym(var)))
}
Demo3 <- fxn3(iris, Species, Petal.Width)
Demo3
#> # A tibble: 3 × 3
#> Species Mean_Val Var
#> <fct> <dbl> <list>
#> 1 setosa 0.246 <sym>
#> 2 versicolor 1.33 <sym>
#> 3 virginica 2.03 <sym>
print.data.frame(Demo3)
#> Species Mean_Val Var
#> 1 setosa 0.246 Petal.Width
#> 2 versicolor 1.326 Petal.Width
#> 3 virginica 2.026 Petal.Width
Created on 2022-04-21 by the reprex package (v2.0.1)
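To get a plain character column instead of a list column, one option is to capture the variable name once, before the pipe, and pass the resulting string into summarize(). The sketch below assumes the rlang package is installed (it is a dependency of dplyr) and uses rlang::enquo() with rlang::as_label(). The names fxn4, Demo4 and var_name are hypothetical, chosen to follow the numbering above.

```r
library(dplyr)

fxn4 <- function(DF, grp, var) {
  # Capture the expression supplied for `var` and turn it into a string.
  # This happens outside summarize(), so it sees the caller's argument.
  var_name <- rlang::as_label(rlang::enquo(var))

  DF %>%
    group_by({{ grp }}) %>%
    summarize(Mean_Val = mean({{ var }}, na.rm = TRUE),
              Var = var_name)
}

Demo4 <- fxn4(iris, Species, Petal.Width)
Demo4
```

Calling deparse(substitute(var)) at the top of the function body, rather than inside summarize(), should work for the same reason: at that point var still refers to the argument as the caller wrote it.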
I have five columns that I'd like to group by another column and then summarize as the mean of each column. However, I'd like each mean to use only the values within a certain range, for all the columns. Is this possible? I don't want to exclude whole rows, only the out-of-range values from each aggregation.
Current code:
a <- b %>% group_by(c) %>% summarise_all(funs(mean(., na.rm=T)))
If you want to compute the mean on only a subset of the values, you can use a lambda function inside summarise().
However, if the subset is defined by only one variable, you should simply use filter().
Also, note that summarise_all() is superseded and we should use summarise(across()) instead.
Here is an example where the mean is computed using only the values strictly between 2 and 3. A column with no values in that range gives NaN.
library(tidyverse)
iris %>%
group_by(Species) %>%
summarise(across(everything(), ~mean(.x, na.rm=TRUE)))
#> # A tibble: 3 x 5
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.01 3.43 1.46 0.246
#> 2 versicolor 5.94 2.77 4.26 1.33
#> 3 virginica 6.59 2.97 5.55 2.03
my_range = c(inf=2, sup=3)
iris %>%
group_by(Species) %>%
summarise(across(everything(), ~.x[.x>my_range["inf"] & .x<my_range["sup"]] %>% mean(na.rm=TRUE)))
#> # A tibble: 3 x 5
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa NaN 2.60 NaN NaN
#> 2 versicolor NaN 2.63 NaN NaN
#> 3 virginica NaN 2.69 NaN 2.27
Created on 2021-05-12 by the reprex package (v2.0.0)
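If the range filter is needed in several places, the lambda can be pulled out into a small helper. This is only a sketch: mean_in_range and in_range are hypothetical names, and it uses dplyr::between(), which includes both endpoints, so the results can differ from the strict comparison above when a value equals 2 or 3.

```r
library(dplyr)

# Mean of the values of x that fall inside [lower, upper] (inclusive).
# Returns NaN when no value is inside the range.
mean_in_range <- function(x, lower, upper) {
  mean(x[between(x, lower, upper)], na.rm = TRUE)
}

in_range <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), ~ mean_in_range(.x, 2, 3)))

in_range
```

No rows are dropped here: each column is filtered independently, right before its own mean is computed.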
I am aggregating some data and I want to add the group sizes N to the output table. Until recently, the code below worked fine. Now, N is equal to the total row count of my table.
iris %>%
group_by(Species) %>%
group_by(N = n(), .add = TRUE) %>%
summarise_all(list(~mean(., na.rm = TRUE)))
# A tibble: 3 x 6
# Groups: Species [3]
Species N Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 setosa 150 5.01 3.43 1.46 0.246
2 versicolor 150 5.94 2.77 4.26 1.33
3 virginica 150 6.59 2.97 5.55 2.03
This looks like a recently introduced bug: it can be reproduced on dplyr 1.0.3 but not on 1.0.2.
You could, however, avoid the second group_by() completely in this case.
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(across(everything(), ~ mean(.x, na.rm = TRUE)), # explicit columns and a lambda avoid deprecation warnings in newer dplyr
N = n())
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width N
#* <fct> <dbl> <dbl> <dbl> <dbl> <int>
#1 setosa 5.01 3.43 1.46 0.246 50
#2 versicolor 5.94 2.77 4.26 1.33 50
#3 virginica 6.59 2.97 5.55 2.03 50
Try this:
rm(list = ls())
library(dplyr)
iris %>%
group_by(Species) %>%
group_by(N = n(), .add = TRUE) %>%
summarise_all(list(~mean(., na.rm = TRUE)))
(Using iris for reproducibility.)
I want to get the rows with the minimum and maximum Petal.Width within each Species in R. I have done that using the two approaches below. Is there a better approach (preferably tidyverse)? Note that because of ties, the two approaches might return different rows. Please correct me if there is anything wrong with either approach.
Approach 1
library(tidyverse)
iris %>%
group_by(Species) %>%
slice_max(Petal.Width, n = 1, with_ties=FALSE) %>%
rbind(
iris %>%
group_by(Species) %>%
slice_min(Petal.Width, n = 1, with_ties=FALSE))
Approach 2
iris %>%
group_by(Species) %>%
arrange(Petal.Width) %>%
filter(row_number() %in% c(1,n()))
Here is a way to do it with summarise(across()):
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(across(.cols = Petal.Width,
.fns = list(min = min, max = max),
.names = "{col}_{fn}"))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 3
Species Petal.Width_min Petal.Width_max
<fct> <dbl> <dbl>
1 setosa 0.1 0.6
2 versicolor 1 1.8
3 virginica 1.4 2.5
You could easily find the min and max of every numerical variable in a data set this way:
iris %>%
group_by(Species) %>%
summarise(across(where(is.numeric),
.fns = list(min = min, max = max),
.names = "{col}_{fn}"))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 9
Species Sepal.Length_min Sepal.Length_max Sepal.Width_min Sepal.Width_max Petal.Length_min Petal.Length_max Petal.Width_min Petal.Width_max
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 4.3 5.8 2.3 4.4 1 1.9 0.1 0.6
2 versicolor 4.9 7 2 3.4 3 5.1 1 1.8
3 virginica 4.9 7.9 2.2 3.8 4.5 6.9 1.4 2.5
You could also use slice like below:
iris %>%
group_by(Species) %>%
slice(which.min(Petal.Width),
which.max(Petal.Width))
Output:
# A tibble: 6 x 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5 3.5 1.6 0.6 setosa
2 5.9 3.2 4.8 1.8 versicolor
3 6.3 3.3 6 2.5 virginica
4 4.9 3.1 1.5 0.1 setosa
5 4.9 2.4 3.3 1 versicolor
6 6.1 2.6 5.6 1.4 virginica
Using aggregate.
aggregate(Petal.Width ~ Species, iris, function(x) c(min=min(x), max=max(x)))
# Species Petal.Width.min Petal.Width.max
# 1 setosa 0.1 0.6
# 2 versicolor 1.0 1.8
# 3 virginica 1.4 2.5
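Since the question mentions ties, here is a minimal sketch of a variant that keeps every row tied for the group minimum or maximum, instead of one arbitrary row per extreme. It relies only on filter() and base range(). The object name extremes is hypothetical.

```r
library(dplyr)

# Keep all rows whose Petal.Width equals the group's minimum or maximum.
# Ties are kept, so a group can contribute more than two rows.
extremes <- iris %>%
  group_by(Species) %>%
  filter(Petal.Width %in% range(Petal.Width)) %>%
  ungroup()

extremes
```

The comparison is exact because range() returns values taken from the column itself, so no floating-point tolerance is needed.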
I'm creating a bunch of basic status reports, and one of the things I'm finding tedious is adding a total row to all my tables. I'm currently using the tidyverse approach, and this is an example of my current code. What I'm looking for is an option to have a few different levels of summarization included by default.
#load into RStudio viewer (not required)
iris = iris
#summary at the group level
summary_grouped = iris %>%
group_by(Species) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))
#summary at the overall level
summary_overall = iris %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
mutate(Species = "Overall")
#append results for report
summary_table = rbind(summary_grouped, summary_overall)
Doing this multiple times over is very tedious. I kind of want:
summary_overall = iris %>%
group_by(Species, total = TRUE) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))
FYI - if you're familiar with SAS, I'm looking for the same type of functionality available via the class, ways, or types statements in proc means, which let me control the level of summarization and get multiple levels in one call.
Any help is appreciated. I know I can create my own function, but was hoping there is something that already exists. I would also prefer to stick with the tidyverse style of programming though I'm not set on that.
Another alternative:
library(tidyverse)
iris %>%
mutate_at("Species", as.character) %>%
list(group_by(.,Species), .) %>%
map(~summarize(.,mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))) %>%
bind_rows() %>%
replace_na(list(Species="Overall"))
#> # A tibble: 4 x 3
#> Species mean_s_length max_s_width
#> <chr> <dbl> <dbl>
#> 1 setosa 5.01 4.4
#> 2 versicolor 5.94 3.4
#> 3 virginica 6.59 3.8
#> 4 Overall 5.84 4.4
You can write a function which does the same summarize on an ungrouped tibble and rbinds that to the end.
summarize2 <- function(df, ...){
bind_rows(summarise(df, ...), summarize(ungroup(df), ...))
}
iris %>%
group_by(Species) %>%
summarize2(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)
)
# # A tibble: 4 x 3
# Species mean_s_length max_s_width
# <fct> <dbl> <dbl>
# 1 setosa 5.01 4.4
# 2 versicolor 5.94 3.4
# 3 virginica 6.59 3.8
# 4 NA 5.84 4.4
You could add some logic for how the "Overall" group should be labelled, if you want:
summarize2 <- function(df, ...){
s1 <- summarise(df, ...)
s2 <- summarize(ungroup(df), ...)
for(v in group_vars(df)){ # use the original grouping variables; summarise() drops the last one from s1
if(is.factor(s1[[v]]))
s1[[v]] <- as.character(s1[[v]])
if(is.character(s1[[v]]))
s2[[v]] <- 'Overall'
else if(is.numeric(s1[[v]]))
s2[[v]] <- -Inf
}
bind_rows(s1, s2)
}
iris %>%
group_by(Species, g = Petal.Length %/% 1) %>%
summarize2(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)
)
# # Groups: Species [4]
# Species g mean_s_length max_s_width
# <chr> <dbl> <dbl> <dbl>
# 1 setosa 1 5.01 4.4
# 2 versicolor 3 5.35 2.9
# 3 versicolor 4 6.09 3.4
# 4 versicolor 5 6.35 3
# 5 virginica 4 5.85 3
# 6 virginica 5 6.44 3.4
# 7 virginica 6 7.43 3.8
# 8 Overall -Inf 5.84 4.4
library(dplyr)
iris %>%
group_by(Species) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
ungroup() %>%
mutate_at(vars(Species), as.character) %>%
# note: the mean of the group means equals the overall mean only because the groups are equal-sized
{rbind(.,c("Overall",mean(.$mean_s_length),max(.$max_s_width)))} %>%
mutate_at(vars(-Species), as.double) %>%
mutate_at(vars(Species), as.factor)
#> # A tibble: 4 x 3
#> Species mean_s_length max_s_width
#> <fct> <dbl> <dbl>
#> 1 setosa 5.01 4.4
#> 2 versicolor 5.94 3.4
#> 3 virginica 6.59 3.8
#> 4 Overall 5.84 4.4
Created on 2019-06-21 by the reprex package (v0.3.0)
One way, also tedious but in one longer pipe, is to put the second summarise instructions in bind_rows.
The as.character call avoids a warning:
Warning messages:
1: In bind_rows_(x, .id) :
binding factor and character vector, coercing into character vector
2: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
library(tidyverse)
summary_grouped <- iris %>%
mutate(Species = as.character(Species)) %>%
group_by(Species) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
bind_rows(iris %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
mutate(Species = "Overall"))
## A tibble: 4 x 3
# Species mean_s_length max_s_width
# <chr> <dbl> <dbl>
#1 setosa 5.01 4.4
#2 versicolor 5.94 3.4
#3 virginica 6.59 3.8
#4 Overall 5.84 4.4
Maybe something like this:
Since you want to perform different operations on the same input (iris), it is best to map over the different summary functions and apply each one to the data.
map_dfr() combines the list of outputs using bind_rows().
library(dplyr)
library(purrr)
pipe <- . %>%
group_by(Species) %>%
summarize(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))
map_dfr(
list(pipe, . %>% mutate(Species = "Overall") %>% pipe),
exec,
iris)
#> Warning in bind_rows_(x, .id): binding factor and character vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> # A tibble: 4 x 3
#> Species mean_s_length max_s_width
#> <chr> <dbl> <dbl>
#> 1 setosa 5.01 4.4
#> 2 versicolor 5.94 3.4
#> 3 virginica 6.59 3.8
#> 4 Overall 5.84 4.4
Here is a solution where the summary function is applied only once, on a doubled dataset:
library(tidyverse)
iris %>%
rbind(mutate(., Species = "Overall")) %>%
group_by(Species) %>%
summarize(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)
)
# A tibble: 4 x 3
Species mean_s_length max_s_width
<chr> <dbl> <dbl>
1 Overall 5.84 4.4
2 setosa 5.01 4.4
3 versicolor 5.94 3.4
4 virginica 6.59 3.8
The trick is to append a copy of the original dataset with a new group ID (i.e. Species): mutate(iris, Species = "Overall")
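The doubled-dataset idea can be wrapped in a small reusable helper, so each report only needs one extra line in the pipe. This is a sketch, not an existing dplyr feature: add_overall, its col and label arguments, and summary_table are hypothetical names, and the grouping column is passed as a string to keep the helper simple.

```r
library(dplyr)

# Stack the data on top of a copy in which the grouping column is
# replaced by a single label, so one summarize() yields a total row too.
add_overall <- function(df, col, label = "Overall") {
  original <- df
  original[[col]] <- as.character(original[[col]]) # avoid factor/character mixing
  overall <- df
  overall[[col]] <- label
  bind_rows(original, overall)
}

summary_table <- iris %>%
  add_overall("Species") %>%
  group_by(Species) %>%
  summarize(mean_s_length = mean(Sepal.Length),
            max_s_width = max(Sepal.Width))

summary_table
```

The trade-off is memory: the data is duplicated once per extra level, which is fine for report-sized tables but may matter for very large inputs.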