As per this article, a recent version of rlang and glue allows to combine tunnelling {{ myobj }} into a glue string:
library(dplyr)
library(rlang)
library(glue)
mean_by <- function(data, by, var, prefix = "avg") {
data %>%
group_by({{ by }}) %>%
summarise("{prefix}_{{ var }}" := mean({{ var }}, na.rm = TRUE))
}
iris %>% mean_by(Species, Sepal.Width)
#> # A tibble: 3 x 2
#> Species avg_Sepal.Width
#> <fct> <dbl>
#> 1 setosa 3.43
#> 2 versicolor 2.77
#> 3 virginica 2.97
But if I want to combine on the other side of the equation, I cannot do this:
analyze_by <- function(data, by, var, prefix = "avg") {
data %>%
group_by({{ by }}) %>%
summarise(analysis = glue("{ prefix } by {{ var }}"))
}
iris %>% analyze_by(Species, Sepal.Width)
#> # A tibble: 3 x 2
#> Species analysis
#> <fct> <glue>
#> 1 setosa avg by { var }
#> 2 versicolor avg by { var }
#> 3 virginica avg by { var }
What would be the best way to get around this problem?
Perhaps, we can use ensym or enquo before passing into glue
analyze_by <- function(data, by, var, prefix = "avg") {
var <- rlang::ensym(var)
data %>%
dplyr::group_by({{ by }}) %>%
dplyr::summarise(analysis = glue::glue("{ prefix } by {var}"))
}
iris %>%
analyze_by(Species, Sepal.Width)
# A tibble: 3 x 2
# Species analysis
# <fct> <glue>
#1 setosa avg by Sepal.Width
#2 versicolor avg by Sepal.Width
#3 virginica avg by Sepal.Width
You could use deparse and substitute :
library(dplyr)
library(rlang)
analyze_by <- function(data, by, var, prefix = "avg") {
var_char <- deparse(substitute(var))
data %>%
group_by({{ by }}) %>%
summarise(analysis = glue::glue("{ prefix } by {var_char}"))
}
iris %>% analyze_by(Species, Sepal.Width)
# A tibble: 3 x 2
# Species analysis
# <fct> <glue>
#1 setosa avg by Sepal.Width
#2 versicolor avg by Sepal.Width
#3 virginica avg by Sepal.Width
Related
I have data with three groups and would like to perform a different custom function on each of the three groups. Rather than write three separate functions, and calling them all separately, I'm wondering whether I can easily wrap all three into one function with a 'group' parameter.
For example, say I want the mean for group A:
library(tidyverse)
data(iris)
iris$Group <- c(rep("A", 50), rep("B", 50), rep("C", 50))
f_a <- function(df){
out <- df %>%
group_by(Species) %>%
summarise(mean = mean(Sepal.Length))
return(out)
}
The median for group B
f_b <- function(df){
out <- df %>%
group_by(Species) %>%
summarise(median = median(Sepal.Length))
return(out)
}
And the standard deviation for group C
f_c <- function(df){
out <- df %>%
group_by(Species) %>%
summarise(sd= sd(Sepal.Length))
return(out)
}
Is there any way I can combine the above functions and run them according to a group parameter?? Like:
fx(df, group = "A")
Which would produce the results of the above f_a function??
Keeping in mind that in my actual use context, I can't simply group_by(group) in the original function, since the actual functions are more complex. Thanks!!
We create a switch inside the function to select the appropriate function to be applied based on the matching input from group. This function is passed into summarise to apply after groupihg by 'Species'
fx <- function(df, group) {
fn_selector <- switch(group,
A = "mean",
B = "median",
C = "sd")
df %>%
group_by(Species) %>%
summarise(!! fn_selector :=
match.fun(fn_selector)(Sepal.Length), .groups = 'drop')
}
-testing
fx(iris, "A")
# A tibble: 3 x 2
# Species mean
# <fct> <dbl>
#1 setosa 5.01
#2 versicolor 5.94
#3 virginica 6.59
fx(iris, "B")
# A tibble: 3 x 2
# Species median
# <fct> <dbl>
#1 setosa 5
#2 versicolor 5.9
#3 virginica 6.5
fx(iris, "C")
# A tibble: 3 x 2
# Species sd
# <fct> <dbl>
#1 setosa 0.352
#2 versicolor 0.516
#3 virginica 0.636
I don't understand the point of having group column in the dataset. When we pass group = "A" in the function this has got nothing to do with group column that was created.
Instead of passing group = "A" in the function and then mapping A to some function you can directly pass the function that you want to apply.
library(dplyr)
f_a <- function(df, fn){
out <- df %>%
group_by(Species) %>%
summarise(out = fn(Sepal.Length))
return(out)
}
f_a(iris, mean)
# A tibble: 3 x 2
# Species out
#* <fct> <dbl>
#1 setosa 5.01
#2 versicolor 5.94
#3 virginica 6.59
f_a(iris, median)
# A tibble: 3 x 2
# Species out
#* <fct> <dbl>
#1 setosa 5
#2 versicolor 5.9
#3 virginica 6.5
I'm creating a bunch of basic status reports and one of things I'm finding tedious is adding a total row to all my tables. I'm currently using the Tidyverse approach and this is an example of my current code. What I'm looking for is an option to have a few different levels included by default.
#load into RStudio viewer (not required)
iris = iris
#summary at the group level
summary_grouped = iris %>%
group_by(Species) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))
#summary at the overall level
summary_overall = iris %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
mutate(Species = "Overall")
#append results for report
summary_table = rbind(summary_grouped, summary_overall)
Doing this multiple times over is very tedious. I kind of want:
summary_overall = iris %>%
group_by(Species, total = TRUE) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))
FYI - if you're familiar with SAS I'm looking for the same type of functionality available via a class, ways or types statements in proc means that let me control the level of summarization and get multiple levels in one call.
Any help is appreciated. I know I can create my own function, but was hoping there is something that already exists. I would also prefer to stick with the tidyverse style of programming though I'm not set on that.
Another alternative:
library(tidyverse)
iris %>%
mutate_at("Species", as.character) %>%
list(group_by(.,Species), .) %>%
map(~summarize(.,mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))) %>%
bind_rows() %>%
replace_na(list(Species="Overall"))
#> # A tibble: 4 x 3
#> Species mean_s_length max_s_width
#> <chr> <dbl> <dbl>
#> 1 setosa 5.01 4.4
#> 2 versicolor 5.94 3.4
#> 3 virginica 6.59 3.8
#> 4 Overall 5.84 4.4
You can write a function which does the same summarize on an ungrouped tibble and rbinds that to the end.
summarize2 <- function(df, ...){
bind_rows(summarise(df, ...), summarize(ungroup(df), ...))
}
iris %>%
group_by(Species) %>%
summarize2(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)
)
# # A tibble: 4 x 3
# Species mean_s_length max_s_width
# <fct> <dbl> <dbl>
# 1 setosa 5.01 4.4
# 2 versicolor 5.94 3.4
# 3 virginica 6.59 3.8
# 4 NA 5.84 4.4
You could add some logic for what the "Overall" groups should be named if you want
summarize2 <- function(df, ...){
s1 <- summarise(df, ...)
s2 <- summarize(ungroup(df), ...)
for(v in group_vars(s1)){
if(is.factor(s1[[v]]))
s1[[v]] <- as.character(s1[[v]])
if(is.character(s1[[v]]))
s2[[v]] <- 'Overall'
else if(is.numeric(s1[[v]]))
s2[[v]] <- -Inf
}
bind_rows(s1, s2)
}
iris %>%
group_by(Species, g = Petal.Length %/% 1) %>%
summarize2(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)
)
# # Groups: Species [4]
# Species g mean_s_length max_s_width
# <chr> <dbl> <dbl> <dbl>
# 1 setosa 1 5.01 4.4
# 2 versicolor 3 5.35 2.9
# 3 versicolor 4 6.09 3.4
# 4 versicolor 5 6.35 3
# 5 virginica 4 5.85 3
# 6 virginica 5 6.44 3.4
# 7 virginica 6 7.43 3.8
# 8 Overall -Inf 5.84 4.4
library(dplyr)
iris %>%
group_by(Species) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
ungroup() %>%
mutate_at(vars(Species), as.character) %>%
{rbind(.,c("Overal",mean(.$mean_s_length),max(.$max_s_width)))} %>%
mutate_at(vars(-Species), as.double) %>%
mutate_at(vars(Species), as.factor)
#> # A tibble: 4 x 3
#> Species mean_s_length max_s_width
#> <fct> <dbl> <dbl>
#> 1 setosa 5.01 4.4
#> 2 versicolor 5.94 3.4
#> 3 virginica 6.59 3.8
#> 4 Overal 5.84 4.4
Created on 2019-06-21 by the reprex package (v0.3.0)
One way, also tedious but in one longer pipe, is to put the second summarise instructions in bind_rows.
The as.character call avoids a warning:
Warning messages:
1: In bind_rows_(x, .id) :
binding factor and character vector, coercing into character vector
2: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
library(tidyverse)
summary_grouped <- iris %>%
mutate(Species = as.character(Species)) %>%
group_by(Species) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
bind_rows(iris %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
mutate(Species = "Overall"))
## A tibble: 4 x 3
# Species mean_s_length max_s_width
# <chr> <dbl> <dbl>
#1 setosa 5.01 4.4
#2 versicolor 5.94 3.4
#3 virginica 6.59 3.8
#4 Overall 5.84 4.4
Maybe something like this:
As you want to perform different operations on the same input (iris), best to map over the different summary functions and apply to the data.
map_dfr combines the list outputs using bind_rows
library(dplyr)
library(purrr)
pipe <- . %>%
group_by(Species) %>%
summarize(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))
map_dfr(
list(pipe, . %>% mutate(Species = "Overall") %>% pipe),
exec,
iris)
#> Warning in bind_rows_(x, .id): binding factor and character vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> # A tibble: 4 x 3
#> Species mean_s_length max_s_width
#> <chr> <dbl> <dbl>
#> 1 setosa 5.01 4.4
#> 2 versicolor 5.94 3.4
#> 3 virginica 6.59 3.8
#> 4 Overall 5.84 4.4
Solution where you need to apply wanted function only once on a double dataset:
library(tidyverse)
iris %>%
rbind(mutate(., Species = "Overall")) %>%
group_by(Species) %>%
summarize(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)
)
# A tibble: 4 x 3
Species mean_s_length max_s_width
<chr> <dbl> <dbl>
1 Overall 5.84 4.4
2 setosa 5.01 4.4
3 versicolor 5.94 3.4
4 virginica 6.59 3.8
Trick is to pass original dataset with a new group ID (ie Species): mutate(iris, Species = "Overall")
Ok, I'm just trying to rename a column inside a nested tibble based on an identifier/character column:
MWE:
library(magrittr)
iris %>%
tibble::as_tibble() %>%
tidyr::nest(-Species) %>%
dplyr::mutate(
Species = as.character(Species),
data = purrr::map2(data, Species,
~dplyr::rename(.x, !!.y := Sepal.Width)))
but this returns the error:
Error in quos(..., .named = TRUE) : object '.y' not found
I have tried using ensym from rlang and all sort of combinations of !! and := without success.
That is the first tibble in the data column should have the Sepal.Width column renamed to setosa, the second to versicolor, and for the last tibble Sepal.Widht should be renamed to virginica.
You could switch away from the formula notation:
library(magrittr)
irisNest <- iris %>%
tibble::as_tibble() %>%
tidyr::nest(-Species) %>%
dplyr::mutate(Species = as.character(Species))
f <- function(x,y) {dplyr::rename(x, !!y := Sepal.Width)}
irisCheck <- dplyr::mutate(irisNest,
data = purrr::map2(data, Species, f))
library("tidyverse")
rename_func <- function(data, Species) {
Species <- as.character(Species)
data %>%
rename(!!Species := Sepal.Length)
}
iris2 <- as_tibble(iris) %>%
nest(-Species) %>%
group_by(Species) %>%
mutate(
data = map2(data, Species, rename_func))
iris2 %>% filter(Species == "setosa") %>% unnest() %>% head(1)
#> # A tibble: 1 x 5
#> # Groups: Species [3]
#> Species setosa Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.1 3.5 1.4 0.2
iris2 %>% filter(Species == "versicolor") %>% unnest() %>% head(1)
#> # A tibble: 1 x 5
#> # Groups: Species [3]
#> Species versicolor Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 versicolor 7 3.2 4.7 1.4
iris2 %>% filter(Species == "virginica") %>% unnest() %>% head(1)
#> # A tibble: 1 x 5
#> # Groups: Species [3]
#> Species virginica Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 virginica 6.3 3.3 6 2.5
Created on 2019-03-10 by the reprex package (v0.2.1)
I would like to pre-assign my column name and use that within a dplyr pipe
Here's an example. I want to do this:
iris %>%
group_by(Species) %>%
summarise(Var = mean(Petal.Length[Sepal.Width > 3]))
But with the column name assigned outside of the pipe, like this
col_name <- "Petal.Length"
iris %>%
group_by(Species) %>%
summarise(Var = mean(!!col_name[Sepal.Width > 3]))
We can convert to symbol (sym) and then do the evaluation (!!)
iris %>%
group_by(Species) %>%
summarise(Var = mean((!!rlang::sym(col_name))[Sepal.Width >3]))
# A tibble: 3 x 2
# Species Var
# <fct> <dbl>
#1 setosa 1.48
#2 versicolor 4.65
#3 virginica 5.72
If we need to use only dplyr, then can pass the variable object in summarise_at
iris %>%
group_by(Species) %>%
summarise_at(vars(col_name), funs(mean(.[Sepal.Width > 3])))
# A tibble: 3 x 2
# Species Petal.Length
# <fct> <dbl>
#1 setosa 1.48
#2 versicolor 4.65
#3 virginica 5.72
I am trying to carry out the following dplyr task, but within a function.
library("dplyr")
iris %>%
group_by(Species) %>%
summarise(N = sum(Petal.Width == 0.2, na.rm = T))
I was thinking along the lines of the following, which is not complete because I am unclear on the syntax.
getSummary <- function(varName,level) {
summary <- iris %>%
group_by(Species %>%
summarise_(N = interp(~sum(var == ilevel, na.rm = T),
var = as.name(varName))))
sums <- summary$N
}
In this case levels is the numeric 0.2. Are there any changes if the value is a character "0.2"?
dplyr is in the process of switching over from a lazyeval-powered NSE system to an rlang-powered one. On the new version (available now through the GitHub version, and soon through CRAN), you can use
library(dplyr)
getSummary <- function(varName, level) {
varName <- enquo(varName) # parse and quote variable name
iris %>%
group_by(Species) %>%
summarise(N = sum((!!varName) == level), # unquote with !! to use
var = rlang::quo_text(varName)) # turn quosure to string
}
getSummary(Petal.Width, 0.2)
#> # A tibble: 3 × 3
#> Species N var
#> <fctr> <int> <chr>
#> 1 setosa 29 Petal.Width
#> 2 versicolor 0 Petal.Width
#> 3 virginica 0 Petal.Width
# or make it accept strings
getSummary <- function(varName, level) {
iris %>%
group_by(Species) %>%
summarise(N = sum((!!rlang::sym(varName)) == level),
var = varName)
}
getSummary('Sepal.Length', 5.0)
#> # A tibble: 3 × 3
#> Species N var
#> <fctr> <int> <chr>
#> 1 setosa 8 Sepal.Length
#> 2 versicolor 2 Sepal.Length
#> 3 virginica 0 Sepal.Length
To use the old lazyeval syntax, it would look like
getSummary <- function(varName, level) {
iris %>%
group_by(Species) %>%
summarise_(N = lazyeval::interp(~sum(x == y), # formula to substitute into
x = lazyeval::lazy(varName), # substituted but unevaluated name
y = level), # value to substitute
var = ~lazyeval::expr_text(varName)) # convert expression to string (equivalent to `deparse(substitute(...))`)
}
getSummary(Sepal.Length, 5.0)
#> # A tibble: 3 × 3
#> Species N var
#> <fctr> <int> <chr>
#> 1 setosa 8 Sepal.Length
#> 2 versicolor 2 Sepal.Length
#> 3 virginica 0 Sepal.Length
# or make it accept strings
getSummary <- function(varName, level) {
iris %>%
group_by(Species) %>%
summarise_(N = lazyeval::interp(~sum(x == y),
x = as.name(varName),
y = level),
var = ~varName)
}
getSummary('Petal.Width', 0.2)
#> # A tibble: 3 × 3
#> Species N var
#> <fctr> <int> <chr>
#> 1 setosa 29 Petal.Width
#> 2 versicolor 0 Petal.Width
#> 3 virginica 0 Petal.Width