I have data with three groups and would like to apply a different custom function to each group. Rather than writing three separate functions and calling each one separately, I'm wondering whether I can easily wrap all three into one function with a 'group' parameter.
For example, say I want the mean for group A:
library(tidyverse)
data(iris)
iris$Group <- c(rep("A", 50), rep("B", 50), rep("C", 50))
f_a <- function(df){
out <- df %>%
group_by(Species) %>%
summarise(mean = mean(Sepal.Length))
return(out)
}
The median for group B
f_b <- function(df){
out <- df %>%
group_by(Species) %>%
summarise(median = median(Sepal.Length))
return(out)
}
And the standard deviation for group C
f_c <- function(df){
out <- df %>%
group_by(Species) %>%
summarise(sd= sd(Sepal.Length))
return(out)
}
Is there any way I can combine the above functions and run them according to a group parameter? For example:
fx(df, group = "A")
This would produce the same result as the f_a function above.
Keep in mind that in my actual use case I can't simply group_by(group) in the original function, since the actual functions are more complex. Thanks!
We create a switch inside the function to select the appropriate function based on the value passed as group. The selected function is then applied inside summarise after grouping by 'Species', and its name is also used as the name of the output column.
fx <- function(df, group) {
fn_selector <- switch(group,
A = "mean",
B = "median",
C = "sd")
df %>%
group_by(Species) %>%
summarise(!! fn_selector :=
match.fun(fn_selector)(Sepal.Length), .groups = 'drop')
}
Testing:
fx(iris, "A")
# A tibble: 3 x 2
# Species mean
# <fct> <dbl>
#1 setosa 5.01
#2 versicolor 5.94
#3 virginica 6.59
fx(iris, "B")
# A tibble: 3 x 2
# Species median
# <fct> <dbl>
#1 setosa 5
#2 versicolor 5.9
#3 virginica 6.5
fx(iris, "C")
# A tibble: 3 x 2
# Species sd
# <fct> <dbl>
#1 setosa 0.352
#2 versicolor 0.516
#3 virginica 0.636
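Since the actual functions are more complex than a single summary call, the same switch idea can also dispatch to whole functions instead of function names. This is only a minimal sketch, assuming the three functions f_a, f_b and f_c from the question; the name fx2 and the stop() fallback for unknown groups are additions for illustration and are not part of the original answer.

```r
library(dplyr)

# The three per-group functions from the question (written compactly).
f_a <- function(df) df %>% group_by(Species) %>% summarise(mean = mean(Sepal.Length))
f_b <- function(df) df %>% group_by(Species) %>% summarise(median = median(Sepal.Length))
f_c <- function(df) df %>% group_by(Species) %>% summarise(sd = sd(Sepal.Length))

# fx2 (hypothetical name) selects one of the functions and calls it on df.
fx2 <- function(df, group) {
  f <- switch(group,
              A = f_a,
              B = f_b,
              C = f_c,
              stop("Unknown group: ", group))  # unnamed last argument acts as the default
  f(df)
}

fx2(iris, "B")  # same result as f_b(iris)
```

Because switch returns the function itself, each branch can be arbitrarily complex without changing fx2.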
I don't understand the point of having a Group column in the dataset. When we pass group = "A" to the function, it has nothing to do with the Group column that was created.
Instead of passing group = "A" and then mapping A to some function, you can pass the function you want to apply directly.
library(dplyr)
f_a <- function(df, fn){
out <- df %>%
group_by(Species) %>%
summarise(out = fn(Sepal.Length))
return(out)
}
f_a(iris, mean)
# A tibble: 3 x 2
# Species out
#* <fct> <dbl>
#1 setosa 5.01
#2 versicolor 5.94
#3 virginica 6.59
f_a(iris, median)
# A tibble: 3 x 2
# Species out
#* <fct> <dbl>
#1 setosa 5
#2 versicolor 5.9
#3 virginica 6.5
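If several statistics are wanted in one call, the same idea might be extended to a named list of functions. The following is a sketch only, assuming dplyr 1.0.0 or later for across(); f_many is a hypothetical name used for illustration.

```r
library(dplyr)

# f_many (hypothetical name) applies every function in `fns` to Sepal.Length
# within each Species; across() names the columns "<column>_<function name>".
f_many <- function(df, fns) {
  df %>%
    group_by(Species) %>%
    summarise(across(Sepal.Length, fns), .groups = "drop")
}

res <- f_many(iris, list(mean = mean, median = median, sd = sd))
res
```

The output has one row per species and one column per supplied function.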
I have a dataset that lists variables, the calculation I want to perform on each (sum, number of distinct values), and the new variable name to use after the calculation.
library(dplyr)
RefDf <- read.table(text = "Variables Calculation NewVariable
Sepal.Length sum Sepal.Length2
Petal.Length n_distinct Petal.LengthNew
", header = T)
Manual Approach - Summarise by grouping of Species variable.
iris %>% group_by_at("Species") %>%
summarise(Sepal.Length2 = sum(Sepal.Length,na.rm = T),
Petal.LengthNew = n_distinct(Petal.Length, na.rm = T)
)
Automate via eval(parse( ))
x <- RefDf %>% mutate(Check = paste0(NewVariable, " = ", Calculation, "(", Variables, ", na.rm = T", ")")) %>% pull(Check)
iris %>% group_by_at("Species") %>% summarise(eval(parse(text = x)))
As of now it is returning -
Species `eval(parse(text = x))`
<fct> <int>
1 setosa 9
2 versicolor 19
3 virginica 20
It should return -
Species Sepal.Length2 Petal.LengthNew
<fct> <dbl> <int>
1 setosa 250. 9
2 versicolor 297. 19
3 virginica 329. 20
You can use parse_exprs:
library(tidyverse)
library(rlang)
RefDf <- read.table(text = "Variables Calculation NewVariable
Sepal.Length sum Sepal.Length2
Petal.Length n_distinct Petal.LengthNew
", header = T)
#
expr_txt <- set_names(str_c(RefDf$Calculation, "(", RefDf$Variables, ")"),
RefDf$NewVariable)
iris %>%
group_by_at("Species") %>%
summarise(!!!parse_exprs(expr_txt), .groups = "drop")
## A tibble: 3 x 3
#Species Sepal.Length2 Petal.LengthNew
#<fct> <dbl> <int>
#1 setosa 250. 9
#2 versicolor 297. 19
#3 virginica 329. 20
Updated
I found a way of sparing those extra lines.
This is just another way of getting your desired result. I'd rather create a function call for every row of your data set and then iterate over those calls together with the new column names to get the desired output:
library(dplyr)
library(rlang)
library(purrr)
# First we create a new variable which is actually of type call in your data set
RefDf %>%
rowwise() %>%
mutate(Call = list(call2(Calculation, parse_expr(Variables)))) -> Rf
Rf
# A tibble: 2 x 4
# Rowwise:
Variables Calculation NewVariable Call
<chr> <chr> <chr> <list>
1 Sepal.Length sum Sepal.Length2 <language>
2 Petal.Length n_distinct Petal.LengthNew <language>
# Then we iterate over `NewVariable` and `Call` at the same time to set the new variable
# name and also evaluate the `call` at the same time
map2(Rf$NewVariable, Rf$Call, ~ iris %>% group_by(Species) %>%
summarise(!!.x := eval_tidy(.y))) %>%
reduce(~ left_join(.x, .y, by = "Species"))
# A tibble: 3 x 3
Species Sepal.Length2 Petal.LengthNew
<fct> <dbl> <int>
1 setosa 250. 9
2 versicolor 297. 19
3 virginica 329. 20
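The per-row calls can also be spliced into a single summarise with !!!, which avoids the intermediate joins. This is a sketch under the assumption that every row of RefDf names a one-argument function and an existing column; RefDf is rebuilt here with data.frame() only to keep the example self-contained.

```r
library(dplyr)
library(rlang)

RefDf <- data.frame(
  Variables   = c("Sepal.Length", "Petal.Length"),
  Calculation = c("sum", "n_distinct"),
  NewVariable = c("Sepal.Length2", "Petal.LengthNew"),
  stringsAsFactors = FALSE
)

# Build one call per row, e.g. sum(Sepal.Length), and name each call
# after the column it should create.
calls <- Map(function(fn, var) call2(fn, sym(var)),
             RefDf$Calculation, RefDf$Variables)
names(calls) <- RefDf$NewVariable

# Splice the named calls into one summarise.
res <- iris %>%
  group_by(Species) %>%
  summarise(!!!calls, .groups = "drop")
res
```

Building calls from symbols rather than parsing pasted text keeps column names with unusual characters from being misread as code.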
I am summarizing group means from a table using the summarize function from the dplyr package in R. I would like to do this dynamically, using a column name string stored in another variable.
The following is the "normal" way and it works, of course:
myTibble <- group_by( iris, Species)
summarise( myTibble, avg = mean( Sepal.Length))
# A tibble: 3 x 2
Species avg
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
However, I would like to do something like this instead:
myTibble <- group_by( iris, Species)
colOfInterest <- "Sepal.Length"
summarise( myTibble, avg = mean( colOfInterest))
I've read the Programming with dplyr page, and I've tried a bunch of combinations of quo, enquo, !!, .dots=(...), etc., but I haven't figured out the right way to do it yet.
I'm also aware of this answer, but 1) when I use the standard-evaluation function summarise_, R tells me that it's deprecated, and 2) that answer doesn't seem elegant at all. So, is there a good, easy way to do this?
Thank you!
1) Use !!sym(...) like this:
colOfInterest <- "Sepal.Length"
iris %>%
group_by(Species) %>%
summarize(avg = mean(!!sym(colOfInterest))) %>%
ungroup
giving:
# A tibble: 3 x 2
Species avg
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
2) A second approach is:
colOfInterest <- "Sepal.Length"
iris %>%
group_by(Species) %>%
summarize(avg = mean(.data[[colOfInterest]])) %>%
ungroup
Of course, this is straightforward in base R:
aggregate(list(avg = iris[[colOfInterest]]), iris["Species"], mean)
Another solution:
iris %>%
group_by(Species) %>%
summarise_at(vars("Sepal.Length"), mean) %>%
ungroup()
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
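The .data pronoun approach also carries over to a reusable function that takes the column name as a string. A minimal sketch, assuming dplyr 1.0.0 or later for the .groups argument; group_mean is a hypothetical name.

```r
library(dplyr)

# group_mean (hypothetical name): `col` is a column name given as a string.
group_mean <- function(df, col) {
  df %>%
    group_by(Species) %>%
    summarise(avg = mean(.data[[col]]), .groups = "drop")
}

res <- group_mean(iris, "Sepal.Length")
res
group_mean(iris, "Petal.Width")
```

Using .data[[col]] looks the name up among the data frame's columns only, so a variable of the same name in the calling environment cannot be picked up by accident.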
Suppose I have the following function
SlowFunction = function(vector){
return(list(
mean =mean(vector),
sd = sd(vector)
))
}
And I would like to use dplyr::summarise to write the results to a data frame:
iris %>%
dplyr::group_by(Species) %>%
dplyr::summarise(
mean = SlowFunction(Sepal.Length)$mean,
sd = SlowFunction(Sepal.Length)$sd
)
Does anyone have a suggestion how I can do this by calling "SlowFunction" once instead of twice? (In my code "SlowFunction" is a slow function that I have to call many times.) Without splitting "SlowFunction" in two parts of course. So actually I would like to somehow fill multiple columns of a dataframe in one statement.
Without changing your current SlowFunction one way is to use do
library(dplyr)
iris %>%
group_by(Species) %>%
do(data.frame(SlowFunction(.$Sepal.Length)))
# Species mean sd
# <fct> <dbl> <dbl>
#1 setosa 5.01 0.352
#2 versicolor 5.94 0.516
#3 virginica 6.59 0.636
Or with group_split + purrr::map_dfr
bind_cols(Species = unique(iris$Species), iris %>%
group_split(Species) %>%
map_dfr(~SlowFunction(.$Sepal.Length)))
Another option is to store the output of SlowFunction in a list column of data frames and then use unnest (from tidyr), naming the column to unnest:
iris %>%
  group_by(Species) %>%
  summarise(res = list(as.data.frame(SlowFunction(Sepal.Length)))) %>%
  unnest(res)
## A tibble: 3 x 3
# Species mean sd
# <fct> <dbl> <dbl>
#1 setosa 5.01 0.352
#2 versicolor 5.94 0.516
#3 virginica 6.59 0.636
We can use group_modify if you are using dplyr 0.8.1 or later (in dplyr 0.8.0 this behaviour was provided by group_map; from 0.8.1 on, group_map returns a list instead). The output from SlowFunction needs to be converted to a data frame.
library(dplyr)
iris %>%
  group_by(Species) %>%
  group_modify(~ SlowFunction(.x$Sepal.Length) %>% as.data.frame())
# # A tibble: 3 x 3
# # Groups: Species [3]
# Species mean sd
# <fct> <dbl> <dbl>
# 1 setosa 5.01 0.352
# 2 versicolor 5.94 0.516
# 3 virginica 6.59 0.636
We can change the SlowFunction to return a tibble and
SlowFunction = function(vector){
tibble(
mean =mean(vector),
sd = sd(vector)
)
}
and then unnest the list column produced by summarise (unnest is from tidyr):
iris %>%
  group_by(Species) %>%
  summarise(out = list(SlowFunction(Sepal.Length))) %>%
  unnest(out)
# A tibble: 3 x 3
# Species mean sd
# <fct> <dbl> <dbl>
#1 setosa 5.01 0.352
#2 versicolor 5.94 0.516
#3 virginica 6.59 0.636
I would like to pre-assign my column name and use that within a dplyr pipe
Here's an example. I want to do this:
iris %>%
group_by(Species) %>%
summarise(Var = mean(Petal.Length[Sepal.Width > 3]))
But with the column name assigned outside of the pipe, like this
col_name <- "Petal.Length"
iris %>%
group_by(Species) %>%
summarise(Var = mean(!!col_name[Sepal.Width > 3]))
We can convert to symbol (sym) and then do the evaluation (!!)
iris %>%
group_by(Species) %>%
summarise(Var = mean((!!rlang::sym(col_name))[Sepal.Width >3]))
# A tibble: 3 x 2
# Species Var
# <fct> <dbl>
#1 setosa 1.48
#2 versicolor 4.65
#3 virginica 5.72
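If we need to use only dplyr, we can pass the variable to summarise_at, wrapping the name in all_of() (dplyr 1.0.0 or later) and using a formula lambda, since funs() is deprecated as of dplyr 0.8.0:
iris %>%
  group_by(Species) %>%
  summarise_at(vars(all_of(col_name)), ~ mean(.[Sepal.Width > 3]))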
# A tibble: 3 x 2
# Species Petal.Length
# <fct> <dbl>
#1 setosa 1.48
#2 versicolor 4.65
#3 virginica 5.72
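The .data pronoun offers another way to do the same lookup without converting the string to a symbol. A minimal sketch, assuming col_name holds the name of an existing column:

```r
library(dplyr)

col_name <- "Petal.Length"

# .data[[col_name]] pulls the column by name; the subset on Sepal.Width
# is then applied to that column within each Species.
res <- iris %>%
  group_by(Species) %>%
  summarise(Var = mean(.data[[col_name]][Sepal.Width > 3]))
res
```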
I would like to use summarise() from dplyr after grouping data to compute a new variable. But, I would like it to use one equation for some of the data and a second equation for the rest of the data.
I have tried using group_by() and summarise() with if_else() but it isn't working.
Here's an example. Let's say--for some reason--I wanted to find a special value for sepal length. For the species 'setosa' this special value is twice the mean of the sepal length. For all of the other species it is simply the mean of sepal length. This is the code I've tried, but it doesn't work with summarise()
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(sepal_special = if_else(Species == "setosa", mean(Sepal.Length)*2, mean(Sepal.Length)))
This idea works with mutate() but I would need to re-format the tibble to be the dataset I am looking for.
library(dplyr)
iris %>%
group_by(Species) %>%
mutate(sepal_special = if_else(Species == "setosa", mean(Sepal.Length)*2, mean(Sepal.Length)))
This is how I want the resulting tibble to be laid out:
library(dplyr)
iris %>%
group_by(Species)%>%
summarise(sepal_special = mean(Sepal.Length))
# A tibble: 3 x 2
# Species sepal_special
# <fctr> <dbl>
#1 setosa 5.01
#2 versicolor 5.94
#3 virginica 6.59
#>
But my result would show the value for setosa x 2
# A tibble: 3 x 2
# Species sepal_special
# <fctr> <dbl>
#1 setosa **10.02**
#2 versicolor 5.94
#3 virginica 6.59
#>
Suggestions? I feel like I've really searched for ways to use if_else() with summarise() but can't find it anywhere, which means there must be a better way.
Thanks!
After the mutate step, use summarise to get the first element of 'sepal_special' for each 'Species'
iris %>%
group_by(Species) %>%
mutate(sepal_special = if_else(Species == "setosa",
mean(Sepal.Length)*2, mean(Sepal.Length))) %>%
summarise(sepal_special = first(sepal_special))
# A tibble: 3 x 2
# Species sepal_special
# <fctr> <dbl>
#1 setosa 10.0
#2 versicolor 5.94
#3 virginica 6.59
Or instead of calling the mutate, after the if_else is applied, get the first value in summarise
iris %>%
group_by(Species) %>%
summarise(sepal_special = if_else(Species == "setosa",
mean(Sepal.Length)*2, mean(Sepal.Length))[1])
# A tibble: 3 x 2
# Species sepal_special
# <fctr> <dbl>
#1 setosa 10.0
#2 versicolor 5.94
#3 virginica 6.59
Another option: since twice the mean is the same as the mean of twice the values, you can double the sepal lengths for setosa and then summarise:
iris %>%
mutate(Sepal.Length = ifelse(Species == "setosa", 2*Sepal.Length, Sepal.Length)) %>%
group_by(Species) %>%
summarise(sepal_special = mean(Sepal.Length))
# A tibble: 3 x 2
Species sepal_special
<fct> <dbl>
1 setosa 10.0
2 versicolor 5.94
3 virginica 6.59
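The order of the two steps can also be reversed: summarise to one row per species first, then apply if_else to the summarised values. A minimal sketch of that variant, assuming dplyr 1.0.0 or later for the .groups argument:

```r
library(dplyr)

# Step 1: one mean per Species. Step 2: double the value for setosa only.
res <- iris %>%
  group_by(Species) %>%
  summarise(sepal_special = mean(Sepal.Length), .groups = "drop") %>%
  mutate(sepal_special = if_else(Species == "setosa",
                                 sepal_special * 2,
                                 sepal_special))
res
```

After summarise there is exactly one row per species, so if_else works on vectors of matching length and no first() or [1] is needed.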