I am trying to carry out the following dplyr task, but within a function.
library("dplyr")
iris %>%
group_by(Species) %>%
summarise(N = sum(Petal.Width == 0.2, na.rm = T))
I was thinking along the lines of the following, which is not complete because I am unclear on the syntax.
getSummary <- function(varName,level) {
summary <- iris %>%
group_by(Species %>%
summarise_(N = interp(~sum(var == ilevel, na.rm = T),
var = as.name(varName))))
sums <- summary$N
}
In this case levels is the numeric 0.2. Are there any changes if the value is a character "0.2"?
dplyr is in the process of switching over from a lazyeval-powered NSE system to an rlang-powered one. On the new version (available now through the GitHub version, and soon through CRAN), you can use
library(dplyr)
getSummary <- function(varName, level) {
varName <- enquo(varName) # parse and quote variable name
iris %>%
group_by(Species) %>%
summarise(N = sum((!!varName) == level), # unquote with !! to use
var = rlang::quo_text(varName)) # turn quosure to string
}
getSummary(Petal.Width, 0.2)
#> # A tibble: 3 × 3
#> Species N var
#> <fctr> <int> <chr>
#> 1 setosa 29 Petal.Width
#> 2 versicolor 0 Petal.Width
#> 3 virginica 0 Petal.Width
# or make it accept strings
getSummary <- function(varName, level) {
iris %>%
group_by(Species) %>%
summarise(N = sum((!!rlang::sym(varName)) == level),
var = varName)
}
getSummary('Sepal.Length', 5.0)
#> # A tibble: 3 × 3
#> Species N var
#> <fctr> <int> <chr>
#> 1 setosa 8 Sepal.Length
#> 2 versicolor 2 Sepal.Length
#> 3 virginica 0 Sepal.Length
To use the old lazyeval syntax, it would look like
getSummary <- function(varName, level) {
iris %>%
group_by(Species) %>%
summarise_(N = lazyeval::interp(~sum(x == y), # formula to substitute into
x = lazyeval::lazy(varName), # substituted but unevaluated name
y = level), # value to substitute
var = ~lazyeval::expr_text(varName)) # convert expression to string (equivalent to `deparse(substitute(...))`)
}
getSummary(Sepal.Length, 5.0)
#> # A tibble: 3 × 3
#> Species N var
#> <fctr> <int> <chr>
#> 1 setosa 8 Sepal.Length
#> 2 versicolor 2 Sepal.Length
#> 3 virginica 0 Sepal.Length
# or make it accept strings
getSummary <- function(varName, level) {
iris %>%
group_by(Species) %>%
summarise_(N = lazyeval::interp(~sum(x == y),
x = as.name(varName),
y = level),
var = ~varName)
}
getSummary('Petal.Width', 0.2)
#> # A tibble: 3 × 3
#> Species N var
#> <fctr> <int> <chr>
#> 1 setosa 29 Petal.Width
#> 2 versicolor 0 Petal.Width
#> 3 virginica 0 Petal.Width
Related
I am trying to make a data.frame which displays the average time an individual displays a behaviour.
I have been using group_by and summarise to calculate the averages across groups. But the output is many rows down. See an example using the iris dataset...
data(iris)
x <- iris %>%
group_by(Species, Petal.Length) %>%
summarise(mean(Sepal.Length))
I would like to get an output that has, for this example, one row per 'Species' and a column of averages per 'Petal.Length'.
I have resorted to creating multiple outputs and then using left_join to combine them into the desired data.frame. See example below...
a <- iris %>%
group_by(Species) %>%
filter(Petal.Length == 0.1) %>%
summarise(mean(Sepal.Length))
b <- iris %>%
group_by(Species) %>%
filter(Petal.Length == 0.2) %>%
summarise(mean(Sepal.Length))
left_join(a, b)
However, doing this twelve or more times at a time is tedious and I am sure there must be an easy way to get the mean(Sepal.Length) for the 'Petal.Length' 0.1, and 0.2, and 0.3 (etc) in the one output.
n.b. in my data Petal.Length would actually be characters that represent behaviours and Sepal.Length would be the duration of time
Some ideas:
library(tidyverse)
data(iris)
mutate(iris, Petal.Length_discrete = cut(Petal.Length, 5)) %>%
group_by(Species, Petal.Length_discrete) %>%
summarise(mean(Sepal.Length))
#> `summarise()` has grouped output by 'Species'. You can override using the `.groups` argument.
#> # A tibble: 7 x 3
#> # Groups: Species [3]
#> Species Petal.Length_discrete `mean(Sepal.Length)`
#> <fct> <fct> <dbl>
#> 1 setosa (0.994,2.18] 5.01
#> 2 versicolor (2.18,3.36] 5
#> 3 versicolor (3.36,4.54] 5.81
#> 4 versicolor (4.54,5.72] 6.43
#> 5 virginica (3.36,4.54] 4.9
#> 6 virginica (4.54,5.72] 6.32
#> 7 virginica (5.72,6.91] 7.25
iris %>%
group_split(Species, Petal.Length) %>%
map(~ summarise(.x, mean(Sepal.Length))) %>%
head(3)
#> [[1]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 4.6
#>
#> [[2]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 4.3
#>
#> [[3]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 5.4
Created on 2021-06-28 by the reprex package (v2.0.0)
I have dataset which shows Variables, calculation I want to perform (sum, no. of distinct values) and new variable names after the calculation.
library(dplyr)
RefDf <- read.table(text = "Variables Calculation NewVariable
Sepal.Length sum Sepal.Length2
Petal.Length n_distinct Petal.LengthNew
", header = T)
Manual Approach - Summarise by grouping of Species variable.
iris %>% group_by_at("Species") %>%
summarise(Sepal.Length2 = sum(Sepal.Length,na.rm = T),
Petal.LengthNew = n_distinct(Petal.Length, na.rm = T)
)
Automate via eval(parse( ))
x <- RefDf %>% mutate(Check = paste0(NewVariable, " = ", Calculation, "(", Variables, ", na.rm = T", ")")) %>% pull(Check)
iris %>% group_by_at("Species") %>% summarise(eval(parse(text = x)))
As of now it is returning -
Species `eval(parse(text = x))`
<fct> <int>
1 setosa 9
2 versicolor 19
3 virginica 20
It should return -
Species Sepal.Length2 Petal.LengthNew
<fct> <dbl> <int>
1 setosa 250. 9
2 versicolor 297. 19
3 virginica 329. 20
You can use parse_exprs:
library(tidyverse)
library(rlang)
RefDf <- read.table(text = "Variables Calculation NewVariable
Sepal.Length sum Sepal.Length2
Petal.Length n_distinct Petal.LengthNew
", header = T)
#
expr_txt <- set_names(str_c(RefDf$Calculation, "(", RefDf$Variables, ")"),
RefDf$NewVariable)
iris %>%
group_by_at("Species") %>%
summarise(!!!parse_exprs(expr_txt), .groups = "drop")
## A tibble: 3 x 3
#Species Sepal.Length2 Petal.LengthNew
#<fct> <dbl> <int>
#1 setosa 250. 9
#2 versicolor 297. 19
#3 virginica 329. 20
Updated
I found a way of sparing those extra lines.
This is just another way of getting your desired result. I'd rather create a function call for every row of your data set and then iterate over it beside the new column names to get to the desired output:
library(dplyr)
library(rlang)
library(purrr)
# First we create a new variable which is actually of type call in your data set
RefDf %>%
rowwise() %>%
mutate(Call = list(call2(Calculation, parse_expr(Variables)))) -> Rf
Rf
# A tibble: 2 x 4
# Rowwise:
Variables Calculation NewVariable Call
<chr> <chr> <chr> <list>
1 Sepal.Length sum Sepal.Length2 <language>
2 Petal.Length n_distinct Petal.LengthNew <language>
# Then we iterate over `NewVariable` and `Call` at the same time to set the new variable
# name and also evaluate the `call` at the same time
map2(Rf$NewVariable, Rf$Call, ~ iris %>% group_by(Species) %>%
summarise(!!.x := eval_tidy(.y))) %>%
reduce(~ left_join(.x, .y, by = "Species"))
# A tibble: 3 x 3
Species Sepal.Length2 Petal.LengthNew
<fct> <dbl> <int>
1 setosa 250. 9
2 versicolor 297. 19
3 virginica 329. 20
As per this article, a recent version of rlang and glue allows to combine tunnelling {{ myobj }} into a glue string:
library(dplyr)
library(rlang)
library(glue)
mean_by <- function(data, by, var, prefix = "avg") {
data %>%
group_by({{ by }}) %>%
summarise("{prefix}_{{ var }}" := mean({{ var }}, na.rm = TRUE))
}
iris %>% mean_by(Species, Sepal.Width)
#> # A tibble: 3 x 2
#> Species avg_Sepal.Width
#> <fct> <dbl>
#> 1 setosa 3.43
#> 2 versicolor 2.77
#> 3 virginica 2.97
But if I want to combine on the other side of the equation, I cannot do this:
analyze_by <- function(data, by, var, prefix = "avg") {
data %>%
group_by({{ by }}) %>%
summarise(analysis = glue("{ prefix } by {{ var }}"))
}
iris %>% analyze_by(Species, Sepal.Width)
#> # A tibble: 3 x 2
#> Species analysis
#> <fct> <glue>
#> 1 setosa avg by { var }
#> 2 versicolor avg by { var }
#> 3 virginica avg by { var }
What would be the best way to get around this problem?
Perhaps, we can use ensym or enquo before passing into glue
analyze_by <- function(data, by, var, prefix = "avg") {
var <- rlang::ensym(var)
data %>%
dplyr::group_by({{ by }}) %>%
dplyr::summarise(analysis = glue::glue("{ prefix } by {var}"))
}
iris %>%
analyze_by(Species, Sepal.Width)
# A tibble: 3 x 2
# Species analysis
# <fct> <glue>
#1 setosa avg by Sepal.Width
#2 versicolor avg by Sepal.Width
#3 virginica avg by Sepal.Width
You could use deparse and substitute :
library(dplyr)
library(rlang)
analyze_by <- function(data, by, var, prefix = "avg") {
var_char <- deparse(substitute(var))
data %>%
group_by({{ by }}) %>%
summarise(analysis = glue::glue("{ prefix } by {var_char}"))
}
iris %>% analyze_by(Species, Sepal.Width)
# A tibble: 3 x 2
# Species analysis
# <fct> <glue>
#1 setosa avg by Sepal.Width
#2 versicolor avg by Sepal.Width
#3 virginica avg by Sepal.Width
I'm creating a bunch of basic status reports and one of things I'm finding tedious is adding a total row to all my tables. I'm currently using the Tidyverse approach and this is an example of my current code. What I'm looking for is an option to have a few different levels included by default.
#load into RStudio viewer (not required)
iris = iris
#summary at the group level
summary_grouped = iris %>%
group_by(Species) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))
#summary at the overall level
summary_overall = iris %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
mutate(Species = "Overall")
#append results for report
summary_table = rbind(summary_grouped, summary_overall)
Doing this multiple times over is very tedious. I kind of want:
summary_overall = iris %>%
group_by(Species, total = TRUE) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))
FYI - if you're familiar with SAS I'm looking for the same type of functionality available via a class, ways or types statements in proc means that let me control the level of summarization and get multiple levels in one call.
Any help is appreciated. I know I can create my own function, but was hoping there is something that already exists. I would also prefer to stick with the tidyverse style of programming though I'm not set on that.
Another alternative:
library(tidyverse)
iris %>%
mutate_at("Species", as.character) %>%
list(group_by(.,Species), .) %>%
map(~summarize(.,mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))) %>%
bind_rows() %>%
replace_na(list(Species="Overall"))
#> # A tibble: 4 x 3
#> Species mean_s_length max_s_width
#> <chr> <dbl> <dbl>
#> 1 setosa 5.01 4.4
#> 2 versicolor 5.94 3.4
#> 3 virginica 6.59 3.8
#> 4 Overall 5.84 4.4
You can write a function which does the same summarize on an ungrouped tibble and rbinds that to the end.
summarize2 <- function(df, ...){
bind_rows(summarise(df, ...), summarize(ungroup(df), ...))
}
iris %>%
group_by(Species) %>%
summarize2(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)
)
# # A tibble: 4 x 3
# Species mean_s_length max_s_width
# <fct> <dbl> <dbl>
# 1 setosa 5.01 4.4
# 2 versicolor 5.94 3.4
# 3 virginica 6.59 3.8
# 4 NA 5.84 4.4
You could add some logic for what the "Overall" groups should be named if you want
summarize2 <- function(df, ...){
s1 <- summarise(df, ...)
s2 <- summarize(ungroup(df), ...)
for(v in group_vars(s1)){
if(is.factor(s1[[v]]))
s1[[v]] <- as.character(s1[[v]])
if(is.character(s1[[v]]))
s2[[v]] <- 'Overall'
else if(is.numeric(s1[[v]]))
s2[[v]] <- -Inf
}
bind_rows(s1, s2)
}
iris %>%
group_by(Species, g = Petal.Length %/% 1) %>%
summarize2(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)
)
# # Groups: Species [4]
# Species g mean_s_length max_s_width
# <chr> <dbl> <dbl> <dbl>
# 1 setosa 1 5.01 4.4
# 2 versicolor 3 5.35 2.9
# 3 versicolor 4 6.09 3.4
# 4 versicolor 5 6.35 3
# 5 virginica 4 5.85 3
# 6 virginica 5 6.44 3.4
# 7 virginica 6 7.43 3.8
# 8 Overall -Inf 5.84 4.4
library(dplyr)
iris %>%
group_by(Species) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
ungroup() %>%
mutate_at(vars(Species), as.character) %>%
{rbind(.,c("Overal",mean(.$mean_s_length),max(.$max_s_width)))} %>%
mutate_at(vars(-Species), as.double) %>%
mutate_at(vars(Species), as.factor)
#> # A tibble: 4 x 3
#> Species mean_s_length max_s_width
#> <fct> <dbl> <dbl>
#> 1 setosa 5.01 4.4
#> 2 versicolor 5.94 3.4
#> 3 virginica 6.59 3.8
#> 4 Overal 5.84 4.4
Created on 2019-06-21 by the reprex package (v0.3.0)
One way, also tedious but in one longer pipe, is to put the second summarise instructions in bind_rows.
The as.character call avoids a warning:
Warning messages:
1: In bind_rows_(x, .id) :
binding factor and character vector, coercing into character vector
2: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
library(tidyverse)
summary_grouped <- iris %>%
mutate(Species = as.character(Species)) %>%
group_by(Species) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
bind_rows(iris %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
mutate(Species = "Overall"))
## A tibble: 4 x 3
# Species mean_s_length max_s_width
# <chr> <dbl> <dbl>
#1 setosa 5.01 4.4
#2 versicolor 5.94 3.4
#3 virginica 6.59 3.8
#4 Overall 5.84 4.4
Maybe something like this:
As you want to perform different operations on the same input (iris), best to map over the different summary functions and apply to the data.
map_dfr combines the list outputs using bind_rows
library(dplyr)
library(purrr)
pipe <- . %>%
group_by(Species) %>%
summarize(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))
map_dfr(
list(pipe, . %>% mutate(Species = "Overall") %>% pipe),
exec,
iris)
#> Warning in bind_rows_(x, .id): binding factor and character vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> # A tibble: 4 x 3
#> Species mean_s_length max_s_width
#> <chr> <dbl> <dbl>
#> 1 setosa 5.01 4.4
#> 2 versicolor 5.94 3.4
#> 3 virginica 6.59 3.8
#> 4 Overall 5.84 4.4
Solution where you need to apply wanted function only once on a double dataset:
library(tidyverse)
iris %>%
rbind(mutate(., Species = "Overall")) %>%
group_by(Species) %>%
summarize(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)
)
# A tibble: 4 x 3
Species mean_s_length max_s_width
<chr> <dbl> <dbl>
1 Overall 5.84 4.4
2 setosa 5.01 4.4
3 versicolor 5.94 3.4
4 virginica 6.59 3.8
Trick is to pass original dataset with a new group ID (ie Species): mutate(iris, Species = "Overall")
Ok, I'm just trying to rename a column inside a nested tibble based on an identifier/character column:
MWE:
library(magrittr)
iris %>%
tibble::as_tibble() %>%
tidyr::nest(-Species) %>%
dplyr::mutate(
Species = as.character(Species),
data = purrr::map2(data, Species,
~dplyr::rename(.x, !!.y := Sepal.Width)))
but this returns the error:
Error in quos(..., .named = TRUE) : object '.y' not found
I have tried using ensym from rlang and all sort of combinations of !! and := without success.
That is the first tibble in the data column should have the Sepal.Width column renamed to setosa, the second to versicolor, and for the last tibble Sepal.Widht should be renamed to virginica.
You could switch away from the formula notation:
library(magrittr)
irisNest <- iris %>%
tibble::as_tibble() %>%
tidyr::nest(-Species) %>%
dplyr::mutate(Species = as.character(Species))
f <- function(x,y) {dplyr::rename(x, !!y := Sepal.Width)}
irisCheck <- dplyr::mutate(irisNest,
data = purrr::map2(data, Species, f))
library("tidyverse")
rename_func <- function(data, Species) {
Species <- as.character(Species)
data %>%
rename(!!Species := Sepal.Length)
}
iris2 <- as_tibble(iris) %>%
nest(-Species) %>%
group_by(Species) %>%
mutate(
data = map2(data, Species, rename_func))
iris2 %>% filter(Species == "setosa") %>% unnest() %>% head(1)
#> # A tibble: 1 x 5
#> # Groups: Species [3]
#> Species setosa Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.1 3.5 1.4 0.2
iris2 %>% filter(Species == "versicolor") %>% unnest() %>% head(1)
#> # A tibble: 1 x 5
#> # Groups: Species [3]
#> Species versicolor Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 versicolor 7 3.2 4.7 1.4
iris2 %>% filter(Species == "virginica") %>% unnest() %>% head(1)
#> # A tibble: 1 x 5
#> # Groups: Species [3]
#> Species virginica Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 virginica 6.3 3.3 6 2.5
Created on 2019-03-10 by the reprex package (v0.2.1)