Evaluate Multiple Lines in Dplyr - r

I have a dataset that lists variables, the calculation I want to perform on each (sum, number of distinct values), and the new variable name to use for each result.
library(dplyr)
RefDf <- read.table(text = "Variables Calculation NewVariable
Sepal.Length sum Sepal.Length2
Petal.Length n_distinct Petal.LengthNew
", header = T)
Manual Approach - Summarise by grouping of Species variable.
iris %>% group_by_at("Species") %>%
summarise(Sepal.Length2 = sum(Sepal.Length,na.rm = T),
Petal.LengthNew = n_distinct(Petal.Length, na.rm = T)
)
Automate via eval(parse( ))
x <- RefDf %>% mutate(Check = paste0(NewVariable, " = ", Calculation, "(", Variables, ", na.rm = T", ")")) %>% pull(Check)
iris %>% group_by_at("Species") %>% summarise(eval(parse(text = x)))
As of now it is returning -
Species `eval(parse(text = x))`
<fct> <int>
1 setosa 9
2 versicolor 19
3 virginica 20
It should return -
Species Sepal.Length2 Petal.LengthNew
<fct> <dbl> <int>
1 setosa 250. 9
2 versicolor 297. 19
3 virginica 329. 20

You can use parse_exprs:
library(tidyverse)
library(rlang)
RefDf <- read.table(text = "Variables Calculation NewVariable
Sepal.Length sum Sepal.Length2
Petal.Length n_distinct Petal.LengthNew
", header = T)
#
expr_txt <- set_names(str_c(RefDf$Calculation, "(", RefDf$Variables, ")"),
RefDf$NewVariable)
iris %>%
group_by_at("Species") %>%
summarise(!!!parse_exprs(expr_txt), .groups = "drop")
## A tibble: 3 x 3
#Species Sepal.Length2 Petal.LengthNew
#<fct> <dbl> <int>
#1 setosa 250. 9
#2 versicolor 297. 19
#3 virginica 329. 20
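As a side note on why the original attempt produced a single column: parse() turns the two strings into two separate expressions, and eval() evaluates them in order but returns only the value of the last one. A minimal base R sketch of that behaviour (the names a and b are made up for illustration and are not part of the question):

```r
# Two assignment-style strings, analogous to the ones built in the question
x <- c("a = 1 + 1", "b = 2 * 3")

exprs <- parse(text = x)
n_exprs <- length(exprs)     # parse() yields one expression per string

# eval() runs both expressions but hands back only the last result
last_value <- eval(exprs)
last_value
```

Inside summarise() the same thing happens, so the one unnamed argument yields a single column holding the n_distinct result. parse_exprs() combined with !!! instead splices each expression in as its own named argument.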

Updated
I found a way to avoid those extra lines.
This is just another way of getting your desired result. The idea is to create a function call for every row of your data set, then iterate over those calls together with the new column names to build the output:
library(dplyr)
library(rlang)
library(purrr)
# First we create a new variable which is actually of type call in your data set
RefDf %>%
rowwise() %>%
mutate(Call = list(call2(Calculation, parse_expr(Variables)))) -> Rf
Rf
# A tibble: 2 x 4
# Rowwise:
Variables Calculation NewVariable Call
<chr> <chr> <chr> <list>
1 Sepal.Length sum Sepal.Length2 <language>
2 Petal.Length n_distinct Petal.LengthNew <language>
# Then we iterate over `NewVariable` and `Call` at the same time to set the new variable
# name and also evaluate the `call` at the same time
map2(Rf$NewVariable, Rf$Call, ~ iris %>% group_by(Species) %>%
summarise(!!.x := eval_tidy(.y))) %>%
reduce(~ left_join(.x, .y, by = "Species"))
# A tibble: 3 x 3
Species Sepal.Length2 Petal.LengthNew
<fct> <dbl> <int>
1 setosa 250. 9
2 versicolor 297. 19
3 virginica 329. 20
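Since the calls are ordinary R objects, a variant of the idea above is to splice them into a single summarise() instead of summarising once per call and joining. The following is only a sketch under the assumption that RefDf holds plain character columns; the helper name calls is made up here:

```r
library(dplyr)
library(rlang)

# Same reference table as in the question, built directly
RefDf <- data.frame(
  Variables   = c("Sepal.Length", "Petal.Length"),
  Calculation = c("sum", "n_distinct"),
  NewVariable = c("Sepal.Length2", "Petal.LengthNew"),
  stringsAsFactors = FALSE
)

# One call per row, e.g. sum(Sepal.Length)
calls <- Map(function(fn, var) call2(fn, sym(var)),
             RefDf$Calculation, RefDf$Variables)
# Name each call after the column it should create
calls <- setNames(calls, RefDf$NewVariable)

# !!! splices the named calls in as separate summarise() arguments
res <- iris %>%
  group_by(Species) %>%
  summarise(!!!calls, .groups = "drop")
res
```

This keeps everything in one grouped pass, at the cost of requiring the function names in Calculation to be visible where summarise() is called.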

Related

tidyverse and a $ subsetting peculiarity? intent for such behavior?

cdata is a tibble (I used haven to import a .sav file into the cdata object).
Why does using cdata$WEIGHT instead of WEIGHT produce such a radical difference in the output below?
this code uses cdata$WEIGHT :
cdata %>% group_by(as.factor(state)) %>%
summarise(n = n(), weighted_n = sum(cdata$WEIGHT))
produces an unwanted table:
this code uses WEIGHT :
cdata %>% group_by(as.factor(state)) %>%
summarise(n = n(), weighted_n = sum(WEIGHT))
produces the correct table:
I realize that tibble has a different mental model than base R. However, the above difference doesn't make intuitive sense to me. What's the intent behind this difference in output when using a common column identification technique (cdata$WEIGHT)?
When there is a grouping variable, cdata$WEIGHT bypasses dplyr's data mask and extracts the whole column, so the sum is computed over every row and repeated for each group. The bare name WEIGHT is evaluated inside the data mask and returns only the rows that belong to the current group.
If you really want to use $, use the .data pronoun:
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(.data$Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 250.
2 versicolor 297.
3 virginica 329.
which is identical to
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 250.
2 versicolor 297.
3 virginica 329.
Or use cur_data() (deprecated since dplyr 1.1.0 in favour of pick()):
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(cur_data()$Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 250.
2 versicolor 297.
3 virginica 329.
Whereas .$ or iris$ extracts the whole column and ignores the grouping:
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(.$Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 876.
2 versicolor 876.
3 virginica 876.
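To see the three behaviours side by side, here is a small sketch on made-up data. The data frame toy and its values are assumptions standing in for cdata, which is not available here:

```r
library(dplyr)

# Hypothetical stand-in for cdata: two states, three rows
toy <- data.frame(state = c("a", "a", "b"),
                  WEIGHT = c(1, 2, 10))

res <- toy %>%
  group_by(state) %>%
  summarise(
    n          = n(),
    masked     = sum(WEIGHT),        # bare name: rows of the current group
    pronoun    = sum(.data$WEIGHT),  # .data pronoun: also per group
    whole_data = sum(toy$WEIGHT),    # toy$: the full column, every time
    .groups = "drop"
  )
res
```

The masked and pronoun columns differ per state, while whole_data repeats the grand total in every row, which mirrors the unwanted table in the question.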

Perform a different simple custom function based on group

I have data with three groups and would like to perform a different custom function on each of the three groups. Rather than write three separate functions, and calling them all separately, I'm wondering whether I can easily wrap all three into one function with a 'group' parameter.
For example, say I want the mean for group A:
library(tidyverse)
data(iris)
iris$Group <- c(rep("A", 50), rep("B", 50), rep("C", 50))
f_a <- function(df){
out <- df %>%
group_by(Species) %>%
summarise(mean = mean(Sepal.Length))
return(out)
}
The median for group B
f_b <- function(df){
out <- df %>%
group_by(Species) %>%
summarise(median = median(Sepal.Length))
return(out)
}
And the standard deviation for group C
f_c <- function(df){
out <- df %>%
group_by(Species) %>%
summarise(sd= sd(Sepal.Length))
return(out)
}
Is there any way I can combine the above functions and run them according to a group parameter?? Like:
fx(df, group = "A")
Which would produce the results of the above f_a function??
Keeping in mind that in my actual use context, I can't simply group_by(group) in the original function, since the actual functions are more complex. Thanks!!
We use switch inside the function to select the function that matches the group argument. That function is then applied inside summarise after grouping by 'Species'.
fx <- function(df, group) {
fn_selector <- switch(group,
A = "mean",
B = "median",
C = "sd")
df %>%
group_by(Species) %>%
summarise(!! fn_selector :=
match.fun(fn_selector)(Sepal.Length), .groups = 'drop')
}
-testing
fx(iris, "A")
# A tibble: 3 x 2
# Species mean
# <fct> <dbl>
#1 setosa 5.01
#2 versicolor 5.94
#3 virginica 6.59
fx(iris, "B")
# A tibble: 3 x 2
# Species median
# <fct> <dbl>
#1 setosa 5
#2 versicolor 5.9
#3 virginica 6.5
fx(iris, "C")
# A tibble: 3 x 2
# Species sd
# <fct> <dbl>
#1 setosa 0.352
#2 versicolor 0.516
#3 virginica 0.636
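One thing to watch with switch is that it returns NULL when nothing matches, which then surfaces as an unhelpful error from match.fun. A possible variation, sketched here with made-up names (summary_fns, fx2, value), looks the function up in a named list and fails early on an unknown group:

```r
library(dplyr)

# Hypothetical lookup table: group label -> summary function
summary_fns <- list(A = mean, B = median, C = sd)

fx2 <- function(df, group) {
  # Fail with a clear message instead of letting a NULL slip through
  if (!group %in% names(summary_fns)) {
    stop("Unknown group: ", group)
  }
  fn <- summary_fns[[group]]
  df %>%
    group_by(Species) %>%
    summarise(value = fn(Sepal.Length), .groups = "drop")
}

res_a <- fx2(iris, "A")
res_a
```

Because the list holds functions rather than names, the entries could also be longer custom functions, which may suit the more complex real use case mentioned in the question.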
I don't understand the point of having a Group column in the dataset. Passing group = "A" to the function has nothing to do with the Group column that was created.
Instead of passing group = "A" and then mapping A to some function, you can pass the function you want to apply directly.
library(dplyr)
f_a <- function(df, fn){
out <- df %>%
group_by(Species) %>%
summarise(out = fn(Sepal.Length))
return(out)
}
f_a(iris, mean)
# A tibble: 3 x 2
# Species out
#* <fct> <dbl>
#1 setosa 5.01
#2 versicolor 5.94
#3 virginica 6.59
f_a(iris, median)
# A tibble: 3 x 2
# Species out
#* <fct> <dbl>
#1 setosa 5
#2 versicolor 5.9
#3 virginica 6.5

Group t test result into columns within tidyverse

I'd like to collect multiple t-test results into one table. Originally my code looked like this:
tt_data <- iris %>%
group_by(Species) %>%
summarise(p = t.test(Sepal.Length,Petal.Length,alternative="two.sided",paired=T)$p.value,
estimate = t.test(Sepal.Length,Petal.Length,alternative="two.sided",paired=T)$estimate
)
tt_data
# Species p estimate
# setosa 2.542887e-51 3.544
# versicolor 9.667914e-36 1.676
# virginica 7.985259e-28 1.036
However, since the statistical test should only be performed once, is there a way to run the t test once per group and still build the intended table? I think some combination of broom and purrr would do it, but I am unfamiliar with the syntax.
# code idea (I know this won't work!)
tt_data <- iris %>%
group_by(Species) %>%
summarise(tt = t.test(Sepal.Length,Petal.Length,alternative="two.sided",paired=T)) %>%
select(Species, tt.p, tt.estimate)
tt_data
# Species tt.p tt.estimate
# setosa 2.542887e-51 3.544
# versicolor 9.667914e-36 1.676
# virginica 7.985259e-28 1.036
You can use broom::tidy() to transform the result of t.test into a tidy tibble:
library(dplyr)
library(broom)
iris %>%
group_by(Species) %>%
group_modify(~{
t.test(.$Sepal.Length,.$Petal.Length,alternative="two.sided",paired=T) %>%
tidy()
}) %>%
select(estimate, p.value)
#> Adding missing grouping variables: `Species`
#> # A tibble: 3 x 3
#> # Groups: Species [3]
#> Species estimate p.value
#> <fct> <dbl> <dbl>
#> 1 setosa 3.54 2.54e-51
#> 2 versicolor 1.68 9.67e-36
#> 3 virginica 1.04 7.99e-28
Created on 2020-09-02 by the reprex package (v0.3.0)
You can tidy the t.test result into a data frame with broom::tidy, store it in a list column, and then use map to select the desired values, i.e.
library(dplyr)
library(tidyr) # provides unnest()
iris %>%
group_by(Species) %>%
summarise(p = list(broom::tidy(t.test(Sepal.Length, Petal.Length, alternative = "two.sided", paired = T)))) %>%
mutate(p.value = purrr::map(p, ~select(.x, c('p.value', 'estimate')))) %>%
select(-p) %>%
unnest(cols = p.value)
# A tibble: 3 x 3
# Species p.value estimate
# <fct> <dbl> <dbl>
#1 setosa 2.54e-51 3.54
#2 versicolor 9.67e-36 1.68
#3 virginica 7.99e-28 1.04
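If broom is not available, the same "run the test once" idea can be sketched with a plain list column: keep the whole htest object per group and pull the pieces out afterwards. The column name tt is an arbitrary choice for this sketch:

```r
library(dplyr)

tt_data <- iris %>%
  group_by(Species) %>%
  # Run the paired t test once per group and keep the full result object
  summarise(tt = list(t.test(Sepal.Length, Petal.Length,
                             alternative = "two.sided", paired = TRUE)),
            .groups = "drop") %>%
  # Extract the two values of interest from each stored result
  mutate(p        = vapply(tt, function(x) x$p.value, numeric(1)),
         estimate = vapply(tt, function(x) unname(x$estimate), numeric(1))) %>%
  select(Species, p, estimate)
tt_data
```

vapply is used instead of sapply so that an unexpected result shape raises an error rather than silently changing the column type.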

Unquoting fails to find variable in mutate and map2 when renaming column of data in nested tibble R

Ok, I'm just trying to rename a column inside a nested tibble based on an identifier/character column:
MWE:
library(magrittr)
iris %>%
tibble::as_tibble() %>%
tidyr::nest(-Species) %>%
dplyr::mutate(
Species = as.character(Species),
data = purrr::map2(data, Species,
~dplyr::rename(.x, !!.y := Sepal.Width)))
but this returns the error:
Error in quos(..., .named = TRUE) : object '.y' not found
I have tried using ensym from rlang and all sort of combinations of !! and := without success.
That is, the first tibble in the data column should have its Sepal.Width column renamed to setosa, the second to versicolor, and in the last tibble Sepal.Width should be renamed to virginica.
You could switch away from the formula notation:
library(magrittr)
irisNest <- iris %>%
tibble::as_tibble() %>%
tidyr::nest(-Species) %>%
dplyr::mutate(Species = as.character(Species))
f <- function(x,y) {dplyr::rename(x, !!y := Sepal.Width)}
irisCheck <- dplyr::mutate(irisNest,
data = purrr::map2(data, Species, f))
library("tidyverse")
rename_func <- function(data, Species) {
Species <- as.character(Species)
data %>%
rename(!!Species := Sepal.Length)
}
iris2 <- as_tibble(iris) %>%
nest(-Species) %>%
group_by(Species) %>%
mutate(
data = map2(data, Species, rename_func))
iris2 %>% filter(Species == "setosa") %>% unnest() %>% head(1)
#> # A tibble: 1 x 5
#> # Groups: Species [3]
#> Species setosa Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.1 3.5 1.4 0.2
iris2 %>% filter(Species == "versicolor") %>% unnest() %>% head(1)
#> # A tibble: 1 x 5
#> # Groups: Species [3]
#> Species versicolor Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 versicolor 7 3.2 4.7 1.4
iris2 %>% filter(Species == "virginica") %>% unnest() %>% head(1)
#> # A tibble: 1 x 5
#> # Groups: Species [3]
#> Species virginica Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 virginica 6.3 3.3 6 2.5
Created on 2019-03-10 by the reprex package (v0.2.1)
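A likely explanation for the original error is that !! is processed by the outer mutate() call before map2() ever runs, and at that point .y does not exist. One way to sidestep unquoting altogether, sketched here under the assumption of dplyr 1.0 or later and tidyr 1.0 or later, is to hand rename() a named character vector through all_of(). The object name iris_renamed is made up for this example:

```r
library(dplyr)
library(tidyr)
library(purrr)

iris_renamed <- iris %>%
  as_tibble() %>%
  nest(data = -Species) %>%
  mutate(Species = as.character(Species)) %>%
  mutate(data = map2(data, Species, function(d, sp) {
    # Named vector of the form c(new_name = "old_name"); no !! required
    rename(d, all_of(setNames("Sepal.Width", sp)))
  }))

names(iris_renamed$data[[1]])
```

Because nothing is unquoted, the anonymous function can stay inside mutate() and does not have to be defined beforehand.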

Function with dplyr with two variables

I am trying to carry out the following dplyr task, but within a function.
library("dplyr")
iris %>%
group_by(Species) %>%
summarise(N = sum(Petal.Width == 0.2, na.rm = T))
I was thinking along the lines of the following, which is not complete because I am unclear on the syntax.
getSummary <- function(varName,level) {
summary <- iris %>%
group_by(Species %>%
summarise_(N = interp(~sum(var == ilevel, na.rm = T),
var = as.name(varName))))
sums <- summary$N
}
In this case levels is the numeric 0.2. Are there any changes if the value is a character "0.2"?
dplyr is in the process of switching over from a lazyeval-powered NSE system to an rlang-powered one. On the new version (available now through the GitHub version, and soon through CRAN), you can use
library(dplyr)
getSummary <- function(varName, level) {
varName <- enquo(varName) # parse and quote variable name
iris %>%
group_by(Species) %>%
summarise(N = sum((!!varName) == level), # unquote with !! to use
var = rlang::quo_text(varName)) # turn quosure to string
}
getSummary(Petal.Width, 0.2)
#> # A tibble: 3 × 3
#> Species N var
#> <fctr> <int> <chr>
#> 1 setosa 29 Petal.Width
#> 2 versicolor 0 Petal.Width
#> 3 virginica 0 Petal.Width
# or make it accept strings
getSummary <- function(varName, level) {
iris %>%
group_by(Species) %>%
summarise(N = sum((!!rlang::sym(varName)) == level),
var = varName)
}
getSummary('Sepal.Length', 5.0)
#> # A tibble: 3 × 3
#> Species N var
#> <fctr> <int> <chr>
#> 1 setosa 8 Sepal.Length
#> 2 versicolor 2 Sepal.Length
#> 3 virginica 0 Sepal.Length
To use the old lazyeval syntax, it would look like
getSummary <- function(varName, level) {
iris %>%
group_by(Species) %>%
summarise_(N = lazyeval::interp(~sum(x == y), # formula to substitute into
x = lazyeval::lazy(varName), # substituted but unevaluated name
y = level), # value to substitute
var = ~lazyeval::expr_text(varName)) # convert expression to string (equivalent to `deparse(substitute(...))`)
}
getSummary(Sepal.Length, 5.0)
#> # A tibble: 3 × 3
#> Species N var
#> <fctr> <int> <chr>
#> 1 setosa 8 Sepal.Length
#> 2 versicolor 2 Sepal.Length
#> 3 virginica 0 Sepal.Length
# or make it accept strings
getSummary <- function(varName, level) {
iris %>%
group_by(Species) %>%
summarise_(N = lazyeval::interp(~sum(x == y),
x = as.name(varName),
y = level),
var = ~varName)
}
getSummary('Petal.Width', 0.2)
#> # A tibble: 3 × 3
#> Species N var
#> <fctr> <int> <chr>
#> 1 setosa 29 Petal.Width
#> 2 versicolor 0 Petal.Width
#> 3 virginica 0 Petal.Width
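With more recent releases (assuming dplyr 1.0 or later and rlang 0.4 or later), the same two variants can be sketched more compactly with the embrace operator {{ }} for bare column names and the .data pronoun for strings. The function names get_summary and get_summary_chr and the df argument are made up for this sketch:

```r
library(dplyr)

# Bare column name: {{ var }} forwards the caller's expression
get_summary <- function(df, var, level) {
  df %>%
    group_by(Species) %>%
    summarise(N = sum({{ var }} == level, na.rm = TRUE), .groups = "drop")
}

# Column name as a string: look it up through the .data pronoun
get_summary_chr <- function(df, var_name, level) {
  df %>%
    group_by(Species) %>%
    summarise(N = sum(.data[[var_name]] == level, na.rm = TRUE), .groups = "drop")
}

res_bare <- get_summary(iris, Petal.Width, 0.2)
res_chr  <- get_summary_chr(iris, "Sepal.Length", 5.0)
res_bare
res_chr
```

Passing the data frame as an argument instead of hard-coding iris is a design choice of this sketch, not something the original answer does.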

Resources