Passing multiple arguments to function in dplyr::summarise_if - r

I am trying to make a function that uses summarise_if (or summarise_at) to calculate the correlation between one column and many others in the data set.
data_set <- data.frame(grp = rep(c("a","b","c"), each =
3), x = rnorm(9), y = rnorm(9), z = rnorm(9))
multiple_cor <- function(d, vars){
d %>%
dplyr::group_by(grp) %>%
dplyr::summarise_at(vars, cor, x) %>%
return()
}
multiple_cor(data_set, vars = c("y","z") )
This gives the error:
Error in dots_list(...) : object 'x' not found
Called from: dots_list(...)
I'm fairly sure this happens because cor does not evaluate x in the right environment, but I am not sure how to get around it.

summarise_at has a .funs argument, so it can handle anonymous functions. I created a function called cors inside your function and passed it to summarise_at, wrapped in funs(), so that x is looked up in the data.
multiple_cor <- function(d, vars){
cors <- function(x, a = NULL) {
stats::cor(x, a)
}
d %>%
dplyr::group_by(grp) %>%
dplyr::summarise_at(vars, funs(cors(x, .))) %>%
return()
}
multiple_cor(data_set, vars = c("y","z") )
# A tibble: 3 x 3
grp y z
<fct> <dbl> <dbl>
1 a 0.803 0.894
2 b -0.284 -0.949
3 c 0.805 -0.571
The result of the function is identical to that of the following lines of code:
data_set %>%
group_by(grp) %>%
summarise(cxy = cor(x, y),
cxz = cor(x, z))
# A tibble: 3 x 3
grp cxy cxz
<fct> <dbl> <dbl>
1 a 0.803 0.894
2 b -0.284 -0.949
3 c 0.805 -0.571
Read this dplyr documentation.
And this google groups discussion.
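In recent dplyr versions funs() is deprecated and summarise_at() is superseded, so the same idea can be sketched with across(). This is only a minimal sketch, not the method from the answer above: multiple_cor_across and cols are illustrative names, and set.seed() is added purely so the example is reproducible. The lambda is created inside summarise(), so x in it refers to the x column of the current group.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed, only to make the sketch reproducible
data_set <- data.frame(grp = rep(c("a", "b", "c"), each = 3),
                       x = rnorm(9), y = rnorm(9), z = rnorm(9))

# Correlate column x with every column named in `cols`, per group
multiple_cor_across <- function(d, cols) {
  d %>%
    group_by(grp) %>%
    summarise(across(all_of(cols), ~ cor(x, .x)))
}

res <- multiple_cor_across(data_set, cols = c("y", "z"))
res
```

all_of() is used so that a character vector of column names is accepted explicitly.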

Related

Dplyr to calculate mean, SD, and graph multiple variables

I have a table with columns
[Time, var1, var2, var3, var4...varN]
I need to calculate mean/SE per Time for each var1, var2...var n , and I want to do this programmatically for all variables, rather than 1 at a time which would involve a lot of copy-pasting.
Section 8.2.3 here https://tidyeval.tidyverse.org/dplyr.html is close to what I want but my below code:
x <- as.data.frame(matrix(nrow = 2, ncol = 3))
x[1,1] = 1
x[1,2] = 2
x[1,3] = 3
x[2,1] =4
x[2,2] = 5
x[2,3] = 6
names(x)[1] <- "time"
names(x)[2] <- "var1"
names(x)[3] <- "var2"
grouped_mean3 <- function(.data, ...) {
print(.data)
summary_vars <- enquos(...)
print(summary_vars)
summary_vars <- purrr::map(summary_vars, function(var) {
expr(mean(!!var, na.rm = TRUE))
})
print(summary_vars)
.data %>%
group_by(time)
summarise(!!!summary_vars) # Unquote-splice the list
}
grouped_mean3(x, var("var1"), var("var2"))
Yields
Error in !summary_vars : invalid argument type
And the original cause is "Must group by variables found in .data." and it finds a column that isn't in the dummy "x" that I generated for the purposes of testing. I have no idea what's happening, sadly.
How do I actually extract the mean from the new summary_vars and add it to the .data table? summary_vars becomes something like
[[1]]
mean(~var1, na.rm = TRUE)
[[2]]
mean(~var2, na.rm = TRUE)
Which seems close, but needs evaluation. How do I evaluate this? !!! wasn't working.
For what it's worth, I tried plugging the example in dplyr into this R engine https://rdrr.io/cran/dplyr/man/starwars.html and it didn't work either.
Help?
End goal would be a table along the lines of
[Time, var1mean, var2mean, var3mean, var4mean...]
Try this:
library(dplyr)
grouped_mean3 <- function(.data, ...) {
vars <- c(...)
.data %>%
group_by(time) %>%
summarise(across(all_of(vars), mean, .names = "{.col}mean"))
}
grouped_mean3(x, 'var1')
# time var1mean
# <dbl> <dbl>
#1 1 2
#2 4 5
grouped_mean3(x, 'var1', 'var2')
# time var1mean var2mean
# <dbl> <dbl> <dbl>
#1 1 2 3
#2 4 5 6
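For readers who want to stay with the quosure approach from the question, the following is a hedged sketch of how it could be made to run. The two changes are the pipe after group_by() and passing bare column names instead of var("var1"). The name grouped_mean_quo and the argument name df are illustrative and not from the original post.

```r
library(dplyr)

x <- data.frame(time = c(1, 4), var1 = c(2, 5), var2 = c(3, 6))

grouped_mean_quo <- function(df, ...) {
  # Capture the bare column names as a named list of quosures
  summary_vars <- rlang::enquos(..., .named = TRUE)
  # Wrap each captured column in mean(); !! inlines the quosure
  summary_exprs <- lapply(summary_vars, function(v) {
    rlang::expr(mean(!!v, na.rm = TRUE))
  })
  # Name the outputs var1mean, var2mean, ...
  names(summary_exprs) <- paste0(names(summary_exprs), "mean")
  df %>%
    group_by(time) %>%            # the pipe here is required
    summarise(!!!summary_exprs)   # unquote-splice the list
}

res <- grouped_mean_quo(x, var1, var2)
res
```

The expressions are evaluated by summarise() itself once they are spliced in with !!!, so no separate evaluation step is needed.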
Perhaps this is what you are looking for?
x %>%
group_by(time) %>%
summarise_at(vars(starts_with('var')), ~mean(., na.rm = TRUE)) %>%
rename_at(vars(starts_with('var')), ~paste0(., "mean")) %>%
merge(x)
With your data (from your question) following is the output:
time var1mean var2mean var1 var2
1 1 2 3 2 3
2 4 5 6 5 6
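The question asks for both mean and SE per time point, while the answers above compute only the mean. A possible sketch that computes both in one pass is below. The helper se and the small data frame with two observations per time value are assumptions made for illustration (an SE needs more than one observation per group).

```r
library(dplyr)

# Hypothetical data: two observations per time point
x <- data.frame(time = c(1, 1, 4, 4),
                var1 = c(2, 4, 5, 7),
                var2 = c(3, 5, 6, 10))

# Standard error of the mean, ignoring missing values (illustrative helper)
se <- function(v) sd(v, na.rm = TRUE) / sqrt(sum(!is.na(v)))

res <- x %>%
  group_by(time) %>%
  summarise(across(starts_with("var"),
                   list(mean = ~ mean(.x, na.rm = TRUE), se = se),
                   .names = "{.col}{.fn}"))
res
```

The .names template produces columns such as var1mean and var1se, in line with the naming the question asks for.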

R: Looping over custom dplyr function

I want to build a custom dplyr function and iterate over it ideally with purrr::map to stay in the tidyverse.
To keep things as easy as possible I replicate my problem using a very simple summarize function.
When building custom functions with dplyr I ran into the problem of non-standard evaluation (NSE). I found three different ways to deal with it. Each way of dealing with NSE works fine when the function is called directly, but not when looping over it. Below you'll find the code to replicate my problem. What would be the correct way to make my function work with purrr::map?
# loading libraries
library(dplyr)
library(tidyr)
library(purrr)
# generate test data
test_tbl <- rbind(tibble(group = rep(sample(letters[1:4], 150, TRUE), each = 4),
score = sample(0:10, size = 600, replace = TRUE)),
tibble(group = rep(sample(letters[5:7], 50, TRUE), each = 3),
score = sample(0:10, size = 150, replace = TRUE))
)
# generate two variables to loop over
test_tbl$group2 <- test_tbl$group
vars <- c("group", "group2")
# summarise function 1 using enquo()
sum_tbl1 <- function(df, x) {
x <- dplyr::enquo(x)
df %>%
dplyr::group_by(!! x) %>%
dplyr::summarise(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
# summarise function 2 using .dots = lazyeval
sum_tbl2 <- function(df, x) {
df %>%
dplyr::group_by_(.dots = lazyeval::lazy(x)) %>%
dplyr::summarize(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
# summarise function 3 using ensym()
sum_tbl3 <- function(df, x) {
df %>%
dplyr::group_by(!!rlang::ensym(x)) %>%
dplyr::summarize(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
# Looping over the functions with map
# each variation produces an error no matter which function I choose
# call within anonymous function without pipe
map(vars, function(x) sum_tbl1(test_tbl, x))
map(vars, function(x) sum_tbl2(test_tbl, x))
map(vars, function(x) sum_tbl3(test_tbl, x))
# call within anonymous function within pipe
map(vars, function(x) test_tbl %>% sum_tbl1(x))
map(vars, function(x) test_tbl %>% sum_tbl2(x))
map(vars, function(x) test_tbl %>% sum_tbl3(x))
# call with formula notation without pipe
map(vars, ~sum_tbl1(test_tbl, .x))
map(vars, ~sum_tbl2(test_tbl, .x))
map(vars, ~sum_tbl3(test_tbl, .x))
# call with formula notation within pipe
map(vars, ~test_tbl %>% sum_tbl1(.x))
map(vars, ~test_tbl %>% sum_tbl2(.x))
map(vars, ~test_tbl %>% sum_tbl3(.x))
I know that there are other solutions for producing summarize tables in a loop, like calling map directly and creating an anonymous function inside map (see code below). However, the problem I am interested in is how to deal with NSE in loops in general.
# One possibility to create summarize tables in loops with map
vars %>%
map(function(x){
test_tbl %>%
dplyr::group_by(!!rlang::ensym(x)) %>%
dplyr::summarize(score = mean(score, na.rm =TRUE),
n = dplyr::n())
})
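The behaviour in the loops can be traced to what enquo() captures: the expression typed by the caller, not the value behind it. The sketch below is illustrative only (capture_expr and wrapper are made-up helper names) and uses rlang directly to show the three cases.

```r
# Return the expression that enquo() captured for argument x
capture_expr <- function(x) rlang::quo_get_expr(rlang::enquo(x))

# Direct call with a bare name: the symbol `group` is captured
direct <- capture_expr(group)

# Call through a wrapper, as map() does: the symbol `x` is captured,
# not the string stored in it
wrapper <- function(x) capture_expr(x)
looped <- wrapper("group")

# Unquoting with !! passes the value itself: the string "group"
wrapper_unquoted <- function(x) capture_expr(!!x)
unquoted <- wrapper_unquoted("group")

str(list(direct = direct, looped = looped, unquoted = unquoted))
```

This is why a function built on enquo() sees a placeholder name such as x or .x when it is called from inside map().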
Update:
Below akrun provides a solution that makes the call via purrr::map() possible. A direct call to the function is then however only possible by calling the grouping variable as a string either directly
sum_tbl(test_tbl, "group")
or indirectly as
sum_tbl(test_tbl, vars[1])
In this solution it is not possible to call the grouping variable in a normal dplyr way as
sum_tbl(test_tbl, group)
Ultimately, it seems to me that solutions to NSE in custom dplyr functions address the problem either at the level of the function call itself, in which case map/lapply cannot be used, or in a way that works with iteration, in which case variables can only be passed as strings.
Building on akrun's answer I built a workaround function which allows both strings and normal variable names in the function call. However, there are definitely better ways to make this possible. Ideally, there is a more straightforward way of dealing with NSE in custom dplyr functions, so that a workaround like the one below is not necessary in the first place.
sum_tbl <- function(df, x) {
x_var <- dplyr::enquo(x)
x_env <- rlang::get_env(x_var)
if(identical(x_env, rlang::empty_env())) {
# works, when x is a string and in loops via map/lapply
sum_tbl <- df %>%
dplyr::group_by(!! rlang::sym(x)) %>%
dplyr::summarise(score = mean(score, na.rm = TRUE),
n = dplyr::n())
} else {
# works, when x is a normal variable name without quotation marks
x = dplyr::enquo(x)
sum_tbl <- df %>%
dplyr::group_by(!! x) %>%
dplyr::summarise(score = mean(score, na.rm = TRUE),
n = dplyr::n())
}
return(sum_tbl)
}
Final update/solution
In an updated version of his answer, akrun provides a solution that accounts for four ways of calling the variable x:
(1) as a normal (non-string) variable name: sum_tbl(test_tbl, group)
(2) as a string name: sum_tbl(test_tbl, "group")
(3) as an indexed vector: sum_tbl(test_tbl, !!vars[1])
(4) as a vector within purrr::map(): map(vars, ~ sum_tbl(test_tbl, !!.x))
In (3) and (4) it is necessary to unquote the variable x using !!.
If I were the only user of the function, this wouldn't be a problem, but as soon as other team members use it I would need to explain and document this requirement.
To avoid this, I extended akrun's solution to account for all four ways without unquoting. However, I am not sure whether this solution creates other pitfalls.
sum_tbl <- function(df, x) {
# if x is a symbol such as group without quotes, then turn it into a string
if(is.symbol(get_expr(enquo(x)))) {
x <- quo_name(enquo(x))
# if x is a language object such as vars[1], evaluate it
# (this turns it into a symbol), then turn it into a string
} else if (is.language(get_expr(enquo(x)))) {
x <- eval(x)
x <- quo_name(enquo(x))
}
# this part of the function works with normal strings as x
sum_tbl <- df %>%
dplyr::group_by(!! rlang::sym(x)) %>%
dplyr::summarise(score = mean(score, na.rm = TRUE),
n = dplyr::n())
return(sum_tbl)
}
We can just use group_by_at, which can take a string as its argument:
sum_tbl1 <- function(df, x) {
df %>%
dplyr::group_by_at(x) %>%
dplyr::summarise(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
and then call as
out1 <- map(vars, ~ sum_tbl1(test_tbl, .x))
Another option is to convert the string to a symbol and then unquote it (!!) within group_by:
sum_tbl2 <- function(df, x) {
df %>%
dplyr::group_by(!! rlang::sym(x)) %>%
dplyr::summarise(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
out2 <- map(vars, ~ sum_tbl2(test_tbl, .x))
identical(out1 , out2)
#[1] TRUE
If we name the df argument, map supplies each element of vars as the remaining argument x, so the function can also be run without an anonymous call:
map(vars, sum_tbl2, df = test_tbl)
Update
If we want it to work with the calling conventions mentioned in the OP's updated post:
sum_tbl3 <- function(df, x) {
x1 <- enquo(x)
x2 <- quo_name(x1)
df %>%
dplyr::group_by_at(x2) %>%
dplyr::summarise(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
sum_tbl3(test_tbl, group)
# A tibble: 7 x 3
# group score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
sum_tbl3(test_tbl, "group")
# A tibble: 7 x 3
# group score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
or call from 'vars'
sum_tbl3(test_tbl, !!vars[1])
# A tibble: 7 x 3
# group score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
and with map
map(vars, ~ sum_tbl3(test_tbl, !!.x))
#[[1]]
# A tibble: 7 x 3
# group score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
#[[2]]
# A tibble: 7 x 3
# group2 score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
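With current dplyr, the two calling styles are often kept as two small functions rather than one that guesses. The sketch below is an assumption-laden illustration, not akrun's solution: sum_tbl_str takes a string and uses across(all_of()), sum_tbl_sym takes a bare name and uses the {{ }} operator. Both names, and the smaller seeded test table, are made up for this example.

```r
library(dplyr)
library(purrr)

set.seed(42)  # assumption: small seeded table, only for illustration
test_tbl <- tibble(group = sample(letters[1:4], 40, TRUE),
                   score = sample(0:10, 40, TRUE))
test_tbl$group2 <- test_tbl$group
vars <- c("group", "group2")

# String interface: suitable for map()/lapply()
sum_tbl_str <- function(df, x) {
  df %>%
    group_by(across(all_of(x))) %>%
    summarise(score = mean(score, na.rm = TRUE), n = n())
}

# Bare-name interface: suitable for interactive calls
sum_tbl_sym <- function(df, x) {
  df %>%
    group_by({{ x }}) %>%
    summarise(score = mean(score, na.rm = TRUE), n = n())
}

out <- map(vars, ~ sum_tbl_str(test_tbl, .x))
direct <- sum_tbl_sym(test_tbl, group)
```

Keeping the interfaces separate avoids having to inspect the captured expression, at the cost of two function names to document.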

Is there a helper function to make this code cleaner on a tibble?

I need to sum sequences generated from one column. I have done it this way:
test <- tibble::tibble(
x = c(1,2,3)
)
test %>% dplyr::mutate(., s = plyr::aaply(x, .margins = 1, .fun = function(x_i){sum(seq(x_i))}))
Is there a cleaner way to do this? Is there a helper function or construct that lets me reduce this:
plyr::aaply(x, .margins = 1, .fun = function(x_i){sum(seq(x_i))})
I am looking for a generic solution; sum and seq are only examples here. Maybe the real problem is that I want to execute a function on each element, not on the whole vector.
This is my real case:
test <- tibble::tibble(
x = c(1,2,3),
y = c(0.5,1,1.5)
)
d <- c(1.23, 0.99, 2.18)
test %>% mutate(., s = (function(x, y) {
dn <- dnorm(x = d, mean = x, sd = y)
s <- sum(dn)
s
})(x,y))
test %>% plyr::ddply(., c("x","y"), .fun = function(row) {
dn <- dnorm(x = d, mean = row$x, sd = row$y)
s <- sum(dn)
s
})
I would like to do that with mutate in a row-wise way, not a vectorized way.
For the specific example, it is a direct application of cumsum
test %>%
mutate(s = cumsum(x))
For generic cases that apply a function to each element of a column, we can use map
test %>%
mutate(s = map_dbl(x, ~ sum(seq(.x))))
# A tibble: 3 x 2
# x s
# <dbl> <dbl>
#1 1 1
#2 2 3
#3 3 6
Update
For the updated dataset, use map2, as we are using corresponding arguments in dnorm from the 'x' and 'y' columns of the dataset
test %>%
mutate(V1 = map2_dbl(x, y, ~ dnorm(d, mean = .x, sd = .y) %>%
sum))
# A tibble: 3 x 3
# x y V1
# <dbl> <dbl> <dbl>
#1 1 0.5 1.56
#2 2 1 0.929
#3 3 1.5 0.470
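Since the question asks for mutate to work row by row, dplyr's rowwise() is another option worth sketching. This is an alternative to the map2 approach above, not a replacement for it. Inside the rowwise mutate, x and y are the single values of the current row, while d is taken from the calling environment.

```r
library(dplyr)

test <- tibble::tibble(
  x = c(1, 2, 3),
  y = c(0.5, 1, 1.5)
)
d <- c(1.23, 0.99, 2.18)

res <- test %>%
  rowwise() %>%                                  # treat each row as its own group
  mutate(s = sum(dnorm(d, mean = x, sd = y))) %>%
  ungroup()                                      # drop the rowwise grouping again
res
```

rowwise() tends to be slower than map2_dbl() on large tables, so it is mainly a readability choice.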

Tidy Eval, using enquo with infer package

This is my first question on this website.
The infer package, which I am trying to use, is part of the tidyverse (tidymodels).
library(tidyverse)
library(rlang)
library(infer)
mtcars$am <- as.factor(mtcars$am)
f <- function(dataset, col){
col <- enquo(col)
bootstrap <- dataset %>%
specify(!!col ~ am ) %>%
generate(reps = 100, type = "bootstrap") %>%
calculate("diff in means", order = c("1", "0"))
}
f(mtcars, mpg)
Error: The response variable `!` cannot be found in this dataframe.The response variable `!col` cannot be found in this dataframe.
In addition: Warning message:
In if (!(as.character(attr(x, "response")) %in% names(x))) { :
I have tried to use qq_show and everything looks fine, so I don't understand the error.
The issue is in the formula. We can convert the quosure to a string (quo_name), paste the formula text together, and convert the resulting string into a formula object
f <- function(dataset, col){
col <- enquo(col)
dataset %>%
specify(as.formula(paste0(quo_name(col), '~ am'))) %>%
generate(reps = 100, type = "bootstrap") %>%
calculate("diff in means", order = c("1", "0"))
}
f(mtcars, mpg)
# A tibble: 100 x 2
# replicate stat
# <int> <dbl>
# 1 1 8.41
# 2 2 10.7
# 3 3 7.65
# 4 4 7.21
# 5 5 7.47
# 6 6 6.59
# 7 7 9.32
# 8 8 5.70
# 9 9 8.25
#10 10 6.24
# ... with 90 more rows
Based on @Lionel Henry's suggestion
f <- function(dataset, col){
col <- ensym(col)
g <- expr(!!col ~ am)
dataset %>%
specify(g) %>%
generate(reps = 100, type = "bootstrap") %>%
calculate("diff in means", order = c("1", "0"))
}
f(mtcars, mpg)
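The underlying reason is that ~ quotes both of its sides, so !!col is stored literally in the formula unless the receiving function processes it. A third way to build the formula, sketched here with base R's reformulate(), avoids string pasting. make_formula is a hypothetical helper name, and the sketch assumes (as the as.formula answer above suggests) that specify() accepts any ordinary formula object. lm() is used only to show that the result behaves like a normal formula.

```r
# Build `<col> ~ am` from a bare column name (illustrative helper)
make_formula <- function(col, explanatory = "am") {
  col <- rlang::ensym(col)  # capture the bare name as a symbol
  stats::reformulate(explanatory, response = rlang::as_string(col))
}

fml <- make_formula(mpg)
fml

# The result is an ordinary formula, usable anywhere a formula is expected
fit <- lm(fml, data = mtcars)
```

Inside a function like f above, the result of make_formula could then be passed to specify() in place of the pasted formula.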

Can get non-standard evaluation to work in dplyr for filter_ and count_ but not distinct_

I am trying to write a function that uses dplyr to count up all unique values of z. My function works fine when I have the variable actually named z. However, if the variable is named x, I get an error (below code).
test.data<-data.frame(y=c(1:10),
x=c(letters[1:10]))
test.data$x<-as.character(test.data$x)
obsfunction<-function(z,y,data){
filter_(data,
!is.na(deparse(substitute(y))))%>%
distinct_(., deparse(substitute(z)))%>% #the line that breaks it
count_(.)
}
obsfunction(z=x,y,data=test.data)
So, the above code doesn't work and gives this error:
>Error in eval(substitute(expr), envir, enclos) : unknown column 'z'
Changing z to x in the function (or renaming x as z) makes it work, but I don't want to have to rename everything, especially considering y works with different names.
I have tried lazyeval::interp and quote() per the vignette, this question, and this question.
distinct_(lazyeval::interp(as.name(z)))%>%
>Error in as.name(z) : object 'x' not found
distinct_(quote(z))%>%
>Error in eval(substitute(expr), envir, enclos) : unknown column 'z'
What am I missing? How do I get z to accept x as the column name?
As dplyr's standard evaluation functions understand strings, I tried the following code, and with additional test data it seems to work. I first extracted the variable names and then constructed the expressions as character strings:
test.data<-data.frame(y=c(1:10),
x=c(letters[1:10]))
test.data$x<-as.character(test.data$x)
f <- function(z, y, data){
z <- deparse(substitute(z))
y <- deparse(substitute(y))
res <- data %>% filter_(
paste('!is.na(', y, ')', sep = '')) %>%
distinct_(z) %>%
count_(.)
}
x <- f(z = x, y, test.data)
# # A tibble: 1 × 1
# n
# <int>
# 1 10
test.data <- data.frame(
y=c(1:4, NA, NA, 7:10),
x=c(letters[c(1:8, 8, 8)]),
stringsAsFactors = F)
x <- f(z = x, y, test.data)
# # A tibble: 1 × 1
# n
# <int>
# 1 6
You can use match.call to capture function arguments and convert them to characters before passing to the dplyr SE functions:
obsfunction<-function(z, y, data){
cl = match.call()
y = as.character(cl['y'])
z = as.character(cl['z'])
data %>% filter_(paste('!is.na(', y, ')', sep = '')) %>%
distinct_(z) %>%
count_(.)
}
obsfunction(z = x, y = y, data = test.data)
# A tibble: 1 × 1
# n
# <int>
#1 10
obsfunction(x, y, test.data)
# A tibble: 1 × 1
# n
# <int>
#1 10
Another lazyeval/dplyr variation, where the variables are passed as formulas and f_interp substitutes uq(x) with the formula passed to it, similar to deparse(substitute(x)):
library(dplyr)
library(lazyeval)
test.data<-data.frame(y=c(1:10),
x=c(letters[1:10]))
test.data$x<-as.character(test.data$x)
obsfunction<-function(z, y, data){
data %>% filter_(f_interp(~!is.na(uq(y)))) %>%
distinct_(f_interp(~uq(z))) %>% count()
}
obsfunction(z=~x,~y,data=test.data)
#A tibble: 1 × 1
# n
# <int>
#1 10
test.data.NA <- data.frame(
y=c(1:4, NA, NA, 7:10),
x=c(letters[c(1:8, 8, 8)]),
stringsAsFactors = FALSE)
obsfunction(z=~x,~y,data=test.data.NA)
# # A tibble: 1 × 1
# n
# <int>
# 1 6
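The underscore verbs (filter_, distinct_, count_) are deprecated in current dplyr. A minimal sketch of the same function with the {{ }} operator is below; obs_count is an illustrative name, and the data argument is moved to the front only to suit piping.

```r
library(dplyr)

# Count distinct values of column z among rows where column y is not NA
obs_count <- function(data, z, y) {
  data %>%
    filter(!is.na({{ y }})) %>%
    distinct({{ z }}) %>%
    count()
}

test.data.NA <- data.frame(
  y = c(1:4, NA, NA, 7:10),
  x = letters[c(1:8, 8, 8)],
  stringsAsFactors = FALSE)

res <- obs_count(test.data.NA, z = x, y = y)
res
```

Bare column names are passed directly, so no deparse(substitute()), match.call, or formula wrapping is needed.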
