Pass a varying list of functions to dplyr summarize

Is it possible to pass a list of functions to dplyr::summarize in a way that allows the list of functions to vary? I'd like to create an overall function that builds a summary table but allows different groups of functions in the output [edit: when the functions are not all applied to the same column].
I was thinking this could be done by creating an overall function whose T/F arguments select which groups of summary functions to include (where funA and funB are lists of functions, and the user could include all functions from funA, funB, or both), but I am not sure how to write the initial function lists (funA, funB) when the functions are not all applied to the same column. Below is an idea of how it would be structured. Is this possible, or is there a better way to do this?
#Essentially - how would I write a function to selectively include a group of functions (for example either funA = c(n, min, max) or funB=c(n_na, n_neg), or both).
extract_all <- function(x){
x %>% summarize(n=n(),
min = min(disp, na.rm=TRUE),
max = max(disp, na.rm=TRUE),
n_na = sum(is.na(wt)),
n_neg = sum(vs < 0, na.rm=TRUE))
}
test <- mtcars %>% group_by(cyl) %>% extract_all()
#Does this structure work?
extract_summaries <- function(x, funA=TRUE, funB=FALSE){
funAls <- list() #but how do you write n, min, max in here?
funBls <- list() #and n_na, n_neg in here
funls <- append(funAls[funA], funBls[funB])
summarize(x, funls)
}
#which could be run with:
test <- mtcars %>% group_by(cyl) %>% extract_summaries(funA=TRUE, funB=TRUE)

Here is one option
extract_summaries <- function(x, colnm, funA=TRUE, funB=FALSE){
funAls <- list(n = length, min= min, max = max)
funBls <- list(n_na = function(y) sum(is.na(y)),
n_neg = function(y) sum(y < 0, na.rm=TRUE))
funls <- append(funAls[funA], funBls[funB])
x %>%
summarise_at(vars({{colnm}}), funls)
}
test <- mtcars %>%
group_by(cyl) %>%
extract_summaries(mpg, funA=TRUE, funB=TRUE)
test
# A tibble: 3 x 6
# cyl n min max n_na n_neg
# <dbl> <int> <dbl> <dbl> <int> <int>
#1 4 11 21.4 33.9 0 0
#2 6 7 17.8 21.4 0 0
#3 8 14 10.4 19.2 0 0
test <- mtcars %>%
group_by(cyl) %>%
extract_summaries(mpg, funA = FALSE, funB = TRUE)
test
# A tibble: 3 x 3
# cyl n_na n_neg
# <dbl> <int> <int>
#1 4 0 0
#2 6 0 0
#3 8 0 0
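The option above applies every function to a single column. For the case raised in the question's edit, where each summary targets a different column, one possible sketch is to store the summaries as unevaluated expressions with rlang::exprs() and splice the selected ones into summarise() with !!!. The columns (disp, wt, vs) are taken from the question's extract_all(); the name extract_summaries_expr is only illustrative and not part of any package.

```r
library(dplyr)

# Hypothetical helper: each group of summaries is a named list of
# unevaluated expressions, so every summary can use its own column.
extract_summaries_expr <- function(x, funA = TRUE, funB = FALSE) {
  funAls <- rlang::exprs(
    n   = n(),
    min = min(disp, na.rm = TRUE),
    max = max(disp, na.rm = TRUE)
  )
  funBls <- rlang::exprs(
    n_na  = sum(is.na(wt)),
    n_neg = sum(vs < 0, na.rm = TRUE)
  )
  # Keep only the groups that were switched on
  funls <- c(if (funA) funAls, if (funB) funBls)
  # Splice the chosen expressions in as separate summarise() arguments
  summarise(x, !!!funls)
}

res_expr <- mtcars %>%
  group_by(cyl) %>%
  extract_summaries_expr(funA = TRUE, funB = TRUE)
```

Because the expressions are only evaluated inside summarise(), they still respect the grouping set up by group_by(cyl).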

Related

Dplyr to calculate mean, SD, and graph multiple variables

I have a table with columns
[Time, var1, var2, var3, var4...varN]
I need to calculate mean/SE per Time for each var1, var2...var n , and I want to do this programmatically for all variables, rather than 1 at a time which would involve a lot of copy-pasting.
Section 8.2.3 here https://tidyeval.tidyverse.org/dplyr.html is close to what I want but my below code:
x <- as.data.frame(matrix(nrow = 2, ncol = 3))
x[1,1] = 1
x[1,2] = 2
x[1,3] = 3
x[2,1] =4
x[2,2] = 5
x[2,3] = 6
names(x)[1] <- "time"
names(x)[2] <- "var1"
names(x)[3] <- "var2"
grouped_mean3 <- function(.data, ...) {
print(.data)
summary_vars <- enquos(...)
print(summary_vars)
summary_vars <- purrr::map(summary_vars, function(var) {
expr(mean(!!var, na.rm = TRUE))
})
print(summary_vars)
.data %>%
group_by(time)
summarise(!!!summary_vars) # Unquote-splice the list
}
grouped_mean3(x, var("var1"), var("var2"))
Yields
Error in !summary_vars : invalid argument type
And the original cause is "Must group by variables found in .data." and it finds a column that isn't in the dummy "x" that I generated for the purposes of testing. I have no idea what's happening, sadly.
How do I actually extract the mean from the new summary_vars and add it to the .data table? summary_vars becomes something like
[[1]]
mean(~var1, na.rm = TRUE)
[[2]]
mean(~var2, na.rm = TRUE)
Which seems close, but needs evaluation. How do I evaluate this? !!! wasn't working.
For what it's worth, I tried plugging the example in dplyr into this R engine https://rdrr.io/cran/dplyr/man/starwars.html and it didn't work either.
Help?
End goal would be a table along the lines of
[Time, var1mean, var2mean, var3mean, var4mean...]
Try this :
library(dplyr)
grouped_mean3 <- function(.data, ...) {
vars <- c(...)
.data %>%
group_by(time) %>%
summarise(across(all_of(vars), mean, .names = "{.col}mean")) # .names adds the "mean" suffix shown below
}
grouped_mean3(x, 'var1')
# time var1mean
# <dbl> <dbl>
#1 1 2
#2 4 5
grouped_mean3(x, 'var1', 'var2')
# time var1mean var2mean
# <dbl> <dbl> <dbl>
#1 1 2 3
#2 4 5 6
Perhaps this is what you are looking for?
x %>%
group_by(time) %>%
summarise_at(vars(starts_with('var')), ~mean(., na.rm=TRUE)) %>%
rename_at(vars(starts_with('var')), ~paste0(., "mean")) %>%
merge(x)
With your data (from your question) following is the output:
time var1mean var2mean var1 var2
1 1 2 3 2 3
2 4 5 6 5 6
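Since the question asks for both the mean and the SE per time point, here is a minimal sketch that passes a named list of functions to across(). The data frame x2 and the helper se are made up for illustration (the question's x has only one row per time, so its SE would be NA).

```r
library(dplyr)

# Made-up example data with several rows per time point
x2 <- data.frame(
  time = c(1, 1, 1, 4, 4, 4),
  var1 = c(2, 4, 6, 5, 7, 9),
  var2 = c(3, 3, 6, 6, 8, NA)
)

# Hypothetical helper: standard error of the mean, ignoring missing values
se <- function(v) sd(v, na.rm = TRUE) / sqrt(sum(!is.na(v)))

res_x2 <- x2 %>%
  group_by(time) %>%
  summarise(
    across(
      starts_with("var"),
      list(mean = ~ mean(.x, na.rm = TRUE), se = ~ se(.x)),
      .names = "{.col}{.fn}"  # gives var1mean, var1se, var2mean, var2se
    )
  )
```

The names in the list become the {.fn} part of each output column, so adding another summary only means adding another list entry.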

How to use group_by() with an empty argument, in R?

I am writing a function that computes the mean of a variable according to some grouping (g1 and g2). I would like the function to take care of the case when the user just wants to compute the mean across the groups, so the group argument will be empty.
I want a solution using tidyverse.
Suppose the following:
y = 1:4
g1 = c('a', 'a', 'b', 'b')
g2 = c(1,2,1,2)
MyData = data.frame(g1, g2, y)
MyFun = function(group){
group_sym = syms(group)
MyData %>%
group_by(!!!group_sym) %>%
summarise(mean = mean(y))
}
# this works well
MyFun(group = c('g1', 'g2'))
Now suppose I want the mean of y across all groups. I would like the function to be able to handle something like
MyFun(group = '')
or
MyFun(group = NULL)
So ideally I would like the group argument to be empty / null and thus MyData would not be grouped. One solution could be to add a condition at the beginning of the function checking if the argument is empty and if TRUE write summarise without group_by. But this is not elegant and my real code is much longer than just a few lines.
Any idea?
1) Use {{...}} and use g1 in place of 'g1':
MyFun = function(group) {
MyData %>%
group_by({{group}}) %>%
summarise(mean = mean(y)) %>%
ungroup
}
MyFun(g1)
## # A tibble: 2 x 2
## g1 mean
## <fct> <dbl>
## 1 a 1.5
## 2 b 3.5
MyFun()
## # A tibble: 1 x 1
## mean
## <dbl>
## 1 2.5
2) This approach uses 'g1' as in the question.
MyFun = function(group) {
group <- if (missing(group)) 'All' else sym(group)
MyData %>%
group_by(!!group) %>%
summarise(mean = mean(y)) %>%
ungroup
}
MyFun('g1')
## # A tibble: 2 x 2
## g1 mean
## <fct> <dbl>
## 1 a 1.5
## 2 b 3.5
MyFun()
## # A tibble: 1 x 2
## `"All"` mean
## <chr> <dbl>
## 1 All 2.5
3) This also works and gives the same output as (2).
MyFun = function(...) {
group <- if (...length()) syms(c(...)) else 'All' # c(...) so that several group names can be passed
MyData %>%
group_by(!!!group) %>%
summarise(mean = mean(y)) %>%
ungroup
}
MyFun('g1')
MyFun()
A different approach is to create a fake group (named 'across_group') in the data when group is missing.
MyFun = function(group) {
if (missing(group)) MyData$across_group = 1
group <- if (missing(group)) syms('across_group') else syms(group)
MyData %>%
group_by(!!!group) %>%
summarise(mean = mean(y)) %>%
ungroup
}
MyFun()
# A tibble: 1 x 2
across_group mean
<dbl> <dbl>
1 1 2.5
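Another possibility, sketched here under the assumption that the group names are passed as a character vector as in the question, is to give group a NULL default. rlang::syms(NULL) returns an empty list, so splicing it into group_by() adds no grouping variables and no extra column appears in the result.

```r
library(dplyr)

MyData <- data.frame(
  g1 = c("a", "a", "b", "b"),
  g2 = c(1, 2, 1, 2),
  y  = 1:4
)

# group defaults to NULL; syms(NULL) is an empty list, so group_by()
# receives no grouping variables and the data stay ungrouped.
MyFun <- function(group = NULL) {
  group_sym <- rlang::syms(group)
  MyData %>%
    group_by(!!!group_sym) %>%
    summarise(mean = mean(y)) %>%
    ungroup()
}

by_both <- MyFun(c("g1", "g2"))  # one row per g1/g2 combination
overall <- MyFun()               # a single row holding the overall mean
```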

Is there a helper function to make this code cleaner on a tibble?

I need to sum the sequences generated from one column. I have done it this way:
test <- tibble::tibble(
x = c(1,2,3)
)
test %>% dplyr::mutate(., s = plyr::aaply(x, .margins = 1, .fun = function(x_i){sum(seq(x_i))}))
Is there a cleaner way to do this? Are there helper functions or constructs that would let me reduce this:
plyr::aaply(x, .margins = 1, .fun = function(x_i){sum(seq(x_i))})
I am looking for a generic solution; sum and seq are only an example here. Maybe the real problem is that I want to execute a function on each element, not on the whole vector.
This is my real case:
test <- tibble::tibble(
x = c(1,2,3),
y = c(0.5,1,1.5)
)
d <- c(1.23, 0.99, 2.18)
test %>% mutate(., s = (function(x, y) {
dn <- dnorm(x = d, mean = x, sd = y)
s <- sum(dn)
s
})(x,y))
test %>% plyr::ddply(., c("x","y"), .fun = function(row) {
dn <- dnorm(x = d, mean = row$x, sd = row$y)
s <- sum(dn)
s
})
I would like to do that with mutate in a row-wise way, not a vectorized way.
For the specific example, where x is 1, 2, 3, the result is the same as a direct application of cumsum
test %>%
mutate(s = cumsum(x))
For the generic case of applying a function to each element of a column, we can use map from purrr
library(purrr)
test %>%
mutate(s = map_dbl(x, ~ sum(seq(.x)))) # iterate over the values of x, not the row numbers
# A tibble: 3 x 2
# x s
# <dbl> <dbl>
#1 1 1
#2 2 3
#3 3 6
Update
For the updated dataset, use map2, as we are using corresponding arguments in dnorm from the 'x' and 'y' columns of the dataset
test %>%
mutate(V1 = map2_dbl(x, y, ~ dnorm(d, mean = .x, sd = .y) %>%
sum))
# A tibble: 3 x 3
# x y V1
# <dbl> <dbl> <dbl>
#1 1 0.5 1.56
#2 2 1 0.929
#3 3 1.5 0.470
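If the goal is to keep writing ordinary mutate() expressions while having them evaluated one row at a time, dplyr's rowwise() is another option. A minimal sketch using the data from the question:

```r
library(dplyr)

test <- tibble::tibble(
  x = c(1, 2, 3),
  y = c(0.5, 1, 1.5)
)
d <- c(1.23, 0.99, 2.18)

res_rowwise <- test %>%
  rowwise() %>%                                    # treat each row as its own group
  mutate(s = sum(dnorm(d, mean = x, sd = y))) %>%  # x and y are single values here
  ungroup()                                        # drop the row-wise grouping again
```

This should give the same values as the map2_dbl() version. On large tables the purrr version is typically faster, since rowwise() evaluates the expression once per row.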

Compute variable according to factor levels

I am fairly new to R and programming in general. I am currently struggling with a piece of code for data transformation and hope someone can take a little time to help me.
Below is a reproducible example:
# Data
a <- c(rnorm(12, 20))
b <- c(rnorm(12, 25))
f1 <- rep(c("X","Y","Z"), each=4) #family
f2 <- rep(x = c(0,1,50,100), 3) #reference and test levels
dt <- data.frame(f1=factor(f1), f2=factor(f2), a,b)
#library loading
library(tidyverse)
Goal : Compute all values (a,b) using a reference value. Calculation should be : a/a_ref with a_ref = a when f2=0 depending on the family (f1 can be X,Y or Z).
I tried to solve this by using this code :
test <- filter(dt, f2!=0) %>% group_by(f1) %>%
mutate("a/a_ref"=a/(filter(dt, f2==0) %>% group_by(f1) %>% distinct(a) %>% pull))
I get:
[image: test results]
As you can see, a is divided by a_ref. But my script seems to recycle the reference values (a_ref) regardless of the family f1.
Do you have any suggestion so that a is computed with regard to the family (f1)?
Thank you for reading !
EDIT
I found a way to do it 'manually'
filter(dt, f1=="X") %>% mutate("a/a_ref"=a/(filter(dt, f1=="X" & f2==0) %>% distinct(a) %>% pull()))
f1 f2 a b a/a_ref
1 X 0 21.77605 24.53115 1.0000000
2 X 1 20.17327 24.02512 0.9263973
3 X 50 19.81482 25.58103 0.9099366
4 X 100 19.90205 24.66322 0.9139422
The problem is that I'd have to update the code for each variable and family, so it is not a clean way to do it.
# use this to reproduce the same dataset and results
set.seed(5)
# Data
a <- c(rnorm(12, 20))
b <- c(rnorm(12, 25))
f1 <- rep(c("X","Y","Z"), each=4) #family
f2 <- rep(x = c(0,1,50,100), 3) #reference and test levels
dt <- data.frame(f1=factor(f1), f2=factor(f2), a,b)
#library loading
library(tidyverse)
dt %>%
group_by(f1) %>% # for each f1 value
mutate(a_ref = a[f2 == 0], # get the a_ref and add it in each row
"a/a_ref" = a/a_ref) %>% # divide a and a_ref
ungroup() %>% # forget the grouping
filter(f2 != 0) # remove rows where f2 == 0
# # A tibble: 9 x 6
# f1 f2 a b a_ref `a/a_ref`
# <fctr> <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 X 1 21.38436 24.84247 19.15914 1.1161437
# 2 X 50 18.74451 23.92824 19.15914 0.9783583
# 3 X 100 20.07014 24.86101 19.15914 1.0475490
# 4 Y 1 19.39709 22.81603 21.71144 0.8934042
# 5 Y 50 19.52783 25.24082 21.71144 0.8994260
# 6 Y 100 19.36463 24.74064 21.71144 0.8919090
# 7 Z 1 20.13811 25.94187 19.71423 1.0215013
# 8 Z 50 21.22763 26.46796 19.71423 1.0767671
# 9 Z 100 19.19822 25.70676 19.71423 0.9738257
You can do this for more than one variable using:
dt %>%
group_by(f1) %>%
mutate_at(vars(a:b), ~ . / .[f2 == 0]) %>% # formula lambda instead of the deprecated funs()
ungroup()
Or generally use vars(a:z) to use all variables between a and z as long as they are one after the other in your dataset.
Another solution could be using mutate_if like:
dt %>%
group_by(f1) %>%
mutate_if(is.numeric, ~ . / .[f2 == 0]) %>% # formula lambda instead of the deprecated funs()
ungroup()
Where the function will be applied to all numeric variables you have. The variables f1 and f2 will be factor variables, so it just excludes those ones.
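In dplyr 1.0.0 and later, the scoped verbs (mutate_at, mutate_if) are superseded by across(). A sketch of the same per-family normalisation in that style, reusing the data construction from above:

```r
library(dplyr)

set.seed(5)
dt <- data.frame(
  f1 = factor(rep(c("X", "Y", "Z"), each = 4)),  # family
  f2 = factor(rep(c(0, 1, 50, 100), 3)),         # reference and test levels
  a  = rnorm(12, 20),
  b  = rnorm(12, 25)
)

res_ratio <- dt %>%
  group_by(f1) %>%
  # divide every numeric column by its value in the f2 == 0 row of the same family
  mutate(across(where(is.numeric), ~ .x / .x[f2 == 0])) %>%
  ungroup()
```

The reference rows (f2 == 0) become exactly 1 and can be removed afterwards with filter(f2 != 0), as in the first solution.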

Why won't dplyr::summarise work with my custom function?

I would like to use a custom function within dplyr's summarise() function, as follows:
library(dplyr)
# Define custom function for calculating standard error
se <- function(x) sd(x) / sqrt(length(x))
# Create a dummy data table with two groups
d <- tibble(gp = sample(c("A", "B"), 20, replace = T),
x = ifelse(gp == "A", rnorm(20), rnorm(20) + 1))
# Summarise data
d %>%
group_by(gp) %>%
summarise(x = mean(x),
se = se(x))
Why do I get NA values in the output rather than the correct values of standard error?
# A tibble: 2 × 3
gp x se
<chr> <dbl> <lgl>
1 A -0.4060173 NA
2 B 0.2999004 NA
I'm aware of some possible alternatives. For example, using the base package:
tapply(d$x, d$gp, se)
But I don't understand why the first version gives the result that it does.
summarize evaluates each expression in turn, so when your first line does
x = mean(x)
The x column (within each group) is replaced by a single value, mean(x). Your next line calls sd on that constant x, and the sd of a single value is NA.
As @joran says in the comments, if you just choose a different name for your mean column, everything will work.
d %>%
group_by(gp) %>%
summarise(avg = mean(x),
se = se(x))
# # A tibble: 2 × 3
# gp avg se
# <chr> <dbl> <dbl>
# 1 A -0.2879016 0.2264810
# 2 B 0.8804859 0.2625018
Note that this sequential evaluation is a deliberate feature of dplyr. It is exactly the practical difference between dplyr::mutate and base::transform.
dd = data.frame(x = 1:3)
base::transform(dd, x = 0, y = x * 2)
# x y
# 1 0 2
# 2 0 4
# 3 0 6
dplyr::mutate(dd, x = 0, y = x * 2)
# x y
# 1 0 0
# 2 0 0
# 3 0 0
This is called out in the Introduction to dplyr vignette:
dplyr::mutate() works the same way as plyr::mutate() and similarly to base::transform(). The key difference between mutate() and transform() is that mutate allows you to refer to columns that you’ve just created.
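Because the expressions are evaluated in order, it is also possible to keep the name x for the mean, as long as the standard error is computed first. A small sketch (the data here are made up, with a fixed seed and equal group sizes for reproducibility):

```r
library(dplyr)

# Custom function for the standard error, as in the question
se <- function(x) sd(x) / sqrt(length(x))

set.seed(1)
d <- tibble(gp = rep(c("A", "B"), each = 10), x = rnorm(20))

res_order <- d %>%
  group_by(gp) %>%
  summarise(se = se(x),    # still sees the original x column
            x  = mean(x))  # x is overwritten only after se is computed
```

Using a distinct name such as avg, as in the answer above, is usually clearer because it does not depend on argument order.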
