I have been looking for, but have not found, a way to apply a single condition to many columns in dplyr.
I have this code (it works):
library(dplyr)
library(magrittr)
data("PlantGrowth")
PlantGrowth %>% mutate (
a=if_else(group=="ctrl", weight*2, weight*100),
b=if_else(group=="ctrl", weight*1.5, weight/100),
c=if_else(group=="ctrl", weight*4, weight*100),
d=if_else(group=="ctrl", weight*5, weight/1000)
)
I would like to avoid repeating the condition. Something like this:
PlantGrowth %>% mutate_if_foo (
group=="ctrl",{
a=weight*2,
b=weight*1.5,
c=weight*4,
d=weight*5
}
)%>% mutate_if_foo (
group!="ctrl",{
a=weight*100,
b=weight/100,
c=weight*100,
d=weight/1000
}
)
I've found many answers on mutate_if, mutate_all, mutate_at and case_when, but they don't answer my question.
A dplyr / tidyverse solution, please.
Thanks in advance
EDIT
Following @Rohit_das's idea about functions, I tried:
mtcars %>% ( function(df) {
if (df$am==1){
df%>% mutate(
a=df$mpg*3,
b=df$cyl*10)
}else{
df%>% mutate(
a=df$disp*300,
d=df$cyl*1000)
}
})
but I get this warning message:
In if (df$am == 1) { :
the condition has length > 1
and only the first element will be used
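The warning appears because if expects a single TRUE or FALSE, while df$am == 1 is a vector with one value per row (from R 4.2.0 onwards this is an error rather than a warning). A minimal sketch of a vectorised alternative, assuming the goal is the per-row branching attempted above; the helper column is_manual is a hypothetical name introduced only for illustration:

```r
library(dplyr)

# `if` cannot branch row by row, so compute the condition once as a
# logical column and reuse it in every vectorised if_else() call.
res <- mtcars %>%
  mutate(
    is_manual = am == 1,                           # hypothetical helper column
    a = if_else(is_manual, mpg * 3, disp * 300),
    b = if_else(is_manual, cyl * 10, NA_real_),    # only defined when am == 1
    d = if_else(is_manual, NA_real_, cyl * 1000)   # only defined when am != 1
  ) %>%
  select(-is_manual)

head(res[, c("am", "a", "b", "d")])
```

The condition is still written only once; each new column just refers to the stored logical vector.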
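I'm not sure I understand the issue here. If you just want to reduce the verbosity of the code, create a custom function. A function called inside mutate() does not see the data frame's columns unless they are passed in, so group and weight are arguments here:
customif <- function(group, weight, x, y) {
if_else(group == "ctrl", weight * x, weight * y)
}
Then you can call this function in your mutate as
PlantGrowth %>% mutate(
a = customif(group, weight, 2, 100),
b = customif(group, weight, 1.5, 1/100),
c = customif(group, weight, 4, 100),
d = customif(group, weight, 5, 1/1000)
)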
I think I found a neat solution with purrr. It takes a data frame of inputs and then dynamically names new columns a:d with new inputs for each column. First column will use x = 2, y = 100 and z = "a" and then the next row, and so on. The cool thing with functional programming like this is that it is very easy to scale up.
library(tidyverse)
iterate <- tibble(x = c(2, 1.5, 4, 5),
y = c(100, 1/100, 100, 1/1000),
z = c("a", "b", "c", "d"))
fun <- function(x, y, z) {
PlantGrowth %>%
mutate(!!z := if_else(group == "ctrl", weight * x, weight * y)) %>%
select(3)
}
PlantGrowth %>%
bind_cols(
pmap_dfc(iterate, fun)
) %>%
as_tibble
Which gives you the same df:
# A tibble: 30 x 6
weight group a b c d
<dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 4.17 ctrl 8.34 6.26 16.7 20.8
2 5.58 ctrl 11.2 8.37 22.3 27.9
3 5.18 ctrl 10.4 7.77 20.7 25.9
4 6.11 ctrl 12.2 9.17 24.4 30.6
5 4.5 ctrl 9 6.75 18 22.5
I think I've found an answer. I tested it on mtcars, but not yet on my real code.
Please comment if you think the concept is wrong.
The filter conditions have to be mutually exclusive; otherwise I will get duplicate rows.
library(dplyr)
library(magrittr)
library(tibble) # only if necessary to preserve rownames
mtcars %>% ( function(df) {
rbind(
(df
%>% tibble::rownames_to_column(.) %>%tibble::rowid_to_column(.) # to preserve rownames
%>%dplyr::filter(am==1)
%>%dplyr::mutate(
a=mpg*3,
b=cyl*10,d=NA)),
(df
%>% tibble::rownames_to_column(.) %>%tibble::rowid_to_column(.) # to preserve rownames
%>%dplyr::filter(am!=1)
%>%dplyr::mutate(
a=disp*3,
d=cyl*100,b=NA))
)
}) %>%arrange(rowid)
Related
Say I have a data frame:
df <- data.frame(a = 1:10,
b = 1:10,
c = 1:10)
I'd like to apply several summary functions to each column, so I use dplyr::summarise_all
library(dplyr)
df %>% summarise_all(.funs = c(mean, sum))
# a_fn1 b_fn1 c_fn1 a_fn2 b_fn2 c_fn2
# 1 5.5 5.5 5.5 55 55 55
This works great! Now, say I have a function that takes an extra parameter. For example, this function calculates the number of elements in a column above a threshold. (Note: this is a toy example and not the real function.)
n_above_threshold <- function(x, threshold) sum(x > threshold)
So, the function works like this:
n_above_threshold(1:10, 5)
#[1] 5
I can apply it to all columns like before, but this time passing the additional parameter, like so:
df %>% summarise_all(.funs = c(mean, n_above_threshold), threshold = 5)
# a_fn1 b_fn1 c_fn1 a_fn2 b_fn2 c_fn2
# 1 5.5 5.5 5.5 5 5 5
But, say I have a vector of thresholds where each element corresponds to a column. Say, c(1, 5, 7) for my example above. Of course, I can't simply do this, as it doesn't make any sense:
df %>% summarise_all(.funs = c(mean, n_above_threshold), threshold = c(1, 5, 7))
If I was using base R, I might do this:
> mapply(n_above_threshold, df, c(1, 5, 7))
# a b c
# 9 5 3
Is there a way of getting this result as part of a dplyr piped workflow like I was using for the simpler cases?
dplyr provides a bunch of context-dependent functions. One is cur_column(). You can use it in summarise to look up the threshold for a given column.
library("tidyverse")
df <- data.frame(
a = 1:10,
b = 1:10,
c = 1:10
)
n_above_threshold <- function(x, threshold) sum(x > threshold)
# Pair the parameters with the columns
thresholds <- c(1, 5, 7)
names(thresholds) <- colnames(df)
df %>%
summarise(
across(
everything(),
# Use `cur_column()` to access each column name in turn
list(count = ~ n_above_threshold(.x, thresholds[cur_column()]),
mean = mean)
)
)
#> a_count a_mean b_count b_mean c_count c_mean
#> 1 9 5.5 5 5.5 3 5.5
This returns NA silently if the current column name doesn't have a known threshold. This is something that you might or might not want to happen.
df %>%
# Add extra column to show what happens if we don't know the threshold for a column
mutate(
x = 1:10
) %>%
summarise(
across(
everything(),
# Use `cur_column()` to access each column name in turn
list(count = ~ n_above_threshold(.x, thresholds[cur_column()]),
mean = mean)
)
)
#> a_count a_mean b_count b_mean c_count c_mean x_count x_mean
#> 1 9 5.5 5 5.5 3 5.5 NA 5.5
Created on 2022-03-11 by the reprex package (v2.0.1)
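If the silent NA described above is not wanted, one option is to fail loudly instead. A minimal sketch, assuming the same df and thresholds as in the answer; get_threshold() is a hypothetical helper introduced here, not part of dplyr:

```r
library(dplyr)

df <- data.frame(a = 1:10, b = 1:10, c = 1:10)
n_above_threshold <- function(x, threshold) sum(x > threshold)
thresholds <- c(a = 1, b = 5, c = 7)

# Hypothetical helper: look up a threshold by column name,
# stopping with an informative error if none is defined.
get_threshold <- function(col) {
  if (!col %in% names(thresholds)) {
    stop("No threshold defined for column '", col, "'", call. = FALSE)
  }
  thresholds[[col]]
}

counts <- df %>%
  summarise(
    across(everything(), ~ n_above_threshold(.x, get_threshold(cur_column())))
  )
counts
```

With an extra column such as x = 1:10 added before summarise(), this version would stop with an error rather than return NA for x.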
Reproducible example
cats <-
data.frame(
name = c(letters[1:10]),
weight = c(rnorm(5, 10, 1), rnorm(5, 20, 3)),
type = c(rep("not_fat", 5), rep("fat", 5))
)
get_means <- function(df, metric, group) {
df %>%
group_by(.[[group]]) %>%
mutate(mean_stat = mean(.[[metric]])) %>%
pull(mean_stat) %>%
unique()
}
get_means(cats, metric = "weight", group = "type")
What I tried
I expect to get two values back, instead I get one value. It appears that the groupby is failing.
I tried everything including using quo(), eval() and substitute(), UQ(), !!, and a whole host of other things to try and make the stuff inside the group_by() work.
This seems awfully simple but I can't figure it out.
Reasoning for code
The decision for variables to be in quotes is because I am using them in ggplot aes_string() calls. I excluded ggplot code inside the function to simplify the code, otherwise it'd be easy because we could use standard evaluation.
I think the "intended" way to do this in the tidyeval framework is to enter the arguments as names (rather than strings) and then quote the arguments using enquo(). ggplot2 understands tidy evaluation operators so this works for ggplot2 as well.
First, let's adapt the dplyr summary function in your example:
library(tidyverse)
library(rlang)
get_means <- function(df, metric, group) {
metric = enquo(metric)
group = enquo(group)
df %>%
group_by(!!group) %>%
summarise(!!paste0("mean_", as_label(metric)) := mean(!!metric))
}
get_means(cats, weight, type)
type mean_weight
1 fat 20.0
2 not_fat 10.2
get_means(iris, Petal.Width, Species)
Species mean_Petal.Width
1 setosa 0.246
2 versicolor 1.33
3 virginica 2.03
Now add in ggplot:
get_means <- function(df, metric, group) {
metric = enquo(metric)
group = enquo(group)
df %>%
group_by(!!group) %>%
summarise(mean_stat = mean(!!metric)) %>%
ggplot(aes(!!group, mean_stat)) +
geom_point()
}
get_means(cats, weight, type)
I'm not sure what type of plot you have in mind, but you can plot the data and summary values using tidy evaluation. For example:
plot_func = function(data, metric, group) {
metric = enquo(metric)
group = enquo(group)
data %>%
ggplot(aes(!!group, !!metric)) +
geom_point() +
geom_point(data=. %>%
group_by(!!group) %>%
summarise(!!metric := mean(!!metric)),
shape="_", colour="red", size=8) +
expand_limits(y=0) +
scale_y_continuous(expand=expansion(mult=c(0,0.02)))
}
plot_func(cats, weight, type)
FYI, you can allow the function to take any number of grouping variables (including none) using the ... argument and enquos instead of enquo (which also requires the use of !!! (unquote-splice) instead of !! (unquote)).
get_means <- function(df, metric, ...) {
metric = enquo(metric)
groups = enquos(...)
df %>%
group_by(!!!groups) %>%
summarise(!!paste0("mean_", quo_text(metric)) := mean(!!metric))
}
get_means(mtcars, mpg, cyl, vs)
cyl vs mean_mpg
1 4 0 26
2 4 1 26.7
3 6 0 20.6
4 6 1 19.1
5 8 0 15.1
get_means(mtcars, mpg)
mean_mpg
1 20.1
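The enquo() plus !! pattern used above can also be written with the embrace operator {{ }}, which quotes and unquotes in one step. A minimal sketch, assuming a version of dplyr/rlang recent enough to support {{ }} and the .groups argument; the output column name mean_stat is kept fixed for simplicity:

```r
library(dplyr)

# `{{ arg }}` behaves like `!!enquo(arg)` inside data-masking verbs
get_means <- function(df, metric, group) {
  df %>%
    group_by({{ group }}) %>%
    summarise(mean_stat = mean({{ metric }}), .groups = "drop")
}

res <- get_means(iris, Petal.Width, Species)
res
```

The function is still called with bare column names, exactly as in get_means(cats, weight, type).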
If you want to use strings for the names, as in your example, the correct way to do this is to convert the string to a symbol with sym and unquote with !!:
get_means <- function(df, metric, group) {
df %>%
group_by(!!sym(group)) %>%
mutate(mean_stat = mean(!!sym(metric))) %>%
pull(mean_stat) %>%
unique()
}
get_means(cats, metric = "weight", group = "type")
[1] 10.06063 17.45906
If you want to use bare names in your function, then use enquo with !!:
get_means <- function(df, metric, group) {
group <- enquo(group)
metric <- enquo(metric)
df %>%
group_by(!!group) %>%
mutate(mean_stat = mean(!!metric)) %>%
pull(mean_stat) %>%
unique()
}
get_means(cats, metric = weight, group = type)
[1] 10.06063 17.45906
What is happening in your example?
Interestingly .[[group]], does work for grouping, but not the way you think. This subsets the stated column of the dataframe as a vector, then makes that a new variable that it groups on:
cats %>%
group_by(.[['type']])
# A tibble: 10 x 4
# Groups: .[["type"]] [2]
name weight type `.[["type"]]`
<fct> <dbl> <fct> <fct>
1 a 9.60 not_fat not_fat
2 b 8.71 not_fat not_fat
3 c 12.0 not_fat not_fat
4 d 8.48 not_fat not_fat
5 e 11.5 not_fat not_fat
6 f 17.0 fat fat
7 g 20.3 fat fat
8 h 17.3 fat fat
9 i 15.3 fat fat
10 j 17.4 fat fat
Your problem comes with the mutate statement. Instead of working on the weight values within each group, mutate(mean_stat = mean(.[['weight']])) simply extracts the whole weight column as a vector, computes its mean, and then assigns that single value to every row of the new column:
cats %>%
group_by(.[['type']]) %>%
mutate(mean_stat = mean(.[['weight']]))
# A tibble: 10 x 5
# Groups: .[["type"]] [2]
name weight type `.[["type"]]` mean_stat
<fct> <dbl> <fct> <fct> <dbl>
1 a 9.60 not_fat not_fat 13.8
2 b 8.71 not_fat not_fat 13.8
3 c 12.0 not_fat not_fat 13.8
4 d 8.48 not_fat not_fat 13.8
5 e 11.5 not_fat not_fat 13.8
6 f 17.0 fat fat 13.8
7 g 20.3 fat fat 13.8
8 h 17.3 fat fat 13.8
9 i 15.3 fat fat 13.8
10 j 17.4 fat fat 13.8
The magrittr pronoun . represents the whole data, so you've taken the mean of all observations. Instead, use the tidy eval pronoun .data which represents the slice of data frame for the current group:
get_means <- function(df, metric, group) {
df %>%
group_by(.data[[group]]) %>%
mutate(mean_stat = mean(.data[[metric]])) %>%
pull(mean_stat) %>%
unique()
}
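A usage sketch for the .data version, assuming the cats data from the question with a fixed seed so the run is repeatable. It uses summarise() rather than mutate() plus unique() purely to keep the example short, and compares the per-group means with the overall mean that the . pronoun would have produced:

```r
library(dplyr)

set.seed(1)
cats <- data.frame(
  name = letters[1:10],
  weight = c(rnorm(5, 10, 1), rnorm(5, 20, 3)),
  type = rep(c("not_fat", "fat"), each = 5)
)

get_means <- function(df, metric, group) {
  df %>%
    group_by(.data[[group]]) %>%
    # .data refers to the current group's slice, so this is a per-group mean
    summarise(mean_stat = mean(.data[[metric]]), .groups = "drop") %>%
    pull(mean_stat)
}

group_means <- get_means(cats, metric = "weight", group = "type")
overall_mean <- mean(cats$weight)  # what mean(.[[metric]]) would have returned

group_means
overall_mean
```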
I would go with a slight modification (if I understand correctly what you would like to achieve):
get_means <- function(df, metric, group) {
df %>%
group_by(!!sym(group)) %>%
summarise(mean_stat = mean(!!sym(metric)))%>% pull(mean_stat)
}
get_means(cats, "weight", "type")
[1] 20.671772 9.305811
which gives exactly the same output as:
cats %>% group_by(type) %>% summarise(mean_stat=mean(weight)) %>%
pull(mean_stat)
[1] 20.671772 9.305811
using *_at functions :
library(dplyr)
get_means <- function(df, metric, group) {
df %>%
group_by_at(group) %>%
mutate_at(metric,list(mean_stat = mean)) %>%
pull(mean_stat) %>%
unique()
}
get_means(cats, metric = "weight", group = "type")
# [1] 10.12927 20.40541
data
set.seed(1)
cats <-
data.frame(
name = c(letters[1:10]),
weight = c(rnorm(5, 10, 1), rnorm(5, 20, 3)),
type = c(rep("not_fat", 5), rep("fat", 5))
)
Updated answer using across(), .data and {} for renaming, and keeping the original function arguments as strings per OP:
library(tidyverse)
get_means <- function(dat = mtcars, metric = "wt", group = "cyl") {
dat %>%
group_by(across(all_of(c(group)))) %>%
summarise("{paste0('mean_',metric)}" := mean(.data[[metric]]), .groups="keep")
}
get_means()
see: ?dplyr_data_masking for more detailed discussion.
I want to build a custom dplyr function and iterate over it ideally with purrr::map to stay in the tidyverse.
To keep things as easy as possible I replicate my problem using a very simple summarize function.
When building custom functions with dplyr I ran into the problem of non-standard evaluation (NSE). I found three different ways to deal with it. Each way of dealing with NSE works fine when the function is called directly, but not when looping over it. Below you'll find the code to replicate my problem. What would be the correct way to make my function work with purrr::map?
# loading libraries
library(dplyr)
library(tidyr)
library(purrr)
# generate test data
test_tbl <- rbind(tibble(group = rep(sample(letters[1:4], 150, TRUE), each = 4),
score = sample(0:10, size = 600, replace = TRUE)),
tibble(group = rep(sample(letters[5:7], 50, TRUE), each = 3),
score = sample(0:10, size = 150, replace = TRUE))
)
# generate two variables to loop over
test_tbl$group2 <- test_tbl$group
vars <- c("group", "group2")
# summarise function 1 using enquo()
sum_tbl1 <- function(df, x) {
x <- dplyr::enquo(x)
df %>%
dplyr::group_by(!! x) %>%
dplyr::summarise(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
# summarise function 2 using .dots = lazyeval
sum_tbl2 <- function(df, x) {
df %>%
dplyr::group_by_(.dots = lazyeval::lazy(x)) %>%
dplyr::summarize(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
# summarise function 3 using ensym()
sum_tbl3 <- function(df, x) {
df %>%
dplyr::group_by(!!rlang::ensym(x)) %>%
dplyr::summarize(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
# Looping over the functions with map
# each variation produces an error no matter which function I choose
# call within anonymous function without pipe
map(vars, function(x) sum_tbl1(test_tbl, x))
map(vars, function(x) sum_tbl2(test_tbl, x))
map(vars, function(x) sum_tbl3(test_tbl, x))
# call within anonymous function within pipe
map(vars, function(x) test_tbl %>% sum_tbl1(x))
map(vars, function(x) test_tbl %>% sum_tbl2(x))
map(vars, function(x) test_tbl %>% sum_tbl3(x))
# call with formula notation without pipe
map(vars, ~sum_tbl1(test_tbl, .x))
map(vars, ~sum_tbl2(test_tbl, .x))
map(vars, ~sum_tbl3(test_tbl, .x))
# call with formula notation within pipe
map(vars, ~test_tbl %>% sum_tbl1(.x))
map(vars, ~test_tbl %>% sum_tbl2(.x))
map(vars, ~test_tbl %>% sum_tbl3(.x))
I know that there are other solutions for producing summarize tables in a loop, like calling map directly and creating an anonymous function inside map (see code below). However, the problem I am interested in is how to deal with NSE in loops in general.
# One possibility to create summarize tables in loops with map
vars %>%
map(function(x){
test_tbl %>%
dplyr::group_by(!!rlang::ensym(x)) %>%
dplyr::summarize(score = mean(score, na.rm =TRUE),
n = dplyr::n())
})
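One way to see why the map() calls above fail is to look at what enquo() captures. It records the expression typed at the call site, which inside an anonymous function is the argument name x, not the string stored in it. A minimal sketch; captured() is a hypothetical helper used only to display the captured expression:

```r
library(rlang)

# Hypothetical helper: report the expression that enquo() captured
captured <- function(x) as_label(enquo(x))

# Called directly with a bare name, the name itself is captured
direct <- captured(group)

# Called through a wrapper, the wrapper's argument name is captured,
# regardless of which string it currently holds
vars <- c("group", "group2")
wrapped <- vapply(vars, function(x) captured(x), character(1))

direct
wrapped
```

This is why the function ends up looking for a column called x (or .x) rather than group or group2.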
Update:
Below akrun provides a solution that makes the call via purrr::map() possible. A direct call to the function is then however only possible by calling the grouping variable as a string either directly
sum_tbl(test_tbl, "group")
or indirectly as
sum_tbl(test_tbl, vars[1])
In this solution it is not possible to call the grouping variable in a normal dplyr way as
sum_tbl(test_tbl, group)
Ultimately, it seems to me that solutions to NSE in custom dplyr functions address the problem either at the level of the function call itself, in which case using map/lapply is not possible, or in a way that works with iteration, in which case variables can only be passed as "strings".
Building on akrun's answer, I built a workaround function that allows both strings and normal variable names in the function call. However, there are surely better ways to make this possible. Ideally, there would be a more straightforward way of dealing with NSE in custom dplyr functions, so that a workaround like the one below is not necessary in the first place.
sum_tbl <- function(df, x) {
x_var <- dplyr::enquo(x)
x_env <- rlang::get_env(x_var)
if(identical(x_env, rlang::empty_env())) {
# works, when x is a string and in loops via map/lapply
sum_tbl <- df %>%
dplyr::group_by(!! rlang::sym(x)) %>%
dplyr::summarise(score = mean(score, na.rm = TRUE),
n = dplyr::n())
} else {
# works, when x is a normal variable name without quotation marks
x = dplyr::enquo(x)
sum_tbl <- df %>%
dplyr::group_by(!! x) %>%
dplyr::summarise(score = mean(score, na.rm = TRUE),
n = dplyr::n())
}
return(sum_tbl)
}
Final update/solution
In an updated version of his answer akrun provides a solution which accounts for four ways of calling variable x:
1. as a normal (non-string) variable name: sum_tbl(test_tbl, group)
2. as a string name: sum_tbl(test_tbl, "group")
3. as an indexed vector: sum_tbl(test_tbl, !!vars[1])
4. as a vector within purrr::map(): map(vars, ~ sum_tbl(test_tbl, !!.x))
In (3) and (4) it is necessary to unquote the variable x using !!.
If I were the only one using the function, this wouldn't be a problem, but as soon as other team members use it I would need to explain and document this behaviour.
To avoid this, I have now extended akrun's solution to account for all four ways without unquoting. However, I am not sure whether this solution creates other pitfalls.
sum_tbl <- function(df, x) {
# if x is a symbol such as group without quotes, then turn it into a string
if(is.symbol(get_expr(enquo(x)))) {
x <- quo_name(enquo(x))
# if x is a language object such as vars[1], evaluate it
# (this turns it into a symbol), then turn it into a string
} else if (is.language(get_expr(enquo(x)))) {
x <- eval(x)
x <- quo_name(enquo(x))
}
# this part of the function works with normal strings as x
sum_tbl <- df %>%
dplyr::group_by(!! rlang::sym(x)) %>%
dplyr::summarise(score = mean(score, na.rm = TRUE),
n = dplyr::n())
return(sum_tbl)
}
We can just use group_by_at that can take a string as argument
sum_tbl1 <- function(df, x) {
df %>%
dplyr::group_by_at(x) %>%
dplyr::summarise(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
and then call as
out1 <- map(vars, ~ sum_tbl1(test_tbl, .x))
Or another option is to convert to symbol and then evaluate (!!) within group_by
sum_tbl2 <- function(df, x) {
df %>%
dplyr::group_by(!! rlang::sym(x)) %>%
dplyr::summarise(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
out2 <- map(vars, ~ sum_tbl2(test_tbl, .x))
identical(out1 , out2)
#[1] TRUE
If we specify one of the parameters, we don't have to provide the second argument, thus can also run without anonymous call
map(vars, sum_tbl2, df = test_tbl)
Update
If we wanted to use it with conditions mentioned in the updated OP's post
sum_tbl3 <- function(df, x) {
x1 <- enquo(x)
x2 <- quo_name(x1)
df %>%
dplyr::group_by_at(x2) %>%
dplyr::summarise(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
sum_tbl3(test_tbl, group)
# A tibble: 7 x 3
# group score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
sum_tbl3(test_tbl, "group")
# A tibble: 7 x 3
# group score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
or call from 'vars'
sum_tbl3(test_tbl, !!vars[1])
# A tibble: 7 x 3
# group score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
and with map
map(vars, ~ sum_tbl3(test_tbl, !!.x))
#[[1]]
# A tibble: 7 x 3
# group score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
#[[2]]
# A tibble: 7 x 3
# group2 score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
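For the string-based route, group_by_at() can also be written with across(all_of()), which dplyr now recommends in place of the *_at verbs. A minimal sketch, assuming dplyr 1.0 or later and a smaller stand-in for test_tbl generated with a fixed seed:

```r
library(dplyr)
library(purrr)

set.seed(42)
# Small stand-in for the question's test data (an assumption, not the original)
test_tbl <- tibble(
  group = sample(letters[1:3], 60, replace = TRUE),
  score = sample(0:10, 60, replace = TRUE)
)
test_tbl$group2 <- test_tbl$group
vars <- c("group", "group2")

# x is a character vector of column names, so it passes through map() unchanged
sum_tbl <- function(df, x) {
  df %>%
    group_by(across(all_of(x))) %>%
    summarise(score = mean(score, na.rm = TRUE), n = n(), .groups = "drop")
}

out <- map(vars, ~ sum_tbl(test_tbl, .x))
out
```

Because all_of() expects strings, this version supports sum_tbl(test_tbl, "group") and iteration, but not the bare-name call sum_tbl(test_tbl, group).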
I have the following test data:
library(tidyverse)
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c("a", "a", "a", "b", "b"),
a = sample(5),
b = sample(5)
)
I would like to write a function that summarises grouped columns with a mean and I wish I could have the resulting columns prefixed with "mean_"
my_summarise1 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(.vars = group_var) %>%
summarise_at(.vars = summarise_var, .funs= mean) %>%
rename_at(.vars= summarise_var, .funs=paste('mean_', .))
}
Without rename_at line it works fine, but with it throws error:
my_summarise1(df, vars(g1,g2),vars(a,b))
R responds with
Error: `.funs` must contain one renaming function, not 4
How should I effectively prefix the new column names?
Smaller question: is it possible to avoid vars() or quotes around the column names when calling the function?
Knowing these two small things would greatly enhance my code, thank you all very much in advance for help.
While the earlier answer by @docendodiscimus is more succinct, for what it's worth, there are two issues with your code:
You need to wrap the paste (better: paste0) function within funs.
You need to ungroup prior to renaming (see e.g. this post).
A working version of your code looks like this:
my_summarise1 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(group_var) %>%
summarise_at(summarise_var, mean) %>%
ungroup() %>%
rename_at(summarise_var, funs(paste0('mean_', .)))
}
my_summarise1(df, vars(g1, g2), vars(a, b))
## A tibble: 3 x 4
# g1 g2 mean_a mean_b
# <dbl> <chr> <dbl> <dbl>
#1 1. a 2.50 2.50
#2 2. a 4.00 5.00
#3 2. b 3.00 2.50
If you want to take a simple route, you can use dplyr's way of adding suffixes to the summarised columns:
my_summarise1 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(.vars = group_var) %>%
summarise_at(.vars = summarise_var, funs(mean=mean))
}
my_summarise1(df, vars(g1,g2), vars(a,b))
# A tibble: 3 x 4
# Groups: g1 [?]
g1 g2 a_mean b_mean
<dbl> <chr> <dbl> <dbl>
1 1. a 3.50 4.50
2 2. a 4.00 1.00
3 2. b 2.00 2.50
In this case, funs(mean=mean) tells dplyr to use the suffix mean and apply the function mean. For clarity, you could use funs(mysuffix = mean) to use any different suffix and apply the mean function.
Re OP's question in comment: you can use the following modification which doesn't require the use of vars when calling the function.
my_summarise2 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(.vars = group_var) %>%
summarise_at(.vars = summarise_var, funs(mean=mean))
}
my_summarise2(df, c("g1","g2"), c("a","b"))
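In current dplyr, funs() and the *_at verbs are deprecated or superseded, and the prefix can be set directly through the .names argument of across(). A minimal sketch, assuming dplyr 1.0 or later, with fixed values in place of sample(5) so the result is repeatable:

```r
library(dplyr)

df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c("a", "a", "a", "b", "b"),
  a = c(3, 1, 4, 5, 2),   # fixed values instead of sample(5)
  b = c(2, 5, 1, 3, 4)
)

my_summarise <- function(df, group_var, summarise_var) {
  df %>%
    group_by(across(all_of(group_var))) %>%
    summarise(
      # "{.col}" is replaced by each summarised column's name
      across(all_of(summarise_var), mean, .names = "mean_{.col}"),
      .groups = "drop"
    )
}

res <- my_summarise(df, c("g1", "g2"), c("a", "b"))
res
```

Column names are passed as character vectors here, so no vars() call is needed.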
Recently I stumbled upon a strange behaviour of dplyr, and I would be happy if somebody could provide some insight.
Assume I have a data frame in which some columns contain numerical values. In a simple scenario I would like to compute rowSums. Although there are many ways to do it, here are two examples:
df <- data.frame(matrix(rnorm(20), 10, 2),
ids = paste("i", 1:20, sep = ""),
stringsAsFactors = FALSE)
# works
dplyr::select(df, - ids) %>% {rowSums(.)}
# does not work
# Error: invalid argument to unary operator
df %>%
dplyr::mutate(blubb = dplyr::select(df, - ids) %>% {rowSums(.)})
# does not work
# Error: invalid argument to unary operator
df %>%
dplyr::mutate(blubb = dplyr::select(., - ids) %>% {rowSums(.)})
# workaround:
tmp <- dplyr::select(df, - ids) %>% {rowSums(.)}
df %>%
dplyr::mutate(blubb = tmp)
# works
rowSums(dplyr::select(df, - ids))
# does not work
# Error: invalid argument to unary operator
df %>%
dplyr::mutate(blubb = rowSums(dplyr::select(df, - ids)))
# workaround
tmp <- rowSums(dplyr::select(df, - ids))
df %>%
dplyr::mutate(blubb = tmp)
First, I don't really understand what is causing the error and second I would like to know how to actually achieve a tidy computation of some (viable) columns in a tidy way.
edit
The question "mutate and rowSums exclude columns", although related, focuses on using rowSums for computation. Here I'm eager to understand why the examples above do not work. It is not so much about how to solve the problem (see the workarounds) as about understanding what happens when the naive approach is applied.
The examples do not work because you are nesting select in mutate and using bare variable names. In this case, select is trying to do something like
> -df$ids
Error in -df$ids : invalid argument to unary operator
which fails because you can't negate a character string (i.e. -"i1" or -"i2" makes no sense). Either of the formulations below works:
df %>% mutate(blubb = rowSums(select_(., "X1", "X2")))
df %>% mutate(blubb = rowSums(select(., -3)))
or
df %>% mutate(blubb = rowSums(select_(., "-ids")))
as suggested by @Haboryme.
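The underlying failure can be reproduced in base R without dplyr. A minimal sketch, assuming only that ids is a character vector as in the question; the error is caught so that the script keeps running:

```r
ids <- paste0("i", 1:3)

# Unary minus is defined for numbers ...
negated_numbers <- -(1:3)

# ... but not for character vectors, which is the error seen inside mutate()
msg <- tryCatch(-ids, error = function(e) conditionMessage(e))

negated_numbers
msg
```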
select_ is deprecated. You can use:
library(dplyr)
df <- data.frame(matrix(rnorm(20), 10, 2),
ids = paste("i", 1:20, sep = ""),
stringsAsFactors = FALSE)
df %>%
mutate(blubb = rowSums(select(., .dots = c("X1", "X2"))))
# Or more generally:
desired_columns <- c("X1", "X2")
df %>%
mutate(blubb = rowSums(select(., .dots = all_of(desired_columns))))
select can now accept bare column names so no need to use .dots or select_ which has been deprecated.
Here are few of the approaches that can work now.
library(dplyr)
#sum all the columns except `ids`.
df %>% mutate(blubb = rowSums(select(., -ids), na.rm = TRUE))
#sum X1 and X2 columns
df %>% mutate(blubb = rowSums(select(., X1, X2), na.rm = TRUE))
#sum all the columns that start with 'X'
df %>% mutate(blubb = rowSums(select(., starts_with('X')), na.rm = TRUE))
#sum all the numeric columns
df %>% mutate(blubb = rowSums(select(., where(is.numeric))))
Adding to this old thread because I searched on this question then realized I was asking the wrong question. Also, I detect some yearning in this and related questions for the proper pipe steps way to do this.
The answers here are somewhat non-intuitive because they are trying to use the dplyr vernacular with non-"tidy" data. If you want to do it the dplyr way, make the data tidy first, using gather(), and then use summarise():
library(tidyverse)
df <- data.frame(matrix(rnorm(20), 10, 2),
ids = paste("i", 1:20, sep = ""),
stringsAsFactors = FALSE)
df %>% gather(key=Xn,value="value",-ids) %>%
group_by(ids) %>%
summarise(rowsum=sum(value))
#> # A tibble: 20 x 2
#> ids rowsum
#> <chr> <dbl>
#> 1 i1 0.942
#> 2 i10 -0.330
#> 3 i11 0.942
#> 4 i12 -0.721
#> 5 i13 2.50
#> 6 i14 -0.611
#> 7 i15 -0.799
#> 8 i16 1.84
#> 9 i17 -0.629
#> 10 i18 -1.39
#> 11 i19 1.44
#> 12 i2 -0.721
#> 13 i20 -0.330
#> 14 i3 2.50
#> 15 i4 -0.611
#> 16 i5 -0.799
#> 17 i6 1.84
#> 18 i7 -0.629
#> 19 i8 -1.39
#> 20 i9 1.44
If you care about the order of the ids when they are not sortable using arrange(), make that column a factor first.
df %>%
mutate(ids=as_factor(ids)) %>%
gather(key=Xn,value="value",-ids) %>%
group_by(ids) %>%
summarise(rowsum=sum(value))
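gather() still works, but tidyr now recommends pivot_longer() for the same reshaping step. A minimal sketch, assuming tidyr 1.0 or later and a smaller data frame with one id per row and a fixed seed:

```r
library(dplyr)
library(tidyr)

set.seed(1)
df <- data.frame(
  X1 = rnorm(5),
  X2 = rnorm(5),
  ids = paste0("i", 1:5),
  stringsAsFactors = FALSE
)

row_totals <- df %>%
  # one row per (ids, column) pair, with the column name in Xn
  pivot_longer(cols = -ids, names_to = "Xn", values_to = "value") %>%
  group_by(ids) %>%
  summarise(rowsum = sum(value), .groups = "drop")

row_totals
```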
Why do you want to use the pipe operator? Just write an expression such as:
rowSums(df[,sapply(df, is.numeric)])
i.e. calculate the rowsums on all the numeric columns, with the advantage of not needing to specify ids.
If you want to save your results as a column within data, you can use data.table syntax like this:
library(data.table)
dt <- as.data.table(df)
dt[, x3 := rowSums(.SD, na.rm = TRUE), .SDcols = which(sapply(dt, is.numeric))]