Summarize many variables with a function of two variables - r

The summarise_if function is very helpful to summary several variables. Assume that I need the mean of every numeric variable in my dataset. I can use
df <- as_tibble(iris)
df %>% summarise_if(is.numeric, .fun = mean)
This works perfectly. But assume now that the function in .fun involves 2 arguments from the dataset (an example is the weighet.mean, where the weight variable is Sepal.Length). I tried,
df %>% summarise_if(is.numeric, .fun = function(x, w) weighted.mean(x, w), w = Sepal.Length)
The error was
Error in list2(...) : object 'Sepal.Width' not found
I suspect that R did not search Sepal.Length in df but in it global environment. So I have to use,
df %>% summarise_if(is.numeric, .fun = function(x, w) weighted.mean(x, w), w = df$Sepal.Length)
This works but it is not a good to do df$Sepal.Length. For example, it becomes completely impossible for me to compute the weighted mean by group.
df %>% group_by(Species) %>% summarise_if(is.numeric, .fun = function(x, w) weighted.mean(x, w), w = df$Sepal.Length)
Error: Problem with summarise() column Sepal.Length.
ℹ Sepal.Length = (function (x, w) ....
x 'x' and 'w' must have the same length
ℹ The error occurred in group 1: Species = setosa.
So, how to use summarise_if or summarise_at with functions involving two variables from the dataset.

If we need to use Sepal.Length as w, concatenate (c) the output from where(is.numeric) and specify -Sepal.Length to remove the column from across, then use weighted.mean on the other numeric columns, with w as 'Sepal.Length'
library(dplyr)
df %>%
summarise(across(c(where(is.numeric), -Sepal.Length),
~ weighted.mean(., w = Sepal.Length)))
# A tibble: 1 × 3
Sepal.Width Petal.Length Petal.Width
<dbl> <dbl> <dbl>
1 3.05 3.97 1.29
Or a grouped one would be
df %>%
group_by(Species) %>%
summarise(across(c(where(is.numeric), -Sepal.Length),
~ weighted.mean(., w = Sepal.Length)))
-output
# A tibble: 3 × 4
Species Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl>
1 setosa 3.45 1.47 0.248
2 versicolor 2.78 4.29 1.34
3 virginica 2.99 5.60 2.03
NOTE: _if, _at, _all suffix functions are deprecated in favor for across

Related

R 3.5.2: Pipe inside custom function - object 'column' not found [duplicate]

This question already has answers here:
Pass arguments to dplyr functions
(7 answers)
Closed 2 years ago.
I am having issues with pipes inside a custom function. Based on the previous posts, I understand that a pipe inside a function creates another level(?) which results in the error I'm getting (see below).
I'm hoping to write a summary function for a large data set with hundreds of numeric and categorical variables. I would like to have the option to use this on different data frames (with similar structure), always group by a certain factor variable and get summaries for multiple columns.
library(tidyverse)
data(iris)
iris %>% group_by(Species) %>% summarise(count = n(), mean = mean(Sepal.Length, na.rm = T))
# A tibble: 3 x 3
Species count mean
<fct> <int> <dbl>
1 setosa 50 5.01
2 versicolor 50 5.94
3 virginica 50 6.59
I'm hoping to create a function like this:
sum_cols <- function (df, col) {
df %>%
group_by(Species) %>%
summarise(count = n(),
mean = mean(col, na.rm = T))
}
And this is the error I'm getting:
sum_cols(iris, Sepal.Length)
Error in mean(col, na.rm = T) : object 'Petal.Width' not found
Called from: mean(col, na.rm = T)
I have had this problem for a while and even though I tried to get answers in a few previous posts, I haven't quite grasped why the problem occurs and how to get around it.
Any help would be greatly appreciated, thanks!
Try searching for non-standard evaluation (NSE).
You can use here {{}} to let R know that col is the column name in df.
library(dplyr)
library(rlang)
sum_cols <- function (df, col) {
df %>%
group_by(Species) %>%
summarise(count = n(), mean = mean({{col}}, na.rm = T))
}
sum_cols(iris, Sepal.Length)
# A tibble: 3 x 3
# Species count mean
# <fct> <int> <dbl>
#1 setosa 50 5.01
#2 versicolor 50 5.94
#3 virginica 50 6.59
If we do not have the latest rlang we can use the old method of enquo and !!
sum_cols <- function (df, col) {
df %>%
group_by(Species) %>%
summarise(count = n(), mean = mean(!!enquo(col), na.rm = T))
}
sum_cols(iris, Sepal.Length)

dplyr: Is it possible to return two columns in summarize using one function?

Say I have a function that returns two scalars, and I want to use it with summarize, e.g.
fn = function(x) {
list(mean(x), sd(x))
}
iris %>%
summarize(fn(Petal.Length)) # Error: Column `fn(Petal.Length)` must be length 1 (a summary value), not 2
iris %>%
summarize(c("a","b") := fn(Petal.Length))
# Error: The LHS of `:=` must be a string or a symbol Run `rlang::last_error()` to see where the error occurred.
I tried both ways, but can't figure it out.
However, this can be done with data.table
library(data.table)
iris1 = copy(iris)
setDT(iris1)[, fn(Petal.Length)]
Is there a way to do this in dplyr?
Yes, you can save them as a list in a column and then use unnest_wider to separate them in different columns.
fn = function(x) {
list(mean = mean(x),sd = sd(x))
}
library(dplyr)
library(tidyr)
iris %>%
summarise(temp = list(fn(Petal.Length))) %>%
unnest_wider(temp)
# A tibble: 1 x 2
# mean sd
# <dbl> <dbl>
#1 3.76 1.77
Or unnest_longer to have them in separate rows
iris %>%
summarise(temp = list(fn(Petal.Length))) %>%
unnest_longer(temp)
# temp temp_id
# <dbl> <chr>
#1 3.76 mean
#2 1.77 sd

summarise_at dplyr multiple columns

I am trying to apply a complex function on multiple columns after applying a group on it.
Code example is:
library(dplyr)
data(iris)
add = function(x,y) {
z = x+y
return(mean(z))
}
iris %>%
group_by(Species) %>%
summarise_at(.vars=c('Sepal.Length', 'Sepal.Width'),
.funs = add('Sepal.Length', 'Sepal.Width' ) )
I was expecting that the function would be applied to each group and returned as a new column but I get:
Error in x + y : non-numeric argument to binary operator
How can I get this work?
Note my real problem has a much more complicated function than the simple add function I've written here that requires the two columns be fed in as separate entities I can't just sum them first.
Thanks
Don't think you need summarise_at, since your definition of add takes care fo the multiple input arguments. summarise_at is useful when you are applying the same change to multiple columns, not for combining them.
If you just want sum of the columns, you can try:
iris %>%
group_by(Species) %>%
summarise_at(
.vars= vars( Sepal.Length, Sepal.Width),
.funs = sum)
which gives:
Species Sepal.Length Sepal.Width
<fctr> <dbl> <dbl>
1 setosa 250 171
2 versicolor 297 138
3 virginica 329 149
in case you want to add the columns together, you can just do:
iris %>%
group_by(Species) %>%
summarise( k = sum(Sepal.Length, Sepal.Width))
which gives:
Species k
<fctr> <dbl>
1 setosa 422
2 versicolor 435
3 virginica 478
using this form with your definition of add
add = function(x,y) {
z = x+y
return(mean(z))
}
iris %>%
group_by(Species) %>%
summarise( k = add(Sepal.Length, Sepal.Width))
returns
Species k
<fctr> <dbl>
1 setosa 8
2 versicolor 9
3 virginica 10
summarize() already allows you to summarize multiple columns.
example:
summarize(mean_xvalues = mean(x) , sd_yvalues = sd(y), median_zvalues = median(z))
where x,y,z are columns of a dataframe.

Renaming a column name, by using the data frame title/name

I have a data frame called "Something". I am doing an aggregation on one of the numeric columns using summarise, and I want the name of that column to contain "Something" - data frame title in the column name.
Example:
temp <- Something %>%
group_by(Month) %>%
summarise(avg_score=mean(score))
But i would like to name the aggregate column as "avg_Something_score". Did that make sense?
We can use the devel version of dplyr (soon to be released 0.6.0) that does this with quosures
library(dplyr)
myFun <- function(data, group, value){
dataN <- quo_name(enquo(data))
group <- enquo(group)
value <- enquo(value)
newName <- paste0("avg_", dataN, "_", quo_name(value))
data %>%
group_by(!!group) %>%
summarise(!!newName := mean(!!value))
}
myFun(mtcars, cyl, mpg)
# A tibble: 3 × 2
# cyl avg_mtcars_mpg
# <dbl> <dbl>
#1 4 26.66364
#2 6 19.74286
#3 8 15.10000
myFun(iris, Species, Petal.Width)
# A tibble: 3 × 2
# Species avg_iris_Petal.Width
# <fctr> <dbl>
#1 setosa 0.246
#2 versicolor 1.326
#3 virginica 2.026
Here, the enquo takes the input arguments like substitute from base R and converts to quosure, with quo_name, we can convert it to string, evaluate the quosure by unquoting (!! or UQ) inside group_by/summarise/mutate etc. The column names on the lhs of assignment (:=) can also evaluated by unquoting to get the columns of interest
You can use rename_ from dplyr with deparse(substitute(Something)) like this:
Something %>%
group_by(Month) %>%
summarise(avg_score=mean(score))%>%
rename_(.dots = setNames("avg_score",
paste0("avg_",deparse(substitute(Something)),"_score") ))
It seems like it makes more sense to generate the new column name dynamically so that you don't have to hard-code the name of the data frame inside setNames. Maybe something like the function below, which takes a data frame, a grouping variable, and a numeric variable:
library(dplyr)
library(lazyeval)
my_fnc = function(data, group, value) {
df.name = deparse(substitute(data))
data %>%
group_by_(group) %>%
summarise_(avg = interp(~mean(v), v=as.name(value))) %>%
rename_(.dots = setNames("avg", paste0("avg_", df.name, "_", value)))
}
Now let's run the function on two different data frames:
my_fnc(mtcars, "cyl", "mpg")
cyl avg_mtcars_mpg
<dbl> <dbl>
1 4 26.66364
2 6 19.74286
3 8 15.10000
my_fnc(iris, "Species", "Petal.Width")
Species avg_iris_Petal.Width
1 setosa 0.246
2 versicolor 1.326
3 virginica 2.026
library(dplyr)
# Take mtcars as an example
# Calculate the mean of mpg using cyl as group
data(mtcars)
Something <- mtcars
# Create a list of expression
dots <- list(~mean(mpg))
# Apply the function, Use setNames to name the column
temp <- Something %>%
group_by(cyl) %>%
summarise_(.dots = setNames(dots,
paste0("avg_", as.character(quote(Something)), "_score")))
You could use colnames(Something)<-c("score","something_avg_score")

R dplyr summarise multiple functions to selected variables

I have a dataset for which I want to summarise by mean, but also calculate the max to just 1 of the variables.
Let me start with an example of what I would like to achieve:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise_at("Sepal.Length:Petal.Width",funs(mean))
which give me the following result
# A tibble: 3 × 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fctr> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.8 4.4 1.9 0.5
2 versicolor 7.0 3.4 5.1 1.8
3 virginica 7.9 3.8 6.9 2.5
Is there an easy way to add, for example, max(Petal.Width)to summarise?
So far I have tried the following:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise_at("Sepal.Length:Petal.Width",funs(mean)) %>%
mutate(Max.Petal.Width = max(iris$Petal.Width))
But with this approach I lose both the group_by and the filter from the code above and gives the wrong results.
The only solution I have been able to achieve is the following:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise_at("Sepal.Length:Petal.Width",funs(mean,max)) %>%
select(Species:Petal.Width_mean,Petal.Width_max) %>%
rename(Max.Petal.Width = Petal.Width_max) %>%
rename_(.dots = setNames(names(.), gsub("_.*$","",names(.))))
Which is a bit convoluted and involves a lot of typing to just add a column with a different summarisation.
Thank you
Although this is an old question, it remains an interesting problem for which I have two solutions that I believe should be available to whoever finds this page.
Solution one
My own take:
mapply(summarise_at,
.vars = lst(names(iris)[!names(iris)%in%"Species"], "Petal.Width"),
.funs = lst(mean, max),
MoreArgs = list(.tbl = iris %>% group_by(Species) %>% filter(Sepal.Length > 5)))
%>% reduce(merge, by = "Species")
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
# 1 setosa 5.314 3.714 1.509 0.2773 0.5
# 2 versicolor 5.998 2.804 4.317 1.3468 1.8
# 3 virginica 6.622 2.984 5.573 2.0327 2.5
Solution two
An elegant solution using package purrr from the tidyverse itself, inspired by this discussion:
list(.vars = lst(names(iris)[!names(iris)%in%"Species"], "Petal.Width"),
.funs = lst("mean" = mean, "max" = max)) %>%
pmap(~ iris %>% group_by(Species) %>% filter(Sepal.Length > 5) %>% summarise_at(.x, .y))
%>% reduce(inner_join, by = "Species")
+ + + # A tibble: 3 x 6
Species Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.31 3.71 1.51 0.277 0.5
2 versicolor 6.00 2.80 4.32 1.35 1.8
3 virginica 6.62 2.98 5.57 2.03 2.5
Short discussion
The data.frame and tibble are the desired result, the last column being the max of petal.width and the other ones the means (by group and filter) of all other columns.
Both solutions hinge on three realizations:
summarise_at accepts as arguments two lists, one of n variables and one of m functions, and applies all m functions to all n variables, therefore producing m X n vectors in a tibble. The solution might thus imply forcing this function to loop in some way across "couples" formed by all variables to which we want one specific function to be applied and the one function, then another group of variables and their own function, and so on!
Now, what does the above in R? What does force an operation to corresponding elements of two lists? Functions such as mapply or the family of functions map2, pmap and variations thereof from dplyr's tidyverse fellow purrr. Both accept two lists of l elements and perform a given operation on corresponding elements (matched by position) of the two lists.
Because the product is not a tibble or a data.frame, but a list, you
simply need to use reduce with inner_join or just merge.
Note that the means I obtain are different from those of the OP, but they are the means I obtain with his reproducible example as well (maybe we have two different versions of the iris dataset?).
If you wanted to do something more complex like that, you could write your own version of summarize_at. With this version you supply triplets of column names, functions, and naming rules. For example
Here's a rough start
my_summarise_at<-function (.tbl, ...)
{
dots <- list(...)
stopifnot(length(dots)%%3==0)
vars <- do.call("append", Map(function(.cols, .funs, .name) {
cols <- select_colwise_names(.tbl, .cols)
funs <- as.fun_list(.funs, .env = parent.frame())
val<-colwise_(.tbl, funs, cols)
names <- sapply(names(val), function(x) gsub("%", x, .name))
setNames(val, names)
}, dots[seq_along(dots)%%3==1], dots[seq_along(dots)%%3==2], dots[seq_along(dots)%%3==0]))
summarise_(.tbl, .dots = vars)
}
environment(my_summarise_at)<-getNamespace("dplyr")
And you can call it with
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
my_summarise_at("Sepal.Length:Petal.Width", mean, "%_mean",
"Petal.Width", max, "%_max")
For the names we just replace the "%" with the default name. The idea is just to dynamically build the summarize_ expression. The summarize_at function is really just a convenience wrapper around that basic function.
If you are trying to do everything with dplyr (which might be easier to remember), then you can leverage the new across function which will be available from dplyr 1.0.0.
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarize(across(Sepal.Length:Petal.Width, mean)) %>%
cbind(iris %>%
group_by(Species) %>%
summarize(across(Petal.Width, max)) %>%
select(-Species)
)
It shows that the only difficulty is to combine two calculations on the same column Petal.Width on a grouped variable - you have to do the grouping again but can nest it into the cbind.
This returns correctly the result:
Species Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Width
1 setosa 5.313636 3.713636 1.509091 0.2772727 0.6
2 versicolor 5.997872 2.804255 4.317021 1.3468085 1.8
3 virginica 6.622449 2.983673 5.573469 2.0326531 2.5
If the task would not specify two calculations but only one on the same column Petal.Width, then this could be elegantly written as:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarize(
across(Sepal.Length:Petal.Length, mean),
across(Petal.Width, max)
)
I was looking for something similar and tried the following. It works well and much easier to read than the suggested solutions.
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise(MeanSepalLength=mean(Sepal.Length),
MeanSepalWidth = mean(Sepal.Width),
MeanPetalLength=mean(Petal.Length),
MeanPetalWidth=mean(Petal.Width),
MaxPetalWidth=max(Petal.Width))
# A tibble: 3 x 6
Species MeanSepalLength MeanSepalWidth MeanPetalLength MeanPetalWidth MaxPetalWidth
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 3.43 1.46 0.246 0.6
2 versicolor 5.94 2.77 4.26 1.33 1.8
3 virginica 6.59 2.97 5.55 2.03 2.5
In summarise() part, define your column name and give your column to summarise inside your function of choice.

Resources