I'm having trouble specifying parameters for a custom function passed to the .fns argument of dplyr's across.
Consider this code:
data(iris)
ref_col <- "Sepal.Length"
iris_summary <- iris %>%
group_by(Species) %>%
summarise(
Sepal.Length_max = max(Sepal.Length),
across(
Sepal.Width:Petal.Width,
~ .x[which.max(get(ref_col))]
)
)
This works properly. Now I need to replace the lambda with a custom function and pass the required arguments inside across (in my real code the custom function is more complex and not convenient to embed in the dplyr pipeline). See the following code:
ref_col <- "Sepal.Length"
get_which_max <- function(x, col_max) x[which.max(get(col_max))]
iris_summary <- iris %>%
group_by(Species) %>%
summarise(
Sepal.Length_max = max(Sepal.Length),
across(
Sepal.Width:Petal.Width,
~ get_which_max(.x, ref_col)
)
)
R now gives the error "object 'Sepal.Length' not found", because it is searching for an object instead of a column name inside the pipeline. Can anyone help me fix this problem?
We may either use cur_data() or pick() (from the devel version of dplyr) to select the column. Also, remove the get from inside get_which_max:
get_which_max <- function(x, col_max) x[which.max(col_max)]
iris_summary <- iris %>%
group_by(Species) %>%
summarise(
Sepal.Length_max = max(Sepal.Length),
across(
Sepal.Width:Petal.Width,
~ get_which_max(.x, cur_data()[[ref_col]])
)
)
-output
# A tibble: 3 × 5
Species Sepal.Length_max Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.8 4 1.2 0.2
2 versicolor 7 3.2 4.7 1.4
3 virginica 7.9 3.8 6.4 2
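A minimal sketch of another option, assuming the .data pronoun is usable inside the lambda passed to across (as far as I know it is, because the lambda is evaluated in the data mask). It looks the reference column up by its string name and hands the resulting vector to the custom function, so no get() is needed:

```r
library(dplyr)

ref_col <- "Sepal.Length"

# The custom function receives the reference vector itself, not its name
get_which_max <- function(x, col_max) x[which.max(col_max)]

iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    Sepal.Length_max = max(Sepal.Length),
    across(
      Sepal.Width:Petal.Width,
      # .data[[ref_col]] is the current group's slice of the reference column
      ~ get_which_max(.x, .data[[ref_col]])
    )
  )

iris_summary
```

This only works as long as the reference column has not been overwritten earlier in the same summarise call.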
Related
I'm wondering how R evaluates several across calls in the same summarise inside a dplyr pipeline. Consider the following example:
data(iris)
iris_summary <- iris %>%
group_by(Species) %>%
summarise(
across(
.cols = starts_with("Sepal"),
.fns = mean
),
across(
.cols = starts_with("Petal"),
.fns = ~ .x[which.max(Sepal.Length)]
)
)
The outcome produced is not the same as that of the following code:
iris_summary_2 <- iris %>%
group_by(Species) %>%
summarise(
across(
.cols = starts_with("Petal"),
.fns = ~ .x[which.max(Sepal.Length)]
),
across(
.cols = starts_with("Sepal"),
.fns = mean
)
)
Is this a problem related to the order in which R evaluates the two across calls in the same summarise?
Calling the data coming from the previous pipe step "step 0", the first across "step 1" and the second across "step 2": I expected R to restart from step 0 before evaluating both step 1 and step 2, but the results seem to indicate that, in step 2, R takes the vector Sepal.Length from step 1 and not from step 0.
Does anyone have tips to force R to take the vector from step 0 without changing the code structure?
Yes, summarize, like mutate and tibble, works sequentially and will use the most recently-updated version of any variables.
mtcars |>
summarize(gear = mean(gear),
gear2 = mean(gear) * 100)
gear gear2
1 3.6875 368.75
You might consider using the .names argument to put your summary numbers in new variables that don't alter the original ones.
iris %>%
group_by(Species) %>%
summarise(
across(
.cols = starts_with("Sepal"),
.fns = mean,
.names = "{.col}_mean"
),
across(
.cols = starts_with("Petal"),
.fns = ~ .x[which.max(Sepal.Length)],
.names = "{.col}_max_Sepal"
)
)
# A tibble: 3 × 5
Species Sepal.Length_mean Sepal.Width_mean Petal.Length_max_Sepal Petal.Width_max_Sepal
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 3.43 1.2 0.2
2 versicolor 5.94 2.77 4.7 1.4
3 virginica 6.59 2.97 6.4 2
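A small sketch to check this, assuming only dplyr and the built-in iris data (the object names with_names and petal_first are mine). With .names, the first across no longer overwrites Sepal.Length, so the second across should return the same values as the version in which the Petal columns are summarised first:

```r
library(dplyr)

# Summaries go to new columns, so Sepal.Length stays untouched for step 2
with_names <- iris %>%
  group_by(Species) %>%
  summarise(
    across(starts_with("Sepal"), mean, .names = "{.col}_mean"),
    across(starts_with("Petal"), ~ .x[which.max(Sepal.Length)],
           .names = "{.col}_max_Sepal")
  )

# Reference: Petal columns first, while Sepal.Length is still the original vector
petal_first <- iris %>%
  group_by(Species) %>%
  summarise(
    across(starts_with("Petal"), ~ .x[which.max(Sepal.Length)]),
    across(starts_with("Sepal"), mean)
  )

all.equal(with_names$Petal.Length_max_Sepal, petal_first$Petal.Length)
all.equal(with_names$Petal.Width_max_Sepal, petal_first$Petal.Width)
```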
I want to use a custom function and return columns with an added "_cat_mean" to each column.
In the code below "$cat_mean" is added and I can't select it by that name.
summarise_categories <- function(x) {
tibble(
cat_mean = round(mean(x) * 2) / 2
)
}
iris_summarised = iris %>%
group_by(Species) %>%
summarise(across(ends_with("Length"), ~summarise_categories(.)))
Selecting the column by the name that is displayed doesn't work:
iris_summarised %>%
select(Species, Sepal.Length$cat_mean)
But this works
iris_summarised %>%
select(Species, Sepal.Length)
I want the column to be named "Sepal.Length_cat_mean"
You can use the .names argument in across to give new column names.
library(dplyr)
summarise_categories <- function(x) {
round(mean(x) * 2) / 2
}
iris %>%
group_by(Species) %>%
summarise(across(ends_with("Length"), summarise_categories,
.names = '{.col}_cat_mean')) -> iris_summarised
iris_summarised
# Species Sepal.Length_cat_mean Petal.Length_cat_mean
# <fct> <dbl> <dbl>
#1 setosa 5 1.5
#2 versicolor 6 4.5
#3 virginica 6.5 5.5
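If the suffix should come from the function rather than being typed into the glue string, a possible variation (a sketch, not the only way) is to pass a named list to across and use the {.fn} placeholder next to {.col}. The list name cat_mean is my choice here:

```r
library(dplyr)

summarise_categories <- function(x) {
  round(mean(x) * 2) / 2
}

iris_summarised <- iris %>%
  group_by(Species) %>%
  summarise(across(
    ends_with("Length"),
    # the list name ("cat_mean") is what {.fn} expands to
    list(cat_mean = summarise_categories),
    .names = "{.col}_{.fn}"
  ))

# The new columns can now be selected by their plain names
iris_summarised %>%
  select(Species, Sepal.Length_cat_mean)
```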
Using base R with colMeans and by
by(iris[-5], iris$Species, function(x) round(colMeans(x) * 2) /2)
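by returns a list-like object rather than a data frame. If a data frame is wanted, one base R alternative (my suggestion, using aggregate with the formula interface instead of by) would be:

```r
# Same rounding to the nearest 0.5, applied to every numeric column per Species
summarise_categories <- function(x) round(mean(x) * 2) / 2

res <- aggregate(. ~ Species, data = iris, FUN = summarise_categories)

# Append the suffix to every column except the grouping one
names(res)[-1] <- paste0(names(res)[-1], "_cat_mean")
res
```

Unlike the dplyr answer, this summarises all four measurement columns, not only those ending in "Length".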
I am trying to apply a complex function on multiple columns after applying a group on it.
Code example is:
library(dplyr)
data(iris)
add = function(x,y) {
z = x+y
return(mean(z))
}
iris %>%
group_by(Species) %>%
summarise_at(.vars=c('Sepal.Length', 'Sepal.Width'),
.funs = add('Sepal.Length', 'Sepal.Width' ) )
I was expecting that the function would be applied to each group and returned as a new column but I get:
Error in x + y : non-numeric argument to binary operator
How can I get this work?
Note my real problem has a much more complicated function than the simple add function I've written here that requires the two columns be fed in as separate entities I can't just sum them first.
Thanks
I don't think you need summarise_at, since your definition of add takes care of the multiple input arguments. summarise_at is useful when you are applying the same change to multiple columns, not for combining them.
If you just want the sum of each column, you can try:
iris %>%
group_by(Species) %>%
summarise_at(
.vars= vars( Sepal.Length, Sepal.Width),
.funs = sum)
which gives:
Species Sepal.Length Sepal.Width
<fctr> <dbl> <dbl>
1 setosa 250 171
2 versicolor 297 138
3 virginica 329 149
In case you want to add the columns together, you can just do:
iris %>%
group_by(Species) %>%
summarise( k = sum(Sepal.Length, Sepal.Width))
which gives:
Species k
<fctr> <dbl>
1 setosa 422
2 versicolor 435
3 virginica 478
Using this form with your definition of add
add = function(x,y) {
z = x+y
return(mean(z))
}
iris %>%
group_by(Species) %>%
summarise( k = add(Sepal.Length, Sepal.Width))
returns
Species k
<fctr> <dbl>
1 setosa 8
2 versicolor 9
3 virginica 10
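As a sketch of how the two styles can be mixed (assuming dplyr 1.0.0 or later for across), the two-column function can sit in the same summarise as per-column summaries. Here k is a new column, so the across that follows still sees the original vectors:

```r
library(dplyr)

# Takes two columns as separate inputs and returns one value per group
add <- function(x, y) {
  z <- x + y
  mean(z)
}

res <- iris %>%
  group_by(Species) %>%
  summarise(
    k = add(Sepal.Length, Sepal.Width),            # combines two columns
    across(c(Sepal.Length, Sepal.Width), mean)     # same function on each column
  )

# mean(x + y) equals mean(x) + mean(y), which gives a quick sanity check
all.equal(res$k, res$Sepal.Length + res$Sepal.Width)
```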
summarize() already allows you to summarize multiple columns.
example:
summarize(mean_xvalues = mean(x) , sd_yvalues = sd(y), median_zvalues = median(z))
where x,y,z are columns of a dataframe.
Seems like this should be easy but I'm stumped. I've gotten the rough hang of programming with dplyr 0.7, but struggling with this: How do I program in dplyr if the variable I want to program with will be a string?
I am scraping a database, and for a variety of reasons want to summarize a variable that I will know the position of but not the name of (the thing I want is always the first column of the supplied table, but the name of the variable stored in that column will vary depending on the database being scraped). To use iris as an example, suppose that I know that the variable that I want is in the first column
library(tidyverse)
desired_var <- colnames(iris)[1]
print(desired_var)
"Sepal.Length"
I now want to group by Species, and take the mean of desired_var, i.e. what I want is to perform
iris %>%
group_by(Species) %>%
summarise(desired_mean = mean(Sepal.Length))
But, now I want to take the mean of a column which is defined by a string stored in desired_var
I get how to do this with a "bare" Sepal.Length
desired_var <- quo(Sepal.Length)
iris %>%
group_by(Species) %>%
summarise(desired_mean = mean(!!desired_var))
But how in the world do I deal with the fact that I have "Sepal.Length" not Sepal.Length , i.e. that desired_var <- "Sepal.Length" ?
You're wandering into tidyeval, a rather new feature of the tidyverse (see here) that is mostly used to write functions around tidyverse functions. For now it is only available with dplyr, but the plan is to extend it to the other tidyverse packages.
For your need though, you don't really need to get into that, when summarize_at will do. This function allows you to extend a particular manipulation that you specify across any variables of your choosing:
iris %>%
group_by(Species) %>%
summarise_at(vars(one_of("Sepal.Length", "Sepal.Width")), funs(desired_mean = mean))
# A tibble: 3 x 3
Species Sepal.Length_desired_mean Sepal.Width_desired_mean
<fctr> <dbl> <dbl>
1 setosa 5.006 3.428
2 versicolor 5.936 2.770
3 virginica 6.588 2.974
You can store the list of variables into a vector, and then use that vector instead:
selected_vectors <- c("Sepal.Length", "Sepal.Width")
iris %>%
group_by(Species) %>%
summarise_at(vars(one_of(selected_vectors)), funs(desired_mean = mean))
1) dynamic variable with !!sym Use sym (or parse_expr) like this:
library(dplyr)
library(rlang)
desired_var <- "Sepal.Length"
iris %>%
group_by(Species) %>%
summarise(desired_mean = mean(!!sym(desired_var))) %>%
ungroup
giving:
# A tibble: 3 x 2
Species desired_mean
<fctr> <dbl>
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
2) summarise_at As @Phil points out in the comments, in the particular case of summarise this could be done like this without using any rlang facilities:
library(dplyr)
desired_var <- "Sepal.Length"
iris %>%
group_by(Species) %>%
summarise_at(desired_var, funs(mean)) %>%
ungroup
giving:
# A tibble: 3 x 2
Species Sepal.Length
<fctr> <dbl>
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
3) dynamic variable and name with !! If you need to set the name dynamically in (1) then try this:
library(dplyr)
library(rlang)
desired_var <- "Sepal.Length"
desired_var_name <- paste("mean", desired_var, sep = "_")
iris %>%
group_by(Species) %>%
summarise(!!desired_var_name := mean(!!sym(desired_var))) %>%
ungroup
giving:
# A tibble: 3 x 2
Species mean_Sepal.Length
<fctr> <dbl>
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
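For completeness, a sketch of two further ways to use a column name stored as a string, assuming a dplyr version that provides the .data pronoun and across with all_of (1.0.0 or later). The result names by_pronoun and by_across are mine:

```r
library(dplyr)

desired_var <- colnames(iris)[1]   # "Sepal.Length"

# Option A: subset the .data pronoun with the string
by_pronoun <- iris %>%
  group_by(Species) %>%
  summarise(desired_mean = mean(.data[[desired_var]]))

# Option B: select the column with all_of() and build the name with .names
by_across <- iris %>%
  group_by(Species) %>%
  summarise(across(all_of(desired_var), mean, .names = "mean_{.col}"))

by_pronoun
by_across
```

Neither option needs rlang to be attached, and option B also covers the dynamic output name from (3).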
I have a dataset which I want to summarise by mean, while also calculating the max for just one of the variables.
Let me start with an example of what I would like to achieve:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise_at("Sepal.Length:Petal.Width",funs(mean))
which gives me the following result
# A tibble: 3 × 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fctr> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.8 4.4 1.9 0.5
2 versicolor 7.0 3.4 5.1 1.8
3 virginica 7.9 3.8 6.9 2.5
Is there an easy way to add, for example, max(Petal.Width) to summarise?
So far I have tried the following:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise_at("Sepal.Length:Petal.Width",funs(mean)) %>%
mutate(Max.Petal.Width = max(iris$Petal.Width))
But with this approach I lose both the group_by and the filter from the code above, and it gives the wrong results.
The only solution I have been able to achieve is the following:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise_at("Sepal.Length:Petal.Width",funs(mean,max)) %>%
select(Species:Petal.Width_mean,Petal.Width_max) %>%
rename(Max.Petal.Width = Petal.Width_max) %>%
rename_(.dots = setNames(names(.), gsub("_.*$","",names(.))))
Which is a bit convoluted and involves a lot of typing to just add a column with a different summarisation.
Thank you
Although this is an old question, it remains an interesting problem for which I have two solutions that I believe should be available to whoever finds this page.
Solution one
My own take:
mapply(summarise_at,
.vars = lst(names(iris)[!names(iris)%in%"Species"], "Petal.Width"),
.funs = lst(mean, max),
MoreArgs = list(.tbl = iris %>% group_by(Species) %>% filter(Sepal.Length > 5))) %>%
reduce(merge, by = "Species")
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
# 1 setosa 5.314 3.714 1.509 0.2773 0.5
# 2 versicolor 5.998 2.804 4.317 1.3468 1.8
# 3 virginica 6.622 2.984 5.573 2.0327 2.5
Solution two
An elegant solution using package purrr from the tidyverse itself, inspired by this discussion:
list(.vars = lst(names(iris)[!names(iris)%in%"Species"], "Petal.Width"),
.funs = lst("mean" = mean, "max" = max)) %>%
pmap(~ iris %>% group_by(Species) %>% filter(Sepal.Length > 5) %>% summarise_at(.x, .y)) %>%
reduce(inner_join, by = "Species")
# A tibble: 3 x 6
Species Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.31 3.71 1.51 0.277 0.5
2 versicolor 6.00 2.80 4.32 1.35 1.8
3 virginica 6.62 2.98 5.57 2.03 2.5
Short discussion
The data.frame and the tibble above are the desired result: the last column is the max of Petal.Width and the others are the means (by group, after the filter) of all other columns.
Both solutions hinge on three realizations:
summarise_at accepts two lists as arguments, one of n variables and one of m functions, and applies all m functions to all n variables, producing m × n vectors in a tibble. The solution therefore has to force it to loop over "couples", each formed by a group of variables and the one function to apply to them, then another group of variables with their own function, and so on.
What does this in R, that is, what applies an operation to corresponding elements of two lists? Functions such as mapply, or map2, pmap and their variants from purrr, dplyr's fellow tidyverse package. Both accept two lists of l elements and perform a given operation on corresponding elements (matched by position).
Because the product is not a tibble or a data.frame but a list, you simply need to use reduce with inner_join or merge.
Note that the means I obtain differ from those of the OP, but they are the means I also obtain with the OP's reproducible example (maybe we have two different versions of the iris dataset?).
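The same pairing idea can be sketched with across instead of summarise_at, assuming dplyr 1.0.0 or later plus purrr. The names col_sets, fun_set and pieces are hypothetical, and the suffixes are added only so the two Petal.Width columns do not clash in the join:

```r
library(dplyr)
library(purrr)

filtered <- iris %>%
  group_by(Species) %>%
  filter(Sepal.Length > 5)

# Two parallel lists: a set of columns and the function meant for that set
col_sets <- list(
  c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  "Petal.Width"
)
fun_set <- list(mean = mean, max = max)

# One grouped summary per (columns, function) couple, matched by position
pieces <- map2(col_sets, names(fun_set), function(cols, fn_name) {
  summarise(
    filtered,
    across(all_of(cols), fun_set[[fn_name]], .names = paste0("{.col}_", fn_name))
  )
})

# The result is a list of tibbles, so join them back on the grouping column
result <- reduce(pieces, inner_join, by = "Species")
result
```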
If you wanted to do something more complex like that, you could write your own version of summarize_at. With this version you supply triplets of column names, functions, and naming rules.
Here's a rough start:
my_summarise_at<-function (.tbl, ...)
{
dots <- list(...)
stopifnot(length(dots)%%3==0)
vars <- do.call("append", Map(function(.cols, .funs, .name) {
cols <- select_colwise_names(.tbl, .cols)
funs <- as.fun_list(.funs, .env = parent.frame())
val<-colwise_(.tbl, funs, cols)
names <- sapply(names(val), function(x) gsub("%", x, .name))
setNames(val, names)
}, dots[seq_along(dots)%%3==1], dots[seq_along(dots)%%3==2], dots[seq_along(dots)%%3==0]))
summarise_(.tbl, .dots = vars)
}
environment(my_summarise_at)<-getNamespace("dplyr")
And you can call it with
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
my_summarise_at("Sepal.Length:Petal.Width", mean, "%_mean",
"Petal.Width", max, "%_max")
For the names we just replace the "%" with the default name. The idea is just to dynamically build the summarize_ expression. The summarize_at function is really just a convenience wrapper around that basic function.
If you are trying to do everything with dplyr (which might be easier to remember), then you can leverage the across function, available since dplyr 1.0.0.
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarize(across(Sepal.Length:Petal.Width, mean)) %>%
cbind(iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarize(across(Petal.Width, max)) %>%
select(-Species)
)
It shows that the only difficulty is combining two calculations on the same column, Petal.Width, within a grouped summary: you have to repeat the grouping and the filter, but you can nest them inside the cbind.
This returns the following result:
Species Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Width
1 setosa 5.313636 3.713636 1.509091 0.2772727 0.5
2 versicolor 5.997872 2.804255 4.317021 1.3468085 1.8
3 virginica 6.622449 2.983673 5.573469 2.0326531 2.5
If the task required only one calculation on the column Petal.Width instead of two, then this could be elegantly written as:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarize(
across(Sepal.Length:Petal.Length, mean),
across(Petal.Width, max)
)
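When both the mean and the max of Petal.Width are needed, one possible sketch without cbind relies on summarise working through its arguments in order: compute the max under a new name first, while Petal.Width is still the full vector, and only then let across replace the columns with their means. The column name Max.Petal.Width is taken from the question:

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  filter(Sepal.Length > 5) %>%
  summarize(
    # evaluated first, on the original (filtered) Petal.Width
    Max.Petal.Width = max(Petal.Width),
    # evaluated second; overwrites the four columns with their group means
    across(Sepal.Length:Petal.Width, mean)
  )

res
```

Swapping the two arguments would compute the max of the already averaged Petal.Width, so the order matters here.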
I was looking for something similar and tried the following. It works well and is much easier to read than the suggested solutions.
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise(MeanSepalLength=mean(Sepal.Length),
MeanSepalWidth = mean(Sepal.Width),
MeanPetalLength=mean(Petal.Length),
MeanPetalWidth=mean(Petal.Width),
MaxPetalWidth=max(Petal.Width))
# A tibble: 3 x 6
Species MeanSepalLength MeanSepalWidth MeanPetalLength MeanPetalWidth MaxPetalWidth
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.31 3.71 1.51 0.277 0.5
2 versicolor 6.00 2.80 4.32 1.35 1.8
3 virginica 6.62 2.98 5.57 2.03 2.5
In the summarise() part, define each new column name and pass the column to summarise to your function of choice.