I want to use a custom function inside across() and get back columns with "_cat_mean" appended to each column name.
With the code below, the columns are displayed as "Sepal.Length$cat_mean" instead, and I can't select them by that name.
summarise_categories <- function(x) {
tibble(
cat_mean = round(mean(x) * 2) / 2
)
}
iris_summarised = iris %>%
group_by(Species) %>%
summarise(across(ends_with("Length"), ~summarise_categories(.)))
Selecting a column by the name that is displayed doesn't work:
iris_summarised %>%
select(Species, Sepal.Length$cat_mean)
But this works
iris_summarised %>%
select(Species, Sepal.Length)
I want the column to be named "Sepal.Length_cat_mean"
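You can use the .names argument of across() to set the new column names. Have the function return a plain number rather than a tibble, so that across() creates ordinary columns instead of a nested data frame column.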
library(dplyr)
summarise_categories <- function(x) {
round(mean(x) * 2) / 2
}
iris %>%
group_by(Species) %>%
summarise(across(ends_with("Length"), summarise_categories,
.names = '{col}_cat_mean')) -> iris_summarised
iris_summarised
# Species Sepal.Length_cat_mean Petal.Length_cat_mean
# <fct> <dbl> <dbl>
#1 setosa 5 1.5
#2 versicolor 6 4.5
#3 virginica 6.5 5.5
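As a variation on the answer above, across() can also build the suffix from the function's name when it is given a named list. This is a minimal sketch: the list name cat_mean is chosen here for illustration, and the "{.col}_{.fn}" pattern is spelled out explicitly even though it is the default for a named list.

```r
library(dplyr)

# Return a plain number (not a tibble) so across() creates ordinary columns.
summarise_categories <- function(x) {
  round(mean(x) * 2) / 2
}

iris_summarised <- iris %>%
  group_by(Species) %>%
  summarise(across(
    ends_with("Length"),
    list(cat_mean = summarise_categories),  # the list name fills {.fn}
    .names = "{.col}_{.fn}"
  ))

# The columns can now be selected by their full names.
iris_summarised %>%
  select(Species, Sepal.Length_cat_mean)
```

The named-list form is handy when several summary functions are applied at once, since each one gets its own suffix.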
Using base R with colMeans and by
by(iris[-5], iris$Species, function(x) round(colMeans(x) * 2) /2)
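The by() call prints one block per species and covers all four numeric columns. If a data frame limited to the "Length" columns and carrying the "_cat_mean" suffix is wanted in base R, one possible sketch uses aggregate() and renames the result afterwards. The names length_cols and res are illustrative only.

```r
# Columns whose names end in "Length"
length_cols <- grep("Length$", names(iris), value = TRUE)

# One row per species, one rounded mean per selected column
res <- aggregate(
  iris[length_cols],
  by = list(Species = iris$Species),
  FUN = function(x) round(mean(x) * 2) / 2
)

# Append the suffix to every column except the grouping column
names(res)[-1] <- paste0(names(res)[-1], "_cat_mean")
res
```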
I'm having trouble passing parameters to a custom function supplied to the .fns argument of dplyr's across().
Consider this code:
data(iris)
ref_col <- "Sepal.Length"
iris_summary <- iris %>%
group_by(Species) %>%
summarise(
Sepal.Length_max = max(Sepal.Length),
across(
Sepal.Width:Petal.Width,
~ .x[which.max(get(ref_col))]
)
)
This works properly. Now I need to replace the lambda with a custom function and pass the required arguments to it inside across() (in my real code the custom function is more complex, and it is not convenient to embed it in the dplyr pipeline). See the following code:
ref_col <- "Sepal.Length"
get_which_max <- function(x, col_max) x[which.max(get(col_max))]
iris_summary <- iris %>%
group_by(Species) %>%
summarise(
Sepal.Length_max = max(Sepal.Length),
across(
Sepal.Width:Petal.Width,
~ get_which_max(.x, ref_col)
)
)
R now gives the error "object 'Sepal.Length' not found", because it is searching for an object rather than a column name inside the pipeline. Can anyone help me fix this problem?
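We can use either cur_data() or pick() to select the column (pick() was in the devel version of dplyr when this was written; it was released in dplyr 1.1.0, which also deprecates cur_data()). Also, remove the get() from inside get_which_max, so that the function receives the reference column as a vector: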
get_which_max <- function(x, col_max) x[which.max(col_max)]
iris_summary <- iris %>%
group_by(Species) %>%
summarise(
Sepal.Length_max = max(Sepal.Length),
across(
Sepal.Width:Petal.Width,
~ get_which_max(.x, cur_data()[[ref_col]])
)
)
Output:
# A tibble: 3 × 5
Species Sepal.Length_max Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.8 4 1.2 0.2
2 versicolor 7 3.2 4.7 1.4
3 virginica 7.9 3.8 6.4 2
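For dplyr 1.1.0 or later, the same idea can be sketched with pick(). This assumes that pick() resolves against the current group when called inside the across() lambda, in the same way cur_data() does above. Since pick() returns a data frame, [[1]] is used to extract the single reference column as a vector.

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where pick() is available

ref_col <- "Sepal.Length"

# The function receives the reference column as a vector, not as a name.
get_which_max <- function(x, col_max) x[which.max(col_max)]

iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    Sepal.Length_max = max(Sepal.Length),
    across(
      Sepal.Width:Petal.Width,
      # pick() selects the reference column from the current group's data
      ~ get_which_max(.x, pick(all_of(ref_col))[[1]])
    )
  )
iris_summary
```

The lookup of the column happens in the lambda, where the grouped data is visible, and the custom function only ever sees plain vectors. That is why the get() call inside the function is no longer needed.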
Consider this code:
iris %>% count(Species) %>% group_by(Species)
# A tibble: 3 x 2
# Groups: Species [3]
Species n
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
I want to define a function which does the same task, something like this:
table_freq <- function(Table, Var) {
freq <- NA
freq <- Table %>%
dplyr::count(Var) %>%
group_by(Var)
return(freq)
}
table_freq(iris, "Species")
But it does not work :
> table_freq(iris, "Species")
Error in `group_by_prepare()`:
! Must group by variables found in `.data`.
* Column `Var` is not found.
Any ideas?
Please do not write alternate solutions, I need to define a function that takes the table and the column name for which we need the freq. table.
You can use table to create a function that will take the table and the column name.
table_freq <- function(Table, Var) {
setNames(data.frame(table(Table[,Var])), c(Var, "n"))
}
table_freq(iris, "Species")
Output
Species n
1 setosa 50
2 versicolor 50
3 virginica 50
The key is {{ }} (the embrace operator), which lets the function take an unquoted column name. We simply write:
table_freq <- function(Table, Var) {
freq <- NA
freq <- Table %>%
dplyr::count({{Var}}) %>%
group_by({{Var}})
return(freq)
}
table_freq(iris, Species)
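Note that the {{ }} version is called with an unquoted column name. If the column name has to arrive as a string, as in the original call table_freq(iris, "Species"), one possible sketch is to select the column with across(all_of(Var)) in both count() and group_by():

```r
library(dplyr)

# Var is a string holding a column name, e.g. "Species"
table_freq <- function(Table, Var) {
  Table %>%
    count(across(all_of(Var))) %>%
    group_by(across(all_of(Var)))
}

res <- table_freq(iris, "Species")
res
```

all_of() raises an error if the named column does not exist, which is usually preferable to silently counting the wrong thing.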
I do this a lot:
library(tidyverse)
iris %>%
group_by(Species) %>%
summarise(num_Species = n_distinct(Species)) %>%
mutate(perc_Species = 100 * num_Species / sum(num_Species))
So I would like to create a function that outputs the same thing but with dynamically named num_ and perc_ columns:
num_perc <- function(df, group_var, summary_var) {
}
I found this resource useful but it did not directly address how to reuse newly created column names in the way I want.
You can use as_label(enquo()) on group_var to get the name of the variable that was passed in as a string, and build the new column names from it. There is a clear example of this in section 6.1.3 of the document you linked. That way we can dynamically prepend num_ and perc_ to the variable name, and only need to pass in df and group_var.
library(dplyr)
num_perc <- function(df, group_var) {
summary_lbl <- as_label(enquo(group_var))
num_lbl <- paste0("num_", summary_lbl)
perc_lbl <- paste0("perc_", summary_lbl)
df %>%
group_by({{ group_var }}) %>%
summarize(!!num_lbl := n_distinct({{ group_var }})) %>%
mutate(!!perc_lbl := 100 * .data[[num_lbl]] / sum(.data[[num_lbl]]))
}
num_perc(iris, Species)
#> # A tibble: 3 × 3
#> Species num_Species perc_Species
#> <fct> <int> <dbl>
#> 1 setosa 1 33.3
#> 2 versicolor 1 33.3
#> 3 virginica 1 33.3
If group_var and summary_var actually differ, the solution is essentially the same:
num_perc <- function(df, group_var, summary_var) {
summary_lbl <- as_label(enquo(summary_var))
num_lbl <- paste0("num_", summary_lbl)
perc_lbl <- paste0("perc_", summary_lbl)
df %>%
group_by({{ group_var }}) %>%
summarize(!!num_lbl := n_distinct({{ summary_var }})) %>%
mutate(!!perc_lbl := 100 * .data[[num_lbl]] / sum(.data[[num_lbl]]))
}
num_perc(iris, Species, Species)
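The same function can also be sketched with the glue-style name syntax, where "num_{{ summary_var }}" on the left of := builds the column name directly. This assumes rlang 1.0.0 or later for englue(), which is used here to get the new name as a string so that the column can be referred to again in mutate(). The mtcars call is only an illustration of a case where the two variables differ.

```r
library(dplyr)

num_perc <- function(df, group_var, summary_var) {
  # englue() builds the new column name as a string
  num_lbl <- rlang::englue("num_{{ summary_var }}")
  df %>%
    group_by({{ group_var }}) %>%
    summarise("num_{{ summary_var }}" := n_distinct({{ summary_var }})) %>%
    mutate("perc_{{ summary_var }}" := 100 * .data[[num_lbl]] / sum(.data[[num_lbl]]))
}

# Number of distinct gear values within each cyl group
res <- num_perc(mtcars, cyl, gear)
res
```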
Another possible solution, which uses deparse(substitute(...)) to get the name of the function parameters as strings:
library(tidyverse)
f <- function(df, group_var, summary_var)
{
group_var <- deparse(substitute(group_var))
summary_var <- deparse(substitute(summary_var))
df %>%
group_by(!!sym(group_var)) %>%
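# summary_var is a string at this point, so convert it back to a symbol;
# n_distinct(summary_var) would count the string itself and always return 1
summarise(!!str_c("num_", summary_var) := n_distinct(!!sym(summary_var))) %>%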
mutate(!!str_c("per_", summary_var) := 100 * !!sym(str_c("num_", summary_var)) / sum(!!sym(str_c("num_", summary_var))))
}
f(iris, Species, Species)
#> # A tibble: 3 × 3
#> Species num_Species per_Species
#> <fct> <int> <dbl>
#> 1 setosa 1 33.3
#> 2 versicolor 1 33.3
#> 3 virginica 1 33.3
Are you sure n_distinct is what you want? The iris dataset has three species (setosa, versicolor, virginica), so within each group n_distinct(Species) is 1 and each species comes out as 1/3 of the distinct species. The iris dataset happens to be balanced, with 50 rows of each species, so each species also represents 1/3 of the rows, but more generally this will not be the case.
A function with data masking will cover imbalanced datasets for you:
library(dplyr)
my_func <- function(df, var) {
df %>%
count({{var}}) %>%
mutate(percent = 100 * n / sum(n))
}
my_func(iris, Species)
iris %>%
my_func(Species) # or with a pipe
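To see the difference on data that is not balanced, here is a small sketch on mtcars, where the three cyl groups have 11, 7 and 14 rows. The function name count_perc is illustrative and not part of the answer above.

```r
library(dplyr)

# Count rows per level of `var` and convert the counts to percentages
count_perc <- function(df, var) {
  df %>%
    count({{ var }}) %>%
    mutate(percent = 100 * n / sum(n))
}

res <- count_perc(mtcars, cyl)
res
# percent is 34.375, 21.875 and 43.75 for cyl 4, 6 and 8
```

With n_distinct(cyl) per group, each row would instead show 1 and 33.3, regardless of how many cars fall in each group.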
I am having issues with pipes inside a custom function. Based on the previous posts, I understand that a pipe inside a function creates another level(?) which results in the error I'm getting (see below).
I'm hoping to write a summary function for a large data set with hundreds of numeric and categorical variables. I would like to have the option to use this on different data frames (with similar structure), always group by a certain factor variable and get summaries for multiple columns.
library(tidyverse)
data(iris)
iris %>% group_by(Species) %>% summarise(count = n(), mean = mean(Sepal.Length, na.rm = T))
# A tibble: 3 x 3
Species count mean
<fct> <int> <dbl>
1 setosa 50 5.01
2 versicolor 50 5.94
3 virginica 50 6.59
I'm hoping to create a function like this:
sum_cols <- function (df, col) {
df %>%
group_by(Species) %>%
summarise(count = n(),
mean = mean(col, na.rm = T))
}
And this is the error I'm getting:
sum_cols(iris, Sepal.Length)
Error in mean(col, na.rm = T) : object 'Sepal.Length' not found
Called from: mean(col, na.rm = T)
I have had this problem for a while and even though I tried to get answers in a few previous posts, I haven't quite grasped why the problem occurs and how to get around it.
Any help would be greatly appreciated, thanks!
Search for non-standard evaluation (NSE) to understand why this happens.
Here you can use {{ }} to tell R that col refers to a column of df.
library(dplyr)
library(rlang)
sum_cols <- function (df, col) {
df %>%
group_by(Species) %>%
summarise(count = n(), mean = mean({{col}}, na.rm = T))
}
sum_cols(iris, Sepal.Length)
# A tibble: 3 x 3
# Species count mean
# <fct> <int> <dbl>
#1 setosa 50 5.01
#2 versicolor 50 5.94
#3 virginica 50 6.59
If you do not have a recent version of rlang (0.4.0 or later, which introduced {{ }}), you can use the older approach with enquo() and !!:
sum_cols <- function (df, col) {
df %>%
group_by(Species) %>%
summarise(count = n(), mean = mean(!!enquo(col), na.rm = T))
}
sum_cols(iris, Sepal.Length)
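Since the question mentions summarising multiple columns, the same {{ }} approach can be extended with across(). This is a minimal sketch: the function name sum_cols_multi and the "mean_" prefix are illustrative choices, not part of the answer above.

```r
library(dplyr)

sum_cols_multi <- function(df, cols) {
  df %>%
    group_by(Species) %>%
    summarise(
      count = n(),
      # {{ cols }} forwards the caller's column selection to across()
      across({{ cols }}, ~ mean(.x, na.rm = TRUE), .names = "mean_{.col}")
    )
}

res <- sum_cols_multi(iris, c(Sepal.Length, Petal.Length))
res
```

Any tidyselect expression can be passed for cols, for example starts_with("Sepal") or where(is.numeric).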
I am trying to apply a complex function to multiple columns after grouping the data.
Code example is:
library(dplyr)
data(iris)
add = function(x,y) {
z = x+y
return(mean(z))
}
iris %>%
group_by(Species) %>%
summarise_at(.vars=c('Sepal.Length', 'Sepal.Width'),
.funs = add('Sepal.Length', 'Sepal.Width' ) )
I was expecting that the function would be applied to each group and returned as a new column but I get:
Error in x + y : non-numeric argument to binary operator
How can I get this to work?
Note that my real problem has a much more complicated function than the simple add function I've written here. It requires the two columns to be fed in as separate entities, so I can't just sum them first.
Thanks
I don't think you need summarise_at, since your definition of add takes care of the multiple input arguments. summarise_at is useful when you are applying the same change to multiple columns, not for combining them.
If you just want the sum of each column, you can try:
iris %>%
group_by(Species) %>%
summarise_at(
.vars= vars( Sepal.Length, Sepal.Width),
.funs = sum)
which gives:
Species Sepal.Length Sepal.Width
<fctr> <dbl> <dbl>
1 setosa 250 171
2 versicolor 297 138
3 virginica 329 149
in case you want to add the columns together, you can just do:
iris %>%
group_by(Species) %>%
summarise( k = sum(Sepal.Length, Sepal.Width))
which gives:
Species k
<fctr> <dbl>
1 setosa 422
2 versicolor 435
3 virginica 478
using this form with your definition of add
add = function(x,y) {
z = x+y
return(mean(z))
}
iris %>%
group_by(Species) %>%
summarise( k = add(Sepal.Length, Sepal.Width))
returns
Species k
<fctr> <dbl>
1 setosa 8
2 versicolor 9
3 virginica 10
summarize() already allows you to summarize multiple columns.
For example:
summarize(mean_xvalues = mean(x), sd_yvalues = sd(y), median_zvalues = median(z))
where x, y and z are columns of a data frame.
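Applied to the grouped iris data from the question, that pattern might look like the following sketch. The output column names are chosen here for illustration.

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  summarise(
    mean_sepal_length   = mean(Sepal.Length),
    sd_sepal_width      = sd(Sepal.Width),
    median_petal_length = median(Petal.Length)
  )
res
```

Each expression sees the columns of the current group as plain vectors, so a custom two-argument function such as add(Sepal.Length, Sepal.Width) can be used in the same position.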