Error when using dplyr {{ }} with aggregate inside a function

I am trying to use aggregate inside a function, using dplyr's {{ }} notation to select the column to aggregate on.
filter <- function(df, level) {
df <- aggregate(.~ {{level}}, data=df, FUN=sum)
return(df)
}
However, I get the error:
Error in model.frame.default(formula = cbind(phylum, '12K1B.txt', '12K2B.txt', :
variable lengths differ (found for '{{ level }}')
I have double-checked my data and there are no missing or NA values. Everything works as expected when I run it outside of the function, so I am not sure what is causing the error.

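{{ }} is tidyverse (rlang) syntax and only works inside functions that support tidy evaluation, such as dplyr verbs. aggregate is a base R function, where {{level}} is simply level wrapped in two pairs of braces.
Suppose we want to achieve something like this: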
aggregate(. ~ Species, data = iris, sum)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 250.3 171.4 73.1 12.3
2 versicolor 296.8 138.5 213.0 66.3
3 virginica 329.4 148.7 277.6 101.3
We can build the formula on the fly by manipulating it as text, like so:
aggregate_var <- function(df, level) {
level <- deparse(substitute(level))
aggregate(formula(paste(". ~", level)), data=df, FUN=sum)
}
aggregate_var(iris, Species)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 250.3 171.4 73.1 12.3
2 versicolor 296.8 138.5 213.0 66.3
3 virginica 329.4 148.7 277.6 101.3
As an aside: filter is already a widely used function name (stats::filter and dplyr::filter both exist), so a more descriptive name is a better choice. Also note that the explicit return statement and the assignment to df are not needed here.
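If the grouping column arrives as a string rather than a bare name, the formula can also be assembled with reformulate() from the stats package. The following is a minimal sketch, assuming syntactic column names; aggregate_by and its arguments are illustrative names, not part of the answer above.

```r
# Hypothetical helper: group by one or more columns given as strings.
aggregate_by <- function(df, groups, FUN = sum) {
  # Fail early if a requested grouping column is missing.
  stopifnot(is.character(groups), all(groups %in% names(df)))
  # reformulate() builds `. ~ g1 + g2` from the character vector.
  f <- reformulate(groups, response = ".")
  aggregate(f, data = df, FUN = FUN)
}

res <- aggregate_by(iris, "Species")
res
```

Because the formula is built from a character vector, the same helper should also accept several grouping columns, for example aggregate_by(mtcars, c("cyl", "gear")).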

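You may use get to achieve this, passing the column name as a string. Note that in the output below the grouping column is labelled get(level), and Species itself is still picked up by the dot, so the sums of its integer factor codes appear as an extra column.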
funn <- function(df, level){
df <- aggregate(.~ get(level), data=df, FUN=sum)
return(df)
}
funn(iris, "Species")
get(level) Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 setosa 250.3 171.4 73.1 12.3 50
2 versicolor 296.8 138.5 213.0 66.3 100
3 virginica 329.4 148.7 277.6 101.3 150
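One way to keep the original column name and avoid summing the grouping column is to skip the formula interface and use the by argument of aggregate instead. This is only a sketch under the assumption that every other column is numeric; sum_by is a hypothetical name.

```r
# Hypothetical helper: `level` is the grouping column name as a string.
sum_by <- function(df, level) {
  # Aggregate every column except the grouping column itself.
  value_cols <- setdiff(names(df), level)
  # df[level] is a one-column data frame, so the group keeps its name.
  aggregate(df[value_cols], by = df[level], FUN = sum)
}

res <- sum_by(iris, "Species")
res
```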

To complement the other answers already provided, if you wanted to use {{, the dplyr way is:
library(dplyr)
my_fun <- function(df, level) {
df |>
group_by({{ level }}) |>
summarize(across(everything(), sum))
}
my_fun(iris, Species)
# A tibble: 3 x 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 250. 171. 73.1 12.3
2 versicolor 297. 138. 213 66.3
3 virginica 329. 149. 278. 101.

Related

Passing multiple dataframes to a function that contains an if statement in R

I have a function below that contains an if statement. It produces this message:
1: In if (df == "iris1") { :
the condition has length > 1 and only the first element will be used
Can anyone amend the code so that it works?
library(tidyverse)
iris1<-iris[1:50, ]
iris2<-iris[51:100,]
add_col<-function(df,colname)
{
df$newcol<-df [, colname]*100
if(df=="iris1"){df<-df%>%mutate(col_id="some text")}
if(df=="iris2"){df<-df%>%mutate(col_id="other text")}
return(df)
}
x <- c("iris1", "iris2")
z<-map(map(x, ~ as.symbol(.x) %>% eval),
~ add_col(.x, "Sepal.Length"))
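The function only receives the data, not the name of the object, so df == "iris1" compares every cell of the data frame with the string. Create a named list and then use imap to loop over the list, passing the name as a separate argument:
library(purrr)
library(dplyr)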
add_col<-function(df, nm, colname)
{
df$newcol<-df [, colname]*100
if(nm =="iris1"){df<-df%>% mutate(col_id="some text")}
if(nm=="iris2"){df<-df%>%mutate(col_id="other text")}
return(df)
}
Testing:
out <- imap(lst(iris1, iris2), ~ add_col(.x, .y, "Sepal.Length"))
> map(out, head, 2)
$iris1
Sepal.Length Sepal.Width Petal.Length Petal.Width Species newcol col_id
1 5.1 3.5 1.4 0.2 setosa 510 some text
2 4.9 3.0 1.4 0.2 setosa 490 some text
$iris2
Sepal.Length Sepal.Width Petal.Length Petal.Width Species newcol col_id
51 7.0 3.2 4.7 1.4 versicolor 700 other text
52 6.4 3.2 4.5 1.5 versicolor 640 other text
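The same idea, a named list plus the name passed alongside each data frame, can be sketched in base R with Map. The labels lookup vector is an assumption introduced here to replace the chain of if statements; it is not part of the answer above.

```r
iris1 <- iris[1:50, ]
iris2 <- iris[51:100, ]

# Hypothetical lookup table: one label per data frame name.
labels <- c(iris1 = "some text", iris2 = "other text")

add_col <- function(df, nm, colname, labels) {
  df$newcol <- df[[colname]] * 100
  df$col_id <- labels[[nm]]  # errors if nm has no entry in labels
  df
}

dfs <- list(iris1 = iris1, iris2 = iris2)
# Map walks over the data frames and their names in parallel.
out <- Map(add_col, dfs, names(dfs),
           MoreArgs = list(colname = "Sepal.Length", labels = labels))
lapply(out, head, 2)
```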

R: aggregate data while adding new count column using base R

I would like to aggregate a data frame while also adding in a new column (N) that counts the number of rows per value of the grouping variable, in base R.
This is trivial in dplyr:
library(dplyr)
data(iris)
combined_summary <- iris %>% group_by(Species) %>% group_by(N=n(), add=TRUE) %>% summarize_all(mean)
> combined_summary
# A tibble: 3 x 6
# Groups: Species [3]
Species N Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 setosa 50 5.01 3.43 1.46 0.246
2 versicolor 50 5.94 2.77 4.26 1.33
3 virginica 50 6.59 2.97 5.55 2.03
I am however in the unfortunate position of having to write this code in an environment that doesn't allow for packages to be used (don't ask; it's not my decision). So I need a way to do this in base R.
I can do it in base R in a long-winded way as follows:
# First create the aggregated tables separately
summary_means <- aggregate(. ~ Species, data=iris, FUN=mean)
summary_count <- aggregate(Sepal.Length ~ Species, data=iris[, c("Species", "Sepal.Length")], FUN=length)
> summary_means
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
> summary_count
Species Sepal.Length
1 setosa 50
2 versicolor 50
3 virginica 50
# Then rename the count column
colnames(summary_count)[2] <- "N"
> summary_count
Species N
1 setosa 50
2 versicolor 50
3 virginica 50
# Finally merge the two dataframes
combined_summary_baseR <- merge(x=summary_count, y=summary_means, by="Species", all.x=TRUE)
> combined_summary_baseR
Species N Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 50 5.006 3.428 1.462 0.246
2 versicolor 50 5.936 2.770 4.260 1.326
3 virginica 50 6.588 2.974 5.552 2.026
Is there any way to do this in a more efficient way in base R?
Here is a base R option that does the aggregation in a single by call:
do.call(rbind, by(
iris[-ncol(iris)], iris[ncol(iris)], function(x) c(N = nrow(x), colMeans(x))))
# N Sepal.Length Sepal.Width Petal.Length Petal.Width
#setosa 50 5.006 3.428 1.462 0.246
#versicolor 50 5.936 2.770 4.260 1.326
#virginica 50 6.588 2.974 5.552 2.026
Using colMeans ensures that the column names are carried through, which avoids an additional setNames call.
Update
In response to your comment, to have row names as a separate column requires an extra step.
d <- do.call(rbind, by(
iris[-ncol(iris)], iris[ncol(iris)], function(x) c(N = nrow(x), colMeans(x))))
cbind(Species = rownames(d), as.data.frame(d))
This is not as concise as the initial by call. I think we're having a clash of philosophies here. In dplyr (and the tidyverse) row names are generally avoided, to be consistent with the principles of "tidy data". In base R row names are common and are (more or less) consistently carried through data operations. So in a way you're asking for a mix of dplyr (tidy) and base R data structure concepts, which may not be the best or most robust approach.
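If a regular Species column is preferred over row names, another base R route is to take the counts from table() and merge them with the means. This is a sketch rather than a drop-in replacement: table() counts every row, whereas the formula interface of aggregate drops rows containing NA by default, so the two can disagree on data with missing values.

```r
# Group means, as in the question.
means <- aggregate(. ~ Species, data = iris, FUN = mean)

# Row counts per group; responseName sets the name of the count column.
counts <- as.data.frame(table(Species = iris$Species), responseName = "N")

# Join on the shared Species column.
combined <- merge(counts, means, by = "Species")
combined
```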

Combine dataframes for means and sd's into one dataframe with sd in brackets after the mean

I would like to create a data frame with several different columns containing means, after which the sd is shown in brackets. To give an example:
df <- iris
mean <- aggregate(df[,1:4], list(iris$Species), mean)
sd <- aggregate(df[,1:4], list(iris$Species), sd)
view(mean)
Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
view(sd)
Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 0.3524897 0.3790644 0.1736640 0.1053856
2 versicolor 0.5161711 0.3137983 0.4699110 0.1977527
3 virginica 0.6358796 0.3224966 0.5518947 0.2746501
Now I would like to have something like this:
Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.0 (0.35) 3.4 (0.38) 1.5 (0.17) 0.2 (0.11)
2 versicolor 5.9 (0.52) 2.8 (0.31) 4.3 (0.47) 1.3 (0.20)
3 virginica 6.6 (0.64) 3.0 (0.32) 5.6 (0.55) 2.0 (0.27)
I reckon there should be a way using the paste function, but I can't figure out how.
We can convert the data to a matrix and apply paste0 directly:
dfN <- mean
dfN[-1] <- paste0(round(as.matrix(mean[-1]), 1), " (",
round(as.matrix(sd[-1]), 2), ")")
Also, this can be done in one step instead of creating multiple datasets:
library(dplyr)
library(stringr)
df %>%
group_by(Species) %>%
summarise_all(list(~ str_c(round(mean(.), 2), " (", round(sd(.), 2), ")")))
# A tibble: 3 x 5
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# <fct> <chr> <chr> <chr> <chr>
#1 setosa 5.01 (0.35) 3.43 (0.38) 1.46 (0.17) 0.25 (0.11)
#2 versicolor 5.94 (0.52) 2.77 (0.31) 4.26 (0.47) 1.33 (0.2)
#3 virginica 6.59 (0.64) 2.97 (0.32) 5.55 (0.55) 2.03 (0.27)
Using mapply we can paste the values, rounding them first so that the result matches the output below:
df1 <- sd
df1[-1] <- mapply(function(x, y) paste0(round(x, 2), "(", round(y, 2), ")"), mean[-1], sd[-1])
df1
# Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 setosa 5.01(0.35) 3.43(0.38) 1.46(0.17) 0.25(0.11)
#2 versicolor 5.94(0.52) 2.77(0.31) 4.26(0.47) 1.33(0.2)
#3 virginica 6.59(0.64) 2.97(0.32) 5.55(0.55) 2.03(0.27)
It is better to use variable names other than mean and sd, since those are functions in R.
I know this question is six months old, but I came across it while researching what to do with my own data. The rstatix package is another option:
df %>%
rstatix::get_summary_stats(show = c("mean", "sd", "median", "iqr")) %>%
dplyr::mutate_if(is.numeric, round, digits = 2) %>%
dplyr::mutate(mean2 = stringr::str_glue("{mean} ({sd})"))
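The round-based answers above print 0.2 where the requested layout shows 0.20, because round() drops trailing zeros once the number is converted to text. A base R sketch with sprintf(), which pads to a fixed number of decimals, is shown below; the names num_cols, means, sds and out are chosen here for illustration and deliberately avoid masking mean and sd.

```r
num_cols <- names(iris)[1:4]
groups <- list(Species = iris$Species)

means <- aggregate(iris[num_cols], groups, mean)
sds   <- aggregate(iris[num_cols], groups, sd)

out <- means
# Format column by column: one decimal for the mean, two for the sd.
out[num_cols] <- Map(function(m, s) sprintf("%.1f (%.2f)", m, s),
                     means[num_cols], sds[num_cols])
out
```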

Using dplyr to group_by and conditionally mutate only with if (without else) statement

I have a dataframe that I need to group by a combination of column entries in order to conditionally mutate several columns using only an if statement (without an else condition).
More specifically, I want to sum up the column values of a certain group if they cross a pre-defined threshold; otherwise the values should remain unchanged.
I have tried doing this using both if_else and case_when, but these functions either require a "false" argument (if_else) or set unmatched values to NA by default (case_when):
iris_mutated <- iris %>%
dplyr::group_by(Species) %>%
dplyr::mutate(Sepal.Length=if_else(sum(Sepal.Length)>250, sum(Sepal.Length)),
Sepal.Width=if_else(sum(Sepal.Width)>170, sum(Sepal.Width)),
Petal.Length=if_else(sum(Petal.Length)>70, sum(Petal.Length)),
Petal.Width=if_else(sum(Petal.Width)>15, sum(Petal.Width)))
iris_mutated <- iris %>%
dplyr::group_by(Species) %>%
dplyr::mutate(Sepal.Length=case_when(sum(Sepal.Length)>250 ~ sum(Sepal.Length)),
Sepal.Width=case_when(sum(Sepal.Width)>170 ~ sum(Sepal.Width)),
Petal.Length=case_when(sum(Petal.Length)>70 ~ sum(Petal.Length)),
Petal.Width=case_when(sum(Petal.Width)>15 ~ sum(Petal.Width)))
Any ideas how to do this instead?
Edit:
Here is an example for the expected output.
The sum of the petal width for all species-wise grouped entries is 12.3 for setosa, 101.3 for virginica and 66.3 for versicolor. If I require that this sum should be at least 15 for the values to be summed up (otherwise the original value should be kept), then I expect the following output (only showing the columns "Petal.Width" and "Species"):
Petal.Width Species
1 0.2 setosa
2 0.2 setosa
3 0.2 setosa
4 0.2 setosa
5 0.2 setosa
6 0.4 setosa
7 0.3 setosa
8 0.2 setosa
9 0.2 setosa
10 0.1 setosa
#...#
50 0.2 setosa
51 66.3 versicolor
52 66.3 versicolor
53 66.3 versicolor
#...#
100 66.3 versicolor
101 101.3 virginica
102 101.3 virginica
103 101.3 virginica
#...#
150 101.3 virginica
I think this is what you are after, using Johnny's method. You avoid the error by having case_when return the original value when the sum is not greater than the cutoff:
iris_mutated <- iris %>%
  group_by(Species) %>%
  mutate(Sepal.Length = case_when(sum(Sepal.Length) > 250 ~ sum(Sepal.Length),
                                  TRUE ~ Sepal.Length),
         Sepal.Width = case_when(sum(Sepal.Width) > 170 ~ sum(Sepal.Width),
                                 TRUE ~ Sepal.Width),
         Petal.Length = case_when(sum(Petal.Length) > 70 ~ sum(Petal.Length),
                                  TRUE ~ Petal.Length),
         Petal.Width = case_when(sum(Petal.Width) > 15 ~ sum(Petal.Width),
                                 TRUE ~ Petal.Width))
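For comparison, the same rule can be sketched in base R for a single column with ave(), which returns the group sum repeated on every row of its group. The threshold of 15 comes from the question; the variable names are illustrative.

```r
threshold <- 15

# Group sum of Petal.Width, aligned row by row with the original data.
group_sum <- ave(iris$Petal.Width, iris$Species, FUN = sum)

iris_mutated <- iris
# Keep the original value unless the group's sum exceeds the threshold.
iris_mutated$Petal.Width <- ifelse(group_sum > threshold,
                                   group_sum, iris$Petal.Width)

iris_mutated[c(1, 51, 101), c("Petal.Width", "Species")]
```

With this data, setosa (sum 12.3) keeps its original values, while versicolor and virginica are replaced by their group sums, which agrees with the expected output in the question.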

Aggregate function not working with paste0 in R

I have a data frame subdist.df that has data for sub-districts. I am trying to sum up the values of rows based on a common attribute in the data frame, namely the DISTRICT column.
The following line of code works
hello2 <-aggregate(.~DISTRICT, subdist.df,sum)
But this one does not.
hello <-aggregate(noquote(paste0(".~","DISTRICT")), subdist.df,sum)
I am unable to understand why this is the case. I need to use it in a function in which DISTRICT can be any column name supplied by the user as an argument.
Using iris data.frame as an example:
aggregate(.~Species, iris, sum)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 250.3 171.4 73.1 12.3
2 versicolor 296.8 138.5 213.0 66.3
3 virginica 329.4 148.7 277.6 101.3
The following paste0 call doesn't work, because noquote only returns a character string of class "noquote", not the formula that the aggregate function requires:
aggregate(noquote(paste0(".~","Species")), iris, sum)
Error in aggregate.data.frame(as.data.frame(x), ...) :
arguments must have same length
Instead, wrapping paste0 in as.formula works:
aggregate(as.formula(paste0(".~","Species")), iris, sum)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 250.3 171.4 73.1 12.3
2 versicolor 296.8 138.5 213.0 66.3
3 virginica 329.4 148.7 277.6 101.3
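The difference is easy to check by inspecting the classes of the two objects. The wrapper below is a minimal sketch of how this could be used with a user-supplied column name; sum_by_group is a hypothetical name.

```r
txt <- paste0(". ~ ", "Species")
class(noquote(txt))     # "noquote": still text, so no formula method is used
class(as.formula(txt))  # "formula"

# Hypothetical wrapper; `group` is a user-supplied column name.
sum_by_group <- function(df, group) {
  stopifnot(group %in% names(df))
  aggregate(as.formula(paste0(". ~ ", group)), data = df, FUN = sum)
}

res <- sum_by_group(iris, "Species")
res
```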
