I have a data frame, subdist.df, that holds data for sub-districts. I am trying to sum the values of rows that share a common attribute in the data frame, i.e. the DISTRICT column.
The following line of code works
hello2 <-aggregate(.~DISTRICT, subdist.df,sum)
But this one does not.
hello <-aggregate(noquote(paste0(".~","DISTRICT")), subdist.df,sum)
I am unable to understand why this is the case. I need to use it in a function wherein DISTRICT can be any input from the user as an argument.
Using iris data.frame as an example:
aggregate(.~Species, iris, sum)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 250.3 171.4 73.1 12.3
2 versicolor 296.8 138.5 213.0 66.3
3 virginica 329.4 148.7 277.6 101.3
The following paste0 call doesn't work, because noquote only returns a character string (of class "noquote") and not a formula, which is what the aggregate function requires:
aggregate(noquote(paste0(".~","Species")), iris, sum)
Error in aggregate.data.frame(as.data.frame(x), ...) :
arguments must have same length
Instead, wrapping the paste0 call in as.formula works:
aggregate(as.formula(paste0(".~","Species")), iris, sum)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 250.3 171.4 73.1 12.3
2 versicolor 296.8 138.5 213.0 66.3
3 virginica 329.4 148.7 277.6 101.3
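Since the goal is a function that takes the grouping column from the user, the same idea can be wrapped up roughly as follows. This is only a sketch: the function name sum_by and its arguments are hypothetical, and it assumes the column name arrives as a character string. It uses reformulate() from the stats package to build the formula instead of pasting the pieces by hand.

```r
# Hypothetical helper: sum all other columns of `df`, grouped by the
# column whose name is given as a string in `group_col`.
sum_by <- function(df, group_col) {
  stopifnot(is.character(group_col), group_col %in% names(df))
  # reformulate("Species", response = ".") builds the formula . ~ Species
  f <- reformulate(group_col, response = ".")
  aggregate(f, data = df, FUN = sum)
}

res <- sum_by(iris, "Species")
res
```

Checking that the column exists up front gives a clearer failure than the length-mismatch error shown above.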
I am trying to use aggregate inside a function, using dplyr's {{ }} notation to select the column to aggregate on.
filter <- function(df, level) {
df <- aggregate(.~ {{level}}, data=df, FUN=sum)
return(df)
}
however I get the error
Error in model.frame.default(formula = cbind(phylum, '12K1B.txt', '12K2B.txt', :
variable lengths differ (found for '{{ level }}')
I have double checked my data and there are no missing or NA values and everything works as expected when I run it outside of the function so I am not sure what is causing the error.
{{ }} is tidyverse syntax and only works inside functions that support tidy evaluation, such as the dplyr verbs. Base R's aggregate does not.
If we want to achieve something like this
aggregate(. ~ Species, data = iris, sum)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 250.3 171.4 73.1 12.3
2 versicolor 296.8 138.5 213.0 66.3
3 virginica 329.4 148.7 277.6 101.3
We can build the formula on the fly by manipulating it as text, like so:
aggregate_var <- function(df, level) {
level <- deparse(substitute(level))
aggregate(formula(paste(". ~", level)), data=df, FUN=sum)
}
aggregate_var(iris, Species)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 250.3 171.4 73.1 12.3
2 versicolor 296.8 138.5 213.0 66.3
3 virginica 329.4 148.7 277.6 101.3
As an aside: filter is already a widely used function name (for example stats::filter and dplyr::filter), so a more descriptive name is safer. Also note that the explicit return statement and the assignment to df are not needed here.
You may use get to achieve this.
funn <- function(df, level){
df <- aggregate(.~ get(level), data=df, FUN=sum)
return(df)
}
funn(iris, "Species")
get(level) Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 setosa 250.3 171.4 73.1 12.3 50
2 versicolor 296.8 138.5 213.0 66.3 100
3 virginica 329.4 148.7 277.6 101.3 150
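The trailing Species column in the output above (50, 100, 150) is the sum of the factor's integer codes: with get(level) on the right-hand side, the dot on the left still expands to every column, including the grouping one. One possible way around this, sketched under the assumption that level is a character string, is to skip the formula interface and pass the value columns and the grouping column separately. The name funn2 is hypothetical.

```r
funn2 <- function(df, level) {
  # Every column except the grouping one gets summed
  value_cols <- setdiff(names(df), level)
  # df[level] is a one-column data frame, so the result keeps the
  # original column name instead of "get(level)"
  aggregate(df[value_cols], by = df[level], FUN = sum)
}

res <- funn2(iris, "Species")
res
```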
To complement the other answers already provided, if you wanted to use {{, the dplyr way is:
my_fun <- function(df, level) {
df |>
group_by({{ level }}) |>
summarize(across(everything(), sum))
}
my_fun(iris, Species)
# A tibble: 3 x 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 250. 171. 73.1 12.3
2 versicolor 297. 138. 213 66.3
3 virginica 329. 149. 278. 101.
I would like to aggregate a data frame while also adding in a new column (N) that counts the number of rows per value of the grouping variable, in base R.
This is trivial in dplyr:
library(dplyr)
data(iris)
combined_summary <- iris %>% group_by(Species) %>% group_by(N=n(), add=TRUE) %>% summarize_all(mean)
> combined_summary
# A tibble: 3 x 6
# Groups: Species [3]
Species N Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 setosa 50 5.01 3.43 1.46 0.246
2 versicolor 50 5.94 2.77 4.26 1.33
3 virginica 50 6.59 2.97 5.55 2.03
I am however in the unfortunate position of having to write this code in an environment that doesn't allow for packages to be used (don't ask; it's not my decision). So I need a way to do this in base R.
I can do it in base R in a long-winded way as follows:
# First create the aggregated tables separately
summary_means <- aggregate(. ~ Species, data=iris, FUN=mean)
summary_count <- aggregate(Sepal.Length ~ Species, data=iris[, c("Species", "Sepal.Length")], FUN=length)
> summary_means
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
> summary_count
Species Sepal.Length
1 setosa 50
2 versicolor 50
3 virginica 50
# Then rename the count column
colnames(summary_count)[2] <- "N"
> summary_count
Species N
1 setosa 50
2 versicolor 50
3 virginica 50
# Finally merge the two dataframes
combined_summary_baseR <- merge(x=summary_count, y=summary_means, by="Species", all.x=TRUE)
> combined_summary_baseR
Species N Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 50 5.006 3.428 1.462 0.246
2 versicolor 50 5.936 2.770 4.260 1.326
3 virginica 50 6.588 2.974 5.552 2.026
Is there any way to do this in a more efficient way in base R?
Here is a base R option using a single by call (to aggregate)
do.call(rbind, by(
iris[-ncol(iris)], iris[ncol(iris)], function(x) c(N = nrow(x), colMeans(x))))
# N Sepal.Length Sepal.Width Petal.Length Petal.Width
#setosa 50 5.006 3.428 1.462 0.246
#versicolor 50 5.936 2.770 4.260 1.326
#virginica 50 6.588 2.974 5.552 2.026
Using colMeans ensures that the column names are carried through which avoids an additional setNames call.
Update
In response to your comment, to have row names as a separate column requires an extra step.
d <- do.call(rbind, by(
iris[-ncol(iris)], iris[ncol(iris)], function(x) c(N = nrow(x), colMeans(x))))
cbind(Species = rownames(d), as.data.frame(d))
This is not as concise as the initial by call. I think we have a clash of philosophies here. In dplyr (and the tidyverse) row names are generally avoided, to be consistent with the principles of "tidy data". In base R row names are common and are (more or less) consistently carried through data operations. So in a way you're asking for a mix of dplyr (tidy) and base R data structure concepts, which may not be the best or most robust approach.
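If a regular Species column is wanted without the row-name step, another possible sketch is to let aggregate compute the means and attach the counts from table. This assumes both results come back in the same group order (true here, since Species is a factor with no empty levels); the code checks that assumption before binding.

```r
means  <- aggregate(. ~ Species, data = iris, FUN = mean)
counts <- table(iris$Species)

# Guard the assumption that both results list the groups in the same order
stopifnot(identical(as.character(means$Species), names(counts)))

# Species first, then N, then the per-column means
combined <- cbind(means["Species"], N = as.vector(counts), means[-1])
combined
```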
I would like to create a data frame with several different columns containing means, after which the sd is shown in brackets. To give an example:
df <- iris
mean <- aggregate(df[,1:4], list(iris$Species), mean)
sd <- aggregate(df[,1:4], list(iris$Species), sd)
view(mean)
Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
view(sd)
Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 0.3524897 0.3790644 0.1736640 0.1053856
2 versicolor 0.5161711 0.3137983 0.4699110 0.1977527
3 virginica 0.6358796 0.3224966 0.5518947 0.2746501
Now I would like to have something like this:
Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.0 (0.35) 3.4 (0.38) 1.5 (0.17) 0.2 (0.11)
2 versicolor 5.9 (0.52) 2.8 (0.31) 4.3 (0.47) 1.3 (0.20)
3 virginica 6.6 (0.64) 3.0 (0.32) 5.6 (0.55) 2.0 (0.27)
I reckon there should be a way using the paste function, but I can't figure out how.
We can convert the data to matrix and apply paste directly
dfN <- mean
dfN[-1] <- paste0(round(as.matrix(mean[-1]), 1), " (",
round(as.matrix(sd[-1]), 2), ")")
Also, this can be done in one step instead of creating multiple datasets
library(dplyr)
library(stringr)
df %>%
group_by(Species) %>%
summarise_all(list(~ str_c(round(mean(.), 2), " (", round(sd(.), 2), ")")))
# A tibble: 3 x 5
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# <fct> <chr> <chr> <chr> <chr>
#1 setosa 5.01 (0.35) 3.43 (0.38) 1.46 (0.17) 0.25 (0.11)
#2 versicolor 5.94 (0.52) 2.77 (0.31) 4.26 (0.47) 1.33 (0.2)
#3 virginica 6.59 (0.64) 2.97 (0.32) 5.55 (0.55) 2.03 (0.27)
Using mapply we can paste the values.
df1 <- sd
df1[-1] <- mapply(function(x, y) paste0(round(x, 2), "(", round(y, 2), ")"), mean[-1], sd[-1])
df1
# Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 setosa 5.01(0.35) 3.43(0.38) 1.46(0.17) 0.25(0.11)
#2 versicolor 5.94(0.52) 2.77(0.31) 4.26(0.47) 1.33(0.2)
#3 virginica 6.59(0.64) 2.97(0.32) 5.55(0.55) 2.03(0.27)
Better to use different names for your variables than mean and sd since those are functions in R.
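For a base R route that needs no intermediate mean and sd tables at all, the formatting can be done inside a single aggregate call. This is a sketch rather than a definitive solution; the helper name mean_sd is made up, and the sprintf format is chosen to match the layout requested in the question (one decimal for the mean, two for the sd).

```r
# Format "mean (sd)" for one numeric vector
mean_sd <- function(x) sprintf("%.1f (%.2f)", mean(x), sd(x))

# Apply it to every non-grouping column, per species
res <- aggregate(. ~ Species, data = iris, FUN = mean_sd)
res
```

The resulting columns are character, so this form is suited to reporting rather than further computation.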
I know this question is six months old, but I came across it while researching what to do with my own data. I guess this is an option too:
df %>%
rstatix::get_summary_stats(show = c("mean", "sd", "median", "iqr")) %>%
dplyr::mutate_if(is.numeric, round, digits = 2) %>%
dplyr::mutate(mean2 = stringr::str_glue("{mean} ({sd})"))
I have a data frame that I need to group by a combination of column entries in order to conditionally mutate several columns using only an if statement (without an else condition).
More specifically, I want to sum up the column values of a certain group if they cross a pre-defined threshold, otherwise the values should remain unchanged.
I have tried doing this using both if_else and case_when but these functions require either a "false" argument (if_else) or by default set values that are not matched to NA (case_when):
iris_mutated <- iris %>%
dplyr::group_by(Species) %>%
dplyr::mutate(Sepal.Length=if_else(sum(Sepal.Length)>250, sum(Sepal.Length)),
Sepal.Width=if_else(sum(Sepal.Width)>170, sum(Sepal.Width)),
Petal.Length=if_else(sum(Petal.Length)>70, sum(Petal.Length)),
Petal.Width=if_else(sum(Petal.Width)>15, sum(Petal.Width)))
iris_mutated <- iris %>%
dplyr::group_by(Species) %>%
dplyr::mutate(Sepal.Length=case_when(sum(Sepal.Length)>250 ~ sum(Sepal.Length)),
Sepal.Width=case_when(sum(Sepal.Width)>170 ~ sum(Sepal.Width)),
Petal.Length=case_when(sum(Petal.Length)>70 ~ sum(Petal.Length)),
Petal.Width=case_when(sum(Petal.Width)>15 ~ sum(Petal.Width)))
Any ideas how to do this instead?
Edit:
Here is an example for the expected output.
The sum of the petal width for all species-wise grouped entries is 12.3 for setosa, 101.3 for virginica and 66.3 for versicolor. If I require that this sum should be at least 15 for the values to be summed up (otherwise the original value should be kept), then I expect the following output (only showing the columns "Petal.Width" and "Species"):
Petal.Width Species
1 0.2 setosa
2 0.2 setosa
3 0.2 setosa
4 0.2 setosa
5 0.2 setosa
6 0.4 setosa
7 0.3 setosa
8 0.2 setosa
9 0.2 setosa
10 0.1 setosa
#...#
50 0.2 setosa
51 66.3 versicolor
52 66.3 versicolor
53 66.3 versicolor
#...#
100 66.3 versicolor
101 101.3 virginica
102 101.3 virginica
103 101.3 virginica
#...#
150 101.3 virginica
I think you are after this? Using Johnny's method. You shouldn't hit an error when you use the original value as part of case_when in the case when the sum is not greater than the cutoff...
iris_mutated <- iris %>%
group_by(Species) %>%
mutate(Sepal.Length = case_when(sum(Sepal.Length) > 250 ~ sum(Sepal.Length),
TRUE ~ Sepal.Length),
Sepal.Width = case_when(sum(Sepal.Width) > 170 ~ sum(Sepal.Width),
TRUE ~ Sepal.Width),
Petal.Length = case_when(sum(Petal.Length) > 70 ~ sum(Petal.Length),
TRUE ~ Petal.Length),
Petal.Width = case_when(sum(Petal.Width) > 15 ~ sum(Petal.Width),
TRUE ~ Petal.Width))
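The same rule can also be written once and reused for every column. The sketch below uses base R's ave() instead of dplyr; the helper name sum_if_above and the cutoffs vector are illustrative, with the cutoff values taken from the question.

```r
# Replace each value by its group total when that total exceeds `cutoff`;
# otherwise keep the original value.
sum_if_above <- function(x, group, cutoff) {
  totals <- ave(x, group, FUN = sum)  # group total repeated on every row
  ifelse(totals > cutoff, totals, x)
}

cutoffs <- c(Sepal.Length = 250, Sepal.Width = 170,
             Petal.Length = 70, Petal.Width = 15)

iris_mutated <- iris
for (col in names(cutoffs)) {
  iris_mutated[[col]] <- sum_if_above(iris[[col]], iris$Species, cutoffs[[col]])
}

iris_mutated[c(1, 51, 101), c("Petal.Width", "Species")]
```

Keeping the cutoffs in a named vector means a new column only needs one extra entry rather than another copy of the condition.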
I am trying to reproduce the analysis given in this blog post for the by() function. When I paste the code into R I get an error message, however, instead of the nice table of summarised iris data on the blog post.
attach(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
So the data frame's there and all is well.
Pasting in the by() function from the blog gives me this error:
by(iris[, 1:4], Species, mean)
Species: setosa
[1] NA
----------------------------------------------------------------------------------
Species: versicolor
[1] NA
----------------------------------------------------------------------------------
Species: virginica
[1] NA
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
I really can't see what's wrong here. I've tried it with other data frames and so on and the problem seems to be with the 1:4 sequence in the indexing for the data frame. If I just specify one column it gives me the means no problem. I can't work out why it's spitting its dummy when given more than one column. Any suggestions?
I am not sure how old the blog post is, but according to the documentation of by, the function behaves differently from what the blog post describes.
by splits the input data into subsetted data frames, but you cannot take the mean of a data frame!
mean(iris[,1:4])
[1] NA
Warning message:
In mean.default(iris[, 1:4]) :
argument is not numeric or logical: returning NA
You can use by if you want the mean of the values in a single column:
by(iris[,1], iris$Species, mean)
iris$Species: setosa
[1] 5.006
---------------------------------------------------------------------------------------------
iris$Species: versicolor
[1] 5.936
---------------------------------------------------------------------------------------------
iris$Species: virginica
[1] 6.588
But to get the means for all columns, use aggregate as suggested by @Thomas.
I'm not sure how that blog post got that answer, because R produces the same output for me as it does for you. Consider aggregate instead:
> aggregate(. ~ Species, iris, mean)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
The warning message tells you that mean.default is where the problem arises. If you want to know why mean.default behaves this way, you can look at its source:
> mean.default
function (x, trim = 0, na.rm = FALSE, ...)
{
if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
warning("argument is not numeric or logical: returning NA")
return(NA_real_)}
...
by() does what it is supposed to, but mean() fails because the data frame it is passed does not pass the is.numeric() test.
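If by() is still preferred, passing a function that accepts a data frame avoids the problem. A minimal sketch using colMeans, which works on the all-numeric subsets that by() hands over:

```r
# colMeans accepts a data frame of numeric columns, unlike mean()
res <- by(iris[, 1:4], iris$Species, colMeans)

# Stack the per-species vectors into one matrix, one row per species
means <- do.call(rbind, res)
means
```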