R: aggregate data while adding new count column using base R [duplicate] - r

I would like to aggregate a data frame while also adding in a new column (N) that counts the number of rows per value of the grouping variable, in base R.
This is trivial in dplyr:
library(dplyr)
data(iris)
combined_summary <- iris %>% group_by(Species) %>% group_by(N = n(), .add = TRUE) %>% summarize_all(mean)
> combined_summary
# A tibble: 3 x 6
# Groups: Species [3]
Species N Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 setosa 50 5.01 3.43 1.46 0.246
2 versicolor 50 5.94 2.77 4.26 1.33
3 virginica 50 6.59 2.97 5.55 2.03
I am however in the unfortunate position of having to write this code in an environment that doesn't allow for packages to be used (don't ask; it's not my decision). So I need a way to do this in base R.
I can do it in base R in a long-winded way as follows:
# First create the aggregated tables separately
summary_means <- aggregate(. ~ Species, data=iris, FUN=mean)
summary_count <- aggregate(Sepal.Length ~ Species, data=iris[, c("Species", "Sepal.Length")], FUN=length)
> summary_means
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
> summary_count
Species Sepal.Length
1 setosa 50
2 versicolor 50
3 virginica 50
# Then rename the count column
colnames(summary_count)[2] <- "N"
> summary_count
Species N
1 setosa 50
2 versicolor 50
3 virginica 50
# Finally merge the two dataframes
combined_summary_baseR <- merge(x=summary_count, y=summary_means, by="Species", all.x=TRUE)
> combined_summary_baseR
Species N Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 50 5.006 3.428 1.462 0.246
2 versicolor 50 5.936 2.770 4.260 1.326
3 virginica 50 6.588 2.974 5.552 2.026
Is there any way to do this in a more efficient way in base R?

Here is a base R option using a single by call (to aggregate)
do.call(rbind, by(
iris[-ncol(iris)], iris[ncol(iris)], function(x) c(N = nrow(x), colMeans(x))))
# N Sepal.Length Sepal.Width Petal.Length Petal.Width
#setosa 50 5.006 3.428 1.462 0.246
#versicolor 50 5.936 2.770 4.260 1.326
#virginica 50 6.588 2.974 5.552 2.026
Using colMeans ensures that the column names are carried through which avoids an additional setNames call.
Update
In response to your comment, to have row names as a separate column requires an extra step.
d <- do.call(rbind, by(
iris[-ncol(iris)], iris[ncol(iris)], function(x) c(N = nrow(x), colMeans(x))))
cbind(Species = rownames(d), as.data.frame(d))
This is not as concise as the initial by call. I think we're having a clash of philosophies here. In dplyr (and the tidyverse), row names are generally avoided, to be consistent with the principles of "tidy data". In base R, row names are common and are (more or less) consistently carried through data operations. So in a way you're asking for a mix of dplyr (tidy) and base R data structure concepts, which may not be the best or most robust approach.
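If a Species column rather than row names is the goal, another base R sketch is to keep the aggregate() result and attach the counts from table(). This is only an illustration: the names means, counts and combined are made up here, and it assumes there are no missing values (the formula interface of aggregate() drops incomplete rows, while table() counts every row).

```r
# Group means, one row per Species
means <- aggregate(. ~ Species, data = iris, FUN = mean)

# Row counts per group; looked up by name so they line up with the rows of `means`
counts <- table(iris$Species)

combined <- cbind(
  means["Species"],
  N = as.vector(counts[as.character(means$Species)]),
  means[setdiff(names(means), "Species")]
)
combined
```

This avoids the second aggregate() call, the renaming step and the merge, at the cost of relying on the group labels to match the counts to the means.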

Related

Error when using dplyr {{ }} with aggregate inside a function

I am trying to use aggregate inside a function by using dplyr's {{ }} notation to select the column to aggregate on.
filter <- function(df, level) {
df <- aggregate(.~ {{level}}, data=df, FUN=sum)
return(df)
}
however I get the error
Error in model.frame.default(formula = cbind(phylum, '12K1B.txt', '12K2B.txt', :
variable lengths differ (found for '{{ level }}')
I have double checked my data and there are no missing or NA values and everything works as expected when I run it outside of the function so I am not sure what is causing the error.
{{ }} is tidyverse (tidy evaluation) syntax and only works inside functions that support it, such as dplyr verbs. Base R's aggregate does not understand it.
If we want to achieve something like this
aggregate(. ~ Species, data = iris, sum)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 250.3 171.4 73.1 12.3
2 versicolor 296.8 138.5 213.0 66.3
3 virginica 329.4 148.7 277.6 101.3
We can make a formula on the fly, manipulating as text like so
aggregate_var <- function(df, level) {
level <- deparse(substitute(level))
aggregate(formula(paste(". ~", level)), data=df, FUN=sum)
}
aggregate_var(iris, Species)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 250.3 171.4 73.1 12.3
2 versicolor 296.8 138.5 213.0 66.3
3 virginica 329.4 148.7 277.6 101.3
As an aside: filter is already a widely used function name (for example stats::filter and dplyr::filter), so a more descriptive name is a better choice. Also note that the explicit return statement and the assignment to df are not needed here.
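You may use get to achieve this, passing the column name as a string. Note that, as the output shows, the grouping column is then labelled get(level), and the original Species column is still matched by the dot and summed as its integer factor codes.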
funn <- function(df, level){
df <- aggregate(.~ get(level), data=df, FUN=sum)
return(df)
}
funn(iris, "Species")
get(level) Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 setosa 250.3 171.4 73.1 12.3 50
2 versicolor 296.8 138.5 213.0 66.3 100
3 virginica 329.4 148.7 277.6 101.3 150
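In the output above, the grouping column is labelled get(level) and Species itself ends up summed. A possible way around both, sketched here under the assumption that the column name arrives as a string, is to skip the formula and use the by argument of aggregate(). The function name sum_by_group is hypothetical.

```r
# Sum every column except the grouping column, which is named by a string
sum_by_group <- function(df, level) {
  value_cols <- setdiff(names(df), level)
  # df[level] is a one-column data frame, so its name is kept in the result
  aggregate(df[value_cols], by = df[level], FUN = sum)
}

res <- sum_by_group(iris, "Species")
res
```

Because the grouping column is excluded from the data being summed, the result keeps a properly named Species column and only the four measurement columns.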
To complement the other answers already provided, if you wanted to use {{ }}, the dplyr way is:
library(dplyr)
my_fun <- function(df, level) {
df |>
group_by({{ level }}) |>
summarize(across(everything(), sum))
}
my_fun(iris, Species)
# A tibble: 3 x 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 250. 171. 73.1 12.3
2 versicolor 297. 138. 213 66.3
3 virginica 329. 149. 278. 101.

Combine dataframes for means and sd's into one dataframe with sd in brackets after the mean

I would like to create a data frame with several different columns containing means, after which the sd is shown in brackets. To give an example:
df <- iris
mean <- aggregate(df[,1:4], list(iris$Species), mean)
sd <- aggregate(df[,1:4], list(iris$Species), sd)
view(mean)
Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
view(sd)
Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 0.3524897 0.3790644 0.1736640 0.1053856
2 versicolor 0.5161711 0.3137983 0.4699110 0.1977527
3 virginica 0.6358796 0.3224966 0.5518947 0.2746501
Now I would like to have something like this:
Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.0 (0.35) 3.4 (0.38) 1.5 (0.17) 0.2 (0.11)
2 versicolor 5.9 (0.52) 2.8 (0.31) 4.3 (0.47) 1.3 (0.20)
3 virginica 6.6 (0.64) 3.0 (0.32) 5.6 (0.55) 2.0 (0.27)
I reckon there should be a way using the paste function, but I can't figure out how.
We can convert the data to matrix and apply paste directly
dfN <- mean
dfN[-1] <- paste0(round(as.matrix(mean[-1]), 1), " (",
round(as.matrix(sd[-1]), 2), ")")
Also, this can be done in one step instead of creating multiple datasets
library(dplyr)
library(stringr)
df %>%
group_by(Species) %>%
summarise_all(list(~ str_c(round(mean(.), 2), " (", round(sd(.), 2), ")")))
# A tibble: 3 x 5
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# <fct> <chr> <chr> <chr> <chr>
#1 setosa 5.01 (0.35) 3.43 (0.38) 1.46 (0.17) 0.25 (0.11)
#2 versicolor 5.94 (0.52) 2.77 (0.31) 4.26 (0.47) 1.33 (0.2)
#3 virginica 6.59 (0.64) 2.97 (0.32) 5.55 (0.55) 2.03 (0.27)
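The same one-step idea can also be sketched in base R by letting aggregate() call a small formatting function. The helper name fmt_mean_sd is made up for this illustration, and the rounding (one decimal for the mean, two for the sd) follows the layout requested in the question.

```r
# Format one numeric vector as "mean (sd)"
fmt_mean_sd <- function(x) sprintf("%.1f (%.2f)", mean(x), sd(x))

# Apply it to every measurement column within each species
mean_sd_table <- aggregate(. ~ Species, data = iris, FUN = fmt_mean_sd)
mean_sd_table
```

Using sprintf() rather than round() keeps trailing zeros, so a value such as 0.20 is not shortened to 0.2.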
Using mapply we can paste the values.
df1 <- sd
df1[-1] <- mapply(function(x, y) paste0(round(x, 2), "(", round(y, 2), ")"), mean[-1], sd[-1])
df1
# Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 setosa 5.01(0.35) 3.43(0.38) 1.46(0.17) 0.25(0.11)
#2 versicolor 5.94(0.52) 2.77(0.31) 4.26(0.47) 1.33(0.2)
#3 virginica 6.59(0.64) 2.97(0.32) 5.55(0.55) 2.03(0.27)
Better to use different names for your variables than mean and sd since those are functions in R.
I know this question is six months old, but I came across it while researching what to do with my own data, so here is another option:
df %>%
rstatix::get_summary_stats(show = c("mean", "sd", "median", "iqr")) %>%
dplyr::mutate_if(is.numeric, round, digits = 2) %>%
dplyr::mutate(mean2 = stringr::str_glue("{mean} ({sd})"))

Aggregate function not working with paste0 in R

I have a data frame subdist.df that has data for sub-districts. I am trying to sum up the values of rows based on a common attribute in the data frame, i.e. the DISTRICT column.
The following line of code works
hello2 <-aggregate(.~DISTRICT, subdist.df,sum)
But this one does not.
hello <-aggregate(noquote(paste0(".~","DISTRICT")), subdist.df,sum)
I am unable to understand why this is the case. I need to use it in a function wherein DISTRICT can be any input from the user as an argument.
Using iris data.frame as an example:
aggregate(.~Species, iris, sum)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 250.3 171.4 73.1 12.3
2 versicolor 296.8 138.5 213.0 66.3
3 virginica 329.4 148.7 277.6 101.3
The following paste0 call doesn't work, because noquote only returns a character string of class "noquote" (it changes how the string is printed), not a formula. aggregate therefore does not use its formula method:
aggregate(noquote(paste0(".~","Species")), iris, sum)
Error in aggregate.data.frame(as.data.frame(x), ...) :
arguments must have same length
Instead, wrapping the paste0 call in as.formula works:
aggregate(as.formula(paste0(".~","Species")), iris, sum)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 250.3 171.4 73.1 12.3
2 versicolor 296.8 138.5 213.0 66.3
3 virginica 329.4 148.7 277.6 101.3
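Since the grouping column is meant to come from a user-supplied argument, the as.formula approach can be wrapped in a function. The following is a minimal sketch: the function name sum_by_formula and the input checks are assumptions added for illustration, and it assumes the column name is syntactically valid (a name containing spaces would need backticks in the pasted formula).

```r
# Sum all other columns by a grouping column whose name is given as a string
sum_by_formula <- function(df, group_col) {
  stopifnot(is.character(group_col), length(group_col) == 1L,
            group_col %in% names(df))
  f <- as.formula(paste(". ~", group_col))
  aggregate(f, data = df, FUN = sum)
}

species_sums <- sum_by_formula(iris, "Species")
species_sums
```

With the data from the question, the call would presumably look like sum_by_formula(subdist.df, "DISTRICT").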

How to use tapply from within a for loop

I have a data.frame "df" which has 200 observations and 18 columns.
The 18 columns are var1, var2, etc....
When I use:
tapply(df$var1, INDEX=df$varX, FUN=mean, na.rm=T)
where varX is one particular (fixed) variable of type string, I get the mean of var1 for each value of varX.
My question is:
How can I put the above command in a for loop so that it covers all the variables (var1, var2, etc.) except, of course, varX?
I tried this:
for (k in c(var1, var2, ..., varn)) {
tapply(df$k, INDEX=df$varX, FUN=mean, na.rm=T)
}
But it did not work.
Please note:
I am sure much more effective and elegant methods/scripts can be used, but since I am a beginner, and so much behind, I sometimes try to go ahead and apply some ideas before I finish reading the respective chapter of a book I have. This is why my method(s) sometimes look primitive.
The most direct adaptation of what you are looking for (using iris as the example data frame) is:
for(k in iris[-5]) # we loop through the columns in `iris`, except last
print(tapply(k, INDEX=iris$Species, FUN=mean, na.rm=T))
Which produces:
setosa versicolor virginica
5.006 5.936 6.588
setosa versicolor virginica
3.428 2.770 2.974
setosa versicolor virginica
1.462 4.260 5.552
setosa versicolor virginica
0.246 1.326 2.026
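The loop in the question fails because df$k looks for a column literally named k. If you prefer to loop over column names, which also lets you label the results, a sketch using [[ ]] indexing could look like this. The names group_var, value_vars and means_by_var are made up for the illustration.

```r
group_var <- "Species"
value_vars <- setdiff(names(iris), group_var)

# Collect one vector of group means per variable, keyed by variable name
means_by_var <- list()
for (k in value_vars) {
  # iris[[k]] selects the column whose name is stored in k
  means_by_var[[k]] <- tapply(iris[[k]], INDEX = iris[[group_var]],
                              FUN = mean, na.rm = TRUE)
}

means_by_var$Sepal.Length
```

Storing the results in a named list, rather than printing them inside the loop, keeps them available for later steps.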
Slightly more elegantly using sapply instead of for:
sapply(iris[-5], tapply, INDEX=iris$Species, mean, na.rm=T)
which produces:
Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
But really, you want to use aggregate, dplyr, or data.table as others have suggested:
data.table(iris)[, lapply(.SD, mean, na.rm=TRUE), by=Species]
iris %>% group_by(Species) %>% summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
aggregate(. ~ Species, iris, mean, na.rm = TRUE) # Courtesy David Arenburg
The first two require loading the packages data.table and dplyr respectively.
library(dplyr)
df %>%
na.omit() %>%
group_by(varX) %>%
summarise(across(everything(), mean))
You could use rowsum(), which is one of the fastest base R aggregation functions (although here we'll need to divide it by the counts of the grouping variable to get the mean).
Following BrodieG's example using data(iris) grouped by Species, we can do
grp <- iris$Species
rowsum(iris[-5], grp, na.rm = TRUE) / tabulate(grp, nlevels(grp))
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# setosa 5.006 3.428 1.462 0.246
# versicolor 5.936 2.770 4.260 1.326
# virginica 6.588 2.974 5.552 2.026

Create tabular summary with a total row

Is there an elegant one-liner (using any R package) to accomplish the following?
tab <- aggregate(. ~ Species, data=iris, mean)
total <- data.frame(Species='Overall', t(colMeans(iris[,-5])))
rbind(tab, total)
Package tables
library(tables)
tabular( (Species + 1) ~ All(iris)*(mean),data=iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Species mean mean mean mean
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
All 5.843 3.057 3.758 1.199
I cheated a little, though: this is only a slight adaptation of the example in the help files ;) so credit goes to Duncan Murdoch.
or in sqldf
library(sqldf)
sqldf("
select Species,
avg(Sepal_Length) `Sepal.Length`,
avg(Sepal_Width) `Sepal.Width`,
avg(Petal_Length) `Petal.Length`,
avg(Petal_Width) `Petal.Width`
from iris
group by Species
union all
select 'All',
avg(Sepal_Length) `Sepal.Length`,
avg(Sepal_Width) `Sepal.Width`,
avg(Petal_Length) `Petal.Length`,
avg(Petal_Width) `Petal.Width`
from iris"
)
which could be written a bit more compactly like this:
variables <- "avg(Sepal_Length) `Sepal.Length`,
avg(Sepal_Width) `Sepal.Width`,
avg(Petal_Length) `Petal.Length`,
avg(Petal_Width) `Petal.Width`"
fn$sqldf(" select Species, $variables from iris group by Species
union all select 'All', $variables from iris")
giving
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006000 3.428000 1.462 0.246000
2 versicolor 5.936000 2.770000 4.260 1.326000
3 virginica 6.588000 2.974000 5.552 2.026000
4 All 5.843333 3.057333 3.758 1.199333
The reshape2 package is perhaps a bit slicker here, getting it done in two steps:
library(reshape2)
iris.m <- melt(iris, id.vars = "Species")
dcast(Species ~ variable, data = iris.m, fun.aggregate = mean, margins = "Species")
#-----
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006000 3.428000 1.462 0.246000
2 versicolor 5.936000 2.770000 4.260 1.326000
3 virginica 6.588000 2.974000 5.552 2.026000
4 (all) 5.843333 3.057333 3.758 1.199333
See the details of the margins argument in ?dcast
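For completeness, a base R sketch of the same total row is to stack a copy of the data relabelled as "Overall" under the original and aggregate once. This is not a one-liner, and the names overall, stacked and summary_table are made up for the illustration.

```r
# Copy of the data in which every row belongs to a single "Overall" group
overall <- iris
overall$Species <- factor("Overall")

# rbind() merges the factor levels, placing "Overall" after the species
stacked <- rbind(iris, overall)

summary_table <- aggregate(. ~ Species, data = stacked, FUN = mean)
summary_table
```

The stacking doubles the number of rows, which is harmless for a data set of this size but may matter for large tables.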
