How to get a summary table with totals - r

I wonder if there is a more efficient way to get a summary table including totals.
I made a four step procedure here.
data<-iris %>% group_by(Species) %>%
summarise(
Sepal.Len = paste(format(round(median(Sepal.Length),2),nsmall=2) ),
P.len = paste(format(round(median(Petal.Length),2),nsmall=2) ) ,
counts=n() )
datatotal<-iris %>% group_by(.) %>%
summarize(
Sepal.Len = paste(format(round(median(Sepal.Length),2),nsmall=2) ),
P.len = paste(format(round(median(Petal.Length),2),nsmall=2) ) ,
counts=n() )
datatotal<-cbind(Species="Total",datatotal)
final<-rbind(data,datatotal)
final
# A tibble: 4 × 4
Species Sepal.Len P.len counts
* <fctr> <chr> <chr> <int>
1 setosa 5.00 1.50 50
2 versicolor 5.90 4.35 50
3 virginica 6.50 5.55 50
4 Total 5.80 4.35 150

A further improvement on #Richard's answer where everything is in one chain:
iris %>%
group_by(Species) %>%
summarise(
Sepal.Len = median(Sepal.Length),
P.len = median(Petal.Length) ,
counts = n()
) %>%
bind_rows(., iris %>%
summarize(
Sepal.Len = median(Sepal.Length),
P.len = median(Petal.Length) ,
counts = n()
) %>%
mutate(Species = "Total")
) %>%
mutate_each(funs(format(., nsmall = 2, digits = 2)), 2:3)
the result:
# A tibble: 4 × 4
Species Sepal.Len P.len counts
<chr> <chr> <chr> <int>
1 setosa 5.00 1.50 50
2 versicolor 5.90 4.35 50
3 virginica 6.50 5.55 50
4 Total 5.80 4.35 150
Another alternative is using the margins parameter of dcast from the reshape2 package:
dcast(transform(melt(iris, id.vars = 'Species', measure.vars = c('Sepal.Length','Petal.Length')),
counts = ave(value, variable, Species, FUN = length)),
Species + counts ~ variable,
fun.aggregate = median,
margins = 'Species')
the result (unfortunately not exactly as described):
Species counts Sepal.Length Petal.Length
1 setosa 50 5.0 1.50
2 versicolor 50 5.9 4.35
3 virginica 50 6.5 5.55
4 (all) (all) 5.8 4.35

You can simplify the code, by moving the formatting to the final object etc, but it won't make it much faster
data <- iris %>% group_by(Species) %>%
summarise(
Sepal.Len = median(Sepal.Length),
P.len = median(Petal.Length) ,
counts = n()
)
datatotal <- iris %>%
summarize(
Sepal.Len = median(Sepal.Length),
P.len = median(Petal.Length) ,
counts = n()
) %>%
mutate(Species = "Total")
final <- rbind(data, datatotal)
format(final, nsmall = 2, digits = 2)

Related

Dplyr: Count number of observations in group and summarise?

Im wondering if there is a more elegant way to perform this.
Right now, I am grouping all observations by Species. Then I summarize the median values.
median <- iris %>%
group_by(Species) %>%
summarise(medianSL = median(Sepal.Length),
medianSW = median(Sepal.Width),
medianPL = median(Petal.Length),
medianPW = median(Petal.Width))
I also wanted a column (n) that shows the amount of flowers in each row:
median_n <- iris %>%
group_by(Species) %>%
tally()
Can I combine these two code chunks? So that way the above code chunk will generate a table with the median lengths AND the total n for each row?
We may use across in summarise to loop over the numeric columns to get the median as well as create a frequency count with n() outside the across
library(dplyr)
library(stringr)
iris %>%
group_by(Species) %>%
summarise(across(where(is.numeric),
~ median(.x, na.rm = TRUE),
.names = "median{str_remove_all(.col, '[a-z.]+')}"),
n = n(), .groups = "drop")
-output
# A tibble: 3 × 6
Species medianSL medianSW medianPL medianPW n
<fct> <dbl> <dbl> <dbl> <dbl> <int>
1 setosa 5 3.4 1.5 0.2 50
2 versicolor 5.9 2.8 4.35 1.3 50
3 virginica 6.5 3 5.55 2 50

How to create a function to get summary statistics as columns?

I have three workflows to get Mean, Standard Deviation, and Variance. Would it be possible to simplify this by creating one function with one table with all the summaries as the result?
Mean
iris %>%
select(-Species) %>%
summarise_all( , mean, na.rm = TRUE) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("Name") %>%
rename(Mean = V1)
Standard Deviation
iris %>%
select(-Species) %>%
summarise_all(., sd, na.rm = TRUE) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("Name") %>%
rename(SD = V1)
Variance
iris %>%
select(-Species) %>%
summarise_all(., var, na.rm = TRUE) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("Name") %>%
rename(Variance = V1)
We could reshape to 'long' format and then do a group by operation to create the three summarise columns
library(dplyr)
library(tidyr)
iris %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "Name") %>%
group_by(Name) %>%
summarise(Mean = mean(value, na.rm = TRUE),
SD = sd(value, na.rm = TRUE),
Variance = var(value, na.rm = TRUE))
-output
# A tibble: 4 × 4
Name Mean SD Variance
<chr> <dbl> <dbl> <dbl>
1 Petal.Length 3.76 1.77 3.12
2 Petal.Width 1.20 0.762 0.581
3 Sepal.Length 5.84 0.828 0.686
4 Sepal.Width 3.06 0.436 0.190
iris %>%
select(-Species) %>%
summarise_all(list(mean = mean,sd = sd, var = var), na.rm = TRUE)%>%
pivot_longer(everything(), names_sep = '_', names_to = c('Name','.value'))
# A tibble: 4 x 4
Name mean sd var
<chr> <dbl> <dbl> <dbl>
1 Sepal.Length 5.84 0.828 0.686
2 Sepal.Width 3.06 0.436 0.190
3 Petal.Length 3.76 1.77 3.12
4 Petal.Width 1.20 0.762 0.581

Evaluate Multiple Lines in Dplyr

I have dataset which shows Variables, calculation I want to perform (sum, no. of distinct values) and new variable names after the calculation.
library(dplyr)
RefDf <- read.table(text = "Variables Calculation NewVariable
Sepal.Length sum Sepal.Length2
Petal.Length n_distinct Petal.LengthNew
", header = T)
Manual Approach - Summarise by grouping of Species variable.
iris %>% group_by_at("Species") %>%
summarise(Sepal.Length2 = sum(Sepal.Length,na.rm = T),
Petal.LengthNew = n_distinct(Petal.Length, na.rm = T)
)
Automate via eval(parse( ))
x <- RefDf %>% mutate(Check = paste0(NewVariable, " = ", Calculation, "(", Variables, ", na.rm = T", ")")) %>% pull(Check)
iris %>% group_by_at("Species") %>% summarise(eval(parse(text = x)))
As of now it is returning -
Species `eval(parse(text = x))`
<fct> <int>
1 setosa 9
2 versicolor 19
3 virginica 20
It should return -
Species Sepal.Length2 Petal.LengthNew
<fct> <dbl> <int>
1 setosa 250. 9
2 versicolor 297. 19
3 virginica 329. 20
You can use parse_exprs:
library(tidyverse)
library(rlang)
RefDf <- read.table(text = "Variables Calculation NewVariable
Sepal.Length sum Sepal.Length2
Petal.Length n_distinct Petal.LengthNew
", header = T)
#
expr_txt <- set_names(str_c(RefDf$Calculation, "(", RefDf$Variables, ")"),
RefDf$NewVariable)
iris %>%
group_by_at("Species") %>%
summarise(!!!parse_exprs(expr_txt), .groups = "drop")
## A tibble: 3 x 3
#Species Sepal.Length2 Petal.LengthNew
#<fct> <dbl> <int>
#1 setosa 250. 9
#2 versicolor 297. 19
#3 virginica 329. 20
Updated
I found a way of sparing those extra lines.
This is just another way of getting your desired result. I'd rather create a function call for every row of your data set and then iterate over it beside the new column names to get to the desired output:
library(dplyr)
library(rlang)
library(purrr)
# First we create a new variable which is actually of type call in your data set
RefDf %>%
rowwise() %>%
mutate(Call = list(call2(Calculation, parse_expr(Variables)))) -> Rf
Rf
# A tibble: 2 x 4
# Rowwise:
Variables Calculation NewVariable Call
<chr> <chr> <chr> <list>
1 Sepal.Length sum Sepal.Length2 <language>
2 Petal.Length n_distinct Petal.LengthNew <language>
# Then we iterate over `NewVariable` and `Call` at the same time to set the new variable
# name and also evaluate the `call` at the same time
map2(Rf$NewVariable, Rf$Call, ~ iris %>% group_by(Species) %>%
summarise(!!.x := eval_tidy(.y))) %>%
reduce(~ left_join(.x, .y, by = "Species"))
# A tibble: 3 x 3
Species Sepal.Length2 Petal.LengthNew
<fct> <dbl> <int>
1 setosa 250. 9
2 versicolor 297. 19
3 virginica 329. 20

Data Frame to matrix - many rows

I'm trying to convert data.frame to matrix. I calculated some statistics for iris dataset and want every statistics to be placed in seperate row. Code below shows all statistics (avg and median) in one single row and that's not a desired output. I want sth like this:
stat Sepal.Lenght Sepal.Width ....
avg 10.5 .....
med ...... .....
Code below:
data_iris <- iris
avg <- data_iris %>%
summarise_at(vars(Sepal.Length:Petal.Width),mean,na.rm=TRUE)
med <- data_iris %>%
summarise_at(vars(Sepal.Length:Petal.Width),median,na.rm=TRUE)
column <- colnames(data_iris[1:4])
rown <- c("avg","median")
df <- data.frame(avg=avg,med=med)
m <- data.matrix(df)
And additional question: I'd like to calculate quantiles but error comes up:
qrtl <- data_iris %>%
summarise_at(vars(Sepal.Length:Petal.Width),quantile,na.rm=TRUE)
error: Column Sepal.Length must be length 1 (a summary value), not 5
What's wrong?
It can be done if we do a reshape into 'long' with pivot_longer
library(dplyr)
library(tidyr)
iris %>%
summarise_if(is.numeric, list(avg = mean, med = median)) %>%
pivot_longer(everything(), names_to = c('.value', 'stat'), names_sep="_")
# stat Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 avg 5.843333 3.057333 3.758 1.199333
#2 med 5.800000 3.000000 4.350 1.300000
If it needs to be converted to matrix, then change the 'stat' to rownames and then use data.matrix
library(tibble)
iris %>%
summarise_if(is.numeric, list(avg = mean, med = median)) %>%
pivot_longer(everything(), names_to = c('.value', 'stat'), names_sep="_") %>%
column_to_rownames('stat') %>%
data.matrix
The quantile is working fine in the dev version of dplyr - 0.8.99.9000`
iris %>%
summarise_at(vars(Sepal.Length:Petal.Width),quantile, na.rm=TRUE)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 4.3 2.0 1.00 0.1
#2 5.1 2.8 1.60 0.3
#3 5.8 3.0 4.35 1.3
#4 6.4 3.3 5.10 1.8
#5 7.9 4.4 6.90 2.5
The OP's package version is 0.8.3, so may be wrapping with list would work
iris %>%
summarise_at(vars(Sepal.Length:Petal.Width),
list(quantile = ~ list(quantile(., na.rm=TRUE)))) %>%
unnest(c(names(.)))
We can use map with transpose and then bind rows from different statistics together.
library(purrr)
map(data_iris[1:4], ~list(mean = mean(.x), sd = sd(.x))) %>%
transpose() %>%
dplyr::bind_rows(.id = "statistics")
# A tibble: 2 x 5
# statistics Sepal.Length Sepal.Width Petal.Length Petal.Width
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 mean 5.84 3.06 3.76 1.20
#2 sd 0.828 0.436 1.77 0.762
Or
map_df(data_iris[1:4], ~c(mean = mean(.x), sd = sd(.x)))

Summarize data at different aggregate levels - R and tidyverse

I'm creating a bunch of basic status reports and one of things I'm finding tedious is adding a total row to all my tables. I'm currently using the Tidyverse approach and this is an example of my current code. What I'm looking for is an option to have a few different levels included by default.
#load into RStudio viewer (not required)
iris = iris
#summary at the group level
summary_grouped = iris %>%
group_by(Species) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))
#summary at the overall level
summary_overall = iris %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
mutate(Species = "Overall")
#append results for report
summary_table = rbind(summary_grouped, summary_overall)
Doing this multiple times over is very tedious. I kind of want:
summary_overall = iris %>%
group_by(Species, total = TRUE) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))
FYI - if you're familiar with SAS I'm looking for the same type of functionality available via a class, ways or types statements in proc means that let me control the level of summarization and get multiple levels in one call.
Any help is appreciated. I know I can create my own function, but was hoping there is something that already exists. I would also prefer to stick with the tidyverse style of programming though I'm not set on that.
Another alternative:
library(tidyverse)
iris %>%
mutate_at("Species", as.character) %>%
list(group_by(.,Species), .) %>%
map(~summarize(.,mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))) %>%
bind_rows() %>%
replace_na(list(Species="Overall"))
#> # A tibble: 4 x 3
#> Species mean_s_length max_s_width
#> <chr> <dbl> <dbl>
#> 1 setosa 5.01 4.4
#> 2 versicolor 5.94 3.4
#> 3 virginica 6.59 3.8
#> 4 Overall 5.84 4.4
You can write a function which does the same summarize on an ungrouped tibble and rbinds that to the end.
summarize2 <- function(df, ...){
bind_rows(summarise(df, ...), summarize(ungroup(df), ...))
}
iris %>%
group_by(Species) %>%
summarize2(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)
)
# # A tibble: 4 x 3
# Species mean_s_length max_s_width
# <fct> <dbl> <dbl>
# 1 setosa 5.01 4.4
# 2 versicolor 5.94 3.4
# 3 virginica 6.59 3.8
# 4 NA 5.84 4.4
You could add some logic for what the "Overall" groups should be named if you want
summarize2 <- function(df, ...){
s1 <- summarise(df, ...)
s2 <- summarize(ungroup(df), ...)
for(v in group_vars(s1)){
if(is.factor(s1[[v]]))
s1[[v]] <- as.character(s1[[v]])
if(is.character(s1[[v]]))
s2[[v]] <- 'Overall'
else if(is.numeric(s1[[v]]))
s2[[v]] <- -Inf
}
bind_rows(s1, s2)
}
iris %>%
group_by(Species, g = Petal.Length %/% 1) %>%
summarize2(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)
)
# # Groups: Species [4]
# Species g mean_s_length max_s_width
# <chr> <dbl> <dbl> <dbl>
# 1 setosa 1 5.01 4.4
# 2 versicolor 3 5.35 2.9
# 3 versicolor 4 6.09 3.4
# 4 versicolor 5 6.35 3
# 5 virginica 4 5.85 3
# 6 virginica 5 6.44 3.4
# 7 virginica 6 7.43 3.8
# 8 Overall -Inf 5.84 4.4
library(dplyr)
iris %>%
group_by(Species) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
ungroup() %>%
mutate_at(vars(Species), as.character) %>%
{rbind(.,c("Overal",mean(.$mean_s_length),max(.$max_s_width)))} %>%
mutate_at(vars(-Species), as.double) %>%
mutate_at(vars(Species), as.factor)
#> # A tibble: 4 x 3
#> Species mean_s_length max_s_width
#> <fct> <dbl> <dbl>
#> 1 setosa 5.01 4.4
#> 2 versicolor 5.94 3.4
#> 3 virginica 6.59 3.8
#> 4 Overal 5.84 4.4
Created on 2019-06-21 by the reprex package (v0.3.0)
One way, also tedious but in one longer pipe, is to put the second summarise instructions in bind_rows.
The as.character call avoids a warning:
Warning messages:
1: In bind_rows_(x, .id) :
binding factor and character vector, coercing into character vector
2: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
library(tidyverse)
summary_grouped <- iris %>%
mutate(Species = as.character(Species)) %>%
group_by(Species) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
bind_rows(iris %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
mutate(Species = "Overall"))
## A tibble: 4 x 3
# Species mean_s_length max_s_width
# <chr> <dbl> <dbl>
#1 setosa 5.01 4.4
#2 versicolor 5.94 3.4
#3 virginica 6.59 3.8
#4 Overall 5.84 4.4
Maybe something like this:
As you want to perform different operations on the same input (iris), best to map over the different summary functions and apply to the data.
map_dfr combines the list outputs using bind_rows
library(dplyr)
library(purrr)
pipe <- . %>%
group_by(Species) %>%
summarize(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))
map_dfr(
list(pipe, . %>% mutate(Species = "Overall") %>% pipe),
exec,
iris)
#> Warning in bind_rows_(x, .id): binding factor and character vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> # A tibble: 4 x 3
#> Species mean_s_length max_s_width
#> <chr> <dbl> <dbl>
#> 1 setosa 5.01 4.4
#> 2 versicolor 5.94 3.4
#> 3 virginica 6.59 3.8
#> 4 Overall 5.84 4.4
Solution where you need to apply wanted function only once on a double dataset:
library(tidyverse)
iris %>%
rbind(mutate(., Species = "Overall")) %>%
group_by(Species) %>%
summarize(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)
)
# A tibble: 4 x 3
Species mean_s_length max_s_width
<chr> <dbl> <dbl>
1 Overall 5.84 4.4
2 setosa 5.01 4.4
3 versicolor 5.94 3.4
4 virginica 6.59 3.8
Trick is to pass original dataset with a new group ID (ie Species): mutate(iris, Species = "Overall")

Resources