How to add totals as well as group_by statistics in R

When computing a statistic with group_by and summarise, we only get the summary per category, not the value for the whole population (Total). How can I get both?
I am looking for something clean and short. So far I can only think of:
bind_rows(
iris %>% group_by(Species) %>% summarise(
"Mean" = mean(Sepal.Width),
"Median" = median(Sepal.Width),
"sd" = sd(Sepal.Width),
"p10" = quantile(Sepal.Width, probs = 0.1))
,
iris %>% summarise(
"Mean" = mean(Sepal.Width),
"Median" = median(Sepal.Width),
"sd" = sd(Sepal.Width),
"p10" = quantile(Sepal.Width, probs = 0.1)) %>%
mutate(Species = "Total")
)
But I would like something more compact. In particular, I don't want to type the code (for summarize) twice, once for each group and once for the total.

You can simplify it if you untangle what you're trying to do: you have iris data that has several species, and you want that summarized along with data for all species. You don't need to calculate those summary stats before you can bind. Instead, bind iris with a version of iris that's been set to Species = "Total", then group and summarize.
library(tidyverse)
bind_rows(
iris,
iris %>% mutate(Species = "Total")
) %>%
group_by(Species) %>%
summarise(Mean = mean(Sepal.Width),
Median = median(Sepal.Width),
sd = sd(Sepal.Width),
p10 = quantile(Sepal.Width, probs = 0.1))
#> # A tibble: 4 x 5
#> Species Mean Median sd p10
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 3.43 3.4 0.379 3
#> 2 Total 3.06 3 0.436 2.5
#> 3 versicolor 2.77 2.8 0.314 2.3
#> 4 virginica 2.97 3 0.322 2.59
I like the caution in the comments above, though I have to do this sort of calculation for work enough that I have a similar shorthand function in a personal package. It perhaps makes less sense for things like standard deviations, but it's something I need to do a lot for adding up totals of demographic groups, etc. (If it's useful, that function is here).
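To illustrate the shorthand idea mentioned above, here is a minimal sketch of such a helper. It is not the function the answer links to: the name with_total and its arguments are hypothetical, and it assumes the grouping column can safely be converted to character.

```r
library(dplyr)

# Hypothetical helper (not the linked function): append a copy of the data
# whose grouping column is overwritten with a single label such as "Total".
with_total <- function(data, col, label = "Total") {
  total <- data
  total[[col]] <- label                      # every row now belongs to "Total"
  data[[col]] <- as.character(data[[col]])   # avoid mixing factor and character
  bind_rows(data, total)
}

res <- iris %>%
  with_total("Species") %>%
  group_by(Species) %>%
  summarise(Mean = mean(Sepal.Width),
            Median = median(Sepal.Width))
```

Because the total is created before grouping, the summarise call is written only once, whatever statistics it contains.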

This is a bit shorter, though quite similar to the bind_rows approach:
q10 <- function(x){quantile(x , probs=0.1)}
iris %>%
select(Species,Sepal.Width)%>%
group_by(Species) %>%
summarise_all(c("mean", "sd", "q10")) %>%
t() %>%
cbind(c("total", iris %>% select(Sepal.Width) %>% summarise_all(c("mean", "sd", "q10")))) %>%
t()
Probably cleaner:
bind_rows(
iris %>%
group_by(Species) %>%
select(Sepal.Width)%>%
summarise_all(c("mean", "sd", "q10"))
,
iris %>%
select(Sepal.Width)%>%
summarise_all(c("mean", "sd", "q10")) %>%
mutate(Species = "Total")
)
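The same idea can be written so that the summary is defined only once and applied twice. The sketch below uses across(), which supersedes summarise_all in dplyr 1.0 and later. The names stats and summarise_width are made up for illustration.

```r
library(dplyr)

# Named list of statistics, defined once (the names become column names).
stats <- list(
  mean = mean,
  sd   = sd,
  q10  = function(x) quantile(x, probs = 0.1, names = FALSE)
)

# Hypothetical wrapper: works on grouped and ungrouped data alike.
summarise_width <- function(d) {
  summarise(d, across(Sepal.Width, stats, .names = "{.fn}"))
}

res <- bind_rows(
  iris %>% group_by(Species) %>% summarise_width(),
  iris %>% summarise_width() %>% mutate(Species = "Total")
)
```

The statistics still get computed in two passes, but the code for them is typed only once.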

Related

Plot histograms per row using gt tables

I want to create a gt table where I see some metrics like number of observations, mean and median, and I want a column with its histogram. For this question I will use the iris dataset.
I have recently learned how to put a plot in a tibble using this code:
library(dplyr)
library(tidyr)
library(purrr)
library(gt)
library(ggplot2)
my_tibble <- iris %>%
pivot_longer(-Species,
names_to = "Vars",
values_to = "Values") %>%
group_by(Vars) %>%
summarise(obs = n(),
mean = round(mean(Values),2),
median = round(median(Values),2),
plots = list(ggplot(cur_data(), aes(Values)) + geom_histogram()))
Now I want to use the plots column to plot a histogram per variable, so I have tried this:
my_tibble %>%
mutate(ggplot = NA) %>%
gt() %>%
text_transform(
locations = cells_body(vars(ggplot)),
fn = function(x) {
map(.$plots,ggplot_image)
}
)
But it returns an error:
Error in body[[col]][stub_df$rownum_i %in% loc$rows] <- fn(body[[col]][stub_df$rownum_i %in% :
replacement has length zero
The gt table should be like this:
Any help will be greatly appreciated.
After reviewing the excellent ideas from @akrun and @TarJae, I have this solution that gives the required gt table:
plots <- iris %>%
pivot_longer(-Species,
names_to = "Vars",
values_to = "Values") %>%
group_by(Vars) %>%
nest() %>%
mutate(plot = map(data,
function(df) df %>%
ggplot(aes(Values)) +
geom_histogram())) %>%
select(-data)
iris %>%
pivot_longer(-Species,
names_to = "Vars",
values_to = "Values") %>%
group_by(Vars) %>%
summarise(obs = n(),
mean = round(mean(Values),2),
median = round(median(Values),2)) %>%
mutate(ggplot = NA) %>%
gt() %>%
text_transform(
locations = cells_body(vars(ggplot)),
fn = function(x) {
map(plots$plot, ggplot_image, height = px(100))
}
)
And this is the table:
I had to create the plot outside the output table, so I could call it in the gt table.
We need to loop over the plots:
library(dplyr)
library(tidyr)
library(purrr)
library(gt)
library(ggplot2)
iris %>%
pivot_longer(-Species,
names_to = "Vars",
values_to = "Values") %>%
nest_by(Vars) %>%
mutate(n = nrow(data),
mean = round(mean(data$Values), 2),
median = round(median(data$Values), 2),
plots = list(ggplot(data, aes(Values)) + geom_histogram()), .keep = "unused") %>%
ungroup %>%
mutate(ggplot = NA) %>%
{dat <- .
dat %>%
select(-plots) %>%
gt() %>%
text_transform(locations = cells_body(c(ggplot)),
fn = function(x) {
map(dat$plots, ggplot_image, height = px(100))
}
)
}
Update (see comments):
Since you are working with a Shiny app, you could use summarytools; see https://cran.r-project.org/web/packages/summarytools/vignettes/introduction.html
It is compatible with Shiny.
Here is a small example:
library(summarytools)
dfSummary(iris,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.75,
valid.col = FALSE,
tmp.img.dir = "/tmp")
view(dfSummary(iris))
Try this:
library(skimr)
skim(iris)
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Sepal.Length 0 1 5.84 0.828 4.3 5.1 5.8 6.4 7.9 ▆▇▇▅▂
2 Sepal.Width 0 1 3.06 0.436 2 2.8 3 3.3 4.4 ▁▆▇▂▁
3 Petal.Length 0 1 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▆▇▂
4 Petal.Width 0 1 1.20 0.762 0.1 0.3 1.3 1.8 2.5 ▇▁▇▅▃

Summarizing Multiple Columns of Data Using Pipes

I'm looking to report the min, max, and mean of certain columns (price, age, and dist) from the houses data set using pipes in a concise tibble. For now, I have the following code, which produces a rather inelegant 1x9 tibble:
houses %>%
select(price, age, dist) %>%
summarize_each(list(min = min, max = max, mean = mean))
I was hoping to create a more organized solution using pipes with the selected data as rows and the summary stats (min, max, mean) as columns resulting in a 3x3 tibble. Any ideas?
You may first reshape the data to long format and then calculate the summary statistics for each column. Here is an example with the mtcars dataset.
library(dplyr)
library(tidyr)
mtcars %>%
select(mpg, disp, cyl) %>%
pivot_longer(cols = everything()) %>%
group_by(name) %>%
summarise(min = min(value, na.rm = TRUE),
max = max(value, na.rm = TRUE),
mean = mean(value, na.rm = TRUE))
# name min max mean
# <chr> <dbl> <dbl> <dbl>
#1 cyl 4 8 6.19
#2 disp 71.1 472 231.
#3 mpg 10.4 33.9 20.1
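A variant of the same reshaping summarises first and pivots afterwards. This sketch uses the ".value" sentinel of pivot_longer to split names such as mpg_min into a row label and a statistic column. It assumes the original column names contain no underscore of their own.

```r
library(dplyr)
library(tidyr)

res <- mtcars %>%
  # One wide row with columns mpg_min, mpg_max, mpg_mean, disp_min, ...
  summarise(across(c(mpg, disp, cyl),
                   list(min = min, max = max, mean = mean))) %>%
  # Text before "_" becomes the row label; text after it names the column.
  pivot_longer(everything(),
               names_to = c("name", ".value"),
               names_sep = "_")
```

The result has one row per variable and the columns name, min, max and mean.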
A possible solution to output a dataframe:
library(dplyr)
houses %>%
summarise(across(c(price,age,dist),c(max,min,mean))) %>%
matrix(ncol = 3, byrow = T) %>%
as.data.frame() %>%
rename(Max=V1, Min=V2, Mean=V3)
A possible solution to output a tibble:
library(dplyr)
houses %>%
summarise(across(c(price,age,dist),c(max,min,mean))) %>%
matrix(ncol = 3, byrow = T) %>%
tibble(Max=unlist(.[,1]),Min=unlist(.[,2]),Mean=unlist(.[,3])) %>%
select(Max,Min,Mean)
EDIT (2021-10-01)
A very short solution:
library(dplyr)
library(purrr)
map_dfc(c("Max","Min","Mean"),
~ tibble(!!sym(.x) := apply(select(houses, price, age, dist),2,tolower(.x))))

Adding differences to a summary-table created without iteration

Based on my first question, found here, about creating a summary table without iteration (i.e. without using map), I wrote the following code based on the excellent answers:
library(tidyverse)
sum_variables <- c("mpg", "hp", "disp")
# Create grouping var; ####
mtcars <- mtcars %>% mutate(
am_factor = case_when(
am == 0 ~ "Automatic",
TRUE ~ "Manual"
)
)
# Create summary table; ####
mtcars %>%
group_by(am_factor) %>%
summarise(
across(
all_of(sum_variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -"am_factor"
) %>% pivot_wider(
names_from = "am_factor"
)
Which gives me the following output,
# A tibble: 3 x 3
name Automatic Manual
<chr> <chr> <chr>
1 mpg 17.15(±3.83) 24.39(±6.17)
2 hp 160.26(±53.91) 126.85(±84.06)
3 disp 290.38(±110.17) 143.53(±87.2)
Using paste0 here reduces the amount of code needed, but it complicates further additions to the table. If, for example, I want to add differences to this table, my current solution is the following,
mtcars %>%
group_by(am_factor) %>%
summarise(
across(
all_of(sum_variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -"am_factor"
) %>% pivot_wider(
names_from = "am_factor"
) %>% mutate(
difference = str_extract(Automatic, "[:digit:].?[:digit:]") %>% as.numeric() -
str_extract(Manual, "[:digit:].?[:digit:]") %>% as.numeric()
)
Which gives the desired output,
# A tibble: 3 × 4
name Automatic Manual difference
<chr> <chr> <chr> <dbl>
1 mpg 17.15(±3.83) 24.39(±6.17) -7
2 hp 160.26(±53.91) 126.85(±84.06) 34
3 disp 290.38(±110.17) 143.53(±87.2) 147
Although it works, extracting numbers back out of formatted strings defeats the purpose of keeping the code simple.
How do I create this summary of my data in a simple manner? It has to be a tidyverse solution, preferably without iteration.
This is not necessarily simpler or shorter but I would prefer to do the mathematical calculations on numbers directly rather than extracting them from strings.
library(dplyr)
library(tidyr)
mtcars %>%
group_by(am_factor) %>%
summarise(across(all_of(sum_variables), list(mean = mean,
sd = ~sprintf('%.2f (± %.2f)', mean(.), sd(.))))) %>%
pivot_longer(cols = -am_factor,
names_to = c('measure', '.value'),
names_sep = '_') %>%
group_by(measure) %>%
mutate(difference = -diff(mean)) %>%
ungroup %>%
select(-mean) %>%
pivot_wider(names_from = am_factor, values_from = sd)
# measure difference Automatic Manual
# <chr> <dbl> <chr> <chr>
#1 mpg -7.24 17.15 (± 3.83) 24.39 (± 6.17)
#2 hp 33.4 160.26 (± 53.91) 126.85 (± 84.06)
#3 disp 147. 290.38 (± 110.17) 143.53 (± 87.20)
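Another way to keep the arithmetic on numbers is to hold the means and standard deviations as numeric columns until the very end and only then build the display strings. The following is a sketch of that ordering, not a replacement for the answer above. The grouping variable is recreated here so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Recreate the grouping variable from the question.
dat <- mtcars %>%
  mutate(am_factor = if_else(am == 0, "Automatic", "Manual"))

res <- dat %>%
  pivot_longer(c(mpg, hp, disp), names_to = "measure") %>%
  group_by(measure, am_factor) %>%
  summarise(mean = mean(value), sd = sd(value), .groups = "drop") %>%
  # Gives numeric columns mean_Automatic, mean_Manual, sd_Automatic, sd_Manual.
  pivot_wider(names_from = am_factor, values_from = c(mean, sd)) %>%
  mutate(difference = mean_Automatic - mean_Manual,   # computed on numbers
         Automatic = sprintf("%.2f (± %.2f)", mean_Automatic, sd_Automatic),
         Manual    = sprintf("%.2f (± %.2f)", mean_Manual, sd_Manual)) %>%
  select(measure, Automatic, Manual, difference)
```

Formatting is the last step, so further numeric columns (ratios, tests) can be added before it without any string parsing.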

Add frequency counts to 2x2 prop.table

How do I add frequency counts to a 2x2 prop.table? So here 'dataset' contains two categorical variables.
dataset %>% prop.table(margin = 2) %>% '*' (100) %>% round(2)
I would like the counts in addition to percentages of each category.
Sorry for the dopey example, but it should look like this, except the sum doesn't need to be reported in every cell.
A reproducible example and solution:
library(dplyr)
tab <- iris %>% mutate(size = factor(1+(Sepal.Length>median(iris$Sepal.Length)),levels = 1:2, labels = c('S','L'))) %>%
select(Species, size) %>%
table()
prop <- prop.table(tab,margin = 2) %>% '*' (100) %>% round(2)
matrix(paste(tab,prop),nrow = nrow(tab),dimnames = dimnames(tab))
gives
size
Species S L
setosa "50 62.5" "0 0"
versicolor "24 30" "26 37.14"
virginica "6 7.5" "44 62.86"
or another solution:
library(dplyr)
library(tidyr)
iris %>% mutate(size = factor(1+(Sepal.Length>median(iris$Sepal.Length)),levels = 1:2, labels = c('S','L'))) %>%
group_by(Species, size) %>%
summarise(n = n()) %>%
group_by(size) %>%
mutate(p = paste(n,round(n/sum(n)*100,2))) %>%
select(-n) %>%
spread(size,p,fill = paste(0,0))
gives
# A tibble: 3 x 3
Species S L
<fct> <chr> <chr>
1 setosa 50 62.5 0 0
2 versicolor 24 30 26 37.14
3 virginica 6 7.5 44 62.86
addmargins applied to your table might do what you want.
set.seed(34)
n <- 20
tab <- table(sample(1:3, n, replace = TRUE), sample(c("A", "B"), n, replace = TRUE))
addmargins(tab)
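For the original request (counts shown next to column percentages), a base R sketch along the lines of the first answer could look like this. The mtcars variables and the "n (p%)" cell format are choices made for illustration only.

```r
# Two-way table of counts; mtcars is used only as example data.
tab <- table(cyl = mtcars$cyl, am = mtcars$am)

# Column percentages (margin = 2), rounded to two decimals.
pct <- round(100 * prop.table(tab, margin = 2), 2)

# Paste counts and percentages cell by cell, keeping the table's dimnames.
out <- matrix(sprintf("%d (%.2f%%)", as.vector(tab), as.vector(pct)),
              nrow = nrow(tab), dimnames = dimnames(tab))

# Counts with row and column totals, if the sums are wanted as well.
totals <- addmargins(tab)
```

Here out is a character matrix meant for display, while totals stays numeric.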

reformatting dplyr summarise_at() output

I am using summarise_at() to obtain the mean and standard error of multiple variables by group.
The output has 1 row for each group, and 1 column for each calculated quantity, per group. I'd like to have a table with 1 row for each variable, and 1 column for each calculated quantity:
data <- mtcars
data$condition <- as.factor(c(rep("control", 16), rep("treat", 16)))
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt),
funs(mean = mean, se=sd(.)/sqrt(n())))
# A tibble: 2 x 7
condition mpg_mean cyl_mean wt_mean mpg_se cyl_se wt_se
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 control 18.2 6.5 3.56 1.04 0.387 0.204
2 treat 22.0 5.88 2.87 1.77 0.499 0.257
Here's what I think would be more useful (the numbers are not meaningful):
# MEAN.control, MEAN.treat, SE.control, SE.treat
# mpg 1.5 2.4 .30 .45
# cyl 3.2 1.9 .20 .60
# disp 12.3 17.8 .20 .19
Any ideas? New to the tidyverse, so sorry if this is too basic.
funs() is deprecated in dplyr; use a list of functions or lambdas in summarise_at/mutate_at instead. After the summarise step, gather the data into 'long' format, separate the 'key' column into 'key1' and 'key2' by splitting at the delimiter _, then unite 'key2' and 'condition' into a single 'cond' column (the statistic names are already upper case because the list elements are named MEAN and SE), spread it to 'wide' format and, if needed, move the 'key1' column into the row names.
library(tidyverse)
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(MEAN = ~ mean(.),
SE = ~sd(.)/sqrt(n()))) %>%
gather(key, val, -condition) %>%
separate(key, into = c("key1", "key2")) %>%
unite(cond, key2, condition, sep=".") %>%
spread(cond, val) %>%
column_to_rownames('key1')
# MEAN.control MEAN.treat SE.control SE.treat
#cyl 6.500000 5.875000 0.3872983 0.4989572
#mpg 18.200000 21.981250 1.0369024 1.7720332
#wt 3.560875 2.873625 0.2044885 0.2571034
A different possibility could be:
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(mean = ~ mean(.),
se = ~ sd(.)/sqrt(n()))) %>%
gather(var, val, -condition) %>%
separate(var, c("vars", "var2")) %>%
mutate(var2 = paste(toupper(var2), as.character(condition), sep = "_")) %>%
select(-condition) %>%
spread(var2, val)
vars MEAN_control MEAN_treat SE_control SE_treat
<chr> <dbl> <dbl> <dbl> <dbl>
1 cyl 6.5 5.88 0.387 0.499
2 mpg 18.2 22.0 1.04 1.77
3 wt 3.56 2.87 0.204 0.257
Here, after your initial steps, it performs a wide-to-long data transformation, excluding the "condition" column. Second, it separates the variable names into two columns. Third, it combines the metric and the condition, with the metric being upper case. Finally, it removes the redundant variable and returns it to the desired format.
Or you can avoid separate() by using some regex:
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(mean = ~ mean(.),
se = ~ sd(.)/sqrt(n()))) %>%
gather(var, val, -condition) %>%
mutate(vars = gsub("_.*$", "", var),
var2 = gsub(".*\\_", "", var)) %>%
mutate(var2 = paste(toupper(var2), as.character(condition), sep = "_")) %>%
select(-condition, -var) %>%
spread(var2, val)
Or with strsplit():
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(mean = ~ mean(.),
se = ~ sd(.)/sqrt(n()))) %>%
gather(var, val, -condition) %>%
mutate(vars = sapply(strsplit(var, "_"), function(x) x[1]),
var2 = sapply(strsplit(var, "_"), function(x) x[2])) %>%
mutate(var2 = paste(toupper(var2), as.character(condition), sep = "_")) %>%
select(-condition, -var) %>%
spread(var2, val)
Or you can completely rewrite it to:
data %>%
select(mpg, cyl, wt, condition) %>%
gather(vars, val, -condition) %>%
group_by(condition, vars) %>%
summarise(mean = mean(val),
se = sd(val)/sqrt(n())) %>%
ungroup() %>%
gather(var2, val, -c(condition, vars)) %>%
mutate(var2 = paste(toupper(var2), condition, sep = "_")) %>%
select(-condition) %>%
spread(var2, val)
In this case, it first selects the variables of interest. Second, it transforms the data from wide to long format, excluding the "condition" column. Third, it groups by condition and variable name and calculates the metrics. Fourth, it performs a second wide-to-long transformation, excluding the "condition" column and the column with the original variable names. Finally, it combines the metric (upper case) with the condition, removes the redundant variable, and spreads the result into the desired format.
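gather and spread are superseded in current tidyr, so the same reshaping can also be sketched with across, pivot_longer and pivot_wider. This assumes dplyr 1.0 or later. The column names variable and stat are arbitrary choices.

```r
library(dplyr)
library(tidyr)

data <- mtcars
data$condition <- factor(rep(c("control", "treat"), each = 16))

res <- data %>%
  group_by(condition) %>%
  # Produces mpg_MEAN, mpg_SE, cyl_MEAN, ... for each condition.
  summarise(across(c(mpg, cyl, wt),
                   list(MEAN = mean, SE = ~ sd(.x) / sqrt(n())))) %>%
  # Split "mpg_MEAN" into the variable name and the statistic.
  pivot_longer(-condition,
               names_to = c("variable", "stat"),
               names_sep = "_") %>%
  # One column per statistic/condition pair, e.g. MEAN.control.
  pivot_wider(names_from = c(stat, condition),
              values_from = value,
              names_sep = ".")
```

The result has one row per variable and the columns MEAN.control, SE.control, MEAN.treat and SE.treat.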
