How to summarize by sets of grouping variables in R and dplyr?


I want to group a data frame using different sets of grouping variables. For each group I want to count the number of observations (or summarize in any other way) and then collect all results in one data frame.
Important: I want to define the sets of grouping variables programmatically, for example as a list.
How do I achieve this in the tidyverse?
Here is my attempt:
library(tidyverse)
count_by_group <- function(...) {
  mtcars %>%
    count(...) %>%
    mutate(
      grouping_variable = paste(ensyms(...), collapse = "."),
      group = paste(!!!enquos(...), sep = ".")
    ) %>%
    select(grouping_variable, group, n)
}
# I want this ...
bind_rows(
count_by_group(cyl),
count_by_group(gear),
count_by_group(cyl, gear)
)
#> grouping_variable group n
#> 1 cyl 4 11
#> 2 cyl 6 7
#> 3 cyl 8 14
#> 4 gear 3 15
#> 5 gear 4 12
#> 6 gear 5 5
#> 7 cyl.gear 4.3 1
#> 8 cyl.gear 4.4 8
#> 9 cyl.gear 4.5 2
#> 10 cyl.gear 6.3 2
#> 11 cyl.gear 6.4 4
#> 12 cyl.gear 6.5 1
#> 13 cyl.gear 8.3 12
#> 14 cyl.gear 8.5 2
# ... but without the repetition of "count_by_group(var)".
# The following does not work:
map_dfr(
list(
cyl,
gear,
c(cyl, gear)
),
count_by_group
)
#> Error in map(.x, .f, ...): object 'cyl' not found
Created on 2020-09-17 by the reprex package (v0.3.0)

Update (2020-10-12): More transparent solution (thanks to @LionelHenry)
library(tidyverse)
count_by_group <- function(...) {
  dots <- enquos(..., .named = TRUE)
  names <- names(dots)
  counted <- count(mtcars, !!!dots)
  # Build one label per row by pasting the values of the grouping columns
  group <- counted %>%
    select(-n) %>%
    rowwise() %>%
    mutate(paste(c_across(everything()), collapse = ".")) %>%
    pull()
  # # Equivalently:
  # group <- counted %>%
  #   select(-n) %>%
  #   pmap_chr(paste, sep = ".")
  counted %>%
    mutate(
      grouping_variable = paste(names, collapse = "."),
      group = group
    ) %>%
    select(grouping_variable, group, n)
}
grouping_variables <- list(
vars(cyl),
vars(gear),
vars(cyl, gear)
)
map_dfr(grouping_variables, ~ count_by_group(!!! .x))
#> grouping_variable group n
#> 1 cyl 4 11
#> 2 cyl 6 7
#> 3 cyl 8 14
#> 4 gear 3 15
#> 5 gear 4 12
#> 6 gear 5 5
#> 7 cyl.gear 4.3 1
#> 8 cyl.gear 4.4 8
#> 9 cyl.gear 4.5 2
#> 10 cyl.gear 6.3 2
#> 11 cyl.gear 6.4 4
#> 12 cyl.gear 6.5 1
#> 13 cyl.gear 8.3 12
#> 14 cyl.gear 8.5 2
Created on 2020-10-12 by the reprex package (v0.3.0)
I just found that this works!
library(tidyverse)
count_by_group <- function(...) {
  mtcars %>%
    count(...) %>%
    mutate(
      grouping_variable = paste(ensyms(...), collapse = "."),
      group = paste(!!!enquos(...), sep = ".")
    ) %>%
    select(grouping_variable, group, n)
}
grouping_variables <- list(
vars(cyl),
vars(gear),
vars(cyl, gear)
)
map_dfr(grouping_variables, ~count_by_group(!!! .))
#> grouping_variable group n
#> 1 cyl 4 11
#> 2 cyl 6 7
#> 3 cyl 8 14
#> 4 gear 3 15
#> 5 gear 4 12
#> 6 gear 5 5
#> 7 cyl.gear 4.3 1
#> 8 cyl.gear 4.4 8
#> 9 cyl.gear 4.5 2
#> 10 cyl.gear 6.3 2
#> 11 cyl.gear 6.4 4
#> 12 cyl.gear 6.5 1
#> 13 cyl.gear 8.3 12
#> 14 cyl.gear 8.5 2
Created on 2020-10-12 by the reprex package (v0.3.0)
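If the grouping sets can be given as character vectors instead of vars() calls, the quasiquotation can be avoided altogether. The following is only a minimal sketch under that assumption (dplyr >= 1.0, tidyr and purrr installed); count_by_set and grouping_sets are names made up for this illustration and do not appear in the answers above.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical input: each grouping set is a character vector of column names
grouping_sets <- list("cyl", "gear", c("cyl", "gear"))

# Hypothetical helper: count one grouping set and label the result
count_by_set <- function(data, vars) {
  data %>%
    count(across(all_of(vars))) %>%
    # paste the grouping columns into a single "group" label column
    unite("group", all_of(vars), sep = ".") %>%
    mutate(grouping_variable = paste(vars, collapse = "."), .before = 1)
}

# Apply the helper to every set and stack the pieces
result <- bind_rows(map(grouping_sets, count_by_set, data = mtcars))
result
```

Because the sets are plain strings, they can be built programmatically (for example with combn()) before being passed to the helper.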

Related

using lapply with list of arguments on dplyr functions that uses data masking

When programming with dplyr, variables passed as function arguments need to be referenced with {{var}} inside dplyr verbs.
This works well, but I would like to use lapply with the var argument supplied in a list, and this throws an error. I have tried going back and forth with substitute and rlang functions such as sym, but to no avail.
Any suggestions? Thanks!
library(tidyverse)
tb <- tibble(a = 1:10, b = 10:1)
foo <- function(var, scalar){
tb %>% mutate(new_var = {{var}}*scalar)
}
foo(a, pi) #works
lapply(X = list(
list(sym("a"), pi),
list(substitute(b), exp(1))), FUN = function(ll) foo(var = ll$a, scalar = ll$pi) ) #err

You can get around the non-standard evaluation by naming the list elements and using do.call:
lapply(X = list(
list(var = sym("a"), scalar = pi),
list(var = substitute(b), scalar = exp(1))),
FUN = function(ll) do.call(foo, ll))
#> [[1]]
#> # A tibble: 10 x 3
#> a b new_var
#> <int> <int> <dbl>
#> 1 1 10 3.14
#> 2 2 9 6.28
#> 3 3 8 9.42
#> 4 4 7 12.6
#> 5 5 6 15.7
#> 6 6 5 18.8
#> 7 7 4 22.0
#> 8 8 3 25.1
#> 9 9 2 28.3
#> 10 10 1 31.4
#>
#> [[2]]
#> # A tibble: 10 x 3
#> a b new_var
#> <int> <int> <dbl>
#> 1 1 10 27.2
#> 2 2 9 24.5
#> 3 3 8 21.7
#> 4 4 7 19.0
#> 5 5 6 16.3
#> 6 6 5 13.6
#> 7 7 4 10.9
#> 8 8 3 8.15
#> 9 9 2 5.44
#> 10 10 1 2.72
Created on 2022-11-03 with reprex v2.0.2
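If the column can be supplied as a string, another option is the .data pronoun, which sidesteps the quoting problem entirely. This is a sketch under that assumption, not the approach from the answer above; foo_chr and args are hypothetical names introduced here.

```r
library(dplyr)

tb <- tibble(a = 1:10, b = 10:1)

# Hypothetical variant of foo() that takes the column name as a string
foo_chr <- function(var, scalar) {
  tb %>% mutate(new_var = .data[[var]] * scalar)
}

# Each element holds the arguments for one call
args <- list(
  list(var = "a", scalar = pi),
  list(var = "b", scalar = exp(1))
)

results <- lapply(args, function(ll) foo_chr(ll$var, ll$scalar))
```

The trade-off is that callers can no longer write a bare column name, so this fits best when the argument list is generated by code anyway.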

How to make multi layer cross tabs in R

I am attempting to create a multi-layered cross tab in R. Currently, when using this code:
NewMexico_DEM_xtab_ <- NewMexico_DEM_Voterfile %>%
group_by(Sex, CountyName) %>%
tally() %>%
spread(Sex, n)
I receive this output:
My goal is to add a layer for age using the Age column and for R to output a tab like this:
Is there a way I can do this with my current code or a package that would make this easier?

Do either of these approaches solve your problem?
library(tidyverse)
# Create sample data
iris_df <- iris
iris_df$Sample <- sample(c("M","F"), 150, replace = TRUE)
# crosstabs
iris_df %>%
group_by(Species, Sample) %>%
tally() %>%
spread(Sample, n)
#> # A tibble: 3 × 3
#> # Groups: Species [3]
#> Species F M
#> <fct> <int> <int>
#> 1 setosa 26 24
#> 2 versicolor 25 25
#> 3 virginica 27 23
# Add in 'Age'
iris_df$Age <- sample(c("18-24", "25-35", "36-45", "45+"), 150, replace = TRUE)
# crosstabs
iris_df %>%
group_by(Species, Sample, Age) %>%
tally() %>%
spread(Age, n)
#> # A tibble: 6 × 6
#> # Groups: Species, Sample [6]
#> Species Sample `18-24` `25-35` `36-45` `45+`
#> <fct> <chr> <int> <int> <int> <int>
#> 1 setosa F 2 4 14 6
#> 2 setosa M 11 4 5 4
#> 3 versicolor F 3 8 8 6
#> 4 versicolor M 5 8 2 10
#> 5 virginica F 5 8 7 7
#> 6 virginica M 6 10 3 4
# Using janitor::tabyl()
library(janitor)
#>
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#>
#> chisq.test, fisher.test
iris_df %>%
tabyl(Species, Sample, Age)
#> $`18-24`
#> Species F M
#> setosa 2 11
#> versicolor 3 5
#> virginica 5 6
#>
#> $`25-35`
#> Species F M
#> setosa 4 4
#> versicolor 8 8
#> virginica 8 10
#>
#> $`36-45`
#> Species F M
#> setosa 14 5
#> versicolor 8 2
#> virginica 7 3
#>
#> $`45+`
#> Species F M
#> setosa 6 4
#> versicolor 6 10
#> virginica 7 4
Created on 2022-08-24 by the reprex package (v2.0.1)
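For a single wide table with one column per Age and Sex combination, count() followed by pivot_wider() may be closer to the layout asked for. The sketch below uses a small made-up data frame (the county, sex and age columns are placeholders for the voter file, which is not available here) and assumes tidyr >= 1.0.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the voter file: 3 counties, 2 sexes, 2 age bands
voters <- tibble(
  county = rep(c("A", "B", "C"), each = 8),
  sex    = rep(c("F", "M"), times = 12),
  age    = rep(rep(c("18-24", "25-35"), each = 2), times = 6)
)

# One row per county, one column per age/sex combination
xtab <- voters %>%
  count(county, age, sex) %>%
  pivot_wider(names_from = c(age, sex), values_from = n, values_fill = 0)
xtab
```

The column names are built as age_sex (for example 18-24_F); names_sep can be used to change the separator.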

Use multiquantile groups from a large dataframe in a grouped dataframe in R

I have the following problem: I have a large data frame from which I need to extract the quantiles of a variable by group, for instance:
list_q <- list()
for (i in 3:5){
tmp <- mtcars %>%
filter(gear == i) %>%
pull(mpg) %>%
quantile(probs = seq(0, 1, 0.25), na.rm = TRUE)
list_q[[i]] <- tmp
}
list_q
With this output:
[[3]]
0% 25% 50% 75% 100%
10.4 14.5 15.5 18.4 21.5
[[4]]
0% 25% 50% 75% 100%
17.800 21.000 22.800 28.075 33.900
[[5]]
0% 25% 50% 75% 100%
15.0 15.8 19.7 26.0 30.4
Now I need to compute the group means of the variable and determine which quantile each mean falls into, using the quantiles of the original measurements:
a <- mtcars %>%
group_by(gear, carb) %>%
summarize(mpg_mean = mean(mpg)) %>%
ungroup()
gear carb mpg_mean
<dbl> <dbl> <dbl>
1 3 1 20.3
2 3 2 17.2
3 3 3 16.3
4 3 4 12.6
5 4 1 29.1
6 4 2 24.8
7 4 4 19.8
8 5 2 28.2
9 5 4 15.8
10 5 6 19.7
11 5 8 15
So I could do this:
g3 <- a %>%
filter(gear == 3) %>%
mutate(quantile = cut(mpg_mean, list_q[[3]], labels = FALSE, include.lowest = TRUE))
g4 <- a %>%
filter(gear == 4) %>%
mutate(quantile = cut(mpg_mean, list_q[[4]], labels = FALSE, include.lowest = TRUE))
g5 <- a %>%
filter(gear == 5) %>%
mutate(quantile = cut(mpg_mean, list_q[[5]], labels = FALSE, include.lowest = TRUE))
bind_rows(g3, g4, g5)
Obtaining:
# A tibble: 11 x 4
gear carb mpg_mean quantile
<dbl> <dbl> <dbl> <int>
1 3 1 20.3 4
2 3 2 17.2 3
3 3 3 16.3 3
4 3 4 12.6 1
5 4 1 29.1 4
6 4 2 24.8 3
7 4 4 19.8 1
8 5 2 28.2 4
9 5 4 15.8 1
10 5 6 19.7 2
11 5 8 15 1
I would like to know if there is a more efficient way to do this.

We can first group_by gear and store the quantiles for mpg in a list. We can then also group_by carb to get mean of mpg value and use the quantiles stored in the list previously to cut this mean of mpg column.
library(dplyr)
mtcars %>%
group_by(gear) %>%
mutate(gear_q = list(quantile(mpg))) %>%
group_by(carb, .add = TRUE) %>% # `.add` replaces the deprecated `add` argument (dplyr >= 1.0.0)
summarize(mpg_mean = mean(mpg),
gear_q = list(first(gear_q))) %>%
mutate(quantile = cut(mpg_mean, first(gear_q),
labels = FALSE, include.lowest = TRUE)) %>%
select(-gear_q)
# gear carb mpg_mean quantile
# <dbl> <dbl> <dbl> <int>
# 1 3 1 20.3 4
# 2 3 2 17.2 3
# 3 3 3 16.3 3
# 4 3 4 12.6 1
# 5 4 1 29.1 4
# 6 4 2 24.8 3
# 7 4 4 19.8 1
# 8 5 2 28.2 4
# 9 5 4 15.8 1
#10 5 6 19.7 2
#11 5 8 15 1
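The same idea can also be written as two separate summaries joined on gear, which some readers may find easier to follow. This is a sketch, assuming dplyr >= 1.0 and purrr; breaks, means and result are names chosen for this illustration.

```r
library(dplyr)
library(purrr)

# Quantile breaks of the original mpg values, one list element per gear
breaks <- mtcars %>%
  group_by(gear) %>%
  summarise(q = list(quantile(mpg, probs = seq(0, 1, 0.25), na.rm = TRUE)))

# Mean mpg per gear/carb combination
means <- mtcars %>%
  group_by(gear, carb) %>%
  summarise(mpg_mean = mean(mpg), .groups = "drop")

# Attach the breaks of the matching gear and bin each mean
result <- means %>%
  left_join(breaks, by = "gear") %>%
  mutate(quantile = map2_int(
    mpg_mean, q,
    ~ cut(.x, .y, labels = FALSE, include.lowest = TRUE)
  )) %>%
  select(-q)
result
```

Keeping the breaks in their own table also makes it easy to inspect them or to reuse them for other summaries.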

How to avoid ellipsis ... in dplyr?

I want to create a function that takes a grouping argument, which can be a single variable or multiple variables. I want it to look like this:
wanted <- function(data, groups, other_params){
data %>% group_by( {{groups}} ) %>% count()
}
This works only when a single group is given and breaks when there are multiple groups. I know it's possible to use the ellipsis ... as follows (but I want the syntax groups = something):
not_wanted <- function(data, ..., other_params){
data %>% group_by( ... ) %>% count()
}
Here is the entire code:
library(dplyr)
library(magrittr)
iris$group2 <- rep(1:5, 30)
wanted <- function(data, groups, other_params){
data %>% group_by( {{groups}} ) %>% count()
}
not_wanted <- function(data, ..., other_params){
data %>% group_by( ... ) %>% count()
}
# works
wanted(iris, groups = Species )
not_wanted(iris, Species, group2)
# doesn't work
wanted(iris, groups = vars(Species, group2) )
wanted(iris, groups = c(Species, group2) )
wanted(iris, groups = vars("Species", "group2") )
# Error: Column `vars(Species, group2)` must be length 150 (the number of rows) or one, not 2

You guys are overcomplicating things; this works just fine:
library(tidyverse)
wanted <- function(data, groups){
data %>% count(!!!groups)
}
mtcars %>% wanted(groups = vars(mpg,disp,hp))
# A tibble: 31 x 4
mpg disp hp n
<dbl> <dbl> <dbl> <int>
1 10.4 460 215 1
2 10.4 472 205 1
3 13.3 350 245 1
4 14.3 360 245 1
5 14.7 440 230 1
6 15 301 335 1
7 15.2 276. 180 1
8 15.2 304 150 1
9 15.5 318 150 1
10 15.8 351 264 1
# … with 21 more rows

The triple bang operator and parse_quos from the rlang package will do the trick. For more info, see e.g. https://stackoverflow.com/a/49941635/6086135
library(dplyr)
library(magrittr)
iris$group2 <- rep(1:5, 30)
vec <- c("Species", "group2")
wanted <- function(data, groups){
data %>% count(!!!rlang::parse_quos(groups, rlang::current_env()))
}
wanted(iris, vec)
#> # A tibble: 15 x 3
#> Species group2 n
#> <fct> <int> <int>
#> 1 setosa 1 10
#> 2 setosa 2 10
#> 3 setosa 3 10
#> 4 setosa 4 10
#> 5 setosa 5 10
#> 6 versicolor 1 10
#> 7 versicolor 2 10
#> 8 versicolor 3 10
#> 9 versicolor 4 10
#> 10 versicolor 5 10
#> 11 virginica 1 10
#> 12 virginica 2 10
#> 13 virginica 3 10
#> 14 virginica 4 10
#> 15 virginica 5 10
Created on 2020-01-06 by the reprex package (v0.3.0)

Here is another option to avoid quotations in the function call. I admit it's not very pretty though.
library(tidyverse)
wanted <- function(data, groups){
grouping <- gsub(x = rlang::quo_get_expr(enquo(groups)), pattern = "\\((.*)?\\)", replacement = "\\1")[-1]
data %>% group_by_at(grouping) %>% count()
}
iris$group2 <- rep(1:5, 30)
wanted(iris, groups = c(Species, group2) )
#> # A tibble: 15 x 3
#> # Groups: Species, group2 [15]
#> Species group2 n
#> <fct> <int> <int>
#> 1 setosa 1 10
#> 2 setosa 2 10
#> 3 setosa 3 10
#> 4 setosa 4 10
#> 5 setosa 5 10
#> 6 versicolor 1 10
#> 7 versicolor 2 10
#> 8 versicolor 3 10
#> 9 versicolor 4 10
#> 10 versicolor 5 10
#> 11 virginica 1 10
#> 12 virginica 2 10
#> 13 virginica 3 10
#> 14 virginica 4 10
#> 15 virginica 5 10
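With dplyr 1.0 or later, wrapping the embraced argument in across() appears to give the groups = c(...) syntax asked for in the question without vars() or string manipulation. The sketch below assumes that version; iris2 is a copy made here so the built-in iris is left untouched.

```r
library(dplyr)

iris2 <- iris
iris2$group2 <- rep(1:5, 30)

# `groups` accepts tidyselect syntax: one bare name or c(name1, name2)
wanted <- function(data, groups) {
  data %>%
    group_by(across({{ groups }})) %>%
    count()
}

res_one <- wanted(iris2, groups = Species)
res_two <- wanted(iris2, groups = c(Species, group2))
res_two
```

Since across() uses tidyselect, helpers such as starts_with("group") should also work as the groups argument.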

dplyr summarize with subtotals

One of the great things about pivot tables in Excel is that they provide subtotals automatically. First, I would like to know if there is anything already created within dplyr that can accomplish this. If not, what is the easiest way to achieve it?
In the example below, I show the mean displacement by number of cylinders and carburetors. For each group of cylinders (4,6,8), I'd like to see the mean displacement for the group (or total displacement, or any other summary statistic).
library(dplyr)
mtcars %>% group_by(cyl,carb) %>% summarize(mean(disp))
cyl carb mean(disp)
1 4 1 91.38
2 4 2 116.60
3 6 1 241.50
4 6 4 163.80
5 6 6 145.00
6 8 2 345.50
7 8 3 275.80
8 8 4 405.50
9 8 8 301.00

data.table: It's very clunky, but this is one way:
library(data.table)
DT <- data.table(mtcars)
rbind(
DT[,.(mean(disp)), by=.(cyl,carb)],
DT[,.(mean(disp), carb=NA), by=.(cyl) ],
DT[,.(mean(disp), cyl=NA), by=.(carb)]
)[order(cyl,carb)]
This gives
cyl carb V1
1: 4 1 91.3800
2: 4 2 116.6000
3: 4 NA 105.1364
4: 6 1 241.5000
5: 6 4 163.8000
6: 6 6 145.0000
7: 6 NA 183.3143
8: 8 2 345.5000
9: 8 3 275.8000
10: 8 4 405.5000
11: 8 8 301.0000
12: 8 NA 353.1000
13: NA 1 134.2714
14: NA 2 208.1600
15: NA 3 275.8000
16: NA 4 308.8200
17: NA 6 145.0000
18: NA 8 301.0000
I'd rather see results in something like an R table, but don't know of any functions for that.
dplyr: @akrun found this analogous code:
bind_rows(
mtcars %>%
group_by(cyl, carb) %>%
summarise(Mean= mean(disp)),
mtcars %>%
group_by(cyl) %>%
summarise(carb=NA, Mean=mean(disp)),
mtcars %>%
group_by(carb) %>%
summarise(cyl=NA, Mean=mean(disp))
) %>% arrange(cyl, carb)
We could wrap the repeated operations in a function:
library(lazyeval)
f1 <- function(df, grp, Var, func){
FUN <- match.fun(func)
df %>%
group_by_(.dots=grp) %>%
summarise_(interp(~FUN(v), v=as.name(Var)))
}
m1 <- f1(mtcars, c('carb', 'cyl'), 'disp', 'mean')
m2 <- f1(mtcars, 'carb', 'disp', 'mean')
m3 <- f1(mtcars, 'cyl', 'disp', 'mean')
bind_rows(list(m1, m2, m3)) %>%
arrange(cyl, carb) %>%
rename(Mean=`FUN(disp)`)
carb cyl Mean
1 1 4 91.3800
2 2 4 116.6000
3 NA 4 105.1364
4 1 6 241.5000
5 4 6 163.8000
6 6 6 145.0000
7 NA 6 183.3143
8 2 8 345.5000
9 3 8 275.8000
10 4 8 405.5000
11 8 8 301.0000
12 NA 8 353.1000
13 1 NA 134.2714
14 2 NA 208.1600
15 3 NA 275.8000
16 4 NA 308.8200
17 6 NA 145.0000
18 8 NA 301.0000
Either option can be made a little less ugly with data.table's rbindlist with fill:
rbindlist(list(
mtcars %>% group_by(cyl) %>% summarise(mean(disp)),
mtcars %>% group_by(carb) %>% summarise(mean(disp)),
mtcars %>% group_by(cyl,carb) %>% summarise(mean(disp))
),fill=TRUE) %>% arrange(cyl,carb)
rbindlist(list(
DT[,mean(disp),by=.(cyl,carb)],
DT[,mean(disp),by=.(cyl)],
DT[,mean(disp),by=.(carb)]
),fill=TRUE)[order(cyl,carb)]
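The repeated group_by/summarise calls can also be driven by a list of grouping sets using current dplyr verbs instead of lazyeval. This is a minimal sketch, assuming dplyr >= 1.0 and purrr; summarise_sets and sets are hypothetical names, and the grand total is left out on purpose.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: summarise `value` once per grouping set and stack the results
summarise_sets <- function(data, sets, value) {
  map(sets, function(vars) {
    data %>%
      group_by(across(all_of(vars))) %>%
      summarise(Mean = mean(.data[[value]]), .groups = "drop")
  }) %>%
    bind_rows() # columns missing from a set are filled with NA
}

sets <- list(c("cyl", "carb"), "cyl", "carb")
out <- summarise_sets(mtcars, sets, "disp") %>%
  arrange(cyl, carb)
out
```

As in the outputs above, a row with NA in carb is the subtotal for that cyl, and a row with NA in cyl is the subtotal for that carb.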

Also possible by simply joining the two group results:
cyl_carb <- mtcars %>% group_by(cyl,carb) %>% summarize(mean(disp))
cyl <- mtcars %>% group_by(cyl) %>% summarize(mean(disp))
joined <- full_join(cyl_carb, cyl)
result <- arrange(joined, cyl)
result
gives:
Source: local data frame [12 x 3]
Groups: cyl [3]
cyl carb mean(disp)
(dbl) (dbl) (dbl)
1 4 1 91.3800
2 4 2 116.6000
3 4 NA 105.1364
4 6 1 241.5000
5 6 4 163.8000
6 6 6 145.0000
7 6 NA 183.3143
8 8 2 345.5000
9 8 3 275.8000
10 8 4 405.5000
11 8 8 301.0000
12 8 NA 353.1000
or with an additional column:
cyl_carb <- mtcars %>% group_by(cyl,carb) %>% summarize(mean(disp))
cyl <- mtcars %>% group_by(cyl) %>% summarize(mean.cyl = mean(disp))
joined <- full_join(cyl_carb, cyl)
joined
gives:
Source: local data frame [9 x 4]
Groups: cyl [?]
cyl carb mean(disp) mean.cyl
(dbl) (dbl) (dbl) (dbl)
1 4 1 91.38 105.1364
2 4 2 116.60 105.1364
3 6 1 241.50 183.3143
4 6 4 163.80 183.3143
5 6 6 145.00 183.3143
6 8 2 345.50 353.1000
7 8 3 275.80 353.1000
8 8 4 405.50 353.1000
9 8 8 301.00 353.1000

Something similar to table with addmargins (although actually a data.frame)
library(dplyr)
library(reshape2)
out <- bind_cols(
mtcars %>% group_by(cyl, carb) %>%
summarise(mu = mean(disp)) %>%
dcast(cyl ~ carb),
(mtcars %>% group_by(cyl) %>% summarise(Total=mean(disp)))[,2]
)
margin <- t((mtcars %>% group_by(carb) %>% summarise(Total=mean(disp)))[,2])
rbind(out, c(NA, margin, mean(mtcars$disp))) %>%
`rownames<-`(c(paste("cyl", c(4,6,8)), "Total")) # add some row names
# cyl 1 2 3 4 6 8 Total
# cyl 4 4 91.3800 116.60 NA NA NA NA 105.1364
# cyl 6 6 241.5000 NA NA 163.80 145 NA 183.3143
# cyl 8 8 NA 345.50 275.8 405.50 NA 301 353.1000
# Total NA 134.2714 208.16 275.8 308.82 145 301 230.7219
The bottom row is the column wise margins, columns named 1:8 are carbs, and Total is the rowwise margins.

Here is a simple one-liner creating margins within a data_frame:
library(plyr)
library(dplyr)
# Margins without labels
mtcars %>%
group_by(cyl,carb) %>%
summarize(Mean_Disp=mean(disp)) %>%
do(plyr::rbind.fill(., data_frame(cyl=first(.$cyl), Mean_Disp=sum(.$Mean_Disp, na.rm=T))))
output:
Source: local data frame [12 x 3]
Groups: cyl [3]
cyl carb Mean_Disp
<dbl> <dbl> <dbl>
1 4 1 91.38
2 4 2 116.60
3 4 NA 207.98
4 6 1 241.50
5 6 4 163.80
6 6 6 145.00
7 6 NA 550.30
8 8 2 345.50
9 8 3 275.80
10 8 4 405.50
11 8 8 301.00
12 8 NA 1327.80
You may also add labels for the summary statistics like:
mtcars %>%
group_by(cyl,carb) %>%
summarize(Mean_Disp=mean(disp)) %>%
do(plyr::rbind.fill(., data_frame(cyl=first(.$cyl), carb=c("Total", "Mean"), Mean_Disp=c(sum(.$Mean_Disp, na.rm=T), mean(.$Mean_Disp, na.rm=T)))))
output:
Source: local data frame [15 x 3]
Groups: cyl [3]
cyl carb Mean_Disp
<dbl> <chr> <dbl>
1 4 1 91.38
2 4 2 116.60
3 4 Total 207.98
4 4 Mean 103.99
5 6 1 241.50
6 6 4 163.80
7 6 6 145.00
8 6 Total 550.30
9 6 Mean 183.43
10 8 2 345.50
11 8 3 275.80
12 8 4 405.50
13 8 8 301.00
14 8 Total 1327.80
15 8 Mean 331.95

With data.table version above v1.11
library(data.table)
cubed <- cube(
as.data.table(mtcars),
.(`mean(disp)` = mean(disp)),
by = c("cyl", "carb")
)
#> cyl carb mean(disp)
#> 1: 6 4 163.8000
#> 2: 4 1 91.3800
#> 3: 6 1 241.5000
#> 4: 8 2 345.5000
#> 5: 8 4 405.5000
#> 6: 4 2 116.6000
#> 7: 8 3 275.8000
#> 8: 6 6 145.0000
#> 9: 8 8 301.0000
#> 10: 6 NA 183.3143
#> 11: 4 NA 105.1364
#> 12: 8 NA 353.1000
#> 13: NA 4 308.8200
#> 14: NA 1 134.2714
#> 15: NA 2 208.1600
#> 16: NA 3 275.8000
#> 17: NA 6 145.0000
#> 18: NA 8 301.0000
#> 19: NA NA 230.7219
res <- dcast(
cubed,
cyl ~ carb,
value.var = "mean(disp)"
)
#> cyl NA 1 2 3 4 6 8
#> 1: NA 230.7219 134.2714 208.16 275.8 308.82 145 301
#> 2: 4 105.1364 91.3800 116.60 NA NA NA NA
#> 3: 6 183.3143 241.5000 NA NA 163.80 145 NA
#> 4: 8 353.1000 NA 345.50 275.8 405.50 NA 301
Created on 2020-02-20 by the reprex package (v0.3.0)
Source: https://jozef.io/r912-datatable-grouping-sets/
library(kableExtra)
options(knitr.kable.NA = "")
res <- as.data.frame(res)
names(res)[2] <- "overall"
res[1, 1] <- "overall"
x <- kable(res, "html")
x <- kable_styling(x, "striped")
add_header_above(x, c(" " = 1, "carb" = ncol(res) - 1))

I know that this may not be a very elegant solution, but I hope it helps anyway:
p <-mtcars %>% group_by(cyl,carb)
p$cyl <- as.factor(p$cyl)
average_disp <- sapply(1:length(levels(p$cyl)), function(x)mean(subset(p,p$cyl==levels(p$cyl)[x])$disp))
df <- data.frame(levels(p$cyl),average_disp)
colnames(df)[1]<-"cyl"
#> df
# cyl average_disp
#1 4 105.1364
#2 6 183.3143
#3 8 353.1000
(Edit: After a minor modification in the definition of p this now yields the same results as @Frank's and @akrun's solution)

You can use this wrapper around ddply, which applies ddply for each possible margin and rbinds the results with its usual output.
To marginalize over all grouping factors:
mtcars %>% ddplym(.variables = .(cyl, carb), .fun = summarise, mean(disp))
To marginalize over carb only:
mtcars %>% ddplym(
.variables = .(carb),
.fun = function(data) data %>% group_by(cyl) %>% summarise(mean(disp)))
Wrapper:
require(plyr)
require(dplyr)
ddplym <- function(.data, .variables, .fun, ..., .margin = TRUE, .margin_name = '(all)') {
if (.margin) {
df <- .ddplym(.data, .variables, .fun, ..., .margin_name = .margin_name)
} else {
df <- ddply(.data, .variables, .fun, ...)
if (.variables %>% length == 0) {
df$.id <- NULL
}
}
return(df)
}
.ddplym <- function(.data,
.variables,
.fun,
...,
.margin_name = '(all)'
) {
.variables <- as.quoted(.variables)
n <- length(.variables)
var_combn_idx <- lapply(0:n, function(x) {
combn(1:n, n - x) %>% alply(2, c)
}) %>%
unlist(recursive = FALSE, use.names = FALSE)
data_list <- lapply(var_combn_idx, function(x) {
data <- ddply(.data, .variables[x], .fun, ...)
# drop '.id' column created when no variables to split by specified
if (!length(.variables[x]))
data <- data[, -1, drop = FALSE]
return(data)
})
# workaround for NULL .variables
if (unlist(.variables) %>% is.null && names(.variables) %>% is.null) {
data_list <- data_list[1]
} else if (unlist(.variables) %>% is.null) {
data_list <- data_list[2]
}
if (length(data_list) > 1) {
data_list <- lapply(data_list, function(data)
rbind_pre(
data = data,
colnames = colnames(data_list[[1]]),
fill = .margin_name
))
}
Reduce(rbind, data_list)
}
rbind_pre <- function(data, colnames, fill = NA) {
colnames_fill <- setdiff(colnames, colnames(data))
data_fill <- matrix(fill,
nrow = nrow(data),
ncol = length(colnames_fill)) %>%
as.data.frame %>% setNames(colnames_fill)
cbind(data, data_fill)[, colnames]
}

Sharing my approach to this (if it's helpful at all). This approach allows for customised sub-totals and totals to be added very easily.
data = data.frame( thing1=sprintf("group %i",trunc(runif(200,0,5))),
thing2=sprintf("type %i",trunc(runif(200,0,5))),
value=rnorm(200,0,1) )
data %>%
group_by( thing1, thing2 ) %>%
summarise( sum=sum(value),
count=n() ) %>%
ungroup() %>%
bind_rows(.,
identity(.) %>%
group_by(thing1) %>%
summarise( aggregation="sub total",
sum=sum(sum),
count=sum(count) ) %>%
ungroup(),
identity(.) %>%
summarise( aggregation="total",
sum=sum(sum),
count=sum(count) ) %>%
ungroup() ) %>%
arrange( thing1, thing2, aggregation ) %>%
select( aggregation, everything() )

Having tried long and hard for very similar issues, I have found that data.table offers the simplest and fastest solution, which fits exactly this purpose:
data.table::cube(
data.table::as.data.table(mtcars),
.(mean_disp = mean(disp)),
by = c("cyl","carb"))
cyl carb mean_disp
1: 6 4 163.8000
2: 4 1 91.3800
3: 6 1 241.5000
4: 8 2 345.5000
5: 8 4 405.5000
6: 4 2 116.6000
7: 8 3 275.8000
8: 6 6 145.0000
9: 8 8 301.0000
10: 6 NA 183.3143
11: 4 NA 105.1364
12: 8 NA 353.1000
13: NA 4 308.8200
14: NA 1 134.2714
15: NA 2 208.1600
16: NA 3 275.8000
17: NA 6 145.0000
18: NA 8 301.0000
19: NA NA 230.7219
The NA entries are the subtotals you are looking for; for instance, in row 10 the 183.31 result is the mean for all 6-cylinder cars. The last row, with NA in both columns, holds the overall mean.
From there, you can easily wrap the result with as_tibble() to jump back into the dplyr semantics world.
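When only some of the subtotals are wanted, data.table also provides groupingsets(), which takes the sets explicitly. The sketch below assumes a data.table version that ships the grouping-set functions; it requests the detail rows, the per-cyl subtotals and the grand total, and skips the per-carb subtotals that cube() would add.

```r
library(data.table)

DT <- as.data.table(mtcars)

gs <- groupingsets(
  DT,
  j = .(mean_disp = mean(disp)),
  by = c("cyl", "carb"),
  sets = list(c("cyl", "carb"), "cyl", character(0)), # character(0) is the grand total
  id = TRUE # adds a `grouping` column identifying the set each row comes from
)
gs
```

The grouping column makes it possible to tell a subtotal row apart from a detail row whose grouping value is genuinely missing.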

Having had this same issue, I'm working on a function to hopefully address this (see https://github.com/jrf1111/TCCD/blob/dev/R/with_subtotals.R). It's still in its development phase, but it does exactly what you're looking for.
mtcars %>%
group_by(cyl, carb) %>%
with_subtotals() %>%
summarize(mean(disp))
# A tibble: 19 x 3
# Groups: cyl [5]
cyl carb `mean(disp)`
<chr> <chr> <dbl>
1 4 1 91.4
2 4 2 117.
3 4 subtotal 105.
4 6 1 242.
5 6 4 164.
6 6 6 145
7 6 subtotal 183.
8 8 2 346.
9 8 3 276.
10 8 4 406.
11 8 8 301
12 8 subtotal 353.
13 subtotal 1 134.
14 subtotal 2 208.
15 subtotal 3 276.
16 subtotal 4 309.
17 subtotal 6 145
18 subtotal 8 301
19 total total 231.
