dplyr: compute means by group, then quantiles for each group

I have a problem with dplyr, or rather I can't figure out how to write the quantile part correctly.
I have data that I want to group by x and y, and then compute the mean of a in each group:
dmean %>%
  group_by(x, y) %>%
  summarise(mean = mean(a))
This part works, no problem.
How do I continue the code to get the 10th and 90th percentiles of a for each group?

You can put several expressions inside summarise, like so:
library(dplyr)
mtcars %>%
  group_by(cyl, am) %>%
  summarise(mean = mean(mpg),
            quantile_10 = quantile(mpg, 0.1),
            quantile_90 = quantile(mpg, 0.9))
# A tibble: 6 x 5
# Groups: cyl [?]
cyl am mean quantile_10 quantile_90
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 0 22.90000 21.76 24.08
2 4 1 28.07500 22.38 32.85
3 6 0 19.12500 17.89 20.74
4 6 1 20.56667 19.96 21.00
5 8 0 15.05000 10.69 18.56
6 8 1 15.40000 15.08 15.72
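As a hedged sketch of the same idea, the version below assumes dplyr 1.0.0 or later (for the .groups argument) and cross-checks one group against base R. names = FALSE only drops the "10%"/"90%" labels that quantile() attaches; the object names by_group and mpg_4_0 are illustrative.

```r
library(dplyr)

# Mean plus 10th and 90th percentiles of mpg for each cyl/am combination.
by_group <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(
    mean        = mean(mpg),
    quantile_10 = quantile(mpg, 0.1, names = FALSE),
    quantile_90 = quantile(mpg, 0.9, names = FALSE),
    .groups     = "drop"   # return an ungrouped tibble (dplyr >= 1.0.0)
  )
by_group

# Cross-check the first group (cyl == 4, am == 0) with base R.
mpg_4_0 <- mtcars$mpg[mtcars$cyl == 4 & mtcars$am == 0]
quantile(mpg_4_0, c(0.1, 0.9))
```

The base R values should match the first row of the table above (21.76 and 24.08).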

Related

Can I use summarise_at for existing variables while adding other variables at the same time?

Suppose I have a grouped data frame:
> mtcars %>%
+ group_by(cyl) %>%
+ summarise(blah = mean(disp))
# A tibble: 3 x 2
cyl blah
<dbl> <dbl>
1 4 105.
2 6 183.
3 8 353.
Then suppose I want to sum some existing variables:
> mtcars %>%
+ group_by(cyl) %>%
+ summarise_at(vars(vs:carb), sum)
# A tibble: 3 x 5
cyl vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 10 8 45 17
2 6 4 3 27 24
3 8 0 2 46 49
However, if I want to add both summarise commands together, I cannot:
> mtcars %>%
+ group_by(cyl) %>%
+ summarise_at(vars(vs:carb), sum) %>%
+ summarise(blah = mean(disp))
Error in mean(disp) : object 'disp' not found
After using group_by() in a dplyr chain, how can I add new features with summarise() as well as summing existing features as above with summarise_at(vars(vs:carb), sum)?
The only way I can think of (at the moment) is to store the data immediately before your first summary, then run two summary verbs, and join them on the grouping variable. For instance:
library(dplyr)
grouped_data <- group_by(mtcars, cyl)
left_join(
summarize(grouped_data, blah = mean(disp)),
summarize_at(grouped_data, vars(vs:carb), sum),
by = "cyl")
# # A tibble: 3 x 6
# cyl blah vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 105. 10 8 45 17
# 2 6 183. 4 3 27 24
# 3 8 353. 0 2 46 49
You can left_join with the dataframe resulting from the summarise.
library(dplyr)
data(mtcars)
mtcars %>%
group_by(cyl) %>%
summarise_at(vars(vs:carb), sum) %>%
left_join(mtcars %>% group_by(cyl) %>% summarise(blah = mean(disp)))
#Joining, by = "cyl"
## A tibble: 3 x 6
# cyl vs am gear carb blah
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 10 8 45 17 105.
#2 6 4 3 27 24 183.
#3 8 0 2 46 49 353.
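What I would do is use mutate_at for the first step, so that the other columns are not collapsed, and then use summarise_at with mean for all the columns together. This works because, after mutate_at, each summed column is constant within a group, so its mean equals the group sum.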
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate_at(vars(vs:carb), sum) %>%
summarise_at(vars(vs:carb, disp), mean)
# cyl vs am gear carb disp
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 10 8 45 17 105.
#2 6 4 3 27 24 183.
#3 8 0 2 46 49 353.
Here's a way. We need to define a helper function first; it only works inside a pipe chain, and it uses unexported functions from dplyr, so it might break one day.
.at <- function(.vars, .funs, ...) {
# make sure we are in a piped call
in_a_piped_fun <- exists(".",parent.frame()) &&
length(ls(envir=parent.frame(), all.names = TRUE)) == 1
if (!in_a_piped_fun)
stop(".at() must be called as an argument to a piped function")
# borrow code from summarize_at
.tbl <- try(eval.parent(quote(.)))
dplyr:::manip_at(
.tbl, .vars, .funs, rlang::enquo(.funs), rlang:::caller_env(),
.include_group_vars = TRUE, ...)
}
library(dplyr, warn.conflicts = FALSE)
mtcars %>%
summarize(!!!.at(vars(vs:carb), sum), blah = mean(disp))
#> vs am gear carb blah
#> 1 14 13 118 90 230.7219
Created on 2019-11-17 by the reprex package (v0.3.0)
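In dplyr 1.0.0 and later, across() can be mixed with ordinary expressions inside a single summarise(), which avoids both the join and the unexported helpers. A minimal sketch, assuming that version is available (the name combined is illustrative):

```r
library(dplyr)

combined <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    across(vs:carb, sum),  # sum the existing columns vs, am, gear, carb
    blah = mean(disp)      # disp is not touched by across(), so this is the raw column
  )
combined
```

Because disp lies outside the vs:carb range, the order of the two arguments does not affect the value of blah here.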

How to calculate a proportion by another variable (not by frequency) in dplyr in R

Using the mtcars data, I want to calculate the proportion of mpg for each group of cyl and am. How can I calculate it?
mtcars %>%
group_by(cyl, am) %>%
summarise(mpg = n(mpg)) %>%
mutate(mpg.gr = mpg/(sum(mpg))
Thanks in advance!
If I understand you correctly, you want the proportion of records for each combination of cyl and am. If so, then I believe your code isn't working because n() doesn't accept an argument. You also need to ungroup() before calculating your proportions.
You could simply do:
mtcars %>%
  group_by(cyl, am) %>%
  summarise(mpg = n()) %>%
  ungroup() %>%
  mutate(mpg.gr = mpg / sum(mpg))
#> # A tibble: 6 x 4
#> cyl am mpg mpg.gr
#> <dbl> <dbl> <int> <dbl>
#> 1 4 0 3 0.0938
#> 2 4 1 8 0.25
#> 3 6 0 4 0.125
#> 4 6 1 3 0.0938
#> 5 8 0 12 0.375
#> 6 8 1 2 0.0625
Note that thanks to ungroup(), the proportions are calculated using the counts of all records, not just those within the cyl group, as before.
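If the goal is instead each group's share of total mpg rather than its share of rows (one possible reading of the question title), the same pattern applies with sum(mpg) in place of n(). A sketch under that assumption; mpg_share, mpg_total and mpg_prop are illustrative names, and .groups = "drop" needs dplyr 1.0.0 or later:

```r
library(dplyr)

mpg_share <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(mpg_total = sum(mpg), .groups = "drop") %>%  # ungrouped result
  mutate(mpg_prop = mpg_total / sum(mpg_total))          # share of overall mpg
mpg_share
```

Dropping the grouping in summarise() plays the same role as the explicit ungroup() call: the denominator is then the total over all six groups.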

filter inside dplyr's summarise

I want to use filter, or a similar function, inside summarise from the dplyr package. I have a data frame (e.g. mtcars) that I need to group by a factor (e.g. cyl) and then calculate some statistics, including each cyl type's percentage of total wt (wt.pc).
The question is: how can I subset/filter the wt column inside summarise to get that percentage without the last 10 rows?
I've tried this code, but it returns NA:
mtcars %>%
group_by(cyl) %>%
summarise(wt = round(sum(wt)),
wt.pc = sum(wt) * 100 / sum(mtcars[, 6]),
wt.pc.short = sum(wt[1:22]) * 100 / sum(mtcars[1:22, 6]),
drat.max = round(max(drat)))
# A tibble: 3 x 5
cyl wt wt.pc wt.pc.short drat.max
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 25 24.3 NA 5
2 6 22 21.4 NA 4
3 8 56 54.4 NA 4
wt.pc.short — % of sum(wt) for every cyl for shorter dataframe mtcars[1:22,]
Something like this?
mtcars %>%
mutate(id = row_number()) %>%
group_by(cyl) %>%
summarise(wt_new = round(sum(wt)), # note the change in name here!
wt.pc = sum(wt) * 100 / sum(mtcars[, 6]),
wt.pc.short = sum(wt[id<23]) * 100 / sum(mtcars[1:22, 6]),
drat.max = round(max(drat)))
# A tibble: 3 x 5
cyl wt_new wt.pc wt.pc.short drat.max
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 25 24.3 22.7 5
2 6 22 21.4 25.8 4
3 8 56 54.4 51.6 4
The important part here is that when you assign wt in the call to summarize, all subsequent references to wt will take the previously assigned wt, not the original wt. A statement such as wt[1:22] is thus somewhat problematic. You can see this here:
mean(mtcars[,"mpg"])
# [1] 20.09062
var(mtcars[,"mpg"])
# [1] 36.3241
mtcars %>% summarise(var_before = var(mpg),
mpg = mean(mpg),
var_after = var(mpg))
# var_before mpg var_after
# 1 36.3241 20.09062 NA
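Another way to sidestep the shadowing issue is to subset the rows before grouping, so that no indexing is needed inside summarise(). A sketch, assuming the "shorter" data frame means the first 22 rows of mtcars as in the question; short and wt_short are hypothetical names:

```r
library(dplyr)

short <- head(mtcars, 22)   # drop the last 10 of the 32 rows

wt_short <- short %>%
  group_by(cyl) %>%
  summarise(wt.pc.short = sum(wt) * 100 / sum(short$wt))  # % of the shortened total
wt_short
```

Since every row of short belongs to exactly one cyl group, the three percentages add up to 100.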
I think you can do it like this. First we calculate the row number within each group. If max(ID) > 10, the group has enough observations for the last 10 rows to be removed, so we keep only rows with ID < max(ID) - 9 (i.e. drop the last 10 rows); otherwise ID == ID is TRUE for every row and nothing is removed.
mtcars %>%
  group_by(cyl) %>%
  mutate(ID = row_number()) %>%
  filter(if (max(ID) > 10) ID < (max(ID) - 9) else ID == ID)

Use group size (`group_size`) in `summarise` in `dplyr`

I want to use the size of a group as part of a groupwise operation in dplyr::summarise.
E.g., calculate the proportion of manuals by cylinder by grouping the mtcars data by cyl and dividing the number of manuals by the size of the group:
mtcars %>%
group_by(cyl) %>%
summarise(zz = sum(am)/group_size(.))
But (I think) because group_size expects a grouped tbl_df and . is ungrouped, this returns:
Error in mutate_impl(.data, dots) : basic_string::resize
Is there a way to do this?
You can use n() to get the number of rows in the current group:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(zz = sum(am)/n())
# cyl zz
# <dbl> <dbl>
#1 4.00 0.727
#2 6.00 0.429
#3 8.00 0.143
This is just a grouped mean: because am is coded 0/1, sum(am)/n() is the same as mean(am).
mtcars %>%
group_by(cyl) %>%
summarise(zz = mean(am))
# A tibble: 3 x 2
# cyl zz
# <dbl> <dbl>
#1 4 0.727
#2 6 0.429
#3 8 0.143
If we need to use group_size:
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
nest %>%
mutate(zz = map_dbl(data, ~ sum(.x$am)/group_size(.x))) %>%
arrange(cyl) %>%
select(-data)
# A tibble: 3 x 2
# cyl zz
# <dbl> <dbl>
#1 4 0.727
#2 6 0.429
#3 8 0.143
Or using do:
mtcars %>%
group_by(cyl) %>%
do(data.frame(zz = sum(.$am)/group_size(.)))
# A tibble: 3 x 2
# Groups: cyl [3]
# cyl zz
# <dbl> <dbl>
#1 4 0.727
#2 6 0.429
#3 8 0.143
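For reference, group_size() is meant to be called on a grouped data frame as a whole and returns one size per group, whereas n() is evaluated once per group inside summarise(). A short sketch comparing the two (grouped, sizes and by_n are illustrative names):

```r
library(dplyr)

grouped <- group_by(mtcars, cyl)

sizes <- group_size(grouped)           # one count per group, as a plain vector
by_n  <- summarise(grouped, n = n())   # the same counts, computed per group

sizes
by_n
```

Both give the counts for cyl 4, 6 and 8, which is why n() is the natural choice inside summarise().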

Summarise for multiple group_by variables combined and individually

I am using dplyr's group_by and summarise to get a mean by each group_by variable combined, but also want to get the mean by each group_by variable individually.
For example if I run
mtcars %>%
group_by(cyl, vs) %>%
summarise(new = mean(wt))
I get
cyl vs new
<dbl> <dbl> <dbl>
4 0 2.140000
4 1 2.300300
6 0 2.755000
6 1 3.388750
8 0 3.999214
But I want to get
cyl vs new
<dbl> <dbl> <dbl>
4 0 2.140000
4 1 2.300300
4 NA 2.285727
6 0 2.755000
6 1 3.388750
6 NA 3.117143
8 0 3.999214
NA 0 3.688556
NA 1 2.611286
I.e. get the mean for the variables both combined and individually
Edit
Jaap marked this as duplicate and pointed me in the direction of Using aggregate to apply several functions on several variables in one call. I looked at jaap's answer there which referenced dplyr but I can't see how that answers my question? You say to use summarise_each, but I still don't see how I can use that to get the mean of each of my group by variables individually? Apologies if I am being stupid...
Here is an idea using bind_rows:
library(dplyr)
mtcars %>%
group_by(cyl, vs) %>%
summarise(new = mean(wt)) %>%
bind_rows(.,
mtcars %>% group_by(cyl) %>% summarise(new = mean(wt)) %>% mutate(vs = NA),
mtcars %>% group_by(vs) %>% summarise(new = mean(wt)) %>% mutate(cyl = NA)) %>%
arrange(cyl) %>%
ungroup()
# A tibble: 10 × 3
# cyl vs new
# <dbl> <dbl> <dbl>
#1 4 0 2.140000
#2 4 1 2.300300
#3 4 NA 2.285727
#4 6 0 2.755000
#5 6 1 3.388750
#6 6 NA 3.117143
#7 8 0 3.999214
#8 8 NA 3.999214
#9 NA 0 3.688556
#10 NA 1 2.611286
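The repeated pipeline can be factored into a small helper. The sketch below assumes dplyr 1.0.0 or later; mean_wt_by is a hypothetical helper name, and it relies on bind_rows() filling columns that are absent from one of the tables with NA.

```r
library(dplyr)

# Hypothetical helper: mean of wt for whatever grouping columns are passed in.
mean_wt_by <- function(data, ...) {
  data %>%
    group_by(...) %>%
    summarise(new = mean(wt), .groups = "drop")
}

all_means <- bind_rows(
  mean_wt_by(mtcars, cyl, vs),  # combined grouping
  mean_wt_by(mtcars, cyl),      # cyl only; vs is filled with NA
  mean_wt_by(mtcars, vs)        # vs only; cyl is filled with NA
) %>%
  arrange(cyl, vs)              # NA sorts last within each key
all_means
```

Adding a further grouping variable then only requires one more mean_wt_by() call inside bind_rows().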
