Suppose I want to summarise a data frame after grouping with differing functions. How can I do that?
mtcars %>% group_by(cyl) %>% summarise(size = n())
# A tibble: 3 x 2
cyl size
<dbl> <int>
1 4 11
2 6 7
3 8 14
But if I try:
mtcars %>% group_by(cyl) %>% summarise(size = n()) %>% summarise_at(vars(c(mpg, am:carb)), mean)
Error in is_string(y) : object 'carb' not found
How can I get first the size of each group with n() and then the mean of the other chosen features?
Here is one way using a dplyr::inner_join() on the two summarize operations by the grouping variable:
mtcars %>%
group_by(cyl) %>%
summarise(size = n()) %>%
inner_join(
mtcars %>%
group_by(cyl) %>%
summarise_at(vars(c(mpg, am:carb)), mean),
by='cyl' )
Output is:
# A tibble: 3 x 6
cyl size mpg am gear carb
<dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 4 11 26.7 0.727 4.09 1.55
2 6 7 19.7 0.429 3.86 3.43
3 8 14 15.1 0.143 3.29 3.5
Since summarise drops every column that is neither a grouping variable nor summarised, an alternative in this case is to first add the group size as a new column with mutate (which leaves all other columns as they are) and then include that column in the summarise_at calculation. The count is constant within each group, so its mean is simply the group size.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate(n = n()) %>%
summarise_at(vars(mpg, am:carb, n), mean)
# A tibble: 3 x 6
# cyl mpg am gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 26.7 0.727 4.09 1.55 11
#2 6 19.7 0.429 3.86 3.43 7
#3 8 15.1 0.143 3.29 3.5 14
We can use data.table methods
library(data.table)
as.data.table(mtcars)[, n := .N, cyl][, lapply(.SD, mean), cyl,
.SDcols = c("mpg", "am", "gear", "carb", "n")]
#    cyl mpg am gear carb n
#1: 6 19.74286 0.4285714 3.857143 3.428571 7
#2: 4 26.66364 0.7272727 4.090909 1.545455 11
#3: 8 15.10000 0.1428571 3.285714 3.500000 14
Or with tidyverse
library(tidyverse)
mtcars %>%
add_count(cyl) %>%
group_by(cyl) %>%
summarise_at(vars(mpg, am:carb, n), mean)
# A tibble: 3 x 6
# cyl mpg am gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 26.7 0.727 4.09 1.55 11
#2 6 19.7 0.429 3.86 3.43 7
#3 8 15.1 0.143 3.29 3.5 14
Or using base R
nm1 <- c("mpg", "am", "gear", "carb", "cyl")
transform(aggregate(.~ cyl, mtcars[nm1], mean), n = as.vector(table(mtcars$cyl)))
# cyl mpg am gear carb n
#1 4 26.66364 0.7272727 4.090909 1.545455 11
#2 6 19.74286 0.4285714 3.857143 3.428571 7
#3 8 15.10000 0.1428571 3.285714 3.500000 14
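With dplyr 1.0.0 or later, the count and the means can probably be computed in a single summarise() call, because across() lets n() sit next to the column summaries. A minimal sketch, assuming a dplyr version that provides across():

```r
library(dplyr)

# Count rows and average the chosen columns in one summarise() call.
# across() applies mean to mpg and to every column from am through carb.
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(size = n(), across(c(mpg, am:carb), mean))

res
```

Unlike the mutate approach, size stays an integer column here because it is never passed through mean.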
I am trying to update my function using the new version of dplyr.
First, I had this function (old version):
slope.k <- function(data, Treatment, Replicate, Day, Ln.AFDMrem){
fitted_models <- data %>% group_by(Treatment, Replicate) %>%
do(model = lm(Ln.AFDMrem ~ Day, data = .))
broom::tidy(fitted_models,model) %>% print(n = Inf)
}
However, the do() function was superseded. Now, I am trying to update with this (new) version:
slope.k <- function(data, Treatment, Replicate, Day, Ln.AFDMrem){
mod_t <- data %>% nest_by(Treatment, Replicate) %>%
mutate(model = list(lm(Ln.AFDMrem ~ Day, data = data))) %>%
summarise(tidy_out = list(tidy(model)))
unnest(select(mod_t, Treatment, tidy_out)) %>% print(n = Inf)
}
However, it doesn't work properly, because I have the following warnings:
Warning messages:
1: `cols` is now required when using unnest().
Please use `cols = c(tidy_out)`
2: `...` is not empty.
We detected these problematic arguments:
* `needs_dots`
These dots only exist to allow future extensions and should be empty.
Did you misspecify an argument?
Thanks in advance!!!
The issue is the use of select with unnest. It can be fixed by dropping the select and passing c(tidy_out) to unnest:
library(dplyr)
library(broom)
library(tidyr)
mtcars %>%
nest_by(carb, gear) %>%
mutate(model = list(lm(mpg ~ disp + drat, data = data))) %>%
summarise(tidy_out = list(tidy(model)), .groups = 'drop') %>%
unnest(c(tidy_out))
-output
# A tibble: 33 x 7
# carb gear term estimate std.error statistic p.value
# <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 3 (Intercept) -8.50 NaN NaN NaN
# 2 1 3 disp 0.0312 NaN NaN NaN
# 3 1 3 drat 7.10 NaN NaN NaN
# 4 1 4 (Intercept) -70.5 302. -0.234 0.854
# 5 1 4 disp -0.0445 0.587 -0.0757 0.952
# 6 1 4 drat 25.5 62.4 0.408 0.753
# 7 2 3 (Intercept) -3.72 8.57 -0.434 0.739
# 8 2 3 disp 0.0437 0.0123 3.54 0.175
# 9 2 3 drat 1.90 2.88 0.661 0.628
#10 2 4 (Intercept) -10.0 226. -0.0443 0.972
# … with 23 more rows
Also, after the summarise step, we can call unnest directly on the 'tidy_out' column.
If we wrap this in a function, assuming that the arguments are passed as unquoted column names:
slope.k <- function(data, Treatment, Replicate, Day, Ln.AFDMrem){
ln_col <- rlang::as_string(ensym(Ln.AFDMrem))
day_col <- rlang::as_string(ensym(Day))
data %>%
nest_by({{Treatment}}, {{Replicate}}) %>%
mutate(model = list(lm(reformulate(day_col, ln_col), data = data))) %>%
summarise(tidy_out = list(tidy(model)), .groups = 'drop') %>%
unnest(tidy_out)
}
slope.k(mtcars, carb, gear, disp, mpg)
# A tibble: 22 x 7
carb gear term estimate std.error statistic p.value
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 3 (Intercept) 22.0 5.35 4.12 0.152
2 1 3 disp -0.00841 0.0255 -0.329 0.797
3 1 4 (Intercept) 52.6 8.32 6.32 0.0242
4 1 4 disp -0.279 0.0975 -2.86 0.104
5 2 3 (Intercept) 1.25 3.49 0.357 0.755
6 2 3 disp 0.0460 0.0100 4.59 0.0443
7 2 4 (Intercept) 36.6 6.57 5.57 0.0308
8 2 4 disp -0.0978 0.0529 -1.85 0.206
9 2 5 (Intercept) 47.0 NaN NaN NaN
10 2 5 disp -0.175 NaN NaN NaN
# … with 12 more rows
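An alternative that avoids nesting altogether might be group_modify(). This is only a sketch with a fixed formula (mpg ~ disp) on mtcars, not a drop-in replacement for slope.k; the name coefs is made up for the illustration:

```r
library(dplyr)
library(broom)

# group_modify() calls the function once per group; .x holds that group's rows
# (without the grouping columns) and the returned data frames are row-bound.
coefs <- mtcars %>%
  group_by(carb, gear) %>%
  group_modify(~ tidy(lm(mpg ~ disp, data = .x))) %>%
  ungroup()

coefs
```

Groups with very few rows may still produce NaN standard errors, as in the output above.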
I want to create a summary table from summarized data using dplyr.
library(dplyr)
mtcars %>% group_by(cyl, gear) %>% summarise(avg_wt = mean(wt))
Here's the output:
# A tibble: 8 x 3
# Groups: cyl [3]
cyl gear avg_wt
<dbl> <dbl> <dbl>
1 4 3 2.46
2 4 4 2.38
3 4 5 1.83
4 6 3 3.34
5 6 4 3.09
6 6 5 2.77
7 8 3 4.10
8 8 5 3.37
How can I generate this output?
columns are cyl and rows are gear:
4 6 8
3 2.46 3.34 4.10
4 2.38 3.09 NA
5 1.83 2.77 3.37
library(dplyr)
library(tidyr)
library(tibble)
mtcars %>%
group_by(cyl, gear) %>%
summarise(avg_wt = mean(wt)) %>%
pivot_wider(
id_cols = "gear",
names_from = "cyl",
values_from = "avg_wt"
) %>%
column_to_rownames("gear")
#> 4 6 8
#> 3 2.465000 3.33750 4.104083
#> 4 2.378125 3.09375 NA
#> 5 1.826500 2.77000 3.370000
Try this:
mytable <- mtcars %>% group_by(cyl, gear) %>% summarise(avg_wt = mean(wt))
tidyr::spread(mytable, cyl, avg_wt)
You should get the following:
gear `4` `6` `8`
<dbl> <dbl> <dbl> <dbl>
1 3 2.46 3.34 4.10
2 4 2.38 3.09 NA
3 5 1.83 2.77 3.37
Hope this helps you.
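If a plain matrix is enough, base R can build the same gear-by-cyl layout without any reshaping. A minimal sketch using tapply() (the name avg_wt is just for illustration):

```r
# tapply() with two grouping factors returns a matrix:
# rows are gear, columns are cyl, and missing combinations are NA.
avg_wt <- with(mtcars, tapply(wt, list(gear = gear, cyl = cyl), mean))

round(avg_wt, 2)
```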
I am building a function that uses {{ }} (curly curly or double mustache)
I would like the user to be able to pass multiple variables into the same {{ }}, but I am not sure if this is possible using {{ }}. I can't find any examples showing how to do this.
Can you tell me whether it is possible and, if so, help me make the minimal reprex below work?
library(tidyverse)
group_mean <- function(.data, group){
.data %>%
group_by({{group}}) %>%
summarise_all(mean)
}
# Works
mtcars %>%
group_mean(group = cyl)
# Fails
mtcars %>%
group_mean(group = c(cyl, am))
Error: Column `c(cyl, am)` must be length 32 (the number of rows) or one, not 64
Edit 2022: Nowadays we'd tend to use the c() syntax of tidyselect for taking in multiple groups of variables.
library(dplyr)
my_mean <- function(data, group_vars, summary_vars) {
data |>
group_by(across({{ group_vars }})) |>
summarise(across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)))
}
mtcars |> my_mean(c(cyl, am), c(mpg, disp))
#> `summarise()` has grouped output by 'cyl'. You can override using the
#> `.groups` argument.
#> # A tibble: 6 × 4
#> # Groups: cyl [3]
#> cyl am mpg disp
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 0 22.9 136.
#> 2 4 1 28.1 93.6
#> 3 6 0 19.1 205.
#> 4 6 1 20.6 155
#> 5 8 0 15.0 358.
#> 6 8 1 15.4 326
See also the Bridge patterns section in https://rlang.r-lib.org/reference/topic-data-mask-programming.html
If your function takes several groups of multiple variables, you need external quoting with vars(). This function simply captures its inputs as a list of expressions:
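Applied to the reprex from the question, the same pattern might look like the sketch below. It assumes dplyr 1.0.0 or later and swaps summarise_all() for across(everything(), mean):

```r
library(dplyr)

# across() evaluates the embraced argument as a tidyselect expression,
# so both a single column and c(col1, col2) are accepted.
group_mean <- function(.data, group) {
  .data %>%
    group_by(across({{ group }})) %>%
    summarise(across(everything(), mean), .groups = "drop")
}

one_group  <- group_mean(mtcars, cyl)
two_groups <- group_mean(mtcars, c(cyl, am))
```

.groups = "drop" returns an ungrouped tibble and silences the grouping message.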
vars(foo, bar)
#> [[1]]
#> <quosure>
#> expr: ^foo
#> env: global
#>
#> [[2]]
#> <quosure>
#> expr: ^bar
#> env: global
Take an argument that you splice with !!!:
group_mean <- function(.data, .vars, ...) {
.data <- doingsomethingelse(.data, ...)
.data %>%
group_by(!!!.vars) %>%
summarise_all(mean)
}
Use it like this:
data %>% group_mean(vars(foo, bar), baz, quux)
For multiple grouping variables, you don't need curly-curly; pass the dots (...) through instead.
group_mean <- function(.data, ...){
.data %>%
group_by(...) %>%
summarise_all(mean)
}
mtcars %>% group_mean(cyl)
# A tibble: 3 x 11
# cyl mpg disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 26.7 105. 82.6 4.07 2.29 19.1 0.909 0.727 4.09 1.55
#2 6 19.7 183. 122. 3.59 3.12 18.0 0.571 0.429 3.86 3.43
#3 8 15.1 353. 209. 3.23 4.00 16.8 0 0.143 3.29 3.5
mtcars %>% group_mean(cyl, am)
# cyl am mpg disp hp drat wt qsec vs gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 0 22.9 136. 84.7 3.77 2.94 21.0 1 3.67 1.67
#2 4 1 28.1 93.6 81.9 4.18 2.04 18.4 0.875 4.25 1.5
#3 6 0 19.1 205. 115. 3.42 3.39 19.2 1 3.5 2.5
#4 6 1 20.6 155 132. 3.81 2.76 16.3 0 4.33 4.67
#5 8 0 15.0 358. 194. 3.12 4.10 17.1 0 3 3.08
#6 8 1 15.4 326 300. 3.88 3.37 14.6 0 5 6
I was trying to pass a list of functions into dplyr's summarize_at function and got a warning:
library(tidyverse)
library(purrr)
p <- c(0.2, 0.5, 0.8)
p_names <- map_chr(p, ~paste0(.x*100, "%"))
p_funs <- map(p, ~partial(quantile, probs = .x, na.rm = TRUE)) %>%
set_names(nm = p_names)
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), funs(!!!p_funs))
#> Warning: funs() is soft deprecated as of dplyr 0.8.0
#> please use list() instead
#>
#> # Before:
#> funs(name = f(.)
#>
#> # After:
#> list(name = ~f(.))
#> This warning is displayed once per session.
#> # A tibble: 3 x 4
#> cyl `20%` `50%` `80%`
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 22.8 26 30.4
#> 2 6 18.3 19.7 21
#> 3 8 13.9 15.2 16.8
I then changed the funs to list but couldn't find a way to unquote the list of funs.
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), list(~ !!!p_funs))
#> Error in !p_funs: invalid argument type
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), list(~ {{p_funs}}))
#> Error: Column `mpg` must be length 1 (a summary value), not 3
list() doesn't support splicing (!!!); use rlang::list2() or lst() instead:
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), rlang::list2(!!!p_funs))
# # A tibble: 3 x 4
# cyl `20%` `50%` `80%`
# <dbl> <dbl> <dbl> <dbl>
# 1 4 22.8 26 30.4
# 2 6 18.3 19.7 21
# 3 8 13.9 15.2 16.8
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), lst(!!!p_funs))
# # A tibble: 3 x 4
# cyl `20%` `50%` `80%`
# <dbl> <dbl> <dbl> <dbl>
# 1 4 22.8 26 30.4
# 2 6 18.3 19.7 21
# 3 8 13.9 15.2 16.8
Though here the simplest option is to pass the list directly:
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), p_funs)
# # A tibble: 3 x 4
# cyl `20%` `50%` `80%`
# <dbl> <dbl> <dbl> <dbl>
# 1 4 22.8 26 30.4
# 2 6 18.3 19.7 21
# 3 8 13.9 15.2 16.8
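Since summarize_at() is superseded, the same named list can probably be handed to across() instead. A sketch, assuming dplyr 1.0.0 or later; the functions are built with base R closures here rather than purrr::partial(), which is a substitution for the sake of a self-contained example:

```r
library(dplyr)

p <- c(0.2, 0.5, 0.8)

# Build one quantile function per probability, named "20%", "50%", "80%".
p_funs <- lapply(p, function(prob) {
  force(prob)
  function(x) quantile(x, probs = prob, na.rm = TRUE, names = FALSE)
})
names(p_funs) <- paste0(p * 100, "%")

# .names = "{.fn}" names each output column after the function alone.
q <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(mpg, p_funs, .names = "{.fn}"))

q
```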
I have a dataset that looks like this:
Category Weekly_Date a b
<chr> <date> <dbl> <dbl>
1 aa 2018-07-01 36.6 1.4
2 aa 2018-07-02 5.30 0
3 bb 2018-07-01 4.62 1.2
4 bb 2018-07-02 3.71 1.5
5 cc 2018-07-01 3.41 12
... ... ... ... ...
I fitted linear regression for each group separately:
fit_linreg <- train %>%
group_by(Category) %>%
do(model = lm(Target ~ Unit_price + Unit_discount, data = .))
Now I have different models for each category:
aa model1
bb model2
cc model3
So, I need to apply each model to the appropriate category. How can I achieve that? (dplyr is preferable)
If you nest your test data and join it with the models, you can then use map2 to make predictions on the test data with the trained models. See the example below with mtcars.
library(tidyverse)
x <- mtcars %>%
group_by(gear) %>%
do(model = lm(mpg ~ hp + wt, data = .))
x
Source: local data frame [3 x 2]
Groups: <by row>
# A tibble: 3 x 2
gear model
* <dbl> <list>
1 3 <S3: lm>
2 4 <S3: lm>
3 5 <S3: lm>
mtcars %>%
group_by(gear) %>%
nest %>%
inner_join(x) %>%
mutate(preds = map2(model, data, predict)) %>%
unnest(preds)
Joining, by = "gear"
# A tibble: 32 x 2
gear preds
<dbl> <dbl>
1 4 22.0
2 4 21.2
3 4 25.1
4 4 26.0
5 4 22.2
6 4 17.8
7 4 17.8
8 4 28.7
9 4 32.3
10 4 30.0
# ... with 22 more rows
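The same idea (pair each model with its own group's new data) can also be sketched in base R with split() and Map(). As in the example above, mtcars stands in for both the training and the test set; the object names are made up for the illustration:

```r
# Fit one model per gear on the training data.
train_by_gear <- split(mtcars, mtcars$gear)
models <- lapply(train_by_gear, function(d) lm(mpg ~ hp + wt, data = d))

# Split the new data the same way, then pair each model with its own group.
test_by_gear <- split(mtcars, mtcars$gear)
preds <- Map(
  function(m, d) predict(m, newdata = d),
  models,
  test_by_gear[names(models)]
)

str(preds)
```

Indexing the test list by names(models) keeps models and data aligned even if the two splits come out in a different order.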
Here's one approach. I'm using data.table to filter, but you can use dplyr instead; I just prefer the data.table syntax.
library(data.table)
d <- as.data.table(mtcars)
cats <- unique(d$cyl)
m <- lapply(cats, function(z){
return(lm(formula = mpg ~ wt + hp + disp,
data = d[cyl == z, ] ))
})
names(m) <- cats
OUTPUT
> summary(m)
Length Class Mode
6 12 lm list
4 12 lm list
8 12 lm list
# Checking first model
> m[[1]]
Call:
lm(formula = mpg ~ wt + hp + disp, data = d[cyl == z, ])
Coefficients:
(Intercept) wt hp disp
30.27791 -3.89618 -0.01097 0.01610
> sapply(1:length(m), function(z) return(summary(m[[z]])$adj.r.squared))
[1] 0.4434228 0.5829574 0.3461900
I named the list because it might be easier to refer to models by name aa or bb in your case. Hope this helps!
I find the nesting and un-nesting very unnatural, so here's my attempt.
Let's say you want the quality of the model's fit.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
do(data.frame(r2 = summary(lm(mpg ~ wt, data = .))$r.squared))
#> # A tibble: 3 x 2
#> # Groups: cyl [3]
#> cyl r2
#> <dbl> <dbl>
#> 1 4 0.509
#> 2 6 0.465
#> 3 8 0.423
Let's say you want the residuals:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
mtcars %>%
group_by(cyl) %>%
do(data.frame(resid = residuals(lm(mpg ~ wt, data = .))))
#> # A tibble: 32 x 2
#> # Groups: cyl [3]
#> cyl resid
#> <dbl> <dbl>
#> 1 4 -3.67
#> 2 4 2.84
#> 3 4 1.02
#> 4 4 5.25
#> 5 4 -0.0513
#> 6 4 4.69
#> 7 4 -4.15
#> 8 4 -1.34
#> 9 4 -1.49
#> 10 4 -0.627
#> # ... with 22 more rows
See ?do for why you need the embedded data.frame(). You'll probably want to include other columns in the result, not just the grouping variable and the residuals. I can't find a neat way to do this other than listing them!
library(dplyr)
mtcars %>%
group_by(cyl) %>%
do(data.frame(disp = .$disp,
qsec = .$qsec,
resid = residuals(lm(mpg ~ wt, data = .))))
#> # A tibble: 32 x 4
#> # Groups: cyl [3]
#> cyl disp qsec resid
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 108 18.6 -3.67
#> 2 4 147. 20 2.84
#> 3 4 141. 22.9 1.02
#> 4 4 78.7 19.5 5.25
#> 5 4 75.7 18.5 -0.0513
#> 6 4 71.1 19.9 4.69
#> 7 4 120. 20.0 -4.15
#> 8 4 79 18.9 -1.34
#> 9 4 120. 16.7 -1.49
#> 10 4 95.1 16.9 -0.627
#> # ... with 22 more rows
Something that doesn't work
For the first example, I thought the following would work:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(r2 = summary(lm(mpg ~ wt, data = .))$r.squared)
#> # A tibble: 3 x 2
#> cyl r2
#> <dbl> <dbl>
#> 1 4 0.753
#> 2 6 0.753
#> 3 8 0.753
But you can see all models have the same r2. It's because the model is being fit to all the data, not per cyl. The `.` here is the magrittr placeholder, which %>% binds to the whole data frame on its left-hand side; summarise() does not re-bind it to the current group. do() works as expected because it subsets the data by group before passing it to the expression to be evaluated. I see they are pondering this, see Hybrid Folding
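If you do want to stay within summarise(), one option might be pick(), which hands the current group's columns to lm(). A sketch, assuming dplyr 1.1.0 or later (where pick() is available):

```r
library(dplyr)

# pick() returns the current group's columns as a data frame,
# so lm() is fit once per cyl rather than on all of mtcars.
r2_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(r2 = summary(lm(mpg ~ wt, data = pick(mpg, wt)))$r.squared)

r2_by_cyl
```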