I have a package that creates calls containing stats details that can then be displayed in plots.
Here is a simple use case:
# setup
set.seed(123)
library(statsExpressions)
library(tidyverse)
# two-sample t-test results in an expression
stats_exp <- bf_ttest(mtcars, am, wt)
# class of object
class(stats_exp)
#> [1] "call"
# using the expression to display details in a plot
ggplot(mtcars, aes(as.factor(am), wt)) + geom_boxplot() +
labs(subtitle = stats_exp)
Now let's say I wanted to do the same kind of visualizations for all levels of a grouping variable. In this case, I will need to create and save the call for each level.
I can successfully do so using tidyr, which can save the call objects in a list column:
# doing this across groups
(df <- mtcars %>%
group_nest(cyl) %>%
mutate(stats_exp = data %>% map(., ~bf_ttest(., am, wt))))
# A tibble: 3 x 3
cyl data stats_exp
<dbl> <list> <list>
1 4 <tibble [11 × 10]> <language>
2 6 <tibble [7 × 10]> <language>
3 8 <tibble [14 × 10]> <language>
# did it work? yes!
df$stats_exp[[1]]
#> atop(displaystyle(NULL), expr = paste("In favor of null: ", "log"["e"],
#> "(BF"["01"], ") = ", "-1.58", ", ", italic("r")["Cauchy"]^"JZS",
#> " = ", "0.71"))
The problem arises when I try to unnest it, which I would like to do since I will need to do some other operations on this dataframe somewhere downstream in my workflow:
# unnest
tidyr::unnest(data = df, cols = c(stats_exp, data))
#> Error: Input must be list of vectors
How can I avoid this error?
I'm not sure what you intend to do with stats_exp after you've manipulated the other data, but this could be a potential solution: deparse each call to a character string, which unnest() can handle.
set.seed(123)
library(statsExpressions)
library(tidyverse)
stats_exp <- bf_ttest(mtcars, am, wt)
df <- mtcars %>%
group_nest(cyl) %>%
mutate(stats_exp = map(data, ~ bf_ttest(.x, am, wt)),
stats_chr = map(stats_exp, ~ paste0(deparse(.x), collapse = " ")))
df %>%
select(stats_chr) %>%
unnest(cols = stats_chr)
#> # A tibble: 3 x 1
#> stats_chr
#> <chr>
#> 1 "atop(displaystyle(NULL), expr = paste(\"In favor of null: \", \"log\"[\"e\"]~
#> 2 "atop(displaystyle(NULL), expr = paste(\"In favor of null: \", \"log\"[\"e\"]~
#> 3 "atop(displaystyle(NULL), expr = paste(\"In favor of null: \", \"log\"[\"e\"]~
Created on 2020-02-25 by the reprex package (v0.3.0)
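If the calls are needed again later (for example as plot subtitles), the deparsed strings can be parsed back into calls. The following is a minimal sketch, assuming hand-written plotmath calls as stand-ins for the output of bf_ttest(); the names calls, df, and restored are illustrative only:

```r
library(tibble)
library(dplyr)
library(purrr)

# Hypothetical stand-ins for the calls returned by bf_ttest()
calls <- list(
  quote(paste("log"["e"], "(BF"["01"], ") = ", "-1.58")),
  quote(italic("r")^2)
)

df <- tibble(id = 1:2, stats_exp = calls) %>%
  # deparse() can return several lines for a long call, so collapse to one string
  mutate(stats_chr = map_chr(stats_exp, ~ paste(deparse(.x), collapse = " ")))

# Later: turn each string back into a call object
restored <- map(df$stats_chr, ~ parse(text = .x)[[1]])
```

The character column can be unnested, filtered, or written to disk like any other column, and the calls are rebuilt only where they are actually needed.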
Based on a solution provided on Twitter (h/t #dvaughan32): unnest() won't fail if stats_exp is left out of the cols argument, and the list column of calls is simply carried along:
library(tidyverse)
library(statsExpressions)
# doing this across groups
df <- mtcars %>%
group_nest(cyl) %>%
mutate(stats_exp = data %>% map(., ~bf_ttest(., am, wt)))
# alternative
tidyr::unnest(data = df, cols = c(data))
#> # A tibble: 32 x 12
#> cyl mpg disp hp drat wt qsec vs am gear carb stats_exp
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>
#> 1 4 22.8 108 93 3.85 2.32 18.6 1 1 4 1 <language>
#> 2 4 24.4 147. 62 3.69 3.19 20 1 0 4 2 <language>
#> 3 4 22.8 141. 95 3.92 3.15 22.9 1 0 4 2 <language>
#> 4 4 32.4 78.7 66 4.08 2.2 19.5 1 1 4 1 <language>
#> 5 4 30.4 75.7 52 4.93 1.62 18.5 1 1 4 2 <language>
#> 6 4 33.9 71.1 65 4.22 1.84 19.9 1 1 4 1 <language>
#> 7 4 21.5 120. 97 3.7 2.46 20.0 1 0 3 1 <language>
#> 8 4 27.3 79 66 4.08 1.94 18.9 1 1 4 1 <language>
#> 9 4 26 120. 91 4.43 2.14 16.7 0 1 5 2 <language>
#> 10 4 30.4 95.1 113 3.77 1.51 16.9 1 1 5 2 <language>
#> # … with 22 more rows
Created on 2020-02-27 by the reprex package (v0.3.0)
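The same behaviour can be reproduced without statsExpressions. This is a small sketch, assuming a call built with bquote() as a stand-in for the real statistical expression (the label column is hypothetical):

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- as_tibble(mtcars) %>%
  group_nest(cyl) %>%
  # one call per group; bquote() substitutes the group's value into the label
  mutate(label = map(cyl, ~ bquote(paste("cyl = ", .(.x)))))

# Unnest only the data column; the list column of calls is repeated per row
out <- unnest(df, cols = data)
```

Each row of out keeps its group's call in the label list column, so it can still be pulled out with out$label[[i]] when building a plot.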
Related
A very similar question was asked here, but I want to add columns for a confidence interval. Their example that works:
x <- mtcars %>%
group_by(gear) %>%
do(model = lm(mpg ~ hp + wt, data = .))
x
Source: local data frame [3 x 2]
Groups: <by row>
# A tibble: 3 x 2
gear model
* <dbl> <list>
1 3 <S3: lm>
2 4 <S3: lm>
3 5 <S3: lm>
mtcars %>%
group_by(gear) %>%
nest %>%
inner_join(x) %>%
mutate(preds = map2(model, data, predict)) %>%
unnest(data, preds)
This works and produces an additional column for mtcars with predicted values from a separate model for each group. Now I'd like to include confidence interval columns from predict():
mtcars %>%
group_by(gear) %>%
nest %>%
inner_join(x) %>%
mutate(preds = map2(model, data, predict, interval = "confidence")) %>%
unnest(data, preds)
This returns the error:
Error in vec_rbind(!!!x, .ptype = ptype) : Internal error in `vec_assign()`: `value` should have been recycled to fit `x`.
The error is triggered in unnest() in the final line. I think the issue is related to the output format of predict(), which is a 3-column dataframe (fit, upr, lwr). Any help would be appreciated!
With interval = "confidence", the output of predict() is a matrix (columns fit, lwr, upr), not a data frame. Convert it to a data frame and then unnest:
library(tidyverse)
mtcars %>%
group_by(gear) %>%
nest %>%
inner_join(x) %>%
mutate(preds = map2(model, data,
~as.data.frame(predict(.x, .y, interval = "confidence")))) %>%
unnest(cols = c(preds, data))
# gear mpg cyl disp hp drat wt qsec vs am carb model fit lwr upr
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <dbl> <dbl> <dbl>
# 1 4 21 6 160 110 3.9 2.62 16.5 0 1 4 <lm> 22.0 19.6 24.4
# 2 4 21 6 160 110 3.9 2.88 17.0 0 1 4 <lm> 21.2 19.2 23.2
# 3 4 22.8 4 108 93 3.85 2.32 18.6 1 1 1 <lm> 25.1 23.0 27.1
# 4 4 24.4 4 147. 62 3.69 3.19 20 1 0 2 <lm> 26.0 21.5 30.6
# 5 4 22.8 4 141. 95 3.92 3.15 22.9 1 0 2 <lm> 22.2 19.9 24.4
# 6 4 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 <lm> 17.8 15.1 20.5
# 7 4 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 <lm> 17.8 15.1 20.5
# 8 4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 1 <lm> 28.7 26.6 30.8
# 9 4 30.4 4 75.7 52 4.93 1.62 18.5 1 1 2 <lm> 32.3 29.3 35.3
#10 4 33.9 4 71.1 65 4.22 1.84 19.9 1 1 1 <lm> 30.0 27.5 32.5
# … with 22 more rows
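To check the claim about the return type, here is a short sketch on a single ungrouped model (the names fit, ci, and ci_df are illustrative):

```r
# Fit one model on the full data, only to inspect what predict() returns
fit <- lm(mpg ~ hp + wt, data = mtcars)

# With an interval requested, predict() returns a numeric matrix
ci <- predict(fit, newdata = mtcars, interval = "confidence")
is.matrix(ci)
colnames(ci)

# A data frame has one column per matrix column, which unnest() can bind
ci_df <- as.data.frame(ci)
```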
If I have a function defined using rlang, how can I use purrr::map to apply it to several variables?
Suppose I have a function defined as:
mean_by <- function(data, by, var) {
data %>%
group_by({{ by }}) %>%
summarise(avg = mean({{ var }}, na.rm = TRUE))
}
This computes group means.
Preferably using a purrr::map solution, how could I apply this function for several "by" variables but a single "var" in a data frame?
You need either the !!! operator or group_by_at():
library(tidyverse)
mean_by <- function(data, by, var) {
data %>%
group_by_at(by) %>%
summarise(avg = {{var}} %>% mean(na.rm =TRUE))
}
mtcars %>%
mean_by(by = vars(mpg,cyl),hp)
#> # A tibble: 27 x 3
#> # Groups: mpg [25]
#> mpg cyl avg
#> <dbl> <dbl> <dbl>
#> 1 10.4 8 210
#> 2 13.3 8 245
#> 3 14.3 8 245
#> 4 14.7 8 230
#> 5 15 8 335
#> 6 15.2 8 165
#> 7 15.5 8 150
#> 8 15.8 8 264
#> 9 16.4 8 180
#> 10 17.3 8 180
#> # … with 17 more rows
# or
mean_by <- function(data, by, var) {
data %>%
group_by(!!!by) %>%
summarise(avg = {{var}} %>% mean(na.rm =TRUE))
}
mtcars %>%
mean_by(by = vars(cyl,disp),hp)
#> # A tibble: 27 x 3
#> # Groups: cyl [3]
#> cyl disp avg
#> <dbl> <dbl> <dbl>
#> 1 4 71.1 65
#> 2 4 75.7 52
#> 3 4 78.7 66
#> 4 4 79 66
#> 5 4 95.1 113
#> 6 4 108 93
#> 7 4 120. 97
#> 8 4 120. 91
#> 9 4 121 109
#> 10 4 141. 95
#> # … with 17 more rows
Created on 2020-01-07 by the reprex package (v0.3.0)
A good alternative is to "pass the dots".
After data, the first argument is the single variable you want to summarise; use ... to pass any grouping variables you want.
This way you have a cleaner syntax for your function and you avoid including the vars function.
library(tidyverse)
mean_by <- function(data, var, ...) {
data %>%
group_by(...) %>%
summarise(avg = {{var}} %>% mean(na.rm =TRUE))
}
mtcars %>%
mean_by(hp, cyl, disp)
#> # A tibble: 27 x 3
#> # Groups: cyl [3]
#> cyl disp avg
#> <dbl> <dbl> <dbl>
#> 1 4 71.1 65
#> 2 4 75.7 52
#> 3 4 78.7 66
#> 4 4 79 66
#> 5 4 95.1 113
#> 6 4 108 93
#> 7 4 120. 97
#> 8 4 120. 91
#> 9 4 121 109
#> 10 4 141. 95
#> # ... with 17 more rows
mtcars %>%
mean_by(hp)
#> # A tibble: 1 x 1
#> avg
#> <dbl>
#> 1 147.
Created on 2020-01-08 by the reprex package (v0.3.0)
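None of the versions above uses purrr::map, which the question asked for. One possible sketch, assuming dplyr 1.0.0 or later (for across()) and a variant of the function that takes the grouping column as a string; mean_by_chr and by_vars are hypothetical names:

```r
library(dplyr)
library(purrr)

# Variant of mean_by() whose `by` argument is a character vector of column names
mean_by_chr <- function(data, by, var) {
  data %>%
    group_by(across(all_of(by))) %>%
    summarise(avg = mean({{ var }}, na.rm = TRUE))
}

by_vars <- c("cyl", "gear")

# One summary table per grouping variable, returned as a named list
results <- map(set_names(by_vars), ~ mean_by_chr(mtcars, .x, hp))
```

results$cyl and results$gear are then separate tibbles, each with the grouping column and avg.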
I am building a function that uses {{ }} (curly curly or double mustache)
I would like the user to be able to pass multiple variables into the same {{ }}, but I am not sure if this is possible using {{ }}. I can't find any examples showing how to do this.
Can you tell me if it is possible and, if so, help me make the minimal reprex below work?
library(tidyverse)
group_mean <- function(.data, group){
.data %>%
group_by({{group}}) %>%
summarise_all(mean)
}
# Works
mtcars %>%
group_mean(group = cyl)
# Fails
mtcars %>%
group_mean(group = c(cyl, am))
Error: Column `c(cyl, am)` must be length 32 (the number of rows) or one, not 64
Edit 2022: Nowadays we'd tend to use the c() syntax of tidyselect for taking in multiple groups of variables.
library(dplyr)
my_mean <- function(data, group_vars, summary_vars) {
data |>
group_by(across({{ group_vars }})) |>
summarise(across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)))
}
mtcars |> my_mean(c(cyl, am), c(mpg, disp))
#> `summarise()` has grouped output by 'cyl'. You can override using the
#> `.groups` argument.
#> # A tibble: 6 × 4
#> # Groups: cyl [3]
#> cyl am mpg disp
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 0 22.9 136.
#> 2 4 1 28.1 93.6
#> 3 6 0 19.1 205.
#> 4 6 1 20.6 155
#> 5 8 0 15.0 358.
#> 6 8 1 15.4 326
See also the Bridge patterns section in https://rlang.r-lib.org/reference/topic-data-mask-programming.html
If your function takes several groups of multiple variables, you need external quoting with vars(). This function simply captures its inputs as a list of expressions:
vars(foo, bar)
#> [[1]]
#> <quosure>
#> expr: ^foo
#> env: global
#>
#> [[2]]
#> <quosure>
#> expr: ^bar
#> env: global
Take an argument that you splice with !!!:
group_mean <- function(.data, .vars, ...) {
.data <- doingsomethingelse(.data, ...)
.data %>%
group_by(!!!.vars) %>%
summarise_all(mean)
}
Use it like this:
data %>% group_mean(vars(foo, bar), baz, quux)
For multiple grouping variables you don't need curly-curly; pass the dots (...) instead.
group_mean <- function(.data, ...){
.data %>%
group_by(...) %>%
summarise_all(mean)
}
mtcars %>% group_mean(cyl)
# A tibble: 3 x 11
# cyl mpg disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 26.7 105. 82.6 4.07 2.29 19.1 0.909 0.727 4.09 1.55
#2 6 19.7 183. 122. 3.59 3.12 18.0 0.571 0.429 3.86 3.43
#3 8 15.1 353. 209. 3.23 4.00 16.8 0 0.143 3.29 3.5
mtcars %>% group_mean(cyl, am)
# cyl am mpg disp hp drat wt qsec vs gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 0 22.9 136. 84.7 3.77 2.94 21.0 1 3.67 1.67
#2 4 1 28.1 93.6 81.9 4.18 2.04 18.4 0.875 4.25 1.5
#3 6 0 19.1 205. 115. 3.42 3.39 19.2 1 3.5 2.5
#4 6 1 20.6 155 132. 3.81 2.76 16.3 0 4.33 4.67
#5 8 0 15.0 358. 194. 3.12 4.10 17.1 0 3 3.08
#6 8 1 15.4 326 300. 3.88 3.37 14.6 0 5 6
I was trying to pass a list of functions into dplyr's summarize_at function and got a warning:
library(tidyverse)
library(purrr)
p <- c(0.2, 0.5, 0.8)
p_names <- map_chr(p, ~paste0(.x*100, "%"))
p_funs <- map(p, ~partial(quantile, probs = .x, na.rm = TRUE)) %>%
set_names(nm = p_names)
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), funs(!!!p_funs))
#> Warning: funs() is soft deprecated as of dplyr 0.8.0
#> please use list() instead
#>
#> # Before:
#> funs(name = f(.)
#>
#> # After:
#> list(name = ~f(.))
#> This warning is displayed once per session.
#> # A tibble: 3 x 4
#> cyl `20%` `50%` `80%`
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 22.8 26 30.4
#> 2 6 18.3 19.7 21
#> 3 8 13.9 15.2 16.8
I then changed funs() to list() but couldn't find a way to unquote the list of functions.
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), list(~ !!!p_funs))
#> Error in !p_funs: invalid argument type
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), list(~ {{p_funs}}))
#> Error: Column `mpg` must be length 1 (a summary value), not 3
list() doesn't support splicing (!!!); use rlang::list2() or lst() instead:
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), rlang::list2(!!!p_funs))
# # A tibble: 3 x 4
# cyl `20%` `50%` `80%`
# <dbl> <dbl> <dbl> <dbl>
# 1 4 22.8 26 30.4
# 2 6 18.3 19.7 21
# 3 8 13.9 15.2 16.8
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), lst(!!!p_funs))
# # A tibble: 3 x 4
# cyl `20%` `50%` `80%`
# <dbl> <dbl> <dbl> <dbl>
# 1 4 22.8 26 30.4
# 2 6 18.3 19.7 21
# 3 8 13.9 15.2 16.8
Though here the simplest option is to pass the list directly:
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), p_funs)
# # A tibble: 3 x 4
# cyl `20%` `50%` `80%`
# <dbl> <dbl> <dbl> <dbl>
# 1 4 22.8 26 30.4
# 2 6 18.3 19.7 21
# 3 8 13.9 15.2 16.8
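The difference between list() and list2() can be seen outside of dplyr. This is a small sketch with a hypothetical list of functions fs:

```r
library(rlang)

fs <- list(lo = min, hi = max)

# list2() understands !!! and splices the elements of fs in place
spliced <- list2(!!!fs, mid = median)
names(spliced)

# Splicing a list and nothing else gives back the same list
same <- identical(list2(!!!fs), fs)

# Base list() parses !!!fs as three logical negations of a list, which is an
# error; the message is captured here instead of stopping the script
base_attempt <- tryCatch(list(!!!fs), error = function(e) conditionMessage(e))
```

This is why passing p_funs directly works: it is already the named list that list2(!!!p_funs) would rebuild.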
I have a dataset that looks like this:
Category Weekly_Date a b
<chr> <date> <dbl> <dbl>
1 aa 2018-07-01 36.6 1.4
2 aa 2018-07-02 5.30 0
3 bb 2018-07-01 4.62 1.2
4 bb 2018-07-02 3.71 1.5
5 cc 2018-07-01 3.41 12
... ... ... ... ...
I fitted linear regression for each group separately:
fit_linreg <- train %>%
group_by(Category) %>%
do(model = lm(Target ~ Unit_price + Unit_discount, data = .))
Now I have different models for each category:
aa model1
bb model2
cc model3
So I need to apply each model to the appropriate category. How can I achieve that? (A dplyr solution is preferable.)
If you nest your test data and join it with the models, you can use map2 to make predictions on the test data with the trained models. See the example below with mtcars.
library(tidyverse)
x <- mtcars %>%
group_by(gear) %>%
do(model = lm(mpg ~ hp + wt, data = .))
x
Source: local data frame [3 x 2]
Groups: <by row>
# A tibble: 3 x 2
gear model
* <dbl> <list>
1 3 <S3: lm>
2 4 <S3: lm>
3 5 <S3: lm>
mtcars %>%
group_by(gear) %>%
nest %>%
inner_join(x) %>%
mutate(preds = map2(model, data, predict)) %>%
unnest(preds)
Joining, by = "gear"
# A tibble: 32 x 2
gear preds
<dbl> <dbl>
1 4 22.0
2 4 21.2
3 4 25.1
4 4 26.0
5 4 22.2
6 4 17.8
7 4 17.8
8 4 28.7
9 4 32.3
10 4 30.0
# ... with 22 more rows
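The example above predicts on the same data the models were fitted to. Here is a sketch of the same pattern with a separate test set, assuming tidyr 1.0.0 or later and an arbitrary even/odd row split of mtcars standing in for the real train and test data:

```r
library(dplyr)
library(tidyr)
library(purrr)

cars  <- as_tibble(mtcars)
train <- cars[seq(2, 32, by = 2), ]  # arbitrary split, for illustration only
test  <- cars[seq(1, 32, by = 2), ]

# One model per group, fitted on the training rows only
models <- train %>%
  nest(data = -cyl) %>%
  mutate(model = map(data, ~ lm(mpg ~ wt, data = .x))) %>%
  select(cyl, model)

# Nest the test rows, attach the matching model, predict, then unnest
preds <- test %>%
  nest(data = -cyl) %>%
  inner_join(models, by = "cyl") %>%
  mutate(pred = map2(model, data, ~ unname(predict(.x, newdata = .y)))) %>%
  select(-model) %>%
  unnest(cols = c(data, pred))
```

Dropping the model column before unnest() keeps the result a plain table of test rows with one extra pred column.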
Here's one approach. I'm using data.table to filter, but you could use dplyr instead; I just prefer the data.table syntax.
library(data.table)
d <- as.data.table(mtcars)
# fit one model per distinct value of cyl
cats <- unique(d$cyl)
m <- lapply(cats, function(z){
return(lm(formula = mpg ~ wt + hp + disp,
data = d[cyl == z, ] ))
})
# name each model after its group so it can be looked up by name
names(m) <- cats
OUTPUT
> summary(m)
Length Class Mode
6 12 lm list
4 12 lm list
8 12 lm list
# Checking first model
> m[[1]]
Call:
lm(formula = mpg ~ wt + hp + disp, data = d[cyl == z, ])
Coefficients:
(Intercept) wt hp disp
30.27791 -3.89618 -0.01097 0.01610
> sapply(1:length(m), function(z) return(summary(m[[z]])$adj.r.squared))
[1] 0.4434228 0.5829574 0.3461900
I named the list because it might be easier to refer to models by name aa or bb in your case. Hope this helps!
I find the nesting and un-nesting very unnatural, so here's my attempt.
Let's say you want the quality of the model's fit.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
do(data.frame(r2 = summary(lm(mpg ~ wt, data = .))$r.squared))
#> # A tibble: 3 x 2
#> # Groups: cyl [3]
#> cyl r2
#> <dbl> <dbl>
#> 1 4 0.509
#> 2 6 0.465
#> 3 8 0.423
Let's say you want the residuals:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
do(data.frame(resid = residuals(lm(mpg ~ wt, data = .))))
#> # A tibble: 32 x 2
#> # Groups: cyl [3]
#> cyl resid
#> <dbl> <dbl>
#> 1 4 -3.67
#> 2 4 2.84
#> 3 4 1.02
#> 4 4 5.25
#> 5 4 -0.0513
#> 6 4 4.69
#> 7 4 -4.15
#> 8 4 -1.34
#> 9 4 -1.49
#> 10 4 -0.627
#> # ... with 22 more rows
See ?do for why you need the embedded data.frame(). You'll probably want to include other columns in the result, not just the grouping variable and the residuals. I can't find a neat way to do this, other than listing them!
library(dplyr)
mtcars %>%
group_by(cyl) %>%
do(data.frame(disp = .$disp,
qsec = .$qsec,
resid = residuals(lm(mpg ~ wt, data = .))))
#> # A tibble: 32 x 4
#> # Groups: cyl [3]
#> cyl disp qsec resid
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 108 18.6 -3.67
#> 2 4 147. 20 2.84
#> 3 4 141. 22.9 1.02
#> 4 4 78.7 19.5 5.25
#> 5 4 75.7 18.5 -0.0513
#> 6 4 71.1 19.9 4.69
#> 7 4 120. 20.0 -4.15
#> 8 4 79 18.9 -1.34
#> 9 4 120. 16.7 -1.49
#> 10 4 95.1 16.9 -0.627
#> # ... with 22 more rows
Something that doesn't work
For the first example, I thought the following would work:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(r2 = summary(lm(mpg ~ wt, data = .))$r.squared)
#> # A tibble: 3 x 2
#> cyl r2
#> <dbl> <dbl>
#> 1 4 0.753
#> 2 6 0.753
#> 3 8 0.753
But you can see all models have the same r2. That's because the model is being fit to all the data, not per cyl: inside summarise(), . is the pipe's placeholder for the whole data frame, not the current group. Looking at the authors' code, I believe it may also be related to how they've optimised the evaluation of mutate() and summarise() using Rcpp. But do() works as expected: it subsets the data by group before passing it to the expression to be evaluated. I see they are pondering this; see Hybrid Folding.
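For comparison, here is a sketch of a per-group fit that uses neither do() nor the pipe placeholder, assuming a dplyr version that provides group_modify() (r2_by_cyl is an illustrative name):

```r
library(dplyr)

r2_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  # .x is the data for the current group only, without the grouping column
  group_modify(~ data.frame(r2 = summary(lm(mpg ~ wt, data = .x))$r.squared))
```

Because .x is bound to each group's rows in turn, the three r2 values differ, unlike in the summarise() attempt above.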