Linear mixed effect model with group_by - r

I'm trying to fit an lmer model using dplyr::group_by so that I don't have to fit the model for each of my species separately.
I found this line of code that seems to work, but I don't know how to visualize the results.
library("lme4")
library("dplyr")
data(Orthodont,package="nlme")
ort_test <- Orthodont %>% group_by(Sex) %>%
do(model = lmer(.,formula=distance~age+(1|Subject)))
and this is what I get out of this
# A tibble: 2 × 2
# Rowwise:
Sex model
<fct> <list>
1 Male <lmrMdLmT>
2 Female <lmrMdLmT>
Can you help me to get the info from the ort_test$model column?
Thanks!!!

We could use tidy() from the broom.mixed package:
library(tidyr)
library(dplyr)
ort_test %>%
mutate(out = list(broom.mixed::tidy(model))) %>%
ungroup %>%
select(Sex, out) %>%
unnest(out)
-output
# A tibble: 8 × 7
Sex effect group term estimate std.error statistic
<fct> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Male fixed <NA> (Intercept) 16.3 1.13 14.5
2 Male fixed <NA> age 0.784 0.0938 8.36
3 Male ran_pars Subject sd__(Intercept) 1.63 NA NA
4 Male ran_pars Residual sd__Observation 1.68 NA NA
5 Female fixed <NA> (Intercept) 17.4 0.859 20.2
6 Female fixed <NA> age 0.480 0.0526 9.12
7 Female ran_pars Subject sd__(Intercept) 2.07 NA NA
8 Female ran_pars Residual sd__Observation 0.780 NA NA
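If broom.mixed is not available, the fixed effects can also be pulled out of the list column directly. This is only a minimal sketch, assuming lme4's fixef() accessor and purrr::map_dfr(); the object names (models, fixed_tbl) are made up for illustration and it returns point estimates only, without standard errors.

```r
library(lme4)
library(dplyr)
library(purrr)
library(tibble)

data(Orthodont, package = "nlme")

# Fit one random-intercept model per Sex and store the fits in a list column
ort_test <- as.data.frame(Orthodont) %>%
  group_by(Sex) %>%
  do(model = lmer(distance ~ age + (1 | Subject), data = .))

# Name each fitted model by its group so that .id can carry the label along
models <- setNames(ort_test$model, as.character(ort_test$Sex))

# fixef() returns a named numeric vector; enframe() turns it into term/estimate rows
fixed_tbl <- map_dfr(
  models,
  ~ enframe(fixef(.x), name = "term", value = "estimate"),
  .id = "Sex"
)
fixed_tbl
```

The same pattern should work with other accessors that return a named vector.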

The reframe function introduced in dplyr v1.1 is a good fit for this kind of work (see the dplyr documentation for reframe). Assuming that you're only interested in the model coefficients, you could do the following:
Orthodont |>
group_by(Sex) |>
reframe(
lmer(distance ~ age + (1 | Subject)) |>
summary() |>
(`$`)("coefficients") |>
as_tibble(rownames = "term", .name_repair = janitor::make_clean_names)
)
#> # A tibble: 4 × 5
#> sex term estimate std_error t_value
#> <fct> <chr> <dbl> <dbl> <dbl>
#> 1 Male (Intercept) 16.3 1.13 14.5
#> 2 Male age 0.784 0.0938 8.36
#> 3 Female (Intercept) 17.4 0.859 20.2
#> 4 Female age 0.480 0.0526 9.12

Related

map_dfr outputting a row rather than a column

This is similar to "purrr::map_dfr binds by columns, not row as expected", but the solutions there aren't working for me. I have a dataframe like
beta_df <- structure(list(intercept = c(-2.75747056032685, -2.90831892599742,
-2.92478082251453, -2.99701559041538, -2.88885796048347, -3.09564193631675
), B1 = c(0.0898235360814854, 0.0291839369781567, 0.0881023522236231,
0.231703026085554, 0.0441573699433149, 0.258219673780526), B2 = c(-0.222367437619057,
0.770536384299238, 0.199648657850609, 0.0529038155448773, 0.00310458335580774,
0.132604387458483), B3 = c(1.26339268033385, 1.29883641278223,
0.949504940387809, 1.26904511447941, 0.863882674439083, 0.823907268679309
), B4 = c(2.13662994525526, 1.02340744740827, 0.959079691725652,
1.60672779812489, 1.19095838867883, -0.0693120654049908)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
#> # A tibble: 6 × 5
#> intercept B1 B2 B3 B4
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -2.76 0.0898 -0.222 1.26 2.14
#> 2 -2.91 0.0292 0.771 1.30 1.02
#> 3 -2.92 0.0881 0.200 0.950 0.959
#> 4 -3.00 0.232 0.0529 1.27 1.61
#> 5 -2.89 0.0442 0.00310 0.864 1.19
#> 6 -3.10 0.258 0.133 0.824 -0.0693
I'd like to turn this into a tibble with columns for the mean, 0.025 and 0.975 quantiles. For the quantile function this works:
beta_df %>%
map_dfr(quantile,0.025)
#> # A tibble: 5 × 1
#> `2.5%`
#> <dbl>
#> 1 -3.08
#> 2 0.0311
#> 3 -0.194
#> 4 0.829
#> 5 0.0592
And this gets me both quantiles
bind_cols(beta_df %>%
map_dfr(quantile, 0.025),
beta_df %>%
map_dfr(quantile, 0.975))
#> # A tibble: 5 × 2
#> `2.5%` `97.5%`
#> <dbl> <dbl>
#> 1 -3.08 -2.77
#> 2 0.0311 0.255
#> 3 -0.194 0.699
#> 4 0.829 1.30
#> 5 0.0592 2.07
But for mean,
beta_df %>%
map_dfr(mean)
#> # A tibble: 1 × 5
#> intercept B1 B2 B3 B4
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -2.93 0.124 0.156 1.08 1.14
Gives me a long row rather than a column. How can I turn the mean of each column of the original dataframe into a row of a single column dataframe labelled mean?
The reason is that quantile() returns a named vector, whereas mean() returns a single unnamed value.
Let's create a custom function around mean() that outputs a named vector,
myMean <- function(x) {setNames(mean(x), nm = 'theMean')}
Applying that using map_dfr we get,
library(dplyr)
beta_df %>%
purrr::map_dfr(myMean)
# A tibble: 5 x 1
theMean
<dbl>
1 -2.93
2 0.124
3 0.156
4 1.08
5 1.14
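Building on the same idea, the mean and both quantiles can be collected in one pass by returning a single named vector per column. This is a minimal sketch; the helper name summarise_draws and the small beta_df below are made up for illustration (the real data has more rows and columns).

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for the original beta_df
beta_df <- tibble(
  intercept = c(-2.8, -2.9, -3.0),
  B1        = c(0.09, 0.03, 0.23)
)

# Return one named vector per column: the names become the output columns
summarise_draws <- function(x) {
  c(mean = mean(x), quantile(x, probs = c(0.025, 0.975)))
}

# .id keeps the original column name as a "term" column
out <- map_dfr(beta_df, summarise_draws, .id = "term")
out
```

Each input column becomes one row, with columns term, mean, 2.5% and 97.5%.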

Having problems updating do() function

I am trying to update my function using the new version of dplyr.
First, I had this function (old version):
slope.k <- function(data, Treatment, Replicate, Day, Ln.AFDMrem){
fitted_models <- data %>% group_by(Treatment, Replicate) %>%
do(model = lm(Ln.AFDMrem ~ Day, data = .))
broom::tidy(fitted_models,model) %>% print(n = Inf)
}
However, the do() function was superseded. Now, I am trying to update with this (new) version:
slope.k <- function(data, Treatment, Replicate, Day, Ln.AFDMrem){
mod_t <- data %>% nest_by(Treatment, Replicate) %>%
mutate(model = list(lm(Ln.AFDMrem ~ Day, data = data))) %>%
summarise(tidy_out = list(tidy(model)))
unnest(select(mod_t, Treatment, tidy_out)) %>% print(n = Inf)
}
However, it doesn't work properly, because I have the following warnings:
Warning messages:
1: `cols` is now required when using unnest().
Please use `cols = c(tidy_out)`
2: `...` is not empty.
We detected these problematic arguments:
* `needs_dots`
These dots only exist to allow future extensions and should be empty.
Did you misspecify an argument?
Thanks in advance!!!
The issue is the use of select inside unnest: unnest() now expects the columns to unnest to be given explicitly (its cols argument). Wrapping the column in c() instead of select() works:
library(dplyr)
library(broom)
library(tidyr)
mtcars %>%
nest_by(carb, gear) %>%
mutate(model = list(lm(mpg ~ disp + drat, data = data))) %>%
summarise(tidy_out = list(tidy(model)), .groups = 'drop') %>%
unnest(c(tidy_out))
-output
# A tibble: 33 x 7
# carb gear term estimate std.error statistic p.value
# <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 3 (Intercept) -8.50 NaN NaN NaN
# 2 1 3 disp 0.0312 NaN NaN NaN
# 3 1 3 drat 7.10 NaN NaN NaN
# 4 1 4 (Intercept) -70.5 302. -0.234 0.854
# 5 1 4 disp -0.0445 0.587 -0.0757 0.952
# 6 1 4 drat 25.5 62.4 0.408 0.753
# 7 2 3 (Intercept) -3.72 8.57 -0.434 0.739
# 8 2 3 disp 0.0437 0.0123 3.54 0.175
# 9 2 3 drat 1.90 2.88 0.661 0.628
#10 2 4 (Intercept) -10.0 226. -0.0443 0.972
# … with 23 more rows
Also, we can call unnest directly on the 'tidy_out' column.
If we wrap this in a function, assuming that the column names are passed as unquoted arguments:
slope.k <- function(data, Treatment, Replicate, Day, Ln.AFDMrem){
ln_col <- rlang::as_string(ensym(Ln.AFDMrem))
day_col <- rlang::as_string(ensym(Day))
data %>%
nest_by({{Treatment}}, {{Replicate}}) %>%
mutate(model = list(lm(reformulate(day_col, ln_col), data = data))) %>%
summarise(tidy_out = list(tidy(model)), .groups = 'drop') %>%
unnest(tidy_out)
}
slope.k(mtcars, carb, gear, disp, mpg)
# A tibble: 22 x 7
carb gear term estimate std.error statistic p.value
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 3 (Intercept) 22.0 5.35 4.12 0.152
2 1 3 disp -0.00841 0.0255 -0.329 0.797
3 1 4 (Intercept) 52.6 8.32 6.32 0.0242
4 1 4 disp -0.279 0.0975 -2.86 0.104
5 2 3 (Intercept) 1.25 3.49 0.357 0.755
6 2 3 disp 0.0460 0.0100 4.59 0.0443
7 2 4 (Intercept) 36.6 6.57 5.57 0.0308
8 2 4 disp -0.0978 0.0529 -1.85 0.206
9 2 5 (Intercept) 47.0 NaN NaN NaN
10 2 5 disp -0.175 NaN NaN NaN
# … with 12 more rows
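For readers unfamiliar with reformulate(): it is a base R (stats) helper that builds a formula from character strings, which is why the function above first converts the column names to strings. A minimal sketch, using mtcars columns as stand-ins:

```r
# reformulate() builds a formula from strings: term labels first, response second
f <- reformulate(termlabels = "disp", response = "mpg")
print(f)

# The resulting formula can be passed to lm() like a hand-written one
fit <- lm(f, data = mtcars)
coef_names <- names(coef(fit))
coef_names
```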

How to extract confidence intervals from multiple regression models?

I am extracting the regression results for two different groups as shown in the example below. In the temp data.frame I get the estimate, std.error, statistic and p-value. However, I don't get the confidence intervals. Is there a simple way to extract them as well?
df <- tibble(
a = rnorm(1000),
b = rnorm(1000),
c = rnorm(1000),
d = rnorm(1000),
group = rbinom(n=1000, size=1, prob=0.5)
)
df$group = as.factor(df$group)
temp <- df %>%
group_by(group) %>%
do(model1 = tidy(lm(a ~ b + c + d, data = .))) %>%
gather(model_name, model, -group) %>%
unnest()
You are calling tidy on an lm object. If you check the help page, there is an option to include the confidence interval, conf.int=TRUE:
temp <- df %>%
group_by(group) %>%
do(model1 = tidy(lm(a ~ b + c + d, data = . ), conf.int=TRUE)) %>%
gather(model_name, model, -group) %>%
unnest(model)
# A tibble: 8 x 9
group model_name term estimate std.error statistic p.value conf.low conf.high
<fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 model1 (Int… 0.0616 0.0423 1.46 0.146 -0.0215 0.145
2 0 model1 b 0.00178 0.0421 0.0424 0.966 -0.0808 0.0844
3 0 model1 c -0.00339 0.0431 -0.0787 0.937 -0.0881 0.0813
4 0 model1 d -0.0537 0.0445 -1.21 0.228 -0.141 0.0337
5 1 model1 (Int… -0.0185 0.0454 -0.408 0.683 -0.108 0.0707
6 1 model1 b 0.00128 0.0435 0.0295 0.976 -0.0842 0.0868
7 1 model1 c -0.0972 0.0430 -2.26 0.0244 -0.182 -0.0126
8 1 model1 d 0.0734 0.0457 1.60 0.109 -0.0165 0.163
If your version of dplyr is 1.0.0 or higher, you can use the following (from dplyr 1.1.0, reframe() is the preferred verb when a group returns several rows):
df %>%
group_by(group) %>%
summarise(tidy(lm(a ~ b + c + d), conf.int = TRUE), .groups = "drop")
#> # A tibble: 8 x 8
#> group term estimate std.error statistic p.value conf.low conf.high
#> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 (Intercept) 0.0734 0.0468 1.57 0.117 -0.0185 0.165
#> 2 0 b -0.101 0.0461 -2.19 0.0292 -0.191 -0.0102
#> 3 0 c 0.0337 0.0464 0.726 0.468 -0.0575 0.125
#> 4 0 d -0.101 0.0454 -2.23 0.0265 -0.190 -0.0118
#> 5 1 (Intercept) -0.0559 0.0468 -1.20 0.232 -0.148 0.0360
#> 6 1 b -0.0701 0.0474 -1.48 0.140 -0.163 0.0230
#> 7 1 c 0.0319 0.0477 0.668 0.504 -0.0619 0.126
#> 8 1 d -0.0728 0.0466 -1.56 0.119 -0.164 0.0188
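As a cross-check without broom, base R's confint() can be applied to each group's lm fit; the conf.low and conf.high columns above should correspond to these intervals. A minimal sketch, assuming a smaller simulated data set and a single predictor (the seed and sizes are arbitrary):

```r
set.seed(1)

# Simulated stand-in for the question's data, with one predictor only
df <- data.frame(
  a     = rnorm(200),
  b     = rnorm(200),
  group = factor(rbinom(200, size = 1, prob = 0.5))
)

# split() gives one data frame per group; confint() returns a matrix per fit
ci_list <- lapply(split(df, df$group), function(d) {
  confint(lm(a ~ b, data = d), level = 0.95)
})
ci_list
```

The result is a named list with one matrix per group, with rows for the terms and columns for the lower and upper bounds.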

Get summary of the model using purrr::map within dplyr piping

Using mtcars data, I am testing map() to build some lm() models:
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
nest()%>%
mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x)))
#> # A tibble: 3 x 3
#> cyl data fit
#> <dbl> <list> <list>
#> 1 6 <tibble [7 x 10]> <S3: lm>
#> 2 4 <tibble [11 x 10]> <S3: lm>
#> 3 8 <tibble [14 x 10]> <S3: lm>
The output shows that I have a new column, fit.
Now I wish to see the summary of each lm
When I try:
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
nest()%>%
mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x))) %>%
map(fit,summary)
#> Error in as_mapper(.f, ...): object 'fit' not found
It gives the error:
Error in as_mapper(.f, ...) : object 'fit' not found
If I wish to calculate R2 or AIC, then I can use the following code without any problem:
library(tidyverse)
library(modelr)
mtcars %>%
group_by(cyl) %>%
nest()%>%
mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x))) %>%
mutate(r2 = map_dbl(fit, ~rsquare(., data = mtcars)),
aic = map_dbl(fit, ~AIC(.))) %>%
arrange(aic)
#> # A tibble: 3 x 5
#> cyl data fit r2 aic
#> <dbl> <list> <list> <dbl> <dbl>
#> 1 6 <tibble [7 x 10]> <S3: lm> -8.96 -Inf
#> 2 4 <tibble [11 x 10]> <S3: lm> -26.4 56.4
#> 3 8 <tibble [14 x 10]> <S3: lm> -1.000 67.3
Created on 2019-06-18 by the reprex package (v0.3.0)
What am I missing?
As IceCreamToucan's comment laid out, purrr::map does not look inside the data frame coming down the pipe, so it cannot find the fit column on its own.
If you call it inside dplyr::mutate, it has access to the fit column created in the previous step.
Another option is to refer to fit explicitly, as in my second suggestion below.
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
nest()%>%
mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x))) %>%
mutate(fit_sum = map(fit,summary))
#> # A tibble: 3 x 4
#> cyl data fit fit_sum
#> <dbl> <list> <list> <list>
#> 1 6 <tibble [7 x 10]> <lm> <smmry.lm>
#> 2 4 <tibble [11 x 10]> <lm> <smmry.lm>
#> 3 8 <tibble [14 x 10]> <lm> <smmry.lm>
mtcars %>%
group_by(cyl) %>%
nest()%>%
mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x))) %>%
{map(.$fit, summary)} #or using pull: `pull(fit) %>% map(summary)`
#> [[1]]
#>
#> Call:
#> lm(formula = mpg ~ ., data = .x)
#>
#> Residuals:
#> ALL 7 residuals are 0: no residual degrees of freedom!
#>
#> Coefficients: (3 not defined because of singularities)
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 32.78649 NA NA NA
#> disp 0.07456 NA NA NA
#> hp -0.04252 NA NA NA
#> drat 1.52367 NA NA NA
#> wt 5.12418 NA NA NA
#> qsec -2.33333 NA NA NA
#> vs -1.75289 NA NA NA
#> am NA NA NA NA
#> gear NA NA NA NA
#> carb NA NA NA NA
#>
#> Residual standard error: NaN on 0 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: NaN
#> F-statistic: NaN on 6 and 0 DF, p-value: NA
####truncated the results for the sake of space####
Created on 2019-06-17 by the reprex package (v0.3.0)
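Once the models sit in a list column, single numbers can be pulled out of each summary with the typed map variants. A minimal sketch, assuming a simpler mpg ~ wt formula than in the question so that no group is fitted without residual degrees of freedom:

```r
library(dplyr)
library(tidyr)
library(purrr)

fits <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(
    fit = map(data, ~ lm(mpg ~ wt, data = .x)),       # one lm per group
    r2  = map_dbl(fit, ~ summary(.x)$r.squared)       # one number per model
  ) %>%
  ungroup()

select(fits, cyl, r2)
```

map_dbl() fails loudly if a summary does not yield exactly one number, which can be a useful check.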
As of the latest release of dplyr (at the time of writing), the tidyverse seems to encourage using the group_modify function instead of purrr plus nested dataframes.
In that workflow, here is how you can get both model summaries and estimates in the same dataframe via the broom package:
# setup
set.seed(123)
library(tidyverse)
options(tibble.width = Inf)
# joining dataframes with regression estimates and model summaries
dplyr::full_join(
# to get a tidy dataframe of regression estimates
x = mtcars %>%
group_by(cyl) %>%
group_modify(.f = ~ broom::tidy(lm(mpg ~ ., data = .x), conf.int = TRUE)),
# to get a tidy dataframe of model summaries
y = mtcars %>%
group_by(cyl) %>%
group_modify(.f = ~ broom::glance(lm(mpg ~ ., data = .x))),
by = "cyl"
) %>%
dplyr::ungroup(x = .)
#> Warning in qt(a, object$df.residual): NaNs produced
#> # A tibble: 25 x 20
#> cyl term estimate std.error statistic.x p.value.x conf.low
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 (Intercept) 60.9 180. 0.338 0.793 -2229.
#> 2 4 disp -0.345 0.469 -0.735 0.596 -6.31
#> 3 4 hp -0.0332 0.364 -0.0915 0.942 -4.65
#> 4 4 drat -4.19 46.4 -0.0903 0.943 -594.
#> 5 4 wt 4.48 29.7 0.151 0.905 -373.
#> 6 4 qsec -0.106 7.82 -0.0136 0.991 -99.4
#> 7 4 vs -3.64 34.0 -0.107 0.932 -435.
#> 8 4 am -6.33 45.2 -0.140 0.912 -581.
#> 9 4 gear 4.07 29.1 0.140 0.912 -366.
#> 10 4 carb 3.22 28.2 0.114 0.928 -355.
#> conf.high r.squared adj.r.squared sigma statistic.y p.value.y df
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2351. 0.928 0.276 3.84 1.42 0.576 9
#> 2 5.62 0.928 0.276 3.84 1.42 0.576 9
#> 3 4.59 0.928 0.276 3.84 1.42 0.576 9
#> 4 586. 0.928 0.276 3.84 1.42 0.576 9
#> 5 382. 0.928 0.276 3.84 1.42 0.576 9
#> 6 99.2 0.928 0.276 3.84 1.42 0.576 9
#> 7 428. 0.928 0.276 3.84 1.42 0.576 9
#> 8 568. 0.928 0.276 3.84 1.42 0.576 9
#> 9 374. 0.928 0.276 3.84 1.42 0.576 9
#> 10 362. 0.928 0.276 3.84 1.42 0.576 9
#> logLik AIC BIC deviance df.residual nobs
#> <dbl> <dbl> <dbl> <dbl> <int> <int>
#> 1 -17.2 56.4 60.8 14.7 1 11
#> 2 -17.2 56.4 60.8 14.7 1 11
#> 3 -17.2 56.4 60.8 14.7 1 11
#> 4 -17.2 56.4 60.8 14.7 1 11
#> 5 -17.2 56.4 60.8 14.7 1 11
#> 6 -17.2 56.4 60.8 14.7 1 11
#> 7 -17.2 56.4 60.8 14.7 1 11
#> 8 -17.2 56.4 60.8 14.7 1 11
#> 9 -17.2 56.4 60.8 14.7 1 11
#> 10 -17.2 56.4 60.8 14.7 1 11
#> # ... with 15 more rows
Created on 2019-06-17 by the reprex package (v0.3.0)

Make prediction for each group differently

I have a dataset that looks like this:
Category Weekly_Date a b
<chr> <date> <dbl> <dbl>
1 aa 2018-07-01 36.6 1.4
2 aa 2018-07-02 5.30 0
3 bb 2018-07-01 4.62 1.2
4 bb 2018-07-02 3.71 1.5
5 cc 2018-07-01 3.41 12
... ... ... ... ...
I fitted linear regression for each group separately:
fit_linreg <- train %>%
group_by(Category) %>%
do(model = lm(Target ~ Unit_price + Unit_discount, data = .))
Now I have different models for each category:
aa model1
bb model2
cc model3
So, I need to apply each model to the appropriate category. How to achieve that? (dplyr is preferable)
If you nest your test data and join it with the models, you can use map2 to make predictions on the test data with the trained models. See the example below with mtcars.
library(tidyverse)
x <- mtcars %>%
group_by(gear) %>%
do(model = lm(mpg ~ hp + wt, data = .))
x
Source: local data frame [3 x 2]
Groups: <by row>
# A tibble: 3 x 2
gear model
* <dbl> <list>
1 3 <S3: lm>
2 4 <S3: lm>
3 5 <S3: lm>
mtcars %>%
group_by(gear) %>%
nest %>%
inner_join(x) %>%
mutate(preds = map2(model, data, predict)) %>%
unnest(preds)
Joining, by = "gear"
# A tibble: 32 x 2
gear preds
<dbl> <dbl>
1 4 22.0
2 4 21.2
3 4 25.1
4 4 26.0
5 4 22.2
6 4 17.8
7 4 17.8
8 4 28.7
9 4 32.3
10 4 30.0
# ... with 22 more rows
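The example above predicts on the same rows the models were trained on. With a separate test set the pattern stays the same. This is a minimal sketch, assuming a hypothetical train/test split of mtcars (every fourth row held out) and using nest_by() rather than do():

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical split: hold out every fourth row as test data
test_idx <- seq(4, nrow(mtcars), by = 4)
train <- mtcars[-test_idx, ]
test  <- mtcars[test_idx, ]

# One model per gear, fitted on the training rows only
models <- train %>%
  nest_by(gear, .key = "train_data") %>%
  mutate(model = list(lm(mpg ~ hp + wt, data = train_data))) %>%
  ungroup() %>%
  select(gear, model)

# Nest the test rows per gear, join the matching model, predict with newdata
preds <- test %>%
  nest(test_data = -gear) %>%
  inner_join(models, by = "gear") %>%
  mutate(pred = map2(model, test_data, ~ predict(.x, newdata = .y))) %>%
  select(gear, test_data, pred) %>%
  unnest(c(test_data, pred))

preds
```

Unnesting test_data together with pred keeps the predictors next to each prediction.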
Here's one approach. I'm using data.table to filter, but you can use dplyr instead; I just prefer the data.table syntax.
library(data.table)
d <- as.data.table(mtcars)
cats <- unique(d$cyl)
m <- lapply(cats, function(z){
return(lm(formula = mpg ~ wt + hp + disp,
data = d[cyl == z, ] ))
})
names(m) <- cats
OUTPUT
> summary(m)
Length Class Mode
6 12 lm list
4 12 lm list
8 12 lm list
# Checking first model
> m[[1]]
Call:
lm(formula = mpg ~ wt + hp + disp, data = d[cyl == z, ])
Coefficients:
(Intercept) wt hp disp
30.27791 -3.89618 -0.01097 0.01610
> sapply(1:length(m), function(z) return(summary(m[[z]])$adj.r.squared))
[1] 0.4434228 0.5829574 0.3461900
I named the list because it might be easier to refer to models by name aa or bb in your case. Hope this helps!
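With a named list, predicting for a given category comes down to looking up the model by name. A minimal sketch in base R; the new_car row is invented for illustration and the formula is shortened to mpg ~ wt + hp:

```r
# One model per cyl value; split() names the list "4", "6", "8"
models <- lapply(split(mtcars, mtcars$cyl), function(d) {
  lm(mpg ~ wt + hp, data = d)
})

# Hypothetical new observation to score
new_car <- data.frame(cyl = 6, wt = 3.0, hp = 110)

# Pick the model whose name matches the category, then predict
pred <- predict(models[[as.character(new_car$cyl)]], newdata = new_car)
pred
```

Converting the category to character matters here: indexing with the number 6 would ask for the sixth list element rather than the element named "6".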
I find the nesting and un-nesting very unnatural, so here's my attempt.
Let's say you want the quality of the model's fit.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
do(data.frame(r2 = summary(lm(mpg ~ wt, data = .))$r.squared))
#> # A tibble: 3 x 2
#> # Groups: cyl [3]
#> cyl r2
#> <dbl> <dbl>
#> 1 4 0.509
#> 2 6 0.465
#> 3 8 0.423
Let's say you want the residuals:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
do(data.frame(resid = residuals(lm(mpg ~ wt, data = .))))
#> # A tibble: 32 x 2
#> # Groups: cyl [3]
#> cyl resid
#> <dbl> <dbl>
#> 1 4 -3.67
#> 2 4 2.84
#> 3 4 1.02
#> 4 4 5.25
#> 5 4 -0.0513
#> 6 4 4.69
#> 7 4 -4.15
#> 8 4 -1.34
#> 9 4 -1.49
#> 10 4 -0.627
#> # ... with 22 more rows
See ?do for why you need the embedded data.frame(). You'll probably want to include other columns in the result, not just the grouping variable and the residuals. I can't find a neat way to do this other than listing them!
library(dplyr)
mtcars %>%
group_by(cyl) %>%
do(data.frame(disp = .$disp,
qsec = .$qsec,
resid = residuals(lm(mpg ~ wt, data = .))))
#> # A tibble: 32 x 4
#> # Groups: cyl [3]
#> cyl disp qsec resid
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 108 18.6 -3.67
#> 2 4 147. 20 2.84
#> 3 4 141. 22.9 1.02
#> 4 4 78.7 19.5 5.25
#> 5 4 75.7 18.5 -0.0513
#> 6 4 71.1 19.9 4.69
#> 7 4 120. 20.0 -4.15
#> 8 4 79 18.9 -1.34
#> 9 4 120. 16.7 -1.49
#> 10 4 95.1 16.9 -0.627
#> # ... with 22 more rows
Something that doesn't work
For the first example, I thought the following would work:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(r2 = summary(lm(mpg ~ wt, data = .))$r.squared)
#> # A tibble: 3 x 2
#> cyl r2
#> <dbl> <dbl>
#> 1 4 0.753
#> 2 6 0.753
#> 3 8 0.753
But you can see all models have the same r2. It's because the model is being fit to all the data, not per cyl. The cause is the . placeholder: %>% binds it to the whole data frame passed into summarise(), so data = . ignores the grouping. do() works as expected because it subsets the data by group and rebinds . to that subset before evaluating the expression. The dplyr authors have also been pondering how such expressions are evaluated, see Hybrid Folding
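One way that appears to give per-group fits inside summarise() is to leave out data = . altogether, so that the formula's variables are looked up among the current group's columns (the same trick as the summarise(tidy(lm(...))) pattern). A minimal sketch, with the pooled fit computed separately for comparison:

```r
library(dplyr)

# Without data = ., mpg and wt are resolved per group by summarise()
r2_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(r2 = summary(lm(mpg ~ wt))$r.squared)

# For comparison: the single pooled fit that data = . silently produced
r2_pooled <- summary(lm(mpg ~ wt, data = mtcars))$r.squared

r2_by_cyl
r2_pooled
```

The per-group values should match the do() output above, while the pooled value is the 0.753 that was repeated for every group.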

Resources