purrr: Group by (nest) and bootstrap - r

I calculated the mean of bootstrap samples for mpg variable from mtcars dataset. My code looks like this (Please, let me know if there's a "better practice" to do it.):
mean_mpg <- function(x) {
rsample::analysis(x) %>%
pull(mpg) %>%
mean()
}
mtcars2 <- rsample::bootstraps(mtcars) %>%
mutate(mean_mpg = purrr::map(splits, mean_mpg)) %>%
tidyr::unnest(mean_mpg) %>%
select(-splits)
However, now I would like to do the same on a grouped dataset. For example:
mtcars %>%
group_by(am)
# now calculate bootstrap means of `mpg` for each `am` group
What's the best way to do it?

I think I would nest() to do this, rather than group_by().
Here is a slightly modified version of how to find the mean mpg for each bootstrap resample of the dataset overall.
library(rsample)
library(tidyverse)
bootstraps(mtcars) %>%
mutate(mpg = map(splits, ~ analysis(.) %>% pull(mpg)),
mean_mpg = map_dbl(mpg, mean))
#> # Bootstrap sampling
#> # A tibble: 25 x 4
#> splits id mpg mean_mpg
#> * <list> <chr> <list> <dbl>
#> 1 <split [32/10]> Bootstrap01 <dbl [32]> 18.8
#> 2 <split [32/13]> Bootstrap02 <dbl [32]> 20.4
#> 3 <split [32/9]> Bootstrap03 <dbl [32]> 21.1
#> 4 <split [32/12]> Bootstrap04 <dbl [32]> 19.4
#> 5 <split [32/10]> Bootstrap05 <dbl [32]> 19.8
#> 6 <split [32/11]> Bootstrap06 <dbl [32]> 20.1
#> 7 <split [32/13]> Bootstrap07 <dbl [32]> 19.1
#> 8 <split [32/11]> Bootstrap08 <dbl [32]> 18.7
#> 9 <split [32/13]> Bootstrap09 <dbl [32]> 19.3
#> 10 <split [32/13]> Bootstrap10 <dbl [32]> 20.9
#> # … with 15 more rows
And here is how I would go about creating bootstrap resamples for each value of am, and then finding the mean value of mpg for those resamples.
mtcars %>%
nest(data = -am) %>%
mutate(nested_boot = map(data, bootstraps)) %>%
select(-data) %>%
unnest(nested_boot) %>%
mutate(mpg = map(splits, ~ analysis(.) %>% pull(mpg)),
mean_mpg = map_dbl(mpg, mean))
#> # A tibble: 50 x 5
#> am splits id mpg mean_mpg
#> <dbl> <list> <chr> <list> <dbl>
#> 1 1 <split [13/4]> Bootstrap01 <dbl [13]> 21.9
#> 2 1 <split [13/4]> Bootstrap02 <dbl [13]> 24.0
#> 3 1 <split [13/5]> Bootstrap03 <dbl [13]> 24.8
#> 4 1 <split [13/5]> Bootstrap04 <dbl [13]> 25.9
#> 5 1 <split [13/3]> Bootstrap05 <dbl [13]> 24.0
#> 6 1 <split [13/5]> Bootstrap06 <dbl [13]> 22.1
#> 7 1 <split [13/4]> Bootstrap07 <dbl [13]> 24.3
#> 8 1 <split [13/4]> Bootstrap08 <dbl [13]> 25.0
#> 9 1 <split [13/5]> Bootstrap09 <dbl [13]> 22.7
#> 10 1 <split [13/6]> Bootstrap10 <dbl [13]> 23.3
#> # … with 40 more rows
Created on 2020-05-26 by the reprex package (v0.3.0)
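Once each group has its own bootstrap means, they can be collapsed into one row per group. The following is a minimal sketch of that, not part of the original answer: the seed, the choice of 200 resamples, and the names boot_means and boot_summary are assumptions made for illustration, and the interval is a simple percentile interval.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(rsample)

set.seed(123)  # assumed seed, only so the sketch is reproducible

# For each value of `am`, draw bootstrap resamples of that group's rows
# and keep the mean of `mpg` from every resample.
boot_means <- mtcars %>%
  nest(data = -am) %>%
  mutate(mean_mpg = map(data, function(d) {
    boots <- bootstraps(d, times = 200)  # 200 is an arbitrary choice
    map_dbl(boots$splits, ~ mean(analysis(.x)$mpg))
  })) %>%
  select(am, mean_mpg) %>%
  unnest(mean_mpg)

# Collapse to one row per group: average bootstrap mean plus a
# 95% percentile interval.
boot_summary <- boot_means %>%
  group_by(am) %>%
  summarise(
    boot_mean = mean(mean_mpg),
    lower = unname(quantile(mean_mpg, 0.025)),
    upper = unname(quantile(mean_mpg, 0.975)),
    .groups = "drop"
  )

boot_summary
```

Because the resampling happens inside each nested data frame, rows from one `am` group never leak into the resamples of the other.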

Related

How do I insert a column of static text into a nested dataframe in r?

Likely a trivial task for the pros out there, but I have not been able to figure out how to insert the text found in the "Slug" column into each of the three nested tables associated with the slug.
![data](https://i.stack.imgur.com/YClrE.png)
I am just looking to get the Slug value inserted into the nested tables and repeated for each row so I can combine and keep track of associations properly.
Any tips are most welcome! Thank you
Solution
You can use rowwise() with mutate(across())
df %>%
rowwise() %>%
mutate(across(floor_price_array:holder_hist, ~list(mutate(.x,slug=slug))))
Explanation
If your original data, say df, looks like this:
id slug floor_price_array num_listed_hist holder_hist
<chr> <chr> <list> <list> <list>
1 a hyznu <tibble [10 x 3]> <tibble [10 x 3]> <tibble [10 x 3]>
2 b awxeb <tibble [10 x 3]> <tibble [10 x 3]> <tibble [10 x 3]>
3 c pbncj <tibble [10 x 3]> <tibble [10 x 3]> <tibble [10 x 3]>
then the above code will add the value in the slug column as a new constant column in each of the nested tibbles, resulting in this (notice that each nested tibble now has four columns):
id slug floor_price_array num_listed_hist holder_hist
<chr> <chr> <list> <list> <list>
1 a hyznu <tibble [10 x 4]> <tibble [10 x 4]> <tibble [10 x 4]>
2 b awxeb <tibble [10 x 4]> <tibble [10 x 4]> <tibble [10 x 4]>
3 c pbncj <tibble [10 x 4]> <tibble [10 x 4]> <tibble [10 x 4]>
For example, floor_price_array now contains this:
[[1]]
# A tibble: 10 x 4
x y z slug
<dbl> <dbl> <dbl> <chr>
1 1.44 2.02 -0.272 hyznu
2 -0.598 -0.723 -0.528 hyznu
3 0.490 -0.576 -1.62 hyznu
4 -0.145 0.349 0.341 hyznu
5 -0.362 0.503 0.584 hyznu
6 -0.798 -0.839 -0.352 hyznu
7 -0.503 -1.27 -1.18 hyznu
8 -0.916 -0.654 0.335 hyznu
9 0.578 0.137 -0.590 hyznu
10 -0.194 -0.674 1.73 hyznu
[[2]]
# A tibble: 10 x 4
x y z slug
<dbl> <dbl> <dbl> <chr>
1 0.876 0.665 -0.723 awxeb
2 -0.0442 -0.00906 0.0829 awxeb
3 -2.15 1.33 0.0692 awxeb
4 0.264 0.237 -0.497 awxeb
5 0.0381 0.0502 -1.58 awxeb
6 -0.802 0.783 -1.34 awxeb
7 -0.940 1.50 -0.542 awxeb
8 0.209 -1.06 0.853 awxeb
9 0.569 -1.15 -0.347 awxeb
10 -1.57 -0.0774 0.0250 awxeb
[[3]]
# A tibble: 10 x 4
x y z slug
<dbl> <dbl> <dbl> <chr>
1 -0.0289 -1.63 1.29 pbncj
2 -0.716 0.647 0.0230 pbncj
3 -0.0797 -0.0227 2.12 pbncj
4 -0.358 -1.43 -1.81 pbncj
5 -1.35 -0.402 -0.463 pbncj
6 -0.00494 -0.136 1.50 pbncj
7 1.09 0.124 -0.974 pbncj
8 -1.18 1.78 -0.836 pbncj
9 0.896 -1.38 0.199 pbncj
10 0.293 0.420 0.562 pbncj
Input data:
df <- tibble(id = letters[1:3]
) %>%
rowwise() %>%
mutate(slug = paste0(sample(letters,5), collapse="")) %>%
mutate(floor_price_array=list(tibble(x=rnorm(10), y=rnorm(10), z=rnorm(10))),
num_listed_hist=list(tibble(x=rnorm(10), y=rnorm(10), z=rnorm(10))),
holder_hist=list(tibble(x=rnorm(10), y=rnorm(10), z=rnorm(10)))
) %>% ungroup()
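The same result can also be reached without rowwise(), by walking each list-column in parallel with the slug column. This is only a sketch of an alternative, not the answer's method: the toy data (two nested columns instead of three, hard-coded slugs) and the name df_tagged are assumptions made to keep it short.

```r
library(dplyr)
library(purrr)
library(tibble)

set.seed(1)  # assumed seed, only for reproducibility

# Toy stand-in for the data in the question (hypothetical values).
df <- tibble(
  id = letters[1:3],
  slug = c("hyznu", "awxeb", "pbncj"),
  floor_price_array = map(1:3, ~ tibble(x = rnorm(10), y = rnorm(10))),
  holder_hist = map(1:3, ~ tibble(x = rnorm(10), y = rnorm(10)))
)

# For every nested column, pair each nested tibble with the slug of its
# row and add that slug as a constant column.
df_tagged <- df %>%
  mutate(across(
    c(floor_price_array, holder_hist),
    ~ map2(.x, slug, function(tbl, s) mutate(tbl, slug = s))
  ))

# The nested tables can now be stacked without losing track of the slug.
bind_rows(df_tagged$floor_price_array)
```

Either form gives the same nested tibbles; map2() just makes the row-by-row pairing explicit.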

Group by and run multiple t tests in R

I have the following dataset (dput here):
# A tibble: 3,713 x 17
ID Age Group RHR HRV Sleep.Onset Wake.Onset Hours.in.Bed Hours.of.Sleep Sleep.Disturbances Latency.min Cycles REM.Sleep.hours Deep.Sleep.hours
<int> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int> <dbl> <dbl>
1 5027 Young Increase 58 73 0.180 0.458 6.66 5.33 9 8.98 6 1.4 0.32
2 5027 Young Increase 83 27 0.162 0.542 9.1 6.84 15 3.48 9 1.19 1.54
3 5027 Young Increase 57 85 0.113 0.318 4.92 4.43 5 1.98 4 1.32 0.44
4 5027 Young Increase 60 70 0.0975 0.319 5.32 3.75 3 26.5 4 1.02 0.14
5 5027 Young Increase 63 72 0.105 0.329 5.38 4.74 5 2.48 5 1.32 0.07
6 5027 Young Increase 62 61 0.983 0.472 11.8 9.44 9 4.48 8 2.07 0.84
7 5027 Young Increase 66 68 0.142 0.426 6.83 5.48 15 2.98 6 1.48 0.35
8 5027 Young Increase 81 28 0.0908 0.177 2.06 1.93 2 2.48 1 0.22 0.22
9 5027 Young Increase 69 57 0.158 0.443 6.85 6.58 13 0.48 6 2.43 0
10 5027 Young Increase 63 60 0.0859 0.318 5.58 5.47 4 0.48 5 1.34 0.13
# ... with 3,703 more rows, and 3 more variables: Light.Sleep.hours <dbl>, Awake.hours <dbl>, Session <chr>
I am trying to calculate a t-test across every variable, grouped by Age and Group between Session (pre or post).
df %>%
select(-ID) %>%
group_by(Age, Group) %>%
summarize_at(
vars(-group_cols(), -Session),
list(p.value = ~ t.test(. ~ Session)$p.value))
I am successful with p values:
# A tibble: 4 x 15
# Groups: Age [2]
Age Group RHR_p.value HRV_p.value Sleep.Onset_p.value Wake.Onset_p.value Hours.in.Bed_p.value Hours.of.Sleep_p~ Sleep.Disturban~ Latency.min_p.v~
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Old Decrease 0.0594 0.865 0.495 0.885 0.316 0.307 0.148 0.00237
2 Old Increase 0.00920 0.634 0.0979 0.0514 0.00774 0.00762 0.247 0.933
3 Young Decrease 0.0975 0.259 0.779 0.760 0.959 0.975 0.256 0.181
4 Young Increase 0.115 0.604 0.846 0.164 0.140 0.242 0.692 0.412
# ... with 5 more variables: Cycles_p.value <dbl>, REM.Sleep.hours_p.value <dbl>, Deep.Sleep.hours_p.value <dbl>, Light.Sleep.hours_p.value <dbl>,
# Awake.hours_p.value <dbl>
However, I am struggling to calculate the other t-test statistics (mean, sd, t, df, 95% CI) for these pre-post comparisons, and also to correct the p-values across groups. Any help is appreciated.
I think I may need to convert data long and use something like this?
df %>%
group_by(Age, Group) %>%
t_test(mean ~ ., by = "Session") %>%
adjust_pvalue(method = "bonferroni") %>%
add_significance()
Data frames can only have certain object classes as column types, and an htest is not one of those.
However, we can store lists as list-columns.
If we adapt the current code to return each htest wrapped in a list, we can later extract elements of the tests separately.
library(dplyr)
library(purrr) # for map() and map_dbl() used below
output <- df %>%
select(-ID) %>%
group_by(Age, Group) %>%
summarize_at(
vars(-group_cols(), -Session),
list(t.test = ~ list(t.test(. ~ Session))))
output
# A tibble: 4 × 15
# Groups: Age [2]
Age Group RHR_t.test HRV_t.test Sleep.Onset_t.test Wake.Onset_t.test Hours.in.Bed_t.test Hours.of.Sleep_t.test Sleep.Disturbance… Latency.min_t.t… Cycles_t.test REM.Sleep.hours…
<chr> <chr> <list> <list> <list> <list> <list> <list> <list> <list> <list> <list>
1 Old Decrease <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
2 Old Increase <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
3 Young Decrease <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
4 Young Increase <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
With this output data.frame, we can extract individual tests and values from them as desired:
output$RHR_t.test
[[1]]
Welch Two Sample t-test
data: . by Session
t = -1.8965, df = 188.22, p-value = 0.05942
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-3.09118590 0.06082897
sample estimates:
mean in group Post mean in group Pre
62.28902 63.80420
[[2]]
Welch Two Sample t-test
data: . by Session
t = -2.6271, df = 226.21, p-value = 0.009199
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-3.3949577 -0.4848655
sample estimates:
mean in group Post mean in group Pre
57.95946 59.89937
[[3]]
Welch Two Sample t-test
data: . by Session
t = 1.6633, df = 251.75, p-value = 0.0975
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-0.2074028 2.4611194
sample estimates:
mean in group Post mean in group Pre
60.58255 59.45570
[[4]]
Welch Two Sample t-test
data: . by Session
t = 1.5849, df = 208.4, p-value = 0.1145
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-0.244287 2.247775
sample estimates:
mean in group Post mean in group Pre
60.23462 59.23288
output$RHR_t.test %>%
map_dbl('p.value')
[1] 0.059424354 0.009199459 0.097497620 0.114502332
We can also convert these lists to user-friendly tibbles with broom::tidy:
tidy_output <- output %>%
mutate(across(ends_with('t.test'), map, broom::tidy))
tidy_output
# A tibble: 4 × 15
# Groups: Age [2]
Age Group RHR_t.test HRV_t.test Sleep.Onset_t.te… Wake.Onset_t.test Hours.in.Bed_t.t… Hours.of.Sleep_… Sleep.Disturbanc… Latency.min_t.t… Cycles_t.test REM.Sleep.hours…
<chr> <chr> <list> <list> <list> <list> <list> <list> <list> <list> <list> <list>
1 Old Decrease <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
2 Old Increase <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
3 Young Decrease <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
4 Young Increase <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
# … with 3 more variables: Deep.Sleep.hours_t.test <list>, Light.Sleep.hours_t.test <list>, Awake.hours_t.test <list>
To extract the "statistic" value from every test, we can do this:
tidy_output %>%
mutate(across(ends_with('t.test'), sapply, pull, 'statistic'))
# A tibble: 4 × 15
# Groups: Age [2]
Age Group RHR_t.test HRV_t.test Sleep.Onset_t.test Wake.Onset_t.test Hours.in.Bed_t.test Hours.of.Sleep_t.test Sleep.Disturbance… Latency.min_t.t… Cycles_t.test REM.Sleep.hours…
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Old Decrease -1.90 0.171 0.684 -0.145 -1.01 -1.02 -1.45 3.05 -0.928 -0.906
2 Old Increase -2.63 0.477 -1.66 -1.96 -2.69 -2.69 -1.16 0.0848 -1.76 -1.87
3 Young Decrease 1.66 1.13 0.281 -0.305 0.0509 -0.0320 1.14 -1.34 -0.675 0.672
4 Young Increase 1.58 0.519 0.195 -1.40 -1.48 -1.17 0.397 -0.821 -1.73 0.886
# … with 3 more variables: Deep.Sleep.hours_t.test <dbl>, Light.Sleep.hours_t.test <dbl>, Awake.hours_t.test <dbl>
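The question also floats converting the data to long format and correcting the p-values. A sketch of that route is given here under stated assumptions: toy is simulated data standing in for the real dataset (only two measured variables, RHR and HRV), and Bonferroni is used simply because the question mentions it.

```r
library(dplyr)
library(tidyr)
library(broom)

set.seed(1)  # assumed seed, only for reproducibility

# Simulated stand-in for the real data (hypothetical values).
toy <- tibble(
  Age     = rep(c("Young", "Old"), each = 40),
  Group   = rep(rep(c("Increase", "Decrease"), each = 20), times = 2),
  Session = rep(c("Pre", "Post"), times = 40),
  RHR     = rnorm(80, mean = 60, sd = 5),
  HRV     = rnorm(80, mean = 70, sd = 10)
)

results <- toy %>%
  # One row per measurement, so every variable is handled the same way.
  pivot_longer(c(RHR, HRV), names_to = "variable", values_to = "value") %>%
  group_by(Age, Group, variable) %>%
  # tidy() returns a one-row tibble with the estimates, t statistic,
  # degrees of freedom (parameter), p-value and confidence interval.
  summarise(tidy(t.test(value ~ Session)), .groups = "drop") %>%
  # Adjust across all tests at once.
  mutate(p.adjusted = p.adjust(p.value, method = "bonferroni"))

results
```

This gives one row per Age, Group and variable, which is usually easier to filter and report than a wide table of list-columns. Standard deviations are not part of the htest object, so they would need a separate summarise() on the long data.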

How can I get coefficients of a parsnip multinomial logistic regression model with vfold_cv?

I use fit_resamples() with workflow() and V-fold cross-validation.
My model is a logistic regression.
How can I get the coefficients of a parsnip logistic regression model with V-fold cross-validation?
If I use v = 5, I want to get the coefficients from each of the 5 fits.
You typically do not want to use fit_resamples() to train and keep five models; the main purpose of the fit_resamples() function is to use resampling to estimate performance. The five models are fit and then thrown away.
However, if you do have some use case where you want to keep around the models that are fit, such as in this article, then you would use extract_model.
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data(penguins)
set.seed(2021)
penguin_split <- penguins %>%
filter(!is.na(sex)) %>%
initial_split(strata = sex)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)
penguin_folds <- vfold_cv(penguin_train, v = 5, strata = sex)
penguin_folds
#> # 5-fold cross-validation using stratification
#> # A tibble: 5 x 2
#> splits id
#> <list> <chr>
#> 1 <split [198/51]> Fold1
#> 2 <split [199/50]> Fold2
#> 3 <split [199/50]> Fold3
#> 4 <split [200/49]> Fold4
#> 5 <split [200/49]> Fold5
glm_spec <- logistic_reg() %>%
set_engine("glm")
glm_rs <- workflow() %>%
add_formula(sex ~ species + bill_length_mm + bill_depth_mm + body_mass_g) %>%
add_model(glm_spec) %>%
fit_resamples(
resamples = penguin_folds,
control = control_resamples(extract = extract_model, save_pred = TRUE)
)
Now that you have used extract_model in your resampling, it is there in your results and you have the models available for each fold.
glm_rs
#> # Resampling results
#> # 5-fold cross-validation using stratification
#> # A tibble: 5 x 6
#> splits id .metrics .notes .extracts .predictions
#> <list> <chr> <list> <list> <list> <list>
#> 1 <split [198/… Fold1 <tibble [2 × … <tibble [0 ×… <tibble [1 ×… <tibble [51 × …
#> 2 <split [199/… Fold2 <tibble [2 × … <tibble [0 ×… <tibble [1 ×… <tibble [50 × …
#> 3 <split [199/… Fold3 <tibble [2 × … <tibble [0 ×… <tibble [1 ×… <tibble [50 × …
#> 4 <split [200/… Fold4 <tibble [2 × … <tibble [0 ×… <tibble [1 ×… <tibble [49 × …
#> 5 <split [200/… Fold5 <tibble [2 × … <tibble [0 ×… <tibble [1 ×… <tibble [49 × …
glm_rs$.extracts[[1]]
#> # A tibble: 1 x 2
#> .extracts .config
#> <list> <chr>
#> 1 <glm> Preprocessor1_Model1
You can use tidyr and broom functions to get the coefficients out, if that is what you are looking for.
glm_rs %>%
dplyr::select(id, .extracts) %>%
unnest(cols = .extracts) %>%
mutate(tidied = map(.extracts, tidy)) %>%
unnest(tidied)
#> # A tibble: 30 x 8
#> id .extracts .config term estimate std.error statistic p.value
#> <chr> <list> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Fold1 <glm> Preprocesso… (Interce… -7.44e+1 12.6 -5.89 3.75e-9
#> 2 Fold1 <glm> Preprocesso… speciesC… -6.59e+0 1.82 -3.61 3.03e-4
#> 3 Fold1 <glm> Preprocesso… speciesG… -7.49e+0 2.54 -2.95 3.18e-3
#> 4 Fold1 <glm> Preprocesso… bill_len… 5.56e-1 0.151 3.67 2.40e-4
#> 5 Fold1 <glm> Preprocesso… bill_dep… 1.72e+0 0.424 4.06 4.83e-5
#> 6 Fold1 <glm> Preprocesso… body_mas… 5.88e-3 0.00130 4.51 6.44e-6
#> 7 Fold2 <glm> Preprocesso… (Interce… -6.87e+1 11.3 -6.06 1.37e-9
#> 8 Fold2 <glm> Preprocesso… speciesC… -5.59e+0 1.75 -3.20 1.39e-3
#> 9 Fold2 <glm> Preprocesso… speciesG… -7.61e+0 2.80 -2.71 6.65e-3
#> 10 Fold2 <glm> Preprocesso… bill_len… 4.88e-1 0.145 3.36 7.88e-4
#> # … with 20 more rows
Created on 2021-06-27 by the reprex package (v2.0.0)
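If the per-fold coefficients are the only goal and the tune machinery is not needed, the folds can also be fitted by hand. This is a rough sketch of that idea rather than the answer's approach: it uses mtcars and the formula am ~ mpg purely as a small illustrative logistic regression, and the names fold_coefs and coef_summary are assumptions.

```r
library(dplyr)
library(purrr)
library(rsample)
library(broom)

set.seed(2021)  # assumed seed, only for reproducibility

folds <- vfold_cv(mtcars, v = 5)

# Fit one logistic regression per fold on its analysis set and tidy it.
fold_coefs <- map2(folds$splits, folds$id, function(split, fold_id) {
  fit <- glm(am ~ mpg, data = analysis(split), family = binomial())
  tidy(fit) %>% mutate(id = fold_id, .before = 1)
}) %>%
  bind_rows()

# How much does each coefficient vary from fold to fold?
coef_summary <- fold_coefs %>%
  group_by(term) %>%
  summarise(
    mean_estimate = mean(estimate),
    sd_estimate = sd(estimate),
    .groups = "drop"
  )

coef_summary
```

With v = 5 this yields five sets of coefficients, one per fold, without computing any performance metrics.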

Map over columns in tidy format using tidyverse

I often use a pattern as seen below, where I store data in a tibble using list-columns, apply functions to the data using purrr::map, and then use pivot_longer to convert to tidy format (long).
Is there a cleaner / more idiomatic way to do this in one step, without having to pivot the data each time?
library(tidyverse)
df <- tibble(n = 5:10)
df$data <- map(df$n, ~rnorm(.x))
df$mean <- map_dbl(df$data, ~mean(.x))
df$median <- map_dbl(df$data, ~median(.x))
# A tibble: 6 x 4
n data mean median
<int> <list> <dbl> <dbl>
1 5 <dbl [5]> -0.0239 -0.324
2 6 <dbl [6]> -0.396 0.0153
3 7 <dbl [7]> 0.506 0.711
4 8 <dbl [8]> 0.463 0.537
5 9 <dbl [9]> -0.248 -0.555
6 10 <dbl [10]> -0.153 -0.293
df <- pivot_longer(df, mean:median)
# A tibble: 12 x 4
n data name value
<int> <list> <chr> <dbl>
1 5 <dbl [5]> mean -0.386
2 5 <dbl [5]> median -0.407
3 6 <dbl [6]> mean -0.190
4 6 <dbl [6]> median -0.451
5 7 <dbl [7]> mean -0.456
6 7 <dbl [7]> median -0.0801
7 8 <dbl [8]> mean -0.0408
8 8 <dbl [8]> median 0.0577
9 9 <dbl [9]> mean 0.273
10 9 <dbl [9]> median 0.410
11 10 <dbl [10]> mean -0.720
12 10 <dbl [10]> median -1.01
I think you already have a good approach; I would have done the same, chaining all the functions in one pipe (%>%).
If you want to avoid the pivot_longer step, you can group by each row and create two new rows for each one. This is possible with dplyr 1.0.0 or higher; note that from dplyr 1.1.0, returning more than one row per group from summarise() is deprecated in favour of reframe().
library(tidyverse)
df %>%
mutate(data = map(n, rnorm),
group = row_number()) %>%
group_by(group) %>%
summarise(n = n,
data = data,
value = {tmp <- unlist(data);c(median(tmp), mean(tmp))},
name = c('median', 'mean')) %>%
ungroup %>%
select(-group)
# n data value name
# <int> <list> <dbl> <chr>
# 1 5 <dbl [5]> 0.571 median
# 2 5 <dbl [5]> 0.343 mean
# 3 6 <dbl [6]> 0.220 median
# 4 6 <dbl [6]> 0.0419 mean
# 5 7 <dbl [7]> -0.193 median
# 6 7 <dbl [7]> -0.132 mean
# 7 8 <dbl [8]> -0.171 median
# 8 8 <dbl [8]> 0.00583 mean
# 9 9 <dbl [9]> 0.952 median
#10 9 <dbl [9]> 0.471 mean
#11 10 <dbl [10]> 0.684 median
#12 10 <dbl [10]> 0.250 mean
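With dplyr 1.1.0 or later, the same one-step idea can be written with reframe(), which is designed to return any number of rows per group. A minimal sketch, assuming dplyr >= 1.1.0 is installed; the seed and the helper column row are assumptions for illustration.

```r
library(dplyr)
library(purrr)
library(tibble)

set.seed(42)  # assumed seed, only for reproducibility

df <- tibble(n = 5:10) %>%
  mutate(data = map(n, rnorm))

long <- df %>%
  mutate(row = row_number()) %>%
  reframe(
    n = n,
    name = c("mean", "median"),
    # Each group is a single row, so data[[1]] is that row's numeric vector.
    value = c(mean(data[[1]]), median(data[[1]])),
    data = data,
    .by = row
  ) %>%
  select(-row)

long
```

The length-one values (n and data) are recycled to match the two summary rows, so each input row produces a "mean" row and a "median" row directly.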

nest_by on R and running multiple models

I am fairly new to this community and R, thank you for all the support.
I encountered the new nest_by option of dplyr and it seems rather good. I have managed to split an existing data frame, but not to run multiple models on the pieces. I would like to iterate through all the data frames and get raw and summary data of statistical models (GLM models mainly).
library(tidyverse)
nested <- mtcars %>% nest_by (cyl,carb)
# A tibble: 9 x 3
# Rowwise: cyl, carb
cyl carb data
<dbl> <dbl> <list<tbl_df[,9]>>
1 4 1 [5 x 9]
2 4 2 [6 x 9]
3 6 1 [2 x 9]
4 6 4 [4 x 9]
5 6 6 [1 x 9]
6 8 2 [4 x 9]
7 8 3 [3 x 9]
8 8 4 [6 x 9]
9 8 8 [1 x 9]
# Now I would like to fit an lm model to each row separately. This line should do it, but it doesn't
fit<- nested %>%
mutate(model = map(data, ~lm(mpg~hp, data=.)))
Now, I am trying to make a printable version of all models for my statistics teacher.
nested <- mtcars %>% nest (data = -c(cyl,carb))
regressions <-nested %>%
mutate(
fit = map(data, ~ lm(mpg ~ hp, data = .x))
)
printing<- regressions %>% rowwise() %>%
mutate (printed = paste(carb, cyl, "This model summary is"), summary(fit), sep = '*')
However, this doesn't work at all.
Any thoughts?
EDIT: In your precise case try this:
library(broom) # provides tidy(), glance() and augment()
nested <- mtcars %>% nest(data = -c(cyl, carb))
regressions <-nested %>%
mutate(
fit = map(data, ~ lm(mpg ~ hp, data = .x)),
tidied = map(fit, tidy),
glanced = map(fit, glance),
augmented = map(fit, augment)
)
regressions %>%
unnest(glanced) # to get statistics of fits
regressions %>%
unnest(tidied) # to get coefficients of all fits
You can use dplyr in combination with broom as in this vignette. There is an exact example with mtcars:
data(mtcars)
mtcars <- as_tibble(mtcars) # to play nicely with list-cols
mtcars
## # A tibble: 32 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # ... with 22 more rows
mtcars %>%
nest(data = -am) %>%
mutate(
fit = map(data, ~ lm(wt ~ mpg + qsec + gear, data = .x)), # S3 list-col
tidied = map(fit, tidy)
) %>%
unnest(tidied)
## # A tibble: 8 x 8
## am data fit term estimate std.error statistic p.value
## <dbl> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 <tibble [13 x 10~ <lm> (Intercep~ 4.28 3.46 1.24 2.47e-1
## 2 1 <tibble [13 x 10~ <lm> mpg -0.101 0.0294 -3.43 7.50e-3
## 3 1 <tibble [13 x 10~ <lm> qsec 0.0398 0.151 0.264 7.98e-1
## 4 1 <tibble [13 x 10~ <lm> gear -0.0229 0.349 -0.0656 9.49e-1
## 5 0 <tibble [19 x 10~ <lm> (Intercep~ 4.92 1.40 3.52 3.09e-3
## 6 0 <tibble [19 x 10~ <lm> mpg -0.192 0.0443 -4.33 5.91e-4
## 7 0 <tibble [19 x 10~ <lm> qsec 0.0919 0.0983 0.935 3.65e-1
## 8 0 <tibble [19 x 10~ <lm> gear 0.147 0.368 0.398 6.96e-1
What if you want not just the tidy output, but the augment and glance outputs as well, while still performing each regression only once? Since we’re using list-columns, we can just fit the model once and use multiple list-columns to store the tidied, glanced and augmented outputs.
regressions <- mtcars %>%
nest(data = -am) %>%
mutate(
fit = map(data, ~ lm(wt ~ mpg + qsec + gear, data = .x)),
tidied = map(fit, tidy),
glanced = map(fit, glance),
augmented = map(fit, augment)
)
regressions %>%
unnest(tidied)
## # A tibble: 8 x 10
## am data fit term estimate std.error statistic p.value glanced augmented
## <dbl> <lis> <lis> <chr> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 1 <tib~ <lm> (Int~ 4.28 3.46 1.24 2.47e-1 <tibbl~ <tibble ~
## 2 1 <tib~ <lm> mpg -0.101 0.0294 -3.43 7.50e-3 <tibbl~ <tibble ~
## 3 1 <tib~ <lm> qsec 0.0398 0.151 0.264 7.98e-1 <tibbl~ <tibble ~
## 4 1 <tib~ <lm> gear -0.0229 0.349 -0.0656 9.49e-1 <tibbl~ <tibble ~
## 5 0 <tib~ <lm> (Int~ 4.92 1.40 3.52 3.09e-3 <tibbl~ <tibble ~
## 6 0 <tib~ <lm> mpg -0.192 0.0443 -4.33 5.91e-4 <tibbl~ <tibble ~
## 7 0 <tib~ <lm> qsec 0.0919 0.0983 0.935 3.65e-1 <tibbl~ <tibble ~
## 8 0 <tib~ <lm> gear 0.147 0.368 0.398 6.96e-1 <tibbl~ <tibble ~
regressions %>%
unnest(glanced)
## # A tibble: 2 x 16
## am data fit tidied r.squared adj.r.squared sigma statistic p.value df
## <dbl> <lis> <lis> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 <tib~ <lm> <tibb~ 0.833 0.778 0.291 15.0 7.59e-4 4
## 2 0 <tib~ <lm> <tibb~ 0.625 0.550 0.522 8.32 1.70e-3 4
## # ... with 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
## # deviance <dbl>, df.residual <int>, augmented <list>
regressions %>%
unnest(augmented)
## # A tibble: 32 x 16
## am data fit tidied glanced wt mpg qsec gear .fitted .se.fit
## <dbl> <lis> <lis> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 <tib~ <lm> <tibb~ <tibbl~ 2.62 21 16.5 4 2.73 0.209
## 2 1 <tib~ <lm> <tibb~ <tibbl~ 2.88 21 17.0 4 2.75 0.152
## 3 1 <tib~ <lm> <tibb~ <tibbl~ 2.32 22.8 18.6 4 2.63 0.163
## 4 1 <tib~ <lm> <tibb~ <tibbl~ 2.2 32.4 19.5 4 1.70 0.137
## 5 1 <tib~ <lm> <tibb~ <tibbl~ 1.62 30.4 18.5 4 1.86 0.151
## 6 1 <tib~ <lm> <tibb~ <tibbl~ 1.84 33.9 19.9 4 1.56 0.156
## 7 1 <tib~ <lm> <tibb~ <tibbl~ 1.94 27.3 18.9 4 2.19 0.113
## 8 1 <tib~ <lm> <tibb~ <tibbl~ 2.14 26 16.7 5 2.21 0.153
## 9 1 <tib~ <lm> <tibb~ <tibbl~ 1.51 30.4 16.9 5 1.77 0.191
## 10 1 <tib~ <lm> <tibb~ <tibbl~ 3.17 15.8 14.5 5 3.15 0.157
## # ... with 22 more rows, and 5 more variables: .resid <dbl>, .hat <dbl>,
## # .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
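The examples above all use nest() plus map(). For nest_by() itself, which the question started from, the result is a rowwise data frame, so inside mutate() the data column already refers to the single nested tibble for that row; calling map() on it would iterate over that tibble's columns instead. A sketch of the rowwise idiom follows, assuming dplyr >= 1.1.0 for reframe() and grouping by cyl only so that every group has enough rows to fit a line.

```r
library(dplyr)
library(broom)

# nest_by() returns a rowwise data frame: one row per value of cyl,
# with the remaining columns stored in the list-column `data`.
models <- mtcars %>%
  nest_by(cyl) %>%
  # No map() needed: `data` is the tibble for the current row.
  # Wrapping the fit in list() stores it in a list-column.
  mutate(model = list(lm(mpg ~ hp, data = data)))

# One row per coefficient per group; reframe() keeps the cyl key.
coefs <- models %>%
  reframe(tidy(model))

# One row of fit statistics per group.
fit_stats <- models %>%
  reframe(glance(model))

coefs
```

Both styles fit each model once; the choice between nest() with map() and nest_by() with list() is mostly a matter of taste.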

Resources