How to efficiently nest() and unnest_wider() in R's tidyverse - r

I am estimating rolling regressions on grouped data.
First, I group_by() and nest() my data by group.
Second, I use map() to estimate rolling regressions with a custom function my_beta(), which returns a list column.
The last step is where I stumble.
I want to extract the groups, dates, and coefficients so that I can merge the coefficients back to my original tibble.
However, my current solution requires three unnest() operations and a bind_cols().
The multiple unnest()s seem inefficient and the bind_cols() seems error prone.
Is there a syntactically and computationally more efficient way to estimate these rolling regressions? My actual data will have 10,000ish groups and 200,000ish observations.
library(tidyverse)
library(tsibble)
#>
#> Attaching package: 'tsibble'
#> The following object is masked from 'package:dplyr':
#>
#> id
set.seed(2001)
df <-
tibble(
date = 1:20,
y = runif(20),
x = runif(20),
z = runif(20),
group = rep(1:2, each = 10)
)
my_beta <- function(...) {
tail(coef(lm(y ~ x + z, data = list(...))), n = -1)
}
current_output <- df %>%
as_tsibble(key = group, index = date) %>%
group_by_key() %>%
nest() %>%
mutate(coefs = purrr::map(data, ~ pslide(., my_beta, .size = 5))) %>%
unnest(coefs) %>%
unnest_wider(coefs, names_sep = '_') %>%
ungroup()
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
current_output
#> # A tibble: 20 x 5
#> group data coefs_...1 coefs_x coefs_z
#> <int> <list> <lgl> <dbl> <dbl>
#> 1 1 <tsibble [10 × 4]> NA NA NA
#> 2 1 <tsibble [10 × 4]> NA NA NA
#> 3 1 <tsibble [10 × 4]> NA NA NA
#> 4 1 <tsibble [10 × 4]> NA NA NA
#> 5 1 <tsibble [10 × 4]> NA 1.46 2.08
#> 6 1 <tsibble [10 × 4]> NA 0.141 -0.396
#> 7 1 <tsibble [10 × 4]> NA 0.754 1.10
#> 8 1 <tsibble [10 × 4]> NA 0.651 0.889
#> 9 1 <tsibble [10 × 4]> NA 0.743 0.954
#> 10 1 <tsibble [10 × 4]> NA 0.308 0.795
#> 11 2 <tsibble [10 × 4]> NA NA NA
#> 12 2 <tsibble [10 × 4]> NA NA NA
#> 13 2 <tsibble [10 × 4]> NA NA NA
#> 14 2 <tsibble [10 × 4]> NA NA NA
#> 15 2 <tsibble [10 × 4]> NA -0.0433 -0.252
#> 16 2 <tsibble [10 × 4]> NA 0.696 0.334
#> 17 2 <tsibble [10 × 4]> NA 0.594 -0.0698
#> 18 2 <tsibble [10 × 4]> NA 0.881 0.0474
#> 19 2 <tsibble [10 × 4]> NA 3.23 -1.32
#> 20 2 <tsibble [10 × 4]> NA -0.942 1.85
desired_output <- df %>%
bind_cols(current_output %>% select(coefs_x, coefs_z))
desired_output
#> # A tibble: 20 x 7
#> date y x z group coefs_x coefs_z
#> <int> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 1 0.759 0.368 0.644 1 NA NA
#> 2 2 0.608 0.992 0.0542 1 NA NA
#> 3 3 0.218 0.815 0.252 1 NA NA
#> 4 4 0.229 0.982 0.0606 1 NA NA
#> 5 5 0.153 0.275 0.488 1 1.46 2.08
#> 6 6 0.374 0.856 0.268 1 0.141 -0.396
#> 7 7 0.619 0.737 0.599 1 0.754 1.10
#> 8 8 0.259 0.641 0.189 1 0.651 0.889
#> 9 9 0.637 0.598 0.543 1 0.743 0.954
#> 10 10 0.325 0.990 0.0265 1 0.308 0.795
#> 11 11 0.816 0.519 0.351 2 NA NA
#> 12 12 0.717 0.766 0.333 2 NA NA
#> 13 13 0.781 0.365 0.380 2 NA NA
#> 14 14 0.838 0.924 0.0778 2 NA NA
#> 15 15 0.736 0.453 0.258 2 -0.0433 -0.252
#> 16 16 0.173 0.291 0.328 2 0.696 0.334
#> 17 17 0.677 0.714 0.884 2 0.594 -0.0698
#> 18 18 0.833 0.718 0.902 2 0.881 0.0474
#> 19 19 0.134 0.351 0.422 2 3.23 -1.32
#> 20 20 0.675 0.963 0.981 2 -0.942 1.85
Created on 2020-02-25 by the reprex package (v0.3.0)

We could simplify the code a bit with
res %>%
unnest(cols = c(data, coefs)) %>%
unnest_wider(col = coefs, names_sep = '_') %>%
select(-coefs_...1)
Where res is
res <-
df %>%
as_tsibble(key = group, index = date) %>%
group_by_key() %>%
nest() %>%
mutate(coefs = purrr::map(data, ~ pslide(., my_beta, .size = 5)))
The code that does the estimate part is left untouched. This only addresses the data wrangling part, about multiple unnest()s and bind_cols().
I haven't done a performance benchmark.

Related

map_dfr outputting a row rather than a column

This is similar to purrr::map_dfr binds by columns, not row as expected but the solutions there aren't working for me. I have a dataframe like
beta_df <- structure(list(intercept = c(-2.75747056032685, -2.90831892599742,
-2.92478082251453, -2.99701559041538, -2.88885796048347, -3.09564193631675
), B1 = c(0.0898235360814854, 0.0291839369781567, 0.0881023522236231,
0.231703026085554, 0.0441573699433149, 0.258219673780526), B2 = c(-0.222367437619057,
0.770536384299238, 0.199648657850609, 0.0529038155448773, 0.00310458335580774,
0.132604387458483), B3 = c(1.26339268033385, 1.29883641278223,
0.949504940387809, 1.26904511447941, 0.863882674439083, 0.823907268679309
), B4 = c(2.13662994525526, 1.02340744740827, 0.959079691725652,
1.60672779812489, 1.19095838867883, -0.0693120654049908)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
#> # A tibble: 6 × 5
#> intercept B1 B2 B3 B4
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -2.76 0.0898 -0.222 1.26 2.14
#> 2 -2.91 0.0292 0.771 1.30 1.02
#> 3 -2.92 0.0881 0.200 0.950 0.959
#> 4 -3.00 0.232 0.0529 1.27 1.61
#> 5 -2.89 0.0442 0.00310 0.864 1.19
#> 6 -3.10 0.258 0.133 0.824 -0.0693
I'd like to turn this into a tibble with columns for the mean, 0.025 and 0.975 quantiles. For the quantile function this works:
beta_df %>%
map_dfr(quantile,0.025)
#> # A tibble: 5 × 1
#> `2.5%`
#> <dbl>
#> 1 -3.08
#> 2 0.0311
#> 3 -0.194
#> 4 0.829
#> 5 0.0592
And this gets me both quantiles
bind_cols(beta_df %>%
map_dfr(quantile, 0.025),
beta_df %>%
map_dfr(quantile, 0.975))
#> # A tibble: 5 × 2
#> `2.5%` `97.5%`
#> <dbl> <dbl>
#> 1 -3.08 -2.77
#> 2 0.0311 0.255
#> 3 -0.194 0.699
#> 4 0.829 1.30
#> 5 0.0592 2.07
But for mean,
beta_df %>%
map_dfr(mean)
#> # A tibble: 1 × 5
#> intercept B1 B2 B3 B4
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -2.93 0.124 0.156 1.08 1.14
Gives me a long row rather than a column. How can I turn the mean of each column of the original dataframe into a row of a single column dataframe labelled mean?
The reason is because the output of quantile() is a named vector whereas for the mean() is just a single value.
Lets create a custom function with the mean that outputs a named vector,
myMean <- function(x) {setNames(mean(x), nm = 'theMean')}
Applying that using map_dfr we get,
library(dplyr)
beta_df %>%
purrr::map_dfr(myMean)
# A tibble: 5 x 1
theMean
<dbl>
1 -2.93
2 0.124
3 0.156
4 1.08
5 1.14

How do I insert a column of static text into a nested dataframe in r?

Likely a trivial task for the pros out there, but have not been able to figure out how to insert the text found in the "Slug" column into each of the three nested tables associated with the slug.
![data] (https://i.stack.imgur.com/YClrE.png)
I am just looking to get the Slug value inserted into the nested tables and repeated for each row so I can combine and keep track of associations properly.
Any tips are most welcome! Thank you
Solution
You can use rowwise() with mutate(across())
df %>%
rowwise() %>%
mutate(across(floor_price_array:holder_hist, ~list(mutate(.x,slug=slug))))
Explanation
If your original data, say df, looks like this:
id slug floor_price_array num_listed_hist holder_hist
<chr> <chr> <list> <list> <list>
1 a hyznu <tibble [10 x 3]> <tibble [10 x 3]> <tibble [10 x 3]>
2 b awxeb <tibble [10 x 3]> <tibble [10 x 3]> <tibble [10 x 3]>
3 c pbncj <tibble [10 x 3]> <tibble [10 x 3]> <tibble [10 x 3]>
then, the above code will add the value in the slug column as a new constant column in each of the nested tibbles, and resulting in this (notice that each now has four columns):
id slug floor_price_array num_listed_hist holder_hist
<chr> <chr> <list> <list> <list>
1 a hyznu <tibble [10 x 4]> <tibble [10 x 4]> <tibble [10 x 4]>
2 b awxeb <tibble [10 x 4]> <tibble [10 x 4]> <tibble [10 x 4]>
3 c pbncj <tibble [10 x 4]> <tibble [10 x 4]> <tibble [10 x 4]>
For example, floor_price_array, now contains this:
[[1]]
# A tibble: 10 x 4
x y z slug
<dbl> <dbl> <dbl> <chr>
1 1.44 2.02 -0.272 hyznu
2 -0.598 -0.723 -0.528 hyznu
3 0.490 -0.576 -1.62 hyznu
4 -0.145 0.349 0.341 hyznu
5 -0.362 0.503 0.584 hyznu
6 -0.798 -0.839 -0.352 hyznu
7 -0.503 -1.27 -1.18 hyznu
8 -0.916 -0.654 0.335 hyznu
9 0.578 0.137 -0.590 hyznu
10 -0.194 -0.674 1.73 hyznu
[[2]]
# A tibble: 10 x 4
x y z slug
<dbl> <dbl> <dbl> <chr>
1 0.876 0.665 -0.723 awxeb
2 -0.0442 -0.00906 0.0829 awxeb
3 -2.15 1.33 0.0692 awxeb
4 0.264 0.237 -0.497 awxeb
5 0.0381 0.0502 -1.58 awxeb
6 -0.802 0.783 -1.34 awxeb
7 -0.940 1.50 -0.542 awxeb
8 0.209 -1.06 0.853 awxeb
9 0.569 -1.15 -0.347 awxeb
10 -1.57 -0.0774 0.0250 awxeb
[[3]]
# A tibble: 10 x 4
x y z slug
<dbl> <dbl> <dbl> <chr>
1 -0.0289 -1.63 1.29 pbncj
2 -0.716 0.647 0.0230 pbncj
3 -0.0797 -0.0227 2.12 pbncj
4 -0.358 -1.43 -1.81 pbncj
5 -1.35 -0.402 -0.463 pbncj
6 -0.00494 -0.136 1.50 pbncj
7 1.09 0.124 -0.974 pbncj
8 -1.18 1.78 -0.836 pbncj
9 0.896 -1.38 0.199 pbncj
10 0.293 0.420 0.562 pbncj
Input data:
df <- tibble(id = letters[1:3]
) %>%
rowwise() %>%
mutate(slug = paste0(sample(letters,5), collapse="")) %>%
mutate(floor_price_array=list(tibble(x=rnorm(10), y=rnorm(10), z=rnorm(10))),
num_listed_hist=list(tibble(x=rnorm(10), y=rnorm(10), z=rnorm(10))),
holder_hist=list(tibble(x=rnorm(10), y=rnorm(10), z=rnorm(10)))
) %>% ungroup()

Group by and run multiple t tests in R

I have the following dataset (dput here):
# A tibble: 3,713 x 17
ID Age Group RHR HRV Sleep.Onset Wake.Onset Hours.in.Bed Hours.of.Sleep Sleep.Disturbances Latency.min Cycles REM.Sleep.hours Deep.Sleep.hours
<int> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int> <dbl> <dbl>
1 5027 Young Increase 58 73 0.180 0.458 6.66 5.33 9 8.98 6 1.4 0.32
2 5027 Young Increase 83 27 0.162 0.542 9.1 6.84 15 3.48 9 1.19 1.54
3 5027 Young Increase 57 85 0.113 0.318 4.92 4.43 5 1.98 4 1.32 0.44
4 5027 Young Increase 60 70 0.0975 0.319 5.32 3.75 3 26.5 4 1.02 0.14
5 5027 Young Increase 63 72 0.105 0.329 5.38 4.74 5 2.48 5 1.32 0.07
6 5027 Young Increase 62 61 0.983 0.472 11.8 9.44 9 4.48 8 2.07 0.84
7 5027 Young Increase 66 68 0.142 0.426 6.83 5.48 15 2.98 6 1.48 0.35
8 5027 Young Increase 81 28 0.0908 0.177 2.06 1.93 2 2.48 1 0.22 0.22
9 5027 Young Increase 69 57 0.158 0.443 6.85 6.58 13 0.48 6 2.43 0
10 5027 Young Increase 63 60 0.0859 0.318 5.58 5.47 4 0.48 5 1.34 0.13
# ... with 3,703 more rows, and 3 more variables: Light.Sleep.hours <dbl>, Awake.hours <dbl>, Session <chr>
I am trying to calculate a t-test across every variable, grouped by Age and Group between Session (pre or post).
df %>%
select(-ID) %>%
group_by(Age, Group) %>%
summarize_at(
vars(-group_cols(), -Session),
list(p.value = ~ t.test(. ~ Session)$p.value))
I am successful with p values:
# A tibble: 4 x 15
# Groups: Age [2]
Age Group RHR_p.value HRV_p.value Sleep.Onset_p.value Wake.Onset_p.value Hours.in.Bed_p.value Hours.of.Sleep_p~ Sleep.Disturban~ Latency.min_p.v~
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Old Decrease 0.0594 0.865 0.495 0.885 0.316 0.307 0.148 0.00237
2 Old Increase 0.00920 0.634 0.0979 0.0514 0.00774 0.00762 0.247 0.933
3 Young Decrease 0.0975 0.259 0.779 0.760 0.959 0.975 0.256 0.181
4 Young Increase 0.115 0.604 0.846 0.164 0.140 0.242 0.692 0.412
# ... with 5 more variables: Cycles_p.value <dbl>, REM.Sleep.hours_p.value <dbl>, Deep.Sleep.hours_p.value <dbl>, Light.Sleep.hours_p.value <dbl>,
# Awake.hours_p.value <dbl>
However, I am struggling to calculate the other t-statistics (mean, sd, t, df, 95%CI) between these pre-post and also correct p-values groups. I am struggling to do this so any help is appreciated.
I think I may need to convert data long and use something like this?
df %>%
group_by(Age, Group) %>%
t_test(mean ~ ., by = "Session") %>%
adjust_pvalue(method = "bonferroni") %>%
add_significance()
Dndata frames can only have certain object classes as column types. A
htest is not one of those.
However, we can store lists as list-columns.
If we adapt the current code to output lists htests as results, we can later extract elements of the tests separately.
library(dplyr)
output <- df %>%
select(-ID) %>%
group_by(Age, Group) %>%
summarize_at(
vars(-group_cols(), -Session),
list(t.test = ~ list(t.test(. ~ Session))))
output
# A tibble: 4 × 15
# Groups: Age [2]
Age Group RHR_t.test HRV_t.test Sleep.Onset_t.test Wake.Onset_t.test Hours.in.Bed_t.test Hours.of.Sleep_t.test Sleep.Disturbance… Latency.min_t.t… Cycles_t.test REM.Sleep.hours…
<chr> <chr> <list> <list> <list> <list> <list> <list> <list> <list> <list> <list>
1 Old Decrease <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
2 Old Increase <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
3 Young Decrease <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
4 Young Increase <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
With this output data.frame, we can extract individual tests and values from them as desired:
output$RHR_t.test
[[1]]
Welch Two Sample t-test
data: . by Session
t = -1.8965, df = 188.22, p-value = 0.05942
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-3.09118590 0.06082897
sample estimates:
mean in group Post mean in group Pre
62.28902 63.80420
[[2]]
Welch Two Sample t-test
data: . by Session
t = -2.6271, df = 226.21, p-value = 0.009199
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-3.3949577 -0.4848655
sample estimates:
mean in group Post mean in group Pre
57.95946 59.89937
[[3]]
Welch Two Sample t-test
data: . by Session
t = 1.6633, df = 251.75, p-value = 0.0975
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-0.2074028 2.4611194
sample estimates:
mean in group Post mean in group Pre
60.58255 59.45570
[[4]]
Welch Two Sample t-test
data: . by Session
t = 1.5849, df = 208.4, p-value = 0.1145
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-0.244287 2.247775
sample estimates:
mean in group Post mean in group Pre
60.23462 59.23288
output$RHR_t.test %>%
map_dbl('p.value')
[1] 0.059424354 0.009199459 0.097497620 0.114502332
We can also convert these lists to user-friendly tibbles with broom::tidy
output %>%
mutate(across(ends_with('t.test'), map, broom::tidy))
# A tibble: 4 × 15
# Groups: Age [2]
Age Group RHR_t.test HRV_t.test Sleep.Onset_t.te… Wake.Onset_t.test Hours.in.Bed_t.t… Hours.of.Sleep_… Sleep.Disturbanc… Latency.min_t.t… Cycles_t.test REM.Sleep.hours…
<chr> <chr> <list> <list> <list> <list> <list> <list> <list> <list> <list> <list>
1 Old Decrease <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
2 Old Increase <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
3 Young Decrease <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
4 Young Increase <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
# … with 3 more variables: Deep.Sleep.hours_t.test <list>, Light.Sleep.hours_t.test <list>, Awake.hours_t.test <list>
To have all tests "statistics", we can do it like this:
tidy_output %>%
mutate(across(ends_with('t.test'), sapply, pull, 'statistic'))
# A tibble: 4 × 15
# Groups: Age [2]
Age Group RHR_t.test HRV_t.test Sleep.Onset_t.test Wake.Onset_t.test Hours.in.Bed_t.test Hours.of.Sleep_t.test Sleep.Disturbance… Latency.min_t.t… Cycles_t.test REM.Sleep.hours…
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Old Decrease -1.90 0.171 0.684 -0.145 -1.01 -1.02 -1.45 3.05 -0.928 -0.906
2 Old Increase -2.63 0.477 -1.66 -1.96 -2.69 -2.69 -1.16 0.0848 -1.76 -1.87
3 Young Decrease 1.66 1.13 0.281 -0.305 0.0509 -0.0320 1.14 -1.34 -0.675 0.672
4 Young Increase 1.58 0.519 0.195 -1.40 -1.48 -1.17 0.397 -0.821 -1.73 0.886
# … with 3 more variables: Deep.Sleep.hours_t.test <dbl>, Light.Sleep.hours_t.test <dbl>, Awake.hours_t.test <dbl>

How do I combine many tibbles by a simple code?

I have pop_1910, ... pop_2000. Each tibble has the following style. I want to combine these tibbles to one tibble. I know bind_rows to do that pop_1910 %>% bind_rows(pop_1920) %>% bind_rows(pop_1930). But it is a little bit tedious. Are there some efficient ways to combine many dataframes?
> pop_2000
# A tibble: 3,143 x 3
fips year pop
<chr> <dbl> <dbl>
1 01001 2000 33364
2 01003 2000 112162
3 01005 2000 23042
4 01007 2000 15432
5 01009 2000 40165
6 01011 2000 9142
7 01013 2000 16798
8 01015 2000 90175
9 01017 2000 29086
10 01019 2000 19470
If you have them inside a list, you can use reduce() to bind all in one move.
library(tidyverse)
my_df_list <- map(1:4, ~tibble(x = rnorm(5), y = rnorm(5)))
my_df_list
#> [[1]]
#> # A tibble: 5 x 2
#> x y
#> <dbl> <dbl>
#> 1 1.99 1.19
#> 2 0.273 0.208
#> 3 1.12 1.18
#> 4 0.00855 -0.593
#> 5 0.502 -0.926
#>
#> [[2]]
#> # A tibble: 5 x 2
#> x y
#> <dbl> <dbl>
#> 1 0.570 -0.709
#> 2 0.599 -0.408
#> 3 -0.687 1.38
#> 4 0.375 1.53
#> 5 0.0394 1.90
#>
#> [[3]]
#> # A tibble: 5 x 2
#> x y
#> <dbl> <dbl>
#> 1 -0.576 1.64
#> 2 0.147 -0.0384
#> 3 0.904 0.164
#> 4 -1.16 -1.02
#> 5 -0.678 1.32
#>
#> [[4]]
#> # A tibble: 5 x 2
#> x y
#> <dbl> <dbl>
#> 1 -0.849 -0.445
#> 2 -0.786 -0.991
#> 3 1.17 -1.00
#> 4 0.222 1.65
#> 5 -0.656 -0.808
reduce(my_df_list, bind_rows)
#> # A tibble: 20 x 2
#> x y
#> <dbl> <dbl>
#> 1 1.99 1.19
#> 2 0.273 0.208
#> 3 1.12 1.18
#> 4 0.00855 -0.593
#> 5 0.502 -0.926
#> 6 0.570 -0.709
#> 7 0.599 -0.408
#> 8 -0.687 1.38
#> 9 0.375 1.53
#> 10 0.0394 1.90
#> 11 -0.576 1.64
#> 12 0.147 -0.0384
#> 13 0.904 0.164
#> 14 -1.16 -1.02
#> 15 -0.678 1.32
#> 16 -0.849 -0.445
#> 17 -0.786 -0.991
#> 18 1.17 -1.00
#> 19 0.222 1.65
#> 20 -0.656 -0.808
Created on 2021-06-07 by the reprex package (v2.0.0)
you may also use map_dfr simply
purrr::map_dfr(my_list, ~.x)
This will give you a single df binded by rows.
OR in baseR
do.call(rbind, my_list)
Even easier is piping your list to dplyr::bind_rows(), e.g.
library(dplyr)
my_list %>% bind_rows()

Get summary of the model using purrr::map within dplyr piping

Using mtcars data, I am testing map() to build some lm() models:
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
nest()%>%
mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x)))
#> # A tibble: 3 x 3
#> cyl data fit
#> <dbl> <list> <list>
#> 1 6 <tibble [7 x 10]> <S3: lm>
#> 2 4 <tibble [11 x 10]> <S3: lm>
#> 3 8 <tibble [14 x 10]> <S3: lm>
The output shows that I have a new column, fit.
Now I wish to see the summary of each lm
When I try:
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
nest()%>%
mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x))) %>%
map(fit,summary)
#> Error in as_mapper(.f, ...): object 'fit' not found
It gives the error:
Error in as_mapper(.f, ...) : object 'fit' not found
If I wish to calculate R2 or aic then I can using the following code without any problem:
library(tidyverse)
library(modelr)
mtcars %>%
group_by(cyl) %>%
nest()%>%
mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x))) %>%
mutate(r2 = map_dbl(fit, ~rsquare(., data = mtcars)),
aic = map_dbl(fit, ~AIC(.))) %>%
arrange(aic)
#> # A tibble: 3 x 5
#> cyl data fit r2 aic
#> <dbl> <list> <list> <dbl> <dbl>
#> 1 6 <tibble [7 x 10]> <S3: lm> -8.96 -Inf
#> 2 4 <tibble [11 x 10]> <S3: lm> -26.4 56.4
#> 3 8 <tibble [14 x 10]> <S3: lm> -1.000 67.3
Created on 2019-06-18 by the reprex package (v0.3.0)
What am I missing?
As IceCreamToucan's comment laid out, purrr::map does not look into the data which has been made within your piping.
If you use it with dplyr::mutate then it has access to fit which you have created in the previous piping.
Another option would be explicitly referring to fit which you can see below, as my second suggestion.
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
nest()%>%
mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x))) %>%
mutate(fit_sum = map(fit,summary))
#> # A tibble: 3 x 4
#> cyl data fit fit_sum
#> <dbl> <list> <list> <list>
#> 1 6 <tibble [7 x 10]> <lm> <smmry.lm>
#> 2 4 <tibble [11 x 10]> <lm> <smmry.lm>
#> 3 8 <tibble [14 x 10]> <lm> <smmry.lm>
mtcars %>%
group_by(cyl) %>%
nest()%>%
mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x))) %>%
{map(.$fit, summary)} #or using pull: `pull(fit) %>% map(summary)`
#> [[1]]
#>
#> Call:
#> lm(formula = mpg ~ ., data = .x)
#>
#> Residuals:
#> ALL 7 residuals are 0: no residual degrees of freedom!
#>
#> Coefficients: (3 not defined because of singularities)
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 32.78649 NA NA NA
#> disp 0.07456 NA NA NA
#> hp -0.04252 NA NA NA
#> drat 1.52367 NA NA NA
#> wt 5.12418 NA NA NA
#> qsec -2.33333 NA NA NA
#> vs -1.75289 NA NA NA
#> am NA NA NA NA
#> gear NA NA NA NA
#> carb NA NA NA NA
#>
#> Residual standard error: NaN on 0 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: NaN
#> F-statistic: NaN on 6 and 0 DF, p-value: NA
####truncated the results for the sake of space####
Created on 2019-06-17 by the reprex package (v0.3.0)
From the latest release of dplyr, tidyverse seems to be encouraging using group_modify functions instead of using purrr + nested dataframes.
In that workflow, here is how you can get both model summaries and estimates in the same dataframe via broom package:
# setup
set.seed(123)
library(tidyverse)
options(tibble.width = Inf)
# joining dataframes with regression estimates and model summaries
dplyr::full_join(
# to get a tidy dataframe of regression estimates
x = mtcars %>%
group_by(cyl) %>%
group_modify(.f = ~ broom::tidy(lm(mpg ~ ., data = .x), conf.int = TRUE)),
# to get a tidy dataframe of model summaries
y = mtcars %>%
group_by(cyl) %>%
group_modify(.f = ~ broom::glance(lm(mpg ~ ., data = .x))),
by = "cyl"
) %>%
dplyr::ungroup(x = .)
#> Warning in qt(a, object$df.residual): NaNs produced
#> # A tibble: 25 x 20
#> cyl term estimate std.error statistic.x p.value.x conf.low
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 (Intercept) 60.9 180. 0.338 0.793 -2229.
#> 2 4 disp -0.345 0.469 -0.735 0.596 -6.31
#> 3 4 hp -0.0332 0.364 -0.0915 0.942 -4.65
#> 4 4 drat -4.19 46.4 -0.0903 0.943 -594.
#> 5 4 wt 4.48 29.7 0.151 0.905 -373.
#> 6 4 qsec -0.106 7.82 -0.0136 0.991 -99.4
#> 7 4 vs -3.64 34.0 -0.107 0.932 -435.
#> 8 4 am -6.33 45.2 -0.140 0.912 -581.
#> 9 4 gear 4.07 29.1 0.140 0.912 -366.
#> 10 4 carb 3.22 28.2 0.114 0.928 -355.
#> conf.high r.squared adj.r.squared sigma statistic.y p.value.y df
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2351. 0.928 0.276 3.84 1.42 0.576 9
#> 2 5.62 0.928 0.276 3.84 1.42 0.576 9
#> 3 4.59 0.928 0.276 3.84 1.42 0.576 9
#> 4 586. 0.928 0.276 3.84 1.42 0.576 9
#> 5 382. 0.928 0.276 3.84 1.42 0.576 9
#> 6 99.2 0.928 0.276 3.84 1.42 0.576 9
#> 7 428. 0.928 0.276 3.84 1.42 0.576 9
#> 8 568. 0.928 0.276 3.84 1.42 0.576 9
#> 9 374. 0.928 0.276 3.84 1.42 0.576 9
#> 10 362. 0.928 0.276 3.84 1.42 0.576 9
#> logLik AIC BIC deviance df.residual nobs
#> <dbl> <dbl> <dbl> <dbl> <int> <int>
#> 1 -17.2 56.4 60.8 14.7 1 11
#> 2 -17.2 56.4 60.8 14.7 1 11
#> 3 -17.2 56.4 60.8 14.7 1 11
#> 4 -17.2 56.4 60.8 14.7 1 11
#> 5 -17.2 56.4 60.8 14.7 1 11
#> 6 -17.2 56.4 60.8 14.7 1 11
#> 7 -17.2 56.4 60.8 14.7 1 11
#> 8 -17.2 56.4 60.8 14.7 1 11
#> 9 -17.2 56.4 60.8 14.7 1 11
#> 10 -17.2 56.4 60.8 14.7 1 11
#> # ... with 15 more rows
Created on 2019-06-17 by the reprex package (v0.3.0)

Resources