I have the following dataset (dput here):
# A tibble: 3,713 x 17
ID Age Group RHR HRV Sleep.Onset Wake.Onset Hours.in.Bed Hours.of.Sleep Sleep.Disturbances Latency.min Cycles REM.Sleep.hours Deep.Sleep.hours
<int> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int> <dbl> <dbl>
1 5027 Young Increase 58 73 0.180 0.458 6.66 5.33 9 8.98 6 1.4 0.32
2 5027 Young Increase 83 27 0.162 0.542 9.1 6.84 15 3.48 9 1.19 1.54
3 5027 Young Increase 57 85 0.113 0.318 4.92 4.43 5 1.98 4 1.32 0.44
4 5027 Young Increase 60 70 0.0975 0.319 5.32 3.75 3 26.5 4 1.02 0.14
5 5027 Young Increase 63 72 0.105 0.329 5.38 4.74 5 2.48 5 1.32 0.07
6 5027 Young Increase 62 61 0.983 0.472 11.8 9.44 9 4.48 8 2.07 0.84
7 5027 Young Increase 66 68 0.142 0.426 6.83 5.48 15 2.98 6 1.48 0.35
8 5027 Young Increase 81 28 0.0908 0.177 2.06 1.93 2 2.48 1 0.22 0.22
9 5027 Young Increase 69 57 0.158 0.443 6.85 6.58 13 0.48 6 2.43 0
10 5027 Young Increase 63 60 0.0859 0.318 5.58 5.47 4 0.48 5 1.34 0.13
# ... with 3,703 more rows, and 3 more variables: Light.Sleep.hours <dbl>, Awake.hours <dbl>, Session <chr>
I am trying to run a t-test on every variable, comparing Session (pre vs. post) within each combination of Age and Group.
df %>%
select(-ID) %>%
group_by(Age, Group) %>%
summarize_at(
vars(-group_cols(), -Session),
list(p.value = ~ t.test(. ~ Session)$p.value))
I am successful with p values:
# A tibble: 4 x 15
# Groups: Age [2]
Age Group RHR_p.value HRV_p.value Sleep.Onset_p.value Wake.Onset_p.value Hours.in.Bed_p.value Hours.of.Sleep_p~ Sleep.Disturban~ Latency.min_p.v~
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Old Decrease 0.0594 0.865 0.495 0.885 0.316 0.307 0.148 0.00237
2 Old Increase 0.00920 0.634 0.0979 0.0514 0.00774 0.00762 0.247 0.933
3 Young Decrease 0.0975 0.259 0.779 0.760 0.959 0.975 0.256 0.181
4 Young Increase 0.115 0.604 0.846 0.164 0.140 0.242 0.692 0.412
# ... with 5 more variables: Cycles_p.value <dbl>, REM.Sleep.hours_p.value <dbl>, Deep.Sleep.hours_p.value <dbl>, Light.Sleep.hours_p.value <dbl>,
# Awake.hours_p.value <dbl>
However, I am struggling to get the other statistics for each pre-post comparison (mean, sd, t, df, 95% CI) and to adjust the p-values across groups. Any help is appreciated.
I think I may need to convert data long and use something like this?
df %>%
group_by(Age, Group) %>%
t_test(mean ~ ., by = "Session") %>%
adjust_pvalue(method = "bonferroni") %>%
add_significance()
Data frames can only have certain object classes as column types. An htest is not one of those.
However, we can store lists as list-columns.
If we adapt the current code to output lists of htests as results, we can later extract elements of the tests separately.
library(dplyr)
output <- df %>%
select(-ID) %>%
group_by(Age, Group) %>%
summarize_at(
vars(-group_cols(), -Session),
list(t.test = ~ list(t.test(. ~ Session))))
output
# A tibble: 4 × 15
# Groups: Age [2]
Age Group RHR_t.test HRV_t.test Sleep.Onset_t.test Wake.Onset_t.test Hours.in.Bed_t.test Hours.of.Sleep_t.test Sleep.Disturbance… Latency.min_t.t… Cycles_t.test REM.Sleep.hours…
<chr> <chr> <list> <list> <list> <list> <list> <list> <list> <list> <list> <list>
1 Old Decrease <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
2 Old Increase <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
3 Young Decrease <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
4 Young Increase <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
With this output data.frame, we can extract individual tests and values from them as desired:
output$RHR_t.test
[[1]]
Welch Two Sample t-test
data: . by Session
t = -1.8965, df = 188.22, p-value = 0.05942
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-3.09118590 0.06082897
sample estimates:
mean in group Post mean in group Pre
62.28902 63.80420
[[2]]
Welch Two Sample t-test
data: . by Session
t = -2.6271, df = 226.21, p-value = 0.009199
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-3.3949577 -0.4848655
sample estimates:
mean in group Post mean in group Pre
57.95946 59.89937
[[3]]
Welch Two Sample t-test
data: . by Session
t = 1.6633, df = 251.75, p-value = 0.0975
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-0.2074028 2.4611194
sample estimates:
mean in group Post mean in group Pre
60.58255 59.45570
[[4]]
Welch Two Sample t-test
data: . by Session
t = 1.5849, df = 208.4, p-value = 0.1145
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-0.244287 2.247775
sample estimates:
mean in group Post mean in group Pre
60.23462 59.23288
library(purrr)  # provides map_dbl()
output$RHR_t.test %>%
map_dbl('p.value')
[1] 0.059424354 0.009199459 0.097497620 0.114502332
We can also convert these lists to user-friendly tibbles with broom::tidy. Storing the result as tidy_output lets us reuse it afterwards:
tidy_output <- output %>%
mutate(across(ends_with('t.test'), ~ map(.x, broom::tidy)))
tidy_output
# A tibble: 4 × 15
# Groups: Age [2]
Age Group RHR_t.test HRV_t.test Sleep.Onset_t.te… Wake.Onset_t.test Hours.in.Bed_t.t… Hours.of.Sleep_… Sleep.Disturbanc… Latency.min_t.t… Cycles_t.test REM.Sleep.hours…
<chr> <chr> <list> <list> <list> <list> <list> <list> <list> <list> <list> <list>
1 Old Decrease <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
2 Old Increase <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
3 Young Decrease <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
4 Young Increase <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
# … with 3 more variables: Deep.Sleep.hours_t.test <list>, Light.Sleep.hours_t.test <list>, Awake.hours_t.test <list>
To extract the test statistic (the t value) from every test, we can do it like this:
tidy_output %>%
mutate(across(ends_with('t.test'), sapply, pull, 'statistic'))
# A tibble: 4 × 15
# Groups: Age [2]
Age Group RHR_t.test HRV_t.test Sleep.Onset_t.test Wake.Onset_t.test Hours.in.Bed_t.test Hours.of.Sleep_t.test Sleep.Disturbance… Latency.min_t.t… Cycles_t.test REM.Sleep.hours…
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Old Decrease -1.90 0.171 0.684 -0.145 -1.01 -1.02 -1.45 3.05 -0.928 -0.906
2 Old Increase -2.63 0.477 -1.66 -1.96 -2.69 -2.69 -1.16 0.0848 -1.76 -1.87
3 Young Decrease 1.66 1.13 0.281 -0.305 0.0509 -0.0320 1.14 -1.34 -0.675 0.672
4 Young Increase 1.58 0.519 0.195 -1.40 -1.48 -1.17 0.397 -0.821 -1.73 0.886
# … with 3 more variables: Deep.Sleep.hours_t.test <dbl>, Light.Sleep.hours_t.test <dbl>, Awake.hours_t.test <dbl>
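To get the group means, t, df, confidence interval and adjusted p-values in a single flat table, one option is to reshape to long format first and run one test per Age, Group and variable. This is a minimal sketch on made-up data: the toy tibble only mimics the structure of the question's data, and the Bonferroni correction across all tests is one possible choice. Standard deviations are not part of an htest, so they would need a separate summarise().

```r
library(dplyr)
library(tidyr)
library(broom)

set.seed(1)
# Made-up stand-in for the real data: 10 Pre and 10 Post rows per Age x Group
toy <- tibble(
  Age     = rep(c("Young", "Old"), each = 40),
  Group   = rep(rep(c("Increase", "Decrease"), each = 20), times = 2),
  Session = rep(c("Pre", "Post"), times = 40),
  RHR     = rnorm(80, mean = 60, sd = 5),
  HRV     = rnorm(80, mean = 65, sd = 10)
)

results <- toy %>%
  # one row per measurement, with the variable name in its own column
  pivot_longer(c(RHR, HRV), names_to = "variable", values_to = "value") %>%
  group_by(Age, Group, variable) %>%
  # one Welch t-test per group, tidied into a one-row tibble
  group_modify(~ tidy(t.test(value ~ Session, data = .x))) %>%
  ungroup() %>%
  # adjust across all tests in the table
  mutate(p.adjusted = p.adjust(p.value, method = "bonferroni"))

results
```

In the tidied output, estimate1 and estimate2 are the two session means (Post and Pre, in alphabetical order), statistic is t, parameter is the degrees of freedom, and conf.low/conf.high bound the 95% confidence interval.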
Related
I have a dataframe "data" that contains
employee ID ("CPNo") - int
Gender - factor
Job Role - factor
Country - factor
Annual Salary - int
I want to run a t-test for each job role in each country to see if there is a significant paygap between the genders in the same job role and country.
I create a nested dataframe that keeps only the nested dataframes with more than 20 observations:
dataNested <- data %>%
select(CPNo, Gender, JobRole, Country, AnnualSalaryLocal) %>%
nest(data = c(CPNo, Gender, AnnualSalaryLocal)) %>% filter(map_int(data, nrow) > 20)
And I want to run a t-test on that nested dataframe:
dataNested %>%
mutate(t_test = map(data, ~t.test(.x$AnnualSalaryLocal ~ .x$Gender, var.eq=F, paired=F)))
Now, if I run the code, I get the following nested dataframe, which contains the results of my t-tests:
JobRole                JobStage Country data     t_test
<fctr>                 <fctr>   <fctr>  <list>   <list>
76 Product Development 06       Ireland <tibble> <S3: htest>
76 Product Development 06       Italy   <tibble> <S3: htest>
82 Service Delivery    05       Italy   <tibble> <S3: htest>
82 Service Delivery    06       Italy   <tibble> <S3: htest>
82 Service Delivery    03       Mexico  <tibble> <S3: htest>
83 Supply & Logistics  01       Mexico  <tibble> <S3: htest>
76 Product Development 05       Poland  <tibble> <S3: htest>
How do I write the syntax if I want to add a new variable "sig" which extracts the p.value from my "t_test" variable?
You can extract by using broom::tidy(). Here's an example using the gapminder dataset:
library(gapminder)
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
gapminder |>
filter(continent %in% c("Europe", "Asia")) |>
group_by(year) |>
nest() |>
mutate(t_test = map(data, ~ t.test(.x$lifeExp ~ .x$continent, var.eq = F, paired = F)),
res = map(t_test, tidy)) |>
unnest(res) |>
ungroup()
# A tibble: 12 × 13
year data t_test estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
<int> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1952 <tibble [63 × 5]> <htest> -18.1 46.3 64.4 -9.09 1.14e-12 56.8 -22.1 -14.1 Welch Two Sample t-test two.sided
2 1957 <tibble [63 × 5]> <htest> -17.4 49.3 66.7 -8.98 4.73e-12 50.6 -21.3 -13.5 Welch Two Sample t-test two.sided
3 1962 <tibble [63 × 5]> <htest> -17.0 51.6 68.5 -9.02 1.24e-11 44.7 -20.8 -13.2 Welch Two Sample t-test two.sided
4 1967 <tibble [63 × 5]> <htest> -15.1 54.7 69.7 -8.29 2.01e-10 42.5 -18.7 -11.4 Welch Two Sample t-test two.sided
5 1972 <tibble [63 × 5]> <htest> -13.5 57.3 70.8 -7.50 3.96e- 9 39.6 -17.1 -9.83 Welch Two Sample t-test two.sided
6 1977 <tibble [63 × 5]> <htest> -12.3 59.6 71.9 -6.72 5.46e- 8 38.7 -16.0 -8.61 Welch Two Sample t-test two.sided
7 1982 <tibble [63 × 5]> <htest> -10.2 62.6 72.8 -6.38 1.18e- 7 41.7 -13.4 -6.96 Welch Two Sample t-test two.sided
8 1987 <tibble [63 × 5]> <htest> -8.79 64.9 73.6 -5.71 1.04e- 6 42.1 -11.9 -5.68 Welch Two Sample t-test two.sided
9 1992 <tibble [63 × 5]> <htest> -7.90 66.5 74.4 -5.19 5.54e- 6 42.7 -11.0 -4.83 Welch Two Sample t-test two.sided
10 1997 <tibble [63 × 5]> <htest> -7.48 68.0 75.5 -4.93 1.34e- 5 42.0 -10.5 -4.42 Welch Two Sample t-test two.sided
11 2002 <tibble [63 × 5]> <htest> -7.47 69.2 76.7 -4.81 2.13e- 5 40.3 -10.6 -4.33 Welch Two Sample t-test two.sided
12 2007 <tibble [63 × 5]> <htest> -6.92 70.7 77.6 -4.65 3.39e- 5 41.5 -9.93 -3.91 Welch Two Sample t-test two.sided
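If only the p-value is needed, a lighter alternative to tidy() is to pull the "p.value" element out of each htest with purrr::map_dbl(). The sketch below is a minimal illustration on the built-in ToothGrowth data, which stands in for the salary data (an assumption made only so the example runs on its own); the column name sig follows the question.

```r
library(dplyr)
library(tidyr)
library(purrr)

# ToothGrowth stands in for the salary data: one t-test of len by supp per dose
nested <- ToothGrowth %>%
  nest(data = c(len, supp)) %>%
  mutate(
    # one htest object per nested data frame
    t_test = map(data, ~ t.test(len ~ supp, data = .x)),
    # extract the "p.value" element of each htest as a plain numeric column
    sig = map_dbl(t_test, "p.value")
  )

nested %>% select(dose, sig)
```

Passing a string to map_dbl() extracts the list element of that name, so the same pattern works for "statistic" or "parameter".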
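There is a package, rstatix, that does what you want in one step. The outcome goes on the left of the formula, the two-level grouping variable on the right, and the strata go in group_by():
library(rstatix)
data %>%
group_by(Country, JobRole) %>%
t_test(AnnualSalaryLocal ~ Gender)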
This is likely a trivial task for the pros out there, but I have not been able to figure out how to insert the text found in the "Slug" column into each of the three nested tables associated with that slug.
![data](https://i.stack.imgur.com/YClrE.png)
I am just looking to get the Slug value inserted into the nested tables and repeated for each row so I can combine and keep track of associations properly.
Any tips are most welcome! Thank you
Solution
You can use rowwise() with mutate(across())
df %>%
rowwise() %>%
mutate(across(floor_price_array:holder_hist, ~list(mutate(.x,slug=slug))))
Explanation
If your original data, say df, looks like this:
id slug floor_price_array num_listed_hist holder_hist
<chr> <chr> <list> <list> <list>
1 a hyznu <tibble [10 x 3]> <tibble [10 x 3]> <tibble [10 x 3]>
2 b awxeb <tibble [10 x 3]> <tibble [10 x 3]> <tibble [10 x 3]>
3 c pbncj <tibble [10 x 3]> <tibble [10 x 3]> <tibble [10 x 3]>
then the above code will add the value in the slug column as a new constant column in each of the nested tibbles, resulting in this (notice that each now has four columns):
id slug floor_price_array num_listed_hist holder_hist
<chr> <chr> <list> <list> <list>
1 a hyznu <tibble [10 x 4]> <tibble [10 x 4]> <tibble [10 x 4]>
2 b awxeb <tibble [10 x 4]> <tibble [10 x 4]> <tibble [10 x 4]>
3 c pbncj <tibble [10 x 4]> <tibble [10 x 4]> <tibble [10 x 4]>
For example, floor_price_array now contains this:
[[1]]
# A tibble: 10 x 4
x y z slug
<dbl> <dbl> <dbl> <chr>
1 1.44 2.02 -0.272 hyznu
2 -0.598 -0.723 -0.528 hyznu
3 0.490 -0.576 -1.62 hyznu
4 -0.145 0.349 0.341 hyznu
5 -0.362 0.503 0.584 hyznu
6 -0.798 -0.839 -0.352 hyznu
7 -0.503 -1.27 -1.18 hyznu
8 -0.916 -0.654 0.335 hyznu
9 0.578 0.137 -0.590 hyznu
10 -0.194 -0.674 1.73 hyznu
[[2]]
# A tibble: 10 x 4
x y z slug
<dbl> <dbl> <dbl> <chr>
1 0.876 0.665 -0.723 awxeb
2 -0.0442 -0.00906 0.0829 awxeb
3 -2.15 1.33 0.0692 awxeb
4 0.264 0.237 -0.497 awxeb
5 0.0381 0.0502 -1.58 awxeb
6 -0.802 0.783 -1.34 awxeb
7 -0.940 1.50 -0.542 awxeb
8 0.209 -1.06 0.853 awxeb
9 0.569 -1.15 -0.347 awxeb
10 -1.57 -0.0774 0.0250 awxeb
[[3]]
# A tibble: 10 x 4
x y z slug
<dbl> <dbl> <dbl> <chr>
1 -0.0289 -1.63 1.29 pbncj
2 -0.716 0.647 0.0230 pbncj
3 -0.0797 -0.0227 2.12 pbncj
4 -0.358 -1.43 -1.81 pbncj
5 -1.35 -0.402 -0.463 pbncj
6 -0.00494 -0.136 1.50 pbncj
7 1.09 0.124 -0.974 pbncj
8 -1.18 1.78 -0.836 pbncj
9 0.896 -1.38 0.199 pbncj
10 0.293 0.420 0.562 pbncj
Input data:
df <- tibble(id = letters[1:3]
) %>%
rowwise() %>%
mutate(slug = paste0(sample(letters,5), collapse="")) %>%
mutate(floor_price_array=list(tibble(x=rnorm(10), y=rnorm(10), z=rnorm(10))),
num_listed_hist=list(tibble(x=rnorm(10), y=rnorm(10), z=rnorm(10))),
holder_hist=list(tibble(x=rnorm(10), y=rnorm(10), z=rnorm(10)))
) %>% ungroup()
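As an alternative sketch that avoids rowwise(), the same result can be reached by walking each list-column together with slug using purrr::map2(). The toy tibble, its column names (prices, holders) and the helper add_slug() are hypothetical and exist only for illustration.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical toy data: two slugs, two list-columns of nested tibbles
toy <- tibble(
  slug    = c("aaa", "bbb"),
  prices  = list(tibble(x = 1:2), tibble(x = 3:4)),
  holders = list(tibble(x = 5:6), tibble(x = 7:8))
)

# Hypothetical helper: add a constant slug column to one nested tibble
add_slug <- function(tbl, s) mutate(tbl, slug = s)

# For each list-column, pair every nested tibble with the slug of its row
tagged <- toy %>%
  mutate(across(c(prices, holders), ~ map2(.x, slug, add_slug)))

tagged$holders[[2]]
```

Because map2() iterates over the list-column and slug in parallel, no row-wise grouping is needed and the result stays an ordinary ungrouped tibble.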
I use fit_resamples() with a workflow() and V-fold cross-validation.
My model is a logistic regression.
How can I get the coefficients of a parsnip logistic regression model fitted with V-fold cross-validation?
With v = 5, I want to get the coefficients from each of the 5 fits.
You typically do not want to use fit_resamples() to train and keep five models; the main purpose of the fit_resamples() function is to use resampling to estimate performance. The five models are fit and then thrown away.
However, if you do have some use case where you want to keep around the models that are fit, such as in this article, then you would use extract_model.
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data(penguins)
set.seed(2021)
penguin_split <- penguins %>%
filter(!is.na(sex)) %>%
initial_split(strata = sex)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)
penguin_folds <- vfold_cv(penguin_train, v = 5, strata = sex)
penguin_folds
#> # 5-fold cross-validation using stratification
#> # A tibble: 5 x 2
#> splits id
#> <list> <chr>
#> 1 <split [198/51]> Fold1
#> 2 <split [199/50]> Fold2
#> 3 <split [199/50]> Fold3
#> 4 <split [200/49]> Fold4
#> 5 <split [200/49]> Fold5
glm_spec <- logistic_reg() %>%
set_engine("glm")
glm_rs <- workflow() %>%
add_formula(sex ~ species + bill_length_mm + bill_depth_mm + body_mass_g) %>%
add_model(glm_spec) %>%
fit_resamples(
resamples = penguin_folds,
control = control_resamples(extract = extract_model, save_pred = TRUE)
)
Now that you have used extract_model in your resampling, it is there in your results and you have the models available for each fold.
glm_rs
#> # Resampling results
#> # 5-fold cross-validation using stratification
#> # A tibble: 5 x 6
#> splits id .metrics .notes .extracts .predictions
#> <list> <chr> <list> <list> <list> <list>
#> 1 <split [198/… Fold1 <tibble [2 × … <tibble [0 ×… <tibble [1 ×… <tibble [51 × …
#> 2 <split [199/… Fold2 <tibble [2 × … <tibble [0 ×… <tibble [1 ×… <tibble [50 × …
#> 3 <split [199/… Fold3 <tibble [2 × … <tibble [0 ×… <tibble [1 ×… <tibble [50 × …
#> 4 <split [200/… Fold4 <tibble [2 × … <tibble [0 ×… <tibble [1 ×… <tibble [49 × …
#> 5 <split [200/… Fold5 <tibble [2 × … <tibble [0 ×… <tibble [1 ×… <tibble [49 × …
glm_rs$.extracts[[1]]
#> # A tibble: 1 x 2
#> .extracts .config
#> <list> <chr>
#> 1 <glm> Preprocessor1_Model1
You can use tidyr and broom functions to get the coefficients out, if that is what you are looking for.
glm_rs %>%
dplyr::select(id, .extracts) %>%
unnest(cols = .extracts) %>%
mutate(tidied = map(.extracts, tidy)) %>%
unnest(tidied)
#> # A tibble: 30 x 8
#> id .extracts .config term estimate std.error statistic p.value
#> <chr> <list> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Fold1 <glm> Preprocesso… (Interce… -7.44e+1 12.6 -5.89 3.75e-9
#> 2 Fold1 <glm> Preprocesso… speciesC… -6.59e+0 1.82 -3.61 3.03e-4
#> 3 Fold1 <glm> Preprocesso… speciesG… -7.49e+0 2.54 -2.95 3.18e-3
#> 4 Fold1 <glm> Preprocesso… bill_len… 5.56e-1 0.151 3.67 2.40e-4
#> 5 Fold1 <glm> Preprocesso… bill_dep… 1.72e+0 0.424 4.06 4.83e-5
#> 6 Fold1 <glm> Preprocesso… body_mas… 5.88e-3 0.00130 4.51 6.44e-6
#> 7 Fold2 <glm> Preprocesso… (Interce… -6.87e+1 11.3 -6.06 1.37e-9
#> 8 Fold2 <glm> Preprocesso… speciesC… -5.59e+0 1.75 -3.20 1.39e-3
#> 9 Fold2 <glm> Preprocesso… speciesG… -7.61e+0 2.80 -2.71 6.65e-3
#> 10 Fold2 <glm> Preprocesso… bill_len… 4.88e-1 0.145 3.36 7.88e-4
#> # … with 20 more rows
Created on 2021-06-27 by the reprex package (v2.0.0)
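If the resampling metrics are not needed and only the per-fold coefficients matter, a different route (not the fit_resamples() approach above) is to fit the model by hand on the analysis set of each fold. This is a minimal sketch under that assumption; it uses mtcars with vs ~ mpg purely as a stand-in so the example is self-contained, and plain glm() rather than a parsnip specification.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(rsample)
library(broom)

set.seed(2021)
folds <- vfold_cv(mtcars, v = 5)

fold_coefs <- folds %>%
  mutate(
    # fit a logistic regression on the analysis (training) part of each fold
    model  = map(splits, ~ glm(vs ~ mpg, data = analysis(.x), family = binomial)),
    # one row per coefficient for each fitted model
    tidied = map(model, tidy)
  ) %>%
  select(id, tidied) %>%
  unnest(tidied)

fold_coefs
```

This gives one row per fold and term, in the same shape as the unnested .extracts output above, but without the metrics and predictions that fit_resamples() collects.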
I calculated the mean of bootstrap samples for mpg variable from mtcars dataset. My code looks like this (Please, let me know if there's a "better practice" to do it.):
mean_mpg <- function(x) {
rsample::analysis(x) %>%
pull(mpg) %>%
mean()
}
mtcars2 <- rsample::bootstraps(mtcars) %>%
mutate(mean_mpg = purrr::map(splits, mean_mpg)) %>%
tidyr::unnest(mean_mpg) %>%
select(-splits)
However, now I would like to do the same on a grouped dataset. For example:
mtcars %>%
group_by(am)
# now calculate bootstrap means of `mpg` for each `am` group
What's the best way to do it?
I think I would nest() to do this, rather than group_by().
Here is a slightly modified version of how to find the mean mpg for each bootstrap resample of the dataset overall.
library(rsample)
library(tidyverse)
bootstraps(mtcars) %>%
mutate(mpg = map(splits, ~ analysis(.) %>% pull(mpg)),
mean_mpg = map_dbl(mpg, mean))
#> # Bootstrap sampling
#> # A tibble: 25 x 4
#> splits id mpg mean_mpg
#> * <list> <chr> <list> <dbl>
#> 1 <split [32/10]> Bootstrap01 <dbl [32]> 18.8
#> 2 <split [32/13]> Bootstrap02 <dbl [32]> 20.4
#> 3 <split [32/9]> Bootstrap03 <dbl [32]> 21.1
#> 4 <split [32/12]> Bootstrap04 <dbl [32]> 19.4
#> 5 <split [32/10]> Bootstrap05 <dbl [32]> 19.8
#> 6 <split [32/11]> Bootstrap06 <dbl [32]> 20.1
#> 7 <split [32/13]> Bootstrap07 <dbl [32]> 19.1
#> 8 <split [32/11]> Bootstrap08 <dbl [32]> 18.7
#> 9 <split [32/13]> Bootstrap09 <dbl [32]> 19.3
#> 10 <split [32/13]> Bootstrap10 <dbl [32]> 20.9
#> # … with 15 more rows
And here is how I would go about creating bootstrap resamples for each value of am, and then finding the mean value of mpg for those resamples.
mtcars %>%
nest(data = -am) %>%
mutate(nested_boot = map(data, bootstraps)) %>%
select(-data) %>%
unnest(nested_boot) %>%
mutate(mpg = map(splits, ~ analysis(.) %>% pull(mpg)),
mean_mpg = map_dbl(mpg, mean))
#> # A tibble: 50 x 5
#> am splits id mpg mean_mpg
#> <dbl> <list> <chr> <list> <dbl>
#> 1 1 <split [13/4]> Bootstrap01 <dbl [13]> 21.9
#> 2 1 <split [13/4]> Bootstrap02 <dbl [13]> 24.0
#> 3 1 <split [13/5]> Bootstrap03 <dbl [13]> 24.8
#> 4 1 <split [13/5]> Bootstrap04 <dbl [13]> 25.9
#> 5 1 <split [13/3]> Bootstrap05 <dbl [13]> 24.0
#> 6 1 <split [13/5]> Bootstrap06 <dbl [13]> 22.1
#> 7 1 <split [13/4]> Bootstrap07 <dbl [13]> 24.3
#> 8 1 <split [13/4]> Bootstrap08 <dbl [13]> 25.0
#> 9 1 <split [13/5]> Bootstrap09 <dbl [13]> 22.7
#> 10 1 <split [13/6]> Bootstrap10 <dbl [13]> 23.3
#> # … with 40 more rows
Created on 2020-05-26 by the reprex package (v0.3.0)
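Once the per-group bootstrap means exist, they can be collapsed into one summary row per am group. The sketch below repeats the nested approach with times = 100 and adds a percentile interval; the number of resamples and the 2.5%/97.5% cut-offs are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(rsample)

set.seed(123)

# Bootstrap within each am group, then take the mean mpg of every resample
boot_means <- mtcars %>%
  nest(data = -am) %>%
  mutate(boots = map(data, bootstraps, times = 100)) %>%
  select(-data) %>%
  unnest(boots) %>%
  mutate(mean_mpg = map_dbl(splits, ~ mean(analysis(.x)$mpg)))

# Summarise the bootstrap distribution per group (illustrative percentile interval)
boot_summary <- boot_means %>%
  group_by(am) %>%
  summarise(
    boot_mean = mean(mean_mpg),
    lower     = quantile(mean_mpg, 0.025, names = FALSE),
    upper     = quantile(mean_mpg, 0.975, names = FALSE)
  )

boot_summary
```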
I am estimating rolling regressions on grouped data.
First, I group_by() and nest() my data by group.
Second, I use map() to estimate rolling regressions with a custom function my_beta(), which returns a list column.
The last step is where I stumble.
I want to extract the groups, dates, and coefficients so that I can merge the coefficients back to my original tibble.
However, my current solution requires three unnest() operations and a bind_cols().
The multiple unnest()s seem inefficient and the bind_cols() seems error prone.
Is there a syntactically and computationally more efficient way to estimate these rolling regressions? My actual data will have 10,000ish groups and 200,000ish observations.
library(tidyverse)
library(tsibble)
#>
#> Attaching package: 'tsibble'
#> The following object is masked from 'package:dplyr':
#>
#> id
set.seed(2001)
df <-
tibble(
date = 1:20,
y = runif(20),
x = runif(20),
z = runif(20),
group = rep(1:2, each = 10)
)
my_beta <- function(...) {
tail(coef(lm(y ~ x + z, data = list(...))), n = -1)
}
current_output <- df %>%
as_tsibble(key = group, index = date) %>%
group_by_key() %>%
nest() %>%
mutate(coefs = purrr::map(data, ~ pslide(., my_beta, .size = 5))) %>%
unnest(coefs) %>%
unnest_wider(coefs, names_sep = '_') %>%
ungroup()
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
current_output
#> # A tibble: 20 x 5
#> group data coefs_...1 coefs_x coefs_z
#> <int> <list> <lgl> <dbl> <dbl>
#> 1 1 <tsibble [10 × 4]> NA NA NA
#> 2 1 <tsibble [10 × 4]> NA NA NA
#> 3 1 <tsibble [10 × 4]> NA NA NA
#> 4 1 <tsibble [10 × 4]> NA NA NA
#> 5 1 <tsibble [10 × 4]> NA 1.46 2.08
#> 6 1 <tsibble [10 × 4]> NA 0.141 -0.396
#> 7 1 <tsibble [10 × 4]> NA 0.754 1.10
#> 8 1 <tsibble [10 × 4]> NA 0.651 0.889
#> 9 1 <tsibble [10 × 4]> NA 0.743 0.954
#> 10 1 <tsibble [10 × 4]> NA 0.308 0.795
#> 11 2 <tsibble [10 × 4]> NA NA NA
#> 12 2 <tsibble [10 × 4]> NA NA NA
#> 13 2 <tsibble [10 × 4]> NA NA NA
#> 14 2 <tsibble [10 × 4]> NA NA NA
#> 15 2 <tsibble [10 × 4]> NA -0.0433 -0.252
#> 16 2 <tsibble [10 × 4]> NA 0.696 0.334
#> 17 2 <tsibble [10 × 4]> NA 0.594 -0.0698
#> 18 2 <tsibble [10 × 4]> NA 0.881 0.0474
#> 19 2 <tsibble [10 × 4]> NA 3.23 -1.32
#> 20 2 <tsibble [10 × 4]> NA -0.942 1.85
desired_output <- df %>%
bind_cols(current_output %>% select(coefs_x, coefs_z))
desired_output
#> # A tibble: 20 x 7
#> date y x z group coefs_x coefs_z
#> <int> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 1 0.759 0.368 0.644 1 NA NA
#> 2 2 0.608 0.992 0.0542 1 NA NA
#> 3 3 0.218 0.815 0.252 1 NA NA
#> 4 4 0.229 0.982 0.0606 1 NA NA
#> 5 5 0.153 0.275 0.488 1 1.46 2.08
#> 6 6 0.374 0.856 0.268 1 0.141 -0.396
#> 7 7 0.619 0.737 0.599 1 0.754 1.10
#> 8 8 0.259 0.641 0.189 1 0.651 0.889
#> 9 9 0.637 0.598 0.543 1 0.743 0.954
#> 10 10 0.325 0.990 0.0265 1 0.308 0.795
#> 11 11 0.816 0.519 0.351 2 NA NA
#> 12 12 0.717 0.766 0.333 2 NA NA
#> 13 13 0.781 0.365 0.380 2 NA NA
#> 14 14 0.838 0.924 0.0778 2 NA NA
#> 15 15 0.736 0.453 0.258 2 -0.0433 -0.252
#> 16 16 0.173 0.291 0.328 2 0.696 0.334
#> 17 17 0.677 0.714 0.884 2 0.594 -0.0698
#> 18 18 0.833 0.718 0.902 2 0.881 0.0474
#> 19 19 0.134 0.351 0.422 2 3.23 -1.32
#> 20 20 0.675 0.963 0.981 2 -0.942 1.85
Created on 2020-02-25 by the reprex package (v0.3.0)
We could simplify the code a bit with
res %>%
unnest(cols = c(data, coefs)) %>%
unnest_wider(col = coefs, names_sep = '_') %>%
select(-coefs_...1)
Where res is
res <-
df %>%
as_tsibble(key = group, index = date) %>%
group_by_key() %>%
nest() %>%
mutate(coefs = purrr::map(data, ~ pslide(., my_beta, .size = 5)))
The code that does the estimation is left untouched. This only addresses the data wrangling part: the multiple unnest()s and the bind_cols().
I haven't done a performance benchmark.
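For comparison, here is a sketch that swaps in the slider package for the sliding step instead of tsibble's pslide(), and keeps the original columns next to the coefficients so no bind_cols() is needed. The helper roll_beta() is hypothetical, the window of 5 rows mirrors .size = 5 above, and this has not been benchmarked either.

```r
library(dplyr)
library(tidyr)
library(slider)

set.seed(2001)
df <- tibble(
  date = 1:20,
  y = runif(20),
  x = runif(20),
  z = runif(20),
  group = rep(1:2, each = 10)
)

# Hypothetical helper: slope coefficients for one window, as a one-row tibble
roll_beta <- function(window) {
  b <- coef(lm(y ~ x + z, data = window))
  tibble(coefs_x = b[["x"]], coefs_z = b[["z"]])
}

rolled <- df %>%
  group_by(group) %>%
  # slide() iterates over data frame rows; incomplete windows yield NULL
  group_modify(~ mutate(.x, coefs = slide(.x, roll_beta, .before = 4, .complete = TRUE))) %>%
  ungroup() %>%
  # keep_empty = TRUE turns the NULL windows into NA rows
  unnest(coefs, keep_empty = TRUE)

rolled
```

Each group is processed independently by group_modify(), so windows never span two groups, and the first four rows of every group come back as NA, as in the desired output.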