I often use a pattern as seen below, where I store data in a tibble using list-columns, apply functions to the data using purrr::map, and then use pivot_longer to convert to tidy format (long).
Is there a cleaner / more idiomatic way to do this in one step, without having to pivot the data each time?
library(tidyverse)
df <- tibble(n = 5:10)
df$data <- map(df$n, ~rnorm(.x))
df$mean <- map_dbl(df$data, ~mean(.x))
df$median <- map_dbl(df$data, ~median(.x))
# A tibble: 6 x 4
n data mean median
<int> <list> <dbl> <dbl>
1 5 <dbl [5]> -0.0239 -0.324
2 6 <dbl [6]> -0.396 0.0153
3 7 <dbl [7]> 0.506 0.711
4 8 <dbl [8]> 0.463 0.537
5 9 <dbl [9]> -0.248 -0.555
6 10 <dbl [10]> -0.153 -0.293
df <- pivot_longer(df, mean:median)
# A tibble: 12 x 4
n data name value
<int> <list> <chr> <dbl>
1 5 <dbl [5]> mean -0.386
2 5 <dbl [5]> median -0.407
3 6 <dbl [6]> mean -0.190
4 6 <dbl [6]> median -0.451
5 7 <dbl [7]> mean -0.456
6 7 <dbl [7]> median -0.0801
7 8 <dbl [8]> mean -0.0408
8 8 <dbl [8]> median 0.0577
9 9 <dbl [9]> mean 0.273
10 9 <dbl [9]> median 0.410
11 10 <dbl [10]> mean -0.720
12 10 <dbl [10]> median -1.01
I think you already have a good approach; I would have done the same, chaining all the functions in one pipe (%>%).
If you want to avoid the pivot_longer step, you can group by each row and create two new rows for each one in summarise(). This requires dplyr 1.0.0 or higher, where summarise() can return multiple rows per group.
library(tidyverse)
df %>%
  mutate(data = map(n, rnorm),
         group = row_number()) %>%
  group_by(group) %>%
  summarise(n = n,
            data = data,
            value = {tmp <- unlist(data); c(median(tmp), mean(tmp))},
            name = c('median', 'mean')) %>%
  ungroup() %>%
  select(-group)
# n data value name
# <int> <list> <dbl> <chr>
# 1 5 <dbl [5]> 0.571 median
# 2 5 <dbl [5]> 0.343 mean
# 3 6 <dbl [6]> 0.220 median
# 4 6 <dbl [6]> 0.0419 mean
# 5 7 <dbl [7]> -0.193 median
# 6 7 <dbl [7]> -0.132 mean
# 7 8 <dbl [8]> -0.171 median
# 8 8 <dbl [8]> 0.00583 mean
# 9 9 <dbl [9]> 0.952 median
#10 9 <dbl [9]> 0.471 mean
#11 10 <dbl [10]> 0.684 median
#12 10 <dbl [10]> 0.250 mean
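If your dplyr is recent enough (1.1.0 or later), reframe() was introduced for exactly this "multiple rows per group" pattern. A minimal sketch, rebuilding df from scratch, assuming dplyr >= 1.1.0 for reframe() and its .by argument:

```r
library(dplyr)
library(purrr)

set.seed(1)
df <- tibble(n = 5:10)

# reframe() may return several rows per group, so both summary
# statistics come out in long format in one step -- no pivot needed.
result <- df %>%
  mutate(data = map(n, rnorm)) %>%
  reframe(data = data,
          name = c("mean", "median"),
          value = c(mean(data[[1]]), median(data[[1]])),
          .by = n)
```

Each distinct n forms its own group, so `data[[1]]` is that row's numeric vector, and the length-1 columns are recycled to match the two summary rows.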
Related
Likely a trivial task for the pros out there, but I have not been able to figure out how to insert the text found in the "Slug" column into each of the three nested tables associated with that slug.
![data](https://i.stack.imgur.com/YClrE.png)
I am just looking to get the Slug value inserted into the nested tables and repeated for each row so I can combine and keep track of associations properly.
Any tips are most welcome! Thank you
Solution
You can use rowwise() with mutate(across()):
df %>%
  rowwise() %>%
  mutate(across(floor_price_array:holder_hist, ~ list(mutate(.x, slug = slug))))
Explanation
If your original data, say df, looks like this:
id slug floor_price_array num_listed_hist holder_hist
<chr> <chr> <list> <list> <list>
1 a hyznu <tibble [10 x 3]> <tibble [10 x 3]> <tibble [10 x 3]>
2 b awxeb <tibble [10 x 3]> <tibble [10 x 3]> <tibble [10 x 3]>
3 c pbncj <tibble [10 x 3]> <tibble [10 x 3]> <tibble [10 x 3]>
then the above code will add the value in the slug column as a new constant column in each of the nested tibbles, resulting in this (notice that each now has four columns):
id slug floor_price_array num_listed_hist holder_hist
<chr> <chr> <list> <list> <list>
1 a hyznu <tibble [10 x 4]> <tibble [10 x 4]> <tibble [10 x 4]>
2 b awxeb <tibble [10 x 4]> <tibble [10 x 4]> <tibble [10 x 4]>
3 c pbncj <tibble [10 x 4]> <tibble [10 x 4]> <tibble [10 x 4]>
For example, floor_price_array now contains this:
[[1]]
# A tibble: 10 x 4
x y z slug
<dbl> <dbl> <dbl> <chr>
1 1.44 2.02 -0.272 hyznu
2 -0.598 -0.723 -0.528 hyznu
3 0.490 -0.576 -1.62 hyznu
4 -0.145 0.349 0.341 hyznu
5 -0.362 0.503 0.584 hyznu
6 -0.798 -0.839 -0.352 hyznu
7 -0.503 -1.27 -1.18 hyznu
8 -0.916 -0.654 0.335 hyznu
9 0.578 0.137 -0.590 hyznu
10 -0.194 -0.674 1.73 hyznu
[[2]]
# A tibble: 10 x 4
x y z slug
<dbl> <dbl> <dbl> <chr>
1 0.876 0.665 -0.723 awxeb
2 -0.0442 -0.00906 0.0829 awxeb
3 -2.15 1.33 0.0692 awxeb
4 0.264 0.237 -0.497 awxeb
5 0.0381 0.0502 -1.58 awxeb
6 -0.802 0.783 -1.34 awxeb
7 -0.940 1.50 -0.542 awxeb
8 0.209 -1.06 0.853 awxeb
9 0.569 -1.15 -0.347 awxeb
10 -1.57 -0.0774 0.0250 awxeb
[[3]]
# A tibble: 10 x 4
x y z slug
<dbl> <dbl> <dbl> <chr>
1 -0.0289 -1.63 1.29 pbncj
2 -0.716 0.647 0.0230 pbncj
3 -0.0797 -0.0227 2.12 pbncj
4 -0.358 -1.43 -1.81 pbncj
5 -1.35 -0.402 -0.463 pbncj
6 -0.00494 -0.136 1.50 pbncj
7 1.09 0.124 -0.974 pbncj
8 -1.18 1.78 -0.836 pbncj
9 0.896 -1.38 0.199 pbncj
10 0.293 0.420 0.562 pbncj
Input data:
df <- tibble(id = letters[1:3]) %>%
  rowwise() %>%
  mutate(slug = paste0(sample(letters, 5), collapse = "")) %>%
  mutate(floor_price_array = list(tibble(x = rnorm(10), y = rnorm(10), z = rnorm(10))),
         num_listed_hist = list(tibble(x = rnorm(10), y = rnorm(10), z = rnorm(10))),
         holder_hist = list(tibble(x = rnorm(10), y = rnorm(10), z = rnorm(10)))) %>%
  ungroup()
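If you prefer to avoid rowwise(), the same effect can be sketched with purrr::map2(), iterating over each list-column together with the slug vector (shown here on a tiny stand-in for df with a single nested column):

```r
library(dplyr)
library(purrr)

# Tiny stand-in for df above: two rows, one nested list-column.
set.seed(1)
df <- tibble(slug = c("hyznu", "awxeb"),
             floor_price_array = list(tibble(x = rnorm(3)),
                                      tibble(x = rnorm(3))))

# map2() pairs each nested tibble (.x) with its slug (.y),
# so no per-row grouping is needed.
out <- df %>%
  mutate(floor_price_array = map2(floor_price_array, slug,
                                  ~ mutate(.x, slug = .y)))
```

For several nested columns, wrap the map2() call in across() exactly as in the rowwise() solution.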
I have the following dataset (dput here):
# A tibble: 3,713 x 17
ID Age Group RHR HRV Sleep.Onset Wake.Onset Hours.in.Bed Hours.of.Sleep Sleep.Disturbances Latency.min Cycles REM.Sleep.hours Deep.Sleep.hours
<int> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int> <dbl> <dbl>
1 5027 Young Increase 58 73 0.180 0.458 6.66 5.33 9 8.98 6 1.4 0.32
2 5027 Young Increase 83 27 0.162 0.542 9.1 6.84 15 3.48 9 1.19 1.54
3 5027 Young Increase 57 85 0.113 0.318 4.92 4.43 5 1.98 4 1.32 0.44
4 5027 Young Increase 60 70 0.0975 0.319 5.32 3.75 3 26.5 4 1.02 0.14
5 5027 Young Increase 63 72 0.105 0.329 5.38 4.74 5 2.48 5 1.32 0.07
6 5027 Young Increase 62 61 0.983 0.472 11.8 9.44 9 4.48 8 2.07 0.84
7 5027 Young Increase 66 68 0.142 0.426 6.83 5.48 15 2.98 6 1.48 0.35
8 5027 Young Increase 81 28 0.0908 0.177 2.06 1.93 2 2.48 1 0.22 0.22
9 5027 Young Increase 69 57 0.158 0.443 6.85 6.58 13 0.48 6 2.43 0
10 5027 Young Increase 63 60 0.0859 0.318 5.58 5.47 4 0.48 5 1.34 0.13
# ... with 3,703 more rows, and 3 more variables: Light.Sleep.hours <dbl>, Awake.hours <dbl>, Session <chr>
I am trying to calculate a t-test for every variable, grouped by Age and Group, between the Session levels (pre or post).
df %>%
  select(-ID) %>%
  group_by(Age, Group) %>%
  summarize_at(
    vars(-group_cols(), -Session),
    list(p.value = ~ t.test(. ~ Session)$p.value))
I am successful with p values:
# A tibble: 4 x 15
# Groups: Age [2]
Age Group RHR_p.value HRV_p.value Sleep.Onset_p.value Wake.Onset_p.value Hours.in.Bed_p.value Hours.of.Sleep_p~ Sleep.Disturban~ Latency.min_p.v~
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Old Decrease 0.0594 0.865 0.495 0.885 0.316 0.307 0.148 0.00237
2 Old Increase 0.00920 0.634 0.0979 0.0514 0.00774 0.00762 0.247 0.933
3 Young Decrease 0.0975 0.259 0.779 0.760 0.959 0.975 0.256 0.181
4 Young Increase 0.115 0.604 0.846 0.164 0.140 0.242 0.692 0.412
# ... with 5 more variables: Cycles_p.value <dbl>, REM.Sleep.hours_p.value <dbl>, Deep.Sleep.hours_p.value <dbl>, Light.Sleep.hours_p.value <dbl>,
# Awake.hours_p.value <dbl>
However, I am struggling to calculate the other t-test statistics (mean, sd, t, df, 95% CI) between these pre-post groups, and also to correct the p-values. Any help is appreciated.
I think I may need to convert the data to long format and use something like this?
df %>%
  group_by(Age, Group) %>%
  t_test(mean ~ ., by = "Session") %>%
  adjust_pvalue(method = "bonferroni") %>%
  add_significance()
Data frames can only have certain object classes as column types; an htest is not one of those.
However, we can store lists as list-columns.
If we adapt the current code to output lists of htest objects as results, we can later extract elements of the tests separately.
library(dplyr)
library(purrr)

output <- df %>%
  select(-ID) %>%
  group_by(Age, Group) %>%
  summarize_at(
    vars(-group_cols(), -Session),
    list(t.test = ~ list(t.test(. ~ Session))))
output
# A tibble: 4 × 15
# Groups: Age [2]
Age Group RHR_t.test HRV_t.test Sleep.Onset_t.test Wake.Onset_t.test Hours.in.Bed_t.test Hours.of.Sleep_t.test Sleep.Disturbance… Latency.min_t.t… Cycles_t.test REM.Sleep.hours…
<chr> <chr> <list> <list> <list> <list> <list> <list> <list> <list> <list> <list>
1 Old Decrease <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
2 Old Increase <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
3 Young Decrease <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
4 Young Increase <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
With this output data.frame, we can extract individual tests and values from them as desired:
output$RHR_t.test
[[1]]
Welch Two Sample t-test
data: . by Session
t = -1.8965, df = 188.22, p-value = 0.05942
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-3.09118590 0.06082897
sample estimates:
mean in group Post mean in group Pre
62.28902 63.80420
[[2]]
Welch Two Sample t-test
data: . by Session
t = -2.6271, df = 226.21, p-value = 0.009199
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-3.3949577 -0.4848655
sample estimates:
mean in group Post mean in group Pre
57.95946 59.89937
[[3]]
Welch Two Sample t-test
data: . by Session
t = 1.6633, df = 251.75, p-value = 0.0975
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-0.2074028 2.4611194
sample estimates:
mean in group Post mean in group Pre
60.58255 59.45570
[[4]]
Welch Two Sample t-test
data: . by Session
t = 1.5849, df = 208.4, p-value = 0.1145
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-0.244287 2.247775
sample estimates:
mean in group Post mean in group Pre
60.23462 59.23288
output$RHR_t.test %>%
  map_dbl('p.value')
[1] 0.059424354 0.009199459 0.097497620 0.114502332
We can also convert these lists to user-friendly tibbles with broom::tidy:
tidy_output <- output %>%
  mutate(across(ends_with('t.test'), map, broom::tidy))
tidy_output
# A tibble: 4 × 15
# Groups: Age [2]
Age Group RHR_t.test HRV_t.test Sleep.Onset_t.te… Wake.Onset_t.test Hours.in.Bed_t.t… Hours.of.Sleep_… Sleep.Disturbanc… Latency.min_t.t… Cycles_t.test REM.Sleep.hours…
<chr> <chr> <list> <list> <list> <list> <list> <list> <list> <list> <list> <list>
1 Old Decrease <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
2 Old Increase <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
3 Young Decrease <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
4 Young Increase <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
# … with 3 more variables: Deep.Sleep.hours_t.test <list>, Light.Sleep.hours_t.test <list>, Awake.hours_t.test <list>
To get all the tests' statistics, we can do it like this:
tidy_output %>%
  mutate(across(ends_with('t.test'), sapply, pull, 'statistic'))
# A tibble: 4 × 15
# Groups: Age [2]
Age Group RHR_t.test HRV_t.test Sleep.Onset_t.test Wake.Onset_t.test Hours.in.Bed_t.test Hours.of.Sleep_t.test Sleep.Disturbance… Latency.min_t.t… Cycles_t.test REM.Sleep.hours…
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Old Decrease -1.90 0.171 0.684 -0.145 -1.01 -1.02 -1.45 3.05 -0.928 -0.906
2 Old Increase -2.63 0.477 -1.66 -1.96 -2.69 -2.69 -1.16 0.0848 -1.76 -1.87
3 Young Decrease 1.66 1.13 0.281 -0.305 0.0509 -0.0320 1.14 -1.34 -0.675 0.672
4 Young Increase 1.58 0.519 0.195 -1.40 -1.48 -1.17 0.397 -0.821 -1.73 0.886
# … with 3 more variables: Deep.Sleep.hours_t.test <dbl>, Light.Sleep.hours_t.test <dbl>, Awake.hours_t.test <dbl>
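Since the asker also wants corrected p-values, one option is to stack all the tidied tests into a single long tibble and run p.adjust() over it. A sketch of that last step, shown on a small synthetic stand-in for the sleep data (two numeric variables, a two-level Session, and the Age/Group grouping columns), since the real dataset is only linked above:

```r
library(dplyr)
library(tidyr)
library(purrr)

# Synthetic stand-in for the sleep data.
set.seed(1)
dat <- tibble(Age = rep(c("Old", "Young"), each = 40),
              Group = rep(c("Decrease", "Increase"), times = 40),
              Session = rep(c("Pre", "Pre", "Post", "Post"), 20),
              RHR = rnorm(80),
              HRV = rnorm(80))

# Tidy each test, stack all variables into long format,
# then Bonferroni-correct across the whole family of tests.
tidy_long <- dat %>%
  group_by(Age, Group) %>%
  summarize_at(vars(-group_cols(), -Session),
               list(t.test = ~ list(broom::tidy(t.test(. ~ Session))))) %>%
  ungroup() %>%
  pivot_longer(ends_with("t.test"), names_to = "variable") %>%
  unnest(value) %>%
  mutate(p.adj = p.adjust(p.value, method = "bonferroni"))
```

Each row of tidy_long is one (Age, Group, variable) test, with estimate, statistic, parameter (df), conf.low/conf.high, and the adjusted p-value side by side.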
I have a dataframe that I need to get into the right configuration for an external program (PRISM Graphpad). In this toy example I have a matrix of 5x3 conditions with duplicate measurements for each condition. Originally for plotting and analysis in R the data is in the proper long tidy format.
While I am comfortable pivoting the data, I run into problems because the values are not unique when pivoting the data frame wide. I need the replicates to be in adjacent columns with the same name for PRISM to properly recognize things. However, when I pivot wide, the duplicate values get shoved into a list because they do not have a unique identifier in the id_cols.
In the real-life example the matrix of conditions is of course much larger, there are more repeats (but an identical number for each condition), and on top of that every df is an entry in a list-column, so I will likely need to apply the solution using a purrr::map function.
library(tidyverse)
df <- data.frame(ID = rep(LETTERS[1:5], 2),
                 cond1 = runif(10, 0, 1),
                 cond2 = runif(10, 1, 10),
                 cond3 = runif(10, 10, 100))
#// original long dataframe
long_df <- pivot_longer(data = df, cols = c("cond1", "cond2", "cond3"))
long_df
#> # A tibble: 30 x 3
#> ID name value
#> <chr> <chr> <dbl>
#> 1 A cond1 0.424
#> 2 A cond2 9.01
#> 3 A cond3 61.6
#> 4 B cond1 0.460
#> 5 B cond2 2.33
#> 6 B cond3 40.3
#> 7 C cond1 0.107
#> 8 C cond2 5.82
#> 9 C cond3 23.9
#> 10 D cond1 0.714
#> # ... with 20 more rows
#// desired output
desired_df <- cbind(df[c(1:5),], df[c(6:10),])
desired_df <- desired_df[,c(1,2,6,3,7,4,8)]
colnames(desired_df)[c(3,5,7)] <- c("cond1", "cond2", "cond3")
desired_df
#> ID cond1 cond1 cond2 cond2 cond3 cond3
#> 1 A 0.4244798 0.8078372 9.005544 5.349371 61.61488 73.80651
#> 2 B 0.4596927 0.3509671 2.325029 8.636263 40.33949 66.54288
#> 3 C 0.1069974 0.3903294 5.817079 7.100623 23.87013 99.98683
#> 4 D 0.7144698 0.1005499 9.886948 7.006333 19.40680 66.86696
#> 5 E 0.2903691 0.6177356 8.890734 9.863695 46.56568 66.42537
#// result from pivot_wider
wide_df <- pivot_wider(long_df, id_cols = ID, names_from = name, values_from = value)
#> Warning: Values are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = length` to identify where the duplicates arise
#> * Use `values_fn = {summary_fun}` to summarise duplicates
wide_df
#> # A tibble: 5 x 4
#> ID cond1 cond2 cond3
#> <chr> <list> <list> <list>
#> 1 A <dbl [2]> <dbl [2]> <dbl [2]>
#> 2 B <dbl [2]> <dbl [2]> <dbl [2]>
#> 3 C <dbl [2]> <dbl [2]> <dbl [2]>
#> 4 D <dbl [2]> <dbl [2]> <dbl [2]>
#> 5 E <dbl [2]> <dbl [2]> <dbl [2]>
Created on 2021-01-15 by the reprex package (v0.3.0)
It is not recommended to have duplicate column names. Therefore, we modify the 'name' column by appending a unique index created with data.table::rowid(), and use that to reshape with pivot_wider:
library(dplyr)
library(tidyr)
library(stringr)
library(data.table)
long_df %>%
  mutate(name = str_c(name, "_", rowid(ID, name))) %>%
  pivot_wider(names_from = name, values_from = value, names_sort = TRUE)
Output:
# A tibble: 5 x 7
# ID cond1_1 cond1_2 cond2_1 cond2_2 cond3_1 cond3_2
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A 0.293 0.920 6.44 9.14 18.5 71.9
#2 B 0.225 0.280 4.34 2.78 59.7 16.9
#3 C 0.704 0.764 7.05 1.40 75.3 64.0
#4 D 0.519 0.802 7.06 5.51 22.4 66.7
#5 E 0.663 0.255 3.88 2.25 30.1 14.2
If it needs to have repeating names, just strip off the _\\d+ suffix at the end of the names with str_remove.
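A minimal sketch of that stripping step, on a tiny stand-in for the widened data frame (assigning to names() directly, since dplyr's renaming verbs refuse duplicate names):

```r
library(stringr)

# Stand-in for the pivot_wider() result above.
wide <- data.frame(ID = "A", cond1_1 = 0.1, cond1_2 = 0.2)

# Drop the trailing "_<digits>" index; "ID" has no suffix and is untouched.
names(wide) <- str_remove(names(wide), "_\\d+$")
names(wide)
#> [1] "ID"    "cond1" "cond1"
```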
Try this:
library(tidyverse)
#Code
new <- df %>%
  pivot_longer(-1) %>%
  group_by(ID, name) %>%
  mutate(name = paste0(name, '.', row_number())) %>%
  pivot_wider(names_sort = TRUE, names_from = name, values_from = value)
Output:
# A tibble: 5 x 7
# Groups: ID [5]
ID cond1.1 cond1.2 cond2.1 cond2.2 cond3.1 cond3.2
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0.619 0.851 4.49 9.17 70.9 84.2
2 B 0.989 0.542 9.64 3.57 55.3 28.3
3 C 0.594 0.602 5.16 8.97 26.2 19.0
4 D 0.349 0.244 5.29 8.52 44.8 17.7
5 E 0.683 0.848 7.27 8.07 97.3 73.9
Then you can process like this:
#Further process
names(new) <- gsub("\\..*","",names(new))
Output:
# A tibble: 5 x 7
# Groups: ID [5]
ID cond1 cond1 cond2 cond2 cond3 cond3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0.129 0.360 4.60 2.74 55.3 58.3
2 B 0.416 0.384 5.93 9.17 15.7 21.8
3 C 0.724 0.622 9.30 7.81 76.9 79.0
4 D 0.101 0.951 6.35 1.58 30.3 68.5
5 E 0.238 0.814 9.46 9.50 12.4 57.8
And export to .txt for the other software.
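For the export itself, base write.table() writes the (duplicated) column names exactly as they are, which is what PRISM needs here. A small sketch, using a temp file as a hypothetical path:

```r
# Stand-in for the processed data frame with repeated names.
out <- data.frame(ID = c("A", "B"),
                  c1 = c(0.1, 0.2),
                  c2 = c(0.3, 0.4))
names(out)[2:3] <- c("cond1", "cond1")  # duplicated names survive in the file

f <- tempfile(fileext = ".txt")  # hypothetical path; substitute your own
write.table(out, f, sep = "\t", row.names = FALSE, quote = FALSE)
```

quote = FALSE keeps the header line as plain `ID<tab>cond1<tab>cond1` rather than quoting each name.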
I calculated the mean of bootstrap samples for mpg variable from mtcars dataset. My code looks like this (Please, let me know if there's a "better practice" to do it.):
mean_mpg <- function(x) {
  rsample::analysis(x) %>%
    pull(mpg) %>%
    mean()
}

mtcars2 <- rsample::bootstraps(mtcars) %>%
  mutate(mean_mpg = purrr::map(splits, mean_mpg)) %>%
  tidyr::unnest(mean_mpg) %>%
  select(-splits)
However, now I would like to do the same on a grouped dataset. For example:
mtcars %>%
  group_by(am)
# now calculate bootstrap means of `mpg` for each `am` group
What's the best way to do it?
I think I would use nest() to do this, rather than group_by().
Here is a slightly modified version of how to find the mean mpg for each bootstrap resample of the dataset overall.
library(rsample)
library(tidyverse)
bootstraps(mtcars) %>%
  mutate(mpg = map(splits, ~ analysis(.) %>% pull(mpg)),
         mean_mpg = map_dbl(mpg, mean))
#> # Bootstrap sampling
#> # A tibble: 25 x 4
#> splits id mpg mean_mpg
#> * <list> <chr> <list> <dbl>
#> 1 <split [32/10]> Bootstrap01 <dbl [32]> 18.8
#> 2 <split [32/13]> Bootstrap02 <dbl [32]> 20.4
#> 3 <split [32/9]> Bootstrap03 <dbl [32]> 21.1
#> 4 <split [32/12]> Bootstrap04 <dbl [32]> 19.4
#> 5 <split [32/10]> Bootstrap05 <dbl [32]> 19.8
#> 6 <split [32/11]> Bootstrap06 <dbl [32]> 20.1
#> 7 <split [32/13]> Bootstrap07 <dbl [32]> 19.1
#> 8 <split [32/11]> Bootstrap08 <dbl [32]> 18.7
#> 9 <split [32/13]> Bootstrap09 <dbl [32]> 19.3
#> 10 <split [32/13]> Bootstrap10 <dbl [32]> 20.9
#> # … with 15 more rows
And here is how I would go about creating bootstrap resamples for each value of am, and then finding the mean value of mpg for those resamples.
mtcars %>%
  nest(-am) %>%
  mutate(nested_boot = map(data, bootstraps)) %>%
  select(-data) %>%
  unnest(nested_boot) %>%
  mutate(mpg = map(splits, ~ analysis(.) %>% pull(mpg)),
         mean_mpg = map_dbl(mpg, mean))
#> # A tibble: 50 x 5
#> am splits id mpg mean_mpg
#> <dbl> <list> <chr> <list> <dbl>
#> 1 1 <split [13/4]> Bootstrap01 <dbl [13]> 21.9
#> 2 1 <split [13/4]> Bootstrap02 <dbl [13]> 24.0
#> 3 1 <split [13/5]> Bootstrap03 <dbl [13]> 24.8
#> 4 1 <split [13/5]> Bootstrap04 <dbl [13]> 25.9
#> 5 1 <split [13/3]> Bootstrap05 <dbl [13]> 24.0
#> 6 1 <split [13/5]> Bootstrap06 <dbl [13]> 22.1
#> 7 1 <split [13/4]> Bootstrap07 <dbl [13]> 24.3
#> 8 1 <split [13/4]> Bootstrap08 <dbl [13]> 25.0
#> 9 1 <split [13/5]> Bootstrap09 <dbl [13]> 22.7
#> 10 1 <split [13/6]> Bootstrap10 <dbl [13]> 23.3
#> # … with 40 more rows
Created on 2020-05-26 by the reprex package (v0.3.0)
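From there, a typical next step is summarising the per-group bootstrap distribution, for example a percentile confidence interval for each am group. A sketch under the same nest-then-bootstrap approach (using the default 25 resamples; pass bootstraps(times = ...) for more):

```r
library(rsample)
library(dplyr)
library(purrr)
library(tidyr)

set.seed(123)
boot_means <- mtcars %>%
  nest(data = -am) %>%
  mutate(nested_boot = map(data, bootstraps)) %>%
  select(-data) %>%
  unnest(nested_boot) %>%
  mutate(mean_mpg = map_dbl(splits, ~ mean(analysis(.x)$mpg)))

# Percentile CI of the bootstrap means, per am group.
ci <- boot_means %>%
  group_by(am) %>%
  summarise(estimate = mean(mean_mpg),
            conf.low = quantile(mean_mpg, 0.025),
            conf.high = quantile(mean_mpg, 0.975))
```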
I am estimating rolling regressions on grouped data.
First, I group_by() and nest() my data by group.
Second, I use map() to estimate rolling regressions with a custom function my_beta(), which returns a list column.
The last step is where I stumble.
I want to extract the groups, dates, and coefficients so that I can merge the coefficients back to my original tibble.
However, my current solution requires three unnest() operations and a bind_cols().
The multiple unnest()s seem inefficient and the bind_cols() seems error prone.
Is there a syntactically and computationally more efficient way to estimate these rolling regressions? My actual data will have 10,000ish groups and 200,000ish observations.
library(tidyverse)
library(tsibble)
#>
#> Attaching package: 'tsibble'
#> The following object is masked from 'package:dplyr':
#>
#> id
set.seed(2001)
df <-
  tibble(
    date = 1:20,
    y = runif(20),
    x = runif(20),
    z = runif(20),
    group = rep(1:2, each = 10)
  )

my_beta <- function(...) {
  tail(coef(lm(y ~ x + z, data = list(...))), n = -1)
}
current_output <- df %>%
  as_tsibble(key = group, index = date) %>%
  group_by_key() %>%
  nest() %>%
  mutate(coefs = purrr::map(data, ~ pslide(., my_beta, .size = 5))) %>%
  unnest(coefs) %>%
  unnest_wider(coefs, names_sep = '_') %>%
  ungroup()
#> New names:
#> * `` -> ...1
current_output
#> # A tibble: 20 x 5
#> group data coefs_...1 coefs_x coefs_z
#> <int> <list> <lgl> <dbl> <dbl>
#> 1 1 <tsibble [10 × 4]> NA NA NA
#> 2 1 <tsibble [10 × 4]> NA NA NA
#> 3 1 <tsibble [10 × 4]> NA NA NA
#> 4 1 <tsibble [10 × 4]> NA NA NA
#> 5 1 <tsibble [10 × 4]> NA 1.46 2.08
#> 6 1 <tsibble [10 × 4]> NA 0.141 -0.396
#> 7 1 <tsibble [10 × 4]> NA 0.754 1.10
#> 8 1 <tsibble [10 × 4]> NA 0.651 0.889
#> 9 1 <tsibble [10 × 4]> NA 0.743 0.954
#> 10 1 <tsibble [10 × 4]> NA 0.308 0.795
#> 11 2 <tsibble [10 × 4]> NA NA NA
#> 12 2 <tsibble [10 × 4]> NA NA NA
#> 13 2 <tsibble [10 × 4]> NA NA NA
#> 14 2 <tsibble [10 × 4]> NA NA NA
#> 15 2 <tsibble [10 × 4]> NA -0.0433 -0.252
#> 16 2 <tsibble [10 × 4]> NA 0.696 0.334
#> 17 2 <tsibble [10 × 4]> NA 0.594 -0.0698
#> 18 2 <tsibble [10 × 4]> NA 0.881 0.0474
#> 19 2 <tsibble [10 × 4]> NA 3.23 -1.32
#> 20 2 <tsibble [10 × 4]> NA -0.942 1.85
desired_output <- df %>%
  bind_cols(current_output %>% select(coefs_x, coefs_z))
desired_output
#> # A tibble: 20 x 7
#> date y x z group coefs_x coefs_z
#> <int> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 1 0.759 0.368 0.644 1 NA NA
#> 2 2 0.608 0.992 0.0542 1 NA NA
#> 3 3 0.218 0.815 0.252 1 NA NA
#> 4 4 0.229 0.982 0.0606 1 NA NA
#> 5 5 0.153 0.275 0.488 1 1.46 2.08
#> 6 6 0.374 0.856 0.268 1 0.141 -0.396
#> 7 7 0.619 0.737 0.599 1 0.754 1.10
#> 8 8 0.259 0.641 0.189 1 0.651 0.889
#> 9 9 0.637 0.598 0.543 1 0.743 0.954
#> 10 10 0.325 0.990 0.0265 1 0.308 0.795
#> 11 11 0.816 0.519 0.351 2 NA NA
#> 12 12 0.717 0.766 0.333 2 NA NA
#> 13 13 0.781 0.365 0.380 2 NA NA
#> 14 14 0.838 0.924 0.0778 2 NA NA
#> 15 15 0.736 0.453 0.258 2 -0.0433 -0.252
#> 16 16 0.173 0.291 0.328 2 0.696 0.334
#> 17 17 0.677 0.714 0.884 2 0.594 -0.0698
#> 18 18 0.833 0.718 0.902 2 0.881 0.0474
#> 19 19 0.134 0.351 0.422 2 3.23 -1.32
#> 20 20 0.675 0.963 0.981 2 -0.942 1.85
Created on 2020-02-25 by the reprex package (v0.3.0)
We could simplify the code a bit with
res %>%
  unnest(cols = c(data, coefs)) %>%
  unnest_wider(col = coefs, names_sep = '_') %>%
  select(-coefs_...1)
Where res is
res <- df %>%
  as_tsibble(key = group, index = date) %>%
  group_by_key() %>%
  nest() %>%
  mutate(coefs = purrr::map(data, ~ pslide(., my_beta, .size = 5)))
The code that does the estimation is left untouched; this only addresses the data-wrangling part (the multiple unnest()s and the bind_cols()). I haven't done a performance benchmark.
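A different way to sidestep the nest/unnest round trip entirely is a grouped mutate() with the slider package, which slides over rows of a data frame directly and pads incomplete windows with NA. A sketch, assuming slider is available and dplyr >= 1.1.0 for pick() (each coefficient is extracted in its own slide_dbl() call, so the model is fit twice per window; fine for a sketch, wasteful at scale):

```r
library(dplyr)
library(slider)

set.seed(2001)
df <- tibble(date = 1:20,
             y = runif(20), x = runif(20), z = runif(20),
             group = rep(1:2, each = 10))

# Fit the 5-row rolling regression and pull one named coefficient.
roll_coef <- function(d, term) coef(lm(y ~ x + z, data = d))[[term]]

out <- df %>%
  group_by(group) %>%
  mutate(coefs_x = slide_dbl(pick(y, x, z), roll_coef, term = "x",
                             .before = 4, .complete = TRUE),
         coefs_z = slide_dbl(pick(y, x, z), roll_coef, term = "z",
                             .before = 4, .complete = TRUE)) %>%
  ungroup()
```

Because the coefficients are computed in place, out already has one row per original observation, so no bind_cols() is needed.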