Using the mtcars data, I am testing map() to build some lm() models:
library(tidyverse)
mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(fit = map(.x = data, ~ lm(mpg ~ ., data = .x)))
#> # A tibble: 3 x 3
#> cyl data fit
#> <dbl> <list> <list>
#> 1 6 <tibble [7 x 10]> <S3: lm>
#> 2 4 <tibble [11 x 10]> <S3: lm>
#> 3 8 <tibble [14 x 10]> <S3: lm>
The output shows that I have a new column, fit.
Now I wish to see the summary of each lm. When I try:
library(tidyverse)
mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(fit = map(.x = data, ~ lm(mpg ~ ., data = .x))) %>%
  map(fit, summary)
#> Error in as_mapper(.f, ...): object 'fit' not found
It gives the error:
Error in as_mapper(.f, ...) : object 'fit' not found
If I wish to calculate R² or AIC, I can use the following code without any problem:
library(tidyverse)
library(modelr)
mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(fit = map(.x = data, ~ lm(mpg ~ ., data = .x))) %>%
  mutate(r2 = map_dbl(fit, ~ rsquare(., data = mtcars)),
         aic = map_dbl(fit, ~ AIC(.))) %>%
  arrange(aic)
#> # A tibble: 3 x 5
#> cyl data fit r2 aic
#> <dbl> <list> <list> <dbl> <dbl>
#> 1 6 <tibble [7 x 10]> <S3: lm> -8.96 -Inf
#> 2 4 <tibble [11 x 10]> <S3: lm> -26.4 56.4
#> 3 8 <tibble [14 x 10]> <S3: lm> -1.000 67.3
Created on 2019-06-18 by the reprex package (v0.3.0)
What am I missing?
As IceCreamToucan's comment points out, purrr::map() does not look inside the data frame you have built up in the pipe.
If you call it inside dplyr::mutate(), it has access to the fit column you created in the previous step.
Another option is to refer to fit explicitly, which you can see below as my second suggestion.
library(tidyverse)
mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(fit = map(.x = data, ~ lm(mpg ~ ., data = .x))) %>%
  mutate(fit_sum = map(fit, summary))
#> # A tibble: 3 x 4
#> cyl data fit fit_sum
#> <dbl> <list> <list> <list>
#> 1 6 <tibble [7 x 10]> <lm> <smmry.lm>
#> 2 4 <tibble [11 x 10]> <lm> <smmry.lm>
#> 3 8 <tibble [14 x 10]> <lm> <smmry.lm>
mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(fit = map(.x = data, ~ lm(mpg ~ ., data = .x))) %>%
  {map(.$fit, summary)} # or using pull: `pull(fit) %>% map(summary)`
#> [[1]]
#>
#> Call:
#> lm(formula = mpg ~ ., data = .x)
#>
#> Residuals:
#> ALL 7 residuals are 0: no residual degrees of freedom!
#>
#> Coefficients: (3 not defined because of singularities)
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 32.78649 NA NA NA
#> disp 0.07456 NA NA NA
#> hp -0.04252 NA NA NA
#> drat 1.52367 NA NA NA
#> wt 5.12418 NA NA NA
#> qsec -2.33333 NA NA NA
#> vs -1.75289 NA NA NA
#> am NA NA NA NA
#> gear NA NA NA NA
#> carb NA NA NA NA
#>
#> Residual standard error: NaN on 0 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: NaN
#> F-statistic: NaN on 6 and 0 DF, p-value: NA
####truncated the results for the sake of space####
Created on 2019-06-17 by the reprex package (v0.3.0)
As of the latest release of dplyr, the tidyverse seems to encourage the group_modify() family of functions instead of purrr + nested data frames.
In that workflow, here is how you can get both model summaries and estimates in the same data frame via the broom package:
# setup
set.seed(123)
library(tidyverse)
options(tibble.width = Inf)
# joining dataframes with regression estimates and model summaries
dplyr::full_join(
  # to get a tidy dataframe of regression estimates
  x = mtcars %>%
    group_by(cyl) %>%
    group_modify(.f = ~ broom::tidy(lm(mpg ~ ., data = .x), conf.int = TRUE)),
  # to get a tidy dataframe of model summaries
  y = mtcars %>%
    group_by(cyl) %>%
    group_modify(.f = ~ broom::glance(lm(mpg ~ ., data = .x))),
  by = "cyl"
) %>%
  dplyr::ungroup(x = .)
#> Warning in qt(a, object$df.residual): NaNs produced
#> # A tibble: 25 x 20
#> cyl term estimate std.error statistic.x p.value.x conf.low
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 (Intercept) 60.9 180. 0.338 0.793 -2229.
#> 2 4 disp -0.345 0.469 -0.735 0.596 -6.31
#> 3 4 hp -0.0332 0.364 -0.0915 0.942 -4.65
#> 4 4 drat -4.19 46.4 -0.0903 0.943 -594.
#> 5 4 wt 4.48 29.7 0.151 0.905 -373.
#> 6 4 qsec -0.106 7.82 -0.0136 0.991 -99.4
#> 7 4 vs -3.64 34.0 -0.107 0.932 -435.
#> 8 4 am -6.33 45.2 -0.140 0.912 -581.
#> 9 4 gear 4.07 29.1 0.140 0.912 -366.
#> 10 4 carb 3.22 28.2 0.114 0.928 -355.
#> conf.high r.squared adj.r.squared sigma statistic.y p.value.y df
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2351. 0.928 0.276 3.84 1.42 0.576 9
#> 2 5.62 0.928 0.276 3.84 1.42 0.576 9
#> 3 4.59 0.928 0.276 3.84 1.42 0.576 9
#> 4 586. 0.928 0.276 3.84 1.42 0.576 9
#> 5 382. 0.928 0.276 3.84 1.42 0.576 9
#> 6 99.2 0.928 0.276 3.84 1.42 0.576 9
#> 7 428. 0.928 0.276 3.84 1.42 0.576 9
#> 8 568. 0.928 0.276 3.84 1.42 0.576 9
#> 9 374. 0.928 0.276 3.84 1.42 0.576 9
#> 10 362. 0.928 0.276 3.84 1.42 0.576 9
#> logLik AIC BIC deviance df.residual nobs
#> <dbl> <dbl> <dbl> <dbl> <int> <int>
#> 1 -17.2 56.4 60.8 14.7 1 11
#> 2 -17.2 56.4 60.8 14.7 1 11
#> 3 -17.2 56.4 60.8 14.7 1 11
#> 4 -17.2 56.4 60.8 14.7 1 11
#> 5 -17.2 56.4 60.8 14.7 1 11
#> 6 -17.2 56.4 60.8 14.7 1 11
#> 7 -17.2 56.4 60.8 14.7 1 11
#> 8 -17.2 56.4 60.8 14.7 1 11
#> 9 -17.2 56.4 60.8 14.7 1 11
#> 10 -17.2 56.4 60.8 14.7 1 11
#> # ... with 15 more rows
Created on 2019-06-17 by the reprex package (v0.3.0)
Related
It might not be very clear from the title but what I wish to do is:
I have a dataframe df with, say, 200 columns; the first 80 columns are response variables (y1, y2, y3, ...) and the remaining 120 are predictors (x1, x2, x3, ...).
I wish to compute a linear model for each pair – lm(yi ~ xi, data = df).
Many of the problems and solutions I have looked through online have either a fixed response vs. many predictors or the other way around, using lapply() and its related functions.
Could anyone who is familiar with this point me in the right direction?
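For the literal pairwise matching (each y_i regressed on the x_i at the same position, assuming equal counts of matched responses and predictors), a minimal purrr sketch could look like the following, using mtcars columns as stand-ins for the real y/x columns; the answer below instead fits every response/predictor combination:
library(purrr)

df <- mtcars
y_names <- names(df)[1:3]   # stand-ins for the response columns
x_names <- names(df)[4:6]   # stand-ins for the position-matched predictor columns

fits <- map2(y_names, x_names,
             ~ lm(reformulate(.y, response = .x), data = df))
names(fits) <- paste(y_names, "~", x_names)
fits[["mpg ~ hp"]]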
Use the tidyverse:
library(tidyverse)
library(broom)
df <- mtcars
y <- names(df)[1:3]
x <- names(df)[4:7]
result <- expand_grid(x, y) %>%
  rowwise() %>%
  mutate(frm = list(reformulate(x, y)),
         model = list(lm(frm, data = df)))
result$model <- purrr::set_names(result$model, nm = paste0(result$y, " ~ ", result$x))
result$model[1:2]
#> $`mpg ~ hp`
#>
#> Call:
#> lm(formula = frm, data = df)
#>
#> Coefficients:
#> (Intercept) hp
#> 30.09886 -0.06823
#>
#>
#> $`cyl ~ hp`
#>
#> Call:
#> lm(formula = frm, data = df)
#>
#> Coefficients:
#> (Intercept) hp
#> 3.00680 0.02168
map_df(result$model, tidy)
#> # A tibble: 24 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 30.1 1.63 18.4 6.64e-18
#> 2 hp -0.0682 0.0101 -6.74 1.79e- 7
#> 3 (Intercept) 3.01 0.425 7.07 7.41e- 8
#> 4 hp 0.0217 0.00264 8.23 3.48e- 9
#> 5 (Intercept) 21.0 32.6 0.644 5.25e- 1
#> 6 hp 1.43 0.202 7.08 7.14e- 8
#> 7 (Intercept) -7.52 5.48 -1.37 1.80e- 1
#> 8 drat 7.68 1.51 5.10 1.78e- 5
#> 9 (Intercept) 14.6 1.58 9.22 2.93e-10
#> 10 drat -2.34 0.436 -5.37 8.24e- 6
#> # ... with 14 more rows
map_df(result$model, glance)
#> # A tibble: 12 x 12
#> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.602 0.589 3.86 45.5 1.79e- 7 1 -87.6 181. 186.
#> 2 0.693 0.683 1.01 67.7 3.48e- 9 1 -44.6 95.1 99.5
#> 3 0.626 0.613 77.1 50.1 7.14e- 8 1 -183. 373. 377.
#> 4 0.464 0.446 4.49 26.0 1.78e- 5 1 -92.4 191. 195.
#> 5 0.490 0.473 1.30 28.8 8.24e- 6 1 -52.7 111. 116.
#> 6 0.504 0.488 88.7 30.5 5.28e- 6 1 -188. 382. 386.
#> 7 0.753 0.745 3.05 91.4 1.29e-10 1 -80.0 166. 170.
#> 8 0.612 0.599 1.13 47.4 1.22e- 7 1 -48.3 103. 107.
#> 9 0.789 0.781 57.9 112. 1.22e-11 1 -174. 355. 359.
#> 10 0.175 0.148 5.56 6.38 1.71e- 2 1 -99.3 205. 209.
#> 11 0.350 0.328 1.46 16.1 3.66e- 4 1 -56.6 119. 124.
#> 12 0.188 0.161 114. 6.95 1.31e- 2 1 -196. 398. 402.
#> # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Created on 2020-12-11 by the reprex package (v0.3.0)
I am fairly new to this community and R, thank you for all the support.
I encountered the new nest_by() option of dplyr and it seems rather good. I have managed to split an existing dataframe with it, but not to run multiple models on the pieces. I would like to iterate through all the dataframes and get raw and summary data from statistical models (mainly GLMs).
library(tidyverse)
nested <- mtcars %>% nest_by(cyl, carb)
# A tibble: 9 x 3
# Rowwise: cyl, carb
cyl carb data
<dbl> <dbl> <list<tbl_df[,9]>>
1 4 1 [5 x 9]
2 4 2 [6 x 9]
3 6 1 [2 x 9]
4 6 4 [4 x 9]
5 6 6 [1 x 9]
6 8 2 [4 x 9]
7 8 3 [3 x 9]
8 8 4 [6 x 9]
9 8 8 [1 x 9]
# Now I would like to run each line separately in an lm model. This line should do it, but it doesn't:
fit <- nested %>%
  mutate(model = map(data, ~ lm(mpg ~ hp, data = .)))
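(Note that nest_by() returns a rowwise data frame, so dplyr's documented row-wise pattern wraps the single model in list() rather than mapping over the list-column; a minimal sketch of that form:)
# In a rowwise data frame, `data` refers to the current row's nested tibble,
# so build one model per row and wrap it in list()
fit <- nested %>%
  mutate(model = list(lm(mpg ~ hp, data = data)))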
Now, I am trying to make a printable version of all models for my statistics teacher.
nested <- mtcars %>% nest(data = -c(cyl, carb))

regressions <- nested %>%
  mutate(
    fit = map(data, ~ lm(mpg ~ hp, data = .x))
  )

printing <- regressions %>%
  rowwise() %>%
  mutate(printed = paste(carb, cyl, "This model summary is"), summary(fit), sep = '*')
However, this doesn't work at all.
Any thoughts?
EDIT: In your precise case try this:
library(broom)

nested <- mtcars %>% nest(data = -c(cyl, carb))

regressions <- nested %>%
  mutate(
    fit = map(data, ~ lm(mpg ~ hp, data = .x)),
    tidied = map(fit, tidy),
    glanced = map(fit, glance),
    augmented = map(fit, augment)
  )
regressions %>%
  unnest(glanced) # to get statistics of fits

regressions %>%
  unnest(tidied)  # to get coefficients of all fits
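If the goal is specifically a printable, per-model text block (the printed column from the question), one option is to capture the output of summary() as a string for each fit. This is only a sketch; the label and summary_text column names are illustrative:
printable <- regressions %>%
  mutate(label = paste("cyl =", cyl, "| carb =", carb),
         summary_text = map_chr(fit, ~ paste(capture.output(summary(.x)),
                                             collapse = "\n")))

cat(printable$label[1], "\n")
cat(printable$summary_text[1])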
You can use dplyr in combination with broom as in this vignette. There is an exact example with mtcars:
data(mtcars)
mtcars <- as_tibble(mtcars) # to play nicely with list-cols
mtcars
## # A tibble: 32 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # ... with 22 more rows
mtcars %>%
  nest(-am) %>%
  mutate(
    fit = map(data, ~ lm(wt ~ mpg + qsec + gear, data = .x)), # S3 list-col
    tidied = map(fit, tidy)
  ) %>%
  unnest(tidied)
## # A tibble: 8 x 8
## am data fit term estimate std.error statistic p.value
## <dbl> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 <tibble [13 x 10~ <lm> (Intercep~ 4.28 3.46 1.24 2.47e-1
## 2 1 <tibble [13 x 10~ <lm> mpg -0.101 0.0294 -3.43 7.50e-3
## 3 1 <tibble [13 x 10~ <lm> qsec 0.0398 0.151 0.264 7.98e-1
## 4 1 <tibble [13 x 10~ <lm> gear -0.0229 0.349 -0.0656 9.49e-1
## 5 0 <tibble [19 x 10~ <lm> (Intercep~ 4.92 1.40 3.52 3.09e-3
## 6 0 <tibble [19 x 10~ <lm> mpg -0.192 0.0443 -4.33 5.91e-4
## 7 0 <tibble [19 x 10~ <lm> qsec 0.0919 0.0983 0.935 3.65e-1
## 8 0 <tibble [19 x 10~ <lm> gear 0.147 0.368 0.398 6.96e-1
What if you want not just the tidy output, but the augment and glance outputs as well, while still performing each regression only once? Since we’re using list-columns, we can just fit the model once and use multiple list-columns to store the tidied, glanced and augmented outputs.
regressions <- mtcars %>%
  nest(-am) %>%
  mutate(
    fit = map(data, ~ lm(wt ~ mpg + qsec + gear, data = .x)),
    tidied = map(fit, tidy),
    glanced = map(fit, glance),
    augmented = map(fit, augment)
  )
regressions %>%
  unnest(tidied)
## # A tibble: 8 x 10
## am data fit term estimate std.error statistic p.value glanced augmented
## <dbl> <lis> <lis> <chr> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 1 <tib~ <lm> (Int~ 4.28 3.46 1.24 2.47e-1 <tibbl~ <tibble ~
## 2 1 <tib~ <lm> mpg -0.101 0.0294 -3.43 7.50e-3 <tibbl~ <tibble ~
## 3 1 <tib~ <lm> qsec 0.0398 0.151 0.264 7.98e-1 <tibbl~ <tibble ~
## 4 1 <tib~ <lm> gear -0.0229 0.349 -0.0656 9.49e-1 <tibbl~ <tibble ~
## 5 0 <tib~ <lm> (Int~ 4.92 1.40 3.52 3.09e-3 <tibbl~ <tibble ~
## 6 0 <tib~ <lm> mpg -0.192 0.0443 -4.33 5.91e-4 <tibbl~ <tibble ~
## 7 0 <tib~ <lm> qsec 0.0919 0.0983 0.935 3.65e-1 <tibbl~ <tibble ~
## 8 0 <tib~ <lm> gear 0.147 0.368 0.398 6.96e-1 <tibbl~ <tibble ~
regressions %>%
  unnest(glanced)
## # A tibble: 2 x 16
## am data fit tidied r.squared adj.r.squared sigma statistic p.value df
## <dbl> <lis> <lis> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 <tib~ <lm> <tibb~ 0.833 0.778 0.291 15.0 7.59e-4 4
## 2 0 <tib~ <lm> <tibb~ 0.625 0.550 0.522 8.32 1.70e-3 4
## # ... with 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
## # deviance <dbl>, df.residual <int>, augmented <list>
regressions %>%
  unnest(augmented)
## # A tibble: 32 x 16
## am data fit tidied glanced wt mpg qsec gear .fitted .se.fit
## <dbl> <lis> <lis> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 <tib~ <lm> <tibb~ <tibbl~ 2.62 21 16.5 4 2.73 0.209
## 2 1 <tib~ <lm> <tibb~ <tibbl~ 2.88 21 17.0 4 2.75 0.152
## 3 1 <tib~ <lm> <tibb~ <tibbl~ 2.32 22.8 18.6 4 2.63 0.163
## 4 1 <tib~ <lm> <tibb~ <tibbl~ 2.2 32.4 19.5 4 1.70 0.137
## 5 1 <tib~ <lm> <tibb~ <tibbl~ 1.62 30.4 18.5 4 1.86 0.151
## 6 1 <tib~ <lm> <tibb~ <tibbl~ 1.84 33.9 19.9 4 1.56 0.156
## 7 1 <tib~ <lm> <tibb~ <tibbl~ 1.94 27.3 18.9 4 2.19 0.113
## 8 1 <tib~ <lm> <tibb~ <tibbl~ 2.14 26 16.7 5 2.21 0.153
## 9 1 <tib~ <lm> <tibb~ <tibbl~ 1.51 30.4 16.9 5 1.77 0.191
## 10 1 <tib~ <lm> <tibb~ <tibbl~ 3.17 15.8 14.5 5 3.15 0.157
## # ... with 22 more rows, and 5 more variables: .resid <dbl>, .hat <dbl>,
## # .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
I am estimating rolling regressions on grouped data.
First, I group_by() and nest() my data by group.
Second, I use map() to estimate rolling regressions with a custom function my_beta(), which returns a list column.
The last step is where I stumble.
I want to extract the groups, dates, and coefficients so that I can merge the coefficients back to my original tibble.
However, my current solution requires three unnest() operations and a bind_cols().
The multiple unnest()s seem inefficient and the bind_cols() seems error prone.
Is there a syntactically and computationally more efficient way to estimate these rolling regressions? My actual data will have 10,000ish groups and 200,000ish observations.
library(tidyverse)
library(tsibble)
#>
#> Attaching package: 'tsibble'
#> The following object is masked from 'package:dplyr':
#>
#> id
set.seed(2001)
df <-
  tibble(
    date = 1:20,
    y = runif(20),
    x = runif(20),
    z = runif(20),
    group = rep(1:2, each = 10)
  )

my_beta <- function(...) {
  tail(coef(lm(y ~ x + z, data = list(...))), n = -1)
}
current_output <- df %>%
  as_tsibble(key = group, index = date) %>%
  group_by_key() %>%
  nest() %>%
  mutate(coefs = purrr::map(data, ~ pslide(., my_beta, .size = 5))) %>%
  unnest(coefs) %>%
  unnest_wider(coefs, names_sep = '_') %>%
  ungroup()
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
current_output
#> # A tibble: 20 x 5
#> group data coefs_...1 coefs_x coefs_z
#> <int> <list> <lgl> <dbl> <dbl>
#> 1 1 <tsibble [10 × 4]> NA NA NA
#> 2 1 <tsibble [10 × 4]> NA NA NA
#> 3 1 <tsibble [10 × 4]> NA NA NA
#> 4 1 <tsibble [10 × 4]> NA NA NA
#> 5 1 <tsibble [10 × 4]> NA 1.46 2.08
#> 6 1 <tsibble [10 × 4]> NA 0.141 -0.396
#> 7 1 <tsibble [10 × 4]> NA 0.754 1.10
#> 8 1 <tsibble [10 × 4]> NA 0.651 0.889
#> 9 1 <tsibble [10 × 4]> NA 0.743 0.954
#> 10 1 <tsibble [10 × 4]> NA 0.308 0.795
#> 11 2 <tsibble [10 × 4]> NA NA NA
#> 12 2 <tsibble [10 × 4]> NA NA NA
#> 13 2 <tsibble [10 × 4]> NA NA NA
#> 14 2 <tsibble [10 × 4]> NA NA NA
#> 15 2 <tsibble [10 × 4]> NA -0.0433 -0.252
#> 16 2 <tsibble [10 × 4]> NA 0.696 0.334
#> 17 2 <tsibble [10 × 4]> NA 0.594 -0.0698
#> 18 2 <tsibble [10 × 4]> NA 0.881 0.0474
#> 19 2 <tsibble [10 × 4]> NA 3.23 -1.32
#> 20 2 <tsibble [10 × 4]> NA -0.942 1.85
desired_output <- df %>%
  bind_cols(current_output %>% select(coefs_x, coefs_z))
desired_output
#> # A tibble: 20 x 7
#> date y x z group coefs_x coefs_z
#> <int> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 1 0.759 0.368 0.644 1 NA NA
#> 2 2 0.608 0.992 0.0542 1 NA NA
#> 3 3 0.218 0.815 0.252 1 NA NA
#> 4 4 0.229 0.982 0.0606 1 NA NA
#> 5 5 0.153 0.275 0.488 1 1.46 2.08
#> 6 6 0.374 0.856 0.268 1 0.141 -0.396
#> 7 7 0.619 0.737 0.599 1 0.754 1.10
#> 8 8 0.259 0.641 0.189 1 0.651 0.889
#> 9 9 0.637 0.598 0.543 1 0.743 0.954
#> 10 10 0.325 0.990 0.0265 1 0.308 0.795
#> 11 11 0.816 0.519 0.351 2 NA NA
#> 12 12 0.717 0.766 0.333 2 NA NA
#> 13 13 0.781 0.365 0.380 2 NA NA
#> 14 14 0.838 0.924 0.0778 2 NA NA
#> 15 15 0.736 0.453 0.258 2 -0.0433 -0.252
#> 16 16 0.173 0.291 0.328 2 0.696 0.334
#> 17 17 0.677 0.714 0.884 2 0.594 -0.0698
#> 18 18 0.833 0.718 0.902 2 0.881 0.0474
#> 19 19 0.134 0.351 0.422 2 3.23 -1.32
#> 20 20 0.675 0.963 0.981 2 -0.942 1.85
Created on 2020-02-25 by the reprex package (v0.3.0)
We could simplify the code a bit with
res %>%
  unnest(cols = c(data, coefs)) %>%
  unnest_wider(col = coefs, names_sep = '_') %>%
  select(-coefs_...1)
Where res is
res <- df %>%
  as_tsibble(key = group, index = date) %>%
  group_by_key() %>%
  nest() %>%
  mutate(coefs = purrr::map(data, ~ pslide(., my_beta, .size = 5)))
The code that does the estimation is left untouched; this only addresses the data-wrangling part, i.e. the multiple unnest()s and the bind_cols(). Because data and coefs are unnested together, the original date, y, x, and z columns come back row-aligned with the coefficients, so no bind_cols() is needed.
I haven't done a performance benchmark.
I run lm() for every column of a dataset, with one of the columns as the dependent variable, using the purrr::map() function.
The results are almost perfect except for this: I want to replace .x in the results with the name of the variable that I ran lm() on.
The post R purrr map show column names in output is related, but I want to avoid creating a function.
Below is the code using the mtcars dataset. I know, for example, that .x in the first output refers to $mpg. I am not sure if setNames() will work.
library(tidyverse)
library(broom)
mod3 <- map(mtcars, ~ lm(mpg ~ .x, data = mtcars)) %>%
  map(~ tidy(.x))
#> Warning in summary.lm(x): essentially perfect fit: summary may be
#> unreliable
mod3
#> $mpg
#> # A tibble: 2 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -5.02e-15 9.94e-16 -5.06e 0 0.0000198
#> 2 .x 1.00e+ 0 4.74e-17 2.11e16 0
#>
#> $cyl
#> # A tibble: 2 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 37.9 2.07 18.3 8.37e-18
#> 2 .x -2.88 0.322 -8.92 6.11e-10
#>
#> $disp
#> # A tibble: 2 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 29.6 1.23 24.1 3.58e-21
#> 2 .x -0.0412 0.00471 -8.75 9.38e-10
#>
#> $hp
#> # A tibble: 2 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 30.1 1.63 18.4 6.64e-18
#> 2 .x -0.0682 0.0101 -6.74 1.79e- 7
#>
#> $drat
#> # A tibble: 2 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -7.52 5.48 -1.37 0.180
#> 2 .x 7.68 1.51 5.10 0.0000178
#>
#> $wt
#> # A tibble: 2 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 37.3 1.88 19.9 8.24e-19
#> 2 .x -5.34 0.559 -9.56 1.29e-10
#>
#> $qsec
#> # A tibble: 2 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -5.11 10.0 -0.510 0.614
#> 2 .x 1.41 0.559 2.53 0.0171
#>
#> $vs
#> # A tibble: 2 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 16.6 1.08 15.4 8.85e-16
#> 2 .x 7.94 1.63 4.86 3.42e- 5
Here is one way to do it
library(tidyverse)
library(broom)
names(mtcars)[-1] %>%
  set_names() %>%
  map(~ lm(as.formula(paste0('mpg ~ ', .x)), data = mtcars)) %>%
  map_dfr(., broom::tidy, .id = "variable")
#> # A tibble: 20 x 6
#> variable term estimate std.error statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 cyl (Intercept) 37.9 2.07 18.3 8.37e-18
#> 2 cyl cyl -2.88 0.322 -8.92 6.11e-10
#> 3 disp (Intercept) 29.6 1.23 24.1 3.58e-21
#> 4 disp disp -0.0412 0.00471 -8.75 9.38e-10
#> 5 hp (Intercept) 30.1 1.63 18.4 6.64e-18
#> 6 hp hp -0.0682 0.0101 -6.74 1.79e- 7
#> 7 drat (Intercept) -7.52 5.48 -1.37 1.80e- 1
#> 8 drat drat 7.68 1.51 5.10 1.78e- 5
#> 9 wt (Intercept) 37.3 1.88 19.9 8.24e-19
#> 10 wt wt -5.34 0.559 -9.56 1.29e-10
#> 11 qsec (Intercept) -5.11 10.0 -0.510 6.14e- 1
#> 12 qsec qsec 1.41 0.559 2.53 1.71e- 2
#> 13 vs (Intercept) 16.6 1.08 15.4 8.85e-16
#> 14 vs vs 7.94 1.63 4.86 3.42e- 5
#> 15 am (Intercept) 17.1 1.12 15.2 1.13e-15
#> 16 am am 7.24 1.76 4.11 2.85e- 4
#> 17 gear (Intercept) 5.62 4.92 1.14 2.62e- 1
#> 18 gear gear 3.92 1.31 3.00 5.40e- 3
#> 19 carb (Intercept) 25.9 1.84 14.1 9.22e-15
#> 20 carb carb -2.06 0.569 -3.62 1.08e- 3
Created on 2019-02-10 by the reprex package (v0.2.1.9000)
Hi, you can use purrr::imap() like so:
mod3 <- map(mtcars, ~ lm(mpg ~ .x, data = mtcars)) %>%
  map(tidy) %>%
  imap(~ {.x[2, 1] <- .y; return(.x)})
imap() sends two things to the function/formula: .x, the item, and .y, which is either the name of the item (the column name in this case) or its index. I had to wrap everything in {} here to get the assignment to work.
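A minimal illustration of what imap() passes along (not specific to this problem): .x is the element and .y is its name.
# .x is the element, .y is its name (or the index if the input is unnamed)
purrr::imap_chr(c(a = 1, b = 2), ~ paste0(.y, " = ", .x))
#>       a       b
#> "a = 1" "b = 2"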
Is it possible to change the contrasts of interaction terms which have been specified in an lm using the colon : notation?
In the example below, the reference category defaults to the last of the six terms generated by gear:vs (i.e., gear5:vs1). I'd instead like it to use the first of the six as the reference (i.e., gear3:vs0).
mtcars.1 <- mtcars %>%
  mutate(gear = as.factor(gear)) %>%
  mutate(vs = as.factor(vs))

lm(data = mtcars.1, mpg ~ gear:vs) %>%
  tidy
#> # A tibble: 6 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 30.4 4.13 7.36 0.0000000824
#> 2 gear3:vs0 -15.4 4.30 -3.57 0.00143
#> 3 gear4:vs0 -9.40 5.06 -1.86 0.0747
#> 4 gear5:vs0 -11.3 4.62 -2.44 0.0218
#> 5 gear3:vs1 -10.1 4.77 -2.11 0.0447
#> 6 gear4:vs1 -5.16 4.33 -1.19 0.245
Specifying contrasts for gear and vs separately doesn't seem to have an effect:
lm(data = mtcars.1, mpg ~ gear:vs,
   contrasts = list(gear = contr.treatment(n = 3, base = 3),
                    vs = contr.treatment(n = 2, base = 2))) %>%
  tidy
#> # A tibble: 6 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 30.4 4.13 7.36 0.0000000824
#> 2 gear3:vs0 -15.4 4.30 -3.57 0.00143
#> 3 gear4:vs0 -9.40 5.06 -1.86 0.0747
#> 4 gear5:vs0 -11.3 4.62 -2.44 0.0218
#> 5 gear3:vs1 -10.1 4.77 -2.11 0.0447
#> 6 gear4:vs1 -5.16 4.33 -1.19 0.245
And I'm not sure how to specify a contrast for gear:vs directly:
lm(data = mtcars.1, mpg ~ gear:vs,
   contrasts = list("gear:vs" = contr.treatment(n = 6, base = 6))) %>%
  tidy
#> Warning in model.matrix.default(mt, mf, contrasts): variable 'gear:vs' is
#> absent, its contrast will be ignored
#> # A tibble: 6 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 30.4 4.13 7.36 0.0000000824
#> 2 gear3:vs0 -15.4 4.30 -3.57 0.00143
#> 3 gear4:vs0 -9.40 5.06 -1.86 0.0747
#> 4 gear5:vs0 -11.3 4.62 -2.44 0.0218
#> 5 gear3:vs1 -10.1 4.77 -2.11 0.0447
#> 6 gear4:vs1 -5.16 4.33 -1.19 0.245
Created on 2019-01-21 by the reprex package (v0.2.1)
One way around this is to pre-calculate the interaction term before regression.
To demonstrate, we can create a factor column GV in mtcars with the same levels as observed in your lm output. It generates the same values:
mtcars %>%
  mutate(GV = interaction(factor(gear), factor(vs)),
         GV = factor(GV, levels = c("5.1", "3.0", "4.0", "5.0", "3.1", "4.1"))) %>%
  lm(mpg ~ GV, .) %>%
  tidy()
# A tibble: 6 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 30.4 4.13 7.36 0.0000000824
2 GV3.0 -15.4 4.30 -3.57 0.00143
3 GV4.0 -9.4 5.06 -1.86 0.0747
4 GV5.0 -11.3 4.62 -2.44 0.0218
5 GV3.1 -10.1 4.77 -2.11 0.0447
6 GV4.1 -5.16 4.33 -1.19 0.245
Now we omit the second mutate term, so the levels are 3.0, 4.0, 5.0, 3.1, 4.1, 5.1.
mtcars %>%
  mutate(GV = interaction(factor(gear), factor(vs))) %>%
  lm(mpg ~ GV, .) %>%
  tidy()
# A tibble: 6 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 15.1 1.19 12.6 1.38e-12
2 GV4.0 5.95 3.16 1.88 7.07e- 2
3 GV5.0 4.08 2.39 1.71 9.96e- 2
4 GV3.1 5.28 2.67 1.98 5.83e- 2
5 GV4.1 10.2 1.77 5.76 4.61e- 6
6 GV5.1 15.4 4.30 3.57 1.43e- 3
Use interaction(factor(gear), factor(vs), lex.order = TRUE) to get the levels 3.0, 3.1, 4.0, 4.1, 5.0, 5.1.
mtcars %>%
  mutate(GV = interaction(factor(gear), factor(vs), lex.order = TRUE)) %>%
  lm(mpg ~ GV, .) %>%
  tidy()
# A tibble: 6 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 15.0 1.19 12.6 1.38e-12
2 GV3.1 5.28 2.67 1.98 5.83e- 2
3 GV4.0 5.95 3.16 1.88 7.07e- 2
4 GV4.1 10.2 1.77 5.76 4.61e- 6
5 GV5.0 4.07 2.39 1.71 9.96e- 2
6 GV5.1 15.3 4.30 3.57 1.43e- 3
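If you only need a different reference cell rather than a full manual reordering of the levels, relevel() on the pre-computed column is another option. This is just a sketch; the choice of "4.1" (gear4:vs1) as the baseline is purely illustrative:
library(dplyr)
library(broom)

mtcars %>%
  mutate(GV = interaction(factor(gear), factor(vs), lex.order = TRUE),
         GV = relevel(GV, ref = "4.1")) %>%  # pick the baseline cell explicitly
  lm(mpg ~ GV, .) %>%
  tidy()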