Based on the response to this question that I posed I have something like the following:
library(tidyverse)
library(dplyr)
library(broom)
library(tidyr)
library(purrr)
dataset <- tibble(
y1=rnorm(n=100),
y2=rnorm(n=100),
x1=rnorm(n=100),
x2=rnorm(n=100),)
outcomes <- dataset %>%
select(y1,y2) %>% colnames
covars <- dataset %>%
select(x1,x2) %>% colnames
paramlist <- expand_grid(outcomes, covars)
paramlist %>%
rowwise %>%
mutate(mod = list(lm(reformulate(covars, response = outcomes), data = dataset)),
res = list(broom::tidy(mod)),
predicted=list(predict(mod)),
data=list(cbind(dataset,predicted)))
# A tibble: 4 x 6
# Rowwise:
#> outcomes covars mod res predicted data
#> <chr> <chr> <list> <list> <list> <list>
#> 1 y1 x1 <lm> <tibble [2 x 5]> <dbl [100]> <df [100 x 5]>
#> 2 y1 x2 <lm> <tibble [2 x 5]> <dbl [100]> <df [100 x 5]>
#> 3 y2 x1 <lm> <tibble [2 x 5]> <dbl [100]> <df [100 x 5]>
#> 4 y2 x2 <lm> <tibble [2 x 5]> <dbl [100]> <df [100 x 5]>
What I would like to do now is, for each combination of outcomes and covars, calculate the mean or sd of the predicted values in data, conditional on some value of x1. For example, x1 might be a treatment variable, and I'd like the adjusted mean of the outcome for those with x1=0. The tricky part seems to be that the outcome and the conditioning variable differ across rows.
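One possible way to approach this, sketched under assumptions: the condition is expressed as a logical vector over the rows of dataset (here the hypothetical stand-in x1 > 0, since the simulated x1 is continuous), and the summaries are computed inside the same rowwise mutate. The names keep, mean_pred and sd_pred are made up for illustration.

```r
library(dplyr)
library(tidyr)

set.seed(1)
dataset <- tibble(
  y1 = rnorm(100), y2 = rnorm(100),
  x1 = rnorm(100), x2 = rnorm(100)
)
paramlist <- expand_grid(outcomes = c("y1", "y2"), covars = c("x1", "x2"))

# Hypothetical condition; for a real 0/1 treatment this could be x1 == 0
keep <- dataset$x1 > 0

summaries <- paramlist %>%
  rowwise() %>%
  mutate(
    # outcome on the left-hand side, covariate on the right
    mod = list(lm(reformulate(covars, response = outcomes), data = dataset)),
    predicted = list(predict(mod)),
    # in a rowwise frame, `predicted` is the numeric vector for this row
    mean_pred = mean(predicted[keep]),
    sd_pred = sd(predicted[keep])
  ) %>%
  ungroup()

summaries %>% select(outcomes, covars, mean_pred, sd_pred)
```

Because the predictions are in the same row order as dataset, the logical vector can be reused for every outcome/covariate pair, regardless of which variables a given row models.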
I want to do an equation-by-equation instrumental variable (IV) regression with a control function in R (using tidyverse and broom). I want to implement this based on a grouped data frame with a dependent variable, y, an endogenous variable, x, an instrument for this endogenous variable, z1, and an exogenous variable, z2. Following a Two Stage Least Squares (2SLS) approach, I would run: (1) regress x on z1 and z2, and (2) regress y on x, z2 and v (the residuals from (1)). For more details on this approach see: https://www.irp.wisc.edu/newsevents/workshops/appliedmicroeconometrics/participants/slides/Slides_14.pdf. Unfortunately, I am not able to run the second regression without an error (see below).
My data looks like this:
df <- data.frame(
id = sort(rep(seq(1, 20, 1), 5)),
group = rep(seq(1, 4, 1), 25),
y = runif(100),
x = runif(100),
z1 = runif(100),
z2 = runif(100)
)
where id is an identifier for the observations, group is an identifier for the groups and the rest is defined above.
library(tidyverse)
library(broom)
# Nest the data frame
df_nested <- df %>%
group_by(group) %>%
nest()
# Run first stage regression and retrieve residuals
df_fit <- df_nested %>%
mutate(
fit1 = map(data, ~ lm(x ~ z1 + z2, data = .x)),
resids = map(fit1, residuals)
)
Now, I want to run the second stage regression. I've tried two things.
First:
df_fit %>%
group_by(group) %>%
unnest(c(data, resids)) %>%
do(lm(y ~ x + z2, data = .x))
This produces Error in is.data.frame(data) : object '.x' not found.
Second:
df_fit %>%
mutate(
fit2 = map2(data, resids, ~ lm(y ~ x + z2, data = .x))
)
df_fit %>% unnest(fit2)
This produces: Error: Must subset columns with a valid subscript vector. x Subscript has the wrong type `grouped_df< . With a larger data set, the second approach would even run into memory problems.
How is this done correctly?
The broom package is loaded, but tidy was never applied to the lm output. In addition, the OP's code had a typo: after mutating to create fit2, the object df_fit was not updated (<-), so df_fit %>% unnest(fit2) cannot work because the column is not found.
library(dplyr)
library(purrr)
library(broom)
library(tidyr)
df_fit %>%
ungroup %>%
mutate(
fit2 = map2(data, resids, ~ tidy(lm(y ~ x + z2, data = .x))
)) %>%
unnest(fit2)
-output
# A tibble: 12 × 9
group data fit1 resids term estimate std.error statistic p.value
<dbl> <list> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 <tibble [25 × 5]> <lm> <dbl [25]> (Intercept) 0.357 0.126 2.82 0.00987
2 1 <tibble [25 × 5]> <lm> <dbl [25]> x -0.0290 0.173 -0.168 0.868
3 1 <tibble [25 × 5]> <lm> <dbl [25]> z2 0.204 0.183 1.11 0.278
4 2 <tibble [25 × 5]> <lm> <dbl [25]> (Intercept) 0.470 0.139 3.38 0.00272
5 2 <tibble [25 × 5]> <lm> <dbl [25]> x 0.168 0.206 0.816 0.423
6 2 <tibble [25 × 5]> <lm> <dbl [25]> z2 0.00615 0.176 0.0350 0.972
7 3 <tibble [25 × 5]> <lm> <dbl [25]> (Intercept) 0.625 0.147 4.25 0.000325
8 3 <tibble [25 × 5]> <lm> <dbl [25]> x 0.209 0.255 0.818 0.422
9 3 <tibble [25 × 5]> <lm> <dbl [25]> z2 -0.398 0.183 -2.18 0.0406
10 4 <tibble [25 × 5]> <lm> <dbl [25]> (Intercept) 0.511 0.235 2.17 0.0407
11 4 <tibble [25 × 5]> <lm> <dbl [25]> x 0.0468 0.247 0.189 0.851
12 4 <tibble [25 × 5]> <lm> <dbl [25]> z2 -0.0246 0.271 -0.0908 0.929
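The second-stage formula above is y ~ x + z2, so the first-stage residuals are passed to map2 but never enter the regression. A minimal sketch of the control-function step as the question describes it (regress y on x, z2 and v) could look like the following; the column name v and the object name second_stage are assumptions for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(1)
df <- data.frame(
  id = sort(rep(seq(1, 20, 1), 5)),
  group = rep(seq(1, 4, 1), 25),
  y = runif(100), x = runif(100),
  z1 = runif(100), z2 = runif(100)
)

second_stage <- df %>%
  nest(data = -group) %>%
  mutate(
    # Stage 1: regress the endogenous x on the instrument and the exogenous variable
    fit1 = map(data, ~ lm(x ~ z1 + z2, data = .x)),
    # Attach the stage-1 residuals to each group's data as column v
    data = map2(data, fit1, ~ mutate(.x, v = unname(residuals(.y)))),
    # Stage 2: include v as the control function
    fit2 = map(data, ~ tidy(lm(y ~ x + z2 + v, data = .x)))
  ) %>%
  select(group, fit2) %>%
  unnest(fit2)

second_stage
```

Note that the standard errors lm reports in the second step do not account for v being an estimated regressor, so they would typically need a correction (for example by bootstrapping both stages together).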
I have a data frame like this:
df <- tibble(
i = rep(1:10, times = 5),
t = rep(1:5, each = 10)
) %>%
mutate(y = rnorm(50))
I want to apply a function that takes the data frame for each t as its argument:
f <- function(df){
return(lm(y ~ +1, data = df))
}
When I apply purrr::map to a nested data frame with the pipe operator, I get an error.
# does not work
df_nested <- df %>%
nest(data = c(t, y)) %>%
rename(data_col = data)
df_nested %>%
purrr::map(.x = .$data_col, .f = f)
On the other hand, when I do not use the pipe operator, I get the desired result.
# Ok
purrr::map(.x = df_nested$data_col, .f = f)
To my understanding, both snippets should return the same result. What is wrong with the code that uses the pipe operator?
The pipe already passes the previous value (df_nested) as the first argument to map, even though . also appears inside .$data_col. You can wrap the call in {} to stop that from happening.
library(tidyverse)
df_nested %>%
{purrr::map(.x = .$data_col, .f = f)}
Another way would be to use mutate with map:
df %>%
nest(data_col = c(t, y)) %>%
mutate(model = map(data_col, f))
# i data_col model
# <int> <list> <list>
# 1 1 <tibble [5 × 2]> <lm>
# 2 2 <tibble [5 × 2]> <lm>
# 3 3 <tibble [5 × 2]> <lm>
# 4 4 <tibble [5 × 2]> <lm>
# 5 5 <tibble [5 × 2]> <lm>
# 6 6 <tibble [5 × 2]> <lm>
# 7 7 <tibble [5 × 2]> <lm>
# 8 8 <tibble [5 × 2]> <lm>
# 9 9 <tibble [5 × 2]> <lm>
#10 10 <tibble [5 × 2]> <lm>
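To make the pipe behaviour concrete, here is a small sketch that writes out what the failing pipe call amounts to and compares two working alternatives. The object names res_bad, models_a and models_b are made up for illustration, and the exact error text depends on the purrr version, so it is not reproduced here.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)
df <- tibble(
  i = rep(1:10, times = 5),
  t = rep(1:5, each = 10)
) %>%
  mutate(y = rnorm(50))

f <- function(df) lm(y ~ 1, data = df)

df_nested <- df %>% nest(data_col = c(t, y))

# `df_nested %>% purrr::map(.x = .$data_col, .f = f)` amounts to this call:
# df_nested is an extra unnamed argument that map forwards to f, which fails
res_bad <- try(
  purrr::map(df_nested, .x = df_nested$data_col, .f = f),
  silent = TRUE
)
inherits(res_bad, "try-error")

# Alternative 1: extract the list column first, then map over it
models_a <- df_nested %>% pull(data_col) %>% map(f)

# Alternative 2: braces suppress the automatic first-argument insertion
models_b <- df_nested %>% {map(.$data_col, f)}
```

Both alternatives return a plain list of ten lm objects, whereas the mutate version keeps the models next to their identifying i column.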
Imagine a high resolution temperature and light time series taken at many locations (stations) over many days. Except, at each station temp and light are taken by different sensors, resulting in a slightly different set of timestamps.
To merge these into one data.frame, I've been trying to make a model of light for each day at each station in df.light. Then, I want to predict light values at the exact timestamps of temp readings, which are nested the same way in df.temp (the temperature dataset).
station <- rep(1:5, each=36500)
dayofyear <- rep(1:365, 5, each=100)
hourofday.light <- runif(182500, min=0, max=24)
light <- runif(182500, min=0, max=40)
hourofday.temp <- runif(182500, min=0, max=24)
temp <- runif(182500, min=0, max=40)
df.light <- data.frame(station, dayofyear, hourofday.light, light)
df.temp <- data.frame(station, dayofyear, hourofday.temp, temp)
> head(df.light)
station dayofyear hourofday.light light
1 1 1 10.217349 0.120381
2 1 1 12.179213 12.423694
3 1 1 16.515400 7.277784
4 1 1 3.775723 31.793782
5 1 1 7.719266 30.578220
6 1 1 9.269916 16.937042
> tail(df.light)
station dayofyear hourofday.light light
182495 5 365 4.712285 19.2047471
182496 5 365 11.190919 39.5921675
182497 5 365 18.710969 11.8182347
182498 5 365 20.288101 11.6874453
182499 5 365 15.466373 0.3264828
182500 5 365 12.969125 29.4429034
> head(df.temp)
station dayofyear hourofday.temp temp
1 1 1 12.1298554 30.862308
2 1 1 23.6226076 9.328942
3 1 1 9.3699831 28.970397
4 1 1 0.1814767 1.405557
5 1 1 23.6300014 39.875743
6 1 1 7.6999984 39.786182
I can make the light models, e.g. GAMs, for each day at each station in df.light using dplyr. But I am stuck not knowing how to feed the nested newdata from df.temp to the models to generate the per-station-per-day predictions.
library("mgcv")
library("tidyverse")
data <- as_tibble(df.light) %>%
group_by(station, dayofyear) %>%
nest()
models <- data %>%
mutate(
model = map(data, ~ gam(light ~ s(hourofday.light), data = .x)),
predicted = map(model, ~ predict.gam(.x, newdata = hourofday.temp)) # newdata doesn't look nested
)
The last line starting with predicted does not work because newdata is not nested...I think. Please help. I'm guessing this could be a common issue in merging time series generated by multiple sources.
You can first prepare the data.
names(df.temp)[3:4] <- names(df.light)[3:4]
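data1 <- df.light %>% group_by(station, dayofyear) %>% nest() %>% ungroup()
data2 <- df.temp %>% group_by(station, dayofyear) %>% nest() %>% ungroup()
Then apply the model and get the predicted values. Note that data2$data is attached by position, so this relies on data1 and data2 listing the station/day groups in the same order.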
result <- data1 %>%
mutate(data2 = data2$data,
model = map(data, ~ gam(light ~ s(hourofday.light),data = .x)),
predicted = map2(model, data2, predict.gam))
result
# A tibble: 1,825 x 6
# station dayofyear data data2 model predicted
# <int> <int> <list> <list> <list> <list>
# 1 1 1 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 2 1 2 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 3 1 3 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 4 1 4 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 5 1 5 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 6 1 6 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 7 1 7 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 8 1 8 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 9 1 9 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
#10 1 10 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# … with 1,815 more rows
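If the two sensors do not cover exactly the same station/day combinations, or the rows come in a different order, matching by position could pair a model with the wrong newdata. A hedged sketch of a key-based variant, using a small simulated stand-in for the data and a shared time column named hourofday (both assumptions, as are the names light_data, temp_data and light_hat):

```r
library(dplyr)
library(tidyr)
library(purrr)
library(mgcv)

set.seed(1)
n <- 50  # readings per station and day (small, for illustration only)
keys <- expand_grid(station = 1:2, dayofyear = 1:2)

df.light <- keys %>%
  slice(rep(seq_len(n()), each = n)) %>%
  mutate(hourofday = runif(n(), 0, 24), light = runif(n(), 0, 40))
df.temp <- keys %>%
  slice(rep(seq_len(n()), each = n)) %>%
  mutate(hourofday = runif(n(), 0, 24), temp = runif(n(), 0, 40))

light_nested <- df.light %>% nest(light_data = c(hourofday, light))
temp_nested  <- df.temp  %>% nest(temp_data  = c(hourofday, temp))

merged <- inner_join(light_nested, temp_nested,
                     by = c("station", "dayofyear")) %>%
  mutate(
    # one GAM of light against time per station and day
    model = map(light_data, ~ gam(light ~ s(hourofday), data = .x)),
    # predict light at the temperature timestamps of the same station and day
    temp_data = map2(model, temp_data,
                     ~ mutate(.y, light_hat = as.numeric(predict(.x, newdata = .y))))
  ) %>%
  select(station, dayofyear, temp_data) %>%
  unnest(temp_data)

head(merged)
```

The join guarantees that each model only ever sees newdata from its own station and day, and the final unnest gives one flat data frame with temp and the predicted light side by side.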
I'm computing the model outputs for a linear regression for a dependent variable with 45 different id values. How can I use tidy (dplyr, apply, etc.) code to accomplish this?
I have a dataset with three variables data = c(id, distance, actPct) such that id == 1:45; -10 <= distance <= 10; 0 <= actPct <= 1.
I need to run a regression, model0n, on each value of id, such that model0n has its output in a new tibble/df. I have completed it for a single regression:
model01 <- data %>%
filter(id == 1) %>%
filter(distance < 1) %>%
filter(distance > -4)
model01 <- lm(data = model01, actPct~distance)
Example Data
set.seed(42)
id <- as.tibble(sample(1:45,100,replace = T))
distance <- as.tibble(sample(-4:4,100,replace = T))
actPct <- as.tibble(runif(100, min=0, max=1))
data01 <- bind_cols(id=id, distance=distance, actPct=actPct)
attr(data01, "col.names") <- c("id", "distance", "actPct")
I expect a new tibble or dataframe that has model01:model45 so I can put all of the regression outputs into a single table.
You can use group_by, nest and mutate with map from the tidyverse to accomplish this:
data01 %>%
group_by(id) %>%
nest() %>%
mutate(models = map(data, ~ lm(actPct ~ distance, data = .x)))
# A tibble: 41 x 3
# id data models
# <int> <list> <list>
# 1 42 <tibble [3 x 2]> <S3: lm>
# 2 43 <tibble [4 x 2]> <S3: lm>
# 3 13 <tibble [2 x 2]> <S3: lm>
# 4 38 <tibble [4 x 2]> <S3: lm>
# 5 29 <tibble [2 x 2]> <S3: lm>
# 6 24 <tibble [5 x 2]> <S3: lm>
# 7 34 <tibble [5 x 2]> <S3: lm>
# 8 7 <tibble [3 x 2]> <S3: lm>
# 9 30 <tibble [2 x 2]> <S3: lm>
# 10 32 <tibble [2 x 2]> <S3: lm>
# ... with 31 more rows
See also the chapter in R for Data Science about many models: https://r4ds.had.co.nz/many-models.html
Data
set.seed(42)
id <- sample(1:45, 100, replace = T)
distance <- sample(-4:4, 100, replace = T)
actPct <- runif(100, min = 0, max = 1)
data01 <- tibble(id = id, distance = distance, actPct = actPct)
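Since the question asks for all regression outputs in a single table, one way to extend the nested approach is to apply broom::tidy to each model and unnest the result. This is a sketch only: the filter that drops ids with fewer than three observations or a single distinct distance is an assumption added here so that every model has an estimable slope, and the name coefs is made up.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(42)
data01 <- tibble(
  id = sample(1:45, 100, replace = TRUE),
  distance = sample(-4:4, 100, replace = TRUE),
  actPct = runif(100, min = 0, max = 1)
)

coefs <- data01 %>%
  group_by(id) %>%
  # assumption: skip ids that cannot support a slope estimate
  filter(n() >= 3, n_distinct(distance) >= 2) %>%
  nest() %>%
  mutate(
    models = map(data, ~ lm(actPct ~ distance, data = .x)),
    tidied = map(models, tidy)
  ) %>%
  ungroup() %>%
  select(id, tidied) %>%
  unnest(tidied)

coefs
```

The result has one row per id and coefficient, with the usual estimate, std.error, statistic and p.value columns.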
I'm trying to extract model info from model in a list column.
Using mtcars to illustrate my problem:
mtcars %>%
nest(-cyl) %>%
mutate(model= map(data, ~lm(mpg~wt, data=.))) %>%
mutate(aic=AIC(model))
What I get is this error message:
Error in mutate_impl(.data, dots) :
Evaluation error: no applicable method for 'logLik' applied to an object of class "list".
But when I do it this way, it works.
mtcars %>%
group_by(cyl) %>%
do(model= lm(mpg~wt, data=.)) %>%
mutate(aic=AIC(model))
Can anyone explain why the second way works? I could not figure it out. In both cases, the list column 'model' contains the model info, but there must be some difference. Thanks a lot.
Let's compare the differences between these two approaches. We can run your entire code except for the last AIC call and save the results to a and b.
a <- mtcars %>%
nest(-cyl) %>%
mutate(model= map(data, ~lm(mpg~wt, data=.)))
b <- mtcars %>%
group_by(cyl) %>%
do(model= lm(mpg~wt, data=.))
Now we can print the results in the console.
a
# A tibble: 3 x 3
cyl data model
<dbl> <list> <list>
1 6 <tibble [7 x 10]> <S3: lm>
2 4 <tibble [11 x 10]> <S3: lm>
3 8 <tibble [14 x 10]> <S3: lm>
b
Source: local data frame [3 x 2]
Groups: <by row>
# A tibble: 3 x 2
cyl model
* <dbl> <list>
1 4 <S3: lm>
2 6 <S3: lm>
3 8 <S3: lm>
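Now we can see dataframe b is grouped by row, while dataframe a is not. This is the key: in a, mutate hands the whole model column (a list of three lm objects) to AIC at once, which is why the error mentions an object of class "list"; in b, AIC is called on one lm object per row.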
To extract AIC in dataframe a, we can use the rowwise function to group dataframe by each row.
mtcars %>%
nest(-cyl) %>%
mutate(model= map(data, ~lm(mpg~wt, data=.))) %>%
rowwise() %>%
mutate(aic=AIC(model))
Source: local data frame [3 x 4]
Groups: <by row>
# A tibble: 3 x 4
cyl data model aic
<dbl> <list> <list> <dbl>
1 6 <tibble [7 x 10]> <S3: lm> 25.65036
2 4 <tibble [11 x 10]> <S3: lm> 61.48974
3 8 <tibble [14 x 10]> <S3: lm> 63.31555
Or we can use the map_dbl function because we know each AIC is numeric.
mtcars %>%
nest(-cyl) %>%
mutate(model= map(data, ~lm(mpg~wt, data=.))) %>%
mutate(aic = map_dbl(model, AIC))
# A tibble: 3 x 4
cyl data model aic
<dbl> <list> <list> <dbl>
1 6 <tibble [7 x 10]> <S3: lm> 25.65036
2 4 <tibble [11 x 10]> <S3: lm> 61.48974
3 8 <tibble [14 x 10]> <S3: lm> 63.31555
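The outputs above come from older package versions. As a sketch for current tidyr, where nest(-cyl) is deprecated in favour of naming the new column, the same idea can be written as follows; the column name aic_glance and the object name fits are assumptions, and broom::glance is shown only as an optional way to get AIC together with other fit statistics.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

fits <- mtcars %>%
  nest(data = -cyl) %>%
  mutate(model = map(data, ~ lm(mpg ~ wt, data = .x)))

# The column is a list; each element is an lm object
class(fits$model)
class(fits$model[[1]])

fits <- fits %>%
  mutate(
    # call AIC once per model and collect the results as a numeric vector
    aic = map_dbl(model, AIC),
    # glance() returns a one-row tibble per model that includes an AIC column
    aic_glance = map_dbl(model, ~ broom::glance(.x)$AIC)
  )

fits %>% select(cyl, aic, aic_glance)
```

Both columns should agree, since glance computes its AIC from the same fitted model.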