As shown in the example below, what I want to achieve is to run the regression many times, with R recording the estimates of did in one data.frame each time.
Each time, I change the year condition in ifelse, e.g. ifelse(mydata$year >= 1993, 1, 0), so each run is a different regression.
mydata$time = ifelse(mydata$year >= 1994, 1, 0)
Can anyone help? My basic code is below (the data can be downloaded through a browser if R returns errors):
library(foreign)
mydata = read.dta("http://dss.princeton.edu/training/Panel101.dta")
mydata$time = ifelse(mydata$year >= 1994, 1, 0)
mydata$treated = ifelse(mydata$country == "E" | mydata$country == "F" | mydata$country == "G", 1, 0)
mydata$did = mydata$time * mydata$treated
didreg = lm(y ~ treated + time + did, data = mydata)
summary(didreg)
Generally, if you want to repeat a process many times with different input each time, you need a function. The following function takes a scalar value year_value as its input, creates the local variables for the regression, and returns the estimates for the model term did.
foo <- function (year_value) {
  ## create local variables from `mydata`
  y <- mydata$y
  treated <- as.numeric(mydata$country %in% c("E", "F", "G"))  ## use `%in%`
  time <- as.numeric(mydata$year >= year_value)  ## use `year_value`
  did <- time * treated
  ## run regression using local variables
  didreg <- lm(y ~ treated + time + did)
  ## return estimate for model term `did`
  coef(summary(didreg))["did", ]
}
foo(1993)
# Estimate Std. Error t value Pr(>|t|)
#-2.784222e+09 1.504349e+09 -1.850782e+00 6.867661e-02
Note there are several places where your original code can be improved: for example, using "%in%" instead of multiple "|", and using as.numeric instead of ifelse to coerce a logical to numeric.
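A quick illustration of the equivalence (a sketch with made-up values):
x <- c("A", "E", "G")
as.numeric(x %in% c("E", "F", "G"))
#[1] 0 1 1
ifelse(x == "E" | x == "F" | x == "G", 1, 0)
#[1] 0 1 1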
Now you need something like a loop to iterate this function over several different values of year_value. I would use lapply.
## raw list of result from `lapply`
year_of_choice <- 1993:1994 ## taken for example
result <- lapply(year_of_choice, foo)
## rbind them into a matrix
data.frame(year = year_of_choice, do.call("rbind", result), check.names = FALSE)
# year Estimate Std. Error t value Pr(>|t|)
#1 1993 -2784221881 1504348732 -1.850782 0.06867661
#2 1994 -2519511630 1455676087 -1.730819 0.08815711
Note, don't include year 1990 (the minimum of variable year) as a choice, otherwise time will be a vector of 1s, the same as the intercept. The resulting model is rank-deficient and you will get a "subscript out of bounds" error. Since version 3.5.0, R has a new complete argument to the generic function coef. So for stability we may use
coef(summary(didreg), complete = TRUE)["did", ]
But you should see all NA or NaN for year 1990.
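If you do want to iterate over all years without tripping on this edge case, here is a hedged sketch of a defensive variant (foo_safe is just an illustrative name) that returns NAs instead of erroring when did is dropped from the fit:
foo_safe <- function (year_value) {
  y <- mydata$y
  treated <- as.numeric(mydata$country %in% c("E", "F", "G"))
  time <- as.numeric(mydata$year >= year_value)
  did <- time * treated
  didreg <- lm(y ~ treated + time + did)
  ctab <- coef(summary(didreg))
  ## `coef(summary(.))` drops aliased terms, so check for "did" first
  if ("did" %in% rownames(ctab)) ctab["did", ]
  else setNames(rep(NA_real_, 4), c("Estimate", "Std. Error", "t value", "Pr(>|t|)"))
}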
Here is another option: we create a matrix for all the years, join it to mydata, gather it to long format, nest by group, then run the regression and extract the estimates. Note that the prefix "gt_et_" stands for "greater than or equal to".
library(foreign)
library(dplyr)
library(tidyr)
library(purrr)
mydata = read.dta("http://dss.princeton.edu/training/Panel101.dta")
mtrx <- matrix(0, length(min(mydata$year):max(mydata$year)), length(min(mydata$year):max(mydata$year)))
mtrx[lower.tri(mtrx, diag = TRUE)] <- 1
df <- mtrx %>% as.data.frame() %>% mutate(year = min(mydata$year):max(mydata$year))
colnames(df) <- c(paste0("gt_et_", df$year), "year")
models <- df %>%
  full_join(., mydata, by = "year") %>%
  gather(mod, time, gt_et_1990:gt_et_1999) %>%
  nest(-mod) %>%
  mutate(data = map(data, ~mutate(.x, treated = ifelse(country == "E" | country == "F" | country == "G", 1, 0),
                                  did = time * treated)),
         mods = map(data, ~lm(y ~ treated + time + did, data = .x) %>% summary() %>% coef())) %>%
  unnest(mods %>% map(broom::tidy)) %>%
  filter(.rownames == "did") %>%
  select(-.rownames)
models
#> mod Estimate Std..Error t.value Pr...t..
#> 1 gt_et_1991 -2309823993 2410140350 -0.95837738 0.34137018
#> 2 gt_et_1992 -2036098728 1780081308 -1.14382344 0.25682856
#> 3 gt_et_1993 -2784221881 1504348732 -1.85078222 0.06867661
#> 4 gt_et_1994 -2519511630 1455676087 -1.73081886 0.08815711
#> 5 gt_et_1995 -2357323806 1455203186 -1.61992760 0.11001662
#> 6 gt_et_1996 250180589 1511322882 0.16553749 0.86902697
#> 7 gt_et_1997 405842197 1619653548 0.25057346 0.80292231
#> 8 gt_et_1998 -75683039 1852314277 -0.04085864 0.96753194
#> 9 gt_et_1999 2951694230 2452126428 1.20372840 0.23299421
Created on 2018-09-01 by the reprex package (v0.2.0).
Exactly like this question, but how do you also get the R squared value for each model? link
Sample data
test <- data.frame(row = c(1:16),
                   plot = c(1,1,1,1,1,2,2,2,3,3,3,3,3,3,3,3),
                   logT = c(1.092,1.091,1.0915,1.09,1.08,1.319,1.316,1.301,1.2134,1.213,1.21,1.22,1.23,1.20,1.19,1.19),
                   utc_datetime = c("2020-03-05T00:00:00Z","2020-03-05T00:30:00Z","2020-03-05T01:00:00Z","2020-03-05T01:30:00Z","2020-03-05T02:00:00Z","2020-03-06T01:00:00Z","2020-03-06T01:30:00Z","2020-03-06T02:00:00Z",
                                    "2020-03-10T02:00:00Z","2020-03-10T02:30:00Z","2020-03-10T03:00:00Z","2020-03-10T03:30:00Z","2020-03-10T04:00:00Z","2020-03-10T04:30:00Z","2020-03-10T05:00:00Z","2020-03-10T05:30:00Z"),
                   hrs_since = c(1,2,3,4,5,1,2,3,1,2,3,4,5,6,7,8))
A deeper explanation of the data I am dealing with is here, but I believe the sample data provided above will suffice. Ideally, I would want to use utc_datetime as the x axis/IV value, but no code I've tried works with it, so I created the hrs_since variable, which does work.
I am looking for an output dataframe that looks something like this:
plot  slope(coeff)  r2 value  rsd
1     2.1           .96       .01
2     1.3           .85       .01
3     .8            .99       .02
When I run the code below...
library(plyr)
output <- ddply(test, "plot", function(x) {
  model <- lm(logT ~ hrs_since, data = x)
  coef(model)
})
I create a dataframe that looks like this:
plot  (Intercept)  hrs_since
1     2.1          .96
2     1.3          .85
3     .8           .99
But when I add summary(model)$r.squared to it, such as below...
output <- ddply(test, "plot", function(x) {
  model <- lm(logT ~ hrs_since, data = x)
  coef(model)
  summary(model)$r.squared
})
I create a dataframe that looks like this:
plot  V1
1     0.98
2     0.97
3     0.89
Here the correct R squared value has been added as column V1 to the df "output", but for some reason I have lost the coeff columns? Ideally, I also want to add rsd and maybe st.dev columns, but I have not attempted that yet because getting the R squared and coeff columns correct is the most important. Also, I originally tried r.squared(model) instead of summary(model)$r.squared in the line below coef(model), but this resulted in the error "Error in UseMethod("pmodel.response") : no applicable method for 'pmodel.response' applied to an object of class "lm"".
Also, I tried the method below as well, and it worked, but the coeff was not among the parameters returned for each plot:
library(broom)
output <- test %>%
  group_by(plot) %>%
  do(glance(lm(logT ~ hrs_since, data = .)))
Thank you in advance!
Here's an approach that nests each of the models in a dataframe and captures the results in the dataframe as well, then uses the broom package to extract the statistics. Two different broom functions extract the intercept and r2, so I run them separately and combine the results into one dataframe.
library(dplyr)
library(modelr)
library(tidyverse)
dat_all <- data.frame()
#nest the datasets as separate dataframes
for (p in unique(test$plot)){
  data <- data.frame(x = test$logT[test$plot == p], y = test$hrs_since[test$plot == p])
  names(data) <- c("logT", "hrs_since")
  dd <- data.frame(plot = p, data = data) %>%
    group_by(plot) %>%
    nest()
  dat_all <- rbind(dat_all, dd)
}
myModel <- function(x){
  lm(data.logT ~ data.hrs_since, data = x)
}
#use map to run the model on each of the nested dataframes
dat_all <- dat_all %>%
  mutate(model = map(data, myModel))
#extract the intercepts
i <- dat_all %>%
  mutate(tidy = map(model, broom::tidy)) %>%
  unnest(tidy) %>%
  filter(term == "(Intercept)") %>%
  select(plot, intercept = estimate)
#extract r2
r <- dat_all %>%
  mutate(glance = map(model, broom::glance)) %>%
  unnest(glance) %>%
  select(plot, r.squared)
#combine statistics by plot
results <- i %>%
  left_join(r, by = "plot")
plot intercept r.squared
<dbl> <dbl> <dbl>
1 1 1.10 0.618
2 2 1.33 0.871
3 3 1.22 0.380
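Since the desired output also has slope and rsd columns, here is a hedged sketch extending the same pattern: broom::tidy's estimate for the data.hrs_since term gives the slope, and the sigma column from broom::glance gives the residual standard error (results_full is an illustrative name):
#extract the slopes
s <- dat_all %>%
  mutate(tidy = map(model, broom::tidy)) %>%
  unnest(tidy) %>%
  filter(term == "data.hrs_since") %>%
  select(plot, slope = estimate)
#extract residual standard error
rsd <- dat_all %>%
  mutate(glance = map(model, broom::glance)) %>%
  unnest(glance) %>%
  select(plot, rsd = sigma)
#combine with the earlier results
results_full <- results %>%
  left_join(s, by = "plot") %>%
  left_join(rsd, by = "plot")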
I want to do regression imputation with dplyr in R efficiently. Here is my problem: I have a data set with many missing values for one column - let's call it p. Now I want to estimate the missing values of p with a regression imputation approach. For that I regress p on a set of variables with OLS using uncensored data (a subset of the data set without missing values for p). Then I use the estimated coefficients to calculate the missing values of p.
My data set looks like that:
df = data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  group = c(1, 1, 2, 1, 1, 2),
  sub_group = c(1, 2, 3, 1, 2, 3),
  p = c(4.3, 5.7, NA, NA, NA, 10),
  var1 = c(0.3, 0.1, 0.4, 0.9, 0.1, 0.2),
  var2 = c(0, 0, 0, 1, 1, 1)
)
where id represents individuals, who buy goods from a group (e.g. "food") with subgroups (like "bread"). p is the price, while var1 and var2 are demographic variables (like "education" and "age").
What I've done so far:
library(dplyr)
df <- as_tibble(df)
# Create uncensored data
uncensored_df <- df %>%
  filter(!is.na(p))
# Run regression on uncensored data
imp_model <- lm(p ~ var1 + var2, data = uncensored_df)
# Get the coefficients of the fitted model
coefs <- unname(imp_model$coefficients)
# Use coefficients to compute missing values of p
censored_df <- df %>%
  filter(is.na(p)) %>%
  group_by(id, group, sub_group) %>%
  mutate(p = coefs[1] + coefs[2] * var1 + coefs[3] * var2)
# And finally combine the two subsets
bind_rows(uncensored_df, censored_df) %>% arrange(id, group, sub_group)
As I use more than var1 and var2 in my actual problem (about 30 variables), what is a better way to do regression imputation with dplyr? (I'm also open to non-dplyr solutions, though.)
library(dplyr)
fit <- lm(p ~ ., data = select(df, p, starts_with("var")))
df %>%
  rowwise() %>%
  mutate(p = ifelse(is.na(p), predict(fit, newdata = across()), p)) %>%
  ungroup()
How it works
For starters, when fitting your model, you can subset your data frame using select and any of the tidyselect helpers to pick your independent variables (here starts_with("var")). This subset data frame then allows you to use the ~ . notation, which means regress p on everything else in the subset data frame.
Next you create a row-wise data frame and use your model to predict where p is missing. In this instance across turns each row into a 1x6 tibble that you can pass to the newdata argument. predict then uses the model fit and this new data to predict a value of p.
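For instance, with the toy data above, the subset passed to lm is just (printed as a sketch):
select(df, p, starts_with("var"))
#     p var1 var2
# 1 4.3  0.3    0
# 2 5.7  0.1    0
# 3  NA  0.4    0
# 4  NA  0.9    1
# 5  NA  0.1    1
# 6 10.0 0.2    1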
Output
id group sub_group p var1 var2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 4.3 0.3 0
2 1 1 2 5.7 0.1 0
3 1 2 3 3.60 0.4 0
4 2 1 1 5.10 0.9 1
5 2 1 2 10.7 0.1 1
6 2 2 3 10 0.2 1
Benchmarking
As mentioned in the comments, for large data frames the rowwise operation takes significantly longer than some other options, since it calls predict once per row rather than once for the whole data frame:
library(microbenchmark)
set.seed(1)
df1 <- df %>%
  slice_sample(n = 1E5, replace = T)
fit <- lm(p ~ ., data = select(df1, p, starts_with("var")))
dplyr_rowwise <- function(){
  df1 %>%
    rowwise() %>%
    mutate(p = ifelse(is.na(p), predict(fit, newdata = across()), p)) %>%
    ungroup()
}
dplyr_coalesce <- function(){
  df1 %>%
    mutate(p = coalesce(p, predict(fit, newdata = df1)))
}
base_index <- function(){
  isna <- is.na(df1$p)
  df1$p[isna] <- predict(fit, newdata = subset(df1, isna))
}
microbenchmark(
  dplyr_rowwise(),
  dplyr_coalesce(),
  base_index(),
  times = 10L
)
Unit: milliseconds
expr min lq mean median uq
dplyr_rowwise() 63739.9512 64441.0800 66926.46041 65513.51785 66923.0241
dplyr_coalesce() 6.5901 6.9037 8.55971 7.21125 7.7157
base_index() 13.0368 13.1790 15.73682 13.53310 19.3004
I am trying to output multiple versions of a model using fixest::feols. My goal is to run the model independently for each restriction column in my dataset and either get a list or etable of all the regression summaries. The tricky part is that I am interacting the restriction variable with Year, and I am not sure how to use sw() within the i() functionality of fixest. I also tried writing a function and using map to output the models by feeding it a list, but I ran into trouble there as well with my limited knowledge of NSE. Here is a simplified version of my dataset:
df <- data.frame(Year = rep(2000:2007), fipscode = c("001", "002"),
                 conditional_ban = rep(0:1), registration = rep(0:1),
                 pct = rnorm(n = 16, mean = .02, sd = .005))
There are two ways I have tried to go about doing this. The first is using sw() within feols:
require(fixest)
res <- feols(fml = pct ~ i(sw(registration, conditional_ban), Year) |
               factor(Year) + factor(fipscode),
             data = df)
etable(res)
This produces the error "You cannot combine stepwise functions with any other element." Upon that discovery, I tried to create a function for my model and then use it in map:
# function that takes a restriction as its argument and results in a table summary with the name
# of the restriction
model_restr <- function(restr){
  mod <- eval(substitute(feols(fml = pct ~ i(restr, Year) | factor(Year) + factor(fipscode),
                               data = df)))
  modsum <- summary(mod)
  varName <- deparse(substitute(restr))
  . <- etable(modsum, signifCode = c("***" = 0.01, "**" = 0.05, "*" = 0.1))
  assign(varName, ., envir = globalenv())
}
# Then, I create a list with the restrictions
restrs <- list(names = c("conditional_ban", "registration"))
# Next, I try to use map to loop the function over the list
map(restrs$names, model_restr)
This results in the error: "Error in feols(fml = pct ~ i(.x[[i]], Year) | factor(Year) + : The variables '.x' and 'i' are in the RHS (first part) of the formula but not in the data set."
Now, I understand that I am feeding a quoted argument to the function model_restr, and I've tried to use !! or noquote() to unquote the argument, like this:
model_restr <- function(restr){
  restr <- noquote(restr)
  mod <- eval(substitute(feols(fml = pct ~ i(restr, Year) | factor(Year) + factor(fipscode),
                               data = df)))
  modsum <- summary(mod)
  varName <- deparse(substitute(restr))
  . <- etable(modsum, signifCode = c("***" = 0.01, "**" = 0.05, "*" = 0.1))
  assign(varName, ., envir = globalenv())
}
I know I am probably missing something important here when it comes to NSE, but I cannot figure out what exactly is going wrong. I am pretty new to R and programming, but I am eager to understand this so that I can speed up my workflow. I appreciate your help!!
Simply insert the i()s into the sw() and that will do:
library(fixest)
data(base_did) ; base = base_did
# Alternative treatment indicator, purely random
base$treat_bis = 1 * (base$id %in% sample(108, 50))
# 2 estimations
res = feols(y ~ sw(i(treat, period, 5), i(treat_bis, period, 5)) | id + period, base)
etable(res)
#> model 1 model 2
#> Dependent Var.: y y
#>
#> treat x period = 1 -2.015 (1.342)
#> treat x period = 2 -1.664 (1.389)
#> treat x period = 3 0.5041 (1.323)
#> treat x period = 4 -0.8846 (1.416)
#> treat x period = 6 1.159 (1.227)
#> treat x period = 7 4.335** (1.308)
#> treat x period = 8 3.826* (1.514)
#> treat x period = 9 4.640*** (1.293)
#> treat x period = 10 6.947*** (1.426)
#> treat_bis x period = 1 -1.565 (1.358)
#> treat_bis x period = 2 -1.200 (1.425)
#> treat_bis x period = 3 -3.350* (1.301)
#> treat_bis x period = 4 -3.538* (1.385)
#> treat_bis x period = 6 -2.660* (1.227)
#> treat_bis x period = 7 -2.274 (1.382)
#> treat_bis x period = 8 -3.399* (1.529)
#> treat_bis x period = 9 -2.724. (1.374)
#> treat_bis x period = 10 -2.177 (1.615)
#> Fixed-Effects: ---------------- ---------------
#> id Yes Yes
#> period Yes Yes
#> _______________________ ________________ _______________
#> S.E.: Clustered by: id by: id
#> Observations 1,080 1,080
#> R2 0.32343 0.26744
#> Within R2 0.08710 0.01155
I took the liberty to change the example since the interactions were collinear with the fixed-effects.
By the way, factor() is not needed in the RHS of the | (it only makes the estimation slower).
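If you have many restriction columns, here is a hedged sketch of building the sw() call programmatically by pasting the formula as a string, which sidesteps the NSE trouble from the question (names match the question's df):
restrs <- c("registration", "conditional_ban")
fml <- as.formula(paste0("pct ~ sw(",
                         paste0("i(", restrs, ", Year)", collapse = ", "),
                         ") | Year + fipscode"))
res <- feols(fml, data = df)
etable(res)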
I was wondering why lm() says 5 coefs not defined because of singularities and then gives all NA in the summary output for 5 coefficients.
Note that all my predictors are categorical.
Is there anything wrong with my data on these 5 coefficients or code? How can I possibly fix this?
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T) # Data
nms <- c("Age","genre","Length","cf.training","error.type","cf.scope","cf.type","cf.revision")
d[nms] <- lapply(d[nms], as.factor) # make factor
vv <- lm(dint~Age+genre+Length+cf.training+error.type+cf.scope+cf.type+cf.revision, data = d)
summary(vv)
First 6 lines of output:
Coefficients: (5 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.17835 0.63573 0.281 0.779330
Age1 -0.04576 0.86803 -0.053 0.958010
Age2 0.46431 0.87686 0.530 0.596990
Age99 -1.64099 1.04830 -1.565 0.118949
genre2 1.57015 0.55699 2.819 0.005263 **
genre4 NA NA NA NA ## for example, this row is all `NA`s; there are 4 more like it!
As others noted, one problem is that you seem to have multicollinearity. Another is that there are missing values in your dataset. The missing values should probably just be removed. As for correlated variables, you should inspect your data to identify this collinearity and remove it. Deciding which variables to remove and which to retain is a very domain-specific topic. However, if you wish, you could use regularisation and fit a model while retaining all variables. This also allows you to fit a model when n (the number of samples) is less than p (the number of predictors).
The code below demonstrates how to examine the correlation structure within your data and identify which variables are most correlated (thanks to this answer). I've also included an example of fitting such a model using L2 regularisation (commonly known as ridge regression).
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T) # Data
nms <- c("Age","genre","Length","cf.training","error.type","cf.scope","cf.type","cf.revision")
d[nms] <- lapply(d[nms], as.factor) # make factor
vv <- lm(dint~Age+genre+Length+cf.training+error.type+cf.scope+cf.type+cf.revision, data = d)
df <- d
df[] <- lapply(df, as.numeric)
cor_mat <- cor(as.matrix(df), use = "complete.obs")
library("gplots")
heatmap.2(cor_mat, trace = "none")
## https://stackoverflow.com/questions/22282531/how-to-compute-correlations-between-all-columns-in-r-and-detect-highly-correlate
library("tibble")
library("dplyr")
library("tidyr")
d2 <- df %>%
  as.matrix() %>%
  cor(use = "complete.obs") %>%
  ## Set diag (a vs a) to NA, then remove
  (function(x) {
    diag(x) <- NA
    x
  }) %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>%
  filter(!is.na(value)) %>%
  ## Sort by decreasing absolute correlation
  arrange(-abs(value))
## 2 pairs of variables are almost exactly correlated!
head(d2)
#> var1 var2 value
#> 1 id study.name 0.9999430
#> 2 study.name id 0.9999430
#> 3 Location timed 0.9994082
#> 4 timed Location 0.9994082
#> 5 Age ed.level 0.7425026
#> 6 ed.level Age 0.7425026
## Remove some variables here, or maybe try regularized regression (see below)
library("glmnet")
## glmnet requires matrix input
X <- d[, c("Age", "genre", "Length", "cf.training", "error.type", "cf.scope", "cf.type", "cf.revision")]
X[] <- lapply(X, as.numeric)
X <- as.matrix(X)
ind_na <- apply(X, 1, function(row) any(is.na(row)))
X <- X[!ind_na, ]
y <- d[!ind_na, "dint"]
glmnet <- glmnet(
x = X,
y = y,
## alpha = 0 is ridge regression
alpha = 0)
plot(glmnet)
Created on 2019-11-08 by the reprex package (v0.3.0)
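To choose the penalty strength rather than eyeballing the coefficient path, here is a hedged sketch using cross-validation via cv.glmnet from the same package (lambda.min is the penalty minimising cross-validated error; assumes the X and y built above):
set.seed(1)
cvfit <- cv.glmnet(x = X, y = y, alpha = 0)
cvfit$lambda.min
coef(cvfit, s = "lambda.min")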
In such a situation you can use the "olsrr" package in R for stepwise regression analysis. I am providing sample code below for stepwise regression analysis in R.
library("olsrr")
#Load the data
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T)
# stepwise regression
vv <- lm(dint ~ Age + genre + Length + cf.training + error.type + cf.scope + cf.type + cf.revision, data = d)
summary(vv)
k <- ols_step_both_p(vv, pent = 0.05, prem = 0.1)
# stepwise regression plot
plot(k)
# final model
k$model
It will provide you exactly the same output as that of SPSS.
In the past I've used the lm function with matrix-type and data.frame-type data. But I guess this is the first time I have tried to use predict with a model fitted without a data.frame, and I can't figure out how to make it work.
I read some other questions (such as Getting Warning: "'newdata' had 1 row but variables found have 32 rows" on predict.lm), and I'm pretty sure that my problem is related to the coefficient names I'm getting after fitting the model. For some reason, the coefficient names are a paste of the matrix name with the column name, and I haven't been able to find out how to fix that.
library(tidyverse)
library(MASS)
set.seed(1)
label <- sample(c(T,F), nrow(Boston), replace = T, prob = c(.6,.4))
x.train <- Boston %>% dplyr::filter(., label) %>%
  dplyr::select(-medv) %>% as.matrix()
y.train <- Boston %>% dplyr::filter(., label) %>%
  dplyr::select(medv) %>% as.matrix()
x.test <- Boston %>% dplyr::filter(., !label) %>%
  dplyr::select(-medv) %>% as.matrix()
y.test <- Boston %>% dplyr::filter(., !label) %>%
  dplyr::select(medv) %>% as.matrix()
fit_lm <- lm(y.train ~ x.train)
fit_lm2 <- lm(medv ~ ., data = Boston, subset = label)
predict(object = fit_lm, newdata = x.test %>% as.data.frame()) %>% length()
predict(object = fit_lm2, newdata = x.test %>% as.data.frame()) %>% length()
# they return different numbers of predictions;
# the first one returns a number of results consistent with x.train, not x.test
Any help will be welcome.
I can't fix your tidyverse code because I don't work with this package. But I am able to explain why predict fails in the first case.
Let me just use the built-in dataset trees for a demonstration:
head(trees, 2)
# Girth Height Volume
#1 8.3 70 10.3
#2 8.6 65 10.3
The normal way to use lm is
fit <- lm(Girth ~ ., trees)
The variable names (on the RHS of ~) are
attr(terms(fit), "term.labels")
#[1] "Height" "Volume"
You need to provide these variables in newdata when using predict.
predict(fit, newdata = data.frame(Height = 1, Volume = 2))
# 1
#11.16125
Now if you fit a model using a matrix:
X <- as.matrix(trees[2:3])
y <- trees[[1]]
fit2 <- lm(y ~ X)
attr(terms(fit2), "term.labels")
#[1] "X"
The variable you need to provide in newdata for predict is now X, not Height or Girth. Note that since X is a matrix variable, you need to protect it with I() when feeding it to a data frame.
newdat <- data.frame(X = I(cbind(1, 2)))
str(newdat)
#'data.frame': 1 obs. of 1 variable:
# $ X: AsIs [1, 1:2] 1 2
predict(fit2, newdat)
# 1
#11.16125
It does not matter that cbind(1, 2) has no column names. What is important is that this matrix is named X in newdat.
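Applying the same idea to your Boston example above (a sketch): name the test matrix x.train in newdata, matching the term label of fit_lm, and protect it with I().
newdat2 <- data.frame(x.train = I(x.test))
predict(object = fit_lm, newdata = newdat2) %>% length()
# now consistent with nrow(x.test)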