Run multiple similar models on different outcomes in R

I would like to run the same model specification on different outcomes in a neat way, instead of writing out a model for each outcome. I would also like to iteratively hold out one unit at a time (e.g. a county) from the model to check whether single observations drive the results. I have tried creating a for loop, but without luck so far.
library(lfe)
## Create long-format dataset. The unit of analysis is county-year,
## i.e. one observation equals a county in a given year.
## The independent variable x is a dummy (0, 1).
year <- c(2007, 2007, 2007, 2007, 2007, 2009, 2009, 2009, 2009, 2009)
county <- c("county1", "county2", "county3", "county4", "county5",
            "county1", "county2", "county3", "county4", "county5")
x <- c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
y1 <- c(2.5, 8, 10, 7, 2, 3, 13, 17, 4.5, 1.3)
y2 <- c(6.5, 2, 3, 18, 2, 14, 7.6, 2.4, 8.2, 4.9)
y3 <- c(5.2, 2, 5, 7.5, 5, 9, 3, 1.7, 2.5, 5.3)
D <- data.frame(year, county, x, y1, y2, y3)
# I have multiple dependent variables: y1, y2, y3, y4 and so on. I only have one
# independent variable, x. I want to estimate the model specification below for each
# dependent variable in a smart way, without having to write it out each time.
m1 <- felm(y1 ~ x                           # outcome regressed on treatment
           | factor(county) + factor(year)  # county and time fixed effects
           | 0                              # no IVs
           | county,                        # SEs clustered on county
           data = D)
# Furthermore, I'd like to iteratively hold out/remove one county or year while estimating a model, to check if they are driving the results

Here's a function that should do it:
library(lfe)
#> Loading required package: Matrix
## Create long-format dataset. The unit of analysis is county-year,
## i.e. one observation equals a county in a given year.
## The independent variable x is a dummy (0, 1).
year <- c(2007, 2007, 2007, 2007, 2007, 2008, 2008, 2008, 2008, 2008, 2009, 2009, 2009, 2009, 2009)
county <- c("county1", "county2", "county3", "county4", "county5",
"county1", "county2", "county3", "county4", "county5",
"county1", "county2", "county3", "county4", "county5")
x <- c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0)
y1 <- c(2.5, 8, 10, 7, 2, 3, 13, 17, 4.5, 1.3, 4,7,2,3,5)
y2 <- c(6.5, 2, 3, 18, 2, 14, 7.6, 2.4, 8.2, 4.9, 5,2,4,6,2)
y3 <- c(5.2, 2, 5, 7.5, 5, 9, 3, 1.7, 2.5, 5.3, 8,7,3,4,6)
D <- data.frame(year, county, x, y1, y2, y3)
# I have multiple dependent variables: y1, y2, y3, y4 and so on. I only have one
# independent variable, x. I want to estimate the model specification below for each
# dependent variable in a smart way, without having to write it out each time.
m1 <- felm(y1 ~ x                           # outcome regressed on treatment
           | factor(county) + factor(year)  # county and time fixed effects
           | 0                              # no IVs
           | county,                        # SEs clustered on county
           data = D)
jfun <- function(model, data, remove = NULL) {
  if (is.null(remove)) {
    stop("Must choose a variable whose values will be jackknifed out.\n")
  }
  dat <- get_all_vars(model, data)
  if (!(remove %in% names(dat))) {
    stop("The remove variable must be in the model.\n")
  }
  obs <- unique(dat[[remove]])
  res <- NULL
  for (i in seq_along(obs)) {
    subd <- subset(dat, dat[[remove]] != obs[i]) # drop every row with this value
    mod <- update(model, data = subd)            # refit the model on the reduced data
    res <- rbind(res, coef(mod))                 # collect the coefficients
  }
  cbind(data.frame(obs_removed = obs), res)
}
jfun(m1, D, "county")
#>   obs_removed         x
#> 1     county1 -1.050000
#> 2     county2 -1.250000
#> 3     county3 -1.163333
#> 4     county4 -3.991667
#> 5     county5 -0.562500
jfun(m1, D, "year")
#>   obs_removed          x
#> 1        2007 -3.4857143
#> 2        2008 -3.5000000
#> 3        2009  0.5083333
Created on 2022-03-06 by the reprex package (v2.0.1)
The function jfun() takes a model object (the one you want to jackknife), the dataset used to fit it, and a string naming the variable whose values you would like to jackknife. The function identifies all unique values of that variable and then, in a loop, removes each one in turn, saving the model coefficients.
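If you also want to run the jackknife over every outcome, one possible extension (a sketch, not part of the original answer) is to build each outcome's formula as a string and inline it into the model call with do.call(), so that the update() step inside jfun() can still re-evaluate the call:
outcomes <- c("y1", "y2", "y3")
jack_list <- lapply(setNames(outcomes, outcomes), function(v) {
  f <- as.formula(paste(v, "~ x | factor(county) + factor(year) | 0 | county"))
  # do.call() splices the formula object into the stored call, so that
  # update(model, data = ...) inside jfun() can re-evaluate it safely
  m <- do.call("felm", list(formula = f, data = D))
  jfun(m, D, "county")
})
jack_list$y2 # jackknife table for outcome y2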

## 1. Fitting models on different outcomes.
# My solution redefines the data frame passed to "data" at each iteration.
# The trick is to select only the desired columns.
model.list <- vector(mode = "list", length = 3) # Pre-allocate a list as long as your outcomes.
j <- 1 # Counter.
for (i in c("y1", "y2", "y3")) {
  # Select the outcome for this iteration and rename it "y".
  temp.dta <- data.frame(y = D[, i], D[, !colnames(D) %in% c("y1", "y2", "y3")])
  model.list[[j]] <- felm(y ~ x | factor(county) + factor(year) | 0 | county,
                          data = temp.dta) # Store the fit in the j-th position.
  j <- j + 1 # Increase the counter.
}
summary(model.list[[1]]) # Model fitted on y1.
## 2. Fitting the same model n times, with the h-th observation removed, h = 1, ..., n.
# With similar reasoning (i.e., redefining the data frame), we can omit one row at each
# iteration. For simplicity, focus on y1.
model.list2 <- vector(mode = "list", length = nrow(D)) # Pre-allocate a list as long as your data.
for (h in seq_len(nrow(D))) {
  model.list2[[h]] <- felm(y1 ~ x | factor(county) + factor(year) | 0 | county,
                           data = D[-h, ]) # Notice the h-th row is omitted.
}
summary(model.list2[[1]]) # Model with first row omitted.
## 3. Combining both ideas -> just combine both solutions (nested loops).
Maybe it's not the most elegant solution, but it works, and it is easy to understand and implement.
Regarding the first question, we can use a for loop to redefine the data frame we want to use at each iteration. The idea is to select only the columns we want to use for the fit, that is, the covariates (which stay constant across iterations) and the desired outcome. Notice that I always name the outcome column y, so I do not have to worry about changing the formula as well. With the data frame so defined (stored in temp.dta), we can fit all the models by setting data = temp.dta within felm(). Results are stored in the list model.list, which must be defined before the loop.
The same trick can be used to fit the model several times while dropping one observation at a time. Now, rather than selecting columns, we select rows. In this case we do not need to redefine the data frame, as we can directly subset our sample in the data parameter.
Notice that for the second solution I focused on y1 for simplicity. If you want to fit the model for all three outcomes, and for each of them repeat the operation while dropping one observation at a time, just combine the solutions with two nested loops, as sketched below. Sort of "the proof is left as an exercise for the reader".
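For completeness, here is a minimal sketch of how those nested loops could look, reusing only D and the felm() specification from above:
results <- list()
for (i in c("y1", "y2", "y3")) {
  # Rebuild the data with the current outcome renamed to "y", as in solution 1.
  temp.dta <- data.frame(y = D[, i], D[, !colnames(D) %in% c("y1", "y2", "y3")])
  fits <- vector(mode = "list", length = nrow(temp.dta))
  for (h in seq_len(nrow(temp.dta))) {
    # Drop the h-th row, as in solution 2.
    fits[[h]] <- felm(y ~ x | factor(county) + factor(year) | 0 | county,
                      data = temp.dta[-h, ])
  }
  results[[i]] <- fits
}
summary(results[["y2"]][[1]]) # model for y2 with the first row dropped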

Related

variogram function in R returns one single observation

I'm trying to construct a variogram cloud in R using the variogram function from the gstat package. I'm not sure if there's something about the topic that I've misunderstood, but surely I should get more than one observation, right? Here's my code:
library(dplyr) # rename()
library(sp)    # coordinates()
library(gstat) # variogram()
data = data.frame(matrix(c(2, 4, 8, 7, 6, 4, 7, 9, 4, 4, -1.01, .05, .47, 1.36, 1.18),
                         nrow = 5, ncol = 3))
data = rename(data, X = X1, Y = X2, Z = X3)
coordinates(data) = c("X", "Y")
var.cld = variogram(Z ~ 1, data = data, cloud = TRUE)
And here's the output:
> var.cld
  dist  gamma dir.hor dir.ver   id left right
1    1 0.0162       0       0 var1    5     4
I found the problem! Apparently the default value of the cutoff argument was too low for my specific set of data. Specifying a higher value resulted in additional observations.
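For example (cutoff = 10 is just an illustrative guess; pick a distance suited to the scale of your coordinates):
# A larger cutoff admits more point pairs into the cloud.
var.cld = variogram(Z ~ 1, data = data, cloud = TRUE, cutoff = 10)
var.cld # now one row per point pair within the cutoff distance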

Applying 'clustering functions' to a series of linear models

I want to iterate over a list of linear models and apply "clustered" standard errors to each model using the vcovCL function. My goal is to do this as efficiently as possible (I am running a linear model across many columns of a dataframe). My problem is trying to specify additional arguments inside the anonymous function. Below I simulate some fake data. Precincts represent my cross-sectional dimension; months represent my time dimension (5 units observed across 4 months). The variable int is a dummy for when an intervention takes place.
df <- data.frame(
  precinct = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4), rep(5, 4)),
  month = rep(1:4, 5),
  crime = rnorm(20, 10, 5),
  int = c(c(0, 1, 1, 0), rep(0, 4), rep(0, 4), c(1, 1, 1, 0), rep(0, 4))
)
df[1:10, ]
outcome <- df[3]
library(sandwich) # vcovCL()
est <- lapply(outcome, FUN = function(x) { lm(x ~ as.factor(precinct) + as.factor(month) + int, data = df) })
se <- lapply(est, function(x) { sqrt(diag(vcovCL(x, cluster = ~ precinct + month))) })
I receive the following error message when adding the cluster argument inside the vcovCL() call.
Error in eval(expr, envir, enclos) : object 'x' not found
The only way around it, in my estimation, would be to index the dataframe, i.e., df$, and then specify the 'clustering' variables. Could this be achieved by specifying an additional argument for df inside of the function call? Is this code efficient?
Maybe specifying the model equation formulaically is a better way to go, I suppose.
Any thoughts/comments are always helpful :)
Here is one approach that would retrieve clustered standard errors for multiple models:
library(sandwich)
# I am going to use the same model three times to get the "sequence" of linear models.
mod <- lm(crime ~ as.factor(precinct) + as.factor(month) + int, data = df)
# define function to retrieve standard errors:
robust_se <- function(mod) {sqrt(diag(vcovCL(mod, cluster = list(df$precinct, df$month))))}
# apply function to all models:
se <- lapply(list(mod, mod, mod), robust_se)
If you want to get the entire output adjusted, the following might be helpful:
library(lmtest)
adj_stats <- function(mod) {coeftest(mod, vcovCL(mod, cluster = list(df$precinct, df$month)))}
adjusted_models <- lapply(list(mod, mod, mod), adj_stats)
To address the multiple column issue:
In case you are struggling with running linear models over several columns, the following might be helpful. All the above would stay the same, except that you are passing your list of models to lapply.
First, let's use this dataframe here:
df <- data.frame(
  precinct = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4), rep(5, 4)),
  month = rep(1:4, 5),
  crime = rnorm(20, 10, 5),
  crime2 = rnorm(20, 10, 5),
  crime3 = rnorm(20, 10, 5),
  int = c(c(0, 1, 1, 0), rep(0, 4), rep(0, 4), c(1, 1, 1, 0), rep(0, 4))
)
Let's define the outcome columns:
outcome_columns <- c("crime", "crime2", "crime3")
Now, let's run a regression with each outcome:
models <- lapply(outcome_columns, function(outcome) {
  lm(eval(parse(text = paste0(outcome, " ~ as.factor(precinct) + as.factor(month) + int"))),
     data = df)
})
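If you prefer to avoid parse()/eval(), base R's reformulate() builds the same formulas from strings; this is a stylistic alternative, not a change in results:
models <- lapply(outcome_columns, function(outcome) {
  f <- reformulate(c("as.factor(precinct)", "as.factor(month)", "int"),
                   response = outcome) # builds e.g. crime ~ as.factor(precinct) + ...
  lm(f, data = df)
})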
And then you would just call
adjusted_models <- lapply(models, adj_stats)
Regarding efficiency:
The above code is efficient in that it is easily adjustable and quick to write up. For most use cases, it will be perfectly fine. For computational efficiency, note that your design matrix is the same in all cases, i.e. by precomputing the common elements (e.g. inv(X'X)*X'), you could save some computations. You would however lose out on the convenience of many built-in functions.
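A minimal sketch of that precomputation idea, reusing the objects from the example above: since the design matrix is identical for every outcome, factor it once and solve for all outcome columns jointly.
X <- model.matrix(~ as.factor(precinct) + as.factor(month) + int, data = df)
qr_X <- qr(X)                       # one QR factorization, shared by all fits
Y <- as.matrix(df[outcome_columns]) # outcome columns side by side
beta <- qr.coef(qr_X, Y)            # one column of coefficients per outcome
Base R's lm() does essentially this when given a matrix response, e.g. lm(cbind(crime, crime2, crime3) ~ as.factor(precinct) + as.factor(month) + int, data = df).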

Stratified cluster sampling estimates from survey package

I want to estimate means and totals from a stratified sampling design in which single stage cluster sampling was used in each stratum. I believe I have the design properly specified using the svydesign() function of the survey package. But I'm not sure how to correctly specify the stratum weights.
Example code is shown below. I provide unadjusted stratum weights using the weights= argument. I expected that the estimate and the SE from svytotal() would be equal to the sum of the stratum weights (70, in the example) times the estimate and SE from svymean(). Instead the estimates differ by a factor of 530 (which is the sum of the stratum weights over all of the elements in the counts data) and the SEs differ by a factor of 898 (???). My questions are (1) how can I provide my 3 stratum weights to svydesign() in a way that it understands, and (2) why aren't the estimates and SEs from svytotal() and svymean() differing by the same factor?
library(survey)
# example data from a stratified sampling design in which
# single stage cluster sampling is used in each stratum
counts <- data.frame(
  Stratum = rep(c("A", "B", "C"), c(5, 8, 8)),
  Cluster = rep(1:8, c(3, 2, 3, 2, 3, 2, 3, 3)),
  Element = c(1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 3),
  Count = 1:21
)
# stratum weights
weights <- data.frame(
  Stratum = c("A", "B", "C"),
  W = c(10, 20, 40)
)
# combine counts and weights
both <- merge(counts, weights)
# estimate mean and total count
D <- svydesign(id=~Cluster, strata=~Stratum, weights=~W, data=both)
a <- svymean(~Count, D)
b <- svytotal(~Count, D)
sum(weights$W) # 70
sum(both$W) # 530
coef(b)/coef(a) # 530
SE(b)/SE(a) # 898.4308
First update
I'm adding a diagram to help explain my design. The entire population is a lake with known area (70 ha in this example). The strata have known areas, too (10, 20, and 40 ha). The number of clusters allocated to each stratum was not proportional. Also, the clusters are tiny relative to the number that could possibly be sampled, so the finite population correction is FPC = 1.
I want to calculate an overall mean and SE on a per unit area basis and a total that is equal to 70 times this mean and SE.
Second update
I wrote the code to do the calculations from scratch. I get a total estimate of 920 with se 61.6.
library(survey)
library(tidyverse)
# example data from a stratified sampling design in which
# single stage cluster sampling is used in each stratum
counts <- data.frame(
  Stratum = rep(c("A", "B", "C"), c(5, 8, 8)),
  Cluster = rep(1:8, c(3, 2, 3, 2, 3, 2, 3, 3)),
  Element = c(1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 3),
  Count = c(5:1, 6:21)
)
# stratum weights
areas <- data.frame(
  Stratum = c("A", "B", "C"),
  A_h = c(10, 20, 40)
)
# calculate cluster means
step1 <- counts %>%
  group_by(Stratum, Cluster) %>%
  summarise(P_hi = sum(Count), m_hi = n())
step2 <- step1 %>%
  group_by(Stratum) %>%
  summarise(
    ybar_h = sum(P_hi) / sum(m_hi),
    n_h = n(),
    sh.numerator = sum((P_hi - ybar_h * m_hi)^2),
    mbar_h = mean(m_hi)
  ) %>%
  mutate(
    S_ybar_h = 1 / mbar_h * sqrt(sh.numerator / (n_h * (n_h - 1)))
  )
# now expand up to strata
step3 <- step2 %>%
  left_join(areas) %>%
  mutate(
    W_h = A_h / sum(A_h)
  ) %>%
  summarise(
    A = sum(A_h),
    ybar_strat = sum(W_h * ybar_h),
    S_ybar_strat = sum(W_h * S_ybar_h / sqrt(n_h))
  ) %>%
  mutate(
    tot = A * ybar_strat,
    S_tot = A * S_ybar_strat
  )
step2
step3
This gives the following output:
> step2
# A tibble: 3 x 6
  Stratum ybar_h   n_h sh.numerator   mbar_h S_ybar_h
   <fctr>  <dbl> <int>        <dbl>    <dbl>    <dbl>
1       A    3.0     2         18.0 2.500000 1.200000
2       B    9.5     3        112.5 2.666667 1.623798
3       C   17.5     3         94.5 2.666667 1.488235
> step3
# A tibble: 1 x 5
      A ybar_strat S_ybar_strat   tot   S_tot
  <dbl>      <dbl>        <dbl> <dbl>   <dbl>
1    70   13.14286    0.8800657   920 61.6046
(Revised answer to revised question)
In this case svytotal isn't what you want -- it's for the actual population total of the elements being sampled, and so doesn't make sense when the population is thought of as infinitely bigger than the sample. The whole survey package is really designed for discrete, finite populations, but we can work around it.
I think you want to get a mean for each stratum and then multiply it by the stratum weights. To do that,
D <- svydesign(id = ~Cluster, strata = ~Stratum, data = both)
means <- svyby(~Count, ~Stratum, svymean, design = D)
svycontrast(means, quote(10*A + 20*B + 40*C))
You'll get a warning
Warning message:
In vcov.svyby(stat) : Only diagonal elements of vcov() available
That's because svyby doesn't return covariances between the stratum means. It's harmless, because the strata really are independent samples (that's what stratification means) so the covariances are zero.
svytotal is doing what I think it should do here: weights are based on sampling probability, so they are only defined for sampling units. The svydesign call applied those weights to the clusters and (because of the cluster sampling) to the elements, giving the 530-fold higher total. You need to supply either observation weights or enough information for svydesign to calculate them itself. If this is cluster sampling with no subsampling, you can divide the stratum weight over the clusters to get the cluster weight, and then divide this over the elements within a cluster to get the observation weight. Or, if the stratum weight is the number of clusters in the population, you can use the fpc argument to svydesign.
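A hedged sketch of that weight-splitting suggestion, reusing the both data frame from the question (the column names w_cluster and w_obs are purely illustrative):
library(dplyr)
both_w <- both %>%
  group_by(Stratum) %>%
  mutate(w_cluster = W / n_distinct(Cluster)) %>% # stratum weight split over its clusters
  group_by(Stratum, Cluster) %>%
  mutate(w_obs = w_cluster / n()) %>%             # cluster weight split over its elements
  ungroup()
D2 <- svydesign(id = ~Cluster, strata = ~Stratum, weights = ~w_obs, data = both_w)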
The fact that the SE doesn't scale the same way as the point estimate is because the population size is unknown and has to be estimated. The mean is the estimated total divided by the estimated population size, and the SE estimate takes account of the variance of the denominator and its covariance with the numerator.

Which packages in R allow for lagged variables for time series analysis?

I would like to include multiple lags of an exogenous variable in a regression. Let's say that I have the following data:
X = c(1, 4, 8, 9, 3, 5...)
X2 = c(4, 6, 7, 9, 7, 8...)
I want to use lags of X2 to predict X. Does anyone know which package allows me to do this? I have tried using dynlm and lag() from stats.
Thanks
library(zoo)
set.seed(1111)
x <- as.zoo(rnorm(10, 0, 0.02))
y <- lag(x, 2, na.pad = TRUE)
cbind(x, y)
This performs an ordinary linear regression of X on the first two lags of X2 with an intercept (fit2), on the first lag with an intercept (fit1), and on just an intercept (fit0). Note that in R one normally uses negative numbers to lag, so for convenience we define a Lag function that uses positive numbers to indicate lags. lag.zoo allows vector lags, so Lag(z2, 1:2) has two columns, one for each of the two lags.
library(dyn)
X = c(1, 4, 8, 9, 3, 5)
X2 = c(4, 6, 7, 9, 7, 8)
z <- zoo(X)
z2 <- zoo(X2)
Lag <- function(x, k = 1) lag(x, k = -k)
fit2 <- dyn$lm(z ~ Lag(z2, 1:2))
fit1 <- dyn$lm(z ~ Lag(z2))
fit0 <- dyn$lm(z ~ 1)
For example, here is fit2.
> fit2
Call:
lm(formula = dyn(z ~ Lag(z2, 1:2)))
Coefficients:
  (Intercept)  Lag(z2, 1:2)1  Lag(z2, 1:2)2
      19.3333        -1.4242        -0.4242
Here is a comparison of the three fits showing that the one- and two-lag fits are not significantly better than just using the intercept; however, there is quite a drop in residual sum of squares from adding the first lag to the intercept-only model, so you might want to ignore the statistical significance and use the first lag anyway.
> anova(fit0, fit1, fit2)
Analysis of Variance Table
Model 1: z ~ 1
Model 2: z ~ Lag(z2)
Model 3: z ~ Lag(z2, 1:2)
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1      3 22.7500
2      2  8.4211  1   14.3289 2.1891 0.3784
3      1  6.5455  1    1.8756 0.2865 0.6871
It would also be possible to use the ts class in place of the zoo class; however, lag.ts does not support vector lags, so with ts each term would have to be written out separately. Lag is from above.
tt <- ts(X)
tt2 <- ts(X2)
fits12_ts <- dyn$lm(tt ~ Lag(tt2) + Lag(tt2, 2))
No external R library is required, I would say
X2 <- c(4, 6, 7, 9, 7, 8)
# The value at position t is x[t - k]; the first k positions are padded with NA.
lagged <- function(x, k) c(rep(NA, k), head(x, -k))
lagged(X2, 2)
# [1] NA NA  4  6  7  9
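The lagged vector can then go straight into lm(), which drops the NA rows automatically:
X <- c(1, 4, 8, 9, 3, 5)
fit <- lm(X ~ lagged(X2, 2)) # regress X on the 2-period lag of X2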

How to grab coefficients with R when estimating a Zero Inflation Model

Probably pretty easy, but I want to know: how do I grab coefficients when using the zeroinfl() function?
treatment <- factor(rep(c(1, 2), c(43, 41)),
                    levels = c(1, 2), labels = c("placebo", "treated"))
improved <- factor(rep(c(1, 2, 3, 1, 2, 3), c(29, 7, 7, 13, 7, 21)),
                   levels = c(1, 2, 3), labels = c("none", "some", "marked"))
numberofdrugs <- rpois(84, 2)
healthvalue <- rpois(84, 0.5)
y <- data.frame(healthvalue, numberofdrugs, treatment, improved)
require(pscl)
ZIP <- zeroinfl(healthvalue ~ numberofdrugs + treatment + improved, data = y)
summary(ZIP)
I usually use ZIP$coef[1] to grab a coefficient, but unfortunately here that grabs a whole bunch. So how can I grab a single coefficient from a ZIP model?
Use the coef extraction function to list all coefficients in one long vector, and then you can use single index notation to select them:
coef(ZIP)[1]
count_(Intercept)
0.1128742
Alternatively, you need to select which model you want to get the coefficients from first:
ZIP$coef$count[1]
(Intercept)
0.1128742
ZIP$coef[[1]][1]
(Intercept)
0.1128742
If you wanted to get fancy you could split the coefficients into a list:
clist <- function(m) {
  cc <- coef(m)
  ptype <- gsub("_.+$", "", names(cc))     # component prefix: "count" or "zero"
  ss <- split(cc, ptype)
  lapply(ss, function(x) {
    names(x) <- gsub("^.*_", "", names(x)) # strip the prefix from the names
    x                                      # return the renamed vector, not the names
  })
}
> clist(ZIP)
$count
(Intercept) numberofdrugs treatmenttreated improvedsome
-1.16112045 0.16126724 -0.07200549 -0.34807344
improvedmarked
0.23593220
$zero
(Intercept) numberofdrugs treatmenttreated improvedsome
7.509235 -14.449669 -58.644743 -8.060501
improvedmarked
58.034805
c1 <- clist(ZIP)
c1$count["numberofdrugs"]
