IV regression computation

IV regression computation - r

For my thesis I am doing an Instrumental Variables (IV) regression and I was wondering if I did it the right way. Couple of issues I have:
Comparing the linear model with the IV models, the sign of the effect changes (positive to negative or the other way round).
Using Two Stage Least Squares (2SLS) with ivreg (from the AER package) gives negative R² values, so I decided to manually compute the 2SLS estimates. These give the same estimates as the ivreg code but now with statistically significant results.
I have limited data and therefore I did not expect any significant results as I already did some non-parametric tests and the means of the different groups were not significantly different.
I am researching the effect of policies of organizations on a given budget.
The organization performs well if the budget residual is positive, so they have less costs than budgeted.
The variable is a percentage, either positive or negative.
There is non random selection into treatment as organizations can determine their own policy.
Furthermore, the policy factors are mostly dummy variables, 19 variables are binary and 2 are categorical and 1 is ratio.
My IV is any number between 0 and 1.
This is what I did:
1. I estimate a simple Ordinary Least Squares model to see what it would do (I know the results don't mean anything).
lm1 <- lm(budget ~ policy1, data=df)
lm2 <- lm(budget ~ policy2, data=df)
summ(lm1)
summ(lm2)
2. Then I performed an IV with the ivreg code, though the R² became negative which I thought was weird.
ivreg1 <- ivreg(budget ~ policy1| iv, data=df)
ivreg2 <- ivreg(budget ~ policy2 | iv, data=df)
library(stargazer)
stargazer(ivreg1, ivreg2, dep.var.labels=c("Budget"), covariate.labels = c("policy 1", "policy2") , align=TRUE, column.sep.width = "-15pt", font.size = "small", type="text")
3. So I tried to do the 2SLS in steps myself.
Instead of fitted.values(reg1) I also used predict(reg1). This gives the same output.
attach(df)
reg1<- lm(policy1~iv)
policy1.hat <- fitted.values(reg1)
reg2 <- lm(policy2~iv)
policy2.hat <- fitted.values(reg2)
ivreg3 <- lm(budget~policy1.hat)
ivreg4 <- lm(budget~policy2.hat)
stargazer(ivreg1, ivreg2, dep.var.labels=c("Budget"), covariate.labels = c("policy 1", "policy2"), align=TRUE, column.sep.width = "-15pt", font.size = "small", type="text")
detach(df)
With this step I got a positive adjusted R² but I noticed that the policy factors are now significant and that the sign compared to the lm model changes.
Question:
Am I computing the IV regression wrong?
Example data (not real numbers due to anonymity of data):
df <- data.frame(
budget = c(4,2.8,9.1,15.5,10.1,12.9,4.3,
-1.9,-4.9,-1.3,14.1,8.6,7.8,-5.8,3.8,7.2,5.2,-5.3,8.6,
3.5,-1.2,-15.7,1.6,6.9,12.6,10.4,4.5,-8.3,-15.3,
9.8,21.5),
iv = c(0.52,0.43,0.41,0.44,0.41,0.4,0.39,
0.43,0.38,0.37,0.34,0.42,0.4,0.36,0.35,0.41,0.39,
0.35,0.31,0.43,0.36,0.51,0.35,0.34,0.37,0.37,0.39,
0.46,0.44,0.36,0.37),
policy1 = c(1L,1L,1L,1L,1L,1L,0L,1L,1L,1L,
1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,1L,1L,1L),
policy2 = c(1L,1L,1L,1L,1L,1L,1L,0L,0L,1L,
0L,1L,0L,1L,1L,1L,1L,0L,1L,1L,1L,1L,1L,1L,1L,
0L,1L,1L,0L,1L,0L)
)

Related

Quasi-Poisson mixed-effect model on overdispersed count data from multiple imputed datasets in R

I'm dealing with problems of three parts that I can solve separately, but now I need to solve them together:
extremely skewed, over-dispersed dependent count variable (the number of incidents while doing something),
necessity to include random effects,
lots of missing values -> multiple imputation -> 10 imputed datasets.
To solve the first two parts, I chose a quasi-Poisson mixed-effect model. Since stats::glm isn't able to include random effects properly (or I haven't figured it out) and lme4::glmer doesn't support the quasi-families, I worked with glmer(family = "poisson") and then adjusted the std. errors, z statistics and p-values as recommended here and discussed here. So I basically turn Poisson mixed-effect regression into quasi-Poisson mixed-effect regression "by hand".
This is all good with one dataset. But I have 10 of them.
I roughly understand the procedure of analyzing multiple imputed datasets – 1. imputation, 2. model fitting, 3. pooling results (I'm using mice library). I can do these steps for a Poisson regression but not for a quasi-Poisson mixed-effect regression. Is it even possible to A) pool across models based on a quasi-distribution, B) get residuals from a pooled object (class "mipo")? I'm not sure. Also I'm not sure how to understand the pooled results for mixed models (I miss random effects in the pooled output; although I've found this page which I'm currently trying to go through).
Can I get some help, please? Any suggestions on how to complete the analysis (addressing all three issues above) would be highly appreciated.
Example of data is here (repre_d_v1 and repre_all_data are stored in there) and below is a crucial part of my code.
library(dplyr); library(tidyr); library(tidyverse); library(lme4); library(broom.mixed); library(mice)
# please download "qP_data.RData" from the last link above and load them
## ===========================================================================================
# quasi-Poisson mixed model from single data set (this is OK)
# first run Poisson regression on df "repre_d_v1", then turn it into quasi-Poisson
modelSingle = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson",
data = repre_d_v1)
# I know there are some warnings but it's because I share only a modified subset of data with you (:
printCoefmat(coef(summary(modelSingle))) # unadjusted coefficient table
# define quasi-likelihood adjustment function
quasi_table = function(model, ctab = coef(summary(model))) {
phi = sum(residuals(model, type = "pearson")^2) / df.residual(model)
qctab = within(as.data.frame(ctab),
{`Std. Error` = `Std. Error`*sqrt(phi)
`z value` = Estimate/`Std. Error`
`Pr(>|z|)` = 2*pnorm(abs(`z value`), lower.tail = FALSE)
})
return(qctab)
}
printCoefmat(quasi_table(modelSingle)) # done, makes sense
## ===========================================================================================
# now let's work with more than one data set
# object "repre_all_data" of class "mids" contains 10 imputed data sets
# fit model using with() function, then pool()
modelMultiple = with(data = repre_all_data,
expr = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson"))
summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
# this has quite similar structure as coef(summary(someGLM))
# but I don't see where are the random effects?
# and more importantly, I wanted a quasi-Poisson model, not just Poisson model...
# ...but here it is not possible to use quasi_table function (defined earlier)...
# ...and that's because I can't compute "phi"

This seems reasonable, with the caveat that I'm only thinking about the computation, not whether this makes statistical sense. What I'm doing here is computing the dispersion for each of the individual fits and then applying it to the summary table, using a variant of the machinery that you posted above.
## compute dispersion values
phivec <- vapply(modelMultiple$analyses,
function(model) sum(residuals(model, type = "pearson")^2) / df.residual(model),
FUN.VALUE = numeric(1))
phi_mean <- mean(phivec)
ss <- summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
## adjust
qctab <- within(as.data.frame(ss),
{ std.error <- std.error*sqrt(phi_mean)
statistic <- estimate/std.error
p.value <- 2*pnorm(abs(statistic), lower.tail = FALSE)
})
The results look weird (dispersion < 1, all model results identical), but I'm assuming that's because you gave us a weird subset as a reproducible example ...

Exchangeable correlations and constant variance in MCMCglmm?

I am attempting to fit a very simple model with MCMCglmm, but am getting quite stuck.
Imagine a class (30 students) get grades for two papers throughout the semester where the paper assignments are exactly the same (we don't want to model a difference in average scores between the papers, there are no "learning effects", and we can assume that the variance in grades are the same.)
Let $i = 1...30$ index the student, $y_{i1}$ and $y_{i2}$ index the scores for that student's first and second papers.
One way to model this data is using random intercepts for student scores to account for correlation between each students scores. Let $\mu_i$ be the student intercept, $sigma$ be the residual sd, and $\sigma_{\mu}$ be the sd of the intercepts. Then we write (in shorthand) our random intercept model at $f(y_{ij}|\mu_i) = Normal(\mu_i, \sigma)$ and $f(\mu_i) = Normal(\mu, \sigma_{\mu)$.
An alternative way to write this model would be to model the residual correlation structure more explicitly. That is, we would write that ${y_{i1}, y_{i2}}$ have a multivariate normal distribution with mean ${\mu, \mu}$ variance $\tau = \sigma^2 + \sigma_{\mu}^2$ and correlation $\rho = \sigma_{\mu}^2 / (\sigma^2 + \sigma_{\mu}^2)$.
To be clear, these models are mathematically equivalent, but statistical software will often have a specific implementation for each. For example we can fit the two approaches separately with nlme:
library(nlme)
library(tidyverse)
library(MCMCglmm)
df <-
tibble(id = factor(rep(1:100, each = 20))) %>%
mutate(paper = 1:n()) %>%
group_by(id) %>%
mutate(mu = rnorm(1),
y = mu + rnorm(n(), 0, 3))
gls(data = df,
model = y~1,
correlation = corCompSymm(form = ~ 1 | id))
lme(data = df, fixed = y ~ 1, random = ~1|id)
It seems MCMCglmm can fit the first parameterization (random intercepts) of the model just fine.
MCMCglmm(data = df,
fixed = y ~ 1,
random = ~id,
nitt = 1000, burnin = 0, thin = 1)
However, I am not seeing a way to implement the second approach. My best attempt involves "widening" the data frame and fitting a multiple response model.
df.wide <- df %>% select(- paper) %>%
pivot_wider(values_from = "y",
names_from = "obs", names_prefix = "paper") %>%
as.data.frame
MCMCglmm(fixed = cbind(paper1, paper2) ~ 1,
rcov = ~us(trait):units,
data = df.wide)
However, (1) I am not sure that I am fitting this model correctly, (2) I am not sure how to interpret the fitted values (especially since my posterior mean covariances seem much too small) and (3) there doesn't seem to be a way to get a constant variance across traits.
p.s. I would appreciate not being told to just fit the random intercept model. I am writing some course materials, and would like students to be able to more directly compare the exchangeable correlation model with other types of correlation structures that we might use when we have more than two observations (i.e. AR, Toeplitz, etc.), and I would like my students to be able to do the comparison of the two parameterizations themselves, as I would do when I used nlme.
FOLLOW-UP: I am currently trying to fit the model with BRMS, though would still be open to any "hacks" in MCMCglmm.
model1 <- brms::brm(data = df,
formula = y ~ 1 + cosy(gr = id, time = obs),
family = "gaussian",
chains = 4, thin = 1, iter = 5000, warmup = 100)

Is exchangeability + equal variances the same as what I would call compound symmetry? (I guess so, since you're using corCompSymm() in nlme) ...
As far as I can tell this isn't possible (I can't rule out that there's some way to hack it with the available variance structures, but it's far from obvious ...) From ?MCMCglmm:
Currently, the only ‘variance.functions’ available are ‘idv’,
‘idh’, ‘us’, ‘cor[]’ and ‘ante[]’. ‘idv’ fits a constant
variance across all components in ‘formula’. Both ‘idh’ and
‘us’ fit different variances across each component in
‘formula’, but ‘us’ will also fit the covariances. ‘corg’
fixes the variances along the diagonal to one and ‘corgh’
fixes the variances along the diagonal to those specified in
the prior. ‘cors’ allows correlation submatrices. ‘ante[]’
fits ante-dependence structures of different order (e.g
ante1, ante2), and the number can be prefixed by a ‘c’ to
hold all regression coefficients of the same order equal. The
number can also be suffixed by a ‘v’ to hold all innovation
variances equal (e.g ‘antec2v’ has 3 parameters).
By using the us() (unstructured, what nlme would call pdSymm for "positive-definite symmetric") structure, I believe you're not constraining the correlation parameters to be all the same (i.e., violating exchangeability).
For what it's worth, one reason (other than pedagogy) to want to specify a compound-symmetric correlation matrix explicitly rather than by composing the sum of group-level and individual-level random effects would be if you wanted to model negative compound symmetry (the sum-of-random-effects approach can only model rho>0).
My guess is that you're also restricted to answers using MCMCglmm, but if "some Bayesian MCMC approach" is good enough, then you could do this via brms or (somewhat more obscurely, sort of) glmmTMB + tmbstan (although this combination does not currently use informative priors!)

Syntax for diagonal variance-covariance matrix for non-linear mixed effects model in nlme

I am analysing routinely collected substance use data during the first 12 months' of treatment in a large sample of outpatients attending drug and alcohol treatment services. I am interested in whether differing levels of methamphetamine use (no use, low use, and high use) at the outset of treatment predicts different levels after a year in treatment, but the data is very irregular, with different clients measured at different times and different numbers of times during their year of treatment.
The data for the high and low use group seem to suggest that drug use at outset reduces during the first 3 months of treatment and then asymptotes. Hence I thought I would try a non-linear exponential decay model.
I started with the following nonlinear generalised least squares model using the gnls() function in the nlme package:
fitExp <- gnls(outcome ~ C*exp(-k*yearsFromStart),
params = list(C ~ atsBase_fac, k ~ atsBase_fac),
data = dfNL,
start = list(C = c(nsC[1], lsC[1], hsC[1]),
k = c(nsC[2], lsC[2], hsC[2])),
weights = varExp(-0.8, form = ~ yearsFromStart),
control = gnlsControl(nlsTol = 0.1))
where outcome is number of days of drug use in the 28 days previous to measurement, atsBase_fac is a three-level categorical predictor indicating level of amphetamine use at baseline (noUse, lowUse, and highUse), yearsFromStart is a continuous predictor indicating time from start of treatment in years (baseline = 0, max - 1), C is a parameter indicating initial level of drug use, and k is the rate of decay in drug use. The starting values of C and k are taken from nls models estimating these parameters for each group. These are the results of that model
Generalized nonlinear least squares fit
Model: outcome ~ C * exp(-k * yearsFromStart)
Data: dfNL
AIC BIC logLik
27672.17 27725.29 -13828.08
Variance function:
Structure: Exponential of variance covariate
Formula: ~yearsFromStart
Parameter estimates:
expon
0.7927517
Coefficients:
Value Std.Error t-value p-value
C.(Intercept) 0.130410 0.0411728 3.16738 0.0015
C.atsBase_faclow 3.409828 0.1249553 27.28839 0.0000
C.atsBase_fachigh 20.574833 0.3122500 65.89218 0.0000
k.(Intercept) -1.667870 0.5841222 -2.85534 0.0043
k.atsBase_faclow 2.481850 0.6110666 4.06150 0.0000
k.atsBase_fachigh 9.485155 0.7175471 13.21886 0.0000
So it looks as if there are differences between groups in initial rate of drug use and in rate of reduction in drug use. I would like to go a step further and fit a nonlinear mixed effects model.I tried consulting Pinhiero and Bates' book accompanying the nlme package but the only models I could find that used irregular, sparse data like mine used a self-starting function, and my model does not do that.
I tried to adapt the gnls() model to nlme like so:
fitNLME <- nlme(model = outcome ~ C*exp(-k*yearsFromStart),
data = dfNL,
fixed = list(C ~ atsBase_fac, k ~ atsBase_fac),
random = pdDiag(yearsFromStart ~ id),
groups = ~ id,
start = list(fixed = c(nsC[1], lsC[1], hsC[1], nsC[2], lsC[2], hsC[2])),
weights = varExp(-0.8, form = ~ yearsFromStart),
control = nlmeControl(optim = "optimizer"))
bit I keep getting error message, I presume through errors in the syntax specifying the random effects.
Can anyone give me some tips on how the syntax for the random effects works in nlme?
The only dataset in Pinhiero and Bates that resembled mine used a diagonal variance-covariance matrix. Can anyone filled me in on the syntax of this nlme function, or suggest a better one?
p.s. I wish I could provide a reproducible example but coming up with synthetic data that re-creates the same errors is way beyond my skills.

hurdle model prediction - count vs response

I'm working on a hurdle model and ran into a question I can't quite figure out. It was my understanding that the overall response prediction of the hurdle is the multiplication of the count prediction by the probability prediction. I.e., the overall response has to be smaller or equal to the count prediction. However, in my data, the response prediction is higher than the count prediction, and I can't figure out why.
Here's a similar result for a toy model (code adapted from here):
library("pscl")
data("RecreationDemand", package = "AER")
## model
m <- hurdle(trips ~ quality | ski, data = RecreationDemand, dist = "negbin")
nd <- data.frame(quality = 0:5, ski = "no")
predict(m, newdata = nd, type = "count")
predict(m, newdata = nd, type = "response")
Why is it that the counts are higher than the responses?
added comparison to glm.nb
Also - I was under the impression that the count part of the hurdle model should give identical predictions to a count-model of only positive values. When I try that, I get completely different values. What am I missing??
library(MASS)
m.nb <- glm.nb(trips ~ quality, data = RecreationDemand[RecreationDemand$trips > 0,])
predict(m, newdata = nd, type = "count") ## hurdle
predict(m.nb, newdata = nd, type = "response") ## positive counts only

The last question is the easiest to answer. The "count" part of the hurdle modle is not simply a standard count model (including a positive probability for zeros) but a zero-truncated count model (where zeros cannot occur).
Using the countreg package from R-Forge you can fit the model you attempted to fit with glm.nb in your example. (Alternatively, VGAM or gamlss could also be used to fit the same model.)
library("countreg")
m.truncnb <- zerotrunc(trips ~ quality, data = RecreationDemand,
subset = trips > 0, dist = "negbin")
cbind(hurdle = coef(m, model = "count"), zerotrunc = coef(m.truncnb), negbin = coef(m.nb))
## hurdle zerotrunc negbin
## (Intercept) 0.08676189 0.08674119 1.75391028
## quality 0.02482553 0.02483015 0.01671314
Up to small numerical differences the first two models are exactly equivalent. The non-truncated model, however, has to compensate the lack of zeros by increasing the intercept and dampening the slope parameter, which is clearly not appropriate here.
As for the predictions, one can distinguish three quantities:
The expectation from the untruncated count part, i.e., simply exp(x'b).
The conditional/truncated expectation from the count part, i.e., accounting for the zero trunctation: exp(x'b)/(1 - f(0)) where f(0) is the probability for 0 in that count part.
The overall expectation for the complete hurdle model, i.e., the probability for crossing the hurdle times the conditional expectation from 2.: exp(x'b)/(1 - f(0)) * (1 - g(0)) where g(0) is the probability for 0 in the zero hurdle part of the model.
See also Section 2.2 and Appendix C in vignette("countreg", package = "pscl") for more details and formulas. predict(..., type = "count") computes item 1 from above where predict(..., type = "response") computes item 3 for a hurdle model and item 2 for a zerotrunc model.

Estimating variance attributed a fixed effect

Disregarding how "important" it is, I am interested in trying to estimate how much of the variance is attributed to a single fixed effect (it being a main effect, or interaction term).
As a quick thought I imagined that constructing a linear model for the predicted values of mixed model (without the random effect), and assessing the ANOVA-table would provide a estimate (yes, the residual variance will then be zero, but we know(?) this from the mixed model). However, from playing around apparently not.
Where is the flaw in my reasoning? Or did I do something wrong along the way? Is there an alternative method?
Disclaimer: I know some people have suggested looking at the change in residual variance when removing/adding fixed effects, but as this does not take into account the correlation between fixed and random effects I am not interested .
data(Orthodont,package="nlme")
Orthodont = na.omit(Orthodont)
#Fitting a linear mixed model
library(lme4)
mod = lmer(distance ~ age*Sex + (1|Subject) , data=Orthodont)
# Predicting across all observed values,
pred.frame = expand.grid(age = seq(min(Orthodont$age, na.rm = T),max(Orthodont$age, na.rm=T)),
Sex = unique(Orthodont$Sex))
# But not including random effects
pred.frame$fit = predict(mod, newdata = pred.frame, re.form=NA)
anova(lm(fit~age*Sex, data = pred.frame))
library(data.table)
Orthodont = data.table(Orthodont)
# to test the validity of the approach
# by estimating a linear model using a random observation
# per individual and look at the means
tmp = sapply(1:500, function(x){
print(x)
as.matrix(anova(lm(distance~age*Sex, data =Orthodont[,.SD[sample(1:.N,1)],"Subject"])))[,2]
}
)
# These are clearly not similar
prop.table(as.table(rowMeans(tmp)[-4]))
age Sex age:Sex
0.60895615 0.31874622 0.07229763
> prop.table(as.table(anova(lm(fit~age*Sex, data = pred.frame))[1:3,2]))
A B C
0.52597575 0.44342996 0.03059429