Manually calculate robust standard errors of a 2SLS regression in R

I used the "systemfit" function in R to estimate a 2SLS model, as it allows specifying the first and second stage separately, which is important for my estimation. I need robust standard errors, but I am having trouble replicating the robust standard errors obtained when estimating the model with "iv_robust", or with "ivreg" in combination with "coeftest".
I tried to calculate the robust standard errors manually as follows:
library(systemfit)

# Two-equation system: first stage and second stage specified separately
firststagehfsystem <- H ~ HL1
secondstageivsystem <- Index ~ H
system <- list(firststagehfsystem, secondstageivsystem)
inst <- ~ HL1
ivhfsystem <- systemfit(system, method = "2SLS", inst = inst, data = DataControls)

# Default (non-robust) covariance matrix of the second-stage equation
coef <- ivhfsystem$eq[[2]]$coefficients
vcov <- ivhfsystem$eq[[2]]$coefCov
se <- sqrt(diag(vcov))

# Attempted HC1 sandwich: (X'X)^-1 X' diag(u_i^2) X (X'X)^-1, scaled by n/(n-k)
residuals <- ivhfsystem$eq[[2]]$residuals
n <- nrow(DataControls)
k <- length(coef)
X <- model.matrix(ivhfsystem$eq[[2]])
hc1 <- (n / (n - k)) * residuals^2
vce_hc1 <- solve(t(X) %*% X) %*% (t(X) %*% diag(hc1) %*% X) %*% solve(t(X) %*% X)
sqrt(diag(vce_hc1))
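For reference, the numbers I am trying to match come from something like the following (a minimal sketch, assuming the AER, sandwich and lmtest packages; "iv_robust" from estimatr with se_type = "HC1" should agree):
library(AER)       # ivreg
library(sandwich)  # vcovHC
library(lmtest)    # coeftest
# same structural equation, instrumenting H with HL1
ivbenchmark <- ivreg(Index ~ H | HL1, data = DataControls)
# HC1 robust standard errors -- the target to reproduce by hand
coeftest(ivbenchmark, vcov = vcovHC(ivbenchmark, type = "HC1"))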
Can anyone point out my mistake? I have been googling for two days and could not find a solution. Thank you!

Related

95% CI for survival linear combination (interaction) using vcov

I have this model:
pwp_gt_int <- coxph(Surv(tstart2, tstop2, status == 1) ~ TD + TD:strata(event))
where TD is a binary variable and Strata is a numeric variable equal to {1, 2, 3}. I need to get 95% CIs for the two linear combinations written out below. I have this function to construct confidence intervals:
mod_summ <- summary(pwp_gt_int)
coefs <- mod_summ$coefficients[, "coef"]  # point estimates
X <- model.matrix(pwp_gt_int)
dof <- nrow(X) - ncol(X)
coefs_var <- vcov(pwp_gt_int)
halfCI <- qt(0.975, dof) * sqrt(diag(coefs_var))
matrix(c(coefs - halfCI, coefs + halfCI), nrow = 3)
but I need something like this:
coefs[2] = coefs[1] + 2*coefs[2]
coefs[3] = coefs[1] + 3*coefs[3]
matrix(c(coefs - halfCI, coefs + halfCI), nrow=3)
But the CIs I get are not plausible; I think I'm not getting the variance-covariance matrix for the linear combinations right.
Please help.
It looks like you're asking for two different things - one is the variance of a linear combination and the other is a confidence interval (and as such, a variance) for a non-linear combination. The linear combination is relatively easy. We know that the variance of a linear combination is:

$$V(Ab) = A\,V(b)\,A^{\top}$$

where A is a matrix of constants and V(b) is the variance-covariance matrix of the random variables (in this case, the coefficients). If your coefficient vector has three values in it, and you want to do as you suggest in your last block of code, then you would define:

$$A = \begin{pmatrix} 1 & 2 & 0 \\ 1 & 0 & 3 \end{pmatrix}$$

or in R as:
A <- matrix(c(1, 1, 2, 0, 0, 3), ncol = 3)
Then, you could make the linear combinations and their variances with:
b <- matrix(coef(pwp_gt_int)[1:3], ncol=1)
V <- vcov(pwp_gt_int)[1:3,1:3]
lincom <- A %*% b
v_lincom <- A %*% V %*% t(A)
sds <- sqrt(diag(v_lincom))
crit <- qt(.975, dof)
cis <- cbind(lincom - crit*sds, lincom + crit*sds)
That would be the confidence interval for the linear combination. The problem is that there isn't such an easy formula for the variance of a non-linear combination. Further, the confidence intervals may be asymmetric. One thing you could do is an end-point transformation, where you take lincom and cis and then exponentiate all of them. Another option would be a parametric bootstrap. Here's what that would look like.
B <- MASS::mvrnorm(2500, b, V)
nlcom <- exp(A %*% b)
nlsim <- exp(A %*% t(B))
nlcis <- apply(nlsim, 1, quantile, c(.025,.975))
Now, nlcis would have the confidence bounds for the non-linear combination. This should work given your data, but without the data to try it out, I'm not sure.
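For what it's worth, the end-point transformation mentioned above is just a one-liner with the objects already computed (a small illustrative addition):
# exponentiate the estimates and the normal-theory CI end-points together
exp(cbind(lincom, cis))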

Evaluating the score function at the MLEs for an lme model in nlme and R

As a sanity check for later work, I am trying to understand how the components of an lme object fit together. Specifically, I am evaluating the score functions at the MLEs obtained from an lme object created using gamm, to check that they equal zero. I've provided a minimal working example. My score functions were obtained from equation (7) in https://arxiv.org/pdf/1612.04911.pdf.
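For reference, my reading of those score functions (as implemented in the code below, writing the marginal covariance as $\Sigma = \sigma^2 I + \tau^2 Z \tilde{V} Z^{\top}$ and the residual as $r = y - X\beta$; here $Z$ corresponds to model$data$Xr and $\tilde{V}$ to the scaled random-effects covariance) is:

$$s(\beta) = X^{\top}\Sigma^{-1} r, \qquad s(\sigma^2) = \tfrac{1}{2}\left(-\operatorname{tr}(\Sigma^{-1}) + r^{\top}\Sigma^{-2} r\right), \qquad s(\tau^2) = \tfrac{1}{2}\left(-\operatorname{tr}(\Sigma^{-1} Z \tilde{V} Z^{\top}) + r^{\top}\Sigma^{-1} Z \tilde{V} Z^{\top} \Sigma^{-1} r\right)$$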
library(lme4)
library(mgcv)
library(lasso2)  # for the Prostate data set
data(Prostate)
model <- gamm(lpsa ~ lcavol + s(lweight), data = Prostate)$lme
Here I compute the objects needed to evaluate the score functions:
X <- model.matrix(formula(model), data = model$data)
YminXbeta <- c(model$data$y - X %*% fixed.effects(model))
varcomps <- VarCorr(model)
varcomps <- as.numeric(varcomps[nrow(varcomps) - (1:0), 1])
Sigmainv <- solve(extract.lme.cov2(model, data = model$data)$V)
Sigmainvsq <- Sigmainv %*% Sigmainv
ZVZT <- model$data$Xr %*% (getVarCov(model, type = "random.effects") / varcomps[1]) %*% t(model$data$Xr)
SigmainvZVZT <- Sigmainv %*% ZVZT
SigmainvZVZTsq <- SigmainvZVZT %*% SigmainvZVZT
Now I evaluate the scores at the MLEs:
# the score for the mean parameter
sbeta <- t(X) %*% Sigmainv %*% YminXbeta
#                            [,1]
# X(Intercept)     5.329071e-14
# Xlcavol          1.190159e-13
# Xs(lweight)Fx1  -1.110223e-15
# the score for the error variance parameter
ssigmasq <- (-sum(diag(Sigmainv)) + YminXbeta %*% Sigmainvsq %*% YminXbeta) / 2
# 3.664974e-08
# the score for the random-effect variance parameter
stausq <- (-sum(diag(SigmainvZVZT)) + YminXbeta %*% SigmainvZVZT %*% Sigmainv %*% YminXbeta) / 2
# -7.507903
Note that the random effects variance component is close to zero
varcomps[1] # 2.665509e-09
but I'm not sure whether that would make the derivative nonzero.
Why is the score function for the variance component term not close to zero? Am I making a mistake, or am I misunderstanding what the objects in the lme object are?

Get Residual Variance-Covariance Matrix in lme4

I am fitting a linear mixed effects model using lme4:
library(lme4)
data(Orthodont, package = "nlme")  # Orthodont ships with nlme, not lme4
dent <- Orthodont
d.test <- lmer(distance ~ age + (1|Subject), data = dent)
If we say generically Y = X * B + Z * d + e is the form of a linear mixed effects model, then I am trying to get Var(Y) = Z * Var(d) * Z^t + Var(e) from the results of the model.
Is the following formulation the right way to do this?
k <- table(dent$Subject)[1]
vars <- VarCorr(d.test)
v <- as.data.frame(vars)
sigma <- attr(vars, "sc")
s.tech <- diag(v$vcov[1], nrow=k)
icc <- v$vcov[1]/sum(v$vcov)
s.tech[upper.tri(s.tech)] <- icc
s.tech[lower.tri(s.tech)] <- icc
sI <- diag(sigma^2, nrow=length(dent$age))
var.b <- kronecker(diag(1, nrow=length(dent$age)/k), s.tech)
var.y <- sI + var.b
I think this is a simple question, but I can't find anywhere code for doing this, so I'm asking if I'm doing it right.
You can do this a bit more easily if you know about getME(), which is a general purpose extract-bits-of-a-lmer-fit function. In particular, you can extract the transposed Z matrix (getME(., "Zt")) and the transposed Lambda matrix; the Lambda matrix is the Cholesky factor of the scaled variance-covariance matrix of the conditional modes (BLUPs). In your notation, Var(d) is the residual variance times the cross-product of Lambda.
The answer cited here is pretty good but the answer below is slightly more general (it should work for any lmer fit).
Fit model:
library(lme4)
data(Orthodont,package="nlme")
d.test <- lmer(distance ~ age + (1|Subject), data=Orthodont)
Extract components:
var.d <- crossprod(getME(d.test,"Lambdat"))
Zt <- getME(d.test,"Zt")
vr <- sigma(d.test)^2
Combine them:
var.b <- vr*(t(Zt) %*% var.d %*% Zt)
sI <- vr * Diagonal(nrow(Orthodont))
var.y <- var.b + sI
A picture:
image(var.y)
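As a quick extra check (a small addition, using only objects defined above), each diagonal element of var.y should equal the sum of the among-subject and residual variances reported by VarCorr():
vc <- as.data.frame(VarCorr(d.test))
# marginal variance of one observation = subject variance + residual variance
all.equal(diag(var.y)[1], sum(vc$vcov))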

Nonlinear regression with sampling weights (package survey)

I would like to estimate the coefficients of a nonlinear model with a binary dependent variable. The nonlinearity arises because two regressors, A and B, depend on a subset of the dataset and on the two parameters lambda1 and lambda2 respectively:
y = alpha + beta1 * A(lambda1) + beta2 * B(lambda2) + delta * X + epsilon
where for each observation i, we have

$$A_i(\lambda_1) = \sum_{k=1}^{a_i - 1} \frac{(a_i - k)^{\lambda_1}}{\sum_{j=1}^{a_i - 1} (a_i - j)^{\lambda_1}} \, R_{ik}$$

where a and the R's are variables in the data.frame. The regressor B(lambda2) is defined in a similar way.
Moreover, I need to include what in Stata are known as pweights, i.e. survey weights or sampling weights. For this reason, I'm working with the R package survey by Thomas Lumley.
First, I create a function for A (and B), i.e.:
A <- function(l1){
  R <- as.matrix(data[, 1:80])
  a <- data[, 169]
  N <- length(a)
  var <- numeric(N)
  for (i in 1:N) {
    ai <- rep(a[i], a[i] - 1)  # vector of a(i)
    k <- 1:(a[i] - 1)          # numbers from 1 to a(i)-1
    num <- (ai - k)^l1
    den <- sum((ai - k)^l1)
    w <- num / den
    w <- c(w, rep(0, dim(R)[2] - length(w)))  # pad with zeros
    var[i] <- R[i, ] %*% w
  }
  return(var)
}
B <- function(l2){
  C <- as.matrix(data[, 82:160])
  a <- data[, 169]
  N <- length(a)
  var <- numeric(N)
  for (i in 1:N) {
    ai <- rep(a[i], a[i] - 1)  # vector of a(i)
    k <- 1:(a[i] - 1)          # numbers from 1 to a(i)-1
    num <- (ai - k)^l2
    den <- sum((ai - k)^l2)
    w <- num / den
    w <- c(w, rep(0, dim(C)[2] - length(w)))  # pad with zeros
    var[i] <- C[i, ] %*% w
  }
  return(var)
}
But the problem is that I don't know how to include the nonlinear regressors in the model (or in the survey design, using the function svydesign):
d_test <- svydesign(id = ~1, data = data, weights = ~hw0010)
Because, when I try to estimate the model:
# log-likelihood function:
LLsvy <- function(y, model, lambda1, lambda2){
  aux1 <- y * log(pnorm(model))
  aux2 <- (1 - y) * log(1 - pnorm(model))
  LL <- aux1 + aux2
  return(LL)
}
fit <- svymle(loglike = LLsvy,
              formulas = list(~y, model = ~ A(lambda1) + B(lambda2) + X,
                              lambda1 = ~1, lambda2 = ~1),
              design = d_test,
              start = list(c(0, 0, 0, 0), c(lambda1 = 11), c(lambda2 = 8)),
              na.action = "na.exclude")
I get the error message:
Error in eval(expr, envir, enclos) : object 'lambda1' not found
I think the problem lies in the nonlinear part, because everything works fine if I fix A and B at some lambda1 and lambda2 (so that the model becomes linear):
lambda1 <- 11
lambda2 <- 8
data$A <- A(lambda1)
data$B <- B(lambda2)
d_test <- svydesign(id = ~1, data = data, weights = ~hw0010)
LLsvylin <- function(y, model){
  aux1 <- y * log(pnorm(model))
  aux2 <- (1 - y) * log(1 - pnorm(model))
  LL <- aux1 + aux2
  return(LL)
}
fitlin <- svymle(loglike = LLsvylin,
                 formulas = list(~y, model = ~ A + B + X),
                 design = d_test,
                 start = list(0, 0, 0, 0),
                 na.action = "na.exclude")
On the contrary, if I don't use the sampling weights, I can easily estimate my nonlinear model using the function mle from package stats4 or the function mle2 from package bbmle.
To sum up: how can I combine sampling weights (via svymle) with estimating a nonlinear model (which I can do using mle or mle2)?
=========================================================================
A problem with the nonlinear part of the model also arises when using the function svyglm (with fixed lambda1 and lambda2, in order to get good starting values for svymle):
lambda1 <- 11
lambda2 <- 8
model0 <- y ~ A(lambda1) + B(lambda2) + X
probit1 <- svyglm(formula = model0,
                  data = data,
                  family = binomial(link = probit),
                  design = d_test)
Because I get the error message:
Error in svyglm.survey.design(formula = model0, data = data, family = binomial(link = probit), :
all variables must be in design= argument
This isn't what svymle does -- it's for generalised linear models, which have linear predictors and a potentially complicated likelihood or loss function. You want non-linear weighted least squares, with a simple loss function but complicated predictors.
There isn't an implementation of design-weighted nonlinear least squares in the survey package, probably because no-one has previously asked for one. You could try emailing the package author.
The upcoming version 4 of the survey package will have a function svynls, so if you know how to fit your model without sampling weights using nls you will be able to fit it with sampling weights.
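For example, the unweighted fit might look something like this with nls (a rough sketch, not tested against your data; it assumes the A() and B() functions above and that y and X are columns of data, with the starting values used earlier):
fit_nls <- nls(y ~ pnorm(alpha + beta1 * A(lambda1) + beta2 * B(lambda2) + delta * X),
               data = data,
               start = list(alpha = 0, beta1 = 0, beta2 = 0, delta = 0,
                            lambda1 = 11, lambda2 = 8))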

How to simulate quantities of interest using arm or rstanarm packages in R?

I would like to know how to simulate quantities of interest from a regression model estimated using either the arm or the rstanarm package in R. I am new to Bayesian methods and R and have been using the Zelig package for some time. I asked a similar question before, but I would like to know if it is possible to simulate those quantities using the posterior distribution estimated by those packages.
In Zelig you can set the values you want for the independent variables and it calculates the results for the outcome variable (expected value, probability, etc.). An example:
# Creating a dataset:
set.seed(10)
x <- rnorm(100,20,10)
z <- rnorm(100,10,5)
e <- rnorm(100,0,1)
y <- 2*x+3*z+e
df <- data.frame(x,z,e,y)
# Loading Zelig
require(Zelig)
# Model
m1.zelig <- zelig(y ~ x + z, model="ls", data=df)
summary(m1.zelig)
# Simulating z = 10
s1 <- setx(m1.zelig, z = 10)
simulation <- sim(m1.zelig, x = s1)
summary(simulation)
So Zelig keeps x at its mean (20.56) and simulates the quantity of interest with z = 10. In this case, y is approximately 71, consistent with the true model: 2 * 20.56 + 3 * 10 = 71.12.
The same model using arm:
# Model
require(arm)
m1.arm <- bayesglm(y ~ x + z, data=df)
summary(m1.arm)
And using rstanarm:
# Model
require(rstanarm)
m1.stan <- stan_lm(y ~ x + z, data = df, prior = NULL)  # prior = NULL gives a flat prior
print(m1.stan)
Is there any way to simulate z = 10 with x at its mean using the posterior distribution estimated by those two packages, and get the expected value of y? Thank you very much!
In the case of bayesglm, you could do
sims <- arm::sim(m1.arm, n = 1000)
y_sim <- rnorm(n = 1000, mean = sims@coef %*% t(as.matrix(s1)), sd = sims@sigma)
mean(y_sim)
For the (unreleased) rstanarm, it would be similar:
sims <- as.matrix(m1.stan)
y_sim <- rnorm(n = nrow(sims),
               mean = sims[, 1:(ncol(sims) - 1)] %*% t(as.matrix(s1)),
               sd = sims[, ncol(sims)])
mean(y_sim)
In general for Stan, you could pass s1 as a row_vector and use it in a generated quantities block of a .stan file like:
generated quantities {
  real y_sim;
  y_sim <- normal_rng(s1 * beta, sigma);
}
in which case the posterior distribution of y_sim would appear when you print the posterior summary.
