i have fit negative binomial model to my data as follows:
> ngbinmodel <- glm.nb( seizure.rate ~ age + treatment, data = epilepsy_reduced)
> summary(ngbinmodel)
Call:
glm.nb(formula = seizure.rate ~ age + treatment, data = epilepsy_reduced,
init.theta = 1.498983674, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3510 -0.8790 -0.4563 0.4328 1.8916
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.0985089 0.5845392 3.590 0.000331 ***
age -0.0007965 0.0193064 -0.041 0.967092
treatment -0.5011593 0.2405658 -2.083 0.037228 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.499) family taken to be 1)
Null deviance: 71.217 on 57 degrees of freedom
Residual deviance: 66.875 on 55 degrees of freedom
AIC: 341.12
Number of Fisher Scoring iterations: 1
Theta: 1.499
Std. Err.: 0.362
2 x log-likelihood: -333.119
Now I would like to check if i should include the interaction effect between age and treatment. I have found two methods to do it:
> intearaction_nbm<-addterm(ngbinmodel, . ~ . * age,test="Chisq")
> summary(intearaction_nbm)
Df AIC LRT Pr(Chi)
Min. :1 Min. :339.1 Min. :0.9383 Min. :0.3327
1st Qu.:1 1st Qu.:339.4 1st Qu.:0.9383 1st Qu.:0.3327
Median :1 Median :339.6 Median :0.9383 Median :0.3327
Mean :1 Mean :339.6 Mean :0.9383 Mean :0.3327
3rd Qu.:1 3rd Qu.:339.9 3rd Qu.:0.9383 3rd Qu.:0.3327
Max. :1 Max. :340.2 Max. :0.9383 Max. :0.3327
NA's :1 NA's :1 NA's :1
and
> ngbinmodel_int <- glm.nb( seizure.rate ~ age*treatment, data = epilepsy_reduced)
> summary(ngbinmodel_int)
glm.nb(formula = seizure.rate ~ age * treatment, data = epilepsy_reduced,
init.theta = 1.531539174, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3503 -0.8742 -0.3848 0.3403 1.8508
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.51361 0.83920 1.804 0.0713 .
age 0.01914 0.02826 0.677 0.4981
treatment 0.60748 1.12199 0.541 0.5882
age:treatment -0.03893 0.03850 -1.011 0.3119
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.5315) family taken to be 1)
Null deviance: 72.238 on 57 degrees of freedom
Residual deviance: 66.874 on 54 degrees of freedom
AIC: 342.18
Number of Fisher Scoring iterations: 1
Theta: 1.532
Std. Err.: 0.373
2 x log-likelihood: -332.180
I was expecting to obtain the same result from both of the methods.
How can i access the regression estimates of intearaction_nbm?
why are the outcomes different? According to intearaction_nbm i should include the interaction term (the AIC is lower) but according to ngbinmodel_int i should not include the interaction term (AIC increases).
would discretizing my continuous variable age be advised?
Remark: You should move this post to cross validated.
How can i access the regression estimates of intearaction_nbm?
intearaction_nbm gives you the result of the addition of single term to your model, if you print it, you will have a row per possible additional term (age:treatment, age:another_variable, etc.) giving you the AIC and P-value among other things.
why are the outcomes different?
Not possible to answer without the data, but what I would do is define both models and compare their AIC using: AIC(model_1, model_2). This way I am sure that I am comparing the same quantity. As you know, the AIC is defined up to an additional term, and unless you check how it is computed, you cannot be sure that two different functions in two different packages use the same definition.
would discretizing my continuous variable age be advised?
Not possible to answer without the data...
Let us consider the dataset quine and the following model with only main effects for Eth and Lrn factors:
library(MASS)
negbin_no_int <- glm.nb(Days ~ Eth + Lrn, data = quine)
summary(negbin_no_int)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 3.0367 0.1334 22.764 < 2e-16 ***
# EthN -0.5520 0.1597 -3.457 0.000546 ***
# LrnSL 0.0388 0.1611 0.241 0.809661
extractAIC(negbin_no_int)
# [1] 3.000 1112.576
The model with the interaction term between the two factor is:
negbin_with_int <- glm.nb(Days ~ Eth * Lrn, data = quine)
summary(negbin_with_int)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.9218 0.1503 19.446 <2e-16 ***
# EthN -0.3374 0.2100 -1.607 0.108
# LrnSL 0.2929 0.2307 1.269 0.204
# EthN:LrnSL -0.4956 0.3201 -1.549 0.122
extractAIC(negbin_with_int)
# [1] 4.000 1112.196
The statistical significance of the interaction term is p=0.122.
Now we compare the two models using addterm:
interaction_nbm <- addterm(negbin_no_int, . ~ . + Eth:Lrn, test="Chisq")
print(interaction_nbm)
# Model:
# Days ~ Eth + Lrn
# Df AIC LRT Pr(Chi)
# <none> 1112.6
# Eth:Lrn 1 1112.2 2.3804 0.1229
The AICs given by addterm are the same calculated using extractAIC.
If you want to see the regression estimates of addterm, you can add a summary(print(nfit)) inside the function, as follows:
myaddterm <- function (object, scope, scale = 0, test = c("none", "Chisq"),
k = 2, sorted = FALSE, trace = FALSE, ...)
{
if (missing(scope) || is.null(scope))
stop("no terms in scope")
if (!is.character(scope))
scope <- add.scope(object, update.formula(object, scope))
if (!length(scope))
stop("no terms in scope for adding to object")
ns <- length(scope)
ans <- matrix(nrow = ns + 1L, ncol = 2L, dimnames = list(c("<none>",
scope), c("df", "AIC")))
ans[1L, ] <- extractAIC(object, scale, k = k, ...)
n0 <- nobs(object, use.fallback = TRUE)
env <- environment(formula(object))
for (i in seq_len(ns)) {
tt <- scope[i]
if (trace) {
message(gettextf("trying + %s", tt), domain = NA)
utils::flush.console()
}
nfit <- update(object, as.formula(paste("~ . +", tt)),
evaluate = FALSE)
nfit <- try(eval(nfit, envir = env), silent = TRUE)
print(summary(nfit))
ans[i + 1L, ] <- if (!inherits(nfit, "try-error")) {
nnew <- nobs(nfit, use.fallback = TRUE)
if (all(is.finite(c(n0, nnew))) && nnew != n0)
stop("number of rows in use has changed: remove missing values?")
extractAIC(nfit, scale, k = k, ...)
}
else NA_real_
}
dfs <- ans[, 1L] - ans[1L, 1L]
dfs[1L] <- NA
aod <- data.frame(Df = dfs, AIC = ans[, 2L])
o <- if (sorted)
order(aod$AIC)
else seq_along(aod$AIC)
test <- match.arg(test)
if (test == "Chisq") {
dev <- ans[, 2L] - k * ans[, 1L]
dev <- dev[1L] - dev
dev[1L] <- NA
nas <- !is.na(dev)
P <- dev
P[nas] <- MASS:::safe_pchisq(dev[nas], dfs[nas], lower.tail = FALSE)
aod[, c("LRT", "Pr(Chi)")] <- list(dev, P)
}
aod <- aod[o, ]
head <- c("Single term additions", "\nModel:", deparse(formula(object)))
if (scale > 0)
head <- c(head, paste("\nscale: ", format(scale), "\n"))
class(aod) <- c("anova", "data.frame")
attr(aod, "heading") <- head
aod
}
interaction_nbm1 <- myaddterm(negbin_no_int, . ~ . + Eth:Lrn, test="Chisq")
The output is:
Call:
glm.nb(formula = Days ~ Eth + Lrn + Eth:Lrn, data = quine, init.theta = 1.177546225,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5770 -1.0470 -0.3645 0.3521 2.7227
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.9218 0.1503 19.446 <2e-16 ***
EthN -0.3374 0.2100 -1.607 0.108
LrnSL 0.2929 0.2307 1.269 0.204
EthN:LrnSL -0.4956 0.3201 -1.549 0.122
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.1775) family taken to be 1)
Null deviance: 182.93 on 145 degrees of freedom
Residual deviance: 168.18 on 142 degrees of freedom
AIC: 1114.2
Number of Fisher Scoring iterations: 1
Theta: 1.178
Std. Err.: 0.146
2 x log-likelihood: -1104.196
Related
I am doing negative binomial analysis for some count data in the following link:https://www.dropbox.com/s/q7fwqicw3ebvwlg/stackquestion.csv?dl=0
I had some problems (error messages) when I tried to fit all the independent variables into the model, which makes me want to look at each independent variables one by one to find out which variable caused the problem. Here is what I found:
For all the other variables, when I fit the variables to the Y which is column A looks normal:
m2 <- glm.nb(A~K, data=d)
summary(m2)
Call:
glm.nb(formula = A ~ K, data = d, init.theta = 0.5569971932,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5070 -1.2538 -0.4360 0.1796 1.9588
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.66185 0.84980 -0.779 0.436
K 0.25628 0.03016 8.498 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.557) family taken to be 1)
Null deviance: 113.202 on 56 degrees of freedom
Residual deviance: 70.092 on 55 degrees of freedom
AIC: 834.86
Number of Fisher Scoring iterations: 1
Theta: 0.5570
Std. Err.: 0.0923
2 x log-likelihood: -828.8570
However, I found this variable L, when I fit L to the Y, I got this:
m1 <- glm.nb(A~L, data=d)
There were 50 or more warnings (use warnings() to see the first 50)
summary(m1)
Call:
glm.nb(formula = A ~ L, data = d, init.theta = 5136324.722, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-67.19 -18.93 -12.07 13.25 64.00
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.45341 0.01796 192.3 <2e-16 ***
L 0.24254 0.00103 235.5 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(5136325) family taken to be 1)
Null deviance: 97084 on 56 degrees of freedom
Residual deviance: 28529 on 55 degrees of freedom
AIC: 28941
Number of Fisher Scoring iterations: 1
Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, :
invalid 'nsmall' argument
You can see that the init.theta and AIC is too large, and there are 50 warning and an error message.
The warning message is this
In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > ... :
iteration limit reached
Actually, variables M and L are two observations of one thing. I did not find anything abnormal with variable L. For all the data, only column L has this problem.
So I am wondering what exactly does this error message mean: Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L,: invalid 'nsmall' argument. Since I just observed these data, how should I fix this error? Thank you!
The important message is in the warnings(): when L is the independent variable, the default number of iterations in the GLM convergence procedure is not high enough to converge on a model fit.
If you manually set the maxit parameter to a higher value, you can fit A ~ L without error:
glm.nb(A ~ L, data = d, control = glm.control(maxit = 500))
See the glm.control documentation for more. Note that you can also set a reasonable value for init.theta - and this will prevent both theta and AIC from fitting to unreasonable values:
m1 <- glm.nb(A ~ L, data = df, control = glm.control(maxit = 500), init.theta = 1.0)
Output:
Call:
glm.nb(formula = A ~ L, data = df, control = glm.control(maxit = 500),
init.theta = 0.8016681349, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.3020 -0.9347 -0.3578 0.1435 2.5420
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.25962 0.40094 3.142 0.00168 **
L 0.38823 0.02994 12.967 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.8017) family taken to be 1)
Null deviance: 160.693 on 56 degrees of freedom
Residual deviance: 67.976 on 55 degrees of freedom
AIC: 809.41
Number of Fisher Scoring iterations: 1
Theta: 0.802
Std. Err.: 0.140
2 x log-likelihood: -803.405
I have a question about R; I would like to use two lags periods instead of one (please check my below code) in my model but I don't know how to write it in R. Can someone help please?
Here below are details of my R code:
library(plm)
fixed = plm(sp ~lag(debt)+lag(I(debt^2))+outgp+gvex+vlimp+vlexp+bcour+infcpi, data=pdata, index=c("country", "year"), model="within")
The lags must be on the variable debt.
This should give 2 lags on the debt variable.
library(plm)
fixed = plm(sp ~lag(debt, k=1:2)+lag(I(debt^2))+outgp+gvex+vlimp+vlexp+bcour+infcpi, data=pdata, index=c("country", "year"), model="within")
For example:
data("Grunfeld", package = "plm")
lags2mod <- plm(inv ~ lag(value, k=1:2) + capital, data = Grunfeld, model = "within")
summary(lags2mod)
Oneway (individual) effect Within Model
Call:
plm(formula = inv ~ lag(value, k = 1:2) + capital, data = Grunfeld,
model = "within")
Balanced Panel: n = 10, T = 18, N = 180
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-272.21434 -19.24168 0.42825 18.09930 260.85548
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
lag(value, k = 1:2)1 0.078234 0.015438 5.0677 1.059e-06 ***
lag(value, k = 1:2)2 -0.018754 0.016078 -1.1664 0.2451
capital 0.352658 0.021003 16.7910 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 2034500
Residual Sum of Squares: 617850
R-Squared: 0.69631
Adj. R-Squared: 0.67449
F-statistic: 127.633 on 3 and 167 DF, p-value: < 2.22e-16
I want to use the partial least squares regression to find the most representative variables to predict my data.
Here is my code:
library(pls)
potion<-read.table("potion-insomnie.txt",header=T)
potionTrain <- potion[1:182,]
potionTest <- potion[183:192,]
potion1 <- plsr(Sommeil ~ Aubepine + Bave + Poudre + Pavot, data = potionTrain, validation = "LOO")
The summary(lm(potion1)) give me this answer:
Call:
lm(formula = potion1)
Residuals:
Min 1Q Median 3Q Max
-14.9475 -5.3961 0.0056 5.2321 20.5847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.63931 1.67955 22.410 < 2e-16 ***
Aubepine -0.28226 0.05195 -5.434 1.81e-07 ***
Bave -1.79894 0.26849 -6.700 2.68e-10 ***
Poudre 0.35420 0.72849 0.486 0.627
Pavot -0.47678 0.52027 -0.916 0.361
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.845 on 177 degrees of freedom
Multiple R-squared: 0.293, Adjusted R-squared: 0.277
F-statistic: 18.34 on 4 and 177 DF, p-value: 1.271e-12
I deduced that only the variables Aubepine et Bave are representative. So I redid the model just with this two variables:
potion1 <- plsr(Sommeil ~ Aubepine + Bave, data = potionTrain, validation = "LOO")
And I plot:
plot(potion1, ncomp = 2, asp = 1, line = TRUE)
Here is the plot of predicted vs measured values:
The problem is that I see the linear regression on the plot, but I can not know its equation and R². Is it possible ?
Is the first part is the same as a multiple regression linear (ANOVA)?
pacman::p_load(pls)
data(mtcars)
potion <- mtcars
potionTrain <- potion[1:28,]
potionTest <- potion[29:32,]
potion1 <- plsr(mpg ~ cyl + disp + hp + drat, data = potionTrain, validation = "LOO")
coef(potion1) # coefficeints
scores(potion1) # scores
## R^2:
R2(potion1, estimate = "train")
## cross-validated R^2:
R2(potion1)
## Both:
R2(potion1, estimate = "all")
Hi I'm new to R and would like to ask a more general question. How do I simulate or create an example data set which is suitable to be posted here and simultaneously posses the property of reproducibility. I would like, for instance, create a numeric example which abstract my data set properly. One condition woud be to implement some correlation between my dependent and independent variables.
For instance. how to introduce some correlation between my count and my in.var1 and in.var2?
set.seed(1122)
count<-rpois(1000,30)
in.var1<- rnorm(1000, mean = 25, sd = 3)
in.var1<- rnorm(1000, mean = 12, sd = 2)
data<-cbind(count,in.var1,in.var2)
You can introduce dependence by adding in some portion of the "information" in the two variables to the construction of the count variable:
set.seed(1222)
in.var1<- rnorm(1000, mean = 25, sd = 3)
#Corrected spelling of in.var2
in.var2<- rnorm(1000, mean = 12, sd = 2)
count<-rpois(1000,30) + 0.15*in.var1 + 0.3*in.var2
# Avoid use 'data` as an object name
dat<-data.frame(count,in.var1,in.var2)
> spearman(count, in.var1)
rho
0.06859676
> spearman(count, in.var2)
rho
0.1276568
> spearman(in.var1, in.var2)
rho
-0.02175273
> summary( glm(count ~ in.var1 + in.var2, data=dat) )
Call:
glm(formula = count ~ in.var1 + in.var2, data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-16.6816 -3.6910 -0.4238 3.4435 15.5326
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.05034 1.74084 16.688 < 2e-16 ***
in.var1 0.14701 0.05613 2.619 0.00895 **
in.var2 0.35512 0.08228 4.316 1.74e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
If you want count to be a function of in.var1 and invar.2 try this. Note that count is already a function name so I am changing it to Count
set.seed(1122)
in.var1<- rnorm(1000, mean = 4, sd = 3)
in.var2<- rnorm(1000, mean = 6, sd = 2)
Count<-rpois(1000, exp(3+ 0.5*in.var1 - 0.25*in.var2))
Data<-data.frame(Count=Count, Var1=in.var1, Var2=in.var2)
You now have a poisson count based on in.var1 and in.var2. A poisson regression will show an intercept of 3 and coefficients of 0.5 for Var1 and -0.25 for Var2
summary(glm(Count~Var1+Var2,data=Data, family=poisson))
Call:
glm(formula = Count ~ Var1 + Var2, family = poisson, data = Data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.84702 -0.76292 -0.04463 0.67525 2.79537
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.001390 0.011782 254.7 <2e-16 ***
Var1 0.499789 0.001004 498.0 <2e-16 ***
Var2 -0.250949 0.001443 -173.9 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 308190.7 on 999 degrees of freedom
Residual deviance: 1063.3 on 997 degrees of freedom
AIC: 6319.2
Number of Fisher Scoring iterations: 4
As I understand you want to add some pattern to your data.
# Basic info taken from Data Science Exploratory Analysis Course
# http://datasciencespecialization.github.io/courses/04_ExploratoryAnalysis/
set.seed(1122)
rowNumber = 1000
count<-rpois(rowNumber,30)
in.var1<- rnorm(rowNumber, mean = 25, sd = 3)
in.var2<- rnorm(rowNumber, mean = 12, sd = 2)
data<-cbind(count,in.var1,in.var2)
dataNew <- data
for (i in 1:rowNumber) {
# flip a coin
coinFlip <- rbinom(1, size = 1, prob = 0.5)
# if coin is heads add a common pattern to that row
if (coinFlip) {
dataNew[i,"count"] <- 2 * data[i,"in.var1"] + 10* data[i,"in.var2"]
}
}
Basically, I am adding a pattern count = 2 *in.var1 + 10 * in.var2 to some random rows, here coinFlip variable. Of course you should vectorize it for more rows.
I want to run a two stage probit least square regression in R. Does anyone know how to do this? Is there any package out there? I know it's possible to do it using Stata, so I imagine it's possible to do it with R.
You might want to be more specific when you say 'two-stage-probit-least-squares'. Since you refer to a Stata program that implements this I am guessing you are talking about the CDSIMEQ package, which implements the Amemiya (1978) procedure for the Heckit model (a.k.a Generalized Tobit, a.k.a. Tobit type II model, etc.). As Grant said, systemfit will do a Tobit for you, but not with two equations. The MicEcon package did have a Heckit (but the package has split so many times I don't know where it is now).
If you want what the CDSIMEQ does, it can easily be implemented in R. I wrote a function that replicates CDSIMEQ:
tspls <- function(formula1, formula2, data) {
# The Continous model
mf1 <- model.frame(formula1, data)
y1 <- model.response(mf1)
x1 <- model.matrix(attr(mf1, "terms"), mf1)
# The dicontionous model
mf2 <- model.frame(formula2, data)
y2 <- model.response(mf2)
x2 <- model.matrix(attr(mf2, "terms"), mf2)
# The matrix of all the exogenous variables
X <- cbind(x1, x2)
X <- X[, unique(colnames(X))]
J1 <- matrix(0, nrow = ncol(X), ncol = ncol(x1))
J2 <- matrix(0, nrow = ncol(X), ncol = ncol(x2))
for (i in 1:ncol(x1)) J1[match(colnames(x1)[i], colnames(X)), i] <- 1
for (i in 1:ncol(x2)) J2[match(colnames(x2)[i], colnames(X)), i] <- 1
# Step 1:
cat("\n\tNOW THE FIRST STAGE REGRESSION")
m1 <- lm(y1 ~ X - 1)
m2 <- glm(y2 ~ X - 1, family = binomial(link = "probit"))
print(summary(m1))
print(summary(m2))
yhat1 <- m1$fitted.values
yhat2 <- X %*% coef(m2)
PI1 <- m1$coefficients
PI2 <- m2$coefficients
V0 <- vcov(m2)
sigma1sq <- sum(m1$residuals ^ 2) / m1$df.residual
sigma12 <- 1 / length(y2) * sum(y2 * m1$residuals / dnorm(yhat2))
# Step 2:
cat("\n\tNOW THE SECOND STAGE REGRESSION WITH INSTRUMENTS")
m1 <- lm(y1 ~ yhat2 + x1 - 1)
m2 <- glm(y2 ~ yhat1 + x2 - 1, family = binomial(link = "probit"))
sm1 <- summary(m1)
sm2 <- summary(m2)
print(sm1)
print(sm2)
# Step 3:
cat("\tNOW THE SECOND STAGE REGRESSION WITH CORRECTED STANDARD ERRORS\n\n")
gamma1 <- m1$coefficients[1]
gamma2 <- m2$coefficients[1]
cc <- sigma1sq - 2 * gamma1 * sigma12
dd <- gamma2 ^ 2 * sigma1sq - 2 * gamma2 * sigma12
H <- cbind(PI2, J1)
G <- cbind(PI1, J2)
XX <- crossprod(X) # X'X
HXXH <- solve(t(H) %*% XX %*% H) # (H'X'XH)^(-1)
HXXVXXH <- t(H) %*% XX %*% V0 %*% XX %*% H # H'X'V0X'XH
Valpha1 <- cc * HXXH + gamma1 ^ 2 * HXXH %*% HXXVXXH %*% HXXH
GV <- t(G) %*% solve(V0) # G'V0^(-1)
GVG <- solve(GV %*% G) # (G'V0^(-1)G)^(-1)
Valpha2 <- GVG + dd * GVG %*% GV %*% solve(XX) %*% solve(V0) %*% G %*% GVG
ans1 <- coef(sm1)
ans2 <- coef(sm2)
ans1[,2] <- sqrt(diag(Valpha1))
ans2[,2] <- sqrt(diag(Valpha2))
ans1[,3] <- ans1[,1] / ans1[,2]
ans2[,3] <- ans2[,1] / ans2[,2]
ans1[,4] <- 2 * pt(abs(ans1[,3]), m1$df.residual, lower.tail = FALSE)
ans2[,4] <- 2 * pnorm(abs(ans2[,3]), lower.tail = FALSE)
cat("Continuous:\n")
print(ans1)
cat("Dichotomous:\n")
print(ans2)
}
For comparison, we can replicate the sample from the author of CDSIMEQ in their article about the package.
> library(foreign)
> cdsimeq <- read.dta("http://www.stata-journal.com/software/sj3-2/st0038/cdsimeq.dta")
> tspls(continuous ~ exog3 + exog2 + exog1 + exog4,
+ dichotomous ~ exog1 + exog2 + exog5 + exog6 + exog7,
+ data = cdsimeq)
NOW THE FIRST STAGE REGRESSION
Call:
lm(formula = y1 ~ X - 1)
Residuals:
Min 1Q Median 3Q Max
-1.885921 -0.438579 -0.006262 0.432156 2.133738
Coefficients:
Estimate Std. Error t value Pr(>|t|)
X(Intercept) 0.010752 0.020620 0.521 0.602187
Xexog3 0.158469 0.021862 7.249 8.46e-13 ***
Xexog2 -0.009669 0.021666 -0.446 0.655488
Xexog1 0.159955 0.021260 7.524 1.19e-13 ***
Xexog4 0.316575 0.022456 14.097 < 2e-16 ***
Xexog5 0.497207 0.021356 23.282 < 2e-16 ***
Xexog6 -0.078017 0.021755 -3.586 0.000352 ***
Xexog7 0.161177 0.022103 7.292 6.23e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6488 on 992 degrees of freedom
Multiple R-squared: 0.5972, Adjusted R-squared: 0.594
F-statistic: 183.9 on 8 and 992 DF, p-value: < 2.2e-16
Call:
glm(formula = y2 ~ X - 1, family = binomial(link = "probit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.49531 -0.59244 0.01983 0.59708 2.41810
Coefficients:
Estimate Std. Error z value Pr(>|z|)
X(Intercept) 0.08352 0.05280 1.582 0.113692
Xexog3 0.21345 0.05678 3.759 0.000170 ***
Xexog2 0.21131 0.05471 3.862 0.000112 ***
Xexog1 0.45591 0.06023 7.570 3.75e-14 ***
Xexog4 0.39031 0.06173 6.322 2.57e-10 ***
Xexog5 0.75955 0.06427 11.818 < 2e-16 ***
Xexog6 0.85461 0.06831 12.510 < 2e-16 ***
Xexog7 -0.16691 0.05653 -2.953 0.003152 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1386.29 on 1000 degrees of freedom
Residual deviance: 754.14 on 992 degrees of freedom
AIC: 770.14
Number of Fisher Scoring iterations: 6
NOW THE SECOND STAGE REGRESSION WITH INSTRUMENTS
Call:
lm(formula = y1 ~ yhat2 + x1 - 1)
Residuals:
Min 1Q Median 3Q Max
-2.32152 -0.53160 0.04886 0.53502 2.44818
Coefficients:
Estimate Std. Error t value Pr(>|t|)
yhat2 0.257592 0.021451 12.009 <2e-16 ***
x1(Intercept) 0.012185 0.024809 0.491 0.623
x1exog3 0.042520 0.026735 1.590 0.112
x1exog2 0.011854 0.026723 0.444 0.657
x1exog1 0.007773 0.028217 0.275 0.783
x1exog4 0.318636 0.028311 11.255 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7803 on 994 degrees of freedom
Multiple R-squared: 0.4163, Adjusted R-squared: 0.4128
F-statistic: 118.2 on 6 and 994 DF, p-value: < 2.2e-16
Call:
glm(formula = y2 ~ yhat1 + x2 - 1, family = binomial(link = "probit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.49610 -0.58595 0.01969 0.59857 2.41281
Coefficients:
Estimate Std. Error z value Pr(>|z|)
yhat1 1.26287 0.16061 7.863 3.75e-15 ***
x2(Intercept) 0.07080 0.05276 1.342 0.179654
x2exog1 0.25093 0.06466 3.880 0.000104 ***
x2exog2 0.22604 0.05389 4.194 2.74e-05 ***
x2exog5 0.12912 0.09510 1.358 0.174544
x2exog6 0.95609 0.07172 13.331 < 2e-16 ***
x2exog7 -0.37128 0.06759 -5.493 3.94e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1386.29 on 1000 degrees of freedom
Residual deviance: 754.21 on 993 degrees of freedom
AIC: 768.21
Number of Fisher Scoring iterations: 6
NOW THE SECOND STAGE REGRESSION WITH CORRECTED STANDARD ERRORS
Continuous:
Estimate Std. Error t value Pr(>|t|)
yhat2 0.25759209 0.1043073 2.46955009 0.01369540
x1(Intercept) 0.01218500 0.1198713 0.10165068 0.91905445
x1exog3 0.04252006 0.1291588 0.32920764 0.74206810
x1exog2 0.01185438 0.1290754 0.09184073 0.92684309
x1exog1 0.00777347 0.1363643 0.05700519 0.95455252
x1exog4 0.31863627 0.1367881 2.32941597 0.02003661
Dichotomous:
Estimate Std. Error z value Pr(>|z|)
yhat1 1.26286574 0.7395166 1.7076909 0.0876937093
x2(Intercept) 0.07079775 0.2666447 0.2655134 0.7906139867
x2exog1 0.25092561 0.3126763 0.8025092 0.4222584495
x2exog2 0.22603717 0.2739307 0.8251618 0.4092797527
x2exog5 0.12911922 0.4822986 0.2677163 0.7889176766
x2exog6 0.95609385 0.2823662 3.3860070 0.0007091758
x2exog7 -0.37128221 0.3265478 -1.1369920 0.2555416141
systemfit will also do the trick.
there are several packages available in R to do two state least squares. here are a few
sem: Two-Stage Least Squares
Zelig: Link removed, no longer functional (28.07.11)
let me know if these serve your purpose.