So I want to find the estimate parameter using GLM and compare it with mle2 package.
Here's my code for GLM
d <- read.delim("http://dnett.github.io/S510/Disease.txt")
d$disease=factor(d$disease)
d$ses=factor(d$ses)
d$sector=factor(d$sector)
str(d)
glm2 <- glm(disease~ses+sector, family=binomial(link=logit), data=d)
summary(glm2)
And my code for mle2()
y<-as.numeric(as.character(d$disease))
x1<-as.numeric(as.character(d$age))
x2<-as.numeric(as.character(d$sector))
x3<-as.numeric(as.character(d$ses))
library(bbmle)
nlldbin=function(A,B,C,D){
eta<-A+B*(x3==2)+C*(x3==3)+D*(x2==2)
p<-1/(1+exp(-eta))
joint.pdf= (p^y)*((1-p)^(1-y))
-sum(joint.pdf, log=TRUE ,na.rm=TRUE)
}
st <- list(A=0.0001,B=0.0001,C=0.0001,D=0.0001)
est_mle2<-mle2(start=st,nlldbin,hessian=TRUE)
summary(est_mle2)
But the result is quiet different. Please help me to fix this, thank you!
> summary(est_mle2)
Maximum likelihood estimation
Call:
mle2(minuslogl = nlldbin, start = st, hessian.opts = TRUE)
Coefficients:
Estimate Std. Error z value Pr(z)
A -20.4999 5775.1484 -0.0035 0.9972
B -5.2499 120578.9515 0.0000 1.0000
C -7.9999 722637.2670 0.0000 1.0000
D -2.2499 39746.6639 -0.0001 1.0000
> summary(glm2)
Call:
glm(formula = disease ~ ses + sector, family = binomial(link = logit),
data = d)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.52001 0.33514 -4.535 5.75e-06 ***
ses2 -0.08525 0.41744 -0.204 0.838177
ses3 0.16086 0.39261 0.410 0.682019
sector2 1.28098 0.34140 3.752 0.000175 ***
I'm not sure your definition of eta is correct. I would use the model matrix.
X <- model.matrix(~ ses + sector, data = d)
nlldbin <- function(A,B,C,D){
eta <- X %*% c(A, B, C, D)
p <- 1/(1+exp(-eta))
logpdf <- y*log(p) + (1-y)*log(1-p)
-sum(logpdf)
}
This line
-sum(joint.pdf, log=TRUE ,na.rm=TRUE)
is wrong. sum doesn't have a special log argument; what you're doing is adding the value TRUE (which gets converted to 1) to the pdf.
What you want is
-sum(log(joint.pdf), na.rm=TRUE)
but this is also not very good for numerical reasons, as the pdf is likely to underflow. A better way of writing it would be
logpdf <- y*log(p) + (1-y)*log(1-p)
-sum(logpdf, na.rm=TRUE)
Related
I am replicating a negative binomial regression model in R. When calculating robust standard errors, the output does not match Stata output of standard errors.
The original Stata code is
nbreg displaced eei lcostofwar cfughh roadskm lpopdensity ltkilled, robust nolog
I have attempted both manual calculation and vcovHC from sandwich. However, neither produces the same results.
My regression model is as follows:
mod1 <- glm.nb(displaced ~ eei + costofwar_log + cfughh + roadskm + popdensity_log + tkilled_log, data = mod1_df)
With vcovHC I have tried every option from HC0 to HC5.
Attempt 1:
cov_m1 <- vcovHC(mod1, type = "HC0", sandwich = T)
se <- sqrt(diag(cov_m1))
Attempt 2:
mod1_rob <- coeftest(mod1, vcovHC = vcov(mod1, type = "HC0"))
The most successful has been HC0 and vcov = sandwich but no SEs are correct.
Any suggestions?
EDIT
My output is as follows (using HC0):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3281183 1.5441312 0.8601 0.389730
eei -0.0435529 0.0183359 -2.3753 0.017536 *
costofwar_log 0.2984376 0.1350518 2.2098 0.027119 *
cfughh -0.0380690 0.0130254 -2.9227 0.003470 **
roadskm 0.0020812 0.0010864 1.9156 0.055421 .
popdensity_log -0.4661079 0.1748682 -2.6655 0.007688 **
tkilled_log 1.0949084 0.2159161 5.0710 3.958e-07 ***
The Stata output I am attempting to replicate is:
Estimate Std. Error
(Intercept) 1.328 1.272
eei -0.044 0.015
costofwar_log 0.298 0.123
cfughh -0.038 0.018
roadskm 0.002 0.0001
popdensity_log -0.466 0.208
tkilled_log 1.095 0.209
The dataset is found here and the recoded variables are:
mod1_df <- table %>%
select(displaced, eei_01, costofwar, cfughh, roadskm, popdensity,
tkilled)
mod1_df$popdensity_log <- log(mod1_df$popdensity + 1)
mod1_df$tkilled_log <- log(mod1_df$tkilled + 1)
mod1_df$costofwar_log <- log(mod1_df$costofwar + 1)
mod1_df$eei <- mod1_df$eei_01*100
Stata uses the observed Hessian for its computations, glm.nb() uses the expected Hessian. Therefore, the default bread() employed by the sandwich() function is different, leading to different results. There are other R packages that employ the observed hessian for its variance-covariance estimate (e.g., gamlss) but these do not supply an estfun() method for the sandwich package.
Hence, below I simply set up a dedicated bread_obs() function that extracts the ML estimates from a negbin object, sets up the negative log-likelihood, computes the observed Hessian numerically via numDeriv::hessian() and computes the "bread" from it (omitting the estimate for log(theta)):
bread_obs <- function(object, method = "BFGS", maxit = 5000, reltol = 1e-12, ...) {
## data and estimated parameters
Y <- model.response(model.frame(object))
X <- model.matrix(object)
par <- c(coef(object), "log(theta)" = log(object$theta))
## dimensions
n <- NROW(X)
k <- length(par)
## nb log-likelihood
nll <- function(par) suppressWarnings(-sum(dnbinom(Y,
mu = as.vector(exp(X %*% head(par, -1))),
size = exp(tail(par, 1)), log = TRUE)))
## covariance based on observed Hessian
rval <- numDeriv::hessian(nll, par)
rval <- solve(rval) * n
rval[-k, -k]
}
With that function I can compare the sandwich() output (based on the expected Hessian) with the output using the bread_obs() (based on the observed Hessian).
s_exp <- sandwich(mod1)
s_obs <- sandwich(mod1, vcov = bread_obs)
cbind("Coef" = coef(mod1), "SE (Exp)" = sqrt(diag(s_exp)), "SE (Obs)" = sqrt(diag(s_obs)))
## Coef SE (Exp) SE (Obs)
## (Intercept) 1.328 1.259 1.259
## eei -0.044 0.017 0.015
## costofwar_log 0.298 0.160 0.121
## cfughh -0.038 0.015 0.018
## roadskm 0.002 0.001 0.001
## popdensity_log -0.466 0.135 0.207
## tkilled_log 1.095 0.179 0.208
This still has slight differences compared to Stata but these are likely numerical differences from the optimization etc.
If you create a new dedicated bread() method for negbin objects
bread.negbin <- bread_obs
then the method dispatch will use this if you do sandwich(mod1).
In R you need to manually provide a degree of freedom correction, so try this which I borrowed from this source:
dfa <- (G/(G - 1)) * (N - 1)/pm1$df.residual
# display with cluster VCE and df-adjustment
firm_c_vcov <- dfa * vcovHC(pm1, type = "HC0", cluster = "group", adjust = T)
coeftest(pm1, vcov = firm_c_vcov)
Here G is the number of Panels in your data set, N is the number of observations and pm1 is your model estimated. Obviously, you could drop the clustering.
I have run the Hosmer Lemeshow statistic in R, but I have obtained an p-value of 1. This seems strange to me. I know that a high p-valvalue means that we do not reject the null hypothesis that observed and expected are the same, but is it possible i have an error somewhere?
How do i interpret such p-value?
Below is the code i have used to run the test. I also attach how my model looks like. Response variable is a count variable, while all regressors are continous. I have run a negative binomial model, due to detected overdispersion in my initial poisson model.
> hosmerlem <- function(y, yhat, g=10)
+ {cutyhat <- cut(yhat, breaks = quantile(yhat, probs=seq(0,1, 1/g)), include.lowest=TRUE)
+ obs <- xtabs(cbind(1 - y, y) ~ cutyhat)
+ expect <- xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
+ chisq <- sum((obs - expect)^2/expect)
+ P <- 1 - pchisq(chisq, g - 2)
+ return(list(chisq=chisq,p.value=P))}
> hosmerlem(y=TOT.N, yhat=fitted(final.model))
$chisq
[1] -2.529054
$p.value
[1] 1
> final.model <-glm.nb(TOT.N ~ D.PARK + OPEN.L + L.WAT.C + sqrt(L.P.ROAD))
> summary(final.model)
Call:
glm.nb(formula = TOT.N ~ D.PARK + OPEN.L + L.WAT.C + sqrt(L.P.ROAD),
init.theta = 4.979895131, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.08218 -0.70494 -0.09268 0.55575 1.67860
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.032e+00 3.363e-01 11.989 < 2e-16 ***
D.PARK -1.154e-04 1.061e-05 -10.878 < 2e-16 ***
OPEN.L -1.085e-02 3.122e-03 -3.475 0.00051 ***
L.WAT.C 1.597e-01 7.852e-02 2.034 0.04195 *
sqrt(L.P.ROAD) 4.924e-01 3.101e-01 1.588 0.11231
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(4.9799) family taken to be 1)
Null deviance: 197.574 on 51 degrees of freedom
Residual deviance: 51.329 on 47 degrees of freedom
AIC: 383.54
Number of Fisher Scoring iterations: 1
Theta: 4.98
Std. Err.: 1.22
2 x log-likelihood: -371.542
As correctly pointed out by #BenBolker, Hosmer-Lemeshow is a test for logistic regression, not for a negative binomial generalized linear model.
If we consider to apply the test to a logistic regression,
the inputs of the function hosmerlem (a copy of the hoslem.test function in the package ResourceSelection) should be:
- y = a numeric vector of observations, binary (0/1)
- yhat = expected values (probabilities)
Here is an illustrative example that shows how to get the correct inputs:
set.seed(123)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.1 + 0.5*x))
logmod <- glm(y ~ x, family=binomial)
# Important: use the type="response" option
yhat <- predict(logmod, type="response")
hosmerlem(y, yhat)
########
$chisq
[1] 4.522719
$p.value
[1] 0.8071559
The same result is given by the function hoslem.test:
library(ResourceSelection)
hoslem.test(y, yhat)
########
Hosmer and Lemeshow goodness of fit (GOF) test
data: y, yhat
X-squared = 4.5227, df = 8, p-value = 0.8072
As already mentioned, HL-test is not appropriate for the specified model. It is also important to know that a large p-value doesn't necessarily mean a good fit. It could also be that there isn't enough evidence to prove it's a poor fit.
Meanwhile, the gofcat package implementation of the HL-test provides for passing model objects directly to the function without necessarily supplying the observed and predicted values. For the simulated data one has:
library(gofcat)
set.seed(123)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.1 + 0.5*x))
logmod <- glm(y ~ x, family=binomial)
hosmerlem(logmod, group = 10)
Hosmer-Lemeshow Test:
Chi-sq df pr(>chi)
binary(Hosmerlem) 4.5227 8 0.8072
H0: No lack of fit dictated
rho: 100%
When we have a linear model with a factor variable X (with levels A, B, and C)
y ~ factor(X) + Var2 + Var3
The result shows the estimate XB and XC which is differences B - A and C - A. (suppose that the reference is A).
If we want to know the p-value of the difference between B and C: C - B,
we should designate B or C as a reference group and re-run the model.
Can we get the p-values of the effect B - A, C - A, and C - B at one time?
You are looking for linear hypothesis test by check p-value of some linear combination of regression coefficients. Based on my answer: How to conduct linear hypothesis test on regression coefficients with a clustered covariance matrix?, where we only considered sum of coefficients, I will extend the function LinearCombTest to handle more general cases, supposing alpha as some combination coefficients of variables in vars:
LinearCombTest <- function (lmObject, vars, alpha, .vcov = NULL) {
## if `.vcov` missing, use the one returned by `lm`
if (is.null(.vcov)) .vcov <- vcov(lmObject)
## estimated coefficients
beta <- coef(lmObject)
## linear combination of `vars` with combination coefficients `alpha`
LinearComb <- sum(beta[vars] * alpha)
## get standard errors for sum of `LinearComb`
LinearComb_se <- sum(alpha * crossprod(.vcov[vars, vars], alpha)) ^ 0.5
## perform t-test on `sumvars`
tscore <- LinearComb / LinearComb_se
pvalue <- 2 * pt(abs(tscore), lmObject$df.residual, lower.tail = FALSE)
## return a matrix
form <- paste0("(", paste(alpha, vars, sep = " * "), ")")
form <- paste0(paste0(form, collapse = " + "), " = 0")
matrix(c(LinearComb, LinearComb_se, tscore, pvalue), nrow = 1L,
dimnames = list(form, c("Estimate", "Std. Error", "t value", "Pr(>|t|)")))
}
Consider a simple example, where we have a balanced design for three groups A, B and C, with group mean 0, 1, 2, respectively.
x <- gl(3,100,labels = LETTERS[1:3])
set.seed(0)
y <- c(rnorm(100, 0), rnorm(100, 1), rnorm(100, 2)) + 0.1
fit <- lm(y ~ x)
coef(summary(fit))
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.1226684 0.09692277 1.265631 2.066372e-01
#xB 0.9317800 0.13706949 6.797866 5.823987e-11
#xC 2.0445528 0.13706949 14.916177 6.141008e-38
Since A is the reference level, xB is giving B - A while xC is giving C - A. Suppose we are now interested in the difference between group B and C, i.e., C - B, we can use
LinearCombTest(fit, c("xC", "xB"), c(1, -1))
# Estimate Std. Error t value Pr(>|t|)
#(1 * xC) + (-1 * xB) = 0 1.112773 0.1370695 8.118312 1.270686e-14
Note, this function is also handy to work out the group mean of B and C, that is (Intercept) + xB and (Intercept) + xC:
LinearCombTest(fit, c("(Intercept)", "xB"), c(1, 1))
# Estimate Std. Error t value Pr(>|t|)
#(1 * (Intercept)) + (1 * xB) = 0 1.054448 0.09692277 10.87926 2.007956e-23
LinearCombTest(fit, c("(Intercept)", "xC"), c(1, 1))
# Estimate Std. Error t value Pr(>|t|)
#(1 * (Intercept)) + (1 * xC) = 0 2.167221 0.09692277 22.36029 1.272811e-65
Alternative solution with lsmeans
Consider the above toy example again:
library(lsmeans)
lsmeans(fit, spec = "x", contr = "revpairwise")
#$lsmeans
# x lsmean SE df lower.CL upper.CL
# A 0.1226684 0.09692277 297 -0.06807396 0.3134109
# B 1.0544484 0.09692277 297 0.86370603 1.2451909
# C 2.1672213 0.09692277 297 1.97647888 2.3579637
#
#Confidence level used: 0.95
#
#$contrasts
# contrast estimate SE df t.ratio p.value
# B - A 0.931780 0.1370695 297 6.798 <.0001
# C - A 2.044553 0.1370695 297 14.916 <.0001
# C - B 1.112773 0.1370695 297 8.118 <.0001
#
#P value adjustment: tukey method for comparing a family of 3 estimates
The $lsmeans domain returns the marginal group mean, while $contrasts returns pairwise group mean difference, since we have used "revpairwise" contrast. Read p.32 of lsmeans for difference between "pairwise" and "revpairwise".
Well this is certainly interesting, as we can compare with the result from LinearCombTest. We see that LinearCombTest is doing correctly.
glht (general linear hypothesis testing) from multcomp package makes this sort of multiple hypothesis test easy without re-running a bunch of separate models. It is essentially crafting a customized contrast matrix based on your defined comparisons of interest.
Using your example comparisons and building on the data #ZheyuanLi provided:
x <- gl(3,100,labels = LETTERS[1:3])
set.seed(0)
y <- c(rnorm(100, 0), rnorm(100, 1), rnorm(100, 2)) + 0.1
fit <- lm(y ~ x)
library(multcomp)
my_ht <- glht(fit, linfct = mcp(x = c("B-A = 0",
"C-A = 0",
"C-B = 0")))
summary(my_ht) will give you the adjusted p-values for the comparisons of interest.
#Linear Hypotheses:
# Estimate Std. Error t value Pr(>|t|)
#B - A == 0 0.9318 0.1371 6.798 1.11e-10 ***
#C - A == 0 2.0446 0.1371 14.916 < 1e-10 ***
#C - B == 0 1.1128 0.1371 8.118 < 1e-10 ***
You could use the library car, and use the function linearHypothesis with the parameter vcov.
Set this as the variance-covariance matrix of your model.
The function takes formulas or a matrix to describe the system of equations that you would like to test.
I'm trying to reproduce this stata example and move from stargazer to texreg. The data is available here.
To run the regression and get the se I run this code:
library(readstata13)
library(sandwich)
cluster_se <- function(model_result, data, cluster){
model_variables <- intersect(colnames(data), c(colnames(model_result$model), cluster))
model_rows <- as.integer(rownames(model_result$model))
data <- data[model_rows, model_variables]
cl <- data[[cluster]]
M <- length(unique(cl))
N <- nrow(data)
K <- model_result$rank
dfc <- (M/(M-1))*((N-1)/(N-K))
uj <- apply(estfun(model_result), 2, function(x) tapply(x, cl, sum));
vcovCL <- dfc*sandwich(model_result, meat=crossprod(uj)/N)
sqrt(diag(vcovCL))
}
elemapi2 <- read.dta13(file = 'elemapi2.dta')
lm1 <- lm(formula = api00 ~ acs_k3 + acs_46 + full + enroll, data = elemapi2)
se.lm1 <- cluster_se(model_result = lm1, data = elemapi2, cluster = "dnum")
stargazer::stargazer(lm1, type = "text", style = "aer", se = list(se.lm1))
==========================================================
api00
----------------------------------------------------------
acs_k3 6.954
(6.901)
acs_46 5.966**
(2.531)
full 4.668***
(0.703)
enroll -0.106**
(0.043)
Constant -5.200
(121.786)
Observations 395
R2 0.385
Adjusted R2 0.379
Residual Std. Error 112.198 (df = 390)
F Statistic 61.006*** (df = 4; 390)
----------------------------------------------------------
Notes: ***Significant at the 1 percent level.
**Significant at the 5 percent level.
*Significant at the 10 percent level.
texreg produces this:
texreg::screenreg(lm1, override.se=list(se.lm1))
========================
Model 1
------------------------
(Intercept) -5.20
(121.79)
acs_k3 6.95
(6.90)
acs_46 5.97 ***
(2.53)
full 4.67 ***
(0.70)
enroll -0.11 ***
(0.04)
------------------------
R^2 0.38
Adj. R^2 0.38
Num. obs. 395
RMSE 112.20
========================
How can I fix the p-values?
Robust Standard Errors with texreg are easy: just pass the coeftest directly!
This has become much easier since the question was last answered: it appears you can now just pass the coeftest with the desired variance-covariance matrix directly. Downside: you lose the goodness of fit statistics (such as R^2 and number of observations), but depending on your needs, this may not be a big problem
How to include robust standard errors with texreg
> screenreg(list(reg1, coeftest(reg1,vcov = vcovHC(reg1, 'HC1'))),
custom.model.names = c('Standard Standard Errors', 'Robust Standard Errors'))
=============================================================
Standard Standard Errors Robust Standard Errors
-------------------------------------------------------------
(Intercept) -192.89 *** -192.89 *
(55.59) (75.38)
x 2.84 ** 2.84 **
(0.96) (1.04)
-------------------------------------------------------------
R^2 0.08
Adj. R^2 0.07
Num. obs. 100
RMSE 275.88
=============================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
To generate this example, I created a dataframe with heteroscedasticity, see below for full runnable sample code:
require(sandwich);
require(texreg);
set.seed(1234)
df <- data.frame(x = 1:100);
df$y <- 1 + 0.5*df$x + 5*100:1*rnorm(100)
reg1 <- lm(y ~ x, data = df)
First, notice that your usage of as.integer is dangerous and likely to cause problems once you use data with non-numeric rownames. For instance, using the built-in dataset mtcars whose rownames consist of car names, your function will coerce all rownames to NA, and your function will not work.
To your actual question, you can provide custom p-values to texreg, which means that you need to compute the corresponding p-values. To achieve this, you could compute the variance-covariance matrix, compute the test-statistics, and then compute the p-value manually, or you just compute the variance-covariance matrix and supply it to e.g. coeftest. Then you can extract the standard errors and p-values from there. Since I am unwilling to download any data, I use the mtcars-data for the following:
library(sandwich)
library(lmtest)
library(texreg)
cluster_se <- function(model_result, data, cluster){
model_variables <- intersect(colnames(data), c(colnames(model_result$model), cluster))
model_rows <- rownames(model_result$model) # changed to be able to work with mtcars, not tested with other data
data <- data[model_rows, model_variables]
cl <- data[[cluster]]
M <- length(unique(cl))
N <- nrow(data)
K <- model_result$rank
dfc <- (M/(M-1))*((N-1)/(N-K))
uj <- apply(estfun(model_result), 2, function(x) tapply(x, cl, sum));
vcovCL <- dfc*sandwich(model_result, meat=crossprod(uj)/N)
}
lm1 <- lm(formula = mpg ~ cyl + disp, data = mtcars)
vcov.lm1 <- cluster_se(model_result = lm1, data = mtcars, cluster = "carb")
standard.errors <- coeftest(lm1, vcov. = vcov.lm1)[,2]
p.values <- coeftest(lm1, vcov. = vcov.lm1)[,4]
texreg::screenreg(lm1, override.se=standard.errors, override.p = p.values)
And just for completeness sake, let's do it manually:
t.stats <- abs(coefficients(lm1) / sqrt(diag(vcov.lm1)))
t.stats
(Intercept) cyl disp
38.681699 5.365107 3.745143
These are your t-statistics using the cluster-robust standard errors. The degree of freedom is stored in lm1$df.residual, and using the built in functions for the t-distribution (see e.g. ?pt), we get:
manual.p <- 2*pt(-t.stats, df=lm1$df.residual)
manual.p
(Intercept) cyl disp
1.648628e-26 9.197470e-06 7.954759e-04
Here, pt is the distribution function, and we want to compute the probability of observing a statistic at least as extreme as the one we observe. Since we testing two-sided and it is a symmetric density, we first take the left extreme using the negative value, and then double it. This is identical to using 2*(1-pt(t.stats, df=lm1$df.residual)). Now, just to check that this yields the same result as before:
all.equal(p.values, manual.p)
[1] TRUE
I'm running a multivariate regression with 2 outcome variables and 5 predictors. I would like to obtain the confidence intervals for all regression coefficients. Usually I use the function lm but it doesn't seem to work for a multivariate regression model (object mlm).
Here's a reproducible example.
library(car)
mod <- lm(cbind(income, prestige) ~ education + women, data=Prestige)
confint(mod) # doesn't return anything.
Any alternative way to do it? (I could just use the value of the standard error and multiply by the right critical t value, but I was wondering if there was an easier way to do it).
confint won't return you anything, because there is no "mlm" method supported:
methods(confint)
#[1] confint.default confint.glm* confint.lm confint.nls*
As you said, we can just plus / minus some multiple of standard error to get upper / lower bound of confidence interval. You were probably going to do this via coef(summary(mod)), then use some *apply method to extract standard errors. But my answer to Obtain standard errors of regression coefficients for an “mlm” object returned by lm() gives you a supper efficient way to get standard errors without going through summary. Applying std_mlm to your example model gives:
se <- std_mlm(mod)
# income prestige
#(Intercept) 1162.299027 3.54212524
#education 103.731410 0.31612316
#women 8.921229 0.02718759
Now, we define another small function to compute lower and upper bound:
## add "mlm" method to generic function "confint"
confint.mlm <- function (model, level = 0.95) {
beta <- coef(model)
se <- std_mlm (model)
alpha <- qt((1 - level) / 2, df = model$df.residual)
list(lower = beta + alpha * se, upper = beta - alpha * se)
}
## call "confint"
confint(mod)
#$lower
# income prestige
#(Intercept) -3798.25140 -15.7825086
#education 739.05564 4.8005390
#women -81.75738 -0.1469923
#
#$upper
# income prestige
#(Intercept) 814.25546 -1.72581876
#education 1150.70689 6.05505285
#women -46.35407 -0.03910015
It is easy to interpret this. For example, for response income, the 95%-confidence interval for all variables are
#(intercept) (-3798.25140, 814.25546)
# education (739.05564, 1150.70689)
# women (-81.75738, -46.35407)
This comes from the predict.lm example. You want the interval = 'confidence' option.
x <- rnorm(15)
y <- x + rnorm(15)
predict(lm(y ~ x))
new <- data.frame(x = seq(-3, 3, 0.5))
predict(lm(y ~ x), new, se.fit = TRUE)
pred.w.clim <- predict(lm(y ~ x), new, interval = "confidence")
matplot(new$x, pred.w.clim,
lty = c(1,2,2,3,3), type = "l", ylab = "predicted y")
This seems to have been discussed recently (July 2018) on the R-devel list, so hopefully by the next version of R it will be fixed. A workaround proposed on that list is to use:
confint.mlm <- function (object, level = 0.95, ...) {
cf <- coef(object)
ncfs <- as.numeric(cf)
a <- (1 - level)/2
a <- c(a, 1 - a)
fac <- qt(a, object$df.residual)
pct <- stats:::format.perc(a, 3)
ses <- sqrt(diag(vcov(object)))
ci <- ncfs + ses %o% fac
setNames(data.frame(ci),pct)
}
Test:
fit_mlm <- lm(cbind(mpg, disp) ~ wt, mtcars)
confint(fit_mlm)
Gives:
2.5 % 97.5 %
mpg:(Intercept) 33.450500 41.119753
mpg:wt -6.486308 -4.202635
disp:(Intercept) -204.091436 -58.205395
disp:wt 90.757897 134.198380
Personnally, I like it in a clean tibble way (using broom::tidy would be even better, but has an issue currently)
library(tidyverse)
confint(fit_mlm) %>%
rownames_to_column() %>%
separate(rowname, c("response", "term"), sep=":")
Gives:
response term 2.5 % 97.5 %
1 mpg (Intercept) 33.450500 41.119753
2 mpg wt -6.486308 -4.202635
3 disp (Intercept) -204.091436 -58.205395
4 disp wt 90.757897 134.198380