I am trying to extract the confidence intervals for my panel logit regression. I am using the following code:
model <- bife(dependent_variable ~ x1 + x2 | area, data = df, model = 'logit')
confint(model)
Running confint gives me NA values for all the coefficients and their confidence intervals.
Is this because of the 'bife' object? The model itself runs fine.
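The cause is the bife:::vcov.bife method, which returns a covariance matrix without dimnames, so the default confint method looks up the standard errors by coefficient name, finds nothing, and returns NA. Until the author fixes this, we can help ourselves by writing a confint.bife method that assigns the coefficient names to the vcov.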
confint.bife <- function (object, parm, level=0.95, ...) {
cf <- coef(object)
pnames <- names(cf)
if (missing(parm)) parm <- pnames
else if (is.numeric(parm)) parm <- pnames[parm]
a <- (1 - level)/2
a <- c(a, 1 - a)
pct <- stats:::format.perc(a, 3)
fac <- qnorm(a)
ci <- array(NA, dim=c(length(parm), 2L),
dimnames=list(parm, pct))
vc <- `dimnames<-`(vcov(object), list(pnames, pnames))
ses <- sqrt(diag(vc))[parm]
ci[] <- cf[parm] + ses %o% fac
ci
}
library('bife')
mod <- bife(LFP ~ I(AGE^2) + log(INCH) + KID1 + KID2 + KID3 +
factor(TIME) | ID, psid)
confint(mod)
# 2.5 % 97.5 %
# I(AGE^2) -0.003787755 -0.001185755
# log(INCH) -0.606681358 -0.236717893
# KID1 -1.393748723 -1.008131941
# KID2 -0.830532213 -0.485097762
# KID3 -0.248997085 0.012550225
# factor(TIME)2 -0.244728227 0.303869081
# factor(TIME)3 -0.190434814 0.438179674
# factor(TIME)4 0.117647679 0.870167422
# factor(TIME)5 0.635239557 1.547524672
# factor(TIME)6 0.613792831 1.689971248
# factor(TIME)7 0.639896725 1.876532219
# factor(TIME)8 0.585828050 2.017753781
# factor(TIME)9 0.753717289 2.381327746
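To see the mechanism without bife installed, here is a minimal base-R sketch. It assumes an ordinary logit glm on mtcars as a stand-in for the bife fit (that substitution is mine, not part of the answer above). It builds the same Wald interval by hand and shows that a name-based lookup on an unnamed vector is what produces the NA values.

```r
# Base-R illustration only: a plain logit glm stands in for the bife model.
fit <- glm(am ~ wt, data = mtcars, family = binomial())

cf  <- coef(fit)
ses <- sqrt(diag(vcov(fit)))           # glm's vcov already carries dimnames
a   <- c(0.025, 0.975)

# Wald interval: estimate + standard error * normal quantile
wald_ci <- cf + ses[names(cf)] %o% qnorm(a)
colnames(wald_ci) <- c("2.5 %", "97.5 %")

# Same numbers as the default method
all.equal(wald_ci, confint.default(fit))

# What goes wrong when the names are missing: lookup by name gives NA
unname(ses)["wt"]
```

The last line is the whole problem in miniature: once the standard errors have names, indexing them by coefficient name works again.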
.lm.fit is considerably faster than lm for reasons documented in several places, but it is not as straightforward to get an adjusted r-squared value from it, so I'm hoping for some help.
Using lm() and then summary() to get the adjusted r-squared.
tstlm <- lm(cyl ~ hp + wt, data = mtcars)
summary(tstlm)$adj.r.squared
Using .lm.fit
mtmatrix <- as.matrix(mtcars)
tstlmf <- .lm.fit(cbind(1,mtmatrix [,c("hp","wt")]), mtmatrix [,"cyl"])
And here I'm stuck. I suspect the information I need to calculate adjusted r-squared is found in the .lm.fit model somewhere but I can't quite figure out how to proceed.
Thanks in advance for any suggestions.
1) R squared equals the squared correlation between the dependent variable and the fitted values. We can get the residuals from tstlmf using resid(tstlmf), and the fitted values equal y minus those residuals.
Adjusted R squared is then obtained by rescaling 1 - R squared with a factor that depends only on the number of rows and columns of X, namely (n - 1) / (n - p).
Note that the formulas would change if there is no intercept.
X <- with(mtcars, cbind(1, hp, wt))
y <- mtcars$cyl
tstlmf <- .lm.fit(X, y)
rsq <- cor(y, y - resid(tstlmf))^2; rsq
## [1] 0.7898
adj <- 1 - (1-rsq) * (nrow(X) - 1) / -diff(dim(X)); adj
## [1] 0.7753
# check
tstlm <- lm(cyl ~ hp + wt, mtcars)
s <- summary(tstlm)
s$r.squared
## [1] 0.7898
s$adj.r.squared
## [1] 0.7753
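As a sketch of how the formulas change without an intercept: the total sum of squares is no longer centered, and n replaces n - 1 in the adjustment. The names X0, fit0, rsq0 and adj0 are introduced here for illustration only, and the result is checked against summary.lm for the same no-intercept model.

```r
# No-intercept fit: the design matrix has no column of ones.
X0 <- with(mtcars, cbind(hp, wt))
y  <- mtcars$cyl
fit0 <- .lm.fit(X0, y)

rss  <- sum(fit0$residuals^2)
tss0 <- sum(y^2)                        # uncentered total sum of squares
rsq0 <- 1 - rss / tss0

# n (not n - 1) in the numerator, because no mean is estimated
adj0 <- 1 - (1 - rsq0) * nrow(X0) / (nrow(X0) - ncol(X0))

# check against lm
s0 <- summary(lm(cyl ~ 0 + hp + wt, data = mtcars))
all.equal(rsq0, s0$r.squared)
all.equal(adj0, s0$adj.r.squared)
```

Note that the squared-correlation shortcut from 1) does not carry over to this case, so the sums of squares are used directly.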
2) R squared can alternatively be calculated as the ratio var(fitted) / var(y), in which case we write:
tstlmf <- .lm.fit(X, y)
rsq <- var(y - resid(tstlmf)) / var(y); rsq
## [1] 0.7898
adj <- 1 - (1-rsq) * (nrow(X) - 1) / -diff(dim(X)); adj
## [1] 0.7753
collapse
flm in the collapse package may be slightly faster than .lm.fit. It returns the coefficients only.
library(collapse)
tstflm <- flm(y, X)
rsq <- c(cor(y, X %*% tstflm)^2); rsq
## [1] 0.7898
adj <- 1 - (1-rsq) * (nrow(X) - 1) / -diff(dim(X)); adj
## [1] 0.7753
or
tstflm <- flm(y, X)
rsq <- var(X %*% tstflm) / var(y); rsq
## [1] 0.7898
adj <- 1 - (1-rsq) * (nrow(X) - 1) / -diff(dim(X)); adj
## [1] 0.7753
The following function computes the adjusted R2 from an object returned by .lm.fit and the response vector y.
adj_r2_lmfit <- function(object, y){
ypred <- y - resid(object)
mss <- sum((ypred - mean(ypred))^2)
rss <- sum(resid(object)^2)
rdf <- length(resid(object)) - object$rank
r.squared <- mss/(mss + rss)
adj.r.squared <- 1 - (1 - r.squared)*(NROW(y) - 1)/rdf
adj.r.squared
}
tstlm <- lm(cyl ~ hp + wt, data = mtcars)
mtmatrix <- as.matrix(mtcars)  # numeric matrix, as in the question
tstlmf <- .lm.fit(cbind(1, mtmatrix[, c("hp", "wt")]), mtmatrix[, "cyl"])
summary(tstlm)$adj.r.squared
#[1] 0.7753073
adj_r2_lmfit(tstlmf, mtmatrix [,"cyl"])
#[1] 0.7753073
I am trying to calculate manually the r-squared given by lm() in R
Considering:
fit <- lm(obs_values ~ preds_values, df)
with sd(df$obs_values) == sd(df$preds_values) and mean(df$obs_values) == mean(df$preds_values)
To do so I can extract the residuals by doing
res_a = residuals(fit) and then inject them in the formula as :
y = sum( (df$obs_values - mean(df$obs_values))^2 )
r_squared = 1 - sum(res_a^2)/y
Here I get the expected r-squared
Now, I would like to get the residual manually.
It should be as trivial as :
res_b = df$obs_values - df$preds_values, but for some reason, res_b is different from res_a...
You can't just do y - x in a regression y ~ x to get residuals. Where have regression coefficients gone?
fit <- lm(y ~ x)
b <- coef(fit)
resi <- y - (b[1] + b[2] * x)
You have many options:
## Residuals manually
# option 1
beta_hat <- coef(fit)
obs_values_hat <- beta_hat["(Intercept)"] + beta_hat["preds_values"] * preds_values
u_hat <- obs_values - obs_values_hat # residuals
# option 2
obs_values_hat <- fitted(fit)
u_hat <- obs_values - obs_values_hat # residuals
# (option 3 - not manually) or just u_hat <- resid(fit)
## R-squared manually
# option 1
var(obs_values_hat) / var(obs_values)
# option 2
1 - var(u_hat) / var(obs_values)
# option 3
cor(obs_values, obs_values_hat)^2
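Here is a small self-contained check of the point made in these answers, on simulated data (the variables and the r2 helper are made up for illustration). Residuals built from the estimated coefficients reproduce lm's R squared, while the naive y - x does not.

```r
set.seed(1)
x <- rnorm(50)
y <- 2 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)

b <- coef(fit)
res_manual <- y - (b[1] + b[2] * x)     # uses the fitted intercept and slope
res_naive  <- y - x                     # silently assumes intercept 0, slope 1

# R squared from a residual vector, relative to the centered total sum of squares
r2 <- function(res) 1 - sum(res^2) / sum((y - mean(y))^2)

all.equal(unname(res_manual), unname(resid(fit)))
all.equal(r2(res_manual), summary(fit)$r.squared)
r2(res_naive)                           # never larger than the value from lm
```

Since least squares minimizes the residual sum of squares over all intercept and slope pairs, the naive residuals can only give a lower (possibly negative) value.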
When I run a panel specification with clustered standard errors in plm and lfe, I get results that differ in the second significant figure. Does anyone know why they differ in their calculation of the SEs?
set.seed(572015)
library(lfe)
library(plm)
library(lmtest)
# clustering example
x <- c(sapply(sample(1:20), rep, times = 1000)) + rnorm(20*1000, sd = 1)
y <- 5 + 10*x + rnorm(20*1000, sd = 10) + c(sapply(rnorm(20, sd = 10), rep, times = 1000))
facX <- factor(sapply(1:20, rep, times = 1000))
mydata <- data.frame(y=y,x=x,facX=facX, state=rep(1:1000, 20))
model <- plm(y ~ x, data = mydata, index = c("facX", "state"), effect = "individual", model = "within")
plmTest <- coeftest(model,vcov=vcovHC(model,type = "HC1", cluster="group"))
lfeTest <- summary(felm(y ~ x | facX | 0 | facX))
data.frame(lfeClusterSE=lfeTest$coefficients[2],
plmClusterSE=plmTest[2])
lfeClusterSE plmClusterSE
1 0.06746538 0.06572588
The difference is in the degrees-of-freedom adjustment. This is the usual first guess when looking for differences in supposedly similar standard errors (see e.g., Different Robust Standard Errors of Logit Regression in Stata and R). Here, the problem can be illustrated when comparing the results from (1) plm+vcovHC, (2) felm, (3) lm+cluster.vcov (from package multiwayvcov).
First, I refit all models:
m1 <- plm(y ~ x, data = mydata, index = c("facX", "state"),
effect = "individual", model = "within")
m2 <- felm(y ~ x | facX | 0 | facX, data = mydata)
m3 <- lm(y ~ facX + x, data = mydata)
All lead to the same coefficient estimates. For m3 the fixed effects are explicitly reported while they are not for m1 and m2. Hence, for m3 only the last coefficient is extracted with tail(..., 1).
all.equal(coef(m1), coef(m2))
## [1] TRUE
all.equal(coef(m1), tail(coef(m3), 1))
## [1] TRUE
The non-robust standard errors also agree.
se <- function(object) tail(sqrt(diag(object)), 1)
se(vcov(m1))
## x
## 0.07002696
se(vcov(m2))
## x
## 0.07002696
se(vcov(m3))
## x
## 0.07002696
And when comparing the clustered standard errors we can now show that felm uses the degrees-of-freedom correction while plm does not:
se(vcovHC(m1))
## x
## 0.06572423
m2$cse
## x
## 0.06746538
library(multiwayvcov)  # provides cluster.vcov
se(cluster.vcov(m3, mydata$facX))
## x
## 0.06746538
se(cluster.vcov(m3, mydata$facX, df_correction = FALSE))
## x
## 0.06572423
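To make the adjustment concrete, here is a base-R sketch of a cluster-robust sandwich estimator on a small simulated panel. It assumes the common small-sample factor G/(G-1) * (N-1)/(N-K), which is the one documented for cluster.vcov's df_correction. That felm applies the same factor is inferred from the matching numbers above, not from its source. All object names are made up for the illustration.

```r
set.seed(42)
G <- 10; n_g <- 30                       # 10 clusters with 30 observations each
cl <- factor(rep(seq_len(G), each = n_g))
x  <- rnorm(G * n_g) + as.integer(cl)
y  <- 1 + 2 * x + rnorm(G)[as.integer(cl)] + rnorm(G * n_g)
fit <- lm(y ~ cl + x)                    # fixed effects as dummies, like m3

X <- model.matrix(fit)
u <- resid(fit)
N <- nrow(X); K <- ncol(X)

bread  <- solve(crossprod(X))
scores <- rowsum(X * u, cl)              # per-cluster sums of x_i * u_i
V0 <- bread %*% crossprod(scores) %*% bread   # no small-sample correction

adj <- G / (G - 1) * (N - 1) / (N - K)   # assumed correction factor
V1  <- adj * V0

ix <- which(colnames(X) == "x")
se0 <- sqrt(diag(V0))[ix]                # uncorrected, plm-style
se1 <- sqrt(diag(V1))[ix]                # corrected, felm-style
c(uncorrected = unname(se0), corrected = unname(se1))

# The same factor applied to the numbers reported above (G = 20, N = 20000, K = 21)
0.06572423 * sqrt(20 / 19 * 19999 / 19979)
```

The last line should land on the felm value to about six decimal places, which supports reading the gap as a pure scaling of the same sandwich estimate.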
I'm running a multivariate regression with 2 outcome variables and 5 predictors. I would like to obtain the confidence intervals for all regression coefficients. Usually I use the function confint on an lm fit, but it doesn't seem to work for a multivariate regression model (object mlm).
Here's a reproducible example.
library(car)
mod <- lm(cbind(income, prestige) ~ education + women, data=Prestige)
confint(mod) # doesn't return anything.
Any alternative way to do it? (I could just use the value of the standard error and multiply by the right critical t value, but I was wondering if there was an easier way to do it).
confint won't return you anything, because there is no "mlm" method supported:
methods(confint)
#[1] confint.default confint.glm* confint.lm confint.nls*
As you said, we can just add / subtract some multiple of the standard error to get the upper / lower bound of the confidence interval. You were probably going to do this via coef(summary(mod)), then use some *apply method to extract the standard errors. But my answer to Obtain standard errors of regression coefficients for an “mlm” object returned by lm() gives you a super efficient way to get standard errors without going through summary. Applying std_mlm to your example model gives:
se <- std_mlm(mod)
# income prestige
#(Intercept) 1162.299027 3.54212524
#education 103.731410 0.31612316
#women 8.921229 0.02718759
Now, we define another small function to compute lower and upper bound:
## add "mlm" method to generic function "confint"
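confint.mlm <- function (object, parm, level = 0.95, ...) {
## signature follows the confint generic; parm is accepted but not used
beta <- coef(object)
se <- std_mlm(object)
## lower-tail t quantile (negative), so adding it gives the lower bound
alpha <- qt((1 - level) / 2, df = object$df.residual)
list(lower = beta + alpha * se, upper = beta - alpha * se)
}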
## call "confint"
confint(mod)
#$lower
# income prestige
#(Intercept) -3798.25140 -15.7825086
#education 739.05564 4.8005390
#women -81.75738 -0.1469923
#
#$upper
# income prestige
#(Intercept) 814.25546 -1.72581876
#education 1150.70689 6.05505285
#women -46.35407 -0.03910015
It is easy to interpret this. For example, for the response income, the 95% confidence intervals for all coefficients are
#(intercept) (-3798.25140, 814.25546)
# education (739.05564, 1150.70689)
# women (-81.75738, -46.35407)
This comes from the predict.lm example. You want the interval = 'confidence' option.
x <- rnorm(15)
y <- x + rnorm(15)
predict(lm(y ~ x))
new <- data.frame(x = seq(-3, 3, 0.5))
predict(lm(y ~ x), new, se.fit = TRUE)
pred.w.clim <- predict(lm(y ~ x), new, interval = "confidence")
matplot(new$x, pred.w.clim,
lty = c(1,2,2,3,3), type = "l", ylab = "predicted y")
This seems to have been discussed recently (July 2018) on the R-devel list, so hopefully by the next version of R it will be fixed. A workaround proposed on that list is to use:
confint.mlm <- function (object, level = 0.95, ...) {
cf <- coef(object)
ncfs <- as.numeric(cf)
a <- (1 - level)/2
a <- c(a, 1 - a)
fac <- qt(a, object$df.residual)
pct <- stats:::format.perc(a, 3)
ses <- sqrt(diag(vcov(object)))
ci <- ncfs + ses %o% fac
setNames(data.frame(ci),pct)
}
Test:
fit_mlm <- lm(cbind(mpg, disp) ~ wt, mtcars)
confint(fit_mlm)
Gives:
2.5 % 97.5 %
mpg:(Intercept) 33.450500 41.119753
mpg:wt -6.486308 -4.202635
disp:(Intercept) -204.091436 -58.205395
disp:wt 90.757897 134.198380
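A quick way to sanity-check these intervals without defining any method is to fit one lm per response and stack the usual confint results. The sketch below does that with base R only; the names est, ses, tq, ci and ref are introduced for illustration.

```r
fit_mlm <- lm(cbind(mpg, disp) ~ wt, data = mtcars)

est <- as.numeric(coef(fit_mlm))         # column-major: mpg terms, then disp terms
ses <- sqrt(diag(vcov(fit_mlm)))         # named "mpg:(Intercept)", "mpg:wt", ...
tq  <- qt(c(0.025, 0.975), fit_mlm$df.residual)
ci  <- est + ses %o% tq                  # one row per coefficient, lower and upper

# Reference: separate univariate fits, stacked in the same order
ref <- rbind(confint(lm(mpg ~ wt, data = mtcars)),
             confint(lm(disp ~ wt, data = mtcars)))
all.equal(unname(ci), unname(ref))
```

The agreement holds because each response in an mlm fit shares the same design matrix and residual degrees of freedom, so the per-coefficient intervals are the same as in the separate regressions.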
Personally, I like it in a clean tibble way (using broom::tidy would be even better, but has an issue currently)
library(tidyverse)
confint(fit_mlm) %>%
rownames_to_column() %>%
separate(rowname, c("response", "term"), sep=":")
Gives:
response term 2.5 % 97.5 %
1 mpg (Intercept) 33.450500 41.119753
2 mpg wt -6.486308 -4.202635
3 disp (Intercept) -204.091436 -58.205395
4 disp wt 90.757897 134.198380
Simulate an AR(1) in R as follows:
# True parameters
b0 <- 1 # intercept
b1 <- 0.9 # coefficient
trueMean <- b0 / (1-b1) # equals 10
set.seed(8236)
capT <- 1000
eps <- rnorm(capT)
y <- rep(NA,capT)
y[1] <- b0 + b1*trueMean + eps[1] # Initialize the series
for(t in 2:capT) y[t] = b0 + b1*y[t-1] + eps[t]
reg1 <- ar(y)
reg2 <- arima(y, order=c(1,0,0))
reg3 <- lm( y[2:capT] ~y[1:(capT-1)] )
Both the reg1 and reg3 estimates are close to the true values. However, reg2, which uses the arima function, estimates an intercept close to the true mean of 10. Any clue as to why this is happening?
Got the answer on this page http://www.stat.pitt.edu/stoffer/tsa2/Rissues.htm
It seems arima() reports the mean but calls it intercept!
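Given that, the regression-style intercept can be recovered from the arima fit as mean * (1 - ar1). A minimal sketch, re-running the simulation from the question (phi_hat, mu_hat and b0_hat are names introduced here for illustration):

```r
set.seed(8236)
b0 <- 1; b1 <- 0.9
capT <- 1000
eps <- rnorm(capT)
y <- numeric(capT)
y[1] <- b0 + b1 * (b0 / (1 - b1)) + eps[1]   # start at the true mean
for (t in 2:capT) y[t] <- b0 + b1 * y[t - 1] + eps[t]

fit <- arima(y, order = c(1, 0, 0))
phi_hat <- unname(coef(fit)["ar1"])
mu_hat  <- unname(coef(fit)["intercept"])    # labelled "intercept", but it is the mean

b0_hat <- mu_hat * (1 - phi_hat)             # intercept in y[t] = b0 + b1 * y[t-1] + eps[t]
c(mean = mu_hat, ar1 = phi_hat, intercept = b0_hat)
```

The converted value is the one to compare with the intercept from lm in reg3.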