Fast adjusted R-squared extraction in R

.lm.fit is considerably faster than lm for reasons documented in several places, but it is not as straightforward to get an adjusted R-squared value, so I'm hoping for some help.
Using lm() and then summary() to get the adjusted R-squared:
tstlm <- lm(cyl ~ hp + wt, data = mtcars)
summary(tstlm)$adj.r.squared
Using .lm.fit:
mtmatrix <- as.matrix(mtcars)
tstlmf <- .lm.fit(cbind(1, mtmatrix[, c("hp", "wt")]), mtmatrix[, "cyl"])
And here I'm stuck. I suspect the information I need to calculate adjusted r-squared is found in the .lm.fit model somewhere but I can't quite figure out how to proceed.
Thanks in advance for any suggestions.

1) R squared equals the squared correlation between the dependent variable and the fitted values. We can get the residuals from tstlmf using resid(tstlmf), and the fitted values equal y minus those residuals.
Adjusted R squared is then obtained as 1 - (1 - R^2) * (n - 1) / (n - p), where n and p are the number of rows and columns of X (intercept column included); in the code below, -diff(dim(X)) computes n - p.
Note that the formulas would change if there were no intercept.
X <- with(mtcars, cbind(1, hp, wt))
y <- mtcars$cyl
tstlmf <- .lm.fit(X, y)
rsq <- cor(y, y - resid(tstlmf))^2; rsq
## [1] 0.7898
adj <- 1 - (1-rsq) * (nrow(X) - 1) / -diff(dim(X)); adj
## [1] 0.7753
# check
tstlm <- lm(cyl ~ hp + wt, mtcars)
s <- summary(tstlm)
s$r.squared
## [1] 0.7898
s$adj.r.squared
## [1] 0.7753
2) R squared can alternatively be calculated as the ratio var(fitted) / var(y), in which case we write:
tstlmf <- .lm.fit(X, y)
rsq <- var(y - resid(tstlmf)) / var(y); rsq
## [1] 0.7898
adj <- 1 - (1-rsq) * (nrow(X) - 1) / -diff(dim(X)); adj
## [1] 0.7753
3) collapse
flm in the collapse package may be slightly faster than .lm.fit. It returns the coefficients only.
library(collapse)
tstflm <- flm(y, X)
rsq <- c(cor(y, X %*% tstflm)^2); rsq
## [1] 0.7898
adj <- 1 - (1-rsq) * (nrow(X) - 1) / -diff(dim(X)); adj
## [1] 0.7753
or
tstflm <- flm(y, X)
rsq <- var(X %*% tstflm) / var(y); rsq
## [1] 0.7898
adj <- 1 - (1-rsq) * (nrow(X) - 1) / -diff(dim(X)); adj
## [1] 0.7753
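If you want to quantify the "slightly faster" claim on your own machine, a rough timing sketch along these lines works (microbenchmark is an assumed extra dependency; X and y are as defined above, and timings will vary):
library(microbenchmark)
microbenchmark(
  lm     = lm(cyl ~ hp + wt, data = mtcars),  # full lm() incl. formula overhead
  lm.fit = .lm.fit(X, y),                     # bare-bones fitter
  flm    = flm(y, X)                          # coefficient-only fitter from collapse
)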

The following function computes the adjusted R2 from an object returned by .lm.fit and the response vector y.
adj_r2_lmfit <- function(object, y) {
  ypred <- y - resid(object)                  # fitted values
  mss <- sum((ypred - mean(ypred))^2)         # model sum of squares
  rss <- sum(resid(object)^2)                 # residual sum of squares
  rdf <- length(resid(object)) - object$rank  # residual degrees of freedom
  r.squared <- mss / (mss + rss)
  adj.r.squared <- 1 - (1 - r.squared) * (NROW(y) - 1) / rdf
  adj.r.squared
}
tstlm <- lm(cyl ~ hp + wt, data = mtcars)
tstlmf <- .lm.fit(cbind(1, mtmatrix[, c("hp", "wt")]), mtmatrix[, "cyl"])
summary(tstlm)$adj.r.squared
#[1] 0.7753073
adj_r2_lmfit(tstlmf, mtmatrix[, "cyl"])
#[1] 0.7753073

Related

Confidence intervals give NA values with bife

I am trying to extract the confidence intervals for my panel logit regression. I am using the following code:
model <- bife(dependent_variable ~ x1 + x2 | area, data = df, model = 'logit')
confint(model)
Running confint gives me NA values for all the coefficients and their confidence intervals.
Is this because of the 'bife' object? The model itself runs fine.
It's the bife:::vcov.bife method, which doesn't produce dimnames. Until the author fixes this, we can help ourselves by writing a confint.bife method that assigns coefficient names to the vcov.
confint.bife <- function(object, parm, level = 0.95, ...) {
  cf <- coef(object)
  pnames <- names(cf)
  if (missing(parm)) parm <- pnames
  else if (is.numeric(parm)) parm <- pnames[parm]
  a <- (1 - level) / 2
  a <- c(a, 1 - a)
  pct <- stats:::format.perc(a, 3)
  fac <- qnorm(a)
  ci <- array(NA, dim = c(length(parm), 2L),
              dimnames = list(parm, pct))
  vc <- `dimnames<-`(vcov(object), list(pnames, pnames))  # add the missing dimnames
  ses <- sqrt(diag(vc))[parm]
  ci[] <- cf[parm] + ses %o% fac
  ci
}
library('bife')
mod <- bife(LFP ~ I(AGE^2) + log(INCH) + KID1 + KID2 + KID3 +
              factor(TIME) | ID, psid)
confint(mod)
# 2.5 % 97.5 %
# I(AGE^2) -0.003787755 -0.001185755
# log(INCH) -0.606681358 -0.236717893
# KID1 -1.393748723 -1.008131941
# KID2 -0.830532213 -0.485097762
# KID3 -0.248997085 0.012550225
# factor(TIME)2 -0.244728227 0.303869081
# factor(TIME)3 -0.190434814 0.438179674
# factor(TIME)4 0.117647679 0.870167422
# factor(TIME)5 0.635239557 1.547524672
# factor(TIME)6 0.613792831 1.689971248
# factor(TIME)7 0.639896725 1.876532219
# factor(TIME)8 0.585828050 2.017753781
# factor(TIME)9 0.753717289 2.381327746
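As a quick sanity check of the diagnosis above (a sketch; behaviour may change once the package is fixed), you can inspect the dimnames of the variance-covariance matrix directly:
## the unpatched bife:::vcov.bife returns a matrix without dimnames,
## which is why the default confint method produces NAs
dimnames(vcov(mod))
## NULL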

How to manually calculate the residuals of a linear model in R

I am trying to manually calculate the r-squared given by lm() in R.
Considering:
fit <- lm(obs_values ~ preds_values, df)
with sd(df$obs_values) == sd(df$preds_values) and mean(df$obs_values) == mean(df$preds_values)
To do so I can extract the residuals with
res_a = residuals(fit)
and then inject them into the formula as:
y = sum( (df$obs_values - mean(df$obs_values))^2 )
rsq = 1 - sum(res_a^2)/y
Here I get the expected r-squared.
Now, I would like to get the residual manually.
It should be as trivial as :
res_b = df$obs_values - df$preds_values, but for some reason res_b differs from res_a...
You can't just do y - x in a regression y ~ x to get the residuals. Where have the regression coefficients gone?
fit <- lm(y ~ x)
b <- coef(fit)
resi <- y - (b[1] + b[2] * x)
You have many options:
## Residuals manually
# option 1
beta_hat <- coef(fit)
obs_values_hat <- beta_hat["(Intercept)"] + beta_hat["preds_values"] * preds_values
u_hat <- obs_values - obs_values_hat # residuals
# option 2
obs_values_hat <- fitted(fit)
u_hat <- obs_values - obs_values_hat # residuals
# (option 3 - not manually) or just u_hat <- resid(fit)
## R-squared manually
# option 1
var(obs_values_hat) / var(obs_values)
# option 2
1 - var(u_hat) / var(obs_values)
# option 3
cor(obs_values, obs_values_hat)^2
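Here is a minimal self-contained sketch of the pitfall, with simulated data (variable names mirror the question's):
set.seed(42)
preds_values <- rnorm(50)
obs_values <- 2 + 0.5 * preds_values + rnorm(50)
fit <- lm(obs_values ~ preds_values)
## naive subtraction ignores the fitted intercept and slope ...
all.equal(unname(obs_values - preds_values), unname(resid(fit)))  # not TRUE
## ... whereas subtracting the fitted values recovers the residuals
all.equal(unname(obs_values - fitted(fit)), unname(resid(fit)))   # TRUE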

Fit many formulae at once, faster options than lapply?

I have a list of formulas I want to fit to data; rather than running a loop I'd like to do this at once, for performance's sake. The estimations should still be separate; I'm not trying to estimate a SUR or anything.
The following code does what I want
x <- matrix(rnorm(300), ncol = 3)
y <- x %*% c(1, 2, 3) + rnorm(100)
formulae <- list(y ~ x[,1],
                 y ~ x[,2],
                 y ~ x[,1] + x[,2])
lapply(formulae, lm)
Unfortunately this gets somewhat slow as the length of formulae increases. Is there a way to truly vectorize this?
If it is any help, the only results of lm I care about are coefficients, and some standard errors.
As I said in my comment, what you really need is a more efficient yet numerically stable fitting routine than lm(). Here I provide a well-tested one of my own, called lm.chol(). It takes a formula and data, and returns:
a coefficient summary table, as you normally see in summary(lm(...))$coef;
Pearson estimate of residual standard error, as you get from summary(lm(...))$sigma;
adjusted-R.squared, as you get from summary(lm(...))$adj.r.squared.
## linear model estimation based on pivoted Cholesky factorization with Jacobi preconditioner
lm.chol <- function(formula, data) {
  ## stage 0: get response vector and model matrix
  ## we did not follow the normal route: match.call, model.frame, model.response, model.matrix, etc.
  y <- data[[as.character(formula[[2]])]]
  X <- model.matrix(formula, data)
  n <- nrow(X); p <- ncol(X)
  ## stage 1: XtX and Jacobi diagonal preconditioner
  XtX <- crossprod(X)
  D <- 1 / sqrt(diag(XtX))
  ## stage 2: pivoted Cholesky factorization
  R <- suppressWarnings(chol(t(D * t(D * XtX)), pivot = TRUE))
  piv <- attr(R, "pivot")
  r <- attr(R, "rank")
  if (r < p) {
    warning("Model is rank-deficient!")
    piv <- piv[1:r]
    R <- R[1:r, 1:r]
  }
  ## stage 3: solve linear system for coefficients
  D <- D[piv]
  b <- D * crossprod(X, y)[piv]
  z <- forwardsolve(t(R), b)
  RSS <- sum(y * y) - sum(z * z)
  sigma <- sqrt(RSS / (n - r))
  para <- D * backsolve(R, z)
  beta.hat <- rep(NA, p)
  beta.hat[piv] <- para
  ## stage 4: get standard errors
  Rinv <- backsolve(R, diag(r))
  se <- rep(NA, p)
  se[piv] <- D * sqrt(rowSums(Rinv * Rinv)) * sigma
  ## stage 5: t-statistics and p-values
  t.statistic <- beta.hat / se
  p.value <- 2 * pt(-abs(t.statistic), df = n - r)
  ## stage 6: construct coefficient summary matrix
  coefficients <- matrix(c(beta.hat, se, t.statistic, p.value), ncol = 4L)
  colnames(coefficients) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
  rownames(coefficients) <- colnames(X)
  ## stage 7: compute adjusted R-squared
  adj.R2 <- 1 - sigma * sigma / var(y)
  ## return model fitting results
  attr(coefficients, "sigma") <- sigma
  attr(coefficients, "adj.R2") <- adj.R2
  coefficients
}
Here are three examples.
Example 1: full rank linear model
We take R's built-in dataset trees as an example.
# using `lm()`
summary(lm(Height ~ Girth + Volume, trees))
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 83.2958 9.0866 9.167 6.33e-10 ***
#Girth -1.8615 1.1567 -1.609 0.1188
#Volume 0.5756 0.2208 2.607 0.0145 *
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 5.056 on 28 degrees of freedom
#Multiple R-squared: 0.4123, Adjusted R-squared: 0.3703
#F-statistic: 9.82 on 2 and 28 DF, p-value: 0.0005868
## using `lm.chol()`
lm.chol(Height ~ Girth + Volume, trees)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 83.2957705 9.0865753 9.166905 6.333488e-10
#Girth -1.8615109 1.1566879 -1.609346 1.187591e-01
#Volume 0.5755946 0.2208225 2.606594 1.449097e-02
#attr(,"sigma")
#[1] 5.056318
#attr(,"adj.R2")
#[1] 0.3702869
The results are exactly the same!
Example 2: rank-deficient linear model
## toy data
set.seed(0)
dat <- data.frame(y = rnorm(100), x1 = runif(100), x2 = rbeta(100,3,5))
dat$x3 <- with(dat, (x1 + x2) / 2)
## using `lm()`
summary(lm(y ~ x1 + x2 + x3, dat))
#Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2164 0.2530 0.856 0.394
#x1 -0.1526 0.3252 -0.469 0.640
#x2 -0.3534 0.5707 -0.619 0.537
#x3 NA NA NA NA
#Residual standard error: 0.8886 on 97 degrees of freedom
#Multiple R-squared: 0.0069, Adjusted R-squared: -0.01358
#F-statistic: 0.337 on 2 and 97 DF, p-value: 0.7147
## using `lm.chol()`
lm.chol(y ~ x1 + x2 + x3, dat)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2164455 0.2529576 0.8556595 0.3942949
#x1 NA NA NA NA
#x2 -0.2007894 0.6866871 -0.2924030 0.7706030
#x3 -0.3051760 0.6504256 -0.4691944 0.6399836
#attr(,"sigma")
#[1] 0.8886214
#attr(,"adj.R2")
#[1] -0.01357594
#Warning message:
#In lm.chol(y ~ x1 + x2 + x3, dat) : Model is rank-deficient!
Here, lm.chol() based on Cholesky factorization with complete pivoting and lm() based on QR factorization with partial pivoting have shrunk different coefficients to NA. But the two estimates are equivalent, with the same fitted values and residuals.
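A quick numerical check of that equivalence (a sketch; the NA coefficient is set to zero when forming fitted values, and the rank-deficiency warning fires again):
b <- suppressWarnings(lm.chol(y ~ x1 + x2 + x3, dat)[, "Estimate"])
b[is.na(b)] <- 0  # the dropped column contributes nothing
X <- model.matrix(y ~ x1 + x2 + x3, dat)
all.equal(unname(drop(X %*% b)), unname(fitted(lm(y ~ x1 + x2 + x3, dat))))
## [1] TRUE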
Example 3: performance for large linear models
n <- 10000; p <- 300
set.seed(0)
dat <- as.data.frame(setNames(replicate(p, rnorm(n), simplify = FALSE), paste0("x",1:p)))
dat$y <- rnorm(n)
## using `lm()`
system.time(lm(y ~ ., dat))
# user system elapsed
# 3.212 0.096 3.315
## using `lm.chol()`
system.time(lm.chol(y ~ ., dat))
# user system elapsed
# 1.024 0.028 1.056
lm.chol() is 3 ~ 4 times faster than lm(). If you want to know the reason, read my other answer on why lm() is slow.
Remark
I have focused on improving performance of the computational kernel. You can take one step further, by using Ben Bolker's parallelism suggestion. If my approach gives a 3x boost, and parallel computing gives a 3x boost on 4 cores, you end up with a 9x boost!
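A minimal sketch of that combination, assuming formulae is a list of formulas over the columns of a data frame dat (names here are illustrative):
library(parallel)
cl <- makeCluster(4, "PSOCK")
clusterExport(cl, c("lm.chol", "dat"))  # workers need the fitter and the data
fits <- parLapply(cl, formulae, function(f) lm.chol(f, dat))
stopCluster(cl)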
There's not really an easy way to vectorize this, but the pdredge function from the MuMIn package gives you a pretty easy way to parallelize it (this assumes you have multiple cores on your machine or that you can set up a local cluster in one of the ways supported by the parallel package).
library(parallel)
clust <- makeCluster(2,"PSOCK")
library(MuMIn)
Construct data:
set.seed(101)
x <- matrix(rnorm(300),ncol=3)
y <- x %*% c(1,2,3)+rnorm(100)
It will be easier to do this with a named data frame rather than an anonymous matrix:
df <- setNames(data.frame(y,x),c("y",paste0("x",1:3)))
The cluster nodes all need access to the data set:
clusterExport(clust,"df")
Fit the full model (you could use y~. to fit all variables)
full <- lm(y~x1+x2,data=df,na.action=na.fail)
Now fit all submodels (see ?MuMIn::dredge for many more options to control which submodels are fitted)
p <- pdredge(full,cluster=clust)
coef(p)
## (Intercept) x1 x2
## 3 -0.003805107 0.7488708 2.590204
## 2 -0.028502039 NA 2.665305
## 1 -0.101434662 1.0490816 NA
## 0 -0.140451160 NA NA
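Since the question also asks for standard errors, one way to pull them out afterwards (a sketch; get.models refits every submodel, so this trades some speed for convenience) is:
fits <- get.models(p, subset = TRUE)
ses <- lapply(fits, function(m) coef(summary(m))[, "Std. Error"])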

Is this really the most practical way to return the p-value of a linear model (lm) object in R?

What is the most practical way of extracting the global p-value of a linear model, lm? I usually end up taking the results from summary and plugging the F-test statistic and degrees of freedom into pf:
set.seed(1)
n <- 10
x <- 1:10
y <- 2*x+rnorm(n)
fit <- lm(y ~ x)
summary(fit) # global p-value: 1.324e-08
fstat <- summary(fit)$fstat
pval <- pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)
pval
Check out the broom package:
library(broom)
set.seed(1)
n <- 10
x <- 1:10
y <- 2*x+rnorm(n)
fit <- lm(y ~ x)
glance(fit)
# r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
# 1 0.9851881 0.9833366 0.8090653 532.1048 1.324022e-08 2 -10.95491 27.90982 28.81758 5.236693 8
glance(fit)$p.value
# [1] 1.324022e-08
tidy(fit)
# term estimate std.error statistic p.value
# 1 (Intercept) -0.1688236 0.55269681 -0.3054542 7.678170e-01
# 2 x 2.0547321 0.08907516 23.0673979 1.324022e-08
Since you asked for it:
Here is a bare-bones implementation that omits the bells and whistles (and checks) of lm. As a consequence it is faster, but you'd use it at your own risk, i.e., the warnings in help("lm.fit") apply. Due to laziness, code for calculation of the F-stats was extracted from the summary.lm source code and only slightly amended (so please consider licence() and citation("stats")).
fit1 <- lm.fit(cbind(1, x), y)
fstats <- function(obj) {
  p <- obj$rank
  rdf <- obj$df.residual
  r <- obj$residuals
  f <- obj$fitted.values
  mss <- sum((f - mean(f))^2)
  rss <- sum(r^2)
  resvar <- rss / rdf
  df.int <- 1L  # assumes there is always an intercept
  fstatistic <- c(value = (mss / (p - df.int)) / resvar,
                  numdf = p - df.int, dendf = rdf)
  fstatistic["pval"] <- pf(fstatistic[1L], fstatistic[2L], fstatistic[3L],
                           lower.tail = FALSE)
  fstatistic
}
fstats(fit1)
# value numdf dendf pval
#5.321048e+02 1.000000e+00 8.000000e+00 1.324022e-08
Check the source of print.summary.lm; it uses the pf function to get the p-value:
format.pval(pf(x$fstatistic[1L], x$fstatistic[2L], x$fstatistic[3L],
               lower.tail = FALSE),
            digits = digits)
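Wrapped into a tiny standalone helper (the name lm_pvalue is just illustrative), using fit from the question:
lm_pvalue <- function(fit) {
  fs <- summary(fit)$fstatistic  # value, numdf, dendf
  unname(pf(fs[1L], fs[2L], fs[3L], lower.tail = FALSE))
}
lm_pvalue(fit)
# [1] 1.324022e-08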

Get confidence intervals for regression coefficients of "mlm" object returned by `lm()`

I'm running a multivariate regression with 2 outcome variables and 5 predictors. I would like to obtain the confidence intervals for all regression coefficients. Usually I use the function confint, but it doesn't seem to work for a multivariate regression model (object mlm).
Here's a reproducible example.
library(car)
mod <- lm(cbind(income, prestige) ~ education + women, data=Prestige)
confint(mod) # doesn't return anything.
Any alternative way to do it? (I could just use the value of the standard error and multiply by the right critical t value, but I was wondering if there was an easier way to do it).
confint won't return anything, because there is no "mlm" method supported:
methods(confint)
#[1] confint.default confint.glm* confint.lm confint.nls*
As you said, we can just plus / minus some multiple of the standard error to get the lower / upper bound of the confidence interval. You were probably going to do this via coef(summary(mod)), then use some *apply method to extract standard errors. But my answer to Obtain standard errors of regression coefficients for an “mlm” object returned by lm() gives you a super efficient way to get standard errors without going through summary.
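The sketch below computes diag((X'X)^{-1}) from the QR factorization stored in the fitted "mlm" object and scales it by each response's residual variance; it follows the idea of that answer but is not its verbatim code:
std_mlm <- function(model) {
  R <- qr.R(model$qr)                  # upper-triangular factor of the pivoted QR
  piv <- model$qr$pivot
  Rinv <- backsolve(R, diag(ncol(R)))  # R^{-1}
  d <- numeric(ncol(R))
  d[piv] <- rowSums(Rinv ^ 2)          # diag((X'X)^{-1}), in original column order
  sigma2 <- colSums(residuals(model) ^ 2) / model$df.residual  # per-response variances
  se <- sqrt(outer(d, sigma2))         # se[i, j]: coefficient i, response j
  dimnames(se) <- dimnames(coef(model))
  se
}
Applying std_mlm to your example model gives: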
se <- std_mlm(mod)
# income prestige
#(Intercept) 1162.299027 3.54212524
#education 103.731410 0.31612316
#women 8.921229 0.02718759
Now we define another small function to compute the lower and upper bounds:
## add "mlm" method to generic function "confint"
confint.mlm <- function(model, level = 0.95) {
  beta <- coef(model)
  se <- std_mlm(model)
  alpha <- qt((1 - level) / 2, df = model$df.residual)
  list(lower = beta + alpha * se, upper = beta - alpha * se)
}
## call "confint"
confint(mod)
#$lower
# income prestige
#(Intercept) -3798.25140 -15.7825086
#education 739.05564 4.8005390
#women -81.75738 -0.1469923
#
#$upper
# income prestige
#(Intercept) 814.25546 -1.72581876
#education 1150.70689 6.05505285
#women -46.35407 -0.03910015
This is easy to interpret. For example, for the response income, the 95% confidence intervals for the variables are:
#(intercept) (-3798.25140, 814.25546)
# education (739.05564, 1150.70689)
# women (-81.75738, -46.35407)
This comes from the predict.lm example. You want the interval = 'confidence' option.
x <- rnorm(15)
y <- x + rnorm(15)
predict(lm(y ~ x))
new <- data.frame(x = seq(-3, 3, 0.5))
predict(lm(y ~ x), new, se.fit = TRUE)
pred.w.clim <- predict(lm(y ~ x), new, interval = "confidence")
matplot(new$x, pred.w.clim,
        lty = c(1, 2, 2, 3, 3), type = "l", ylab = "predicted y")
This seems to have been discussed recently (July 2018) on the R-devel list, so hopefully by the next version of R it will be fixed. A workaround proposed on that list is to use:
confint.mlm <- function(object, level = 0.95, ...) {
  cf <- coef(object)
  ncfs <- as.numeric(cf)
  a <- (1 - level) / 2
  a <- c(a, 1 - a)
  fac <- qt(a, object$df.residual)
  pct <- stats:::format.perc(a, 3)
  ses <- sqrt(diag(vcov(object)))
  ci <- ncfs + ses %o% fac
  setNames(data.frame(ci), pct)
}
Test:
fit_mlm <- lm(cbind(mpg, disp) ~ wt, mtcars)
confint(fit_mlm)
Gives:
2.5 % 97.5 %
mpg:(Intercept) 33.450500 41.119753
mpg:wt -6.486308 -4.202635
disp:(Intercept) -204.091436 -58.205395
disp:wt 90.757897 134.198380
Personally, I like it in a clean tibble way (using broom::tidy would be even better, but it has an issue currently):
library(tidyverse)
confint(fit_mlm) %>%
  rownames_to_column() %>%
  separate(rowname, c("response", "term"), sep = ":")
Gives:
response term 2.5 % 97.5 %
1 mpg (Intercept) 33.450500 41.119753
2 mpg wt -6.486308 -4.202635
3 disp (Intercept) -204.091436 -58.205395
4 disp wt 90.757897 134.198380
