How to manually calculate the residuals of linear model in R

How to manually calculate the residuals of linear model in R - r

I am trying to calculate manually the r-squared given by lm() in R
Considering:
fit <- lm(obs_values ~ preds_values, df)
with sd(df$obs_values) == sd(df$preds_values) and mean(df$obs_values) == mean(df$preds_values)
To do so I can extract the residuals by doing
res_a = residuals(fit) and then inject them in the formula as :
y = sum( (df$obs_values - mean(df$obs_values))^2 )
r-squared = 1 - sum(res_a^2)/y
Here I get the expected r-squared
Now, I would like to get the residual manually.
It should be as trivial as :
res_b = df$obs_values - df$predss_values, but for some reason, res_b is different than res_a...

You can't just do y - x in a regression y ~ x to get residuals. Where have regression coefficients gone?
fit <- lm(y ~ x)
b <- coef(fit)
resi <- y - (b[1] + b[2] * x)

You have many options:
## Residuals manually
# option 1
beta_hat <- coef(fit)
obs_values_hat <- beta_hat["(Intercept)"] + beta_hat["preds_values"] * preds_values
u_hat <- obs_values - obs_values_hat # residuals
# option 2
obs_values_hat <- fitted(fit)
u_hat <- obs_values - obs_values_hat # residuals
# (option 3 - not manually) or just u_hat <- resid(fit)
## R-squared manually
# option 1
var(obs_values_hat) / var(obs_values)
# option 2
1 - var(u_hat) / var(obs_values)
# option 3
cor(obs_values, obs_values_hat)^2

Related

Confidence intervals gives NA values with BIFE

I am trying to extract the confidence intervals for my panel logit regression. I am using the following code:
model <- bife(dependent_variable ~ x1 + x2 | area, data = df, model = 'logit')
confint(model)
Running confint gives me NA values for all the coefficients and their confidence intervals.
Is this because of the 'bife' object? The model itself runs fine.

It's the bife:::vcov.bife method which doesn't produce dimnames. Until the author fixes this, we could help ourselves by writing a confint.bife method, that assigns coefficient names to the vcov.
confint.bife <- function (object, parm, level=0.95, ...) {
cf <- coef(object)
pnames <- names(cf)
if (missing(parm)) parm <- pnames
else if (is.numeric(parm)) parm <- pnames[parm]
a <- (1 - level)/2
a <- c(a, 1 - a)
pct <- stats:::format.perc(a, 3)
fac <- qnorm(a)
ci <- array(NA, dim=c(length(parm), 2L),
dimnames=list(parm, pct))
vc <- `dimnames<-`(vcov(object), list(pnames, pnames))
ses <- sqrt(diag(vc))[parm]
ci[] <- cf[parm] + ses %o% fac
ci
}
library('bife')
mod <- bife(LFP ~ I(AGE^2) + log(INCH) + KID1 + KID2 + KID3 +
factor(TIME) | ID, psid)
confint(mod)
# 2.5 % 97.5 %
# I(AGE^2) -0.003787755 -0.001185755
# log(INCH) -0.606681358 -0.236717893
# KID1 -1.393748723 -1.008131941
# KID2 -0.830532213 -0.485097762
# KID3 -0.248997085 0.012550225
# factor(TIME)2 -0.244728227 0.303869081
# factor(TIME)3 -0.190434814 0.438179674
# factor(TIME)4 0.117647679 0.870167422
# factor(TIME)5 0.635239557 1.547524672
# factor(TIME)6 0.613792831 1.689971248
# factor(TIME)7 0.639896725 1.876532219
# factor(TIME)8 0.585828050 2.017753781
# factor(TIME)9 0.753717289 2.381327746

Fast adjusted r-squared extraction

.lm.fit is considerably faster than lm for reasons documented in several places, but it is not as straight forward to get an adjusted r-squared value so I'm hoping for some help.
Using lm() and then summary() to get the adjusted r-squared.
tstlm <- lm(cyl ~ hp + wt, data = mtcars)
summary(tstlm)$adj.r.squared
Using .lm.fit
mtmatrix <- as.matrix(mtcars)
tstlmf <- .lm.fit(cbind(1,mtmatrix [,c("hp","wt")]), mtmatrix [,"cyl"])
And here I'm stuck. I suspect the information I need to calculate adjusted r-squared is found in the .lm.fit model somewhere but I can't quite figure out how to proceed.
Thanks in advance for any suggestions.

1) R squared equals the squared correlation between the dependent variable and the fitted values. We can get the residuals from tstlmf using resid(tstslmf) and the fitted values equal y minus those residuals.
Adjusted R squared is formed by multiplying R squared by an expression using only the number of rows and columns of X.
Note that the formulas would change if there is no intercept.
X <- with(mtcars, cbind(1, hp, wt))
y <- mtcars$cyl
testlmf <- .lm.fit(X, y)
rsq <- cor(y, y - resid(tstlmf))^2; rsq
## [1] 0.7898
adj <- 1 - (1-rsq) * (nrow(X) - 1) / -diff(dim(X)); adj
## [1] 0.7753
# check
tstlm <- lm(cyl ~ hp + wt, mtcars)
s <- summary(tstlm)
s$r.squared
## [1] 0.7898
s$adj.r.squared
## [1] 0.7753
2) R squared can alternately be calculated as the ratio var(fitted) / var(y) as in the link above and in that case we write:
testlmf <- .lm.fit(X, y)
rsq <- var(y - resid(tstlmf)) / var(y); rsq
## [1] 0.7898
adj <- 1 - (1-rsq) * (nrow(X) - 1) / -diff(dim(X)); adj
## [1] 0.7753
collapse
flm in the collapse package may be slightly faster than .lm.fit. It returns the coefficients only.
library(collapse)
tstflm <- flm(y, X)
rsq <- c(cor(y, X %*% tstflm)^2); rsq
## [1] 0.7898
adj <- 1 - (1-rsq) * (nrow(X) - 1) / -diff(dim(X)); adj
## [1] 0.7753
or
tstflm <- flm(y, X)
rsq <- var(X %*% tstflm) / var(y); rsq
## [1] 0.7898
adj <- 1 - (1-rsq) * (nrow(X) - 1) / -diff(dim(X)); adj
## [1] 0.7753

The following function computes the adjusted R2 from an object returned by .lm.fit and the response vector y.
adj_r2_lmfit <- function(object, y){
ypred <- y - resid(object)
mss <- sum((ypred - mean(ypred))^2)
rss <- sum(resid(object)^2)
rdf <- length(resid(object)) - object$rank
r.squared <- mss/(mss + rss)
adj.r.squared <- 1 - (1 - r.squared)*(NROW(y) - 1)/rdf
adj.r.squared
}
tstlm <- lm(cyl ~ hp + wt, data = mtcars)
tstlmf <- .lm.fit(cbind(1,mtmatrix [,c("hp","wt")]), mtmatrix [,"cyl"])
summary(tstlm)$adj.r.squared
#[1] 0.7753073
adj_r2_lmfit(tstlmf, mtmatrix [,"cyl"])
#[1] 0.7753073

Use bayesian inference to find relationship between variables

I have a lot of data, like the image below, which have a relationship with each other. I want to make an equation to describe this relationship, something like Power = a * WindSpeed ^ b . How can I use Bayesian Inference to find a and b? I want to use R for this.

Welcome to SO and good luck. Do not forget to pay attention to the comment (they are really relevant & you can increase probability to get an answer).
Please see below example of using Bayesian univariate regression using Bolstadt package.
library(Bolstad)
# Simulation of Power vs Wind
# Power = 5 * Windspeed ^ 2
set.seed(123)
n <- 100
# y = Power
# x = WindSpeed
# e = error term
x <- (1:(25 * n))/ n
e <- rnorm(length(x)) / 10
# y = a * x ^ b
# log(y) = log(a) + b * log(x) + e
# or
# in exponential form
y <- exp(log(5) + e) * x ^ 2
# bayes univariate linear regression model
z <- bayes.lin.reg(log(y), log(x))
# Standard deviation of residuals: 0.0943
# Posterior Mean Posterior Std. Deviation
# -------------- ------------------------
# Intercept: 6.076 0.0059657
# Slope: 1.996 0.0062209
# ------------------------------------------------
# pay attention the result of bayession regression
# are shifted for intercept by the mean
# is is accouted as below
intercept_shifted <- z$intercept$mean - z$slope$mean * mean(log(x))
intercept_shifted
# [1] 1.617218
# validate by standar linear model:
lm(log(y) ~ log(x))
# Coefficients:
# (Intercept) log(x)
# 1.617 1.996
a = exp(intercept_shifted)
a
# [1] 5.039051
b = z$slope$mean
b
# [1] 1.996134

Fit many formulae at once, faster options than lapply?

I have a list for formulas I want to fit to data, rather than running a loop I'd like to do this at once, for performance's sake. The estimations should still be separate, I'm not trying to estimate a SUR or anything.
The following code does what I want
x <- matrix(rnorm(300),ncol=3)
y <- x %*% c(1,2,3)+rnorm(100)
formulae <-list(y~x[,1],
y~x[,2],
y~x[,1] + x[,2])
lapply(formulae,lm)
Unfortunately this gets somewhat slow as the length of formulae increases is there a way to truly vectorize this?
If it is any help, the only results of lm I care about are coefficients, and some standard errors.

As I said in my comment, what you really need is a more efficient yet stable fitting routine other than lm(). Here I would provide you a well tested one written myself, called lm.chol(). It takes a formula and data, and returns:
a coefficient summary table, as you normally see in summary(lm(...))$coef;
Pearson estimate of residual standard error, as you get from summary(lm(...))$sigma;
adjusted-R.squared, as you get from summary(lm(...))$adj.r.squared.
## linear model estimation based on pivoted Cholesky factorization with Jacobi preconditioner
lm.chol <- function(formula, data) {
## stage0: get response vector and model matrix
## we did not follow the normal route: match.call, model.frame, model.response, model matrix, etc
y <- data[[as.character(formula[[2]])]]
X <- model.matrix(formula, data)
n <- nrow(X); p <- ncol(X)
## stage 1: XtX and Jacobi diagonal preconditioner
XtX <- crossprod(X)
D <- 1 / sqrt(diag(XtX))
## stage 2: pivoted Cholesky factorization
R <- suppressWarnings(chol(t(D * t(D * XtX)), pivot = TRUE))
piv <- attr(R, "pivot")
r <- attr(R, "rank")
if (r < p) {
warning("Model is rank-deficient!")
piv <- piv[1:r]
R <- R[1:r, 1:r]
}
## stage 3: solve linear system for coefficients
D <- D[piv]
b <- D * crossprod(X, y)[piv]
z <- forwardsolve(t(R), b)
RSS <- sum(y * y) - sum(z * z)
sigma <- sqrt(RSS / (n - r))
para <- D * backsolve(R, z)
beta.hat <- rep(NA, p)
beta.hat[piv] <- para
## stage 4: get standard error
Rinv <- backsolve(R, diag(r))
se <- rep(NA, p)
se[piv] <- D * sqrt(rowSums(Rinv * Rinv)) * sigma
## stage 5: t-statistic and p-value
t.statistic <- beta.hat / se
p.value <- 2 * pt(-abs(t.statistic), df = n - r)
## stage 6: construct coefficient summary matrix
coefficients <- matrix(c(beta.hat, se, t.statistic, p.value), ncol = 4L)
colnames(coefficients) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
rownames(coefficients) <- colnames(X)
## stage 7: compute adjusted R.squared
adj.R2 <- 1 - sigma * sigma / var(y)
## return model fitting results
attr(coefficients, "sigma") <- sigma
attr(coefficients, "adj.R2") <- adj.R2
coefficients
}
Here I would offer three examples.
Example 1: full rank linear model
We take R's built-in dataset trees as an example.
# using `lm()`
summary(lm(Height ~ Girth + Volume, trees))
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 83.2958 9.0866 9.167 6.33e-10 ***
#Girth -1.8615 1.1567 -1.609 0.1188
#Volume 0.5756 0.2208 2.607 0.0145 *
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 5.056 on 28 degrees of freedom
#Multiple R-squared: 0.4123, Adjusted R-squared: 0.3703
#F-statistic: 9.82 on 2 and 28 DF, p-value: 0.0005868
## using `lm.chol()`
lm.chol(Height ~ Girth + Volume, trees)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 83.2957705 9.0865753 9.166905 6.333488e-10
#Girth -1.8615109 1.1566879 -1.609346 1.187591e-01
#Volume 0.5755946 0.2208225 2.606594 1.449097e-02
#attr(,"sigma")
#[1] 5.056318
#attr(,"adj.R2")
#[1] 0.3702869
The results are exactly the same!
Example 2: rank-deficient linear model
## toy data
set.seed(0)
dat <- data.frame(y = rnorm(100), x1 = runif(100), x2 = rbeta(100,3,5))
dat$x3 <- with(dat, (x1 + x2) / 2)
## using `lm()`
summary(lm(y ~ x1 + x2 + x3, dat))
#Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2164 0.2530 0.856 0.394
#x1 -0.1526 0.3252 -0.469 0.640
#x2 -0.3534 0.5707 -0.619 0.537
#x3 NA NA NA NA
#Residual standard error: 0.8886 on 97 degrees of freedom
#Multiple R-squared: 0.0069, Adjusted R-squared: -0.01358
#F-statistic: 0.337 on 2 and 97 DF, p-value: 0.7147
## using `lm.chol()`
lm.chol(y ~ x1 + x2 + x3, dat)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2164455 0.2529576 0.8556595 0.3942949
#x1 NA NA NA NA
#x2 -0.2007894 0.6866871 -0.2924030 0.7706030
#x3 -0.3051760 0.6504256 -0.4691944 0.6399836
#attr(,"sigma")
#[1] 0.8886214
#attr(,"adj.R2")
#[1] -0.01357594
#Warning message:
#In lm.chol(y ~ x1 + x2 + x3, dat) : Model is rank-deficient!
Here, lm.chol() based on Cholesky factorization with complete pivoting and lm() based on QR factorization with partial pivoting have shrunk different coefficients to NA. But two estimation are equivalent, with the same fitted values and residuals.
Example 3: performance for large linear models
n <- 10000; p <- 300
set.seed(0)
dat <- as.data.frame(setNames(replicate(p, rnorm(n), simplify = FALSE), paste0("x",1:p)))
dat$y <- rnorm(n)
## using `lm()`
system.time(lm(y ~ ., dat))
# user system elapsed
# 3.212 0.096 3.315
## using `lm.chol()`
system.time(lm.chol(y ~ ., dat))
# user system elapsed
# 1.024 0.028 1.056
lm.chol() is 3 ~ 4 times faster than lm(). If you want to know the reason, read my this answer.
Remark
I have focused on improving performance on computational kernel. You can take one step further, by using Ben Bolker's parallelism suggestion. If my approach gives 3 times boost, and parallel computing gives 3 times boost on 4 cores, you end up with 9 times boost!

There's not really an easy way to vectorize this, but the pdredge function from the MuMIn package gives you a pretty easy way to parallelize it (this assumes you have multiple cores on your machine or that you can set up a local cluster in one of the ways supported by the parallel package ...
library(parallel)
clust <- makeCluster(2,"PSOCK")
library(MuMIn)
Construct data:
set.seed(101)
x <- matrix(rnorm(300),ncol=3)
y <- x %*% c(1,2,3)+rnorm(100)
It will be easier to do this with a named data frame rather than an anonymous matrix:
df <- setNames(data.frame(y,x),c("y",paste0("x",1:3)))
The cluster nodes all need access to the data set:
clusterExport(clust,"df")
Fit the full model (you could use y~. to fit all variables)
full <- lm(y~x1+x2,data=df,na.action=na.fail)
Now fit all submodels (see ?MuMIn::dredge for many more options to control which submodels are fitted)
p <- pdredge(full,cluster=clust)
coef(p)
## (Intercept) x1 x2
## 3 -0.003805107 0.7488708 2.590204
## 2 -0.028502039 NA 2.665305
## 1 -0.101434662 1.0490816 NA
## 0 -0.140451160 NA NA

Arima fails to estimate a simple AR(1) in R

Simulate an AR(1) in R as follows:
# True parameters
b0 <- 1 # intercept
b1 <- 0.9 # coefficient
trueMean <- b0 / (1-b1) # equals to 10
set.seed(8236)
capT <- 1000
eps <- rnorm(capT)
y <- rep(NA,capT)
y[1] <- b0 + b1*trueMean + eps[1] # Initialize the series
for(t in 2:capT) y[t] = b0 + b1*y[t-1] + eps[t]
reg1 <- ar(y)
reg2 <- arima(y, order=c(1,0,0))
reg3 <- lm( y[2:capT] ~y[1:(capT-1)] )
Both reg1 and reg3 estimates are close to the true values. However, reg2 which uses the arima function estimates an intercept close to the true Mean of 10. Any clue as to why this is happening?

Got the answer on this page http://www.stat.pitt.edu/stoffer/tsa2/Rissues.htm
It seems arima() reports the mean but calls it intercept!

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How to manually calculate the residuals of linear model in R - r

You can't just do y - x in a regression y ~ x to get residuals. Where have regression coefficients gone? fit <- lm(y ~ x) b <- coef(fit) resi <- y - (b[1] + b[2] * x)

Related

Confidence intervals gives NA values with BIFE

Fast adjusted r-squared extraction

Use bayesian inference to find relationship between variables

Fit many formulae at once, faster options than lapply?

Arima fails to estimate a simple AR(1) in R

Categories

Resources