Getting wrong betas when doing OLS regression in R

My first question here. This problem has stolen days from my life. I know, it's not that important, but at the same time: I need to know! I know there are many good ways of running a regression. But when I try to do it using good old arithmetic just to get the hang of it, I get ridiculous answers for beta.
Beta vector is supposed to be (X'X)^(-1)X'y (where X is the matrix of regressors and y the vector of answers). I'll give one example (and that it's not suitable for OLS is irrelevant - I just want b:s here):
X <- matrix(1:10)
y <- matrix(2:11)
b <- (t(X) %*% X)^(-1) %*% t(X) %*% y
Which gives b = 1.142857, while summary(lm(y~X)) gives beta = 1 and an intercept of 1. When I add a constant to X to get an intercept, X <- cbind(X,1), the result I get is b = (2.324675,14.5), which doesn't make sense at all. What am I doing wrong here?

There are two problems here. The first is a problem of notation. The power of -1 in the formula actually indicates a matrix inverse. That is calculated with solve in R and not with ^-1, which indicates element-wise reciprocals.
Then, you need to create a design matrix that actually contains an intercept.
X <- matrix(1:10)
y <- matrix(2:11)^2
coef(lm(y~X))
#(Intercept) X
# -21 13
X <- cbind(1, X)
solve(t(X) %*% X) %*% t(X) %*% y
# [,1]
#[1,] -21
#[2,] 13
Obviously, you should not actually do this matrix inversion in real world applications (and R's lm doesn't do it).
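As a rough sketch of what avoiding the explicit inverse can look like, the snippet below reuses the toy data from this answer and compares three routes to the same coefficients. The variable names b_inv, b_ne and b_qr are made up for illustration, and the QR route is only meant to be similar in spirit to lm(), not a copy of its internals.

```r
# Same toy data as in the answer above
X <- cbind(1, 1:10)        # design matrix with an intercept column
y <- (2:11)^2

# Explicit inverse (fine for illustration only)
b_inv <- solve(t(X) %*% X) %*% t(X) %*% y

# Solve the normal equations directly, without forming the inverse
b_ne <- solve(crossprod(X), crossprod(X, y))

# QR-based least squares
b_qr <- qr.coef(qr(X), y)

all.equal(drop(b_inv), drop(b_ne))
all.equal(drop(b_ne), unname(b_qr))
```

All three should agree on this small, well-conditioned problem; the differences only start to matter when X'X is close to singular.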

The issue is with using ^(-1) for the inverse. It doesn't work like that for Matrices. solve is used to get the inverse of a matrix: https://www.statmethods.net/advstats/matrix.html
# use solve
b <- solve(t(X) %*% X) %*% t(X) %*% y
# fit model without intercept
m <- lm(y~-1+X)
summary(m)
# same coefficients
b
m$coefficients
# with intercept
X2 <- cbind(rep(1, 10), X)
b2 <- solve(t(X2) %*% X2) %*% t(X2) %*% y
m2 <- lm(y~+X)
summary(m2)
b2
m2$coefficients
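To see the difference between ^(-1) and solve directly, here is a small check on an arbitrary 2 x 2 matrix (the matrix A is just an illustrative choice, not taken from the question):

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)

A^(-1)            # element-wise reciprocals: 1/2, 1, 1, 1/3
solve(A)          # the actual matrix inverse

# Only the result of solve() multiplies back to the identity matrix
all.equal(A %*% solve(A), diag(2))
isTRUE(all.equal(A %*% A^(-1), diag(2)))
```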

X <- cbind(1, matrix(1:10))
b<-solve(t(X)%*%X)%*%t(X)%*%y
https://www.rdocumentation.org/packages/Matrix/versions/0.3-26/topics/solve.Matrix

Related

How R calculates the Regression coefficients using lm() function

I wanted to replicate R's calculation on estimation of regression equation on below data:
set.seed(1)
Vec = rnorm(1000, 100, 3)
DF = data.frame(X1 = Vec[-1], X2 = Vec[-length(Vec)])
Below R reports estimates of coefficients
coef(lm(X1~X2, DF)) ### slope = -0.03871511
Then I manually estimate the regression estimate for slope
(sum(DF[,1]*DF[,2])*nrow(DF) - sum(DF[,1])*sum(DF[,2])) / (nrow(DF) * sum(DF[,1]^2) - (sum(DF[,1])^2)) ### -0.03871178
They are close but still do not match exactly.
Can you please help me understand what I am missing here?
Any pointers would be very helpful.
The problem is that X1 and X2 are switched in lm relative to the long formula.
Background
The formula for the slope in lm(y ~ x) is the following, where x and y each have length n, x is short for x[i], y is short for y[i], and the summations are over i = 1, 2, ..., n:
slope = (n * sum(x*y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
Source of the problem
Thus the long formula in the question, also shown in (1) below, corresponds to lm(X2 ~ X1, DF) and not to lm(X1 ~ X2, DF). Either change the formula in the lm model as in (1) below, or else change the long formula in the question by replacing each occurrence of DF[, 1] in the denominator with DF[, 2] as in (2) below.
# (1)
coef(lm(X2 ~ X1, DF))[[2]]
## [1] -0.03871178
(sum(DF[,1]*DF[,2])*nrow(DF) - sum(DF[,1])*sum(DF[,2])) /
(nrow(DF) * sum(DF[,1]^2) - (sum(DF[,1])^2)) # as in question
## [1] -0.03871178
# (2)
coef(lm(X1 ~ X2, DF))[[2]] # as in question
## [1] -0.03871511
(sum(DF[,1]*DF[,2])*nrow(DF) - sum(DF[,1])*sum(DF[,2])) /
(nrow(DF) * sum(DF[,2]^2) - (sum(DF[,2])^2))
## [1] -0.03871511
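One way to make the argument order harder to mix up is to wrap the sum formula in a small function. The helper name slope_by_sums below is hypothetical (it does not appear in the question or the answers); the data is regenerated exactly as in the question.

```r
# Hypothetical helper: slope of lm(y ~ x) written with plain sums
slope_by_sums <- function(x, y) {
  n <- length(x)
  (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
}

set.seed(1)
Vec <- rnorm(1000, 100, 3)
DF <- data.frame(X1 = Vec[-1], X2 = Vec[-length(Vec)])

# The regressor goes first, the response second
all.equal(slope_by_sums(DF$X2, DF$X1), coef(lm(X1 ~ X2, DF))[[2]])
all.equal(slope_by_sums(DF$X1, DF$X2), coef(lm(X2 ~ X1, DF))[[2]])
```

Only the denominator depends on which variable is treated as the regressor, which is why the two results in the question are so close.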
This is not a StackOverflow question per se, but rather a stats question for the sister site.
The narrow answer is that you can look into the R sources; it generally farms off to LAPACK and BLAS, but a key part of the regression calculation is specialised in order to deal (in a statistical, rather than numerical, way) with low-rank cases.
Anyway, here, I believe you are 'merely' not adjusting for degrees of freedom correctly which 'almost but not quite' washes out when you use 1000 observations. A simpler case follows, along with a 'simpler' way to calculate the coefficient 'by hand' which also has the advantage of matching:
> set.seed(1)
> Vec <- rnorm(5,100,3)
> DF <- data.frame(X1=Vec[-1], X2=Vec[-length(Vec)])
> coef(lm(X1 ~ X2, DF))[2]
X2
-0.322898
> cov(DF$X1, DF$X2) / var(DF$X2)
[1] -0.322898
>
coef(lm(X1~X2, DF))
# (Intercept) X2
# 103.83714016 -0.03871511
You can apply the OLS coefficient formula in matrix form as below.
X = cbind(1,DF[,2])
solve(t(X) %*% (X)) %*% t(X)%*% as.matrix(DF[,1])
giving,
# [,1]
#[1,] 103.83714016
#[2,] -0.03871511
which is the same as the lm() output.
Data:
set.seed(1)
Vec = rnorm(1000, 100, 3)
DF = data.frame(X1 = Vec[-1], X2 = Vec[-length(Vec)])

observed information matrix for logistic model

I have proposed my own model and am now trying to implement it in R. I am stuck on how to compute the observed information matrix from my formula. I used glm() to fit a logistic model with a penalty term on a binary data set x1, x2, x3, y (all coded 0/1). In the code below, fit1 is the glm() model and dev.new is the penalised deviance.
X.tilde <- as.matrix(x) # n*p matrix of the data set
W <- Diagonal(length(y), weights) # n*n diagonal matrix of the weights
qq <- exp(fit1$fitted.values)/(1 + exp(fit1$fitted.values)) # n*1 vector (pi=probability of the logistic model )
cc <- t(1 - qq) # n*1 vector
gg <- (dev.new) * t(dev.new) # p*p matrix
ff <- (X.tilde) %*% t(X.tilde) # n*n matrix
pp <- exp(fit1$coefficients)/(1 + exp(fit1$coefficients)) # p*1 matrix
ss <- t(1/(1 + exp(fit1$coefficients))) # p*1 vector
aa <- t(X.tilde) %*% qq %*% cc %*% W %*% (X.tilde) # p*p matrix
firstP <- (aa + (pp * ss)) # p*p matrix
info.mat <- firstP+gg # p*p matrix
info.mat <- as.matrix(info.mat)
this code returns the following error
Error in e1 + Matrix(e2) :
Matrices must have same number of rows for arithmetic
In theory the dimensions are fine, but when I implement it they do not match.
Any help?
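One possible source of such a mismatch, assuming fit1$coefficients is a plain numeric vector, is that * multiplies element by element and so does not turn a length-p vector and its transpose into a p x p matrix. The sketch below uses made-up stand-in values for pp and ss and only illustrates the dimension behaviour; it is not a verified version of the formula in the question.

```r
# Stand-in values (assumed, not from the question's data)
pp <- c(0.2, 0.5, 0.7)        # a length-p vector, here p = 3
ss <- t(c(0.8, 0.5, 0.3))     # t() of a plain vector is a 1 x p matrix

dim(pp * ss)                  # 1 x 3: element-wise product
dim(outer(pp, drop(ss)))      # 3 x 3: outer product

# Checking dim() of every intermediate term before adding them
# usually pinpoints which one is not p x p.
```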

Is it possible to cache `lm()` matrices to fit new data?

I wrote an algorithm which fits a linear model with lm() and then "updates" the response variable iteratively. The problem is: In a high-dimension scenario, fitting linear models creates a bottleneck.
On the other hand, most of the work required is a matrix inversion that only depends on the covariate matrix X, i.e., the coefficients are given by: solve(t(X) %*% X) %*% t(X) %*% y.
Reading lm() code, I understand that R uses QR decomposition.
Is it possible to recover the internal matrix operation used and fit a new model with new y values faster?
Here's a minimal example:
set.seed(1)
X <- matrix(runif(400*150000), nrow = 150000)
y1 <- runif(150000)
y2 <- runif(150000)
mod1 <- lm(y1 ~ X)
mod2 <- lm(y2 ~ X)
Theoretically, mod2 "repeats" costly matrix operations identical to the ones made in the first lm() call.
I want to keep using lm() for its efficient implementation and ability to handle rank-deficient matrices automatically.
# Data
set.seed(1)
n = 5
X <- matrix(runif(5*n), nrow = n)
y1 <- runif(n)
y2 <- runif(n)
# lm models
mod1 <- lm(y1 ~ X)
mod2 <- lm(y2 ~ X)
# Obtain QR decomposition of X
q = qr(X)
# Reuse 'q' to obtain fitted values repeatedly
mod1_fv = qr.fitted(q, y1)
mod2_fv = qr.fitted(q, y2)
# Compare fitted values from reusing 'q' to fitted values in 'lm' models
Vectorize(all.equal)(unname(mod1$fitted.values), mod1_fv)
#> [1] TRUE TRUE TRUE TRUE TRUE
Vectorize(all.equal)(unname(mod2$fitted.values), mod2_fv)
#> [1] TRUE TRUE TRUE TRUE TRUE
Created on 2019-09-06 by the reprex package (v0.3.0)
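Two caveats about the example above: with n = 5 and five columns both fits reproduce y exactly, so the comparison holds trivially, and qr(X) has no intercept column while lm(y ~ X) adds one. A variant with more rows than columns and an explicit intercept might look like the following sketch (the sizes n = 200 and p = 5 and the name Xd are arbitrary choices for illustration):

```r
set.seed(1)
n <- 200
p <- 5
X  <- matrix(runif(n * p), nrow = n)
y1 <- runif(n)
y2 <- runif(n)

# lm(y ~ X) adds an intercept, so the cached design matrix needs one too
Xd <- cbind(1, X)
q  <- qr(Xd)                 # computed once

b1 <- qr.coef(q, y1)         # reused for each new response
b2 <- qr.coef(q, y2)

all.equal(unname(coef(lm(y1 ~ X))), unname(b1))
all.equal(unname(coef(lm(y2 ~ X))), unname(b2))
```

qr.fitted(q, y) and qr.resid(q, y) can be reused on the same q in the same way.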
Have you tried just fitting a multivariate model? I haven't checked the code, but on my system it takes roughly half the time of fitting the two models separately, so I wouldn't be surprised if it's doing what you suggest behind the scenes. That is,
mods <- lm(cbind(y1, y2) ~ X)
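A small check, on made-up data, that the multivariate fit returns the same coefficients as separate fits (one column of coef(mods) per response):

```r
set.seed(1)
n <- 50
X  <- matrix(runif(3 * n), nrow = n)
y1 <- runif(n)
y2 <- runif(n)

mods <- lm(cbind(y1, y2) ~ X)
coef(mods)                   # matrix with columns "y1" and "y2"

all.equal(unname(coef(mods)[, "y1"]), unname(coef(lm(y1 ~ X))))
all.equal(unname(coef(mods)[, "y2"]), unname(coef(lm(y2 ~ X))))
```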

Lars algorithm with lasso modification

I'm trying to implement the lars algorithm with the lasso modification.
At point 3. I'm stuck, I want to program it but I don't really understand it.
Point 1 and 2 I already did, here is the code:
#1. Standardize the predictors to have mean zero and unit norm.
set.seed(19875)
n <- 10
p <- 5
real_p <- 5
x <- matrix(rnorm(n*p), nrow=n, ncol=p)
x <- x-matrix(apply(x,2,mean),ncol=ncol(x),nrow=nrow(x),byrow=T)
x <- x/matrix(apply(x,2,sd),ncol=ncol(x),nrow=nrow(x),byrow=T)
y <- apply(x[,1:real_p], 1, sum) + rnorm(n)
#Start with the residual r = y − ȳ, β1, β2, ..., βp = 0
r=y-mean(y)
beta=matrix(0, ncol=ncol(x), nrow=1)
#2. Find the predictor xj most correlated with r.
co= t(x)%*%r
j= (1:ncol(x))[abs(co)==max(abs(co))][1]
#3. Move βj from 0 towards its least-squares coefficient <xj, r>, until some
#other competitor xk has as much correlation with the current residual
#as does xj.
I would very much appreciate any clarification.
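One way to read step 3, sketched for the very first step only: move βj a fraction alpha of the way towards the least-squares coefficient of r on xj, where alpha is the smallest value at which some other predictor xk has the same absolute inner product with the updated residual as xj. The names delta, u, a, cand and alpha are introduced here for illustration, and the sketch assumes all columns of x have the same norm, as they do after the standardisation. The lasso modification (dropping a predictor whose coefficient hits zero) does not come into play on the first step.

```r
set.seed(19875)
n <- 10
p <- 5
x <- scale(matrix(rnorm(n * p), nrow = n))  # centred columns with unit sd
y <- rowSums(x) + rnorm(n)

r    <- y - mean(y)              # starting residual
beta <- numeric(p)               # all coefficients start at 0

co <- drop(crossprod(x, r))      # inner products <x_k, r>
j  <- which.max(abs(co))         # step 2: most correlated predictor
C  <- abs(co[j])

# Step 3: least-squares coefficient of r on x_j and the matching fit direction
delta <- co[j] / sum(x[, j]^2)
u     <- x[, j] * delta
a     <- drop(crossprod(x, u))   # change in each <x_k, r> per unit of alpha

# Candidate fractions at which an inactive x_k ties with x_j
k     <- setdiff(seq_len(p), j)
cand  <- c((C - co[k]) / (C - a[k]), (C + co[k]) / (C + a[k]))
alpha <- min(cand[cand > 0 & cand <= 1])

beta[j] <- alpha * delta
r_new   <- r - alpha * u
co_new  <- drop(crossprod(x, r_new))

# After the step, x_j and its closest competitor are tied
all.equal(abs(co_new[j]), max(abs(co_new[k])))
```

From here the tied predictor would join the active set and both coefficients would move together in their joint least-squares direction.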

Linear optimization of difference subject to constraints

I have a vector of values (x1,x2,x3,x4,x5,x6,x7) and I want to create a new, unknown vector (y1,y2,y3,y4,y5,y6,y7) that minimizes ||x-y||^2. I also want to create this new vector subject to the constraints that x1+x2+x3+x4+x5=x6 and x1+x2+x3+x4=x7. I tried to use constrOptim but I do not think I have the right inputs. Any help would be greatly appreciated!
Would it be best to come up with a set of values and then use a nls model to predict them? How would I do that?
Thank you!!
We assume that what the question actually intended was that y is known and we want to get x with the indicated constraints.
Note that nls does not work for zero-residual problems. Since no data was provided in the question, we don't know whether that is the case here, so we first present two solutions that can handle it and then show an nls solution for the non-zero-residual case. We use y shown below in (1) as our test input for (1) and (2), and it does have zero residuals. For (3), the nls solution, we use a different y which does not lead to zero residuals.
Here are some alternative solutions:
1) lm We define x5_to_x7 which maps the first 5 components of x to the entire 7-element vector. Because x5_to_x7 is a linear operator it corresponds to a matrix X which we form and then use in lm:
# test data
y <- c(1:5, sum(1:5), sum(1:4))
x5_to_x7 <- function(x5) c(x5, sum(x5), sum(x5[1:4]))
X <- apply(diag(5), 1, x5_to_x7)
fm <- lm(y ~ X + 0)
giving:
coef(fm)
## X1 X2 X3 X4 X5
## 1 2 3 4 5
all.equal(x5_to_x7(coef(fm)), y)
## [1] TRUE
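To make the linear operator concrete, the matrix X built by apply above can be compared with the same matrix written out by hand (X_explicit is a name introduced here for illustration):

```r
x5_to_x7 <- function(x5) c(x5, sum(x5), sum(x5[1:4]))
X <- apply(diag(5), 1, x5_to_x7)

# Rows 1-5 pass x1..x5 through, row 6 sums all five, row 7 sums the first four
X_explicit <- rbind(diag(5), rep(1, 5), c(1, 1, 1, 1, 0))

dim(X)
all.equal(X, X_explicit)
```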
2) optim Alternatively we can define a residual sum of squares function and minimize it using optim, where y and x5_to_x7 are as above:
rss <- function(x) sum((y - x5_to_x7(x))^2)
result <- optim(numeric(5), rss, method = "BFGS")
giving:
> result
$par
[1] 1 2 3 4 5
$value
[1] 5.685557e-20
$counts
function gradient
18 12
$convergence
[1] 0
$message
NULL
> all.equal(x5_to_x7(result$par), y)
[1] TRUE
3) nls If y were such that the residuals are not zero then it would be possible to use nls as suggested in the question.
y <- 1:7
fm1 <- lm(y ~ X + 0)
fm2 <- nls(y ~ x5_to_x7(x), start = list(x = numeric(5)))
all.equal(coef(fm1), coef(fm2), check.attributes = FALSE)
## [1] TRUE
