How R calculates the regression coefficients using the lm() function

I wanted to replicate R's estimation of the regression equation on the data below:
set.seed(1)
Vec = rnorm(1000, 100, 3)
DF = data.frame(X1 = Vec[-1], X2 = Vec[-length(Vec)])
Below, R reports the estimated coefficients:
coef(lm(X1~X2, DF)) ### slope = -0.03871511
Then I manually estimate the slope:
(sum(DF[,1]*DF[,2])*nrow(DF) - sum(DF[,1])*sum(DF[,2])) / (nrow(DF) * sum(DF[,1]^2) - (sum(DF[,1])^2)) ### -0.03871178
They are close but do not match exactly.
Can you please help me understand what I am missing here?
Any pointer will be very helpful.

The problem is that X1 and X2 are switched in lm relative to the long formula.
Background
The formula for the slope in lm(y ~ x) is the following, where x and y each have length n, x is short for x[i], y is short for y[i], and the summations run over i = 1, 2, ..., n:
slope = (n * sum(x*y) - sum(x) * sum(y)) / (n * sum(x^2) - (sum(x))^2)
Source of the problem
Thus the long formula in the question, also shown in (1) below, corresponds to lm(X2 ~ X1, DF) and not to lm(X1 ~ X2, DF). Either change the formula in the lm model as in (1) below, or else change the long formula in the question by replacing each occurrence of DF[, 1] in the denominator with DF[, 2], as in (2) below.
# (1)
coef(lm(X2 ~ X1, DF))[[2]]
## [1] -0.03871178
(sum(DF[,1]*DF[,2])*nrow(DF) - sum(DF[,1])*sum(DF[,2])) /
(nrow(DF) * sum(DF[,1]^2) - (sum(DF[,1])^2)) # as in question
## [1] -0.03871178
# (2)
coef(lm(X1 ~ X2, DF))[[2]] # as in question
## [1] -0.03871511
(sum(DF[,1]*DF[,2])*nrow(DF) - sum(DF[,1])*sum(DF[,2])) /
(nrow(DF) * sum(DF[,2]^2) - (sum(DF[,2])^2))
## [1] -0.03871511

This is not a StackOverflow question per se, but rather a stats question for the sister site.
The narrow answer is that you can look into the R sources; lm() generally farms the work out to LAPACK and BLAS, but a key part of the regression calculation is specialised in order to deal (in a statistical, rather than numerical, way) with low-rank cases.
Anyway, here I believe you are 'merely' not adjusting for degrees of freedom correctly, which 'almost but not quite' washes out when you use 1000 observations. A simpler case follows, along with a 'simpler' way to calculate the coefficient 'by hand', which also has the advantage of matching:
> set.seed(1)
> Vec <- rnorm(5,100,3)
> DF <- data.frame(X1=Vec[-1], X2=Vec[-length(Vec)])
> coef(lm(X1 ~ X2, DF))[2]
X2
-0.322898
> cov(DF$X1, DF$X2) / var(DF$X2)
[1] -0.322898
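The same cov/var identity should also reproduce the slope on the original 1000-observation data; a quick sketch reusing the question's setup:
set.seed(1)
Vec <- rnorm(1000, 100, 3)
DF <- data.frame(X1 = Vec[-1], X2 = Vec[-length(Vec)])
cov(DF$X1, DF$X2) / var(DF$X2)  # should equal the lm() slope, -0.03871511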

coef(lm(X1~X2, DF))
#  (Intercept)           X2
# 103.83714016  -0.03871511
You can apply the OLS coefficient formula in matrix form, as below.
X = cbind(1, DF[,2])
solve(t(X) %*% X) %*% t(X) %*% as.matrix(DF[,1])
giving,
# [,1]
#[1,] 103.83714016
#[2,] -0.03871511
which is the same as the lm() output.
Data:
set.seed(1)
Vec = rnorm(1000, 100, 3)
DF = data.frame(X1 = Vec[-1], X2 = Vec[-length(Vec)])
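As an aside, lm() does not form t(X) %*% X explicitly; it solves the least-squares problem via a QR decomposition. A sketch of the equivalent base-R computation, reusing the X defined above:
q <- qr(X)           # QR decomposition of the design matrix
qr.coef(q, DF[, 1])  # should reproduce the lm() coefficients, 103.83714016 and -0.03871511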

Related

Replace lm coefficients and calculate results of lm new in R

I am able to change the coefficients of my linear model. Then I want to compare the results of my "new" model with the new coefficients, but R is not calculating the results with the new coefficients.
As you can see in the following example, the summaries of my models fit and fit1 are exactly the same, even though results like multiple R-squared or the fitted values should change.
set.seed(2157010) #forgot set.
x1 <- 1998:2011
x2 <- x1 + rnorm(length(x1))
y <- 3*x2 + rnorm(length(x1)) #you had x, not x1 or x2
fit <- lm( y ~ x1 + x2)
# view original coefficients
coef(fit)
# generate second function for comparing results
fit1 <- fit
# replace coefficients with new values, use whole name which is coefficients:
fit1$coefficients[2:3] <- c(5, 1)
# view new coefficients
coef(fit1)
# Comparing
summary(fit)
summary(fit1)
Thanks in advance
It might be easier to compute the multiple R^2 yourself with the substituted parameters.
mult_r2 <- function(beta, y, X) {
  tot_ss <- var(y) * (length(y) - 1)  # total sum of squares about the mean
  rss <- sum((y - X %*% beta)^2)      # residual sum of squares for the given beta
  1 - rss / tot_ss
}
(Or, more compactly, following the comments, you could compute p <- X %*% beta; (cor(y, p))^2.)
mult_r2(coef(fit), y = model.response(model.frame(fit)), X = model.matrix(fit))
## 0.9931179, matches summary()
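If you want to cross-check against what lm() itself reports, summary() exposes the same quantity (using the fit above):
summary(fit)$r.squared
## [1] 0.9931179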
Now with new coefficients:
new_coef <- coef(fit)
new_coef[2:3] <- c(5,1)
mult_r2(new_coef, y = model.response(model.frame(fit)), X = model.matrix(fit))
## [1] -343917
That last result seems pretty wild, but the substituted coefficients are very different from the true least-squares coeffs, and negative R^2 is possible when the model is bad enough ...

Is it possible to cache `lm()` matrices to fit new data?

I wrote an algorithm which fits a linear model with lm() and then "updates" the response variable iteratively. The problem is that, in a high-dimensional scenario, fitting the linear models creates a bottleneck.
On the other hand, most of the work required is a matrix inversion that only depends on the covariate matrix X, i.e., the coefficients are given by: solve(t(X) %*% X) %*% t(X) %*% y.
Reading lm() code, I understand that R uses QR decomposition.
Is it possible to recover the internal matrix operation used and fit a new model with new y values faster?
Here's a minimal example:
set.seed(1)
X <- matrix(runif(400*150000), nrow = 150000)
y1 <- runif(150000)
y2 <- runif(150000)
mod1 <- lm(y1 ~ X)
mod2 <- lm(y2 ~ X)
Theoretically, mod2 "repeats" costly matrix operations identical to the ones made in the first lm() call.
I want to keep using lm() for its efficient implementation and its ability to handle rank-deficient matrices automatically.
# Data
set.seed(1)
n = 5
X <- matrix(runif(5*n), nrow = n)
y1 <- runif(n)
y2 <- runif(n)
# lm models
mod1 <- lm(y1 ~ X)
mod2 <- lm(y2 ~ X)
# Obtain QR decomposition of X
q = qr(X)
# Reuse 'q' to obtain fitted values repeatedly
mod1_fv = qr.fitted(q, y1)
mod2_fv = qr.fitted(q, y2)
# Compare fitted values from reusing 'q' to fitted values in 'lm' models
Vectorize(all.equal)(unname(mod1$fitted.values), mod1_fv)
#> [1] TRUE TRUE TRUE TRUE TRUE
Vectorize(all.equal)(unname(mod2$fitted.values), mod2_fv)
#> [1] TRUE TRUE TRUE TRUE TRUE
Created on 2019-09-06 by the reprex package (v0.3.0)
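If the coefficients themselves are needed rather than the fitted values, the same caching idea should carry over, with one caveat worth flagging; a hedged sketch (q_int is just an illustrative name):
# lm(y1 ~ X) implicitly adds an intercept column, so to recover coefficients
# (not just fitted values) take the decomposition of the augmented design matrix
q_int <- qr(cbind(1, X))
qr.coef(q_int, y1)  # should correspond to coef(mod1), up to pivoting/rank handling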
Have you tried just fitting a multivariate model? I haven't checked the code, but on my system it takes almost half the time of fitting the two models separately, so I wouldn't be surprised if it is doing what you suggest behind the scenes. That is,
mods <- lm(cbind(y1, y2) ~ X)
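The per-response coefficients can then be read off column by column from that single fit; a small sketch (assuming, as is usual, that the coefficient columns are named after the responses):
coef(mods)          # a matrix with one column of coefficients per response
coef(mods)[, "y1"]  # should agree with coef(lm(y1 ~ X))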

Simulate data from regression model with exact parameters in R

How can I simulate data so that the coefficients recovered by lm are determined to be particular pre-determined values and have normally distributed residuals? For example, could I generate data so that lm(y ~ 1 + x) will yield (Intercept) = 1.500 and x = 4.000? I would like the solution to be versatile enough to work for multiple regression with continuous x (e.g., lm(y ~ 1 + x1 + x2)) but there are bonus points if it works for interactions as well (lm(y ~ 1 + x1 + x2 + x1*x2)). Also, it should work for small N (e.g., N < 200).
I know how to simulate random data which is generated by these parameters (see e.g. here), but that randomness carries over to variation in the estimated coefficients, e.g., Intercept = 1.488 and x = 4.067.
Related: It is possible to generate data that yields pre-determined correlation coefficients (see here and here). So I'm asking if this can be done for multiple regression?
One approach is to use a perfectly symmetrical noise. The noise cancels itself so the estimated parameters are exactly the input parameters, yet the residuals appear normally distributed.
x <- 1:100
y <- cbind(1,x) %*% c(1.5, 4)
eps <- rnorm(100)
x <- c(x, x)
y <- c(y + eps, y - eps)
fit <- lm(y ~ x)
coef(fit)
# (Intercept)           x
#         1.5         4.0
plot(fit)
The residuals are normally distributed... but exhibit an abnormally perfect symmetry!
EDIT by OP: I wrote up general-purpose code exploiting the symmetrical-residuals trick. It scales well to more complex models. This example also shows that it works for categorical predictors and interaction effects.
library(dplyr)
# Data and residuals
df = tibble(
  # Predictors
  x1 = 1:100,                    # Continuous
  x2 = rep(c(0, 1), each = 50),  # Dummy-coded categorical
  # Generate y from model, including interaction term
  y_model = 1.5 + 4 * x1 - 2.1 * x2 + 8.76543 * x1 * x2,
  noise = rnorm(100)             # Residuals
)
# Do the symmetrical-residuals trick
# This is copy-and-paste ready, no matter model complexity.
df = bind_rows(
  df %>% mutate(y = y_model + noise),
  df %>% mutate(y = y_model - noise)  # Mirrored
)
# Check that it works
fit <- lm(y ~ x1 + x2 + x1*x2, df)
coef(fit)
# (Intercept)          x1          x2       x1:x2
#     1.50000     4.00000    -2.10000     8.76543
You could do rejection sampling:
set.seed(42)
tol <- 1e-8
x <- 1:100
continue <- TRUE
while (continue) {
  y <- cbind(1, x) %*% c(1.5, 4) + rnorm(length(x))
  if (sum((coef(lm(y ~ x)) - c(1.5, 4))^2) < tol) continue <- FALSE
}
coef(lm(y ~ x))
#(Intercept) x
# 1.500013 4.000023
Obviously, this is a brute-force approach, and the smaller the tolerance and the more complex the model, the longer it will take. A more efficient approach should be possible by providing the residuals as input and then employing some matrix algebra to calculate the y values. But that's more of a maths question ...
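One way to realise that matrix-algebra idea (a sketch, not part of the original answer) is to make the noise orthogonal to the design matrix before adding it, so the least-squares coefficients come out exactly:
set.seed(42)
x <- 1:100
eps <- rnorm(length(x))
eps_perp <- residuals(lm(eps ~ x))         # part of eps orthogonal to cbind(1, x)
y <- cbind(1, x) %*% c(1.5, 4) + eps_perp  # exact coefficients plus orthogonal noise
coef(lm(y ~ x))                            # should be exactly 1.5 and 4, up to rounding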

Calculating coefficients of bivariate linear regression

Question to be answered
Does anyone know how to solve the attached problem in two lines of code? I believe as.matrix would work to create a matrix X, and then X %*% X, t(X), and solve(X) could be used to get the answer. However, it does not seem to be working. Any answers will help, thanks.
I would recommend using read.csv instead of read.table. It would be useful for you to go over the difference between the two functions in this thread: read.csv vs. read.table.
df <- read.csv("http://pengstats.macssa.com/download/rcc/lmdata.csv")
model1 <- lm(y ~ x1 + x2, data = df)
coefficients(model1) # get the coefficients of your regression model1
summary(model1) # get the summary of model1
Based on kon_u's answer, here is an example of how to do it by hand:
df <- read.csv("http://pengstats.macssa.com/download/rcc/lmdata.csv")
model1 <- lm(y ~ x1 + x2, data = df)
coefficients(model1) # get the coefficients of your regression model1
summary(model1) # get the summary of model1
### Based on the formula
X <- cbind(1, df$x1, df$x2) # the column of 1s accounts for the intercept
Y <- df$y
bhat <- solve(t(X) %*% X) %*% t(X) %*% Y # coefficients
bhat # Note that we got the same coefficients with the lm function
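If you literally want the two-line version the question asks about, a slightly more numerically friendly sketch of the same normal-equations computation (crossprod is base R) would be:
X <- cbind(1, df$x1, df$x2)
solve(crossprod(X), crossprod(X, df$y))  # t(X) %*% X and t(X) %*% y, solved in one step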

Linear optimization of difference subject to constraints

I have a vector of values (x1, x2, x3, x4, x5, x6, x7) and I want to create a new unknown vector (y1, y2, y3, y4, y5, y6, y7) that minimizes ||x - y||^2, subject to the constraints that x1+x2+x3+x4+x5 = x6 and x1+x2+x3+x4 = x7. I tried to use constrOptim but I do not think I have the right inputs. Any help would be greatly appreciated!
Would it be best to come up with a set of values and then use an nls model to predict them? How would I do that?
Thank you!!
We assume that what the question actually intended was that y is known and we want to get x with the indicated constraints.
Note that nls does not work for zero-residual problems, and since no data was provided in the question we don't know whether that is the case here, so we first present two solutions that can handle it and then show an nls solution for the non-zero-residual case. We use the y shown below in (1) as the test input for (1) and (2); it does have zero residuals. For (3), the nls solution, we use a different y which does not lead to zero residuals.
Here are some alternative solutions:
1) lm  We define x5_to_x7, which maps the first 5 components of x to the entire 7-element vector. Because x5_to_x7 is a linear operator, it corresponds to a matrix X, which we form and then use in lm:
# test data
y <- c(1:5, sum(1:5), sum(1:4))
x5_to_x7 <- function(x5) c(x5, sum(x5), sum(x5[1:4]))
X <- apply(diag(5), 1, x5_to_x7)
fm <- lm(y ~ X + 0)
giving:
coef(fm)
## X1 X2 X3 X4 X5
## 1 2 3 4 5
all.equal(x5_to_x7(coef(fm)), y)
## [1] TRUE
2) optim  Alternatively, we can define a residual-sum-of-squares function and minimize it with optim, where y and x5_to_x7 are as above:
rss <- function(x) sum((y - x5_to_x7(x))^2)
result <- optim(numeric(5), rss, method = "BFGS")
giving:
> result
$par
[1] 1 2 3 4 5
$value
[1] 5.685557e-20
$counts
function gradient
18 12
$convergence
[1] 0
$message
NULL
> all.equal(x5_to_x7(result$par), y)
[1] TRUE
3) nls  If y were such that the residuals are not zero, then it would be possible to use nls, as suggested in the question:
y <- 1:7
fm1 <- lm(y ~ X + 0)
fm2 <- nls(y ~ x5_to_x7(x), start = list(x = numeric(5)))
all.equal(coef(fm1), coef(fm2), check.attributes = FALSE)
## [1] TRUE
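As an aside, the same least-squares solution can also be obtained directly with base R's qr.solve, which returns the least-squares solution for a rectangular system:
qr.solve(X, y)  # should match coef(fm1) and coef(fm2) above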
