Linear optimization of difference subject to constraints - r

I have a vector of values (x1,x2,x3,x4,x5,x6,x7) and I want to find a new unknown vector (y1,y2,y3,y4,y5,y6,y7) that minimizes ||x-y||^2. I also want this new vector to satisfy the constraints x1+x2+x3+x4+x5=x6 and x1+x2+x3+x4=x7. I tried to use constrOptim but I do not think I have the right inputs. Any help would be greatly appreciated!
Would it be best to come up with a set of values and then use an nls model to predict them? How would I do that?
Thank you!!

We assume that what the question actually intended is that y is known and we want to find x subject to the indicated constraints.
Note that nls does not work for zero-residual problems, and since no data was provided in the question we do not know whether that is the case here. We therefore first present two solutions that can handle zero residuals and then show an nls fit for the non-zero-residual case. We use the y shown below in (1) as the test input for (1) and (2); it does give zero residuals. For (3), the nls solution, we use a different y that does not lead to zero residuals.
Here are some alternative solutions:
1) lm We define x5_to_x7, which maps a 5-element vector to the corresponding 7-element vector. Because x5_to_x7 is a linear operator it corresponds to a matrix X, which we form and then use in lm:
# test data
y <- c(1:5, sum(1:5), sum(1:4))
x5_to_x7 <- function(x5) c(x5, sum(x5), sum(x5[1:4]))  # map a 5-vector to the full 7-vector
X <- apply(diag(5), 1, x5_to_x7)  # 7 x 5 matrix such that X %*% x equals x5_to_x7(x)
fm <- lm(y ~ X + 0)
giving:
coef(fm)
## X1 X2 X3 X4 X5
## 1 2 3 4 5
all.equal(x5_to_x7(coef(fm)), y)
## [1] TRUE
2) optim Alternatively, we can define a residual sum of squares function and minimize it with optim, where y and x5_to_x7 are as above:
rss <- function(x) sum((y - x5_to_x7(x))^2)
result <- optim(numeric(5), rss, method = "BFGS")
giving:
> result
$par
[1] 1 2 3 4 5
$value
[1] 5.685557e-20
$counts
function gradient
18 12
$convergence
[1] 0
$message
NULL
> all.equal(x5_to_x7(result$par), y)
[1] TRUE
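As a side note, because the objective is a simple quadratic in x, we could also supply the analytic gradient to optim, reusing the matrix X built in (1). The gradient function gr below is our own addition, shown only as a sketch:
# gradient of rss(x) = ||y - X x||^2 with respect to x
gr <- function(x) as.numeric(-2 * t(X) %*% (y - X %*% x))
result2 <- optim(numeric(5), rss, gr, method = "BFGS")
result2$par  # again approximately 1 2 3 4 5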
3) nls If y were such that the residuals were not zero, then it would be possible to use nls as suggested in the question:
y <- 1:7
fm1 <- lm(y ~ X + 0)
fm2 <- nls(y ~ x5_to_x7(x), start = list(x = numeric(5)))
all.equal(coef(fm1), coef(fm2), check.attributes = FALSE)
## [1] TRUE

Related

How R calculates the Regression coefficients using lm() function

I wanted to replicate R's estimation of the regression coefficients by hand for the data below:
set.seed(1)
Vec = rnorm(1000, 100, 3)
DF = data.frame(X1 = Vec[-1], X2 = Vec[-length(Vec)])
Below, R reports the coefficient estimates:
coef(lm(X1~X2, DF)) ### slope = -0.03871511
Then I estimate the slope manually:
(sum(DF[,1]*DF[,2])*nrow(DF) - sum(DF[,1])*sum(DF[,2])) / (nrow(DF) * sum(DF[,1]^2) - (sum(DF[,1])^2)) ### -0.03871178
They are close but still do not match exactly.
Can you please help me understand what I am missing here?
Any pointer will be very helpful.
The problem is that X1 and X2 are switched in lm relative to the long formula.
Background
The formula for the slope in lm(y ~ x), where x and y each have length n, x is short for x[i], y is short for y[i], and the summations run over i = 1, 2, ..., n, is:
slope = (n * sum(x*y) - sum(x) * sum(y)) / (n * sum(x^2) - (sum(x))^2)
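To make explicit which variable plays the role of the predictor, the same formula can be wrapped in a small helper (the function name slope_hat is ours, added purely for illustration):
slope_hat <- function(x, y) {
  n <- length(x)
  (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
}
# slope_hat(DF$X2, DF$X1) reproduces coef(lm(X1 ~ X2, DF))[2]
# slope_hat(DF$X1, DF$X2) reproduces coef(lm(X2 ~ X1, DF))[2]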
Source of the problem
Thus the long formula in the question, also shown in (1) below, corresponds to lm(X2 ~ X1, DF) and not to lm(X1 ~ X2, DF). Either change the formula in the lm call as in (1) below, or else change the long formula by replacing each occurrence of DF[, 1] in the denominator with DF[, 2], as in (2) below.
# (1)
coef(lm(X2 ~ X1, DF))[[2]]
## [1] -0.03871178
(sum(DF[,1]*DF[,2])*nrow(DF) - sum(DF[,1])*sum(DF[,2])) /
(nrow(DF) * sum(DF[,1]^2) - (sum(DF[,1])^2)) # as in question
## [1] -0.03871178
# (2)
coef(lm(X1 ~ X2, DF))[[2]] # as in question
## [1] -0.03871511
(sum(DF[,1]*DF[,2])*nrow(DF) - sum(DF[,1])*sum(DF[,2])) /
(nrow(DF) * sum(DF[,2]^2) - (sum(DF[,2])^2))
## [1] -0.03871511
This is not a StackOverflow question per se, but rather a stats question for the sister site.
The narrow answer is that you can look into the R sources; the work is generally farmed off to LAPACK and BLAS, but a key part of the regression calculation is specialised in order to deal (in a statistical, rather than numerical, way) with low-rank cases.
Anyway, here I believe you are 'merely' not adjusting for degrees of freedom correctly, which 'almost but not quite' washes out when you use 1000 observations. A simpler case follows, along with a 'simpler' way to calculate the coefficient 'by hand', which also has the advantage of matching:
> set.seed(1)
> Vec <- rnorm(5,100,3)
> DF <- data.frame(X1=Vec[-1], X2=Vec[-length(Vec)])
> coef(lm(X1 ~ X2, DF))[2]
X2
-0.322898
> cov(DF$X1, DF$X2) / var(DF$X2)
[1] -0.322898
>
coef(lm(X1~X2, DF))
# (Intercept) X2
# 103.83714016 -0.03871511
You can apply the OLS coefficient formula in matrix form, beta = (X'X)^(-1) X'y, as below.
X = cbind(1, DF[, 2])
solve(t(X) %*% X) %*% t(X) %*% as.matrix(DF[, 1])
giving,
# [,1]
#[1,] 103.83714016
#[2,] -0.03871511
which is the same as the lm() output.
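As a side note (not part of the original answer), the same normal-equations formula can be evaluated with crossprod and a single call to solve on a linear system, which is equivalent and usually a bit faster and more stable for larger X:
solve(crossprod(X), crossprod(X, as.matrix(DF[, 1])))
# gives the same intercept and slope as above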
Data:
set.seed(1)
Vec = rnorm(1000, 100, 3)
DF = data.frame(X1 = Vec[-1], X2 = Vec[-length(Vec)])

Is it possible to cache `lm()` matrices to fit new data?

I wrote an algorithm which fits a linear model with lm() and then "updates" the response variable iteratively. The problem is that in a high-dimensional scenario, fitting the linear models becomes a bottleneck.
On the other hand, most of the work required is a matrix inversion that depends only on the covariate matrix X, i.e., the coefficients are given by: solve(t(X) %*% X) %*% t(X) %*% y.
Reading lm() code, I understand that R uses QR decomposition.
Is it possible to recover the internal matrix operation used and fit a new model with new y values faster?
Here's a minimal example:
set.seed(1)
X <- matrix(runif(400*150000), nrow = 150000)
y1 <- runif(150000)
y2 <- runif(150000)
mod1 <- lm(y1 ~ X)
mod2 <- lm(y2 ~ X)
Theoretically, mod2 "repeats" costly matrix operations identical to the ones made in the first lm() call.
I want to keep using lm() for its efficient implementation and its ability to handle rank-deficient matrices automatically.
# Data
set.seed(1)
n = 5
X <- matrix(runif(5*n), nrow = n)
y1 <- runif(n)
y2 <- runif(n)
# lm models
mod1 <- lm(y1 ~ X)
mod2 <- lm(y2 ~ X)
# Obtain QR decomposition of X
q = qr(X)
# Reuse 'q' to obtain fitted values repeatedly
mod1_fv = qr.fitted(q, y1)
mod2_fv = qr.fitted(q, y2)
# Compare fitted values from reusing 'q' to fitted values in 'lm' models
Vectorize(all.equal)(unname(mod1$fitted.values), mod1_fv)
#> [1] TRUE TRUE TRUE TRUE TRUE
Vectorize(all.equal)(unname(mod2$fitted.values), mod2_fv)
#> [1] TRUE TRUE TRUE TRUE TRUE
Created on 2019-09-06 by the reprex package (v0.3.0)
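If the coefficients themselves are needed as well, the cached decomposition can be reused with qr.coef in the same way; a brief sketch under the same setup as above:
mod1_coef <- qr.coef(q, y1)
mod2_coef <- qr.coef(q, y2)
# Note: q was built from X alone, so these coefficients contain no intercept term.
# To mirror lm(y ~ X) exactly for a general X, build the decomposition as qr(cbind(1, X)).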
Have you tried just fitting a multivariate model? I haven't checked the code, but on my system it takes almost half the time of fitting the two models separately, so I wouldn't be surprised if it's doing what you suggest behind the scenes. That is,
mods <- lm(cbind(y1, y2) ~ X)
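For reference, the coefficients of such a multivariate fit come back as a matrix with one column per response, so each single-response model can still be read off (a small usage sketch based on the example above):
coef(mods)[, "y1"]  # should match coef(lm(y1 ~ X))
coef(mods)[, "y2"]  # should match coef(lm(y2 ~ X))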

R: one regression model for 2 different data sets to prepare for waldtest

I have two different data sets, each representing one of my two portfolios.
y(p) is the dependent variable and x1(p), x2(p), x3(p), x4(p) are the independent variables.
(p) indicates a portfolio-specific value. Column 1 of each variable represents portfolio 1 and column 2 represents portfolio 2.
The regression equation is:
y(p) = α(p) + β1(p)*x1(p) + β2(p)*x2(p) + β3(p)*x3(p) + β4(p)*x4(p)
What I did so far is implement a separate regression model for each portfolio in R:
lm1 <- lm(y[,1]~x1[,1]+x2[,1]+x3[,1]+x4[,1])
lm2 <- lm(y[,2]~x1[,2]+x2[,2]+x3[,2]+x4[,2])
My objective is to compare the two intercepts of both regression models. Within the scope of this comparison I need to test the joint significance of these intercepts. As far as I can tell, using the Wald test should be appropriate.
If I use the waldtest function from the lmtest package, it does not work, obviously because the response variable is not the same for both models.
library(lmtest)
waldtest(lm1,lm2)
In waldtest.default(object, ..., test = match.arg(test)) :
models with response "y[, 2]" removed because response differs from model 1
None of the workarounds I tried so far worked either, e.g. R: Waldtest: "Error in solve.default(vc[ovar, ovar]) : 'a' is 0-diml"
My guess is that the regression needs to be done in a different way to fix the problems regarding the waldtest.
So that leads to my question:
Is there a way to do the regression in one model that still generates portfolio-specific intercepts and coefficients? (I assume that this would fix the problems with the waldtest function.)
Any advice or suggestion will be appreciated.
The following data can be used for a reproducible example:
y=matrix(rnorm(10),ncol=2)
x1=matrix(rnorm(10),ncol=2)
x2=matrix(rnorm(10),ncol=2)
x3=matrix(rnorm(10),ncol=2)
x4=matrix(rnorm(10),ncol=2)
lm1 <- lm(y[,1]~x1[,1]+x2[,1]+x3[,1]+x4[,1])
lm2 <- lm(y[,2]~x1[,2]+x2[,2]+x3[,2]+x4[,2])
library(lmtest)
waldtest(lm1,lm2)
Best regards,
Simon
Here are three ways to test the equality of the intercepts. The second one is an implementation of the accepted answer to this question, while the other two are implementations of the second answer to the aforementioned question under different assumptions.
Let
n <- 5
y <- matrix(rnorm(10), ncol = 2)
x <- matrix(rnorm(10), ncol = 2)
First, we may indeed perform the test with only a single model. For that purpose we create a new vector Y that concatenates y[, 1] and y[, 2]. As for the independent variables, we create a block-diagonal matrix with the regressors of one model in the upper-left block and those of the other model in the lower-right block. Lastly, we create a group factor indicating the underlying model. Hence,
library(Matrix)
Y <- c(y)
X <- as.matrix(bdiag(x[, 1], x[, 2]))
G <- factor(rep(0:1, each = n))
Now the unrestricted model is
m1 <- lm(Y ~ G + X - 1)
while the restricted one is
m2 <- lm(Y ~ X)
Testing for intercepts equality gives
library(lmtest)
waldtest(m1, m2)
# Wald test
#
# Model 1: Y ~ G + X - 1
# Model 2: Y ~ X
# Res.Df Df F Pr(>F)
# 1 6
# 2 7 -1 0.5473 0.4873
so that, as expected, we cannot reject their equality. A problem with this solution, however, is that it amounts to estimating the two models separately while assuming that the errors have the same variance in both. Also, we do not allow for cross-correlation between the errors.
Second, we can relax the assumption of identical error variances by estimating two separate models and employing a Z-test, as follows.
M1 <- lm(y[, 1] ~ x[, 1])
M2 <- lm(y[, 2] ~ x[, 2])
Z <- unname((coef(M1)[1] - coef(M2)[1]) / sqrt(coef(summary(M1))[1, 2]^2 + coef(summary(M2))[1, 2]^2))
2 * pnorm(-abs(Z))
# [1] 0.5425736
leading to the same conclusion.
Lastly, we can employ SUR (seemingly unrelated regressions), which allows for model-dependent error variances as well as contemporaneous cross-dependence between the errors (this may not be necessary in your case; it depends on the kind of data you are using). For that we can use the systemfit package as follows:
library(systemfit)
eq1 <- y[, 1] ~ x[, 1]
eq2 <- y[, 2] ~ x[, 2]
m <- systemfit(list(eq1, eq2), method = "SUR")
In this case we are also able to perform the Wald test (linearHypothesis comes from the car package):
library(car)
R <- matrix(c(1, 0, -1, 0), nrow = 1) # restriction matrix: eq1 intercept minus eq2 intercept
linearHypothesis(m, R, test = "Chisq")
# Linear hypothesis test (Chi^2 statistic of a Wald test)
#
# Hypothesis:
# eq1_(Intercept) - eq2_(Intercept) = 0
#
# Model 1: restricted model
# Model 2: m
#
# Res.Df Df Chisq Pr(>Chisq)
# 1 7
# 2 6 1 0.3037 0.5816

Linear fit without slope in r

I want to fit a linear model with no slope and extract information from it. My objective is to find the best y-intercept for a horizontal line through a data set, and also to evaluate from the resulting linear fit whether y shows a particular behavior (x is a date). I have been using the range to evaluate the behavior, but I am looking for an index without units.
Removing y-intercept:
X <- 1:10
Y <- 2:11
lm1 <- lm(Y~X + 0, data = data.frame(X=X,Y=Y)) # y-intercept remove opt 1
lm1 <- lm(Y~X - 1, data = data.frame(X=X,Y=Y)) # y-intercept remove opt 2
lm1 <- lm(Y~0 + X, data = data.frame(X=X,Y=Y)) # y-intercept remove opt 3
lm1$coefficients
X
1.142857
summary(lm1)$r.squared
[1] 0.9957567
All of the lm variants shown above give the same fit and report a high R-squared. But if I evaluate:
lm2 <- lm(Y~1, data = data.frame(X=X,Y=Y))
lm2$coefficients
(Intercept)
6.5
summary(lm2)$r.squared
[1] 0
Is there a way to compute, outside of the lm function, an index that quantifies how well y is represented by a horizontal line?
Let lmObject be your linear model returned by lm (called with y = TRUE so that the response is stored in the fit).
If your model has an intercept, then R-squared is computed as
with(lmObject, 1 - c(crossprod(residuals) / crossprod(y - mean(y))) )
If your model does not have an intercept, then R-squared is computed as
with(lmObject, 1 - c(crossprod(residuals) / crossprod(y)) )
Note that if your model has only an intercept (so it certainly falls under the first case above), you have
residuals = y - mean(y)
thus R-squared is always 1 - 1 = 0.
In regression analysis it is recommended to include an intercept in the model to obtain unbiased estimates. A model with only an intercept is the NULL model; any other model is compared with this NULL model in the analysis of variance.
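As a quick check of the two formulas above against summary()$r.squared, using the X and Y from the question (this verification snippet is our own addition):
fit0 <- lm(Y ~ X + 0, data = data.frame(X = X, Y = Y), y = TRUE)  # no intercept
fit1 <- lm(Y ~ 1, data = data.frame(X = X, Y = Y), y = TRUE)      # intercept only
with(fit0, 1 - c(crossprod(residuals) / crossprod(y)))            # 0.9957567, as above
with(fit1, 1 - c(crossprod(residuals) / crossprod(y - mean(y))))  # 0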
A note. The value / quantity you want has nothing to do with regression. You can simply compute it as
c(crossprod(Y - mean(Y)) / crossprod(Y)) ## `Y` is your data
#[1] 0.1633663
Alternatively, use
(length(Y) - 1) * var(Y) / c(crossprod(Y))
#[1] 0.1633663

Error when using `deltamethod` {msm} for a fitted linear model: covariances should be an n x n matrix

I am using deltamethod from the msm package to derive the standard error of a transformed variable.
Example code:
require(msm)
x1 <- 1:10
x2 <- c(0,0,0,0,0,0,0,0,0,0)
y <- c(1,3,3,4,5,7,7,8,9,10)
m1 <- lm(y~x1+x2)
summary(m1)
deltamethod(~ (1-x1), coef(m1), vcov(m1))
The error I get is "Covariances should be a 3x3 matrix". The reason is that one variable does not have any variation (x2 is always zero) and therefore has an "NA" coefficient in the regression output.
Is there an easy fix for this? I know I could leave the variable out, but I am running more than 1,000 regressions, each with around 15 parameters to estimate, and the variables without variation (the NA ones) differ from regression to regression.
How about:
deltamethod(~(1-x1), na.omit(coef(m1)), vcov(m1))
# [1] 0.2949063
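Since the aliased (NA) variable differs from regression to regression, the same idea can be wrapped in a small helper so the coefficient vector always matches the dimensions of vcov(). The helper below is our own sketch; note that the x1, x2, ... in the formula refer to positions in the reduced (non-NA) coefficient vector:
se_delta <- function(fit, form) {
  # drop aliased coefficients so the lengths match vcov(fit)
  deltamethod(form, na.omit(coef(fit)), vcov(fit))
}
se_delta(m1, ~ (1 - x1))
# [1] 0.2949063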
