Quadprog to constrain coefficients of linear regression - r

I need some help. I need to fit a linear model (y ~ X1) with the following constraints:
Intercept + Beta1 <= 1; and both Intercept and Beta1 need to belong to [0, 1].
After looking into how to do this in R, I found that quadprog seems like the best option. However, I do not fully understand how to set those constraints. I took an example from a webpage (http://zoonek.free.fr/blosxom/R/2012-06-01_Optimization.html) and adapted it to my needs, but the sum of the coefficients is greater than 1 in some cases (by the way, I have to run this model on different data frames, and in some of them the constraint holds, but in others Intercept + Beta1 > 1).
My script is:
library(quadprog)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)  # x2 and x3 are referenced in y below (carried over from the webpage example)
x3 <- rnorm(n)
y  <- .3 * x1 + .2 * x2 + .5 * x3 + rnorm(n)
X  <- cbind(1, x1)
Mod <- solve.QP(t(X) %*% X, t(y) %*% X,
                cbind(matrix(1, nrow = 2, ncol = 1),  # column 1: Intercept + Beta1 (an equality, since meq = 1)
                      diag(2),                        # Intercept >= 0, Beta1 >= 0
                      -diag(2)),                      # -Intercept >= -1, -Beta1 >= -1 (i.e. both <= 1)
                c(1, 0, 0, -1, -1),
                meq = 1)
As I said before, my problem is that in some cases Intercept plus Beta1 ends up greater than 1, and I do not understand how to set that constraint in quadprog with that matrix format.
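For reference, here is a minimal sketch (not part of the original question) of how the <= constraint can be encoded. solve.QP expects constraints of the form t(Amat) %*% b >= bvec, with the first meq columns treated as equalities, so Intercept + Beta1 <= 1 can be written as -Intercept - Beta1 >= -1 and passed as an ordinary inequality with meq = 0:
# Sketch, assuming X and y from the script above: all five constraints as inequalities.
Amat <- cbind(c(-1, -1),   # -Intercept - Beta1 >= -1  (i.e. Intercept + Beta1 <= 1)
              diag(2),     #  Intercept >= 0, Beta1 >= 0
              -diag(2))    # -Intercept >= -1, -Beta1 >= -1 (i.e. both <= 1)
bvec <- c(-1, 0, 0, -1, -1)
Mod2 <- solve.QP(t(X) %*% X, t(y) %*% X, Amat, bvec, meq = 0)
Mod2$solution   # constrained estimates of (Intercept, Beta1)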

Related

Replace lm coefficients and calculate results of the new lm in R

I am able to change the coefficients of my linear model. Then I want to compare the results of my "new" model with the new coefficients, but R is not calculating the results with the new coefficients.
As you can see in my following example, the summaries of my models fit and fit1 are exactly the same, even though results like the multiple R-squared or the fitted values should change.
set.seed(2157010) #forgot set.
x1 <- 1998:2011
x2 <- x1 + rnorm(length(x1))
y <- 3*x2 + rnorm(length(x1)) #you had x, not x1 or x2
fit <- lm( y ~ x1 + x2)
# view original coefficients
coef(fit)
# generate second function for comparing results
fit1 <- fit
# replace coefficients with new values; use the full component name, which is coefficients:
fit1$coefficients[2:3] <- c(5, 1)
# view new coefficients
coef(fit1)
# Comparing
summary(fit)
summary(fit1)
Thanks in advance
It might be easier to compute the multiple R^2 yourself with the substituted parameters.
mult_r2 <- function(beta, y, X) {
  tot_ss <- var(y) * (length(y) - 1)  # total sum of squares
  rss <- sum((y - X %*% beta)^2)      # residual sum of squares
  1 - rss/tot_ss
}
(or, more compactly, following the comments, you could compute p <- X %*% beta; (cor(y, p))^2, though the squared correlation only matches 1 - rss/tot_ss for the least-squares coefficients themselves, not for arbitrary substituted values)
mult_r2(coef(fit), y = model.response(model.frame(fit)), X = model.matrix(fit))
## 0.9931179, matches summary()
Now with new coefficients:
new_coef <- coef(fit)
new_coef[2:3] <- c(5,1)
mult_r2(new_coef, y = model.response(model.frame(fit)), X = model.matrix(fit))
## [1] -343917
That last result seems pretty wild, but the substituted coefficients are very different from the true least-squares coeffs, and negative R^2 is possible when the model is bad enough ...
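If you also want fitted values and residuals under the substituted coefficients, they have to be recomputed by hand as well, since summary() keeps using the values stored in the fit; a minimal sketch along the same lines:
# Sketch: recompute fitted values and residuals with the substituted
# coefficients; summary(fit1) ignores edits made to fit1$coefficients.
X <- model.matrix(fit)
y_obs <- model.response(model.frame(fit))
fitted_new <- as.vector(X %*% new_coef)
resid_new <- y_obs - fitted_new
head(cbind(original = fitted(fit), substituted = fitted_new))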

Simulate data from regression model with exact parameters in R

How can I simulate data so that the coefficients recovered by lm are determined to be particular pre-determined values and have normally distributed residuals? For example, could I generate data so that lm(y ~ 1 + x) will yield (Intercept) = 1.500 and x = 4.000? I would like the solution to be versatile enough to work for multiple regression with continuous x (e.g., lm(y ~ 1 + x1 + x2)) but there are bonus points if it works for interactions as well (lm(y ~ 1 + x1 + x2 + x1*x2)). Also, it should work for small N (e.g., N < 200).
I know how to simulate random data which is generated by these parameters (see e.g. here), but that randomness carries over to variation in the estimated coefficients, e.g., Intercept = 1.488 and x = 4.067.
Related: It is possible to generate data that yields pre-determined correlation coefficients (see here and here). So I'm asking if this can be done for multiple regression?
One approach is to use perfectly symmetrical noise. The noise cancels itself out, so the estimated parameters are exactly the input parameters, yet the residuals appear normally distributed.
x <- 1:100
y <- cbind(1,x) %*% c(1.5, 4)
eps <- rnorm(100)
x <- c(x, x)
y <- c(y + eps, y - eps)
fit <- lm(y ~ x)
coef(fit)
# (Intercept)           x
#         1.5         4.0
plot(fit)
Residuals are normally distributed...
... but exhibit an abnormally perfect symmetry!
EDIT by OP: I wrote up general-purpose code exploiting the symmetrical-residuals trick. It scales well to more complex models. This example also shows that it works for categorical predictors and interaction effects.
library(dplyr)
# Data and residuals
df = tibble(
  # Predictors
  x1 = 1:100,                    # Continuous
  x2 = rep(c(0, 1), each = 50),  # Dummy-coded categorical
  # Generate y from model, including interaction term
  y_model = 1.5 + 4 * x1 - 2.1 * x2 + 8.76543 * x1 * x2,
  noise = rnorm(100)             # Residuals
)
# Do the symmetrical-residuals trick
# This is copy-and-paste ready, no matter model complexity.
df = bind_rows(
  df %>% mutate(y = y_model + noise),
  df %>% mutate(y = y_model - noise)  # Mirrored
)
# Check that it works
fit <- lm(y ~ x1 + x2 + x1*x2, df)
coef(fit)
# (Intercept)          x1          x2       x1:x2
#     1.50000     4.00000    -2.10000     8.76543
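As a quick sanity check (my addition, not part of the original edit), one can confirm that the recovered coefficients are exact and that the residuals still look normal apart from the enforced mirror symmetry:
# Sketch: the fitted coefficients are exact, and the residuals are just the
# mirrored noise, so a QQ-plot looks normal apart from the perfect symmetry.
all.equal(unname(coef(fit)), c(1.5, 4, -2.1, 8.76543))
qqnorm(residuals(fit)); qqline(residuals(fit))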
You could do rejection sampling:
set.seed(42)
tol <- 1e-8
x <- 1:100
continue <- TRUE
while (continue) {
  y <- cbind(1, x) %*% c(1.5, 4) + rnorm(length(x))
  if (sum((coef(lm(y ~ x)) - c(1.5, 4))^2) < tol) continue <- FALSE
}
coef(lm(y ~ x))
# (Intercept)           x
#    1.500013    4.000023
Obviously, this is a brute-force approach and the smaller the tolerance and the more complex the model, the longer this will take. A more efficient approach should be possible by providing residuals as input and then employing some matrix algebra to calculate y values. But that's more of a maths question ...
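For what it's worth, here is one way that matrix-algebra idea could look (my own sketch, not part of the original answer): project the noise onto the orthogonal complement of the design matrix, so that the least-squares fit recovers the target coefficients exactly while the residuals remain approximately normal.
# Sketch: noise that is orthogonal to the design columns leaves the OLS
# estimate at exactly the target coefficients (up to floating-point error).
set.seed(1)
x <- 1:100
X <- cbind(1, x)                      # design matrix
beta_target <- c(1.5, 4)
eps <- rnorm(length(x))
eps_perp <- residuals(lm(eps ~ x))    # project the X-directions out of the noise
y <- as.vector(X %*% beta_target) + eps_perp
coef(lm(y ~ x))                       # recovers 1.5 and 4 to numerical precision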

Generating iid variates in R

I am working on a question and it reads:
Generate 1000 iid variates of X = (x1, x2) with a specific definition that is given in an example. The definition is that x1 has a standard normal distribution, so N(0,1), while x2 is -x1 when -1 <= x1 <= 1 and is x1 otherwise.
I have used the following code to generate bivariate random variables before in R, but I do not know how to get X = (x1, x2) and then plot it.
library(gplots)
library(mvtnorm)  # rmvnorm comes from mvtnorm, not gplots
n <- 10^6
sigma <- matrix(c(4, 2, 2, 3), ncol = 2)
x <- rmvnorm(n, mean = c(1, 2), sigma = sigma)
h2d <- hist2d(x, show = FALSE, same.scale = TRUE, nbins = c(40, 40))
persp(h2d$x, h2d$y, h2d$counts/n, ticktype = "detailed", theta = 30, phi = 30,
      expand = 0.5, shade = 0.5, col = "cyan", ltheta = -30,
      main = "Bivariate Normal Distribution", zlab = "f(x1,x2)", xlab = "x1", ylab = "x2")
So I know the above is not correct but I am not sure if I can do something similar or I am going about this all wrong.
It's simpler than you think:
set.seed(1) # Setting a seed
X1 <- rnorm(1000) # Simulating X1
X2 <- ifelse(abs(X1) <= 1, -X1, X1) # If abs(X1) <= 1, set X2 = -X1; otherwise X2 = X1
Since the question is about normal marginals rather than a normal bivariate distribution, we may look at a bivariate density estimate:
library(MASS)
image(kde2d(X1,X2))
Clearly the shape is not an ellipsoid, so the bivariate distribution is not normal even though both marginals are normal.
It can also be seen analytically. Let Z = X1 + X2. If (X1, X2) were bivariate normal, then Z would also be normal. But P(Z = 0) >= P(|X1| <= 1) ~= 0.68, i.e., Z has positive mass at zero, which cannot happen for a continuous distribution.
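A quick numerical illustration of that argument (my own addition), using the X1 and X2 simulated above:
# Sketch: Z = X1 + X2 has an atom at zero with probability P(|X1| <= 1).
Z <- X1 + X2
mean(Z == 0)          # empirical mass at zero
pnorm(1) - pnorm(-1)  # theoretical value, about 0.6827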
You can use the base rnorm function to generate the normal distribution, and a simple ifelse to get x2, as shown below:
x1 <- rnorm(10, mean = 0, sd = 1)
x2 <- ifelse( ((x1 <= 1) & (x1 >= -1)), -x1, x1)
plot(x1, x2, type='p')
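One more check worth making (my own addition, not from either answer): the sign flip on |x1| <= 1 leaves the marginal distribution of x2 standard normal, by the symmetry of N(0,1).
# Sketch: x2 is still marginally N(0,1) even though (x1, x2) is not bivariate normal.
set.seed(1)
x1 <- rnorm(10000)
x2 <- ifelse(abs(x1) <= 1, -x1, x1)
ks.test(x2, "pnorm")   # should not reject standard normality
qqnorm(x2); qqline(x2)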

Linear regression with constraints on the coefficients

I am trying to perform a linear regression for a model like this:
Y = aX1 + bX2 + c
So, Y ~ X1 + X2
Suppose I have the following response vector:
set.seed(1)
Y <- runif(100, -1.0, 1.0)
And the following matrix of predictors:
X1 <- runif(100, 0.4, 1.0)
X2 <- sample(rep(0:1,each=50))
X <- cbind(X1, X2)
I want to use the following constraints on the coefficients:
a + c >= 0
c >= 0
So no constraint on b.
I know that the glmc package can be used to apply constraints, but I was not able to determine how to apply it for my constraints. I also know that contr.sum can be used so that all coefficients sum to 0, for example, but that is not what I want to do. solve.QP() seems like another possibility, where setting meq=0 can be used so that all coefficients are >=0 (again, not my goal here).
Note: The solution must be able to handle NA values in the response vector Y, for example with:
Y <- runif(100, -1.0, 1.0)
Y[c(2,5,17,56,37,56,34,78)] <- NA
solve.QP can be passed arbitrary linear constraints, so it can certainly be used to model your constraints a+c >= 0 and c >= 0.
First, we can add a column of 1's to X to capture the intercept term, and then we can replicate standard linear regression with solve.QP:
X2 <- cbind(X, 1)
library(quadprog)
solve.QP(t(X2) %*% X2, t(Y) %*% X2, matrix(0, 3, 0), c())$solution
# [1] 0.08614041 0.21433372 -0.13267403
With the sample data from the question, neither constraint is met using standard linear regression.
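(As a cross-check, not part of the original answer: the unconstrained solve.QP solution should coincide with an ordinary lm fit, just with the intercept reported first by lm instead of last.)
# Sketch: unconstrained quadratic programming reproduces ordinary least squares;
# expect the same three values as above, with the intercept listed first.
coef(lm(Y ~ X))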
By modifying both the Amat and bvec parameters, we can add our two constraints:
solve.QP(t(X2) %*% X2, t(Y) %*% X2, cbind(c(1, 0, 1), c(0, 0, 1)), c(0, 0))$solution
# [1] 0.0000000 0.1422207 0.0000000
Subject to these constraints, the squared residuals are minimized by setting the a and c coefficients to both equal 0.
You can handle missing values in Y or X2 just as the lm function does, by removing the offending observations. You might do something like the following as a pre-processing step:
has.missing <- rowSums(is.na(cbind(Y, X2))) > 0
Y <- Y[!has.missing]
X2 <- X2[!has.missing,]
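After that pre-processing, the constrained fit can be re-run unchanged on the cleaned data (a sketch, assuming the filtered Y and X2 from the step above):
# Sketch: same constrained fit as before, now on the rows without missing values.
solve.QP(t(X2) %*% X2, t(Y) %*% X2,
         cbind(c(1, 0, 1), c(0, 0, 1)),  # a + c >= 0 ; c >= 0
         c(0, 0))$solution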

Fit 'nls': singular gradient matrix at initial parameter estimates

I'm new to using nls and I'm having trouble finding the starting parameters. I've read several posts and tried various parameters and formula constructions, but I keep getting errors.
This is a small example of what I'm doing, and I'd very much appreciate it if anyone could give me some tips!
# Data to which I want to fit a non-linear function
x <- c(0, 4, 13, 30, 63, 92)
y <- c(0.00000000, 0.00508822, 0.01103990, 0.02115466, 0.04036655, 0.05865331)
z <- 0.98
# STEPS:
# 1 pool, z fixed. This works.
fit <- nls(y ~ z * ((1 - exp(-k1*x))),
           start = list(k1 = 0))
# 2 pool model, z fixed
fit2 <- nls(y ~ z * (1 - exp(-k1*x)) + (1 - exp(-k2*x)),
            start = list(k1 = 0, k2 = 0))  # Error: singular gradient matrix at initial parameter estimates
# My goal: 2 pool model, z free
fit3 <- nls(y ~ z * (1 - exp(-k1*x)) + (1 - exp(-k2*x)),
            start = list(z = 0.5, k1 = 0, k2 = 0))
It has been a while since you asked the question but maybe you are still interested in some comments:
At least your fit2 works fine when one varies the starting parameters (see code and plots below). I guess that fit3 is then just a "too complicated" model given these data, which basically follow a linear trend. That implies that two parameters are usually sufficient to describe the data reasonably well (see the second plot).
So as a general hint: When you obtain
singular gradient matrix at initial parameter estimates
you can
1) vary the starting values/your initial parameter estimates
and/or
2) try to simplify your model by looking for redundant parameters, which usually cause trouble.
I also highly recommend always plotting the data first, together with your initial guesses (check also this question).
Here is a plot showing the outcome for your fit, fit2 and a third function defined by me which is given in the code below:
As you can see, there is almost no difference between your fit2 and the function with a variable z and a single exponential. Two parameters seem pretty much enough to describe the system reasonably well (even one is already quite well represented by the black line in the plot above). If you then want to fit a line through a certain data point, you can also check out this answer.
So what does it look like when one uses a linear function with two free parameters, and a function with a variable z, one exponential term and a variable offset? That is shown in the following plot; again there is not much of a difference:
How do the residuals compare?
> fit
Nonlinear regression model
model: y ~ zfix * ((1 - exp(-k1 * x)))
data: parent.frame()
k1
0.0006775
residual sum-of-squares: 1.464e-05
> fit2
Nonlinear regression model
model: y ~ zfix * (1 - exp(-k1 * x)) + (1 - exp(-k2 * x))
data: parent.frame()
k1 k2
-0.0006767 0.0014014
residual sum-of-squares: 9.881e-06
> fit3
Nonlinear regression model
model: y ~ Z * (1 - exp(-k1 * x))
data: parent.frame()
Z k1
0.196195 0.003806
residual sum-of-squares: 9.59e-06
> fit4
Nonlinear regression model
model: y ~ a * x + b
data: parent.frame()
a b
0.0006176 0.0019234
residual sum-of-squares: 6.084e-06
> fit5
Nonlinear regression model
model: y ~ z * (1 - exp(-k1 * x)) + k2
data: parent.frame()
z k1 k2
0.395106 0.001685 0.001519
residual sum-of-squares: 5.143e-06
As one could guess, the fit with only one free parameter gives the worst result, while the one with three free parameters gives the best; however, there is not much of a difference (in my opinion).
Here is the code I used:
x <- c(0, 4, 13, 30, 63, 92)
y <- c(0.00000000, 0.00508822, 0.01103990, 0.02115466, 0.04036655, 0.05865331)
zfix <- 0.98
plot(x,y)
# STEPS:
# 1 pool, z fixed. This works.
fit <- nls(y ~ zfix * ((1 - exp(-k1*x))), start=list(k1=0))
xr = data.frame(x = seq(min(x),max(x),len=200))
lines(xr$x,predict(fit,newdata=xr))
# 2 pool model, z fixed
fit2 <- nls(y ~ zfix * (1 - exp(-k1*x)) + (1 - exp(-k2*x)), start=list(k1=0, k2=0.5))
lines(xr$x,predict(fit2,newdata=xr), col='red')
# 3 z variable
fit3 <- nls(y ~ Z * (1 - exp(-k1*x)), start=list(Z=zfix, k1=0.2))
lines(xr$x,predict(fit3,newdata=xr), col='blue')
legend('topleft', c('fixed z, single exp', 'fixed z, two exp', 'variable z, single exp'),
       lty = c(1, 1, 1),
       lwd = c(2.5, 2.5, 2.5),
       col = c('black', 'red', 'blue'))
#dev.new()
plot(x,y)
# 4 fit linear function a*x + b
fit4 <- nls(y ~ a *x + b, start=list(a=1, b=0.))
lines(xr$x,predict(fit4,newdata=xr), col='blue')
fit5 <- nls(y ~ z * (1 - exp(-k1*x)) + k2, start=list(z=zfix, k1=0.1, k2=0.5))
lines(xr$x,predict(fit5,newdata=xr), col='red')
legend('topleft', c('linear approach', 'variable z, single exp, offset'),
       lty = c(1, 1),
       lwd = c(2.5, 2.5),
       col = c('blue', 'red'))
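A further option (my own suggestion, not from the original answer) is to sidestep the starting-value problem for the variable-z, single-exponential form with one of R's self-starting models; the sketch below assumes the x and y defined above and may still struggle when the asymptote is poorly determined by the data.
# Sketch: SSasympOrig fits y = Asym * (1 - exp(-exp(lrc) * x)), i.e. the
# variable-z, single-exponential model, and chooses its own starting values.
fit_ss <- nls(y ~ SSasympOrig(x, Asym, lrc))
coef(fit_ss)  # Asym plays the role of z; exp(lrc) corresponds to k1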
