Linear regression with constraints on the coefficients in R

I am trying to perform linear regression, for a model like this:
Y = aX1 + bX2 + c
So, Y ~ X1 + X2
Suppose I have the following response vector:
set.seed(1)
Y <- runif(100, -1.0, 1.0)
And the following matrix of predictors:
X1 <- runif(100, 0.4, 1.0)
X2 <- sample(rep(0:1,each=50))
X <- cbind(X1, X2)
I want to use the following constraints on the coefficients:
a + c >= 0
c >= 0
So no constraint on b.
I know that the glmc package can be used to apply constraints, but I was not able to work out how to express my constraints with it. I also know that contr.sum can be used to make all coefficients sum to 0, for example, but that is not what I want to do. solve.QP() seems like another possibility; for instance, with an identity constraint matrix and meq = 0 it can force all coefficients to be >= 0 (again, not my goal here).
Note: The solution must be able to handle NA values in the response vector Y, for example with:
Y <- runif(100, -1.0, 1.0)
Y[c(2,5,17,56,37,56,34,78)] <- NA

solve.QP can be passed arbitrary linear constraints, so it can certainly be used to model your constraints a+c >= 0 and c >= 0.
First, we can add a column of 1's to X to capture the intercept term; I'll call the augmented design matrix Xmat (rather than reusing the name X2, which is already one of the predictors). With no constraints, solve.QP then replicates standard linear regression:
Xmat <- cbind(X, 1)
library(quadprog)
solve.QP(t(Xmat) %*% Xmat, t(Y) %*% Xmat, matrix(0, 3, 0), c())$solution
# [1] 0.08614041 0.21433372 -0.13267403
With the sample data from the question, neither constraint is met using standard linear regression.
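As a quick sanity check (not part of the question; it assumes the Y, X1, and X2 simulated above), an unconstrained lm fit should reproduce exactly these coefficients, since solve.QP with D = X'X and d = X'Y is just ordinary least squares:
# Unconstrained OLS for comparison; the coefficients should match the
# solve.QP output above, with the intercept listed first instead of last.
coef(lm(Y ~ X1 + X2))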
By modifying both the Amat and bvec arguments, we can add our two constraints; each column of Amat encodes one constraint, the first for a + c >= 0 and the second for c >= 0:
solve.QP(t(Xmat) %*% Xmat, t(Y) %*% Xmat, cbind(c(1, 0, 1), c(0, 0, 1)), c(0, 0))$solution
# [1] 0.0000000 0.1422207 0.0000000
Subject to these constraints, the squared residuals are minimized by setting the a and c coefficients to both equal 0.
You can handle missing values in Y or Xmat just as the lm function does, by removing the offending observations. You might do something like the following as a pre-processing step:
has.missing <- rowSums(is.na(cbind(Y, Xmat))) > 0
Y <- Y[!has.missing]
Xmat <- Xmat[!has.missing,]
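After that pre-processing, the constrained fit is the same call as before, just on the reduced data (a sketch assuming the objects defined above):
# Refit with the same two constraints after dropping incomplete observations.
solve.QP(t(Xmat) %*% Xmat, t(Y) %*% Xmat,
         cbind(c(1, 0, 1), c(0, 0, 1)), c(0, 0))$solution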

Related

Quadprog to constrain coefficients of linear regression

I need some help. I need to fit a linear model (y ~ X1) with the following constraints:
Intercept + Beta1 <= 1; and both Intercept and Beta1 need to belong to [0,1]
After looking into how to do this in R, I found that quadprog seems like the best option. However, I do not fully understand how to set those constraints. I took an example from a webpage (http://zoonek.free.fr/blosxom/R/2012-06-01_Optimization.html) and adapted it to my needs, but the sum of the coefficients is greater than 1 in some cases (by the way, I have to run this model on different data frames, and in some of them the constraint holds, but in others Intercept + Beta1 > 1).
My script is:
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)  # x2 and x3 were not defined in the original snippet; standard normals assumed here
x3 <- rnorm(n)
y <- .3 * x1 + .2 * x2 + .5 * x3 + rnorm(n)
X <- cbind(1, x1)
library(quadprog)
Mod <- solve.QP(t(X) %*% X, t(y) %*% X,
                cbind(matrix(1, nrow = 2, ncol = 1),
                      diag(2),
                      -diag(2)),
                c(1, 0, 0, -1, -1),
                meq = 1)
As I said above, my problem is that in some cases the intercept plus Beta1 ends up greater than 1, and I do not understand how to set that constraint in quadprog with that matrix layout.
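One possible way to encode all three constraints (a sketch, not from this thread; it assumes the X and y defined above): each column of Amat is a constraint of the form t(a) %*% b >= b0, so Intercept + Beta1 <= 1 becomes -(Intercept + Beta1) >= -1, the lower bounds give b0 >= 0 and b1 >= 0, the upper bounds give -b0 >= -1 and -b1 >= -1, and meq = 0 means nothing is forced to hold with equality:
# Sketch: all five constraints as inequalities (meq = 0).
# Columns of Amat:  -(b0 + b1) >= -1,  b0 >= 0,  b1 >= 0,  -b0 >= -1,  -b1 >= -1
Amat <- cbind(c(-1, -1), diag(2), -diag(2))
bvec <- c(-1, 0, 0, -1, -1)
fit  <- solve.QP(t(X) %*% X, t(y) %*% X, Amat, bvec, meq = 0)
fit$solution  # constrained intercept and slope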

3-variable linear model in R

I want to get coefficients for a linear model related to synergism/antagonism between different chemicals.
Chemicals X, Y, Z. Coefficients b0...b7.
0 = b0 + b1x + b2y + b3z + b4xy + b5xz + b6yz + b7xyz
Some combination of X, Y, Z will kill 50% of cells (b0 represents this total effectiveness), and the sign/magnitude of the higher order terms represents interactions between chemicals.
Given real datapoints, I want to fit this model.
EDIT: I've gotten rid of the trivial solution by adding a forcing value at the start. Test data:
x1 <- c(0,1,2,3,4)
y1 <- c(0,2,1,5,4)
z1 <- c(0,1,-0.66667,-6,-7.25)
q <- c(-1,0,0,0,0)
model <- lm(q ~ x1*y1*z1)
This set has coefficients: -30, 12, 6, 4, 1, -1, 0, 0.5
EDIT: Progress made from before, will try some more data points. The first four coefficients look good (multiply by 30):
Coefficients:
(Intercept) x1 y1 z1 x1:y1 x1:z1 y1:z1 x1:y1:z1
-1.00000 0.47826 0.24943 0.13730 -0.05721 NA NA NA
EDIT: Adding more data points hasn't been successful so far, not sure if I need to have a certain minimum amount to be accurate.
Am I setting things up correctly? Once I have coefficients, I want to solve for z so that I can plot a 3D surface. Thanks!
I was able to get the coefficients using just 16 arbitrary data points, and appending a point to exclude the trivial answer:
x1 <- c(0,1,2,3,4,1,2,3,4,1,2,3,4,5,6,7,8)
y1 <- c(0,2,1,5,4,3,7,5,8,6,2,1,5,5,3,5,7)
z1 <- c(0,1,-0.66667,-6,-7.25,-0.66667,-5.55556,-6,-6.125,-4,-2.5,-6,-6.8,-7.3913,-11.1429,-8.2069,-6.83333)
q <- c(-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
model <- lm(q ~ x1*y1*z1)
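To solve for z and plot a surface (a sketch, not part of the original answer; it assumes the fitted model above), note that every term of 0 = b0 + b1*x + b2*y + b3*z + b4*xy + b5*xz + b6*yz + b7*xyz containing z is linear in z, so z = -(b0 + b1*x + b2*y + b4*x*y) / (b3 + b5*x + b6*y + b7*x*y):
# Sketch: rearrange the fitted surface equation for z and plot it.
b <- coef(model)   # order: (Intercept), x1, y1, z1, x1:y1, x1:z1, y1:z1, x1:y1:z1
b[is.na(b)] <- 0   # lm reports NA for terms it could not estimate
zsurf <- function(x, y) {
  -(b[1] + b[2]*x + b[3]*y + b[5]*x*y) / (b[4] + b[6]*x + b[7]*y + b[8]*x*y)
}
xg <- seq(0, 8, length.out = 30)
yg <- seq(0, 8, length.out = 30)
persp(xg, yg, outer(xg, yg, zsurf), theta = 30, phi = 30,
      xlab = "x", ylab = "y", zlab = "z")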

System is computationally singular due to small numbers in linearHypothesis

OK, so here is the code that demonstrates the problem I am referring to (hccm and linearHypothesis come from the car package):
library(car)  # provides hccm() and linearHypothesis()
x1 <- c(0.001, 0.002, 0.003, 0.0003)
x2 <- c(15000893, 23034340, 3034300, 232332242)
x3 <- c(1,3,5,6)
y <- rnorm( 4 )
model=lm( y ~ x1 + x2 )
model2=lm( y ~ x1 + x3 )
type <- "hc0"
V <- hccm(model, type=type)
sumry <- summary(model)
table <- coef(sumry)
table[,2] <- sqrt(diag(V))
table[,3] <- table[,1]/table[,2]
table[,4] <- 2*pt(abs(table[,3]), df.residual(model), lower.tail=FALSE)
sumry$coefficients <- table
p <- nrow(table)
hyp <- cbind(0, diag(p - 1))
linearHypothesis(model, hyp, white.adjust=type)
Note that this is not caused by perfect multicollinearity.
As you can see, I deliberately set the value of x2 to be very large and the value of x1 to be very small. When this happens, I cannot perform a linearHypothesis test of model=lm( y ~ x1 + x2 ) on all coefficients being 0: linearHypothesis(model, hyp, white.adjust=type). R will throw the following error:
> linearHypothesis(model, hyp, white.adjust=type)
Error in solve.default(vcov.hyp) :
system is computationally singular: reciprocal condition number = 2.31795e-23
However, when I use model2=lm( y ~ x1 + x3 ) instead, whose x3 is not too large compared to x1, the linearHypothesis test succeeds:
> linearHypothesis(model2, hyp, white.adjust=type)
Linear hypothesis test
Hypothesis:
x1 = 0
x3 = 0
Model 1: restricted model
Model 2: y ~ x1 + x3
Note: Coefficient covariance matrix supplied.
  Res.Df Df      F Pr(>F)
1      3
2      1  2 11.596 0.2033
I am aware that this might be caused by the fact that R refuses to invert a matrix whose reciprocal condition number falls below a tolerance, in this case 2.31795e-23. However, is there a way to circumvent that? Is this a limitation in R or in the underlying numerical routines?
What is the good practice here? The only method I can think of is to rescale the variables so that they are on the same scale, but I am concerned about how much information I will lose by dividing everything by its standard deviation.
In fact, I have 200 variables that are percentages and 10 variables (including the dependent variable) that are large (potentially on the 10^6 scale). It would be troublesome to scale them one by one.
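One hedged possibility (not from this thread; it assumes the model and hyp objects above) is simply to rescale the very large predictors before fitting. Testing whether a coefficient is zero is unaffected by the units of that predictor, so the F-test result should not change, but the coefficient covariance matrix becomes far better conditioned:
# Sketch: bring x2 down to a magnitude comparable with x1 before fitting.
x2s    <- x2 / 1e6                       # or: as.numeric(scale(x2))
model3 <- lm(y ~ x1 + x2s)
linearHypothesis(model3, cbind(0, diag(2)), white.adjust = "hc0")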

Generating iid variates in R

I am working on a question and it reads:
Generate 1000 iid variates of X = (x1, x2) with a specific definition that is given in an example. The definition is that x1 follows a standard normal distribution, N(0,1), while x2 equals -x1 when -1 <= x1 <= 1 and equals x1 otherwise.
I have used the following code to generate bi variate random variables before in R but I do not know how to to get X=(x1, x2) and then plot.
library(gplots)   # hist2d
library(mvtnorm)  # rmvnorm
n <- 10^6
sigma <- matrix(c(4, 2, 2, 3), ncol = 2)
x <- rmvnorm(n, mean = c(1, 2), sigma = sigma)
h2d <- hist2d(x, show = FALSE, same.scale = TRUE, nbins = c(40, 40))
persp(h2d$x, h2d$y, h2d$counts/n, ticktype = "detailed", theta = 30, phi = 30,
      expand = 0.5, shade = 0.5, col = "cyan", ltheta = -30,
      main = "Bivariate Normal Distribution", zlab = "f(x1,x2)",
      xlab = "x1", ylab = "x2")
So I know the above is not correct but I am not sure if I can do something similar or I am going about this all wrong.
It's simpler than you think:
set.seed(1)                          # set a seed for reproducibility
X1 <- rnorm(1000)                    # simulate X1 ~ N(0, 1)
X2 <- ifelse(abs(X1) <= 1, -X1, X1)  # X2 = -X1 if |X1| <= 1, X2 = X1 otherwise
Since the point of the question is that the marginals are normal while the joint distribution is not bivariate normal, we may look at a bivariate density estimate:
library(MASS)
image(kde2d(X1,X2))
Clearly the shape is not an ellipsoid, so the bivariate distribution is not normal even though both marginals are normal.
It can also be seen analytically. Let Z = X1 + X2. If (X1, X2) were bivariate normal, then Z would also be normal. But P(Z = 0) >= P(|X1| <= 1) ~= 0.68, i.e., Z has positive mass at zero, which cannot happen for a continuous distribution.
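That mass at zero can also be checked empirically (a quick sketch using the X1 and X2 simulated above):
# Proportion of simulated pairs with X1 + X2 exactly zero;
# should be close to P(|X1| <= 1), about 0.68.
mean(X1 + X2 == 0)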
You can use the base function rnorm to generate the normal draws, and a simple ifelse to get x2, as shown below:
x1 <- rnorm(10, mean = 0, sd = 1)
x2 <- ifelse( ((x1 <= 1) & (x1 >= -1)), -x1, x1)
plot(x1, x2, type='p')

Adding error variance to output of predict()

I am attempting to take a linear model fitted to empirical data, e.g.:
set.seed(1)
x <- seq(from = 0, to = 1, by = .01)
y <- x + .25*rnorm(101)
model <- (lm(y ~ x))
summary(model)
# R^2 is .6208
Now, what I would like to do is use the predict function (or something similar) to create, from x, a vector of predicted values that shares the error structure of the original relationship between x and y. Using predict alone gives perfectly fitted values, so R^2 is 1, e.g.:
y2 <- predict(model)
summary(lm(y2 ~ x))
# R^2 is 1
I know that I can use predict(model, se.fit = TRUE) to get the standard errors of the prediction, but I haven't found an option to incorporate those into the prediction itself, nor do I know exactly how to incorporate these standard errors into the predicted values to give the correct amount of error.
Hopefully someone here can point me in the right direction!
How about simulate(model) ?
set.seed(1)
x <- seq(from = 0, to = 1, by = .01)
y <- x + .25*rnorm(101)
model <- (lm(y ~ x))
y2 <- predict(model)  # fitted values only, no noise
y3 <- simulate(model) # one simulated response that includes residual error
matplot(x, cbind(y, y2, y3), pch = 1, col = 1:3)
If you need to do it by hand, you could use
y4 <- rnorm(nobs(model), mean = predict(model),
            sd = summary(model)$sigma)
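As a rough check (a sketch, using the objects above), regressing the hand-rolled y4 on x should give an R^2 in the neighbourhood of the original 0.62 rather than 1:
summary(lm(y4 ~ x))$r.squared  # roughly comparable to the original fit, not 1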
