I'm trying to use linear regression to figure out the best weighting for 3 models to predict an outcome. So there are 3 variables (x1, x2, x3) that are the predictions of the dependent variable, y. My question is: how do I run a regression with the constraint that the coefficients sum to 1? For example:
this is good:
y = .2(x1) + .4(x2) + .4(x3)
since .2 + .4 + .4 = 1
this is no good:
y = 1.2(x1) + .4(x2) + .3(x3)
since 1.2 + .4 + .3 > 1
I'm looking to do this in R if possible. Thanks. Let me know if this needs to get moved to the stats area ('Cross-Validated').
EDIT:
The problem is to classify each row as 1 or 0. y is the actual value (0 or 1) from the training set, x1 is the predicted value from a kNN model, x2 is from a randomForest, x3 is from a gbm model. I'm trying to get the best weightings for each model, so each coefficient is <= 1 and the coefficients sum to 1.
Would look something like this:
y/Actual value knnPred RfPred gbmPred
0 .1111 .0546 .03325
1 .7778 .6245 .60985
0 .3354 .1293 .33255
0 .2235 .9987 .10393
1 .9888 .6753 .88933
... ... ... ...
The measure for success is AUC. So I'm trying to set the coefficients to maximize AUC while making sure they sum to 1.
There's very likely a better way that someone else will share, but you're looking for two parameters such that
b1 * x1 + b2 * x2 + (1 - b1 - b2) * x3
is close to y. To do that, I'd write an error function to minimize
minimizeMe <- function(b, x, y) { ## calculates MSE
  mean((b[1] * x[, 1] + b[2] * x[, 2] + (1 - sum(b)) * x[, 3] - y) ^ 2)
}
and throw it to optim
fit <- optim(par = c(.2, .4), fn = minimizeMe, x = cbind(x1, x2, x3), y = y)
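The third weight is then implied: 1 - sum(fit$par). And since the stated measure of success is AUC rather than MSE, the same optim scaffolding can carry a different objective. A sketch, assuming the pROC package (not part of the original answer); note that AUC is piecewise constant in the weights, so the default Nelder-Mead search can stall on flat regions:
library(pROC)
negAUC <- function(b, x, y) { # negated because optim minimizes
  pred <- b[1] * x[, 1] + b[2] * x[, 2] + (1 - sum(b)) * x[, 3]
  -as.numeric(auc(y, pred))
}
fitAUC <- optim(par = c(1/3, 1/3), fn = negAUC, x = cbind(x1, x2, x3), y = y)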
There's no data to test on, but:
mod1 <- lm(y ~ 0+x1+x2+x3, data=dat)
mod2 <- lm(y/I(sum(coef(mod1))) ~ 0+x1+x2+x3, data=dat)
And now that I think about it some more, skip mod2, just:
coef(mod1)/sum(coef(mod1))
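The rescaled weights can then be applied directly to form the blended prediction. A minimal sketch, assuming dat holds the columns x1, x2, x3 (note the rescaling is a heuristic: it does not minimize MSE under the constraint, and nothing keeps individual weights nonnegative):
w <- coef(mod1) / sum(coef(mod1)) # weights now sum to 1
blend <- as.matrix(dat[, c("x1", "x2", "x3")]) %*% w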
For the five rows shown, either round(knnPred) or round(gbmPred) gives perfect predictions, so there is some question whether more than one predictor is needed.
At any rate, to solve the question as stated, the following will give nonnegative coefficients that sum to 1 (except possibly for tiny differences due to computer arithmetic). a is the matrix of independent variables and b is the dependent variable. c and d define the equality constraint (coefficients sum to 1) and e and f define the inequality constraints (coefficients are nonnegative).
library(lsei)
a <- cbind(x1, x2, x3)
b <- y
c <- matrix(c(1, 1, 1), 1)
d <- 1
e <- diag(3)
f <- c(0, 0, 0)
lsei(a, b, c, d, e, f)
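A quick sanity check on the result, assuming lsei() returns the solution vector as in the call above:
w <- lsei(a, b, c, d, e, f)
sum(w)           # should be 1, up to floating-point error
all(w >= -1e-12) # nonnegativity, up to the same tolerance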
I'd like to assign a numerical value to a character element of a list, but to an unquoted version of the element so that I can use it in a model formula.
Suppose I have a fully specified model formula that I'll use in, say, the nls function:
m.form <- y ~ b0 + b1*x1 + b2*x2
(I know my example is linear, but that doesn't matter for this). I also have a list of the parameter names and some starting values for each parameter:
params <- c("b0","b1","b2")
startvals <- list(b0=1, b1=1, b2=-1)
I then want to assign a value to a parameter in params so I can estimate a restricted version of the model, let's say forcing b1 == 0. Of course, I want to do this by referring to the parameter in the vector params (because I'm going to loop over a model with more variables and parameters, estimating the model with a given restriction in each loop iteration).
So I want to do something like this:
params[2] <- 0
summary(nls(m.form, data, startvals[-2]))
where I'm trying to replace the parameter name in the formula with a numerical 0 and then delete the starting value for that parameter from startvals, since that parameter no longer appears in the model (very likely not the best way to do this!). The above doesn't work, but if instead of the params[2] <- 0 line I use b1 <- 0, it does work as intended. But I'll be looping through all the parameters in the model, so I don't want to write out the actual parameter name each time. Thanks.
Edit 1
So to be clearer, I need to be able to impose the restriction by referring to the element of the params vector because I'm ultimately going to loop through, each time estimating the model with a different restriction. So, e.g. maybe in the first loop iteration I impose params[2]=0, but in the next, maybe it's params[3]=0.5.
1) It can be done without rewriting the formula by defining the value and removing it from startvals. No packages are used.
set.seed(123)
DF <- data.frame(y = rnorm(25), x1 = rnorm(25), x2 = rnorm(25))
m.form <- y ~ b0 + b1*x1 + b2*x2
startvals <- list(b0=1, b1=1, b2=-1)
b1 <- 0 # nls will find this value in the environment of the formula
nls(m.form, DF, start = startvals[-2])
giving:
Nonlinear regression model
model: y ~ b0 + b1 * x1 + b2 * x2
data: DF
b0 b2
-0.03457 0.12139
residual sum-of-squares: 21.18
Number of iterations to convergence: 1
Achieved convergence tolerance: 3.722e-09
2) Or if you want to substitute b1 = 0 into the formula anyway, then
m.form0 <- do.call("substitute", list(m.form, list(b1 = 0)))
nls(m.form0, DF, start = startvals[-2])
giving:
Nonlinear regression model
model: y ~ b0 + 0 * x1 + b2 * x2
data: DF
b0 b2
-0.03457 0.12139
residual sum-of-squares: 21.18
Number of iterations to convergence: 1
Achieved convergence tolerance: 3.722e-09
Added
If you want to specify these in terms of ix, a non-empty vector of parameter index numbers, and vals, an equal-length vector of constraint values, then:
set.seed(123)
DF <- data.frame(y = rnorm(25), x1 = rnorm(25), x2 = rnorm(25))
m.form <- y ~ b0 + b1*x1 + b2*x2
params <- c("b0", "b1", "b2")
startvals <- list(b0 = 1, b1 = 1, b2 = -1)
ix <- 2
vals <- 0
L <- setNames(list(vals), params[ix])
# 1
list2env(L, environment(m.form)) # add constraints to formula's envir
nls(m.form, DF, start = startvals[-ix])
## Nonlinear regression model
## model: y ~ b0 + b1 * x1 + b2 * x2
## ...snip...
# 2
m.form0 <- do.call("substitute", list(m.form, L))
nls(m.form0, DF, start = startvals[-ix])
## Nonlinear regression model
## model: y ~ b0 + 0 * x1 + b2 * x2
## ...snip...
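Since the original goal was to loop over restrictions, either form drops into a loop directly. A sketch using the substitute() approach, restricting one parameter to 0 per iteration:
for (ix in seq_along(params)) {
  L <- setNames(list(0), params[ix]) # e.g. b0 = 0, then b1 = 0, then b2 = 0
  fit.ix <- nls(do.call("substitute", list(m.form, L)), DF, start = startvals[-ix])
  print(coef(fit.ix))
}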
You could write a function that does the replacement:
m.form <- y ~ b0 + b1*x1 + b2*x2
restrict <- function(form, restrictions){
  # use word boundaries so that e.g. b1 does not also match b10
  restrictions <- setNames(as.character(restrictions),
                           paste0("\\b", names(restrictions), "\\b"))
  # collapse deparse() output, which can split long formulas across lines
  form <- stringr::str_replace_all(paste(deparse(form), collapse = " "), restrictions)
  as.formula(form)
}
params <- c("b0","b1","b2")
startvals <- list(b0=1, b1=1, b2=-1)
summary(nls(restrict(m.form, c(b1 = 0)), data, startvals[-2]))
You can restrict more than one parameter:
summary(nls(restrict(m.form, c(b1 = 0, b0 = 1)), data, startvals[3]))
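For example, reusing the DF simulated in the previous answer:
nls(restrict(m.form, c(b1 = 0)), DF, start = startvals[-2])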
I have existing code for fitting a sigmoid curve to data in R. How can I use selfStart (or another method) to automatically find starting values for the regression?
sigmoid = function(params, x) {
  params[1] / (1 + exp(-params[2] * (x - params[3])))
}
dataset = data.frame("x" = 1:53, "y" =c(0,0,0,0,0,0,0,0,0,0,0,0,0,0.1,0.18,0.18,0.18,0.33,0.33,0.33,0.33,0.41,0.41,0.41,0.41,0.41,0.41,0.5,0.5,0.5,0.5,0.68,0.58,0.58,0.68,0.83,0.83,0.83,0.74,0.74,0.74,0.83,0.83,0.9,0.9,0.9,1,1,1,1,1,1,1) )
x = dataset$x
y = dataset$y
# fitting code
fitmodel <- nls(y~a/(1 + exp(-b * (x-c))), start=list(a=1,b=.5,c=25))
# visualization code
# get the coefficients using the coef function
params=coef(fitmodel)
y2 <- sigmoid(params,x)
plot(y2,type="l")
points(y)
This is a common (and interesting) problem in non-linear curve fitting.
Background
We can find sensible starting values if we take a closer look at the function sigmoid, f(x) = a / (1 + exp(-b * (x - c))).
We first note that exp(-b * (x - c)) -> 0 as x -> Inf.
So for large values of x, the function approaches a. In other words, as a starting value for a we may choose the value of y for the largest value of x.
In R, this translates to y[which.max(x)].
Now that we have a starting value for a, we need to decide on starting values for b and c. To do that, we can make use of the geometric series
1 / (1 + z) = 1 - z + z^2 - ...
and expand f(x) = y by keeping only the first two terms:
y ≈ a * (1 - exp(-b * (x - c)))
We now set a = 1 (our starting value for a), re-arrange the equation and take the logarithm on both sides:
log(1 - y) ≈ -b * x + b * c
We can now fit a linear model of the form log(1 - y) ~ x to obtain estimates for the slope (-b) and offset (b * c), which in turn provide the starting values for b and c.
R implementation
Let's define a function that takes as arguments the values x and y and returns a list of parameter starting values:
start_val_sigmoid <- function(x, y) {
  # the small offset inside log() avoids log(0) where y equals its maximum
  fit <- lm(log(y[which.max(x)] - y + 1e-6) ~ x)
  list(
    a = y[which.max(x)],
    b = unname(-coef(fit)[2]),
    c = unname(-coef(fit)[1] / coef(fit)[2]))
}
Based on the data for x and y you give, we obtain the following starting values
start_val_sigmoid(x, y)
#$a
#[1] 1
#
#$b
#[1] 0.2027444
#
#$c
#[1] 15.01613
Since start_val_sigmoid returns a list we can use its output directly as the start argument in nls
nls(y ~ a / ( 1 + exp(-b * (x - c))), start = start_val_sigmoid(x, y))
#Nonlinear regression model
# model: y ~ a/(1 + exp(-b * (x - c)))
# data: parent.frame()
# a b c
# 1.0395 0.1254 29.1725
# residual sum-of-squares: 0.2119
#
#Number of iterations to convergence: 9
#Achieved convergence tolerance: 9.373e-06
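As an aside on the selfStart part of the question: base R ships a self-starting logistic model, SSlogis, which fits the same curve in the parameterization Asym / (1 + exp((xmid - x) / scal)) (so Asym = a, xmid = c and scal = 1/b) and computes its own starting values:
nls(y ~ SSlogis(x, Asym, xmid, scal), data = dataset)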
I am trying to set up a simple OLS model with constraints on the coefficients in R. The code below works; however, it imposes
y = c + a1*x1 + a2*x2 + a3*x3 with the constraint a1 + a2 = 1
I would like to revise this constraint to:
a1*a2 - a3 = 0
Thanks for your help!
WORKING CODE:
set.seed(1000)
n <- 20
x1 <- seq(100,length.out=n)+rnorm(n,0,2)
x2 <- seq(50,length.out=n)+rnorm(n,0,2)
x3 <- seq(10,length.out=n)+rnorm(n,0,2)
constant <- 100
ymat <- constant + .5*x1 + .5*x2 + .75*x3 + rnorm(n,0,4)
xmat <- cbind(x1,x2,x3)
X <- cbind(rep(1,n),xmat) # explicitly include vector for constant
bh <- solve(t(X)%*%X)%*%t(X)%*%ymat
XX <- solve(t(X)%*%X)
cmat <- matrix(1,1,1)
Q <- matrix(c(0,1,1,0),ncol(X),1) # a1+a2=1 for y = c + a1x1 + a2x2 + a3x3
bc <- bh-XX%*%Q%*%solve(t(Q)%*%XX%*%Q)%*%(t(Q)%*%bh-cmat)
library(quadprog)
d <- t(ymat) %*% X
Rinv = solve(chol(t(X)%*%X))
qp <- solve.QP(Dmat=Rinv, dvec=d, Amat=Q, bvec=cmat, meq=1, factorized=TRUE)
qp
cbind(bh,qp$unconstrained.solution)
cbind(bc,qp$solution)
Assuming the problem is to minimize || ymat - X b ||^2 subject to b[2] * b[3] == b[4], we can substitute for b[4], giving the unconstrained nls problem shown below. The b in the code is the first 3 elements of the original b, and the original b[4] is recovered by multiplying its last two elements together. No packages are used.
fm <- nls(ymat ~ X %*% c(b, b[2] * b[3]), start = list(b = 0:2))
fm
giving:
Nonlinear regression model
model: ymat ~ X %*% c(b, b[2] * b[3])
data: parent.frame()
b1 b2 b3
76.9718 0.6275 0.7598
residual sum-of-squares: 204
Number of iterations to convergence: 4
Achieved convergence tolerance: 6.555e-06
To compute b4:
prod(coef(fm)[-1])
## [1] 0.476805
Note
In a similar way the original problem (to minimize the same objective but with the original constraint) can be reduced to an unconstrained problem and solved using nls via substitution:
nls(ymat ~ X %*% c(b[1], b[2], 1-b[2], b[3]), start = list(b = 0:2))
giving:
Nonlinear regression model
model: ymat ~ X %*% c(b[1], b[2], 1 - b[2], b[3])
data: parent.frame()
b1 b2 b3
105.3186 0.3931 0.7964
residual sum-of-squares: 222.3
Number of iterations to convergence: 1
Achieved convergence tolerance: 4.838e-08
It would even be possible to reparameterize to make the original problem solvable by lm: since a2 = 1 - a1, moving x2 to the left-hand side gives y - x2 = c + a1*(x1 - x2) + a3*x3, which is a regression of y on x1 - x2 and x3 with x2 as an offset
lm(ymat ~ I(X[, 2] - X[, 3]) + X[, 4] + offset(X[, 3]))
giving
Call:
lm(formula = ymat ~ I(X[, 2] - X[, 3]) + X[, 4] + offset(X[, 3]))
Coefficients:
(Intercept) I(X[, 2] - X[, 3]) X[, 4]
105.3186 0.3931 0.7964
G. Grothendieck - thank you for your response. Unfortunately this didn't work for me.
I decided to work out the Lagrangian long handed, which turned out too complicated for me to solve.
Then I realized:
a1*a2 - a3 = 0
a1*a2 = a3
ln(a1*a2) = ln(a3)
ln(a1) + ln(a2) - ln(a3) = 0
This leaves me with an additive constraint (valid only when the coefficients are all positive), which I can solve with the quadprog package.
Maybe you can try the code below, using fmincon() from pracma:
library(pracma)
library(NlcOptim)
# define the objective function
fn <- function(v) norm(ymat - as.vector(xmat %*% v), "2")
# the constraint a1*a2 - a3 = 0
heq1 <- function(v) prod(v[1:2]) - v[3]
# solve for a1, a2 and a3
res <- fmincon(0:2, fn, heq = heq1)
which gives
> res$par
[1] 1.9043754 -0.1781830 -0.3393272
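Note that fn above omits the constant term from the model in the question, which may explain the odd-looking coefficients. A sketch that keeps it, reusing X = cbind(rep(1,n), xmat) from the question's code:
fn2 <- function(v) norm(ymat - as.vector(X %*% v), "2")
heq2 <- function(v) v[2] * v[3] - v[4] # a1*a2 - a3 = 0, with v[1] the constant
res2 <- fmincon(c(100, 0:2), fn2, heq = heq2)
res2$par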
I have a time-series which I need to fit onto an AR (auto-regression) model.
The AR model has the form:
x(t) = a0 + a1*x(t-1) + a2*x(t-2) + ... + aq*x(t-q) + noise.
I have two constraints:
Find the best AR fit when lag.max = 50.
Sum of all coefficients a0 + a1 + ... + aq = 1
I wrote the below code:
require(FitAR)
data(lynx) # my real data comes from the stock market.
z <- -log(lynx)
#find best model
step <- SelectModel(z, ARModel = "AR" ,lag.max = 50, Criterion = "AIC",Best=10)
summary(step) # display results
# fit the model with the chosen order and get the coefficients
arfit <- ar(z, order.max=ceiling(mean(step[,1])), aic=FALSE)
# check whether the coefficients sum to 1
sum(arfit$ar)
[1] 0.5784978
My question is, how to add the constraint: sum of all coefficients = 1?
I looked at this question, but I do not see how to apply it.
UPDATE:
I think I managed to solve my question as follows.
library(quadprog)
coeff <- arfit$ar
y <- 0
for (i in 1:length(coeff)) {
  y <- y + coeff[i] * c(z[(i+1):length(z)], rep(0, i))
  if (i == 1) X <- c(z[2:length(z)], 0)
  else X <- cbind(X, c(z[(i+1):length(z)], rep(0, i)))
}
Dmat <- t(X) %*% X
s <- solve.QP(Dmat, t(y) %*% X, matrix(1, nrow=length(coeff), ncol=1), 1, meq=1)
s$solution
# The coefficients should sum up to 1
sum(s$solution)
I am trying to simulate a three-variable dataset so that I can run linear regression models on it. 'X1' and 'X2' would be continuous independent variables (mean=0, sd=1), and 'Y' would be the continuous dependent variable.
The variables will be related such that a regression model produces coefficients like this:
Y = 5 + 3(X1) - 2(X2)
I would like to simulate this dataset such that the resulting regression model has an R-squared value of 0.2. How can I determine the value of 'sd.value' so that the regression model has this R-squared?
n <- 200
set.seed(101)
sd.value <- 1
X1 <- rnorm(n, 0, 1)
X2 <- rnorm(n, 0, 1)
Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
simdata <- data.frame(X1, X2, Y)
summary(lm(Y ~ X1 + X2, data=simdata))
Take a look at this code, it should be close enough to what you want:
simulate <- function(n.obs=10^4, beta=c(5, 3, -2), R.sq=0.8) {
  stopifnot(length(beta) == 3)
  df <- data.frame(x1=rnorm(n.obs), x2=rnorm(n.obs)) # x1 and x2 are independent
  var.epsilon <- (beta[2]^2 + beta[3]^2) * (1 - R.sq) / R.sq
  stopifnot(var.epsilon > 0)
  df$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
  df$y <- with(df, beta[1] + beta[2]*x1 + beta[3]*x2 + epsilon)
  return(df)
}
get.R.sq <- function(desired) {
  model <- lm(y ~ x1 + x2, data=simulate(R.sq=desired))
  return(summary(model)$r.squared)
}
df <- data.frame(desired.R.sq=seq(from=0.05, to=0.95, by=0.05))
df$actual.R.sq <- sapply(df$desired.R.sq, FUN=get.R.sq)
plot(df)
abline(a=0, b=1, col="red", lty=2)
Basically your question comes down to figuring out the expression for var.epsilon. Since we have y = b1 + b2*x1 + b3*x2 + epsilon, and the Xs and epsilon are all independent, we have var[y] = b2^2 * var[x1] + b3^2 * var[x2] + var[eps], where var[x1] = var[x2] = 1 by assumption. You can then solve for var[eps] as a function of R-squared.
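Spelled out: R.sq = (b2^2 + b3^2) / (b2^2 + b3^2 + var[eps]), so var[eps] = (b2^2 + b3^2) * (1 - R.sq) / R.sq, which is exactly the var.epsilon line in the code above.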
So the formula for R^2 is 1 - var(residual)/var(total).
In this case, the variance of Y is going to be 3^2+2^2+sd.value^2, since we are adding three independent random variables. And, asymptotically, the residual variance is going to be simply sd.value^2.
So you can compute rsquared explicitly with this function:
rsq <- function(x){1 - x^2/(9 + 4 + x^2)}
With a little algebra, you can compute the inverse of this function:
rsqi <- function(x){sqrt(13)*sqrt((1-x)/x)}
So setting sd.value<-rsqi(rsquared) should give you what you want.
We can test this as follows:
simrsq <- function(x){
  # uses n, X1 and X2 from the question's code above
  Y <- rnorm(n, (5 + 3*X1 - 2*X2), rsqi(x))
  simdata <- data.frame(X1, X2, Y)
  summary(lm(Y ~ X1 + X2, data=simdata))$r.squared
}
> meanrsq<-rep(0,9)
> for(i in 1:50)
+ meanrsq<-meanrsq+Vectorize(simrsq)((1:9)/10)
> meanrsq/50
[1] 0.1031827 0.2075984 0.3063701 0.3977051 0.5052408 0.6024988 0.6947790
[8] 0.7999349 0.8977187
So it looks to be correct.
This is how I would do it (blind iterative algorithm, assuming no knowledge, for when you are purely interested in "how to simulate this"):
simulate.sd <- function(nsim=10, n=200, seed=101, tol=0.01) {
  set.seed(seed)
  sd.value <- 1
  rsquare <- 1:nsim # initialized outside the tolerance band so the while loop runs
  results <- 1:nsim
  for (i in 1:nsim) {
    # track iterations: give up on this replicate after 3000 steps
    iter <- 0
    while (rsquare[i] > (0.20 + tol) | rsquare[i] < (0.2 - tol)) {
      sd.value <- sd.value + 0.01
      rsquare[i] <- simulate.sd.iter(sd.value, n)
      iter <- iter + 1
      if (iter > 3000) { break }
    }
    results[i] <- sd.value # store the sd.value that hit the tolerance band
    sd.value <- 1
  }
  cbind(results, rsquare)
}
simulate.sd.iter <- function(sd.value, n=200) { # helper function
  # takes the sd.value, creates data, and returns the r-squared
  X1 <- rnorm(n, 0, 1)
  X2 <- rnorm(n, 0, 1)
  Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
  simdata <- data.frame(X1, X2, Y)
  return(summary(lm(Y ~ X1 + X2, data=simdata))$r.squared)
}
simulate.sd()
A few things to note:
I let X1 and X2 vary, since this affects the sought sd.value.
The tolerance is how exact you want this estimate to be. Are you fine with an r-squared of ~0.19 or ~0.21? Then have the tolerance be 0.01.
Note that too precise a tolerance might not allow you to find a result.
The starting value of 1 is quite bad, making this iterative algorithm quite slow.
The resulting vector for 10 results is:
[1] 5.64 5.35 5.46 5.42 5.79 5.39 5.64 5.62 4.70 5.55
which takes roughly 13 seconds to produce on my machine.
My next step would be to start from 4.5, add 0.001 to the iteration instead of 0.01, and perhaps lower the tolerance. Good luck!
Alright, some summary statistics for nsim=100, taking 150 seconds, with a step increase of 0.001 and tolerance still at 0.01:
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.513 4.913 5.036 5.018 5.157 5.393
Why are you interested in this though?
Here is another way to generate data for a multiple linear regression with normally distributed errors:
(Oops, sorry - this code just produces a multiple regression without targeting a specific R-squared.)
sim.regression <- function(n.obs=10, coefficients=runif(10,-5,5), s.deviation=.1){
  n.var <- length(coefficients)
  M <- matrix(0, ncol=n.var, nrow=n.obs)
  beta <- as.matrix(coefficients)
  # fill each column with an independent standard normal predictor
  for (i in 1:n.var){
    M[,i] <- rnorm(n.obs, 0, 1)
  }
  y <- M %*% beta + rnorm(n.obs, 0, s.deviation)
  return(list(x=M, y=y, coeff=coefficients))
}
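A quick usage sketch (all names as defined by the function above):
sim <- sim.regression(n.obs=200, coefficients=c(5, 3, -2), s.deviation=1)
summary(lm(sim$y ~ sim$x)) # recovers roughly the chosen coefficients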