I have two vectors (example):
x=c(100,98,60,30,28,30,20,10)
y=c(10,9.8,5,3,2,3.4,2.8,1)
I would like to fit them using this function:

y = a / (1 + exp(-b * (x - c))) + d

and get the fitting parameters a, b, c, d.
I used this:
m<-nls(x~a/1+e^(-b*(y-c)) + d)
but I got this error:
Error in y - c : non-numeric argument to binary operator
There are a few problems: x and y are reversed in the formula, e^(...) should be exp(...), and the denominator needs parentheses (a/1 + ... divides a by 1 rather than by the whole term). I also found that fixing d at 0 helped.
d <- 0                                             # fix d at 0
st <- list(a = mean(y), b = 1/sd(x), c = mean(x))  # rough starting values
fm <- nls(y ~ a/(1 + exp(-b*(x - c))) + d, start = st)
fm
giving:
Nonlinear regression model
model: y ~ a/(1 + exp(-b * (x - c)))
data: parent.frame()
a b c
19.96517 0.02623 99.73842
residual sum-of-squares: 1.82
Number of iterations to convergence: 9
Achieved convergence tolerance: 9.023e-06
Plotting this, it seems to be a good fit visually (x is not sorted, so order the points before drawing the line):
plot(y ~ x)
o <- order(x)
lines(fitted(fm)[o] ~ x[o], col = "red")
I think the reason is that c is treated as the combine function c(). Change it to another symbol (c1, for instance). Of course you would also need to specify meaningful starting parameters, but I guess that was not your question.
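A minimal sketch of that fix (the starting values here are illustrative guesses, not part of the original answer; nls may still fail if they are far off):

m <- nls(y ~ a/(1 + exp(-b*(x - c1))) + d,
         start = list(a = 20, b = 0.03, c1 = 100, d = 0))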
I have existing code for fitting a sigmoid curve to data in R. How can I use selfStart (or another method) to automatically find starting values for the regression?
sigmoid = function(params, x) {
params[1] / (1 + exp(-params[2] * (x - params[3])))
}
dataset = data.frame("x" = 1:53, "y" =c(0,0,0,0,0,0,0,0,0,0,0,0,0,0.1,0.18,0.18,0.18,0.33,0.33,0.33,0.33,0.41,0.41,0.41,0.41,0.41,0.41,0.5,0.5,0.5,0.5,0.68,0.58,0.58,0.68,0.83,0.83,0.83,0.74,0.74,0.74,0.83,0.83,0.9,0.9,0.9,1,1,1,1,1,1,1) )
x = dataset$x
y = dataset$y
# fitting code
fitmodel <- nls(y~a/(1 + exp(-b * (x-c))), start=list(a=1,b=.5,c=25))
# visualization code
# get the coefficients using the coef function
params=coef(fitmodel)
y2 <- sigmoid(params,x)
plot(y2,type="l")
points(y)
This is a common (and interesting) problem in non-linear curve fitting.
Background
We can find sensible starting values if we take a closer look at the function sigmoid

f(x) = a / (1 + exp(-b * (x - c)))

We first note that

f(x) -> a as x -> Inf

So for large values of x, the function approaches a. In other words, as a starting value for a we may choose the value of y for the largest value of x.
In R language, this translates to y[which.max(x)].
Now that we have a starting value for a, we need to decide on starting values for b and c. To do that, we can make use of the geometric series

1 / (1 + r) = 1 - r + r^2 - r^3 + ...

and expand f(x) = y by keeping only the first two terms

y ≈ a * (1 - exp(-b * (x - c)))

We now set a = 1 (our starting value for a), re-arrange the equation and take the logarithm on both sides

log(1 - y) = -b * x + b * c

We can now fit a linear model of the form log(1 - y) ~ x to obtain estimates for the slope and offset, which in turn provide the starting values for b and c.
R implementation
Let's define a function that takes the values x and y as arguments and returns a list of parameter starting values:
start_val_sigmoid <- function(x, y) {
  ## linear fit of log(a - y) ~ x; the small offset 1e-6 avoids log(0)
  fit <- lm(log(y[which.max(x)] - y + 1e-6) ~ x)
  list(
    a = y[which.max(x)],                       # plateau: y at the largest x
    b = unname(-coef(fit)[2]),                 # b = -slope
    c = unname(-coef(fit)[1] / coef(fit)[2]))  # c = -intercept/slope
}
Based on the data for x and y you give, we obtain the following starting values
start_val_sigmoid(x, y)
#$a
#[1] 1
#
#$b
#[1] 0.2027444
#
#$c
#[1] 15.01613
Since start_val_sigmoid returns a list, we can use its output directly as the start argument in nls:
nls(y ~ a / ( 1 + exp(-b * (x - c))), start = start_val_sigmoid(x, y))
#Nonlinear regression model
# model: y ~ a/(1 + exp(-b * (x - c)))
# data: parent.frame()
# a b c
# 1.0395 0.1254 29.1725
# residual sum-of-squares: 0.2119
#
#Number of iterations to convergence: 9
#Achieved convergence tolerance: 9.373e-06
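As an aside, base R also ships a self-starting logistic model, SSlogis, which computes its own starting values. Its parameterization Asym/(1 + exp((xmid - x)/scal)) maps onto ours via a = Asym, c = xmid, b = 1/scal. A sketch on the same data:

# self-starting logistic fit; no start argument needed
fit_ss <- nls(y ~ SSlogis(x, Asym, xmid, scal), data = dataset)
coef(fit_ss)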
Sample data
dataset = data.frame("x" = 1:53, "y" =c(0,0,0,0,0,0,0,0,0,0,0,0,0,0.1,0.18,0.18,0.18,0.33,0.33,0.33,0.33,0.41,0.41,0.41,0.41,0.41,0.41,0.5,0.5,0.5,0.5,0.68,0.58,0.58,0.68,0.83,0.83,0.83,0.74,0.74,0.74,0.83,0.83,0.9,0.9,0.9,1,1,1,1,1,1,1) )
x = dataset$x
y = dataset$y
I am trying to set up a simple OLS model with constraints on the coefficients in R. The code below works, but it demonstrates

y = c + a1*x1 + a2*x2 + a3*x3 with constraint a1 + a2 = 1
I would like to revise this constraint to:
a1*a2 - a3 = 0
thanks for your help!
WORKING CODE:
set.seed(1000)
n <- 20
x1 <- seq(100,length.out=n)+rnorm(n,0,2)
x2 <- seq(50,length.out=n)+rnorm(n,0,2)
x3 <- seq(10,length.out=n)+rnorm(n,0,2)
constant <- 100
ymat <- constant + .5*x1 + .5*x2 + .75*x3 + rnorm(n,0,4)
xmat <- cbind(x1,x2,x3)
X <- cbind(rep(1,n),xmat) # explicitly include vector for constant
bh <- solve(t(X)%*%X)%*%t(X)%*%ymat   # unconstrained OLS estimate
XX <- solve(t(X)%*%X)
cmat <- matrix(1,1,1)                 # right-hand side of the constraint
Q <- matrix(c(0,1,1,0),ncol(X),1)     # a1+a2=1 for y = c + a1x1 + a2x2 + a3x3
bc <- bh-XX%*%Q%*%solve(t(Q)%*%XX%*%Q)%*%(t(Q)%*%bh-cmat)  # restricted LS, analytic
library(quadprog)
d <- t(ymat) %*% X
Rinv <- solve(chol(t(X)%*%X))         # inverse Cholesky factor for factorized=TRUE
qp <- solve.QP(Dmat=Rinv, dvec=d, Amat=Q, bvec=cmat, meq=1, factorized=TRUE)
qp
cbind(bh,qp$unconstrained.solution)
cbind(bc,qp$solution)
Assuming the problem is to minimize || ymat - X b ||^2 subject to b[2] * b[3] == b[4], we can substitute for b[4], giving the unconstrained nls problem shown below. The b in the code holds the first 3 elements of the full coefficient vector; b[4] is recovered by multiplying its last two elements together. No packages are used.
fm <- nls(ymat ~ X %*% c(b, b[2] * b[3]), start = list(b = 0:2))
fm
giving:
Nonlinear regression model
model: ymat ~ X %*% c(b, b[2] * b[3])
data: parent.frame()
b1 b2 b3
76.9718 0.6275 0.7598
residual sum-of-squares: 204
Number of iterations to convergence: 4
Achieved convergence tolerance: 6.555e-06
To compute b4
prod(coef(fm)[-1])
## [1] 0.476805
Note
In a similar way the original problem (to minimize the same objective but with the original constraint) can be reduced to an unconstrained problem and solved using nls via substitution:
nls(ymat ~ X %*% c(b[1], b[2], 1-b[2], b[3]), start = list(b = 0:2))
giving:
Nonlinear regression model
model: ymat ~ X %*% c(b[1], b[2], 1 - b[2], b[3])
data: parent.frame()
b1 b2 b3
105.3186 0.3931 0.7964
residual sum-of-squares: 222.3
Number of iterations to convergence: 1
Achieved convergence tolerance: 4.838e-08
It would even be possible to reparameterize to make this original problem solvable by lm
lm(ymat ~ I(X[, 2] - X[, 3]) + X[, 4] + offset(X[, 3]))
giving:
Call:
lm(formula = ymat ~ I(X[, 2] - X[, 3]) + X[, 4] + offset(X[, 3]))
Coefficients:
(Intercept) I(X[, 2] - X[, 3]) X[, 4]
105.3186 0.3931 0.7964
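To recover all four coefficients from this reparameterized fit, a small sketch (not part of the original answer):

co <- coef(lm(ymat ~ I(X[, 2] - X[, 3]) + X[, 4] + offset(X[, 3])))
c(c = co[[1]], a1 = co[[2]], a2 = 1 - co[[2]], a3 = co[[3]])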
G. Grothendieck - thank you for your response. Unfortunately this didn't work for me.
I decided to work out the Lagrangian longhand, which turned out too complicated for me to solve.
Then I realized:
a1*a2 - a3 = 0
a1*a2 = a3
ln(a1*a2) = ln(a3)
ln(a1) + ln(a2) - ln(a3) = 0
This leaves me with an additive constraint which I can solve with the quadprog package (note this assumes a1, a2, a3 > 0, and the constraint is linear in the log-parameters rather than in the original coefficients).
Maybe you can try the code below, using fmincon()
library(pracma)
library(NlcOptim)
# define the objective function (residual 2-norm)
fn <- function(v) norm(ymat - as.vector(xmat %*% v), "2")
# the equality constraint a1*a2 - a3 = 0
heq1 <- function(v) prod(v[1:2]) - v[3]
# solve for a1, a2 and a3
res <- fmincon(0:2, fn, heq = heq1)
giving:
> res$par
[1] 1.9043754 -0.1781830 -0.3393272
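As a quick sanity check (my addition), the equality constraint should hold at the solution:

prod(res$par[1:2]) - res$par[3]  # should be ~0 up to solver tolerance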
I am trying to estimate the constants for Heaps' law.
I have the following dataset, novels_collection:
  Number of novels DistinctWords WordOccurrences
1                1         13575          117795
2                1         34224          947652
3                1         40353         1146953
4                1         55392         1661664
5                1         60656         1968274
Then I built the following function:
# Function for Heaps law
heaps <- function(K, n, B){
K*n^B
}
heaps(2,117795,.7) #Just to test it works
So n = WordOccurrences, and K and B are the constants to be estimated in order to predict DistinctWords.
I tried this but it gives me an error:
fitHeaps <- nls(DistinctWords ~ heaps(K,WordOccurrences,B),
data = novels_collection[,2:3],
start = list(K = .1, B = .1), trace = T)
Error in numericDeriv(form[[3L]], names(ind), env) :
  Missing value or an infinity produced when evaluating the model
Any idea in how could I fix this or a method to fit the function and get the values for K and B?
If you take log transform on both sides of y = K * n ^ B, you get log(y) = log(K) + B * log(n). This is a linear relationship between log(y) and log(n), hence you can fit a linear regression model to find log(K) and B.
logy <- log(novels_collection$DistinctWords)
logn <- log(novels_collection$WordOccurrences)
fit <- lm(logy ~ logn)
para <- coef(fit)        ## log(K) and B
para[1] <- exp(para[1])  ## now K and B
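To use the fit on the original scale, back-transform the predictions (a sketch, assuming the novels_collection data frame from the question):

K <- para[1]; B <- para[2]
pred <- K * novels_collection$WordOccurrences^B  # predicted DistinctWords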
With minpack.lm we can fit the non-linear model directly. I suspect it will be more prone to overfitting than a linear model on the log-transformed variables (as done by Zheyuan), but comparing the residuals of the two models on a held-out dataset would give an empirical answer, which would be interesting to see.
library(minpack.lm)
fitHeaps = nlsLM(DistinctWords ~ heaps(K, WordOccurrences, B),
data = novels_collection[,2:3],
start = list(K = .01, B = .01))
coef(fitHeaps)
# K B
# 5.0452566 0.6472176
plot(novels_collection$WordOccurrences, novels_collection$DistinctWords, pch=19)
lines(novels_collection$WordOccurrences, predict(fitHeaps, newdata = novels_collection[,2:3]), col='red')
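As a rough follow-up to the overfitting question above (my addition; with five points a genuine held-out comparison is not feasible), the in-sample residual sums of squares of the two fits can at least be compared, assuming fit from the log-log lm answer is still in the workspace:

sum(residuals(fitHeaps)^2)                                   # non-linear fit
sum((novels_collection$DistinctWords - exp(fitted(fit)))^2)  # back-transformed log-log fit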
I have a dataset like this
df
          x           y
  7.3006667 -0.14383333
 -0.8983333  0.02133333
  2.7953333 -0.07466667
and I would like to fit an exponential function of the form y = a * exp(b*x).
This is what I tried and the error I get
f <- function(x,a,b) {a * exp(b * x)}
st <- coef(nls(log(y) ~ log(f(x, a, b)), df, start = c(a = 1, b = -1)))
Error in qr.qty(QR, resid) : NA/NaN/Inf in foreign function call (arg 5)
In addition: Warning messages:
1: In log(y) : NaNs produced
2: In log(y) : NaNs produced
fit <- nls(y ~ f(x, a, b), data = df, start = list(a = st[1], b = st[2]))
Error in nls(y ~ exp(a + b * x), data = df, start = list(a = st[1], :
singular gradient
I believe it has to do with the fact that the log is not defined for negative numbers but I don't know how to solve this.
I'm having trouble seeing the problem here.
f <- function(x,a,b) {a * exp(b * x)}
fit <- nls(y~f(x,a,b),df,start=c(a=1,b=1))
summary(fit)$coefficients
# Estimate Std. Error t value Pr(>|t|)
# a -0.02285668 0.03155189 -0.7244157 0.6008871
# b 0.25568987 0.19818736 1.2901422 0.4197729
plot(y~x, df)
curve(predict(fit,newdata=data.frame(x)), add=TRUE)
The coefficients are very poorly estimated, but that's not surprising: you have two parameters and three data points.
As to why your code fails: the first call to nls(...) generates an error, so st is never set (although it may retain a value from some earlier code). You then use that stale st in the second call to nls(...).
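If you still want data-driven starting values despite the negative y values (which rule out the log transform), one option is to minimize the residual sum of squares directly with optim and hand the result to nls. A sketch, assuming df and f as defined above:

rss <- function(p) sum((df$y - f(df$x, p[1], p[2]))^2)  # residual sum of squares
p0 <- optim(c(a = 1, b = 1), rss)$par                   # rough minimizer (Nelder-Mead)
fit2 <- nls(y ~ f(x, a, b), data = df, start = as.list(p0))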
I'm trying to use linear regression to figure out the best weighting for 3 models to predict an outcome. So there are 3 variables (x1, x2, x3) that are the predictions of the dependent variable, y. My question is: how do I run a regression with the constraint that the coefficients sum to 1? For example:
this is good:
y = .2(x1) + .4(x2) + .4(x3)
since .2 + .4 + .4 = 1
this is no good:
y = 1.2(x1) + .4(x2) + .3(x3)
since 1.2 + .4 + .3 > 1
I'm looking to do this in R if possible. Thanks. Let me know if this needs to be moved to the stats area (Cross Validated).
EDIT:
The problem is to classify each row as 1 or 0. y is the actual values ( 0 or 1 ) from the training set, x1 is the predicted values from a kNN model, x2 is from a randomForest, x3 is from a gbm model. I'm trying to get the best weightings for each model, so each coefficient is <=1 and the sum of the coefficients == 1.
Would look something like this:
y/Actual value  knnPred  RfPred  gbmPred
             0   .1111    .0546   .03325
             1   .7778    .6245   .60985
             0   .3354    .1293   .33255
             0   .2235    .9987   .10393
             1   .9888    .6753   .88933
           ...     ...      ...      ...
The measure for success is AUC. So I'm trying to set the coefficients to maximize AUC while making sure they sum to 1.
There's very likely a better way that someone else will share, but you're looking for two parameters such that
b1 * x1 + b2 * x2 + (1 - b1 - b2) * x3
is close to y. To do that, I'd write an error function to minimize
minimizeMe <- function(b, x, y) { ## Calculates MSE
mean((b[1] * x[, 1] + b[2] * x[, 2] + (1 - sum(b)) * x[, 3] - y) ^ 2)
}
and throw it to optim
fit <- optim(par = c(.2, .4), fn = minimizeMe, x = cbind(x1, x2, x3), y = y)
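The fitted weights can then be read off as follows (sketch):

b <- fit$par
weights <- c(b, 1 - sum(b))  # third weight implied by the sum-to-one constraint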
No data to test on:
mod1 <- lm(y ~ 0+x1+x2+x3, data=dat)
mod2 <- lm(y/I(sum(coef(mod1))) ~ 0+x1+x2+x3, data=dat)
And now that I think about it some more, skip mod2, just:
coef(mod1)/sum(coef(mod1))
For the five rows shown, either of round(knnPred) or round(gbmPred) gives perfect predictions, so there is some question whether more than one predictor is needed.
At any rate, to solve the given question as stated, the following will give nonnegative coefficients that sum to 1 (except possibly for tiny differences due to computer arithmetic). Here a is the matrix of independent variables and b is the dependent variable; c and d define the equality constraint (coefficients sum to 1) and e and f define the inequality constraints (coefficients are nonnegative).
library(lsei)
a <- cbind(x1, x2, x3)      # model matrix
b <- y                      # response
c <- matrix(c(1, 1, 1), 1)  # equality constraint: b1 + b2 + b3 ...
d <- 1                      # ... equals 1
e <- diag(3)                # inequality constraints: each coefficient ...
f <- c(0, 0, 0)             # ... is >= 0
lsei(a, b, c, d, e, f)
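A quick check (my addition) that the returned solution respects both constraint sets; this assumes lsei() returns the coefficient vector directly, as in recent versions of the package:

sol <- lsei(a, b, c, d, e, f)
sum(sol)       # should be 1 (equality constraint)
all(sol >= 0)  # should be TRUE (inequality constraints)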