univariate non linear optimisation in R - r

I'm trying to find solution in R that performs similarly to MATLAB's trust region reflective algorithm. This question has been asked before but the author was asked to provide reproducible example. I couldn't comment there so the only solution was to post new question. Here's my example:
x <- c(5000,5000,5000,5000,2500,2500,2500,2500,1250,1250,1250,1250,625,625, 625,625,312,312,312,312,156,156,156,156)
y <- c(0.209065186,0.208338898,0.211886104,0.209638321,0.112064803,0.110535275,0.111748670,0.111208841,0.060416469,0.059098975,0.059274827,0.060859512,0.032178730,0.033190833,0.031621743,0.032345817,0.017983939,0.016632180,0.018468540,0.019513489,0.011490089,0.011076365,0.009282322,0.012309134)
Since initial parameter values are the central issue I tried using 'nls2' package which uses 'brute-force' algorithm to find good starting parameters. Even with that, nls and nls.lm cannot reach convergence. Here's some basic code for this:
library('nls2'); library('minpack.lm')
fo <- y ~ I(A * (x ^ B) + C)
sA <- seq(-2,1,len=10) # range of parameter values
sB <- seq(-1,1,len=10)
sC <- seq(-1,1,len=10)
st1 <- expand.grid(A=sA,B=sB,C=sC)
mod1 <- nls2(fo,start=st1,algorithm="brute-force")
fit_ <- nls(fo,start=coef(mod1)) # basic nls
# or nls.lm
fit_ <- nlsLM(fo, start=coef(mod1),algorithm = "LM")
MATLAB produced:
a = 7.593e-05 (6.451e-05, 8.736e-05)
b = 0.9289 (0.9116, 0.9462)
c = 0.002553 (0.001333, 0.003772)
Goodness of fit:
SSE: 2.173e-05
R-square: 0.9998
Adjusted R-square: 0.9998
RMSE: 0.001017
and yes, using these parameter values, R also produced the solution.
Question: how to obtain this in R without using matlab ?

After looking at a the plotted data, I have no problem guessing suitable starting values:
plot(y ~ x)
The data is almost on a straight line through 0. So good starting value vor B and C should be 1 and 0, respectively. Then you only need to guestimate the slope of the straight line. Of course you could also use lm(y ~ x) to find starting values for A and C.
fo <- y ~ A * (x ^ B) + C
DF <- data.frame(x, y)
fit <- nls(fo, start = list(A = 0.001, B = 1, C = 0), data = DF)
summary(fit)
#Formula: y ~ A * (x^B) + C
#
#Parameters:
# Estimate Std. Error t value Pr(>|t|)
#A 7.593e-05 5.495e-06 13.820 5.17e-12 ***
#B 9.289e-01 8.317e-03 111.692 < 2e-16 ***
#C 2.552e-03 5.866e-04 4.351 0.000281 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 0.001017 on 21 degrees of freedom
#
#Number of iterations to convergence: 5
#Achieved convergence tolerance: 9.084e-07
lines(seq(min(x), max(x), length.out = 100),
predict(fit, newdata = data.frame(x = seq(min(x), max(x), length.out = 100))),
col = "blue")

Related

Non-numeric argument to binary operator R NLS package

I am using the nls package in R to perform a nonlinear fit. I have specified my independent variable as follows:
t <- seq(1,7)
and my dependent variables as P <- c(0.0246, 0.2735, 0.5697, 0.6715, 0.8655, 0.9614, 1)
I then have tried:
m <- nls(P ~ 1 / (c + q*exp(-b*t))^(1/v)),
but every time I get:
"Error in c + q * exp(-b * t) : non-numeric argument to binary
operator"
Every one of my variables is numeric. Any ideas?
Thanks!
You have more than one problem in your script. The main issue is that you should never use names which are used by R: t is the matrix transpose, c is a common method to create vectors, and q is the quit instruction. nls() will not try to fit them, as they are already defined. I recommend using more meaningful and less dangerous variables such as Coef1, Coef2, …
The second problem is that you are trying to fit a model with 4 variables to a dataset with 7 data... This may yield singularities and other problems.
For the sake of the argument, I have reduced your model to three variables, and changed some names:
Time <- seq(1,7)
Prob <- c(0.0246, 0.2735, 0.5697, 0.6715, 0.8655, 0.9614, 1)
plot(Time, Prob)
And now we perform the nls() fit:
Fit <- nls(Prob ~ 1 / (Coef1 + Coef2 * exp(-Coef3 * Time)))
X <- data.frame(Time = seq(0, 7, length.out = 100))
Y <- predict(object = Fit, newdata = X)
lines(X$Time, Y)
And a summary of the results:
summary(Fit)
# Formula: Prob ~ 1/(Coef1 + Coef2 * exp(-Coef3 * Time))
#
# Parameters:
# Estimate Std. Error t value Pr(>|t|)
# Coef1 1.00778 0.06113 16.487 7.92e-05 ***
# Coef2 23.43349 14.42378 1.625 0.1796
# Coef3 1.04899 0.21892 4.792 0.0087 **
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.06644 on 4 degrees of freedom
#
# Number of iterations to convergence: 12
# Achieved convergence tolerance: 3.04e-06
I know it is not exactly what you wanted, but I hope it helps.

R: lm() result differs when using `weights` argument and when using manually reweighted data

In order to correct heteroskedasticity in error terms, I am running the following weighted least squares regression in R :
#Call:
#lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
#Weighted Residuals:
# Min 1Q Median 3Q Max
#-1.83779 -0.33226 0.02011 0.25135 1.48516
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -3.939440 0.609991 -6.458 1.62e-09 ***
#q 0.175019 0.070101 2.497 0.013696 *
#q2 0.048790 0.005613 8.693 8.49e-15 ***
#b 0.473891 0.134918 3.512 0.000598 ***
#c 0.119551 0.125430 0.953 0.342167
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.5096 on 140 degrees of freedom
#Multiple R-squared: 0.9639, Adjusted R-squared: 0.9628
#F-statistic: 933.6 on 4 and 140 DF, p-value: < 2.2e-16
Where "weighting" is a variable (function of the variable q) used for weighting the observations. q2 is simply q^2.
Now, to double-check my results, I manually weight my variables by creating new weighted variables :
mydata$a.wls <- mydata$a * mydata$weighting
mydata$q.wls <- mydata$q * mydata$weighting
mydata$q2.wls <- mydata$q2 * mydata$weighting
mydata$b.wls <- mydata$b * mydata$weighting
mydata$c.wls <- mydata$c * mydata$weighting
And run the following regression, without the weights option, and without a constant - since the constant is weighted, the column of 1 in the original predictor matrix should now equal the variable weighting:
Call:
lm(formula = a.wls ~ 0 + weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
#Residuals:
# Min 1Q Median 3Q Max
#-2.38404 -0.55784 0.01922 0.49838 2.62911
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#weighting -4.125559 0.579093 -7.124 5.05e-11 ***
#q.wls 0.217722 0.081851 2.660 0.008726 **
#q2.wls 0.045664 0.006229 7.330 1.67e-11 ***
#b.wls 0.466207 0.121429 3.839 0.000186 ***
#c.wls 0.133522 0.112641 1.185 0.237876
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.915 on 140 degrees of freedom
#Multiple R-squared: 0.9823, Adjusted R-squared: 0.9817
#F-statistic: 1556 on 5 and 140 DF, p-value: < 2.2e-16
As you can see, the results are similar but not identical. Am I doing something wrong while manually weighting the variables, or does the option "weights" do something more than simply multiplying the variables by the weighting vector?
Provided you do manual weighting correctly, you won't see discrepancy.
So the correct way to go is:
X <- model.matrix(~ q + q2 + b + c, mydata) ## non-weighted model matrix (with intercept)
w <- mydata$weighting ## weights
rw <- sqrt(w) ## root weights
y <- mydata$a ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
## remember to drop intercept when using formula
fit_by_wls <- lm(y ~ X - 1, weights = w)
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
Although it is generally recommended to use lm.fit and lm.wfit when passing in matrix directly:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
But when using these internal subroutines lm.fit and lm.wfit, it is required that all input are complete cases without NA, otherwise the underlying C routine stats:::C_Cdqrls will complain.
If you still want to use the formula interface rather than matrix, you can do the following:
## weight by square root of weights, not weights
mydata$root.weighting <- sqrt(mydata$weighting)
mydata$a.wls <- mydata$a * mydata$root.weighting
mydata$q.wls <- mydata$q * mydata$root.weighting
mydata$q2.wls <- mydata$q2 * mydata$root.weighting
mydata$b.wls <- mydata$b * mydata$root.weighting
mydata$c.wls <- mydata$c * mydata$root.weighting
fit_by_wls <- lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
fit_by_ols <- lm(formula = a.wls ~ 0 + root.weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
Reproducible Example
Let's use R's built-in data set trees. Use head(trees) to inspect this dataset. There is no NA in this dataset. We aim to fit a model:
Height ~ Girth + Volume
with some random weights between 1 and 2:
set.seed(0); w <- runif(nrow(trees), 1, 2)
We fit this model via weighted regression, either by passing weights to lm, or manually transforming data and calling lm with no weigths:
X <- model.matrix(~ Girth + Volume, trees) ## non-weighted model matrix (with intercept)
rw <- sqrt(w) ## root weights
y <- trees$Height ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
fit_by_wls <- lm(y ~ X - 1, weights = w)
#Call:
#lm(formula = y ~ X - 1, weights = w)
#Coefficients:
#X(Intercept) XGirth XVolume
# 83.2127 -1.8639 0.5843
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
#Call:
#lm(formula = y_tilde ~ X_tilde - 1)
#Coefficients:
#X_tilde(Intercept) X_tildeGirth X_tildeVolume
# 83.2127 -1.8639 0.5843
So indeed, we see identical results.
Alternatively, we can use lm.fit and lm.wfit:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
We can check coefficients by:
matfit_by_wls$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
matfit_by_ols$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
Again, results are the same.

how to generate data on the a given value of coefficient of multiple determination in R?

I need generate data on the a given value of coefficient of multiple determination.
For example,if i indicated R^2 = 0.77, i want generate data, which create regression model with R^2=0.77
but these data must be in a certain range. For example, sample= 100 and i need 4 variables(x1 - dependent var), where values in range from 5-15. How do that?
I use optim
optim(0.77, fn, gr = NULL,
method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN",
"Brent"),
lower = 5, upper = 15,
control = list(), hessian = FALSE)
but i don't know how create function fn for my purpose. Please help to write this function
First here's a solution:
library(mvtnorm)
get.r <- function(x) c((x+sqrt(x**2+3*x))/(3),(x-sqrt(x**2+3*x))/(3))
set.seed(123)
cv <- get.r(0.77)[1]
out <- rmvnorm(100,sigma=matrix(c(1,cv,cv,cv,cv,1,cv,cv,cv,cv,1,cv,cv,cv,cv,1),ncol=4))
out1 <- as.data.frame(10*(out-min(out))/diff(range(out))+5)
range(out1)
# [1] 5 15
lm1 <- lm(V1~V2+V3+V4,data=out1)
summary(lm1)
# Call:
# lm(formula = V1 ~ V2 + V3 + V4, data = out1)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.75179 -0.64323 -0.03397 0.64770 2.23142
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.36180 0.50940 0.710 0.479265
# V2 0.29557 0.09311 3.175 0.002017 **
# V3 0.31433 0.08814 3.566 0.000567 ***
# V4 0.35438 0.07581 4.674 9.62e-06 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.927 on 96 degrees of freedom
# Multiple R-squared: 0.7695, Adjusted R-squared: 0.7623
# F-statistic: 106.8 on 3 and 96 DF, p-value: < 2.2e-16
Now let me explain how I got there. We can construct this statistically. First we need to understand a little about correlation and covariance. One formula for correlation is
Corr(X, Y) = Cov(X,Y)/sqrt(Var(X)Var(Y))
And one formula for covariance is:
Cov(X,Y) = E(XY) - E(X)E(Y)
In your question you want to get the multiple correlation of the regression model:
Y = X1 + X2 + X3
Let's make this as simple as possible and force the variance of all variables to be 1 and let's make the pairwise correlation between any two variables to be equal and call it r.
Now we're looking for the square of the correlation between Y and X1 + X2 + X3, which is:
R^2 = [Cov(Y,X1 + X2 + X3)]^2/[Var(Y)Var(X1 + X2 + X3)]
Note that
Cov(Y,X1 + X2 + X3) = Cov(Y,X1) + Cov(Y,X2) + Cov(Y,X3)
Further note that the variance of each variable is 1 and the pairwise correlation is r, so the above result is equivalent to 3r.
Also note that
Var(X1 + X2 + X3) = Var(X1) + Var(X2) + Var(X3) + Cov(X1,X2) + Cov(X1,X3) + Cov(X2,X3).
Since the variance of each is 1, this is equivalent to 3 + 6r, so
R^2 = 9r^2/(3 + 6r) = 3r^2/(1 + 2r)
We can use the quadratic equation to solve for r and get
r = (R^2 +/- sqrt((R^2)^2+3R^2))/3
If we substitute R^2 = 0.77, then r = -0.3112633 or 0.8245966. We can use either to get what you need by using rmvnorm() within the mvtnorm package. And since R^2 is invariant to linear transformations, we can transform the resulting variables so that they fall between 5 and 15.
Update:
If we want to simulate with n predictors, we can use the following (note that I am not transforming the range of each predictor, but that can be done after the fact without altering the multiple R^2):
get.r <- function(x,n) c(((n-1)*x+sqrt(((n-1)*x)**2+4*n*x))/(2*n),
((n-1)*x-sqrt(((n-1)*x)**2+4*n*x))/(2*n))
sim.data <- function(R2, n) {
sig.mat <- matrix(get.r(R2,n+1)[1],n+1,n+1)
diag(sig.mat) <- 1
out <- as.data.frame(rmvnorm(100,sigma=sig.mat))
return(out)
}
This isn't an answer, but I wanted to share what I did. I don't believe optim can be used the way you want it to. I attempted a "brute force" method to find a dataset that could work, but the highest r-squared I "randomed" was 0.23:
# Initializing our boolean and counter.
rm(list = ls())
Done <- FALSE
count <- 1
maxr2 <- .000001
# I set y ahead of time.
y <- sample(5:15, 100, replace = TRUE)
# Running until an appropriate r-squared is found.
while(!Done) {
# Generating a sample data set to optimize y on.
a <- sample(5:15, 100, replace = TRUE)
b <- sample(5:15, 100, replace = TRUE)
c <- sample(5:15, 100, replace = TRUE)
data <- data.frame(y = y, a = a, b = b, c = c)
# Making our equation and making a linear model.
EQ <- "y ~ a + b + c" # Creating the equation.
model <- lm(EQ, data) # Running the model.
if (count != 1) { if (summary(model)$r.squared > maxr2) { maxr2 <- summary(model)$r.squared } }
r2 <- summary(model)$r.squared # Grabbing the r-squared.
print(r2) # Printing r-squared out to see what is popping out.
if (r2 <= 0.78 & r2 >= 0.76) { Done <- TRUE } # If the r-squared is satfisfactory, pop it out.
count <- count + 1 # Incrementing our counter.
if (count >= 1000000) { Done <- TRUE ; print("A satisfactory r-squared was not found.") } # Setting this to run at most 1,000,000 times.
}
# Data will be your model that has an r-squared of 0.77 if you found one.
The issue with optim is that it optimizes individual parameters, single values. The first argument in optim is the par argument, which is meant to be a list of the values you want to optimize. This could be used in optimizing an r-squared by some decay function that is dependent on several values (these would be your par values). However, in this case, you're asking to optimize entire columns towards maximizing an r-squared, which doesn't make sense (as far as I know) with optim.

piecewise function fitting with nls() in R

I am trying to fit a two-part line to data.
Here's some sample data:
x<-c(0.00101959664756622, 0.001929220749155, 0.00165657261751726,
0.00182514724375389, 0.00161532360585458, 0.00126991061099209,
0.00149545009309177, 0.000816386510029308, 0.00164402569283353,
0.00128029006251656, 0.00206892841921455, 0.00132378793976235,
0.000953143467154676, 0.00272964503695939, 0.00169743839571702,
0.00286411493120396, 0.0016464862337286, 0.00155672067449593,
0.000878271561566836, 0.00195872573138819, 0.00255412836538339,
0.00126212428137799, 0.00106206607962734, 0.00169140916371657,
0.000858015581562961, 0.00191955159274793, 0.00243104345247067,
0.000871042201994687, 0.00229814264111745, 0.00226756341241083)
y<-c(1.31893118849162, 0.105150790530179, 0.412732029152914, 0.25589805483046,
0.467147868109498, 0.983984462069833, 0.640007862668818, 1.51429617241365,
0.439777145282391, 0.925550163462951, -0.0555942758921906, 0.870117027565708,
1.38032147826294, -0.96757052387814, 0.346370836378525, -1.08032147826294,
0.426215616848312, 0.55151485221263, 1.41306889485598, 0.0803478641720901,
-0.86654892295057, 1.00422341998656, 1.26214517662281, 0.359512373951839,
1.4835398594013, 0.154967053938309, -0.680501679226447, 1.44740598234453,
-0.512732029152914, -0.359512373951839)
I am hoping to be able to define the best fitting two part line (hand drawn example shown)
I then define a piecewise function that should find a two part linear function. The definition is based on the gradients of the two lines and their intercept with each other, which should completely define the lines.
# A=gradient of first line segment
# B=gradient of second line segment
# Cx=inflection point x coord
# Cy=inflexion point y coord
out_model <- nls(y ~ I(x <= Cx)*Cy-A*(Cx-x)+I(x > Cx)*Cy+B*(x),
data = data.frame(x,y),
start = c(A=-500,B=-500,Cx=0.0001,Cy=-1.5) )
However I get the error:
Error in nls(y ~ I(x <= Cx) * Cy - A * (Cx - x) + I(x > Cx) * Cy + B * :
singular gradient
I got the basic method from Finding a curve to match data
Any ideas where I am going wrong?
I don't have an elegant answer, but I do have an answer.
(SEE THE EDIT BELOW FOR A MORE ELEGANT ANSWER)
If Cx is small enough that there are no data points to fit A and Cy to, or if Cx is big enough that there are no data points to fit B and Cy to, the QR decomposition matrix will be singular because there will be many different values of Cx, A and Cy or Cx, B and Cy respectively that will fit the data equally well.
I tested this by preventing Cx from being fitted. If I fix Cx at (say) Cx = mean(x), nls() solves the problem without difficulty:
nls(y ~ ifelse(x < mean(x),ya+A*x,yb+B*x),
data = data.frame(x,y),
start = c(A=-1000,B=-1000,ya=3,yb=0))
... gives:
Nonlinear regression model
model: y ~ ifelse(x < mean(x), ya + A * x, yb + B * x)
data: data.frame(x, y)
A B ya yb
-1325.537 -1335.918 2.628 2.652
residual sum-of-squares: 0.06614
Number of iterations to convergence: 1
Achieved convergence tolerance: 2.294e-08
That led me to think that if I transformed Cx so that it could never go outside the range [min(x),max(x)], that might solve the problem. In fact, I'd want there to be at least three data points available to fit each of the "A" line and the "B" line, so Cx has to be between the third lowest and the third highest values of x. Using the atan() function with the appropriate arithmetic let me map a range [-inf,+inf] onto [0,1], so I got the code:
trans <- function(x) 0.5+atan(x)/pi
xs <- sort(x)
xlo <- xs[3]
xhi <- xs[length(xs)-2]
nls(y ~ ifelse(x < xlo+(xhi-xlo)*trans(f),ya+A*x,yb+B*x),
data = data.frame(x,y),
start = c(A=-1000,B=-1000,ya=3,yb=0,f=0))
Unfortunately, however, I still get the singular gradient matrix at initial parameters error from this code, so the problem is still over-parameterised. As #Henrik has suggested, the difference between the bilinear and single linear fit is not great for these data.
I can nevertheless get an answer for the bilinear fit, however. Since nls() solves the problem when Cx is fixed, I can now find the value of Cx that minimises the residual standard error by simply doing a one-dimensional minimisation using optimize(). Not a particularly elegant solution, but better than nothing:
xs <- sort(x)
xlo <- xs[3]
xhi <- xs[length(xs)-2]
nn <- function(f) nls(y ~ ifelse(x < xlo+(xhi-xlo)*f,ya+A*x,yb+B*x),
data = data.frame(x,y),
start = c(A=-1000,B=-1000,ya=3,yb=0))
ssr <- function(f) sum(residuals(nn(f))^2)
f = optimize(ssr,interval=c(0,1))
print (f$minimum)
print (nn(f$minimum))
summary(nn(f$minimum))
... gives output of:
[1] 0.8541683
Nonlinear regression model
model: y ~ ifelse(x < xlo + (xhi - xlo) * f, ya + A * x, yb + B * x)
data: data.frame(x, y)
A B ya yb
-1317.215 -872.002 2.620 1.407
residual sum-of-squares: 0.0414
Number of iterations to convergence: 1
Achieved convergence tolerance: 2.913e-08
Formula: y ~ ifelse(x < xlo + (xhi - xlo) * f, ya + A * x, yb + B * x)
Parameters:
Estimate Std. Error t value Pr(>|t|)
A -1.317e+03 1.792e+01 -73.493 < 2e-16 ***
B -8.720e+02 1.207e+02 -7.222 1.14e-07 ***
ya 2.620e+00 2.791e-02 93.854 < 2e-16 ***
yb 1.407e+00 3.200e-01 4.399 0.000164 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.0399 on 26 degrees of freedom
Number of iterations to convergence: 1
There isn't a huge difference between the values of A and B and ya and yb for the optimum value of f, but there is some difference.
(EDIT -- ELEGANT ANSWER)
Having separated the problem into two steps, it isn't necessary to use nls() any more. lm() works fine, as follows:
function (x,y)
{
f <- function (Cx)
{
lhs <- function(x) ifelse(x < Cx,Cx-x,0)
rhs <- function(x) ifelse(x < Cx,0,x-Cx)
fit <- lm(y ~ lhs(x) + rhs(x))
c(summary(fit)$r.squared,
summary(fit)$coef[1], summary(fit)$coef[2],
summary(fit)$coef[3])
}
r2 <- function(x) -(f(x)[1])
res <- optimize(r2,interval=c(min(x),max(x)))
res <- c(res$minimum,f(res$minimum))
best_Cx <- res[1]
coef1 <- res[3]
coef2 <- res[4]
coef3 <- res[5]
plot(x,y)
abline(coef1+best_Cx*coef2,-coef2) #lhs
abline(coef1-best_Cx*coef3,coef3) #rs
}
... which gives:
If the breakpoint is known it is possible to use linear regression
Broken stick regression from "Practical Regression and Anova using R"
Julian J. Faraway
December 2000
k <- 0.0025
lhs <- function(x) ifelse(x < k,k-x,0)
rhs <- function(x) ifelse(x < k,0,x-k)
fit <- lm(y ~ lhs(x) + rhs(x))
The package segmented was designed for this type of problem.
First, create a regular linear regression with lm:
linmod <- lm(y ~ x)
summary(linmod)
Which gives us:
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-0.108783 -0.025432 -0.006484 0.040092 0.088638
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.630e+00 2.732e-02 96.28 <2e-16 ***
x -1.326e+03 1.567e+01 -84.63 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.04869 on 28 degrees of freedom
Multiple R-squared: 0.9961, Adjusted R-squared: 0.996
F-statistic: 7163 on 1 and 28 DF, p-value: < 2.2e-16
Next, we use the linear model to produce a segmented model with 1 break point:
segmod <- segmented(linmod, seg.Z = ~x)
summary(segmod)
And the segmented model provides a slightly better r-squared:
***Regression Model with Segmented Relationship(s)***
Call:
segmented.lm(obj = linmod, seg.Z = ~x)
Estimated Break-Point(s):
Est. St.Err
0.003 0.000
Meaningful coefficients of the linear terms:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.659e+00 2.882e-02 92.239 <2e-16 ***
x -1.347e+03 1.756e+01 -76.742 <2e-16 ***
U1.x 5.167e+02 4.822e+02 1.072 NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.04582 on 26 degrees of freedom
Multiple R-Squared: 0.9968, Adjusted R-squared: 0.9964
Convergence attained in 3 iterations with relative change 0
You can check the plot, intercept and slope:
plot(segmod)
intercept(segmod)
slope(segmod)
Thank to Henrik for putting me on the right path!
Here's a more complete and relatively elegant solution with a simple plot:
range_x<-max(x)-min(x)
intervals=1000
coef1=c()
coef2=c()
coef3=c()
r2=c()
for (i in 1:intervals)
{
Cx<-min(x)+(i-1)*(range_x/intervals)
lhs <- function(x) ifelse(x < Cx,Cx-x,0)
rhs <- function(x) ifelse(x < Cx,0,x-Cx)
fit <- lm(y ~ lhs(x) + rhs(x))
coef1[i]<-summary(fit)$coef[1]
coef2[i]<-summary(fit)$coef[2]
coef3[i]<-summary(fit)$coef[3]
r2[i]<-summary(fit)$r.squared
}
best_r2<-max(r2) # get best r squared
pos<-which.max(r2)
best_Cx<-min(x)+(pos-1)*(range_x/intervals) # get Cx for best r2
plot(x,y)
abline(coef1[pos]+best_Cx*coef2[pos],-coef2[pos]) #lhs
abline(coef1[pos]-best_Cx*coef3[pos],coef3[pos]) #rs

nls line of best fit - how to force plotting of line?

I am trying to write a basic function to add some lines of best fit to plots using nls.
This works fine unless the data just happens to be defined exactly by the formula passed to nls. I'm aware of the issues and that this is documented behaviour as reported here.
My question though is how can I get around this and force a line of best fit to be plotted regardless of the data exactly being described by the model? Is there a way to detect the data matches exactly and plot the perfectly fitting curve? My current dodgy solution is:
#test data
x <- 1:10
y <- x^2
plot(x, y, pch=20)
# polynomial line of best fit
f <- function(x,a,b,d) {(a*x^2) + (b*x) + d}
fit <- nls(y ~ f(x,a,b,d), start = c(a=1, b=1, d=1))
co <- coef(fit)
curve(f(x, a=co[1], b=co[2], d=co[3]), add = TRUE, col="red", lwd=2)
Which fails with the error:
Error in nls(y ~ f(x, a, b, d), start = c(a = 1, b = 1, d = 1)) :
singular gradient
The easy fix I apply is to jitter the data slightly, but this seems a bit destructive and hackish.
# the above code works after doing...
y <- jitter(x^2)
Is there a better way?
Use Levenberg-Marquardt.
x <- 1:10
y <- x^2
f <- function(x,a,b,d) {(a*x^2) + (b*x) + d}
fit <- nls(y ~ f(x,a,b,d), start = c(a=1, b=0, d=0))
Error in nls(y ~ f(x, a, b, d), start = c(a = 1, b = 0, d = 0)) :
number of iterations exceeded maximum of 50
library(minpack.lm)
fit <- nlsLM(y ~ f(x,a,b,d), start = c(a=1, b=0, d=0))
summary(fit)
Formula: y ~ f(x, a, b, d)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 1 0 Inf <2e-16 ***
b 0 0 NA NA
d 0 0 NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0 on 7 degrees of freedom
Number of iterations to convergence: 1
Achieved convergence tolerance: 1.49e-08
Note that I had to adjust the starting values and the result is sensitive to starting values.
fit <- nlsLM(y ~ f(x,a,b,d), start = c(a=1, b=0.1, d=0.1))
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 1.000e+00 2.083e-09 4.800e+08 < 2e-16 ***
b -7.693e-08 1.491e-08 -5.160e+00 0.00131 **
d 1.450e-07 1.412e-08 1.027e+01 1.8e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.191e-08 on 7 degrees of freedom
Number of iterations to convergence: 3
Achieved convergence tolerance: 1.49e-08

Resources