R nls model fails to converge on Linux, not on macOS

I have a grouped/nested data frame on which I want to run an nls model fit for each group. This code used to work fine on one machine. Running it on another machine results in an error.
The minimally reproducible example is this:
third_order_polynomial <- function(video_bitrate, a, b, c, d = 0) {
  return(a * log10(video_bitrate)^3 + b * log10(video_bitrate)^2 + c * log10(video_bitrate) + d)
}
problematic_data = read.csv(text="video_name,video_target_bitrate,video_height,video_width,video_frame_rate,video_bitrate,video_duration,video_size,video_bitrate_log,score
water_netflix_200kbps_360p_59.94fps_vp9.mkv,200,360,640,59.94,201.18,9.994,0.2454337060546875,2.3035848038124636,1.31034482758621
water_netflix_750kbps_360p_59.94fps_vp9.mkv,750,360,640,59.94,735.98,9.994,0.8978740380859375,2.866866012696663,1.9655172413793105
water_netflix_2000kbps_720p_59.94fps_vp9.mkv,2000,720,1280,59.94,1737.01,9.994,2.119101311035156,3.239802318695986,2.68965517241379
water_netflix_7500kbps_1080p_59.94fps_vp9.mkv,7500,1080,1920,59.94,7374.88,9.994,8.9971375390625,3.8677549581062247,3.6551724137931
water_netflix_15000kbps_1080p_59.94fps_vp9.mkv,15000,1080,1920,59.94,14738.57,9.994,17.98062360595703,4.168455348432381,4.20689655172414
water_netflix_40000kbps_2160p_59.94fps_vp9.mkv,40000,2160,3840,59.94,37534.27,9.994,45.790709763183585,4.574427973737644,4.48275862068965
")
nls(
  formula = score ~ third_order_polynomial(video_bitrate, a, b, c, d),
  data = problematic_data,
  start = list(a = 1, b = 1, c = 1, d = 0),
  lower = list(a = -1, b = -1, c = -5, d = 0),
  upper = list(a = 5, b = 5, c = 5, d = 5),
  algorithm = "port"
)
#> Error in nls(formula = score ~ third_order_polynomial(video_bitrate, a, : Convergence failure: singular convergence (7)
Created on 2022-05-06 by the reprex package (v2.0.1)
The error appears under Ubuntu 20.04 using R 4.2.0 x86_64-pc-linux-gnu.
The error does not occur under macOS 12.3.1 using R 4.2.0 aarch64-apple-darwin20 (64-bit).
Is this a bug in the stats package? Some numerical instability? I guess I will create a bug report, but I'm wondering how I could prevent this.
Note that the start parameters have been chosen to get a good fit for the entire data set. The whole code is here (link to the particular revision that can also be reproducibly run by simply sourcing the script).

Your data are well-behaved, but nevertheless you're fitting 4 parameters to only 6 data points, so this kind of problem is not surprising. It's extremely common to get differences in convergence between different platforms/versions of R/etc.: tiny changes (compiler versions, compiler optimization flags, etc.) can easily flip a fitting attempt from just-barely-working to just-barely-failing.
Your starting values may be generally appropriate, but they're pretty far off for this data set (see plot below).
Do you need to use nls, or even non-linear fitting?
This model can be fitted easily with a linear model:
lm(score ~ poly(log10(video_bitrate), 3), data = problematic_data)
(this uses an orthogonal polynomial basis, which will be most numerically stable; you can probably get away with poly(log10(video_bitrate), 3, raw = TRUE) which will give you interpretable parameters). This doesn't easily allow you to incorporate constraints on the parameters, but for this particular data set the constraints don't seem to be binding.
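For illustration, here is a minimal sketch (my own addition, not part of the original answer) of how the raw-polynomial coefficients map back onto the a, b, c, d parameterisation of third_order_polynomial; lm() returns them in the order d, c, b, a:
fit_raw <- lm(score ~ poly(log10(video_bitrate), 3, raw = TRUE),
              data = problematic_data)
## reverse so the names line up: cubic term = a, ..., intercept = d
setNames(rev(coef(fit_raw)), c("a", "b", "c", "d"))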
The "singular convergence" warning from nls is notoriously opaque: see here:
A return with IV(1) = 7 occurs if a more favorable stopping test is not satisfied and if the algorithm thinks
f(x) − min{f(y): ||D(y−x)|| ≤ V(LMAXS)} < V(SCTOL)*|f(x)|,
where D is given by (4.1). When this test is satisfied, it appears that x has too many
degrees of freedom — and you should ponder whether f was properly formulated.
Default = max{10^(−10), MACHEP^(2/3)}
I don't think it's possible to adjust this tolerance from the R level. If you don't use algorithm = "port" you won't see this error (but you might be discarding a meaningful warning).
You could try nls.lm() from the minpack.lm package.
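For example, a sketch of that suggestion using the formula interface nlsLM() (rather than nls.lm() directly); the bounds from the question carry over as plain numeric vectors:
library(minpack.lm)
fitLM <- nlsLM(score ~ third_order_polynomial(video_bitrate, a, b, c, d),
               data = problematic_data,
               start = list(a = 1, b = 1, c = 1, d = 0),
               lower = c(-1, -1, -5, 0),
               upper = c(5, 5, 5, 5))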
You can also cook up your own least-squares minimizers (albeit slightly less efficiently) using optim or nlminb (you have to write an objective function that computes the residual sum of squares yourself).
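A minimal sketch of that route with nlminb, which also supports the box constraints from the question (my own illustration, not part of the original answer):
## residual sum of squares as an explicit objective function
rss <- function(p) {
  pred <- third_order_polynomial(problematic_data$video_bitrate,
                                 p[1], p[2], p[3], p[4])
  sum((problematic_data$score - pred)^2)
}
fit_nlminb <- nlminb(start = c(a = 1, b = 1, c = 1, d = 0),
                     objective = rss,
                     lower = c(-1, -1, -5, 0),
                     upper = c(5, 5, 5, 5))
fit_nlminb$par  # estimates under the same bounds as the nls() call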
library(ggplot2)
ggplot(problematic_data, aes(log10(video_bitrate), score)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3)) +
  ## starting values
  geom_function(fun = function(x) x^3 + x^2 + x, colour = "red") +
  scale_y_log10()

1) This is a linear model, i.e. linear in the coefficients, so we can use lm.
fm3 <- lm(formula = score ~ poly(log10(video_bitrate), 3), problematic_data)
2) Or use raw = TRUE so that the coefficients correspond to those in the question; fm3 and fm3r give the same predictions and are only parameterized differently. That is, all.equal(fitted(fm3), fitted(fm3r)) is TRUE.
fm3r <- lm(formula = score ~ poly(log10(video_bitrate), 3, raw = TRUE), problematic_data)
3) Also note that the model is not significantly different from the simpler score ~ poly(log10(video_bitrate), 1) model, which is equivalent to score ~ log10(video_bitrate) modulo parametrization (a quick check of this equivalence follows the anova output below).
fm1 <- lm(formula = score ~ poly(log10(video_bitrate), 1), data = problematic_data)
anova(fm1, fm3)
giving:
Analysis of Variance Table

Model 1: score ~ poly(log10(video_bitrate), 1)
Model 2: score ~ poly(log10(video_bitrate), 3)
  Res.Df      RSS Df Sum of Sq      F Pr(>F)
1      4 0.079747
2      2 0.012857  2  0.066891 5.2028 0.1612
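As a quick check of the equivalence noted in point 3 (my addition, not part of the original answer), the two parameterisations give identical fitted values:
fm1r <- lm(formula = score ~ log10(video_bitrate), data = problematic_data)
all.equal(fitted(fm1), fitted(fm1r))  # TRUE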
Graphics:
plot(score ~ log10(video_bitrate), problematic_data)
lines(fitted(fm3) ~ log10(video_bitrate), problematic_data)
lines(fitted(fm1) ~ log10(video_bitrate), problematic_data, lty = 2)
legend("topleft", c("fm3", "fm1"), lty = 1:2)
Update
Revised to include log10 which had been omitted. Added graphics. Removed some output for brevity.

Related

Syntax for three-piece segmented regression using NLS in R when concave

My goal is to fit a three-piece (i.e., two break-point) regression model to make predictions using propagate's predictNLS function, making sure to define knots as parameters, but my model formula seems off.
I've used the segmented package to estimate the breakpoint locations (used as starting values in NLS), but would like to keep my models in the NLS format, specifically nlsLM {minpack.lm}, because I am fitting other types of curves to my data using NLS, want to allow NLS to optimize the knot values, am sometimes using variable weights, and need to be able to easily calculate the Monte Carlo confidence intervals from propagate. Though I'm very close to having the right syntax for the formula, I'm not getting the expected/required behaviour near the breakpoint(s). The segments SHOULD meet directly at the breakpoints (without any jumps), but at least on this data, I'm getting a weird local minimum at the breakpoint (see plots below).
Below is an example of my data and general process. I believe my issue to be in the NLS formula.
library(minpack.lm)
library(segmented)
y <- c(-3.99448113, -3.82447011, -3.65447803, -3.48447030, -3.31447855, -3.14448753, -2.97447972, -2.80448401, -2.63448380, -2.46448069, -2.29448796, -2.12448912, -1.95448783, -1.78448797, -1.61448563, -1.44448719, -1.27448469, -1.10448651, -0.93448525, -0.76448637, -0.59448626, -0.42448586, -0.25448588, -0.08448548, 0.08551417, 0.25551393, 0.42551411, 0.59551395, 0.76551389, 0.93551398)
x <- c(61586.1711, 60330.5550, 54219.9925, 50927.5381, 48402.8700, 45661.9175, 37375.6023, 33249.1248, 30808.6131, 28378.6508, 22533.3782, 13901.0882, 11716.5669, 11004.7305, 10340.3429, 9587.7994, 8736.3200, 8372.1482, 8074.3709, 7788.1847, 7499.6721, 7204.3168, 6870.8192, 6413.0828, 5523.8097, 3961.6114, 3460.0913, 2907.8614, 2016.1158, 452.8841)
df <- data.frame(x, y)
#Use Segmented to get estimates for parameters with 2 breakpoints
my.seg2 <- segmented(lm(y ~ x, data = df), seg.Z = ~ x, npsi = 2)
#extract knot, intercept, and coefficient values to use as NLS start points
my.knot1 <- my.seg2$psi[1,2]
my.knot2 <- my.seg2$psi[2,2]
my.m_2 <- slope(my.seg2)$x[1,1]
my.b1 <- my.seg2$coefficients[[1]]
my.b2 <- my.seg2$coefficients[[2]]
my.b3 <- my.seg2$coefficients[[3]]
#Fit a NLS model to ~replicate segmented model. Presumably my model formula is where the problem lies
my.model <- nlsLM(
  y ~ m*x + b + (b2*(ifelse(x >= knot1 & x <= knot2, 1, 0)*(x - knot1)) +
                   (b3*ifelse(x > knot2, 1, 0)*(x - knot2 - knot1))),
  data = df,
  start = c(m = my.m_2, b = my.b1, b2 = my.b2, b3 = my.b3,
            knot1 = my.knot1, knot2 = my.knot2)
)
How it should look
plot(my.seg2)
How it does look
plot(x, y)
lines(x=x, y=predict(my.model), col='black', lty = 1, lwd = 1)
I was pretty sure I had it "right", but when the 95% confidence intervals are plotted along with the line and the prediction resolution (i.e., the density of x points) is increased, things look dramatically incorrect.
Thank you all for your help.
Define g to be a grouping vector having the same length as x which takes on values 1, 2, 3 for the 3 sections of the X axis and create an nls model from these. The resulting plot looks ok.
my.knots <- c(my.knot1, my.knot2)
g <- cut(x, c(-Inf, my.knots, Inf), label = FALSE)
fm <- nls(y ~ a[g] + b[g] * x, df, start = list(a = c(1, 1, 1), b = c(1, 1, 1)))
plot(y ~ x, df)
lines(fitted(fm) ~ x, df, col = "red")
(continued after graph)
Constraints
Although the above looks OK and may be sufficient, it does not guarantee that the segments intersect at the knots. To do that we must impose the constraints that both sides are equal at the knots:
a[2] + b[2] * my.knots[1] = a[1] + b[1] * my.knots[1]
a[3] + b[3] * my.knots[2] = a[2] + b[2] * my.knots[2]
so
a[2] = a[1] + (b[1] - b[2]) * my.knots[1]
a[3] = a[2] + (b[2] - b[3]) * my.knots[2]
= a[1] + (b[1] - b[2]) * my.knots[1] + (b[2] - b[3]) * my.knots[2]
giving:
# returns a vector of the three a values
avals <- function(a1, b) unname(cumsum(c(a1, -diff(b) * my.knots)))
fm2 <- nls(y ~ avals(a1, b)[g] + b[g] * x, df, start = list(a1 = 1, b = c(1, 1, 1)))
To get the three a values we can use:
co <- coef(fm2)
avals(co[1], co[-1])
To get the residual sum of squares:
deviance(fm2)
## [1] 0.193077
Polynomial
Although it involves a large number of parameters, a polynomial fit could be used in place of the segmented linear regression. A 12th-degree polynomial involves 13 parameters but has a lower residual sum of squares than the segmented linear regression. A lower degree could be used with a corresponding increase in the residual sum of squares. A 7th-degree polynomial involves 8 parameters and visually looks not too bad, although it has a higher residual sum of squares.
fm12 <- nls(y ~ cbind(1, poly(x, 12)) %*% b, df, start = list(b = rep(1, 13)))
deviance(fm12)
## [1] 0.1899218
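For reference, the 7th-degree fit mentioned above can be set up the same way (a sketch; only the degree and parameter count change):
fm7 <- nls(y ~ cbind(1, poly(x, 7)) %*% b, df, start = list(b = rep(1, 8)))
deviance(fm7)  # higher than deviance(fm12), as noted above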
The odd behaviour may, in part, reflect a limitation in segmented: segmented returns a single change-point value without quantifying the associated uncertainty. Redoing the analysis using mcp, which returns Bayesian posteriors, we see that the second change point is bimodally distributed:
library(mcp)
model = list(
  y ~ 1 + x,  # Intercept + slope in first segment
  ~ 0 + x,    # Only slope changes in the next segments
  ~ 0 + x
)
# Fit it with a large number of samples and plot the change point posteriors
fit = mcp(model, data = data.frame(x, y), iter = 50000, adapt = 10000)
plot_pars(fit, regex_pars = "^cp*", type = "dens_overlay")
FYI, mcp can plot credible intervals as well (the red dashed lines):
plot(fit, q_fit = TRUE)

Non linear model with five parameters (w/ nls R)

This is my first question, so please let me know if I'm doing anything wrong. We have a data frame with two variables and want to model EPR (egg production rate) as a function of temperature.
The relevant packages as per the nls page:
install.packages("tidyverse")
install.packages("nls.multstart")
install.packages("nlstools")
library(tidyverse)
library(nls.multstart)
library(nlstools)
The relevant variables from a larger df:
temp=c(9.2,9.9,12.7,12.8,14.3,14.5,16.3,16.5,18,18,19.6,19.6,19.9,19.9,22,22.4,23.2,23.4,25.3,25.6,27,27.3,28.5,30.3,20.9)
EPR=c(1.5,0,0,0,1.27,0.56,3.08,0.575,2.7,3.09,2,6.3,2,3.76,3.7,1.65,7.1,18.9,7.07,3.77,13.79,0,0,0.47,0)
df<-data.frame(temp,EPR)
Here I write the formula with the five parameters to be estimated (k1, a, b, k2, c); temp will be the x values. So far so good.
formula <- function(k1, a, b, k2, c, temp) {
  modelEPR <- k1*1/(1 + exp(-a*(temp - b))) - k2*exp(c*temp)
  return(modelEPR)
}
This is where I'm stuck; I'm already using quite narrow start_lower and start_upper ranges, since I now know the parameters from using the Excel solver somewhat successfully. The values I get with this method will get me a model, albeit quite an inaccurate one. Yes, I gave start_lower and start_upper a much greater range in the beginning, but that didn't yield any better results.
fit <- nls_multstart(EPR ~ formula(k1,a,b,k2,c,temp),
data = df,
iter = 100,
start_lower = c(k1 = 14, a = 0.3, b = 20, k2 = 0.02, c = 0.15),
start_upper = c(k1 = 15, a = 0.5, b = 21, k2 = 0.08, c = 0.24),
supp_errors = 'Y',
na.action = na.omit)
fit
As mentioned, I used the Excel solver to successfully make the model and got the parameter estimates, then tried to just manually insert them here in R, which makes for a much better model.
model<-df %>%
mutate(pred=(14.69/(1+exp(-0.41*(temp-20.52)))-0.05*exp(0.19 *temp))) %>%
ggplot()+
xlab("Temperature (°C)")+
ylab("EPR (Eggs per female per day")+
geom_point(aes(temp,EPR))+
geom_line(aes(temp,pred),col="red")
model
Ultimately, I have two questions:
a) What am I doing wrong? Or is it simply the data being weird? It seems to work better with Excel?!
b) How do I code the bridge between fit and model? fit will yield the 5 parameters, but how do I insert them directly into the model function? Can I utilize mutate somehow here?
Would appreciate any help!
A. Starting values and fitting model
To get starting values:
If k1 = 0 then we can rearrange the formula as follows and then use the result of fitting that linear model as a starting value for c.
log(EPR) ~ log(k2) + c * temp
b is a shift in temp and a is a scaling so choose b = mean(temp) and a = 1/sd(temp)
We can use algorithm = "plinear" to avoid having to specify starting values for the linear parameters, i.e. for k1 and k2. When using plinear the right hand side of the formula should be a matrix such that k1 times the first column plus k2 times the second column gives the predicted EPR.
This gives the following. Note that k1 and k2 will be represented by .lin1 and .lin2 in the nls output.
fm1 <- lm(log(EPR) ~ temp, df, subset = EPR > 0)
st2 <- list(c = coef(fm1)[[2]], a = 1/sd(df$temp), b = mean(df$temp))
fo2 <- EPR ~ cbind(1/(1+exp(-a*(temp-b))), -exp(c*temp))
fm2 <- nls(fo2, df, start = st2, algorithm = "plinear",
control = list(maxiter = 200))
deviance(fm2) # residual sum of squares
## [1] 333.6
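In the comparisons below, pred denotes the manual predictions from the question (the Excel-solver parameters); it can be recreated as:
pred <- with(df, 14.69/(1 + exp(-0.41*(temp - 20.52))) - 0.05*exp(0.19*temp))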
Note that this represents a lower (better) residual sum of squares than the fit shown in the question:
sum((df$EPR - pred)^2) # residual sum of squares for fit shown in question
## [1] 339.7
No packages were used.
We can plot the two fits, where the fit from the question is in blue and the fit done here is in red. From the plot there is some question whether the two large EPR values are outliers and whether they should be excluded.
plot(EPR ~ temp, df)
lines(fitted(fm2) ~ temp, df, subset = order(temp), col = "red")
lines(pred ~ temp, df, subset = order(temp), col = "blue")
[continued after screenshot]
B. Evaluating model at given parameters
For a given model expressed in formula notation we can evaluate it at given parameters using the nls2 package. nls2 takes arguments similar to those of nls, but if the starting value is a data frame with one row and the algorithm is "brute" then it simply returns the value of the right-hand side evaluated at the starting values. See ?nls2 for more information.
library(nls2)
fo <- EPR ~ k1*1/(1+exp(-a*(temp-b)))-k2*exp(c*temp)
st <- list(k1 = 14.69, a = 0.41, b = 20.52, k2 = 0.05, c = 0.19)
fm <- nls2(fo, df, start = data.frame(st), algorithm = "brute")
deviance(fm)
## [1] 339.7
fitted(fm) # predictions at parameter values given in st
or in terms of a function:
rhs <- function(a, b, c, k1, k2, temp) k1*1/(1+exp(-a*(temp-b)))-k2*exp(c*temp)
p <- do.call("rhs", c(st, list(temp = df$temp)))
all.equal(p, pred)
## [1] TRUE

ARIMA model with nonlinear exogenous variable in R

I'm doing a non-linear regression in R and want to add one moving-average term to my model to eliminate the autocorrelations in residuals.
Basically, here is the model:
y[n] = a + log((x1[n])^g + (x2[n])^g) + c*e[n-1] + e[n]
where [e] is the moving average term.
I plan to use ARIMA(0, 0, 1) to model the residuals. However, I do not know which function I should use in R to add a non-linear exogenous part to an ARIMA model.
More information: I know how to use the nls command to estimate a and g, but do not know how to deal with e[n].
I know that xreg in arima can handle an ARIMA model with linear exogenous variables. Is there a similar function to handle an ARIMA model with nonlinear exogenous variables?
Thank you for the help in advance!
nlme has this capability, as it fits non-linear mixed models. You can think of it as an extension of nls (a fixed-effects-only non-linear regression) that allows random effects and correlated errors.
nlme can handle ARMA correlation via something like correlation = corARMA(0.2, ~ 1, p = 0, q = 1, fixed = TRUE). This means that the residuals are an MA(1) process, with an initial coefficient guess of 0.2 to be updated during model fitting. The ~ 1 means the MA(1) structure applies along the observation order, with no further grouping structure.
I am not an expert in nlme, but I know nlme is what you need. I produce the following example, but since I am not an expert, I can't get nlme to work at the moment. I post it here to give a start / flavour.
set.seed(0)
x1 <- runif(100)
x2 <- runif(100)
## MA(1) correlated error, with innovation standard deviation 0.1
e <- arima.sim(model = list(ma = 0.5), n = 100, sd = 0.1)
## a true model, with `a = 0.2, g = 0.5`
y0 <- 0.2 + log(x1 ^ 0.5 + x2 ^ 0.5)
## observations
y <- y0 + e
## no need to install; it comes with R; just `library()` it
library(nlme)
fit <- nlme(y ~ a + log(x1 ^ g + x2 ^ g), fixed = a + g ~ 1,
start = list(a = 0.5, g = 1),
correlation = corARMA(0.2, form = ~ 1, p = 0, q = 1, fixed = FALSE))
Similar to nls, we have an overall model formula y ~ a + log(x1 ^ g + x2 ^ g), and starting values are required for the iteration process. I have chosen start = list(a = 0.5, g = 1). The correlation bit has been explained at the beginning.
The fixed and random arguments in nlme specify what should be treated as fixed effects and random effects in the overall formula. Since we have no random effects, we leave random unspecified. We want a and g as fixed effects, so I tried something like fixed = a + g ~ 1. Unfortunately it does not quite work, for a reason I don't know. I read ?nlme and thought this formula means that we want a common a and g for all observations, but nlme reports an error saying this is not a valid group formula.
I am still investigating this; as I said, the above gives us a start. We are already fairly close to the final answer.
Thanks to user20650 for pointing out my awkward error: I should use the gnls function rather than nlme. By the design of the nlme package, the functions lme and nlme have to take a random argument to work. Luckily, there are several other routines in the nlme package for extending linear models and non-linear models.
gls and gnls extend lm and nls by allowing correlated and/or heteroscedastic errors (via correlation structures and variance functions).
So, I should really use gnls instead:
## no `fixed` argument as `gnls` is a fixed-effect only
fit <- gnls(y ~ a + log(x1 ^ g + x2 ^ g), start = list(a = 0.5, g = 1),
correlation = corARMA(0.2, form = ~ 1, p = 0, q = 1, fixed = FALSE))
#Generalized nonlinear least squares fit
# Model: y ~ a + log(x1^g + x2^g)
# Data: NULL
# Log-likelihood: 92.44078
#
#Coefficients:
# a g
#0.1915396 0.5007640
#
#Correlation Structure: ARMA(0,1)
# Formula: ~1
# Parameter estimate(s):
# Theta1
#0.4184961
#Degrees of freedom: 100 total; 98 residual
#Residual standard error: 0.1050295
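As a rough diagnostic (my addition, not part of the original answer), one could check that little autocorrelation remains in the normalized residuals of the gnls fit:
## normalized residuals account for the fitted MA(1) structure
acf(residuals(fit, type = "normalized"))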

Exponential curve fitting in R

time = 1:100
head(y)
0.07841589 0.07686316 0.07534116 0.07384931 0.07238699 0.07095363
plot(time,y)
This is an exponential curve.
How can I fit a line to this curve without knowing the formula? I can't use nls as the formula is unknown (only the data points are given).
How can I get the equation for this curve and determine the constants in the equation?
I tried loess but it doesn't give the intercepts.
You need a model to fit to the data.
Without knowing the full details of your model, let's say that this is an exponential growth model, which one could write as: y = a * e^(r*t)
where y is your measured variable, t is the time at which it was measured, a is the value of y when t = 0, and r is the growth constant. We want to estimate a and r.
This is a non-linear problem because we want to estimate the exponent, r. However, in this case we can use some algebra and transform it into a linear equation by taking the log on both sides and solving (remember logarithmic rules), resulting in:
log(y) = log(a) + r * t
We can visualise this with an example, by generating a curve from our model, assuming some values for a and r:
t <- 1:100 # these are your time points
a <- 10 # assume the size at t = 0 is 10
r <- 0.1 # assume a growth constant
y <- a*exp(r*t) # generate some y observations from our exponential model
# visualise
par(mfrow = c(1, 2))
plot(t, y) # on the original scale
plot(t, log(y)) # taking the log(y)
So, for this case, we could explore two possibilities:
Fit our non-linear model to the original data (for example using the nls() function)
Fit our "linearised" model to the log-transformed data (for example using the lm() function)
Which option to choose (and there are more options) depends on what we think (or assume) is the data-generating process behind our data.
Let's illustrate with some simulations that include added noise (sampled from
a normal distribution), to mimic real data. Please look at this
StackExchange post
for the reasoning behind this simulation (pointed out by Alejo Bernardin's comment).
set.seed(12) # for reproducible results
# errors constant across time - additive
y_add <- a*exp(r*t) + rnorm(length(t), sd = 5000) # or: rnorm(length(t), mean = a*exp(r*t), sd = 5000)
# errors grow as y grows - multiplicative (constant on the log-scale)
y_mult <- a*exp(r*t + rnorm(length(t), sd = 1)) # or: rlnorm(length(t), mean = log(a) + r*t, sd = 1)
# visualise
par(mfrow = c(1, 2))
plot(t, y_add, main = "additive error")
lines(t, a*exp(t*r), col = "red")
plot(t, y_mult, main = "multiplicative error")
lines(t, a*exp(t*r), col = "red")
For the additive model, we could use nls(), because the error is constant across
t. When using nls() we need to specify some starting values for the optimization algorithm (try to "guesstimate" what these are, because nls() often struggles to converge on a solution).
add_nls <- nls(y_add ~ a*exp(r*t),
start = list(a = 0.5, r = 0.2))
coef(add_nls)
# a r
# 11.30876845 0.09867135
Using the coef() function we can get the estimates for the two parameters.
This gives us OK estimates, close to what we simulated (a = 10 and r = 0.1).
You could see that the error variance is reasonably constant across the range of the data, by plotting the residuals of the model:
plot(t, resid(add_nls))
abline(h = 0, lty = 2)
For the multiplicative error case (our y_mult simulated values), we should use lm() on log-transformed data, because
the error is constant on that scale instead.
mult_lm <- lm(log(y_mult) ~ t)
coef(mult_lm)
# (Intercept) t
# 2.39448488 0.09837215
To interpret this output, remember again that our linearised model is log(y) = log(a) + r*t, which is equivalent to a linear model of the form Y = β0 + β1 * X, where β0 is our intercept and β1 our slope.
Therefore, in this output (Intercept) is equivalent to log(a) of our model and t is the coefficient for the time variable, so equivalent to our r.
To meaningfully interpret the (Intercept) we can take its exponential (exp(2.39448488)), giving us ~10.96, which is quite close to our simulated value.
It's worth noting what would happen if we'd fit data where the error is multiplicative
using the nls function instead:
mult_nls <- nls(y_mult ~ a*exp(r*t), start = list(a = 0.5, r = 0.2))
coef(mult_nls)
# a r
# 281.06913343 0.06955642
Now we over-estimate a and under-estimate r
(Mario Reutter
highlighted this in his comment). We can visualise the consequence of using the wrong approach to fit our model:
# get the model's coefficients
lm_coef <- coef(mult_lm)
nls_coef <- coef(mult_nls)
# make the plot
plot(t, y_mult)
lines(t, a*exp(r*t), col = "brown", lwd = 5)
lines(t, exp(lm_coef[1])*exp(lm_coef[2]*t), col = "dodgerblue", lwd = 2)
lines(t, nls_coef[1]*exp(nls_coef[2]*t), col = "orange2", lwd = 2)
legend("topleft", col = c("brown", "dodgerblue", "orange2"),
legend = c("known model", "nls fit", "lm fit"), lwd = 3)
We can see how the lm() fit to log-transformed data was substantially better than the nls() fit on the original data.
You can again plot the residuals of this model, to see that the variance is not constant across the range of the data (we can also see this in the graphs above, where the spread of the data increases for higher values of t):
plot(t, resid(mult_nls))
abline(h = 0, lty = 2)
Unfortunately, taking the logarithm and fitting a linear model is not optimal. The reason is that the errors for large y-values weigh much more than those for small y-values when applying the exponential function to go back to the original model.
Here is one example:
f <- function(x){exp(0.3*x+5)}
squaredError <- function(a,b,x,y) {sum((exp(a*x+b)-f(x))^2)}
x <- 0:12
y <- f(x) * ( 1 + sample(-300:300,length(x),replace=TRUE)/10000 )
x
y
#--------------------------------------------------------------------
M <- lm(log(y)~x)
a <- unlist(M[1])[2]
b <- unlist(M[1])[1]
print(c(a,b))
squaredError(a,b,x,y)
approxPartAbl_a <- (squaredError(a+1e-8,b,x,y) - squaredError(a,b,x,y))/1e-8
for ( i in 0:10 )
{
eps <- -i*sign(approxPartAbl_a)*1e-5
print(c(eps,squaredError(a+eps,b,x,y)))
}
Result:
> f <- function(x){exp(0.3*x+5)}
> squaredError <- function(a,b,x,y) {sum((exp(a*x+b)-f(x))^2)}
> x <- 0:12
> y <- f(x) * ( 1 + sample(-300:300,length(x),replace=TRUE)/10000 )
> x
[1] 0 1 2 3 4 5 6 7 8 9 10 11 12
> y
[1] 151.2182 203.4020 278.3769 366.8992 503.5895 682.4353 880.1597 1186.5158 1630.9129 2238.1607 3035.8076 4094.6925 5559.3036
> #--------------------------------------------------------------------
>
> M <- lm(log(y)~x)
> a <- unlist(M[1])[2]
> b <- unlist(M[1])[1]
> print(c(a,b))
coefficients.x coefficients.(Intercept)
0.2995808 5.0135529
> squaredError(a,b,x,y)
[1] 5409.752
> approxPartAbl_a <- (squaredError(a+1e-8,b,x,y) - squaredError(a,b,x,y))/1e-8
> for ( i in 0:10 )
+ {
+ eps <- -i*sign(approxPartAbl_a)*1e-5
+ print(c(eps,squaredError(a+eps,b,x,y)))
+ }
[1] 0.000 5409.752
[1] -0.00001 5282.91927
[1] -0.00002 5157.68422
[1] -0.00003 5034.04589
[1] -0.00004 4912.00375
[1] -0.00005 4791.55728
[1] -0.00006 4672.70592
[1] -0.00007 4555.44917
[1] -0.00008 4439.78647
[1] -0.00009 4325.71730
[1] -0.0001 4213.2411
>
Perhaps one can try some numeric method, e.g. gradient search, to find the minimum of the squared error function.
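For instance, a minimal sketch (my addition) that hands the squaredError function defined above to optim, starting from the lm estimates:
opt <- optim(par = unname(c(a, b)),
             fn = function(p) squaredError(p[1], p[2], x, y))
opt$par    # refined a and b
opt$value  # no larger than squaredError(a, b, x, y)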
If it really is exponential, you can try taking the logarithm of your variable and fitting a linear model to that.
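A minimal sketch of that suggestion, using the time and y from the question (assuming the full y vector is available):
fit_log <- lm(log(y) ~ time)
exp(coef(fit_log)[1])  # estimate of the multiplicative constant a
coef(fit_log)[2]       # estimate of the rate r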

How to put a complicated equation into a R formula?

We have the diameter of trees as the predictor and tree height as the dependent variable. A number of different equations exist for this kind of data and we try to model some of them and compare the results.
However, we can't figure out how to correctly put one equation into the corresponding R formula format.
The trees data set in R can be used as an example.
data(trees)
df <- trees
df$h <- df$Height * 0.3048 #transform to metric system
df$dbh <- (trees$Girth * 0.3048) / pi #transform tree girth to diameter
First, the example of an equation that seems to work well:
form1 <- h ~ I(dbh ^ -1) + I( dbh ^ 2)
m1 <- lm(form1, data = df)
m1
Call:
lm(formula = form1, data = df)

Coefficients:
(Intercept)    I(dbh^-1)     I(dbh^2)
    27.1147      -5.0553       0.1124
Coefficients a, b and c are estimated, which is what we are interested in.
Now the problematic equation: h = 1.3 + dbh^2 / (a + b * dbh + c * dbh^2)
Trying to fit it like this:
form2 <- h ~ I(dbh ^ 2) / dbh + I(dbh ^ 2) + 1.3
gives an error:
m1 <- lm(form2, data = df)
Error in terms.formula(formula, data = data) :
  invalid model formula in ExtractVars
I guess this is because / is interpreted as a nested model and not an arithmetic operator?
This doesn't give an error:
form2 <- h ~ I(I(dbh ^ 2) / dbh + I(dbh ^ 2) + 1.3)
m1 <- lm(form2, data = df)
But the result is not the one we want:
m1
Call:
lm(formula = form2, data = df)
Coefficients:
                     (Intercept)  I(I(dbh^2)/dbh + I(dbh^2) + 1.3)
                         19.3883                            0.8727
Only one coefficient is given for the whole term within the outer I(), which seems logical.
How can we fit the second equation to our data?
Assuming you are using nls, the R formula can use an ordinary R function, H(a, b, c, D), so the formula can be just h ~ H(a, b, c, dbh), and this works:
# use lm to get startingf values
lm1 <- lm(1/(h - 1.3) ~ I(1/dbh) + I(1/dbh^2), df)
start <- rev(setNames(coef(lm1), c("c", "b", "a")))
# run nls
H <- function(a, b, c, D) 1.3 + D^2 / (a + b * D + c * D^2)
nls1 <- nls(h ~ H(a, b, c, dbh), df, start = start)
nls1 # display result
Graphing the output:
plot(h ~ dbh, df)
lines(fitted(nls1) ~ dbh, df)
You've got a couple of problems: (1) you're missing parentheses around the denominator of form2 (and R has no way to know that you want to add a constant a in the denominator, or where to put any of the parameters, really), and, much more problematic, (2) your second model isn't linear, so lm won't work.
Fixing (1) is easy:
form2 <- h ~ 1.3 + I(dbh^2) / (a + b * dbh + c * I(dbh^2))
To fix (2): though there are many ways to estimate parameters for a nonlinear model, nls (nonlinear least squares) is a good place to start:
m2 <- nls(form2, data = df, start = list(a = 1, b = 1, c = 1))
You need to provide starting guesses for the parameters in nls. I just picked 1's, but you should use better guesses that ballpark what the parameters might be.
edit: fixed, no longer incorrectly using offset ...
An answer that complements #shujaa's:
You can transform your problem from
H = 1.3 + D^2/(a+b*D+c*D^2)
to
1/(H-1.3) = a/D^2+b/D+c
This would normally mess up the assumptions of the model (i.e., if H were normally distributed with constant variance, then 1/(H-1.3) wouldn't be). However, let's try it anyway:
data(trees)
df <- transform(trees,
h=Height * 0.3048, #transform to metric system
dbh=Girth * 0.3048 / pi #transform tree girth to diameter
)
lm(1/(h-1.3) ~ poly(I(1/dbh),2,raw=TRUE),data=df)
## Coefficients:
## (Intercept) poly(I(1/dbh), 2, raw = TRUE)1
## 0.043502 -0.006136
## poly(I(1/dbh), 2, raw = TRUE)2
## 0.010792
These results would normally be good enough to get good starting values for the nls fit. However, you can do better than that via glm, which uses a link function to allow for some forms of non-linearity. Specifically,
(fit2 <- glm(h-1.3 ~ poly(I(1/dbh),2,raw=TRUE),
family=gaussian(link="inverse"),data=df))
## Coefficients:
## (Intercept) poly(I(1/dbh), 2, raw = TRUE)1
## 0.041795 -0.002119
## poly(I(1/dbh), 2, raw = TRUE)2
## 0.008175
##
## Degrees of Freedom: 30 Total (i.e. Null); 28 Residual
## Null Deviance: 113.2
## Residual Deviance: 80.05 AIC: 125.4
##
You can see that the results are approximately the same as the linear fit, but not quite.
pframe <- data.frame(dbh=seq(0.8,2,length=51))
We use predict, but need to correct the prediction to account for the fact that we subtracted a constant from the LHS:
pframe$h <- predict(fit2,newdata=pframe,type="response")+1.3
p2 <- predict(fit2,newdata=pframe,se.fit=TRUE) ## predict on link scale
pframe$h_lwr <- with(p2,1/(fit+1.96*se.fit))+1.3
pframe$h_upr <- with(p2,1/(fit-1.96*se.fit))+1.3
png("dbh_tmp1.png",height=4,width=6,units="in",res=150)
par(las=1,bty="l")
plot(h~dbh,data=df)
with(pframe,lines(dbh,h,col=2))
with(pframe,polygon(c(dbh,rev(dbh)),c(h_lwr,rev(h_upr)),
border=NA,col=adjustcolor("black",alpha=0.3)))
dev.off()
Because we have used the constant on the LHS, this doesn't fit perfectly into ggplot's geom_smooth framework. (It almost, but doesn't quite, fit into the framework of using an offset: we could only use an offset if our formula were 1/H - 1.3 = a/D^2 + ..., i.e. if the constant adjustment were on the link (inverse) scale rather than the original scale.)
library("ggplot2")
ggplot(df,aes(dbh,h))+geom_point()+theme_bw()+
geom_line(data=pframe,colour="red")+
geom_ribbon(data=pframe,colour=NA,alpha=0.3,
aes(ymin=h_lwr,ymax=h_upr))
ggsave("dbh_tmp2.png",height=4,width=6)
