R smooth.spline(): smoothing spline is not smooth but overfits my data

I have several data points that seem suitable for fitting a spline through them. When I do this, I get a rather bumpy fit, as if the curve were overfitting, which is not what I understand as smoothing.
Is there a special option or parameter for getting back a really smooth spline function?
Using the penalty parameter of smooth.spline didn't have any visible effect. Maybe I did it wrong?
Here are data and code:
results <- structure(
list(
beta = c(
0.983790622281964, 0.645152464354322,
0.924104713597375, 0.657703886566088, 0.788138034115623, 0.801080207252363,
1, 0.858337365965949, 0.999687052533693, 0.666552625121279, 0.717453633245958,
0.621570152961453, 0.964658181346544, 0.65071758770312, 0.788971505000918,
0.980476054183113, 0.670263506919246, 0.600387040967624, 0.759173403408052,
1, 0.986409675965, 0.982996471134736, 1, 0.995340781899163, 0.999855895958986,
1, 0.846179233381267, 0.879226324448832, 0.795820998892035, 0.997586607285667,
0.848036806290156, 0.905320944437968, 0.947709125535428, 0.592172373022407,
0.826847031044922, 0.996916006944244, 0.785967729206612, 0.650346929853076,
0.84206351833549, 0.999043126652724, 0.936879214753098, 0.76674066557003,
0.591431233516217, 1, 0.999833445117791, 0.999606223666537, 0.6224971799303,
1, 0.974537160571494, 0.966717133936379
), inventoryCost = c(
1750702.95138889,
442784.114583333, 1114717.44791667, 472669.357638889, 716895.920138889,
735396.180555556, 3837320.74652778, 872873.4375, 2872414.93055556,
481095.138888889, 538125.520833333, 392199.045138889, 1469500.95486111,
459873.784722222, 656220.486111111, 1654143.83680556, 437511.458333333,
393295.659722222, 630952.170138889, 4920958.85416667, 1723517.10069444,
1633579.86111111, 4639909.89583333, 2167748.35069444, 3062420.65972222,
5132702.34375, 838441.145833333, 937659.288194444, 697767.1875,
2523016.31944444, 800903.819444444, 1054991.49305556, 1266970.92013889,
369537.673611111, 764995.399305556, 2322879.6875, 656021.701388889,
458403.038194444, 844133.420138889, 2430700, 1232256.68402778,
695574.479166667, 351348.524305556, 3827440.71180556, 3687610.41666667,
2950652.51736111, 404550.78125, 4749901.64930556, 1510481.59722222,
1422708.07291667
)
), .Names = c("beta", "inventoryCost"), class = c("data.frame")
)
plot(results$beta,results$inventoryCost)
mySpline <- smooth.spline(results$beta,results$inventoryCost, penalty=999999)
lines(mySpline$x, mySpline$y, col="red", lwd = 2)

Transform your data sensibly before modelling
Based on the scale of your results$inventoryCost, a log transform is appropriate. For simplicity, in the following I use x and y, and I reorder your data so that x is ascending:
x <- results$beta; y <- log(results$inventoryCost)
reorder <- order(x); x <- x[reorder]; y <- y[reorder]
par(mfrow = c(1,2))
plot(x, y, main = "take log transform")
hist(x, main = "x is skewed")
The left figure looks better, right? Also, it is highly recommended to further transform x, because it is skewed! (see the right figure).
The following transform is appropriate:
x1 <- -(1-x)^(1/3)
The cube root of (1 - x) makes the data more spread out around x = 1. I added a minus sign so that there is a positively monotonic relation between x and x1 rather than a negative one. Now let's check the relationship:
par(mfrow = c(1,2))
plot(x1, y, main = expression(y %~% ~ x1))
hist(x1, main = "x1 is well spread out")
Fitting a spline
Now we are ready for statistical modelling. Try the following call:
fit <- smooth.spline(x1, y, nknots = 10)
pred <- stats:::predict.smooth.spline(fit, x1)$y ## predict at all x1
## or you can simply call: pred <- predict(fit, x1)$y
plot(x1, y) ## scatter plot
lines(x1, pred, lwd = 2, col = 2) ## fitted spline
Does it look nice? Note that nknots = 10 tells smooth.spline to place 10 interior knots (by quantile); we are therefore fitting a penalized regression spline rather than a smoothing spline. In fact, smooth.spline() almost never fits a true smoothing spline unless you set all.knots = TRUE (see the later example).
I also dropped penalty = 999999, as that has nothing to do with smoothness control. If you really want to control smoothness yourself, rather than letting smooth.spline figure out the optimal amount by GCV, use the argument df or spar. I will give an example later.
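For instance (a sketch; the particular df and spar values here are arbitrary, only to show the syntax):
fit.df   <- smooth.spline(x1, y, nknots = 10, df = 5)     # fix the effective degrees of freedom
fit.spar <- smooth.spline(x1, y, nknots = 10, spar = 0.8) # or fix the smoothing parameter directly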
To transform fit back to original scale, do:
plot(x, exp(y), main = expression(Inventory %~%~ beta))
lines(x, exp(pred), lwd = 2, col = 2)
As you can see, the fitted spline is as smooth as you had expected.
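If you need predictions on a regular grid of beta rather than only at the observed points, apply the same transform to the grid first (a sketch; the grid length is arbitrary):
beta.grid <- seq(min(x), max(x), length.out = 200)  # grid on the original beta scale
x1.grid <- -(1 - beta.grid)^(1/3)                   # the same transform used for fitting
pred.grid <- predict(fit, x1.grid)$y                # predict on the transformed scale
lines(beta.grid, exp(pred.grid), lwd = 2, col = 4)  # back-transform and overlay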
Explanation of the fitted spline
Let's see the summary of your fitted spline:
> fit
Smoothing Parameter spar= 0.4549062 lambda= 0.0008657722 (11 iterations)
Equivalent Degrees of Freedom (Df): 6.022959
Penalized Criterion: 0.08517417
GCV: 0.004288539
We used 10 knots, ending up with about 6 degrees of freedom, so the penalization suppresses roughly 4 parameters. The smoothing parameter chosen by GCV, after 11 iterations, is lambda = 0.0008657722.
Why do we have to transform x to x1?
A spline is penalized by its 2nd derivative, but the penalty is on the averaged/integrated squared 2nd derivative over all data points. Now, look at your data (x, y). For x below 0.98, the relationship is relatively steady; as x approaches 1, the relationship quickly becomes steeper. The "change point" near 0.98 has a very high second derivative, much higher than the second derivatives at other locations.
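In notation (a standard way of writing the smoothing-spline criterion, up to weights and the internal scaling used by smooth.spline):
\min_{f} \; \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 + \lambda \int f''(t)^2 \, dt
A single location that requires a very large |f''(t)| contributes disproportionately to the integrated penalty, so the fit either refuses to bend there or, if forced to (small lambda / high df), becomes wiggly everywhere else.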
y0 <- as.numeric(tapply(y, x, mean)) ## remove tied values
x0 <- unique(x) ## remove tied values
dy0 <- diff(y0)/diff(x0) ## 1st order difference
ddy0 <- diff(dy0)/diff(x0[-1]) ## 2nd order difference
plot(x0[1:43], abs(ddy0), pch = 19)
Look at that huge spike in 2nd order difference/derivative! Now, if we fit a spline directly, the spline curve around this change point will be heavily penalized.
bad <- smooth.spline(x, y, all.knots = TRUE)
bad.pred <- predict(bad, x)$y
plot(x, exp(y), main = expression(Inventory %~% ~ beta))
lines(x, exp(bad.pred), col = 2, lwd = 3)
abline(v = 0.98, lwd = 2, lty = 2)
You can see clearly that the spline is having some difficulty in approximating data after x = 0.98.
There are of course ways to achieve a better approximation after this change point, for example by manually setting a smaller smoothing parameter or a higher degree of freedom. But then we go to the other extreme. Remember, both the penalty and the degrees of freedom are global measures: increasing the model complexity gives a better approximation after x = 0.98, but also makes the other parts bumpier. Now let's try a model with 45 degrees of freedom:
worse <- smooth.spline(x, y, all.knots = TRUE, df = 45)
worse.pred <- predict(worse, x)$y
plot(x, exp(y), main = expression(Inventory %~% ~ beta))
lines(x, exp(worse.pred), col = 2, lwd = 2)
As you can see, the curve is bumpy. Of course it is: with 45 degrees of freedom we have overfitted our dataset of 50 points.
In fact, your original misuse of smooth.spline() does exactly the same thing:
> mySpline
Call:
smooth.spline(x = results$beta, y = results$inventoryCost, penalty = 999999)
Smoothing Parameter spar= -0.8074624 lambda= 3.266077e-19 (17 iterations)
Equivalent Degrees of Freedom (Df): 45
Penalized Criterion: 5.598386
GCV: 0.03824885
Oops: 45 degrees of freedom, overfitting!

I don't think you should use / want splinefun. I would suggest fitting a GAM instead:
library(mgcv)
fit <- gam(inventoryCost ~ s(beta, bs = "cr", k = 20), data = results)
summary(fit)
gam.check(fit)
plot(fit)
plot(inventoryCost ~ beta, data = results, col = "dark red", pch = 16)
curve(predict(fit, newdata = data.frame(beta = x)), add = TRUE,
from = min(results$beta), to = max(results$beta), n = 1e3, lwd = 2)
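If you also want a rough confidence band around the GAM fit, predict.gam can return standard errors; a sketch (the grid size and the 2-standard-error band are arbitrary choices):
newd <- data.frame(beta = seq(min(results$beta), max(results$beta), length.out = 200))
pr <- predict(fit, newdata = newd, se.fit = TRUE)
lines(newd$beta, pr$fit + 2 * pr$se.fit, lty = 2)
lines(newd$beta, pr$fit - 2 * pr$se.fit, lty = 2)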

Syntax for three-piece segmented regression using NLS in R when concave

My goal is to fit a three-piece (i.e., two break-point) regression model to make predictions using propagate's predictNLS function, making sure to define knots as parameters, but my model formula seems off.
I've used the segmented package to estimate the breakpoint locations (used as starting values in NLS), but I would like to keep my models in the NLS format, specifically nlsLM {minpack.lm}, because I am fitting other types of curves to my data using NLS, I want to allow NLS to optimize the knot values, I sometimes use variable weights, and I need to be able to easily calculate the Monte Carlo confidence intervals from propagate. Though I'm very close to having the right syntax for the formula, I'm not getting the expected/required behaviour near the breakpoint(s). The segments SHOULD meet directly at the breakpoints (without any jumps), but at least on this data I'm getting a weird local minimum at the breakpoint (see plots below).
Below is an example of my data and general process. I believe my issue to be in the NLS formula.
library(minpack.lm)
library(segmented)
y <- c(-3.99448113, -3.82447011, -3.65447803, -3.48447030, -3.31447855, -3.14448753, -2.97447972, -2.80448401, -2.63448380, -2.46448069, -2.29448796, -2.12448912, -1.95448783, -1.78448797, -1.61448563, -1.44448719, -1.27448469, -1.10448651, -0.93448525, -0.76448637, -0.59448626, -0.42448586, -0.25448588, -0.08448548, 0.08551417, 0.25551393, 0.42551411, 0.59551395, 0.76551389, 0.93551398)
x <- c(61586.1711, 60330.5550, 54219.9925, 50927.5381, 48402.8700, 45661.9175, 37375.6023, 33249.1248, 30808.6131, 28378.6508, 22533.3782, 13901.0882, 11716.5669, 11004.7305, 10340.3429, 9587.7994, 8736.3200, 8372.1482, 8074.3709, 7788.1847, 7499.6721, 7204.3168, 6870.8192, 6413.0828, 5523.8097, 3961.6114, 3460.0913, 2907.8614, 2016.1158, 452.8841)
df<- data.frame(x,y)
#Use Segmented to get estimates for parameters with 2 breakpoints
my.seg2 <- segmented(lm(y ~ x, data = df), seg.Z = ~ x, npsi = 2)
#extract knot, intercept, and coefficient values to use as NLS start points
my.knot1 <- my.seg2$psi[1,2]
my.knot2 <- my.seg2$psi[2,2]
my.m_2 <- slope(my.seg2)$x[1,1]
my.b1 <- my.seg2$coefficients[[1]]
my.b2 <- my.seg2$coefficients[[2]]
my.b3 <- my.seg2$coefficients[[3]]
#Fit a NLS model to ~replicate segmented model. Presumably my model formula is where the problem lies
my.model <- nlsLM(y ~ m*x + b + (b2 * (ifelse(x >= knot1 & x <= knot2, 1, 0) * (x - knot1)) + (b3 * ifelse(x > knot2, 1, 0) * (x - knot2 - knot1))),
                  data = df,
                  start = c(m = my.m_2, b = my.b1, b2 = my.b2, b3 = my.b3, knot1 = my.knot1, knot2 = my.knot2))
How it should look
plot(my.seg2)
How it does look
plot(x, y)
lines(x=x, y=predict(my.model), col='black', lty = 1, lwd = 1)
I was pretty sure I had it "right", but when the 95% confidence intervals are plotted with the line and the prediction resolution (i.e., the density of x points) is increased, things look dramatically incorrect.
Thank you all for your help.
Define g to be a grouping vector having the same length as x which takes on values 1, 2, 3 for the 3 sections of the X axis and create an nls model from these. The resulting plot looks ok.
my.knots <- c(my.knot1, my.knot2)
g <- cut(x, c(-Inf, my.knots, Inf), labels = FALSE)
fm <- nls(y ~ a[g] + b[g] * x, df, start = list(a = c(1, 1, 1), b = c(1, 1, 1)))
plot(y ~ x, df)
lines(fitted(fm) ~ x, df, col = "red")
Constraints
Although the above looks ok and may be sufficient it does not guarantee that the segments intersect at the knots. To do that we must impose the constraints that both sides are equal at the knots:
a[2] + b[2] * my.knots[1] = a[1] + b[1] * my.knots[1]
a[3] + b[3] * my.knots[2] = a[2] + b[2] * my.knots[2]
so
a[2] = a[1] + (b[1] - b[2]) * my.knots[1]
a[3] = a[2] + (b[2] - b[3]) * my.knots[2]
= a[1] + (b[1] - b[2]) * my.knots[1] + (b[2] - b[3]) * my.knots[2]
giving:
# returns a vector of the three a values
avals <- function(a1, b) unname(cumsum(c(a1, -diff(b) * my.knots)))
fm2 <- nls(y ~ avals(a1, b)[g] + b[g] * x, df, start = list(a1 = 1, b = c(1, 1, 1)))
To get the three a values we can use:
co <- coef(fm2)
avals(co[1], co[-1])
To get the residual sum of squares:
deviance(fm2)
## [1] 0.193077
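As a quick sanity check (a sketch reusing co, avals and my.knots from above), the two sides of each knot should now give the same fitted value:
b <- co[-1]                # the three slopes
a <- avals(co[1], b)       # the three intercepts implied by the constraints
c(a[1] + b[1] * my.knots[1], a[2] + b[2] * my.knots[1])  # equal: segments meet at knot 1
c(a[2] + b[2] * my.knots[2], a[3] + b[3] * my.knots[2])  # equal: segments meet at knot 2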
Polynomial
Although it involves a larger number of parameters, a polynomial fit could be used in place of the segmented linear regression. A 12th-degree polynomial involves 13 parameters but has a lower residual sum of squares than the segmented linear regression. A lower degree could be used, with a corresponding increase in the residual sum of squares; a 7th-degree polynomial involves 8 parameters and visually looks not too bad, although it has a higher residual sum of squares.
fm12 <- nls(y ~ cbind(1, poly(x, 12)) %*% b, df, start = list(b = rep(1, 13)))
deviance(fm12)
## [1] 0.1899218
It may, in part, reflect a limitation in segmented. segmented returns a single change point value without quantifying the associated uncertainty. Redoing the analysis using mcp which returns Bayesian posteriors, we see that the second change point is bimodally distributed:
library(mcp)
model = list(
y ~ 1 + x, # Intercept + slope in first segment
~ 0 + x, # Only slope changes in the next segments
~ 0 + x
)
# Fit it with a large number of samples and plot the change point posteriors
fit = mcp(model, data = data.frame(x, y), iter = 50000, adapt = 10000)
plot_pars(fit, regex_pars = "^cp*", type = "dens_overlay")
FYI, mcp can plot credible intervals as well (the red dashed lines):
plot(fit, q_fit = TRUE)

get the derivative of an ECDF

Is it possible to differentiate an ECDF? Take the one obtained in the following as an example.
set.seed(1)
a <- sort(rnorm(100))
b <- ecdf(a)
plot(b)
I would like to take the derivative of b in order to obtain its probability density function (PDF).
n <- length(a) ## `a` must be sorted in non-decreasing order already
plot(a, 1:n / n, type = "s") ## "staircase" plot; not "line" plot
However I'm looking to find the derivative of b
In sample-based statistics, an estimated density (for a continuous random variable) is not obtained from the ECDF by differentiation, because the sample size is finite and the ECDF is not differentiable. Instead, we estimate the density directly. I guess plot(density(a)) is what you are really looking for.
a few days later..
Warning: the following is just a numerical solution without statistical grounding!
I take it as an exercise to learn about the R package scam for shape-constrained additive models, a companion package to mgcv written by Dr Pya, an early PhD student of Prof Wood.
The logic is as follows:
using scam::scam, fit a monotonically increasing P-spline to the ECDF (you have to specify how many knots you want); [Note that monotonicity is not the only theoretical constraint: the smoothed ECDF should also be "clipped" at its two edges, at 0 on the left and 1 on the right. I currently impose this constraint via weights, by giving very large weights to the two edge points.]
using stats::splinefun, reparametrize the fitted spline as a monotone interpolation spline through the knots and the predicted values at the knots;
return the interpolation spline function, which can also evaluate 1st, 2nd and 3rd derivatives.
Why I expect this to work:
As sample size grows,
ECDF converges to CDF;
P-spline is consistent so a smoothed ECDF will be increasingly unbiased for ECDF;
the 1st derivative of smoothed ECDF will be increasingly unbiased for PDF.
Use with caution:
You have to choose the number of knots yourself;
the derivative is NOT normalized, so the area under the curve is not exactly 1;
the result can be rather unstable, and is only good for large sample sizes.
function arguments:
x: a vector of samples;
n.knots: number of knots;
n.cells: number of grid points when plotting derivative function
You need to install scam package from CRAN.
library(scam)
test <- function (x, n.knots, n.cells) {
  ## get ECDF
  n <- length(x)
  x <- sort(x)
  y <- 1:n / n
  dat <- data.frame(x = x, y = y) ## make sure `scam` can find `x` and `y`
  ## fit a monotonically increasing P-spline for ECDF
  fit <- scam::scam(y ~ s(x, bs = "mpi", k = n.knots), data = dat,
                    weights = c(n, rep(1, n - 2), 10 * n))
  ## interior knots
  xk <- with(fit$smooth[[1]], knots[4:(length(knots) - 3)])
  ## spline values at interior knots
  yk <- predict(fit, newdata = data.frame(x = xk))
  ## reparametrization into a monotone interpolation spline
  f <- stats::splinefun(xk, yk, "hyman")
  par(mfrow = c(1, 2))
  plot(x, y, pch = 19, col = "gray") ## ECDF
  lines(x, f(x), type = "l")         ## smoothed ECDF
  title(paste0("number of knots: ", n.knots,
               "\neffective degree of freedom: ", round(sum(fit$edf), 2)),
        cex.main = 0.8)
  xg <- seq(min(x), max(x), length = n.cells)
  plot(xg, f(xg, 1), type = "l")    ## density estimated by scam
  lines(stats::density(x), col = 2) ## a proper density estimate by density
  ## return smooth ECDF function
  f
}
## try large sample size
set.seed(1)
x <- rnorm(1000)
f <- test(x, n.knots = 20, n.cells = 100)
f is a function as returned by stats::splinefun (read ?splinefun).
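For example (a sketch; f was fitted above to samples from N(0, 1)):
f(0)             # smoothed ECDF at 0; should be close to 0.5
f(0, deriv = 1)  # estimated density at 0; should be close to dnorm(0) = 0.3989
## the derivative is not normalised; see how far the area under it is from 1
integrate(function(z) f(z, deriv = 1), min(x), max(x))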
A naive, similar solution is to fit an interpolation spline to the ECDF without any smoothing. But this is a very bad idea, as it has no consistency.
g <- splinefun(sort(x), 1:length(x) / length(x), method = "hyman")
curve(g(x, deriv = 1), from = -3, to = 3)
A reminder: it is highly recommended to use stats::density for a direct density estimation.

Why does the predicted polynomial change drastically when only the resolution of the prediction grid changes?

Why do I get different predictions from the exact same model when I only change the prediction grid size (steps of 0.001 vs 0.01)?
set.seed(0)
n_data=2000
x=runif(n_data)-0.5
y=0.1*sin(x*30)/x+runif(n_data)
plot(x,y)
poly_df=5
x_exp=as.data.frame(cbind(y,poly(x, poly_df)))
fit=lm(y~.,data=x_exp)
x_plt1=seq(-1,1,0.001)
x_plt_exp1=as.data.frame(poly(x_plt1,poly_df))
lines(x_plt1,predict(fit,x_plt_exp1),lwd=3,col=2)
x_plt2=seq(-1,1,0.01)
x_plt_exp2=as.data.frame(poly(x_plt2,poly_df))
lines(x_plt2,predict(fit,x_plt_exp2),lwd=3,col=3)
This is a coding / programming problem: in my quick run I can't reproduce it with the appropriate set-up, i.e. putting poly() inside the model formula. So I think this question is better suited for Stack Overflow.
## quick test ##
set.seed(0)
x <- runif(2000) - 0.5
y <- 0.1 * sin(x * 30) / x + runif(2000)
plot(x,y)
x_exp <- data.frame(x, y)
fit <- lm(y ~ poly(x, 5), data = x_exp)
x1 <- seq(-1, 1, 0.001)
y1 <- predict(fit, newdata = list(x = x1))
lines(x1, y1, lwd = 5, col = 2)
x2 <- seq(-1, 1, 0.01)
y2 <- predict(fit, newdata = list(x = x2))
lines(x2, y2, lwd = 2, col = 3)
cuttlefish44 has pointed out the fault in your implementation: when making the prediction matrix, we want to reuse the construction information stored in the model matrix, rather than constructing a new set of basis functions. If you wonder what such "construction information" is, perhaps you can go through this very long answer: How poly() generates orthogonal polynomials? How to understand the "coefs" returned?
Perhaps I can give a brief summary here so that you can get around that long, detailed answer.
The construction of an orthogonal polynomial basis always starts by centring the input covariate values x. If this centre is different, everything else will be different. This is the difference between poly(x, coefs = NULL) and poly(x, coefs = some_coefficients): the former always constructs a new set of basis functions using a new centre, while the latter uses the existing centring information in some_coefficients to evaluate basis values for the given set-up. The latter is surely what we want when making predictions.
poly(x, coefs = some_coefficients) actually calls predict.poly (which I explained in that long answer). It is relatively rare that we need to set the coefs argument ourselves, unless we are doing testing. If we set up the linear model the way I did in my quick run above, predict.lm is smart enough to realize the correct way to predict poly model terms, i.e., internally it does poly(new_x, coefs = some_coefficients) for us.
As an interesting contrast, ordinary (raw) polynomials don't have this problem. For example, if you specify raw = TRUE in all poly() calls in your code, you will have no trouble, because a raw polynomial carries no construction information; it just takes powers 1, 2, ..., degree of x.
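A small demonstration of this equivalence (a sketch; new_x is an arbitrary set of points, and x is from the quick test above):
p <- poly(x, 5)                                 # basis built from the data; stores its "coefs"
new_x <- c(-0.25, 0, 0.25)
b1 <- predict(p, new_x)                         # dispatches to predict.poly, reusing the stored coefs
b2 <- poly(new_x, 5, coefs = attr(p, "coefs"))  # the explicit equivalent
all.equal(c(b1), c(b2))                         # TRUE: same basis values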
In the first place, the predicted lines don't fit the original data. You failed to make the poly objects for prediction.
...
poly_ori <- poly(x, poly_df) # important
...
plot(x,y)
x_plt1 = seq(-1, 1, 0.001)
x_plt_exp1 = as.data.frame(poly(x_plt1, poly_df, coefs = attr(poly_ori, "coefs")))
lines(x_plt1, predict(fit, x_plt_exp1),lwd = 3, col = 2)
x_plt2 = seq(-1, 1, 0.01)
x_plt_exp2 = as.data.frame(poly(x_plt2, poly_df, coefs = attr(poly_ori, "coefs")))
lines(x_plt2, predict(fit, x_plt_exp2), lwd = 3, col = 3)

Exponential curve fitting in R

time = 1:100
head(y)
0.07841589 0.07686316 0.07534116 0.07384931 0.07238699 0.07095363
plot(time,y)
This is an exponential curve.
How can I fit a curve to these data without knowing the formula? I can't use nls as the formula is unknown (only data points are given).
How can I get the equation for this curve and determine the constants in the equation?
I tried loess but it doesn't give the intercepts.
You need a model to fit to the data.
Without knowing the full details of your model, let's say that this is an exponential growth model, which one could write as: y = a * e^(r*t)
where y is your measured variable, t is the time at which it was measured, a is the value of y when t = 0, and r is the growth constant. We want to estimate a and r.
This is a non-linear problem because we want to estimate the exponent, r. However, in this case we can use some algebra and transform it into a linear equation by taking the log on both sides (remember the logarithmic rules), resulting in:
log(y) = log(a) + r * t
We can visualise this with an example, by generating a curve from our model, assuming some values for a and r:
t <- 1:100 # these are your time points
a <- 10 # assume the size at t = 0 is 10
r <- 0.1 # assume a growth constant
y <- a*exp(r*t) # generate some y observations from our exponential model
# visualise
par(mfrow = c(1, 2))
plot(t, y) # on the original scale
plot(t, log(y)) # taking the log(y)
So, for this case, we could explore two possibilities:
Fit our non-linear model to the original data (for example using the nls() function)
Fit our "linearised" model to the log-transformed data (for example using the lm() function)
Which option to choose (and there are more options) depends on what we think (or assume) the data-generating process behind our data is.
Let's illustrate with some simulations that include added noise (sampled from a normal distribution), to mimic real data. Please look at this StackExchange post for the reasoning behind this simulation (pointed out by Alejo Bernardin's comment).
set.seed(12) # for reproducible results
# errors constant across time - additive
y_add <- a*exp(r*t) + rnorm(length(t), sd = 5000) # or: rnorm(length(t), mean = a*exp(r*t), sd = 5000)
# errors grow as y grows - multiplicative (constant on the log-scale)
y_mult <- a*exp(r*t + rnorm(length(t), sd = 1)) # or: rlnorm(length(t), mean = log(a) + r*t, sd = 1)
# visualise
par(mfrow = c(1, 2))
plot(t, y_add, main = "additive error")
lines(t, a*exp(t*r), col = "red")
plot(t, y_mult, main = "multiplicative error")
lines(t, a*exp(t*r), col = "red")
For the additive model, we could use nls(), because the error is constant across t. When using nls() we need to specify some starting values for the optimization algorithm (try to "guesstimate" what these are, because nls() often struggles to converge on a solution).
add_nls <- nls(y_add ~ a*exp(r*t),
start = list(a = 0.5, r = 0.2))
coef(add_nls)
# a r
# 11.30876845 0.09867135
Using the coef() function we can get the estimates for the two parameters.
This gives us OK estimates, close to what we simulated (a = 10 and r = 0.1).
You could see that the error variance is reasonably constant across the range of the data, by plotting the residuals of the model:
plot(t, resid(add_nls))
abline(h = 0, lty = 2)
For the multiplicative error case (our y_mult simulated values), we should use lm() on log-transformed data, because the error is constant on that scale instead.
mult_lm <- lm(log(y_mult) ~ t)
coef(mult_lm)
# (Intercept) t
# 2.39448488 0.09837215
To interpret this output, remember again that our linearised model is log(y) = log(a) + r*t, which is equivalent to a linear model of the form Y = β0 + β1 * X, where β0 is our intercept and β1 our slope.
Therefore, in this output (Intercept) is equivalent to log(a) of our model and t is the coefficient for the time variable, so equivalent to our r.
To meaningfully interpret the (Intercept) we can take its exponential (exp(2.39448488)), giving us ~10.96, which is quite close to our simulated value.
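In code (a one-line check using the mult_lm fit from above):
exp(coef(mult_lm)[1])  # back-transformed intercept, ~10.96, our estimate of a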
It's worth noting what would happen if we'd fit data where the error is multiplicative using the nls function instead:
mult_nls <- nls(y_mult ~ a*exp(r*t), start = list(a = 0.5, r = 0.2))
coef(mult_nls)
# a r
# 281.06913343 0.06955642
Now we over-estimate a and under-estimate r (Mario Reutter highlighted this in his comment). We can visualise the consequence of using the wrong approach to fit our model:
# get the model's coefficients
lm_coef <- coef(mult_lm)
nls_coef <- coef(mult_nls)
# make the plot
plot(t, y_mult)
lines(t, a*exp(r*t), col = "brown", lwd = 5)
lines(t, exp(lm_coef[1])*exp(lm_coef[2]*t), col = "dodgerblue", lwd = 2)
lines(t, nls_coef[1]*exp(nls_coef[2]*t), col = "orange2", lwd = 2)
legend("topleft", col = c("brown", "dodgerblue", "orange2"),
       legend = c("known model", "lm fit", "nls fit"), lwd = 3)
We can see how the lm() fit to log-transformed data was substantially better than the nls() fit on the original data.
You can again plot the residuals of this model, to see that the variance is not constant across the range of the data (we can also see this in the graphs above, where the spread of the data increases for higher values of t):
plot(t, resid(mult_nls))
abline(h = 0, lty = 2)
Unfortunately, taking the logarithm and fitting a linear model is not optimal. The reason is that the errors for large y-values weigh much more than those for small y-values when applying the exponential function to go back to the original model.
Here is one example:
f <- function(x){exp(0.3*x+5)}
squaredError <- function(a,b,x,y) {sum((exp(a*x+b)-f(x))^2)}
x <- 0:12
y <- f(x) * ( 1 + sample(-300:300,length(x),replace=TRUE)/10000 )
x
y
#--------------------------------------------------------------------
M <- lm(log(y)~x)
a <- unlist(M[1])[2]
b <- unlist(M[1])[1]
print(c(a,b))
squaredError(a,b,x,y)
approxPartAbl_a <- (squaredError(a+1e-8,b,x,y) - squaredError(a,b,x,y))/1e-8
for ( i in 0:10 )
{
eps <- -i*sign(approxPartAbl_a)*1e-5
print(c(eps,squaredError(a+eps,b,x,y)))
}
Result:
> f <- function(x){exp(0.3*x+5)}
> squaredError <- function(a,b,x,y) {sum((exp(a*x+b)-f(x))^2)}
> x <- 0:12
> y <- f(x) * ( 1 + sample(-300:300,length(x),replace=TRUE)/10000 )
> x
[1] 0 1 2 3 4 5 6 7 8 9 10 11 12
> y
[1] 151.2182 203.4020 278.3769 366.8992 503.5895 682.4353 880.1597 1186.5158 1630.9129 2238.1607 3035.8076 4094.6925 5559.3036
> #--------------------------------------------------------------------
>
> M <- lm(log(y)~x)
> a <- unlist(M[1])[2]
> b <- unlist(M[1])[1]
> print(c(a,b))
coefficients.x coefficients.(Intercept)
0.2995808 5.0135529
> squaredError(a,b,x,y)
[1] 5409.752
> approxPartAbl_a <- (squaredError(a+1e-8,b,x,y) - squaredError(a,b,x,y))/1e-8
> for ( i in 0:10 )
+ {
+ eps <- -i*sign(approxPartAbl_a)*1e-5
+ print(c(eps,squaredError(a+eps,b,x,y)))
+ }
[1] 0.000 5409.752
[1] -0.00001 5282.91927
[1] -0.00002 5157.68422
[1] -0.00003 5034.04589
[1] -0.00004 4912.00375
[1] -0.00005 4791.55728
[1] -0.00006 4672.70592
[1] -0.00007 4555.44917
[1] -0.00008 4439.78647
[1] -0.00009 4325.71730
[1] -0.0001 4213.2411
>
Perhaps one can try some numeric method, e.g. gradient search, to find the minimum of the squared error function.
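For example (a sketch using optim() with the default Nelder-Mead, reusing squaredError() and the lm-based values a, b from the code above):
opt <- optim(c(a, b), function(p) squaredError(p[1], p[2], x, y))
opt$par    # refined estimates of the rate a and log-scale intercept b
opt$value  # should be smaller than squaredError(a, b, x, y) above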
If it really is exponential, you can try taking the logarithm of your variable and fitting a linear model to that.
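For example (a sketch, assuming time and y are the vectors from the question):
fit.log <- lm(log(y) ~ time)
a.hat <- exp(coef(fit.log)[1])  # estimated value at time = 0
r.hat <- coef(fit.log)[2]       # estimated rate (negative here, since y is decaying)
plot(time, y)
curve(a.hat * exp(r.hat * x), add = TRUE, col = 2)
This comes with the caveat discussed above: it implicitly assumes multiplicative (log-scale) errors.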
