smooth.spline(): fitted model does not match user-specified degree of freedom - r

Here is the code I ran
fun <- function(x) {1 + 3*sin(4*pi*x-pi)}
set.seed(1)
num.samples <- 1000
x <- runif(num.samples)
y <- fun(x) + rnorm(num.samples) * 1.5
fit <- smooth.spline(x, y, all.knots=TRUE, df=3)
Despite df=3, when I checked the fitted model, the output was
Call:
smooth.spline(x = x, y = y, df = 3, all.knots = TRUE)
Smoothing Parameter spar= 1.499954 lambda= 0.002508571 (26 iterations)
Equivalent Degrees of Freedom (Df): 9.86422
Could someone please help? Thanks!

Note that from R-3.4.0 (2017-04-21), smooth.spline can accept direct specification of λ by a newly added argument lambda. But it will still be converted to the internal one spar during estimation. So the following answer is not affected.
Smoothing parameter λ / spar lies in the centre of smoothness control
Smoothness is controlled by the smoothing parameter λ. smooth.spline() uses an internal smoothing parameter spar rather than λ:
spar = s0 + 0.0601 * log(λ)
Such a logarithmic transform is necessary in order to do unconstrained minimization, as in GCV/CV. The user can specify spar to indirectly specify λ. When spar grows linearly, λ grows exponentially, so there is rarely a need for a large spar value.
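The degrees of freedom df are also defined in terms of λ:
$$\mathrm{df}(\lambda) = \operatorname{tr}\left( X \left( X^\top X + \lambda S \right)^{-1} X^\top \right)$$
where X is the model matrix with B-spline basis and S is the penalty matrix.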
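A trace-based df is the sum of the diagonal of the smoother ("hat") matrix, and smooth.spline() returns that diagonal as the lev component. The following is a minimal check on the question's simulated data; the choice spar = 1 and the name fit_lev are mine, just for illustration.

```r
# Recreate the data from the question
set.seed(1)
x <- runif(1000)
y <- 1 + 3 * sin(4 * pi * x - pi) + rnorm(1000) * 1.5

# Fit at a fixed spar (arbitrary illustrative value)
fit_lev <- smooth.spline(x, y, all.knots = TRUE, spar = 1)

# lev holds the leverages, i.e. the diagonal of the smoother matrix;
# summing them gives the trace, which should agree with the reported df
df_from_trace <- sum(fit_lev$lev)
c(reported = fit_lev$df, from_leverages = df_from_trace)
```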
You can have a check on their relationships with your dataset:
spar <- seq(1, 2.5, by = 0.1)
a <- sapply(spar, function (spar_i) unlist(smooth.spline(x, y, all.knots=TRUE, spar = spar_i)[c("df","lambda")]))
Let's sketch df ~ spar, λ ~ spar and log(λ) ~ spar:
par(mfrow = c(1,3))
plot(spar, a[1, ], type = "b", main = "df ~ spar",
xlab = "spar", ylab = "df")
plot(spar, a[2, ], type = "b", main = "lambda ~ spar",
xlab = "spar", ylab = "lambda")
plot(spar, log(a[2,]), type = "b", main = "log(lambda) ~ spar",
xlab = "spar", ylab = "log(lambda)")
Note the radical growth of λ with spar, the linear relationship between log(λ) and spar, and the relatively smooth relationship between df and spar.
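If you prefer a number to a picture, the linear relationship can be checked by regressing log(λ) on spar; the reciprocal of the slope should be close to the 0.0601 multiplier. This is only a sketch on the question's data, and spar_grid and log_lambda are names I made up here.

```r
# Recreate the data from the question
set.seed(1)
x <- runif(1000)
y <- 1 + 3 * sin(4 * pi * x - pi) + rnorm(1000) * 1.5

# lambda at a few fixed spar values
spar_grid <- seq(1, 2, by = 0.25)
log_lambda <- sapply(spar_grid, function(s) {
  log(smooth.spline(x, y, all.knots = TRUE, spar = s)$lambda)
})

# Slope of log(lambda) against spar; its reciprocal is the multiplier
# in spar = s0 + multiplier * log(lambda)
slope <- unname(coef(lm(log_lambda ~ spar_grid))[2])
c(slope = slope, multiplier = 1 / slope)
```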
smooth.spline() fitting iterations for spar
If we manually specify the value of spar, as we did in the sapply() call, no fitting iterations are done to select spar; otherwise smooth.spline() needs to iterate through a number of spar values. If we
specify cv = TRUE / FALSE, the fitting iterations aim to minimize the CV / GCV score;
specify df = mydf, the fitting iterations aim to minimize (df(spar) - mydf) ^ 2.
Minimizing GCV is easy to follow. We don't care about the GCV score itself, but about the corresponding spar. By contrast, when minimizing (df(spar) - mydf)^2, we usually care about the df value at the end of the iterations rather than spar! But bear in mind that this is a minimization problem, so we are never guaranteed that the final df matches our target value mydf.
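To make the "search over spar until df matches" idea concrete, here is a rough stand-in that solves df(spar) = 3 with uniroot() on a search interval I chose by hand. This is not smooth.spline()'s internal algorithm, only an illustration that the answer depends on where you allow spar to go; df_at and target_df are hypothetical names.

```r
# Recreate the data from the question
set.seed(1)
x <- runif(1000)
y <- 1 + 3 * sin(4 * pi * x - pi) + rnorm(1000) * 1.5

# df as a function of spar (hypothetical helper)
df_at <- function(s) smooth.spline(x, y, all.knots = TRUE, spar = s)$df

# Solve df(spar) = target_df on a hand-picked interval that extends
# beyond the default upper bound of 1.5
target_df <- 3
root <- uniroot(function(s) df_at(s) - target_df,
                interval = c(1.5, 2.5), tol = 1e-6)
c(spar = root$root, df = df_at(root$root))
```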
Why you put df = 3, but get df = 9.864?
The end of the iterations could mean hitting a minimum, reaching the search boundary, or reaching the maximum number of iterations.
We are far from the maximum iteration limit (default 500), yet we have not hit the minimum. So we may have reached the boundary.
Do not focus on df, think about spar.
smooth.spline(x, y, all.knots=TRUE, df=3)$spar # 1.4999
According to ?smooth.spline, by default, smooth.spline() searches spar within [-1.5, 1.5]. That is, when you put df = 3, the minimization terminates at the search boundary rather than reaching df = 3.
Have a look at our graph of the relationship between df and spar again. From the figure, it looks like we need a spar value near 2 in order to get df = 3.
Let's use control.spar argument:
fit <- smooth.spline(x, y, all.knots=TRUE, df=3, control.spar = list(high = 2.5))
# Smoothing Parameter spar= 1.859066 lambda= 0.9855336 (14 iterations)
# Equivalent Degrees of Freedom (Df): 3.000305
Now you end up with df = 3, and it takes spar = 1.86 to get there.
A better suggestion: Do not use all.knots = TRUE
Look, you have 1000 data points. With all.knots = TRUE you will use 1000 parameters. Wishing to end up with df = 3 implies that 997 out of 1000 parameters are suppressed. Imagine how large a λ, hence spar, you need!
Try using a penalized regression spline instead. Suppressing 200 parameters to 3 is definitely much easier:
fit <- smooth.spline(x, y, nknots = 200, df=3) ## using 200 knots
# Smoothing Parameter spar= 1.317883 lambda= 0.9853648 (16 iterations)
# Equivalent Degrees of Freedom (Df): 3.000386
Now, you end up with df = 3 without spar control.


Non linear model with five parameters (w/ nls R)

This is my first question, please let me know if I'm doing anything wrong. We have a df with two variables, and want to model EPR (egg production rate) as a function of temperature.
The relevant packages as per the nls page:
install.packages("tidyverse")
install.packages("nls.multstart")
install.packages("nlstools")
library(tidyverse)
library(nls.multstart)
library(nlstools)
The relevant variables from a larger df:
temp=c(9.2,9.9,12.7,12.8,14.3,14.5,16.3,16.5,18,18,19.6,19.6,19.9,19.9,22,22.4,23.2,23.4,25.3,25.6,27,27.3,28.5,30.3,20.9)
EPR=c(1.5,0,0,0,1.27,0.56,3.08,0.575,2.7,3.09,2,6.3,2,3.76,3.7,1.65,7.1,18.9,7.07,3.77,13.79,0,0,0.47,0)
df<-data.frame(temp,EPR)
Here I write the formula with the five parameters to be estimated (k1,a,b,k2,c), temp will be the x values. So far so good.
formula<-function(k1,a,b,k2,c,temp) {
modelEPR<-k1*1/(1+exp(-a*(temp-b)))-k2*exp(c*temp)
return(modelEPR)
}
This is where I'm stuck; I'm already using quite narrow start_lower and upper, since I now know the parameters by using the excel solver somewhat successfully. The values I get with this method will get me a model, albeit quite an inaccurate one. Yes, I gave the start lower and upper a much greater range in the beginning, but that didn't yield any better results.
fit <- nls_multstart(EPR ~ formula(k1,a,b,k2,c,temp),
data = df,
iter = 100,
start_lower = c(k1 = 14, a = 0.3, b = 20, k2 = 0.02, c = 0.15),
start_upper = c(k1 = 15, a = 0.5, b = 21, k2 = 0.08, c = 0.24),
supp_errors = 'Y',
na.action = na.omit)
fit
As aforementioned, I used the excel solver to successfully make the model and I got the parameter estimates, then tried to just manually insert them here in R, which makes for a much better model.
model<-df %>%
mutate(pred=(14.69/(1+exp(-0.41*(temp-20.52)))-0.05*exp(0.19 *temp))) %>%
ggplot()+
xlab("Temperature (°C)")+
ylab("EPR (eggs per female per day)")+
geom_point(aes(temp,EPR))+
geom_line(aes(temp,pred),col="red")
model
Ultimately, I have two questions;
a) What am I doing wrong? Or is it simply the data being weird? Seems to work better with excel?!
b) How do I code the bridge between fit and model? fit will yield the 5 parameters, but how do I insert them directly into the model function? Can I utilize mutate somehow here?
Would appreciate any help!
A. Starting values and fitting model
To get starting values:
If k1 = 0 then we can rearrange the formula as follows and then use the result of fitting that linear model as a starting value for c.
log(EPR) ~ log(k2) + c * temp
b is a shift in temp and a is a scaling so choose b = mean(temp) and a = 1/sd(temp)
We can use algorithm = "plinear" to avoid having to specify starting values for the linear parameters, i.e. for k1 and k2. When using plinear the right hand side of the formula should be a matrix such that k1 times the first column plus k2 times the second column gives the predicted EPR.
This gives the following. Note that k1 and k2 will be represented by .lin1 and .lin2 in the nls output.
fm1 <- lm(log(EPR) ~ temp, df, subset = EPR > 0)
st2 <- list(c = coef(fm1)[[2]], a = 1/sd(df$temp), b = mean(df$temp))
fo2 <- EPR ~ cbind(1/(1+exp(-a*(temp-b))), -exp(c*temp))
fm2 <- nls(fo2, df, start = st2, algorithm = "plinear",
control = list(maxiter = 200))
deviance(fm2) # residual sum of squares
## [1] 333.6
Note that this represents a lower (better) residual sum of squares than the fit shown in the question:
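pred <- with(df, 14.69/(1+exp(-0.41*(temp-20.52)))-0.05*exp(0.19*temp)) # predictions from the question's parameters
sum((df$EPR - pred)^2) # residual sum of squares for fit shown in question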
## [1] 339.7
No packages were used.
We can plot the two fits, where the fit from the question is in blue and the fit done here is in red. From the plot there is some question whether the two large EPR values are outliers and whether they should be excluded.
plot(EPR ~ temp, df)
lines(fitted(fm2) ~ temp, df, subset = order(temp), col = "red")
lines(pred ~ temp, df, subset = order(temp), col = "blue")
B. Evaluating model at given parameters
For a given model expressed in formula notation we can evaluate it at given parameters using the nls2 package. nls2 takes arguments similar to nls, but if the starting value is a data frame with one row and the algorithm is "brute" then it simply returns the value of the right hand side evaluated at the starting values. See ?nls2 for more information.
library(nls2)
fo <- EPR ~ k1*1/(1+exp(-a*(temp-b)))-k2*exp(c*temp)
st <- list(k1 = 14.69, a = 0.41, b = 20.52, k2 = 0.05, c = 0.19)
fm <- nls2(fo, df, start = data.frame(st), algorithm = "brute")
deviance(fm)
## [1] 339.7
fitted(fm) # predictions at parameter values given in st
or in terms of a function:
rhs <- function(a, b, c, k1, k2, temp) k1*1/(1+exp(-a*(temp-b)))-k2*exp(c*temp)
p <- do.call("rhs", c(st, list(temp = df$temp)))
all.equal(p, pred)
## [1] TRUE
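For the "bridge" asked about in (b), one possible sketch is to keep the parameters in a named vector (which is what coef() would give you from a fitted nls-type object) and feed them to the model function with do.call(), evaluating on a fine temperature grid so that the plotted line is smooth. The names epr_model, params and epr_grid are made up for this illustration, and the parameter values are the Excel ones from the question, standing in for coef(fit).

```r
# Model function with the same five parameters as in the question
epr_model <- function(temp, k1, a, b, k2, c) {
  k1 / (1 + exp(-a * (temp - b))) - k2 * exp(c * temp)
}

# Stand-in for coef(fit): the Excel estimates quoted in the question
params <- c(k1 = 14.69, a = 0.41, b = 20.52, k2 = 0.05, c = 0.19)

# Fine grid over the observed temperature range (9.2 to 30.3 in the data)
epr_grid <- data.frame(temp = seq(9.2, 30.3, length.out = 200))

# Named parameters are matched to the function arguments by name
epr_grid$pred <- do.call(epr_model,
                         c(list(temp = epr_grid$temp), as.list(params)))
head(epr_grid)
```

The resulting data frame can be passed to geom_line(data = epr_grid, aes(temp, pred)) or to lines() in base graphics.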

get the derivative of an ECDF

Is it possible to differentiate an ECDF? Take the one obtained in the following, for example.
set.seed(1)
a <- sort(rnorm(100))
b <- ecdf(a)
plot(b)
I would like to take the derivative of b in order to obtain its probability density function (PDF).
n <- length(a) ## `a` must be sorted in non-decreasing order already
plot(a, 1:n / n, type = "s") ## "staircase" plot; not "line" plot
However I'm looking to find the derivative of b
In sample-based statistics, the estimated density (for a continuous random variable) is not obtained from the ECDF by differentiation, because the sample size is finite and the ECDF is a step function, hence not differentiable. Instead, we estimate the density directly. I guess plot(density(a)) is what you are really looking for.
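If you need the estimated density as a function you can evaluate, rather than just a plot, a small sketch is to interpolate the grid returned by density(). The names dens and pdf_hat are mine; the trapezoid sum is only a sanity check that the estimate integrates to about 1.

```r
set.seed(1)
a <- sort(rnorm(100))

# Kernel density estimate on a grid (default Gaussian kernel and bandwidth)
dens <- density(a)

# Turn the grid into a function; zero outside the grid
pdf_hat <- approxfun(dens$x, dens$y, yleft = 0, yright = 0)

# Trapezoid rule over the grid: should be close to 1
area <- sum(diff(dens$x) * (head(dens$y, -1) + tail(dens$y, -1)) / 2)
c(pdf_at_0 = pdf_hat(0), area = area)
```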
a few days later..
Warning: the following is just a numerical solution without statistical ground!
I take it as an exercise to learn about R package scam for shape constrained additive models, a child package of mgcv by Prof Wood's early PhD student Dr Pya.
The logic is as such:
using scam::scam, fit a monotonically increasing P-spline to the ECDF (you have to specify how many knots you want); [Note that monotonicity is not the only theoretical constraint. The smoothed ECDF must also be "clipped" at its two edges: the left edge at 0 and the right edge at 1. I am currently using weights to impose this constraint, by giving very large weights to the two edge points]
using stats::splinefun, reparametrize the fitted spline with a monotonic interpolation spline through knots and predicted values at knots;
return the interpolation spline function, which can also evaluate the 1st, 2nd and 3rd derivatives.
Why I expect this to work:
As sample size grows,
ECDF converges to CDF;
P-spline is consistent so a smoothed ECDF will be increasingly unbiased for ECDF;
the 1st derivative of smoothed ECDF will be increasingly unbiased for PDF.
Use with caution:
You have to choose number of knots yourself;
the derivative is NOT normalized so that the area under the curve is 1;
the result can be rather unstable, and is only good for large sample size.
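On the normalization point, one way to rescale a non-negative curve so that it integrates to 1 is to divide by its numerical integral. The helper normalize_density below is hypothetical (not part of scam or stats) and is demonstrated on a deliberately mis-scaled normal density instead of the spline derivative, so that it runs without extra packages.

```r
# Hypothetical helper: rescale a non-negative function to unit area
normalize_density <- function(f, lower, upper) {
  area <- integrate(f, lower, upper)$value
  function(x) f(x) / area
}

# Stand-in for an unnormalized derivative: twice the standard normal pdf
raw_curve <- function(x) 2 * dnorm(x)
pdf_norm <- normalize_density(raw_curve, -8, 8)

# Area after rescaling
integrate(pdf_norm, -8, 8)$value
```

With a spline function f that accepts a deriv argument, the input would be function(x) f(x, deriv = 1) and the limits would be the range of the data.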
function arguments:
x: a vector of samples;
n.knots: number of knots;
n.cells: number of grid points when plotting derivative function
You need to install scam package from CRAN.
library(scam)
test <- function (x, n.knots, n.cells) {
## get ECDF
n <- length(x)
x <- sort(x)
y <- 1:n / n
dat <- data.frame(x = x, y = y) ## make sure `scam` can find `x` and `y`
## fit a monotonically increasing P-spline for ECDF
fit <- scam::scam(y ~ s(x, bs = "mpi", k = n.knots), data = dat,
weights = c(n, rep(1, n - 2), 10 * n))
## interior knots
xk <- with(fit$smooth[[1]], knots[4:(length(knots) - 3)])
## spline values at interior knots
yk <- predict(fit, newdata = data.frame(x = xk))
## reparametrization into a monotone interpolation spline
f <- stats::splinefun(xk, yk, "hyman")
par(mfrow = c(1, 2))
plot(x, y, pch = 19, col = "gray") ## ECDF
lines(x, f(x), type = "l") ## smoothed ECDF
title(paste0("number of knots: ", n.knots,
"\neffective degree of freedom: ", round(sum(fit$edf), 2)),
cex.main = 0.8)
xg <- seq(min(x), max(x), length = n.cells)
plot(xg, f(xg, 1), type = "l") ## density estimated by scam
lines(stats::density(x), col = 2) ## a proper density estimate by density
## return smooth ECDF function
f
}
## try large sample size
set.seed(1)
x <- rnorm(1000)
f <- test(x, n.knots = 20, n.cells = 100)
f is a function as returned by stats::splinefun (read ?splinefun).
A naive, similar solution is to do interpolation spline on ECDF without smoothing. But this is a very bad idea, as we have no consistency.
g <- splinefun(sort(x), 1:length(x) / length(x), method = "hyman")
curve(g(x, deriv = 1), from = -3, to = 3)
A reminder: it is highly recommended to use stats::density for a direct density estimation.

R smooth.spline(): smoothing spline is not smooth but overfitting my data

I have several data points which seem suitable for fitting a spline through them. When I do this, I get a rather bumpy fit, like overfitting, which is not what I understand as smoothing.
Is there a special option / parameter for getting back the function of a really smooth spline like here.
The usage of the penalty parameter for smooth.spline didn't have any visible effect. Maybe I did it wrong?
Here are data and code:
results <- structure(
list(
beta = c(
0.983790622281964, 0.645152464354322,
0.924104713597375, 0.657703886566088, 0.788138034115623, 0.801080207252363,
1, 0.858337365965949, 0.999687052533693, 0.666552625121279, 0.717453633245958,
0.621570152961453, 0.964658181346544, 0.65071758770312, 0.788971505000918,
0.980476054183113, 0.670263506919246, 0.600387040967624, 0.759173403408052,
1, 0.986409675965, 0.982996471134736, 1, 0.995340781899163, 0.999855895958986,
1, 0.846179233381267, 0.879226324448832, 0.795820998892035, 0.997586607285667,
0.848036806290156, 0.905320944437968, 0.947709125535428, 0.592172373022407,
0.826847031044922, 0.996916006944244, 0.785967729206612, 0.650346929853076,
0.84206351833549, 0.999043126652724, 0.936879214753098, 0.76674066557003,
0.591431233516217, 1, 0.999833445117791, 0.999606223666537, 0.6224971799303,
1, 0.974537160571494, 0.966717133936379
), inventoryCost = c(
1750702.95138889,
442784.114583333, 1114717.44791667, 472669.357638889, 716895.920138889,
735396.180555556, 3837320.74652778, 872873.4375, 2872414.93055556,
481095.138888889, 538125.520833333, 392199.045138889, 1469500.95486111,
459873.784722222, 656220.486111111, 1654143.83680556, 437511.458333333,
393295.659722222, 630952.170138889, 4920958.85416667, 1723517.10069444,
1633579.86111111, 4639909.89583333, 2167748.35069444, 3062420.65972222,
5132702.34375, 838441.145833333, 937659.288194444, 697767.1875,
2523016.31944444, 800903.819444444, 1054991.49305556, 1266970.92013889,
369537.673611111, 764995.399305556, 2322879.6875, 656021.701388889,
458403.038194444, 844133.420138889, 2430700, 1232256.68402778,
695574.479166667, 351348.524305556, 3827440.71180556, 3687610.41666667,
2950652.51736111, 404550.78125, 4749901.64930556, 1510481.59722222,
1422708.07291667
)
), .Names = c("beta", "inventoryCost"), class = c("data.frame")
)
plot(results$beta,results$inventoryCost)
mySpline <- smooth.spline(results$beta,results$inventoryCost, penalty=999999)
lines(mySpline$x, mySpline$y, col="red", lwd = 2)
Transform your data sensibly before modelling
Based on the scale of your results$inventoryCost, log transform is appropriate. For simplicity, in the following I am using x, y. I am also reordering your data so that x is ascending:
x <- results$beta; y <- log(results$inventoryCost)
reorder <- order(x); x <- x[reorder]; y <- y[reorder]
par(mfrow = c(1,2))
plot(x, y, main = "take log transform")
hist(x, main = "x is skewed")
The left figure looks better? Also, it is highly recommended to transform x as well, because it is skewed (see right figure).
The following transform is appropriate:
x1 <- -(1-x)^(1/3)
The cube root of (1-x) will make the data more spread out around x = 1. I multiply by an additional -1 so that there is a positive monotonic relationship, rather than a negative one, between x and x1. Now let's check the relationship:
par(mfrow = c(1,2))
plot(x1, y, main = expression(y %~% ~ x1))
hist(x1, main = "x1 is well spread out")
Fitting a spline
Now we are ready for statistical modelling. Try the following call:
fit <- smooth.spline(x1, y, nknots = 10)
pred <- stats:::predict.smooth.spline(fit, x1)$y ## predict at all x1
## or you can simply call: pred <- predict(fit, x1)$y
plot(x1, y) ## scatter plot
lines(x1, pred, lwd = 2, col = 2) ## fitted spline
Does it look nice? Note that nknots = 10 tells smooth.spline to place 10 interior knots (by quantile); therefore, we are fitting a penalized regression spline rather than a smoothing spline. In fact, the smooth.spline() function almost never fits a smoothing spline, unless you put all.knots = TRUE (see later example).
I also dropped penalty = 999999, as that has nothing to do with smoothness control. If you really want to control smoothness, rather than letting smooth.spline figure out the optimal one by GCV, you should use the argument df or spar. I will give an example later.
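As a quick illustration of the two knobs on toy data (simulated here, not your inventory data; all sim_ and fit_ names are made up): asking for df = 5 returns a fit with about 5 equivalent degrees of freedom, and a larger spar gives a stiffer curve with fewer degrees of freedom than a smaller spar.

```r
# Toy data: noisy sine curve
set.seed(42)
sim_x <- seq(0, 1, length.out = 100)
sim_y <- sin(2 * pi * sim_x) + rnorm(100, sd = 0.3)

# Control smoothness through the target degrees of freedom
fit_df5 <- smooth.spline(sim_x, sim_y, df = 5)

# Control smoothness through spar directly
fit_stiff  <- smooth.spline(sim_x, sim_y, spar = 1.0)  # heavy penalty
fit_wiggly <- smooth.spline(sim_x, sim_y, spar = 0.3)  # light penalty

c(df5 = fit_df5$df, stiff = fit_stiff$df, wiggly = fit_wiggly$df)
```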
To transform fit back to original scale, do:
plot(x, exp(y), main = expression(Inventory %~%~ beta))
lines(x, exp(pred), lwd = 2, col = 2)
As you can see, the fitted spline is as smooth as you had expected.
Explanation on fitted spline
Let's see the summary of your fitted spline:
> fit
Smoothing Parameter spar= 0.4549062 lambda= 0.0008657722 (11 iterations)
Equivalent Degrees of Freedom (Df): 6.022959
Penalized Criterion: 0.08517417
GCV: 0.004288539
We used 10 knots, ending up with 6 degrees of freedom, so penalization suppresses about 4 parameters. The smoothing parameter GCV has chosen, after 11 iterations, is lambda = 0.0008657722.
Why do we have to transform x to x1
A spline is penalized by its 2nd derivative, yet the penalty acts on the 2nd derivative averaged/integrated over all data points. Now, look at your data (x, y). For x below 0.98, the relationship is relatively steady; as x approaches 1, the relationship quickly gets steeper. The "change point", 0.98, has a very high second derivative, much higher than the second derivatives at other locations.
y0 <- as.numeric(tapply(y, x, mean)) ## remove tied values
x0 <- unique(x) ## remove tied values
dy0 <- diff(y0)/diff(x0) ## 1st order difference
ddy0 <- diff(dy0)/diff(x0[-1]) ## 2nd order difference
plot(x0[1:43], abs(ddy0), pch = 19)
Look at that huge spike in 2nd order difference/derivative! Now, if we fit a spline directly, the spline curve around this change point will be heavily penalized.
bad <- smooth.spline(x, y, all.knots = TRUE)
bad.pred <- predict(bad, x)$y
plot(x, exp(y), main = expression(Inventory %~% ~ beta))
lines(x, exp(bad.pred), col = 2, lwd = 3)
abline(v = 0.98, lwd = 2, lty = 2)
You can see clearly that the spline is having some difficulty in approximating data after x = 0.98.
There are of course some ways to achieve a better approximation after this change point, for example, by manually setting a smaller smoothing parameter or higher degrees of freedom. But that takes us to the other extreme. Remember, both penalization and degrees of freedom are global measures. Increasing model complexity will give a better approximation after x = 0.98, but will also make other parts more bumpy. Now let's try a model with 45 degrees of freedom:
worse <- smooth.spline(x, y, all.knots = TRUE, df = 45)
worse.pred <- predict(worse, x)$y
plot(x, exp(y), main = expression(Inventory %~% ~ beta))
lines(x, exp(worse.pred), col = 2, lwd = 2)
As you can see, the curve is bumpy. Sure, we have overfitted our dataset of 50 data points with 45 degrees of freedom.
In fact, your original misuse of smooth.spline() is doing the same thing:
> mySpline
Call:
smooth.spline(x = results$beta, y = results$inventoryCost, penalty = 999999)
Smoothing Parameter spar= -0.8074624 lambda= 3.266077e-19 (17 iterations)
Equivalent Degrees of Freedom (Df): 45
Penalized Criterion: 5.598386
GCV: 0.03824885
Oops, 45 degree of freedom, overfitting!
I don't think you should use / want splinefun. I would suggest fitting a GAM instead:
library(mgcv)
fit <- gam(inventoryCost ~ s(beta, bs = "cr", k = 20), data = results)
summary(fit)
gam.check(fit)
plot(fit)
plot(inventoryCost ~ beta, data = results, col = "dark red", pch = 16)
curve(predict(fit, newdata = data.frame(beta = x)), add = TRUE,
from = min(results$beta), to = max(results$beta), n = 1e3, lwd = 2)

Fitting non-linear Langmuir Isotherm in R

I want to fit isotherm models to the following data in R. The simplest isotherm model is the Langmuir model, given here (the model is at the bottom of the page). My MWE is given below, and it throws an error. I wonder if there is any R package for isotherm models.
X <- c(10, 30, 50, 70, 100, 125)
Y <- c(155, 250, 270, 330, 320, 323)
Data <- data.frame(X, Y)
LangIMfm2 <- nls(formula = Y ~ Q*b*X/(1+b*X), data = Data, start = list(Q = 1, b = 0.5), algorith = "port")
Error in nls(formula = Y ~ Q * b * X/(1 + b * X), data = Data, start = list(Q = 1, :
Convergence failure: singular convergence (7)
Edited
Some nonlinear models can be transformed to linear models. My understanding is that there might be a one-to-one relationship between the estimates of the nonlinear model and those of its linear form, but their corresponding standard errors are not related to each other. Is this assertion true? Are there any pitfalls in fitting nonlinear models by transforming to linearity?
I am not aware of such packages, and personally I don't think that you need one, as the problem can be solved using base R.
nls is sensitive to the starting parameters, so you should begin with a good starting guess. You can easily evaluate Q because it corresponds to the asymptotic limit of the isotherm at x-->Inf, so it is reasonable to begin with Q=323 (which is the last value of Y in your sample data set).
Next, you could do plot(Data) and add a line with an isotherm that corresponds to your starting parameters Q and b and tweak b to come up with a reasonable guess.
The plot below shows your data set (points) and a probe isotherm with Q = 323 and b = 0.5, generated by with(Data,lines(X,323*0.5*X/(1+0.5*X),col='red')) (red line). It seemed a reasonable starting guess to me, and I gave it a try with nls:
LangIMfm2 <- nls(formula = Y ~ Q*b*X/(1+b*X), data = Data, start = list(Q = 300, b = 1), algorith = "port")
# Nonlinear regression model
# model: Y ~ Q * b * X/(1 + b * X)
# data: Data
# Q b
# 366.2778 0.0721
# residual sum-of-squares: 920.6
#
# Algorithm "port", convergence message: relative convergence (4)
and plotted predicted line to make sure that nls found the right solution:
lines(Data$X,predict(LangIMfm2),col='green')
Having said that, I would suggest using a more effective strategy, based on linearizing the model by rewriting the isotherm equation in reciprocal coordinates:
z <- 1/Data
plot(Y~X,z)
abline(lm(Y~X,z))
M <- lm(Y~X,z)
Q <- 1/coef(M)[1]
# 363.2488
b <- coef(M)[1]/coef(M)[2]
# 0.0741759
As you can see, both approaches produce essentially the same result, but the linear model is more robust and doesn't require starting parameters (and, as far as I remember, it is the standard way of doing isotherm analysis in experimental physical chemistry).
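The two approaches can also be combined: take the estimates from the reciprocal (linearized) fit and pass them to nls as starting values, so that no guessing is needed. A sketch on the data from the question; lin_fit, Q_start, b_start and nls_fit are names chosen here.

```r
Data <- data.frame(X = c(10, 30, 50, 70, 100, 125),
                   Y = c(155, 250, 270, 330, 320, 323))

# Linearized form: 1/Y = 1/Q + (1/(Q*b)) * (1/X)
lin_fit <- lm(I(1 / Y) ~ I(1 / X), data = Data)
Q_start <- unname(1 / coef(lin_fit)[1])                    # 1 / intercept
b_start <- unname(coef(lin_fit)[1] / coef(lin_fit)[2])     # intercept / slope

# Refine on the original scale, starting from the linearized estimates
nls_fit <- nls(Y ~ Q * b * X / (1 + b * X), data = Data,
               start = list(Q = Q_start, b = b_start))
rbind(linearized = c(Q = Q_start, b = b_start), nls = coef(nls_fit))
```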
You can use the SSmicmen self-starter function (see Ritz and Streibig, 2008, Nonlinear Regression with R), which ships with R in the stats package and calculates initial parameters from the fit of the linearized form of the Michaelis-Menten (MM) equation. Fortunately, the MM equation has a form that can be adapted to the Langmuir equation, S = Smax*x/(KL + x). I've found the nlshelper and tidyverse packages useful for modeling and exporting the results of the nls command into tables and plots, particularly when modeling sample groups (nlme is loaded below for nlsList). Here's my code for modeling a single set of sorption data:
library(tidyverse)
library(nlme)
library(nlshelper)
lang.fit <- nls(Y ~ SSmicmen(X,Smax,InvKL), data=Data)
fit.summary <- broom::tidy(lang.fit) # tidy() lives in broom, which library(tidyverse) does not attach
fit.coefs <- coef(lang.fit)
For simplicity, the Langmuir affinity constant is modeled here as 1/KL. Applying this code, I get the same parameter estimates as @Marat gave above.
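To compare with the Q and b parametrization used in the question, the fitted Michaelis-Menten constants can be converted back: Smax plays the role of Q, and the second parameter is the reciprocal of b. A minimal sketch with base R only (SSmicmen needs no extra packages); mm_fit and est are names chosen here.

```r
Data <- data.frame(X = c(10, 30, 50, 70, 100, 125),
                   Y = c(155, 250, 270, 330, 320, 323))

# Self-starting fit of Y = Smax * X / (InvKL + X); no start values needed
mm_fit <- nls(Y ~ SSmicmen(X, Smax, InvKL), data = Data)
est <- coef(mm_fit)

# Back to the Langmuir form Y = Q * b * X / (1 + b * X)
c(Q = est[["Smax"]], b = 1 / est[["InvKL"]])
```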
The simple code below allows for wrangling the data in order to create a ggplot object, containing the original points and fitted line (i.e., geom_point would represent the original X and Y data, geom_line would represent the original X plus YHat).
FitY <- tibble(predict(lang.fit))
YHat <- FitY[,1]
Data2 <- cbind(Data, YHat)
If you want to model multiple groups of data (say, based on a "Sample_name" column), then the lang.fit variable would be calculated as below, this time using the nlsList command:
lang.fit <- nlsList(Y ~ SSmicmen(X,Smax,InvKL) | Sample_name, data=Data)
The problem is the starting values. We show two approaches to this as well as an alternative that converges even using the starting values in the question.
1) plinear The right hand side is linear in Q*b so it would be better to absorb b into Q and then we have a parameter that enters linearly so it is easier to solve. Also with the plinear algorithm no starting values are needed for the linear parameter so only the starting value for b need be specified. With plinear the right hand side of the nls formula should be specified as the vector that multiplies the linear parameter. The result of running nls giving fm0 below will be coefficients named b and .lin where Q = .lin / b.
We already have our answer from fm0 but if we want a clean run in terms of b and Q rather than b and .lin we can run the original formula in the question using the starting values implied by the coefficients returned by fm0 as shown.
fm0 <- nls(Y ~ X/(1+b*X), Data, start = list(b = 0.5), alg = "plinear")
st <- with(as.list(coef(fm0)), list(b = b, Q = .lin/b))
fm <- nls(Y ~ Q*b*X/(1+b*X), Data, start = st)
fm
giving
Nonlinear regression model
model: Y ~ Q * b * X/(1 + b * X)
data: Data
b Q
0.0721 366.2778
residual sum-of-squares: 920.6
Number of iterations to convergence: 0
Achieved convergence tolerance: 9.611e-07
We can display the result. The points are the data and the red line is the fitted curve.
plot(Data)
lines(fitted(fm) ~ X, Data, col = "red")
2) mean Alternately, using a starting value of mean(Data$Y) for Q seems to work well.
nls(Y ~ Q*b*X/(1+b*X), Data, start = list(b = 0.5, Q = mean(Data$Y)))
giving:
Nonlinear regression model
model: Y ~ Q * b * X/(1 + b * X)
data: Data
b Q
0.0721 366.2779
residual sum-of-squares: 920.6
Number of iterations to convergence: 6
Achieved convergence tolerance: 5.818e-06
The question already had a reasonable starting value for b, which we used, but if one were needed one could set Y to Q*b so that they cancel, set X to mean(Data$X), and solve for b to give b = 1 - 1/mean(Data$X) as a possible starting value. Although not shown, using this starting value for b with mean(Data$Y) as the starting value for Q also resulted in convergence.
3) optim If we use optim the algorithm converges even with the initial values used in the question. We form the residual sum of squares and minimize that:
rss <- function(p) {
Q <- p[1]
b <- p[2]
with(Data, sum((Y - b*Q*X/(1+b*X))^2))
}
optim(c(1, 0.5), rss)
giving:
$par
[1] 366.27028219 0.07213613
$value
[1] 920.62
$counts
function gradient
249 NA
$convergence
[1] 0
$message
NULL

How to compute prediction intervals for a circle fit in R

I wish to compute the prediction interval of the radius from a circle fit with the formula r² = (x-h)²+(y-k)², where r is the radius of the circle, x and y are Gaussian coordinates, and h and k mark the center of the fitted circle.
# data
x <- c(1,2.2,1,2.5,1.5,0.5,1.7)
y <- c(1,1,3,2.5,4,1.7,0.8)
# using nls.lm from minpack.lm (minimising the sum of squared residuals)
library(minpack.lm)
residFun <- function(par,x,y) {
res <- sqrt((x-par$h)^2+(y-par$k)^2)-par$r
return(res)
}
parStart <- list("h" = 1.5, "k" = 2.5, "r" = 1.7)
out <- nls.lm(par = parStart, x = x, y = y, lower =NULL, upper = NULL, residFun)
The problem is, predict() doesn't work with nls.lm, hence I am trying to compute the circle fit using nlsLM. (I could compute it by hand, but have trouble creating my design matrix.)
So this is what I tried next:
dat = list("x" = x,"y" = y)
out1 <- nlsLM(y ~ sqrt(-(x-h)^2+r^2)+k, start = parStart )
which results in:
Error in stats:::nlsModel(formula, mf, start, wts) :
singular gradient matrix at initial parameter estimates
Question 1a: How does nlsLM() work with circle fits? (The advantage being that the generic predict() is available.)
Question 1b: How do I get the prediction interval for my circle fit?
EXAMPLE from linear regression (this is what I want for the circle regression)
attach(faithful)
eruption.lm = lm(eruptions ~ waiting)
newdata = data.frame(waiting=seq(45,90, length = 272))
# confidence interval
conf <- predict(eruption.lm, newdata, interval="confidence")
# prediction interval
pred <- predict(eruption.lm, newdata, interval="predict")
# plot of the data [1], the regression line [1], confidence interval [2], and prediction interval [3]
plot(eruptions ~ waiting)
lines(conf[,1] ~ newdata$waiting, col = "black") # [1]
lines(conf[,2] ~ newdata$waiting, col = "red") # [2]
lines(conf[,3] ~ newdata$waiting, col = "red") # [2]
lines(pred[,2] ~ newdata$waiting, col = "blue") # [3]
lines(pred[,3] ~ newdata$waiting, col = "blue") # [3]
Kind regards
Summary of Edits:
Edit1: Rearranged formula in nlsLM, but parameter (h,k,r) results are now different in out and out1 ...
Edit2: Added 2 Wikipedia links to clarify the terminology used (cf. below):
confidence interval
prediction interval
Edit3: Some rephrasing of the question(s)
Edit4: Added a working example for linear regression
I am having a hard time figuring out what you want to do. Let me illustrate what the data looks like and something about the "prediction".
plot(x,y, xlim=range(x)*c(0, 1.5), ylim=range(y)*c(0, 1.5))
lines(out$par$h+c(-1,-1,1,1,-1)*out$par$r, # extremes of x-coord
out$par$k+c(-1,1,1,-1 ,-1)*out$par$r, # extremes of y-coord
col="red")
So what "prediction interval" are we speaking about? ( I do realize that you were thinking of a circle and if you just want to plot a circle on this background that's going to be pretty easy as well.)
lines(out$par$h+cos(seq(-pi,pi, by=0.1))*out$par$r, #center + r*cos(theta)
out$par$k+sin(seq(-pi,pi, by=0.1))*out$par$r, #center + r*sin(theta)
col="red")
I think that this question is not answerable in its current form. Any predict() function that is based on a linear model will require the predicted variable to be a linear function of the input design matrix. r^2 = (x-x0)^2 + (y-y0)^2 is not a linear function of the design matrix (which would be something like [x0 x y0 y]), so I don't think you're going to be able to find a linear model fit that will give you confidence intervals. If someone more clever than I am has a way to do it, though, I'd be very interested in hearing about it.
The general way to approach these sorts of problems is to create a hierarchical nonlinear model, where your hyperparameters would be x0 and y0 (your h and k) with uniform distribution over your search space, and then the r^2 would be distributed ~N((x-x0)^2+(y-y0)^2, \sigma). You would then use MCMC sampling or similar to get your posterior confidence intervals.
Here's a solution to find h, k, r using base R's optim function. You essentially create a cost function that is a closure containing the data you wish to optimize over. I had to square the residuals, else the objective would go to -Inf. There is a local optima problem, so you need to run this a few times...
# data
x <- c(1,2.2,1,2.5,1.5,0.5,1.7)
y <- c(1,1,3,2.5,4,1.7,0.8)
residFunArg <- function(xVector,yVector){
function(theta,xVec=xVector,yVec=yVector){
# theta = c(h, k, r); returns the sum of squared radial residuals
sum((sqrt((xVec-theta[1])^2+(yVec-theta[2])^2)-theta[3])^2)
}
}
rFun = residFunArg(x,y);
o = optim(f=rFun,par=c(0,0,0))
h = o$par[1]
k = o$par[2]
r = o$par[3]
Run this command in the REPL to observe the local minima:
o=optim(f=rFun,par=runif(3),method="CG");o$par
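Since the result depends on the starting values, one option (a different technique from the optim search above, offered only as a sketch) is an algebraic circle fit: expanding the circle equation gives x^2 + y^2 = 2*h*x + 2*k*y + (r^2 - h^2 - k^2), which is linear in the unknowns and can be solved with lm(). Its output can then seed optim. The helper name fit_circle_algebraic is made up here.

```r
# Algebraic circle fit via linear least squares (hypothetical helper)
fit_circle_algebraic <- function(x, y) {
  z <- x^2 + y^2
  cf <- unname(coef(lm(z ~ x + y)))  # intercept, 2*h, 2*k
  h <- cf[2] / 2
  k <- cf[3] / 2
  r <- sqrt(cf[1] + h^2 + k^2)       # intercept = r^2 - h^2 - k^2
  c(h = h, k = k, r = r)
}

# Sanity check on points lying exactly on a circle with h = 1, k = 2, r = 3
theta <- seq(0, 2 * pi, length.out = 12)[-12]
est <- fit_circle_algebraic(1 + 3 * cos(theta), 2 + 3 * sin(theta))

# Use the algebraic fit as the starting point for the geometric fit
x <- c(1, 2.2, 1, 2.5, 1.5, 0.5, 1.7)
y <- c(1, 1, 3, 2.5, 4, 1.7, 0.8)
rss <- function(p) sum((sqrt((x - p[1])^2 + (y - p[2])^2) - p[3])^2)
start <- fit_circle_algebraic(x, y)
refined <- optim(start, rss)
rbind(algebraic = start, geometric = refined$par)
```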
