nls function does not perform well in R

I need to fit this formula
y ~ 1/(pi*a*(1+((x-2.15646)/a)^2))+1/(pi*b*(1+((x-2.16355)/b)^2))
to the variables x and y
x<- c(2.15011, 2.15035, 2.15060, 2.15084, 2.15109, 2.15133, 2.15157, 2.15182, 2.15206, 2.15231, 2.15255, 2.15280, 2.15304, 2.15329, 2.15353, 2.15377, 2.15402, 2.15426, 2.15451, 2.15475, 2.15500, 2.15524, 2.15549, 2.15573, 2.15597, 2.15622, 2.15646, 2.15671, 2.15695, 2.15720, 2.15744, 2.15769, 2.15793, 2.15817, 2.15842, 2.15866, 2.15891, 2.15915, 2.15940, 2.15964, 2.15989, 2.16013, 2.16037, 2.16062, 2.16086, 2.16111, 2.16135, 2.16160, 2.16184, 2.16209, 2.16233, 2.16257, 2.16282, 2.16306, 2.16331, 2.16355, 2.16380, 2.16404, 2.16429, 2.16453, 2.16477, 2.16502, 2.16526, 2.16551, 2.16575, 2.16600, 2.16624, 2.16649, 2.16673, 2.16697, 2.16722, 2.16746, 2.16771, 2.16795, 2.16820, 2.16844, 2.16869, 2.16893, 2.16917, 2.16942, 2.16966, 2.16991)
y<- c(3.77212, 3.79541, 3.84574, 3.91918, 4.01056, 4.11677, 4.23851, 4.37986, 4.54638, 4.74367, 4.97765, 5.25593, 5.58823, 5.98405, 6.44850, 6.98006, 7.57280, 8.22085, 8.92094, 9.66990, 10.45900, 11.26720, 12.05540, 12.76920, 13.34830, 13.74250, 13.92420, 13.89250, 13.67090, 13.29980, 12.82780, 12.30370, 11.76950, 11.25890, 10.80020, 10.41860, 10.13840, 9.98005, 9.95758, 10.07690, 10.33680, 10.73210, 11.25730, 11.90670, 12.67240, 13.54110, 14.49530, 15.51670, 16.58660, 17.67900, 18.75190, 19.74600, 20.59680, 21.24910, 21.66800, 21.83910, 21.76560, 21.46020, 20.94020, 20.22730, 19.35360, 18.36460, 17.31730, 16.26920, 15.26920, 14.35320, 13.54360, 12.85230, 12.28520, 11.84690, 11.54040, 11.36610, 11.32130, 11.39980, 11.59230, 11.88310, 12.25040, 12.66660, 13.09810, 13.50220, 13.82580, 14.01250)
in order to estimate the values of 'a' and 'b'. Both 'a' and 'b' lie in the range between 0 and 1.
However, when I used the nls command:
nls(y ~1/(pi*a*(1+((x-2.15646)/a)^2))+1/(pi*b*(1+((x-2.16355)/b)^2)), control = list(maxiter = 500), start=list(a=0.4,b=0.4))
The console reported the following error:
singular gradient
Can anyone explain why the console prints this message?

This gives a better fit:
Before getting into the code (below), there are several issues with your model:
Assuming this is proton NMR, the area under the peaks is proportional to the proton abundance (so, number of protons). Your model does not allow for this, essentially forcing all peaks to have the same area. This is the main reason for the poor fit. We can accommodate this easily by including a "height" factor for each peak.
Your model assumes the peak positions. Why not just let the algorithm find the true peak positions?
Your model does not account for baseline drift, which as you can see is quite severe in your dataset. We can accommodate this by adding a linear drift function to the model.
nls(...) is poor for this type of modeling - the algorithms it uses are not especially robust. The default algorithm, Gauss-Newton, is especially poor when fitting offset data, so estimating p1 and p2 in a model with f(x-p1, x-p2) nearly always fails.
A better approach is to use the exceptionally robust Levenberg-Marquardt algorithm implemented in nls.lm(...) in package minpack.lm. This package is a bit harder to use, but it is capable of dealing with problems inaccessible to nls(...). If you're going to do a lot of this, you should read the documentation to understand how this example works.
Finally, even with nls.lm(...) the starting points have to be reasonable. In your model a and b are the peak widths. Clearly they must be comparable to or smaller than the difference in peak positions or the peaks will get smeared together. Your estimates of (a,b) = (0.4, 0.4) were much too large.
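For a quick sense of scale: the two assumed peak positions are only about 0.007 apart, so widths of 0.4 would smear the peaks into one broad hump.
2.16355 - 2.15646
# [1] 0.00709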
plot(x,y)
library(minpack.lm)
lorentzian <- function(par, x) {
  a <- par[1]        # width of peak 1
  b <- par[2]        # width of peak 2
  p1 <- par[3]       # position of peak 1
  p2 <- par[4]       # position of peak 2
  h1 <- par[5]       # height (area) factor of peak 1
  h2 <- par[6]       # height (area) factor of peak 2
  drift.a <- par[7]  # baseline intercept
  drift.b <- par[8]  # baseline slope
  h1/(pi*a*(1+((x-p1)/a)^2)) + h2/(pi*b*(1+((x-p2)/b)^2)) + drift.a + drift.b*x
}
resid <- function(par, obs, xx) { obs - lorentzian(par, xx) }
par <- c(a = 0.001, b = 0.001, p1 = 2.157, p2 = 2.163, h1 = 1, h2 = 1, drift.a = 0, drift.b = 0)
lower <- c(a = 0, b = 0, p1 = 0, p2 = 0, h1 = 0, h2 = 0, drift.a = NA, drift.b = NA)
nls.out <- nls.lm(par = par, lower = lower, fn = resid, obs = y, xx = x,
                  control = nls.lm.control(maxiter = 500))
coef(nls.out)
# a b p1 p2 h1 h2 drift.a drift.b
# 1.679632e-03 1.879690e-03 2.156308e+00 2.163500e+00 4.318793e-02 8.199394e-02 -9.273083e+02 4.323897e+02
lines(x,lorentzian(coef(nls.out), x), col=2, lwd=2)
One last thing: the convention on SO is to wait a day before "accepting" an answer. The reason is that questions with accepted answers rarely get additional attention - once you accept an answer, no one else will look at it.

Try using the reciprocals of a and b:
fm <- nls(y ~ 1/(pi*(1/a)*(1+((x-2.15646)/(1/a))^2)) + 1/(pi*(1/b)*(1+((x-2.16355)/(1/b))^2)),
          lower = 1, alg = "port",
          control = list(maxiter = 500),
          start = list(a = 1/.4, b = 1/.4))
which gives:
> 1/coef(fm)
a b
1.00000000 0.02366843
Unfortunately the model does not work very well as the graph at the bottom shows.
plot(y ~ x, pch = 20)
lines(x, fitted(fm), col = "red")
ADDED:
In his answer, @jlhoward provided a better fitting model based on 8 parameters. I will just point out that if we used his model with his starting values for a, b, p1 and p2 (we don't need starting values for the linear parameters if we specify alg = "plinear") then nls would work too.
fo <- y ~ cbind(1/(pi*a*(1+((x-p1)/a)^2)), 1/(pi*b*(1+((x-p2)/b)^2)), 1, x)
start <- c(a = 0.001, b = 0.001, p1 = 2.157, p2 = 2.163)
fm2 <- nls(fo, start = start, alg = "plinear")
giving:
> coef(fm2)
a b p1 p2 .lin1
1.679635e-03 1.879682e-03 2.156308e+00 2.163500e+00 4.318798e-02
.lin2 .lin3 .lin.x
8.199364e-02 -9.273104e+02 4.323907e+02
Graph showing poor fit for fm:
REVISED to add constraints.

Related

SIRD model fitting in R using real data not working

I am trying to fit a SIRD model in R to real data. However, the observed values lie nowhere near the fitted curve. I can't understand what the error is or how to resolve it, but I have noticed that changing the value of "state" produces the error
DLSODA- Warning..Internal T (=R1) and H (=R2) are
such that in the machine, T + H = T on the next step
(H = step size). Solver will continue anyway.
In above message, R1 = 0.1, R2 = 9.94667e-21
Here is my entire code. Any help is greatly appreciated!
library(deSolve)
state<-c(S=10000,I=1000,R=5000,D=100)
parameters <- c(a=180,b=0.4,g=0.2)
eqn <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    dS <- -a*I*S
    dI <- a*I*S - g*I - b*I
    dR <- g*I
    dD <- b*I
    list(c(dS, dI, dR, dD))
  })
}
times <- seq(0.1,2.6,by=0.1)
out <- ode(y = state, times = times, func = eqn, parms = parameters)
out
plot(out)
library(FME)
data <- data.frame(
time = seq(0.1,2.6,0.1),
S=c(11417747943,11417733626,11417717809,11417702207,11417685587,11417670536,
11417652672,11417629493,11417603660,11417577979,11417550853,11417520318,
11417495673,11417466974,11417435119,11417399167,11417362265,11417326539,
11417286125,11417254482,11417226564,11417187020,11417143837,11417095924,
11417046477,11416989403),
I=c(3686,7062,4415,8040,7706,4316,8266,13947,13593,11207,13446,19114,5121,15400,
16658,15386,19766,21024,22426,10683,3958,15701,10290,23299,11340,29331),
R=c(9913,7193,11344,7467,8861,10671,9510,9138,12174,14400,13588,11314,19463,13165,
15098,20444,17019,14523,17874,20854,23820,23600,32641,24126,37821,27508),
D=c(54,57,56,88,50,48,87,84,58,70,92,99,58,132,95,111,112,166,108,102,139,
227,249,481,277,222)
)
cost <- function(p) {
  out <- ode(state, times, eqn, p)
  modCost(out, data, weight = "none")
}
fit <- modFit(f = cost, p = parameters)
summary(fit)
out1 <- ode(state, times, eqn, parameters)
out2 <- ode(state, times, eqn, coef(fit))
plot(out1, out2, obs=data, obspar=list(pch=16, col="red"))
Your code has several issues:
The order of magnitude of the state variables differs, so you need weight = "std" or weight = "mean".
The initial values of the state variables are far away from the data. This is the most critical error. You can either set them manually to reasonable values (see below) or, even better, fit them; see the FME documentation for how this can be done.
The start parameters are far away from the optimum. While it would be desirable for the algorithm to converge from arbitrary naive start values, this is rarely the case, so some careful consideration or trial and error is unavoidable.
The mass balance is violated, i.e. the sum of all 4 states changes over time. Check rowSums(data[-1]).
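A quick check of the mass balance, using the data frame from the question (if the balance held, every row total would be the same):
range(rowSums(data[c("S", "I", "R", "D")]))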
Here is an approach that handles part of the problem. The next step would then be to fix the mass balance and to include the initial states of the ode model as parameters of the nonlinear optimization.
library(deSolve)
library(FME)
eqn <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    dS <- -a*I*S
    dI <- a*I*S - g*I - b*I
    dR <- g*I
    dD <- b*I
    list(c(dS, dI, dR, dD))
  })
}
data <- data.frame(
time = seq(0.1,2.6,0.1),
S=c(11417747943,11417733626,11417717809,11417702207,11417685587,11417670536,
11417652672,11417629493,11417603660,11417577979,11417550853,11417520318,
11417495673,11417466974,11417435119,11417399167,11417362265,11417326539,
11417286125,11417254482,11417226564,11417187020,11417143837,11417095924,
11417046477,11416989403),
I=c(3686,7062,4415,8040,7706,4316,8266,13947,13593,11207,13446,19114,5121,15400,
16658,15386,19766,21024,22426,10683,3958,15701,10290,23299,11340,29331),
R=c(9913,7193,11344,7467,8861,10671,9510,9138,12174,14400,13588,11314,19463,13165,
15098,20444,17019,14523,17874,20854,23820,23600,32641,24126,37821,27508),
D=c(54,57,56,88,50,48,87,84,58,70,92,99,58,132,95,111,112,166,108,102,139,
227,249,481,277,222)
)
state <- c(S=11417747943, I=5000, R=8000, D=50)
parameters <- c(a=1e-10, b=0.001, g=0.1)
times<-seq(0.1,2.6,by=0.01)
cost <- function(p) {
  out <- ode(state, times, eqn, p)
  modCost(out, data, weight = "mean")
}
fit <- modFit(f = cost, p = parameters)
summary(fit, corr=TRUE)
out2 <- ode(state, times, eqn, coef(fit))
plot(out2, obs=data, obspar=list(pch=16, col="red"), ylim=list(c(0, 2e10), c(0, 50000), c(0, 50000), c(0, 600)))
Edit
The following approach improves the fit by:
fixing mass balance by setting total population to be constant over time
re-scale data to improve stability of optimization
guessing initial values from data
It would (in theory) be even better to include initial values in the optimization, but this would lead again to non-identifiability of parameters
due to the intrinsic characteristics of the given model and data. See twocomp_final.R for a related tutorial example.
Instead of rescaling the data, one may also consider adapting the control parameters
of the optimizer(s) and of the ode function, or rescaling individual state variables differently.
However, it is easiest here just to rescale the population to "million people".
## fix mass balance, i.e. make sum of all states constant
## an alternative would be an additional process in the model
## for migration and / or birth and natural death
Population <- rowSums(data[c("S", "I", "R", "D")])
data$S <- Population[1] - rowSums(data[c("I", "R", "D")])
## rescale state variables to numerically more convenient numbers
## here simply: million people
scaled_data <- cbind(
time = data$time,
data[c("S", "I", "R", "D")] * 1e-6
)
## guess initial values from data (of course a little bit subjective)
state <- c(
S = scaled_data$S[1],
I = mean(scaled_data$I[1:3]),
R = mean(scaled_data$R[1:5]),
D = mean(scaled_data$D[1:3])
)
## use good initial parameters by thinking and some trial and error
parameters <- c(a = 0.0001, b = 0.01, g = 1)
cost2 <- function(p) {
  out <- ode(state, times, eqn, p)
  modCost(out, scaled_data, weight = "mean")
}
## fit model, enable trace with option nprint
fit <- modFit(f = cost2, p = parameters, control = list(nprint = 1))
summary(fit, corr=TRUE)
out2 <- ode(state, times, eqn, coef(fit))
plot(out2, obs = scaled_data, obspar = list(pch = 16, col = "red"))

Why are the predicted values of my GLM cyclical?

I wrote a binomial regression model to predict the prevalence of igneous stone, v, at an archaeological site based on proximity to a river, river_dist, but when I use the predict() function I'm getting odd cyclical results instead of the curve I was expecting. For reference, my data:
v n river_dist
1 102 256 1040
2 1 11 720
3 19 24 475
4 12 15 611
Which I fit to this model:
library(bbmle)
m_r <- mle2(ig$v ~ dbinom(size=ig$n, prob = 1/(1+exp(-(a + br * river_dist)))),
start = list(a = 0, br = 0), data = ig)
This produces a coefficient which, when back-transformed, suggests about a 0.4% decrease in the odds of igneous stone per meter from the river (exp(br) = 0.996):
exp(coef(m_r))
That's all good. But when I try to predict new values, I get this odd cycling of values:
newdat <- data.frame(river_dist=seq(min(ig$river_dist), max(ig$river_dist),len=100))
newdat$v <- predict(m_r, newdata=newdat, type="response")
plot(v~river_dist, data=ig, col="red4")
lines(v ~ river_dist, newdat, col="green4", lwd=2)
Example of predicted values:
river_dist v
1 475.0000 216.855114
2 480.7071 9.285536
3 486.4141 20.187424
4 492.1212 12.571487
5 497.8283 213.762248
6 503.5354 9.150584
7 509.2424 19.888471
8 514.9495 12.381805
9 520.6566 210.476312
10 526.3636 9.007289
11 532.0707 19.571218
12 537.7778 12.180629
Why are the values cycling up and down like that, creating crazy spikes when graphed?
In order for newdata to work, you have to specify the variables as 'raw' values rather than with $:
library(bbmle)
m_r <- mle2(v ~ dbinom(size=n, prob = 1/(1+exp(-(a + br * river_dist)))),
start = list(a = 0, br = 0), data = ig)
At this point, as @user20650 suggests, you'll also have to specify a value (or values) for n in newdata.
This model appears to be identical to binomial regression: is there a reason not to use
glm(cbind(v,n-v) ~ river_dist, data=ig, family=binomial)
? (bbmle::mle2 is more general, but glm is much more robust.) (Also: fitting two parameters to four data points is theoretically fine, but you should not try to push the results too far ... in particular, a lot of the default results from GLM/MLE are asymptotic ...)
Actually, in double-checking the correspondence of the MLE fit with GLM I realized that the default method ("BFGS", for historical reasons) doesn't actually give the right answer (!); switching to method="Nelder-Mead" improves things. Adding control=list(parscale=c(a=1,br=0.001)) to the argument list, or scaling the river dist (e.g. going from "1 m" to "100 m" or "1 km" as the unit), would also fix the problem.
m_r <- mle2(v ~ dbinom(size = n,
                       prob = 1/(1 + exp(-(a + br * river_dist)))),
            start = list(a = 0, br = 0), data = ig,
            method = "Nelder-Mead")
pframe <- data.frame(river_dist = seq(500, 1000, length = 51), n = 1)
pframe$prop <- predict(m_r, newdata = pframe, type = "response")
CIs <- lapply(seq(nrow(ig)),
              function(i) prop.test(ig[i, "v"], ig[i, "n"])$conf.int)
ig2 <- data.frame(ig, setNames(as.data.frame(do.call(rbind, CIs)),
                               c("lwr", "upr")))
library(ggplot2); theme_set(theme_bw())
ggplot(ig2,aes(river_dist,v/n))+
geom_point(aes(size=n)) +
geom_linerange(aes(ymin=lwr,ymax=upr)) +
geom_smooth(method="glm",
method.args=list(family=binomial),
aes(weight=n))+
geom_line(data=pframe,aes(y=prop),colour="red")
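For completeness, here is a minimal sketch of the parscale alternative mentioned above (keeping the default BFGS optimizer); the parscale values are the ones suggested earlier rather than carefully tuned:
m_r2 <- mle2(v ~ dbinom(size = n, prob = 1/(1 + exp(-(a + br * river_dist)))),
             start = list(a = 0, br = 0), data = ig,
             control = list(parscale = c(a = 1, br = 0.001)))
coef(m_r2)   # expected to be close to the Nelder-Mead fit above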
Finally, note that your third-farthest site is an outlier (although the small sample size means it doesn't hurt much).

Exponential decay fit in R

I would like to fit an exponential decay function in R to the following data:
data <- structure(list(x = 0:38, y = c(0.991744340878828, 0.512512332368168,
0.41102449265681, 0.356621905557202, 0.320851602373477, 0.29499198506227,
0.275037747162642, 0.25938850981822, 0.245263623938863, 0.233655093612007,
0.224041426946405, 0.214152907133301, 0.207475138903635, 0.203270738895484,
0.194942528735632, 0.188107106969046, 0.180926819430008, 0.177028560207711,
0.172595416846822, 0.166729221891201, 0.163502461048814, 0.159286528409165,
0.156110097827889, 0.152655498715612, 0.148684858095915, 0.14733605355542,
0.144691873223729, 0.143118852619617, 0.139542186417186, 0.137730138713745,
0.134353615271572, 0.132197800438632, 0.128369567159113, 0.124971834736476,
0.120027536018095, 0.117678812415655, 0.115720611113327, 0.112491329844252,
0.109219168085624)), class = "data.frame", row.names = c(NA,
-39L), .Names = c("x", "y"))
I've tried fitting with nls but the generated curve is not close to the actual data.
It would be very helpful if anyone could explain how to work with such nonlinear data and find a function of best fit.
Try y ~ .lin / (b + x^c). Note that when using "plinear" one omits the .lin linear parameter when specifying the formula to nls and also omits a starting value for it.
Also note that the .lin and b parameters are approximately 1 at the optimum so we could also try the one parameter model y ~ 1 / (1 + x^c). This is the form of a one-parameter log-logistic survival curve. The AIC for this one parameter model is worse than for the 3 parameter model (compare AIC(fm1) and AIC(fm3)) but the one parameter model might still be preferable due to its parsimony and the fact that the fit is visually indistinguishable from the 3 parameter model.
opar <- par(mfcol = 2:1, mar = c(3, 3, 3, 1), family = "mono")
# data = data.frame with x & y col names; fm = model fit; main = string shown above plot
Plot <- function(data, fm, main) {
  plot(y ~ x, data, pch = 20)
  lines(fitted(fm) ~ x, data, col = "red")
  legend("topright", bty = "n", cex = 0.7, legend = capture.output(fm))
  title(main = paste(main, "- AIC:", round(AIC(fm), 2)))
}
# 3 parameter model
fo3 <- y ~ 1/(b + x^c) # omit .lin parameter; plinear will add it automatically
fm3 <- nls(fo3, data = data, start = list(b = 1, c = 1), alg = "plinear")
Plot(data, fm3, "3 parameters")
# one parameter model
fo1 <- y ~ 1 / (1 + x^c)
fm1 <- nls(fo1, data, start = list(c = 1))
Plot(data, fm1, "1 parameter")
par(opar)
AIC
Adding the solutions in the other answers, we can compare the AIC values. We have labelled each solution by the number of parameters it uses (the degrees of freedom are one greater than that). The log-log solution has been reworked to use nls instead of lm and to have y as the left-hand side, since one cannot compare AIC values across models that have different left-hand sides or that use different optimization routines (the log-likelihood constants used could differ).
fo2 <- y ~ exp(a + b * log(x+1))
fm2 <- nls(fo2, data, start = list(a = 1, b = 1))
fo4 <- y ~ SSbiexp(x, A1, lrc1, A2, lrc2)
fm4 <- nls(fo4, data)
aic <- AIC(fm1, fm2, fm3, fm4)
aic[order(aic$AIC), ]
giving from best AIC (i.e. fm3) to worst AIC (i.e. fm2):
df AIC
fm3 4 -329.35
fm1 2 -307.69
fm4 5 -215.96
fm2 3 -167.33
A biexponential model would fit much better, though still not perfect. This would indicate that you might have two simultaneous decay processes.
fit <- nls(y ~ SSbiexp(x, A1, lrc1, A2, lrc2), data = data)
#A1*exp(-exp(lrc1)*x)+A2*exp(-exp(lrc2)*x)
plot(y ~x, data = data)
curve(predict(fit, newdata = data.frame(x)), add = TRUE)
If the measurement error depends on magnitude, you could consider using it for weighting.
However, you should consider carefully what kind of model you'd expect from your domain knowledge. Just selecting a non-linear model empirically is usually not a good idea. A non-parametric fit might be a better option.
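As rough sketches of those two suggestions (the 1/y^2 weights simply assume the error scales with the magnitude of y, and loess with its default settings is just one possible non-parametric smoother):
fit_w <- nls(y ~ SSbiexp(x, A1, lrc1, A2, lrc2), data = data, weights = 1/y^2)  # weighted biexponential fit
np <- loess(y ~ x, data = data)  # non-parametric alternative
plot(y ~ x, data = data, pch = 20)
lines(data$x, fitted(np), col = "blue")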
data <- structure(list(x = 0:38, y = c(0.991744340878828, 0.512512332368168,
0.41102449265681, 0.356621905557202, 0.320851602373477, 0.29499198506227,
0.275037747162642, 0.25938850981822, 0.245263623938863, 0.233655093612007,
0.224041426946405, 0.214152907133301, 0.207475138903635, 0.203270738895484,
0.194942528735632, 0.188107106969046, 0.180926819430008, 0.177028560207711,
0.172595416846822, 0.166729221891201, 0.163502461048814, 0.159286528409165,
0.156110097827889, 0.152655498715612, 0.148684858095915, 0.14733605355542,
0.144691873223729, 0.143118852619617, 0.139542186417186, 0.137730138713745,
0.134353615271572, 0.132197800438632, 0.128369567159113, 0.124971834736476,
0.120027536018095, 0.117678812415655, 0.115720611113327, 0.112491329844252,
0.109219168085624)), class = "data.frame", row.names = c(NA,
-39L), .Names = c("x", "y"))
# Do this because the log of 0 is not possible to calculate
data$x = data$x +1
fit = lm(log(y) ~ log(x), data = data)
plot(data$x, data$y)
lines(data$x, exp(fit$coefficients[1]) * data$x ^ fit$coefficients[2], col = "red")  # back-transform: y = exp(intercept) * x^slope
This did a lot better than using the nls formula, and when plotted the fit seems to do fairly well.

Write a program to minimize the sum of squares of recursive exponential function

This is the function that I'd like to code in R:
Y_i = a + sum over j = 1, 2, ..., i-1 of (b*exp(-(i-j)/c) - f*exp(-(i-j)/g)) * S_j
a, b, c, f, g are to be determined by nls (with starting values arbitrarily set to 7, 30, 15, 1, 2); S and Y are in the dataset.
The function can also be presented as computationally friendlier recursive equations, which is what my attempt below implements.
Here is my attempt at the code, but I could not get it to converge:
S=c(235,90,1775,960,965,1110,370,485,667,140,588,10,0,1340,600,0,930,1250,930,120,895,825,0,935,695,270,0,610,0,0,445,0,0,370,470,819,717,0,0,60,0,135,690,0,825,730,1250,370,1010,261,0,865,570,1425,150,1515,1143,0,675,1465,375,0,690,290,0,430,735,510,270,450,1044,0,928,60,95,105,60,950,0,1640,3960,1510,500,1135,0,0,0,181,568,60,1575,247,0,1270,870,290,510,0,540,455,120,580,420,90,525,1116,499,0,60,150,660,1080,1715,90,1090,840,975,280,850,633,30,1530,1765,880,150,225,77,1380,810,835,0,540,1017,1108,0,300,600,90,370,910,0,60,60,0,0,0,0,50,0,735,900)
Y=c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,7.7,NA,NA,7.2,NA,NA,NA,NA,NA,NA,7.4,NA,NA,NA,NA,NA,NA,10.7,NA,NA,NA,NA,8.1,8.5,NA,NA,NA,NA,NA,9.9,NA,7.4,NA,NA,NA,9.5,NA,NA,9,NA,NA,NA,8.8,NA,NA,8.5,NA,NA,NA,6.9,NA,NA,7.9,NA,NA,NA,7.3,NA,7.9,8.3,NA,NA,NA,11.5,NA,NA,12.3,NA,NA,NA,6.1,NA,NA,9,NA,NA,NA,10.3,NA,NA,9.7,NA,NA,8.6,NA,9.1,NA,NA,11,NA,NA,12.4,11.1,10.1,NA,NA,NA,NA,11.7,NA,NA,9,NA,NA,NA,10.2,NA,NA,11.2,NA,NA,NA,11.8,NA,9.2,10,9.8,NA,9.5,11.3,10.3,9.5,10.2,10.6,NA,10.8,10.7,11.1,NA,NA,NA,NA,NA,NA,NA,NA,12.6,NA)
mydata = data.frame(Y,S)
f <- function(a, b, f, c, g, m) {
  model <- matrix(NA, nrow(m) + 1, 3)
  model[1, 1] = 0
  model[1, 2] = 0
  model[1, 3] = a
  for (i in 2:nrow(model)) {
    model[i, 1] = exp(-1/c)*model[i-1, 1] + m$S[i-1]
    model[i, 2] = exp(-1/g)*model[i-1, 2] + m$S[i-1]
    model[i, 3] = a + b*model[i, 1] - f*model[i, 2]
  }
  model <- as.data.frame(model)
  colnames(model) = c('l', 'm', 'Y')
  model$Y[which(m$Y > 0)]
}
Y=mydata$Y
nls(Y ~ f(a,b,f,c,g,mydata), start=list(a=7,b=5.3651,f=5.3656,c=16.50329,g=16.5006),control=list(maxiter=1000,minFactor=1e-12))
The errors I've been getting, depending on the starting values, are:
Error in nls(Y ~ f(a, b, f, c, g, mydata), start = list(a = 7, :
number of iterations exceeded maximum of 1000
Error in nls(Y ~ f(a, b, f, c, g, mydata), start = list(a = 7, :
singular gradient
I'm stuck and not sure what to do, any help would be greatly appreciated.
Try this:
ff <- function(a, b, f, c, g) {
  Y <- numeric(length(S))
  for (i in seq(from = 2, to = length(S))) {
    j <- seq(length = i - 1)
    Y[i] <- a + sum((b*exp(-(i-j)/c) - f*exp(-(i-j)/g))*S[j])
  }
  Y
}
S <- c(235,90,1775,960,965,1110,370,485,667,140,588,10,0,1340,600,0,930,1250,930,120,895,825,0,935,695,270,0,610,0,0,445,0,0,370,470,819,717,0,0,60,0,135,690,0,825,730,1250,370,1010,261,0,865,570,1425,150,1515,1143,0,675,1465,375,0,690,290,0,430,735,510,270,450,1044,0,928,60,95,105,60,950,0,1640,3960,1510,500,1135,0,0,0,181,568,60,1575,247,0,1270,870,290,510,0,540,455,120,580,420,90,525,1116,499,0,60,150,660,1080,1715,90,1090,840,975,280,850,633,30,1530,1765,880,150,225,77,1380,810,835,0,540,1017,1108,0,300,600,90,370,910,0,60,60,0,0,0,0,50,0,735,900)
Y <- c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,7.7,NA,NA,7.2,NA,NA,NA,NA,NA,NA,7.4,NA,NA,NA,NA,NA,NA,10.7,NA,NA,NA,NA,8.1,8.5,NA,NA,NA,NA,NA,9.9,NA,7.4,NA,NA,NA,9.5,NA,NA,9,NA,NA,NA,8.8,NA,NA,8.5,NA,NA,NA,6.9,NA,NA,7.9,NA,NA,NA,7.3,NA,7.9,8.3,NA,NA,NA,11.5,NA,NA,12.3,NA,NA,NA,6.1,NA,NA,9,NA,NA,NA,10.3,NA,NA,9.7,NA,NA,8.6,NA,9.1,NA,NA,11,NA,NA,12.4,11.1,10.1,NA,NA,NA,NA,11.7,NA,NA,9,NA,NA,NA,10.2,NA,NA,11.2,NA,NA,NA,11.8,NA,9.2,10,9.8,NA,9.5,11.3,10.3,9.5,10.2,10.6,NA,10.8,10.7,11.1,NA,NA,NA,NA,NA,NA,NA,NA,12.6,NA)
nls(Y ~ ff(a, b, f, c, g), start = list(a = 7, b = 5.3651, f = 5.3656, c = 16.50329, g = 16.5006))
But I am unable to get nls to run here. You may also try a general-purpose optimizer. Construct the sum of squares function (the negative sum of squares, since we maximize it):
SS <- function(par) {
  a <- par[1]
  b <- par[2]
  f <- par[3]
  c <- par[4]
  g <- par[5]
  -sum((Y - ff(a, b, f, c, g))^2, na.rm = TRUE)
}
and maximize (using the question's starting values):
library(maxLik)
start <- c(a = 7, b = 5.3651, f = 5.3656, c = 16.50329, g = 16.5006)
summary(a <- maxBFGS(SS, start = start))
It works, but as you see the gradients are still pretty large. The gradients become small if I re-run a Newton-Raphson optimizer on the output values of BFGS:
summary(b <- maxNR(SS, start=coef(a)))
which gives the results
Newton-Raphson maximisation
Number of iterations: 1
Return code: 2
successive function values within tolerance limit
Function value: -47.36338
Estimates:
estimate gradient
a 10.584488 0.0016371615
b 6.954444 -0.0043306656
f 6.955095 0.0043327901
c 28.622035 -0.0005735572
g 28.619185 0.0003871179
I don't know if this makes sense. The issues with nls and the other optimizers hint that you have numerical instabilities, either related to large numerical values, or the difference of exponents in the model formula.
Check what is going on there :-)
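If you would rather stay in base R, the same least-squares objective can be minimized directly with optim() (note the sign flip, since optim() minimizes by default). A minimal sketch using ff() and the data above, with the question's starting values:
SS_pos <- function(par) sum((Y - ff(par[1], par[2], par[3], par[4], par[5]))^2, na.rm = TRUE)
start <- c(a = 7, b = 5.3651, f = 5.3656, c = 16.50329, g = 16.5006)  # same starting values as above
fit_optim <- optim(start, SS_pos, method = "BFGS", control = list(maxit = 1000))
fit_optim$par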

How do I select the smoothing parameter for smooth.spline()?

I know that the smoothing parameter (lambda) is quite important for fitting a smoothing spline, but I did not see any post here about how to select a reasonable lambda (spar = ?). I was told that spar normally ranges from 0 to 1. Could anyone share their experience using smooth.spline()? Thanks.
smooth.spline(x, y = NULL, w = NULL, df, spar = NULL,
cv = FALSE, all.knots = FALSE, nknots = NULL,
keep.data = TRUE, df.offset = 0, penalty = 1,
control.spar = list(), tol = 1e-6 * IQR(x))
agstudy provides a visual way to choose spar. I remember from a linear models class (though not exactly) that cross-validation can be used to pick the "best" spar. Here's a toy example borrowed from agstudy:
x = seq(1:18)
y = c(1:3,5,4,7:3,2*(2:5),rep(10,4))
splineres <- function(spar) {
  res <- rep(0, length(x))
  for (i in 1:length(x)) {
    mod <- smooth.spline(x[-i], y[-i], spar = spar)
    res[i] <- predict(mod, x[i])$y - y[i]
  }
  return(sum(res^2))
}
spars <- seq(0, 1.5, by = 0.001)
ss <- rep(0, length(spars))
for (i in 1:length(spars)){
ss[i] <- splineres(spars[i])
}
plot(spars, ss, 'l', xlab = 'spar', ylab = 'Cross Validation Residual Sum of Squares' , main = 'CV RSS vs Spar')
spars[which.min(ss)]
R > spars[which.min(ss)]
[1] 0.381
The code is not the neatest, but it should be easy to understand. Also, if you specify cv = TRUE in smooth.spline:
R > xyspline <- smooth.spline(x, y, cv=T)
R > xyspline$spar
[1] 0.3881
From the help of smooth.spline you have the following:
The computational λ used (as a function of spar) is λ = r * 256^(3*spar - 1).
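So, for instance, increasing spar by 1/3 multiplies the computational λ by a factor of 256 (the data-dependent r term cancels in the ratio):
256^(3*(0.5 + 1/3) - 1) / 256^(3*0.5 - 1)
# [1] 256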
spar can be greater than 1 (but I guess not by too much). I think you can vary this parameter and choose it graphically by plotting the fitted values for different values of spar. For example:
spars <- seq(0.2,2,length.out=10) ## I will choose between 10 values
dat <- data.frame(
spar= as.factor(rep(spars,each=18)), ## spar to group data(to get different colors)
x = seq(1:18), ## recycling here to repeat x and y
y = c(1:3,5,4,7:3,2*(2:5),rep(10,4)))
library(lattice)  # xyplot() comes from lattice
xyplot(y ~ x | spar, data = dat, type = c('p'), pch = 19, groups = spar,
       panel = function(x, y, groups, ...) {
         s2 <- smooth.spline(y, spar = spars[panel.number()])
         panel.lines(s2)
         panel.xyplot(x, y, groups, ...)
       })
Here, for example, I get the best results for spar = 0.4.
If you don't have duplicated points at the same x value, then try generalized cross-validation (GCV), which smooth.spline uses when cv = FALSE (the default) - it is a clever way of picking a pretty good value for lambda (spar). One neat detail about GCV is that it doesn't actually have to go to the trouble of doing the calculations for every single set of leave-one-out points, as highlighted in Simon Wood's book. For lots of detail on this, have a look at the notes on Simon Wood's web page on MGCV.
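For example, a minimal sketch on the toy x and y used above (cv = FALSE is the default and requests GCV):
fit_gcv <- smooth.spline(x, y, cv = FALSE)
fit_gcv$spar     # spar selected by GCV
fit_gcv$lambda   # the corresponding lambda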
Adrian Bowman's sm R package has a function h.select() which is intended specifically for doing the grunt work of choosing a smoothing parameter (though I'm not 100% sure that it is compatible with the smooth.spline() function in the base package).
