Fitting experimental data points to different cumulative distributions using R - r

I am new to programming and to R, so I would really appreciate your feedback on the problem I am trying to solve.
I have to fit a cumulative distribution with a two- or three-parameter function. This seems like a fairly straightforward task, but I have been going around in circles on it for some time.
Here are my variables:
x=c(0.01,0.011482,0.013183,0.015136,0.017378,0.019953,0.022909,0.026303,0.0302,0.034674,0.039811,0.045709,0.052481,0.060256,0.069183,0.079433,0.091201,0.104713,0.120226,0.138038,0.158489,0.18197,0.20893,0.239883,0.275423,0.316228,0.363078,0.416869,0.47863,0.549541,0.630957,0.724436,0.831764,0.954993,1.096478,1.258925,1.44544,1.659587,1.905461,2.187762,2.511886,2.884031,3.311311,3.801894,4.365158,5.011872,5.754399,6.606934,7.585776,8.709636,10,11.481536,13.182567,15.135612,17.378008,19.952623,22.908677,26.30268,30.199517,34.673685,39.810717,45.708819,52.480746,60.255959,69.183097,79.432823,91.201084,104.712855,120.226443,138.038426,158.489319,181.970086,208.929613,239.883292,275.42287,316.227766,363.078055,416.869383,478.630092,549.540874,630.957344,724.43596,831.763771,954.992586,1096.478196)
y=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.00044816,0.00127554,0.00221488,0.00324858,0.00438312,0.00559138,0.00686054,0.00817179,0.00950625,0.01085188,0.0122145,0.01362578,0.01514366,0.01684314,0.01880564,0.02109756,0.0237676,0.02683182,0.03030649,0.0342276,0.03874555,0.04418374,0.05119304,0.06076553,0.07437854,0.09380666,0.12115065,0.15836926,0.20712933,0.26822017,0.34131335,0.42465413,0.51503564,0.60810697,0.69886817,0.78237651,0.85461023,0.91287236,0.95616228,0.98569093,0.99869001,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999)
This is the plot where I set up x-axis as log:
After some research, I tried a sigmoid function, as found in one of the posts (I can't add a link since my reputation is not high enough). This is the code:
# sigmoid function definition
sigmoid = function(params, x) {
params[1] / (1 + exp(-params[2] * (x - params[3])))
}
# fitting code using nonlinear least square
fitmodel <- nls(y~a/(1 + exp(-b * (x-c))), start=list(a=1,b=.5,c=25))
# get the coefficients using the coef function
params=coef(fitmodel)
# asigning to y2 sigmoid function
y2 <- sigmoid(params,x)
# plotting y2 function
plot(y2,type="l")
# plotting data points
points(y)
This led me to some good-looking fitting results (I don't know how to quantify this). But when I look at the plot of the sigmoid fitting function, I don't understand why the S shape now appears in the range of x-values from 40 until 7 (looking at the data, the S shape should be in x-values from 10 until 200).
Since I couldn't explain this behavior, I thought of trying the Weibull equation for fitting, but so far I can't get the code to run.
To sum up:
Do you have any idea why the sigmoid is giving me that weird fit?
Do you know of a better two- or three-parameter equation for this fitting approach?
How could I determine the goodness of fit? Something like r^2?

# Data
df <- data.frame(x=c(0.01,0.011482,0.013183,0.015136,0.017378,0.019953,0.022909,0.026303,0.0302,0.034674,0.039811,0.045709,0.052481,0.060256,0.069183,0.079433,0.091201,0.104713,0.120226,0.138038,0.158489,0.18197,0.20893,0.239883,0.275423,0.316228,0.363078,0.416869,0.47863,0.549541,0.630957,0.724436,0.831764,0.954993,1.096478,1.258925,1.44544,1.659587,1.905461,2.187762,2.511886,2.884031,3.311311,3.801894,4.365158,5.011872,5.754399,6.606934,7.585776,8.709636,10,11.481536,13.182567,15.135612,17.378008,19.952623,22.908677,26.30268,30.199517,34.673685,39.810717,45.708819,52.480746,60.255959,69.183097,79.432823,91.201084,104.712855,120.226443,138.038426,158.489319,181.970086,208.929613,239.883292,275.42287,316.227766,363.078055,416.869383,478.630092,549.540874,630.957344,724.43596,831.763771,954.992586,1096.478196),
y=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.00044816,0.00127554,0.00221488,0.00324858,0.00438312,0.00559138,0.00686054,0.00817179,0.00950625,0.01085188,0.0122145,0.01362578,0.01514366,0.01684314,0.01880564,0.02109756,0.0237676,0.02683182,0.03030649,0.0342276,0.03874555,0.04418374,0.05119304,0.06076553,0.07437854,0.09380666,0.12115065,0.15836926,0.20712933,0.26822017,0.34131335,0.42465413,0.51503564,0.60810697,0.69886817,0.78237651,0.85461023,0.91287236,0.95616228,0.98569093,0.99869001,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999))
# sigmoid function definition
sigmoid = function(x, a, b, c) {
a * exp(-b * exp(-c * x))
}
# fitting code using nonlinear least square
fitmodel <- nls(y ~ sigmoid(x, a, b, c), start=list(a=1,b=.5,c=-2), data = df)
# plotting y2 function
plot(df$x, predict(fitmodel),type="l", log = "x")
# plotting data points
points(df)
The function I used is the Gompertz function and this blog post explains why R² shouldn't be used with nonlinear fits and offers an alternative.
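On the first point in the question (the S shape showing up at the "wrong" x-values): one likely explanation, offered here as an assumption rather than something confirmed by the asker, is that plot(y2, type="l") with a single vector plots the values against their index (1, 2, 3, ...) and not against x. The sketch below uses a made-up sigmoid (the parameters 0.05 and 60 are hypothetical, not the fitted ones) on a log-spaced grid similar to the one in the question, and shows that the midpoint of the curve sits at a very different place on the index axis than on the x axis.

```r
# Log-spaced grid of 85 points from 0.01 to 1000, similar in shape to the question's x
x <- 10^seq(-2, 3, length.out = 85)

# Hypothetical sigmoid values (illustrative parameters, not a real fit)
y2 <- 1 / (1 + exp(-0.05 * (x - 60)))

# Plotting a single vector puts the index 1..85 on the horizontal axis
plot(y2, type = "l")

# Supplying x explicitly (with a log axis) puts the curve where the data are
plot(x, y2, type = "l", log = "x")

# Position of the point closest to the curve's midpoint, as index and as x-value
idx_mid <- which.min(abs(y2 - 0.5))
c(index = idx_mid, x = x[idx_mid])
```

The accepted code above already avoids the issue by calling plot(df$x, predict(fitmodel), ...), that is, by passing x explicitly.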

After going through different functions and different data sets, I found the solution that answers all of the questions I posted.
The code for the data set stated in the question is as follows:
df <- data.frame(x=c(0.01,0.011482,0.013183,0.015136,0.017378,0.019953,0.022909,0.026303,0.0302,0.034674,0.039811,0.045709,0.052481,0.060256,0.069183,0.079433,0.091201,0.104713,0.120226,0.138038,0.158489,0.18197,0.20893,0.239883,0.275423,0.316228,0.363078,0.416869,0.47863,0.549541,0.630957,0.724436,0.831764,0.954993,1.096478,1.258925,1.44544,1.659587,1.905461,2.187762,2.511886,2.884031,3.311311,3.801894,4.365158,5.011872,5.754399,6.606934,7.585776,8.709636,10,11.481536,13.182567,15.135612,17.378008,19.952623,22.908677,26.30268,30.199517,34.673685,39.810717,45.708819,52.480746,60.255959,69.183097,79.432823,91.201084,104.712855,120.226443,138.038426,158.489319,181.970086,208.929613,239.883292,275.42287,316.227766,363.078055,416.869383,478.630092,549.540874,630.957344,724.43596,831.763771,954.992586,1096.478196),
y=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.00044816,0.00127554,0.00221488,0.00324858,0.00438312,0.00559138,0.00686054,0.00817179,0.00950625,0.01085188,0.0122145,0.01362578,0.01514366,0.01684314,0.01880564,0.02109756,0.0237676,0.02683182,0.03030649,0.0342276,0.03874555,0.04418374,0.05119304,0.06076553,0.07437854,0.09380666,0.12115065,0.15836926,0.20712933,0.26822017,0.34131335,0.42465413,0.51503564,0.60810697,0.69886817,0.78237651,0.85461023,0.91287236,0.95616228,0.98569093,0.99869001,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999))
library(drc)
fm <- drm(y ~ x, data = df, fct = G.3()) #The Gompertz model G.3()
plot(fm)
#Gompertz Coefficients and residual standard error
summary(fm)
The plot after fitting
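For the goodness-of-fit part of the question, the residual standard error reported by summary() can also be computed by hand for any nls fit. The following is a minimal sketch on simulated data (the data, the logistic-in-log10(x) model and the parameter names asym, rate and mid are all assumptions made for illustration, not the fit from this thread).

```r
set.seed(1)

# Simulated cumulative curve on a log-spaced grid (illustrative data only)
x  <- 10^seq(-2, 3, length.out = 85)
lx <- log10(x)
y  <- plogis(4 * (lx - 1.8)) + rnorm(85, sd = 0.01)

# Three-parameter logistic fitted on the log10 scale
fit <- nls(y ~ asym / (1 + exp(-rate * (lx - mid))),
           start = list(asym = 1, rate = 3, mid = 1.5))

# Residual standard error: sqrt(residual sum of squares / residual degrees of freedom)
rse_by_hand <- sqrt(sum(resid(fit)^2) / df.residual(fit))

rse_by_hand   # same quantity that summary(fit) prints
sigma(fit)    # built-in extractor for the same value
AIC(fit)      # can be used to compare candidate models on the same data
```

The residual standard error is in the units of y, so a smaller value means the curve passes closer to the points; AIC is only meaningful when comparing models fitted to the same response.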

Related

How do I perform Non Linear Least Squares in R with a pre determined lag structure

Suppose I want to estimate the parameters of the following model:
$y_t = \beta_0 \left( \sum_{i=1}^{p} w(\delta; i)\, x_{t-i} \right)$.
Latex version of the equation: https://i.stack.imgur.com/POOlD.png
Here $y_t$ and $x_{t-i}$ are known data points, $w(\delta; i)$ follows an exponential Almon lag structure with two parameters delta1 and delta2 (see image), and beta0 is the common parameter.
Generating some data for x and y
y <- seq(1:10)
x <- rnorm(10,2,5)
The literature suggests estimating the model parameters using NLS and the Gauss-Newton method. R does have a function gaussNewton, but I am not sure how to use it. How do I approach the estimation of the parameters beta0, delta1 and delta2?
Wikipedia describes the general approach (https://en.wikipedia.org/wiki/Non-linear_least_squares), but I feel it is not appropriate in this case.
The nls function in R is unable to deal with predefined lag structures so this is not an option either. Maybe I could write out the function in the form of the sum of squared residuals and use the optim function? Another option could be to use the nlm function.
nonls <- function(delta1,delta2,i,p) {
z <- exp(delta1 * i + delta2 *i)
wdelta[i] <- exp(delta1 * i + delta2 *i)/sum(z[1:i])
ssr <- (y[i]- (beta0 * wdelta[i] * x[i:p]))^2
}
optim(ssr)
I look forward to your suggestions.
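One way the "write out the sum of squared residuals and pass it to optim" idea could look is sketched below. Several things here are assumptions: the weights are taken to be the usual exponential Almon form, exp(delta1*i + delta2*i^2) normalized to sum to one (the linked image is not available, and the attempt above uses delta2*i); the data are simulated with known parameters; and the helper names almon_w and lag_mat are made up for this example.

```r
set.seed(1)
n <- 60   # number of observations (simulated)
p <- 4    # number of lags
x <- rnorm(n, 2, 5)

# Assumed exponential Almon weights: proportional to exp(d1*i + d2*i^2), summing to 1
almon_w <- function(delta, p) {
  i <- seq_len(p)
  z <- exp(delta[1] * i + delta[2] * i^2)
  z / sum(z)
}

# Matrix whose column i holds x_{t-i} for t = p+1, ..., n
lag_mat <- function(x, p) {
  n <- length(x)
  sapply(seq_len(p), function(i) x[(p + 1 - i):(n - i)])
}

X <- lag_mat(x, p)

# Simulate y from known parameters so the estimates can be checked
beta0_true <- 1.5
delta_true <- c(0.3, -0.2)
y <- beta0_true * drop(X %*% almon_w(delta_true, p)) + rnorm(nrow(X), sd = 0.1)

# Sum of squared residuals as a function of par = (beta0, delta1, delta2)
ssr <- function(par) {
  fitted <- par[1] * drop(X %*% almon_w(par[2:3], p))
  sum((y - fitted)^2)
}

fit <- optim(c(1, 0, 0), ssr, method = "BFGS")
fit$par     # estimates of beta0, delta1, delta2
fit$value   # minimized sum of squared residuals
```

Note that the objective takes a single parameter vector and returns one number, which is what optim expects; the lag structure is handled by building the lag matrix once, outside the objective.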

How to do an exponential regression model?

I have a small data base (txt file).
I want to obtain an exponential regression in R.
The commands that I am using are:
regression <- read.delim("C:/Users/david/OneDrive/Desktop/regression.txt")
View(regression)
source('~/.active-rstudio-document', echo=TRUE)
m <- nls(DelSqRho ~ (1-exp(-a*(d-b)**2)), data=regression, start=list(a=1, b=1))
y_est<-predict(m,regression$d)
plot(x,y)
lines(x,y_est)
summary(m)
But, when I run it, I get an error:
Error in nls(DelSqRho ~ (1 - exp(-a * (d - b)^2)), data = regression, :
step factor 0.000488281 reduced below 'minFactor' of 0.000976562
and I do not know how to solve it or how to obtain the exponential regression. Any hints, please?
nls is quite sensitive to the values of the starting parameters and so you want to choose values that give a reasonable fit to the data (minpack.lm::nlsLM can be a bit more forgiving).
You can plot the curve at your starting values of a=1 and b=1 and see that it doesn't do a great job of capturing the curve.
regression <- read.delim("regression.txt")
with(regression, plot(d, DelSqRho, ylim=c(-3, 1)))
xs <- seq(min(regression$d), max(regression$d), length=100)
a <- 1; b <- 1; ys <- 1 - exp(-a* (xs - b)**2)
lines(xs, ys)
One way to get starting values is by rearranging the objective function.
y = 1 - exp(-a*(x-b)**2) can be rearranged as log(1/(1-y)) = ab^2 - 2abx + ax^2 (here y must be less than one). Linear regression can then be used to get an estimate of a and b.
start_m <- lm(log(1/(1-DelSqRho)) ~ poly(d, 2, raw=TRUE), regression)
unname(a <- coef(start_m)[3]) # as `a` is aligned with the quadratic term
# [1] -0.2345953
unname(b <- sqrt(coef(start_m)[1]/coef(start_m)[3]))
# [1] 2.933345
(Sometimes it is not possible to rearrange the data in this way and you can try to get a rough idea of the parameters by plotting the curves at various starting parameters. nls2 can also do a brute force search or grid search over starting parameters.)
We can now try to estimate the nls model at these parameters:
m <- nls(DelSqRho ~ 1-exp(-a*(d-b)**2), data=regression, start=list(a=a, b=b))
coef(m)
# a b
# -0.2379078 2.8868374
And plot the results:
# note that `newdata` must be a named list or data frame
# in which to look for variables with which to predict.
y_est <- predict(m, newdata=data.frame(d=xs))
with(regression, plot(d, DelSqRho))
lines(xs, y_est, col="red", lwd=2)
The fit isn't great and is perhaps suggestive that a more flexible model is required.
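The starting-value recipe above can be checked on simulated data where the true parameters are known. This is only a sketch: the data frame sim, the true values 0.4 and 1.2 and the noise level are invented for illustration, and b is recovered here from the linear term (b = -coef2 / (2 * coef3)) rather than from the intercept, which avoids taking a square root.

```r
set.seed(42)

# Simulated data from y = 1 - exp(-a * (d - b)^2) with a = 0.4, b = 1.2
sim <- data.frame(d = seq(0, 3, length.out = 50))
sim$y <- 1 - exp(-0.4 * (sim$d - 1.2)^2) + rnorm(50, sd = 0.01)

# Linearization: log(1 / (1 - y)) = a*b^2 - 2*a*b*d + a*d^2 (requires y < 1)
lin <- lm(log(1 / (1 - y)) ~ poly(d, 2, raw = TRUE), data = sim)

a0 <- unname(coef(lin)[3])               # quadratic coefficient estimates a
b0 <- unname(-coef(lin)[2] / (2 * a0))   # linear coefficient is -2*a*b

# Use the linearized estimates as starting values for nls
fit <- nls(y ~ 1 - exp(-a * (d - b)^2), data = sim,
           start = list(a = a0, b = b0))
coef(fit)
```

With starting values this close, nls should need only a few iterations; with arbitrary starts such as a = 1, b = 1 on real data, the step-halving error from the question is much more likely.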

R gmm package using exactly identified moment conditions

For exactly identified moments, GMM results should be the same regardless of initial starting values. This doesn't appear to be the case however.
library(gmm)
data(Finance)
x <- data.frame(rm=Finance[1:500,"rm"], rf=Finance[1:500,"rf"])
# want to solve for coefficients theta[1], theta[2] in exactly identified
# system
g <- function(theta, x)
{
m.1 <- x[,"rm"] - theta[1] - theta[2]*x[,"rf"]
m.z <- (x[,"rm"] - theta[1] - theta[2]*x[,"rf"])*x[,"rf"]
f <- cbind(m.1, m.z)
return(f)
}
# gmm coefficient result should be identical to ols regressing rm on rf
# since two moments are E[u]=0 and E[u*rf]=0
model.lm <- lm(rm ~ rf, data=x)
model.lm
# gmm is consistent with lm given correct starting values
summary(gmm(g, x, t0=model.lm$coefficients))
# problem is that using different starting values leads to different
# coefficients
summary(gmm(g, x, t0=rep(0,2)))
Is there something wrong with my setup?
The gmm package author Pierre Chausse was kind enough to respond to my inquiry.
For linear models, he suggests using the formula approach:
gmm(rm ~ rf, ~rf, data=x)
For non-linear models, he emphasizes that the starting values are indeed critical. In the case of exactly identified models, he suggests setting the fnscale to a small number to force the optim minimizer to converge closer to 0. Also, he thinks the BFGS algorithm works better with GMM.
summary(gmm(g, x, t0=rep(0,2), method = "BFGS", control=list(fnscale=1e-8)))
Both solutions work for this example. Thanks Pierre!
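To see why the starting values should not matter in principle here: with the two moments E[u]=0 and E[u*rf]=0, the sample moment equations are a linear system in theta, so they can be solved directly without any optimizer. The sketch below does that in base R on simulated data (the series rf and mkt are made up and stand in for the Finance columns; this does not use the gmm package) and compares the result with lm.

```r
set.seed(123)

# Simulated stand-ins for the risk-free and market return series
rf  <- rnorm(500, mean = 0.01, sd = 0.005)
mkt <- 0.002 + 1.5 * rf + rnorm(500, sd = 0.01)

# Instruments: a constant and rf, matching the two moment conditions
Z <- cbind(1, rf)

# Setting the sample moments Z'(mkt - Z theta) / n to zero gives the normal equations
theta_mm  <- drop(solve(crossprod(Z), crossprod(Z, mkt)))
theta_ols <- unname(coef(lm(mkt ~ rf)))

rbind(theta_mm, theta_ols)   # the two rows agree up to rounding

# Sample moments evaluated at the solution are numerically zero
g_bar <- colMeans(Z * (mkt - drop(Z %*% theta_mm)))
g_bar
```

Since the objective is essentially zero at the solution, a general-purpose minimizer can stop early on its default tolerances, which is consistent with the fnscale advice above.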

Solution of varying coefficients ODE

I have a set of observed raw data and use 2nd order ODE to fit the data
y''+b1(t)y'+b0(t)y = 0
The coefficients b1 and b0 are time-dependent, and I use principal differential analysis (PDA) (R package: fda, function: pda.fd) to get the estimates of b1(t) and b0(t).
To check the validity of the estimates of b1(t) and b0(t), I use a collocation method (R package: bvpSolve, function: bvpcol) to get the numerical solution of the ODE and compare it with the smoothed curve fitted to the raw data.
My problem is that the numerical solution from bvpcol captures the shape of the fitted curve but not the values of the function: the two differ by a constant multiple.
(Since I am not allowed to post images, please see the link for the figure.)
See the figure of my output. The gray dots are my raw data, the red line is the Fourier expansion of the raw data, the green line is the numerical solution from bvpcol, and the blue line is the green line divided by 1.62. We can see that the green line captures the shape, but its values are a constant multiple of the Fourier expansion.
I fitted several other data sets and found a similar situation, but with different constants. I am wondering whether this is a problem with the numerical solution of the ODE or has some other cause, and how to get good agreement between the numerical solution (green) and the true Fourier expansion.
Any help or ideas are appreciated!
Here are the raw data and code:
RData is here
library(fda)
library(bvpSolve)
# load the data
load('y.RData')
tvec = 1:length(y)
tvec = (tvec-min(tvec))/(max(tvec)-min(tvec))
# create basis
fbasis = create.fourier.basis(c(0,1),nbasis=nbasis)
bbasis = create.bspline.basis(c(0,1),norder=8,nbasis=47)
bfdPar = fdPar(bbasis)
yfd = smooth.basis(tvec,y,fbasis)$fd
yfdlist = list(yfd)
bwtlist = rep(list(bfdPar),2)
# PDA fit
bwt = pda.fd(yfdlist,bwtlist)$bwtlist
# output of estimated coefficients
beta0.fd<-bwt[[1]]$fd
beta1.fd<-bwt[[2]]$fd
# define the vary-coef function in terms of t
fbeta0<-function(t)eval.fd(t,beta0.fd)
fbeta1<-function(t)eval.fd(t,beta1.fd)
# define 2nd order ODE
fun2 <- function(t,y,pars) {
with(as.list(c(y,pars)),{
beta0 = pars[[1]];
beta1 = pars[[2]];
dy1 = y[2]
dy2 = -beta1(t)*y[2]-beta0(t)*y[1]
return(list(c(dy1,dy2)))
})
}
# BVP
yinit<-c(p1[1],NA)
yend<-c(p1[length(p1)],NA)
t<-seq(tvec[1],tvec[length(tvec)],0.005)
col<-bvpcol(yini=yinit,yend=yend,x=t,func=fun2,parms=c(fbeta0,fbeta1),atol=1e-5,islin=T)
# plot output
plot(col[,1],col[,2],col='green',type='l')
points(tvec,p1,col='darkgray')
lines(yfd,col='red',lwd=2)
lines(col[,1],col[,2],col='green',type='l')
lines(col[,1],col[,2]/1.62,col='blue',type='l',lwd=2,lty=4)
legend('topleft',col=c('green','darkgray','red','blue'),
legend=c('ODE solution','raw data','basis curve fitting','ODE solution/1.62'),lty=1)

Solve variable coefficients second order linear ODE?

For the second-order linear ODE with variable coefficients
$x''(t)+\beta_1(t)x'(t)+\beta_0(t) x(t)=0$
I have the numerical values (as vectors) for $\beta_1(t)$ and $\beta_0(t)$. Does anyone know of an R package that can solve this? Some simple examples to illustrate would be great as well.
From googling, I found that 'bvpSolve' can handle the constant-coefficient case.
In order to use deSolve, you have to make your second-order ODE
x''(t) + \beta_1(t) x'(t) + \beta_0(t) x(t)=0
into a pair of coupled first-order ODEs:
x'(t) = y(t)
y'(t) = - \beta_1(t) y(t) - \beta_0(t) x(t)
Then it's straightforward:
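gfun <- function(t,z,params) {
  g <- with(as.list(c(z,params)),
  {
    # example time-varying coefficients
    beta0 <- sin(2*pi*t)
    beta1 <- cos(2*pi*t)
    # derivatives of x and y, in the same order as the state vector
    c(x=y,
      y= -beta1*y - beta0*x)
  })
  list(g,NULL)
}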
library("deSolve")
run1 <- ode(c(x=1,y=1),times=0:40,func=gfun,parms=numeric(0))
I picked some initial conditions (x(0)=1, x'(0)=1) arbitrarily; you might also want to add parameters to the model (i.e. make parms something other than numeric(0))
PS if you're not happy doing the conversion to coupled first-order ODEs by hand, and want a package that will seamlessly handle second-order ODEs, then I don't know the answer ...
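Since the question says the coefficients are available only as vectors, one possible approach is to interpolate them with approxfun and evaluate the interpolants inside the derivative function. The sketch below is an illustration under stated assumptions: the coefficient vectors are made up (sampled from the same sin/cos used above), the helper names rhs and rk4 are hypothetical, and a plain fixed-step fourth-order Runge-Kutta loop in base R is used in place of deSolve so that the example has no package dependency. The same interpolated functions could be called inside a derivative function passed to ode.

```r
# Coefficients known only on a grid (values made up for illustration)
t_grid    <- seq(0, 1, by = 0.01)
beta0_vec <- sin(2 * pi * t_grid)
beta1_vec <- cos(2 * pi * t_grid)

# Linear interpolation so the coefficients can be evaluated at any t in [0, 1]
beta0 <- approxfun(t_grid, beta0_vec, rule = 2)
beta1 <- approxfun(t_grid, beta1_vec, rule = 2)

# First-order system for z = (x, y): x' = y, y' = -beta1(t) y - beta0(t) x
rhs <- function(t, z) c(z[2], -beta1(t) * z[2] - beta0(t) * z[1])

# Classical fixed-step fourth-order Runge-Kutta integrator
rk4 <- function(z0, times) {
  out <- matrix(NA_real_, nrow = length(times), ncol = length(z0))
  out[1, ] <- z0
  for (k in seq_len(length(times) - 1)) {
    h  <- times[k + 1] - times[k]
    t  <- times[k]
    z  <- out[k, ]
    k1 <- rhs(t, z)
    k2 <- rhs(t + h / 2, z + h / 2 * k1)
    k3 <- rhs(t + h / 2, z + h / 2 * k2)
    k4 <- rhs(t + h, z + h * k3)
    out[k + 1, ] <- z + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
  }
  out
}

times <- seq(0, 1, by = 0.005)
sol1  <- rk4(c(1, 1), times)   # x(0) = 1, x'(0) = 1
sol2  <- rk4(c(2, 2), times)   # same ODE, initial conditions doubled

# The ODE is linear and homogeneous, so doubling the initial conditions
# doubles the whole solution
max(abs(sol2 - 2 * sol1))
```

The first column of sol1 is x(t) and the second is x'(t). Because any constant multiple of a solution is again a solution, the overall scale is fixed only by the initial or boundary conditions supplied.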

Resources