Fitting parameters and initial conditions of ODE models with nls.lm in R

I am currently trying to fit ODE functional responses using the Levenberg-Marquardt routine (nls.lm) in the minpack.lm package, following the tutorial here (http://www.r-bloggers.com/learning-r-parameter-fitting-for-models-involving-differential-equations/).
In that example, the author fits the data by first setting up a rate function, rxnrate, which I modified as shown below:
library(ggplot2) #library for plotting
library(reshape2) # library for reshaping data (tall-narrow <-> short-wide)
library(deSolve) # library for solving differential equations
library(minpack.lm) # library for least squares fit using the Levenberg-Marquardt algorithm
# prediction of concentration
# rate function
rxnrate=function(t,c,parms){
# rate constant passed through a list called parms
k1=parms$k1
k2=parms$k2
k3=parms$k3
# c is the concentration of species
# derivatives dc/dt are computed below
r=rep(0,length(c))
r[1]=-k1*c["A"] #dcA/dt
r[2]=k1*c["A"]-k2*c["B"]+k3*c["C"] #dcB/dt
r[3]=k2*c["B"]-k3*c["C"] #dcC/dt
# the computed derivatives are returned as a list
# order of derivatives needs to be the same as the order of species in c
return(list(r))
}
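For reference, the rate function can be sanity-checked on its own before any fitting; the values below are made up purely for this sketch:
# quick check that rxnrate integrates without errors (made-up constants and initial state)
cinit <- c(A = 1, B = 0, C = 0)
t <- seq(0, 5, 0.1)
parms <- list(k1 = 0.5, k2 = 0.5, k3 = 0.5)
out <- ode(y = cinit, times = t, func = rxnrate, parms = parms)
head(out)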
My problem is that the initial condition of each state can also be treated as an estimated parameter. However, this does not work properly at the moment.
Below is my code:
# function that calculates residual sum of squares
ssq=function(myparms){
# initial concentration
cinit=c(A=myparms[4],B=0,C=0)
# time points for which conc is reported
# include the points where data is available
t=c(seq(0,5,0.1),df$time)
t=sort(unique(t))
# parms from the parameter estimation routine
k1=myparms[1]
k2=myparms[2]
k3=myparms[3]
# solve ODE for a given set of parameters
out=ode(y=cinit,times=t,func=rxnrate,parms=list(k1=k1,k2=k2,k3=k3))
# Filter data that contains time points where data is available
outdf=data.frame(out)
outdf=outdf[outdf$time %in% df$time,]
# Evaluate predicted vs experimental residual
preddf=melt(outdf,id.var="time",variable.name="species",value.name="conc")
expdf=melt(df,id.var="time",variable.name="species",value.name="conc")
ssqres=preddf$conc-expdf$conc
# return predicted vs experimental residual
return(ssqres)
}
# parameter fitting using the Levenberg-Marquardt algorithm
# initial guess for parameters
myparms=c(k1=0.5,k2=0.5,k3=0.5,A=1)
# fitting
fitval=nls.lm(par=myparms,fn=ssq)
Once I run this, the following error appears:
Error in chol.default(object$hessian) :
the leading minor of order 1 is not positive definite

The problem with your code is the following:
In the line cinit=c(A=myparms[4],B=0,C=0) you gave A both the value of myparms[4] AND the name of myparms[4]. Let's see:
myparms=c(k1=0.5,k2=0.5,k3=0.5,A=1)
cinit=c(A=myparms[4],B=0,C=0)
print(cinit)
A.A B C
1 0 0
To solve this problem, you can do this:
myparms=c(k1=0.5,k2=0.5,k3=0.5,A=1)
cinit=c(A=unname(myparms[4]),B=0,C=0)
print(cinit)
A B C
1 0 0
or this:
myparms=c(k1=0.5,k2=0.5,k3=0.5,1)
cinit=c(A=unname(myparms[4]),B=0,C=0)
print(cinit)
A B C
1 0 0
Then your code will work!
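Once the corrected code runs, the fitted values can be inspected like this (a quick sketch, assuming the fitval object from your code above):
summary(fitval)   # parameter estimates with standard errors
fitval$par        # estimated k1, k2, k3 and the initial concentration A
fitval$deviance   # residual sum of squares at the optimum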
Best regards,
J_F

Related

Clustering with Mclust results in an empty cluster

I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Edit: here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0), especially with a mean nearly identical to that of the second cluster. This only appears when specifically looking for a univariate, equal-variance model. Using, for example, modelNames="V" or leaving it at the default does not produce this problem.
This thread: Cluster contains no observations describes a similar problem, but if I understand correctly, that appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there's quite a lot of data there (see the spike in the histogram):
set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value,br=50)
abline(v=fit$parameters$mean,
col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"),lty=8)
Briefly, mclust fits Gaussian mixture models (GMMs), which are probabilistic models: they estimate the mean and variance of each cluster as well as the probability of each point belonging to each cluster. This is unlike k-means, which gives a hard assignment. The likelihood of the model is built from the probabilities of each data point under each cluster; you can check the details in mclust's publication.
In this model, the means of cluster 1 and cluster 2 are close, but their expected proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that if you have a data point around the means of clusters 1 or 2, it will consistently be assigned to cluster 2. For example, let's predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability. So, in the same example, everything is assigned to cluster 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
To answer your question: no, you did not do anything wrong; it's a pitfall, at least with this implementation of GMM. I would say it's a bit of overfitting, but you can basically keep only the clusters that actually have members.
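If you want to work only with the non-empty components, a minimal sketch (assuming the fit object from above) is to keep the components that actually receive observations in the hard classification:
used <- sort(unique(fit$classification))  # components with at least one member
fit$parameters$mean[used]                 # means of the non-empty components
fit$parameters$pro[used]                  # their mixing proportions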
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv,what="classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need a Gaussian mixture with spherical (equal-variance) components, consider using a fuzzy k-means instead:
library(ClusterR)
fit_kmeans <- KMeans_rcpp(as.matrix(data$value), clusters = 3, fuzzy = TRUE)  # assumed call; this line was missing from the post
plot(NULL, xlim = range(data$value), ylim = c(0, 4), ylab = "cluster", yaxt = "n", xlab = "values")
points(data$value, fit_kmeans$clusters, pch = 19, cex = 0.1, col = factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.

Wald-test for single statistic in R

I have a series of hazard rates at two points (a low and a high point) on the curve, with corresponding standard errors. I calculate the hazard ratio by dividing the high-point hazard rate by the low-point hazard rate. This is the hratio column. In the next column I would like to show the probability (p-value) that the ratio is significantly different from 1, using the Wald test.
I have tried doing this using wald.test() from the aods3 package, but I keep getting error messages. It seems that the function only allows for the comparison of two related regression models.
How would you go about doing this?
> wald
fit.low se.low fit.high se.high hratio
1 0.09387638 0.002597817 0.09530283 0.002800329 0.9850324
2 0.10941588 0.002870383 0.10831292 0.003061924 1.0101831
3 0.02549611 0.001054303 0.02857411 0.001368525 0.8922802
4 0.02818208 0.000917136 0.02871669 0.000936373 0.9813833
5 0.04857652 0.000554676 0.04897211 0.000568229 0.9919222
6 0.05121328 0.000565592 0.05142951 0.000554893 0.9957956
> library(aods3)
> wald$pv <- wald.test(b=wald$hratio)
Error in wald.test(b = wald$hratio) :
One of the arguments Terms or L must be used.
In wald.test() the arguments L and Terms both default to NULL, and Sigma is the variance-covariance matrix of b (e.g. Sigma = vcov(b)); you need to supply one of Terms or L yourself.
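Alternatively, you can compute the Wald test by hand. This is a sketch that assumes the low-point and high-point rates are independent, so the delta method gives an approximate standard error for the ratio; the z-statistic then tests H0: hratio = 1:
# approximate SE of the ratio via the delta method (assumes independent estimates)
rel.var <- with(wald, (se.low / fit.low)^2 + (se.high / fit.high)^2)
se.ratio <- wald$hratio * sqrt(rel.var)
# Wald z-statistic and two-sided p-value for H0: ratio = 1
z <- (wald$hratio - 1) / se.ratio
wald$pv <- 2 * pnorm(-abs(z))
wald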

How to find the minimum floating-point value accepted by betareg package?

I'm doing a beta regression in R, which requires values between 0 and 1, endpoints excluded, i.e. (0,1) instead of [0,1].
I have some 0 and 1 values in my dataset, so I'd like to convert them to the smallest possible neighbor, such as 0.0000...0001 and 0.9999...9999. I've used .Machine$double.xmin (which gives me 2.225074e-308), but betareg() still gives an error:
invalid dependent variable, all observations must be in (0, 1)
If I use 0.000001 and 0.999999, I get a different set of errors:
1: In betareg.fit(X, Y, Z, weights, offset, link, link.phi, type, control) :
failed to invert the information matrix: iteration stopped prematurely
2: In sqrt(wpp) :
Error in chol.default(K) :
the leading minor of order 4 is not positive definite
Only if I use 0.0001 and 0.9999 can I run without errors. Is there any way I can improve these minimum values with betareg? Or should I just be happy with that?
Try it with eps (the displacement from 0 and 1) first equal to 1e-4 (as you have here) and then 1e-3. If the results of the models don't differ in any way you care about, that's great. If they do, you need to be very careful, because it suggests your answers will be very sensitive to assumptions.
In the example below the dispersion parameter phi changes a lot, but the intercept and slope parameters don't change very much.
If you do find that the parameters change by a worrying amount for your particular data, then you need to think harder about the process by which the zeros and ones arise, and model that process appropriately, e.g.
a censored-data model: zeros/ones arise through a minimum/maximum detection threshold, so you model the zero/one values as actually lying somewhere in the tails, or
a hurdle/zero-one-inflation model: zeros and ones arise through a separate process from the rest of the data, so you use a binomial or multinomial model to characterize zero vs. (0,1) vs. one, then use a Beta regression on the (0,1) component (a rough sketch of this idea is at the end of this answer).
Questions about these steps are probably more appropriate for CrossValidated than for SO.
sample data
set.seed(101)
library(betareg)
dd <- data.frame(x=rnorm(500))
rbeta2 <- function(n, prob=0.5, d=1) {
rbeta(n, shape1=prob*d, shape2=(1-prob)*d)
}
dd$y <- rbeta2(500,plogis(1+5*dd$x),d=1)
dd$y[dd$y<1e-8] <- 0
trial fitting function
ss <- function(eps) {
dd <- transform(dd,
y=pmin(1-eps,pmax(eps,y)))
m <- try(betareg(y~x,data=dd))
if (inherits(m,"try-error")) return(rep(NA,3))
return(coef(m))
}
ss(0) ## fails
ss(1e-8) ## fails
ss(1e-4)
## (Intercept) x (phi)
## 0.3140810 1.5724049 0.7604656
ss(1e-3) ## also fails
ss(1e-2)
## (Intercept) x (phi)
## 0.2847142 1.4383922 1.3970437
ss(5e-3)
## (Intercept) x (phi)
## 0.2870852 1.4546247 1.2029984
try it for a range of values
evec <- seq(-4,-1,length=51)
res <- t(sapply(evec, function(e) ss(10^e)) )
library(ggplot2)
ggplot(data.frame(e=10^evec,reshape2::melt(res)),
aes(e,value,colour=Var2))+
geom_line()+scale_x_log10()
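For the hurdle idea mentioned above, here is a rough sketch using the simulated dd; it treats exact zeros/ones as a separate binomial process and fits the Beta regression only on the interior observations (the names boundary, m_bound and m_beta are just illustrative):
dd$boundary <- as.integer(dd$y <= 0 | dd$y >= 1)              # 1 if the response sits on the boundary
m_bound <- glm(boundary ~ x, family = binomial, data = dd)    # model for hitting 0/1
m_beta <- betareg(y ~ x, data = subset(dd, boundary == 0))    # Beta regression on the (0,1) part
coef(m_bound)
coef(m_beta)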

non-linear optimisation with constraints

I am trying to find the initial values of a system of differential equations. My parameter estimation returns negative values for some of the initial values, but all of them have to be greater than or equal to 0. On top of that, I would like to specify the following constraints on the initial conditions:
A > 0
0 < B < 1
0 < C < 1
0 < D < 1
E = 0
F = 0
0 < G < 5
0 < H < 100
All results obtained should be positive (they are concentrations).
How can I introduce these constraints? I found some information about how to do this with optim, but I can't find anything relevant to my problem.
Any help will be appreciated.
Malgosia
library(ggplot2)
library(reshape2)
library(deSolve)
library(minpack.lm)
time<-seq(0, 5, by=1)
P4=c(0.018,0.028,0.201,0.888,0.934,2.044)
E2=c(0.355,0.28,0.665,0.995,0.934)
FSH=c(0.408,0.226,0.126,0.224,0.123)
signal<-as.data.frame(list(time=time,P4))
input<-approxfun(P4,rule=2)
df<-data.frame(time,E2,FSH)
df
names(df)=c("time","E2","FSH")
#plot data
tmp=melt(df,id.vars=c("time"),variable.name="species",value.name="conc")
ggplot(data=tmp,aes(x=time,y=conc,color=species))+geom_point(size=4)
#rate function
rxnrate=function(t,c,parms){
#c is the concentration of species
#derivatives dc/dt are computed below
P4<-input(t)
r=rep(0,length(c))
r[1]<-12.84*1/(1+(P4/5)^5)*((c["G"]/3)^10)/(1+(c["G"]/3)^10)-2.14*c["A"];
r[2]<-75*(((c["A"]/5)^10)/(1+(c["A"]/5)^10))-8.56*c["B"];
r[3]<-12.84*(1/(1+(c["H"]/2)^2))*(1/(1+(c["G"]/10)^10))+1*((c["A"]/1)^1)/(1+(c["A"]/1)^1)-2.14*c["C"];
r[4]<-0.0107*c["C"]+0.321*c["C"]*c["D"]- 0.749*c["D"];
r[5]<-0.749*c["D"]- 0.749*c["E"]+0.214*c["C"]*c["E"]^2;
r[6]<-0.749*c["E"]-0.749*c["F"]+0.214*c["B"]*c["F"]^2;
r[7]<-0.0107 + 2.14*c["E"] + 10.7*c["F"]-1.07*c["G"];
r[8]<-0.0107+3*c["E"]+ 3.21*c["F"]+3.21*c["G"]-1.07*c["H"];
return(list(r))
}
cinit<-c(A=0.3947,B=0.40727,C=0.408,D=0.17828,E=0,F=0.05,G=0.355,H=0.9);
t=df$time;
out=ode(y=cinit,times=t,func=rxnrate)
head(out)
plot(out)
ssq=function(myparms){
#initial concentration
cinit=c(A=myparms[1],B=myparms[2],C=myparms[3],D=myparms[4],E=myparms[5],F=myparms[6],G=myparms[7],H=myparms[8])
cinit=c(A=unname(myparms[1]),B=unname(myparms[2]),C=unname(myparms[3]),D=unname(myparms[4]),E=unname(myparms[5]),F=unname(myparms[6]),G=unname(myparms[7]),H=unname(myparms[8]))
print(cinit)
#time points for which conc is reported
#include the points where data is available
t=c(seq(0,5,1),df$time)
t=sort(unique(t))
#parameters from the parameters estimation
#solve ODE for a given set of parameters
out=ode(y=cinit,times=t,func=rxnrate)
#Filter data that contains time points
outdf=data.frame(out)
outdf=outdf[outdf$time%in% df$time,]
#Evaluate predicted vs experimental residual
preddf=melt(outdf,id.var="time",variable.name="species",value.name="conc")
expdf=melt(df,id.var="time",variable.name="species",value.name="conc")
ssqres=preddf$conc-expdf$conc
return(ssqres)
}
# parameter fitting using the Levenberg-Marquardt algorithm
#initial guess for parameters
myparms=c(A=0.6947,B=0.4072,C=0.408,D=0.2,E=0,F=0,G=0.355,H=0.9)
#fitting
fitval=nls.lm(par=myparms,fn=ssq)
#summary of fit
summary(fitval)
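One thing that may help here (not a full answer, just a sketch): nls.lm() itself accepts lower and upper bound vectors, which can encode the box constraints listed above. E and F must stay at 0; in the sketch below they are pinned by a negligibly small interval, which is a pragmatic workaround (alternatively, drop them from the parameter vector and hard-code 0 inside ssq):
# box constraints via nls.lm's lower/upper arguments (E and F pinned near 0; bounds are illustrative)
myparms <- c(A = 0.6947, B = 0.4072, C = 0.408, D = 0.2, E = 0, F = 0, G = 0.355, H = 0.9)
lowerb <- c(A = 0, B = 0, C = 0, D = 0, E = 0, F = 0, G = 0, H = 0)
upperb <- c(A = Inf, B = 1, C = 1, D = 1, E = 1e-6, F = 1e-6, G = 5, H = 100)
fitval <- nls.lm(par = myparms, lower = lowerb, upper = upperb, fn = ssq)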

Questions about using nls.lm in the minpack.lm package

I am trying to use the minpack.lm package in R, specifically the nls.lm function. I'm poring through the manual and help files, but the requirements for setting it up are a little beyond my current capabilities. Any guidance is greatly appreciated. Here is my code, with the error message I'm getting, below.
R Code:
# Thomas P. Taggart
# ERE445/645
# Spring 2013 - Calibration Presentation
# Lumped parameter rainfall-runoff model for the Susquehanna River at Conklin, NY.
# Outlined in Haith's (1987) GWLF model. The model uses the SCS curve
# number runoff technique to determine runoff, with snowpack, unsaturated zone, and
# saturated zone mass balances. Evapotranspiration is to be determined using Hamon’s
# method with average monthly values for daylight hours.
# In this model we assume the following constants, which are determined through calibration:
# Baseflow Recession Coefficient, Kb
# Field capacity, FCAP
# Curve number for average moisture conditions, CN2
# Initial antecedent moisture conditions, iAMC
# Initial snow accumulation, iSNt
# Initial saturated zone storage, iSATt
# No deep groundwater seepage
# including needed functions
source("Functions.R")
source("distributionFunctions.R")
source("GWLF_Model.R")
require(ggplot2)
require(reshape)
library(minpack.lm)
library(scales)
###############################################################################################
# USGS Discharge data for Conklin, NY - Gage on the Susquehanna
# Reading in the input file
dischargeInput <- read.csv("USGS_DailyDischarge_ConklinNY_01503000_A.csv", header=TRUE)
###############################################################################################
# Weather Data
# Read in input file
weatherInput = read.csv("Conklin_NY_WeatherData_Edit.csv")
###############################################################################################
# Setting up the model inputs - initial run
# Baseflow Recession, Kb
Kb <- 0.90
# Initial unsaturated storage is at field capacity, FCAP (cm)
FCAP <- 10
# Curve number for average moisture conditions, CN
CN <- 65.7
# Initial antecedent moisture conditions, AMC
AMC <- 1.5
# Initial saturated zone storage, SATt
iSATt <- 0.45
# Snowmelt constant, K
K <- 0.45
parameters <- c(Kb, FCAP,CN, AMC, iSATt, K)
# Calling the Model - 1st time to see the initial outputs
# GWLF(parameters, dischargeInput, weatherInput)
###############################################################################################
# Calibrating the model
guess <- c("Kb"=0.1, "FCAP"=1,"CN"=50, "AMC"=0, "iSATt"=0, "K"=0.5)
out <- nls.lm(par = guess, fn = GWLF(parameters, dischargeInput, weatherInput))
Here is the error message:
Error in function (par) : could not find function "fn"
How do I need to set up par? Or the first argument of the function I'm calling within nls.lm?
The GWLF function is being passed 6 parameters that are used as constants in the function. These are the 6 parameters I hope to calibrate.
Thanks,
Tom
From reading ?nls.lm:
You need to pass the function itself, not a call to the function:
out <- nls.lm(par = guess, fn = GWLF, dischargeInput, weatherInput)
Note that the extra arguments (which I assume are the data) are passed via ...
It would be safer to name these arguments, using whatever argument names they have within GWLF.
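For example (a sketch only; the argument names dischargeInput and weatherInput are assumptions about GWLF's actual formals, so substitute whatever names the function really uses):
# extra arguments named explicitly so they are matched to the right formals of GWLF
out <- nls.lm(par = guess, fn = GWLF,
              dischargeInput = dischargeInput, weatherInput = weatherInput)
Keep in mind that nls.lm expects fn to return the vector of residuals to be minimized in the least-squares sense, so GWLF should return observed-minus-simulated discharge rather than a single summary statistic.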
