Confidence interval for Weibull distribution - r

I have wind data that I'm using to perform extreme value analysis (calculate return levels). I'm using R with packages 'evd', 'extRemes' and 'ismev'.
I'm fitting GEV, Gumbel and Weibull distributions, in order to estimate the return levels (RL) for some period T.
For the GEV and Gumbel cases, I can get RL's and Confidence Intervals using the extRemes::return.level() function.
Some code:
require(ismev)
require(MASS)
data(wind)
x = wind[, 2]
rperiod = 10
fit <- fitdistr(x, 'weibull')
s <- fit$estimate['shape']
b <- fit$estimate['scale']
rlevel <- qweibull(1 - 1/rperiod, shape = s, scale = b)
## CI around rlevel
## ci.rlevel = ??
But for the Weibull case, I need some help to generate the CI's.

I suspect the excruciatingly correct answer will be that the joint confidence region is an ellipse or some bent-sausage shape but you can extract variance estimates for the parameters from the fit object with the vcov function and then build standard errors for which +/- 1.96 SE's should be informative:
> sqrt(vcov(fit)["shape", "shape"])
[1] 0.691422
> sqrt(vcov(fit)["scale", "scale"])
[1] 1.371256
> s +c(-1,1)*sqrt(vcov(fit)["shape", "shape"])
[1] 6.162104 7.544948
> b +c(-1,1)*sqrt(vcov(fit)["scale", "scale"])
[1] 54.46597 57.20848
The usual way to calculate a CI for a single parameter is to assume Normal distribution and use theta+/- 1.96*SE(theta). In this case, you have two parameters so doing that with both of them would give you a "box", the 2D analog of an interval. The truly correct answer would be something more complex in the 'scale'-by-'shape' parameter space and might be most easily achieved with simulation methods, unless you have a better grasp of theory than I have.

Related

find total variation distance between multinomial distributions in r

I am comparing Bayes estimators to MLE in multinomial distributions. I am drawing random samples using rmultinom from a particular multinomial distribution using
rmultinom(400, size = 30, prob = c(5,7,10,8,14,10,15,12,10,9))
For each of the 400 samples, I compute the MLE and Bayes estimators for the ten probability parameters. I now want to find in each case the total variation distance between the true distribution and the one defined by the estimators.
Since for size 30 and 10 bins there are over 200 million possible arrangements, I don't think that using the theoretical definition is a good idea.
The package distrEx has a function "TotalVarDist()", but it can only be used with distributions defined in the distr package, and multinomial is not one of them. There are directions for defining them (see here and here) but the options are either to define a discrete distribution by explicitly listing the support (again, I don't think this is a good option since the support has a size of over 200 million) or starting from scratch using the same methods as how the distr package was created, which is beyond my current ability.
Any thoughts on how to do this, either using the packages mentioned or in a completely different way?
My answer is about how to calculate this using base R.
We have two multinomial parameter vectors, θ and η. The total variation distance is equivalent to P_θ(E) - P_η(E), where E={ω | P_θ({ω})>P_η({ω})}, and ω is a vector of sample counts.
I know of two ways to evaluate P(E) in base R. One is a very simple simulation-based method. The other reframes the problem in terms of a linear combination of the counts, which is approximately normally distributed, and uses the pnorm function.
Simulation-based method
You simulate samples from each distribution, check whether they're in E using the probability mass functions, and count how often they are. I'll go through an example here. We'll assume the true distribution from your question:
unnormalized.true <- c(5,7,10,8,14,10,15,12,10,9)
true <- unnormalized.true / sum(unnormalized.true)
We'll draw a sample and estimate a new distribution using a Bayes estimator:
set.seed(921)
result <- as.vector(rmultinom(1, size = 30, prob = true))
result
## [1] 3 6 2 0 5 3 3 4 1 3
dirichlet <- (result+1)/(30+length(true))
Calculating the probability of E under the true distribution:
set.seed(939)
true.dist <- rmultinom(10^6, 30, true)
p.true.e <- mean(apply(true.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, dirichlet) > 0))
Calculating the probability of E under the estimated distribution from the Bayes estimator:
dirichlet.dist <- rmultinom(10^6, 30, dirichlet)
p.dirichlet.e <- mean(apply(dirichlet.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, dirichlet) > 0))
And we can subtract to get the total variation distance.
p.true.e - p.dirichlet.e
## [1] 0.83737
Repeating this with the maximum likelihood estimate, we get a comparison of the estimators.
mle <- result/30
mle.dist <- rmultinom(10^6, 30, mle)
p.true.e2 <- mean(apply(true.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, mle) > 0))
p.mle.e2 <- mean(apply(mle.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, mle) > 0))
p.true.e2 - p.mle.e2
## [1] 0.968301
(edited to fix a serious mistake. Previously I had re-used p.true.e in the comparison with the MLE. I forgot that the event E is defined in terms of the estimated distribution.)
Normal approximation
I think this method is actually more accurate than the simulation based method, despite the normal approximation. As you'll see, we're not taking a normal approximation to the multinomial counts, which would be unlikely to be accurate for n=30. We're taking a normal approximation to a linear combination of these counts, which is close to normal. The weakness of this method will turn out to be that it can't handle zero probabilities in the estimated distribution. That's a real problem, since handling zeros gracefully is, to me, part of the point of using total variation distance rather than Kullback-Leibler divergence. But here it is.
The following derivation yields a restatement of E:
Define
where N_i is one cell of the multinomial sample, and
Then, E is the event that L>0.
The reason we have a problem with a zero probability is that it causes one of the λ_i's to be infinite.
I want to verify that L is close to normally distributed, in the example from before. I'll do that by getting samples from the distribution of L, using the previous multinomial simulation:
lambda <- log(true/dirichlet)
L.true.dist <- apply(true.dist, 2, function(x) sum(lambda*x))
L.dirichlet.dist <- apply(dirichlet.dist, 2, function(x) sum(lambda*x))
Note that I'm doing the comparison between the true distribution and the Bayes estimated distribution. I can't do the one with the MLE, because my sample had a zero count.
Plotting the distribution of L and comparing to a normal fit:
par(mfrow=c(1,2))
L.true.dist.hist <- hist(L.true.dist)
L.true.dist.fit <- function(x)
length(L.true.dist) * diff(L.true.dist.hist$breaks)[1] *
dnorm(x, mean(L.true.dist), sd=sd(L.true.dist))
curve(L.true.dist.fit, add=TRUE, n=1000, col='red')
L.dirichlet.dist.hist <- hist(L.dirichlet.dist)
L.dirichlet.dist.fit <- function(x)
length(L.dirichlet.dist) * diff(L.dirichlet.dist.hist$breaks)[1] *
dnorm(x, mean(L.dirichlet.dist), sd=sd(L.dirichlet.dist))
curve(L.dirichlet.dist.fit, add=TRUE, n=1000, col='red')
par(mfrow=c(1,1))
The distribution of L appears normal. So, instead of using simulations, we can just use pnorm. However, we need to calculate the mean and standard deviation of L. This can be done as follows.
The mean of L is
where p_i is the cell probability of cell i in the distribution p. The variance is
where
is the covariance matrix of the multinomial distribution. I'll calculate these moments for this example, and check them against the empirical moments in the simulation. First, for the distribution of L under the true distribution:
n <- 30
k <- length(true)
mean.L.true <- sum(lambda * n * true)
# Did we get the mean right?
c(mean.L.true, mean(L.true.dist))
## [1] 3.873509 3.875547
# Covariance matrix assuming the true distribution
sigma.true <- outer(1:k, 1:k, function(i,j)
ifelse(i==j, n*true[i]*(1-true[i]), -n*true[i]*true[j]))
var.L.true <- t(lambda) %*% sigma.true %*% lambda
# Did we get the standard deviation right?
c(sqrt(var.L.true), sd(L.true.dist))
## [1] 2.777787 2.776945
Then, the mean and variance of L under the Bayes estimate of the distribution:
mean.L.dirichlet <- sum(lambda * n * dirichlet)
# Did we get the mean right?
c(mean.L.dirichlet, mean(L.dirichlet.dist))
## [1] -3.893836 -3.895983
# Covariance matrix assuming the estimated distribution
sigma.dirichlet <- outer(1:k, 1:k, function(i,j)
ifelse(i==j, n*dirichlet[i]*(1-dirichlet[i]), -n*dirichlet[i]*dirichlet[j]))
var.L.dirichlet <- t(lambda) %*% sigma.dirichlet %*% lambda
# Did we get the standard deviation right?
c(sqrt(var.L.dirichlet), sd(L.dirichlet.dist))
## [1] 2.796348 2.793421
With these in hand, we can calculate the total variation distance with pnorm:
pnorm(0, mean.L.true, sd=sqrt(var.L.true), lower.tail=FALSE) -
pnorm(0, mean.L.dirichlet, sd=sqrt(var.L.true), lower.tail=FALSE)
## [1] 0.8379193
# Previous result was 0.83737
We get three digits of agreement with the simulation.
I don't know of any easy way to extend the normal approximation method to handle zero probabilities, though. I had an idea, but I got stuck trying to calculate the covariance matrix of the counts conditional on a specific cell having 0 count. I could share my progress if you think you could make something of it.

confidence interval around predicted value from complex inverse function

I'm trying to get a 95% confidence interval around some predicted values, but am not capable of achieving this.
Basically, I estimated a growth curve like this:
set.seed(123)
dat=data.frame(size=rnorm(50,10,3),age=rnorm(50,5,2))
S <- function(t,ts,C,K) ((C*K)/(2*pi))*sin(2*pi*(t-ts))
sommers <- function(t,Linf,K,t0,ts,C)
Linf*(1-exp(-K*(t-t0)-S(t,ts,C,K)+S(t0,ts,C,K)))
model <- nls(size~sommers(age,Linf,K,t0,ts,C),data=dat,
start=list(Linf=10,K=4.7,t0=2.2,C=0.9,ts=0.1))
I have independent size measurements, for which I would like to predict the age. Therefore, the inverse of the function, which is not very straightforward, I calculated like this:
model.out=coef(model)
S.out <- function(t)
((model.out[[4]]*model.out[[2]])/(2*pi))*sin(2*pi*(t-model.out[[5]]))
sommers.out <- function(t)
model.out[[1]]*(1-exp(-model.out[[2]]*(t-model.out[[3]])-S.out(t)+S.out(model.out[[3]])))
inverse = function (f, lower = -100, upper = 100) {
function (y) uniroot((function (x) f(x) - y), lower = lower, upper = upper)[1]
}
sommers.inverse = inverse(sommers.out, 0, 25)
x= sommers.inverse(10) #this works with my complete dataset, but not with this fake one
Although this works fine, I need to know the confidence interval (95%) around this estimate (x). For linear models there is for example "predict(... confidence=)". I could also bootstrap the function somehow to get the quantiles associated with the parameters (didn't find how), to then use the extremes of those to calculate the maximum and minimum values predictable. But that doesn't really look like the good way of doing this....
Any help would be greatly appreciated.
EDIT after answer:
So this worked (explained in the book of Ben Bolker, see answer):
vmat = mvrnorm(1000, mu = coef(mfit), Sigma = vcov(mfit))
dist = numeric(1000)
for (i in 1:1000) {dist[i] = sommers_inverse(9.938,vmat[i,])}
quantile(dist, c(0.025, 0.975))
On the rather bad fake data I gave, this works of course rather horrible. But on the real data (which I have a problem recreating), this is ok!
Unless I'm mistaken, you're going to have to use either regular (parametric) bootstrapping or a method called either "population predictive intervals" (e.g., see section 5 of chapter 7 of Bolker 2008), which assumes that the sampling distributions of your parameters are multivariate Normal. However, I think you may have bigger problems, unless I've somehow messed up your model in adapting it ...
Generate data (note that random data may actually bad for testing your model - see below ...)
set.seed(123)
dat <- data.frame(size=rnorm(50,10,3),age=rnorm(50,5,2))
S <- function(t,ts,C,K) ((C*K)/(2*pi))*sin(2*pi*(t-ts))
sommers <- function(t,Linf,K,t0,ts,C)
Linf*(1-exp(-K*(t-t0)-S(t,ts,C,K)+S(t0,ts,C,K)))
Plot the data and the initial curve estimate:
plot(size~age,data=dat,ylim=c(0,16))
agevec <- seq(0,10,length=1001)
lines(agevec,sommers(agevec,Linf=10,K=4.7,t0=2.2,ts=0.1,C=0.9))
I had trouble with nls so I used minpack.lm::nls.lm, which is slightly more robust. (There are other options here, e.g. calculating the derivatives and providing the gradient function, or using AD Model Builder or Template Model Builder, or using the nls2 package.)
For nls.lm we need a function that returns the residuals:
sommers_fn <- function(par,dat) {
with(c(as.list(par),dat),size-sommers(age,Linf,K,t0,ts,C))
}
library(minpack.lm)
mfit <- nls.lm(fn=sommers_fn,
par=list(Linf=10,K=4.7,t0=2.2,C=0.9,ts=0.1),
dat=dat)
coef(mfit)
## Linf K t0 C ts
## 10.6540185 0.3466328 2.1675244 136.7164179 0.3627371
Here's our problem:
plot(size~age,data=dat,ylim=c(0,16))
lines(agevec,sommers(agevec,Linf=10,K=4.7,t0=2.2,ts=0.1,C=0.9))
with(as.list(coef(mfit)), {
lines(agevec,sommers(agevec,Linf,K,t0,ts,C),col=2)
abline(v=t0,lty=2)
abline(h=c(0,Linf),lty=2)
})
With this kind of fit, the results of the inverse function are going to be extremely unstable, as the inverse function is many-to-one, with the number of inverse values depending sensitively on the parameter values ...
sommers_pred <- function(x,pars) {
with(as.list(pars),sommers(x,Linf,K,t0,ts,C))
}
sommers_pred(6,coef(mfit)) ## s(6)=9.93
sommers_inverse <- function (y, pars, lower = -100, upper = 100) {
uniroot(function(x) sommers_pred(x,pars) -y, c(lower, upper))$root
}
sommers_inverse(9.938, coef(mfit)) ## 0.28
If I pick my interval very carefully I can get back the correct answer ...
sommers_inverse(9.938, coef(mfit), 5.5, 6.2)
Maybe your model will be better behaved with more realistic data. I hope so ...

How to estimate lambdas of poisson distributed samples in R and to draw Kernel estimation of the density function of the estimator basing on that?

So I have 500 poisson distributed simulated samples with n=100 each.
1) How can I estimate the lambdas for each of these samples separately in R ?
2) How can I draw Kernel Estimation of the density function of the estimator for lambda based on the 500 estimated lambdas? (my guess is somehow with "Kernsmooth" package and function "bkfe" but i fail to programm it normally anyway
taskpois <- function(size, leng){
+ taskmlepois <- NULL
+ for (i in 1:leng){
+ randompois <- rpois(size, 6)
+ taskmlepois[i] <- mean(randompois)
+ }
+ return(taskmlepois)
+ }
tasksample <- taskpois(size=100, leng=500)
As the comments suggest, it seems you're pretty close already.
ltarget <- 2
set.seed(101)
lambdavec <- replicate(500,mean(rpois(100,lambda=ltarget)))
dd <- density(lambdavec)
plot(dd,main="",las=1,bty="l")
We might as well add the expected result based on asymptotic theory:
curve(dnorm(x,mean=2,sd=sqrt(2/100)),add=TRUE,col=2)
We can add another line that shows that the variation among the densities of different experiments is pretty large relative to the difference between the theoretical and observed density from the first experiment:
lambdavec2 <- replicate(500,mean(rpois(100,lambda=ltarget)))
lines(density(lambdavec2),col=4)

Fit distribution to given frequency values in R

I have frequency values changing with the time (x axis units), as presented on the picture below. After some normalization these values may be seen as data points of a density function for some distribution.
Q: Assuming that these frequency points are from Weibull distribution T, how can I fit best Weibull density function to the points so as to infer the distribution T parameters from it?
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
plot(1:length(sample), sample, type = "l")
points(1:length(sample), sample)
Update.
To prevent from being misunderstood, I would like to add little more explanation. By saying I have frequency values changing with the time (x axis units) I mean I have data which says that I have:
7787 realizations of value 1
3056 realizations of value 2
2359 realizations of value 3 ... etc.
Some way towards my goal (incorrect one, as I think) would be to create a set of these realizations:
# Loop to simulate values
set.values <- c()
for(i in 1:length(sample)){
set.values <<- c(set.values, rep(i, times = sample[i]))
}
hist(set.values)
lines(1:length(sample), sample)
points(1:length(sample), sample)
and use fitdistr on the set.values:
f2 <- fitdistr(set.values, 'weibull')
f2
Why I think it is incorrect way and why I am looking for a better solution in R?
in the distribution fitting approach presented above it is assumed that set.values is a complete set of my realisations from the distribution T
in my original question I know the points from the first part of the density curve - I do not know its tail and I want to estimate the tail (and the whole density function)
Here is a better attempt, like before it uses optim to find the best value constrained to a set of values in a box (defined by the lower and upper vectors in the optim call). Notice it scales x and y as part of the optimization in addition to the Weibull distribution shape parameter, so we have 3 parameters to optimize over.
Unfortunately when using all the points it pretty much always finds something on the edges of the constraining box which indicates to me that maybe Weibull is maybe not a good fit for all of the data. The problem is the two points - they ares just too large. You see the attempted fit to all data in the first plot.
If I drop those first two points and just fit the rest, we get a much better fit. You see this in the second plot. I think this is a good fit, it is in any case a local minimum in the interior of the constraining box.
library(optimx)
sample <- c(60953,7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
t.sample <- 0:22
s.fit <- sample[3:23]
t.fit <- t.sample[3:23]
wx <- function(param) {
res <- param[2]*dweibull(t.fit*param[3],shape=param[1])
return(res)
}
minwx <- function(param){
v <- s.fit-wx(param)
sqrt(sum(v*v))
}
p0 <- c(1,200,1/20)
paramopt <- optim(p0,minwx,gr=NULL,lower=c(0.1,100,0.01),upper=c(1.1,5000,1))
popt <- paramopt$par
popt
rms <- paramopt$value
tit <- sprintf("Weibull - Shape:%.3f xscale:%.1f yscale:%.5f rms:%.1f",popt[1],popt[2],popt[3],rms)
plot(t.sample[2:23], sample[2:23], type = "p",col="darkred")
lines(t.fit, wx(popt),col="blue")
title(main=tit)
You can directly calculate the maximum likelihood parameters, as described here.
# Defining the error of the implicit function
k.diff <- function(k, vec){
x2 <- seq(length(vec))
abs(k^-1+weighted.mean(log(x2), w = sample)-weighted.mean(log(x2),
w = x2^k*sample))
}
# Setting the error to "quite zero", fulfilling the equation
k <- optimize(k.diff, vec=sample, interval=c(0.1,5), tol=10^-7)$min
# Calculate lambda, given k
l <- weighted.mean(seq(length(sample))^k, w = sample)
# Plot
plot(density(rep(seq(length(sample)),sample)))
x <- 1:25
lines(x, dweibull(x, shape=k, scale= l))
Assuming the data are from a Weibull distribution, you can get an estimate of the shape and scale parameter like this:
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
f<-fitdistr(sample, 'weibull')
f
If you are not sure whether it is distributed Weibull, I would recommend using the ks.test. This tests whether your data is from a hypothesised distribution. Given your knowledge of the nature of the data, you could test for a few selected distributions and see which one works best.
For your example this would look like this:
ks = ks.test(sample, "pweibull", shape=f$estimate[1], scale=f$estimate[2])
ks
The p-value is insignificant, hence you do not reject the hypothesis that the data is from a Weibull distribution.
Update: The histograms of either the Weibull or exponential look like a good match to your data. I think the exponential distribution gives you a better fit. Pareto distribution is another option.
f<-fitdistr(sample, 'weibull')
z<-rweibull(10000, shape= f$estimate[1],scale= f$estimate[2])
hist(z)
f<-fitdistr(sample, 'exponential')
z = rexp(10000, f$estimate[1])
hist(z)

Errors running Maximum Likelihood Estimation on a three parameter Weibull cdf

I am working with the cumulative emergence of flies over time (taken at irregular intervals) over many summers (though first I am just trying to make one year work). The cumulative emergence follows a sigmoid pattern and I want to create a maximum likelihood estimation of a 3-parameter Weibull cumulative distribution function. The three-parameter models I've been trying to use in the fitdistrplus package keep giving me an error. I think this must have something to do with how my data is structured, but I cannot figure it out. Obviously I want it to read each point as an x (degree days) and a y (emergence) value, but it seems to be unable to read two columns. The main error I'm getting says "Non-numeric argument to mathematical function" or (with slightly different code) "data must be a numeric vector of length greater than 1". Below is my code including added columns in the df_dd_em dataframe for cumulative emergence and percent emergence in case that is useful.
degree_days <- c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94)
emergence <- c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0)
cum_em <- cumsum(emergence)
df_dd_em <- data.frame (degree_days, emergence, cum_em)
df_dd_em$percent <- ave(df_dd_em$emergence, FUN = function(df_dd_em) 100*(df_dd_em)/46)
df_dd_em$cum_per <- ave(df_dd_em$cum_em, FUN = function(df_dd_em) 100*(df_dd_em)/46)
x <- pweibull(df_dd_em[c(1,3)],shape=5)
dframe2.mle <- fitdist(x, "weibull",method='mle')
Here's my best guess at what you're after:
Set up data:
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd <- transform(dd,cum_em=cumsum(emergence))
We're actually going to fit to an "interval-censored" distribution (i.e. probability of emergence between successive degree day observations: this version assumes that the first observation refers to observations before the first degree-day observation, you could change it to refer to observations after the last observation).
library(bbmle)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun <- function(scale,shape,x=dd$degree_days,y=dd$emergence) {
prob <- pmax(diff(pweibull(c(-Inf,x), ## or (c(x,Inf))
shape=shape,scale=scale)),1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
library(bbmle)
I should probably have used something more systematic like the method of moments (i.e. matching the mean and variance of a Weibull distribution with the mean and variance of the data), but I just hacked around a bit to find plausible starting values:
## preliminary look (method of moments would be better)
scvec <- 10^(seq(0,4,length=101))
plot(scvec,sapply(scvec,NLLfun,shape=1))
It's important to use parscale to let R know that the parameters are on very different scales:
startvals <- list(scale=1000,shape=1)
m1 <- mle2(NLLfun,start=startvals,
control=list(parscale=unlist(startvals)))
Now try with a three-parameter Weibull (as originally requested) -- requires only a slight modification of what we already have:
library(FAdist)
NLLfun2 <- function(scale,shape,thres,
x=dd$degree_days,y=dd$emergence) {
prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres)),
1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
Looks like the three-parameter fit is much better:
library(emdbook)
AICtab(m1,m2)
## dAIC df
## m2 0.0 3
## m1 21.7 2
And here's the graphical summary:
with(dd,plot(cum_em~degree_days,cex=3))
with(as.list(coef(m1)),curve(sum(dd$emergence)*
pweibull(x,shape=shape,scale=scale),col=2,
add=TRUE))
with(as.list(coef(m2)),curve(sum(dd$emergence)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))
(could also do this more elegantly with ggplot2 ...)
These don't seem like spectacularly good fits, but they're sane. (You could in principle do a chi-squared goodness-of-fit test based on the expected number of emergences per interval, and accounting for the fact that you've fitted a three-parameter model, although the values might be a bit low ...)
Confidence intervals on the fit are a bit of a nuisance; your choices are (1) bootstrapping; (2) parametric bootstrapping (resample parameters assuming a multivariate normal distribution of the data); (3) delta method.
Using bbmle::mle2 makes it easy to do things like get profile confidence intervals:
confint(m1)
## 2.5 % 97.5 %
## scale 1576.685652 1777.437283
## shape 4.223867 6.318481
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd$cum_em <- cumsum(dd$emergence)
dd$percent <- ave(dd$emergence, FUN = function(dd) 100*(dd)/46)
dd$cum_per <- ave(dd$cum_em, FUN = function(dd) 100*(dd)/46)
dd <- transform(dd)
#start 3 parameter model
library(FAdist)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun2 <- function(scale,shape,thres,
x=dd$degree_days,y=dd$percent) {
prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres)),
1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
summary(m2)
#graphical summary
windows(5,5)
with(dd,plot(cum_per~degree_days,cex=3))
with(as.list(coef(m2)),curve(sum(dd$percent)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))

Resources