ar(1) simulation with non-zero mean - r

I can't seem to find the correct way to simulate an AR(1) time series with a mean that is not zero.
I need 53 data points, rho = .8, mean = 300.
However, arima.sim(list(order=c(1,0,0), ar=.8), n=53, mean=300, sd=21)
gives me values in the 1500s. For example:
1480.099 1480.518 1501.794 1509.464 1499.965 1489.545 1482.367 1505.103 (and so on)
I have also tried arima.sim(n=52, model=list(ar=c(.8)), start.innov=300, n.start=1)
but then it just counts down like this:
238.81775870 190.19203239 151.91292491 122.09682547 96.27074057 77.17105923 63.15148491 50.04211711 39.68465916 32.46837830 24.78357345 21.27437183 15.93486092 13.40199333 10.99762449 8.70208879 5.62264196 3.15086491 2.13809323 1.30009732
and I have tried arima.sim(list(order=c(1,0,0), ar=.8), n=53,sd=21) + 300 which seems to give a correct answer. For example:
280.6420 247.3219 292.4309 289.8923 261.5347 279.6198 290.6622 295.0501
264.4233 273.8532 261.9590 278.0217 300.6825 291.4469 291.5964 293.5710
285.0330 274.5732 285.2396 298.0211 319.9195 324.0424 342.2192 353.8149
and so on..
However, I am not sure that this is doing the correct thing. Is the series still autocorrelated around the correct level?

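Your last option is fine for getting the desired mean, "mu". It generates data from the model:
(y[t] - mu) = phi * (y[t-1] - mu) + epsilon[t], epsilon[t] ~ N(0, sigma=21),
t=1,2,...,n.
Your first approach sets an intercept, "alpha", rather than a mean:
y[t] = alpha + phi * y[t-1] + epsilon[t].
The extra arguments mean=300 and sd=21 are passed on to the innovation generator, so the innovations have mean 300, which acts as alpha = 300. The stationary mean of this model is alpha / (1 - phi) = 300 / 0.2 = 1500, which is why you see values around 1500.
Your second option sets the starting value y[0] equal to 300. As long as |phi|<1, the influence of this initial value vanishes after a few periods and has no effect on the level of the series.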
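To make the intercept-versus-mean distinction concrete, here is a minimal sketch that writes the recursion by hand. The helper simulate_ar1 is hypothetical (it is not part of any package), and it starts the series at its stationary mean for simplicity instead of drawing the first value from the stationary distribution.

```r
# Hypothetical helper: AR(1) recursion in intercept form,
# y[t] = alpha + phi * y[t-1] + eps[t]
simulate_ar1 <- function(n, phi, alpha, sigma) {
  y <- numeric(n)
  prev <- alpha / (1 - phi)  # start at the stationary mean (a simplification)
  for (t in seq_len(n)) {
    y[t] <- alpha + phi * prev + rnorm(1, mean = 0, sd = sigma)
    prev <- y[t]
  }
  y
}

phi <- 0.8
mu <- 300

# Intercept needed for a stationary mean of 300
alpha_for_mean_300 <- mu * (1 - phi)

# Stationary mean implied by an intercept of 300 (the first attempt)
mean_for_alpha_300 <- 300 / (1 - phi)

set.seed(123)
y <- simulate_ar1(n = 53, phi = phi, alpha = alpha_for_mean_300, sigma = 21)

# The same model via arima.sim: simulate around zero, then shift by mu
y_shifted <- arima.sim(list(order = c(1, 0, 0), ar = phi), n = 53, sd = 21) + mu
```

With phi = 0.8, an intercept of 60 corresponds to a mean of 300, while an intercept of 300 corresponds to a mean of 1500.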
Edit
The value of the standard deviation that you observe in the simulated data is correct. Be aware that the variance of the AR(1) process, y[t], is not equal to the variance of the innovations, epsilon[t]. The variance of the AR(1) process, sigma^2_y, can be obtained as follows:
Var(y[t]) = Var(alpha) + phi^2 * Var(y[t-1]) + Var(epsilon[t])
As the process is stationary, Var(y[t]) = Var(y[t-1]), which we call sigma^2_y. Thus, we get:
sigma^2_y = 0 + phi^2 * sigma^2_y + sigma^2_epsilon
sigma^2_y = sigma^2_epsilon / (1 - phi^2)
For the values of the parameters that you are using you have:
sigma_y = sqrt(21^2 / (1 - 0.8^2)) = 35.
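The stationary standard deviation can be checked empirically with a long simulated series. This is only a sanity check under the same parameter values; the series length of 1e5 and the seed are arbitrary choices of mine.

```r
set.seed(2024)
phi <- 0.8
sigma_eps <- 21

# Theoretical standard deviation of the stationary AR(1) process
sigma_y_theory <- sqrt(sigma_eps^2 / (1 - phi^2))

# Empirical standard deviation from a long simulated series
y_long <- arima.sim(list(order = c(1, 0, 0), ar = phi), n = 1e5, sd = sigma_eps)
sigma_y_empirical <- sd(y_long)

c(theory = sigma_y_theory, empirical = sigma_y_empirical)
```

With only 53 observations the sample standard deviation will fluctuate a lot more around 35 than it does here.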

Use the rGARMA function in the ts.extend package
You can generate random vectors from any stationary Gaussian ARMA model using the ts.extend package. This package generates random vectors directly from the multivariate normal distribution using the computed autocorrelation matrix for the random vector, so it gives random vectors from the exact distribution and does not require "burn-in" iterations. Here is an example of generating multiple independent time-series vectors all from an AR(1) model.
#Load the package
library(ts.extend)
#Set parameters
MEAN <- 300
ERRORVAR <- 21^2
AR <- 0.8
m <- 53
#Generate n = 16 random vectors from this model
set.seed(1)
SERIES <- rGARMA(n = 16, m = m, mean = MEAN, ar = AR, errorvar = ERRORVAR)
#Plot the series using ggplot2 graphics
library(ggplot2)
plot(SERIES)
As you can see, the generated time-series vectors in this plot use the appropriate mean and error variance that were specified in the inputs.
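For intuition, the same idea (drawing the whole vector from a multivariate normal with the AR(1) covariance matrix) can be sketched in base R. This is not the ts.extend implementation, just an illustration that assumes the stationary AR(1) autocovariance gamma(h) = sigma^2_epsilon / (1 - phi^2) * phi^|h|.

```r
m <- 53
phi <- 0.8
mu <- 300
errorvar <- 21^2

# Stationary variance and full covariance matrix of the AR(1) vector
gamma0 <- errorvar / (1 - phi^2)
Sigma <- gamma0 * phi^abs(outer(1:m, 1:m, "-"))

# chol() returns an upper-triangular R with t(R) %*% R == Sigma,
# so mu + t(R) %*% z has covariance Sigma when z is standard normal
R <- chol(Sigma)
set.seed(1)
z <- rnorm(m)
y <- mu + drop(crossprod(R, z))
```

Because the first value is drawn from the stationary distribution as well, no burn-in period is needed.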


R code for maximum likelihood estimate from a specific likelihood function

I have been trying to generate R code for maximum likelihood estimation from a log likelihood function in a paper (equation 9 in page 609). Authors in the paper estimated it using MATLAB, which I am not familiar with. So I tried to generate codes in R.
Here is the snapshot of the log likelihood function in the paper:
, where
r: Binary decision (0 or 1) indicating infested plant(s) detection (1) or not (0).
e: Inspection efficiency. This is known.
n: Sample size
The overall objective is to estimate plant infestation rate (gamma: γ) and epsilon (e) based on binary decision of presence and absence of infested plants instead of using infested plant(s) detected. So, the function has only binary information (r) of infested plant detection and sample size. Since epsilon (e) is known or fixed, the actual goal is to estimate gamma (γ) in a population.
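Another objective is to compare estimated infestation rates from above with ones from the hypergeometric sampling formula in another paper (on page 6). As implemented in the code below, the formula is:
n = (1 - (1 - ci)^(1/D)) * (N - (D - 1)/2),
where N is the total number of plants (box), D = p * N is the number of infested plants, and ci is the desired probability of detection.
This formula gives the sample size required to detect infested plants with a chosen probability (e.g., 0.95) given an infestation rate. For example: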
# Sample size calculation function
fosgate.sample1 <- function(box, p, ci){ # Note: box represent total plant number
ninf <- p*box
sample.size <- round(((1-(1-ci)^(1/ninf))*(box-(ninf-1)/2)))
#sample.size <- ceiling(((1-(1-ci)^(1/ninf))*(box-(ninf-1)/2)))
sample.size
}
fosgate.sample1(box=100, p = .05, ci = .95) # where box: population or total plants, p: infestation rate, and ci: probability of detection
## 44
The idea is if sample size (e.g., 44) and binary decision data are provided the log-likelihood function can be used to estimate infestation rate and the rate may be close to anticipated rate (e.g., .05). Ultimately, I would like to compare plant infestation rates (gamma: γ) estimated from the log likelihood function above and D/N in the sample size calculation formula (second) or p in the sample size code below.
I generated R code for the log-likelihood described above.
### MLE with stats4
library(stats4)
# Log-likelihood function
plant.inf.lik <- function(inf.rate){
logl <- suppressWarnings(
sum((1-insp.result)*n*log(1-inf.rate) +
insp.result*log(1-(1-inf.rate)^n))
)
return(-logl)
}
Using the sample size function (i.e., fosgate.sample1) I generated sample sizes for various cases of total plant (or box) and anticipated detection rate (p) in the function. Since I am also interested in error/confidence ranges of estimated plant infestation rates, I used bootstrapping to calculate range of estimates (I am not sure if this is appropriate/acceptable). Here is the final code I generated:
### MLE and CI with bootstrapping with multiple scenarios
plant <- c(100, 500, 1000, 5000, 10000, 100000) # Total plant number
ir <- seq(.01, .2, by = .01) # Plant infestation rate
df.result <- data.frame(expand.grid(plant=plant, inf.rate = ir))
df.result$sample.size <- fosgate.sample1(box=df.result$plant, p=df.result$inf.rate, ci=.95) # Sample size
df.result$insp.result <- 1000 # Shipment number (can be replaced with random integers)
df.result <- df.result[order(df.result$plant, df.result$inf.rate, df.result$sample.size), ]
rownames(df.result) <- 1:nrow(df.result)
df.result$est.mean <- 0
#df.result$est.median <- 0
df.result$est.lower.ci <- 0
df.result$est.upper.ci <- 0
df.result$nsim <- 0
str(df.result)
head(df.result)
# Looping
est <- rep(NA, 1000)
for(j in 1:nrow(df.result)){
for(i in 1:1000){
insp.result <- sample(c(rep(1, df.result$insp.result[j]-df.result$insp.result[j]*df.result$inf.rate[j]),
rep(0, df.result$insp.result[j]*df.result$inf.rate[j])))
ir <- df.result$inf.rate[j]
n <- df.result$sample.size[j]
insp.result <- sample(insp.result, replace = TRUE)
est[i] <- mle(plant.inf.lik, start = list(inf.rate = ir*.9), method = "BFGS", nobs = length(insp.result))@coef
df.result$est.mean[j] <- mean(est, na.rm = TRUE)
# df.result$est.median[j] <- median(est, na.rm = TRUE)
df.result$est.lower.ci[j] <- quantile(est, prob = .025, na.rm = TRUE)
df.result$est.upper.ci[j] <- quantile(est, prob = .975, na.rm = TRUE)
df.result$nsim[j] <- length(est)
}
}
# Significance test result
sig <- ifelse(df.result$inf.rate >= df.result$est.lower.ci & df.result$inf.rate <= df.result$est.upper.ci, "no sig", "sig")
table(sig)
# Plot
library(ggplot2)
library(reshape2)
df.result$num <- ave(df.result$inf.rate, df.result$plant, FUN=seq_along)
df.result.m <- melt(df.result, id.vars=c("plant", "sample.size", "insp.result", "est.lower.ci", "est.upper.ci", "nsim", "num"))
df.result.m$est.lower.ci <- ifelse(df.result.m$variable == "inf.rate", NA, df.result.m$est.lower.ci)
df.result.m$est.upper.ci <- ifelse(df.result.m$variable == "inf.rate", NA, df.result.m$est.upper.ci)
str(df.result.m)
ggplot(data = df.result.m, aes(x = num, y = value, group=variable, color=variable, shape=variable))+
geom_point()+
geom_errorbar(aes(ymin = est.lower.ci, ymax = est.upper.ci), width=.5)+
scale_y_continuous(breaks = seq(0, .2, .02))+
xlab("Index")+
ylab("Plant infestation rate")+
facet_wrap(~plant, ncol = 3)
When I ran the code, I was able to obtain results and to compare estimated (est.mean) and anticipated (inf.rate) infestation rates as shown in the plot below.
If the results are correct, the plot indicates that the estimates look fine overall but are off for higher infestation rates.
Also, I always get warning messages when I leave out the "suppressWarnings" function, and occasionally the error message below. I have no clue how to fix them.
## Warning messages
## 29: In log(1 - (1 - inf.rate)^n) : NaNs produced
## 30: In log(1 - inf.rate) : NaNs produced
## Error message (occasionally)
## Error in solve.default(oout$hessian) :
## Lapack routine dgesv: system is exactly singular: U[1,1] = 0
My questions are:
Is R function (plant.inf.lik) for maximum likelihood estimation of the log-likelihood function appropriate?
Should I take care of warning and error messages? If yes, how? Again, I have no clue how to fix...
Is bootstrapping (resampling?) method appropriate to estimate CI ranges and/or standard error?
I found this link useful for alternative approach. Although I am still working both approaches together, results seem different (maybe following question).
Any suggestion would be greatly appreciated.
Concerning your last question about estimating CI ranges, there are three common methods for ML estimators:
Variance estimation from the inverted Hessian matrix.
Jackknife estimator for the variance (simpler and more stable, if the Hessian is estimated numerically, but computationally more expensive)
Bootstrap CIs (the computationally most expensive approach).
For bootstrap CIs, you do not need to implement them yourself (bias correction, for example, can be tricky); you can rely on the R library boot.
Incidentally, I have written a summary with R code for all three approaches two years ago: Construction of Confidence Intervals (see section 5). For the method utilizing the Hessian Matrix, e.g., the outline is as follows:
lnL <- function(theta1, theta2, ...) {
# definition of the negative (!)
# log-likelihood function...
}
# starting values for the optimization
theta0 <- c(start1, start2, ...)
# optimization
p <- optim(theta0, lnL, hessian=TRUE)
if (p$convergence == 0) {
theta <- p$par
covmat <- solve(p$hessian)
sigma <- sqrt(diag(covmat))
}
The function mle from stats4 already wraps the covariance matrix estimation and returns it in vcov. In the practical use cases in which I have tried this (paired comparison models), though, this estimation was rather unstable, and I have resorted to the jackknife method instead.
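As an illustration of the Hessian outline, here is a minimal sketch applied to the detection likelihood from the question. The data are simulated (gamma_true, n and the number of shipments are my own assumptions), and I use optim's "Brent" method with explicit bounds because the problem is one-dimensional; keeping the search inside (0, 1) also avoids the NaN warnings from log().

```r
set.seed(1)
gamma_true <- 0.05   # assumed infestation rate for the simulation
n <- 44              # sample size per shipment
shipments <- 1000    # number of inspected shipments

# r = 1 if at least one infested plant is found in a sample of size n
r <- rbinom(shipments, 1, 1 - (1 - gamma_true)^n)

# Negative log-likelihood in the infestation rate g
negll <- function(g) {
  -sum((1 - r) * n * log(1 - g) + r * log(1 - (1 - g)^n))
}

# One-dimensional bounded optimization with a numerically estimated Hessian
fit <- optim(par = 0.04, fn = negll, method = "Brent",
             lower = 1e-4, upper = 0.5, hessian = TRUE,
             control = list(ndeps = 1e-5))

if (fit$convergence == 0) {
  covmat <- solve(fit$hessian)          # inverse of the observed information
  se <- sqrt(diag(covmat))
  ci <- fit$par + c(-1, 1) * qnorm(0.975) * se
}

# For a common n the MLE has a closed form, which is handy as a check
gamma_closed <- 1 - mean(r == 0)^(1 / n)
```

The small finite-difference step ndeps = 1e-5 is chosen because the parameter itself is of order 0.05; the default step of 1e-3 is coarse relative to that scale.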

GAM with "gp" smoother: how to retrieve the variogram parameters?

I am using the following geoadditive model
library(gamair)
library(mgcv)
data(mack)
mack$log.net.area <- log(mack$net.area)
gm2 <- gam(egg.count ~ s(lon,lat,bs="gp",k=100,m=c(2,10,1)) +
s(I(b.depth^.5)) +
s(c.dist) +
s(temp.20m) +
offset(log.net.area),
data = mack, family = tw, method = "REML")
Here I am using an exponential covariance function with range = 10 and power = 1 (m=c(2,10,1)). How can I retrieve from the results the variogram parameters (nugget, sill)? I couldn't find anything in the model output.
In the smoothing approach the correlation matrix is specified, so you only estimate the variance parameter, i.e., the sill. For example, you've set m = c(2, 10, 1) in s(, bs = 'gp'), giving an exponential correlation matrix with range parameter phi = 10. Note that phi is not identical to the range, except for spherical correlation. For many correlation models the actual range is a function of phi.
The variance / sill parameter is closely related to the smoothing parameter in penalized regression, and you can obtain it by dividing the scale parameter by smoothing parameter:
with(gm2, scale / sp["s(lon,lat)"])
#s(lon,lat)
# 26.20877
Is this right? No. There is a trap here: smoothing parameters returned in $sp are not real ones, and we need the following:
gm2_sill <- with(gm2, scale / sp["s(lon,lat)"] * smooth[[1]]$S.scale)
#s(lon,lat)
# 7.7772
And we copy in the range parameter you've specified:
gm2_phi <- 10
The nugget must be zero, since a smooth function is continuous. Using lines.variomodel function from geoR package, you can visualize the semivariogram for the latent Gaussian spatial random field modeled by s(lon,lat).
library(geoR)
lines.variomodel(cov.model = "exponential", cov.pars = c(gm2_sill, gm2_phi),
nugget = 0, max.dist = 60)
abline(h = gm2_sill, lty = 2)
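If geoR is not available, the exponential semivariogram can be written out directly. The sketch below assumes the usual exponential form, nugget + sill * (1 - exp(-h / phi)), and the common convention that the practical range is the distance at which 95% of the sill is reached; exp_semivariogram is a hypothetical helper name.

```r
# Hypothetical helper: exponential semivariogram
exp_semivariogram <- function(h, sill, phi, nugget = 0) {
  nugget + sill * (1 - exp(-h / phi))
}

sill <- 7.7772   # value obtained above
phi <- 10        # range parameter passed via m = c(2, 10, 1)

# Distance at which the curve reaches 95% of the sill (about 3 * phi)
practical_range <- -phi * log(0.05)

h <- seq(0, 60, length.out = 200)
plot(h, exp_semivariogram(h, sill, phi), type = "l",
     xlab = "distance", ylab = "semivariance")
abline(h = sill, lty = 2)
abline(v = practical_range, lty = 3)
```

This illustrates why phi and the range differ for the exponential model: at h = phi the curve has only reached about 63% of the sill.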
However, be skeptical of this variogram. mgcv is not an easy environment in which to interpret geostatistics. The use of low-rank smoothers suggests that the above variance parameter is for parameters in the new parameter space rather than the original one. For example, there are 630 unique spatial locations in the spatial field for the mack dataset, so the correlation matrix should be 630 x 630, and the full random effects should be a vector of length 630. But by setting k = 100 in s(, bs = 'gp'), the truncated eigendecomposition and subsequent low-rank approximation reduce the random effects to length 100. The variance parameter is really for this vector, not the original one. This might explain why the sill and the actual range do not agree with the data and the predicted s(lon,lat).
## unique locations
loc <- unique(mack[, c("lon", "lat")])
max(dist(loc))
#[1] 15.98
The maximum distance between two spatial locations in the dataset is 15.98, but the actual range from the variogram seems to be somewhere between 40 and 60, which is too large.
## predict `s(lon, lat)`, using the method I told you in your last question
## https://stackoverflow.com/q/51634953/4891738
sp <- predict(gm2,
data.frame(loc, b.depth = 0, c.dist = 0, temp.20m = 0,
log.net.area = 0),
type = "terms", terms = "s(lon,lat)")
c(var(sp))
#[1] 1.587126
The predicted s(lon,lat) has a variance of only 1.587, while the sill of 7.77 is much higher.

find total variation distance between multinomial distributions in r

I am comparing Bayes estimators to MLE in multinomial distributions. I am drawing random samples using rmultinom from a particular multinomial distribution using
rmultinom(400, size = 30, prob = c(5,7,10,8,14,10,15,12,10,9))
For each of the 400 samples, I compute the MLE and Bayes estimators for the ten probability parameters. I now want to find in each case the total variation distance between the true distribution and the one defined by the estimators.
Since for size 30 and 10 bins there are over 200 million possible arrangements, I don't think that using the theoretical definition is a good idea.
The package distrEx has a function "TotalVarDist()", but it can only be used with distributions defined in the distr package, and multinomial is not one of them. There are directions for defining them (see here and here) but the options are either to define a discrete distribution by explicitly listing the support (again, I don't think this is a good option since the support has a size of over 200 million) or starting from scratch using the same methods as how the distr package was created, which is beyond my current ability.
Any thoughts on how to do this, either using the packages mentioned or in a completely different way?
My answer is about how to calculate this using base R.
We have two multinomial parameter vectors, θ and η. The total variation distance is equivalent to P_θ(E) - P_η(E), where E={ω | P_θ({ω})>P_η({ω})}, and ω is a vector of sample counts.
I know of two ways to evaluate P(E) in base R. One is a very simple simulation-based method. The other reframes the problem in terms of a linear combination of the counts, which is approximately normally distributed, and uses the pnorm function.
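Before turning to those two methods, the identity used above can be checked exactly on a toy multinomial whose support is small enough to enumerate. The parameters below (size 4, three cells, theta and eta) are made up for illustration only.

```r
size <- 4
theta <- c(0.5, 0.3, 0.2)
eta <- c(0.4, 0.4, 0.2)

# Enumerate every count vector (a, b, c) with a + b + c = size
grid <- expand.grid(a = 0:size, b = 0:size)
grid <- grid[grid$a + grid$b <= size, ]
counts <- cbind(grid$a, grid$b, size - grid$a - grid$b)

# Probability mass of each outcome under both distributions
p <- apply(counts, 1, dmultinom, size = size, prob = theta)
q <- apply(counts, 1, dmultinom, size = size, prob = eta)

# Definition: half the L1 distance between the mass functions
tv_half_l1 <- 0.5 * sum(abs(p - q))

# Event form: P_theta(E) - P_eta(E) with E = {omega : p(omega) > q(omega)}
E <- p > q
tv_event <- sum(p[E]) - sum(q[E])

c(tv_half_l1, tv_event)
```

Both numbers coincide, which is the property the simulation and the normal approximation rely on for the size-30, ten-cell case where enumeration is not practical.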
Simulation-based method
You simulate samples from each distribution, check whether they're in E using the probability mass functions, and count how often they are. I'll go through an example here. We'll assume the true distribution from your question:
unnormalized.true <- c(5,7,10,8,14,10,15,12,10,9)
true <- unnormalized.true / sum(unnormalized.true)
We'll draw a sample and estimate a new distribution using a Bayes estimator:
set.seed(921)
result <- as.vector(rmultinom(1, size = 30, prob = true))
result
## [1] 3 6 2 0 5 3 3 4 1 3
dirichlet <- (result+1)/(30+length(true))
Calculating the probability of E under the true distribution:
set.seed(939)
true.dist <- rmultinom(10^6, 30, true)
p.true.e <- mean(apply(true.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, dirichlet) > 0))
Calculating the probability of E under the estimated distribution from the Bayes estimator:
dirichlet.dist <- rmultinom(10^6, 30, dirichlet)
p.dirichlet.e <- mean(apply(dirichlet.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, dirichlet) > 0))
And we can subtract to get the total variation distance.
p.true.e - p.dirichlet.e
## [1] 0.83737
Repeating this with the maximum likelihood estimate, we get a comparison of the estimators.
mle <- result/30
mle.dist <- rmultinom(10^6, 30, mle)
p.true.e2 <- mean(apply(true.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, mle) > 0))
p.mle.e2 <- mean(apply(mle.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, mle) > 0))
p.true.e2 - p.mle.e2
## [1] 0.968301
(edited to fix a serious mistake. Previously I had re-used p.true.e in the comparison with the MLE. I forgot that the event E is defined in terms of the estimated distribution.)
Normal approximation
I think this method is actually more accurate than the simulation based method, despite the normal approximation. As you'll see, we're not taking a normal approximation to the multinomial counts, which would be unlikely to be accurate for n=30. We're taking a normal approximation to a linear combination of these counts, which is close to normal. The weakness of this method will turn out to be that it can't handle zero probabilities in the estimated distribution. That's a real problem, since handling zeros gracefully is, to me, part of the point of using total variation distance rather than Kullback-Leibler divergence. But here it is.
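The following derivation yields a restatement of E. The multinomial coefficient is the same under both distributions, so
$$P_\theta(\{\omega\}) > P_\eta(\{\omega\}) \iff \prod_i \theta_i^{N_i} > \prod_i \eta_i^{N_i} \iff \sum_i N_i \log(\theta_i / \eta_i) > 0.$$
Define
$$L = \sum_i \lambda_i N_i,$$
where N_i is one cell of the multinomial sample, and
$$\lambda_i = \log(\theta_i / \eta_i).$$
Then, E is the event that L>0.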
The reason we have a problem with a zero probability is that it causes one of the λ_i's to be infinite.
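The equivalence between "L > 0" and "the mass function is larger under the true distribution" can be checked numerically. This sketch reuses the true distribution and the sample counts from the example above (typed in directly so it does not depend on the random number generator); the number of draws and the seed are arbitrary.

```r
unnormalized.true <- c(5, 7, 10, 8, 14, 10, 15, 12, 10, 9)
theta <- unnormalized.true / sum(unnormalized.true)

# Sample counts from the example above and the resulting Bayes estimate
result <- c(3, 6, 2, 0, 5, 3, 3, 4, 1, 3)
eta <- (result + 1) / (30 + length(theta))

lambda <- log(theta / eta)

set.seed(7)
draws <- rmultinom(2000, size = 30, prob = theta)

# Membership in E via the probability mass functions ...
in_E_pmf <- apply(draws, 2, function(x)
  dmultinom(x, 30, theta) > dmultinom(x, 30, eta))

# ... and via the linear combination L = sum(lambda * N)
in_E_L <- colSums(lambda * draws) > 0

all(in_E_pmf == in_E_L)
```

If any cell of the estimated distribution were zero (as with the MLE here), the corresponding lambda would be infinite and the second form would break down.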
I want to verify that L is close to normally distributed, in the example from before. I'll do that by getting samples from the distribution of L, using the previous multinomial simulation:
lambda <- log(true/dirichlet)
L.true.dist <- apply(true.dist, 2, function(x) sum(lambda*x))
L.dirichlet.dist <- apply(dirichlet.dist, 2, function(x) sum(lambda*x))
Note that I'm doing the comparison between the true distribution and the Bayes estimated distribution. I can't do the one with the MLE, because my sample had a zero count.
Plotting the distribution of L and comparing to a normal fit:
par(mfrow=c(1,2))
L.true.dist.hist <- hist(L.true.dist)
L.true.dist.fit <- function(x)
length(L.true.dist) * diff(L.true.dist.hist$breaks)[1] *
dnorm(x, mean(L.true.dist), sd=sd(L.true.dist))
curve(L.true.dist.fit, add=TRUE, n=1000, col='red')
L.dirichlet.dist.hist <- hist(L.dirichlet.dist)
L.dirichlet.dist.fit <- function(x)
length(L.dirichlet.dist) * diff(L.dirichlet.dist.hist$breaks)[1] *
dnorm(x, mean(L.dirichlet.dist), sd=sd(L.dirichlet.dist))
curve(L.dirichlet.dist.fit, add=TRUE, n=1000, col='red')
par(mfrow=c(1,1))
The distribution of L appears normal. So, instead of using simulations, we can just use pnorm. However, we need to calculate the mean and standard deviation of L. This can be done as follows.
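The mean of L is
$$E[L] = n \sum_i \lambda_i p_i,$$
where p_i is the cell probability of cell i in the distribution p. The variance is
$$\mathrm{Var}(L) = \lambda^T \Sigma \lambda,$$
where
$$\Sigma_{ij} = \begin{cases} n p_i (1 - p_i), & i = j \\ -n p_i p_j, & i \neq j \end{cases}$$
is the covariance matrix of the multinomial distribution. I'll calculate these moments for this example, and check them against the empirical moments in the simulation. First, for the distribution of L under the true distribution: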
n <- 30
k <- length(true)
mean.L.true <- sum(lambda * n * true)
# Did we get the mean right?
c(mean.L.true, mean(L.true.dist))
## [1] 3.873509 3.875547
# Covariance matrix assuming the true distribution
sigma.true <- outer(1:k, 1:k, function(i,j)
ifelse(i==j, n*true[i]*(1-true[i]), -n*true[i]*true[j]))
var.L.true <- t(lambda) %*% sigma.true %*% lambda
# Did we get the standard deviation right?
c(sqrt(var.L.true), sd(L.true.dist))
## [1] 2.777787 2.776945
Then, the mean and variance of L under the Bayes estimate of the distribution:
mean.L.dirichlet <- sum(lambda * n * dirichlet)
# Did we get the mean right?
c(mean.L.dirichlet, mean(L.dirichlet.dist))
## [1] -3.893836 -3.895983
# Covariance matrix assuming the estimated distribution
sigma.dirichlet <- outer(1:k, 1:k, function(i,j)
ifelse(i==j, n*dirichlet[i]*(1-dirichlet[i]), -n*dirichlet[i]*dirichlet[j]))
var.L.dirichlet <- t(lambda) %*% sigma.dirichlet %*% lambda
# Did we get the standard deviation right?
c(sqrt(var.L.dirichlet), sd(L.dirichlet.dist))
## [1] 2.796348 2.793421
With these in hand, we can calculate the total variation distance with pnorm:
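pnorm(0, mean.L.true, sd=sqrt(var.L.true), lower.tail=FALSE) -
pnorm(0, mean.L.dirichlet, sd=sqrt(var.L.true), lower.tail=FALSE)
## [1] 0.8379193
# Previous result was 0.83737
# Note: the second call reuses sqrt(var.L.true); with sqrt(var.L.dirichlet), the
# standard deviation under the estimated distribution, the result is roughly 0.8365.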
We get three digits of agreement with the simulation.
I don't know of any easy way to extend the normal approximation method to handle zero probabilities, though. I had an idea, but I got stuck trying to calculate the covariance matrix of the counts conditional on a specific cell having 0 count. I could share my progress if you think you could make something of it.

Sample from a custom likelihood function

I have the following likelihood function which I used in a rather complex model (in practice on a log scale):
library(plyr)
dcustom=function(x,sd,L,R){
R. = (log(R) - log(x))/sd
L. = (log(L) - log(x))/sd
ll = pnorm(R.) - pnorm(L.)
return(ll)
}
df=data.frame(Range=seq(100,500),sd=rep(0.1,401),L=200,U=400)
df=mutate(df, Likelihood = dcustom(Range, sd,L,U))
with(df,plot(Range,Likelihood,type='l'))
abline(v=200)
abline(v=400)
In this function, the sd is predetermined and L and R are "observations" (very much like the endpoints of a uniform distribution), so all 3 of them are given. The above function provides a large likelihood (1) if the model estimate x (derived parameter) is in between the L-R range, a smooth likelihood decrease (between 0 and 1) near the bounds (of which the sharpness is dependent on the sd), and 0 if it is too much outside.
This function works very well to obtain estimates of x, but now I would like to do the inverse: draw a random x from the above function. If I would do this many times, I would generate a histogram that follows the shape of the curve plotted above.
The ultimate goal is to do this in C++, but I think it would be easier for me if I could first figure out how to do this in R.
There's some useful information online that helps me start (http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution, https://stats.stackexchange.com/questions/88697/sample-from-a-custom-continuous-distribution-in-r) but I'm still not entirely sure how to do it and how to code it.
I presume (not sure at all!) the steps are:
transform likelihood function into probability distribution
calculate the cumulative distribution function
inverse transform sampling
Is this correct and if so, how do I code this? Thank you.
One idea might be to use the Metropolis-Hastings algorithm to obtain a sample from the distribution given all the other parameters and your likelihood.
# Metropolis-Hastings algorithm (random-walk proposal)
set.seed(2018)
n_sample <- 100000
posterior_sample <- rep(NA, n_sample)
x <- 300 # starting value: I chose 300 based on your likelihood plot
for (i in 1:n_sample){
lik <- dcustom(x = x, sd = 0.1, L = 200, R =400)
# propose a value for x (you can adjust the stepsize with the sd)
x.proposed <- x + rnorm(1, 0, sd = 20)
lik.proposed <- dcustom(x = x.proposed, sd = 0.1, L = 200, R = 400)
r <- lik.proposed/lik # this is the acceptance ratio
# accept the new value with probability min(1, r)
if (runif(1) < r) {
x <- x.proposed
}
# record the current state every iteration; after a rejection the old value
# is repeated, which is what makes the chain target the right distribution
posterior_sample[i] <- x
}
# plotting the density
approximate_distr <- posterior_sample
d <- density(approximate_distr)
plot(d, main = "Sample from distribution")
abline(v=200)
abline(v=400)
# If you now want to sample just a few values (for example, 5) you could use
sample(approximate_distr,5)
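The three steps listed in the question (normalize, compute the CDF, invert it) can also be done numerically on a grid, which may be easier to port to C++ than an MCMC sampler. This is a rough sketch: the grid limits of 100 and 700 and the step of 0.1 are my own choices, picked so that the likelihood is negligible outside the grid for sd = 0.1, L = 200, R = 400.

```r
dcustom <- function(x, sd, L, R) {
  pnorm((log(R) - log(x)) / sd) - pnorm((log(L) - log(x)) / sd)
}

# 1. Evaluate the (unnormalized) likelihood on a fine grid
grid <- seq(100, 700, by = 0.1)
dens <- dcustom(grid, sd = 0.1, L = 200, R = 400)

# 2. Cumulative distribution function on the grid (normalized to end at 1)
cdf <- cumsum(dens) / sum(dens)

# 3. Inverse transform sampling: map uniform draws through the inverse CDF
set.seed(1)
u <- runif(10000)
draws <- approx(x = cdf, y = grid, xout = u, rule = 2)$y

hist(draws, breaks = 100, freq = FALSE, main = "Inverse transform sample")
abline(v = c(200, 400))
```

The accuracy is limited by the grid resolution, so the grid should be refined if sd is made much smaller.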

Errors running Maximum Likelihood Estimation on a three parameter Weibull cdf

I am working with the cumulative emergence of flies over time (taken at irregular intervals) over many summers (though first I am just trying to make one year work). The cumulative emergence follows a sigmoid pattern and I want to create a maximum likelihood estimation of a 3-parameter Weibull cumulative distribution function. The three-parameter models I've been trying to use in the fitdistrplus package keep giving me an error. I think this must have something to do with how my data is structured, but I cannot figure it out. Obviously I want it to read each point as an x (degree days) and a y (emergence) value, but it seems to be unable to read two columns. The main error I'm getting says "Non-numeric argument to mathematical function" or (with slightly different code) "data must be a numeric vector of length greater than 1". Below is my code including added columns in the df_dd_em dataframe for cumulative emergence and percent emergence in case that is useful.
degree_days <- c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94)
emergence <- c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0)
cum_em <- cumsum(emergence)
df_dd_em <- data.frame (degree_days, emergence, cum_em)
df_dd_em$percent <- ave(df_dd_em$emergence, FUN = function(df_dd_em) 100*(df_dd_em)/46)
df_dd_em$cum_per <- ave(df_dd_em$cum_em, FUN = function(df_dd_em) 100*(df_dd_em)/46)
x <- pweibull(df_dd_em[c(1,3)],shape=5)
dframe2.mle <- fitdist(x, "weibull",method='mle')
Here's my best guess at what you're after:
Set up data:
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd <- transform(dd,cum_em=cumsum(emergence))
We're actually going to fit to an "interval-censored" distribution (i.e. probability of emergence between successive degree day observations: this version assumes that the first observation refers to observations before the first degree-day observation, you could change it to refer to observations after the last observation).
library(bbmle)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun <- function(scale,shape,x=dd$degree_days,y=dd$emergence) {
prob <- pmax(diff(pweibull(c(-Inf,x), ## or (c(x,Inf))
shape=shape,scale=scale)),1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
I should probably have used something more systematic like the method of moments (i.e. matching the mean and variance of a Weibull distribution with the mean and variance of the data), but I just hacked around a bit to find plausible starting values:
## preliminary look (method of moments would be better)
scvec <- 10^(seq(0,4,length=101))
plot(scvec,sapply(scvec,NLLfun,shape=1))
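For reference, the method-of-moments start mentioned above could be sketched as follows for the two-parameter Weibull. It treats each emergence as an observation at the recorded degree-day value (ignoring the interval censoring), matches the coefficient of variation to get the shape, and then matches the mean to get the scale. The bracket c(0.5, 50) passed to uniroot is an assumption that happens to contain the root for these data.

```r
degree_days <- c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
                 1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
                 1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
                 2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
                 2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
                 2707.36,2773.82,2816.39,2863.94)
emergence <- c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
               0,0,0,0,1,0,0,0,0,0)

# Weighted mean and variance of degree days, weighted by emergence counts
w <- emergence / sum(emergence)
m <- sum(w * degree_days)
v <- sum(w * (degree_days - m)^2)
cv2 <- v / m^2

# For a Weibull, CV^2 = gamma(1 + 2/k) / gamma(1 + 1/k)^2 - 1 depends on shape k only
cv2_gap <- function(k) gamma(1 + 2 / k) / gamma(1 + 1 / k)^2 - 1 - cv2
shape0 <- uniroot(cv2_gap, c(0.5, 50))$root

# The mean of a Weibull is scale * gamma(1 + 1/k)
scale0 <- m / gamma(1 + 1 / shape0)

c(shape = shape0, scale = scale0)
```

These values could then be passed as startvals in place of the hand-picked scale=1000, shape=1.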
It's important to use parscale to let R know that the parameters are on very different scales:
startvals <- list(scale=1000,shape=1)
m1 <- mle2(NLLfun,start=startvals,
control=list(parscale=unlist(startvals)))
Now try with a three-parameter Weibull (as originally requested) -- requires only a slight modification of what we already have:
library(FAdist)
NLLfun2 <- function(scale,shape,thres,
x=dd$degree_days,y=dd$emergence) {
prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres)),
1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
Looks like the three-parameter fit is much better:
library(emdbook)
AICtab(m1,m2)
## dAIC df
## m2 0.0 3
## m1 21.7 2
And here's the graphical summary:
with(dd,plot(cum_em~degree_days,cex=3))
with(as.list(coef(m1)),curve(sum(dd$emergence)*
pweibull(x,shape=shape,scale=scale),col=2,
add=TRUE))
with(as.list(coef(m2)),curve(sum(dd$emergence)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))
(could also do this more elegantly with ggplot2 ...)
These don't seem like spectacularly good fits, but they're sane. (You could in principle do a chi-squared goodness-of-fit test based on the expected number of emergences per interval, and accounting for the fact that you've fitted a three-parameter model, although the values might be a bit low ...)
Confidence intervals on the fit are a bit of a nuisance; your choices are (1) bootstrapping; (2) parametric bootstrapping (resample parameters assuming a multivariate normal distribution of the data); (3) delta method.
Using bbmle::mle2 makes it easy to do things like get profile confidence intervals:
confint(m1)
## 2.5 % 97.5 %
## scale 1576.685652 1777.437283
## shape 4.223867 6.318481
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd$cum_em <- cumsum(dd$emergence)
dd$percent <- ave(dd$emergence, FUN = function(dd) 100*(dd)/46)
dd$cum_per <- ave(dd$cum_em, FUN = function(dd) 100*(dd)/46)
dd <- transform(dd)
#start 3 parameter model
library(FAdist)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun2 <- function(scale,shape,thres,
x=dd$degree_days,y=dd$percent) {
prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres)),
1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
summary(m2)
#graphical summary
windows(5,5)
with(dd,plot(cum_per~degree_days,cex=3))
with(as.list(coef(m2)),curve(sum(dd$percent)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))
