I've fit a Dyanmic Linear Model to some data using the dlmFilter in R [from the dlm package]. From said filter I have predicted 7 steps ahead using the dlmForecast function. The predicted outcome is very good, but I would like to add a 95% confidence interval and [after a lot of testing] have struggled to do so.
I've mocked up some similar code, below:
library(dlm)
data <- c(20.68502, 17.28549, 12.18363, 13.53479, 15.38779, 16.14770, 20.17536, 43.39321, 42.91027, 49.41402, 59.22262, 55.42043)
mod.build <- function(par) {
dlmModPoly(1, dV = exp(par[1]), dW = exp(par[2]))
}
# Returns most likely estimate of relevant values for parameters
mle <- dlmMLE(a2, rep(0,2), mod.build); #nileMLE$conv
if(mle$convergence==0) print("converged") else print("did not converge")
mod1 <- dlmModPoly(dV = v, dW = c(0, w))
mod1Filt <- dlmFilter(a1, mod1)
fut1 <- dlmForecast(mod1Filt, n = 7)
The forecast outcome appears to be very good [although the model to some extent over-fits the data due to the small number of observations]. However, I would like to add a 95% confidence interval and have struggled to figure out how to do so.
Any advice would be appreciated?
Cheers
hwidth <- (outer(sapply(fut1$Q, FUN=function(x) sqrt(diag(x))), qnorm(0.025, lower = FALSE)) +as.vector(t(fut1$f)))
Related
I am working with the orings data set in the faraway package in R. I have written the following grouped binomial model:
orings_model <- glm(cbind(damage, 6-damage) ~ temp, family = binomial, data = orings)
summary(orings_model)
I then constructed the Chi-Square test statistic and calculated the p-value:
pchisq(orings_model$null.deviance, orings_model$df.null,lower=FALSE)
First, I would like to generate data under the null distribution for this test statistic using rbinom with the average proportion of damaged o-rings (i.e., the variable "damage"). Second, I would like to recompute the above test statistic with this new data. I am not sure how to do this.
And second, I want to the process above 1000 times, saving the test statistic
each time. I am also not sure how to do this. My inclination is to use a for loop, but I am not sure how to set it up. Any help would be really appreciated!
It is not completely clear what you're looking to do here, but we can at least show some quick principles of how we can achieve this, and then hopefully you can get to your goal.
1) Simulating the null model
It is not entirely clear that you would like to simulate the null model here. It seems more like you're interested in simulating the actual model fit. Note that the null model is the model with form cbind(damage, 6-damage) ~ 1, and the null deviance and df are from this model. Either way, we can simulate data from the model using the simulate function in base R.
sims <- simulate(orings_model, 1000)
If you want to go the manual way estimate the mean vector of your model and use this for the probabilities in your call to rbinom
nsim <- 1000 * nrow(orings)
probs <- predict(orings_model, type = 'response')
sims_man <- matrix(rbinom(nsim, 6, probs),
ncol = 1000)
# Check they are equal:
# rowMeans(sims_man) - probs
In the first version we get a data.frame with 1000 columns each with a n times 2 matrix (damage vs not damage). In the latter we just summon the damage outcome.
2) Perform the bootstrapping
You could do this manually with the data above.
# Data from simulate
statfun <- function(x){
data <- orings_model$data
data$damage <- if(length(dim(x)) > 1)
x[, 1]
else
x
newmod <- update(orings_model, data = data)
pchisq(newmod$null.deviance, newmod$df.null, lower=FALSE)
}
sapply(sims, statfun)
# data from manual method
apply(sims_man, 2, statfun)
or alternatively one could take a bit of time with the boot function, allowing for a standardized way to perform the bootstrap:
library(boot)
# See help("boot")
ran_gen <- function(data, mle){
data$damage <- simulate(orings_model)[[1]][,1]
data
}
boot_metric <- function(data, w){
model <- glm(cbind(damage = damage, not_damage = 6 - damage) ~ temp,
family = binomial, data = data)
pchisq(model$null.deviance,
model$df.null,
lower=FALSE)
}
boots <- boot(orings, boot_metric,
R = 1000,
sim = 'parametric',
ran.gen = ran_gen,
mle = pchisq(orings_model$null.deviance,
orings_model$df.null,
lower=FALSE))
At which point we have the statistic in boots$t and the null statistic in boots$t0, so a simple statistic can be estimated using sum(boots$t > boots$t0) / boots$R (R being the number of replication).
I am new to R and looking to estimate the likelihood of having an outcome>=100 using a probability density function (the outcome in my example is the size of an outbreak). I believe I have the correct coding, but something doesn't feel right about the answer, when I look at the plot.
This is my code (it's based on the output of a stochastic model of an outbreak). I'd very much appreciate pointers. I think the error is in the likelihood calculation....
Thank you!
total_cases.dist <- dlnorm(sample.range, mean = total_cases.mean, sd = total_cases.sd)
total_cases.df <- data.frame("total_cases" = sample.range, "Density" = total_cases.dist)
library(ggplot2)
ggplot(total_cases.df, aes(x = total_cases, y = Density)) + geom_point()
pp <- function(x) {
print(paste0(round(x * 100, 3), "%"))
}
# likelihood of n_cases >= 100
pp(sum(total_cases.df$Density[total_cases.df$total_cases >= 100]))
You are using dlnorm, which is the log-normal distribution, which means the mean and sd are the mean of the log (values) and sd of log(values), for example:
# we call the standard rlnorm
X = rlnorm(1000,0,1)
# gives something close to sd = exp(1), and mean=something
c(mean(X),sd(X))
# gives what we simulated
c(mean(log(X)),sd(log(X)))
We now simulate some data, using a known poisson distribution where mean = variance. And we can model it using the log-normal:
set.seed(100)
X <- rpois(500,lambda=1310)
# we need to log values first
total_cases.mean <- mean(log(X))
total_cases.sd <- sd(log(X))
and you can see it works well
sample.range <- 1200:1400
hist(X,br=50,freq=FALSE)
lines(sample.range,
dlnorm(sample.range,mean=total_cases.mean,sd=total_cases.sd),
col="navyblue")
For your example, you can get probability of values > 1200 (see histogram):
plnorm(1200,total_cases.mean,total_cases.sd,lower.tail=FALSE)
Now for your data, if it is true that mean = 1310.198 and total_cases.sd = 31615.26, take makes variance ~ 76000X of your mean ! I am not sure then if the log normal distribution is appropriate for modeling this kind of data..
I’m attempting to perform a regression calibration on two variables using the yorkfit() function in the IsoplotR package. I would like to estimate the confidence interval of the bootstrapped slope coefficient from this model; however, instead of using the typical bootstrap method below, I’d like to only perform the iterations on 75% of the data (randomly selected) at a time. So far, using the following sample data, I managed to bootstrap the slope coefficient result of the yorkfit() function:
library(boot)
library(IsoplotR)
X <- c(9.105,8.987,8.974,8.994,8.996,8.966,9.035,9.215,9.239,
9.307,9.227,9.17, 9.102)
Y <- c(28.1,28.9,29.6,29.5,29.0,28.8,28.5,27.3,27.1,26.5,
27.0,27.5,28.4)
n <- length(X)
sX <- X*0.02
sY <- Y*0.05
rXY <- rep(0.8,n)
dat <- cbind(X,sX,Y,sY,rXY)
fit <- york(dat)
boot.test <- function(data,indices){
sample = data[indices,]
mod = york(sample)
return (mod$b)
}
result <- boot(data=dat, statistic = boot.test, R=1000)
boot.ci(result, type = 'bca')
...but I'm not really sure where to go from here. Any help to move me in the right direction would be greatly appreciated. I’m new to R so I apologize if question is ambiguous. Thanks.
Based on the package documentation, you should be able to use the ran.gen argument, with sim="parametric", to sample using a custom function. In this case, the sample is a certain percent of the total observations, chosen at random. Something like the following should accomplish what you want:
result <- boot(
data=dat,
statistic =boot.test,
R=1000,
sim="parametric",
ran.gen=function(data, percent){
n=nrow(data)
indic=runif(n)
data[rank(indic, ties.method="random")<=round(n*percent,0),]
},
percent=0.75)
I am trying to make a similar analysis to McNeil & Frey in their paper 'Estimation of tail-related risk measures for heteroscedastic financial time series: an extreme value approach' but I am stuck with a problem when implementing the models.
The approach is to fit a AR(1)-GARCH(1,1) model in order to estimate the the one-day ahead forecast of the VaR using a window of 1000 observations.
I have simulated data that should work fine with my model, and I assume that if I would be doing this correct, the observed coverage rate should be close to the theoretical one. However it is always below the theoretical coverage rate, and I don´t know why.
I beleive that this is how the calculation of the estimated VaR is done
VaR_hat = mu_hat + sigma_hat * qnorm(alpha)
, but I might be wrong. I have tried to find related questions here at stack but I have not found any.
How I approach this can be summarized in three steps.
Simulate 2000 AR(1)-GARCH(1,1) observations and fit a corresponding model and extract the one day prediction of the conditional mean and standard deviation using a window of 1000 observations.(Thereby making 1000 predictions)
Use the predicted values and the normal quantile to calculate the VaR for the wanted confidence level.
Check if the coverage rate is close to the theoretical one.
If someone could help me I would be extremely thankful, and if I'm unclear in my formalation please just tell me and I'll try to come up with a better explanation to the problem.
The code I'm using is attached below.
Thank you in advance
library(fGarch)
nObs <- 2000 # Number of observations.
quantileLevel <- 0.95 # Since we expect 5% exceedances.
from <- seq(1,1000) # Lower index vector for observations in model.
to <- seq(1001,2000) # Upper index vector for observations in model.
VaR_vec <- rep(0,(nObs-1000)) # Empty vector for storage of 1000 VaR estimates.
# Specs for simulated data (including AR(1) component and all components for GARC(1,1)).
spec = garchSpec(model = list(omega = 1e-6, alpha = 0.08, beta = 0.91, ar = 0.10),
cond.dist = 'norm')
# Simulate 1000 data points.
data_sim <- c(garchSim(spec, n = nObs, n.start = 1000))
for (i in 1:1000){
# The rolling window of 1000 observations.
data_insert <- data_sim[from[i]:to[i]]
# Fitting an AR(1)-GARCH(1,1) model with normal cond.dist.
fitted_model <- garchFit(~ arma(1,0) + garch(1,1), data_insert,
trace = FALSE,
cond.dist = "norm")
# One day ahead forecast of conditional mean and standard deviation.
predict(fitted_model, n.ahead = 1)
prediction_model <- predict(fitted_model, n.ahead = 1)
mu_pred <- prediction_model$meanForecast
sigma_pred <- prediction_model$standardDeviation
# Calculate VaR forecast
VaR_vec[i] <- mu_pred + sigma_pred*qnorm(quantileLevel)
if (length(to)-i != 0){
print(c('Countdown, just',(length(to) - i),'iterations left'))
} else {
print(c('Done!'))
}
}
# Exctract only the estiamtes ralated to the forecasts.
compare_data_sim <- data_sim[1001:length(data_sim)]
hit <- rep(0,length(VaR_vec))
# Count the amount of exceedances.
for (i in 1:length(VaR_vec)){
hit[i] <- sum(VaR_vec[i] <= compare_data_sim[i])
}
plot(data_sim[1001:2000], type = 'l',
ylab = 'Simulated data', main = 'Illustration of one day ahead prediction of 95%-VaR')
lines(VaR_vec, col = 'red')
cover_prop <- sum(hit)/length(hit)
print(sprintf("Diff theoretical level and VaR coverage = %f", (1-quantileLevel) - cover_prop))
I'm trying to get a 95% confidence interval around some predicted values, but am not capable of achieving this.
Basically, I estimated a growth curve like this:
set.seed(123)
dat=data.frame(size=rnorm(50,10,3),age=rnorm(50,5,2))
S <- function(t,ts,C,K) ((C*K)/(2*pi))*sin(2*pi*(t-ts))
sommers <- function(t,Linf,K,t0,ts,C)
Linf*(1-exp(-K*(t-t0)-S(t,ts,C,K)+S(t0,ts,C,K)))
model <- nls(size~sommers(age,Linf,K,t0,ts,C),data=dat,
start=list(Linf=10,K=4.7,t0=2.2,C=0.9,ts=0.1))
I have independent size measurements, for which I would like to predict the age. Therefore, the inverse of the function, which is not very straightforward, I calculated like this:
model.out=coef(model)
S.out <- function(t)
((model.out[[4]]*model.out[[2]])/(2*pi))*sin(2*pi*(t-model.out[[5]]))
sommers.out <- function(t)
model.out[[1]]*(1-exp(-model.out[[2]]*(t-model.out[[3]])-S.out(t)+S.out(model.out[[3]])))
inverse = function (f, lower = -100, upper = 100) {
function (y) uniroot((function (x) f(x) - y), lower = lower, upper = upper)[1]
}
sommers.inverse = inverse(sommers.out, 0, 25)
x= sommers.inverse(10) #this works with my complete dataset, but not with this fake one
Although this works fine, I need to know the confidence interval (95%) around this estimate (x). For linear models there is for example "predict(... confidence=)". I could also bootstrap the function somehow to get the quantiles associated with the parameters (didn't find how), to then use the extremes of those to calculate the maximum and minimum values predictable. But that doesn't really look like the good way of doing this....
Any help would be greatly appreciated.
EDIT after answer:
So this worked (explained in the book of Ben Bolker, see answer):
vmat = mvrnorm(1000, mu = coef(mfit), Sigma = vcov(mfit))
dist = numeric(1000)
for (i in 1:1000) {dist[i] = sommers_inverse(9.938,vmat[i,])}
quantile(dist, c(0.025, 0.975))
On the rather bad fake data I gave, this works of course rather horrible. But on the real data (which I have a problem recreating), this is ok!
Unless I'm mistaken, you're going to have to use either regular (parametric) bootstrapping or a method called either "population predictive intervals" (e.g., see section 5 of chapter 7 of Bolker 2008), which assumes that the sampling distributions of your parameters are multivariate Normal. However, I think you may have bigger problems, unless I've somehow messed up your model in adapting it ...
Generate data (note that random data may actually bad for testing your model - see below ...)
set.seed(123)
dat <- data.frame(size=rnorm(50,10,3),age=rnorm(50,5,2))
S <- function(t,ts,C,K) ((C*K)/(2*pi))*sin(2*pi*(t-ts))
sommers <- function(t,Linf,K,t0,ts,C)
Linf*(1-exp(-K*(t-t0)-S(t,ts,C,K)+S(t0,ts,C,K)))
Plot the data and the initial curve estimate:
plot(size~age,data=dat,ylim=c(0,16))
agevec <- seq(0,10,length=1001)
lines(agevec,sommers(agevec,Linf=10,K=4.7,t0=2.2,ts=0.1,C=0.9))
I had trouble with nls so I used minpack.lm::nls.lm, which is slightly more robust. (There are other options here, e.g. calculating the derivatives and providing the gradient function, or using AD Model Builder or Template Model Builder, or using the nls2 package.)
For nls.lm we need a function that returns the residuals:
sommers_fn <- function(par,dat) {
with(c(as.list(par),dat),size-sommers(age,Linf,K,t0,ts,C))
}
library(minpack.lm)
mfit <- nls.lm(fn=sommers_fn,
par=list(Linf=10,K=4.7,t0=2.2,C=0.9,ts=0.1),
dat=dat)
coef(mfit)
## Linf K t0 C ts
## 10.6540185 0.3466328 2.1675244 136.7164179 0.3627371
Here's our problem:
plot(size~age,data=dat,ylim=c(0,16))
lines(agevec,sommers(agevec,Linf=10,K=4.7,t0=2.2,ts=0.1,C=0.9))
with(as.list(coef(mfit)), {
lines(agevec,sommers(agevec,Linf,K,t0,ts,C),col=2)
abline(v=t0,lty=2)
abline(h=c(0,Linf),lty=2)
})
With this kind of fit, the results of the inverse function are going to be extremely unstable, as the inverse function is many-to-one, with the number of inverse values depending sensitively on the parameter values ...
sommers_pred <- function(x,pars) {
with(as.list(pars),sommers(x,Linf,K,t0,ts,C))
}
sommers_pred(6,coef(mfit)) ## s(6)=9.93
sommers_inverse <- function (y, pars, lower = -100, upper = 100) {
uniroot(function(x) sommers_pred(x,pars) -y, c(lower, upper))$root
}
sommers_inverse(9.938, coef(mfit)) ## 0.28
If I pick my interval very carefully I can get back the correct answer ...
sommers_inverse(9.938, coef(mfit), 5.5, 6.2)
Maybe your model will be better behaved with more realistic data. I hope so ...