Accuracy is different between "tune" and "predict" in R - r

I am building an SVM to classify runs as before or after track maintenance, using an on-board accelerometer mounted on a train car. There is a section of track of interest, and I extracted the acceleration data corresponding to that section. Each run takes around 3 minutes to pass through it, and the accelerometer samples at around 1600/s, so there are roughly 3 min * 60 s * 1600/s = 288,000 acceleration records per run. For each run I then compute the variance, maximum, minimum, mean, standard deviation, and most frequent value of those records. With around 250 runs, this gives a dataset of 250 rows of summary features, to which I add the class label (before or after maintenance) based on the maintenance record and the date of each run.
Using this dataset, I tried to build the SVM mentioned above. First, I tried to find optimal values of the gamma and cost parameters of the Gaussian (RBF) kernel, so I used "tune" to do a grid search. I got the following result:
> source("grid_search.R")
[gamma = 1 , cost = 10 ]
- best parameters:
gamma = 1.584893 ; cost = 25.11886 ;
accuracy: 88.54935 %
Also "grid_search.R" is as follow:
library(e1071)

gamma <- 10^(0.0)
cost <- 10^(1.0)
gammaRange <- 10^seq(log10(gamma)-1, log10(gamma)+1, length=11)[2:10]
costRange  <- 10^seq(log10(cost)-1,  log10(cost)+1,  length=11)[2:10]
t <- tune.svm(Category ~ ., data = X, gamma=gammaRange, cost=costRange,
              tunecontrol = tune.control(sampling="cross", cross=8))
cat("[gamma =", gamma, ", cost =" , cost , "]\n")
cat("- best parameters:\n")
cat("gamma =", t$best.parameters$gamma, "; cost =", t$best.parameters$cost, ";\n")
cat("accuracy:", 100 - t$best.performance * 100, "%\n\n")
plot(t, transform.x=log10, transform.y=log10, zlim=c(0,0.1))
After that, using "svm" with the parameter "gamma = 1.584893 ; cost = 25.11886 ;" found as above, I trained SVM and tried to predict to classify the data which is used for training SVM as following:
gamma = 1.584893 ; cost = 25.11886;
model <- svm(Category ~ ., data = X, gamma=gamma, cost=cost)
pred <- predict(model, X)
table(pred, X[,13])
And I got the following confusion matrix:
pred     after before
  after     47      2
  before     1    185
My question is: based on the confusion matrix above, the accuracy is
1 - (1 + 2)/(47 + 2 + 1 + 185) = 0.987234 (98.7%)
But "tune" reported "accuracy: 88.54935 %" when it found the optimal parameters gamma = 1.584893 and cost = 25.11886. Why do these two accuracies differ so much?
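For reference, a minimal sketch (assuming the objects X, t, and model from the code above are still in the workspace) of where each number comes from: the 88.5% is the 8-fold cross-validated accuracy computed inside tune.svm, whereas the 98.7% is the accuracy of the final model on the very data it was trained on.
# accuracy reported by tune: 1 minus the 8-fold cross-validation error
cv.accuracy <- 1 - t$best.performance        # ~0.885
# accuracy of the trained model on its own training data (resubstitution)
pred <- predict(model, X)
train.accuracy <- mean(pred == X$Category)   # ~0.987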

Related

In Rjags/runjags, what causes the "node inconsistent with parents" error when using dinterval?

I have wracked my brain trying to come up with a solution to this problem and I'm at wits end! First, the necessary context: Aquatic plants in lakes are sampled with rakes. You throw a rake out into the lake, you pull it back into your boat, and you figure out what plants are on its tines. In our case, we measure both presence/absence as well as "abundance," but in an ordinal/interval-censored way --> it's 0 if species X isn't noticed on the rake at all, 1 if it covers < 25% of the rake's tines, 2 if it covers between 25 and 75%, and 3 if it covers > 75%. However, it's fairly easy to miss a species entirely when it's in low abundance, so 0s are sketchy--they may not represent true absences, and that is really the issue our model is trying to explore.
So, there are really three layers here--a true, fully latent abundance that we don't observe directly at all, a partially latent "true presence/absence" in that we know where true presences are but not where true absences are, and then we have our observed presence/absence data. What's more interesting is that we think some environmental variables may affect both true abundance and true occurrence but differently, and then other variables may affect detectability, and it's those processes we're trying to tease apart.
So, anyhow, my actual model is much larger and more complicated than what I've pasted below, but here is a sort of functional (but probably academically meritless) training version of it that replicates the error I am getting.
library(rjags)
library(runjags)

#data setup
N = 1500 #Number of cases
obs = sample(c(0,1,2,3), N,
             replace=T, prob=c(0.7, 0.2, 0.075, 0.025)) #Our observed, interval-censored data.
X1 = rnorm(N) #Some covariate that probably affects both occurrence and abundance, but maybe in different ways.
abundances = rep(NA, times = N) #Abundance is a latent variable we don't directly observe. From elsewhere, I know the values here need to be NAs so the model will know to impute them.
occur = rep(1, times = N) #Occurrence is a degraded form of our abundance data.
#d will hold the initial values for the abundance data, since these are apparently needed to jumpstart the imputation.
d = vector()
for(o in 1:N) {
  if (obs[o]==0) { d[o] = 0.025; occur[o] = 0 }
  if (obs[o]==1) { d[o] = 0.15 }
  if (obs[o]==2) { d[o] = 0.5 }
  if (obs[o]==3) { d[o] = 0.875 }
}
#Data
test.data = list("N" = N,
                 "obs" = obs,
                 "X1" = X1,
                 "abund" = abundances,
                 "lim" = c(0.05, 0.25, 0.75, 0.9999),
                 "occur" = occur)
#Inits
inits = list(abund = d)
cat("model
{
for (i in 1:N) {
obs[i] ~ dinterval(abund[i], lim)
abund[i] ~ dbeta(theta[i], rho[i]) T(0.0001, 0.9999)
theta[i] <- mu[i] * epsilon
rho[i] <- epsilon * (1-mu[i])
logit(mu[i]) <- alpha1 + X.beta1 * X1[i]
occur[i] ~ dbern(phi[i])
logit(phi[i]) <- alpha2 + X.beta2 * X1[i]
}
#Priors
epsilon ~ dnorm(5, 0.1) T(0.01, 10)
alpha1 ~ dnorm(0, 0.01)
X.beta1 ~ dnorm(0, 0.01)
alpha2 ~ dnorm(0, 0.01)
X.beta2 ~ dnorm(0, 0.01)
}
", file = "training.txt")
test.run = jags.model(file = "training.txt", inits = inits, data=test.data, n.chains = 3)
params = c("epsilon",
"alpha1",
"alpha2",
"X.beta1",
"X.beta2")
run1 = run.jags("training.txt", data = test.data, n.chains=3, burnin = 1000, sample = 5000, adapt = 4000, thin = 2,
monitor = c(params), method="parallel", modules = 'glm')
At the end, I get this error, and I always get this error any time I try to do something even remotely like this:
Graph information:
   Observed stochastic nodes: 3000
   Unobserved stochastic nodes: 1505
   Total graph size: 19519
Reading parameter file inits1.txt
Initializing model
Error in node obs[1]
Node inconsistent with parents
I've read every posting that covers this error I can find, including this one, this one, this one, and this one. I can surmise from my research and testing that the error is probably occurring for one of the following reasons.
1. My initials for the latent abundance variable are not adequate somehow. It sounds like this requires pretty useful initial values to work.
2. One or more of my priors is allowing values that are not permissible, OR they are too broad and that's causing problems somehow. This might be especially an issue because of the beta distribution I am using, which has strong requirements about not having values outside of 0 and 1.
3. I am using the dinterval() function incorrectly, which seems likely because it is always the line containing it that trips the error.
4. My model is somehow mis-specified.
But I can't see where I might be going wrong--I have tried a number of different options for 1 and 2, and so far as I can tell from the documentation (see pages 55-56), I am using dinterval correctly. What am I missing??
In case it's relevant, from what I have gathered, the idea of dinterval() is that the variable on the left of the ~ is the interval-censored version of the variable given in the first argument (here, abundance). Then, the second argument (here, lim) is a vector of "breakpoints" that dictate which intervals the abundance data end up in. So, here, you end up with an observed abundance code of 0 if you are lower than the lowest lim (here, 0.05), 1 if you are in between the first two values in lim, etc. It's like the abundance variable is being pushed through a "binning sieve" created by the lim variable to produce a binned output variable, our observed abundances.
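To make that concrete, the same binning can be mimicked in plain R with findInterval (a quick sketch using the lim vector and the initial abundance values from above; findInterval counts how many breakpoints lie at or below each value):
lim <- c(0.05, 0.25, 0.75, 0.9999)
abund.example <- c(0.025, 0.15, 0.5, 0.875)
findInterval(abund.example, lim)
# [1] 0 1 2 3   (matching the observed codes)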
Any guidance would be most welcome!!
I have run your example with JAGS 4.3.0 and rjags 4-12. For me, the version with rjags runs correctly. The version with runjags does not work because you have not provided initial values. This is easily fixed by adding the argument
inits=list(inits, inits, inits)
to the call to run.jags().
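In other words, the only change needed is to supply one set of initial values per chain (a sketch of the corrected call, with everything else exactly as in the question):
run1 = run.jags("training.txt", data = test.data, n.chains = 3,
                inits = list(inits, inits, inits),
                burnin = 1000, sample = 5000, adapt = 4000, thin = 2,
                monitor = c(params), method = "parallel", modules = 'glm')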
You have correctly understood the purpose of dinterval. This is an "observable function" which imposes constraints on its parameters via a likelihood. When using dinterval you must always provide initial values that satisfy the constraints from the first iteration. As far as I can see, your initial values do satisfy the constraints, and this is verified by the fact that I can run your example (with initial values).

R code for maximum likelihood estimate from a specific likelihood function

I have been trying to write R code for maximum likelihood estimation based on a log-likelihood function in a paper (equation 9 on page 609). The authors of the paper estimated it using MATLAB, which I am not familiar with, so I tried to write the code in R.
Here is the log-likelihood function in the paper; written out in the form I implement in the R code further below, it is
logL(gamma) = sum[ (1 - r) * n * log(1 - gamma) + r * log(1 - (1 - gamma)^n) ]
where
r: Binary decision (0 or 1) indicating infested plant(s) detection (1) or not (0).
e: Inspection efficiency. This is known.
n: Sample size
The overall objective is to estimate plant infestation rate (gamma: γ) and epsilon (e) based on binary decision of presence and absence of infested plants instead of using infested plant(s) detected. So, the function has only binary information (r) of infested plant detection and sample size. Since epsilon (e) is known or fixed, the actual goal is to estimate gamma (γ) in a population.
Another objective is to compare the infestation rates estimated above with the ones from the hypergeometric sampling formula in another paper (on page 6). The formula, as implemented in the code below, is
sample size = round( (1 - (1 - ci)^(1/ninf)) * (box - (ninf - 1)/2) ),  where ninf = p * box.
This formula gives the sample size required to detect infested plants with a selected probability (e.g., 0.95) given an infestation rate. For example:
# Sample size calculation function
fosgate.sample1 <- function(box, p, ci){ # Note: box represents the total plant number
  ninf <- p*box
  sample.size <- round(((1-(1-ci)^(1/ninf))*(box-(ninf-1)/2)))
  #sample.size <- ceiling(((1-(1-ci)^(1/ninf))*(box-(ninf-1)/2)))
  sample.size
}
fosgate.sample1(box=100, p = .05, ci = .95) # box: population or total plants, p: infestation rate, ci: probability of detection
## 44
The idea is that, given a sample size (e.g., 44) and the binary decision data, the log-likelihood function can be used to estimate the infestation rate, and that rate should be close to the anticipated rate (e.g., 0.05). Ultimately, I would like to compare the plant infestation rates (gamma: γ) estimated from the log-likelihood function above with D/N in the sample size calculation formula (the second one), or with p in the sample size code below.
I generated R code for the log-likelihood described above.
### MLE with stats4
library(stats4)
# Negative log-likelihood function (insp.result and n are taken from the calling environment)
plant.inf.lik <- function(inf.rate){
  logl <- suppressWarnings(
    sum((1-insp.result)*n*log(1-inf.rate) +
          insp.result*log(1-(1-inf.rate)^n))
  )
  return(-logl)
}
Using the sample size function (i.e., fosgate.sample1) I generated sample sizes for various cases of total plant (or box) and anticipated detection rate (p) in the function. Since I am also interested in error/confidence ranges of estimated plant infestation rates, I used bootstrapping to calculate range of estimates (I am not sure if this is appropriate/acceptable). Here is the final code I generated:
### MLE and CI with bootstrapping with multiple scenarios
plant <- c(100, 500, 1000, 5000, 10000, 100000) # Total plant number
ir <- seq(.01, .2, by = .01) # Plant infestation rate
df.result <- data.frame(expand.grid(plant=plant, inf.rate = ir))
df.result$sample.size <- fosgate.sample1(box=df.result$plant, p=df.result$inf.rate, ci=.95) # Sample size
df.result$insp.result <- 1000 # Shipment number (can be replaced with random integers)
df.result <- df.result[order(df.result$plant, df.result$inf.rate, df.result$sample.size), ]
rownames(df.result) <- 1:nrow(df.result)
df.result$est.mean <- 0
#df.result$est.median <- 0
df.result$est.lower.ci <- 0
df.result$est.upper.ci <- 0
df.result$nsim <- 0
str(df.result)
head(df.result)
# Looping
est <- rep(NA, 1000)
for(j in 1:nrow(df.result)){
  for(i in 1:1000){
    insp.result <- sample(c(rep(1, df.result$insp.result[j]-df.result$insp.result[j]*df.result$inf.rate[j]),
                            rep(0, df.result$insp.result[j]*df.result$inf.rate[j])))
    ir <- df.result$inf.rate[j]
    n <- df.result$sample.size[j]
    insp.result <- sample(insp.result, replace = TRUE)
    est[i] <- coef(mle(plant.inf.lik, start = list(inf.rate = ir*.9), method = "BFGS", nobs = length(insp.result)))
    df.result$est.mean[j] <- mean(est, na.rm = TRUE)
    # df.result$est.median[j] <- median(est, na.rm = TRUE)
    df.result$est.lower.ci[j] <- quantile(est, prob = .025, na.rm = TRUE)
    df.result$est.upper.ci[j] <- quantile(est, prob = .975, na.rm = TRUE)
    df.result$nsim[j] <- length(est)
  }
}
# Significance test result
sig <- ifelse(df.result$inf.rate >= df.result$est.lower.ci & df.result$inf.rate <= df.result$est.upper.ci, "no sig", "sig")
table(sig)
# Plot
library(ggplot2)
library(reshape2)
df.result$num <- ave(df.result$inf.rate, df.result$plant, FUN=seq_along)
df.result.m <- melt(df.result, id.vars=c("plant", "sample.size", "insp.result", "est.lower.ci", "est.upper.ci", "nsim", "num"))
df.result.m$est.lower.ci <- ifelse(df.result.m$variable == "inf.rate", NA, df.result.m$est.lower.ci)
df.result.m$est.upper.ci <- ifelse(df.result.m$variable == "inf.rate", NA, df.result.m$est.upper.ci)
str(df.result.m)
ggplot(data = df.result.m, aes(x = num, y = value, group=variable, color=variable, shape=variable))+
geom_point()+
geom_errorbar(aes(ymin = est.lower.ci, ymax = est.upper.ci), width=.5)+
scale_y_continuous(breaks = seq(0, .2, .02))+
xlab("Index")+
ylab("Plant infestation rate")+
facet_wrap(~plant, ncol = 3)
When I ran the code, I was able to obtain results and to compare estimated (est.mean) and anticipated (inf.rate) infestation rates as shown in the plot below.
If the results are correct, the plot indicates that the estimation looks fine overall but is off for higher infestation rates.
Also, without the "suppressWarnings" function I always got the warning messages below, and occasionally the error message. I have no clue how to fix them.
## Warning messages
## 29: In log(1 - (1 - inf.rate)^n) : NaNs produced
## 30: In log(1 - inf.rate) : NaNs produced
## Error message (occasionally)
## Error in solve.default(oout$hessian) :
## Lapack routine dgesv: system is exactly singular: U[1,1] = 0
My questions are:
Is the R function (plant.inf.lik) appropriate for maximum likelihood estimation of the log-likelihood function above?
Should I take care of the warning and error messages? If yes, how? Again, I have no clue how to fix them...
Is the bootstrapping (resampling) method appropriate for estimating CI ranges and/or standard errors?
I found this link useful for an alternative approach. Although I am still working on both approaches, the results seem different (maybe a follow-up question).
Any suggestion would be greatly appreciated.
Concerning your last question about estimating CI ranges, there are three common methods for ML estimators:
Variance estimation from the inverted Hessian matrix.
Jackknife estimator for the variance (simpler and more stable, if the Hessian is estimated numerically, but computationally more expensive)
Bootstrap CIs (the computationally most expensive approach).
For bootstrap CIs, you do not need to implement them yourself (bias correction, for example, can be tricky), but can rely on the R library boot.
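A minimal sketch of that route, assuming an insp.result vector (the 0/1 inspection decisions for one scenario) is already defined as in the question's loop, and taking n = 44 as the fixed sample size purely for illustration:
library(boot)
library(stats4)
# statistic: re-fit the MLE on a bootstrap resample of the inspection results
boot.stat <- function(data, idx) {
  insp.result <- data[idx]
  n <- 44   # assumed fixed sample size for this illustration
  nll <- function(inf.rate) {
    -suppressWarnings(sum((1-insp.result)*n*log(1-inf.rate) +
                            insp.result*log(1-(1-inf.rate)^n)))
  }
  coef(mle(nll, start = list(inf.rate = 0.05), method = "BFGS"))
}
b <- boot(insp.result, boot.stat, R = 999)
boot.ci(b, type = "perc")   # percentile bootstrap CI for gamma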
Incidentally, I have written a summary with R code for all three approaches two years ago: Construction of Confidence Intervals (see section 5). For the method utilizing the Hessian Matrix, e.g., the outline is as follows:
lnL <- function(theta1, theta2, ...) {
  # definition of the negative (!)
  # log-likelihood function...
}
# starting values for the optimization
theta0 <- c(start1, start2, ...)
# optimization
p <- optim(theta0, lnL, hessian=TRUE)
if (p$convergence == 0) {
  theta <- p$par
  covmat <- solve(p$hessian)
  sigma <- sqrt(diag(covmat))
}
The function mle from stats4 already wraps the covariance matrix estimation and returns it in vcov. In the practical use cases in which I have tried this (paired comparison models), though, this estimation was rather unstable, and I have resorted to the jackknife method instead.
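For reference, the vcov route mentioned above looks like this (a sketch, assuming insp.result and n from one of the question's scenarios are defined so that plant.inf.lik can be evaluated):
fit <- mle(plant.inf.lik, start = list(inf.rate = 0.05), method = "BFGS")
se  <- sqrt(diag(vcov(fit)))       # asymptotic SE from the inverted Hessian
coef(fit) + c(-1, 1) * 1.96 * se   # approximate 95% Wald confidence interval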

R Quantreg: Singularity with categorical survey data

For my Bachelor's thesis I am trying to apply a linear median regression model to constant-sum data from a survey (see the formula from A. Blass (2008)). It is an attempt to recreate the probability elicitation approach proposed by A. Blass et al (2008) - Using Elicited Choice Probabilities to Estimate Random Utility Models: Preferences for Electricity Reliability.
My dependent variable is the log-odds transformation of the constant sum allocations. Calculated using the following formula:
library(dplyr)
PE_raw <- PE_raw %>%
  group_by(sys_RespNum, Task) %>%
  mutate(LogProb = c(log(Response[1]/Response[1]),
                     log(Response[2]/Response[1]),
                     log(Response[3]/Response[1])))
My independent variables are delivery costs, minimum order quantity and delivery window, each a categorical variable with levels 0, 1, 2 and 3. Here, level 0 represents the none-option.
Data snapshot
I tried running the following quantile regression (using R's quantreg package):
LAD.factor <- rq(LogProb ~ factor(`Delivery costs`) + factor(`Minimum order quantity`) + factor(`Delivery window`) + factor(NoneOpt), data=PE_raw, tau=0.5)
However, I ran into the following error indicating singularity:
Error in rq.fit.br(x, y, tau = tau, ...) : Singular design matrix
I ran a linear regression and applied R's alias function for further investigation. This informed me of three cases of perfect multicollinearity:
minimum order quantity 3 = delivery costs 1 + delivery costs 2 + delivery costs 3 - minimum order quantity 1 - minimum order quantity 2
delivery window 3 = delivery costs 1 + delivery costs 2 + delivery costs 3 - delivery window 1 - delivery window 2
NoneOpt = intercept - delivery costs 1 - delivery costs 2 - delivery costs 3
In hindsight these cases all make sense. When R dichotomized the categorical variables, you get these results by construction, as delivery costs 1 + delivery costs 2 + delivery costs 3 = 1 and minimum order quantity 1 + minimum order quantity 2 + minimum order quantity 3 = 1. Rewriting gives the first formula.
It looks like a classic dummy trap. In an attempt to work around this issue I tried to manually dichotomize the data and used the following formula:
LM.factor <- rq(LogProb ~ Delivery.costs_1 + Delivery.costs_2 + Minimum.order.quantity_1 + Minimum.order.quantity_2 + Delivery.window_1 + Delivery.window_2 + factor(NoneOpt), data=PE_dichomitzed, tau=0.5)
Instead of an error message I now got the following:
Warning message:
In rq.fit.br(x, y, tau = tau, ...) : Solution may be nonunique
When using the summary function:
> summary(LM.factor)
Error in base::backsolve(r, x, k = k, upper.tri = upper.tri, transpose = transpose, :
singular matrix in 'backsolve'. First zero in diagonal [2]
In addition: Warning message:
In summary.rq(LM.factor) : 153 non-positive fis
Is anyone familiar with this issue? I am looking for alternative solutions. Perhaps I am making mistakes using the rq() function, or the data might be misrepresented.
I am grateful for any input, thank you in advance.
Reproducible example
library(quantreg)
#### Raw dataset (PE_raw_SO) ####
# quantile regression (produces singularity error)
LAD.factor <- rq(
LogProb ~ factor(`Delivery costs`) +
factor(`Minimum order quantity`) + factor(`Delivery window`) +
factor(NoneOpt),
data = PE_raw_SO,
tau = 0.5
)
# linear regression to check for singularity
LM.factor <- lm(
LogProb ~ factor(`Delivery costs`) +
factor(`Minimum order quantity`) + factor(`Delivery window`) +
factor(NoneOpt),
data = PE_raw_SO
)
alias(LM.factor)
# impose assumptions on standard errors
summary(LM.factor, se = "iid")
summary(LM.factor, se = "boot")
#### Manually created dummy variables to get rid of
#### collinearity (PE_dichotomized_SO) ####
LAD.di.factor <- rq(
LogProb ~ Delivery.costs_1 + Delivery.costs_2 +
Minimum.order.quantity_1 + Minimum.order.quantity_2 +
Delivery.window_1 + Delivery.window_2 + factor(NoneOpt),
data = PE_dichotomized_SO,
tau = 0.5
)
summary(LAD.di.factor) #backsolve error
# impose assumptions (unusual results)
summary(LAD.di.factor, se = "iid")
summary(LAD.di.factor, se = "boot")
# linear regression to check for singularity
LM.di.factor <- lm(
LogProb ~ Delivery.costs_1 + Delivery.costs_2 +
Minimum.order.quantity_1 + Minimum.order.quantity_2 +
Delivery.window_1 + Delivery.window_2 + factor(NoneOpt),
data = PE_dichotomized_SO
)
alias(LM.di.factor)
summary(LM.di.factor) #regular results, all significant
Link to sample data + code: GitHub
The "Solution may be nonunique" behaviour is not unusual when doing quantile regressions with dummy explanatory variables.
See, e.g., the quantreg FAQ:
The estimation of regression quantiles is a linear programming
problem. And the optimal solution may not be unique.
A more intuitive explanation for what is happening is given by Roger Koenker (the author of quantreg) on r-help back in 2006:
When computing the median from a sample with an even number of
distinct values there is inherently some ambiguity about its value:
any value between the middle order statistics is "a" median.
Similarly, in regression settings the optimization problem solved by
the "br" version of the simplex algorithm, modified to do general
quantile regression identifies cases where there may be non
uniqueness of this type. When there are "continuous" covariates this
is quite rare, when covariates are discrete then it is relatively
common, atleast when tau is chosen from the rationals. For univariate
quantiles R provides several methods of resolving this sort of
ambiguity by interpolation, "br" doesn't try to do this, instead
returning the first vertex solution that it comes to.
Your second warning -- "153 non-positive fis" -- relates to how the local densities are calculated by rq. Occasionally, the estimated local densities of the quantile regression function end up being negative (which is obviously impossible). If this happens, rq automatically sets them to zero. Again, quoting from the FAQ:
This is generally harmless, leading to a somewhat conservative
(larger) estimate of the standard errors, however if the reported
number of non-positive fis is large relative to the sample size then
it is an indication of misspecification of the model.
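If you want standard errors that do not depend on those local density estimates at all, one option (a sketch reusing the LAD.di.factor fit from the reproducible example; the question already calls this) is the bootstrap:
# bootstrap standard errors avoid the sparsity/density estimation behind the "non-positive fis" warning
summary(LAD.di.factor, se = "boot", R = 1000)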

value at risk estimation using fGarch package in R

I am trying to do an analysis similar to McNeil & Frey in their paper 'Estimation of tail-related risk measures for heteroscedastic financial time series: an extreme value approach', but I am stuck with a problem when implementing the models.
The approach is to fit an AR(1)-GARCH(1,1) model in order to estimate the one-day-ahead forecast of the VaR, using a rolling window of 1000 observations.
I have simulated data that should work fine with my model, and I assume that if I were doing this correctly, the observed coverage rate would be close to the theoretical one. However, it is always below the theoretical coverage rate, and I don't know why.
I believe that the estimated VaR is calculated as
VaR_hat = mu_hat + sigma_hat * qnorm(alpha)
, but I might be wrong. I have tried to find related questions here on Stack Overflow, but I have not found any.
How I approach this can be summarized in three steps.
Simulate 2000 AR(1)-GARCH(1,1) observations, fit a corresponding model on a rolling window of 1000 observations, and extract the one-day-ahead prediction of the conditional mean and standard deviation (thereby making 1000 predictions).
Use the predicted values and the normal quantile to calculate the VaR for the wanted confidence level.
Check if the coverage rate is close to the theoretical one.
If someone could help me I would be extremely thankful, and if I'm unclear in my formulation please just tell me and I'll try to come up with a better explanation of the problem.
The code I'm using is attached below.
Thank you in advance
library(fGarch)
nObs <- 2000 # Number of observations.
quantileLevel <- 0.95 # Since we expect 5% exceedances.
from <- seq(1,1000) # Lower index vector for observations in model.
to <- seq(1001,2000) # Upper index vector for observations in model.
VaR_vec <- rep(0,(nObs-1000)) # Empty vector for storage of 1000 VaR estimates.
# Specs for simulated data (including AR(1) component and all components for GARCH(1,1)).
spec = garchSpec(model = list(omega = 1e-6, alpha = 0.08, beta = 0.91, ar = 0.10),
                 cond.dist = 'norm')
# Simulate nObs = 2000 data points.
data_sim <- c(garchSim(spec, n = nObs, n.start = 1000))
for (i in 1:1000){
  # The rolling window of 1000 observations.
  data_insert <- data_sim[from[i]:to[i]]
  # Fitting an AR(1)-GARCH(1,1) model with normal cond.dist.
  fitted_model <- garchFit(~ arma(1,0) + garch(1,1), data_insert,
                           trace = FALSE,
                           cond.dist = "norm")
  # One day ahead forecast of conditional mean and standard deviation.
  prediction_model <- predict(fitted_model, n.ahead = 1)
  mu_pred <- prediction_model$meanForecast
  sigma_pred <- prediction_model$standardDeviation
  # Calculate VaR forecast
  VaR_vec[i] <- mu_pred + sigma_pred*qnorm(quantileLevel)
  if (length(to)-i != 0){
    print(c('Countdown, just',(length(to) - i),'iterations left'))
  } else {
    print(c('Done!'))
  }
}
# Extract only the estimates related to the forecasts.
compare_data_sim <- data_sim[1001:length(data_sim)]
hit <- rep(0,length(VaR_vec))
# Count the amount of exceedances.
for (i in 1:length(VaR_vec)){
hit[i] <- sum(VaR_vec[i] <= compare_data_sim[i])
}
plot(data_sim[1001:2000], type = 'l',
ylab = 'Simulated data', main = 'Illustration of one day ahead prediction of 95%-VaR')
lines(VaR_vec, col = 'red')
cover_prop <- sum(hit)/length(hit)
print(sprintf("Diff theoretical level and VaR coverage = %f", (1-quantileLevel) - cover_prop))

ar(1) simulation with non-zero mean

I can't seem to find the correct way to simulate an AR(1) time series with a mean that is not zero.
I need 53 data points, rho = .8, mean = 300.
However, arima.sim(list(order=c(1,0,0), ar=.8), n=53, mean=300, sd=21)
gives me values in the 1500s. For example:
1480.099 1480.518 1501.794 1509.464 1499.965 1489.545 1482.367 1505.103 (and so on)
I have also tried arima.sim(n=52, model=list(ar=c(.8)), start.innov=300, n.start=1)
but then it just counts down like this:
 [1] 238.81775870 190.19203239 151.91292491 122.09682547  96.27074057
 [6]  77.17105923  63.15148491  50.04211711  39.68465916  32.46837830
[11]  24.78357345  21.27437183  15.93486092  13.40199333  10.99762449
[16]   8.70208879   5.62264196   3.15086491   2.13809323   1.30009732
and I have tried arima.sim(list(order=c(1,0,0), ar=.8), n=53, sd=21) + 300, which seems to give a correct answer. For example:
280.6420 247.3219 292.4309 289.8923 261.5347 279.6198 290.6622 295.0501
264.4233 273.8532 261.9590 278.0217 300.6825 291.4469 291.5964 293.5710
285.0330 274.5732 285.2396 298.0211 319.9195 324.0424 342.2192 353.8149
and so on..
However, I am in doubt whether this is doing the correct thing. Is it still autocorrelated with the correct coefficient?
Your last option is okay to get the desired mean, "mu". It generates data from the model:
(y[t] - mu) = phi * (y[t-1] - mu) + epsilon[t],   epsilon[t] ~ N(0, sigma = 21),   t = 1, 2, ..., n.
Your first approach sets an intercept, "alpha", rather than a mean:
y[t] = alpha + phi * y[t-1] + epsilon[t].
With |phi| < 1, the implied mean of that process is alpha / (1 - phi) = 300 / (1 - 0.8) = 1500, which is why your simulated values are in the 1500s.
Your second option sets the starting value y[0] equal to 300. As long as |phi|<1 the influence of this initial value will vanish after a few periods and will have no effect
on the level of the series.
Edit
The value of the standard deviation that you observe in the simulated data is correct. Be aware that the variance of the AR(1) process, y[t], is not equal to the variance of the innovations, epsilon[t]. The variance of the AR(1) process, sigma^2_y, can be obtained as follows:
Var(y[t]) = Var(alpha) + phi^2 * Var(y[t-1]) + Var(epsilon[t])
As the process is stationary, Var(y[t]) = Var(y[t-1]), which we call sigma^2_y. Thus, we get:
sigma^2_y = 0 + phi^2 * sigma^2_y + sigma^2_epsilon
sigma^2_y = sigma^2_epsilon / (1 - phi^2)
For the values of the parameters that you are using you have:
sigma_y = sqrt(21^2 / (1 - 0.8^2)) = 35.
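A quick empirical check of these numbers (a sketch, using the question's parameters) is to simulate a long series and compare the sample mean and standard deviation with mu = 300 and sigma_y = 35:
set.seed(1)
y <- arima.sim(list(order = c(1, 0, 0), ar = 0.8), n = 1e5, sd = 21) + 300
mean(y)   # close to 300
sd(y)     # close to 35 = sqrt(21^2 / (1 - 0.8^2))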
Use the rGARMA function in the ts.extend package
You can generate random vectors from any stationary Gaussian ARMA model using the ts.extend package. This package generates random vectors directly from the multivariate normal distribution using the computed autocorrelation matrix for the random vector, so it gives random vectors from the exact distribution and does not require "burn-in" iterations. Here is an example of generating multiple independent time-series vectors, all from an AR(1) model.
#Load the package
library(ts.extend)
#Set parameters
MEAN <- 300
ERRORVAR <- 21^2
AR <- 0.8
m <- 53
#Generate n = 16 random vectors from this model
set.seed(1)
SERIES <- rGARMA(n = 16, m = m, mean = MEAN, ar = AR, errorvar = ERRORVAR)
#Plot the series using ggplot2 graphics
library(ggplot2)
plot(SERIES)
As you can see, the generated time-series vectors in this plot use the appropriate mean and error variance that were specified in the inputs.
