Problems with calculating the likelihood of an outcome based on cumulative probability function in R

I am new to R and looking to estimate the likelihood of having an outcome>=100 using a probability density function (the outcome in my example is the size of an outbreak). I believe I have the correct coding, but something doesn't feel right about the answer, when I look at the plot.
This is my code (it's based on the output of a stochastic model of an outbreak). I'd very much appreciate pointers. I think the error is in the likelihood calculation....
Thank you!
total_cases.dist <- dlnorm(sample.range, mean = total_cases.mean, sd = total_cases.sd)
total_cases.df <- data.frame("total_cases" = sample.range, "Density" = total_cases.dist)
library(ggplot2)
ggplot(total_cases.df, aes(x = total_cases, y = Density)) + geom_point()
pp <- function(x) {
print(paste0(round(x * 100, 3), "%"))
}
# likelihood of n_cases >= 100
pp(sum(total_cases.df$Density[total_cases.df$total_cases >= 100]))

You are using dlnorm, the density of the log-normal distribution, so its mean and sd arguments (meanlog and sdlog) are the mean and standard deviation of log(values), not of the values themselves. For example:
# we call the standard rlnorm
X = rlnorm(1000,0,1)
# mean(X) is close to exp(0.5), about 1.65, and sd(X) to sqrt((exp(1) - 1) * exp(1)), about 2.16
c(mean(X),sd(X))
# gives what we simulated
c(mean(log(X)),sd(log(X)))
We now simulate some data from a known Poisson distribution, where mean = variance, and model it with the log-normal:
set.seed(100)
X <- rpois(500,lambda=1310)
# we need to log values first
total_cases.mean <- mean(log(X))
total_cases.sd <- sd(log(X))
You can see that the fit works well:
sample.range <- 1200:1400
hist(X,br=50,freq=FALSE)
lines(sample.range,
dlnorm(sample.range,mean=total_cases.mean,sd=total_cases.sd),
col="navyblue")
For this example, you can get the probability of values > 1200 (see histogram) from the cumulative distribution function:
plnorm(1200,total_cases.mean,total_cases.sd,lower.tail=FALSE)
Now for your data, if it is true that mean = 1310.198 and total_cases.sd = 31615.26, that makes the variance (sd^2, about 1e9) roughly 760,000 times your mean! I am not sure that the log-normal distribution is appropriate for modeling this kind of data.
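As a rough sketch of the point above: the probability of an outcome >= 100 comes from the cumulative distribution function (plnorm), not from summing density values, and if only the mean and sd on the original scale are known they first have to be converted to the log scale. The conversion below is the standard moment-matching formula for the log-normal; the helper name lnorm_params and the input values are made up for illustration and are not taken from the question.

```r
# Hypothetical helper: convert a mean and sd on the original scale
# into the meanlog / sdlog parameters expected by dlnorm() and plnorm().
lnorm_params <- function(m, s) {
  sdlog <- sqrt(log(1 + (s / m)^2))
  meanlog <- log(m) - sdlog^2 / 2
  list(meanlog = meanlog, sdlog = sdlog)
}

# Illustrative values only (an assumption, not the question's model output)
p <- lnorm_params(m = 150, s = 80)

# P(outcome >= 100) from the cumulative distribution function
prob_cdf <- plnorm(100, meanlog = p$meanlog, sdlog = p$sdlog, lower.tail = FALSE)

# Summing densities only approximates this when the grid has unit spacing
# and covers essentially the whole upper tail
grid <- 100:100000
prob_sum <- sum(dlnorm(grid, meanlog = p$meanlog, sdlog = p$sdlog))

print(c(cdf = prob_cdf, density_sum = prob_sum))
```

With a coarser or shorter sample.range the density sum drifts away from the plnorm value, which may be why the summed result looked wrong against the plot.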

Related

How to compute some basic statistics in R with the density and support only?

I have no sample and I'd like to compute the variance, mean, median, and mode of a distribution for which I only have a vector with its density and a vector with its support. Is there an easy way to compute these statistics in R with this information?
Suppose that I only have the following information:
Support
Density
sum(Density) == 1 #TRUE
length(Support)==length(Density)# TRUE
You have to compute weighted sums.
For example, starting with Johann's example:
set.seed(312345)
x = rnorm(1000, mean=10, sd=1)
x_support = density(x)$x
x_density = density(x)$y
plot(x_support, x_density)
mean(x)
prints
[1] 10.00558
and what, I believe, you're looking for
m = weighted.mean(x_support, x_density)
computes mean as weighted mean of values, producing output
10.0055796130192
Base R only provides weighted.mean; the other weighted statistics are available from packages (for example wtd.var in Hmisc) or can be computed directly from their definitions.
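A minimal sketch of the remaining quantities, computed directly from the support and density vectors. It assumes the density values are used as weights and normalized to sum to 1, as in the question; the variable names are illustrative.

```r
set.seed(1)
x <- rnorm(1000, mean = 10, sd = 1)
d <- density(x)

support <- d$x
w <- d$y / sum(d$y)          # normalize so the weights sum to 1

# weighted mean and variance
m <- sum(support * w)
v <- sum(w * (support - m)^2)

# median: first support point where the cumulative weight reaches 0.5
med <- support[which(cumsum(w) >= 0.5)[1]]

# mode: support point with the highest density
mode_est <- support[which.max(w)]

print(c(mean = m, variance = v, median = med, mode = mode_est))
```

The variance comes out slightly larger than var(x) because the kernel density estimate smooths the data; with a density that is given exactly, that inflation does not occur.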
If you don't need a mathematical solution, and an empirical one is all right, you can achieve a pretty good approximation by sampling.
Let's generate some data:
set.seed(6854684)
x = rnorm(50,mean=10,sd=1)
x_support = density(x)$x
x_density = density(x)$y
# see our example:
plot(x_support, x_density )
# the real mean of x
mean(x)
Now to 'reverse' the process we generate a large sample from that density distribution:
x_sampled = sample(x = x_support, 1000000, replace = TRUE, prob = x_density)
# get the statistics
mean(x_sampled)
median(x_sampled)
var(x_sampled)
etc...

Fitting a confidence interval to dlmForecast in R

I've fit a Dynamic Linear Model to some data using dlmFilter in R [from the dlm package]. From that filter I have predicted 7 steps ahead using the dlmForecast function. The predicted outcome is very good, but I would like to add a 95% confidence interval and [after a lot of testing] have struggled to do so.
I've mocked up some similar code, below:
library(dlm)
data <- c(20.68502, 17.28549, 12.18363, 13.53479, 15.38779, 16.14770, 20.17536, 43.39321, 42.91027, 49.41402, 59.22262, 55.42043)
mod.build <- function(par) {
dlmModPoly(1, dV = exp(par[1]), dW = exp(par[2]))
}
# Returns most likely estimate of relevant values for parameters
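mle <- dlmMLE(data, rep(0,2), mod.build) # the mock-up passed an undefined a2; the series above is assumed
if(mle$convergence==0) print("converged") else print("did not converge")
# v and w were undefined in the mock-up; building the model from the MLE estimates is assumed here
mod1 <- mod.build(mle$par)
mod1Filt <- dlmFilter(data, mod1)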
fut1 <- dlmForecast(mod1Filt, n = 7)
The forecast outcome appears to be very good [although the model to some extent over-fits the data due to the small number of observations]. However, I would like to add a 95% confidence interval and have struggled to figure out how to do so.
Any advice would be appreciated.
Cheers
# Despite its name, hwidth holds the upper 95% limit: forecast mean + qnorm(0.975) * forecast sd
hwidth <- (outer(sapply(fut1$Q, FUN=function(x) sqrt(diag(x))), qnorm(0.025, lower = FALSE)) +as.vector(t(fut1$f)))
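A small sketch of how both limits could be assembled from the forecast means and variances. To keep it runnable without the dlm package, it uses a stand-in list with the same f and Q components that dlmForecast returns; the numbers in it are invented for illustration.

```r
# Stand-in for the object returned by dlmForecast(): 'f' holds the forecast
# means (one row per step) and 'Q' a list of forecast variance matrices.
# The values are made up, not output from the model above.
fut <- list(
  f = matrix(c(55.1, 55.1, 55.1), ncol = 1),
  Q = list(matrix(40), matrix(65), matrix(90))
)

z <- qnorm(0.975)                                 # about 1.96
fsd <- sapply(fut$Q, function(q) sqrt(diag(q)))   # forecast standard deviations
fmean <- as.vector(fut$f)

interval <- data.frame(
  step  = seq_along(fmean),
  lower = fmean - z * fsd,
  mean  = fmean,
  upper = fmean + z * fsd
)
print(interval)
```

With a real forecast, replacing fut by fut1 should give the same table for a univariate series; the limits widen with the horizon because the forecast variances in Q grow.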

Sample from a custom likelihood function

I have the following likelihood function which I used in a rather complex model (in practice on a log scale):
library(plyr)
dcustom=function(x,sd,L,R){
R. = (log(R) - log(x))/sd
L. = (log(L) - log(x))/sd
ll = pnorm(R.) - pnorm(L.)
return(ll)
}
df=data.frame(Range=seq(100,500),sd=rep(0.1,401),L=200,U=400)
df=mutate(df, Likelihood = dcustom(Range, sd,L,U))
with(df,plot(Range,Likelihood,type='l'))
abline(v=200)
abline(v=400)
In this function, the sd is predetermined and L and R are "observations" (very much like the endpoints of a uniform distribution), so all 3 of them are given. The above function provides a large likelihood (1) if the model estimate x (derived parameter) is in between the L-R range, a smooth likelihood decrease (between 0 and 1) near the bounds (of which the sharpness is dependent on the sd), and 0 if it is too much outside.
This function works very well to obtain estimates of x, but now I would like to do the inverse: draw a random x from the above function. If I would do this many times, I would generate a histogram that follows the shape of the curve plotted above.
The ultimate goal is to do this in C++, but I think it would be easier for me if I could first figure out how to do this in R.
There's some useful information online that helps me start (http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution, https://stats.stackexchange.com/questions/88697/sample-from-a-custom-continuous-distribution-in-r) but I'm still not entirely sure how to do it and how to code it.
I presume (not sure at all!) the steps are:
transform likelihood function into probability distribution
calculate the cumulative distribution function
inverse transform sampling
Is this correct and if so, how do I code this? Thank you.
One idea might be to use the Metropolis-Hastings algorithm to obtain a sample from the distribution given all the other parameters and your likelihood.
# Metropolis-Hastings algorithm
set.seed(2018)
n_sample <- 100000
posterior_sample <- rep(NA, n_sample)
x <- 300 # starting value: I chose 300 based on your likelihood plot
for (i in 1:n_sample){
lik <- dcustom(x = x, sd = 0.1, L = 200, R =400)
# propose a value for x (you can adjust the stepsize with the sd)
x.proposed <- x + rnorm(1, 0, sd = 20)
lik.proposed <- dcustom(x = x.proposed, sd = 0.1, L = 200, R = 400)
r <- lik.proposed/lik # this is the acceptance ratio
# accept the proposed value with probability min(1, r)
if (runif(1) < r) {
x <- x.proposed
}
# record the current state on every iteration; a rejected proposal repeats the old value
posterior_sample[i] <- x
}
# plotting the density
approximate_distr <- posterior_sample
d <- density(approximate_distr)
plot(d, main = "Sample from distribution")
abline(v=200)
abline(v=400)
# If you now want to sample just a few values (for example, 5) you could use
sample(approximate_distr,5)
# example output (exact values depend on the chain): 281.7310 371.2317 378.0504 342.5199 412.3302
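The three steps listed in the question (normalize, build the cumulative distribution, invert it) can also be sketched directly on a grid. This is only an approximation under stated assumptions: the grid limits (50 to 800) and spacing (0.1) are choices made here, and the likelihood is treated as negligible outside that range.

```r
dcustom <- function(x, sd, L, R) {
  pnorm((log(R) - log(x)) / sd) - pnorm((log(L) - log(x)) / sd)
}

# Assumed grid: wide enough that the likelihood is close to 0 at both ends
grid <- seq(50, 800, by = 0.1)
lik <- dcustom(grid, sd = 0.1, L = 200, R = 400)

pdf <- lik / sum(lik)   # step 1: normalize to a discrete probability distribution
cdf <- cumsum(pdf)      # step 2: cumulative distribution function

# step 3: inverse transform sampling, i.e. the first grid point whose CDF exceeds u
set.seed(42)
u <- runif(10000)
idx <- pmin(findInterval(u, cdf) + 1, length(grid))
draws <- grid[idx]

hist(draws, breaks = 100, main = "Inverse transform sample")
abline(v = c(200, 400))
```

The same logic ports to C++ fairly directly: a prefix sum over the gridded likelihood followed by a binary search for each uniform draw.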

value at risk estimation using fGarch package in R

I am trying to make a similar analysis to McNeil & Frey in their paper 'Estimation of tail-related risk measures for heteroscedastic financial time series: an extreme value approach' but I am stuck with a problem when implementing the models.
The approach is to fit an AR(1)-GARCH(1,1) model in order to estimate the one-day-ahead forecast of the VaR using a window of 1000 observations.
I have simulated data that should work fine with my model, and I assume that if I were doing this correctly, the observed coverage rate should be close to the theoretical one. However, it is always below the theoretical coverage rate, and I don't know why.
I believe that this is how the calculation of the estimated VaR is done
VaR_hat = mu_hat + sigma_hat * qnorm(alpha)
but I might be wrong. I have tried to find related questions here on Stack but I have not found any.
How I approach this can be summarized in three steps.
Simulate 2000 AR(1)-GARCH(1,1) observations and fit a corresponding model and extract the one day prediction of the conditional mean and standard deviation using a window of 1000 observations.(Thereby making 1000 predictions)
Use the predicted values and the normal quantile to calculate the VaR for the wanted confidence level.
Check if the coverage rate is close to the theoretical one.
If someone could help me I would be extremely thankful, and if I'm unclear in my formulation please just tell me and I'll try to come up with a better explanation of the problem.
The code I'm using is attached below.
Thank you in advance
library(fGarch)
nObs <- 2000 # Number of observations.
quantileLevel <- 0.95 # Since we expect 5% exceedances.
from <- seq(1,1000) # Lower index vector for observations in model.
to <- seq(1001,2000) # Upper index vector for observations in model.
VaR_vec <- rep(0,(nObs-1000)) # Empty vector for storage of 1000 VaR estimates.
# Specs for simulated data (including AR(1) component and all components for GARCH(1,1)).
spec = garchSpec(model = list(omega = 1e-6, alpha = 0.08, beta = 0.91, ar = 0.10),
cond.dist = 'norm')
# Simulate nObs = 2000 data points.
data_sim <- c(garchSim(spec, n = nObs, n.start = 1000))
for (i in 1:1000){
# The rolling window of 1000 observations.
data_insert <- data_sim[from[i]:to[i]]
# Fitting an AR(1)-GARCH(1,1) model with normal cond.dist.
fitted_model <- garchFit(~ arma(1,0) + garch(1,1), data_insert,
trace = FALSE,
cond.dist = "norm")
# One day ahead forecast of conditional mean and standard deviation.
prediction_model <- predict(fitted_model, n.ahead = 1)
mu_pred <- prediction_model$meanForecast
sigma_pred <- prediction_model$standardDeviation
# Calculate VaR forecast
VaR_vec[i] <- mu_pred + sigma_pred*qnorm(quantileLevel)
if (length(to)-i != 0){
print(c('Countdown, just',(length(to) - i),'iterations left'))
} else {
print(c('Done!'))
}
}
# Extract only the observations related to the forecasts.
compare_data_sim <- data_sim[1001:length(data_sim)]
hit <- rep(0,length(VaR_vec))
# Count the number of exceedances.
for (i in 1:length(VaR_vec)){
hit[i] <- sum(VaR_vec[i] <= compare_data_sim[i])
}
plot(data_sim[1001:2000], type = 'l',
ylab = 'Simulated data', main = 'Illustration of one day ahead prediction of 95%-VaR')
lines(VaR_vec, col = 'red')
cover_prop <- sum(hit)/length(hit)
print(sprintf("Diff theoretical level and VaR coverage = %f", (1-quantileLevel) - cover_prop))
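One way to sanity-check the VaR formula and the hit counting separately from the model fitting is a small base-R simulation in which the conditional mean and standard deviation are known exactly. This is only a sketch: the time-varying sigma below is an invented stand-in for GARCH forecasts, not output from fGarch.

```r
set.seed(1)
n <- 5000
quantileLevel <- 0.95

# Assumed "perfect forecasts": the true conditional mean and sd of each observation
mu <- rep(0, n)
sigma <- 0.01 * exp(rnorm(n, sd = 0.3))

# Each observation is drawn using the mean and sd forecast for that same day
y <- rnorm(n, mean = mu, sd = sigma)

# Upper-tail VaR at the 95% level, as in the formula above
VaR <- mu + sigma * qnorm(quantileLevel)

# An exceedance occurs when the observation lies above its own VaR forecast
hit <- y > VaR
cover_prop <- mean(hit)
print(c(expected = 1 - quantileLevel, observed = cover_prop))
```

Here the observed rate lands close to 5%, which suggests the formula itself is fine. In the rolling-window code it may be worth confirming that VaR_vec[i] is compared with the first observation after the end of window i: data_sim[from[i]:to[i]] already contains observation 1000 + i, which is also the value compare_data_sim[i] refers to.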

Central Limit Theorem in R

I wish to simulate the central limit theorem in order to demonstrate it, and I am not sure how to do it in R. I want to create 10,000 samples with a sample size of n (can be numeric or a parameter), from a distribution I will choose (uniform, exponential, etc...). Then I want to graph in one plot (using the par and mfrow commands) the original distribution (histogram), the distribution of the means of all samples, a Q-Q plot of the means, and in the 4th graph (there are four, 2X2), I am not sure what to plot. Can you please assist me in starting to program it in R ? I think once I have the simulated data I should be fine. Thank you.
My initial attempt is below; it is too simple and I am not even sure it is correct.
r = 10000;
n = 20;
M = matrix(0,n,r);
Xbar = rep(0,r);
for (i in 1:r)
{
M[,i] = runif(n,0,1);
}
for (i in 1:r)
{
Xbar[i] = mean(M[,i]);
}
hist(Xbar);
The CLT states that given i.i.d. samples from a distribution with finite mean and variance, the sample mean (as a random variable), suitably centered and scaled, has a distribution that converges to a Gaussian as the number of samples n increases. Here, I will assume that you want to generate r sample sets containing n samples each to create r samples of the sample mean. Some code to do that is as follows:
set.seed(123) ## set the seed for reproducibility
r <- 10000
n <- 200 ## I use 200 instead of 20 to enhance convergence to Gaussian
## this function computes the r samples of the sample mean from the
## r*n original samples
sample.means <- function(samps, r, n) {
rowMeans(matrix(samps,nrow=r,ncol=n))
}
For generating the plots, we use ggplot2 and Aaron's qqplot.data function. We also use gridExtra to plot multiple plots in one frame.
library(ggplot2)
library(gridExtra)
qqplot.data <- function (vec) {
# following four lines from base R's qqline()
y <- quantile(vec[!is.na(vec)], c(0.25, 0.75))
x <- qnorm(c(0.25, 0.75))
slope <- diff(y)/diff(x)
int <- y[1L] - slope * x[1L]
d <- data.frame(resids = vec)
ggplot(d, aes(sample = resids)) + stat_qq() + geom_abline(slope = slope, intercept = int, colour="red") + ggtitle("Q-Q plot")
}
generate.plots <- function(samps, samp.means) {
p1 <- qplot(samps, geom="histogram", bins=30, main="Sample Histogram")
p2 <- qplot(samp.means, geom="histogram", bins=30, main="Sample Mean Histogram")
p3 <- qqplot.data(samp.means)
grid.arrange(p1,p2,p3,ncol=2)
}
Then we can use these functions with the uniform distribution:
samps <- runif(r*n) ## uniform distribution [0,1]
# compute sample means
samp.means <- sample.means(samps, r, n)
# generate plots
generate.plots(samps, samp.means)
Or, with the Poisson distribution with mean = 3:
samps <- rpois(r*n,lambda=3)
# compute sample means
samp.means <- sample.means(samps, r, n)
# generate plots
generate.plots(samps, samp.means)
Or, with the exponential distribution with mean = 1/rate = 1:
samps <- rexp(r*n,rate=1)
# compute sample means
samp.means <- sample.means(samps, r, n)
# generate plots
generate.plots(samps, samp.means)
Note that the sample-mean histograms all look Gaussian, with a mean very close to the mean of the original generating distribution, whether this is uniform, Poisson, or exponential, as predicted by the CLT (the variance of the sample mean is also 1/n = 1/200 of the variance of the original generating distribution).
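The mean and variance claims can be checked numerically with a few lines of base R. This sketch uses the uniform case with the same r and n as above; the name var_ratio is introduced here for illustration.

```r
set.seed(123)
r <- 10000
n <- 200

samps <- runif(r * n)                                      # uniform on [0, 1]
samp.means <- rowMeans(matrix(samps, nrow = r, ncol = n))

# Ratio of the observed variance of the sample means to var(samps) / n;
# it should be close to 1 if the 1/n scaling holds
var_ratio <- var(samp.means) / (var(samps) / n)

print(c(mean_of_means = mean(samp.means), var_ratio = var_ratio))
```

The mean of the sample means should sit near 0.5, the mean of the uniform distribution, and the ratio near 1.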
Maybe this can help you get started. I have hard-coded the normal distribution and only shown two of your suggested plots: a histogram of a randomly selected sample, and a histogram of all sample means.
I guess my main suggestion is using a list to store the samples instead of a matrix.
r <- 10000
my.n <- 20
simulation <- list()
for (i in 1:r) {
simulation[[i]] <- rnorm(my.n)
}
sample.means <- sapply(simulation, mean)
selected.sample <- sample(r, 1) # pick one sample index at random
if (dev.cur() > 1) dev.off() # close an open graphics device, if any
par(mfrow = c(1, 2))
hist(simulation[[selected.sample]])
hist(sample.means)
