Calculate expected value and variance using Monte Carlo simulation - r

So I have this probability distribution for the repair cost of a single breakdown:
X = 0          with probability 7/8
X ~ Exp(1/60)  with probability 1/8
James's car breaks down N times a year, where N ~ Pois(2); X is the repair cost of a single breakdown and Y is the total cost James causes in a year.
I want to calculate E[Y] and V(Y), which should come out to E[Y] = 15 and V(Y) = 1800.
I have this Monte Carlo simulation:
expon_dis <- rexp(200, 1/60)
result_matrix2 <- rep(0, 200)
expected_matrix <- rep(0, runs)

for (u in 1:runs) {
  expon_dis <- rexp(200, 1/60)
  N <- rpois(200, 2)
  for (l in 1:200) {
    result_matrix2[l] <- (expon_dis[l] * (1/8)) * (N[l])
  }
  expected_matrix[u] <- mean(result_matrix2)
}
This code gives the expected value of 15 but the variance is not correct. So what is wrong with this simulation?

I didn't have time to read through all of your code, but I think the error comes from the multiplication: scaling a single exponential draw by 1/8 and by N reproduces the mean, but not the distribution of the yearly total, so the variance comes out wrong. You need to simulate each of the N breakdowns separately (each costs 0 with probability 7/8) and sum them.
Below is a very rough implementation. First, write a function that simulates the total cost given x breakdowns:
sim_cost <- function(x) {
  cost <- rexp(x, 1/60)                                       # candidate repair cost for each breakdown
  prob <- sample(c(0, 1/60), x, prob = c(7/8, 1/8), replace = TRUE)
  sum(cost[prob > 0])                                         # only 1/8 of breakdowns actually cost anything
}
Then generate the number of breakdowns per year:
set.seed(111)
N <- rpois(500000, 2)
Then iterate over the years; if a year has 0 breakdowns, we return 0:
set.seed(111)
sim <- sapply(N, function(i) if (i == 0) 0 else sum(sim_cost(i)))
mean(sim)
[1] 14.98248
var(sim)
[1] 1797.549
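These match the analytic values (my own check, not part of the original answer): E[X] = (1/8)*60 = 7.5 and E[X^2] = (1/8)*2*60^2 = 900, so the compound-Poisson formulas give E[Y] = lambda*E[X] = 2*7.5 = 15 and V(Y) = lambda*E[X^2] = 2*900 = 1800.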
You need quite a large number of simulations, but the above should give you code that you can start optimizing to get closer to the exact values.
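If speed becomes an issue, here is a fully vectorized sketch of the same idea (my own rewrite, not the original answer's code): draw all breakdowns at once, give each one an Exp(1/60) cost with probability 1/8 and zero otherwise, then sum the costs by year.
set.seed(111)
n_years <- 500000
N <- rpois(n_years, 2)                              # breakdowns per year
total <- sum(N)
cost <- rexp(total, 1/60) * rbinom(total, 1, 1/8)   # 0 with prob 7/8, Exp(1/60) otherwise
year <- rep(seq_len(n_years), N)                    # year index of each breakdown
Y <- numeric(n_years)                               # years with 0 breakdowns stay at 0
agg <- tapply(cost, year, sum)
Y[as.integer(names(agg))] <- agg
mean(Y)   # close to 15
var(Y)    # close to 1800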

Related

Realistic age structured model using ODE from the deSolve package

I am trying to simulate a realistic age-structured model where all individuals shift into the following age group at the end of the time step (rather than ageing continuously at a given rate), using ODEs from the deSolve package.
Consider for example a model with two states, Susceptible (S) and Infectious (I), each state being divided into 4 age groups (S1, S2, S3, S4 and I1, I2, I3, I4): all individuals in S1 should go into S2 at the end of the time step, those in S2 should go into S3, and so on.
I tried to do this in two steps: first solving the ODE, then shifting individuals into the following age group at the end of the time step, but without success.
Below is one of my attempts:
library(deSolve)

times <- seq(from = 0, to = 100, by = 1)
n_agecat <- 4

# Initial number of individuals in each state
S_0 <- c(999, rep(0, n_agecat - 1))
I_0 <- c(1, rep(0, n_agecat - 1))
si_initial_state_values <- c(S = S_0,
                             I = I_0)

# Parameter values
si_parameters <- c(beta = 0.01) # contact rate assuming random mixing

si_model <- function(time, state, parameters) {
  with(as.list(c(state, parameters)), {
    n_agegroups <- 4
    S <- state[1:n_agegroups]
    I <- state[(n_agegroups + 1):(2 * n_agegroups)]
    # Total population
    N <- S + I
    # Force of infection
    lambda <- beta * I / N
    # Solving the differential equations
    dS <- -lambda * S
    dI <- lambda * S
    # Trying to shift all individuals into the following age group
    S <- c(0, S[-n_agecat])
    I <- c(0, I[-n_agecat])
    return(list(c(dS, dI)))
  })
}

output <- as.data.frame(ode(y = si_initial_state_values,
                            times = times,
                            func = si_model,
                            parms = si_parameters))
Any guidance will be much appreciated, thank you in advance!
I had a look at your model. Implementing the shift in an event function works in principle, but the main model still has several problems:
Die-out: if the age groups are shifted each time step and the first element is simply filled with zero, everything is pushed out of the last group within 4 time steps and the population dies out.
Infection: in your version the infected can only infect their own age group, so you need to sum over the age groups before calculating lambda.
Finally, what is an "age" group here? Do you actually mean time since infection?
To sum up, there are several options. I would personally prefer a discrete model for such a simulation, i.e. difference equations, an age-structured matrix model or an individual-based model (see the sketch after the ODE example below).
If you want to keep it as an ODE, I recommend leaving the susceptibles together as one state and implementing only the infected as stage-structured.
Here is a quick example, please check:
library(deSolve)

times <- seq(from = 0, to = 100, by = 1)
n_agegroups <- 14
n_agecat <- 14

# Initial number of individuals in each state
S_0 <- c(999)                     # only one state
I_0 <- c(1, rep(0, n_agecat - 1)) # several stages
si_initial_state_values <- c(S = S_0,
                             I = I_0)

# Parameter values
si_parameters <- c(beta = 0.1)    # set contact parameter to a higher value

si_model <- function(time, state, parameters) {
  with(as.list(c(state, parameters)), {
    S <- state[1]
    I <- state[2:(n_agegroups + 1)]
    # Total population
    N <- S + sum(I)
    # Force of infection
    #lambda <- beta * I/N          # old
    lambda <- beta * sum(I) / N    # NEW
    # Solving the differential equations
    dS <- -lambda * S
    dI <- lambda * S
    list(c(dS, c(dI, rep(0, n_agegroups - 1))))
  })
}

shift <- function(t, state, p) {
  S <- state[1]
  I <- state[2:(n_agegroups + 1)]
  I <- c(0, I[-n_agecat])
  c(S, I)
}

# output time steps (note: ode uses automatic simulation steps!)
times <- 1:200
# time steps of the events (i.e. shifting), not necessarily the same as times
evt_times <- 1:200

output <- ode(y = si_initial_state_values,
              times = times,
              func = si_model,
              parms = si_parameters,
              events = list(func = shift, time = evt_times))

## default plot function
plot(output, ask = FALSE)

## plot totals
S <- output[, 2]
I <- rowSums(output[, -(1:2)])
par(mfrow = c(1, 2))
plot(times, S, type = "l", ylim = c(0, max(S)))
lines(times, I, col = "red", lwd = 1)

## plot stage groups
matplot(times, output[, -(1:2)], col = rainbow(n = 14), lty = 1, type = "l", ylab = "S")
Note: This is just a technical demonstration, not a valid stage structured SIR model!
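As a comparison for the difference-equation option mentioned above, here is a minimal discrete-time sketch (my own addition, not part of the original answer) with the same single-S / staged-I structure as the ODE example; the age shift is just a vector rotation at the end of each step.
n_stages <- 14
beta <- 0.1
S <- 999
I <- c(1, rep(0, n_stages - 1))
out <- matrix(NA, nrow = 200, ncol = 1 + n_stages)
for (t in 1:200) {
  lambda <- beta * sum(I) / (S + sum(I))   # force of infection summed over all stages
  new_inf <- lambda * S
  S <- S - new_inf
  I <- c(new_inf, I[-n_stages])            # new infections enter stage 1; everyone else
                                           # advances one stage, leaving after the last one
  out[t, ] <- c(S, I)
}
matplot(out, type = "l", lty = 1, xlab = "time step", ylab = "individuals")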

How to draw Poisson density curve in R?

I need to show that the number of events in a Poisson process follows a Poisson distribution with parameter lambda * t.
Here is the Poisson process generator:
ppGen <- function(lambda, maxTime) {
  taos <- taosGen(lambda, maxTime)
  pp <- NULL
  for (i in 1:maxTime) {
    pp[i] <- sum(taos <= i)
  }
  return(pp)
}
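(Note: taosGen is not shown in the question. A hypothetical stand-in, assuming it returns the arrival times of a Poisson process with rate lambda up to maxTime, might be:)
taosGen <- function(lambda, maxTime) {   # hypothetical stand-in, not the OP's code
  arrivals <- numeric(0)
  t <- rexp(1, lambda)                   # exponential inter-arrival gaps
  while (t <= maxTime) {
    arrivals <- c(arrivals, t)
    t <- t + rexp(1, lambda)
  }
  arrivals
}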
Here I replicate the process 1000 times and collect the total number of occurrences in each realisation:
d <- ppGen(0.5,100)
tail(d,n=1)
reps <- 1000
x1 <- replicate(reps, tail(ppGen(0.5,100), n=1))
hist(x1)
Here is the histogram:
Here I am trying to draw a theoretical Poisson density curve with parameter lambda * t:
xfit <- seq(1, 100, length = 100)
yfit <- dpois(xfit, lambda = 0.5 * 100)
lines(xfit, yfit)
But the curve doesn't appear anywhere near the histogram. Can anyone suggest the right way to do this?
Maybe you can try curve, like below:
x <- rpois(1000, 0.5 * 100)
dp <- function(x, lbd = 0.5 * 100) dpois(x, lambda = lbd)
curve(dp, 0, 100)
hist(x, freq = FALSE, add = TRUE)
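The mismatch in the question comes from the scales: hist(x1) plots counts while dpois() returns probabilities, so the theoretical curve hugs zero. Here is a small sketch of the same fix the other way around (my own addition): draw the histogram on the density scale with unit-width bins and overlay the pmf at integer values.
set.seed(1)
x1 <- rpois(1000, 0.5 * 100)   # stand-in for the OP's simulated end counts
hist(x1, freq = FALSE, breaks = seq(min(x1) - 0.5, max(x1) + 0.5, by = 1))
k <- min(x1):max(x1)
lines(k, dpois(k, lambda = 0.5 * 100), col = "red", lwd = 2)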

Rejection Sampling to generate Normal samples from Cauchy samples

I tried my luck at coding a rejection sampling method to generate a sample that follows a normal distribution. The samples look like normal distributions at first glance, but the p-value of the Shapiro-Wilk test is always < 0.05. I don't really know where I went wrong, and I only got the pseudo-code from my teacher (it's NOT homework). Any help is appreciated. Below is my code:
f <- function(x, m, v) { # target distribution, m = mean, v = variance
  dnorm(x, m, sqrt(v))
}

g <- function(x, x0, lambda) { # Cauchy distribution for sampling
  dcauchy(x, x0, lambda)
}

genSamp <- function(n, m, v) { # the user can choose the mean, the variance and the size of the sample
  stProbe <- rep(0, n)         # the sample vector
  # wanted to make sure that everything is covered, so I took a range that depends on the mean
  interval <- c(m - 10 * sqrt(v), m + 10 * sqrt(v))
  # rescaling coefficient, so the Cauchy density is never under the normal density
  M <- max(f(interval, m, v) / g(interval, m, v))
  # I chose x0 = m and lambda = v, so the Cauchy distribution is close to the target normal distribution
  for (i in 1:n) {
    repeat {
      x <- rcauchy(1, m, v)
      u <- runif(1, 0, max(f(interval, m, v)))
      if (u < (f(x, m, v) / (M * g(x, m, v)))) {
        break
      }
    }
    stProbe[i] <- x
  }
  return(stProbe)
}
Then I tried it out with:
test <- genSamp(100, 2, 0.5)
hist(test, prob = TRUE, breaks = 30) # looked not bad
shapiro.test(test)                   # p-value way below 0.05
Thank you in advance for your help.
Actually, the first thing I checked was the sample mean and sample variance. When I draw 1000 samples with your genSamp, I get a sample mean of about 2, but a sample variance of about 2.64, far from the target 0.5.
The 1st problem is with your computation of M. Note that:
interval = c(m - 10 * sqrt(v), m + 10 * sqrt(v))
only gives you 2 values, rather than a grid of equally spaced points on the interval. At 10 standard deviations away from the mean, the normal density is almost 0, so M is almost 0. You need to do something like
interval <- seq(m - 10 * sqrt(v), m + 10 * sqrt(v), by = 0.01)
The 2nd problem is the generation of the uniform random variable inside your repeat loop. Why do you do
u <- runif(1,0,max(f(interval,m,v)))
You want
u <- runif(1, 0, 1)
With these fixes, I have tested that genSamp gets the correct sample mean and sample variance. The samples pass both the Shapiro-Wilk test and the Kolmogorov-Smirnov test (?ks.test).
Full working code
f <- function(x, m, v) dnorm(x, m, sqrt(v))
g <- function(x, x0, lambda) dcauchy(x, x0, lambda)

genSamp <- function(n, m, v) {
  stProbe <- rep(0, n)
  interval <- seq(m - 10 * sqrt(v), m + 10 * sqrt(v), by = 0.01)
  M <- max(f(interval, m, v) / g(interval, m, v))
  for (i in 1:n) {
    repeat {
      x <- rcauchy(1, m, v)
      u <- runif(1, 0, 1)
      if (u < (f(x, m, v) / (M * g(x, m, v)))) break
    }
    stProbe[i] <- x
  }
  return(stProbe)
}
set.seed(0)
test <- genSamp(1000, 2, 0.5)
shapiro.test(test)$p.value
#[1] 0.1563038
ks.test(test, rnorm(1000, 2, sqrt(0.5)))$p.value
#[1] 0.7590978
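For completeness (my own addition), a quick check of the sample moments against the targets m = 2 and v = 0.5:
set.seed(1)
s <- genSamp(10000, 2, 0.5)
mean(s)   # close to 2
var(s)    # close to 0.5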
You have
f <- function(x, m, v) { # target distribution, m = mean, v = variance
  dnorm(x, e, sqrt(v))
}
which evaluates the target density with mean e, but e is never defined.

One sample hypothesis test for proportions

I'm looking for a built-in R function that calculates the power of a one-sample hypothesis test for proportions.
The built-in function power.prop.test only does TWO-SAMPLE hypothesis tests for proportions.
The original question is: "How many times do you have to toss a coin to determine that it is biased?"
p.null <- 0.5 # null hypothesis
We say that a coin is "biased" if the probability of tossing heads is either greater than 0.51 or less than 0.49. Otherwise we say that it is "good enough".
delta <- 0.01
Here is a function to toss a biased coin N times and return the proportion of heads:
biased.coin <- function(delta, N) {
  probs <- runif(N, 0, 1)
  heads <- probs[probs < 0.5 + delta]
  return(length(heads) / N)
}
We fix alpha and beta throughout at the standard values. Our goal is to calculate N.
alpha = 0.05 # 95% confidence interval
beta = 0.8 # Correctly reject the null hypothesis 80% of the time.
The first step is to use a simulation.
A single experiment is to toss the coin N times and reject the null hypothesis if the number of heads deviates "too far" from the expected value of N/2
We then repeat the experiment M times and count how many times the null hypothesis is (correctly) rejected.
M <- 1000
simulate.power <- function(delta, N, p.null, M, alpha) {
  print(paste("Calculating power for N =", N))
  reject <- c()
  se <- sqrt(p.null * (1 - p.null)) / sqrt(N)
  for (i in (1:M)) {
    heads <- biased.coin(delta, N)   # perform an experiment
    z <- (heads - p.null) / se       # z-score
    p.value <- pnorm(-abs(z))        # p-value
    reject[i] <- p.value < alpha/2   # Do we reject the null?
  }
  return(sum(reject) / M)            # proportion of the time the null was rejected
}
Next we plot a graph (slow, about 5 minutes):
ns <- seq(1000, 50000, by = 1000)
my.pwr <- c()
for (i in (1:length(ns))) {
  my.pwr[i] <- simulate.power(delta, ns[i], p.null, M, alpha)
}
plot(ns, my.pwr)
From the graph it looks like the N you need for a power of beta = 0.8 is about 20000.
The simulation is very slow, so it would be nice to have a built-in function.
A little fiddling around gave me this:
magic <- function(p.null, delta, alpha, N) {
  magic <- power.prop.test(p1 = p.null,
                           p2 = p.null + delta,
                           sig.level = alpha,
                           ###################################
                           n = 2 * N,   # mysterious 2
                           ###################################
                           alternative = "two.sided",
                           strict = FALSE)
  return(magic[["power"]])
}
Let's plot it against our simulated data.
pwr.magic <- c()
for (i in (1:length(ns))) {
  pwr.magic[i] <- magic(p.null, delta, alpha, ns[i])
}
points(ns, pwr.magic, pch = 20)
The fit is good, but I have no idea why I need to multiply N by two in order to get a one-sample power out of a two-sample proportion test.
It would be nice if there were a built-in function that lets you do one sample directly.
Thanks!
You could try
library(pwr)
h <- ES.h(0.51, 0.5) # Compute effect size h for two proportions
pwr.p.test(h = h, n = NULL, sig.level = 0.05, power = 0.8, alternative = "two.sided")
# proportion power calculation for binomial distribution (arcsine transformation)
# h = 0.02000133
# n = 19619.53
# sig.level = 0.05
# power = 0.8
# alternative = two.sided
As an aside, one way to speed up your simulation significantly would be to use rbinom instead of runif:
biased.coin2 <- function(delta, N) {
rbinom(1, N, 0.5 + delta) / N
}
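For what it's worth (my own addition, not from the original answer), plugging the pwr.p.test sample size back into a quick rbinom-based simulation of the same two-sided z test lands near the requested power of 0.8:
set.seed(1)
N <- 19620; M <- 2000
p.null <- 0.5; delta <- 0.01; alpha <- 0.05
se <- sqrt(p.null * (1 - p.null) / N)
heads <- rbinom(M, N, p.null + delta) / N
z <- (heads - p.null) / se
mean(2 * pnorm(-abs(z)) < alpha)   # proportion of experiments rejecting H0, roughly 0.8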

How does ar.yw estimate the variance

In R, how does the function ar.yw estimate the variance? Specifically, where does the number "var.pred" come from? It does not seem to come from the usual Yule-Walker estimate of the variance, nor from the sum of squared residuals divided by the df (and even though there is disagreement about what the df should be, none of the choices gives an answer equivalent to var.pred). And yes, I know that there are better methods than Yule-Walker; I am just trying to figure out what R is doing.
set.seed(82346)
temp <- arima.sim(n=10, list(ar = 0.5), sd=1)
fit <- ar(temp, method = "yule-walker", demean = FALSE, aic=FALSE, order.max=1)
## R's estimate of the sigma squared
fit$var.pred
## YW estimate
sum(temp^2)/10 - fit$ar*sum(temp[2:10]*temp[1:9])/10
## YW if there was a mean
sum((temp-mean(temp))^2)/10 - fit$ar*sum((temp[2:10]-mean(temp))*(temp[1:9]-mean(temp)))/10
## estimate based on residuals, different possible df.
sum(na.omit(fit$resid^2))/10
sum(na.omit(fit$resid^2))/9
sum(na.omit(fit$resid^2))/8
sum(na.omit(fit$resid^2))/7
Need to read the code if it's not documented.
?ar.yw
Which says: "In ar.yw the variance matrix of the innovations is computed from the fitted coefficients and the autocovariance of x." If that is not enough explanation, then you need to look at the code:
methods(ar.yw)
#[1] ar.yw.default* ar.yw.mts*
#see '?methods' for accessing help and source code
getAnywhere(ar.yw.default)
# there are two cases that I see
x <- as.matrix(x)
nser <- ncol(x)
if (nser > 1L) # .... not your situation
#....
else {
    r <- as.double(drop(xacf))
    z <- .Fortran(C_eureka, as.integer(order.max), r, r,
        coefs = double(order.max^2), vars = double(order.max),
        double(order.max))
    coefs <- matrix(z$coefs, order.max, order.max)
    partialacf <- array(diag(coefs), dim = c(order.max, 1L, 1L))
    var.pred <- c(r[1L], z$vars)
    #.......
    order <- if (aic)
        (0L:order.max)[xaic == 0L]
    else order.max
    ar <- if (order)
        coefs[order, seq_len(order)]
    else numeric()
    var.pred <- var.pred[order + 1L]
    var.pred <- var.pred * n.used/(n.used - (order + 1L))
So you now need to find the Fortran code for C_eureka. I think it is here: https://svn.r-project.org/R/trunk/src/library/stats/src/eureka.f This is the code that I think is returning the var.pred estimate. I'm not a time series guy, and it's your responsibility to review this process for applicability to your problem.
subroutine eureka (lr,r,g,f,var,a)
c
c solves Toeplitz matrix equation toep(r)f=g(1+.)
c by Levinson's algorithm
c a is a workspace of size lr, the number
c of equations
c
snipped
c estimate the innovations variance
var(l) = var(l-1) * (1 - f(l,l)*f(l,l))
if (l .eq. lr) return
d = 0.0d0
q = 0.0d0
do 50 i = 1, l
k = l-i+2
d = d + a(i)*r(k)
q = q + f(l,i)*r(k)
50 continue
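Putting the quoted pieces together for the AR(1), order 1, demean = FALSE case in the question: var.pred appears to be the Levinson-Durbin innovation variance built from the autocovariances with divisor n, rescaled by n.used/(n.used - (order + 1)). A hedged sanity check (my own addition, based only on the source excerpts above):
n <- length(temp)
acv0 <- sum(temp^2) / n                  # lag-0 autocovariance, divisor n
acv1 <- sum(temp[-1] * temp[-n]) / n     # lag-1 autocovariance, divisor n
phi <- acv1 / acv0                       # Yule-Walker AR(1) coefficient (fit$ar)
v <- acv0 * (1 - phi^2)                  # eureka: var(1) = r(1) * (1 - f(1,1)^2)
v * n / (n - 2)                          # the n.used/(n.used - (order + 1)) rescaling
fit$var.pred                             # should match the value above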
