My code for getting a proper plot in R does not seem to work (I am new to R and I am having difficulties with coding).
Basically, using the concept of temporal discounting in the form of the beta-delta model, we are supposed to calculate the subjective value for $10 at every delay from 0 to 365.
The context of the homework is that we have to account for the important exception that if a reward is IMMEDIATE, there’s no discount, but if it occurs at any delay there’s both an exponential discount and a delay penalty.
I created a variable called BetaDeltaValuesOf10, which is 366 elements long and represents the subjective value for $10 at each delay from 0 to 365.
The code needs to have the following properties using for-loops and an if-else statement:
1) IF the delay is 0, the subjective value is the objective magnitude (and should be saved to the appropriate element of BetaDeltaValuesOf10).
2) OTHERWISE, calculate the subjective value at the exponentially discounted rate, assuming 𝛿 = .98 and apply a delay penalty of .8, then save it to the appropriate element of BetaDeltaValuesOf10.
The standard code given to us to help us in creating the code is as follows:
BetaDeltaValuesOf10 = 0
Delays = 0:365
Code (equation) to get the subjective value/preference using the exponential discounting model:
ExponentialDecayValuesOf10 = .98^Delays*10
0.98 is the discount rate (delta), which ranges between 0 and 1.
Delays is the number of time periods in the future when the later reward will be delivered.
10 is the objective magnitude of the $10 reward.
Code (equation) to get the subjective value using the beta-delta model:
0.8*0.98^Delays*10
0.8 is the delay penalty
The code I came up with in trying to satisfy the above mentioned properties is as follows:
for(t in 1:length(Delays)){BetaDeltaValuesOf10 = 0.98^0*10
if(BetaDeltaValuesOf10 == 0){0.98^t*10}
else {0.8*0.98^t*10}
}
So, I ran the code and did not get any errors. But when I try to plot the outcome of the code, my plot comes up blank.
To plot I used the code:
plot(BetaDeltaValuesOf10,type = 'l', ylab = 'DiscountedValue')
I believe that my code is actually faulty and that is why I am not getting a proper outcome for my plot.
Please let me know of the amendments to the code and if the community needs any clarification, I will try to clarify as soon as I can.
result  <- double(length = 366)    # subjective value for each delay 0..365
delays  <- 0:365
val     <- 10                      # objective magnitude of the reward
delta   <- 0.98                    # exponential discount rate
penalty <- 0.8                     # delay penalty, applied only when delay > 0

for (t in seq_along(delays)) {
  # penalty^(delays[t] > 0) equals 1 at delay 0 and 0.8 otherwise,
  # so the immediate reward is left undiscounted
  result[t] <- val * delta^delays[t] * penalty^(delays[t] > 0)
}

plot(x = delays, y = result, pch = 20)
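If the assignment specifically requires a for-loop with an if-else statement, a sketch that follows that structure directly (using the variable names and parameter values from the question) would be:

# For-loop with if-else, as described in the assignment
Delays <- 0:365
BetaDeltaValuesOf10 <- numeric(length(Delays))   # pre-allocate 366 elements

for (t in seq_along(Delays)) {
  if (Delays[t] == 0) {
    BetaDeltaValuesOf10[t] <- 10                         # immediate: no discount, no penalty
  } else {
    BetaDeltaValuesOf10[t] <- 0.8 * 0.98^Delays[t] * 10  # exponential discount plus delay penalty
  }
}

plot(BetaDeltaValuesOf10, type = 'l', ylab = 'DiscountedValue')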
The following code comes from the book Statistics and Data Analysis for Financial Engineering, which describes how to generate simulated data from an ARCH(1) model.
library(TSA)
library(tseries)
n = 10200
set.seed("7484")
e = rnorm(n)
a = e
y = e
sig2 = e^2
omega = 1
alpha = 0.55
phi = 0.8
mu = 0.1
omega/(1-alpha) ; sqrt(omega/(1-alpha))
for (t in 2:n){
a[t] = sqrt(sig2[t])*e[t]
y[t] = mu + phi*(y[t-1]-mu) + a[t]
sig2[t+1] = omega + alpha * a[t]^2
}
plot(e[10001:n],type="l",xlab="t",ylab=expression(epsilon),main="(a) white noise")
My question is: why do we need to discard the first 10,000 simulated values?
========================================================
Bottom Line Up Front
Truncation is needed to deal with sampling bias introduced by the simulation model's initialization when the simulation output is a time series.
Details
Not all simulations require truncation of initial data. If a simulation produces independent observations, then no truncation is needed. The problem arises when the simulation output is a time series. Time series differ from independent data because their observations are serially correlated (also known as autocorrelated). For positive correlations, the result is similar to having inertia: observations that are near neighbors tend to be similar to each other.

This characteristic interacts with the reality that computer simulations are programs, and all state variables need to be initialized to something. The initialization is usually to a convenient state, such as "empty and idle" for a queueing service model where nobody is in line and the server is available to immediately help the first customer. As a result, that first customer experiences zero wait time with probability 1, which is certainly not the case for the wait time of some customer k where k > 1.

Here's where serial correlation kicks us in the pants. If the first customer always has a zero wait time, that affects some unknown number of subsequent customers' experiences. On average they tend to be below the long-term average wait time, but gravitate more towards that long-term average as k, the customer number, increases. How long this "initialization bias" lingers depends both on how atypical the initialization is relative to the long-term behavior, and on the magnitude and duration of the serial correlation structure of the time series.
The average of a set of values yields an unbiased estimate of the population mean only if they belong to the same population, i.e., if E[X_i] = μ, a constant, for all i. In the previous paragraph, we argued that this is not the case for time series with serial correlation that are generated starting from a convenient but atypical state. The solution is to remove some (unknown) quantity of observations from the beginning of the data so that the remaining data all have the same expected value.

This issue was first identified by Richard Conway in a RAND Corporation memo in 1961 and published in a refereed journal in 1963 [R. W. Conway, "Some tactical problems in digital simulation", Manag. Sci. 10 (1963) 47–61]. How to determine an optimal truncation amount has been, and remains, an active area of research in the field of simulation.

My personal preference is for a technique called MSER, developed by Prof. Pres White (University of Virginia). It treats the end of the data set as the most reliable in terms of unbiasedness and works its way towards the front, using a fairly simple measure to detect when adding observations closer to the front produces a significant deviation. You can find more details in the 2011 Winter Simulation Conference paper on MSER if you're interested. Note that the 10,000 you used may be overkill, or it may be insufficient, depending on the magnitude and duration of serial correlation effects for your particular model.
It turns out that serial correlation causes other problems in addition to the issue of initialization bias. It also has a significant effect on the standard error of estimates, as pointed out at the bottom of page 489 of the WSC2011 paper, so people who calculate the i.i.d. estimator s²/n can be off by orders of magnitude on the estimated width of confidence intervals for their simulation output.
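To make the initialization bias concrete, here is a minimal R sketch (my own illustration, not from the book or the WSC paper): an AR(1) series, using the same phi and mu as the question's code, deliberately started far from its long-run mean. The early observations pull the sample mean away from mu until they are discarded.

# Initialization bias in miniature: AR(1) started at an atypical value
set.seed(1)
n    <- 10200
phi  <- 0.8
mu   <- 0.1
y    <- numeric(n)
y[1] <- 25                       # deliberately atypical initial state
for (t in 2:n) {
  y[t] <- mu + phi * (y[t - 1] - mu) + rnorm(1)
}
mean(y[1:50])    # early observations still reflect the atypical start
mean(y[51:n])    # after the warm-up is discarded, close to the long-run mean mu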
I have the following problem:
A casualty insurance company has 1000 policyholders, each of whom will
independently present a claim in the next month with probability 5%.
Assume that the amounts of the claims made are independent exponential
random variables with mean 800 dollars.
Does anyone know how to create a simulation in R to estimate the probability
that the sum of those claims exceeds 50,000 dollars?
This sounds like a homework assignment, so it's probably best to consult with your teacher(s) if you're unsure about how to approach this. Bearing that in mind, here's how I'd go about simulating this:
First, create a function that generates values from an exponential distribution and sums those values, based on the values you give in your problem description.
# Sum of simulated claim amounts, using the expected number of claims (n_policies * prob_claim)
get_sum_claims <- function(n_policies, prob_claim, mean_claim) {
  sum(rexp(n = n_policies * prob_claim, rate = 1/mean_claim))
}
Next, make this function return the sum of all claims lots of times, and store the results. The line with map_dbl does this, essentially instructing R to return 100000 simulated sums of claims from the get_sum_claims function.
library(tidyverse)
claim_sums <- map_dbl(1:100000, ~ get_sum_claims(1000, 0.05, 800))
Finally, we can calculate the probability that the sum of claims is greater than 50000 by using the code below:
sum(claim_sums > 50000)/length(claim_sums)
This gives a fairly reliable estimate of ~ 0.046 as the probability that the sum of claims exceeds 50000 in a given month.
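If you would rather not load the tidyverse just for this step, base R's replicate() gives the same result (a small variation on the code above, not part of the original answer):

# Base-R equivalent of the map_dbl() call above
claim_sums <- replicate(100000, get_sum_claims(1000, 0.05, 800))
mean(claim_sums > 50000)   # same estimate as sum(claim_sums > 50000)/length(claim_sums)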
I'm a bit inexperienced with R, but here is my solution.
First, construct a function which simulates a single trial. To do so, one needs to determine how many claims are filed, n. I hope it is clear that n ~ Binomial(1000, 0.05). Note that you cannot simply assume n = 1000 * 0.05 = 50. By doing so, you would decrease the variance, which will result in a lower probability (a quick numerical comparison is sketched after the code below). I can explain why this is the case if needed. Then, generate and sum n values based on an exponential distribution with mean 800.
simulate_total_claims <- function(){
  # Draw the number of claims from Binomial(1000, 0.05), then draw that many
  # exponential claim amounts with mean 800 and sum them
  claim_amounts <- rexp(rbinom(n = 1, size = 1000, prob = 0.05), rate = 1/800)
  total <- sum(claim_amounts)
  return(total)
}
Now, all that needs to be done is run the above function a lot and determine the proportion of runs which have values greater than 50000.
totals <- rerun(.n = 100000, simulate_total_claims())
estimated_prob <- mean(unlist(totals) > 50000)
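To see that variance point numerically, here is a quick side-by-side check (my own sketch, reusing the two approaches above; the seed and run count are arbitrary). The fixed-count version typically gives a noticeably lower estimate than the binomial-count version.

# Compare a fixed claim count of 50 with a Binomial(1000, 0.05) claim count
set.seed(42)
fixed_sums    <- replicate(100000, sum(rexp(50, rate = 1/800)))
binomial_sums <- replicate(100000, sum(rexp(rbinom(1, 1000, 0.05), rate = 1/800)))

mean(fixed_sums > 50000)      # smaller: ignores variability in the number of claims
mean(binomial_sums > 50000)   # larger: reflects that extra source of variability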
I'm going to preface this with the fact that I am a complete R novice.
I have the following problem:
Consider a simple model that progresses year-by-year. In year i, let W_i = patient is well, I_i = patient is ill, and D_i = patient is dead. Transitions can be modeled as a set of conditional probabilities.
Let L = number of years that the patient is well.
I have come up with the probability mass function of L to be P(L)=(1-p)(p)^{L-1}.
The given information is that the patient is well in year 1 and that, given their age and risk factors, P(W_{i+1} | W_i) = 0.2 for all i.
The problem is to write a function in R that simulates the trajectory of a single patient and returns the number of years the patient is well.
I thought that this could be programmed in R as a binomial distribution using the rbinom function. For a single patient,
rbinom(1, 1, 0.2)
but I don't think that this would return the number of years that the patient is well. I'm thinking that the rbinom function should be the start, and that it would need to be paired with a way to count the number of years that a patient is well, but I don't know how to do that.
The next problem is to use R to simulate 1000 patient trajectories and find the sample mean of years of wellness. I'm assuming that this would be an extension of the previous part, just replacing the 1 patient with 1000. However, I can't quite figure out where the 1000 should go, n or size:
rbinom(n, size, prob)
This is assuming that using rbinom is the correct thing to do in the first place...
If I were to do this in another programming language (say Python) I would use a while loop conditional on patient_status=W and starting with L=0 iterate through the loop and add 1 each successful iteration. I'm not sure if R works in the same way.
Let's start with what rbinom(1, 1, 0.2) does: it returns 1 instance of the sum of 1 independent Bernoulli (that is, 0-1) random variable that has a probability of 0.2 of being equal to 1. So, that line will only give outputs of 0 (which it will do 80% of the time) or 1 (which it will do the other 20% of the time). As you noted, this isn't what you want.
The issue here is the selection of a random variable. A binomial variable is great for something like, "I roll ten dice. How many land on 6?" because it has the following essential components:
outcomes dichotomized into success / failure
a fixed number (ten) of trials
a consistent probability of success (1/6)
independent trials (dice do not affect each other)
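For instance (my own illustration, not part of the original answer), that dice question maps directly onto rbinom():

rbinom(1, size = 10, prob = 1/6)   # one experiment: how many of ten dice land on 6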
The situation you're describing doesn't have those features. So, what to do?
Option 1: Go with your instinct for a while() loop. I'll preface this by saying that while() loops are discouraged in R for various reasons (chiefly inefficiency). But, since you already understand the concept, let's run with it.
one_patient <- function(){
  status <- 1   # 1 = healthy, 0 = ill
  years <- (-1) # count how many years completed while healthy
  while(status == 1){
    years <- years + 1          # this line will run at least one time
    status <- rbinom(1, 1, 0.2) # your rbinom(1, 1, 0.2) line makes an appearance!
  }
  return(years)
}
Now, executing one_patient() will result in the number of years the patient successfully transitioned from well to well. This will be at least 0, since years starts at -1 and is incremented at least one time. It could be very high, if the patient is lucky, though it most likely won't be. You can experiment with this by changing the 0.2 parameter to something more optimistic like 0.99 to simulate long life spans.
Option 2: Rethink the random variable. I mentioned above that the variable wasn't binomial; in fact, it's geometric. A situation like, "I roll a die until it lands on 6. How many rolls did it take?" is geometric because it has the following essential components:
outcomes dichotomized into success / failure
a consistent probability of success
repeated trials that terminate when the first success is reached
independent trials
Much like how binomial variables have useful functions in R such as rbinom(), pbinom(), qbinom(), dbinom(), there is a corresponding collection for geometric variables: rgeom(), pgeom(), qgeom(), dgeom().
To use rgeom(), we need to be careful about one detail: here, a "success" is characterized as the patient becoming ill, because that's when the experiment ends. (Above, by encoding the patient being well as 1, we're implicitly using the reverse perspective.) This means that the "success" probability is 0.8. rgeom(1, 0.8) will return the number of draws strictly before the first success, which is equivalent to the number of years the patient went from well to well, as above. Note that the 1 parameter refers to the number of times we want to run this experiment and not something else. Hence:
rgeom(1, 0.8)
will accomplish the same task as the one_patient() function we defined above. (That is, the distribution of outputs for each will be the same.)
For multiple patients, you can either wrap the one_patient() function inside replicate(), or you can just directly adjust the first parameter of rgeom(1, 0.8). The second option is much faster, though both are fast if just simulating 1000 patients.
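For the second part of the question, simulating 1000 patients and taking the sample mean of well years, a minimal sketch (either form, since they produce the same distribution) might be:

# 1000 simulated patients, two equivalent ways
well_years_a <- replicate(1000, one_patient())
well_years_b <- rgeom(1000, 0.8)

mean(well_years_a)   # sample mean of well years; the theoretical mean is 0.2/0.8 = 0.25
mean(well_years_b)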
Addendum
Proof that both have the same effect:
sims1 <- replicate(10000, one_patient())
hist(sims1, breaks = seq(-0.5, max(sims1) + 0.5, by = 1))
sims2 <- rgeom(10000, 0.8)
hist(sims2, breaks = seq(-0.5, max(sims2) + 0.5, by = 1))
Proof that rgeom() is faster:
library(microbenchmark)
microbenchmark(
replicate(10000, one_patient()),
rgeom(10000, 0.8)
)
#Unit: milliseconds
# expr min lq mean median uq max neval
# replicate(10000, one_patient()) 35.4520 38.77585 44.135562 43.82195 46.05920 73.5090 100
# rgeom(10000, 0.8) 1.1978 1.22540 1.273766 1.23640 1.27485 1.9734 100
I am testing whether sending consumers information about a promotion convinces them to buy anything. Out of 100k consumers, we randomly selected 90% of them and sent them catalogs. After some time, we tracked who bought.
To recreate the problem, let's use:
set.seed(1)
got <- rbinom(n=100000, size=1, prob=0.1)
bought <- rbinom(n=100000, size=1, prob=0.05)
table(got, bought)
   bought
got     0    1
  0 85525 4448
  1  9567  460
As I read on here, I should use the prop.test(table(got, bought), correct=FALSE) function, but I want to check not only whether the proportions are equal, but whether the proportion of those who bought during the promotion was greater in the group that got the leaflet than in the group that didn't.
Should I use the argument alternative = "less" or alternative = "greater"? And does the order of got and bought matter?
You usually want to use a two-sided alternative (for all you know, sending the promotion annoys people and they become less likely to purchase).
By default, prop.test performs a chi-square test, which by construction does not look at which group is bigger.
You could do a t.test like this
t.test(bought ~ got, data = data.frame(got = got, bought = bought))
Depending on your typical conversion rate, sample size, and alpha, you can get confidence intervals implying negative conversion rates, so a bootstrapping or Bayesian approach may be better suited.
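If you do decide you want the one-sided test the question asks about, one way to avoid any ambiguity about the ordering of got and bought (a sketch of my own, not a recommendation over the two-sided test above) is to pass the success counts and group sizes explicitly instead of the table:

# Buyers and group sizes, with the catalog group listed first
x <- c(sum(bought[got == 1]), sum(bought[got == 0]))   # number who bought in each group
n <- c(sum(got == 1), sum(got == 0))                   # group sizes

# With this ordering, alternative = "greater" tests whether the catalog group's
# buying proportion is higher than the no-catalog group's.
prop.test(x, n, alternative = "greater", correct = FALSE)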
I have produced a stochastic model of infection (a parasitic worm), using a Gillespie SSA. The model uses the "GillespieSSA" package (https://cran.r-project.org/web/packages/GillespieSSA/index.html).
In short, the code models a population of discrete compartments. Movement between compartments depends on user-defined rate equations. The SSA algorithm calculates the number of events produced by each rate equation for a given timestep (tau) and updates the population accordingly; the process repeats up to a given time point. The problem is that the number of events is assumed Poisson distributed (Poisson(rate[i]*tau)), which produces an error when a rate is negative, including when population numbers become negative.
# Parameter Values
sir.parms <- c(deltaHinfinity=0.00299, CHi=0.00586, deltaH0=0.0854, aH=0.5,
muH=0.02, SigmaW=0.1, SigmaM =0.8, SigmaL=104, phi=1.15, f = 0.6674,
deltaVo=0.0166, CVo=0.0205, alphaVo=0.5968, beta=52, mbeta=7300 ,muV=52, g=0.0096, N=100)
# Initial Population Values
sir.x0 <- c(W=20,M=10,L=0.02)
# Rate Equations
sir.a <- c("((deltaH0+deltaHinfinity*CHi*mbeta*L)/(1+CHi*mbeta*L))*mbeta*L*N"
,"SigmaW*W*N", "muH*W*N", "((1/2)*phi*f)*W*N", "SigmaM*M*N", "muH*M*N",
"(deltaVo/(1+CVo*M))*beta*M*N", "SigmaL*L*N", "muV*L*N", "alphaVo*M*L*N", "(aH/g)*L*N")
# Population change for each event
sir.nu <- matrix(c(+0.01,0,0,
-0.01,0,0,
-0.01,0,0,
0,+0.01,0,
0,-0.01,0,
0,-0.01,0,
0,0,+0.01/230,
0,0,-0.01/230,
0,0,-0.01/230,
0,0,-0.01/230,
0,0,-0.01/32),nrow=3,ncol=11,byrow=FALSE)
runs <- 10
set.seed(1)
# Data Frame of output
sir.out <- data.frame(time=numeric(),W=numeric(),M=numeric(),L=numeric())
# Multiple runs and combining data and SSA methods
for(i in 1:runs){
  sim <- ssa(sir.x0,sir.a,sir.nu,sir.parms, method="ETL", tau=1/12, tf=140, simName="SIR")
  sim.out <- data.frame(time=sim$data[,1],W=sim$data[,2],M=sim$data[,3],L=sim$data[,4])
  sim.out$run <- i
  sir.out <- rbind(sir.out,sim.out)
}
Thus, rates are computed and the model updates the population values at each time step, with the data stored in a data frame and then appended to the previous runs. However, when population levels get very low, the number of events reducing a compartment can exceed the number of individuals in that compartment. One fix is to make the time step very small, but this greatly increases the run time of the simulation.
My question is: is there a way to augment the code so that, as the data is calculated at each time step, any negative population values are converted to 0?
I have tried working on this problem, but only seem to be able to come up with methods that alter the values once the simulation is complete, with the negative values still causing issues in the runs themselves.
E.g.
if (sir.out$L < 0) sir.out$L == 0
Any help would be appreciated
I believe the problem is the method you set ("ETL") in the ssa function. The ETL method will eventually produce negative numbers. You can try the "OTL" method, based on "Efficient step size selection for the tau-leaping simulation method", in which there are a few more parameters that you can tweak, but the basic command is:
ssa(sir.x0,sir.a,sir.nu,sir.parms, method="OTL", tf=140, simName="SIR")
Or the direct method, which will not produce negative numbers at all:
ssa(sir.x0,sir.a,sir.nu,sir.parms, method="D", tf=140, simName="SIR")
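As a side note on the post-processing attempt in the question: the replacement would need this syntax (shown only for completeness; as the question already notes, it does not fix the negative rates inside the runs themselves, so switching the method is the better solution):

# Clamp any negative population values to 0 after the simulation (post hoc only)
sir.out$W <- pmax(sir.out$W, 0)
sir.out$M <- pmax(sir.out$M, 0)
sir.out$L <- pmax(sir.out$L, 0)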