Simulate Steps Through a Markov Chain in R

I am trying to simulate a step through a Markov chain, with the intent of looping through the procedure multiple times until a condition is met (i.e., to find out how many steps it takes, on average, to reach a specific state).
In this case, a state can only move one way: e.g., State 4 can transition forward to State 5, but cannot transition backward to State 3. This means the lower-left triangle of the transition matrix is all zeros, and it is also why the method below puts arbitrarily large values in the 'prior' states. I attempt to find the correct new state by examining which probability in the specified row of the transition matrix is closest to a random number.
get_new_state <- function(current_state, trans_matrix)
{
  # generate a random number between 0 and 1 to compare against the
  # transition matrix probabilities
  rand <- runif(1)
  # transition to the state where the random number falls within the
  # transition matrix
  row <- current_state  # the current state determines the row of trans_matrix
  col <- current_state  # start in the current state's column
  # loop through all columns and find the correct state;
  # potential_states holds the value of each state we might transition to
  potential_states <- rep(0, ncol(trans_matrix))
  # we can't transition to a previous state, so set the prior states
  # arbitrarily high so they are never picked by which.min() below
  potential_states[1:col] <- 999
  for (k in row:ncol(trans_matrix)) # loop through the non-zero matrix values
  {
    if (trans_matrix[row, k] > rand)
    {
      potential_states[k] <- trans_matrix[row, k] / rand
      potential_states[k] <- 1 - potential_states[k]
    }
  }
  # the new state is the index of the lowest value
  # (the lowest value is the one closest to the random number)
  new_state <- which.min(potential_states)
  return(as.numeric(new_state))
}
I'm not sure this approach is reasonable. I assume there is a better way to simulate this without the kludge that puts arbitrarily large values in potential_states[].

Would something like this work better (it is a one-line Markov transition)?
new_state <- sample(1:number_states, size=1,
                    prob=transition_matrix[old_state,])
I would then just put this in, for instance, a while() loop with a counter, along the lines of the sketch below.
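For concreteness, here is a minimal sketch of that while() loop idea (simulate_steps, start_state, and target_state are made-up names; transition_matrix is assumed to be a valid row-stochastic matrix whose target state is reachable):
simulate_steps <- function(transition_matrix, start_state, target_state) {
  number_states <- ncol(transition_matrix)
  state <- start_state
  counter <- 0
  while (state != target_state) {
    # one Markov step: draw the next state from the current state's row
    state <- sample(1:number_states, size=1,
                    prob=transition_matrix[state,])
    counter <- counter + 1
  }
  counter
}
# average number of steps over many runs, e.g.:
# mean(replicate(10000, simulate_steps(transition_matrix, 1, 5)))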


How do I use R's optimize, optim functions, etc. to find the function input that provides a specific returned value?

I am trying to find a function input value (scalar) that, when combined with three other known vector parameters, creates a fourth vector that diminishes to zero exactly at a specific vector element (and not before).
The fourth vector contains calculated portfolio balances, where each vector element is a year. I want the portfolio balance to reach zero exactly in the year the investor dies (the year is specified in one of the known vector parameters), and I want to achieve this by finding the initial portfolio value that makes that vector element zero in the year of death and no sooner.
(Note: In Excel, this could be done with the Goal Seek tool by setting a specific portfolio balance element (year) equal to zero by changing the initial portfolio value.)
I have tried both optim and optimize; my latest attempt is the script below. The optimized initial portfolio value always comes back equal to the upper bound of the constraint.
Any help would be greatly appreciated. I have been pulling my hair out for two days.
f1 <- function(init, spendV, returnV, stateV) {
  # create a vector whose first element is based on an initial portfolio value
  # (init); each subsequent element equals the previous portfolio balance
  # minus an amount of spending, with a market return rate then applied.
  # The portfolio balance can never fall below zero.
  # When a state vector element = 4, the investor died in the previous period.
  endState <- which(stateV == 4)
  portfoliobal <- rep(0, length(spendV))
  portfoliobal[1] <- (init - spendV[1]) * returnV[1]
  for (i in 2:length(spendV)) {
    portfoliobal[i] <- max(0, (portfoliobal[i-1] - spendV[i]) * returnV[i])
  }
  # the function returns the portfolio balance when the investor dies
  return(portfoliobal[endState])
}
# spending, market growth rates, and the period of death are given vectors
spendV <- c(97719.19,97737.92,102649.4,98669.15,78108.44,58105.49,51710.02,53267.12,
39982.34,21070,22439.68,21375.4,15613.45,10826.54,9333.82,7808.239,
8737.435,8382.001,7267.976,6534.688,5129.403,5026.947,4931.132,5468.401,
5033.245,5195.273,5199.938,4854.684,5039.221,3757.753,1822.97,1202.24,
1237.238,965.111,1051.235,906.4884,1110.66,1018.127,500.788,538.2703,
584.5545,599.1832,575.2254,640.8828,604.0179,781.878,595.9795,625.5037,
615.471,667.4227)
init <- 1000000
returnV <- c(0.9347388,1.170053,1.204515,0.9572276,1.044682,0.9229759,1.110595,0.9299398,
1.161509,1.053207,1.058104,0.8997761,1.000342,1.353597,1.031785,1.121795,0.8745584,
1.05637,1.180234,0.9795393,0.9137375,0.772738,1.021843,0.9697467,1.055284,0.9182615,
0.9662726,1.105152,1.099005,0.9195565,0.895424,0.9226368,1.196467,1.085768,0.9529325,
1.485245,0.9124764,0.8978044,0.8021779,0.9064698,1.034353,0.9914232,0.742632,
0.9308539,0.9683604,0.9325817,1.000051,1.145982,1.018012,1.127159)
stateV <- c(3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
2,2,2,2,2,2,2,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
cat("\n\nCritical vector element= ",which(stateV == 4))
# the critical vector element is the portfolio balance at death
# I experiment with different inital portfolio values and find that portfolio balance[34] approaches zero
# when initial portfolio balance is about 710,000
f1value <- f1(init=1000000,spendV,returnV,stateV)
cat ("\n\nf1value for critical vector element",which(stateV == 4)," when inital value = 1,000,000: ",f1value)
f1value <- f1(init=800000,spendV,returnV,stateV)
cat ("\n\nf1value for critical vector element",which(stateV == 4)," when inital value = 800,000: ",f1value)
f1value <- f1(init=710000,spendV,returnV,stateV)
cat ("\n\nf1value for critical vector element",which(stateV == 4)," when inital value = 710,000: ",f1value)
# I am trying to find the initial portfolio balance that leaves zero dollars in portfolio balance vector [34]
# ---- BUT not BEFORE portfolio balance [34]
# opt <- optimize(f1,c(0,10000000),spendV,returnV,stateV,maximum = FALSE)
Answer
optim() and optimize() both seek a minimum (by default). You want a zero. The function you are looking for is uniroot(). Do this:
uniroot(f=f1, interval=c(1, 1e7), spendV=spendV, returnV=returnV, stateV=stateV)
#> $root
#> [1] 703932.9
Some other stuff
For fun, here's a rewrite of your function in vectorized notation (I lazily left out the returnV and spendV arguments):
f2 <- function(init, end=34) {
  init*prod(returnV[1:end]) - sum(cumprod(returnV[end:1])*spendV[end:1])
}
uniroot(f2, c(1, 1e7))
#> $root
#> [1] 703932.9
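As a quick sanity check, plugging the root back into the original function should return a balance at death of essentially zero:
f1(init=703932.9, spendV=spendV, returnV=returnV, stateV=stateV)
# should be approximately 0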
In general, if you are testing possible solutions, you might find it easier to plot a whole bunch of solutions, rather than trying them one at a time and reporting the result with cat(). Here's how I might go about it:
x <- seq(1, 2e6, by=100000)
y <- f2(x, 34)
plot(x, y, pch=20)
abline(v=0, h=0)
One last thought: in your function f1, the line which(stateV==4) might return a vector of values if the user inputs, say, stateV=c(1,4,4). I'm guessing you want the first occurrence of a "4", which you can accomplish with which(stateV==4)[1] or if(sum(stateV==4)>1) stop("Too many 4s!").
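For instance, a hedged sketch of that guard at the top of f1 might look like:
endState <- which(stateV == 4)[1]  # keep only the first occurrence
if (is.na(endState)) stop("stateV contains no 4")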

Forming a Wright-Fisher loop with "sample()"

I am trying to create a simple loop to generate a Wright-Fisher simulation of genetic drift with the sample() function (I'm actually not dead set on using this function, but, in my naivety, it seems like the right way to go). I know that sample() randomly selects values from a vector based on certain probabilities. My goal is to create a system that keeps making random selections from successive sets: if it takes some original set of values and samples a second set from it, I'd like the loop to take another random sample from that second set (using the probabilities that were defined earlier).
I'd like to just learn how to do this in a very general way, so the specific probabilities and elements are arbitrary at this point. The only things that matter are (1) that every element can be repeated and (2) that the size of the set stays constant across generations, per Wright-Fisher. As an example, I've been playing with the following:
V <- c(1,1,2,2,2,2)
sample(V, size=6, replace=TRUE, prob=c(1,1,1,1,1,1))
Regrettably, I don't have any code to share yet, precisely because I'm not sure how to start writing this kind of loop. I know that for() loops are used to repeat a procedure multiple times, so my guess is to start there. However, from what I've researched, it seems you have to start with a variable (typically i), and there is no variable in this sampling that seems explicitly obvious, which isn't to say one couldn't be made up.
If you wanted to repeatedly sample from a population with replacement for a total of iter iterations, you could use a for loop:
set.seed(144)  # For reproducibility
population <- init.population
for (i in seq_len(iter)) {
  population <- sample(population, replace=TRUE)
}
population
# [1] 1 1 1 1 1 1
Data:
init.population <- c(1, 1, 2, 2, 2, 2)
iter <- 100
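If you also want to watch the allele counts drift generation by generation rather than only inspect the final population, a small extension of the same loop records the count of the allele coded 1 after each resampling (a sketch; n.gen and counts are made-up names):
set.seed(144)
n.gen <- 100
population <- init.population
counts <- numeric(n.gen)  # copies of allele "1" after each generation
for (gen in seq_len(n.gen)) {
  population <- sample(population, replace=TRUE)
  counts[gen] <- sum(population == 1)
}
plot(counts, type="s", xlab="Generation", ylab="Copies of allele 1")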

Expected value of the difference between a sum of variables and a threshold

I have a custom deck consisting of eight copies of each card in the sequence 2^n, n = 0,...,6 (56 cards in total). I draw cards (without replacement) until the sum is equal to or greater than a threshold. How can I implement an R function that calculates the mean of the difference between the sum and the threshold?
I tried to do it following How to store values in a vector with nested functions, but it takes ages... I think there is a way to do it with probabilities/simulations, but I can't figure it out.
The threshold could be greater than the value of any single card (e.g., threshold = 500) or less than the value of a single card (e.g., threshold = 50).
What I have done so far is find all the subsets whose sum is greater than or equal to the threshold; then I only need to subtract the threshold and calculate the mean.
I am using the following code in R. For a small set I get the answer quite fast. However, I have been running the function for several hours with the full set of 56 numbers and it is still running.
set <- c(rep(1,8), rep(2,8), rep(4,8), rep(8,8), rep(16,8), rep(32,8), rep(64,8))
threshold <- 500  # example value from the question
recursive.subset <- function(x, index, current, threshold, result) {
  for (i in index:length(x)) {
    if (current + x[i] >= threshold) {
      store <<- append(store, sum(c(result, x[i])))
    } else {
      recursive.subset(x, i + 1, current + x[i], threshold, c(result, x[i]))
    }
  }
}
store <- vector()
inivector <- vector(mode="numeric", length=0)  # initializing an empty vector
recursive.subset(set, 1, 0, threshold, inivector)
I don't know if it is possible to get an exact solution, simply because there are so many possible combinations. It is probably better to run simulations, i.e., write a script for one full draw and then rerun that script many times. Since the individual outcomes are very similar, the simulation should give a pretty good approximation.
Ok, here goes:
set <- rep(2^(0:6), each = 8)
thr <- 500
fun <- function(set, thr) {
  x <- cumsum(sample(set))
  value <- x[min(which(x >= thr))]
  value
}
system.time(a <- replicate(100000, fun(set, thr)))
#   user  system elapsed
#   1.10    0.00    1.09
mean(a - thr)
# [1] 21.22992
Explanation: rather than drawing one card at a time, I shuffle the whole deck at once (sample) and then calculate the cumulative sum (cumsum). I then find the first point where the cards add up to the threshold or more, and look up the corresponding value in x. We run this function many times with replicate to obtain a vector of outcomes, and use mean(a - thr) to calculate the mean difference.
Edit: Made a really stupid typo in the code, fixed it now.
Edit2: Shortened the function a little.

optimizing markov chain transition matrix calculations?

As an intermediate R user, I know that for loops can very often be optimized with functions like apply or similar. However, I am not aware of functions that can optimize my current code to generate a Markov chain matrix, which is running quite slowly. Have I maxed out on speed, or is there something I am overlooking? I am trying to find the transition matrix for a Markov chain by counting the number of occurrences in the 24-hour periods before given alerts. The vector ids contains all possible ids (about 1700).
The original matrix looks like this, as an example:
>matrix
id time
1 1376084071
1 1376084937
1 1376023439
2 1376084320
2 1372983476
3 1374789234
3 1370234809
And here is my code to try to handle this:
matrixtimesort <- matrix[order(-matrix$time),]
frequency <- 86400  # number of seconds in 1 day
# Initialize the matrix that will contain the probabilities
transprobs <- matrix(data=0, nrow=length(ids), ncol=length(ids))
# Loop through each type of event
for (i in 1:length(ids)) {
  localmatrix <- matrix[matrix$id==ids[i],]
  # Loop through each row of the event
  for (j in 1:nrow(localmatrix)) {
    localtime <- localmatrix[j,]$time
    # Find the top and bottom row numbers defining the 1-day window
    indices <- which(matrixtimesort$time < localtime &
                     matrixtimesort$time >= (localtime - frequency))
    # Find the IDs that occur within the 1-day window
    positiveids <- unique(matrixtimesort[min(indices):max(indices),]$id)
    # Add one to each cell in the matrix corresponding to an occurrence
    for (l in 1:length(positiveids)) {
      k <- which(ids==positiveids[l])
      transprobs[i,k] <- transprobs[i,k] + 1
    }
  }
  # Divide each row by the total number of occurrences to get probabilities
  transprobs[i,] <- transprobs[i,]/nrow(localmatrix)
}
# Normalize rows so that the row sums equal 1
normalized <- transprobs/rowSums(transprobs)
Can anyone make any suggestions to optimize this for speed?
Using nested loops seems like a bad idea; your code can be vectorized to speed it up.
For example, why find the top and bottom row numbers at all? You can simply compare the time values against the window boundaries directly; the comparison is a vectorized operation, as sketched below.
HTH.
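To make that concrete, here is a hedged sketch of how the body of the inner loop could look (it assumes the same objects as in the question: matrixtimesort, ids, transprobs, frequency, a row index i, and one event time localtime). The logical comparison replaces the min/max index bookkeeping, and match() replaces the innermost for loop:
# all rows falling in the 1-day window before this event, found in one
# vectorized comparison (no row-number bookkeeping needed):
in.window <- matrixtimesort$time < localtime &
             matrixtimesort$time >= (localtime - frequency)
positiveids <- unique(matrixtimesort$id[in.window])
# increment all matching cells at once instead of looping over positiveids:
k <- match(positiveids, ids)
transprobs[i, k] <- transprobs[i, k] + 1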

How should I combine two loops in R?

I want to ask your opinion since I am not sure how to do this. It concerns one part of my paper project, and my situation is as follows:
Stage I
I have 2 groups, and for each group I need to complete the following steps:
Generate 3 random numbers from a normal distribution, square them, and sum the squares.
Repeat step 1 fifteen times, so that at the end I have 15 numbers.
I have already done stage I using a for loop.
n1 <- 3
n2 <- 3
miu <- 0
sd1 <- 1
sd2 <- 1
asim <- 15
w <- rep(NA, asim)
x <- rep(NA, asim)
for (i in 1:asim) {
  print(i)
  set.seed(i)
  data1 <- rnorm(n1, miu, sd1)
  data2 <- rnorm(n2, miu, sd2)
  w[i] <- sum(data1^2)
  x[i] <- sum(data2^2)
}
w
x
The second stage is:
Stage II
For each group, I need to:
Sort the group;
Find the trimmed mean of the group.
I need to simulate the whole process (stage I and stage II) 5000 times. How should I proceed with stage II? Do I need another loop?
These are tasks you can do without explicit loops. First, note a few things: it makes no difference whether you generate 3 × 15 × 2000 random numbers in batches or all at once; they still share the same distribution.
Next: setting the seed within each iteration makes your simulation deterministic. Call set.seed once at the start of your script instead.
So, what we will do is to generate all random numbers at once, then compute their squared norms for groups of three, then build groups of 15.
First some variable definitions:
set.seed(20131301)
repetitions <- 2000
numperval <- 3
numpergroup <- 15
miu <- 0
sd1 <- 1
sd2 <- 1
As we need two groups, we wrap the group-generation code in a custom function. This is not necessary, but it helps a bit in keeping the code clean and readable.
generateGroup <- function(repetitions, numperval, numpergroup, m, s) {
  # Generate all the random numbers
  data <- rnorm(repetitions*numperval*numpergroup, m, s)
  # Build groups of 3:
  data <- matrix(data, ncol=numperval)
  # And compute the squared norm of each group
  data <- rowSums(data*data)
  # Finally build a matrix with 15 columns: each row is one repetition
  # of 15 numbers
  matrix(data, ncol=numpergroup)
}
Great, now we can generate the random numbers for our two groups:
group1 <- generateGroup(repetitions, numperval, numpergroup, miu, sd1)
group2 <- generateGroup(repetitions, numperval, numpergroup, miu, sd2)
To compute the trimmed means, we use apply:
trimmedmeans_group1 <- apply(group1, 1, mean, trim=0.25)
trimmedmeans_group2 <- apply(group2, 1, mean, trim=0.25)
I used mean with the trim argument instead of sorting, discarding the extremes, and computing the mean by hand. If you need the sorted numbers explicitly, you can do it by hand (just for one group this time):
sorted <- t(apply(group1, 1, sort))
# We have to transpose because apply returns a matrix with each observation
# in one column; above we used the other orientation, so we transpose to
# stay consistent.
Now it is easy to throw away the extreme columns and compute the mean manually, as in the sketch below. (Note that mean with trim=0.25 on 15 values removes floor(15 * 0.25) = 3 observations from each end, not two.)
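For completeness, a hedged sketch of that manual route (k = 3 matches what mean with trim=0.25 does on 15 values):
k <- 3  # floor(15 * 0.25) observations trimmed from each end
manual_trimmed <- rowMeans(sorted[, (k + 1):(ncol(sorted) - k)])
all.equal(manual_trimmed, trimmedmeans_group1)  # should be TRUE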
