optimizing markov chain transition matrix calculations? - r

As an intermediate R user, I know that for loops can very often be optimized by using functions like apply or otherwise. However, I am not aware of functions that can optimize my current code to generate a markov chain matrix, which is running quite slowly. Have I max-ed out on speed or are there things that I am overlooking? I am trying to find the transition matrix for a Markov chain by counting the number of occurrences in 24-hour time periods before given alerts. The vector ids contains all possible id's (about 1700).
The original matrix looks like this, as an example:
>matrix
id time
1 1376084071
1 1376084937
1 1376023439
2 1376084320
2 1372983476
3 1374789234
3 1370234809
And here is my code to try to handle this:
matrixtimesort <- matrix[order(-matrix$time),]
frequency = 86400 #number of seconds in 1 day
# Initialize matrix that will contain probabilities
transprobs <- matrix(data=0, nrow=length(ids), ncol=length(ids))
# Loop through each type of event
for (i in 1:length(ids)){
localmatrix <- matrix[matrix$id==ids[i],]
# Loop through each row of the event
for(j in 1:nrow(localmatrix)) {
localtime <- localmatrix[j,]$time
# Find top and bottom row number defining the 1-day window
indices <- which(matrixtimesort$time < localtime & matrixtimesort$time >= (localtime - frequency))
# Find IDs that occur within the 1-day window
positiveids <- unique(matrixtimesort[c(min(indices):max(indices)),]$id)
# Add one to each cell in the matrix that corresponds to the occurrence of an event
for (l in 1:length(positiveids)){
k <- which(ids==positiveids[l])
transprobs[i,k] <- transprobs[i,k] + 1
}
}
# Divide each row by total number of occurrences to determine probabilities
transprobs[i,] <- transprobs[i,]/nrow(localmatrix)
}
# Normalize rows so that row sums are equal to 1
normalized <- transprobs/rowSums(transprobs)
Can anyone make any suggestions to optimize this for speed?

Using nested loops seems a bad idea. Your code can be vectorized to speed up.
For example, why find the top and bottom of row numbers? You can simply compare the time value with "time_0 + frequency": it is a vectorized operation.
HTH.

Related

Simulate Steps Through a Markov Chain

I am trying to simulate a step through a Markov chain with the intent to loop through the procedure multiple times until a condition is met. (i.e., find out how many steps it takes, on average, to reach a specific state).
In this case, a state can only go one way. E.g., State 4 can transition forward to State 5, but cannot transition backward to State 3. This means the left-lower half of the transition matrix is set to zero. This is also why the method below puts arbitrarily large values in the 'prior' states. I attempt to find the correct new state by examining which probability in the specified row of the transition matrix is closest to a random number.
get_new_state <- function(current_state, trans_matrix)
{
# generate a random number between 0-1 to compare to the transition matrix probabilities
rand <- runif(1)
# transition to where the
# random number falls within the transition matrix
row <- current_state # initial condition determines the row of the trans_matrix
col = current_state # start in the column
# loop thru all columns and find the correct state
potential_states <- rep(0, each=ncol(trans_matrix)) # holds the value of the potential state it transitions to
# in this case, we can't transition to a previous state so we set the previous state values arbitrarily high
# so they don't get selected in the which.min() function later
potential_states[1:col] <- 999
for(k in row:ncol(trans_matrix)) # loop thru non-zero matrix values
{
if(trans_matrix[row,k] > rand)
{
potential_states[k] <- trans_matrix[row,k] / rand
potential_states[k] <- 1 - potential_states[k]
}
}
# new state is equal to the index of the lowest value
# lowest value = closest to random number
new_state = which.min(potential_states)
return(as.numeric(new_state))
}
I'm not sure if this approach is reasonable. I'm assuming there is a better way to simulate without the kluge that puts arbitrarily large values in potential_states[].
Would something like this work better (it is a one-line Markov transition):
> new_state <- sample(1:number_states, size=1,
prob=transition_matrix[old_state,])
and just put this in a (for instance) while() loop with a counter.

Count duration of value in vector in R

I am trying to count the length of occurrances of a value in a vector such as
q <- c(1,1,1,1,1,1,4,4,4,4,4,4,4,4,4,4,4,4,6,6,6,6,6,6,6,6,6,6,1,1,4,4,4)
Actual vectors are longer than this, and are time based. What I would like would be an output for 4 that tells me it occurred for 12 time steps (before the vector changes to 6) and then 3 time steps. (Not that it occurred 15 times total).
Currently my ideas to do this are pretty inefficient (a loop that looks element by element that I can have stop when it doesn't equal the value I specified). Can anyone recommend a more efficient method?
x <- with(rle(q), data.frame(values, lengths)) will pull the information that you want (courtesy of d.b. in the comments).
From the R Documentation: rle is used to "Compute the lengths and values of runs of equal values in a vector – or the reverse operation."
y <- x[x$values == 4, ] will subset the data frame to include only the value of interest (4). You can then see clearly that 4 ran for 12 times and then later for 3.
Modifying the code will let you check whatever value you want.

TraMineR, Extract all present combination of events as dummy variables

Lets say I have this data. My objective is to extraxt combinations of sequences.
I have one constraint, the time between two events may not be more than 5, lets call this maxGap.
User <- c(rep(1,3)) # One users
Event <- c("C","B","C") # Say this is random events could be anything from LETTERS[1:4]
Time <- c(c(1,12,13)) # This is a timeline
df <- data.frame(User=User,
Event=Event,
Time=Time)
If want to use these sequences as binary explanatory variables for analysis.
Given this dataframe the result should be like this.
res.df <- data.frame(User=1,
C=1,
B=1,
CB=0,
BC=1,
CBC=0)
(CB) and (CBC) will be 0 since the maxGap > 5.
I was trying to write a function for this using many for-loops, but it becomes very complex if the sequence becomes larger and the different number of evets also becomes larger. And also if the number of different User grows to 100 000.
Is it possible of doing this in TraMineR with the help of seqeconstraint?
Here is how you would do that with TraMineR
df.seqe <- seqecreate(id=df$User, timestamp=df$Time, event=df$Event)
constr <- seqeconstraint(maxGap=5)
subseq <- seqefsub(df.seqe, minSupport=0, constraint=constr)
(presence <- seqeapplysub(subseq, method="presence"))
which gives
(B) (B)-(C) (C)
1-(C)-11-(B)-1-(C) 1 1 1
presence is a table with a column for each subsequence that occurs at least once in the data set. So, if you have several individuals (event sequences), the table will have one row per individual and the columns will be the binary variable you are looking for. (See also TraMineR: Can I get the complete sequence if I give an event sub sequence? )
However, be aware that TraMineR works fine only with subsequences of length up to about 4 or 5. We suggest to set maxK=3 or 4 in seqefsub. The number of individuals should not be a problem, nor should the number of different possible events (the alphabet) as long as you restrict the maximal subsequence length you are looking for.
Hope this helps

Expected value of the difference between a sum of variables and a threshold

I had a custom deck consisting of eight cards of the sequence 2^n, n=0,..,6. I draw cards (without replacement) until the sum is equal or greater than the threshold. How can I implement in R a function that calculates the mean of the difference between the sum and the threshold??
I tried to do it using this How to store values in a vector with nested functions
but it takes ages... I think there is a way to do it with probabilities/simulations but I can figure out.
The threshold could be greater than the value of one single card, ie, threshold=500 or less than the value of a single card, ie, threshold=50
What I have done so far is to find all the subsets that meet the condition of the sum greater or equal to the threshold. Then I will only substract the threshold and calculate the mean.
I am using the following code in R. For a small set I get the answer quite fast. However, I have been running the function for several ours with the set containing the 56 numbers and is still working.
set<-c(rep(1,8),rep(2,8), rep(4,8),rep(8,8),rep(16,8),rep(32,8),rep(64,8))
recursive.subset <-function(x, index, current, threshold, result){
for (i in index:length(x)){
if (current + x[i] >= threshold){
store <<- append(store, sum(c(result,x[i])))
} else {
recursive.subset(x, i + 1, current+x[i], threshold, c(result,x[i]))
}
}
}
store <- vector()
inivector <- vector(mode="numeric", length=0) #initializing empty vector
recursive.subset (set, 1, 0, threshold, inivector)
I don't know if it is possible to get an exact solution, simply because there are so many possible combinations. It is probably better to do simulations, i.e. write a script for 1 full draw and then rerun that script many times. Since the solutions are very similar, the simulation should give a pretty good approximation.
Ok, here goes:
set <- rep(2^(0:6), each = 8)
thr <- 500
fun <- function(set,thr){
x <- cumsum(sample(set))
value <- x[min(which(x >= thr))]
value
}
system.time(a <- replicate(100000, fun(set,thr)))
# user system elapsed
# 1.10 0.00 1.09
mean(a - thr)
# [1] 21.22992
Explanation: Rather than drawing a card one at a time, I draw all cards simultaneously (sample) and then calculate the cumulative sum (cumsum). I then find the point where the cards at up to the threshold or larger, and find the sum of those cards back in x. We run this function many times with replicate, to obtain a vector of the values. We use mean(a-thr) to calculate the mean difference.
Edit: Made a really stupid typo in the code, fixed it now.
Edit2: Shortened the function a little.

How should I combine two loops in r?

I want to ask your opinion since I am not so sure how to do it. This is regarding one part of my paper project and my situation is:
Stage I
I have 2 groups and for each group I need to compute the following steps:
Generate 3 random numbers from normal distribution and square them.
Repeat step 1 for 15 times and at the end I will get 15 random numbers.
I already done stage I using for loop.
n1<-3
n2<-3
miu<-0
sd1<-1
sd2<-1
asim<-15
w<-rep(NA,asim)
x<-rep(NA,asim)
for (i in 1:asim) {
print(i)
set.seed(i)
data1<-rnorm(n1,miu,sd1)
data2<-rnorm(n2,miu,sd2)
w[i]<-sum(data1^2)
x[i]<-sum(data2^2)
}
w
x
Second stage is;
Stage II
For each group, I need to:
Sort the group;
Find trimmed mean for each group.
For the whole process (stage I and stage II) I need to simulate them for 5000 times. How am I going to proceed with step 2? Do you think I need to put another loop to proceed with stage II?
Those are tasks you can do without explicit loops. Therefore, note a few things: It is the same if you generate 3 times 15 times 2000 random numbers or if you generate them all at once. They still share the same distribution.
Next: Setting the seed within each loop makes your simulation deterministic. Call set.seed once at the start of your script.
So, what we will do is to generate all random numbers at once, then compute their squared norms for groups of three, then build groups of 15.
First some variable definitions:
set.seed(20131301)
repetitions <- 2000
numperval <- 3
numpergroup <- 15
miu <- 0
sd1 <- 1
sd2 <- 1
As we need two groups, we wrap the group generation stuff into a custom function. This is not necessary, but does help a bit in keeping the code clean an readable.
generateGroup <- function(repetitions, numperval, numpergroup, m, s) {
# Generate all data
data <- rnorm(repetitions*numperval*numpergroup, m, s)
# Build groups of 3:
data <- matrix(data, ncol=numperval)
# And generate the squared norm of those
data <- rowSums(data*data)
# Finally build a matrix with 15 columns, each column one dataset of numbers, each row one repetition
matrix(data, ncol=numpergroup)
}
Great, now we can generate random numbers for our group:
group1 <- generateGroup(repetitions, numperval, numpergroup, miu, sd1)
group2 <- generateGroup(repetitions, numperval, numpergroup, miu, sd2)
To compute the trimmed mean, we again utilize apply:
trimmedmeans_group1 <- apply(group1, 1, mean, trim=0.25)
trimmedmeans_group2 <- apply(group2, 1, mean, trim=0.25)
I used mean with the trim argument instead of sorting, throwing away and computing the mean. If you need the sorted numbers explicitly, you could do it by hand (just for one group, this time):
sorted <- t(apply(group1, 1, sort))
# We have to transpose as apply by default returns a matrix with each observation in one column. I chose the other way around above, so we stick with this convention and transpose.
Now, it would be easy to throw away the first and last two columns and generate the mean, if you want to do it manually.

Resources