Calculate probability of observing sequence using markovchain package - r

Let's use the dataset from this question:
dat<-data.frame(replicate(20,sample(c("A", "B", "C","D"), size = 100, replace=TRUE)))
Then we can build the transition matrix and the markov chain:
# Build transition matrix
trans.matrix <- function(X, prob = TRUE)
{
tt <- table( c(X[,-ncol(X)]), c(X[,-1]) )
if(prob) tt <- tt / rowSums(tt)
tt
}
trans.mat <- trans.matrix(as.matrix(dat))
attributes(trans.mat)$class <- 'matrix'
# Build markovchain
library(markovchain)
chain <- new('markovchain', transitionMatrix = trans.mat)
If I now encounter a new sequence, say AAABCAD, can I then calculate the probability of observing this sequence given this Markov chain?

I cannot see a function in markovchain exactly for that, but it can be easily done manually too. There's one caveat though: the transition matrix does not provide the probability of observing the first A, which needs to be provided by you. Let it be 0.25, as it would be if all four states were equally likely (which is true in your example).
First split the new sequence into a vector of states; the transitions in the observed chain can then be obtained with
obs <- strsplit("AAABCAD", "")[[1]]
cbind(head(obs, -1), obs[-1])
# [,1] [,2]
# [1,] "A" "A"
# [2,] "A" "A"
# [3,] "A" "B"
# [4,] "B" "C"
# [5,] "C" "A"
# [6,] "A" "D"
Probabilities for each of those transitions then are
trans.mat[cbind(head(obs, -1), obs[-1])]
# [1] 0.2268722 0.2268722 0.2268722 0.2926316 0.2791165 0.2665198
and the final answer is 0.25 * (the product of the above vector):
0.25 * prod(trans.mat[cbind(head(obs, -1), obs[-1])])
# [1] 6.355069e-05
For comparison, we may estimate this probability by simulating many sequences of length 7, drawn the same way as the original data (each state sampled independently and uniformly):
dat <- replicate(2000000, paste(sample(c("A", "B", "C", "D"), size = 7, replace = TRUE), collapse = ""))
mean(dat == "AAABCAD")
# [1] 6.55e-05
Looks close enough!
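The manual steps above can be wrapped in a small helper. This is only a minimal sketch, not something the markovchain package provides: the name seq_prob and its init argument are hypothetical, and the two-state transition matrix is made up for illustration.

```r
# Hypothetical helper (not from markovchain): probability of an observed
# sequence under a first-order Markov chain.
#   obs   : character vector of states, e.g. c("A", "A", "B")
#   trans : row-stochastic matrix with state names as row and column names
#   init  : named vector of initial-state probabilities (you must supply this)
seq_prob <- function(obs, trans, init) {
  stopifnot(all(obs %in% rownames(trans)), all(obs %in% names(init)))
  # Pair each state with its successor and look up the transition probabilities
  steps <- cbind(head(obs, -1), obs[-1])
  unname(init[obs[1]]) * prod(trans[steps])
}

# Toy two-state chain, made up for illustration
toy <- matrix(c(0.9, 0.1,
                0.5, 0.5),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("A", "B"), c("A", "B")))
p <- seq_prob(c("A", "A", "B"), toy, init = c(A = 0.5, B = 0.5))
p  # 0.5 * 0.9 * 0.1 = 0.045
```

With the objects from the question, the call would look like seq_prob(strsplit("AAABCAD", "")[[1]], trans.mat, c(A = 0.25, B = 0.25, C = 0.25, D = 0.25)).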

Related

Using mvrnorm from MASS package

I need to generate a random sample with a multivariate normal distribution using seed(12346) with 100 columns and 5000 rows.
So far I have got this:
set.seed(12346)
Preg1 <- data.frame(MASS::mvrnorm(n=5000,mu=c(0,0,0),Sigma = diag(3)))
The above gives me three columns, how can I get 100?
I cannot figure out how to get the vector of mu with 100 zeros without typing them in and the Sigma would then be Sigma = diag(100)
You can use mu = rep(0, 100). The rep function is used to repeat values.
set.seed(12346)
ncol = 100
Preg1 <- data.frame(MASS::mvrnorm(n = 5000, mu = rep(0, ncol), Sigma = diag(ncol)))
dim(Preg1)
# [1] 5000 100
The rep function is quite useful. It can be used in various ways that aren't applicable here but are good to know about:
rep(c("A", "B", "C"), times = 3)
# [1] "A" "B" "C" "A" "B" "C" "A" "B" "C"
rep(c("A", "B", "C"), times = 1:3)
# [1] "A" "B" "B" "C" "C" "C"
rep(c("A", "B", "C"), each = 3)
# [1] "A" "A" "A" "B" "B" "B" "C" "C" "C"
In this particular case, because your Sigma is an identity matrix, each column is actually independent. So it would be equivalent to generate each column (or even each draw) independently, which we could do either of these ways:
x = replicate(n = ncol, rnorm(5000))
dim(x)
# [1] 5000 100
z = matrix(rnorm(5000 * ncol), ncol = ncol)
dim(z)
# [1] 5000 100
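As a rough sanity check of the independence argument, one could look at the sample covariance of the independently generated columns. This is a sketch using base R only; the names n_col, S, max_diag_err and max_offdiag are made up for this example, and the exact numbers depend on the seed and R version.

```r
set.seed(12346)
n_col <- 100
z <- matrix(rnorm(5000 * n_col), ncol = n_col)

S <- cov(z)                                 # 100 x 100 sample covariance matrix
max_diag_err <- max(abs(diag(S) - 1))       # sample variances should be near 1
max_offdiag  <- max(abs(S[upper.tri(S)]))   # sample covariances should be near 0
c(max_diag_err, max_offdiag)
```

Both values should be small, which is what you would expect from draws with Sigma = diag(100).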

Enumerating a subset of paths in a sequential probability tree in R

To illustrate the problem, let us define the following matrix (where NA indicates that the option is unavailable in period t)
set.seed(1)
x <- matrix(NA, 4, 4, dimnames = list(paste0("t=", seq_len(4)), LETTERS[seq_len(4)]))
x[lower.tri(x, diag = TRUE)] <- rnorm(10)
Which gives a matrix that looks like this:
A B C D
t=1 0.91897737 NA NA NA
t=2 0.78213630 0.61982575 NA NA
t=3 0.07456498 -0.05612874 -1.4707524 NA
t=4 -1.98935170 -0.15579551 -0.4781501 0.4179416
The goal is to calculate the probability that each value is the highest in each time period $t$. However, the values are conditional on the values in the previous periods. For example, in moving from period t=2 to t=3 under the assumption that A is the highest, A is only compared to C and not B, because in t=2 A is assumed to be higher than B. We can structure the problem as a tree like this:
So for t=1 the probability is 1, for t=2 we calculate 2 probabilities from 1 grouping, in t=3 we calculate 4 probabilities from 2 groupings (note how one option is eliminated from the comparison because of the sequential dependence and inherent assumption that it was not the highest in t-1) and in t=4, we calculate 8 probabilities from 4 groupings. The final probabilities then are product over the probabilities in each t making up the 8 paths. In the real problem, t gets large and manually identifying these groupings becomes infeasible.
I've been trying to come up with a clever way of identifying these paths and calculating the probabilities. One idea was to use a set of "masking matrices", one for each possible pattern. That way I could simply multiply by the masking matrix and perform row operations. However, I could not find a robust way to populate the different masking matrices as the number of levels increased.
For example, assume the pattern of choosing A in all periods leading up to the final period can be described by the following masking matrix:
mask <- matrix(c(
1, NA, NA, NA,
1, 1, NA, NA,
1, NA, 1, NA,
1, NA, NA, 1
), ncol = 4, byrow = TRUE, dimnames = list(paste0("t=", seq_len(4)), LETTERS[seq_len(4)]))
which looks like this (1 of the 4 possible comparisons in this case):
A B C D
t=1 1 NA NA NA
t=2 1 1 NA NA
t=3 1 NA 1 NA
t=4 1 NA NA 1
And we can calculate the probabilities in each period like this (all rows sum to one as they should):
exp_x <- exp(x * mask)
sum_exp_x <- rowSums(exp_x, na.rm = TRUE)
pr_x <- exp_x / sum_exp_x
A B C D
t=1 1.00000000 NA NA NA
t=2 0.54048879 0.4595112 NA NA
t=3 0.82423638 NA 0.1757636 NA
t=4 0.08261824 NA NA 0.9173818
Is there a clever way of doing this for all possible paths as t grows? Or a good way of populating a set of masking matrices to loop over? I'm trying to keep the problem from growing out of hand. Is it possible that complete path enumeration and elimination is a better option, i.e. faster and more robust? Any help, ideas and pointers are welcome.
Is this what you want?
find_path <- function(nperiods, opts = LETTERS[seq_len(nperiods)]) {
stopifnot(length(opts) == nperiods)
out <- matrix(nrow = 2 ^ (nperiods - 1L), ncol = nperiods)
r <- 1L
recur_ <- function(period, branch, outcome) {
if (period > length(branch)) {
out[r, ] <<- opts[branch]
r <<- r + 1L
return(NULL)
}
for (i in c(outcome, period)) {
branch[[period]] <- i
recur_(period + 1L, branch, i)
}
}
recur_(1L, integer(nperiods), NULL)
out
}
calc_prob <- function(mat) {
ps <- dimnames(mat)[[1L]]; if (is.null(ps)) ps <- seq_len(nrow(mat))
ops <- dimnames(mat)[[2L]]; if (is.null(ops)) ops <- seq_len(ncol(mat))
paths <- find_path(nrow(mat), ops)
out <- vapply(seq_len(ncol(paths))[-1L], function(i) {
comp <- ops[[i]]
comp <- ifelse(paths[, i] == comp, paths[, i - 1L], comp)
x <- exp(mat[i, paths[, i]])
y <- exp(mat[i, comp])
x / (x + y)
}, numeric(nrow(paths)))
dimnames(out) <- NULL; out <- cbind(1, out)
dimnames(out)[[2L]] <- dimnames(paths)[[2L]] <- ps
list(paths = paths, probs = out)
}
Output
> calc_prob(x) # x is the same lower-triangular matrix as shown in your example.
$paths
t=1 t=2 t=3 t=4
[1,] "A" "A" "A" "A"
[2,] "A" "A" "A" "D"
[3,] "A" "A" "C" "C"
[4,] "A" "A" "C" "D"
[5,] "A" "B" "B" "B"
[6,] "A" "B" "B" "D"
[7,] "A" "B" "C" "C"
[8,] "A" "B" "C" "D"
$probs
t=1 t=2 t=3 t=4
[1,] 1 0.5404888 0.8242364 0.08261823
[2,] 1 0.5404888 0.8242364 0.91738177
[3,] 1 0.5404888 0.1757636 0.28985432
[4,] 1 0.5404888 0.1757636 0.71014568
[5,] 1 0.4595112 0.8044942 0.36037495
[6,] 1 0.4595112 0.8044942 0.63962505
[7,] 1 0.4595112 0.1955058 0.28985432
[8,] 1 0.4595112 0.1955058 0.71014568
The variable paths gives you all the possible outcomes for each period t; probs tells you the probability of the corresponding outcome. However, note that such a probability tree grows exponentially as the number of periods increases. The equation is
$N = 2^{t-1}$
where N is the number of all possible paths at period t. For just 20 periods, you will have 524288 different paths. If the number of periods goes to 30, you will have 536870912 different paths, which is more than R can practically enumerate this way. I suggest you reconsider your expected outputs. Are you running a simulation with some other constraints than just the time dependence, so that we can trim off some unnecessary paths? Or maybe you only need some summary statistics, like the expected value, so that we don't have to generate all possible paths? There must be a better way than a brute-force approach like this.
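To make the growth and the final path probabilities concrete, here is a small sketch. The helper name n_paths is made up for this example, and the probs matrix is simply copied (rounded) from the $probs output above rather than recomputed.

```r
# Number of paths after t periods: every period after the first doubles the count
n_paths <- function(t) 2^(t - 1)
n_paths(c(4, 20, 30))  # 8, 524288, 536870912

# Per-period probabilities copied from the $probs output above (rounded)
probs <- rbind(
  c(1, 0.5404888, 0.8242364, 0.08261823),
  c(1, 0.5404888, 0.8242364, 0.91738177),
  c(1, 0.5404888, 0.1757636, 0.28985432),
  c(1, 0.5404888, 0.1757636, 0.71014568),
  c(1, 0.4595112, 0.8044942, 0.36037495),
  c(1, 0.4595112, 0.8044942, 0.63962505),
  c(1, 0.4595112, 0.1955058, 0.28985432),
  c(1, 0.4595112, 0.1955058, 0.71014568)
)

# Probability of each complete path is the product across periods
path_prob <- apply(probs, 1, prod)
sum(path_prob)  # the 8 paths cover every outcome, so this is 1 up to rounding
```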

R Populate multi-dimensional array with input from alternating vectors

I'm trying to populate a multi-dimensional array with two vectors of the same length. The input data should alternate between the vectors, so that the first input is the first object of the first vector, the second input is the first object of the second vector and so on.
I searched for similar problems on this site and found the function rbind(), however, this will not work as soon as my third dimension is unequal to 1.
In short, I want to achieve this:
a <- 1:6
b <- c("a","b","c","d","e","f")
# output array
, , 1
[,1] [,2]
[1,] "1" "a"
[2,] "2" "b"
[3,] "3" "c"
, , 2
[,1] [,2]
[1,] "4" "d"
[2,] "5" "e"
[3,] "6" "f"
I have a working solution below using three for-loops, but this seems overly complicated.
a <- 1:6
b <- c("a","b","c","d","e","f")
len <- prod(length(a)+length(b))
myarray <- array(rep(F,len),dim=c(3,2,2))
counter <- 1
for (n in 1:dim(myarray)[3]) { # n 2
for (r in 1:dim(myarray)[1]) { # rows 3
for (c in 1:dim(myarray)[2]) { # columns 2
if (c %% 2 != 0) {
myarray[r,c,n] <- a[counter]
} else {
myarray[r,c,n] <- b[counter]
}
}
counter <- counter + 1
}
}
Is there an easier approach?
(I'm sure I'm missing something very simple here, but I'm new to R and can't figure it out myself)
Thank you for reading!
[EDIT]
The code should be applicable to a data set with any vector length and any dimension dim = c(x,y,z).
Example data can be found on Dryad Database https://doi.org/10.5061/dryad.mp713, "Table 1 Arctic char landmarks", which contains 13 pairs of x-y-coordinates from 121 individuals of arctic char fish (dim=c(13,2,121)).
Here is my solution for the problem with dim = c(13,2,121):
M <- cbind(a, b)
array(sapply(seq(1, length(a), 13), function(i) M[i:(i+12),]), c(13,2,121))
Do not forget to store the result Mneu <- ...
For your small example:
M <- cbind(a, b);
array(sapply(seq(1, length(a), 3), function(i) M[i:(i+2),]), c(3,2,2))
Form an array and then permute the dimensions:
aperm(array(cbind(a, b), c(3, 2, 2)), c(1, 3:2))
giving:
, , 1
[,1] [,2]
[1,] "1" "a"
[2,] "2" "b"
[3,] "3" "c"
, , 2
[,1] [,2]
[1,] "4" "d"
[2,] "5" "e"
[3,] "6" "f"
Note
We can generalize the example slightly:
n <- 6 # must be 26 or less so that we can use letters below
a <- 1:n
b <- head(letters, n)
aperm(array(cbind(a, b), c(n/2,2,2)), c(1, 3:2))
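For the general dim = c(x, 2, z) case mentioned in the edit, the same aperm idea can be wrapped in a function. This is a sketch under the assumption that both vectors have the same length and that the length is a multiple of the number of rows; pair_array and its arguments are hypothetical names.

```r
# Hypothetical helper: slice two equal-length vectors into an
# n_row x 2 x n_slice array, with `a` in column 1 and `b` in column 2.
pair_array <- function(a, b, n_row) {
  stopifnot(length(a) == length(b), length(a) %% n_row == 0)
  n_slice <- length(a) / n_row
  # Fill an n_row x n_slice x 2 array column by column, then swap the
  # last two dimensions so the a/b pair becomes the column dimension
  aperm(array(cbind(a, b), c(n_row, n_slice, 2)), c(1, 3, 2))
}

res <- pair_array(1:6, letters[1:6], n_row = 3)
dim(res)    # 3 2 2
res[, , 2]  # rows: "4" "d" / "5" "e" / "6" "f"
```

For the Arctic char data this would be called with n_row = 13, giving dim c(13, 2, 121) when each vector has 13 * 121 entries.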

R : randomly divide 10 data values into a group of 5 and a group of 5

I would like to find all the ways to divide 10 data values into 2 groups of 5.
If I'm right, there are 252 possibilities:
choose(10,5)
252
How can I do this in R?
Thanks !
Here's one possibility:
a <- letters[1:10]
split1 <- combn(a, 5)
split2 <- apply(split1, 2, function(x) a[!a %in% x])
Pick a random one:
set.seed(1)
rnd <- sample(1:ncol(split1), size=1)
split1[, rnd]; split2[, rnd]
# [1] "a" "c" "d" "g" "i"
# [1] "b" "e" "f" "h" "j"
Let me explain in detail what I have to do.
I have 2 sets of data (cellular and secreted), each measured for wt and mutant:
cellular_wt = c(1.1656,0.9577,1.3655,0.9016,0.9336)
cellular_mutant = c(2.8896,5.7018,3.595,1.6998,1.8893)
secreted_wt = c(7.8491,6.1546,5.1972,6.1607,5.928)
secreted_mutant = c(4.6801,3.2418,3.6651,3.0678,2.3221)
mean_cellular_wt <- mean(cellular_wt)
mean_cellular_mutant <- mean(cellular_mutant)
mean_secreted_wt <- mean(secreted_wt)
mean_secreted_mutant <- mean(secreted_mutant)
mean_secreted_wt / mean_cellular_wt          # 5.877085
mean_secreted_mutant / mean_cellular_mutant  # 1.076156
mean_ratio <- (mean_secreted_wt / mean_cellular_wt) / (mean_secreted_mutant / mean_cellular_mutant)  # 5.46
I want to run a randomization test on these data to test the significance of the mean ratio.
To do so, I would like to randomly divide these 10 values (cellular_wt + cellular_mutant and secreted_wt + secreted_mutant) into 2 groups of 5 (as in the initial data sets), and calculate the mean ratio each time.
This way, I can see whether the observed ratio of 5.46 is unusually large by comparing it to the 252 ratios that could have been seen due to random assignment alone. Does that make sense?
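One way the randomization described above might be sketched, using combn to enumerate all 252 relabelings. This rests on an assumption the post does not confirm: that the i-th cellular value and the i-th secreted value come from the same sample, so a single relabeling of the 10 samples applies to both measurements. The names ratio_of_ratios, splits, ratios, observed and p_value are made up for this example.

```r
# Pooled data: positions 1-5 are wt, positions 6-10 are mutant
cellular <- c(1.1656, 0.9577, 1.3655, 0.9016, 0.9336,
              2.8896, 5.7018, 3.595, 1.6998, 1.8893)
secreted <- c(7.8491, 6.1546, 5.1972, 6.1607, 5.928,
              4.6801, 3.2418, 3.6651, 3.0678, 2.3221)

# Ratio of the secreted/cellular mean ratios when samples `idx` are labelled wt
# (ASSUMPTION: cellular[i] and secreted[i] belong to the same sample)
ratio_of_ratios <- function(idx) {
  (mean(secreted[idx]) / mean(cellular[idx])) /
    (mean(secreted[-idx]) / mean(cellular[-idx]))
}

splits   <- combn(10, 5)                       # all 252 ways to pick the "wt" group
ratios   <- apply(splits, 2, ratio_of_ratios)  # statistic for every relabeling
observed <- ratio_of_ratios(1:5)               # the actual labelling, about 5.46
p_value  <- mean(ratios >= observed)           # one-sided share at least as large
```

If the two measurements are not paired, the cellular and secreted values would have to be shuffled separately, which gives a different reference distribution.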

Generate all combinations given a constraint

How can I generate all of the 6 combinations of 2 treatments (A,B) in blocks of 4, such that in each block there is an equal number of A's and B's, using R?
"AABB","ABAB","ABBA","BBAA","BABA","BAAB"
P.S. The number of combinations is calculated as follows:
If
T = #treatments
n = #treatments in each block = k*T,
The number of combinations equals $n! / (k!)^T$, i.e. n! divided by k! taken T times.
Thank you
Something like this should work:
library(gtools)
t <- c('A','B')
k <- 2
n <- k * length(t)
t2 <- rep(t, k)
m <- permutations(n,n)
res <- unique(apply(m,MARGIN=1,function(x) paste(t2[x],collapse='')))
--------------------------------------------------------------------
res
[1] "ABAB" "ABBA" "AABB" "BAAB" "BABA" "BBAA"
The multicool package implements an algorithm for permuting multisets, which is exactly the task you want performed. Here's an example of what it can do:
library(multicool)
# Create a simple convenience function
enumAllPartitions <- function(multiset) {
m1 <- initMC(multiset) # Initialize the permutation object
N <- factorial(length(multiset)) / # Calculate number of permutations
prod(factorial(table(multiset)))
sapply(seq_len(N), function(X) paste(nextPerm(m1), collapse=""))
}
# Try it out with a few different multisets
x <- c("A", "A", "B", "B")
y <- c("G", "L", "L", "L")
z <- c("X", "X", "Y", "Z", "Z")
lapply(list(x,y,z), enumAllPartitions)
[[1]]
[1] "BBAA" "ABBA" "BABA" "ABAB" "AABB" "BAAB"
[[2]]
[1] "LLLG" "GLLL" "LGLL" "LLGL"
[[3]]
[1] "ZZYXX" "XZZYX" "ZXZYX" "ZZXYX" "XZZXY" "ZXZXY" "XZXZY" "XXZZY" "ZXXZY"
[10] "ZZXXY" "YZZXX" "ZYZXX" "XZYZX" "ZXYZX" "YZXZX" "XYZZX" "YXZZX" "ZYXZX"
[19] "XZYXZ" "ZXYXZ" "XZXYZ" "XXZYZ" "ZXXYZ" "YZXXZ" "XYZXZ" "YXZXZ" "XYXZZ"
[28] "XXYZZ" "YXXZZ" "ZYXXZ"
The expected solution can also be achieved using the new iterpc package.
I <- iterpc(c(2, 2), labels=c("A", "B"), ordered=TRUE)
getall(I)
# [,1] [,2] [,3] [,4]
# [1,] "A" "A" "B" "B"
# [2,] "A" "B" "A" "B"
# [3,] "A" "B" "B" "A"
# [4,] "B" "A" "A" "B"
# [5,] "B" "A" "B" "A"
# [6,] "B" "B" "A" "A"
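For the two-treatment case in the question, the same result can also be obtained without extra packages. This is a base-R sketch: it picks which k of the n = 2k positions receive "A" and fills the rest with "B". The variable names are made up for this example, and it does not generalize to more than two treatments as written.

```r
# Choose which k of the n = 2k positions receive "A"; the rest receive "B"
k <- 2
n <- 2 * k
blocks <- apply(combn(n, k), 2, function(pos) {
  block <- rep("B", n)
  block[pos] <- "A"
  paste(block, collapse = "")
})
blocks
# [1] "AABB" "ABAB" "ABBA" "BAAB" "BABA" "BBAA"

# Count matches the formula from the question: 4! / (2! * 2!) = 6
length(blocks) == factorial(n) / factorial(k)^2
```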

Resources