Enumerating a subset of paths in a sequential probability tree in R

To illustrate the problem, let us define the following matrix (where NA indicates that the option is unavailable in period t)
set.seed(1)
x <- matrix(NA, 4, 4, dimnames = list(paste0("t=", seq_len(4)), LETTERS[seq_len(4)]))
x[lower.tri(x, diag = TRUE)] <- rnorm(10)
Which gives a matrix that looks like this:
              A           B          C         D
t=1  0.91897737          NA         NA        NA
t=2  0.78213630  0.61982575         NA        NA
t=3  0.07456498 -0.05612874 -1.4707524        NA
t=4 -1.98935170 -0.15579551 -0.4781501 0.4179416
The goal is to calculate the probability that each value is the highest in each time period $t$. However, the values are conditional on the values in the previous periods. For example, in moving from period t=2 to t=3 under the assumption that A is the highest, A is only compared to C and not to B, because A is assumed to have been higher than B in t=2. We can structure the problem as a tree:
So for t=1 the probability is 1, for t=2 we calculate 2 probabilities from 1 grouping, in t=3 we calculate 4 probabilities from 2 groupings (note how one option is eliminated from the comparison because of the sequential dependence and the inherent assumption that it was not the highest in t-1), and in t=4 we calculate 8 probabilities from 4 groupings. The final probability of each of the 8 paths is then the product of the per-period probabilities along that path. In the real problem, t gets large and manually identifying these groupings becomes infeasible.
I've been trying to come up with a clever way of identifying these paths and calculating the probabilities. One idea was to use a set of "masking matrices", one for each possible pattern. That way I could simply multiply by the masking matrix and perform row operations. However, I could not find a robust way to populate the different masking matrices as the number of levels increased.
For example, assume the pattern of choosing A in all periods leading up to the final period can be described by the following masking matrix:
mask <- matrix(c(
1, NA, NA, NA,
1, 1, NA, NA,
1, NA, 1, NA,
1, NA, NA, 1
), ncol = 4, byrow = TRUE, dimnames = list(paste0("t=", seq_len(4)), LETTERS[seq_len(4)]))
which looks like this (1 of the 4 possible comparisons in this case):
    A  B  C  D
t=1 1 NA NA NA
t=2 1  1 NA NA
t=3 1 NA  1 NA
t=4 1 NA NA  1
And we can calculate the probabilities in each period like this (all rows sum to one as they should):
exp_x <- exp(x * mask)
sum_exp_x <- rowSums(exp_x, na.rm = TRUE)
pr_x <- exp_x / sum_exp_x
             A         B         C         D
t=1 1.00000000        NA        NA        NA
t=2 0.54048879 0.4595112        NA        NA
t=3 0.82423638        NA 0.1757636        NA
t=4 0.08261824        NA        NA 0.9173818
Is there a clever way of doing this for all possible paths as t grows? Or a good way of populating a set of masking matrices to loop over? I'm trying to keep the problem from growing out of hand. Is it possible that complete path enumeration and elimination is a better option, i.e. faster and more robust? Any help, ideas and pointers are welcome.

Is this what you want?
find_path <- function(nperiods, opts = LETTERS[seq_len(nperiods)]) {
  stopifnot(length(opts) == nperiods)
  out <- matrix(nrow = 2 ^ (nperiods - 1L), ncol = nperiods)
  r <- 1L
  recur_ <- function(period, branch, outcome) {
    if (period > length(branch)) {
      # A complete path: store it in the next free row of `out`
      out[r, ] <<- opts[branch]
      r <<- r + 1L
      return(NULL)
    }
    # Either the previous winner stays, or the option introduced this period wins
    for (i in c(outcome, period)) {
      branch[[period]] <- i
      recur_(period + 1L, branch, i)
    }
  }
  recur_(1L, integer(nperiods), NULL)
  out
}
calc_prob <- function(mat) {
  # Fall back to integer labels when the matrix has no dimnames
  ps <- dimnames(mat)[[1L]]; if (is.null(ps)) ps <- seq_len(nrow(mat))
  ops <- dimnames(mat)[[2L]]; if (is.null(ops)) ops <- seq_len(ncol(mat))
  paths <- find_path(nrow(mat), ops)
  # For each period i > 1, compare the chosen option with its single competitor
  out <- vapply(seq_len(ncol(paths))[-1L], function(i) {
    comp <- ops[[i]]
    # If the newcomer was chosen, the competitor is the previous period's winner
    comp <- ifelse(paths[, i] == comp, paths[, i - 1L], comp)
    x <- exp(mat[i, paths[, i]])
    y <- exp(mat[i, comp])
    x / (x + y)
  }, numeric(nrow(paths)))
  dimnames(out) <- NULL; out <- cbind(1, out)
  dimnames(out)[[2L]] <- dimnames(paths)[[2L]] <- ps
  list(paths = paths, probs = out)
}
Output
> calc_prob(x) # x is the same lower-triangular matrix as shown in your example.
$paths
t=1 t=2 t=3 t=4
[1,] "A" "A" "A" "A"
[2,] "A" "A" "A" "D"
[3,] "A" "A" "C" "C"
[4,] "A" "A" "C" "D"
[5,] "A" "B" "B" "B"
[6,] "A" "B" "B" "D"
[7,] "A" "B" "C" "C"
[8,] "A" "B" "C" "D"
$probs
t=1 t=2 t=3 t=4
[1,] 1 0.5404888 0.8242364 0.08261823
[2,] 1 0.5404888 0.8242364 0.91738177
[3,] 1 0.5404888 0.1757636 0.28985432
[4,] 1 0.5404888 0.1757636 0.71014568
[5,] 1 0.4595112 0.8044942 0.36037495
[6,] 1 0.4595112 0.8044942 0.63962505
[7,] 1 0.4595112 0.1955058 0.28985432
[8,] 1 0.4595112 0.1955058 0.71014568
The variable paths gives you all the possible outcomes for each period t; probs gives the probability of the corresponding outcome in each period. However, note that such a probability tree grows exponentially as the number of periods increases. The number of paths is
$N = 2^{t-1}$
where N is the number of all possible paths at period t. For just 20 periods, you will have 524288 different paths. If the number of periods goes to 30, you will have 536870912 different paths, and enumerating (let alone storing) that many paths in R is impractical. I suggest you reconsider your expected outputs. Are you running a simulation with constraints other than the time dependence, so that we can trim off some unnecessary paths? Or maybe you only need summary statistics such as the expected value, so that we don't have to generate all possible paths? There must be a better way than a brute-force approach like this.
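If only per-period summaries are needed, one possible way to avoid enumerating all paths is a forward recursion over the "current leader". The sketch below is not part of the answer above; it assumes the tree structure described there (in each period the previous winner is compared only with the newly introduced option), and the name leader_probs is made up for illustration. The matrix values are copied from the printed example rather than regenerated.

```r
# Example matrix, filled column by column with the values printed in the question
x <- matrix(NA_real_, 4, 4,
            dimnames = list(paste0("t=", 1:4), LETTERS[1:4]))
x[lower.tri(x, diag = TRUE)] <- c(
  0.91897737, 0.78213630, 0.07456498, -1.98935170,  # column A, t = 1..4
  0.61982575, -0.05612874, -0.15579551,             # column B, t = 2..4
  -1.4707524, -0.4781501,                           # column C, t = 3..4
  0.4179416                                         # column D, t = 4
)

# Hypothetical helper: m[t, j] is the probability that option j is the
# winner at period t, summed over all paths that end in j at that period.
leader_probs <- function(x) {
  n <- nrow(x)
  m <- matrix(0, n, n, dimnames = dimnames(x))
  m[1, 1] <- 1
  for (period in seq_len(n)[-1]) {
    prev <- seq_len(period - 1)
    # Probability that each earlier option beats the newcomer in this period
    stay <- exp(x[period, prev]) /
      (exp(x[period, prev]) + exp(x[period, period]))
    m[period, prev] <- m[period - 1, prev] * stay
    # The newcomer wins whenever the previous leader loses
    m[period, period] <- sum(m[period - 1, prev] * (1 - stay))
  }
  m
}

m <- leader_probs(x)
print(round(m, 4))
```

Each row of m sums to one, and the work grows quadratically in the number of periods instead of exponentially. The trade-off is that individual path probabilities are no longer available, only the marginal probability of each option leading in each period.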

Related

Replacing pair of element of symmetric matrix with NA

I have a positive definite symmetric matrix. Pasting the matrix generated using the following code:
library(clusterGeneration)  # provides genPositiveDefMat()
set.seed(123)
m <- genPositiveDefMat(
dim = 3,
covMethod = "unifcorrmat",
rangeVar = c(0,1) )
x <- as.matrix(m$Sigma)
diag(x) <- 1
x
#Output
[,1] [,2] [,3]
[1,] 1.0000000 -0.2432303 -0.4110525
[2,] -0.2432303 1.0000000 -0.1046602
[3,] -0.4110525 -0.1046602 1.0000000
Now, I want to run the matrix through iterations and in each iteration I want to replace the symmetric pair with NA. For example,
Iteration 1:
x[1,2] = x[2,1] <- NA
Iteration 2:
x[1,3] = x[3,1] <- NA
and so on....
My idea was to check using a for loop
Prototype:
for( r in 1:nrow(x)
for( c in 1:ncol(x)
if x[r,c]=x[c,r]<-NA
else
x[r,c]
The issue with my code is for row 1 and column 1, the values are equal hence it sets to 0 (which is wrong). Also, the moment it is not NA it comes out of the loop.
Appreciate any help here.
Thanks
If you need the replacement done iteratively, you can use the indexes of values represented by upper.tri(x)/lower.tri to do the replacements pair-by-pair. That will allow you to pass the results to a function before/after each replacement, e.g.:
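mat <- x  # the matrix from the question
# row/column positions of the lower-triangle cells
idx <- which(lower.tri(mat), arr.ind=TRUE)
# linear indices of each lower-triangle cell and of its mirrored upper-triangle cell
sel <- cbind(
  replace(mat, , seq_along(mat))[ idx ],
  replace(mat, , seq_along(mat))[ idx[,2:1] ]
)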
# [,1] [,2]
#[1,] 2 4 ##each row represents the lower/upper pair
#[2,] 3 7
#[3,] 6 8
for( i in seq_len(nrow(sel)) ) {
mat[ sel[i,] ] <- NA
print(mean(mat, na.rm=TRUE))
}
#[1] 0.2812249
#[1] 0.5581359
#[1] 1
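A variation on the same idea, given only as a sketch: combn() can enumerate the index pairs (i, j) with i < j directly, which some readers may find easier to follow than linear indices. The matrix values are copied from the output printed in the question, and the variable names (pairs, means) are illustrative.

```r
# Symmetric matrix with the values printed in the question
x <- matrix(c( 1.0000000, -0.2432303, -0.4110525,
              -0.2432303,  1.0000000, -0.1046602,
              -0.4110525, -0.1046602,  1.0000000),
            nrow = 3, byrow = TRUE)

# Each column of `pairs` is one (i, j) pair with i < j
pairs <- combn(nrow(x), 2)
means <- numeric(ncol(pairs))

for (k in seq_len(ncol(pairs))) {
  i <- pairs[1, k]
  j <- pairs[2, k]
  # Blank out both halves of the symmetric pair
  x[i, j] <- x[j, i] <- NA
  # Any per-iteration computation can go here; the mean mirrors the answer above
  means[k] <- mean(x, na.rm = TRUE)
}
print(means)
```

The replacements here are cumulative, as in the loop above. To blank out only one pair per iteration, work on a fresh copy of the matrix inside the loop instead.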

Calculate probability of observing sequence using markovchain package

Let's use the dataset from this question:
dat<-data.frame(replicate(20,sample(c("A", "B", "C","D"), size = 100, replace=TRUE)))
Then we can build the transition matrix and the markov chain:
# Build transition matrix
trans.matrix <- function(X, prob = TRUE)
{
  # count transitions from each column to the next one
  tt <- table( c(X[,-ncol(X)]), c(X[,-1]) )
  if(prob) tt <- tt / rowSums(tt)
  tt
}
trans.mat <- trans.matrix(as.matrix(dat))
attributes(trans.mat)$class <- 'matrix'
# Build markovchain
library(markovchain)
chain <- new('markovchain', transitionMatrix = trans.mat)
If I now encounter a new sequence, let's say AAABCAD can I then calculate the probability of observing this sequence given this markovchain?
I cannot see a function in markovchain exactly for that, but it can be easily done manually too. There's one caveat though: the transition matrix does not provide the probability of observing the first A, which needs to be provided by you. Let it be 0.25, as it would be if all four states were equally likely (which is true in your example).
Then the transitions in the observed chain can be obtained with
obs <- strsplit("AAABCAD", "")[[1]]  # the observed sequence as a character vector
cbind(head(obs, -1), obs[-1])
# [,1] [,2]
# [1,] "A" "A"
# [2,] "A" "A"
# [3,] "A" "B"
# [4,] "B" "C"
# [5,] "C" "A"
# [6,] "A" "D"
Probabilities for each of those transitions then are
trans.mat[cbind(head(obs, -1), obs[-1])]
# [1] 0.2268722 0.2268722 0.2268722 0.2926316 0.2791165 0.2665198
and the final answer is 0.25 * (the product of the above vector):
0.25 * prod(trans.mat[cbind(head(obs, -1), obs[-1])])
# [1] 6.355069e-05
For comparison, we may estimate this probability by generating many chains of length 7:
dat <- replicate(2000000, paste(sample(c("A", "B", "C", "D"), size = 7, replace = TRUE), collapse = ""))
mean(dat == "AAABCAD")
# [1] 6.55e-05
Looks close enough!
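The manual calculation above can be wrapped in a small helper. This is only a sketch in base R: sequence_prob is a made-up name (not a function of the markovchain package), and the two-state transition matrix and initial distribution are toy values chosen so the result is easy to check by hand.

```r
# Hypothetical helper: probability of an observed state sequence under a
# first-order Markov chain with transition matrix `trans` (rows = from,
# columns = to, both named by state) and initial distribution `init`.
sequence_prob <- function(obs, trans, init) {
  # One row per transition: from-state, to-state
  steps <- cbind(head(obs, -1), obs[-1])
  unname(init[obs[1]]) * prod(trans[steps])
}

# Toy chain, purely illustrative
states <- c("A", "B")
toy_trans <- matrix(c(0.9, 0.1,
                      0.4, 0.6),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(states, states))
toy_init <- c(A = 0.5, B = 0.5)

# 0.5 (start in A) * 0.9 (A -> A) * 0.1 (A -> B) * 0.4 (B -> A)
p <- sequence_prob(c("A", "A", "B", "A"), toy_trans, toy_init)
print(p)
```

For long sequences the product can underflow, so summing log-probabilities instead of multiplying is a common alternative.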

Find closest value with condition

I have a function that finds me the nearest values for each row in a matrix. It then reports a list with an index of the nearest rows. However, I want it to exclude values if they are +1 in the first AND +1 in the second column away from a particular set of values (-1 in the first and -1 in the second column should also be removed). Moreover, +1 in first column and -1 in second column with respect to the values of interest should also be avoided.
As an example, if I want things closest to c(2, 1), it should accept c(3,1) or c(2,2) or c(1,1), but NOT c(3,2) and not c(1,0).
Basically, for an output to be reported either column 1 or column 2 should be a value of 1 away from a row of interest, but not both.
input looks like this
x
v1 v2
[1,] 3 1
[2,] 2 1
[3,] 3 2
[4,] 1 2
[5,] 8 5
myfunc(x)
The output looks like this. Notice that the closest thing to row 2 ($V2 in output) is row 1,3,4. The answer should only be 1 though.
$V1
[1] 2 3
$V2
[1] 1 3 4
$V3
[1] 1 2
$V4
[1] 2
$V5
integer(0)
Here is myfunc
myfunc = function(t){
d1 <- dist(t[,1])
d2 <- dist(t[,2])
dF <- as.matrix(d1) <= 1 & as.matrix(d2) <= 1
diag(dF) <- NA
colnames(dF) <- NULL
dF2 <- lapply(as.data.frame(dF), which)
return(dF2)
}
Basically, the rows that you want to find should differ from your reference element by +1 or -1 in one column and be identical in the other column. That means that the sum over the absolute values of the differences is exactly one. For your example c(2, 1), this works as follows:
c(3, 1): difference is c(1, 0), thus sum(abs(c(1, 0))) = 1 + 0 = 1
c(1, 1): difference is c(-1, 0), thus sum(abs(c(-1, 0))) = 1 + 0 = 1
etc.
The following function checks exactly this:
myfunc <- function(x) {
do_row <- function(r) {
r_mat <- matrix(rep(r, length = length(x)), ncol = ncol(x), byrow = TRUE)
abs_dist <- abs(r_mat - x)
return(which(rowSums(abs_dist) == 1))
}
return(apply(x, 1, do_row))
}
do_row() does the job for a single row, and then apply() is used to do this with each row. For your example, I get:
myfunc(x)
## [[1]]
## [1] 2 3
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 1
##
## [[4]]
## integer(0)
##
## [[5]]
## integer(0)
Using sweep(), one can write a shorter function:
myfunc2 <- function(x) {
apply(x, 1, function(r) which(rowSums(abs(sweep(x, 2, r))) == 1))
}
But this seems harder to understand, and it turns out to be slower by about a factor of two for your matrix x. (I have also tried it with a large matrix, and there the efficiency seems about the same.)
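Since the condition is "sum of absolute differences equals one", another option is to let dist() compute all pairwise Manhattan distances at once. The following is a sketch only, using the example matrix from the question; the name neighbours is illustrative, and the exact comparison with 1 relies on the data being whole numbers.

```r
# Example input from the question
x <- cbind(v1 = c(3, 2, 3, 1, 8),
           v2 = c(1, 1, 2, 2, 5))

# Pairwise Manhattan (city-block) distances between all rows
d <- as.matrix(dist(x, method = "manhattan"))

# For each row, the indices of rows at distance exactly 1
neighbours <- lapply(seq_len(nrow(d)), function(i) unname(which(d[i, ] == 1)))
print(neighbours)
```

This builds the full n-by-n distance matrix, so it trades memory for brevity; for very large inputs the row-by-row version above may be the safer choice.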

R : randomly divide 10 data values into a group of 5 and a group of 5

I would like to find all the possible ways to divide 10 data values into 2 groups of 5.
If I'm right, there are 252 possibilities:
choose(10,5)
252
How can I do it with R?
Thanks !
Here's one possibility:
a <- letters[1:10]
split1 <- combn(a, 5)  # each column is one group of 5
split2 <- apply(split1, 2, function(x) a[!a %in% x])  # the complementary group of 5
Pick a random one:
set.seed(1)
rnd <- sample(1:ncol(split1), size=1)
split1[, rnd]; split2[, rnd]
# [1] "a" "c" "d" "g" "i"
# [1] "b" "e" "f" "h" "j"
So I will explain in detail what I have to do.
I have 2 sets of data, each with wild-type (wt) and mutant values:
cellular_wt = c(1.1656,0.9577,1.3655,0.9016,0.9336)
cellular_mutant = c(2.8896,5.7018,3.595,1.6998,1.8893)
secreted_wt = c(7.8491,6.1546,5.1972,6.1607,5.928)
secreted_mutant = c(4.6801,3.2418,3.6651,3.0678,2.3221)
mean_cellular_wt <- mean(cellular_wt)
mean_cellular_mutant <- mean(cellular_mutant)
mean_secreted_wt <- mean(secreted_wt)
mean_secreted_mutant <- mean(secreted_mutant)
mean_secreted_wt / mean_cellular_wt          # 5.877085
mean_secreted_mutant / mean_cellular_mutant  # 1.076156
mean_ratio <- (mean_secreted_wt / mean_cellular_wt) / (mean_secreted_mutant / mean_cellular_mutant)  # 5.46
I want to run a randomization test on these data to test the significance of the mean ratio.
To do so, I would like to randomly divide these 10 values (cellular_wt + cellular_mutant, and secreted_wt + secreted_mutant) into 2 groups of 5 (like the initial data sets), and calculate the mean ratio each time.
In this way, I can see whether the observed ratio of 5.46 seems unusually large by comparing it to the 252 ratios that could have been seen due to random assignment alone. Does that make sense?
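One way the procedure described above could look in code is sketched below. It rests on an assumption the post does not state: that the i-th cellular value and the i-th secreted value come from the same sample, so both measurements are reassigned together. The helper name ratio_of_ratios and the one-sided p-value are illustrative choices, not a definitive analysis.

```r
# Pooled data: positions 1-5 are wt, positions 6-10 are mutant
cellular <- c(1.1656, 0.9577, 1.3655, 0.9016, 0.9336,
              2.8896, 5.7018, 3.595, 1.6998, 1.8893)
secreted <- c(7.8491, 6.1546, 5.1972, 6.1607, 5.928,
              4.6801, 3.2418, 3.6651, 3.0678, 2.3221)

# Hypothetical helper: treat the samples in `idx` as "wt" and the rest as
# "mutant", then compute the ratio of secreted/cellular mean ratios
ratio_of_ratios <- function(idx) {
  (mean(secreted[idx]) / mean(cellular[idx])) /
    (mean(secreted[-idx]) / mean(cellular[-idx]))
}

observed <- ratio_of_ratios(1:5)  # the original labelling

# All choose(10, 5) = 252 ways to pick which 5 samples are labelled "wt"
splits <- combn(10, 5)
perm_ratios <- apply(splits, 2, ratio_of_ratios)

# One-sided p-value: share of relabellings at least as extreme as observed
p_value <- mean(perm_ratios >= observed)
print(c(observed = observed, p_value = p_value))
```

Because the original labelling is itself one of the 252 splits, the p-value can never be exactly zero. If the cellular and secreted measurements are not paired, the two sets of values would need to be split independently instead.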

How to analyze a data frame and get another with values greater than 0.05?

I have a data frame with different p-values, including missing values (NA):
pvalue2=pvalue[1:679,3:10]
I need to go through it and write "Normal" for numbers greater than 0.05, and keep the value itself for numbers less than 0.05. I want the result to be written to another data frame.
This is my code:
a=data.frame()
for (i in 1:nrow(pvalue2)) {
for (j in 1:ncol(pvalue2)){
if (pvalue2[i,j] >=0.05) {
print (a[i,j]=="Normal")
} else {print a[i,j]==pvalue2[i,j] }
}
}
Can someone help me, please?
a <- ifelse(as.matrix(pvalue2) < .05, as.matrix(pvalue2), "normal")
a <- as.data.frame(a)
Since R is a high-level language that is not compiled, for loops have a tendency to get very slow as they grow. By using vectorized functions instead (which do the looping internally in a lower-level language), you speed up the code and make it more readable.
Example run
> set.seed(123)
> pvalue2 <- matrix(runif(18)/10, 6, 3)
> pvalue2[sample(length(pvalue2), 4)] <- NA
> pvalue2 <- as.data.frame(pvalue2)
> pvalue2
V1 V2 V3
1 0.02875775 0.05281055 0.067757064
2 0.07883051 0.08924190 0.057263340
3 0.04089769 0.05514350 NA
4 0.08830174 0.04566147 0.089982497
5 0.09404673 NA NA
6 NA 0.04533342 0.004205953
> ifelse(as.matrix(pvalue2) < .05, as.matrix(pvalue2), "normal")
V1 V2 V3
[1,] "0.0287577520124614" "normal" "normal"
[2,] "normal" "normal" "normal"
[3,] "0.04089769218117" "normal" NA
[4,] "normal" "0.0456614735303447" "normal"
[5,] "normal" NA NA
[6,] NA "0.0453334156190977" "0.00420595335308462"
I suppose your p-values are stored as factors. You need to convert them to numeric values first.
tmp <- sapply(pvalue2, function(x) as.numeric(as.character(x)))
Now, the object tmp can be used:
# copy the existing data frame to a new object
df2 <- pvalue2
# fill it with "Normal"
df2[ , ] <- "Normal"
# replace with values from tmp if value < 0.05
df2[tmp < 0.05] <- pvalue2[tmp < 0.05]
Assuming your first data frame is called df:
df_2 <- data.frame(matrix(nrow = nrow(df), ncol = ncol(df)))
for (i in seq_len(ncol(df))) {
  # `&` is the element-wise AND; `&&` only works on single values
  df_2[, i] <- ifelse(!is.na(df[, i]) & df[, i] >= 0.05, "Normal",
                      ifelse(!is.na(df[, i]) & df[, i] < 0.05, df[, i], NA))
}
set.seed(42)
df <- data.frame(a=runif(10,0,0.1),b=runif(10,0,0.1))
#since there are only numeric values
#you can transform to matrix
m <- as.matrix(df)
#new matrix
m2 <- m
m2[m>0.05] <- "Normal"
df2 <- as.data.frame(m2)
a b
1 Normal 0.045774177624844
2 Normal Normal
3 0.0286139534786344 Normal
4 Normal 0.0255428824340925
5 Normal 0.0462292822543532
6 Normal Normal
7 Normal Normal
8 0.013466659723781 0.0117487361654639
9 Normal 0.0474997081561014
10 Normal Normal
