Building a table of probability transition matrices with plyr

I'm trying to model a system of continuous-time Markov chains where different time intervals have different rates.
I build a rate matrix for each time period like this
make.rate.matrix <- function(c1, c2, m12, m21) {
  matrix(
    c(# State 1: lineages in different populations
      -(m12+m21), m21, m12, 0,
      # State 2: both lineages in population 1
      2*m12, -(2*m12+c1), 0, c1,
      # State 3: both lineages in population 2
      2*m21, 0, -(2*m21+c2), c2,
      # State 4: coalesced (catches both populations; absorbing)
      0, 0, 0, 0),
    byrow = TRUE, nrow = 4)
}
(if you are interested, it models the coalescence density in a two-deme system with migration)
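As a quick sanity check (my addition, not from the original post): every row of a valid rate matrix sums to zero, which is easy to verify here.
rowSums(make.rate.matrix(1, 2, 0.2, 0.4))
# [1] 0 0 0 0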
The rates, the cs and ms, differ across time periods, so I want to build a rate matrix for each time period and then a transition probability matrix for each period.
With two periods I can specify the rates like this
rates <- data.frame(c1 = c(1,2), c2 = c(2,1), m12 = c(0.2, 0.3), m21 = c(0.4, 0.2))
and I want to use the first rates from time 0 to t and the second set of rates from time t to s, say.
So I want to have a table of rate matrices for the first and second period, and probability transition matrices for moving from state a to b through the first and second period.
library(plyr)
mlply(rates, make.rate.matrix)
gives me a list of the two rate matrices, and if I want a table where I can easily look up the rate matrices, I can do something like
> xx <- array(unlist(mlply(rates, make.rate.matrix)), dim=c(4,4,2))
> xx[,,1]
[,1] [,2] [,3] [,4]
[1,] -0.6 0.4 0.2 0
[2,] 0.4 -1.4 0.0 1
[3,] 0.8 0.0 -2.8 2
[4,] 0.0 0.0 0.0 0
> xx[,,2]
[,1] [,2] [,3] [,4]
[1,] -0.5 0.2 0.3 0
[2,] 0.6 -2.6 0.0 2
[3,] 0.4 0.0 -1.4 1
[4,] 0.0 0.0 0.0 0
I can then get the probability transition matrices like
> library(Matrix)
> t <- 1; s <- 2
> P1 <- expm(xx[,,1] * t)
> P2 <- expm(xx[,,2] * (s - t))
but I somehow cannot figure out how to get a table of these like I can get for the rate matrices.
I feel that I should be able to get there with aaply, but I am stumped as to how to get there.
How do I get a table P, where P[,,1] is P1 and P[,,2] is P2?
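One approach (a sketch rather than necessarily the aaply idiom you had in mind; durations is a helper name introduced here) is to map expm over the slices of xx with vapply, which rebuilds the 4 x 4 x 2 array for you:
library(Matrix)
durations <- c(t, s - t)  # length of each time period
P <- vapply(seq_len(dim(xx)[3]),
            function(i) as.matrix(expm(xx[,,i] * durations[i])),
            FUN.VALUE = matrix(0, 4, 4))
# P[,,1] is P1 and P[,,2] is P2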

Related

Create fully antithetic draws using R

Say I have the matrix d, which is the result of two different realizations (rows) of a sampling procedure in two dimensions (columns). I want to develop a function that creates the fully antithetic draws from this original matrix.
c1 <- c(0.1, 0.6); c2 <- c(0.3, 0.8); d <- rbind(c1, c2)
# [,1] [,2]
# c1 0.1 0.6
# c2 0.3 0.8
That is to say, for example, for the first realization (c(0.1, 0.6)) I want to obtain the mirror images of this random draw in two dimensions, which generates 4 (2^2) possible combinations as follows:
d1_anthi <- matrix(
  c(    d[1,1],     d[1,2],
    1 - d[1,1],     d[1,2],
        d[1,1], 1 - d[1,2],
    1 - d[1,1], 1 - d[1,2]), nrow = 2, ncol = 4)
t(d1_anthi)
# [,1] [,2]
# [1,] 0.1 0.6
# [2,] 0.9 0.6
# [3,] 0.1 0.4
# [4,] 0.9 0.4
Analogously, for the second realization the result is the following:
d2_anthi <- matrix(
  c(    d[2,1],     d[2,2],
    1 - d[2,1],     d[2,2],
        d[2,1], 1 - d[2,2],
    1 - d[2,1], 1 - d[2,2]), nrow = 2, ncol = 4)
t(d2_anthi)
# [,1] [,2]
# [1,] 0.3 0.8
# [2,] 0.7 0.8
# [3,] 0.3 0.2
# [4,] 0.7 0.2
Accordingly, my desired object looks like this:
anthi_draws <- rbind(t(d1_anthi), t(d2_anthi))
# [,1] [,2]
# [1,] 0.1 0.6 <- original first realization
# [2,] 0.9 0.6
# [3,] 0.1 0.4
# [4,] 0.9 0.4
# [5,] 0.3 0.8 <- original second realization
# [6,] 0.7 0.8
# [7,] 0.3 0.2
# [8,] 0.7 0.2
Finally, I would like to create a function that, given a matrix of random numbers, is able to create this expanded matrix of antithetic draws. For example, in the picture below I have a sampling in three dimensions, so the total number of draws per original draw is 2^3 = 8.
In particular, I am having problems with creating the full set of combinations, which depends on the dimensions of the original sampling (columns of the matrix). I was planning on using expand.grid() but I couldn't create the full combinations with it. Any hints or help in order to create such a function are welcome. Thank you in advance.
You can try this
do.call(
  rbind,
  apply(
    d,
    1,
    function(x) {
      expand.grid(data.frame(rbind(x, 1 - x)))
    }
  )
)
which gives
X1 X2
c1.1 0.1 0.6
c1.2 0.9 0.6
c1.3 0.1 0.4
c1.4 0.9 0.4
c2.1 0.3 0.8
c2.2 0.7 0.8
c2.3 0.3 0.2
c2.4 0.7 0.2
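If you want this packaged as a function for any number of columns, a small wrapper along these lines should work (a sketch building on the answer above; antithetic_draws is my own name, not from the original answer):
antithetic_draws <- function(d) {
  do.call(rbind, lapply(seq_len(nrow(d)), function(i) {
    x <- d[i, ]
    # each column can stay x_j or flip to 1 - x_j;
    # expand.grid enumerates all 2^ncol(d) combinations
    as.matrix(expand.grid(data.frame(rbind(x, 1 - x))))
  }))
}
antithetic_draws(d)  # 8 x 2 here; a 3-column input gives 2^3 = 8 rows per draw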

unexpected result from cmdscale in R

I'm afraid I'm missing something obvious, but I just can't see what I am doing wrong.
If anyone can help me find it, please, it would be great.
Here's the full, symmetrical distance matrix I'm starting from:
d2 <- structure(list(P1 = c(0, 0.1, 0.3, 0.2, 0, 0.1), P2 = c(0.1,
0, 0.5, 0.7, 1, 0.9), P3 = c(0.3, 0.5, 0, 1, 0.2, 0.3), P4 = c(0.2,
0.7, 1, 0, 0.2, 0.5), P5 = c(0, 1, 0.2, 0.2, 0, 0.7), P6 = c(0.1,
0.9, 0.3, 0.5, 0.7, 0)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L))
sum(abs(d2-t(d2)))
#[1] 0
I want to generate coordinates for the corresponding 6 points, so that the (euclidean) distance matrix resulting from those coordinates is as close as possible to my d2.
From the cmdscale documentation:
A set of Euclidean distances on n points can be represented exactly in at most n - 1 dimensions.
I would have thought (n-1)/2 dimensions would suffice, and indeed, when I run cmdscale, if I go anywhere higher than k=3 I get something close to 0 for the higher coordinates, or even warnings:
cmdscale(d2,k=3)
# [,1] [,2] [,3]
#[1,] -0.03526127 0.07755701 1.708755e-05
#[2,] -0.50626939 0.31256816 -5.646907e-02
#[3,] -0.26333957 -0.40518119 -6.978213e-02
#[4,] 0.35902238 0.37455879 2.148406e-02
#[5,] 0.33997864 -0.17998635 -2.809260e-01
#[6,] 0.10586921 -0.17951643 3.856760e-01
cmdscale(d2,k=4)
# [,1] [,2] [,3] [,4]
#[1,] -0.03526127 0.07755701 1.708755e-05 -7.450581e-09
#[2,] -0.50626939 0.31256816 -5.646907e-02 -7.450581e-09
#[3,] -0.26333957 -0.40518119 -6.978213e-02 -7.450581e-09
#[4,] 0.35902238 0.37455879 2.148406e-02 -7.450581e-09
#[5,] 0.33997864 -0.17998635 -2.809260e-01 -7.450581e-09
#[6,] 0.10586921 -0.17951643 3.856760e-01 -7.450581e-09
cmdscale(d2,k=5)
# [,1] [,2] [,3] [,4]
#[1,] -0.03526127 0.07755701 1.708755e-05 -7.450581e-09
#[2,] -0.50626939 0.31256816 -5.646907e-02 -7.450581e-09
#[3,] -0.26333957 -0.40518119 -6.978213e-02 -7.450581e-09
#[4,] 0.35902238 0.37455879 2.148406e-02 -7.450581e-09
#[5,] 0.33997864 -0.17998635 -2.809260e-01 -7.450581e-09
#[6,] 0.10586921 -0.17951643 3.856760e-01 -7.450581e-09
#Warning message:
#In cmdscale(d2, k = 5) : only 4 of the first 5 eigenvalues are > 0
So, assuming that k=3 is sufficient, this is what happens when I try to reverse the operation:
dd <- dist(cmdscale(d2,k=3),diag = T,upper = T)
dd
# 1 2 3 4 5 6
#1 0.0000000 0.5294049 0.5384495 0.4940956 0.5348482 0.4844970
#2 0.5294049 0.0000000 0.7578630 0.8710048 1.0045529 0.9013064
#3 0.5384495 0.7578630 0.0000000 1.0018275 0.6777074 0.6282371
#4 0.4940956 0.8710048 1.0018275 0.0000000 0.6319294 0.7097335
#5 0.5348482 1.0045529 0.6777074 0.6319294 0.0000000 0.7065166
#6 0.4844970 0.9013064 0.6282371 0.7097335 0.7065166 0.0000000
Which is quite different from what I expected:
as.matrix(dd)-d2
# P1 P2 P3 P4 P5 P6
#1 0.0000000 0.429404930 0.238449457 0.294095619 0.534848178 0.384497043
#2 0.4294049 0.000000000 0.257862963 0.171004810 0.004552925 0.001306386
#3 0.2384495 0.257862963 0.000000000 0.001827507 0.477707386 0.328237091
#4 0.2940956 0.171004810 0.001827507 0.000000000 0.431929428 0.209733518
#5 0.5348482 0.004552925 0.477707386 0.431929428 0.000000000 0.006516573
#6 0.3844970 0.001306386 0.328237091 0.209733518 0.006516573 0.000000000
sum(abs(as.matrix(dd)-d2))
#[1] 7.543948
Has anyone got any idea why the two distance matrices don't match at all?
I could try building my own least-squares problem to find the coordinates, but first I need to understand if I'm doing something wrong with these out-of-the-box R functions.
Thanks!
EDIT: possible inconsistency in the data found
Could the issue be that according to d2 points 1 and 5 coincide (they have distance 0):
as.matrix(d2)
# P1 P2 P3 P4 P5 P6
#[1,] 0.0 0.1 0.3 0.2 0.0 0.1
#[2,] 0.1 0.0 0.5 0.7 1.0 0.9
#[3,] 0.3 0.5 0.0 1.0 0.2 0.3
#[4,] 0.2 0.7 1.0 0.0 0.2 0.5
#[5,] 0.0 1.0 0.2 0.2 0.0 0.7
#[6,] 0.1 0.9 0.3 0.5 0.7 0.0
but then these two points have different distances from other points, e.g. d(1-2) is 0.1 whereas d(5-2) is 1?
Replacing the two 0's does not seem to help though:
d3 <- d2
d3[1,5] <- 0.2
d3[5,1] <- 0.2
dd3 <- cmdscale(as.matrix(d3),k=3)
sum(abs(as.matrix(dist(dd3))-as.matrix(d3)))
#[1] 7.168348
Does this perhaps indicate that not all distance matrices can be reduced to a completely consistent set of points, regardless of how many dimensions one uses?
EDIT 2: possible answer to the last question.
I suspect that the answer is yes. And I was wrong about the number of dimensions: I see now why you need n-1 rather than half that.
If I have a distance d(A-B) = 1, I can represent that in 2-1 = 1 dimensions (the x axis), i.e. on a line, placing A at xA = 0 and B at xB = 1.
Then I introduce a third point C and I state that d(A-C) = 2.
I have 3 points, so I need 3-1 = 2 dimensions (xy plane).
The constraint given by d(A-C) is:
(xC - 0)^2 + (yC - 0)^2 = d(A-C)^2 = 4.
i.e. C can be anywhere on a circle of radius 2 centred at A.
This constrains both xC and yC to be in [-2,2].
However, previously I had not considered that this constrains the possible values of d(B-C), too, because:
d(B-C)^2 = (xC - 1)^2 + (yC - 0)^2
thus, by substitution of the (yC - 0)^2 term:
d(B-C)^2 = (xC - 1)^2 + 4 - (xC - 0)^2 = -2*xC + 5
d(B-C)^2 is therefore bound to [-2*(+2)+5,-2*(-2)+5] = [1,9].
So if my distance matrix contained d(A-B) = 1, d(A-C) = 2 and d(B-C) anywhere outside [1,3], it would describe a system that does not correspond to 3 points in Euclidean space. (This is just the triangle inequality: |d(A-C) - d(A-B)| <= d(B-C) <= d(A-C) + d(A-B).)
At least, I hope this makes sense.
So I guess my original question must be withdrawn.
I thought I'd leave the reasoning here for future reference or if anyone else should have the same doubt.
Multidimensional scaling creates coordinates for the specified number of dimensions such that they will represent the distances in the original matrix as closely as possible. But the distances will be at different scales. In your example, d3 is the original distance matrix, dd3 is the matrix of coordinates, and dist(dd3) is the distance matrix from the reconstructed coordinates. The values are different, but they reflect the same relationships between points:
d3.v <- as.vector(as.dist(d3)) # Vector of original distances
dd3.v <- as.vector(dist(dd3)) # Vector of distances computed from coordinates
cor(d3.v, dd3.v)
# [1] 0.9433903
plot(d3.v, dd3.v, pch=16)
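As a further check (my addition, not part of the original answer): classical scaling can reproduce the distances exactly only if all eigenvalues of the doubly centred matrix are non-negative, and cmdscale returns them when asked:
ev <- cmdscale(d2, eig = TRUE)$eig
ev  # clearly negative values mean d2 is not a Euclidean distance matrix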

Efficient way to generate a coincidence matrix

I want to generate a simple coincidence matrix. I've looked for R packages but could not find one that does this calculation so far; I don't know if the English term for this matrix is different from the Portuguese one... so, here is what I need to do.
I have a matrix:
[,1] [,2] [,3] [,4]
[1,] 1 1 2 1
[2,] 1 2 3 1
[3,] 2 3 1 2
[4,] 1 2 3 3
A coincidence matrix is calculated by comparing the rows element by element to generate a dissimilarity distance with the formula:
Diss = 1 - (Coincidences / (Coincidences + Discordances))
So my resulting matrix is a symmetric one with dim 4x4 and diagonal elements equal to 0; in the example my A(1,2) would be:
A(1,2) = 1 - (2 / 4) = 0.5
A(1,3) = 1 - (0/4) = 1.0
And so on...
I have created a function to generate this matrix:
cs_matrix <- function(x) {
  cs.mat <- matrix(rep(0, dim(x)[1]^2), ncol = dim(x)[1])
  for (i in 1:dim(x)[1]) {
    for (j in 1:dim(x)[1]) {
      cs.mat[i,j] <- 1 - (sum(x[i,] == x[j,]) / dim(x)[2])
    }
  }
  return(cs.mat)
}
The function works fine, but my actual Data Set has 2560 observations of 4 variables, thus generating a 2560 x 2560 coincidence matrix, and it takes quite some time to do the calculation. I wonder if there is a more efficient way of calculating this or even if there is already a package that can calculate this dissimilarity distance. This matrix will be later used in Cluster Analysis.
I think you can use outer
add <- function(x, y) sum(mat[x, ] == mat[y,])
nr <- seq_len(nrow(mat))
mat1 <- 1 - outer(nr, nr, Vectorize(add))/ncol(mat)
mat1
# [,1] [,2] [,3] [,4]
#[1,] 0.00 0.50 1 0.75
#[2,] 0.50 0.00 1 0.25
#[3,] 1.00 1.00 0 1.00
#[4,] 0.75 0.25 1 0.00
If diagonal elements need to be 1 do diag(mat1) <- 1.
data
mat <- structure(c(1, 1, 2, 1, 1, 2, 3, 2, 2, 3, 1, 3, 1, 1, 2, 3), .Dim = c(4L,4L))
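If the outer/Vectorize version is still slow for 2560 rows, here is a sketch of a further speedup (my suggestion, not from the original answer; fast_cs is a hypothetical name): count the row-pair matches with one tcrossprod per distinct value instead of comparing rows pairwise in R code.
fast_cs <- function(x) {
  vals <- unique(as.vector(x))
  # tcrossprod(x == v) counts, for each pair of rows, the columns where both equal v;
  # summing over all values v gives the total coincidences per row pair
  matches <- Reduce(`+`, lapply(vals, function(v) tcrossprod(x == v)))
  1 - matches / ncol(x)
}
fast_cs(mat)  # same values as mat1 above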

use function code to produce an n×n matrix

In order to produce the matrix in the picture, I tried to write a function to do this, but I cannot figure out what to do next, and I'm also not sure if what I have done so far is right or not.
Matrix <- function(n){
  mat1 <- diag(x = ((1:n) - 1) / ((1:n) + 1), n, n)[-1, ]
  mat2 <- diag(x = ((1:n) - (1:n) + 1) / ((1:n) + 1), n, n)[, -1]
  mat3 <- diag(x = 1 / ((1:n) + 1), n, n)
}
An option:
library(SoDA)
n <- 4
triDiag(diagonal = rep(1/(n+1), n+1),
        upper = (n:1)/(n+1),
        lower = (1:n)/(n+1))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.2 0.8 0.0 0.0 0.0
# [2,] 0.2 0.2 0.6 0.0 0.0
# [3,] 0.0 0.4 0.2 0.4 0.0
# [4,] 0.0 0.0 0.6 0.2 0.2
# [5,] 0.0 0.0 0.0 0.8 0.2
It is not entirely clear what you are trying to achieve.
From your description the matrix will be (n+1) x (n+1), with off-diagonal values running from 1/(n+1) to n/(n+1), and I assume the rest of the matrix is sparse. It is not a simple structure to achieve via vectorized computations, but it can be filled in a single for loop, so the loop itself runs in O(n) time for a matrix of size n+1.
The code below presents an example. The idea is to traverse the sub- and super-diagonals in opposite directions, assigning one value to each per iteration.
Create_Matrix <- function(n){
  n1 = n + 1  # last row/column index, so n + 1 is not recomputed
  n2 = n1 + 1
  output <- diag(1/n1, nrow = n1, ncol = n1)
  for(i in seq(n)){
    # fill the sub-diagonal top-down and the super-diagonal bottom-up;
    # output[[1]] is the common factor 1/(n+1)
    output[i + 1, i] = output[n1 - i, n2 - i] = output[[1]] * i
  }
  output
}
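A quick check (my addition) that the loop version reproduces the triDiag output above:
Create_Matrix(4)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]  0.2  0.8  0.0  0.0  0.0
# [2,]  0.2  0.2  0.6  0.0  0.0
# [3,]  0.0  0.4  0.2  0.4  0.0
# [4,]  0.0  0.0  0.6  0.2  0.2
# [5,]  0.0  0.0  0.0  0.8  0.2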

R: calculating the time spent by N individuals within each of M time intervals

There are four time intervals
[0, 3), [3, 10), [10, 12), and [12, Inf)
and three subjects for whom we have survival times
10.3, 0.7, 12.2
I would like to construct a matrix with three rows (one for each individual) and four columns (one for each time interval) that contains the time spent by each individual within each time interval.
For this particular example, we have
3.0 7 0.3 0.0
0.7 0 0.0 0.0
3.0 7 2.0 0.2
Can you help me to obtain this in R? The idea is to apply this for N much larger than 3.
My attempt:
breaks <- c(0, 3, 10, 12, Inf) # interval break points
M <- length(breaks) - 1 # number of intervals
time <- c(10.3, 0.7, 12.2) # observed survival times
N <- length(time) # number of subjects
timeSpent <- matrix(NA, nrow = N, ncol = M)
for (m in 1:M)
{
  ind <- which(breaks[m + 1] - time > 0)
  timeSpent[ind, m] <- time[ind] - breaks[m]
  timeSpent[-ind, m] <- breaks[m + 1] - breaks[m]
}
timeSpent <- replace(x = timeSpent, list = timeSpent < 0, values = 0)
breaks <- c(0, 3, 10, 12, Inf)
time <- c(10.3, 0.7, 12.2)
timeSpent <- sapply(time, function(x) {
  int <- max(which(x > breaks))
  res <- diff(breaks)
  res[int:length(res)] <- 0
  res[int] <- x - breaks[int]
  res
})
t(timeSpent)
# [,1] [,2] [,3] [,4]
#[1,] 3.0 7 0.3 0.0
#[2,] 0.7 0 0.0 0.0
#[3,] 3.0 7 2.0 0.2
This doesn't loop and should be faster. However, a potential problem could be memory demand.
tmp <- t(outer(time, breaks, ">"))
res <- tmp * breaks
res[is.na(res)] <- 0
res <- diff(res)
res[diff(tmp)==-1] <- time+res[diff(tmp)==-1]
t(res)
# [,1] [,2] [,3] [,4]
#[1,] 3.0 7 0.3 0.0
#[2,] 0.7 0 0.0 0.0
#[3,] 3.0 7 2.0 0.2
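Another compact vectorized option (my addition, not from the original answers): the time a subject spends in [l, u) is min(time, u) - min(time, l), which outer and pmin compute for all subjects and intervals at once.
lower <- head(breaks, -1)  # interval lower bounds: 0, 3, 10, 12
upper <- tail(breaks, -1)  # interval upper bounds: 3, 10, 12, Inf
outer(time, upper, pmin) - outer(time, lower, pmin)
#     [,1] [,2] [,3] [,4]
#[1,]  3.0    7  0.3  0.0
#[2,]  0.7    0  0.0  0.0
#[3,]  3.0    7  2.0  0.2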