I am learning R on my own and I am having trouble building a transition probability matrix in RStudio using the markovchain package. First I tried to calculate the transition probabilities of a DNA sequence:
ATTCAACACATCCAGCCACATGCTCCGAGAGGAGGCAGAGGGCCCCCGGAATGATGCTTACCGAGATTCTTGTTTTTATCCTCGTGGTTGTTTAAAAACGAGTTGAAACTGACGGCATGTCGGACTATAAGCTACTTACTCACCATAGACGTGACCATAGGCCCTAAAACGTTACCGAGATATTCACTTCTAATAACAGTTGTCGGCAGAGCCAAAAGGCCGGGTGATAATACTTTAAAAAGGGAGTTGATTGTTGTATCTAATCCTAGAATGTCAAGAGCGACCATAACAAGATAATTCGGCAGAGCCAGAAAGCGTTCAAGGACTAGAACCATACCGAGACGCAAACGTTCAGGTCGAACTCTAATACCGATTAGT
How can the transition probability matrix be calculated for a sequence like this? I was thinking of using R indexes, but I don't really know how to calculate those transition probabilities.
Is there a way of doing this in R?
I am guessing that the output of those probabilities in a matrix should be something like this:
A T C G
A 0.60 0.10 0.10 0.20
T 0.10 0.50 0.30 0.10
C 0.05 0.20 0.70 0.05
G 0.40 0.05 0.05 0.50
You can use the markovchain package for this. First, your data:
seq <- "ATTCAACACATCCAGCCACATGCTCCGAGAGGAGGCAGAGGGCCCCCGGAATGATGCTTACCGAGATTCTTGTTTTTATCCTCGTGGTTGTTTAAAAACGAGTTGAAACTGACGGCATGTCGGACTATAAGCTACTTACTCACCATAGACGTGACCATAGGCCCTAAAACGTTACCGAGATATTCACTTCTAATAACAGTTGTCGGCAGAGCCAAAAGGCCGGGTGATAATACTTTAAAAAGGGAGTTGATTGTTGTATCTAATCCTAGAATGTCAAGAGCGACCATAACAAGATAATTCGGCAGAGCCAGAAAGCGTTCAAGGACTAGAACCATACCGAGACGCAAACGTTCAGGTCGAACTCTAATACCGATTAGT"
Then use the package to split the string into single bases and fit the chain:
library(markovchain)
base_sequence <- strsplit(seq, "")[[1]]
mcX <- markovchainFit(base_sequence)$estimate
mcX
# A C G T
# A 0.3000000 0.2250000 0.2583333 0.2166667
# C 0.2857143 0.2619048 0.2380952 0.2142857
# G 0.3764706 0.1882353 0.2117647 0.2235294
# T 0.3068182 0.2159091 0.1818182 0.2954545
Create DNA
DNA <- "ATTCAACACATCCAGCCACATGCTCCGAGAGGAGGCAGAGGGCCCCCGGAATGATGCTTACCGAGATTCTTGTTTTTATCCTCGTGGTTGTTTAAAAACGAGTTGAAACTGACGGCATGTCGGACTATAAGCTACTTACTCACCATAGACGTGACCATAGGCCCTAAAACGTTACCGAGATATTCACTTCTAATAACAGTTGTCGGCAGAGCCAAAAGGCCGGGTGATAATACTTTAAAAAGGGAGTTGATTGTTGTATCTAATCCTAGAATGTCAAGAGCGACCATAACAAGATAATTCGGCAGAGCCAGAAAGCGTTCAAGGACTAGAACCATACCGAGACGCAAACGTTCAGGTCGAACTCTAATACCGATTAGT"
Split it character by character
DNA_list <- unlist(strsplit(DNA, split = ""))
Retrieve unique elements
DNA_unique <- unique(DNA_list)
Create an empty matrix
matrix <- matrix(0, ncol = length(DNA_unique), nrow=length(DNA_unique))
Fill it: for each element i and its successor i + 1, add one to the corresponding cell of the matrix.
for (i in 1:(length(DNA_list) - 1)) {
  # Logical index of the current base (row) and the next base (column)
  index_of_i <- DNA_unique == DNA_list[i]
  index_of_i_plus_1 <- DNA_unique == DNA_list[i + 1]
  matrix[index_of_i, index_of_i_plus_1] <- matrix[index_of_i, index_of_i_plus_1] + 1
}
Normalize it so that each row sums to 1 (rows and columns follow the order of DNA_unique, here A, T, C, G)
matrix <- matrix / rowSums(matrix)
> matrix
[,1] [,2] [,3] [,4]
[1,] 0.3000000 0.2166667 0.2250000 0.2583333
[2,] 0.3068182 0.2954545 0.2159091 0.1818182
[3,] 0.2857143 0.2142857 0.2619048 0.2380952
[4,] 0.3764706 0.2235294 0.1882353 0.2117647
NB: There might be a faster way to do this if you have a really large DNA sequence to process, but here it seems to be fast enough.
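On the point about speed: the counting loop can be replaced by a single call to table() on the sequence and its one-step shift. This is only a sketch under assumptions: it uses a short prefix of the sequence to stay readable, and the names states, from, to, counts and trans are made up for the illustration.

```r
# A short prefix of the sequence above, to keep the example readable
dna <- "ATTCAACACATCCAGCCACATGCTCCGAGAGG"
bases <- strsplit(dna, "")[[1]]

# Fix the state order so the result is always a square 4 x 4 table
states <- c("A", "C", "G", "T")
from <- factor(head(bases, -1), levels = states)  # base at position i
to   <- factor(tail(bases, -1), levels = states)  # base at position i + 1

# Count every (from, to) pair at once, then divide each row by its total
counts <- table(from, to)
trans  <- prop.table(counts, margin = 1)
round(trans, 3)
```

A state that never occurs in the "from" position has a row total of zero, so its row would come out as NaN; the prefix used here avoids that case.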
I reduced the problem to a small example. I hope it is helpful and understandable.
Given are the two vectors A and B. Each entry in vector A is to be replaced by the entry in vector B that has the smallest absolute difference to it. After replacing all entries in vector A, the new vector should be saved.
Maybe a for loop is a good idea (also for large vectors)?
Thank you very much for your help!
For example:
A <- c(1.2, 1.3, 1.3, 1.4, 1.5)
B <- c(1.25, 1.45)
The for-loop should work like this:
|1.2 - 1.25| = 0.05
|1.2 - 1.45| = 0.25
etc.
0.05 is the absolute minimum, so 1.2 is replaced with 1.25
The new vector should look like this:
newVector <- c(1.25, 1.25, 1.25, 1.45, 1.45)
Another idea using outer and max.col
B[max.col(-abs(outer(A, B, `-`)))]
# [1] 1.25 1.25 1.25 1.45 1.45
This should be fine to use if A and B are not too large.
Step by step:
outer(A, B, "-") returns the following matrix
# [,1] [,2]
#[1,] -0.05 -0.25
#[2,] 0.05 -0.15
#[3,] 0.05 -0.15
#[4,] 0.15 -0.05
#[5,] 0.25 0.05
where the first column is the result of A - B[1] and the second column is A - B[2]. For each row we need to find the column position of the absolute minimum.
There is no min.col function, hence the minus sign in
max.col(-abs(outer(A, B, `-`)))
which returns
# [1] 1 1 1 2 2
We finally use this vector to extract the desired values from B.
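One caveat worth illustrating: by default max.col breaks ties at random, so an entry of A that is exactly halfway between two entries of B may map to either one. A minimal sketch with made-up vectors (not the ones from the question) that forces a deterministic choice through the ties.method argument:

```r
# Made-up vectors where 1 and 3 are exactly halfway between entries of B
A <- c(1, 2, 3)
B <- c(0, 2, 4)

# ties.method = "first" always picks the leftmost column among tied maxima
idx <- max.col(-abs(outer(A, B, `-`)), ties.method = "first")
B[idx]
# [1] 0 2 2
```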
With sapply this can be easily done without a for-loop:
newVector <- sapply(A,function(x) B[which.min(abs(x-B))])
First get the index of the minimum of the difference of all elements of A with B:
(ind <- sapply(A, function(a) which.min(abs(B - a))))
# [1] 1 1 1 2 2
Then replace the values by the respective values in B:
B[ind]
# [1] 1.25 1.25 1.25 1.45 1.45
This form is for illustrative purposes. The short version would be:
sapply(A, function(a) B[which.min(abs(B - a))])
# [1] 1.25 1.25 1.25 1.45 1.45
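For the "large vectors" part of the question, the approaches above compare every entry of A with every entry of B. A possible alternative, sketched here as an assumption rather than a benchmarked recommendation, is to sort B once and locate each entry of A between the midpoints of neighbouring B values with findInterval. The helper name nearest_in is hypothetical.

```r
# Hypothetical helper: nearest value in B for each entry of A
nearest_in <- function(A, B) {
  B_sorted <- sort(B)
  # Midpoints between neighbouring B values mark where the nearest value changes
  mids <- (head(B_sorted, -1) + tail(B_sorted, -1)) / 2
  # findInterval counts how many midpoints lie at or below each entry of A
  B_sorted[findInterval(A, mids) + 1]
}

A <- c(1.2, 1.3, 1.3, 1.4, 1.5)
B <- c(1.25, 1.45)
nearest_in(A, B)
# [1] 1.25 1.25 1.25 1.45 1.45
```

With this sketch an entry of A that falls exactly on a midpoint is assigned to the larger of the two B values.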
I would like to simulate 10000 results from the function below and store the values. It is a function available in the msm package for R.
sim.msm(qmatrix,15)
Result:
$states
[1] 1 2 3 2 3 2 2
$times
[1] 0.000000 1.538988 2.240587 9.695302 11.002184 14.998754 15.000000
$qmatrix
[,1] [,2] [,3]
[1,] -0.11 0.10 0.01
[2,] 0.05 -0.15 0.10
[3,] 0.02 0.07 -0.09
This is only one simulation. I need 10000 like this.
I would be grateful if someone could help me.
replicate repeats the same command N times; with simplify = FALSE the results are returned as a list. Here N = 10:
replicate(10, sim.msm(qmatrix,15), simplify = FALSE)
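To show how the stored results can be used afterwards, here is a sketch that does not require msm: sim_once is a hypothetical stand-in that only mimics the shape of the sim.msm output (a list with states and times), and its random values have no modelling meaning. With the real function, sim_once() would be replaced by sim.msm(qmatrix, 15).

```r
set.seed(1)

# Hypothetical stand-in for sim.msm(qmatrix, 15): same list shape, arbitrary values
sim_once <- function() {
  n <- sample(2:6, 1)
  list(states = sample(1:3, n, replace = TRUE),
       times  = c(0, sort(runif(n - 2, 0, 15)), 15))
}

# Keep all 10000 simulations in one list
sims <- replicate(10000, sim_once(), simplify = FALSE)

# Example of post-processing: the last state of every simulated path
final_state <- vapply(sims, function(s) s$states[length(s$states)], integer(1))
table(final_state)
```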
I have a matrix called "variables" which includes 9 variables (9 columns). I have obtained the pairwise correlation matrix with this code:
matrix.cor <- cor(variables, method="kendall", use="pairwise")
Now I want to obtain the average pairwise correlation as a function of the number of variables considered. That is, the average over all possible combinations of 2 variables, 3 variables, 4 variables... up to the 9 variables, in order to see the effect of adding variables. I have this R code (extracted from an article with more factors and columns) but it does not work as intended: I only obtain the average considering all 9 variables.
pairwisecor.df = ddply(data, c("Exp"), function(x) {
  Smax = unique(x$Rich)
  x = x[, variables]
  cormat = cor(t(x), use = "complete.obs", method = c("kendall"))
  data.frame(
    Smax = Smax,
    no.fn = nrow(x),
    avg.cor = mean(cormat[lower.tri(cormat)]))
})
I think it shouldn't be very difficult to create a function that analyzes a cumulative number of variables, but I only have the reference of an article where the data is much more complicated.
Any idea?
Here is a fictitious example of calculating the mean over lower-triangular submatrices of increasing size, starting from the upper-left corner:
> (cormat <- matrix((1:25)/25, 5, 5))
[,1] [,2] [,3] [,4] [,5]
[1,] 0.04 0.24 0.44 0.64 0.84
[2,] 0.08 0.28 0.48 0.68 0.88
[3,] 0.12 0.32 0.52 0.72 0.92
[4,] 0.16 0.36 0.56 0.76 0.96
[5,] 0.20 0.40 0.60 0.80 1.00
> avg.cor = c()
> for (i in 2:dim(cormat)[1]) {
+ avg.cor=cbind(avg.cor,mean(cormat[1:i,1:i][lower.tri(cormat[1:i,1:i])]))
+ }
> avg.cor
[,1] [,2] [,3] [,4]
[1,] 0.08 0.1733333 0.2666667 0.36
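The loop above only looks at the leading submatrices (variables 1 to i in their given order). If "all possible combinations" of k variables are wanted instead, combn can enumerate them. The sketch below uses random stand-in data with 5 variables, and summary_by_k is a made-up name. One thing to keep in mind: every pair of variables appears in the same number of size-k subsets, so the mean over all subsets is identical for every k; only the spread (here min and max) changes.

```r
set.seed(42)
variables <- matrix(rnorm(50 * 5), ncol = 5)  # stand-in data, not the asker's
matrix.cor <- cor(variables, method = "kendall", use = "pairwise")
p <- ncol(matrix.cor)

summary_by_k <- t(sapply(2:p, function(k) {
  # Average pairwise correlation within each subset of k variables
  subset_means <- combn(p, k, FUN = function(idx) {
    sub <- matrix.cor[idx, idx]
    mean(sub[lower.tri(sub)])
  })
  c(k = k, mean = mean(subset_means),
    min = min(subset_means), max = max(subset_means))
}))
summary_by_k
```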
I want to subtract from each column its own mean.
First I compute the mean of each column.
My data, called m, is below:
angel distance
[1,] 1.3 0.43
[2,] 4.0 0.84
[3,] 2.7 0.58
[4,] 2.2 0.58
[5,] 3.6 0.70
[6,] 4.9 1.00
[7,] 0.9 0.27
[8,] 1.1 0.29
[9,] 3.1 0.63
> mean<-apply(m,2,FUN=mean)
angel distance
2.6444444 0.5911111
> m-mean
angel distance
1 -1.34444444 -0.16111111
2 3.40888889 -1.80444444
3 0.05555556 -0.01111111
4 1.60888889 -2.06444444
5 0.95555556 0.10888889
6 4.30888889 -1.64444444
7 -1.74444444 -0.32111111
8 0.50888889 -2.35444444
9 0.45555556 0.03888889
So the mean vector is recycled down the columns, element by element, rather than being applied across each row.
I want the column means to be subtracted from every row instead. How can I do this?
First, let's use colMeans(m) to get column means of matrix m. Then we use sweep:
sweep(m, 2, colMeans(m))
where 2 specifies margin (we want column-wise operation, and in 2D index, the second index is for column). By default, sweep performs FUN = "-", so in above we are subtracting column means from the matrix, i.e., centring the matrix.
Similarly if we want to subtract row means from all rows, we can use:
sweep(m, 1, rowMeans(m))
You can set FUN argument to other functions, too. Another common use of sweep is for column / row rescaling, where you can read How to rescale my matrix by column or row for more.
The function scale mentioned in the other answer is used only for column-wise operations. A common use is to standardise all matrix columns. We can set scale = FALSE to perform column centring only.
scale is just a wrapper around sweep, which you can verify by inspecting the source code of scale.default:
if (center) {
center <- colMeans(x, na.rm = TRUE)
x <- sweep(x, 2L, center, check.margin = FALSE)
}
if (scale) {
scale <- apply(x, 2L, f)
x <- sweep(x, 2L, scale, "/", check.margin = FALSE)
}
Read ?sweep, ?scale, ?colMeans for more on those functions.
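As a quick check of the above on the data from the question (rebuilt here by hand, keeping the column name angel as posted), the sketch below compares sweep with scale(..., scale = FALSE) and with the double-transpose idiom; centred is just an illustrative name.

```r
# The matrix m from the question, typed in again so the example is self-contained
m <- cbind(angel    = c(1.3, 4.0, 2.7, 2.2, 3.6, 4.9, 0.9, 1.1, 3.1),
           distance = c(0.43, 0.84, 0.58, 0.58, 0.70, 1.00, 0.27, 0.29, 0.63))

# Subtract each column's mean from that column
centred <- sweep(m, 2, colMeans(m))
colMeans(centred)  # both values are zero up to rounding error

# scale() gives the same numbers but adds a "scaled:center" attribute
all.equal(centred, scale(m, scale = FALSE), check.attributes = FALSE)

# m - colMeans(m) recycles down the columns; transposing first lines things up
all.equal(centred, t(t(m) - colMeans(m)))
```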
You can get the same result with scale (centring only, without dividing by the standard deviation):
scale(m, scale=FALSE)
angel distance
[1,] -1.34444444 -0.16111111
[2,] 1.35555556 0.24888889
[3,] 0.05555556 -0.01111111
[4,] -0.44444444 -0.01111111
[5,] 0.95555556 0.10888889
[6,] 2.25555556 0.40888889
[7,] -1.74444444 -0.32111111
[8,] -1.54444444 -0.30111111
[9,] 0.45555556 0.03888889
This question is sort of a follow-up to how to extract intragroup and intergroup distances from a distance matrix? in R. In that question, they first computed the distance matrix for all points, and then simply extracted the inter-class distance matrix. I have a situation where I'd like to bypass the initial computation and skip right to extraction, i.e. I want to directly compute the inter-class distance matrix. Drawing from the linked example, with tweaks, let's say I have some data in a dataframe called df:
values<-c(0.002,0.3,0.4,0.005,0.6,0.2,0.001,0.002,0.3,0.01)
class<-c("A","A","A","B","B","B","B","A","B","A")
df<-data.frame(values, class)
What I'd like is a distance matrix:
1 2 3 8 10
4 .003 .295 .395 .003 .005
5 .598 .300 .200 .598 .590
6 .198 .100 .200 .198 .190
7 .001 .299 .399 .001 .009
9 .298 .000 .100 .298 .290
Does there already exist in R an elegant and fast way to do this?
EDIT After receiving a good solution for the 1D case above, I thought of a bonus question: what about a higher-dimensional case, say if instead df looks like this:
values1<-c(0.002,0.3,0.4,0.005,0.6,0.2,0.001,0.002,0.3,0.01)
values2<-c(0.001,0.1,0.1,0.001,0.1,0.1,0.001,0.001,0.1,0.01)
class<-c("A","A","A","B","B","B","B","A","B","A")
df<-data.frame(values1, values2, class)
And I'm interested in again getting a matrix of the Euclidean distance between points in class B with points in class A.
For general n-dimensional Euclidean distance, we can exploit the equation (not R, but algebra):
square_dist(b,a) = sum_i(b[i]*b[i]) + sum_i(a[i]*a[i]) - 2*inner_prod(b,a)
where the sums are over the dimensions of vectors a and b for i=[1,n]. Here, a and b are one pair from A and B. The key here is that this equation can be written as a matrix equation for all pairs in A and B.
In code:
## First split the data with respect to the class
n <- 2 ## the number of dimensions, for this example is 2
tmp <- split(df[,1:n], df$class)
d <- sqrt(matrix(rowSums(expand.grid(rowSums(tmp$B*tmp$B),rowSums(tmp$A*tmp$A))),
nrow=nrow(tmp$B)) -
2. * as.matrix(tmp$B) %*% t(as.matrix(tmp$A)))
Notes:
The inner rowSums compute sum_i(b[i]*b[i]) and sum_i(a[i]*a[i]) for each b in B and a in A, respectively.
expand.grid then generates all pairs between B and A.
The outer rowSums computes the sum_i(b[i]*b[i]) + sum_i(a[i]*a[i]) for all these pairs.
This result is then reshaped into a matrix. Note that the number of rows of this matrix is the number of points of class B as you requested.
Then subtract two times the inner product of all pairs. This inner product can be written as a matrix multiply tmp$B %*% t(tmp$A) where I left out the coercion to matrix for clarity.
Finally, take the square root.
Using this code with your data:
print(d)
## 1 2 3 8 10
##4 0.0030000 0.3111688 0.4072174 0.0030000 0.01029563
##5 0.6061394 0.3000000 0.2000000 0.6061394 0.59682493
##6 0.2213707 0.1000000 0.2000000 0.2213707 0.21023796
##7 0.0010000 0.3149635 0.4110985 0.0010000 0.01272792
##9 0.3140143 0.0000000 0.1000000 0.3140143 0.30364453
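Because this formula subtracts nearly equal numbers, rounding can in principle leave a tiny negative value under the square root for coincident points (such as points 9 and 2 here), which would give NaN. The sketch below is a variant of the code above, not the answer's exact code: it builds the sum of squares with outer, clamps at zero with pmax, and cross-checks against dist. The names sq and d_direct are made up for the illustration.

```r
values1 <- c(0.002, 0.3, 0.4, 0.005, 0.6, 0.2, 0.001, 0.002, 0.3, 0.01)
values2 <- c(0.001, 0.1, 0.1, 0.001, 0.1, 0.1, 0.001, 0.001, 0.1, 0.01)
class   <- c("A", "A", "A", "B", "B", "B", "B", "A", "B", "A")
df <- data.frame(values1, values2, class)

tmp <- split(df[, 1:2], df$class)
B <- as.matrix(tmp$B)
A <- as.matrix(tmp$A)

# |b|^2 + |a|^2 - 2 * <b, a> for every pair (b, a)
sq <- outer(rowSums(B^2), rowSums(A^2), `+`) - 2 * B %*% t(A)
d  <- sqrt(pmax(sq, 0))  # clamp tiny negative rounding errors before sqrt

# Cross-check against the full distance matrix, keeping only the B-vs-A block
d_direct <- as.matrix(dist(rbind(B, A)))[rownames(B), rownames(A)]
max(abs(d - d_direct))   # should be negligibly small
```

The cross-check computes the full distance matrix, so it is only meant for verifying the result on small data, not as part of the fast path.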
Note that this code will work for any n > 1. We can recover your previous 1-d result by setting n to 1 and not performing the inner rowSums (because tmp$A and tmp$B are now plain vectors with a single value per point):
n <- 1 ## the number of dimensions, set this now to 1
tmp <- split(df[,1:n], df$class)
d <- sqrt(matrix(rowSums(expand.grid(tmp$B*tmp$B,tmp$A*tmp$A)),
nrow=length(tmp$B)) -
2. * as.matrix(tmp$B) %*% t(as.matrix(tmp$A)))
print(d)
## [,1] [,2] [,3] [,4] [,5]
##[1,] 0.003 0.295 0.395 0.003 0.005
##[2,] 0.598 0.300 0.200 0.598 0.590
##[3,] 0.198 0.100 0.200 0.198 0.190
##[4,] 0.001 0.299 0.399 0.001 0.009
##[5,] 0.298 0.000 0.100 0.298 0.290
Here's an attempt via generating each combination and then simply taking the difference from each value:
abs(matrix(Reduce(`-`, expand.grid(split(df$values, df$class))), nrow=5, byrow=TRUE))
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0.003 0.295 0.395 0.003 0.005
#[2,] 0.598 0.300 0.200 0.598 0.590
#[3,] 0.198 0.100 0.200 0.198 0.190
#[4,] 0.001 0.299 0.399 0.001 0.009
#[5,] 0.298 0.000 0.100 0.298 0.290
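The one-liner above hard-codes nrow = 5 and drops the point labels. A possible variant for the 1-d case, offered as a sketch, calls outer on the two groups directly and labels rows and columns with the original row numbers; idx is an illustrative name.

```r
values <- c(0.002, 0.3, 0.4, 0.005, 0.6, 0.2, 0.001, 0.002, 0.3, 0.01)
class  <- c("A", "A", "A", "B", "B", "B", "B", "A", "B", "A")
df <- data.frame(values, class)

# Row numbers of the points in each class
idx <- split(seq_len(nrow(df)), df$class)

# |b - a| for every B point (rows) against every A point (columns)
d <- abs(outer(df$values[idx$B], df$values[idx$A], `-`))
dimnames(d) <- list(idx$B, idx$A)
d
```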