Draw Markov chain given transition matrix in R

Let trans_m be an n-by-n transition matrix of a first-order Markov chain. In my problem, n is large, say 10,000, so trans_m is a sparse matrix built with the Matrix package; otherwise its size would be huge. My goal is to simulate a Markov-chain sequence given a vector of initial states s1 and this transition matrix trans_m. Consider the following concrete example.
library(Matrix)
n <- 5000 # there are 5,000 states in this case.
trans_m <- Matrix(0, nrow = n, ncol = n, sparse = TRUE)
K <- 5 # the maximal number of states that could be reached.
for(i in 1:n){
  states_reachable <- sample(1:n, size = K) # randomly pick K states that can be reached with equal probability.
  trans_m[i, states_reachable] <- 1/K
}
s1 <- sample(1:n, size = 1000, replace = TRUE) # generate 1000 initial states
draw_next <- function(s) {
  .s <- sample(1:n, size = 1, prob = trans_m[s, ]) # given the current state s, draw the next state .s
  .s
}
sapply(s1, draw_next)
Given the vector of initial states s1 as above, I used sapply(s1, draw_next) to draw the next state. When n is larger, sapply becomes slow. Is there a better way?

Repeatedly indexing by row can be slow, so it's faster to work on the transpose of the transition matrix and use column indexing, and to factor out the indexing from the inner function:
R> trans_m_t <- t(trans_m)
R>
R> require(microbenchmark)
R> microbenchmark(
+ apply(trans_m_t[,s1], 2,sample, x=n, size=1, replace=F)
+ ,
+ sapply(s1, draw_next)
+ )
Unit: milliseconds
                                                             expr        min          lq        mean      median          uq        max neval
 apply(trans_m_t[, s1], 2, sample, x = n, size = 1, replace = F) 111.828814 193.1139810 190.4379185 194.6563380 196.4273105 270.418189   100
                                           sapply(s1, draw_next) 499.255402 503.7398805 512.0849013 506.9467125 516.6082480 586.762573   100
Since you're already working with a sparse matrix, you might be able to get even better performance by working directly on the triplets. Using the higher-level matrix operators can trigger recompression.
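For example, here is a minimal sketch (not from the original answer) of sampling from the triplet representation; it assumes trans_m, s1 and n as defined in the question, and the names trip, next_states, next_probs and draw_next_trip are made up for illustration:
library(Matrix)
trip <- as(trans_m, "TsparseMatrix") # triplet form: 0-based slots @i (row), @j (column), @x (value)
# group the reachable next states and their probabilities by current state, once up front
next_states <- split(trip@j + 1L, trip@i + 1L)
next_probs  <- split(trip@x,      trip@i + 1L)
draw_next_trip <- function(s) {
  ns <- next_states[[as.character(s)]]
  # note: if a state had exactly one successor, sample()'s scalar shortcut would need
  # special handling; here every row has K = 5 entries
  sample(ns, size = 1, prob = next_probs[[as.character(s)]])
}
sapply(s1, draw_next_trip)
This does all the indexing work once when the triplets are split by row, so nothing inside the sampling loop touches the compressed matrix.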

Related

Fast QR Factorization in R

I have a large number of matrices for which I need to perform a QR factorization and store the resulting Q matrices (normalized such that the R matrix has positive numbers on its diagonal). Is there a faster method than using the qr() function?
Here is the working example:
system.time({
  # Parameters for the matrix to be generated
  row_number <- 1e6/4
  col_number <- 4
  # Generate large matrix of random numbers normally distributed.
  # Basically it's a matrix containing 4x4 matrices for which I will perform the QR factorization:
  RND <- matrix(data = rnorm(n = 1e6, mean = 0, sd = 1), nrow = row_number, ncol = col_number)
  # Allocate a 0 matrix where the result will be entered
  QSTACK <- matrix(0, nrow = row_number, ncol = col_number)
  number_of_blocks <- row_number/4 # The number of 4x4 matrices in RND => 62,500
  for (k in c(1:number_of_blocks)) {
    l1 <- 1 + col_number * (k-1)
    l2 <- col_number * k
    QR <- qr(RND[l1:l2,]) # Perform QR factorization
    R <- qr.R(QR)
    QSTACK[l1:l2,] <- qr.Q(QR) %*% diag(sign(diag(R))) # Normalize such that R diagonal elements are positive
  }
})
# user system elapsed
# 3.04 0.03 3.07
So that took 3.07 seconds to compute 62,500 QR factorizations. I'm wondering if there is something faster?
If you want:
- the R factor to have positive diagonal elements
- to explicitly form the Q factor (rather than its sequential Householder vectors format)
you can cheat by using Cholesky factorization:
cheatQR <- function (X) {
XtX <- crossprod(X)
R <- chol(XtX)
Q <- t(forwardsolve(R, t(X), upper.tri = TRUE, transpose = TRUE))
list(Q = Q, R = R)
}
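Why this works (a short derivation, not part of the original answer): for a QR factorization X = QR with orthonormal Q and positive-diagonal R, we have X'X = R'Q'QR = R'R, so the upper-triangular Cholesky factor of X'X returned by chol() (which has a positive diagonal by construction) is exactly the desired R, and the triangular solve inside cheatQR recovers Q = X R^{-1}. Numerically, forming X'X squares the condition number, so this shortcut is best suited to small, well-conditioned blocks like the 4-column matrices here.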
The raw QR:
rawQR <- function (X) {
QR <- qr(X)
Q <- qr.Q(QR)
R <- qr.R(QR)
sgn <- sign(diag(R))
R <- sgn * R
Q <- Q * rep(sgn, each = nrow(Q))
list(Q = Q, R = R)
}
Benchmark:
X <- matrix(rnorm(10000 * 4), nrow = 10000, ncol = 4)
ans1 <- rawQR(X)
ans2 <- cheatQR(X)
all.equal(ans1, ans2)
#[1] TRUE
library(microbenchmark)
microbenchmark(rawQR(X), cheatQR(X))
#Unit: microseconds
# expr min lq mean median uq max neval
# rawQR(X) 3083.537 3109.222 3796.191 3123.2230 4782.583 13895.81 100
# cheatQR(X) 828.025 837.491 1421.211 865.9085 1434.657 32577.01 100
For further speedup, it is often advised to link R against an optimized BLAS library such as OpenBLAS. But more relevant to your context, where you are computing a large number of QR factorizations of small matrices, it is more worthwhile to parallelize your for loop.
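For instance, here is a hedged sketch (not from the original answer) of parallelizing the block-wise loop with the base parallel package; it assumes RND, col_number and number_of_blocks from the question, reuses rawQR() from above, and the names block_list, Q_list and QSTACK2 are illustrative:
library(parallel)
# split RND into its 4x4 blocks once, then factorize the blocks in parallel
block_list <- lapply(seq_len(number_of_blocks), function(k) {
  l1 <- 1 + col_number * (k - 1)
  l2 <- col_number * k
  RND[l1:l2, ]
})
cl <- makeCluster(detectCores() - 1)
clusterExport(cl, "rawQR")
Q_list <- parLapply(cl, block_list, function(X) rawQR(X)$Q)
stopCluster(cl)
QSTACK2 <- do.call(rbind, Q_list) # same layout as QSTACK in the question
For blocks this small the per-task communication overhead can dominate, so in practice you would likely hand each worker a chunk of many blocks rather than one block per task.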

How to optimize my correlation problem in R?

I have three dataframes in R, let's call them A, B, and C.
Dataframe C contains two columns: the first one contains various row names from dataframe A and the second one contains row names from dataframe B:
C <- data.frame(col1 = c("a12", "a9"), col2 = c("b6","b54"))
I want to calculate the correlation coefficient and p-value for each row of table C using the corresponding rows of tables A and B (i.e. correlating the values in row a12 of table A with those in row b6 of table B, row a9 of table A with row b54 of table B, etc.) and put the results in additional columns of table C. This is my current naive and highly inefficient code:
for (i in 1:nrow(C)) {
  correlation <- cor.test(unlist(A[C[i,1],]), unlist(B[C[i,2],]), method = "spearman")
  C[i,3] <- correlation$estimate
  C[i,4] <- correlation$p.value
}
The main problem is that with my current large datasets this analysis can literally take months, so I'm looking for a more efficient way to accomplish this task. I also tried the following code using the "Hmisc" package, but the server I'm working on can't handle the large vectors:
library(Hmisc)
A <- t(A)
B <- t(B)
ind.A <- match(C[,1], colnames(A))
A <- A[, ind.A]
ind.B <- match(C[,2], colnames(B))
B <- B[, ind.B]
C[,3] <- diag(rcorr(as.matrix(A), as.matrix(B), type = "spearman")$r[c(1:ncol(A)), c(1:ncol(A))])
C[,4] <- diag(rcorr(as.matrix(A), as.matrix(B), type = "spearman")$P[c(1:ncol(A)), c(1:ncol(A))])
Based on the comment by @HYENA, I tried parallelizing the processing. This approach accelerated the process approximately 4 times (with 8 cores). The code:
library(foreach)
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
cor.res <- foreach (i = 1:nrow(C)) %dopar% {
  a <- C[i,1]
  b <- C[i,2]
  correlation <- cor.test(unlist(A[a,]), unlist(B[b,]), method = "spearman")
  c(correlation$estimate, correlation$p.value)
}
stopCluster(cl) # release the workers when done
cor.res <- data.frame(Reduce("rbind", cor.res))
C[,c(3,4)] <- cor.res
Extract just the part you need from cor.test, giving cor_test1, and use that instead; or, in addition, create a lookup table for the p-values, giving cor_test2, which is slightly faster than cor_test1.
Based on the median column of the benchmark below (with vectors of length 10), these run about 3x faster than cor.test. Although cor_test2 is only slightly faster than cor_test1 here, we have included it since the speedup could depend on the size of the input, which we don't have; you can try it out yourself with whatever sizes you have.
# given correlation and degrees of freedom output p value
r2pval <- function(r, dof) {
  tval <- sqrt(dof) * r / sqrt(1 - r^2)
  min(pt(tval, dof), pt(tval, dof, lower.tail = FALSE))
}
# faster version of cor.test
cor_test1 <- function(x, y) {
  r <- cor(x, y)
  dof <- length(x) - 2
  tval <- sqrt(dof) * r / sqrt(1 - r^2)
  pval <- min(pt(tval, dof), pt(tval, dof, lower.tail = FALSE))
  c(r, pval)
}
# even faster version of cor.test.
# Given x, y and the pvals table calculate a 2-vector of r and p value
cor_test2 <- function(x, y, pvals) {
  r <- cor(x, y)
  c(r, pvals[100 * round(r, 2) + 101])
}
# test
set.seed(123)
n <- 10
x <- rnorm(n); y <- rnorm(n)
dof <- n - 2
# pvals is the 201 p values for r = -1, -0.99, -0.98, ..., 1
pvals <- sapply(seq(-1, 1, 0.01), r2pval, dof = dof)
library(microbenchmark)
microbenchmark(cor.test(x, y), cor_test1(x, y), cor_test2(x, y, pvals))
giving:
Unit: microseconds
expr min lq mean median uq max neval cld
cor.test(x, y) 253.7 256.7 346.278 266.05 501.45 650.6 100 a
cor_test1(x, y) 84.8 87.2 346.777 89.10 107.40 22974.4 100 a
cor_test2(x, y, pvals) 72.4 75.0 272.030 79.45 91.25 17935.8 100 a
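To wire this back into the question's setup, a minimal sketch (not part of the original answer) could look like the following; it assumes A, B and C from the question and cor_test1 from above. Note that cor_test1 as written uses cor()'s default Pearson correlation, whereas the question used Spearman, so adapting it would need a matching change inside cor_test1:
res <- t(vapply(seq_len(nrow(C)),
                function(i) cor_test1(unlist(A[C[i, 1], ]), unlist(B[C[i, 2], ])),
                numeric(2)))
C[, 3] <- res[, 1] # correlation estimates
C[, 4] <- res[, 2] # p-values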

Faster matrix multiplication by replacing a double loop

I have a dataframe which looks a bit like the one produced by the following code (but much larger):
set.seed(10)
mat <- matrix(rbinom(200, size=1, prob = .5), ncol = 10)
The columns are issues, and a 1 indicates that an observation is interested in a specific issue. I want to generate a network comparing all observations and count the issues that each dyad is jointly interested in.
I have produced the following code, which seems to be working fine:
mat2 <- matrix(NA, 20, 20)
for(i in 1:nrow(mat)){
  for(j in 1:nrow(mat)){
    mat2[i,j] <- sum(as.numeric(mat[i,]==1) + as.numeric(mat[j,]==1) == 2)
  }
}
So I compare every observation with every other observation, and only if both have a 1 for an issue (i.e., both are interested) does the term sum to 2 and get counted as joint interest in that topic.
My problem is that my dataset is very large, and the loop now runs for hours already.
Does anyone have an idea how to do this while avoiding the loop?
This should be faster:
tmat <- t(mat==1)
mat4 <- apply(tmat, 2, function(x) colSums(tmat & x))
Going ahead and promoting @jogo's comment, as it is by far the fastest (thanks for the hint, I will use that in production as well).
set.seed(10)
mat <- matrix(rbinom(200, size=1, prob = .5), ncol = 10)
mat2 <- matrix(NA,20,20)
binary_mat <- mat == 1
tmat <- t(mat==1)
microbenchmark::microbenchmark(
  "loop" = for(i in 1:nrow(mat)){
    for(j in 1:nrow(mat)){
      mat2[i,j] <- sum(as.numeric(mat[i,]==1) + as.numeric(mat[j,]==1) == 2)
    }
  },
  "apply" = mat4 <- apply(tmat, 2, function(x) colSums(tmat & x)),
  "matrix multiplication" = mat5 <- mat %*% t(mat),
  "tcrossprod" = tcrossprod(mat),
  "tcrossprod binary" = tcrossprod(binary_mat)
)
On my machine this benchmark results in
Unit: microseconds
expr min lq mean median uq max neval cld
loop 16699.634 16972.271 17931.82535 17180.397 17546.1545 31502.706 100 b
apply 322.942 330.046 395.69045 357.886 368.8300 4299.228 100 a
matrix multiplication 21.889 28.801 36.76869 39.360 43.9685 50.689 100 a
tcrossprod 7.297 8.449 11.20218 9.984 14.4005 18.433 100 a
tcrossprod binary 7.680 8.833 11.08316 9.601 12.0970 35.713 100 a
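As a quick sanity check (not part of the original answers), the cross-product approaches reproduce the double-loop result exactly, since for a 0/1 matrix the inner product of two rows counts the issues both observations marked with 1:
all(mat2 == tcrossprod(mat)) # TRUE
all(mat2 == mat %*% t(mat))  # TRUE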

Machine Learning: Stochastic gradient descent for logistic regression in R: Calculating Eout and average number of epochs

I am trying to write a code to solve the following problem (As stated in HW5 in the CalTech course Learning from Data):
In this problem you will create your own target function f (probability in this case) and data set D to see how Logistic Regression works. For simplicity, we will take f to be a 0/1 probability so y is a deterministic function of x. Take d = 2 so you can visualize the problem, and let X = [-1, 1] x [-1, 1] with uniform probability of picking each x in X. Choose a line in the plane as the boundary between f(x) = 1 (where y has to be +1) and f(x) = 0 (where y has to be -1) by taking two random, uniformly distributed points from X and taking the line passing through them as the boundary between y = ±1. Pick N = 100 training points at random from X, and evaluate the outputs yn for each of these points xn. Run Logistic Regression with Stochastic Gradient Descent to find g, and estimate Eout (the cross entropy error) by generating a sufficiently large, separate set of points to evaluate the error. Repeat the experiment for 100 runs with different targets and take the average. Initialize the weight vector of Logistic Regression to all zeros in each run. Stop the algorithm when ||w(t-1) - w(t)|| < 0.01, where w(t) denotes the weight vector at the end of epoch t. An epoch is a full pass through the N data points (use a random permutation of 1, 2, ..., N to present the data points to the algorithm within each epoch, and use different permutations for different epochs). Use a learning rate of 0.01.
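For reference (spelled out here for clarity; these are the standard logistic-regression formulas rather than a quote from the post), the per-example cross-entropy error is E_n(w) = ln(1 + exp(-y_n w'x_n)), its gradient with respect to w is -y_n x_n / (1 + exp(y_n w'x_n)), and each SGD step updates w to w - 0.01 times that gradient; Eout is then estimated as the average of E_n over the large held-out set.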
I am required to calculate the nearest value to Eout for N=100, and the average number of epochs for the required criterion.
I wrote and ran the code but I'm not getting the right answers (as stated in the solutions, Eout should be near 0.1 and the number of epochs near 350). The number of epochs required for a delta w of 0.01 comes out far too small (around 10), leaving the error too big (around 2). I then tried replacing the criterion with ||w(t-1) - w(t)|| < 0.001 (rather than 0.01); then the average number of epochs required was about 250 and the out-of-sample error was about 0.35.
Is there something wrong with my code/solution, or is it possible that the answers provided are faulty? I've added comments to indicate what I intend to do at each step. Thanks in advance.
library(pracma)
h <- 0 # h will later be updated to number of required epochs
p <- 0 # p will later be updated to Eout
C <- matrix(ncol=10000, nrow=2) # Testing set, used to calculate out of sample error
d <- matrix(ncol=10000, nrow=1)
for(i in 1:10000){
  C[, i] <- c(runif(2, min = -1, max = 1)) # Sample data
  d[1, i] <- sign(C[2, i] - f(C[1, i]))
}
for(g in 1:100){ # 100 runs of the experiment
  x <- runif(2, min = -1, max = 1)
  y <- runif(2, min = -1, max = 1)
  fit = (lm(y~x))
  t <- summary(fit)$coefficients[,1]
  f <- function(x){ # Target function
    t[2]*x + t[1]
  }
  A <- matrix(ncol=100, nrow=2) # Sample data
  b <- matrix(ncol=100, nrow=1)
  norm_vec <- function(x) {sqrt(sum(x^2))} # vector norm calculator
  w <- c(0,0) # weights initialized to zero
  for(i in 1:100){
    A[, i] <- c(runif(2, min = -1, max = 1)) # Sample data
    b[1, i] <- sign(A[2, i] - f(A[1, i]))
  }
  q <- matrix(nrow = 2, ncol = 1000) # q tracks the weight vector at the end of each epoch
  l = 1
  while(l < 1001){
    E <- function(z){ # cross entropy error function
      x = z[1]
      y = z[2]
      v = z[3]
      return(log(1 + exp(-v*t(w)%*%c(x, y))))
    }
    err <- function(xn1, xn2, yn){ # gradient of error function
      return(c(-yn*xn1, -yn*xn2)*(exp(-yn*t(w)*c(xn1,xn2))/(1+exp(-yn*t(w)*c(xn1,xn2)))))
    }
    e = matrix(nrow = 2, ncol = 100) # e will track the required gradient at each data point
    e[,1:100] = 0
    perm = sample(100, 100, replace = FALSE, prob = NULL) # Random permutation of the data indices
    for(j in 1:100){ # One complete epoch
      r = A[,perm[j]] # pick the perm[j]th entry in A
      s = b[perm[j]] # pick the perm[j]th entry in b
      e[,perm[j]] = err(r[1], r[2], s) # Gradient of the error
      w = w - 0.01*e[,perm[j]] # update the weight vector according to the formula involving step size and gradient
    }
    q[,l] = w # the lth entry is the weight vector at the end of the lth epoch
    if(l > 1 & norm_vec(q[,l] - q[,l-1]) < 0.001){ # given criterion to terminate the algorithm
      break
    }
    l = l+1 # move to the next epoch
  }
  for(n in 1:10000){
    p[g] = mean(E(c(C[1,n], C[2, n], d[n]))) # average over 10000 data points, of the error function, in experiment no. g
  }
  h[g] = l # gth entry in the vector h, tracks the number of epochs in the gth iteration of the experiment
}
mean(h) # Mean number of epochs needed
mean(p) # average Eout, over 100 experiments

Sampling repeatedly with different probability

In the following code "Weight" is a large matrix of weight sets. This matrix consists of, let's say, 1000 rows and 4 columns. Each row is a set of weights (the elements of each row sum to one).
In addition, there are four objects and I want to select one of them based on each weight set. In other words, the random selection should be repeated for every weight set.
Right now I have solved the problem with a for loop. But is there a more efficient way to code it in R?
y <- c("a", "b", "c", "d")
for(i in 1:nrow(Weight)){
  selection[i] <- sample(y, 1, prob=Weight[i,]) # selection is a vector with the same number of rows as Weight
}
A more efficient way would be to first compute the row-wise cumulative sums of your weights then draw a number between 0 and 1 and see where it lands within that cumulative sum. This way, you only need to do one call to runif to get your random data, versus 1000 calls using other methods.
Weight <- matrix(sample(1:100, 1000 * 4, TRUE), 1000, 4)
x <- runif(nrow(Weight))
cumul.w <- Weight %*% upper.tri(diag(ncol(Weight)), diag = TRUE) / rowSums(Weight)
i <- rowSums(x > cumul.w) + 1L
selection <- y[i]
Also note how I computed the cumulative sums by multiplying by a triangular matrix instead of using the slower apply(Weight, 1, cumsum). Everything is vectorized so it should be way faster than using an apply or for loop.
Benchmark comparison with apply and for:
f_runif <- function(Weight, y) {
  x <- runif(nrow(Weight))
  cumul.w <- Weight %*% upper.tri(diag(ncol(Weight)), diag = TRUE) /
    rowSums(Weight)
  i <- rowSums(x > cumul.w) + 1L
  y[i]
}
f_for <- function(Weight, y) {
  selection <- rep(NA, nrow(Weight))
  for(i in 1:nrow(Weight)){
    selection[i] <- sample(y, 1, prob=Weight[i,])
  }
  selection # return the filled vector
}
f_apply <- function(Weight, y) {
  apply(Weight, 1, function(w) sample(y, 1, prob=w))
}
y <- c("a", "b", "c", "d")
Weight <- matrix(sample(1:100, 1000 * 4, TRUE), 1000, 4)
library(microbenchmark)
microbenchmark(f_runif(Weight, y),
f_for (Weight, y),
f_apply(Weight, y))
# Unit: microseconds
# expr min lq median uq max neval
# f_runif(Weight, y) 223.635 231.111 274.531 281.2165 1443.208 100
# f_for(Weight, y) 10220.674 11238.660 11574.039 11917.1610 14583.028 100
# f_apply(Weight, y) 9006.974 10016.747 10509.150 10879.9245 27060.189 100
Wrap your sample into a function that lets you pass only one argument, a row from Weight:
myfun <- function(w) {
  sample(y, 1, prob=w)
}
Then you can use one of the apply family:
apply(Weight, 1, myfun)
However, so long as you have pre-allocated selection your method is not terribly inefficient.
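For instance, a minimal sketch of the pre-allocated loop (assuming Weight and y as defined in the question):
selection <- character(nrow(Weight)) # allocate the result once, up front
for (i in seq_len(nrow(Weight))) {
  selection[i] <- sample(y, 1, prob = Weight[i, ])
}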
