Sampling repeatedly with different probabilities in R

In the following code, "Weight" is a large matrix of weight sets. It consists of, say, 1000 rows and 4 columns, where each row is a set of weights (the elements of each row sum to one).
In addition, there are four objects and I want to select one of them based on each weight set. In other words, this random selection should be repeated for every weight set.
Right now I have solved the problem with a for loop, but is there a more efficient way to code it in R?
y <- c("a", "b", "c", "d")
for (i in 1:nrow(Weight)) {
  selection[i] <- sample(y, 1, prob = Weight[i, ])  # selection is a vector with the same number of rows as Weight
}

A more efficient way would be to first compute the row-wise cumulative sums of your weights and then draw a number between 0 and 1 and see where it lands within that cumulative sum. This way, you only need one call to runif to get your random data, versus 1000 calls with the other methods.
Weight <- matrix(sample(1:100, 1000 * 4, TRUE), 1000, 4)
x <- runif(nrow(Weight))
cumul.w <- Weight %*% upper.tri(diag(ncol(Weight)), diag = TRUE) / rowSums(Weight)
i <- rowSums(x > cumul.w) + 1L
selection <- y[i]
Also note how I computed the cumulative sums by multiplying by a triangular matrix instead of using the slower apply(Weight, 1, cumsum). Everything is vectorized so it should be way faster than using an apply or for loop.
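If you want to convince yourself that the triangular-matrix trick gives the same result as apply(Weight, 1, cumsum), a quick check along these lines (using the Weight and cumul.w objects defined above) should agree up to floating-point tolerance:
cumul.check <- t(apply(Weight / rowSums(Weight), 1, cumsum))
all.equal(as.vector(cumul.w), as.vector(cumul.check))  # should be TRUE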
Benchmark comparison with apply and for:
f_runif <- function(Weight, y) {
  x <- runif(nrow(Weight))
  cumul.w <- Weight %*% upper.tri(diag(ncol(Weight)), diag = TRUE) /
             rowSums(Weight)
  i <- rowSums(x > cumul.w) + 1L
  y[i]
}
f_for <- function(Weight, y) {
  selection <- rep(NA, nrow(Weight))
  for (i in 1:nrow(Weight)) {
    selection[i] <- sample(y, 1, prob = Weight[i, ])
  }
  selection
}
f_apply <- function(Weight, y) {
  apply(Weight, 1, function(w) sample(y, 1, prob = w))
}
y <- c("a", "b", "c", "d")
Weight <- matrix(sample(1:100, 1000 * 4, TRUE), 1000, 4)
library(microbenchmark)
microbenchmark(f_runif(Weight, y),
               f_for(Weight, y),
               f_apply(Weight, y))
# Unit: microseconds
#                expr       min        lq    median         uq       max neval
#  f_runif(Weight, y)   223.635   231.111   274.531   281.2165  1443.208   100
#    f_for(Weight, y) 10220.674 11238.660 11574.039 11917.1610 14583.028   100
#  f_apply(Weight, y)  9006.974 10016.747 10509.150 10879.9245 27060.189   100
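As a rough plausibility check (not part of the benchmark, and using the Weight and y objects defined above), the empirical selection frequencies returned by f_runif should be close to the average normalized weights:
round(prop.table(table(f_runif(Weight, y))), 3)  # observed share of "a", "b", "c", "d"
round(colMeans(Weight / rowSums(Weight)), 3)     # expected share under the weights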

Wrap your sample into a function that lets you pass only one argument, a row from Weight:
myfun <- function(w) {
  sample(y, 1, prob = w)
}
Then you can use one of the apply family:
apply(Weight, 1, myfun)
However, so long as you have pre-allocated selection your method is not terribly inefficient.
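For example, pre-allocating simply means creating selection at its full length before the loop, along these lines:
selection <- character(nrow(Weight))  # pre-allocate once ...
for (i in 1:nrow(Weight)) {
  selection[i] <- sample(y, 1, prob = Weight[i, ])  # ... then fill in place
}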

Related

How to optimize my correlation problem in R?

I have three dataframes in R, let's call them A, B, and C.
Dataframe C contains two columns: the first contains row names from dataframe A and the second contains row names from dataframe B:
C <- data.frame(col1 = c("a12", "a9"), col2 = c("b6","b54"))
I want to calculate the correlation coefficient and p-value for each row of table C, using the corresponding rows of tables A and B (i.e. correlating the values of row a12 of table A with row b6 of table B, row a9 of table A with row b54 of table B, etc.), and put the resulting values in additional columns of table C. This is my current naive and highly inefficient code:
for (i in 1:nrow(C)) {
  correlation <- cor.test(unlist(A[C[i, 1], ]), unlist(B[C[i, 2], ]), method = "spearman")
  C[i, 3] <- correlation$estimate
  C[i, 4] <- correlation$p.value
}
The main problem is that with my current large datasets this analysis can literally take months, so I'm looking for a more efficient way to accomplish this task. I also tried the following code using the "Hmisc" package, but the server I'm working on can't handle the large vectors:
A <- t(A)
B <- t(B)
ind.A <- match(C[, 1], colnames(A))
A <- A[, ind.A]
ind.B <- match(C[, 2], colnames(B))
B <- B[, ind.B]
C[, 3] <- diag(rcorr(as.matrix(A), as.matrix(B), type = "spearman")$r[c(1:ncol(A)), c(1:ncol(A))])
C[, 4] <- diag(rcorr(as.matrix(A), as.matrix(B), type = "spearman")$P[c(1:ncol(A)), c(1:ncol(A))])
Based on the comment by #HYENA, I tried parallelizing the processing. This approach accelerated the computation approximately 4-fold (with 8 cores). The code:
library(foreach)
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
cor.res <- foreach(i = 1:nrow(C)) %dopar% {
  a <- C[i, 1]
  b <- C[i, 2]
  correlation <- cor.test(unlist(A[a, ]), unlist(B[b, ]), method = "spearman")
  c(correlation$estimate, correlation$p.value)
}
cor.res <- data.frame(Reduce("rbind", cor.res))
C[, c(3, 4)] <- cor.res
Extract just the part you need from cor.test, giving cor_test1, and use that instead; or, in addition, create a lookup table for the p values, giving cor_test2, which is slightly faster than cor_test1.
Based on the median column, with vectors of length 10 these run about 3x faster than cor.test. Although cor_test2 is only slightly faster than cor_test1 here, we have included it since the speedup could depend on the size of your input, which we don't have; you can try it out yourself with whatever sizes you have.
# given correlation and degrees of freedom, output p value
r2pval <- function(r, dof) {
  tval <- sqrt(dof) * r / sqrt(1 - r^2)
  min(pt(tval, dof), pt(tval, dof, lower.tail = FALSE))
}

# faster version of cor.test
cor_test1 <- function(x, y) {
  r <- cor(x, y)
  dof <- length(x) - 2
  tval <- sqrt(dof) * r / sqrt(1 - r^2)
  pval <- min(pt(tval, dof), pt(tval, dof, lower.tail = FALSE))
  c(r, pval)
}

# even faster version of cor.test:
# given x, y and the pvals table, calculate a 2-vector of r and p value
cor_test2 <- function(x, y, pvals) {
  r <- cor(x, y)
  c(r, pvals[100 * round(r, 2) + 101])
}
# test
set.seed(123)
n <- 10
x <- rnorm(n); y <- rnorm(n)
dof <- n - 2
# pvals is the 201 p values for r = -1, -0.99, -0.98, ..., 1
pvals <- sapply(seq(-1, 1, 0.01), r2pval, dof = dof)
library(microbenchmark)
microbenchmark(cor.test(x, y), cor_test1(x, y), cor_test2(x, y, pvals))
giving:
Unit: microseconds
                   expr   min    lq    mean median     uq     max neval cld
         cor.test(x, y) 253.7 256.7 346.278 266.05 501.45   650.6   100   a
        cor_test1(x, y)  84.8  87.2 346.777  89.10 107.40 22974.4   100   a
 cor_test2(x, y, pvals)  72.4  75.0 272.030  79.45  91.25 17935.8   100   a
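As a sketch of how cor_test1 could slot into the loop from the question (assuming the A, B and C objects defined there; note that cor() defaults to Pearson, so adjust cor_test1, e.g. cor(x, y, method = "spearman"), if you specifically need Spearman rank correlations):
res <- t(vapply(seq_len(nrow(C)), function(i) {
  cor_test1(unlist(A[C[i, 1], ]), unlist(B[C[i, 2], ]))  # returns c(r, pval) for row i
}, numeric(2)))
C[, c(3, 4)] <- res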

Faster matrix multiplication by replacing a double loop

I have a dataframe which looks a bit like the one produced by the following code (but is much larger):
set.seed(10)
mat <- matrix(rbinom(200, size=1, prob = .5), ncol = 10)
In the columns are issues and 1 indicates that an observation is interested in a specific issue. I want to generate a network comparing all observations and have a count of issues that each dyad is jointly interested in.
I have produced the following code, which seems to be working fine:
mat2 <- matrix(NA, 20, 20)
for (i in 1:nrow(mat)) {
  for (j in 1:nrow(mat)) {
    mat2[i, j] <- sum(as.numeric(mat[i, ] == 1) + as.numeric(mat[j, ] == 1) == 2)
  }
}
So I compare every observation with every other observation, and only if both have a 1 for an issue (i.e., both are interested) does the sum reach 2 and get counted as a joint interest in that topic.
My problem is that my dataset is very large, and the loop now runs for hours already.
Does anyone have an idea how to do this while avoiding the loop?
This should be faster:
tmat <- t(mat==1)
mat4 <- apply(tmat, 2, function(x) colSums(tmat & x))
Going ahead and promoting #jogo's comment, as it is by far the fastest (thanks for the hint, I will use that in production as well).
set.seed(10)
mat <- matrix(rbinom(200, size = 1, prob = .5), ncol = 10)
mat2 <- matrix(NA, 20, 20)
binary_mat <- mat == 1
tmat <- t(mat == 1)
microbenchmark::microbenchmark(
  "loop" = for (i in 1:nrow(mat)) {
    for (j in 1:nrow(mat)) {
      mat2[i, j] <- sum(as.numeric(mat[i, ] == 1) + as.numeric(mat[j, ] == 1) == 2)
    }
  },
  "apply" = mat4 <- apply(tmat, 2, function(x) colSums(tmat & x)),
  "matrix multiplication" = mat5 <- mat %*% t(mat),
  "tcrossprod" = tcrossprod(mat),
  "tcrossprod binary" = tcrossprod(binary_mat)
)
On my machine this benchmark results in
Unit: microseconds
                  expr       min        lq        mean    median         uq       max neval cld
                  loop 16699.634 16972.271 17931.82535 17180.397 17546.1545 31502.706   100   b
                 apply   322.942   330.046   395.69045   357.886   368.8300  4299.228   100  a
 matrix multiplication    21.889    28.801    36.76869    39.360    43.9685    50.689   100  a
            tcrossprod     7.297     8.449    11.20218     9.984    14.4005    18.433   100  a
     tcrossprod binary     7.680     8.833    11.08316     9.601    12.0970    35.713   100  a
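To verify that the alternatives count the same joint interests as the double loop, you can compare the results directly; both comparisons should return TRUE:
mat2_check <- matrix(NA, 20, 20)
for (i in 1:nrow(mat)) for (j in 1:nrow(mat))
  mat2_check[i, j] <- sum(as.numeric(mat[i, ] == 1) + as.numeric(mat[j, ] == 1) == 2)
all(mat2_check == apply(tmat, 2, function(x) colSums(tmat & x)))  # should be TRUE
all(mat2_check == tcrossprod(mat))                                # should be TRUE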

For loops for nested variables within function in R

I would like to iterate through vectors of values and calculate something for every value while being within a function environment in R. For example:
# I have costs for 3 companies
c <- c(10, 20, 30)
# I have the same revenue across all 3
r <- 100
# I want to obtain the profits for all 3 within one variable
result <- list()
# I could do this in a for loop
for (i in 1:3) {
  result[i] <- r - c[i]
}
Now let's assume I have a model that is very long, and I define everything as a function which is to be solved with various random draws for the costs.
# Random draws
n <- 1000
r <- rnorm(n, mean = 100, sd = 10)
c1 <- rnorm(n, mean = 10, sd = 1)
c2 <- rnorm(n, mean = 20, sd = 2)
c3 <- rnorm(n, mean = 30, sd = 3)
X <- data.frame(r, c1, c2, c3)
fun <- function(x) {
  r <- x[1]
  c <- c(x[2], x[3], x[4])
  for (i in 1:3) {
    result[i] <- r - c[i]
  }
  return(result)
}
I could then evaluate the result for all draws by iterating through the rows of randomly sampled input data.
for (j in 1:n) {
  x <- X[j, ]
  y <- fun(x)
}
In this example, the output variable y would contain the nested result variable, which comprises the results for all 3 companies. However, my line of thinking results in an error, and I think it has to do with the fact that I try to return a nested variable. How would you approach something like this?
I would suggest rethinking your coding approach. This is a very un-R-like way of doing things.
For example, the first for loop can be written much more succinctly as
x <- c(10, 20, 30)
r <- 100
result <- lapply(-x, `+`, r)
Then fun becomes something like
fun <- function(x) lapply(-x[-1], `+`, x[1])
To then operate over the rows of a data.frame (which is what you seem to do in the last step), you can use something like
apply(X, 1, fun)
where the MARGIN = 1 argument in apply ensures that you are applying a function per row (as opposed to per column).
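For example (a small usage sketch, assuming the X from your question and the fun defined above):
profits <- apply(X, 1, fun)  # a list with one element per row of X
profits[[1]]                 # the three profits for the first draw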
Here's an approach using your function and a for loop:
# Random draws
n <- 1000
r <- rnorm(n, mean = 100, sd = 10)
c1 <- rnorm(n, mean = 10, sd = 1)
c2 <- rnorm(n, mean = 20, sd = 2)
c3 <- rnorm(n, mean = 30, sd = 3)
X <- data.frame(r, c1, c2, c3)
result <- list()
fun <- function(x) {
  r <- x[[1]]
  c <- c(x[[2]], x[[3]], x[[4]])
  for (i in 1:3) {
    result[i] <- r - c[i]
  }
  return(result)
}
# Create a list to store results
profits <- rep(list(1:3), nrow(X))
# Loop through each row of the dataframe and store the results in profits
for (i in 1:nrow(X)) {
  profits_temp <-
    fun(list(X[i, "r"], X[i, "c1"], X[i, "c2"], X[i, "c3"]))
  for (j in 1:3)
    profits[[i]][[j]] <- profits_temp[[j]]
}
# Eye results
profits[[1]]
#> [1] 93.23594 81.25731 70.27699
profits[[2]]
#> [1] 80.50516 69.27517 63.36439

Draw Markov chain given transition matrix in R

Let trans_m be an n by n transition matrix of a first-order Markov chain. In my problem, n is large, say 10,000, and trans_m is a sparse matrix constructed with the Matrix package; otherwise, its size would be huge. My goal is to simulate a Markov chain sequence given a vector of initial states s1 and this transition matrix trans_m. Consider the following concrete example.
library(Matrix)

n <- 5000  # there are 5,000 states in this case
trans_m <- Matrix(0, nrow = n, ncol = n, sparse = TRUE)
K <- 5     # the maximal number of states that can be reached
for (i in 1:n) {
  states_reachable <- sample(1:n, size = K)  # randomly pick K states that can be reached with equal probability
  trans_m[i, states_reachable] <- 1/K
}
s1 <- sample(1:n, size = 1000, replace = TRUE)  # generate 1000 initial states
draw_next <- function(s) {
  .s <- sample(1:n, size = 1, prob = trans_m[s, ])  # given the current state s, draw the next state .s
  .s
}
sapply(s1, draw_next)
Given the vector of initial states s1 as above, I used sapply(s1, draw_next) to draw the next state. When n is larger, sapply becomes slow. Is there a better way?
Repeatedly indexing by row can be slow, so it's faster to work on the transpose of the transition matrix and use column indexing, and to factor out the indexing from the inner function:
R> trans_m_t <- t(trans_m)
R>
R> require(microbenchmark)
R> microbenchmark(
+    apply(trans_m_t[, s1], 2, sample, x = n, size = 1, replace = F),
+    sapply(s1, draw_next)
+  )
Unit: milliseconds
                                                             expr        min          lq        mean      median          uq        max neval
 apply(trans_m_t[, s1], 2, sample, x = n, size = 1, replace = F) 111.828814 193.1139810 190.4379185 194.6563380 196.4273105 270.418189   100
                                           sapply(s1, draw_next) 499.255402 503.7398805 512.0849013 506.9467125 516.6082480 586.762573   100
Since you're already working with a sparse matrix, you might be able to get even better performance by working directly on the triplets. Using the higher-level matrix operators can trigger recompression.
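For instance, one way to work on the triplets (a rough sketch, not benchmarked here, using the trans_m, s1 and n objects from the question) is to extract them once with summary(), group them by row, and sample from each small candidate set directly:
trip <- summary(trans_m)  # data frame of nonzero entries: i (row), j (column), x (value)
by_row <- split(seq_len(nrow(trip)), factor(trip$i, levels = 1:n))
draw_next_sparse <- function(s) {
  idx  <- by_row[[s]]      # triplet rows belonging to state s
  cand <- trip$j[idx]      # states reachable from s
  if (length(cand) == 1L) return(cand)  # sample() would treat a single integer as 1:cand
  sample(cand, size = 1, prob = trip$x[idx])
}
next_states <- vapply(s1, draw_next_sparse, integer(1))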

Numerical integration in R for indicator functions on the unit square

I am trying to write a simple routine for obtaining the measure of certain regions in the unit square (two dimensions).
I know I could use one of the standard numerical integration functions, but I wanted to try to optimize my code since the class of functions I am interested in is very small.
This is my code:
#GRID DEFINITION
listgrid <- vector(mode = "list", length = 10201)
listgrid2 <- vector(mode = "list", length = 101)
for (i in 0:100) {
  listgrid2[[i + 1]] <- c(NaN, i/100)
}
listgrid <- rep(listgrid2, 101)
for (j in 0:100) {
  for (k in 0:100) {
    listgrid[[j*101 + k + 1]][1] <- j/100
  }
}

#Indicator Integration Routine
indint <- function(FUN) {
  valuegrid <- sapply(listgrid, FUN)
  result <- ifelse(sum(valuegrid) == 0, 0, (sum(valuegrid) + 50.5)/10201)
  return(result)
}
It just defines a grid, then evaluates the function at all the grid points, checks whether each value is zero or one, and adds up the elements of the grid. The 50.5 is a (very rough, I know) bias correction.
The problem is that this is neither fast nor accurate, I guess probably because of the sapply...
The regions I want to know the area of are nothing too complicated, they have smooth boundaries.
How would you approach this problem?
Thank you.
From the little testing I've done, you may well be best off using the adaptIntegrate function in the cubature package, as that is written in C.
Otherwise, if you can change your function from returning vector values to separate inputs for x and y, you may be best off using outer.
Here is some test code including your original:
#GRID DEFINITION
listgrid <- vector(mode = "list", length = 10201)
listgrid2 <- vector(mode = "list", length = 101)
for (i in 0:100) {
  listgrid2[[i + 1]] <- c(NaN, i/100)
}
listgrid <- rep(listgrid2, 101)
for (j in 0:100) {
  for (k in 0:100) {
    listgrid[[j*101 + k + 1]][1] <- j/100
  }
}

#Indicator Integration Routine
indint <- function(FUN) {
  valuegrid <- sapply(listgrid, FUN)
  result <- ifelse(sum(valuegrid) == 0, 0, (sum(valuegrid) + 50.5)/10201)
  return(result)
}
#Matrix Grid
Row <- rep(seq(0, 1, .01), 101)
Col <- rep(seq(0, 1, .01), each = 101)
Grid <- cbind(Row, Col)

#Integrator using apply
QuickInt <- function(FUN) {
  #FUN must be vector-valued and two-dimensional
  return(sum(apply(Grid, 1, FUN)) / dim(Grid)[1])
}

#Integrator using mapply
QuickInt2 <- function(FUN) {
  #FUN must take separate x and y inputs and be two-dimensional
  return(sum(mapply(FUN, x = Row, y = Col)) / length(Row))
}

#Index for each axis for outer
Index <- seq(0, 1, .01)

#Integrator using outer
QuickInt3 <- function(FUN) {
  #FUN must take separate x and y inputs and be two-dimensional
  return(sum(outer(Index, Index, FUN) / (length(Index) * length(Index))))
}
Here are some test functions:
#Two test functions, both in vector input (_V) and independent input (_I) forms
test_f1_I <- function(x, y) x^2 + y^2
test_f1_V <- function(X) X[1]^2 + X[2]^2
test_f2_I <- function(x, y) sin(x) * cos(y)
test_f2_V <- function(X) sin(X[1]) * cos(X[2])
Testing accuracy:
#Accuracy Test
library(cubature)
TrueValues <- c(adaptIntegrate(test_f1_V, c(0, 0), c(1, 1))$integral, adaptIntegrate(test_f2_V, c(0, 0), c(1, 1))$integral)
InditV <- c(indint(test_f1_V), indint(test_f2_V))
Q1V <- c(QuickInt(test_f1_V), QuickInt(test_f2_V))
Q2V <- c(QuickInt2(test_f1_I), QuickInt2(test_f2_I))
Q3V <- c(QuickInt3(test_f1_I), QuickInt3(test_f2_I))
Rs <- rbind(TrueValues, InditV, Q1V, Q2V, Q3V)
Rs
> Rs
                [,1]      [,2]
TrueValues 0.6666667 0.3868223
InditV     0.6749505 0.3911174
Q1V        0.6700000 0.3861669
Q2V        0.6700000 0.3861669
Q3V        0.6700000 0.3861669
Speed Test (R 3.1.2, Intel i7-2600K overclocked to 4.6 GHz, 16 GB RAM, Windows 7 64-bit, R compiled with OpenBLAS 2.13).
#Speed Test
library(microbenchmark)
Spd1 <- microbenchmark(adaptIntegrate(test_f1_V, c(0, 0), c(1, 1)),
                       indint(test_f1_V),
                       QuickInt(test_f1_V),
                       QuickInt2(test_f1_I),
                       QuickInt3(test_f1_I),
                       times = 100L,
                       unit = "relative",
                       control = list(order = 'block'))
Spd2 <- microbenchmark(adaptIntegrate(test_f2_V, c(0, 0), c(1, 1)),
                       indint(test_f2_V),
                       QuickInt(test_f2_V),
                       QuickInt2(test_f2_I),
                       QuickInt3(test_f2_I),
                       times = 100L,
                       unit = "relative",
                       control = list(order = 'block'))
> Spd1
Unit: relative
                                        expr         min          lq        mean      median          uq        max neval
 adaptIntegrate(test_f1_V, c(0, 0), c(1, 1))    1.000000    1.000000    1.000000    1.000000    1.000000    1.00000   100
                           indint(test_f1_V)  491.861071  543.166516  544.559059  568.961534  500.777301  934.62521   100
                         QuickInt(test_f1_V) 1777.972641 2102.021092 2188.762796 2279.148477 2141.449811 1781.36640   100
                        QuickInt2(test_f1_I)  681.548730  818.746406  887.345323  875.833969  923.155066  972.86832   100
                        QuickInt3(test_f1_I)    4.689201    4.525525    4.795768    4.401223    3.706085   31.28801   100
> Spd2
Unit: relative
                                        expr       min        lq      mean    median        uq        max neval
 adaptIntegrate(test_f2_V, c(0, 0), c(1, 1))   1.00000   1.00000   1.00000   1.00000   1.00000    1.00000   100
                           indint(test_f2_V) 192.49010 216.29215 232.99091 226.67006 239.32413  464.76166   100
                         QuickInt(test_f2_V) 628.38419 770.34591 872.15128 870.58423 937.67521 1018.59902   100
                        QuickInt2(test_f2_I) 267.68011 306.74721 345.69155 332.67551 352.19696  726.86772   100
                        QuickInt3(test_f2_I)  12.32649  12.05382  12.15228  11.89353  11.75468   26.54075   100
Here is a vectorized way to estimate the measure using simple Monte Carlo techniques. You can play with n and replicate.n to get the accuracy/speed trade-off you desire.
MC_estimator <- function(expr, n = 1E5, replicate.n = 50) {
  f <- function() {
    # generate points randomly over the square [-1, 1] x [-1, 1]
    x <- runif(n, min = -1)
    y <- runif(n, min = -1)
    # keep only the points inside the unit circle
    sel <- sqrt(x^2 + y^2) <= 1
    x <- x[sel]
    y <- y[sel]
    selN <- length(x)
    # select points in the desired region, find the proportion and multiply by
    # the total area of the sampled region (the unit circle) for an estimate
    eval(parse(text = paste("pi*sum(", expr, ")/selN")))
  }
  # replicate the procedure to bootstrap the estimate
  mean(replicate(replicate.n, f()))
}
expr <- "y <= sin(x) & y >= x^2 & y <= 0.5 - x"
> system.time(MC_estimator(expr))
user system elapsed
1.20 0.01 1.23
> true <- 0.0370152
> estimated <- MC_estimator(expr)
> cat(round(100*(estimated - true)/true, 2), "% difference")
0.34 % difference
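If a tighter estimate is needed, the same call can simply be run with larger n and replicate.n, at the cost of run time, for example:
estimated <- MC_estimator(expr, n = 1E6, replicate.n = 200)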
