Need help vectorizing a for loop in R

I'm trying to speed up an R function from a package I regularly use, so any help vectorizing the for-loop below would be much appreciated!
# simulate a random 0/1 matrix
y <- array(0, dim = c(75, 12))
samp <- function(x) sample(c(0, 1), 1)
y <- apply(y, c(1, 2), samp)
nr <- nrow(y)
nc <- ncol(y)
rs <- rowSums(y)
p <- colSums(y)
out <- matrix(0, nrow = nr, ncol = nc)
for (i in 1:nr) {
  out[i, sample.int(nc, rs[i], prob = p)] <- 1
}
The issue I'm having a hard time getting around is the reference to object 'rs' within the loop.
Any suggestions?

Here are two options:
This one uses the somewhat discouraged <<- operator:
lapply(1:nr, function(i) out[i, sample.int(nc, rs[i], prob = p)] <<- 1)
This one uses more traditional indexing:
out[do.call('rbind', sapply(1:nr, function(i) cbind(i, sample.int(nc, rs[i], prob = p))))] <- 1
I suppose you could also use Vectorize to do an implicit mapply on your function:
z <- Vectorize(sample.int, vectorize.args='size')(nc, rs, prob=p)
out[cbind(rep(1:length(z), sapply(z, length)), unlist(z))] <- 1
But I don't think that's necessarily any cleaner.
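For reference, here is the explicit mapply spelling of the same idea (a sketch, equivalent to the Vectorize call above; not part of the original answer):
z <- mapply(sample.int, size = rs, MoreArgs = list(n = nc, prob = p), SIMPLIFY = FALSE)
out[cbind(rep(seq_along(z), lengths(z)), unlist(z))] <- 1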
And, indeed, @Roland is correct that all of these are slower than just doing the for loop:
> microbenchmark(op(), t1(), t2(), t3())
Unit: microseconds
 expr     min       lq   median       uq      max neval
 op() 494.970 513.8290 521.7195 532.3040 1902.898   100
 t1() 591.962 602.1615 609.4745 617.5570 2369.385   100
 t2() 734.756 754.7700 764.3925 782.4825 2205.421   100
 t3() 642.383 672.9815 711.4700 763.8150 2283.169   100
Yay for benefit-free obfuscation!
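For reference, the benchmarked wrappers were not shown in the original; they presumably looked something like the following, with op() wrapping the original loop and t1() through t3() wrapping the three alternatives above (the exact definitions are an assumption):
op <- function() {
  out <- matrix(0, nrow = nr, ncol = nc)
  for (i in 1:nr) out[i, sample.int(nc, rs[i], prob = p)] <- 1
  out
}
t1 <- function() {
  out <- matrix(0, nrow = nr, ncol = nc)
  lapply(1:nr, function(i) out[i, sample.int(nc, rs[i], prob = p)] <<- 1)  # <<- assigns to this out
  out
}
t2 <- function() {
  out <- matrix(0, nrow = nr, ncol = nc)
  out[do.call('rbind', sapply(1:nr, function(i) cbind(i, sample.int(nc, rs[i], prob = p))))] <- 1
  out
}
t3 <- function() {
  out <- matrix(0, nrow = nr, ncol = nc)
  z <- Vectorize(sample.int, vectorize.args = 'size')(nc, rs, prob = p)
  out[cbind(rep(1:length(z), sapply(z, length)), unlist(z))] <- 1
  out
}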

Faster matrix multiplication by replacing a double loop

I have a data frame which looks a bit like the one produced by the following code (but much larger):
set.seed(10)
mat <- matrix(rbinom(200, size=1, prob = .5), ncol = 10)
The columns are issues, and a 1 indicates that an observation is interested in that issue. I want to generate a network comparing all observations, with a count of the issues each dyad is jointly interested in.
I have produced the following code, which seems to be working fine:
mat2 <- matrix(NA, 20, 20)
for (i in 1:nrow(mat)) {
  for (j in 1:nrow(mat)) {
    mat2[i, j] <- sum(as.numeric(mat[i, ] == 1) + as.numeric(mat[j, ] == 1) == 2)
  }
}
So I compare every observation with every other observation, and only if both have a 1 (i.e., both are interested) does the sum equal 2 and get counted as a joint interest in that topic.
My problem is that my dataset is very large, and the loop now runs for hours already.
Does anyone have an idea how to do this while avoiding the loop?
This should be faster:
tmat <- t(mat == 1)                                     # logical matrix, one observation per column
mat4 <- apply(tmat, 2, function(x) colSums(tmat & x))   # count shared TRUEs for every pair of observations
Going ahead and promoting @jogo's comment, as it is by far the fastest (thanks for the hint, I will use that in production as well).
set.seed(10)
mat <- matrix(rbinom(200, size=1, prob = .5), ncol = 10)
mat2 <- matrix(NA,20,20)
binary_mat <- mat == 1
tmat <- t(mat==1)
microbenchmark::microbenchmark(
  "loop" = for (i in 1:nrow(mat)) {
    for (j in 1:nrow(mat)) {
      mat2[i, j] <- sum(as.numeric(mat[i, ] == 1) + as.numeric(mat[j, ] == 1) == 2)
    }
  },
  "apply" = mat4 <- apply(tmat, 2, function(x) colSums(tmat & x)),
  "matrix multiplication" = mat5 <- mat %*% t(mat),
  "tcrossprod" = tcrossprod(mat),
  "tcrossprod binary" = tcrossprod(binary_mat)
)
On my machine this benchmark results in
Unit: microseconds
                  expr       min        lq        mean    median         uq       max neval cld
                  loop 16699.634 16972.271 17931.82535 17180.397 17546.1545 31502.706   100   b
                 apply   322.942   330.046   395.69045   357.886   368.8300  4299.228   100  a
 matrix multiplication    21.889    28.801    36.76869    39.360    43.9685    50.689   100  a
            tcrossprod     7.297     8.449    11.20218     9.984    14.4005    18.433   100  a
     tcrossprod binary     7.680     8.833    11.08316     9.601    12.0970    35.713   100  a
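Why the cross-product works: for a 0/1 matrix, the (i, j) entry of mat %*% t(mat) is the inner product of rows i and j, which counts exactly the columns where both rows have a 1. A quick check, reusing the objects above (mat2 must already have been filled by the loop):
all.equal(mat2, tcrossprod(mat), check.attributes = FALSE)  # should return TRUE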

Why is using a for loop faster than apply, in computing the norm of all rows/columns of a matrix?

Consider the following
n <- 10^4
p <- 2
foo <- matrix(runif(p*n), n, p)
I would like to compute the norm of each row of the matrix, i.e., to compute sqrt(crossprod(a_i)) where a_i is the i-th row of foo. I can do this with apply, or with a for loop:
for_loop <- function(x) {
  range <- seq_along(x[, 1])
  foo <- range
  for (i in range) {
    foo[i] <- sqrt(crossprod(x[i, ]))
  }
  foo
}
use_apply <- function(x) {
  apply(x, 1, function(r) sqrt(crossprod(r)))
}
I thought the simpler apply code would be faster, however:
> microbenchmark(for_loop(foo), use_apply(foo), times = 1000)
Unit: milliseconds
           expr      min       lq     mean   median       uq      max neval
  for_loop(foo) 16.07111 18.87690 24.25369 20.78997 27.66441 179.8374  1000
 use_apply(foo) 24.77948 29.05891 35.98689 31.89625 40.30085 205.1632  1000
Note that times = 1000 can take quite a bit of time; if you don't have a fast machine, you may want to use the microbenchmark defaults. Why is apply slower than the for-loop code? Is there some function from purrr which would be faster?
EDIT I couldn't believe that crossprod(x) would be so much slower than sum(x*x), so I wanted to check Emmanuel-Lin's results. I get very different timings:
my_loop <- function(x) {
  range <- seq_along(x[, 1])
  foo <- range
  for (i in range) {
    foo[i] <- sqrt(sum(x[i, ] * x[i, ]))
  }
  foo
}
my_apply <- function(x) {
  apply(x, 1, function(r) sqrt(sum(r * r)))
}
for_loop <- function(x) {
  range <- seq_along(x[, 1])
  foo <- range
  for (i in range) {
    foo[i] <- sqrt(crossprod(x[i, ]))
  }
  foo
}
use_apply <- function(x) {
  apply(x, 1, function(r) sqrt(crossprod(r)))
}
> microbenchmark(for_loop(foo), my_loop(foo), use_apply(foo), my_apply(foo))
Unit: milliseconds
           expr       min       lq     mean   median       uq       max neval
  for_loop(foo) 16.299758 17.77176 21.59988 19.04428 22.44558 131.33819   100
   my_loop(foo)  9.950813 12.02106 14.43540 12.66142 15.26865  45.42030   100
 use_apply(foo) 25.480019 27.95396 31.98351 29.85244 36.41599  60.88678   100
  my_apply(foo) 13.277354 14.98329 17.60356 15.98103 19.70325  34.07097   100
OK, my_apply and my_loop are faster (I still can't believe it! What, is crossprod optimized for slowness? :-/), but not as much faster as Emmanuel-Lin found. It's probably related to the dimension checks that crossprod performs.
apply is literally an R for-loop if you check the code:
# only the for-loop code shown here
if (length(d.call) < 2L) {
    if (length(dn.call))
        dimnames(newX) <- c(dn.call, list(NULL))
    for (i in 1L:d2) {
        tmp <- forceAndCall(1, FUN, newX[, i], ...)
        if (!is.null(tmp))
            ans[[i]] <- tmp
    }
}
else for (i in 1L:d2) {
    tmp <- forceAndCall(1, FUN, array(newX[, i], d.call,
        dn.call), ...)
    if (!is.null(tmp))
        ans[[i]] <- tmp
}
In addition to the loop above, apply also runs a series of checks to make sure the arguments you provided are correct. Both of these make it a bit slower than a plain for loop.
However, lapply, sapply and vapply run their loops in C, which avoids some of that per-iteration overhead, so they are typically faster than an R-level for loop.
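As for the purrr part of the question: a vapply or purrr::map_dbl version of the same computation would look roughly like the sketch below. Neither appears in the original answers and neither was benchmarked here; both still call an R function once per row, so don't expect them to beat the plain loop by much.
use_vapply <- function(x) {
  # one sqrt(sum(...)) call per row, collected into a pre-typed numeric vector
  vapply(seq_len(nrow(x)), function(i) sqrt(sum(x[i, ] * x[i, ])), numeric(1))
}
# purrr equivalent (assumes purrr is installed):
# use_map_dbl <- function(x) purrr::map_dbl(seq_len(nrow(x)), ~ sqrt(sum(x[.x, ] * x[.x, ])))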
To complete @LyzandeR's answer:
You can make the computation much faster by coding the multiplication yourself:
Replace crossprod with sum(r * r).
my_loop <- function(x) {
  range <- seq_along(x[, 1])
  foo <- range
  for (i in range) {
    foo[i] <- sqrt(sum(x[i, ] * x[i, ]))
  }
  foo
}
my_sapply <- function(x) {
  # despite its name, this uses apply, just like my_apply in the edit above
  apply(x, 1, function(r) sqrt(sum(r * r)))
}
microbenchmark(for_loop(X),   # X is the answerer's own test matrix (not defined in the post)
               use_apply(X),
               my_loop(X),
               my_sapply(X),
               times = 100)
And the results:
Unit: milliseconds
         expr       min        lq      mean    median        uq       max neval
  for_loop(X) 122.45210 145.67150 179.84469 177.63446 199.10468 460.73182   100
 use_apply(X) 141.99250 169.11596 198.82019 198.11953 223.50906 296.94566   100
   my_loop(X)  10.38776  11.61263  16.47609  14.24066  19.07957  58.50008   100
 my_sapply(X)  13.21431  15.32081  23.23124  18.39573  26.08099 222.57685   100
So it is more than 10 times faster!
Also, you can see that your machine is much faster than mine :/
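For completeness, the per-row norms can also be computed with no explicit loop at all via rowSums; this variant is not in any of the answers above, just a sketch:
row_norms <- function(x) sqrt(rowSums(x * x))  # fully vectorized, no per-row R calls
all.equal(row_norms(foo), my_loop(foo))        # should return TRUE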

Fill matrix using names with Rcpp

Suppose that the named elements of a vector, stored in a list, should be assigned to the matching columns of a matrix (see the example below).
library(microbenchmark)
set.seed(123)
myList <- list()
for (i in 1:10000) {
  myList[[i]] <- list(sample(setNames(rnorm(5), sample(LETTERS[1:5])), ceiling(runif(1, 1, 4))))
}
myMatrix <- matrix(NA, ncol = 5, nrow = 10000)
colnames(myMatrix) <- LETTERS[1:5]
for (i in 1:10000) {
  myMatrix[i, match(names(myList[[i]][[1]]), colnames(myMatrix))] <- myList[[i]][[1]]
}
myList[[6]][[1]]
myMatrix[6,]
microbenchmark(for(i in 1:10000) {myMatrix[i, match(names(myList[[i]][[1]]), colnames(myMatrix))] <- myList[[i]][[1]]}, times = 10)
In this example, elements of 10,000 vectors are assigned to the matching columns of a matrix.
Problem
The assignment is slow (approximately 3.5 seconds)!
Question
How can I speed up this process in R or with Rcpp?
Use rbindlist from package data.table. It can bind by matching column names.
library(microbenchmark)
n <- 10000
set.seed(123)
myList <- list()
for (i in 1:n) {
  myList[[i]] <- list(sample(setNames(rnorm(5), sample(LETTERS[1:5])), ceiling(runif(1, 1, 4))))
}
myMatrix <- matrix(NA, ncol = 5, nrow = n)
colnames(myMatrix) <- LETTERS[1:5]
library(data.table)
microbenchmark(match = for (i in 1:n) {
                 myMatrix[i, match(names(myList[[i]][[1]]), colnames(myMatrix))] <- myList[[i]][[1]]
               },
               rbindlist = {
                 myMatrix1 <- as.matrix(rbindlist(lapply(myList, function(x) as.list(unlist(x))),
                                                  fill = TRUE))
                 myMatrix1 <- myMatrix1[, order(colnames(myMatrix1))]
               },
               times = 10)
# Unit: milliseconds
#       expr        min         lq       mean     median         uq        max neval cld
#      match 1392.52949 1496.40382 1599.63584 1605.39080 1690.98410 1761.67322    10   b
#  rbindlist   48.76146   50.29176   51.66355   51.10672   53.75465   54.93798    10  a
all.equal(myMatrix, myMatrix1)
#TRUE
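A base-R alternative that avoids the per-row loop entirely is a single matrix-index assignment; this is only a sketch (not part of the original answer), reusing myList and n from above:
vals <- unlist(lapply(myList, `[[`, 1))                    # all values, with their names preserved
rows <- rep(seq_len(n), lengths(lapply(myList, `[[`, 1)))  # the row each value belongs to
myMatrix2 <- matrix(NA_real_, nrow = n, ncol = 5, dimnames = list(NULL, LETTERS[1:5]))
myMatrix2[cbind(rows, match(names(vals), colnames(myMatrix2)))] <- vals
# all.equal(myMatrix, myMatrix2) should return TRUE once myMatrix has been filled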

Vectorization of tempered fractional differencing calculation

I am trying to speed up this approximation of tempered fractional differencing.
This controls the long/quasi-long memory of a time series. Given that the first for loop is iterative, I don't know how to vectorize it. Also, the output of my attempted vectorization is a little off from that of the unaltered raw code. Thank you for your help.
Raw Code
tempfracdiff <- function(x, d, eta) {
  n <- length(x)
  x <- x - mean(x)
  PI <- numeric(n)
  PI[1] <- -d
  TPI <- numeric(n)
  ydiff <- x
  for (k in 2:n) PI[k] <- PI[k - 1] * (k - 1 - d) / k
  for (j in 1:n) TPI[j] <- exp(-eta * j) * PI[j]
  for (i in 2:n) ydiff[i] <- x[i] + sum(TPI[1:(i - 1)] * x[(i - 1):1])
  return(ydiff)
}
Attempted Vectorization
tempfracdiffFL <- function(x, d, eta) {
  n <- length(x)
  x <- x - mean(x)
  PI <- numeric(n)
  PI[1] <- -d
  TPI <- numeric(n)
  ydiff <- x
  for (k in 2:n) PI[k] <- PI[k - 1] * (k - 1 - d) / k
  TPI[1:n] <- exp(-eta * 1:n) * PI[1:n]
  # this is why the output is off: `:` uses only the first element of 2:n - 1, so every row gets the same single term
  ydiff[2:n] <- x[2:n] + sum(TPI[1:(2:n - 1)] * x[(2:n - 1):1])
  return(ydiff)
}
For PI, you can use cumprod:
k <- 1:n
PI <- cumprod((k-1-d)/k)
TPI may be expressed without indices:
TPI <- exp(-eta*k)*PI
And ydiff is x plus the convolution of x and TPI:
ydiff <- x + c(0, convolve(x, rev(TPI), type = "o")[1:(n - 1)])
So, putting it all together:
mytempfracdiff <- function(x, d, eta) {
  n <- length(x)
  x <- x - mean(x)
  k <- 1:n
  PI <- cumprod((k - 1 - d) / k)
  TPI <- exp(-eta * k) * PI
  x + c(0, convolve(x, rev(TPI), type = "o")[1:(n - 1)])
}
Test case example
set.seed(1)
x <- rnorm(100)
d <- 0.1
eta <- 0.5
all.equal(mytempfracdiff(x,d,eta), tempfracdiff(x,d,eta))
# [1] TRUE
library(microbenchmark)
microbenchmark(mytempfracdiff(x,d,eta), tempfracdiff(x,d,eta))
Unit: microseconds
                      expr     min       lq      mean   median       uq      max neval
 mytempfracdiff(x, d, eta) 186.220 198.0025  211.9254  207.473  219.944  302.548   100
   tempfracdiff(x, d, eta) 961.617 978.5710 1117.8803 1011.257 1061.816 3556.270   100
For PI[k], Reduce is helpful
n <- 5; d <- 0.3
fun <- function(a, b) a * (b - 1 - d) / b
Reduce(fun, c(1, 1:n), accumulate = TRUE)[-1]  # eliminates PI[0]
[1] -0.30000000 -0.10500000 -0.05950000 -0.04016250 -0.02972025
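The same values come straight out of cumprod, as in the cumprod-based answer above; a quick equivalence check with the n, d and fun defined here:
all.equal(Reduce(fun, c(1, 1:n), accumulate = TRUE)[-1],
          cumprod(((1:n) - 1 - d) / (1:n)))  # should return TRUE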

R - fastest way to look at all unique pairs of columns

I have a data frame, M, and I want to calculate all pairwise correlations between the columns of M. I can do this easily using apply functions, e.g.
pvals = laply(M, function(x) llply(M, function(y) cor.test(x, y)$p.value))  # laply/llply from the plyr package
However, this solution is doing 2x the required work because the correlation between x and y is the same as the correlation between y and x.
I am looking for a fast, simple way to calculate all correlations among unique pairs of columns. I would like the result to be an NxN matrix, where N=ncol(M). I've searched on Stack Overflow for a long time, but couldn't find anything that did this. Thanks!
For the iris data, you can do:
data(iris)
r <- cor(iris[1:4])
to get the correlation matrix.
You can look at what cor.test actually does with stats:::cor.test and find this...
df <- n - 2L
ESTIMATE <- c(cor = r)
PARAMETER <- c(df = df)
STATISTIC <- c(t = sqrt(df) * r/sqrt(1 - r^2))
p <- pt(STATISTIC, df)
which is all vectorized, so you can just run it.
There's a good discussion of the different tests on Wikipedia: http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
You could use combn:
# Some data:
DF <- USJudgeRatings
# transform to matrix for better subset performance:
m <- as.matrix(DF)
# use combn and its `FUN` argument:
res <- matrix(nrow = ncol(DF), ncol = ncol(DF))
res[lower.tri(res)] <- combn(seq_along(DF), 2, function(ind) cor.test(m[, ind[[1]]], m[, ind[[2]]])$p.value)
res[upper.tri(res)] <- t(res)[upper.tri(res)]
diag(res) <- 0
Benchmarks:
corpRoland <- function(DF) {
  m <- as.matrix(DF)
  res <- matrix(nrow = ncol(DF), ncol = ncol(DF))
  res[lower.tri(res)] <- combn(seq_along(DF), 2,
                               function(ind) cor.test(m[, ind[[1]]], m[, ind[[2]]])$p.value)
  res[upper.tri(res)] <- t(res)[upper.tri(res)]
  diag(res) <- 0
  res
}
corpNeal <- function(DF) {
  cors <- cor(DF)
  df <- nrow(DF) - 2
  STATISTIC <- c(t = sqrt(df) * cors / sqrt(1 - cors^2))
  p <- pt(STATISTIC, df)
  matrix(2 * pmin(p, 1 - p), nrow = ncol(DF))
}
library(microbenchmark)
DF <- as.data.frame(matrix(rnorm(1e3), ncol=10))
microbenchmark(corpRoland(DF), corpNeal(DF))
# Unit: microseconds
#            expr       min         lq    median       uq       max neval
#  corpRoland(DF) 14021.003 14228.2040 14950.212 15157.27 17013.574   100
#    corpNeal(DF)   342.631   351.6775   373.636   385.34   467.773   100
DF <- as.data.frame(matrix(rnorm(1e4), ncol=100))
microbenchmark(corpRoland(DF), corpNeal(DF), times=10)
# Unit: milliseconds
#            expr         min          lq      median          uq         max neval
#  corpRoland(DF) 1595.878487 1601.221980 1615.391891 1633.746678 1637.373231    10
#    corpNeal(DF)    8.359662    8.751755    9.021532    9.509576    9.753154    10
So, you should use the answer by @NealFultz.
