R - fastest way to look at all unique pairs of columns

I have a data frame, M, and I want to calculate all pairwise correlations between the columns of M. I can do this easily using apply functions, e.g.
pvals = laply(M, function(x) llply(M, function(y) cor.test(x, y)$p.value))
However, this solution is doing 2x the required work because the correlation between x and y is the same as the correlation between y and x.
I am looking for a fast, simple way to calculate all correlations among unique pairs of columns. I would like the result to be an NxN matrix, where N=ncol(M). I've searched on Stack Overflow for a long time, but couldn't find anything that did this. Thanks!

For the iris data, you can do:
data(iris)
r <- cor(iris[1:4])
to get the correlation matrix.
You can see what cor.test actually does by printing the default method, stats:::cor.test.default, where you will find this:
df <- n - 2L
ESTIMATE <- c(cor = r)
PARAMETER <- c(df = df)
STATISTIC <- c(t = sqrt(df) * r/sqrt(1 - r^2))
p <- pt(STATISTIC, df)
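All of this is vectorized, so you can run it directly on the whole correlation matrix r, with n set to nrow(iris). Note that pt(STATISTIC, df) on its own is a one-sided p-value; for the two-sided value that cor.test reports by default, use 2 * pmin(p, 1 - p).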
There's a good discussion of the different tests on wikipedia: http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
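As a minimal sketch of the idea above (not the answerer's exact code), the following computes two-sided Pearson p-values for every pair of iris columns from a single cor() call and spot-checks one entry against cor.test. The names X, tstat, p and p_check are introduced here for illustration only.

```r
# Sketch: p-values for all column pairs from one cor() call
X <- as.matrix(iris[1:4])
r <- cor(X)                      # 4 x 4 Pearson correlation matrix
n <- nrow(X)
df <- n - 2L

# t statistic for every pair at once; on the diagonal r = 1,
# so the statistic is infinite and the p-value is (essentially) 0
tstat <- sqrt(df) * r / sqrt(1 - r^2)

# two-sided p-value, matching the cor.test default alternative
p <- 2 * pt(-abs(tstat), df)

# spot check one pair against cor.test
p_check <- cor.test(X[, 1], X[, 2])$p.value
all.equal(as.numeric(p[1, 2]), as.numeric(p_check))
```

Because cor() fills the whole matrix in compiled code, computing both triangles this way is typically still far cheaper than calling cor.test once per unique pair.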

You could use combn:
#Some data:
DF <- USJudgeRatings
#transform to matrix for better subset performance:
m <- as.matrix(DF)
#use combn and its `FUN` argument:
res <- matrix(nrow=ncol(DF), ncol=ncol(DF))
res[lower.tri(res)] <- combn(seq_along(DF), 2, function(ind) cor.test(m[, ind[[1]]], m[, ind[[2]]])$p.value)
res[upper.tri(res)] <- t(res)[upper.tri(res)]
diag(res) <- 0
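The assignment into res[lower.tri(res)] relies on combn producing pairs in the same order in which lower.tri walks the matrix (column by column). A small sketch that checks this ordering for a 4 x 4 case; n, pairs and idx are illustrative names, not part of the answer:

```r
n <- 4
pairs <- combn(n, 2)   # columns are the pairs (1,2), (1,3), (1,4), (2,3), ...

# positions of the lower triangle, listed in column-major order
idx <- which(lower.tri(matrix(0, n, n)), arr.ind = TRUE)

# the first element of each pair is the matrix column,
# the second element is the matrix row
same_order <- all(idx[, "col"] == pairs[1, ]) && all(idx[, "row"] == pairs[2, ])
same_order
```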
Benchmarks:
corpRoland <- function(DF) {
m <- as.matrix(DF)
res <- matrix(nrow=ncol(DF), ncol=ncol(DF))
res[lower.tri(res)] <- combn(seq_along(DF), 2, function(ind) cor.test(m[, ind[[1]]], m[, ind[[2]]])$p.value)
res[upper.tri(res)] <- t(res)[upper.tri(res)]
diag(res) <- 0
res}
corpNeal <- function(DF) {
cors <- cor(DF)
df <- nrow(DF)-2
STATISTIC <- c(t = sqrt(df) * cors/sqrt(1 - cors^2))
p <- pt(STATISTIC, df)
matrix(2 * pmin(p, 1 - p),nrow=ncol(DF))}
library(microbenchmark)
DF <- as.data.frame(matrix(rnorm(1e3), ncol=10))
microbenchmark(corpRoland(DF), corpNeal(DF))
#Unit: microseconds
# expr min lq median uq max neval
# corpRoland(DF) 14021.003 14228.2040 14950.212 15157.27 17013.574 100
# corpNeal(DF) 342.631 351.6775 373.636 385.34 467.773 100
DF <- as.data.frame(matrix(rnorm(1e4), ncol=100))
microbenchmark(corpRoland(DF), corpNeal(DF), times=10)
# Unit: milliseconds
# expr min lq median uq max neval
# corpRoland(DF) 1595.878487 1601.221980 1615.391891 1633.746678 1637.373231 10
# corpNeal(DF) 8.359662 8.751755 9.021532 9.509576 9.753154 10
So, you should use the answer by @NealFultz.

How to optimize my correlation problem in R?

I have three dataframes in R, let's call them A, B, and C.
Dataframe C contains two columns: the first contains row names from dataframe A and the second contains row names from dataframe B:
C <- data.frame(col1 = c("a12", "a9"), col2 = c("b6","b54"))
I want to calculate the correlation coefficient and p-values for each row of the table C using the corresponding values from the rows of table A and B (i.e. correlating values from the a12 row in the table A with values from b6 row from table B, a9 row from table A with b54 row from table B, etc.) and put the resulting values in additional columns in the table C. This is my current naive and highly inefficient code:
for (i in 1:nrow(C)) {
correlation <- cor.test(unlist(A[C[i,1],]), unlist(B[C[i,2],]), method = "spearman")
C[i,3] <-correlation$estimate
C[i,4] <- correlation$p.value
}
The main problem is that with my current large datasets this analysis can literally take months, so I'm looking for a more efficient way to accomplish this task. I also tried the following code using the "Hmisc" package, but the server I'm working on can't handle the large vectors:
A <- t(A)
B <- t(B)
ind.A <- match(C[,1], colnames(A))
A<- A[,ind.A]
ind.B <- match(C[,2], colnames(B))
B<- B[,ind.B]
C[,3]<- diag(rcorr(as.matrix(A),as.matrix(B),type = "spearman")$r[c(1:ncol(A)),c(1:ncol(A))])
C[,4]<- diag(rcorr(as.matrix(A),as.matrix(B),type = "spearman")$P[c(1:ncol(A)),c(1:ncol(A))])
Based on the comment by @HYENA, I tried parallel processing. This approach accelerated the process approximately 4 times (with 8 cores). The code:
library(foreach)
library(doParallel)
cl<- makeCluster(detectCores())
registerDoParallel(cl)
cor.res<- foreach (i=1:nrow(C)) %dopar% {
a<- C[i,1]
b<- C[i,2]
correlation<- cor.test(unlist(A[a,]),unlist(B[b,]), method = "spearman")
c(correlation$estimate,correlation$p.value)
}
cor.res<- data.frame(Reduce("rbind",cor.res))
C[,c(3,4)]<- cor.res
Extract just the part you need from cor.test, giving cor_test1, and use that instead. In addition, you can create a lookup table for the p-values, giving cor_test2, which is slightly faster than cor_test1.
Based on the median column, with vectors of length 10 these run about 3x faster than cor.test. Although cor_test2 is only slightly faster than cor_test1 here, it is included because the speed could depend on the size of the input, which we don't have; you can try it out yourself with whatever sizes you have.
# given correlation and degrees of freedom, output the two-sided p value
# (the factor of 2 matches the default alternative = "two.sided" in cor.test)
r2pval <- function(r, dof) {
tval <- sqrt(dof) * r/sqrt(1 - r^2)
2 * min(pt(tval, dof), pt(tval, dof, lower.tail = FALSE))
}
# faster version of cor.test (Pearson correlation, two-sided p value)
cor_test1 <- function(x, y) {
r <- cor(x, y)
dof <- length(x) - 2
tval <- sqrt(dof) * r/sqrt(1 - r^2)
pval <- 2 * min(pt(tval, dof), pt(tval, dof, lower.tail = FALSE))
c(r, pval)
}
# even faster version of cor.test.
# Given x, y and the pvals table calculate a 2-vector of r and p value
cor_test2 <- function(x, y, pvals) {
r <- cor(x, y)
c(r, pvals[100 * round(r, 2) + 101])
}
# test
set.seed(123)
n <- 10
x <- rnorm(n); y <- rnorm(n)
dof <- n - 2
# pvals is the 201 p values for r = -1, -0.99, -0.98, ..., 1
pvals <- sapply(seq(-1, 1, 0.01), r2pval, dof = dof)
library(microbenchmark)
microbenchmark(cor.test(x, y), cor_test1(x, y), cor_test2(x, y, pvals))
giving:
Unit: microseconds
expr min lq mean median uq max neval cld
cor.test(x, y) 253.7 256.7 346.278 266.05 501.45 650.6 100 a
cor_test1(x, y) 84.8 87.2 346.777 89.10 107.40 22974.4 100 a
cor_test2(x, y, pvals) 72.4 75.0 272.030 79.45 91.25 17935.8 100 a
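Since the question asks for Spearman correlations over many row pairs, the same idea can be pushed further by removing the per-pair function call altogether. The sketch below is an assumption-based variant, not the answerer's code: it ranks each selected row, computes all row-wise correlations with rowSums, and uses a t-distribution approximation for the p-value. cor.test may instead compute an exact p-value for small samples without ties, so p-values can differ slightly. The toy A, B and C are hypothetical stand-ins for the question's data.

```r
# Hypothetical toy data standing in for A, B and C from the question
set.seed(1)
A <- matrix(rnorm(50), nrow = 5, dimnames = list(paste0("a", 1:5), NULL))
B <- matrix(rnorm(50), nrow = 5, dimnames = list(paste0("b", 1:5), NULL))
C <- data.frame(col1 = c("a1", "a3", "a5"), col2 = c("b2", "b2", "b4"))

# Pull out the paired rows and rank within each row
# (Spearman correlation is Pearson correlation on ranks)
RA <- t(apply(A[as.character(C$col1), , drop = FALSE], 1, rank))
RB <- t(apply(B[as.character(C$col2), , drop = FALSE], 1, rank))

# Row-wise Pearson correlation of the ranks, all pairs at once
RA <- RA - rowMeans(RA)
RB <- RB - rowMeans(RB)
rho <- rowSums(RA * RB) / sqrt(rowSums(RA^2) * rowSums(RB^2))

# Two-sided p-value from the t approximation with n - 2 degrees of freedom
n <- ncol(A)
tstat <- rho * sqrt((n - 2) / (1 - rho^2))
pval <- 2 * pt(-abs(tstat), df = n - 2)

C$rho <- unname(rho)
C$p <- unname(pval)
C
```

For very large C, the same computation can be run over blocks of rows to keep memory use bounded.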

Faster matrix multiplication by replacing a double loop

I have data that looks roughly like what the following code produces (but much larger):
set.seed(10)
mat <- matrix(rbinom(200, size=1, prob = .5), ncol = 10)
The columns are issues, and a 1 indicates that an observation is interested in a specific issue. I want to generate a network comparing all observations, with a count of the issues that each dyad is jointly interested in.
I have produced the following code, which seems to be working fine:
mat2 <- matrix(NA,20,20)
for(i in 1:nrow(mat)){
for(j in 1:nrow(mat)){
mat2[i,j] <- sum(as.numeric(mat[i,]==1) + as.numeric(mat[j,]==1) == 2)
}
}
So I compare every entry with every other entry, and only if both have a 1 entry (i.e., they are interested), then this sums to 2 and will be counted as joint interest in a topic.
My problem is that my dataset is very large, and the loop has already been running for hours.
Does anyone have an idea how to do this while avoiding the loop?
This should be faster:
tmat <- t(mat==1)
mat4 <- apply(tmat, 2, function(x) colSums(tmat & x))
Going ahead and promoting @jogo's comment, as it is by far the fastest (thanks for the hint, I will use that in production as well).
set.seed(10)
mat <- matrix(rbinom(200, size=1, prob = .5), ncol = 10)
mat2 <- matrix(NA,20,20)
binary_mat <- mat == 1
tmat <- t(mat==1)
microbenchmark::microbenchmark(
"loop" = for(i in 1:nrow(mat)){
for(j in 1:nrow(mat)){
mat2[i,j] <- sum(as.numeric(mat[i,]==1) + as.numeric(mat[j,]==1) == 2)
}
},
"apply" = mat4 <- apply(tmat, 2, function(x) colSums(tmat & x)),
"matrix multiplication" = mat5 <- mat %*% t(mat),
"tcrossprod" = tcrossprod(mat),
"tcrossprod binary" = tcrossprod(binary_mat)
)
On my machine this benchmark results in
Unit: microseconds
expr min lq mean median uq max neval cld
loop 16699.634 16972.271 17931.82535 17180.397 17546.1545 31502.706 100 b
apply 322.942 330.046 395.69045 357.886 368.8300 4299.228 100 a
matrix multiplication 21.889 28.801 36.76869 39.360 43.9685 50.689 100 a
tcrossprod 7.297 8.449 11.20218 9.984 14.4005 18.433 100 a
tcrossprod binary 7.680 8.833 11.08316 9.601 12.0970 35.713 100 a
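The reason matrix multiplication works here is that, for 0/1 data, entry (i, j) of mat %*% t(mat) is the sum over issues of mat[i, k] * mat[j, k], which counts exactly the issues where both rows have a 1. A short sketch that checks tcrossprod against a simplified form of the double loop; loop_res and fast_res are names introduced for this example:

```r
set.seed(10)
mat <- matrix(rbinom(200, size = 1, prob = .5), ncol = 10)

# reference result: count the issues shared by each pair of rows
loop_res <- matrix(NA_real_, nrow(mat), nrow(mat))
for (i in seq_len(nrow(mat))) {
  for (j in seq_len(nrow(mat))) {
    loop_res[i, j] <- sum(mat[i, ] == 1 & mat[j, ] == 1)
  }
}

# same counts in one call: tcrossprod(mat) is mat %*% t(mat)
fast_res <- tcrossprod(mat)

all.equal(loop_res, fast_res)
```

The diagonal of the result holds each observation's own number of issues, i.e. rowSums(mat).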

Efficient way to sum every k columns in each row of large sparse matrix

In this post on CodeReview, I compared several ways to generate a large sparse matrix. Specifically, I compared dense and sparse constructions using the Matrix package in R. My question is about post-processing with the sparse constructions. I'm finding that when I try to find the row sums of every k columns, the dense construction outperforms the sparse constructions.
Microbenchmarking
ncols <- 100000
nrows <- 1000
col_probs <- runif(ncols, 0.001, 0.002)
mat1 <- spMat_dense(ncols=ncols,nrows=nrows,col_probs=col_probs)
mat2 <- spMat_dgC(ncols=ncols,nrows=nrows,col_probs = col_probs)
mat3 <- spMat_dgT(ncols=ncols,nrows=nrows,col_probs=col_probs)
k <- 50
starts <- seq(1, ncols, by=k)
microbenchmark::microbenchmark(sapply(starts, function(x) rowSums(mat1[, x:(x+k-1)])),
sapply(starts, function(x) Matrix::rowSums(mat2[, x:(x+k-1)])),
sapply(starts, function(x) Matrix::rowSums(mat3[, x:(x+k-1)])),
times=5L)
Unit: milliseconds
expr min lq mean median uq max neval cld
sapply(starts, function(x) rowSums(mat1[, x:(x + k - 1)])) 912.0453 947.0454 1041.365 965.4375 1007.311 1374.988 5 a
sapply(starts, function(x) Matrix::rowSums(mat2[, x:(x + k - 1)])) 2097.4125 2208.0056 2566.575 2406.8450 2851.640 3268.970 5 b
sapply(starts, function(x) Matrix::rowSums(mat3[, x:(x + k - 1)])) 13231.4790 13619.3818 13819.745 13675.6282 13923.803 14648.434 5 c
My guess is that the sapply function works better with dense matrices because it doesn't need to do the sparse to dense conversion under the hood. The functions are posted below.
Question
Is there a way to improve the speed of the above post-processing for sparse constructions?
Functions
spMat_dense <- function(ncols,nrows,col_probs){
matrix(rbinom(nrows*ncols,1,col_probs),
ncol=ncols,byrow=T)
}
library(Matrix)
spMat_dgC <- function(ncols,nrows,col_probs){
#Credit to Andrew Guster (https://stackoverflow.com/a/56348978/4321711)
mat <- Matrix(0, nrows, ncols, sparse = TRUE) #blank matrix for template
i <- vector(mode = "list", length = ncols) #each element of i contains the '1' rows
p <- rep(0, ncols) #p will be cumsum no of 1s by column
for(r in 1:nrows){
row <- rbinom(ncols, 1, col_probs) #random row
p <- p + row #add to column identifier
if(any(row == 1)){
for (j in which(row == 1)){
i[[j]] <- c(i[[j]], r-1) #append row identifier
}
}
}
p <- c(0, cumsum(p)) #this is the format required
i <- unlist(i)
x <- rep(1, length(i))
mat@i <- as.integer(i)
mat@p <- as.integer(p)
mat@x <- x
return(mat)
}
spMat_dgT <- function(ncols, nrows, col_probs){
#Credit to minem - https://codereview.stackexchange.com/a/222190/121860
r <- lapply(1:ncols, function(x) {
p <- col_probs[x]
i <- sample.int(2L, size = nrows, replace = T, prob = c(1 - p, p))
which(i == 2L)
})
rl <- lengths(r)
nc <- rep(1:ncols, times = rl) # col indexes
nr <- unlist(r) # row index
ddims <- c(nrows, ncols)
sparseMatrix(i = nr, j = nc, dims = ddims, giveCsparse = FALSE)
}
Using a dgCMatrix as input, this is one possible solution that is very fast:
new_combine <- function(mat,k){
#Convert dgCMatrix to dgTMatrix
x.T <- as(mat, "dgTMatrix")
#Map column indices to new set of indices
#based on partitioning every k columns
x.T@j <- as.integer(x.T@j %/% k)
#Correct dimensions of new matrix
x.T@Dim <- as.integer(c(nrow(x.T),floor(ncol(mat)/k)))
#Convert back to dgCMatrix
y <- as(x.T,"dgCMatrix")
y
}
microbenchmark::microbenchmark(sapply(starts, function(x) Matrix::rowSums(mat2[, x:(x+k-1)])),
new_combine(mat2,k),
times=5L)
Unit: milliseconds
expr min lq mean median uq max neval cld
sapply(starts, function(x) Matrix::rowSums(mat2[, x:(x + k - 1)])) 1808.872676 1864.783181 1925.17118 1935.98946 1990.28866 2025.92192 5 b
new_combine(mat2, k) 8.471521 9.396441 10.99871 10.04459 10.96175 16.11923 5 a
comp <- sapply(starts, function(x) Matrix::rowSums(mat2[, x:(x+k-1)]))
comp2 <- new_combine(mat2,k)
> all.equal(comp2,as(comp,"dgCMatrix"))
[1] TRUE
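A different route to the same result, offered here as a sketch rather than as part of the answer, is to multiply by a sparse indicator matrix that maps each column to its group of k columns. This avoids touching the slots directly. It assumes the Matrix package is installed and that the number of columns is a multiple of k; the small dimensions and the names G, grouped and expected are illustrative.

```r
library(Matrix)

set.seed(42)
nrows <- 6; ncols <- 20; k <- 5

# small sparse 0/1 test matrix
mat <- Matrix(rbinom(nrows * ncols, 1, 0.3), nrows, ncols, sparse = TRUE)

# G[c, g] = 1 when column c belongs to group g (columns 1..k -> group 1, ...)
G <- sparseMatrix(i = seq_len(ncols),
                  j = (seq_len(ncols) - 1) %/% k + 1,
                  x = 1,
                  dims = c(ncols, ncols %/% k))

# one sparse product sums every block of k columns within each row
grouped <- mat %*% G

# reference: the sapply approach from the question
expected <- sapply(seq(1, ncols, by = k),
                   function(s) Matrix::rowSums(mat[, s:(s + k - 1)]))

all.equal(as.matrix(grouped), expected, check.attributes = FALSE)
```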

Vectorization of tempered fractional differencing calculation

I am trying to speed up this approximation of tempered fractional differencing.
This controls the long/quasi-long memory of a time series. Given that the first for loop is iterative, I don't know how to vectorize it. Also, the output of the attempted vectorization is a little off from the unaltered raw code. Thank you for your help.
Raw Code
tempfracdiff= function (x,d,eta) {
n=length(x);x=x-mean(x);PI=numeric(n)
PI[1]=-d;TPI=numeric(n);ydiff=x
for (k in 2:n) {PI[k]=PI[k-1]*(k-1-d)/k}
for (j in 1:n) {TPI[j]=exp(-eta*j)*PI[j]}
for (i in 2:n) {ydiff[i]=x[i]+sum(TPI[1:(i-1)]*x[(i-1):1])}
return(ydiff) }
Attempted Vectorization
tempfracdiffFL=function (x,d,eta) {
n=length(x);x=x-mean(x);PI=numeric(n)
PI[1]=-d;TPI=numeric(n);ydiff=x
for (k in 2:n) {PI[k]=PI[k-1]*(k-1-d)/k}
TPI[1:n]=exp(-eta*1:n)*PI[1:n]
ydiff[2:n]=x[2:n]+sum(TPI[1:(2:n-1)]*x[(2:n-1):1])
return(ydiff) }
For PI, you can use cumprod:
k <- 1:n
PI <- cumprod((k-1-d)/k)
TPI may be expressed without indices:
TPI <- exp(-eta*k)*PI
And ydiff is x plus the convolution of x and TPI:
ydiff <- x+c(0,convolve(x,rev(TPI),type="o")[1:n-1])
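To see why the convolution reproduces the third loop, note that the loop computes ydiff[i] = x[i] + sum over j = 1..(i-1) of TPI[j] * x[i-j], and element i-1 of the open convolution of x with TPI is exactly that lagged sum. A small sketch that checks this on a short series; the values of n, d and eta and the names loop_res, conv and conv_res are chosen here for illustration:

```r
set.seed(1)
n <- 8; d <- 0.1; eta <- 0.5
x <- rnorm(n)
x <- x - mean(x)

k <- 1:n
TPI <- exp(-eta * k) * cumprod((k - 1 - d) / k)

# explicit definition, as in the original loop
loop_res <- x
for (i in 2:n) {
  loop_res[i] <- x[i] + sum(TPI[1:(i - 1)] * x[(i - 1):1])
}

# open convolution; rev() is needed because of how convolve() defines type = "open"
conv <- convolve(x, rev(TPI), type = "open")
conv_res <- x + c(0, conv[seq_len(n - 1)])

all.equal(loop_res, conv_res)
```

convolve() works through the FFT, so the two results agree up to floating-point error rather than bit for bit.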
So, putting it all together:
mytempfracdiff = function (x,d,eta) {
n <- length(x)
x <- x-mean(x)
k <- 1:n
PI <- cumprod((k-1-d)/k)
TPI <- exp(-eta*k)*PI
x+c(0,convolve(x,rev(TPI),type="o")[1:n-1])
}
Test case example
set.seed(1)
x <- rnorm(100)
d <- 0.1
eta <- 0.5
all.equal(mytempfracdiff(x,d,eta), tempfracdiff(x,d,eta))
# [1] TRUE
library(microbenchmark)
microbenchmark(mytempfracdiff(x,d,eta), tempfracdiff(x,d,eta))
Unit: microseconds
expr min lq mean median uq max neval
mytempfracdiff(x, d, eta) 186.220 198.0025 211.9254 207.473 219.944 302.548 100
tempfracdiff(x, d, eta) 961.617 978.5710 1117.8803 1011.257 1061.816 3556.270 100
For PI[k], Reduce is helpful
n <- 5; d <- .3
fun <- function( a,b ) a * (b-1-d)/b
Reduce( fun, c(1,1:n), accumulate = T )[-1] # Eliminates PI[0]
[1] -0.30000000 -0.10500000 -0.05950000 -0.04016250 -0.02972025

Need help vectorizing a for loop in R

I'm trying to speed up an R function from a package I regularly use, so any help vectorizing the for-loop below would be much appreciated!
y <- array(0, dim=c(75, 12))
samp <- function(x) x<-sample(c(0,1), 1)
y <- apply(y, c(1,2), samp)
nr <- nrow(y)
nc <- ncol(y)
rs <- rowSums(y)
p <- colSums(y)
out <- matrix(0, nrow = nr, ncol = nc)
for (i in 1:nr) {
out[i, sample.int(nc, rs[i], prob = p)] <- 1
}
The issue I'm having a hard time getting around is the reference to object 'rs' within the loop.
Any suggestions?
Here are two options:
This one uses the somewhat discouraged <<- operator:
lapply(1:nr, function(i) out[i, sample.int(nc, rs[i], prob = p)] <<- 1)
This one uses more traditional indexing:
out[do.call('rbind',sapply(1:nr, function(i) cbind(i,sample.int(nc, rs[i], prob = p))))] <- 1
I suppose you could also use Vectorize to do an implicit mapply on your function:
z <- Vectorize(sample.int, vectorize.args='size')(nc, rs, prob=p)
out[cbind(rep(1:length(z), sapply(z, length)), unlist(z))] <- 1
But I don't think that's necessarily any cleaner.
And, indeed, @Roland is correct that all of these are slower than just doing the for loop:
> microbenchmark(op(), t1(), t2(), t3())
Unit: microseconds
expr min lq median uq max neval
op() 494.970 513.8290 521.7195 532.3040 1902.898 100
t1() 591.962 602.1615 609.4745 617.5570 2369.385 100
t2() 734.756 754.7700 764.3925 782.4825 2205.421 100
t3() 642.383 672.9815 711.4700 763.8150 2283.169 100
Yay for benefit-free obfuscation!
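The benchmarked functions op(), t1(), t2() and t3() are not shown in the answer. As a hedged reconstruction for anyone who wants to repeat the comparison, the sketch below wraps the original loop and one index-based variant in functions (loop_version and index_version are hypothetical names, and the variant uses lapply rather than Vectorize) and checks that both keep each row's total.

```r
set.seed(1)
y <- matrix(sample(0:1, 75 * 12, replace = TRUE), nrow = 75)
nr <- nrow(y); nc <- ncol(y)
rs <- rowSums(y)   # number of 1s to place in each row
p <- colSums(y)    # column weights used as sampling probabilities

# the original for loop, wrapped in a function
loop_version <- function() {
  out <- matrix(0, nrow = nr, ncol = nc)
  for (i in seq_len(nr)) out[i, sample.int(nc, rs[i], prob = p)] <- 1
  out
}

# sample the columns for every row first, then assign by (row, col) index pairs
index_version <- function() {
  out <- matrix(0, nrow = nr, ncol = nc)
  z <- lapply(rs, function(size) sample.int(nc, size, prob = p))
  out[cbind(rep(seq_len(nr), lengths(z)), unlist(z))] <- 1
  out
}

ok_loop <- all(rowSums(loop_version()) == rs)
ok_index <- all(rowSums(index_version()) == rs)
c(ok_loop, ok_index)
```

Both versions still call sample.int once per row, which is where the time goes, so it is unsurprising that reshuffling the surrounding code does not beat the plain loop.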
