I have three dataframes in R, let's call them A, B, and C.
Dataframe C contains two columns: the first holds various row names from dataframe A and the second holds row names from dataframe B:
C <- data.frame(col1 = c("a12", "a9"), col2 = c("b6","b54"))
I want to calculate the correlation coefficient and p-values for each row of the table C using the corresponding values from the rows of table A and B (i.e. correlating values from the a12 row in the table A with values from b6 row from table B, a9 row from table A with b54 row from table B, etc.) and put the resulting values in additional columns in the table C. This is my current naive and highly inefficient code:
for (i in 1:nrow(C)) {
correlation <- cor.test(unlist(A[C[i,1],]), unlist(B[C[i,2],]), method = "spearman")
C[i,3] <-correlation$estimate
C[i,4] <- correlation$p.value
}
The main problem is that with my current large datasets this analysis can literally take months, so I'm looking for a more efficient way to accomplish this task. I also tried the following code using the "Hmisc" package, but the server I'm working on can't handle the large vectors:
A <- t(A)
B <- t(B)
ind.A <- match(C[,1], colnames(A))
A<- A[,ind.A]
ind.B <- match(C[,2], colnames(B))
B<- B[,ind.B]
C[,3]<- diag(rcorr(as.matrix(A),as.matrix(B),type = "spearman")$r[c(1:ncol(A)),c(1:ncol(A))])
C[,4]<- diag(rcorr(as.matrix(A),as.matrix(B),type = "spearman")$P[c(1:ncol(A)),c(1:ncol(A))])
Based on the comment by @HYENA, I tried parallel processing. This approach sped up the process approximately 4 times (with 8 cores). The code:
library(foreach)
library(doParallel)
cl<- makeCluster(detectCores())
registerDoParallel(cl)
cor.res<- foreach (i=1:nrow(C)) %dopar% {
a<- C[i,1]
b<- C[i,2]
correlation<- cor.test(unlist(A[a,]),unlist(B[b,]), method = "spearman")
c(correlation$estimate,correlation$p.value)
}
cor.res<- data.frame(Reduce("rbind",cor.res))
C[,c(3,4)]<- cor.res
# release the worker processes once the results are collected
stopCluster(cl)
Extract just the part you need from cor.test, giving cor_test1, and use that instead. In addition, you can create a lookup table for the p-values, giving cor_test2, which is slightly faster than cor_test1. Note that cor_test2 rounds r to two decimals before the lookup, so its p-values are approximate, and the table is only valid for one fixed number of degrees of freedom.
Judging by the median column, with vectors of length 10 these run about 3x faster than cor.test. Although cor_test2 is only slightly faster than cor_test1 here, we have included it since the speed-up could depend on the size of the input, which we don't have; you can try it out yourself with whatever sizes you have.
# given correlation and degrees of freedom output the two-sided p value
# (same t-based formula cor.test uses for its default two-sided test)
r2pval <- function(r, dof) {
  tval <- sqrt(dof) * r/sqrt(1 - r^2)
  2 * min(pt(tval, dof), pt(tval, dof, lower.tail = FALSE))
}
# faster version of cor.test
# note: cor() defaults to Pearson; pass method = "spearman" to cor() to match the question
cor_test1 <- function(x, y) {
  r <- cor(x, y)
  dof <- length(x) - 2
  tval <- sqrt(dof) * r/sqrt(1 - r^2)
  pval <- 2 * min(pt(tval, dof), pt(tval, dof, lower.tail = FALSE))
  c(r, pval)
}
# even faster version of cor.test.
# Given x, y and the pvals table calculate a 2-vector of r and p value
cor_test2 <- function(x, y, pvals) {
  r <- cor(x, y)
  # round before indexing: R truncates fractional indices, so 129.9999 would pick 129
  c(r, pvals[round(100 * r) + 101])
}
# test
set.seed(123)
n <- 10
x <- rnorm(n); y <- rnorm(n)
dof <- n - 2
# pvals is the 201 p values for r = -1, -0.99, -0.98, ..., 1
pvals <- sapply(seq(-1, 1, 0.01), r2pval, dof = dof)
library(microbenchmark)
microbenchmark(cor.test(x, y), cor_test1(x, y), cor_test2(x, y, pvals))
giving:
Unit: microseconds
expr min lq mean median uq max neval cld
cor.test(x, y) 253.7 256.7 346.278 266.05 501.45 650.6 100 a
cor_test1(x, y) 84.8 87.2 346.777 89.10 107.40 22974.4 100 a
cor_test2(x, y, pvals) 72.4 75.0 272.030 79.45 91.25 17935.8 100 a
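For the original question (Spearman correlation over many row pairs), the per-pair function call can also be avoided altogether. The following is only a sketch, not taken from either answer: paired_spearman is a hypothetical helper that assumes A and B are numeric, have the same number of columns, and contain no NAs. It ranks each selected row once and then computes Pearson's r on the ranks for all pairs with row-wise sums. The p-value uses the t approximation, so it should be treated as approximate and may differ from the exact p-value cor.test reports for small samples.

```r
# Hypothetical helper (not from the original answers): Spearman correlation
# for each (row of A, row of B) pair listed in `pairs`, computed in one pass.
paired_spearman <- function(A, B, pairs) {
  a <- as.matrix(A)[as.character(pairs[[1]]), , drop = FALSE]
  b <- as.matrix(B)[as.character(pairs[[2]]), , drop = FALSE]
  # Spearman's rho is Pearson's r computed on the within-row ranks
  ra <- t(apply(a, 1, rank))
  rb <- t(apply(b, 1, rank))
  # centre each row, then take the row-wise correlation
  ra <- ra - rowMeans(ra)
  rb <- rb - rowMeans(rb)
  rho <- rowSums(ra * rb) / sqrt(rowSums(ra^2) * rowSums(rb^2))
  # two-sided p-value from the t approximation (assumption: good enough here)
  dof <- ncol(a) - 2
  tval <- rho * sqrt(dof / (1 - rho^2))
  p <- 2 * pt(-abs(tval), dof)
  data.frame(pairs, rho = unname(rho), p.value = unname(p))
}

# toy data with the row names used in the question
set.seed(1)
A <- matrix(rnorm(60), 3, 20, dimnames = list(c("a9", "a12", "a30"), NULL))
B <- matrix(rnorm(60), 3, 20, dimnames = list(c("b6", "b54", "b7"), NULL))
C <- data.frame(col1 = c("a12", "a9"), col2 = c("b6", "b54"))
res <- paired_spearman(A, B, C)
```

If the selected rows do not fit in memory at once, the same function can be applied to chunks of C and the results bound together.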
I have a dataframe which looks like the one produced by the following code (but much larger)
set.seed(10)
mat <- matrix(rbinom(200, size=1, prob = .5), ncol = 10)
The columns are issues, and a 1 indicates that an observation is interested in a specific issue. I want to generate a network comparing all observations, with a count of the issues that each dyad is jointly interested in.
I have produced the following code, which seems to be working fine:
mat2 <- matrix(NA,20,20)
for(i in 1:nrow(mat)){
for(j in 1:nrow(mat)){
mat2[i,j] <- sum(as.numeric(mat[i,]==1) + as.numeric(mat[j,]==1) == 2)
}
}
So I compare every entry with every other entry, and only if both have a 1 entry (i.e., they are interested), then this sums to 2 and will be counted as joint interest in a topic.
My problem is that my dataset is very large, and the loop has already been running for hours.
Does anyone have an idea how to do this while avoiding the loop?
This should be faster:
tmat <- t(mat==1)
mat4 <- apply(tmat, 2, function(x) colSums(tmat & x))
Going ahead and promoting @jogo's comment, as it is by far the fastest (thanks for the hint, I will use that in production as well).
set.seed(10)
mat <- matrix(rbinom(200, size=1, prob = .5), ncol = 10)
mat2 <- matrix(NA,20,20)
binary_mat <- mat == 1
tmat <- t(mat==1)
microbenchmark::microbenchmark(
"loop" = for(i in 1:nrow(mat)){
for(j in 1:nrow(mat)){
mat2[i,j] <- sum(as.numeric(mat[i,]==1) + as.numeric(mat[j,]==1) == 2)
}
},
"apply" = mat4 <- apply(tmat, 2, function(x) colSums(tmat & x)),
"matrix multiplication" = mat5 <- mat %*% t(mat),
"tcrossprod" = tcrossprod(mat),
"tcrossprod binary" = tcrossprod(binary_mat)
)
On my machine this benchmark results in
Unit: microseconds
expr min lq mean median uq max neval cld
loop 16699.634 16972.271 17931.82535 17180.397 17546.1545 31502.706 100 b
apply 322.942 330.046 395.69045 357.886 368.8300 4299.228 100 a
matrix multiplication 21.889 28.801 36.76869 39.360 43.9685 50.689 100 a
tcrossprod 7.297 8.449 11.20218 9.984 14.4005 18.433 100 a
tcrossprod binary 7.680 8.833 11.08316 9.601 12.0970 35.713 100 a
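As a brief illustration of why the matrix product gives the right answer: entry [i, j] of mat %*% t(mat) is the sum over issues k of mat[i, k] * mat[j, k], and for 0/1 data each term is 1 only when both observations are interested in issue k. A minimal check on the example data (the names joint and direct are just for this sketch):

```r
set.seed(10)
mat <- matrix(rbinom(200, size = 1, prob = .5), ncol = 10)

# all pairwise joint-interest counts in one call
joint <- tcrossprod(mat)

# spot-check one dyad against a direct count of shared issues
i <- 3; j <- 7
direct <- sum(mat[i, ] == 1 & mat[j, ] == 1)
joint[i, j] == direct
```

The diagonal of joint holds the number of issues each observation is interested in, which can be dropped if only the dyads matter.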
I have a dataset with the following structure:
require(data.table)
train <- data.table(sample(1:10, 10), runif(10, 0, 10))
However, the dataset is ~7.5 GB in memory and has ~630 million rows. Attempting summary(train) yields an error: Error: cannot allocate vector of size 2.3 Gb. I can extract some information by manually calling train[, mean(V2)], train[, min(V2)] and train[, max(V2)], but median and quantiles result in OOM. Is it possible to perform these operations on a machine with 16 GB of RAM?
One idea would be to split the dataset, but that would be a bit cumbersome w.r.t. median and quantiles.
So I came up with the function summaryI, to which we supply the name of the column of interest:
summaryI <- function(i2) {
setorderv(train, i2)
N <- train[, .N]
# count NAs:
# nas <- is.na(train[[i2]])
# nNA <- sum(nas)
# OR
i <- 1L
nNA <- 0L
while (i <= N && is.na(train[[i2]][i])) { # i <= N stops the loop if the whole column is NA
nNA <- nNA + 1L
i <- i + 1L
}
nNA
# will be slow if many NAs, but more memory efficient
# (will not create additional vector)
n <- N - nNA
probs <- seq(0, 1, 0.25)
# quantiles, only type = 7
index <- 1 + (n - 1) * probs
lo <- floor(index)
hi <- ceiling(index)
qs <- train[[i2]][lo + nNA]
i <- which(index > lo)
h <- (index - lo)[i]
qs[i] <- (1 - h) * qs[i] + h * train[[i2]][hi[i] + nNA]
qs # quantile results
rmean <- sum(train[[i2]], na.rm = T) / n
qq <- c(qs[1L:3L], rmean, qs[4L:5L])
digits <- max(3L, getOption("digits") - 3L)
qq <- signif(qq, digits)
names(qq) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
if (nNA > 0L) { # to match summary output
c(qq, `NA's` = nNA)
} else {
qq
}
}
The basic idea is that we order the column of interest in place (with setorderv from data.table) and then try to do all the calculations without copying data.
As mentioned in the code comments, if your data have a lot of NAs then this will be slow.
But hopefully you will be able to run it on all of the data. Also, I hard-coded the NA handling inside the function.
Example:
summaryI('V2')
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 2.398e-08 2.501e-01 5.000e-01 5.000e-01 7.500e-01 1.000e+00 1.000e+02
or run over multiple columns, like:
sapply(colnames(train), summaryI)
The source code of summary and quantile, which I used as basis, can be found:
github quantile.R
github summary.R
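The quantile step inside summaryI is the type 7 rule applied to an already sorted vector: the fractional position 1 + (n - 1) * p is split into its two neighbouring order statistics, which are then linearly interpolated. A small sketch on a toy vector (the values are made up for illustration) that compares the hand-rolled result with quantile():

```r
# toy data, sorted once up front (summaryI sorts in place with setorderv)
x <- sort(c(2.5, 0.1, 9.7, 4.2, 7.3, 1.8, 6.6))
n <- length(x)
probs <- seq(0, 1, 0.25)

# fractional positions of the requested quantiles in the sorted vector
index <- 1 + (n - 1) * probs
lo <- floor(index)
hi <- ceiling(index)
h <- index - lo            # interpolation weight, 0 when index is an integer

# linear interpolation between the two neighbouring order statistics
qs <- (1 - h) * x[lo] + h * x[hi]

all.equal(qs, unname(quantile(x, probs, type = 7)))
```

Only the handful of elements at positions lo and hi are ever read, which is why the approach avoids allocating a second full-length vector.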
I have 5 tables and would like to find the combinations of their rows that fulfill some requirements. I could easily solve the example below by using matrix multiplication to create all possible combinations and afterwards selecting the rows that fulfill my needs. The problem is that my original data consists of 5 tables with 200 rows each, which would require a couple of hundred GB of RAM to generate all possible combinations.
So I tried this instead:
x1 <- seq(1,10,1)
x2 <- seq(5,15,3)
x3 <- seq(2,11,1)
x4 <- seq(1,5,1)
x5 <- seq(1,20,2)
The combinations should fulfill: x1 < x2 and x1 < x3.
nm <- data.frame(matrix(NA,1,5))
for(a in 1:length(x1)){
for(s in 1:length(x2)){
for(d in 1:length(x3)){
for(f in 1:length(x4)){
for(g in 1:length(x5)){
l1 <- x1[a]
l2 <- x2[s]
if(l1 < l2){
l3 <- x3[d]
if(l1 < l3){
l4 <- x4[f]
l5 <- x5[g]
fy <- c()
fy[1] <- l1
fy[2] <- l2
fy[3] <- l3
fy[4] <- l4
fy[5] <- l5
nm <- rbind(nm, fy)
}}}}}}}
In my original problem I have more if statements, which I hope will increase speed. But it has been running for about 24 hours now and is still not done. The example above takes about 10 s, which makes me think it's stuck.
Two problems:
The huge problem is that you grow an object in a loop. This is the slowest operation possible since there is huge OS overhead involved. You need to preallocate the object and only grow it in chunks as necessary.
The medium problem is that you use a data.frame to store the results. Data.frames are useful, but slow. Use a matrix instead.
nm1 <- matrix(nrow = 1e3, ncol = 5) #adjust the chunk size to a reasonable estimate
rx <- 1
for(a in 1:length(x1)){
for(s in 1:length(x2)){
for(d in 1:length(x3)){
for(f in 1:length(x4)){
for(g in 1:length(x5)){
l1 <- x1[a]
l2 <- x2[s]
if(l1 < l2){
l3 <- x3[d]
if(l1 < l3){
l4 <- x4[f]
l5 <- x5[g]
if(rx > nrow(nm1)) nm1 <- rbind(nm1, matrix(nrow = 1e3, ncol = 5))
nm1[rx, 1] <- l1
nm1[rx, 2] <- l2
nm1[rx, 3] <- l3
nm1[rx, 4] <- l4
nm1[rx, 5] <- l5
rx <- rx + 1
}}}}}}}
nm1 <- nm1[seq_len(rx - 1),]
Timings:
Unit: milliseconds
expr min lq mean median uq max neval cld
mod() 589.2437 591.1576 594.4138 593.3678 595.0909 603.2087 5 a
original() 4934.4981 4952.4502 4980.6414 4953.3183 4985.7943 5077.1463 5 b
We get roughly a factor of 8 performance improvement (by the medians above) without actually starting to think about the algorithm. This factor gets bigger if you have more iterations of growing the data.frame. If this is still too slow, you could try to byte-compile the code using the compiler package. It would also be trivial to implement as actual compiled code with Rcpp. However, you should benchmark with an increasing number of iterations and extrapolate the timings to your actual problem. You might need to find a better algorithm than brute force, or consider whether you actually need to do this.
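One possible direction for a better algorithm, sketched here under the assumption that the constraints only involve some of the tables (as in the example, where only x1, x2 and x3 are constrained): filter the constrained tables first, and only then cross the surviving rows with the unconstrained ones. The names triples, rest and res are made up for this sketch; it never builds the rows that would be thrown away, but the final result can of course still be large.

```r
x1 <- seq(1, 10, 1)
x2 <- seq(5, 15, 3)
x3 <- seq(2, 11, 1)
x4 <- seq(1, 5, 1)
x5 <- seq(1, 20, 2)

# step 1: combine only the constrained variables and filter immediately
triples <- expand.grid(x1 = x1, x2 = x2, x3 = x3)
triples <- triples[triples$x1 < triples$x2 & triples$x1 < triples$x3, ]

# step 2: cross the survivors with the unconstrained variables
rest <- expand.grid(x4 = x4, x5 = x5)
res <- merge(triples, rest, by = NULL)  # by = NULL gives the Cartesian product

nrow(res)
```

With additional constraints, the same idea applies at each stage: apply every condition as soon as all the variables it mentions are available.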
A JSD matrix is a similarity matrix of distributions based on the Jensen-Shannon divergence.
Given a matrix m whose rows represent distributions, we would like to find the JSD distance between each pair of distributions. The resulting JSD matrix is a square matrix with dimensions nrow(m) x nrow(m). It is a triangular matrix where each element contains the JSD value between two rows of m.
JSD can be calculated by the following R function:
JSD<- function(x,y) sqrt(0.5 * (sum(x*log(x/((x+y)/2))) + sum(y*log(y/((x+y)/2)))))
where x, y are rows in matrix m.
I experimented with different JSD matrix calculation algorithms in R to figure out the quickest one. To my surprise, the algorithm with two nested loops performs faster than the different vectorized versions (parallelized or not). I'm not happy with the results. Could you point me to better solutions than the ones I came up with?
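library(parallel)
library(plyr)
library(doParallel)
library(foreach)
library(Matrix)         # sparseMatrix(), used in dist.JSD_4 and dist.JSD_5
library(microbenchmark) # microbenchmark(), used in the performance test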
nodes <- detectCores()
cl <- makeCluster(4)
registerDoParallel(cl)
m <- runif(24000, min = 0, max = 1)
m <- matrix(m, 24, 1000)
prob_dist <- function(x) t(apply(x, 1, prop.table))
JSD<- function(x,y) sqrt(0.5 * (sum(x*log(x/((x+y)/2))) + sum(y*log(y/((x+y)/2)))))
m <- t(prob_dist(m))
m[m==0] <- 0.000001
Algorithm with two nested loops:
dist.JSD_2 <- function(inMatrix) {
matrixColSize <- ncol(inMatrix)
resultsMatrix <- matrix(0, matrixColSize, matrixColSize)
for(i in 2:matrixColSize) {
for(j in 1:(i-1)) {
resultsMatrix[i,j]=JSD(inMatrix[,i], inMatrix[,j])
}
}
return(resultsMatrix)
}
Algorithm with outer:
dist.JSD_3 <- function(inMatrix) {
matrixColSize <- ncol(inMatrix)
resultsMatrix <- outer(1:matrixColSize,1:matrixColSize, FUN = Vectorize( function(i,j) JSD(inMatrix[,i], inMatrix[,j])))
return(resultsMatrix)
}
Algorithm with combn and apply:
dist.JSD_4 <- function(inMatrix) {
matrixColSize <- ncol(inMatrix)
ind <- combn(matrixColSize, 2)
out <- apply(ind, 2, function(x) JSD(inMatrix[,x[1]], inMatrix[,x[2]]))
a <- rbind(ind, out)
resultsMatrix <- sparseMatrix(a[1,], a[2,], x=a[3,], dims=c(matrixColSize, matrixColSize))
return(resultsMatrix)
}
Algorithm with combn and aaply:
dist.JSD_5 <- function(inMatrix) {
matrixColSize <- ncol(inMatrix)
ind <- combn(matrixColSize, 2)
out <- aaply(ind, 2, function(x) JSD(inMatrix[,x[1]], inMatrix[,x[2]]))
a <- rbind(ind, out)
resultsMatrix <- sparseMatrix(a[1,], a[2,], x=a[3,], dims=c(matrixColSize, matrixColSize))
return(resultsMatrix)
}
performance test:
mbm = microbenchmark(
two_loops = dist.JSD_2(m),
outer = dist.JSD_3(m),
combn_apply = dist.JSD_4(m),
combn_aaply = dist.JSD_5(m),
times = 10
)
ggplot2::autoplot(mbm)
> summary(mbm)
         expr      min       lq     mean   median       uq      max neval cld
1   two_loops 18.30857 18.68309 23.50231 18.77303 18.87891 65.34197    10 a
2       outer 38.93112 40.98369 42.44783 42.16858 42.85978 48.82437    10  b
3 combn_apply 20.45740 20.90747 21.49122 21.35042 22.06277 22.98803    10 a
4 combn_aaply 55.61176 56.77545 59.37358 58.93953 62.26417 64.77407    10   c
This is my implementation of your dist.JSD_2
dist0 <- function(m) {
ncol <- ncol(m)
result <- matrix(0, ncol, ncol)
for (i in 2:ncol) {
for (j in 1:(i-1)) {
x <- m[,i]; y <- m[,j]
result[i, j] <-
sqrt(0.5 * (sum(x * log(x / ((x + y) / 2))) +
sum(y * log(y / ((x + y) / 2)))))
}
}
result
}
The usual steps are to replace iterative calculations with vectorized versions. I moved sqrt(0.5 * ...) from inside the loops, where it is applied to each element of result, to outside the loop, where it is applied to the vector result.
I realized that sum(x * log(x / ((x + y) / 2))) could be written as sum(x * log(2 * x)) - sum(x * log(x + y)). The first sum is calculated once for each entry of result, but could be calculated once for each column. It too comes out of the loops, with the vector of values (one element for each column) calculated as colSums(m * log(2 * m)).
The remaining term inside the inner loop is sum((x + y) * log(x + y)). For a given value of i, we can trade off space for speed by vectorizing this across all relevant y columns as a matrix operation
j <- seq_len(i - 1L)
xy <- m[, i] + m[, j, drop=FALSE]
xylogxy[i, j] <- colSums(xy * log(xy))
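The rewrite rests on the identity x * log(x / ((x + y) / 2)) = x * log(2 * x) - x * log(x + y); adding the matching term for y collects the two remaining pieces into sum((x + y) * log(x + y)). A quick numerical check on two made-up distributions (lhs and rhs are names used only in this sketch):

```r
set.seed(1)
# two arbitrary strictly positive probability vectors
x <- prop.table(runif(24))
y <- prop.table(runif(24))

# the two sums as they appear inside the original JSD function
lhs <- sum(x * log(x / ((x + y) / 2))) + sum(y * log(y / ((x + y) / 2)))

# per-column terms plus the single cross term that stays in the loop
rhs <- sum(x * log(2 * x)) + sum(y * log(2 * y)) - sum((x + y) * log(x + y))

all.equal(lhs, rhs)
```

The two forms agree up to floating-point rounding, which is why the final comparison of the full matrices uses all.equal rather than identical.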
The end result is
dist4 <- function(m) {
ncol <- ncol(m)
xlogx <- matrix(colSums(m * log(2 * m)), ncol, ncol)
xlogx2 <- xlogx + t(xlogx)
xlogx2[upper.tri(xlogx2, diag=TRUE)] <- 0
xylogxy <- matrix(0, ncol, ncol)
for (i in seq_len(ncol)[-1]) {
j <- seq_len(i - 1L)
xy <- m[, i] + m[, j, drop=FALSE]
xylogxy[i, j] <- colSums(xy * log(xy))
}
sqrt(0.5 * (xlogx2 - xylogxy))
}
Which produces results that are numerically equal (though not exactly identical) to the original
> all.equal(dist0(m), dist4(m))
[1] TRUE
and about 2.25x faster
> microbenchmark(dist0(m), dist4(m), dist.JSD_cpp2(m), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
dist0(m) 48.41173 48.42569 49.26072 48.68485 49.48116 51.64566 10
dist4(m) 20.80612 20.90934 21.34555 21.09163 21.96782 22.32984 10
dist.JSD_cpp2(m) 28.95351 29.11406 29.43474 29.23469 29.78149 30.37043 10
You'll still be waiting for about 10 hours, though that seems to imply a very large problem. The algorithm seems like it is quadratic in the number of columns, but the number of columns here was small (24) compared to the number of rows, so I wonder what the actual size of data being processed is? There are ncol * (ncol - 1) / 2 distances to be calculated.
A crude approach to further performance gain is parallel evaluation, which the following implements using parallel::mclapply()
dist4p <- function(m, ..., mc.cores=detectCores()) {
ncol <- ncol(m)
xlogx <- matrix(colSums(m * log(2 * m)), ncol, ncol)
xlogx2 <- xlogx + t(xlogx)
xlogx2[upper.tri(xlogx2, diag=TRUE)] <- 0
xx <- mclapply(seq_len(ncol)[-1], function(i, m) {
j <- seq_len(i - 1L)
xy <- m[, i] + m[, j, drop=FALSE]
colSums(xy * log(xy))
}, m, ..., mc.cores=mc.cores)
xylogxy <- matrix(0, ncol, ncol)
xylogxy[upper.tri(xylogxy, diag=FALSE)] <- unlist(xx)
sqrt(0.5 * (xlogx2 - t(xylogxy)))
}
My laptop has 8 nominal cores, and for 1000 columns I have
> system.time(xx <- dist4p(m1000))
user system elapsed
48.909 1.939 8.043
which suggests that I get 48s of processor time in 8s of clock time. The algorithm is still quadratic, so this might reduce overall computation time to about 1h for the full problem. Memory might become an issue on a multicore machine, where all processes are competing for the same memory pool; it might be necessary to choose mc.cores less than the number available.
With large ncol, the way to get better performance is to avoid calculating the complete set of distances. Depending on the nature of the data it might make sense to filter for duplicate columns, or to filter for informative columns (e.g., with greatest variance), or... An appropriate strategy requires more information on what the columns represent and what the goal is for the distance matrix. The question 'how similar is company i to other companies?' can be answered without calculating the full distance matrix, just a single row, so if the number of times the question is asked is small relative to the total number of companies, then maybe there is no need to calculate the full distance matrix? Another strategy might be to reduce the number of companies to be clustered by (1) simplifying the 1000 rows of measurement using principal components analysis, (2) kmeans clustering of all 50k companies to identify, say, 1000 centroids, and (3) using the interpolated measurements and the Jensen-Shannon distance between these for clustering.
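The single-row idea could look roughly like this. It is only a sketch: jsd_to_all is a hypothetical helper, not part of the code above, and it assumes every column of m is a strictly positive probability distribution (as after the m[m==0] <- 0.000001 step in the question). It reuses the same algebra as dist4 but for one column against all others, so the cost is linear rather than quadratic in the number of columns.

```r
# Hypothetical helper: JSD between column i of m and every column of m.
# Assumes each column of m is a strictly positive probability distribution.
jsd_to_all <- function(m, i) {
  xlogx <- colSums(m * log(2 * m))   # one term per column
  xy <- m[, i] + m                   # column i added to every column
  d2 <- 0.5 * (xlogx[i] + xlogx - colSums(xy * log(xy)))
  sqrt(pmax(d2, 0))                  # guard against tiny negative rounding error
}

# toy data: 10 distributions over 6 categories, one per column
set.seed(1)
m <- prop.table(matrix(runif(60), 6, 10), margin = 2)
d <- jsd_to_all(m, 3)
```

Element 3 of d is the distance of column 3 to itself, which is zero up to rounding.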
I'm sure there are better approaches than the following, but your JSD function itself can trivially be converted to an Rcpp function by just swapping sum and log for their Rcpp sugar equivalents, and using std::sqrt in place of R's base::sqrt.
#include <Rcpp.h>
// [[Rcpp::export]]
double cppJSD(const Rcpp::NumericVector& x, const Rcpp::NumericVector& y) {
return std::sqrt(0.5 * (Rcpp::sum(x * Rcpp::log(x/((x+y)/2))) +
Rcpp::sum(y * Rcpp::log(y/((x+y)/2)))));
}
I only tested with your dist.JSD_2 approach (since it was the fastest version), but you should see an improvement when using cppJSD instead of JSD regardless of the implementation:
R> microbenchmark::microbenchmark(
two_loops = dist.JSD_2(m),
cpp = dist.JSD_cpp(m),
times=100L)
Unit: milliseconds
expr min lq mean median uq max neval
two_loops 41.25142 41.34755 42.75926 41.45956 43.67520 49.54250 100
cpp 36.41571 36.52887 37.49132 36.60846 36.98887 50.91866 100
EDIT:
Actually, your dist.JSD_2 function itself can easily be converted to an Rcpp function for an additional speed-up:
// [[Rcpp::export("dist.JSD_cpp2")]]
Rcpp::NumericMatrix foo(const Rcpp::NumericMatrix& inMatrix) {
size_t cols = inMatrix.ncol();
Rcpp::NumericMatrix result(cols, cols);
for (size_t i = 1; i < cols; i++) {
for (size_t j = 0; j < i; j++) {
result(i,j) = cppJSD(inMatrix(Rcpp::_, i), inMatrix(Rcpp::_, j));
}
}
return result;
}
(where cppJSD was defined in the same .cpp file as the above). Here are the timings:
R> microbenchmark::microbenchmark(
two_loops = dist.JSD_2(m),
partial_cpp = dist.JSD_cpp(m),
full_cpp = dist.JSD_cpp2(m),
times=100L)
Unit: milliseconds
expr min lq mean median uq max neval
two_loops 41.25879 41.36729 42.95183 41.84999 44.08793 54.54610 100
partial_cpp 36.45802 36.62463 37.69742 36.99679 37.96572 44.26446 100
full_cpp 32.00263 32.12584 32.82785 32.20261 32.63554 38.88611 100
dist.JSD_2 <- function(inMatrix) {
matrixColSize <- ncol(inMatrix)
resultsMatrix <- matrix(0, matrixColSize, matrixColSize)
for(i in 2:matrixColSize) {
for(j in 1:(i-1)) {
resultsMatrix[i,j]=JSD(inMatrix[,i], inMatrix[,j])
}
}
return(resultsMatrix)
}
##
dist.JSD_cpp <- function(inMatrix) {
matrixColSize <- ncol(inMatrix)
resultsMatrix <- matrix(0, matrixColSize, matrixColSize)
for(i in 2:matrixColSize) {
for(j in 1:(i-1)) {
resultsMatrix[i,j]=cppJSD(inMatrix[,i], inMatrix[,j])
}
}
return(resultsMatrix)
}
m <- runif(24000, min = 0, max = 1)
m <- matrix(m, 24, 1000)
prob_dist <- function(x) t(apply(x, 1, prop.table))
JSD <- function(x,y) sqrt(0.5 * (sum(x*log(x/((x+y)/2))) + sum(y*log(y/((x+y)/2)))))
m <- t(prob_dist(m))
m[m==0] <- 0.000001