Related
I need to make recodings to data sets of the following form.
# List elements of varying length
set.seed(12345)
n = 1e3
m = sample(2:5, n, T)
V = list()
for(i in 1:n) {
for(j in 1:m[i])
if(j ==1) V[[i]] = 0 else V[[i]][j] = V[[i]][j-1] + rexp(1, 1/10)
}
As an example consider
[1] 0.00000 23.23549 30.10976
Each list element contains a ascending vector of length m, each starting with 0 and ending somewhere in positive real numbers.
Now, consider a value s, where s is smaller than the maximum v_m of each V[[i]]. Also let v_m_ denote the m-1-th element of V[[i]]. Our goal is to find all elements of V[[i]] that are bounded by v_m_ - s and v_m - s. In the example above, if s=5, the desired vector v would be 23.23549. v can contain more elements if the interval encloses more values. As an example consider:
> V[[1]]
[1] 0.000000 2.214964 8.455576 10.188048 26.170458
If we now let s=16, the resulting vector is now 0 2.214964 8.455576, so that it has length 3. The code below implements this procedure using a for loop. It returns v in a list for all n. Note that I also attach the (upper/lower) bound before/afterv, if the bound lead to a reduction in length of v (in other words, if the bound has a positive value).
This loop is too slow in my application because n is large and the procedure is part of a larger algorithm that has to be run many times with some parameters changing. Is there a way to obtain the result faster than with a for loop, for example using vectorization? I know lapply in general is not faster than for.
# Series maximum and one below maximum
v_m = sapply(V, function(x) x[length(x)])
v_m_ = sapply(V, function(x) x[length(x)-1])
# Set some offsets s
s = runif(n,0,v_m)
# Procedure
d1 = (v_m_ - s)
d2 = (v_m - s)
if(sum(d2 < 0) > 0) stop('s cannot be larger than series maximum.')
# For loop - can this be done faster?
W = list()
for(i in 1:n){
v = V[[i]]
l = length(v)
v = v[v > d1[i]]
if(l > length(v)) v = c(d1[i], v)
l = length(v)
v = v[v < d2[i]]
if(l > length(v)) v = c(v, d2[i])
W[[i]] = v
}
I guess you can try mapply like below
V <- lapply(m, function(i) c(0, cumsum(rexp(i - 1, 1 / 10))))
v <- sapply(V, tail, 2)
s <- runif(n, 0, v[1, ])
if (sum(v[2, ] < 0) > 0) stop("s cannot be larger than series maximum.")
W <- mapply(
function(x, lb, ub) c(pmax(lb,0), x[x >= lb & x <= ub], pmin(ub,max(x))),
V,
v[1,]-s,
v[2,]-s
)
I don't think vectorization will be an option since the operation goes from a list of unequal-length vectors to another list of unequal-length vectors.
For example, the following vectorizes all the comparisons, but the unlist/relist operations are too expensive (not to mention the final lapply(..., unique)). Stick with the for loop.
W <- lapply(
relist(
pmax(
pmin(
unlist(V),
rep.int(d2, lengths(V))
),
rep.int(d1, lengths(V))
),
V
),
unique
)
I see two things that will give modest gains in speed. First, if s is always greater than 0, your final if statement will always evaluate to TRUE, so it can be skipped, simplifying some of the code. Second is to pre-allocate W. These are both implemented in fRecode2 below. A third thing that gives a slight gain is to avoid multiple reassignments to v. This is implemented in fRecode3 below.
For additional speed, move to Rcpp--it will allow the vectors in W to be built via a single pass through each vector element in V instead of two.
set.seed(12345)
n <- 1e3
m <- sample(2:5, n, T)
V <- lapply(m, function(i) c(0, cumsum(rexp(i - 1, 1 / 10))))
v_m <- sapply(V, function(x) x[length(x)])
v_m_ <- sapply(V, function(x) x[length(x)-1])
s <- runif(n,0,v_m)
d1 <- (v_m_ - s)
d2 <- (v_m - s)
if(sum(d2 < 0) > 0) stop('s cannot be larger than series maximum.')
fRecode1 <- function() {
# original function
W = list()
for(i in 1:n){
v = V[[i]]
l = length(v)
v = v[v > d1[i]]
if(l > length(v)) v = c(d1[i], v)
l = length(v)
v = v[v < d2[i]]
if(l > length(v)) v = c(v, d2[i])
W[[i]] = v
}
W
}
fRecode2 <- function() {
W <- vector("list", length(V))
i <- 0L
for(v in V){
l <- length(v)
v <- v[v > d1[i <- i + 1L]]
if (l > length(v)) v <- c(d1[i], v)
W[[i]] <- c(v[v < d2[i]], d2[[i]])
}
W
}
fRecode3 <- function() {
W <- vector("list", length(V))
i <- 0L
for(v in V){
idx1 <- sum(v <= d1[i <- i + 1L]) + 1L
idx2 <- sum(v < d2[i])
if (idx1 > 1L) {
if (idx2 >= idx1) {
W[[i]] <- c(d1[i], v[idx1:idx2], d2[i])
} else {
W[[i]] <- c(d1[i], d2[i])
}
} else {
W[[i]] <- c(v[1:idx2], d2[i])
}
}
W
}
microbenchmark::microbenchmark(fRecode1 = fRecode1(),
fRecode2 = fRecode2(),
fRecode3 = fRecode3(),
times = 1e3,
check = "equal")
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> fRecode1 2.0210 2.20405 2.731124 2.39785 2.80075 12.7946 1000
#> fRecode2 1.2829 1.43315 1.917761 1.54715 1.88495 51.8183 1000
#> fRecode3 1.2710 1.38920 1.741597 1.45640 1.76225 5.4515 1000
Not a huge speed boost: fRecode3 shaves just under a microsecond on average for each vector in V.
How could I do something like this but in an optimal (vectorized) way in R?
N=10000
f <- 1.005
S0 <- 100
p <- 1/10
n <- seq(3,N)
S <- c(f*S0, f^2*S0, f^3*S0)
P <- c(0, 0, p*(f-1)*f^2*S0)
for(i in n){
R <- tail(S,1)-tail(P,1)
S <- c(S, f*R)
P <- c(P, p*(f-1)*R)
}
the final desired output being of course S and P (all the way up to row N+1). This computes a sequential time series row by row (each row is a function of the previous row values, above row 3).
I tried to use lapply but it's difficult to get a function to return two changes in the global environment... (and the resulting table is also badly formatted)
The simplest step to speed up your code is to pre-allocate the vectors. Start S and P at their final lengths, rather than "growing" them each iteration of the loop. This results in a more than 100x speed-up of your code:
N <- 10000
f <- 1.005
S0 <- 100
p <- 1/10
original = function(N, f, S0, p) {
n <- seq(3,N)
S <- c(f*S0, f^2*S0, f^3*S0)
P <- c(0, 0, p*(f-1)*f^2*S0)
for(i in n){
R <- tail(S,1)-tail(P,1)
S <- c(S, f*R)
P <- c(P, p*(f-1)*R)
}
return(list(S, P))
}
pre_allocated = function(N, f, S0, p) {
n <- seq(3,N)
S <- c(f*S0, f^2*S0, f^3*S0, rep(NA, N - 3))
P <- c(0, 0, p*(f-1)*f^2*S0, rep(NA, N - 3))
for(i in n){
R <- S[i] - P[i]
S[i + 1] <- f*R
P[i + 1] <- p*(f-1)*R
}
return(list(S, P))
}
## Check that we get the same result
identical(original(N, f, S0, p), pre_allocated(N, f, S0, p))
# [1] TRUE
## See how fast it is
microbenchmark::microbenchmark(original(N, f, S0, p), pre_allocated(N, f, S0, p), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# original(N, f, S0, p) 414.3610 419.9241 441.26030 426.01610 454.6002 538.0523 10
# pre_allocated(N, f, S0, p) 2.3306 2.6478 2.92908 3.05785 3.1198 3.2885 10
It's possible that a vectorized solution, perhaps using a function like cumprod, would be even faster, but I don't see a clear way to do it. If you can write out your result mathematically as a cumulative sum or product, that would make it clearer and possibly reveal a solution.
In this post on CodeReview, I compared several ways to generate a large sparse matrix. Specifically, I compared dense and sparse constructions using the Matrix package in R. My question is about post-processing with the sparse constructions. I'm finding that when I try to find the row sums of every k columns, the dense construction outperforms the sparse constructions.
Microbenchmarking
ncols <- 100000
nrows <- 1000
col_probs <- runif(ncols, 0.001, 0.002)
mat1 <- spMat_dense(ncols=ncols,nrows=nrows,col_probs=col_probs)
mat2 <- spMat_dgC(ncols=ncols,nrows=nrows,col_probs = col_probs)
mat3 <- spMat_dgT(ncols=ncols,nrows=nrows,col_probs=col_probs)
k <- 50
starts <- seq(1, ncols, by=k)
microbenchmark::microbenchmark(sapply(starts, function(x) rowSums(mat1[, x:(x+k-1)])),
sapply(starts, function(x) Matrix::rowSums(mat2[, x:(x+k-1)])),
sapply(starts, function(x) Matrix::rowSums(mat3[, x:(x+k-1)])),
times=5L)
Unit: milliseconds
expr
sapply(starts, function(x) rowSums(mat1[, x:(x + k - 1)]))
sapply(starts, function(x) Matrix::rowSums(mat2[, x:(x + k - 1)]))
sapply(starts, function(x) Matrix::rowSums(mat3[, x:(x + k - 1)]))
min lq mean median uq max
912.0453 947.0454 1041.365 965.4375 1007.311 1374.988
2097.4125 2208.0056 2566.575 2406.8450 2851.640 3268.970
13231.4790 13619.3818 13819.745 13675.6282 13923.803 14648.434
neval cld
5 a
5 b
5 c
My guess is that the sapply function works better with dense matrices because it doesn't need to do the sparse to dense conversion under the hood. The functions are posted below.
Question
Is there a way to improve the speed of the above post-processing for sparse constructions?
Functions
spMat_dense <- function(ncols,nrows,col_probs){
matrix(rbinom(nrows*ncols,1,col_probs),
ncol=ncols,byrow=T)
}
library(Matrix)
spMat_dgC <- function(ncols,nrows,col_probs){
#Credit to Andrew Guster (https://stackoverflow.com/a/56348978/4321711)
mat <- Matrix(0, nrows, ncols, sparse = TRUE) #blank matrix for template
i <- vector(mode = "list", length = ncols) #each element of i contains the '1' rows
p <- rep(0, ncols) #p will be cumsum no of 1s by column
for(r in 1:nrows){
row <- rbinom(ncols, 1, col_probs) #random row
p <- p + row #add to column identifier
if(any(row == 1)){
for (j in which(row == 1)){
i[[j]] <- c(i[[j]], r-1) #append row identifier
}
}
}
p <- c(0, cumsum(p)) #this is the format required
i <- unlist(i)
x <- rep(1, length(i))
mat#i <- as.integer(i)
mat#p <- as.integer(p)
mat#x <- x
return(mat)
}
spMat_dgT <- function(ncols, nrows, col_probs){
#Credit to minem - https://codereview.stackexchange.com/a/222190/121860
r <- lapply(1:ncols, function(x) {
p <- col_probs[x]
i <- sample.int(2L, size = nrows, replace = T, prob = c(1 - p, p))
which(i == 2L)
})
rl <- lengths(r)
nc <- rep(1:ncols, times = rl) # col indexes
nr <- unlist(r) # row index
ddims <- c(nrows, ncols)
sparseMatrix(i = nr, j = nc, dims = ddims, giveCsparse = FALSE)
}
Using a dgCMatrix as input, this is one possible solution that is very fast:
new_combine <- function(mat,k){
#Convert dgCMatrix to dgTMatrix
x.T <- as(mat, "dgTMatrix")
#Map column indices to new set of indices
#based on partitioning every k columns
x.T#j <- as.integer(x.T#j %/% k)
#Correct dimensions of new matrix
x.T#Dim <- as.integer(c(nrow(x.T),floor(ncol(mat)/k)))
#Convert back to dgCMatrix
y <- as(x.T,"dgCMatrix")
y
}
microbenchmark::microbenchmark(sapply(starts, function(x) Matrix::rowSums(mat2[, x:(x+k-1)])),
new_combine(mat2,k),
times=5L)
Unit: milliseconds
expr
sapply(starts, function(x) Matrix::rowSums(mat2[, x:(x + k - 1)]))
new_combine(mat2, k)
min lq mean median uq
1808.872676 1864.783181 1925.17118 1935.98946 1990.28866
8.471521 9.396441 10.99871 10.04459 10.96175
max neval cld
2025.92192 5 b
16.11923 5 a
comp <- sapply(starts, function(x) Matrix::rowSums(mat2[, x:(x+k-1)]))
comp2 <- new_combine(mat2,k)
> all.equal(comp2,as(comp,"dgCMatrix"))
[1] TRUE
I'm trying to calculate the weighted euclidean distance (squared) between twoo data frames that have the same number of columns (variables) and different number of rows (observations).
The calculation follows the formula:
DIST[m,i] <- sum(((DATA1[m,] - DATA2[i,]) ^ 2) * lambda[1,])
I specifically need to multiply each parcel of the somatory by a specific weight (lambda).
The code provided bellow runs correctly, but if I use it in hundreds of iterations it takes a lot of processing time. Yesterday it took me 18 hours to create a graphic using multiple iterations of a function that contains this calculation. Using library(profvis) profvis({ my code }) I saw that this specific part of the code is taking up like 80% of the processing time.
I read a lot about how to reduce the processing time using parallel and vectorized operations, but I don't know how to implement them in this particular case, because of the weight lamb#.
Can some one help me reduce my processing time with this code?
More information about the code and the structure of the data can be found in the code provided bellow as comments.
# Data frames used to calculate the euclidean distances between each observation
# from DATA1 and each observation from DATA2.
# The euclidean distance is between a [600x50] and a [8X50] dataframes, resulting
# in a [600X8] dataframe.
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]
# Weights used for each of the 50 variables to calculate the weighted
# euclidean distance.
# Can be a vector of different weights or a scalar of the same weight
# for all variables.
lambda <- runif(n=50, min=0, max=10) ## length(lambda) > 1
# lambda=1 ## length(lambda) == 1
if (length(lambda) > 1) {
as.numeric(unlist(lambda))
lambda <- as.matrix(lambda)
lambda <- t(lambda)
}
nrows1 <- nrow(DATA1)
nrows2 <- nrow(DATA2)
# Euclidean Distance calculation
DIST <- matrix(NA, nrow=nrows1, ncol=nrows2 )
for (m in 1:nrows1) {
for (i in 1:nrows2) {
if (length(lambda) == 1) {
DIST[m, i] <- sum((DATA1[m, ] - DATA2[i, ])^2)
}
if (length(lambda) > 1){
DIST[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda[1, ])
}
next
}
next
}
After all the sugestions, combining the answers from #MDWITT (for length(lambda > 1) and #F. Privé (for length(lambda == 1) the final solution took only one minute to run, whilst the original one took me an hour and a half to run, in a bigger code that has that calculation. The final code for this problem, for those interested, is:
#Data frames used to calculate the euclidean distances between each observation from DATA1 and each observation from DATA2.
#The euclidean distance is between a [600x50] and a [8X50] dataframes, resulting in a [600X8] dataframe.
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]
#Weights used for each of the 50 variables to calculate the weighted euclidean distance.
#Can be a vector of different weights or a scalar of the same weight for all variables.
#lambda <- runif(n = 50, min = 0, max = 10) ##length(lambda) > 1
lambda = 1 ##length(lambda) == 1
nrows1 <- nrow(DATA1)
nrows2 <- nrow(DATA2)
#Euclidean Distance calculation
DIST <- matrix(NA, nrow = nrows1, ncol = nrows2)
if (length(lambda) > 1){
as.numeric(unlist(lambda))
lambda <- as.matrix(lambda)
lambda <- t(lambda)
library(Rcpp)
cppFunction('NumericMatrix weighted_distance (NumericMatrix x, NumericMatrix y, NumericVector lambda){
int n_x = x.nrow();
int n_y = y.nrow();
NumericMatrix DIST(n_x, n_y);
//begin the loop
for (int i = 0 ; i < n_x; i++){
for (int j = 0 ; j < n_y ; j ++) {
double d = sum(pow(x.row(i) - y.row(j), 2)*lambda);
DIST(i,j) = d;
}
}
return (DIST) ;
}')
DIST <- weighted_distance(DATA1, DATA2, lambda = lambda)}
if (length(lambda) == 1) {
DIST <- outer(rowSums(DATA1^2), rowSums(DATA2^2), '+') - tcrossprod(DATA1, 2 * DATA2)
}
Rewrite the problem to use linear algebra and vectorization, which is much faster than loops.
If you don't have lambda, this is just
outer(rowSums(DATA1^2), rowSums(DATA2^2), '+') - tcrossprod(DATA1, 2 * DATA2)
With lambda, it becomes
outer(drop(DATA1^2 %*% lambda), drop(DATA2^2 %*% lambda), '+') -
tcrossprod(DATA1, sweep(DATA2, 2, 2 * lambda, '*'))
Here an alternate way using Rcpp just to have this concept documents. In a file called euclidean.cpp in it I have
#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix weighted_distance (NumericMatrix x, NumericMatrix y, NumericVector lambda){
int n_x = x.nrow();
int n_y = y.nrow();
NumericMatrix out(n_x, n_y);
//begin the loop
for (int i = 0 ; i < n_x; i++){
for (int j = 0 ; j < n_y ; j ++) {
double d = sum(pow(x.row(i) - y.row(j), 2)*lambda);
out(i,j) = d;
}
}
return (out) ;
}
In R, then I have
library(Rcpp)
sourceCpp("libs/euclidean.cpp")
# Generate Data
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]
lambda <- runif(n=50, min=0, max=10)
# Run the program
out <- weighted_distance(DATA1, DATA2, lambda = lambda)
When I test the speed using:
microbenchmark(
Rcpp_way = weighted_distance(DATA1, DATA2, lambda = lambda),
other = {DIST <- matrix(NA, nrow=nrows1, ncol=ncols)
for (m in 1:nrows1) {
for (i in 1:nrows2) {
if (length(lambda) == 1) {
DIST[m, i] <- sum((DATA1[m, ] - DATA2[i, ])^2)
}
if (length(lambda) > 1){
DIST[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda[1, ])
}
next
}
next
}}, times = 100)
You can see that it is a good clip faster:
Unit: microseconds
expr min lq mean median uq max neval
Rcpp_way 446.769 492.308 656.9849 562.667 846.9745 1169.231 100
other 24688.821 30681.641 44153.5264 37511.385 50878.3585 200843.898 100
JSD matrix is a similarity matrix of distributions based on Jensen-Shannon divergence.
Given matrix m which rows present distributions we would like to find JSD distance between each distribution. Resulting JSD matrix is a square matrix with dimensions nrow(m) x nrow(m). This is triangular matrix where each element contains JSD value between two rows in m.
JSD can be calculated by the following R function:
JSD<- function(x,y) sqrt(0.5 * (sum(x*log(x/((x+y)/2))) + sum(y*log(y/((x+y)/2)))))
where x, y are rows in matrix m.
I experimented with different JSD matrix calculation algorithms in R to figure out the quickest one. For my surprise, the algorithm with two nested loops performs faster than the different vectorized versions (parallelized or not). I'm not happy with the results. Could you pinpoint me better solutions than the ones I game up?
library(parallel)
library(plyr)
library(doParallel)
library(foreach)
nodes <- detectCores()
cl <- makeCluster(4)
registerDoParallel(cl)
m <- runif(24000, min = 0, max = 1)
m <- matrix(m, 24, 1000)
prob_dist <- function(x) t(apply(x, 1, prop.table))
JSD<- function(x,y) sqrt(0.5 * (sum(x*log(x/((x+y)/2))) + sum(y*log(y/((x+y)/2)))))
m <- t(prob_dist(m))
m[m==0] <- 0.000001
Algorithm with two nested loops:
dist.JSD_2 <- function(inMatrix) {
matrixColSize <- ncol(inMatrix)
resultsMatrix <- matrix(0, matrixColSize, matrixColSize)
for(i in 2:matrixColSize) {
for(j in 1:(i-1)) {
resultsMatrix[i,j]=JSD(inMatrix[,i], inMatrix[,j])
}
}
return(resultsMatrix)
}
Algorithm with outer:
dist.JSD_3 <- function(inMatrix) {
matrixColSize <- ncol(inMatrix)
resultsMatrix <- outer(1:matrixColSize,1:matrixColSize, FUN = Vectorize( function(i,j) JSD(inMatrix[,i], inMatrix[,j])))
return(resultsMatrix)
}
Algorithm with combn and apply:
dist.JSD_4 <- function(inMatrix) {
matrixColSize <- ncol(inMatrix)
ind <- combn(matrixColSize, 2)
out <- apply(ind, 2, function(x) JSD(inMatrix[,x[1]], inMatrix[,x[2]]))
a <- rbind(ind, out)
resultsMatrix <- sparseMatrix(a[1,], a[2,], x=a[3,], dims=c(matrixColSize, matrixColSize))
return(resultsMatrix)
}
Algorithm with combn and aaply:
dist.JSD_5 <- function(inMatrix) {
matrixColSize <- ncol(inMatrix)
ind <- combn(matrixColSize, 2)
out <- aaply(ind, 2, function(x) JSD(inMatrix[,x[1]], inMatrix[,x[2]]))
a <- rbind(ind, out)
resultsMatrix <- sparseMatrix(a[1,], a[2,], x=a[3,], dims=c(matrixColSize, matrixColSize))
return(resultsMatrix)
}
performance test:
mbm = microbenchmark(
two_loops = dist.JSD_2(m),
outer = dist.JSD_3(m),
combn_apply = dist.JSD_4(m),
combn_aaply = dist.JSD_5(m),
times = 10
)
ggplot2::autoplot(mbm)
> summary(mbm)
expr min lq mean median
1 two_loops 18.30857 18.68309 23.50231 18.77303
2 outer 38.93112 40.98369 42.44783 42.16858
3 combn_apply 20.45740 20.90747 21.49122 21.35042
4 combn_aaply 55.61176 56.77545 59.37358 58.93953
uq max neval cld
1 18.87891 65.34197 10 a
2 42.85978 48.82437 10 b
3 22.06277 22.98803 10 a
4 62.26417 64.77407 10 c
This is my implementation of your dist.JSD_2
dist0 <- function(m) {
ncol <- ncol(m)
result <- matrix(0, ncol, ncol)
for (i in 2:ncol) {
for (j in 1:(i-1)) {
x <- m[,i]; y <- m[,j]
result[i, j] <-
sqrt(0.5 * (sum(x * log(x / ((x + y) / 2))) +
sum(y * log(y / ((x + y) / 2)))))
}
}
result
}
The usual steps are to replace iterative calculations with vectorized versions. I moved sqrt(0.5 * ...) from inside the loops, where it is applied to each element of result, to outside the loop, where it is applied to the vector result.
I realized that sum(x * log(x / (x + y) / 2)) could be written as sum(x * log(2 * x)) - sum(x * log(x + y)). The first sum is calculated once for each entry, but could be calculated once for each column. It too comes out of the loops, with the vector of values (one element for each column) calculated as colSums(m * log(2 * m)).
The remaining term inside the inner loop is sum((x + y) * log(x + y)). For a given value of i, we can trade off space for speed by vectorizing this across all relevant y columns as a matrix operation
j <- seq_len(i - 1L)
xy <- m[, i] + m[, j, drop=FALSE]
xylogxy[i, j] <- colSums(xy * log(xy))
The end result is
dist4 <- function(m) {
ncol <- ncol(m)
xlogx <- matrix(colSums(m * log(2 * m)), ncol, ncol)
xlogx2 <- xlogx + t(xlogx)
xlogx2[upper.tri(xlogx2, diag=TRUE)] <- 0
xylogxy <- matrix(0, ncol, ncol)
for (i in seq_len(ncol)[-1]) {
j <- seq_len(i - 1L)
xy <- m[, i] + m[, j, drop=FALSE]
xylogxy[i, j] <- colSums(xy * log(xy))
}
sqrt(0.5 * (xlogx2 - xylogxy))
}
Which produces results that are numerically equal (though not exactly identical) to the original
> all.equal(dist0(m), dist4(m))
[1] TRUE
and about 2.25x faster
> microbenchmark(dist0(m), dist4(m), dist.JSD_cpp2(m), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
dist0(m) 48.41173 48.42569 49.26072 48.68485 49.48116 51.64566 10
dist4(m) 20.80612 20.90934 21.34555 21.09163 21.96782 22.32984 10
dist.JSD_cpp2(m) 28.95351 29.11406 29.43474 29.23469 29.78149 30.37043 10
You'll still be waiting for about 10 hours, though that seems to imply a very large problem. The algorithm seems like it is quadratic in the number of columns, but the number of columns here was small (24) compared to the number of rows, so I wonder what the actual size of data being processed is? There are ncol * (ncol - 1) / 2 distances to be calculated.
A crude approach to further performance gain is parallel evaluation, which the following implements using parallel::mclapply()
dist4p <- function(m, ..., mc.cores=detectCores()) {
ncol <- ncol(m)
xlogx <- matrix(colSums(m * log(2 * m)), ncol, ncol)
xlogx2 <- xlogx + t(xlogx)
xlogx2[upper.tri(xlogx2, diag=TRUE)] <- 0
xx <- mclapply(seq_len(ncol)[-1], function(i, m) {
j <- seq_len(i - 1L)
xy <- m[, i] + m[, j, drop=FALSE]
colSums(xy * log(xy))
}, m, ..., mc.cores=mc.cores)
xylogxy <- matrix(0, ncol, ncol)
xylogxy[upper.tri(xylogxy, diag=FALSE)] <- unlist(xx)
sqrt(0.5 * (xlogx2 - t(xylogxy)))
}
My laptop has 8 nominal cores, and for 1000 columns I have
> system.time(xx <- dist4p(m1000))
user system elapsed
48.909 1.939 8.043
suggests that I get 48s of processor time in 8s of clock time. The algorithm is still quadratic, so this might reduce overall computation time to about 1h for the full problem. Memory might become an issue on a multicore machine, where all processes are competing for the same memory pool; it might be necessary to choose mc.cores less than the number available.
With large ncol, the way to get better performance is to avoid calculating the complete set of distances. Depending on the nature of the data it might make sense to filter for duplicate columns, or to filter for informative columns (e.g., with greatest variance), or... An appropriate strategy requires more information on what the columns represent and what the goal is for the distance matrix. The question 'how similar is company i to other companies?' can be answered without calculating the full distance matrix, just a single row, so if the number of times the question is asked relative to the total number of companies is small, then maybe there is no need to calculate the full distance matrix? Another strategy might be to reduce the number of companies to be clustered by (1) simplify the 1000 rows of measurement using principal components analysis, (2) kmeans clustering of all 50k companies to identify say 1000 centroids, and (3) using the interpolated measurements and Jensen-Shannon distance between these for clustering.
I'm sure there are better approaches than the following, but your JSD function itself can trivially be converted to an Rcpp function by just swapping sum and log for their Rcpp sugar equivalents, and using std::sqrt in place of the R's base::sqrt.
#include <Rcpp.h>
// [[Rcpp::export]]
double cppJSD(const Rcpp::NumericVector& x, const Rcpp::NumericVector& y) {
return std::sqrt(0.5 * (Rcpp::sum(x * Rcpp::log(x/((x+y)/2))) +
Rcpp::sum(y * Rcpp::log(y/((x+y)/2)))));
}
I only tested with your dist.JST_2 approach (since it was the fastest version), but you should see an improvement when using cppJSD instead of JSD regardless of the implementation:
R> microbenchmark::microbenchmark(
two_loops = dist.JSD_2(m),
cpp = dist.JSD_cpp(m),
times=100L)
Unit: milliseconds
expr min lq mean median uq max neval
two_loops 41.25142 41.34755 42.75926 41.45956 43.67520 49.54250 100
cpp 36.41571 36.52887 37.49132 36.60846 36.98887 50.91866 100
EDIT:
Actually, your dist.JSD_2 function itself can easily be converted to an Rcpp function for an additional speed-up:
// [[Rcpp::export("dist.JSD_cpp2")]]
Rcpp::NumericMatrix foo(const Rcpp::NumericMatrix& inMatrix) {
size_t cols = inMatrix.ncol();
Rcpp::NumericMatrix result(cols, cols);
for (size_t i = 1; i < cols; i++) {
for (size_t j = 0; j < i; j++) {
result(i,j) = cppJSD(inMatrix(Rcpp::_, i), inMatrix(Rcpp::_, j));
}
}
return result;
}
(where cppJSD was defined in the same .cpp file as the above). Here are the timings:
R> microbenchmark::microbenchmark(
two_loops = dist.JSD_2(m),
partial_cpp = dist.JSD_cpp(m),
full_cpp = dist.JSD_cpp2(m),
times=100L)
Unit: milliseconds
expr min lq mean median uq max neval
two_loops 41.25879 41.36729 42.95183 41.84999 44.08793 54.54610 100
partial_cpp 36.45802 36.62463 37.69742 36.99679 37.96572 44.26446 100
full_cpp 32.00263 32.12584 32.82785 32.20261 32.63554 38.88611 100
dist.JSD_2 <- function(inMatrix) {
matrixColSize <- ncol(inMatrix)
resultsMatrix <- matrix(0, matrixColSize, matrixColSize)
for(i in 2:matrixColSize) {
for(j in 1:(i-1)) {
resultsMatrix[i,j]=JSD(inMatrix[,i], inMatrix[,j])
}
}
return(resultsMatrix)
}
##
dist.JSD_cpp <- function(inMatrix) {
matrixColSize <- ncol(inMatrix)
resultsMatrix <- matrix(0, matrixColSize, matrixColSize)
for(i in 2:matrixColSize) {
for(j in 1:(i-1)) {
resultsMatrix[i,j]=cppJSD(inMatrix[,i], inMatrix[,j])
}
}
return(resultsMatrix)
}
m <- runif(24000, min = 0, max = 1)
m <- matrix(m, 24, 1000)
prob_dist <- function(x) t(apply(x, 1, prop.table))
JSD <- function(x,y) sqrt(0.5 * (sum(x*log(x/((x+y)/2))) + sum(y*log(y/((x+y)/2)))))
m <- t(prob_dist(m))
m[m==0] <- 0.000001