How to improve processing time for euclidean distance calculation - r

I'm trying to calculate the weighted euclidean distance (squared) between two data frames that have the same number of columns (variables) and different numbers of rows (observations).
The calculation follows the formula:
DIST[m,i] <- sum(((DATA1[m,] - DATA2[i,]) ^ 2) * lambda[1,])
I specifically need to multiply each term of the sum by a specific weight (lambda).
The code provided below runs correctly, but if I use it in hundreds of iterations it takes a lot of processing time. Yesterday it took me 18 hours to create a graphic using multiple iterations of a function that contains this calculation. Using library(profvis) profvis({ my code }) I saw that this specific part of the code is taking up about 80% of the processing time.
I read a lot about how to reduce the processing time using parallel and vectorized operations, but I don't know how to implement them in this particular case, because of the weight lambda.
Can someone help me reduce my processing time with this code?
More information about the code and the structure of the data can be found in the code provided below as comments.
# Data frames used to calculate the euclidean distances between each observation
# from DATA1 and each observation from DATA2.
# The euclidean distance is between a [600x50] and a [8X50] dataframes, resulting
# in a [600X8] dataframe.
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]
# Weights used for each of the 50 variables to calculate the weighted
# euclidean distance.
# Can be a vector of different weights or a scalar of the same weight
# for all variables.
lambda <- runif(n=50, min=0, max=10) ## length(lambda) > 1
# lambda=1 ## length(lambda) == 1
if (length(lambda) > 1) {
as.numeric(unlist(lambda))
lambda <- as.matrix(lambda)
lambda <- t(lambda)
}
nrows1 <- nrow(DATA1)
nrows2 <- nrow(DATA2)
# Euclidean Distance calculation
DIST <- matrix(NA, nrow=nrows1, ncol=nrows2 )
for (m in 1:nrows1) {
for (i in 1:nrows2) {
if (length(lambda) == 1) {
DIST[m, i] <- sum((DATA1[m, ] - DATA2[i, ])^2)
}
if (length(lambda) > 1){
DIST[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda[1, ])
}
next
}
next
}
After all the suggestions, combining the answers from @MDWITT (for length(lambda) > 1) and @F. Privé (for length(lambda) == 1), the final solution took only one minute to run, whereas the original took an hour and a half, as part of a bigger program that contains this calculation. The final code for this problem, for those interested, is:
#Data frames used to calculate the euclidean distances between each observation from DATA1 and each observation from DATA2.
#The euclidean distance is between a [600x50] and a [8X50] dataframes, resulting in a [600X8] dataframe.
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]
#Weights used for each of the 50 variables to calculate the weighted euclidean distance.
#Can be a vector of different weights or a scalar of the same weight for all variables.
#lambda <- runif(n = 50, min = 0, max = 10) ##length(lambda) > 1
lambda = 1 ##length(lambda) == 1
nrows1 <- nrow(DATA1)
nrows2 <- nrow(DATA2)
#Euclidean Distance calculation
DIST <- matrix(NA, nrow = nrows1, ncol = nrows2)
if (length(lambda) > 1){
as.numeric(unlist(lambda))
lambda <- as.matrix(lambda)
lambda <- t(lambda)
library(Rcpp)
cppFunction('NumericMatrix weighted_distance (NumericMatrix x, NumericMatrix y, NumericVector lambda){
int n_x = x.nrow();
int n_y = y.nrow();
NumericMatrix DIST(n_x, n_y);
//begin the loop
for (int i = 0 ; i < n_x; i++){
for (int j = 0 ; j < n_y ; j ++) {
double d = sum(pow(x.row(i) - y.row(j), 2)*lambda);
DIST(i,j) = d;
}
}
return (DIST) ;
}')
DIST <- weighted_distance(DATA1, DATA2, lambda = lambda)}
if (length(lambda) == 1) {
DIST <- outer(rowSums(DATA1^2), rowSums(DATA2^2), '+') - tcrossprod(DATA1, 2 * DATA2)
}

Rewrite the problem to use linear algebra and vectorization, which is much faster than loops.
If you don't have lambda, this is just
outer(rowSums(DATA1^2), rowSums(DATA2^2), '+') - tcrossprod(DATA1, 2 * DATA2)
With lambda, it becomes
outer(drop(DATA1^2 %*% lambda), drop(DATA2^2 %*% lambda), '+') -
tcrossprod(DATA1, sweep(DATA2, 2, 2 * lambda, '*'))
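As a quick sanity check (a sketch, not part of the original answer), the vectorized expression can be compared with the double loop on smaller random data. It relies on expanding (a - b)^2 * w into a^2 * w + b^2 * w - 2 * a * b * w. The names loop_dist and vec_dist are only illustrative, and lambda is assumed to be a plain numeric vector with one weight per column.

```r
set.seed(1)
DATA1 <- matrix(rexp(60 * 50, rate = 0.1), ncol = 50)  # [60x50], smaller than in the question
DATA2 <- matrix(rexp(8 * 50, rate = 0.1), ncol = 50)   # [8x50]
lambda <- runif(n = 50, min = 0, max = 10)             # one weight per column

# Reference result: the double loop from the question
loop_dist <- matrix(NA_real_, nrow(DATA1), nrow(DATA2))
for (m in seq_len(nrow(DATA1))) {
  for (i in seq_len(nrow(DATA2))) {
    loop_dist[m, i] <- sum((DATA1[m, ] - DATA2[i, ])^2 * lambda)
  }
}

# Vectorized form: weighted squared norms of each row, minus twice the weighted cross product
vec_dist <- outer(drop(DATA1^2 %*% lambda), drop(DATA2^2 %*% lambda), "+") -
  tcrossprod(DATA1, sweep(DATA2, 2, 2 * lambda, "*"))

# Equal up to floating-point rounding
all.equal(loop_dist, vec_dist)
```

Because the expanded form subtracts large, similar numbers, the two results agree only up to rounding, which is why all.equal() is used rather than identical().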

Here is an alternative using Rcpp, just to have this concept documented. In a file called euclidean.cpp I have
#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix weighted_distance (NumericMatrix x, NumericMatrix y, NumericVector lambda){
int n_x = x.nrow();
int n_y = y.nrow();
NumericMatrix out(n_x, n_y);
//begin the loop
for (int i = 0 ; i < n_x; i++){
for (int j = 0 ; j < n_y ; j ++) {
double d = sum(pow(x.row(i) - y.row(j), 2)*lambda);
out(i,j) = d;
}
}
return (out) ;
}
In R, then I have
library(Rcpp)
sourceCpp("libs/euclidean.cpp")
# Generate Data
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]
lambda <- runif(n=50, min=0, max=10)
# Run the program
out <- weighted_distance(DATA1, DATA2, lambda = lambda)
When I test the speed using:
microbenchmark(
Rcpp_way = weighted_distance(DATA1, DATA2, lambda = lambda),
other = {DIST <- matrix(NA, nrow=nrows1, ncol=nrows2)
for (m in 1:nrows1) {
for (i in 1:nrows2) {
if (length(lambda) == 1) {
DIST[m, i] <- sum((DATA1[m, ] - DATA2[i, ])^2)
}
if (length(lambda) > 1){
DIST[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda) # lambda is a plain vector here
}
next
}
next
}}, times = 100)
You can see that it is a good clip faster:
Unit: microseconds
expr min lq mean median uq max neval
Rcpp_way 446.769 492.308 656.9849 562.667 846.9745 1169.231 100
other 24688.821 30681.641 44153.5264 37511.385 50878.3585 200843.898 100

Related

Trying to simulate Poisson samples using inverse CDF method but my R function produces wrong results

I wrote some R code for simulating random samples from a Poisson distribution, based on the description of an algorithm (see attached image). But my code does not seem to work correctly, because the generated random samples are of a different pattern compared with those generated by R's built-in rpois() function. Can anybody tell me what I did wrong and how to fix my function?
r.poisson <- function(n, l=0.5)
{
U <- runif(n)
X <- rep(0,n)
p=exp(-l)
F=p
for(i in 1:n)
{
if(U[i] < F)
{
X[i] <- i
} else
{
p=p*l/(i+1)
F=F+p
i=i+1
}
}
return(X)
}
r.poisson(50)
The output is very different from rpois(50, lambda = 0.5). The algorithm I followed is:
(Thank you for your question. Now I know how a Poisson random variable is simulated.)
You had a misunderstanding. The inverse CDF method (with recursive computation) you referenced is used to generate a single Poisson random sample. So you need to fix this function to produce a single number. Here is the correct function, commented to help you follow each step.
rpois1 <- function (lambda) {
## step 1
U <- runif(1)
## step 2
i <- 0
p <- exp(-lambda)
F <- p
## you need an "infinite" loop
## no worry, it will "break" at some time
repeat {
## step 3
if (U < F) {
X <- i
break
}
## step 4
i <- i + 1
p <- lambda * p / i ## I have incremented i, so it is `i` not `i + 1` here
F <- F + p
## back to step 3
}
return(X)
}
Now to get n samples, you need to call this function n times. R has a nice function called replicate to repeat a function many times.
r.poisson <- function (n, lambda) {
## use `replicate()` to call `rpois1` n times
replicate(n, rpois1(lambda))
}
Now we can make a reasonable comparison with R's own rpois.
x1 <- r.poisson(1000, lambda = 0.5)
x2 <- rpois(1000, lambda = 0.5)
## set breaks reasonably when making a histogram
xmax <- max(x1, x2) + 0.5
par(mfrow = c(1, 2))
hist(x1, main = "proof-of-concept-implementation", breaks = seq.int(-0.5, xmax))
hist(x2, main = "R's rpois()", breaks = seq.int(-0.5, xmax))
Remark:
Credit to jblood94 for showing how to find a vectorization opportunity in an R loop without converting everything to C/C++. R's rpois is coded in C, which is why it is fast.
A vectorized version will run much faster than a non-vectorized function using replicate. The idea is to iteratively drop the uniform random samples as i is incremented.
r.poisson1 <- function(n, l = 0.5) {
U <- runif(n)
i <- 0L
X <- integer(n)
p <- exp(-l)
F <- p
idx <- 1:n
while (length(idx)) {
bln <- U < F
X[idx[bln]] <- i
p <- l*p/(i <- i + 1L)
F <- F + p
idx <- idx[!bln]
U <- U[!bln]
}
X
}
#Zheyuan Li's non-vectorized functions:
rpois1 <- function (lambda) {
## step 1
U <- runif(1)
## step 2
i <- 0
p <- exp(-lambda)
F <- p
## you need an "infinite" loop
## no worry, it will "break" at some time
repeat {
## step 3
if (U < F) {
X <- i
break
}
## step 4
i <- i + 1
p <- lambda * p / i ## `i` was already incremented, so divide by `i`
F <- F + p
## back to step 3
}
return(X)
}
r.poisson2 <- function (n, lambda) {
## use `replicate()` to call `rpois1` n times
replicate(n, rpois1(lambda))
}
Benchmark:
microbenchmark::microbenchmark(r.poisson1(1e5),
r.poisson2(1e5, 0.5),
rpois(1e5, 0.5))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r.poisson1(1e+05) 3.063202 3.129151 3.782200 3.225402 3.734600 18.377700 100
#> r.poisson2(1e+05, 0.5) 217.631002 244.816601 269.692648 267.977001 287.599251 375.910601 100
#> rpois(1e+05, 0.5) 1.519901 1.552300 1.649026 1.579551 1.620451 7.531401 100
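Another way to see the inverse-CDF idea discussed above, offered only as a sketch: precompute the Poisson CDF on a grid and let findInterval() count how many CDF values are less than or equal to each uniform draw. That count is exactly the smallest i with U < F(i). The function name r.poisson3 and the tail_prob cutoff are assumptions made for this illustration, not part of either answer.

```r
# Inverse-CDF sampling with a precomputed CDF table (illustrative sketch).
# Assumption: the table is truncated at the (1 - tail_prob) quantile.
r.poisson3 <- function(n, lambda, tail_prob = 1e-12) {
  kmax <- qpois(1 - tail_prob, lambda)   # largest value kept in the table
  cdf <- ppois(0:kmax, lambda)           # F(0), F(1), ..., F(kmax)
  U <- runif(n)
  # number of CDF values <= U equals the smallest i with U < F(i)
  findInterval(U, cdf)
}

set.seed(1)
x3 <- r.poisson3(1e5, lambda = 0.5)

# Compare simulated frequencies of 0..4 with the exact probabilities
round(rbind(simulated = tabulate(x3 + 1L, nbins = 5) / length(x3),
            exact = dpois(0:4, 0.5)), 4)
```

This avoids the explicit loop over i entirely, at the cost of choosing a truncation point for the table.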

Fast random sampling from matrix of cumulative probability mass functions in R

I have a matrix (mat_cdf) representing the cumulative probability that an individual in census tract i moves to census tract j on a given day. Given a vector of agents who decide not to "stay home", I have a function, GetCTMove (below), that randomly samples from this matrix to determine which census tract they will spend time in.
# Random generation
cts <- 500
i <- rgamma(cts, 50, 1)
prop <- 1:cts
# Matrix where rows correspond to probability mass of column integer
mat <- do.call(rbind, lapply(i, function(i){dpois(prop, i)}))
# Convert to cumulative probability mass
mat_cdf <- matrix(NA, cts, cts)
for(i in 1:cts){
# Create cdf for row i
mat_cdf[i,] <- sapply(1:cts, function(j) sum(mat[i,1:j]))
}
GetCTMove <- function(agent_cts, ct_mat_cdf){
# Expand such that every agent has its own row corresponding to CDF of movement from their home ct i to j
mat_expand <- ct_mat_cdf[agent_cts,]
# Probabilistically sample column index for every row by generating random number and then determining corresponding closest column
s <- runif(length(agent_cts))
fin_col <- max.col(s < mat_expand, "first")
return(fin_col)
}
# Sample of 500,000 agents' residence ct
agents <- sample(1:cts, size = 500000, replace = T)
# Run function
system.time(GetCTMove(agents, mat_cdf))
user system elapsed
3.09 1.19 4.30
Working with 1 million agents, each sample takes ~10 seconds to run, which, multiplied by many time steps, leads to hours for each simulation, and this function is by far the rate-limiting factor of the model. I'm wondering if anyone has advice on a faster implementation of this kind of random sampling. I've used the dqrng package to speed up random number generation, but that gain is minuscule in comparison to the matrix expansion (mat_expand) and max.col calls, which take longest to run.
The first thing that you can optimise is the following code:
max.col(s < mat_expand, "first")
Since s < mat_expand returns a logical matrix, applying the max.col function is the same as getting the first TRUE in each row. In this case, using which will be much more efficient. Also, as shown below, you store all your CDFs in a matrix.
mat <- do.call(rbind, lapply(i, function(i){dpois(prop, i)}))
mat_cdf <- matrix(NA, cts, cts)
for(i in 1:cts){
mat_cdf[i,] <- sapply(1:cts, function(j) sum(mat[i,1:j]))
}
This structure may not be optimal. A list structure is better for applying functions like which. It is also faster to run as you do not have to go through a do.call(rbind, ...).
# using a list structure to speed up the creation of cdfs
ls_cdf <- lapply(i, function(x) cumsum(dpois(prop, x)))
Below is your implementation:
# Implementation 1
GetCTMove <- function(agent_cts, ct_mat_cdf){
mat_expand <- ct_mat_cdf[agent_cts,]
s <- runif(length(agent_cts))
fin_col <- max.col(s < mat_expand, "first")
return(fin_col)
}
On my desktop, it takes about 2.68s to run.
> system.time(GetCTMove(agents, mat_cdf))
user system elapsed
2.25 0.41 2.68
With a list structure and a which function, the run time can be reduced by about 1s.
# Implementation 2
GetCTMove2 <- function(agent_cts, ls_cdf){
n <- length(agent_cts)
s <- runif(n)
out <- integer(n)
i <- 1L
while (i <= n) {
out[[i]] <- which(s[[i]] < ls_cdf[[agent_cts[[i]]]])[[1L]]
i <- i + 1L
}
out
}
> system.time(GetCTMove2(agents, ls_cdf))
user system elapsed
1.59 0.02 1.64
To my knowledge, with R only there is no other way to further speed up the code. However, you can indeed improve the performance by re-writing the key function GetCTMove in C++. With the Rcpp package, you can do something as follows:
# Implementation 3
Rcpp::cppFunction('NumericVector fast_GetCTMove(NumericVector agents, NumericVector s, List cdfs) {
int n = agents.size();
NumericVector out(n);
for (int i = 0; i < n; ++i) {
NumericVector cdf = as<NumericVector>(cdfs[agents[i] - 1]);
int m = cdf.size();
for (int j = 0; j < m; ++j) {
if (s[i] < cdf[j]) {
out[i] = j + 1;
break;
}
}
}
return out;
}')
GetCTMove3 <- function(agent_cts, ls_cdf){
s <- runif(length(agent_cts))
fast_GetCTMove(agent_cts, s, ls_cdf)
}
This implementation is lightning fast, which should fulfil all your needs.
> system.time(GetCTMove3(agents, ls_cdf))
user system elapsed
0.07 0.00 0.06
The full script is attached as follows:
# Random generation
cts <- 500
i <- rgamma(cts, 50, 1)
prop <- 1:cts
agents <- sample(1:cts, size = 500000, replace = T)
# using a list structure to speed up the creation of cdfs
ls_cdf <- lapply(i, function(x) cumsum(dpois(prop, x)))
# below is your code
mat <- do.call(rbind, lapply(i, function(i){dpois(prop, i)}))
mat_cdf <- matrix(NA, cts, cts)
for(i in 1:cts){
mat_cdf[i,] <- sapply(1:cts, function(j) sum(mat[i,1:j]))
}
# Implementation 1
GetCTMove <- function(agent_cts, ct_mat_cdf){
mat_expand <- ct_mat_cdf[agent_cts,]
s <- runif(length(agent_cts))
fin_col <- max.col(s < mat_expand, "first")
return(fin_col)
}
# Implementation 2
GetCTMove2 <- function(agent_cts, ls_cdf){
n <- length(agent_cts)
s <- runif(n)
out <- integer(n)
i <- 1L
while (i <= n) {
out[[i]] <- which(s[[i]] < ls_cdf[[agent_cts[[i]]]])[[1L]]
i <- i + 1L
}
out
}
# Implementation 3
Rcpp::cppFunction('NumericVector fast_GetCTMove(NumericVector agents, NumericVector s, List cdfs) {
int n = agents.size();
NumericVector out(n);
for (int i = 0; i < n; ++i) {
NumericVector cdf = as<NumericVector>(cdfs[agents[i] - 1]);
int m = cdf.size();
for (int j = 0; j < m; ++j) {
if (s[i] < cdf[j]) {
out[i] = j + 1;
break;
}
}
}
return out;
}')
GetCTMove3 <- function(agent_cts, ls_cdf){
s <- runif(length(agent_cts))
fast_GetCTMove(agent_cts, s, ls_cdf)
}
system.time(GetCTMove(agents, mat_cdf))
system.time(GetCTMove2(agents, ls_cdf))
system.time(GetCTMove3(agents, ls_cdf))
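As an R-only variation on Implementation 2 above, here is a hedged sketch that groups agents by home tract and searches each CDF once per group with findInterval(). Since findInterval() counts the CDF values that are less than or equal to s, adding 1 gives the first column where s < cdf. The name GetCTMove4 and the optional s argument are assumptions introduced for this example, and it has not been timed against the implementations above. It also assumes every draw falls below the last CDF value, the same assumption Implementation 2 makes.

```r
GetCTMove4 <- function(agent_cts, ls_cdf, s = runif(length(agent_cts))) {
  out <- integer(length(agent_cts))
  # positions of the agents, grouped by their home census tract
  groups <- split(seq_along(agent_cts), agent_cts)
  for (ct in names(groups)) {
    idx <- groups[[ct]]
    cdf <- ls_cdf[[as.integer(ct)]]
    # count of cdf values <= s, plus 1, is the first column with s < cdf
    out[idx] <- findInterval(s[idx], cdf) + 1L
  }
  out
}

# Small example, checked against the which()-based lookup
set.seed(1)
cts <- 200
rates <- rgamma(cts, 50, 1)
ls_cdf <- lapply(rates, function(x) cumsum(dpois(1:cts, x)))
agents <- sample(1:cts, size = 2000, replace = TRUE)
s <- runif(length(agents))

res_grouped <- GetCTMove4(agents, ls_cdf, s)
res_which <- vapply(seq_along(agents),
                    function(k) which(s[k] < ls_cdf[[agents[k]]])[1L],
                    integer(1))
identical(res_grouped, res_which)
```

Passing s explicitly makes the two lookups comparable on the same random draws. In a simulation the default argument can be used instead.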

Which R implementation gives the fastest JSD matrix computation?

The JSD matrix is a similarity matrix of distributions based on Jensen-Shannon divergence.
Given a matrix m whose rows represent distributions, we would like to find the JSD distance between each pair of distributions. The resulting JSD matrix is a square matrix with dimensions nrow(m) x nrow(m). It is a triangular matrix where each element contains the JSD value between two rows of m.
JSD can be calculated by the following R function:
JSD<- function(x,y) sqrt(0.5 * (sum(x*log(x/((x+y)/2))) + sum(y*log(y/((x+y)/2)))))
where x, y are rows in matrix m.
I experimented with different JSD matrix calculation algorithms in R to figure out the quickest one. To my surprise, the algorithm with two nested loops performs faster than the different vectorized versions (parallelized or not). I'm not happy with the results. Could you point me to better solutions than the ones I came up with?
library(parallel)
library(plyr)
library(doParallel)
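library(foreach)
library(Matrix) # provides sparseMatrix(), used by dist.JSD_4 and dist.JSD_5
library(microbenchmark) # used for the performance test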
nodes <- detectCores()
cl <- makeCluster(4)
registerDoParallel(cl)
m <- runif(24000, min = 0, max = 1)
m <- matrix(m, 24, 1000)
prob_dist <- function(x) t(apply(x, 1, prop.table))
JSD<- function(x,y) sqrt(0.5 * (sum(x*log(x/((x+y)/2))) + sum(y*log(y/((x+y)/2)))))
m <- t(prob_dist(m))
m[m==0] <- 0.000001
Algorithm with two nested loops:
dist.JSD_2 <- function(inMatrix) {
matrixColSize <- ncol(inMatrix)
resultsMatrix <- matrix(0, matrixColSize, matrixColSize)
for(i in 2:matrixColSize) {
for(j in 1:(i-1)) {
resultsMatrix[i,j]=JSD(inMatrix[,i], inMatrix[,j])
}
}
return(resultsMatrix)
}
Algorithm with outer:
dist.JSD_3 <- function(inMatrix) {
matrixColSize <- ncol(inMatrix)
resultsMatrix <- outer(1:matrixColSize,1:matrixColSize, FUN = Vectorize( function(i,j) JSD(inMatrix[,i], inMatrix[,j])))
return(resultsMatrix)
}
Algorithm with combn and apply:
dist.JSD_4 <- function(inMatrix) {
matrixColSize <- ncol(inMatrix)
ind <- combn(matrixColSize, 2)
out <- apply(ind, 2, function(x) JSD(inMatrix[,x[1]], inMatrix[,x[2]]))
a <- rbind(ind, out)
resultsMatrix <- sparseMatrix(a[1,], a[2,], x=a[3,], dims=c(matrixColSize, matrixColSize))
return(resultsMatrix)
}
Algorithm with combn and aaply:
dist.JSD_5 <- function(inMatrix) {
matrixColSize <- ncol(inMatrix)
ind <- combn(matrixColSize, 2)
out <- aaply(ind, 2, function(x) JSD(inMatrix[,x[1]], inMatrix[,x[2]]))
a <- rbind(ind, out)
resultsMatrix <- sparseMatrix(a[1,], a[2,], x=a[3,], dims=c(matrixColSize, matrixColSize))
return(resultsMatrix)
}
performance test:
mbm = microbenchmark(
two_loops = dist.JSD_2(m),
outer = dist.JSD_3(m),
combn_apply = dist.JSD_4(m),
combn_aaply = dist.JSD_5(m),
times = 10
)
ggplot2::autoplot(mbm)
> summary(mbm)
expr min lq mean median
1 two_loops 18.30857 18.68309 23.50231 18.77303
2 outer 38.93112 40.98369 42.44783 42.16858
3 combn_apply 20.45740 20.90747 21.49122 21.35042
4 combn_aaply 55.61176 56.77545 59.37358 58.93953
uq max neval cld
1 18.87891 65.34197 10 a
2 42.85978 48.82437 10 b
3 22.06277 22.98803 10 a
4 62.26417 64.77407 10 c
This is my implementation of your dist.JSD_2
dist0 <- function(m) {
ncol <- ncol(m)
result <- matrix(0, ncol, ncol)
for (i in 2:ncol) {
for (j in 1:(i-1)) {
x <- m[,i]; y <- m[,j]
result[i, j] <-
sqrt(0.5 * (sum(x * log(x / ((x + y) / 2))) +
sum(y * log(y / ((x + y) / 2)))))
}
}
result
}
The usual steps are to replace iterative calculations with vectorized versions. I moved sqrt(0.5 * ...) from inside the loops, where it is applied to each element of result, to outside the loop, where it is applied to the vector result.
I realized that sum(x * log(x / ((x + y) / 2))) could be written as sum(x * log(2 * x)) - sum(x * log(x + y)). The first sum is calculated once for each entry, but could be calculated once for each column. It too comes out of the loops, with the vector of values (one element for each column) calculated as colSums(m * log(2 * m)).
The remaining term inside the inner loop is sum((x + y) * log(x + y)), which collects the sum(x * log(x + y)) and sum(y * log(x + y)) parts of the two sums. For a given value of i, we can trade off space for speed by vectorizing this across all relevant y columns as a matrix operation
j <- seq_len(i - 1L)
xy <- m[, i] + m[, j, drop=FALSE]
xylogxy[i, j] <- colSums(xy * log(xy))
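The decomposition described above can be checked numerically for a single pair of distributions. This is only a sketch: the vectors x and y are random examples, and the names jsd_direct and jsd_decomposed are made up for the illustration.

```r
set.seed(42)
x <- prop.table(runif(24))   # two example probability vectors
y <- prop.table(runif(24))

# Direct formula, as in the question
jsd_direct <- sqrt(0.5 * (sum(x * log(x / ((x + y) / 2))) +
                          sum(y * log(y / ((x + y) / 2)))))

# Decomposed form: one term per vector, plus one joint term
jsd_decomposed <- sqrt(0.5 * (sum(x * log(2 * x)) +
                              sum(y * log(2 * y)) -
                              sum((x + y) * log(x + y))))

all.equal(jsd_direct, jsd_decomposed)
```

Only the last term depends on the pair, which is why it is the only part left inside the loop.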
The end result is
dist4 <- function(m) {
ncol <- ncol(m)
xlogx <- matrix(colSums(m * log(2 * m)), ncol, ncol)
xlogx2 <- xlogx + t(xlogx)
xlogx2[upper.tri(xlogx2, diag=TRUE)] <- 0
xylogxy <- matrix(0, ncol, ncol)
for (i in seq_len(ncol)[-1]) {
j <- seq_len(i - 1L)
xy <- m[, i] + m[, j, drop=FALSE]
xylogxy[i, j] <- colSums(xy * log(xy))
}
sqrt(0.5 * (xlogx2 - xylogxy))
}
Which produces results that are numerically equal (though not exactly identical) to the original
> all.equal(dist0(m), dist4(m))
[1] TRUE
and about 2.25x faster
> microbenchmark(dist0(m), dist4(m), dist.JSD_cpp2(m), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
dist0(m) 48.41173 48.42569 49.26072 48.68485 49.48116 51.64566 10
dist4(m) 20.80612 20.90934 21.34555 21.09163 21.96782 22.32984 10
dist.JSD_cpp2(m) 28.95351 29.11406 29.43474 29.23469 29.78149 30.37043 10
You'll still be waiting for about 10 hours, though that seems to imply a very large problem. The algorithm seems like it is quadratic in the number of columns, but the number of columns here was small (24) compared to the number of rows, so I wonder what the actual size of data being processed is? There are ncol * (ncol - 1) / 2 distances to be calculated.
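To make the quadratic growth concrete, the count ncol * (ncol - 1) / 2 can be evaluated for a few sizes. The sizes below are illustrative: 24 and 1000 appear in the examples here, and 50000 is the "50k companies" figure mentioned in this discussion.

```r
n_cols <- c(24, 1000, 50000)
n_pairs <- n_cols * (n_cols - 1) / 2   # number of pairwise distances
data.frame(n_cols, n_pairs)
# n_pairs is 276, 499500 and 1249975000 respectively
```

Going from 1000 to 50000 columns multiplies the work by roughly 2500, not 50.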
A crude approach to further performance gain is parallel evaluation, which the following implements using parallel::mclapply()
dist4p <- function(m, ..., mc.cores=detectCores()) {
ncol <- ncol(m)
xlogx <- matrix(colSums(m * log(2 * m)), ncol, ncol)
xlogx2 <- xlogx + t(xlogx)
xlogx2[upper.tri(xlogx2, diag=TRUE)] <- 0
xx <- mclapply(seq_len(ncol)[-1], function(i, m) {
j <- seq_len(i - 1L)
xy <- m[, i] + m[, j, drop=FALSE]
colSums(xy * log(xy))
}, m, ..., mc.cores=mc.cores)
xylogxy <- matrix(0, ncol, ncol)
xylogxy[upper.tri(xylogxy, diag=FALSE)] <- unlist(xx)
sqrt(0.5 * (xlogx2 - t(xylogxy)))
}
My laptop has 8 nominal cores, and for 1000 columns I have
> system.time(xx <- dist4p(m1000))
user system elapsed
48.909 1.939 8.043
which suggests that I get 48s of processor time in 8s of clock time. The algorithm is still quadratic, so this might reduce overall computation time to about 1h for the full problem. Memory might become an issue on a multicore machine, where all processes are competing for the same memory pool; it might be necessary to choose mc.cores less than the number available.
With large ncol, the way to get better performance is to avoid calculating the complete set of distances. Depending on the nature of the data it might make sense to filter for duplicate columns, or to filter for informative columns (e.g., with greatest variance), or... An appropriate strategy requires more information on what the columns represent and what the goal is for the distance matrix. The question 'how similar is company i to other companies?' can be answered without calculating the full distance matrix, just a single row, so if the number of times the question is asked is small relative to the total number of companies, then maybe there is no need to calculate the full distance matrix? Another strategy might be to reduce the number of companies to be clustered by (1) simplifying the 1000 rows of measurement using principal components analysis, (2) kmeans clustering of all 50k companies to identify, say, 1000 centroids, and (3) using the interpolated measurements and the Jensen-Shannon distance between these for clustering.
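The single-row idea can be sketched by reusing the same decomposition as dist4: compute the distance from one column to every column in one vectorized step. The function name jsd_one_vs_all and the small random matrix are assumptions for this example. pmax() is only there to guard against tiny negative values caused by rounding.

```r
JSD <- function(x, y) sqrt(0.5 * (sum(x * log(x / ((x + y) / 2))) +
                                  sum(y * log(y / ((x + y) / 2)))))

# Distances from column i of m to every column of m (columns are distributions)
jsd_one_vs_all <- function(m, i) {
  xlogx <- colSums(m * log(2 * m))   # per-column term
  xy <- m[, i] + m                   # column i added to every column
  sqrt(pmax(0, 0.5 * (xlogx[i] + xlogx - colSums(xy * log(xy)))))
}

set.seed(1)
m <- matrix(runif(200 * 6), nrow = 200)
m <- sweep(m, 2, colSums(m), "/")    # each column sums to 1

d_fast <- jsd_one_vs_all(m, 3)
d_ref <- vapply(seq_len(ncol(m)), function(j) JSD(m[, 3], m[, j]), numeric(1))
all.equal(d_fast, d_ref)
```

The cost of one such query is linear in the number of columns, so a handful of queries is far cheaper than the full matrix.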
I'm sure there are better approaches than the following, but your JSD function itself can trivially be converted to an Rcpp function by just swapping sum and log for their Rcpp sugar equivalents, and using std::sqrt in place of R's base::sqrt.
#include <Rcpp.h>
// [[Rcpp::export]]
double cppJSD(const Rcpp::NumericVector& x, const Rcpp::NumericVector& y) {
return std::sqrt(0.5 * (Rcpp::sum(x * Rcpp::log(x/((x+y)/2))) +
Rcpp::sum(y * Rcpp::log(y/((x+y)/2)))));
}
I only tested with your dist.JSD_2 approach (since it was the fastest version), but you should see an improvement when using cppJSD instead of JSD regardless of the implementation:
R> microbenchmark::microbenchmark(
two_loops = dist.JSD_2(m),
cpp = dist.JSD_cpp(m),
times=100L)
Unit: milliseconds
expr min lq mean median uq max neval
two_loops 41.25142 41.34755 42.75926 41.45956 43.67520 49.54250 100
cpp 36.41571 36.52887 37.49132 36.60846 36.98887 50.91866 100
EDIT:
Actually, your dist.JSD_2 function itself can easily be converted to an Rcpp function for an additional speed-up:
// [[Rcpp::export("dist.JSD_cpp2")]]
Rcpp::NumericMatrix foo(const Rcpp::NumericMatrix& inMatrix) {
size_t cols = inMatrix.ncol();
Rcpp::NumericMatrix result(cols, cols);
for (size_t i = 1; i < cols; i++) {
for (size_t j = 0; j < i; j++) {
result(i,j) = cppJSD(inMatrix(Rcpp::_, i), inMatrix(Rcpp::_, j));
}
}
return result;
}
(where cppJSD was defined in the same .cpp file as the above). Here are the timings:
R> microbenchmark::microbenchmark(
two_loops = dist.JSD_2(m),
partial_cpp = dist.JSD_cpp(m),
full_cpp = dist.JSD_cpp2(m),
times=100L)
Unit: milliseconds
expr min lq mean median uq max neval
two_loops 41.25879 41.36729 42.95183 41.84999 44.08793 54.54610 100
partial_cpp 36.45802 36.62463 37.69742 36.99679 37.96572 44.26446 100
full_cpp 32.00263 32.12584 32.82785 32.20261 32.63554 38.88611 100
dist.JSD_2 <- function(inMatrix) {
matrixColSize <- ncol(inMatrix)
resultsMatrix <- matrix(0, matrixColSize, matrixColSize)
for(i in 2:matrixColSize) {
for(j in 1:(i-1)) {
resultsMatrix[i,j]=JSD(inMatrix[,i], inMatrix[,j])
}
}
return(resultsMatrix)
}
##
dist.JSD_cpp <- function(inMatrix) {
matrixColSize <- ncol(inMatrix)
resultsMatrix <- matrix(0, matrixColSize, matrixColSize)
for(i in 2:matrixColSize) {
for(j in 1:(i-1)) {
resultsMatrix[i,j]=cppJSD(inMatrix[,i], inMatrix[,j])
}
}
return(resultsMatrix)
}
m <- runif(24000, min = 0, max = 1)
m <- matrix(m, 24, 1000)
prob_dist <- function(x) t(apply(x, 1, prop.table))
JSD <- function(x,y) sqrt(0.5 * (sum(x*log(x/((x+y)/2))) + sum(y*log(y/((x+y)/2)))))
m <- t(prob_dist(m))
m[m==0] <- 0.000001

Minimum id with non-repetitive elements

I am stuck in a difficult problem in R and am not able to resolve it. The problem goes like this.
x and y are two vectors, as given below:
x<- c(1,2,3,4,5)
y<- c(12,4,2,5,7,18,9,10)
I want to create a new vector p, where length(p) = length(x), in the following manner:
For each id in x, find the id in y which has minimum absolute distance in terms of values. For instance, for id=1 in x, value_x(id=1)=1, min_value_y =2, and id_y(value==2) = 3. Thus, the answer to id 1 in x is 3. Thus, we create a new vector p, which will have following values: p = (3,3,3,2,4);
Now we have to update p, in the following manner:
As 3 has been the id corresponding to id_x=1, it can't be the id for id_x=2. Hence, we have to discard id_y =3 with value 2, to calculate the next minimum distance for id_x=2. Next best minimum distance for id_x=2 is id_y=2 with value 4. Hence, updated p is (3,2,3,2,4).
As 3 has been the id corresponding to id_x=1, it can't be the id for id_x=3. Hence, we have to discard id_y=3 with value 2 to calculate the next minimum distance for id_x=3. Since id_y=2 is now taken by id_x=2, the next best match for id_x=3 is id_y=4 with value 5 (a distance of 2). Hence, updated p is (3,2,4,2,4).
As the next values in p are 2 and 4, we have to repeat what we did in the last two steps. In summary, while calculating the minimum distance between x and y, for each id of x we have to get the id of y which hasn't appeared previously. Thus all the elements of p have to be unique.
Any answers would be appreciated.
I tried something like this, though not a complete solution:
minID <- function(x,y) {return(which(abs(x-y)==min(abs(x-y))))};
p1 <- sapply(x,minID,y=y);
#Calculates the list of all minimum elements -no where close to actual solution :(
x and y each have over 1 million elements, so a for loop would be extremely slow. I am looking for a faster solution.
This can be implemented efficiently with a binary search tree on the elements of y, deleting elements as they're matched and added to p. I've implemented this using set from the stl in C++, using Rcpp to get the code into R:
library(Rcpp)
getVals = cppFunction(
'NumericVector getVals(NumericVector x, NumericVector y) {
NumericVector p(x.size());
std::vector<std::pair<double, int> > init;
for (int j=0; j < y.size(); ++j) {
init.push_back(std::pair<double, int>(y[j], j));
}
std::set<std::pair<double, int> > s(init.begin(), init.end());
for (int i=0; i < x.size(); ++i) {
std::set<std::pair<double, int> >::iterator p1, p2, selected;
p1 = s.lower_bound(std::pair<double, int>(x[i], 0));
if (p1 == s.begin()) {
// nothing smaller remains (assumes the set is not empty, i.e. y is at least as long as x)
selected = p1;
} else {
p2 = p1;
--p2; // largest remaining element below x[i]
if (p1 == s.end() || fabs(x[i] - p2->first) <= fabs(x[i] - p1->first)) {
selected = p2;
} else {
selected = p1;
}
}
p[i] = selected->second+1; // 1-indexed
s.erase(selected);
}
return p;
}')
Here's a runtime comparison against the pure-R solution that was posted -- the binary search tree solution is much faster and enables solutions with vectors of length 1 million in just a few seconds:
# Pure-R posted solution
getVals2 = function(x, y) {
n <- length(x)
p <- rep(NA, n)
for(i in 1:n) {
id <- which.min(abs(y - x[i]))
y[id] <- Inf
p[i] <- id
}
return(p)
}
# Test with medium-sized vectors
set.seed(144)
x = rnorm(10000)
y = rnorm(20000)
system.time(res1 <- getVals(x, y))
# user system elapsed
# 0.008 0.000 0.008
system.time(res2 <- getVals2(x, y))
# user system elapsed
# 1.284 2.919 4.211
all.equal(res1, res2)
# [1] TRUE
# Test with large vectors
set.seed(144)
x = rnorm(1000000)
y = rnorm(2000000)
system.time(res3 <- getVals(x, y))
# user system elapsed
# 4.402 0.097 4.467
The reason for the speedup is that this approach is asymptotically faster -- if x is of size n and y is of size m, then the binary search tree approach runs in O((n+m)log(m)) time -- O(m log(m)) to construct the BST and O(n log(m)) to compute p -- while the which.min approach runs in O(nm) time.
n <- length(x)
p <- rep(NA, n)
for(i in 1:n) {
id <- which.min(abs(y - x[i]))
y[id] <- Inf
p[i] <- id
}
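For reference, here is what the greedy loop above produces on the small x and y from the question. This is a sketch that wraps the loop in a function, as getVals2 does in the benchmark above. Each matched element of y is overwritten with Inf so that it cannot be selected again.

```r
getVals2 <- function(x, y) {
  n <- length(x)
  p <- rep(NA, n)
  for (i in 1:n) {
    id <- which.min(abs(y - x[i]))  # closest remaining element of y
    y[id] <- Inf                    # mark it as used
    p[i] <- id
  }
  p
}

x <- c(1, 2, 3, 4, 5)
y <- c(12, 4, 2, 5, 7, 18, 9, 10)
p <- getVals2(x, y)
p
# [1] 3 2 4 5 7
```

The matched values are y[p] = 2, 4, 5, 7, 9, and no id of y is used twice.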
I have tried to develop code in R and have gotten around a 20x improvement over the for loop. The code is as follows:
Generalized.getMinId <- function(a,b)
{
sapply(a, FUN = function(x) which.min(abs(x-b)))
}
Generalized.getAbsDiff <- function(a,b)
{
lapply(a, FUN = function(x) abs(x-b))
}
min_id = Generalized.getMinId(tlist,clist);
dup = which(duplicated(min_id));
while(length(dup) > 0)
{
absdiff = Generalized.getAbsDiff(tlist[dup],clist);
infind = lapply(dup, function(x,y)
{l <- head(y,x-1); l[l>0]}, y = min_id);
absdiff = Map(`[<-`, absdiff, infind, Inf);
dupind = sapply(absdiff, which.min);
min_id[dup] = dupind;
dup = which(duplicated(min_id));
}
In case someone can make an improvement over this piece of code, it would be awesome.

use apply function to 2 separate lists in R

I have the following code to create a sample function and to generate simulated data
mean_detects<- function(obs,cens) {
detects <- obs[cens==0]
nondetects <- obs[cens==1]
res <- mean(detects)
return(res)
}
mu <- log(1); sigma <- log(3); n_samples = 10; n_iterations = 5; p = 0.10
dset2 <- function (mu, sigma, n_samples, n_iterations, p) {
X_after <- matrix(NA_real_, nrow = n_iterations, ncol = n_samples)
delta <- matrix(NA_real_, nrow = n_iterations, ncol = n_samples)
lod <- quantile(rlnorm(100000, mu, sigma), p = p)
pct_cens <- numeric(n_iterations)
count <- 1
while(count <= n_iterations) {
X_before <- rlnorm(n_samples, mu, sigma)
X_after[count, ] <- pmax(X_before, lod)
delta [count, ] <- X_before <= lod
pct_cens[count] <- mean(delta[count,])
if (pct_cens [count] > 0 & pct_cens [count] < 1 ) count <- count + 1 }
ave_detects <- mean_detects(X_after,delta) ## how can I use apply or other functions here?
return(ave_detects)
}
If I specify n_iterations = 1, I will have a 1x10 X_after matrix and also a 1x10 delta matrix. Then the mean_detects function works fine using this command.
ave_detects <- mean_detects(X_after,delta)
However, when I increase n_iterations to, say, 5, I will have two 5x10 matrices, X_after and delta, and the mean_detects function does not work any more. It only gives me output for 1 iteration instead of 5. My real simulation has thousands of iterations, so speed and memory must also be taken into account.
Edits: I edited my code based on your comments. The mean_detects function that I created was meant to show an example of using the X_after and delta matrices simultaneously. The real function is very long. That's why I did not post it here.
Your actual question isn't really clear. So,
"My function only takes in 1 dataframe".
Actually your function takes in two vectors
Write code that can use both X_after and delta. This doesn't really mean anything - sorry.
"speed and memory must be taken into account". This is vague. Will you run out of memory? As a suggestion, you could think about a running mean, updated one value at a time. For example,
x = runif(5)
total = 0
for(i in seq_along(x)) {
total = (i-1)*total/i + x[i]/i
cat(i, ": mean ", total, "\n")
}
1 : mean 0.4409
2 : mean 0.5139
3 : mean 0.5596
4 : mean 0.6212
5 : mean 0.6606
Aside
Your dset2 function requires the variable n (which you haven't defined).
Your dset2 function doesn't return an obvious value.
your mean_detects function can be simplified to:
mean(obs[cens==0])
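Since the question asks how to apply mean_detects to each iteration, here is one possible sketch, assuming X_after and delta are matrices with one row per iteration as in dset2. The small matrices and the names row_means and row_means_vec are made up for the illustration. The vapply() form works for any function of two vectors, while the rowSums() form is a shortcut that only applies when the function is a simple mean of the detected values.

```r
mean_detects <- function(obs, cens) mean(obs[cens == 0])

# Two iterations with four samples each (illustrative values)
X_after <- matrix(c(1, 2, 3, 4,
                    5, 6, 7, 8), nrow = 2, byrow = TRUE)
delta <- matrix(c(0, 1, 0, 0,
                  1, 0, 0, 1), nrow = 2, byrow = TRUE)

# General pattern: call the function once per row, passing matching rows
row_means <- vapply(seq_len(nrow(X_after)),
                    function(r) mean_detects(X_after[r, ], delta[r, ]),
                    numeric(1))

# Vectorized shortcut for this particular function
row_means_vec <- rowSums(X_after * (delta == 0)) / rowSums(delta == 0)

row_means
all.equal(row_means, row_means_vec)
```

Inside dset2 the result would be a vector with one value per iteration rather than a single number.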

Resources