I would like to speed up a distance calculation. I have already put effort into parallelizing it; unfortunately, it still takes longer than an hour.
Basically, the distance between a vector i and a vector j is computed via the Manhattan distance. The distances between the possible values of the vectors are given in the matrix Vardist: Vardist[i[1], j[1]] is the distance between the two values i[1] and j[1] (the matrix is indexed by the characters in i[1] and j[1], respectively).
There is one more important addition to the distance computation. The distance between vector i and vector j is the minimum, over all permutations of vector j, of the Manhattan distance between vector i and the permuted vector j. This is what makes it computationally heavy the way it is programmed.
I have 1000 objects to compare with one another, and each object is a vector of length 5, so there are 120 permutations per vector.
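Written out explicitly (my notation, not from the original post), the distance being computed is

d(i, j) = min over all permutations π of (1, …, 5) of  Σ_{k=1}^{5} Vardist[ i[k], j[π(k)] ]

i.e. each of the 120 reorderings of j is scored against i using the value-distance matrix, and the smallest score is kept.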
distMatrix <- foreach(i = 1:samplesize,
                      .combine = cbind,
                      .options.snow = opts,
                      .packages = c("combinat")) %dopar%
{
  # initialize the column of distances for object i
  dist <- rep(0, samplesize)
  # get the values of customer i
  ValuesCi <- as.matrix(recodedData[i, ])
  # keep only the rows of the value-distance matrix needed for customer i
  mVardist <- Vardist[ValuesCi, ]
  for (j in i:samplesize) {
    # the distance between vector i and every permutation of vector j is computed;
    # the minimum of all those distances is taken as the distance between i and j
    dist[j] <- min(unlist(permn(recodedData[j, ],
                                function(x) {
                                  pdist <- 0
                                  # nvariables is the length of each vector
                                  # (k is used so the outer foreach index i is not shadowed)
                                  for (k in 1:nvariables) {
                                    pdist <- pdist + mVardist[k, as.matrix(x)[k]]
                                  }
                                  return(pdist)
                                })))
  }
  dist
}
Any tips or suggestions are greatly appreciated!
Oh yes, this code is going to take a while. The basic reason is that you use explicit indexing; even parallelizing will not help.
Okay, there are several options you can use.
(1) use base::dist; give it a matrix and it will compute distances between the rows in the matrix.
(2) use a clustering package, e.g. flexclust, which offers some other options.
(3) If you need to compute distances between the rows of one matrix and the rows of another matrix, you can vectorize the code, e.g. for Euclidean distance:
function(xmat, ymat) {
  t(apply(xmat, 1, function(x) {
    sqrt(colSums((t(ymat) - x)^2))
  }))
}
(4) use C++ via Rcpp to make use of BLAS functionality, and you may even consider parallelizing the code using RcppParallel (distance matrix example).
Once you have fast routines for medium-sized data, you may then look into distributing the computation across clusters for truly large data.
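To make option (4) concrete, here is a minimal Rcpp sketch (assuming the Rcpp package is installed; the function name manhattanDist is my own, and it computes only the plain, non-permuted Manhattan distance between rows — the permutation-minimum logic from the question would still have to be added on top):

library(Rcpp)
cppFunction('
NumericMatrix manhattanDist(NumericMatrix x) {
  int n = x.nrow(), p = x.ncol();
  NumericMatrix out(n, n);
  for (int i = 0; i < n; ++i) {
    for (int j = i + 1; j < n; ++j) {
      double d = 0.0;
      for (int k = 0; k < p; ++k) d += std::fabs(x(i, k) - x(j, k));
      out(i, j) = d;   // fill both triangles of the symmetric result
      out(j, i) = d;
    }
  }
  return out;
}')
# usage (assuming a purely numeric coding of the data):
# distMatrix <- manhattanDist(as.matrix(recodedData))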
I was reading about the Cosine distance, and looking for a method to calculate it in R.
I did not find it, but from its description in Wikipedia it seemed pretty straightforward to write it as a function of the simple Euclidean distance matrix one can obtain from dist.
If the input matrix has row vectors, like in this example, the function is:
Cosine_dist_rows <- function(m) {
  0.5 * (dist(m / sqrt(rowSums(m^2)), method = "euclidean"))^2
}
If it has column vectors:
Cosine_dist_cols <- function(m) {
  0.5 * (dist(t(m) / sqrt(colSums(m^2)), method = "euclidean"))^2
}
I tested it with the data from the example I linked above, and it seemed to work (it gave a near-zero difference between the similarity matrix from lsa and 1 minus the distance matrix from the above code).
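For what it is worth, here is a quick self-contained check of the row-vector formula against a direct cosine computation (my own sketch; the small random matrix is only for illustration):

set.seed(1)
m_test <- matrix(runif(20), nrow = 4)
# direct cosine distance: 1 minus the cosine similarity of the normalized rows
direct <- 1 - tcrossprod(m_test / sqrt(rowSums(m_test^2)))
max(abs(as.matrix(Cosine_dist_rows(m_test)) - direct))  # should be ~ 0 up to rounding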
Does anybody know if:
using R's own dist to compute a Euclidean distance matrix is efficient, or instead suffers from memory or speed limitations?
doing the above additional calculations on the resulting dist object is particularly costly?
this could be done better / more efficiently when the input matrix m is binary (and sparse)?
I'm asking because I might need to calculate cosine distance matrices from sets of 10^4-10^5 sparse binary vectors, and I suspect that going via the Euclidean distance when one has binary vectors is not the best idea.
Apart from using m instead of m^2 in the colSums/rowSums computation, which is the same for binary vectors, I would not know what else could be done to make this more efficient.
I know that a "binary" method exists in dist, but that is what we usually refer to as "Tanimoto" distance, which has a different formula and can't easily be linked to the cosine distance (you would need to do matrix algebra, and then the advantage of using dist would be lost, I believe). Besides, I don't know if "binary" is much faster than "euclidean".
Any idea?
Thanks!
PS
Here is an example of a matrix of 1000 sparse (row) vectors:
set.seed(123654)
dfu <- do.call(rbind, sapply(1:1000, function(i) {
  n <- ceiling(26 / sample(2:52, 1))
  data.frame("ID" = i, "F" = sample(LETTERS, size = n), stringsAsFactors = F)
}, simplify = F))
m <- xtabs(~ID + F, dfu, sparse = T)
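Not part of the original question, but one possible way to exploit the sparsity for binary data is to compute the cosine similarities directly from the cross-product of the row-normalized sparse matrix instead of going through dist (a sketch, using the m built above; cosine_dist_sparse is my own name):

library(Matrix)
cosine_dist_sparse <- function(m) {
  # for a binary matrix, rowSums(m^2) equals rowSums(m)
  mn  <- Diagonal(x = 1 / sqrt(rowSums(m))) %*% m   # row-normalize, stays sparse
  sim <- tcrossprod(mn)                             # cosine similarity matrix
  as.dist(1 - as.matrix(sim))                       # cosine distance
}
# should essentially match the Euclidean-based version above:
range(as.matrix(cosine_dist_sparse(m)) - as.matrix(Cosine_dist_rows(as.matrix(m))))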
You have a set of N=400 objects, each having its own coordinates in a, say, 19-dimensional space.
You calculate the (Euclidean) distance matrix (all pairwise distances).
Now you want to select n=50 objects, such that the sum of all pairwise distances between the selected objects is maximal.
I devised a way to solve this by linear programming (code below, for a smaller example), but it seems inefficient to me, because I am using N*(N-1)/2 binary variables, corresponding to all the non-redundant elements of the distance matrix, and then a lot of constraints to ensure self-consistency of the solution vector.
I suspect there must be a simpler approach, where only N variables are used, but I can't immediately think of one.
This post briefly mentions some 'Bron–Kerbosch' algorithm, which apparently addresses the distance sum part.
But in that example the sum of distances is a specific number, so I don't see a direct application to my case.
I had a brief look at quadratic programming, but again I could not see the immediate parallel with my case, although the b %*% t(b) matrix, where b is the (column) binary solution vector, could in theory be used to multiply the distance matrix; I'm really not familiar with this technique, though.
Could anyone please advise (/point me to other posts explaining) if and how this kind of problem can be solved by linear programming using only N binary variables?
Or provide any other advice on how to tackle the problem more efficiently?
Thanks!
PS: here's the code I referred to above.
require(Matrix)
# distmat defined manually for this example as a sparseMatrix
distmat <- sparseMatrix(i = c(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1)),
                        j = c(2:5, 3:5, 4:5, 5:5),
                        x = c(0.3, 0.2, 0.9, 0.5, 0.1, 0.8, 0.75, 0.6, 0.6, 0.15))
N <- 5
n <- 3
distmat_summary <- summary(distmat)
distmat_summary["ID"] <- 1:NROW(distmat_summary)
i.mat <- xtabs(~ i + ID, distmat_summary, sparse = T)
j.mat <- xtabs(~ j + ID, distmat_summary, sparse = T)
ij.mat <- rbind(i.mat, "5" = rep(0, 10)) + rbind("1" = rep(0, 10), j.mat)
ij.mat.rowSums <- rowSums(ij.mat)
ij.diag.mat <- .sparseDiagonal(n = length(ij.mat.rowSums), x = -ij.mat.rowSums)
colnames(ij.diag.mat) <- dimnames(ij.mat)[[1]]
mat <- rbind(cbind(ij.mat, ij.diag.mat),
             cbind(ij.mat, ij.diag.mat),
             c(rep(0, NCOL(ij.mat)), rep(1, NROW(ij.mat))))
dir <- c(rep("<=", NROW(ij.mat)), rep(">=", NROW(ij.mat)), "==")
rhs <- c(rep(0, NROW(ij.mat)), 1 - unname(ij.mat.rowSums), n)
obj <- xtabs(x ~ ID, distmat_summary)
obj <- c(obj, setNames(rep(0, NROW(ij.mat)), dimnames(ij.mat)[[1]]))
if (length(find.package(package = "Rsymphony", quiet = TRUE)) == 0) install.packages("Rsymphony")
require(Rsymphony)
LP.sol <- Rsymphony_solve_LP(obj, mat, dir, rhs, types = "B", max = TRUE)
idx <- (1 + NCOL(ij.mat)):(NCOL(ij.mat) + NROW(ij.mat))
items.sol <- names(obj)[idx][as.logical(LP.sol$solution[idx])]
items.sol
ID.sol <- names(obj)[1:NCOL(ij.mat)][as.logical(LP.sol$solution[1:NCOL(ij.mat)])]
as.data.frame(distmat_summary[distmat_summary$ID %in% ID.sol, ])
This problem is called the p-dispersion-sum problem. It can be formulated using N binary variables, but using quadratic terms. As far as I know, it is not possible to formulate it with only N binary variables in a linear program.
This paper by Pisinger gives the quadratic formulation and discusses bounds and a branch-and-bound algorithm.
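For reference, the quadratic formulation referred to above has the simple form (my notation): maximize Σ_{i<j} d_ij · x_i · x_j subject to Σ_i x_i = n and x_i ∈ {0, 1} for i = 1, …, N. That is, one binary variable per object, with the pairwise distances appearing only in the quadratic objective.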
Hope this helps.
I have built my own distance function (let's call it d1). Now I have a matrix for which I need to compute the distance matrix. With x as the matrix holding the data for each sample (one sample per row), the code written to get the distance matrix is the following:
# Build the matrix
wDM <- matrix(0, nrow = nrow(x), ncol = nrow(x))
# Fill the matrix
for (i in 1:(nrow(wDM) - 1)) {
  for (j in (i + 1):nrow(wDM)) {
    wDM[i, j] <- wDM[j, i] <- d1(x[i, ], x[j, ])
  }
}
I have to repeat this process several times, so I was wondering whether there is a faster way to fill the distance matrix wDM than using two for loops.
Thank you so much,
You can use dist() from the proxy package. It lets you specify a user-defined distance function via the method parameter (method = yourDistance; the default is Euclidean). Check the documentation here: https://cran.r-project.org/web/packages/proxy/proxy.pdf
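A minimal sketch of that usage (the toy d1 and the random x below are placeholders, not taken from the question):

library(proxy)
d1 <- function(u, v) sum(abs(u - v))   # toy user-defined distance
x  <- matrix(rnorm(20), nrow = 5)
wDM <- as.matrix(proxy::dist(x, method = d1))  # same result as the double for loop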
Rao's QE is based on a weighted Euclidean distance matrix: Q = Σ_{i>j} d_ij p_i p_j, where the double sum runs over the lower-left (or, since the matrix is symmetric, upper-right) elements of the distance matrix. I have the vectors that make up the d_ijs in a data table dt, one column per element (say there are x of them); p is the final column, and nrow = S.
If I only needed an unweighted distance matrix I could simply call dist() on the x columns. How do I weight the d_ijs by the product of p_i and p_j?
An example data set is at https://github.com/GeraldCNelson/nutmod/blob/master/RaoD_example.csv, with the ps in the column called foodQ.ratio.
You still start with dist for the raw Euclidean distance matrix. Let it be D. As you will read from R - How to get row & column subscripts of matched elements from a distance matrix, a "dist" object is not a real matrix, but a 1D array. So first do D <- as.matrix(D) or D <- dist2mat(D) to convert it to a complete matrix before the following.
Now, letting p be the vector of weights, Rao's QE is just the quadratic form p'Dp / 2:
c(crossprod(p, D %*% p)) / 2
Note, I am not doing everything in the most efficient way. I perform a symmetric matrix-vector multiplication D %*% p using the full D rather than just its lower triangular part, because R does not have a routine for triangular matrix-vector multiplication. So I compute the full version and then divide by 2.
This doubles the amount of computation that is strictly necessary; making D a full matrix also doubles the memory cost. But if your problem is of small to medium size, this is absolutely fine. For a large problem, if you are an R and C wizard, call the BLAS routine dtrmv, or even dtpmv, for the triangular matrix-vector computation.
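Putting it together with the setup described in the question (a sketch; traitCols is my placeholder for the x trait columns, and foodQ.ratio is the weight column named in the question):

traitCols <- setdiff(names(dt), "foodQ.ratio")        # the x trait columns (assumed)
D <- as.matrix(dist(dt[, traitCols, with = FALSE]))   # raw Euclidean distance matrix
p <- dt$foodQ.ratio                                   # weights
raoQE <- c(crossprod(p, D %*% p)) / 2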
Update
I just found this simple paper, Rao's quadratic entropy as a measure of functional diversity based on multiple traits, for the definition and use of Rao's QE. It mentions that we can replace the Euclidean distance with the Mahalanobis distance. In case we want to do this, use my code in Mahalanobis distance of each pair of observations for fast computation of the Mahalanobis distance matrix.
According to the results I am getting (I do not see this in the API), hclust works by using each row of a given matrix as a vector. Is there any way to make it work with columns instead?
Besides, does dist work the same way, or does dist work with columns?
You can always apply hclust to the transposed matrix:
# if you have an observations matrix
m <- matrix(1:100, nrow = 20)
hc <- hclust(dist(t(m)))
Besides, does dist work the same way, or does dist work with columns?
The general convention is variables in columns and observations in rows, and that's how dist works:
dist package:stats R Documentation
Distance Matrix Computation
Description:
This function computes and returns the distance matrix computed by
using the specified distance measure to compute the distances
between the rows of a data matrix.
Update
hclust works by using each row of a given matrix as a vector.
Actually, the internal implementation of hclust shouldn't matter. You pass it a dissimilarity structure produced by dist, and I am almost sure that all metrics implemented in dist produce a proper symmetric distance matrix.
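As a quick sanity check of that last point (a trivial sketch using the m from the example above):

d <- dist(t(m))
attr(d, "Size")            # 5: one observation per column of m
isSymmetric(as.matrix(d))  # TRUE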