Improving distance calculation in R

I have built my own distance function (let's call it d1). Now I have a matrix for which I need to compute the distance matrix. With x as the matrix holding the data for each sample (one sample per row), the code I wrote to get the distance matrix is the following:
# Build the matrix
wDM <- matrix(0, nrow = nrow(x), ncol = nrow(x))
# Fill the matrix
for (i in 1:(nrow(wDM) - 1)) {
  for (j in (i + 1):nrow(wDM)) {
    wDM[i, j] <- wDM[j, i] <- d1(x[i, ], x[j, ])
  }
}
I have to run this process several times, so I was wondering whether there is a faster way to fill the distance matrix wDM than using two for loops.
Thank you so much,

You can use dist() from the proxy package. It lets you supply a user-defined distance function via the method argument (the default is Euclidean). Check the documentation here: https://cran.r-project.org/web/packages/proxy/proxy.pdf
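For illustration, a minimal sketch of what that could look like (assuming d1 takes two numeric vectors and returns a single number; the d1 body below is a placeholder, not your actual distance):
library(proxy)
# placeholder custom distance, for illustration only
d1 <- function(a, b) sum(abs(a - b))
x <- matrix(rnorm(20), nrow = 5)
# proxy::dist applies the function to pairs of rows and returns a dist-like object
wDM <- proxy::dist(x, method = d1)
as.matrix(wDM)   # full symmetric matrix, if that is what you need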

Related

R: cmdscale: need to use dissimilarity matrix?

After some testing, it seems that we can feed either dissimilarity or similarity matrices to cmdscale (apparently the function detects when it is a similarity matrix and takes 1 minus the matrix).
Therefore
F <- mysimilarityMatrix
mds <- stats::cmdscale(F, k = 2)
appears to give the same result as
F <- 1 - mysimilarityMatrix
mds <- stats::cmdscale(F, k = 2)
I would just like confirmation, since the documentation does not clarify this.
Thanks!
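One way to check this numerically (a sketch; the similarity matrix below is made up for illustration, symmetric with unit diagonal):
set.seed(1)
p <- matrix(runif(25), nrow = 5)
mysimilarityMatrix <- (p + t(p)) / 2
diag(mysimilarityMatrix) <- 1
mds_sim  <- stats::cmdscale(mysimilarityMatrix, k = 2)
mds_diss <- stats::cmdscale(1 - mysimilarityMatrix, k = 2)
# if cmdscale really converts similarities internally, these should agree
# (possibly up to a sign flip / reflection of the axes)
all.equal(abs(mds_sim), abs(mds_diss))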

Cosine distance matrix as a function of Euclidean distance matrix in R, and applications to binary vectors

I was reading about the Cosine distance, and looking for a method to calculate it in R.
I did not find it, but from its description in Wikipedia it seemed pretty straightforward to write it as a function of the simple Euclidean distance matrix one can obtain from dist.
If the input matrix has row vectors, like in this example, the function is:
Cosine_dist_rows <- function(m) {
  0.5 * (dist(m / sqrt(rowSums(m^2)), method = "euclidean"))^2
}
If it has column vectors:
Cosine_dist_cols <- function(m) {
  0.5 * (dist(t(m) / sqrt(colSums(m^2)), method = "euclidean"))^2
}
I tested it with the data from the example I linked above, and it seemed to work (it gave a near-zero difference between the similarity matrix from lsa and 1 minus the distance matrix from the above code).
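For reference, this is the kind of check I mean (a sketch using the Cosine_dist_rows function above; lsa::cosine works on columns, so I pass the transpose):
library(lsa)
set.seed(1)
m <- matrix(runif(60), nrow = 6)              # 6 row vectors of length 10
sim_lsa  <- cosine(t(m))                      # cosine similarities between the rows of m
dist_new <- as.matrix(Cosine_dist_rows(m))    # distances from the function above
max(abs((1 - sim_lsa) - dist_new))            # near zero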
Does anybody know if:
- using R's own dist to compute a Euclidean distance matrix is efficient, or instead suffers from memory or speed limitations?
- doing the above additional calculations on the resulting dist object is particularly costly?
- this could be done better / more efficiently when the input matrix m is binary (and sparse)?
I'm asking because I might need to calculate cosine distance matrices from sets of 10^4-10^5 sparse binary vectors, and I suspect that going via the Euclidean distance when one has binary vectors is not the best idea.
Apart from using m instead of m^2 in the colSums/rowSums computation, which is the same for binary vectors, I would not know what else could be done to make this more efficient.
I know that a "binary" method exists in dist, but that is what we usually refer to as "Tanimoto" distance, which has a different formula and can't easily be linked to the cosine distance (you would need to do matrix algebra, and then the advantage of using dist would be lost, I believe). Besides, I don't know if "binary" is much faster than "euclidean".
Any idea?
Thanks!
PS
Here is an example of a matrix of 1000 sparse (row) vectors:
set.seed(123654)
dfu <- do.call(rbind, sapply(1:1000, function(i) {
  n <- ceiling(26 / sample(2:52, 1))
  data.frame("ID" = i, "F" = sample(LETTERS, size = n), stringsAsFactors = F)
}, simplify = F))
m <- xtabs(~ID + F, dfu, sparse = T)
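For comparison, a sketch of what a direct sparse computation could look like for binary row vectors (assuming the Matrix package; I have not benchmarked this against the dist-based version above):
library(Matrix)
cosine_dist_sparse <- function(m) {
  dots  <- tcrossprod(m)        # pairwise dot products between rows (stays sparse)
  norms <- sqrt(rowSums(m))     # for binary data, rowSums(m) equals rowSums(m^2)
  sim   <- as.matrix(dots) / outer(norms, norms)
  as.dist(1 - sim)
}
D <- cosine_dist_sparse(m)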

Speed Up Distance Calculations

I would like to speed up a distance calculation. I have already put effort into parallelizing it. Unfortunately it still takes longer than an hour.
Basically, the distance between vectors i and j is computed via the Manhattan distance. The distances between the possible values of the vectors are given in the matrix Vardist: Vardist[i[1], j[1]] is the distance between the two values i[1] and j[1] (the matrix is indexed by the characters in i[1] and j[1], respectively).
There is one more important addition to the distance computation. The distance between vectors i and j is the minimum over all Manhattan distances between vector i and any possible permutation of vector j. This makes it computationally heavy the way it is programmed.
I have 1000 objects to compare with one another. Furthermore, each object is a vector of length 5, so there are 120 permutations for each vector.
distMatrix <- foreach(i = 1:samplesize,
                      .combine = cbind,
                      .options.snow = opts,
                      .packages = c("combinat")) %dopar% {
  # initialize the distance vector for object i
  dist <- rep(0, samplesize)
  # get the values of customer i
  ValuesCi <- as.matrix(recodedData[i, ])
  # remove unnecessary entries from the value distance matrix
  mVardist <- Vardist[ValuesCi, ]
  for (j in i:samplesize) {
    # the distance between vector i and every permutation of vector j is computed;
    # the minimum over all these distances is taken as the distance between i and j
    dist[j] <- min(unlist(permn(recodedData[j, ], function(x) {
      pdist <- 0
      # nvariables is the length of each vector
      for (k in 1:nvariables) {
        pdist <- pdist + mVardist[k, as.matrix(x)[k]]
      }
      return(pdist)
    })))
  }
  dist
}
Any tips or suggestions are greatly appreciated!
Oh yes, this code is going to take a while. The basic reason is that you use explicit indexing. Even parallelizing will not help.
Okay, there are several options you can use.
(1) use stats::dist (R's built-in dist); give it a matrix and it will compute the distances between the rows of the matrix.
(2) use some clustering packages, e.g. flexclust, that offer other options.
(3) If you need to compute distances between the rows of one matrix and the rows of some other matrix, you can vectorize the code, e.g. for Euclidean distance:
function(xmat, ymat) {
  t(apply(xmat, 1, function(x) {
    sqrt(colSums((t(ymat) - x)^2))
  }))
}
(4) use C++ and Rcpp to make use of the BLAS functionality and you may even consider parallelizing code using RcppParallel (distance matrix example)
Once you have fast routines for medium-sized data, you may move on to distributing the computation across a cluster for large data.
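For example, a quick usage sketch of the vectorized function from (3) (the name crossdist_euclidean is mine, purely for illustration):
crossdist_euclidean <- function(xmat, ymat) {
  t(apply(xmat, 1, function(x) {
    sqrt(colSums((t(ymat) - x)^2))
  }))
}
xmat <- matrix(rnorm(20), nrow = 4)
ymat <- matrix(rnorm(30), nrow = 6)
d <- crossdist_euclidean(xmat, ymat)
dim(d)   # 4 x 6: distances between each row of xmat and each row of ymat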

easy thing about clustering in R

According to the results I am getting (I do not see this documented in the API), hclust works by using each row of a given matrix as a vector. Is there any way to make it work with columns instead?
Besides, does dist work the same way, or does it work with columns?
You can always apply hclust to the transposed matrix:
# If m is your observations matrix (observations in rows, variables in columns)
m <- matrix(1:100, nrow = 20)
# transposing makes the columns the objects being clustered
hc <- hclust(dist(t(m)))
Besides, does dist work the same or does dist work with columns?
The general convention is variables in columns and observations in rows, and that is how dist works:
dist package:stats R Documentation
Distance Matrix Computation
Description:
This function computes and returns the distance matrix computed by
using the specified distance measure to compute the distances
between the rows of a data matrix.
Update
hclust works by using each row of a given matrix as a vector.
Actually, the internal implementation of hclust shouldn't matter. You pass it as an argument the dissimilarity structure produced by dist, and I am almost sure that all metrics implemented in dist produce a proper symmetric distance matrix.
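A quick way to convince yourself (sketch):
m <- matrix(rnorm(50), nrow = 10)
dm <- as.matrix(dist(m, method = "manhattan"))
isSymmetric(dm)       # TRUE
all(diag(dm) == 0)    # TRUE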

Using different metric for hclust linkage?

In R you can use all sorts of metrics to build a distance matrix prior to clustering, e.g. binary distance, Manhattan distance, etc...
However, when it comes to choosing a linkage method (complete, average, single, etc.), these linkage methods all seem to use Euclidean distance. This does not seem particularly appropriate if you rely on a different metric to build the distance matrix.
Is there a way (or a library...) to apply other distances to linkage methods when building a clustering tree?
Thanks!
I don't really get your question. For example, suppose I have the following data:
x <- matrix(rnorm(100), nrow=5)
then I can build a distance matrix using dist
##Changing the distance measure
d_e = dist(x, method="euclidean")
d_m = dist(x, method="maximum")
I can then cluster however I want:
##Changing the clustering method
hclust(d_m, method="median")
If you have constructed a matrix that already represents the pairwise distances, use e.g.
hclust(as.dist(mx), method="single")
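Putting the two pieces together, a minimal self-contained sketch (Manhattan distance with average linkage, using only stats functions):
set.seed(42)
x <- matrix(rnorm(100), nrow = 10)
# build the distance matrix with a non-Euclidean metric
d_man <- dist(x, method = "manhattan")
# the linkage method operates on these precomputed dissimilarities,
# not on the raw coordinates
hc <- hclust(d_man, method = "average")
plot(hc)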
You might want to try using agnes, rather than hclust, and hand it a distance matrix. There's a nice tutorial on this here:
http://strata.uga.edu/software/pdf/clusterTutorial.pdf
From the tutorial, here's how you would generate and use a distance matrix for clustering:
# load library for distance functions
library(vegan)
# calculate Bray (= Sørenson) distances among samples
mydata.bray <- vegdist(mydata, method = "bray")
# run the cluster analysis
mydata.bray.agnes <- agnes(mydata.bray)
I myself use Prof. Daniel Müllner's fastcluster library, which has exactly the same API as agnes but is orders of magnitude faster for large data sets.

Resources