R: cmdscale: need to use dissimilaritymatrix? - r

After some testing, it seems that we can feed either dissimilarity or similarity matrices to cmdscale (apparently the function detects when it is a similarity so it will take 1 - the matrix).
Therefore
F=mysimilarityMatrix
mds <- stats::cmdscale(F, k= 2)
is equal to
F=1-mysimilarityMatrix
mds <- stats::cmdscale(F, k= 2)
I would just want confirmation, since the function does not clarify this.
Thanks!

Related

Cosine distance matrix as a function of Euclidean distance matrix in R, and applications to binary vectors

I was reading about the Cosine distance, and looking for a method to calculate it in R.
I did not find it, but from its description in Wikipedia it seemed pretty straightforward to write it as a function of the simple Euclidean distance matrix one can obtain from dist.
If the input matrix has row vectors, like in this example, the function is:
Cosine_dist_rows <- function(m) {
0.5*(dist(m/sqrt(rowSums(m^2)),method="euclidean"))^2
}
If it has column vectors:
Cosine_dist_cols <- function(m) {
0.5*(dist(t(m)/sqrt(colSums(m^2)),method="euclidean"))^2
}
I tested it with the data from the example I linked above, and it seemed to work (it gave a near-zero difference between the similarity matrix from lsa and 1 minus the distance matrix from the above code).
Does anybody know if:
using R's own dist to compute a Euclidean distance matrix is efficient, or instead suffers from memory or speed limitations?
doing the above additional calculations on the resulting dist object is particularly costly?
this could be done better / more efficiently when the input matrix m is binary (and sparse)?
I'm asking because I might need to calculate cosine distance matrices from sets of 10^4-10^5 sparse binary vectors, and I suspect that going via the Euclidean distance when one has binary vectors is not the best idea.
Apart from using m instead of m^2 in the colSums/rowSums computation, which is the same for binary vectors, I would not know what else could be done to make this more efficient.
I know that a "binary" method exists in dist, but that is what we usually refer to as "Tanimoto" distance, which has a different formula and can't easily be linked to the cosine distance (you would need to do matrix algebra, and then the advantage of using dist would be lost, I believe). Besides, I don't know if "binary" is much faster than "euclidean".
Any idea?
Thanks!
PS
Here is an example of a matrix of 1000 sparse (row) vectors:
set.seed(123654)
dfu <- do.call(rbind, sapply(1:1000, function(i) {
n <- ceiling(26/sample(2:52,1))
data.frame("ID" = i, "F" = sample(LETTERS,size=n), stringsAsFactors = F)
}, simplify = F))
m <- xtabs(~ID + F, dfu, sparse = T)

How to select n objects from a set of N objects, maximizing the sum of pairwise distances between them

You have a set of N=400 objects, each having its own coordinates in a, say, 19-dimensional space.
You calculate the (Euclidean) distance matrix (all pairwise distances).
Now you want to select n=50 objects, such that the sum of all pairwise distances between the selected objects is maximal.
I devised a way to solve this by linear programming (code below, for a smaller example), but it seems inefficient to me, because I am using N*(N-1)/2 binary variables, corresponding to all the non-redundant elements of the distance matrix, and then a lot of constraints to ensure self-consistency of the solution vector.
I suspect there must be a simpler approach, where only N variables are used, but I can't immediately think of one.
This post briefly mentions some 'Bron–Kerbosch' algorithm, which apparently addresses the distance sum part.
But in that example the sum of distances is a specific number, so I don't see a direct application to my case.
I had a brief look at quadratic programming, but again I could not see the immediate parallel with my case, although the 'b %*% bT' matrix, where b is the (column) binary solution vector, could in theory be used to multiply the distance matrix, etc.; but I'm really not familiar with this technique.
Could anyone please advise (/point me to other posts explaining) if and how this kind of problem can be solved by linear programming using only N binary variables?
Or provide any other advice on how to tackle the problem more efficiently?
Thanks!
PS: here's the code I referred to above.
require(Matrix)
#distmat defined manually for this example as a sparseMatrix
distmat <- sparseMatrix(i=c(rep(1,4),rep(2,3),rep(3,2),rep(4,1)),j=c(2:5,3:5,4:5,5:5),x=c(0.3,0.2,0.9,0.5,0.1,0.8,0.75,0.6,0.6,0.15))
N = 5
n = 3
distmat_summary <- summary(distmat)
distmat_summary["ID"] <- 1:NROW(distmat_summary)
i.mat <- xtabs(~i+ID,distmat_summary,sparse=T)
j.mat <- xtabs(~j+ID,distmat_summary,sparse=T)
ij.mat <- rbind(i.mat,"5"=rep(0,10))+rbind("1"=rep(0,10),j.mat)
ij.mat.rowSums <- rowSums(ij.mat)
ij.diag.mat <- .sparseDiagonal(n=length(ij.mat.rowSums),-ij.mat.rowSums)
colnames(ij.diag.mat) <- dimnames(ij.mat)[[1]]
mat <- rbind(cbind(ij.mat,ij.diag.mat),cbind(ij.mat,ij.diag.mat),c(rep(0,NCOL(ij.mat)),rep(1,NROW(ij.mat)) ))
dir <- c(rep("<=",NROW(ij.mat)),rep(">=",NROW(ij.mat)),"==")
rhs <- c(rep(0,NROW(ij.mat)),1-unname(ij.mat.rowSums),n)
obj <- xtabs(x~ID,distmat_summary)
obj <- c(obj,setNames(rep(0, NROW(ij.mat)), dimnames(ij.mat)[[1]]))
if (length(find.package(package="Rsymphony",quiet=TRUE))==0) install.packages("Rsymphony")
require(Rsymphony)
LP.sol <- Rsymphony_solve_LP(obj,mat,dir,rhs,types="B",max=TRUE)
items.sol <- (names(obj)[(1+NCOL(ij.mat)):(NCOL(ij.mat)+NROW(ij.mat))])[as.logical(LP.sol$solution[(1+NCOL(ij.mat)):(NCOL(ij.mat)+NROW(ij.mat))])]
items.sol
ID.sol <- names(obj)[1:NCOL(ij.mat)][as.logical(LP.sol$solution[1:NCOL(ij.mat)])]
as.data.frame(distmat_summary[distmat_summary$ID %in% ID.sol,])
This problem is called the p-dispersion-sum problem. It can be formulated using N binary variables, but using quadratic terms. As far as I know, it is not possible to formulate it with only N binary variables in a linear program.
This paper by Pisinger gives the quadratic formulation and discusses bounds and a branch-and-bound algorithm.
Hope this helps.

Improving distance calculation

I have build my own distance (let's call it d1). Now, I have a matrix for which I need to compute the distance. Considering x as the matrix with the content for each sample, the code written to get the distance matrix is the following:
# Build the matrix
wDM <- matrix(0, nrow=nrow(x), ncol=nrow(x))
# Fill the matrix
for (i in 1:(nrow(wDM)-1)){
for (j in (i+1):nrow(wDM)){
wDM[i,j] <- wDM[j,i] <- d1(x[i,], x[j,])
}
}
I have to implement this process several times. So, I was wondering if there is a faster way to fill the distance matrix wDM rather than using two for loops.
Thank you so much,
You can use dist() from proxy package. It lets you specify user-defined distance function by setting the parameter method = #yourDistance default would be euclidean. Check the documentation here: https://cran.r-project.org/web/packages/proxy/proxy.pdf

Inaccuracies w/ prcomp? R lang PCA for eigenfaces

My question is: in the case of having a matrix we want to do PCA on, where the number of features greatly outnumbers the number of trials, why doesn't prcomp behave as expected (or am I missing something)?
Below is a summary of the issue, full code is here, compressed 7MB data source is here (is 55MB uncompressed), target image is here.
My exact situation is that I have a matrix p by n matrix X (p = features, n = trials) where the trials are photos taken of faces, and the features are the pixels in the photos (so a 32256 by 148 matrix). What I want to do is find the principal component score vectors of that matrix. Since finding the covariance matrix XX^T is too expensive, an easy solution is to find the eigenvectors (v_i) of X^TX and transform them by X (Xv_i) more info.
XTX <- t(X) %*% X # missing the 1 / n - 1 for cov matrix b/c we normalize later anyway
eigen <- eigen(XTX)
eigenvectors.XTX.col <- eigen$vectors
principal.component.scores <- apply(eigenvectors.XTX.col, 2, function(c) {
normalize.vector(X %*% matrix(c, ncol = 1))
})
The principal component scores are eigenfaces in my case, and can be used to successfully reconstruct the target face as seen here: http://cl.ly/image/260w0N0u0Z3y (refer to my full code for how)
Passing X to prcomp should do something equivalent, but has a different result than the above homegrown way:
pca <- prcomp(X)
pca$x # right size, but wrong pc scores
The result of using pca$x in reconstructing the face is not total crap, but much worse: http://cl.ly/image/2p19360u2P43
I also checked that using prcomp on t(x) yielded a different rotation matrix, so prcomp is doing something fancy, but something mysterious under the hood. I know from here that prcomp is using SVD to calculate the principal component loading vectors instead of eigen decomposition, but that should not be leading to any errors here (or so I think...).
What is the correct way of using the built in prcomp method, there must be a way, right?
Wow, the answer is not a fun one at all, and rather has to do with default parameters in the prcomp method:
To solve this issue, first, I looked at the R source of prcomp and saw that the rotation matrix should equal svd(X)$v. Checking this on the R command line proved that with my X (data here) it did not. This is because even though there is the default param scale = F to prcomp, prcomp will still run the R scale method, if only to center the matrix, which is default True as seen here. In my case, this is bad because I passed the data as already centered (subtracted mean image).
So, rerunning with prcomp(X, center = F) will yield a rotation matrix equal to svd(X)$v as expected. From this point forward, the only "mistake" prcomp makes when constructing prcomp(X, center = F)$x will be to not normalize the columns, so they are each only off by a scalar multiple from the principal.component.scores matrix I reference above in my code. Without normalizing prcomp(X, center = F)$x the results are better, but not quite great as seen here:
http://cl.ly/image/3u2y3m1h2S0o
But after normalizing via pca.x.norm <- apply(pca$x, 2, normalize.vector) the results of prcomp in reconstructing the face is identical:
http://cl.ly/image/24390O3x0A0x
tl;dr - prcomp unexpectedly centers the data even with the param scale = F, plus for the purposes of eigenfaces you will need to normalize the columns of prcomp(X, center = F)$x, then everything will work as desired!

Function and data format for doing vector-based clustering in R

I need to run clustering on the correlations of data row vectors, that is, instead of using individual variables as clustering predictor variables, I intend to use the correlations between the vector of variables between data rows.
Is there a function in R that does vector-based clustering. If not and I need to do it manually, what is the right data format to feed in a function such as cmeans or kmeans?
Say, I have m variables and n data rows, the m variables constitute one vector for each data row. so I have a n X n matrix for correlation or cosine. Can this matrix be plugged in the clustering function directly or certain processing is required?
Many thanks.
You can transform your correlation matrix into a dissimilarity matrix,
for instance 1-cor(x) (or 2-cor(x) or 1-abs(cor(x))).
# Sample data
n <- 200
k <- 10
x <- matrix( rnorm(n*k), nr=k )
x <- x * row(x) # 10 dimensions, with less information in some of them
# Clustering
library(cluster)
r <- pam(1-cor(x), diss=TRUE, k=5)
# Check the results
plot(prcomp(t(x))$x[,1:2], col=r$clustering, pch=16, cex=3)
R clustering is often a bit limited. This is a design limitation of R, since it heavily relies on low-level C code for performance. The fast kmeans implementation included with R is an example of such a low-level code, that in turn is tied to using Euclidean distance.
There are a dozen of extensions and alternatives available in the community around R. There are PAM, CLARA and CLARANS for example. They aren't exactly k-means, but closely related. There should be a "spherical k-means" somewhere, that is sensible for cosine distance. There is the whole family of hierarchical clusterings (which scale rather badly - usually O(n^3), with O(n^2) in a few exceptions - but are very easy to understand conceptually).
If you want to explore some more clustering options, have a look at ELKI, it should allow clustering (with various methods, including k-means) by correlation based distances (and it also includes such distance functions). It's not R, though, but Java. So if you are bound to using R, it won't work for you.

Resources