easy thing about clustering in R - r

According to the results I am getting (I do not see this in the documentation), hclust seems to treat each row of a given matrix as a vector. Is there any way to make it work with columns instead?
Besides, does dist work the same or does dist work with columns?

You can always apply hclust to the transposed matrix:
# m is a 20 x 5 matrix; t(m) turns its columns into rows
m <- matrix(1:100, nrow=20)
hc <- hclust(dist(t(m)))  # clusters the 5 columns of m
Besides, does dist work the same or does dist work with columns?
The general convention is variables in columns and observations in rows, and that is how dist works:
dist package:stats R Documentation
Distance Matrix Computation
Description:
This function computes and returns the distance matrix computed by
using the specified distance measure to compute the distances
between the rows of a data matrix.
Update
hclust works by using each row of a given matrix as a vector.
Actually, the internal implementation of hclust shouldn't matter. You pass it the dissimilarity structure produced by dist as an argument, and I am almost sure that all metrics implemented in dist produce a proper symmetric distance matrix.
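As a minimal sketch of the points above (the row and column names are made up for illustration), the following checks that dist compares rows, and that transposing first makes hclust cluster the columns:

```r
# Toy matrix: 20 rows (observations) and 5 columns (variables).
# The dimnames are hypothetical, added only to make the output readable.
m <- matrix(1:100, nrow = 20,
            dimnames = list(paste0("row", 1:20), paste0("col", 1:5)))

d_rows <- dist(m)      # distances between the 20 rows
d_cols <- dist(t(m))   # distances between the 5 columns

attr(d_rows, "Size")   # 20
attr(d_cols, "Size")   # 5

# hclust only sees the dist object, so it clusters whatever dist compared
hc_cols <- hclust(d_cols)
hc_cols$labels         # "col1" "col2" "col3" "col4" "col5"
```

The labels on the resulting tree are the quickest way to confirm which dimension was clustered.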

Related

Cosine distance matrix as a function of Euclidean distance matrix in R, and applications to binary vectors

I was reading about the Cosine distance, and looking for a method to calculate it in R.
I did not find it, but from its description in Wikipedia it seemed pretty straightforward to write it as a function of the simple Euclidean distance matrix one can obtain from dist.
If the input matrix has row vectors, like in this example, the function is:
Cosine_dist_rows <- function(m) {
  0.5*(dist(m/sqrt(rowSums(m^2)),method="euclidean"))^2
}
If it has column vectors:
Cosine_dist_cols <- function(m) {
  0.5*(dist(t(m)/sqrt(colSums(m^2)),method="euclidean"))^2
}
I tested it with the data from the example I linked above, and it seemed to work (it gave a near-zero difference between the similarity matrix from lsa and 1 minus the distance matrix from the above code).
Does anybody know if:
using R's own dist to compute a Euclidean distance matrix is efficient, or instead suffers from memory or speed limitations?
doing the above additional calculations on the resulting dist object is particularly costly?
this could be done better / more efficiently when the input matrix m is binary (and sparse)?
I'm asking because I might need to calculate cosine distance matrices from sets of 10^4-10^5 sparse binary vectors, and I suspect that going via the Euclidean distance when one has binary vectors is not the best idea.
Apart from using m instead of m^2 in the colSums/rowSums computation, which is the same for binary vectors, I would not know what else could be done to make this more efficient.
I know that a "binary" method exists in dist, but that is what we usually refer to as "Tanimoto" distance, which has a different formula and can't easily be linked to the cosine distance (you would need to do matrix algebra, and then the advantage of using dist would be lost, I believe). Besides, I don't know if "binary" is much faster than "euclidean".
Any idea?
Thanks!
PS
Here is an example of a matrix of 1000 sparse (row) vectors:
set.seed(123654)
dfu <- do.call(rbind, sapply(1:1000, function(i) {
  n <- ceiling(26/sample(2:52,1))
  data.frame("ID" = i, "F" = sample(LETTERS,size=n), stringsAsFactors = FALSE)
}, simplify = FALSE))
m <- xtabs(~ID + F, dfu, sparse = TRUE)  # sparse = TRUE needs the Matrix package
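One possible way to avoid the detour through Euclidean distance is to compute the cosine similarity directly from a cross-product. The sketch below is only an illustration under stated assumptions: cosine_dist_crossprod is a made-up helper name, the toy data are random, and the matrix is converted to a dense base R matrix, so it does not address memory use for very large inputs. Whatever the method, a full pairwise structure over 10^5 vectors has roughly 5 * 10^9 entries.

```r
# Cosine distance from a cross-product: 1 - (a . b) / (|a| * |b|).
# cosine_dist_crossprod is a hypothetical helper name, not from any package.
cosine_dist_crossprod <- function(m) {
  m <- as.matrix(m)
  norms <- sqrt(rowSums(m^2))            # for 0/1 data this equals sqrt(rowSums(m))
  sim <- tcrossprod(m) / outer(norms, norms)
  as.dist(1 - sim)
}

# The dist-based version from the question, for comparison
Cosine_dist_rows <- function(m) {
  0.5 * (dist(m / sqrt(rowSums(m^2)), method = "euclidean"))^2
}

set.seed(1)
m_toy <- matrix(rbinom(200, 1, 0.3), nrow = 20)  # made-up binary data
m_toy[rowSums(m_toy) == 0, 1] <- 1               # avoid all-zero rows (norm 0)

# Largest disagreement between the two versions; expected to be near zero
max(abs(as.vector(cosine_dist_crossprod(m_toy)) -
        as.vector(Cosine_dist_rows(m_toy))))
```

Both versions rely on the identity that, for unit-length vectors, half the squared Euclidean distance equals one minus the cosine similarity.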

Improving distance calculation

I have built my own distance function (let's call it d1). Now, I have a matrix for which I need to compute the distances. Considering x as the matrix with the content for each sample, the code written to get the distance matrix is the following:
# Build the matrix
wDM <- matrix(0, nrow=nrow(x), ncol=nrow(x))
# Fill the matrix
for (i in 1:(nrow(wDM)-1)){
  for (j in (i+1):nrow(wDM)){
    wDM[i,j] <- wDM[j,i] <- d1(x[i,], x[j,])
  }
}
I have to implement this process several times. So, I was wondering if there is a faster way to fill the distance matrix wDM rather than using two for loops.
Thank you so much,
You can use dist() from the proxy package. It lets you specify a user-defined distance function through the method parameter (method = yourDistance); the default is Euclidean. Check the documentation here: https://cran.r-project.org/web/packages/proxy/proxy.pdf
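A minimal sketch of that suggestion, assuming the proxy package is installed. The d1 below is a made-up stand-in (Manhattan distance) and the data are random; the double loop from the question is kept as a reference to compare against. Whether proxy is actually faster will depend on how expensive d1 itself is.

```r
# d1 here is a made-up stand-in (Manhattan distance); replace it with your own.
d1 <- function(a, b) sum(abs(a - b))

set.seed(42)
x <- matrix(rnorm(40), nrow = 8)   # 8 samples, 5 variables (toy data)

# Reference: the double loop from the question
wDM_loop <- matrix(0, nrow = nrow(x), ncol = nrow(x))
for (i in seq_len(nrow(x) - 1)) {
  for (j in (i + 1):nrow(x)) {
    wDM_loop[i, j] <- wDM_loop[j, i] <- d1(x[i, ], x[j, ])
  }
}

# Same matrix through proxy::dist, if the proxy package is installed
if (requireNamespace("proxy", quietly = TRUE)) {
  wDM <- as.matrix(proxy::dist(x, method = d1))
  print(all.equal(unname(wDM), wDM_loop))
}
```

If the result is going straight into hclust or a similar function, the as.matrix step can be dropped and the dist object passed on directly.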

convert a list -class numeric- into a distance structure in R

I have a list that looks like this, it is a measure of dispersion for each sample.
1 2 3 4 5
0.11829384 0.24987017 0.08082147 0.13355495 0.12933790
To analyze this further I need it to be a distance structure; the vegan package needs it as a 'dist' object.
I found some solutions that convert matrices to dist objects, but how could I turn this data into a dist object?
I am using the FD package; in the manual I found:
Still, one potential advantage of FDis over Rao’s Q is that in the unweighted case
(i.e. with presence-absence data), it opens possibilities for formal statistical tests for differences in
FD between two or more communities through a distance-based test for homogeneity of multivariate
dispersions (Anderson 2006); see betadisper for more details
I wanted to use vegan betadisper function to test if there are differences among different regions (I provided this using element "region" with column "region" too)
functional <- FD(trait, comun)
mod <- betadisper(functional$FDis, region$region)
Using gowdis or fdisp from FD didn't work either.
distancias <- gowdis(rasgo)
mod <- betadisper(distancias, region$region)
dispersion <- fdisp(distancias, presence)
mod <- betadisper(dispersion, region$region)
I tried this but I need a list object. I thought I could pass those results to betadisper.
You cannot do this: FD::fdisp() does not return dissimilarities. It returns a list of three elements: the dispersions FDis for each sampling unit (SU), and the results of the eigen decomposition of the input dissimilarities (eig for eigenvalues, vectors for orthonormal eigenvectors). The FDis values are summarized for each original SU, but they carry no information on the differences among SUs. The eigen decomposition could be used to reconstruct the original input dissimilarities (your distancias from FD::gowdis()), but you can simply use the input dissimilarities directly. FD::gowdis() returns a regular "dist" structure that you can pass to vegan::betadisper(), if that gives you a meaningful analysis. For this, your grouping variable must be based on the same units as your distancias. In a typical application of fdisp, the units are species (taxa), but it seems you want an analysis for communities/sites/whatever. That will not be possible with these tools.
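To illustrate why one value per sampling unit cannot serve as a "dist" object, here is a small base R sketch. The traits matrix is hypothetical random data, used only to show what a real dist object contains; it is not the FD workflow itself.

```r
# The five FDis values from the question: one number per sampling unit
fdis <- c(0.11829384, 0.24987017, 0.08082147, 0.13355495, 0.12933790)
n <- length(fdis)

# A "dist" object over n units stores one value per pair of units
n_pairs <- n * (n - 1) / 2
n_pairs                      # 10

# Hypothetical trait table, only to show what a real dist object holds
set.seed(1)
traits <- matrix(runif(n * 3), nrow = n)
d <- dist(traits)

inherits(d, "dist")          # TRUE
length(d) == n_pairs         # TRUE
```

Five dispersion values therefore cannot fill the ten pairwise slots that a dist object over five units requires.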

Using different metric for hclust linkage?

In R you can use all sorts of metrics to build a distance matrix prior to clustering, e.g. binary distance, Manhattan distance, etc...
However, when it comes to choosing a linkage method (complete, average, single, etc.), these linkage methods all seem to use Euclidean distance. This does not seem particularly appropriate if you rely on a different metric to build the distance matrix.
Is there a way (or a library...) to apply other distances to linkage methods when building a clustering tree?
Thanks!
I don't really get your question. For example, suppose I have the following data:
x <- matrix(rnorm(100), nrow=5)
then I can build a distance matrix using dist
##Changing the distance measure
d_e = dist(x, method="euclidean")
d_m = dist(x, method="maximum")
I can then cluster however I want:
##Changing the clustering method
hclust(d_m, method="median")
If you have constructed a matrix that already represents the pairwise distances, use e.g.
hclust(as.dist(mx), method="single")
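Putting these pieces together, a short sketch on toy data: hclust only receives the dist object, so any metric can be paired with any linkage method, and a full square matrix of pairwise distances gives the same tree once it is wrapped in as.dist. The variable names are illustrative.

```r
set.seed(1)
x <- matrix(rnorm(100), nrow = 5)

# Manhattan metric combined with average linkage
d_man <- dist(x, method = "manhattan")
hc_direct <- hclust(d_man, method = "average")

# Same thing starting from a full square matrix of pairwise distances
mx <- as.matrix(d_man)
hc_from_mx <- hclust(as.dist(mx), method = "average")

identical(hc_direct$merge, hc_from_mx$merge)   # TRUE
cutree(hc_direct, k = 2)                       # cluster membership of the 5 rows
```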
You might want to try using agnes, rather than hclust, and hand it a distance matrix. There's a nice tutorial on this here:
http://strata.uga.edu/software/pdf/clusterTutorial.pdf
From the tutorial, here's how you would generate and use a distance matrix for clustering:
library(vegan)    # load library for distance functions
library(cluster)  # provides agnes
mydata.bray <- vegdist(mydata, method="bray")  # calculates Bray (= Sørensen) distances among samples
mydata.bray.agnes <- agnes(mydata.bray)        # run the cluster analysis
I myself use Prof. Daniel Müllner's fastcluster library, which provides a drop-in replacement for hclust and is much faster for large data sets.
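A hedged sketch of handing a precomputed dist object to agnes and to fastcluster's hclust, using toy data and base dist instead of vegdist so that it runs without your own data. Each call is guarded, because the cluster and fastcluster packages may not be installed.

```r
set.seed(1)
x <- matrix(rnorm(100), nrow = 5)
d_man <- dist(x, method = "manhattan")   # any metric; the linkage step just uses it

# agnes accepts a dist object directly
if (requireNamespace("cluster", quietly = TRUE)) {
  ag <- cluster::agnes(d_man, method = "average")
  hc_ag <- as.hclust(ag)                 # convert for use with cutree, plot, etc.
  print(cutree(hc_ag, k = 2))
}

# fastcluster exposes an hclust function that also takes a dist object
if (requireNamespace("fastcluster", quietly = TRUE)) {
  hc_fast <- fastcluster::hclust(d_man, method = "average")
  print(cutree(hc_fast, k = 2))
}
```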

R: how to compute the distance between the columns of a matrix?

The R function dist "computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix".
However, I want the distance measure to be computed between the columns of a data matrix, not the rows! How can I do that?
Do I need to rotate the matrix? If so, how? If not, should I use a different function?
Maybe you can use the R function t?
t(x) will transpose matrix x.
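For example, with a tiny made-up matrix whose columns are the points (0,0), (3,4) and (6,8), transposing before calling dist gives the distances between columns:

```r
# Three columns, each a 2-dimensional point (toy data)
x <- cbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))

d_cols <- dist(t(x))             # rows of t(x) are the columns of x

as.matrix(d_cols)["a", "b"]      # 5
as.matrix(d_cols)["a", "c"]      # 10
as.matrix(d_cols)["b", "c"]      # 5
```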
