Dendrogram and heatmap on similarity matrix - r

I have already computed a similarity matrix for pairwise comparisons of my data, and I want to use hierarchical clustering and a heatmap to visualize the data.
The heatmap itself isn't an issue, but for the hierarchical clustering the function seems to compute a distance matrix from my similarity matrix and then cluster on that (I am using aheatmap from the NMF package, if that changes things).
What is the best way to tell it that the input is already a similarity matrix and to cluster on that data directly, with the dendrogram drawn next to the heatmap?
Thanks!

You should be able to specify the distance method to aheatmap. I tried it out with the iris dataset.
NMF::aheatmap(iris[, 3:4]) # The default uses euclidean
NMF::aheatmap(iris[, 3:4], Rowv = 'manhattan', Colv = 'euclidean') # Specify what type of distance method to use on rows, and columns.
It also says you can pass external clustering to it. See the ?NMF::aheatmap help file for more.
hc <- hclust(dist(x, method = 'minkowski'), method = 'centroid')
aheatmap(x, Rowv = hc, info = TRUE)
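If the input is already a similarity matrix, one option is to convert it to dissimilarities yourself and hand the resulting tree to aheatmap, so that no second distance computation happens. A minimal sketch, assuming larger values mean "more alike" and that 1 - S is a sensible dissimilarity; the correlation matrix S is only a stand-in for your own matrix, and average linkage is an arbitrary choice:

```r
# Stand-in similarity matrix: correlations between 30 iris flowers (rows)
S <- cor(t(as.matrix(iris[1:30, 1:4])))

# Turn similarities into dissimilarities and wrap them as a "dist" object
d <- as.dist(1 - S)

# Cluster on those dissimilarities directly
hc <- hclust(d, method = "average")

# Reuse the same tree for rows and columns (S is symmetric)
if (requireNamespace("NMF", quietly = TRUE)) {
  NMF::aheatmap(S, Rowv = hc, Colv = hc)
}
```

Because the tree is built outside aheatmap, the heatmap still shows the original similarities while the dendrogram reflects 1 - S.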

Related

How do I implement a non-default dissimilarity metric with vegan function metaMDS()?

I have constructed a distance matrix from phylogenetic data using the Claddis function MorphDistMatrix() with the distance metric "MORD" (Maximum Observable Rescaled Distance). I now want to use this dissimilarity matrix to run an NMDS using the vegan function metaMDS(). However, although metaMDS has many distance metrics to choose from, "MORD" is not one of them. How do I enable metaMDS() to have this metric as an option?
Edit: here is some example code:
nexus.data<-ReadMorphNexus("example.nex")
Reading in Nexus file
dist<- MorphDistMatrix(nexus.data, distance = "MORD")
Claddis command for creating distance matrix. Instead of using the Gower dissimilarity (distance = "GC"), I would like to use Maximum Observable Rescaled Distance (distance = "MORD"), which is a modified form of Gower for use with ordered characters (Lloyd 2016). So far so good.
nmds<-metaMDS(dist$DistanceMatrix, k=2, trymax=1000, distance = "GC")
Here is where I run into trouble: as I understand it, the distance used for the metaMDS command should be the same as was used to construct the distance matrix, but MORD is not an option for "distance" in metaMDS. If I were to construct the distance matrix under Gower dissimilarity it wouldn't be a problem, as that is also available in metaMDS.
Lloyd, G. T., 2016. Estimating morphological diversity and tempo with discrete character-taxon matrices: implementation, challenges, progress, and future directions. Biological Journal of the Linnean Society, 118, 131-151.
metaMDS has an argument distfun for selecting a dissimilarity function other than vegdist. Such a function should accept an argument method that selects the dissimilarity measure, and it should return a regular dissimilarity object that inherits from the standard R "dist" class. I do not know the Claddis package: does it return regular dissimilarities? Your example hints that it returns something that is not a regular "dist" object. Alternatively, you can pass pre-calculated dissimilarities to metaMDS as its input. Again, these should be regular dissimilarities, as in any decent R implementation. So you need to check the following with your dissimilarities:
inherits(dist, "dist") # your dist result: should be TRUE
inherits(dist$DistanceMatrix, "dist") # alternatively this should be TRUE
## if the latter was TRUE, you can extract that with
d <- dist$DistanceMatrix
## if d is not a "dist" object, you can see if it can be turned into one
d <- as.dist(dist$DistanceMatrix)
inherits(d, "dist") # TRUE: OK, FALSE: no hope
## if it was OK, you just do
metaMDS(d)
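The check-and-convert steps above can be tried end to end without Claddis. A minimal sketch, assuming the package hands back a plain symmetric matrix with a zero diagonal; the matrix m below is a hypothetical stand-in for dist$DistanceMatrix, and trymax = 50 is an arbitrary choice:

```r
library(vegan)

# Hypothetical stand-in for dist$DistanceMatrix: a plain symmetric matrix
set.seed(1)
m <- as.matrix(dist(matrix(rnorm(80), nrow = 20)))

# A plain matrix is not a "dist" object, but as.dist() converts it
d <- as.dist(m)
stopifnot(!inherits(m, "dist"), inherits(d, "dist"))

# With pre-calculated dissimilarities, no 'distance' argument is needed
nmds <- metaMDS(d, k = 2, trymax = 50, trace = FALSE)
nmds$stress
```

In this route metaMDS never computes dissimilarities itself, so the metric used to build the matrix does not have to be one that metaMDS knows about.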

R — Automatic Optimal Number of Clusters Sequence Algorithm

I am interested in finding a function to automatically determine the optimal number of clusters in R.
I am using a sequence algorithm from the package TraMineR to compute my distances.
library(TraMineR)
data(biofam)
biofam.seq <- seqdef(biofam[501:600, 10:25])
## OM distances ##
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = "TRATE",
full.matrix = F)
For instance, hclust can simply be used like this
h = hclust(as.dist(biofam.om), method = 'ward')
and the number of clusters can then be manually determined with
clusters = cutree(h, k = 7)
What I would ultimately like is to set k in the cutree function automatically, based on an "ideal" number of clusters.
It seems that the package clValid has such a function (optimalScores).
However, I cannot pass a distance matrix into clValid.
clValid(obj = as.dist(biofam.om), 2:6, clMethods = 'hierarchical')
I get this error
argument 'obj' must be a matrix, data.frame, or ExpressionSet object
I get the same kind of error using other packages such as NbClust
NbClust(diss = as.dist(biofam.om), method = 'ward.D')
Data matrix is needed.
Does anyone know how to solve this, or know of other packages?
Thanks.
There are several different criteria for measuring the quality of a clustering result and choosing the optimal number of clusters. Take a look at the WeightedCluster package: http://mephisto.unige.ch/weightedcluster/WeightedCluster.pdf
You can easily compare between different measures and numbers of clusters.
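One criterion that works directly on a dissimilarity object is the average silhouette width. A minimal sketch using the cluster package; this is just one of several possible criteria, the distances on USArrests are only a stand-in for as.dist(biofam.om), and the candidate range 2:10 is an arbitrary assumption:

```r
library(cluster)

# Stand-in dissimilarities; replace with as.dist(biofam.om)
d <- dist(scale(USArrests))
h <- hclust(d, method = "ward.D")

# Average silhouette width for each candidate number of clusters
ks <- 2:10
asw <- sapply(ks, function(k) {
  mean(silhouette(cutree(h, k = k), d)[, "sil_width"])
})

# Pick the k with the highest average silhouette width and cut the tree there
best_k <- ks[which.max(asw)]
clusters <- cutree(h, k = best_k)
```

Plotting asw against ks before trusting best_k is a reasonable habit, since the maximum can be shallow.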

Silhouette plot in R

I have a set of data containing:
item, associated cluster, silhouette coefficient. I can further augment this data set with more information if necessary.
I would like to generate a silhouette plot in R. I am having trouble with this because examples I came across use the built-in kmeans (or related) clustering function and plot the result. I want to bypass this step and produce the plot for my own clustering algorithm but I'm ending up short on providing the correct arguments to the plot function.
Thank you.
EDIT
Data set example https://pastebin.mozilla.org/8853427
What I've tried is loading the dataset and passing it to the plot function using various arguments based on https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/silhouette.html
Function silhouette in package cluster can do the plots for you. It just needs a vector of cluster membership (produced from whatever algorithm you choose) and a dissimilarity matrix (probably best to use the same one used in producing the clusters). For example:
library(cluster)
library(vegan)
data(varespec)
dis = vegdist(varespec)
res = pam(dis, 3) # or whatever your choice of clustering algorithm is
sil = silhouette(res$clustering, dis) # or use your cluster vector
dev.new() # open a separate device (RStudio sometimes does not display silhouette plots correctly); windows() also works, but only on Windows
plot(sil)
EDIT: For k-means (which uses squared Euclidean distance)
library(vegan)
library(cluster)
data(varespec)
dis = dist(varespec)^2 # squared Euclidean distances, to match k-means
res = kmeans(varespec, 3)
sil = silhouette(res$cluster, dis)
dev.new() # windows() also works, but only on Windows
plot(sil)
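If only the items, cluster labels and already computed silhouette coefficients are available (no dissimilarity matrix), a silhouette-style plot can also be drawn by hand with base graphics. A minimal sketch; the data frame and its column names item, cluster and sil are hypothetical and simply mirror the layout described in the question:

```r
# Hypothetical data: one row per item, with its cluster and silhouette width
df <- data.frame(
  item    = paste0("item", 1:9),
  cluster = rep(1:3, each = 3),
  sil     = c(0.4, 0.8, 0.6, -0.1, 0.7, 0.2, 0.5, 0.3, 0.9)
)

# Group by cluster, and sort by decreasing silhouette width within each cluster
df <- df[order(df$cluster, -df$sil), ]

# Horizontal bars, first cluster at the top, one colour per cluster
barplot(rev(df$sil), horiz = TRUE, names.arg = rev(df$item),
        col = rev(df$cluster) + 1, border = NA, las = 1,
        xlab = "Silhouette width",
        main = sprintf("Average silhouette width: %.2f", mean(df$sil)))
```

This only mimics the layout of plot(sil); it does not create an object of class "silhouette".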

How to extract cluster centres from agnes for inputting into kmeans?

One recommended method for getting a good cluster solution is to first use a hierarchical clustering method, choose a number of clusters, then extract the centroids, and then rerun it as a K-means clustering algorithm with the centres pre-specified. A toy example:
library(cluster)
data(agriculture)
ag.a <- agnes(agriculture, method = "ward")
ag.2 <- cutree(ag.a, k = 2)
This would give me two clusters. How can I extract the cluster centres in a format that I can then put into the kmeans() algorithm and reapply it to the same data?
You can use the clustering to assign cluster membership and then calculate the center for all the observations in a cluster. The kmeans function allows you to specify initial centers via the centers= parameter if you pass in a matrix. You can do that with
library(cluster)
data(agriculture)
ag.a <- agnes(agriculture, method = "ward")
ag.2 <- cutree(ag.a, k = 2)
# calculate means per group
cent<-aggregate(cbind(x,y)~ag.2, agriculture, mean)
# pass as initial centers
kmeans(agriculture, cent[,-1])
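The same idea can be written so that it does not depend on the column names x and y. A minimal sketch, assuming all columns are numeric; k = 3 is an arbitrary choice, and the conversion through as.hclust is a precaution rather than a requirement:

```r
library(cluster)
data(agriculture)

# Hierarchical step: Ward clustering, converted to an hclust tree and cut
hc <- as.hclust(agnes(agriculture, method = "ward"))
k <- 3
groups <- cutree(hc, k = k)

# Column sums per group divided by group sizes give the group means
centers <- rowsum(as.matrix(agriculture), groups) / as.vector(table(groups))

# k-means step, started from the hierarchical cluster centres
km <- kmeans(agriculture, centers = centers)
km$centers
```

Comparing table(groups, km$cluster) shows how many observations k-means moved away from their hierarchical cluster.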

How does the heatmap function in R cluster the data and how can we get the number of groups?

I have a distance matrix that I plot with the heatmap function. The heatmap function clusters the data into groups, and I want to reproduce the same grouping.
The call is:
heatmap(distanceMatrix, symm = T)
The groups are evident along the diagonal of the matrix.
In fact, I am looking for the number of groups. After that I can use hclust and cutree in R to partition the data.
Have you looked at the help file of the function (?heatmap)? See the arguments below.
distfun
function used to compute the distance (dissimilarity) between both rows and columns. Defaults to dist.
hclustfun
function used to compute the hierarchical clustering when Rowv or Colv are not dendrograms. Defaults to hclust. Should take as argument a result of distfun and return an object to which as.dendrogram can be applied.
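With those two defaults, heatmap applies dist to the rows of whatever matrix it is given, even if that matrix already holds distances. A minimal sketch of both variants; the three simulated groups and k = 3 are assumptions made purely for illustration:

```r
# Simulated stand-in for distanceMatrix: three groups of 10 points each
set.seed(42)
pts <- rbind(matrix(rnorm(20, mean = 0),  ncol = 2),
             matrix(rnorm(20, mean = 5),  ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2))
distanceMatrix <- as.matrix(dist(pts))

# What heatmap() does by default: distances between rows of the matrix
hc_default <- hclust(dist(distanceMatrix))

# Treating the matrix itself as the distances
hc_direct <- hclust(as.dist(distanceMatrix))

# The same choice inside heatmap(), via distfun
heatmap(distanceMatrix, symm = TRUE, distfun = as.dist)

# Once a number of groups is chosen, cut the matching tree
groups <- cutree(hc_direct, k = 3)
table(groups)
```

Whichever tree is used for cutree should be the one built the same way as in the heatmap call, otherwise the groups and the plotted dendrogram can disagree.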
The NbClust package (library(NbClust)) solved the problem:
"NbClust: An examination of indices for determining the number of clusters : NbClust Package"

Resources