I know very little about R, but I need to convert a dendrogram produced by hierarchical clustering in MATLAB into an R dendrogram structure. The following table shows the output of MATLAB's hierarchical clustering function; the first and second columns are the IDs of the objects or branches, and the third column is the distance.
Is there a way to map this table (or the MATLAB dendrogram) into an R dendrogram?
I think the easiest way to get a dendrogram in R is to use an intermediate result from your MATLAB analysis instead of the final table.
Assuming you have a dissimilarity matrix called Diss_Mat (which you must have computed at some point in your MATLAB algorithm), you can do the following:
DIST_Mat <- as.dist(Diss_Mat)  # create a "dist" type object
dendro <- as.dendrogram(hclust(DIST_Mat))
where the second line performs the hierarchical clustering in R and then creates a dendrogram-type object.
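If you really do want to map the MATLAB linkage table itself, a direct conversion is possible. Here is a sketch, assuming Z is the (n-1) x 3 matrix returned by MATLAB's linkage(), where leaves are numbered 1..n and the cluster created by row k gets the id n+k; matlab_to_hclust is a hypothetical helper name, not a function from any package:

```r
# Convert a MATLAB linkage matrix Z into an R hclust object (sketch).
# R's hclust codes leaves as negative numbers and earlier merges as
# positive row indices, so each MATLAB id must be recoded.
matlab_to_hclust <- function(Z) {
  n <- nrow(Z) + 1L
  merge <- matrix(0L, n - 1L, 2L)
  for (k in seq_len(n - 1L)) {
    for (j in 1:2) {
      id <- Z[k, j]
      merge[k, j] <- if (id <= n) -as.integer(id) else as.integer(id) - n
    }
  }
  # recover a leaf order by walking the merge tree from the last row
  leaves <- function(i) {
    if (i < 0) -i else c(leaves(merge[i, 1]), leaves(merge[i, 2]))
  }
  structure(list(merge = merge, height = Z[, 3], order = leaves(n - 1L),
                 labels = as.character(seq_len(n))),
            class = "hclust")
}

# toy example: 4 objects, MATLAB merges (1,2), then (3,4), then the two clusters
Z <- rbind(c(1, 2, 0.5), c(3, 4, 0.8), c(5, 6, 1.2))
hc <- matlab_to_hclust(Z)
dendro <- as.dendrogram(hc)
```

After the conversion, functions such as cutree() and plot() work on the result as on any hclust object.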
Related
Is there a way of calculating or estimating the area under the curve (AUC) as an external metric, using base R, from confusion matrices alone?
If not, how would I do it, given the clustering object?
e.g. we can start from
cutree(hclust(dist(iris[,1:4]), method="average"), 3)
or, from a diagonal-maximized version of
table(iris$Species, cutree(hclust(dist(iris[,1:4]), method="average"), 3))
the latter being the confusion matrix. I would much prefer a solution that starts from the confusion matrix, but if that's impossible we can use the clustering object itself.
I read the comments here: Calculate AUC in R? -- the top solution looks good, but it's unclear to me how to generalise it for multi-class data like iris.
(No packages, obviously, I want to find out how to do it by hand in base R)
I used the R package corrplot to visualize the correlation matrix of my data, clustering the variables with the built-in hclust option.
The invocation of the command was like this (plus various arrangements of titles, axes etc):
corrplot(Rbas,type="upper",order="hclust",method="ellipse")
But now I am performing some analyses and visualizations with other packages, and the question of compatibility of results arose. In particular, I have to repeat the clustering of the correlation matrix manually. The corrplot documentation leaves one point obscure: which dissimilarity measure does corrplot use behind its defaults? Is it 1-|corr|, sqrt(1-corr^2), or something else? The literature offers several choices, as described, for example, in this article.
Update to answer my own question. I made a trial guess, using the dissimilarity measure 1-corr. That is, I coded (Rbas is the correlation matrix):
dissim1 <- 1 - Rbas
dist1 <- as.dist(dissim1)
plot(hclust(dist1))
and recovered an ordering of variables coinciding with the one produced by corrplot with the default hclust invocation. But it is not clear whether this is indeed the mechanism corrplot uses, and whether it will hold for any other matrix.
The function used by corrplot to reorder variables is corrMatOrder (see ?corrMatOrder). It returns a single permutation vector.
When order = "hclust" is selected in corrplot, corrMatOrder invokes the internal function corrplot:::reorder_using_hclust:
function (corr, hclust.method)
{
hc <- hclust(as.dist(1 - corr), method = hclust.method)
order.dendrogram(as.dendrogram(hc))
}
This function uses 1 - corr as the dissimilarity measure.
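You can reproduce that ordering by hand in base R. A sketch using the built-in mtcars data and "complete" linkage (corrplot's default hclust.method):

```r
# Reproduce corrplot's "hclust" variable ordering manually
M  <- cor(mtcars)                               # a correlation matrix
hc <- hclust(as.dist(1 - M), method = "complete")  # 1 - corr as dissimilarity
ord <- order.dendrogram(as.dendrogram(hc))      # the permutation vector

# ord should match corrMatOrder(M, order = "hclust") if corrplot is installed
colnames(M)[ord]  # variables in the order corrplot would draw them
```

The same three lines applied to your own Rbas matrix let you check whether the 1 - corr guess holds beyond a single example.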
Here is a truncated version of my data set. There are many more rows in the full set.
I know I can convert the second column to a vector via as.vector(df[,2]), which I can then use for distance calculation. Once I have the distances, I'm going to cluster. But then I want to know whether the sequences labeled "1" in the first column ended up clustering together, and likewise for "2", "3", and so on. How would I go about that?
It would be more helpful to include a text dump of your data using dput() rather than an image. It looks like your data might be in Excel; you could save it as a csv file and load it into R using read.csv with stringsAsFactors=FALSE, so that your SecondaryStructure column comes in as strings. Once you have that, load the stringdist package (install it if you don't have it). That package has a function, stringdistmatrix, that will give you a distance matrix using Levenshtein distance. Most clustering algorithms accept a distance matrix as input. You might start with hclust (and it may be better to use method="single" rather than the default method="complete"). hclust gives you a tree structure; use cutree to turn that into a set of cluster assignments. Once you have the cluster assignments, table(Clusters, PrimarySeqGroup) gives you a confusion matrix.
I hope that this helps.
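A minimal, runnable sketch of that workflow, using base R's adist() (which also computes Levenshtein distance) in place of the stringdist package; the column names and sequences here are made up for illustration:

```r
# Toy data; in practice df would come from read.csv(..., stringsAsFactors = FALSE)
df <- data.frame(PrimarySeqGroup    = c(1, 1, 2, 2),
                 SecondaryStructure = c("HHEE", "HHEC", "CEEE", "CEEC"),
                 stringsAsFactors = FALSE)

d  <- as.dist(adist(df$SecondaryStructure))  # Levenshtein distance matrix (base R)
hc <- hclust(d, method = "single")           # single linkage, as suggested above
clusters <- cutree(hc, k = 2)                # hard cluster assignments
table(clusters, df$PrimarySeqGroup)          # confusion matrix
```

A strong diagonal in the resulting table means the clusters recovered the groups from the first column.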
I am interested in using the pvclust R package to assess the significance of clusters I generated with the standard hierarchical clustering function hclust in R. I have a data matrix of ~8000 genes and their expression values at 4 developmental time points. The code below shows what I use for regular hierarchical clustering. My first question: is there a way to take the hr.dendrogram plot and feed it to pvclust?
Secondly, pvclust seems to cluster columns, which makes it better suited to data compared across columns rather than rows as in my case (I have seen many examples where pvclust is used to cluster samples rather than genes). Has anyone used pvclust in the way I describe?
My simple code for regular hierarchical clustering is as follows:
mydata<-read.table("Developmental.genes",header=TRUE, row.names=1)
mydata<-na.omit(mydata)
data.corr <-cor(t(mydata),method="pearson")
d<-as.dist(1-data.corr)
hr<-hclust(d,method="complete",members=NULL)
hr.dendrogram <- plot(as.dendrogram(hr))
I appreciate any help with this!
Why not just use pvclust first, like fit <- pvclust(mydata, method.hclust="ward", nboot=1000, method.dist="eucl")? After that, fit$hclust will equal the tree that hclust would give you on the same distance matrix. Note that pvclust clusters the columns of its input, so to bootstrap clusters of genes (rows) you would pass the transposed matrix t(mydata).
I've applied PCA on my data using the function prcomp in R.
This function returns the following:
Variation
Rotation Matrix
Standard Deviation
Scores (X)
My question is: How can I reconstruct the reduced version of the data after choosing, for example, two principal components?
Going back and forth between PCA space and the original space is done using the rotation matrix. Just take a close look at the matrix algebra in the PCA chapter of your favorite multivariate statistics book. To truncate (or reduce) your dataset, limit the rotation matrix to the PC axes you want, e.g. the first two.
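As a sketch, assuming a prcomp fit called pca (here on the built-in iris measurements, with centering and scaling):

```r
pca <- prcomp(iris[, 1:4], scale. = TRUE)
k <- 2  # number of principal components to keep

# scores of the first k PCs times the transposed (truncated) rotation matrix
recon <- pca$x[, 1:k] %*% t(pca$rotation[, 1:k])

# undo the scaling and centering that prcomp applied
recon <- sweep(recon, 2, pca$scale, "*")
recon <- sweep(recon, 2, pca$center, "+")
```

With k equal to the full number of components the reconstruction is exact; with k = 2 you get the best rank-2 approximation of the data in the least-squares sense.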