I am interested in using the pvclust R package to assess the significance of clusters that I have generated with the regular hierarchical clustering function hclust in R. I have a data matrix that consists of ~8000 genes and their expression values at 4 developmental time points. The code below shows what I use to perform regular hierarchical clustering on my data. My first question is: is there a way to take the clustering behind the hr.dendrogram plot and pass it to pvclust?
Secondly, pvclust seems to cluster columns, and it seems more appropriate for data that is being compared across columns rather than rows like I want to do (I have seen many examples where pvclust is used to cluster samples rather than genes). Has anyone used pvclust in a similar fashion to what I want to do?
My simple code for regular hierarchical clustering is as follows:
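# Read the expression matrix: genes in rows, time points in columns
mydata <- read.table("Developmental.genes", header=TRUE, row.names=1)
mydata <- na.omit(mydata)
# Pearson correlation between genes (rows), converted to a distance
data.corr <- cor(t(mydata), method="pearson")
d <- as.dist(1 - data.corr)
# Complete-linkage hierarchical clustering of the genes
hr <- hclust(d, method="complete", members=NULL)
hr.dendrogram <- as.dendrogram(hr)
plot(hr.dendrogram)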
I appreciate any help with this!
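Why not just run pvclust first, like fit <- pvclust(data.matrix, method.hclust="ward.D", method.dist="euclidean", nboot=1000)? Note that pvclust takes the data matrix itself rather than a distance matrix, and that it clusters the columns, so transpose the matrix first if you want to cluster genes. After that, fit$hclust holds the ordinary hclust result for the original (unresampled) data, so it can be plotted or cut like any other hclust object.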
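To make that suggestion concrete for the question's setup, here is a minimal sketch. It assumes the goal is to reproduce the question's own settings (1 - Pearson correlation, complete linkage) rather than Ward/Euclidean. The function names cluster_genes and pv_cluster_genes and the toy matrix are illustrative inventions; only pvclust(), pvrect() and the hclust component of the result come from the pvclust package. The pvclust call itself is left commented out because it needs the package installed and can take a while.

```r
# Reference clustering of genes (rows), as in the question:
# 1 - Pearson correlation between genes, complete linkage.
cluster_genes <- function(expr) {
  hclust(as.dist(1 - cor(t(expr), method = "pearson")), method = "complete")
}

# Hypothetical wrapper (the name is an assumption, not part of pvclust).
# pvclust clusters the COLUMNS of its input, so genes are moved into
# columns with t(); method.dist = "correlation" is its 1 - correlation distance.
pv_cluster_genes <- function(expr, nboot = 1000) {
  if (!requireNamespace("pvclust", quietly = TRUE)) {
    stop("the pvclust package is required: install.packages('pvclust')")
  }
  pvclust::pvclust(t(expr), method.hclust = "complete",
                   method.dist = "correlation", nboot = nboot)
}

# Toy stand-in for the real data: 10 genes x 4 time points (invented values).
set.seed(1)
toy <- matrix(rnorm(10 * 4), nrow = 10,
              dimnames = list(paste0("gene", 1:10), paste0("t", 1:4)))
hr <- cluster_genes(toy)

# Not run here (needs pvclust and some patience):
# fit <- pv_cluster_genes(toy, nboot = 100)
# plot(fit)                           # dendrogram annotated with AU/BP values
# pvclust::pvrect(fit, alpha = 0.95)  # boxes around clusters with high AU
# fit$hclust                          # hclust object, comparable to hr
```

Two caveats worth checking before relying on this: bootstrapping ~8000 genes is likely to be slow, and with only four time points there are very few rows to resample, so the resulting p-values may be unstable.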
A similar question was asked a few years ago (previous post) but remained unanswered, so I am trying my luck again.
I'm trying to reproduce in R cluster analyses that were done in SAS using the Ward method and the TRIM option. The TRIM option automatically omits points with low probability densities (outliers). Densities are estimated by the kth-nearest-neighbor or uniform-kernel method. This option is applied during the clustering analysis.
My goal is to find the same clustering method, including this trimming option, in R, because I have to extend my dataset with new data. I therefore want to be sure that my cluster analysis in R is correct and closely follows what was done in SAS.
So far, as my dataset is composed of mixed variables, I have computed a Gower dissimilarity matrix. Then I tried different clustering approaches:
the usual one from the 'cluster' package (hclust with the Ward method) => worked well, but I can't find any function to deal with outliers during the analysis.
partitioning clustering from 'TraMineRextras' (pam with the Ward method) => outliers can be removed, but only once the clusters are identified, so it gives different results from those from SAS.
density-based clustering from the 'dbscan' package => worked well (good number of clusters and identification of outliers), but the clustering does not use the Ward method, so I can't rely on it to reproduce the exact same analysis as in SAS.
Does anyone know how to proceed or would have ideas to reproduce the trimming from SAS in R?
Many thanks !
I think you are looking for the agnes function in the cluster package. Documentation can be found here: https://cran.r-project.org/web/packages/cluster/cluster.pdf
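For the trimming part of the question, here is a rough base R approximation, not a verified port of the SAS procedure. It assumes that trimming amounts to dropping, before clustering, the fraction of points with the largest distance to their k-th nearest neighbour (i.e. the lowest k-NN density estimate), and then running Ward clustering on the rest. The function name trimmed_ward and its parameters k, trim and n_clusters are hypothetical. Whether this matches SAS output exactly would need to be checked against a dataset already analysed in SAS.

```r
# Sketch: k-NN density trimming followed by Ward clustering (assumed workflow).
# d          : a dissimilarity object (e.g. Gower distances)
# k          : which nearest neighbour defines the density estimate
# trim       : fraction of lowest-density points to drop
# n_clusters : number of clusters to cut the tree into
trimmed_ward <- function(d, k = 5, trim = 0.10, n_clusters = 3) {
  m <- as.matrix(d)
  # Distance to the k-th nearest neighbour; position 1 after sorting is the
  # point itself (distance 0), hence k + 1.
  knn_dist <- apply(m, 1, function(row) sort(row)[k + 1])
  # A large k-NN distance means low density: keep the densest (1 - trim) share.
  keep <- knn_dist <= quantile(knn_dist, 1 - trim)
  hc <- hclust(as.dist(m[keep, keep]), method = "ward.D2")
  groups <- rep(NA_integer_, nrow(m))   # trimmed points stay NA
  groups[keep] <- cutree(hc, k = n_clusters)
  groups
}

# Toy data (invented): two tight groups plus two obvious outliers.
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2),
           c(20, 20), c(-20, 20))
groups <- trimmed_ward(dist(x), k = 3, trim = 0.05, n_clusters = 2)
```

Note that Ward's criterion is defined for Euclidean distances, so applying it to a Gower dissimilarity matrix is itself an approximation.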
For my undergraduate research project, I'm looking for R code for agglomerative clustering. Basically, I need to know what happens inside the hclust method in R. I have looked everywhere but can't find a proper way to combine the two data points that have the smallest distance. I have generated the first dissimilarity matrix, but I am stuck on how to update it after that first step. I'm not tied to R; if anyone could give me a solution in any language, I would gratefully accept it.
sample from the dataset
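As an illustration of the step described above (merge the closest pair, then rebuild the dissimilarity matrix), here is a naive teaching sketch in base R. It assumes Euclidean distances and complete linkage, where the distance from a merged cluster to any other cluster is the larger of the two old distances. The function name agglomerate is made up for this example, and the real hclust uses a much more efficient algorithm.

```r
# Naive agglomerative clustering with complete linkage (roughly O(n^3)).
agglomerate <- function(x) {
  d <- as.matrix(dist(x))               # first dissimilarity matrix
  diag(d) <- Inf                        # never merge a cluster with itself
  members <- as.list(seq_len(nrow(d)))  # observations in each current cluster
  height <- numeric(0)                  # dissimilarity at each merge
  while (nrow(d) > 1) {
    # 1. Find the pair of clusters with the smallest dissimilarity.
    pair <- which(d == min(d), arr.ind = TRUE)[1, ]
    i <- min(pair)
    j <- max(pair)
    height <- c(height, d[i, j])
    # 2. Complete linkage update: distance from the merged cluster to every
    #    other cluster is the maximum of the two old distances.
    merged <- pmax(d[i, ], d[j, ])
    d[i, ] <- merged
    d[, i] <- merged
    # 3. Cluster j is now part of cluster i, so drop its row and column.
    d <- d[-j, -j, drop = FALSE]
    members[[i]] <- c(members[[i]], members[[j]])
    members[[j]] <- NULL
  }
  list(height = height, members = members[[1]])
}

# Toy data (invented) to compare against hclust.
set.seed(42)
x <- matrix(rnorm(30), ncol = 3)
res <- agglomerate(x)
ref <- hclust(dist(x), method = "complete")
```

On data without tied distances, res$height should match ref$height. Replacing pmax with pmin in step 2 would give single linkage instead.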
Is there a way of calculating or estimating the area under the curve as an external metric, using base R, from confusion matrices alone?
If not, how would I do it, given the clustering object?
e.g. we can start from
cutree(hclust(dist(iris[,1:4]), method="average"), 3)
or, from a diagonal-maximized version of
table(iris$Species, cutree(hclust(dist(iris[,1:4]), method="average"), 3))
the latter being the confusion matrix. I would much, much prefer a solution that goes from the confusion matrix but if it's impossible we can use the clustering object itself.
I read the comments here: Calculate AUC in R? -- the top solution looks good, but it's unclear to me how to generalise it for multi-class data like iris.
(No packages, obviously, I want to find out how to do it by hand in base R)
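One way to get a number from the confusion matrix alone, sketched in base R: a confusion matrix fixes only a single operating point per class, so the one-vs-rest ROC "curve" is the polyline (0,0), (FPR,TPR), (1,1), and its area is (1 + TPR - FPR) / 2, which is the same as balanced accuracy. This is an assumption about what "AUC" should mean here. A full ROC curve needs scores or probabilities, which hard cluster labels do not provide. The function name auc_from_confusion is hypothetical, and the matrix is assumed to be square with rows as true classes and columns already matched to them.

```r
# One-vs-rest AUC from a single operating point per class.
# cm: square confusion matrix, rows = true classes, columns = matched clusters.
auc_from_confusion <- function(cm) {
  cm <- as.matrix(cm)
  stopifnot(nrow(cm) == ncol(cm))
  tp <- diag(cm)
  fn <- rowSums(cm) - tp
  fp <- colSums(cm) - tp
  tn <- sum(cm) - tp - fn - fp
  tpr <- tp / (tp + fn)              # sensitivity per class
  fpr <- fp / (fp + tn)              # 1 - specificity per class
  per_class <- (1 + tpr - fpr) / 2   # area under the two-segment ROC polyline
  list(per_class = per_class, macro = mean(per_class))
}

# Usage with the clustering from the question (assumes the cluster labels
# already line up with the species on the diagonal).
cm <- table(iris$Species,
            cutree(hclust(dist(iris[, 1:4]), method = "average"), 3))
res <- auc_from_confusion(cm)
```

The macro value is the unweighted mean over classes; weighting by class size is an equally defensible choice.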
I am trying to run a cluster analysis on a very large data set. It contains only strings as values. I have removed the NAs and replaced them with a dummy value. My k-means in R keeps failing due to NA coercion. How would the community cluster this data? I am showing 10 rows of a dummy example below. In this situation, let's call the data frame cluster_data.
Any help would be greatly appreciated. I am trying to see whether any of the columns cause the data to break earlier than another, to try to understand a possible structure. I thought clustering with k-means was the best approach, but I do not see how to do it with strings. I have converted the columns to factors in R and still have issues. Any example code is greatly appreciated.
Question: how do you run kmeans clustering with strings?
Answer: You can't run k means cluster analysis on categorical data. You need data that a distance function can make sense of.
K-means is designed for continuous variables, where least-squares and the mean make sense to be used as centers.
For other data types, it is better to use other algorithms, such as PAM, HAC, DBSCAN, OPTICS, ...
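As a small illustration of the HAC route for string data, here is a base R sketch. It assumes a simple matching dissimilarity (the share of columns in which two rows differ) is a reasonable notion of distance for the data. The helper name simple_matching, the toy cluster_data frame and its column names are all invented for the example.

```r
# Simple matching dissimilarity for purely categorical data:
# the fraction of columns in which two rows have different values.
simple_matching <- function(df) {
  m <- as.matrix(df)   # character matrix, one row per observation
  n <- nrow(m)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- mean(m[i, ] != m[j, ])
    }
  }
  as.dist(d)
}

# Toy stand-in for cluster_data (invented values, strings only).
cluster_data <- data.frame(
  colour = c("red", "red", "red", "blue", "blue", "green"),
  size   = c("small", "small", "medium", "large", "large", "large"),
  flag   = c("yes", "no", "yes", "no", "yes", "no"),
  stringsAsFactors = TRUE
)

# Hierarchical clustering (HAC) on the categorical dissimilarities.
hc <- hclust(simple_matching(cluster_data), method = "average")
groups <- cutree(hc, k = 2)
```

The double loop is quadratic in the number of rows, so for a very large data set it would be too slow as written. The cluster package's daisy() with metric = "gower" computes a comparable dissimilarity, and its result can be passed to pam().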
I have been trying to cluster my data so that I can separate the different intensities. In the graph below you can see two distinct groups. Other plots like this are not so easily distinguished, so I thought k-means with an estimate of the number of clusters would be a good way to go. I am using the function pamkCBI from the fpc package (basically the same as pamk, just with an output I find easier to use) to cluster my data (also below). The problem is that the data are being clustered along the x-axis, which produces two clusters with the top peak in one and the low peaks in the other. I need it to distinguish between the V1-V8 lines. I thought I could cluster along the y-axis by transposing the matrix (swapping rows and columns), but then I get this error:
Error in summary(silhouette(clustering[ss[[i]]], dx))$avg.width :
$ operator is invalid for atomic vectors
There has to be a way this can be done. If anyone has any suggestions, or another way to do this, using a different package (even a different program or a different clustering technique) I'd appreciate it. Sorry for the long question.
library(flexmix)
library(fpc)
cluster <- pamkCBI(mt,krange=1:100,criterion="multiasw", usepam=FALSE,
scaling=FALSE, diss=FALSE,
critout=FALSE, ns=10, seed=NULL)
Example test matrix