A similar question was asked a few years ago (previous post) but remained unanswered, so I am trying my luck again.
I am trying to reproduce in R a cluster analysis done in SAS that used the Ward method together with the TRIM option. The TRIM option automatically omits points with low estimated probability densities (outliers); the densities are estimated by the kth-nearest-neighbor or uniform-kernel method, and the trimming is applied during the clustering analysis itself.
My goal is to find the same clustering method, including this trimming option, in R because I have to extend my dataset with new data. I thus want to be sure my cluster analyses in R are correct and follow what was done in SAS.
So far, as my dataset is composed of mixed variables, I have computed a Gower dissimilarity matrix. Then I tried different clustering approaches:
the usual one from the 'cluster' package (hclust with the Ward method) => worked well, but I can't find any function to deal with outliers during the analysis.
partitioning clustering from 'TraMineRextras' (pam with the Ward method) => outliers can be removed, but only once the clusters are identified, so it gives different results from the SAS ones.
the density-based clustering algorithm from the 'dbscan' package => worked well (correct number of clusters and identification of outliers), but the clustering does not use the Ward method, so I can't rely on it to reproduce the exact same analysis as in SAS.
Does anyone know how to proceed, or have ideas on how to reproduce the SAS trimming in R?
Many thanks!
I think you are looking for the agnes function in the cluster package. Documentation can be found here: https://cran.r-project.org/web/packages/cluster/cluster.pdf
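For instance, here is a minimal sketch (my assumptions: mydata is your mixed-variable data frame, and the kNN-distance trim with k = 5 at the 10% level only mirrors, rather than exactly reproduces, SAS's density-based TRIM option):
library(cluster)
# Gower dissimilarity for mixed variables
g <- daisy(mydata, metric = "gower")
dm <- as.matrix(g)
# Rough density proxy: distance to the kth nearest neighbour
# (a large distance means low density); k and the 10% trim level
# are arbitrary placeholders
k <- 5
knn.dist <- apply(dm, 1, function(r) sort(r)[k + 1])  # sort(r)[1] is the self-distance 0
keep <- knn.dist <= quantile(knn.dist, 0.90)          # drop the 10% lowest-density points
# Ward clustering on the trimmed dissimilarity matrix
ag <- agnes(as.dist(dm[keep, keep]), diss = TRUE, method = "ward")
plot(ag)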
Is there a way of calculating or estimating the area under the curve as an external metric, using base R, from confusion matrices alone?
If not, how would I do it, given the clustering object?
e.g. we can start from
cutree(hclust(dist(iris[,1:4]), method="average"), 3)
or, from a diagonal-maximized version of
table(iris$Species, cutree(hclust(dist(iris[,1:4]), method="average"), 3))
the latter being the confusion matrix. I would much, much prefer a solution that goes from the confusion matrix but if it's impossible we can use the clustering object itself.
I read the comments here: Calculate AUC in R? -- the top solution looks good, but it's unclear to me how to generalise it for multi-class data like iris.
(No packages, obviously, I want to find out how to do it by hand in base R)
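Edit: one possible reading I have been considering (an assumption on my part, since a hard clustering gives only a single ROC operating point per class) is to take each class one-vs-rest from the diagonal-maximized confusion matrix and compute the area under the two-segment ROC curve through that single point, which simplifies to (TPR + 1 - FPR)/2:
# Hypothetical sketch, assuming cm is a diagonal-maximized confusion
# matrix (true classes in rows, clusters in columns, cluster j matched
# to class j). With one operating point per class, the area under the
# two-segment ROC curve through it is (TPR + 1 - FPR) / 2.
auc_from_cm <- function(cm) {
  sapply(seq_len(nrow(cm)), function(i) {
    tp <- cm[i, i]
    fn <- sum(cm[i, -i])
    fp <- sum(cm[-i, i])
    tn <- sum(cm[-i, -i])
    (tp / (tp + fn) + 1 - fp / (fp + tn)) / 2
  })
}
cm <- table(iris$Species,
            cutree(hclust(dist(iris[, 1:4]), method = "average"), 3))
auc_from_cm(cm)   # per-class values; take mean() for a single number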
I currently have an adjacency matrix I would like to perform spectral clustering on, to determine the community each node belongs to. I have looked around, but there do not appear to be implementations in either igraph or other packages.
Another issue is determining how many clusters you want. I was wondering if R has any packages that might help one find the optimal number of clusters to break an adjacency matrix into? Thanks.
I cannot advise on R; however, I can suggest this example implementation of spectral clustering using Python and NetworkX (which is comparable to igraph). It should not be hard to translate it into R.
For an introduction to Spectral Clustering see lectures 28-34 here and this paper.
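For what it's worth, here is a minimal base-R sketch of unnormalized spectral clustering (my assumptions: A is a symmetric, non-negative adjacency matrix and k is the desired number of clusters), with a comment on the eigengap heuristic for choosing k:
spectral_cluster <- function(A, k) {
  D <- diag(rowSums(A))              # degree matrix
  L <- D - A                         # unnormalized graph Laplacian
  eig <- eigen(L, symmetric = TRUE)  # eigenvalues come back in decreasing order
  n <- nrow(A)
  U <- eig$vectors[, (n - k + 1):n, drop = FALSE]  # k smallest eigenvectors
  kmeans(U, centers = k, nstart = 20)$cluster      # cluster the embedded rows
}
# For the number of clusters, a common heuristic is the eigengap:
# compute L <- diag(rowSums(A)) - A and look for a large jump in
# plot(sort(eigen(L, symmetric = TRUE)$values)[1:10])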
I have a set (2k - 4k) of small strings (3-6 characters) and I want to cluster them. Since I use strings, previous answers on How does clustering (especially String clustering) work? informed me that Levenshtein distance is good to use as a distance function for strings. Also, since I do not know the number of clusters in advance, hierarchical clustering is the way to go, not k-means.
Although I understand the problem in its abstract form, I do not know what the easiest way to actually do it is. For example, is MATLAB or R a better choice for the actual implementation of hierarchical clustering with a custom distance function (Levenshtein distance)?
For both, one may easily find a Levenshtein distance implementation. The clustering part seems harder. For example, Clustering text in MATLAB calculates the distance array for all strings, but I cannot understand how to use the distance array to actually get the clustering. Can any of you gurus show me the way to implement hierarchical clustering in either MATLAB or R with a custom distance function?
This may be a bit simplistic, but here's a code example that uses hierarchical clustering based on Levenshtein distance in R.
set.seed(1)
rstr <- function(n, k) { # vector of n random char(k) strings
  sapply(1:n, function(i) {
    do.call(paste0, as.list(sample(letters, k, replace = TRUE)))
  })
}
str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))
# Levenshtein distance matrix via base R's adist()
d <- adist(str)
rownames(d) <- str
# hierarchical clustering on the distance matrix
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc, k = 3)                    # outline the three clusters on the dendrogram
df <- data.frame(str, cutree(hc, k = 3))  # append cluster IDs to the strings
In this example, we create a set of 30 random char(5) strings artificially in 3 groups (starting with "aa", "bb", and "cc"). We calculate the Levenshtein distance matrix using adist(...) and run hierarchical clustering using hclust(...). Then we cut the dendrogram into three clusters with cutree(...) and append the cluster IDs to the original strings.
ELKI includes Levenshtein distance, and offers a wide choice of advanced clustering algorithms, for example OPTICS clustering.
Text clustering support was contributed by Felix Stahlberg, as part of his work on:
Stahlberg, F., Schlippe, T., Vogel, S., & Schultz, T.: Word segmentation through cross-lingual word-to-phoneme alignment. IEEE Spoken Language Technology Workshop (SLT), 2012.
We would of course appreciate additional contributions.
While the answer depends to a degree on the meaning of the strings, in general your problem is solved by the sequence analysis family of techniques. More specifically, Optimal Matching Analysis (OMA).
Most often, OMA is carried out in three steps. First, you define your sequences; from your description I assume that each letter is a separate "state", the building block of a sequence. Second, you employ one of several algorithms to calculate the distances between all sequences in your dataset, thus obtaining the distance matrix. Finally, you feed that distance matrix into a clustering algorithm, such as hierarchical clustering or Partitioning Around Medoids (PAM), which seems to be gaining popularity due to the additional information it provides on the quality of the clusters. The latter guides you in the choice of the number of clusters, one of the several subjective steps in sequence analysis.
In R the most convenient package with a great number of functions is TraMineR, the website can be found here. Its user guide is very accessible, and developers are more or less active on SO as well.
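As a rough sketch of the first two steps (hypothetical names and settings: strings is your character vector, each letter is treated as a state, and the unit indel cost with a constant substitution-cost matrix are placeholder choices):
library(TraMineR)
# Step 1: define sequences, one letter per state
# (split each string into letters, pad with NA to equal length)
letters.mat <- do.call(rbind, lapply(strsplit(strings, ""), function(x) {
  length(x) <- max(nchar(strings))
  x
}))
seqs <- seqdef(letters.mat)
# Step 2: Optimal Matching distances
dist.om1 <- seqdist(seqs, method = "OM", indel = 1, sm = "CONSTANT")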
You are likely to find that the clustering is not the most difficult part, except for the decision on the number of clusters. The TraMineR guide shows that the syntax is very straightforward, and the results are easy to interpret based on visual sequence graphs. Here is an example from the user guide:
clusterward1 <- agnes(dist.om1, diss = TRUE, method = "ward")
dist.om1 is the distance matrix obtained by OMA; cluster membership is contained in the clusterward1 object, with which you can do whatever you want: plotting, recoding as variables, etc. The diss = TRUE option indicates that the data object is the dissimilarity (or distance) matrix. Easy, eh? The most difficult choice (not syntactically, but methodologically) is choosing the right distance algorithm, suitable for your particular application. Once you have that, and are able to justify the choice, the rest is quite easy. Good luck!
If you would like a clear explanation of how to use partitional clustering (which will surely be faster) to solve your problem, check this paper: Effective Spell Checking Methods Using Clustering Algorithms.
https://www.researchgate.net/publication/255965260_Effective_Spell_Checking_Methods_Using_Clustering_Algorithms?ev=prf_pub
The authors explain how to cluster a dictionary using a modified (PAM-like) version of iK-Means.
Best of Luck!
I am interested in using the pvclust R package to determine the significance of clusters that I have generated using the regular hierarchical clustering function hclust in R. I have a data matrix that consists of ~8000 genes and their expression values at 4 developmental time points. The code below shows what I use to perform regular hierarchical clustering on my data. My first question is: is there a way to take the hr dendrogram plot and apply it to pvclust?
Secondly, pvclust seems to cluster columns, and it seems more appropriate for data that is compared across columns rather than rows, as I want to do (I have seen many examples where pvclust is used to cluster samples rather than genes). Has anyone used pvclust in a similar fashion to what I want to do?
My simple code for regular hierarchical clustering is as follows:
mydata <- read.table("Developmental.genes", header = TRUE, row.names = 1)
mydata <- na.omit(mydata)                        # drop genes with missing values
data.corr <- cor(t(mydata), method = "pearson")  # gene-gene Pearson correlation
d <- as.dist(1 - data.corr)                      # correlation distance
hr <- hclust(d, method = "complete")
plot(as.dendrogram(hr))   # plot() returns NULL, so there is no point assigning it
I appreciate any help with this!
Why not just use pvclust first, like fit <- pvclust(distance.matrix, method.hclust = "ward", nboot = 1000, method.dist = "eucl")? After that, fit$hclust will be equal to hclust(distance.matrix).
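To address the row-vs-column issue directly, here is a hedged sketch using mydata from the question (pvclust clusters the columns of its input, so the genes are transposed into columns; method.dist = "correlation" corresponds to the 1 - Pearson correlation distance computed above):
library(pvclust)
# pvclust bootstraps the *columns*, so transpose to cluster genes;
# "correlation" distance is 1 - Pearson correlation, matching the
# as.dist(1 - data.corr) step in the question
fit <- pvclust(t(mydata), method.hclust = "complete",
               method.dist = "correlation", nboot = 1000)
plot(fit)                   # dendrogram annotated with AU/BP p-values
pvrect(fit, alpha = 0.95)   # box clusters with AU p-value >= 0.95
With ~8000 genes this is computationally heavy; the same package also provides parPvclust() to distribute the bootstrap, which helps at this scale.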