Predict mclust cluster membership outside R

I've used mclust to find clusters in a dataset. Now I want to implement these findings in external, non-R software to classify new observations, so predict.Mclust (suggested in previous similar questions) is not an option. I need to know how mclust classifies observations.
Since mclust outputs a center and a covariance matrix for each cluster, it seemed reasonable to calculate the Mahalanobis distance from every observation to every cluster and then assign each observation to the nearest cluster in that sense. However, this does not fully reproduce the mclust classification.
Example code with simulated data (in this example I only use one dataset, d, and try to obtain the same classification as mclust does with the Mahalanobis approach outlined above):
library(MASS)   # mvrnorm()
library(mclust) # Mclust()
set.seed(123)
c1<-mvrnorm(100,mu=c(0,0),Sigma=matrix(c(2,0,0,2),ncol=2))
c2<-mvrnorm(200,mu=c(3,3),Sigma=matrix(c(3,0,0,3),ncol=2))
d<-rbind(c1,c2)
m<-Mclust(d)
int_class<-m$classification
clust1_cov<-m$parameters$variance$sigma[,,1]
clust1_center<-m$parameters$mean[,1]
clust2_cov<-m$parameters$variance$sigma[,,2]
clust2_center<-m$parameters$mean[,2]
mahal_clust1<-mahalanobis(d,cov=clust1_cov,center=clust1_center)
mahal_clust2<-mahalanobis(d,cov=clust2_cov,center=clust2_center)
mahal_clust_dist<-cbind(mahal_clust1,mahal_clust2)
mahal_classification<-apply(mahal_clust_dist,1,which.min)
table(int_class,mahal_classification)
#List Mahalanobis distances for the misclassified observations:
mahal_clust_dist[mahal_classification!=int_class,]
plot(m,what="classification")
#Indicate misclassified observations:
points(d[mahal_classification!=int_class,],pch="X")
#Results:
> table(int_class,mahal_classification)
         mahal_classification
int_class   1   2
        1 124   0
        2   5 171
> mahal_clust_dist[mahal_classification!=int_class,]
     mahal_clust1 mahal_clust2
[1,]     1.340450     1.978224
[2,]     1.607045     1.717490
[3,]     3.545037     3.938316
[4,]     4.647557     5.081306
[5,]     1.570491     2.193004
Five observations are classified differently by the Mahalanobis approach and by mclust. In the plot they are intermediate points between the two clusters. Could someone tell me why this does not work, and how I could mimic the internal classification of mclust and predict.Mclust?

After formulating the above question I did some additional research (thanks LoBu) and found that the key is to calculate the posterior probability (pp) of an observation belonging to each cluster and to classify according to the maximal pp. The following works:
library(mvtnorm) # provides dmvnorm()
denom<-rep(0,nrow(d))
pp_matrix<-matrix(rep(NA,nrow(d)*2),nrow=nrow(d))
for(i in 1:2){
denom<-denom+m$parameters$pro[i]*dmvnorm(d,m$parameters$mean[,i],m$parameters$variance$sigma[,,i])
}
for(i in 1:2){
pp_matrix[,i]<-m$parameters$pro[i]*dmvnorm(d,m$parameters$mean[,i],m$parameters$variance$sigma[,,i]) / denom
}
pp_class<-apply(pp_matrix,1,which.max)
table(pp_class,m$classification)
#Result:
        m$classification
pp_class   1   2
       1 124   0
       2   0 176
But if someone could explain, in layman's terms, the difference between the Mahalanobis and the pp approach, I would be grateful. What do the "mixing probabilities" (m$parameters$pro) signify?

In addition to the Mahalanobis distance, you also need to take the cluster weight into account.
These weights give the relative importance of the clusters where they overlap.
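For example, each component density is proportional to exp(-D^2/2) / sqrt(det(Sigma)), and mclust multiplies it by the mixing probability pro (the expected share of observations in that cluster), so the classification score is not the Mahalanobis distance D^2 alone. A minimal sketch, reusing m and d from the code above, that should reproduce the pp rule:
log_score <- sapply(1:2, function(k) {
  S  <- m$parameters$variance$sigma[, , k]
  mu <- m$parameters$mean[, k]
  # log of the unnormalised posterior: log weight - 0.5*log|Sigma| - 0.5*D^2
  log(m$parameters$pro[k]) - 0.5 * log(det(S)) - 0.5 * mahalanobis(d, center = mu, cov = S)
})
score_class <- apply(log_score, 1, which.max)
table(score_class, m$classification)  # should agree with the pp classification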

Related

Is there an R function to statistically compare different cluster solutions? (e.g. k-means solution with pam/clara solution)

I compared the silhouette widths of different cluster algorithms on the same dataset: k-means, clara and pam. I can see which one scores highest on silhouette width. But can I now statistically test whether the solutions differ from each other, roughly as we normally would with an ANOVA?
I formulated the hypothesis for my thesis that clara and pam would give more valid results than k-means. I know the silhouette width of both of them is higher, but I don't know how I can statistically confirm/disconfirm my hypothesis.
#######4: Behavioral Clustering
##4.1 Kmeans
kmeans.res.4.1 <- kmeans(ClusterDFSBeha, 2)
print(kmeans.res.4.1)
#Calculate SW
library(clValid)
intern4.1 <- clValid(ClusterDFSBeha, 2, clMethods="kmeans",validation="internal", maxitems = 9800)
summary(intern4.1)
#Silhouette width = 0.7861
##4.2 PAM
library(cluster) # pam() and clara()
pam.res.4.2 <- pam(ClusterDFSBeha, 2)
print(pam.res.4.2)
intern4.2 <- clValid(ClusterDFSBeha, 2, clMethods="pam", validation="internal", maxitems = 9800)
summary(intern4.2)
#Silhouette width = 0.6702
##4.3 Clara
clara.res.4.3 <- clara(ClusterDFSBeha,2)
print(clara.res.4.3)
intern4.3 <- clValid(ClusterDFSBeha, 2, clMethods="clara", validation="internal", maxitems = 9800)
summary(intern4.3)
#Silhouette width = 0.8756
Now I would like to statistically assess whether the methods 'differ' from each other, so that I can reject or accept my hypothesis at a certain p level.
This is not a perfect answer.
If you want to test the "quality" of a clustering method, the best thing is to look at the partition given by the algorithm.
To check it, you can compare partitions with a measure such as the ARI (Adjusted Rand Index); this is called relative performance. Another idea is to use simulated data where you know the true labels, so you can see how far your result is from the truth. The last approach I know of is to assess the stability of your clustering method under small perturbations of the data: the gap statistic of Rob Tibshirani.
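For instance, the ARI between two partitions of the same data can be computed directly from the label vectors; a small sketch reusing the objects from the question (mclust::adjustedRandIndex is one of several available implementations):
library(mclust)
adjustedRandIndex(kmeans.res.4.1$cluster, pam.res.4.2$clustering)   # k-means vs pam
adjustedRandIndex(kmeans.res.4.1$cluster, clara.res.4.3$clustering) # k-means vs clara
Note that this measures how much two methods agree with each other, not which one is closer to an (unknown) truth.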
But in fact, in clustering theory (unsupervised classification), it is really hard to evaluate the pertinence of a clustering. We have fewer model-selection criteria than for supervised learning tasks.
I really advise you to look around online; for instance this package vignette seems to be a good introduction:
https://cran.r-project.org/web/packages/clValid/vignettes/clValid.pdf
To answer directly: I don't think what you are looking for exists. If it does, I will be really happy to learn more about it.
Such a comparison will never be fair.
Any such test makes assumptions, and a clustering method based on similar assumptions can be expected to score better.
For example, if you use the Silhouette with Euclidean distance and compare PAM (Euclidean distance) with k-means, PAM must be expected to have an advantage. If you used the Silhouette with squared Euclidean distances instead, k-means is almost certainly going to fare best (and it is also almost certain to outperform PAM with squared Euclidean).
So you aren't judging which method is "better", but which one correlates more with your evaluation method.
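To illustrate that last point, here is a sketch reusing the objects from the question: the very same k-means partition gets a different silhouette width depending on whether plain or squared Euclidean distances are fed to the evaluation.
library(cluster)
d_euc <- dist(ClusterDFSBeha)                                  # Euclidean distances
sil_euc <- silhouette(kmeans.res.4.1$cluster, d_euc)           # silhouette on d
sil_sq  <- silhouette(kmeans.res.4.1$cluster, dmatrix = as.matrix(d_euc)^2)  # on d^2
summary(sil_euc)$avg.width
summary(sil_sq)$avg.width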
There is a simple way using contingency tables. Say you have one set of cluster assignments, ll, and another, cc; in the ideal situation those labels align perfectly. From that table you can produce a statistic using the chi-squared test, and a p-value for the significance of the allocation differences:
ll1 = rep(c(4,3,2,1),100)
cc1 = rep(c(1:4),length(ll1)/4)
table(cc1, ll1)
print(paste("chi statistic=",chisq.test(cc1, ll1)$statistic ))
print(paste("chi pvalue=",chisq.test(cc1, ll1)$p.value ))
producing:
   ll1
cc1   1   2   3   4
  1   0   0   0 100
  2   0   0 100   0
  3   0 100   0   0
  4 100   0   0   0
[1] "chi statistic= 1200"
[1] "chi pvalue= 1.21264177763119e-252"
meaning that the cell counts are not randomly (uniformly) allocated, which supports an association. For a random allocation:
ll2 = sample(c(4,3,2,1),100,replace=TRUE)
cc2 = sample(c(1:4),length(ll2),replace=TRUE)
table(cc2, ll2)
print(paste("chi statistic=",chisq.test(cc2, ll2)$statistic ))
print(paste("chi pvalue=",chisq.test(cc2, ll2)$p.value ))
with outputs:
   ll2
cc2  1  2  3  4
  1  6  7  6 10
  2  5  5  7  9
  3  6  7  7  4
  4  4  8  5  4
[1] "chi statistic= 4.96291083483202"
[1] "chi pvalue= 0.837529350518186"
supporting that there is no association.
You can use this for your cluster assignments from different algorithms, to see if they are randomly associated or not.
You can also use the Variation of Information distance for clusterings to get the distance between the assignments. For ll1 and cc1 (using the 'mcclust' R package):
vi.dist(ll1,cc1)
vi.dist(ll1,cc1, parts=TRUE)
you get
0
    vi H(1|2) H(2|1)
     0      0      0
and for the sampled ll2 and cc2
vi.dist(ll2,cc2)
vi.dist(ll2,cc2, parts=TRUE)
3.68438190593985
              vi           H(1|2)           H(2|1)
3.68438190593985 1.84631473075115 1.83806717518869
There's also the V-measure you can apply.
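Since the V-measure is only mentioned in passing, here is a small self-contained sketch computing it from the contingency table (homogeneity and completeness as in Rosenberg & Hirschberg, 2007); the function name is ad hoc, not from a package, and it reuses ll1/cc1 and ll2/cc2 from the code above:
v_measure <- function(a, b) {
  tab  <- table(a, b)            # contingency table of the two labelings
  p_ab <- tab / sum(tab)         # joint distribution
  p_a  <- rowSums(p_ab)
  p_b  <- colSums(p_ab)
  H <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }   # entropy
  # conditional entropy H(rows | cols) = -sum p(a,b) * log(p(a,b) / p(b))
  Hcond <- function(joint, col_marg) {
    pm   <- matrix(col_marg, nrow = nrow(joint), ncol = ncol(joint), byrow = TRUE)
    keep <- joint > 0
    -sum(joint[keep] * log(joint[keep] / pm[keep]))
  }
  hom  <- if (H(p_a) == 0) 1 else 1 - Hcond(p_ab, p_b) / H(p_a)     # homogeneity
  comp <- if (H(p_b) == 0) 1 else 1 - Hcond(t(p_ab), p_a) / H(p_b)  # completeness
  if (hom + comp == 0) 0 else 2 * hom * comp / (hom + comp)
}
v_measure(ll1, cc1)  # perfectly aligned labels -> 1
v_measure(ll2, cc2)  # random labels -> close to 0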

comparing kappa coefficients (intercoder agreements) on categorical data

I have a list of 282 items that has been classified by 6 independent coders into 20 categories.
The 20 categories are defined by words (example "perceptual", "evaluation" etc).
The 6 coders have different status: 3 of them are experts, 3 are novices.
I calculated all the kappas (and alphas) between each pair of coders, as well as the overall kappa among the 6 coders, the kappa among the 3 experts, and the kappa among the 3 novices.
Now I would like to check whether there is a significant difference between the interrater agreements achieved by the experts vs those achieved by the novices (whose kappa is indeed lower).
How would you approach this question and report the results?
thanks!
You can at least easily obtain Cohen's kappa and its sd in R (by far the best option, in my opinion).
The PresenceAbsence package has a Kappa (see ?Kappa) function.
You can get the package with the usual install.packages("PresenceAbsence"), then pass it a confusion matrix, e.g.:
# we load the package
library(PresenceAbsence)
# a dummy confusion matrix
cm <- matrix(round(runif(16, 0, 10)), nrow=4)
Kappa(cm)
You will obtain the kappa and its sd. As far as I know there are limitations to significance testing with the kappa metric (e.g. see https://en.wikipedia.org/wiki/Cohen's_kappa#Significance_and_magnitude).
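As for testing: one common approximate approach for two independent kappas (e.g. experts vs novices) is a z-test on the difference using the standard errors; a sketch with made-up numbers, not output of the code above:
kappa_experts <- 0.72; se_experts <- 0.05
kappa_novices <- 0.58; se_novices <- 0.06
z <- (kappa_experts - kappa_novices) / sqrt(se_experts^2 + se_novices^2)
p_value <- 2 * pnorm(-abs(z))  # two-sided p-value for H0: equal kappas
z; p_value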
hope this helps

Discrepancy in results when using k-means and plotting the distance matrix. Why?

I am clustering some data in RStudio. I am having a problem with the results of a k-means cluster analysis versus plotting a hierarchical clustering. When I use the function kmeans, I get 4 groups with 10, 20, 30 and 6 observations. Nevertheless, when I plot the dendrogram, I get 4 groups but with different numbers of observations: 23, 26, 10 and 7.
Have you ever found a problem like this?
Here you are my code:
mydata<-scale(mydata0)
# K-Means Cluster Analysis
fit <- kmeans(mydata, 4) # 4 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydatafinal <- data.frame(mydata, fit$cluster)
fit$size
[1] 10 20 30 6
# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit2 <- hclust(d, method="ward.D2")
plot(fit2,cex=0.4) # display dendrogram
groups <- cutree(fit2, k=4) # cut tree into 4 clusters
# draw dendrogram with red borders around the 4 clusters
rect.hclust(fit2, k=4, border="red")
Results of k-means and hierarchical clustering do not need to be the same in every scenario.
Just to give an example: every time you run k-means, the initial choice of centroids is different, so the results can differ.
This is not surprising. K-means clustering is initialised at random and can give distinct answers. Typically one tends to do several runs and then aggregate the results to check which are the 'core' clusters.
Hierarchical clustering is, in contrast, purely deterministic, as there is no randomness involved. But like k-means it is a heuristic: a set of rules is followed to create clusters, without optimising an explicit global objective function (for example the intra- and inter-cluster variance vs the overall variance). The way existing clusters are merged with individual observations (the linkage, i.e. the method="ward.D2" argument you pass to hclust) is crucial in determining the sizes of the formed clusters.
Having a properly defined objective function to optimise should give you a unique answer (or a set thereof), but the problem is NP-hard because of the sheer number of possible partitions (as a function of the number of observations). This is why only heuristics exist, and also why any clustering procedure should be seen not as a tool giving definitive answers but as an exploratory one.
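A small sketch of the "several runs" idea, reusing mydata and the objects from the question: restart k-means many times via nstart and then cross-tabulate the result against the cut dendrogram to see how the two partitions relate.
set.seed(1)
fit_multi <- kmeans(mydata, centers = 4, nstart = 50)  # 50 random restarts, best kept
table(kmeans = fit_multi$cluster, hclust = groups)     # groups from cutree() above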

how to use the optimised cutoff point to improve the predictive model in R?

I am a bit confused about how to use a cut-off score to improve the precision of my predictive model.
Here is a sample of the data. I have a dataset (matrix) which looks like:
> data_test
1 2 3 4 5 6
KRT6B 0.807688 1.097187 -0.390313 0.644938 -0.187188 1.200688
CXCL1 0.255250 -0.134917 1.886083 0.433417 0.267583 0.996583
S100A8 -1.694800 0.012900 -0.314800 -0.368600 -0.750100 2.864700
S100A7 -0.417500 0.989000 -0.887000 -0.914500 -0.909000 4.485000
HORMAD1 -0.124750 -0.304083 -0.911050 5.426917 0.042250 6.490917
CLCA2 4.243417 0.032583 -1.750917 -1.551250 1.249917 1.494417
The colnames are samples and the rownames are genes.
To find a cut-off point, I generate a score by summing the expression values in each column and assign it to predictor:
predictor <- colSums(data_test)
> predictor
1 2 3 4 5 6
3.069305 1.692670 -2.367997 3.670922 -0.286538 17.532305
and the response for it:
> response
[1] norm high norm low norm high
Levels: high norm low
I used the pROC package to generate a ROC curve and find the optimised cut-off (with the Youden index / J statistic):
library(pROC)
rocobj <- roc(response,predictor)
cutpoint <- coords(rocobj,x='best',input='threshold',best.method = 'youden')
threshold specificity sensitivity
0.7030660 1.0000000 0.6666667
So, now I have my optimised cut-off point, but I cannot understand how to use this cut-off (a sort of score) to improve the precision of my predictive model.
Several papers have used this approach and shown at the end that the similarity level of the predictive model improved when using the new cut-off point. I tried to understand, but I am stuck here because I don't get it (I mean the next step). They didn't mention how they checked the similarity or how they implemented the new cut-off point (in my case, the score) to improve their method.
Could someone give me a good explanation of the next step?
Thanks in advance, and sorry for my messy explanation.
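For reference, a minimal sketch of how such a threshold is usually applied, assuming a binary split (e.g. "high" vs everything else) and a hypothetical matrix new_data of further samples scored the same way:
threshold <- as.numeric(cutpoint["threshold"])   # Youden-optimal cut-off from above
new_score <- colSums(new_data)                   # same per-sample score as before
predicted <- ifelse(new_score >= threshold, "high", "other")
table(predicted)
# comparing predicted against known labels of a validation set (confusion table)
# would then show whether precision improves at this cut-off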

Difficulty Comparing Clusters in R - Inconsistent Labeling

I am attempting to run a Monte Carlo simulation that compares two different clustering techniques. The following code generates a dataset according to random cluster assignments and then applies two clustering techniques (k-means and sparse k-means).
My issue is that these three labelings (the true assignment and the two methods) use different labels for their clusters. For example, what I call cluster 1, k-means might call cluster 2 and sparse k-means might call cluster 3. When I regenerate the data and re-run, the differences in labeling are not consistent: sometimes the labels agree, sometimes they do not.
Can anyone provide a way to 'standardize' these labels so I can run n iterations of the simulation without having to manually resolve labeling differences each time?
My code:
library(sparcl)
library(flexclust)
x.generate=function(n,p,q,mu){
c=sample(c(1,2,3),n,replace=TRUE)
x=matrix(rnorm(p*n),nrow=n)
for(i in 1:n){
if(c[i]==1){
for(j in 1:q){
x[i,j]=rnorm(1,mu,1)
}
}
if(c[i]==2){
for(j in 1:q){
x[i,j]=rnorm(1,-mu,1)
}
}
}
return(list('sample'=x,'clusters'=c))
}
x=x.generate(20,50,50,1)
w=KMeansSparseCluster.permute(x$sample,K=3,silent=TRUE)
kms.out = KMeansSparseCluster(x$sample,K=3,wbounds=w$bestw,silent=TRUE)
km.out = kmeans(x$sample,3)
tabs=table(x$clusters,kms.out$Cs)
tab=table(x$clusters,km.out$cluster)
CER=1-randIndex(tab)
Sample output of x$clusters, km.out$cluster, kms.out$Cs
> x$clusters
[1] 3 2 2 2 1 1 2 2 3 2 1 1 3 1 1 3 2 2 3 1
> km.out$cluster
[1] 3 1 1 1 2 2 1 1 3 1 2 2 3 2 2 3 1 1 3 2
> kms.out$Cs
[1] 1 2 2 2 3 3 2 2 1 2 3 3 1 3 3 1 2 2 1 3
One of the most used criteria of similarity is the Jaccard distance; see for instance Ben-Hur, A., Elisseeff, A., & Guyon, I. (2002). A stability based method for discovering structure in clustered data. Pacific Symposium on Biocomputing (pp. 6-17).
Others include:
Fowlkes, E. B., & Mallows, C. L. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78, 553-569.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193-218.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66, 846-850.
As @Joran points out, the clusters are nominal and thus do not have an order per se.
Here are 2 heuristics that come to my mind:
Starting from the tables you calculate already: when the clusters are well aligned, the trace of the tab matrix is maximal.
If the number of clusters is small, you could find the maximum by trying all permutations of 1:n for method 2 against the n clusters of method 1. If n is too large, you may go with a heuristic that first puts the biggest match onto the diagonal, and so on.
Similarly, the trace of the distance matrix between the centroids of the 2 methods should be minimal.
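A sketch of the first heuristic: for a small number of clusters, try every permutation of the estimated labels and keep the one that maximises the trace of the contingency table. This reuses x$clusters and km.out$cluster from the question and assumes all k labels occur in both vectors; permn() is from the combinat package, but any permutation generator would do.
library(combinat)
align_labels <- function(truth, est) {
  k      <- max(truth, est)
  perms  <- permn(seq_len(k))                 # all permutations of 1:k
  traces <- sapply(perms, function(p) sum(diag(table(truth, p[est]))))
  best   <- perms[[which.max(traces)]]
  best[est]                                   # relabeled estimate, aligned with truth
}
aligned <- align_labels(x$clusters, km.out$cluster)
table(x$clusters, aligned)                    # large counts now sit on the diagonal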
K-means is a randomized algorithm, so you must actually expect the cluster labels to be randomly ordered.
That is why the established evaluation methods for clusterings (read the Wikipedia article on clustering, in particular the section on "external validation") do not assume that there is a one-to-one mapping of clusters.
Even worse, one clustering algorithm may find 3 clusters, another one may find 4 clusters.
There are also hierarchical clustering algorithms. There each object can belong to many clusters, as clusters can be nested in each other.
Also some algorithms such as DBSCAN have a notion of "noise": These objects do not belong to any cluster.
I would not recommend the Jaccard distance (even though it is famous and well established), as it is hugely influenced by cluster sizes. This is because it counts node pairs rather than nodes. I also find the methods with a statistical flavour to be missing the point. The point is that the space of partitions (clusterings) has a beautiful lattice structure. Two distances that work beautifully within that structure are the Variation of Information (VI) distance and the split/join distance. See also this answer on Stack Exchange:
https://stats.stackexchange.com/questions/24961/comparing-clusterings-rand-index-vs-variation-of-information/25001#25001
It includes examples of all three distances discussed here (Jaccard, VI, split/join).
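Applied to the objects from the earlier simulation question, the VI distance is easy to obtain with the mcclust package mentioned above; for comparison, flexclust's randIndex (already loaded in that question) gives the adjusted Rand index:
library(mcclust)
vi.dist(km.out$cluster, kms.out$Cs)           # 0 means identical partitions
randIndex(table(km.out$cluster, kms.out$Cs))  # adjusted Rand index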
