What would be the best function/package to use in R to try and replicate the K-means clustering method used in SPSS? Here is an example of the syntax I would use in SPSS:
QUICK CLUSTER VAR1 TO VAR10
/MISSING=LISTWISE
/CRITERIA=CLUSTER(5) MXITER(50) CONVERGE(.02)
/METHOD=KMEANS(NOUPDATE)
Thanks!
In SPSS, use the /PRINT INITIAL option. This will give you the initial cluster centers, which seem to be fixed in SPSS but are chosen randomly in R (see ?kmeans for the centers parameter).
If you pass the initial cluster centers printed in the SPSS output as centers and set algorithm="Lloyd" in kmeans, you should get the same results (at least it worked for me, testing with several repetitions).
Example of an SPSS-output of the initial cluster centers:
Cluster
Cl1 Cl2 Cl3 Cl4
Var A 1 1 4 3
Var B 4 1 4 1
Var C 1 1 1 4
Var D 1 4 4 1
Var E 1 4 1 2
Var F 1 4 4 3
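This table, replicated as a matrix in R (one row per cluster, one column per variable), with the kmeans computation:
mat <- matrix(c(1,1,4,3,4,1,4,1,1,1,1,4,1,4,4,1,1,4,1,2,1,4,4,3), nrow=4, ncol=6)
# mydata is a placeholder for your own data frame holding the six variables
kmeans(na.omit(mydata), centers=mat, iter.max=20, algorithm="Lloyd")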
Be sure to use the same maximum number of iterations in SPSS and in R's kmeans, and use the Lloyd method in R's kmeans.
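As a minimal sketch of the layout issue: SPSS prints the initial centers with one row per variable, while kmeans expects one row per cluster, so the printed table has to be transposed. The data below (toy) are simulated stand-ins, not real SPSS data, and the object names are made up for illustration.

```r
# Initial centers as printed by SPSS: one row per variable, one column per cluster
spss_centers <- matrix(
  c(1, 1, 4, 3,
    4, 1, 4, 1,
    1, 1, 1, 4,
    1, 4, 4, 1,
    1, 4, 1, 2,
    1, 4, 4, 3),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste("Var", LETTERS[1:6]), paste0("Cl", 1:4))
)

# kmeans() expects one row per cluster, so transpose the SPSS table
centers <- t(spss_centers)

# Simulated stand-in data (assumption): 25 noisy points around each center
set.seed(1)
toy <- centers[rep(1:4, each = 25), ] + matrix(rnorm(100 * 6, sd = 0.3), ncol = 6)
rownames(toy) <- NULL

# Fixed starting centers plus Lloyd's algorithm, as described above
fit <- kmeans(toy, centers = centers, iter.max = 20, algorithm = "Lloyd")
fit$size
```

The transposed matrix is the same as the one built with nrow=4, ncol=6 from the row-wise SPSS values, because R fills matrices column by column.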
However, I don't know whether it's better to have a fixed or a random choice of initial centers. I personally like the random choice: I compute a linear discriminant analysis on the resulting cluster groups to assess the classification accuracy, and rerun the kmeans clustering until I have a satisfying group classification.
Edit: I found this posting where the SPSS procedure of selecting initial clusters is described. Perhaps somebody knows of an R implementation?
I compared the silhouette widths of different cluster algorithms on the same dataset: k-means, clara and pam. I can see which one scores the highest on silhouette width. But can I now statistically test whether the solutions differ from each other, much as we normally do with ANOVA?
I formulated the hypothesis for my thesis that clara and pam would give more valid results than k-means. I know the silhouette width of both of them is higher, but I don't know how I can statistically confirm/disconfirm my hypothesis.
#######4: Behavioral Clustering
##4.1 Kmeans
kmeans.res.4.1 <- kmeans(ClusterDFSBeha, 2)
print(kmeans.res.4.1)
#Calculate SW
library(clValid)
intern4.1 <- clValid(ClusterDFSBeha, 2, clMethods="kmeans",validation="internal", maxitems = 9800)
summary(intern4.1)
#Silhouette width = 0.7861
##4.2 PAM
pam.res.4.2 <- pam(ClusterDFSBeha, 2)
print(pam.res.4.2)
intern4.2 <- clValid(ClusterDFSBeha, 2, clMethods="pam", validation="internal", maxitems = 9800)
summary(intern4.2)
#Silhouette width = 0.6702
##4.3 Clara
clara.res.4.3 <- clara(ClusterDFSBeha,2)
print(clara.res.4.3)
intern4.3 <- clValid(ClusterDFSBeha, 2, clMethods="clara", validation="internal", maxitems = 9800)
summary(intern4.3)
#Silhouette width = 0.8756
Now I would like to assess whether the methods differ from each other statistically, so that I can reject or accept my hypothesis at a certain p level.
It is not a perfect answer.
If you want to test the "quality" of a clustering method, the better approach is to look at the partition produced by the algorithm.
To check it, you can compare partitions with a measure such as the ARI (Adjusted Rand Index); this is called relative performance. Another idea is to use simulated data where you know the true labels, so you can measure how far your result is from the truth. The last one I know of is to assess the stability of your clustering method under small perturbations of the data: the gap statistic of Rob Tibshirani.
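To make the ARI concrete, here is a minimal base-R sketch computed from the contingency table of two partitions. The function name adjusted_rand_index and the toy partitions are made up for illustration; packages such as mclust and flexclust ship their own implementations.

```r
# Adjusted Rand Index from the contingency table of two label vectors
adjusted_rand_index <- function(a, b) {
  tab <- unclass(table(a, b))
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))           # pairs together in both partitions
  sum_rows  <- sum(choose(rowSums(tab), 2))  # pairs together in partition a
  sum_cols  <- sum(choose(colSums(tab), 2))  # pairs together in partition b
  expected  <- sum_rows * sum_cols / choose(n, 2)
  (sum_cells - expected) / ((sum_rows + sum_cols) / 2 - expected)
}

# Same partition with permuted labels: the ARI ignores label names
p1 <- rep(1:3, each = 10)
p2 <- c(3, 1, 2)[p1]
ari_same <- adjusted_rand_index(p1, p2)

# Unrelated random partition for comparison
set.seed(42)
p3 <- sample(1:3, 30, replace = TRUE)
ari_random <- adjusted_rand_index(p1, p3)
```

Because the index works on pairs of observations, it needs no matching of cluster labels between the two methods.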
But in clustering theory (unsupervised classification) it is really hard to evaluate the relevance of a cluster. We have fewer model selection criteria than for supervised learning tasks.
I really advise you to look online; for instance, this package description seems to be a good introduction:
https://cran.r-project.org/web/packages/clValid/vignettes/clValid.pdf
To answer directly, I don't think that what you are looking for exists. If it does, I would be really happy to know more about it.
Such a comparison will never be fair.
Any such test makes some assumptions, and a clustering method that is based on similar assumptions is to be expected to score better.
For example, if you use Silhouette with Euclidean distance, PAM with Euclidean distance, and k-means, it must be expected that PAM has an advantage. If you used Silhouette with squared Euclidean distances instead, k-means is almost certainly going to fare best (and it is also almost certain to outperform PAM with squared Euclidean).
So you aren't judging which method is "better", but which correlates more with your evaluation method.
There is a simple way using contingency tables. Say you have one set of cluster assignments ll and another cc. In the ideal situation the labels align perfectly. From the contingency table of the two you can compute a chi-squared test statistic and a p-value for the significance of the association between the allocations:
ll1 = rep(c(4,3,2,1),100)
cc1 = rep(c(1:4),length(ll1)/4)
table(cc1, ll1)
print(paste("chi statistic=",chisq.test(cc1, ll1)$statistic ))
print(paste("chi pvalue=",chisq.test(cc1, ll1)$p.value ))
producing;
ll1
cc1 1 2 3 4
1 0 0 0 100
2 0 0 100 0
3 0 100 0 0
4 100 0 0 0
[1] "chi statistic= 1200"
[1] "chi pvalue= 1.21264177763119e-252"
meaning that the cell counts are not randomly (uniformly) allocated supporting an association. For a random allocation;
ll2 = sample(c(4,3,2,1),100,replace=TRUE)
cc2 = sample(c(1:4),length(ll2),replace=TRUE)
table(cc2, ll2)
print(paste("chi statistic=",chisq.test(cc2, ll2)$statistic ))
print(paste("chi pvalue=",chisq.test(cc2, ll2)$p.value ))
with outputs
ll2
cc2 1 2 3 4
1 6 7 6 10
2 5 5 7 9
3 6 7 7 4
4 4 8 5 4
[1] "chi statistic= 4.96291083483202"
[1] "chi pvalue= 0.837529350518186"
supporting that there is no association.
You can use this for your cluster assignments from different algorithms, to see if they are randomly associated or not.
You can also use the Variation of Information distance for clusterings to get the distance between the assignments. For ll1 and cc1 (with the 'mcclust' R package):
library(mcclust)
vi.dist(ll1,cc1)
vi.dist(ll1,cc1, parts=TRUE)
you get
0
vi H(1|2) H(2|1)
0 0 0
and for the sampled ll2 and cc2
vi.dist(ll2,cc2)
vi.dist(ll2,cc2, parts=TRUE)
3.68438190593985
vi H(1|2) H(2|1)
3.68438190593985 1.84631473075115 1.83806717518869
There's also the V-measure, which you can apply.
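For readers who want to see what the Variation of Information is made of, here is a minimal base-R sketch: the sum of the two conditional entropies of the label vectors. The function name variation_of_information is hypothetical, and the use of base-2 logarithms is an assumption chosen to be consistent with the magnitudes printed above.

```r
# Variation of Information between two label vectors (log base 2 assumed)
variation_of_information <- function(a, b) {
  p  <- unclass(table(a, b)) / length(a)  # joint distribution of the labels
  pa <- rowSums(p)                        # marginal of a
  pb <- colSums(p)                        # marginal of b
  entropy <- function(q) -sum(q[q > 0] * log2(q[q > 0]))
  h_joint <- entropy(p)
  c(vi       = 2 * h_joint - entropy(pa) - entropy(pb),
    "H(a|b)" = h_joint - entropy(pb),
    "H(b|a)" = h_joint - entropy(pa))
}

# Perfectly aligned assignments (up to relabelling) have distance zero
ll1 <- rep(c(4, 3, 2, 1), 100)
cc1 <- rep(1:4, length(ll1) / 4)
vi_aligned <- variation_of_information(ll1, cc1)
```

The first element is always the sum of the other two, which matches the pattern in the parts=TRUE output.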
I've used mclust to find clusters in a dataset. Now I want to carry these findings over to external non-R software to classify new observations (predict.Mclust is thus not an option, as has been suggested in previous similar questions). I need to know how mclust classifies observations.
Since mclust outputs a center and a covariance matrix for each cluster, it seemed reasonable to calculate the Mahalanobis distance from every observation to every cluster. Observations could then be assigned to the Mahalanobis-nearest cluster. However, this does not seem to work fully.
Example code with simulated data (in this example I only use one dataset, d, and try to obtain the same classification as mclust does with the Mahalanobis approach outlined above):
library(MASS)    # mvrnorm()
library(mclust)  # Mclust()
set.seed(123)
c1<-mvrnorm(100,mu=c(0,0),Sigma=matrix(c(2,0,0,2),ncol=2))
c2<-mvrnorm(200,mu=c(3,3),Sigma=matrix(c(3,0,0,3),ncol=2))
d<-rbind(c1,c2)
m<-Mclust(d)
int_class<-m$classification
clust1_cov<-m$parameters$variance$sigma[,,1]
clust1_center<-m$parameters$mean[,1]
clust2_cov<-m$parameters$variance$sigma[,,2]
clust2_center<-m$parameters$mean[,2]
mahal_clust1<-mahalanobis(d,cov=clust1_cov,center=clust1_center)
mahal_clust2<-mahalanobis(d,cov=clust2_cov,center=clust2_center)
mahal_clust_dist<-cbind(mahal_clust1,mahal_clust2)
mahal_classification<-apply(mahal_clust_dist,1,function(x){
match(min(x),x)
})
table(int_class,mahal_classification)
#List Mahalanobis distances for misclassified observations:
mahal_clust_dist[mahal_classification!=int_class,]
plot(m,what="classification")
#Indicate misclassified observations:
points(d[mahal_classification!=int_class,],pch="X")
#Results:
> table(int_class,mahal_classification)
         mahal_classification
int_class   1   2
        1 124   0
        2   5 171
> mahal_clust_dist[mahal_classification!=int_class,]
     mahal_clust1 mahal_clust2
[1,]     1.340450     1.978224
[2,]     1.607045     1.717490
[3,]     3.545037     3.938316
[4,]     4.647557     5.081306
[5,]     1.570491     2.193004
Five observations are classified differently by the Mahalanobis approach and by mclust. In the plots they are intermediate points between the two clusters. Could someone tell me why it does not work and how I could mimic the internal classification of mclust and predict.Mclust?
After formulating the above question I did some additional research (thx LoBu) and found that the key was to calculate the posterior probability (pp) for an observation to belong to a certain cluster and classify according to maximal pp. The following works:
library(mvtnorm)  # dmvnorm()
denom<-rep(0,nrow(d))
pp_matrix<-matrix(rep(NA,nrow(d)*2),nrow=nrow(d))
for(i in 1:2){
denom<-denom+m$parameters$pro[i]*dmvnorm(d,m$parameters$mean[,i],m$parameters$variance$sigma[,,i])
}
for(i in 1:2){
pp_matrix[,i]<-m$parameters$pro[i]*dmvnorm(d,m$parameters$mean[,i],m$parameters$variance$sigma[,,i]) / denom
}
pp_class<-apply(pp_matrix,1,function(x){
match(max(x),x)
})
table(pp_class,m$classification)
#Result:
pp_class 1 2
1 124 0
2 0 176
But if someone could explain in layman's terms the difference between the Mahalanobis and pp approaches, I would be grateful. What do the "mixing probabilities" (m$parameters$pro) signify?
In addition to Mahalanobis distance, you also need to take the cluster weight into account.
These weight the relative importance of clusters when they overlap.
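One way to see the connection: taking the log of weight times Gaussian density shows that picking the cluster with maximal posterior probability is the same as picking the smallest value of (squared Mahalanobis distance + log-determinant of the covariance - 2 * log of the weight). The sketch below checks this in base R. The mixture parameters are made up for illustration and are not taken from the Mclust fit above.

```r
# Hypothetical two-component mixture (parameters invented for illustration)
pro   <- c(0.4, 0.6)                   # mixing probabilities (cluster weights)
mu    <- list(c(0, 0), c(3, 3))        # cluster centers
sigma <- list(diag(2, 2), diag(3, 2))  # cluster covariance matrices

set.seed(1)
x <- matrix(rnorm(200, mean = 1.5, sd = 2), ncol = 2)

# Corrected score: squared Mahalanobis distance + log|Sigma| - 2 * log(weight)
score <- sapply(1:2, function(k) {
  mahalanobis(x, center = mu[[k]], cov = sigma[[k]]) +
    as.numeric(determinant(sigma[[k]], logarithm = TRUE)$modulus) -
    2 * log(pro[k])
})
score_class <- max.col(-score, ties.method = "first")  # smallest score wins

# Reference: weight times Gaussian density, computed directly
dens <- sapply(1:2, function(k) {
  d2 <- mahalanobis(x, center = mu[[k]], cov = sigma[[k]])
  pro[k] * exp(-0.5 * d2) / sqrt((2 * pi)^ncol(x) * det(sigma[[k]]))
})
pp_class <- max.col(dens, ties.method = "first")       # largest posterior wins

table(score_class, pp_class)
```

The plain Mahalanobis rule drops the last two terms, so it only agrees with the posterior rule when all clusters have equal weights and equal covariance determinants.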
I am a bit confused about how to use a cut-off score to improve the precision of my predictive model.
Here is a sample of my data. I have a dataset (matrix) which looks like:
> data_test
1 2 3 4 5 6
KRT6B 0.807688 1.097187 -0.390313 0.644938 -0.187188 1.200688
CXCL1 0.255250 -0.134917 1.886083 0.433417 0.267583 0.996583
S100A8 -1.694800 0.012900 -0.314800 -0.368600 -0.750100 2.864700
S100A7 -0.417500 0.989000 -0.887000 -0.914500 -0.909000 4.485000
HORMAD1 -0.124750 -0.304083 -0.911050 5.426917 0.042250 6.490917
CLCA2 4.243417 0.032583 -1.750917 -1.551250 1.249917 1.494417
The column names are samples and the row names are genes.
To find a cut-off point, I generate a score by summing the expression values in each column and assign it to predictor:
predictor <- colSums(data_test)
> predictor
1 2 3 4 5 6
3.069305 1.692670 -2.367997 3.670922 -0.286538 17.532305
and the response for it:
> response
[1] norm high norm low norm high
Levels: high norm low
I used pROC package to generate a ROC curve and find the optimised cut-off (with youden index value/ J statistic):
library(pROC)
rocobj <- roc(response,predictor)
cutpoint <- coords(rocobj,x='best',input='threshold',best.method = 'youden')
threshold specificity sensitivity
0.7030660 1.0000000 0.6666667
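For reference, the idea behind a Youden-optimal threshold can be sketched in base R for a binary outcome: try candidate thresholds, compute sensitivity and specificity for each, and keep the one maximising J = sensitivity + specificity - 1. The scores and labels below are invented for illustration, and this is only a sketch of the idea, not pROC's implementation.

```r
# Hypothetical binary example (values made up for illustration)
score  <- c(0.2, 0.4, 0.5, 0.7, 0.9, 1.3, 1.8, 2.5)
status <- c(0, 0, 0, 1, 0, 1, 1, 1)  # 1 = case, 0 = control

# Candidate thresholds: midpoints between consecutive sorted scores
s <- sort(unique(score))
thresholds <- (head(s, -1) + tail(s, -1)) / 2

# Youden's J for each candidate threshold
youden <- sapply(thresholds, function(t) {
  pred <- as.integer(score > t)
  sens <- mean(pred[status == 1] == 1)
  spec <- mean(pred[status == 0] == 0)
  sens + spec - 1
})

# Keep the first threshold with maximal J and classify with it
best <- thresholds[which.max(youden)]
table(predicted = as.integer(score > best), actual = status)
```

Applying the threshold simply turns the continuous score into a predicted class, which can then be compared with the known labels.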
So now I have my optimised cut-off point, but I cannot understand how I can use this optimised cut-off (a sort of score!) to improve the precision of my predictive model.
In several papers this approach has been used, and at the end the authors show that the similarity level of the predictive model improved with the new cut-off point. I tried to understand but I am stuck here because I don't get the next step. They didn't mention how they checked the similarity or how they applied the new cut-off point (in my case, the score) to improve their method.
Could someone give me a good explanation for the next step?
Thanks in advance and sorry for my messy explanation.
I am attempting to run a monte carlo simulation that compares two different clustering techniques. The following code generates a dataset according to random clustering and then applies two clustering techniques (kmeans and sparse k means).
My issue is that these three techniques use different labels for their clusters. For example, what I call cluster 1, kmeans might call it cluster 2 and sparse k means might call it cluster 3. When I regenerate and re-run, the differences in labeling do not appear to be consistent. Sometimes the labels agree, sometimes they do not.
Can anyone provide a way to 'standardize' these labels so I can run n iterations of the simulation without having to manually resolve labeling differences each time?
My code:
library(sparcl)
library(flexclust)
x.generate=function(n,p,q,mu){
c=sample(c(1,2,3),n,replace=TRUE)
x=matrix(rnorm(p*n),nrow=n)
for(i in 1:n){
if(c[i]==1){
for(j in 1:q){
x[i,j]=rnorm(1,mu,1)
}
}
if(c[i]==2){
for(j in 1:q){
x[i,j]=rnorm(1,-mu,1)
}
}
}
return(list('sample'=x,'clusters'=c))
}
x=x.generate(20,50,50,1)
w=KMeansSparseCluster.permute(x$sample,K=3,silent=TRUE)
kms.out = KMeansSparseCluster(x$sample,K=3,wbounds=w$bestw,silent=TRUE)
km.out = kmeans(x$sample,3)
tabs=table(x$clusters,kms.out$Cs)
tab=table(x$clusters,km.out$cluster)
CER=1-randIndex(tab)
Sample output of x$clusters, km.out$cluster, kms.out$Cs
> x$clusters
[1] 3 2 2 2 1 1 2 2 3 2 1 1 3 1 1 3 2 2 3 1
> km.out$cluster
[1] 3 1 1 1 2 2 1 1 3 1 2 2 3 2 2 3 1 1 3 2
> kms.out$Cs
[1] 1 2 2 2 3 3 2 2 1 2 3 3 1 3 3 1 2 2 1 3
One of the most widely used similarity criteria is the Jaccard distance. See for instance Ben-Hur, A., Elisseeff, A., & Guyon, I. (2002). A stability based method for discovering structure in clustered data. Pacific Symposium on Biocomputing (pp. 6--17).
Others include
Fowlkes, E. B., & Mallows, C. L. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78, 553--569.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193--218.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66, 846--850.
As @Joran points out, the clusters are nominal, and thus do not have an order per se.
Here are 2 heuristics that come to my mind:
1. Starting from the tables you calculate already: when the clusters are well aligned, the trace of the tab matrix is maximal. If the number of clusters is small, you could find the maximum by trying all permutations of the labels 1:n of method 2 against the n clusters of method 1. If it is too large, you may go with a heuristic that first puts the biggest match onto the diagonal, and so on.
2. Similarly, the trace of the distance matrix between the centroids of the 2 methods should be minimal.
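The first heuristic can be sketched in base R as follows, using the x$clusters and km.out$cluster vectors from the sample output in the question. The helper names permutations and align_labels are made up for illustration, and the brute-force search is only practical for a small number of clusters.

```r
# Label vectors copied from the sample output in the question
truth <- c(3, 2, 2, 2, 1, 1, 2, 2, 3, 2, 1, 1, 3, 1, 1, 3, 2, 2, 3, 1)
km    <- c(3, 1, 1, 1, 2, 2, 1, 1, 3, 1, 2, 2, 3, 2, 2, 3, 1, 1, 3, 2)

# All permutations of a vector (fine for small k)
permutations <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in permutations(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# Relabel `labels` so that the trace of the cross-table with `reference` is maximal
align_labels <- function(reference, labels, k = max(reference, labels)) {
  tab <- table(factor(reference, levels = 1:k), factor(labels, levels = 1:k))
  perms <- permutations(1:k)
  # perm[i] is the label in `labels` matched to reference cluster i
  agreement <- sapply(perms, function(perm) sum(tab[cbind(1:k, perm)]))
  best <- perms[[which.max(agreement)]]
  # Map each label back to the reference cluster it was matched to
  match(labels, best)
}

km_aligned <- align_labels(truth, km)
table(truth, km_aligned)
```

Inside a simulation loop, the same call can be applied to each method's labels before tabulating, so no manual relabelling is needed.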
K-means is a randomized algorithm. You must actually expect the cluster labels to be randomly ordered.
That is why the established evaluation methods for clusters (read the Wikipedia article on clustering, in particular the section on "external validation") do not assume that there is a one-to-one mapping of clusters.
Even worse, one clustering algorithm may find 3 clusters, another one may find 4 clusters.
There are also hierarchical clustering algorithms. There each object can belong to many clusters, as clusters can be nested in each other.
Also some algorithms such as DBSCAN have a notion of "noise": These objects do not belong to any cluster.
I would not recommend the Jaccard distance (even though it is famous and well established) as it is hugely influenced by cluster sizes. This is due to the fact that it counts node pairs rather than nodes. I also find the methods with a statistical flavour to be missing the point. The point is that the space of partitions (clusterings) has a beautiful lattice structure. Two distances that work beautifully within that structure are the Variation of Information (VI) distance and the split/join distance. See also this answer on stackexchange:
https://stats.stackexchange.com/questions/24961/comparing-clusterings-rand-index-vs-variation-of-information/25001#25001
It includes examples of all three distances discussed here (Jaccard, VI, split/join).
Hints I got as to a different question puzzled me quite a bit.
I got an exercise, actually part of a larger exercise:
1. Cluster some data, using hclust (done)
2. Given a totally new vector, find out to which of the clusters you got in 1 it is nearest.
According to the exercise, this should be done in quite a short time.
However, after weeks I am puzzled whether this can be done at all, as apparently all I really get from hclust is a tree - and not, as I assumed, a number of clusters.
As I suppose I was unclear:
Say, for instance, I feed hclust a matrix which consists of 15 1x5 Vectors, 5 times (1 1 1 1 1 ), 5 times (2 2 2 2 2) and 5 times (3 3 3 3 3). This should give me three quite distinct clusters of size 5, anyone can easily do that by hand. Is there a command to use so that I can actually find out from the program that there are 3 such clusters in my hclust-object and what they contain?
You'll have to think about what the right metric is to define closeness to the cluster. Building on the example in the hclust doc, here's a way to compute the means for each cluster and then measure the distance between the new data point and the set of means.
# Leave out one state
A <-USArrests
B <-A[rownames(A)!="Kentucky",]
KY <- A[rownames(A)=="Kentucky",]
# Put the B data into 10 clusters
hc <- hclust(dist(B), "ave")
memb <- cutree(hc, k = 10)
B$cluster = memb[rownames(B)==names(memb)]
# Compute the averages over the clusters
M <-aggregate( .~cluster, data=B, FUN=mean)
M$cluster=NULL
# Now add the hold out state to the set of averages
M <-rbind(M,KY)
# Compute the distance between the clusters and the hold out state.
# This is a pretty silly way to do this but it works.
D <- as.matrix(dist(as.matrix(M),diag=TRUE,upper=TRUE))["Kentucky",]
names(D) = rownames(M)
KYclust = which.min(D[-length(D)])
memb[memb==KYclust]
# Now cluster the full set of states and compare the results.
hc <- hclust(dist(A), "ave")
memb <- cutree(hc, k = 10)
a=memb[which(names(memb)=="Kentucky")]
memb[memb==a]
In contrast to k-means, clusters found by hclust can be of arbitrary shape.
The distance to the nearest cluster center therefore is not always meaningful.
Doing a 1-nearest-neighbor-style assignment is probably better.
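A minimal sketch of that idea on the toy example from the question (five copies each of three constant vectors): cutree turns the tree into cluster memberships, and a new vector gets the cluster of its nearest training point. The new_point values are invented for illustration and Euclidean distance is an assumption.

```r
# Toy data from the question: 5 rows each of (1,...), (2,...), (3,...)
x <- rbind(matrix(1, 5, 5), matrix(2, 5, 5), matrix(3, 5, 5))

# Build the tree, then cut it into 3 clusters to get memberships
hc <- hclust(dist(x), method = "average")
memb <- cutree(hc, k = 3)
split(seq_len(nrow(x)), memb)  # row indices contained in each cluster

# Hypothetical new vector (values made up for illustration)
new_point <- c(2.2, 1.9, 2.1, 2.0, 2.3)

# 1-nearest-neighbor assignment: cluster of the closest training row
d <- sqrt(colSums((t(x) - new_point)^2))
nearest <- which.min(d)
new_cluster <- memb[nearest]
```

Unlike the centroid approach, this uses only distances to actual observations, so it also behaves sensibly for non-convex clusters.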