Comparing Kmeans and Agglomerative Clustering - r

I'm working to compare kmeans and agglomerative clustering methods for a country-level analysis, but I'm struggling because they are assigning numerical clusters differently, so even if a country (let's say Canada) is in a cluster by itself, that cluster is '1' in one version and '2' in another. How would I reconcile that both methods clustered this the same way, but assigned a different order?
I tried some hacky things with averaging the two, but am struggling to figure the logic out.
   geo_group      a_cluster k_cluster
   <chr>              <int>     <int>
 1 United States          1         1
 2 Canada                 2         3
 3 United Kingdom         3         5
 4 Australia              4         5
 5 Germany                4         5
 6 Mexico                 5         6
 7 France                 5         5
 8 Sweden                 6         8
 9 Brazil                 6         6
10 Netherlands            6         6

Agglomerative clustering and kmeans are different methods to define a partition of a set of samples (e.g. samples 1 and 2 belong to cluster A and sample 3 belongs to cluster B).
kmeans assigns each sample to its nearest cluster centroid, measured by Euclidean distance. This is only possible for numerical features and is often only useful for spatial data (e.g. longitude and latitude), because there Euclidean distance is just the distance as the crow flies.
Agglomerative clustering, however, can be used with many other dissimilarity measures, not just metric distances. Jaccard dissimilarity, for example, allows categorical as well as numerical data.
Furthermore, the number of clusters can be chosen afterwards, whereas in kmeans the chosen k affects the clustering in the first place. In agglomerative clustering, clusters are merged together in a hierarchical manner. This merging can use single, complete, or average linkage, among others, so there is not just one but many different agglomerative algorithms.
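As a rough illustration of choosing the number of clusters afterwards, here is a minimal sketch in base R. The toy matrix `x` and the choice of average linkage are assumptions made for the example; they are not taken from the question.

```r
# Toy data (hypothetical): two loose groups of 10 points in 2 dimensions
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))

# Agglomerative clustering: build the full merge tree once ...
tree <- hclust(dist(x), method = "average")  # "single" or "complete" also work

# ... and pick the number of clusters afterwards by cutting the tree
h2 <- cutree(tree, k = 2)
h3 <- cutree(tree, k = 3)

# kmeans needs k up front; a different k means a new run
km2 <- kmeans(x, centers = 2)

# Cross-tabulate the two partitions; the label numbers themselves are arbitrary
print(table(hclust = h2, kmeans = km2$cluster))
```

The same tree serves every value of k, while kmeans has to be rerun for each one.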
It is very normal to get different results from these methods.
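For the labelling problem itself, one possible approach (a sketch, not the only way) is to renumber each label vector by order of first appearance, so that identical partitions end up with identical labels, and then cross-tabulate. The two vectors below are copied from the table in the question; `relabel` is a hypothetical helper name.

```r
a_cluster <- c(1, 2, 3, 4, 4, 5, 5, 6, 6, 6)
k_cluster <- c(1, 3, 5, 5, 5, 6, 5, 8, 6, 6)

# Renumber clusters by order of first appearance
relabel <- function(x) match(x, unique(x))

a_std <- relabel(a_cluster)
k_std <- relabel(k_cluster)

# TRUE only if both methods produced exactly the same partition
same_partition <- identical(a_std, k_std)
print(same_partition)

# The cross-tabulation shows which clusters correspond to each other
print(table(agglomerative = a_std, kmeans = k_std))
```

With this data the partitions are not identical, but Canada (row 2) is a cluster of its own in both and receives the same label after renumbering.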

Related

Is it possible to write a function in R to perform a discriminant analysis with a cumulative, variable number of factors?

I am attempting to perform a linear discriminant analysis on geometric morphometric data. Because geometric morphometric data typically produces large numbers of variables, and discriminant analyses require more data points than variables to accurately classify specimens, a common solution in the literature is to perform a principal component analysis and then use as input for the LDA a variable number of PCs that represents less than 99% of the cumulative variance but returns the highest reclassification rate.
Right now the way I am doing this is running the LDA in R (using the functions in the Morpho and MASS packages) under every possible number of PCs used and noting the classification accuracy by hand until I found the lowest number of PCs that returned the highest accuracy, but this is highly inefficient.
I was wondering if there was any way to write a function that would run an LDA for all possible numbers of the first N PCs (up to a certain, user defined level representing 99% of the cumulative variance) and return the percent reclassification rate for each level, producing something like the following:
PCs  percent_accuracy
 20  72.2
 19  76.3
 18  77.4
 17  80.1
 16  75.4
 15  50.7
...  ...
  1  20.2
So row 1 would be the reclassification rate when the first 20 PCs are used, row 2 is the rate when the first 19 PCs are used, and so on and so forth.
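One way such a function might look is sketched below. It is only an illustration under several assumptions: `lda_by_pcs` is a hypothetical name, the built-in `iris` data stands in for morphometric data, the PCA comes from `prcomp`, and accuracy is the leave-one-out rate that `MASS::lda` returns with `CV = TRUE` (the Morpho functions are not used here).

```r
library(MASS)

# For each k from the largest allowed number of PCs down to 1,
# run an LDA on the first k PCs and record the leave-one-out accuracy.
lda_by_pcs <- function(x, groups, max_var = 0.99) {
  pca <- prcomp(x)
  cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  # Largest number of PCs whose cumulative variance stays below max_var
  n_max <- max(1, sum(cum_var < max_var))
  acc <- sapply(n_max:1, function(k) {
    fit <- lda(pca$x[, 1:k, drop = FALSE], grouping = groups, CV = TRUE)
    100 * mean(as.character(fit$class) == as.character(groups))
  })
  data.frame(PCs = n_max:1, percent_accuracy = round(acc, 1))
}

acc <- lda_by_pcs(iris[, 1:4], iris$Species)
print(acc)
```

The row with the highest `percent_accuracy` and the smallest `PCs` could then be picked with `acc[order(-acc$percent_accuracy, acc$PCs), ][1, ]`.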

How to score the clusters obtained through igraph community detection

I have constructed a gene co-expression network from RNA-seq data. The network file is an edge list of about 1 GB, created by calculating the Pearson correlation of each gene pair and keeping the gene pairs with a correlation > 95%.
I have clustered this gene network (edge list) using the "cluster_louvain" community detection algorithm from the igraph R package and obtained 534 subclusters. Many of the subclusters contain only one vertex.
How can I score the clusters in order to identify the best ones, i.e. those that have more vertices and edges and are important for further study?
You do not provide any data, so I will illustrate with an arbitrary example.
library(igraph)
set.seed(1234)
g = erdos.renyi.game(20,0.1)
plot(g)
CL = cluster_louvain(g)
plot(g, vertex.color=CL$membership)
Now you can get the number of vertices in each cluster and the number of edges that connect them.
## number of vertices per cluster
table(CL$membership)
1 2 3 4 5 6 7
1 1 3 2 3 5 5
## number of edges within each cluster
NumClust = max(CL$membership)
sapply(1:NumClust, function(i)
  ecount(induced_subgraph(g, which(CL$membership==i))))
[1] 0 0 2 1 2 4 5
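The two counts can be combined into a simple score per cluster, for example the internal edge density. The sketch below uses base R only and a hypothetical toy edge list; with igraph, the inputs could be obtained from `as_edgelist(g)` and `membership(CL)`. The function name `score_clusters` and the choice of density as a score are assumptions, not part of igraph.

```r
# Hypothetical toy input: one row per undirected edge, vertices numbered 1..7
edges <- data.frame(from = c(1, 1, 2, 4, 5, 6),
                    to   = c(2, 3, 3, 5, 6, 7))
membership <- c(1, 1, 1, 2, 2, 2, 3)  # vertex i belongs to cluster membership[i]

score_clusters <- function(edges, membership) {
  ids <- sort(unique(membership))
  # An edge is internal if both endpoints lie in the same cluster
  internal <- membership[edges$from] == membership[edges$to]
  n_vertices <- as.integer(table(factor(membership, levels = ids)))
  n_edges <- as.integer(table(factor(membership[edges$from][internal], levels = ids)))
  # Density: internal edges divided by the number of possible vertex pairs
  possible <- n_vertices * (n_vertices - 1) / 2
  density <- ifelse(possible > 0, n_edges / possible, NA)
  out <- data.frame(cluster = ids, n_vertices, n_edges, density)
  out[order(-out$n_edges, -out$n_vertices), ]
}

scores <- score_clusters(edges, membership)
print(scores)
```

Singleton clusters get a density of `NA` and sort to the bottom, which makes them easy to drop.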

Post-hoc test for lmer Error message

I am running a linear mixed-effects model in R, and I was able to run my code successfully and get results.
My code is as follows:
library(lme4)
library(multcomp)
read.csv(file="bh_new_all_woas.csv")
whb=read.csv(file="bh_new_all_woas.csv")
attach(whb)
head(whb)
whb.model = lmer(Density ~ distance + (1|Houses) + Cats, data = whb)
summary(whb.model)
However, I would like to compare the levels of my distance fixed factor, which has 4 levels. I tried running lsmeans as follows:
lsmeans(whb.model, pairwise ~ distance, adjust = "tukey")
This error popped up:
Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments
I also tried glht using this code:
glht(whb.model, linfct=mcp(distance="tukey"))
and got the same results. A sample of my data is as follows:
Houses distance abund density
House 1 20 0 0
House 1 120 6.052357 0.00077061
House 1 220 3.026179 0.000385305
House 1 320 7.565446 0.000963263
House 2 20 0 0
House 2 120 4.539268 0.000577958
House 2 220 6.539268 0.000832606
House 2 320 5.026179 0.000639953
House 3 20 0 0
House 3 120 6.034696 0.000768362
House 3 220 8.565446 0.001090587
House 3 320 5.539268 0.000705282
House 4 20 0 0
House 4 120 6.052357 0.00077061
House 4 220 8.052357 0.001025258
House 4 320 2.521606 0.000321061
House 5 20 4.513089 0.000574624
House 5 120 6.634916 0.000844784
House 5 220 4.026179 0.000512629
House 5 320 5.121827 0.000652131
House 6 20 2.513089 0.000319976
House 6 120 9.308185 0.001185155
House 6 220 7.803613 0.000993587
House 6 320 6.130344 0.00078054
House 7 20 3.026179 0.000385305
House 7 120 9.052357 0.001152582
House 7 220 7.052357 0.000897934
House 7 320 6.547785 0.00083369
House 8 20 5.768917 0.000734521
House 8 120 4.026179 0.000512629
House 8 220 4.282007 0.000545202
House 8 320 7.537835 0.000959747
House 9 20 3.513089 0.0004473
House 9 120 5.026179 0.000639953
House 9 220 8.052357 0.001025258
House 9 320 9.573963 0.001218995
House 10 20 2.255828 0.000287221
House 10 120 5.255828 0.000669193
House 10 220 10.060874 0.001280991
House 10 320 8.539268 0.001087254
Does anyone have any suggestions on how to fix this problem?
So which problem is it that needs fixing? One issue is the model, and another is the follow-up to it.
The model displayed is fitted using the fixed effects ~ distance + Cats. Now, Cats is not in the dataset provided, so that's an issue. But aside from that, distance enters the model as a quantitative predictor (if I am to believe the read.csv statements etc.). This model implies that changes in the expected Density are proportional to changes in distance. Is that a reasonable model? Maybe, maybe not. But is it reasonable to follow that up with multiple comparisons for distance? Definitely not. From this model, the change between distances of 20 to 120 will be exactly the same as the change between distances of 120 and 220. The estimated slope of distance, from the model summary, embodies everything you need to know about the effect of distance. Multiple comparisons should not be done.
Now, one might guess from the question that what you really had wanted to do was to fit a model where each of the four distances has its own effect, separate from the other distances. That would require a model with factor(distance) as a predictor; in that case, factor(distance) will account for 3 degrees of freedom rather than 1 d.f. for distance as a quantitative predictor. For such a model, it is appropriate to follow it up with multiple comparisons (unless possibly distance also interacts with some other predictors). If you were to fit such a model, I believe you will find there will be no errors in your lsmeans call (though you need a library("lsmeans") statement, not shown in your code).
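To make the degrees-of-freedom difference concrete, here is a small base R sketch. It uses only the first three houses from the sample in the question, and Cats is left out because it does not appear in that sample; the commented `lmer` line at the end is how the change might look in the original model, not something run here.

```r
# First three houses from the sample data in the question
whb <- data.frame(
  Houses   = rep(c("House 1", "House 2", "House 3"), each = 4),
  distance = rep(c(20, 120, 220, 320), times = 3),
  density  = c(0, 0.00077061, 0.000385305, 0.000963263,
               0, 0.000577958, 0.000832606, 0.000639953,
               0, 0.000768362, 0.001090587, 0.000705282)
)

# distance as a quantitative predictor: a single slope column
X_num <- model.matrix(~ distance, data = whb)
# distance as a factor: one indicator column per non-reference level
X_fac <- model.matrix(~ factor(distance), data = whb)

df_num <- ncol(X_num) - 1  # subtract the intercept column
df_fac <- ncol(X_fac) - 1
print(c(numeric = df_num, factor = df_fac))

# With lme4 loaded, the factor version of the model might read:
# lmer(density ~ factor(distance) + (1 | Houses), data = whb)
```

The numeric coding spends 1 d.f. on distance and the factor coding spends 3, which is what makes pairwise comparisons between the four distances meaningful.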
Ultimately, getting programs to run without error is not necessarily the same as producing sensible or meaningful answers. So my real answer is to consider carefully what is a reasonable model for the data. I might suggest seeking one-on-one help from a statistical consultant to make sure you understand the modeling issues. Once that is settled, then appropriate interpretation of that model is the next step; and again, that may require some advice.
Additional minor notes about the code provided:
The first read.csv call accomplishes nothing because it doesn't store the data.
R is case-sensitive, so technically, Density isn't in your dataset either.
When the data frame is attached, you don't also need the data argument in the lmer call.
The apparent fact that Houses has levels "House 1", "House 2", etc. is messed-up in your listing because the comma delimiters in your data file are not shown.

Difficulty Comparing Clusters in R - Inconsistent Labeling

I am attempting to run a Monte Carlo simulation that compares two different clustering techniques. The following code generates a dataset according to random clustering and then applies two clustering techniques (kmeans and sparse k-means).
My issue is that these three techniques use different labels for their clusters. For example, what I call cluster 1, kmeans might call it cluster 2 and sparse k means might call it cluster 3. When I regenerate and re-run, the differences in labeling do not appear to be consistent. Sometimes the labels agree, sometimes they do not.
Can anyone provide a way to 'standardize' these labels so I can run n iterations of the simulation without having to manually resolve labeling differences each time?
My code:
library(sparcl)
library(flexclust)
x.generate=function(n,p,q,mu){
  c=sample(c(1,2,3),n,replace=TRUE)
  x=matrix(rnorm(p*n),nrow=n)
  for(i in 1:n){
    if(c[i]==1){
      for(j in 1:q){
        x[i,j]=rnorm(1,mu,1)
      }
    }
    if(c[i]==2){
      for(j in 1:q){
        x[i,j]=rnorm(1,-mu,1)
      }
    }
  }
  return(list('sample'=x,'clusters'=c))
}
x=x.generate(20,50,50,1)
w=KMeansSparseCluster.permute(x$sample,K=3,silent=TRUE)
kms.out = KMeansSparseCluster(x$sample,K=3,wbounds=w$bestw,silent=TRUE)
km.out = kmeans(x$sample,3)
tabs=table(x$clusters,kms.out$Cs)
tab=table(x$clusters,km.out$cluster)
CER=1-randIndex(tab)
Sample output of x$clusters, km.out$cluster, kms.out$Cs
> x$clusters
[1] 3 2 2 2 1 1 2 2 3 2 1 1 3 1 1 3 2 2 3 1
> km.out$cluster
[1] 3 1 1 1 2 2 1 1 3 1 2 2 3 2 2 3 1 1 3 2
> kms.out$Cs
[1] 1 2 2 2 3 3 2 2 1 2 3 3 1 3 3 1 2 2 1 3
One of the most used criteria of similarity is the Jaccard distance. See for instance Ben-Hur, A., Elisseeff, A., & Guyon, I. (2002). A stability based method for discovering structure in clustered data. Pacific Symposium on Biocomputing (pp. 6-17).
Others include
Fowlkes, E. B., & Mallows, C. L. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78, 553-569.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193-218.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66, 846-850.
As @Joran points out, the clusters are nominal, and thus do not have an order per se.
Here are 2 heuristics that come to my mind:
Starting from the tables you calculate already: when the clusters are well aligned, the trace of the tab matrix is maximal.
If the number of clusters is small, you could find the maximum by trying all permutations of 1 : n of method 2 against the $n$ clusters of method 1. If it is too large, you may go with a heuristic that first puts the biggest match onto the diagonal and so on.
Similarly, the trace of the distance matrix between the centroids of the 2 methods should be minimal.
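The first heuristic (try all permutations and keep the one with the largest trace) could be sketched in base R as follows. The label vectors are copied from the sample output in the question; `all_perms` and `align_labels` are hypothetical helper names, and the code assumes both vectors use the labels 1..k. Brute force is only practical for a small number of clusters.

```r
truth <- c(3, 2, 2, 2, 1, 1, 2, 2, 3, 2, 1, 1, 3, 1, 1, 3, 2, 2, 3, 1)
km    <- c(3, 1, 1, 1, 2, 2, 1, 1, 3, 1, 2, 2, 3, 2, 2, 3, 1, 1, 3, 2)
kms   <- c(1, 2, 2, 2, 3, 3, 2, 2, 1, 2, 3, 3, 1, 3, 3, 1, 2, 2, 1, 3)

# All permutations of a vector, as a list
all_perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in all_perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# Relabel `labels` so that its agreement with `reference` is maximal
align_labels <- function(reference, labels) {
  k <- max(reference, labels)
  tab <- table(factor(reference, levels = 1:k), factor(labels, levels = 1:k))
  perms <- all_perms(1:k)
  # perm[j] is the reference label given to cluster j of `labels`;
  # the sum below is the trace of the table after permuting its columns
  agreement <- sapply(perms, function(perm) sum(tab[cbind(perm, 1:k)]))
  best <- perms[[which.max(agreement)]]
  best[labels]
}

km_aligned  <- align_labels(truth, km)
kms_aligned <- align_labels(truth, kms)
print(table(truth, km_aligned))
```

After alignment, both `km_aligned` and `kms_aligned` can be compared with `truth` element by element, so the loop over simulation runs no longer needs manual relabelling.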
K-means is a randomized algorithm. You must expect them to be randomly ordered, actually.
That is why the established evaluation methods for clusters (read the Wikipedia article on clustering, in particular the section on "external validation") do not assume that there is a one-on-one mapping of clusters.
Even worse, one clustering algorithm may find 3 clusters, another one may find 4 clusters.
There are also hierarchical clustering algorithms. There each object can belong to many clusters, as clusters can be nested in each other.
Also some algorithms such as DBSCAN have a notion of "noise": These objects do not belong to any cluster.
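As an example of an external measure that needs no mapping between labels, here is a minimal base R sketch of the plain (unadjusted) Rand index from Rand (1971). It only asks, for every pair of objects, whether both clusterings agree on putting the pair together or apart. The function name `rand_index` is hypothetical, and the vectors are copied from the sample output in the question.

```r
truth <- c(3, 2, 2, 2, 1, 1, 2, 2, 3, 2, 1, 1, 3, 1, 1, 3, 2, 2, 3, 1)
km    <- c(3, 1, 1, 1, 2, 2, 1, 1, 3, 1, 2, 2, 3, 2, 2, 3, 1, 1, 3, 2)

rand_index <- function(a, b) {
  same_a <- outer(a, a, "==")   # TRUE where a pair shares a cluster in a
  same_b <- outer(b, b, "==")   # TRUE where a pair shares a cluster in b
  pairs <- upper.tri(same_a)    # each unordered pair once
  mean(same_a[pairs] == same_b[pairs])
}

ri_same <- rand_index(truth, km)
ri_diff <- rand_index(truth, rep(1:2, each = 10))
print(c(ri_same, ri_diff))
```

The first value is 1 because `truth` and `km` describe the same partition under different labels; the second is lower because the partitions differ.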
I would not recommend the Jaccard distance (even though it is famous and well established) as it is hugely influenced by cluster sizes. This is due to the fact that it counts node pairs rather than nodes. I also find the methods with a statistical flavour to be missing the point. The point is that the space of partitions (clusterings) has a beautiful lattice structure. Two distances that work beautifully within that structure are the Variation of Information (VI) distance and the split/join distance. See also this answer on stackexchange:
https://stats.stackexchange.com/questions/24961/comparing-clusterings-rand-index-vs-variation-of-information/25001#25001
It includes examples of all three distances discussed here (Jaccard, VI, split/join).
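For reference, the VI distance can be computed in a few lines of base R from the contingency table, using VI = H(A) + H(B) - 2 I(A, B). This is only a sketch with natural logarithms; the function name is hypothetical and the vectors are copied from the sample output in the question.

```r
truth <- c(3, 2, 2, 2, 1, 1, 2, 2, 3, 2, 1, 1, 3, 1, 1, 3, 2, 2, 3, 1)
km    <- c(3, 1, 1, 1, 2, 2, 1, 1, 3, 1, 2, 2, 3, 2, 2, 3, 1, 1, 3, 2)

variation_of_information <- function(a, b) {
  joint <- table(a, b) / length(a)         # joint distribution of the labels
  entropy <- function(p) {
    p <- p[p > 0]                          # treat 0 * log(0) as 0
    -sum(p * log(p))
  }
  h_a <- entropy(rowSums(joint))
  h_b <- entropy(colSums(joint))
  mutual_info <- h_a + h_b - entropy(joint)
  h_a + h_b - 2 * mutual_info
}

vi_same <- variation_of_information(truth, km)
vi_diff <- variation_of_information(truth, rep(1:2, each = 10))
print(c(vi_same, vi_diff))
```

VI is zero (up to rounding) exactly when the two partitions coincide, regardless of how the clusters are numbered.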

SPSS K-means & R

What would be the best function/package to use in R to try and replicate the K-means clustering method used in SPSS? Here is an example of the syntax I would use in SPSS:
QUICK CLUSTER VAR1 TO VAR10
/MISSING=LISTWISE
/CRITERIA=CLUSTER(5) MXITER(50) CONVERGE(.02)
/METHOD=KMEANS(NOUPDATE)
Thanks!
In SPSS, use the /PRINT INITIAL option. This will give you the initial cluster centers, which seem to be fixed in SPSS, but random in R (see ?kmeans for the parameter centers).
If you use the printed initial cluster centers from the SPSS output and the algorithm="Lloyd" parameter in kmeans, you should get the same results (at least it worked for me, testing with several repetitions).
Example of an SPSS-output of the initial cluster centers:
        Cluster
        Cl1  Cl2  Cl3  Cl4
Var A     1    1    4    3
Var B     4    1    4    1
Var C     1    1    1    4
Var D     1    4    4    1
Var E     1    4    1    2
Var F     1    4    4    3
This table, replicated as matrix in R, with kmeans computation:
mat <- matrix(c(1,1,4,3,4,1,4,1,1,1,1,4,1,4,4,1,1,4,1,2,1,4,4,3), nrow=4, ncol=6)
kmeans(na.omit(data.frame), centers=mat, iter.max=20, algorithm="Lloyd")
Be sure to use the same maximum number of iterations in SPSS and in R's kmeans, and use the Lloyd method in R's kmeans.
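A self-contained way to check that fixed initial centers make kmeans reproducible is sketched below. The centers are the ones from the SPSS table above; the data set `dat` is hypothetical (noisy points generated around those centers), since the original data is not shown.

```r
# Initial centers from the SPSS output: rows = clusters, columns = variables
mat <- matrix(c(1,1,4,3, 4,1,4,1, 1,1,1,4, 1,4,4,1, 1,4,1,2, 1,4,4,3),
              nrow = 4, ncol = 6)

# Hypothetical data: 25 noisy observations around each center
set.seed(42)
dat <- mat[rep(1:4, each = 25), ] + matrix(rnorm(100 * 6, sd = 0.3), ncol = 6)

# Two runs with the same fixed starting centers and the Lloyd algorithm
fit1 <- kmeans(dat, centers = mat, iter.max = 20, algorithm = "Lloyd")
fit2 <- kmeans(dat, centers = mat, iter.max = 20, algorithm = "Lloyd")

# With fixed centers there is no random start, so the runs agree
reproducible <- identical(fit1$cluster, fit2$cluster)
print(reproducible)
print(table(fit1$cluster))
```

Because the rows of `mat` are the clusters, cluster 1 in R starts from the same center as Cl1 in SPSS, which also keeps the cluster numbering comparable.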
However, I don't know whether it's better to have a fixed or a random choice of initial centers. I personally like the random choice: I compute a linear discriminant analysis with the cluster groups found to assess the classification accuracy, and rerun the kmeans clustering until I have a satisfying group classification.
Edit: I found this posting where the SPSS procedure of selecting initial clusters is described. Perhaps somebody knows of an R implementation?
