I have proceeded with clustering storm energy data using different clustering methods (kmeans, hclust, agnes, fanny) in R, but even though it is easy to choose the best method for my work conceptually, I need a computational (not theoretical) method to compare and evaluate the methods via their results. Do you know whether something like that exists?
Thanks in advance,
Thanks for the question; I learnt that you can compute the optimal number of clusters using the eclust function from the factoextra package.
Using the kmeans demo from here:
# Load and scale the dataset
data("USArrests")
DF <- scale(USArrests)
When the data is not scaled, the clustering results might not be reliable [example](http://stats.stackexchange.com/questions/140711/why-does-gap-statistic-for-k-means-suggest-one-cluster-even-though-there-are-ob).
library("factoextra")
# Enhanced k-means clustering
res.km <- eclust(DF, "kmeans")
# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)
Comparison of Clustering Functions:
You can use all the available methods and compute the optimal number of clusters with:
clusterFuncList <- c("kmeans", "pam", "clara", "fanny", "hclust", "agnes", "diana")
resultList <- do.call(rbind, lapply(clusterFuncList, function(x) {
  cat("Begin clustering for function:", x, "\n")
  # For each clustering function find the optimal number of clusters;
  # to disable plotting use graph = FALSE
  clustObj <- eclust(DF, x, graph = FALSE)
  cat("End clustering for function:", x, "\n\n\n")
  # return the optimal number of clusters for each clustering function
  data.frame(clustFunc = x, optimalNumbClusters = clustObj$nbclust,
             stringsAsFactors = FALSE)
}))
# >resultList
# clustFunc optimalNumbClusters
# 1 kmeans 4
# 2 pam 4
# 3 clara 5
# 4 fanny 5
# 5 hclust 4
# 6 agnes 4
# 7 diana 4
Gap Statistic, i.e. a goodness-of-fit measure:
The "gap statistic" is used as a measure of goodness of fit for clustering algorithms; see the paper by Tibshirani, Walther and Hastie (2001).
For a fixed, user-defined number of clusters we can compare the gap statistic of each clustering algorithm with the clusGap function from the cluster package:
numbClusters = 5
library(cluster)
clusterFuncFixedK = c("kmeans", "pam", "clara", "fanny")
gapStatList <- do.call(rbind,lapply(clusterFuncFixedK,function(x) {
cat("Begin clustering for function:",x,"\n")
set.seed(42)
#For each clustering function compute gap statistic
gapStatBoot=clusGap(DF,FUNcluster=get(x),K.max=numbClusters)
gapStatVec= round(gapStatBoot$Tab[,"gap"],3)
gapStat_at_AllClusters = paste(gapStatVec,collapse=",")
gapStat_at_chosenCluster = gapStatVec[numbClusters]
#return gap statistic for each clustering function
cat("End clustering for function:",x,"\n\n\n")
resultDF = data.frame(clustFunc = x, gapStat_at_AllClusters = gapStat_at_AllClusters,gapStat_at_chosenCluster = gapStat_at_chosenCluster, stringsAsFactors=FALSE)
}))
# >gapStatList
# clustFunc gapStat_at_AllClusters gapStat_at_chosenCluster
#1 kmeans 0.184,0.235,0.264,0.233,0.27 0.270
#2 pam 0.181,0.253,0.274,0.307,0.303 0.303
#3 clara 0.181,0.253,0.276,0.311,0.315 0.315
#4 fanny 0.181,0.23,0.313,0.351,0.478 0.478
The table above shows the gap statistic of each algorithm at each number of clusters from k = 1 to 5. Column 3, gapStat_at_chosenCluster, holds the gap statistic at k = 5 clusters. The higher the gap statistic, the better the partitioning; hence, at k = 5 clusters, fanny performs best relative to the other algorithms on the USArrests dataset.
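The hierarchical methods (hclust, agnes, diana) do not expose the FUNcluster(x, k) interface that clusGap expects, which is why they were left out of the fixed-k comparison above. A small wrapper, sketched here along the lines of the cluster package examples, lets you include hclust as well:
library(cluster)
# wrap hclust + cutree so it matches clusGap's FUNcluster(x, k) interface
hclusCut <- function(x, k) {
  list(cluster = cutree(hclust(dist(x), method = "ward.D2"), k = k))
}
set.seed(42)
clusGap(DF, FUNcluster = hclusCut, K.max = numbClusters)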
I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Edit: here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0), especially one with a mean nearly identical to that of the second cluster. This only appears when specifically asking for a univariate, equal-variance model. Using, for example, modelNames="V", or leaving it at the default, does not produce this problem.
This thread: Cluster contains no observations describes a similar problem, but if I understand correctly, that appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!
As you noted the mean of cluster 1 and 2 are extremely similar, and it so happens that there's quite a lot of data there (see spike on histogram):
set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value,br=50)
abline(v=fit$parameters$mean,
col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"),lty=8)
Briefly, mclust fits a Gaussian mixture model (GMM), a probabilistic model that estimates the means and variances of the clusters as well as the probability of each point belonging to each cluster. This is unlike k-means, which gives a hard assignment. The model's likelihood is built from these per-point membership probabilities; you can check the details in mclust's publication.
In this model, the means of cluster 1 and cluster 2 are near but their expected proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that if a data point lies around the means of clusters 1 and 2, it will consistently be assigned to cluster 2. For example, let's predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability. So, for the same example, everything is assigned to cluster 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
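As a quick check (not part of the original answer), the hard labels can be reproduced by taking the row-wise argmax over the membership matrix $z:
# reproduce $classification from the membership probabilities in $z
z <- predict(fit, 1350:1400)$z
head(apply(z, 1, which.max))  # matches predict(fit, 1350:1400)$classification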
To answer your question: no, you did not do anything wrong; it is a pitfall, at least with this implementation of GMM. I would say it is a bit of overfitting, but you can basically keep only the clusters that actually have members.
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv,what="classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need to use a Gaussian mixture with equal-variance (spherical) components, consider using a fuzzy k-means:
library(ClusterR)
# fit_kmeans was not shown in the original answer; assumed here to be a fuzzy k-means fit via KMeans_rcpp
fit_kmeans <- KMeans_rcpp(matrix(data$value, ncol = 1), clusters = 3, fuzzy = TRUE)
plot(NULL, xlim = range(data$value), ylim = c(0, 4), ylab = "cluster", yaxt = "n", xlab = "values")
points(data$value, fit_kmeans$clusters, pch = 19, cex = 0.1, col = factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.
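For reference, a minimal sketch of that route, assuming ClusterR's GMM/predict_GMM interface and an arbitrary choice of three components:
library(ClusterR)
X <- matrix(data$value, ncol = 1)
gmm_fit <- GMM(X, gaussian_comps = 3, dist_mode = "eucl_dist",
               seed_mode = "random_subset", km_iter = 10, em_iter = 10)
# predict_GMM returns per-point cluster probabilities and hard labels
pred <- predict_GMM(X, gmm_fit$centroids, gmm_fit$covariance_matrices, gmm_fit$weights)
table(pred$cluster_labels)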
Context: I am trying to 1) define cluster based on certain individuals and 2) assign others individuals to the defined clusters.
What has been done: I use FactoMineR functions PCA() and HCPC() according to the workflow described by Husson, F., Josse, J., Pages, J., 2010. Principal component methods - hierarchical clustering - partitional clustering: why would we need to choose for visualizing data? Technical Report – Agrocampus 17.
Question: is it possible to assign each supplementary individual (PCA(..., ind.sup = ***)) to a cluster defined in the hierarchical clustering analysis?
Similar question: this question has already been asked on Stack Overflow here, but that was 5 years ago and the answer does not fit into the HCPC workflow.
Please find below some code using the base R dataset mtcars:
if(!require(FactoMineR)){install.packages("FactoMineR")}
library(FactoMineR)
if(!require(factoextra)){install.packages("factoextra")}
library(factoextra)
# lets use mtcars dataset for this question
head(mtcars)
# some individuals are considered as supplementary:
mtcars[22:nrow(mtcars),]
# HCPC workflow
res.pca = PCA(mtcars,
ind.sup = c(22:nrow(mtcars)), # the last 10 individuals are supplementary
scale.unit = TRUE,
ncp = 5,
graph = TRUE)
fviz_pca_ind(res.pca) # here supplementary individuals are included
res.hcpc = HCPC(res.pca,
nb.clust = -1, # automatic tree cut
min = 3,
max = NULL,
graph = FALSE,
kk=Inf) # no k-means pre-processing
# here are the two results needed to illustrate my question
fviz_dend(res.hcpc, show_labels = TRUE)
res.hcpc$desc.ind # notice that individuals considered as supplementary are not included in any cluster
Expected output: something like the dendrogram produced above, but where supplementary individuals are assigned to a cluster and clearly identified, with Fiat X-9 and Lotus Europa (supplementary individuals) assigned to the cluster containing Fiat 128, for example.
How do I determine the optimal number of clusters when using hierarchical clustering, given that I only have a distance matrix because I am measuring pairwise (Levenshtein) distances? I have looked at other posts, but they all use k-means or hierarchical clustering on numeric data, not on string data like that shown below. Any suggestions on how to use R to find the number of clusters?
set.seed(1)
rstr <- function(n,k){ # vector of n random char(k) strings
sapply(1:n,function(i) {do.call(paste0,as.list(sample(letters,k,replace=T)))})
}
str<- c(paste0("aa",rstr(10,3)),paste0("bb",rstr(10,3)),paste0("cc",rstr(10,3)))
# Levenshtein Distance
d <- adist(str)
rownames(d) <- str
hc <- hclust(as.dist(d))
Several statistics can be used.
Look for example at the WeightedCluster package that can compute and plot a series of such statistics.
To illustrate, you get the optimal number of groups for each available statistics as follows:
library("WeightedCluster")
hcRange <- as.clustrange(hc, diss=as.dist(d), ncluster=6)
summary(hcRange)
## 1. N groups 1. stat
## PBC 3 0.8799136
## HG 3 1.0000000
## HGSD 3 0.9987651
## ASW 3 0.4136550
## ASWw 3 0.4722895
## CH 3 8.3605263
## R2 6 0.4734561
## CHsq 3 20.6538462
## R2sq 6 0.6735039
## HC 3 0.0000000
You can also plot the statistics (here we show the weighted average silhouette width, ASWw, Hubert's Gamma, HG, and the point-biserial correlation, PBC) for all the computed solutions:
plot(hcRange, stat = c("ASWw", "HG", "PBC"), lwd = 2)
The best solution seems to be the three-group solution.
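As a quick sanity check (not part of the original answer), you can cut the tree into three groups and cross-tabulate them against the aa/bb/cc prefixes used to generate the strings:
groups <- cutree(hc, k = 3)
# rows: known prefix, columns: recovered group
table(substr(str, 1, 2), groups)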
I want to do a Kmeans clustering on a dataset (namely, Sample_Data) with three variables (columns) such as below:
A B C
1 12 10 1
2 8 11 2
3 14 10 1
. . . .
. . . .
. . . .
In the typical way, after scaling the columns and determining the number of clusters, I would use this function in R:
Sample_Data <- scale(Sample_Data)
output_kmeans <- kmeans(Sample_Data, centers = 5, nstart = 50)
But what if there is a preference among the variables? I mean, suppose variable (column) A is more important than the other two variables.
How can I incorporate such weights into the model?
Thank you all
You have to use a weighted k-means clustering, like the one provided in the flexclust package:
https://cran.r-project.org/web/packages/flexclust/flexclust.pdf
The function
cclust(x, k, dist = "euclidean", method = "kmeans",
weights=NULL, control=NULL, group=NULL, simple=FALSE,
save.data=FALSE)
Perform k-means clustering, hard competitive learning or neural gas on a data matrix.
weights An optional vector of weights to be used in the fitting process. Works only in combination with hard competitive learning.
A toy example using iris data:
library(flexclust)
data(iris)
cl <- cclust(iris[,-5], k=3, save.data=TRUE,weights =c(1,0.5,1,0.1),method="hardcl")
cl
kcca object of family ‘kmeans’
call:
cclust(x = iris[, -5], k = 3, method = "hardcl", weights = c(1, 0.5, 1, 0.1), save.data = TRUE)
cluster sizes:
1 2 3
50 59 41
As you can see from the output of cclust, even when using competitive learning the family is always kmeans.
The difference lies in how cluster assignment is done during the training phase:
If method is "kmeans", the classic kmeans algorithm as given by
MacQueen (1967) is used, which works by repeatedly moving all cluster
centers to the mean of their respective Voronoi sets. If "hardcl",
on-line updates are used (AKA hard competitive learning), which work
by randomly drawing an observation from x and moving the closest
center towards that point (e.g., Ripley 1996).
The weights parameter is just a vector of numbers; in general I use values between 0.01 (minimum weight) and 1 (maximum weight).
I had the same problem, and the answer here was not satisfying for me.
What we both wanted was an observation-weighted k-means clustering in R. A good readable example for our question is this link: https://towardsdatascience.com/clustering-the-us-population-observation-weighted-k-means-f4d58b370002
However, the solution of using the flexclust package is not satisfying, simply because the algorithm used is not the "standard" k-means algorithm but the "hard competitive learning" algorithm. The differences are well described above and in the package description.
I looked through many sites and did not find any solution/package in R for performing a "standard" k-means algorithm with weighted observations. I was also wondering why the flexclust package explicitly does not support weights with the standard k-means algorithm. If anyone has an explanation for this, please feel free to share!
So basically you have two options: first, rewrite the flexclust algorithm to enable weights within the standard approach; or second, estimate weighted cluster centroids as starting centroids, perform a standard k-means algorithm with only one iteration, compute new weighted cluster centroids from the resulting assignment, run k-means with one iteration again, and so on until you reach convergence.
I used the second alternative because it was the easier way for me. I used the data.table package; I hope you are familiar with it.
rm(list=ls())
library(data.table)
### gen dataset with sample-weights
dataset <- data.table(iris)
dataset[, weights:= rep(c(1, 0.7, 0.3, 4, 5),30)]
dataset[, Species := NULL]
### initial hclust for estimating weighted centroids
clustering <- hclust(dist(dataset[, c(1:4)], method = 'euclidean'),
method = 'ward.D2')
no_of_clusters <- 4
### estimating starting centroids (weighted)
weighted_centroids <- matrix(NA, nrow = no_of_clusters,
ncol = ncol(dataset[, c(1:4)]))
for (i in (1:no_of_clusters))
{
weighted_centroids[i,] <- sapply(dataset[, c(1:4)][cutree(clustering, k =
no_of_clusters) == i,], weighted.mean, w = dataset[cutree(clustering, k = no_of_clusters) == i, weights])
}
### performing weighted k-means as explained in my post
iter <- 0
cluster_i <- 0
cluster_iminus1 <- 1
## while loop: if number of iteration is smaller than 50 and cluster_i (result of
## current iteration) is not identical to cluster_iminus1 (result of former
## iteration) then continue
while(identical(cluster_i, cluster_iminus1) == F && iter < 50){
# update iteration
iter <- iter + 1
# k-means with weighted centroids and one iteration (may generate warning messages
# as no convergence is reached)
cluster_kmeans <- kmeans(x = dataset[, c(1:4)], centers = weighted_centroids, iter.max = 1)$cluster
# estimating new weighted centroids from the current k-means assignment
weighted_centroids <- matrix(NA, nrow = no_of_clusters,
                             ncol = ncol(dataset[, c(1:4)]))
for (i in (1:no_of_clusters))
{
weighted_centroids[i,] <- sapply(dataset[, c(1:4)][cluster_kmeans == i,],
                                 weighted.mean,
                                 w = dataset[cluster_kmeans == i, weights])
}
# update cluster_i and cluster_iminus1
if(iter == 1) {cluster_iminus1 <- 0} else{cluster_iminus1 <- cluster_i}
cluster_i <- cluster_kmeans
}
## merge final clusters to data table
dataset[, cluster := cluster_i]
If you want to increase the weight of a variable (column), just multiply it by a constant c > 1.
It's trivial to show that this increases the weight in the SSQ optimization objective.
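A minimal sketch of that idea, assuming the Sample_Data from the question with columns in the order A, B, C and weights chosen by the analyst (the values here are arbitrary):
# scale first so the weights act on comparable columns, then stretch column A
w <- c(A = 2, B = 1, C = 1)                 # A counts double (weight 2^2 = 4 in the SSQ)
Xw <- sweep(scale(Sample_Data), 2, w, `*`)  # multiply each column by its weight
output_kmeans_w <- kmeans(Xw, centers = 5, nstart = 50)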
Is it possible to change the average estimator in a region by something different from the mean, like median or geometric mean using the rpart library in R? (or another library)
I believe my tree partitioning is highly affected by extreme values and I would like to build trees showing other estimators.
Thanks!
One of the usual tricks for right-skewed responses would be to take logs. In many applications this makes the response distribution more symmetric and then you don't need to switch from the usual mean predictions.
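For illustration, a minimal sketch of the log-transform route on the cars data used below (not from the original answer); note that exponentiating the mean of the logs amounts to predicting a geometric mean:
library("rpart")
data("cars", package = "datasets")
# fit on the log scale, then back-transform the predictions
rp_log <- rpart(log(dist) ~ speed, data = cars)
exp(predict(rp_log, data.frame(speed = c(10, 15, 20))))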
Another solution for changing the learning of the tree would be to use more robust scores, e.g., ranks etc. The ctree() function from the partykit package offers a nonparametric inference framework for this.
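A minimal sketch of that alternative (assumed usage, again on the cars data):
library("partykit")
data("cars", package = "datasets")
# conditional inference tree: splits chosen by permutation tests on ranks/scores
ct <- ctree(dist ~ speed, data = cars)
predict(ct, data.frame(speed = c(10, 15, 20)))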
Finally, the partykit package also allows you to compute predictions other than the mean from the terminal nodes. You can easily transform rpart trees to party trees via as.party(). A very simple example would be to learn an rpart tree on the cars data
library("rpart")
data("cars", package = "datasets")
rp <- rpart(dist ~ speed, data = cars)
And then transform it to party:
library("partykit")
pr <- as.party(rp)
The tree structure remains unchanged, but you get enhanced plotting and predictions; the default plot method produces a nicer visualization of the tree.
Furthermore, the default predictions on both objects are the same.
nd <- data.frame(speed = c(10, 15, 20))
predict(rp, nd)
## 1 2 3
## 18.20000 39.75000 65.26316
predict(pr, nd)
## 1 2 3
## 18.20000 39.75000 65.26316
However, the latter allows you to specify a FUNction that should be used in each of the nodes. This must be of the form function(y, w) where y is the response and w are the case weights. As we haven't used any weights here, we can simply ignore that argument and do:
predict(pr, nd, FUN = function(y, w) mean(y))
## 1 2 3
## 18.20000 39.75000 65.26316
predict(pr, nd, FUN = function(y, w) median(y))
## 1 2 3
## 18 35 64
predict(pr, nd, FUN = function(y, w) quantile(y, 0.9))
## 1 2 3
## 28.0 57.0 92.2
And so on... See the package vignettes for more details.