EM clustering instead of Kmeans - r

I have the following script, which I use to find the best number of clusters with kmeans. How can I change it to use the EM clustering technique instead of kmeans? Thanks in advance!
Reproducible example:
ourdata <- scale(USArrests)
wss <- (nrow(ourdata)-1)*sum(apply(ourdata,2,var))
for (i in 2:10) wss[i] <- sum(kmeans(ourdata,
centers=i)$withinss)
plot(1:10, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")

The EMCluster package offers a variety of functions for running EM model-based clustering. An example of finding a solution with k = 3 clusters:
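A minimal sketch of such a k = 3 fit, assuming the EMCluster package is installed. It reuses the em.EM() call and the $class field that appear in the update further down; the seed value is arbitrary and only there for reproducibility.

```r
library(EMCluster)

ourdata <- scale(USArrests)

set.seed(1)  # arbitrary seed; EM initialization is random
# Fit a 3-component model with EM, keeping the package defaults
em_fit <- em.EM(ourdata, nclass = 3)

# Cluster label assigned to each state, and the cluster sizes
head(em_fit$class)
table(em_fit$class)
```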
Update per OP's comment:
You can calculate the within sum of squares, along with other metrics of interest, using fpc::cluster.stats(). These can be extracted and plotted as in your original post. As a reminder, calling this "the elbow technique" is imprecise: the elbow technique is general and can be, and is, used with any metric of choice, not only the within sum of squares from your original post.
library(EMCluster)
library(fpc)
ourdata<- scale(USArrests)
dist_fit <- dist(ourdata)
num_clusters <- 2:4
set.seed(1)
wss <- vapply(num_clusters, function(i_k) {
em_fit <- em.EM(ourdata, nclass = i_k, lab = NULL, EMC = .EMC,
stable.solution = TRUE, min.n = NULL, min.n.iter = 10)
cluster_stats_fit <- fpc::cluster.stats(dist_fit, em_fit$class)
cluster_stats_fit$within.cluster.ss
}, numeric(1))
plot(num_clusters, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
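If you would rather not depend on fpc, the within-cluster sum of squares can be computed in base R from any label vector, whether it comes from EM or from kmeans. The sketch below uses a helper named within_ss(), which is a made-up name for illustration, and checks it against kmeans, whose labels and tot.withinss are available without extra packages.

```r
# Within-cluster sum of squares for an arbitrary vector of cluster labels.
# within_ss is a hypothetical helper, not part of any package.
within_ss <- function(x, labels) {
  x <- as.matrix(x)
  sum(vapply(split(seq_len(nrow(x)), labels), function(idx) {
    block <- x[idx, , drop = FALSE]
    # squared distances of each row to its cluster centroid
    sum(sweep(block, 2, colMeans(block))^2)
  }, numeric(1)))
}

ourdata <- scale(USArrests)
set.seed(1)
km <- kmeans(ourdata, centers = 3, nstart = 25)

# Sanity check: for kmeans labels this should match what kmeans reports
all.equal(within_ss(ourdata, km$cluster), km$tot.withinss)
```

The same helper can be called with em_fit$class in place of km$cluster.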

Related

What should be an Optimal value of K in K means Clustering for it to be implemented on ANY Dataset?

As the question says, I'm making a visualization tool that is meant to work for any dataset provided. What optimal value of K should I select, and how?
You can use the Calinski criterion from the vegan package, although the phrasing of your question is somewhat debatable. I hope this is what you are expecting; please comment if it is not.
For example, you can do:
n = 100
g = 6
set.seed(g)
d <- data.frame(
x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))),
y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
require(vegan)
fit <- cascadeKM(scale(d, center = TRUE, scale = TRUE), 1, 10, iter = 1000)
plot(fit, sortg = TRUE, grpmts.plot = TRUE)
calinski.best <- as.numeric(which.max(fit$results[2,]))
cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")
This gives a value of 5, which means you can use 5 clusters. The criterion is built on the within-cluster and between-cluster variation of k-means clustering, and you can also write manual code based on that.
From the cascadeKM documentation:
criterion: The criterion that will be used to select the best
partition. The default value is "calinski", which refers to the
Calinski-Harabasz (1974) criterion. The simple structure index ("ssi")
is also available. Other indices are available in function clustIndex
(package cclust). In our experience, the two indices that work best
and are most likely to return their maximum value at or near the
optimal number of clusters are "calinski" and "ssi".
Manual code would look something like the following. At the first iteration (k = 1) there is no between-cluster sum of squares (SSB), so the within-cluster sum of squares equals the total:
wss <- (nrow(d)-1)*sum(apply(d,2,var))
# With a single cluster, TSS = WSS (TSS: total sum of squares, WSS: within-cluster sum of squares)
for (i in 2:15) wss[i] <- sum(kmeans(d,centers=i)$withinss) # from k = 2 onward TSS stays constant while the between-cluster sum of squares grows, so WSS shrinks
# Plot the result for k = 1 to 15. The upper limit is not fixed; choose how many values of k to try.
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares",col="mediumseagreen",pch=12)
The output is an elbow plot. The point after which the line flattens out is the one to pick as the optimal number of clusters; in this case it is 5.
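The Calinski-Harabasz criterion mentioned above can also be computed by hand from the within and between sums of squares that kmeans returns. The sketch below is only an illustration on made-up data with three well-separated groups; it is not the vegan implementation, and the range of k and the nstart value are arbitrary choices.

```r
set.seed(6)
# Three well-separated synthetic groups (illustrative data, not from the post)
d <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)
n <- nrow(d)

# CH(k) = (BSS / (k - 1)) / (WSS / (n - k)); it is undefined for k = 1
ks <- 2:8
ch <- vapply(ks, function(k) {
  km <- kmeans(d, centers = k, nstart = 25)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}, numeric(1))

# The k with the largest index is the suggested number of clusters
best_k <- ks[which.max(ch)]
best_k
```

Unlike the elbow plot, this gives a single number to maximize, which is easier to automate in a tool that has to handle arbitrary datasets.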

K-means clustering doesn't find all clusters in data

The set of data I am using is shown below. As one can see you would think the k-means cluster analysis would find the centers of these clusters easily.
However, when I run K-means cluster analysis and plot the centers I get this.
I am using just the basic kmeans code:
cluster <- kmeans(mydata,90)
cluster$centers
A little-known fact about kmeans is that to get reliable results, you need to run the algorithm repeatedly with many random initializations. I typically use kmeans(, nstart = 1000).
In theory, the kmeans++ algorithm does not suffer as much from the initialization problem, but I often find that kmeans with many random restarts performs better than kmeans++. Still, you might want to try kmeans++ using the flexclust R package.
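A small sketch of the effect of nstart, using made-up data on a 3 x 3 grid of tight clusters (not the poster's data). Both fits start from the same seed, so the only difference is the number of random restarts.

```r
set.seed(42)
# Illustrative 3 x 3 grid of tight clusters
grid <- expand.grid(cx = 1:3, cy = 1:3)
x <- do.call(rbind, lapply(seq_len(nrow(grid)), function(i) {
  cbind(rnorm(50, grid$cx[i], 0.05), rnorm(50, grid$cy[i], 0.05))
}))

set.seed(1)
km_single <- kmeans(x, centers = 9, nstart = 1)
set.seed(1)
km_multi <- kmeans(x, centers = 9, nstart = 50)

# Lower is better; the multi-start fit keeps the best of its 50 runs
c(single = km_single$tot.withinss, multi = km_multi$tot.withinss)
```

If the single-start value is clearly larger, that run got stuck with two centers in one cluster and none in another, which is the symptom described in the question.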
As I mentioned in the comment, using hclust() to find the centers might be a viable approach.
set.seed(1)
l <- 1e4
v1 <- sample(1:10, l, replace=TRUE) + rnorm(l, 0, 0.05)
v2 <- sample(1:13, l, replace=TRUE) + rnorm(l, 0, 0.05)
dtf <- data.frame(v1, v2)
par(mar=c(2, 2, 1, 1))
plot(dtf, pch=16, cex=0.2, col="#00000044")
km <- kmeans(dtf, 10*13)
points(km$centers, cex=2, lwd=0.5, col="red")
hc <- hclust(dist(dtf))
hc <- cutree(hc, 10*13)
hcent <- aggregate(dtf, list(hc), mean)[, -1]
hckm <- kmeans(dtf, hcent)
points(hckm$centers, cex=3, lwd=0.5, col="blue")
This data set would likely be clustered much better by DBSCAN.
Choose epsilon smaller than the distance between clusters (e.g., 10); minPts should then not matter much, e.g., minPts = 4.
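A minimal sketch of that suggestion, assuming the dbscan package (an assumption; the answer does not name a package) and made-up grid data in which neighbouring clusters are 1 unit apart. The eps value of 0.3 is chosen for this toy data only, so it has to be adapted to the scale of the real data.

```r
set.seed(1)
# Illustrative data: 6 tight clusters on a 3 x 2 grid, spacing 1
centers <- expand.grid(cx = 1:3, cy = 1:2)
pts <- do.call(rbind, lapply(seq_len(nrow(centers)), function(i) {
  cbind(rnorm(100, centers$cx[i], 0.05), rnorm(100, centers$cy[i], 0.05))
}))

n_found <- NA_integer_
if (requireNamespace("dbscan", quietly = TRUE)) {
  # eps is well below the gap between clusters; minPts = 4 as suggested
  fit <- dbscan::dbscan(pts, eps = 0.3, minPts = 4)
  # label 0 marks noise points, so it is not counted as a cluster
  n_found <- length(setdiff(unique(fit$cluster), 0))
}
n_found
```

Note that DBSCAN does not need the number of clusters up front, which avoids guessing a value such as 90.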

an error in R code written for finding clusters

I am relatively new to R. I am trying to find the optimum number of clusters for the iris data using the following methods:
library(datasets)
head(iris)
# method1:
wss <- (nrow(iris)-1)*sum(apply(iris,2,var))
for (i in 2:3) wss[i] <- sum(kmeans(iris, centers=i)$withinss)
plot(1:3, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")
# method2:
library(fpc)
pamk.best <- pamk(iris)
cat("number of clusters estimated by optimum average silhouette width:", pamk.best$nc, "\n")
plot(pam(iris, pamk.best$nc))
Both methods throw an error. Could someone please shed some light on it? Many thanks in advance.
apply(iris,2,var)
gives you an error because the 5th column (Species) is not numeric.
Try
apply(iris[,1:4],2,var)
The same goes for the second method, which fails with:
Error in pam(sdata, k, diss = diss, ...) :
x is not a numeric dataframe or matrix.
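Putting the fix together for the first method, as a sketch: the numeric columns are selected by type rather than by position, and the range of k and the nstart value are arbitrary choices that were not in the question.

```r
# Keep only the numeric columns (this drops the Species factor)
iris_num <- iris[, sapply(iris, is.numeric)]

set.seed(1)
# k = 1: the within sum of squares equals the total sum of squares
wss <- (nrow(iris_num) - 1) * sum(apply(iris_num, 2, var))
# k = 2..6: total within-cluster sum of squares from kmeans
for (i in 2:6) wss[i] <- kmeans(iris_num, centers = i, nstart = 25)$tot.withinss
round(wss, 1)
```

The values in wss can be passed to plot(1:6, wss, type="b") as in the question, and iris_num can be passed to pamk() and pam() in place of iris for the second method.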

R kmeans (stats) vs Kmeans (amap)

Hello stackoverflow community,
I'm running kmeans (stats package) and Kmeans (amap package) on the Iris dataset. In both cases, I use the same algorithm (Lloyd–Forgy), the same distance (euclidean), the same number of initial random sets (50), the same maximal number of iterations (1000), and I test for the same set of k values (from 2 to 15). I also use the same seed for both cases (4358).
I don't understand why under these conditions I'm getting different wss curves, in particular: the "elbow" using the stats package is much less accentuated than when using the amap package.
Could you please help me to understand why? Thanks much!
Here is the code:
# data load and scaling
newiris <- iris
newiris$Species <- NULL
newiris <- scale(newiris)
# using kmeans (stats)
wss1 <- (nrow(newiris)-1)*sum(apply(newiris,2,var))
for (i in 2:15) {
set.seed(4358)
wss1[i] <- sum(kmeans(newiris, centers=i, iter.max=1000, nstart=50,
algorithm="Lloyd")$withinss)
}
# using Kmeans (amap)
library(amap)
wss2 <- (nrow(newiris)-1)*sum(apply(newiris,2,var))
for (i in 2:15) {
set.seed(4358)
wss2[i] <- sum(Kmeans(newiris, centers=i, iter.max=1000, nstart=50,
method="euclidean")$withinss)
}
# plots
plot(1:15, wss1, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main="kmeans (stats package)")
plot(1:15, wss2, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main="Kmeans (amap package)")
EDIT:
I've emailed the author of the amap package and will post the reply when/if I get any.
https://cran.r-project.org/web/packages/amap/index.html
The author of the amap package changed the code, so the withinss value is now the sum computed with the chosen method (e.g. Euclidean distance).
One way to solve this is to take the result returned by Kmeans (amap) and recalculate withinss as the error sum of squares (SSE).
Here is my suggestion:
# using Kmeans (amap)
library(amap)
wss2 <- (nrow(newiris)-1)*sum(apply(newiris,2,var))
for (i in 2:15) {
set.seed(4358)
ans.Kmeans <- Kmeans(newiris, centers=i, iter.max=1000, nstart=50, method="euclidean")
wss <- vector(mode = "numeric", length=i)
for (j in 1:i) {
# drop = FALSE keeps a one-row cluster as a matrix instead of a vector
km <- as.matrix(newiris[which(ans.Kmeans$cluster %in% j), , drop = FALSE])
# SSE of cluster j: squared deviations from the cluster centroid
wss[j] <- sum(scale(km, center = TRUE, scale = FALSE)^2)
## equivalent for clusters with at least two rows:
## wss[j] <- ( nrow(km)-1) * sum(apply(km,2,var))
}
wss2[i] <- sum(wss)
}
Note: the pearson method in this package is wrong in version 0.8-14 (be careful!).
See line 325 of the code at this link:
https://github.com/cran/amap/blob/master/src/distance_T.inl

How to find the optimal number of clusters?

I know this question has already been asked, but I am failing to implement a decent plot for the following code:
options(digits=1)
set.seed(2014)
mydata <- matrix(seq(1,360),nrow=10,ncol=36)
wss <- c()
for (i in 1:19) wss[i] <- sum(kmeans(x=mydata,centers=seq(1,360,length.out=20)[i])$withinss)
plot(1:9, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
It produces the following error
Error in sample.int(m, k) :
cannot take a sample larger than the population when 'replace = FALSE'
kmeans assumes that each row in your data is an observation. So if you have k rows in x, the $cluster element of the result will be of length k. Here your test data has 10 rows, yet you are specifying a centers value of about 20 when i=2. There is no way that 10 observations can form 20 different clusters.
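A sketch of a loop that respects this limit on the same test matrix, keeping the number of clusters below the number of rows. It is only meant to show the constraint, not to be a sensible clustering of this data.

```r
set.seed(2014)
mydata <- matrix(seq(1, 360), nrow = 10, ncol = 36)

# 10 rows means 10 observations, so keep k below 10
max_k <- nrow(mydata) - 1
wss <- vapply(seq_len(max_k), function(k) {
  kmeans(mydata, centers = k)$tot.withinss
}, numeric(1))
wss
```

Because wss now has one value per k, it can be plotted with plot(seq_len(max_k), wss, type="b") without a length mismatch.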
Just a little spark in the dark!
options(digits=1)
set.seed(2014)
mydata <- seq(from=1,to=365)
wss <- c()
for (i in 5:15){
wss[i-4] <- sum(kmeans(mydata,centers=floor(seq(from=1,to=365,length.out=i)[-i]))$withinss)
}
# i in 5:15 supplies i-1 centers, so wss holds 11 values for 4 to 14 clusters
plot(4:14,wss,type="b",xlab="Number of Clusters",ylab="Within groups sum of squares")
Does that make sense? #jlhoward #jbaums
