nstart for k-means in R

Search results in numerous places report that the argument nstart in R's kmeans function sets a number of repeated runs of the algorithm and chooses 'the best one'; see e.g. https://datascience.stackexchange.com/questions/11485/k-means-in-r-usage-of-nstart-parameter. Can anyone provide any clarity on how it does this, i.e. by what measure does it define best?
Secondly: R's kmeans function takes an argument centers. Here, as typical in k-means, it is possible to initialise the centroids before the algorithm begins expectation-maximisation, by choosing as initial centroids rows (data-points) from within your data-set. (You could supply, in vector form, points not present in your data-set as well, with considerably greater effort. In this case you could in theory choose the global optimum as your centroids. This is not what I'm asking for.) When nstart or the seed randomises initializations, I am quite sure that it does so by picking a random choice of centroids from your data-set and starting from those (not just a random set of points within the space).
In general, therefore, I'm looking for a way to get a good (e.g. best out of $n$ trials, or best from nstart) set of starting data-instances from the data-set as initial centroids. Is there any way of extracting the 'winning' (=best) set of initial centroids from nstart (which I could then use, say, in the centers parameter in future)? Any other streamlined & quick way to get a very good set of starting centroids (presumably, reasonably close to where the cluster centres will end up being)?
Is there perhaps, at least, a way to extract from a given kmeans run, what initial centroids it chose to start with?

The criterion that kmeans tries to minimize is the trace of the within-cluster scatter matrix, i.e.:
$$ \mathrm{trace}(S_w) = \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2 $$
Concerning the best starting point: obviously, the "best" starting point would be the cluster centers eventually chosen by kmeans. These are returned in the attribute centers:
km <- kmeans(iris[,-5], 3)
print(km$centers)
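As a quick sanity check, km$tot.withinss is exactly this criterion evaluated at the final centers (a small sketch continuing the iris example above):
X <- as.matrix(iris[, -5])
# squared distance of every point to its assigned cluster center, summed over all points
crit <- sum(rowSums((X - km$centers[km$cluster, ])^2))
all.equal(crit, km$tot.withinss)  # TRUE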
If you are looking for the best random start point, you can create random start points yourself (with runif), do this nstart times and evaluate which initial configuration leads to the smallest km$tot.withinss:
nstart <- 10
K <- 3  # number of clusters
D <- 4  # data-point dimension
# range of each variable
r.min <- apply(iris[, -5], MARGIN = 2, FUN = min)
r.max <- apply(iris[, -5], MARGIN = 2, FUN = max)
for (i in 1:nstart) {
  centers <- data.frame(runif(K, r.min[1], r.max[1]))
  for (d in 2:D) {
    centers <- cbind(centers, data.frame(runif(K, r.min[d], r.max[d])))
  }
  names(centers) <- names(iris[, -5])
  # call kmeans with these centers and compare km$tot.withinss
  # ...
}
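Since kmeans itself initialises by drawing a random set of rows of x as centers (see the documentation quoted further down), you can also mimic nstart by hand and keep the winning set of initial rows, which gives you the 'best initial data instances' asked about above. A rough sketch along those lines (the variable names are mine; the comparison again uses tot.withinss):
set.seed(42)
X <- iris[, -5]
K <- 3
n.starts <- 10
best.km <- NULL
best.init <- NULL
for (i in 1:n.starts) {
  init <- X[sample(nrow(X), K), ]  # K random data rows as initial centers
  # (if your data contain duplicated rows, make sure the sampled rows are distinct points)
  km.i <- kmeans(X, centers = init)
  if (is.null(best.km) || km.i$tot.withinss < best.km$tot.withinss) {
    best.km <- km.i
    best.init <- init  # the 'winning' set of initial centroids
  }
}
best.init  # can be reused later, e.g. kmeans(X, centers = best.init)
As far as I can tell, kmeans does not report which initial centers it drew internally, so running the starts yourself like this is the practical way to recover them.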

Related

Generating multiple random graphs in R with same number of nodes and ties?

I would like to generate a large number of random graphs with the same number of nodes and ties, and use the result to find the distributions etc of the standard metrics.
I found this link for generating random graphs with a given number of nodes and ties (Graph generation given number of edges and nodes). Is there an easy way to tell R to do this 1000 times or so and combine all of the resulting graphs into one object that I can then analyse (for things like average distance, degree, diameter, etc.)?
Ultimately I want to be able to use this information for comparison with an empirical network.
I got this answer from a friend, and it appears to work:
library(igraph)
RR1 <- list()
for (i in 1:1000) {
  RR1[[i]] <- erdos.renyi.game(559, 9640, type = "gnm", directed = FALSE, loops = FALSE)
}
number_of_edges <- sapply(RR1, gsize)
number_of_edges
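The same pattern extends to the other metrics mentioned in the question; a sketch continuing from RR1 above (mean_distance, diameter and degree are standard igraph functions):
avg_dist <- sapply(RR1, mean_distance)   # average shortest-path length of each graph
diams    <- sapply(RR1, diameter)        # diameter of each graph
degrees  <- unlist(lapply(RR1, degree))  # pooled degree distribution across all graphs
summary(avg_dist)
summary(diams)
hist(degrees)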

Kmeans function - Amap package - what nstart stands for

I don't understand what nstart changes in the algorithm.
If centers = 8, the function will form 8 clusters. But what does nstart vary?
This is the explanation on the documentation:
centers:
Either the number of clusters or a set of initial cluster centers. If the first, a random set of rows in x are chosen as the initial centers.
nstart:
If centers is a number, how many random sets should be chosen?
Unfortunately, ?kmeans doesn't explain this precisely (in either the stats or the amap package), but one can get an idea by looking at the kmeans code.
If you use more than one random start (nstart greater than 1), the algorithm returns the partition that corresponds to the smallest total within-cluster sum of squares.
(The output contains this value as tot.withinss.)
Look further below in the details:
The algorithm of Hartigan and Wong (1979) is used by default. Note that some authors use k-means to refer to a specific algorithm rather than the general method: most commonly the algorithm given by MacQueen (1967) but sometimes that given by Lloyd (1957) and Forgy (1965). The Hartigan–Wong algorithm generally does a better job than either of those, but trying several random starts (nstart> 1) is often recommended. In rare cases, when some of the points (rows of x) are extremely close, the algorithm may not converge in the “Quick-Transfer” stage, signalling a warning (and returning ifault = 4). Slight rounding of the data may be advisable in that case.
nstart stands for the number of random starts. I cannot explain the statistical details, but in their example code the authors of this function choose 25 random starts:
## random starts do help here with too many clusters
## (and are often recommended anyway!):
(cl <- kmeans(x, 5, nstart = 25))
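For context, x in that snippet is the simulated two-cluster data from the ?kmeans example; something along these lines (not the verbatim help code) makes it self-contained and lets you compare a single start with 25 starts:
set.seed(1)
# two Gaussian blobs in 2-D, deliberately over-clustered with 5 centers
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
kmeans(x, 5, nstart = 1)$tot.withinss   # depends on the luck of the single start
kmeans(x, 5, nstart = 25)$tot.withinss  # best of 25 starts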

Discrepancy in results when using k-means and plotting the distance matrix. Why?

I am clustering some data in RStudio. I am having a problem with the results of k-means cluster analysis compared with hierarchical clustering. When I use kmeans, I get 4 groups with 10, 20, 30 and 6 observations. Nevertheless, when I plot the dendrogram, I get 4 groups but with different numbers of observations: 23, 26, 10 and 7.
Have you ever run into a problem like this?
Here is my code:
mydata <- scale(mydata0)
# K-means cluster analysis
fit <- kmeans(mydata, 4)  # 4-cluster solution
# get cluster means
aggregate(mydata, by = list(fit$cluster), FUN = mean)
# append cluster assignment
mydatafinal <- data.frame(mydata, fit$cluster)
fit$size
[1] 10 20 30 6
# Ward hierarchical clustering
d <- dist(mydata, method = "euclidean")  # distance matrix
fit2 <- hclust(d, method = "ward.D2")
plot(fit2, cex = 0.4)  # display dendrogram
groups <- cutree(fit2, k = 4)  # cut tree into 4 clusters
# draw dendrogram with red borders around the 4 clusters
rect.hclust(fit2, k = 4, border = "red")
Results of k-means and hierarchical clustering need not be the same in every scenario.
Just to give an example, every time you run k-means the initial choice of centroids is different, and so the results are different.
This is not surprising. K-means clustering is initialised at random and can give distinct answers. Typically one does several runs and then aggregates the results to check which are the 'core' clusters.
Hierarchical clustering is, in contrast, purely deterministic, as there is no randomness involved. But like k-means, it is a heuristic: a set of rules is followed to create clusters, with no regard to any underlying objective function (for example the intra- and inter-cluster variance versus the overall variance). The way existing clusters are merged with individual observations is crucial in determining the size of the resulting clusters (this is the method = "ward.D2" argument you pass to hclust).
Having a properly defined objective function to optimise should give you a unique answer (or a set thereof), but the problem is NP-hard because of the sheer number of possible partitions (as a function of the number of observations). This is why only heuristics exist, and also why any clustering procedure should be seen not as a tool giving definitive answers but as an exploratory one.
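A quick, concrete way to see how the two partitions relate (reusing fit and groups from the code in the question) is to cross-tabulate the assignments; cluster labels are arbitrary, so look for rows and columns whose counts concentrate in a single cell:
table(kmeans = fit$cluster, hclust = groups)
# increasing nstart, e.g. kmeans(mydata, 4, nstart = 25), also makes the k-means
# partition much more stable across repeated runs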

Determining optimum number of clusters for k-means with a large dataset

I have a matrix of 62 columns and 181,408 rows that I am going to cluster using k-means. What I would ideally like is a method for identifying the optimum number of clusters. I have tried implementing the gap statistic technique using clusGap from the cluster package (reproducible code below), but this produces several error messages relating to the size of the vector (122 GB) and memory.limit problems on Windows, and an "Error in dist(xs) : negative length vectors are not allowed" on OS X. Does anyone have suggestions on techniques for determining the optimum number of clusters with a large dataset? Or, alternatively, on how to make my code work (and not take several days to complete)? Thanks.
library(cluster)
inputdata<-matrix(rexp(11247296, rate=.1), ncol=62)
clustergap <- clusGap(inputdata, FUN=kmeans, K.max=12, B=10)
At 62 dimensions, the result will likely be meaningless due to the curse of dimensionality.
k-means does a minimum SSQ assignment, which technically equals minimizing the squared Euclidean distances. However, Euclidean distance is known to not work well for high dimensional data.
If you don't know the number of clusters k to supply as a parameter to k-means, there are three ways to find it automatically:
G-means algorithm: it discovers the number of clusters automatically, using a statistical test to decide whether to split a k-means center into two. The algorithm takes a hierarchical approach to detecting the number of clusters, based on a statistical test of the hypothesis that a subset of the data follows a Gaussian distribution (the continuous function that approximates the exact binomial distribution of events), and it splits the cluster if the test fails. It starts with a small number of centers, say one cluster (k=1), then splits it into two centers (k=2) and splits each of these two again (k=4), giving four centers in total. If G-means does not accept these four centers, the answer is the previous step: two centers in this case (k=2). This is the number of clusters your dataset will be divided into. G-means is very useful when you do not have an estimate of the number of clusters you will get after grouping your instances; notice that a poor choice of the k parameter might give you wrong results. The parallel version of G-means is called p-means. G-means sources:
source 1
source 2
source 3
x-means: an algorithm that efficiently searches the space of cluster locations and numbers of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC). This version of k-means finds the number k and also accelerates k-means.
Online k-means or streaming k-means: it runs k-means by scanning the whole data set once and finds the optimal number of clusters automatically. Spark implements it.
This is from RBloggers.
https://www.r-bloggers.com/k-means-clustering-from-r-in-action/
You could do the following:
data(wine, package = "rattle")
head(wine)
df <- scale(wine[-1])
wssplot <- function(data, nc = 15, seed = 1234) {
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss, type = "b", xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
}
wssplot(df)
This will produce an elbow plot of the within-groups sum of squares against the number of clusters.
From this you can choose the value of k to be either 3 or 4: there is a clear fall in the within-groups sum of squares when moving from 1 to 3 clusters. After three clusters the decrease levels off, suggesting that a 3-cluster solution may be a good fit to the data.
But as Anony-Mouse pointed out, the curse of dimensionality is still an issue here, because k-means relies on Euclidean distance.
I hope this answer helps you to a certain extent.

Function and data format for doing vector-based clustering in R

I need to run clustering on the correlations of data row vectors; that is, instead of using individual variables as clustering predictors, I intend to use the correlations between the variable vectors of the data rows.
Is there a function in R that does this kind of vector-based clustering? If not, and I need to do it manually, what is the right data format to feed into a function such as cmeans or kmeans?
Say I have m variables and n data rows; the m variables constitute one vector for each data row, so I have an n x n matrix of correlations or cosines. Can this matrix be plugged into the clustering function directly, or is some processing required?
Many thanks.
You can transform your correlation matrix into a dissimilarity matrix,
for instance 1-cor(x) (or 2-cor(x) or 1-abs(cor(x))).
# Sample data
n <- 200
k <- 10
x <- matrix(rnorm(n * k), nrow = k)
x <- x * row(x)  # 10 dimensions, with less information in some of them
# Clustering
library(cluster)
r <- pam(1 - cor(x), diss = TRUE, k = 5)
# Check the results
plot(prcomp(t(x))$x[, 1:2], col = r$clustering, pch = 16, cex = 3)
R's clustering options are often a bit limited. This is a design limitation of R, since it relies heavily on low-level C code for performance. The fast kmeans implementation included with R is an example of such low-level code, and it is in turn tied to using Euclidean distance.
There are dozens of extensions and alternatives available in the community around R: PAM, CLARA and CLARANS, for example. They aren't exactly k-means, but they are closely related. There should also be a "spherical k-means" somewhere, which is sensible for cosine distance. And there is the whole family of hierarchical clusterings, which scale rather badly (usually O(n^3), with O(n^2) in a few exceptions) but are very easy to understand conceptually; a sketch follows below.
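For instance, the hierarchical route works directly on the correlation-based dissimilarity from above (a sketch reusing x and the 1-cor(x) transformation; method = "average" is just one reasonable choice of linkage):
d <- as.dist(1 - cor(x))   # dissimilarity between the column vectors of x
hc <- hclust(d, method = "average")
cl <- cutree(hc, k = 5)    # 5 clusters, for comparison with the pam() result
plot(prcomp(t(x))$x[, 1:2], col = cl, pch = 16, cex = 3)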
If you want to explore some more clustering options, have a look at ELKI, it should allow clustering (with various methods, including k-means) by correlation based distances (and it also includes such distance functions). It's not R, though, but Java. So if you are bound to using R, it won't work for you.
