Memory issue with K-means clustering - r

I'm trying to cluster key phrases from a search history using K means clustering, but I run into the error "cannot allocate vector of size 30gb" when I run the stringdistmatrix() command. The dataset I am using includes 63455 unique elements, so the resulting matrix requires about 30gb of memory to process. Is there a way to lower the requirements of the process without losing too much significance?
Below is the code I am attempting to run, if you happen to notice any other errors:
#Set data source, format for use, check consistency
MyData <- c('Create company email', 'email for business', 'free trial', 'corporate pricing', 'email cost')
summary(MyData)
#Define number of clusters
kclusters = round(0.90 * length(unique(MyData)))
#Compute distance between words
uniquedata <- unique(as.character(MyData))
distancemodels <- stringdistmatrix(uniquedata, uniquedata, method="jw")
#Create Dendrogram
rownames(distancemodels) <- uniquedata
hc <- hclust(as.dist(distancemodels))
par(mar = rep(2, 4))
plot(hc)
#Create clusters from grouped keywords
dfClust <- data.frame(uniquedata, cutree(hc, k=kclusters))
names(dfClust) <- c('data','cluster')
plot(table(dfClust$cluster))
#End view
view(dfClust)

I don't know of any way to avoid generating the distance matrix when doing k-means clustering.
You could consider alternative clustering algorithms that have been devised to avoid memory issues. The main one that comes to mind is CLARA (Clustering Large Applications; Kaufman and Rousseeuw 1990). In R, it's as simple as cluster::clara, taking numeric data only (like k-means) and requiring you to set k in advance.
Read the manual (?cluster::clara) especially on number of samples which you should set higher than the default. Hope that helps!
edit: just noticed you don't actually have numeric data to start with, so perhaps CLARA is not all that helpful. You could perhaps use some of the same principles as CLARA, including sampling your data multiple times to reduce the memory footprint and combining results later on.

Related

How can I achieve hierarchical clustering with p-values for a large dataset?

I am trying to carry out hierarchical cluster analysis (based on Ward's method) on a large dataset (thousands of records and 13 variables) representing multi-species observations of marine predators, to identify possible significant clusters in species composition.
Each record has date, time etc and presence/absence data (0 / 1) for each species.
I attempted hierarchical clustering with the function pvclust. I transposed the data (pvclust works on transposed tables), then I ran pvclust on the data selecting Jacquard distances (“binary” in R) as a distance measure (suitable for species pres/abs data) and Ward’s method (“ward.D2”). I used “parallel = TRUE” to reduce computation time. However, using a default of nboots= 1000, my computer was not able to finish the computation in hours and finally I got ann error, so I tried with lower nboots (100).
I cannot provide my dataset here, and I do not think it makes sense to provide a small test dataset, as one of the main issues here seems to be the size itself of the dataset. However, I am providing the lines of code I used for the transposition, clustering and plotting:
tdata <- t(data)
cluster <- pvclust(tdata, method.hclust="ward.D2", method.dist="binary",
nboot=100, parallel=TRUE)
plot(cluster, labels=FALSE)
This is the dendrogram I obtained (never mind the confusion at the lower levels due to overlap of branches).
As you can see, the p-values for the higher ramifications of the dendrogram all seem to be 0.
Now, I understand that my data may not be perfect, but I still think there is something wrong with the method I am using, as I would not expect all these values to be zero even with very low significance in the clusters.
So my questions would be
is there anything I got wrong in the pvclust function itself?
may my low nboots (due to “weak” computer) be a reason for the non-significance of my results?
are there other functions in R I could try for hierarchical clustering that also deliver p-values?
Thanks in advance!
.............
I have tried to run the same code on a subset of 500 records with nboots = 1000. This worked in a reasonable computation time, but the output is still not very satisfying - see dendrogram2 .dendrogram obtained for a SUBSET of 500 records and nboots=1000

nstart for k-means in R

Search results in numerous places report that the argument nstart in R's function kmeans sets a number of iterations of the algorithm and chooses 'the best one', see e.g. https://datascience.stackexchange.com/questions/11485/k-means-in-r-usage-of-nstart-parameter. Can anyone provide any clarity on how it does this, i.e. by what measure does it define best?
Secondly: R's kmeans function takes an argument centers. Here, as typical in k-means, it is possible to initialise the centroids before the algorithm begins expectation-maximisation, by choosing as initial centroids rows (data-points) from within your data-set. (You could supply, in vector form, points not present in your data-set as well, with considerably greater effort. In this case you could in theory choose the global optimum as your centroids. This is not what I'm asking for.) When nstart or the seed randomises initializations, I am quite sure that it does so by picking a random choice of centroids from your data-set and starting from those (not just a random set of points within the space).
In general, therefore, I'm looking for a way to get a good (e.g. best out of $n$ trials, or best from nstart) set of starting data-instances from the data-set as initial centroids. Is there any way of extracting the 'winning' (=best) set of initial centroids from nstart (which I could then use, say, in the centers parameter in future)? Any other streamlined & quick way to get a very good set of starting centroids (presumably, reasonably close to where the cluster centres will end up being)?
Is there perhaps, at least, a way to extract from a given kmeans run, what initial centroids it chose to start with?
The criterion that kmeans tries to minimize is the trace of the within scatter matrix, i.e. (unfortunately, this forum does not support LaTeX, but you hopefully can read it nevertheless):
$$ trace(S_w) = \sum_{k=1}^K \sum{x \in C_k} ||x - \mu_k||^2 $$
Concerning the best starting point: obviously, the "best" starting point would be the cluster centers eventually chosen by kmeans. These are returned in the attribute centers:
km <- kmeans(iris[,-5], 3)
print(km$centers)
If you are looking for the best random start point, you can create random start points yourself (with runif), do this nstart times and evaluate which initial configuration leads to the smallest km$tot.withinss:
nstart <- 10
K <- 3 # number of clusters
D <- 4 # data point dimension
# select possible range
r.min <- apply(iris[,-5], MARGIN=2, FUN=min)
r.max <- apply(iris[,-5], MARGIN=2, FUN=max)
for (i in 1:nstart) {
centers <- data.frame(runif(K, r.min[d], r.max[d]))
for (d in 2:D) {
centers <- cbind(centers, data.frame(runif(K, r.min[d], r.max[d])))
}
names(centers) <- names(iris[,-5])
# call kmeans with centers and compare tot.withinss
# ...
}

Efficient plotting of part of a hierarchical cluster

I am running agglomerative clustering on a data set of 130K rows (130K unique keys) and 7 columns, each column ranging from 20 to 2000 unique levels. The data are categorical, specifically alphanumeric codes. At most they can be thought of as factors. I am experimenting with what results I might get from a couple of alternatives to k-modes, including hierarchical clustering and MCA.
My question is, is there any good way to visualize the results up to a certain level with the tree structure?
Standard steps are not a problem:
library{cluster}
Compute Gower distance,
ptm <- proc.time()
gower.dist <- daisy(df[,colnams], metric = c("gower"))
elapsed <- proc.time() - ptm
c(elapsed[3],elapsed[3]/60)
Compute agglomerative clustering object from Gower distance
aggl.clust.c <- hclust(gower.dist, method = "complete")
Now to plotting it. The following line works, but the plot is humanly unreadable
plot(aggl.clust.c, main = "Agglomerative, complete linkages")
Ideally what I am looking for would be something like so (the below is pseudocode that failed on my system)
plot(cutree(aggl.clust.c, k=7), main = "Agglomerative, complete linkages")
I am running R version 3.2.3. That version cannot change (and I don't believe it ought to make a difference for what I am trying to do).
I'd be interested in doing the same in Python, if anyone has good pointers.
I found a useful answer to my question re plotting part of a tree using the as.dendogram() method. Link: http://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning

R, issue with a Hierarchical clustering after a Multiple correspondence analysis

I want to cluster a dataset (600000 observations), and for each cluster I want to get the principal components.
My vectors are composed by one email and by 30 qualitative variables.
Each quantitative variable has 4 classes: 0,1,2 and 3.
So first thing I'm doing is to load the library FactoMineR and to load my data:
library(FactoMineR)
mydata = read.csv("/home/tom/Desktop/ACM/acm.csv")
Then I'm setting my variables as qualitative (I'm excluding the variable 'email' though):
for(n in 1:length(mydata)){mydata[[n]] <- factor(mydata[[n]])}
I'm removing the emails from my vectors:
mydata2 = mydata[2:31]
And I'm running a MCA in this new dataset:
mca.res <- MCA(mydata2)
I now want to cluster my dataset using the hcpc function:
res.hcpc <- HCPC(mca.res)
But I got the following error message:
Error: cannot allocate vector of size 1296.0 Gb
What do you think I should do? Is my dataset too large? Am I using well the hcpc function?
Since it uses hierarchical clustering, HCPC needs to compute the lower triangle of a 600000 x 600000 distance matrix (~ 180 billion elements). You simply don't have the RAM to store this object and even if you did, the computation would likely take hours if not days to complete.
There have been various discussions on Stack Overflow/Cross Validated on clustering large datasets; some with solutions in R include:
k-means clustering in R on very large, sparse matrix? (bigkmeans)
Cluster Big Data in R and Is Sampling Relevant? (clara)
If you want to use one of these alternative clustering approaches, you would apply it to mca.res$ind$coord in your example.
Another idea, suggested in response to the problem clustering very large dataset in R, is to first use k means to find a certain number of cluster centres and then use hierarchical clustering to build the tree from there. This method is actually implemented via the kk argument of HCPC.
For example, using the tea data set from FactoMineR:
library(FactoMineR)
data(tea)
## run MCA as in ?MCA
res.mca <- MCA(tea, quanti.sup = 19, quali.sup = c(20:36), graph = FALSE)
## run HCPC for all 300 individuals
hc <- HCPC(res.mca, kk = Inf, consol = FALSE)
## run HCPC from 30 k means centres
res.consol <- NULL ## bug work-around
hc2 <- HCPC(res.mca, kk = 30, consol = FALSE)
The consol argument offers the option to consolidate the clusters from the hierarchical clustering using k-means; this option is not available when kk is set to a real number, hence consol is set to FALSE here. The object res.consul is set to NULL to work around a minor bug in FactoMineR 1.27.
The following plot show the clusters based on the 300 individuals (kk = Inf) and based on the 30 k means centres (kk = 30) for the data plotted on the first two MCA axes:
It can be seen that the results are very similar. You should easily be able to apply this to your data with 600 or 1000 k means centres, perhaps up to 6000 with 8GB RAM. If you wanted to use a larger number, you'd probably want to code a more efficient version using bigkmeans, SpatialTools::dist1 and fastcluster::hclust.
That error message usually indicates that R has not enough RAM at its disposal to complete the command. I guess you are running this within 32bit R, possibly under Windows? If this is the case, then killing other processes and deleting unused R variables might possibly help: for example, you might try to delete mydata, mydata2 with
rm(mydata, mydata2)
(as well as all other non-necessary R variables) before executing the command which generates the error. However the ultimate solution in general is to switch to 64bit R, preferably under 64bit Linux and with a decent RAM amount, also see here:
R memory management / cannot allocate vector of size n Mb
R Memory Allocation "Error: cannot allocate vector of size 75.1 Mb"
http://r.789695.n4.nabble.com/Error-cannot-allocate-vector-of-size-td3629384.html

Function and data format for doing vector-based clustering in R

I need to run clustering on the correlations of data row vectors, that is, instead of using individual variables as clustering predictor variables, I intend to use the correlations between the vector of variables between data rows.
Is there a function in R that does vector-based clustering. If not and I need to do it manually, what is the right data format to feed in a function such as cmeans or kmeans?
Say, I have m variables and n data rows, the m variables constitute one vector for each data row. so I have a n X n matrix for correlation or cosine. Can this matrix be plugged in the clustering function directly or certain processing is required?
Many thanks.
You can transform your correlation matrix into a dissimilarity matrix,
for instance 1-cor(x) (or 2-cor(x) or 1-abs(cor(x))).
# Sample data
n <- 200
k <- 10
x <- matrix( rnorm(n*k), nr=k )
x <- x * row(x) # 10 dimensions, with less information in some of them
# Clustering
library(cluster)
r <- pam(1-cor(x), diss=TRUE, k=5)
# Check the results
plot(prcomp(t(x))$x[,1:2], col=r$clustering, pch=16, cex=3)
R clustering is often a bit limited. This is a design limitation of R, since it heavily relies on low-level C code for performance. The fast kmeans implementation included with R is an example of such a low-level code, that in turn is tied to using Euclidean distance.
There are a dozen of extensions and alternatives available in the community around R. There are PAM, CLARA and CLARANS for example. They aren't exactly k-means, but closely related. There should be a "spherical k-means" somewhere, that is sensible for cosine distance. There is the whole family of hierarchical clusterings (which scale rather badly - usually O(n^3), with O(n^2) in a few exceptions - but are very easy to understand conceptually).
If you want to explore some more clustering options, have a look at ELKI, it should allow clustering (with various methods, including k-means) by correlation based distances (and it also includes such distance functions). It's not R, though, but Java. So if you are bound to using R, it won't work for you.

Resources