PCA and cluster analysis, very slow computing - R

My data has 30,000 rows and 140 columns and I am trying to cluster it. I am running a PCA and then using about 12 principal components in the cluster analysis. I took a random sample of 3,000 observations, and running both the PCA and the hierarchical clustering on it took 44 minutes.
A co-worker did the same in SPSS and it took significantly less time. Any idea why?
Here is a simplified version of my code, which works fine but is really slow on anything over 2,000 observations. I included the USArrests dataset, which is really small so it doesn't really represent my problem, but it shows what I'm trying to do. I'm hesitant to post a large dataset as that seems rude.
I'm not sure how to speed the clustering up. I know I can take random samples of the data and then use a predict function to assign clusters to the test data. But ideally I'd like to use all of the data in the clustering, since the data is static and isn't ever going to change or be updated.
library(factoextra)
library(FactoMineR)
library(FactoInvestigate)
## Data
# mydata: my real data has 32,000 rows and 139 variables
# example data with small data set
data("USArrests")
mydata <- USArrests
## Initial PCA on mydata
res.pca <- PCA(mydata, ncp=4, scale.unit=TRUE, graph = TRUE)
Investigate(res.pca) # this report is very helpful! I determined to keep 12 PC and start with 3 clusters.
## Keep PCA dataset with only 2 PC
res.pca1 <- PCA(mydata, ncp=2, scale.unit=TRUE, graph = TRUE)
## Run a HC on the PC: Start with suggested number of PC and Clusters
res.hcpc <- HCPC(res.pca1, nb.clust=4, graph = FALSE)
## Dendrogram
fviz_dend(res.hcpc,
          cex = 0.7,
          palette = "jco",
          rect = TRUE, rect_fill = TRUE,
          rect_border = "jco",
          labels_track_height = 0.8
)
## Cluster Viz
fviz_cluster(res.hcpc,
             geom = "point",
             ellipse.type = "convex",
             #repel = TRUE,
             show.clust.cent = TRUE,
             palette = "jco",
             ggtheme = theme_minimal(),
             main = "Factor map"
)
#### Cluster 1: Means of Variables
res.hcpc$desc.var$quanti$'1'
#### Cluster 2: Means of Variables
res.hcpc$desc.var$quanti$'2'
#### Cluster 3: Means of Variables
res.hcpc$desc.var$quanti$'3'
#### Cluster 4: Means of Variables
res.hcpc$desc.var$quanti$'4'
#### Number of Observations in each cluster
cluster_hd = res.hcpc$data.clust$clust
summary(cluster_hd)
Any idea why SPSS is so much faster?
Any idea how to speed this up? I know clustering is labor intensive, but I'm not sure where the efficiency threshold lies for data with 30,000 records and 140 variables.
Are some of the other clustering packages more efficient? Suggestions?

HCPC performs hierarchical clustering on the principal components using the Ward criterion. You can use the k-means algorithm for the clustering part instead, which is much faster: hierarchical clustering has a time complexity of O(n³), whereas k-means is O(n), where n is the number of observations.
Since the criteria optimized by k-means and by hierarchical clustering with Ward are the same (minimizing the total within-cluster variance), you can first run k-means with a high number of clusters (say 300, for instance) and then run hierarchical clustering on the cluster centers if you need to keep the hierarchical aspect.
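Here is a minimal sketch of that two-step approach applied to the coordinates returned by PCA(). It assumes your full 30,000-row dataset (with the tiny USArrests example there are fewer rows than the 300 centers, so it would not run as-is), and the numbers 300 and 4 are only illustrative:
## 1) k-means with many centers on the retained principal components
pc_scores <- res.pca$ind$coord                    # individuals' coordinates on the kept PCs
set.seed(123)
km <- kmeans(pc_scores, centers = 300, iter.max = 50, nstart = 5)
## 2) hierarchical clustering (Ward) on the 300 centers only
hc <- hclust(dist(km$centers), method = "ward.D2")
plot(hc, labels = FALSE)
final_k <- cutree(hc, k = 4)                      # cut the tree at the chosen number of clusters
## map every observation back to a final cluster via its k-means center
obs_cluster <- final_k[km$cluster]
table(obs_cluster)
Note that HCPC itself implements this idea via its kk argument (see the MCA answer further down).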

Related

How to measure performance of K-Means cluster in R? [image & code included]

I am currently doing a k-means cluster analysis on some customer data at my company. I want to measure the performance of this cluster, but I don't know which library packages are used to measure its performance, and I am also unsure whether my clusters are grouped too closely together.
The data feeding my clustering is a simple RFM set (recency, frequency, and monetary value). I also included the average order value per transaction for each customer. I used the elbow method to determine the optimal number of clusters. The data consists of 1,400 customers and 4 metric values.
Attached is also an image of the cluster plot & R Code
library(dplyr)       # for glimpse()
library(factoextra)  # for fviz_cluster()
drop = c('CUST_Business_NM')
#Cleaning & Scaling the Data
new_cluster_data = na.omit(data)
new_cluster_data = new_cluster_data[, !(names(new_cluster_data) %in% drop)]
new_cluster_data = scale(new_cluster_data)
glimpse(new_cluster_data)
#Elbow Method for Optimal Clusters
k.max <- 15
data <- new_cluster_data
wss <- sapply(1:k.max,
              function(k){ kmeans(data, k, nstart = 50, iter.max = 15)$tot.withinss })
#Plot out the Elbow
wss
plot(1:k.max, wss,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
#Create the Cluster
kmeans_test = kmeans(new_cluster_data, centers = 8, nstart = 1000)
View(kmeans_test$cluster)
#Visualize the Cluster
fviz_cluster(kmeans_test, data = new_cluster_data, show.clust.cent = TRUE, geom = c("point", "text"))
You probably do not want to measure the performance of the cluster but the performance of the clustering algorithm, in this case kmeans.
First, you need to be clear about which distance measure you want to use. The clustering is based on a dissimilarity matrix, so the choice of distance measure is critical; you can experiment with Euclidean, Manhattan, correlation-based, or other distance measures, e.g., like this:
library("factoextra")
dis_pearson <- get_dist(yourdataset, method = "pearson")
dis_pearson
fviz_dist(dis_pearson)
This will give you the distance matrix and visualize it.
The output of kmeans has several bits of information. The most important with regard to your question are:
totss: the total sum of squares
withinss: vector of within-cluster sum of squares
tot.withinss: total within-cluster sum of squares
betweenss: the between-cluster sum of squares
Thus, the goal is to optimize these by playing with distances and other methods to cluster the data. kmeans() ships with base R's stats package; you can simply compute these measures with mycluster <- kmeans(yourdataframe, centers = 2) and then inspect mycluster.
Side comment: kmeans requires the number of clusters to be defined by the user (additional effort), and it is very sensitive to outliers.
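For example, a minimal sketch of pulling those measures out of a kmeans fit (using the built-in USArrests data purely as a stand-in):
km <- kmeans(scale(USArrests), centers = 2, nstart = 25)
km$totss         # total sum of squares
km$withinss      # within-cluster sum of squares, one value per cluster
km$tot.withinss  # sum of the values above
km$betweenss     # totss minus tot.withinss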

How to run hierarchical clustering hclust on large data? [duplicate]

I am trying to implement hierarchical clustering in R with hclust(); this requires a distance matrix created by dist(), but my dataset has around a million rows, and even EC2 instances run out of RAM. Is there a workaround?
One possible solution is to sample your data, cluster the smaller sample, then treat the clustered sample as training data for k-nearest neighbors and "classify" the rest of the data. Here is a quick example with 1.1M rows, using a sample of 5,000 points. The original data is not well separated, but with only 1/220 of the data, the sample is. Since your question referred to hclust, I used that, but you could use other clustering algorithms like DBSCAN or mean shift.
## Generate data
set.seed(2017)
x = c(rnorm(250000, 0,0.9), rnorm(350000, 4,1), rnorm(500000, -5,1.1))
y = c(rnorm(250000, 0,0.9), rnorm(350000, 5.5,1), rnorm(500000, 5,1.1))
XY = data.frame(x,y)
Sample5K = sample(length(x), 5000) ## Downsample
## Cluster the sample
DM5K = dist(XY[Sample5K,])
HC5K = hclust(DM5K, method="single")
Groups = cutree(HC5K, 8)
Groups[Groups>4] = 4   # lump the small single-linkage offshoots (groups 5-8) together as group 4
plot(XY[Sample5K,], pch=20, col=rainbow(4, alpha=c(0.2,0.2,0.2,1))[Groups])
Now just assign all other points to the nearest cluster.
Core = which(Groups<4)   # keep only the three main clusters as kNN training labels
library(class)
knnClust = knn(XY[Sample5K[Core], ], XY, Groups[Core])
plot(XY, pch=20, col=rainbow(3, alpha=0.1)[knnClust])
A few quick notes.
Because I created the data, I knew to choose three clusters. With a real problem, you would have to do the work of figuring out an appropriate number of clusters.
Sampling 1/220 could completely miss any small clusters. In the small sample, they would just look like noise.

Compute dissimilarity matrix on parallel cores [duplicate]

I'm trying to compute a dissimilarity matrix based on a big data frame with both numerical and categorical features. When I run the daisy function from the cluster package I get the error message:
Error: cannot allocate vector of size X.
In my case X is about 800 GB. Any idea how I can deal with this problem? Additionally, it would be great if someone could help me run the function on parallel cores. Below is the call that computes the dissimilarity matrix on the iris dataset:
require(cluster)
d <- daisy(iris)
I've had a similar issue before. Running daisy() on even 5k rows of my dataset took a really long time.
I ended up using the k-means algorithm in the h2o package, which parallelizes and one-hot encodes categorical data. Just make sure to center and scale your data (mean 0, SD 1) before passing it to h2o.kmeans, so that the clustering algorithm doesn't prioritize columns with large nominal differences (since it is trying to minimize the distance calculation). I used the scale() function.
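A minimal sketch of that scaling step (df is assumed to be a plain data frame of your raw data, and vars is the character vector of column names later passed to h2o.kmeans; both names are placeholders):
num_cols <- vapply(df, is.numeric, logical(1))  # pick out the numeric columns
df[num_cols] <- scale(df[num_cols])             # center to mean 0, scale to SD 1
vars <- names(df)[num_cols]                     # columns to cluster on, passed via x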
After installing h2o:
library(h2o)
h2o.init(nthreads = 16, min_mem_size = '150G')
h2o.df <- as.h2o(df)
h2o_kmeans <- h2o.kmeans(training_frame = h2o.df, x = vars, k = 5, estimate_k = FALSE, seed = 1234)
summary(h2o_kmeans)

Discrepancy in results when using k-means and plotting the distance matrix. Why?

I am clustering some data in RStudio. I am having a problem with the results of a k-means cluster analysis compared with hierarchical clustering. When I use kmeans, I get 4 groups with 10, 20, 30 and 6 observations. However, when I plot the dendrogram, I get 4 groups with different numbers of observations: 23, 26, 10 and 7.
Have you ever run into a problem like this?
Here is my code:
mydata<-scale(mydata0)
# K-Means Cluster Analysis
fit <- kmeans(mydata, 4) # 4 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydatafinal <- data.frame(mydata, fit$cluster)
fit$size
[1] 10 20 30 6
# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit2 <- hclust(d, method="ward.D2")
plot(fit2, cex=0.4) # display dendrogram
groups <- cutree(fit2, k=4) # cut tree into 4 clusters
# draw dendrogram with red borders around the 4 clusters
rect.hclust(fit2, k=4, border="red")
Results of k-means and hierarchical clustering do not need to be the same in every scenario.
Just to give an example: every time you run k-means, the initial choice of centroids is different, so the results differ.
This is not surprising. K-means clustering is initialised at random and can give distinct answers. Typically one does several runs and then aggregates the results to check which are the 'core' clusters.
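As a small illustration of that run-to-run variability (the built-in USArrests data and the seeds below are arbitrary stand-ins):
set.seed(1)
k1 <- kmeans(scale(USArrests), centers = 4)               # one random initialisation
set.seed(2)
k2 <- kmeans(scale(USArrests), centers = 4)               # another initialisation
table(k1$cluster, k2$cluster)                             # labels and sizes can disagree
k3 <- kmeans(scale(USArrests), centers = 4, nstart = 50)  # many restarts, best solution kept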
Hierarchical clustering is, in contrast, purely deterministic, as there is no randomness involved. But like k-means, it is a heuristic: a set of rules is followed to create clusters with no regard to any underlying objective function (for example the intra- and inter-cluster variance versus the overall variance). The way existing clusters are merged with individual observations (the "ward.D2" linkage you pass as method in the hclust call) is crucial in determining the size of the formed clusters.
Having a properly defined objective function to optimise should give you a unique answer (or a set thereof), but the problem is NP-hard because of the sheer number of possible partitions (as a function of the number of observations). This is why only heuristics exist, and also why any clustering procedure should be seen not as a tool giving definitive answers but as an exploratory one.

R, issue with a Hierarchical clustering after a Multiple correspondence analysis

I want to cluster a dataset (600,000 observations), and for each cluster I want to get the principal components.
My vectors consist of one email and 30 qualitative variables.
Each qualitative variable has 4 levels: 0, 1, 2 and 3.
So the first thing I do is load the FactoMineR library and my data:
library(FactoMineR)
mydata = read.csv("/home/tom/Desktop/ACM/acm.csv")
Then I'm setting my variables as qualitative (I'm excluding the variable 'email' though):
for (n in seq_along(mydata)) { mydata[[n]] <- factor(mydata[[n]]) }
I'm removing the emails from my vectors:
mydata2 = mydata[2:31]
And I'm running an MCA on this new dataset:
mca.res <- MCA(mydata2)
I now want to cluster my dataset using the HCPC function:
res.hcpc <- HCPC(mca.res)
But I got the following error message:
Error: cannot allocate vector of size 1296.0 Gb
What do you think I should do? Is my dataset too large? Am I using the HCPC function correctly?
Since it uses hierarchical clustering, HCPC needs to compute the lower triangle of a 600000 x 600000 distance matrix (~ 180 billion elements). You simply don't have the RAM to store this object and even if you did, the computation would likely take hours if not days to complete.
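A rough back-of-envelope check (assuming 8-byte doubles) shows the scale of the problem, and lands in the same ballpark as the ~1296 Gb reported in the error:
n <- 6e5
n * (n - 1) / 2              # ~1.8e11 pairwise distances in the lower triangle
n * (n - 1) / 2 * 8 / 2^30   # ~1,340 GiB just to store the dist object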
There have been various discussions on Stack Overflow/Cross Validated on clustering large datasets; some with solutions in R include:
k-means clustering in R on very large, sparse matrix? (bigkmeans)
Cluster Big Data in R and Is Sampling Relevant? (clara)
If you want to use one of these alternative clustering approaches, you would apply it to mca.res$ind$coord in your example.
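For instance, a sketch of the clara route applied to the MCA coordinates (the number of clusters below is a placeholder you would need to choose yourself):
library(cluster)
coords <- mca.res$ind$coord               # individuals' coordinates from the MCA
set.seed(42)
cl <- clara(coords, k = 10, samples = 50, pamLike = TRUE)   # PAM-like clustering on sub-samples
table(cl$clustering)                      # cluster sizes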
Another idea, suggested in response to the problem clustering very large dataset in R, is to first use k means to find a certain number of cluster centres and then use hierarchical clustering to build the tree from there. This method is actually implemented via the kk argument of HCPC.
For example, using the tea data set from FactoMineR:
library(FactoMineR)
data(tea)
## run MCA as in ?MCA
res.mca <- MCA(tea, quanti.sup = 19, quali.sup = c(20:36), graph = FALSE)
## run HCPC for all 300 individuals
hc <- HCPC(res.mca, kk = Inf, consol = FALSE)
## run HCPC from 30 k means centres
res.consol <- NULL ## bug work-around
hc2 <- HCPC(res.mca, kk = 30, consol = FALSE)
The consol argument offers the option to consolidate the clusters from the hierarchical clustering using k-means; this option is not available when kk is set to a real number, hence consol is set to FALSE here. The object res.consol is set to NULL to work around a minor bug in FactoMineR 1.27.
The following plot shows the clusters based on all 300 individuals (kk = Inf) and based on the 30 k-means centres (kk = 30), for the data plotted on the first two MCA axes:
It can be seen that the results are very similar. You should easily be able to apply this to your data with 600 or 1,000 k-means centres, perhaps up to 6,000 with 8 GB of RAM. If you wanted to use a larger number, you'd probably want to code a more efficient version using bigkmeans, SpatialTools::dist1 and fastcluster::hclust.
That error message usually indicates that R does not have enough RAM at its disposal to complete the command. I guess you are running this in 32-bit R, possibly under Windows? If that is the case, killing other processes and deleting unused R variables might help: for example, you might try to delete mydata and mydata2 with
rm(mydata, mydata2)
(as well as all other unnecessary R variables) before executing the command that generates the error. However, the ultimate solution in general is to switch to 64-bit R, preferably under 64-bit Linux and with a decent amount of RAM; also see:
R memory management / cannot allocate vector of size n Mb
R Memory Allocation "Error: cannot allocate vector of size 75.1 Mb"
http://r.789695.n4.nabble.com/Error-cannot-allocate-vector-of-size-td3629384.html
