R: How to implement parallel processing when calculating a polychoric correlation matrix?

I am trying to implement parallel processing to speed up the calculation of a polychoric correlation matrix using the hetcor() function from the polycor package. Here's some example code with the desired output:
library(dplyr)    # for %>% and mutate_all()
library(polycor)  # for hetcor()

set.seed(352)
toy <-
  data.frame(A = sample(0:7),
             B = sample(0:7),
             C = sample(0:7)) %>%
  mutate_all(.funs = ordered)
toy

hetcor(toy, ML = FALSE, pd = TRUE)
This already takes a second or two, but it can take more than 20 minutes with my actual dataset. I know that the parallel package can use the 8 cores on my computer to make this faster, but I cannot figure out how to implement it, or even where to start. Most of the examples I find online use apply-style approaches, which seem different from what I'm trying to do (unless you're supposed to loop over the variable pairs somehow, maybe with the polychor() function?).
Is there a generous soul who can show an example of how to use the parallel package to calculate a polychoric correlation matrix from the toy data?
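A minimal sketch of one possible approach, not a tested or definitive answer: hetcor() fills the matrix one variable pair at a time, so you can compute each pairwise correlation with polycor::polychor() on a cluster of workers and assemble the matrix yourself. Everything below other than the polycor and parallel functions (the var_pairs list, the cors vector, the matrix R) is my own naming.
library(polycor)
library(parallel)

## all unordered pairs of columns
var_pairs <- combn(names(toy), 2, simplify = FALSE)

cl <- makeCluster(detectCores() - 1)
clusterExport(cl, "toy")

## compute each pairwise polychoric correlation on a worker
cors <- parSapply(cl, var_pairs, function(p) {
  polycor::polychor(toy[[p[1]]], toy[[p[2]]], ML = FALSE)
})
stopCluster(cl)

## assemble the symmetric correlation matrix
R <- diag(ncol(toy))
dimnames(R) <- list(names(toy), names(toy))
for (i in seq_along(var_pairs)) {
  R[var_pairs[[i]][1], var_pairs[[i]][2]] <- cors[i]
  R[var_pairs[[i]][2], var_pairs[[i]][1]] <- cors[i]
}
R
Note that, unlike hetcor(..., pd = TRUE), this sketch does not force the result to be positive definite; something like Matrix::nearPD() could be applied afterwards if that matters for your use case.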

Related

No convergence for hard competitive learning clustering (flexclust package)

I am applying the functions from the flexclust package for hard competitive learning clustering, and I am having trouble with the convergence.
I am using this algorithm because I was looking for a method to perform a weighted clustering, giving different weights to groups of variables. I chose hard competitive learning based on a response to a previous question (Weighted Kmeans R).
I am trying to find the optimal number of clusters, and to do so I am using the function stepFlexclust with the following code:
new("flexclustControl") ## check the default values
fc_control <- new("flexclustControl")
fc_control#iter.max <- 500 ### 500 iterations
fc_control#verbose <- 1 # this will set the verbose to TRUE
fc_control#tolerance <- 0.01
### I want to give more weight to the first 24 variables of the dataframe
my_weights <- rep(c(1, 0.064), c(24, 31))
set.seed(1908)
hardcl <- stepFlexclust(x=df, k=c(7:20), nrep=100, verbose=TRUE,
FUN = cclust, dist = "euclidean", method = "hardcl", weights=my_weights, #Parameters for hard competitive learning
control = fc_control,
multicore=TRUE)
However, the algorithm does not converge, even with 500 iterations. I would appreciate any suggestions. Should I increase the number of iterations? Is this an indicator that something else is not going well, or did I make a mistake with the R commands?
Thanks in advance.
Two things answer my question (along with a comment on weighting variables for k-means, or rather, for hard competitive learning):
The weights apply to observations (= rows of x), not variables (= columns of x), so using hardcl to weight variables is wrong.
In hardcl or neural gas you need many more iterations than in standard k-means: in k-means one iteration uses the complete data set to update the centroids, whereas hard competitive learning uses only a single observation per iteration. Compared with k-means, multiply the number of iterations by your sample size (see the sketch below).
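As a rough illustration of that last point (my own sketch, assuming df is the data frame from the question), the flexclustControl iteration limit can be scaled by the number of rows:
library(flexclust)

fc_control <- new("flexclustControl")
## one hard-competitive-learning iteration processes a single observation,
## so the equivalent of ~500 full k-means passes over the data is
## roughly 500 * nrow(df) iterations
fc_control@iter.max <- 500 * nrow(df)
fc_control@tolerance <- 0.01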

Efficient plotting of part of a hierarchical cluster

I am running agglomerative clustering on a data set of 130K rows (130K unique keys) and 7 columns, each column ranging from 20 to 2000 unique levels. The data are categorical, specifically alphanumeric codes. At most they can be thought of as factors. I am experimenting with what results I might get from a couple of alternatives to k-modes, including hierarchical clustering and MCA.
My question is, is there any good way to visualize the results up to a certain level with the tree structure?
Standard steps are not a problem:
library(cluster)
Compute Gower distance,
ptm <- proc.time()
gower.dist <- daisy(df[,colnams], metric = c("gower"))
elapsed <- proc.time() - ptm
c(elapsed[3],elapsed[3]/60)
Compute agglomerative clustering object from Gower distance
aggl.clust.c <- hclust(gower.dist, method = "complete")
Now to plotting it. The following line works, but the plot is not readable for a human:
plot(aggl.clust.c, main = "Agglomerative, complete linkages")
Ideally, what I am looking for would be something like the following (the line below is pseudocode that failed on my system):
plot(cutree(aggl.clust.c, k=7), main = "Agglomerative, complete linkages")
I am running R version 3.2.3. That version cannot change (and I don't believe it ought to make a difference for what I am trying to do).
I'd be interested in doing the same in Python, if anyone has good pointers.
I found a useful answer to my question regarding plotting part of a tree using the as.dendrogram() method. Link: http://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning
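For completeness, here is a minimal sketch (my own, reusing aggl.clust.c from the code above) of how the as.dendrogram() route can be used to plot only the top of the tree, cut so that roughly 7 branches remain:
## convert the hclust object to a dendrogram
dend <- as.dendrogram(aggl.clust.c)

## pick a height between the 6th and 7th largest merge heights,
## so cutting there leaves 7 branches
h7 <- mean(rev(aggl.clust.c$height)[6:7])

## keep only the part of the dendrogram above that height and plot it
plot(cut(dend, h = h7)$upper, main = "Agglomerative, complete linkages (top 7 branches)")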

Fuzzy C-Means Clustering in R

I am performing Fuzzy Clustering on some data. I first scaled the data frame so each variable has a mean of 0 and sd of 1. Then I ran the clValid function from the package clValid as follows:
library(dplyr)
library(clValid)

df <- iris[, -5]  # I do not use iris, but to make this reproducible
clust <- sapply(df, scale)
intvalid <- clValid(clust, 2:10, clMethods = c("fanny"),
                    validation = "internal", maxitems = 1000)
The results told me 4 would be the best number of clusters. Therefore I ran the fanny function from the cluster package as follows:
library(cluster)
library(plyr)

res.fanny <- fanny(clust, 4, metric = 'SqEuclidean')
res.fanny$coeff
res.fanny$k.crisp
df$fuzzy <- res.fanny$clustering
profile <- ddply(df, .(fuzzy), summarize,
                 count = length(fuzzy))
However, in looking at the profile, I only have 3 clusters instead of 4. How is this possible? Should I then go with 3 clusters instead of 4? How do I explain this? I do not know how to recreate my actual data here because it is quite large. Has anybody else encountered this before?
This is an attempt at an answer, based on limited information, and it may not fully address the questioner's situation. It sounds like there may be other issues; in chat they indicated that they had encountered additional errors that I cannot reproduce. fanny will calculate and assign items to "crisp" clusters, based on a metric. It will also produce a matrix showing the fuzzy clustering assignment, which can be accessed via the membership element.
The issue the questioner described can be recreated by increasing the memb.exp parameter using the iris data set. Here is an example:
library(plyr)
library(clValid)
library(cluster)
df<-iris[,-5] # I do not use iris, but to make reproducible
clust<-sapply(df,scale)
res.fanny <- fanny(clust, 4, metric='SqEuclidean', memb.exp = 2)
Calling res.fanny$k.crisp shows that this produces 4 crisp clusters.
res.fanny14 <- fanny(clust, 4, metric='SqEuclidean', memb.exp = 14)
Calling res.fanny14$k.crisp shows that this produces 3 crisp clusters.
One can still access the membership of each of the 4 clusters using res.fanny14$membership.
If you have a good reason to think there should be 4 crisp clusters, you could reduce the memb.exp parameter, which would tighten up the cluster assignments. Or, if you are doing some sort of supervised learning, one procedure to tune this parameter would be to reserve some test data, do a hyperparameter grid search, and select the value that produces the best result on your preferred metric (see the sketch below). However, without knowing more about the task, the data, or what the questioner is trying to accomplish, it is hard to suggest much more than this.
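As a purely illustrative sketch (my own, reusing clust from the code above), one can scan a range of memb.exp values and record how many crisp clusters each produces before settling on one:
## scan memb.exp and record the number of crisp clusters produced
exps <- seq(1.5, 14, by = 2.5)
crisp <- sapply(exps, function(e) {
  fanny(clust, 4, metric = "SqEuclidean", memb.exp = e)$k.crisp
})
data.frame(memb.exp = exps, k.crisp = crisp)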
First of all, I encourage you to read the nice vignette of the clValid package.
The R package clValid contains functions for validating the results of a cluster analysis. There are three main types of cluster validation measures available. One of these measures is the Dunn index, the ratio of the smallest distance between observations in different clusters to the largest intra-cluster distance. I focus on the Dunn index for simplicity. In general, connectivity should be minimized, while both the Dunn index and the silhouette width should be maximized.
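As a small self-contained illustration of the Dunn index itself (my own sketch, not part of the original answer), clValid exposes it directly via the dunn() function:
library(clValid)
library(cluster)

## scaled iris data without the Species column
clust <- sapply(iris[, -5], scale)

## a 2-cluster fuzzy partition and its Dunn index
fit <- fanny(clust, 2, metric = "euclidean")
dunn(distance = as.matrix(dist(clust)), clusters = fit$clustering)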
clValid creators explicitly refer to the fanny function of the cluster package in their documentation.
The clValid package is useful for running several algorithms/metrics across a prespecified sets of clustering.
library(dplyr)
library(clValid)
iris
table(iris$Species)
clust <- sapply(iris[, -5], scale)
In my code I need to increase the number of iterations to reach convergence (maxit = 1500). Results are obtained with the summary function applied to the clValid object intvalid. It seems that the optimal number of clusters is 2 (but that is not the main point here).
intvalid <- clValid(clust, 2:5, clMethods = c("fanny"),
                    maxit = 1500,
                    validation = "internal",
                    metric = "euclidean")
summary(intvalid)
The results from any method can be extracted from a clValid object for further analysis using the clusters method. Here the results from the 2-cluster solution are extracted (hc$`2`), with emphasis on Dunn's partition coefficient (hc$`2`$coeff). Of course, these results relate to the "euclidean" metric of the clValid call.
hc <- clusters(intvalid, "fanny")
hc$`2`$coeff
Now, I simply call fanny from the cluster package using the euclidean metric and 2 clusters. The results overlap completely with those of the previous step.
res.fanny <- fanny(clust, 2, metric='euclidean', maxit = 1500)
res.fanny$coeff
Now, we can look at the classification table:
table(hc$`2`$clustering, iris[,5])
    setosa versicolor virginica
  1     50          0         0
  2      0         50        50
and at the profile:
library(plyr)
df <- iris[, -5]
df$fuzzy <- hc$`2`$clustering
profile <- ddply(df, .(fuzzy), summarize,
                 count = length(fuzzy))
profile
  fuzzy count
1     1    50
2     2   100

Compute dissimilarity matrix on parallel cores [duplicate]

I'm trying to compute a dissimilarity matrix based on a big data frame with both numerical and categorical features. When I run the daisy function from the cluster package I get the error message:
Error: cannot allocate vector of size X.
In my case X is about 800 GB. Any idea how I can deal with this problem? Additionally, it would also be great if someone could help me run the function on parallel cores. Below you can find the code that computes the dissimilarity matrix on the iris dataset:
require(cluster)
d <- daisy(iris)
I've had a similar issue before. Running daisy() on even 5k rows of my dataset took a really long time.
I ended up using the k-means algorithm in the h2o package, which parallelizes and one-hot encodes categorical data. Just make sure to center and scale your data (mean 0, standard deviation 1) before plugging it into h2o.kmeans, so that the clustering algorithm doesn't prioritize columns with large nominal differences (since it is trying to minimize the distance calculation). I used the scale() function.
After installing h2o:
library(h2o)

h2o.init(nthreads = 16, min_mem_size = '150G')
h2o.df <- as.h2o(df)   # df: your (already centered and scaled) data frame
h2o_kmeans <- h2o.kmeans(training_frame = h2o.df, x = vars,   # vars: names of the columns to cluster on
                         k = 5, estimate_k = FALSE, seed = 1234)
summary(h2o_kmeans)

R, issue with a Hierarchical clustering after a Multiple correspondence analysis

I want to cluster a dataset (600000 observations), and for each cluster I want to get the principal components.
My vectors are composed of one email address and 30 qualitative variables.
Each qualitative variable has 4 levels: 0, 1, 2 and 3.
So the first thing I do is load the FactoMineR library and my data:
library(FactoMineR)
mydata = read.csv("/home/tom/Desktop/ACM/acm.csv")
Then I'm setting my variables as qualitative (I'm excluding the variable 'email' though):
for(n in 1:length(mydata)){mydata[[n]] <- factor(mydata[[n]])}
I'm removing the emails from my vectors:
mydata2 = mydata[2:31]
And I'm running a MCA in this new dataset:
mca.res <- MCA(mydata2)
I now want to cluster my dataset using the HCPC function:
res.hcpc <- HCPC(mca.res)
But I got the following error message:
Error: cannot allocate vector of size 1296.0 Gb
What do you think I should do? Is my dataset too large? Am I using the HCPC function correctly?
Since it uses hierarchical clustering, HCPC needs to compute the lower triangle of a 600000 x 600000 distance matrix (~ 180 billion elements). You simply don't have the RAM to store this object and even if you did, the computation would likely take hours if not days to complete.
There have been various discussions on Stack Overflow/Cross Validated on clustering large datasets; some with solutions in R include:
k-means clustering in R on very large, sparse matrix? (bigkmeans)
Cluster Big Data in R and Is Sampling Relevant? (clara)
If you want to use one of these alternative clustering approaches, you would apply it to mca.res$ind$coord in your example.
Another idea, suggested in response to the problem clustering very large dataset in R, is to first use k-means to find a certain number of cluster centres and then use hierarchical clustering to build the tree from there. This method is actually implemented via the kk argument of HCPC.
For example, using the tea data set from FactoMineR:
library(FactoMineR)
data(tea)
## run MCA as in ?MCA
res.mca <- MCA(tea, quanti.sup = 19, quali.sup = c(20:36), graph = FALSE)
## run HCPC for all 300 individuals
hc <- HCPC(res.mca, kk = Inf, consol = FALSE)
## run HCPC from 30 k means centres
res.consol <- NULL ## bug work-around
hc2 <- HCPC(res.mca, kk = 30, consol = FALSE)
The consol argument offers the option to consolidate the clusters from the hierarchical clustering using k-means; this option is not available when kk is set to a finite number, hence consol is set to FALSE here. The object res.consol is set to NULL to work around a minor bug in FactoMineR 1.27.
A plot of the clusters based on all 300 individuals (kk = Inf) and on the 30 k-means centres (kk = 30), drawn on the first two MCA axes, shows that the results are very similar. You should easily be able to apply this to your data with 600 or 1000 k-means centres, perhaps up to 6000 with 8 GB of RAM. If you wanted to use a larger number, you'd probably want to code a more efficient version using bigkmeans, SpatialTools::dist1 and fastcluster::hclust (a rough sketch of that idea follows below).
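Here is a rough sketch of that k-means-then-hierarchical idea with faster building blocks. It is my own, untested on data of this size; it assumes mca.res$ind$coord holds the MCA coordinates as in the question, and uses stats::kmeans where biganalytics::bigkmeans could be swapped in for truly large data:
library(fastcluster)   # faster drop-in replacement for hclust()

coords <- mca.res$ind$coord

## first reduce the 600000 individuals to a manageable number of k-means centres
km <- kmeans(coords, centers = 1000, iter.max = 100, nstart = 5)

## then build the hierarchical tree on the centres only
hc.big <- fastcluster::hclust(dist(km$centers), method = "ward.D2")
plot(hc.big, labels = FALSE, main = "Tree built on 1000 k-means centres")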
That error message usually indicates that R does not have enough RAM at its disposal to complete the command. I guess you are running this within 32-bit R, possibly under Windows? If this is the case, then killing other processes and deleting unused R variables might help: for example, you might try to delete mydata and mydata2 with
rm(mydata, mydata2)
(as well as all other unnecessary R variables) before executing the command that generates the error. However, the ultimate general solution is to switch to 64-bit R, preferably under 64-bit Linux and with a decent amount of RAM; also see here:
R memory management / cannot allocate vector of size n Mb
R Memory Allocation "Error: cannot allocate vector of size 75.1 Mb"
http://r.789695.n4.nabble.com/Error-cannot-allocate-vector-of-size-td3629384.html
