Compute dissimilarity matrix on parallel cores [duplicate] - r

I'm trying to compute a dissimilarity matrix based on a big data frame with both numerical and categorical features. When I run the daisy function from the cluster package I get the error message:
Error: cannot allocate vector of size X.
In my case X is about 800 GB. Any idea how I can deal with this problem? Additionally it would be also great if someone could help me to run the function in parallel cores. Below you can find the function that computes the dissimilarity matrix on the iris dataset:
require(cluster)
d <- daisy(iris)

I've had a similar issue before. Running daisy() on even 5k rows of my dataset took a really long time.
I ended up using the kmeans algorithm in the h2o package which parallelizes and 1-hot encodes categorical data. I would just make sure to center and scale your data (mean 0 w/ stdev = 1) before plugging it into h2o.kmeans. This is so that the clustering algorithm doesn't prioritize columns that have high nominal differences (since it's trying to minimize the distance calculation). I used the scale() function.
After installing h2o:
h2o.init(nthreads = 16, min_mem_size = '150G')
h2o.df <- as.h2o(df)
h2o_kmeans <- h2o.kmeans(training_frame = h2o.df, x = vars, k = 5, estimate_k = FALSE, seed = 1234)
summary(h2o_kmeans)

Related

R: How to implement parallel processing when calculating polychoric correlation matrix?

I am trying to implement parallel processing to speed up the calculation of a polychoric correlation matrix using the hetcor() function from the polycor package. Here's some example code with desired output:
set.seed(352)
toy <-
data.frame(A = sample(0:7),
B = sample(0:7),
C = sample(0:7)) %>%
mutate_all(.funs = ordered)
toy
hetcor(toy, ML = FALSE, pd = TRUE)
So this takes a second or two already, but can take >20 minutes with my actual dataset. I know that the parallel package can utilize the 8 cores I have on my computer to make this faster. But I can not figure out how to implement it, or even where to start. Most of the examples I find online are using apply-style approaches, which seems different from what I'm trying to do (unless you're supposed to loop this process somehow, maybe with the polychor() function?).
Is there a generous soul that can show an example of how to use the parallel package to calculate polychoric correlation matrix from the toy data?

No convergence for hard competitive learning clustering (flexclust package)

I am applying the functions from the flexclust package for hard competitive learning clustering, and I am having trouble with the convergence.
I am using this algorithm because I was looking for a method to perform a weighed clustering, giving different weights to groups of variables. I chose hard competitive learning based on a response for a previous question (Weighted Kmeans R).
I am trying to find the optimal number of clusters, and to do so I am using the function stepFlexclust with the following code:
new("flexclustControl") ## check the default values
fc_control <- new("flexclustControl")
fc_control#iter.max <- 500 ### 500 iterations
fc_control#verbose <- 1 # this will set the verbose to TRUE
fc_control#tolerance <- 0.01
### I want to give more weight to the first 24 variables of the dataframe
my_weights <- rep(c(1, 0.064), c(24, 31))
set.seed(1908)
hardcl <- stepFlexclust(x=df, k=c(7:20), nrep=100, verbose=TRUE,
FUN = cclust, dist = "euclidean", method = "hardcl", weights=my_weights, #Parameters for hard competitive learning
control = fc_control,
multicore=TRUE)
However, the algorithm does not converge, even with 500 iterations. I would appreciate any suggestion. Should I increase the number of iterations? Is this an indicator that something else is not going well, or did I a mistake with the R commands?
Thanks in advance.
Two things that answer my question (as well as a comment on weighted variables for kmeans, or better said, with hard competitive learning):
The weights are for observations (=rows of x), not variables (=columns of x). so using hardcl for weighting variables is wrong.
In hardcl or neural gas you need much more iterations compared to standard k-means: In k-means one iteration uses the complete data set to change the centroids, hard competitive learning and uses only a single observation. In comparison to k-means multiply the number of iterations by your sample size.

R — Automatic Optimal Number of Clusters Sequence Algorithm

I am interested in finding a function to automatically determine the optimal number of clusters in R.
I am using a sequence algorithm from the package TraMineR to compute my distances.
library(TraMineR)
data(biofam)
biofam.seq <- seqdef(biofam[501:600, 10:25])
## OM distances ##
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = "TRATE",
full.matrix = F)
For instance, hclust can simply be used like this
h = hclust(as.dist(biofam.om), method = 'ward')
and the number of clusters can then be manually determined with
clusters = cutree(h, k = 7)
What I would like ultimately is to automatically set up in the cutree function the k number of clusters, based on an "ideal" number of clusters.
It seems that the package clValid has such function (optimalScores).
However, I cannot pass a distance matrix into clValid.
clValid(obj = as.dist(biofam.om), 2:6, clMethods = 'hierarchical')
I get this error
argument 'obj' must be a matrix, data.frame, or ExpressionSet object
I get the same kind of error using other packages such as NbClust
NbClust(diss = as.dist(biofam.om), method = 'ward.D')
Data matrix is needed.
Anyone knows how to solve this or knows other packages?
Thanks.
There are several different criteria for measuring the quality of a clustering result and choosing the optimal number of clusters. Take a look at the weightedCluster package: http://mephisto.unige.ch/weightedcluster/WeightedCluster.pdf
You can easily compare between different measures and numbers of clusters.

Random Forest with caret package: Error: cannot allocate vector of size 153.1 Gb

I was trying to build a random forest model for a dataset in Kaggle, i always doing machine learning with caret package, the dataset has 1.5 million + rows and 46 variables with no missing values (about 150 mb in size), 40+ variables are categorical and the outcome is the response i am trying to predict and it is binary. After some pre-processing with dplyr, I started working on building model with caret package, but i got this error message when i was trying to run the "train" function:"Error: cannot allocate vector of size 153.1 Gb" Here is my code:
## load packages
require(tidyr)
require(dplyr)
require(readr)
require(ggplot2)
require(ggthemes)
require(caret)
require(parallel)
require(doParallel)
## prepare for parallel processing
n_Cores <- detectCores()
n_Cluster <- makeCluster(n_Cores)
registerDoParallel(n_Cluster)
## import orginal datasets
people_Dt <- read_csv("people.csv",col_names = TRUE)
activity_Train <- read_csv("act_train.csv",col_names = TRUE)
### join two sets together and remove variables not to be used
first_Try <- people_Dt%>%
left_join(activity_Train,by="people_id")%>%
select(-ends_with("y"))%>%
filter(!is.na(outcome))
## try with random forest
in_Tr <- createDataPartition(first_Try$outcome,p=0.75,list=FALSE)
rf_Train <- firt_Try[in_Tr,]
rf_Test <- firt_Try[-in_Tr,]
## set model cross validation parameters
model_Control <- trainControl(method = "repeatedcv",repeats=2,number=2,allowParallel = TRUE)
rf_RedHat <- train(outcome~.,
data=rf_Train,
method="rf",
tuneLength=10,
importance=TRUE,
trControl=model_Control)
My computer is a fairly powerful machine with E3 processors and 32GB RAM. I have two questions:
1. Where did i get a vector that is as large as 150GB? Is it because some codes I wrote?
2. I cannot get a machine with that big ram, is there any workarouds to solve the issue that i can move on with my model building process?
the dataset has 1.5 million + rows and 46 variables with no missing values (about 150 mb in size)
To be clear here, you most likely don't need 1.5 million rows to build a model. Instead, you should be taking a smaller subset which doesn't cause the memory problems. If you are concerned about reducing the size of your sample data, then you can do some descriptive stats on the 40 predictors, on a smaller set, and make sure that the behavior appears to be the same.
The problem is probably related to the one-hot-encoding of caret in your categorical variables. Since you have a lot of categorical variables, this seems to be a real problem such that it increases your dataset in a huge way. One-hot encoding will create a new column for every factor per categorical variables that you have.
Maybe you could try something like the h2o-package, which handles categorical variable in another way such that in not exploding your dataset when the model is run.

R, issue with a Hierarchical clustering after a Multiple correspondence analysis

I want to cluster a dataset (600000 observations), and for each cluster I want to get the principal components.
My vectors are composed by one email and by 30 qualitative variables.
Each quantitative variable has 4 classes: 0,1,2 and 3.
So first thing I'm doing is to load the library FactoMineR and to load my data:
library(FactoMineR)
mydata = read.csv("/home/tom/Desktop/ACM/acm.csv")
Then I'm setting my variables as qualitative (I'm excluding the variable 'email' though):
for(n in 1:length(mydata)){mydata[[n]] <- factor(mydata[[n]])}
I'm removing the emails from my vectors:
mydata2 = mydata[2:31]
And I'm running a MCA in this new dataset:
mca.res <- MCA(mydata2)
I now want to cluster my dataset using the hcpc function:
res.hcpc <- HCPC(mca.res)
But I got the following error message:
Error: cannot allocate vector of size 1296.0 Gb
What do you think I should do? Is my dataset too large? Am I using well the hcpc function?
Since it uses hierarchical clustering, HCPC needs to compute the lower triangle of a 600000 x 600000 distance matrix (~ 180 billion elements). You simply don't have the RAM to store this object and even if you did, the computation would likely take hours if not days to complete.
There have been various discussions on Stack Overflow/Cross Validated on clustering large datasets; some with solutions in R include:
k-means clustering in R on very large, sparse matrix? (bigkmeans)
Cluster Big Data in R and Is Sampling Relevant? (clara)
If you want to use one of these alternative clustering approaches, you would apply it to mca.res$ind$coord in your example.
Another idea, suggested in response to the problem clustering very large dataset in R, is to first use k means to find a certain number of cluster centres and then use hierarchical clustering to build the tree from there. This method is actually implemented via the kk argument of HCPC.
For example, using the tea data set from FactoMineR:
library(FactoMineR)
data(tea)
## run MCA as in ?MCA
res.mca <- MCA(tea, quanti.sup = 19, quali.sup = c(20:36), graph = FALSE)
## run HCPC for all 300 individuals
hc <- HCPC(res.mca, kk = Inf, consol = FALSE)
## run HCPC from 30 k means centres
res.consol <- NULL ## bug work-around
hc2 <- HCPC(res.mca, kk = 30, consol = FALSE)
The consol argument offers the option to consolidate the clusters from the hierarchical clustering using k-means; this option is not available when kk is set to a real number, hence consol is set to FALSE here. The object res.consul is set to NULL to work around a minor bug in FactoMineR 1.27.
The following plot show the clusters based on the 300 individuals (kk = Inf) and based on the 30 k means centres (kk = 30) for the data plotted on the first two MCA axes:
It can be seen that the results are very similar. You should easily be able to apply this to your data with 600 or 1000 k means centres, perhaps up to 6000 with 8GB RAM. If you wanted to use a larger number, you'd probably want to code a more efficient version using bigkmeans, SpatialTools::dist1 and fastcluster::hclust.
That error message usually indicates that R has not enough RAM at its disposal to complete the command. I guess you are running this within 32bit R, possibly under Windows? If this is the case, then killing other processes and deleting unused R variables might possibly help: for example, you might try to delete mydata, mydata2 with
rm(mydata, mydata2)
(as well as all other non-necessary R variables) before executing the command which generates the error. However the ultimate solution in general is to switch to 64bit R, preferably under 64bit Linux and with a decent RAM amount, also see here:
R memory management / cannot allocate vector of size n Mb
R Memory Allocation "Error: cannot allocate vector of size 75.1 Mb"
http://r.789695.n4.nabble.com/Error-cannot-allocate-vector-of-size-td3629384.html

Resources