How to implement fanny (soft clustering) for a large Dataset? - r

I am trying to implement soft clustering on an imbalanced dataset with around 200k rows and 40 columns.
Whenever I run the fanny() function, RStudio crashes and I am forced to start a new session.
I can run cmeans() successfully on this dataset, but not fanny().
Initially, fanny() showed this error:
Error: cannot allocate vector of size 123.5 Gb
So I added --max-vsize=1500000M to the target (Properties) used to launch R. After that, RAM usage would climb to 31.8 GB whenever I ran fanny(), and after a couple of minutes RStudio would crash.
library(cluster)
# The data frame 'trainSet' has around 20 factor columns and 20 integer columns, with 200k rows.
Cluster <- fanny(trainSet, 3)

Apparently fanny tries to use a distance matrix.
Hence I suggest that you carefully study the ideas of the algorithm: does it actually need that matrix, or can it be implemented efficiently without one (which means writing the algorithm, not just calling it)? If it does need the distance matrix, you won't be able to run fanny on data sets much larger than about 65k observations.
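For roughly 200k rows, the dist object alone would need about n*(n-1)/2 doubles, i.e. around 150 GB, the same order as the 123.5 Gb in your error. One pragmatic workaround, offered only as a sketch and not something fanny() does for you, is to run the fuzzy clustering on a random subsample using a Gower dissimilarity from daisy(); the object name trainSet and the sample size of 5,000 are assumptions for illustration:
library(cluster)

set.seed(42)
idx <- sample(nrow(trainSet), 5000)    # random subsample of the rows
sub <- trainSet[idx, ]

# daisy() handles the mixed factor/integer columns via Gower dissimilarity
d <- daisy(sub, metric = "gower")      # only 5000 x 5000 dissimilarities

fuzzy <- fanny(d, k = 3, diss = TRUE)  # soft clustering on the subsample
head(fuzzy$membership)                 # membership degrees per observation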

Related

Parallel processing a function being applied with mapply a large dataset

My problem is the following: I have a large dataset in R (which I run in VS Code), which I'll call full, of about 1.2 GB (28 million rows, 10 columns), and a subset of this dataset, which I'll call main (4.3 million rows, 10 columns). I use Windows with an i7-10700K CPU (3.8 GHz, 8 cores) and 16 GB of RAM.
These datasets contain unique identifiers for products, which span multiple time periods and stores. For each (product, store) combination, I need to calculate summary statistics over similar products, excluding that store and product. For this reason, I essentially need the full dataset loaded, and I cannot split it.
I have created a function that takes a given product-store, filters the dataset to exclude that product-store, and then performs the summary statistics.
There are over 1 million product-stores, so an apply would take over 1 million runs. Each run of the function takes about 0.5 seconds, which adds up to roughly six days of serial computation.
I then decided to use furrr's future_map2 along with plan(cluster, workers=8) to try and parallelize the process.
One piece of advice that normally goes against parallelization is that, if a lot of data needs to be moved to each worker, the process can take a long time. My understanding is that the parallelization would move the large dataset to each worker once and then perform the apply in parallel. This seems to imply that my process will be more efficient under parallelization, even with a large dataset.
I wanted to know whether, overall, I am taking the most advisable approach to speeding up the function. I have already switched fully to data.table functions to improve speed, so I don't believe there is much left to optimize within the function itself.
Tried parallelizing, worried about what's the smartest approach.
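For reference, a minimal sketch of the parallel setup described above; the data.table full, the id and price columns, and summarise_one() are placeholders, not the asker's actual code:
library(data.table)
library(future)
library(furrr)

plan(cluster, workers = 8)   # 8 parallel R workers; 'full' is exported to each once

# One summary per (product, store) pair, excluding that pair from 'full'
summarise_one <- function(prod, store) {
  similar <- full[!(product_id == prod & store_id == store)]
  similar[, .(mean_price = mean(price), n = .N)]
}

pairs <- unique(main[, .(product_id, store_id)])

results <- future_map2(pairs$product_id, pairs$store_id, summarise_one,
                       .options = furrr_options(seed = TRUE))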

Memory management in phylogenetic tree pairwise distance calculations

I have a metametabolite dendrogram (FTICR-MS data), and I'm measuring the pairwise branch lengths to create a null model using a randomized distribution. I'm using Bob Danczak's script, which uses cophenetic() to calculate the pairwise distances, then running a for loop to compute the random distributions. My input is a large phylo object of 46.1 MB. Understandably, I am receiving the error Error in dist.nodes(x) : tree too big, but I really need to calculate these distances. What are some memory-managing techniques to get around this issue? I'm fairly certain it's the package and not my computer (8 cores, 64 GB RAM), though I'm never 100% confident when it comes to computers!
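One memory-managing idea, offered only as a sketch and not part of Danczak's script: prune the tree to a random subset of tips with ape::keep.tip() before calling cophenetic(), and build the null distribution from repeated subsamples rather than the full pairwise matrix:
library(ape)

# 'tree' is a placeholder for the large phylo object
subsample_cophenetic <- function(tree, n_tips = 5000) {
  keep <- sample(tree$tip.label, n_tips)   # random subset of tips
  cophenetic(keep.tip(tree, keep))         # pairwise distances on the pruned tree
}

dists <- replicate(10, subsample_cophenetic(tree), simplify = FALSE)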

dist() function in R: vector size limitation

I was trying to draw a hierarchical clustering of some samples (40 of them) over some features (genes). I have a big table with 500k rows and 41 columns (the first one is the name), and when I tried
d<-dist(as.matrix(file),method="euclidean")
I got this error
Error: cannot allocate vector of size 1101.1 Gb
How can I get around this limitation? I googled it and came across the ff package in R, but I don't quite understand whether it could solve my issue.
Thanks!
Generally speaking, hierarchical clustering is not the best approach for dealing with very large datasets.
In your case, however, there is a different problem. If you want to cluster samples, the structure of your data is wrong: observations should be represented as rows, and gene expression (or whatever kind of data you have) as columns.
Let's assume you have data like this:
data <- as.data.frame(matrix(rnorm(n=500000*40), ncol=40))
What you want to do is:
# Create transposed data matrix
data.matrix.t <- t(as.matrix(data))
# Create distance matrix
dists <- dist(data.matrix.t)
# Clustering
hcl <- hclust(dists)
# Plot
plot(hcl)
NOTE
You should remember that Euclidean distances can be rather misleading when you work with high-dimensional data.
When dealing with large data sets, R is not the best choice.
The majority of methods in R seem to be implemented by computing a full distance matrix, which inherently needs O(n^2) memory and runtime. Matrix-based implementations don't scale well to large data unless the matrix is sparse (which a distance matrix, by definition, isn't).
I don't know if you realized it, but 1101.1 Gb is more than a terabyte. I don't think you have that much RAM, and you probably won't have the time to wait for that matrix to be computed either.
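A back-of-the-envelope check in R (assuming dist()'s default double-precision storage of the lower triangle):
n <- 5e5
n * (n - 1) / 2 * 8 / 2^30   # ~931 GiB, the same order of magnitude as the reported 1101.1 Gb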
For example ELKI is much more powerful for clustering, as you can enable index structures to accelerate many algorithms. This saves both memory (usually down to linear memory usage; for storing the cluster assignments) and runtime (usually down to O(n log n), one O(log n) operation per object).
But of course, it also varies from algorithm to algorithm. K-means for example, which needs point-to-mean distances only, does not need (and cannot use) an O(n^2) distance matrix.
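As an illustration (my own sketch, not part of the answer above), base R kmeans() handles a 500k x 40 matrix without ever building a pairwise distance matrix:
X  <- matrix(rnorm(500000 * 40), ncol = 40)      # ~160 MB of data
km <- kmeans(X, centers = 3, nstart = 2, iter.max = 50)
table(km$cluster)                                # cluster sizes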
So in the end: I don't think the memory limit of R is your actual problem. The method you want to use doesn't scale.
I just experienced a related issue, but with fewer rows (around 100 thousand, with 16 columns).
RAM size is the limiting factor.
To limit the memory needed, I used two functions from two different packages.
From parallelDist, the function parDist() lets you obtain the distances quite quickly. It uses RAM during the process, of course, but the resulting dist object seems to take less memory (no idea why).
Then I used the hclust() function, but from the package fastcluster. fastcluster is actually not that fast on this amount of data, but it seems to use less memory than the default hclust().
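A minimal sketch of that two-package combination; the matrix X and the settings are illustrative, not the exact code used:
library(parallelDist)
library(fastcluster)

# X stands in for the ~100k x 16 numeric matrix
d   <- parDist(X, method = "euclidean", threads = 4)   # multithreaded distance computation
hcl <- fastcluster::hclust(d, method = "ward.D2")      # memory-frugal hierarchical clustering
plot(hcl, labels = FALSE)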
Hope this will be useful for anybody who finds this topic.

SVM modeling with BIG DATA

For modeling with an SVM in R, I have used the kernlab package (the ksvm method) on Windows XP with 2 GB of RAM. But with 201,497 data rows, I am not able to provide enough memory for the modeling (I get the error: cannot allocate vector of size greater than 2.7 GB).
Therefore, I used Amazon micro and large instances for the SVM modeling, but they show the same issue as the local machine (cannot allocate vector of size greater than 2.7 GB).
Can anyone suggest a solution to this problem of modeling with big data, or is there something wrong with my approach?
Without a reproducible example it is hard to say if the dataset is just too big, or if some parts of your script are suboptimal. A few general pointers:
Take a look at the High Performance Computing Task View, which lists the main R packages relevant for working with big data.
You use your entire dataset for training your model. You could instead take a subset (say 10%) and fit your model on that; repeating this procedure a few times will yield insight into whether the model fit is sensitive to which subset of the data you use (see the sketch after these pointers).
Some analysis techniques, e.g. PCA, can be performed by processing the data iteratively, i.e. in chunks. This makes analyses on very big datasets (>> 100 GB) possible. I'm not sure if this is possible with kernlab.
Check that the R version you are using is 64-bit.
This earlier question might be of interest.
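As a sketch of the subsetting idea from the second pointer (the data frame dat, the formula, and the parameters are placeholders, not the asker's code):
library(kernlab)

# fit on five independent 10% subsets and compare the fits
fits <- lapply(1:5, function(i) {
  idx <- sample(nrow(dat), size = 0.1 * nrow(dat))
  ksvm(y ~ ., data = dat[idx, ], kernel = "rbfdot", C = 1)
})

sapply(fits, error)   # training error of each subset model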

Random forest on a big dataset

I have a large dataset in R (1M+ rows by 6 columns) that I want to use to train a random forest (using the randomForest package) for regression purposes. Unfortunately, I get an Error in matrix(0, n, n) : too many elements specified error when trying to do the whole thing at once, and cannot-allocate-enough-memory kinds of errors when running it on a subset of the data, down to 10,000 or so observations.
Seeing that there is no chance I can add more RAM on my machine and random forests are very suitable for the type of process I am trying to model, I'd really like to make this work.
Any suggestions or workaround ideas are much appreciated.
You're likely asking randomForest to create the proximity matrix for the data, which, if you think about it, will be insanely big: 1 million x 1 million. A matrix of this size would be required no matter how small you set sampsize. Indeed, simply Googling the error message seems to confirm this, as the package author states that the only place in the entire source code where n, n) appears is in calculating the proximity matrix.
But it's hard to help more, given that you've provided no details about the actual code you're using.
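If that is indeed the cause, here is a sketch of a call that avoids the proximity matrix (the objects and forest size are placeholders, not the asker's code):
library(randomForest)

fit <- randomForest(x = predictors, y = response,
                    ntree = 200,
                    sampsize = 50000,     # rows drawn per tree
                    proximity = FALSE)    # do not build the n x n proximity matrix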
