Memory management in phylogenetic tree pairwise distance calculations - r

I have a metametabolite dendrogram (FTICR-MS data) and I'm measuring the pairwise branch lengths to create a null model from randomized distributions. I'm using Bob Danczak's script, which calls cophenetic() to calculate the pairwise distances, then runs a for loop to compute the random distributions. My input is a large phylo object (a list) of 46.1 MB. Understandably, I'm receiving the error: Error in dist.nodes(x) : tree too big, but I really need to calculate these distances. What are some memory-management techniques to get around this issue? I'm fairly certain it's the package and not my computer (8 cores, 64 GB RAM), though I'm never 100% confident when it comes to computers!
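One way around the hard size limit in dist.nodes() is to never build the full cophenetic matrix in RAM at all. Below is a hedged sketch (not Danczak's script) that computes patristic tip-to-tip distances in row blocks with castor::get_pairwise_distances() and writes them into a file-backed matrix from bigstatsr. The object name `tree`, the block size, and the castor/bigstatsr calls are assumptions; check them against your installed package versions.
library(ape)        # the phylo object
library(castor)     # get_pairwise_distances(): patristic distances for given tip pairs
library(bigstatsr)  # FBM(): a file-backed matrix, so the result never has to fit in RAM
# Hedged sketch: fill a tip-by-tip patristic distance matrix in row blocks,
# avoiding cophenetic()/dist.nodes() entirely. `tree` is assumed to be your phylo object.
n   <- length(tree$tip.label)
D   <- FBM(n, n, backingfile = "coph_dist")   # lives on disk, not in RAM
blk <- 200                                     # rows per block; tune to your RAM
for (start in seq(1, n, by = blk)) {
  rows <- start:min(start + blk - 1, n)
  A    <- rep(rows, each = n)                  # row tip of each (row, column) pair
  B    <- rep.int(1:n, length(rows))           # column tip of each pair
  D[rows, ] <- matrix(get_pairwise_distances(tree, A, B),
                      nrow = length(rows), byrow = TRUE)
}
The randomization loop can then read blocks of D as needed instead of holding one giant matrix in memory.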

Related

Is it possible in R to calculate all eigenvalues of a very large symmetric n by n dense matrix in blocks to conserve RAM?

To provide some context, I work with DNA methylation data that, even after some filtering, can still consist of 200K-300K features (with far fewer samples, about 500). I need to do some operations on this, and I have been using the bigstatsr package for other operations, which can use a Filebacked Big Matrix (FBM) to compute, for instance, a crossproduct in blocks. I further found that this can work with RSpectra::eigs_sym to get a specified number of eigenvalues, but unfortunately not all of them. To get all eigenvalues I have mainly seen the base R eigen function being used, but with it I run out of RAM when I have a matrix that is 300k by 300k.
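For what it's worth, the matrix-free setup described above can look roughly like the sketch below. This is an assumption-laden illustration (the FBM `K`, the choice of k, and the exact eigs_sym() function interface should be checked against the RSpectra documentation), and it still only yields a partial spectrum, not all eigenvalues.
library(bigstatsr)
library(RSpectra)
# Hypothetical setup: `K` is an FBM holding the large symmetric matrix on disk.
# eigs_sym() can take a function returning A %*% x, so K never has to be a
# dense in-memory matrix; big_prodVec() does the product against the FBM.
matvec  <- function(x, args) big_prodVec(args$K, x)
partial <- eigs_sym(matvec, k = 100, which = "LM",
                    n = nrow(K), args = list(K = K))
partial$values   # the 100 largest-magnitude eigenvalues only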

Random Forest with p>>n and not enough memory

I am trying to perform Random Forest classification on genomic data with ~200k predictors and ~20 rows. The predictors have already been pruned for autocorrelation. I tried the 'ranger' R package, but it complains that it cannot allocate a 164 Gb vector (I have 32 Gb of RAM).
Is there any RF implementation that can manage the analysis given the available RAM (I would like to avoid increasing the swap)?
Should I maybe use a different algorithm (from what I read, RF should deal alright with p >> n)?
If it's genomic data, are there a lot of zeroes? If so, you might be able to convert it into a sparse matrix using the Matrix package (see the sketch after these comments). I believe ranger has been able to work with sparse matrices for a while, and this can help a lot with memory issues.
As far as I know, ranger is the best R random forest package available for datasets where p >> n.
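For illustration, the sparse-matrix route might look like the sketch below. `X` (the ~20 x 200k predictor matrix) and the labels `y` are placeholders for the poster's data, and you should check that your installed ranger version accepts dgCMatrix input via its x/y interface.
library(Matrix)
library(ranger)
# Hedged sketch of the suggestion above: store the mostly-zero predictor
# matrix sparsely before handing it to ranger.
X_sparse <- Matrix(X, sparse = TRUE)   # dgCMatrix: only non-zero entries are stored
fit <- ranger(x = X_sparse, y = as.factor(y), num.trees = 500)
fit$prediction.error                   # OOB error as a quick sanity check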

Memory Problem: Average-Linkage Clustering

Data with a million rows and 18 columns needs to be clustered using average-linkage clustering, which in turn requires calculating the Euclidean distances between rows. When I run d <- dist(data), R gives the following error:
Error: cannot allocate vector of size 3725.3 Gb
My computer has 32 Gb of memory. What should my approach be?
The distance matrix, even storing only its upper triangle, will always need close to 4 TB of memory (that is the 3725.3 Gb in the error message). Moreover, even a fast implementation of hierarchical clustering has time complexity O(n^2). You can try two things:
1. Use the function hclust.vector from the fastcluster package, which does not require a distance matrix as input and thereby saves space complexity at the expense of time complexity.
2. Use a different clustering algorithm that is not based on all pairwise distances, e.g. k-means.
You can also try a hybrid approach: first condense the data with 2. and then apply 1. (a sketch of this idea follows below).
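A minimal sketch of the hybrid idea, assuming the million rows sit in a numeric matrix called `data`: condense with k-means first, then cluster the far smaller set of centers. Plain hclust() suffices in the second step, since the condensed distance matrix is tiny.
set.seed(1)
# Step 2 first: condense one million rows into, say, 1000 k-means centers.
centers <- kmeans(data, centers = 1000, iter.max = 50)$centers
# Now the pairwise distance matrix is only 1000 x 1000 and fits easily in RAM.
d   <- dist(centers)
hcl <- hclust(d, method = "average")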

How to implement fanny (soft clustering) for a large Dataset?

I am trying to implement soft clustering on an imbalanced dataset. The dataset has around 200k rows and 40 columns.
Whenever I run the fanny() function, RStudio crashes and I am forced to start a new session.
I can run cmeans() successfully on the same dataset, but not fanny().
Initially fanny() showed this error:
Error: cannot allocate vector of size 123.5 Gb
So I added --max-vsize=1500000M to the target (Properties) when launching R. After that, RAM usage would hit 31.8 GB whenever I ran fanny(), and after a couple of minutes RStudio would crash.
library(cluster)
# The dataset 'trainSet' has around 20 factor columns and 20 integer columns, with 200k rows.
Cluster <- fanny(trainSet, 3)
Apparently fanny tries to use a distance matrix.
Hence I suggest that you carefully study the ideas of the algorithm and whether it really needs that matrix, or whether it can be implemented efficiently without one (that means writing the algorithm yourself, not just calling it!). If it does need the distance matrix, then you won't be able to run fanny on data sets much larger than 65k rows.
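If you do stick with fanny(), one pragmatic workaround (a sketch under assumptions, not a fix for the full 200k rows) is to run it on a random subsample small enough that its dissimilarity matrix fits in memory. daisy() is used here because the data mixes factor and integer columns.
library(cluster)
# `trainSet` is assumed to be the poster's data frame of ~20 factor and
# ~20 integer columns. A 10k-row subsample gives a dissimilarity object of
# roughly 10k*(10k-1)/2 doubles, i.e. about 0.4 GB, which is manageable.
set.seed(42)
sub <- trainSet[sample(nrow(trainSet), 10000), ]
dis <- daisy(sub)                  # Gower dissimilarity handles mixed column types
fz  <- fanny(dis, k = 3, diss = TRUE)
head(fz$membership)                # soft (fuzzy) cluster memberships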

dist() function in R: vector size limitation

I was trying to draw a hierarchical clustering of some samples (40 of them) over some features (genes). I have a big table with 500k rows and 41 columns (the first one is the name), and when I tried
d<-dist(as.matrix(file),method="euclidean")
I got this error
Error: cannot allocate vector of size 1101.1 Gb
How can I get around this limitation? I googled it and came across the ff package in R, but I don't quite understand whether it could solve my issue.
Thanks!
Generally speaking, hierarchical clustering is not the best approach for dealing with very large datasets.
In your case, however, there is a different problem: if you want to cluster samples, the structure of your data is wrong. Observations should be represented as rows, and gene expression (or whatever kind of data you have) as columns.
Let's assume you have data like this:
data <- as.data.frame(matrix(rnorm(n=500000*40), ncol=40))
What you want to do is:
# Create transposed data matrix
data.matrix.t <- t(as.matrix(data))
# Create distance matrix
dists <- dist(data.matrix.t)
# Clustering
hcl <- hclust(dists)
# Plot
plot(hcl)
NOTE
You should remember that Euclidean distances can be rather misleading when you work with high-dimensional data.
When dealing with large data sets, R is not the best choice.
The majority of methods in R seem to be implemented by computing a full distance matrix, which inherently needs O(n^2) memory and runtime. Matrix-based implementations don't scale well to large data, unless the matrix is sparse (which a distance matrix by definition isn't).
I don't know if you realized that 1101.1 Gb is over a terabyte. I don't think you have that much RAM, and you probably won't have the time to wait for this matrix to be computed either.
ELKI, for example, is much more powerful for clustering, as you can enable index structures to accelerate many algorithms. This saves both memory (usually down to linear memory usage, for storing the cluster assignments) and runtime (usually down to O(n log n), one O(log n) operation per object).
But of course, it also varies from algorithm to algorithm. K-means, for example, which needs only point-to-mean distances, does not need (and cannot use) an O(n^2) distance matrix.
So in the end: I don't think the memory limit of R is your actual problem. The method you want to use doesn't scale.
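To make the k-means point concrete, here is a small sketch against the simulated 500k x 40 `data` object from the earlier answer: only distances to the k centers are ever computed, so memory stays linear in the number of rows.
set.seed(1)
# Only k = 10 centers are compared against the 500k rows at each iteration,
# so no n x n distance matrix is ever formed.
km <- kmeans(data, centers = 10, iter.max = 100)
table(km$cluster)    # cluster sizes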
I just experienced a related issue but with fewer rows (around 100 thousand, with 16 columns).
RAM size is the limiting factor.
To limit the memory needed, I used two different functions from two different packages.
From parallelDist, the function parDist() lets you obtain the distances quite fast. It of course uses RAM during the process, but the resulting dist object seems to take up less memory (no idea why).
Then I used the hclust() function, but from the package fastcluster. fastcluster is actually not that fast on this amount of data, but it seems to use less memory than the default hclust().
Hope this will be useful for anybody who finds this topic.
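Put together, the two-package approach described above looks roughly like this sketch, assuming a numeric matrix `m` with ~100k rows; parDist() mirrors the dist() interface, and loading fastcluster masks stats::hclust() with its leaner implementation.
library(parallelDist)   # parDist(): multi-threaded pairwise distance computation
library(fastcluster)    # masks stats::hclust() with a faster, more memory-frugal version
# `m` is assumed to be a numeric matrix with ~100k rows and 16 columns.
d   <- parDist(m, method = "euclidean")   # returns an ordinary dist object
hcl <- hclust(d, method = "average")      # fastcluster's hclust is picked up here
groups <- cutree(hcl, k = 10)             # cluster assignments from the tree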

Resources