dist() function in R: vector size limitation

I was trying to draw a hierarchical clustering of some samples (40 of them) over some features (genes). I have a big table with 500k rows and 41 columns (the first one is the name), and when I tried
d<-dist(as.matrix(file),method="euclidean")
I got this error
Error: cannot allocate vector of size 1101.1 Gb
How can I get around this limitation? I googled it and came across the ff package in R, but I don't quite understand whether it could solve my issue.
Thanks!

Generally speaking, hierarchical clustering is not the best approach for dealing with very large datasets.
In your case, however, there is a different problem. If you want to cluster samples, the structure of your data is wrong: observations should be represented as rows, and gene expression (or whatever kind of data you have) as columns.
Let's assume you have data like this:
data <- as.data.frame(matrix(rnorm(n=500000*40), ncol=40))
What you want to do is:
# Create transposed data matrix
data.matrix.t <- t(as.matrix(data))
# Create distance matrix
dists <- dist(data.matrix.t)
# Clustering
hcl <- hclust(dists)
# Plot
plot(hcl)
NOTE
You should remember that Euclidean distances can be rather misleading when you work with high-dimensional data.
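For instance (just an illustration, not something you must do), a correlation-based distance between samples is a common alternative for expression data and drops into the same workflow:
# Sketch of a common alternative: use 1 - Pearson correlation between samples
# as the distance instead of the Euclidean distance across 500k genes.
cor.dists <- as.dist(1 - cor(as.matrix(data)))  # cor() works column-wise, i.e. per sample
hcl.cor <- hclust(cor.dists)
plot(hcl.cor)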

When dealing with large data sets, R is not the best choice.
The majority of methods in R seem to be implemented by computing a full distance matrix, which inherently needs O(n^2) memory and runtime. Matrix-based implementations don't scale well to large data unless the matrix is sparse (which a distance matrix, by definition, isn't).
I don't know if you realized that 1101.1 Gb is more than a terabyte. I don't think you have that much RAM, and you probably won't have the time to wait for this matrix to be computed either.
For example ELKI is much more powerful for clustering, as you can enable index structures to accelerate many algorithms. This saves both memory (usually down to linear memory usage; for storing the cluster assignments) and runtime (usually down to O(n log n), one O(log n) operation per object).
But of course, it also varies from algorithm to algorithm. K-means for example, which needs point-to-mean distances only, does not need (and cannot use) an O(n^2) distance matrix.
So in the end: I don't think the memory limit of R is your actual problem. The method you want to use doesn't scale.
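To illustrate the k-means point with a toy example (my own sketch, not tied to any particular tool):
# k-means never materialises an n x n distance matrix, so it copes with row
# counts where dist() would need terabytes.
set.seed(1)
x <- matrix(rnorm(500000 * 2), ncol = 2)      # 500k observations, 2 features
km <- kmeans(x, centers = 10, iter.max = 30)  # memory use stays linear in n
table(km$cluster)
# dist(x) on the same data would need roughly 500000^2 / 2 * 8 bytes, i.e. about 1 TB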

I just experienced a related issue, but with fewer rows (around 100 thousand rows and 16 columns).
RAM size is the limiting factor.
To limit the memory footprint I used two different functions from two different packages.
From parallelDist, the function parDist() allows you to obtain the distances quite fast. It uses RAM during the process, of course, but it seems that the resulting dist object takes up less memory (I have no idea why).
Then I used the hclust() function, but from the package fastcluster. fastcluster is actually not that fast on such an amount of data, but it seems to use less memory than the default hclust().
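Roughly, the workflow looks like this (a sketch from memory on smaller toy data, not my exact code):
# parallelDist for the distances, fastcluster as a drop-in replacement for hclust().
library(parallelDist)  # parDist()
library(fastcluster)   # masks base hclust() with a faster implementation

x <- matrix(rnorm(10000 * 16), ncol = 16)      # smaller stand-in for the real data
d <- parDist(x, method = "euclidean")          # multithreaded distance computation
hcl <- fastcluster::hclust(d, method = "average")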
Hope this will be useful for anybody who finds this topic.

Related

Is it possible in R to calculate all eigenvalues of a very large symmetric n by n dense matrix in blocks to conserve RAM?

To provide some context, I work with DNA methylation data that, even after some filtering, can still consist of 200K-300K features (with far fewer samples, about 500). I need to do some operations on this, and I have been using the bigstatsr package for other operations, which can use a Filebacked Big Matrix (FBM) to compute, for instance, a crossproduct in blocks. I further found that this can work with RSpectra::eigs_sym to get a specified number of eigenvalues, but unfortunately not all of them. To get all eigenvalues I have mainly seen the base R eigen function being used, but with this I run out of RAM when I have a matrix that is 300k by 300k.
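To make the contrast concrete, here is a toy-sized sketch of the two routes I mean (an illustration only, nowhere near the real 300k by 300k size):
# RSpectra::eigs_sym() returns only the k requested eigenvalues; base eigen()
# returns all of them but needs the whole dense matrix in RAM.
library(RSpectra)

set.seed(1)
A <- crossprod(matrix(rnorm(1000 * 200), ncol = 200))   # 200 x 200 symmetric matrix

partial <- eigs_sym(A, k = 10, which = "LM")$values                # 10 largest only
all_ev  <- eigen(A, symmetric = TRUE, only.values = TRUE)$values   # all 200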

Memory Problem: Average-Linkage Clustering

Data with a million rows and 18 columns needs to be clustered using average-linkage clustering, which in turn requires calculating the Euclidean distance between rows. When I do this with d <- dist(data), R gives the following error:
Error: cannot allocate vector of size 3725.3 Gb
My computer has a memory of 32 Gb. What should be my approach?
The distance matrix, even storing only its upper triangle, will always need close to 4 TB of memory (the 3725.3 Gb in the error message). Moreover, a fast implementation of hierarchical clustering has time complexity O(n^2). You can try two things:
1. Use the function hclust.vector from the fastcluster package, which does not require a distance matrix as input and thereby saves space complexity at the expense of time complexity.
2. Use a different clustering algorithm that is not based on all pairwise distances, e.g. k-means.
You can also try a hybrid approach by first condensing the data with 2. and then applying 1.; a rough sketch of that idea follows.
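For illustration (my own sketch, not from the answer, and using plain hclust() with average linkage on the condensed centroids, which are few enough at that point for a full distance matrix):
# Hybrid sketch: condense the million rows with k-means, then run average-linkage
# clustering on the much smaller set of centroids.
set.seed(42)
data <- matrix(rnorm(1e6 * 18), ncol = 18)          # toy stand-in for the real data

km  <- kmeans(data, centers = 2000, iter.max = 50)  # step 2: condense to 2000 centroids
d   <- dist(km$centers)                             # 2000 x 2000 distances fit in RAM
hcl <- hclust(d, method = "average")                # average linkage on the centroids
plot(hcl, labels = FALSE)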

Is calculating the index in an array more efficient than letting the compiler do it?

I'm trying to generalize a neural network function to arbitrarily many layers, and so I need multiple matrices to hold the weights for each neuron in each layer. I was originally declaring explicit matrix objects in R to hold my weights for each layer. Instead of having one matrix per layer, I thought of a way (not saying it's original) to store all of my weights in a single array, and defined an "indexing function" to map a weight to its appropriate index in the array.
I defined the function so that it maps (i, j, k) — meaning the k-th weight of the j-th neuron in the i-th layer — to a position in the array, where L(r) is the number of neurons in layer r. (After writing these definitions, I realized that Stack Overflow doesn't render LaTeX the way MathOverflow does, which is unfortunate.)
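In R terms, the idea is roughly the following (a sketch with made-up layer sizes, assuming a fully connected network so that neuron j in layer i has one weight per neuron in layer i-1):
L <- c(4, 8, 8, 2)   # hypothetical layer sizes; layer 1 is the input layer

# index(i, j, k): position in the flat weight vector of the k-th weight of the
# j-th neuron in layer i (layer r contributes L[r] * L[r - 1] weights in total)
index <- function(i, j, k) {
  before <- if (i > 2) sum(L[2:(i - 1)] * L[1:(i - 2)]) else 0  # weights of earlier layers
  before + (j - 1) * L[i - 1] + k
}

weights <- rnorm(sum(L[-1] * L[-length(L)]))   # one flat array holding every weight
weights[index(3, 2, 5)]                        # 5th weight of neuron 2 in layer 3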
Now the question is: is it more efficient to compute the index of my weights in this way, or is it actually less efficient?
After looking up how indices are computed for arrays in general, this is essentially what is done at compilation anyway if I just keep a matrix in each layer holding the weights, so it seems like I may just be making my code overly complicated and harder to understand if there's no difference in time efficiency.
TL;DR: use the matrices; it's easier to understand and takes advantage of optimized CPU instructions.
In computer science parlance, the efficiency (scalability) of algorithms is reasoned about using Big O cost. A score can be given to both the time and space complexity.
Using Big O notation, let's compare the two approaches:
Array Approach
time complexity:
Array index access is O(1) time, no matter how large an array becomes, it is just as computationally easy to access an element given its index.
As you've created a function to compute the index of the k-th weight, this adds some small complexity, but it would probably run in constant O(1) time as it is a mathematical expression, so it's negligible.
space complexity:
O(N) where N is the number of weights across all layers.
Matrices Approach
time complexity:
A matrix is essentially a 2D array with O(1) access.
space complexity:
O(N + M), where N is the number of neurons and M is the number of weights.
Conceptually, we can see that the two approaches have equivalent time and space complexity scores.
However, there are other trade-offs involved (and as a good SO-er I must inform you of those).
When it comes to working with the data, the array approach is less efficient than the matrix approach because it forgoes the opportunity for SIMD operations. As @liborm alluded to, there are vectorised (SIMD) operations handled by lower-level system libraries like LAPACK/BLAS, which "batch" CPU instructions for matrix operations (there is less overhead in transferring and computing data on the CPU than in sending a new instruction every time).
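As a rough illustration (my own toy benchmark, not rigorous), compare a layer's forward pass written as a matrix product with a pure-R loop over hand-computed indices:
# One layer, 256 inputs -> 256 outputs: %*% delegates to BLAS, the loop does not.
set.seed(1)
n_in <- 256; n_out <- 256
W <- matrix(rnorm(n_out * n_in), nrow = n_out)   # weights as a matrix
w <- as.vector(t(W))                             # the same weights flattened row-wise
x <- rnorm(n_in)

forward_blas <- function() W %*% x               # vectorised matrix product

forward_loop <- function() {                     # manual indexing in pure R
  out <- numeric(n_out)
  for (j in seq_len(n_out)) {
    acc <- 0
    for (k in seq_len(n_in)) acc <- acc + w[(j - 1) * n_in + k] * x[k]
    out[j] <- acc
  }
  out
}

system.time(for (r in 1:100) forward_blas())
system.time(for (r in 1:100) forward_loop())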
Instead of having one matrix per layer, I thought of a way ... to store all of my weights in a single array
It's hard to see why you opted for the latter, as it requires you to create a bespoke indexing function. Maybe it's nicer to think about all your weights being in one long array? I would argue, however, that the mental load required to maintain the array mapping is higher than having one matrix dedicated to each layer.
A hash-table-like structure of matrices would be much easier to reason about:
layers <- list(layer1 = matrix(...), layer2 = matrix(...), layerN = matrix(...))
Further reading
http://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html
There are many factors to take into consideration with each of the approaches. I'm not familiar with R, but I'm assuming matrices' buffers are represented as one-dimensional arrays in memory. (Even if they are written as two-dimensional arrays in the underlying C implementation, the compiler stores them as a one-dimensional array in memory.)
The overall outline of the memory operations is as follows. With several matrices (one per layer), you pay one allocation per layer, and each access is an ordinary 2-D index lookup. With one matrix for all layers plus index calculation, you pay a single allocation, and each access is an index lookup plus the cost of evaluating the indexing function. We can clearly see that the second case scales better, even though there's the additional cost of the function call.
Having said that, in general, having a statically allocated array with all the weights for all the layers should be faster.
In most cases a computer's bottleneck is memory bandwidth, and the best way to counteract this is to minimize the number of memory accesses.
With this in mind, there's another, more primitive reason why the second approach will probably be faster: caches.
Here's a good explanation of the performance difference when accessing a two-dimensional array in a loop, by Good Ol' Bob Martin.
TL;DR: Caches take advantage of the principle of locality, so having memory accesses spatially close to each other (as you would in one single array, accessed in a cache-friendly way as explained in Bob Martin's answer) gives better performance than having them spatially separated (in several distinct arrays).
PS: I also recommend benchmarking both approaches and comparing, since these cache nuances are machine-dependent. It might be the case that the dataset/NN is small enough to fit completely in RAM, or even in cache on a very powerful server.
I'm sure you want to use some kind of native array objects, so you get the speedups provided by BLAS/LAPACK implementations (see eg Intel MKL discussion here if you're on Windows). Most of the time in NN evaluation will be spent in matrix multiplications (like SGEMM), and this is where BLAS implementations like Intel MKL can be an order of magnitude faster.
That is, even if the hand-coded indices for your single-array multi-layer network were super fast, you won't be able to use them with the optimised multiplication routines, which would make your whole network significantly slower. Use the native array objects and create a multi-layer abstraction on top of them.
But actually if you want speed and usability (and to really build some NN models), you should consider using something like R interface to TensorFlow. As a bonus you'll get things like running on the GPU for free.
Nice puzzle. If you are asking about calculating the index yourself, that happens at runtime, so I just want to understand how you would let the compiler compute it instead. If you need to play with this information at any later time, I would suggest using a hashmap-like mechanism; I have done that for a similar need.

dist function with large number of points

I am using the dist function from the stats package to calculate the distance between points. My problem is that I have 24469 points, and the output of dist is a vector of length 18705786 instead of a matrix. I already tried converting it with as.matrix, but the result is too large.
How can I find out which points each distance corresponds to?
For example, which(distance <= 700) gives me positions in the vector, but how can I map those positions back to the pairs of points the distances correspond to?
There are some things you could try, depending on what exactly you need:
Calculate the distances in a loop, and only keep those that match the criterion. Especially when the number of matches is much smaller than the total size of the distance matrix, this saves a lot of RAM. Such a loop is probably very slow if implemented in pure R; that is also why dist does not use R but, I believe, C to perform the calculations. This could mean that you get your results but have to wait a while. Alternatively, the excellent Rcpp package would allow you to write this in C/C++, making it much, much faster. (A small sketch of the pure-R variant follows these suggestions.)
Start using packages like bigmemory to store the distance matrix. You then build it in a loop and store it iteratively in the bigmemory object (I have not worked with bigmemory before, so I don't know the exact details). After building the matrix, you can access it to extract your desired results. Effectively, all tricks for handling large data in R apply to this bullet. See e.g. R SO posts on big data.
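A minimal pure-R sketch of the first suggestion (illustrative only; the threshold of 700 is taken from your example, and your_points is a placeholder for your own matrix):
# Loop over the points and keep only the pairs whose distance is within the
# threshold, so the full distance matrix is never stored.
close_pairs <- function(x, threshold = 700) {
  n <- nrow(x)
  res <- list()
  for (i in seq_len(n - 1)) {
    rest <- x[(i + 1):n, , drop = FALSE]              # points after point i
    d <- sqrt(rowSums(sweep(rest, 2, x[i, ])^2))      # distances from point i
    hit <- which(d <= threshold)
    if (length(hit) > 0)
      res[[length(res) + 1]] <- data.frame(from = i, to = i + hit, dist = d[hit])
  }
  do.call(rbind, res)
}

# pts <- as.matrix(your_points)   # one row per point
# close_pairs(pts, threshold = 700)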
Some interesting links (found googling for r distance matrix for large vector):
Efficient (memory-wise) function for repeated distance matrix calculations AND chunking of extra large distance matrices
(lucky you!) http://stevemosher.wordpress.com/2012/04/08/using-bigmemory-for-a-distance-matrix/

hclust size limit?

I'm new to R. I'm trying to run hclust() on about 50K items. I have 10 columns to compare and 50K rows of data. When I try assigning the distance matrix, I get: "Cannot allocate vector of 5GB".
Is there a size limit to this? If so, how do I go about doing a cluster of something this large?
EDIT
I ended up increasing the max.limit and increasing the machine's memory to 8GB, and that seems to have fixed it.
Classic hierarchical clustering approaches are O(n^3) in runtime and O(n^2) in memory complexity. So yes, they scale incredibly badly to large data sets. Obviously, anything that requires materialization of the distance matrix is O(n^2) or worse.
Note that there are some specializations of hierarchical clustering such as SLINK and CLINK that run in O(n^2) and, depending on the implementation, may also need only O(n) memory.
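For example (a sketch of mine, not part of the original answer), the fastcluster package ships a memory-saving single-linkage routine that skips the distance matrix entirely:
# hclust.vector() clusters directly from the data matrix, so memory stays
# roughly linear in the number of rows instead of quadratic.
library(fastcluster)

X <- matrix(rnorm(50000 * 10), ncol = 10)     # toy data of the size in the question
hcl <- hclust.vector(X, method = "single")    # SLINK-style single linkage
plot(hcl, labels = FALSE)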
You might want to look into more modern clustering algorithms. Anything that runs in O(n log n) or better should work for you. There are plenty of good reasons to not use hierarchical clustering: usually it is rather sensitive to noise (i.e. it doesn't really know what to do with outliers) and the results are hard to interpret for large data sets (dendrograms are nice, but only for small data sets).
The size limit is being set by your hardware and software, and you have not given enough specifics to say much more. On a machine with adequate resources you would not be getting this error. Why not try a 10% sample before diving into the deep end of the pool? Perhaps starting with:
reduced <- full[ sample(1:nrow(full), nrow(full)/10 ) , ]
