hclust size limit? - r

I'm new to R. I'm trying to run hclust() on about 50K items. I have 10 columns to compare and 50K rows of data. When I tried to compute the distance matrix, I got: "Cannot allocate vector of 5GB".
Is there a size limit to this? If so, how do I go about doing a cluster of something this large?
EDIT
I ended up increasing the max.limit and increased the machine's memory to 8GB and that seems to have fixed it.

Classic hierarchical clustering approaches are O(n^3) in runtime and O(n^2) in memory. So yes, they scale very badly to large data sets. Anything that requires materializing the full distance matrix is in O(n^2) or worse.
Note that there are some specializations of hierarchical clustering such as SLINK and CLINK that run in O(n^2), and depending on the implementation may also only need O(n) memory.
You might want to look into more modern clustering algorithms. Anything that runs in O(n log n) or better should work for you. There are plenty of good reasons to not use hierarchical clustering: usually it is rather sensitive to noise (i.e. it doesn't really know what to do with outliers) and the results are hard to interpret for large data sets (dendrograms are nice, but only for small data sets).
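To make the O(n^2) memory cost concrete, here is a rough back-of-the-envelope sketch. It assumes the distances are stored the way R's dist() stores them, as the n(n-1)/2 entries of the lower triangle in 8-byte doubles. The helper names dist_bytes and dist_gib are made up for this illustration.

```r
# Bytes needed to hold the lower triangle of an n x n distance matrix,
# assuming 8-byte doubles (as in a "dist" object).
dist_bytes <- function(n, bytes_per_value = 8) {
  n * (n - 1) / 2 * bytes_per_value
}

# The same figure expressed in GiB (2^30 bytes).
dist_gib <- function(n) dist_bytes(n) / 2^30

sizes <- c(1000, 10000, 50000)
data.frame(rows = sizes, GiB = round(dist_gib(sizes), 3))
```

For 50,000 rows this comes to roughly 9.3 GiB for the distances alone, before any working copies the clustering itself may need.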

The size limit is being set by your hardware and software, and you have not given enough specifics to say much more. On a machine with adequate resources you would not be getting this error. Why not try a 10% sample before diving into the deep end of the pool? Perhaps starting with:
reduced <- full[ sample(1:nrow(full), nrow(full)/10 ) , ]
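Filled out a little, the sampling idea might look like the sketch below. The data frame full here is random stand-in data (an assumption, since the real data isn't shown), and cutting the tree into k = 5 groups is an arbitrary choice.

```r
set.seed(123)

# Stand-in for the real data: 5,000 rows and 10 numeric columns.
full <- as.data.frame(matrix(rnorm(5000 * 10), ncol = 10))

# Draw a 10% random sample of the rows without replacement.
idx <- sample.int(nrow(full), size = nrow(full) %/% 10)
reduced <- full[idx, ]

# The distance matrix for 500 rows is small enough to cluster comfortably.
hc <- hclust(dist(reduced), method = "complete")
groups <- cutree(hc, k = 5)
table(groups)
```

If the sample clusters sensibly, the sample size can be increased step by step while watching memory use.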

Related

Is calculating the index in an array more efficient than letting the compiler do it?

I'm trying to generalize a neural network function to arbitrarily many layers, and so I need multiple matrices to hold the weights for each neuron in each layer. I was originally explicitly declaring matrix objects in R to hold my weights for each layer. Instead of having one matrix per layer, I thought of a way (not saying it's original), to store all of my weights in a single array and defined an "indexing function" to map a weight to its appropriate index in the array.
I defined the function as follows:
where is the k-th weight of the j-th neuron in the i-th layer and L(r) is the number of neurons in layer r. After writing these definitions, I realize that stackoverflow doesn't allow latex like mathoverflow which is unfortunate.
Now the question is: Is it more efficient to compute the index of my weights in this way, or is actually less efficient?
After looking up how indices are computed for arrays in general, this is essentially what is done on compilation anyway if I just kept a matrix in each layer holding the weights, so it seems like I may just be making my code overly complicated and harder to understand if there's no difference in time efficiency.
TL;DR: use the matrices. It's easier to understand and takes advantage of optimized CPU instructions.
In computer science parlance, the efficiency (scalability) of algorithms is reasoned about using Big O cost. A score can be given to both the time and space complexity.
Using Big O notation, let's compare the two approaches:
Array Approach
time complexity:
Array index access is O(1) time: no matter how large an array becomes, it is just as computationally easy to access an element given its index.
As you've created a function to compute the index of the k-th weight, this adds some small complexity but would probably run in constant O(1) time as it is a mathematical expression, so negligible.
space complexity:
O(N) where N is the number of weights across all layers.
Matrices Approach
time complexity:
A matrix is essentially a 2d array with O(1) access
space complexity
O(N + M), where N is number of neurons and M is number of weights.
Conceptually, we can see that the two approaches have an equivalent time and space complexity score.
However, there are other trade-offs involved (and as a good SO-er I must inform you of those).
When it comes to working with the data, the array approach is less efficient than the matrices approach because it circumvents the opportunity for SIMD operations. As @liborm alluded to, there are vectorised (SIMD) operations handled by lower-level system libraries like LAPACK/BLAS, which "batch" CPU instructions for some matrix operations (less overhead to transfer and compute data at the CPU compared to sending a new instruction every time).
Instead of having one matrix per layer, I thought of a way ... to store all of my weights in a single array
It's hard to see why you would opt for the latter, as it requires you to create a bespoke indexing function. Maybe it's nicer to think about all your weights being in one long array? However, I would argue the mental load required to maintain the array mapping is higher than having multiple matrices, each dedicated to a layer.
A hash-table like structure of matrices would be much easier to reason about
layers <- list(layer1 = [[...]], layer2 = [[...]], layerN = [[...]])
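As a minimal runnable sketch of that idea: the layer sizes, the random weights and the tanh activation below are all assumptions made for illustration, not something taken from the question.

```r
set.seed(1)

# Hypothetical network: 4 inputs, a hidden layer of 3 neurons, 2 outputs.
layer_sizes <- c(4, 3, 2)

# One weight matrix per layer; row j holds the incoming weights of neuron j.
layers <- lapply(seq_len(length(layer_sizes) - 1), function(i) {
  matrix(rnorm(layer_sizes[i + 1] * layer_sizes[i]),
         nrow = layer_sizes[i + 1], ncol = layer_sizes[i])
})
names(layers) <- paste0("layer", seq_along(layers))

# Forward pass: each step is a single matrix product, which R hands to BLAS.
forward <- function(layers, input) {
  Reduce(function(activation, w) tanh(w %*% activation), layers, input)
}

output <- forward(layers, rnorm(layer_sizes[1]))

# The k-th weight of the j-th neuron in layer i is simply layers[[i]][j, k].
layers$layer1[2, 3]
```

No bespoke index function is needed, and adding a layer only means appending one more matrix to the list.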
Further reading
http://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html
There are many factors to take into consideration in each of the approaches. I'm not familiar with R but I'm assuming matrices' buffers are represented as one-dimensional arrays in memory. (Even if they are written as two dimensional arrays in the underlying C implementation the compiler stores it as one-dimensional array in memory)
The overall outline of the memory operations is:
Case: Several matrices (one per layer)
Allocation of matrices:
Accessing of indices:
Case: One matrix for all layers + index calculation
Allocation of matrix cost:
Accessing each of the indices cost:
Function cost:
We can clearly see that the second case scales better, even though there's the additional cost of the function call.
Having said that, in general a statically allocated array holding all the weights for all the layers should be faster.
In most cases, a computer's bottleneck is memory bandwidth, and the best way to counteract this is to minimize the number of memory accesses.
With this in mind, there's another, more primitive reason why the 2nd approach will probably be faster: caches.
Here's a good explanation of the performance difference in accessing a two-dimensional array in a loop by Good Ol' Bob Martin.
TL; DR: Caches take advantage of the principle of locality, and therefore, having memory accesses spatially close to each other (as you would in one single array and accessing them in a cache-friendly way as explained in Bob Martin's answer) renders better performance than having them spatially separated (having them in several distinct arrays).
PS: I also recommend benchmarking both approaches and comparing them, since these nuances regarding the cache are machine-dependent. It might be the case that the dataset/NN is small enough to fit completely in RAM, or even in cache on a very powerful server.
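A benchmark along those lines could be sketched as follows. The original index formula is not shown in the question, so flat_index below is a hypothetical stand-in that assumes the weights of each layer are laid out column-major, one layer after another. The layer sizes are also made up.

```r
set.seed(1)

# Hypothetical layer sizes L(r); layer i maps L(i) inputs to L(i + 1) neurons.
layer_sizes <- c(50, 40, 30)
n_layers <- length(layer_sizes) - 1

# Approach 1: one matrix per layer.
mats <- lapply(seq_len(n_layers), function(i) {
  matrix(rnorm(layer_sizes[i + 1] * layer_sizes[i]), nrow = layer_sizes[i + 1])
})

# Approach 2: all weights in one flat vector plus a hand-written index function.
flat <- unlist(lapply(mats, as.vector))
offsets <- c(0, cumsum(vapply(mats, length, integer(1))))
flat_index <- function(i, j, k) offsets[i] + (k - 1) * layer_sizes[i + 1] + j

# Both approaches must return the same weights before timing means anything.
idx <- expand.grid(j = seq_len(layer_sizes[2]), k = seq_len(layer_sizes[1]))
via_flat <- flat[flat_index(1, idx$j, idx$k)]
via_matrix <- mats[[1]][cbind(idx$j, idx$k)]
same <- identical(via_flat, via_matrix)

# Time repeated element lookups; the numbers are machine-dependent.
t_flat <- system.time(for (r in 1:200) flat[flat_index(1, idx$j, idx$k)])
t_mat <- system.time(for (r in 1:200) mats[[1]][cbind(idx$j, idx$k)])
rbind(flat = t_flat, matrix = t_mat)[, "elapsed"]
```

This only times element lookups. It says nothing about matrix products, where the per-layer matrices can be passed straight to %*%.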
I'm sure you want to use some kind of native array objects, so you get the speedups provided by BLAS/LAPACK implementations (see eg Intel MKL discussion here if you're on Windows). Most of the time in NN evaluation will be spent in matrix multiplications (like SGEMM), and this is where BLAS implementations like Intel MKL can be an order of magnitude faster.
That is - even if the hand-coded indices for your single-array multi-layer network were super fast, you won't be able to use it with the optimised multiplication routines, which would make your whole network significantly slower. Use the native array objects and create a multi layer abstraction on top of them.
But actually if you want speed and usability (and to really build some NN models), you should consider using something like R interface to TensorFlow. As a bonus you'll get things like running on the GPU for free.
Nice puzzle. If you are asking about calculating the index yourself, that would happen at runtime, after the code has been compiled. I just want to understand: how would you let the compiler compute it? If you need to work with this information later on, I would suggest using a HashMap-like mechanism. I have done that for a similar need.

What are the minimum system requirements for analysing large datasets (30gb) in R?

I tried running the Apriori algorithm on a 30GB CSV file in which each row is a basket with up to 34 items (columns) in it. RStudio died right after execution started. I want to know the minimum system requirements: how much RAM and what CPU configuration do I need to run algorithms on large data sets?
This question cannot be answered as such. It highly depends on what you want to do with the data.
Example
If you are able to process all lines one by one, you just need a tiny bit of RAM (for example if you want to count them; I believe this also holds for the most trivial use of Apriori).
If you want to calculate the distance between all points efficiently, you will want a ton of RAM, and another few GB to store the output (I believe this is even less intense than the most extreme use of Apriori).
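The line-by-line case can be sketched with a file connection that is read in fixed-size chunks, so memory use depends on the chunk size rather than the file size. The function name count_lines_chunked and the small temporary CSV are made up for this illustration.

```r
# Count the lines of a file while holding at most chunk_size lines in memory.
count_lines_chunked <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0  # numeric, so very large files do not overflow an integer
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    total <- total + length(lines)
  }
  total
}

# Small stand-in file: 25 "baskets" with two items each.
tmp <- tempfile(fileext = ".csv")
writeLines(sprintf("item%d,item%d", 1:25, 26:50), tmp)

n_lines <- count_lines_chunked(tmp, chunk_size = 10L)
n_lines
```

Any per-line work (for example, tallying item frequencies) can go inside the loop in place of the simple count.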
Conclusion
As such I would recommend:
Use whatever hardware you have to process a subset of the data. Check your memory and CPU usage, as you increase the data size (or other parameters) and extrapolate your results to see what you probably need.

Handling huge simulations in R

I have written an R program that generates a random vector of length 1 million. I need to simulate it 1 million times. Out of the 1 million simulations, I will be using 50K observed vectors (chosen in some random manner) as samples. So the sample size is 50K x 1M. Is there a way to deal with this in R?
There are a few problems and some not-so-good solutions.
First, R cannot store such a huge matrix on my machine; it exceeds the RAM. I looked into packages like bigmemory, ffbase, etc. that use hard disk space. But data this large can reach terabytes in size, and I have 200GB of hard disk available on my machine.
Even if storing it is possible, there is the problem of running time. The code may take more than 100 hours to run!
Can anyone please suggest a way out? Thanks
This answer really stands as something in between a comment and an answer. The easy way out of your dilemma is to not work with such massive data sets. You can most likely take a reasonably-sized representative subset of that data (say requiring no more than a few hundred MB) and train your model this way.
If you have to use the model in production on actual data sets with millions of observations, then the problem would no longer be related to R.
If possible use sparse matrix techniques
If possible try leveraging storage memory and chunking the object into parts
If possible try to use Big Data tools such as H2O
Leverage multicore and HPC computing with pbdR, parallel, etc
Consider using a spot instance of a Big Data / HPC cloud VPS instance on AWS, Azure, DigitalOcean, etc. Most offer distributions with R preinstalled and with a high RAM multicore instance you can "spin up" (start) and down (stop) quickly and cheaply
Use sampling and statistical solutions when possible
Consider doing some of your simulations or pre-simulation steps in a relational database, or something like Spark + Scala; some have R integration nowadays, actually
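One way to combine the chunking and sampling suggestions is to never hold more than one simulated vector at a time and to keep only a small summary of each retained vector. The sketch below assumes that such a per-vector summary is all the analysis needs, which may not be true for the original problem. The sizes are scaled down, rnorm() stands in for the real simulator, and summarise_vector is a hypothetical helper.

```r
set.seed(2024)

n_sims  <- 500L    # stand-in for 1 million simulations
vec_len <- 1000L   # stand-in for vectors of length 1 million
n_keep  <- 25L     # stand-in for the 50K retained vectors

# Decide up front which simulations will be used as samples.
keep <- sort(sample.int(n_sims, n_keep))

# Hypothetical per-vector summary; replace with whatever the analysis needs.
summarise_vector <- function(v) c(mean = mean(v), sd = sd(v))

results <- matrix(NA_real_, nrow = n_keep, ncol = 2,
                  dimnames = list(NULL, c("mean", "sd")))
row <- 0L
for (s in seq_len(n_sims)) {
  v <- rnorm(vec_len)  # one simulated vector, overwritten on every iteration
  if (s %in% keep) {
    row <- row + 1L
    results[row, ] <- summarise_vector(v)
  }
}

head(results)
```

Memory use is then one vector plus the n_keep x 2 result matrix, regardless of how many simulations are run.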

A framework for comparing the time performance of Expectation Maximization

I have my own implementation of the Expectation Maximization (EM) algorithm based on this paper, and I would like to compare this with the performance of another implementation. For the tests, I am using k centroids with 1 Gb of txt data and I am just measuring the time it takes to compute the new centroids in 1 iteration. I tried it with an EM implementation in R, but I couldn't, since the result is plotted in a graph and gets stuck when there's a large number of txt data. I was following the examples in here.
Does anybody know of an implementation of EM that can measure its performance or know how to do it with R?
Fair benchmarking of EM is hard. Very hard.
The initialization will usually involve randomness, so results can differ widely between runs. For all I know, the R implementation by default uses hierarchical clustering to find the initial clusters, which comes at O(n^2) memory and most likely O(n^3) runtime cost. In my benchmarks, R would run out of memory due to this. I assume there is a way to specify initial cluster centers/models. A random-objects initialization will of course be much faster. k-means++ is probably a good way to choose initial centers in practice.
EM theoretically never terminates. It just at some point does not change much anymore, and thus you can set a threshold to stop. However, the exact definition of the stopping threshold varies.
There exist all kinds of model variations. A method only using fuzzy assignments such as Fuzzy-c-means will of course be much faster than an implementation using multivariate Gaussian Mixture Models with a covariance matrix, in particular with higher dimensionality.
Covariance matrices also need O(k * d^2) memory, and the inversion will take O(k * d^3) time, and thus is clearly not appropriate for text data.
Data may or may not be appropriate. If you run EM on a data set that actually has Gaussian clusters, it will usually work much better than on a data set that doesn't provide a good fit at all. When there is no good fit, you will see a high variance in runtime even with the same implementation.
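The O(k * d^2) memory figure for full covariance matrices can be turned into numbers with a quick sketch. It assumes dense storage in 8-byte doubles. The vocabulary size of 100,000 terms for text data is a made-up example value, and the helper names are hypothetical.

```r
# Bytes needed for k dense d x d covariance matrices of 8-byte doubles.
cov_bytes <- function(k, d, bytes_per_value = 8) k * d^2 * bytes_per_value
cov_gib <- function(k, d) cov_bytes(k, d) / 2^30

low_dim  <- cov_gib(k = 10, d = 10)    # e.g. 10 numeric attributes
text_dim <- cov_gib(k = 10, d = 1e5)   # e.g. a 100,000-term vocabulary

c(low_dim = low_dim, text_dim = text_dim)
```

With these example values, the text case needs roughly 745 GiB for the covariance matrices alone, while the 10-dimensional case needs a few kilobytes.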
For a starter, try running your own algorithm several times with different initialization, and check your runtime for variance. How large is the variance compared to the total runtime?
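A small timing harness for that experiment might look like this. Since the asker's own EM code isn't shown, kmeans() from base R is used purely as a placeholder for the algorithm under test. The helper name time_runs and the random data are also assumptions.

```r
# Run fit_fun once per seed and record the elapsed wall-clock time of each run.
time_runs <- function(fit_fun, n_runs = 5L) {
  vapply(seq_len(n_runs), function(seed) {
    set.seed(seed)  # a different random initialization for every run
    unname(system.time(fit_fun())["elapsed"])
  }, numeric(1))
}

set.seed(42)
x <- matrix(rnorm(2000 * 4), ncol = 4)  # stand-in data

# Placeholder for your own EM implementation.
timings <- time_runs(function() kmeans(x, centers = 3, nstart = 1))

c(mean = mean(timings), sd = sd(timings))
```

If the standard deviation is a large fraction of the mean, a single timed run says little about either implementation.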
You can try benchmarking against the EM implementation in ELKI. But I doubt the implementation will work with sparse data such as text: that data just is not Gaussian, so it is not a proper benchmark. Most likely it will not be able to process the data at all because of this. This is expected, and can be explained from theory. Try to find data sets that are dense and that can be expected to have multiple Gaussian clusters (sorry, I can't give you many recommendations here; the classic Iris and Old Faithful data sets are too small to be useful for benchmarking).

dist() function in R: vector size limitation

I was trying to draw a hierarchical clustering of some samples (40 of them) over some features(genes) and I have a big table with 500k rows and 41 columns (1st one is name) and when I tried
d<-dist(as.matrix(file),method="euclidean")
I got this error
Error: cannot allocate vector of size 1101.1 Gb
How can I get around this limitation? I googled it and came across the ff package in R, but I don't quite understand whether that could solve my issue.
Thanks!
Generally speaking hierarchical clustering is not the best approach for dealing with very large datasets.
In your case, however, there is a different problem. If you want to cluster samples, the structure of your data is wrong. Observations should be represented as the rows, and gene expression (or whatever kind of data you have) as the columns.
Let's assume you have data like this:
data <- as.data.frame(matrix(rnorm(n=500000*40), ncol=40))
What you want to do is:
# Create transposed data matrix
data.matrix.t <- t(as.matrix(data))
# Create distance matrix
dists <- dist(data.matrix.t)
# Clustering
hcl <- hclust(dists)
# Plot
plot(hcl)
NOTE
You should remember that euclidean distances can be rather misleading when you work with high-dimensional data.
When dealing with large data sets, R is not the best choice.
The majority of methods in R seem to be implemented by computing a full distance matrix, which inherently needs O(n^2) memory and runtime. Matrix-based implementations don't scale well to large data, unless the matrix is sparse (which a distance matrix by definition isn't).
I don't know if you realized that 1101.1 Gb is more than 1 terabyte. I don't think you have that much RAM, and you probably won't have the time to wait for this matrix to be computed either.
For example ELKI is much more powerful for clustering, as you can enable index structures to accelerate many algorithms. This saves both memory (usually down to linear memory usage; for storing the cluster assignments) and runtime (usually down to O(n log n), one O(log n) operation per object).
But of course, it also varies from algorithm to algorithm. K-means for example, which needs point-to-mean distances only, does not need (and cannot use) an O(n^2) distance matrix.
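The k-means point can be tried directly in base R: kmeans() works from the data matrix and the current centers, so no n x n distance matrix is ever built. The data below is random stand-in data with two artificially separated groups.

```r
set.seed(7)

n <- 20000
x <- matrix(rnorm(n * 10), ncol = 10)

# Shift half of the rows so that there are two groups to find.
x[1:(n / 2), ] <- x[1:(n / 2), ] + 4

# Only point-to-center distances are computed during the fit.
fit <- kmeans(x, centers = 2, nstart = 3, iter.max = 50)

sizes <- table(fit$cluster)
sizes
```

The memory needed is dominated by the n x 10 data matrix itself, which grows linearly with the number of rows.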
So in the end: I don't think the memory limit of R is your actual problem. The method you want to use doesn't scale.
I just experienced a related issue, but with fewer rows (around 100 thousand, with 16 columns).
RAM size is the limiting factor.
To limit the memory needed, I used 2 different functions from 2 different packages.
From parallelDist, the function parDist() lets you obtain the distances quite fast. It uses RAM during the process, of course, but the resulting dist object seems to take less memory (no idea why).
Then I used the hclust() function, but from the package fastcluster. fastcluster is actually not so fast on this amount of data, but it seems to use less memory than the default hclust().
Hope this will be useful for anybody who finds this topic.
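Put together, the two steps might look like the sketch below. It uses random stand-in data that is much smaller than the 100 thousand rows mentioned above, and it falls back to the base R functions when parallelDist or fastcluster is not installed. The complete-linkage method and k = 4 are arbitrary choices.

```r
set.seed(11)
x <- matrix(rnorm(2000 * 16), ncol = 16)  # stand-in data, 16 columns

# Step 1: distances, computed in parallel if parallelDist is available.
d <- if (requireNamespace("parallelDist", quietly = TRUE)) {
  parallelDist::parDist(x, method = "euclidean")
} else {
  dist(x, method = "euclidean")
}

# Step 2: hierarchical clustering, via fastcluster if it is available.
hc <- if (requireNamespace("fastcluster", quietly = TRUE)) {
  fastcluster::hclust(d, method = "complete")
} else {
  stats::hclust(d, method = "complete")
}

groups <- cutree(hc, k = 4)
table(groups)
```

Both steps still need the full set of pairwise distances, so this reduces the constant factors but not the O(n^2) memory growth.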
