How to make DBSCAN run faster in R for large datasets?

How can I make DBSCAN run faster in R? I have 256,000 observations and I have been playing around with eps and MinPts, but I keep getting an error saying that my vector memory has been exhausted. Is there any way I can make this faster?
I am aware that k-means would be more suitable for large data sets, but I need to explore alternative clustering algorithms beyond it for my project.
Any advice would be much appreciated. Thank you.

Related

R - How to work with a large dataset that does not (but could) fit within memory?

I am trying to work with some data, and it is brutally large (genetic data). Even using only the bare minimum of columns I need, I am still looking at ~30 GB of data.
Assuming I upgrade my PC to 64 GB of RAM, would I even be able to work with that data in R, or will I run into issues somewhere else, e.g. the CPU not being beefy enough (AMD Ryzen 3600X), RStudio not being able to handle it, etc.?
If RStudio cannot handle it or is extremely slow, is there another way I can work with this data? I just want to do dimension reduction (which may make it a lot easier for me to use R) and run logistic regression on the data, maybe with some varied train/test splits.
Any help here is appreciated.
Thank you!

Working with large datasets in R (Sentinel 2)

I'm working with more than 500 gigabytes of rasters in RStudio.
My code works fine, but the problem is that R writes all the raster data into a temporary folder, which means the computation time is more than 4 days (even on an SSD). Is there a way to make the processing faster?
I'm working on a computer with 64 gigabytes of RAM and a 1.5 gigabyte SSD.
best regards
I don't know Sentinel 2, so it's hard to advise on performance specifically. Basically, aside from the unhelpful answer that 'R is not suited for large datasets', you have to try to (a) use parallel computation with the foreach and doParallel packages, (b) find better-suited packages to work with, or (c) reduce the complexity of the problem.
A) One solution is parallel computing, if your calculations can be divided up (e.g., your problem consists of many independent calculations whose results you simply write out). For example, with the foreach and doParallel packages, processing many temporal networks is much faster than with a 'normal' serial for loop (foreach/doParallel are very useful for computing basic statistics for each member of a network and for the global network, whenever you need to repeat those computations over many sub-networks or many 'networks at time T' and .combine the results into one big dataset). That .combine argument is of no use for a single 500 GB network, so you would have to write results out one by one, and it will still take a while (4 days of serial work might come down to several hours of parallel computation, assuming the parallel run is 6 or 7 times faster than your current one).
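A rough sketch of that foreach/doParallel pattern; chunks and process_chunk() are hypothetical placeholders for however you split your own computation:

library(foreach)
library(doParallel)

cl <- parallel::makeCluster(parallel::detectCores() - 1)  # leave one core free
registerDoParallel(cl)

# chunks is a hypothetical list of independent pieces of work;
# process_chunk() is a hypothetical function returning a small data frame of statistics
results <- foreach(ch = chunks, .combine = rbind) %dopar% {
  process_chunk(ch)
}

parallel::stopCluster(cl)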
B) Sometimes it is simply a matter of finding a more suitable package, as with text-mining computations and the performance offered by the quanteda package. I prefer to do text mining tidyverse-style, but for large datasets, and before migrating to a language other than R, quanteda is very powerful and fast, even on large collections of texts. If quanteda is still too slow for basic text mining on your dataset, you have to move to another technology, or stop attempting 'death computing' and/or reduce the complexity of your problem, solution, or dataset size (e.g., quanteda is not, yet, fast at fitting a GloVe model on a very large 500 GB corpus; there you are reaching the limits of the methods the package offers, so you have to try a language other than R: libraries in Python or Java such as spaCy will be better than R for deploying a GloVe model on a very large dataset, and it's not a very big step from R).
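For reference, a minimal sketch of the kind of quanteda workflow meant here (my_texts is a hypothetical character vector with one document per element):

library(quanteda)

corp <- corpus(my_texts)               # build a corpus from the raw texts
toks <- tokens(corp, remove_punct = TRUE)
m    <- dfm(toks)                      # sparse document-feature matrix
m    <- dfm_trim(m, min_termfreq = 5)  # drop rare terms to keep the matrix small
topfeatures(m, 20)                     # quick look at the most frequent terms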
I would suggest trying the terra package; it has pretty much the same functions as raster, but it can be much faster.
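A minimal sketch of what the switch from raster to terra can look like (file names and the temp directory are placeholders); terraOptions() also lets you control where temporary files go and how much RAM terra may use before spilling to disk:

library(terra)

terraOptions(tempdir = "path/to/fast_tmp", memfrac = 0.7)  # hypothetical temp dir; allow more RAM use

r  <- rast("sentinel2_band.tif")   # analogous to raster::raster()
r2 <- r * 0.0001                   # example: scale reflectance; terra processes this in chunks in C++
writeRaster(r2, "sentinel2_scaled.tif", overwrite = TRUE)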

k-means for many duplicate points in R

Suppose I have a one-dimensional data set that contains many repeated numbers, for example S = c(rep(4, times = 1000), rep(5, times = 808), rep(9, times = 990)). Is there an efficient way to do k-means in R? In my actual data I have only around 20 distinct values, but each of them appears around 100,000 times, and it runs very slowly. So I wonder if there is a more efficient way.
K-means can be implemented with weights. It's straightforward to do so.
But IIRC the version included with R is not implemented this way. The version in flexclust may be, but it's pure R and much, much slower.
Either way, you will want to implement this in Fortran or C, like the regular kmeans function. Maybe you can find a package that already has a good implementation.
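A rough sketch of the weighted idea in plain base R, under the assumption that the data are one-dimensional with only a handful of distinct values (as in the question); the loop runs over the ~20 distinct values rather than the full set of observations, so it is fast even though it is pure R:

weighted_kmeans_1d <- function(S, k, iter = 100) {
  tab <- table(S)
  x <- as.numeric(names(tab))   # the ~20 distinct values
  w <- as.numeric(tab)          # how many times each value occurs
  centers <- sample(x, k)       # simple random initialization
  for (i in seq_len(iter)) {
    # assign each distinct value to its nearest center
    cl <- apply(abs(outer(x, centers, "-")), 1, which.min)
    # recompute each center as the weighted mean of its members
    new_centers <- sapply(seq_len(k), function(j)
      if (any(cl == j)) weighted.mean(x[cl == j], w[cl == j]) else centers[j])
    if (all(abs(new_centers - centers) < 1e-12)) break
    centers <- new_centers
  }
  list(centers = centers, cluster = cl[match(S, x)])
}

S <- c(rep(4, times = 1000), rep(5, times = 808), rep(9, times = 990))
weighted_kmeans_1d(S, 2)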

hclust size limit?

I'm new to R. I'm trying to run hclust() on about 50K items. I have 10 columns to compare and 50K rows of data. When I try to create the distance matrix, I get: "Cannot allocate vector of 5GB".
Is there a size limit to this? If so, how do I go about doing a cluster of something this large?
EDIT
I ended up increasing the max.limit and upgrading the machine's memory to 8 GB, and that seems to have fixed it.
Classic hierarchical clustering approaches are O(n^3) in runtime and O(n^2) in memory. So yes, they scale incredibly badly to large data sets. Obviously, anything that requires materializing the distance matrix is O(n^2) or worse.
Note that there are some specializations of hierarchical clustering such as SLINK and CLINK that run in O(n^2), and depending on the implementation may also only need O(n) memory.
You might want to look into more modern clustering algorithms. Anything that runs in O(n log n) or better should work for you. There are plenty of good reasons to not use hierarchical clustering: usually it is rather sensitive to noise (i.e. it doesn't really know what to do with outliers) and the results are hard to interpret for large data sets (dendrograms are nice, but only for small data sets).
The size limit is being set by your hardware and software, and you have not given enough specifics to say much more. On a machine with adequate resources you would not be getting this error. Why not try a 10% sample before diving into the deep end of the pool? Perhaps starting with:
reduced <- full[sample(nrow(full), nrow(full) / 10), ]  # keep a random 10% of the rows
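From there, a quick sketch of clustering just that sample (the method and number of clusters are only illustrative); the distance matrix for ~5K rows is roughly 100 times smaller than for 50K rows:

d  <- dist(reduced)                  # ~5K x 5K instead of 50K x 50K
hc <- hclust(d, method = "ward.D2")  # any hclust method works here
plot(hc, labels = FALSE)
groups <- cutree(hc, k = 10)         # hypothetical number of clusters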

Random forest on a big dataset

I have a large dataset in R (1M+ rows by 6 columns) that I want to use to train a random forest (using the randomForest package) for regression purposes. Unfortunately, I get an "Error in matrix(0, n, n) : too many elements specified" error when trying to do the whole thing at once, and "cannot allocate enough memory" kinds of errors when running it on a subset of the data, even down to 10,000 or so observations.
Seeing that there is no chance I can add more RAM to my machine, and that random forests are very suitable for the type of process I am trying to model, I'd really like to make this work.
Any suggestions or workaround ideas are much appreciated.
You're likely asking randomForest to create the proximity matrix for the data, which, if you think about it, will be insanely big: 1 million x 1 million. A matrix this size would be required no matter how small you set sampsize. Indeed, simply Googling the error message seems to confirm this, as the package author states that the only place in the entire source code where n,n) is found is in calculating the proximity matrix.
But it's hard to help more, given that you've provided no details about the actual code you're using.
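If the proximity matrix is indeed the culprit, a minimal sketch of the kind of call to check (the formula, data frame name, and tuning values are hypothetical) is to leave proximity at its default of FALSE and keep the per-tree sample small:

library(randomForest)

fit <- randomForest(y ~ ., data = train,
                    ntree = 200,
                    sampsize = 10000,    # each tree sees a 10K-row subsample
                    proximity = FALSE)   # do not build the n x n proximity matrix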
