Working with large datasets in R (Sentinel 2) - r

I'm working with more than 500 gigabytes of rasters in RStudio.
My code works fine, but R writes all the raster data into a temporary folder, so the computation takes more than 4 days (even on an SSD). Is there a way to make the processing faster?
I'm working on a computer with 64 gigabytes of RAM and a 1.5 Gigabyte SSD.
best regards

I don't know Sentinel 2, so it's hard to give specific performance advice. Broadly, you can (a) parallelize the computation with the foreach and doParallel packages, (b) look for a better-suited package, or (c) reduce the complexity of the problem. That is apart from unhelpful answers like 'R is not suited for large datasets'.
A) Parallel computing is one option if your calculations can be divided into independent pieces (e.g., the job consists of many calculations whose results you simply write out). For example, with the foreach and doParallel packages, analyzing many temporal networks is much faster than with an ordinary serial for loop: they are very useful for computing basic statistics for each member of a network and for the network as a whole when the same computation has to be repeated over many sub-networks or over many snapshots of a network at time T, with the results then merged into one large dataset via .combine. The .combine argument will be useless for a single 500 GB dataset, so you will have to write the results out one by one, and that will still take a long time (4 days would become several hours of parallel computation, assuming the parallel version is 6 or 7 times faster than your current one).
B) Sometimes it is simply a matter of finding a more suitable package. Text mining is an example: I prefer tidyverse-style text mining, but for large datasets, and before migrating from R to another language, the quanteda package is very powerful and fast, even on large text collections. In that example, if quanteda is too slow for basic text mining on your dataset, you have to move to another technology, or stop throwing brute-force computation at it ('death computing') and reduce the complexity of your problem, your solution, or the size of your data (e.g., quanteda is not yet fast at fitting a GloVe model on a 500 GB dataset; at that point you have reached the limits of the package and should try another language: Python or Java libraries such as spaCy will do better than R at fitting a GloVe model on a very large dataset, and it is not a big step from R).
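The pattern described in (A), where each worker writes its own result instead of returning everything through .combine, could look roughly like the sketch below. It assumes the foreach and doParallel packages are installed; process_tile, the tile count, and the two-worker cluster are hypothetical placeholders, not anything specific to Sentinel 2.

```r
library(foreach)
library(doParallel)

# Hypothetical stand-in for one independent unit of work (e.g. one raster tile).
process_tile <- function(tile_id, out_dir) {
  set.seed(tile_id)
  values <- runif(1000)                      # placeholder for reading the tile
  result <- data.frame(tile = tile_id, mean = mean(values))
  out_file <- file.path(out_dir, sprintf("tile_%03d.rds", tile_id))
  saveRDS(result, out_file)                  # each worker writes its own output
  out_file                                   # return only the small file path
}

out_dir <- file.path(tempdir(), "tiles")
dir.create(out_dir, showWarnings = FALSE)

cl <- makeCluster(2)                         # assumed worker count; tune to your machine
registerDoParallel(cl)

# .combine only concatenates file paths, so nothing large travels between processes.
tile_files <- foreach(i = 1:8, .combine = c) %dopar% {
  process_tile(i, out_dir)
}

stopCluster(cl)
```

Returning only file paths keeps the combined result tiny, which matters when the individual outputs are far too large to hold in memory together.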

I would suggest trying the terra package. It has pretty much the same functions as raster, but it can be much faster.
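As a rough illustration of what that might look like, the sketch below computes a band index with terra and writes it to disk. It assumes the terra package is installed; the two tiny synthetic rasters stand in for real Sentinel 2 bands (which would be opened with rast() on a file path), and the NDVI formula and the memfrac value are illustrative choices, not part of the original question.

```r
library(terra)

# Optional tuning: where temporary files go and how much RAM terra may use.
terraOptions(tempdir = tempdir(), memfrac = 0.6)

# Tiny synthetic rasters standing in for red and near-infrared bands.
set.seed(1)
red <- rast(nrows = 10, ncols = 10, vals = runif(100, 0.05, 0.2))
nir <- rast(nrows = 10, ncols = 10, vals = runif(100, 0.3, 0.6))

# Raster arithmetic is applied cell by cell.
ndvi <- (nir - red) / (nir + red)

# Write the result to a GeoTIFF (hypothetical output path).
out_file <- file.path(tempdir(), "ndvi.tif")
writeRaster(ndvi, out_file, overwrite = TRUE)
```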

Related

How to speed EDA and model running in r?

I am running an exploratory data analysis with the SmartEDA package (https://cran.r-project.org/web/packages/SmartEDA/SmartEDA.pdf). One of its functions, "ExpReport", automatically creates an exploratory data analysis report in HTML format.
I have a dataset with 172 variables and 16487 rows, and this is taking a very long time to run! Is there a way to speed up R for every task we do?
I will also have to run some models on this data (and eventually more data), such as randomForest and logistic regression, and would like a way to do this quickly.
I heard about parallel processing but can't really understand how it works, or whether it works only with specific packages or functions...
Thank you all!
This picture shows how memory and cpu are affected just running "ExpReport":
[Image: memory and CPU consumption while running ExpReport]
The problem with large datasets in R is that R reads the entire dataset into RAM at once, and R objects live entirely in memory.
The doMC package provides a parallel backend for the %dopar% operator using the multicore functionality of the parallel package.
Secondly, packages like bigmemory, ff and data.table come in really handy.
Here is a vignette that will help you handle large datasets:
https://rpubs.com/msundar/large_data_analysis
Hope you find this helpful.
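To make the idea of parallel processing more concrete, here is a minimal sketch that uses only the base parallel package rather than doMC. The simulated data, the bootstrap task, and the two-worker cluster are assumptions made for illustration; the point is only that independent tasks (here, one model fit per resample) can be handed to separate worker processes.

```r
library(parallel)

# Simulated data standing in for a real modelling dataset.
set.seed(1)
n <- 2000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.5 * dat$x1 - 0.25 * dat$x2))

# One independent task: fit a logistic regression on a bootstrap resample.
fit_one <- function(i, dat) {
  idx <- sample(nrow(dat), replace = TRUE)
  coef(glm(y ~ x1 + x2, data = dat[idx, ], family = binomial))
}

cl <- makeCluster(2)                  # assumed worker count
clusterSetRNGStream(cl, 123)          # reproducible random numbers on the workers
coefs <- parLapply(cl, 1:20, fit_one, dat = dat)
stopCluster(cl)

# One row per resample, one column per coefficient.
coef_mat <- do.call(rbind, coefs)
colMeans(coef_mat)
```

Nothing here is specific to glm: any function whose calls do not depend on each other can be distributed the same way, although a function that is not written to be split up (such as a single report-generating call) will not speed up on its own.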

Handling huge simulations in R

I have written an R program that generates a random vector of length 1 million. I need to simulate it 1 million times. Out of the 1 million simulations, I will use 50K observed vectors (chosen in some random manner) as samples, so the sample size is 50K x 1M. Is there a way to deal with this in R?
There are a few problems and some not-so-good solutions.
First, R cannot store such a huge matrix on my machine; it exceeds the available RAM. I looked into packages like bigmemory, ffbase, etc. that use hard disk space, but data this large can run into terabytes, and I have 200 GB of hard disk available on my machine.
Even if storage is possible, there is the problem of running time. The code may take more than 100 hours to run!
Can anyone please suggest a way out? Thanks
This answer really stands as something in between a comment and an answer. The easy way out of your dilemma is to not work with such massive data sets. You can most likely take a reasonably-sized representative subset of that data (say requiring no more than a few hundred MB) and train your model this way.
If you have to use the model in production on actual data sets with millions of observations, then the problem would no longer be related to R.
If possible use sparse matrix techniques
If possible try leveraging storage memory and chunking the object into parts
If possible try to use Big Data tools such as H2O
Leverage multicore and HPC computing with pbdR, parallel, etc
Consider using a spot instance of a Big Data / HPC cloud VPS instance on AWS, Azure, DigitalOcean, etc. Most offer distributions with R preinstalled and with a high RAM multicore instance you can "spin up" (start) and down (stop) quickly and cheaply
Use sampling and statistical solutions when possible
Consider doing some of your simulations or pre-simulation steps in a relational database, or something like Spark + Scala; some have R integration nowadays, actually
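The chunking and sampling suggestions above can be combined in a small sketch: decide which simulations will be kept before generating anything, regenerate only those on demand, and accumulate a summary chunk by chunk instead of storing a full matrix. This assumes the selection of kept vectors does not depend on the simulated values and that a running summary (here a mean vector) is what is ultimately needed; simulate_one and the scaled-down sizes are hypothetical.

```r
set.seed(42)

n_sims     <- 1000   # stands in for 1e6 simulations
vec_length <- 500    # stands in for 1e6 elements per vector
n_keep     <- 50     # stands in for the 50K retained vectors
chunk_size <- 10     # number of vectors held in memory at once

# Choose the retained simulation ids up front.
keep_ids <- sort(sample(n_sims, n_keep))

# Hypothetical simulator: seeding by id lets any vector be regenerated on demand.
simulate_one <- function(id, len) {
  set.seed(id)
  rnorm(len)
}

running_sum <- numeric(vec_length)
chunks <- split(keep_ids, ceiling(seq_along(keep_ids) / chunk_size))

for (ids in chunks) {
  # vec_length x length(ids) matrix: only one chunk is ever in memory.
  block <- vapply(ids, simulate_one, numeric(vec_length), len = vec_length)
  running_sum <- running_sum + rowSums(block)
}

mean_vector <- running_sum / n_keep
```

Peak memory is governed by chunk_size rather than by the total number of simulations, and the discarded simulations are never generated at all.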

Parallel cpu processing tm Dcorpus polarity

I am trying to examine a database containing roughly 80,000 txt documents by scoring the polarity of each sentence in the text with R.
My problem is that my computer (12 GB RAM, 8 CPUs, Windows 10) isn't able to transform the txt files into a corpus: it takes more than two days.
I found out that the DCorpus function makes it possible to use all CPUs in parallel. However, starting from the DCorpus, I don't know how to run the splitSentence function, the transformation to a data frame, and the scoring via the polarity function on all CPUs in parallel as well.
Moreover, I am not sure whether parallelizing the code helps with RAM usage.
Thanks for your help in advance!
All your problems arise from using the tm package, which is incredibly inefficient.
Try, for example, the text2vec package. I believe you will be able to perform your analysis in minutes and with very moderate RAM usage.
Disclosure - I'm the author of this package.
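For orientation, a minimal sketch of turning raw texts into a document-term matrix with text2vec might look like the following. It assumes the text2vec package is installed and uses three made-up documents; it covers only the corpus-building step, not sentence splitting or polarity scoring.

```r
library(text2vec)

# Made-up documents standing in for the contents of the txt files.
docs <- c(doc1 = "Good movie. Really good.",
          doc2 = "Bad movie.",
          doc3 = "A good film.")

# Iterator that lowercases and tokenizes the texts lazily.
it <- itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer,
             progressbar = FALSE)

# Build the vocabulary, then a sparse document-term matrix.
vocab <- create_vocabulary(it)
dtm <- create_dtm(it, vocab_vectorizer(vocab))
dim(dtm)
```

The resulting matrix is sparse, which is a large part of why memory use stays moderate on big collections.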

How to let R use all the cores of the computer?

I have read that R uses only a single CPU. How can I let R use all the available cores to run statistical algorithms?
Yes, for starters, see the High Performance Computing Task View on CRAN. This lists details of packages that can be used in support of parallel computing on a single machine.
From R version 2.14.0, there is inbuilt support for parallel computing via the parallel package, which includes slightly modified versions of the existing snow and multicore packages. The parallel package has a vignette that you should read. You can view it using:
vignette(package="parallel", topic = "parallel")
There are other ways to exploit your multiple cores, for example via use of a multi-threaded BLAS for linear algebra computations.
Whether any of this will speed up the "statistics calculations" you want to do will depend on what those "statistics calculations" are. Spawning off multiple threads or workers entails an overhead cost to set them up, manage them and collect the results. Some operations see a benefit (some large, some small) of using multiple cores/threads, others are slowed down because of this extra overhead.
In short, do not expect to get an n times decrease in your compute time by using n cores instead of just 1.
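One way to see the overhead for yourself is to time the same work serially and on a small cluster, as in the sketch below. The two task functions, the task counts, and the two-worker cluster are arbitrary choices for illustration; actual timings depend entirely on the machine, so none are shown.

```r
library(parallel)

# Two kinds of task: one trivially cheap, one with real work per call.
tiny_task  <- function(i) sqrt(i)
heavy_task <- function(i) sum(sort(runif(2e5)))

cl <- makeCluster(2)   # assumed worker count; detectCores() reports what is available

# Elapsed seconds for running `task` over `xs`, serially and on the cluster.
compare <- function(task, xs) {
  c(serial   = system.time(lapply(xs, task))[["elapsed"]],
    parallel = system.time(parLapply(cl, xs, task))[["elapsed"]])
}

timings <- rbind(tiny  = compare(tiny_task, 1:1000),
                 heavy = compare(heavy_task, 1:16))
stopCluster(cl)

print(timings)
```

For very cheap tasks, the cost of shipping work to the workers and collecting results can outweigh the computation itself, which is the overhead described above.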
If you only need a few* iterations of the same thing (or the same code for a few* different parameters), the easiest way to go is to run several copies of R; the OS will distribute the work across different cores.
Otherwise, go and learn how to use real parallel extensions.
*For the purposes of this answer, "few" means less than or equal to the number of cores.

Best practices for efficient multiple time series analysis

I have a large number of time series (>100) which differ in sampling frequency and in the time period for which they are available. Each series has to be tested for unit roots, seasonally adjusted, and put through other preliminary transformations and checks.
As a large number of series have to be routinely checked, what is the solution to do it efficiently? The concern is to save time in the routine aspects and keep track of the series and analysis results. Unit root testing of the series for example is something subjective. How much of this type of analysis can be automated and how?
I have already read the questions regarding the statistical workflow which suggests having a common script to run on each series.
I am asking something more specific and based on experience of handling a multiple time series dataset. The focus is more on minimizing errors while dealing with so many series and also automating repetitive tasks.
I assume the series will be examined independently, as you've not mentioned any inter-relationships in the models. I'm not sure what kind of object you're looking to use or which tests, but the basic goal of "best practices" is independent of the actual package to be used.
The simplest approaches involve loading objects into a list and analyzing each series via simple iterators such as lapply or via multicore methods such as mclapply or foreach, in R. For Matlab, you can operate over cell arrays. The parallel computing toolbox has a function called parfor, for "parallel for", which is similar to the foreach function in R. For my money, I'd recommend using R as it's cheaper (free) and has a much richer functionality for statistical analyses. Matlab has better documentation and help tools, but these tend to matter less over time as you become more familiar with the tools and methods of your research (and as your bookshelf of references grows).
It's good to become accustomed to using multicore tools in general, as this can substantially decrease the time it takes to do analyses on a bunch of independent small objects.
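The list-plus-iterator approach can be sketched as follows. The three synthetic series and the choice of PP.test (the Phillips-Perron unit root test from the stats package) are assumptions for illustration; the tryCatch wrapper is there so that one problematic series is recorded in the results table instead of stopping the whole run.

```r
# Synthetic series with different frequencies and lengths.
set.seed(1)
series <- list(
  monthly   = ts(cumsum(rnorm(120)), start = c(2010, 1), frequency = 12),
  quarterly = ts(cumsum(rnorm(60)),  start = c(2005, 1), frequency = 4),
  too_short = ts(c(1, 2), frequency = 1)   # deliberately unusable
)

# Run the same check on one series and return a one-row summary.
check_series <- function(x) {
  tryCatch({
    test <- PP.test(x)
    data.frame(n = length(x), frequency = frequency(x),
               pp_stat = unname(test$statistic), p_value = test$p.value,
               error = NA_character_)
  }, error = function(e) {
    # Record the failure so it can be reviewed later.
    data.frame(n = length(x), frequency = frequency(x),
               pp_stat = NA_real_, p_value = NA_real_,
               error = conditionMessage(e))
  })
}

# Serial version; on Unix-like systems parallel::mclapply(series, check_series)
# can be swapped in for lapply without other changes.
results <- do.call(rbind, lapply(series, check_series))
print(results)
```

Keeping one row per series, including an error column, gives a single table to scan for series that need manual attention.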
