How to speed up EDA and model running in R?

I am running an exploratory data analysis with the SmartEDA package (https://cran.r-project.org/web/packages/SmartEDA/SmartEDA.pdf), and one of its functions, ExpReport, automatically creates an exploratory data analysis report in HTML format.
I have a dataset with 172 variables and 16487 rows, and this is taking a very long time to run! Is there a way to speed up R in every task we do?
I will also have to run some models with this data (and more data eventually), like randomForest, logistic regression, etc., and would like to have a method to do this quickly.
I heard about parallel processing but can't really understand how it works and whether it works only with specific packages or functions...
Thank you all!
This screenshot shows how memory and CPU are affected just by running ExpReport:
[Screenshot: memory and CPU consumption while running ExpReport]

The problem with large data sets in R is that R reads the entire data set into RAM all at once, and R objects live entirely in memory.
The doMC package provides a parallel backend for foreach's %dopar% operator, using the multicore functionality of the parallel package.
Secondly, packages like bigmemory, ff and data.table come in really handy.
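As a rough sketch of the foreach/doMC pattern (doMC forks, so it is for Linux/macOS; on Windows doParallel plays the same role), something like the following fits one model per chunk of rows across cores; the data, formula and core count are made-up placeholders, not part of the original question:
# Minimal foreach/doMC sketch -- placeholders only
library(foreach)
library(doMC)
registerDoMC(cores = 4)                      # register a 4-worker backend

chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))

# Fit one small logistic regression per chunk in parallel and
# row-bind the coefficient vectors into a matrix
coefs <- foreach(d = chunks, .combine = rbind) %dopar% {
  coef(glm(am ~ mpg + wt, family = binomial, data = d))
}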
Here is a vignette that will help you handle large datasets:
https://rpubs.com/msundar/large_data_analysis
Hope you find this helpful.

Related

Working with large datasets in R (Sentinel 2)

I'm working with more than 500 gigabytes of rasters in RStudio.
My code works fine, but the problem is that R writes all the raster data into a temporary folder, which means the computation takes more than 4 days (even on an SSD). Is there a way to make the processing faster?
I'm working on a computer with 64 gigabytes of RAM and a 1.5-terabyte SSD.
Best regards
I don't know Sentinel 2, so it's hard to help you on performance specifically. Basically, beyond unhelpful answers like 'R is not suited for large datasets', you have to try to (a) use parallel computation with the foreach and doParallel packages, (b) find better-suited packages to work with, or (c) reduce the complexity of the problem.
A) One solution is parallel computing, if your calculations can be divided (e.g., your problem consists of many calculations and you simply write out the results). For example, with the foreach and doParallel packages, processing many temporal networks is much faster than with a 'normal' serial for-loop: foreach/doParallel are very useful for computing basic statistics for each member of a network and for the global network, whenever you need to repeat these computations over many sub-networks or many 'networks at time T' and .combine the results into one big dataset. That .combine argument is of no use for a single 500 GB network, though, so you have to write the results out one by one, and it will still take a long time (4 days might become several hours of parallel computation, assuming the parallel version is 6 or 7 times faster than your current one).
B) Sometimes it is simply a matter of identifying a more suitable package, as with text-mining computations and the performance offered by the quanteda package. I prefer to do text mining in tidyverse style, but for large datasets, and before migrating to another language than R, quanteda is very powerful and fast, even on large collections of texts. If quanteda is still too slow for basic text mining on your dataset, you have to move to another technology, or stop attempting 'death computing' and reduce the complexity of your problem / solution / dataset size (e.g., quanteda is not, yet, fast at computing a GloVe model on a very large 500 GB text dataset; at that point you are reaching the limits of the methods the package offers, so try another language: Python or Java libraries like SpaCy will do better than R for deploying a GloVe model on a very large dataset, and it's not a very big step from R).
I would suggest trying the terra package; it has pretty much the same functions as raster, but it can be much faster.
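As a rough sketch of what that might look like (the file path, band names, temp-folder location and memory fraction below are placeholders, not taken from the question):
# Minimal terra sketch -- placeholders only
library(terra)

# Steer where temporary files go and how much RAM terra may use
terraOptions(tempdir = "D:/terra_tmp", memfrac = 0.6)

r <- rast("sentinel2_scene.tif")             # opens lazily; pixels stay on disk
ndvi <- (r[["B8"]] - r[["B4"]]) / (r[["B8"]] + r[["B4"]])

# Write straight to a destination file instead of terra's temp folder
writeRaster(ndvi, "ndvi.tif", overwrite = TRUE)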

Is parallel processing a solution for RAM shortage in R due to a large dataset?

I would like to run several machine learning techniques (logistic regression, SVM, random forest, neural network) in R on a dataset of 224 GB, while my RAM is only 16 GB.
I suppose a solution could be to rent a virtual PC in the cloud with 256 GB RAM, for example an EC2 instance at AWS based on an AMI from this post by Louis Aslett:
http://www.louisaslett.com/RStudio_AMI/
Alternatively, I understand there are several parallel processing methods and packages, for example sparklyr, future and ff. Is parallel processing a solution to my problem of limited RAM? Or is parallel processing aimed at running code faster?
If parallel processing is indeed a solution, I assume I would need to modify the processing inside the machine learning packages. For example, logistic regression is done with this line of code:
model <- glm(Y ~ ., family = binomial(link = "logit"), data = train)
However, as far as I know, I have no influence over the calculations inside the glm method.
Your problem is that you can't fit all the data in memory at once, and the standard glm() function needs that. Luckily, linear and generalized linear models can be computed by processing the data in batches; the issue is how to combine the computations between the batches.
Parallel algorithms need to break a dataset up to send to the workers, but if you only have one worker you process the pieces serially, so it's really only the "breaking up" part that you need. The biglm package in R can do that for your class of models.
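A minimal sketch of what that can look like with bigglm() from the biglm package, assuming the data sit in a CSV file; the file name, column names and chunk size are placeholders:
# Minimal biglm sketch -- placeholders only
library(biglm)

# bigglm() can take a function that hands it the data one chunk at a time;
# it calls the function with reset = TRUE to rewind between passes.
make_chunk_reader <- function(path, chunk_size = 50000) {
  con <- NULL
  cols <- names(read.csv(path, nrows = 1))
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(con)) close(con)
      con <<- file(path, open = "r")
      readLines(con, n = 1)                  # skip the header line
      return(NULL)
    }
    chunk <- tryCatch(
      read.csv(con, nrows = chunk_size, header = FALSE, col.names = cols),
      error = function(e) NULL)              # read.csv errors once no lines remain
    if (is.null(chunk) || nrow(chunk) == 0) NULL else chunk
  }
}

fit <- bigglm(Y ~ x1 + x2,
              data = make_chunk_reader("train.csv"),
              family = binomial(link = "logit"))
summary(fit)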
I'd suggest h2o. It has a lot of support for fitting logistic regression, SVM, random forest, and neural networks, among others.
Here's how to install h2o in R.
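A minimal sketch of the logistic regression case, assuming the data sit in a CSV and the response column is called Y (both placeholders):
# Minimal h2o sketch -- placeholders only
library(h2o)
h2o.init(max_mem_size = "12G")               # start a local H2O cluster

train_h2o <- h2o.importFile("train.csv")     # data is held by H2O, not by R
predictors <- setdiff(names(train_h2o), "Y")

fit <- h2o.glm(x = predictors, y = "Y",      # Y should be categorical for binomial
               training_frame = train_h2o,
               family = "binomial")
h2o.performance(fit)                         # training metrics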
I also found the bigmemory family of packages limited in the functionality available.

Parallel CPU processing with tm, DCorpus and polarity

I am trying to analyse a database of roughly 80,000 txt documents by scoring the polarity of each sentence in each text with R.
My problem is that my computer isn't able to transform the txt files into a corpus in reasonable time (12 GB RAM, 8 CPUs, Windows 10): it takes more than two days.
I found out that there is a way to use all CPUs in parallel with the DCorpus function. However, starting from the DCorpus, I don't know how to run the splitSentence function, the transformation to a data frame, and the scoring via the polarity function using all CPUs in parallel again.
Moreover, I am not sure whether parallelizing the code helps me with the RAM usage at all.
Thanks for your help in advance!
All your problems arise from using the tm package, which is incredibly inefficient.
Try, for example, the text2vec package. I believe you will be able to perform your analysis in minutes and with very moderate RAM usage.
Disclosure: I'm the author of this package.
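For illustration, a rough sketch of building a document-term matrix with text2vec; the docs data frame and its columns are placeholders, and the polarity scoring itself would still come from a separate dictionary or package:
# Minimal text2vec sketch -- placeholders only
library(text2vec)

it <- itoken(docs$text,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             ids = docs$id,
             progressbar = FALSE)

vocab <- create_vocabulary(it)
dtm   <- create_dtm(it, vocab_vectorizer(vocab))   # sparse document-term matrix
dim(dtm)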

Has anyone tried to parallelize multiple imputation in 'mice' package?

I'm aware that the Amelia R package provides some support for parallel multiple imputation (MI). However, preliminary analysis of my study's data revealed that the data is not multivariate normal, so, unfortunately, I can't use Amelia. Consequently, I've switched to the mice R package for MI, as this package can perform MI on data that is not multivariate normal.
Since the MI process via mice is very slow (currently I'm using AWS m3.large 2-core instance), I've started wondering whether it's possible to parallelize the procedure to save processing time. Based on my review of mice documentation and the corresponding JSS paper, as well as mice's source code, it appears that currently the package doesn't support parallel operations. This is sad, because IMHO the MICE algorithm is naturally parallel and, thus, its parallel implementation should be relatively easy and it would result in a significant economy in both time and resources.
Question: Has anyone tried to parallelize MI in mice package, either externally (via R parallel facilities), or internally (by modifying the source code) and what are results, if any? Thank you!
Recently, I tried to parallelize multiple imputation (MI) via the mice package externally, that is, by using R's multiprocessing facilities, in particular the parallel package, which ships with the base R distribution. Basically, the solution is to use mclapply() to distribute a pre-calculated share of the total number of needed imputations to each core and then combine the resulting imputed data into a single object. Performance-wise, the results of this approach are beyond my most optimistic expectations: the processing time decreased from 1.5 hours to under 7 minutes(!), and that's on only two cores. I did remove one multilevel factor, but it shouldn't have much effect. Regardless, the result is unbelievable!
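A minimal sketch of that approach, using the nhanes toy data that ships with mice; the number of imputations, seeds and the analysis model are placeholders (mclapply() forks, so the parallelism works on Linux/macOS):
# Minimal parallel-mice sketch -- placeholders only
library(mice)
library(parallel)

n_imp  <- 10                                 # total imputations wanted
n_core <- 2

# Run independent mice() calls on separate cores, each doing a share of m
imps <- mclapply(seq_len(n_core), function(i) {
  mice(nhanes, m = n_imp / n_core, seed = 100 + i, printFlag = FALSE)
}, mc.cores = n_core)

imp_all <- Reduce(ibind, imps)               # combine the per-core mids objects
fit <- with(imp_all, lm(bmi ~ age + chl))
pool(fit)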

SVM modeling with BIG DATA

For modeling with SVM in R, I have used the kernlab package (ksvm method) on Windows XP with 2 GB of RAM. But with as many as 201,497 data rows I am not able to provide enough memory for the modeling (I get the error: cannot allocate vector of size 2.7 GB).
Therefore, I have used Amazon micro and large instances for the SVM modeling, but they show the same issue as my local machine (cannot allocate vector of size 2.7 GB).
Can anyone suggest a solution to this problem of modeling with big data, or is there something wrong with my approach?
Without a reproducible example it is hard to say whether the dataset is just too big or whether some parts of your script are suboptimal. A few general pointers:
Take a look at the High Performance Computing Task View; it lists the main R packages relevant for working with big data.
You currently use your entire dataset for training your model. You could instead take a subset (say 10%) and fit your model on that. Repeating this procedure a few times will show whether the model fit is sensitive to which subset of the data you use (see the sketch after this list).
Some analysis techniques, e.g. PCA, can be done by processing the data iteratively, i.e. in chunks. This makes analyses on very big datasets (>> 100 GB) possible. I'm not sure whether this is possible with kernlab.
Check that the R version you are using is 64-bit.
This earlier question might be of interest.
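A minimal sketch of the subsampling idea, assuming a data frame train with a factor response Y (both placeholders):
# Minimal kernlab subsampling sketch -- placeholders only
library(kernlab)
set.seed(1)

test_idx <- sample(nrow(train), 5000)        # hold out some rows for evaluation
test     <- train[test_idx, ]
pool     <- train[-test_idx, ]

errs <- replicate(5, {
  idx <- sample(nrow(pool), size = 0.1 * nrow(pool))   # a fresh 10% subsample
  fit <- ksvm(Y ~ ., data = pool[idx, ], kernel = "rbfdot")
  mean(predict(fit, test) != test$Y)                   # misclassification rate
})
errs   # similar values across repeats suggest the subsample size is adequate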
