I'm aware that the Amelia R package provides some support for parallel multiple imputation (MI). However, preliminary analysis of my study's data revealed that the data is not multivariate normal, so, unfortunately, I can't use Amelia. Consequently, I've switched to the mice R package for MI, as it can perform MI on data that is not multivariate normal.
Since the MI process via mice is very slow (currently I'm using an AWS m3.large 2-core instance), I've started wondering whether it's possible to parallelize the procedure to save processing time. Based on my review of the mice documentation, the corresponding JSS paper, and mice's source code, it appears that the package currently doesn't support parallel operations. This is a pity, because IMHO the MICE algorithm is naturally parallel; thus, a parallel implementation should be relatively easy and would yield significant savings in both time and resources.
Question: Has anyone tried to parallelize MI in the mice package, either externally (via R parallel facilities) or internally (by modifying the source code), and what were the results, if any? Thank you!
Recently, I tried to parallelize multiple imputation (MI) via the mice package externally, that is, by using R's multiprocessing facilities, in particular the parallel package, which ships with the base R distribution. Basically, the solution is to use mclapply() to distribute a pre-calculated share of the total number of imputations to each core and then combine the resulting imputed data into a single object (a sketch of the approach follows below). Performance-wise, the results are beyond my most optimistic expectations: the processing time decreased from 1.5 hours to under 7 minutes(!), and that's on only two cores. I did remove one multilevel factor, but it shouldn't have much effect. Regardless, the result is remarkable!
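For reference, here is a minimal sketch of that approach, assuming a made-up total of 10 imputations and using the nhanes example data that ships with mice; mice::ibind() merges the per-core mids objects (mclapply() forks, so this works on Unix-like systems only):

library(mice)
library(parallel)

n_cores <- 2                        # e.g., an m3.large exposes 2 cores
m_total <- 10                       # total number of imputations (assumption)
m_per_core <- m_total / n_cores     # share of imputations per core

imp_list <- mclapply(seq_len(n_cores), function(i) {
  # different seed per worker so the random draws differ across cores
  mice(nhanes, m = m_per_core, seed = i, printFlag = FALSE)
}, mc.cores = n_cores)

# merge the partial mids objects into a single object with m_total imputations
imp <- Reduce(ibind, imp_list)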
Related
I am running an exploratory data analysis with the SmartEDA package (https://cran.r-project.org/web/packages/SmartEDA/SmartEDA.pdf), and one of its functions, ExpReport, automatically creates an exploratory data analysis report in HTML format.
I have a dataset with 172 variables and 16,487 rows, and this is taking a very long time to run! Is there a way to speed up R for every task we do?
I will also have to run some models with this data (and eventually more data), like randomForest, logistic regression, etc., and would like a way to do this quickly.
I've heard about parallel processing but can't really understand how it works, or whether it only works with specific packages or functions...
Thank you all!
This screenshot shows how memory and CPU are affected just by running ExpReport:
[screenshot: memory and CPU consumption while running ExpReport]
The problem with large data sets in R is that R reads the entire data set into RAM at once, and R objects live entirely in memory.
First, the doMC package provides a parallel backend for foreach's %dopar% operator, using the multicore functionality of the parallel package; a minimal sketch is below.
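Here is a minimal sketch of that pattern, with an assumed core count and a toy bootstrap task on the built-in mtcars data (doMC works on Unix-like systems; on Windows you would use doParallel instead):

library(doMC)
library(foreach)

registerDoMC(cores = 2)   # register the parallel backend

# fit one model per bootstrap sample in parallel and row-bind the coefficients
results <- foreach(i = 1:100, .combine = rbind) %dopar% {
  idx <- sample(nrow(mtcars), replace = TRUE)
  coef(lm(mpg ~ wt + hp, data = mtcars[idx, ]))
}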
Secondly, packages like bigmemory, ff, and data.table come in really handy; a small data.table example follows.
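For instance, data.table's fread() reads large delimited files much faster than read.csv() and returns a data.table that supports fast grouped operations (the file and column names below are hypothetical):

library(data.table)

dt <- fread("big_file.csv")                    # fast, multi-threaded file reading
dt[, .(mean_val = mean(value)), by = group]    # fast grouped aggregation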
Here is a vignette that will help you handle large datasets:
https://rpubs.com/msundar/large_data_analysis
Hope you find this helpful.
I'm working with more than 500 gigabytes of rasters in RStudio.
My code works fine, but the problem is that R writes all the raster data into a temporary folder, which means the computation takes more than 4 days (even on an SSD). Is there a way to make the processing faster?
I'm working on a computer with 64 gigabytes of RAM and a 1.5-gigabyte SSD.
Best regards
I don't know Sentinel-2, so it's hard to help you with performance specifically. Basically, you have to try to (a) use some parallel computation with the foreach and doParallel packages, (b) find better-suited packages to work with, or (c) reduce the complexity of the problem, rather than settling for unhelpful answers like 'R is not suited for large datasets'.
A) One solution is parallel computing, if your calculations can be divided (e.g., your problem consists of many independent calculations whose results you simply write out). For example, with the foreach and doParallel packages, processing many temporal networks is much faster than with a 'normal' serial for-loop (e.g., foreach/doParallel are very useful for computing basic statistics for each member of a network and for the global network, whenever you need to repeat those computations over many sub-networks or many 'networks at time T' and .combine the results into one large dataset). That .combine argument will be useless for a single 500 GB network, so you would have to write the results out one by one, and it will still take a long time (the 4 days might become several hours of parallel computation, assuming the parallel version is 6 or 7 times faster than your current one). A sketch of the foreach/doParallel pattern is given below.
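Here is a minimal sketch of that foreach/doParallel pattern; the list of 'sub-networks' and the statistic computed per piece are placeholders for illustration:

library(foreach)
library(doParallel)

cl <- makeCluster(6)          # e.g., 6 worker processes (assumption)
registerDoParallel(cl)

# stand-in for a list of sub-networks or time slices
networks <- split(seq_len(600), rep(1:60, each = 10))

stats <- foreach(net = networks, .combine = rbind) %dopar% {
  # compute some basic statistics per piece (placeholder computation)
  data.frame(size = length(net), mean_id = mean(net))
}

stopCluster(cl)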
B) Sometimes it is simply a matter of identifying a more suitable package, as in the case of text-mining computations and the performance offered by the quanteda package. I prefer to do text mining in tidyverse style, but for large datasets, and before migrating to another language, quanteda is very powerful and fast, even on large corpora (a small example is sketched below). If quanteda is still too slow for basic text mining on your dataset, you have to migrate to another technology, or stop attempting 'death computing' and reduce the complexity of your problem, your solution, or the size of your datasets (e.g., quanteda is not yet fast at fitting a GloVe model on a very large 500 GB corpus, so you would be reaching the limits of the methods the package offers; libraries in Python or Java, such as spaCy, will serve you better than R for deploying a GloVe model on a very large dataset, and the step from R is not a big one).
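For reference, a minimal quanteda sketch (the two-document text vector is a toy placeholder); building a tokens object and a document-feature matrix scales well to large corpora:

library(quanteda)

txt <- c(d1 = "R handles text mining well.",
         d2 = "quanteda is fast on large corpora.")
corp <- corpus(txt)
toks <- tokens(corp, remove_punct = TRUE)
dfmat <- dfm(toks)
topfeatures(dfmat, 5)    # the most frequent features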
I would suggest trying the terra package; it has pretty much the same functions as raster, but it can be much faster. A short sketch is below.
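A minimal sketch of what that might look like; the file paths are hypothetical, and terraOptions() is used here to point terra's temporary files at a larger or faster disk:

library(terra)

terraOptions(tempdir = "/path/to/big/fast/disk")   # redirect temporary raster files
r <- rast("sentinel_tile.tif")                     # read a raster
r2 <- r * 2                                        # processed block-wise, not all in RAM
writeRaster(r2, "sentinel_tile_x2.tif", overwrite = TRUE)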
The problem:
I am estimating fixed effects models using the felm function in the lfe package. I estimate several models, but the largest includes approximately 40 dependent variables plus county and year fixed effects (~3000 levels and 14 levels, respectively) using ~50 million observations. When I run the model, I get the following error:
Error in Crowsum(xz, ia) :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:519
Calls: felm -> felm.mm -> newols -> Crowsum
I realize that long vectors contain 2^31 or more elements, so I assume that felm produces these long vectors in estimating the model. Details about my resources: I have access to high-performance computing with multiple nodes, each with multiple cores; the node with the largest memory has 1012GB.
(1) Is support for "long vectors" something that can only be added by the author of lfe package?
(2) If so, are there other options for FE regressions on large data (provided that one has access to large amounts of memory and/or cluster computing, as I do)? (If I need to make a separate post to address this question more specifically, I can do that as well.)
Please note that I know there is a similar post about this problem: Error in LFE - Long Vectors Not Supported - R. 3.4.3
However, I made a new question for two reasons: (1) the author's question was unfocused (it was unclear what feedback to provide, and I didn't want to assume that what I wanted to know was the same as what the author wanted); and (2) even if I edited that question, the original author left out details that I thought could be relevant.
It appears that long vectors won't be supported by the lfe package unless the author adds support.
Other routes for performing regression analysis on large datasets include the biglm package. biglm lets you run a linear or logistic regression on one chunk of data and then update the estimates as you feed it further chunks. However, this is a sequential process, meaning it cannot be performed in parallel, so it may simply take too long. A short sketch of the chunked workflow is below.
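Here is a minimal sketch of that chunked biglm workflow; the chunking of the built-in mtcars data is artificial and just for illustration (with real data you would read each chunk from disk):

library(biglm)

chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))

fit <- biglm(mpg ~ wt + hp, data = chunks[[1]])   # fit on the first chunk
for (ch in chunks[-1]) {
  fit <- update(fit, ch)                          # update the estimates with each further chunk
}
summary(fit)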
Another option, if you have access to more advanced computing resources (more cores, more RAM), is the partools package, which provides tools for working with R's parallel package. It provides a wrapper around R's lm called calm (chunk-averaged linear model), which, similar to biglm, regresses on chunks, but allows the process to be done in parallel. The partools vignette is an excellent resource for learning how to use the package (even for those who haven't used the parallel package before): https://cran.r-project.org/web/packages/partools/index.html
It appears that calm doesn't report standard errors, but for datasets of the size that require parallel computing, standard errors are probably not relevant, especially if the dataset contains data on the entire population being studied. A conceptual sketch of the chunk-averaging idea (not partools' actual API) is given below.
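To illustrate the idea rather than partools' actual interface, here is a conceptual sketch of chunk-averaged regression with base R's parallel package: fit lm() on chunks in parallel and average the coefficient vectors (the chunking of mtcars is, again, purely illustrative):

library(parallel)

cl <- makeCluster(4)
chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))

# fit lm() on each chunk in parallel and collect the coefficient vectors
coef_list <- parLapply(cl, chunks, function(ch) coef(lm(mpg ~ wt + hp, data = ch)))

# chunk-averaged estimates
avg_coef <- Reduce(`+`, coef_list) / length(coef_list)
avg_coef

stopCluster(cl)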
I am using a package called missForest to estimate the missing values in my data set.
My question is: how can we parallelize this process to shorten the time that it takes to get the results?
Please refer to this example (from missForest package):
data(iris)
summary(iris)
The data contains four continuous and one categorical variable.
Artificially produce missing values using the prodNA function:
set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)
summary(iris.mis)
Impute missing values, providing the complete matrix for illustration. Use 'verbose' to see what happens between iterations:
iris.imp <- missForest(iris.mis, xtrue = iris, verbose = TRUE)
Yesterday I submitted version 1.4 of missForest to CRAN; the Windows and Linux packages are ready, the Mac version will follow soon.
The updated missForest function has an additional argument, "parallelize", which allows you either to grow the individual random forests in parallel (parallelize = "forests") or to impute several variables at the same time (parallelize = "variables"). The default is no parallel computing (parallelize = "no").
Do not forget to register a suitable parallel backend, e.g. using the package "doParallel", before trying it for the first time. The "doParallel" vignette gives an illustrative example in Section 4.
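For reference, a minimal sketch of that workflow, reusing the iris example from above (the core count is an assumption):

library(missForest)
library(doParallel)

registerDoParallel(cores = 2)       # register a parallel backend first

set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)

# grow each random forest in parallel; parallelize = "variables" would instead
# impute several variables at the same time
iris.imp <- missForest(iris.mis, parallelize = "forests")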
Due to some other details I had to temporarily remove the "missForest" vignette from the package. But I will resolve this in due course and release it as version 1.4-1.
It's a bit tricky to do a good job of parallelizing the missForest function. There seem to be two basic ways to do it:
Create the randomForest model objects in parallel;
Execute multiple randomForest operations (create model and predict) in parallel for each of the columns of the data frame that contain NA's.
Method 1 is rather easy to implement, except that you have to compute the error estimates yourself, since the randomForest combine function doesn't compute them for you (a standalone sketch of the pattern is below). However, if the individual randomForest objects don't take long to compute and there are many columns containing NA's, you may get little if any speed-up, even though the operations take a long time in aggregate.
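Outside of missForest, the basic method-1 pattern looks like this: grow pieces of one random forest in parallel with foreach and merge them with randomForest::combine (the data, core count, and tree counts are illustrative; as noted, combine() does not carry over the error estimates):

library(randomForest)
library(foreach)
library(doParallel)

registerDoParallel(cores = 2)

rf <- foreach(ntree = rep(250, 2), .combine = randomForest::combine,
              .packages = "randomForest") %dopar% {
  randomForest(Species ~ ., data = iris, ntree = ntree)
}
rf   # a single forest with 500 trees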
Method 2 is a bit harder to implement because the sequential algorithm updates the columns of the xmis data frame after each randomForest operation. I think the right way to parallelize this is to process n columns in parallel at a time (where n is the number of worker processes), thus requiring another loop around the n columns in order to process all of the columns of the data frame. My experiments suggest that unless this is done, the outer loop takes longer to converge, thus losing the benefit of executing in parallel.
In general, to get a performance improvement you will need to implement both of these methods and choose which to use based on your input data. If you just have a few columns with NA's but the randomForest models take a long time to compute, you should choose method 1. If you have many columns with NA's, you should probably choose method 2, even if the individual randomForest models take a long time to compute, because it can be done more efficiently, although it's possible that it will still require an extra iteration of the outer while loop.
In the process of experimenting with missForest, I eventually developed a parallel version of the package. I put the modified version of library.R on a GitHub Gist; however, it isn't trivial to use in that form, especially without documentation. So I contacted the author of missForest, and he is very interested in incorporating at least some of my modifications into the official package, so hopefully the next version of missForest posted to CRAN will support parallel execution.
Do you know if there are any plans to introduce parallel programming in R for all packages?
I'm aware of some developments such as R-revolution and parallel-programming packages, but they seem to offer specialised functions which replace the most popular functions (linear programming, etc.). However, one of the great things about R is the huge number of specialised packages which pop up every day and make complex and time-consuming analyses very easy to run. Many of these use very popular functions such as the generalised linear model, but also use the results for additional calculations and comparisons and finally sort out the output. As far as I understand, you need to define which parts of a function can be run in parallel, so this is probably why most specialised R packages don't have this functionality and cannot have it unless their code is edited.
Are there any plans (or any packages) to enable the most popular R functions to run with parallel processing, so that all the less popular functions built on them can run in parallel as well? For example, the difR package uses the glm function for most of its functions; if glm were enabled to run with parallel processing (or rewritten and then released in a new R version) on all multi-processor machines, there would be no need to rewrite the difR package, and it could then run some of its most burdensome procedures with the aid of parallel processing on a Windows PC.
I completely agree with Paul's answer.
In addition, a general system for parallelization needs some very non-trivial calibration, even for those functions that can easily be parallelized: what if you have a call stack of several functions that offer parallel computation (e.g., you are bootstrapping some model fit, the model-fitting routine may already offer parallelization, and the low-level linear algebra may be implicitly parallel)? You need to estimate (or choose manually) at which level explicit parallelization should be done, and you may additionally have implicit parallelization, so you need to trade these off against each other.
However, there is one particularly easy and general way to parallelize computations implicitly in R: linear algebra can be parallelized and sped up considerably by using an optimized BLAS. Depending on your system, this can be as easy as telling your package manager to install the optimized BLAS, and R will use it. Once it is linked to R, all packages that use the base linear algebra functions like %*%, crossprod, solve, etc. will benefit. A quick check and benchmark are sketched below.
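As a quick way to see which BLAS/LAPACK R is linked against, and to get a rough benchmark of a BLAS-bound operation (the matrix size here is arbitrary):

sessionInfo()    # recent R versions list the BLAS/LAPACK libraries in use

n <- 2000
m <- matrix(rnorm(n * n), n, n)
system.time(crossprod(m))    # typically much faster with OpenBLAS/MKL than with the reference BLAS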
See, e.g., Dirk Eddelbüttel's gcbd package and its vignette, and also the discussions of how to use GotoBLAS2 / OpenBLAS.
Deciding how to parallelize a given problem is often non-trivial, so a specific implementation has to be made in each and every case, here for each R package. I therefore do not think a general implementation of parallel processing in R will be made, or is even possible.