Parallelize rfcv() function for feature selection in randomForest package - r

I wonder if anyone knows how to parallelize rfcv() function implemented in R-package 'randomForest'. Sorry if the question sounds very basic, but I tried to do this using 'foreach' without any results.

Have a look at the caret package and its documentation.
Not only is it more general (allowing for more models than "just" random forests), it also integrates pre- and post-processing --- while giving you parallel execution where feasible, particularly for evaluation and cross-validation, which is an "embarrassingly parallel" problem.
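As a rough sketch (not the internals of rfcv() itself), caret's rfe() with rfFuncs runs a comparable cross-validated feature-selection loop and will use any registered parallel backend; the worker count, subset sizes, and the predictors/response objects below are placeholders.

```r
library(caret)
library(doParallel)

cl <- makeCluster(4)          # adjust the number of workers to your machine
registerDoParallel(cl)

# Recursive feature elimination with random forests, cross-validated in parallel
ctrl <- rfeControl(functions = rfFuncs,
                   method = "cv",
                   number = 5,
                   allowParallel = TRUE)

fit <- rfe(x = predictors,            # data frame of predictors (placeholder)
           y = response,              # response vector (placeholder)
           sizes = c(2, 4, 8, 16),    # candidate feature-subset sizes
           rfeControl = ctrl)

stopCluster(cl)
```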

Related

Difference between brglm & logistf?

I am currently fitting a penalized logistic regression model using the package logistf (due to quasi-complete separation).
I chose this package over brglm because I found many more recommendations for logistf. However, brglm seems to integrate better with other functions such as predict() or margins::margins(). The documentation of brglm says:
"Implementations of the bias-reduction method for logistic regressions can also be found in thelogistf package. In addition to the obvious advantage ofbrglmin the range of link functions that can be used ("logit","probit","cloglog"and"cauchit"), brglm is also more efficient computationally."
Has anyone experience with those two packages and can tell me whether I am overlooking a weakness in brglm, or can I just use it instead of logistf?
I'd be grateful for any insights!
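For anyone weighing the two, a minimal sketch of the call styles (the data frame dat, its columns, and the formula are placeholders, not from the original post); the brglm fit behaves much like a glm object, which is what makes predict() and margins::margins() work on it:

```r
library(logistf)
library(brglm)

# Firth-type penalized logistic regression with logistf
fit_lf <- logistf(y ~ x1 + x2, data = dat)
summary(fit_lf)

# Bias-reduced logistic regression with brglm (default logit link)
fit_br <- brglm(y ~ x1 + x2, family = binomial(link = "logit"), data = dat)
predict(fit_br, newdata = dat, type = "response")  # glm-style prediction
```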

How to implement regularization / weight decay in R

I'm surprised at the number of R neural network packages that don't appear to have a parameter for regularization/lambda/weight decay. I'm assuming I'm missing something obvious. When I use a package like mlr and look at the integrated learners, I don't see parameters for regularization.
For example: nnTrain from the deepnet package:
list of params
I see parameters for just about everything - even drop out - but not lambda or anything else that looks like regularization.
My understanding of both caret and mlr is that they basically organize other ML packages and try to provide a consistent way to interact with them. I'm not finding L1/L2 regularization in any of them.
I've also done 20 google searches looking for R packages with regularization but found nothing. What am I missing? Thanks!
I looked through more of the models within mlr (a daunting task) and eventually found the h2o package learners. In mlr, the classif.h2o.deeplearning model has every parameter I could think of, including L1 and L2.
Installing h2o is as simple as:
install.packages('h2o')
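As an illustrative sketch (iris, the layer sizes, and the penalty values are stand-ins, not from the post), the underlying h2o call exposes l1 and l2 directly, which is what the classif.h2o.deeplearning learner wraps:

```r
library(h2o)
h2o.init()

# Convert an ordinary data frame to an H2OFrame (iris used only as a stand-in)
train <- as.h2o(iris)

fit <- h2o.deeplearning(
  x = 1:4,                 # predictor columns
  y = "Species",           # response column
  training_frame = train,
  hidden = c(32, 32),      # two hidden layers
  l1 = 1e-4,               # L1 regularization
  l2 = 1e-4,               # L2 regularization / weight decay
  epochs = 20
)

h2o.shutdown(prompt = FALSE)
```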

Parallelizing random forests

Through searching and asking, I've found many packages I can use to make use of all the cores of my server, and many packages that can do random forest.
I'm quite new at this, and I'm getting lost among all the ways to parallelize the training of my random forest. Could you give some advice on reasons to use and/or avoid each of them, or on specific combinations (with or without caret?) that have proven themselves?
Packages for parallelization :
doParallel,
doSNOW,
doSMP (discontinued ?),
doMC
(and what about mclapply ?)
Packages for random forest :
[caret + some of the following]
rf,
parRF,
randomForest,
ranger,
Rborist,
parallelRandomForest (crashes my R Studio session...)
Thanks
There are a few answers on SO, such as parallel execution of random forest in R and Suggestions for speeding up Random Forests, that I would take a look at.
Those posts are helpful, but a bit older. The ranger package is an especially fast implementation of random forest, so if you are new to this it might be the easiest way to speed up your model training. Its paper discusses the tradeoffs of some of the available packages; which one gives you the best performance will vary with your data size and number of features.
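For concreteness, here is a rough sketch of two common patterns (iris and the worker/tree counts are arbitrary stand-ins): growing sub-forests in parallel with foreach/doParallel and combining them, versus letting ranger use multiple threads natively.

```r
library(randomForest)
library(doParallel)   # also attaches foreach
library(ranger)

# Pattern 1: grow four sub-forests of 125 trees each in parallel, then combine
cl <- makeCluster(4)
registerDoParallel(cl)
rf <- foreach(ntree = rep(125, 4),
              .combine = randomForest::combine,
              .packages = "randomForest") %dopar% {
  randomForest(Species ~ ., data = iris, ntree = ntree)
}
stopCluster(cl)

# Pattern 2: ranger is multithreaded out of the box
rg <- ranger(Species ~ ., data = iris, num.trees = 500, num.threads = 4)
```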

Is ezPerm (of ez Package) an alternative for aovp (of lmPerm package)?

I was wondering if the ezPerm function (of the ez package) is an appropriate alternative to aovp (of the orphaned and unsupported lmPerm package)?
The aovp function has been the preferred option because it works exactly like aov. ezPerm is fairly easy to use, but I am not sure it is equivalent. And then there is the coin package, which supposedly can do permutation tests, but I have not found a good explanation of it.
ezANOVA is a parametric approach, while ezPerm is a non-parametric (permutation) approach, so it does not require the usual distributional assumptions to be satisfied.
I have not used aovp, but I am now considering it as an alternative to ezPerm. I tried ezPerm, but it takes a long time on big data (it is only practical for small data), and it warns that interactions may not be trusted and that the function is still a work in progress.
Regarding aov and ezANOVA, I read that the problem with ezANOVA is that it doesn't use formulae to define the model (taken from "Just Enough R: ANOVA Cookbook"). I feel ezANOVA is the better option for repeated measures; otherwise they are almost the same.
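For orientation only, a rough sketch of the two call styles (the data frame df, its columns, and the permutation settings are placeholders; check each package's help pages, since the interfaces differ considerably):

```r
library(ez)
library(lmPerm)

# ezPerm: columns are named via dv/wid/within rather than a formula
perm_ez <- ezPerm(data = df, dv = .(score), wid = .(subject),
                  within = .(condition), perms = 1000)

# aovp: the same formula interface as aov(), with permutation-based p-values
perm_aovp <- aovp(score ~ condition + Error(subject/condition),
                  data = df, perm = "Prob")
summary(perm_aovp)
```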

Parallel programming for all R packages

Do you know if there are any plans to introduce parallel programming in R for all packages?
I'm aware of some developments such as Revolution R and the various parallel programming packages, but they seem to provide specialised functions that replace the most popular ones (linear programming etc.). However, one of the great things about R is the huge number of specialised packages that pop up every day and make complex, time-consuming analyses easy to run. Many of these use very popular functions such as the generalised linear model, but also use the results for additional calculation and comparison and finally sort out the output. As far as I understand, you need to define which parts of a function can run in parallel, which is probably why most specialised R packages don't have this functionality and cannot have it unless their code is edited.
Are there any plans (or any packages) to enable the most popular R functions to run in parallel, so that all the less popular functions that call them could run in parallel too? For example, the difR package uses the glm function for most of its functions; if glm itself were enabled to run in parallel (or rewritten and released in a new R version) for all multi-processor machines, then there would be no need to rewrite the difR package, and it could run some of its most burdensome procedures with the aid of parallel processing on a Windows PC.
I completely agree with Paul's answer.
In addition, a general system for parallelization needs some very non-trivial calibration, even for functions that can easily be parallelized: what if you have a call stack of several functions that each offer parallel computation (e.g. you are bootstrapping a model fit, the model fitting itself may already offer parallelization, and the low-level linear algebra can be implicitly parallel)? You need to estimate (or choose manually) at which level explicit parallelization should be done, and you possibly have implicit parallelization as well, so you need to trade these off against each other.
However, there is one particularly easy and general way to parallelize computations implicitly in R: linear algebra can be parallelized and sped up considerably by using an optimized BLAS. Using this can (depending on your system) be as easy as telling your package manager to install the optimized BLAS and R will use it. Once it is linked to R, all packages that use the base linear algebra functions like %*%, crossprod, solve etc. will profit.
See e.g. Dirk Eddelbüttel's gcbd package and its vignette, and also the discussions of how to use GotoBLAS2 / OpenBLAS.
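A quick way to check which BLAS R is actually using and roughly what it buys you (the matrix size is arbitrary; recent R versions report the BLAS/LAPACK libraries near the top of the sessionInfo() output):

```r
# Which BLAS/LAPACK libraries is R linked against?
sessionInfo()

# Rough benchmark: an optimized, multithreaded BLAS speeds this up
# without any change to package code
n <- 2000
m <- matrix(rnorm(n * n), n, n)
system.time(crossprod(m))
```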
How to parallelize a given problem is often non-trivial, so a specific implementation has to be made in each and every case, here for each R package. I therefore do not think a general implementation of parallel processing in R will be made, or is even possible.
