Parallel random forest in R utilizing the caret package

I am using the parallel random forest regression method, method = "parRF", in R's caret package; it seems to run faster than a regular random forest. Could someone explain the difference in implementation that speeds up the process?
Any link to a document explaining the parallel random forest algorithm and its implementation would be of great help.

It is a parallel implementation using your machine's multiple cores and an MPI package.
Check out the page on parallel implementations at http://caret.r-forge.r-project.org/parallel.html and, of course, the package's CRAN page. I hope these provide enough detail.
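As a rough illustration (the data set, core count, and resampling settings below are just placeholders), the usual pattern is to register a foreach backend and then call train() with method = "parRF"; as I understand it, the forest is then grown in pieces on the workers and the pieces are combined into one forest:

```r
library(caret)
library(doParallel)

cl <- makePSOCKcluster(4)   # 4 worker processes; adjust to your machine
registerDoParallel(cl)

set.seed(1)
fit <- train(Species ~ ., data = iris,
             method = "parRF",
             trControl = trainControl(method = "cv", number = 5))

stopCluster(cl)
fit
```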

Related

Is it possible to build a random forest with model based trees i.e., `mob()` in partykit package

I'm trying to build a random forest using model-based regression trees from the partykit package. I have built a model-based tree using the mob() function with a user-defined fit() function which returns an object at the terminal node.
In partykit there is cforest(), which uses only ctree()-type trees. I want to know if it is possible to modify cforest(), or to write a new function, to build random forests from model-based trees that return objects at the terminal nodes. I want to use the objects in the terminal nodes for predictions. Any help is much appreciated. Thank you in advance.
Edit: The tree I have built is similar to the one here -> https://stackoverflow.com/a/37059827/14168775
How do I build a random forest using a tree similar to the one in the above answer?
At the moment, there is no canned solution for general model-based forests using mob(), although most of the building blocks are available. However, we are currently reimplementing the backend of mob() so that we can leverage the infrastructure underlying cforest() more easily. Also, mob() is quite a bit slower than ctree(), which is somewhat inconvenient when learning forests.
The best alternative, currently, is to use cforest() with a custom ytrafo. These can also accommodate model-based transformations, very much like the scores in mob(). In fact, in many situations ctree() and mob() yield very similar results when provided with the same score function as the transformation.
A worked example is available in this conference presentation:
Heidi Seibold, Achim Zeileis, Torsten Hothorn (2017).
"Individual Treatment Effect Prediction Using Model-Based Random Forests."
Presented at Workshop "Psychoco 2017 - International Workshop on Psychometric Computing",
WU Wirtschaftsuniversität Wien, Austria.
URL https://eeecon.uibk.ac.at/~zeileis/papers/Psychoco-2017.pdf
The special case of model-based random forests for individual treatment effect prediction was also implemented in a dedicated package model4you that uses the approach from the presentation above and is available from CRAN. See also:
Heidi Seibold, Achim Zeileis, Torsten Hothorn (2019).
"model4you: An R Package for Personalised Treatment Effect Estimation."
Journal of Open Research Software, 7(17), 1-6.
doi:10.5334/jors.219
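As a rough illustration of that route (the data, variable names, and arguments below are made up; pmforest() and pmodel() are the package's main entry points, but see the package vignette for the exact interface):

```r
library("model4you")

# Toy data: outcome y, binary treatment trt, covariates x1 and x2.
set.seed(1)
d <- data.frame(y   = rnorm(200),
                trt = factor(rep(c("A", "B"), 100)),
                x1  = rnorm(200),
                x2  = rnorm(200))

# Base model fitted to everyone; the forest then partitions its
# parameters (here: the treatment effect) with respect to the covariates.
base_mod <- lm(y ~ trt, data = d)
frst <- pmforest(base_mod, data = d, ntree = 100)

# Person-wise coefficients, i.e. individualized treatment effects.
coefs <- pmodel(frst)
head(coefs)
```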

Is parallel processing a solution for RAM shortage in R due to a large dataset?

I would like to apply several machine learning techniques (logistic regression, SVM, random forest, neural networks) in R to a dataset of 224 GB, while my RAM is only 16 GB.
I suppose a solution could be to rent a virtual machine in the cloud with 256 GB RAM, for example an EC2 instance on AWS based on the AMI from this post by Louis Aslett:
http://www.louisaslett.com/RStudio_AMI/
Alternatively, I understand there are several parallel processing methods and packages, for example sparklyr, future, and ff. Is parallel processing a solution to my problem of limited RAM, or is parallel processing targeted at running code faster?
If parallel processing is a solution, I assume I need to modify the processes within the machine learning packages. For example, logistic regression is done with this line of code:
model <- glm(Y ~ ., family = binomial(link = "logit"), data = train)
However, as far as I know, I have no influence over the calculations within the glm method.
Your problem is that you can't fit all the data in memory at once, and the standard glm() function needs that. Luckily, linear and generalized linear models can be computed using the data in batches. The issue is how to combine the computations between the batches.
Parallel algorithms need to break up datasets to send to workers, but if you only have one worker, you'd need to process them serially, so it's only the "breaking up" part that you need. The biglm package in R can do that for your class of models.
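As a rough sketch of that chunked pattern (the file name, column names, and chunk size below are hypothetical; for logistic regression you would use bigglm() with family = binomial() in the same spirit):

```r
library(biglm)

# First chunk initializes the fit.
chunk1 <- read.csv("train.csv", nrows = 100000)
fit <- biglm(Y ~ x1 + x2, data = chunk1)

# Each further chunk is folded into the existing fit with update(),
# so the full file never has to be in memory at once.
chunk2 <- read.csv("train.csv", skip = 100001, nrows = 100000,
                   header = FALSE, col.names = names(chunk1))
fit <- update(fit, chunk2)

summary(fit)
```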
I'd suggest h2o. It has a lot of support for fitting logistic regression, SVM, random forest, and neural networks, among others.
Here's how to install h2o in R
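A minimal sketch of the h2o workflow might then look like this (the file path and the target column Y are hypothetical; the data is held by the local H2O cluster rather than in R's memory):

```r
install.packages("h2o")
library(h2o)

h2o.init(max_mem_size = "12G")        # start a local H2O cluster (Java)
train <- h2o.importFile("train.csv")  # data lives in H2O, not in R
train$Y <- as.factor(train$Y)         # binary target for classification

fit <- h2o.glm(y = "Y",
               x = setdiff(colnames(train), "Y"),
               training_frame = train,
               family = "binomial")
fit
```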
I also didn't find the bigmemory family of packages useful here, as they are limited in the functionality available.

MXNet-Caret Training and Optimization

I am using the MXNet library in RStudio to train a neural network model.
When training the model using caret, I can tune (among others) the "momentum" parameter. Is this related to the stochastic gradient descent optimizer?
I know that this is the default optimizer when training with mx.model.FeedForward.create, but what happens when I use caret::train?
Momentum is related to SGD and controls how prone your algorithm is to changing the direction of descent. There are several formulas for this; read more about it here: https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d
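For intuition, here is a tiny numeric sketch of one common formulation of the momentum update (MXNet's exact update rule may differ in the details):

```r
gamma <- 0.9   # momentum coefficient (the parameter caret tunes)
eta   <- 0.1   # learning rate
grad  <- function(w) 2 * w   # gradient of the toy objective f(w) = w^2

w <- 5; v <- 0
for (i in 1:200) {
  v <- gamma * v + eta * grad(w)   # velocity accumulates past gradients
  w <- w - v                       # take the step along the velocity
}
w   # close to the minimum at 0, after oscillating around it
```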
The caret package is meant to be general purpose, so it works with MXNet. When you call caret::train() it accepts a method parameter, which is looked up in caret's model repository; that repository currently supports MXNet. See https://github.com/topepo/caret/issues/887 for an example with Adam, or https://github.com/topepo/caret/blob/master/RegressionTests/Code/mxnet.R for regular SGD.
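As a rough sketch of what that looks like (this assumes the mxnet R package is installed; the data and tuneLength are illustrative, and you can inspect the tuned parameter names yourself with getModelInfo()):

```r
library(caret)

# Which hyperparameters does caret tune for method = "mxnet"?
# ("momentum" is among them, next to layer sizes, learning rate, etc.)
getModelInfo("mxnet")$mxnet$parameters

# caret cross-validates over its tuning grid, so momentum is tuned like
# any other hyperparameter and then handed to MXNet's SGD optimizer
# when each candidate model is fit.
set.seed(1)
fit <- train(Species ~ ., data = iris,
             method = "mxnet",
             tuneLength = 3,
             trControl = trainControl(method = "cv", number = 5))
fit$bestTune
```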

Parallelizing random forests

Through searching and asking, I've found many packages I can use to make use of all the cores of my server, and many packages that can do random forests.
I'm quite new to this, and I'm getting lost among all the ways to parallelize the training of my random forest. Could you give some advice on reasons to use and/or avoid each of them, or on specific combinations of them (with or without caret?) that have proven themselves?
Packages for parallelization:
doParallel,
doSNOW,
doSMP (discontinued?),
doMC
(and what about mclapply?)
Packages for random forest:
[caret + some of the following]
rf,
parRF,
randomForest,
ranger,
Rborist,
parallelRandomForest (crashes my RStudio session...)
Thanks
There are a few answers on SO, such as parallel execution of random forest in R and Suggestions for speeding up Random Forests, that I would take a look at.
Those posts are helpful, but they are a bit older. The ranger package is an especially fast implementation of random forests, so if you are new to this it might be the easiest way to speed up your model training. The ranger paper discusses the trade-offs of some of the available packages; depending on your data size and number of features, which package gives you the best performance will vary.
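As a minimal illustration of the ranger route (data and settings are placeholders):

```r
library(ranger)

# ranger parallelizes the tree growing itself via num.threads, so no
# foreach/do* backend is needed for the forest.
fit <- ranger(Species ~ ., data = iris,
              num.trees = 500,
              num.threads = 4)   # use 4 cores
fit$prediction.error             # out-of-bag error

# Through caret the same model is available as method = "ranger"; if you
# also register a doParallel backend for caret's resampling, keep
# num.threads = 1 so the cores are not oversubscribed.
```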

Parallelize rfcv() function for feature selection in randomForest package

I wonder if anyone knows how to parallelize the rfcv() function implemented in the R package randomForest. Sorry if the question sounds very basic, but I tried to do this using foreach without any results.
Have a look at the caret package and its documentation.
It is not only more general (allowing for more models than "just" random forests) but also integrates pre- and post-processing, while giving you parallel execution where feasible, particularly for evaluation and cross-validation, which is an "embarrassingly parallel" problem.
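If you want a concrete starting point, caret's rfe() with rfFuncs plays a role similar to rfcv() (cross-validated random-forest feature selection), and its resampling runs in parallel once a foreach backend is registered. This sketch uses a built-in data set and made-up subset sizes:

```r
library(caret)
library(doParallel)

cl <- makePSOCKcluster(4)
registerDoParallel(cl)

set.seed(1)
res <- rfe(x = mtcars[, -1], y = mtcars$mpg,
           sizes = c(2, 4, 6, 8),
           rfeControl = rfeControl(functions = rfFuncs,
                                   method = "cv", number = 5))
stopCluster(cl)
res
```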
