RandomForest algorithm in SparkR? - r

I have implemented the randomForest algorithm in R and am trying to do the same using SparkR (from Apache Spark 2.0.0).
But I found only linear model functions such as glm() in SparkR:
https://www.codementor.io/spark/tutorial/linear-models-apache-spark-1-5-uses-present-limitations
I couldn't find any random forest (decision tree) examples.
There is a RandomForest implementation in Spark's MLlib, but I cannot find R bindings for MLlib either.
Could you let me know whether SparkR (2.0.0) supports random forests? If not, is it possible to connect SparkR with MLlib to use RandomForest?
Otherwise, how can we achieve this using SparkR?

True, it's not available in SparkR as of now.
One possible option is to build random forests on distributed chunks of the data and combine the trees later.
After all, the method is built on randomness anyway.
A good link: https://groups.google.com/forum/#!topic/sparkr-dev/3N6LK7k4NB0
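To illustrate the chunk-and-combine idea, here is a sketch in plain R (not SparkR) using the CRAN randomForest package; in a real Spark setting each chunk would be trained on a separate worker, whereas here a local data frame is simply split into pieces.

```r
# Chunk-and-combine sketch with the CRAN randomForest package.
library(randomForest)

set.seed(42)
# Split the data into 3 chunks (stand-ins for distributed partitions).
chunks <- split(iris, rep(1:3, length.out = nrow(iris)))

# Grow a small forest on each chunk.
forests <- lapply(chunks, function(d)
  randomForest(Species ~ ., data = d, ntree = 50))

# randomForest::combine() merges the ensembles into one forest.
big_forest <- do.call(combine, forests)
big_forest$ntree  # 150 trees in total
```

One caveat: the combined object's out-of-bag error estimates are no longer meaningful, since each sub-forest only ever saw its own chunk; accuracy should be assessed on a held-out set.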

Related

Can we import the random forest model built using SparkR to R and then use getTree to extract one of the trees?

In a decision tree we can see or visualize the node splits, and I want to do something similar. But I am using SparkR, which does not have decision trees. So I am planning to run a random forest with a single tree as a parameter in SparkR, save the model, and use getTree to see the node splits and then visualize them with ggplot.
The short answer is no.
Models built with SparkR are not compatible with ones built with the respective R packages, in this case randomForest; hence, you will not be able to use the getTree function from the latter to visualize a tree from a random forest built with SparkR.
On a different level: I am surprised that decision trees have still not found their way into SparkR - they have looked ready in the GitHub repo for several months now. But even when they arrive, they are not expected to offer methods for visualizing trees, and you will still not be able to use functions from other R packages for that purpose.
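For reference, the workflow the asker describes does work when the model is built with the CRAN randomForest package itself (just not with a SparkR model): grow a one-tree "forest" and inspect its splits with getTree().

```r
# Inspecting splits of a single tree with the CRAN randomForest package.
library(randomForest)

set.seed(1)
fit <- randomForest(Species ~ ., data = iris, ntree = 1)

# getTree() returns one row per node: left/right daughter, split
# variable, split point, status, and the node's prediction.
tree <- getTree(fit, k = 1, labelVar = TRUE)
head(tree)
```

The resulting data frame is what one would then feed into a ggplot-based visualization; it is only available because both the model and getTree come from the same package.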

How to implement regularization / weight decay in R

I'm surprised at the number of R neural network packages that don't appear to have a parameter for regularization/lambda/weight decay. I'm assuming I'm missing something obvious. When I use a package like MLR and look at the integrated learners, I don't see parameters for regularization.
For example, nn.train from the deepnet package (see its parameter list):
I see parameters for just about everything - even dropout - but not lambda or anything else that looks like regularization.
My understanding of both caret and mlr is that they basically organize other ML packages and try to provide a consistent way to interact with them. I'm not finding L1/L2 regularization in any of them.
I've also done 20 google searches looking for R packages with regularization but found nothing. What am I missing? Thanks!
I looked through more of the models within mlr, (a daunting task), and eventually found the h2o package learners. In mlr, the classif.h2o.deeplearning model has every parameter I could think of, including L1 and L2.
Installing h2o is as simple as:
install.packages('h2o')
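As a hedged sketch of what the answer found: once h2o is installed, its deep learning interface exposes L1 and L2 penalties directly (via the l1 and l2 arguments), whether called through mlr's classif.h2o.deeplearning wrapper or through h2o's own API, as below.

```r
# Sketch: weight decay / regularization via h2o's deep learning API.
library(h2o)
h2o.init()

iris_h2o <- as.h2o(iris)
fit <- h2o.deeplearning(
  x = 1:4, y = "Species",
  training_frame = iris_h2o,
  hidden = c(10, 10),
  l1 = 1e-4,   # L1 (lasso-style) weight penalty
  l2 = 1e-4,   # L2 penalty, i.e. classic weight decay
  epochs = 10
)
```

Note that h2o.init() starts a local Java-backed H2O instance, so a working Java runtime is required.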

hybridEnsemble package in R

I want to build a bagged logistic regression model in R. My dataset is highly imbalanced, with only 0.007% positive occurrences.
My thoughts to solve this was to use Bagged Logistic Regression. I came across the hybridEnsemble package in R. Does anyone have an example of how this package can be used? I searched online, but unfortunately did not find any examples.
Any help will be appreciated.
The way I would try to solve this is to use the h2o.stackedEnsemble() function in the h2o R package. You can automatically create more balanced classifiers by using the balance_classes = TRUE option in the base learners. More information about how to use this function to create ensembles is in the Stacked Ensemble section of the H2O docs.
Also, using H2O will be way faster than anything that's written in native R.
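A hedged sketch of the suggested approach follows, on a made-up imbalanced toy dataset standing in for the asker's data: cross-validated base learners trained with balance_classes = TRUE, then stacked with h2o.stackedEnsemble(). (balance_classes is available for the tree and deep learning base learners; for a pure logistic base learner via h2o.glm you would use observation weights instead.)

```r
# Stacked ensemble with class balancing in h2o.
library(h2o)
h2o.init()

# Toy imbalanced binary data (stand-in for the asker's dataset).
set.seed(1)
n <- 1000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- factor(ifelse(df$x1 + rnorm(n) > 2, "pos", "neg"))
train <- as.h2o(df)

# Base learners must share the fold scheme and keep their CV predictions.
gbm <- h2o.gbm(x = c("x1", "x2"), y = "y", training_frame = train,
               nfolds = 5, fold_assignment = "Modulo",
               keep_cross_validation_predictions = TRUE,
               balance_classes = TRUE)
rf  <- h2o.randomForest(x = c("x1", "x2"), y = "y", training_frame = train,
               nfolds = 5, fold_assignment = "Modulo",
               keep_cross_validation_predictions = TRUE,
               balance_classes = TRUE)

ens <- h2o.stackedEnsemble(x = c("x1", "x2"), y = "y",
                           training_frame = train,
                           base_models = list(gbm, rf))
```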

Adaboosting in R with any classifier

There is an implementation of the AdaBoost algorithm in R. See this link; the function is called boosting.
The problem is that this package uses classification trees as the base (weak) learner.
Is it possible to substitute another weak learner (e.g., an SVM or a neural network) in this package?
If not, are there any examples of AdaBoost implementations in R?
Many thanks!
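For the second part of the question, here is a minimal sketch of discrete AdaBoost in base R where the weak learner is pluggable: any pair of fit(x, y, w) / predict functions returning labels in {-1, +1} can be dropped in. The decision stump below is just one illustration; an SVM or neural-network wrapper that supports observation weights would slot into the same interface.

```r
# Minimal discrete AdaBoost with a pluggable weak learner.
# y must be coded as -1/+1; `fit` takes (x, y, weights), `pred` takes
# (model, x) and returns -1/+1 predictions.
adaboost <- function(x, y, fit, pred, M = 20) {
  n <- nrow(x)
  w <- rep(1 / n, n)
  models <- vector("list", M)
  alpha  <- numeric(M)
  for (m in 1:M) {
    models[[m]] <- fit(x, y, w)
    h   <- pred(models[[m]], x)
    err <- sum(w * (h != y))
    err <- min(max(err, 1e-10), 1 - 1e-10)   # guard against 0/1 error
    alpha[m] <- 0.5 * log((1 - err) / err)
    w <- w * exp(-alpha[m] * y * h)          # upweight misclassified points
    w <- w / sum(w)
  }
  structure(list(models = models, alpha = alpha, pred = pred),
            class = "adaboost")
}

predict.adaboost <- function(object, x, ...) {
  votes <- sapply(seq_along(object$models), function(m)
    object$alpha[m] * object$pred(object$models[[m]], x))
  sign(rowSums(votes))                       # weighted majority vote
}

# Example weak learner: a one-variable decision stump minimising
# weighted error.
stump_fit <- function(x, y, w) {
  best <- list(err = Inf)
  for (j in seq_len(ncol(x)))
    for (s in unique(x[, j]))
      for (pol in c(1, -1)) {
        h   <- pol * ifelse(x[, j] <= s, 1, -1)
        err <- sum(w * (h != y))
        if (err < best$err) best <- list(j = j, s = s, pol = pol, err = err)
      }
  best
}
stump_pred <- function(m, x) m$pol * ifelse(x[, m$j] <= m$s, 1, -1)

# Usage on a small two-class problem:
x <- as.matrix(iris[1:100, 1:4])
y <- ifelse(iris$Species[1:100] == "setosa", 1, -1)
fit <- adaboost(x, y, stump_fit, stump_pred, M = 10)
mean(predict(fit, x) == y)  # training accuracy
```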

Running Statistical tests in R from Decisions Trees by Weka

I am trying to figure out how to do this but Google does not seem to find me an answer.
I have a nice dataset that I am able to generate a pruned and unpruned decision trees in Weka. From this I can get the 10-fold cross-validation information which is nice.
But I would like to run statistical tests between the two decision trees, e.g., a t-test or a Wilcoxon test, using R. I have been advised to use the DMwR and RWeka packages, but as I have no prior experience with this language, after reading the RWeka docs and googling for tutorials and other explanations, I am coming up empty-handed.
As far as I know, you can run a t-test using WEKA's Experimenter, where you select the dataset and the algorithms (they could be the same algorithm with different parameters) and then perform the t-test.
As for the Wilcoxon test, what I usually do is save each model generated by WEKA (they are Java objects), read those objects back in, and run the test in a Java program (you may be able to do this in R).
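If the per-fold accuracies of the two trees can be exported from WEKA (ten numbers each, paired by fold), the tests themselves are one-liners in base R. The vectors below are made-up illustrative values, not real WEKA output.

```r
# Paired tests on per-fold cross-validation accuracies (illustrative values).
acc_pruned   <- c(0.91, 0.89, 0.93, 0.90, 0.92, 0.88, 0.91, 0.94, 0.90, 0.92)
acc_unpruned <- c(0.90, 0.88, 0.94, 0.89, 0.91, 0.87, 0.92, 0.93, 0.89, 0.91)

# Paired t-test across folds.
t.test(acc_pruned, acc_unpruned, paired = TRUE)

# Wilcoxon signed-rank test, the non-parametric alternative.
wilcox.test(acc_pruned, acc_unpruned, paired = TRUE)
```

The pairing matters: both models must be evaluated on the same folds so the test compares fold-by-fold differences rather than two unrelated samples.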