Running Statistical Tests in R from Decision Trees by Weka - r

I am trying to figure out how to do this but Google does not seem to find me an answer.
I have a nice dataset from which I am able to generate pruned and unpruned decision trees in Weka. From this I can get the 10-fold cross-validation information, which is nice.
But I would like to run statistical tests between the two decision trees, i.e. a t-test or Wilcoxon test, using R. I have been advised to use the DMwR and RWeka packages, but as I have no prior experience with this language, reading the RWeka docs and googling for tutorials or other explanations, I am coming up empty-handed.

As far as I know, you can run a t-test using WEKA's Experimenter, where you select the dataset and the algorithms (they could be the same algorithm with different parameters) and then perform a t-test.
As for the Wilcoxon test, what I usually do is "save" each model generated by WEKA (they are Java objects), read those objects back in a Java program (you may be able to do the same in R), and perform the test there.
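If you would rather stay in R, here is a minimal sketch with RWeka, assuming the goal is a paired test over the same 10 folds comparing pruned vs. unpruned J48 (Weka's C4.5); iris stands in for your dataset and accuracy for whatever performance measure you prefer:

```r
library(RWeka)  # provides J48() and Weka_control()

set.seed(42)
folds <- sample(rep(1:10, length.out = nrow(iris)))   # assign each row to a fold
acc   <- function(model, test) mean(predict(model, test) == test$Species)

pruned <- unpruned <- numeric(10)
for (k in 1:10) {
  train <- iris[folds != k, ]
  test  <- iris[folds == k, ]
  pruned[k]   <- acc(J48(Species ~ ., data = train), test)
  unpruned[k] <- acc(J48(Species ~ ., data = train,
                         control = Weka_control(U = TRUE)), test)  # -U = unpruned tree
}

t.test(pruned, unpruned, paired = TRUE)       # paired t-test over the folds
wilcox.test(pruned, unpruned, paired = TRUE)  # paired Wilcoxon signed-rank test
```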

Related

Is it possible to build a random forest with model-based trees, i.e. `mob()` in the partykit package

I'm trying to build a random forest using model-based regression trees from the partykit package. I have built a model-based tree using the mob() function with a user-defined fit() function which returns an object at the terminal node.
In partykit there is cforest(), which uses only ctree()-type trees. I want to know if it is possible to modify cforest() or write a new function which builds random forests from model-based trees and returns objects at the terminal nodes. I want to use the objects in the terminal nodes for predictions. Any help is much appreciated. Thank you in advance.
Edit: The tree I have built is similar to the one here -> https://stackoverflow.com/a/37059827/14168775
How do I build a random forest using a tree similar to the one in the above answer?
At the moment, there is no canned solution for general model-based forests using mob(), although most of the building blocks are available. However, we are currently reimplementing the backend of mob() so that we can leverage the infrastructure underlying cforest() more easily. Also, mob() is quite a bit slower than ctree(), which is somewhat inconvenient when learning forests.
The best alternative, currently, is to use cforest() with a custom ytrafo. These can also accommodate model-based transformations, very much like the scores in mob(). In fact, in many situations ctree() and mob() yield very similar results when provided with the same score function as the transformation.
A worked example is available in this conference presentation:
Heidi Seibold, Achim Zeileis, Torsten Hothorn (2017).
"Individual Treatment Effect Prediction Using Model-Based Random Forests."
Presented at Workshop "Psychoco 2017 - International Workshop on Psychometric Computing",
WU Wirtschaftsuniversität Wien, Austria.
URL https://eeecon.uibk.ac.at/~zeileis/papers/Psychoco-2017.pdf
The special case of model-based random forests for individual treatment effect prediction was also implemented in a dedicated package model4you that uses the approach from the presentation above and is available from CRAN. See also:
Heidi Seibold, Achim Zeileis, Torsten Hothorn (2019).
"model4you: An R Package for Personalised Treatment Effect Estimation."
Journal of Open Research Software, 7(17), 1-6.
doi:10.5334/jors.219
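As a rough illustration of the model4you route (a sketch only, assuming its pmforest()/pmodel() interface and a simulated dataset, not code taken from the paper): fit a base model containing the treatment effect, grow a forest of model-based trees around it, and extract personalised coefficients.

```r
library(model4you)

set.seed(1)
d <- data.frame(
  y   = rnorm(200),
  trt = factor(sample(c("A", "B"), 200, replace = TRUE)),
  x1  = rnorm(200),
  x2  = rnorm(200)
)

base_mod <- lm(y ~ trt, data = d)           # base model: overall treatment effect
frst     <- pmforest(base_mod, ntree = 50)  # forest partitioning over the remaining covariates
head(pmodel(frst))                          # personalised (per-observation) coefficients
```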

Can we import a random forest model built using SparkR into R and then use getTree to extract one of the trees?

Just as we can see or visualize the node splits in a decision tree, I want to do something similar. But I am using SparkR, and it does not have decision trees. So I am planning to use a random forest with 1 tree as a parameter, run it on SparkR, then save the model and use getTree to see the node splits and visualize them further using ggplot.
The short answer is no.
Models built with SparkR are not compatible with ones built with the respective R packages, in this case randomForest; hence, you will not be able to use the getTree function from the latter to visualize a tree from a random forest built with SparkR.
On a different level: I am surprised that decision trees have still not found their way into SparkR; they seem to have been ready in the GitHub repo for several months now. But even when they arrive, they are not expected to offer methods for visualizing trees, and you will still not be able to use functions from other R packages for that purpose.
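For contrast, this is roughly what the question is after, and it only works on a forest trained with the randomForest package in plain R (a minimal sketch on iris):

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 1)

# one row per node: split variable, split point, daughter nodes, and prediction
getTree(rf, k = 1, labelVar = TRUE)
```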

Posterior Predictive Checking

I'm relatively new at all this. I've performed an imputation on metabolomics data, and a colleague has queried the quality of my imputation (I performed predictive mean matching using MICE in R).
Having looked into this, there doesn't seem to be any official way to assess an imputation beyond visually comparing the imputed and observed data. I've found some papers on posterior predictive checking and p-values comparing whether the completed data are more extreme than the observed data, but as I'm new to this, attempting to write the code without guidance feels too challenging at this point.
Can anyone direct me to an example script, or perhaps a well described command, in the program R which will enable me to perform sufficient checks on my imputation?
Thank you, K.
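Not an official check, but a common starting point: mice ships lattice-style plots that overlay imputed and observed values. A minimal sketch on mice's built-in nhanes data, assuming a pmm imputation like the one described:

```r
library(mice)

imp <- mice(nhanes, method = "pmm", m = 5, seed = 123, printFlag = FALSE)

densityplot(imp)          # marginal densities: observed (blue) vs. imputed (red)
stripplot(imp, pch = 20)  # individual imputed points against the observed spread
```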

RandomForest algorithm in SparkR?

I have implemented the randomForest algorithm in R and am trying to do the same using SparkR (from Apache Spark 2.0.0).
But I have only found linear model functions like glm() implemented in SparkR:
https://www.codementor.io/spark/tutorial/linear-models-apache-spark-1-5-uses-present-limitations
I couldn't find any RandomForest (decision tree algorithm) examples.
There is RandomForest in Spark's MLlib, but I cannot find R bindings for MLlib either.
Kindly let me know whether SparkR (2.0.0) supports RandomForest. If not, is it possible to connect SparkR with MLlib to use RandomForest, or how else can we achieve this using SparkR?
True, it's not available in SparkR as of now.
A possible option is to build random forests on distributed chunks of the data and combine the trees later; see the sketch after the link below.
After all, it's all about randomness.
A good link: https://groups.google.com/forum/#!topic/sparkr-dev/3N6LK7k4NB0
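A rough local sketch of that idea in plain R (not actual SparkR code): grow small forests on chunks of the data and merge them with randomForest::combine(); on Spark the per-chunk training would happen on the executors and only the merge in R.

```r
library(randomForest)

set.seed(1)
chunks  <- split(iris, sample(rep(1:3, length.out = nrow(iris))))   # 3 data chunks
forests <- lapply(chunks, function(d) randomForest(Species ~ ., data = d, ntree = 50))

big_rf <- do.call(randomForest::combine, forests)  # merge the per-chunk forests
big_rf$ntree                                       # 150 trees in the merged ensemble
```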

Deploy R statistical models in WSO2?

A newbie question on WSO2 and 'R'....
I have a customer who is looking to build some statistical models using 'R'. These models are mostly associated with customer scoring, i.e. taking in a table of customer data with behavioural attributes as columns, and spitting out a 'score' for each customer.
Two questions on this:
Can 'R' models be deployed like rules in a service model?
Could you deploy R models into a WSO2 middleware, and if so, how and where?
TIA
Note: I'm not familiar with WSO2, but I am with R.
The answer to your question very much depends on what type of models you would like to deploy. The easiest ones are models such as linear/logistic regression followed by decision trees.
The reason they are easy is that for linear and logistic regression you get a nice formula you can plug into any programming interface. An example prediction formula might look like the following:
customer_predicted_life_time_value = 17.25 + 2.365*num_of_products_held - 16.12*time_at_address + 25.36*monthly_income
Similarly, decision trees can easily be exported as a bunch of if-then-else rules (there are at least a couple of packages in R which will translate an R decision tree model into rules; see the sketch below).
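For example, one option (a sketch using the rpart.plot package, not anything WSO2-specific) is to fit a tree with rpart and print it as if-then rules that could be re-implemented in a rules engine:

```r
library(rpart)
library(rpart.plot)  # provides rpart.rules()

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)  # kyphosis ships with rpart
rpart.rules(fit)  # one row per leaf: predicted value plus the conditions that lead to it
```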
You could technically deploy randomForest in the form of rules too, but that will be cumbersome to implement.
