decision trees with forced structure - r

I have been using decision trees (CART) in R using the rpart package to look at the relationship between SST (predictor variables) and climate (predictand variable).
I would like to "force" the tree into a particular structure - i.e. split on predictor variable 1, then on variable 2.
I've been using R for a while so I thought I'd be able to look at the code behind the rpart function and modify it to search for 'best splits' in a particular predictor variable first. However the rpart function calls C routines and not having any experience with C I get lost here...
I could write a function from scratch but would like to avoid it if possible! So my questions are:
Is there another decision tree technique (preferably implemented in R) in which you can force the structure of the tree?
If not - is there some way I could convert the C code to R?
Any other ideas?
Thanks in advance, and help is much appreciated.

If your data imply a tree with a known structure, you can give that structure to R in Newick or NEXUS format and then read it in with read.tree or read.nexus from the ape package.

Have a look at the method argument of rpart. From the documentation:
... ‘method’ can be a list of functions named ‘init’, ‘split’ and ‘eval’. Examples are given in the file ‘tests/usersplits.R’ in the sources.
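A simpler workaround than writing custom split functions is to fit the tree in stages: offer rpart only variable 1 for the root split, then fit a separate one-split tree on variable 2 inside each child node. The sketch below uses made-up data; the column names x1, x2 and y are placeholders for the two predictors and the climate response, not anything from the original question.

```r
library(rpart)

# Hypothetical data: x1 and x2 stand in for the two predictors.
set.seed(1)
n <- 200
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- ifelse(d$x1 > 0.5, 2, 0) + ifelse(d$x2 > 0.3, 1, 0) + rnorm(n, sd = 0.1)

# Step 1: force the root split onto x1 by offering only x1 and one level.
root <- rpart(y ~ x1, data = d, method = "anova",
              control = rpart.control(maxdepth = 1, cp = 0))

# root$where gives, for each observation, the row of root$frame
# (i.e. the leaf) it ended up in.
d$node <- root$where

# Step 2: inside each child of the root, allow a split on x2 only.
children <- lapply(split(d, d$node), function(sub) {
  rpart(y ~ x2, data = sub, method = "anova",
        control = rpart.control(maxdepth = 1, cp = 0))
})

print(root)
print(children)
```

The result is a set of nested fits rather than a single rpart object, so prediction on new data means routing each row through the root first and then through the matching child tree.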


Extract sample of features used to build each tree in H2O

In the GBM model, the following parameters are used:
col_sample_rate
col_sample_rate_per_tree
col_sample_rate_change_per_level
I understand how the sampling works and how many variables are considered for splitting at each level of every tree. What I am trying to find out is how many times each feature is considered when making a splitting decision. Is there an easy way to extract, from the model object, the sample of features that was used for each splitting decision?
Referring to the explanation provided by H2O, http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/col_sample_rate.html, is there a way to know which 60 randomly chosen features were used for each split?
Thank you for your help!
If you want to see which features were used at a given split in a given tree, you can navigate the H2OTree object.
For R see documentation here and here
For Python see documentation here
You can also take a look at this Blog (if this link ever dies just do a google search for H2OTree class)
I don’t know if I would call this easy, but the MOJO tree visualizer spits out a graphviz dot data file which is turned into a visualization. This has the information you are interested in.
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/overview-summary.html#viewing-a-mojo
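As a rough sketch of the H2OTree route: once the split features of each tree are collected into a list, counting how often each feature was chosen is a one-liner. The vectors below are hypothetical stand-ins for the `features` slot of H2OTree objects, and the commented-out call assumes a fitted GBM named `model` with `ntrees` trees. Note that this counts features that were actually chosen for a split; as far as I can tell, the randomly sampled candidate set for each split is not stored in the tree object.

```r
# Hypothetical stand-ins for the per-tree `features` slot of H2OTree objects.
# NA marks a leaf node, which has no split feature.
tree_features <- list(
  c("age", "income", NA, NA, NA),
  c("income", "age", "tenure", NA, NA, NA, NA),
  c("income", NA, NA)
)

# With a running H2O cluster the list could be built roughly like this
# (assumption: `model` is a fitted GBM and `ntrees` is its number of trees):
# tree_features <- lapply(seq_len(ntrees), function(i)
#   h2o.getModelTree(model, tree_number = i)@features)

# Count how often each feature was chosen for a split across all trees;
# table() drops the NA entries of the leaf nodes by default.
split_counts <- sort(table(unlist(tree_features)), decreasing = TRUE)
print(split_counts)
```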

Can we import the random forest model built using SparkR to R and then use getTree to extract one of the trees?

With a decision tree we can see or visualize the node splits, and I want to do something similar. But I am using SparkR, which does not have decision trees. So I am planning to fit a random forest with a single tree in SparkR, save the model, use getTree to see the node splits, and then visualize them using ggplot.
The short answer is no.
Models built with SparkR are not compatible with ones built with the respective R packages, in this case randomForest; hence, you will not be able to use the getTree function from the latter to visualize a tree from a random forest built with SparkR.
On a different note: I am surprised that decision trees have still not found their way into SparkR, as they seem to have been ready in the GitHub repo for several months now. But even when they arrive, they are not expected to offer methods for visualizing trees, and you will still not be able to use functions from other R packages for that purpose.

call gbm model from C++

I've got a gbm object and I want to use it from C++, for example to do what predict.gbm() does on new data. At first I tried to translate the if-else rules into C++ by writing the tree out to a file. However, I found that the gbm result doesn't match the tree it generates. For example, when I use just the first tree, the SplitCodePred value in the tree doesn't match the value generated by predict.gbm(). Does anybody know how to do the prediction manually based on the gbm model?
See my answer to your question on Cross Validated.
In short, you should be able to call e.g. gbm_pred directly from the C/C++ source. The source is available here. You can see how the gbm output object is mapped onto the arguments for gbm_pred in the R function predict.gbm.
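To illustrate why a single SplitCodePred value need not match predict.gbm(), here is a minimal sketch of walking one tree by hand. The tree table is hand-made in the column layout of pretty.gbm.tree(), and init_f is a made-up stand-in for the model's initial value (initF). My understanding, which should be checked against the gbm_pred source, is that node ids and SplitVar are 0-based, that an observation goes left when its value is strictly below the threshold, and that the prediction on the link scale is the initial value plus the terminal-node values of all trees used. Categorical splits are handled differently and are not covered here.

```r
# Hand-made stand-in for one tree in the layout of pretty.gbm.tree():
# node ids and SplitVar are 0-based; SplitVar == -1 marks a terminal node,
# whose SplitCodePred holds that node's contribution to the prediction.
tree <- data.frame(
  SplitVar      = c(0, -1, -1, -1),
  SplitCodePred = c(0.5, -0.2, 0.3, 0.05),
  LeftNode      = c(1, -1, -1, -1),
  RightNode     = c(2, -1, -1, -1),
  MissingNode   = c(3, -1, -1, -1)
)

# Walk a single observation x (numeric vector of predictors) down the tree.
# Only continuous splits are handled in this sketch.
predict_one_tree <- function(tree, x) {
  node <- 0
  while (tree$SplitVar[node + 1] != -1) {
    value <- x[tree$SplitVar[node + 1] + 1]
    if (is.na(value)) {
      node <- tree$MissingNode[node + 1]
    } else if (value < tree$SplitCodePred[node + 1]) {
      node <- tree$LeftNode[node + 1]
    } else {
      node <- tree$RightNode[node + 1]
    }
  }
  tree$SplitCodePred[node + 1]
}

init_f <- 1.0  # hypothetical initial value, standing in for the model's initF

# Prediction with one tree = initial value + terminal-node value.
pred <- init_f + predict_one_tree(tree, c(0.7))
print(pred)
```

The same loop translates almost line for line into C++ once the tree columns are exported as arrays.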

Normalizing a Phylogenetic Tree in R

When working with phylogenetic tree data in R (specifically when working with "phylo" or "phylo4" objects) it would be useful to normalize branch lengths so that certain taxa (the ones that evolve faster) do not contribute a disproportionate amount of branch length to the tree. This seems to be common in computing UniFrac values, as can be found in the discussion here: http://bmf2.colorado.edu/unifrac/help.psp. (I need more than just UniFrac values, however).
However, I cannot find a function that performs this normalization step. I have looked in ape, picante, adephylo, and phylobase. Could someone direct me to a package that includes this function, or a package that makes writing this kind of function straightforward?
Are you looking for a function to just scale the branch lengths of a tree? If so, compute.brlen() in ape will do it. There are built in options for Grafen's rho and all = 1. You can also supply your own function.
I don't know if UniFrac does some other kind of branch length scaling. But if so, you could write your function and pass it.
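A short sketch of both routes, using a small made-up tree: compute.brlen() for the built-in options, and direct manipulation of the edge.length vector for a custom rescaling. Dividing by the total branch length is only one possible normalization, chosen here for illustration; it is not necessarily what UniFrac does.

```r
library(ape)

# Small made-up tree in Newick format.
tree <- read.tree(text = "((A:1,B:4):1,(C:2,D:2):3);")

# Built-in options: all branch lengths set to 1, and Grafen's method.
unit   <- compute.brlen(tree, 1)
grafen <- compute.brlen(tree, method = "Grafen")

# Custom rescaling: branch lengths live in tree$edge.length, so any
# normalization can be applied directly. Here the total length becomes 1.
scaled <- tree
scaled$edge.length <- tree$edge.length / sum(tree$edge.length)

print(unit$edge.length)
print(grafen$edge.length)
print(scaled$edge.length)
```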

The internal implementation of R's dataset

I am trying to build a data processing program. Currently I use a matrix of doubles to represent the data table: each row is an instance and each column represents a feature. I also have an extra vector holding the target value for each instance, which is of type double for regression and integer for classification.
I want to make it more general. I am wondering what kind of structure R uses to store a dataset, i.e. the internal implementation in R.
Maybe if you inspect the rpy2 package, you can learn something about how data structures are represented (and can be accessed).
R's standard structure for a dataset is the `data.frame'. A detailed introduction to data frames can be found here:
http://cran.r-project.org/doc/manuals/R-intro.html#Data-frames
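A quick way to see what a data frame is made of is to inspect one from within R. The sketch below uses made-up column names; it shows that a data frame is a list of equal-length column vectors, each with its own type, plus a few attributes, and that a factor column is stored as integer codes with a levels attribute.

```r
# Made-up example: two numeric features and a categorical target.
df <- data.frame(
  x1    = c(1.5, 2.0, 3.2),
  x2    = c(0.1, 0.4, 0.9),
  label = factor(c("a", "b", "a"))
)

typeof(df)           # "list": one list element per column
names(attributes(df))  # names, class and row.names are stored as attributes

typeof(df$x1)        # "double"
typeof(df$label)     # "integer": factors are integer codes ...
levels(df$label)     # ... plus a character vector of levels

# Columns can have different types, unlike the columns of a matrix.
sapply(df, class)
```

For a general-purpose table this suggests column-wise storage with a per-column type, rather than a single matrix of doubles.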
