Extract sample of features used to build each tree in H2O - r

In a GBM model, the following parameters are used:
col_sample_rate
col_sample_rate_per_tree
col_sample_rate_change_per_level
I understand how the sampling works and how many variables get considered for splitting at each level of every tree. I am trying to understand how many times each feature gets considered for making a decision. Is there a way to easily extract, from the model object, the sample of features considered for each splitting decision?
Referring to the explanation provided by H2O, http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/col_sample_rate.html, is there a way to know the 60 randomly chosen features for each split?
Thank you for your help!

If you want to see which features were used at a given split in a given tree, you can navigate the H2OTree object.
For R see documentation here and here
For Python see documentation here
You can also take a look at this blog post (if the link ever dies, just do a Google search for the H2OTree class).
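As a rough sketch of what that looks like in the Python client (not part of the original answer; it assumes a trained GBM called gbm and a running H2O cluster, and the exact attribute names can vary slightly between H2O versions):

import h2o
from h2o.tree import H2OTree

# Assumes `gbm` is an already trained H2O GBM model on a running cluster
tree = H2OTree(model=gbm, tree_number=0)  # first tree of the ensemble

# Internal nodes carry the feature they split on; leaf positions come back as None
split_features = [f for f in tree.features if f is not None]
print(split_features)

Repeating this over tree_number (and tree_class for multinomial models) gives the full list of features actually used for splitting decisions.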

I don't know if I would call this easy, but the MOJO tree visualizer emits a Graphviz dot data file that gets turned into a visualization, and that file contains the information you are interested in.
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/overview-summary.html#viewing-a-mojo
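A hedged sketch of that route, assuming the Python client, a trained model gbm, and that h2o.jar is available locally (the paths here are illustrative):

# Save the model as a MOJO zip file next to the script
mojo_path = gbm.download_mojo(path=".")

# Then, from a shell, render a tree as Graphviz dot data and convert it to an image:
#   java -cp h2o.jar hex.genmodel.tools.PrintMojo --tree 0 -i <mojo_path> -o model.gv
#   dot -Tpng model.gv -o model.png
# Every node in the resulting .gv file names the feature its split was made on.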

Related

LightGBM plot tree not matching feature importance

I am plotting a model from LightGBM and am trying to view the plot of a tree. When I use plot_tree it works; however, the output of the tree does not match the feature importance, nor does it match the number of leaves I chose when optimizing my parameters.
For example, Feature A is the most important feature in my feature importance plot, but this feature does not show up in my actual decision tree plot as a node to make a decision on. Also, one of my parameters is 22 leaves, but the tree plot has 24 leaves.
I am doing this within a Databricks environment using Python.
Any ideas what is happening?
I can't post code, sorry. Anyone with a general idea of what is happening would help.
First of all, LightGBM is a boosting ensemble method, which means that it builds several trees in series.
So, which tree are you plotting? You have several trees, and exploring only one tree is not representative of how the model works as a whole. If you check a few more trees, your Feature A should appear.
About the different num_leaves, I don't have a clear answer; it makes no sense. I would need some code and output to analyze it (but I have seen in your comment that you can't provide it, don't worry). In theory, you shouldn't have any tree with more than 22 leaves if you specified that value. Anyway, you can try another hyperparameter: max_depth, which is quite similar, or even better.
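To make the "which tree are you plotting?" point concrete, here is a minimal sketch assuming a trained lightgbm.Booster named model (for the sklearn wrapper, use model.booster_) and matplotlib installed:

import lightgbm as lgb
import matplotlib.pyplot as plt

# Look at the first few trees instead of just tree 0; each tree can split on different features
for i in range(3):
    lgb.plot_tree(model, tree_index=i, figsize=(20, 8))
    plt.show()

# Feature importance is aggregated over all boosting rounds, so a feature that
# dominates the importance plot may simply be absent from the single tree you plotted
print(model.feature_importance(importance_type="split"))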

How to make Decision Tree rules more understandable?

I'd like to extract useful rules from Decision Trees/Random Forest in order to develop a more applicable way to handle the rules and predictions. So I need an application which makes the rules more understandable.
Any suggestions (e.g. visualizations, validation methods etc) for my purpose?
As far as WHY a particular split was chosen, the answer is always going to be: "Because that split created the best splitting of the target variable."
You referenced scikit-learn. Go ahead and briefly scan scikit-learn's documentation on Decision Trees; it has an example in the middle of the page (a rendered plot of a fitted tree) that is exactly what you are asking for.
The code to generate that plot is there as well:
from sklearn.datasets import load_iris
from sklearn import tree

# Fit a decision tree on the iris data
iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

# Export the fitted tree in Graphviz .dot format
# (the StringIO import from sklearn.externals.six in the original example is
# unused and has been removed from recent scikit-learn versions)
with open("iris.dot", "w") as f:
    tree.export_graphviz(clf, out_file=f)
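As a small aside that goes beyond the snippet above (it assumes scikit-learn 0.21 or later), the same fitted clf can also be dumped as plain-text rules, which is arguably the most readable form:

from sklearn.tree import export_text

# Print the decision rules of the fitted tree as indented if/else text
print(export_text(clf, feature_names=list(iris.feature_names)))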
There are several other graphical representations there as well, with accompanying code.
The scikit-learn documentation is generally awesome and very useful.
Hope this helps!
While this is certainly possible for Decision Trees, and AN6U5 did a great job describing how, Random Forests use bundles of little trees that were trained on random subsets of the data and random subsets of the features. Each tree is therefore optimal only within that limited setting of features and data. Since there are typically hundreds or even thousands of them, figuring out the context by examining the randomized data is going to be a thankless task; I don't think anyone does it.
However, there are importance rankings generated for the features in Random Forests, and pretty much all implementations will output them if requested. They turn out to be extremely useful.
Two of the most important ones are MDI (Mean Decrease Impurity) and MDA (Mean Decrease Accuracy). They are described in some detail in chapter 6 of this excellent work: http://arxiv.org/pdf/1407.7502v3.pdf
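To make MDI and MDA concrete, here is a rough scikit-learn sketch (not from the cited work): feature_importances_ corresponds to MDI, and permutation importance is the usual stand-in for MDA:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# MDI: impurity decreases accumulated over all trees, normalized to sum to 1
print(rf.feature_importances_)

# MDA-style: average drop in score when each feature is randomly permuted
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(perm.importances_mean)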

System Dependence Graph with frama-c

I read that with Frama-C we can generate a PDG:
which free tools can I use to generate the program dependence graph for c codes
My question is: is there a way for it to generate an SDG? (An SDG is a set of PDGs; it aims to model interprocedural dependences.)
Could anybody help me or give me tips about which tools can generate the SDG?
Thank you
I'm not completely sure that it answers your question, but Frama-C's PDG plugin does have inter-procedural information, in the form of nodes for parameters and implicit inputs (globals that are read by the callee), as well as for the returned value and output locations (globals that are written). It uses results of the From plug-in to compute dependencies.
If I understand the PDG API in Db.Pdg correctly, you should be able to obtain all nodes corresponding to a given call with the Db.Pdg.find_simple_stmt_nodes function.

Fastest way to reduce dimensionality for multi-classification in R

What I currently have:
I have a data frame with one column of factors called "Class" which contains 160 different classes. I have 1200 variables, each one being an integer and no individual cell exceeding the value of 1000 (if that helps). About 1/4 of the cells are the number zero. The total dataset contains 60,000 rows. I have already used the nearZeroVar function, and the findCorrelation function to get it down to this number of variables. In my particular dataset some individual variables may appear unimportant by themselves, but are likely to be predictive when combined with two other variables.
What I have tried:
First I tried just creating a random forest model and planned on using the varimp property to filter out the useless variables, but I gave up after letting it run for days. Then I tried using fscaret, but that ran overnight on an 8-core machine with 64 GB of RAM (same as the previous attempt) and didn't finish. Then I tried:
Feature Selection using Genetic Algorithms. That ran overnight and didn't finish either. I was also trying to make principal component analysis work, but for some reason couldn't. I have never been able to successfully do PCA within caret, which could be both my problem and my solution here. I can follow all the "toy" demo examples on the web, but I still think I am missing something in my case.
What I need:
I need some way to quickly reduce the dimensionality of my dataset so I can make it usable for creating a model. Maybe a good place to start would be an example of using PCA with a dataset like mine using Caret. Of course, I'm happy to hear any other ideas that might get me out of the quicksand I am in right now.
I have done only some toy examples too.
Still, here are some ideas that do not fit into a comment.
All your attributes seem to be numeric. Maybe running the Naive Bayes algorithm on your dataset will give some reasonable classifications? Naive Bayes assumes all attributes are independent of each other, but experience shows (and many scholars say) that its results are often still useful despite that strong assumption.
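The question is about R (where e1071::naiveBayes or klaR::NaiveBayes would play this role), but just to make the baseline concrete, here is a minimal scikit-learn sketch on placeholder data of roughly the same shape:

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the real problem: integer predictors, 160 classes
rng = np.random.default_rng(0)
X = rng.integers(0, 1000, size=(5000, 1200))
y = rng.integers(0, 160, size=5000)

# Naive Bayes fits in seconds even at this width, so it makes a cheap baseline
print(cross_val_score(GaussianNB(), X, y, cv=3).mean())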
If you absolutely MUST do attribute selection, e.g. as part of an assignment:
Did you try to process your dataset with the free GUI-based data-mining tool Weka? There is an "attribute selection" tab where you have several algorithms (or algorithm-combinations) for removing irrelevant attributes at your disposal. That is an art, and the results are not so easy to interpret, though.
Read this pdf as an introduction and see this video for a walk-through and an introduction to the theoretical approach.
The videos assume familiarity with Weka, but they may still help.
There is an RWeka interface but it's a bit laborious to install, so working with the Weka GUI might be easier.
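On the PCA part of the question, which this answer does not cover: as a hedged sketch of the general recipe (shown with scikit-learn; in caret the rough equivalent is preProcess with method = "pca" and a thresh argument), scaling the variables and keeping enough components for, say, 95% of the variance looks like this on placeholder data:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder matrix standing in for the wide predictor table
X = np.random.default_rng(0).integers(0, 1000, size=(5000, 1200)).astype(float)

# Scale, then keep the smallest number of components explaining 95% of the variance
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = reducer.fit_transform(X)
print(X_reduced.shape)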

Rough Set-based Attribute Reduction

I tried RSAR, a free package, but I wonder if there are any other good attribute reducers out there. Even packages for R or MATLAB, or any other resource capable of letting me find the minimal set of attributes which classify the data.
For example, given a set with hundreds of examples of mail, described by different attributes and classified as spam or not spam, I want to find the minimal set of attributes that describes all the data, in order to discard useless information.
Considering the type of problem you describe, that is, choosing the right attributes for email classification, the best way might be to use Weka (Weka home). It has several feature-selection algorithms, which can be applied either interactively, to visualize their effect, or in conjunction with various classification algorithms, to evaluate their effect on actual classification. (Note that choosing attributes for classification without proper validation against a specific classifier might lead to less than optimal results in real life.)
Some relevant links:
Weka's manual regarding attribute selection
A (somewhat outdated) hands-on example
You can use the RoughSets package of the R language. See the description of FS.one.reduct.computation in R (after installing the RoughSets package).
For example, HIRING2Matrix is a decision table with a number of attributes, and reduct1 is the reduced set of attributes:
library(RoughSets)
reduct1 <- FS.one.reduct.computation(HIRING2Matrix, greedy = TRUE, power = 1)
