Can DOE driver results feed Metamodel component? - openmdao

I am interested in exploring surrogate-based optimization. I am not yet writing OpenMDAO code, just trying to figure out to what extent OpenMDAO will support this work.
I see that it has a DOE driver to generate training data (http://openmdao.readthedocs.org/en/1.5.0/usr-guide/tutorials/doe-drivers.html), and that it has several surrogate models that can be added to a meta-model (http://openmdao.readthedocs.org/en/1.5.0/usr-guide/examples/krig_sin.html). Yet I haven't found an example where the results of the DOE are passed as training data to the meta-model.
In many of the examples/tutorials/forum-posts it seems that the training data is created directly on or within the meta model. So it is not clear how these things work together.
Could the developers explain how training data is passed from a DOE to a meta model? Thanks!

In OpenMDAO 1.x, this kind of process isn't directly supported (yet) via a DOE, but it is definitely possible. There are two paths you can take, which offer different benefits depending on your eventual goal.
I will separate the different scenarios based on a single high level classification:
1) You want to do gradient-based optimization around the whole DOE/meta-model combination. This would be the case if, for example, you wanted to use CFD to predict drag at a few key points, then use a meta-model to generate a drag polar for mission analysis. A great example of this kind of modeling can be found in this paper on simultaneous aircraft-mission design optimization.
2) You don't want to do gradient-based optimization around the whole model. You might want to do gradient-free optimization (like a genetic algorithm). You might want to do gradient-based optimization just around the surrogate itself, with fixed training data. Or you might not want to do optimization at all.
If your use case falls under scenario 1 (or will eventually), then you want to use a multi-point approach. You create one instance of your model for each training case, then mux the results into an array that you pass into the meta-model. This is necessary so that derivatives can be propagated through the full model. The multi-point approach will work well and is very parallelizable. Depending on the structure of the model you will use for generating the training data itself, you might also consider a slightly different multi-point approach with a distributed component or a series of distributed components chained together. If your model will support it, the distributed component approach is the most efficient model structure to use in this case.
If your use case falls into scenario 2, you can still employ the multi-point approach if you like. It will work out of the box. However, you could also consider using a regular DOE to generate the training data. To do this, you'll need to use a nested-problem approach, where you put the DOE training data generation in a sub-problem. This will also work, though it will take a bit of extra coding on your part to get the array of results out of the DOE because that's not currently implemented.
If you want to use the DOE to generate the data, then pass it downstream to a surrogate that gets optimized on, you could use a pair of problem instances. This would not necessarily require nested problems at all. Instead, you build a run script with one problem instance that uses a DOE; when it's done, you collect the data into an array. Then you manually assign that array to the training inputs of a meta-model in a second problem instance. Something like the following pseudo-code:
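# Pseudo-code: DOE() and Optimizer() stand for whichever DOE and optimizer drivers you choose.
prob1 = Problem()
prob1.driver = DOE()
# set up the DOE variables and model ...
prob1.run()
# collect the DOE results into an array (not built in, so this step needs custom code)
training_data = prob1.driver.results

prob2 = Problem()
prob2.driver = Optimizer()
# set up the meta-model and optimization problem ...
# hand the collected array to the meta-model as its training inputs
prob2['meta_model.train:x'] = training_data
prob2.run()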

Related

Extract sample of features used to build each tree in H2O

In the GBM model, the following parameters are used:
col_sample_rate
col_sample_rate_per_tree
col_sample_rate_change_per_level
I understand how the sampling works and how many variables get considered for splitting at each level for every tree. I am trying to understand how many times each feature gets considered for making a decision. Is there a way to easily extract from the model object the sample of features used for each splitting decision?
Referring to the explanation provided by H2O, http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/col_sample_rate.html, is there a way to know which 60 randomly chosen features were considered for each split?
Thank you for your help!
If you want to see which features were used at a given split in a given tree, you can navigate the H2OTree object.
For R see documentation here and here
For Python see documentation here
You can also take a look at this Blog (if this link ever dies just do a google search for H2OTree class)
I don’t know if I would call this easy, but the MOJO tree visualizer spits out a graphviz dot data file which is turned into a visualization. This has the information you are interested in.
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/overview-summary.html#viewing-a-mojo
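Once you have the per-node split features for each tree, tallying them is straightforward. Below is a minimal sketch in plain Python. It assumes each tree gives you a list with one feature name per node and None for leaf nodes (which is how I understand the features attribute of H2OTree to be laid out); count_split_features and the example lists are made up for illustration. Note that this counts the features that were actually chosen for a split, not every candidate that was sampled.

```python
from collections import Counter


def count_split_features(trees):
    """Count how often each feature is used to split, across all trees.

    `trees` is an iterable of per-tree feature lists, one entry per node,
    with None marking leaf nodes (the layout assumed for H2OTree.features).
    """
    counts = Counter()
    for node_features in trees:
        counts.update(f for f in node_features if f is not None)
    return counts


# Hypothetical per-node feature lists for two small trees.
example_trees = [
    ["age", "income", None, None, "age", None, None],
    ["income", None, "height", None, None],
]
usage = count_split_features(example_trees)
print(usage.most_common())
```

With a real model you would build the input by looping over the tree indices and reading the feature list from each tree object.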

customizable cross-validation in h2o (features that depend on the training set)

I have a model where some of the input features are calculated from the training dataset (e.g. average or median of a value). I am trying to perform n-fold cross-validation on this model, but that means that the values for these features would be different depending on the samples selected for training/validation for each fold. Is there a way in h2o (I'm using it in R) to pass a function that calculates those features once the training set has been determined?
It seems like a pretty intuitive feature to have, but I have not been able to find any documentation on something like it out-of-the-box. Does it exist? If so, could someone point me to a resource?
There's no way to do this while using the built-in cross-validation in H2O. If H2O were written in pure R or Python, then it would be easy to extend it to allow a user to pass in a function to create custom features within the cross-validation loop, however the core of H2O is written in Java, so automatically translating an arbitrary user-defined function from R or Python, first into a REST call and then into Java is not trivial.
Instead, what you'd have to do is write a loop to do the cross-validation yourself and compute the features within the loop.
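The shape of such a loop, sketched in plain Python rather than R or H2O, might look like the following. The data, k_fold_indices and add_centered_feature are made up for illustration; the derived feature here is simply each value minus the training-set mean, and the model fitting step is left as a comment.

```python
import statistics


def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def add_centered_feature(values, train_mean):
    """Derived feature: each value minus the mean of the training rows."""
    return [v - train_mean for v in values]


values = [3.0, 5.0, 8.0, 1.0, 9.0, 4.0, 7.0, 2.0, 6.0, 10.0]  # made-up data
fold_means = []
for holdout in k_fold_indices(len(values), k=5):
    holdout_set = set(holdout)
    train = [v for i, v in enumerate(values) if i not in holdout_set]
    valid = [values[i] for i in holdout]

    # Compute the statistic on the training rows only, then apply it to both
    # splits so no validation information leaks into the feature.
    train_mean = statistics.mean(train)
    train_feature = add_centered_feature(train, train_mean)
    valid_feature = add_centered_feature(valid, train_mean)

    fold_means.append(train_mean)
    # A model would be fitted on train_feature and scored on valid_feature here.

print(fold_means)
```

Each fold ends up with a different training mean, which is exactly why the feature has to be recomputed inside the loop.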
It sounds like you may be doing target encoding (or something similar), and if that's the case, you'll be interested in this PR to add target encoding in H2O. In the discussion, we talk about the same issue that you're having.

r rpart only working for integers and not factors? getting a tree with no depth

I'm having a few issues running a simple decision tree within R using rpart.
I can't post my actual data for an example because of confidentiality, but here's the structure. I've blanked out a load of bits just because I've got my tin foil hat on today.
I've run the most basic model to predict MIX based on MIX_BEFORE and LIFESTAGE and I don't get a tree out of the end of it. I've tried using rpart.control and specifying the minsplit, it makes no difference.
Even when I add in a few more variables I still don't get a tree:
Yet... the second I remove the factor variables and attempt to build a tree using an integer, it works fine:
Any ideas at all?
Your data has a fairly strong class imbalance: 99% one class, 1% the other. So rpart can get 99% accuracy just by saying that everything is the majority class (which is what it is doing). Most variables will not be able to discriminate better than that, so you get trees with no branches like you did with the factor variables. Your _NBR variable happens to be more predictive for the small number of points with _NBR >= 7. But even your model that uses _NBR predicts almost all points are majority class. You may be able to get some help from This Cross Validated Post on how to deal with class imbalance.
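To make the baseline argument concrete, here is a tiny Python sketch (not R, and not rpart itself) that computes the accuracy of always predicting the majority class. The labels are made up to match the 99% / 1% split; majority_baseline_accuracy is a hypothetical helper.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Made-up labels with the 99% / 1% split described above.
labels = ["A"] * 990 + ["B"] * 10
baseline = majority_baseline_accuracy(labels)
print(baseline)
```

Any split has to beat this number to look worthwhile, which is a high bar for most variables.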

How to make Decision Tree rules more understandable?

I'd like to extract useful rules from Decision Trees/Random Forest in order to develop a more applicable way to handle the rules and predictions. So I need an application which makes the rules more understandable.
Any suggestions (e.g. visualizations, validation methods etc) for my purpose?
As far as WHY a particular split was chosen, the answer is always going to be: "Because that split created the best splitting of the target variable."
You referenced scikit-learn... Go ahead and briefly scan scikit-learn's documentation on Decision Trees... It has an example, which is exactly what you are asking for in the middle of the page. It looks like this:
The code to generate this plot is there also:
from sklearn.datasets import load_iris
from sklearn import tree

# Fit a decision tree on the iris data
iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

# Write the fitted tree to a Graphviz .dot file
with open("iris.dot", "w") as f:
    tree.export_graphviz(clf, out_file=f)
There are several other graphical representations there also with accompanying code:
The SKL documentation is generally awesome and is very useful.
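If the goal is rules you can read rather than a picture, scikit-learn can also print a fitted tree as indented text via export_text (available in reasonably recent versions). A minimal sketch, assuming the same iris data as above; the max_depth=2 and random_state=0 settings are my own choices to keep the output short and repeatable:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# A shallow tree keeps the printed rule list short.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Render the fitted tree as indented if/else style rules.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indentation level is one split, and each leaf line reports the predicted class, so the output can be pasted into a report or turned into plain if/else statements by hand.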
Hope this helps!
While this is certainly possible for Decision Trees and AN6U5 did a great job describing how, Random Forests use bundles of little trees that were trained using random subsets of the data and random subsets of the features. Thus each tree is optimal only in that limited setting of features and data. Since there are typically 100s or even 1000s of them, figuring out the context by examining the randomized data is going to be a thankless task. I don't think anyone does it.
However, there are importance rankings for the features generated for Random Forests, and pretty much all implementations will output them if requested. They turn out to be extremely useful.
Two of the most important ones are MDI (Mean Decrease Impurity) and MDA (Mean Decrease Accuracy). They are described in some detail in chapter 6 of this excellent work: http://arxiv.org/pdf/1407.7502v3.pdf
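As a rough illustration in scikit-learn: the feature_importances_ attribute of a fitted forest gives the impurity-based (MDI) ranking, and permutation_importance from sklearn.inspection measures the drop in score when a feature is shuffled, which is in the spirit of MDA (it is not necessarily identical to the out-of-bag formulation in the reference above). The dataset, split and hyperparameters below are arbitrary choices for the sketch.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# MDI: impurity-based importances accumulated while the trees were grown.
mdi = forest.feature_importances_

# MDA-style: mean drop in held-out accuracy when one feature is shuffled.
mda = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
).importances_mean

for name, mdi_score, mda_score in zip(iris.feature_names, mdi, mda):
    print(f"{name}: MDI={mdi_score:.3f}, MDA={mda_score:.3f}")
```

The two rankings usually agree on the strongest features but can differ on correlated ones, so it is worth looking at both.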

Fastest way to reduce dimensionality for multi-classification in R

What I currently have:
I have a data frame with one column of factors called "Class" which contains 160 different classes. I have 1200 variables, each one being an integer and no individual cell exceeding the value of 1000 (if that helps). About 1/4 of the cells are the number zero. The total dataset contains 60,000 rows. I have already used the nearZeroVar function, and the findCorrelation function to get it down to this number of variables. In my particular dataset some individual variables may appear unimportant by themselves, but are likely to be predictive when combined with two other variables.
What I have tried:
First I tried just creating a random forest model, planning to use the varimp property to filter out the useless stuff, but gave up after letting it run for days. Then I tried using fscaret, but that ran overnight on an 8-core machine with 64GB of RAM (same as the previous attempt) and didn't finish. Then I tried feature selection using genetic algorithms. That ran overnight and didn't finish either. I was trying to make principal component analysis work, but for some reason couldn't. I have never been able to successfully do PCA within Caret, which could be my problem and solution here. I can follow all the "toy" demo examples on the web, but I still think I am missing something in my case.
What I need:
I need some way to quickly reduce the dimensionality of my dataset so I can make it usable for creating a model. Maybe a good place to start would be an example of using PCA with a dataset like mine using Caret. Of course, I'm happy to hear any other ideas that might get me out of the quicksand I am in right now.
I have done only some toy examples too.
Still, here are some ideas that do not fit into a comment.
All your attributes seem to be numeric. Maybe running the Naive Bayes algorithm on your dataset will give some reasonable classifications? It assumes that all attributes are independent of each other, but experience shows (and many scholars say) that Naive Bayes results are often still useful despite that strong assumption.
If you absolutely MUST do attribute selection, e.g. as part of an assignment:
Did you try to process your dataset with the free GUI-based data-mining tool Weka? There is an "attribute selection" tab where you have several algorithms (or algorithm-combinations) for removing irrelevant attributes at your disposal. That is an art, and the results are not so easy to interpret, though.
Read this pdf as an introduction and see this video for a walk-through and an introduction to the theoretical approach.
The videos assume familiarity with Weka, but maybe it still helps.
There is an RWeka interface but it's a bit laborious to install, so working with the Weka GUI might be easier.
