Simple Rules in PMML

We are currently exploring deploying Zementis ADAPA or their UPPI plugin on top of a Hadoop cluster. We plan to export our SAS models to PMML and deploy them.
However, in addition to the models exported from SAS, we need to express much simpler 'models'/classification rules in PMML.
An example is:
input: var1, var2
rule: var1 >= var2
output: 'true' or 'false'
I'm currently thinking of expressing this as a very simple decision tree (TreeModel in PMML) or a very simple rule set (RuleSet in PMML).
Here are my questions:
Am I using the right models?
Is this even the right approach? Is there another way to express rules in PMML?
Is this even the right thing to ask of PMML? Is anyone else using PMML to express rules like this?

Since a PMML document always expects some sort of 'model' to be present, you essentially have to trick it by putting in a dummy regression model. You then express your rule/logic with PMML's 'if-then-else' construct in your input preprocessing (the TransformationDictionary) to 'derive' your answer field. After that, you expose the derived field using the Output element.
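To make the trick concrete, here is a rough sketch of what such a document could look like, written as a small Python script that just emits the XML. The field names, the dummy regression target, and the PMML 4.2 version are all assumptions for illustration, so validate the result against the PMML schema and your scoring engine before relying on it.

# Sketch only: the dummy regression model exists just so the document is valid PMML;
# the real work happens in the TransformationDictionary and the Output element.
PMML_RULE = """<?xml version="1.0" encoding="UTF-8"?>
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header copyright="example" description="simple comparison rule"/>
  <DataDictionary numberOfFields="3">
    <DataField name="var1" optype="continuous" dataType="double"/>
    <DataField name="var2" optype="continuous" dataType="double"/>
    <DataField name="dummy_target" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TransformationDictionary>
    <!-- if (var1 >= var2) then "true" else "false" -->
    <DerivedField name="answer" optype="categorical" dataType="string">
      <Apply function="if">
        <Apply function="greaterOrEqual">
          <FieldRef field="var1"/>
          <FieldRef field="var2"/>
        </Apply>
        <Constant dataType="string">true</Constant>
        <Constant dataType="string">false</Constant>
      </Apply>
    </DerivedField>
  </TransformationDictionary>
  <RegressionModel modelName="dummy" functionName="regression">
    <MiningSchema>
      <MiningField name="var1"/>
      <MiningField name="var2"/>
      <MiningField name="dummy_target" usageType="target"/>
    </MiningSchema>
    <Output>
      <!-- expose the derived field as the scoring output -->
      <OutputField name="rule_result" optype="categorical" dataType="string" feature="transformedValue">
        <FieldRef field="answer"/>
      </OutputField>
    </Output>
    <RegressionTable intercept="0"/>
  </RegressionModel>
</PMML>
"""

with open("simple_rule.pmml", "w") as f:
    f.write(PMML_RULE)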
I know this is far too much work for too little benefit. I did it only as a proof of concept, and we decided not to express simple rules in PMML.

Related

Performance drop when converting Mask RCNN to uff format

My goal is to deploy a Mask R-CNN model trained with the well-known Matterport repo using NVIDIA DeepStream.
To do so, I first have to convert the generated .h5 model into a .uff. This operation is described here.
After the conversion, I have run the generated .uff model with TensorRT and DeepStream, and it has very poor performance compared to the .h5 model (it almost never detects/masks the objects).
Before the conversion, I made the corresponding changes to handle NCHW models and configured the number of classes and the backbone (in this case resnet50).
I don't know how to continue. Any advice would really help me. Thanks!
To solve the problem, one must use the same configuration for training and for the conversion.
In particular, since most models start by transfer learning from the pretrained COCO model, one has to use that model's very same config.
In addition, the input image sizes have to be consistent with the training configuration.
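As an illustration only, here is a minimal sketch of what "same configuration" can look like with the Matterport code base. The class below subclasses its public Config API; the dataset name, class count, and image sizes are made-up values that must be replaced with whatever was actually used for training before running the .h5-to-.uff conversion.

# Hypothetical config sketch; every value here must mirror the training run.
from mrcnn.config import Config

class ConversionConfig(Config):
    NAME = "my_dataset"        # made-up name, set it to your own dataset
    BACKBONE = "resnet50"      # must match the backbone used during training
    NUM_CLASSES = 1 + 80       # background + object classes from training
    IMAGE_MIN_DIM = 800        # image sizes must match the training configuration
    IMAGE_MAX_DIM = 1024
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

config = ConversionConfig()
config.display()               # print the effective settings to compare against training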

Extract sample of features used to build each tree in H2O

In a GBM model, the following parameters are used:
col_sample_rate
col_sample_rate_per_tree
col_sample_rate_change_per_level
I understand how the sampling works and how many variables are considered for splitting at each level of every tree. I am trying to understand how many times each feature gets considered for making a decision. Is there a way to easily extract the samples of features used for making splitting decisions from the model object?
Referring to the explanation provided by H2O, http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/col_sample_rate.html, is there a way to know which 60 randomly chosen features were considered for each split?
Thank you for your help!
If you want to see which features were used at a given split in a given tree, you can navigate the H2OTree object.
For R see documentation here and here
For Python see documentation here
You can also take a look at this blog post (if that link ever dies, just do a Google search for the H2OTree class).
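To make that concrete, here is a minimal Python sketch; the tiny frame, column names, and GBM settings are invented purely so the example runs end to end, and the same walk over H2OTree.features applies to your own model.

import h2o
from h2o.estimators import H2OGradientBoostingEstimator
from h2o.tree import H2OTree
from collections import Counter

h2o.init()

# Toy training frame (made-up data) so the example is self-contained.
df = h2o.H2OFrame({
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [8, 7, 6, 5, 4, 3, 2, 1],
    "y":  [0, 1, 0, 1, 0, 1, 0, 1],
})
df["y"] = df["y"].asfactor()

gbm = H2OGradientBoostingEstimator(ntrees=3, col_sample_rate=0.6, min_rows=1, seed=42)
gbm.train(x=["x1", "x2"], y="y", training_frame=df)

# Count how often each feature is chosen for a split across the ensemble.
usage = Counter()
for t in range(gbm.ntrees):
    tree = H2OTree(model=gbm, tree_number=t)                  # t-th tree of the ensemble
    usage.update(f for f in tree.features if f is not None)   # leaf nodes report None
print(usage.most_common())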
I don't know if I would call this easy, but the MOJO tree visualizer spits out a Graphviz dot data file, which is turned into a visualization. This has the information you are interested in.
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/overview-summary.html#viewing-a-mojo
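If you go the MOJO visualizer route, the workflow is roughly the following. Paths and file names are placeholders, a trained model such as the gbm above is assumed, and both h2o.jar and Graphviz's dot need to be available locally.

import subprocess

mojo_path = gbm.download_mojo(path=".")        # export the trained model as a MOJO zip
# PrintMojo (shipped with H2O) turns one tree of the MOJO into Graphviz dot data.
subprocess.run(["java", "-cp", "h2o.jar", "hex.genmodel.tools.PrintMojo",
                "--tree", "0", "-i", mojo_path, "-o", "tree0.gv"], check=True)
subprocess.run(["dot", "-Tpng", "tree0.gv", "-o", "tree0.png"], check=True)   # render it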

customizable cross-validation in h2o (features that depend on the training set)

I have a model where some of the input features are calculated from the training dataset (e.g. the average or median of a value). I am trying to perform n-fold cross-validation on this model, but that means the values of these features would differ depending on the samples selected for training/validation in each fold. Is there a way in h2o (I'm using it in R) to pass a function that calculates those features once the training set has been determined?
It seems like a pretty intuitive feature to have, but I have not been able to find any documentation on something like it out-of-the-box. Does it exist? If so, could someone point me to a resource?
There's no way to do this while using the built-in cross-validation in H2O. If H2O were written in pure R or Python, it would be easy to extend it to let a user pass in a function that creates custom features within the cross-validation loop. However, the core of H2O is written in Java, so automatically translating an arbitrary user-defined function from R or Python, first into a REST call and then into Java, is not trivial.
Instead, what you'd have to do is write a loop to do the cross-validation yourself and compute the features within the loop.
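A minimal sketch of such a loop in Python follows (the toy frame, column names, and the mean-centering feature are invented for illustration; the same pattern translates directly to R). The key point is that the training-set-dependent statistic is computed on the training rows of each fold only and then applied unchanged to the held-out rows.

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Toy frame: "x" is the raw input, "y" a continuous response (made-up data).
df = h2o.H2OFrame({"x": list(range(20)), "y": [2.0 * i + 1.0 for i in range(20)]})

n_folds = 5
df["fold"] = df["x"].kfold_column(n_folds=n_folds, seed=42)   # fold assignment column

rmses = []
for k in range(n_folds):
    train = df[df["fold"] != k, :]
    valid = df[df["fold"] == k, :]

    # Fold-dependent feature: center "x" on the *training* fold's mean,
    # then apply that same statistic to the validation fold.
    x_mean = train["x"].as_data_frame()["x"].mean()
    train["x_centered"] = train["x"] - x_mean
    valid["x_centered"] = valid["x"] - x_mean

    model = H2OGradientBoostingEstimator(ntrees=20, seed=42)
    model.train(x=["x_centered"], y="y", training_frame=train, validation_frame=valid)
    rmses.append(model.rmse(valid=True))

print(sum(rmses) / n_folds)   # average held-out RMSE across folds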
It sounds like you may be doing target encoding (or something similar), and if that's the case, you'll be interested in this PR to add target encoding in H2O. In the discussion, we talk about the same issue that you're having.

Can DOE driver results feed Metamodel component?

I am interested in exploring surrogate-based optimization. I am not yet writing OpenMDAO code; I am just trying to figure out to what extent OpenMDAO will support this work.
I see that it has a DOE driver to generate training data (http://openmdao.readthedocs.org/en/1.5.0/usr-guide/tutorials/doe-drivers.html), I see that it has several surrogate models that can be added to a meta model (http://openmdao.readthedocs.org/en/1.5.0/usr-guide/examples/krig_sin.html). Yet, I haven't found an example where the results of the DOE are passed as training data to the Meta-model.
In many of the examples/tutorials/forum-posts it seems that the training data is created directly on or within the meta model. So it is not clear how these things work together.
Could the developers explain how training data is passed from a DOE to a meta model? Thanks!
In OpenMDAO 1.x, this kind of process isn't directly supported (yet) via a DOE, but it is definitely possible. There are two paths that you can take, which offer different benefits depending on your eventual goal.
I will separate the different scenarios based on a single high level classification:
1) You want to do gradient-based optimization around the whole DOE/meta-model combination. This would be the case if, for example, you wanted to use CFD to predict drag at a few key points, then use a meta-model to generate a drag polar for mission analysis. A great example of this kind of modeling can be found in this paper on simultaneous aircraft-mission design optimization.
2) You don't want to do gradient-based optimization around the whole model. You might want to do gradient-free optimization (like a genetic algorithm), or gradient-based optimization just around the surrogate itself with fixed training data, or you might not want to do optimization at all.
If your use case falls under scenario 1 (or will eventually fall under it in the future), then you want to use a multi-point approach. You create one instance of your model for each training case, then you mux the results into an array that you pass into the meta-model. This is necessary so that derivatives can be propagated through the full model. The multi-point approach will work well and is very parallelizable. Depending on the structure of the model you will use to generate the training data, you might also consider a slightly different multi-point approach with a distributed component, or a series of distributed components chained together. If your model supports it, the distributed-component approach is the most efficient model structure to use in this case.
If your use case falls into scenario 2, you can still employ the multi-point approach if you like; it will work out of the box. However, you could also consider using a regular DOE to generate the training data. In order to do this, you'll need to use a nested-problem approach, where you put the DOE training-data generation in a sub-problem. This will also work, though it will take a bit of extra coding on your part to get the array of results out of the DOE, because that's not currently implemented.
If you wanted to use the DOE to generate the data and then pass it downstream to a surrogate that gets optimized on, you could use a pair of problem instances. This would not necessarily require you to make nested problems at all. Instead, you just build a run-script that has one problem instance that uses a DOE; when it's done, you collect the data into an array. Then you manually assign that to the training inputs of a meta-model in a second problem instance. Something like the following pseudo-code:
# pseudo-code: DOE(), Optimizer(), and driver.results are placeholders,
# not actual OpenMDAO 1.x class or attribute names
prob1 = Problem()
prob1.driver = DOE()               # e.g. one of the DOE drivers
# set up the DOE variables and model ...
prob1.run()

# collect the DOE cases into an array; in practice you would pull them
# out of a case recorder attached to the driver
training_data = prob1.driver.results

prob2 = Problem()
prob2.driver = Optimizer()         # e.g. a gradient-based or gradient-free optimizer
# set up the meta-model and optimization problem ...
prob2['meta_model.train:x'] = training_data
prob2.run()

OLS in Python with Dummy Variables - Best Solution?

I have a problem I am trying to solve in Python, and I have found multiple solutions (I think), but I am trying to figure out which one is the best. I am hoping to choose libraries that will be fully supported in the future so I do not have to rewrite this service.
I want to do an ordinary multivariate least-squares regression with both categorical and continuous independent variables. The code has to be written in Python, as it is being integrated into a web service. I have been following pandas quite a bit but have never used it, so this seems to be one approach:
SOLUTION 1. https://github.com/pydata/pandas/blob/master/examples/regressions.py
Obviously, numpy/scipy are ideal, but I can't find an example that uses dummy variables (does anyone have one?). I did find this, though:
SOLUTION 2. http://www.scipy.org/Cookbook/OLS
which I could modify to support dummy variables, but I do not want to do that if someone else has done it already. Also, I want the numbers to be very similar to R, as I have done most of my analysis offline and I can use those results for unit tests.
And in example (2) above, I see that I could technically use rpy/rpy2, although that is not optimal because my web service would then require yet another piece of technology (R). The good thing about using that interface is that the numbers would be identical to my results from R.
SOLUTION 3. http://www.scipy.org/Cookbook/OLS (but using Rpy/Rpy2)
Anyway, I am interested in what everyone's approach would be out of these three solutions, whether there are any I am missing, and whether pandas is mature enough to start using in a production web service. The key thing here is that I do not want to have to support/patch bug fixes or write anything from scratch if possible. I'm too busy and probably not smart enough :)
Thanks.
You can use statsmodels, which provides many different models and result statistics.
If you want an R-like formula interface, here are some examples, and you can look at the corresponding documentation:
http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/contrasts.html
http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/example_formulas.html
If you want a pure numpy version, then here is an old example that does everything from scratch:
http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html#ols-with-dummy-variables
The models are integrated with pandas and can use a pandas DataFrame as the data structure for the dependent and independent variables (endog and exog in statsmodels naming convention).
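As a quick illustration of the formula interface handling a categorical regressor (the DataFrame below is a made-up toy; statsmodels dummy-codes C(group) for you, mirroring R's factor handling):

import pandas as pd
import statsmodels.formula.api as smf

# Toy data: one continuous regressor and one categorical regressor.
df = pd.DataFrame({
    "y": [1.0, 2.1, 2.9, 4.2, 5.1, 6.3],
    "x": [1, 2, 3, 4, 5, 6],
    "group": ["a", "a", "b", "b", "c", "c"],
})

model = smf.ols("y ~ x + C(group)", data=df).fit()   # C() marks the categorical term
print(model.summary())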
