I have a model where some of the input features are calculated from the training dataset (e.g. average or median of a value). I am trying to perform n-fold cross validation on this model, but that means that the values for these features would be different depending on the samples selected for training/validation for each fold. Is there a way in h2o (I'm using it in R) to perhaps pass a funtion that calculates those features once the training set has been determined?
It seems like a pretty intuitive feature to have, but I have not been able to find any documentation on something like it out-of-the-box. Does it exist? If so, could someone point me to a resource?
There's no way to do this while using the built-in cross-validation in H2O. If H2O were written in pure R or Python, then it would be easy to extend it to allow a user to pass in a function to create custom features within the cross-validation loop, however the core of H2O is written in Java, so automatically translating an arbitrary user-defined function from R or Python, first into a REST call and then into Java is not trivial.
Instead, what you'd have to do is write a loop to do the cross-validation yourself and compute the features within the loop.
It sounds like you may be doing target encoding (or something similar), and if that's the case, you'll be interested in this PR to add target encoding in H2O. In the discussion, we talk about the same issue that you're having.
Related
I'm using the Caret package from R to create prediction models for maximum energy demand. What i need to use is neural network multilayer perceptron, but in the Caret package i found out there's 2 of the mlp method, which is "mlp" and "mlpML". what is the difference between the two?
I have read description from a book (Advanced R Statistical Programming and Data Models: Analysis, Machine Learning, and Visualization) but it still doesnt answer my question.
Caret has 238 different models available! However many of them are just different methods to call the same basic algorithm.
Besides mlp there are 9 other methods of calling a multi-layer-perceptron one of which is mlpML. The real difference is only in the parameters of the function call and which model you need depends on your use case and what you want to adapt about the basic model.
Chances are, if you don't know what mlpML or mlpWeightDecay,etc. does you are fine to just use the basic mlp.
Looking at the official documentation we can see that:
mlp(size) while mlpML(layer1,layer2,layer3) so in the first method you can only tune the size of the multi-layer-perceptron while in the second call you can tune each layer individually.
Looking at the source code here:
https://github.com/topepo/caret/blob/master/models/files/mlp.R
and here:
https://github.com/topepo/caret/blob/master/models/files/mlpML.R
It seems that the difference is that mlpML allows several hidden layers:
modelInfo <- list(label = "Multi-Layer Perceptron, with multiple layers",
while mlp has one single layer with hidden units.
The official documentation also hints at this difference. In my opinion, it is not particularly useful to have many different models that differ only very slightly, and the documentation does not explain those slight differences well.
I'm using the nested resampling capabilities of the R mlr package, as outlined here: http://mlr-org.github.io/mlr-tutorial/release/html/nested_resampling/index.html?
The problem I am working on is a financial time series (FX rate) and so the existing resampling methods are not ideal. Ideally I'm looking to implement walk forward resampling within the nesting context. Is it possible to do this?
I can't see any way of getting greater access to the the CV or RepCV methods, but all that would be needed would be a way of controlling the fold selection and ordering they use, such as the foldType parameter available in the package cvTools.
Some timeseries-related functionality is currently under review in PR 1702. This may do most of what you want; you could at least use it as a starting point for your own implementation.
I'm currently working on trust prediction in social networks - from obvious reasons I model this problem as data stream. What I want to do is to "update" my trained model using old model + new chunk of data stream. Classifiers that I am using are SVM, NB (e1071 implementation), neural network (nnet) and C5.0 decision tree.
Sidenote: I know that this solution is possible using RMOA package by defining "model" argument in trainMOA function, but I don't think I can use it with those classifiers implementations (if I am wrong please correct me).
According to strange SO rules, I can't post it as comment, so be it.
Classifiers that you've listed need full data set at the time you train a model, so whenever new data comes in, you should combine it with previous data and retrain the model. What you are probably looking for is online machine learning. One of the very popular implementations is Vowpal Wabbit, it also has bindings to R.
I'm using this LDA package for R. Specifically I am trying to do supervised latent dirichlet allocation (slda). In the linked package, there's an slda.em function. However what confuses me is that it asks for alpha, eta and variance parameters. As far as I understand, I thought these parameters are unknowns in the model. So my question is, did the author of the package mean to say that these are initial guesses for the parameters? If yes, there doesn't seem to be a way of accessing them from the result of running slda.em.
Aside from coding the extra EM steps in the algorithm, is there a suggested way to guess reasonable values for these parameters?
Since you are trying to generate a supervised model, the typical approach would be to use cross validation to determine the model parameters. So you hold out some of the data as your test set, train the a model on the remaining data, and evaluate the model performance, repeating k times. You then continue to repeat with different model parameters to determine which result in the best model performance.
In the specific case of slda, I would run demo(slda) to see the author's implementation of it. When you run the demo, you'll see that he sets alpha=1.0, eta=0.1, and variance=0.25. I'd suggest using these as your starting point, and then use cross validation to determine better parameters if you need to improve model performance.
I am interested in exploring surrogate based optimization. I am not yet writing opendao code, just trying to figure out to what extent OpenMDAO will support this work.
I see that it has a DOE driver to generate training data (http://openmdao.readthedocs.org/en/1.5.0/usr-guide/tutorials/doe-drivers.html), I see that it has several surrogate models that can be added to a meta model (http://openmdao.readthedocs.org/en/1.5.0/usr-guide/examples/krig_sin.html). Yet, I haven't found an example where the results of the DOE are passed as training data to the Meta-model.
In many of the examples/tutorials/forum-posts it seems that the training data is created directly on or within the meta model. So it is not clear how these things work together.
Could the developers explain how training data is passed from a DOE to a meta model? Thanks!
In openmdao 1.x, this kind of process isn't directly supported (yet) via a DOE, but it is definitely possible. There are two paths that you can take, which offer different benefits depending on your eventual goal.
I will separate the different scenarios based on a single high level classification:
1) You want to do gradient based optimization around the whole DOE/Metamodel combination. This would be the case if, for example, you wanted to use CFD to predict drag at a few key points, then use a meta-model to generate a drag polar for mission analysis. A great example of this kind of modeling can be found in this paper on simultaneous aircraft-mission design optimization..
2) You don't want to do gradient based optimization around the whole model. You might want to do gradient free optimization (like a Genetic algorithm). You might want to do gradient based optimization just around the surrogate itself, with fixed training data. Or you might not want to do optimization at all...
If you're use case falls under scenario 1 (or will eventually fall under this use case in the future), then you want to use a multi-point approach. You create one instance of your model for each training case, then you can mux the results into an array you pass into meta-model. This is necessary so that derivatives can
be propagated through the full model. The multi-point approach will work well, and is very parallelizable. Depending on the structure of the model you will use for generating the training data itself, you might also consider a slightly different multi-point approach with a distributed component or a series of distributed components chained together. If your model will support it, the distributed component approach is the most efficient model structure to use in this case.
If you're use case falls into scenario 2, you can still employ the multi-point approach if you like. It will work out of the box. However, you could also consider using a regular DOE to generate the training data. In order to do this, you'll need to use a nested-problem approach, where you put the DOE training data generation in a sub-problem. This will also work, though it will take a bit of extra coding on your part to get the array of results out of the DOE because thats not currently implemented.
If you wanted to use the DOE to generate the data, then pass it downstream to a surrogate that would get optimized on, you could use a pair of problem instances. This would not necessarily require that you make nested problems at all. Instead you just build a run-script that has one problem instance that uses a DOE, when its done you collect the data into an array. Then you could manually assign that to the training inputs of a meta-model in a second problem instance. Something like the following pseudo-code:
prob1 = Problem()
prob1.driver = DOE()
#set up the DOE variables and model ...
prob1.run()
training_data = prob1.driver.results
prob2 = Problem()
prob2.driver = Optimizer()
#set up the meta-model and optimization problem
prob2['meta_model.train:x'] = training_data
prob2.run()