call gbm model from C++ - r

I've got a gbm object and I want to use it from C++, for example to do the equivalent of predict.gbm() on new data. At first I tried to translate the if-else rules into C++ by just writing the tree out to a file. However, I found that the gbm result doesn't match the tree it generates: for example, when I use just the first tree, the SplitCodePred value in the tree doesn't match the value generated by predict.gbm(). So does anybody know how to do the prediction manually based on the gbm model?

See my answer to your question on Cross Validated.
In short, you should be able to call e.g. gbm_pred directly from the C/C++ source. The source is available here. You can see how the gbm output object is mapped onto the arguments for gbm_pred in the R function predict.gbm.
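For reference, the per-tree values are added on top of the model's intercept term (initF), which may be why a single tree's SplitCodePred values don't look like the predict.gbm() output. Below is a rough R sketch of walking one tree by hand; the fitted object my_gbm, the one-row data frame newrow and the continuous-splits-only assumption are mine, not from the gbm sources, so check the details against pretty.gbm.tree() yourself.

library(gbm)

predict_one_tree <- function(object, newrow, i.tree = 1) {
  tree <- pretty.gbm.tree(object, i.tree = i.tree)
  node <- 1                                    # 1-based row of node 0, the root
  while (tree$SplitVar[node] != -1) {          # -1 marks a terminal node
    xval <- newrow[[object$var.names[tree$SplitVar[node] + 1]]]
    node <- if (is.na(xval)) tree$MissingNode[node] + 1
            else if (xval < tree$SplitCodePred[node]) tree$LeftNode[node] + 1
            else tree$RightNode[node] + 1
  }
  tree$SplitCodePred[node]                     # at a terminal node this is the prediction
}

## On the link scale, predict(my_gbm, newrow, n.trees = 1) should then be close to
## my_gbm$initF + predict_one_tree(my_gbm, newrow, 1).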

Related

How can I get My.stepwise.glm to return the model outside the console?

I asked this question on RCommunity but haven't had anyone bite... so I'm here!
My current project involves predicting whether some trees will survive given future climate change scenarios. Against better judgement (which would be to use something like Maxent) I've decided to pursue this with a GLM, which requires presence and absence data. Every time I generate my absence data (as I was only given presence data) using randomPoints from dismo, the resulting GLM has different significant variables. I found a package called My.stepwise that has a My.stepwise.glm function (here: My.stepwise.glm: Stepwise Variable Selection Procedure for Generalized Linear... in My.stepwise: Stepwise Variable Selection Procedures for Regression Analysis), and this goes through a forward/backward selection process to find the best variables and returns a model ready for you.
My problem is that I don't want to run My.stepwise.glm just once and use the model it spits out. I'd like to run it roughly 100 times with different pseudo-absence data, see which variables it returns, then take the most frequent variables and move forward with building my model using those. The issue is that the My.stepwise.glm function ends with 'print(summary(initial.model))', and I would like to be able to access the output the way step() returns a list, where you can then say 'step$coefficients' and get the coefficients back as numerics. Can anyone help me with this?
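One possible workaround, sketched purely from the description above (I haven't checked the package source): take a local copy of the function and swap its final print(summary(initial.model)) expression for one that returns the model.

library(My.stepwise)

my_stepwise_glm <- My.stepwise.glm
## Assumes the print(summary(initial.model)) call really is the last expression
## in the body; run print(body(My.stepwise.glm)) first to confirm.
body(my_stepwise_glm)[[length(body(my_stepwise_glm))]] <-
  quote(invisible(initial.model))

## fit <- my_stepwise_glm(...same arguments as before...)
## names(coef(fit))   # collect these over the ~100 pseudo-absence runs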

customizable cross-validation in h2o (features that depend on the training set)

I have a model where some of the input features are calculated from the training dataset (e.g. the average or median of a value). I am trying to perform n-fold cross-validation on this model, but that means the values of these features would differ depending on the samples selected for training/validation in each fold. Is there a way in h2o (I'm using it from R) to pass a function that calculates those features once the training set has been determined?
It seems like a pretty intuitive feature to have, but I have not been able to find any documentation on something like it out-of-the-box. Does it exist? If so, could someone point me to a resource?
There's no way to do this while using the built-in cross-validation in H2O. If H2O were written in pure R or Python, it would be easy to extend it to let a user pass in a function that creates custom features within the cross-validation loop. However, the core of H2O is written in Java, so automatically translating an arbitrary user-defined function from R or Python, first into a REST call and then into Java, is not trivial.
Instead, what you'd have to do is write a loop to do the cross-validation yourself and compute the features within the loop.
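A sketch of what that loop can look like (the data frame dat, the response y, the raw column x_raw, the 5 folds and the fold-dependent mean feature are all made up for illustration, and it assumes a binary factor response so that AUC is meaningful):

library(h2o)
h2o.init()

set.seed(1)
folds <- sample(rep(1:5, length.out = nrow(dat)))
aucs <- numeric(5)

for (k in 1:5) {
  train <- dat[folds != k, ]
  valid <- dat[folds == k, ]

  ## Feature computed from the training fold only, then applied to both sets.
  m <- mean(train$x_raw)
  train$x_mean_centered <- train$x_raw - m
  valid$x_mean_centered <- valid$x_raw - m

  fit <- h2o.gbm(x = setdiff(names(train), "y"), y = "y",
                 training_frame = as.h2o(train))
  aucs[k] <- h2o.auc(h2o.performance(fit, newdata = as.h2o(valid)))
}
mean(aucs)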
It sounds like you may be doing target encoding (or something similar), and if that's the case, you'll be interested in this PR to add target encoding in H2O. In the discussion, we talk about the same issue that you're having.

R randomForest to PMML class index is wrong

I'm exporting an R randomForest model to PMML. The resulting PMML always has the class as the first element of the DataDictionary element, which is not always the case for my data.
Is there some way to fix this or at least increment the PMML with custom Extension elements? That way I could put the class index there.
I've looked in the pmml package documentation, as well as in the pmmlTransformations packages, but couldn't find anything there that could help me solve this issue.
By PMML class I assume you mean the model type (classification vs regression) in the PMML model attributes?
If so, it is not true that the model type is determined from the data type of the first element of the DataDictionary; these are completely independent. The model type is determined by the type R thinks the model is. The R random forest object records the type it thinks it is (model$type), and that is the model type exported by the pmml function. If you want your model to be a certain type, just make sure you let R know that. For example, if you are using the iris data set and your predicted variable is Sepal.Length, R will correctly assume it is a regression model. If you insist on treating it as a classification model, try using as.factor(Sepal.Length) instead.
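A small illustration of that point (a sketch only, assuming the randomForest and pmml packages and the iris example above):

library(randomForest)
library(pmml)

## Numeric response: randomForest records this as a regression model.
rf_reg <- randomForest(Sepal.Length ~ ., data = iris, ntree = 50)
rf_reg$type    # "regression"

## Same response wrapped in as.factor(): now recorded as classification.
iris2 <- transform(iris, Sepal.Length = as.factor(Sepal.Length))
rf_cls <- randomForest(Sepal.Length ~ ., data = iris2, ntree = 50)
rf_cls$type    # "classification"

## pmml() exports whichever type the R object reports.
pmml(rf_cls)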

caret package: Is it possible to implement my own bootstrapping method?

I am using caret package for R to select variables for my model. When using rfe command, one should pass rfeControl object, which has a method parameter. Options for this parameter are boot, cv, LOOCV and LGOCV. Since I am dealing with time series data I need to use special bootstrapping/cross-validation techniques as normal ones do not apply for time series data (otherwise distributions get corrupted etc.).
My question is how I would plug in my own implementation of bootstrapping but still use caret's rfe method, which otherwise has everything I need.
There isn't an easy way. If you study the code for rfe.default() you will note that in cases where method = "boot" the createResample() function is used. This is the function that generates the bootstrap samples. Similar functions are used for the other CV methods.
There is a hard way: take over whichever create*() function is most appropriate. Say you want to do a block bootstrap or ME bootstrap; take over the createResample() function and use method = "boot", or, if you want a special form of CV, use method = "cv" and take over createFolds().
You will need to write your own create*() function and replace the one in the caret NAMESPACE with your version. Not easy, but eminently doable. Say you write your own createResample() function; first note that this function creates n = times bootstrap samples, returning them in a matrix with times columns and as many rows as you have samples. You need to write a custom createResample() function that returns the same object but which performs the time series bootstrapping you want to employ.
Once you have written that function you then need to get it into the caret namespace so that it is used by functions in the caret package. For this you use assignInNamespace(). Say your new bootstrapping function is called createMyResample() and it is in your workspace; to insert it into the caret namespace do:
assignInNamespace("createResample", createMyResample, ns = "caret")
Sorry I can't be more specific but you don't say how you want the bootstrap/CV to be performed nor what R code you want to use to do the resampling. If you provide further details on how you would do the resampling I will take another look and see if I can help you write your create*() function.
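For what it's worth, here is a very rough sketch of what a moving-block-bootstrap replacement might look like, written only against the behaviour described above; the signature, the resample names and the default block length are guesses that you should check against caret's own createResample() before calling assignInNamespace():

createMyResample <- function(y, times = 10, list = TRUE) {
  n <- length(y)
  block <- max(1, floor(n^(1/3)))     # crude default block length
  one_sample <- function() {
    starts <- sample(seq_len(n - block + 1),
                     size = ceiling(n / block), replace = TRUE)
    idx <- unlist(lapply(starts, function(s) s:(s + block - 1)))
    sort(idx[seq_len(n)])             # keep exactly n indices
  }
  out <- replicate(times, one_sample(), simplify = FALSE)
  names(out) <- paste0("Resample", seq_len(times))
  if (list) out else do.call(cbind, out)
}

## Then, as above:
## assignInNamespace("createResample", createMyResample, ns = "caret")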
Failing all of this, contact Max Kuhn, the author and maintainer of caret; he may be able to advise further, or at least you can suggest this feature as a wish-list item for a future version.

decision trees with forced structure

I have been using decision trees (CART) in R using the rpart package to look at the relationship between SST (predictor variables) and climate (predictand variable).
I would like to "force" the tree into a particular structure - i.e. split on predictor variable 1, then on variable 2.
I've been using R for a while, so I thought I'd be able to look at the code behind the rpart function and modify it to search for 'best splits' in a particular predictor variable first. However, the rpart function calls C routines, and, not having any experience with C, I get lost there...
I could write a function from scratch but would like to avoid it if possible! So my questions are:
Is there another decision tree technique (implemented in R, preferably) in which you can force the structure of the tree?
If not, is there some way I could convert the C code to R?
Any other ideas?
Thanks in advance, and help is much appreciated.
When your data indicates a tree with a known structure, present that structure to R using either a newick or nexus file format. Then you can read in the structure using either read.tree or read.nexus from the ape package.
Maybe you should look at the method formal parameter of rpart.
In the documentation:
... ‘method’ can be a list of functions named ‘init’, ‘split’ and ‘eval’. Examples are given in the file ‘tests/usersplits.R’ in the sources.
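If writing your own init/split/eval functions turns out to be more than the problem needs, a cruder way of forcing the first split onto one predictor is to grow the tree in stages yourself; the sketch below uses made-up column names (climate for the predictand, sst1 and sst2 for the SST predictors, in a data frame dat) and assumes no rows are dropped for missing values:

library(rpart)

## Stage 1: one split only, with sst1 as the only candidate variable.
top <- rpart(climate ~ sst1, data = dat,
             control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))

## top$where gives the terminal node each training row ends up in,
## so it can be used to partition the data for the next level.
groups <- split(dat, top$where)

## Stage 2: within each branch, grow a subtree restricted to sst2.
subtrees <- lapply(groups, function(d)
  rpart(climate ~ sst2, data = d, control = rpart.control(cp = 0)))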
