I'm trying to use gbm in R to create a boosting classification tree model for my data.
The problem is that I'm trying to classify my data into multiple labels, and the only classification distribution I can find for gbm ("bernoulli") works only for binary classification.
Is there some change that I could make to my code to create a model which classifies the data into more than just two classes?
boost <- gbm(label ~ ., data = training, distribution = "bernoulli",
             n.trees = 5000,
             interaction.depth = 4)
Try
distribution = "multinomial"
Note that, although the option does not seem to appear in the gbm documentation, it is indeed available: see the example at the top of page 30 of the PDF manual, where gbm with distribution = "multinomial" is used on the three-class iris dataset.
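A minimal sketch along those lines, using iris (the tree count here is arbitrary):

library(gbm)
set.seed(1)
fit <- gbm(Species ~ ., data = iris, distribution = "multinomial",
           n.trees = 2000, interaction.depth = 4)
# for multinomial fits, predict() with type = "response" returns an
# n x n.classes x length(n.trees) array of class probabilities
probs <- predict(fit, newdata = iris, n.trees = 2000, type = "response")
pred  <- colnames(probs[, , 1])[apply(probs[, , 1], 1, which.max)]
table(pred, iris$Species)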
I'd like to use the 'e1071' library to fit an SVM model. So far, I've fitted a model that produces a smooth regression curve for the data set (the purple curve in my plot).
However, I want the SVM model to "follow" the data, so that the prediction for each value is as close as possible to the actual data. I think this is possible, based on a graph showing how an SVM model (model 2) can behave similarly to an ARIMA model (model 1).
I tried changing the kernel to no avail. Any help will be much appreciated.
Fine-tuning an SVM is no easy task. Have you considered other models, for example GAMs (generalized additive models)? These work well on very curvy data.
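For what it's worth, a minimal sketch on toy data (the variable names and tuning values are placeholders, not a recipe): with e1071, a radial-kernel SVR can be made to hug the data by lowering epsilon and raising cost and gamma, and a GAM gives a curvy fit almost for free.

library(e1071)
library(mgcv)   # for the GAM comparison
set.seed(42)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)
d <- data.frame(x, y)

# radial-kernel SVR: smaller epsilon and larger cost/gamma make the
# fitted curve track the data more tightly (at the risk of overfitting)
fit.svm <- svm(y ~ x, data = d, kernel = "radial",
               cost = 100, gamma = 1, epsilon = 0.01)

# or let e1071 pick cost/gamma by cross-validation
tuned <- tune(svm, y ~ x, data = d,
              ranges = list(cost = 10^(0:3), gamma = c(0.1, 1, 10)))

# GAM alternative: a penalized spline follows curvy data well
fit.gam <- gam(y ~ s(x), data = d)

plot(x, y)
lines(x, predict(fit.svm, d), col = "purple", lwd = 2)
lines(x, predict(fit.gam, d), col = "blue", lwd = 2)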
I have a multiclass classification problem (with 10 classes) that I am trying to solve using the neural network option 'mxnet' in the caret package in R. I'm using 10-fold cross-validation during training and would like to plot a learning curve to figure out whether/how the model is overfitting. I have adapted the solution given in this post (Plot learning curves with caret package and R) to fit my data. However, since the learning curve is computed over each of the resamples, not all factors/classes (1-10) are present in each fold, which leads to the following error:
Error: One or more factor levels in the outcome has no data
I have also tried caret's built-in learning_curve_dat function, but I encounter the same error message.
Is there a way to bypass this problem of not all factors being present in each one of the folds?
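For reference, the call I'm making looks roughly like this (the data frame dat and the outcome column label are placeholders for my actual data):

library(caret)
# `dat` is a data frame whose factor outcome `label` has 10 levels
lc <- learning_curve_dat(dat = dat, outcome = "label",
                         test_prop = 1/4,
                         method = "mxnet",
                         trControl = trainControl(method = "cv", number = 10))
# this fails whenever a resample is missing one of the 10 classes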
I am a newbie in R and I need to know how to plot a tree selected from a random forest model trained with the train() function in the caret package.
I used a training dataset to fit a random forest model with the train() function. The resulting forest contains about 500 trees. Is there a way to plot a selected individual tree?
Thank you.
CRAN package party offers a method called prettyTree.
Look here
As far as I know, the randomForest package does not have any built-in functionality for plotting individual trees. You can extract a tree with the getTree() function, but nothing is provided to plot or visualize it. This question may be a duplicate; a quick search turns up approaches other people have used to extract trees from a random forest, found
here and here and here
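For what it's worth, a minimal sketch of the extraction step (note that caret stores the underlying randomForest fit in the train object's finalModel slot):

library(randomForest)
fit <- randomForest(Species ~ ., data = iris, ntree = 500)
# pull out tree number 42 as a data frame of split rules;
# labelVar = TRUE shows variable names instead of column indices
tr <- getTree(fit, k = 42, labelVar = TRUE)
head(tr)
# for a caret model: getTree(caret_fit$finalModel, k = 42, labelVar = TRUE)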
Background
The reference manual for the gbm package states that the interact.gbm function computes Friedman's H-statistic to assess the strength of variable interactions. The H-statistic is on the scale [0, 1].
The reference manual for the dismo package does not cite any literature for how the gbm.interactions function detects and models interactions. Instead, it gives a list of general procedures used to detect and model interactions. The dismo vignette "Boosted Regression Trees for ecological modeling" states that the dismo package extends functions in the gbm package.
Question
How does dismo::gbm.interactions actually detect and model interactions?
Why
I am asking this question because gbm.interactions in the dismo package yields results >1, which the gbm package reference manual says is not possible.
I checked the tar.gz for each of the packages to see if the source code was similar. It is different enough that I cannot determine whether the two packages are using the same method to detect and model interactions.
To summarize, the difference between the two approaches boils down to how the "partial dependence function" of the two predictors is estimated.
The dismo package is based on code originally given in Elith et al. (2008); you can find the original source in the paper's supplementary material. The paper describes the procedure only briefly. Basically, model predictions are obtained over a grid of the two predictors, with all other predictors set at their means. The predictions are then regressed onto the grid, and the mean squared error of that regression is multiplied by 1000. This statistic indicates departures of the model predictions from a linear combination of the two predictors, flagging a possible interaction.
From the dismo package, we can also obtain the relevant source code for gbm.interactions. The interaction test boils down to the following commands (copied directly from source):
interaction.test.model <- lm(prediction ~ as.factor(pred.frame[,1]) + as.factor(pred.frame[,2]))
interaction.flag <- round(mean(resid(interaction.test.model)^2) * 1000,2)
pred.frame contains a grid of the two predictors in question, and prediction is the prediction from the original gbm fitted model where all but two predictors under consideration are set at their means.
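Putting the pieces together, here is a self-contained sketch of that test on a built-in dataset (the predictor pair, grid size, and tree count are arbitrary choices for illustration):

library(gbm)
set.seed(1)
fit <- gbm(mpg ~ wt + hp + disp, data = mtcars, distribution = "gaussian",
           n.trees = 1000, interaction.depth = 3)
# grid over the two predictors of interest, the rest held at their means
pred.frame <- expand.grid(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 20),
                          hp = seq(min(mtcars$hp), max(mtcars$hp), length.out = 20))
pred.frame$disp <- mean(mtcars$disp)
prediction <- predict(fit, newdata = pred.frame, n.trees = 1000)
# regress the predictions on the two main effects only; the leftover
# MSE * 1000 measures departure from additivity (a possible interaction)
interaction.test.model <- lm(prediction ~ as.factor(pred.frame[, 1]) + as.factor(pred.frame[, 2]))
round(mean(resid(interaction.test.model)^2) * 1000, 2)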
This is different from Friedman's H-statistic (Friedman & Popescu, 2005), which is estimated via formula (44) for any pair of predictors. That statistic is essentially the departure from additivity for any two predictors, averaging over the values of the other variables, NOT setting the other variables at their means. It is expressed as a fraction of the total variance of the partial dependence function of the two variables (or model-implied predictions), so it will always be between 0 and 1.
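For comparison, Friedman's H for the same pair via gbm (reusing the fit from the sketch above) is bounded:

# H-statistic for the (wt, hp) pair; always between 0 and 1
interact.gbm(fit, data = mtcars, i.var = c("wt", "hp"), n.trees = 1000)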
I've been using the caret package in R to run some boosted regression tree and random forest models, and I'm hoping to generate prediction intervals for a set of new cases using the built-in cross-validation routine.
The trainControl function allows you to save the hold-out predictions from each of the n folds, but I'm wondering whether unknown cases can also be predicted at each fold using the built-in functions, or whether I need to write a separate loop to build the models n times.
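For the first part, what I'm doing currently looks roughly like this (the dataset and method are placeholders):

library(caret)
ctrl <- trainControl(method = "cv", number = 10, savePredictions = "all")
fit  <- train(mpg ~ ., data = mtcars, method = "rf", trControl = ctrl)
head(fit$pred)   # hold-out predictions for every fold
# but there is no obvious hook here for scoring *new* cases per fold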
Any advice much appreciated
Check out the R package quantregForest, available on CRAN. It can easily calculate prediction intervals for random forest models. There's a nice paper by the package author explaining the background of the method. (Sorry, I can't say anything about prediction intervals for BRT models; I'm looking for those myself...)
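A minimal sketch of the idea on a built-in dataset (the quantiles and the train/test split are arbitrary):

library(quantregForest)
set.seed(1)
aq    <- na.omit(airquality)
train <- aq[1:80, ]
test  <- aq[81:nrow(aq), ]
qrf <- quantregForest(x = train[, -1], y = train$Ozone, ntree = 500)
# 90% prediction intervals from the 5th and 95th conditional quantiles
predict(qrf, newdata = test[, -1], what = c(0.05, 0.5, 0.95))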