There are several R functions and packages that estimate each feature's contribution to the response variable from a fitted XGBoost model.
For example:
xgboostExplainer::buildExplainer
DALEX::prediction_breakdown
Each of these functions returns the contribution of each feature in the X matrix to the response variable, which is particularly useful with continuous target variables.
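For concreteness, here is a minimal sketch of the first option (untested; the argument names are from memory, so check ?buildExplainer, and X / y are placeholder training data), just to show the kind of per-feature contribution table these tools return:

# Minimal sketch, not tested; see ?buildExplainer for the exact arguments.
library(xgboost)
library(xgboostExplainer)

dtrain <- xgb.DMatrix(data = X, label = y)   # X, y: placeholder training data
bst <- xgboost(data = dtrain, nrounds = 100, objective = "reg:squarederror")

explainer <- buildExplainer(bst, dtrain, type = "regression")
contribs  <- explainPredictions(bst, explainer, dtrain)
# Each row of 'contribs' holds one additive contribution per feature (plus an
# intercept) that sums to the row's prediction; nothing constrains their sign.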
Does anyone know how to constrain those contributions, for example to force them to be positive?
Thanks
I am using the zoib package in R to build zero-inflated beta regression models. I am looking for a simple way to use the models that zoib produces to calculate a predicted response for a new dataset. By "new dataset" I mean data not used to build the original zoib models.
I know I can just take the zoib model parameters and manually write a prediction function in R, but I want to utilise the fact that zoib models are Bayesian so I can get a posterior distribution of possible response values. My plan is to use the posterior distributions to calculate confidence intervals around each prediction.
Because zoib uses an MCMC approach within RJags, I have investigated these two solutions:
1. manipulating the code within RJags
2. appending the new data with an "NA" response variable
I don't know how to implement the first solution because zoib runs RJags internally and the zero-inflated model it fits is very complicated. I tried the second solution, but it simply ignored the rows of data that I appended with "NA" response values.
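In the meantime, the manual route mentioned above might look roughly like this (a sketch only, assuming the posterior draws of the mean-model coefficients have been extracted into a matrix post, one row per MCMC draw, and that Xnew is a design matrix for the new data with matching columns):

# Posterior predictive mean of the beta component for new data, done by hand.
# 'post' and 'Xnew' are assumptions as described above, not zoib output objects.
eta <- post %*% t(Xnew)                      # linear predictor: draws x new rows
mu  <- plogis(eta)                           # logit link for the beta-regression mean
ci  <- apply(mu, 2, quantile, probs = c(0.025, 0.5, 0.975))
# 'ci' gives a posterior median and 95% interval per new observation; the zero-
# and one-inflation components would need the same treatment separately.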
I emailed the zoib package developers and this was their response:
For now, the zoib function can only output posterior predictive samples for Y given the X in the data set where the zoib regression is applied to, but not for a new set of X's. Your suggestion can be easily incorporated into the new version of the package, which is expected to be out in about a few weeks.
This is a follow-up to a previous question I asked a while back that was recently answered.
I have built several gbm models with dismo::gbm.step, which relies on the gbm fitting functions found in the R package gbm, as well as cross-validation tools from the R package splines.
As part of my analysis, I would like to use some of the graphical tools available in R (e.g. perspective plots) to visualize pairwise interactions in the data. Both the gbm and the dismo packages have functions for detecting and modelling interactions in the data.
The implementation in dismo is explained in Elith et al. (2008) and returns a statistic which indicates departures of the model predictions from a linear combination of the predictors, while holding all other predictors at their means.
The implementation in gbm uses Friedman's H statistic (Friedman & Popescu, 2005), returns a different metric, and does NOT set the other variables at their means.
The interactions modelled and plotted with dismo::gbm.interactions are great and have been very informative. However, I would also like to use gbm::interact.gbm, partly for publication strength and also to compare the results from the two methods.
If I try to run gbm::interact.gbm on a gbm.object created with dismo, an error is returned:
"Error in is.factor(data[, x$var.names[j]]) :
argument "data" is missing, with no default"
I understand dismo::gbm.step adds extra data that the authors thought would be useful to the gbm model.
I also understand that the answer to my question lies somewhere in the source code.
My question is...
Is it possible to modify a gbm object created in dismo so that it can be used with gbm::interact.gbm? If so, would this be accomplished by...
a. Modifying the gbm object created in dismo::gbm.step?
b. Modifying the source code for gbm::interact.gbm?
c. Doing something else?
I will be going through the source code trying to solve this myself; if I come up with a solution before anyone answers, I will answer my own question.
The gbm::interact.gbm function requires data as an argument: interact.gbm <- function(x, data, i.var = 1, n.trees = x$n.trees).
The dismo gbm.object is essentially the same as the gbm gbm.object, but with extra information attached, so I don't imagine changing the gbm.object would help.
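If the only missing piece is the data argument, a workaround may simply be to pass the original data frame to interact.gbm yourself. A hedged sketch (untested; my.data, the column indices and the gaussian family are placeholders):

library(gbm)
library(dismo)

# Fit with dismo as usual ('my.data' and the gbm.x/gbm.y indices are hypothetical).
fit <- gbm.step(data = my.data, gbm.x = 3:10, gbm.y = 2,
                family = "gaussian", tree.complexity = 3, learning.rate = 0.01)

# Friedman's H for the first two predictors in the model, supplying the same data
# frame that gbm.step was fitted to; n.trees defaults to the fitted model's trees.
h <- interact.gbm(fit, data = my.data, i.var = c(1, 2))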
Background
The reference manual for the gbm package states that the interact.gbm function computes Friedman's H-statistic to assess the strength of variable interactions. The H-statistic is on the scale [0, 1].
The reference manual for the dismo package does not reference any literature for how the gbm.interactions function detects and models interactions. Instead it gives a list of general procedures used to detect and model interactions. The dismo vignette "Boosted Regression Trees for ecological modeling" states that the dismo package extends functions in the gbm package.
Question
How does dismo::gbm.interactions actually detect and model interactions?
Why
I am asking this question because gbm.interactions in the dismo package yields results >1, which the gbm package reference manual says is not possible.
I checked the tar.gz for each of the packages to see if the source code was similar. It is different enough that I cannot determine if these two packages are using the same method to detect and model interactions.
To summarize, the difference between the two approaches boils down to how the "partial dependence function" of the two predictors is estimated.
The dismo package is based on code originally given in Elith et al. (2008); you can find the original source in the supplementary material. The paper describes the procedure only briefly. Basically, model predictions are obtained over a grid of the two predictors, with all other predictors set at their means. These predictions are then regressed onto the grid, and the mean squared error of that regression is multiplied by 1000. This statistic indicates departures of the model predictions from a linear combination of the two predictors, suggesting a possible interaction.
From the dismo package, we can also obtain the relevant source code for gbm.interactions. The interaction test boils down to the following commands (copied directly from source):
interaction.test.model <- lm(prediction ~ as.factor(pred.frame[,1]) + as.factor(pred.frame[,2]))
interaction.flag <- round(mean(resid(interaction.test.model)^2) * 1000,2)
pred.frame contains a grid of the two predictors in question, and prediction is the prediction from the original fitted gbm model with all predictors other than the two under consideration set at their means.
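Conceptually (this is a simplified illustration, not the actual dismo source), pred.frame and prediction are built along these lines, assuming a fitted gbm model fit and a data frame my.data whose predictors are x1, x2 and x3:

# Grid over the two predictors of interest; every other predictor at its mean.
x1.seq <- seq(min(my.data$x1), max(my.data$x1), length.out = 20)
x2.seq <- seq(min(my.data$x2), max(my.data$x2), length.out = 20)
pred.frame <- expand.grid(x1 = x1.seq, x2 = x2.seq)
pred.frame$x3 <- mean(my.data$x3)            # 'x3' stands in for all other predictors

prediction <- predict(fit, newdata = pred.frame, n.trees = fit$n.trees, type = "link")
# These are the 'pred.frame' and 'prediction' objects used in the lm() test above.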
This is different from Friedman's H statistic (Friedman & Popescu, 2005), which is estimated via formula (44) for any pair of predictors. It is essentially the departure from additivity for any two predictors, averaging over the values of the other variables rather than setting them at their means. It is expressed as a proportion of the total variance of the partial dependence function of the two variables (i.e. of the model-implied predictions), so it will always be between 0 and 1.
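If I remember the paper correctly, the pairwise statistic has roughly the form below, where \hat F_{jk}, \hat F_j and \hat F_k are the (mean-centred) partial dependence functions of the two predictors jointly and individually, evaluated at the n training points:

H_{jk}^2 = \frac{\sum_{i=1}^{n} \left[ \hat F_{jk}(x_{ij}, x_{ik}) - \hat F_j(x_{ij}) - \hat F_k(x_{ik}) \right]^2}{\sum_{i=1}^{n} \hat F_{jk}^2(x_{ij}, x_{ik})}

The numerator measures the departure from additivity and the denominator normalises by the variance of the joint partial dependence, which is why the result stays in [0, 1] rather than being multiplied onto an arbitrary scale as in the dismo statistic.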
I'm experimenting with Bayesian networks in R and have built some networks using the bnlearn package. I can use them to make predictions for new observations with predict(); however, I would also like to have the posterior distribution over the possible classes. Is there a way of retrieving this information?
It seems like there is a prob parameter that does this for the naive Bayes implementation in the bnlearn package, but not for networks fitted with bn.fit.
Thankful for any help with this.
See the documentation of bnlearn.
The predict function implements prob only for naive.bayes and TAN.
In short, because all other methods do not necessarily compute posterior probabilities.
[bnlearn]::predict returns the predicted values for node given the data specified by data. Depending on the value of method, the predicted values are computed as follows: a) parents, b) bayes-lw.
When using bayes-lw, likelihood weighting simulations are performed for making predictions.
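If you need actual posterior probabilities rather than just the predicted class, one workaround is to query the fitted network directly with likelihood weighting; a hedged sketch (the node names, states and evidence below are made up):

library(bnlearn)

# 'fitted' is a bn.fit object; "Class" is the target node; the evidence list
# fixes the observed predictors for one new case (all names here are hypothetical).
p.yes <- cpquery(fitted, event = (Class == "yes"),
                 evidence = list(A = "high", B = "low"), method = "lw")
p.no  <- cpquery(fitted, event = (Class == "no"),
                 evidence = list(A = "high", B = "low"), method = "lw")
# Repeating this per class (or per row of new data) approximates the posterior
# distribution over the classes; the estimates are Monte Carlo, so they vary a bit.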
Hope this helps. :)
I am trying to use the ksvm function of the kernlab package in R for epsilon-SVM regression. I want to supply the parameters C (regularization constant) and epsilon (insensitivity) as vectors (with length equal to the number of training samples), but I cannot figure out how to do this. Please suggest a way.
Why do you assume that you can do it? According to the documentation of ksvm you can only weight classes, not individual samples. Such a modification is available in, for example, the sklearn Python library (as sample weights).
To artificially implement per-sample C weights you could oversample your data. It will be very inefficient (especially if you have large differences in C values), but it can be applied to almost any SVM library.
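A rough sketch of that oversampling idea for eps-regression with kernlab (untested; x, y and the per-sample weight vector c.weights are placeholders):

library(kernlab)

# Replicate each training row roughly in proportion to its desired C weight,
# then fit an ordinary eps-SVR with a single global C.
reps <- pmax(1L, as.integer(round(c.weights / min(c.weights))))
idx  <- rep(seq_along(y), times = reps)

fit <- ksvm(x[idx, , drop = FALSE], y[idx], type = "eps-svr",
            kernel = "rbfdot", C = 1, epsilon = 0.1)
# A row repeated k times contributes k times to the loss, mimicking a k-fold
# larger C for that sample; epsilon still has to stay a single value.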