MLR - calculating feature importance for bagged, boosted trees (XGBoost) in R

Good morning,
I have a question about calculating feature importance for bagged and boosted regression tree models built with the mlr package in R. I am using XGBoost to make predictions and bagging to estimate prediction uncertainty. My data set is relatively large: approximately 10k features and 10k observations. The predictions work perfectly (see the code below), but I can't seem to calculate feature importance (the last line in the code below). The importance function crashes with no error messages and freezes the R session. I saw some related Python code where people calculate the importance for each of the bagged models here and here, but I haven't been able to get that to work in R either. Specifically, I'm not sure how to access the individual models within the object produced by mlr (the mb object in the code below). In Python this seems to be trivial; in R I can't extract mb$learner.model, which seems logically closest to what I need. Has anyone had any experience with this issue?
Please see the code below
# regression task with observation weights
learn1 <- makeRegrTask(data = train.all, target = "resp", weights = weights1)

# base XGBoost learner
lrn.xgb <- makeLearner("regr.xgboost", predict.type = "response")
lrn.xgb$par.vals <- list(objective = "reg:squarederror", eval_metric = "error",
                         nrounds = 300, gamma = 0, booster = "gbtree", max.depth = 6)

# bag 50 XGBoost models, each fitted on 85% of the observations and all features
lrn.xgb.bag <- makeBaggingWrapper(lrn.xgb, bw.iters = 50, bw.replace = TRUE,
                                  bw.size = 0.85, bw.feats = 1)
lrn.xgb.bag <- setPredictType(lrn.xgb.bag, predict.type = "se")

mb <- mlr::train(lrn.xgb.bag, learn1)
fimp1 <- getFeatureImportance(mb)  # this call hangs and freezes the R session

If you set bw.feats = 1 it might be feasible to average the feature importance values across the bagged models.
Basically you just have to apply getFeatureImportance() over all the single models that are stored in the HomogeneousEnsembleModel. Some extra care is necessary because the sampling shuffles the order of the features, even though we set bw.feats to 100%.
library(mlr)

# small toy regression task
data = data.frame(x1 = runif(100), x2 = runif(100), x3 = runif(100))
data$y = with(data, x1 + 2 * x2 + 0.1 * x3 + rnorm(100))
task = makeRegrTask(data = data, target = "y")

lrn.xgb = makeLearner("regr.xgboost", predict.type = "response")
lrn.xgb$par.vals = list(objective = "reg:squarederror", eval_metric = "error",
                        nrounds = 50, gamma = 0, booster = "gbtree", max.depth = 6)
lrn.xgb.bag = makeBaggingWrapper(lrn.xgb, bw.iters = 10, bw.replace = TRUE,
                                 bw.size = 0.85, bw.feats = 1)
lrn.xgb.bag = setPredictType(lrn.xgb.bag, predict.type = "se")
mb = mlr::train(lrn.xgb.bag, task)

# feature importance of every single model stored in the HomogeneousEnsembleModel
fimps = lapply(mb$learner.model$next.model, function(x) getFeatureImportance(x)$res)
fimp = fimps[[1]]
# we have to take extra care because the results are not ordered
for (i in 2:length(fimps)) {
  fimp = merge(fimp, fimps[[i]], by = "variable")
}
rowMeans(fimp[, -1]) # only makes sense with bw.feats = 1
# [1] 0.2787052 0.4853880 0.2359068
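Since the merged fimp data frame keeps the feature names in its variable column, you might attach them to the averaged values for readability, e.g.:
# name the averaged importances by feature and sort them
avg.imp = setNames(rowMeans(fimp[, -1]), fimp$variable)
sort(avg.imp, decreasing = TRUE)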

Related

Meta and Metafor R Package

I am currently conducting a meta-analysis in R using the package "metafor". Doing my research I came across a different package for meta-analyses in R, namely "meta". I like the forest plot created by the latter package better (design-wise), but unfortunately some of the data is not the same as in the plot I created with metafor.
Specifically, the results differ only for I^2 and the pooled estimate.
meta_1 <- rma(yi = yi, vi = vi, measure = "SMD", method = "ML", slab = Citation, data = dat)
forest(meta_1)
meta_2 <- metagen(yi, vi^.5, data = dat, studlab = paste(Citation),
                  comb.fixed = FALSE, comb.random = TRUE, hakn = TRUE,
                  method.tau = "ML", sm = "SMD")
forest(meta_2)
Does anyone know why those differences emerge?
So I was able to get the prediction interval to match across the two functions, but not the I^2 values (even though the difference is only about 2%). There might be some statistical correction one package applies that the other does not, or it may have to do with the random-effects/fixed-effects modelling approach.
Anyway, I hope this code helps point you in the right direction. To get the CIs to match you also have to use the method.tau.ci parameter in metagen().
library(meta)
library(metafor)

study <- 1:8
yi <- c(-0.48965031, 0.64970214, 0.11201680, 0.07945655,
        -0.70874645, -0.54922759, 0.66768916, -0.45523574)
vi <- c(0.10299697, 0.14036855, 0.05137812, 0.03255550,
        0.34913525, 0.34971466, 0.07539957, 0.08428983)
dat <- data.frame(study, yi, vi)

meta_1 <- rma(yi = dat$yi, vi = dat$vi, measure = "SMD", method = "REML",
              slab = paste(study), data = dat)
forest(meta_1)

meta_2 <- meta::metagen(TE = dat$yi, seTE = dat$vi^.5, method.tau = "REML",
                        method.tau.ci = "BJ", comb.random = TRUE, comb.fixed = TRUE,
                        sm = "SMD")
forest(meta_2)
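If you want to compare the numbers directly rather than reading them off the forest plots, you can inspect the fitted objects from the code above; meta_1$b and meta_2$TE.random are the stored pooled estimates (component names as used by current metafor/meta versions):
# compare the pooled estimates numerically
meta_1$b           # pooled estimate from metafor::rma
meta_2$TE.random   # pooled random-effects estimate from meta::metagen

# full textual output, including tau^2 and I^2, from each package
print(meta_1)
summary(meta_2)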

rpart giving same results for cross-validation and no CV

Like the title says, I'm trying to run a decision tree both with and without cross-validation using the rpart package in R. I'm doing this using the xval parameter, as described in the vignette (https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf)
Unfortunately, I'm getting the same tree with and without CV. I've compared the calculation time for each, and the CV model takes about 10 times as long, so it's apparently doing something; I just can't figure out what.
I've also redone the model a number of times with different complexity parameters, but it hasn't made any difference.
Here's sample code that shows my problem: the printcp() outputs show the same results, and the predictions from both models on the training set and on a hold-out set are identical.
library(rpart)
library(caret)

abalone <- read.csv(file = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',
                    header = FALSE)
names(abalone) <- c("sex", "length", "diameter", "height", "whole_weight",
                    "shucked_weight", "viscera_weight", "shell_weight", "rings")

train_set <- createDataPartition(abalone$sex, times = 1, p = 0.8, list = FALSE)
abalone_train <- abalone[train_set, ]
abalone_test <- abalone[-train_set, ]
# fit without cross-validation
abalone_fit_noCV <- rpart(sex ~ .,
                          data = abalone_train,
                          method = "class",
                          parms = list(split = 'information'),
                          control = rpart.control(xval = 0, cp = 0.005))

# fit with 10-fold cross-validation
abalone_fit_CV <- rpart(sex ~ .,
                        data = abalone_train,
                        method = "class",
                        parms = list(split = 'information'),
                        control = rpart.control(xval = 10, cp = 0.005))

printcp(abalone_fit_noCV)
printcp(abalone_fit_CV)

# compare predictions on the training data
CV_pred <- predict(abalone_fit_CV, type = "class")
noCV_pred <- predict(abalone_fit_noCV, type = "class")
confusionMatrix(CV_pred, noCV_pred)

# compare predictions on the hold-out set
CV_pred <- predict(abalone_fit_CV, abalone_test, type = "class")
noCV_pred <- predict(abalone_fit_noCV, abalone_test, type = "class")
confusionMatrix(CV_pred, noCV_pred)
In true beginner fashion, I figured this out shortly after posting.
For anybody else coming upon this issue, it is basically answered on Cross Validated :
The final tree that is returned is still the initial tree. You must use the prune function using the cross-validation plot to choose the best subtree.
This is clear if you read the full Pruning the tree section of the vignette, rather than just the cross-validation section.
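A minimal sketch of that pruning step, reusing abalone_fit_CV from the code above; the cp value here is chosen with the usual minimum-xerror rule (a one-SE rule is another common choice):
# pick the complexity parameter with the lowest cross-validated error
best_cp <- abalone_fit_CV$cptable[which.min(abalone_fit_CV$cptable[, "xerror"]), "CP"]

# prune back to the corresponding subtree; this is where the CV results actually change the tree
abalone_fit_pruned <- prune(abalone_fit_CV, cp = best_cp)
printcp(abalone_fit_pruned)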

How to ensemble forecasts in R using weights

I have a couple of forecasts and am trying to figure out how to merge the two according to some criterion that is widely used.
In part one, I split the data and compared the forecasts against the actual values using Forecast_comb.
library(forecast)
library(ForecastCombinations)

y1 = rnorm(100)
train = y1[1:90]
test = y1[91:100]

fit1 = auto.arima(train)
fit2 = ets(train)
forc1 = forecast(fit1, h = 10)$mean
forc2 = forecast(fit2, h = 10)$mean
forc_all = cbind(forc1, forc2)
forc_all

?Forecast_comb
fitted <- Forecast_comb(obs = test, fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted
In part two, I rebuilt the model on the entire data set and forecast out ten values. How can one merge the two sets of forecasts according to some criterion?
fit3 = auto.arima(y1)
fit4 = ets(y1)
forc3 = forecast(fit3, h = 10)$mean
forc4 = forecast(fit4, h = 10)$mean
forc_all = cbind(forc3, forc4)
forc_all
fitted <- Forecast_comb(obs = y1[91:100], fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted
Thanks for the help
The reason that I am using ForecastCombinations is that it includes procedures for popular combination strategies. I thought that perhaps that function could be modified to perform the desired ensembling.
Based on a lot of Kaggle competitions where people share/discuss their scripts, I'd say that by far the most common and most effective way is simply to manually weight and add your predictions.
pacman::p_load(forecast)
pacman::p_load(ForecastCombinations)

y1 = rnorm(100)
train = y1[1:90]
test = y1[91:100]

fit1 = auto.arima(train)
fit2 = ets(train)
forc1 = forecast(fit1, h = 10)$mean
forc2 = forecast(fit2, h = 10)$mean
forc_all = cbind(forc1, forc2)
forc_all

?Forecast_comb
fitted_1 <- Forecast_comb(obs = test, fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted_1

fit3 = auto.arima(y1)
fit4 = ets(y1)
forc3 = forecast(fit3, h = 10)$mean
forc4 = forecast(fit4, h = 10)$mean
forc_all = cbind(forc3, forc4)
forc_all

fitted_2 <- Forecast_comb(obs = y1[91:100], fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted_2

# By far the most common way to combine/weight is simply:
fitted <- fitted_2 * .5 + fitted_1 * .5
fitted
One might ask whether you should use equal weights, or how to decide what the weights should be. This is usually determined by:
(a) naive, equal weighting, if that's all you have time for and it seems to work fine;
(b) iterating with a holdout or cross-validation sample(s), being careful not to overfit.
Some people take fancier approaches. It's easy to mess that up; however, if you get it right it can lead to a more accurate forecast.
The model-based and other fancier approaches amount to adding another stage to the modelling process, in which your predictions on a holdout sample form the X matrix and the outcome variable is the actual y for that sample (see the sketch after this paragraph).
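A minimal sketch of that second-stage ("stacking") idea, reusing the holdout forecasts forc1/forc2 and the holdout observations test from the code above; the lm() meta-model is just one simple choice of second-stage learner:
# second-stage regression: holdout forecasts are the predictors,
# the actual holdout values are the response
stack_data <- data.frame(y = as.numeric(test),
                         f1 = as.numeric(forc1),
                         f2 = as.numeric(forc2))
meta_fit <- lm(y ~ f1 + f2, data = stack_data)
coef(meta_fit)  # fitted coefficients act as (unconstrained) combination weights

# apply the learned weights to new forecasts (e.g. forc3/forc4 from the full-data models)
new_forc <- data.frame(f1 = as.numeric(forc3), f2 = as.numeric(forc4))
combined <- predict(meta_fit, newdata = new_forc)
combined
In practice the weights are often constrained to be non-negative and to sum to one, which keeps the combined forecast from extrapolating wildly beyond the individual forecasts.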
Also, check out Erin LeDell's approach in h2oEnsemble.

Set G in prior using MCMCglmm, with categorical response and phylogeny

I am new to the MCMCglmm package in R, and rather new to glm models in general. I have a dataset of species traits and whether or not they have been introduced outside of their native range.
I would like to test whether being introduced (as a binary 0/1 response variable) can be explained by any of the species traits. I would also like to correct for phylogeny between species.
I was told that for a binary response I could use family = "threshold" and that I should fix the residual variance at 1, but I am having some trouble with the other parameters needed for the prior.
I've specified the R structure (the residual variance), but if I specify R I must also specify G for the random effects, and it is not clear to me how to choose values for this parameter. I've tried putting in default values, but I get error messages:
Error in MCMCglmm(fixed, random = ~species, data = data2, family = "threshold", :
prior$G has the wrong number of structures
I have read the help pages, vignettes and course notes, but have not found an example with a binary response, and it is not clear to me how to choose the values for the priors. This is what I have so far:
fixed = Intro_binary ~ Trait1 + Trait2 + Trait3
Ainv = inverseA(redTree1)$Ainv
binary_model = MCMCglmm(fixed, random = ~species, data = data, family = "threshold",
                        ginverse = list(species = Ainv),
                        prior = list(
                          G = list(),                  # not sure about the parameters for the random effects
                          R = list(V = 1, fix = 1)),   # to fix the residual variance at one
                        nitt = 60000, burnin = 10000)
Any help or feedback would be greatly appreciated!
This one is a bit tricky with the information you provide. I'd say you can keep the residual variance fixed at 1 and define a "weak" prior for G, with one prior specification per random-effect structure (here there is only one, ~species, which is why your prior$G had the "wrong number of structures"):
priors <- list(R = list(V = 1, fix = 1),                 # residual variance fixed at 1 (threshold family)
               G = list(G1 = list(V = 1, nu = 0.002)))   # weak prior for the phylogenetic random effect

binary_model <- MCMCglmm(fixed, random = ~species, data = data,
                         family = "threshold",
                         ginverse = list(species = Ainv),
                         prior = priors,
                         nitt = 60000, burnin = 10000)
However, without more information on your analysis, I strongly suggest you plot your posteriors to have a look at the results and see whether anything looks wrong. Have a look at the MCMCglmm package Course Notes for more information on how to set these priors (especially on what not to do in section 1.5); you can also find more specific guidance on tuning the prior to your model if it fits one of the categories in the tutorial.
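A minimal way to do that inspection, assuming the binary_model object above (Sol holds the fixed-effect samples, VCV the variance components):
# traceplots and posterior densities for the fixed effects and variance components
plot(binary_model$Sol)
plot(binary_model$VCV)
summary(binary_model)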

Using Cost Sensitive C50 in caret

I am using train from the caret package to train some C5.0 models. I manage to do fine with method = "C5.0", but when I want to use the cost-sensitive C5.0 method I struggle to understand how to tune the cost parameter. What I am trying to do is to introduce a cost for wrongly predicting one of my classes. I've tried searching the caret package website (http://topepo.github.io/caret/index.html) and reading several manuals/tutorials found here and there, but I didn't find any information about how to handle the cost parameter. So this is what I tried on my own:
Run train with the default settings to see what I get. In the output, the train function tried cost values from 0 to 2 and gave the best model for cost = 2.
Try to add the cost to the expand.grid function as a matrix, the same way you would when using the C50 package directly. The code is below (trials is set to 1 because I just want one tree/set of rules in my output):
c50Grid <- expand.grid(.trials=1, .model=c("tree", "rules"), .winnow=c("TRUE", "FALSE"), .cost=matrix(c(0,1,2,0), ncol=2))
However, when I execute the train function, although I don't get any errors (I do get 50 warnings), train again tried cost values from 0 to 2. What am I doing wrong? What format does the cost parameter take, and what does it mean here? How would I interpret the results? Which class is the one the cost applies to, as in "predicting class 0 wrong costs double compared to class 1"? Also, I only tried a single matrix; even if that format had worked, how would I supply the different cost values that I want to test?
Thanks! Any help would be really welcome!
Edit:
So, trying to find an answer on my own about the meaning of the cost parameter for C5.0Cost, I went to C5.0Cost.R (https://r-forge.r-project.org/scm/viewvc.php/models/files/C5.0Cost.R?view=markup&root=caret&pathrev=761) and looked at the code.
This line:
cmat <-matrix(c(0, param$cost, 1, 0), ncol = 2)
I guess it's passing the cost parameter into the cost matrix, so I think I now understand how it works. If I have classes {0, 1} and my positive class is 0, this matrix says "predicting class 0 wrong costs double compared to class 1", right?
My question now is, how could I do the opposite? How could I specify that "predicting class 1 wrong costs double compared to class 0", which would be:
cmat <- matrix(c(0, 1, param$cost, 0), ncol=2)
Could I just set the cost to 0.5? And if I want to train with different values, just use values less than 1 (0.5, 0.6, 0.7, etc.)?
Note: given the way my data is set up, when I used C50 or other trees before it took "positive class = 0", so I had to invert the cost matrix when using C50 directly; if I use the caret method C5.0Cost, I'd need to do the same or find another way to do it...
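Just evaluating the cmat line from the caret source for a couple of cost values makes the asymmetry explicit (this only builds the matrices; it does not run caret):
# the matrix caret constructs internally for a given scalar cost
cost_matrix <- function(cost) matrix(c(0, cost, 1, 0), ncol = 2)
cost_matrix(2)    # penalises one type of error twice as much as the other
cost_matrix(0.5)  # a value below 1 reverses which error carries the larger penalty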
I'd really appreciate any help here.
Thanks!
There is a cost-sensitive model code for train and C5.0 (use method = "C5.0Cost"). For example:
library(caret)

set.seed(1)
dat1 <- twoClassSim(1000, intercept = -12)
dat2 <- twoClassSim(1000, intercept = -12)

# summary function reporting accuracy/kappa plus sensitivity and specificity
stats <- function(data, lev = NULL, model = NULL) {
  c(postResample(data[, "pred"], data[, "obs"]),
    Sens = sensitivity(data[, "pred"], data[, "obs"]),
    Spec = specificity(data[, "pred"], data[, "obs"]))
}

ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     summaryFunction = stats)

set.seed(2)
mod1 <- train(Class ~ ., data = dat1,
              method = "C5.0",
              tuneGrid = expand.grid(model = "tree", winnow = FALSE,
                                     trials = c(1:10, (1:5) * 10)),
              trControl = ctrl)

xyplot(Sens + Spec ~ trials, data = mod1$results,
       type = "l",
       auto.key = list(columns = 2,
                       lines = TRUE,
                       points = FALSE))

set.seed(2)
mod2 <- train(Class ~ ., data = dat1,
              method = "C5.0Cost",
              tuneGrid = expand.grid(model = "tree", winnow = FALSE,
                                     trials = c(1:10, (1:5) * 10),
                                     cost = 1:10),
              trControl = ctrl)

xyplot(Sens + Spec ~ trials | format(cost), data = mod2$results,
       type = "l",
       auto.key = list(columns = 2,
                       lines = TRUE,
                       points = FALSE))
Max
If I have classes {0, 1} and my positive class is 0, this matrix says "predicting class 0 wrong costs double compared to class 1", right? My question now is, how could I do the opposite? How could I specify that "predicting class 1 wrong costs double compared to class 0" [...]?
Unfortunately, you can't change the costs for the false positives in caret at the moment. This appears to be a bug! See this post for further information about this issue.
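As a possible workaround while caret only exposes a single scalar cost, you could fit the cost-sensitive model with the C50 package directly and supply your own cost matrix. This is only a sketch reusing dat1 from the example above; check ?C50::C5.0 for which margin of the costs matrix refers to the predicted versus the true class before relying on it:
library(C50)

# reuse dat1 from the caret example above
x <- dat1[, names(dat1) != "Class"]
y <- dat1$Class

# 2x2 cost matrix with the class levels as dimnames; adjust which off-diagonal
# entry you raise depending on which error you want to penalise more
# (see ?C5.0 for the row/column convention)
cmat <- matrix(c(0, 1, 2, 0), ncol = 2,
               dimnames = list(levels(y), levels(y)))

mod_costs <- C5.0(x, y, costs = cmat)
summary(mod_costs)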
