Bagging in R: How to use train with the bag function - r

In R, I am trying to use the bag function with the train function. I start with using train and rpart for a classification tree model, on the simple iris data set. Now I want to create a bag of such 10 trees with the bag function. The documentation says that the aggregate parameter must be a function to choose a value from all bagged models, so I created one called agg, which chooses the string of greatest frequency. However, the bag function gives the following error:
Error in fitter(btSamples[[iter]], x = x, y = y, ctrl = bagControl, v = vars, :
task 1 failed - "attempt to apply non-function"
Here is my complete code:
# Use bagging to create a bagged classification tree from 10 classification trees created with rpart.
data(iris)
# Create training and testing data sets:
inTrain = createDataPartition(y=iris$Species, p=0.7, list=F)
train = iris[inTrain,]
test = iris[-inTrain,]
# Create regressor and outcome datasets for bag function:
regressors = train[,-5]
species = train[,5]
# Create aggregate function:
agg = function(x, type) {
y = count(x)
y = y[order(y$freq, decreasing=T),]
as.character(y$x[1])
}
# Create bagged trees with bag function:
treebag = bag(regressors, species, B=10,
bagControl = bagControl(fit = train(data=train, species ~ ., method="rpart"),
predict = predict,
aggregate = agg
)
)
This gives the error message stated above. I don't understand why it rejects the agg function.

from ?bag()
When using bag with train, classification models should use type =
"prob" inside of the predict function so that predict.train(object,
newdata, type = "prob") will work.
So I guess you might want to try:
bagControl = bagControl(fit = train(data=train, species ~ .,
method="rpart", type="prob"),
predict = predict,
aggregate = agg
)

Related

Retrain best model on full dataset in R

I have two models to select from and using some criteria I choose one of the two. (The below is just an example, I know it doesn't make much sense)
library(forecast)
set.seed(4)
sample_dat= sample(1:nrow(cars), 5)
train = cars[-sample_dat, ]
test = cars[sample_dat, ]
models = list(lm(dist ~ speed, train), glm(dist ~ speed, train, family = "poisson"))
test_res = sapply(models, function(x) accuracy(predict(x, test, type = "response"), test$dist)[2]) #Getting the RMSE for each model
best_model = models[which.min(test_res)]
How can I retrain the best model using the full dataset (train + test)? I checked the update and update.formula functions but these don't seem to be updating the data part.
update(best_model[[1]],data = rbind(train,test))
You do not want to change the formula since that is the best model but rather update the data
Base R using your own logic, first creating a list mirroring the models list:
set.seed(4)
sample_dat= sample(1:nrow(cars), 5)
train = cars[-sample_dat, ]
test = cars[sample_dat, ]
models = list(lm(dist ~ speed, train), glm(dist ~ speed, train, family = "poisson"))
model_application = list(as.expression("lm(dist ~ speed, cars)$call"),
as.expression("glm(dist ~ speed, cars, family = 'poisson'))$call"))
test_res = sapply(models,
function(x){
# Store a function to caclulate the RMSE: rmse => function
rmse <- function(actual_vec, pred_vec){sqrt(mean((pred_vec - actual_vec)**2))}
# Getting the RMSE for each model: numeric scalar => .GlobalEnv
rmse(test$dist, predict(x, data = test, type = "response"))
}
)
best_model = models[[which.min(test_res)]]
applied_model <- eval(eval(as.expression(parse(text = model_application[[which.min(test_res)]]))))

Plotting variable importance from ensemble of models with for loop

I keep running into an error while attempting to plot variable importance from ensemble of models.
I have ensemble of models I've fitted and now I am trying to create multiple variable importance plots for each algorithm I've fitted. I am using varImp() function from caret to extract variable importance, then plot() it. To fit ensemble of models, I am using caretEnsemble package.
Thank you for any help, please see the example of code below.
# Caret ensemble is needed to produce list of models
library(caret)
library(caretEnsemble)
# Set algorithms I wish to fit
my_algorithms <- c("glmnet", "svmRadial", "rf", "nnet", "knn", "rpart")
# Define controls
my_controls <- trainControl(
method = "cv",
savePredictions = "final",
number = 3
)
# Run the models all at once with caretEnsemble
my_list_of_models <- caretEnsemble::caretList(Species ~ .,
data = iris,
trControl = my_controls,
methodList = my_algorithms)
# Subset models
list_of_algorithms <- my_list_of_models[my_algorithms]
# Create first for loop to extract variable importance via caret::varImp()
importance <- list()
for (algo in seq_along(list_of_algorithms)) {
importance[[algo]] <- varImp(list_of_algorithms[[algo]])
}
# Create second loop to go over extracted importance and plot it using plot()
importance_plots <- list()
for (imp in seq_along(importance)) {
importance_plots[[imp]] <- plot(importance[[imp]])
}
# Error occurs during the second for loop:
Error in data.frame(values = unlist(unname(x)), ind, stringsAsFactors = FALSE):arguments imply differing number of rows: 16,
I've come up with the solution to the problem above and decided to post it as my own answer. I've written a small function to plot variable importance without relying on caret helper functions to create plots. I used dotplot and levelplot because caret returns data.frame that differs based on provided algorithm. It may not work on different algorithms and models that didn't fit.
# Libraries ---------------------------------------------------------------
library(caret) # To train ML algorithms
library(dplyr) # Required for %>% operators in custom function below
library(caretEnsemble) # To train multiple caret models
library(lattice) # Required for plotting, should be loaded alongside caret
library(gridExtra) # Required for plotting multiple plots
# Custom function ---------------------------------------------------------
# The function requires list of models as input and is used in for loop
plot_importance <- function(importance_list, imp, algo_names) {
importance <- importance_list[[imp]]$importance
model_title <- algo_names[[imp]]
if (ncol(importance) < 2) { # Plot dotplot if dim is ncol < 2
importance %>%
as.matrix() %>%
dotplot(main = model_title)
} else { # Plot heatmap if ncol > 2
importance %>%
as.matrix() %>%
levelplot(xlab = NULL, ylab = NULL, main = model_title, scales = list(x = list(rot = 45)))
}
}
# Tuning parameters -------------------------------------------------------
# Set algorithms I wish to fit
# Rather than using methodList as provided above, I've switched to tuneList because I need to control tuning parameters of random forest algorithm.
my_algorithms <- list(
glmnet = caretModelSpec(method = "glmnet"),
rpart = caretModelSpec(method = "rpart"),
svmRadial = caretModelSpec(method = "svmRadial"),
rf = caretModelSpec(method = "rf", importance = TRUE), # Importance is not computed for "rf" by default
nnet = caretModelSpec(method = "nnet"),
knn = caretModelSpec(method = "knn")
)
# Define controls
my_controls <- trainControl(
method = "cv",
savePredictions = "final",
number = 3
)
# Run the models all at once with caretEnsemble
my_list_of_models <- caretList(Species ~ .,
data = iris,
tuneList = my_algorithms,
trControl = my_controls
)
# Extract variable importance ---------------------------------------------
importance <- lapply(my_list_of_models, varImp)
# Plotting variable immportance -------------------------------------------
# Create second loop to go over extracted importance and plot it using plot()
importance_plots <- list()
for (imp in seq_along(importance)) {
# importance_plots[[imp]] <- plot(importance[[imp]])
importance_plots[[imp]] <- plot_importance(importance_list = importance, imp = imp, algo_names = names(my_list_of_models))
}
# Multiple plots at once
do.call("grid.arrange", c(importance_plots))

how to print variable importance of all the models in the leaderboard of h2o.automl in r

I am using automl() function of H2o package in R for regression.
Consider I am using the name "aml" for building models.
aml <- h2o.automl(x=x, y=y, training_frame = train_set,
max_models = 20, seed = 1,
keep_cross_validation_predictions = TRUE)
The leaderboard of automl() shows the top performed models. I am able to print the importance of the predictors through h2o.varimp() function and plot a graph for the same using h2o.varimp_plot() function for only the leader model (the best model given by automl function).
h2o.varimp(aml#leader)
h2o.varimp_plot(aml#leader)
Is there any way to print the variable importance of the predictors for all the models in the leaderboard and plot a graph using the above two functions?
Stacked Ensembles (usually the leader model) does not yet support variable importance (JIRA here). However the variable importance for rest of the models can be retrieved in a loop over the model ids in the leaderboard. See R code below.
library(h2o)
h2o.init()
# Import a sample binary outcome train/test set into H2O
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[,y] <- as.factor(train[,y])
# Run AutoML for 10 models
aml <- h2o.automl(x = x, y = y,
training_frame = train,
max_models = 10,
seed = 1)
# View the AutoML Leaderboard
lb <- aml#leaderboard
print(lb, n = nrow(lb))
# Get model ids for all models in the AutoML Leaderboard
model_ids <- as.data.frame(lb$model_id)[,1]
# View variable importance for all the models (besides Stacked Ensemble)
for (model_id in model_ids) {
print(model_id)
m <- h2o.getModel(model_id)
h2o.varimp(m)
h2o.varimp_plot(m)
}

Pass model formula as argument in R

I need to cross-validate several glmer models on the same data so I've made a function to do this (I'm not interested in preexisting functions for doing this). I want to pass an arbitrary glmer model to my function as the only argument. Sadly, I can't figure out how to do this, and the interwebz won't tell me.
Ideally, I would like to do something like:
model = glmer(y ~ x + (1|z), data = train_folds, family = "binomial"
model2 = glmer(y ~ x2 + (1|z), data = train_folds, family = "binomial"
And then call cross_validation_function(model) and cross_validation_function(model2). The training data within the function is called train_fold.
However, I suspect I need to pass the model formula in different way using reformulate.
Here is an example of my function. The project is about predicting autism(ASD) from behavioral features. The data variable is da.
library(pacman)
p_load(tidyverse, stringr, lmerTest, MuMIn, psych, corrgram, ModelMetrics,
caret, boot)
cross_validation_function <- function(model){
#creating folds
participants = unique(da$participant)
folds <- createFolds(participants, 10)
cross_val <- sapply(seq_along(folds), function(x) {
train_folds = filter(da, !(as.numeric(participant) %in% folds[[x]]))
predict_fold = filter(da, as.numeric(participant) %in% folds[[x]])
#model to be tested should be passed as an argument here
train_model <- model
predict_fold <- predict_fold %>%
mutate(predictions_perc = predict(train_model, predict_fold, allow.new.levels = T),
predictions_perc = inv.logit(predictions_perc),
predictions = ifelse(predictions_perc > 0.5, "ASD","control"))
conf_mat <- caret::confusionMatrix(data = predict_fold$predictions, reference = predict_fold$diagnosis, positive = "ASD")
accuracy <- conf_mat$overall[1]
sensitivity <- conf_mat$byClass[1]
specificity <- conf_mat$byClass[2]
fixed_ef <- fixef(train_model)
output <- c(accuracy, sensitivity, specificity, fixed_ef)
})
cross_df <- t(cross_val)
return(cross_df)
}
Solution developed from the comment: Using as.formula strings can be converted into a formula which can passed as arguments to my function in the following way:
cross_validation_function <- function(model_formula){
...
train_model <- glmer(model_formula, data = da, family = "binomial")
...}
formula <- as.formula( "y~ x + (1|z"))
cross_validation_function(formula)
If you aim is to extract the model formula from a fitted model, the you can use
attributes(model)$call[[2]]. Then you can use this formula when fitting model with the cv folds.
mod_formula <- attributes(model)$call[[2]]
train_model = glmer(mod_formula , data = train_data,
family = "binomial")

SVM in R e1071 package yields different outputs for different formula notation but same data and parameters

For the purpose of doing Twitter sentiment analysis I am using the SVM function from e1071 package.
I used the RTextTools package to create a document term matrix which I split into a training and a test set. I have "train" which is a data frame of training observations excluding the predicted variable. Then I have "sentitrain", which is a vector of sentiment values corresponding to the training set. Same for testing.
Then I used three different ways for fitting the a SVM model.
Firstly, I created a container
trainmat = as.matrix(train)
# create container object
traincontainer = create_container(trainmat,
sentitrain,
trainSize = 1:nrow(trainmat),
virgin = FALSE)
# create test matrix
testmat = as.matrix(test)
testcontainer = create_container(testmat, labels = rep(0, nrow(test)),
testSize = 1:nrow(test), virgin = FALSE)
model <- train_model(traincontainer , "SVM", kernel="radial", cost=400)
results = classify_model(testcontainer, model)
preds = results[,1]
confusionMatrix(table(preds, sentitest))
This gave me approximately 76% classification accuracy.
In the second method I simply took the column names of my training matrix and then created a formula:
n = names(train)
# exclude the predicted variable
n = setdiff(n, c("sentiment"))
predictors = paste(n, collapse = " + ")
# create formula
f = as.formula(paste("sentiment ~ ", predictors))
model = svm(f, data = train, cost = 400, kernel = "radial")
preds = predict(model, test)
confusionMatrix(table(preds, sentitest))
This gave me around 69% accuracy.
Thirdly I just passed the data frames and the vectors of predicted values directly to the function call:
model = svm(train, sentitrain, data = train, cost = 400, kernel =
"radial")
preds = predict(model, test)
confusionMatrix(table(preds, sentitest))
This resulted in an astonishing 87% accuracy.
I ran each model several times with cross validation to make sure these differences were not due to randomness.
As I understand it I always used the same function from the same package (RTextTools calls the SVM function from e1071 internally) with the same parameters on the same data. The only difference is the way I passed those parameters. How can the results be so different?

Resources