Ensemble different datasets in R

I am trying to combine signals from different models using the example described here. I have different datasets that predict the same output. However, when I combine the model outputs in caretList and ensemble the signals, it gives an error:
Error in check_bestpreds_resamples(modelLibrary) :
Component models do not have the same re-sampling strategies
Here is a reproducible example:
library(caret)
library(caretEnsemble)

df1 <- data.frame(x1 = rnorm(200),
                  x2 = rnorm(200),
                  y  = as.factor(sample(c("Jack", "Jill"), 200, replace = TRUE)))

df2 <- data.frame(z1 = rnorm(400),
                  z2 = rnorm(400),
                  y  = as.factor(sample(c("Jack", "Jill"), 400, replace = TRUE)))

check_1 <- train(x = df1[, 1:2], y = df1[, 3],
                 method = "nnet",
                 tuneLength = 10,
                 trControl = trainControl(method = "cv",
                                          classProbs = TRUE,
                                          savePredictions = TRUE))

check_2 <- train(x = df2[, 1:2], y = df2[, 3],
                 method = "nnet",
                 preProcess = c("center", "scale"),
                 tuneLength = 10,
                 trControl = trainControl(method = "cv",
                                          classProbs = TRUE,
                                          savePredictions = TRUE))

combine <- c(check_1, check_2)
ens <- caretEnsemble(combine)

First of all, you are trying to combine two models trained on different training data sets. That is not going to work: all ensemble members need to be based on the same training set, otherwise each trained model carries a different set of resamples. Hence your current error.
Also, building your models without caretList is risky because you have a big chance of ending up with different resampling strategies. You can control that a bit better by setting the index in trainControl (see the vignette), as sketched below.
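For example, a minimal sketch (assuming everything is fit on df1; ctrl_fixed is just an illustrative name) of fixing the folds once so every model sees identical resampling indexes:
set.seed(1324)
# Create the cross-validation indexes once and reuse them for every model
my_folds <- createFolds(df1$y, k = 5, returnTrain = TRUE)

ctrl_fixed <- trainControl(method = "cv",
                           index = my_folds,
                           classProbs = TRUE,
                           savePredictions = "final")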
If you use one dataset you can use the following code:
ctrl <- trainControl(method = "cv",
                     number = 5,
                     classProbs = TRUE,
                     savePredictions = "final")

set.seed(1324)

# will generate the following warning:
# indexes not defined in trControl. Attempting to set them ourselves, so
# each model in the ensemble will have the same resampling indexes.
models <- caretList(x = df1[, 1:2],
                    y = df1[, 3],
                    trControl = ctrl,
                    tuneList = list(
                      check_1 = caretModelSpec(method = "nnet", tuneLength = 10),
                      check_2 = caretModelSpec(method = "nnet", tuneLength = 10,
                                               preProcess = c("center", "scale"))
                    ))

ens <- caretEnsemble(models)
A glm ensemble of 2 base models: nnet, nnet

Ensemble results:
Generalized Linear Model

200 samples
  2 predictor
  2 classes: 'Jack', 'Jill'

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results:

  Accuracy   Kappa
  0.5249231  0.04164767
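Once fitted, the ensemble can be used like any other caret model; a minimal sketch of predicting on new data (here it simply reuses the df1 predictors):
# Sketch: predictions from the stacked ensemble
ens_preds <- predict(ens, newdata = df1[, 1:2])
head(ens_preds)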
Also read this guide on different ensemble strategies.

Related

Training, validation and testing without using caret

I'm having doubts about the hyperparameter tuning step; I think I might be confusing things.
I split my dataset into training (70%), validation (15%) and testing (15%) sets. Below is the code used for regression with random forest.
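For reference, a minimal sketch of how such a split could be produced (assuming the full data frame is called df):
# 70/15/15 random split (sketch; assumes the full data frame is called df)
set.seed(123)
idx        <- sample(seq_len(nrow(df)))
n_train    <- floor(0.70 * nrow(df))
n_val      <- floor(0.15 * nrow(df))
train      <- df[idx[1:n_train], ]
validation <- df[idx[(n_train + 1):(n_train + n_val)], ]
testing    <- df[idx[(n_train + n_val + 1):length(idx)], ]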
1. Training
I perform the initial training with the dataset, as follows:
library(caret)   # for RMSE()
library(ranger)

rf_model <- ranger(y ~ .,
                   data = train,
                   num.trees = 500,
                   mtry = 5,
                   min.node.size = 100,
                   importance = "impurity")
I get the R squared and the RMSE using the actual and predicted data from the training set.
pred_rf <- predict(rf_model, train)
pred_rf <- data.frame(pred = pred_rf$predictions, obs = train$y)
RMSE_rf <- RMSE(pred_rf$pred, pred_rf$obs)
R2_rf   <- (cor(pred_rf$pred, pred_rf$obs))^2
2. Parameter optimization
Using a parameter grid, the best model is chosen based on performance.
hyper_grid <- expand.grid(mtry = seq(3, 12, by = 4),
                          sample_size = c(0.5, 1),
                          min.node.size = seq(20, 500, by = 100),
                          MSE = as.numeric(NA),
                          R2 = as.numeric(NA),
                          OOB_RMSE = as.numeric(NA))
I then search for the best model according to the smallest OOB error, for example.
for (i in 1:nrow(hyper_grid)) {
  model <- ranger(formula = y ~ .,
                  data = train,
                  num.trees = 500,
                  mtry = hyper_grid$mtry[i],
                  sample.fraction = hyper_grid$sample_size[i],
                  min.node.size = hyper_grid$min.node.size[i],
                  importance = "impurity",
                  replace = TRUE,
                  oob.error = TRUE,
                  verbose = TRUE)
  hyper_grid[i, "MSE"]      <- model$prediction.error
  hyper_grid[i, "R2"]       <- model$r.squared
  hyper_grid[i, "OOB_RMSE"] <- sqrt(model$prediction.error)
}
Choose the best performing model
x <- hyper_grid[which.min(hyper_grid$OOB_RMSE), ]
The final model:
rf_fit_model <- ranger(formula = y ~ .,
                       data = train,
                       num.trees = 100,
                       mtry = x$mtry,
                       sample.fraction = x$sample_size,
                       min.node.size = x$min.node.size,
                       oob.error = TRUE,
                       verbose = TRUE,
                       importance = "impurity")
Perform model prediction with validation data
rf_predict_val <- predict(rf_fit_model, validation)
rf_predict_val <- data.frame(pred = rf_predict_val$predictions, obs = validation$y)
RMSE_rf_fit <- RMSE(rf_predict_val$pred, rf_predict_val$obs)
R2_rf_fit   <- (cor(rf_predict_val$pred, rf_predict_val$obs))^2
Well, now I wonder whether I should repeat the model evaluation with the test data. The issue is that the validation data is being used only as a "test" set and is not really helping to validate the model.
I've used cross-validation with other methods, but here I'd like to do it more manually; one of the reasons is that CV via caret is very slow. Am I on the right track?
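As a point of comparison, a minimal sketch of a manual k-fold loop with ranger (assumes the train data frame with a numeric outcome y; the names are illustrative):
# Manual 5-fold CV with ranger (sketch)
set.seed(123)
folds   <- sample(rep(1:5, length.out = nrow(train)))
cv_rmse <- numeric(5)
for (k in 1:5) {
  fit        <- ranger(y ~ ., data = train[folds != k, ], num.trees = 500)
  preds      <- predict(fit, train[folds == k, ])$predictions
  cv_rmse[k] <- sqrt(mean((preds - train$y[folds == k])^2))
}
mean(cv_rmse)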
Code using Caret, but very slow:
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 10)

grid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                    n.trees = 1000,
                    shrinkage = c(0.01, 0.1),
                    n.minobsinnode = 50)

gbmTune <- train(y ~ ., data = train,
                 method = "gbm",
                 tuneGrid = grid,
                 verbose = TRUE,
                 trControl = ctrl)

PCA as preprocessing for KNN model in caret not working

I am running a kNN classification with PCA as preprocessing (threshold = 0.8). However, when it seems to have finished and I call the varImp() function, it does not report the importance of the principal components (as it should). When I inspect the model fit object, the preProcess element is also NULL.
It seems that the PCA is not being applied.
The code for training control and training that I am using is:
ctrl <- trainControl(method = "cv",
                     number = 10,
                     summaryFunction = defaultSummary,
                     preProcOptions = list(thresh = 0.8),
                     classProbs = TRUE)

set.seed(150)
knn.fit <- train(defective ~ .,
                 data = fTR,
                 method = "knn",
                 preProcess = "pca",
                 tuneGrid = data.frame(k = 4),
                 # tuneGrid = data.frame(k = seq(3, 5, 1)),
                 # tuneLength = 10,
                 trControl = ctrl,
                 metric = "Kappa")
What am I doing wrong?
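As a side check, a minimal sketch (assuming fTR holds the predictors plus the defective outcome) of calling caret's preProcess() directly, to confirm how many components the 0.8 threshold would retain:
# Run the PCA step on its own to see what caret would compute (sketch)
predictors <- fTR[, setdiff(names(fTR), "defective")]
pp <- preProcess(predictors, method = "pca", thresh = 0.8)
pp                             # reports the components needed for 80% of the variance
head(predict(pp, predictors))  # the rotated predictors that would feed the kNN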

error with mnLogloss for multinomial classifier using caret/gbm

I am trying to build a multinomial classifier. It seems to work, and I am able to generate a plot of the minimized logLoss vs. boosting iterations, but I am having trouble extracting the error value. This is the error when I run the mnLogLoss function:
Error in mnLogLoss(predicted, lev = predicted$label) :
'data' should have columns consistent with 'lev'
The data has been partitioned into:
- training
- testing
In both, the column "label" contains the ground truth.
library(MLmetrics)

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                           verboseIter = FALSE, savePredictions = TRUE,
                           classProbs = TRUE, summaryFunction = mnLogLoss)

gbmGrid1 <- expand.grid(.interaction.depth = 1:3, .n.trees = (1:10) * 20,
                        .shrinkage = 0.01, .n.minobsinnode = 3)

system.time(
  gbmFit1 <- train(label ~ ., data = training, method = "gbm", trControl = fitControl,
                   verbose = 1, metric = "logLoss", tuneGrid = gbmGrid1)
)

gbmPredictions <- predict(gbmFit1, testing)
predicted <- cbind(gbmPredictions, testing)
mnLogLoss(predicted, lev = levels(predicted$label))
For mnLogLoss, it says in the vignette:
data: a data frame with columns 'obs' and 'pred' for the observed and predicted outcomes. For metrics that rely on class probabilities, such as 'twoClassSummary', columns should also include predicted probabilities for each class. See the 'classProbs' argument to 'trainControl'.
So it's not asking for the training data. The data argument here is just an input, so I use some simulated data:
library(caret)

df <- data.frame(label = factor(sample(c("a", "b"), 100, replace = TRUE)),
                 matrix(runif(500), ncol = 50))
training <- df[1:50, ]
testing  <- df[1:50, ]

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                           verboseIter = FALSE, savePredictions = TRUE,
                           classProbs = TRUE, summaryFunction = mnLogLoss)

gbmGrid1 <- expand.grid(.interaction.depth = 1:3, .n.trees = (1:10) * 20,
                        .shrinkage = 0.01, .n.minobsinnode = 3)

gbmFit1 <- train(label ~ ., data = training, method = "gbm", trControl = fitControl,
                 verbose = 1, metric = "logLoss", tuneGrid = gbmGrid1)
And we put together obs, pred, and the class probabilities as the last two columns:
predicted <- data.frame(obs  = testing$label,
                        pred = predict(gbmFit1, testing),
                        predict(gbmFit1, testing, type = "prob"))
head(predicted)
  obs pred         a         b
1   b    a 0.5506054 0.4493946
2   b    a 0.5107631 0.4892369
3   a    b 0.4859799 0.5140201
4   b    a 0.5090264 0.4909736
5   b    b 0.4545746 0.5454254
6   a    a 0.6211514 0.3788486
mnLogLoss(predicted, lev = levels(predicted$obs))
logLoss
0.6377392

Identical results from random forest hyperparameter tuning in R

I am trying out hyperparameter tuning of a random forest in R using library(randomForest). Why am I getting identical results for different values of the hyperparameter maxnodes?
I am using the Titanic dataset obtained from Kaggle. I applied grid search only to the hyperparameter 'mtry', and the results gave different values of Accuracy and Kappa for each 'mtry'.
However, when I tried to search for the best 'maxnodes' value, all of them returned the same Accuracy and Kappa.
This is for tuning 'mtry'
library(randomForest)
library(caret)

control <- trainControl(method = "cv",
                        number = 10,
                        search = "grid")

tuneGrid <- expand.grid(.mtry = c(1:10))
rf_mtry <- train(train_X,
                 train_Y,
                 method = "rf",
                 metric = "Accuracy",
                 tuneGrid = tuneGrid,
                 trControl = control,
                 importance = TRUE,
                 nodesize = 14,
                 ntree = 300)
This is for tuning 'maxnodes'
mtry_best <- rf_mtry$bestTune$mtry
store_maxnode <- list()
tuneGrid <- expand.grid(.mtry = mtry_best)

for (maxnodes in c(2:20)) {
  set.seed(1234)
  rf_maxnode <- train(train_X,
                      train_Y,
                      method = "rf",
                      metric = "Accuracy",
                      tuneGrid = tuneGrid,
                      trControl = control,
                      importance = TRUE,
                      nodesize = 5,
                      maxnodes = maxnodes,
                      ntree = 300)
  current_iteration <- toString(maxnodes)
  store_maxnode[[current_iteration]] <- rf_maxnode
}
results_mtry <- resamples(store_maxnode)
summary(results_mtry)
I expected to see different values of Accuracy and Kappa for different maxnodes values, but they were identical.

Tuning XGboost parameters Using Caret - Error: The tuning parameter grid should have columns

I am using caret for modeling with "xgboost".
1- However, I get the following error:
"Error: The tuning parameter grid should have columns nrounds,
max_depth, eta, gamma, colsample_bytree, min_child_weight, subsample"
The code:
library(caret)
library(doParallel)
library(dplyr)
library(pROC)
library(xgboost)

## Create train/test indexes
## preserve class indices
set.seed(42)
my_folds <- createFolds(train_churn$churn, k = 10)

# Compare class distribution
i <- my_folds$Fold1
table(train_churn$churn[i]) / length(i)

my_control <- trainControl(summaryFunction = twoClassSummary,
                           classProbs = TRUE,
                           verboseIter = TRUE,
                           savePredictions = TRUE,
                           index = my_folds)

my_grid <- expand.grid(nrounds = 500,
                       max_depth = 7,
                       eta = 0.1,
                       gammma = 1,
                       colsample_bytree = 1,
                       min_child_weight = 100,
                       subsample = 1)

set.seed(42)
model_xgb <- train(class ~ ., data = train_churn,
                   metric = "ROC",
                   method = "xgbTree",
                   trControl = my_control,
                   tuneGrid = my_grid)
2- I also want to get a prediction made by averaging the predictions of the models fitted to each fold.
I know it's a tad late, but check your spelling of gamma in the grid of tuning parameters. You misspelled it as gammma (with a triple m).
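With the spelling fixed, the grid has exactly the columns xgbTree expects (the values themselves are just the ones from the question):
# Corrected tuning grid: "gamma", not "gammma"
my_grid <- expand.grid(nrounds = 500,
                       max_depth = 7,
                       eta = 0.1,
                       gamma = 1,
                       colsample_bytree = 1,
                       min_child_weight = 100,
                       subsample = 1)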
