I am trying to fit a multinomial classifier. It seems to work, and I can generate a plot of minimized logLoss vs. boosting iterations, but I am having trouble extracting the error value. This is the error I get when I run the mnLogLoss function:
Error in mnLogLoss(predicted, lev = predicted$label) :
'data' should have columns consistent with 'lev'
The data has been partitioned into:
- training
- testing
In both, the column "label" contains the ground truth.
library(MLmetrics)
fitControl <- trainControl(method = "repeatedcv", number=10, repeats=3, verboseIter = FALSE,
savePredictions = TRUE, classProbs = TRUE, summaryFunction= mnLogLoss)
gbmGrid1 <- expand.grid(.interaction.depth = (1:3), .n.trees = (1:10)*20, .shrinkage = 0.01, .n.minobsinnode = 3)
system.time(
gbmFit1 <- train(label~., data = training, method = "gbm", trControl=fitControl,
verbose = 1, metric = "logLoss", tuneGrid = gbmGrid1)
)
gbmPredictions <- predict(gbmFit1, testing)
predicted <- cbind(gbmPredictions, testing)
mnLogLoss(predicted, lev = levels(predicted$label))
For mnLogLoss, the help file says:
data: a data frame with columns ‘obs’ and ‘pred’ for the observed
and predicted outcomes. For metrics that rely on class
probabilities, such as ‘twoClassSummary’, columns should also
include predicted probabilities for each class. See the
‘classProbs’ argument to ‘trainControl’.
So it's not asking for the training data. The data parameter here is just an input, so I use some simulated data:
library(caret)
# simulate a two-class outcome with 50 noise predictors (100 rows)
df = data.frame(label = factor(sample(c("a","b"), 100, replace = TRUE)),
                matrix(runif(100*50), ncol = 50))
training = df[1:50,]
testing  = df[51:100,]
fitControl <- trainControl(method = "repeatedcv", number=10, repeats=3, verboseIter = FALSE,
savePredictions = TRUE, classProbs = TRUE, summaryFunction= mnLogLoss)
gbmGrid1 <- expand.grid(.interaction.depth = (1:3), .n.trees = (1:10)*20, .shrinkage = 0.01, .n.minobsinnode = 3)
gbmFit1 <- train(label~., data = training, method = "gbm", trControl = fitControl,
                 verbose = 1, metric = "logLoss", tuneGrid = gbmGrid1)
Then we put together obs, pred, and the class probabilities as the last two columns:
predicted <- data.frame(obs=testing$label,
pred=predict(gbmFit1, testing),
predict(gbmFit1, testing,type="prob"))
head(predicted)
obs pred a b
1 b a 0.5506054 0.4493946
2 b a 0.5107631 0.4892369
3 a b 0.4859799 0.5140201
4 b a 0.5090264 0.4909736
5 b b 0.4545746 0.5454254
6 a a 0.6211514 0.3788486
mnLogLoss(predicted, lev = levels(predicted$obs))
logLoss
0.6377392
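As a side note, the cross-validated log loss is also stored in the fitted train object, so you can look at it without calling mnLogLoss() yourself; a quick sketch using standard caret accessors:
# resampled (cross-validated) log loss at the chosen tuning parameters
getTrainPerf(gbmFit1)
# full log-loss profile over the tuning grid
gbmFit1$results[, c("interaction.depth", "n.trees", "shrinkage", "logLoss")]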
Two questions: visualizing the error of a model, and calculating the log loss.
(1) I'm trying to tune a multinomial GBM classifier, but I'm not sure how to interpret the output. I understand that logLoss is meant to be minimized, but in the plot below, for any range of iterations or trees, it only appears to increase.
inTraining <- createDataPartition(final_data$label, p = 0.80, list = FALSE)
training <- final_data[inTraining,]
testing <- final_data[-inTraining,]
fitControl <- trainControl(method = "repeatedcv", number=10, repeats=3, verboseIter = FALSE, savePredictions = TRUE, classProbs = TRUE, summaryFunction= mnLogLoss)
gbmGrid1 <- expand.grid(.interaction.depth = (1:5)*2, .n.trees = (1:10)*25, .shrinkage = 0.1, .n.minobsinnode = 10)
gbmFit1 <- train(label~., data = training, method = "gbm", trControl=fitControl,
verbose = 1, metric = "ROC", tuneGrid = gbmGrid1)
plot(gbmFit1)
--
(2) On a related note, when I try to call mnLogLoss directly I get this error, which keeps me from quantifying the error:
mnLogLoss(testing, levels(testing$label)) : 'lev' cannot be NULL
I suspect you set the learning rate too high. So using an example dataset:
final_data = iris
final_data$label=final_data$Species
final_data$Species=NULL
inTraining <- createDataPartition(final_data$label, p = 0.80, list = FALSE)
training <- final_data[inTraining,]
testing <- final_data[-inTraining,]
fitControl <- trainControl(method = "repeatedcv", number=10, repeats=3,
verboseIter = FALSE, savePredictions = TRUE, classProbs = TRUE, summaryFunction= mnLogLoss)
gbmGrid1 <- expand.grid(.interaction.depth = 1:3, .n.trees = (1:10)*10, .shrinkage = 0.1, .n.minobsinnode = 10)
gbmFit1 <- train(label~., data = training, method = "gbm", trControl=fitControl,
verbose = 1, tuneGrid = gbmGrid1,metric="logLoss")
plot(gbmFit1)
A bit different from yours, but you can see the upward trend after about 20 trees. It really depends on your data, but with a high learning rate you arrive very quickly at a minimum, and anything after that just adds noise. See the illustration in Boehmke's book, and also check out a more statistics-based discussion.
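If you want to see this numerically rather than reading it off the plot, the per-grid-point resampled log loss is stored in the train object (standard caret result columns):
# where the resampled log loss bottoms out
res <- gbmFit1$results
res[which.min(res$logLoss), c("interaction.depth", "n.trees", "shrinkage", "logLoss")]
# loss profile for the best interaction depth only
subset(res, interaction.depth == gbmFit1$bestTune$interaction.depth,
       select = c(n.trees, logLoss))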
Let's lower the learning rate and you can see:
gbmGrid1 <- expand.grid(.interaction.depth = 1:3, .n.trees = (1:10)*10, .shrinkage = 0.01, .n.minobsinnode = 10)
gbmFit1 <- train(label~., data = training, method = "gbm", trControl=fitControl,
verbose = 1, tuneGrid = gbmGrid1,metric="logLoss")
plot(gbmFit1)
Note that you will most likely need more iterations to reach a loss as low as in the first fit.
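A minimal sketch of extending the grid at the lower learning rate (the grid values here are arbitrary, just more trees than before):
gbmGrid2 <- expand.grid(.interaction.depth = 1:3, .n.trees = (1:20)*50,
                        .shrinkage = 0.01, .n.minobsinnode = 10)
gbmFit2 <- train(label~., data = training, method = "gbm", trControl = fitControl,
                 verbose = 1, tuneGrid = gbmGrid2, metric = "logLoss")
plot(gbmFit2)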
I'm trying to use log loss as the loss function for training with caret, using the data from the Kobe Bryant shot selection competition on Kaggle.
This is my script:
library(caret)
data <- read.csv("./data.csv")
data$shot_made_flag <- factor(data$shot_made_flag)
data$team_id <- NULL
data$team_name <- NULL
train_data_kaggle <- data[!is.na(data$shot_made_flag),]
test_data_kaggle <- data[is.na(data$shot_made_flag),]
inTrain <- createDataPartition(y=train_data_kaggle$shot_made_flag,p=.8,list=FALSE)
train <- train_data_kaggle[inTrain,]
test <- train_data_kaggle[-inTrain,]
folds <- createFolds(train$shot_made_flag, k = 10)
ctrl <- trainControl(method = "repeatedcv", index = folds, repeats = 3, summaryFunction = mnLogLoss)
res <- train(shot_made_flag~., data = train, method = "gbm", preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss", verbose = FALSE)
And this is the traceback of the error:
7: stop("'data' should have columns consistent with 'lev'")
6: ctrl$summaryFunction(testOutput, lev, method)
5: evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels,
metric = metric, method = method)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(shot_made_flag ~ ., data = train, method = "gbm",
preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss",
verbose = FALSE)
1: train(shot_made_flag ~ ., data = train, method = "gbm", preProc = c("zv",
"center", "scale"), trControl = ctrl, metric = "logLoss",
verbose = FALSE)
When I use defaultSummary as the summaryFunction and don't specify a metric in train, it works, but it doesn't with mnLogLoss. I'm guessing it expects the data in a different format than what I am passing, but I can't find where the error is.
From the help file for defaultSummary:
To use twoClassSummary and/or mnLogLoss, the classProbs argument of trainControl should be TRUE. multiClassSummary can be used without class probabilities but some statistics (e.g. overall log loss and the average of per-class area under the ROC curves) will not be in the result set.
Therefore, I think you need to change your trainControl() to the following:
ctrl <- trainControl(method = "repeatedcv", index = folds, repeats = 3, summaryFunction = mnLogLoss, classProbs = TRUE)
If you do this and run your code you will get the following error:
Error: At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1 . Please use factor levels that can be used as valid R variable names (see ?make.names for help).
You just need to change the 0/1 levels of shot_made_flag to something that can be a valid R variable name:
data$shot_made_flag <- ifelse(data$shot_made_flag == 0, "miss", "made")
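Equivalently, you can do the recode and the factor conversion in one step; this is just a variant of the same fix:
data$shot_made_flag <- factor(data$shot_made_flag, levels = c(0, 1),
                              labels = c("miss", "made"))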
With the above changes your code will look like this:
library(caret)
data <- read.csv("./data.csv")
data$shot_made_flag <- ifelse(data$shot_made_flag == 0, "miss", "made")
data$shot_made_flag <- factor(data$shot_made_flag)
data$team_id <- NULL
data$team_name <- NULL
train_data_kaggle <- data[!is.na(data$shot_made_flag),]
test_data_kaggle <- data[is.na(data$shot_made_flag),]
inTrain <- createDataPartition(y=train_data_kaggle$shot_made_flag,p=.8,list=FALSE)
train <- train_data_kaggle[inTrain,]
test <- train_data_kaggle[-inTrain,]
folds <- createFolds(train$shot_made_flag, k = 3)
ctrl <- trainControl(method = "repeatedcv", classProbs = TRUE, index = folds, repeats = 3, summaryFunction = mnLogLoss)
res <- train(shot_made_flag~., data = train, method = "gbm", preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss", verbose = FALSE)
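Once that trains, you can compute the held-out log loss the same way as in the first example above; a sketch, assuming the miss/made recode:
test_pred <- data.frame(obs = test$shot_made_flag,
                        pred = predict(res, test),
                        predict(res, test, type = "prob"))
mnLogLoss(test_pred, lev = levels(test_pred$obs))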
I am trying to fit a glm model using caret in R with healthcare data from the CDC. However, whenever I try to train the model using caret's train() command, I keep getting the following error:
Error in `[.default`(y, , "time") : incorrect number of dimensions
Below is my code:
#download data
download.file(url = "ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/dataset_documentation/nhamcs/stata/ed2014-stata.zip",destfile = "ed2014-stata.zip")
unzip("ed2014-stata.zip")
library(haven)
nhamcs2014 <- read_dta("ed2014-stata.dta")
dim(nhamcs2014)
#isolate variables of interest
keep2014<- c("SEX","IMMEDR","SEEN72","CANCER","ETOHAB","ALZHD","ASTHMA","CEBVD","CKD","COPD","CHF","CAD","DEPRN",
"DIABTYP1","DIABTYP2","DIABTYP0","ESRD","HPE","EDHIV","HYPLIPID","HTN","OBESITY","OSA","OSTPRSIS",
"SUBSTAB")
new.nhamcs2014 <- nhamcs2014[keep2014]
#remove missing data
e=new.nhamcs2014$IMMEDR==-9
e.clean.nhamcs2014<- new.nhamcs2014[!e,]
f=e.clean.nhamcs2014$IMMEDR==-8
f.clean.nhamcs2014<- e.clean.nhamcs2014[!f,]
g=f.clean.nhamcs2014$SEEN72==-9
g.clean.nhamcs2014 <- f.clean.nhamcs2014[!g,]
h=g.clean.nhamcs2014$SEEN72==-8
h.clean.nhamcs2014 <- g.clean.nhamcs2014[!h,]
i <- h.clean.nhamcs2014$IMMEDR==7
i.clean.nhamcs2014 <- h.clean.nhamcs2014[!i,]
#Convert response variable (IMMEDR) to binomial variable
i.clean.nhamcs2014$IMMEDR[i.clean.nhamcs2014$IMMEDR==3] <- 0
i.clean.nhamcs2014$IMMEDR[i.clean.nhamcs2014$IMMEDR==2] <- 0
i.clean.nhamcs2014$IMMEDR[i.clean.nhamcs2014$IMMEDR==1] <- 0
i.clean.nhamcs2014$IMMEDR[i.clean.nhamcs2014$IMMEDR==5] <- 1
i.clean.nhamcs2014$IMMEDR[i.clean.nhamcs2014$IMMEDR==4] <- 1
#clean data
i.clean.nhamcs2014$SEX[i.clean.nhamcs2014$SEX==1] <- 0
i.clean.nhamcs2014$SEX[i.clean.nhamcs2014$SEX==2] <- 1
i.clean.nhamcs2014$SEEN72[i.clean.nhamcs2014$SEEN72==1] <- 0
i.clean.nhamcs2014$SEEN72[i.clean.nhamcs2014$SEEN72==2] <- 1
View(i.clean.nhamcs2014)
sum(is.na(i.clean.nhamcs2014))
#create glm model using caret
library(caret)
set.seed(1)
inTrain<-createDataPartition(i.clean.nhamcs2014$IMMEDR, p=.75, list = FALSE)
train.nhamcs2014 <- i.clean.nhamcs2014[inTrain,]
test.nhamcs2014 <- i.clean.nhamcs2014[-inTrain,]
control <- trainControl(method = "cv", number = 5, summaryFunction = twoClassSummary,
classProbs = TRUE, verboseIter = TRUE, returnResamp = "final")
model.glm <- train(IMMEDR~.,method = "glm", family = binomial(), metric = "ROC",
maximize = TRUE, data = train.nhamcs2014, trControl = control)
Error in `[.default`(y, , "time") : incorrect number of dimensions
Any input would be greatly appreciated!
The problem is the outcome column: it comes in as haven's labelled double format, which caret cannot handle. When you convert it to a factor just before training, it runs without issue.
Run after sum(is.na(i.clean.nhamcs2014)):
i.clean.nhamcs2014$IMMEDR <- as.character(i.clean.nhamcs2014$IMMEDR)
i.clean.nhamcs2014$IMMEDR[i.clean.nhamcs2014$IMMEDR == "0"] <- "zero"
i.clean.nhamcs2014$IMMEDR[i.clean.nhamcs2014$IMMEDR == "1"] <- "one"
i.clean.nhamcs2014$IMMEDR <- factor(i.clean.nhamcs2014$IMMEDR, levels = c("zero", "one"))
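For context, read_dta() returns haven's labelled-double columns, which is what caret trips over. If you prefer to strip the label attributes from the whole data frame, haven has a helper for that (optional; the factor conversion above is the part that matters):
i.clean.nhamcs2014 <- haven::zap_labels(i.clean.nhamcs2014)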
and then
set.seed(1)
inTrain<-createDataPartition(i.clean.nhamcs2014$IMMEDR, p=.75, list = FALSE)
train.nhamcs2014 <- i.clean.nhamcs2014[inTrain,]
test.nhamcs2014 <- i.clean.nhamcs2014[-inTrain,]
control <- trainControl(method = "cv", number = 5, summaryFunction = twoClassSummary,
classProbs = TRUE, verboseIter = TRUE, returnResamp = "final")
model.glm <- train(IMMEDR~.,method = "glm", family = binomial(), metric = "ROC",
maximize = TRUE, data = train.nhamcs2014, trControl = control)
> model.glm
Generalized Linear Model
12194 samples
24 predictor
2 classes: 'zero', 'one'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 9756, 9755, 9755, 9755, 9755
Resampling results:
ROC Sens Spec
0.632222 0.8814675 0.1774027
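You can then check the held-out set in the usual way; a quick sketch using caret's confusionMatrix:
test_pred <- predict(model.glm, newdata = test.nhamcs2014)
confusionMatrix(test_pred, test.nhamcs2014$IMMEDR, positive = "one")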
I am trying to combine signals from different models using the example described here. I have different datasets that predict the same output. However, when I combine the model outputs in a caretList and ensemble the signals, I get an error:
Error in check_bestpreds_resamples(modelLibrary) :
Component models do not have the same re-sampling strategies
Here is the reproducible example
library(caret)
library(caretEnsemble)
df1 <-
data.frame(x1 = rnorm(200),
x2 = rnorm(200),
y = as.factor(sample(c("Jack", "Jill"), 200, replace = T)))
df2 <-
data.frame(z1 = rnorm(400),
z2 = rnorm(400),
y = as.factor(sample(c("Jack", "Jill"), 400, replace = T)))
library(caret)
check_1 <- train( x = df1[,1:2],y = df1[,3],
method = "nnet",
tuneLength = 10,
trControl = trainControl(method = "cv",
classProbs = TRUE,
savePredictions = T))
check_2 <- train( x = df2[,1:2],y = df2[,3] ,
method = "nnet",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv",
classProbs = TRUE,
savePredictions = T))
combine <- c(check_1, check_2)
ens <- caretEnsemble(combine)
First of all, you are trying to combine two models trained on different training data sets. That is not going to work: all models in the ensemble need to be based on the same training set. Otherwise you will have different sets of resamples in each trained model, which is exactly what your current error says.
Also, building your models without using caretList is risky, because you have a big chance of ending up with different resampling strategies. You can control that by setting the index argument in trainControl (see the vignette).
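For completeness, a minimal sketch of fixing the indexes yourself so two separately trained models end up with identical resamples (the object names and fold count here are arbitrary):
set.seed(1324)
shared_index <- createFolds(df1$y, k = 5, returnTrain = TRUE)
ctrl_shared <- trainControl(method = "cv", number = 5, index = shared_index,
                            classProbs = TRUE, savePredictions = "final")
# any train() call with trControl = ctrl_shared now uses the same folds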
If you use 1 dataset you can use the following code:
ctrl <- trainControl(method = "cv",
number = 5,
classProbs = TRUE,
savePredictions = "final")
set.seed(1324)
# will generate the following warning:
# indexes not defined in trControl. Attempting to set them ourselves, so
# each model in the ensemble will have the same resampling indexes.
models <- caretList(x = df1[,1:2],
y = df1[,3] ,
trControl = ctrl,
tuneList = list(
check_1 = caretModelSpec(method = "nnet", tuneLength = 10),
check_2 = caretModelSpec(method = "nnet", tuneLength = 10, preProcess = c("center", "scale"))
))
ens <- caretEnsemble(models)
A glm ensemble of 2 base models: nnet, nnet
Ensemble results:
Generalized Linear Model
200 samples
2 predictor
2 classes: 'Jack', 'Jill'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results:
Accuracy Kappa
0.5249231 0.04164767
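Predictions from the ensemble then come from predict() as usual (a sketch; depending on your caretEnsemble version the default output is class labels or probabilities):
head(predict(ens, newdata = df1[, 1:2]))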
Also read this guide on different ensemble strategies.
I am using caret for modeling with xgboost.
1- However, I get the following error:
"Error: The tuning parameter grid should have columns nrounds,
max_depth, eta, gamma, colsample_bytree, min_child_weight, subsample"
The code:
library(caret)
library(doParallel)
library(dplyr)
library(pROC)
library(xgboost)
## Create train/test indexes
## preserve class indices
set.seed(42)
my_folds <- createFolds(train_churn$churn, k = 10)
# Compare class distribution
i <- my_folds$Fold1
table(train_churn$churn[i]) / length(i)
my_control <- trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = TRUE,
savePredictions = TRUE,
index = my_folds
)
my_grid <- expand.grid(nrounds = 500,
max_depth = 7,
eta = 0.1,
gammma = 1,
colsample_bytree = 1,
min_child_weight = 100,
subsample = 1)
set.seed(42)
model_xgb <- train(
class ~ ., data = train_churn,
metric = "ROC",
method = "xgbTree",
trControl = my_control,
tuneGrid = my_grid)
2- I also want to get predictions by averaging the predictions from the model fitted to each fold.
I know it's a tad late, but check your spelling of gamma in the grid of tuning parameters. You misspelled it as gammma (with a triple m).
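With the spelling fixed, the grid has exactly the columns xgbTree expects:
my_grid <- expand.grid(nrounds = 500,
                       max_depth = 7,
                       eta = 0.1,
                       gamma = 1,               # was "gammma"
                       colsample_bytree = 1,
                       min_child_weight = 100,
                       subsample = 1)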