Error with caret and summaryFunction mnLogLoss: columns consistent with 'lev'

I'm trying to use log loss as the loss function for training with caret, using the data from Kaggle's Kobe Bryant shot selection competition.
This is my script:
library(caret)
data <- read.csv("./data.csv")
data$shot_made_flag <- factor(data$shot_made_flag)
data$team_id <- NULL
data$team_name <- NULL
train_data_kaggle <- data[!is.na(data$shot_made_flag),]
test_data_kaggle <- data[is.na(data$shot_made_flag),]
inTrain <- createDataPartition(y=train_data_kaggle$shot_made_flag,p=.8,list=FALSE)
train <- train_data_kaggle[inTrain,]
test <- train_data_kaggle[-inTrain,]
folds <- createFolds(train$shot_made_flag, k = 10)
ctrl <- trainControl(method = "repeatedcv", index = folds, repeats = 3, summaryFunction = mnLogLoss)
res <- train(shot_made_flag~., data = train, method = "gbm", preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss", verbose = FALSE)
And this is the traceback of the error:
7: stop("'data' should have columns consistent with 'lev'")
6: ctrl$summaryFunction(testOutput, lev, method)
5: evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels,
       metric = metric, method = method)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(shot_made_flag ~ ., data = train, method = "gbm",
       preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss",
       verbose = FALSE)
1: train(shot_made_flag ~ ., data = train, method = "gbm", preProc = c("zv",
       "center", "scale"), trControl = ctrl, metric = "logLoss",
       verbose = FALSE)
When I use defaultSummary as the summaryFunction and specify no metric in train, it works, but it does not work with mnLogLoss. I am guessing it expects the data in a different format from what I am passing, but I can't find where the error is.

From the help file for defaultSummary:
To use twoClassSummary and/or mnLogLoss, the classProbs argument of trainControl should be TRUE. multiClassSummary can be used without class probabilities but some statistics (e.g. overall log loss and the average of per-class area under the ROC curves) will not be in the result set.
Therefore, I think you need to change your trainControl() to the following:
ctrl <- trainControl(method = "repeatedcv", index = folds, repeats = 3, summaryFunction = mnLogLoss, classProbs = TRUE)
If you do this and run your code you will get the following error:
Error: At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1 . Please use factor levels that can be used as valid R variable names (see ?make.names for help).
You just need to change the 0/1 levels of shot_made_flag to something that can be a valid R variable name:
data$shot_made_flag <- ifelse(data$shot_made_flag == 0, "miss", "made")
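As a quick way to see why the 0/1 levels are rejected, the sketch below compares the levels with what make.names() would turn them into, using base R only. The vector flag is a small made-up stand-in for data$shot_made_flag, not the Kaggle data.

```r
# Hypothetical stand-in for data$shot_made_flag (NA marks the rows to predict)
flag <- factor(c(0, 1, 1, NA, 0))

# A level is a valid R variable name if make.names() leaves it unchanged
old_levels <- levels(flag)
old_valid <- make.names(old_levels) == old_levels   # FALSE FALSE
converted <- make.names(old_levels)                 # "X0" "X1"

# Relabel as in the answer; NA values stay NA
relabelled <- factor(ifelse(flag == 0, "miss", "made"))
new_valid <- make.names(levels(relabelled)) == levels(relabelled)  # TRUE TRUE
```

Running the same comparison on your own factor before calling train() should show whether the levels need renaming.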
With the above changes your code will look like this:
library(caret)
data <- read.csv("./data.csv")
data$shot_made_flag <- ifelse(data$shot_made_flag == 0, "miss", "made")
data$shot_made_flag <- factor(data$shot_made_flag)
data$team_id <- NULL
data$team_name <- NULL
train_data_kaggle <- data[!is.na(data$shot_made_flag),]
test_data_kaggle <- data[is.na(data$shot_made_flag),]
inTrain <- createDataPartition(y=train_data_kaggle$shot_made_flag,p=.8,list=FALSE)
train <- train_data_kaggle[inTrain,]
test <- train_data_kaggle[-inTrain,]
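# returnTrain = TRUE returns the training rows of each fold, which is what trainControl(index = ) expects
folds <- createFolds(train$shot_made_flag, k = 3, returnTrain = TRUE)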
ctrl <- trainControl(method = "repeatedcv", classProbs = TRUE, index = folds, repeats = 3, summaryFunction = mnLogLoss)
res <- train(shot_made_flag~., data = train, method = "gbm", preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss", verbose = FALSE)
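To illustrate where the original "columns consistent with 'lev'" message comes from: without classProbs = TRUE, the data frame handed to the summary function only has obs and pred, so the probability columns named after the class levels are missing. The sketch below mimics that kind of check in base R. check_prob_columns() is a hypothetical helper written for this illustration, not caret's actual source, and the probabilities are made up.

```r
# Hypothetical helper: are there columns named after every class level?
check_prob_columns <- function(data, lev) {
  all(lev %in% colnames(data))
}

lev <- c("miss", "made")

# What a summary function sees without class probabilities
without_probs <- data.frame(
  obs  = factor(c("miss", "made"), levels = lev),
  pred = factor(c("made", "made"), levels = lev)
)

# The same data with one (made-up) probability column per class
with_probs <- cbind(without_probs, miss = c(0.7, 0.2), made = c(0.3, 0.8))

has_cols_without <- check_prob_columns(without_probs, lev)  # FALSE
has_cols_with    <- check_prob_columns(with_probs, lev)     # TRUE
```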

Related

Error: The tuning parameter grid should have columns parameter

I am trying to run a 10-fold cross-validated lasso regression in R, but when I pass tuneGrid, I get this error and I don't know how to fix it. Here is my code:
ctrlspecs<-trainControl(method="cv",number=10, savePredictions="all", classProb=TRUE)
lambdas<-c(seq(0,2,length=3))
foldlasso<-train(y1~x1,data=train_dat, method="glm", mtryGrid=expand.grid(alpha=1,lambda=lambdas),
trControl=ctrlspecs,tuneGrid=expand.grid(.alpha=1,.lambda=lambdas),na.action=na.omit)
Clean your code!!!
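ctrlspecs <-
  trainControl(
    method = "cv",
    number = 10,
    savePredictions = "all",
    classProbs = TRUE  # full argument name; "classProb" only works through partial matching
  )
lambdas <- seq(0, 2, length.out = 3)
foldlasso <-
  train(
    y1 ~ x1,
    data = train_dat,
    method = "glm",
    # mtryGrid is not a train() argument, so it is passed on through ... to the model fit
    mtryGrid = expand.grid(alpha = 1, lambda = lambdas),
    trControl = ctrlspecs,
    na.action = na.omit
  )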
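A possible explanation for the error message itself: caret compares the column names of tuneGrid with the tuning parameters of the chosen method. The "glm" method has no tuning parameters, and caret uses the placeholder name "parameter" for such models, whereas alpha and lambda belong to the "glmnet" method. The sketch below only does that name comparison in base R; the two parameter vectors are written out by hand as assumptions rather than looked up with caret.

```r
lambdas <- seq(0, 2, length.out = 3)
lasso_grid <- expand.grid(alpha = 1, lambda = lambdas)

# Assumed tuning parameter names per caret method (hand-written, not queried from caret)
glm_params    <- "parameter"            # placeholder for a model without tuning parameters
glmnet_params <- c("alpha", "lambda")

matches_glm    <- setequal(names(lasso_grid), glm_params)     # FALSE
matches_glmnet <- setequal(names(lasso_grid), glmnet_params)  # TRUE
```

Under that assumption, a lasso fit would use method = "glmnet" together with tuneGrid = lasso_grid. Note that glmnet needs at least two predictor columns, so a formula like y1 ~ x1 would also have to be extended.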

error with mnLogloss for multinomial classifier using caret/gbm

I am trying to perform a multinomial classifier. It seems to work and I am able to generate a plot with minimized logLoss vs boosting iterations, however I am having trouble extracting the error value. This is the error when I run the mnLogLoss function.
Error in mnLogLoss(predicted, lev = predicted$label) :
'data' should have columns consistent with 'lev'
The data has been partitioned into:
- training
- testing
In both, the column "label" contains the ground truth.
library(MLmetrics)
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, verboseIter = FALSE,
                           savePredictions = TRUE, classProbs = TRUE, summaryFunction = mnLogLoss)
gbmGrid1 <- expand.grid(.interaction.depth = (1:3), .n.trees = (1:10)*20, .shrinkage = 0.01, .n.minobsinnode = 3)
system.time(
  gbmFit1 <- train(label ~ ., data = training, method = "gbm", trControl = fitControl,
                   verbose = 1, metric = "logLoss", tuneGrid = gbmGrid1)
)
gbmPredictions <- predict(gbmFit1, testing)
predicted <- cbind(gbmPredictions, testing)
mnLogLoss(predicted, lev = levels(predicted$label))
For mnLogLoss, it says in the vignette:
data: a data frame with columns ‘obs’ and ‘pred’ for the observed
and predicted outcomes. For metrics that rely on class
probabilities, such as ‘twoClassSummary’, columns should also
include predicted probabilities for each class. See the
‘classProbs’ argument to ‘trainControl’.
So it is not asking for the training or testing data. The data argument must hold the observed classes, the predicted classes and the class probabilities. Since no data was provided, I use some simulated data:
library(caret)
df = data.frame(label = factor(sample(c("a", "b"), 100, replace = TRUE)),
                matrix(runif(500), ncol = 50))
training = df[1:50,]
testing = df[1:50,]
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, verboseIter = FALSE,
                           savePredictions = TRUE, classProbs = TRUE, summaryFunction = mnLogLoss)
gbmGrid1 <- expand.grid(.interaction.depth = (1:3), .n.trees = (1:10)*20, .shrinkage = 0.01, .n.minobsinnode = 3)
gbmFit1 <- train(label ~ ., data = training, method = "gbm", trControl = fitControl,
                 verbose = 1, metric = "logLoss", tuneGrid = gbmGrid1)
Then we put together obs, pred and the class probabilities; the last two columns are the probabilities of each class:
predicted <- data.frame(obs = testing$label,
                        pred = predict(gbmFit1, testing),
                        predict(gbmFit1, testing, type = "prob"))
head(predicted)
  obs pred         a         b
1   b    a 0.5506054 0.4493946
2   b    a 0.5107631 0.4892369
3   a    b 0.4859799 0.5140201
4   b    a 0.5090264 0.4909736
5   b    b 0.4545746 0.5454254
6   a    a 0.6211514 0.3788486
mnLogLoss(predicted, lev = levels(predicted$obs))
logLoss
0.6377392
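To make the role of the probability columns concrete, here is a hand-rolled sketch of multinomial log loss: the mean of the negative log of the probability assigned to each row's observed class. manual_log_loss() is a hypothetical helper for illustration, and caret's own implementation may differ in details such as how probabilities are clipped. The toy data reuses the first three rows printed above.

```r
# Hypothetical helper: log loss from a data frame with 'obs' plus one probability column per level
manual_log_loss <- function(data, lev, eps = 1e-15) {
  stopifnot("obs" %in% colnames(data), all(lev %in% colnames(data)))
  probs <- as.matrix(data[, lev, drop = FALSE])
  # keep probabilities away from 0 and 1 so that log() stays finite
  probs <- pmin(pmax(probs, eps), 1 - eps)
  # for every row, pick the probability of the observed class
  idx <- cbind(seq_len(nrow(data)), match(as.character(data$obs), lev))
  -mean(log(probs[idx]))
}

toy <- data.frame(
  obs  = factor(c("b", "b", "a"), levels = c("a", "b")),
  pred = factor(c("a", "a", "b"), levels = c("a", "b")),
  a    = c(0.5506054, 0.5107631, 0.4859799),
  b    = c(0.4493946, 0.4892369, 0.5140201)
)

toy_loss <- manual_log_loss(toy, lev = levels(toy$obs))
```

The helper never uses pred. Only obs and the columns named after the levels matter, which is why passing the raw testing data with a single predictions column fails.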

Error in SVM model: Pre-processing methods are limited to: BoxCox, YeoJohnson in R

I am trying to run an SVM model, but I get the error:
Error: pre-processing methods are limited to: BoxCox, YeoJohnson, expoTrans, invHyperbolicSine, center, scale, range, knnImpute, bagImpute, medianImpute, pca, ica, spatialSign, ignore, keep, remove, zv, nzv, conditionalX, corr
I don't understand what is going wrong.
svm.model_unigrams = train(outcome ~.
, data = training_set_unigrams
, trControl = training_controls
, method = "svmRadial"
, preProcess = (training_set_unigrams, method = c("center", "scale"))
, na.action = na.pass)
As you have not provided any data, I am using the iris data. Note that the preProcess argument of train() (abbreviated to preProc here) takes a character vector of method names, not a data set or a call to preProcess():
library(caret)
data(iris)
svm.model_unigrams = train(Species ~ ., data = iris,
                           trControl = trainControl(method = "cv",
                                                    number = 5,
                                                    allowParallel = TRUE),
                           method = "svmRadial",
                           preProc = c("center", "scale"),
                           na.action = na.pass)
Similarly, you can use other pre-processing methods, such as:
train(Species ~ ., data = iris,
      trControl = trainControl(method = "cv",
                               number = 5,
                               allowParallel = TRUE),
      method = "svmRadial",
      preProc = c("BoxCox"),
      na.action = na.pass)
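If it is unclear which entry triggers the message, one option is to compare the requested methods with the list printed in the error before calling train(). The sketch below does that in base R. The allowed vector is copied from the error message above, invalid_methods() is a hypothetical helper, and caret's internal validation may work differently.

```r
# Method names copied from the error message
allowed <- c("BoxCox", "YeoJohnson", "expoTrans", "invHyperbolicSine",
             "center", "scale", "range", "knnImpute", "bagImpute",
             "medianImpute", "pca", "ica", "spatialSign", "ignore",
             "keep", "remove", "zv", "nzv", "conditionalX", "corr")

# Hypothetical helper: which requested methods are not in the allowed list?
invalid_methods <- function(requested) setdiff(requested, allowed)

ok_request  <- invalid_methods(c("center", "scale"))        # character(0)
bad_request <- invalid_methods(c("center", "standardize"))  # "standardize"
```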

upper/lower values of mtry in Simulated annealing algorithm

I got this code from the web. It uses grid search and simulated annealing to tune the parameters of a random forest. My question is where in the code the simulated annealing algorithm gets the starting and ending values of the mtry parameter. Usually we give lower and upper bounds to this type of algorithm, but I could not find any here. The result gives me the MAE and the optimal value of mtry, and I do not see where these are calculated from. I use library(randomForest).
d=readARFF("Results.arff")
index <- createDataPartition(log10(d$Result), p = .70,list = FALSE)
tr <- d[index, ]
ts <- d[-index, ]
index_2 <- createFolds(tr$Result, returnTrain = TRUE, list = TRUE)
ctrl <- trainControl(method = "cv", index = index_2, search="grid")
grid_search <- train(log10(Effort) ~ ., data = tr,
                     method = "rf",
                     ## Will create 48 parameter combinations
                     tuneLength = 8,
                     metric = "MAE",
                     preProc = c("center", "scale", "zv"),
                     trControl = ctrl)
getTrainPerf(grid_search)
obj <- function(param, maximize = FALSE) {
  mod <- train(log10(Effort) ~ ., data = tr,
               method = "rf",
               preProc = c("center", "scale", "zv"),
               metric = "MAE",
               trControl = ctrl,
               tuneGrid = data.frame(mtry = 10^(param[1])))##, sigma = 10^(param[2])))
  if(maximize)
    -getTrainPerf(mod)[, "TrainMAE"] else
    getTrainPerf(mod)[, "TrainMAE"]
}
num_mods <- 10
## Simulated annealing from base R
set.seed(45642)
tic()
san_res <- optim(par = c(0), fn = obj, method = "SANN",
                 control = list(maxit = num_mods))
san_res
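On where the search starts and ends: optim(method = "SANN") takes no lower or upper bounds. According to ?optim, it starts from par and by default draws each new candidate from a Gaussian Markov kernel whose scale is proportional to the current temperature. In the code above, par = c(0) and mtry = 10^param, so the first model is fitted with mtry = 1. Later candidates are not restricted to integers or to the valid range of mtry, and maxit = num_mods caps the number of evaluations. The sketch below shows this behaviour with a hypothetical toy objective (toy_obj) in place of the random forest, recording every point that optim evaluates.

```r
visited <- numeric(0)

# Hypothetical toy objective with its minimum at 1; records each evaluated point
toy_obj <- function(param) {
  visited <<- c(visited, param[1])
  (param[1] - 1)^2
}

set.seed(45642)
toy_res <- optim(par = c(0), fn = toy_obj, method = "SANN",
                 control = list(maxit = 10))

first_point  <- visited[1]     # the starting value supplied through 'par'
search_range <- range(visited) # candidates drift around the start, with no bounds imposed
```

If bounds are needed, one option is to enforce them inside the objective, for example by clamping and rounding the value before it is passed to tuneGrid.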

Illegal column names error yet column names are legal

I am wondering why I get this error. I can only reproduce it if I make the levels within my data frame illegal column names, but then why does it work in the RF implementation?
I am thinking about using ranger, as it seems to run faster.
library(caret)
library(ranger)
library(randomForest)
df <- data.frame(class = c(rep(c('A','B'), 10)), var1 = runif(20, 0,10), var2 = runif(20, 0,20), var3 = c(rep(c(' A','1 B', 'C'), 6), 'D','D'))
df
CTRL <- trainControl(method = "repeatedcv",
                     number = 2,
                     repeats = 1,
                     verboseIter = TRUE,
                     classProbs = TRUE,
                     returnResamp = "final",
                     summaryFunction = twoClassSummary)
ranger_model <- caret::train(class ~ .,
                             df,
                             method = "ranger",
                             trControl = CTRL,
                             preProc = c("center", "scale"),
                             metric = "ROC",
                             tuneGrid = expand.grid(.mtry = c(1,2)))
rf_model <- caret::train(class ~ .,
                         df,
                         method = "rf",
                         trControl = CTRL,
                         preProc = c("center", "scale"),
                         metric = "ROC",
                         tuneGrid = expand.grid(.mtry = c(1,2)))
ranger_model
rf_model
Ranger Output:
+ Fold1.Rep1: mtry=1
model fit failed for Fold1.Rep1: mtry=1 Error in parse.formula(formula, data) :
Error: Illegal column names in formula interface. Fix column names or use alternative interface in ranger.
Also, when I check the ranger source code that produces the error, I do not understand why this condition evaluates to TRUE, because when I run the same code on my data frame, I do not get the same result:
## Error if illegal column name
if (!all(make.names(independent_vars[!interaction_idx]) == independent_vars[!interaction_idx])) {
stop("Error: Illegal column names in formula interface. Fix column names or use alternative interface in ranger.")
}
https://github.com/cran/ranger/blob/master/R/formula.R
When I run it on my df:
formula <- 'class ~ .'
data <- df
f <- as.formula(formula)
t <- terms(f, data = data)
## Get dependent var(s)
response <- data.frame(eval(f[[2]], envir = data))
colnames(response) <- deparse(f[[2]])
## Get independent vars
independent_vars <- attr(t, "term.labels")
interaction_idx <- grepl(":", independent_vars)
## Error if illegal column name
if (!all(make.names(independent_vars[!interaction_idx]) == independent_vars[!interaction_idx])) {
print("Error: Illegal column names in formula interface. Fix column names or use alternative interface in ranger.")
}
> !all(make.names(independent_vars[!interaction_idx]) == independent_vars[!interaction_idx])
## [1] FALSE
Is it because the factor columns are converted into a one-hot encoded matrix that uses the factor levels as column names? Again, I am not sure why it would work in RF and not in ranger.
Thoughts?
This should be fixed in caret 6.0-77. In your example, you'll have to add the splitrule parameter to tuneGrid:
library(caret)
library(ranger)
library(randomForest)
df <- data.frame(class = c(rep(c('A','B'), 10)), var1 = runif(20, 0,10), var2 = runif(20, 0,20), var3 = c(rep(c(' A','1 B', 'C'), 6), 'D','D'))
df
CTRL <- trainControl(method = "repeatedcv",
                     number = 2,
                     repeats = 1,
                     verboseIter = TRUE,
                     classProbs = TRUE,
                     returnResamp = "final",
                     summaryFunction = twoClassSummary)
ranger_model <- caret::train(class ~ .,
                             df,
                             method = "ranger",
                             trControl = CTRL,
                             preProc = c("center", "scale"),
                             metric = "ROC",
                             tuneGrid = expand.grid(.mtry = c(1,2), .splitrule = "gini"))
rf_model <- caret::train(class ~ .,
                         df,
                         method = "rf",
                         trControl = CTRL,
                         preProc = c("center", "scale"),
                         metric = "ROC",
                         tuneGrid = expand.grid(.mtry = c(1,2)))
ranger_model
rf_model
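On the one-hot encoding guess in the question: it looks plausible. The sketch below assumes that the formula interface expands factors into dummy columns in the style of model.matrix() before the model is fitted, and checks the resulting column names with the same make.names() comparison that ranger uses. It is base R only and does not call caret or ranger, so it illustrates the mechanism without confirming what either package does internally.

```r
set.seed(1)
df <- data.frame(class = rep(c("A", "B"), 10),
                 var1 = runif(20, 0, 10),
                 var2 = runif(20, 0, 20),
                 var3 = c(rep(c(" A", "1 B", "C"), 6), "D", "D"),
                 stringsAsFactors = TRUE)

# The original column names pass the check, as observed in the question
original_ok <- all(make.names(names(df)) == names(df))

# Dummy-variable expansion pastes the factor levels onto the column name
design <- model.matrix(class ~ ., data = df)
predictor_names <- setdiff(colnames(design), "(Intercept)")

# Expanded names that make.names() would alter, i.e. "illegal" column names
illegal <- predictor_names[make.names(predictor_names) != predictor_names]
```

Levels such as ' A' and '1 B' contain spaces, so at least one expanded column name fails the check even though var3 itself is a legal name.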
