Illegal column names error yet column names are legal - r

I'm wondering why I get this error. I can only reproduce it when the factor levels in my data frame are not legal column names, but why does the same data work with the randomForest implementation?
I'm thinking about switching to ranger because it seems to run faster.
library(caret)
library(ranger)
library(randomForest)
df <- data.frame(class = c(rep(c('A','B'), 10)), var1 = runif(20, 0,10), var2 = runif(20, 0,20), var3 = c(rep(c(' A','1 B', 'C'), 6), 'D','D'))
df
CTRL <- trainControl(method = "repeatedcv",
number = 2,
repeats = 1,
verboseIter = TRUE,
classProbs = TRUE,
returnResamp = "final",
summaryFunction = twoClassSummary)
ranger_model <- caret::train(class ~ .,
df,
method = "ranger",
trControl = CTRL,
preProc = c("center", "scale"),
metric="ROC",
tuneGrid = expand.grid(.mtry=c(1,2)))
rf_model <- caret::train(class ~ .,
df,
method = "rf",
trControl = CTRL,
preProc = c("center", "scale"),
metric="ROC",
tuneGrid = expand.grid(.mtry=c(1,2)))
ranger_model
rf_model
Ranger Output:
+ Fold1.Rep1: mtry=1
model fit failed for Fold1.Rep1: mtry=1 Error in parse.formula(formula, data) :
Error: Illegal column names in formula interface. Fix column names or use alternative interface in ranger.
Also, when I look at the ranger source that raises the error, I don't understand why the condition evaluates to TRUE, because running the same check on my data frame does not give that result:
## Error if illegal column name
if (!all(make.names(independent_vars[!interaction_idx]) == independent_vars[!interaction_idx])) {
stop("Error: Illegal column names in formula interface. Fix column names or use alternative interface in ranger.")
}
https://github.com/cran/ranger/blob/master/R/formula.R
When I run it on my df:
formula <- 'class ~ .'
data <- df
f <- as.formula(formula)
t <- terms(f, data = data)
## Get dependent var(s)
response <- data.frame(eval(f[[2]], envir = data))
colnames(response) <- deparse(f[[2]])
## Get independent vars
independent_vars <- attr(t, "term.labels")
interaction_idx <- grepl(":", independent_vars)
## Error if illegal column name
if (!all(make.names(independent_vars[!interaction_idx]) == independent_vars[!interaction_idx])) {
print("Error: Illegal column names in formula interface. Fix column names or use alternative interface in ranger.")
}
> !all(make.names(independent_vars[!interaction_idx]) == independent_vars[!interaction_idx])
## [1] FALSE
Is it because the factor columns are made into a 1-hot encoded matrix that uses the factor level as the column name? Again, not sure why it would work in RF and not ranger.
Thoughts?
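One way to test the guess about dummy columns is to build the design matrix by hand. The sketch below uses base R's model.matrix() as a stand-in for what a formula interface does with a factor predictor. Whether caret passes exactly these names on to ranger is an assumption, and `toy` is a made-up data frame with the same awkward levels.

```r
# Toy data with the same awkward factor levels as var3 in the question.
toy <- data.frame(
  var1 = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5),
  var3 = factor(c(" A", "1 B", "C", " A", "1 B", "C"))
)

# The original column names pass the make.names() check ...
raw_names <- colnames(toy)
raw_ok <- all(make.names(raw_names) == raw_names)

# ... but expanding the factor into dummy columns pastes each level onto the
# variable name, so a level containing a space yields an illegal name.
dummy_names <- setdiff(
  colnames(model.matrix(~ var1 + var3, data = toy)),
  "(Intercept)"
)
dummy_ok <- all(make.names(dummy_names) == dummy_names)

print(dummy_names)
print(c(raw = raw_ok, dummies = dummy_ok))
```

If the dummy names fail the check while the raw names pass, that would be consistent with the error appearing only inside train() and not when the check is run on the data frame directly.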

This should be fixed in caret 6.0-77. In your example, you'll have to add the splitrule parameter to tuneGrid (later caret releases may also expect a min.node.size column in the grid):
library(caret)
library(ranger)
library(randomForest)
df <- data.frame(class = c(rep(c('A','B'), 10)), var1 = runif(20, 0,10), var2 = runif(20, 0,20), var3 = c(rep(c(' A','1 B', 'C'), 6), 'D','D'))
df
CTRL <- trainControl(method = "repeatedcv",
number = 2,
repeats = 1,
verboseIter = TRUE,
classProbs = TRUE,
returnResamp = "final",
summaryFunction = twoClassSummary)
ranger_model <- caret::train(class ~ .,
df,
method = "ranger",
trControl = CTRL,
preProc = c("center", "scale"),
metric="ROC",
tuneGrid = expand.grid(.mtry=c(1,2), .splitrule="gini"))
rf_model <- caret::train(class ~ .,
df,
method = "rf",
trControl = CTRL,
preProc = c("center", "scale"),
metric="ROC",
tuneGrid = expand.grid(.mtry=c(1,2)))
ranger_model
rf_model

Related

How to adjust ggplot axis in R package "caret" for resample class?

I have two models trained with the R package caret, and I'd like to compare their performance. The "resamples" class works with ggplot; however, an error occurs when I try to adjust the x-axis: Error: Discrete value supplied to continuous scale. Thanks for any help.
library(caret)
data("mtcars")
mydata = mtcars[, -c(8,9)]
set.seed(100)
model_rf <- train(
hp ~ .,
data = mydata,
tuneLength = 5,
method = "ranger",
metric = "RMSE",
preProcess = c('center', 'scale'),
trControl = trainControl(
method = "repeatedcv",
number = 5,
repeats = 5,
verboseIter = TRUE,
savePredictions = "final"
)
)
model_rp <- train(
hp ~ .,
data = mydata,
method = "rpart",
metric = "RMSE",
preProcess = c('center', 'scale'),
trControl = trainControl(
method = "repeatedcv",
number = 5,
repeats = 5,
verboseIter = TRUE,
savePredictions = "final"
)
)
Resamples <- resamples(list("RF" = model_rf, "RP" = model_rp))
ggplot(Resamples, metric = "RMSE")
ggplot(Resamples, metric = "RMSE") + scale_x_continuous(limits = c(0,60), breaks = seq(0,60,10))
## Error: Discrete value supplied to continuous scale
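The plot appears to put the model names on the discrete x aesthetic and the metric on y, with the axes flipped for display, so the axis drawn horizontally is really the y scale. If you change scale_x_continuous to scale_y_continuous, the error goes away: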
ggplot(Resamples, metric = "RMSE") +
scale_y_continuous(limits = c(0,60), breaks = seq(0,60,10))
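Why the y scale is the right one can be seen on a hand-built plot. This sketch assumes the resamples plot is laid out with the model names on x, the metric on y, and the axes swapped with coord_flip(). That layout is an assumption about caret's internals, and the numbers in `est` are invented.

```r
library(ggplot2)

# Invented estimates standing in for the resampled RMSE of two models.
est <- data.frame(
  Model    = c("RF", "RP"),
  Estimate = c(25, 40),
  Lower    = c(20, 33),
  Upper    = c(30, 47)
)

p <- ggplot(est, aes(x = Model, y = Estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = Lower, ymax = Upper), width = 0.1) +
  coord_flip()   # the metric is drawn horizontally but is still the y scale

# Adjusting the continuous y scale builds without error.
p_ok <- p + scale_y_continuous(limits = c(0, 60), breaks = seq(0, 60, 10))
built <- ggplot_build(p_ok)

# A continuous x scale clashes with the discrete model names.
p_bad <- p + scale_x_continuous(limits = c(0, 60))
x_error <- tryCatch(
  { ggplot_build(p_bad); NA_character_ },
  error = function(e) conditionMessage(e)
)
print(x_error)
```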

Error in SVM model: Pre-processing methods are limited to: BoxCox, YeoJohnson in R

I am trying to run an SVM model, but I get the error:
Error: pre-processing methods are limited to: BoxCox, YeoJohnson, expoTrans, invHyperbolicSine, center, scale, range, knnImpute, bagImpute, medianImpute, pca, ica, spatialSign, ignore, keep, remove, zv, nzv, conditionalX, corr
I don't understand what is going wrong.
svm.model_unigrams = train(outcome ~.
, data = training_set_unigrams
, trControl = training_controls
, method = "svmRadial"
, preProcess = (training_set_unigrams, method = c("center", "scale"))
, na.action = na.pass)
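Since you have not provided any data, I am using the iris data. Note that the preProcess argument of train() takes a character vector of method names, not a call that includes the data: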
library(caret)
data(iris)
svm.model_unigrams = train(Species ~., data = iris,
trControl = trainControl(method = "cv",
number = 5,
allowParallel = TRUE),
method = "svmRadial",
preProc = c("center", "scale"),
na.action = na.pass)
Similarly, you can use other methods, such as BoxCox:
train(Species ~., data = iris,
trControl = trainControl(method = "cv",
number = 5,
allowParallel = TRUE),
method = "svmRadial",
preProc = c("BoxCox"),
na.action = na.pass)
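To see what those method names do, the same transformations can be applied outside train() with caret's preProcess() and predict(). This is only a sketch on the iris predictors; `pp` and `scaled` are illustrative names.

```r
library(caret)
data(iris)

predictors <- iris[, 1:4]

# Estimate the centering and scaling parameters from the predictors only.
pp <- preProcess(predictors, method = c("center", "scale"))

# Apply them; each column should end up with mean 0 and standard deviation 1.
scaled <- predict(pp, predictors)

print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))
```

Inside train(), passing the same character vector as preProcess (or its abbreviation preProc) lets caret re-estimate these parameters within each resample.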

Error with caret and summaryFunction mnLogLoss: columns consistent with 'lev'

I'm trying to use log loss as loss function for training with Caret, using the data from the Kobe Bryant shot selection competition of Kaggle.
This is my script:
library(caret)
data <- read.csv("./data.csv")
data$shot_made_flag <- factor(data$shot_made_flag)
data$team_id <- NULL
data$team_name <- NULL
train_data_kaggle <- data[!is.na(data$shot_made_flag),]
test_data_kaggle <- data[is.na(data$shot_made_flag),]
inTrain <- createDataPartition(y=train_data_kaggle$shot_made_flag,p=.8,list=FALSE)
train <- train_data_kaggle[inTrain,]
test <- train_data_kaggle[-inTrain,]
folds <- createFolds(train$shot_made_flag, k = 10)
ctrl <- trainControl(method = "repeatedcv", index = folds, repeats = 3, summaryFunction = mnLogLoss)
res <- train(shot_made_flag~., data = train, method = "gbm", preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss", verbose = FALSE)
And this is the traceback of the error:
7: stop("'data' should have columns consistent with 'lev'")
6: ctrl$summaryFunction(testOutput, lev, method)
5: evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels,
metric = metric, method = method)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(shot_made_flag ~ ., data = train, method = "gbm",
preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss",
verbose = FALSE)
1: train(shot_made_flag ~ ., data = train, method = "gbm", preProc = c("zv",
"center", "scale"), trControl = ctrl, metric = "logLoss",
verbose = FALSE)
When I use defaultSummary as the summaryFunction and specify no metric in train, it works, but it doesn't with mnLogLoss. I'm guessing it expects the data in a different format from what I am passing, but I can't find where the error is.
From the help file for defaultSummary:
To use twoClassSummary and/or mnLogLoss, the classProbs argument of trainControl should be TRUE. multiClassSummary can be used without class probabilities but some statistics (e.g. overall log loss and the average of per-class area under the ROC curves) will not be in the result set.
Therefore, I think you need to change your trainControl() to the following:
ctrl <- trainControl(method = "repeatedcv", index = folds, repeats = 3, summaryFunction = mnLogLoss, classProbs = TRUE)
If you do this and run your code you will get the following error:
Error: At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1 . Please use factor levels that can be used as valid R variable names (see ?make.names for help).
You just need to change the 0/1 levels of shot_made_flag to something that can be a valid R variable name:
data$shot_made_flag <- ifelse(data$shot_made_flag == 0, "miss", "made")
With the above changes your code will look like this:
library(caret)
data <- read.csv("./data.csv")
data$shot_made_flag <- ifelse(data$shot_made_flag == 0, "miss", "made")
data$shot_made_flag <- factor(data$shot_made_flag)
data$team_id <- NULL
data$team_name <- NULL
train_data_kaggle <- data[!is.na(data$shot_made_flag),]
test_data_kaggle <- data[is.na(data$shot_made_flag),]
inTrain <- createDataPartition(y=train_data_kaggle$shot_made_flag,p=.8,list=FALSE)
train <- train_data_kaggle[inTrain,]
test <- train_data_kaggle[-inTrain,]
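## trainControl(index = ...) expects the rows used for training in each resample,
## so ask createFolds for the training rows rather than the held-out rows
folds <- createFolds(train$shot_made_flag, k = 3, returnTrain = TRUE)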
ctrl <- trainControl(method = "repeatedcv", classProbs = TRUE, index = folds, repeats = 3, summaryFunction = mnLogLoss)
res <- train(shot_made_flag~., data = train, method = "gbm", preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss", verbose = FALSE)
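The level problem described in that error message can be checked in base R before any model is fitted. The sketch below uses a small made-up vector (`flag`) in place of shot_made_flag and relabels the levels with factor() instead of ifelse(); either approach should work.

```r
# Made-up stand-in for shot_made_flag, including an unlabelled (NA) row.
flag <- factor(c(0, 1, 1, NA, 0))

# Levels "0" and "1" are not valid variable names; make.names() turns them
# into "X0" and "X1", which is what the error message warns about.
print(levels(flag))
print(make.names(levels(flag)))

# Relabel the levels directly; the NA row stays NA.
relabelled <- factor(flag, levels = c("0", "1"), labels = c("miss", "made"))

# The new levels survive make.names() unchanged, so they are usable as
# column names for the class probabilities.
print(levels(relabelled))
print(all(make.names(levels(relabelled)) == levels(relabelled)))
```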

upper/lower values of mtry in Simulated annealing algorithm

I got this code from the web. It uses grid search and simulated annealing to tune the parameters of a random forest. My question is where in the code the simulated annealing algorithm gets the starting and ending values of the mtry parameter. Usually we give lower and upper bounds to this type of algorithm, but I could not find any here. The result gives me the MAE and the optimal value of mtry, and I don't see where these are calculated from. I use library(randomForest).
d=readARFF("Results.arff")
index <- createDataPartition(log10(d$Result), p = .70,list = FALSE)
tr <- d[index, ]
ts <- d[-index, ]
index_2 <- createFolds(tr$Result, returnTrain = TRUE, list = TRUE)
ctrl <- trainControl(method = "cv", index = index_2, search="grid")
grid_search <- train(log10(Effort) ~ ., data = tr,
method = "rf",
## Will create 48 parameter combinations
tuneLength = 8,
metric = "MAE",
preProc = c("center", "scale", "zv"),
trControl = ctrl)
getTrainPerf(grid_search)
obj <- function(param, maximize = FALSE) {
mod <- train(log10(Effort) ~ ., data = tr,
method = "rf",
preProc = c("center", "scale", "zv"),
metric = "MAE",
trControl = ctrl,
tuneGrid = data.frame(mtry = 10^(param[1])))##, sigma = 10^(param[2])))
if(maximize)
-getTrainPerf(mod)[, "TrainMAE"] else
getTrainPerf(mod)[, "TrainMAE"]
}
num_mods <- 10
## Simulated annealing from base R
set.seed(45642)
tic()
san_res <- optim(par = c(0), fn = obj, method = "SANN",
control = list(maxit = num_mods))
san_res
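On the question of where the bounds come from: optim() with method = "SANN" takes no lower or upper limits. It evaluates the objective at par first and then proposes new points by random perturbation, so with par = c(0) the first candidate here is mtry = 10^0 = 1. The sketch below swaps the random forest for a cheap made-up objective (`toy_obj`) and logs every point visited. It does not model how caret or randomForest treat non-integer mtry values.

```r
visited <- numeric(0)

# Hypothetical stand-in for the cross-validated MAE: smallest near mtry = 4.
toy_obj <- function(param) {
  visited <<- c(visited, param[1])   # log every point the optimiser tries
  mtry <- 10^param[1]
  (mtry - 4)^2
}

set.seed(45642)
res <- optim(par = c(0), fn = toy_obj, method = "SANN",
             control = list(maxit = 10))

# The first entry is the starting value; the rest are random proposals.
print(data.frame(param = visited, mtry = 10^visited))
print(10^res$par)
```

Because the search is unbounded, any limits on mtry would have to be enforced inside the objective function itself, for example by clamping the value before it is passed to the model.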

Caret: undefined columns selected

I have been trying to get the code below to run in caret but get the following error. Can anyone tell me how to troubleshoot it?
Error in [.data.frame(data, , lvls[1]) : undefined columns selected
library(tidyverse)
library(caret)
mydf <- iris
mydf <- mydf %>%
mutate(tgt = as.factor(ifelse(Species == 'setosa','Y','N'))) %>%
select(everything(), -Species)
trainIndex <- createDataPartition(mydf$tgt, p = 0.75, times = 1, list = FALSE)
train <- mydf[trainIndex,]
test <- mydf[-trainIndex,]
fitControl <- trainControl(method = 'repeatedcv',
number = 10,
repeats = 10,
allowParallel = TRUE,
summaryFunction = twoClassSummary)
fit_log <- train(tgt~.,
data = train,
method = "glm",
trControl = fitControl,
family = "binomial")
You need to use classProbs = TRUE in your control function. The ROC curve is based on the class probabilities, and the error comes from the summary function not finding those columns.
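The error message itself can be reproduced without caret. In this sketch, `held_out` is a made-up stand-in for the data a summary function receives. The assumption is that, without classProbs = TRUE, it holds only the observed and predicted classes, so looking up a column named after the first class level fails.

```r
lvls <- c("N", "Y")

# Without class probabilities: only observed and predicted classes.
held_out <- data.frame(
  obs  = factor(c("N", "Y", "Y"), levels = lvls),
  pred = factor(c("N", "Y", "N"), levels = lvls)
)

# Selecting the column named after the first level fails, as in the question.
msg <- tryCatch(
  { held_out[, lvls[1]]; NA_character_ },
  error = function(e) conditionMessage(e)
)
print(msg)

# With one probability column per class level the same lookup succeeds.
with_probs <- cbind(held_out, N = c(0.9, 0.2, 0.6), Y = c(0.1, 0.8, 0.4))
print(with_probs[, lvls[1]])
```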
Use data = data.frame(xxxxx), as in the example below:
fit.cart <- train(Condition~., data = data.frame(trainset), method="rpart", metric=metric, trControl=control)
