Recipe fails with caret::train - r

When using caret with recipes i get an error stating:
Error in { : task 1 failed - "$ operator is invalid for atomic vectors"
I managed to narrow it down to a problem with the recipe. But i am not sure what i'm doing wrong. Anyone has seen this before? The only relevant information i found was here and it stated:
This happens when the model object fails and caused no recipe to be
available
Below follows the code i use. I cannot share the data, but the error appears when using mtcars as well.
library(caret)
library(tidymodels)
library(embed)
library(doParallel)
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
df <- mtcars %>%
as_tibble() %>%
mutate(cyl = factor(cyl)) # to have one nominal variable
set.seed(123)
cv_split <- initial_split(df)
df_train <- training(cv_split)
set.seed(123)
cv_folds <- vfold_cv(df_train, v = 10, repeats = 10)
cv_ind <- rsample2caret(cv_folds)
rec <-
recipe(mpg ~ ., data = df_train) %>%
step_nzv(all_predictors()) %>%
step_lencode_mixed(all_nominal(), outcome = vars(mpg))
ctrl <-
trainControl(
method = "repeatedcv",
repeats = 10,
index = cv_ind$index,
indexOut = cv_ind$indexOut,
allowParallel = TRUE)
train(rec,
data = df_train,
method = "glmnet",
tuneLength = 20,
trControl = ctrl)`

Related

Select tuneGrid depending on the model in caret R

I try to apply ML on the iris dataset, using "knn" and "rpart" algorithms. This is my code:
library(tidyverse)
library(caret)
dataset <- iris
tt_index <- createDataPartition(dataset$Sepal.Length, times = 1, p = 0.9, list = FALSE)
train_set <- dataset[tt_index, ]
test_set <- dataset[-tt_index, ]
models <- c("knn","rpart")
fits <- lapply(models, function(model){
print(model)
train(Species ~ .,
data = train_set,
tuneGrid = case_when(model == "knn" ~ data.frame(k = seq(3,50,1)),
model == "rpart" ~ data.frame(cp = seq(0,0.1,len = 50))),
method = model)
})
I want to set tuneGrid parameter depending on the model inside lapply. But I receive this error:
Error in `[.data.frame`(value[[1]], rep(NA_integer_, m)) :
undefined columns selected
Any help will be greatly appreciated.
We could use if/else
library(caret)
out <- lapply(models, function(model)
train(Species ~ ., data = train_set,
tuneGrid = if(model == "knn") data.frame(k = seq(3,50,1)) else
data.frame(cp = seq(0,0.1,len = 50)), method = model))
According to ?case_when
A vector of length 1 or n, matching the length of the logical input or output vectors, with the type (and attributes) of the first RHS. Inconsistent lengths or types will generate an error.

Error in -train : invalid argument to unary operator

I am using R Studio and trying to knit a file. the code chunk below will run as the chunk but throws an error when I try to knit the file.
tree.corolla <- rpart(Price ~ ., data = toyota.corolla.df, control = rpart.control(maxdepth = 5), method = "anova")
The error I am getting is:
Error in -train : invalid argument to unary operator
Calls: ... eval -> predict -> predict.rpart -> [ -> [.data.frame
I am using the ToyotaCorolla.csv dataset that is available here:
https://pitt.box.com/s/e0rhjtba8az85epqus9xu85e4q6zxuts
The entire code chunk is below:
#install.packages("rpart")
#install.packages("rpart.plot")
#install.packages("gbm")
#install.packages("randomForest")
#install.packages("dummies")
library(randomForest)
library(gbm)
library(rpart)
library(rpart.plot)
library(tree)
library(ISLR)
library(dummies)
library(adabag)
library(rpart)
library(caret)
toyota.corolla.df <- read.csv("ToyotaCorolla.csv")
#View(toyota.corolla.df)
# randomly generate training and validation sets
toyota.corolla.df <- toyota.corolla.df[ , -c(1, 2, 5, 6)]
toyota.corolla.df <- cbind(toyota.corolla.df, dummy(toyota.corolla.df$Fuel_Type, sep = "_"))
toyota.corolla.df <- cbind(toyota.corolla.df, dummy(toyota.corolla.df$Color, sep = "_"))
toyota.corolla.df <- toyota.corolla.df[ , -c(4, 7)]
set.seed(123)
inTraining <- createDataPartition(toyota.corolla.df$Price, p = .60, list = FALSE)
training <- toyota.corolla.df[ inTraining,]
testing <- toyota.corolla.df[-inTraining,]
tree.corolla <- rpart(Price ~ ., data = toyota.corolla.df, control = rpart.control(maxdepth = 5), method = "anova")
summary(tree.corolla)
plot(tree.corolla)
text(tree.corolla,pretty=0)
cv.corolla=trainControl(method = "repeatedcv", number = 10, repeats = 10)
prp(tree.corolla, type = 1, extra = 1, split.font = 1, varlen = -10)
yhat=predict(tree.corolla,newdata=toyota.corolla.df[-train,])
corolla.test=toyota.corolla.df[-train,"Price"]
plot(yhat,corolla.test)
abline(0,1)

How to evaluate LightGBM in R using cohen's kappa?

I use XGBoost in R on a regular basis and want to start using LightGBM on the same data. My goal is to use cohen's kappa as evaluation metric. However, I am not able to properly implement LightGBM - it seems that no learning occurs. As a very simple example, I'll use the titanic dataset.
library(data.table)
library(dplyr)
library(caret)
titanic <- fread("https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv")
titanic_complete <- titanic %>%
select(survived, pclass, sex, age, sibsp, parch, fare, embarked) %>%
mutate_if(is.character, as.factor) %>%
mutate(survived = as.factor(survived)) %>%
na.omit()
train_class <- titanic_complete %>%
select(survived) %>%
pull()
train_numeric <- titanic_complete %>%
select_if(is.numeric) %>%
data.matrix()
ctrl <- trainControl(method = "none", search = "grid")
tune_grid_xgbTree <- expand.grid(
nrounds = 700,
eta = 0.1,
max_depth = 3,
gamma = 0,
colsample_bytree = 0,
min_child_weight = 1,
subsample = 1)
set.seed(512)
fit_xgb <- train(
x = train_numeric,
y = train_class,
tuneGrid = tune_grid_xgbTree,
trControl = ctrl,
method = "xgbTree",
metric = "Kappa",
verbose = TRUE)
confusionMatrix(predict(fit_xgb, train_numeric), train_class)
Gives me a Kappa of 0.57 evaluated on the training set (which is only to show my problem, otherwise I would use cross-validation).
For LightGBM, I write Kappa as a custom evaluation function:
library(lightgbm)
lgb.kappa <- function(preds, y) {
label <- getinfo(y, "label")
k <- unlist(e1071::classAgreement(table(label, preds)))["kappa"]
return(list(name = "kappa", value = as.numeric(k), higher_better = TRUE))
}
X_train <- titanic_complete %>% select(-survived) %>% data.matrix()
y_train <- titanic_complete %>% select(survived) %>% data.matrix()
y_train <- y_train - 1
dtrain <- lgb.Dataset(data = X_train, label = y_train)
Here, I use the same parameter set than in XGBoost but I tried different combinations without success.
fit_lgbm <- lgb.train(data = dtrain,
objective = "binary",
learning_rate = 0.1,
nrounds = 700,
colsample_bytree = 0,
eval = lgb.kappa,
min_child_weight = 1,
max_depth = 3)
No learning occurs and the algorithm outputs "No further splits with positive gain, best gain: -inf" and Kappa = 0.
If someone hast successfully implemented LightGBM (maybe with a custom evaluation metric), I would be very happy for a hint of how to resolve this.
No learning occurs and the algorithm outputs "No further splits with positive gain, best gain: -inf"
This is because LightGBM's default parameter values are configured for larger datasets. The training dataset in your example above only has 714 rows. To deal with this, I recommend setting LightGBM's parameters to values that permit smaller leaf nodes, and limiting the number of leaves instead of the depth.
list(
"min_data_in_leaf" = 3
, "max_depth" = -1
, "num_leaves" = 8
)
and Kappa = 0.
I believe your implementation of Cohen's kappa has a mistake. The input to e1071::classAgreement() is expected to be a table of counts (a confusion matrix), and preds is in the form of predicted probabilities. I think this implementation is correct, based on the description of this metric on Wikipedia.
lgb.kappa <- function(preds, dtrain) {
label <- getinfo(dtrain, "label")
threshold <- 0.5
thresholded_preds <- as.integer(preds > threshold)
k <- unlist(e1071::classAgreement(table(label, thresholded_preds)))["kappa"]
return(list(name = "kappa", value = as.numeric(k), higher_better = TRUE))
}
Finally, I think 700 iterations is probably too many for a 700ish-observation dataset. You can see the value of metrics evaluated against the training data at each iteration by passing the training data as a validation set.
Taken together, I think the code below accomplishes what the original question asked for.
library(data.table)
library(dplyr)
library(caret)
library(lightgbm)
titanic <- fread("https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv")
titanic_complete <- titanic %>%
select(survived, pclass, sex, age, sibsp, parch, fare, embarked) %>%
mutate_if(is.character, as.factor) %>%
mutate(survived = as.factor(survived)) %>%
na.omit()
train_class <- titanic_complete %>%
select(survived) %>%
pull()
train_numeric <- titanic_complete %>%
select_if(is.numeric) %>%
data.matrix()
lgb.kappa <- function(preds, dtrain) {
label <- getinfo(dtrain, "label")
threshold <- 0.5
thresholded_preds <- as.integer(preds > threshold)
k <- unlist(e1071::classAgreement(table(label, thresholded_preds)))["kappa"]
return(list(name = "kappa", value = as.numeric(k), higher_better = TRUE))
}
X_train <- titanic_complete %>% select(-survived) %>% data.matrix()
y_train <- titanic_complete %>% select(survived) %>% data.matrix()
y_train <- y_train - 1
# train, printing out eval metrics at ever iteration
fit_lgbm <- lgb.train(
data = lgb.Dataset(
data = X_train,
label = y_train
),
params = list(
"min_data_in_leaf" = 3
, "max_depth" = -1
, "num_leaves" = 8
),
objective = "binary",
learning_rate = 0.1,
nrounds = 10L,
verbose = 1L,
valids = list(
"train" = lgb.Dataset(
data = X_train,
label = y_train
)
),
eval = lgb.kappa,
)
# evaluate a custom function after training
fit_lgbm$eval_train(
feval = lgb.kappa
)

R: LIME package: get probability change when variable changes?

I have a question about the LIME package. I've made a model with lime like this example from thomasp85 (thank you):
devtools::install_github("thomasp85/lime")
setwd("C:/world-happiness-report")
load("data_15_16.RData")
# configure multicore
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
library(caret)
set.seed(42)
index <- createDataPartition(data_15_16$Happiness.Score.l, p = 0.7, list = FALSE)
train_data <- data_15_16[index, ]
test_data <- data_15_16[-index, ]
set.seed(42)
model_mlp <- caret::train(Happiness.Score.l ~ .,
data = train_data,
method = "mlp",
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
verboseIter = FALSE))
library(lime)
explain <- lime(train_data, model_mlp, bin_continuous = TRUE, n_bins = 5, n_permutations = 1000)
pred <- data.frame(sample_id = 1:nrow(test_data),
predict(model_mlp, test_data, type = "prob"),
actual = test_data$Happiness.Score.l)
pred$prediction <- colnames(pred)[3:5][apply(pred[, 3:5], 1, which.max)]
pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong")
library(tidyverse)
pred_cor <- filter(pred, correct == "correct")
pred_wrong <- filter(pred, correct == "wrong")
test_data_cor %
mutate(sample_id = 1:nrow(test_data)) %>%
filter(sample_id %in% pred_cor$sample_id) %>%
sample_n(size = 3) %>%
remove_rownames() %>%
tibble::column_to_rownames(var = "sample_id") %>%
select(-Happiness.Score.l)
test_data_wrong %
mutate(sample_id = 1:nrow(test_data)) %>%
filter(sample_id %in% pred_wrong$sample_id) %>%
sample_n(size = 3) %>%
remove_rownames() %>%
tibble::column_to_rownames(var = "sample_id") %>%
select(-Happiness.Score.l)
explanation_cor <- explain(test_data_cor, n_labels = 3, n_features = 5)
explanation_wrong <- explain(test_data_wrong, n_labels = 3, n_features = 5)
plot_features(explanation_wrong, ncol = 3)
The plot produces the following exampple for a certain case:
What I would like to know: is there a way to get numbers on how the probability would change when (for example) Family would be higher, so for example:
Family in this case has level: 0.71. When Family would increase with
0.1 this would cause the probability to increase with 0.5%
The reason for this is that I have got the same graph type, and I would like to create additional information of the meaning of the effect that is indicated in the graph.
Any information, also in the form of literature etc would be helpful :)
It seems you are asking for counterfactuals. LIME does not really provides this functionality. It is not their goal. You have to create the counterfactuals manually and submit them to LIME in the same iteration because LIME is non-deterministic -- it selects randomly the neighbors to fit the interpretable model-- so you get different results every time you run it.

Caret: undefined columns selected

I have been trying to get the below code to run in caret but get the error. Can anyone tell me how to trouble shoot it.
Error in [.data.frame(data, , lvls[1]) : undefined columns selected
library(tidyverse)
library(caret)
mydf <- iris
mydf <- mydf %>%
mutate(tgt = as.factor(ifelse(Species == 'setosa','Y','N'))) %>%
select(everything(), -Species)
trainIndex <- createDataPartition(mydf$tgt, p = 0.75, times = 1, list = FALSE)
train <- mydf[trainIndex,]
test <- mydf[-trainIndex,]
fitControl <- trainControl(method = 'repeatedcv',
number = 10,
repeats = 10,
allowParallel = TRUE,
summaryFunction = twoClassSummary)
fit_log <- train(tgt~.,
data = train,
method = "glm",
trControl = fitControl,
family = "binomial")
You need to used classProbs = TRUE in your control function. The ROC curve is based on the class probabilities and the error is the summary function not finding those columns.
Use data = data.frame(xxxxx). As in the example below
fit.cart <- train(Condition~., data = data.frame(trainset), method="rpart", metric=metric, trControl=control)

Resources