How to evaluate LightGBM in R using Cohen's kappa?

I use XGBoost in R on a regular basis and want to start using LightGBM on the same data. My goal is to use Cohen's kappa as the evaluation metric. However, I am not able to implement LightGBM properly; it seems that no learning occurs. As a very simple example, I'll use the Titanic dataset.
library(data.table)
library(dplyr)
library(caret)
titanic <- fread("https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv")
titanic_complete <- titanic %>%
  select(survived, pclass, sex, age, sibsp, parch, fare, embarked) %>%
  mutate_if(is.character, as.factor) %>%
  mutate(survived = as.factor(survived)) %>%
  na.omit()
train_class <- titanic_complete %>%
  select(survived) %>%
  pull()
train_numeric <- titanic_complete %>%
  select_if(is.numeric) %>%
  data.matrix()
ctrl <- trainControl(method = "none", search = "grid")
tune_grid_xgbTree <- expand.grid(
  nrounds = 700,
  eta = 0.1,
  max_depth = 3,
  gamma = 0,
  colsample_bytree = 0,
  min_child_weight = 1,
  subsample = 1)
set.seed(512)
fit_xgb <- train(
  x = train_numeric,
  y = train_class,
  tuneGrid = tune_grid_xgbTree,
  trControl = ctrl,
  method = "xgbTree",
  metric = "Kappa",
  verbose = TRUE)
confusionMatrix(predict(fit_xgb, train_numeric), train_class)
This gives me a Kappa of 0.57 evaluated on the training set (used here only to illustrate the problem; otherwise I would use cross-validation).
For LightGBM, I write Kappa as a custom evaluation function:
library(lightgbm)
lgb.kappa <- function(preds, y) {
  label <- getinfo(y, "label")
  k <- unlist(e1071::classAgreement(table(label, preds)))["kappa"]
  return(list(name = "kappa", value = as.numeric(k), higher_better = TRUE))
}
X_train <- titanic_complete %>% select(-survived) %>% data.matrix()
y_train <- titanic_complete %>% select(survived) %>% data.matrix()
y_train <- y_train - 1
dtrain <- lgb.Dataset(data = X_train, label = y_train)
Here, I use the same parameter set as in XGBoost, but I have also tried different combinations without success.
fit_lgbm <- lgb.train(data = dtrain,
                      objective = "binary",
                      learning_rate = 0.1,
                      nrounds = 700,
                      colsample_bytree = 0,
                      eval = lgb.kappa,
                      min_child_weight = 1,
                      max_depth = 3)
No learning occurs and the algorithm outputs "No further splits with positive gain, best gain: -inf" and Kappa = 0.
If someone has successfully implemented LightGBM (maybe with a custom evaluation metric), I would be very happy for a hint on how to resolve this.

No learning occurs and the algorithm outputs "No further splits with positive gain, best gain: -inf"
This is because LightGBM's default parameter values are configured for larger datasets. The training dataset in your example above only has 714 rows. To deal with this, I recommend setting LightGBM's parameters to values that permit smaller leaf nodes, and limiting the number of leaves instead of the depth.
list(
"min_data_in_leaf" = 3
, "max_depth" = -1
, "num_leaves" = 8
)
and Kappa = 0.
I believe your implementation of Cohen's kappa has a mistake: the input to e1071::classAgreement() is expected to be a table of counts (a confusion matrix), but preds is a vector of predicted probabilities. I think the implementation below is correct, based on the description of this metric on Wikipedia.
lgb.kappa <- function(preds, dtrain) {
  label <- getinfo(dtrain, "label")
  threshold <- 0.5
  thresholded_preds <- as.integer(preds > threshold)
  k <- unlist(e1071::classAgreement(table(label, thresholded_preds)))["kappa"]
  return(list(name = "kappa", value = as.numeric(k), higher_better = TRUE))
}
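As a quick sanity check on that formula, Cohen's kappa computed by hand (observed vs. chance agreement) should match what e1071::classAgreement() reports. The toy labels below are made up purely for illustration:
# toy labels and predictions, chosen arbitrarily
cm <- table(
  truth = c(0, 0, 0, 1, 1, 1, 1, 0),
  pred  = c(0, 0, 1, 1, 1, 0, 1, 0)
)
p_o <- sum(diag(cm)) / sum(cm)                    # observed agreement
p_e <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2 # agreement expected by chance
(p_o - p_e) / (1 - p_e)                           # kappa by hand: 0.5
e1071::classAgreement(cm)$kappa                   # matches: 0.5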
Finally, I think 700 iterations is probably too many for a 700ish-observation dataset. You can see the value of metrics evaluated against the training data at each iteration by passing the training data as a validation set.
Taken together, I think the code below accomplishes what the original question asked for.
library(data.table)
library(dplyr)
library(caret)
library(lightgbm)
titanic <- fread("https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv")
titanic_complete <- titanic %>%
  select(survived, pclass, sex, age, sibsp, parch, fare, embarked) %>%
  mutate_if(is.character, as.factor) %>%
  mutate(survived = as.factor(survived)) %>%
  na.omit()
train_class <- titanic_complete %>%
  select(survived) %>%
  pull()
train_numeric <- titanic_complete %>%
  select_if(is.numeric) %>%
  data.matrix()
lgb.kappa <- function(preds, dtrain) {
  label <- getinfo(dtrain, "label")
  threshold <- 0.5
  thresholded_preds <- as.integer(preds > threshold)
  k <- unlist(e1071::classAgreement(table(label, thresholded_preds)))["kappa"]
  return(list(name = "kappa", value = as.numeric(k), higher_better = TRUE))
}
X_train <- titanic_complete %>% select(-survived) %>% data.matrix()
y_train <- titanic_complete %>% select(survived) %>% data.matrix()
y_train <- y_train - 1
# train, printing out eval metrics at every iteration
fit_lgbm <- lgb.train(
  data = lgb.Dataset(
    data = X_train,
    label = y_train
  ),
  params = list(
    "min_data_in_leaf" = 3
    , "max_depth" = -1
    , "num_leaves" = 8
  ),
  objective = "binary",
  learning_rate = 0.1,
  nrounds = 10L,
  verbose = 1L,
  valids = list(
    "train" = lgb.Dataset(
      data = X_train,
      label = y_train
    )
  ),
  eval = lgb.kappa
)
# evaluate a custom function after training
fit_lgbm$eval_train(
  feval = lgb.kappa
)
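If you would rather not pick the number of iterations by hand, the same setup can be extended with early stopping. This is a sketch of my own, not part of the original answer: to my knowledge, lgb.train() accepts an early_stopping_rounds argument in the package versions this answer targets, so training stops once the metric on the validation data stops improving.
# hedged sketch: early stopping driven by the custom kappa metric.
# the training data doubles as the validation set here, as above;
# in practice you would hold out a separate validation split.
fit_lgbm_es <- lgb.train(
  data = lgb.Dataset(data = X_train, label = y_train),
  params = list(
    "min_data_in_leaf" = 3
    , "max_depth" = -1
    , "num_leaves" = 8
  ),
  objective = "binary",
  learning_rate = 0.1,
  nrounds = 700L,
  early_stopping_rounds = 10L,
  valids = list(
    "train" = lgb.Dataset(data = X_train, label = y_train)
  ),
  eval = lgb.kappa
)
fit_lgbm_es$best_iter # iteration with the best validation score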

Related

Logistic Regression in R - using package "logistf"

I used the package "logistf" to perform a logistic regression in R.
library(logistf)
library(rsample) # initial_split(), training(), testing()
library(assertr) # verify()
library(dplyr)   # %>%
df <- read.csv("data.csv", header = TRUE, row.names = 1)
df <- as.data.frame(sapply(df, as.numeric))
df_split <- initial_split(df, prop = 0.9)
df_train <-
  training(df_split) %>%
  verify(expr = nrow(.) == 14355L)
df_test <-
  testing(df_split) %>%
  verify(expr = nrow(.) == 1596L)
x_train <- as.matrix(df_train[, 1:259]) # removes the class column
y_train <- as.double(as.matrix(df_train[, 260]))
mle <- logistf(y_train ~ x_train, firth = TRUE, family = binomial)
When I run the above code, I get the following error:
Error in logistf.fit(x = x, y = y, weight = weight, offset = offset, firth, :
In iteration 0: Determinant of Fisher information matrix was numerically 0
How can I fix this error?
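One thing worth checking (an assumption about the likely cause, not a confirmed fix): this error usually means the Fisher information matrix is numerically singular, which happens when some of the 259 predictors are constant or exact linear combinations of each other. A quick screen before fitting might look like this:
# hedged sketch: drop zero-variance and exactly collinear columns,
# which commonly make the Fisher information matrix singular
library(caret)
constant_cols <- which(apply(x_train, 2, var) == 0) # zero-variance predictors
lin_combos <- caret::findLinearCombos(x_train)      # exact linear dependencies
drop_idx <- unique(c(constant_cols, lin_combos$remove))
if (length(drop_idx) > 0) {
  x_train <- x_train[, -drop_idx]
}
mle <- logistf(y_train ~ x_train, firth = TRUE)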

How to fix "sink stack is full" over training a model in R

I want to train a regression model with keras_model_sequential in R. To find the best parameters over a grid search I used tuning_run. But I got this error:
training run 1/128 (flags = list(0.05, 66, "relu", 8, 10, 0.001, 0.2))
Error in sink(file = output_file, type = "output", split = TRUE) :
sink stack is full
Calls: tuning_run ... with_changed_file_copy -> force -> sink -> .handleSimpleError -> h
I should mention that a folder named "runs" was created in the data and script path, containing many subfolders whose names look like dates; maybe this is related.
library(plyr)
library(boot)
library(keras)
library(tensorflow)
library(kerasR)
library(tidyverse)
library(tfruns)
library(MLmetrics)
df = mainlist[[1]] # data: a 33x31 dataframe (33 samples, 31 features; the last column is the target)
x = (length(df)) - 1
print(x)
df1 = df[, 2:x]
# normalization
df2 = df[, 2:length(df)]
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
maxmindf <- as.data.frame(lapply(df2, normalize))
attach(maxmindf)
df_norm <- as.matrix(maxmindf)
# Determine sample size
ind <- sample(2, nrow(df_norm), replace = TRUE, prob = c(0.80, 0.20))
# Split the data (peaks)
training <- df_norm[ind == 1, 1:(ncol(df_norm) - 1)]
test1 <- df_norm[ind == 2, 1:(ncol(df_norm) - 1)]
training_target <- df_norm[ind == 1, ncol(df_norm)]
test1_target <- df_norm[ind == 2, ncol(df_norm)]
# number of nodes in the first hidden layer
u1_1 = ceiling((1/2) * (ncol(training) + 1))
u2_1 = ceiling(1 * (ncol(training) + 1))
u3_1 = ceiling((2/3) * (ncol(training) + 1))
u4_1 = ceiling(2 * ncol(training))
#### a) Declaring the flags for hyperparameters
FLAGS = flags(
  flag_numeric("dropout1", 0.05),
  flag_integer("units", u1_1),
  flag_string("activation1", "relu"),
  flag_integer("batchsize1", 8),
  flag_integer("Epoch1", 50),
  flag_numeric("learning_rate", 0.01),
  flag_numeric("val_split", 0.2),
  flag_numeric("reg_l1", 0.001)
)
#### b) Defining the DNN model
build_model <- function() {
  model <- keras_model_sequential()
  model %>%
    layer_dense(units = FLAGS$units, activation = FLAGS$activation1, input_shape = c(dim(training)[2])) %>%
    layer_dropout(rate = FLAGS$dropout1) %>%
    layer_dense(units = 1, activation = "linear")
  ##### c) Compiling the DNN model
  model %>% compile(
    loss = "mse",
    optimizer = optimizer_adam(FLAGS$learning_rate),
    metrics = c("mse"))
  model
}
model <- build_model()
model %>% summary()
print_dot_callback <- callback_lambda(
  on_epoch_end = function(epoch, logs) {
    if (epoch %% 80 == 0) cat("\n")
    cat(".")
  })
early_stop <- callback_early_stopping(monitor = "val_loss", mode = 'min', patience = 20)
########### d) Fitting the DNN model #################
model_Final <- build_model()
model_fit_Final <- model_Final %>% fit(
  training,
  training_target,
  epochs = FLAGS$Epoch1, batch_size = FLAGS$batchsize1,
  shuffle = FALSE,
  validation_split = FLAGS$val_split,
  verbose = 0,
  callbacks = list(early_stop, print_dot_callback)
)
################ a) Inner cross-validation ##########################
nCVI = 5
Hyperpar = data.frame() # the results of each hyperparameter combination from each inner partition will be saved here
for (i in 1:nCVI) { # do it to choose the best parameters
  print("I is:")
  print(i)
  Sam_per = sample(1:nrow(training), nrow(training))
  X_trII = training[Sam_per, ]
  y_trII = training_target[Sam_per]
  # print(head(X_trII, 3))
  print("----------------------")
  print(head(y_trII, 3))
  ############ b) Grid search using the tuning_run() function of the tfruns package ########
  runs.sp <- tuning_run(paste0("train.R")
                        , runs_dir = '_tuningE1'
                        , flags = list(dropout1 = c(0, 0.05),
                                       units = c(u1_1, u2_1),
                                       activation1 = ("relu"),
                                       batchsize1 = c(8, 16),
                                       Epoch1 = c(10, 50),
                                       learning_rate = c(0.001),
                                       val_split = c(0.2)),
                        sample = 0.2,
                        confirm = FALSE,
                        echo = F)
  # clean_runs(ls_runs(completed == FALSE))
  ##### c) Saving each combination of hyperparameters in the Hyperpar data.frame
  runs.sp = runs.sp[order(runs.sp$flag_units, runs.sp$flag_dropout1, runs.sp$flag_batchsize1, runs.sp$flag_Epoch1), ]
  runs.sp$grid_length = 1:nrow(runs.sp) # we save the grid length and also the important parameters
  Parameters = data.frame(grid_length = runs.sp$grid_length,
                          metric_val_mse = runs.sp$metric_val_mse,
                          flag_dropout1 = runs.sp$flag_dropout1,
                          flag_units = runs.sp$flag_units,
                          flag_batchsize1 = runs.sp$flag_batchsize1,
                          epochs_completed = runs.sp$epochs_completed,
                          flag_learning_rate = runs.sp$flag_learning_rate,
                          flag_activation1 = runs.sp$flag_activation1)
  Hyperpar = rbind(Hyperpar, data.frame(Parameters)) # we save the important parameters
}
##### d) Summarizing the five inner folds by hyperparameter combination
# the average prediction performance is obtained for each inner fold
Hyperpar %>%
  group_by(grid_length) %>%
  summarise(val_mse = mean(metric_val_mse),
            dropout1 = mean(flag_dropout1),
            units = mean(flag_units),
            batchsize1 = mean(flag_batchsize1),
            learning_rate = mean(flag_learning_rate),
            epochs = mean(epochs_completed)) %>%
  select(grid_length, val_mse, dropout1, units, batchsize1,
         learning_rate, epochs) %>%
  mutate_if(is.numeric, funs(round(., 3)))
Hyperpar_Opt = Hyperpar
###### e) Selecting the best combination of hyperparameters
Min = min(Hyperpar_Opt$val_mse)
pos_opt = which(Hyperpar_Opt$val_mse == Min)
pos_opt = pos_opt[1]
Optimal_Hyper = Hyperpar_Opt[pos_opt, ]
##### Selecting the best hyperparameters
Drop_O = Optimal_Hyper$dropout1
Epoch_O = round(Optimal_Hyper$epochs, 0)
Units_O = round(Optimal_Hyper$units, 0)
activation_O = unique(Hyperpar$flag_activation1)
batchsize_O = round(Optimal_Hyper$batchsize1, 0)
lr_O = Optimal_Hyper$learning_rate
print_dot_callback <- callback_lambda(
  on_epoch_end = function(epoch, logs) {
    if (epoch %% 20 == 0) cat("\n")
    cat(".")
  })
# refitting the model with the optimal values
model_Sec <- keras_model_sequential()
model_Sec %>%
  layer_dense(units = Units_O, activation = activation_O, input_shape = c(dim(training)[2])) %>%
  layer_dropout(rate = Drop_O) %>%
  layer_dense(units = 1, activation = activation_O)
model_Sec %>% compile(
  loss = "mean_squared_error",
  optimizer = optimizer_adam(lr = lr_O),
  metrics = c("mean_squared_error"))
# fit the model with our data
ModelFited <- model_Sec %>% fit(
  X_trII, y_trII,
  epochs = Epoch_O, batch_size = batchsize_O, ##### validation_split = 0.2,
  verbose = 0,
  callbacks = list(early_stop, print_dot_callback)
)
############# g) Prediction on the testing set ##########################
Yhat = model_Sec %>% predict(test1)
y_p = Yhat
y_p_tst = as.numeric(y_p)
# y_tst = y[tst_set]
plot(test1_target, y_p_tst)
MSE = mean((test1_target - y_p_tst)^2)
Do you have any ideas?
Thanks in advance.
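One observation that may help (an assumption, not a verified fix): R keeps a small fixed-size stack of sink() diversions, and tfruns opens one per run to capture output. If earlier runs aborted before their diversions were closed, the stack keeps growing until it is full. Clearing stale sinks before re-running is cheap to try:
# hedged sketch: close any output diversions left open by aborted runs
while (sink.number() > 0) {
  sink() # each call pops one output sink
}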

Nested resampling (nested CV) with several metrics using tidymodels

I tried to recreate this example of nested resampling with tidymodels but with two hyperparameters and more than one metric.
Here is the code for two hyperparameters:
# Data
library(mlbench)
sim_data <- function(n) {
  tmp <- mlbench.friedman1(n, sd = 1)
  tmp <- cbind(tmp$x, tmp$y)
  tmp <- as.data.frame(tmp)
  names(tmp)[ncol(tmp)] <- "y"
  tmp
}
set.seed(9815)
train_dat <- sim_data(50)
train_dat$y <- rep(c("yes", "no"))
train_dat$y <- as.factor(train_dat$y)
# Nested CV
library(tidymodels)
results <- nested_cv(train_dat,
                     outside = vfold_cv(v = 3, repeats = 3),
                     inside = bootstraps(times = 5))
results
# Apply SVM to nested CV
library(kernlab)
# `object` will be an `rsplit` object from our `results` tibble
# `cost` is the tuning parameter
svm_metrics <- function(object, cost = 1, rbf_sigma = 0.2) {
  y_col <- ncol(object$data)
  mod <-
    svm_rbf(mode = "classification", cost = cost, rbf_sigma = rbf_sigma) %>%
    set_engine("kernlab") %>%
    fit(y ~ ., data = analysis(object))
  holdout_pred <-
    predict(mod, assessment(object)) %>%
    bind_cols(assessment(object) %>% dplyr::select(y))
  sens(holdout_pred, truth = y, estimate = .pred_class)$.estimate
}
# In some cases, we want to parameterize the function over the tuning parameters:
svm_metrics_wrapper <- function(cost, rbf_sigma, object) svm_metrics(object, cost, rbf_sigma)
# `object` will be an `rsplit` object for the bootstrap samples
tune_over_svm <- function(object) {
  tibble(cost = grid_random(cost(), size = 3),
         rbf_sigma = grid_random(rbf_sigma(), size = 3)) %>%
    mutate(Sens = map2_dbl(cost, rbf_sigma, svm_metrics_wrapper, object = object))
}
# `object` is an `rsplit` object in `results$inner_resamples`
summarize_tune_results <- function(object) {
  # Return a row-bound tibble of the bootstrap results
  map_df(object$splits, tune_over_svm) %>%
    # For each value of the tuning parameters, compute the
    # average sensitivity, which is the inner bootstrap estimate.
    group_by(cost, rbf_sigma) %>%
    summarize(mean_sens = mean(Sens, na.rm = TRUE),
              n = length(Sens),
              .groups = "drop")
}
library(furrr)
plan(multisession)
tuning_results <- future_map(results$inner_resamples, summarize_tune_results)
However, I am totally at a loss as to how to add other metrics, say specificity and pr_auc. Any idea on how to do this is appreciated. Thanks.
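One direction that might work (a sketch under my own assumptions, not tested code from the thread): let the inner function return one row per metric via yardstick's metric_set(), adding class probabilities so that pr_auc can be computed. The .pred_no column and event level are assumptions; with factor levels c("no", "yes"), yardstick treats "no" as the event by default.
# hedged sketch: several metrics per holdout via yardstick::metric_set()
svm_metrics_multi <- function(object, cost = 1, rbf_sigma = 0.2) {
  mod <-
    svm_rbf(mode = "classification", cost = cost, rbf_sigma = rbf_sigma) %>%
    set_engine("kernlab") %>%
    fit(y ~ ., data = analysis(object))
  holdout_pred <-
    predict(mod, assessment(object), type = "prob") %>%
    bind_cols(predict(mod, assessment(object))) %>%
    bind_cols(assessment(object) %>% dplyr::select(y))
  multi_metric <- metric_set(sens, spec, pr_auc)
  multi_metric(holdout_pred, truth = y, estimate = .pred_class, .pred_no)
}
tune_over_svm() would then collect a tibble per parameter pair (e.g. map2() plus tidyr::unnest()) instead of a single Sens column, and the group_by/summarize step would additionally group by .metric.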

R: LIME package: get probability change when variable changes?

I have a question about the LIME package. I've made a model with lime like this example from thomasp85 (thank you):
devtools::install_github("thomasp85/lime")
setwd("C:/world-happiness-report")
load("data_15_16.RData")
# configure multicore
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
library(caret)
set.seed(42)
index <- createDataPartition(data_15_16$Happiness.Score.l, p = 0.7, list = FALSE)
train_data <- data_15_16[index, ]
test_data <- data_15_16[-index, ]
set.seed(42)
model_mlp <- caret::train(Happiness.Score.l ~ .,
                          data = train_data,
                          method = "mlp",
                          trControl = trainControl(method = "repeatedcv",
                                                   number = 10,
                                                   repeats = 5,
                                                   verboseIter = FALSE))
library(lime)
explain <- lime(train_data, model_mlp, bin_continuous = TRUE, n_bins = 5, n_permutations = 1000)
pred <- data.frame(sample_id = 1:nrow(test_data),
                   predict(model_mlp, test_data, type = "prob"),
                   actual = test_data$Happiness.Score.l)
pred$prediction <- colnames(pred)[3:5][apply(pred[, 3:5], 1, which.max)]
pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong")
library(tidyverse)
pred_cor <- filter(pred, correct == "correct")
pred_wrong <- filter(pred, correct == "wrong")
test_data_cor <- test_data %>%
  mutate(sample_id = 1:nrow(test_data)) %>%
  filter(sample_id %in% pred_cor$sample_id) %>%
  sample_n(size = 3) %>%
  remove_rownames() %>%
  tibble::column_to_rownames(var = "sample_id") %>%
  select(-Happiness.Score.l)
test_data_wrong <- test_data %>%
  mutate(sample_id = 1:nrow(test_data)) %>%
  filter(sample_id %in% pred_wrong$sample_id) %>%
  sample_n(size = 3) %>%
  remove_rownames() %>%
  tibble::column_to_rownames(var = "sample_id") %>%
  select(-Happiness.Score.l)
explanation_cor <- explain(test_data_cor, n_labels = 3, n_features = 5)
explanation_wrong <- explain(test_data_wrong, n_labels = 3, n_features = 5)
plot_features(explanation_wrong, ncol = 3)
The plot produces the following example for a certain case (the figure itself is not reproduced here):
What I would like to know: is there a way to get numbers on how the probability would change when (for example) Family is higher? For example:
Family in this case has level 0.71. If Family increased by 0.1, the probability would increase by 0.5%.
The reason is that I have the same graph type, and I would like to add information about the size of the effect indicated in the graph.
Any information, including pointers to literature etc., would be helpful :)
It seems you are asking for counterfactuals. LIME does not really provide this functionality; it is not its goal. You have to create the counterfactuals manually and submit them to LIME in the same call, because LIME is non-deterministic (it randomly selects the neighbors used to fit the interpretable model), so you get different results every time you run it.
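A minimal sketch of that idea (the Family feature and data objects are taken from the question; the workflow itself is my assumption, not tested):
# one case plus a manually perturbed copy, explained in the same call
case <- test_data_cor[1, , drop = FALSE]
counterfactual <- case
counterfactual$Family <- counterfactual$Family + 0.1 # bump one feature
both <- rbind(case, counterfactual)
# a single explain() call keeps both rows in the same (random) LIME run
explanation_pair <- explain(both, n_labels = 1, n_features = 5)
plot_features(explanation_pair, ncol = 2)
# the model's own probabilities give the direct change you asked about
predict(model_mlp, both, type = "prob")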

Can the R version of lime explain xgboost models with count:poisson objective function?

I generated a model using xgb.train with the "count:poisson" objective function and I get the following error when trying to create the explainer:
Error: Unsupported model type
Lime works when I replace the objective with something else, such as reg:logistic.
Is there a way to explain count:poisson in lime?
Thanks.
Reproducible example:
library(xgboost)
library(dplyr)
library(caret)
library(insuranceData) # example dataset https://cran.r-project.org/web/packages/insuranceData/insuranceData.pdf
library(lime) # Local Interpretable Model-Agnostic Explanations
set.seed(123)
data(dataCar)
mydb <- dataCar %>% select(clm, exposure, veh_value, veh_body,
                           veh_age, gender, area, agecat)
label_var <- "clm"
offset_var <- "exposure"
feature_vars <- mydb %>%
  select(-one_of(c(label_var, offset_var))) %>%
  colnames()
# preparing data for xgboost (one-hot encoding of categorical (factor) data)
myformula <- paste0("~", paste0(feature_vars, collapse = " + ")) %>% as.formula()
dummyFier <- caret::dummyVars(myformula, data = mydb, fullRank = TRUE)
dummyVars.df <- predict(dummyFier, newdata = mydb)
mydb_dummy <- cbind(mydb %>% select(one_of(c(label_var, offset_var))),
                    dummyVars.df)
rm(myformula, dummyFier, dummyVars.df)
feature_vars_dummy <- mydb_dummy %>% select(-one_of(c(label_var, offset_var))) %>% colnames()
xgbMatrix <- xgb.DMatrix(
  data = mydb_dummy %>% select(feature_vars_dummy) %>% as.matrix,
  label = mydb_dummy %>% pull(label_var),
  missing = "NAN")
# model 1: this does not work
myParam <- list(max.depth = 2,
                eta = .01,
                gamma = 0.001,
                objective = 'count:poisson',
                eval_metric = "poisson-nloglik")
booster <- xgb.train(
  params = myParam,
  data = xgbMatrix,
  nround = 50)
explainer <- lime(mydb_dummy %>% select(feature_vars_dummy),
                  model = booster)
explanation <- explain(mydb_dummy %>% select(feature_vars_dummy) %>% head,
                       explainer,
                       n_labels = 1,
                       n_features = 2)
# Error: Unsupported model type
# model 2: this works
myParam2 <- list(max.depth = 2,
                 eta = .01,
                 gamma = 0.001,
                 objective = 'reg:logistic',
                 eval_metric = "logloss")
booster2 <- xgb.train(
  params = myParam2,
  data = xgbMatrix,
  nround = 50)
explainer <- lime(mydb_dummy %>% select(feature_vars_dummy),
                  model = booster2)
explanation <- explain(mydb_dummy %>% select(feature_vars_dummy) %>% head,
                       explainer,
                       n_features = 2)
plot_features(explanation)
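For what it's worth, one workaround that may be viable (an assumption on my side, not verified against the lime version in the question): lime picks the model type by inspecting the xgboost objective, and count:poisson is simply missing from its mapping. Since lime dispatches on the S3 generics model_type() and predict_model(), you can declare the booster a regression model yourself:
# hedged sketch: treat a count:poisson booster as plain regression for lime
model_type.xgb.Booster <- function(x, ...) "regression"
predict_model.xgb.Booster <- function(x, newdata, type, ...) {
  # lime expects a single-column data.frame named Response for regression
  data.frame(Response = predict(x, xgb.DMatrix(as.matrix(newdata))))
}
explainer <- lime(mydb_dummy %>% select(feature_vars_dummy),
                  model = booster)
explanation <- explain(mydb_dummy %>% select(feature_vars_dummy) %>% head,
                       explainer,
                       n_features = 2)
plot_features(explanation)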
