There were missing values in resampled performance measures

I need to do a classification task on this dataset. As the following code shows, I tried to implement xgboost using the caret package. Since my dataset is imbalanced, I prefer to use the F-score as the performance measure. Furthermore, I need to use the first 700,000 instances as the train set and the remaining 150,000 instances as the test set. As the commented parts of my code show, I read this post and other related posts, but I could not solve the issue.
mytrainvalid <- read.csv("mytrainvalid.csv")

library(xgboost)
library(dplyr)
library(caret)

mytrainvalid$DEFAULT <- ifelse(mytrainvalid$DEFAULT != 0, "one", "zero")
mytrainvalid$DEFAULT <- as.factor(mytrainvalid$DEFAULT)
input_x <- as.matrix(select(mytrainvalid, -DEFAULT))

## Use the validation index in the trainControl
ind <- as.integer(rownames(mytrainvalid))
vi <- c(700001:850000)
# modelling
grid_default <- expand.grid(
  nrounds = c(100, 200),
  max_depth = 6,
  eta = 0.1,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)
## use F-score as the data is imbalanced (20:1)
f1 <- function(data, lev = NULL, model = NULL) {
  precision <- posPredValue(data$pred, data$obs, positive = "pass")
  recall <- sensitivity(data$pred, data$obs, positive = "pass")
  f1_val <- (2 * precision * recall) / (precision + recall)
  names(f1_val) <- c("F1")
  f1_val
}
##
data.ctrl <- trainControl(method = "cv",
                          number = 1,
                          allowParallel = TRUE,
                          returnData = FALSE,
                          index = list(Fold1 = (1:ind)[-vi]),
                          sampling = "smote",
                          classProbs = TRUE,
                          summaryFunction = f1,
                          savePredictions = "final",
                          verboseIter = TRUE,
                          search = "random"
                          #savePred = T
                          )
xgb_model <- caret::train(input_x,
                          mytrainvalid$DEFAULT,
                          method = "xgbTree",
                          trControl = data.ctrl,
                          #tuneGrid = grid_default,
                          verbose = FALSE,
                          metric = "F1",
                          classProbs = TRUE,
                          #linout = FALSE,
                          #threshold = 0.3,
                          #scale_pos_weight = sum(input_y$DEFAULT == "no")/sum(input_y$DEFAULT == "yes"),
                          #maximize = FALSE,
                          tuneLength = 2)
Unfortunately, the following error is produced:
Something is wrong; all the F1 metric values are missing:
F1
Min. : NA
1st Qu.: NA
Median : NA
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :2
Error: Stopping
In addition: Warning messages:
1: model fit failed for Fold1: eta=0.09121, max_depth=8, gamma=7.227, colsample_bytree=0.6533, min_child_weight=15, subsample=0.9783, nrounds=800 Error in createModel(x = subset_x(x, modelIndex), y = y[modelIndex], wts = wts[modelIndex], :
formal argument "classProbs" matched by multiple actual arguments
2: model fit failed for Fold1: eta=0.15119, max_depth=8, gamma=8.877, colsample_bytree=0.4655, min_child_weight= 3, subsample=0.9515, nrounds=536 Error in createModel(x = subset_x(x, modelIndex), y = y[modelIndex], wts = wts[modelIndex], :
formal argument "classProbs" matched by multiple actual arguments
3: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
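The warnings point at two separate problems. The model fits fail because classProbs is passed both through trainControl and directly to train(), so caret's createModel receives it twice; it belongs in trainControl only. Even with that fixed, the F1 values would still be missing, because the summary function asks for a class called "pass" while the outcome levels here are "one" and "zero", so posPredValue and sensitivity come back NA. A third, quieter bug: (1:ind)[-vi] uses only the first element of ind (R warns that only the first element of the numerical expression is used), so the training fold collapses to a single row; presumably ind[-vi] was intended. A minimal sketch of the corrected pieces, assuming the rare "one" class is the event of interest:

## summary function keyed to a level that actually exists
f1 <- function(data, lev = NULL, model = NULL) {
  precision <- posPredValue(data$pred, data$obs, positive = "one")
  recall    <- sensitivity(data$pred, data$obs, positive = "one")
  f1_val    <- (2 * precision * recall) / (precision + recall)
  names(f1_val) <- "F1"
  f1_val
}

data.ctrl <- trainControl(method = "cv",
                          number = 1,
                          index = list(Fold1 = ind[-vi]),  # ind[-vi], not (1:ind)[-vi]
                          sampling = "smote",
                          classProbs = TRUE,               # set classProbs here only
                          summaryFunction = f1,
                          savePredictions = "final",
                          verboseIter = TRUE)

xgb_model <- caret::train(input_x,
                          mytrainvalid$DEFAULT,
                          method = "xgbTree",
                          trControl = data.ctrl,  # no classProbs argument here
                          metric = "F1",
                          verbose = FALSE,
                          tuneLength = 2)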

Related

Using R, is there a way to train and cross validate a random forest algorithm with the F1 score?

I have data with class imbalance (the response variable has two classes, and one of them is significantly more common than the other). Accuracy does not seem to be a good metric for training a model in this situation (I can get 99% accuracy and completely misclassify the minority class). I think that using the F1 score would be more beneficial.
Has anyone ever tried using the F1 score as a training metric in R?
I tried modifying the iris data set to make Species a binary variable and ran random forest. Could someone please help me debug this?
library(caret)
library(randomForest)

data(iris)
iris$Species = ifelse(iris$Species == "setosa", "a", "b")
iris$Species = as.factor(iris$Species)

f1 <- function(data, lev = NULL, model = NULL) {
  precision <- posPredValue(data$pred, data$obs, positive = "pass")
  recall <- sensitivity(data$pred, data$obs, positive = "pass")
  f1_val <- (2 * precision * recall) / (precision + recall)
  names(f1_val) <- c("F1")
  f1_val
}
train.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3,
                              classProbs = TRUE,
                              #sampling = "smote",
                              summaryFunction = f1,
                              search = "grid")

tune.grid <- expand.grid(.mtry = seq(from = 1, to = 10, by = 1))

random.forest.orig <- train(Species ~ ., data = iris,
                            method = "rf",
                            tuneGrid = tune.grid,
                            metric = "F1",
                            trControl = train.control)
Gives the following error:
Something is wrong; all the F1 metric values are missing:
F1
Min. : NA
1st Qu.: NA
Median : NA
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :10
Error: Stopping
In addition: There were 50 or more warnings (use warnings() to see the first 50)
5: stop("Stopping", call. = FALSE)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(Species ~ ., data = iris, method = "rf", tuneGrid = tune.grid,
metric = "F1", trControl = train.control)
1: train(Species ~ ., data = iris, method = "rf", tuneGrid = tune.grid,
metric = "F1", trControl = train.control)
> warnings()
Warning messages:
1: In randomForest.default(x, y, mtry = param$mtry, ...) :
invalid mtry: reset to within valid range
Source: Training Model in Caret Using F1 Metric
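The failure here is the same "pass" problem: the outcome levels are "a" and "b", so positive = "pass" makes posPredValue and sensitivity return NA and every resampled F1 is missing. The "invalid mtry" warning is a separate, harmless issue: the grid tries mtry up to 10 while this iris setup has only four predictors, so randomForest resets the out-of-range values. A sketch of the corrected pieces, assuming "a" (setosa) is taken as the positive class:

f1 <- function(data, lev = NULL, model = NULL) {
  # "a" is an actual level of iris$Species after the recoding above
  precision <- posPredValue(data$pred, data$obs, positive = "a")
  recall    <- sensitivity(data$pred, data$obs, positive = "a")
  f1_val    <- (2 * precision * recall) / (precision + recall)
  names(f1_val) <- "F1"
  f1_val
}

# mtry cannot usefully exceed the number of predictors (4 here)
tune.grid <- expand.grid(.mtry = 1:4)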

How can I fix this cross validation error in R?

I am running cross validation on a training dataset in R. I did it with random forest, and now I am working with a decision tree; when I run it, it gives me an error. I ran the cross validation for random forest using 10 and 3 folds. I am following an online lesson to learn data science with R, and I ran into this difficulty, which I have been trying to figure out for hours. The code is:
#cross validation
library(caret)
library(doSNOW)

set.seed(2348)
cv.10.folds <- createMultiFolds(rf.label, k = 10, times = 10)

#check stratification
table(rf.label)
342 / 549

#set up caret's trainControl object per above
ctrl.1 <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                       index = cv.10.folds)
table(rf.label[cv.10.folds[[33]]])

#Set up doSNOW package for multi-core training. This is helpful as we're going
#to be training a lot of trees
cl <- makeCluster(6, type = "SOCK")
registerDoSNOW(cl)

#Set seed for reproducibility and train
set.seed(32384)
rf.4.cv.1 <- train(x = rf.train.4, y = rf.label, method = "rf", tuneLength = 3,
                   ntree = 1000, trControl = ctrl.1)

#Shutdown cluster
stopCluster(cl)

#check out results
rf.4.cv.1
#rework with 3 folds
set.seed(37596)
cv.3.folds <- createMultiFolds(rf.label, k = 3, times = 10)

#set up caret's trainControl object per above
ctrl.3 <- trainControl(method = "repeatedcv", number = 3, repeats = 10,
                       index = cv.3.folds)

#Set up doSNOW package for multi-core training. This is helpful as we're going
#to be training a lot of trees
cl <- makeCluster(6, type = "SOCK")
registerDoSNOW(cl)

#Set seed for reproducibility and train
set.seed(94622)
rf.3.cv.1 <- train(x = rf.train.3, y = rf.label, method = "rf", tuneLength = 3,
                   ntree = 1000, trControl = ctrl.3)

#Shutdown cluster
stopCluster(cl)

#check out results
rf.3.cv.1
# Using a single decision tree to better understand what's going on with the features
library(rpart)
library(rpart.plot)

#Using 3-fold cross validation repeated 10 times
#create utility function
rpart.cv <- function(seed, training, labels, ctrl) {
  cl <- makeCluster(6, type = "SOCK")
  registerDoSNOW(cl)
  set.seed(seed)
  #Train on the supplied features and labels
  rpart.cv <- train(x = training, y = labels, method = "rpart", tuneLength = 30,
                    trControl = ctrl)
  #Shutdown cluster
  stopCluster(cl)
  return(rpart.cv)
}

#Grab features
features <- c("Pclass", "title", "family.size")
rpart.train.1 <- data.combined[1:891, features]

#Run cross validation and check out results
rpart.1.cv.1 <- rpart.cv(94622, rpart.train.1, rf.label, ctrl.3)
rpart.1.cv.1

#Plot
prp(rpart.1.cv.1$finalModel, type = 0, extra = 1, under = TRUE)
When I ran it, I got this error message:
Something is wrong; all the Accuracy metric values are missing:
Accuracy Kappa
Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA
Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA
NA's :3 NA's :3
Error: Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
  There were missing values in resampled performance measures.
> rpart.1.cv.1
Error: object 'rpart.1.cv.1' not found
I was able to solve it with:
rpart.cv <- rpart(Survived ~ Pclass + title + family.size,
                  data = data.combined, method = "class",
                  parms = list(split = "gini"),
                  control = rpart.control(cp = .2, minsplit = 5,
                                          minbucket = 5, maxdepth = 10))
rpart.plot(rpart.cv, cex = .5, extra = 4)
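The workaround above drops cross-validation entirely and fits a single rpart tree. If the cross-validated caret run is still wanted, a common alternative is to tune rpart's complexity parameter cp through an explicit grid; a sketch, assuming rpart.train.1, rf.label (a factor), and ctrl.3 from the listing above are in scope:

# method = "rpart" tunes the complexity parameter cp
rpart.grid <- expand.grid(cp = seq(0.001, 0.2, length.out = 10))
rpart.1.cv.1 <- train(x = rpart.train.1, y = rf.label, method = "rpart",
                      tuneGrid = rpart.grid, trControl = ctrl.3)
rpart.1.cv.1

# plot the cross-validated tree as before
prp(rpart.1.cv.1$finalModel, type = 0, extra = 1, under = TRUE)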

Using F1 score metric in KNN through caret package

I am attempting to use the F1 score to determine which value of k maximises the model's performance for its given purpose. The model is built with the train function from the caret package.
Example dataset: https://www.kaggle.com/lachster/churndata
My current code includes the following (as the function for the F1 score):
f1 <- function(data, lev = NULL, model = NULL) {
  precision <- posPredValue(data$pred, data$obs, positive = "pass")
  recall <- sensitivity(data$pred, data$obs, positive = "pass")
  f1_val <- (2 * precision * recall) / (precision + recall)
  names(f1_val) <- c("F1")
  f1_val
}
The following as train control:
train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                              summaryFunction = f1, search = "grid")
And the following as my final execution of the train command:
x <- train(CHURN ~ .,
           data = experiment,
           method = "knn",
           tuneGrid = expand.grid(.k = 1:30),
           metric = "F1",
           trControl = train.control)
Please note that the model is attempting to predict the churn rate from a set of telco customers.
The execution returns the following result:
Something is wrong; all the F1 metric values are missing:
F1
Min. : NA
1st Qu.: NA
Median : NA
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :30
Error in train.default(x, y, weights = w, ...) : Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
EDIT: Thanks to help from missuse, my code now looks like the following, but it returns this error:
levels(exp2$CHURN) <- make.names(levels(factor(exp2$CHURN)))
library(mlbench)

train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                              summaryFunction = prSummary, classProbs = TRUE)
knn_fit <- train(CHURN ~ ., data = exp2, method = "knn",
                 trControl = train.control,
                 preProcess = c("center", "scale"),
                 tuneLength = 15, metric = "F")
The error:
Error in trainControl(method = "repeatedcv", number = 10, repeats = 3, :
object 'prSummary' not found
caret contains a summary function, prSummary, that provides the F1 score (the "object 'prSummary' not found" error above usually just means caret was not loaded in that session). Full example:
library(caret)
library(mlbench)

data(Sonar)

train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                              summaryFunction = prSummary, classProbs = TRUE)
knn_fit <- train(Class ~ ., data = Sonar, method = "knn",
                 trControl = train.control,
                 preProcess = c("center", "scale"),
                 tuneLength = 15,
                 metric = "F")
knn_fit
#output
k-Nearest Neighbors
208 samples
60 predictor
2 classes: 'M', 'R'
Pre-processing: centered (60), scaled (60)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 187, 188, 187, 188, 187, 187, ...
Resampling results across tuning parameters:
k AUC Precision Recall F
5 0.3582687 0.7936713 0.9065657 0.8414592
7 0.4985709 0.7758271 0.8883838 0.8239438
9 0.6632328 0.7484092 0.8853535 0.8089210
11 0.7426320 0.7151175 0.8676768 0.7814297
13 0.7388742 0.6883105 0.8646465 0.7641392
15 0.7594436 0.6787983 0.8467172 0.7520524
17 0.7583071 0.6909693 0.8527778 0.7616448
19 0.7702208 0.6913001 0.8585859 0.7644433
21 0.7642698 0.6962528 0.8707071 0.7719442
23 0.7652370 0.6945755 0.8707071 0.7696863
25 0.7606508 0.6929364 0.8707071 0.7683987
27 0.7454728 0.6916762 0.8676768 0.7669464
29 0.7551679 0.6900416 0.8707071 0.7676640
31 0.7603099 0.6935720 0.8828283 0.7749490
33 0.7614621 0.6938805 0.8770202 0.7728923
F was used to select the optimal model using the largest value.
The final value used for the model was k = 5.
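For a post-hoc check on held-out data, caret's confusionMatrix can report precision, recall, and F1 directly; a sketch, assuming a simple 80/20 split of Sonar (createDataPartition keeps the class proportions):

set.seed(42)
idx <- createDataPartition(Sonar$Class, p = 0.8, list = FALSE)
fit <- train(Class ~ ., data = Sonar[idx, ], method = "knn",
             trControl = train.control,
             preProcess = c("center", "scale"),
             tuneLength = 15, metric = "F")
preds <- predict(fit, newdata = Sonar[-idx, ])

# mode = "prec_recall" reports Precision, Recall, and F1 directly
confusionMatrix(preds, Sonar$Class[-idx], mode = "prec_recall")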

Error "variable lengths differ" when assigning weights parameter in caret (R)

I want to apply weighted observations in caret using the code below:
model_weights <- ifelse(train$y == 0,
                        (1/table(train$y)[1]) * 0.5,
                        (1/table(train$y)[2]) * 0.5)

xgbT <- train(x = as.matrix(train[,-21]), y = make.names(as.factor(train$y)),
              method = "xgbTree",
              trControl = cctrl1,
              metric = "MCC",
              maximize = TRUE,
              weights = model_weights,
              preProc = c("center", "scale"),
              tuneGrid = expand.grid(nrounds = c(150),          #number of trees
                                     max_depth = c(7),          #max tree depth
                                     eta = c(0.03),             #learning rate
                                     gamma = c(0.3),            #min split loss
                                     colsample_bytree = c(0.7),
                                     min_child_weight = c(10, 1, 5), #min number of instances in a leaf
                                     subsample = c(0.6)),       #subsample ratio of the training instances
              early_stop_round = c(3),  #stop if no improvement over the specified rounds
              objective = c("binary:logistic"),
              silent = 0)
However, it gives me this error:
Error in model.frame.default(formula = .outcome ~ ., data = dat, weights = wts) :
  variable lengths differ (found for '(weights)')
Though I have checked that their lengths are the same with the code below:
> table(model_weights)
model_weights
0.0000277654375832963  0.000231481481481481
                18008                  2160
> table(train$y)
    0     1
18008  2160
Any idea how to fix this?
NOTE: I can run the train function without the weights parameter.
After further debugging, I found that the problem is the sampling I apply in cctrl1: the lengths differ because I generate the weights before caret applies its re-sampling.
So you can fix this by simply removing sampling from your trControl. If you still want re-sampling, you have to re-sample the data yourself before running the code below:
model_weights <- ifelse(train$y == 0,
                        (1/table(train$y)[1]) * 0.5,
                        (1/table(train$y)[2]) * 0.5)
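A sketch of that order of operations, using caret's downSample for illustration (upSample works the same way); the sampling argument would then be dropped from cctrl1 so the data are not re-sampled twice:

library(caret)

# re-sample the training data first ...
down_train <- downSample(x = train[, -21], y = as.factor(train$y), yname = "y")

# ... then compute the weights on the re-sampled data, so their length
# matches what train() actually sees
model_weights <- ifelse(down_train$y == 0,
                        (1/table(down_train$y)[1]) * 0.5,
                        (1/table(down_train$y)[2]) * 0.5)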

GBM and Caret package: invalid number of intervals

Though I am defining target <- factor(train$target, levels = c(0, 1)), the code below produces this error:
Error in cut.default(y, unique(quantile(y, probs = seq(0, 1, length = cuts))), :
  invalid number of intervals
In addition: Warning messages:
1: In train.default(x, y, weights = w, ...) :
  cannnot compute class probabilities for regression
What does it mean and how can I fix it?
gbmGrid <- expand.grid(n.trees = (1:30)*10,
                       interaction.depth = c(1, 5, 9),
                       shrinkage = 0.1)

fitControl <- trainControl(method = "repeatedcv",
                           number = 5,
                           repeats = 5,
                           verboseIter = FALSE,
                           returnResamp = "all",
                           classProbs = TRUE)

target <- factor(train$target, levels = c(0, 1))

gbm <- caret::train(target ~ .,
                    data = train,
                    #distribution = "gaussian",
                    method = "gbm",
                    trControl = fitControl,
                    tuneGrid = gbmGrid)

prob = predict(gbm, newdata = testing, type = 'prob')[,2]
First, don't do this:
target <- factor(train$target, levels = c(0, 1))
You will get a warning:
At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1
Second, you created a separate object called target. With the formula method, train uses the column called target inside the data frame train, and those are different data: the column is still numeric 0/1, so caret runs a regression, which is exactly what the warning says. Modify the column itself.
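A sketch of the fix, converting the column itself to a factor with make.names-safe levels (the labels X0/X1 here are just an assumption; any valid R names work):

# fix the column inside the data frame, with levels that are valid R names
train$target <- factor(train$target, levels = c(0, 1), labels = c("X0", "X1"))

gbm_fit <- caret::train(target ~ ., data = train,
                        method = "gbm",
                        trControl = fitControl,
                        tuneGrid = gbmGrid,
                        verbose = FALSE)
prob <- predict(gbm_fit, newdata = testing, type = "prob")[, 2]

Note that current caret versions also expect an n.minobsinnode column in a gbm tuning grid, so the grid above may need gbmGrid$n.minobsinnode <- 10.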
