I am attempting to use the F1 score to determine which value of k maximises the model's performance for its given purpose. The model is built with the train function from the caret package.
Example dataset: https://www.kaggle.com/lachster/churndata
My current code includes the following as the F1 score function:
f1 <- function(data, lev = NULL, model = NULL) {
  precision <- posPredValue(data$pred, data$obs, positive = "pass")
  recall <- sensitivity(data$pred, data$obs, positive = "pass")
  f1_val <- (2 * precision * recall) / (precision + recall)
  names(f1_val) <- c("F1")
  f1_val
}
The following as train control:
train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                              summaryFunction = f1, search = "grid")
And the following as my final execution of the train command:
x <- train(CHURN ~ .,
           data = experiment,
           method = "knn",
           tuneGrid = expand.grid(.k = 1:30),
           metric = "F1",
           trControl = train.control)
Please note that the model is attempting to predict churn for a set of telco customers.
The execution returns the following result:
Something is wrong; all the F1 metric values are missing:
F1
Min. : NA
1st Qu.: NA
Median : NA
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :30
Error in train.default(x, y, weights = w, ...) : Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
EDIT: Thanks to help from missuse, my code now looks like the following, but it returns this error:
levels(exp2$CHURN) <- make.names(levels(factor(exp2$CHURN)))
library(mlbench)
train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                              summaryFunction = prSummary, classProbs = TRUE)
knn_fit <- train(CHURN ~ ., data = exp2, method = "knn",
                 trControl = train.control,
                 preProcess = c("center", "scale"),
                 tuneLength = 15, metric = "F")
The error:
Error in trainControl(method = "repeatedcv", number = 10, repeats = 3, :
object 'prSummary' not found
caret contains a summary function, prSummary, that provides the F1 score. prSummary is exported by caret, so the object 'prSummary' not found error above usually means caret was not loaded (or is an old version); the example below loads it explicitly. Full example:
library(caret)
library(mlbench)
data(Sonar)
train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                              summaryFunction = prSummary, classProbs = TRUE)
knn_fit <- train(Class ~ ., data = Sonar, method = "knn",
                 trControl = train.control,
                 preProcess = c("center", "scale"),
                 tuneLength = 15,
                 metric = "F")
knn_fit
#output
k-Nearest Neighbors
208 samples
60 predictor
2 classes: 'M', 'R'
Pre-processing: centered (60), scaled (60)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 187, 188, 187, 188, 187, 187, ...
Resampling results across tuning parameters:
k AUC Precision Recall F
5 0.3582687 0.7936713 0.9065657 0.8414592
7 0.4985709 0.7758271 0.8883838 0.8239438
9 0.6632328 0.7484092 0.8853535 0.8089210
11 0.7426320 0.7151175 0.8676768 0.7814297
13 0.7388742 0.6883105 0.8646465 0.7641392
15 0.7594436 0.6787983 0.8467172 0.7520524
17 0.7583071 0.6909693 0.8527778 0.7616448
19 0.7702208 0.6913001 0.8585859 0.7644433
21 0.7642698 0.6962528 0.8707071 0.7719442
23 0.7652370 0.6945755 0.8707071 0.7696863
25 0.7606508 0.6929364 0.8707071 0.7683987
27 0.7454728 0.6916762 0.8676768 0.7669464
29 0.7551679 0.6900416 0.8707071 0.7676640
31 0.7603099 0.6935720 0.8828283 0.7749490
33 0.7614621 0.6938805 0.8770202 0.7728923
F was used to select the optimal model using the largest value.
The final value used for the model was k = 5.
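For reference, the reason the custom f1 summary in the question produced all-NA values is that positive = "pass" is not one of the CHURN factor levels, so posPredValue and sensitivity come back as NA for every resample. A minimal sketch of a level-agnostic version, using the lev argument that caret passes to every summary function (note also that prSummary itself requires the MLmetrics package to be installed):

# a minimal sketch: take the positive class from lev (the outcome levels
# supplied by caret) instead of hard-coding a level such as "pass"
f1 <- function(data, lev = NULL, model = NULL) {
  precision <- posPredValue(data$pred, data$obs, positive = lev[1])
  recall <- sensitivity(data$pred, data$obs, positive = lev[1])
  f1_val <- (2 * precision * recall) / (precision + recall)
  names(f1_val) <- "F1"
  f1_val
}

With this version the original trainControl(summaryFunction = f1) call should work, and metric = "F1" in train() matches the name set on the return value.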
Related
I need to do a classification task on this dataset. As the code below shows, I tried to implement xgboost using the caret package. Since my dataset is imbalanced, I prefer to use the F-score as the performance measure. Furthermore, I need to use the first 700,000 instances as the training set and the remaining 150,000 instances as the test set. As the commented parts of my code show, I read this post and other related posts. However, I could not solve the issue.
mytrainvalid <- read.csv("mytrainvalid.csv")
library(xgboost)
library(dplyr)
library(caret)
mytrainvalid$DEFAULT <- ifelse(mytrainvalid$DEFAULT != 0, "one", "zero")
mytrainvalid$DEFAULT <- as.factor(mytrainvalid$DEFAULT)
input_x <- as.matrix(select(mytrainvalid, -DEFAULT))
## Use the validation index in the trainControl
ind = as.integer(rownames(mytrainvalid))
vi = c(700001:850000)
# modelling
grid_default <- expand.grid(
  nrounds = c(100, 200),
  max_depth = 6,
  eta = 0.1,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)
## use the F-score since the data are imbalanced (20:1)
f1 <- function(data, lev = NULL, model = NULL) {
  precision <- posPredValue(data$pred, data$obs, positive = "pass")
  recall <- sensitivity(data$pred, data$obs, postive = "pass")
  f1_val <- (2 * precision * recall) / (precision + recall)
  names(f1_val) <- c("F1")
  f1_val
}
##
data.ctrl <- trainControl(method = "cv",
                          number = 1,
                          allowParallel = TRUE,
                          returnData = FALSE,
                          index = list(Fold1 = (1:ind)[-vi]),
                          sampling = "smote",
                          classProbs = TRUE,
                          summaryFunction = f1,
                          savePredictions = "final",
                          verboseIter = TRUE,
                          search = "random"
                          #savePred=T
)
xgb_model <- caret::train(input_x,
                          mytrainvalid$DEFAULT,
                          method = "xgbTree",
                          trControl = data.ctrl,
                          #tuneGrid = grid_default,
                          verbose = FALSE,
                          metric = "F1",
                          classProbs = TRUE,
                          #linout = FALSE,
                          #threshold = 0.3,
                          #scale_pos_weight = sum(input_y$DEFAULT == "no")/sum(input_y$DEFAULT == "yes"),
                          #maximize = FALSE,
                          tuneLength = 2)
Unfortunately, the following error is produced:
Something is wrong; all the F1 metric values are missing:
F1
Min. : NA
1st Qu.: NA
Median : NA
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :2
Error: Stopping
In addition: Warning messages:
1: model fit failed for Fold1: eta=0.09121, max_depth=8, gamma=7.227, colsample_bytree=0.6533, min_child_weight=15, subsample=0.9783, nrounds=800 Error in createModel(x = subset_x(x, modelIndex), y = y[modelIndex], wts = wts[modelIndex], :
formal argument "classProbs" matched by multiple actual arguments
2: model fit failed for Fold1: eta=0.15119, max_depth=8, gamma=8.877, colsample_bytree=0.4655, min_child_weight= 3, subsample=0.9515, nrounds=536 Error in createModel(x = subset_x(x, modelIndex), y = y[modelIndex], wts = wts[modelIndex], :
formal argument "classProbs" matched by multiple actual arguments
3: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
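Two things stand out here, both hedged guesses from the error messages rather than a tested fix. First, classProbs is a trainControl() argument; passing it to train() as well makes caret forward it to the model-fitting code twice, which is what "formal argument "classProbs" matched by multiple actual arguments" means. Second, the f1 summary uses positive = "pass" (and misspells it as postive), but the DEFAULT levels are "one" and "zero", which yields the all-NA F1 values. A sketch of both fixes:

# corrected summary: use an actual level of DEFAULT as the positive class
f1 <- function(data, lev = NULL, model = NULL) {
  precision <- posPredValue(data$pred, data$obs, positive = "one")
  recall <- sensitivity(data$pred, data$obs, positive = "one")
  f1_val <- (2 * precision * recall) / (precision + recall)
  names(f1_val) <- "F1"
  f1_val
}
# (re-run the trainControl() call above after redefining f1 so it picks up the fix)

# classProbs = TRUE stays in data.ctrl only; it is dropped from train()
xgb_model <- caret::train(input_x,
                          mytrainvalid$DEFAULT,
                          method = "xgbTree",
                          trControl = data.ctrl,
                          verbose = FALSE,
                          metric = "F1",
                          tuneLength = 2)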
I used caret to train an mlp model with this code.
library(datasets)
library(MASS)
library(caret)
DP = caret::createDataPartition(Boston$medv, p=0.75, list = F)
train = Boston[DP,]
test = Boston[-DP,]
colnames(train) = colnames(Boston)
colnames(test) = colnames(Boston)
mlp = caret::train(medv ~ ., data = Boston, method = "mlp",
                   trControl = trainControl(method = "cv", number = 3),
                   tuneGrid = expand.grid(size = 1:3), linOut = T, metric = "RMSE")
Yp = caret::predict.train(mlp, test[,1:13])
I got this error message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.
Please, can someone help me understand why I got this error?
The R-squared value comes out as NA for some runs; you can check the output:
set.seed(111)
mlp = caret::train(medv ~ ., data = Boston, method = "mlp",
                   trControl = trainControl(method = "cv", number = 3),
                   tuneGrid = expand.grid(size = 1:3), linOut = T, metric = "RMSE")
mlp$results
size RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 1 9.152376 NaN 6.640184 0.9213123 NA 0.6877405
2 2 14.353732 0.000965434 12.274448 8.3227673 NA 8.6994894
3 3 12.701064 NaN 10.988850 3.2658958 NA 3.6549478
Note that even for the models that do work, the R-squared is very low. There are two problems with the model: 1) the size may be too small, and 2) the data are not scaled, so the predictions collapse to a single value and R-squared becomes meaningless:
Yp = caret::predict.train(mlp, test[,1:13])
table(Yp)
Yp
20.0358009338379
125
Try something like this:
mlp = caret::train(medv ~ ., data = Boston, method = "mlp",
                   trControl = trainControl(method = "cv", number = 3),
                   preProcess = c("center", "scale"),
                   tuneGrid = expand.grid(size = 3:5), linOut = T, metric = "RMSE")
mlp
Multi-Layer Perceptron
506 samples
13 predictor
Pre-processing: centered (13), scaled (13)
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 337, 338, 337
Resampling results across tuning parameters:
size RMSE Rsquared MAE
3 7.926669 0.3291762 5.619198
4 6.976707 0.4913297 5.130273
5 6.894459 0.5188481 5.040821
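As a quick follow-up check (hypothetical, reusing the earlier train/test split from the question): with scaling, the predictions should no longer collapse to a single value, and caret's postResample gives the hold-out RMSE, R-squared and MAE:

Yp = predict(mlp, test[, 1:13])
length(unique(Yp))          # should now be well above 1
postResample(Yp, test$medv) # RMSE, Rsquared, MAE on the test set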
Error: nrow(x) == n is not TRUE
I am not sure what "n" is referring to in this case. Here is the code throwing the error:
# BUILD MODEL
set.seed(9353)
elastic_net_model <- train(x = predictors, y = y,
                           method = "glmnet",
                           family = "binomial",
                           preProcess = c("scale"),
                           tuneLength = 10,
                           metric = "ROC",
                           # metric = "Spec",
                           trControl = train_control)
The main problem that others were running into with this error is that their y variable was not a factor or numeric; they were often passing it as a matrix or data frame. I explicitly make my y a factor, shown here:
# Make sure that the outcome variable is a two-level factor
dfBlocksAll$trophout1 = as.factor(dfBlocksAll$trophout1)
# Set levels for dfBlocksAll$trophout1
levels(dfBlocksAll$trophout1) <- c("NoTrophy", "Trophy")
# Split the data into training and test set, 70/30 split
set.seed(1934)
index <- createDataPartition(y = dfBlocksAll$trophout1, p = 0.70, list = FALSE)
training <- dfBlocksAll[index, ]
testing <- dfBlocksAll[-index, ]
# This step is the heart of the process
y <- dfBlocksAll$trophout1 # outcome variable - did they get a trophy or not?
predictors <- training[,which(colnames(training) != "trophout1")]
The only other potentially relevant code that comes before the block throwing the error is this:
train_control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 10,
                              # sampling = "down",
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary,
                              allowParallel = TRUE,
                              savePredictions = "final",
                              verboseIter = FALSE)
Since my y is already a factor, I assume that my error has something to do with the x, not the y. As you can see from the code, my x is a data frame called "predictors". This data frame contains 768 observations of 67 variables and is filled with characters and numerics.
Your response variable has to come from the training set; here I use an example dataset:
dfBlocksAll = data.frame(matrix(runif(1000),ncol=10))
dfBlocksAll$trophout1 = factor(sample(c("NoTrophy", "Trophy"),100,replace=TRUE))
index <- createDataPartition(y = dfBlocksAll$trophout1, p = 0.70, list = FALSE)
training <- dfBlocksAll[index, ]
testing <- dfBlocksAll[-index, ]
And this part should be changed so that y is taken from training; otherwise length(y) is the full-data row count while x has only the training rows, which is exactly the mismatch that nrow(x) == n checks:
y <- training$trophout1
predictors <- training[,which(colnames(training) != "trophout1")]
And the rest runs fine:
elastic_net_model <- train(x = predictors, y = y,
                           method = "glmnet",
                           family = "binomial",
                           preProcess = c("scale"),
                           tuneLength = 10,
                           metric = "ROC",
                           trControl = train_control)
elastic_net_model
glmnet
71 samples
10 predictors
2 classes: 'NoTrophy', 'Trophy'
Pre-processing: scaled (10)
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 65, 64, 64, 63, 64, 64, ...
Resampling results across tuning parameters:
alpha lambda ROC Sens Spec
0.1 0.0003090198 0.5620833 0.5908333 0.51666667
0.1 0.0007138758 0.5620833 0.5908333 0.51666667
0.1 0.0016491457 0.5614583 0.5908333 0.51083333
0.1 0.0038097407 0.5594444 0.5933333 0.51083333
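A quick way to see the original mismatch is to compare the length of y with the number of rows of x before calling train(); with the example data above (counts correspond to that toy dataset):

nrow(predictors)              # 71: rows in the training predictors
length(training$trophout1)    # 71: matches, so train() is happy
length(dfBlocksAll$trophout1) # 100: the original y, hence nrow(x) == n failed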
I have data with class imbalance: the response variable has two classes, and one class is significantly more common than the other. Accuracy does not seem to be a good metric for training a model in this situation (I can get 99% accuracy while completely misclassifying the minority class). I think that using the F1 score would be more beneficial.
Has anyone ever tried using the F1 score as a training metric in R?
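To make the accuracy trap concrete, here is a small illustration (a toy sketch, not from the original data): with a 99:1 split, always predicting the majority class scores 99% accuracy while recall for the minority class, and hence F1, collapses:

library(caret)

# toy 99:1 imbalance; the "classifier" always predicts the majority class
obs <- factor(c(rep("maj", 99), "min"), levels = c("min", "maj"))
pred <- factor(rep("maj", 100), levels = c("min", "maj"))

mean(pred == obs)                        # accuracy: 0.99
sensitivity(pred, obs, positive = "min") # minority recall: 0, so F1 breaks down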
I tried modifying the iris data set to make Species a binary variable and ran random forest. Could someone please help me debug this?
library(caret)
library(randomForest)
data(iris)
iris$Species = ifelse(iris$Species == "setosa", "a", "b")
iris$Species = as.factor(iris$Species)
f1 <- function(data, lev = NULL, model = NULL) {
  precision <- posPredValue(data$pred, data$obs, positive = "pass")
  recall <- sensitivity(data$pred, data$obs, postive = "pass")
  f1_val <- (2 * precision * recall) / (precision + recall)
  names(f1_val) <- c("F1")
  f1_val
}
train.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3,
                              classProbs = TRUE,
                              #sampling = "smote",
                              summaryFunction = f1,
                              search = "grid")
tune.grid <- expand.grid(.mtry = seq(from = 1, to = 10, by = 1))
random.forest.orig <- train(Species ~ ., data = iris,
                            method = "rf",
                            tuneGrid = tune.grid,
                            metric = "F1",
                            trControl = train.control)
Gives the following error:
Something is wrong; all the F1 metric values are missing:
F1
Min. : NA
1st Qu.: NA
Median : NA
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :10
Error: Stopping
In addition: There were 50 or more warnings (use warnings() to see the first 50)
5: stop("Stopping", call. = FALSE)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(Species ~ ., data = iris, method = "rf", tuneGrid = tune.grid,
metric = "F1", trControl = train.control)
1: train(Species ~ ., data = iris, method = "rf", tuneGrid = tune.grid,
metric = "F1", trControl = train.control)
> warnings()
Warning messages:
1: In randomForest.default(x, y, mtry = param$mtry, ...) :
invalid mtry: reset to within valid range
Source: Training Model in Caret Using F1 Metric
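For what it's worth, the failure mode here is the same as in the churn question at the top: positive = "pass" (note also the postive typo) is not a level of the outcome, whose levels are "a" and "b", so every resample yields NA. The separate invalid mtry warning appears because the grid runs to 10 while iris has only four predictors. A minimal sketch of the corrected summary, assuming "a" (setosa) is the class of interest:

f1 <- function(data, lev = NULL, model = NULL) {
  # "a" is an actual level of iris$Species after the recoding above
  precision <- posPredValue(data$pred, data$obs, positive = "a")
  recall <- sensitivity(data$pred, data$obs, positive = "a")
  f1_val <- (2 * precision * recall) / (precision + recall)
  names(f1_val) <- "F1"
  f1_val
}

Restricting the grid to expand.grid(.mtry = 1:4) should also silence the mtry warning.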
I am running cross-validation on a training dataset in R. I did it with random forest, and now I am working with a decision tree; when I run it, it gives me an error. I ran the cross-validation for random forest using 10 and 3 folds. I am following an online lesson to learn data science with R, and I ran into this difficulty, which I have been trying to figure out for hours. The code is:
#cross validation
library(caret)
library(doSNOW)
set.seed(2348)
cv.10.folds <- createMultiFolds(rf.label, k=10, times = 10)
#check stratification
table(rf.label)
342 / 549  # ratio of the two class counts from table(rf.label)
#set up caret's trainControl object per above
ctrl.1 <- trainControl(method = "repeatedcv", number = 10, repeats = 10, index = cv.10.folds)
table(rf.label[cv.10.folds[[33]]])
#Set up doSNOW package for multi-core training. This is helpful as we're going
#to be training a lot of trees
cl <- makeCluster(6, type = "SOCK")
registerDoSNOW(cl)
#Set seed for reproducibility and train
set.seed(32384)
rf.4.cv.1 <- train(x = rf.train.4, y = rf.label, method = "rf", tunelength = 3,
                   ntree = 1000, trControl = ctrl.1)
#Shutdown cluster
stopCluster(cl)
#check out results
rf.4.cv.1
#rework with 3 folds
set.seed(37596)
cv.3.folds <- createMultiFolds(rf.label, k=3, times = 10)
#set up caret's trainControl object per above
ctrl.3 <- trainControl(method = "repeatedcv", number = 3, repeats = 10, index = cv.3.folds)
#Set up doSNOW package for multi-core training. This is helpful as we're going
#to be training a lot of trees
cl <- makeCluster(6, type = "SOCK")
registerDoSNOW(cl)
#Set seed for reproducibility and train
set.seed(94622)
rf.3.cv.1 <- train(x = rf.train.3, y = rf.label, method = "rf", tunelength = 3,
                   ntree = 1000, trControl = ctrl.3)
#Shutdown cluster
stopCluster(cl)
#check out results
rf.3.cv.1
# Using single Decision tree to better understand what's going on with the features
library(rpart)
library(rpart.plot)
#Using 3 fold cross validation repeated 10 times
#create utility function
rpart.cv <- function(seed, training, labels, ctrl) {
  cl <- makeCluster(6, type = "SOCK")
  registerDoSNOW(cl)
  set.seed(seed)
  #Leverage formula interface for training
  rpart.cv <- train(x = training, y = labels, method = "rpart", tunelength = 30,
                    trControl = ctrl)
  #Shutdown cluster
  stopCluster(cl)
  return(rpart.cv)
}
#Grab features
features <- c("Pclass", "title", "family.size")
rpart.train.1 <- data.combined[1:891, features]
#Run cross validation and check out results
rpart.1.cv.1 <- rpart.cv(94622, rpart.train.1, rf.label, ctrl.3)
rpart.1.cv.1
#Plot
prp(rpart.1.cv.1$finalModel, type = 0, extra = 1, under = TRUE)
When I ran it, I got this error message:
Something is wrong; all the Accuracy metric values are missing:
Accuracy Kappa
Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA
Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA
NA's :3 NA's :3
Error: Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
  There were missing values in resampled performance measures.
> rpart.1.cv.1
Error: object 'rpart.1.cv.1' not found
I was able to solve it with:
method = "class", parms = list(split = "Gini"), data =data.combined, control = rpart.control(cp)= .2, minsplit =5, minibucket = 5, maxdepth =10)
rpart.cv <- rpart(Survived~ Pclass + title + family.size,
data = data.combined, method = "class")
rpart.plot(rpart.cv, cex =.5, extra =4)
``