I have an imbalanced dataset with only 87 target events "F" out of 496,978 observations. Since I would like to see a rule/tree, I chose tree-based models and have been following the code in the book "Applied Predictive Modeling" by Dr. Max Kuhn, where chapter 16 addresses this imbalance issue well.
Here is the sample data structure:
str(training[,predictors])
'data.frame': 496978 obs. of 36 variables:
$ Point_Of_Sale_Code : Factor w/ 5 levels "c0","c2","c90",..: 3 3 5 5 3 3 5 5 5 5 ...
$ Delinquent_Amount : num 0 0 0 0 0 0 0 0 0 0 ...
$ Delinquent_Days_Count : num 0 0 0 0 0 0 0 0 0 0 ...
$ Overlimit_amt : num 0 0 0 0 0 0 0 0 0 0 ...
I tried down-sampling with a random forest and it works well, with a good AUC = 0.9997 on the test data and this confusion matrix:
          Reference
Prediction      N  F
         N 140526  0
         F   1442 24
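For reference, the down-sampling step can be done before fitting the forest. Here is a minimal sketch of the idea (not my exact code; caret::downSample() is just one convenient way to do it, and the settings are illustrative):

library(caret)
library(randomForest)

# Down-sample the majority class so both classes end up with the minority-class
# count, then fit the forest on the balanced data.
set.seed(1401)
downTrain <- downSample(x = training[, predictors],
                        y = training$flag,
                        yname = "flag")
table(downTrain$flag)

rfDown <- randomForest(flag ~ ., data = downTrain, ntree = 500)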
However, random forest does not give me a specific rule, so I tried the code from the book exactly as written:
library(rpart)
library(e1071)
initialRpart <- rpart(flag ~ ., data = training,
                      control = rpart.control(cp = 0.0001))
rpartGrid <- data.frame(.cp = initialRpart$cptable[, "CP"])
cmat <- list(loss = matrix(c(0, 1, 20, 0), ncol = 2))
set.seed(1401)
cartWMod1 <- train(x = training[, predictors],
                   y = training$flag,
                   method = "rpart",
                   trControl = ctrlNoProb,
                   tuneGrid = rpartGrid,
                   metric = "Kappa",
                   parms = cmat)
cartWMod1
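For completeness, ctrlNoProb comes from earlier in the chapter; roughly, it is a resampling control without class probabilities, something along these lines (an approximation, not the book's exact definition):

library(caret)
ctrlNoProb <- trainControl(method = "cv", number = 10,
                           classProbs = FALSE,
                           summaryFunction = defaultSummary)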
I get the error message below every time, no matter what I try (for example, converting all integer columns to numeric), and I am not sure why I get this warning message:
Warning message:
In ni[1:m] * nj[1:m] : NAs produced by integer overflow

Aggregating results
Selecting tuning parameters
Error in train.default(x = training[, predictors], y = training$flag, :
  final tuning parameters could not be determined
I also tried the code for the C5.0 package:
library(C50)
c5Grid <- expand.grid(.model = c("tree", "rules"),
                      .trials = c(1, (1:10) * 10),
                      .winnow = FALSE)
finalCost <- matrix(c(0, 150, 1, 0), ncol = 2)
rownames(finalCost) <- colnames(finalCost) <- levels(training$flag)
set.seed(1401)
C5CostFit1 <- train(training[, predictors],
                    training$flag,
                    method = "C5.0",
                    metric = "Kappa",
                    tuneGrid = c5Grid,
                    cost = finalCost,
                    control = C5.0Control(earlyStopping = FALSE),
                    trControl = ctrlNoProb)
C5CostCM1 <- confusionMatrix(predict(C5CostFit1, training), training$flag)
I got the result below, which classifies all of the target events F as the nonevent N. Is it possible to increase the cost penalty from 150 to something larger to fix this issue? Thank you!
C5CostCM1
Confusion Matrix and Statistics
          Reference
Prediction      N  F
         N 141968 24
         F      0  0
Accuracy : 0.9998
95% CI : (0.9997, 0.9999)
No Information Rate : 0.9998
P-Value [Acc > NIR] : 0.554
Kappa : NA
Mcnemar's Test P-Value : 2.668e-06
Sensitivity : 1.0000
Specificity : 0.0000
Pos Pred Value : 0.9998
Neg Pred Value : NaN
Prevalence : 0.9998
Detection Rate : 0.9998
Detection Prevalence : 1.0000
Balanced Accuracy : 0.5000
'Positive' Class : N
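For example, would something along these lines be a reasonable next step? (The value 500 is just a guess, and I am not sure which row/column orientation C5.0 expects, so the construction simply mirrors finalCost above.)

# Hypothetical heavier false-negative penalty (500 instead of 150), built the
# same way as finalCost above.
heavierCost <- matrix(c(0, 500, 1, 0), ncol = 2)
rownames(heavierCost) <- colnames(heavierCost) <- levels(training$flag)
heavierCost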
I have been googling this issue for the past week but have not found a solution. The code from the book works well, yet it gives me errors on my data... Any suggestion will be appreciated! Thank you so much!
I think it's telling you that something in the output (i.e., the list) has NAs in it: the Kappa statistic.
Using something like this:
results.matrix = confusionMatrix(data, reference)
results.df = as.data.frame(results.matrix[3])
summary(is.finite(results.df$overall))
Gives you this:
Mode FALSE TRUE NA's
logical 1 6 0
So I'm guessing that's what it's picking up.
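The integer-overflow warning points in the same direction: with roughly 500,000 rows, the product of the integer marginal totals of the confusion table (the ni[1:m] * nj[1:m] term in the warning, which as far as I can tell comes from the Kappa calculation in e1071::classAgreement) exceeds .Machine$integer.max, so Kappa comes back NA for every resample and train() cannot choose a cp value. A small sketch of the overflow, with an illustrative count:

n_major <- 496891L                          # roughly the number of non-event rows
n_major * n_major                           # NA, with "NAs produced by integer overflow"
as.numeric(n_major) * as.numeric(n_major)   # about 2.47e11, fine in double precision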
Related
I have the following simplified dataset as an example:
> str(one_year_before)
'data.frame': 3359 obs. of 3 variables:
$ Default_status : Factor w/ 2 levels "NO","YES": 1 1 1 2 2 1 1 1 1 1 ...
$ Average_paydex : num 79.6 73.3 73.3 66.4 64.9 ...
$ Average_amount_of_defaults: num 0 0 0 0 0 0 0 0 0 0 ...
And following code:
library(MASS)
library(caret)
set.seed(567)
# Store row numbers for training set: index_train
index_train <- createDataPartition(y = one_year_before$Default_status,
                                   p = .7,  ## the percentage of data in the training set
                                   list = FALSE)
# Create training set: training_set
training_set <- one_year_before[index_train, ]
# Create test set: test_set
test_set <- one_year_before[-index_train, ]
str(training_set)
# k = 10 fold cross-validation
folds <- 10
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 20,
                              summaryFunction = twoClassSummary,
                              classProbs = TRUE, savePredictions = TRUE)

model <- train(Default_status ~ .,
               data = training_set,
               method = "glm",
               preProcess = c('center', 'scale'),
               trControl = train_control,
               metric = 'ROC')
I get the following error:
Warning messages:
1: In Ops.factor(y, 0.5) : ‘-’ not meaningful for factors
2: model fit failed for Fold01.Rep01: parameter=none Error in glm(formula = .outcome ~ ., data = structure(list(Average_paydex = c(0.620189463001776, :
The following terms are causing separation among the sample points: (Intercept), Average_paydex, Average_amount_of_defaults
So far I have converted the column Default_status from a factor with levels 0 and 1 to YES and NO, but that does not help. The same data works perfectly with caret for random forest, but for GLM (and e.g. glmStepAIC) I get the same error.
What am I missing?
I would really appreciate help, as I have spent hours debugging this.
Here is also a link to the dataset in csv. data
So, I managed to solve this. I had safeBinaryRegression loaded, which masks the glm function. So, when using the caret package, make sure this package is not loaded at the same time.
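A quick way to check whether glm is masked and to undo it (a sketch, assuming safeBinaryRegression is the only package masking it):

environmentName(environment(glm))                   # "safeBinaryRegression" if masked
detach("package:safeBinaryRegression", unload = TRUE)
environmentName(environment(glm))                   # should now be "stats"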
Hopefully this solution helps someone :)
I was trying to analyse the example provided by the caret package for confusionMatrix, i.e.:
lvs <- c("normal", "abnormal")
truth <- factor(rep(lvs, times = c(86, 258)),
                levels = rev(lvs))
pred <- factor(c(rep(lvs, times = c(54, 32)),
                 rep(lvs, times = c(27, 231))),
               levels = rev(lvs))
xtab <- table(pred, truth)
confusionMatrix(xtab)
However, to be honest, I don't quite understand it. Let's pick, for example, this very simple model:
set.seed(42)
x <- sample(0:1, 100, T)
y <- rnorm(100)
glm(x ~ y, family = binomial('logit'))
And I don't know how I can analogously produce a confusion matrix for this glm model. Do you understand how it can be done?
EDIT
I tried to run an example provided in the comments:
train <- data.frame(LoanStatus_B = as.numeric(rnorm(100)>0.5), b= rnorm(100), c = rnorm(100), d = rnorm(100))
logitMod <- glm(LoanStatus_B ~ ., data=train, family=binomial(link="logit"))
library(caret)
# Use your model to make predictions, in this example newdata = training set, but replace with your test set
pdata <- predict(logitMod, newdata = train, type = "response")
confusionMatrix(data = as.numeric(pdata>0.5), reference = train$LoanStatus_B)
but I get the error: `data` and `reference` should be factors with the same levels
Am I doing something incorrectly?
You just need to turn them into factors:
confusionMatrix(data = as.factor(as.numeric(pdata > 0.5)),
                reference = as.factor(train$LoanStatus_B))
# Confusion Matrix and Statistics
#
# Reference
# Prediction 0 1
# 0 61 31
# 1 2 6
#
# Accuracy : 0.67
# 95% CI : (0.5688, 0.7608)
# No Information Rate : 0.63
# P-Value [Acc > NIR] : 0.2357
#
# Kappa : 0.1556
#
# Mcnemar's Test P-Value : 1.093e-06
#
# Sensitivity : 0.9683
# Specificity : 0.1622
# Pos Pred Value : 0.6630
# Neg Pred Value : 0.7500
# Prevalence : 0.6300
# Detection Rate : 0.6100
# Detection Prevalence : 0.9200
# Balanced Accuracy : 0.5652
#
# 'Positive' Class : 0
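The same pattern works for the simpler glm(x ~ y) model at the top of your question. A sketch, reusing the x and y simulated there (the 0.5 cutoff is just an illustrative choice):

library(caret)   # for confusionMatrix()

# Threshold the fitted probabilities and coerce both sides to factors with the
# same explicit levels before calling confusionMatrix().
fit  <- glm(x ~ y, family = binomial('logit'))
prob <- predict(fit, type = "response")
confusionMatrix(data      = factor(as.numeric(prob > 0.5), levels = 0:1),
                reference = factor(x, levels = 0:1))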
I have tried all the solutions suggested on Stack Overflow for the error "data and reference should be factors with the same levels".
library(caTools)   # for sample.split()
library(caret)     # for confusionMatrix()

set.seed(10)
indices = sample.split(consumers$label, SplitRatio = 0.75)
train = consumers[indices,]
test = consumers[!(indices),]
##Build a logistic regression model
is.factor(train$label)
contrasts(train$label)
lr_model <- data.frame(label = as.numeric(rnorm(100)>0.5), b= rnorm(100), c = rnorm(100), d = rnorm(100))
logitMod <- glm(label ~ ., data=train, family=binomial(link="logit"))
pdata <- predict(logitMod, newdata = train, type = "response")
confusionMatrix(data = as.numeric(pdata>0.5), reference = train$label)
I still get "Error: data and reference should be factors with the same levels."
My dataset has three columns - ration, time and label (where the label is male and female)
Going on a hunch here that you're using caret::confusionMatrix, so here goes. What you're doing is passing an integer vector as data and a factor as reference. Notice that the documentation calls for a factor of predicted classes or a table.
> library(caret)
>
> ref <- factor(sample(0:1, size = 100, replace = TRUE))
> data1 <- sample(0:1, size = 100, replace = TRUE)
> data2 <- factor(sample(0:1, size = 100, replace = TRUE))
# this is your case
> confusionMatrix(data = data1, reference = ref)
Error: `data` and `reference` should be factors with the same levels.
# pass in a factor (try a table for giggles)
> confusionMatrix(data = data2, reference = ref)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 24 19
1 33 24
Accuracy : 0.48
95% CI : (0.379, 0.5822)
No Information Rate : 0.57
P-Value [Acc > NIR] : 0.97198
Kappa : -0.02
Mcnemar's Test P-Value : 0.07142
Sensitivity : 0.4211
Specificity : 0.5581
Pos Pred Value : 0.5581
Neg Pred Value : 0.4211
Prevalence : 0.5700
Detection Rate : 0.2400
Detection Prevalence : 0.4300
Balanced Accuracy : 0.4896
'Positive' Class : 0
confusionMatrix(data = as.factor(as.numeric(pdata>0.5)), reference = train$label)
This should work.
caret gives me the error below. I'm training an SVM for prediction starting from a bag of words and wanted to use caret to tune the C parameter; however:
bow.model.svm.tune <- train(
  Training.match ~ .,
  data = data.frame(
    Training.match = factor(Training.Data.old$Training.match,
                            labels = c('no match', 'match')),
    Text.features.dtm.df) %>%
    filter(Training.Data.old$Data.tipe == 'train'),
  method = 'svmRadial',
  tuneLength = 9,
  preProc = c("center", "scale"),
  metric = "ROC",
  trControl = trainControl(method = "repeatedcv",
                           repeats = 5,
                           summaryFunction = twoClassSummary,
                           classProbs = TRUE))
Error: At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated
because the variables names will be converted to no.match, match .
Please use factor levels that can be used as valid R variable names
(see ?make.names for help).
The original e1071::svm() function doesn't give any problems, so I suppose the error arises in the tuning phase:
bow.model.svm.tune <- svm(
  Training.match ~ .,
  data = data.frame(
    Training.match = factor(Training.Data.old$Training.match,
                            labels = c('no match', 'match')),
    Text.features.dtm.df) %>%
    filter(Training.Data.old$Data.tipe == 'train'))
The data is simply an outcome factor variable plus a set of TF-IDF-transformed word vectors:
'data.frame': 1796 obs. of 1697 variables:
$ Training.match : Factor w/ 2 levels "no match","match": 2 1 1 1 1 1 1 1 2 1 ...
$ azienda : num 0.12 0 0 0 0 ...
$ bus : num 0.487 0 0 0 0 ...
$ locale : num 0.275 0 0 0 0 ...
$ martini : num 0.852 0.741 0.947 0.947 0.501 ...
$ osp : num 0.339 0 0 0 0 ...
$ ospedale : num 0.0389 0.0676 0.0864 0.0864 0.0915 ...
When predicting (internally inside train, or when using predict.train yourself), the functions make new columns for each class probability. If the code expects a column called "no match", it won't find it, because data.frame converts that name to "no.match", and it will throw an error.
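The usual workaround is what the error message itself suggests: give the outcome factor levels that are valid R variable names before calling train(). A sketch using make.names() on the levels from your data:

old_levels <- c("no match", "match")
make.names(old_levels)   # "no.match" "match"

# Rebuild the outcome with syntactically valid level names before passing it on.
Training.match <- factor(Training.Data.old$Training.match,
                         labels = make.names(old_levels))
levels(Training.match)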
While using the caret package for model tuning today, I came across this strange behavior: given a specific combination of tuning parameters T*, the metric value associated with T* (here Cohen's Kappa, K) changes depending on whether T* is evaluated alone or as part of a grid of possible combinations. In the practical example that follows, caret is used to interface with the gbm package.
# Load libraries and data
library (caret)
data<-read.csv("mydata.csv")
data$target<-as.factor(data$target)
# data are available at https://www.dropbox.com/s/1bglmqd14g840j1/mydata.csv?dl=0
Procedure 1: T* evaluated alone
#Define 5-fold cv as validation settings
fitControl <- trainControl(method = "cv",number = 5)
# Define the combination of tuning parameter for this example T*
gbmGrid <- expand.grid(.interaction.depth = 1,
                       .n.trees = 1000,
                       .shrinkage = 0.1,
                       .n.minobsinnode = 1)
# Fit a gbm with T* as model parameters and K as scoring metric.
set.seed(825)
gbmFit1 <- train(target ~ ., data = data,
                 method = "gbm",
                 distribution = "adaboost",
                 trControl = fitControl,
                 tuneGrid = gbmGrid,
                 verbose = FALSE,
                 metric = "Kappa")
# The results show that T* is associated with Kappa = 0.47. Remember this result and the confusion matrix.
testPred<-predict(gbmFit1, newdata = data)
confusionMatrix(testPred, data$target)
# output selection
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 832 34
1 0 16
Kappa : 0.4703
Procedure 2: T* evaluated along with other tuning profiles
Here everything is the same as in procedure 1 except for the fact that several combinations of tuning parameters {T} are considered:
# Notice that the original T* is included in {T}!!
gbmGrid2 <- expand.grid(.interaction.depth = 1,
                        .n.trees = seq(100, 1000, by = 100),
                        .shrinkage = 0.1,
                        .n.minobsinnode = 1)
# Fit the gbm
set.seed(825)
gbmFit2 <- train(target ~ ., data = data,
                 method = "gbm",
                 distribution = "adaboost",
                 trControl = fitControl,
                 tuneGrid = gbmGrid2,
                 verbose = FALSE,
                 metric = "Kappa")
# Caret should pick the model with the highest Kappa.
# Since T* is in {T} I would expect the best model to have K >= 0.47
testPred<-predict(gbmFit2, newdata = data)
confusionMatrix(testPred, data$target)
# output selection
Reference
Prediction 0 1
0 831 47
1 1 3
Kappa : 0.1036
The results are inconsistent with my expectations: the best model in {T} scores K = 0.10. How is that possible, given that T* has K = 0.47 and is included in {T}? Additionally, according to the following plot, K for T* as evaluated in procedure 2 is now around 0.01. Any idea about what is going on? Am I missing something?
I am getting consistent resampling results from your data and code.
The first model has Kappa = 0.00943
gbmFit1$results
  interaction.depth n.trees shrinkage n.minobsinnode  Accuracy       Kappa  AccuracySD   KappaSD
1                 1    1000       0.1              1 0.9331022 0.009430576 0.004819004 0.0589132
The second model has the same results for n.trees = 1000
gbmFit2$results
   shrinkage interaction.depth n.minobsinnode n.trees  Accuracy        Kappa  AccuracySD     KappaSD
1        0.1                 1              1     100 0.9421803 -0.002075765 0.002422952 0.004641553
2        0.1                 1              1     200 0.9387776 -0.008326896 0.002468351 0.004654972
3        0.1                 1              1     300 0.9365049 -0.012187900 0.002625886 0.003978702
4        0.1                 1              1     400 0.9353749 -0.013950906 0.003077431 0.004837097
5        0.1                 1              1     500 0.9353685 -0.013961221 0.003244201 0.004878259
6        0.1                 1              1     600 0.9342322 -0.015486214 0.005202656 0.007469843
7        0.1                 1              1     700 0.9319658 -0.018574633 0.007033402 0.009470466
8        0.1                 1              1     800 0.9319658 -0.018574633 0.007033402 0.009470466
9        0.1                 1              1     900 0.9342386  0.010955568 0.003144850 0.057825336
10       0.1                 1              1    1000 0.9331022  0.009430576 0.004819004 0.058913202
Note that the best model in your second run has n.trees = 900
gbmFit2$bestTune
n.trees interaction.depth shrinkage n.minobsinnode
9 900 1 0.1 1
Since train picks the "best" model based on your metric, your second prediction is using a different model (n.trees of 900 instead of 1000).
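As an apples-to-apples check (a sketch), compare the cross-validated Kappa for the same tuning values in the two fits, rather than the Kappa obtained by re-predicting the training set:

# Resampled performance for n.trees = 1000 in both fits; these values agree,
# unlike the confusion matrices computed on the training data.
subset(gbmFit2$results, n.trees == 1000)[, c("n.trees", "Accuracy", "Kappa")]
gbmFit1$results[, c("n.trees", "Accuracy", "Kappa")]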