I want to perform a logistic regression with the train() function from the caret package. My model looks something like this:
model <- train(Y ~ .,
               data = train_data,
               family = "binomial",
               method = "glmnet")
With the resulting model, I want to make predictions:
pred <- predict(model, newdata = test_data, s = "lambda.min", type = "prob")
Now I want to evaluate how good the model predictions are compared with the actual test data. I know how to obtain the ROC and AUC, but I am also interested in the Brier score. The formula for the Brier score is almost identical to the MSE.
The problem I am facing is that the type argument in predict only allows "prob" (or "class", which I am not interested in), which gives the probability of a prediction being a ONE (e.g. 0.64) and the complementary probability of it being a ZERO (e.g. 0.36). For the Brier score, however, I need one probability estimate for each prediction that contains the information of both (e.g. a value above 0.5 would indicate a 1 and a value below 0.5 would indicate a 0).
I have not found any solution for obtaining the Brier score within the caret package. I am aware that with glmnet's cv.glmnet, the predict function allows type = "response", which would solve my problem. However, for personal preference I would like to stay with the caret package.
Thanks for the help!
If we go by the wiki definition of the Brier score, the most common formulation is

BS = \frac{1}{N} \sum_{t=1}^{N} (f_t - o_t)^2

where f_t is the probability that was forecast, o_t is the actual outcome of the event at instance t (0 or 1), and N is the number of forecasting instances.
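A direct translation of the formula into R (a minimal sketch with made-up f_t and o_t vectors, just to illustrate the computation):

# Brier score: mean squared difference between forecast probabilities
# and the observed 0/1 outcomes (illustrative values only)
brier <- function(f_t, o_t) mean((f_t - o_t)^2)
brier(f_t = c(0.9, 0.2, 0.6), o_t = c(1, 0, 1))  # 0.07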
In R, if your label is a factor, then the logistic regression will always predict with respect to the 2nd level, meaning you calculate the probability and the 0/1 outcome with respect to that level. For example:
library(caret)
idx = sample(nrow(iris), 100)
data = iris
data$Species = factor(ifelse(data$Species=="versicolor","v","o"))
levels(data$Species)
[1] "o" "v"
In this case, o is 0 and v is 1.
train_data = data[idx,]
test_data = data[-idx,]
model <- train(Species ~ ., data = train_data,
               family = "binomial", method = "glmnet")
pred <- predict(model, newdata = test_data, type = "prob")
So we can see the predicted probability of each class:
head(pred)
          o          v
1 0.8367885 0.16321154
2 0.7970508 0.20294924
3 0.6383656 0.36163437
4 0.9510763 0.04892370
5 0.9370721 0.06292789
To calculate the score:
f_t = pred[, 2]                          # forecast probability of the 2nd level ("v")
o_t = as.numeric(test_data$Species) - 1  # actual outcome: 0 for "o", 1 for "v"
mean((f_t - o_t)^2)                      # Brier score
[1] 0.32
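Applied back to the caret/glmnet setup from the question, the same calculation would look like this (a sketch assuming test_data$Y is a factor whose second level is the positive class, with object names as in the question):

# Same calculation with the question's objects
pred <- predict(model, newdata = test_data, type = "prob")
f_t <- pred[, 2]                    # P(positive class), the 2nd factor level
o_t <- as.numeric(test_data$Y) - 1  # actual outcomes recoded to 0/1
mean((f_t - o_t)^2)                 # Brier score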
I use the Brier score to tune my models in caret for binary classification. I ensure that the "positive" class is the second class, which is the default when you label your response 0/1 (factor levels sort so that "1" comes second). Then I created this master summary function, based on caret's own suite of summary functions, to return all the metrics I want to see:
BigSummary <- function(data, lev = NULL, model = NULL) {
  pr_auc <- try(MLmetrics::PRAUC(data[, lev[2]],
                                 ifelse(data$obs == lev[2], 1, 0)),
                silent = TRUE)
  brscore <- try(mean((data[, lev[2]] - ifelse(data$obs == lev[2], 1, 0)) ^ 2),
                 silent = TRUE)
  rocObject <- try(pROC::roc(ifelse(data$obs == lev[2], 1, 0), data[, lev[2]],
                             direction = "<", quiet = TRUE), silent = TRUE)
  if (inherits(pr_auc, "try-error")) pr_auc <- NA
  if (inherits(brscore, "try-error")) brscore <- NA
  rocAUC <- if (inherits(rocObject, "try-error")) {
    NA
  } else {
    rocObject$auc
  }
  tmp <- unlist(e1071::classAgreement(table(data$obs,
                                            data$pred)))[c("diag", "kappa")]
  out <- c(Acc = tmp[[1]],
           Kappa = tmp[[2]],
           AUCROC = rocAUC,
           AUCPR = pr_auc,
           Brier = brscore,
           Precision = caret:::precision.default(data = data$pred,
                                                 reference = data$obs,
                                                 relevant = lev[2]),
           Recall = caret:::recall.default(data = data$pred,
                                           reference = data$obs,
                                           relevant = lev[2]),
           F = caret:::F_meas.default(data = data$pred, reference = data$obs,
                                      relevant = lev[2]))
  out
}
Now I can simply pass summaryFunction = BigSummary in trainControl and then metric = "Brier", maximize = FALSE in the train call.
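For example, a minimal sketch of that wiring (hypothetical train_data with a two-level factor response Y; classProbs = TRUE is required so that the class-probability columns BigSummary reads actually exist):

library(caret)

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,   # levels of Y must be valid R names (e.g. "no"/"yes")
                     summaryFunction = BigSummary)

set.seed(42)
fit <- train(Y ~ ., data = train_data,
             method = "glmnet", family = "binomial",
             trControl = ctrl,
             metric = "Brier", maximize = FALSE)  # lower Brier is better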
The following script reproduces a problem equivalent to the one stated in the h2o Help example (Help -> View Example Flow, or Help -> Browse Installed packs.. -> examples -> Airlines Delay.flow, download), but using the h2o R-package and a fixed seed (123456):
library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1)  # To use available cores
IS_LOCAL_FILE = switch(1, FALSE, TRUE)
if (IS_LOCAL_FILE) {
data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = F)
allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}
response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)
# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime",
"TailNum", "ActualElapsedTime", "CRSElapsedTime",
"AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
"Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
"WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
"IsArrDelayed")
predictors <- setdiff(predictors, predictors.exc)
# Convert to factor for classification
allyears2k.hex[, response] <- as.factor(allyears2k.hex[, response])
# Copied and pasted from the flow, then converting to R syntax
fit1 <- h2o.glm(
  x = predictors,
  model_id = "glm_model", seed = 123456, training_frame = allyears2k.hex,
  ignore_const_cols = T, y = response,
  family = "binomial", solver = "IRLSM",
  alpha = 0.5, lambda = 0.00001, lambda_search = F, standardize = T,
  non_negative = F, score_each_iteration = F,
  max_iterations = -1, link = "family_default", intercept = T, objective_epsilon = 0.00001,
  beta_epsilon = 0.0001, gradient_epsilon = 0.0001, prior = -1, max_active_predictors = -1
)
# Analysis
confMatrix <- h2o.confusionMatrix(fit1)
print("Confusion Matrix for training dataset")
print(confMatrix)
print(summary(fit1))
h2o.shutdown()
This is the Confusion Matrix for the training set:
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
            NO   YES    Error          Rate
NO           0 20887 1.000000  =20887/20887
YES          0 23091 0.000000      =0/23091
Totals       0 43978 0.474942  =20887/43978
And the metrics:
H2OBinomialMetrics: glm
** Reported on training data. **
MSE: 0.2473858
RMSE: 0.4973789
LogLoss: 0.6878898
Mean Per-Class Error: 0.5
AUC: 0.5550138
Gini: 0.1100276
R^2: 0.007965165
Residual Deviance: 60504.04
AIC: 60516.04
By contrast, the h2o Flow run shows better performance; its metrics and the confusion matrix for the max-F1 threshold appear in the Flow screenshots (not reproduced here). The h2o Flow performance is much better than running the same algorithm through the equivalent R-package function.
Note: For the sake of simplicity I am using the Airlines Delay problem, a well-known h2o example, but I have found this kind of significant difference in other similar situations using the glm algorithm.
Any thoughts about why these significant differences occur?
Appendix A: Using default model parameters
Following the suggestion from Darren Cook's answer, here I use the default model-building parameters, except for the excluded columns and the seed:
h2o flow
Now the buildModel is invoked like this:
buildModel 'glm', {"model_id":"glm_model-default",
"seed":"123456","training_frame":"allyears2k.hex",
"ignored_columns":
["DayofMonth","DepTime","CRSDepTime","ArrTime","CRSArrTime","TailNum",
"ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
"TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted",
"CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
"LateAircraftDelay","IsArrDelayed"],
"response_column":"IsDepDelayed","family":"binomial"
}
The results and training metrics appear in the corresponding Flow screenshots (not reproduced here).
Running R-Script
The following script allows an easy switch to the default configuration (via the IS_DEFAULT_MODEL variable) while also keeping the configuration as stated in the Airlines Delay example:
library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1)  # To use available cores
IS_LOCAL_FILE = switch(2, FALSE, TRUE)
IS_DEFAULT_MODEL = switch(2, FALSE, TRUE)
if (IS_LOCAL_FILE) {
data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = F)
allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}
response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)
# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime",
"TailNum", "ActualElapsedTime", "CRSElapsedTime",
"AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
"Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
"WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
"IsArrDelayed")
predictors <- setdiff(predictors, predictors.exc)
# Convert to factor for classification
allyears2k.hex[, response] <- as.factor(allyears2k.hex[, response])
if (IS_DEFAULT_MODEL) {
fit1 <- h2o.glm(
x = predictors, model_id = "glm_model", seed = 123456,
training_frame = allyears2k.hex, y = response, family = "binomial"
)
} else { # Copied and pasted from the flow, then converting to R syntax
fit1 <- h2o.glm(
x = predictors,
model_id = "glm_model", seed = 123456, training_frame = allyears2k.hex,
ignore_const_cols = T, y = response,
family = "binomial", solver = "IRLSM",
alpha = 0.5, lambda = 0.00001, lambda_search = F, standardize = T,
non_negative = F, score_each_iteration = F,
max_iterations = -1, link = "family_default", intercept = T, objective_epsilon = 0.00001,
beta_epsilon = 0.0001, gradient_epsilon = 0.0001, prior = -1, max_active_predictors = -1
)
}
# Analysis
confMatrix <- h2o.confusionMatrix(fit1)
print("Confusion Matrix for training dataset")
print(confMatrix)
print(summary(fit1))
h2o.shutdown()
It produces the following results:
MSE: 0.2473859
RMSE: 0.497379
LogLoss: 0.6878898
Mean Per-Class Error: 0.5
AUC: 0.5549898
Gini: 0.1099796
R^2: 0.007964984
Residual Deviance: 60504.04
AIC: 60516.04
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
            NO   YES    Error          Rate
NO           0 20887 1.000000  =20887/20887
YES          0 23091 0.000000      =0/23091
Totals       0 43978 0.474942  =20887/43978
Some metrics are close, but the confusion matrices are quite different: the R script predicts all flights as delayed.
Appendix B: Configuration
Package: h2o
Version: 3.18.0.4
Type: Package
Title: R Interface for H2O
Date: 2018-03-08
Note: I tested the R-Script also under 3.19.0.4231 with the same results
This is the cluster information after running the R script:
> h2o.init(max_mem_size = "12g", nthreads = -1)
R is connected to the H2O cluster:
H2O cluster version: 3.18.0.4
...
H2O API Extensions: Algos, AutoML, Core V3, Core V4
R Version: R version 3.3.3 (2017-03-06)
Troubleshooting Tip: build the all-defaults model first:
mDef = h2o.glm(predictors, response, allyears2k.hex, family="binomial")
This takes 2 seconds and gives almost exactly the same AUC and confusion matrix as in your Flow screenshots.
So, we now know the problem you see is due to all the model customization you have done...
...except when I build your fit1 I get basically the same results as my default model:
            NO   YES    Error          Rate
NO        4276 16611 0.795279  =16611/20887
YES       1573 21518 0.068122   =1573/23091
Totals    5849 38129 0.413479  =18184/43978
This was using your script exactly as given, so it fetched the remote csv file. (Oh, I removed the max_mem_size argument, as I don't have 12g on this notebook!)
Assuming you can get exactly your posted results, running exactly the code you posted (in a fresh R session, with a newly started H2O cluster), one possible explanation is that you are using 3.19.x, whereas the latest stable release is 3.18.0.2? (My test was with 3.14.0.1.)
Finally, I guess this is the explanation: both have the same parameter configuration for building the model (that is not the problem), but the h2o Flow uses a specific parsing customization that converts some variables into Enum, which the R script did not specify.
The Airlines Delay problem, as specified in the h2o Flow example, uses the following predictor variables (the flow defines the ignored_columns):
"Year", "Month", "DayOfWeek", "UniqueCarrier",
"FlightNum", "Origin", "Dest", "Distance"
All of these predictors should be parsed as Enum, except Distance. Therefore the R script needs to convert such columns from numeric or character into factor.
Executing using h2o R-package
Here is the updated R script:
library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1)  # To use available cores
IS_LOCAL_FILE = switch(2, FALSE, TRUE)
IS_DEFAULT_MODEL = switch(2, FALSE, TRUE)
if (IS_LOCAL_FILE) {
data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = T)
allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}
response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)
# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime",
"ArrTime", "CRSArrTime",
"TailNum", "ActualElapsedTime", "CRSElapsedTime",
"AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
"Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
"WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
"IsArrDelayed")
predictors <- setdiff(predictors, predictors.exc)
column.asFactor <- c("Year", "Month", "DayofMonth", "DayOfWeek",
"UniqueCarrier", "FlightNum", "Origin", "Dest", response)
# Coercing as factor (equivalent to Enum from h2o Flow)
# Note: Using lapply does not work, see the answer of this question
# https://stackoverflow.com/questions/49393343/how-to-coerce-multiple-columns-to-factors-at-once-for-h2oframe-object
for (col in column.asFactor) {
allyears2k.hex[col] <- as.factor(allyears2k.hex[col])
}
if (IS_DEFAULT_MODEL) {
fit1 <- h2o.glm(x = predictors, y = response,
training_frame = allyears2k.hex,
family = "binomial", seed = 123456
)
} else { # Copied and pasted from the flow, then converting to R syntax
fit1 <- h2o.glm(
x = predictors,
model_id = "glm_model", seed = 123456,
training_frame = allyears2k.hex,
ignore_const_cols = T, y = response,
family = "binomial", solver = "IRLSM",
alpha = 0.5, lambda = 0.00001, lambda_search = F, standardize = T,
non_negative = F, score_each_iteration = F,
max_iterations = -1, link = "family_default", intercept = T,
objective_epsilon = 0.00001,
beta_epsilon = 0.0001, gradient_epsilon = 0.0001, prior = -1,
max_active_predictors = -1
)
}
# Analysis
print("Confusion Matrix for training dataset")
confMatrix <- h2o.confusionMatrix(fit1)
print(confMatrix)
print(summary(fit1))
h2o.shutdown()
Here is the result of running the R script under the default configuration (IS_DEFAULT_MODEL = T):
H2OBinomialMetrics: glm
** Reported on training data. **
MSE: 0.2001145
RMSE: 0.4473416
LogLoss: 0.5845852
Mean Per-Class Error: 0.3343562
AUC: 0.7570867
Gini: 0.5141734
R^2: 0.1975266
Residual Deviance: 51417.77
AIC: 52951.77
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
            NO   YES    Error          Rate
NO       10337 10550 0.505099  =10550/20887
YES       3778 19313 0.163614   =3778/23091
Totals   14115 29863 0.325799  =14328/43978
Executing under h2o flow
Now, executing the flow Airlines_Delay_GLMFixedSeed, we can obtain the same results. Here are the details of the flow configuration:
The parseFiles function:
parseFiles
paths: ["https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"]
destination_frame: "allyears2k.hex"
parse_type: "CSV"
separator: 44
number_columns: 31
single_quotes: false
column_names:
["Year","Month","DayofMonth","DayOfWeek","DepTime","CRSDepTime","ArrTime",
"CRSArrTime","UniqueCarrier","FlightNum","TailNum","ActualElapsedTime",
"CRSElapsedTime","AirTime","ArrDelay","DepDelay","Origin","Dest",
"Distance","TaxiIn","TaxiOut","Cancelled","CancellationCode",
"Diverted","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
"LateAircraftDelay","IsArrDelayed",
"IsDepDelayed"]
column_types ["Enum","Enum","Enum","Enum","Numeric","Numeric",
"Numeric","Numeric", "Enum","Enum","Enum","Numeric",
"Numeric", "Numeric","Numeric","Numeric",
"Enum","Enum","Numeric","Numeric","Numeric",
"Enum","Enum","Numeric","Numeric","Numeric",
"Numeric","Numeric","Numeric","Enum","Enum"]
delete_on_done: true
check_header: 1
chunk_size: 4194304
where the following predictor columns are converted to Enum: "Year", "Month", "DayOfWeek", "UniqueCarrier", "FlightNum", "Origin", "Dest"
Now invoking the buildModel function as follows, using the default parameters except for ignored_columns and seed:
buildModel 'glm', {"model_id":"glm_model-default","seed":"123456",
"training_frame":"allyears2k.hex",
"ignored_columns":["DayofMonth","DepTime","CRSDepTime","ArrTime",
"CRSArrTime","TailNum",
"ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
"TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted",
"CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
"LateAircraftDelay","IsArrDelayed"],"response_column":"IsDepDelayed",
"family":"binomial"}
and finally we get the following Training Output metrics (the Flow results screenshot is not reproduced here):
model glm_model-default
model_checksum -2438376548367921152
frame allyears2k.hex
frame_checksum -2331137066674151424
description ·
model_category Binomial
scoring_time 1521598137667
predictions ·
MSE 0.200114
RMSE 0.447342
nobs 43978
custom_metric_name ·
custom_metric_value 0
r2 0.197527
logloss 0.584585
AUC 0.757084
Gini 0.514168
mean_per_class_error 0.334347
residual_deviance 51417.772427
null_deviance 60855.951538
AIC 52951.772427
null_degrees_of_freedom 43977
residual_degrees_of_freedom 43211
Comparing both results
The training metrics are almost the same to the first 4 significant digits:

                       R-Script     H2o Flow
MSE:                  0.2001145     0.200114
RMSE:                 0.4473416     0.447342
LogLoss:              0.5845852     0.584585
Mean Per-Class Error: 0.3343562     0.334347
AUC:                  0.7570867     0.757084
Gini:                 0.5141734     0.514168
R^2:                  0.1975266     0.197527
Residual Deviance:    51417.77      51417.772427
AIC:                  52951.77      52951.772427
The confusion matrices are slightly different:

             TP     TN     FP    FN     Error
R-Script  10337  19313  10550  3778  0.325799
H2o Flow  10341  19309  10546  3782  0.3258
My understanding is that the differences are within an acceptable tolerance (around 0.0001), therefore we can say that both interfaces provide the same result.
I'm working on building a predictive model for breast cancer data using R. After performing gcrma normalization, I generated the potential predictor variables. Now, while running the RF algorithm, I encountered the following error:
rf_output=randomForest(x=pred.data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)
Error in randomForest.default(x = pred.data, y = target, importance = TRUE, : Can not handle categorical predictors with more than 53 categories.
code:
library(randomForest)
library(ROCR)
library(Hmisc)
library(genefilter)
setwd("E:/kavya's project_work/final")
datafile<-"trainset_gcrma.txt"
clindatafile<-read.csv("mod clinical_details.csv")
outfile="trainset_RFoutput.txt"
varimp_pdffile="trainset_varImps.pdf"
MDS_pdffile="trainset_MDS.pdf"
ROC_pdffile="trainset_ROC.pdf"
case_pred_outfile="trainset_CasePredictions.txt"
vote_dist_pdffile="trainset_vote_dist.pdf"
data_import=read.table(datafile, header = TRUE, na.strings = "NA", sep="\t")
clin_data_import=clindatafile
clincaldata_order=order(clin_data_import[,"GEO.asscession.number"])
clindata=clin_data_import[clincaldata_order,]
data_order=order(colnames(data_import)[4:length(colnames(data_import))])+3 #Order data without first three columns, then add 3 to get correct index in original file
rawdata=data_import[,c(1:3,data_order)] #grab first three columns, and then remaining columns in order determined above
header=colnames(rawdata)
X=rawdata[,4:length(header)]
ffun=filterfun(pOverA(p = 0.2, A = 100), cv(a = 0.7, b = 10))
filt=genefilter(2^X,ffun)
filt_Data=rawdata[filt,]
#Get potential predictor variables
predictor_data=t(filt_Data[,4:length(header)])
predictor_names=c(as.vector(filt_Data[,3])) #gene symbol
colnames(predictor_data)=predictor_names
target= clindata[,"relapse"]
target[target==0]="NoRelapse"
target[target==1]="Relapse"
target=as.factor(target)
tmp = as.vector(table(target))
num_classes = length(tmp)
min_size = tmp[order(tmp,decreasing=FALSE)[1]]
sampsizes = rep(min_size,num_classes)
rf_output=randomForest(x=pred.data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)
error:"Error in randomForest.default(x = pred.data, y = target, importance = TRUE, : Can not handle categorical predictors with more than 53 categories."
As I'm new to machine learning, I'm unable to proceed. Kindly do the needful.
Thanks in advance.
It is hard to say without knowing the data. Run class() or summary() on all your predictor variables to ensure that they are not accidentally interpreted as characters or factors. If you really do have more than 53 levels, you will have to convert them to binary variables. Example:
mtcars$automatic <- mtcars$am == 0
mtcars$manual <- mtcars$am == 1
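More generally, a sketch of both steps (the type check uses predictor_data from the question; the one-hot expansion uses base R's model.matrix() on mtcars$cyl, since the question's data isn't available):

# How is each predictor column typed? (predictor_data from the question)
table(sapply(as.data.frame(predictor_data), class))

# One 0/1 indicator column per factor level via model.matrix()
mtcars$cyl <- factor(mtcars$cyl)
cyl_dummies <- model.matrix(~ cyl - 1, data = mtcars)
head(cyl_dummies)  # columns cyl4, cyl6, cyl8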