I am currently trying to build a multi-class prediction model to predict a letter out of the 26 letters of the English alphabet. I have built a few models using ANN, SVM, an ensemble, and Naive Bayes, but I am stuck at evaluating the accuracy of these models. Although the confusion matrix shows me the letter-wise true and false predictions, I am only able to get an overall accuracy for each model. Is there a way to evaluate a model's accuracy similar to the ROC and AUC values for binomial classification?
Note: I am currently running the models using the H2O package as it saves me time.
Once you train a model in H2O, if you simply do print(fit), it will show you all the available metrics for that model type. For multiclass problems, I'd recommend h2o.mean_per_class_error().
R code example on the iris dataset:
library(h2o)
h2o.init(nthreads = -1)
data(iris)
fit <- h2o.naiveBayes(x = 1:4,
                      y = 5,
                      training_frame = as.h2o(iris),
                      nfolds = 5)
Once you have the model, you can evaluate its performance using the h2o.performance() function to view all the metrics:
> h2o.performance(fit, xval = TRUE)
H2OMultinomialMetrics: naivebayes
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
Cross-Validation Set Metrics:
=====================
Extract cross-validation frame with `h2o.getFrame("iris")`
MSE: (Extract with `h2o.mse`) 0.03582724
RMSE: (Extract with `h2o.rmse`) 0.1892808
Logloss: (Extract with `h2o.logloss`) 0.1321609
Mean Per-Class Error: 0.04666667
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,xval = TRUE)`
=======================================================================
Top-3 Hit Ratios:
k hit_ratio
1 1 0.953333
2 2 1.000000
3 3 1.000000
Or you can look at a particular metric, like mean_per_class_error:
> h2o.mean_per_class_error(fit, xval = TRUE)
[1] 0.04666667
If you want to view performance on a test set, then you can do the following:
perf <- h2o.performance(fit, newdata = test)
h2o.mean_per_class_error(perf)
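The per-class error rates behind the mean per-class error can also be read off the confusion matrix of a performance object. A small sketch on the cross-validated iris fit above (the exact columns printed may vary a little between H2O versions):
perf_cv <- h2o.performance(fit, xval = TRUE)
h2o.confusionMatrix(perf_cv)  # one row per class, including a per-class error column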
The objective is to train a model to predict the default variable. Train a KNN model with k = 13 using the knn3() function and calculate the test accuracy.
My code to solve this problem so far is:
# load packages
library("mlbench")
library("tibble")
library("caret")
library("rpart")
# set seed
set.seed(49607)
# load data and coerce to tibble
default = as_tibble(ISLR::Default)
# split data
dft_trn_idx = sample(nrow(default), size = 0.8 * nrow(default))
dft_trn = default[dft_trn_idx, ]
dft_tst = default[-dft_trn_idx, ]
# check data
dft_trn
# fit knn model
mod_knn = knn3(default ~ ., data = dft_trn, k = 13)
# make "predictions" with knn model
new_obs = data.frame(balance = 421, income = 28046)
predtrn = predict(mod_knn, new_obs, type = "prob")
confusionMatrix(predtrn,dft_trn)
At the last line of the code chunk, I get the error "Error: data and reference should be factors with the same levels." I am unsure how to fix this, or whether this is even the correct method for measuring the test accuracy.
Any help would be great, thanks!
First of all, you are on the right track: splitting the data into training and test sets is a necessary step. The issue is that you are trying to evaluate a prediction made on a new observation that lies outside both the training and the test set. The principle in ML is to train the model on the training set and then make predictions on the test set in order to evaluate performance, and you already have the dataset for that (dft_tst). As a reminder, if you only have a predicted label without the true label to compare it against, a confusion matrix cannot be computed. Here is the code to obtain the desired matrix:
# load packages
library("mlbench")
library("tibble")
library("caret")
library("rpart")
# set seed
set.seed(49607)
# load data and coerce to tibble
default = as_tibble(ISLR::Default)
Now, we split into train and test sets:
# split data
dft_trn_idx = sample(nrow(default), size = 0.8 * nrow(default))
dft_trn = default[dft_trn_idx, ]
dft_tst = default[-dft_trn_idx, ]
We train the model:
# fit knn model
mod_knn = knn3(default ~ ., data = dft_trn, k = 13)
Now, the key part: make predictions on the test set (or any labelled set) and obtain the confusion matrix:
# make "predictions" with knn model
predtrn = predict(mod_knn, dft_tst, type = "class")
In order to compute the confusion matrix, the predictions and the original labels must have the same length (and be factors with the same levels):
#Confusion matrix
confusionMatrix(predtrn,dft_tst$default)
Output:
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 1929 67
Yes 1 3
Accuracy : 0.966
95% CI : (0.9571, 0.9735)
No Information Rate : 0.965
P-Value [Acc > NIR] : 0.4348
Kappa : 0.0776
Mcnemar's Test P-Value : 3.211e-15
Sensitivity : 0.99948
Specificity : 0.04286
Pos Pred Value : 0.96643
Neg Pred Value : 0.75000
Prevalence : 0.96500
Detection Rate : 0.96450
Detection Prevalence : 0.99800
Balanced Accuracy : 0.52117
'Positive' Class : No
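If you only need the test accuracy as a single number, it can be read off the object returned by confusionMatrix(); a small sketch continuing the code above:
cm <- confusionMatrix(predtrn, dft_tst$default)
cm$overall["Accuracy"]  # overall test accuracy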
I want to know what g means, why the Lemeshow goodness of fit (GOF) test is used in research, and what is wrong with my confusion matrix for logistic regression. I get this message:
Error in confusionMatrix(cnfmat) :
could not find function "confusionMatrix"
# ..Binary Logistic Regression :
install.packages("caTools")
library(caTools)
require(caTools)
sample = sample.split(diabetes$Outcome, SplitRatio=0.80)
train = subset(diabetes, sample==TRUE)
test = subset(diabetes, sample==FALSE)
nrow(diabetes) ## calculating the total number of rows
nrow(train) ## total number of Train data rows >> 0.80 * 768
nrow(test) ## total number of Test data rows >> 0.20 * 768
str(train) ## Structure of train set
Logis_mod<- glm(Outcome~Pregnancies+Glucose+BloodPressure+SkinThickness+
Insulin+BMI+DiabetesPedigreeFunction+Age,family = binomial,data = train)
summary(Logis_mod)
# AIC: Akaike information criterion
# A good model is the one that has the minimum AIC among all candidate models.
# Testing the Model
glm_probs <- predict(Logis_mod, newdata = test, type = "response")
summary(glm_probs)
glm_pred <- ifelse(glm_probs > 0.5, 1, 0)
summary(glm_pred)
# Average prediction for each of the two outcomes
tapply(glm_pred,train$Outcome,mean)
# Confusion Matrix for logistic regression
install.packages("e1071")
library(e1071)
prdval <-predict(Logis_mod,type = "response")
prdbln <-ifelse(prdval > 0.5, 1, 0)
cnfmat <-table(prd=prdbln,act =train$Outcome)
confusionMatrix(cnfmat)
#Odd Ratio :
exp(cbind("OR"=coef(Logis_mod),confint(Logis_mod)))
I'm not sure which "g" you are referring to, but I'm going to assume it's the computed statistic from your Lemeshow test. If that is the case, then values of "g" indicate how well a model explains the variability in the data and can be used to compare models fit to the same data (better models will have larger "g" values).
More generally, any goodness of fit (GOF) test in research is used to determine how well your model fits the variability in your data.
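If the g in question is instead the g argument of the Hosmer-Lemeshow test as implemented in the ResourceSelection package (an assumption on my part), it is simply the number of groups the observations are binned into. A minimal sketch on the model fitted above:
install.packages("ResourceSelection")
library(ResourceSelection)
# observed 0/1 outcomes vs. fitted probabilities, binned into g = 10 groups
hoslem.test(Logis_mod$y, fitted(Logis_mod), g = 10)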
Additionally, you are receiving the error because the confusionMatrix() function is part of the caret R package. Install caret by first running the following line of code in R or RStudio:
install.packages("caret")
Then in your code change
confusionMatrix(cnfmat)
to
caret::confusionMatrix(cnfmat)
(confusionMatrix() accepts the cross-tabulation produced by table(), so the cnfmat line itself does not need to change.)
I can't tell from the documentation whether the predict.H2OModel() function from the h2o package in R gives out-of-bag (OOB) predictions for random forest models built using h2o.randomForest().
In fact, in the 3-4 examples I've tried, the results of predict.H2OModel() seem closer to the non-OOB predictions from predict.randomForest() in the randomForest package than to the OOB ones.
Does anyone know if they are OOB predictions? If not, do you know how to get OOB predictions for h2o.randomForest() models?
Example:
set.seed(123)
library(randomForest)
library(h2o)
data(mtcars)
d = mtcars[,c('mpg', 'cyl', 'disp', 'hp', 'wt' )]
## define some common settings for both random forests
n.trees=1000
mtry = 3
min.node = 3
## prep for h2o.randomForest
h2o.init()
d.h2o= as.h2o(d)
x.names = colnames(d)[2:5] ## predictors
## fit both models
set.seed(123);
rf = randomForest(mpg ~ ., data = d , ntree=n.trees, mtry = mtry, nodesize=min.node)
h2o = h2o.randomForest(y='mpg', x=x.names, training_frame = d.h2o, ntrees=n.trees, mtries = mtry, min_rows=min.node)
## Correct way and incorrect way of getting OOB predictions for a randomForest model. Not sure about h2o model.
d$rf.oob.pred = predict(rf) ## Gives OOB predictions
d$rf.pred = predict(rf , newdata=d ) ## Doesn't give OOB predictions.
d$h2o.pred = as.vector(predict(h2o, newdata=d.h2o)) ## Not sure if this is OOB or not.
## d$h2o.pred seems more similar to d$rf.pred than d$rf.oob.pred,
## suggesting that predict.H2OModel() might not give OOB predictions.
mean((d$rf.pred - d$h2o.pred)^2)
mean((d$rf.oob.pred - d$h2o.pred)^2)
H2O's h2o.predict() does not provide predictions for the OOB data. You specify which dataset you want to predict on with the newdata parameter, so with newdata = d.h2o you are getting predictions for the d.h2o frame you've specified.
Currently there is no method to get per-row predictions for the OOB data. However, there is a JIRA ticket to specify whether you would like OOB metrics (note that this ticket also links to another ticket which helps clarify how training metrics are currently reported for Random Forest).
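As a partial workaround: for DRF, the metrics H2O reports on the training frame are described as being computed on the out-of-bag samples, so while per-row OOB predictions are not exposed, OOB-based aggregate metrics are. A sketch using the model object from your example (named h2o):
## aggregate metrics on the OOB samples (reported as "training" metrics for DRF)
h2o.performance(h2o, train = TRUE)
h2o.mse(h2o, train = TRUE)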
I am running h2o random forest with the following parameter setting
model_rf <- h2o.randomForest(x = predictors, y = labels,
                             training_frame = train_data, classification = T,
                             importance = T,
                             verbose = T, type = "BigData", ntree = 50)
After running I am getting the following output.
Model Details:
==============
H2ORegressionModel: drf
Model ID: DRFModel__906d074da6ebf8057525b2b61c1c4c87
Model Summary:
number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
1 50.000000 2708173.000000 20.000000 20.000000 20.00000 4200.000000 5241.000000 4720.70000
H2ORegressionMetrics: drf
** Reported on training data. **
Description: Metrics reported on Out-Of-Bag training samples
MSE: 0.0006302392
R2 : -0.03751038
Following are my questions:
1) What do MSE and R2 mean?
2) If they are the mean squared error and the like, why am I getting these metrics in a classification setting?
3) How do I get other metrics like Gini or AUC?
4) Can I say that if these two values decrease with a different parameter setting, my model performance has improved?
Here are the answers to your questions:
1. MSE stands for mean squared error; essentially it measures the difference between the estimator and the estimated values. R2 measures how well-fit a statistical model is.
2. Using MSE you can judge how often your model misclassified data.
3. If you are using Flow, click on Inspect and then output-training_metrics to see MSE, R2, AUC, Gini, etc.
4. Sorry, I'm not sure I understand this question. Are you asking whether a decreased Gini or AUC equates to improved model performance?
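As a side note, the same numbers can be pulled programmatically with the h2o metric accessors; a sketch, assuming labels is a factor so that the model is trained as a classifier (the model shown above was trained as a regression model, so AUC and Gini would not be reported for it as-is):
h2o.mse(model_rf, train = TRUE)
h2o.r2(model_rf, train = TRUE)
h2o.auc(model_rf, train = TRUE)       # binomial classification models only
h2o.giniCoef(model_rf, train = TRUE)  # binomial classification models only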
Avni
I am using the caret package in R for training a radial basis SVM for classification; in addition, a linear SVM is used for variable selection. With metric="Accuracy", this works fine, but eventually I am more interested in optimizing metric="ROC". While the ROC is calculated for all models that are fit, there seems to be some problem with aggregating the ROC values.
The following is some example code:
library(caret)
library(kernlab)   # for ksvm() and vanilladot used in rankSVM below
library(mlbench)
set.seed(0)
data(Sonar)
x<-scale(Sonar[,1:60])
y<-as.factor(Sonar[,61])
# Custom summary function to use both
# defaultSummary() and twoClassSummary
# Also input and output of summary function are printed
svm.summary <- function(data, lev = NULL, model = NULL) {
  print(head(data, n = 3))
  a <- defaultSummary(data, lev, model)
  b <- twoClassSummary(data, lev, model)
  out <- c(a, b)
  print(out)
  out
}
fitControl <- trainControl(
  method = "cv",
  number = 2,
  classProbs = TRUE,
  summaryFunction = svm.summary,
  verbose = T,
  allowParallel = FALSE)
# Ranking function: Rank Variables using a linear
# SVM
rankSVM <- function(object, x, y) {
  print("ranking")
  obj <- ksvm(x = as.matrix(x), y = y,
              kernel = vanilladot,
              kpar = list(), C = 10,
              scaled = F)
  w <- t(obj@coef[[1]] %*% obj@xmatrix[[1]])
  z <- abs(w) / sqrt(sum(w^2))
  ord <- order(z, decreasing = T)
  data.frame(var = dimnames(z)[[1]][ord], Overall = z[ord])
}
svmFuncs<-getModelInfo("svmRadial",regex=F)
svmFit <- function(x, y, first, last, ...) {
  out <- train(x = x, y = as.factor(y),
               method = "svmRadial",
               trControl = fitControl,
               scaled = F,
               metric = "Accuracy",
               maximize = T,
               returnData = T)
  out$finalModel
}
selectionFunctions <- list(summary = svm.summary,
                           fit = svmFit,
                           pred = svmFuncs$svmRadial$predict,
                           prob = svmFuncs$svmRadial$prob,
                           rank = rankSVM,
                           selectSize = pickSizeBest,
                           selectVar = pickVars)
selectionControl <- rfeControl(functions = selectionFunctions,
                               rerank = F,
                               verbose = T,
                               method = "cv",
                               number = 2)
subsets<-c(1,30,60)
svmProfile <- rfe(x = x, y = y,
                  sizes = subsets,
                  metric = "Accuracy",
                  maximize = TRUE,
                  rfeControl = selectionControl)
svmProfile
The final output is the following:
> svmProfile
Recursive feature selection
Outer resampling method: Cross-Validated (2 fold)
Resampling performance over subset size:
Variables Accuracy Kappa ROC Sens Spec AccuracySD KappaSD ROCSD SensSD SpecSD Selected
1 0.8075 0.6122 NaN 0.8292 0.7825 0.02981 0.06505 NA 0.06153 0.1344 *
30 0.8028 0.6033 NaN 0.8205 0.7825 0.00948 0.02533 NA 0.09964 0.1344
60 0.8028 0.6032 NaN 0.8206 0.7823 0.00948 0.02679 NA 0.12512 0.1635
The top 1 variables (out of 1):
V49
ROC is NaN. Inspecting the output (as verbose=T and the summary function was patched to display both its output and parts of its input) reveals that, while the ROC seems to be calculated correctly when tuning the SVMs in the inner loop:
+ Fold1: sigma=0.01172, C=0.25
pred obs M R
1 M R 0.6658878 0.3341122
2 M R 0.5679477 0.4320523
3 R R 0.2263576 0.7736424
Accuracy Kappa ROC Sens Spec
0.6730769 0.3480826 0.7961310 0.6428571 0.7083333
- Fold1: sigma=0.01172, C=0.25
+ Fold1: sigma=0.01172, C=0.50
pred obs M R
1 M R 0.7841249 0.2158751
2 M R 0.7231365 0.2768635
3 R R 0.3033492 0.6966508
Accuracy Kappa ROC Sens Spec
0.7692308 0.5214724 0.8407738 0.9642857 0.5416667
- Fold1: sigma=0.01172, C=0.50
[...]
there seems to be a problem in the outer iteration. "Between" two folds we get the following:
-(rfe) fit Fold1 size: 1
pred obs Variables
1 M R 1
2 M R 1
3 M R 1
Accuracy Kappa ROC Sens Spec
0.7864078 0.5662328 NA 0.8727273 0.6875000
pred obs Variables
1 R R 30
2 M R 30
3 M R 30
Accuracy Kappa ROC Sens Spec
0.7961165 0.5853939 NA 0.8909091 0.6875000
pred obs Variables
1 R R 60
2 M R 60
3 M R 60
Accuracy Kappa ROC Sens Spec
0.7961165 0.5842783 NA 0.9090909 0.6666667
+(rfe) fit Fold2 size: 60
So here it seems the input to the summary function is a matrix that does not contain the class probabilities but the number of variables instead, so the ROC values cannot be calculated/aggregated correctly. Does anybody know how to prevent this? Did I forget to tell caret to output class probabilities somewhere?
Help is greatly appreciated, as caret is really a cool package to use and would save me plenty of work if I can get this to run correctly.
Thoralf
getModelInfo is designed to get code for train and doesn't automatically work with rfe (I'll make a note of that in the documentation). rfe doesn't look for a slot called probs, and no probability predictions means no ROC summary.
You might want to base your code on caretFuncs, which is designed to work with rfe and should automate a lot of what I think you would like to do.
For example, in caretFuncs, the pred module will create class and probability predictions:
function(object, x) {
  tmp <- predict(object, x)
  if(object$modelType == "Classification" &
     !is.null(object$modelInfo$prob)) {
    out <- cbind(data.frame(pred = tmp),
                 as.data.frame(predict(object, x, type = "prob")))
  } else out <- tmp
  out
}
You might be able to simply plug in your rankSVM into caretFuncs$rank.
Take a look at the feature selection page on the website. It has details about what code modules you will need.
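To illustrate the suggestion, here is a rough sketch (untested, built on the Sonar objects, subsets, and rankSVM from the question) of what a caretFuncs-based setup might look like:
# start from the rfe helper set that already returns class probabilities
newFuncs <- caretFuncs
newFuncs$summary <- twoClassSummary   # reports ROC, Sens, Spec
newFuncs$rank <- rankSVM              # plug in the linear-SVM ranking
rfeCtrl <- rfeControl(functions = newFuncs,
                      method = "cv",
                      number = 2,
                      verbose = TRUE)
svmProfileROC <- rfe(x = x, y = y,
                     sizes = subsets,
                     metric = "ROC",
                     maximize = TRUE,
                     rfeControl = rfeCtrl,
                     ## the arguments below are passed through to train() by caretFuncs$fit
                     method = "svmRadial",
                     trControl = trainControl(method = "cv",
                                              number = 2,
                                              classProbs = TRUE,
                                              summaryFunction = twoClassSummary))
classProbs = TRUE matters here: as far as I can tell, caret only fits the kernlab SVM with prob.model = TRUE (and can therefore hand class probabilities to the summary function) when it is set.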