I am running an h2o random forest with the following parameter settings:
model_rf <- h2o.randomForest(x = predictors, y = labels,
                             training_frame = train_data, classification = T,
                             importance = T,
                             verbose = T, type = "BigData", ntree = 50)
After running it, I get the following output.
Model Details:
==============
H2ORegressionModel: drf
Model ID: DRFModel__906d074da6ebf8057525b2b61c1c4c87
Model Summary:
number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
1 50.000000 2708173.000000 20.000000 20.000000 20.00000 4200.000000 5241.000000 4720.70000
H2ORegressionMetrics: drf
** Reported on training data. **
Description: Metrics reported on Out-Of-Bag training samples
MSE: 0.0006302392
R2 : -0.03751038
My questions are the following.
1) What do MSE and R2 mean?
2) If they are mean squared error and similar, why am I getting these metrics in a classification setting?
3) How do I get other metrics like Gini or AUC?
4) Can I say that if these two metrics decrease with a different parameter setting, my model performance has improved?
Here are the answers to your questions:
1. MSE stands for mean squared error; essentially, it measures the average squared difference between the estimates and the true values. R2 measures how well the statistical model fits the data.
Using MSE you can judge how badly, on average, your model misclassifies the data.
If you are using Flow, click on Inspect and then output-training_metrics to see MSE, R2, AUC, gini, etc.
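From R rather than Flow, the current h2o API exposes these metrics directly. A rough sketch (the function names below are from the newer h2o R API, which differs from the older call above; the key assumption is that 'labels' holds the name of the response column and that it is converted to a factor so the model is trained as a classifier):
library(h2o)
h2o.init()
## In the current API, H2O treats a numeric response as regression (hence MSE/R2);
## converting the response to a factor trains a classifier instead.
train_data[, labels] <- as.factor(train_data[, labels])
model_rf <- h2o.randomForest(x = predictors, y = labels,
                             training_frame = train_data, ntrees = 50)
h2o.mse(model_rf)              # MSE on the OOB training samples
h2o.auc(model_rf)              # AUC (binomial problems)
h2o.giniCoef(model_rf)         # Gini coefficient
h2o.confusionMatrix(model_rf)  # per-class errors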
Sorry, I'm not sure I understand this question. Are you asking whether a decreased Gini or AUC equates to improved model performance?
Avni
I have a multivariate model with this (approximate) form:
library(MCMCglmm)
mod.1 <- MCMCglmm(
  cbind(OFT1, MIS1, PC1, PC2) ~
    trait - 1 +
    trait:sex +
    trait:date,
  random = ~us(trait):squirrel_id + us(trait):year,
  rcov = ~us(trait):units,
  family = c("gaussian", "gaussian", "gaussian", "gaussian"),
  data = final_MCMC,
  prior = prior.invgamma,
  verbose = FALSE,
  pr = TRUE,      # this saves the BLUPs
  nitt = 103000,  # number of iterations
  thin = 100,     # interval at which the Markov chain is stored
  burnin = 3000)
For publication purposes, I've been asked to report the Gelman-Rubin statistic to indicate that the model has converged.
I have been trying to run:
gelman.diag(mod.1)
But, I get this error:
Error in mcmc.list(x) : Arguments must be mcmc objects
Any suggestions on the proper approach? I assume the error means I can't pass my mod.1 output through gelman.diag(), but I am not sure what I am supposed to put there instead. My knowledge is quite limited here, so I'd appreciate any and all help!
Note that I haven't added the data here, but I suspect the answer is more code syntax and not data related.
gelman.diag() requires an mcmc.list. If you run the model more than once (ideally with overdispersed starting values), extract the 'Sol' component from each fit and place them in an mcmc.list (below, the same model is simply fit twice):
library(MCMCglmm)
model1 <- MCMCglmm(PO ~ 1, random = ~FSfamily, data = PlodiaPO, verbose = FALSE,
                   nitt = 1300, burnin = 300, thin = 1)
model2 <- MCMCglmm(PO ~ 1, random = ~FSfamily, data = PlodiaPO, verbose = FALSE,
                   nitt = 1300, burnin = 300, thin = 1)
mclist <- mcmc.list(model1$Sol, model2$Sol)
gelman.diag(mclist)
# gelman.diag(mclist)
#Potential scale reduction factors:
# Point est. Upper C.I.
#(Intercept) 1 1
According to the documentation, it is applicable only when more than one MCMC chain is available:
Gelman and Rubin (1992) propose a general approach to monitoring convergence of MCMC output in which m > 1 parallel chains are run with starting values that are overdispersed relative to the posterior distribution.
The input x here is
x - An mcmc.list object with more than one chain, and with starting values that are overdispersed with respect to the posterior distribution.
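Applied to the model in the question, the same idea would look roughly like this (a sketch only, assuming final_MCMC and prior.invgamma are still in the workspace): refit the model two or more times to obtain independent chains, collect each fit's Sol, and pass the resulting mcmc.list to gelman.diag().
library(MCMCglmm)
library(coda)
## Refit the same specification several times to obtain independent chains
chains <- lapply(1:3, function(i)
  MCMCglmm(cbind(OFT1, MIS1, PC1, PC2) ~ trait - 1 + trait:sex + trait:date,
           random  = ~us(trait):squirrel_id + us(trait):year,
           rcov    = ~us(trait):units,
           family  = rep("gaussian", 4),
           data    = final_MCMC,
           prior   = prior.invgamma,
           pr      = TRUE,   # with pr = TRUE, Sol also contains the BLUPs
           nitt    = 103000, thin = 100, burnin = 3000,
           verbose = FALSE))
## Combine the sampled fixed effects (and BLUPs) from each chain and run the diagnostic
mclist <- do.call(mcmc.list, lapply(chains, function(m) m$Sol))
gelman.diag(mclist, multivariate = FALSE)
With pr = TRUE the Sol matrix is large, so you may prefer to drop that argument (or subset the fixed-effect columns) when checking convergence.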
I'm trying to run a relatively straightforward glmer model and get warnings that the fit is singular (isSingular), and I can't figure out why.
In my dataset, 40 participants did 108 trials. They responded to a question (the response is coded as correct/incorrect - 0/1) and rated confidence in their response on a continuous scale from 0 to 1.
library(lme4)
library(tidybayes)
library(tidyverse)
set.seed(5)
n_trials = 108
n_subjs = 40
data =
  tibble(
    subject = as.factor(rep(c(1:n_subjs), n_trials)),
    correct = sample(c(0, 1), replace = TRUE, size = (n_trials * n_subjs)),
    confidence = runif(n_trials * n_subjs)
  )
I'm trying to run a mixed effects logistic regression, to estimate each participant's ability to associate high confidence to correct responses only. That means, I have good reasons to add the random slope of confidence in my model.
The simplest model that I'm interested in gives me:
model = glmer(correct ~ confidence + (confidence|subject) ,
data = data,
family = binomial)
boundary (singular) fit: see ?isSingular, and
> isSingular(model)
[1] TRUE
So I simplify the model beyond usefulness, and get the same problem:
model = glmer(correct ~ confidence + (1|subject) ,
data = data,
family = binomial)
I tried to bin confidence (I'm sure there are more elegant ways), in case that helped, but it didn't:
#Initialize as vector of 0s
data$confidence_binned <- numeric(dim(data)[1])
nbins = 4
bins=seq(0,1,length.out = (nbins+1))
for (b in 1:(length(bins)-1)) {
data$confidence_binned[data$confidence>=bins[b] & data$confidence<bins[b+1]] = b
}
data$confidence_binned[data$confidence == 1] = nbins  # values exactly equal to 1 fall outside the half-open bins above
model = glmer(correct ~ confidence_binned + (confidence_binned|subject) ,
data = data,
family = binomial)
boundary (singular) fit: see ?isSingular
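(As a side note, and unrelated to the singularity itself, the binning above can be done more concisely with cut(); this is just a sketch of an equivalent approach:)
## Equivalent binning with cut(); bin edges are handled slightly differently at the
## boundaries, but the result is the same 1..nbins integer coding for these data
data$confidence_binned <- cut(data$confidence,
                              breaks = seq(0, 1, length.out = nbins + 1),
                              labels = FALSE, include.lowest = TRUE)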
There are many posts and SO questions about the isSingular warning, but all the ones I've found say that the model is too complex for the data, and the solution is usually to 'keep it maximal'. However, this model is as simple as it can get, and I am confused that with (what sounds to me like) enough trials it still fails.
I also tried changing the controller, but it didn't help:
ctrl = glmerControl(optimizer = "bobyqa",
                    boundary.tol = 1e-5,
                    calc.derivs = TRUE,
                    use.last.params = FALSE,
                    sparseX = FALSE,
                    tolPwrss = 1e-7,
                    compDev = TRUE,
                    nAGQ0initStep = TRUE,
                    ## optimizer args
                    optCtrl = list(maxfun = 1e5))
model <- glmer(correct ~ confidence_binned + (confidence_binned | subject),
               data = data,
               verbose = T,
               control = ctrl,
               family = binomial)
Any help or pointers on what to look out for in the data are appreciated.
EDIT to respond to a comment: here is the result of ggplot(data, aes(x = subject, y = correct)) + stat_summary(fun.data = mean_cl_normal) (plot not reproduced here).
GLMMs with correlated random slopes and random intercepts (aka the maximal model) are notoriously difficult to fit even with well-behaved data, despite some who advocate for this approach. Unless you see seriously fluctuating by-subject or by-item variance for the random-slope predictors, my best advice is to fit a random-intercepts-only model and see if it fits better.
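A minimal sketch of that comparison, using the simulated data from the question:
library(lme4)
## Fit the maximal model and a random-intercepts-only model on the same data
m_max <- glmer(correct ~ confidence + (confidence | subject),
               data = data, family = binomial)
m_int <- glmer(correct ~ confidence + (1 | subject),
               data = data, family = binomial)
## Check singularity and compare the fits
isSingular(m_max)
isSingular(m_int)
anova(m_max, m_int)   # likelihood-ratio test / AIC comparison
If the random-slope variance is estimated at (or very near) zero, the simpler model will fit essentially as well; that is what the singular-fit message is flagging.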
For three comprehensive papers that have very different views on this subject, see below. The first is a paper often cited for the maximal approach. The second is written by the guy who created the lme4 package, who makes arguments for parsimonious models. The third is an additional peer-reviewed paper by Bates recommended by Ben Bolker.
Citations:
Maximal Model Perspective: Barr et al., 2013
Parsimonious Model Perspective: Bates et al., 2018
Balancing Type I Error and Power: Matuschek et al., 2017
Here's my question:
I have a medium-sized data set about the condition of a hydraulic system.
The data set consists of 68 variables plus the condition of the system (green, yellow, red).
I have to use several classifiers to predict the behaviour of the system, so I have divided my data set into training and test sets as follows.
(Regarding the conditions, the colours mean: red = warning, yellow = pay attention, green = good.)
This is what I wrote:
Tab$Condition = factor(Tab$Condition, labels = c("Yellow", "Green", "Red"))
set.seed(32343)
reg_Control = trainControl("repeatedcv", number = 5, repeats = 5, verboseIter = T, classProbs = T)
inTrain = createDataPartition(y = Tab$Condition, p = 0.75, list = FALSE)
training = Tab[inTrain,]
testing = Tab[-inTrain,]
I'm using a linear SVM classifier to predict the behaviour of the system.
I started by using a random value for C to see what kind of results I should get.
svmLinear = train(Condition ~ ., data = training, method = "svmLinear",
                  trControl = reg_Control, tuneGrid = data.frame(C = seq(0.1, 1, 0.1)))
svmLPredictions = predict(svmLinear, newdata = training)
confusionMatrix(svmLPredictions, training$Condition)
# misclassification of 129/1655, accuracy of 92.21%
svmLPred = predict(svmLinear, newdata = testing)
confusionMatrix(svmLPred, testing$Condition)
# misclassification of 41/550, accuracy of 92.55%
I've used a linear SVM classifier to predict the behaviour of the system.
As I said before, I started with a random value for C.
How do I then decide on the best value of C to use for the analysis?
Sorry if the question is banal, but I'm a beginner!
Answers will be helpful!
Thanks
Caret calls other packages to run the actual modelling process; caret itself is only a (very powerful) convenience package in this regard. However, it does this automatically, so a user might not realize it unless an error is thrown.
Anyway, I have cobbled together an example to explain the process.
library(caret)
data("iris")
set.seed(1024)
tr <- createDataPartition(iris$Species, list = FALSE)
training <- iris[ tr,]
testing <- iris[-tr,]
#head(training)
fitControl <- trainControl(## smaller values for a quick run
  method = "repeatedcv",
  number = 5,
  repeats = 4)
set.seed(1024)
tunegrid = data.frame(C = c(0.25, 0.5, 1, 5, 8, 12, 100))
tunegrid
svmfit <- train(Species ~ ., data = training,
                method = "svmLinear",
                trControl = fitControl,
                tuneGrid = tunegrid)
# print this; it will give the model's accuracy (on training data) for the
# various parameter values
svmfit
#C Accuracy Kappa
#0.25 0.9533333 0.930
#0.50 0.9666667 0.950
#1.00 0.9766667 0.965
#5.00 0.9800000 0.970
#8.00 0.9833333 0.975
#12.00 0.9833333 0.975
#100.00 0.9400000 0.910
#The final value used for the model was C = 8.
# it has already chosen the best model (as per train Accuracy )
# how well does it work on test data?
preds <-predict(svmfit, testing)
cmSVM <-confusionMatrix(preds, testing$Species)
print(cmSVM)
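If it helps, you can also inspect the tuning results visually; caret's plot() method for train objects shows the cross-validated accuracy against C, and bestTune stores the value that was selected:
plot(svmfit)      # resampled accuracy as a function of C
svmfit$bestTune   # the value of C chosen by train()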
I can't tell from the documentation whether the predict.H2OModel() function from the h2o package in R gives OOB predictions for random forest models built using h2o.randomForest().
In fact, in the 3-4 examples I've tried, the results of predict.H2OModel() seem closer to the non-OOB predictions from predict.randomForest() in the randomForest package than to the OOB ones.
Does anyone know if they are OOB predictions? If not, do you know how to get OOB predictions for h2o.randomForest() models?
Example:
set.seed(123)
library(randomForest)
library(h2o)
data(mtcars)
d = mtcars[,c('mpg', 'cyl', 'disp', 'hp', 'wt' )]
## define some common settings for both random forests
n.trees=1000
mtry = 3
min.node = 3
## prep for h2o.randomForest
h2o.init()
d.h2o= as.h2o(d)
x.names = colnames(d)[2:5] ## predictors
## fit both models
set.seed(123);
rf = randomForest(mpg ~ ., data = d , ntree=n.trees, mtry = mtry, nodesize=min.node)
h2o = h2o.randomForest(y='mpg', x=x.names, training_frame = d.h2o, ntrees=n.trees, mtries = mtry, min_rows=min.node)
## Correct way and incorrect way of getting OOB predictions for a randomForest model. Not sure about h2o model.
d$rf.oob.pred = predict(rf) ## Gives OOB predictions
d$rf.pred = predict(rf , newdata=d ) ## Doesn't give OOB predictions.
d$h2o.pred = as.vector(predict(h2o, newdata=d.h2o)) ## Not sure if this is OOB or not.
## d$h2o.pred seems more similar to d$rf.pred than d$rf.oob.pred,
## suggesting that predict.H2OModel() might not give OOB predictions.
mean((d$rf.pred - d$h2o.pred)^2)
mean((d$rf.oob.pred - d$h2o.pred)^2)
H2O's h2o.predict() does not provide predictions on the OOB data. You have to specify which dataset you want to predict on with the newdata parameter, so when you pass newdata = d.h2o you get predictions for the d.h2o frame you've specified.
Currently there is no method to get predictions for the OOB data. However, there is a JIRA ticket for specifying whether you would like OOB metrics (note this ticket also links to another ticket that helps clarify how training metrics are currently reported for Random Forest).
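What you can get today are OOB-based metrics rather than per-row OOB predictions: the training metrics H2O stores on a DRF model are computed on the out-of-bag samples (the model printout describes them as "Metrics reported on Out-Of-Bag training samples"). A small sketch using the model fit above, where the model object happens to be named h2o:
## Training metrics for DRF are computed on the out-of-bag samples
perf_oob <- h2o.performance(h2o, train = TRUE)
perf_oob            # prints MSE, RMSE, etc. for the OOB training data
h2o.mse(perf_oob)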
I am currently trying to build a multi-class prediction model to predict a letter out of the 26 letters of the English alphabet. I have built a few models using ANN, SVM, ensembles, and naive Bayes, but I am stuck at evaluating their accuracy. Although the confusion matrix shows me the letter-wise true and false predictions, I am only able to get an overall accuracy for each model. Is there a way to evaluate a model's accuracy similar to the ROC and AUC values for a binomial classification?
Note: I am currently running the models using the H2O package as it saves me time.
Once you train a model in H2O, if you simply do: print(fit) it will show you all the available metrics for that model type. For multiclass, I'd recommend h2o.mean_per_class_error().
R code example on the iris dataset:
library(h2o)
h2o.init(nthreads = -1)
data(iris)
fit <- h2o.naiveBayes(x = 1:4,
                      y = 5,
                      training_frame = as.h2o(iris),
                      nfolds = 5)
Once you have the model, you can evaluate its performance using the h2o.performance() function to view all the metrics:
> h2o.performance(fit, xval = TRUE)
H2OMultinomialMetrics: naivebayes
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
Cross-Validation Set Metrics:
=====================
Extract cross-validation frame with `h2o.getFrame("iris")`
MSE: (Extract with `h2o.mse`) 0.03582724
RMSE: (Extract with `h2o.rmse`) 0.1892808
Logloss: (Extract with `h2o.logloss`) 0.1321609
Mean Per-Class Error: 0.04666667
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,xval = TRUE)`
=======================================================================
Top-3 Hit Ratios:
k hit_ratio
1 1 0.953333
2 2 1.000000
3 3 1.000000
Or you can look at a particular metric, like mean_per_class_error:
> h2o.mean_per_class_error(fit, xval = TRUE)
[1] 0.04666667
If you want to view performance on a test set, then you can do the following:
perf <- h2o.performance(fit, test)
h2o.mean_per_class_error(perf)
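If you also want a per-class breakdown rather than just the mean, the confusion matrix on the same kind of performance object shows the error rate for each class:
## Per-class error rates from the cross-validated metrics
perf_cv <- h2o.performance(fit, xval = TRUE)
h2o.confusionMatrix(perf_cv)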