Goodness of fit test for logistic model - r

I want know what g means and why use Lemeshow goodness of fit (GOF) test in Research ?? and what wrong in "confusion matrix for logistic regression ?
this message :
Error in confusionMatrix(cnfmat) :
could not find function "confusionMatrix"
# ..Binary Logistic Regression :
install.packages("caTools")
library(caTools)
require(caTools)
sample = sample.split(diabetes$Outcome, SplitRatio=0.80)
train = subset(diabetes, sample==TRUE)
test = subset(diabetes, sample==FALSE)
nrow(diabetes) ##calculationg the total number of rows
nrow(train) ## total number of Train data rows >> 0.80 * 768
nrow(test) ## total number of Test data rows >> 0.20 * 768
str(train) ## Structure of train set
Logis_mod<- glm(Outcome~Pregnancies+Glucose+BloodPressure+SkinThickness+
Insulin+BMI+DiabetesPedigreeFunction+Age,family = binomial,data = train)
summary(Logis_mod)
#AIC .. Akaike information criteria ...
#A good model is the one that has minimum AIC among all the other models.
# Testing the Model
glm_probs <- predict(Logis_mod, newdata = test, type = "response")
summary(glm_probs)
glm_pred <- ifelse(glm_probs > 0.5, 1, 0)
summary(glm_pred)
#Avarge prediction for each of the Two outcomes ..
tapply(glm_pred,train$Outcome,mean)
# Confusion Matrix for logistic regression
install.packages("e1071")
library(e1071)
prdval <-predict(Logis_mod,type = "response")
prdbln <-ifelse(prdval > 0.5, 1, 0)
cnfmat <-table(prd=prdbln,act =train$Outcome)
confusionMatrix(cnfmat)
#Odd Ratio :
exp(cbind("OR"=coef(Logis_mod),confint(Logis_mod)))

I'm not sure what "g" you are referring to, but I'm going to assume it's your resulting computed statistic from your Lemeshow test. If this is the case then values of "g" indicate how well a model explains the variability in the data and can be used to compare models based on the same set of data (better models will have larger "g" values).
More generally, any goodness of fit (GOF) test in research is used to determine how well your model fits the variability in your data.
Additionally, you are receiving your error because the confusionMatrix() function is a part of the caret R package. Install caret by first running the below line of code in R or RStudio.
install.packages("caret")
Then in your code change
cnfmat <-table(prd=prdbln,act =train$Outcome)
confusionMatrix(cnfmat)
to
cnfmat <-data.frame(prd=prdbln,act =train$Outcome, stringsAsFactors = FALSE)
caret::confusionMatrix(cnfmat)

Related

Quasi-Poisson mixed-effect model on overdispersed count data from multiple imputed datasets in R

I'm dealing with problems of three parts that I can solve separately, but now I need to solve them together:
extremely skewed, over-dispersed dependent count variable (the number of incidents while doing something),
necessity to include random effects,
lots of missing values -> multiple imputation -> 10 imputed datasets.
To solve the first two parts, I chose a quasi-Poisson mixed-effect model. Since stats::glm isn't able to include random effects properly (or I haven't figured it out) and lme4::glmer doesn't support the quasi-families, I worked with glmer(family = "poisson") and then adjusted the std. errors, z statistics and p-values as recommended here and discussed here. So I basically turn Poisson mixed-effect regression into quasi-Poisson mixed-effect regression "by hand".
This is all good with one dataset. But I have 10 of them.
I roughly understand the procedure of analyzing multiple imputed datasets – 1. imputation, 2. model fitting, 3. pooling results (I'm using mice library). I can do these steps for a Poisson regression but not for a quasi-Poisson mixed-effect regression. Is it even possible to A) pool across models based on a quasi-distribution, B) get residuals from a pooled object (class "mipo")? I'm not sure. Also I'm not sure how to understand the pooled results for mixed models (I miss random effects in the pooled output; although I've found this page which I'm currently trying to go through).
Can I get some help, please? Any suggestions on how to complete the analysis (addressing all three issues above) would be highly appreciated.
Example of data is here (repre_d_v1 and repre_all_data are stored in there) and below is a crucial part of my code.
library(dplyr); library(tidyr); library(tidyverse); library(lme4); library(broom.mixed); library(mice)
# please download "qP_data.RData" from the last link above and load them
## ===========================================================================================
# quasi-Poisson mixed model from single data set (this is OK)
# first run Poisson regression on df "repre_d_v1", then turn it into quasi-Poisson
modelSingle = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson",
data = repre_d_v1)
# I know there are some warnings but it's because I share only a modified subset of data with you (:
printCoefmat(coef(summary(modelSingle))) # unadjusted coefficient table
# define quasi-likelihood adjustment function
quasi_table = function(model, ctab = coef(summary(model))) {
phi = sum(residuals(model, type = "pearson")^2) / df.residual(model)
qctab = within(as.data.frame(ctab),
{`Std. Error` = `Std. Error`*sqrt(phi)
`z value` = Estimate/`Std. Error`
`Pr(>|z|)` = 2*pnorm(abs(`z value`), lower.tail = FALSE)
})
return(qctab)
}
printCoefmat(quasi_table(modelSingle)) # done, makes sense
## ===========================================================================================
# now let's work with more than one data set
# object "repre_all_data" of class "mids" contains 10 imputed data sets
# fit model using with() function, then pool()
modelMultiple = with(data = repre_all_data,
expr = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson"))
summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
# this has quite similar structure as coef(summary(someGLM))
# but I don't see where are the random effects?
# and more importantly, I wanted a quasi-Poisson model, not just Poisson model...
# ...but here it is not possible to use quasi_table function (defined earlier)...
# ...and that's because I can't compute "phi"
This seems reasonable, with the caveat that I'm only thinking about the computation, not whether this makes statistical sense. What I'm doing here is computing the dispersion for each of the individual fits and then applying it to the summary table, using a variant of the machinery that you posted above.
## compute dispersion values
phivec <- vapply(modelMultiple$analyses,
function(model) sum(residuals(model, type = "pearson")^2) / df.residual(model),
FUN.VALUE = numeric(1))
phi_mean <- mean(phivec)
ss <- summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
## adjust
qctab <- within(as.data.frame(ss),
{ std.error <- std.error*sqrt(phi_mean)
statistic <- estimate/std.error
p.value <- 2*pnorm(abs(statistic), lower.tail = FALSE)
})
The results look weird (dispersion < 1, all model results identical), but I'm assuming that's because you gave us a weird subset as a reproducible example ...

How to test accuracy of a trained knn model in R Studio?

The objective is to train a model to predict the default variable. Train a KNN model with k = 13 using the knn3() function and calculate the test accuracy.
My code to solve this problem so far is:
# load packages
library("mlbench")
library("tibble")
library("caret")
library("rpart")
# set seed
set.seed(49607)
# load data and coerce to tibble
default = as_tibble(ISLR::Default)
# split data
dft_trn_idx = sample(nrow(default), size = 0.8 * nrow(default))
dft_trn = default[dft_trn_idx, ]
dft_tst = default[-dft_trn_idx, ]
# check data
dft_trn
# fit knn model
mod_knn = knn3(default ~ ., data = dft_trn, k = 13)
# make "predictions" with knn model
new_obs = data.frame(balance = 421, income = 28046)
predtrn = predict(mod_knn, new_obs, type = "prob")
confusionMatrix(predtrn,dft_trn)
at the last line of the code chunk, I get error "Error: data and reference should be factors with the same levels." I am unsure as to how I can fix this, or if this is even the correct method to measure the test accuracy.
Any help would be great, thanks!
First of all, as machine learner you are doing well because a necessary step is to split data into train and test set. The issue I found is that you are trying to compare a new prediction from data outside from test and train test. The principle in ML is to train the model on train dataset and then make predictions on test dataset in order to finally evaluate performance. You have the datasets for that (dft_tst). Here the code to obtain confusion matrix. As a reminder, if you have one predicted label without having the real label to compare, the confusion matrix will not be computed. Here the code to obtain the desired matrix:
# load packages
library("mlbench")
library("tibble")
library("caret")
library("rpart")
# set seed
set.seed(49607)
# load data and coerce to tibble
default = as_tibble(ISLR::Default)
Now, we split into train and test sets:
# split data
dft_trn_idx = sample(nrow(default), size = 0.8 * nrow(default))
dft_trn = default[dft_trn_idx, ]
dft_tst = default[-dft_trn_idx, ]
We train the model:
# fit knn model
mod_knn = knn3(default ~ ., data = dft_trn, k = 13)
Now, the key part is making predictions on test set (or any labelled set) and obtain the confusion matrix:
# make "predictions" with knn model
predtrn = predict(mod_knn, dft_tst, type = "class")
In order to compute the confusion matrix, the predictions and original labels must have the same lenght:
#Confusion matrix
confusionMatrix(predtrn,dft_tst$default)
Output:
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 1929 67
Yes 1 3
Accuracy : 0.966
95% CI : (0.9571, 0.9735)
No Information Rate : 0.965
P-Value [Acc > NIR] : 0.4348
Kappa : 0.0776
Mcnemar's Test P-Value : 3.211e-15
Sensitivity : 0.99948
Specificity : 0.04286
Pos Pred Value : 0.96643
Neg Pred Value : 0.75000
Prevalence : 0.96500
Detection Rate : 0.96450
Detection Prevalence : 0.99800
Balanced Accuracy : 0.52117
'Positive' Class : No

Extimate prediction accuracy of cox ph

i would like to develop a cox proportional hazard model with r, use it to predict input and evaluate the accuracy of the model. For the evaluation I would like to use the Brior score.
# import various packages, needed at some point of the script
library("survival")
library("survminer")
library("prodlim")
library("randomForestSRC")
library("pec")
library("rpart")
library("mlr")
library("Hmisc")
library("ipred")
# load lung cancer data
data("lung")
head(lung)
# recode status variable
lung$status <- lung$status-1
# Delete rows with missing values
lung <- na.omit(lung)
# split data into training and testing
## 80% of the sample size
smp_size <- floor(0.8 * nrow(lung))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(lung)), size = smp_size)
# training and testing data
train.lung <- lung[train_ind, ]
test.lung <- lung[-train_ind, ]
# time and failure event
s <- Surv(train.lung$time, train.lung$status)
# create model
cox.ph2 <- coxph(s~age+meal.cal+wt.loss, data=train.lung)
# predict
pred <- predict(cox.ph2, newdata = train.lung)
# evaluate
sbrier(s, pred)
as an outcome of the prediction I would expect the time (as in "when does this individuum experience failure). Instead I get values like this
[1] 0.017576359 -0.135928959 -0.347553969 0.112509137 -0.229301199 -0.131861582 0.044589175 0.002634008
[9] 0.345966978 0.209488560 0.002418358
What does that mean?
Furthermore sbrier does not work. Apparently it can not work with the prediction pred (no surprise there)
How do I solve this? How do I make a prediction with cox.ph2? How can I evaluate the model afterwards?
The predict() function won't return a time value, you have to specify the argument type = c("lp", "risk","expected","terms","survival") in the predict() function.
If you want to get the hazard ratios :
predict(cox.ph2, newdata = test.lung, type = "risk")
Note that you want to predict the values on the test set not the training set.
I have read that you can use AFT models in your case :
https://stats.stackexchange.com/questions/79362/how-to-get-predictions-in-terms-of-survival-time-from-a-cox-ph-model
You also can read this post :
Calculate the Survival prediction using Cox Proportional Hazard model in R
Hope it will help

How do I evaluate a multinomial classification model in R?

i am currently trying to build a muti-class prediction model to predict the letter out of 26 English alphabets. I have currently built a few models using ANN, SVM, Ensemble and nB. But i am stuck at the evaluating the accuracy of these models. Although the confusion matrix shows me the Alphabet-wise True and False predictions, I am only able to get an overall accuracy of each model. Is there a way to evaluate the model's accuracy similar to the ROC and AUC values for a Binomial Classification.
Note: I am currently running the model using the H2o package as it saves me more time.
Once you train a model in H2O, if you simply do: print(fit) it will show you all the available metrics for that model type. For multiclass, I'd recommend h2o.mean_per_class_error().
R code example on the iris dataset:
library(h2o)
h2o.init(nthreads = -1)
data(iris)
fit <- h2o.naiveBayes(x = 1:4,
y = 5,
training_frame = as.h2o(iris),
nfolds = 5)
Once you have the model, we can evaluate model performance using the h2o.performance() function to view all the metrics:
> h2o.performance(fit, xval = TRUE)
H2OMultinomialMetrics: naivebayes
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
Cross-Validation Set Metrics:
=====================
Extract cross-validation frame with `h2o.getFrame("iris")`
MSE: (Extract with `h2o.mse`) 0.03582724
RMSE: (Extract with `h2o.rmse`) 0.1892808
Logloss: (Extract with `h2o.logloss`) 0.1321609
Mean Per-Class Error: 0.04666667
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,xval = TRUE)`
=======================================================================
Top-3 Hit Ratios:
k hit_ratio
1 1 0.953333
2 2 1.000000
3 3 1.000000
Or you can look at a particular metric, like mean_per_class_error:
> h2o.mean_per_class_error(fit, xval = TRUE)
[1] 0.04666667
If you want to view performance on a test set, then you can do the following:
perf <- h2o.performance(fit, test)
h2o.mean_per_class_error(perf)

Different no of tuples for the prediction model and test set data in SVM

I have a dataset with two columns as shown below, where Column 1, timestamp is a particular value for time for which Column.10 gives the total power usage at that instance of time. There are totally 81502 instances for this data.
I'm doing support vector regression on this data in R using the e1071 package to predict the future usage of power. The code is given below. I first divided the dataset into training and test data. Then using the training data modeled the data using the svm function and then predict the power usage for the testset.
library(e1071)
attach(data.csv)
index <- 1:nrow(data.csv)
testindex <- sample(index,trunc(length(index)/3))
testset <- na.omit(data.csv[testindex, ])
trainingset <- na.omit(data.csv[-testindex, ])
model <- svm(Column.10 ~ timestamp, data=trainingset)
prediction <- predict(model, testset[,-2])
tab <- table(pred = prediction, true = testset[,2])
However, when I try to make a confusion matrix from the prediction, I'm getting the error:
Error in table(pred = prediction, true = testset[, 2]) : all arguments must have the same length
So I tried to find the length of the two arguments and found that
the length(prediction) to be 81502
and the length(testset[,2]) to be 27167
Since I had done the prediction only for the testset, I don't know how prediction is done for 81502 values. How are the total no of values different for the prediction and the testset? How is the power value for the entire dataset getting predicted eventhough I gave it only for the testset?
Change
prediction <- predict(model, testset[,-2])
in
prediction <- predict(model, testset)
However, you should not use table when doing regression, use the MSE instead.

Resources