Interpretation of AUC NaN values in h2o cross-validation predictions summary - r

I have noticed that for some runs of:
train=as.h2o(u)
mod = h2o.glm(family= "binomial", x= c(1:15), y="dc",
training_frame=train, missing_values_handling = "Skip",
lambda = 0, compute_p_values = TRUE, nfolds = 10,
keep_cross_validation_predictions= TRUE)
there are NaNs in the cross-validation metrics summary of AUC for some CV iterations of the model.
For example:
print(mod@model$cross_validation_metrics_summary["auc",])
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid cv_6_valid cv_7_valid cv_8_valid cv_9_valid cv_10_valid
auc 0.63244045 0.24962118 0.25 0.6666667 0.8095238 1.0 0.6666667 0.46666667 NaN NaN 1.0 0.2
NaNs seem to appear less frequently when I set a smaller number of folds, e.g. nfolds=7.
How should these NaN values be interpreted, and when does h2o cross-validation output them?
I suppose it happens when AUC can't be computed correctly in an iteration. My training set has 70 complete rows.
Can AUC cross-validation results that contain NaNs be considered reliable?

There are specific cases that cause a division by zero when calculating the ROC curve, which makes the AUC NaN. With so little data, it is likely that some folds contain no positive cases (dc=1) at all, so the true positive rate is undefined for them.
We can test this by keeping the fold column and then counting the values of dc in each fold:
...
train <- as.h2o(u)
mod <- h2o.glm(family = "binomial"
, x = c(1:15)
, y = "dc"
, training_frame = train
, missing_values_handling = "Skip"
, lambda = 0
, compute_p_values = TRUE
, nfolds = 10
, keep_cross_validation_fold_assignment = TRUE
, seed = 1234)
fold <- as.data.frame(h2o.cross_validation_fold_assignment(mod))
df <- cbind(u,fold)
table(df[c("dc","fold_assignment")])
fold_assignment
dc 0 1 2 3 4 5 6 7 8 9
0 4 6 6 2 9 6 6 4 4 6
1 2 2 3 4 0 2 0 0 1 2
mod@model$cross_validation_metrics_summary["auc",]
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid cv_6_valid cv_7_valid
auc 0.70238096 0.19357596 0.875 0.6666667 0.5 0.375 NaN 0.5833333 NaN
cv_8_valid cv_9_valid cv_10_valid
auc NaN 1.0 0.9166667
We see that the folds with NaN are the same folds that have only dc=0.
Setting the NaNs aside, the wide spread of AUC values across your folds (from 0.2 to 1) tells us that this is not a robust model, and it is likely overfitting. Can you add more data?
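To see why a fold with a single class yields NaN, here is a minimal base R sketch of the rank-based (Mann-Whitney) form of AUC. The helper auc_rank and the toy scores are made up for illustration, and this is not necessarily how h2o computes AUC internally; it only shows that the denominator contains the number of positives, so a fold without any dc=1 rows gives 0/0.

```r
# Rank-based AUC: probability that a random positive is scored above a
# random negative. auc_rank is a hypothetical helper, not an h2o function.
auc_rank <- function(labels, scores) {
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)  # average ranks, so ties are handled
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

scores <- c(0.1, 0.4, 0.35, 0.8)  # toy predicted probabilities

# Fold containing both classes: AUC is defined
auc_mixed <- auc_rank(c(0, 0, 1, 1), scores)

# Fold containing only dc = 0: numerator and denominator are both 0
auc_single_class <- auc_rank(c(0, 0, 0, 0), scores)

print(auc_mixed)
print(is.nan(auc_single_class))
```

If your h2o version supports it, passing fold_assignment = "Stratified" to h2o.glm may help keep some dc=1 rows in every fold, although with this few positives the per-fold AUC will remain very noisy.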

Related

R: How to fix an error when coding a Bonferroni correction

I am stuck on how to code the Bonferroni correction and the raw p-values for a Pearson correlation matrix in RStudio. I am a student and new to R. When I calculated the Pearson correlation matrix I got only the r values, not the raw p-values, and I am not sure how to get those. I then tried to calculate the Bonferroni correction and received an error message saying a list object cannot be coerced to type double. How do I fix my code so this error goes away? I also tried to create a table of the mean, SD, and n for the data and got stuck on how to proceed.
My data is as follows:
Tree Height DBA Leaf Diameter
45.3 14.9 0.76
75.2 26.6 1.06
70.1 22.9 1.19
95 31.8 1.59
107.8 35.5 0.43
93 26.2 1.49
91.5 29 1.19
78.5 29.2 1.1
85.2 30.3 1.24
50 16.8 0.67
47.1 12.8 0.98
73.2 28.4 1.2
Packages I have installed: dplyr, tidyr, multcomp, multcompView
I read in the data from an Excel CSV (comma delimited) file, which creates the data frame dataHW8_1 (12 obs. of 3 variables)
summary(dataHW8_1)
I then created Scatterplots of the data
plot(dataHW8_1$Tree_Height,dataHW8_1$DBA,main="Scatterplot Tree Height Vs Trunk Diameter at Breast Height (DBA)",xlab="Tree Height (cm)",ylab="DBA (cm)")
plot(dataHW8_1$Tree_Height,dataHW8_1$Leaf_Diameter,main="Scatterplot Tree Height Vs Leaf Diameter",xlab="Tree Height (cm)",ylab="Leaf Diameter (cm)")
plot(dataHW8_1$DBA,dataHW8_1$Leaf_Diameter,main="Scatterplot Trunk Diameter at Breast Height (DBA) Vs Leaf Diameter",xlab="DBA (cm)",ylab="Leaf Diameter (cm)")
I then noticed that the data was not linear, so I transformed it using the log() function
dataHW8_1log = log(dataHW8_1)
I then re-created my Scatterplots using the transformed data
plot(dataHW8_1log$Tree_Height,dataHW8_1log$DBA,main="Scatterplot of
Transformed (log)Tree Height Vs Trunk Diameter at Breast Height
(DBA)",xlab="Tree Height (cm)",ylab="DBA (cm)")
plot(dataHW8_1log$Tree_Height,dataHW8_1log$Leaf_Diameter,main="Scatterplot
of Transformed (log)Tree Height Vs Leaf Diameter",xlab="Tree Height
(cm)",ylab="Leaf Diameter (cm)")
plot(dataHW8_1log$DBA,dataHW8_1log$Leaf_Diameter,main="Scatterplot of
Transformed (log) Trunk Diameter at Breast Height (DBA) Vs Leaf
Diameter",xlab="DBA (cm)",ylab="Leaf Diameter (cm)")
I then created a matrix plot of Scatterplots
pairs(dataHW8_1log)
I then calculated the correlation coefficient using the Pearson method
this does not give a matrix of uncorrected p-values------How do you do that?
cor(dataHW8_1log,method="pearson")
I am stuck on what to do to get a matrix of the raw probabilities (uncorrected p-values) of the data
I then calculated the Bonferroni correction-----How do you do that?
Data$Bonferroni =
p.adjust(dataHW8_1log,
method = "bonferroni")
Doing this gave me the following error:
Error in p.adjust(dataHW8_1log, method = "bonferroni") :
(list) object cannot be coerced to type 'double'
I tried to fix this using lapply, but that did not fix my problem
I then tried to make a table of mean, SD, n, but I was only able to create the following code and became stuck on where to go from there------How do you do that?
(,data = dataHW8_1log,
FUN = function(x) c(Mean = mean(x, na.rm = T),
n = length(x),
sd = sd(x, na.rm = T))
I have tried following examples online, but none of them have helped me get the Bonferroni correction to work. If anyone can help explain what I did wrong and how to make the matrices/table, I would greatly appreciate it.
Here is an example using a 50 rows by 10 columns sample dataframe.
# 50 rows x 10 columns sample dataframe
df <- as.data.frame(matrix(runif(500), ncol = 10));
We can show pairwise scatterplots.
# Pairwise scatterplot
pairs(df);
We can now use cor.test to get p-values for a single comparison. We use a convenience function cor.test.p to do this for all pairwise comparisons. To give credit where credit is due, the function cor.test.p has been taken from this SO post, and takes as an argument a dataframe whilst returning a matrix of uncorrected p-values.
# cor.test on dataframes
# From: https://stackoverflow.com/questions/13112238/a-matrix-version-of-cor-test
cor.test.p <- function(x) {
FUN <- function(x, y) cor.test(x, y)[["p.value"]];
z <- outer(
colnames(x),
colnames(x),
Vectorize(function(i,j) FUN(x[,i], x[,j])));
dimnames(z) <- list(colnames(x), colnames(x));
return(z);
}
# Uncorrected p-values from pairwise correlation tests
pval <- cor.test.p(df);
We now correct for multiple hypothesis testing by applying the Bonferroni correction to every row (or column, since the matrix is symmetric) and we're done. Note that p.adjust takes a vector of p-values as an argument.
# Multiple hypothesis-testing corrected p-values
# Note: pval is a symmetric matrix, so it doesn't matter if we correct
# by column or by row
padj <- apply(pval, 2, p.adjust, method = "bonferroni");
padj;
#V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#V1 0 1 1.0000000 1 1.0000000 1.0000000 1 1 1.0000000 1
#V2 1 0 1.0000000 1 1.0000000 1.0000000 1 1 1.0000000 1
#V3 1 1 0.0000000 1 0.9569498 1.0000000 1 1 1.0000000 1
#V4 1 1 1.0000000 0 1.0000000 1.0000000 1 1 1.0000000 1
#V5 1 1 0.9569498 1 0.0000000 1.0000000 1 1 1.0000000 1
#V6 1 1 1.0000000 1 1.0000000 0.0000000 1 1 0.5461443 1
#V7 1 1 1.0000000 1 1.0000000 1.0000000 0 1 1.0000000 1
#V8 1 1 1.0000000 1 1.0000000 1.0000000 1 0 1.0000000 1
#V9 1 1 1.0000000 1 1.0000000 0.5461443 1 1 0.0000000 1
#V10 1 1 1.0000000 1 1.0000000 1.0000000 1 1 1.0000000 0
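Applied to the data from the question, the same idea can be sketched as follows. The error in the question arises because p.adjust expects a numeric vector of p-values, whereas it was given the data frame itself (a list). As an alternative to correcting each column of the p-value matrix, this sketch corrects only the three unique pairwise tests, which is one common reading of a Bonferroni correction for a correlation matrix; it also builds the mean/SD/n table. The object names (dat, dat_log, summ, raw_p, adj_p) are made up for illustration.

```r
# Data copied from the question
dat <- data.frame(
  Tree_Height   = c(45.3, 75.2, 70.1, 95, 107.8, 93, 91.5, 78.5, 85.2, 50, 47.1, 73.2),
  DBA           = c(14.9, 26.6, 22.9, 31.8, 35.5, 26.2, 29, 29.2, 30.3, 16.8, 12.8, 28.4),
  Leaf_Diameter = c(0.76, 1.06, 1.19, 1.59, 0.43, 1.49, 1.19, 1.1, 1.24, 0.67, 0.98, 1.2)
)
dat_log <- log(dat)

# Table of mean, SD and n: one row per variable
summ <- t(sapply(dat_log, function(x) {
  c(mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE),
    n    = sum(!is.na(x)))
}))

# Raw (uncorrected) p-value for each unique pair of variables
pairs_idx <- combn(names(dat_log), 2)
raw_p <- apply(pairs_idx, 2, function(v) {
  cor.test(dat_log[[v[1]]], dat_log[[v[2]]], method = "pearson")$p.value
})
names(raw_p) <- apply(pairs_idx, 2, paste, collapse = " vs ")

# Bonferroni correction over the three tests; input is a numeric vector
adj_p <- p.adjust(raw_p, method = "bonferroni")

print(summ)
print(cbind(raw_p, adj_p))
```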

Conditional logistic regression: within subject matching

I'm trying to compare the prevalence of a specific lesion (binary) at the symptomatic side to the asymptomatic side within a group of patients.
I've already performed a McNemar test to compare the prevalence at the symptomatic versus asymptomatic side within patients.
However, I've also been asked to perform a conditional logistic regression. I'm not sure if my syntax is correct with respect to the stratification:
summary(clogit(ds$symp ~ ds$asymp, strata(ds$ID), data=ds, method = "exact"))
Does R compare both sides of the patient (symptomatic vs asymptomatic) within the patient(s)? Or do I have to duplicate manually the patient ID (one ID for the symptomatic side AND one ID for the asymptomatic side)?
Thanks,
An example:
ID symp asymp
1 0 0
2 1 0
3 0 0
4 0 0
5 1 0
6 1 1
7 0 0
8 0 0
9 0 1
10 0 0
As an example: patient 2 has a lesion at the symptomatic side only and patient 9 at the asymptomatic side only. Patient 6 has lesions at both sides.
An exact McNemar test shows:
test <- table(df$symp, df$asymp)
compare <- exact2x2(test, paired = TRUE, alternative = "two.sided", tsmethod = "central")
print(compare)
Exact McNemar test (with central confidence intervals)
data: test
b = 1, c = 2, p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.00847498 9.60452988
sample estimates:
odds ratio
0.5
However, a conditional logistic regression model:
> summary(clogit(df$symp ~ df$asymp, strata(df$ID), data=df, method = "exact"))
Call:
coxph(formula = Surv(rep(1, 10L), df$symp) ~ df$asymp, data = df,
method = "exact")
n= 10, number of events= 3
coef exp(coef) se(coef) z Pr(>|z|)
df$asymp 0.973 2.646 1.524 0.638 0.523
exp(coef) exp(-coef) lower .95 upper .95
df$asymp 2.646 0.378 0.1334 52.46
Rsquare= 0.039 (max possible= 0.616 )
Likelihood ratio test= 0.4 on 1 df, p=0.528
Wald test = 0.41 on 1 df, p=0.5232
Score (logrank) test = 0.43 on 1 df, p=0.5127
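Note that the Call printed above contains no strata() term in the formula, which suggests the stratification was not actually used. A minimal sketch of one way to set this up, assuming the usual clogit convention that strata() goes inside the formula and that the data are in long format (one row per side per patient): the names wide, long, side and lesion are made up for illustration.

```r
library(survival)

# Example data from the question (one row per patient)
wide <- data.frame(
  ID    = 1:10,
  symp  = c(0, 1, 0, 0, 1, 1, 0, 0, 0, 0),
  asymp = c(0, 0, 0, 0, 0, 1, 0, 0, 1, 0)
)

# Long format: two rows per patient, one for each side
long <- data.frame(
  ID     = rep(wide$ID, times = 2),
  side   = rep(c(1, 0), each = nrow(wide)),  # 1 = symptomatic, 0 = asymptomatic
  lesion = c(wide$symp, wide$asymp)
)

# Each patient is a stratum, so sides are compared within patients
fit <- clogit(lesion ~ side + strata(ID), data = long, method = "exact")
summary(fit)
```

With 1:1 matching like this, only the discordant patients (2, 5 and 9) contribute to the estimate, so the odds ratio for side should agree with the ratio of discordant pairs used by the McNemar test, possibly inverted depending on which side is coded as 1.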

Getting predicted values at response scale using broom::augment function

I'm fitting glm model in R and can get predicted values at response scale using predict.glm(object=fm1, type="response") where fm1 is the fitted model. I wonder how to get predicted values at response scale using augment function from broom package. My minimum working example is given below.
Dilution <- c(1/128, 1/64, 1/32, 1/16, 1/8, 1/4, 1/2, 1, 2, 4)
NoofPlates <- rep(x=5, times=10)
NoPositive <- c(0, 0, 2, 2, 3, 4, 5, 5, 5, 5)
Data <- data.frame(Dilution, NoofPlates, NoPositive)
fm1 <- glm(formula=NoPositive/NoofPlates~log(Dilution),
family=binomial("logit"), data=Data, weights=NoofPlates)
predict.glm(object=fm1, type="response")
# 1 2 3 4 5 6 7 8 9 10
# 0.02415120 0.07081045 0.19005716 0.41946465 0.68990944 0.87262421 0.95474066 0.98483820 0.99502511 0.99837891
library(broom)
broom::augment(x=fm1)
# NoPositive.NoofPlates log.Dilution. X.weights. .fitted .se.fit .resid .hat .sigma
# 1 0.0 -4.8520303 5 -3.6989736 1.1629494 -0.4944454 0.15937234 0.6483053
# 2 0.0 -4.1588831 5 -2.5743062 0.8837030 -0.8569861 0.25691194 0.5662637
# 3 0.4 -3.4657359 5 -1.4496388 0.6404560 1.0845988 0.31570923 0.4650405
# 4 0.4 -2.7725887 5 -0.3249714 0.4901128 -0.0884021 0.29247321 0.6784308
# 5 0.6 -2.0794415 5 0.7996960 0.5205868 -0.4249900 0.28989252 0.6523116
# 6 0.8 -1.3862944 5 1.9243633 0.7089318 -0.4551979 0.27931425 0.6486704
# 7 1.0 -0.6931472 5 3.0490307 0.9669186 0.6805552 0.20199632 0.6155754
# 8 1.0 0.0000000 5 4.1736981 1.2522190 0.3908698 0.11707018 0.6611557
# 9 1.0 0.6931472 5 5.2983655 1.5498215 0.2233227 0.05944982 0.6739965
# 10 1.0 1.3862944 5 6.4230329 1.8538108 0.1273738 0.02781019 0.6778365
# .cooksd .std.resid
# 1 0.0139540988 -0.5392827
# 2 0.0886414317 -0.9941540
# 3 0.4826245827 1.3111391
# 4 0.0022725303 -0.1050972
# 5 0.0543073747 -0.5043322
# 6 0.0637954916 -0.5362006
# 7 0.0375920888 0.7618349
# 8 0.0057798939 0.4159767
# 9 0.0008399932 0.2302724
# 10 0.0001194412 0.1291827
In a generalized linear model, the linear predictor is related to the mean of the response through a link function. For a Gaussian model this is the identity function, but for logistic regression it is the logit (probit is another option). This means you can get predicted values either on the link scale (the "raw" linear predictor) or on the response scale. This is why ?predict.glm offers a type argument, which corresponds to type.predict in augment.
broom::augment(x=fm1, newdata = Data, type.predict = "response")
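The relationship between the two scales can be checked in base R, without broom. This sketch refits the model from the question and confirms that applying the inverse logit (plogis) to the link-scale predictions reproduces the response-scale ones; the names link_scale and resp_scale are made up for illustration.

```r
# Data and model from the question
Dilution   <- c(1/128, 1/64, 1/32, 1/16, 1/8, 1/4, 1/2, 1, 2, 4)
NoofPlates <- rep(x = 5, times = 10)
NoPositive <- c(0, 0, 2, 2, 3, 4, 5, 5, 5, 5)
Data <- data.frame(Dilution, NoofPlates, NoPositive)

fm1 <- glm(formula = NoPositive / NoofPlates ~ log(Dilution),
           family = binomial("logit"), data = Data, weights = NoofPlates)

# Link scale: the linear predictor (what .fitted shows by default)
link_scale <- predict(fm1, type = "link")

# Response scale: fitted probabilities
resp_scale <- predict(fm1, type = "response")

# The inverse logit maps one onto the other
print(all.equal(plogis(link_scale), resp_scale))
```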

Tuning of mtry by caret returning strange value

I tune the mtry parameter of randomForest using the train function from the caret package. There are only 48 columns in my X data, yet train returns mtry=50 as the best value, which should not be a valid value (>48). What is the explanation for that?
> dim(X)
[1] 93 48
> fit <- train(level~., data=data.frame(X,level), tuneLength=13)
> fit$finalModel
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 50
OOB estimate of error rate: 2.15%
Confusion matrix:
high low class.error
high 81 1 0.01219512
low 1 10 0.09090909
It is even worse if I don't set the tuneLength parameter:
> fit <- train(level~., data=data.frame(X,level))
> fit$finalModel
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 55
OOB estimate of error rate: 2.15%
Confusion matrix:
high low class.error
high 81 1 0.01219512
low 1 10 0.09090909
I can't provide the data because it is confidential. But there's nothing special about these data: each column is numeric or a factor, and there are no missing values.
The apparent discrepancy is most likely[1] between the number of columns in your data set and the number of predictors, which may not be the same if any of the columns are factors. You used the formula method, which will expand the factors into dummy variables. For example:
> head(model.matrix(Sepal.Width ~ ., data = iris))
(Intercept) Sepal.Length Petal.Length Petal.Width Speciesversicolor Speciesvirginica
1 1 5.1 1.4 0.2 0 0
2 1 4.9 1.4 0.2 0 0
3 1 4.7 1.3 0.2 0 0
4 1 4.6 1.5 0.2 0 0
5 1 5.0 1.4 0.2 0 0
6 1 5.4 1.7 0.4 0 0
So there are 4 predictor columns in iris (Sepal.Width being the outcome), but you end up with 5 (non-intercept) predictors.
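The same count can be made programmatically, which may help to check whether dummy-variable expansion explains an mtry larger than the number of columns. This is a small sketch on iris; n_columns and n_predictors are names made up for illustration.

```r
# Predictor columns in the data frame (everything except the outcome)
n_columns <- ncol(iris) - 1

# Predictors after the formula interface expands factors into dummies
mm <- model.matrix(Sepal.Width ~ ., data = iris)
n_predictors <- ncol(mm) - 1  # drop the intercept column

print(c(columns = n_columns, predictors = n_predictors))
```

Running the equivalent on data.frame(X, level) with level ~ . would show how many predictors train actually sees. Passing x and y to train directly instead of a formula should avoid the expansion, assuming the underlying model accepts factors as randomForest does.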
Max
[1] This is why you need to provide a reproducible example. Often, when I get ready to ask a question, the answer becomes apparent while I take the time to write a good description of the issue.

caret train glm training error - empty result

I have a dataset with a number of binary predictors and a binary outcome. I am trying to use logistic regression to predict the outcome, using the caret package.
For some reason, after training, my model does not produce any results, although it finishes without errors. However, when I train with cross-validation, I do get results.
> Model = train(success ~ . - contestid - index - tags, data = p.train,
+ method = "glm",
+ family = binomial(link = "logit"),
+ trControl = trainControl(method = "none"));
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> Model$results
[1] Accuracy Kappa parameter
<0 rows> (or 0-length row.names)
With cross-validation:
> Model = train(success ~ . - contestid - index - tags, data = p.train,
+ method = "glm",
+ family = binomial(link = "logit"),
+ trControl = trainControl(method = "cv"));
There were 22 warnings (use warnings() to see them)
> Model$results
parameter Accuracy Kappa AccuracySD KappaSD
1 none 0.8 0.4208333 0.1972027 0.460482
> Model$resample
Accuracy Kappa Resample
1 0.75 0.5000000 Fold01
2 0.50 0.2000000 Fold02
3 1.00 1.0000000 Fold03
4 0.75 0.5000000 Fold04
5 1.00 1.0000000 Fold05
6 1.00 NA Fold06
7 0.75 0.5000000 Fold07
8 0.75 0.0000000 Fold08
9 0.50 -0.3333333 Fold09
10 1.00 NA Fold10
All warnings are the same, about the fitted probabilities, since my data allows perfect separation. However, this does not prevent training with cv to produce results.
What might be the reason for the absence of results in the first case?
Thanks
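A likely explanation: trainControl(method = "none") tells train to fit a single model on the whole training set without any resampling, so there are no held-out predictions from which Accuracy or Kappa could be estimated, and Model$results stays empty. If a number is still wanted in that case, it has to be computed from predictions directly. Below is a minimal base R sketch of that idea using the built-in mtcars data as a stand-in (the data from the question is not available); the variable names are made up for illustration, and accuracy on the training data is optimistic compared with cross-validation.

```r
# Stand-in logistic regression on a built-in data set
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial(link = "logit"))

# Predicted class on the training data, using a 0.5 cutoff
pred_prob  <- predict(fit, type = "response")
pred_class <- as.integer(pred_prob > 0.5)

# Apparent (training-set) accuracy: no resampling involved
apparent_accuracy <- mean(pred_class == mtcars$am)
print(apparent_accuracy)
```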
