This question already has answers here:
Obtaining threshold values from a ROC curve
(5 answers)
Closed 7 years ago.
My data is kind of irregular to apply ROC etc. for a threshold determination. To simplify, here is a demo, let x is
x<-c(0,0,0,12, 3, 4, 5, 15, 15.3, 20,18, 26)
Suppose x=15.1 is the unknown true threshold and the corresponding test outcome y will be negative (0) if x==0 OR x > 15.1, otherwise y is positive(1) such that:
y<-c(0,0,0,1, 1, 1, 1, 1, 0,0,0,0)
Due to 0 is a positive outcome in x, I'm wondering in which way I can determine the threshold of x to predict y the best. I have tried R packaged pROC and ROCR, both seem not straight forward for this situation. Would somebody have me some suggestions?
You have a situation where you predict 0 for high values of x and predict 1 for low values of x, except you always predict 0 if x == 0. Standard packages like pROC and ROCR expect low values of x to be associated with predicting y=0. You could transform your data to this situation by:
Flipping the sign of all your predictions
Replacing 0 with a small negative value in x
In code (using this answer to extract TPR and FPR for each cutoff):
x2 <- -x
x2[x2 == 0] <- -1000
library(ROCR)
pred <- prediction(x2, y)
perf <- performance(pred, "tpr", "fpr")
data.frame(cut=perf#alpha.values[[1]], fpr=perf#x.values[[1]],
tpr=perf#y.values[[1]])
# cut fpr tpr
# 1 Inf 0.0000000 0.0
# 2 -3.0 0.0000000 0.2
# 3 -4.0 0.0000000 0.4
# 4 -5.0 0.0000000 0.6
# 5 -12.0 0.0000000 0.8
# 6 -15.0 0.0000000 1.0
# 7 -15.3 0.1428571 1.0
# 8 -18.0 0.2857143 1.0
# 9 -20.0 0.4285714 1.0
# 10 -26.0 0.5714286 1.0
# 11 -1000.0 1.0000000 1.0
Now you can select your favorite cutoff based on the true and false positive rates, remembering that the selected cutoff value will be negated from the original value.
Related
I was doing a logistic regression and made a table that represents the predicted probability ,actual class, and predicted class.
If the predicted probability is more than 0.5, I classified it as 1,so the predicted class becomes 1. But I want to change the threshold value from 0.5 to another value.
I was considering to find a threshold value that maximizes both true positive rate and true negative rate. Here I made a simple data df to demonstrate what I want to do.
df<-data.frame(actual_class=c(0,1,0,0,1,1,1,0,0,1),
predicted_probability=c(0.51,0.3,0.2,0.35,0.78,0.69,0.81,0.31,0.59,0.12),
predicted_class=c(1,0,0,0,1,1,1,0,1,0))
If I can find a threshold value, I will classify using that value instead of 0.5.
I don't know how to find a threshold value that both maximizes true positive rate and true negative rate.
You can check a range of values pretty easily:
probs <- seq(0, 1, by=.05)
names(probs) <- probs
results <- sapply(probs, function(x) df$actual_class == as.integer(df$predicted_probability > x))
results is a 10 row by 21 column logical matrix showing when the predicted class equals the actual class:
colSums(results) # Number of correct predictions
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1
5 5 5 4 5 5 4 6 6 6 6 7 8 8 7 7 6 5 5 5 5
predict <- as.integer(df$predicted_probability > .6)
xtabs(~df$actual_class+predict)
# predict
# df$actual_class 0 1
# 0 5 0
# 1 2 3
You can see that probabilities of .6 and .65 result in 8 correct predictions. This conclusion is based on the data you used in the analysis so it probably overestimates how successful you would be with new data.
I have noticed that for some runs of:
train=as.h2o(u)
mod = h2o.glm(family= "binomial", x= c(1:15), y="dc",
training_frame=train, missing_values_handling = "Skip",
lambda = 0, compute_p_values = TRUE, nfolds = 10,
keep_cross_validation_predictions= TRUE)
there are NaNs in cross-validation metrics summary of AUC for some cv iterations of the model.
For example:
print(mod#model$cross_validation_metrics_summary["auc",])
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid cv_6_valid cv_7_valid cv_8_valid cv_9_valid cv_10_valid
auc 0.63244045 0.24962118 0.25 0.6666667 0.8095238 1.0 0.6666667 0.46666667 NaN NaN 1.0 0.2
NaN in CV seems to appear less frequently when I set smaller nfolds=7.
How these NaN values should be interpreted and when h2o cross-validation outputs them?
I suppose it happens when AUC can't be assessed correctly in an iteration. My training set has 70 complete rows.
Can such AUC cross-validation results (containing NaNs) be considered as reliable?
There are specific cases that could cause division by zero when calculating the ROC curve, which could cause an AUC to be NaN. It's probable that due to small data you have some folds that have no true positives and are causing this issue.
We can test this by keeping the fold column and then counting the values of dc in each fold:
...
train <- as.h2o(u)
mod <- h2o.glm(family = "binomial"
, x = c(1:15)
, y = "dc"
, training_frame = train
, missing_values_handling = "Skip"
, lambda = 0
, compute_p_values = TRUE
, nfolds = 10
, keep_cross_validation_fold_assignment = TRUE
, seed = 1234)
fold <- as.data.frame(h2o.cross_validation_fold_assignment(mod))
df <- cbind(u,fold)
table(df[c("dc","fold_assignment")])
fold_assignment
dc 0 1 2 3 4 5 6 7 8 9
0 4 6 6 2 9 6 6 4 4 6
1 2 2 3 4 0 2 0 0 1 2
mod#model$cross_validation_metrics_summary["auc",]
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid cv_6_valid cv_7_valid
auc 0.70238096 0.19357596 0.875 0.6666667 0.5 0.375 NaN 0.5833333 NaN
cv_8_valid cv_9_valid cv_10_valid
auc NaN 1.0 0.9166667
We see that the folds with NaN are the same folds that have only dc=0.
Not counting the NaN, the wide variety of AUC for your folds (from 0.2 to 1) tells us that this is not a robust model, and it is likely being overfitted. Can you add more data?
I am stuck on how to proceed with coding in RStudio for the Bonferroni Correction and the raw P values for the Pearson Correlation Matrix. I am a student and am new to R. I am also lost on how to get a table of the mean,SD, and n for the data. When I calculated the Pearson Correlation Matrix I just got the r value and not the raw probabilities value also. I am not sure how to code to get that in RStudio. I then tried to calculate the Bonferroni Correction and received an error message saying list object cannot be coerced to type double. How do I fix my code so this goes away? I also tried to create a table of the mean, SD, and n for the data and I became stuck on how to proceed.
My data is as follows:
Tree Height DBA Leaf Diameter
45.3 14.9 0.76
75.2 26.6 1.06
70.1 22.9 1.19
95 31.8 1.59
107.8 35.5 0.43
93 26.2 1.49
91.5 29 1.19
78.5 29.2 1.1
85.2 30.3 1.24
50 16.8 0.67
47.1 12.8 0.98
73.2 28.4 1.2
Packages I have installed dplyr,tidyr,multcomp,multcompview
I Read in the data from excel CSV(comma delimited) file and This creates data>dataHW8_1 12obs. of 3 variables
summary(dataHW8_1)
I then created Scatterplots of the data
plot(dataHW8_1$Tree_Height,dataHW8_1$DBA,main="Scatterplot Tree Height Vs Trunk Diameter at Breast Height (DBA)",xlab="Tree Height (cm)",ylab="DBA (cm)")
plot(dataHW8_1$Tree_Height,dataHW8_1$Leaf_Diameter,main="Scatterplot Tree Height Vs Leaf Diameter",xlab="Tree Height (cm)",ylab="Leaf Diameter (cm)")
plot(dataHW8_1$DBA,dataHW8_1$Leaf_Diameter,main="Scatterplot Trunk Diameter at Breast Height (DBA) Vs Leaf Diameter",xlab="DBA (cm)",ylab="Leaf Diameter (cm)")
I then noticed that the data was not linear so I transformed it using the log() fucntion
dataHW8_1log = log(dataHW8_1)
I then re-created my Scatterplots using the transformed data
plot(dataHW8_1log$Tree_Height,dataHW8_1log$DBA,main="Scatterplot of
Transformed (log)Tree Height Vs Trunk Diameter at Breast Height
(DBA)",xlab="Tree Height (cm)",ylab="DBA (cm)")
plot(dataHW8_1log$Tree_Height,dataHW8_1log$Leaf_Diameter,main="Scatterplot
of Transformed (log)Tree Height Vs Leaf Diameter",xlab="Tree Height
(cm)",ylab="Leaf Diameter (cm)")
plot(dataHW8_1log$DBA,dataHW8_1log$Leaf_Diameter,main="Scatterplot of
Transformed (log) Trunk Diameter at Breast Height (DBA) Vs Leaf
Diameter",xlab="DBA (cm)",ylab="Leaf Diameter (cm)")
I then created a matrix plot of Scatterplots
pairs(dataHW8_1log)
I then calculated the correlation coefficent using the Pearson method
this does not give an uncorreted matrix of P values------How do you do that?
cor(dataHW8_1log,method="pearson")
I am stuck on what to do to get a matrix of the raw probabilities (uncorrected P values) of the data
I then calculated the Bonferroni correction-----How do you do that?
Data$Bonferroni =
p.adjust(dataHW8_1log,
method = "bonferroni")
Doing this gave me the follwing error:
Error in p.adjust(dataHW8_1log, method = "bonferroni") :
(list) object cannot be coerced to type 'double'
I tried to fix using lapply, but that did not fix my promblem
I then tried to make a table of mean, SD, n, but I was only able to create the following code and became stuck on where to go from there------How do you do that?
(,data = dataHW8_1log,
FUN = function(x) c(Mean = mean(x, na.rm = T),
n = length(x),
sd = sd(x, na.rm = T))
I have tried following examples online, but none of them have helped me with the getting the Bonferroni Correction to code correctly.If anyone can help explain what I did wrong and how to make the Matrices/table I would greatly appreciate it.
Here is an example using a 50 rows by 10 columns sample dataframe.
# 50 rows x 10 columns sample dataframe
df <- as.data.frame(matrix(runif(500), ncol = 10));
We can show pairwise scatterplots.
# Pairwise scatterplot
pairs(df);
We can now use cor.test to get p-values for a single comparison. We use a convenience function cor.test.p to do this for all pairwise comparisons. To give credit where credit is due, the function cor.test.p has been taken from this SO post, and takes as an argument a dataframe whilst returning a matrix of uncorrected p-values.
# cor.test on dataframes
# From: https://stackoverflow.com/questions/13112238/a-matrix-version-of-cor-test
cor.test.p <- function(x) {
FUN <- function(x, y) cor.test(x, y)[["p.value"]];
z <- outer(
colnames(x),
colnames(x),
Vectorize(function(i,j) FUN(x[,i], x[,j])));
dimnames(z) <- list(colnames(x), colnames(x));
return(z);
}
# Uncorrected p-values from pairwise correlation tests
pval <- cor.test.p(df);
We now correct for multiple hypothesis testing by applying the Bonferroni correction to every row (or column, since the matrix is symmetric) and we're done. Note that p.adjust takes a vector of p-values as an argument.
# Multiple hypothesis-testing corrected p-values
# Note: pval is a symmetric matrix, so it doesn't matter if we correct
# by column or by row
padj <- apply(pval, 2, p.adjust, method = "bonferroni");
padj;
#V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#V1 0 1 1.0000000 1 1.0000000 1.0000000 1 1 1.0000000 1
#V2 1 0 1.0000000 1 1.0000000 1.0000000 1 1 1.0000000 1
#V3 1 1 0.0000000 1 0.9569498 1.0000000 1 1 1.0000000 1
#V4 1 1 1.0000000 0 1.0000000 1.0000000 1 1 1.0000000 1
#V5 1 1 0.9569498 1 0.0000000 1.0000000 1 1 1.0000000 1
#V6 1 1 1.0000000 1 1.0000000 0.0000000 1 1 0.5461443 1
#V7 1 1 1.0000000 1 1.0000000 1.0000000 0 1 1.0000000 1
#V8 1 1 1.0000000 1 1.0000000 1.0000000 1 0 1.0000000 1
#V9 1 1 1.0000000 1 1.0000000 0.5461443 1 1 0.0000000 1
#V10 1 1 1.0000000 1 1.0000000 1.0000000 1 1 1.0000000 0
I have a tab delimited file with 70 rows of data and 34 columns of characteristics, where the first 60 rows look like this:
groups x1 x2 x3 x4 x5 (etc, up to x34)
0 0.1 0.5 0.5 0.4 0.2
1 0.2 0.3 0.8 0.4 0.1
0 0.4 0.7 0.6 0.2 0.1
1 0.4 0.4 0.7 0.1 0.4
And the last 10 rows look like this:
groups x1 x2 x3 x4 x5
NA 0.2 0.1 0.5 0.4 0.2
NA 0.2 0.1 0.8 0.4 0.1
NA 0.2 0.2 0.6 0.2 0.1
NA 0.2 0.3 0.7 0.1 0.4
The groups are binary (i.e. each row either belongs to group 0 or group 1). The aim is to use the first 60 rows as my training data set, and the last 10 rows as my test data set; to classify the last 10 rows into groups 0 or 1. The class of the last 10 rows is currently labelled as "NA" (as they have not been assigned to a class).
I ran this code:
library(caret)
data <-read.table("data_challenge_test.tab",header=TRUE)
set.seed(3303)
train <-sample(1:60)
data.train <-data[train,]
dim(data.train)
data.test <-data[-train,]
dim(data.test)
data.train[["groups"]] = factor(data.train[["groups"]])
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
knn_fit <- train(groups ~x1+x2+x3+x4+x5, data = data.train, method = "knn",trControl=trctrl,preProcess = c("center", "scale"),tuneLength = 10)
test_pred <- predict(knn_fit, newdata = data.test)
confusionMatrix(test_pred, data.test$groups)
the test_pred output is:
> test_pred
[1] 0 0 0 0 1 1 0 1 1 0
Levels: 0 1
and the confusion matrix output is:
> confusionMatrix(test_pred, data.test$groups)
Error in confusionMatrix.default(test_pred, data.test$groups) :
the data cannot have more levels than the reference
Then I checked the str of test_pred and data.test$groups:
> str(test_pred)
Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 2 2 1
> str(data.test$groups)
int [1:10] NA NA NA NA NA NA NA NA NA NA
So I understand that my error is because my two inputs to the confusion matrix are not of the same type.
So then in my data set, I changed my "NA" columns to randomly either 0 or 1 (i.e. I just manually randomly changed the first 5 unknown classes to class 0 and then the second 5 unknown classes to class 1).
Then I re-ran the above code
The output was:
> test_pred
[1] 0 0 0 0 1 1 0 1 1 0
Levels: 0 1
> confusionMatrix(test_pred, data.test$groups)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 4 2
1 1 3
Accuracy : 0.7
95% CI : (0.3475, 0.9333)
No Information Rate : 0.5
P-Value [Acc > NIR] : 0.1719
Kappa : 0.4
Mcnemar's Test P-Value : 1.0000
Sensitivity : 0.8000
Specificity : 0.6000
Pos Pred Value : 0.6667
Neg Pred Value : 0.7500
Prevalence : 0.5000
Detection Rate : 0.4000
Detection Prevalence : 0.6000
Balanced Accuracy : 0.7000
'Positive' Class : 0
So I have three questions:
Originally, the class of all my training data set was 0 or 1, and the class of my test data sets were all marked as NA or ?.
caret doesn't seem to like that due to the error described above. When I assigned my test data set random starting binary variables instead of NA/?, the analysis "worked" (as in no errors).
Is the binary groups I've manually randomly assigned to the test data set affecting the confusion matrix (or any aspect of the analysis?), or is this acceptable? If not, what is the solution: what group do I assigned unclassified test data to in the beginning of the analysis.
Is the test_pred output ordered? I wanted the last 10 rows of my table to be predicted and the output of test_pred is: 0 0 0 0 1 1 0 1 1 0. Are these last 10 rows in order?
I would like to visualise the results once I have this issue sorted. Can anyone recommend a standard package that is commonly done to do this (I am new to machine learning)?
Edit: Given that the confusion matrix is directly uses references and prediction to calculate accuracy, I'm pretty sure I cannot just randomly assign classes to the unknown classed rows as it will affect the accuracy of the confusion matrix. So an alternative suggestion would be appreciated.
A confusion matrix is comparison of your classification output to the actual classes. So if your test data set does not have the labels, you cannot draw a confusion matrix.
There are other ways of checking how well your classification algorithm did. You can for now read about AIC which is analogous to Linear Regressions R-squared.
If you still want a confusion matrix, use first 50 rows for training and 50-60 for testing. This output will let you create a confusion matrix.
Yes the output is ordered and you can column bind it to your test set.
Visualising classification tasks is done by drawing a ROC curve. CARET library should have that too.
I'm fitting glm model in R and can get predicted values at response scale using predict.glm(object=fm1, type="response") where fm1 is the fitted model. I wonder how to get predicted values at response scale using augment function from broom package. My minimum working example is given below.
Dilution <- c(1/128, 1/64, 1/32, 1/16, 1/8, 1/4, 1/2, 1, 2, 4)
NoofPlates <- rep(x=5, times=10)
NoPositive <- c(0, 0, 2, 2, 3, 4, 5, 5, 5, 5)
Data <- data.frame(Dilution, NoofPlates, NoPositive)
fm1 <- glm(formula=NoPositive/NoofPlates~log(Dilution),
family=binomial("logit"), data=Data, weights=NoofPlates)
predict.glm(object=fm1, type="response")
# 1 2 3 4 5 6 7 8 9 10
# 0.02415120 0.07081045 0.19005716 0.41946465 0.68990944 0.87262421 0.95474066 0.98483820 0.99502511 0.99837891
library(broom)
broom::augment(x=fm1)
# NoPositive.NoofPlates log.Dilution. X.weights. .fitted .se.fit .resid .hat .sigma
# 1 0.0 -4.8520303 5 -3.6989736 1.1629494 -0.4944454 0.15937234 0.6483053
# 2 0.0 -4.1588831 5 -2.5743062 0.8837030 -0.8569861 0.25691194 0.5662637
# 3 0.4 -3.4657359 5 -1.4496388 0.6404560 1.0845988 0.31570923 0.4650405
# 4 0.4 -2.7725887 5 -0.3249714 0.4901128 -0.0884021 0.29247321 0.6784308
# 5 0.6 -2.0794415 5 0.7996960 0.5205868 -0.4249900 0.28989252 0.6523116
# 6 0.8 -1.3862944 5 1.9243633 0.7089318 -0.4551979 0.27931425 0.6486704
# 7 1.0 -0.6931472 5 3.0490307 0.9669186 0.6805552 0.20199632 0.6155754
# 8 1.0 0.0000000 5 4.1736981 1.2522190 0.3908698 0.11707018 0.6611557
# 9 1.0 0.6931472 5 5.2983655 1.5498215 0.2233227 0.05944982 0.6739965
# 10 1.0 1.3862944 5 6.4230329 1.8538108 0.1273738 0.02781019 0.6778365
# .cooksd .std.resid
# 1 0.0139540988 -0.5392827
# 2 0.0886414317 -0.9941540
# 3 0.4826245827 1.3111391
# 4 0.0022725303 -0.1050972
# 5 0.0543073747 -0.5043322
# 6 0.0637954916 -0.5362006
# 7 0.0375920888 0.7618349
# 8 0.0057798939 0.4159767
# 9 0.0008399932 0.2302724
# 10 0.0001194412 0.1291827
For generalized linear model, in order for the math to come out, the model needs to be transformed using a link function. For Gaussian model, this is the identity function, but for logistic regression, we use a logit function (can also be probit, does that ring a bell?). This means that you can get "raw" predicted values or transformed. This is why ?predict.glm offers a type argument, which translates to type.predict in augment.
broom::augment(x=fm1, newdata = Data, type.predict = "response")