Find a threshold value for confusion matrix in R - r

I was doing a logistic regression and made a table that contains the predicted probability, actual class, and predicted class.
If the predicted probability is more than 0.5, I classified it as 1, so the predicted class becomes 1. But I want to change the threshold value from 0.5 to another value.
I would like to find a threshold value that maximizes both the true positive rate and the true negative rate. Here I made a simple data frame df to demonstrate what I want to do.
df<-data.frame(actual_class=c(0,1,0,0,1,1,1,0,0,1),
predicted_probability=c(0.51,0.3,0.2,0.35,0.78,0.69,0.81,0.31,0.59,0.12),
predicted_class=c(1,0,0,0,1,1,1,0,1,0))
If I can find a threshold value, I will classify using that value instead of 0.5.
I don't know how to find a threshold value that maximizes both the true positive rate and the true negative rate.

You can check a range of values pretty easily:
probs <- seq(0, 1, by=.05)
names(probs) <- probs
results <- sapply(probs, function(x) df$actual_class == as.integer(df$predicted_probability > x))
results is a 10 row by 21 column logical matrix showing when the predicted class equals the actual class:
colSums(results) # Number of correct predictions
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1
5 5 5 4 5 5 4 6 6 6 6 7 8 8 7 7 6 5 5 5 5
predict <- as.integer(df$predicted_probability > .6)
xtabs(~df$actual_class+predict)
# predict
# df$actual_class 0 1
# 0 5 0
# 1 2 3
You can see that probabilities of .6 and .65 result in 8 correct predictions. This conclusion is based on the data you used in the analysis so it probably overestimates how successful you would be with new data.
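Since the question asks for a threshold that balances the true positive rate and the true negative rate rather than raw accuracy, one common approach is to maximise their sum (Youden's J). A minimal sketch reusing probs and df from above:
## For each candidate threshold, compute TPR (sensitivity) and TNR (specificity)
rates <- sapply(probs, function(x) {
  pred <- as.integer(df$predicted_probability > x)
  c(tpr = sum(pred == 1 & df$actual_class == 1) / sum(df$actual_class == 1),
    tnr = sum(pred == 0 & df$actual_class == 0) / sum(df$actual_class == 0))
})
## Threshold with the largest TPR + TNR
probs[which.max(rates["tpr", ] + rates["tnr", ])]
The same caveat applies: the chosen threshold is tuned on the data used to fit the model, so it will look better here than on new data.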

Related

Heatmap of effect sizes and p-values using different exposures and outcomes in ggplot2

I want to create a heat map that graphically shows effect sizes between different outcomes and exposures and whether the p-values were significant.
I have created one big dataframe containing all exposure-outcome tests with p-values and effect sizes. The effect direction can be positive or negative. There are great resources to create this for correlation matrices, such as corrplot,
but I don't see how to do this for effect sizes with different exposures and outcomes.
This would be the sample dataframe. The full data would have 20 exposures and 15 outcomes.
Here is a shortened example. Estimates and p-values are made up, so disregard the statistical nonsense in the values.
dat
# id Exposure Outcome beta p-value se x
# 1 a 1 0.02 0.04 0.001
# 1 a 2 0.52 0.001 0.02
# 1 a 3 0.001 0.54 0.001
# 1 b 1 -0.02 0.09 0.045
# 1 b 2 0.06 0.12 0.03
# 1 b 3 -0.1 0.41 0.09
# 1 c 1 -0.42 0.01 0.08
This is an example of a similar plot using correlation.
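One way to build this kind of plot is ggplot2's geom_tile, with a diverging fill for the effect size and a text mark for significant cells. A rough sketch using the sample values above, assuming the columns are named Exposure, Outcome, beta and p_value (adjust the names to the real data):
library(ggplot2)

dat <- data.frame(
  Exposure = rep(c("a", "b", "c"), times = c(3, 3, 1)),
  Outcome  = factor(c(1, 2, 3, 1, 2, 3, 1)),
  beta     = c(0.02, 0.52, 0.001, -0.02, 0.06, -0.1, -0.42),
  p_value  = c(0.04, 0.001, 0.54, 0.09, 0.12, 0.41, 0.01)
)

ggplot(dat, aes(x = Exposure, y = Outcome, fill = beta)) +
  geom_tile() +
  ## diverging palette so positive and negative effects are distinguishable
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  ## mark significant cells (here p < 0.05) with an asterisk
  geom_text(aes(label = ifelse(p_value < 0.05, "*", "")), size = 6)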

How to calculate the probability of each columns in DataFrame -using R (Assuming Normal Distribution)

This is my DF:
Con1 Con2 Con3 Con4
1 45 576
2 23 1234
3 67 345
4 22 44
5 5 567
For each column, I want to find the mean and the SD.
Then, for each cell in that column, I want to apply a normal distribution calculation to find the probability of the cell's value.
For example, if Con1's mean is 32.4 and its SD is 4, I want to take each number in this column, apply the normal distribution to find the probability of each number, and then replace the number with its probability.
the output (For example):
Con1 Con2 Con3 Con4
1 0.6 0.455
2 0.34 0.09
3 0.23 0.12
4 0.1 0.55
5 0.7 0.88
Any help?
In base R you can do this with...
sapply(df, function(x) pnorm(x, mean = mean(x), sd = sd(x)))
Con1 Con2
[1,] 0.7002401 0.5207649
[2,] 0.3476271 0.9400139
[3,] 0.9253371 0.3172112
[4,] 0.3323590 0.1224208
[5,] 0.1267551 0.5125718
This uses pnorm, which is the cumulative normal distribution function. If you want the density instead, use dnorm. You might also like to have a look at the scale function to normalise values.
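For comparison, a quick sketch of the two alternatives mentioned, using the same df as above: dnorm gives the density at each value, and scale gives column-wise z-scores.
## density at each value instead of the cumulative probability
sapply(df, function(x) dnorm(x, mean = mean(x), sd = sd(x)))
## column-wise z-scores (centred and scaled)
scale(df)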

R - How to create a new column in a dataframe with calculations based on condition of another column

In a project, I measured the iodine concentration of tumors (column=ROI_IC) at different off center positions (column=Offcenter) (table heights) in a CT scanner. I know the true concentration of each of the tumors (column=Real_IC; there are 4 different tumors with 4 different Real_IC concentrations). Each tumor was measured at each off-center position 10 times (column=Measurement_repeat). I calculated an absolute error between the measured iodine concentration and the real iodine concentration (column=absError_IC).
This is just the head of the data:
Offcenter Measurement_repeat Real_IC ROI_IC absError_IC
1 0 1 0.0 0.4 0.4
2 0 2 0.0 0.3 0.3
3 0 3 0.0 0.3 0.3
4 0 4 0.0 0.0 0.0
5 0 5 0.0 0.0 0.0
6 0 6 0.0 -0.1 0.1
7 0 7 0.0 -0.2 0.2
8 0 8 0.0 -0.2 0.2
9 0 9 0.0 -0.1 0.1
10 0 10 0.0 0.0 0.0
11 0 1 0.4 0.4 0.0
12 0 2 0.4 0.3 0.1
13 0 3 0.4 0.2 0.2
14 0 4 0.4 0.0 0.4
15 0 5 0.4 0.0 0.4
16 0 6 0.4 -0.1 0.5
17 0 7 0.4 0.1 0.3
18 0 8 0.4 0.3 0.1
19 0 9 0.4 0.6 0.2
20 0 10 0.4 0.7 0.3
Now I would like to create a new column called corrError_IC.
In this column, the measured iodine concentration (ROI_IC) should be corrected based on the mean absolute error (mean of 10 measurements) that was found for that specific Real_IC concentration at Offcenter = 0.
Because there are 4 tumor concentrations, there are 4 mean values at Offcenter = 0 that I want to apply to the other off-center values.
mean1=mean of the 10 absError-IC measurements of the `Real_IC=0`
mean2=mean of the 10 absError-IC measurements of the `Real_IC=0.4`
mean3=mean of the 10 absError-IC measurements of the `Real_IC=3`
mean4=mean of the 10 absError-IC measurements of the `Real_IC=5`
Basically, I want the average absolute error for a specific tumor at Offcenter = 0 (there are 4 different tumor types with four different Real_IC values), and then I want to correct all tumors at the other Offcenter positions by the absolute error values derived from the Offcenter = 0 data.
I tried ifelse statements but I was not able to figure it out.
EDIT: Off-center has specific levels: c(-6,-4,-3,-2,-1,0,1,2,3,4,6)
Here is how I would approach this problem:
1. Compute the mean of absError_IC grouped by Real_IC (using the Offcenter == 0 rows).
2. Left join the original data.frame with the grouped means.
Code example:
## replicate sample data sets
ROI_IC = c(0.4, 0.3, 0.3, 0.0, 0.0, -0.1, -0.2, -0.2, -0.1, 0.0,
           0.4, 0.3, 0.2, 0.0, 0.0, -0.1, 0.1, 0.3, 0.6, 0.7)
df = data.frame("Offcenter"=rep(0, 40),
                "Measurement_repeat"=rep(c(1:10), 4),
                "Real_IC"=rep(c(0, 0.4, 3, 5), each=10),
                "ROI_IC"=rep(ROI_IC, 2),
                stringsAsFactors=F)
df$absError_IC = abs(df$Real_IC - df$ROI_IC)
## compute mean of "absError_IC" grouped by "Real_IC", using only the Offcenter == 0 rows
mean_values = aggregate(df$absError_IC[df$Offcenter==0],
                        by=list("Real_IC"=df$Real_IC[df$Offcenter==0]),
                        FUN=mean)
names(mean_values)[which(names(mean_values)=="x")] = "MAE"
## left join to append column
df = merge(df, mean_values, by.x="Real_IC", by.y="Real_IC", all.x=T, all.y=F, sort=F)
## notice that column order shifts based on "key"
df[c(1:5, 10:15), ]
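If the corrected concentration is then simply the measurement minus that mean error, as in the question's own follow-up further down, it can be added directly from the joined MAE column (a sketch):
## corrected value = measured value minus the mean absolute error of its Real_IC group
df$corrError_IC = df$ROI_IC - df$MAE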
I suggest using the data.table package, which is particularly useful when you need to manipulate large data.
library(data.table)
## dt = data.table(df) or dt = fread(<path>)
## dt[Offcenter == 0, absError_IC := abs(Real_IC - ROI_IC)]
## compute grouped mean, using only the Offcenter == 0 rows
mean_values = dt[Offcenter == 0, j=list("MAE"=mean(absError_IC)), by=list(Real_IC)]
## left join
dt = merge(dt, mean_values, by.x="Real_IC", by.y="Real_IC", all.x=T, all.y=F, sort=F)
Consider ave for inline aggregation: its first argument is the numeric field, the following arguments are the grouping fields, and the last argument, which must be passed as the named parameter FUN, is the function to apply: ave(num_vector, ..., FUN=func).
df$corrError_IC <- with(df, ave(absError_IC, Real_IC, FUN=mean))
To handle NAs, extend the function argument to pass na.rm:
df$corrError_IC <- with(df, ave(absError_IC, Real_IC, FUN=function(x) mean(x, na.rm=TRUE)))
I found a way to compute what I want by creating an extra column taking the average absolute errors from the 4 Real_IC levels for Off-center = 0 and matching them whenever Real_IC has a certain level.
In a second step, I subtract these from the ROI_ICs. However, how can I simplify that code to a more general form (at the moment I calculate the average absErrors based on their row location)? Sorry I am an absolute beginner ;(
Of note: My data.frame is called "ds_M"
#Define absolute errors for the 4 Real_IC levels as variables
average1<-mean(ds_M$absError_IC[1:10]) #for Real_IC=0
average2<-mean(ds_M$absError_IC[11:20]) #for Real_IC=0.4
average3<-mean(ds_M$absError_IC[21:30]) #for Real_IC=3
average4<-mean(ds_M$absError_IC[31:40]) #for Real_IC=5
# New column assigning the correction factor to each Real_IC level
ds_M$absCorr[ds_M$Real_IC==0]<-average1
ds_M$absCorr[ds_M$Real_IC==0.4]<-average2
ds_M$absCorr[ds_M$Real_IC==3]<-average3
ds_M$absCorr[ds_M$Real_IC==5]<-average4
# Calculate new column with corrected ROI_ICs
ds_M$corrError_IC<-ds_M$ROI_IC - ds_M$absCorr
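A more general version of the same idea, not tied to row positions, is to compute the per-level mean from the Offcenter == 0 rows and then map it back by matching Real_IC levels. A sketch using the same ds_M column names:
## Mean absolute error per Real_IC level, using only the Offcenter == 0 rows
ref <- aggregate(absError_IC ~ Real_IC, data = ds_M[ds_M$Offcenter == 0, ], FUN = mean)
## Look up each row's Real_IC level and subtract the corresponding mean error
ds_M$absCorr <- ref$absError_IC[match(ds_M$Real_IC, ref$Real_IC)]
ds_M$corrError_IC <- ds_M$ROI_IC - ds_M$absCorr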

Caret to predict class with knn: Do I need to provide unknown classes with a random class variable?

I have a tab delimited file with 70 rows of data and 34 columns of characteristics, where the first 60 rows look like this:
groups x1 x2 x3 x4 x5 (etc, up to x34)
0 0.1 0.5 0.5 0.4 0.2
1 0.2 0.3 0.8 0.4 0.1
0 0.4 0.7 0.6 0.2 0.1
1 0.4 0.4 0.7 0.1 0.4
And the last 10 rows look like this:
groups x1 x2 x3 x4 x5
NA 0.2 0.1 0.5 0.4 0.2
NA 0.2 0.1 0.8 0.4 0.1
NA 0.2 0.2 0.6 0.2 0.1
NA 0.2 0.3 0.7 0.1 0.4
The groups are binary (i.e. each row either belongs to group 0 or group 1). The aim is to use the first 60 rows as my training data set, and the last 10 rows as my test data set; to classify the last 10 rows into groups 0 or 1. The class of the last 10 rows is currently labelled as "NA" (as they have not been assigned to a class).
I ran this code:
library(caret)
data <-read.table("data_challenge_test.tab",header=TRUE)
set.seed(3303)
train <-sample(1:60)
data.train <-data[train,]
dim(data.train)
data.test <-data[-train,]
dim(data.test)
data.train[["groups"]] = factor(data.train[["groups"]])
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
knn_fit <- train(groups ~x1+x2+x3+x4+x5, data = data.train, method = "knn",trControl=trctrl,preProcess = c("center", "scale"),tuneLength = 10)
test_pred <- predict(knn_fit, newdata = data.test)
confusionMatrix(test_pred, data.test$groups)
the test_pred output is:
> test_pred
[1] 0 0 0 0 1 1 0 1 1 0
Levels: 0 1
and the confusion matrix output is:
> confusionMatrix(test_pred, data.test$groups)
Error in confusionMatrix.default(test_pred, data.test$groups) :
the data cannot have more levels than the reference
Then I checked the str of test_pred and data.test$groups:
> str(test_pred)
Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 2 2 1
> str(data.test$groups)
int [1:10] NA NA NA NA NA NA NA NA NA NA
So I understand that my error is because my two inputs to the confusion matrix are not of the same type.
So then, in my data set, I changed my NA labels to either 0 or 1 (i.e. I manually set the first 5 unknown classes to class 0 and the second 5 unknown classes to class 1).
Then I re-ran the above code
The output was:
> test_pred
[1] 0 0 0 0 1 1 0 1 1 0
Levels: 0 1
> confusionMatrix(test_pred, data.test$groups)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 4 2
1 1 3
Accuracy : 0.7
95% CI : (0.3475, 0.9333)
No Information Rate : 0.5
P-Value [Acc > NIR] : 0.1719
Kappa : 0.4
Mcnemar's Test P-Value : 1.0000
Sensitivity : 0.8000
Specificity : 0.6000
Pos Pred Value : 0.6667
Neg Pred Value : 0.7500
Prevalence : 0.5000
Detection Rate : 0.4000
Detection Prevalence : 0.6000
Balanced Accuracy : 0.7000
'Positive' Class : 0
So I have three questions:
1. Originally, the classes in my training data set were 0 or 1, and the classes in my test data set were all marked as NA or ?.
caret doesn't seem to like that, due to the error described above. When I assigned my test data set random binary labels instead of NA/?, the analysis "worked" (as in, no errors).
Are the binary groups I manually assigned to the test data set affecting the confusion matrix (or any other aspect of the analysis), or is this acceptable? If not, what is the solution: what group do I assign unclassified test data to at the beginning of the analysis?
2. Is the test_pred output ordered? I wanted the last 10 rows of my table to be predicted, and the output of test_pred is: 0 0 0 0 1 1 0 1 1 0. Do these predictions correspond to those last 10 rows, in order?
3. I would like to visualise the results once I have this issue sorted. Can anyone recommend a standard package that is commonly used for this (I am new to machine learning)?
Edit: Given that the confusion matrix directly uses the reference and prediction classes to calculate accuracy, I'm pretty sure I cannot just randomly assign classes to the unknown rows, as it will affect the accuracy reported by the confusion matrix. So an alternative suggestion would be appreciated.
A confusion matrix is a comparison of your classification output to the actual classes. So if your test data set does not have labels, you cannot draw a confusion matrix.
There are other ways of checking how well your classification algorithm did. You can, for now, read about AIC, which is analogous to linear regression's R-squared.
If you still want a confusion matrix, use the first 50 rows for training and rows 51-60 for testing. That output will let you create a confusion matrix; see the sketch below.
Yes, the output is ordered, and you can column-bind it to your test set.
Visualising classification results is usually done by drawing a ROC curve; the caret library should have that too.
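To make the hold-out suggestion concrete, here is a sketch that keeps the 60 labelled rows, trains on the first 50, evaluates on rows 51-60 (which have real labels, so confusionMatrix has a valid reference), and only then predicts the 10 unlabelled rows. Object and column names follow the question's code:
## keep only the labelled rows and hold out the last 10 of them for evaluation
labelled <- data[1:60, ]
labelled$groups <- factor(labelled$groups)
training <- labelled[1:50, ]
holdout  <- labelled[51:60, ]

knn_fit <- train(groups ~ x1 + x2 + x3 + x4 + x5, data = training, method = "knn",
                 trControl = trctrl, preProcess = c("center", "scale"), tuneLength = 10)
## evaluate against the known labels of the held-out rows
holdout_pred <- predict(knn_fit, newdata = holdout)
confusionMatrix(holdout_pred, holdout$groups)
## predictions for the 10 unlabelled rows come back in their original row order
predict(knn_fit, newdata = data[61:70, ])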

Determine threshold value based on sensitivities and specificities [duplicate]

This question already has answers here:
Obtaining threshold values from a ROC curve
(5 answers)
Closed 7 years ago.
My data is somewhat irregular, which makes it hard to apply ROC analysis etc. directly for threshold determination. To simplify, here is a demo; let x be
x<-c(0,0,0,12, 3, 4, 5, 15, 15.3, 20,18, 26)
Suppose x = 15.1 is the unknown true threshold, and the corresponding test outcome y is negative (0) if x == 0 OR x > 15.1, and positive (1) otherwise, so that:
y<-c(0,0,0,1, 1, 1, 1, 1, 0,0,0,0)
Because x == 0 also gives a negative outcome (so the relationship between x and y is not simply monotonic), I'm wondering how I can best determine the threshold of x for predicting y. I have tried the R packages pROC and ROCR, but neither seems straightforward for this situation. Would somebody give me some suggestions?
You have a situation where you predict 0 for high values of x and predict 1 for low values of x, except that you always predict 0 if x == 0. Standard packages like pROC and ROCR expect low values of x to be associated with predicting y=0. You can transform your data to that situation by:
Flipping the sign of all your predictions
Replacing the zeros (after negation) with a value below all the other negated values, e.g. -1000, so that they sort to the bottom and always predict 0
In code (using this answer to extract TPR and FPR for each cutoff):
x2 <- -x
x2[x2 == 0] <- -1000
library(ROCR)
pred <- prediction(x2, y)
perf <- performance(pred, "tpr", "fpr")
data.frame(cut=perf@alpha.values[[1]], fpr=perf@x.values[[1]],
           tpr=perf@y.values[[1]])
# cut fpr tpr
# 1 Inf 0.0000000 0.0
# 2 -3.0 0.0000000 0.2
# 3 -4.0 0.0000000 0.4
# 4 -5.0 0.0000000 0.6
# 5 -12.0 0.0000000 0.8
# 6 -15.0 0.0000000 1.0
# 7 -15.3 0.1428571 1.0
# 8 -18.0 0.2857143 1.0
# 9 -20.0 0.4285714 1.0
# 10 -26.0 0.5714286 1.0
# 11 -1000.0 1.0000000 1.0
Now you can select your favorite cutoff based on the true and false positive rates, remembering that the selected cutoff is the negative of the corresponding value on the original scale of x.
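If the goal is a cutoff that balances sensitivity and specificity, one option is to store the table above and pick the row maximising tpr - fpr (Youden's J), then negate to get back to the original scale of x. A small sketch:
cuts <- data.frame(cut = perf@alpha.values[[1]], fpr = perf@x.values[[1]],
                   tpr = perf@y.values[[1]])
best <- cuts[which.max(cuts$tpr - cuts$fpr), ]
-best$cut   ## cutoff on the original scale of x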
