Probability - Flu test using Bayes Rule - math

I have the following question:
You go to the doctor about a strong headache. The doctor randomly selects you for a blood test for flu, which is suspected to affect 1 in 9,000 people in your city. The accuracy of the test is 99%, meaning that the probability of a false positive is 1%. The probability of a false negative is zero. Given that you test positive, what is the probability that you have flu?
In this question, could someone help me understand what P(Positive|Flu) would be? Would it be 1 or 0.99?

P(+|Flu) = 0.99. (Strictly, a false negative rate of zero would make P(+|Flu) = 1, but whether you use 1 or 0.99 the posterior below comes out to about 1.1%.) But the question is somewhat misleading and really cannot be solved unless we know the prevalence of flu among people with strong headaches, since not everyone in the population has a strong headache. The prevalence of flu in the population is 1 in 9,000, but you have a strong headache, which probably means your probability of actually having the flu is somewhat higher than that of, say, your friend who doesn't have one. Anyway,...
Bayes' Rule says: P(Flu|+) = P(+|Flu) x P(Flu) / P(+)
Information known:
P(Flu) = 1/9000
P(+|no Flu) = 0.01 (False positive rate)
P(-|Flu) = 0 (False negative rate)
We need P(+). Using the law of total probability, we can calculate it:
P(+) = P(+|Flu) x P(Flu) + P(+|no Flu) x P(no Flu) = 0.99 x 1/9000 + 0.01 x 8999/9000 ≈ 0.01011
So, P(Flu|+) = 0.99 x (1/9000) / 0.01011 ≈ 0.0109, or about 1.1%. So you're unlikely to have the flu, even after the positive test. Why? Because the prevalence of flu is very low (~0.0001) and the test is not perfect (1 in 100 people without flu will test positive).
Moral of the story? Don't screen for flu among the general population. Only screen those at high risk or those who show symptoms (like headache AND fever + cough), in which case the prevalence of flu would be much higher than 1 in 9,000, probably 1 in 20. Change the prevalence to 1 in 20 and your risk of flu given a + test result would jump to 84%.
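If you want to plug in other prevalences, here is a quick R sketch of the same calculation (the function name post_flu is just for illustration; the numbers are the ones used above):
post_flu <- function(prev, sens = 0.99, fpr = 0.01) {
  # P(Flu | +) by Bayes' rule: sens * prev / P(+)
  sens * prev / (sens * prev + fpr * (1 - prev))
}
post_flu(1 / 9000)  # ~0.011: about a 1.1% chance of flu after a positive test
post_flu(1 / 20)    # ~0.84: the picture changes once the prior reflects symptomatic patients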

Related

How to add multiple independent probabilities to determine the overall probability of a single output

I apologize in advance for any confusing explanations, but I will try to be as clear as possible.
If there are multiple indicators that predict an outcome with a known accuracy, and they are all attempting to predict the same result, how do you properly add the probabilities?
For example, if John and David are taking a test, and historically John answers 80% of questions correctly, and David answers 75% of questions correctly, and both John and David select the same answer on a question, what is the probability that they are correct? Let's assume that John and David are completely independent of each other and that all questions are equally difficult.
I would think that the probability that they are correct is higher than 80%, so I don't think averaging makes sense.
Thanks to Robert, who commented on this question, I was able to figure out that what I was looking for is a well-known problem solved by Bayes' theorem, which is used to re-evaluate existing probabilities given new information. I won't go further into the intuition behind it, but 3Blue1Brown has a very good video on the topic.
Applying Bayes' theorem to the event that both experts give the same answer:
P(both correct | they agree) = (P(A)*P(B)) / (P(A)*P(B) + P(!A)*P(!B))
Where:
P(A) is the probability that the first expert is correct,
P(!A) is 1 - P(A),
P(B) is the probability that the second expert is correct, and
P(!B) is 1 - P(B)
Using this equation for the scenario in the question: if John has an 80% chance of being right and David has a 75% chance of being right, and they both agree, then the chance that they are both correct is 0.8*0.75 / (0.8*0.75 + 0.2*0.25) = 0.6 / 0.65 ≈ 0.923, i.e. about 92.3%.
To check this, I wrote a simple Python script that simulates this exact scenario a large number of times and prints out the result. In this code, two "experts" each have a set probability of being correct, and their accuracy is tracked individually and together.
import random

TRIALS = 1000000
exp1_correct = 0
exp2_correct = 0
combined_correct = 0
consensus_count = 0

for i in range(TRIALS):
    expert1 = random.random() <= 0.8   # expert 1 is correct with probability 0.80
    expert2 = random.random() <= 0.75  # expert 2 is correct with probability 0.75
    if expert1 and expert2:
        combined_correct += 1
    if expert1:
        exp1_correct += 1
    if expert2:
        exp2_correct += 1
    if expert1 == expert2:             # the experts agree (both right or both wrong)
        consensus_count += 1

print(f'Expert 1 had an accuracy of {exp1_correct / TRIALS}')
print(f'Expert 2 had an accuracy of {exp2_correct / TRIALS}')
print(f'Consensus had an accuracy of {combined_correct / consensus_count}')
Running this verifies that the equation above is correct. Hopefully this is helpful to someone who has the same question that I did!

Calculate vaccine efficacy confidence Interval using the exact method

I'm trying to calculate confidence intervals for vaccine efficacy studies.
All the studies I am looking at claim that they use the exact method and cite this free PDF: Statistical Methods in Cancer Research Volume II: The Design and Analysis of Cohort Studies. It is my understanding that the exact method is also sometimes called the Clopper-Pearson method.
The data I have is:
Person-years of vaccinated
Person-years of unvaccinated
Number of cases among vaccinated
Number of cases among unvaccinated
Efficacy is easy to calculate:
Efficacy (%) = ( 1 - ( (Number of cases among vaccinated / Person-years of vaccinated) / (Number of cases among unvaccinated / Person-years of unvaccinated) ) ) * 100
But calculating the confidence interval is harder.
At first I thought that this website gave the code I needed:
testall <- binom.test(8, 8+162)
(theta <- testall$conf.int)
(VE <- (1-2*theta)/(1-theta))
In this example, 8 is the number of cases in the vaccinated group and 162 is the number of cases in the unvaccinated group. But I have had a few problems with this.
(1) There are some studies where the sizes of the two cohorts (vaccinated vs. unvaccinated) are different, and I don't think this code works for those cohorts.
(2) I want to be able to adjust the type of confidence interval. For example, one study used a "one-sided α risk of 2·5%" whereas another study used "a two-sided α level of 5%". I'm not clear whether this affects the numbers.
Either way, when I tried to run the numbers, it didn't work.
Here is an example of a data set I am trying to validate:
Number of cases among vaccinated: 176
Number of cases among unvaccinated: 221
Person-years of vaccinated: 11,793
Person-years of unvaccinated: 5,809
Efficacy: 60.8%
Two-sided 95% CI: 52.0 to 68.0
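One way to handle unequal person-time is the conditional exact approach (this is my reading of the "exact method" in that reference, not necessarily exactly what each study did): conditional on the total number of cases, the number of cases among the vaccinated is binomial with p = RR*PY_v / (RR*PY_v + PY_u), so the Clopper-Pearson interval for p from binom.test converts into an exact interval for the rate ratio RR and hence for VE = 1 - RR. Note also that a two-sided 95% exact interval is built from two one-sided 97.5% limits, which is why the two α phrasings you quote usually describe the same bounds. A minimal R sketch with the numbers above:
# sketch only: exact (Clopper-Pearson) CI for VE with unequal person-time
cases_v <- 176; py_v <- 11793   # vaccinated arm
cases_u <- 221; py_u <- 5809    # unvaccinated arm
p  <- c(cases_v / (cases_v + cases_u),
        binom.test(cases_v, cases_v + cases_u, conf.level = 0.95)$conf.int)
rr <- (p / (1 - p)) * (py_u / py_v)  # convert the case-split p back to a rate ratio
ve <- 100 * (1 - rr)                 # point estimate, then the upper and lower VE limits
ve                                   # (order flips because VE = 1 - RR)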

testing for proportional hazards: cox.zph()

I am confused about what cox.zph is showing. I came across this test in the documentation for the finalfit package; there is a section under the heading "Testing for Proportional Hazards" about halfway down, which suggests testing the assumption that the risk associated with a particular variable does not change over time.
I ran it using the code below, but the documentation seems to imply both that I want a flat line at zero in the plot (which I appear to have) and that the hypothesis tests should show no variables significantly different from zero (which is not what I get, since every p-value is significant). This seems like a contradiction. Does anyone have any insight into where I may be going wrong?
matt_sfit1 <- coxph(Surv(matt_tmove_cen, matt_moved_cen) ~
                      matt_ncdem + flood_risk_simple + pre_matt.yr + CurrentAge +
                      distance_bi + percap.inc.k + employment + rentership +
                      pop.change + pop.den.k,
                    data = matt_timeadd)
matt_sfit1 %>% cox.zph()
chisq df p
matt_ncdem 39.22057 1 0.000000000378530830
flood_risk_simple 28.56281 1 0.000000090707709686
pre_matt.yr 7.96306 1 0.0047742
CurrentAge 5.83612 1 0.0157004
distance_bi 141.75756 1 < 0.000000000000000222
percap.inc.k 58.80923 1 0.000000000000017372
employment 30.16208 1 0.000000039740433777
rentership 8.69457 1 0.0031916
pop.change 36.13011 1 0.000000001845730660
pop.den.k 9.56108 1 0.0019875
GLOBAL 281.42991 10 < 0.000000000000000222
zph_result <- cox.zph(matt_sfit1)
plot(zph_result, var = 5)
Testing for proportionality is very important. If the proportional hazards assumption is rejected, it means that the effect of interest varies over time, and that the 'pooled' coefficient you are looking at is actually an average of different underlying values.
The first test you reported gives an overview of whether the PH assumption holds, i.e. of whether the effect of interest is constant over time. A graphical inspection can be informative in detecting 'when' this variation happens (for example, a covariate may have a stronger effect earlier/later on; this can sometimes be expected from a theoretical point of view). I think that the chosen y-scale is hiding a non-horizontal line. I would try to isolate the smoothed curve by removing the observation points. You have to specify the resid=FALSE argument in plot.
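For instance, on the survival package's built-in lung data (an illustrative sketch, not the matt_* model from the question), the smoothed curve can be isolated like this:
library(survival)
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
zp  <- cox.zph(fit)
zp                                # per-covariate chi-square tests plus a GLOBAL test
plot(zp, var = 2, resid = FALSE)  # smoothed beta(t) curve for 'sex' only, no residual points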
The two tests should give you a coherent outcome.
Past threads (among others, here and here) offer excellent guidance on how to address the issue.

logistic regression alternative interpretation

I am trying to analyze data that show whether or not people catch a disease; that is, the response is binary. I applied logistic regression. Assume the result of the logistic regression looks like this:
ID <- c(1, 2, 3, 4)
Test_Data <- c(0, 1, 1, 0)
Reg_Output <- c(0.01, 0.4, 0.8, 0.49)
result <- data.frame(ID, Test_Data, Reg_Output)
result
#   ID Test_Data Reg_Output
# 1  1         0       0.01
# 2  2         1       0.40
# 3  3         1       0.80
# 4  4         0       0.49
Can I say that the person with ID = 3 will catch the disease at 80 percent? Is this the right approach? If not, why? I am quite confused; any help would be great!
My second question: how can I calculate an accuracy rate other than by rounding the model output to 0 or 1? Rounding 0.49 to 0 is not so meaningful, I think.
In my example, the model output becomes 0, 0, 1, 0 instead of 0.01, 0.4, 0.8, 0.49, based on whether each value is greater or less than 0.5, and the accuracy rate is then 75%. Is there any other calculation method?
Thanks!
Can I say that the person with ID = 3 will catch the disease at 80 percent?
It is unclear what you mean by "at"; the conventional interpretation of the logistic regression output here is that the model estimates the probability that person #3 will catch the disease to be 80%. It is also unclear what you mean by "alternative" in your title (you don't elaborate in the question body).
how can I calculate an accuracy rate other than by rounding the model output to 0 or 1?
Accuracy by definition requires rounding the model results to 0/1. But, at least in principle, the decision threshold need not necessarily be 0.5...
Rounding 0.49 to 0 is not so meaningful, I think.
Do you think rounding 0.49 to 1 is more meaningful? Because this is the only alternative choice in a binary classification setting (a person either will catch the disease, or not).
Regarding the log loss metric mentioned in the comments: its role is completely different from that of accuracy. You may find these relevant answers of mine helpful:
Loss & accuracy - Are these reasonable learning curves?
How does Keras evaluate the accuracy? (despite the misleading title, it has nothing particular to do with Keras).
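To make the accuracy-versus-log-loss point concrete, here is a small sketch using the toy vectors from the question (the 0.5 threshold is just one possible choice):
Test_Data  <- c(0, 1, 1, 0)
Reg_Output <- c(0.01, 0.4, 0.8, 0.49)

threshold  <- 0.5                                 # any value in (0, 1) could be used
pred_class <- as.integer(Reg_Output >= threshold)
accuracy   <- mean(pred_class == Test_Data)       # 0.75 at this threshold

# log loss scores the predicted probabilities directly, with no rounding involved
log_loss <- -mean(Test_Data * log(Reg_Output) + (1 - Test_Data) * log(1 - Reg_Output))
accuracy; log_loss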
I seriously suggest you have a look at some logistic regression tutorials (there are literally hundreds out there); a highly recommended source is the textbook An Introduction to Statistical Learning (with Applications in R), made freely available by the authors...

Calculating test characteristics

I have a 2x2 contingency table from a larger dataset:
> ct
disease
test 0 1
no 118 12
yes 24 46
and I would like to quickly retrieve the different (medical) diagnostic test characteristics such as
Sensitivity
Specificity
Likelihood Ratio +
Likelihood Ratio -
False positive rate
False negative rate
Prob of disease
Pred value positive
Pred value negative
p(neg test wrong)
p(test positive)
p(test negative)
Overall accuracy
with their respective 95% CIs. Is there a package/function that does that? Many thanks.
Possibly you could write a custom function for each of these test characteristics. It would ensure the correct format for your particular problem and is probably faster than all the Googling you're already doing. Each one should be pretty quick. For example, Sensitivity:
sens <- function(ct) { ct[2, 2] / sum(ct[, 2]) }  # true positives / all with disease
And Specificity:
spec <- function(ct) { ct[1, 1] / sum(ct[, 1]) }  # true negatives / all without disease
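For the 95% CIs, exact (Clopper-Pearson) intervals for proportions such as sensitivity and specificity can be obtained with binom.test; a minimal sketch using the table above (rebuilding it as a matrix for clarity):
ct <- matrix(c(118, 24, 12, 46), nrow = 2,
             dimnames = list(test = c("no", "yes"), disease = c("0", "1")))
sens_ci <- binom.test(ct["yes", "1"], sum(ct[, "1"]))$conf.int  # TP / (TP + FN)
spec_ci <- binom.test(ct["no",  "0"], sum(ct[, "0"]))$conf.int  # TN / (TN + FP)
sens_ci; spec_ci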
OK, epi.tests(t, verbose=T) it is.
