Conditional probability in R

The question:
A screening test for a disease that affects 0.05% of the male population is able to identify the disease in 90% of cases where an individual actually has the disease. The test, however, generates 1% false positives (it gives a positive reading when the individual does not have the disease). Find the probability that a man has the disease given that he has tested positive. Then, find the probability that a man has the disease given that he has a negative test.
My wrong attempt:
I first started by letting:
• T be the event that a man has a positive test
• Tc be the event that a man has a negative test
• D be the event that a man actually has the disease
• Dc be the event that a man does not have the disease
Therefore we need to find P(D|T) and P(D|Tc)
Then I wrote this code:
set.seed(110)
sims = 1000
D = rep(0, sims)
Dc = rep(0, sims)
T = rep(0, sims)
Tc = rep(0, sims)
# run the loop
for(i in 1:sims){
  # flip to see if we have the disease
  flip = runif(1)
  # if we got the disease, mark it
  if(flip <= .0005){
    D[i] = 1
  }
  # if we have the disease, we need to flip for T and Tc
  if(D[i] == 1){
    # flip for a positive test
    flip1 = runif(1)
    # see if we got T
    if(flip1 < 1/9){
      T[i] = 1
    }
    # flip for a negative test
    flip2 = runif(1)
    # see if we got Tc
    if(flip2 < 1/10){
      Tc[i] = 1
    }
  }
}
# P(D|T)
mean(D[T == 1])
# P(D|Tc)
mean(D[Tc == 1])
I'm really struggling so any help would be appreciated!

Perhaps the best way to think through a conditional probability question like this is with a concrete example.
Say we tested one million individuals in the population. Then 500 individuals (0.05% of one million) would be expected to have the disease, of whom 450 would be expected to test positive and 50 to test negative (since the false negative rate is 10%).
Conversely, 999,500 would be expected to not have the disease (one million minus the 500 who do have the disease), but since 1% of them would test positive, then we would expect 9,995 people (1% of 999,500) with false positive results.
So, given a positive test result taken at random, it either belongs to one of the 450 people with the disease who tested positive, or one of the 9,995 people without the disease who tested positive - we don't know which.
This is the situation in the first question, since we have a positive test result but don't know whether it is a true positive or a false positive. The probability of our subject having the disease given their positive test is the probability that they are one of the 450 true positives out of the 10,445 people with positive results (9995 false positives + 450 true positives). This boils down to the simple calculation 450/10,445 or 0.043, which is 4.3%.
Similarly, a negative test taken at random either belongs to one of the 989,505 (999,500 - 9,995) people without the disease who tested negative, or one of the 50 people with the disease who tested negative, so the probability of having the disease is 50/989,555 (989,505 true negatives + 50 false negatives), or about 0.005%.
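If you prefer to see that arithmetic as code, here is a minimal R version of the same counts (no simulation, just the expected numbers out of one million people):
n <- 1e6
d <- n * 0.0005        # 500 people with the disease
tp <- d * 0.9          # 450 true positives
fn <- d * 0.1          # 50 false negatives
fp <- (n - d) * 0.01   # 9995 false positives
tn <- (n - d) * 0.99   # 989505 true negatives
tp / (tp + fp)         # P(disease | positive test), about 0.0431
fn / (fn + tn)         # P(disease | negative test), about 5.05e-05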
I think this question demonstrates the importance of taking disease prevalence into account when interpreting test results, and has very little to do with programming or R. It requires only a calculator (at most).
If you really wanted to run a simulation in R, you could do:
set.seed(1) # This makes the sample reproducible
sample_size <- 1000000 # This can be changed to get a larger or smaller sample
# Create a large sample of 1 million "people", using a 1 to denote disease and
# a 0 to denote no disease, with probabilities of 0.0005 (which is 0.05%) and
# 0.9995 (which is 99.95%) respectively.
disease <- sample(x = c(0, 1),
                  size = sample_size,
                  replace = TRUE,
                  prob = c(0.9995, 0.0005))
# Create an empty vector to hold the test results for each person
test <- numeric(sample_size)
# Simulate the test results of people with the disease, using a 1 to denote
# a positive test and 0 to denote a negative test. This uses a probability of
# 0.9 (which is 90%) of having a positive test and 0.1 (which is 10%) of having
# a negative test. We draw as many samples as we have people with the disease
# and put them into the "test" vector at the locations corresponding to the
# people with the disease.
test[disease == 1] <- sample(x = c(0, 1),
                             size = sum(disease),
                             replace = TRUE,
                             prob = c(0.1, 0.9))
# Now we do the same for people without the disease, simulating their test
# results, with a 1% probability of a positive test.
test[disease == 0] <- sample(x = c(0, 1),
                             size = sample_size - sum(disease),
                             replace = TRUE,
                             prob = c(0.99, 0.01))
Now that we have run our simulation, we can count the true positives, false positives, true negatives and false negatives by creating a contingency table:
contingency_table <- table(disease, test)
contingency_table
#>        test
#> disease      0      1
#>       0 989566   9976
#>       1     38    420
and get the approximate probability of having the disease given a positive test like this:
contingency_table[2, 2] / sum(contingency_table[,2])
#> [1] 0.04040015
and the probability of having the disease given a negative test like this:
contingency_table[2, 1] / sum(contingency_table[,1])
#> [1] 3.83992e-05
You'll notice that the probability estimates from sampling are not that accurate because of how small some of the sampling probabilities are. You could simulate a larger sample, but it might take a while for your computer to run it.

To expand on Allan's answer, but relating it back to Bayes' theorem, if you prefer:
From the question, you know (converting percentages to probabilities):
P(D) = 0.0005, and so P(Dc) = 0.9995
P(T|D) = 0.9, and so P(Tc|D) = 0.1
P(T|Dc) = 0.01, and so P(Tc|Dc) = 0.99
Plugging in to Bayes' theorem:
P(D|T) = P(T|D)P(D) / (P(T|D)P(D) + P(T|Dc)P(Dc)) = (0.9 × 0.0005) / (0.9 × 0.0005 + 0.01 × 0.9995) ≈ 0.0431
P(D|Tc) = P(Tc|D)P(D) / (P(Tc|D)P(D) + P(Tc|Dc)P(Dc)) = (0.1 × 0.0005) / (0.1 × 0.0005 + 0.99 × 0.9995) ≈ 0.00005
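If you want to check those numbers in R, here is a one-off sketch (the variable names are just for illustration):
p_D    <- 0.0005   # prevalence, P(D)
p_T_D  <- 0.9      # sensitivity, P(T|D)
p_T_Dc <- 0.01     # false positive rate, P(T|Dc)
# P(D|T) by Bayes' theorem
p_T_D * p_D / (p_T_D * p_D + p_T_Dc * (1 - p_D))   # about 0.0431
# P(D|Tc)
(1 - p_T_D) * p_D / ((1 - p_T_D) * p_D + (1 - p_T_Dc) * (1 - p_D))   # about 5.05e-05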

Related

R function to find difference in mean greater than or equal to a specific number

I have just started my basic statistics course using R and we're studying using R for paired t-tests. I have come across questions where we're given two sets of data and asked to find whether the difference in means is equal to 0, greater than 0, and so on. The function we use for two samples x and y with an unknown variance is similar to the one below:
t.test(x, y, var.equal=TRUE, alternative="greater")
My question is, how would we do this if we wanted to test whether the difference in means is greater than or equal to a specified number, against the alternative that it is less than that number, rather than 0?
For example, say we're given two sets of data for the before and after weights of 10 people. How do we test that the mean difference in weight is greater than or equal to, say, 3 kg against the alternative that the mean difference in weight is less than 3 kg? Is there a way to do this? Would really appreciate any guidance on this matter.
It might be worthwhile posting on https://stats.stackexchange.com/ as well if you need more theoretical backing. Is it OK to add/subtract the 3 kg from either x or y and then use the t-test to check for similarity? I think this would tell you at least which outcome is more likely, if that's the end goal. It would be good to get feedback on this.
# number of obs, and rnorm dist for simulating
N <- 10
mu <- 70
sd <- 10
set.seed(1)
x <- round(rnorm(N, mu, sd), 1)
# three outcomes
# (1) no change
y_same <- x + round(rnorm(N, 0, 5), 1)
# (2) average increase of 3
y_imp <- x + rnorm(N, 3, 5)
# (3) average decrease of 3
y_dec <- x + rnorm(N, -3, 5)
# say y_imp is true
y_act <- y_imp
# can we test whether we're closer to the output by altering
# the original data? or conversely, altering y_imp
t_inc <- t.test(x+3, y_act, var.equal=TRUE, alternative="two.sided")
t_dec <- t.test(x-3, y_act, var.equal=TRUE, alternative="two.sided")
t_inc$p.value
[1] 0.8279801
t_dec$p.value
[1] 0.0956033
# one with the highest p.value has the closest distribution, so
# +3 kg more likely than -3kg
You can set mu=3 to change the null hypothesis from 0 to 3, assuming your variables are in the units you describe above. Since your alternative is that the mean difference is less than 3 kg, use alternative="less":
t.test(x, y, mu=3, alternative="less", paired=TRUE)
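For a self-contained illustration, here is a sketch on simulated before/after weights (the data below is made up, not from the question), testing H0: mean loss >= 3 kg against H1: mean loss < 3 kg:
set.seed(42)
before <- rnorm(10, mean = 80, sd = 8)          # hypothetical starting weights (kg)
after  <- before - rnorm(10, mean = 4, sd = 2)  # true mean loss of about 4 kg
# paired test of the mean difference (before - after) against mu = 3
t.test(before, after, mu = 3, paired = TRUE, alternative = "less")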
More (general) information on Stack Exchange [here](https://stats.stackexchange.com/questions/206316/can-a-paired-or-two-group-t-test-test-if-the-difference-between-two-means-is-l/206317#206317).

Test for no change in Limma

I'm looking for a way to identify genes that are significantly stable across conditions. In other words, the opposite of standard DE analysis.
Standard DE splits genes into two categories: significantly changing on one side, and everything else, "the rest", on the other.
"The rest", however, contains both genes that actually do not change, and genes for which the confidence in the change is not sufficient to call them differential.
What I want is to find those that do not change, or in other words, those for which I can confidently say that there's no change across my conditions.
I know this is possible in DESeq2 by providing an alternative null hypothesis, but I have to integrate this as an extra step into someone else's pipeline that already uses limma, and I'd like to stick to it.
Ideally I would like to test for both DE and non-changing genes in a similar way, something conceptually similar to changing the H0 in DESeq2.
At the moment the code to test for DE goes like:
# shaping data
comparison <- eBayes(lmFit(my_data, weights = my.weights^2))
results <- limma::topTable(my_data, sort.by = "t",
coef = 1, number = Inf)
As an example, I'd love something like the following, but anything conceptually similar would do.
comparison <- eBayes(lmFit(my_data, weights = my.weights^2), ALTERNATIVE_H0 = my_H0)
I know treat() allows you to specify an interval null hypothesis by providing a fold change; citing the manual: "it uses an interval null hypothesis, where the interval is [-lfc,lfc]".
However, this still tests for change away from a central interval around 0, while the null intervals I would like to test against are (-Inf, -lfc] and [lfc, Inf).
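For reference, the treat() call I mean would slot into the existing pipeline roughly like this (the lfc cutoff here is just illustrative):
# treat() replaces eBayes() and tests the interval null |logFC| <= lfc
fit  <- lmFit(my_data, weights = my.weights^2)
tfit <- treat(fit, lfc = log2(1.2))
results <- limma::topTreat(tfit, coef = 1, number = Inf)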
Is there any option I'm missing?
Thanks!
You can try to use the confidence interval of the logFC to select your genes, but I must say this is very dependent on the number of samples you have and on how strong the biological variance is. Below I show an example of how it can be done.
First we use DESeq2 to generate an example dataset; we set betaSD so that we have a small proportion of genes that should show differences between conditions:
library(DESeq2)
library(limma)
set.seed(100)
dds = makeExampleDESeqDataSet(n=2000,betaSD=1)
#pull out the data
DF = colData(dds)
# get out the true fold change
FC = mcols(dds)
Now we can run limma-voom on this dataset:
V = voom(counts(dds),model.matrix(~condition,data=DF))
fit = lmFit(V,model.matrix(~condition,data=DF))
fit = eBayes(fit)
# get the results, in this case, we are interested in the 2nd coef
res = topTable(fit,coef=2,n=nrow(V),confint=TRUE)
So there is an option to collect the 95% confidence interval of the fold change in the function topTable. We do that and compare against the true FC:
# fill in the true fold change
res$true_FC = FC[rownames(res),"trueBeta"]
We can look at how the estimated and true fold changes differ:
plot(res$logFC,res$true_FC)
So let's say we want to find genes where we are confident the absolute log fold change is < 1; we can do:
tabResults = function(tab, fc_cutoff){
  true_unchange = abs(tab$true_FC) < fc_cutoff
  pred_unchange = tab$CI.L > (-fc_cutoff) & tab$CI.R < fc_cutoff
  list(
    X = table(pred_unchange, true_unchange),
    expression_distr = aggregate(
      tab$AveExpr ~ pred_unchange + true_unchange, data = tab, mean
    ))
}
tabResults(res,1)$X
             true_unchange
pred_unchange FALSE TRUE
        FALSE   617 1249
        TRUE      7  127
The above results tell us that if we limit it to genes whose 95% confidence intervals lie within +/- 1 log fold change, we get 134 hits, 7 of which are false (their actual fold change is > 1).
The reason we miss some truly unchanging genes is that they are expressed a bit lower, while most of the genes we correctly predict to be unchanging have high expression:
tabResults(res, 1)$expression_distr
  pred_unchange true_unchange tab$AveExpr
1         FALSE         FALSE    7.102364
2          TRUE         FALSE    8.737670
3         FALSE          TRUE    6.867615
4          TRUE          TRUE   10.042866
We can use a lower fold change cutoff, but we also end up with fewer genes:
tabResults(res, 0.7)$X
             true_unchange
pred_unchange FALSE TRUE
        FALSE   964 1016
        TRUE      1   19
The confidence interval depends a lot on the number of samples you have, so a cutoff of 1 for one dataset would mean something different for another.
So I would say that if you have a dataset at hand, you can first run DESeq2 on it, obtain the mean-variance relationship, and simulate data as I have done here, to get a rough idea of what fold change cutoff would be OK and how many genes you could possibly get, and make a decision from there.

Binomial probability invocation with R

At a car hire service, 50% of cars are returned on time. A sample of 20 car hires is studied. Is this the correct invocation of dbinom in R to calculate the probability that all 20 cars are returned on time?
dbinom(x=20, size=20, prob=0.5)
Yes.
We can check against what we know the answer to be (0.5^20, since 20 choose 20 is 1 and (0.5^20)*(0.5^(20-20)) = 0.5^20):
dbinom(x=20, size=20, prob=0.5)
# [1] 9.536743e-07
0.5^20
# [1] 9.536743e-07
From help("dbinom"):
x, q vector of quantiles.
...
size number of trials (zero or more).
prob probability of success on each trial.
So here, x is our quantile (what is the probability there were 20 successes?), size is our number of trials (a sample of 20), and prob is the probability of success in each one (there is a 1/2 chance each car is returned timely).
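As a quick cross-check (not required, just another route to the same number), the value also falls out of the upper tail of the binomial CDF, since 20 successes is the only outcome above 19:
# P(X = 20) equals P(X > 19) because 20 is the maximum possible count
pbinom(19, size = 20, prob = 0.5, lower.tail = FALSE)
# [1] 9.536743e-07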

Confusing p values with ANOVA on a big dataframe

I am trying to analyse the significant differences between car company performance values across different countries. I am using ANOVA to do this.
Running ANOVA on my real dataset (30 countries, 1000 car companies and 90000 measurement scores) gave every car company a p-value of zero.
Confused by this, I created a reproducible example (below) with 30 countries, 3 car companies and 90000 random scores. On purpose, I kept a constant score of 1 for the Benz company, where you shouldn't see any difference between countries. After running the ANOVA, I see a p-value of 0.46 instead of 1.
Does anyone know why this is?
Reproducible example
set.seed(100000)
qqq <- 90000
df = data.frame(id = 1:qqq,
                country = rep(c(paste0("us", letters),
                                paste0("usa", letters[1:4])), each = 3000),
                tesla = runif(qqq),
                bmw = runif(qqq),
                benz = rep(1, qqq))
str(df)
out <- data.frame()
for(j in 3:ncol(df)){
  amod2 <- aov(df[, j] ~ df$country)
  out[(j - 2), 1] <- colnames(df)[j]
  out[(j - 2), 2] <- summary(amod2)[[1]][[1, "Pr(>F)"]]
}
colnames(out)<-c("cars","pvalue")
write.table(out,"df.output")
df.output
"cars" "pvalue"
"1" "tesla" 0.245931589754359
"2" "bmw" 0.382730335188437
"3" "benz" 0.465083026215268
With respect to the "benz" p-value in your reproducible example: an ANOVA analysis requires positive variance (i.e., non-constant data). If you violate this assumption, the model is degenerate. Technically, the p-value is based on an F-statistic that is a ratio of the mean square attributable to the "country" effect (for "benz" in your example, zero) divided by the residual (within-group) mean square (for "benz" in your example, also zero), so your F-statistic has "value" 0/0, or NaN.
Because of the approach R takes to calculating the F-statistic (using a QR matrix decomposition to improve numerical stability in "nearly" degenerate cases), it calculates an F-statistic equal to 1 (with 29 and 89970 degrees of freedom). This gives a p-value of:
> pf(1, 29, 89970, lower=FALSE)
[1] 0.465083
>
but it is, of course, largely meaningless.
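You can see the 0/0 directly by computing the sums of squares by hand for a constant response (a small sketch mirroring your "benz" column, not your actual data):
y <- rep(1, 90000)
g <- gl(30, 3000)   # 30 "countries" with 3000 observations each
ss_between <- sum(tapply(y, g, function(v) length(v) * (mean(v) - mean(y))^2))
ss_within  <- sum((y - ave(y, g))^2)
c(ss_between, ss_within)   # both exactly zero, so F = (0/29) / (0/89970) = NaN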
With respect to your original problem, with large datasets relatively small effects will yield very small p-values. For example, if you add the following after your df definition above to introduce a difference in country usa:
df = within(df, {
  o = country == "usa"
  tesla[o] = tesla[o] + .1
  bmw[o] = bmw[o] + .1
  benz[o] = benz[o] + .1
  rm(o)
})
you will find that out looks like this:
> out
cars pvalue
1 tesla 9.922166e-74
2 bmw 5.143542e-74
3 benz 0.000000e+00
>
Is this what you're seeing, or are you seeing all of them exactly zero?

Function that returns TRUE with a given probability

I'm looking for a function that returns TRUE with a given probability. Something like:
> proba = 2/3
> function(proba)
It returns TRUE (or 1) with a probability of 2/3 and it returns FALSE (or 0) with a probability of 1/3
The only way to compute that I can think of is:
> sample(c(rep(1,ceiling(proba*100)),rep(0,ceiling((1-proba)*100))),1)
but it gives only an approximation (and it is not really good looking!) as it can only deal with values that have a finite number of decimals.
proba <- 2/3
# number of values:
n <- 1
as.logical(rbinom(n, size = 1, prob = proba))
runif(1) > 1/3 will do it for you. Or in the general case,
prob <- function(winval) runif(1) > (1 - winval)
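A quick empirical sanity check of this approach (nothing more than counting TRUEs over many draws):
set.seed(1)
mean(replicate(1e5, runif(1) > 1/3))   # should come out close to 2/3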
How about:
function(proba) sample(c(TRUE, FALSE), 1, prob = c(proba, 1 - proba))
And if you want to be able to draw any number of TRUE/FALSE, not just one:
function(proba, size) sample(c(TRUE, FALSE), size, prob = c(proba, 1 - proba),
                             replace = TRUE)
Just for reference, you can avoid the doubt about the fractional representation of your probabilities by creating the total population and then performing a selection, like so:
sample(c(rep(TRUE, 2), rep(FALSE, 1)), 1)
OR
sample(c(TRUE, TRUE, FALSE), 1)
Usually, we use probabilities to represent the selection likelihood in a population of unknown or effectively uncountable size. Probability is used as a proxy. When you know the details of the population, using the exact population is actually preferred from a mathematical perspective. It also has the side effect of being a more accurate representation of this specific problem.
To extend the solution, you would need to convert your probabilities into a population total for each population subset. In this case, we have two subsets: TRUE, and FALSE. Instead of representing the selection likelihood of a TRUE individual as 2/3, you would instead state the number of TRUE individuals contained in the total population TRUE_N, and the number of FALSE individuals contained in the total population FALSE_N.
TRUE_N <- 2
FALSE_N <- 1
sample(c(rep(TRUE, TRUE_N), rep(FALSE, FALSE_N)), 1)
