Binomial probability invocation with R

At a car hire service 50% of cars are returned on time. A sample of 20 car hires is studied. Is this the correct R invocation of dbinom to calculate the probability that all 20 cars are returned on time?
dbinom(x=20, size=20, prob=0.5)

Yes.
We can check against what we know the answer to be (0.5^20, since 20 choose 20 is 1 and (0.5^20)*(0.5^(20-20)) = 0.5^20):
dbinom(x=20, size=20, prob=0.5)
# [1] 9.536743e-07
0.5^20
# [1] 9.536743e-07
From help("dbinom"):
x, q vector of quantiles.
...
size number of trials (zero or more).
prob probability of success on each trial.
So here, x is our quantile (what is the probability that there were 20 successes?), size is our number of trials (a sample of 20), and prob is the probability of success in each one (there is a 1/2 chance each car is returned on time).
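As a further check, here is a minimal sketch comparing dbinom with the binomial formula written out by hand. The variable names (n, p, k, by_hand) are illustrative only and do not come from the question.

```r
# Illustrative names; not part of the original question
n <- 20    # number of car hires in the sample
p <- 0.5   # probability that a car is returned on time
k <- 20    # number of on-time returns we are asking about

# Binomial formula written out by hand: choose(n, k) * p^k * (1 - p)^(n - k)
by_hand <- choose(n, k) * p^k * (1 - p)^(n - k)

# Should agree with dbinom up to floating-point tolerance
all.equal(by_hand, dbinom(x = k, size = n, prob = p))

# dbinom is vectorised over x, so the probabilities for 0..n sum to 1
sum(dbinom(x = 0:n, size = n, prob = p))
```

Because dbinom accepts a vector for x, the same call can return the probability of every possible count at once.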

Related

Conditional probability in r

The question:
A screening test for a disease that affects 0.05% of the male population is able to identify the disease in 90% of the cases where an individual actually has the disease. The test, however, generates 1% false positives (it gives a positive reading when the individual does not have the disease). Find the probability that a man has the disease given that he has tested positive. Then, find the probability that a man has the disease given that he has a negative test.
My wrong attempt:
I first started by letting:
• T be the event that a man has a positive test
• Tc be the event that a man has a negative test
• D be the event that a man actually has the disease
• Dc be the event that a man does not have the disease
Therefore we need to find P(D|T) and P(D|Tc)
Then I wrote this code:
set.seed(110)
sims = 1000
D = rep(0, sims)
Dc = rep(0, sims)
T = rep(0, sims)
Tc = rep(0, sims)
# run the loop
for(i in 1:sims){
# flip to see if we have the disease
flip = runif(1)
# if we got the disease, mark it
if(flip <= .0005){
D[i] = 1
}
# if we have the disease, we need to flip for T and Tc,
if(D[i] == 1){
# flip for S1
flip1 = runif(1)
# see if we got S1
if(flip1 < 1/9){
T[i] = 1
}
# flip for S2
flip2 = runif(1)
# see if we got S1
if(flip2 < 1/10){
Tc[i] = 1
}
}
}
# P(D|T)
mean(D[T == 1])
# P(D|Tc)
mean(D[Tc == 1])
I'm really struggling so any help would be appreciated!
Perhaps the best way to think through a conditional probability question like this is with a concrete example.
Say we tested one million individuals in the population. Then 500 individuals (0.05% of one million) would be expected to have the disease, of whom 450 would be expected to test positive and 50 to test negative (since the false negative rate is 10%).
Conversely, 999,500 would be expected to not have the disease (one million minus the 500 who do have the disease), but since 1% of them would test positive, then we would expect 9,995 people (1% of 999,500) with false positive results.
So, given a positive test result taken at random, it either belongs to one of the 450 people with the disease who tested positive, or one of the 9,995 people without the disease who tested positive - we don't know which.
This is the situation in the first question, since we have a positive test result but don't know whether it is a true positive or a false positive. The probability of our subject having the disease given their positive test is the probability that they are one of the 450 true positives out of the 10,445 people with positive results (9995 false positives + 450 true positives). This boils down to the simple calculation 450/10,445 or 0.043, which is 4.3%.
Similarly, a negative test taken at random either belongs to one of the 989,505 (999,500 - 9,995) people without the disease who tested negative, or one of the 50 people with the disease who tested negative, so the probability of having the disease is 50/989,555 (50 false negatives out of 989,505 + 50 negative results), or about 0.005%.
I think this question is demonstrating the importance of taking disease prevalence into account when interpreting test results, and has very little to do with programming or R. It requires only a calculator (at most).
If you really wanted to run a simulation in R, you could do:
set.seed(1) # This makes the sample reproducible
sample_size <- 1000000 # This can be changed to get a larger or smaller sample
# Create a large sample of 1 million "people", using a 1 to denote disease and
# a 0 to denote no disease, with probabilities of 0.0005 (which is 0.05%) and
# 0.9995 (which is 99.95%) respectively.
disease <- sample(x = c(0, 1),
size = sample_size,
replace = TRUE,
prob = c(0.9995, 0.0005))
# Create an empty vector to hold the test results for each person
test <- numeric(sample_size)
# Simulate the test results of people with the disease, using a 1 to denote
# a positive test and 0 to denote a negative test. This uses a probability of
# 0.9 (which is 90%) of having a positive test and 0.1 (which is 10%) of having
# a negative test. We draw as many samples as we have people with the disease
# and put them into the "test" vector at the locations corresponding to the
# people with the disease.
test[disease == 1] <- sample(x = c(0, 1),
size = sum(disease),
replace = TRUE,
prob = c(0.1, 0.9))
# Now we do the same for people without the disease, simulating their test
# results, with a 1% probability of a positive test.
test[disease == 0] <- sample(x = c(0, 1),
size = sample_size - sum(disease),
replace = TRUE,
prob = c(0.99, 0.01))
Now that we have run our simulation, we can just count the true positives, false positives, true negatives and false negatives by creating a contingency table:
contingency_table <- table(disease, test)
contingency_table
#>        test
#> disease      0      1
#>       0 989566   9976
#>       1     38    420
and get the approximate probability of having the disease given a positive test like this:
contingency_table[2, 2] / sum(contingency_table[,2])
#> [1] 0.04040015
and the probability of having the disease given a negative test like this:
contingency_table[2, 1] / sum(contingency_table[,1])
#> [1] 3.83992e-05
You'll notice that the probability estimates from sampling are not that accurate because of how small some of the sampling probabilities are. You could simulate a larger sample, but it might take a while for your computer to run it.
Created on 2021-08-19 by the reprex package (v2.0.0)
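To expand on Allan's answer, but relating it back to Bayes' theorem, if you prefer:
From the question, you know (converting percentages to probabilities):
P(D) = 0.0005, so P(Dc) = 0.9995
P(T|D) = 0.9, so P(Tc|D) = 0.1
P(T|Dc) = 0.01, so P(Tc|Dc) = 0.99
Plugging in:
P(D|T) = P(T|D) P(D) / [P(T|D) P(D) + P(T|Dc) P(Dc)] = (0.9 * 0.0005) / (0.9 * 0.0005 + 0.01 * 0.9995) = 0.00045 / 0.010445 ≈ 0.043
P(D|Tc) = P(Tc|D) P(D) / [P(Tc|D) P(D) + P(Tc|Dc) P(Dc)] = (0.1 * 0.0005) / (0.1 * 0.0005 + 0.99 * 0.9995) = 0.00005 / 0.989555 ≈ 0.00005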
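Bayes' theorem applied to the numbers in the question can also be sketched in R. The variable names below are illustrative assumptions; the three input probabilities are taken directly from the question.

```r
# Inputs taken from the question (variable names are illustrative)
p_d    <- 0.0005  # P(D): prevalence of the disease (0.05%)
p_t_d  <- 0.90    # P(T | D): positive test given disease
p_t_dc <- 0.01    # P(T | Dc): false positive rate

# P(D | T) = P(T | D) P(D) / [P(T | D) P(D) + P(T | Dc) P(Dc)]
p_d_t <- (p_t_d * p_d) / (p_t_d * p_d + p_t_dc * (1 - p_d))

# P(D | Tc) uses the complementary test probabilities
p_d_tc <- ((1 - p_t_d) * p_d) /
  ((1 - p_t_d) * p_d + (1 - p_t_dc) * (1 - p_d))

p_d_t   # about 0.043, i.e. 450 / 10445
p_d_tc  # about 0.00005, i.e. 50 / 989555
```

The two results match the counts from the one-million-person example: 450 true positives among 10,445 positive tests, and 50 false negatives among 989,555 negative tests.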

R function to find difference in mean greater than or equal to a specific number

I have just started my basic statistics course using R and we're studying paired t-tests. I have come across questions where we're given two sets of data and asked to test whether the difference in means is equal to 0, greater than 0, and so on. The function we use for two samples x and y with unknown variance is similar to the one below:
t.test(x, y, var.equal=TRUE, alternative="greater")
My question is, how would we do this if we wanted to test that the difference in means is greater than or equal to a specified number, against the alternative that it is less than that number, rather than 0?
For example, say we're given two data sets of before and after weights for 10 people. How do we test that the mean difference in weight is more than or equal to, say, 3kg against the alternative that the mean difference in weight is less than 3kg? Is there a way to do this? I would really appreciate any guidance on this matter.
It might be worthwhile posting on https://stats.stackexchange.com/ as well if you need a more theoretical justification. Is it OK to add or subtract the 3kg from either x or y and then use the t-test to check for similarity? I think this would at least tell you which outcome is more likely, if that's the end goal. It would be good to get feedback on this.
# number of obs, and rnorm dist for simulating
N <- 10
mu <- 70
sd <- 10
set.seed(1)
x <- round(rnorm(N, mu, sd), 1)
# three outcomes
# (1) no change
y_same <- x + round(rnorm(N, 0, 5), 1)
# (2) average increase of 3
y_imp <- x + rnorm(N, 3, 5)
# (3) average decrease of 3
y_dec <- x + rnorm(N, -3, 5)
# say y_imp is true
y_act <- y_imp
# can we test whether we're closer to the output by altering
# the original data? or conversely, altering y_imp
t_inc <- t.test(x+3, y_act, var.equal=TRUE, alternative="two.sided")
t_dec <- t.test(x-3, y_act, var.equal=TRUE, alternative="two.sided")
t_inc$p.value
# [1] 0.8279801
t_dec$p.value
# [1] 0.0956033
# one with the highest p.value has the closest distribution, so
# +3 kg more likely than -3kg
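You can set mu=3 to change the null hypothesis from 0 to 3, assuming your x and y variables are in the units you describe above. With paired=TRUE the test is carried out on the differences x - y, and since your alternative is that the mean difference is less than 3, use alternative="less":
t.test(x, y, mu=3, alternative="less", paired=TRUE)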
More (general) information on Stack Exchange [here](https://stats.stackexchange.com/questions/206316/can-a-paired-or-two-group-t-test-test-if-the-difference-between-two-means-is-l/206317#206317).
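Here is a minimal sketch for the hypotheses in the question (null: mean difference of at least 3 kg; alternative: less than 3 kg). The vectors before and after and the simulated weight change are made up for illustration. With paired = TRUE, t.test works on the differences between its first and second arguments, so the argument order determines the sign of the difference being tested.

```r
set.seed(1)  # reproducible illustration

# Hypothetical weights (kg) for 10 people; values are simulated, not real data
before <- round(rnorm(10, mean = 70, sd = 10), 1)
after  <- before + rnorm(10, mean = 3, sd = 5)

# H0: mean(after - before) >= 3  versus  H1: mean(after - before) < 3
res <- t.test(after, before, mu = 3, alternative = "less", paired = TRUE)
res$p.value

# Equivalent one-sample test on the differences shifted by 3
res_shift <- t.test(after - before - 3, mu = 0, alternative = "less")
all.equal(res$p.value, res_shift$p.value)
```

The second call shows that setting mu = 3 is the same as subtracting 3 from the differences and testing against 0, which is the idea behind shifting x or y by 3 kg.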

How to generate n random numbers from negative binomial distribution?

I am trying to write a function to generate n random numbers from a negative binomial distribution.
To do this, I first wrote a function to generate n random variables from a geometric distribution. My function for generating n random numbers from a geometric distribution is as follows:
rGE<-function(n,p){
I<-rep(NA,n)
for (j in 1:n){
x<-rBer(1,p)
i<-1 # number of trials
while(x==0){
x<-rBer(1,p)
i<-i+1
}
I[j]<- i
}
return(I)
}
I tested this function (rGE), for example for rGE(10,0.5), which is generating 10 random numbers from a geometric distribution with probability of success 0.5, a random result was:
[1] 2 4 2 1 1 3 4 2 3 3
In rGE function I used a function named rBer which is:
rBer<-function(n,p){
sample(0:1,n,replace = TRUE,prob=c(1-p,p))
}
Now, I want to extend my function above (rGE) into a function for generating n random numbers from a negative binomial distribution. I wrote the following function:
rNB<-function(n,r,p){
I<-seq(n)
for (j in 1:n){
x<-0
x<-rBer(1,p)
i<-1 # number of trials
while(x==0 & I[j]!=r){
x<-rBer(1,p)
i<-i+1
}
I[j]<- i
}
return(I)
}
I tested it several times with rNB(3,2,0.1), which should generate 3 random numbers from a negative binomial distribution with parameters r=2 and p=0.1:
> rNB(3,2,0.1)
[1] 2 1 7
> rNB(3,2,0.1)
[1] 3 1 4
> rNB(3,2,0.1)
[1] 3 1 2
> rNB(3,2,0.1)
[1] 3 1 3
> rNB(3,2,0.1)
[1] 46 1 13
As you can see, my function (rNB) does not seem to work correctly, because the second random number it generates is always 1.
Could anyone help me correct my function (rNB) so that it generates n random numbers from a negative binomial distribution with parameters r and p, where r is the number of successes and p is the probability of success?
[[Hint: Explanations regarding geometric distribution and negative binomial distribution:
Geometric distribution: In probability theory and statistics, the geometric distribution is either of two discrete probability distributions:
The probability distribution of the number X of Bernoulli trials needed to get one success, supported on the set { 1, 2, 3, ... }.
The probability distribution of the number Y = X − 1 of failures before the first success, supported on the set { 0, 1, 2, 3, ... }
Negative binomial distribution: A negative binomial experiment is a statistical experiment that has the following properties:
The experiment consists of x repeated trials.
Each trial can result in just two possible outcomes. We call one of these outcomes a success and the other, a failure.
The probability of success, denoted by P, is the same on every trial.
The trials are independent; that is, the outcome on one trial does not affect the outcome on other trials.
The experiment continues until r successes are observed, where r is specified in advance.
]]
Your function will be much faster if you use R's native vectorization. The way you can do this is to generate all your Bernoulli trials at once.
Note that for a negative binomial distribution, the expected number of failures before the r-th success is r * (1 - p) / p, so the expected number of Bernoulli trials needed to get r successes is r / p.
If we want to draw n negative binomial samples, then the expected total number of Bernoulli trials will therefore be n * r / p. So we want to draw at least that many Bernoulli samples. For simplicity, we can start by drawing twice that number: 2 * n * r / p. In the unlikely case that this is not enough, we can double the number of draws repeatedly until we have enough; once the sum of the resultant vector of Bernoulli trials is at least r * n, we know we have enough Bernoulli trials to simulate our n negative binomial trials.
We can now run a cumsum on the vector of Bernoulli trials to keep track of the number of successes. Subtracting each trial's own outcome from the cumsum gives the number of successes that occurred before that trial. If you then perform integer division on this vector by %/% r, you will have all the Bernoulli trials labelled according to which negative binomial trial they belonged to, with each group ending on its r-th success. You then table this vector.
The first n numbers of the table (obtained by subsetting the table by [1:n] or equivalently by [seq(n)]) are your negative binomial draws. We just remove the table's names by using as.numeric. We also subtract the number of successes (i.e. r) from each of our counts, since we are only counting the failures, not the successes.
rNB <- function(n, r, p) {
mult <- 2
all_samples <- 0
while(sum(all_samples) < n * r)
{
# Draw mult times the expected number of trials, n * r / p
all_samples <- rBer(ceiling(mult * n * r / p), p)
mult <- mult * 2
}
# Successes seen before each trial; each block of r successes is one draw
successes_before <- cumsum(all_samples) - all_samples
as.numeric(table(successes_before %/% r))[seq(n)] - r
}
So we can do:
rNB(3, 2, 0.1)
#> [1] 14 19 41
rNB(3, 2, 0.1)
#> [1] 23 6 56
rNB(3, 2, 0.1)
#> [1] 11 31 59
rNB(3, 2, 0.1)
#> [1] 7 21 14
mean(rNB(10000, 2, 0.1))
#> [1] 18.0002
We can test this against R's own rnbinom:
mean(rnbinom(10000, 2, 0.1))
#> [1] 18.0919
hist(rnbinom(10000, 2, 0.5), breaks = 0:20)
hist(rNB(10000, 2, 0.5), breaks = 0:20)
Note that the logic of your own version isn't quite right. In particular, the line while(x == 0 & I[j] != r) doesn't make any sense. I is a vector of 1:n, so in your example, whenever j is 2, I[j] is equal to r and the loop stops. This is why your second number is always 1. I don't know what you were trying to do here.
If you want to do it one Bernoulli trial at a time, as you are doing in your own version, try this modified function. The variable names should hopefully make it easy to follow the logic:
rNB <- function(n, r, p) {
# Create an empty vector of length n for our results
draws <- numeric(n)
# Now for each of the n trials we will get a negative binomial sample:
for (i in 1:n) {
# Create success and failure counters for this draw
failures <- successes <- 0
# Now run Bernoulli trials, counting successes and failures as we go
# until we hit r successes
while(successes < r)
{
if(rBer(1, p) == 1)
successes <- successes + 1
else
failures <- failures + 1
}
# Once we have reached r successes, the current number of failures is our
# negative binomial draw
draws[i] <- failures
}
return(draws)
}
This produces draws from the same distribution as the faster, albeit more opaque, vectorized version.
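One way to sanity-check a sampler like this is to compare its sample mean and variance with the theoretical values for the number of failures before the r-th success, r * (1 - p) / p and r * (1 - p) / p^2. The sketch below repeats the loop-based logic under the hypothetical name rNB_loop so that it runs on its own; the seed and sample size are arbitrary choices.

```r
# Bernoulli helper as defined in the question
rBer <- function(n, p) {
  sample(0:1, n, replace = TRUE, prob = c(1 - p, p))
}

# Loop-based sampler (hypothetical name): failures before the r-th success
rNB_loop <- function(n, r, p) {
  draws <- numeric(n)
  for (i in seq_len(n)) {
    failures <- successes <- 0
    while (successes < r) {
      if (rBer(1, p) == 1) {
        successes <- successes + 1
      } else {
        failures <- failures + 1
      }
    }
    draws[i] <- failures
  }
  draws
}

set.seed(42)  # arbitrary seed for reproducibility
r <- 2
p <- 0.5
x <- rNB_loop(5000, r, p)

# Sample moments next to the theoretical moments of the failure count
c(sample_mean = mean(x), theory_mean = r * (1 - p) / p)
c(sample_var  = var(x),  theory_var  = r * (1 - p) / p^2)
```

The sample values should land close to 2 and 4 respectively, with some random variation from run to run if the seed is changed.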

Simulation on the probability of doing laundry together

I am trying to simulate a self-made probability problem: "Suppose there are 6 households living in a unit of an apartment complex. On average, a single household does laundry twice a week for 2 hours each time. Find the probability that any two households are doing laundry at the same time."
However, I was able to simulate for the case when a single household does laundry ONCE a week (R code below) and would appreciate any help on extending the code to the scenario for doing laundry TWICE a week.
I also attempted to find a theoretical solution but it did not match with my simulation results below. Any help is appreciated. Thanks!
dist.min <- function(x) {
ifelse(min(dist(x)) <= 2 * 3600 - 1, T, F)
}
set.seed(12345)
N <- 100000
mat <- matrix(sample(1:(24 * 60 * 60 * 7), N * 6, replace = T), ncol = 6)
is.same <- apply(mat, 1, dist.min)
mean(is.same) # 0.30602
Hi, if I understood the problem correctly, I would take the following approach.
This is a binomial distribution where n = 6 is the number of families and the probability p that a given family is doing laundry at any point in time is 4/168 (4 hours of laundry divided by the 168 hours in a week).
Then the theoretical probability of at least 2 families doing laundry at the same time is
sum(dbinom(2:6,6,4/168))
which gives about 0.8%.
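As a cross-check, the same tail probability can be computed from the complement with pbinom. This is only a sketch under the answer's assumptions (households act independently, and each is busy 4 out of 168 hours); the variable names are illustrative.

```r
n_households <- 6
p_busy <- 4 / 168  # assumed chance a household is doing laundry at a given moment

# P(at least 2 busy) by summing the binomial probabilities for 2..6
tail_sum <- sum(dbinom(2:n_households, size = n_households, prob = p_busy))

# Same quantity via the upper tail: P(X > 1)
tail_comp <- pbinom(1, size = n_households, prob = p_busy, lower.tail = FALSE)

all.equal(tail_sum, tail_comp)
round(100 * tail_sum, 2)  # expressed as a percentage
```

Note that this models a snapshot at a single moment in time, which is a different quantity from the one in the question's simulation (whether any two laundry sessions overlap during the week).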
As for the simulation, let's create a matrix with 6 columns (one per family) and 10K rows (the number of simulations). Then let's fill the matrix with 1 (doing laundry) and 0 (not), where prob gives the probabilities of not doing and doing laundry at any point in time.
Running this code, I get a probability of about 0.7% for 2 or more families doing laundry at the same time:
mat<-replicate(6,sample(0:1,size = 10000,replace=T,prob = c(164/168,4/168)))
table(rowSums(mat))

Confusing p values with ANOVA on a big dataframe

I am trying to analyse the significant differences between different car company performance values across different countries. I am using ANOVA to do this.
Running ANOVA on my real dataset (30 countries, 1000 car companies and 90000 measurement scores) gave every car a zero p-value.
Confused by this, I created a reproducible example (below) with 30 groups, 3 car companies and 90000 random scores. I purposely kept a constant score of 1 for the Benz company, so that you shouldn't see any difference between countries. After running ANOVA, I see a p-value of 0.46 instead of 1.
Does anyone know why this is?
Reproducible example
set.seed(100000)
qqq <- 90000
df = data.frame(id = c(1:90000), country = c(rep("usa",3000), rep("usb",3000), rep("usc",3000), rep("usd",3000), rep("use",3000), rep("usf",3000), rep("usg",3000), rep("ush",3000), rep("usi",3000), rep("usj",3000), rep("usk",3000), rep("usl",3000), rep("usm",3000), rep("usn",3000), rep("uso",3000), rep("usp",3000), rep("usq",3000), rep("usr",3000), rep("uss",3000), rep("ust",3000), rep("usu",3000), rep("usv",3000), rep("usw",3000), rep("usx",3000), rep("usy",3000), rep("usz",3000), rep("usaa",3000), rep("usab",3000), rep("usac",3000), rep("usad",3000)), tesla=runif(90000), bmw=runif(90000), benz=rep(1, each=qqq))
str(df)
out<-data.frame()
for(j in 3:ncol(df)){
amod2 <- aov(df[,j]~df$country)
out[(j-2),1]<-colnames(df)[j]
out[(j-2),2]<-summary(amod2, test = adjusted("bonferroni"))[[1]][[1,"Pr(>F)"]]
}
colnames(out)<-c("cars","pvalue")
write.table(out,"df.output")
df.output
"cars" "pvalue"
"1" "tesla" 0.245931589754359
"2" "bmw" 0.382730335188437
"3" "benz" 0.465083026215268
With respect to the "benz" p-value in your reproducible example: an ANOVA requires positive variance (i.e., non-constant data). If you violate this assumption, the model is degenerate. Technically, the p-value is based on an F-statistic whose value is a normalized ratio of the variance attributable to the "country" effect (for "benz" in your example, zero) divided by the residual (within-group) variance (for "benz" in your example, also zero), so your F-statistic has "value" 0/0 or NaN.
Because of the approach R takes to calculating the F-statistic (using a QR matrix decomposition to improve numerical stability in "nearly" degenerate cases), it calculates an F-statistic equal to 1 (w/ 29 and 89970 degrees of freedom). This gives a p-value of:
> pf(1, 29, 89970, lower=FALSE)
[1] 0.465083
>
but it is, of course, largely meaningless.
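The degrees of freedom in that call follow from the dimensions of the reproducible example, which can be sketched as follows (the variable names are illustrative, not from the original code):

```r
n_groups <- 30     # countries in the reproducible example
n_obs    <- 90000  # total number of observations

df_between <- n_groups - 1      # 29
df_within  <- n_obs - n_groups  # 89970

# Upper-tail probability of F = 1 under the F(29, 89970) distribution
p_val <- pf(1, df_between, df_within, lower.tail = FALSE)
p_val
```

This reproduces the 0.465 reported for "benz", which supports the explanation that R ends up with an F-statistic of 1 for the constant column.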
With respect to your original problem, with large datasets relatively small effects will yield very small p-values. For example, if you add the following after your df definition above to introduce a difference in country usa:
df = within(df, {
o = country=="usa"
tesla[o] = tesla[o] + .1
bmw[o] = bmw[o] + .1
benz[o] = benz[o] + .1
rm(o)
})
you will find that out looks like this:
> out
cars pvalue
1 tesla 9.922166e-74
2 bmw 5.143542e-74
3 benz 0.000000e+00
>
Is this what you're seeing, or are you seeing all of them exactly zero?
