I am trying to simulate a self-made probability problem: "Suppose there are 6 households living in a unit of an apartment complex. On average, a single household does laundry twice a week for 2 hours each time. Find the probability that any two households are doing laundry at the same time."
So far I have only been able to simulate the case where a single household does laundry ONCE a week (R code below), and I would appreciate any help extending the code to the TWICE-a-week scenario.
I also attempted to find a theoretical solution, but it did not match my simulation results below. Any help is appreciated. Thanks!
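dist.min <- function(x) {
  # TRUE if the two closest start times (in seconds) are less than 2 hours apart
  min(dist(x)) <= 2 * 3600 - 1
}
set.seed(12345)
N <- 100000
# one row per simulated week, one column per household's start time in seconds
mat <- matrix(sample(1:(24 * 60 * 60 * 7), N * 6, replace = TRUE), ncol = 6)
is.same <- apply(mat, 1, dist.min)
mean(is.same) # 0.30602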
Hi, if I understood the problem correctly, I would take the following approach.
This is a binomial distribution with n = 6 families, where the probability p that a given family is doing laundry at any given moment is 4/168: 4 hours of laundry per week divided by the 168 hours in a week.
The theoretical probability that at least 2 families are doing laundry at the same moment is then
sum(dbinom(2:6,6,4/168))
which gives about 0.8% (0.00798).
For the simulation, let's create a matrix with 6 columns (one per family) and 10K rows (one per simulation run). We fill the matrix with 1 (doing laundry) and 0 (not), where prob gives the probabilities of not doing and of doing laundry at any point in time.
Running this code, I get a probability of about 0.7% that 2 or more families are doing laundry at the same time.
mat<-replicate(6,sample(0:1,size = 10000,replace=T,prob = c(164/168,4/168)))
table(rowSums(mat))
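The question also asks how to extend the once-a-week simulation to two loads per household. A minimal sketch of one way to do that is below. It rests on several assumptions that are not in the original posts: start times are uniform over the week, a household's own two loads are allowed to overlap each other, sessions do not wrap around the end of the week, and the function name simulate_overlap and its arguments are made up for illustration. With loads = 1 the estimate should land near the 0.306 reported in the question, which gives a rough sanity check.

```r
# Estimate the probability that sessions of two DIFFERENT households overlap.
# All names and defaults here are illustrative assumptions.
simulate_overlap <- function(n_sims = 20000, households = 6, loads = 2,
                             duration = 2 * 3600, week = 7 * 24 * 3600) {
  # owner[i] is the household that session i belongs to
  owner <- rep(seq_len(households), each = loads)
  # TRUE where two sessions belong to different households
  different <- outer(owner, owner, "!=")
  hits <- replicate(n_sims, {
    start <- runif(households * loads, 0, week)  # start times in seconds
    gap <- abs(outer(start, start, "-"))         # pairwise start-time gaps
    any(gap < duration & different)              # any cross-household overlap?
  })
  mean(hits)
}

set.seed(12345)
simulate_overlap(loads = 1)  # once a week
simulate_overlap(loads = 2)  # twice a week
```

Unlike the binomial calculation above, which looks at a single moment in time, this counts an overlap anywhere in the week, so the two numbers answer different questions.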
I got a question related to probability theory and I tried to solve it by simulating it in R. However, I ran into a problem as the while loop does not seem to break.
The question is asking: How many people are needed such that there is at least a 70% chance that one of them is born on the last day of December?
Here is my code:
prob <- 0
people <- 1
while (prob <= 0.7) {
people <- people + 1 #start the iteration with 2 people in the room and increase 1 for every iteration
birthday <- sample(365, size = people, replace = TRUE)
prob <- length(which(birthday == 365)) / people
}
return(prob)
My guess is that it could never hit 70%, therefore the while loop never breaks, am I right? If so, did I interpret the question wrongly?
I did not want to post this on stats.stackexchange.com because I thought this is more related to code rather than math itself, but I will move it if necessary, thanks.
This is a case where an analytical solution based on probability is easier and more accurate than trying to simulate. I agree with Harshvardhan that your formulation is solving the wrong problem.
The probability of having at least one person in a pool of n have their birthday on a particular target date is 1-P{all n miss the target date}. This probability is at least 0.7 when P{all n miss the target date} < 0.3. The probability of each individual missing the target is assumed to be P{miss} = 1-1/365 (365 days per year, all birthdates equally likely). If the individual birthdays are independent, then P{all n miss the target date} = P{miss}^n.
I am not an R programmer, but the following Ruby should translate pretty easily:
# Use rationals to avoid cumulative float errors.
# Makes it slower but accurate.
P_MISS_TARGET = 1 - 1/365r
p_all_miss = P_MISS_TARGET
threshold = 3r / 10 # seeking P{all miss target} < 0.3
n = 1
while p_all_miss > threshold
  p_all_miss *= P_MISS_TARGET
  n += 1
end
puts "With #{n} people, the probability all miss is #{p_all_miss.to_f}"
which produces:
With 439 people, the probability all miss is 0.29987476838793214
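For readers working in R, the same calculation can be sketched in closed form: P{miss}^n <= 0.3 is equivalent to n >= log(0.3) / log(P{miss}), so no loop is needed. The variable names p_miss and n_needed are illustrative, and the sketch keeps the assumptions above (365 equally likely days, independent birthdays).

```r
# Probability that one person misses the target date
p_miss <- 1 - 1 / 365

# Smallest n with p_miss^n <= 0.3, i.e. P{at least one hit} >= 0.7
n_needed <- ceiling(log(0.3) / log(p_miss))
n_needed

# Probability of at least one hit with n_needed and with one person fewer
1 - p_miss^n_needed
1 - p_miss^(n_needed - 1)
```

This uses floating point rather than rationals, which is accurate enough here because the threshold is not hit exactly.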
Addendum
I got curious, since my answer differs from the accepted one, so I wrote a small simulation. Again, I think it's straightforward enough to understand even though it's not in R:
require 'quickstats' # Stats "gem" available from rubygems.org
def trial
  n = 1
  # Keep adding people to the count until one of them hits the target
  n += 1 while rand(1..365) != 365
  return n
end
def quantile(percentile = 0.7, number_of_trials = 1_000)
  # Create an array containing results from specified number of trials.
  # Defaults to 1000 trials
  counts = Array.new(number_of_trials) { trial }
  # Sort the array and determine the empirical target percentile.
  # Defaults to 70th percentile
  return counts.sort[(percentile * number_of_trials).to_i]
end
# Tally the statistics of 100 quantiles and report results,
# including margin of error, formatted to 3 decimal places.
stats = QuickStats.new
100.times { stats.new_obs(quantile) }
puts "#{"%.3f" % stats.avg}+/-#{"%.3f" % (1.96*stats.std_err)}"
Five runs produce outputs such as:
440.120+/-3.336
440.650+/-3.495
435.820+/-3.558
439.500+/-3.738
442.290+/-3.909
which is strongly consistent with the analytical result derived earlier and seems to differ significantly from the other responders' answers.
Note that on my machine the simulation takes roughly 40 times longer than the analytical calculation, is more complex, and introduces uncertainty. To increase the precision you would need larger sample sizes, and thus longer run times. Given these considerations, I would reiterate my advice to go for the direct solution in this case.
Indeed, your probability will (almost) never reach 0.7, because prob is the share of people born on day 365, and that share stays close to 1/365 as people grows. Only in the very first iterations, with a handful of people, could it exceed 0.7 by chance.
Furthermore, to calculate a probability for a given number of persons, you should draw many samples and then calculate the probability. Here is a way to achieve that:
N = 450              # max. number of people to try
probs = numeric(N)   # vector to store the estimated probabilities
# try every group size in 1:N
for (people in 1:N) {
  # draw 200 samples to estimate the probability
  samples = 200
  successes = 0
  for (i in 1:samples) {
    birthday <- sample(365, size = people, replace = TRUE)
    total_last_day <- sum(birthday == 365)
    if (total_last_day >= 1) {
      successes <- successes + 1
    }
  }
  # store the estimated probability
  probs[people] = successes / samples
}
# group sizes that achieved an estimated probability > 0.7
which(probs > 0.7)
As this is a simulation, the result varies from run to run. Increasing the number of samples per group size would make the result more stable.
You are solving the wrong problem. The question is, "How many people are needed such that there is at least a 70% chance that one of them is born on the last day of December?". What you are finding now is "How many people are needed such that 70% have their birthdays on the last day of December?". The answer to the second question is close to zero. But the first one is much simpler.
Replace prob <- length(which(birthday == 365)) / people with check = any(birthday == 365) in your logic because at least one of them has to be born on Dec 31. Then, you will be able to find if that number of people will have at least one person born on Dec 31.
After that, you will have to rerun the simulation multiple times to generate empirical probability distribution (kind of Monte Carlo). Only then you can check for probability.
Simulation Code
people_count = function(i)
{
  set.seed(i)
  for (people in 1:10000)
  {
    birthday = sample(365, size = people, replace = TRUE)
    check = any(birthday == 365)
    if(check == TRUE)
    {
      pf = people
      break
    }
  }
  return(pf)
}
people_count() function returns the number of people required to have so that at least one of them was born on Dec 31. Then I rerun the simulation 10,000 times.
# Number of simulations
nsim = 10000
# base R equivalent of lapply(...) %>% unlist(), so no pipe package is needed
l = unlist(lapply(1:nsim, people_count))
Let's see the distribution of the number of people required.
To find actual probability, I'll use cumsum().
> cdf = cumsum(l/nsim)
> which(cdf>0.7)[1]
[1] 292
So, on average, you would need 292 people to have more than a 70% chance.
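As a cross-check on this step: cumsum(l/nsim) adds up the simulated counts themselves, so it is not an empirical CDF of l. A minimal sketch of reading off the 70th percentile directly is below. It assumes the number of people needed until the first Dec 31 birthday is geometric with success probability 1/365 and uses rgeom() as a stand-in for that count; the names first_hit, n_runs and q70 are illustrative.

```r
set.seed(42)
n_runs <- 100000

# rgeom() counts failures before the first success, so add 1 to count people
first_hit <- rgeom(n_runs, prob = 1 / 365) + 1

# Smallest group size at which the empirical CDF reaches 0.7
q70 <- unname(quantile(first_hit, probs = 0.7, type = 1))
q70

# Share of runs in which at most q70 people were needed
mean(first_hit <= q70)
```

Under these assumptions the percentile should come out in the neighbourhood of the analytical 439 rather than 292, up to simulation noise.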
In addition to @pjs's answer, I would like to provide one myself, written in R. I attempted to solve this question by simulation rather than analytically, and I am sharing it in case it helps someone else with the same problem. It's not that well written, but the idea is there:
# create a function which will find if anyone is born on last day
last_day <- function(x){
  birthdays <- sample(365, size = x, replace = TRUE) # randomly draw everyone's birthday
  # TRUE if at least one person was born on the last day
  if (length(which(birthdays == 365)) >= 1) {
    TRUE
  } else {
    FALSE
  }
}
# find out how many people are needed to reach 70%
people <- 0 # set number of people to zero
prob <- 0 # set prob to zero
while (prob <= 0.7) { # loop does not stop until prob exceeds 70%
  people <- people + 1 # increase the number of people every iteration
  prob <- mean(replicate(10000, last_day(people))) # run last_day 10000 times and take the mean as the probability
}
print(people)
last_day() only returns TRUE or FALSE. So I run last_day() 10000 times on every iteration of the loop to find out how often, out of 10000 runs, one or more people are born on the last day (this gives the probability). I then keep the loop running until the probability exceeds 70%, and print the number of people.
The answer I get from running the loop once is 440, which is quite close to the answer provided by @pjs.
I am trying to develop code that will tell me the likelihood of rolling at least one six when using 1 through 20 dice. I am specifically trying to build a single piece of code that loops through the problem space and generates this information. The question has left me at a loss.
I have tried using the sample function and looked at contingency tables.
n = 10000 # number of simulated rolls (assumed value)
die1 = sample(1:6,n,replace=T)
die2 = sample(1:6,n,replace=T)
sum_of_dice = die1 + die2
counts = table(sum_of_dice)
proba_empiric = counts/sum(counts)
barplot(proba_empiric)
The above provides the basis for a probability but not for the joint probability of two dice.
The final code should be able to tell me the likelihood of rolling a six with 1 die, 2 dice, 3 dice, all the way to twenty dice.
One way to simulate the probability of rolling at least one 6 using 1 to 20 dice is to use rbinom():
sapply(1:20, function(x) mean(rbinom(10000, x, 1/6) > 0))
[1] 0.1675 0.3008 0.4174 0.5176 0.5982 0.6700 0.7157 0.7704 0.8001 0.8345 0.8643 0.8916 0.9094 0.9220 0.9310
[16] 0.9471 0.9547 0.9623 0.9697 0.9718
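The simulated values can be compared with the exact ones. A short sketch, assuming fair and independent dice: the chance of no six on n dice is (5/6)^n, so the chance of at least one six is 1 - (5/6)^n. The variable names are illustrative.

```r
# Exact probability of at least one six when rolling 1, 2, ..., 20 fair dice
n_dice <- 1:20
p_at_least_one_six <- 1 - (5 / 6)^n_dice
round(p_at_least_one_six, 4)
```

Differences between these values and the rbinom() estimates come from simulation noise and shrink as the number of simulated rolls grows.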
If I am understanding you correctly, you have 20 dice and you want to know the probability of at least one six appearing among them.
We can write a function to roll one die
roll_die <- function() sample(6, 1)
Then write another function which rolls 20 dice and checks if there is at least one six among them
roll_20_die <- function() {
  any(replicate(20, roll_die()) == 6)
}
and replicate this function a sufficient number of times to estimate the probability
n <- 10000
table(replicate(n, roll_20_die()))/n
# FALSE TRUE
#0.0244 0.9756
I'm new to R, but I am trying to use it to aggregate losses that are observed from a severity distribution by an observation from a frequency distribution, which is essentially what rcompound does. However, I need a more granular approach, as I need to manipulate the severity distribution before 'aggregation'.
Let's take an example. Suppose you have:
rpois(10,lambda=3)
Thereby, giving you something like:
[1] 2 2 3 5 2 5 6 4 3 1
Additionally, suppose we have severity of losses determined by:
rgamma(20,shape=1,scale=10000)
So that we also have the following output:
[1] 233.0257 849.5771 7760.4402 731.5646 8982.7640 24172.2369 30824.8424 22622.8826 27646.5168 1638.2333 6770.9010 2459.3722 782.0580 16956.1417 1145.4368 5029.0473 3485.6412 4668.1921 5637.8359 18672.0568
My question is: what is an efficient way to get R to take each Poisson observation in turn and then aggregate losses from my severity distribution? For example, the first Poisson observation is 2. Therefore, adding two observations (the first two) from my Gamma distribution gives 1082.60.
I say this needs to be 'efficient' (in run time) because:
- The Poisson parameter may become significantly large, i.e. up to 1000 or so.
- The realisations are likely to be up to 1,000,000, i.e. up to a million Poisson and Gamma observations to sort through.
Any help would be greatly appreciated.
Thanks, Dave.
It looks like you want to split the gamma vector at positions indicated by the accumulation of the poisson vector.
The following function (from here) does the splitting:
splitAt <- function(x, pos) unname(split(x, cumsum(seq_along(x) %in% pos)))
pois <- c(2, 2, 3, 5, 2, 5, 6, 4, 3, 1)
gam <- c(233.0257, 849.5771, 7760.4402, 731.5646, 8982.7640, 24172.2369, 30824.8424, 22622.8826, 27646.5168, 1638.2333, 6770.9010, 2459.3722, 782.0580, 16956.1417, 1145.4368, 5029.0473, 3485.6412, 4668.1921, 5637.8359, 18672.0568)
posits <- cumsum(pois)
Then do the following:
sapply(splitAt(gam, posits + 1), sum)
[1] 1082.603 8492.005 63979.843 61137.906 17738.200 19966.153 18672.057
According to the post I linked to above, the splitAt() function slows down for large arrays, so you could (if necessary) consider the alternatives proposed in that post. For my part, I generated 1e6 poissons and 1e6 gammas, and the above function ran in 0.78 sec on my machine.
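An alternative sketch that avoids computing split positions is to label every severity draw with the index of the frequency observation it belongs to and then sum within labels. The helper name aggregate_losses is hypothetical, and the sketch assumes the severity vector holds exactly sum(freq) draws. It uses the first six Poisson counts and the first 19 Gamma values from the example above, so the result can be compared with the first six sums shown there.

```r
# Hypothetical helper: sum consecutive severity draws according to the counts
aggregate_losses <- function(freq, sev) {
  stopifnot(sum(freq) == length(sev))
  # Label each severity draw with its frequency observation; keeping all
  # levels means a count of zero yields an empty group with sum 0
  grp <- factor(rep(seq_along(freq), times = freq), levels = seq_along(freq))
  unname(vapply(split(sev, grp), sum, numeric(1)))
}

pois <- c(2, 2, 3, 5, 2, 5)
gam <- c(233.0257, 849.5771, 7760.4402, 731.5646, 8982.7640, 24172.2369,
         30824.8424, 22622.8826, 27646.5168, 1638.2333, 6770.9010, 2459.3722,
         782.0580, 16956.1417, 1145.4368, 5029.0473, 3485.6412, 4668.1921,
         5637.8359)
aggregate_losses(pois, gam)
```

In practice the severities could be generated with rgamma(sum(freq), ...) after drawing freq, so that the lengths always match. Whether this is faster than splitAt() on a million observations would need to be timed.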
At a car hire service, 50% of cars are returned on time. A sample of 20 car hires is studied. Is this the correct R invocation of dbinom to calculate the probability that all 20 cars are returned on time?
dbinom(x=20, size=20, prob=0.5)
Yes.
We can check against what we know the answer to be (0.5^20, since 20 choose 20 is 1 and (0.5^20)*(0.5^(20-20)) = 0.5^20):
dbinom(x=20, size=20, prob=0.5)
# [1] 9.536743e-07
0.5^20
# [1] 9.536743e-07
From help("dbinom"):
x, q vector of quantiles.
...
size number of trials (zero or more).
prob probability of success on each trial.
So here, x is our quantile (what is the probability there were 20 successes?), size is our number of trials (a sample of 20), and prob is the probability of success in each one (there is a 1/2 chance that each car is returned on time).
I am currently working with a raster comprised of 1,750,000 data points of a storm taken last winter. I am using the pracma::findpeaks() function in an effort to quantify and analyze periods of the storm. Every hour of the storm occurs over 90,000 data points, and I would like to get an hour by hour analysis. Over some hour intervals, the function works perfectly:
findpeaks(winddf$s1[1609931:1699931], nups = 3, ndowns = 3, minpeakheight = 10.79, minpeakdistance = 5)
returns 110 peaks over this interval with these parameters for a peak
However, over another 90,000 count interval I get this error message after I run this:
findpeaks(winddf$s1[179133:269132], nups = 3, ndowns = 3, minpeakheight = 8.84, minpeakdistance = 5)
Error in xp[i] <- which.max(x[x1[i]:x2[i]]) + x1[i] - 1 :
replacement has length zero
The only changes made were the threshold minpeakheight and the interval I am viewing. The function works for all 3000 count intervals, some 8000 count intervals, and a few 15,000 count intervals, but I would much rather perform 20 analyses of the 20 hours over 90,000 count intervals of the storm than perform 600 analyses using 3000 count intervals. I cannot give the complete code or data, as the data file is too large. Thank you.
I had the same issue with the findpeaks function. I found out that I had some non-numerical values in my vector that were producing NAs. I converted all NAs to zero with the line of code below and this resolved the issue.
dt$X5[is.na(dt$X5)] <- 0
Where dt is a dataframe and X5 is the vector you are plugging into the findpeaks function.