Need help interpreting this code - r

So this code was handed out in our school, its one of many examples (model of rolling a fair dice).
x<-runif(1)
y<-as.double(x<=c(1/6,2/6,3/6,4/6,5/6,1))*(1:6)
x<-min(y[y>0])
Im having trouble understanding the relation of this code and rolling a dice.
So the first line generates 1 randomly uniform distributed number x between 0 and 1.
In the second line we put a condition into x: If its less than 1 of the components of the vector (1/6,2/6,3/6,4/6,5/6,1) we get a TRUE=1 , else FALSE=0.
And then this result is multiplied by the vector (1,2,3,4,5,6).
Lastly we take the minimum value of that vector product (has to be greater than zero).
I cant get the intuition behind this. Would someone here mind to explain the relation of this code to rolling a dice in real life. Im confused..

So for rolling a dice each number has the same probability of 1/6 to appear.
Now what is done here, is to simulate rolling a dice.
Therefore in the first line a random number between 0 and 1 is generated.
The intervals it is compared to are all equally sized and they have a length of 1/6.
There for, for x to lie in one of these intervals the probability is again 1/6.
So what is done then in the third line, is to look up in which interval x has fallen.
Lets do an example:
Supposed x is 0.25.
Then the vector of the second line would look like this:
FALSE, TRUE, TRUE, TRUE , TRUE, TRUE
With the multiplication you get:
0, 2, 3, 4, 5, 6
Therefore at the end x is equal to 2.
So at the end x is supposed to be the number the dice is showing

Benjamin is right basically your code is saying
If runif(1) = 0 to 1/6 = die roll = 1
runif(1) = 1/6 to 2/6 = die roll = 2
runif(1) = 2/6 to 3/6 = die roll = 3
etc
Because the second line returns a vector of results the x<-min(y[y>0]) just returns the first positive which is your die roll

Related

Having trouble solving simulation

I got a question related to probability theory and I tried to solve it by simulating it in R. However, I ran into a problem as the while loop does not seem to break.
The question is asking: How many people are needed such that there is at least a 70% chance that one of them is born on the last day of December?
Here is my code:
prob <- 0
people <- 1
while (prob <= 0.7) {
people <- people + 1 #start the iteration with 2 people in the room and increase 1 for every iteration
birthday <- sample(365, size = people, replace = TRUE)
prob <- length(which(birthday == 365)) / people
}
return(prob)
My guess is that it could never hit 70%, therefore the while loop never breaks, am I right? If so, did I interpret the question wrongly?
I did not want to post this on stats.stackexchange.com because I thought this is more related to code rather than math itself, but I will move it if necessary, thanks.
This is a case where an analytical solution based on probability is easier and more accurate than trying to simulate. I agree with Harshvardhan that your formulation is solving the wrong problem.
The probability of having at least one person in a pool of n have their birthday on a particular target date is 1-P{all n miss the target date}. This probability is at least 0.7 when P{all n miss the target date} < 0.3. The probability of each individual missing the target is assumed to be P{miss} = 1-1/365 (365 days per year, all birthdates equally likely). If the individual birthdays are independent, then P{all n miss the target date} = P{miss}^n.
I am not an R programmer, but the following Ruby should translate pretty easily:
# Use rationals to avoid cumulative float errors.
# Makes it slower but accurate.
P_MISS_TARGET = 1 - 1/365r
p_all_miss = P_MISS_TARGET
threshold = 3r / 10 # seeking P{all miss target} < 0.3
n = 1
while p_all_miss > threshold
p_all_miss *= P_MISS_TARGET
n += 1
end
puts "With #{n} people, the probability all miss is #{p_all_miss.to_f}"
which produces:
With 439 people, the probability all miss is 0.29987476838793214
Addendum
I got curious, since my answer differs from the accepted one, so I wrote a small simulation. Again, I think it's straightforward enough to understand even though it's not in R:
require 'quickstats' # Stats "gem" available from rubygems.org
def trial
n = 1
# Keep adding people to the count until one of them hits the target
n += 1 while rand(1..365) != 365
return n
end
def quantile(percentile = 0.7, number_of_trials = 1_000)
# Create an array containing results from specified number of trials.
# Defaults to 1000 trials
counts = Array.new(number_of_trials) { trial }
# Sort the array and determine the empirical target percentile.
# Defaults to 70th percentile
return counts.sort[(percentile * number_of_trials).to_i]
end
# Tally the statistics of 100 quantiles and report results,
# including margin of error, formatted to 3 decimal places.
stats = QuickStats.new
100.times { stats.new_obs(quantile) }
puts "#{"%.3f" % stats.avg}+/-#{"%.3f" % (1.96*stats.std_err)}"
Five runs produce outputs such as:
440.120+/-3.336
440.650+/-3.495
435.820+/-3.558
439.500+/-3.738
442.290+/-3.909
which is strongly consistent with the analytical result derived earlier and seems to differ significantly from other responder's answers.
Note that on my machine the simulation takes roughly 40 times longer than the analytical calculation, is more complex, and introduces uncertainty. To increase the precision you would need larger sample sizes, and thus longer run times. Given these considerations, I would reiterate my advice to go for the direct solution in this case.
Indeed, your probability will (almost) never reach 0.7, because you hardly will hit the point where exactly 1 person has got birthday = 365. When people gets larger, there will be more people having a birthday = 365, and the probability for exactly 1 person will decrease.
Furthermore, to calculate a probability for a given number of persons, you should draw many samples and then calculate the probability. Here is a way to achieve that:
N = 450 # max. number of peoples being tried
probs = array(numeric(), N) # empty array to store found probabilities
# try for all people numbers in range 1:N
for(people in 1:N){
# do 200 samples to calculate prop
samples = 200
successes = 0
for(i in 1:samples){
birthday <- sample(365, size = people, replace = TRUE)
total_last_day <- sum(birthday == 365)
if(total_last_day >= 1){
successes <- successes + 1
}
}
# store found prop in array
probs[people] = successes/samples
}
# output of those people numbers that achieved a probability of > 0.7
which(probs>0.7)
As this is a simulation, the result depends on the run. Increasing the sample rate would make the result more stable.
You are solving the wrong problem. The question is, "How many people are needed such that there is at least a 70% chance that one of them is born on the last day of December?". What you are finding now is "How many people are needed such that 70% have their birthdays on the last day of December?". The answer to the second question is close to zero. But the first one is much simpler.
Replace prob <- length(which(birthday == 365)) / people with check = any(birthday == 365) in your logic because at least one of them has to be born on Dec 31. Then, you will be able to find if that number of people will have at least one person born on Dec 31.
After that, you will have to rerun the simulation multiple times to generate empirical probability distribution (kind of Monte Carlo). Only then you can check for probability.
Simulation Code
people_count = function(i)
{
set.seed(i)
for (people in 1:10000)
{
birthday = sample(365, size = people, replace = TRUE)
check = any(birthday == 365)
if(check == TRUE)
{
pf = people
break
}
}
return(pf)
}
people_count() function returns the number of people required to have so that at least one of them was born on Dec 31. Then I rerun the simulation 10,000 times.
# Number of simulations
nsim = 10000
l = lapply(1:nsim, people_count) %>%
unlist()
Let's see the distribution of the number of people required.
To find actual probability, I'll use cumsum().
> cdf = cumsum(l/nsim)
> which(cdf>0.7)[1]
[1] 292
So, on average, you would need 292 people to have more than a 70% chance.
In addition to #pjs answer, I would like to provide one myself, written in R. I attempted to solve this question by simulation rather than an analytical approach, and I am sharing it in case it is helpful for someone else who also has the same problem. Its not that well written but the idea is there:
# create a function which will find if anyone is born on last day
last_day <- function(x){
birthdays <- sample(365, size = x, replace = TRUE) #randomly get everyone's birthdays
if(length(which(birthdays == 365)) >= 1) {
TRUE #find amount of people born on last day and return true if >1
} else {
FALSE
}
}
# find out how many people needed to get 70%
people <- 0 #set number of people to zero
prob <- 0 #set prob to zero
while (prob <= 0.7) { #loop does not stop until it hits 70%
people <- people + 1 #increase the number of people every iteration
prob <- mean(replicate(10000, last_day(people))) #run last_day 10000 times to find the mean of probability
}
print(no_of_people)
last_day() only return TRUE or FALSE. So I run last_day() 10000 times in the loop for every iteration to find out, out of 10000 times, how many times does it have one or more people born on the last day (This will give the probability). I then keep the loop running until the probability is 70% or more, then print the number of people.
The answer I get from running the loop once is 440 which is quite close to the answer provided by #pjs.

How to understand the result of Discrete Fourier Transform under period finding?

I am learning how to use Discrete Fourier Transform(DFT) to find the period about a^x mod(N), in which x is a positive integer, a is any prime number, and N is the product of two prime factors p and q.
For example, the period of 2^x mod(15) is 4,
>>> for x in range(8):
... print(2**x % 15)
...
Output: 1 2 4 8 1 2 4 8
^-- the next period
and the result of DFT is as following,
(cited from O'Reilly Programming Quantum Computers chapter 12)
There are 4 spikes with 4-unit spacing, and I think the latter 4 means that period is 4.
But, when N is 35 and period is 12
>>> for x in range(16):
... print(2**x % 35)
...
Output: 1 2 4 8 16 32 29 23 11 22 9 18 1 2 4 8
^-- the next period
In this case, there are 8 spikes greater than 100, whose locations are 0, 5, 6, 11, 32, 53, 58, 59, respectively.
Does the location sequence imply the magic number 12? And how to understand "12 evenly spaced spikes" from the righthand graph?
(cited from O'Reilly Programming Quantum Computers chapter 12)
see How to compute Discrete Fourier Transform? and all the sublinks especially How do I obtain the frequencies of each value in an FFT?.
As you can see i-th element of DFT result (counting from 0 to n-1 including) represent Niquist frequency
f(i) = i * fsampling / n
And DFT result uses only those sinusoidal frequencies. So if your signal does have different one (even slightly different frequency or shape) aliasing occurs.
Aliased sinusoid creates 2 frequencies in DFT output one higher and one lower frequency.
Any sharp edge is translated to many frequencies (usually continuous spectrum like your last example)
The f(0) is no frequency and represents DC offset.
On top of all this if the input of your DFT is real domain then the DFT result is symmetric meaning you can use only first half of the result as the second is just mirror image (not including the f(0)) which makes sense as you can not represent bigger than fsampling/2 frequency in real domain data.
Conclusion:
You can not obtain frequency of signal used by DFT as there is infinite number of ways how such signal can be computed. DFT is reconstructing the signal using sinwaves and your signal is definately no sinwave so the results will not match what you think.
Matching niquist frequencies to yours is done by correctly chosing the n for DFT however without knowing the frequency ahead you can not do this ...
It may be possible to compute the singular sinwave frequency from its 2 aliases however your signal is no sinwave so that is not applicable for your case anyway.
I would use different approaches to determine frequency of integer numeric signal:
compute histogram of signal
so count how many of each number there is
test possible frequencies
You can brute force all possible periods of signal and test if consequent periods are the same however for big data is this not optimal...
We can use histogram to speed this up. So if you look at the counts cnt(ix) from histogram for periodic signal of frequency f and period T in data of size n then the period of signal should be a common divider of all the counts
T = n/f
k*f = GCD(all non zero cnt[i])
where k divides the GCD result. However in case n is not exact multiple of T or the signal has noise or slight deviations in it this will not work. However we can at least estimate the GCD and test all frequencies around which will be still faster than brute force.
So for each count (not accounting for noise) it should comply this:
cnt(ix) = ~ n/(f*k)
k = { 1,2,3,4,...,n/f}
so:
f = ~ n/(cnt(ix)*k)
so if you got signal like this:
1,1,1,2,2,2,2,3,3,1,1,1,2,2,2,2,3,3,1
then histogram would be cnt[]={0,7,8,4,0,0,0,0,...} and n=19 so computing f in periods per n for each used element leads to:
f(ix) = n/(cnt(ix)*k)
f(1) = 19/(7*k) = ~ 2.714/k
f(2) = 19/(8*k) = ~ 2.375/k
f(3) = 19/(4*k) = ~ 4.750/k
Now the real frequency should be a common divider (CD) of results so taking biggest and smallest counts rounded up and down (ignoring noise) leads to these options:
f = CD(2,4) = 2
f = CD(3,4) = none
f = CD(2,5) = none
f = CD(3,5) = none
so now test frequency (luckily its just one valid in this case) 2 periods per 19 samples meaning T = ~ 9.5 so test rounded up and down ...
signal(t+ 0)=1,1,1,2,2,2,2,3,3,1,1,1,2,2,2,2,3,3,1
signal(t+ 9)=1,1,1,2,2,2,2,3,3,1 // check 9 elements
signal(t+10)=1,1,2,2,2,2,3,3,1,? // check 10 elements
As you can see signal(t...t+9)==signal(t+9...t+9+9) meaning the period is T=9.

Confusion Between 'sample' and 'rbinom' in R

Why are these not equivalent?
#First generate 10 numbers between 0 and .5
set.seed(1)
x <- runif(10, 0, .5)
These are the two statements I'm confused by:
#First
sample(rep(c(0,1), length(x)), size = 10, prob = c(rbind(1-x,x)), replace = F)
#Second
rbinom(length(x), size = 1, prob=x)
I was originally trying to use 'sample'. What I thought I was doing was generating ten (0,1) pairs, then assigning the probability that each would return either a 0 or a 1.
The second one works and gives me the output I need (trying to run a sim). So I've been able to solve my problem. I'm just curious as to what's going on under the hood with 'sample' so that I can understand R better.
The first area of difference is the location of the length of the vector specification in the parameter list. The names size have different meanings in these two functions. (I hadn't thought about that source of confusion before, and I'm sure I have made this error myself many times.)
The random number generators (starting with r and having a distribution suffix) have that choice as the first parameter, whereas sample has it as the second parameter. So the length of the second one is 10 and the length of the first is 1. In sample the draw is from the values in the first argument, while 'size' is the length of the vector to create. In the rbinom function, n is the length of the vector to create, while size is the number of items to hypothetically draw from a theoretical urn having a distribution determined by 'prob'. The result returned is the number of "ones". Try:
rbinom(length(x), size = 10, prob=x)
Regarding the argument to prob: I don't think you need the c().
The difference between the two function is quite simple.
Think of a pack of shuffled cards, and choose a number of cards from it. That is exactly the situation that sample simulates.
This code,
> set.seed(123)
> sample(1:40, 5)
[1] 12 31 16 33 34
randomly extract five numbers from the 1:40 vector of numbers.
In your example, you set size = 1. It means you choose only one element from the pool of possible values. If you set size = 10 you will get ten values as you desire.
set.seed(1)
x <- runif(10, 0, .5)
> sample(rep(c(0,1), length(x)), size = 10, prob = c(rbind(1-x,x)), replace = F)
[1] 0 0 0 0 0 0 0 1 0 1
Instead, the goal of the rbinom function is to simulate events where the results are "discrete", such as the flip of a coin. It considers, as parameters, the probability of success on a trial, such as the flip of the coin, according to a given probability of 0.5. Here we simulate 100 flips. If you think that the coin could be stacked in order to favor one specific outcome, we could simulate this behaviour by setting probability equals to 0.8, as in the example below.
> set.seed(123)
> table(rbinom(100, 1, prob = 0.5))
0 1
53 47
> table(rbinom(100, 1, prob = 0.8))
0 1
19 81

Find first increasing value in vector

I draw a random sample from Uniform Distribution by
u <- runif (1000,0,1)
Now I want to calculate the value of this random variable
N = min_n {n : u_n > u_{n-1}}
Edit
Let say I draw a random sample of size 10.
So, I have u= (u_1,u_2,u_3,...,u_10). Now I want to find minimum n for which u_n > u_{n-1}
If you take the difference (using diff) then you're looking for where the difference is greater than 0. We search for the first time that happens
u <- c(.5, .4, .3, .6)
min(which(diff(u) > 0))
This gives us 3 which is close to what we want but not exactly. Since this will return 1 if the first difference is greater than 0 what we really want to do is add 1 to the result
min(which(diff(u) > 0))) + 1
which should give what we want. This will give a warning if your sequence is strictly descending though since it can't find a value that meets the criteria. We could code in some tests and decide on the appropriate output in that case but I'll leave that as an exercise for the reader.

On average, how many times will this incorrect loop iterate?

In some cases, a loop needs to run for a random number of iterations that ranges from min to max, inclusive. One working solution is to do something like this:
int numIterations = randomInteger(min, max);
for (int i = 0; i < numIterations; i++) {
/* ... fun and exciting things! ... */
}
A common mistake that many beginning programmers make is to do this:
for (int i = 0; i < randomInteger(min, max); i++) {
/* ... fun and exciting things! ... */
}
This recomputes the loop upper bound on each iteration.
I suspect that this does not give a uniform distribution of the number of times the loop will iterate that ranges from min to max, but I'm not sure exactly what distribution you do get when you do something like this. Does anyone know what the distribution of the number of loop iterations will be?
As a specific example: suppose that min = 0 and max = 2. Then there are the following possibilities:
When i = 0, the random value is 0. The loop runs 0 times.
When i = 0, the random value is nonzero. Then:
When i = 1, the random value is 0 or 1. Then the loop runs 1 time.
When i = 1, the random value is 2. Then the loop runs 2 times.
The probability of this first event is 1/3. The second event has probability 2/3, and within it, the first subcase has probability 2/3 and the second event has probability 1/3. Therefore, the average number of distributions is
0 × 1/3 + 1 × 2/3 × 2/3 + 2 × 2/3 × 1/3
= 0 + 4/9 + 4/9
= 8/9
Note that if the distribution were indeed uniform, we'd expect to get 1 loop iteration, but now we only get 8/9 on average. My question is whether it's possible to generalize this result to get a more exact value on the number of iterations.
Thanks!
Final edit (maybe!). I'm 95% sure that this isn't one of the standard distributions that are appropriate. I've put what the distribution is at the bottom of this post, as I think the code that gives the probabilities is more readable! A plot for the mean number of iterations against max is given below.
Interestingly, the number of iterations tails off as you increase max. Would be interesting if someone else could confirm this with their code.
If I were to start modelling this, I would start with the geometric distribution, and try to modify that. Essentially we're looking at a discrete, bounded distribution. So we have zero or more "failures" (not meeting the stopping condition), followed by one "success". The catch here, compared to the geometric or Poisson, is that the probability of success changes (also, like the Poisson, the geometric distribution is unbounded, but I think structurally the geometric is a good base). Assuming min=0, the basic mathematical form for P(X=k), 0 <= k <= max, where k is the number of iterations the loop runs, is, like the geometric distribution, the product of k failure terms and 1 success term, corresponding to k "false"s on the loop condition and 1 "true". (Note that this holds even to calculate the last probability, as the chance of stopping is then 1, which obviously makes no difference to a product).
Following on from this, an attempt to implement this in code, in R, looks like this:
fx = function(k,maximum)
{
n=maximum+1;
failure = factorial(n-1)/factorial(n-1-k) / n^k;
success = (k+1) / n;
failure * success
}
This assumes min=0, but generalizing to arbitrary mins isn't difficult (see my comment on the OP). To explain the code. First, as shown by the OP, the probabilities all have (min+1) as a denominator, so we calculate the denominator, n. Next, we calculate the product of the failure terms. Here factorial(n-1)/factorial(n-1-k) means, for example, for min=2, n=3 and k=2: 2*1. And it generalises to give you (n-1)(n-2)... for the total probability of failure. The probability of success increases as you get further into the loop, until finally, when k=maximum, it is 1.
Plotting this analytic formula gives the same results as the OP, and the same shape as the simulation plotted by John Kugelman.
Incidentally the R code to do this is as follows
plot_probability_mass_function = function(maximum)
{
x=0:maximum;
barplot(fx(x,max(x)), names.arg=x, main=paste("max",maximum), ylab="P(X=x)");
}
par(mfrow=c(3,1))
plot_probability_mass_function(2)
plot_probability_mass_function(10)
plot_probability_mass_function(100)
Mathematically, the distribution is, if I've got my maths right, given by:
which simplifies to
(thanks a bunch to http://www.codecogs.com/latex/eqneditor.php)
The latter is given by the R function
function(x,m) { factorial(m)*(x+1)/(factorial(m-x)*(m+1)^(x+1)) }
Plotting the mean number of iterations is done like this in R
meanf = function(minimum)
{
x = 0:minimum
probs = f(x,minimum)
x %*% probs
}
meanf = function(maximum)
{
x = 0:maximum
probs = f(x,maximum)
x %*% probs
}
par(mfrow=c(2,1))
max_range = 1:10
plot(sapply(max_range, meanf) ~ max_range, ylab="Mean number of iterations", xlab="max")
max_range = 1:100
plot(sapply(max_range, meanf) ~ max_range, ylab="Mean number of iterations", xlab="max")
Here are some concrete results I plotted with matplotlib. The X axis is the value i reached. The Y axis is the number of times that value was reached.
The distribution is clearly not uniform. I don't know what distribution it is offhand; my statistics knowledge is quite rusty.
1. min = 10, max = 20, iterations = 100,000
2. min = 100, max = 200, iterations = 100,000
I believe that it would still, given a sufficient amount of executions, conform to the distribution of the randomInteger function.
But this is probably a question better suited to be asked on MATHEMATICS.
I don’t know the math behind it, but I know how to compute it! In Haskell:
import Numeric.Probability.Distribution
iterations min max = iteration 0
where
iteration i = do
x <- uniform [min..max]
if i < x
then iteration (i + 1)
else return i
Now expected (iterations 0 2) gives you the expected value of ~0.89. Maybe someone with the requisite math knowledge can explain what I’m actually doing here. Because you start at 0, the loop will always run at least min times.

Resources