Simulating coin toss - r

In the New York Times yesterday there was a reference to a paper essentially saying that the probability of 'heads' after a 'head' appears is not 0.5 (assuming a fair coin), challenging the "hot hand" myth. I want to prove it to myself.
Thus, I am working on coding a simulation of 7 coin tosses, and counting the number of heads after the first head, provided, naturally, that there is a first head at all.
I came up with the following lines of R code, but I'm still getting NA values, and would appreciate some help:
n <- 7 # number of tosses
p <- 0.5 # probability of heads
sims <- 100 # number of simulations
Freq_post_H <- 0 # frequency of 'head'-s after first 'head'
for(i in 1:sims){
z <- rbinom(n, 1, p)
if(sum(z==1)!=0){
y <- which(z==1)[1]
Freq_post_H[i] <- sum(z[(y+1):n])/length((y+1):n)
}else{
next()
}
Freq_post_H
}
Freq_post_H
What am I missing?
CONCLUSION: After the initial hiccups of mismatched variable names, both responses solve the question. One of the answers corrects problems in the initial code related to what happens with the last toss (i + 1) by introducing min(y + 1, n), and corrects the basic misunderstanding of next within a loop generating NA for skipped iterations. So thank you (+1).
Critically, and the reason for this appended "conclusion" the second response addresses a more fundamental or conceptual problem: we want to calculate the fraction of H's that are preceded by a H, as opposed to p(H) in whatever number of tosses remain after a head has appeared, which will be 0.5 for a fair coin.

This is a simulation of what they did in the newspaper:
nsims <- 10000
k <- 4
set.seed(42)
sims <- replicate(nsims, {
x <- sample(0:1, k, TRUE)
#print(x)
sum( # sum logical values, i.e. 0/1
diff(x) == 0L & # is difference between consecutive values 0?
x[-1] == 1L ) / # and are these values heads?
sum(head(x, -1) == 1L) #divide by number of heads (without last toss)
})
mean(sims, na.rm = TRUE) #NaN cases are samples without heads, i.e. 0/0
#[1] 0.4054715
k <- 7
sims <- replicate(nsims, {
x <- sample(0:1, k, TRUE)
#print(x)
sum(diff(x) == 0L & x[-1] == 1L) / sum(head(x, -1) == 1L)
})
mean(sims, na.rm = TRUE)
#[1] 0.4289402

n <- 7 # number of tosses
p <- 0.5 # probability of heads
sims <- 100 # number of simulations
Prob_post_H <- 0 # frequency of 'head'-s after first 'head'
for(i in 1:sims){
z <- rbinom(n, 1, p)
if(sum(z==1) != 0){
y <- which(z==1)[1]
Prob_post_H[i] <- mean(z[min(y+1, n):n], na.rm=TRUE)
}else{
next()
}
}
mean(Prob_post_H,na.rm=TRUE)
#[1] 0.495068
It looks like it's right around 50%. We can scale up to see more simulations.
sims <- 10000
mean(Prob_post_H,na.rm=TRUE)
#[1] 0.5057866
Still around 50%.

This is to simulate 100 fair coin tosses 30,000 times
counter <- 1
coin <- sum(rbinom(100,1,0.5))
while(counter<30000){
coin <- c(coin, sum(rbinom(100,1,0.5)))
counter <- counter+1
}
Try these after running above variable
hist(coin)
str(coin)
mean(coin)
sd(coin)

Below is some sample code in R to simulate a fair coin toss in R using the sample function. You can modify it as you like to simulate any number of flips. Since the outcome of flipping a coin is independent for each flip, the probability of a head or tail is always 0.5 for any given flip. Over many coin flips the probability of at least half of the flips being heads (or tails) will converge to 0.5. The probability that you get exactly half heads and half tails approaches 0.
n <- 7
count_heads <- 0
coin_flip <- sample(c(0,1), n, replace = TRUE)
for(flip_i in 1:n)
{
if(coin_flip[flip_i] == 1)
{
count_heads = count_heads + 1
}
}
count_heads/n

Related

How to simulate a Polya urn (Martingale) like problem?

In a village there are living N=100 people and they decide with an
interesting way about some actions. Specifically, if someone proposes
an action, then all the N villagers vote for it with YES or NO. The
next day each villager re-adjusts her/his opinion independently from
the other villagers, and votes again with probability equal to the
probability of the total (maximum) supporters of the previous day.This
voting process continues until all N=100 agree on the same opinion.
Question:
How many voting days must pass until all N villagers vote the same ?
My effort
The answer must be the number of iterations needed.
I want to simulate in R this process which is a Polya Urn like (I believe) process but here we don't have red =1 and green = 1 ball in the urn.We have N balls (voters).
Also we have random i people in YES and j people on NO on X_{0} the first day.
Therefore we have p = i/N and q =j/n.
Now the next day each villager will vote again but with probability equal to maximum probability of the previous day.
Something like
votevillage <- function(n) {
i = sample(1:N,1);i
j = N-i;j
p = i/N;p
q = 1-p;q
support = max(i,j)
while (support != n) {
vote = sample(c("YES","NO"),1,prob=c(1-p,p))
support = support + vote
}
if (vote == "YES")
return(1)
else
return(0)
}
n = 100
trials = 100000
simlist = replicate(trials, votevillage(n))
mean(simlist)
The above code is wrong.It's my idea (something like a pseudo code).
As mentioned in the comments, it depends of course on the distribution of yes voters in the first round (if all villagers voted yes (no) on the first round the whole election lasts only 1 day.
The following lines show how to simulate the voting:
nr_of_yes_votes <- function(prob, N) {
rbinom(1, N, prob)
}
nr_of_days_until_unanimity <- function(x0, N) {
i <- 1
x <- x0
while (x < N && x > 0) {
p <- x / N
x <- nr_of_yes_votes(p, N)
i <- i + 1
}
i
}
simulate <- function(prob0, N = 100, seed = 123, reps = 10000) {
set.seed(seed)
x0 <- nr_of_yes_votes(prob0, N)
mean(replicate(reps, nr_of_days_until_unanimity(x0, N)))
}
simulate(.5) ## 137.9889
simulate(0) ## 1
simulate(1) ## 1
Intuitively, the more disagreement in the beginning the longer it will take to get to unanimity. Furthermore, the problem is symmetric. Thus, we would expect something where the numebr of days peaks when there is maximum disagreement in the first voting (which corresponds to an initial voting probability of 0.5) and which declines symmetrically as we vome closer to 0 (1).
This can be nicely shown wiht the following lines:
ns <- vapply((p0 <- seq(0, 1, by = .01)), simulate, numeric(1))
plot(p0, ns, type = "l", xlab = expression(Prob[0]),
ylab = "Expected Days")

Comparison of two vectors resulted after simulation

I would like to apply the Rejection sampling method to simulate a random vector Y=(Y_1, Y_2) of a uniform distribution from a unit disc D = { (X_1 , X_2) \in R^2: \sqrt{x^2_1 + x^2_2} ≤ 1} such that X = (X_1 , X_ 2) is random vector of a uniform distribution in the square S = [−1, 1]^2 and the joint density f(y_1,y_2) = \frac{1}{\pi} 1_{D(y_1,y_2)}.
In the rejection method, we accept a sample generally if f(x) \leq C * g(x). I am using the following code to :
x=runif(100,-1,1)
y=runif(100,-1,1)
d=data.frame(x=x,y=y)
disc_sample=d[(d$x^2+d$y^2)<1,]
plot(disc_sample)
I have two questions:
{Using the above code, logically, the size of d should be greater than the size of disc_sample but when I call both of them I see there are 100 elements in each one of them. How could this be possible. Why the sizes are the same.} THIS PART IS SOLVED, thanks to the comment below.
The question now
Also, how could I reformulate my code to give me the total number of samples needed to get 100 samples follow the condition. i.e to give me the number of samples rejected until I got the 100 needed sample?
Thanks to the answer of r2evans but I am looking to write something simpler, a while loop to store all possible samples inside a matrix or a data frame instead of a list then to call from that data frame just the samples follow the condition. I modified the code from the answer without the use of the lists and without sapply function but it is not giving the needed result, it yields only one row.
i=0
samps <- data.frame()
goods <- data.frame()
nr <- 0L
sampsize <- 100L
needs <- 100L
while (i < needs) {
samps <- data.frame(x = runif(1, -1, 1), y = runif(1, -1, 1))
goods <- samps[(samps$x^2+samps$y^2)<1, ]
i = i+1
}
and I also thought about this:
i=0
j=0
samps <- matrix()
goods <- matrix()
needs <- 100
while (j < needs) {
samps[i,1] <- runif(1, -1, 1)
samps[i,2] <- runif(1, -1, 1)
if (( (samps[i,1])**2+(samps[i,2])**2)<1){
goods[j,1] <- samps[i,1]
goods[j,2] <- samps[i,2]
}
else{
i = i+1
}
}
but it is not working.
I would be very grateful for any help to modify the code.
As to your second question ... you cannot reformulate your code to know precisely how many it will take to get (at least) 100 resulting combinations. You can use a while loop and concatenate results until you have at least 100 such rows, and then truncate those over 100. Because using entropy piecewise (at scale) is "expensive", you might prefer to always over-estimate the rows you need and grab all at once.
(Edited to reduce "complexity" based on homework constraints.)
set.seed(42)
samps <- vector(mode = "list")
goods <- vector(mode = "list")
nr <- 0L
iter <- 0L
sampsize <- 100L
needs <- 100L
while (nr < needs && iter < 50) {
iter <- iter + 1L
samps[[iter]] <- data.frame(x = runif(sampsize, -1, 1), y = runif(sampsize, -1, 1))
rows <- (samps[[iter]]$x^2 + samps[[iter]]$y^2) < 1
goods[[iter]] <- samps[[iter]][rows, ]
nr <- nr + sum(rows)
}
iter # number of times we looped
# [1] 2
out <- head(do.call(rbind, goods), n = 100)
NROW(out)
# [1] 100
head(out) ; tail(out)
# x y
# 1 0.8296121 0.2524907
# 3 -0.4277209 -0.5668654
# 4 0.6608953 -0.2221099
# 5 0.2834910 0.8849114
# 6 0.0381919 0.9252160
# 7 0.4731766 0.4797106
# x y
# 221 -0.65673577 -0.2124462
# 231 0.08606199 -0.7161822
# 251 -0.37263236 0.1296444
# 271 -0.38589120 -0.2831997
# 28 -0.62909284 0.6840144
# 301 -0.50865171 0.5014720

Simplify Simulations on R

as I mentioned in a previous question. I am brand new to programming and have no prior experience, but am very happy to be learning.
However, I've run into the following problem, my professor has given us the following:
sim1 <- function(n) {
xm <- matrix(nrow=n,ncol=2)
for (i in 1:n) {
d <- rnorm(1)
if (runif(1) < 0.5) {
xm[i,1] <- 1
xm[i,2] <- 2.5*d + 69
} else {
xm[i,1] <- 0
xm[i,2] <- 2*d + 64
}
}
return(xm)
}
With the following task: Try to improve the efficiency of this code. Use speed.test to see if it is improved for generating n=1000 observations.
I have finally at least been able to figure out what this code does, nonetheless, I am completely lost on how I could possibly make this code more efficient.
Any help means a whole lot.
Thank you!
If possible, don't use loops in R. rep and rnorm will fill vectors with 5, 10, or 500,000 values all in one call, very quickly. Calling rnorm(1) 500,000 times is a waste and much slower than simply calling rnorm(500000). It's like taking a Ferrari for a drive, going 1 foot and stopping, going 1 foot and stopping, over and over to get to your destination.
This function will return statistically identical results as your function. However, instead of using loops, it does things in the R way.
sim2 <- function(n) {
n1 <- floor(n/2) #this is how many of the else clause we'll do
n2 <- n - n1 #this is how many of the if clause we'll do
col11 <- rep(0, n1) #bam! we have a vector filled with 0s
col12 <- (rnorm(n1) * 2) + 64 #bam! vector filled with deviates
col21 <- rep(1, n2) #bam! vector filled with 1s
col22 <- (rnorm(n2) * 2.5) + 69 #bam! vector filled with deviates
xm <- cbind(c(col11,col21), c(col12,col22)) #now we have a matrix, 2 cols, n rows
return(xm[sample(nrow(xm)),]) #shuffle the rows, return matrix
}
No loops! The functionality might be obvious but in case it is not, I'll explain. First, n1 & n2 are simply to split the size of n appropriately (accounting for odd numbers).
Next, the binomial process (i.e., if(runif(1) < 0.5) {} else {}) per element can be eliminated since we know that in sim1, half of the matrix falls into the if condition and half in the else (see proof below). We don't need to decide for each element over and over and over which random path to take when we know that it's 50/50. So, we're going to do ALL the else 50% first: we fill a vector with n/2 0s (col11) and another with n/2 random deviates (mean = 0, sd = 1 by default) and, for each deviate, multiply by 2 and add 64, with result vector col12. That 50% is done.
Next, we finish the second 50% (the if portion). We fill a vector with n/2 1s (col21) and another with random deviates and, for each deviate, multiply by 2.5 and add 69.
We now have 4 vectors that we'll turn into a matrix. STEP 1: We glue col11 (filled with n/2 0s) and col21 (filled with n/2 1s) together using the c function to get a vector (n elements). STEP 2: Glue col12 and col22 together (filled with the deviates) using c to get a vector (like a 1 column x n row matrix). Note: 0s/1s are associated with the correct deviates based on 64/69 formulas. STEP 3: Use cbind to make a matrix (xm) out of the vectors: 0/1 vector becomes column 1, deviate vector becomes column 2. STEP 4: Get the number of rows in the matrix (which should just be n) using nrow. STEP 5: Make a shuffled vector with all the row numbers randomly ordered using sample. STEP 6: Make a new (unnamed) matrix putting xm's rows in order according to the shuffled vector. The point of steps 4-6 is just to randomly order the rows, since the binomial process in sim1 would have produced a random order of rows.
This version runs 866% faster!
> system.time({ sim1(500000)})
user system elapsed
1.341 0.179 1.527
> system.time({ sim2(500000)})
user system elapsed
0.145 0.011 0.158
If you're concerned about proof that this maintains the integrity of the binomial process, consider that the binomial process does two things: 1) It associates 1 with the 2.5*d+69 equation and 0 with the 2*d + 64 equation - the association is maintained since rows are shuffled intact; 2) 50% go in the if clause and 50% in the else clause, as proved below.
sim3 <- function(n) {
a <- 0
for(j in 1:n) {
if(runif(1) < 0.5) {
a <- a + 1
}
}
return(a/n)
}
> sim3(50)
[1] 0.46
> sim3(5000)
[1] 0.4926
> sim3(10000)
[1] 0.5022
> sim3(5000000)
[1] 0.4997844
The binomial process produces 50% 1s and 50% 0s (column 1).
I'll do what I think is the most obvious step, namely to move rnorm() out of the loop and take advantage of its vectorized nature (as rawr alluded to)
sim2 <- function(n) {
xm <- matrix(nrow=n, ncol=2)
d <- rnorm(n)
for (i in 1:n) {
if (runif(1) < 0.5) {
xm[i,1] <- 1
xm[i,2] <- 2.5*d[i] + 69
} else {
xm[i,1] <- 0
xm[i,2] <- 2*d[i] + 64
}
}
return(xm)
}
n <- 1e3
set.seed(1); system.time(s1 <- sim1(n)); system.time(s2 <- sim2(n))
# user system elapsed
# 0.019 0.004 0.023
# user system elapsed
# 0.010 0.000 0.009
t.test(s1[,2], s2[,2]) # Not identical, but similar, again alluded to by rawr
Just that gives us a reasonable improvement. A similar thing can be done with runif() as well, but I'll leave that to you.
If you want some reading material I can recommend Hadley Wickhams Advanced R and the chapter Optimising code.
And in case you're wondering, it is indeed possible to eliminate both the loop and the conditionals.
One optimization I can suggest is that you create the matrix with default value as 0. Once matrix has been created with 0 value as default then there will be no need to populate a value 0 in function.
The modified code will look like:
sim1 <- function(n) {
#create matrix with 0 value.
xm <- matrix(0,nrow=n,ncol=2)
for (i in 1:n) {
d <- rnorm(1)
if (runif(1) < 0.5) {
xm[i,1] <- 1
xm[i,2] <- 2.5*d + 69
} else {
#xm[i,1] <- 0 --- No longer needed
xm[i,2] <- 2*d + 64
}
}
return(xm)
}

using replicate() to roll 100 dice 10,000 times; unexpected results

In an EdX R stats class, we are asked to look at the proportion of times a '6' is rolled in a set of 100 die rolls. Then we are asked to roll 100 dice 10,000 times, to see the standard deviation of the difference in means from the 100-die rolls.
The results of the 100-die rolls are as expected; around 0.1703 or so (1/6 = 0.1666667)
But when I load up replicate() to throw sets of 100 dice 10,000 times to see 10,000 means, the results are not what I expect. I don't see any values outside the range of a z-score = 2:
set.seed(1)
# get mean of 100 dice rolls
mean100dice <- function(){
n=100
x <- replicate(n, sample(1:6, n, replace=TRUE), simplify='vector')
mean(x==6)
}
mean100dice() #these come out as expected
means10k <- replicate(10000, mean100dice(),simplify='vector')
p = 1/6
z = (means10k - p) / sqrt(p*(1-p)/n)
mean(z > 2) ## I expect this to be > 0
range(means10k) ## sanity check
> mean(z > 2)
[1] 0
> range(means10k)
[1] 0.1522 0.1806
At a guess, you set n <- 100 instead of n <- 10000 when calculating z.
It's a good idea to provide explicit variable names, so you don't mixed up. For example, you need to distinguish n_dice_rolls and n_replicates.
Incidentally, your code for calculating the mean of 100 dice rolls is not correct.
sample(1:6, n, replace=TRUE) rolls n dice; you don't need to call replicate() as well. I think you want something like this.
roll_nd6 <- function(n_dice) {
sample(1:6, n_dice, replace = TRUE)
}
get_fraction_of_sixes_from_rolling_nd6 <- function(n_dice) {
mean(roll_nd6(n_dice) == 6L)
}
monte_carlo_simulate_get_fraction_of_sixes <- function(n_replications, n_dice) {
replicate(
n_replications,
get_fraction_of_sixes_from_rolling_nd6(n_dice),
simplify = "vector"
)
}
calc_z_score <- function(actual_p, expected_p) {
(actual_p - expected_p) /
sqrt(expected_p * (1 - expected_p) / length(actual_p))
}
actual_fraction_of_sixes <- monte_carlo_simulate_get_fraction_of_sixes(10000, 100)
z_scores <- calc_z_score(actual_fraction_of_sixes, 1 / 6)
You have a mistake in mean100dice: You sample 100 dice, and replicate that 100 times, so it's actually not the average of a 100 dice, but of 100*100 = 10,000 dice. Of course, the mean of that is going to be much closer to p on average.

Faster coding than using for loop

Suppose I have the following data frame
set.seed(36)
n <- 300
dat <- data.frame(x = round(runif(n,0,200)), y = round(runif(n, 0, 500)))
d <- dat[order(dat$y),]
For each value of d$y<=300, I have to create a variable res in which the numerator is the sum of the indicator (d$x <= d$y[i]) and the denominator is the sum of the indicator (d$y >= d$y[i]). I have written the codes in for loop:
res <- NULL
for( i in seq_len(sum(d$y<=300)) ){
numerator <- sum(d$x <= d$y[i])
denominator <- sum(d$y >= d$y[i])
res[i] <- numerator / denominator
}
But my concern is when the number of observations of x and y is large, that is, the number of rows of the data frame increases, the for loop will work slowly. Additionally, if I simulate data 1000 times and each time run the for loop, the program will be inefficient.
What can be the more efficient solution of the code?
This depends on d already being sorted as it is:
# example data
set.seed(36)
n <- 1e5
dat <- data.frame(x = round(runif(n,0,200)), y = round(runif(n, 0, 500)))
d <- dat[order(dat$y),]
My suggestion (thanks to #alexis_laz for the denominator):
system.time(res3 <- {
xs <- sort(d$x) # sorted x
yt <- d$y[d$y <= 300] # truncated y
num = findInterval(yt, xs)
den = length(d$y) - match(yt, d$y) + 1L
num/den
})
# user system elapsed
# 0 0 0
OP's approach:
system.time(res <- {
res <- NULL
for( i in seq_len(sum(d$y<=300)) ){
numerator <- sum(d$x <= d$y[i])
denominator <- sum(d$y >= d$y[i])
res[i] <- numerator / denominator
}
res
})
# user system elapsed
# 50.77 1.13 52.10
# verify it matched
all.equal(res,res3) # TRUE
#d.b's approach:
system.time(res2 <- {
numerator = rowSums(outer(d$y, d$x, ">="))
denominator = rowSums(outer(d$y, d$y, "<="))
res2 = numerator/denominator
res2 = res2[d$y <= 300]
res2
})
# Error: cannot allocate vector of size 74.5 Gb
# ^ This error is common when using outer() on large-ish problems
Vectorization. Generally, tasks are faster in R if they can be vectorized. The key functions related to ordered vectors have confusing names (findInterval, sort, order and cut), but fortunately they all work on vectors.
Continuous vs discrete. The match above should be a fast way to compute the denominator whether the data is continuous or has mass points / repeating values. If the data is continuous (and so has no repeats), the denominator can just be seq(length(xs), length = length(yt), by=-1). If it is fully discrete and has a lot of repetition (like the example here), there might be some way to make that faster as well, maybe like one of these:
den2 <- inverse.rle(with(rle(yt), list(
values = length(xs) - length(yt) + rev(cumsum(rev(lengths))),
lengths = lengths)))
tab <- unname(table(yt))
den3 <- rep(rev(cumsum(rev(tab))) + length(xs) - length(yt), tab)
# verify
all.equal(den,den2) # TRUE
all.equal(den,den3) # TRUE
findInterval will still work for the numerator for continuous data. It's not ideal for the repeated-values case considered here I guess (since we're redundantly finding the interval for many repeated yt values). Similar ideas for speeding that up likely apply.
Other options. As #chinsoon suggested, the data.table package might be a good fit if findInterval is too slow, since it has a lot of features focused on sorted data, but it's not obvious to me how to apply it here.
Instead of running loop, generate all the numerator and denominator at once. This also allows you to keep track of which res is associated with which x and y. Later, you can keep only the ones you want.
You can use outer for element wise comparison between vectors.
numerator = rowSums(outer(d$y, d$x, ">=")) #Compare all y against all x
denominator = rowSums(outer(d$y, d$y, "<=")) #Compare all y against itself
res2 = numerator/denominator #Obtain 'res' for all rows
#I would first 'cbind' res2 to d and only then remove the ones for 'y <=300'
res2 = res2[d$y <= 300] #Keep only those 'res' that you want
Since this is using rowSums, this should be faster.

Resources