Using replicate() to roll 100 dice 10,000 times; unexpected results

In an EdX R stats class, we are asked to look at the proportion of times a '6' is rolled in a set of 100 die rolls. Then we are asked to roll 100 dice 10,000 times, to see the standard deviation of the difference in means from the 100-die rolls.
The results of the 100-die rolls are as expected; around 0.1703 or so (1/6 = 0.1666667)
But when I load up replicate() to throw sets of 100 dice 10,000 times to see 10,000 means, the results are not what I expect. I don't see any values outside the range of a z-score = 2:
set.seed(1)
# get mean of 100 dice rolls
mean100dice <- function(){
n=100
x <- replicate(n, sample(1:6, n, replace=TRUE), simplify='vector')
mean(x==6)
}
mean100dice() #these come out as expected
means10k <- replicate(10000, mean100dice(),simplify='vector')
p = 1/6
z = (means10k - p) / sqrt(p*(1-p)/n)
mean(z > 2) ## I expect this to be > 0
range(means10k) ## sanity check
> mean(z > 2)
[1] 0
> range(means10k)
[1] 0.1522 0.1806

At a guess, you set n <- 100 instead of n <- 10000 when calculating z.
It's a good idea to provide explicit variable names, so you don't get mixed up. For example, you need to distinguish n_dice_rolls and n_replicates.
Incidentally, your code for calculating the mean of 100 dice rolls is not correct.
sample(1:6, n, replace=TRUE) rolls n dice; you don't need to call replicate() as well. I think you want something like this.
roll_nd6 <- function(n_dice) {
sample(1:6, n_dice, replace = TRUE)
}
get_fraction_of_sixes_from_rolling_nd6 <- function(n_dice) {
mean(roll_nd6(n_dice) == 6L)
}
monte_carlo_simulate_get_fraction_of_sixes <- function(n_replications, n_dice) {
replicate(
n_replications,
get_fraction_of_sixes_from_rolling_nd6(n_dice),
simplify = "vector"
)
}
calc_z_score <- function(actual_p, expected_p, n_dice) {
# the standard error of a proportion depends on the number of dice per replicate,
# not on the number of replicates
(actual_p - expected_p) /
sqrt(expected_p * (1 - expected_p) / n_dice)
}
actual_fraction_of_sixes <- monte_carlo_simulate_get_fraction_of_sixes(10000, 100)
z_scores <- calc_z_score(actual_fraction_of_sixes, 1 / 6, 100)

You have a mistake in mean100dice: you sample 100 dice and replicate that 100 times, so it is not the average of 100 dice, but of 100*100 = 10,000 dice. Of course, the mean of that is going to be much closer to p on average.
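To make the difference concrete, here is a minimal sketch of the corrected simulation, assuming the goal is the fraction of sixes in a single set of 100 rolls, repeated 10,000 times. The names n_dice, n_replicates and fraction_of_sixes are illustrative and not from the original post. The z-scores use n_dice in the standard error, because each mean is computed from 100 rolls.

```r
set.seed(1)

n_dice <- 100L          # rolls per replicate (assumed from the question)
n_replicates <- 10000L  # number of replicates
p <- 1 / 6              # probability of rolling a six

# Fraction of sixes in one set of n_dice rolls; sample() alone rolls all the dice
fraction_of_sixes <- function(n_dice) {
  mean(sample(1:6, n_dice, replace = TRUE) == 6L)
}

means10k <- replicate(n_replicates, fraction_of_sixes(n_dice))

# Standard error of a proportion based on n_dice rolls
theoretical_se <- sqrt(p * (1 - p) / n_dice)
z <- (means10k - p) / theoretical_se

frac_above_2 <- mean(z > 2)  # fraction of replicates with z-score above 2
sd_of_means <- sd(means10k)  # should be close to theoretical_se
```

With this version sd_of_means should land near sqrt(p * (1 - p) / 100), roughly 0.037, and some z-scores exceed 2.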

Related

How to simulate a Polya urn (Martingale) like problem?

In a village there are living N=100 people and they decide with an
interesting way about some actions. Specifically, if someone proposes
an action, then all the N villagers vote for it with YES or NO. The
next day each villager re-adjusts her/his opinion independently from
the other villagers, and votes again with probability equal to the
probability of the total (maximum) supporters of the previous day. This
voting process continues until all N=100 agree on the same opinion.
Question:
How many voting days must pass until all N villagers vote the same ?
My effort
The answer must be the number of iterations needed.
I want to simulate this process in R. I believe it is a Polya-urn-like process, but here we don't start with red = 1 and green = 1 ball in the urn. We have N balls (voters).
Also, on the first day (X_{0}) a random number i of people vote YES and j people vote NO.
Therefore we have p = i/N and q = j/N.
Now the next day each villager will vote again, but with probability equal to the maximum probability of the previous day.
Something like
votevillage <- function(n) {
i = sample(1:N,1);i
j = N-i;j
p = i/N;p
q = 1-p;q
support = max(i,j)
while (support != n) {
vote = sample(c("YES","NO"),1,prob=c(1-p,p))
support = support + vote
}
if (vote == "YES")
return(1)
else
return(0)
}
n = 100
trials = 100000
simlist = replicate(trials, votevillage(n))
mean(simlist)
The above code is wrong. It's my idea (something like pseudocode).
As mentioned in the comments, it depends of course on the distribution of yes voters in the first round: if all villagers voted yes (or all voted no) in the first round, the whole election lasts only 1 day.
The following lines show how to simulate the voting:
nr_of_yes_votes <- function(prob, N) {
rbinom(1, N, prob)
}
nr_of_days_until_unanimity <- function(x0, N) {
i <- 1
x <- x0
while (x < N && x > 0) {
p <- x / N
x <- nr_of_yes_votes(p, N)
i <- i + 1
}
i
}
simulate <- function(prob0, N = 100, seed = 123, reps = 10000) {
set.seed(seed)
x0 <- nr_of_yes_votes(prob0, N)
mean(replicate(reps, nr_of_days_until_unanimity(x0, N)))
}
simulate(.5) ## 137.9889
simulate(0) ## 1
simulate(1) ## 1
Intuitively, the more disagreement in the beginning, the longer it will take to get to unanimity. Furthermore, the problem is symmetric. Thus, we would expect the number of days to peak when there is maximum disagreement in the first vote (which corresponds to an initial voting probability of 0.5) and to decline symmetrically as we come closer to 0 (or 1).
This can be nicely shown with the following lines:
ns <- vapply((p0 <- seq(0, 1, by = .01)), simulate, numeric(1))
plot(p0, ns, type = "l", xlab = expression(Prob[0]),
ylab = "Expected Days")
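If the first-day vote is itself random, as in the question's sample(1:N, 1), one possible extension is to draw a fresh starting count for every replicate and average over those. This is only a sketch: the helper name days_until_unanimity and the choice of 1,000 replicates are assumptions, and the update rule is the same binomial step used above.

```r
set.seed(123)

N <- 100L

# Number of voting days until everybody agrees, starting from x0 YES votes
days_until_unanimity <- function(x0, N) {
  days <- 1L
  x <- x0
  while (x > 0 && x < N) {
    x <- rbinom(1, N, x / N)  # each villager votes YES with probability x / N
    days <- days + 1L
  }
  days
}

# Random first-day result for each replicate (assumption: uniform on 1..N)
initial_yes <- sample(1:N, 1000, replace = TRUE)
days <- vapply(initial_yes, days_until_unanimity, integer(1), N = N)

mean_days <- mean(days)  # average number of voting days over random starts
```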

Comparison of two vectors resulted after simulation

I would like to apply the rejection sampling method to simulate a random vector Y = (Y_1, Y_2) uniformly distributed on the unit disc D = { (x_1, x_2) \in R^2 : \sqrt{x_1^2 + x_2^2} \leq 1 }, where the proposal X = (X_1, X_2) is a random vector uniformly distributed on the square S = [-1, 1]^2 and the target joint density is f(y_1, y_2) = \frac{1}{\pi} 1_D(y_1, y_2).
In the rejection method, we generally accept a candidate x drawn from g if u \leq f(x) / (C g(x)), where u is drawn uniformly from [0, 1]. I am using the following code:
x=runif(100,-1,1)
y=runif(100,-1,1)
d=data.frame(x=x,y=y)
disc_sample=d[(d$x^2+d$y^2)<1,]
plot(disc_sample)
I have two questions:
{Using the above code, logically, the size of d should be greater than the size of disc_sample, but when I call both of them I see there are 100 elements in each. How is this possible? Why are the sizes the same?} THIS PART IS SOLVED, thanks to the comment below.
The question now
Also, how could I reformulate my code to give me the total number of samples needed to get 100 samples that satisfy the condition, i.e., to give me the number of samples rejected before I have the 100 samples I need?
Thanks to r2evans for the answer, but I am looking to write something simpler: a while loop that stores all generated samples in a matrix or a data frame instead of a list, and then extracts from that data frame just the samples that satisfy the condition. I modified the code from the answer to avoid lists and the sapply function, but it does not give the needed result; it yields only one row.
i=0
samps <- data.frame()
goods <- data.frame()
nr <- 0L
sampsize <- 100L
needs <- 100L
while (i < needs) {
samps <- data.frame(x = runif(1, -1, 1), y = runif(1, -1, 1))
goods <- samps[(samps$x^2+samps$y^2)<1, ]
i = i+1
}
and I also thought about this:
i=0
j=0
samps <- matrix()
goods <- matrix()
needs <- 100
while (j < needs) {
samps[i,1] <- runif(1, -1, 1)
samps[i,2] <- runif(1, -1, 1)
if (( (samps[i,1])**2+(samps[i,2])**2)<1){
goods[j,1] <- samps[i,1]
goods[j,2] <- samps[i,2]
}
else{
i = i+1
}
}
but it is not working.
I would be very grateful for any help to modify the code.
As to your second question ... you cannot reformulate your code to know precisely how many it will take to get (at least) 100 resulting combinations. You can use a while loop and concatenate results until you have at least 100 such rows, and then truncate those over 100. Because using entropy piecewise (at scale) is "expensive", you might prefer to always over-estimate the rows you need and grab all at once.
(Edited to reduce "complexity" based on homework constraints.)
set.seed(42)
samps <- vector(mode = "list")
goods <- vector(mode = "list")
nr <- 0L
iter <- 0L
sampsize <- 100L
needs <- 100L
while (nr < needs && iter < 50) {
iter <- iter + 1L
samps[[iter]] <- data.frame(x = runif(sampsize, -1, 1), y = runif(sampsize, -1, 1))
rows <- (samps[[iter]]$x^2 + samps[[iter]]$y^2) < 1
goods[[iter]] <- samps[[iter]][rows, ]
nr <- nr + sum(rows)
}
iter # number of times we looped
# [1] 2
out <- head(do.call(rbind, goods), n = 100)
NROW(out)
# [1] 100
head(out) ; tail(out)
# x y
# 1 0.8296121 0.2524907
# 3 -0.4277209 -0.5668654
# 4 0.6608953 -0.2221099
# 5 0.2834910 0.8849114
# 6 0.0381919 0.9252160
# 7 0.4731766 0.4797106
# x y
# 221 -0.65673577 -0.2124462
# 231 0.08606199 -0.7161822
# 251 -0.37263236 0.1296444
# 271 -0.38589120 -0.2831997
# 28 -0.62909284 0.6840144
# 301 -0.50865171 0.5014720
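For the simpler one-point-at-a-time loop the question asks about, a minimal sketch might look like the following. It assumes a preallocated matrix for the accepted points and two counters; the names accepted, n_accepted, n_drawn and n_rejected are illustrative. It is slower than drawing in batches, but it reports exactly how many candidates were needed.

```r
set.seed(42)

needs <- 100L

# Preallocate space for the accepted points
accepted <- matrix(NA_real_, nrow = needs, ncol = 2,
                   dimnames = list(NULL, c("x", "y")))
n_accepted <- 0L  # points inside the unit disc so far
n_drawn <- 0L     # all candidates drawn so far

while (n_accepted < needs) {
  candidate <- runif(2, -1, 1)  # one point uniform on the square [-1, 1]^2
  n_drawn <- n_drawn + 1L
  if (sum(candidate^2) < 1) {   # keep it only if it falls inside the disc
    n_accepted <- n_accepted + 1L
    accepted[n_accepted, ] <- candidate
  }
}

n_rejected <- n_drawn - needs   # candidates thrown away along the way
```

Since the disc covers pi/4 of the square, n_drawn should typically come out somewhere around 100 / (pi / 4), i.e. near 127, though it varies from run to run.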

Simplify Simulations on R

As I mentioned in a previous question, I am brand new to programming and have no prior experience, but am very happy to be learning.
However, I've run into a problem. My professor has given us the following:
sim1 <- function(n) {
xm <- matrix(nrow=n,ncol=2)
for (i in 1:n) {
d <- rnorm(1)
if (runif(1) < 0.5) {
xm[i,1] <- 1
xm[i,2] <- 2.5*d + 69
} else {
xm[i,1] <- 0
xm[i,2] <- 2*d + 64
}
}
return(xm)
}
With the following task: Try to improve the efficiency of this code. Use speed.test to see if it is improved for generating n=1000 observations.
I have finally at least been able to figure out what this code does, nonetheless, I am completely lost on how I could possibly make this code more efficient.
Any help means a whole lot.
Thank you!
If possible, don't use loops in R. rep and rnorm will fill vectors with 5, 10, or 500,000 values all in one call, very quickly. Calling rnorm(1) 500,000 times is a waste and much slower than simply calling rnorm(500000). It's like taking a Ferrari for a drive, going 1 foot and stopping, going 1 foot and stopping, over and over to get to your destination.
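This function will return results that are statistically very similar to those of your function. The one difference is that it fixes the split between the two groups at exactly half, whereas your function decides each row at random. Instead of using loops, it does things in the R way.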
sim2 <- function(n) {
n1 <- floor(n/2) #this is how many of the else clause we'll do
n2 <- n - n1 #this is how many of the if clause we'll do
col11 <- rep(0, n1) #bam! we have a vector filled with 0s
col12 <- (rnorm(n1) * 2) + 64 #bam! vector filled with deviates
col21 <- rep(1, n2) #bam! vector filled with 1s
col22 <- (rnorm(n2) * 2.5) + 69 #bam! vector filled with deviates
xm <- cbind(c(col11,col21), c(col12,col22)) #now we have a matrix, 2 cols, n rows
return(xm[sample(nrow(xm)),]) #shuffle the rows, return matrix
}
No loops! The functionality might be obvious but in case it is not, I'll explain. First, n1 & n2 are simply to split the size of n appropriately (accounting for odd numbers).
Next, the binomial process (i.e., if(runif(1) < 0.5) {} else {}) per element can be eliminated since we know that in sim1, half of the matrix falls into the if condition and half in the else (see proof below). We don't need to decide for each element over and over and over which random path to take when we know that it's 50/50. So, we're going to do ALL the else 50% first: we fill a vector with n/2 0s (col11) and another with n/2 random deviates (mean = 0, sd = 1 by default) and, for each deviate, multiply by 2 and add 64, with result vector col12. That 50% is done.
Next, we finish the second 50% (the if portion). We fill a vector with n/2 1s (col21) and another with random deviates and, for each deviate, multiply by 2.5 and add 69.
We now have 4 vectors that we'll turn into a matrix. STEP 1: We glue col11 (filled with n/2 0s) and col21 (filled with n/2 1s) together using the c function to get a vector (n elements). STEP 2: Glue col12 and col22 together (filled with the deviates) using c to get a vector (like a 1 column x n row matrix). Note: 0s/1s are associated with the correct deviates based on 64/69 formulas. STEP 3: Use cbind to make a matrix (xm) out of the vectors: 0/1 vector becomes column 1, deviate vector becomes column 2. STEP 4: Get the number of rows in the matrix (which should just be n) using nrow. STEP 5: Make a shuffled vector with all the row numbers randomly ordered using sample. STEP 6: Make a new (unnamed) matrix putting xm's rows in order according to the shuffled vector. The point of steps 4-6 is just to randomly order the rows, since the binomial process in sim1 would have produced a random order of rows.
This version runs 866% faster!
> system.time({ sim1(500000)})
user system elapsed
1.341 0.179 1.527
> system.time({ sim2(500000)})
user system elapsed
0.145 0.011 0.158
If you're concerned about proof that this maintains the integrity of the binomial process, consider that the binomial process does two things: 1) It associates 1 with the 2.5*d+69 equation and 0 with the 2*d + 64 equation - the association is maintained since rows are shuffled intact; 2) 50% go in the if clause and 50% in the else clause, as proved below.
sim3 <- function(n) {
a <- 0
for(j in 1:n) {
if(runif(1) < 0.5) {
a <- a + 1
}
}
return(a/n)
}
> sim3(50)
[1] 0.46
> sim3(5000)
[1] 0.4926
> sim3(10000)
[1] 0.5022
> sim3(5000000)
[1] 0.4997844
The binomial process produces approximately 50% 1s and 50% 0s (column 1), and the proportion gets closer to 0.5 as n grows.
I'll do what I think is the most obvious step, namely to move rnorm() out of the loop and take advantage of its vectorized nature (as rawr alluded to)
sim2 <- function(n) {
xm <- matrix(nrow=n, ncol=2)
d <- rnorm(n)
for (i in 1:n) {
if (runif(1) < 0.5) {
xm[i,1] <- 1
xm[i,2] <- 2.5*d[i] + 69
} else {
xm[i,1] <- 0
xm[i,2] <- 2*d[i] + 64
}
}
return(xm)
}
n <- 1e3
set.seed(1); system.time(s1 <- sim1(n)); system.time(s2 <- sim2(n))
# user system elapsed
# 0.019 0.004 0.023
# user system elapsed
# 0.010 0.000 0.009
t.test(s1[,2], s2[,2]) # Not identical, but similar, again alluded to by rawr
Just that gives us a reasonable improvement. A similar thing can be done with runif() as well, but I'll leave that to you.
If you want some reading material, I can recommend Hadley Wickham's Advanced R and its chapter Optimising code.
And in case you're wondering, it is indeed possible to eliminate both the loop and the conditionals.
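As one possible way to do that, the sketch below draws all the uniform and normal deviates at once and replaces the if/else with arithmetic on a 0/1 indicator. The function name sim_vectorized and the variable names are assumptions for illustration; the 50/50 choice is still made independently for every row, as in sim1.

```r
sim_vectorized <- function(n) {
  group <- as.numeric(runif(n) < 0.5)  # 1 with probability 0.5, otherwise 0
  d <- rnorm(n)                        # one standard normal deviate per row
  # group 1: 2.5 * d + 69, group 0: 2 * d + 64
  value <- (2 + 0.5 * group) * d + (64 + 5 * group)
  unname(cbind(group, value))          # n x 2 matrix, like the loop version
}

set.seed(1)
xm <- sim_vectorized(1000)
```

The result can be compared against the loop version with system.time() and t.test() in the same way as above.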
One optimization I can suggest is to create the matrix with a default value of 0. Once the matrix is initialized with 0s, there is no need to assign 0 inside the function.
The modified code will look like:
sim1 <- function(n) {
#create matrix with 0 value.
xm <- matrix(0,nrow=n,ncol=2)
for (i in 1:n) {
d <- rnorm(1)
if (runif(1) < 0.5) {
xm[i,1] <- 1
xm[i,2] <- 2.5*d + 69
} else {
#xm[i,1] <- 0 --- No longer needed
xm[i,2] <- 2*d + 64
}
}
return(xm)
}

For loop in R (Special Case: Wiener Process)

I'm dealing right now with a valuation of Option prices for my university thesis.
We need to program some things in R. It's the first time I'm working with programming software like R. I've been doing this for the last 2 weeks and this is how far I have got:
s <- 120
#Value of the stock today
sd <- 0.1
#standard deviation
d <- 0.003
#Drift
N <- 365
T <-1
dt <-T/N
t <- seq(0,T, length=N+1)
W <- c(0, cumsum(sqrt(dt)*rnorm(N)))
#plot( t, W, type="l", main="Wiener process", ylim=c(-1,1))
S <- s*exp(d+sd*W)
S
This is a simple generalized Wiener process which I want to turn into a Monte Carlo simulation.
For S there are now 366 (N+1) Values of the Stock path. What I need is a "for loop" which takes the last Value of S and allocates it into a vector (list vector), so that I can run the loop for example 10000 times, collect every last Value of S and get the average of the vector.
I have no idea how I can program such a for loop.
I would really appreciate if you could help me or give me some good hints.
Greetings from Germany
Christian
I never studied Wiener Processes, but I think this would be a simple outline of the code you're trying to achieve:
stock_prices <- s #Initialise vector of stock prices
numIter <- 10^4 #Set number of iterations in the for loop
for(i in 1:numIter) {
s <- stock_prices[i] #This is the current stock price (for ith iteration / time step)
#Calculate the next stock price here, call it next_price
#Add price of next iteration / time step to your vector:
stock_prices <- c(stock_prices, next_price)
}
stock_prices will be a vector holding the initial price followed by the 10,000 stock prices you simulated.
I don't know how you calculate the next stock price from S, but if you draw from the values of S randomly, then it might be useful to check out the function sample (type ?sample for help on it).
Hope that helps
If you just want to run code repeatedly, putting it in a function is nice (but not absolutely necessary). I will refer to all the code in your question as <your code>.
To make a function that runs your code,
my_function = function() {
<your code>
}
The function will, by default, return its last line, in this case S. You only want the last element of S, tail(S, 1). So we can modify the function to return only that:
my_function = function() {
<your code>
return(tail(S, 1))
}
We can then call it in a for loop n times and assign the result. It is best to pre-allocate the vector for the results so that an appropriately sized block of memory can be set aside for it up front:
n = 10000
results = rep(NA, n)
for (i in 1:n) {
results[i] <- my_function()
}
This is equivalent to
n = 10000
results = rep(NA, n)
for (i in 1:n) {
<your code>
results[i] <- tail(S, 1)
}
And, for that matter, it is also equivalent to
results = replicate(n, my_function())
which is a handy shortcut.
If you want to be fancy, you could parameterize your function:
my_nice_function = function(s = 120, sd = 0.1, d = 0.003, N = 365) {
T <- 1
dt <- T / N
t <- seq(0, T, length = N + 1)
W <- c(0, cumsum(sqrt(dt) * rnorm(N)))
S <- s * exp(d + sd * W)
return(tail(S, 1))
}
Now my_nice_function has default values as in your code, but you can easily adjust them, e.g., to run 50 simulations with sd = 0.2 you can do this:
replicate(50, my_nice_function(sd = 0.2))
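If only the final value of each path is needed, another option is to simulate all paths in one matrix and sum the increments column by column. The sketch below assumes the same model as the question's code, S = s * exp(d + sd * W); the names s0, sigma, drift, horizon and n_paths are illustrative replacements chosen to avoid masking sd() and T.

```r
set.seed(1)

s0 <- 120         # stock value today
sigma <- 0.1      # standard deviation
drift <- 0.003    # drift
N <- 365L         # time steps per path
horizon <- 1      # total time
n_paths <- 10000L # number of simulated paths

dt <- horizon / N

# One column per path, one row per time step
increments <- matrix(sqrt(dt) * rnorm(N * n_paths), nrow = N, ncol = n_paths)
W_T <- colSums(increments)           # final value of each Wiener path
S_T <- s0 * exp(drift + sigma * W_T) # final stock value of each path

mean_S_T <- mean(S_T)

# Sanity check: W_T is N(0, horizon), so the mean of S_T has a closed form
expected_mean <- s0 * exp(drift + sigma^2 * horizon / 2)
```

mean_S_T should be close to expected_mean, which gives a quick check that the simulation is wired up as intended.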

Simulating coin toss

In the New York Times yesterday there was a reference to a paper essentially saying that the probability of 'heads' after a 'head' appears is not 0.5 (assuming a fair coin), challenging the "hot hand" myth. I want to prove it to myself.
Thus, I am working on coding a simulation of 7 coin tosses, and counting the number of heads after the first head, provided, naturally, that there is a first head at all.
I came up with the following lines of R code, but I'm still getting NA values, and would appreciate some help:
n <- 7 # number of tosses
p <- 0.5 # probability of heads
sims <- 100 # number of simulations
Freq_post_H <- 0 # frequency of 'head'-s after first 'head'
for(i in 1:sims){
z <- rbinom(n, 1, p)
if(sum(z==1)!=0){
y <- which(z==1)[1]
Freq_post_H[i] <- sum(z[(y+1):n])/length((y+1):n)
}else{
next()
}
Freq_post_H
}
Freq_post_H
What am I missing?
CONCLUSION: After the initial hiccups of mismatched variable names, both responses solve the question. One of the answers corrects problems in the initial code related to what happens with the last toss (i + 1) by introducing min(y + 1, n), and corrects the basic misunderstanding of next within a loop generating NA for skipped iterations. So thank you (+1).
Critically, and the reason for this appended "conclusion", the second response addresses a more fundamental or conceptual problem: we want to calculate the fraction of H's that are preceded by an H, as opposed to p(H) in whatever number of tosses remain after a head has appeared, which will be 0.5 for a fair coin.
This is a simulation of what they did in the newspaper:
nsims <- 10000
k <- 4
set.seed(42)
sims <- replicate(nsims, {
x <- sample(0:1, k, TRUE)
#print(x)
sum( # sum logical values, i.e. 0/1
diff(x) == 0L & # is difference between consecutive values 0?
x[-1] == 1L ) / # and are these values heads?
sum(head(x, -1) == 1L) #divide by number of heads (without last toss)
})
mean(sims, na.rm = TRUE) #NaN cases are samples without heads, i.e. 0/0
#[1] 0.4054715
k <- 7
sims <- replicate(nsims, {
x <- sample(0:1, k, TRUE)
#print(x)
sum(diff(x) == 0L & x[-1] == 1L) / sum(head(x, -1) == 1L)
})
mean(sims, na.rm = TRUE)
#[1] 0.4289402
n <- 7 # number of tosses
p <- 0.5 # probability of heads
sims <- 100 # number of simulations
Prob_post_H <- 0 # frequency of 'head'-s after first 'head'
for(i in 1:sims){
z <- rbinom(n, 1, p)
if(sum(z==1) != 0){
y <- which(z==1)[1]
Prob_post_H[i] <- mean(z[min(y+1, n):n], na.rm=TRUE)
}else{
next()
}
}
mean(Prob_post_H,na.rm=TRUE)
#[1] 0.495068
It looks like it's right around 50%. We can scale up to see more simulations.
sims <- 10000
# (re-run the loop above with the new value of sims, then:)
mean(Prob_post_H,na.rm=TRUE)
#[1] 0.5057866
Still around 50%.
This simulates 100 fair coin tosses 30,000 times:
counter <- 1
coin <- sum(rbinom(100,1,0.5))
while(counter<30000){
coin <- c(coin, sum(rbinom(100,1,0.5)))
counter <- counter+1
}
Try these after running the code above:
hist(coin)
str(coin)
mean(coin)
sd(coin)
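The loop above grows coin one element at a time. Since rbinom is vectorized, the same 30,000 head counts can probably be drawn in a single call; the following is a minimal sketch of that idea, reusing the name coin from the code above.

```r
set.seed(1)

# 30,000 experiments, each counting heads in 100 fair coin tosses
coin <- rbinom(30000, size = 100, prob = 0.5)

mean_heads <- mean(coin)  # should be near 100 * 0.5 = 50
sd_heads <- sd(coin)      # should be near sqrt(100 * 0.5 * 0.5) = 5
```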
Below is some sample R code that simulates fair coin tosses using the sample function. You can modify it as you like to simulate any number of flips. Since each flip is independent, the probability of a head or a tail is always 0.5 for any given flip. Over many coin flips the probability of at least half of the flips being heads (or tails) will converge to 0.5. The probability that you get exactly half heads and half tails approaches 0.
n <- 7
count_heads <- 0
coin_flip <- sample(c(0,1), n, replace = TRUE)
for(flip_i in 1:n)
{
if(coin_flip[flip_i] == 1)
{
count_heads = count_heads + 1
}
}
count_heads/n
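The counting loop can also be written without a loop, and the claim about exactly half heads can be checked directly with dbinom. This is a small sketch; the names fraction_heads, n_flips and p_exactly_half are illustrative.

```r
set.seed(1)

n <- 7
coin_flip <- sample(c(0, 1), n, replace = TRUE)

# Heads are coded as 1, so the mean is the fraction of heads
fraction_heads <- mean(coin_flip)

# Probability of exactly half heads for increasing (even) numbers of flips
n_flips <- c(10, 100, 1000, 10000)
p_exactly_half <- dbinom(n_flips / 2, size = n_flips, prob = 0.5)
```

p_exactly_half shrinks as n_flips grows, which is the sense in which the probability of an exact 50/50 split approaches 0.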
