Monte Carlo for 3 or more consecutive faces - r

I wrote this code to check for 3 or more consecutive equal faces in a simulation of 500,000 iterations, with five rolls of a fair die per iteration. I think I am on the right track, but I am missing something. I keep getting a missing-value error:
nrep = 500000
count = 0
for (i in 1:nrep) {
  roll = sample(6, 5)
  print(roll)
  if (roll[i] == roll[i+1] & roll[i+1] == roll[i+2]) count = count + 1
}
print(count)
Please advise on a correction using base R only.

Adding to my comment: you can use the function rle() to compute the lengths and values of runs of equal values in a vector. You can do something like the following:
nrep = 500000
count = 0
for (i in 1:nrep) {
  roll = sample(6, 5, replace = TRUE)  # five independent rolls of a fair die
  roll_rle = rle(roll)
  if (any(roll_rle$lengths >= 3)) {    # any run of 3 or more equal faces
    print(roll)
    count = count + 1
  }
}
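To turn the count into a probability, divide by the number of iterations. As a quick sanity check (my own back-of-the-envelope count, so treat it as an assumption): of the 6^5 = 7776 equally likely sequences, those whose runs are all of length at most 2 number 6*5^4 + 4*6*5^3 + 3*6*5^2 = 7200, so the exact probability of seeing a run of 3 or more is 576/7776 = 2/27.
count / nrep
# should land near 2/27, i.e. about 0.0741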

Use for-loop and if function to create a new vector?

I want to do the following operation with the code: I want to draw samples of n = 30 from a given normal distribution and calculate the mean of each sample. (Up to this step my function works without any problem.) After that I want to create a new vector of "Yes" or "No", depending on whether the mean is in a certain range or not. Sadly the code does not conduct this step. I always get a vector with 13 elements, but there should be 500. What is the problem? Where is my mistake?
o = 13
u = 7
d = c()
for (i in 1:500){
  i = rnorm(30, mean = 10, sd = 6.04)
  i = mean(i)
  if (i <= o & i >= u) {
    d[i] = "Yes"
  } else {
    d[i] = "No"
  }
}
You should avoid changing the value of your iterator (i) within your loop. In your case, i becomes a non-integer value, and when you use it to index your d vector, R takes only the integer portion of i.
Consider what happens when I have a vector
x <- 1:4
and I take the pi index of it.
x[pi]
# [1] 3
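The same truncation happens on assignment, which is why the original loop produced a short vector (presumably the largest truncated sample mean encountered was 13). A quick demonstration:
d <- c()
d[2.7] <- "Yes" # the fractional index 2.7 is truncated to 2
d
# [1] NA    "Yes"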
Your code should look more like this:
o = 13
u = 7
d = c()
for (i in 1:500){
  sample_i = rnorm(30, mean = 10, sd = 6.04)
  mean_i = mean(sample_i)
  if (mean_i <= o & mean_i >= u) {
    d[i] = "Yes"
  } else {
    d[i] = "No"
  }
}
If you would like to improve your code some, here are some suggestions:
First, avoid "growing" your results. Growing a vector forces R to re-allocate and copy it on every iteration, which hurts performance. It is better to decide how long your result (d) should be and set it to that length to begin with.
Next, try not to hard-code the number of iterations into your loop. Get familiar with seq_along and seq_len and use them to count iterations for you.
o = 13
u = 7
d = character(500)       # I made a change here (character, since we store "Yes"/"No")
for (i in seq_along(d)){ # And I made a change here
  sample_i = rnorm(30, mean = 10, sd = 6.04)
  mean_i = mean(sample_i)
  if (mean_i <= o & mean_i >= u) {
    d[i] = "Yes"
  } else {
    d[i] = "No"
  }
}
Re-assigning i looks like a bad idea to me.
Are you sure you want to do this in a for loop? If not, a vectorised solution with crossing() (tidyverse; nice explanations at varianceexplained.org) should work pretty nicely, I think?
library(tidyverse)
o = 13
u = 7
crossing(trial = 1:500,
         rounds = 1:30) %>%
  mutate(num = rnorm(n(), mean = 10, sd = 6.04)) %>%
  group_by(trial) %>%
  summarise(mean = mean(num)) %>%
  mutate(d = case_when(mean <= o & mean >= u ~ "Yes",
                       TRUE ~ "No")) %>%
  count(d)
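For comparison, here is a base R version of the same vectorised idea using replicate(), reusing o and u from above (a sketch, not a drop-in for the tidyverse pipeline):
means <- replicate(500, mean(rnorm(30, mean = 10, sd = 6.04)))
d <- ifelse(means <= o & means >= u, "Yes", "No")
table(d)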

Monte Carlo Simulation for Pattern

I am writing a Monte Carlo simulation to check how often no 'y' is immediately next to another 'y'. I conjured up a vector of 40 x's and 10 y's placed at random positions in the vector. My goal is to estimate the probability of not having any adjacent y's in the vector. Here is what I tried:
nrep = 100000
count = 0
for (i in 1:nrep) {
  x = sample(c(rep('x', 40), rep('y', 10)))
  if (x[i] != x[i+1] && x[i+1] != x[i+2]) count = count + 1
}
print(count/nrep)
The result is a very small number, which doesn't seem to make sense to me.
The if part is not correct: it only looks at positions i, i+1, and i+2, where i is the iteration number, not at the whole vector. We can use head/tail to line up each element with its successor and check whether any two consecutive 'y's occur in an iteration.
nrep = 100000
count = 0
set.seed(2020)
for (i in 1:nrep) {
  x = sample(rep(c('x', 'y'), c(40, 10)))
  # head(x, -1) and tail(x, -1) pair each element with the one after it
  if (any(head(x, -1) == 'y' & tail(x, -1) == 'y')) count = count + 1
}
count/nrep
#[1] 0.891
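This agrees with the exact answer: arrangements with no two adjacent y's place the 10 y's into the 41 gaps around the 40 x's, so (assuming I have the combinatorics right) the probability of at least one adjacent pair is
1 - choose(41, 10) / choose(50, 10)
# ~0.891, matching the simulation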

Optimising a calculation on every cumulative subset of a vector in R

I have a collection of DNA sequencing reads of various lengths, sorted from longest to shortest. I would like to know the largest number of reads I can include in a set such that the N50 of that set is above some threshold t.
For any given set of reads, the total amount of data is just the cumulative sum of the lengths of the reads. The N50 is defined as the length of the read such that half of the data are contained in reads at least that long.
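For instance, with read lengths 5, 4, 3, 2, 1 (a small illustration of my own): the total is 15, and reads of length 4 or longer contain 9 of those 15 bases, which is more than half, so the N50 is 4.
l <- c(5, 4, 3, 2, 1)
cs <- cumsum(l)                 # 5 9 12 14 15
l[min(which(cs > sum(l) / 2))]  # first read whose cumulative sum passes 15/2 = 7.5
# [1] 4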
I have a solution below, but it is slow for very large read sets. I tried vectorising it, but this was slower (probably because my threshold is usually relatively large, such that my solution below stops calculating fairly early on).
Here's a worked example:
df = data.frame(l = 100:1) # read lengths
df$cs = cumsum(df$l)       # getting the cumulative sum is easy and quick
t = 95                     # let's imagine that this is my threshold N50
for(i in 1:nrow(df)){
  N50 = df$l[min(which(df$cs > df$cs[i]/2))]
  if(N50 < t){ break }
}
# the loop will have gone one too far, so I subtract one
number.of.reads = as.integer(i - 1)
This works fine on small datasets, but my actual data are more like 5m reads varying in length from ~200,000 down to 1 (longer reads are rarer), and with an N50 threshold of interest around 100,000 it gets pretty slow.
This example is closer to something that's realistic. It takes ~15s on my desktop.
l = ceiling(runif(100000, min = 0, max = 19999))
l = sort(l, decreasing = T)
df = data.frame(l = l)
df$cs = cumsum(df$l)
t = 18000
for(i in 1:nrow(df)){
  n = df$l[min(which(df$cs > df$cs[i]/2))]
  if(n < t){ break }
}
result = as.integer(i - 1)
So, I'm interested in any ideas, tips, or tricks to noticeably optimise this. It seems like this should be possible, but I'm out of ideas.
As n is decreasing with i, you should use a binary search algorithm.
binSearch <- function(min, max) {
  print(mid <- floor(mean(c(min, max))))
  if (mid == min) {
    # the interval has collapsed: settle the answer at the lower endpoint
    if (df$l[min(which(df$cs > df$cs[min]/2))] < t) {
      return(min - 1)
    } else {
      return(max - 1)
    }
  }
  n = df$l[min(which(df$cs > df$cs[mid]/2))]
  if (n >= t) {
    return(binSearch(mid, max)) # threshold still met: search the upper half
  } else {
    return(binSearch(min, mid)) # threshold missed: search the lower half
  }
}
Then, just call
binSearch(1, nrow(df))
Since your data are ordered by read length, maybe you could avoid testing every single row. Instead, you can iterate and test a limited number of reasonably spaced rows at each iteration (using while(), for example), and so get progressively closer to your solution. This should make things much faster. Just make sure that once you get close to the solution, you stop iterating.
This is your solution
set.seed(111)
l = ceiling(runif(100000, min = 0, max = 19999))
l = sort(l, decreasing = T)
df = data.frame(l = l)
df$cs = cumsum(df$l)
t = 18000
for(i in 1:nrow(df)){
  n = df$l[min(which(df$cs > df$cs[i]/2))]
  if(n < t){ break }
}
result = as.integer(i - 1)
result
# 21216, in ~29 seconds
Instead of testing every row, let's set a range
i1 <- 1
i2 <- nrow(df)
i.range <- as.integer(seq(i1, i2, length.out = 10))
Now, test only these 10 rows. Get the closest one and "focus in" by re-defining the range. Stop when you cannot increase granularity.
while(sum(duplicated(i.range)) == 0){
  for(i in 1:length(i.range)){
    N50 = df$l[min(which(df$cs > df$cs[i.range[i]]/2))]
    if(N50 < t){ break }
  }
  # update i1 and i2
  i1 <- i.range[(i - 1)]
  i2 <- i.range[i]
  i.range <- as.integer(seq(i1, i2, length.out = 10))
}
i.range <- seq(i1, i2, by = 1)
for(i in i.range){
  N50 = df$l[min(which(df$cs > df$cs[i]/2))]
  if(N50 < t){ break }
}
result <- as.integer(i - 1)
result
# 21216, in ~0.06 seconds
Same result in a fraction of the time.
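A fully vectorised version is also possible (my sketch, not from the answers above): since df$cs is increasing, the inner min(which(df$cs > df$cs[i]/2)) is itself a lookup in a sorted vector, which findInterval() can perform for every row at once.
N50s <- df$l[findInterval(df$cs / 2, df$cs) + 1]
result <- which(N50s < t)[1] - 1 # NA if the N50 never drops below t
result
# should match the loop's result (21216 with set.seed(111))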

R "non-random" number simulation

This question is in regard to R. I would like to generate a "non-random" sample of 1's and 0's where the next value generated is dependent on the previous value. For example, if the first value in the vector is 1, then there is a 60% chance that the next value is 1; and if the next value is a 0, then there is a 60% chance that the following will be 0. I have attached the question as it was posed to me. I greatly appreciate any help.
Given your conditions, I would generate the sequence using a for loop.
set.seed(111)
n = 10000 # as the Q said, at least 10,000
seq = vector()
seq[1] = sample(0:1, 1) # also given in the Q
for(i in 2:n) {
  if(seq[i-1] == 0){
    seq[i] = sample(0:1, 1, prob = c(0.6, 0.4)) # 60% chance of staying at 0
  }
  else{
    seq[i] = sample(0:1, 1, prob = c(0.4, 0.6)) # 60% chance of staying at 1
  }
}
table(seq)
# seq
#    0    1
# 4961 5039
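Since the chain is symmetric (it keeps the previous value with probability 0.6 whether that value is 0 or 1), the loop can also be vectorised. A sketch of my own: each step switches with probability 0.4, so the current state is just the parity of the number of switches so far.
set.seed(111)
n = 10000
switches = rbinom(n - 1, 1, 0.4) # 1 = switch, 0 = keep the previous value
seq2 = cumsum(c(sample(0:1, 1), switches)) %% 2
table(seq2)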

How to use a while() loop within a for() loop in R

I'm new to R, so most of my code is most likely wrong. However, I was wondering how to use a while() loop within a for() loop. I'm trying to simulate rolling a pair of dice several times: if the total is 2, 3, 7, 11, or 12 then I stop; if the total is 4, 5, 6, 8, 9, or 10 then I continue to roll the dice until the initial total appears again or a 7 comes up. I'm trying to find the average number of rolls it takes to end the game.
count = 0
x = NULL
for (i in 1:10) {
  x[i] = c(sample(1:6,1) + sample(1:6,1))
  if(x[i] == c(2||3||7||11||12)) {
    if(TRUE) {count = count + 1}
  } else {
    while(x[i] == c(4||5||6||8||9||10)) {
      x[i+1] = c(sample(1:6,1) + sample(1:6,1))
      if(x[i+1] == c(x[i]||7)) {
        if(TRUE){count = count + x[i+1]}
      }
    }
  }
}
print(count)
I think there are a few issues with your logic. I'm not quite sure what you're trying to do in your code, but this is my interpretation of your description of the problem ... it only runs a single round of your game. It should work if you embed it in a for loop, though (just don't reset count or re-set the random-number seed inside your loop); then count will give you the total number of rolls, and you can divide by the number of rounds to get the average.
Setup:
count = 0
sscore <- c(2,3,7,11,12)
set.seed(101)
debug = TRUE
Running a single round:
x = sample(1:6,1) + sample(1:6,1) ## initial roll
count = count + 1
if (x %in% sscore) {
  ## don't need to do anything if we hit,
  ## as the roll has already been counted
  if (debug) cat("hit", x, "\n") ## (x, not x[i]: x is a scalar here)
} else {
  ## initialize while loop -- try one more time
  y = sample(1:6,1) + sample(1:6,1)
  count = count + 1
  if (debug) cat("initial", x, "next", y, "\n")
  while(!(y %in% c(x, 7))) {
    y = sample(1:6,1) + sample(1:6,1)
    count = count + 1
    if (debug) cat("keep trying", y, "\n")
  } ## end while
} ## end if-not-hit
print(count)
I tried embedding this in a for loop and got a mean of 3.453 for 1000 rounds, close to #PawelP's answer.
PS I hope this isn't homework, as I prefer not to answer homework questions ...
EDIT: I had a bug - forgot to remove if negation. Now the below seems to be 100% true to your description of the problem.
This is my implementation of the game you've described. It calculates the average number of rolls it took to end the game over a TOTAL_GAMES many games.
TOTAL_GAMES = 1000
counts = rep(0, TOTAL_GAMES)
x = NULL
for (i in 1:TOTAL_GAMES) {
  x_start = sample(1:6,1) + sample(1:6,1) # come-out roll
  counts[i] = counts[i] + 1
  x = x_start
  if(x %in% c(2, 3, 7, 11, 12)){
    next # round ends immediately
  }
  repeat {
    x = sample(1:6,1) + sample(1:6,1)
    counts[i] = counts[i] + 1
    if(x %in% c(x_start, 7)){
      break # hit the initial total or a 7
    }
  }
}
print(mean(counts))
It seems that the average number of rolls is around 3.38
Here's one approach to this question - I made a function that runs a single trial, and another function which conducts a variable number of these trials and returns the cumulative average.
## Single trial
rollDice <- function(){
  init <- sample(1:6,1) + sample(1:6,1)
  rolls <- 1
  if( init %in% c(2,3,7,11,12) ){
    return(1)
  } else {
    Flag <- TRUE
    while( Flag ){
      roll <- sample(1:6,1) + sample(1:6,1)
      rolls <- rolls + 1
      if( roll %in% c(init,7) ){
        Flag <- FALSE
      }
    }
  }
  return(rolls)
}
## Multiple trials
simAvg <- function(nsim = 100){
  x <- replicate(nsim, rollDice())
  Reduce("+", x) / nsim # equivalently, mean(x)
}
##
## Testing
nTrial <- seq(1, 1000, 25)
Results <- sapply(nTrial, function(X){ simAvg(X) })
##
## Plot over varying number of simulations
plot(x = nTrial, y = Results, pch = 20)
As #Ben Bolker pointed out, you had a couple of errors with || (understandable for someone new to R: an expression like 2||3||7 collapses to a single TRUE rather than a set of values to compare against). Also, you'll probably hear it a thousand times, but for and while loops are pretty inefficient in R, so you generally want to avoid them if possible. In the case of the while loop in the rollDice() function above, it probably isn't a big deal, because the probability of the loop executing a large number of times is very low. I used the functions Reduce and replicate to serve the role of a for loop in the second function. Good question though, it was fun to work on.
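As a final sanity check (my own calculation, not part of the original answers): the expected number of rolls has a short closed form. The round ends on the first roll with probability 12/36; otherwise a point s in {4,5,6,8,9,10} is set, and each further roll ends the round with probability P(s) + P(7), so the number of extra rolls is geometric with mean 1/(P(s) + P(7)).
p <- sapply(2:12, function(s) sum(outer(1:6, 1:6, "+") == s)) / 36
points <- c(4, 5, 6, 8, 9, 10)
extra <- 1 / (p[points - 1] + p[7 - 1]) # p[s - 1] is P(total = s)
1 + sum(p[points - 1] * extra)
# [1] 3.375758 -- consistent with the simulated 3.38 and 3.453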
