I am trying to code the following game in R:
Roll a die until you observe a 4 followed by a 6
Count how many rolls it took to observe a 4 followed by a 6
Repeat these first two steps 100 times
Calculate the average number of rolls it took to observe a 4 followed by a 6
I tried to simulate this manually as follows. I first used the "runif" command in R to "roll a die" a large number of times, hoping that a 4 followed by a 6 would eventually appear (I don't know how to code this using "do until" loops). I repeated this 100 times and put all these rolls into a data frame:
roll_1 = floor(runif(100, min=1, max=6))
roll_2 = floor(runif(100, min=1, max=6))
roll_3 = floor(runif(100, min=1, max=6))
roll_4 = floor(runif(100, min=1, max=6))
roll_5 = floor(runif(100, min=1, max=6))
#etc
roll_100 = floor(runif(100, min=1, max=6))
all_rolls = data.frame(roll_1, roll_2, roll_3, roll_4, roll_5, roll_100)
This looks as follows:
head(all_rolls)
roll_1 roll_2 roll_3 roll_4 roll_5 roll_100
1 4 2 5 3 1 4
2 3 2 4 4 1 2
3 1 3 1 4 2 1
4 3 2 1 4 4 3
5 4 1 2 2 5 5
6 2 3 3 5 3 1
I then exported this data frame into Microsoft Excel, manually inspected each column, and recorded the row number at which a 6 first appears immediately after a 4. I then averaged this number over all columns to get the average number of rolls needed before you observe a 4 followed by a 6. This took some time to do, but it worked.
I am looking for a quicker way to do this. Does anyone know if "do until" loops can be used in R to speed up this "game"?
Thanks
Instead of runif, I would sample from 1:6, since a die only has the values 1 to 6 and never values like 1.23. Note also that floor(runif(100, min=1, max=6)) only produces the values 1 to 5, so a 6 can never appear.
This is how you can use while loop -
roll_from_4_to6 <- function() {
  n <- 1:6
  i <- 0                      # number of rolls made so far
  previous_4 <- FALSE         # was the previous roll a 4?
  while (TRUE) {
    current_value <- sample(n, 1)
    i <- i + 1
    if (previous_4 && current_value == 6) break
    previous_4 <- current_value == 4
  }
  i
}
Run it once.
roll_from_4_to6()
Run it 100 times and take the average.
mean(replicate(100, roll_from_4_to6()))
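R has no literal "do until" construct; the closest equivalent is repeat with a break at the end of the body. The following is a minimal sketch of the same game written that way. The function name rolls_until_4_then_6 is made up for this illustration. For a fair die the expected number of rolls is 36, so the simulated mean should land somewhere near that value.

```r
# Hypothetical helper: roll a fair die until a 4 is immediately followed by a 6.
# Returns the total number of rolls, including the final 6.
rolls_until_4_then_6 <- function() {
  n_rolls <- 0
  previous <- NA                      # no previous roll yet
  repeat {
    current <- sample(1:6, 1)         # one roll of the die
    n_rolls <- n_rolls + 1
    # "until" condition: previous roll was a 4 and this one is a 6
    if (!is.na(previous) && previous == 4 && current == 6) break
    previous <- current
  }
  n_rolls
}

set.seed(1)
results <- replicate(100, rolls_until_4_then_6())
mean(results)
```

With only 100 repetitions the average is fairly noisy; increasing the number passed to replicate gives a more stable estimate.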
I considered a different approach to solve this problem, deviating from the exact instructions you received.
Create a sequence of rolls that is extremely large, so you can find 100 cases in which a 6 follows a 4:
x = sample(1:6, 1e6, TRUE)
The mean of rolls needed to get a 6 after a 4 is:
mean(diff(which(x == 6 & data.table::shift(x) == 4)[1:100]))
What you're doing there:
x == 6 & data.table::shift(x) == 4 is a logical vector that is TRUE wherever a 6 is immediately preceded by a 4 (shift() lags x by one position by default), and FALSE elsewhere.
which(x == 6 & data.table::shift(x) == 4)[1:100] gives the indexes of the first 100 of those TRUEs.
diff tells us how many rolls there were between consecutive matches.
mean gives us the average of those gaps.
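The same lagged comparison can be sketched in base R, without data.table, by comparing the vector with itself shifted by one position. This is only an illustration: the names hits and gaps are made up here, and the sequence is shortened to 1e5 rolls to keep it quick.

```r
set.seed(42)
x <- sample(1:6, 1e5, replace = TRUE)

# x[-length(x)] is "this roll", x[-1] is "the next roll".
# Adding 1 turns the position of the 4 into the position of the 6 that follows it.
hits <- which(x[-length(x)] == 4 & x[-1] == 6) + 1

# Number of rolls needed for each occurrence: the first one is counted from
# the start of the sequence, later ones from the previous occurrence.
gaps <- diff(c(0, hits))
mean(gaps)
```

Because the pattern "4 then 6" cannot overlap with itself, the gaps between occurrences in one long sequence behave like independent plays of the game.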
Rolling a die follows a categorical distribution. Using the rcat function from the extraDistr package, you can sample from a categorical distribution:
library(extraDistr)
roll_game <- function() {
count <- 2
dices <- rcat(2, c(1/6 ,1/6, 1/6, 1/6, 1/6, 1/6))
while(!(rev(dices)[2] ==4 && rev(dices)[1] ==6 )){
dices <- c(dices, rcat(1, c(1/6 ,1/6, 1/6, 1/6, 1/6, 1/6)))
count <- count+1
}
count
}
mean(replicate(100, roll_game()))
will get your answer
100 people are watching a show at a theater. At the end of the show, all of them visit the cloakroom to collect their coats. The man working in the cloakroom hands the coats back completely at random. The customers who receive the right coat leave. The others, who received a wrong one, give the coat back, and the man again hands the coats out at random. The process ends when all the customers have received their own coat.
I want to simulate this martingale process in R in order to find the expected time at which the process ends.
But I don't know how. Any help?
Something like:
# 100 customers
x = seq(1,100,by=1);x
# random sample from x
y = sample(x,100,replace=FALSE)
x==y
# for the next iteration exclude those who are TRUE and run it again until everyone is TRUE
The expected time is how many iterations were needed.
Or something like this :
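n = 100
X = seq(1, n, by = 1)          # customers still waiting for their coat
martingale = rep(NA, n)        # round in which each customer got the right coat
iterations = 0
accept = 0
while (length(X) > 0) {
  iterations = iterations + 1
  y = sample(X, length(X), replace = FALSE)   # coats handed out this round
  matched = X == y
  accept = accept + sum(matched)
  martingale[X[matched]] = iterations
  X = X[!matched]
}
accept
iterations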
One way to do this is as follows (using 10 people as an example, the print statement is unnecessary, just to show what's done in each iteration):
set.seed(0)
x <- 1:10
count <- 0
while(length(x) > 0){
x <- x[x != sample(x)]
print(x)
count <- count + 1
}
# [1] 1 2 3 4 5 6 7 9 10
# [1] 3 4 5 6 7 9
# [1] 3 4 5 6 7
# [1] 3 4 5 6 7
# [1] 3 4 5 6 7
# [1] 3 4 5 6 7
# [1] 3 4 5 6 7
# [1] 3 4 5 6 7
# [1] 3 6
#
count
# [1] 10
For each step in the loop, it removes the values of x where the customers have been randomly allocated their coat, until there are none left.
To use this code to get the expected time taken for 100 people, you could extend it to:
set.seed(0)
nits <- 1000 #simulate the problem 1000 times
count <- 0
for (i in 1:nits){
x <- 1:100
while(length(x) > 0){
x <- x[x != sample(x)]
count <- count + 1/nits
}
}
count
# [1] 99.901
I hypothesise without proof that the expected time for n people is n iterations - it seems pretty close when I tried with 50, 100 or 200 people.
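A standard argument points the same way: a uniformly random permutation has, on average, exactly one fixed point, so each round sends home one customer in expectation regardless of how many remain, which suggests an expected time of n rounds. The sketch below checks this empirically for a few small n. The function name rounds_until_done and the chosen sizes are assumptions made for this illustration only.

```r
# Hypothetical helper: number of rounds until all n customers have their own coat.
rounds_until_done <- function(n) {
  remaining <- seq_len(n)
  rounds <- 0
  while (length(remaining) > 0) {
    # Hand the remaining coats out in a random order
    # (sample.int avoids the sample(x) surprise for length-one x).
    handed_out <- remaining[sample.int(length(remaining))]
    remaining <- remaining[remaining != handed_out]
    rounds <- rounds + 1
  }
  rounds
}

set.seed(123)
sizes <- c(5, 10, 20)
estimates <- sapply(sizes, function(n) mean(replicate(2000, rounds_until_done(n))))
rbind(sizes, estimates)
```

Each estimate should sit close to the corresponding n, in line with the hypothesis above.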
I didn't follow your discussion above and I'm not entirely sure if that's what you want, but my rationale was as follows:
You have N people and queue them.
In the first round the first person has a chance of 1/N to get their clothes right.
At this point you have two options. Either person 1 gets their clothes right or not.
If person 1 gets their clothes right, then person 2 has a chance of 1/(N-1) to get their clothes right. If person 1 didn't get the correct clothes, person 1 remains in the pool (at the end), and person 2 also has a 1/N probability to get their clothes right.
You continue to assign these probabilities until all N persons have seen the clerk once. Then you sort out those who have the right clothes and repeat at step 1 until everyone has their clothes right.
For simulation purposes, you'd of course repeat the whole thing 1000 or 10000 times.
If I understand you correctly, you are interested in the number of iterations, i.e. how often the clerk has to go through the whole queue (or what remains of it) until everyone has their clothes.
library(tidyverse)
people <- 100
results <- data.frame(people = 1:people,
iterations = NA)
counter <- 0
finished <- 0
while (finished < people)
{
loop_people <- results %>%
filter(is.na(iterations)) %>%
pull(people)
loop_prob <- 1/length(loop_people)
loop_correct <- 0
for (i in 1:length(loop_people))
{
correct_clothes_i <- sample(c(0,1), size = 1, prob = c(1-loop_prob, loop_prob))
if (correct_clothes_i == 1)
{
results[loop_people[i], 2] <- counter + 1
loop_correct <- loop_correct + 1
loop_prob <- 1/(length(loop_people) - loop_correct)
}
}
counter <- counter + 1
finished <- length(which(!is.na(results$iterations)))
}
max(results$iterations)
[1] 86
head(results)
people iterations
1 1 7
2 2 42
3 3 86
4 4 67
5 5 2
6 6 9
The results$iterations column contains the iteration number where each person has gotten their clothes right, thus max(results$iterations) gives you the total number of loops.
I have no proof, but empirically and intuitively the number of required iterations should approach N.
So let's say I roll 5 dice.
The code to simulate the rolls would be
Rolls<-sample(1:6, 5, replace=TRUE)
and that's if I want to store my rolls under the object Rolls.
Now let's say for some reason I don't want there to be more than 2 sixes. That means if I roll, for example, 6 3 5 6 6 1 would I be able to re-roll one of the 6 values into a new value so that there are only 2 values of 6 and 4 values that are not 6?
Any support would be appreciated.
Thanks in advance
A solution without loops could be:
condition = which(Rolls==6)
if(length(condition)>=3){
Rolls[condition[3:length(condition)]] = sample(1:5, length(condition)-2, replace=TRUE)
}
condition holds the positions of the 6s in Rolls; if there are more than 2, you select the third one onward, Rolls[condition[3:length(condition)]], and re-sample them from 1:5 so that no new 6 can appear.
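The same idea can be wrapped in a small function. This is a sketch only: cap_sixes and max_sixes are names invented for this example, and it keeps the choice made above of drawing the replacements from 1:5, which means a re-rolled die can never show a 6.

```r
# Hypothetical helper: keep at most `max_sixes` sixes, re-rolling the surplus ones.
cap_sixes <- function(rolls, max_sixes = 2) {
  six_pos <- which(rolls == 6)                       # positions of all sixes
  surplus <- six_pos[seq_along(six_pos) > max_sixes] # third six onward (by default)
  # Replacement values are drawn from 1:5, so no new six is introduced.
  rolls[surplus] <- sample(1:5, length(surplus), replace = TRUE)
  rolls
}

capped <- cap_sixes(c(6, 3, 5, 6, 6, 1))
capped
```

Only the surplus positions change; the first two sixes and all other values stay where they were.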
And the second question could be something like:
remove = 3
Rolls = Rolls[-which(Rolls==remove)[1]]
You can easily put those into functions if you like
Edit 1
To make the second answer a bit more interactive, you can build a function for it:
remove.roll = function(remove, rolls){
  rolls[-which(rolls==remove)[1]]   # drop the first occurrence of `remove`
}
And then the user can call the function with whatever value of remove they like. You can also make a program that takes information from the prompt:
remove = as.numeric(readline(prompt="Enter number to remove: "))
Rolls = Rolls[-which(Rolls==remove)[1]]
print(Rolls)
if i understood it correctly, that should work:
n <- 10
(Rolls<-sample(1:6, n, replace=TRUE))
#> [1] 6 2 4 1 1 6 5 2 1 6
(Nr_of_six <- sum(6 == Rolls))
#> [1] 3
while (Nr_of_six > 1) {
extra_roll <- sample(1:6, 1, replace=TRUE)
second_six <- which(Rolls==6)[2]
Rolls[second_six] <- extra_roll
print(Rolls)
Nr_of_six <- sum(6 == Rolls)
}
#> [1] 6 2 4 1 1 4 5 2 1 6
#> [1] 6 2 4 1 1 4 5 2 1 3
print(Rolls)
#> [1] 6 2 4 1 1 4 5 2 1 3
Created on 2021-03-21 by the reprex package (v1.0.0)
We can make this a fun demonstration of a use case for scan(). You can input the position of the values that you want to replace. Note that you need to hand scan() each position value piece by piece and hit enter after every one, in the end you can end the input by handing over an empty string "" and pressing enter.
Code
dice.roll <- function(){
# Initial toss
Rolls <- sample(seq(1, 6), 5, replace=TRUE)
# Communicate
cat("The outcome of the dice roll was:", "\n\n", Rolls, "\n\n",
"would you like to reroll any of those values ?", "\n",
"If yes enter the positions of the values you would \n like to replace, else just input an empty string.")
# Take input
tmp1 <- scan(what = "")
# Replace
Rolls[as.numeric(tmp1)] <- sample(seq(1, 6), length(tmp1), replace=TRUE)
# Return
cat("You succesfully replaced", length(tmp1), "elements. Your rolls now look as follows: \n\n",
Rolls)
}
dice.roll()
# The outcome of the dice Roll was:
#
# 6 4 6 3 4
#
# would you like to reroll any of those values ?
# If yes enter the positions of the values you would
# like to replace, else just input an empty string.
# 1: 1
# 2: 3
# 3: ""
# Read 2 items
# You succesfully replaced 2 elements. Your set now looks as follows
#
# 2 4 2 3 4
Please note that this function is just a quick write-up. To implement this properly, you should use a while statement or recursion to repeat the replacement as often as you'd like. Additionally, before actually using this, one would have to insert if statements that handle inputs that are too long and other user behavior that could cause an error.
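As a rough idea of what such input handling could look like, here is a sketch that separates the replacement step from the interactive part. The helper name reroll_positions is hypothetical, and the rule it applies (silently ignoring anything that is not a valid position) is just one possible choice.

```r
# Hypothetical helper: re-roll only the valid positions among the user's input.
reroll_positions <- function(rolls, positions) {
  # Input from scan(what = "") arrives as character; non-numbers become NA.
  positions <- suppressWarnings(as.integer(positions))
  in_range <- !is.na(positions) & positions >= 1 & positions <= length(rolls)
  valid <- unique(positions[in_range])    # drop invalid and duplicate positions
  rolls[valid] <- sample(1:6, length(valid), replace = TRUE)
  rolls
}

out <- reroll_positions(c(6, 4, 6, 3, 4), c("1", "3", "9", "abc"))
out
```

Inside dice.roll(), the line that assigns into Rolls could call a helper like this with tmp1 instead of indexing directly.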
Here is my version of this function, which uses recursion to re-roll values until there are no more than two 6s. Note that I put the rolls vector outside of the function, so in order to replace the third, fourth, ... 6 from inside the function we use the superassignment operator <<-.
I personally chose to modify the first 6 whenever there are three or more 6s.
rolls <- sample(1:6, 6, replace = TRUE)
n_six <- function() {
n <- length(rolls[rolls == 6])
if(n <= 2) {
return(rolls)
} else {
extra <- sample(1:6, 1, replace = TRUE)
rolls[which(rolls == 6)][1] <<- extra
}
n_six()
}
# Imagine our rolls would be a vector with 3 six values like this
rolls <- c(1, 2, 6, 5, 6, 6)
> n_six()
[1] 1 2 3 5 6 6 # First 6 was replaced
# Or our rolls contains 4 six values
rolls <- c(1, 6, 6, 5, 6, 6)
> n_six()
[1] 1 4 1 5 6 6 # First 2 6s have been replaced
And so on ...
My dataframe contains a column with various touch points, numbers 1 till 18. I want to know which touch point results in touch point 10. Therefore I want to create a new column which shows the touch point which occurred before touch point 10 per customer journey (PurchaseID). If touch point 10 doesn't occur in a customer journey the value can be NULL or 0.
So for example:
dd <- read.table(text="
PurchaseId TouchPoint DesiredOutcome
1 8 6
1 6 6
1 10 6
2 12 0
2 8 0
3 17 4
3 3 4
3 4 4
3 10 4", header=TRUE)
The complete dataset contains 2.500.000 observations. Does anyone know how to solve my problem? Thanks in advance.
Firstly, it is better to give a complete reproducible sample code. I suggest you look at the data.table library which is nice for handling large datasets.
library(data.table)
mdata <- matrix(sample(x = c(1:20, 21), size = 15*10, replace = TRUE), ncol = 10)
mdata[mdata==21] <- NA
mdata <- data.frame(mdata)
names(mdata) <- paste0("cj", 1:10)
df_touch <- data.table(mdata)
# -- using for
res <- rep(0, nrow(df_touch))
for( i in 1:10){
cat(i, "\n")
res[i] <- i*df_touch[, (10 %in% get(paste0("cj", i)))]
cat(res[i], "\n")
}
# -- using lapply
dfun <- function(x, k = 10){ return( k %in% x ) }
df_touch[, lapply(.SD, dfun)]
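For the layout shown in the question (one row per touch point), the desired column can also be sketched in base R with ave(). This assumes the rows are already in journey order within each PurchaseId and that only the first occurrence of touch point 10 matters. The helper name prev_before_10 and the column name PrevTouchPoint are invented for this example.

```r
dd <- data.frame(
  PurchaseId = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  TouchPoint = c(8, 6, 10, 12, 8, 17, 3, 4, 10)
)

# Hypothetical helper: touch point right before the first 10 in one journey,
# or 0 if there is no 10 (or the 10 is the very first touch point).
prev_before_10 <- function(tp) {
  pos <- which(tp == 10)[1]
  if (is.na(pos) || pos == 1) 0 else tp[pos - 1]
}

# ave() applies the helper per PurchaseId and recycles the result to every row.
dd$PrevTouchPoint <- ave(dd$TouchPoint, dd$PurchaseId, FUN = prev_before_10)
dd
```

On 2,500,000 rows a grouped data.table or dplyr call applying the same helper per PurchaseId is likely to be faster, but the logic stays the same.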
I am trying to group a column of my data.frame/data.table into three groups, all with equal sums.
The data is first ordered from smallest to largest, such that group one would be made up of a large number of rows with small values, and group three would have a small number of rows with large values. This is accomplished in spirit with:
test <- data.frame(x = as.numeric(1:100000))
store <- 0
total <- sum(test$x)
for(i in 1:100000){
store <- store + test$x[i]
if(store < total/3){
test$y[i] <- 1
} else {
if(store < 2*total/3){
test$y[i] <- 2
} else {
test$y[i] <- 3
}
}
}
While successful, I feel like there must be a better way (and maybe a very obvious solution that I am missing).
I never like resorting to loops, especially with nested ifs, when a vectorized approach is available - with even 100,000+ records this code becomes quite slow
This method would become impossibly complex to code to a larger number of groups (not necessarily the looping, but the ifs)
Requires pre-ordering of the column. Might not be able to get around this one.
As a nuance (not that it makes a difference) but the data to be summed would not always (or ever) be consecutive integers.
Maybe with cumsum:
test$z <- cumsum(test$x) %/% (ceiling(sum(test$x) / 3)) + 1
This is more or less a bin-packing problem.
Use the binPack function from the BBmisc package:
library(BBmisc)
test$bins <- binPack(test$x, sum(test$x)/3+1)
The sums of the 3 bins are nearly identical:
tapply(test$x, test$bins, sum)
1 2 3
1666683334 1666683334 1666683332
I thought that the cumsum/integer-division approach was very elegant, but it does return a somewhat irregular allocation:
> tapply(test$x, test$z, sum)
1 2 3
1666636245 1666684180 1666729575
> sum(test$x)/3
[1] 1666683333
So I thought I would first create a random permutation and offer something similar:
test$x <- sample(test$x)
test$z2 <- cumsum(test$x)[ findInterval(cumsum(test$x),
c(0, 1666683333*(1:2), sum(test$x)+1))]
> tapply(test$x, test$z2, sum)
91099 116379 129539
1666676164 1666686837 1666686999
This also achieves a more even distribution of counts:
> table(test$z2)
91099 116379 129539
33245 33235 33520
> table(test$z)
1 2 3
57734 23915 18351
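I must admit to puzzlement regarding the naming of the entries in z2; presumably it comes from using the interval number to index cumsum(test$x), which returns the first three cumulative sums rather than the labels 1, 2 and 3.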
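If plain group labels are wanted, the interval number returned by findInterval() can be used directly. The following is a minimal sketch on a smaller vector; the names breaks, group and sums are chosen for this example only.

```r
x <- as.numeric(1:1000)
total <- sum(x)

# Two inner cut points at one third and two thirds of the total.
breaks <- total * (1:2) / 3

# findInterval() returns 0, 1 or 2 here; adding 1 gives group labels 1, 2, 3.
group <- findInterval(cumsum(x), breaks) + 1

sums <- tapply(x, group, sum)
sums
table(group)
```

Each group sum can differ from total/3 by at most the largest single value, because a row is never split across groups.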
Or you can just cut on the cumsum
test$z <- cut(cumsum(test$x), breaks = 3, labels = 1:3)
or use ggplot2::cut_interval instead of cut:
test$z <- ggplot2::cut_interval(cumsum(test$x), n = 3, labels = 1:3)
You can use fold() from groupdata2 and get an almost equal number of elements per group:
# Create data frame
test <- data.frame(x = as.numeric(1:100000))
# Use fold() to create 3 numerically balanced groups
test <- groupdata2::fold(test, k = 3, num_col = "x")
# Watch first 10 rows
head(test, 10)
## # A tibble: 10 x 2
## # Groups: .folds [3]
## x .folds
## <dbl> <fct>
## 1 1 1
## 2 2 3
## 3 3 2
## 4 4 1
## 5 5 2
## 6 6 2
## 7 7 1
## 8 8 3
## 9 9 2
## 10 10 3
# Check the sum and number of elements per group
# (the %>% pipe needs dplyr or magrittr to be attached)
library(dplyr)
test %>%
dplyr::group_by(.folds) %>%
dplyr::summarize(sum_ = sum(x),
n_members = dplyr::n())
## # A tibble: 3 x 3
## .folds sum_ n_members
## <fct> <dbl> <int>
## 1 1 1666690952 33333
## 2 2 1666716667 33334
## 3 3 1666642381 33333
The experiment involves rolling a fair die and getting x say, then tossing a fair coin x number of times and recording the number of tails. I need to do this experiment 50 times and record the outcomes in a vector, (which I'll then use to plot a histogram.)
This is my code so far:
for (i in 1:100)
{X <- sample(6,1,replace=TRUE,c(1,1,1,1,1,1)/6)
Y <- sample(2,1,replace=TRUE,c(1,1)/2)}
Youtcomes <- c(sum(Y))
Youtcomes
But instead of giving me a vector with 100 elements, I keep getting just a single number. Where am I going wrong?
Note: I have to use a for loop.
Use the fact that R is vectorized. You can then use a binomial distribution to replicate the coin toss.
heads <- rbinom(size = sample(6,100, replace = TRUE), n=100, prob = 0.5)
sum(heads)
Perhaps I have missed something, but what is wrong with one call to sample() to do the 100 rolls of the dice, and then plug that into rbinom() to do the coin tosses? We pass the output from sample() to the size argument
> set.seed(1)
> rbinom(100, size = sample(6, 100, replace = TRUE), prob = 0.5)
[1] 1 1 1 6 1 2 2 2 3 1 2 1 2 1 1 0 3 1 1 3 6 1 2 0 2 1 1 1 2 2 2 1 0 1 4 3 3
[38] 1 5 2 3 2 2 1 3 2 0 2 1 4 2 3 1 1 1 0 1 1 1 1 2 2 1 2 3 1 0 2 1 2 2 4 2 1
[75] 1 5 3 2 3 5 1 2 3 1 4 0 3 1 2 1 1 0 1 5 2 3 0 2 2 3
Disclaimer: (very inefficient solution; see mnel/Gavin's solution)
As you can read the many, many, .. MANY comments underneath each of the answers, while this answer attempts to answer OP's specific question(however inefficient his requirements maybe), in the spirit of maintaining decorum of the forum, some have (rightly) pointed out that the question is in bad taste and my answer doesn't do justice to the forum requirements. I accept all criticism and leave the answer here only for obvious reasons (marked as answer, continuity). I suggest you look at mnel/Gavin's answer for a vectorised solution to this specific problem. If you're interested in looking at an implementation of for-loop, then refer to the bottom of this post, but I suggest you look at it to know the structure of for-loop, but not implement a for-loop to this specific problem. Thank you.
Your code is riddled with quite a few problems, apart from the main problem @Joshua already mentioned:
First, you rewrite every time the values of X and Y inside the loop so, at the end of the loop, there is only the last value of Y that is being summed up.
Second, your code for Y is not correct. You say, you have to get x amount of coin tosses, Yet, you use sample(2, 1, ...). The 1 must be replaced with X which equals the number from the die roll.
Try out this code instead:
Youtcomes <- sapply(1:100, function(x) {
X <- sample(1:6, 1, replace=TRUE, rep(1,6)/6)
Y <- sample(c("H", "T"), X, replace=TRUE, rep(1,2)/2)
sum(Y == "T")
})
Here, we loop over 100 times, and each time, sample values between 1 and 6 and store in X. Then, we sample either head (H) or tail (T) X number of times and store in Y.
Now, sum(Y == "T") gives the sum for current value of x (1 <= x <= 100). So, at the end, Youtcomes will be your set of simulated Y == Tail values.
Then, you can do a hist(Youtcomes).
Edit: If it's a for-loop solution that's desired, then:
# always assign the variable you'll index inside for-loop
# else the object will keep growing every time and a copy of
# entire object is made for every i, which makes it extremely
# slow/inefficient.
Youtcomes <- rep(0, 100)
for (i in 1:100) {
X <- sample(1:6, 1, replace=TRUE, rep(1,6)/6)
Y <- sample(c("H", "T"), X, replace=TRUE, rep(1,2)/2)
# assign output inside the loop with [i] indexing
Youtcomes[i] <- sum(Y == "T")
# since Youtcomes is assigned a 100 values of 0's before
# the values will replace 0' at each i. Thus the object
# is not copied every time. This is faster/efficient.
}
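The loop above can also be combined with the binomial idea from the other answers: inside each iteration, one call to rbinom() stands in for tossing the coin X times. This is only a sketch; the names n_experiments, die and tails are made up for it.

```r
set.seed(2024)
n_experiments <- 100
tails <- integer(n_experiments)       # pre-allocate the result vector

for (i in seq_len(n_experiments)) {
  die <- sample(1:6, 1)               # roll the die once
  # Number of tails in `die` fair coin tosses
  tails[i] <- rbinom(1, size = die, prob = 0.5)
}

tails
hist(tails, breaks = seq(-0.5, 6.5, by = 1))
```

The explicit breaks put each possible count from 0 to 6 in its own bar.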
Arun beat me to it. But another of the many, many ways could be (if I understand your desired outcome correctly):
X <- sample(6,100,replace=TRUE,c(1,1,1,1,1,1)/6)
# factor() with fixed levels keeps a "T" count of 0 when no tails were tossed
Y <- lapply(X , function(x){ res <- sample( c( "H" , "T" ) , x , replace=TRUE , c(1,1)/2 ) ; table( factor( res , levels = c( "H" , "T" ) ) ) } )
You want to histogram the results....
res <- unlist(Y)
hist( res[names( res )=="T"] )