How to simulate a martingale process problem in R?

100 people are watching a show at a theater. At the end of the show, all of them visit the cloakroom to collect their coats. The attendant hands the coats back completely at random. The participants who receive their own coat leave; the others return the wrong coat, and the attendant again hands the coats back at random. The process ends when every customer has their own coat back.
I want to simulate this martingale process in R in order to find the expected time until the process ends.
But I don't know how. Any help?
Something like:
# 100 customers
x = seq(1,100,by=1);x
# random sample from x
y = sample(x,100,replace=FALSE)
x==y
# for the next iteration exclude those who are TRUE and run it again until everyone is TRUE
The expected time is how many iterations were needed.
Or something like this :
n = 100
X = seq(1,100,by=1)
martingale = rep(NA,n)
iterations = 0
accept = 0
while (X != n) {
  iterations = iterations + 1
  y = sample(1:100,100,replace=FALSE)
  if (X == y){
    accept = accept + 1
    X = X+1
    martingale[X] = y
  }
}
accept
iterations

One way to do this is as follows (using 10 people as an example; the print statement is unnecessary, it is just there to show what happens in each iteration):
set.seed(0)
x <- 1:10
count <- 0
while(length(x) > 0){
  x <- x[x != sample(x)]
  print(x)
  count <- count + 1
}
# [1] 1 2 3 4 5 6 7 9 10
# [1] 3 4 5 6 7 9
# [1] 3 4 5 6 7
# [1] 3 4 5 6 7
# [1] 3 4 5 6 7
# [1] 3 4 5 6 7
# [1] 3 4 5 6 7
# [1] 3 4 5 6 7
# [1] 3 6
# integer(0)
count
# [1] 10
At each step of the loop, it removes the values of x where the customers have been randomly allocated their own coat, until there are none left.
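To see what a single step does, here is an illustrative sketch with 5 people (perm is just a name for the random reallocation):
x <- 1:5
perm <- sample(x)   # coats handed back in random order
rbind(person = x, coat = perm)
x[x != perm]        # the people who got the wrong coat and must try again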
To use this code to get the expected time taken for 100 people, you could extend it to:
set.seed(0)
nits <- 1000 #simulate the problem 1000 times
count <- 0
for (i in 1:nits){
  x <- 1:100
  while(length(x) > 0){
    x <- x[x != sample(x)]
    count <- count + 1/nits
  }
}
count
# [1] 99.901
I hypothesise, without proof, that the expected time for n people is n iterations; it seems pretty close when I tried with 50, 100 or 200 people.
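One way to check this hypothesis is to wrap the simulation above in a function and run it for several group sizes (a minimal sketch; expected_rounds is an arbitrary name):
expected_rounds <- function(n, nits = 1000) {
  count <- 0
  for (i in 1:nits) {
    x <- 1:n
    while (length(x) > 0) {
      x <- x[x != sample(x)]
      count <- count + 1/nits
    }
  }
  count
}
sapply(c(50, 100, 200), expected_rounds)
# each estimate should land close to 50, 100 and 200 respectively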

I didn't follow your discussion above and I'm not entirely sure if this is what you want, but my rationale was as follows:
You have N people and queue them.
In the first round, the first person has a 1/N chance to get their clothes right.
At this point you have two options: either person 1 gets their clothes right or not.
If person 1 gets their clothes right, then person 2 has a 1/(N-1) chance to get their clothes right. If person 1 didn't get the correct clothes, person 1 remains in the pool (at the end), and person 2 also has a 1/N probability to get their clothes right.
You continue to assign these probabilities until all N persons have seen the clerk once. Then you sort out those who have the right clothes and repeat from step 1 until everyone has their clothes right.
For simulation purposes, you would of course repeat the whole thing 1000 or 10000 times.
If I understand you correctly, you are interested in the number of iterations, i.e. how often the clerk has to go through the whole queue (or what remains of it) until everyone has their clothes back.
library(tidyverse)
people <- 100
results <- data.frame(people = 1:people,
                      iterations = NA)
counter <- 0
finished <- 0
while (finished < people) {
  loop_people <- results %>%
    filter(is.na(iterations)) %>%
    pull(people)
  loop_prob <- 1/length(loop_people)
  loop_correct <- 0
  for (i in 1:length(loop_people)) {
    correct_clothes_i <- sample(c(0,1), size = 1, prob = c(1-loop_prob, loop_prob))
    if (correct_clothes_i == 1) {
      results[loop_people[i], 2] <- counter + 1
      loop_correct <- loop_correct + 1
      loop_prob <- 1/(length(loop_people) - loop_correct)
    }
  }
  counter <- counter + 1
  finished <- length(which(!is.na(results$iterations)))
}
max(results$iterations)
[1] 86
head(results)
people iterations
1 1 7
2 2 42
3 3 86
4 4 67
5 5 2
6 6 9
The results$iterations column contains the iteration number where each person has gotten their clothes right, thus max(results$iterations) gives you the total number of loops.
I have no proof, but empirically and intuitively the number of required iterations should approach N: in each round, the expected number of people who get their clothes right is roughly one, so on average about one person leaves the queue per round.

Related

Simulating Dice Rolls in R

I am trying to code the following game in R:
Roll a dice until you observe a 4 followed by a 6
Count how many times it took you to observe a 4 followed by a 6
Repeat these first two steps 100 times
Calculate the average number of times it took to observe a 4 followed by a 6
I tried to manually simulate this as follows: I first used the runif command in R to "roll a die" a large number of times, hoping to eventually see a 4 followed by a 6 (I don't know how to code this using "do until" loops). I repeated this 100 times and put all these rolls into a data frame:
roll_1 = floor(runif(100, min=1, max=6))
roll_2 = floor(runif(100, min=1, max=6))
roll_3 = floor(runif(100, min=1, max=6))
roll_4 = floor(runif(100, min=1, max=6))
roll_5 = floor(runif(100, min=1, max=6))
#etc
roll_100 = floor(runif(100, min=1, max=6))
all_rolls = data.frame(roll_1, roll_2, roll_3, roll_4, roll_5, roll_100)
This looks as follows:
head(all_rolls)
roll_1 roll_2 roll_3 roll_4 roll_5 roll_100
1 4 2 5 3 1 4
2 3 2 4 4 1 2
3 1 3 1 4 2 1
4 3 2 1 4 4 3
5 4 1 2 2 5 5
6 2 3 3 5 3 1
I then exported this data frame into Microsoft Excel, manually inspected each column, and counted the row number at which a 6 appears preceded by a 4. I then averaged this number over all columns to calculate the average number of rolls needed before observing a 4 followed by a 6. This took some time to do, but it worked.
I am looking for a quicker way to do this. Does anyone know if "do until" loops can be used in R to accelerate this "game"?
Thanks
Instead of runif, I would sample the values 1:6, since a die only takes the integer values 1 to 6 and never values like 1.23. (Note also that floor(runif(100, min = 1, max = 6)) can never produce a 6, so the pattern you are looking for would never appear.)
This is how you can use a while loop -
roll_from_4_to6 <- function() {
  n <- 1:6
  i <- 0   # number of rolls so far (starting at 1 would over-count by one)
  previous_4 <- FALSE
  while(TRUE) {
    current_value <- sample(n, 1)
    i <- i + 1
    if(previous_4 && current_value == 6) break
    previous_4 <- current_value == 4
  }
  i
}
Run it once.
roll_from_4_to6()
Run it 100 times and take the average.
mean(replicate(100, roll_from_4_to6()))
I considered a different approach to solve this problem, deviating from the exact instructions you received.
Create a sequence of rolls that is extremely large, so you can find 100 cases in which a 6 follows a 4:
x = sample(1:6, 1e6, TRUE)
The mean number of rolls needed to get a 6 after a 4 is:
mean(diff(which(x == 6 & data.table::shift(x) == 4)[1:100]))
What you're doing there:
x == 6 & data.table::shift(x) == 4 is a logical vector with a TRUE at every position where the previous roll was a 4 and the current roll is a 6 (shift lags the vector by one).
which(x == 6 & data.table::shift(x) == 4)[1:100] gives the indexes of the first 100 of those TRUEs.
diff tells us how many rolls there were between consecutive matches.
mean gives us the average of those gaps, which should be close to the theoretical expectation of 36 rolls.
Sampling a die roll follows a categorical distribution, so you can use the rcat function from the extraDistr package:
library(extraDistr)
roll_game <- function() {
  count <- 2
  dices <- rcat(2, rep(1/6, 6))
  while(!(rev(dices)[2] == 4 && rev(dices)[1] == 6)){
    dices <- c(dices, rcat(1, rep(1/6, 6)))
    count <- count + 1
  }
  count
}
mean(replicate(100, roll_game()))
will give you the answer.

Binary Search like concept to create subset data in R

I have the dataset w and key variable x below, for two cases.
Case 1:
x = 4
w = c(1,2,4,4,4,4,6,7,8,9,10,11,12,14,15)
Case 2:
x = 12
w = c(1,2,4,4,4,4,6,7,8,9,10,11,12,14,15)
I want to create a function which will search for x in the dataset w and subset the original dataset into a smaller one based on x's location in w. The output will be a smaller dataset whose upper-bound value equals the search key. Below is the function I am trying to write in R:
create_chunk <- function(val, tab, L=1L, H=length(tab))
{
  if(H >= L)
  {
    mid = L + ((H-L)/2)
    ## If the element is present within middle length
    if(tab[mid] > val)
    {
      ## subset the original data in reduced size and again do mid position value checking
      ## then subset the data
    } else
    {
      mid = mid + (mid/2)
      ## Increase the mid position to go for right side checking
    }
  }
}
The output I am looking for is below:
Output for Case 1:
Dataset containing: 1,2,4,4,4,4
Output for Case 2:
Dataset containing: 1,2,4,4,4,4,6,7,8,9,10,11,12
Please note:
1. The dataset may contain duplicate values for the search key, and all the duplicate values are expected in the output dataset.
2. I have huge datasets (around 2M rows) from which I am trying to subset a smaller dataset based on the search key.
New Update: Case 3
Input Data:
date value size stockName
1 2016-08-12 12:44:43 10093.40 4 HWA IS Equity
2 2016-08-12 12:44:38 10093.35 2 HWA IS Equity
3 2016-08-12 12:44:47 10088.00 2 HWA IS Equity
4 2016-08-12 12:44:52 10089.95 1 HWA IS Equity
5 2016-08-12 12:44:53 10089.95 1 HWA IS Equity
6 2016-08-12 12:44:54 10088.95 1 HWA IS Equity
The search key is 10089.95 in the value column.
Expected Output is:
date value size stockName
1 2016-08-12 12:44:47 10088.00 2 HWA IS Equity
2 2016-08-12 12:44:54 10088.95 1 HWA IS Equity
3 2016-08-12 12:44:52 10089.95 1 HWA IS Equity
4 2016-08-12 12:44:53 10089.95 1 HWA IS Equity
You could do this, which takes care of duplicate values; in case of duplicates, the highest matching position is returned. Please note that A must be in non-decreasing order.
binSearch <- function(A, value, left=1, right=length(A)){
  if (left > right)
    return(-1)
  middle <- (left + right) %/% 2
  if (A[middle] == value){
    # walk right past any duplicates (guarded so we don't run off the end)
    while (middle <= length(A) && A[middle] == value)
      middle <- middle + 1
    return(middle - 1)
  }
  else {
    if (A[middle] > value)
      return(binSearch(A, value, left, middle - 1))
    else
      return(binSearch(A, value, middle + 1, right))
  }
}
w[1:binSearch(w,x1)]
# [1] 1 2 4 4 4 4
w[1:binSearch(w,x2)]
# [1] 1 2 4 4 4 4 6 7 8 9 10 11 12
However, as it's mentioned in the comments, you could simply use findInterval to achieve the same:
w[1:findInterval(x1,w)]
As you know, binary search is O(log(N)); as stated in ?findInterval, findInterval also enjoys logarithmic complexity here, since the length of its first argument is one:
The function findInterval finds the index of one vector x in another, vec, where the latter must be non-decreasing. Where this is trivial, equivalent to apply( outer(x, vec, ">="), 1, sum), as a matter of fact, the internal algorithm uses interval search ensuring O(n * log(N)) complexity where n <- length(x) (and N <- length(vec)). For (almost) sorted x, it will be even faster, basically O(n).
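For the duplicated key in Case 1, both approaches agree on the highest matching position (a quick check, using w and binSearch from above):
findInterval(4, w)
# [1] 6
binSearch(w, 4)
# [1] 6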
EDIT
As per your edit and your new setting, you could do this (suppose your data is in df):
o <- order(df$value)
rows <- o[1:findInterval(key, df$value[o])]
df[rows,]
Or equivalently, using the proposed binSearch function:
o <- order(df$value)
rows <- o[1:binSearch(df$value[o], key)]
df[rows,]
data
x1 <- 4
x2 <- 12
w <- c(1,2,4,4,4,4,6,7,8,9,10,11,12,14,15)
key <- 10089.95
Here is a very simple solution, and you can build your function from these commands. Of course you have to check whether x is in w, but that's your part :-)
x <- 12
w <- c(1,2,4,4,4,4,6,7,8,9,10,11,12,14,15)
index <- which(x == w)
w_new <- w[1:index[length(index)]]
print(w_new)
#[1] 1 2 4 4 4 4 6 7 8 9 10 11 12
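For completeness, here is a minimal sketch of such a function with the membership check included (the name create_chunk comes from the question; returning the full vector on a miss is an arbitrary choice):
create_chunk <- function(x, w) {
  index <- which(x == w)
  if (length(index) == 0) {
    warning("search key not found; returning the full dataset")
    return(w)
  }
  w[1:index[length(index)]]
}
create_chunk(12, w)
#[1] 1 2 4 4 4 4 6 7 8 9 10 11 12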

Creating groups of equal sum in R

I am trying to group a column of my data.frame/data.table into three groups, all with equal sums.
The data is first ordered from smallest to largest, such that group one would be made up of a large number of rows with small values, and group three would have a small number of rows with large values. This is accomplished in spirit with:
test <- data.frame(x = as.numeric(1:100000))
store <- 0
total <- sum(test$x)
for(i in 1:100000){
  store <- store + test$x[i]
  if(store < total/3){
    test$y[i] <- 1
  } else {
    if(store < 2*total/3){
      test$y[i] <- 2
    } else {
      test$y[i] <- 3
    }
  }
}
While successful, I feel like there must be a better way (and maybe a very obvious solution that I am missing):
1. I never like resorting to loops, especially with nested ifs, when a vectorized approach is available; with even 100,000+ records this code becomes quite slow.
2. This method would become impossibly complex to code for a larger number of groups (not necessarily the looping, but the ifs).
3. It requires pre-ordering of the column. (I might not be able to get around this one.)
4. As a nuance (not that it makes a difference), the data to be summed would not always (or ever) be consecutive integers.
Maybe with cumsum:
test$z <- cumsum(test$x) %/% (ceiling(sum(test$x) / 3)) + 1
This is more or less a bin-packing problem.
Use the binPack function from the BBmisc package:
library(BBmisc)
test$bins <- binPack(test$x, sum(test$x)/3+1)
The sums of the 3 bins are nearly identical:
tapply(test$x, test$bins, sum)
1 2 3
1666683334 1666683334 1666683332
I thought that the cumsum/integer-division approach was very elegant, but it does return a somewhat irregular allocation:
> tapply(test$x, test$z, sum)
1 2 3
1666636245 1666684180 1666729575
> sum(test)/3
[1] 1666683333
So I thought I would first create a random permutation and offer something similar:
test$x <- sample(test$x)
test$z2 <- cumsum(test$x)[findInterval(cumsum(test$x),
                                       c(0, 1666683333*(1:2), sum(test$x)+1))]
> tapply(test$x, test$z2, sum)
91099 116379 129539
1666676164 1666686837 1666686999
This also achieves a more even distribution of counts:
> table(test$z2)
91099 116379 129539
33245 33235 33520
> table(test$z)
1 2 3
57734 23915 18351
I must admit to some puzzlement regarding the naming of the entries in z2; they arise because z2 indexes cumsum(test$x) by the interval number, so the group labels are simply the first three cumulative sums.
Or you can just cut on the cumsum
test$z <- cut(cumsum(test$x), breaks = 3, labels = 1:3)
or use ggplot2::cut_interval instead of cut:
test$z <- cut_interval(cumsum(test$x), n = 3, labels = 1:3)
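Either way, you can check how even the split is with tapply (assuming test and test$z from above):
tapply(test$x, test$z, sum)
# the three group sums should be roughly equal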
You can use fold() from groupdata2 and get an almost equal number of elements per group:
# Create data frame
test <- data.frame(x = as.numeric(1:100000))
# Use fold() to create 3 numerically balanced groups
test <- groupdata2::fold(test, k = 3, num_col = "x")
# Look at the first 10 rows
head(test, 10)
## # A tibble: 10 x 2
## # Groups: .folds [3]
## x .folds
## <dbl> <fct>
## 1 1 1
## 2 2 3
## 3 3 2
## 4 4 1
## 5 5 2
## 6 6 2
## 7 7 1
## 8 8 3
## 9 9 2
## 10 10 3
# Check the sum and number of elements per group
library(dplyr) # for the %>% pipe
test %>%
  dplyr::group_by(.folds) %>%
  dplyr::summarize(sum_ = sum(x),
                   n_members = dplyr::n())
## # A tibble: 3 x 3
## .folds sum_ n_members
## <fct> <dbl> <int>
## 1 1 1666690952 33333
## 2 2 1666716667 33334
## 3 3 1666642381 33333

Using R as a game simulator

I am trying to simulate a simple game where you spin a spinner, labeled 1-5, and then progress until you pass the finish line (spot 50). I am a bit new to R and have been working on this for a while, searching for answers. When I run the code below, it doesn't add the numbers in sequence; it returns a list of my 50 random spins and their values. How do I get it to add the spins on top of each other, then stop once the total is >= 50?
SpacesOnSpinner<-(seq(1,5,by=1))
N<-50
L1<-integer(N)
for (i in 1:N){
  takeaspin <- sample(SpacesOnSpinner,1,replace=TRUE)
  L1[i] <- L1[i] + takeaspin
}
This is a good use-case for replicate. I'm not sure if you have to use a for loop, but you could do this instead (replicate is a loop too):
SpacesOnSpinner<-(seq(1,5,by=1))
N<-10
cumsum( replicate( N , sample(SpacesOnSpinner,1,replace=TRUE) ) )
#[1] 5 10 14 19 22 25 27 29 30 33
However, since you have a condition on which you want to stop, perhaps the other answer's while loop is exactly what you need in this case (people will tell you while loops are bad in R, but they have their uses). Using the replicate method, you can see how many spins it took to get past 50 with a simple subset afterwards (you will not know in advance how many spins it will take, but at most it will be 50!):
N <- 50
x <- cumsum( replicate( N , sample(5,1) ) )
# Value of accumulator at each round while it is still below 50
x[ x < 50 ]
#[1] 5 6 7 8 12 16 21 24 25 29 33 34 36 38 39 41 42 44 45 49
# Number of spins before the total reaches 50
length(x[x < 50])
# [1] 20
Here is another interesting way to simulate your game, using a recursive function.
spin <- function(outcomes = 1:5, start = 0L, end = 50L)
  if (start <= end)
    c(got <- sample(outcomes, 1), Recall(outcomes, start + got, end))
spin()
# [1] 5 4 4 5 1 5 3 2 3 4 4 1 5 4 3
Although elegant, it won't be as fast as an improved version of @Simon's solution that makes a single call to sample, as suggested by @Viktor:
spin <- function(outcomes = 1:5, end = 50L) {
  max.spins <- ceiling(end / min(outcomes))
  x <- sample(outcomes, max.spins, replace = TRUE)
  head(x, match(TRUE, cumsum(x) >= end))
}
spin()
# [1] 3 5 2 3 5 2 2 5 1 2 1 5 5 5 2 4
For your ultimate goal (finding the probability of one person being in the lead for the entire game), it is debatable whether while will be more efficient: a while loop is certainly slower, but you may benefit from the possibility of exiting early as the lead switches from one player to the other. Both approaches are worth testing.
You can use a while statement and a variable total for keeping track of the sum:
total <- 0
while(total < 50){
  takeaspin <- sample(SpacesOnSpinner,1,replace=TRUE)
  total <- takeaspin + total
}
print(total)
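If you also want to know how many spins it took, a small extension of the same idea works (a sketch; nspins is an arbitrary name):
total <- 0
nspins <- 0
while (total < 50){
  total <- total + sample(SpacesOnSpinner, 1)
  nspins <- nspins + 1
}
c(total = total, spins = nspins)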

Outcome of a simulated dice and coin toss in R [closed]

The experiment involves rolling a fair die and getting some value x, then tossing a fair coin x times and recording the number of tails. I need to do this experiment 50 times and record the outcomes in a vector (which I'll then use to plot a histogram).
This is my code so far:
for (i in 1:100)
{
  X <- sample(6,1,replace=TRUE,c(1,1,1,1,1,1)/6)
  Y <- sample(2,1,replace=TRUE,c(1,1)/2)
}
Youtcomes <- c(sum(Y))
Youtcomes
But instead of giving me a vector with 100 elements, I keep getting just a single number. Where am I going wrong?
Note: I have to use a for loop.
Use the fact that R is vectorized. You can then use a binomial distribution to replicate the coin tosses:
heads <- rbinom(size = sample(6,100, replace = TRUE), n=100, prob = 0.5)
sum(heads)
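Since the stated goal is a histogram of the 100 outcomes, you can plot the vector directly:
hist(heads)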
Perhaps I have missed something, but what is wrong with one call to sample() to do the 100 rolls of the dice, and then plug that into rbinom() to do the coin tosses? We pass the output from sample() to the size argument
> set.seed(1)
> rbinom(100, size = sample(6, 100, replace = TRUE), prob = 0.5)
[1] 1 1 1 6 1 2 2 2 3 1 2 1 2 1 1 0 3 1 1 3 6 1 2 0 2 1 1 1 2 2 2 1 0 1 4 3 3
[38] 1 5 2 3 2 2 1 3 2 0 2 1 4 2 3 1 1 1 0 1 1 1 1 2 2 1 2 3 1 0 2 1 2 2 4 2 1
[75] 1 5 3 2 3 5 1 2 3 1 4 0 3 1 2 1 1 0 1 5 2 3 0 2 2 3
Disclaimer: this is a very inefficient solution (see mnel's/Gavin's answers).
As you can read in the many, many... MANY comments underneath each of the answers, this answer attempts to address the OP's specific question (however inefficient the requirement may be). In the spirit of maintaining the decorum of the forum, some have rightly pointed out that the question is in bad taste and that my answer doesn't do it justice. I accept all criticism and leave the answer here only for obvious reasons (it was marked as the answer, continuity). I suggest you look at mnel's/Gavin's answer for a vectorised solution to this specific problem. If you are interested in an implementation of a for loop, refer to the bottom of this post, but study it to learn the structure of a for loop, not to apply one to this specific problem. Thank you.
Your code has quite a few problems, apart from the main one @Joshua already mentioned:
First, you overwrite the values of X and Y in every iteration of the loop, so at the end only the last value of Y is summed.
Second, your code for Y is not correct. You say you have to toss the coin x times, yet you use sample(2, 1, ...). The 1 must be replaced with X, which equals the number from the die roll.
Try out this code instead:
Youtcomes <- sapply(1:100, function(x) {
  X <- sample(1:6, 1, replace=TRUE, rep(1,6)/6)
  Y <- sample(c("H", "T"), X, replace=TRUE, rep(1,2)/2)
  sum(Y == "T")
})
Here, we loop 100 times; each time we sample a value between 1 and 6 and store it in X. Then we sample either head (H) or tail (T) X times and store the result in Y.
Now, sum(Y == "T") counts the tails for the current experiment, so at the end Youtcomes holds the 100 simulated tail counts.
Then you can do hist(Youtcomes).
Edit: If it's a for-loop solution that's desired, then:
# Always pre-allocate the variable you'll index inside a for loop.
# Otherwise the object keeps growing, and a copy of the entire
# object is made on every iteration, which is extremely
# slow/inefficient.
Youtcomes <- rep(0, 100)
for (i in 1:100) {
  X <- sample(1:6, 1, replace=TRUE, rep(1,6)/6)
  Y <- sample(c("H", "T"), X, replace=TRUE, rep(1,2)/2)
  # assign the output inside the loop with [i] indexing
  Youtcomes[i] <- sum(Y == "T")
  # since Youtcomes was pre-allocated with 100 zeros, each iteration
  # replaces one zero in place; the object is not copied every time,
  # which makes this fast/efficient
}
Arun beat me to it. But another of the many, many ways could be (if I understand your desired outcome correctly):
X <- sample(6, 100, replace=TRUE, c(1,1,1,1,1,1)/6)
Y <- lapply(X, function(x){
  res <- sample(c("H", "T"), x, replace=TRUE, c(1,1)/2)
  table(res)
})
You want to histogram the results....
res <- unlist(Y)
hist( res[names( res )=="T"] )
