Simulate coin toss for one week? - r

This is not homework. I am interested in setting up a simulation of a coin toss in R. I would like to run the simulation for a week. Is there a function in R that will allow me to start and stop the simulation over a time period such as a week? If all goes well, I may want to increase the length of the simulation period.
For example:
x <- rbinom(10, 1, 1/2)
So to clarify, instead of 10 in the code above, how do I keep the simulation going for a week (number of trials in a week versus set number of trials)? Thanks.

Here is code that will continue to run for three seconds, then stop and print the totals.
x <- Sys.time()
duration <- 3 # number of seconds
heads <- 0
tails <- 0
while(Sys.time() <= x + duration){
s <- sample(0:1, 1)
if(s == 1) heads <- heads+1 else tails <- tails+1
cat(sample(0:1, 1))
}
cat("heads: ", heads)
cat("tails: ", tails)
The results:
001100111000011010000010110111111001011110100110001101101010 ...
heads: 12713
tails: 12836
Note of warning:
At the speed of my machine, I bet that you get a floating point error long before the end of the week. In other words, you may get to the maximum value your machine allows you to store as an integer, double, float or whatever you are using, and then your code will crash.
So you may have to build in some error checking or rollover mechanism to protect you from this.
For an accelerated illustration of what will happen, try the following:
x <- 1e300
while(is.finite(x)){
x <- x+x
cat(x, "\n")
}
R deals with the floating point overload gracefully, and returns Inf.
So, whatever data you had in the simulation is now lost. It's not possible to analyse infinity to any sensible degree.
Keep this in mind when you design your simulation.

While now is smaller than a week later time stamp append to x rbinmo(1,1,1/2)
R> week_later <- strptime("2012-06-22 16:45:00", "%Y-%m-%d %H:%M:%S")
R> x <- rbinom(1, 1, 1/2) // init x
R> while(as.numeric(Sys.time()) < as.numeric(week_later)){
R> x <- append(x, rbinom(1, 1, 1/2))
R> }

You may be interested in the fairly new package harvestr by Andrew Redd. It splits a task into pieces (the idea being that pieces could be run in parallel). The part of the package that applies to your question is that it caches results of the pieces that have already been processed, so that if the task is interupted and restarted then those pieces that have finished will not be rerun, but it will pick up on those that did not complete (pieces that were interupted part way through will start from the beginning of that piece).
This may let you start and stop the simulation as you request.

Related

R Precision for Double - Why code returns negative why positive outcome expected?

I am testing 2 ways of calculating Prod(b-a), where a and b are vectors of length n. Prod(b-a)=(b1-a1)(b2-a2)(b3-a3)*... (bn-an), where b_i>a_i>0 for all i=1,2,3, n. For some special cases, another way (Method 2) of calculation this prod(b-a) is more efficient. It uses the following formula, which is to expand the terms and sum them:
Here is my question is: When it happens that a_i very close to b_i, the true outcome could be very, very close 0, something like 10^(-16). Method 1 (substract and Multiply) always returns positive output. Method 2 of using the formula some times return negative output ( about 7~8% of time returning negative for my experiment). Mathematically, these 2 methods should return exactly the same output. But in computer language, it apparently produces different outputs.
Here are my codes to run the test. When I run the testing code for 10000 times, about 7~8% of my runs for method 2 returns negative output. According to the official document, the R double has the precision of "2.225074e-308" as indicated by R parameter: ".Machine$double.xmin". Why it's getting into the negative values when the differences are between 10^(-16) ~ 10^(-18)? Any help that sheds light on this will be apprecaited. I would also love some suggestions concerning how to practically increase the precision to higher level as indicated by R document.
########## Testing code 1.
ftest1case<-function(a,b) {
n<-length(a)
if (length(b)!=n) stop("--------- length a and b are not right.")
if ( any(b<a) ) stop("---------- b has to be greater than a all the time.")
out1<-prod(b-a)
out2<-0
N<-2^n
for ( i in 1:N ) {
tidx<-rev(as.integer(intToBits(x=i-1))[1:n])
tsign<-ifelse( (sum(tidx)%%2)==0,1.0,-1.0)
out2<-out2+tsign*prod(b[tidx==0])*prod(a[tidx==1])
}
c(out1,out2)
}
########## Testing code 2.
ftestManyCases<-function(N,printFreq=1000,smallNum=10^(-20))
{
tt<-matrix(0,nrow=N,ncol=2)
n<-12
for ( i in 1:N) {
a<-runif(n,0,1)
b<-a+runif(n,0,1)*0.1
tt[i,]<-ftest1case(a=a,b=b)
if ( (i%%printFreq)==0 ) cat("----- i = ",i,"\n")
if ( tt[i,2]< smallNum ) cat("------ i = ",i, " ---- Negative summation found.\n")
}
tout<-apply(tt,2,FUN=function(x) { round(sum(x<smallNum)/N,6) } )
names(tout)<-c("PerLess0_Method1","PerLee0_Method2")
list(summary=tout, data=tt)
}
######## Step 1. Test for 1 case.
n<-12
a<-runif(n,0,1)
b<-a+runif(n,0,1)*0.1
ftest1case(a=a,b=b)
######## Step 2 Test Code 2 for multiple cases.
N<-300
tt<-ftestManyCases(N=N,printFreq = 100)
tt[[1]]
It's hard for me to imagine when an algorithm that consists of generating 2^n permutations and adding them up is going to be more efficient than a straightforward product of differences, but I'll take your word for it that there are some special cases where it is.
As suggested in comments, the root of your problem is the accumulation of floating-point errors when adding values of different magnitudes; see here for an R-specific question about floating point and here for the generic explanation.
First, a simplified example:
n <- 12
set.seed(1001)
a <- runif(a,0,1)
b <- a + 0.01
prod(a-b) ## 1e-24
out2 <- 0
N <- 2^n
out2v <- numeric(N)
for ( i in 1:N ) {
tidx <- rev(as.integer(intToBits(x=i-1))[1:n])
tsign <- ifelse( (sum(tidx)%%2)==0,1.0,-1.0)
j <- as.logical(tidx)
out2v[i] <- tsign*prod(b[!j])*prod(a[j])
}
sum(out2v) ## -2.011703e-21
Using extended precision (with 1000 bits of precision) to check that the simple/brute force calculation is more reliable:
library(Rmpfr)
a_m <- mpfr(a, 1000)
b_m <- mpfr(b, 1000)
prod(a_m-b_m)
## 1.00000000000000857647286522936696473705868726043995807429578968484409120647055193862325070279593735821154440625984047036486664599510856317884962563644275433171621778761377125514191564456600405460403870124263023336542598111475858881830547350667868450934867675523340703947491662460873009229537576817962228e-24
This proves the point in this case, but in general doing extended-precision arithmetic will probably kill any performance gains you would get.
Redoing the permutation-based calculation with mpfr values (using out2 <- mpfr(0, 1000), and going back to the out2 <- out2 + ... running summation rather than accumulating the values in a vector and calling sum()) gives an accurate answer (at least to the first 20 or so digits, I didn't check farther), but takes 6.5 seconds on my machine (instead of 0.03 seconds when using regular floating-point).
Why is this calculation problematic? First, note the difference between .Machine$double.xmin (approx 2e-308), which is the smallest floating-point value that the system can store, and .Machine$double.eps (approx 2e-16), which is the smallest value such that 1+x > x, i.e. the smallest relative value that can be added without catastrophic cancellation (values a little bit bigger than this magnitude will experience severe, but not catastrophic, cancellation).
Now look at the distribution of values in out2v, the series of values in out2v:
hist(out2v)
There are clusters of negative and positive numbers of similar magnitude. If our summation happens to add a bunch of values that almost cancel (so that the result is very close to 0), then add that to another value that is not nearly zero, we'll get bad cancellation.
It's entirely possible that there's a way to rearrange this calculation so that bad cancellation doesn't happen, but I couldn't think of one easily.

R poisson simulation

I need to write a function which calculates the number of arrivals until time t in n trials. And the arguments should be the lambda (which I can assume to be between 0.1 and 1), the time (I can assume to be less than or equal to 1) and the number of counts to be sampled.
I've previously written a function which takes a vector of length n which has the first n-1 elements as inter-event times and the nth element as the time t, and it counts the number of events which occur before t.
inp <- readline(prompt="Input vector with each element seperated by a space")
inp <- strsplit(inp," ")
inp <- as.integer(as.vector(inp[[1]]))
t <- tail(inp, n=1)
c.e <- function(x) {
inp = x
stopped = NA
for (i in seq_along(inp)) {
runsum <- sum(inp[1:i])
cat("The sum of the", i, "first elements is", runsum, "\n")
if (runsum > tail(inp, 1)) {
stopped = i - 1
break()
}
}
stopped
}
cat(c.e(inp), "events occur in", t, "time units")
(e.g. inputting 1 2 3 4 7 would output that 3 events occur in 6 time units)
I think I need to use and possibly edit this function in order to get it to do what I need it do, but I'm really not sure how to do this. Any help would be appreciated :)
You could edit this in to your old function if you want something like the printing and the nice inputs/outputs that you've got, but as for what you've actually asked for, it seems like all that you need is counts<-function(lambda, t, n) sum(rpois(n, lambda) < t).
rpois generates the trial results, with parameters n (the number of trials) and lambda, we then compare them with t (time) and sum our results.

How to append first and last element in each vector (of a different length) of a list in R, without making the code slow?

I am new to R and StackOverFlow. So if there's something I am missing, please let me know. I am working on simulating nonhomogeneous processes to understand the neural behavior better. My code works in a way that I have, say 20 trials lasting 2 seconds each (each trial representing a spike train). Then the list SpikeTimes gives me a list of 20 vectors where each vector corresponds to the time stamps of where the spikes occurred in that particular trial. [Eg.SpikeTimes[1] which looks like this, 0.002250802 0.053934034...1.971574170 2.023090384 means that in the first Spike Train, spikes occurred at 0.002250802, 0.053934034 and so on. I don't know why it also brings up a time stamps beyond my time limit of 2 seconds, but I will work on that later]. My code looks like this--
nTrials=20
t_max=2.0000000
LambdaInv<- function(x){ifelse( x< 15, x/30,
ifelse(x >= 15 & x < 38, ((x-15)/46)+0.5,
ifelse(x >= 38 & x <53, ((x-38)/30)+1.0,
ifelse(x >= 53 & x< 67.4, ((x-53)/72)+1.5,
((x-67.4)/30)+1.7))))}
t = 0
s = 0
X = numeric(0)
NonHomoSpikes <- function(t_max){
while(t <= t_max){
u <- runif(1)
s <- s-log(u)
t <- LambdaInv(s)
X <- c(X,t)
}
return(X)
}
SpikeTimes <- lapply(1:nTrials, function(x) NonHomoSpikes(2))
My problem is that, for each of the vector in the list SpikeTimes; which gives time stamps of spikes, I also want to include the beginning (that is 0) and the end (that is 2) of the spike train. So I want to append this list to have each vector include the first entry as 0 and the last entry as 2.
My SpikeTimes[1] would then look like 0 0.002250802 0.053934034...1.971574170 2 and other SpikeTimes[i] would look similar. I tried SpikeTimes <- c(0, SpikeTimes) for entering 0 in the beginning but it only made the list have 21 vectors instead of 20 with the vector 0 as the first element (I mean I get why that happened). How can I do it in a way that doesn't make my code slow? I am newbie in R and reading up on the internet hasn't help with this particular problem. I would appreciate any sort of input.
Solution: change NonHomoSpikes return statement to return(c(0, X, 2)).
To 'manually' stop if t < 2 (as described in comments):
NonHomoSpikes <- function(t_max){
while(t <= t_max){
u <- runif(1)
s <- s-log(u)
t <- LambdaInv(s)
if(t > 2) break
X <- c(X,t)
}
return(c(0,X,0))
}

Expected value of the difference between a sum of variables and a threshold

I had a custom deck consisting of eight cards of the sequence 2^n, n=0,..,6. I draw cards (without replacement) until the sum is equal or greater than the threshold. How can I implement in R a function that calculates the mean of the difference between the sum and the threshold??
I tried to do it using this How to store values in a vector with nested functions
but it takes ages... I think there is a way to do it with probabilities/simulations but I can figure out.
The threshold could be greater than the value of one single card, ie, threshold=500 or less than the value of a single card, ie, threshold=50
What I have done so far is to find all the subsets that meet the condition of the sum greater or equal to the threshold. Then I will only substract the threshold and calculate the mean.
I am using the following code in R. For a small set I get the answer quite fast. However, I have been running the function for several ours with the set containing the 56 numbers and is still working.
set<-c(rep(1,8),rep(2,8), rep(4,8),rep(8,8),rep(16,8),rep(32,8),rep(64,8))
recursive.subset <-function(x, index, current, threshold, result){
for (i in index:length(x)){
if (current + x[i] >= threshold){
store <<- append(store, sum(c(result,x[i])))
} else {
recursive.subset(x, i + 1, current+x[i], threshold, c(result,x[i]))
}
}
}
store <- vector()
inivector <- vector(mode="numeric", length=0) #initializing empty vector
recursive.subset (set, 1, 0, threshold, inivector)
I don't know if it is possible to get an exact solution, simply because there are so many possible combinations. It is probably better to do simulations, i.e. write a script for 1 full draw and then rerun that script many times. Since the solutions are very similar, the simulation should give a pretty good approximation.
Ok, here goes:
set <- rep(2^(0:6), each = 8)
thr <- 500
fun <- function(set,thr){
x <- cumsum(sample(set))
value <- x[min(which(x >= thr))]
value
}
system.time(a <- replicate(100000, fun(set,thr)))
# user system elapsed
# 1.10 0.00 1.09
mean(a - thr)
# [1] 21.22992
Explanation: Rather than drawing a card one at a time, I draw all cards simultaneously (sample) and then calculate the cumulative sum (cumsum). I then find the point where the cards at up to the threshold or larger, and find the sum of those cards back in x. We run this function many times with replicate, to obtain a vector of the values. We use mean(a-thr) to calculate the mean difference.
Edit: Made a really stupid typo in the code, fixed it now.
Edit2: Shortened the function a little.

R probability simulation that won't terminate?

I'm teaching a statistics class where I'm having students explore questions in probability and statistics through simulation using R. Recently there was some confusion about the probability of getting exactly two 6's when rolling 5 dice. The answer is choose(5,2)*5^3/6^5, but some students were convinced that "order shouldn't matter"; i.e. that the answer should be choose(5,2)*choose(25,3)/choose(30,5). I thought it would be fun to have them simulate rolling 5 dice thousands of times, keeping track of the empirical probability for each experiment, and then repeat the experiment many times. The problem is the two numbers above are sufficiently close that it's quite hard to get a simulation to tease out the difference in a statistically significant fashion (of course I could just be doing it wrong). I tried rolling 5 dice 100000 times, then repeating the experiment 10000 times. This took an hour or so to run on my i7 linux machine and still allowed for a 25% chance that the correct answer is choose(5,2)*choose(25,3)/choose(30,5). So I increased the number of dice rolls per experiment to 10^6. Now the code has been running for over 2 days and shows no sign of finishing. I'm confused by this, as I only increased the number of operations by an order of magnitude, implying that the run time should be closer to 10 hours.
Second question: Is there a better way to do this? See code posted below:
probdist = rep(0,10000)
for (j in 1:length(probdist))
{
outcome = rep(0,1000000)
for (k in 1:1000000)
{
rolls = sample(1:6, 5, replace=T)
if (length(rolls[rolls == 6]) == 2) outcome[k] = 1
}
probdist[j] = sum(outcome)/length(outcome)
}
A good rule of thumb is to never, ever write a for loop in R. Here's an alternative solution:
doSample <- function()
{
sum(sample(1:6,size=5,replace=TRUE)==6)==2
}
> system.time(samples <- replicate(n=10000,expr=doSample()))
user system elapsed
0.06 0.00 0.06
> mean(samples)
[1] 0.1588
> choose(5,2)*5^3/6^5
[1] 0.160751
Doesn't seem to be too accurate with $10,000$ samples. Better with $100,000$:
> system.time(samples <- replicate(n=100000,expr=doSample()))
user system elapsed
0.61 0.02 0.61
> mean(samples)
[1] 0.16135
I had originally awarded a correct answer check to M. Berk for his/her suggestion to use the R replicate() function. Further investigation has forced to to rescind my previous endorsement. It turns out that replicate() is just a wrapper for sapply(), which doesn't actually afford any performance benefits over a for loop (this seems to be a common misconception). In any case, I prepared 3 versions of the simulation, 2 using a for loop, and one using replicate, as suggested, and ran them one after the other, starting from a fresh R session each time, in order to compare the execution times:
# dice26dist1.r: For () loop version with unnecessary array allocation
probdist = rep(0,100)
for (j in 1:length(probdist))
{
outcome = rep(0,1000000)
for (k in 1:1000000)
{
rolls = sample(1:6, 5, replace=T)
if (length(rolls[rolls == 6]) == 2) outcome[k] = 1
}
probdist[j] = sum(outcome)/length(outcome)
}
system.time(source('dice26dist1.r'))
user system elapsed
596.365 0.240 598.614
# dice26dist2.r: For () loop version
probdist = rep(0,100)
for (j in 1:length(probdist))
{
outcomes = 0
for (k in 1:1000000)
{
rolls = sample(1:6, 5, replace=T)
if (length(rolls[rolls == 6]) == 2) outcomes = outcomes + 1
}
probdist[j] = outcomes/1000000
}
system.time(source('dice26dist2.r'))
user system elapsed
506.331 0.076 508.104
# dice26dist3.r: replicate() version
doSample <- function()
{
sum(sample(1:6,size=5,replace=TRUE)==6)==2
}
probdist = rep(0,100)
for (j in 1:length(probdist))
{
samples = replicate(n=1000000,expr=doSample())
probdist[j] = mean(samples)
}
system.time(source('dice26dist3.r'))
user system elapsed
804.042 0.472 807.250
From this you can see that the replicate() version is considerably slower than either of the for loop versions by any system.time metric. I had originally thought that my problem was mostly due to cache misses by allocating the million character outcome[] array, but comparing the times of dice26dist1.r and dice26dist2.r indicates that this only has nominal impact on performance (although the impact on system time is considerable: >300% difference.
One might argue that I'm still using for loops in all three simulations, but as far as I can tell this is completely unavoidable when simulating a random process; I have to simulate actually going through the random process (in this case, rolling 5 die) every time. I would love to know about any technique that would allow me to avoid using a for loop (in a way that improves performance, of course). I understand that this problem would lend itself very effectively to parallelization, but I'm talking about using a single R session -- is there a way to make this faster?
Vectorization is almost always preferred to any for loop. In this case, you should see substantial speedup by generating all your dice throws first, then checking how many in each group of five equal 6.
set.seed(5)
N <- 1e6
foo <- matrix(sample(1:6, 5*N, replace=TRUE), ncol=5)
p <- mean(rowSums(foo==6)==2)
se <- sqrt(p*(1-p)/N)
p
## [1] 0.160382
Here's a 95% confidence interval:
p + se*qnorm(0.975)*c(-1,1)
## [1] 0.1596628 0.1611012
We can see that the true answer (ans1) is in the interval, but the false answer (ans2) is not, or we could perform significance tests; the p-value when testing the true answer is 0.31 but for the false answer is 0.0057.
(ans1 <- choose(5,2)*5^3/6^5)
## [1] 0.160751
pnorm(abs((ans1-p)/se), lower=FALSE)*2
## [1] 0.3145898
ans2 <- choose(5,2)*choose(25,3)/choose(30,5)
## [1] 0.1613967
pnorm(abs((ans2-p)/se), lower=FALSE)*2
## [1] 0.005689008
Note that I'm generating all the dice throws at once; if memory is an issue, you could split this up into pieces and combine, as you did in your original post. This is possibly what caused your unexpected speedup in time; if it was necessary to use swap memory, this would slow it substantially. If so, better to increase the number of time you run the loop, not the number of rolls within the loop.

Resources