Simplify Simulations in R

As I mentioned in a previous question, I am brand new to programming and have no prior experience, but am very happy to be learning.
However, I've run into the following problem. My professor has given us the following code,
sim1 <- function(n) {
  xm <- matrix(nrow=n, ncol=2)
  for (i in 1:n) {
    d <- rnorm(1)
    if (runif(1) < 0.5) {
      xm[i,1] <- 1
      xm[i,2] <- 2.5*d + 69
    } else {
      xm[i,1] <- 0
      xm[i,2] <- 2*d + 64
    }
  }
  return(xm)
}
along with the following task: "Try to improve the efficiency of this code. Use speed.test to see if it is improved for generating n=1000 observations."
I have finally been able to figure out what this code does; nonetheless, I am completely lost on how I could possibly make it more efficient.
Any help means a whole lot.
Thank you!

If possible, don't use loops in R. rep and rnorm will fill vectors with 5, 10, or 500,000 values all in one call, very quickly. Calling rnorm(1) 500,000 times is a waste and much slower than simply calling rnorm(500000). It's like taking a Ferrari for a drive, going 1 foot and stopping, going 1 foot and stopping, over and over to get to your destination.
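For example, the difference shows up directly in a quick timing comparison (an illustrative sketch; exact timings will vary by machine):
system.time(for (i in 1:500000) rnorm(1))   # 500,000 tiny calls
system.time(rnorm(500000))                  # one vectorized call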
The function below returns statistically equivalent results to yours; however, instead of using loops, it does things the R way.
sim2 <- function(n) {
  n1 <- floor(n/2)                  # this is how many of the else clause we'll do
  n2 <- n - n1                      # this is how many of the if clause we'll do
  col11 <- rep(0, n1)               # bam! we have a vector filled with 0s
  col12 <- (rnorm(n1) * 2) + 64     # bam! vector filled with deviates
  col21 <- rep(1, n2)               # bam! vector filled with 1s
  col22 <- (rnorm(n2) * 2.5) + 69   # bam! vector filled with deviates
  xm <- cbind(c(col11, col21), c(col12, col22))  # now we have a matrix, 2 cols, n rows
  return(xm[sample(nrow(xm)), ])    # shuffle the rows, return matrix
}
No loops! The functionality might be obvious, but in case it is not, I'll explain. First, n1 and n2 simply split n into two halves (accounting for odd n).
Next, the per-element binomial process (i.e., if (runif(1) < 0.5) {} else {}) can be eliminated, since we know that in sim1 about half of the rows fall into the if condition and half into the else (see proof below). We don't need to decide, element by element, over and over, which random path to take when we know it's 50/50. So we do ALL of the else 50% first: we fill one vector with n/2 0s (col11) and another with n/2 random deviates (mean = 0, sd = 1 by default), multiplying each deviate by 2 and adding 64 to get col12. That 50% is done.
Next, we finish the second 50% (the if portion). We fill a vector with n/2 1s (col21) and another with random deviates and, for each deviate, multiply by 2.5 and add 69.
We now have four vectors that we'll turn into a matrix.
Step 1: Glue col11 (filled with n/2 0s) and col21 (filled with n/2 1s) together using the c function to get one vector of n elements.
Step 2: Glue col12 and col22 (filled with the deviates) together using c to get a second vector of n elements. Note that the 0s/1s stay associated with the correct deviates, based on the 64/69 formulas.
Step 3: Use cbind to make a matrix (xm) out of the two vectors: the 0/1 vector becomes column 1, the deviate vector becomes column 2.
Step 4: Get the number of rows in the matrix (which should just be n) using nrow.
Step 5: Make a shuffled vector of all the row numbers, randomly ordered, using sample.
Step 6: Build a new (unnamed) matrix by putting xm's rows in the order given by the shuffled vector.
The point of steps 4-6 is simply to put the rows in random order, since the binomial process in sim1 would have produced rows in random order.
This version runs 866% faster!
> system.time({ sim1(500000)})
user system elapsed
1.341 0.179 1.527
> system.time({ sim2(500000)})
user system elapsed
0.145 0.011 0.158
If you're concerned about proof that this maintains the integrity of the binomial process, consider that the binomial process does two things: 1) It associates 1 with the 2.5*d+69 equation and 0 with the 2*d + 64 equation - the association is maintained since rows are shuffled intact; 2) 50% go in the if clause and 50% in the else clause, as proved below.
sim3 <- function(n) {
  a <- 0
  for (j in 1:n) {
    if (runif(1) < 0.5) {
      a <- a + 1
    }
  }
  return(a/n)
}
> sim3(50)
[1] 0.46
> sim3(5000)
[1] 0.4926
> sim3(10000)
[1] 0.5022
> sim3(5000000)
[1] 0.4997844
The binomial process produces approximately 50% 1s and 50% 0s (column 1).
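The same check can be done without a loop, in keeping with the vectorization advice above (a quick sketch; the printed value will differ slightly from run to run):
mean(runif(5000000) < 0.5)   # prints a value very close to 0.5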

I'll do what I think is the most obvious step, namely move rnorm() out of the loop and take advantage of its vectorized nature (as rawr alluded to).
sim2 <- function(n) {
  xm <- matrix(nrow=n, ncol=2)
  d <- rnorm(n)
  for (i in 1:n) {
    if (runif(1) < 0.5) {
      xm[i,1] <- 1
      xm[i,2] <- 2.5*d[i] + 69
    } else {
      xm[i,1] <- 0
      xm[i,2] <- 2*d[i] + 64
    }
  }
  return(xm)
}
n <- 1e3
set.seed(1); system.time(s1 <- sim1(n)); system.time(s2 <- sim2(n))
# user system elapsed
# 0.019 0.004 0.023
# user system elapsed
# 0.010 0.000 0.009
t.test(s1[,2], s2[,2]) # Not identical, but similar, again alluded to by rawr
Just that gives us a reasonable improvement. A similar thing can be done with runif() as well, but I'll leave that to you.
If you want some reading material, I can recommend Hadley Wickham's Advanced R, in particular the chapter on optimising code.
And in case you're wondering, it is indeed possible to eliminate both the loop and the conditionals.
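For what it's worth, a fully vectorized version (my sketch, not code from any of the answers) might look like the following; it keeps the per-row coin flip but does everything in whole-vector operations:
sim2c <- function(n) {
  d <- rnorm(n)                # all normal deviates in one call
  heads <- runif(n) < 0.5      # one coin flip per row, also in one call
  cbind(as.numeric(heads),                      # column 1: 1 for the "if" branch, 0 for the "else"
        ifelse(heads, 2.5*d + 69, 2*d + 64))    # column 2: formula picked per row
}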

One optimization I can suggest is to create the matrix with a default value of 0. Once the matrix has been created with 0 as the default, there is no need to assign 0 inside the function.
The modified code looks like this:
sim1 <- function(n) {
  # create the matrix pre-filled with 0s
  xm <- matrix(0, nrow=n, ncol=2)
  for (i in 1:n) {
    d <- rnorm(1)
    if (runif(1) < 0.5) {
      xm[i,1] <- 1
      xm[i,2] <- 2.5*d + 69
    } else {
      # xm[i,1] <- 0 --- no longer needed
      xm[i,2] <- 2*d + 64
    }
  }
  return(xm)
}

Related

Calculate vector whose length is not known beforehand - should I "grow" it?

I need to calculate entries of a vector whose length I do not know beforehand. How to do so efficiently?
A trivial solution is to "grow" it: start with a small or empty vector and successively append new entries until the stopping criterion is reached. For example:
foo <- numeric(0)
while ( sum(foo) < 100 ) foo <- c(foo,runif(1))
length(foo)
# 195
However, "growing" vectors is frowned upon in R for performance reasons.
Of course, I could "grow it in chunks": pre-allocate a "good-sized" vector, fill it, double its length when it is full, and finally cut it down to size. But this feels error-prone and will make for inelegant code.
Is there a better or canonical way to do this? (In my actual application, the calculation and the stopping criterion are a bit more complicated, of course.)
In reply to some useful comments
Even if you don't know the length beforehand, do you know the maximum possible length it can theoretically have? In such cases I tend to initialize the vector with that length and after the loop cut the NAs or remove the unused entries based on the latest index value.
No, the maximum length is not known in advance.
Do you need to keep all values as the vector grows?
Yes, I do.
What about something like rand_num <- runif(300); rand_num[cumsum(rand_num) < 100], where you choose a vector large enough that the condition will be met with high probability? You can of course check it and use an even bigger number if it's not met. I've tested up to runif(10000) and it's still faster than "growing".
My actual use case involves a dynamic calculation, which I can't simply vectorize (otherwise I would not be asking).
Specifically, to approximate the convolution of negative binomial random variables, I need to calculate the probability masses of the integer random variable $K$ in Theorem 2 in Furman, 2007 up to a high cumulative probability. These masses $pr_k$ involve some intricate recursive sums.
I could "grow it in chunks": pre-allocate a "good-sized" vector, fill it, double its length when it is full, and finally cut it down to size. But this feels error-prone and will make for inelegant code.
Sounds like you are referring to the accepted answer of Collecting an unknown number of results in a loop. Have you coded it up and tried it? The idea of length doubling is more than sufficient (see the end of this answer), as the length will grow geometrically. I will demonstrate my method in the following.
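(As an aside, the "double when full" idea from the question might look roughly like the sketch below; this is illustrative only and not the method demonstrated in the rest of this answer.)
grow_by_doubling <- function(stop_sum) {
  z <- numeric(64)   # pre-allocate a small buffer
  n_used <- 0        # how many slots are filled so far
  sum_z <- 0
  while (sum_z < stop_sum) {
    if (n_used == length(z)) length(z) <- 2 * length(z)   # double when full (pads with NA)
    z_i <- runif(1)
    n_used <- n_used + 1
    z[n_used] <- z_i
    sum_z <- sum_z + z_i
  }
  z[seq_len(n_used)]   # finally cut it down to size
}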
For testing purposes, wrap your code in a function. Note how I avoid computing sum(z) for every while test.
ref <- function (stop_sum, timing = TRUE) {
  set.seed(0)   ## fix a seed to compare performance
  if (timing) t1 <- proc.time()[[3]]
  z <- numeric(0)
  sum_z <- 0
  while (sum_z < stop_sum) {
    z_i <- runif(1)
    z <- c(z, z_i)
    sum_z <- sum_z + z_i
  }
  if (timing) {
    t2 <- proc.time()[[3]]
    return(t2 - t1)   ## return execution time
  } else {
    return(z)         ## return result
  }
}
Chunking is necessary to reduce the operational costs of concatenation.
template <- function (chunk_size, stop_sum, timing = TRUE) {
  set.seed(0)   ## fix a seed to compare performance
  if (timing) t1 <- proc.time()[[3]]
  z <- vector("list")   ## store all segments in a list
  sum_z <- 0            ## cumulative sum
  while (sum_z < stop_sum) {
    segmt <- numeric(chunk_size)   ## initialize a segment
    i <- 1
    while (i <= chunk_size) {
      z_i <- runif(1)        ## call a function & get a value
      sum_z <- sum_z + z_i   ## update cumulative sum
      segmt[i] <- z_i        ## fill in the segment
      if (sum_z >= stop_sum) break   ## ready to break at any time
      i <- i + 1
    }
    ## grow the list
    if (sum_z < stop_sum) z <- c(z, list(segmt))
    else z <- c(z, list(segmt[1:i]))
  }
  if (timing) {
    t2 <- proc.time()[[3]]
    return(t2 - t1)     ## return execution time
  } else {
    return(unlist(z))   ## return result
  }
}
Let's check correctness first.
z <- ref(1e+4, FALSE)
z1 <- template(5, 1e+4, FALSE)
z2 <- template(1000, 1e+4, FALSE)
range(z - z1)
#[1] 0 0
range(z - z2)
#[1] 0 0
Let's then compare speed.
## reference implementation
t0 <- ref(1e+4, TRUE)
## unrolling implementation
trial_chunk_size <- seq(5, 1000, by = 5)
tm <- sapply(trial_chunk_size, template, stop_sum = 1e+4, timing = TRUE)
## visualize timing statistics
plot(trial_chunk_size, tm, type = "l", ylim = c(0, t0), col = 2, bty = "l")
abline(h = t0, lwd = 2)
Looks like chunk_size = 200 is sufficiently good, and the speedup factor is
t0 / tm[trial_chunk_size == 200]
#[1] 16.90598
Finally, let's see via profiling how much time is spent growing the vector with c.
Rprof("a.out")
z0 <- ref(1e+4, FALSE)
Rprof(NULL)
summaryRprof("a.out")$by.self
# self.time self.pct total.time total.pct
#"c" 1.68 90.32 1.68 90.32
#"runif" 0.12 6.45 0.12 6.45
#"ref" 0.06 3.23 1.86 100.00
Rprof("b.out")
z1 <- template(200, 1e+4, FALSE)
Rprof(NULL)
summaryRprof("b.out")$by.self
# self.time self.pct total.time total.pct
#"runif" 0.10 83.33 0.10 83.33
#"c" 0.02 16.67 0.02 16.67
Adaptive chunk_size with linear growth
ref has O(N * N) operational complexity where N is the length of the final vector. template in principle has O(M * M) complexity, where M = N / chunk_size. To attain linear complexity O(N), chunk_size needs to grow with N, but a linear growth is sufficient: chunk_size <- chunk_size + 1.
template1 <- function (chunk_size, stop_sum, timing = TRUE) {
  set.seed(0)   ## fix a seed to compare performance
  if (timing) t1 <- proc.time()[[3]]
  z <- vector("list")   ## store all segments in a list
  sum_z <- 0            ## cumulative sum
  while (sum_z < stop_sum) {
    segmt <- numeric(chunk_size)   ## initialize a segment
    i <- 1
    while (i <= chunk_size) {
      z_i <- runif(1)        ## call a function & get a value
      sum_z <- sum_z + z_i   ## update cumulative sum
      segmt[i] <- z_i        ## fill in the segment
      if (sum_z >= stop_sum) break   ## ready to break at any time
      i <- i + 1
    }
    ## grow the list
    if (sum_z < stop_sum) z <- c(z, list(segmt))
    else z <- c(z, list(segmt[1:i]))
    ## increase chunk_size
    chunk_size <- chunk_size + 1
  }
  ## remove this line if you want
  cat(sprintf("final chunk size = %d\n", chunk_size))
  if (timing) {
    t2 <- proc.time()[[3]]
    return(t2 - t1)     ## return execution time
  } else {
    return(unlist(z))   ## return result
  }
}
A quick test verifies that we have attained linear complexity.
template1(200, 1e+4)
#final chunk size = 283
#[1] 0.103
template1(200, 1e+5)
#final chunk size = 664
#[1] 1.076
template1(200, 1e+6)
#final chunk size = 2012
#[1] 10.848
template1(200, 1e+7)
#final chunk size = 6330
#[1] 108.183

Data generation: Creating a vector of vectors

I have a vector of positive integers of unknown length. Let's call it vector a with elements a[1], a[2], ...
I want to perform calculations on every vector b where, for all i, 0 <= b[i] <= a[i].
The following does not work:
for (b in 0:a)
{
  # calculations
}
The best I have come up with is:
probabilities <- function(a,p)
{
k <- a
k[1] <- 1
h <- rep(0,sum(a)+1)
for(i in 2:length(a))
{
k[i] <- k[i-1]*(a[i-1]+1)
}
for(i in 0:prod(a+1))
{
b <- a
for(j in 1:length(a))
{
b[j] <- (floor(i/k[j]) %% (a[j]+1))
}
t <- 1
for(j in 1:length(a))
{
t <- t * choose(a[j],b[j])*(p[j])^(b[j])*(1-p[j])^(a[j]-b[j])
}
h[sum(b)+1] <- h[sum(b)+1] + t
}
return(h)
}
In the middle of my function is where I create b. I start by setting b equal to a (so that it is the same size); then I replace all of the elements of b with different elements that are rather tricky to calculate. This seems like an inefficient solution. It works, but it is fairly slow as the numbers get large. Any ideas for how I can cut down on the processing time? Essentially, what this does for b is: the first time through, b is all zeros; then it is 1, 0, 0, 0, ...; the first element keeps incrementing until it reaches a[1], then b[2] increments and b[1] resets to 0, and b[1] starts incrementing again.
I know the math is sound, I just do not trust that it is efficient. I studied combinatorics for a few years, but have never studied computational complexity theory, so coming up with a fast algorithm is a bit beyond my realm of knowledge. Any ideas would be helpful!
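(As an illustrative aside, not part of the original post: the same enumeration of all vectors b can be written without the hand-rolled counter by letting expand.grid generate the combinations and dbinom compute the binomial terms. It still enumerates all prod(a+1) combinations, so it is a readability sketch rather than an asymptotic improvement.)
probabilities_grid <- function(a, p) {
  # every vector b with 0 <= b[j] <= a[j], one combination per row
  grid <- as.matrix(expand.grid(lapply(a, function(ai) 0:ai)))
  # product of the binomial terms for each row (dbinom is vectorized over j)
  probs <- apply(grid, 1, function(b) prod(dbinom(b, a, p)))
  # accumulate the probabilities by the total sum(b), as h does above
  totals <- rowSums(grid)
  sapply(0:sum(a), function(s) sum(probs[totals == s]))
}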

Different results on same expression

I have written the following sum s in two ways. Although both expressions are the same, they give me two different answers: the output comes out negative when it is supposed to be positive. Please tell me why this is happening and how to rectify it.
n <- 40
k <- 20
m <- 30
T <- 1.2
t1 <- 3
(1) The first way of expressing the sum
s <- 0
for (j in 0:(m-1)) {
  a <- choose(m-1, j)*(-1)^j*(1/(n-m+1+j))*(1-exp(-T*(n-m+1+j)*(1/t1)))
  s <- s + a
}
s <- s * m*choose(n, m)
(2) The second way of expressing the same sum is
s <- 0
for (j in 0:(m-1)) {
  a <- choose(m-1, j)*(-1)^j*(1/(n-m+1+j))*(1-exp(-T*(n-m+1+j)*(1/t1)))*m*choose(n, m)
  s <- s + a
}

Calculating standard deviation on large table [duplicate]

I recently posted this question on the r-help mailing list but got no answers, so I thought I would post it here as well and see if there were any suggestions.
I am trying to calculate the cumulative standard deviation of a matrix. I want a function that accepts a matrix and returns a matrix of the same size where output cell (i,j) is set to the standard deviation of input column j between rows 1 and i. NAs should be ignored, unless cell (i,j) of the input matrix itself is NA, in which case cell (i,j) of the output matrix should also be NA.
I could not find a built-in function, so I implemented the following code. Unfortunately, this uses a loop that ends up being somewhat slow for large matrices. Is there a faster built-in function or can someone suggest a better approach?
cumsd <- function(mat)
{
  retval <- mat*NA
  # column-wise sd of rows 1..i (sd() no longer works column-wise on a matrix, hence apply)
  for (i in 2:nrow(mat)) retval[i,] <- apply(mat[1:i, , drop=FALSE], 2, sd, na.rm=TRUE)
  retval[is.na(mat)] <- NA
  retval
}
Thanks.
You could use cumsum to compute the necessary running sums and then apply the direct formula for the variance/sd as vectorized operations on the matrix:
cumsd_mod <- function(mat) {
  cum_var <- function(x) {
    ind_na <- !is.na(x)
    nn <- cumsum(ind_na)
    x[!ind_na] <- 0
    cumsum(x^2)/(nn-1) - (cumsum(x))^2/(nn-1)/nn
  }
  v <- sqrt(apply(mat, 2, cum_var))
  v[is.na(mat) | is.infinite(v)] <- NA
  v
}
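For reference, cum_var implements the running version of the usual shortcut formula for the sample variance, applied to the first $i$ non-missing values of a column: $s_i^2 = \frac{1}{n_i-1}\left(\sum_{k \le i} x_k^2 - \frac{(\sum_{k \le i} x_k)^2}{n_i}\right)$, where $n_i$ is the number of non-NA entries among rows 1..i; zeroing out the NAs first means they drop out of both cumulative sums.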
just for comparison:
set.seed(2765374)
X <- matrix(rnorm(1000),100,10)
X[cbind(1:10,1:10)] <- NA # to have some NA's
all.equal(cumsd(X),cumsd_mod(X))
# [1] TRUE
And about timing:
X <- matrix(rnorm(100000),1000,100)
system.time(cumsd(X))
# user system elapsed
# 7.94 0.00 7.97
system.time(cumsd_mod(X))
# user system elapsed
# 0.03 0.00 0.03
Another try (Marek's is faster)
cumsd2 <- function(y) {
  n <- nrow(y)
  apply(y, 2, function(i) {
    Xmeans <- lapply(1:n, function(z) rep(sum(i[1:z])/z, z))
    Xs <- sapply(1:n, function(z) i[1:z])
    sapply(2:n, function(z) sqrt(sum((Xs[[z]]-Xmeans[[z]])^2, na.rm = T)/(z-1)))
  })
}

Simulating coin toss

In the New York Times yesterday there was a reference to a paper essentially saying that the probability of 'heads' after a 'head' appears is not 0.5 (assuming a fair coin), challenging the "hot hand" myth. I want to prove it to myself.
Thus, I am working on coding a simulation of 7 coin tosses, and counting the number of heads after the first head, provided, naturally, that there is a first head at all.
I came up with the following lines of R code, but I'm still getting NA values, and would appreciate some help:
n <- 7            # number of tosses
p <- 0.5          # probability of heads
sims <- 100       # number of simulations
Freq_post_H <- 0  # frequency of 'head'-s after first 'head'
for (i in 1:sims) {
  z <- rbinom(n, 1, p)
  if (sum(z==1) != 0) {
    y <- which(z==1)[1]
    Freq_post_H[i] <- sum(z[(y+1):n])/length((y+1):n)
  } else {
    next()
  }
  Freq_post_H
}
Freq_post_H
What am I missing?
CONCLUSION: After the initial hiccups of mismatched variable names, both responses solve the question. One of the answers corrects problems in the initial code related to what happens when the first head is the last toss by introducing min(y + 1, n), and corrects the basic misunderstanding of next within a loop, which leaves NA for skipped iterations. So thank you (+1).
Critically, and the reason for this appended "conclusion": the second response addresses a more fundamental, conceptual problem. We want to calculate the fraction of H's that are preceded by an H, as opposed to p(H) in whatever number of tosses remain after a head has appeared, which will be 0.5 for a fair coin.
This is a simulation of what they did in the newspaper:
nsims <- 10000
k <- 4
set.seed(42)
sims <- replicate(nsims, {
  x <- sample(0:1, k, TRUE)
  #print(x)
  sum(                       # sum logical values, i.e. 0/1
    diff(x) == 0L &          # is the difference between consecutive values 0?
    x[-1] == 1L) /           # and are these values heads?
    sum(head(x, -1) == 1L)   # divide by the number of heads (excluding the last toss)
})
mean(sims, na.rm = TRUE)     # NaN cases are samples without heads, i.e. 0/0
#[1] 0.4054715
k <- 7
sims <- replicate(nsims, {
  x <- sample(0:1, k, TRUE)
  #print(x)
  sum(diff(x) == 0L & x[-1] == 1L) / sum(head(x, -1) == 1L)
})
mean(sims, na.rm = TRUE)
#[1] 0.4289402
n <- 7            # number of tosses
p <- 0.5          # probability of heads
sims <- 100       # number of simulations
Prob_post_H <- 0  # frequency of 'head'-s after first 'head'
for (i in 1:sims) {
  z <- rbinom(n, 1, p)
  if (sum(z==1) != 0) {
    y <- which(z==1)[1]
    Prob_post_H[i] <- mean(z[min(y+1, n):n], na.rm=TRUE)
  } else {
    next()
  }
}
mean(Prob_post_H, na.rm=TRUE)
#[1] 0.495068
It looks like it's right around 50%. We can scale up to see more simulations.
sims <- 10000
mean(Prob_post_H,na.rm=TRUE)
#[1] 0.5057866
Still around 50%.
This is to simulate 100 fair coin tosses 30,000 times
counter <- 1
coin <- sum(rbinom(100, 1, 0.5))
while (counter < 30000) {
  coin <- c(coin, sum(rbinom(100, 1, 0.5)))
  counter <- counter + 1
}
Try these after running the code above:
hist(coin)
str(coin)
mean(coin)
sd(coin)
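Incidentally, the growing coin vector isn't needed; the same 30,000 draws can be generated in one or two vectorized calls (a small sketch, in line with the vectorization advice earlier on this page):
coin <- replicate(30000, sum(rbinom(100, 1, 0.5)))
# or, since the number of heads in 100 fair tosses is itself binomial:
coin <- rbinom(30000, 100, 0.5)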
Below is some sample code to simulate a fair coin toss in R using the sample function. You can modify it as you like to simulate any number of flips. Since the outcome of each flip is independent, the probability of a head or a tail is always 0.5 for any given flip. Over many coin flips the probability of at least half of the flips being heads (or tails) converges to 0.5, while the probability of getting exactly half heads and half tails approaches 0.
n <- 7
count_heads <- 0
coin_flip <- sample(c(0,1), n, replace = TRUE)
for (flip_i in 1:n)
{
  if (coin_flip[flip_i] == 1)
  {
    count_heads <- count_heads + 1
  }
}
count_heads/n
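The explicit counting loop can also be collapsed into a single vectorized call, which gives the same proportion:
mean(coin_flip)   # proportion of heads, equal to count_heads/n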

Resources