Trying to wrap my mind around vectorizing while trying to make some simulations faster, I found this very basic epidemic simulation. The code is from the book http://www.amazon.com/Introduction-Scientific-Programming-Simulation-Using/dp/1420068725/ref=sr_1_1?ie=UTF8&qid=1338069156&sr=8-1
#program spuRs/resources/scripts/SIRsim.r
SIRsim <- function(a, b, N, T) {
# Simulate an SIR epidemic
# a is infection rate, b is removal rate
# N initial susceptibles, 1 initial infected, simulation length T
# returns a matrix size (T+1)*3 with columns S, I, R respectively
S <- rep(0, T+1)
I <- rep(0, T+1)
R <- rep(0, T+1)
S[1] <- N
I[1] <- 1
R[1] <- 0
for (i in 1:T) {
S[i+1] <- rbinom(1, S[i], (1 - a)^I[i])
R[i+1] <- R[i] + rbinom(1, I[i], b)
I[i+1] <- N + 1 - R[i+1] - S[i+1]
}
return(matrix(c(S, I, R), ncol = 3))
}
The core of the simulation is the for loop. My question is: since the code produces the S[i+1] and R[i+1] values from the S[i] and R[i] values, is it possible to vectorize it with an apply function?
Many thanks
It's hard to 'vectorize' iterative calculations, but this is a simulation, and simulations are likely to be run many times. So write this to do all the simulations at the same time by adding an argument M (number of simulations to perform), allocating an M x (T + 1) matrix, and then filling in successive columns (times) of each simulation. The changes seem to be remarkably straightforward (so I've probably made a mistake; I'm particularly concerned about the use of vectors in the second and third arguments to rbinom, though this is consistent with the documentation).
SIRsim <- function(a, b, N, T, M) {
## Simulate an SIR epidemic
## a is infection rate, b is removal rate
## N initial susceptibles, 1 initial infected, simulation length T
## M is the number of simulations to run
## returns a list of S, I, R matrices, each with M simulations
## across T + 1 time points
S <- I <- R <- matrix(0, M, T + 1)
S[,1] <- N
I[,1] <- 1
for (i in seq_len(T)) {
S[,i+1] <- rbinom(M, S[,i], (1 - a)^I[,i])
R[,i+1] <- R[,i] + rbinom(M, I[,i], b)
I[,i+1] <- N + 1 - R[,i+1] - S[,i+1]
}
list(S=S, I=I, R=R)
}
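For example, a minimal usage sketch (the parameter values here are made up for illustration):
set.seed(42)
sim <- SIRsim(a = 0.0005, b = 0.1, N = 1000, T = 100, M = 50)
## each component is an M x (T + 1) matrix; summarise across simulations,
## e.g. the mean number of infected at each time point
plot(0:100, colMeans(sim$I), type = "l", xlab = "day", ylab = "mean infected")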
Why does the computation of the following code in R take so much time? It takes many minutes, so I have interrupted the calculations.
My aim is to adapt my simulated random numbers (sumzv, dim(sumzv) = 1000000 x 10) to my market model S_t (geometric Brownian motion).
The vectors m and s contain the drift and standard deviation of the GBM, each holding 10 numbers. DEL is the time-step size, and S_0 is a vector containing 10 stock prices at time 0.
n <- 1000000
k <- 10
S_t <- data.frame(matrix(0, nrow = n, ncol = k))
i <- 1
j <- 1
t <- 10
for (j in 1:k) {
for (i in 1:n) {
S_t[i, j] <- S_0[j] * exp(m[j] * t * DEL + s[j] * sqrt(DEL) * sumzv[i, j])
}
}
Thank you for your help. Please keep in mind that I'm a beginner :)
Unfortunately, I couldn't find any helpful information on the internet so far. Some pages said vectorization is helpful for speeding up R code, but I couldn't see how to apply it here.
I tried to break down the data frames into vectors but this got very complex.
The following code, with a vectorized inner loop, is equivalent to the posted code.
It also pre-computes the inner-loop vectors fac1 and fac2.
S_t <- data.frame(matrix(0, nrow = n, ncol = k))
fac1 <- m * t * DEL
fac2 <- s * sqrt(DEL)
for (j in 1:k) {
S_t[, j] <- S_0[j] * exp(fac1[j] + fac2[j] * sumzv[, j])
}
The fully vectorized version of the loop on j above is the one-liner below. The transposes are needed because R is column-major and we are multiplying by row vectors indexed on j = 1:k.
S_t2 <- t(S_0 * exp(fac1 + fac2 * t(sumzv)))
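As a quick sanity check that the one-liner matches the loop, here is a small self-contained sketch with made-up sizes and values (the loop is written with sapply here for brevity):
set.seed(1)
n <- 5; k <- 3; t <- 10; DEL <- 1/250          # made-up sizes and time step
m <- runif(k, 0, 0.1); s <- runif(k, 0.1, 0.3) # made-up drift and volatility
S_0 <- rep(100, k)
sumzv <- matrix(rnorm(n * k), n, k)
fac1 <- m * t * DEL
fac2 <- s * sqrt(DEL)
S_t <- sapply(1:k, function(j) S_0[j] * exp(fac1[j] + fac2[j] * sumzv[, j]))
S_t2 <- t(S_0 * exp(fac1 + fac2 * t(sumzv)))
all.equal(S_t, S_t2)  # should be TRUE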
Hello,
I've had problems with the following while loop in R. I'm trying to find the number of samples (n) at which I can achieve a variance less than 0.01 (dtest), and to obtain the resulting values of n, m, s and d:
n <- 100
x <- rnorm(n,0,1)
sd(x)
d <- sd(x)/sqrt(n)
dtest <- 0.01
while(dtest <=0.01) {
x <- rnorm(n,0,1)
n <- n+1
m <- mean(x)
s <- sd(x)
d <- s/sqrt(n)
return(output <- data.frame(n,m,s,d))
}
The first time I ran the loop it worked without problems and gave an n of approximately 27K. Now, every time I execute the loop, it just keeps accumulating.
There are a number of issues:
Your condition should compare d to dtest. Currently it compares two values that never change within the loop, so it will run forever.
Increment n at the start of the loop. Otherwise you're using a different n to compute x and d.
Create your results data frame once, after the loop, rather than creating and discarding it on each iteration. And don't use return(), which is meant for use inside functions.
Note that sd(x)/sqrt(n) is standard error, not variance. Variance would be sd(x)^2.
set.seed(13)
n <- 99
x <- rnorm(n,0,1)
d <- sd(x)/sqrt(n)
dtest <- 0.01
while(dtest <= d) {
n <- n+1
x <- rnorm(n,0,1)
s <- sd(x)
d <- s/sqrt(n)
}
output <- data.frame(n,m = mean(x),s,d)
output
n m s d
1 9700 0.01906923 0.9848469 0.009999605
I have constructed a discrete-time SIR model using a loop within a function (I have added my code below).
Currently the results of the iterations come out as a list that seems to show all the S values first, followed by the I values and then the R values (which I have deduced from the nature of the values).
I need the output as a data frame with the column names: 'Iteration', 'S', 'I' and 'R' from left to right and the corresponding values underneath such that when a row is read it will tell you the iteration and values of S, I and R at that iteration.
I do not know how to construct a data frame that returns the output values in this way. I only started learning R a few weeks ago and am not yet proficient, so any help would be HUGELY appreciated.
Thank you in advance.
#INITIAL CONDITIONS
S=999
I=1
R=0
#PARAMETERS
beta = 0.003 # infectious contact rate (/person/day)
gamma = 0.2 # recovery rate (/day)
#SIR MODEL WITH POISSON SAMPLING
discrete_SIR_model <- function(){
for(i in 1:30){ # the number of loop iterations indicates the duration
                # of the model in days, i.e. 'i in 1:30' constitutes 30 days
  deltaI <- rpois(1, beta * I * S) # rate at which individuals in the
                                   # population are becoming infected
  deltaR <- rpois(1, gamma * I)    # rate at which infected individuals
                                   # are recovering
  S[i+1] <- S[i] - deltaI
  I[i+1] <- I[i] + deltaI - deltaR
  R[i+1] <- R[i] + deltaR
}
}
output <- list(c(S, I, R))
output
If a for loop is used, one can define vectors or a data frame beforehand in which the results are stored:
beta <- 0.001 # infectious contact rate (/person/day)
gamma <- 0.2 # recovery rate (/day)
S <- I <- R <- numeric(31)
S[1] <- 999
I[1] <- 1
R[1] <- 0
set.seed(123) # makes the example reproducible
for(i in 1:30){
deltaI <- rpois(1, beta * I[i] * S[i])
deltaR <- rpois(1, gamma * I[i])
S[i+1] <- S[i] - deltaI
I[i+1] <- I[i] + deltaI - deltaR
R[i+1] <- R[i] + deltaR
}
output <- data.frame(S, I, R)
output
matplot(output)
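To get exactly the column layout asked for in the question, the iteration index can be added as a first column:
output <- data.frame(Iteration = 0:30, S = S, I = I, R = R)
head(output)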
As an alternative, it is also possible to employ a package for this. Package deSolve is intended for differential equations, but it can also solve the discrete case with method "euler":
library(deSolve)
discrete_SIR_model <- function(t, y, p) {
with(as.list(c(y, p)), {
deltaI <- rpois(1, beta * I * S)
deltaR <- rpois(1, gamma * I)
list(as.double(c(-deltaI, deltaI - deltaR, deltaR)))
})
}
y0 <- c(S = 999.0, I=1, R=0)
p <- c(
beta = 0.001, # infectious contact rate (/person/day)
gamma = 0.2 # recovery rate (/day)
)
times <- 1:30
set.seed(576) # to make the example reproducible
output <- ode(y0, times, discrete_SIR_model, p, method="euler")
plot(output, mfrow=c(1,3))
Note: I reduced beta, otherwise the discrete model would become unstable.
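If the same data-frame layout is wanted here too, the ode() result can be coerced with as.data.frame(); its time column plays the role of the iteration index (a small sketch):
df <- as.data.frame(output) # columns: time, S, I, R
names(df)[1] <- "Iteration"
head(df)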
The pricing of the Asian option is approximated, using Monte Carlo simulation, by:
delta <- 1/12
T <- 2
S0 <- 100
sigma <- 0.20
K <- 100
r <- 0.01
n <- 10^4
m <- T/delta
S <- S0
for(i in 1:n) {
for(j in 1:m) {
W <- rnorm(1)
Si <- S[length(S)]*exp((r-0.5*sigma^2)*delta + sigma*sqrt(delta)*W)
S <- c(S, Si)
}
Si.bar <- mean(S[-1])
Ci <- exp(-r*T)*max(Si.bar - K, 0)
}
mean(Ci)
The for(j in 1:m) loop runs perfectly, I think... But when I run it n times, using for(i in 1:n), the values in S get smaller and smaller as n grows, decreasing to almost zero. This leads to a mean (Si.bar <- mean(S[-1])) well below the strike price, K = 100.
I can't figure out what is wrong with the last two lines of code. I'm getting a value of 0 for the Asian call option, due to the payoff function. The correct value for this option is approximately 7 (mean(Ci)).
There are a couple of issues with your code. First, it's inefficient in R to build a vector by repeated concatenation; instead, allocate the vector up front and then assign to its elements.
Second, as I understand it, the aim is to repeat the inner loop n times and store the results in a vector C before taking the mean. That's not what your code does at the moment: each iteration of the outer loop makes S longer and overwrites Ci, so the last statement, mean(Ci), is meaningless.
Here's an amended version of the code. I've used plyr partly to make the code neater, and partly for its progress bar functionality.
library(plyr)
delta <- 1/12
T <- 2
S0 <- 100
sigma <- 0.20
K <- 100
r <- 0.01
n <- 10^4
m <- T/delta
S <- numeric(m + 1)
S[1] <- S0
asian_price <- function() {
for(j in 1:m) {
W <- rnorm(1)
S[j + 1] <- S[j] * exp((r - 0.5 * sigma^2) * delta + sigma * sqrt(delta) * W)
}
Si.bar <- mean(S[-1])
exp(-r * T) * max(Si.bar - K, 0)
}
C <- raply(n, asian_price(), .progress = "text")
mean(C)
# [1] 7.03392
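As an aside, if you'd rather avoid the plyr dependency, base R's replicate() does the same job (minus the progress bar). The inner loop can also be vectorized, since a GBM path is just exp(cumsum()) of its log-returns; a sketch under the same parameter values:
## base-R alternative to raply()
C2 <- replicate(n, asian_price())
mean(C2)
## vectorized inner loop: build the whole price path at once
asian_price_vec <- function() {
  logret <- (r - 0.5 * sigma^2) * delta + sigma * sqrt(delta) * rnorm(m)
  path <- S0 * exp(cumsum(logret))
  exp(-r * T) * max(mean(path) - K, 0)
}
C3 <- replicate(n, asian_price_vec())
mean(C3)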
I am working on a research project where I want to determine the equivalence of two distributions. I am currently using the Mann-Whitney Test for Equivalence, and the code I am running (below) was provided with the book Testing Statistical Hypotheses of Equivalence and Noninferiority by Stefan Wellek (2010). Before running my data I am testing this code with random normal distributions that have the same mean and standard deviation. My problem is that there are three nested for loops, and when running larger distribution sizes (as in the example below) the code takes forever to run. If I only had to run it once that would not be such a problem, but I am doing a simulation test and creating power curves, so I need to run many iterations of this code (around 10,000). At the moment, depending on how I alter the distribution sizes, it takes days to run 10,000 iterations.
Any help in a way to increase the performance of this would be greatly appreciated.
x <- rnorm(n=125, m=3, sd=1)
y <- rnorm(n=500, m=3, sd=1)
alpha <- 0.05
m <- length(x)
n <- length(y)
eps1_ <- 0.2 #0.1382 default
eps2_ <- 0.2 #0.2602 default
eqctr <- 0.5 + (eps2_-eps1_)/2
eqleng <- eps1_ + eps2_
wxy <- 0
pihxxy <- 0
pihxyy <- 0
for (i in 1:m)
  for (j in 1:n)
    wxy <- wxy + trunc(0.5*(sign(x[i] - y[j]) + 1))
for (i in 1:m)
  for (j1 in 1:(n-1))
    for (j2 in (j1+1):n)
      pihxyy <- pihxyy + trunc(0.5*(sign(x[i] - max(y[j1],y[j2])) + 1))
for (i1 in 1:(m-1))
  for (i2 in (i1+1):m)
    for (j in 1:n)
      pihxxy <- pihxxy + trunc(0.5*(sign(min(x[i1],x[i2]) - y[j]) + 1))
wxy <- wxy / (m*n)
pihxxy <- pihxxy*2 / (m*(m-1)*n)
pihxyy <- pihxyy*2 / (n*(n-1)*m)
sigmah <- sqrt((wxy-(m+n-1)*wxy**2+(m-1)*pihxxy+(n-1)*pihxyy)/(m*n))
crit <- sqrt(qchisq(alpha,1,(eqleng/2/sigmah)**2))
if (abs((wxy-eqctr)/sigmah) >= crit) rej <- 1
if (abs((wxy-eqctr)/sigmah) < crit) rej <- 0
if (is.na(sigmah) || is.na(crit)) rej <- 1
MW_Decision <- rej
cat(" ALPHA =",alpha," M =",m," N =",n," EPS1_ =",eps1_," EPS2_ =",eps2_,
"\n","WXY =",wxy," SIGMAH =",sigmah," CRIT =",crit," REJ=",MW_Decision)
See edit below for an even better suggestion
One simple suggestion to get a bit of a speed boost is to byte compile your code.
For example, I wrapped your code into a function starting from the alpha <- 0.05 line and ran it on my laptop. Simply byte compiling your current code as-is, it runs twice as fast.
set.seed(1234)
x <- rnorm(n=125, m=3, sd=1)
y <- rnorm(n=500, m=3, sd=1)
# f1 <- function(x,y){ ...your code...}
system.time(f1(x, y))
# user system elapsed
# 33.249 0.008 33.278
library(compiler)
f2 <- cmpfun(f1)
system.time(f2(x, y))
# user system elapsed
# 17.162 0.002 17.170
EDIT
I should add that this is the type of thing a different language would do much better than R. Have you looked at the Rcpp and inline packages?
I've been curious to learn how to use them so I figured this was a good chance.
Here's a tweak of your code using the inline package and Fortran (since I'm more comfortable with that than C). It wasn't hard at all (provided you know Fortran or C); I just followed the examples listed in cfunction.
First, let's re-write your loops and compile them:
library(inline)
# Fortran code for first loop
loop1code <- "
integer i, j1, j2
real*8 tmp
do i = 1, m
do j1 = 1, n-1
do j2 = j1+1, n
tmp = x(i) - max(y(j1),y(j2))
if (tmp > 0.) pihxyy = pihxyy + 1
end do
end do
end do
"
# Compile the code and turn loop into a function
loop1fun <- cfunction(sig = signature(x="numeric", y="numeric", pihxyy="integer", m="integer", n="integer"), dim=c("(m)", "(n)", "", "", ""), loop1code, language="F95")
# Fortran code for second loop
loop2code <- "
integer i1, i2, j
real*8 tmp
do i1 = 1, m-1
do i2 = i1+1, m
do j = 1, n
tmp = min(x(i1), x(i2)) - y(j)
if (tmp > 0.) pihxxy = pihxxy + 1
end do
end do
end do
"
# Compile the code and turn loop into a function
loop2fun <- cfunction(sig = signature(x="numeric", y="numeric", pihxxy="integer", m="integer", n="integer"), dim=c("(m)", "(n)", "", "", ""), loop2code, language="F95")
Now let's create a new function that uses these. To keep it short, I'll just sketch the key parts I modified from your code:
f3 <- function(x, y){
# ... code ...
# Remove old loop
## for (i in 1:m)
## for (j1 in 1:(n-1))
## for (j2 in (j1+1):n)
## pihxyy <- pihxyy + trunc(0.5*(sign(x[i] - max(y[j1],y[j2])) + 1))
# Call new function from compiled code instead
pihxyy <- loop1fun(x, y, pihxyy, m, n)$pihxyy
# Remove second loop
## for (i1 in 1:(m-1))
## for (i2 in (i1+1):m)
## for (j in 1:n)
## pihxxy <- pihxxy + trunc(0.5*(sign(min(x[i1],x[i2]) - y[j]) + 1))
# Call new compiled function for second loop
pihxxy <- loop2fun(x, y, pihxxy, m, n)$pihxxy
# ... code ...
}
And now we run it and voila, we get a huge speed boost! :)
system.time(f3(x, y))
# user system elapsed
# 0.12 0.00 0.12
I did check that it got the same results as your code, but you probably want to run some additional tests just in case.
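For instance, a minimal consistency check (assuming the wrapped functions f2 and f3 are modified to return the final decision MW_Decision rather than just printing it):
identical(f2(x, y), f3(x, y)) # should be TRUE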
You can use outer instead of the first double loop:
set.seed(42)
f1 <- function(x,y) {
wxy <- 0
for (i in 1:m)
for (j in 1:n)
wxy <- wxy + trunc(0.5*(sign(x[i] - y[j]) + 1))
wxy
}
f2 <- function(x,y) sum(outer(x,y, function(x,y) trunc(0.5*(sign(x-y)+1))))
f1(x,y)
[1] 32041
f2(x,y)
[1] 32041
You get roughly a 50x speedup:
library(microbenchmark)
microbenchmark(f1(x,y),f2(x,y))
Unit: milliseconds
expr min lq median uq max neval
f1(x, y) 138.223841 142.586559 143.642650 145.754241 183.0024 100
f2(x, y) 1.846927 2.194879 2.677827 3.141236 21.1463 100
The other loops are trickier.
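They can, however, be reduced to counting. pihxyy adds 1 for every (i, j1 < j2) with x[i] > max(y[j1], y[j2]), i.e. x[i] strictly exceeds both y values, so each x[i] contributes choose(k, 2) where k is the number of y values below it; symmetrically, each y[j] contributes choose(c, 2) to pihxxy, where c is the number of x values above it. A sketch of the counting versions (these reproduce the strict comparisons of the original trunc(sign(...)) terms):
f_pihxyy <- function(x, y) {
  k <- vapply(x, function(xi) sum(y < xi), numeric(1))   # y values below each x[i]
  sum(choose(k, 2))
}
f_pihxxy <- function(x, y) {
  cnt <- vapply(y, function(yj) sum(x > yj), numeric(1)) # x values above each y[j]
  sum(choose(cnt, 2))
}
This turns the O(m*n^2) and O(m^2*n) triple loops into O(m*n) work.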