Using replicate over for loop - r

I want to make a simulation 10000 times on the difference between load and demand. I need the proportion of negative differences over all simulation. I used the for loop below, but this is not performant for large simulations as it requires me to calculate everything by hand later on.
How can I do the same with replicate? What are the benefits of using replicate over for loop?
for(i in 1:10){
load<-rnorm(10,8,2)
demand<-c(6,7,6,5,6,7,8,7,6,5)
diff<-load-demand
res<-sum(diff<0) / 10# sum the negative differences
print(res)
}

The big disadvantage of using a for-loop, is that you don't automatically store everything in an object. replicate() does that for you.
Another tip: keep everything that doesn't change out of the loop (like the definition of demand)
How to do what you describe using replicate():
nsim <- 10000
demand <- c(6,7,6,5,6,7,8,7,6,5)
res <- replicate(nsim,{
load <- rnorm(10,8,2)
diff <- load - demand
return(sum(diff < 0))
})
res <- sum(res) / nsim
The return() function isn't necessary (a code block returns the last line it executes), but I add it so you see immediately what's going to be stored in res.
But you actually don't need a loop to do this (given you described what you want accurately). You can just use vectorization and recycling to do this:
nsim <- 10000
demand <- c(6,7,6,5,6,7,8,7,6,5)
load <- rnorm(10*nsim, 8, 2)
diff <- load - demand
res <- sum(diff < 0) / nsim
replicate or a for loop actually only make sense if you want to keep the results for each simulation. As you're only interested in the overall result, the second code block will be more performant and R-like.

Related

Foreach instead for loop

I have a large database and I wrote a code which executes the same calculations on that in a rolling manner by nesting it in a for loop. My problem is that the code runs pretty long. As I read, this is probably caused by R using a single-threaded method as default. As far as I know, foreach package would make it possible to speed up the execution by considerable time, however, I am unsure how to implement it. Currently, my code looks like this, in every iteration I subset a chunk of the large database and do various stuff with these subsets. At the end of an iteration, I collect the output in a time series. Is it possible to apply foreach in this situation?
(k in seq(1,5284, 21)) {
fdata <- data[k:(k+251),]
tdata <- data[(k+252):(k+377),]
}
Thanks!
This is certainly doable using foreach. Depending on your OS you would first have to load a suitable backend (e.g. SNOW on a windows machine) and then set up a cluster.
Example:
library(foreach)
library(doSNOW)
# set number of cores/CPUs to be used
(n_cores <- parallel::detectCores() - 1)
# some example data
dat <- matrix(1:1e3, ncol = 10)
# a set you iterate over
k <- 1:99
# run stuff in parallel
cl <- makeCluster(n_cores)
registerDoSNOW(cl)
result <- foreach(k) %dopar% {
fdata <- dat[k:(k+1), ]
# do computationally expensive stuff with `fdata`
# ... and return something
cumsum(fdata[1,] + fdata[2,])
}
stopCluster(cl)
By default result will be a list of the results. There are, however, ways to combine into an array or similar. Look at details on the .combine argument in ?foreach.

Speed up for loop assigning data to matrix in R

I am simulating data and filling a matrix using a for loop in R. Currently the loop is running slower than I would like. I've done some work to vectorize some of the variables to improve the loops speed but it still taking some time. I believe the
mat[j,year] <- sum(vec==1)/x
part of the loop is slowing things down. I've looked into filling matrices more efficiently but could not find anything to help my current problem. Eventually this will be used as a part of a shiny app so all of variables I assign will need to be easily assigned different values.
Any advice to speed up the loop or more efficiently write this loop would be greatly appreciated.
Here is the loop:
#These variables are all specified because they need to change with different simulations
num.sims <- 20
time <- 50
mat <- matrix(nrow = num.sims, ncol = time)
x <- 1000
init <- 0.5*x
vec <- vector(length = x)
ratio <- 1
freq <- -0.4
freq.vec <- numeric(nrow(mat))
## start a loop
for (j in 1:num.sims) {
vec[1:init] <- 1; vec[(init+1):x] <- 2
year <- 2
freq.vec[j] <- sum(vec==1)/x
for (i in 1:(x*(time-1))) {
freq.1 <- sum(vec==1)/x; freq.2 <- 1 - freq.1
fit.ratio <- exp(freq*(freq.1-0.5) + log(ratio))
Pr.1 <- fit.ratio*freq.1/(fit.ratio*freq.1 + freq.2)
vec[ceiling(x*runif(1))] <- sample(c(1,2), 1, prob=c(Pr.1,1-Pr.1))
## record data
if (i %% x == 0) {
mat[j,year] <- sum(vec==1)/x
year <- year + 1
}}}
The inner loop is what is slowing you down. You're doing x number of iterations to update each cell in the matrix. Since each trip to modify vec depends on the previous iteration, this would be difficult to simplify. #Andrew Feierman is probably correct that this would benefit from being moved to C++, at least the four lines before the if statement.
Alternatively, this only takes 10-20 seconds to run. Unless you're going to scale this up or run it many times, it might not be worth the trouble to speed it up. If you do keep it as is, you could put a progress bar in Shiny to let the user know things are still working.
Depending on how often you will need to call this loop, it could be worth rewriting it in C++. R is built on C++, and any C++ will run many, many times faster than even efficient R code.
sourceCpp is a good package to start with: https://www.rdocumentation.org/packages/Rcpp/versions/0.12.11/topics/sourceCpp

Block bootstrap for genomic data

I am trying to implement a block bootstrap procedure, but I haven't figured out a way of doing this efficiently.
My data.frame has the following structure:
CHR POS var_A var_B
1 192 0.9 0.7
1 2000 0.8 0.3
2 3 0.21 0.76
2 30009 0.36 0.15
...
The first column is the chromosome identification, the second column is the position, and the last two columns are variables for which I want to calculate a correlation. The problem is that each row is not entirely independent to one another, depending on the distance between them (the closer the more dependent), and so I cannot simply do cor(df$var_A, df$var_B).
The way out of this problem that is commonly used with this type of data is performing a block bootstrap. That is, I need to divide my data into blocks of length X, randomly select one row inside that block, and then calculate my statistic of interest. Note, however, that these blocks need to be defined based on the column POS, and not based on the row number. Also, this procedure needs to be done for each chromosome.
I tried to implement this, but I came up with the slowest code possible (it didn't even finish running) and I am not 100% sure it works.
x = 1000
cors = numeric()
iter = 1000
for(j in 1:iter) {
df=freq[0,]
for (i in unique(freq$CHR)) {
t = freq[freq$CHR==i,]
fim = t[nrow(t),2]
i = t[1,2]
f = i + x
while(f < fim) {
rows = which(t$POS>=i & t$POS<f)
s = sample(rows)
df = rbind(df,t[s,])
i = f
f = f + x
}
}
cors = c(cors, cor(df$var_A, df$var_B))
}
Could anybody help me out? I am sure there is a more efficient way of doing this.
Thank you in advance.
One efficient way to try would be to use the 'boot' package, of which functions include parallel processing capabilities.
In particular, the 'tsboot', or time series boot function, will select ordered blocks of data. This could work if your POS variable is some kind of ordered observation.
The boot package functions are great, but they need a little help first. To use bootstrap functions in the boot package, one must first wrap the statistic of interest in a function which includes an index argument. This is the device the bootstrap generated index will use to pass sampled data to your statistic.
cor_hat <- function(data, index) cor(y = data[index,]$var_A, x = data[index,]$var_B)
Note cor_hat in the arguments below. The sim = "fixed", l = 1000 arguments, which indicate you want fixed blocks of length(l) 1000. However, you could do blocks of any size, 5 or 10 if your trying to capture nearest neighbor dynamics moving over time. The multicore argument speaks for itself, but it maybe "snow" if you are using windows.
library(boot)
tsboot(data, cor_hat, R = 1000, sim = "fixed", l = 1000, parallel = "multicore", ncpus = 4)
In addition, page 194 of Elements of Statistical Learning provides a good example of the framework using the traditional boot function, all of which is relevant to tsboot.
Hope that helps, good luck.
Justin
r
I hope I understood you right:
# needed for round_any()
library(plyr)
res <- lapply(unique(freq$CHR),function(x){
freq_sel <- freq[freq$CHR==x,]
blocks <- lapply(seq(1,round_any(max(freq_sel$POS),1000,ceiling),1000), function(ix) freq_sel[freq_sel$POS > ix & freq_sel$POS <= ix+999,])
do.call(rbind,lapply(blocks,function(x) if (nrow(x) > 1) x[sample(1:nrow(x),1),] else x))
})
This should return a list with an entry for each chromosome. Within each entry, there's an observation per 1kb-block if present. The number of blocks is determined by the maximum POS value.
EDIT:
library(doParallel)
library(foreach)
library(plyr)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
res <- foreach(x=unique(freq$CHR),.packages = 'plyr') %dopar% {
freq_sel <- freq[freq$CHR==x,]
blocks <- lapply(seq(1,round_any(max(freq_sel$POS),1000,ceiling),1000), function(ix) freq_sel[freq_sel$POS > ix & freq_sel$POS <= ix+999,])
do.call(rbind,lapply(blocks,function(x) if (nrow(x) > 1) x[sample(1:nrow(x),1),] else x))
}
stopCluster(cl)
This is a simple parallelisation with foreach on each Chromosome. It could be better to restructure the function and base the parallel processing on another level (such as the 1000 iterations or maybe the blocks). In any case, I can just stress again what I was saying in my comment: Before you work on parallelising your code, you should be sure that it's as efficient as possible. Meaning you might want to look into the boot package or similar to get an increase in efficiency. That said, with the number of iterations you're planning, parallel processing might be useful once you're comfortable with your function.
So, after a while I came up with an answer to my problem. Here it goes.
You'll need the package dplyr.
l = 1000
teste = freq %>%
mutate(w = ceiling(POS/l)) %>%
group_by(CHR, w) %>%
sample_n(1)
This code creates a new variable named w based on the position in the genome (POS). This variable w is the window to which each row was assigned, and it depends on l, which is the length of your window.
You can repeat this code several times, each time sampling one row per window/CHR (with the sample_n(1)) and apply whatever statistic of interest that you want.

Implementing fast numerical calculations in R

I was trying to do an extensive computation in R. Eighteen hours have passed but my RStudio seems to continue to work. I'm not sure if I could have written the script in a different way to make it faster. I was trying to implement a Crank–Nicolson type method over a 50000 by 350 matrix as shown below:
#defining the discretization of cells
dt<-1
t<-50000
dz<-0.0075
z<-350*dz
#velocity & diffusion
v<-2/(24*60*60)
D<-0.02475/(24*60*60)
#make the big matrix (all filled with zeros)
m <- as.data.frame(matrix(0, t/dt+1, z/dz+2)) #extra columns/rows for boundary conditions
#fill the first and last columns with constant boundary values
m[,1]<-400
m[,length(m)]<-0
#implement the calculation
for(j in 2:(length(m[1,])-1)){
for(i in 2:length(m[[1]])){
m[i,][2:length(m)-1][[j]]<-m[i-1,][[j]]+
D*dt*(m[i-1,][[j+1]]-2*m[i-1,][[j]]+m[i-1,][[j-1]])/(dz^2)-
v*dt*(m[i-1,][[j+1]]-m[i-1,][[j-1]])/(2*dz)
}}
Is there a way to know how long would it take for R to implement it? Is there a better way of constructing the numerical calculation? At this point, I feel like excel could have been faster!!
Just making a few simple optimisations really helps here. The original version code of your code would take ~ 5 days on my laptop. Using a matrix and calculating just once values that are reused in the loop, we bring this down to around 7 minutes
And think about messy constructions like
m[i,][2:length(m)-1][[j]]
This is equivalent to
m[[i, j]]
which would be faster (as well as much easier to understand). Making this change further reduces the runtime by another factor of over 2, to around 3 minutes
Putting this together we have
dt<-1
t<-50000
dz<-0.0075
z<-350*dz
#velocity & diffusion
v<-2/(24*60*60)
D<-0.02475/(24*60*60)
#make the big matrix (all filled with zeros)
m <- (matrix(0, t/dt+1, z/dz+2)) #extra columns/rows for boundary conditions
# cache a few values that get reused many times
NC = NCOL(m)
NR = NROW(m)
C1 = D*dt / dz^2
C2 = v*dt / (2*dz)
#fill the first and last columns with constant boundary values
m[,1]<-400
m[,NC]<-0
#implement the calculation
for(j in 2:(NC-1)){
for(i in 2:NR){
ma = m[i-1,]
ma.1 = ma[[j+1]]
ma.2 = ma[[j-1]]
m[[i,j]] <- ma[[j]] + C1*(ma.1 - 2*ma[[j]] + ma.2) - C2*(ma.1 - ma.2)
}
}
If you need to go even faster than this, you can try out some more optimisations. For example see here for how different ways of indexing the same element can have very different execution times. In general it is better to refer to column first, then row.
If all the optimisations you can do in R are not enough for your speed requirements, then you might implement the loop in RCpp instead.

Nested for loop improvement

I'm trying to write a simple population model using nested for loops. I want to project the population 10yrs, and I want to run this projection 100 times. I need the output of each time step and include a counter so I know which year of which iteration the results correspond to. I have an example running using this code, but I was wondering if:
1) There was a more elegant solution to than using the rows<-rows+1 command to reset and advance the counter each time?
2) There is a more elegant solution than for loops to accomplish this?
library(VGAM)
popdata<-matrix(nrow=1000,ncol=3)
dimnames(popdata)[[2]]<-c('iteration','year','popsize')
rows<-1
for (iteration in 1:100){
pop<-50
for(year in 1:10){
popdata[rows,1]<-iteration
popdata[rows,2]<-year
pop<-rbetabinom(1,pop,0.6)
popdata[rows,3]<-pop
rows<-rows+1
}
}
You can replace the row counter by (iteration - 1) * 10 + year.
You can also clean up the work done in the loop by assigning the iteration and year data outside as you know in advance what these results will be.
This gives something like
popdata<-matrix(nrow=1000,ncol=3)
dimnames(popdata)[[2]]<-c('iteration','year','popsize')
# Do the deterministic stuff first
popdata[, 1] <- rep(1:100, each = 10)
popdata[, 2] <- rep(1:10, times = 100)
for (iteration in 1:100){
pop<-50
for(year in 1:10){
rows <- (iteration - 1) * 10 + yea
pop<-rbetabinom(1,pop,0.6)
popdata[rows,3]<-pop
}
}
Because pop depends on the previous pop it is less easy to simplify the inner loop entirely.
I think the approach would be to generate all your binomial results as a long vector of 1s and 0s.
Then in each loop you would subset the next pop elements of that vector.
In the end I don't think that would actually be neater or more readable though.

Resources