Speed up for loop assigning data to matrix in R

I am simulating data and filling a matrix using a for loop in R. Currently the loop is running slower than I would like. I've done some work to vectorize some of the variables to improve the loop's speed, but it is still taking some time. I believe the
mat[j,year] <- sum(vec==1)/x
part of the loop is slowing things down. I've looked into filling matrices more efficiently but could not find anything that helps with my current problem. Eventually this will be used as part of a Shiny app, so all of the variables I assign will need to be easily assigned different values.
Any advice to speed up the loop or more efficiently write this loop would be greatly appreciated.
Here is the loop:
#These variables are all specified because they need to change with different simulations
num.sims <- 20
time <- 50
mat <- matrix(nrow = num.sims, ncol = time)
x <- 1000
init <- 0.5*x
vec <- vector(length = x)
ratio <- 1
freq <- -0.4
freq.vec <- numeric(nrow(mat))
## start a loop
for (j in 1:num.sims) {
  vec[1:init] <- 1; vec[(init + 1):x] <- 2
  year <- 2
  freq.vec[j] <- sum(vec == 1)/x
  for (i in 1:(x*(time - 1))) {
    freq.1 <- sum(vec == 1)/x; freq.2 <- 1 - freq.1
    fit.ratio <- exp(freq*(freq.1 - 0.5) + log(ratio))
    Pr.1 <- fit.ratio*freq.1/(fit.ratio*freq.1 + freq.2)
    vec[ceiling(x*runif(1))] <- sample(c(1, 2), 1, prob = c(Pr.1, 1 - Pr.1))
    ## record data
    if (i %% x == 0) {
      mat[j, year] <- sum(vec == 1)/x
      year <- year + 1
    }
  }
}

The inner loop is what is slowing you down. You're doing x iterations to update each cell of the matrix, and since each modification of vec depends on the previous iteration, this is difficult to vectorize. @Andrew Feierman is probably correct that this would benefit from being moved to C++, at least the four lines before the if statement.
Alternatively, this only takes 10-20 seconds to run. Unless you're going to scale it up or run it many times, it might not be worth the trouble to speed it up. If you do keep it as is, you could put a progress bar in Shiny to let the user know things are still working, as sketched below.
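For reference, a minimal runnable sketch of that idea using Shiny's withProgress()/incProgress(); the ui/output ids are placeholders, and the inner-loop body is replaced by a stand-in so the example runs on its own:

library(shiny)

num.sims <- 20; time <- 50

ui <- fluidPage(plotOutput("sim"))

server <- function(input, output, session) {
  output$sim <- renderPlot({
    mat <- matrix(nrow = num.sims, ncol = time)
    withProgress(message = "Simulating...", value = 0, {
      for (j in 1:num.sims) {
        mat[j, ] <- runif(time)  # stand-in for the question's inner loop
        incProgress(1/num.sims, detail = paste("run", j, "of", num.sims))
      }
    })
    matplot(t(mat), type = "l", xlab = "year", ylab = "frequency")
  })
}

shinyApp(ui, server)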

Depending on how often you will need to call this loop, it could be worth rewriting it in C++. Much of R itself is implemented in C, and compiled C++ called from R can run many, many times faster than even efficient R code.
sourceCpp() from the Rcpp package is a good place to start: https://www.rdocumentation.org/packages/Rcpp/versions/0.12.11/topics/sourceCpp
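To make that concrete, here is a sketch (untested, and simRow is my name, not from the question) of the inner loop moved to C++ with Rcpp::cppFunction(). It also keeps a running count of 1s, so each iteration is O(1) instead of the O(x) cost of sum(vec==1):

library(Rcpp)  # requires a compiler toolchain (e.g. Rtools on Windows)

cppFunction('
NumericVector simRow(int x, int time, double ratio, double freq, int init) {
  IntegerVector vec(x, 2);
  for (int k = 0; k < init; ++k) vec[k] = 1;
  int count1 = init;                 // running count of 1s, replaces sum(vec==1)
  NumericVector row(time, NA_REAL);  // first slot stays NA, as mat[, 1] does
  int year = 1;
  for (int i = 1; i <= x * (time - 1); ++i) {
    double freq1 = (double)count1 / x;
    double fit   = exp(freq * (freq1 - 0.5) + log(ratio));
    double pr1   = fit * freq1 / (fit * freq1 + (1.0 - freq1));
    int idx      = (int)(R::runif(0.0, 1.0) * x);
    int newval   = (R::runif(0.0, 1.0) < pr1) ? 1 : 2;
    count1 += (newval == 1) - (vec[idx] == 1);
    vec[idx] = newval;
    if (i % x == 0) row[year++] = (double)count1 / x;
  }
  return row;
}')

# One row per simulation, same shape as mat, using the question's variables:
mat2 <- t(replicate(num.sims, simRow(x, time, ratio, freq, init)))

Beyond the language switch, maintaining the running count is itself a large win, since the original recomputes sum(vec==1) on every one of the x*(time-1) iterations.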

Related

How to optimize for loops and rbinds with large datasets

I am currently working on a large dataset (~1.5M entries) using R, a language I am not yet completely familiar with.
Basically, what I am trying to do is the following:
I want to check what happens during a time interval after "Start".
"Start" represents a few temporal values within every "Trial", and "Trial" represents all of the trials recorded for one "Reference".
So for each Reference, I want to check all Trials and see what happens after "Start" during that Trial.
It's not so important if what I'm trying to do is still obscure; the point is that I want to check every entry in my data frame.
My instinctive (read: R-noob-ish) way of programming this led me to a piece of code that I know is far from optimized and takes a LOT of time to run.
My_Function <- function(DataFrame){
  counts <- data.frame()
  for (reference in DataFrame$Ref){
    ref_tested <- subset(DataFrame, Ref == reference)
    ref_count <- data.frame()
    for (trial in ref_tested$Trial){
      trial_tested <- subset(ref_tested, Trial == trial)
      for (timing in trial_tested$Start){
        interesting <- subset(DataFrame, Start > timing & Start <= timing + some_time & Trial == trial)
        ref_count <- rbind(ref_count, as.data.frame(table(interesting$ele)))
      }
    }
    temp <- aggregate(Freq ~ Var1, data = ref_count, FUN = sum)
    counts <- rbind(counts, temp)
  }
  return(counts)
}
Here, as.data.frame(table(interesting$ele)) can have different lengths, and thus so can ref_count.
I failed to find a way to grow my data frame without using rbind, but I also know that, given the size of my output, it is not time-efficient at all.
Also, I have programmed in other languages such as Python and C++ (a long time ago), and I know that three nested for loops usually mean you're doing it wrong. But then again, I did not find a way to avoid them in this particular case.
So, do you have any advice on how to use R, or one of its package, to avoid such a situation?
Thank you in advance,
K.
EDIT:
Thank you for your first pieces of advice.
I tried the plyr package and was able to reduce the size of my code chunk; it does as expected and is more understandable. Plus, I was able to produce some example data for reproducibility. See:
# Example input
library(plyr)
DF <- data.frame(Trial = sample(1:400, 500000, replace = TRUE),
                 Ref   = sample(1:25,  500000, replace = TRUE),
                 Start = rnorm(n = 500000, mean = 1, sd = 1))
DF$rn <- rownames(DF)
tempDF <- DF[sample(nrow(DF), 100), ]  # for testing purposes
Test <- ddply(.data = tempDF, "rn", function(x){
  interesting <- subset(DF,
                        Trial == x$Trial &
                        Start > x$Start &
                        Start < x$Start + some_time)
  interesting$Elec <- x$Ref
  return(interesting)
})
This is nice, but I still feel like it is not the way to go; in this example we only browse 100 observations, which takes ~4 s (measured with system.time()), but if I want to scan all 500,000 observations of DF it would take more than 5 hours.
I have checked out data.table, but I am still trying to understand how to use it; a sketch of one possible data.table approach follows.
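For what it's worth, here is a sketch of the same lookup as a single data.table non-equi self-join. It is untested, some_time is a placeholder (the question never defines it), and since the example DF has no ele column it counts the Ref of the matched rows instead:

library(data.table)

some_time <- 0.5                 # placeholder value
DT <- as.data.table(DF)
DT[, End := Start + some_time]

# For every row, pull the rows of the same Trial whose Start falls in
# (Start, Start + some_time); one join replaces the whole ddply() call.
hits <- DT[DT,
           on = .(Trial, Start > Start, Start < End),
           .(Elec = i.Ref, Trial = i.Trial, hit = x.Ref),
           allow.cartesian = TRUE, nomatch = NULL]

# Counts per reference, analogous to the table()/aggregate() step:
counts <- hits[, .N, by = .(Elec, hit)]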

Using replicate over for loop

I want to run a simulation of the difference between load and demand 10,000 times, and I need the proportion of negative differences over all simulations. I used the for loop below, but it is not practical for large simulations because it requires me to collect the printed results by hand afterwards.
How can I do the same with replicate? What are the benefits of using replicate over a for loop?
for (i in 1:10) {
  load <- rnorm(10, 8, 2)
  demand <- c(6, 7, 6, 5, 6, 7, 8, 7, 6, 5)
  diff <- load - demand
  res <- sum(diff < 0) / 10  # proportion of negative differences
  print(res)
}
The big disadvantage of using a for loop is that you don't automatically store everything in an object; replicate() does that for you.
Another tip: keep everything that doesn't change out of the loop (like the definition of demand).
How to do what you describe using replicate():
nsim <- 10000
demand <- c(6, 7, 6, 5, 6, 7, 8, 7, 6, 5)
res <- replicate(nsim, {
  load <- rnorm(10, 8, 2)
  diff <- load - demand
  return(sum(diff < 0))
})
res <- sum(res) / (10 * nsim)  # divide by the total number of differences
The return() function isn't necessary (a code block returns the last expression it evaluates), but I add it so you see immediately what's going to be stored in res. Note that each replication returns a count out of 10 draws, so the final division is by 10 * nsim to get a proportion.
But you actually don't need a loop to do this (given that you described what you want accurately). You can just use vectorization and recycling:
nsim <- 10000
demand <- c(6, 7, 6, 5, 6, 7, 8, 7, 6, 5)
load <- rnorm(10 * nsim, 8, 2)
diff <- load - demand  # demand (length 10) is recycled along load
res <- sum(diff < 0) / (10 * nsim)  # equivalently, mean(diff < 0)
replicate() or a for loop only makes sense if you want to keep the results of each individual simulation. As you're only interested in the overall result, the second code block is more performant and more R-like.

Nested for loop improvement

I'm trying to write a simple population model using nested for loops. I want to project the population 10 years ahead and run this projection 100 times. I need the output of each time step, including a counter, so I know which year of which iteration each result corresponds to. I have an example running with the code below, but I was wondering:
1) Is there a more elegant solution than using the rows <- rows + 1 command to reset and advance the counter each time?
2) Is there a more elegant solution than for loops to accomplish this?
library(VGAM)
popdata <- matrix(nrow = 1000, ncol = 3)
dimnames(popdata)[[2]] <- c('iteration', 'year', 'popsize')
rows <- 1
for (iteration in 1:100) {
  pop <- 50
  for (year in 1:10) {
    popdata[rows, 1] <- iteration
    popdata[rows, 2] <- year
    pop <- rbetabinom(1, pop, 0.6)
    popdata[rows, 3] <- pop
    rows <- rows + 1
  }
}
You can replace the row counter with (iteration - 1) * 10 + year.
You can also move work out of the loop by assigning the iteration and year columns up front, since you know in advance what those values will be.
This gives something like
popdata <- matrix(nrow = 1000, ncol = 3)
dimnames(popdata)[[2]] <- c('iteration', 'year', 'popsize')
# Do the deterministic stuff first
popdata[, 1] <- rep(1:100, each = 10)
popdata[, 2] <- rep(1:10, times = 100)
for (iteration in 1:100) {
  pop <- 50
  for (year in 1:10) {
    rows <- (iteration - 1) * 10 + year
    pop <- rbetabinom(1, pop, 0.6)
    popdata[rows, 3] <- pop
  }
}
Because each pop depends on the previous pop, it is less easy to simplify the inner loop entirely. I think the approach would be to generate all your binomial results up front as a long vector of 1s and 0s, and then in each loop subset the next pop elements of that vector. In the end I don't think that would actually be neater or more readable, though. An alternative using replicate() is sketched below.
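For what it's worth, a sketch of a middle ground (one_run is my name; it keeps the same VGAM::rbetabinom update as the question): wrap a single trajectory in a function and let replicate() handle the iterations and the bookkeeping:

library(VGAM)

one_run <- function(years = 10, pop0 = 50, prob = 0.6) {
  pop <- pop0
  vapply(seq_len(years), function(y) {
    pop <<- rbetabinom(1, pop, prob)  # update pop in the enclosing function
    pop
  }, numeric(1))
}

traj <- replicate(100, one_run())  # 10 x 100 matrix (years x iterations)
popdata <- data.frame(
  iteration = rep(1:100, each = 10),
  year      = rep(1:10, times = 100),
  popsize   = as.vector(traj)      # column-major order matches the labels
)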

Can I stop and start a script in R

I am working with R and my script is taking a very long time. I was thinking I could stop it and then restart it by changing my counters.
My code is this
NC <- MLOA
for (i in 1:313578) {
  len_mods <- length(MLOA[[i]])
  for (j in 1:2090) {
    for (k in 1:len_mods) {
      temp_match <- matchv[j]
      temp_rep <- replacev[j]
      temp_mod <- MLOA[[i]][k]
      is_found <- match(temp_mod, temp_match, nomatch = 0, incomparables = 0)
      if (is_found[1] == 1) NC[[i]][k] <- temp_rep
      rm(temp_match, temp_rep, temp_mod)
    }
  }
}
I am thinking that I can stop my script, then restart it by checking the current values of i, j and k and changing the loop ranges to start at those values. So instead of counting for (i in 1:313578), if i is up to 100,000 I could do for (i in 100000:313578).
I don't want to stop my script though before checking that my logic about restarting it is solid.
Thanks in anticipation
I'm a bit confused about what you are doing. Generally on this forum it is a good idea to simplify your code greatly and present only the core of the problem in a very simple example. That notwithstanding, this might help: put your for loop in a function whose parameters are the first elements of the sequences you loop over. For example:
myloop <- function(x, ...){
  for (i in seq(x, 313578, 1)){
    ...
This way you can easily manipulate where your loop starts.
The more important question, however, is why you are using for loops in the first place. In R, for loops should be avoided at all costs. By vectorizing your code you can greatly increase its speed; I have seen speed increases of a factor of 500! A quick illustration of the gap is below.
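As a rough illustration (timings vary by machine, and this is a toy comparison, not the poster's code):

x <- rnorm(1e7)
system.time({ s <- 0; for (v in x) s <- s + v })  # interpreted loop
system.time(sum(x))                               # vectorized built-in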
In general, the only reason you use a for loop in R is if current iterations of the for loop depend on previous iterations. If this is the case then you are likely bound to the slow for loop.
Depending on your computer skills, however, even for loops can be made faster in R. If you know C, or are willing to learn a bit, interfacing with C can dramatically increase the speed of your code.
An easier way to increase the speed of your code, which unfortunately will not yield the same speed-up as interfacing with C, is to use R's byte compiler. Check out the cmpfun() function from the compiler package.
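A minimal sketch of cmpfun() in use (note that recent versions of R byte-compile functions automatically via the JIT, so the gain today is often modest):

library(compiler)

slow_sum <- function(n) {      # toy function, not the poster's loop
  s <- 0
  for (i in 1:n) s <- s + i
  s
}

fast_sum <- cmpfun(slow_sum)   # byte-compiled version, same interface
fast_sum(1e6)                  # called exactly like the original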
One final thing on speeding up code: the line temp_match <- matchv[j] looks innocuous enough, but it can really slow things down, because every time you assign matchv[j] to temp_match you make a copy. Your computer then needs to find somewhere in RAM to store that copy. R is smart: as you make more and more copies, it cleans up after you and throws away the ones you are no longer using via the garbage collector. But finding places to store your copies, as well as running the garbage collector, takes time. Read this if you want to learn more: http://adv-r.had.co.nz/memory.html
You could also use while loops for your three loops to maintain counters. In the following, you can stop the script at any time (and view the intermediate results), then restart it by setting continue <- TRUE (so the counters are not reset) and re-running the loop part of the script:
n <- 6
res <- array(NaN, dim = rep(n, 3))
continue <- FALSE
if (!continue) {
  i <- 1
  j <- 1
  k <- 1
}
while (k <= n) {
  while (j <= n) {
    while (i <= n) {
      res[i, j, k] <- as.numeric(paste0(i, j, k))
      Sys.sleep(0.1)
      i <- i + 1
    }
    j <- j + 1
    i <- 1
  }
  k <- k + 1
  j <- 1
}
i; j; k
res
This is what I got to:
for (i in 1:313578) {
  mp <- match(MLOA[[i]], matchv, nomatch = 0, incomparables = 0)
  lgic <- which(as.logical(mp), arr.ind = FALSE, useNames = TRUE)
  NC[[i]][lgic] <- replacev[mp]
}
Thanks to those who responded. Jacob H, you are right, I am definitely a newbie with R; your response was useful. Frank, your pointers helped.
My solution probably still isn't an optimal one. All I wanted to do was a find and replace: matchv was the vector in which I searched for a match for each MLOA[[i]], with replacev being the vector of replacement values.

Make nested loops more efficient?

I'm analyzing large sets of data using the following script:
M <- c_alignment
c_check <- function(x){
  if (x == c_1) {
    1
  } else {
    0
  }
}
both_c_check <- function(x){
  if (x[res_1] == c_1 && x[res_2] == c_1) {
    1
  } else {
    0
  }
}
variance_function <- function(x, y){
  sqrt(x*(1 - x))*sqrt(y*(1 - y))
}
frames_total <- nrow(M)
cols <- ncol(M)
c_vector <- apply(M, 2, max)
freq_vector <- matrix(nrow = sum(c_vector))
co_freq_matrix <- matrix(nrow = sum(c_vector), ncol = sum(c_vector))
insertion <- 0
res_1_insertion <- 0
for (res_1 in 1:cols){
  for (c_1 in 1:c_vector[res_1]){
    res_1_insertion <- res_1_insertion + 1
    insertion <- insertion + 1
    res_1_subset <- sapply(M[, res_1], c_check)
    freq_vector[insertion] <- sum(res_1_subset)/frames_total
    res_2_insertion <- 0
    for (res_2 in 1:cols){
      if (is.na(co_freq_matrix[res_1_insertion, res_2_insertion + 1])){
        for (c_2 in 1:max(c_vector[res_2])){
          res_2_insertion <- res_2_insertion + 1
          both_res_subset <- apply(M, 1, both_c_check)
          co_freq_matrix[res_1_insertion, res_2_insertion] <- sum(both_res_subset)/frames_total
          co_freq_matrix[res_2_insertion, res_1_insertion] <- sum(both_res_subset)/frames_total
        }
      }
    }
  }
}
covariance_matrix <- (co_freq_matrix - crossprod(t(freq_vector)))
variance_matrix <- matrix(outer(freq_vector, freq_vector, variance_function), ncol = length(freq_vector))
correlation_coefficient_matrix <- covariance_matrix/variance_matrix
A model input would be something like this:
1 2 1 4 3
1 3 4 2 1
2 3 3 3 1
1 1 2 1 2
2 3 4 4 2
What I'm calculating is the binomial covariance between each state found in M[,i] and each state found in M[,j]. Each row holds the states observed in one trial, and I want to see how the states of the columns co-vary.
Clarification: I'm finding the covariance of two multinomial distributions, but I'm doing it through binomial comparisons.
The input is a 4200 x 510 matrix, and the c value for each column is about 15 on average. I know for loops are terribly slow in R, but I'm not sure how I can use the apply functions here. If anyone has a suggestion as to properly using apply here, I'd really appreciate it. Right now the script takes several hours. Thanks!
I thought of writing a comment, but I have too much to say.
First of all, if you think apply is faster, look at Is R's apply family more than syntactic sugar? It might be, but it's far from guaranteed.
Next, please don't grow matrices as you move through your code; that slows your code down incredibly. Preallocate the matrix and fill it up; that can increase your code's speed more than tenfold. You're growing different vectors and matrices throughout your code, and that's insane (forgive me the strong speech).
Then, look at the help page of ?subset and the warning given there:
This is a convenience function intended for use interactively. For
programming it is better to use the standard subsetting functions like
[, and in particular the non-standard evaluation of argument subset
can have unanticipated consequences.
Always. Use. Indices.
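For instance, the indexed equivalent of a subset() call looks like this (a generic illustration, not code from the question):

df <- data.frame(x = c(-1, 2, 3), y = c("a", "b", "c"))
subset(df, x > 0)            # interactive convenience
df[df$x > 0, , drop = FALSE] # programmable equivalent using [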
Further, you recalculate the same values over and over again. freq_res_2, for example, is calculated for every res_2 and state_2 as many times as you have combinations of res_1 and state_1. That's just a waste of resources. Move whatever you don't need to recalculate out of your loops, and save it in matrices you can simply access again.
Heck, now that I'm at it: please use vectorized functions. Think again and see what you can drag out of the loops. This is what I see as the core of your calculation:
cov <- (freq_both - (freq_res_1)*(freq_res_2)) /
(sqrt(freq_res_1*(1-freq_res_1))*sqrt(freq_res_2*(1-freq_res_2)))
As I see it, you can construct matrices freq_both, freq_res_1 and freq_res_2 and use them as input for that one line. The result will be the whole covariance matrix (don't call it cov, though: cov is a function). Exit loops. Enter fast code.
Given the fact I have no clue what's in c_alignment, I'm not going to rewrite your code for you, but you definitely should get rid of the C way of thinking and start thinking R.
Let this be a start: The R Inferno
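To make that vectorized final step concrete, a sketch assuming freq_vector (marginal frequencies) and co_freq_matrix (pairwise frequencies) have been filled as in the question; the intermediate names are assumptions:

p <- as.vector(freq_vector)            # marginal frequency of each state
n <- length(p)

freq_res_1 <- matrix(p, nrow = n, ncol = n)  # row i holds p[i] in every column
freq_res_2 <- t(freq_res_1)                  # column j holds p[j] in every row
freq_both  <- co_freq_matrix

covariance  <- freq_both - freq_res_1 * freq_res_2
correlation <- covariance /
  (sqrt(freq_res_1 * (1 - freq_res_1)) * sqrt(freq_res_2 * (1 - freq_res_2)))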
It's not really the four-way nested loops but the way your code grows memory on each iteration; that happens four times, on the cbind and rbind lines. Standard advice in R (and Matlab, and Python) in situations like this is to allocate in advance and then fill in. That's what the apply functions do: they allocate a list as long as the known number of results, assign each result to its slot, and then merge all the results together at the end. In your case you could just allocate the correct-size matrix in advance and assign into it at those four points (roughly speaking). That should be as fast as the apply family, and you might find it easier to code. The generic pattern is sketched below.
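As a toy sketch of that pattern (not the question's code):

n <- 1e5

# Growing: every c() reallocates and copies the whole vector
grow <- c()
for (i in 1:n) grow <- c(grow, i^2)

# Preallocating: one allocation, then indexed assignment
fill <- numeric(n)
for (i in 1:n) fill[i] <- i^2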
