I'm still pretty new to R programming, and I've read a lot about replacing for-loops, particularly with the apply functions, which has been really useful in making my code more efficient. However, in some of the programs I'm trying to create, I have for-loops where each iteration must be carried out in order, because the effects of one iteration affect what happens in the next, and as far as I'm aware this cannot be achieved with, for example, lapply(). An example of this sort of for-loop:
p <- 1
for (i in 1:x) {
  p <- p + sample(c(1, 0), prob = c(p, 1), size = 1)
}
Is there a way to replace this kind of loop?
Thanks for your time, everyone!
This kind of logic is known as a reduction (or fold). Consequently it’s solved not by lapply (or similar), which is an example of a mapping, but by Reduce:
p <- Reduce(function(a, b) a + sample(c(1, 0), prob = c(a, 1), size = 1),
            seq_len(x), 1)  # initial value 1, matching p <- 1 in the question
Note that the argument b in this case isn’t used — this corresponds to your loop variable i, which is likewise not used.
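As a side note, if you want the whole trajectory of p rather than only its final value, Reduce can also return the intermediate results via accumulate = TRUE. A small sketch using the same update step as above:

p_all <- Reduce(function(a, b) a + sample(c(1, 0), prob = c(a, 1), size = 1),
                seq_len(x), 1, accumulate = TRUE)
p <- p_all[length(p_all)]  # the final value, equivalent to the loop's p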
I think it is a slight myth that using lapply will always make your code more efficient. It can make your code easier to read and understand, but not necessarily faster:
> system.time(for (i in 1:10000000) x <- i)
   user  system elapsed
  1.712   0.035   1.759
> system.time(lapply(1:10000000, function(i) x <- i))
   user  system elapsed
 13.941   0.414  14.464
In this case, your code seems very simple and clear. So, is there a reason to get rid of the for loop?
I have a dataframe (see the picture). I am creating periods of 30 values and calculating how many of these values are over 0.1. At the end, I want to save all 336 outputs in a dataframe (as rows). How could I do that? My code is failing!
i <- 0
secos = as.data.frame(NULL)
for (i in c(0:336)) {
  hola = as.data.frame(pp[c(1+i:29 + i)])
  secos[[i]] = sum(hola > 0.1)
  secos = rbind(secos[[i]])
}
1. Iteratively building (growing) data.frames in R is a bad thing. For good reading, see the R Inferno, chapter 2 on Growing Objects. Bottom line: it works, but as you add more rows, it gets progressively slower and uses (at least) twice as much memory as you intend.
2. You explicitly overwrite secos with rbind(secos[[i]]), where the rbind call is a complete no-op (e.g., see identical(rbind(mtcars), mtcars)). Back to (1): best to do L <- lapply(0:336, function(i) ...) and then secos <- do.call(rbind, L).
3. R indexes are 1-based, but your first iteration (i = 0) assigns to secos[[0]], which fails.
A literal translation of this into a better start is something like the following. (Up front: your reference to pp only makes sense if you have an object pp that you used to create your data.frame above, since pp[.] by itself will not reference the frame. If you're using attach(.) to be able to do that, then ... don't. Too many risks and things can go wrong with it; it is one of the base functions I'd vote to remove.)
invec <- 0:336
L <- sapply(invec, function(i) {
  # note: the original 1+i:29 + i parses as 1 + (i:29) + i, which is almost
  # certainly not intended; (i + 1):(i + 30) gives a window of 30 values
  hola <- as.data.frame(pp[(i + 1):(i + 30)])
  sum(hola > 0.1)
})
secos <- data.frame(i = invec, secos = L)
An alternative:
L <- lapply(invec, function(i) {
  hola <- as.data.frame(pp[(i + 1):(i + 30)])  # same 30-value window as above
  data.frame(secos = sum(hola > 0.1))
})
out <- do.call(rbind, L)
I can't help but think there is a more efficient, R-idiomatic way to aggregate this data. My guess is that it's a moving window of sorts, perhaps a month wide (or similar). If that's the case, I recommend looking into zoo::rollapply(pp, 30, function(z) sum(z > 0.1)), perhaps with meaningful application of align=, partial=, and/or fill=.
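For illustration, a short sketch of that rolling-window idea (untested; it assumes pp is a plain numeric vector, and the right align =/partial = settings depend on how your windows should be anchored):

library(zoo)

# one output per 30-value window: how many values exceed 0.1
secos <- rollapply(pp, width = 30, FUN = function(z) sum(z > 0.1),
                   align = "left")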
I am trying to implement a block bootstrap procedure, but I haven't figured out a way of doing this efficiently.
My data.frame has the following structure:
CHR   POS    var_A  var_B
1     192    0.9    0.7
1     2000   0.8    0.3
2     3      0.21   0.76
2     30009  0.36   0.15
...
The first column is the chromosome identification, the second is the position, and the last two are the variables for which I want to calculate a correlation. The problem is that the rows are not entirely independent of one another, depending on the distance between them (the closer, the more dependent), so I cannot simply do cor(df$var_A, df$var_B).
The way out of this problem that is commonly used with this type of data is performing a block bootstrap. That is, I need to divide my data into blocks of length X, randomly select one row inside each block, and then calculate my statistic of interest. Note, however, that these blocks need to be defined based on the column POS, and not based on the row number. Also, this procedure needs to be done for each chromosome.
I tried to implement this, but I came up with the slowest code possible (it didn't even finish running) and I am not 100% sure it works.
x = 1000
cors = numeric()
iter = 1000
for (j in 1:iter) {
  df = freq[0, ]
  for (i in unique(freq$CHR)) {
    t = freq[freq$CHR == i, ]
    fim = t[nrow(t), 2]
    i = t[1, 2]
    f = i + x
    while (f < fim) {
      rows = which(t$POS >= i & t$POS < f)
      s = sample(rows)
      df = rbind(df, t[s, ])
      i = f
      f = f + x
    }
  }
  cors = c(cors, cor(df$var_A, df$var_B))
}
Could anybody help me out? I am sure there is a more efficient way of doing this.
Thank you in advance.
One efficient way would be to use the boot package, whose functions include parallel processing capabilities.
In particular, the tsboot (time series bootstrap) function will select ordered blocks of data. This could work if your POS variable is some kind of ordered observation.
The boot package functions are great, but they need a little help first. To use the bootstrap functions in the boot package, one must first wrap the statistic of interest in a function. For the plain boot function, that wrapper takes the data plus an index argument, which is the device the bootstrap-generated index uses to pass sampled data to your statistic; tsboot is simpler in that it passes each resampled series directly to your statistic, so the wrapper only needs the data argument:
cor_hat <- function(data) cor(y = data$var_A, x = data$var_B)
Note cor_hat in the arguments below. The sim = "fixed", l = 1000 arguments indicate that you want fixed blocks of length l = 1000. However, you could use blocks of any size, 5 or 10, if you're trying to capture nearest-neighbour dynamics moving over time. The parallel = "multicore" argument speaks for itself, but it may need to be "snow" if you are using Windows.
library(boot)
tsboot(data, cor_hat, R = 1000, sim = "fixed", l = 1000, parallel = "multicore", ncpus = 4)
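For what it's worth, tsboot returns a standard "boot" object, so the original-sample statistic and the bootstrap replicates can be pulled out along these lines (a sketch):

b <- tsboot(data, cor_hat, R = 1000, sim = "fixed", l = 1000)
b$t0                            # cor_hat evaluated on the original data
quantile(b$t, c(0.025, 0.975))  # a simple percentile interval from the replicates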
In addition, page 194 of Elements of Statistical Learning provides a good example of the framework using the traditional boot function, all of which is relevant to tsboot.
Hope that helps, good luck.
Justin
I hope I understood you right:
# needed for round_any()
library(plyr)

res <- lapply(unique(freq$CHR), function(x) {
  freq_sel <- freq[freq$CHR == x, ]
  blocks <- lapply(seq(1, round_any(max(freq_sel$POS), 1000, ceiling), 1000),
                   function(ix) freq_sel[freq_sel$POS > ix & freq_sel$POS <= ix + 999, ])
  do.call(rbind, lapply(blocks, function(x) if (nrow(x) > 1) x[sample(1:nrow(x), 1), ] else x))
})
This should return a list with an entry for each chromosome. Within each entry, there's an observation per 1kb-block if present. The number of blocks is determined by the maximum POS value.
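From there, one bootstrap replicate of the statistic in your question can be computed by stacking the list back together (a sketch, assuming the var_A/var_B columns from your data):

df <- do.call(rbind, res)
cor(df$var_A, df$var_B)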
EDIT:
library(doParallel)
library(foreach)
library(plyr)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

res <- foreach(x = unique(freq$CHR), .packages = 'plyr') %dopar% {
  freq_sel <- freq[freq$CHR == x, ]
  blocks <- lapply(seq(1, round_any(max(freq_sel$POS), 1000, ceiling), 1000),
                   function(ix) freq_sel[freq_sel$POS > ix & freq_sel$POS <= ix + 999, ])
  do.call(rbind, lapply(blocks, function(x) if (nrow(x) > 1) x[sample(1:nrow(x), 1), ] else x))
}
stopCluster(cl)
This is a simple parallelisation with foreach over the chromosomes. It could be better to restructure the function and base the parallel processing on another level (such as the 1000 iterations, or maybe the blocks). In any case, I can just stress again what I was saying in my comment: before you work on parallelising your code, you should be sure that it's as efficient as possible, meaning you might want to look into the boot package or similar to gain efficiency. That said, with the number of iterations you're planning, parallel processing might be useful once you're comfortable with your function.
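To illustrate the other level, here is an untested sketch that wraps the per-chromosome sampling above inside a foreach over the 1000 bootstrap iterations instead (same assumptions about freq as before):

library(doParallel)
library(foreach)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

cors <- foreach(j = 1:1000, .combine = c, .packages = 'plyr') %dopar% {
  res <- lapply(unique(freq$CHR), function(x) {
    freq_sel <- freq[freq$CHR == x, ]
    breaks <- seq(1, round_any(max(freq_sel$POS), 1000, ceiling), 1000)
    blocks <- lapply(breaks, function(ix) freq_sel[freq_sel$POS > ix & freq_sel$POS <= ix + 999, ])
    do.call(rbind, lapply(blocks, function(b) if (nrow(b) > 1) b[sample(1:nrow(b), 1), ] else b))
  })
  df <- do.call(rbind, res)
  cor(df$var_A, df$var_B)
}
stopCluster(cl)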
So, after a while I came up with an answer to my problem. Here it goes.
You'll need the package dplyr.
library(dplyr)

l <- 1000
teste <- freq %>%
  mutate(w = ceiling(POS / l)) %>%
  group_by(CHR, w) %>%
  sample_n(1)
This code creates a new variable named w based on the position in the genome (POS). This variable w is the window to which each row was assigned, and it depends on l, which is the length of your window.
You can repeat this code several times, each time sampling one row per window/CHR (with sample_n(1)), and apply whatever statistic of interest you want.
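For instance, a sketch of the full bootstrap built on this, assuming the correlation from the original question is the statistic of interest (untested):

cors <- replicate(1000, {
  teste <- freq %>%
    mutate(w = ceiling(POS / l)) %>%
    group_by(CHR, w) %>%
    sample_n(1)
  cor(teste$var_A, teste$var_B)
})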
I have a df, YearHT, 6.5M rows x 55 columns. There is specific information I want to extract and add, but only based on aggregate values. I am using a for loop to subset the large df and then performing the computations.
I have heard that for loops should be avoided, and I wonder if there is a way to avoid a for loop that I have used, as when I run this query it takes ~3hrs.
Here is my code:
srt = NULL
for (i in doubletCounts$Var1) {
  s = subset(YearHT, YearHT$berthlet == i)
  e = unlist(c(strsplit(i, '\\|'), median(s$berthtime)))
  srt = rbind(srt, e)
}
srt = data.frame(srt)
s2 = data.frame(srt$X2, srt$X1, srt$X3)
colnames(s2) = colnames(srt)
s = rbind(srt, s2)
doubletCounts is a 700 x 3 df, and each of its values is found within the large df.
I would be glad to hear any ideas to optimize/speed up this process.
Here is a fast solution using data.table, although it is not completely clear from your question what output you want to get.
# load library
library(data.table)

# convert your dataset into a data.table, by reference
setDT(YearHT)

# subset YearHT, keeping values that are present in doubletCounts$Var1
YearHT_df <- YearHT[berthlet %in% doubletCounts$Var1]

# aggregate values: one median per berthlet group
output <- YearHT_df[, .(median = median(berthtime)), by = berthlet]
for loops aren't necessarily something to avoid, but there are certain ways of using for loops that should be avoided. You've committed the classic for loop blunder here.
srt = NULL
for (i in index)
{
  [stuff]
  srt = rbind(srt, [stuff])
}
is bound to be slower than you would like because each time you hit srt = rbind(...), you're asking R to do all sorts of things to figure out what kind of object srt needs to be and how much memory to allocate to it. When you know what the length of your output needs to be up front, it's better to do
srt <- vector("list", length = length(doubletCounts$Var1))
names(srt) <- doubletCounts$Var1
for (i in doubletCounts$Var1) {
  s <- subset(YearHT, YearHT$berthlet == i)
  srt[[i]] <- unlist(c(strsplit(i, '\\|'), median(s$berthtime)))
}
srt <- data.frame(srt)
Or the apply alternative of
srt = lapply(doubletCounts$Var1, function(i) {
  s = subset(YearHT, YearHT$berthlet == i)
  unlist(c(strsplit(i, '\\|'), median(s$berthtime)))
})
Both of those should run at about the same speed
(Note: both are untested, for lack of data, so they might be a little buggy)
Something else you can try that might have a smaller effect would be dropping the subset call and use indexing. The content of your for loop could be boiled down to
unlist(c(strsplit(i, '\\|'),
         median(YearHT[YearHT$berthlet == i, "berthtime"])))
But I'm not sure how much time that would save.
I am working with R and my script is taking a very long time. I was thinking I can stop it and then start it again by changing my counters.
My code is this
NC <- MLOA
for (i in 1:313578) {
  len_mods <- length(MLOA[[i]])
  for (j in 1:2090) {
    for (k in 1:len_mods) {
      temp_match <- matchv[j]
      temp_rep <- replacev[j]
      temp_mod <- MLOA[[i]][k]
      is_found <- match(temp_mod, temp_match, nomatch = 0, incomparables = 0)
      if (is_found[1] == 1) NC[[i]][k] <- temp_rep
      rm(temp_match, temp_rep, temp_mod)
    }
  }
}
I am thinking that I can stop my script, then restart it by checking what the values of i, j and k are and changing the loop ranges to start at their current values. So instead of for (i in 1:313578), if i is up to 100,000 I could do for (i in 100000:313578).
I don't want to stop my script though before checking that my logic about restarting it is solid.
Thanks in anticipation
I'm a bit confused about what you are doing. Generally on this forum it is a good idea to greatly simplify your code and present only the core of the problem in a very simple example. That notwithstanding, this might help: put your for loop in a function whose parameters are the first elements of the sequences you loop over. For example:
myloop <- function(x, ...) {
  for (i in seq(x, 313578, 1)) {
    ...
This way you can easily manipulate where your loop starts.
The more important question is, however, why are you using for loops in the first place? In R, for loops should be avoided at all costs. By vectorizing your code you can greatly increase its speed. I have realized speed increases of a factor of 500!
In general, the only reason you use a for loop in R is if current iterations of the for loop depend on previous iterations. If this is the case then you are likely bound to the slow for loop.
Depending on your computer skills, however, even for loops can be made faster in R. If you know C, or are willing to learn a bit, interfacing with C can dramatically increase the speed of your code.
An easier way to increase the speed of your code, which unfortunately will not yield the same speed-up as interfacing with C, is to use R's byte compiler. Check out the cmpfun function in the compiler package.
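A minimal sketch of cmpfun (slow_sum/fast_sum are made-up names for illustration; note that recent versions of R byte-compile functions automatically, so the gain may be modest):

library(compiler)

# a deliberately loop-heavy toy function
slow_sum <- function(n) {
  s <- 0
  for (i in 1:n) s <- s + i
  s
}

fast_sum <- cmpfun(slow_sum)  # byte-compiled version of the same function

system.time(slow_sum(1e7))
system.time(fast_sum(1e7))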
One final thing on speeding up code: the line temp_match <- matchv[j] looks innocuous enough; however, it can really slow things down. This is because every time you assign matchv[j] to temp_match you make a copy, and your computer needs to find somewhere to store that copy in RAM. R is smart: as you make more and more copies, it will clean up after you and throw away the ones you are no longer using via garbage collection. However, finding places to store your copies, as well as running the garbage collector, takes time. Read this if you want to learn more: http://adv-r.had.co.nz/memory.html.
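One concrete win along those lines in your code: temp_match and temp_rep depend only on j, so they can be assigned once per j instead of once per (j, k) pair. An untested sketch of the two inner loops with those assignments hoisted out (and the rm() calls, which also cost time, dropped):

for (j in 1:2090) {
  temp_match <- matchv[j]  # depends only on j: assign once per j
  temp_rep <- replacev[j]
  for (k in 1:len_mods) {
    if (MLOA[[i]][k] %in% temp_match) NC[[i]][k] <- temp_rep
  }
}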
You could also use while loops for your 3 loops to maintain counters. In the following, you can stop the script at any time (and view the intermediate results) and restart it by changing continue to TRUE (so the counters are not reset) or simply re-running the loop part of the script:
n <- 6
res <- array(NaN, dim = rep(n, 3))
continue <- FALSE

if (!continue) {
  i <- 1
  j <- 1
  k <- 1
}

while (k <= n) {
  while (j <= n) {
    while (i <= n) {
      res[i, j, k] <- as.numeric(paste0(i, j, k))
      Sys.sleep(0.1)
      i <- i + 1
    }
    j <- j + 1
    i <- 1
  }
  k <- k + 1
  j <- 1
}

i; j; k
res
This is what I got to....
for (i in 1:313578) {
  mp <- match(MLOA[[i]], matchv, nomatch = 0, incomparables = 0)
  lgic <- which(as.logical(mp), arr.ind = FALSE, useNames = TRUE)
  NC[[i]][lgic] <- replacev[mp]
}
Thanks to those who responded. Jacob H, you are right, I am definitely a newbie with R; your response was useful. Frank, your pointers helped.
My solution probably still isn't an optimal one. All I wanted to do was a find and replace: matchv was the vector in which I was searching for a match for each element of MLOA[[i]], with replacev being the vector of replacement information.
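For the record, since the goal is a plain find-and-replace, a fully vectorised sketch (untested, same MLOA/matchv/replacev as above) could drop the outer loop as well:

NC <- lapply(MLOA, function(v) {
  m <- match(v, matchv)      # position in matchv of each element, NA if no match
  hit <- !is.na(m)
  v[hit] <- replacev[m[hit]] # substitute the matched elements
  v
})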
I have a question regarding R apply (and all its variants). Is there a way to update the arguments of the function while apply is working?
For example, I have a function NextSol(Prev_Sol) that generates a new solution from Prev_Sol, compares it with the original one in some way and then returns either the original or the new, depending on the result of the comparison. I need to save all the solutions returned. Currently, I am doing this:
for (i in 2:N) {
  Results[[i]] <- NextSol(Results[[i - 1]])
}
But maybe there is a (faster) way to do it using apply? I have also seen that Reduce could help, but I have no idea how to use it. Any help will be much appreciated!
As Thomas said, the for loop is the standard way of looping when one iteration depends on a previous one. (Just make sure that you correctly handle the case of N = 1 in your code.)
An alternative is to use the Reduce function. This example is adapted from the one on the ?Reduce help page.
NextSol <- function(x) x + 1 #Or whatever you want
Funcall <- function(f, ...) f(...)
Reduce(Funcall, rep.int(list(NextSol), 5), 0, right = TRUE)
## [1] 5
It's unlikely that this will be much faster, and it's arguably harder to read, so you may well decide to stick with a for loop.
Well, I suppose we can make it easier to read by wrapping it in an Iterate function.
Iterate <- function(f, init, n)
{
  Reduce(function(g, ...) g(...),
         rep.int(list(f), n),
         init,
         right = TRUE)
}
Iterate(NextSol, 0, 5) #same as before