I have a question regarding R apply (and all its variants). Is there a way to update the arguments of the function while apply is working?
For example, I have a function NextSol(Prev_Sol) that generates a new solution from Prev_Sol, compares it with the original one in some way and then returns either the original or the new, depending on the result of the comparison. I need to save all the solutions returned. Currently, I am doing this:
for( i in 2:N ) {
Results[[i]] <- NextSol(Results[[i-1]])
}
But maybe there is a (faster) way to do it using apply? I have also seen that Reduce could help, but I have no idea how to use it. Any help will be much appreciated!
As Thomas said, the for loop is the standard way of looping when one iteration depends on a previous one. (Just make sure that you correctly handle the case of N = 1 in your code.)
An alternative is to use the Reduce function. This example is adapted from the one on the ?Reduce help page.
NextSol <- function(x) x + 1 #Or whatever you want
Funcall <- function(f, ...) f(...)
Reduce(Funcall, rep.int(list(NextSol), 5), 0, right = TRUE)
## [1] 5
It's unlikely that this will be much faster, and it's arguably harder to read, so you may well decide to stick with a for loop.
Well, I suppose we can make it easier to read by wrapping it in an Iterate function.
Iterate <- function(f, init, n)
{
Reduce(
function(f, ...) f(...),
rep.int(list(f), n),
init,
right = TRUE
)
}
Iterate(NextSol, 0, 5) #same as before
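Since the question asks to keep every solution along the way, the accumulate argument of Reduce may be worth a look. This is only a minimal sketch: the toy NextSol and the starting value init are stand-ins, not the asker's real code.

```r
NextSol <- function(x) x + 1   # toy stand-in for the real NextSol
N <- 5                         # total number of solutions to keep
init <- 0                      # hypothetical starting solution

# The folding function ignores its second argument (the step index).
# accumulate = TRUE keeps every intermediate value;
# simplify = FALSE keeps the result as a list, like Results in the question.
Results <- Reduce(
  function(prev, i) NextSol(prev),
  seq_len(N - 1),
  init,
  accumulate = TRUE,
  simplify = FALSE
)
length(Results)   # N elements: the initial solution plus N - 1 updates
```

With N = 1, seq_len(N - 1) is empty and Results holds just the initial solution, so that edge case needs no special handling here.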
Firstly, let us generate data like this:
library(data.table)
data <- data.table(date = as.Date("2015-05-01")+0:299)
set.seed(123)
data[,":="(
a = round(30*cumprod(1+rnorm(300,0.001,0.05)),2),
b = rbinom(300,5000,0.8)
)]
Then I want to apply my custom function to multiple columns multiple times without typing each call out manually. For example, my custom function is add <- function(x,n) (x+n)
I provide my for loops code as following:
add <- function(x,n) (x+n)
n <- 3
freture_old <- c("a","b")
for(i in 1:n ){
data[,(paste0(freture_old,"_",i)) := add(.SD,i),.SDcols =freture_old ]
}
Could you please show me an lapply version to use instead of the for loop?
If all you want is to use an lapply loop instead of a for loop you really do not need to change much. For a data.table object it is even easier since every iteration will change the data.table without having to save a copy to the global environment. One thing I add just to suppress the output to the console is to wrap an invisible around it.
lapply(1:n,function(i) data[,paste0(freture_old,"_",i):=lapply(.SD,add,i),.SDcols =freture_old])
Note that if you assign this lapply to an object you will get a list of data.tables the size of the number of iterations or in this case 3. This will kill memory because you are really only interested in the final entry. Therefore just run the code without assigning it to a variable. Now if you do not assign it to anything you will get every iteration printed out to the console. So what I would suggest is to wrap an invisible around it like this:
invisible(lapply(1:n,function(i) data[,paste0(freture_old,"_",i):=lapply(.SD,add,i),.SDcols =freture_old]))
Hope this helps and let me know if you need me to add anything else to this answer. Good luck!
An option without an R "loop" (quoted since ultimately it's a loop at some level somewhere):
data[,
c(outer(freture_old, seq_len(n), paste, sep="_")) :=
as.data.table(matrix(outer(as.matrix(.SD), seq_len(n), add), .N)),
.SDcols=freture_old]
Or equivalently in base R:
setDF(data)
cbind(data, matrix(outer(as.matrix(data[, freture_old]), seq_len(n), add),
nrow(data)))
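For comparison, here is a base R sketch that also names the new columns. It assumes a small toy data frame df (hypothetical, standing in for data) and builds the column/offset combinations with expand.grid, so it is an illustration rather than a drop-in replacement.

```r
add <- function(x, n) x + n
n <- 3
cols <- c("a", "b")

# Hypothetical toy data standing in for the real table
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# One row per (column, offset) pair; the column name varies fastest
grid <- expand.grid(col = cols, i = seq_len(n), stringsAsFactors = FALSE)

# Compute each new column, then name it "<column>_<offset>"
new_cols <- Map(function(col, i) add(df[[col]], i), grid$col, grid$i)
names(new_cols) <- paste(grid$col, grid$i, sep = "_")

# Attach all new columns in one assignment
df[names(new_cols)] <- new_cols
names(df)
```

The new columns come out in the order a_1, b_1, a_2, b_2, a_3, b_3, which matches the ordering used by the outer-based versions above.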
I'm still pretty new to R programming, and I've read a lot about replacing for-loops, particularly with the apply functions, which has been really useful in making my code more efficient. However, in some of the programs I'm trying to create, I have for-loops where each loop must be carried out in order, because the effects of one loop affect what happens in the next loop, and as far as I'm aware this cannot be achieved with, for example, lapply(). An example of this sort of for-loop:
p <- 1
for (i in 1:x) {
p <- p + sample(c(1, 0), prob = c(p, 1), size = 1)
}
Is there a way to replace this kind of loop?
Thanks for your time, everyone!
This kind of logic is known as reduction or fold. Consequently it’s solved not by lapply (or similar), which is an example of a mapping, but by Reduce:
p = Reduce(function (a, b) a + sample(c(1, 0), prob = c(a, 1), size = 1),
seq_len(x), initial_value)
Note that the argument b isn't used in this case; it corresponds to your loop variable i, which is likewise not used. Here initial_value stands for the starting value of p (1 in your loop).
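A runnable version of the same idea might look like the sketch below. The values of x and the seed are arbitrary assumptions for illustration, and step is just a hypothetical name for the folding function.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
x <- 10        # hypothetical number of iterations

# p is the running value; the second argument (the index) is ignored
step <- function(p, i) p + sample(c(1, 0), size = 1, prob = c(p, 1))

# Final value only, starting from p = 1 as in the question
p_final <- Reduce(step, seq_len(x), 1)

# Whole trajectory: the start value followed by one entry per iteration
p_path <- Reduce(step, seq_len(x), 1, accumulate = TRUE)
length(p_path)   # x + 1
```

Note that p_final and p_path come from separate random draws, so the last element of p_path need not equal p_final.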
I think it is a slight myth that using lapply will always make your code more efficient. It can make your code easier to read and understand, but not necessarily faster:
> system.time(for(i in 1:10000000) x <- i )
user system elapsed
1.712 0.035 1.759
> system.time(lapply(1:10000000, function(i) x <- i ))
user system elapsed
13.941 0.414 14.464
In this case, your code seems very simple and clear. So, is there a reason to get rid of the for loop?
I have a df, YearHT, with 6.5M rows x 55 columns. There is specific information I want to extract and add, but only based on aggregate values. I am using a for loop to subset the large df and then performing the computations.
I have heard that for loops should be avoided, and I wonder if there is a way to avoid a for loop that I have used, as when I run this query it takes ~3hrs.
Here is my code:
srt=NULL
for(i in doubletCounts$Var1){
s=subset(YearHT,YearHT$berthlet==i)
e=unlist(c(strsplit(i,'\\|'),median(s$berthtime)))
srt=rbind(srt,e)
}
srt=data.frame(srt)
s2=data.frame(srt$X2,srt$X1,srt$X3)
colnames(s2)=colnames(srt)
s=rbind(srt,s2)
doubletCounts is 700 x 3 df, and each of the values is found within the large df.
I would be glad to hear any ideas to optimize/speed up this process.
Here is a fast solution using data.table, although it is not completely clear from your question what output you want to get.
# load library
library(data.table)
# convert your dataset into data.table
setDT(YearHT)
# subset YearHT keeping values that are present in doubletCounts$Var1
YearHT_df <- YearHT[ berthlet %in% doubletCounts$Var1]
# aggregate values
output <- YearHT_df[ , .( median= median(berthtime)), by = berthlet ]
for loops aren't necessarily something to avoid, but there are certain ways of using for loops that should be avoided. You've committed the classic for loop blunder here.
srt = NULL
for (i in index)
{
[stuff]
srt = rbind(srt, [stuff])
}
is bound to be slower than you would like, because each time you hit srt = rbind(...), R has to allocate a new, larger object and copy everything accumulated so far into it. When you know the length of your output up front, it's better to do
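vars <- as.character(doubletCounts$Var1)
srt <- vector("list", length = length(vars)) # preallocate one slot per value
for (k in seq_along(vars)) {
  i <- vars[k]
  s <- subset(YearHT, YearHT$berthlet == i)
  srt[[k]] <- unlist(c(strsplit(i, '\\|'), median(s$berthtime)))
}
srt <- data.frame(do.call(rbind, srt)) # bind the rows once, at the end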
Or the apply alternative of
srt = lapply(doubletCounts$Var1,
function(i)
{
s=subset(YearHT,YearHT$berthlet==i)
unlist(c(strsplit(i,'\\|'),median(s$berthtime)))
}
)
Both of those should run at about the same speed.
(Note: both are untested, for lack of data, so they might be a little buggy)
Something else you can try, which might have a smaller effect, is dropping the subset call and using indexing instead. The content of your for loop could be boiled down to
unlist(c(strsplit(i, '\\|'),
median(YearHT[YearHT$berthlet == i, "berthtime"])))
But I'm not sure how much time that would save.
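To make the preallocation pattern concrete, here is a small self-contained sketch. The toy YearHT and doubletCounts below are made-up stand-ins for the real data, and keys and rows are hypothetical names; only the column names berthlet, berthtime and Var1 come from the question.

```r
# Hypothetical toy stand-ins for the real data
YearHT <- data.frame(
  berthlet  = c("A|B", "A|B", "C|D", "C|D", "C|D"),
  berthtime = c(1, 3, 10, 20, 60),
  stringsAsFactors = FALSE
)
doubletCounts <- data.frame(Var1 = c("A|B", "C|D"), stringsAsFactors = FALSE)

keys <- as.character(doubletCounts$Var1)
rows <- vector("list", length(keys))   # preallocate one slot per key

for (k in seq_along(keys)) {
  # Index directly instead of calling subset()
  times <- YearHT$berthtime[YearHT$berthlet == keys[k]]
  parts <- strsplit(keys[k], "|", fixed = TRUE)[[1]]
  rows[[k]] <- data.frame(X1 = parts[1], X2 = parts[2], X3 = median(times),
                          stringsAsFactors = FALSE)
}

srt <- do.call(rbind, rows)   # bind once, after the loop
srt
```

Building each row as a one-row data frame keeps the median numeric, whereas unlist(c(...)) in the original coerces everything to character.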
I keep running into situations where I want to dynamically create variables using a for loop (or similar / more efficient construct using dplyr perhaps). However, it's unclear to me how to do it right now.
For example, the below shows a construct that I would intuitively expect to generate 10 variables assigned numbers 1:10, but it doesn't work.
for (i in 1:10) {paste("variable",i,sep = "") = i}
The error
Error in paste("variable", i, sep = "") = i :
target of assignment expands to non-language object
Any thoughts on what method I should use to do this? I assume there are multiple approaches (including a more efficient dplyr method). Full disclosure: I'm relatively new to R and really appreciate the help. Thanks!
I've run into this problem myself many times. The solution is the assign command.
for(i in 1:10){
assign(paste("variable", i, sep = ""), i)
}
If you wanted to get everything into one vector, you could use sapply. The following code would give you a vector from 1 to 10, with each item named "variablei", where i is the value of that item (so "variable1", "variable2", and so on). This may not be the prettiest or most elegant way to use the apply family for this, but I think it ought to work well enough.
var.names <- function(x){
a <- x
names(a) <- paste0("variable", x)
return(a)
}
variables <- sapply(X = 1:10, FUN = var.names)
This sort of approach seems to be favored because it keeps all of those variables tucked away in one object, rather than scattered all over the global environment. This could make calling them easier in the future, preventing the need to use get to scrounge up variables you'd saved.
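If the values are not all scalars of the same type, a named list is one way to get the same "one object" benefit. A minimal sketch, where vars is just a hypothetical name for the container:

```r
# Build a named list instead of ten separate variables
vars <- setNames(as.list(1:10), paste0("variable", 1:10))

vars$variable3          # access by name with $
vars[["variable7"]]     # or with [[ ]], handy when the name is itself computed

# Computed names work without get():
k <- 5
vars[[paste0("variable", k)]]
```

A list like this can also be passed straight to lapply or sapply if the same operation is needed on every element.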
No need to use a loop: you can build a character string with paste0, turn it into an unevaluated expression with parse, and finally evaluate it with eval.
eval(parse(text = paste0("variable", 1:10, "=",1:10, collapse = ";") ))
The code you have is really no more useful than a vector of elements:
x<-1
for(i in 2:10){
x<-c(x,i)
}
(Obviously, this example is trivial; you could just use x<-1:10 and be done. I assume there's a reason you need to do non-vectorized calculations on each variable.)
I know that I should avoid for-loops, but I'm not exactly sure how to do what I want to do with an apply function.
Here is a slightly simplified model of what I'm trying to do. So, essentially I have a big matrix of predictors and I want to run a regression using a window of 5 predictors on each side of the indexed predictor (i in the case of a for loop). With a for loop, I can just say something like:
results<-NULL
window<-5
for(i in 1:ncol(g))
{
first<-i-window #Set window boundaries
if(first<1){
1->first
}
last<-i+window-1
if(last>ncol(g)){
ncol(g)->last
}
predictors<-g[,first:last]
#Do regression stuff and return some result
results[i]<-regression stuff
}
Is there a good way to do this with an apply function? My problem is that the vector that apply would be shoving into the function really doesn't matter. All that matters is the index.
This question touches several points that are made in 'The R Inferno' http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
There are some loops you should avoid, but not all of them. And using an apply function is more hiding the loop than avoiding it. This example seems like a good choice to leave in a 'for' loop.
Growing objects is generally bad form -- it can be extremely inefficient in some cases. If you are going to have a blanket rule, then "not growing objects" is a better one than "avoid loops".
You can create a list with the final length by:
result <- vector("list", ncol(g))
for(i in 1:ncol(g)) {
# stuff
result[[i]] <- NA # placeholder: store the result for iteration i here
}
In some circumstances you might think the command:
window<-5
means give me a logical vector stating which values of 'window' are less than -5.
Spaces are good to use, mostly so as not to confuse humans, but in cases like the one directly above also so as not to confuse R.
Using an apply function to do your regression is mostly a matter of preference in this case; it can handle some of the bookkeeping for you (and so possibly prevent errors) but won't speed up the code.
I would suggest using vectorized functions to compute your first's and last's, though, perhaps something like:
window <- 5
ng <- 15 #or ncol(g)
xy <- data.frame(first = pmax( (1:ng) - window, 1 ),
last = pmin( (1:ng) + window, ng) )
Or be even smarter with
xy <- data.frame(first= c(rep(1, window), 1:(ng-window) ),
last = c((window+1):ng, rep(ng, window)) )
Then you could use this in a for loop like this:
results <- list()
for(i in 1:nrow(xy)) {
results[[i]] <- xy$first[i] : xy$last[i]
}
results
or with lapply like this:
results <- lapply(1:nrow(xy), function(i) {
xy$first[i] : xy$last[i]
})
where in both cases I just return the sequence between first and last; you would substitute your actual regression code.
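As an illustration of what substituting the regression might look like, here is a sketch that fits lm on each window and keeps the R-squared. The predictor matrix g and the response y are simulated, and returning R-squared is an arbitrary choice, since the question does not say what the regression should return.

```r
set.seed(1)                                  # arbitrary seed for reproducibility
g <- matrix(rnorm(100 * 15), nrow = 100)     # hypothetical predictor matrix
y <- rnorm(100)                              # hypothetical response

window <- 5
ng <- ncol(g)
first <- pmax(seq_len(ng) - window, 1)       # window boundaries, clipped to 1..ng
last  <- pmin(seq_len(ng) + window, ng)

# vapply iterates over the column index, not over the data itself
r2 <- vapply(seq_len(ng), function(i) {
  predictors <- g[, first[i]:last[i], drop = FALSE]
  summary(lm(y ~ predictors))$r.squared
}, numeric(1))

length(r2)   # one value per column of g
```

Iterating over seq_len(ng) is the usual way around the "only the index matters" problem: the function closes over g, first and last and uses the index to pick its window.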