I have a list of fairly large objects that I want to apply a complicated function to in parallel, but my current method uses too much memory. I thought Reference Classes might help, but using mclapply to modify them doesn't seem to work.
The function modifies the object itself, so I overwrite the original object with the new one. Since the object is a list and I'm only modifying a small part of it, I was hoping that R's copy-on-modify semantics would avoid having multiple copies made; however, in running it, it doesn't seem to be the case for what I'm doing. Here's a small example of the base R methods I have been using. It correctly resets the balance to zero.
## make a list of accounts, each with a balance
## and a function to reset the balance
library(parallel) ## provides mclapply
foo <- lapply(1:5, function(x) list(balance=x))
reset1 <- function(x) {x$balance <- 0; x}
foo[[4]]$balance
## 4 ## BEFORE reset
foo <- mclapply(foo, reset1)
foo[[4]]$balance
## 0 ## AFTER reset
It seems that using Reference Classes might help as they are mutable, and when using lapply it does do as I expect; the balance is reset to zero.
Account <- setRefClass("Account", fields=list(balance="numeric"),
methods=list(reset=function() {balance <<- 0}))
foo <- lapply(1:5, function(x) Account$new(balance=x))
foo[[4]]$balance
## 4
invisible(lapply(foo, function(x) x$reset()))
foo[[4]]$balance
## 0
But when I use mclapply, it doesn't properly reset. Note that if you're on Windows or have mc.cores=1, lapply will be called instead.
foo <- lapply(1:5, function(x) Account$new(balance=x))
foo[[4]]$balance
## 4
invisible(mclapply(foo, function(x) x$reset()))
foo[[4]]$balance
## 4
What's going on? How can I work with Reference Classes in parallel? Is there a better way altogether to avoid unnecessary copying of objects?
I think the forked processes, while they can read all the variables in the workspace, work on their own copies, so any change made in a worker is not visible in the parent session. Returning the object from the function works, but I don't know yet whether it improves the memory issues.
foo <- mclapply(foo, function(x) {x$reset(); x})
foo[[4]]$balance
## 0
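One way to limit what travels back from the workers is to return only the small piece that changed and apply it to the objects in the parent session. The following is a minimal sketch under that assumption, not a measured fix for the memory problem; the names new_balances and cores are made up for the illustration.

```r
library(methods)
library(parallel)

Account <- setRefClass("Account", fields = list(balance = "numeric"),
  methods = list(reset = function() { balance <<- 0 }))
foo <- lapply(1:5, function(x) Account$new(balance = x))

## mclapply cannot fork on Windows, so fall back to one core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L

## each worker resets its own copy and returns only the new balance
new_balances <- mclapply(foo, function(x) { x$reset(); x$balance },
                         mc.cores = cores)

## apply the returned values to the original objects in the parent
for (i in seq_along(foo)) foo[[i]]$balance <- new_balances[[i]]
foo[[4]]$balance
```

Whether this actually lowers peak memory depends on how much of each object the function touches, so it is worth profiling on the real data.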
All of the below is conducted in R.
I am trying to store the results of a for-loop in different containers, but somehow I keep ending up with NA warnings and the results are not stored in my container. I even tried different containers for different for-loops within the function, and finally a matrix for the containers, but it doesn't seem to work.
Already trying different solutions for two full days, and it seems there should be such an easy solution. Maybe I just can't see it myself anymore...
data.ols<-data.frame(cbind(rep(1),holiday,weathersit,atemp,hum,windspeed))
y<-as.vector(cnt)
z=c(holiday, weathersit, atemp, hum, windspeed)
z.names=c("holiday","weathersit","atemp","hum","windspeed")
result.container<-data.frame(matrix(nrow=6,ncol=4))
colnames(result.container)<-c("beta","SE","t-statistic","p-value")
ols<-function(y,X2,x=0){
X<-matrix(z, ncol=5)
X2<-cbind(rep(1, nrow(X)), X)
XXinv <- solve(t(X2) %*% X2, diag(ncol(X2))) # Compute (X'X)^-1
beta<-XXinv%*%t(X2)%*%y
print(beta)
result.container[,1]<-beta
result.testdebug<-vector()
for (i in c("V1","holiday","weathersit","atemp","hum","windspeed")){
SE<-sd(i)
result.testdebug[i]<-sd(data.ols[,i])
return(result.testdebug)
result.container[,2]<-result.testdebug}
result.testtvalue<-vector()
for (i in c("V1","holiday","weathersit","atemp","hum","windspeed")){
nominator<-(mean(i)-x)
t.value <- nominator/sd(i)
return(t.value)
result.testtvalue<-t.value
result.container[,3]<-result.testtvalue}
df <- length(X)-1
p.value <- 2*pt(t.value, df, lower.tail=FALSE)
return(p.value)
result.container[,4]<-p.value
list(rbind(beta,result.testdebug,t.value,p.value))}
It seems you are having some trouble with functions in R. In R, functions have their own environment (i.e. their own set of objects). Even though they can read from their parent environment, an ordinary assignment (<- or =) inside a function creates or modifies a local object and leaves the parent environment untouched. Let me demonstrate that with simpler code.
teste2=matrix(,2,2)
teste=function(a,b) {teste2[,1]=c(a,b)}
teste(3,2)
teste2
     [,1] [,2]
[1,]   NA   NA
[2,]   NA   NA
As you can see teste (the function) cannot change teste2 (the matrix).
In R, the best way to write a function is to pass it all the objects it needs as parameters and, at the end of the function body, to have a single return() call that gives back the final object.
You did something close to that, but used multiple return() calls. A function exits at the first return() it reaches, so everything after it is never run. See below:
teste=function(a,b) {c=a;return(c);d=b;return(d)}
teste(3,2)
[1] 3
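Putting both points together, a small sketch of the recommended pattern could look like this. The helper name fill_first_column is hypothetical; the idea is only that the matrix goes in as an argument, comes back as the return value, and is assigned by the caller.

```r
## Hypothetical helper: receives the matrix, returns the modified copy
fill_first_column <- function(m, a, b) {
  m[, 1] <- c(a, b)  # changes the local copy only
  return(m)          # single return() at the end
}

teste2 <- matrix(NA_real_, 2, 2)
teste2 <- fill_first_column(teste2, 3, 2)  # assign the result back
teste2
```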
For your particular code, I recommend removing all the result.container<- assignments and putting a single return() at the end, around that last list(...).
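As a rough illustration of that advice, here is a sketch of an OLS function that builds its result table locally and returns it once. It makes several assumptions that are not in the question: the data are simulated stand-ins for the bike-sharing columns, and the standard errors use the textbook formula sqrt(diag(sigma^2 (X'X)^-1)) rather than the sd() calls in the original code.

```r
## Simulated stand-in data (the real columns are not available here)
set.seed(1)
n <- 100
X <- cbind(atemp = runif(n), hum = runif(n), windspeed = runif(n))
y <- 2 + 1.5 * X[, "atemp"] + rnorm(n)

ols <- function(y, X) {
  X2     <- cbind(Intercept = 1, X)          # add the constant
  XXinv  <- solve(crossprod(X2))             # (X'X)^-1
  beta   <- drop(XXinv %*% crossprod(X2, y))
  resid  <- y - drop(X2 %*% beta)
  df     <- nrow(X2) - ncol(X2)              # residual degrees of freedom
  sigma2 <- sum(resid^2) / df
  se     <- sqrt(diag(XXinv) * sigma2)
  t_stat <- beta / se
  p_val  <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
  ## one return value, built entirely inside the function
  data.frame(beta = beta, SE = se, t.statistic = t_stat, p.value = p_val)
}

result.container <- ols(y, X)  # the caller stores the result
result.container
```

The container is created by the call itself, so nothing outside the function needs to be pre-allocated or modified.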
I have a question regarding R apply (and all its variants). Is there a way to update the arguments of the function while apply is working?
For example, I have a function NextSol(Prev_Sol) that generates a new solution from Prev_Sol, compares it with the original one in some way and then returns either the original or the new, depending on the result of the comparison. I need to save all the solutions returned. Currently, I am doing this:
for( i in 2:N ) {
Results[[i]] <- NextSol(Results[[i-1]])
}
But maybe there is a (faster) way to do it using apply? I have seen also that Reduce could help but I have no idea of how can I use it. Any help will be much appreciated!
As Thomas said, the for loop is the standard way of looping when one iteration depends on a previous one. (Just make sure that you correctly handle the case of N = 1 in your code.)
An alternative is to use the Reduce function. This example is adapted from the one on the ?Reduce help page.
NextSol <- function(x) x + 1 #Or whatever you want
Funcall <- function(f, ...) f(...)
Reduce(Funcall, rep.int(list(NextSol), 5), 0, right = TRUE)
## [1] 5
It's unlikely that this will be much faster, and it's arguably harder to read, so you may well decide to stick with a for loop.
Well, I suppose we can make it easier to read by wrapping it in an Iterate function.
Iterate <- function(f, init, n)
{
  Reduce(
    function(f, ...) f(...),
    rep.int(list(f), n),
    init,
    right = TRUE
  )
}
Iterate(NextSol, 0, 5) #same as before
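Since the question asks for every intermediate solution to be saved, Reduce with accumulate = TRUE may be closer to the original for loop. This is a sketch that reuses the toy NextSol from above; N and the starting value 0 are placeholders.

```r
NextSol <- function(x) x + 1  # toy stand-in, as above
N <- 5

## the second argument of the folding function is only a step counter
Results <- Reduce(function(prev, i) NextSol(prev),
                  seq_len(N - 1), 0, accumulate = TRUE)
Results  # the starting value followed by N - 1 successive solutions
```

With accumulate = TRUE the result holds the initial value plus one entry per step, which matches the contents of Results after the loop.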
Or, put more generally: how can I add multiple attributes to the elements of a list?
I am stuck trying to set an attribute on the elements of a list, all of which are data.frames. In the end I would like to add names(myList) as a varying attribute to every data.frame inside. But I cannot even get a static attribute for all list elements to work.
lapply(myList,attr,which="myname") <- "myStaticName"
This does not work because there is no replacement function lapply<-. If I had at least an idea how to do this, maybe I could figure out how to do it with varying attributes like the names of the list.
I don't recommend it, but you could do: myList <- lapply(myList, 'attr<-', which='myname', value='myStaticName'). Note that the result has to be assigned back, because lapply returns modified copies. An old-fashioned for loop is probably the clearest way to perform this task, or do this assignment upstream when the objects are created.
for (i in seq_along(myList)) attr(myList[[i]], 'myname') <- 'myStaticName'
EDIT:
As @mnel points out in the comments, setattr in the data.table package is also an efficient option, since it assigns by reference.
Edit: @mnel, don't use setattr with lapply. This is one case where the for loop is much faster.
library(microbenchmark)
library(data.table)
myList <- as.list(1:10000)
`lapply.attr<-` <-
function()
lapply(myList, 'attr<-', which='myname', value='myStaticName')
`for.attr<-` <-
function()
for (i in seq_along(myList))
attr(myList[[i]], 'myname') <- 'myStaticName'
lapply.setattr <-
function()
lapply(myList, setattr, name='myname', value='myStaticName')
for.setattr <- function()
for (i in seq_along(myList))
setattr(myList[[i]], name = 'myname', value = 'myStaticName')
result <- microbenchmark(`lapply.attr<-`(), `for.attr<-`(), lapply.setattr(), for.setattr())
plot(result)
Based on this answer by Thierry I found a solution on my own. Actually I have been close with several tries but did not return the WHOLE list which is key.
myList <- lapply(names(myList),function(X){
attr(myList[[X]],"myname") <- X
myList[[X]]
})
My mistake was that my function ended with the attr<- assignment, so it returned only the attribute value instead of the modified list element. Thus I was not able to replace the initial list.
@Matthew Plourde: what's strange is that your benchmark looks somewhat different on my machine: RStudio, OS X, 2.5 GHz Intel Core i7, 16 GB RAM.
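One caveat with iterating over names(myList) is that lapply then returns an unnamed list, so the element names are lost on reassignment. A possible alternative, sketched here with a made-up two-element myList, is Map, which walks over the elements and their names together and keeps the names of its first list argument.

```r
## Made-up example data
myList <- list(a = data.frame(x = 1:2), b = data.frame(x = 3:4))

myList <- Map(function(df, nm) {
  attr(df, "myname") <- nm  # set the varying attribute
  df                        # return the whole data.frame
}, myList, names(myList))

attr(myList$b, "myname")
names(myList)
```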
I have lots of variables in R, all of type list
a100 = list()
a200 = list()
# ...
p700 = list()
Each variable is a complicated data structure:
a200$time$data # returns 1000 x 1000 matrix
Now, I want to apply code to each variable in turn. However, since R doesn't support pass-by-reference, I'm not sure what to do.
One idea I had was to create a big list of all these lists, i.e.,
biglist = list()
biglist[[1]] = a100
...
And then I could iterate over biglist:
for (i in 1:length(biglist)){
biglist[[i]]$newstuff = "profit"
# more code here
}
And finally, after the loop, go backwards so that existing code (that uses variable names) still works:
a100 = biglist[[1]]
# ...
The question is: is there a better way to iterate over a set of named lists? I have a feeling that I'm doing things horribly wrong. Is there something easier, like:
# FAKE, Idealized code:
foreach x in (a100, a200, ....){
x$newstuff = "profit"
}
a100$newstuff # "profit"
To parallel walk over lists you can use mapply, which will take parallel lists and then walk over them in lock-step. Furthermore, in a functional language you should emit the object that you want rather than modify the data structure within a function call.
You should use the sapply, apply, lapply, ... family of functions.
jim
jimmyb is quite right. lapply and sapply are specifically designed to work on lists, so they would work with your biglist as well. Don't forget to return the object from the inner function, though. An example:
X <- list(A=list(A1=1:2,A2=3:4),B=list(B1=5:6,B2=7:8))
lapply(X,function(i){
i$newstuff = "profit"
return(i)
})
Now as you said, R passes by value, so you have multiple copies of the data roaming around. If you work with really big lists, you might want to try toning the memory usage down by working on each variable separately, using assign and get. The following is considered bad coding, but can sometimes be necessary to avoid memory trouble:
A <- X[[1]] ; B <- X[[2]] #make the data
list.names <- c("A","B")
for (i in list.names){
tmp <- get(i)
tmp$newstuff <- "profit"
assign(i,tmp)
rm(tmp)
}
Make sure you are well aware of the implications this code has, as you're working within the global environment. If you need to do this more often, you might want to work with environments instead:
my.env <- new.env() # make the environment
my.env$A <- X[[1]];my.env$B <- X[[2]] # put vars in environment
for (i in list.names){
tmp <- get(i,envir=my.env)
tmp$newstuff <- "profit"
assign(i,tmp,envir=my.env)
rm(tmp)
}
my.env$A
my.env$B
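If the existing code must keep using the original variable names, the round trip through a big list can be written without copying each variable by hand. This is a sketch using mget and list2env from base R; the contents of a100 and a200 are made up, and env simply refers to wherever those variables live.

```r
## Stand-ins for the asker's variables (contents are made up)
a100 <- list(time = list(data = matrix(0, 2, 2)))
a200 <- list(time = list(data = matrix(1, 2, 2)))

env <- environment()                     # where the variables live
var_names <- c("a100", "a200")

biglist <- mget(var_names, envir = env)  # named list of the variables
biglist <- lapply(biglist, function(x) {
  x$newstuff <- "profit"
  x                                      # return the modified element
})
invisible(list2env(biglist, envir = env))  # write back under the same names

a100$newstuff
```

Like the get/assign loop, this still overwrites variables by name, so the same caution about working in the global environment applies.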
I often want to do essentially the following:
mat <- matrix(0,nrow=10,ncol=1)
lapply(1:10, function(i) { mat[i,] <- rnorm(1,mean=i)})
I would expect mat to have 10 random numbers in it, but it still contains only zeros. (I am not worried about the rnorm part. Clearly there is a right way to do that. I am worried about affecting mat from within an anonymous function of lapply.) Can I not affect the matrix mat from inside lapply? Why not? Is there a scoping rule of R that is blocking this?
I discussed this issue in this related question: "Is R’s apply family more than syntactic sugar". You will notice that if you look at the function signature for for and apply, they have one critical difference: a for loop evaluates an expression, while an apply loop evaluates a function.
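The difference can be seen in a tiny sketch (the variables x and y are only for illustration): the body of a for loop is evaluated in the calling environment, whereas the function passed to lapply assigns to a local variable that disappears when the call ends.

```r
x <- 0
for (i in 1:3) x <- x + i  # the loop body runs in the current environment

y <- 0
invisible(lapply(1:3, function(i) y <- y + i))  # assigns to a local y only

x  # changed by the loop
y  # unchanged by lapply
```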
If you want to alter things outside the scope of an apply function, then you need to use <<- or assign. Or more to the point, use something like a for loop instead. But you really need to be careful when working with things outside of a function because it can result in unexpected behavior.
In my opinion, one of the primary reasons to use an apply function is explicitly because it doesn't alter things outside of it. This is a core concept in functional programming, wherein functions avoid having side effects. This is also a reason why the apply family of functions can be used in parallel processing (and similar functions exist in the various parallel packages such as snow).
Lastly, the right way to run your code example is to pass the parameters in to your function and assign the output back, like so:
mat <- matrix(0,nrow=10,ncol=1)
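## return the new value from each call, then unlist so that mat stays numeric
mat <- matrix(unlist(lapply(1:10, function(i, mat) { mat[i,] <- rnorm(1,mean=i); mat[i,] }, mat=mat)), ncol=1)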
It is always best to be explicit about a parameter when possible (hence the mat=mat) rather than inferring it.
One of the main advantages of higher-order functions like lapply() or sapply() is that you don't have to initialize your "container" (matrix in this case).
As Fojtasek suggests:
as.matrix(lapply(1:10,function(i) rnorm(1,mean=i)))
Alternatively:
do.call(rbind,lapply(1:10,function(i) rnorm(1,mean=i)))
Or, simply as a numeric vector:
sapply(1:10,function(i) rnorm(1,mean=i))
If you really want to modify a variable outside the scope of your anonymous function (a random number generator in this instance), use <<-:
> mat <- matrix(0,nrow=10,ncol=1)
> invisible(lapply(1:10, function(i) { mat[i,] <<- rnorm(1,mean=i)}))
> mat
[,1]
[1,] 1.6780866
[2,] 0.8591515
[3,] 2.2693493
[4,] 2.6093988
[5,] 6.6216346
[6,] 5.3469690
[7,] 7.3558518
[8,] 8.3354715
[9,] 9.5993111
[10,] 7.7545249
See this post about <<-. But in this particular example, a for-loop would just make more sense:
mat <- matrix(0,nrow=10,ncol=1)
for( i in 1:10 ) mat[i,] <- rnorm(1,mean=i)
with the minor cost of creating an index variable, i, in the global workspace.
Instead of actually altering mat, lapply just returns a list holding the value of each call. You need to assign that result to mat and turn it back into a numeric matrix, for example with unlist() followed by matrix().
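As a sketch of that last step, vapply can stand in for lapply plus unlist, since it returns a plain numeric vector and checks that every call yields exactly one number. The seed and the name draws are only there for the illustration.

```r
set.seed(42)  # only to make the illustration reproducible

## one numeric value per call, collected into a numeric vector
draws <- vapply(1:10, function(i) rnorm(1, mean = i), numeric(1))

mat <- matrix(draws, nrow = 10, ncol = 1)  # build the matrix from the result
dim(mat)
```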