Run a for loop in parallel in R

I have a for loop that is something like this:
for (i in 1:150000) {
  tempMatrix = functionThatDoesSomething() # calling a function
  finalMatrix = cbind(finalMatrix, tempMatrix)
}
Could you tell me how to make this parallel?
I tried this based on an example online, but am not sure if the syntax is correct. It also didn't increase the speed much.
finalMatrix = foreach(i=1:150000, .combine=cbind) %dopar% {
  tempMatrix = {}
  tempMatrix = functionThatDoesSomething() # calling a function
  cbind(finalMatrix, tempMatrix)
}

Thanks for your feedback. I did look up parallel processing after I posted this question.
Finally, after a few tries, I got it running. I have added the code below in case it is useful to others.
library(foreach)
library(doParallel)
# set up parallel backend to use many processors
cores = detectCores()
cl <- makeCluster(cores[1] - 1) # not to overload your computer
registerDoParallel(cl)

finalMatrix <- foreach(i = 1:150000, .combine = cbind) %dopar% {
  tempMatrix = functionThatDoesSomething() # calling a function
  # do other things if you want
  tempMatrix # equivalent to finalMatrix = cbind(finalMatrix, tempMatrix)
}
#stop cluster
stopCluster(cl)
Note: if you allocate too many processes, you may get this error: Error in serialize(data, node$con) : error writing to connection.
Note: if .combine in the foreach statement is rbind, then the final object returned is built by appending the output of each iteration row-wise.
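For illustration, here is a minimal runnable sketch of the difference (toyFunction below is a made-up stand-in for functionThatDoesSomething):
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# made-up stand-in that returns a 1 x 3 row of numbers
toyFunction <- function(i) matrix(i * 1:3, nrow = 1)

# .combine = cbind glues results side by side: 1 row, 3 * 4 columns
wide <- foreach(i = 1:4, .combine = cbind) %dopar% toyFunction(i)

# .combine = rbind stacks results row-wise: 4 rows, 3 columns
tall <- foreach(i = 1:4, .combine = rbind) %dopar% toyFunction(i)

dim(wide) # 1 12
dim(tall) # 4 3

stopCluster(cl)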
Hope this is useful for folks trying out parallel processing in R for the first time like me.
References:
http://www.r-bloggers.com/parallel-r-loops-for-windows-and-linux/
https://beckmw.wordpress.com/2014/01/21/a-brief-foray-into-parallel-processing-with-r/

Related

function that works for lapply returns error for parLapply

I have a function that I am using with lapply and it works fine. I have tried to convert the lapply to a parLapply, and now it returns an error on cluster node 1 processing element 1: object of type 'closure' is not subsettable. Is this an issue with how I am setting up the parallel environment, or is it something else?
The function I am using:
setLimits <- function(col){
  if (choice == "num") {
    if (auto == "0") {
      high = inHigh[count]
      low = inLow[count]
    } else {
      high = inHigh[1]
      low = inLow[1]
    }
  } else {
    if (auto == "0") {
      high = attributes(dt[, col])$SpotfireColumnMetaData$limits.prod.upper +
             attributes(dt[, col])$SpotfireColumnMetaData$limits.prod.upper * (inHigh[count] / 100)
      low = attributes(dt[, col])$SpotfireColumnMetaData$limits.prod.lower +
            attributes(dt[, col])$SpotfireColumnMetaData$limits.prod.lower * (inLow[count] / 100)
    } else {
      high = attributes(dt[, col])$SpotfireColumnMetaData$limits.prod.upper +
             attributes(dt[, col])$SpotfireColumnMetaData$limits.prod.upper * (inHigh[1] / 100)
      low = attributes(dt[, col])$SpotfireColumnMetaData$limits.prod.lower +
            attributes(dt[, col])$SpotfireColumnMetaData$limits.prod.lower * (inLow[1] / 100)
    }
  }
  list(low, high)
}
The setup:
numCores <- detectCores()
cluster<-makeCluster(numCores)
inHigh <- c(100, 5000, 340, 6532, 45325, 645345, 2342, 2466)
inLow <- c(-432, -34325, -5342, 643, 234, 234, 234, 1)
x=ncol(dt)
calling the parLapply:
inVal=parLapply(cl=cluster,X=1:ncol(dt),fun=setLimits)
calling the lapply:
inVal=lapply(1:ncol(dt),setLimits)
Sorry, I won't be able to provide dt, and dput doesn't appear to work in the program I am using. dt is stored as a data.frame. I did simplify the function a little, but I tried to keep the main things that happen in it.
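A note on the setup shown above: setLimits reads several objects from the calling environment (choice, auto, inHigh, inLow, count and dt), and PSOCK workers created by makeCluster start as fresh R sessions that do not see those objects. A minimal sketch of exporting them to the workers before the parLapply call, assuming the object names used above:
library(parallel)

numCores <- detectCores()
cluster <- makeCluster(numCores)

# send every global object that setLimits relies on to the workers
clusterExport(cluster, varlist = c("choice", "auto", "inHigh", "inLow", "count", "dt"))

inVal <- parLapply(cl = cluster, X = 1:ncol(dt), fun = setLimits)
stopCluster(cluster)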

How to find / modify objects directly on parallel workers in R

I have an expensive problem I'm trying to split into pieces.
It's an optimization problem, and consists of an initial expensive setup step, followed by a recursive structure, such that the workers can only perform one step at a time before the results need to be collected, and a new task sent to the workers.
A complicating feature is that an initial setup step for the sub-computations has to be performed directly on each worker and cannot be exported to the workers via clusterExport or similar.
I had hoped to be able to use clusterApply to assign the outcome of this initial setup to be stored on the specific worker, but can't seem to achieve this.
The first part of my code below shows my current attempts and describes what I would like, the second shows an attempt to see all objects available on the worker and where they are located.
library(parallel)
### What I would like to do:
test2 <- function(){
  MYOBJECT <- 0
  cl = makeCluster(2, type = 'PSOCK')
  clusterExport(cl, c('MYOBJECT'), envir = environment())
  clusterApply(cl, 1:2, function(x) { # attempt to modify / create MYOBJECT on the worker processes
    y <- x * 2 # expensive operation I only want to do once, that *cannot* be exported to the worker
    MYOBJECT <<- y
    MYOBJECT <- y
    assign('MYOBJECT', y, envir = parent.frame()) # envs[[1]])
  })
  clusterApply(cl, 1:2, function(x) MYOBJECT * .5) # cheap operation to be done many times
}
test2() #should return a list of 1 and 2, without assignment into the test2 function environment / re exporting
#trying to find out where MYOBJECT is on the worker
test <- function(){
  MYOBJECT <- 1
  cl = makeCluster(1, type = 'PSOCK')
  clusterExport(cl, c('MYOBJECT'), envir = environment())
  clusterApply(cl, 1, function(x) {
    MYOBJECT <<- list('hello')
    assign('MYOBJECT', list('hellohello'), envir = parent.frame()) # envs[[1]])
  })
  clusterApply(cl, 1, function(x)
    lapply(sys.frames(), ls) # where is MYOBJECT?
  )
}
test()
Simple solution in the end: to modify the contents of individual workers in a persistent manner, the assignment within the clusterApply call needs to be made to the global environment.
library(parallel)
### Working version:
test2 <- function(){
  MYOBJECT <- 0
  cl = makeCluster(2, type = 'PSOCK')
  clusterExport(cl, c('MYOBJECT'), envir = environment())
  clusterApply(cl, 1:2, function(x) { # create MYOBJECT2 on each worker
    y <- x * 2 # expensive operation done only once per worker
    assign('MYOBJECT2', y, envir = globalenv())
  })
  clusterApply(cl, 1:2, function(x) MYOBJECT2 * .5) # cheap operation to be done many times
}
test2() #should return a list of 1 and 2, without assignment into the test2 function environment / re exporting
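This works because each PSOCK worker is a separate R process with its own global environment: anything assigned there persists across clusterApply calls for the lifetime of the cluster, so the second clusterApply can read MYOBJECT2 without re-exporting it. The worker-side state is lost once stopCluster is called.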

Parallelized code dependent on other functions

Here is the structure of a function I have created, inside which parallelization has been implemented.
parallelized.function <- function(...){
  # Function used in the parallelization
  used.in.par <- function(...)
  # Functions needed by used.in.par (auxiliaries)
  aux1 <- function(...)
  aux2 <- function(...)

  #---------------------------------------------------#
  # PARALLELIZATION PROCESS
  suppressMessages(library(foreach))
  suppressMessages(library(doParallel))
  ..................................
  %dopar%{ used.in.par(...) }
  #---------------------------------------------------#

  return(something)
}
The code works, but aux1 and aux2 have to be defined inside parallelized.function for it to work (which takes a lot of lines of code).
Is there any way to call the aux1 and aux2 functions instead of writing all their code inside parallelized.function?
I tried creating new scripts with the aux functions and writing source(".../aux1.R") and source(".../aux2.R") inside parallelized.function, without success.
Thank you,
The foreach package can do that for you. To access a function that was not defined in the current environment, use the .export argument; to load the required packages on each of the workers, use the .packages option.
foreach(
  ...,
  .export = c('aux1', 'aux2'),
  .packages = c(...)
) %dopar% {
  ...
}
Note that loading the foreach and doParallel packages inside the foreach loop is not required. However, you are also missing the part where you register the cluster.
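For illustration, a minimal self-contained sketch of the pattern (the bodies of aux1 and aux2 are made up, the call that used.in.par would make is written directly in the loop body, and the cluster is registered with the standard doParallel calls):
library(foreach)
library(doParallel)

# helper functions defined at the top level, e.g. sourced from their own files
aux1 <- function(x) x + 1
aux2 <- function(x) x * 2

parallelized.function <- function(n){
  cl <- makeCluster(2)
  registerDoParallel(cl) # the registration step that was missing
  out <- foreach(i = 1:n,
                 .combine = c,
                 .export = c('aux1', 'aux2')) %dopar% {
    aux2(aux1(i)) # body of used.in.par, inlined for brevity
  }
  stopCluster(cl)
  out
}

parallelized.function(4) # returns c(4, 6, 8, 10)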

Foreach does not perform IF statements

I've implemented foreach in one of the multiple for statements of my R code. It returns the main result (the one after all the iterations); however, it does not execute an if statement within the code.
Below is the skeleton of my code (it's too long to include everything). The if statement does not work and the variable "Disc_Time" remains the same (as initialized). What am I doing wrong or missing? I've tried with .export="f" and .export=ls(GlovalEnv) without success.
library(foreach)
library(doParallel)

cores = detectCores()
cl <- makeCluster(cores[1] - 1) # not to overload your computer
registerDoParallel(cl)

Disc_Time <- c("UE", "Beam_order", "Time")
# .... MORE VARIABLES

MDP_x <- foreach(d = 1:length(dista), .combine = 'c') %dopar% {
  for (q in 1:sim) {
    for (ue in 1:n) {
      for (i in 1:length(seq_order_BS)) {
        for (j in 1:length(seq_order_UE)) {
          if (first == 0) {
            Disc_Time <- rbind(Disc_Time, c(ue, i, D_Time))
          }
        }
      }
    }
  }
}

stopCluster(cl)
To see whether your if statement is working, we need to know how first is set and what value it has before your loop. It does not look like first changes within your loop, so the check could really sit outside your %dopar% statement.
That said, the if statement is not your issue. foreach returns a list containing the result of each evaluation of its expression. For example:
ls <- foreach(d = 1:5) %dopar% {
  return(d)
}
gives a list ls that contains the numbers one to five.
The only expression in your function is an assignment to Disc_Time. This is evaluated within each of your nodes and never returned to the parent environment, so Disc_Time is never changed in the environment the code was called from.
It looks as though you are trying to set a side effect of your parallel function (to change Disc_Time), which to my knowledge is not possible in a parallel context. Perhaps you want:
MDP_x <- foreach(d = 1:length(dista), .combine = 'c') %dopar% {
  for (q in 1:sim) {
    for (ue in 1:n) {
      for (i in 1:length(seq_order_BS)) {
        for (j in 1:length(seq_order_UE)) {
          if (first == 0) {
            return(rbind(Disc_Time, c(ue, i, D_Time)))
          } else {
            return(NA)
          }
        }
      }
    }
  }
}

stopCluster(cl)
MDP_x should then have the values you want for each d.
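More generally, the idiomatic pattern is to have each %dopar% iteration return its piece of the result and let .combine stitch the pieces together, instead of assigning to a variable outside the loop. A minimal runnable sketch of that pattern (the objects sim, n, dista and so on from the question are not shown, so made-up loop bounds and placeholder values are used here):
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

first <- 0
D_Time <- 0.5 # placeholder value

# each iteration returns a small matrix of rows; .combine = rbind stacks them
Disc_Time <- foreach(d = 1:4, .combine = rbind) %dopar% {
  rows <- NULL
  for (ue in 1:2) {
    for (i in 1:3) {
      if (first == 0) {
        rows <- rbind(rows, c(UE = ue, Beam_order = i, Time = D_Time * d))
      }
    }
  }
  rows # the value of the foreach expression is what gets combined
}

stopCluster(cl)
Using named columns like this also avoids the need to seed Disc_Time with a character header row.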

Run different function in different threads/task in R

Does R have any mechanism to run different calculations in different threads (a Windows-like mechanism of threads/tasks)? Let's say we have:
func1 <- function(x) { return (x^2); }
func2 <- function(y) { return (y^3); }
I need to execute something like this (pseudocode):
thread1 <- thread_run(func1);
thread2 <- thread_run(func2);
with some mechanism of synchronization, like:
wait(thread1);
wait(thread2);
You can do that with the future package.
install.packages("future")
library(future)
Then just use your code and change the assignment to:
thread1 %<-% thread_run(func1);
thread2 %<-% thread_run(func2);
More to read here: http://www.r-bloggers.com/a-future-for-r-slides-from-user-2016/
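A minimal runnable sketch of that idea (plan(multisession) and the blocking behaviour of future variables are standard future-package usage; thread_run from the question is replaced by calling the functions directly):
library(future)
plan(multisession, workers = 2) # run each future in its own R session

func1 <- function(x) x^2
func2 <- function(y) y^3

# %<-% starts each computation asynchronously
thread1 %<-% func1(10)
thread2 %<-% func2(10)

# reading the variables blocks until the corresponding future has finished,
# which plays the role of wait(thread1); wait(thread2)
c(thread1, thread2) # 100 1000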
