In R, is there danger of communication between foreach loops (doSNOW) when using assignments to store intermediate output?

I want to create a function that uses assignments to store intermediate output (p). This intermediate output is used in statements further down. I want everything to be parallelized using doSNOW and foreach, and I do NOT want that intermediate output to be communicated between iterations of the foreach loop. I don't want to store the intermediate output in a list (e.g. p[[i]]) because then I would have to change a huge amount of code.
Question 1: Is there any danger that another iteration of the foreach loop will use the intermediate output (p)?
Question 2: If yes, when would there be danger of that happening and how to prevent it?
Here is an example of what I mean:
install.packages('foreach')
library('foreach')
install.packages('doSNOW')
library('doSNOW')
NbrCores <- 4
cl <- makeCluster(NbrCores)
registerDoSNOW(cl)

test <- function(value){
  foreach(i = 1:500) %dopar% {
    # some statement based on parameter 'value'
    p <- value
    # some statement that uses p
    v <- p
    # other statements
  }
}

test(value = 1)

Each of the nodes used in the parallel computation runs in its own R process, I believe, so there is no risk of variables from one node influencing the results on another. In general it is possible to make the processes communicate with each other, but foreach does not do that: it simply iterates over the sequence it is given and executes each item of the sequence independently on one of the nodes.
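As a quick sanity check (my own illustrative sketch, not part of the original answer), you can have every iteration report its worker's process ID and whether a variable called p already exists before the iteration assigns it; the object names are only placeholders:

library(foreach)
library(doSNOW)

cl <- makeCluster(4)
registerDoSNOW(cl)

check <- foreach(i = 1:8, .combine = rbind) %dopar% {
  existed <- exists("p", inherits = FALSE)  # was p already defined in this task's environment?
  p <- i * 10                               # the "intermediate output"
  data.frame(iteration = i, pid = Sys.getpid(), p_existed_before = existed)
}

check  # expected: p_existed_before is FALSE everywhere, even when one worker runs several tasks

stopCluster(cl)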

Related

Send chunks of large dataset to specific cores for R parallel foreach

I am attempting to scale a script that applies a feature extraction function to an image file generated from each row of an R matrix. For faster computation I split the matrix into equally sized chunks and run each chunk in parallel using the foreach construct. To do this, I have to send the entire matrix to each core using clusterExport and subset it to the desired chunk within the foreach loop.
I am hoping to find a way to export only the matrix chunks to each core, instead of passing the full matrix to each core and then subsetting. I was only able to find one thread close to what I was looking for: this answer by Steve Weston, who sends individual chunks to each core using clusterCall (code pasted below).
library(parallel)
cl <- makeCluster(detectCores())
df <- data.frame(a=1:10, b=1:10)
ix <- splitIndices(nrow(df), length(cl))
for (i in seq_along(cl)) {
  clusterCall(cl[i], function(d) {
    assign('mydata', d, pos=.GlobalEnv)
    NULL # don't return any data to the master
  }, df[ix[[i]],,drop=FALSE])
}
This answer worked as advertised; however, the cores in this example run in sequence rather than in parallel. My attempt to parallelize this using foreach instead of for was hamstrung by having to use clusterExport to transfer the dataset variable, which is exactly the issue I'm trying to avoid:
clusterExport(cl,c("df","ix"))
foreach() %dopar% {etc}
Is there a way to pass chunks of a variable to each core and operate on them in parallel? A foreach solution would be nice, or a parallelized adaptation of Steve Weston's structure. Note that I am developing for Windows, so forking is not an option for me.
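One minimal sketch of the second option (a parallelized adaptation of the clusterCall structure above, with colSums standing in for the real per-chunk feature extraction): replace the sequential for loop with clusterApply, which hands chunks[[i]] to worker i and lets the workers evaluate concurrently, so each worker only ever receives its own chunk.

library(parallel)

cl <- makeCluster(detectCores())

df <- data.frame(a = 1:10, b = 1:10)
chunks <- lapply(splitIndices(nrow(df), length(cl)),
                 function(ix) df[ix, , drop = FALSE])

# clusterApply sends one chunk per worker and collects the results;
# nothing but the worker's own chunk crosses the socket
res <- clusterApply(cl, chunks, function(d) {
  colSums(d)  # placeholder for the real per-chunk computation
})

do.call(rbind, res)

stopCluster(cl)

If the foreach syntax is preferred, the chunks could also be stored on the workers this way first and then processed with one foreach task per worker, but the clusterApply call above already runs the chunks in parallel.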

How to modify elements of a vector based on other elements in parallel?

I'm trying to parallelize part of my code, but I can't figure out how, since everywhere I read about parallelization the object is completely split and then processed using only the "information" in the chunk. My problem cannot be split into independent chunks, but the updates can be done independently.
Here's a simplified version
a <- runif(10000)
indexes <- c(seq(1,9999,2), seq(2,10000,2))
for(i in indexes){
  .prev <- ifelse(i>1, a[i-1], 0)
  .next <- ifelse(i<10000, a[i+1], 1)
  a[i] <- runif(1, min(.prev,.next), max(.prev, .next))
}
While each iteration of the for loop depends on the current values in a, the order defined in indexes makes the problem parallelizable within the even and within the odd indexes. E.g., if indexes = seq(1,9999,2), the dependence is not a problem, since the values used in each iteration won't ever be modified during that pass. On the other hand, it is impossible to execute this on the subset a[indexes] alone, hence the "splitting" strategy in the guides I read is unable to perform this operation.
How can I parallelize a problem like this, where I need the object to be modified and to "look" at the whole vector every time? Instead of each worker returning some output, can each worker modify a "shared" object in memory?
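One way to exploit that structure without shared memory (a rough sketch using doParallel; the cluster size is an assumption): within one pass, every update reads only neighbours that the pass does not modify, so the new values can be computed in parallel from a snapshot of a and written back on the master before the next pass starts.

library(foreach)
library(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)

a <- runif(10000)

for (idx in list(seq(1, 9999, 2), seq(2, 10000, 2))) {
  # each i in this pass reads only a[i-1] and a[i+1], which the pass leaves untouched
  new_vals <- foreach(i = idx, .combine = c) %dopar% {
    .prev <- if (i > 1) a[i - 1] else 0
    .next <- if (i < 10000) a[i + 1] else 1
    runif(1, min(.prev, .next), max(.prev, .next))
  }
  a[idx] <- new_vals  # the write-back happens on the master between passes
}

stopCluster(cl)

In practice you would hand each worker a block of indexes rather than one index per task, since thousands of tiny tasks will be dominated by communication overhead; the structure of the two passes stays the same.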

Getting the index of an iterator in R (in parallel with foreach)

I'm using the foreach function to iterate over columns of a data.frame. At each iteration, I would like to get the index of the iterator (i.e. the index or the name of the column considered) and the column itself.
However, the following code, which seems fine at first glance, doesn't work because i has no names or colnames attributes.
foreach(i=iter(base[1:N],by='col')) %dopar% c(colnames(i),i)
Now, if you wonder why I'm not iterating over indexes, the reason is that I'm using %dopar% and I don't want to send the whole base to all workers, but only the column each of them requires.
Question : How can I get the index of an iterator ?
Thank you
I would just specify a second iteration variable in the foreach loop that acts as a counter:
library(foreach)
library(iterators)
df <- data.frame(a=1:10, b=rnorm(10), c=runif(10))
r <- foreach(d=df, i=icount()) %do% {
  list(d=d, i=i)
}
The "icount" function from the iterators package will return an unbounded counting iterator if no arguments are used, so this example works regardless of the number of columns in the data frame.
You could also include the column name as a third iteration variable:
r <- foreach(d=df, i=icount(), nm=colnames(df)) %do% {
  list(d=d, i=i, nm=nm)
}
Here are a few possibilities:
Modify the iter function (or write your own) so that instead of sending just the value of the column, it also includes the name or other information.
You could iterate over the indexes, but use a shared memory tool (such as the Rdsm package) so that each process only needs to grab the part of the data frame it needs rather than distributing the entire data frame.
You could convert your base data frame into a list where each element contains the corresponding column of base along with the column name, then iterate over that list, so the entire element is sent but not the other elements (sketched below).
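A minimal sketch of that last option (the contents of base and the per-column work are placeholders): each list element carries both the column and its name, so a worker only receives the element for its own task.

library(foreach)
library(doSNOW)

cl <- makeCluster(2)
registerDoSNOW(cl)

base <- data.frame(a = 1:10, b = rnorm(10), c = runif(10))

# pair each column with its name; only one element travels per task
cols <- lapply(seq_along(base), function(j) list(nm = names(base)[j], col = base[[j]]))

r <- foreach(x = cols) %dopar% {
  c(x$nm, x$col)  # analogous to the asker's c(colnames(i), i)
}

stopCluster(cl)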

Directly assign results of doMC (foreach) to data frame

Let's say I have the example code
kkk<-data.frame(m.mean=1:1000, m.sd=1:1000/20)
kkk[,3:502]<-NA
for (i in 1:nrow(kkk)){
  kkk[i,3:502] <- rnorm(n=500, mean=kkk[i,1], sd=kkk[i,2])
}
I would like to convert this to run in parallel with doMC. My problem is that foreach results in a list, whereas I need the result of each iteration to be a vector that can then be transferred to the data frame (which will later be exported as CSV for further processing).
Any ideas?
You don't need a loop for this, and putting a large matrix of numbers in a data frame only to treat it as a matrix is inefficient (although you may need to create a data frame at the end, after doing all your math, in order to write to a CSV file).
m.mean <- 1:1000
m.sd <- 1:1000/20
num.columns <- 500
x <- matrix(nrow=length(m.mean), ncol=num.columns,
            data=rnorm(n=length(m.mean) * num.columns))
x <- x * cbind(m.sd)[,rep(1,num.columns)] + cbind(m.mean)[,rep(1,num.columns)]
kkk <- data.frame(m.mean=m.mean, m.sd=m.sd, unname(x))
write.csv(kkk, "kkk.txt")
To answer your original question about directly assigning results to an existing data structure from a foreach loop: that is not possible. The foreach package's parallel backends are designed to perform each computation in a separate R process, so each one has to return a separate object to the parent process, which collects them with the .combine function provided to foreach. You could write a parallel foreach loop that assigns directly to the kkk variable, but it would have no effect, because each assignment would happen in a separate process and would never be shared with the main process.
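If you still want the foreach version of the original loop, here is a sketch using doMC as in the question (the core count is an assumption): have each iteration return its vector of draws and let .combine = "rbind" stack them into a matrix on the master, which you can then bind to the data frame yourself.

library(foreach)
library(doMC)
registerDoMC(cores = 4)

kkk <- data.frame(m.mean = 1:1000, m.sd = 1:1000/20)

# each iteration returns one vector of 500 draws; rbind stacks them in order
sims <- foreach(i = 1:nrow(kkk), .combine = rbind) %dopar% {
  rnorm(n = 500, mean = kkk[i, 1], sd = kkk[i, 2])
}

kkk <- cbind(kkk, as.data.frame(unname(sims)))
write.csv(kkk, "kkk.txt")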

How to rewrite my R code for multicore processing?

I have R code that I need to get to a "parallelization" stage. I'm new at this, so please forgive me if I use the wrong terms. I have a process that just has to chug through individual by individual, one at a time, and then average across individuals at the end. The process is exactly the same for each individual (it's a Brownian Bridge); I just have to do this for >300 individuals. So I was hoping someone here might know how to change my code so that it can be spawned, or parallelized, or whatever the word is, so that the 48 CPUs I now have access to can help reduce the 58 days it will take to compute this on my little laptop. In my head I would just send out one individual to one processor, have it run through the script, and then send out another one... if that makes sense.
Below is my code. I have tried to comment it and have indicated where I think the code needs to be changed.
for (n in 1:(length(IDNames))){ #THIS PROCESSES THROUGH EACH INDIVIDUAL
  #THIS FIRST PART IS JUST EXTRACTING THE DATA FROM MY TWO INPUT FILES.
  #I HAVE ONE FILE WITH ALL THE LOCATIONS AND THEN ANOTHER FILE WITH A DATE RANGE.
  #EACH INDIVIDUAL HAS DIFFERENT DATE RANGES, THUS IT HAS TO PULL OUT EACH INDIVIDUAL'S
  #DATA SET SEPARATELY AND THEN RUN THE FUNCTION ON IT.
  IndivData = MovData[MovData$ID==IDNames[n],]
  IndivData = IndivData[1:(nrow(IndivData)-1),]
  if (UseTimeWindow==T){
    IndivDates = dates[dates$ID==IDNames[n],]
    IndivData = IndivData[IndivData$DateTime>IndivDates$Start[1]&IndivData$DateTime<IndivDates$End[1],]
  }
  IndivData$TimeDif[nrow(IndivData)]=NA
  ########################
  #THIS IS THE PROCESS THAT I THINK EACH INDIVIDUAL HAS TO BE RUN THROUGH
  BBMM <- brownian.bridge(x=IndivData$x, y=IndivData$y,
                          time.lag = IndivData$TimeDif[1:(nrow(IndivData)-1)], location.error=20,
                          area.grid = Grid, time.step = 0.1)
  #############################
  #BELOW IS JUST CODE TO BIND THE RESULTS INTO A GRID DATA FRAME I ALREADY CREATED.
  #I DO NOT UNDERSTAND HOW THE MULTICORE-PROCESSED CODE WOULD JOIN THE DATA BACK,
  #WHICH IS WHY I'VE INCLUDED THIS PART OF THE CODE.
  if(n==1){ #creating a data frame with the x, y, and probabilities for the first individual
    BBMMProbGrid = as.data.frame(1:length(BBMM[[2]]))
    BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[2]],BBMM[[3]],BBMM[[4]])
    colnames(BBMMProbGrid)=c("GrdId","X","Y",paste(IDNames[n],"_Prob", sep=""))
  } else { #For every other individual just add the new information to the data frame
    BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[4]])
    colnames(BBMMProbGrid)[n*2+2]=paste(IDNames[n],"_Prob", sep="")
  } # end if
} #end loop through individuals
Not sure why this has been voted down either. I think the foreach package is what you're after. Those first few PDFs have very clear, useful information in them. Basically, write what you want done for each person as a function. Then use foreach to send the data for one person out to a node to run the function (while sending another person's data to another node, etc.), and then it compiles all the results using something like rbind. I've used this a few times with great results.
Edit: I didn't look to rework your code as I figure, given you've got that far, you'll easily have the skills to wrap it into a function and then use the one-liner foreach call.
Edit 2: This was too long for a comment to reply to you.
I thought since you had got that far with the code that you would be able to get it into a function :) If you're still working on this, it might help to think of writing a for loop to loop over your subjects and do the calculations required for that subject. Then that for loop is what you want in your function. I think in your code that is everything down to 'area.grid'. Then you can get rid of most of your [n]'s, since the data is only subset once per iteration.
Perhaps:
pernode <- function(i) {
  # MovData, IDNames, dates, UseTimeWindow and Grid are picked up from the calling
  # environment (fine with doMC, whose forked workers inherit the master's workspace)
  IndivData = MovData[MovData$ID==IDNames[i],]
  IndivData = IndivData[1:(nrow(IndivData)-1),]
  if (UseTimeWindow==T){
    IndivDates = dates[dates$ID==IDNames[i],]
    IndivData = IndivData[IndivData$DateTime>IndivDates$Start[1]
                          &IndivData$DateTime<IndivDates$End[1],]
  }
  IndivData$TimeDif[nrow(IndivData)]=NA
  BBMM <- brownian.bridge(x=IndivData$x, y=IndivData$y,
                          time.lag = IndivData$TimeDif[1:(nrow(IndivData)-1)], location.error=20,
                          area.grid = Grid, time.step = 0.1)
  return(BBMM)
}
Then something like:
library(doMC)
library(foreach)
registerDoMC(cores=48) # or perhaps a few less than all you have
system.time(
  output <- foreach(i = 1:length(IDNames), .combine = "rbind", .multicombine=TRUE,
                    .inorder = FALSE) %dopar% { pernode(i) }
)
Hard to say whether that is it without some test data; let me know how you get on.
This is a general example, since I didn't have the patience to read through all of your code. One of the quickest ways to spread this across multiple processors would be to use the multicore library and mclapply (a parallelized version of lapply) to push a list through a function; in your case the individual items on the list would be the data frames for each of the 300+ individuals.
Example:
library(multicore)
result <- mclapply(data_list, your_function, mc.preschedule=FALSE, mc.set.seed=FALSE)
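Here data_list and your_function are placeholders. Using the objects from the question, they might be built roughly like this (an untested sketch; MovData and Grid come from the asker's code, and the date-window filtering is omitted):

# one data frame per individual, split on the ID column of MovData
data_list <- split(MovData, MovData$ID)

# per-individual computation, essentially the body of the asker's loop
your_function <- function(IndivData) {
  IndivData <- IndivData[1:(nrow(IndivData)-1),]
  IndivData$TimeDif[nrow(IndivData)] <- NA
  brownian.bridge(x=IndivData$x, y=IndivData$y,
                  time.lag=IndivData$TimeDif[1:(nrow(IndivData)-1)], location.error=20,
                  area.grid=Grid, time.step=0.1)
}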
As I understand your description, you have access to a distributed computing cluster, so the multicore package will not work. You have to use Rmpi, snow, or foreach. Based on your existing loop structure I would advise using the foreach and doSNOW packages. But it looks like you have a lot of data, so you should probably check whether you can reduce what is sent to the nodes to only the data that is actually required.
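A sketch of that idea (object names such as MovData and Grid come from the question; the cluster size, the BBMM package name, and the omission of the date-window filtering are my assumptions): iterate over a list of per-individual data frames so that each task ships only one individual's rows to a node instead of exporting the full data set.

library(foreach)
library(doSNOW)

cl <- makeCluster(8)
registerDoSNOW(cl)

# one small data frame per individual; only the element for the current task
# travels to the node that runs it
per_indiv <- split(MovData, MovData$ID)

results <- foreach(IndivData = per_indiv, .packages = "BBMM") %dopar% {
  IndivData <- IndivData[1:(nrow(IndivData) - 1), ]
  IndivData$TimeDif[nrow(IndivData)] <- NA
  brownian.bridge(x = IndivData$x, y = IndivData$y,
                  time.lag = IndivData$TimeDif[1:(nrow(IndivData) - 1)],
                  location.error = 20, area.grid = Grid, time.step = 0.1)
}

stopCluster(cl)

The results come back as a list with one BBMM object per individual, which you can then bind into your BBMMProbGrid on the master, just as the sequential loop does.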
