Send chunks of large dataset to specific cores for R parallel foreach

I am attempting to scale a script that applies a feature-extraction function to an image file generated from each row of an R matrix. For faster computation I split the matrix into equally sized chunks and process each chunk in parallel using the foreach construct. To do this, I currently have to send the entire matrix to each core using clusterExport and then subset it to the desired chunk inside the foreach loop.
I am hoping to find a way to export only the relevant matrix chunk to each core, instead of passing the full matrix to every core and then subsetting. The only thread I could find close to what I am looking for is this answer by Steve Weston, who sent individual chunks to each core using clusterCall (code pasted below):
library(parallel)
cl <- makeCluster(detectCores())
df <- data.frame(a=1:10, b=1:10)
ix <- splitIndices(nrow(df), length(cl))
for (i in seq_along(cl)) {
  clusterCall(cl[i], function(d) {
    assign('mydata', d, pos=.GlobalEnv)
    NULL  # don't return any data to the master
  }, df[ix[[i]], , drop=FALSE])
}
This answer worked as advertised; however, the chunks in this example are sent to the cores one after another (the for loop runs sequentially) rather than in parallel. My attempt to parallelize this using foreach instead of for was hamstrung by having to use clusterExport to transfer the dataset variable, which is exactly the issue I am trying to avoid:
clusterExport(cl,c("df","ix"))
foreach() %dopar% {etc}
Is there a way to pass chunks of a variable to each core and operate on them in parallel? A foreach solution would be nice, or a parallelized adaptation of Steve Weston's structure. Note that I am developing on Windows, so forking is not an option for me.
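One possible approach, sketched below with a placeholder matrix mat and a dummy per-row feature function (both assumptions, not the asker's actual objects): parallel::clusterApply() hands each element of a list to a single worker, so if the matrix is split into chunks on the master first, each worker only ever receives its own chunk over the socket connection.
library(parallel)

cl <- makeCluster(detectCores())

mat <- matrix(runif(1000), nrow = 100)  # stand-in for the real matrix
chunks <- lapply(splitIndices(nrow(mat), length(cl)),
                 function(ix) mat[ix, , drop = FALSE])

# clusterApply() ships list element i to exactly one worker, so no worker
# receives the full matrix, and the chunk computations run concurrently.
res <- clusterApply(cl, chunks, function(chunk) {
  apply(chunk, 1, sum)  # placeholder for the per-row feature extraction
})

stopCluster(cl)
With one chunk per worker this behaves much like Steve Weston's loop, except that distribution and computation happen inside a single parallel call instead of a sequential setup loop followed by foreach.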

Related

How to fill a very large array in parallel R

I need to fill a lot of very large arrays by opening thousands of CSV files, extracting columns of data, and inserting them into 3D and 4D matrices. I've tried writing this in parallel, but my computer always crashes when memory fills up. I've looked at this question, Parallel `for` loop with an array as output, but I have not gotten those suggestions to work for me. Here's my code (generalized where needed):
tmin_array_1981_2010 <- array(NA, c(585, 1386, 366))
foreach (f = 1:500000, .packages=c('dplyr','lubridate')) %dopar% {
  data <- read.csv(file_name[f])
  tmin_array_1981_2010[y[f], x[f], ] <- data$column
}
There's a lot more that I'm doing in the foreach loop, but this is enough to understand what I want to do. I've read that I can use an lapply statement to parallelize this code, but I'm not going to pretend I understand what, or how, they're doing it. I've also tried using the abind function as shown in this post, Parallel `for` loop with an array as output, but this performs worse than the simple code I have above.
acomb <- function(...) abind(..., along=3)
foreach (f=1:18, .combine='acomb', .multicombine=TRUE,
         .packages=c('dplyr','lubridate','vroom','tidyverse')) %dopar% {
  data <- read.csv(file_name[f])
  tmin_array_1981_2010[y[f], x[f], ] <- data$column
}
Any help would be great. Thank you.
I guess the part taking the time is reading the CSVs.
So you can always return list(y[f], x[f], data$column) (or even just data$column) from the loop and fill the array afterwards on the master. Do not use .combine in that case.
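A minimal sketch of that suggestion, assuming file_name, x, and y exist as in the question; the 4-worker cluster and the doParallel backend are assumptions (any registered foreach backend works):
library(foreach)
library(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)

# Workers only read the files and return the pieces; no .combine needed.
pieces <- foreach(f = seq_along(file_name)) %dopar% {
  data <- read.csv(file_name[f])
  list(y = y[f], x = x[f], column = data$column)
}

stopCluster(cl)

# Fill the array serially on the master, where it exists only once.
tmin_array_1981_2010 <- array(NA, c(585, 1386, 366))
for (p in pieces) {
  tmin_array_1981_2010[p$y, p$x, ] <- p$column
}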

Does clusterMap in Snow support dynamic processing?

It seems clusterMap in snow doesn't support dynamic processing. I'd like to do parallel computing with pairs of parameters stored in a data frame, but the elapsed time of each job varies a great deal. If the jobs are not scheduled dynamically, it will be time consuming.
e.g.
library(snow)
cl2 <- makeCluster(3, type = "SOCK")
df_t <- data.frame(type = c(rep('a',3), rep('b',3)),
                   value = c(rep('1',3), rep('2',3)))
clusterExport(cl2, "df_t")
clusterMap(cl2, function(x, y) { paste(x, y) },
           df_t$type, df_t$value)
It is true that clusterMap doesn't support dynamic processing, but there is a comment in the code suggesting that it might be implemented in the future.
In the meantime, I would create a list from the data in order to call clusterApplyLB with a slightly different worker function:
ldf <- lapply(seq_len(nrow(df_t)), function(i) df_t[i,])
clusterApplyLB(cl2, ldf, function(df) {paste(df$type, df$value)})
This was common before clusterMap was added to the snow package.
Note that your use of clusterMap doesn't actually require you to export df_t since your worker function doesn't refer to it. But if you're willing to export df_t to the workers, you could also use:
clusterApplyLB(cl2, 1:nrow(df_t), function(i){paste(df_t$type[i],df_t$value[i])})
In this case, df_t must be exported to the cluster workers since the worker function references it. However, it is generally less efficient since each worker only needs a fraction of the entire data frame.
I found that clusterMap in the parallel package supports load balancing (via its .scheduling = "dynamic" argument), but it is less efficient than combining lapply with clusterApplyLB as implemented in snow. I tried to look at the source code to figure out why, but clusterMap is not available when I click the 'source' and 'R code' links.
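For reference, a short sketch of the load-balanced clusterMap call being compared in that comment; this is only an illustration of the .scheduling argument in the parallel package, not a benchmark, and it re-creates the question's cl2 and df_t:
library(parallel)

cl2 <- makeCluster(3)
df_t <- data.frame(type = c(rep('a', 3), rep('b', 3)),
                   value = c(rep('1', 3), rep('2', 3)))

# .scheduling = "dynamic" asks clusterMap for load-balanced dispatch;
# no clusterExport is needed because the columns are passed as arguments.
clusterMap(cl2, function(x, y) paste(x, y),
           df_t$type, df_t$value,
           .scheduling = "dynamic")

stopCluster(cl2)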

In R, is there danger of communication between foreach loops (doSNOW) when using assignments to store intermediate output?

I want to create a function that uses assignments to store intermediate output (p). This intermediate output is used in statements further down. I want everything to be parallelized using doSNOW and foreach, and I do NOT want that intermediate output to be communicated between iterations of the foreach loop. I don't want to store the intermediate output in a list (e.g. p[[i]]) because then I would have to change a huge amount of code.
Question 1: Is there any danger that another iteration of the foreach loop will use the intermediate output (p)?
Question 2: If yes, when would there be danger of that happening and how to prevent it?
Here is an example of what I mean:
install.packages('foreach')
library('foreach')
install.packages('doSNOW')
library('doSNOW')

NbrCores <- 4
cl <- makeCluster(NbrCores)
registerDoSNOW(cl)

test <- function(value){
  foreach(i=1:500) %dopar% {
    # some statement based on parameter 'value'
    p <- value
    # some statement that uses p
    v <- p
    # other statements
  }
}
test(value=1)
Each of the nodes used in the parallel computation runs in its own R process, I believe. Therefore there is no risk of variables from one node influencing the results on another. In general it is possible to communicate between the processes, but foreach only iterates over the sequence it is given, executing each item independently on one of the nodes.
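A small sketch illustrating that isolation, reusing the cluster registered in the question's code: the intermediate p assigned inside an iteration never reaches the master session, and only the value returned from the loop body comes back.
res <- foreach(i = 1:4, .combine = c) %dopar% {
  p <- i * 10   # intermediate value, local to this iteration's environment
  p + 1         # only this return value is sent back to the master
}

res          # 11 21 31 41, assembled from the workers' return values
exists("p")  # FALSE on the master: the worker-side 'p' was never shared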

Directly assign results of doMC (foreach) to data frame

Let's say I have the following example code:
kkk <- data.frame(m.mean=1:1000, m.sd=1:1000/20)
kkk[, 3:502] <- NA
for (i in 1:nrow(kkk)){
  kkk[i, 3:502] <- rnorm(n=500, mean=kkk[i,1], sd=kkk[i,2])
}
I would like to convert this to run in parallel with doMC. My problem is that foreach returns a list, whereas I need the result of each iteration to be a vector that can then be transferred into the data frame (which will later be exported as CSV for further processing).
Any ideas?
You don't need a loop for this, and putting a large matrix of numbers in a data frame only to treat it as a matrix is inefficient (although you may need to create a data frame at the end, after doing all the math, in order to write a CSV file).
m.mean <- 1:1000
m.sd <- 1:1000/20
num.columns <- 500
x <- matrix(nrow=length(m.mean), ncol=num.columns,
            data=rnorm(n=length(m.mean) * num.columns))
x <- x * cbind(m.sd)[,rep(1,num.columns)] + cbind(m.mean)[,rep(1,num.columns)]
kkk <- data.frame(m.mean=m.mean, m.sd=m.sd, unname(x))
write.csv(kkk, "kkk.txt")
To answer your original question about directly assigning results to an existing data structure from a foreach loop: that is not possible. The foreach package's parallel backends are designed to perform each computation in a separate R process, so each one has to return a separate object to the parent process, which collects them with the .combine function provided to foreach. You could write a parallel foreach loop that assigns directly to the kkk variable, but it would have no effect, because each assignment would happen in a separate process and would not be shared with the main process.
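If the simulation does need to stay inside a parallel loop, a hedged sketch of the usual pattern is to return each row from the loop body and let .combine assemble the matrix on the master (registerDoMC(4) and the column count of 500 are assumptions matching the question):
library(foreach)
library(doMC)
registerDoMC(4)

kkk <- data.frame(m.mean = 1:1000, m.sd = 1:1000/20)

# Each iteration returns one vector; .combine = rbind stacks them row-wise
# on the master, so no worker ever writes into kkk directly.
sims <- foreach(i = 1:nrow(kkk), .combine = rbind) %dopar% {
  rnorm(n = 500, mean = kkk[i, 1], sd = kkk[i, 2])
}

kkk[, 3:502] <- sims
write.csv(kkk, "kkk.txt")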

mclapply with big objects - "serialization is too large to store in a raw vector"

I keep hitting an issue with the multicore package and big objects. The basic idea is that I'm using a Bioconductor function (readBamGappedAlignments) to read in large objects. I have a character vector of filenames, and I've been using mclapply to loop over the files and read them into a list. The function looks something like this:
objects <- mclapply(files, function(x) {
  on.exit(message(sprintf("Completed: %s", x)))
  message(sprintf("Started: '%s'", x))
  readBamGappedAlignments(x)
}, mc.cores=10)
However, I keep getting the following error: Error: serialization is too large to store in a raw vector. Yet I can read the same files in on their own without this error. I've found mention of this issue here, without resolution.
Any parallel solution suggestions would be appreciated; this has to be done in parallel. I could look towards snow, but I have a very powerful server with 15 processors, 8 cores each, and 256 GB of memory to do this on. I'd rather just do it on this machine across cores than use one of our clusters.
The integer limit is rumored to be addressed very soon in R. In my experience that limit can block datasets with just under 2 billion cells (around the maximum integer value), because low-level functions like sendMaster in the multicore package rely on passing raw vectors. I had around 1 million processes representing about 400 million rows and 800 million cells of data in the data.table format, and when mclapply was sending the results back it ran into this limit.
A divide and conquer strategy is not that hard and it works. I realize this is a hack and one should be able to rely on mclapply.
Instead of one big list, create a list of lists. Each sub-list is smaller than the version that broke, and you then feed them into mclapply split by split. Call this file_map. The results are a list of lists, which you can then flatten with a double do.call concatenation. As a result, each time mclapply finishes, the serialized raw vector stays at a manageable size.
Just loop over the smaller pieces:
collector <- vector("list", length(file_map))  # more complex than normal for speed
for (index in 1:length(file_map)) {
  reduced_set <- mclapply(file_map[[index]], function(x) {
    on.exit(message(sprintf("Completed: %s", x)))
    message(sprintf("Started: '%s'", x))
    readBamGappedAlignments(x)
  }, mc.cores=10)
  collector[[index]] <- reduced_set
}
output <- do.call("c", do.call("c", collector))  # double concatenate of the list of lists
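For completeness, one way file_map might be built, sketched here by splitting the file vector into batches of 100 files (the directory name, file pattern, and batch size are all assumptions):
files <- list.files("bam_dir", pattern = "\\.bam$", full.names = TRUE)

# Assign each file to a batch of 100; split() then returns the list of batches.
file_map <- split(files, ceiling(seq_along(files) / 100))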
Alternatively, save the output to a database, such as SQLite, as you go.
