R parallel computing with snowfall - writing to files from separate workers

I am using the snowfall 1.84 package for parallel computing and would like each worker to write data to its own separate file during the computation. Is this possible? If so, how?
I am using the "SOCK" connection type, e.g. sfInit(parallel=TRUE, ..., type="SOCK"), and would like the code to be platform independent (Unix/Windows).
I know it is possible to use the "slaveOutfile" option in sfInit to define a file to write the logs to, but this is intended for debugging purposes and all slaves/workers use the same file. I need each worker to have its OWN output file!
The data I need to write are large data frames, NOT simple diagnostic messages. These data frames need to be output by the slaves and cannot be sent back to the master process.
Does anyone know how I can get this done?
Thanks

A simple solution is to use sfClusterApply to execute a function that opens a different file on each of the workers, assigning the resulting file object to a global variable so you can write to it in subsequent parallel operations:
library(snowfall)

nworkers <- 3
sfInit(parallel=TRUE, cpus=nworkers, type='SOCK')

# Run once per worker: open a worker-specific file and keep the
# connection in a global variable on that worker
workerinit <- function(datfile) {
  fobj <<- file(datfile, 'w')
  NULL
}
sfClusterApply(sprintf('worker_%02d.dat', seq_len(nworkers)), workerinit)

# Each call writes to the file connection owned by whichever worker executes it
work <- function(i) {
  write.csv(data.frame(x=1:3, i=i), file=fobj)
  i
}
sfLapply(1:10, work)

sfStop()
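One loose end in the example above: the connections opened by workerinit are never closed. A line like the following, placed just before the sfStop() call, asks every worker to close its own file (sfClusterEval evaluates an expression on all workers; fobj is the worker-global variable from the example):
sfClusterEval(close(fobj))   # each worker closes its private output file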

Related

Working with environments in the parallel package in R

I have an R API that makes use of 5 different R files that define different metrics that I use. Each of those files has a number of tasks that I run using the parallel package since they all use the same data, but with different groupings. To avoid having to create and close the clusters in each file, I took out those commands and put them into a cluster.R file. So the structure I have is basically:
cluster.R —
cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, {
  library('dplyr')
  source('helpers.R')
})
.Last <- function() {
  stopCluster(cl)
}
Metric1.R —
metric1.function <- function(x, y, z) {
  # ... dplyr transformations ...
}

some_date <- date_from_api_input

tasks <- list(job1 = function() {metric1.function(data, grouping1, some_date)},
              job2 = function() {metric1.function(data, grouping2, some_date)},
              job3 = function() {metric1.function(data, grouping3, some_date)}
)
clusterExport(cl, c('data', 'metric1.function', 'some_date'), envir = environment())
out <- clusterApplyLB(
  cl,
  tasks,
  function(f) f()
)
bind_rows(out)
This API just creates different metrics that then fill a database table that holds them all. So each metric file contains different functions and inputs but outputs the same columns and groupings.
Metrics 2-5 are all the same, except that the custom function is different for each file and defined at the beginning of each file. The problem I'm having is that all metrics are also run in parallel, and I'm having issues working with the environments. What ends up happening is that a job will say that some_date isn't found, or that metric2.function isn't found in metric5.R.
I use plumber to expose R and each time it starts, it sources the cluster.R file, starts up the clusters with their initializations, and listens for any requests that come in.
When running in series it works just fine for testing and everything passes as expected, but in production, when our server runs all the scripts in parallel, the variables and functions I've exported with clusterExport either don't get passed in or get mixed up.
Should I be structuring it in a different fashion or am I using the parallel package incorrectly for my purpose?
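For what it's worth, one way to sidestep the export problem is to make each task self-contained, so that nothing has to be looked up in the worker's global environment. A minimal sketch reusing the names from the question (illustrative only, not a verified fix):
# Build tasks that carry their own inputs: each closure's environment captures
# the data, the metric function and the date, so they are serialized along with
# the task and clusterExport() is no longer needed for them.
make_task <- function(fun, dat, grouping, dt) {
  force(fun); force(dat); force(grouping); force(dt)
  function() fun(dat, grouping, dt)
}

tasks <- list(
  job1 = make_task(metric1.function, data, grouping1, some_date),
  job2 = make_task(metric1.function, data, grouping2, some_date),
  job3 = make_task(metric1.function, data, grouping3, some_date)
)

out <- parallel::clusterApplyLB(cl, tasks, function(f) f())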

Is there a way to combine Rmpi & mclapply?

I have some R code that applies a function to a list of objects. The function is simple but involves a bootstrapping calculation, which can be easily sped up using mclapply. When run on a single node, everything is fine.
However, I have a cluster and what I've been trying to do is to distribute the application of the function to the list of objects across multiple nodes. To do this I've been using Rmpi (0.6-6).
The code below runs fine
library(Rmpi)
cl <- parallel::makeCluster(10, type='MPI')
parallel::clusterExport(cl, varlist=c('as.matrix'), envir=environment())
descriptor <- parallel::parLapply(1:5, function(am) {
  val <- mean(unlist(lapply(1:120, function(x) mean(rnorm(1e7)))))
  return(c(val, Rmpi::mpi.universe.size()))
}, cl=cl)
print(do.call(rbind, descriptor))
snow::stopCluster(cl)
However, if I convert the lapply to mclapply and set mc.cores=10, MPI warns that forking will lead to bad things, and the job hangs.
(In all cases jobs are being submitted via SLURM)
Based on the MPI warning, it seems that I should not be using mclapply within Rmpi jobs. Is this a correct assessment?
If so, does anybody have suggestions on how I can parallelize the function that is being run on each node?
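One possibility, sketched under the assumption that each node has spare cores for the bootstrap: keep the outer MPI cluster for distributing the list across nodes, but replace the inner mclapply on each MPI worker with a small PSOCK cluster, which launches fresh R processes instead of forking:
descriptor <- parallel::parLapply(1:5, function(am) {
  # PSOCK workers are started with Rscript, so no fork() happens inside the MPI job
  inner <- parallel::makeCluster(10, type = "PSOCK")
  on.exit(parallel::stopCluster(inner), add = TRUE)
  val <- mean(unlist(parallel::parLapply(inner, 1:120,
                                         function(x) mean(rnorm(1e7)))))
  c(val, Rmpi::mpi.universe.size())
}, cl = cl)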

Read/write the same object from scripts executed by 2 separate threads in R

Problem Description:
I am trying to build an automated trading system in R connecting to Oanda REST API. My Operating System is Windows 10.
The program has two separate infinite looping components through "while (TRUE)": The "trading engine" and the "tick data streaming engine".
The program is organised in such a way that the two components communicate through a queue object created using "R6" package.
The "tick data streaming engine" receives tick FX data from Oanda server.
It then creates a tick event object from the data and "push" the tick event to the queue using an instance of the queue class created using "R6" package.
The "trading engine" "pop" the queue object and analyses the event object that comes out.
If it is a Tick data event, it makes an analysis to see whether it meets the conditions set by the logic of the trading strategy.
If the popped tick event meets the conditions, the trading engine creates an order event object
which is "pushed" to the back of the queue using the same instance of the queue class created using "R6" package.
To this end, I want to run the "trading engine" using one thread and run the "tick data streaming engine" using another thread.
The two separate threads should be able to push to, and pop from the same event queue instance.
My understanding is that the event queue instance object should be a shared object for the two separate threads to have access to it.
Question:
My question is: how can I implement a shared object that can be dynamically modified (read/write) by code files running on two separate threads,
or any other construct that allows reading from and writing to the same object from two or more threads?
Could I use packages such as "mmap" for a shared-memory implementation, or any other package, to achieve my objective?
Attempts:
In order to test the feasibility of the program, this is what I tried:
For simplicity and reproducibility, I created a shared object called "sharedMatrix".
It is a 10 x 1 matrix which will play the role of the event queue instance in my actual Oanda API program.
I used the "bigmemory" R package to transform the initial matrix "m" into a big.matrix object "x" and attached it so that it could be a shared object: "sharedMatrix".
By doing this, I was expecting "sharedMatrix" to be "seen" and modified by each thread running the two separate code files.
# Codefile1
for (i in 1:5) {
  sharedMatrix[i, 1] <- i
}
# Codefile2
for (j in 6:10) {
  sharedMatrix[j, 1] <- j
}
I sourced the two code files using the "foreach" and "doParallel" R packages by executing the following code:
library(doParallel)
library(bigmemory)
library(foreach)

m <- matrix(nrow = 10)                      # Create 10 x 1 matrix
x <- as.big.matrix(m)                       # Convert m to a big.matrix
mdesc <- describe(x)                        # Get a description of the matrix

cl <- makeCluster(2, outfile = "Log.txt")   # Create a cluster of two threads with output file "Log.txt"
registerDoParallel(cl)
clusterExport(cl = cl, varlist = ls())      # Export input data to all cores

fileList <- list("Codefile1.R", "Codefile2.R")  # Script files saved in the current working directory

foreach(f = fileList, .packages = "bigmemory") %dopar% {
  sharedMatrix <- attach.big.matrix(mdesc)  # Attach the matrix via shared memory
  source(f)                                 # Source the script files for parallel execution
}
To my surprise this is the console output when the above code is executed:
Error in { : task 1 failed - "object 'sharedMatrix' not found"
After checking the content of sharedMatrix, I was expecting to see something like this:
sharedMatrix[]
1 2 3 4 5 6 7 8 9 10
However this is what I see:
sharedMatrix[]
Error: object 'sharedMatrix' not found
It seems to me that the worker threads do not "see" the shared object "sharedMatrix".
Any help will be very much appreciated. Thanks.
Use
library(doParallel)
library(bigmemory)
library(foreach)

m <- matrix(nrow = 10)                      # Create 10 x 1 matrix
x <- as.big.matrix(m)                       # Convert m to a big.matrix
mdesc <- describe(x)                        # Get a description of the matrix

cl <- makeCluster(2, outfile = "Log.txt")   # Create a cluster of two threads with output file "Log.txt"
registerDoParallel(cl)
clusterExport(cl = cl, varlist = ls())      # Export input data to all cores

fileList <- list("Codefile1.R", "Codefile2.R")  # Script files saved in the current working directory

foreach(f = fileList, .packages = "bigmemory") %dopar% {
  sharedMatrix <- attach.big.matrix(mdesc)  # Attach the matrix via shared memory
  source(f, local = TRUE)                   # Source the script files for parallel execution
  NULL
}

parallel::stopCluster(cl)
Basically, you need the option local = TRUE in the source() function.
PS: Also, make sure to stop clusters.
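To see why, consider a hypothetical minimal example (uses_y.R stands for any script that references y): source() with the default local = FALSE evaluates the script in the global environment, whereas the sharedMatrix created inside the foreach body only exists in the body's local environment on the worker.
f <- function() {
  y <- 1
  source("uses_y.R")                # default local = FALSE: runs in globalenv(), where 'y' is not found
  source("uses_y.R", local = TRUE)  # runs in f()'s environment, so 'y' is visible
}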

Parallelization using shared memory [bigmemory]

I'm experiencing some difficulties trying to make a parallel scenario [doSNOW] work that involves the use of shared memory [bigmemory]. The summary is that I get the error "Error in { : task 1 failed - "cannot open the connection"" in some of the foreach workers. More specifically, checking the cluster output log, it is related to "'/temp/x_bigmatrix.desc': Permission denied", as if there were some problem with concurrent access to the big.matrix descriptor file.
Please excuse me: because the code is a bit complex I'm not including a reproducible example, but I'll try to explain the main points of the workflow.
I have a matrix X, which is converted into a big.matrix through:
x_bigmatrix <- as.big.matrix(x_matrix,
                             type = "double",
                             separated = FALSE,
                             backingfile = "x_bigmatrix.bin",
                             descriptorfile = "x_bigmatrix.desc",
                             backingpath = "./temp/")
Then, I initialize the sock cluster with doSNOW [I'm on Windows 10 x64]:
cl <- parallel::makeCluster(N-1, outfile= "output.log")
registerDoSNOW(cl)
(showConnections() properly shows the registered connections)
Now I have to explain that there is a main loop (foreach) over the workers, and then an inner loop in which each worker loops over the rows in X. The main idea is that each worker is fed chunks of data sequentially through the inner loop; each worker may keep some of these observations, but instead of storing the observations themselves, it stores the row indices for later retrieval. To complicate things further, each worker modifies an associated R6 class environment where the indices are stored. I mention this because access to the big.matrix descriptor file takes place in two different places: the main foreach loop and within each R6 environment. The main foreach body is the following:
workersOutput <- foreach(worker = workers, .verbose = TRUE) %dopar% {
  # In the main foreach loop I don't include the .packages argument because
  # I source all the needed libraries at the beginning of the loop
  source(libs)
  # I attach the big.matrix using attach.big.matrix and the descriptor file
  x_bigmatrix <- attach.big.matrix("./temp/x_bigmatrix.desc")
  # Now the inner for loop is going to loop over the rows in X.
  # Each worker can store some of these indices for posterior retrieval
  for (i in seq(1, nrow(X))) {
    # Each R6 object associated with each worker is modified, storing indices...
    # Within these environments, there is read-only access through a getter using the same
    # procedure as above: x_bigmatrix <- attach.big.matrix("./temp/x_bigmatrix.desc")
  }
}
stopCluster(cl)
The problem occurs in the inner loop when trying to access the file-backed big.matrix. If I change the behaviour of these environments to store the observations themselves instead of the row indices (so there is no longer any access to the descriptor file within these objects), then it works without any problem. Also, if I run it without parallelization [registerDoSEQ()] but store the row indices in the objects, there are no errors either. So the problem only occurs when I mix parallelization and double access to the shared big.matrix within the different R6 environments. The weird thing is that some of the workers run longer than others, and in the end at least one finishes its run... which makes me think the problem is concurrent access to the big.matrix descriptor file.
Am I failing at some basics here?
If the problem is concurrent access to the big.matrix descriptor file, you can just pass the descriptor object (obtained with describe) rather than the descriptor file that contains it.
Explanation:
When attaching from the descriptor file, attach.big.matrix first creates the big.matrix.descriptor object and then attaches the big.matrix from this object. So, if you use the object directly, it will be copied to all your workers and you can attach the big.matrix from it.
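A minimal sketch of that change, reusing the names from the question (assumes x_bigmatrix is the file-backed big.matrix created earlier):
mdesc <- describe(x_bigmatrix)                # descriptor object, created once on the master

workersOutput <- foreach(worker = workers, .packages = "bigmemory") %dopar% {
  x_bigmatrix <- attach.big.matrix(mdesc)     # attach from the copied descriptor object, not the .desc file
  # ... inner loop over the rows, storing indices as before ...
}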

How can I run multiple independent and unrelated functions in parallel without larger code do-over?

I've been searching around the internet, trying to understand parallel processing.
Everything I find seems to assume that I have some kind of loop operating on, e.g., every Nth row of a data set divided among N cores and combined afterwards, and I keep getting pointed towards parallelized apply() functions.
(Warning, ugly code below)
My situation, though, is that what I have is of the form
tempJob <- myFunction(filepath, string.arg1, string.arg2)
where the path is a file location, and the string arguments are various ways of sorting my data.
My current workflow is simply amassing a lot of
tempjob1 <- myFunction(args)
tempjob2 <- myFunction(other args)
...
tempjobN <- myFunction(some other args here)
# Make a list of all temporary outputs in the global environment
temp.list <- lapply(ls(pattern = "temp"), get)
# Stack them all
df <- rbindlist(temp.list)
# Remove all variables from workspace matching "temp"
rm(list=ls(pattern="temp"))
These jobs are entirely independent, and could in principle be run in 8 separate instances of R (although that would be a bother to manage, I guess). How can I farm the first 8 jobs out to 8 cores, so that whenever a core finishes its job and returns the processed dataset to the global environment, it simply takes whichever job is next in line?
With the future package (I'm the author) you can achieve what you want with a minor modification to your code - use "future" assignments %<-% instead of regular assignments <- for the code you want to run asynchronously.
library("future")
plan(multisession)
tempjob1 %<-% myFunction(args)
tempjob2 %<-% myFunction(other args)
...
tempjobN %<-% myFunction(some other args here)
temp.list <- lapply(ls(pattern = "temp"), get)
EDIT 2022-01-04: plan(multiprocess) -> plan(multisession) since multiprocess is deprecated and will eventually be removed.
Unless you are unfortunate enough to be using Windows, you could maybe try with GNU Parallel like this:
parallel Rscript ::: script1.R script2.R JOB86*.R
and that would keep 8 scripts running at a time, if your CPU has 8 cores. You can change it with -j 4 if you just want 4 at a time. The JOB86 part is just random - I made it up.
You can also add switches for a progress bar, for how to handle errors, for adding parameters and distributing jobs across multiple machines.
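For example, a sketch combining a few of those switches (flag names as in GNU Parallel's documentation; adjust to your setup):
parallel --bar --joblog jobs.log --halt now,fail=1 -j 4 Rscript ::: script1.R script2.R JOB86*.R
Here --bar draws a progress bar, --joblog records each job's runtime and exit status, --halt now,fail=1 stops everything as soon as one job fails, and -j 4 caps concurrency at 4 jobs.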
If you are on a Mac, you can install GNU Parallel with homebrew:
brew install parallel
I think the easiest way is to use one of the parallelized apply functions. Those will do all the fiddly work of separating out the jobs, taking whichever job is next in line, etc.
Put all your arguments into a list:
args <- list(
list(filePath1, stringArgs11, stringArgs21),
list(filePath2, stringArgs12, stringArgs22),
...
list(filePath8, stringArgs18, stringArgs28)
)
Then do something like
library(parallel)
cl <- makeCluster(detectCores())
df <- parSapply(cl, args, myFunction)
I'm not sure about parSapply, and I can't check as R isn't working on my machine just now. If that doesn't work, use parLapply and then manipulate the result.
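For reference, a sketch of that parLapply variant (assuming myFunction(filepath, string.arg1, string.arg2) as above; each element of args is the list of its three arguments, unpacked with do.call()):
library(parallel)
library(data.table)                      # for rbindlist()

cl <- makeCluster(detectCores())
clusterExport(cl, "myFunction")          # make myFunction visible on the workers
temp.list <- parLapply(cl, args, function(a) do.call(myFunction, a))
stopCluster(cl)

df <- rbindlist(temp.list)               # stack the results, as in the original workflow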
