Parallelization using shared memory [bigmemory] - r

I'm experiencing some difficulties when trying to make a parallel scenario [doSNOW] work, which involves the use of shared memory [bigmemory]. The summary is that I get the following error in some of the foreach workers: "Error in { : task 1 failed - "cannot open the connection"". More specifically, the cluster output log shows "'/temp/x_bigmatrix.desc': Permission denied", as if there were some problem with concurrent access to the big.matrix descriptor file.
Please excuse me for not including a reproducible example (the code is a bit complex), but I'll try to explain the workflow and its main points.
I have a matrix X, which is converted into a big.matrix through:
x_bigmatrix <- as.big.matrix(x_matrix,
                             type = "double",
                             separated = FALSE,
                             backingfile = "x_bigmatrix.bin",
                             descriptorfile = "x_bigmatrix.desc",
                             backingpath = "./temp/")
Then, I initialize the SOCK cluster with doSNOW [I'm on Windows 10 x64]:
cl <- parallel::makeCluster(N-1, outfile= "output.log")
registerDoSNOW(cl)
(showConnections() properly shows the registered connections)
Now I have to explain the structure: there is a main loop (foreach) over the workers and, inside it, an inner loop where each worker iterates over the rows of X. The main idea is that each worker is fed chunks of data sequentially through the inner loop and may keep some of these observations; but instead of storing the observations themselves, it stores their row indexes for later retrieval. To complicate things even further, each worker modifies an associated R6 class environment where the indexes are stored. I mention this because access to the big.matrix descriptor file takes place in two different places: the main foreach loop and within each R6 environment. The main foreach body is the following:
workersOutput <- foreach(worker = workers, .verbose = TRUE) %dopar% {
  # In the main foreach loop I don't include the .packages argument because
  # I source all the needed libraries at the beginning of the loop
  source(libs)
  # I attach the big.matrix using attach.big.matrix and the descriptor file
  x_bigmatrix <- attach.big.matrix("./temp/x_bigmatrix.desc")
  # Now the inner for loop iterates over the rows in X.
  # Each worker can store some of these indexes for later retrieval
  for (i in seq(1, nrow(X))) {
    # Each R6 object associated with each worker is modified, storing indexes...
    # Within these environments, there is read-only access through a getter using the same
    # procedure as above: x_bigmatrix <- attach.big.matrix("./temp/x_bigmatrix.desc")
  }
}
stopCluster(cl)
The problem occurs in the inner loop when trying to access the file-backed big.matrix. If I change the behaviour in these environments to store the observations explicitly instead of the row indexes (so there is no access to the descriptor file within these objects anymore), then it works without any problem. Also, if I run it without parallelization [registerDoSEQ()] while still storing the row indexes in the objects, there are no errors either. So the problem only appears when I mix parallelization with this double access to the shared big.matrix from the different R6 environments. The weird thing is that some workers run for longer than others, and in the end at least one of them finishes its run... which makes me think the problem is concurrent access to the big.matrix descriptor file.
Am I failing at some basics here?

If the problem is concurrent access to the big.matrix descriptor file, you can simply pass the descriptor object (obtained with describe()) rather than the descriptor file that contains it.
Explanation:
When attaching from the descriptor file, bigmemory first creates the big.matrix.descriptor object and then attaches the big.matrix from that object. So, if you use the object directly, it will be copied to all your cluster workers and you can attach the big.matrix from it on each of them.
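A minimal sketch of that approach (with a small dummy matrix standing in for the real x_matrix, and illustrative names not taken from the original code):
library(bigmemory)
library(doSNOW)
library(foreach)
# Dummy stand-in for the real x_matrix
x_matrix <- matrix(rnorm(1000), nrow = 100)
x_bigmatrix <- as.big.matrix(x_matrix,
                             type = "double",
                             backingfile = "x_bigmatrix.bin",
                             descriptorfile = "x_bigmatrix.desc",
                             backingpath = tempdir())
x_desc <- describe(x_bigmatrix)  # descriptor *object*, not the path to the .desc file
cl <- parallel::makeCluster(3)
registerDoSNOW(cl)
res <- foreach(worker = 1:3, .packages = "bigmemory") %dopar% {
  # x_desc is serialized to each worker, so no worker re-reads the .desc file
  xb <- attach.big.matrix(x_desc)
  sum(xb[worker, ])
}
parallel::stopCluster(cl)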

Related

RStudio: Statement to clear memory [duplicate]

I was hoping to make a global function that would clear my workspace and dump my memory. I called my function "cleaner" and want it to execute the following code:
remove(list = ls())
gc()
I tried making the function in the global environment but when I run it, the console just prints the text of the function. In my function file to be sourced:
cleaner <- function(){
  remove(list = ls())
  gc()
  # I tried adding return(invisible()) to see if that helped but no luck
}
cleaner()
Even when I make the function in the script I want it to run (cutting out any potential errors with sourcing), the storage dump seems to work, but it still doesn't clear the workspace.
Two thoughts about this. First, your code does not delete all objects; to also remove the hidden ones, use
rm(list = ls(all.names = TRUE))
Second, there is also the command gctorture(), which provokes garbage collection on (nearly) every memory allocation (as the help page says). It's intended for R developers to ferret out memory protection bugs:
cleaner <- function(){
  # Turn it on
  gctorture(TRUE)
  # Clear workspace
  rm(list = ls(all.names = TRUE, envir = sys.frame(-1)),
     envir = sys.frame(-1))
  # Turn it off (important or it gets very slow)
  gctorture(FALSE)
}
If this procedure is used within a function, there is the following problem: since the function has its own stack frame, only the objects within that stack frame are deleted; they still exist outside it. Therefore, sys.frame(-1) must be specified so that the next-higher stack frame is considered instead. The variables are then deleted in the function that calls cleaner(), and in cleaner() itself once the function exits.
But this also means that the function only works correctly when called from the top level (if really necessary, you can use sys.frames(), which lists all higher-level stack frames, to build something that avoids this limitation).
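As a quick sanity check of the intended top-level behaviour (a hypothetical fresh session, not from the original answer):
x <- 1:10
y <- "hello"
cleaner()
ls(all.names = TRUE)
# character(0) -- x, y and cleaner itself are gone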

Working with environments in the parallel package in R

I have an R API that makes use of 5 different R files that define different metrics that I use. Each of those files has a number of tasks that I run using the parallel package since they all use the same data, but with different groupings. To avoid having to create and close the clusters in each file, I took out those commands and put them into a cluster.R file. So the structure I have is basically:
cluster.R —
cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, {
  library(dplyr)
  source("helpers.R")
})
.Last <- function() {
  stopCluster(cl)
}
Metric1.R —
metric1.function <- function(x, y, z) {
  # dplyr transformations
}
some_date <- date_from_api_input
tasks <- list(job1 = function() {metric1.function(data, grouping1, some_date)},
              job2 = function() {metric1.function(data, grouping2, some_date)},
              job3 = function() {metric1.function(data, grouping3, some_date)}
)
clusterExport(cl, c('data', 'metric1.function', 'some_date'), envir = environment())
out <- clusterApplyLB(
  cl,
  tasks,
  function(f) f()
)
bind_rows(out)
This API just creates different metrics that then fills a database table that holds them all. So each metric file contains different functions and inputs but output the same columns and groupings.
Metrics 2-5 are all the same, except the custom function is different for each file and defined at the beginning of each file. The problem I'm having is that all metrics are also run in parallel, and I'm having issues working with the environments. What ends up happening is that a job will say that some_date isn't found, or that metric2.function isn't found in metric5.R.
I use plumber to expose R and each time it starts, it sources the cluster.R file, starts up the clusters with their initializations, and listens for any requests that come in.
When running in series it works just fine for testing and everything passes as expected, but in production, when our server runs all the scripts in parallel, the variables and functions I've exported with clusterExport either don't get passed in or get mixed up.
Should I be structuring it in a different fashion or am I using the parallel package incorrectly for my purpose?

Read/write the same object from scripts executed by 2 separate threads in R

Problem Description:
I am trying to build an automated trading system in R connecting to Oanda REST API. My Operating System is Windows 10.
The program has two separate components that loop forever via "while (TRUE)": the "trading engine" and the "tick data streaming engine".
The program is organised in such a way that the two components communicate through a queue object created using the "R6" package.
The "tick data streaming engine" receives tick FX data from the Oanda server.
It then creates a tick event object from the data and pushes the tick event onto the queue using an instance of the queue class created with the "R6" package.
The "trading engine" pops the queue and analyses the event object that comes out.
If it is a tick data event, it runs an analysis to see whether it meets the conditions set by the logic of the trading strategy.
If the popped tick event meets the conditions, the trading engine creates an order event object, which is pushed to the back of the queue using the same instance of the queue class.
To this end, I want to run the "trading engine" on one thread and the "tick data streaming engine" on another thread.
The two separate threads should be able to push to, and pop from, the same event queue instance.
My understanding is that the event queue instance should be a shared object so that the two separate threads can access it.
Question:
My question is: how can I implement a shared object that can be dynamically modified (read/write) by code files running on two separate threads,
or is there any other construct that can help achieve reading from and writing to the same object from two or more threads?
How could I use packages such as "mmap", or any other package, for a shared-memory implementation to achieve my objective?
Attempts:
In order to test the feasibility of the program, this is what I tried:
For simplicity and reproducibility, I created a shared object called "sharedMatrix".
It is a 10 x 1 matrix which will play the role of the event queue instance in my actual Oanda API program.
I used the "bigmemory" R package to transform the initial matrix "m" into a big.matrix object "x" and attached it so that it could be a shared object: "sharedMatrix".
By doing this, I was expecting "sharedMatrix" to be "seen" and modified by each thread running the two separate code files.
#Codefile1
for(i in 1:5){
sharedMatrix[i,1] <- i
}
#Codefile2
for(j in 6:10){
sharedMatrix[j,1] <- j
}
I sourced the two code files using the "foreach" and "doParallel" R packages by executing the following code:
library(doParallel)
library(bigmemory)
library(foreach)
m <- matrix(nrow = 10)  # Create 10 x 1 matrix
x <- as.big.matrix(m)   # convert m to big.matrix
mdesc <- describe(x)    # get a description of the matrix
cl <- makeCluster(2, outfile = "Log.txt")  # Create a cluster of two workers
                                           # with output file "Log.txt"
registerDoParallel(cl)
clusterExport(cl = cl, varlist = ls())  # export input data to all workers
fileList <- list("Codefile1.R", "Codefile2.R")  # a list of script files saved
                                                # in the current working directory
foreach(f = fileList, .packages = "bigmemory") %dopar% {
  sharedMatrix <- attach.big.matrix(mdesc)  # attach the matrix via shared memory
  source(f)  # Source the script files for parallel execution
}
To my surprise this is the console output when the above code is executed:
Error in { : task 1 failed - "object 'sharedMatrix' not found"
After checking the content of sharedMatrix, I was expecting to see something like this:
sharedMatrix[]
1 2 3 4 5 6 7 8 9 10
However this is what I see:
sharedMatrix[]
Error: object 'sharedMatrix' not found
It seems to me that the worker threads do not "see" the shared object "sharedMatrix".
Any help will be very much appreciated. Thanks.
Use
library(doParallel)
library(bigmemory)
library(foreach)
m <- matrix(nrow = 10)  # Create 10 x 1 matrix
x <- as.big.matrix(m)   # convert m to big.matrix
mdesc <- describe(x)    # get a description of the matrix
cl <- makeCluster(2, outfile = "Log.txt")  # Create a cluster of two workers
                                           # with output file "Log.txt"
registerDoParallel(cl)
clusterExport(cl = cl, varlist = ls())  # export input data to all workers
fileList <- list("Codefile1.R", "Codefile2.R")  # a list of script files saved
                                                # in the current working directory
foreach(f = fileList, .packages = "bigmemory") %dopar% {
  sharedMatrix <- attach.big.matrix(mdesc)  # attach the matrix via shared memory
  source(f, local = TRUE)  # Source the script files for parallel execution
  NULL
}
parallel::stopCluster(cl)
parallel::stopCluster(cl)
Basically, you need the option local = TRUE in the source() call. Without it, source() evaluates the script in the worker's global environment, while sharedMatrix is created in the local environment of the foreach expression, so the script cannot find it.
PS: Also, make sure to stop clusters.
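To see what local = TRUE changes in isolation, here is a small self-contained illustration (the temporary script and the variable y are made up for this example):
script <- tempfile(fileext = ".R")
writeLines("print(y)", script)
f <- function() {
  y <- 42                       # exists only in f's local environment
  try(source(script))           # fails: 'y' not found (evaluated in globalenv())
  source(script, local = TRUE)  # prints 42 (evaluated in f's environment)
}
f()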

rxDataStep in RevoScaleR package crashing

I am trying to create a new factor column on an .xdf data set with the rxDataStep function in RevoScaleR:
rxDataStep(nyc_lab1
, nyc_lab1
, transforms = list(RatecodeID_desc = factor(RatecodeID, levels=RatecodeID_Levels, labels=RatecodeID_Labels))
, overwrite=T
)
where nyc_lab1 is a pointer to a .xdf file. I know that the file is fine because I imported it into a data table and successfully created the new factor column there.
However, I get the following error message:
Error in doTryCatch(return(expr), name, parentenv, handler) :
ERROR: The sample data set for the analysis has no variables.
What could be wrong?
First, RevoScaleR has some warts when it comes to replacing data. In particular, overwriting the input file with the output can sometimes cause rxDataStep to fail for unknown reasons.
Even if it works, you probably shouldn't do it anyway. If there is a mistake in your code, you risk destroying your data. Instead, write to a new file each time, and only delete the old file once you've verified you no longer need it.
Second, any object you reference that isn't part of the dataset itself has to be passed in via the transformObjects argument. See ?rxTransform. Basically, the rx* functions are meant to be portable to distributed computing contexts, where the R session that runs the code isn't necessarily the same as your local session. In this scenario, you can't assume that objects in your global environment will exist in the session where the code executes.
Try something like this:
nyc_lab2 <- RxXdfData("nyc_lab2.xdf")
nyc_lab2 <- rxDataStep(nyc_lab1, nyc_lab2,
    transforms = list(
        RatecodeID_desc = factor(RatecodeID, levels = .levs, labels = .labs)
    ),
    transformObjects = list(
        .levs = RatecodeID_Levels,
        .labs = RatecodeID_Labels
    )
)
Or, you could use dplyrXdf which will handle all this file management business for you:
nyc_lab2 <- nyc_lab1 %>% factorise(RatecodeID)

R parallel computing with snowfall - writing to files from separate workers

I am using the snowfall 1.84 package for parallel computing and would like each worker to write data to its own separate file during the computation. Is this possible? If so, how?
I am using the "SOCK" type connection e.g., sfInit( parallel=TRUE, ...,type="SOCK" ) and would like the code to be platform independent (unix/windows).
I know it is possible to use the "slaveOutfile" option in sfInit to define a file to write the log output to. But this is intended for debugging purposes, and all slaves/workers must use the same file. I need each worker to have its OWN output file!
The data I need to write are large data frames, NOT simple diagnostic messages. These data frames need to be output by the slaves and cannot be sent back to the master process.
Does anyone know how I can get this done?
Thanks.
A simple solution is to use sfClusterApply to execute a function that opens a different file on each of the workers, assigning the resulting file object to a global variable so you can write to it in subsequent parallel operations:
library(snowfall)
nworkers <- 3
sfInit(parallel=TRUE, cpus=nworkers, type='SOCK')
# Open a different file on each worker, storing the connection in a
# worker-global variable so later parallel calls can write to it
workerinit <- function(datfile) {
  fobj <<- file(datfile, 'w')
  NULL
}
sfClusterApply(sprintf('worker_%02d.dat', seq_len(nworkers)), workerinit)
# Each task writes to the file owned by whichever worker executes it
work <- function(i) {
  write.csv(data.frame(x = 1:3, i = i), file = fobj)
  i
}
sfLapply(1:10, work)
sfStop()
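One small addition, not part of the original answer: before the final sfStop(), the file connections opened on the workers can be closed explicitly so the output is flushed to disk:
sfClusterEval(close(fobj))  # run on every worker before sfStop()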
