Accessing R objects from a subprocess in the parent process

In the context of teaching R programming, I am trying to run R scripts completely independently, so that I can compare the objects they have generated.
Currently, I do this with R environments:
student_env <- new.env()
solution_env <- new.env()
eval(parse(text = "x <- 4"), envir = student_env)
eval(parse(text = "x <- 5"), envir = solution_env)
student_env$x == solution_env$x
While this provides some encapsulation, it is far from complete. For example, if I execute a library() call in the student environment, the package is attached to the global R session's search path, making it available to code running in the solution environment as well.
To ensure complete separation, I could fire up subprocesses using the subprocess package:
library(subprocess)
rbin <- file.path(R.home("bin"), "R")
student_handle <- spawn_process(rbin, c('--no-save'))
solution_handle <- spawn_process(rbin, c('--no-save'))
process_write(student_handle, "x <- 4\n")
process_write(solution_handle, "x <- 5\n")
However, I'm not sure how to go about the step of fetching the R objects so I can compare them.
My questions:
Is subprocess a good approach?
If yes, how can I (efficiently!) grab the R representations of objects from a subprocess so I can compare the objects in the parent process? Python does this through pickling/dilling.
I could communicate through .rds files, but this is unnecessary file creation/reading.
In R, I came across RProtoBuf, but I'm not sure if it solves my problem.
If no, are there other approaches I should consider? I've looked into opencpu, but the concept of firing up a local server and then using R to talk to that server to get object representations feels like too complex an approach.
Thanks!

Another possible approach is the callr package, which is popular and developed by a credible source: https://github.com/r-lib/callr#readme.
An example from there:
r(function() var(iris[, 1:4]))
#>              Sepal.Length Sepal.Width Petal.Length Petal.Width
#> Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
#> Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
#> Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
#> Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063
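For the grading use case in the question, callr could run each submission in a fresh R session and ship the resulting objects back, since callr serializes the function's return value for you. A minimal sketch, where run_isolated is a hypothetical helper and the student/solution code is assumed to arrive as strings:
library(callr)
run_isolated <- function(code) {
  callr::r(function(code) {
    env <- new.env()
    eval(parse(text = code), envir = env)
    as.list(env)          # return every object the code created
  }, args = list(code = code))
}
student  <- run_isolated("x <- 4")
solution <- run_isolated("x <- 5")
identical(student$x, solution$x)
#> [1] FALSE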

I'd use Rserve, as it lets you run multiple R sessions and control them all from the master R session. You can run commands in those sessions in any given (interwoven) order and access the objects stored there in their native format.
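A rough sketch of that route, assuming Rserve is installed and the RSclient package is used from the master session (exact startup arguments may differ between Rserve versions):
library(RSclient)
# Start two independent Rserve instances first, one per port, e.g. with
# Rserve::Rserve(port = 6311) and Rserve::Rserve(port = 6312)
student  <- RS.connect(port = 6311)
solution <- RS.connect(port = 6312)
RS.eval(student,  x <- 4)
RS.eval(solution, x <- 5)
# Objects come back in native R form, ready to compare in the master session
identical(RS.eval(student, x), RS.eval(solution, x))
#> [1] FALSE
RS.close(student); RS.close(solution)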
subprocess was created to run and control any arbitrary program via its command-line interface, so I never intended to add an object-passing mechanism. That said, if I were to access objects from child processes, I'd do it via saveRDS and readRDS.
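A rough illustration of that file-based route with the subprocess handles from the question; the temporary file and the Sys.sleep() wait are crude stand-ins for proper synchronisation:
tmp <- normalizePath(tempfile(fileext = ".rds"), winslash = "/", mustWork = FALSE)
process_write(student_handle, sprintf("saveRDS(x, '%s')\n", tmp))
Sys.sleep(0.5)        # crude: give the child process time to write the file
student_x <- readRDS(tmp)
identical(student_x, 4)
#> [1] TRUE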

Related

Dynamic library dependencies not recognized when run in parallel under R foreach()

I'm using the Rfast package, which imports the package RcppZiggurat. I'm running R 3.6.3 on a Linux cluster (Red Hat 6.1). The packages are installed on my local directory but R is installed system-wide.
The Rfast functions (e.g. colsums()) work well when I call them directly, but not when I call them in a foreach() loop like the following (EDIT: I added the code to register the cluster, as pointed out by Rui Barradas, but it didn't fix the problem):
library(Rfast)
library(doParallel)
library(foreach)
cores <- detectCores()
cl <- makeCluster(cores)
registerDoParallel(cl)
A <- matrix(rnorm(1e6), 1000, 1000)
cm <- foreach(n = 1:4, .packages = 'Rfast') %dopar% colmeans(A)
stopCluster(cl)
I get this error:
unable to load shared object '/home/users/sutd/R/x86_64-pc-linux-gnu-library/3.6/RcppZiggurat/libs/RcppZiggurat.so':
libgsl.so.0: cannot open shared object file: No such file or directory
Somehow, the dynamic library is recognized when called directly but not when called under foreach().
I know that libgsl.so is located in /usr/lib64/, so I've added the following line at the beginning of my R script
Sys.setenv(LD_LIBRARY_PATH=paste("/usr/lib64/", Sys.getenv("LD_LIBRARY_PATH"), sep = ":"))
But it didn't work.
I have also tried to do dyn.load('/usr/lib64/libgsl.so') but I get the following error:
Error in dyn.load("/usr/lib64/libgsl.so") : unable to load shared object '/usr/lib64/libgsl.so':
/usr/lib64/libgsl.so: undefined symbol: cblas_ctrmv
How do I make the dependencies available in the foreach() parallel loops?
NOTE
In the actual use case I am using the genetic algorithm package GA, and have GA::ga() which handles the foreach() loop, and within the loop I use a function in my own package which calls the Rfast functions. So I'm hoping that there is a solution where I don't have to modify the foreach() call.
The following works with no problems. Unlike the code in the question, it starts by detecting the number of available cores, creating a cluster, and making it available to foreach.
library(Rfast)
library(doParallel)
library(foreach)
cores <- detectCores()
cl <- makeCluster(cores)
registerDoParallel(cl)
set.seed(2020)
A <- matrix(rnorm(1e6), 1000, 1000)
cm <- foreach(n = 1:4,
              .combine = rbind,
              .packages = "Rfast") %dopar% {
  colmeans(A)
}
stopCluster(cl)
str(cm)
#num [1:4, 1:1000] -0.02668 -0.02668 -0.02668 -0.02668 0.00172 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:4] "result.1" "result.2" "result.3" "result.4"
# ..$ : NULL
The foreach package was great for its time. But, now, parallel computations should be done with future, if only for the static code analysis that handles exporting the right objects to the workers. As a result, under the future approach, registering a package with .packages= isn't needed. Moreover, code using future mirrors usual R code, with only a slight change: the output variable is set up as a listenv. For example, we have:
library("future")
library("listenv")
library("Rfast")
plan(tweak(multiprocess, workers = 2L))
# For all cores, directly use:
# plan(multiprocess)
# Generate matrix once
A <- matrix(rnorm(1e6), 1000, 1000)
# Setup output
x <- listenv()
# Iterate 4 times
for (i in 1:4) {
  # On each core, compute the colmeans()
  x[[i]] %<-% {
    colmeans(A)
    # For better control over function applies, use a namespace call
    # e.g. Rfast::colmeans(A)
  }
}
# Switch from listenv to list
output <- as.list(x)
Thanks to the answers by @RuiBarradas and @coatless, I realized that the problem is not with foreach(), because (1) the problem also occurred when I ran the code with future, and (2) it occurred with the foreach() code even with the incorrect call, when I didn't register the cluster. When no cluster is registered, foreach() throws a warning and runs in sequential mode instead; but that didn't happen.
Therefore, I realized that the problem must occur even before the foreach() call. In the log, the error appeared right after the message about loading the package RcppZiggurat, so something must go wrong when that package is loaded.
I then checked the dependencies of RcppZiggurat and found that it depends on another package called RcppGSL, which interfaces R with the GSL library. Bingo: that's why libgsl.so.0 is needed when RcppZiggurat is loaded.
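As a side note, the dependency chain can be inspected directly from R; a quick sketch (the second call needs access to a package repository):
# Direct imports declared by the installed package
packageDescription("RcppZiggurat")$Imports
# Full recursive dependency set, resolved against a repository
tools::package_dependencies("RcppZiggurat", db = available.packages(),
                            recursive = TRUE)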
So I made an R script named test-gsl.R, which has the following two lines.
library(RcppZiggurat)
print("OK")
Now, I run the following on the head node
$ module load R/3.6.3
$ Rscript test-gsl.R
And everything works fine; "OK" is printed.
But this doesn’t work if I submit the job on the compute node. First, the PBS script, called test.sh, is as follows
### Resources request
#PBS -l select=1:ncpus=1:mem=1GB
### Walltime
#PBS -l walltime=00:01:00
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
### Run R
module load R/3.6.3
Rscript test-gsl.R
Then I ran
qsub test.sh
And the error popped up. This means that there is something different between the compute node and the head node on my system, and that it has nothing to do with the packages. I contacted the system administrator, who explained to me that the GSL library is available on the head node at the default path, but not on the compute node. So in my shell script I need to add module load gsl/2.1 before running my R script. I tested that and everything worked.
The solution seems simple enough, but I know too little about Linux administration to have realized it myself. Only after asking around and trying (rather blindly) many things did I finally arrive at this solution. So thanks to those who offered help, and mea culpa for not being able to describe the problem accurately at the beginning.

How can I evaluate a C function from a dynamic library in an R package?

I’m trying to implement parallel computing in an R package that calls C from R with the .C function. It seems that the nodes of the cluster can’t access the dynamic library. I have made a parallel socket cluster, like this:
cl <- makeCluster(2)
I would like to evaluate a C function called valgrad from my R package on each of the nodes in my cluster using clusterEvalQ, from the R package parallel. However, my code is producing an error. I compile my package, but when I run
out <- clusterEvalQ(cl, cresults <- .C(C_valgrad, …))
where … represents the arguments in the C function valgrad. I get this error:
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
2 nodes produced errors; first error: object 'C_valgrad' not found
I suspect there is a problem with clusterEvalQ’s ability to access the dynamic library. I attempted to fix this problem by loading the glmm package into the cluster using
clusterEvalQ(cl, library(glmm))
but that did not fix the problem.
I can evaluate valgrad on each of the clusters using the foreach function from the foreach R package, like this:
out <- foreach(1:no_cores) %dopar% {.C(C_valgrad, …)}
no_cores is the number of nodes in my cluster. However, this function doesn’t allow any of the results of the evaluation of valgrad to be accessed in any subsequent calculation on the cluster.
How can I either
(1) make the results of the evaluation of valgrad accessible for later calculations on the cluster or
(2) use clusterEvalQ to evaluate valgrad?
You have to load the external library. But this is not done with library calls, it's done with dyn.load.
The following two functions are useful if you work with more than one operating system; they use the built-in variable .Platform$dynlib.ext.
Note also the unload function. You will need it if you develop a library of C functions: if you change a C function, the dynamic library has to be unloaded and then reloaded (as the new version) before testing it.
See Writing R Extensions (file R-exts.pdf in the doc folder, or on CRAN), section 5.
dynLoad <- function(dynlib) {
  dynlib <- paste(dynlib, .Platform$dynlib.ext, sep = "")
  dyn.load(dynlib)
}

dynUnload <- function(dynlib) {
  dynlib <- paste(dynlib, .Platform$dynlib.ext, sep = "")
  dyn.unload(dynlib)
}
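As a hedged sketch of how this applies to the cluster in the question: the package name glmm and the routine valgrad come from the question, the exact path to the shared object depends on the installation, and on some platforms libs/ contains an architecture subfolder.
library(parallel)
cl <- makeCluster(2)
clusterEvalQ(cl, {
  lib <- file.path(find.package("glmm"), "libs",
                   paste0("glmm", .Platform$dynlib.ext))
  dyn.load(lib)   # load the package's shared object on each worker
})
# After loading, the routine can be called by name on each node, e.g.
# clusterEvalQ(cl, .C("valgrad", ...)), or looked up with
# getNativeSymbolInfo("valgrad", PACKAGE = "glmm")
stopCluster(cl)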

Is an rJava object exportable with future (a package for asynchronous computing in R)?

I'm trying to speed up my R code with the future package, using a multicore plan on Linux. In the future definition I'm creating a Java object and trying to pass it to .jcall(), but I'm getting a null value for the Java object inside the future. Could anyone please help me resolve this? Below is sample code:
library("future")
plan(multicore)
library(rJava)
.jinit()
# preprocess is a user-defined function
preprocess <- function(a) {
  # some preprocessing task here
  # time-consuming statistical analysis here
  return(lreturn)  # return a list of 3 components
}
my_value <- preprocess(a = value)
obj <- .jnew("java.custom.class")
f <- future({
  .jcall(obj, "V", "CustomJavaMethod", my_value)
})
Basically I'm dealing with large streaming data. In the code above I'm sending a string of streaming data to a user-defined function for statistical analysis, which returns a list of 3 components. I then want to send this list to a custom Java class [ java.custom.class ] for further processing, using the custom Java method [ CustomJavaMethod ].
Without future my code runs fine, but I'm receiving 12 streaming records per minute and the code has become slow; I've observed delays in processing.
Currently I'm using Unix with 16 cores. With the future package my processing finishes fast, but I have traced the problem back to .jcall: something goes wrong there.
Hope this clarifies my pain.
(Author of the future package here:)
Unfortunately, there are certain types of objects in R that cannot be sent to another R process for further processing. To clarify, this is a limitation of those types of objects, not of the parallel framework used (here the future framework). The simplest example of such an object is a file connection, e.g. con <- file("my-local-file.txt", open = "wb"). I've documented some examples in Section 'Non-exportable objects' of the 'Common Issues with Solutions' vignette (https://cran.r-project.org/web/packages/future/vignettes/future-4-issues.html).
As mentioned in the vignette, you can set an option (*) such that the future framework looks for these types of objects and gives an informative error before attempting to launch the future ("early stopping"). Here is your example with this check activated:
library("future")
plan(multisession)
## Assert that global objects can be sent back and forth between
## the main R process and background R processes ("workers")
options(future.globals.onReference = "error")
library("rJava")
.jinit()
end <- .jnew("java/lang/String", " World!")
f <- future({
  start <- .jnew("java/lang/String", "Hello")
  .jcall(start, "Ljava/lang/String;", "concat", end)
})
# Error in FALSE :
# Detected a non-exportable reference ('externalptr') in one of the
# globals ('end' of class 'jobjRef') used in the future expression
So, yes, your example actually works when using plan(multicore). The reason is that 'multicore' uses forked processes (available on Unix and macOS but not on Windows). However, I would try my best not to limit your software to parallelizing only on "forkable" systems; if you can find an alternative approach, I would aim for that. That way your code will also work on, say, a huge cloud cluster.
(*) The reason these checks are not enabled by default is that (a) they are still being beta tested, and (b) they add overhead, because we basically need to scan all the globals for non-supported objects. Whether these checks will be enabled by default in the future will be discussed over at https://github.com/HenrikBengtsson/future.
The code in the question calls an unknown custom Java method, my_value is undefined, and so on; it's hard to know what you are really trying to achieve.
Take a look at the following example, maybe you can get inspiration from it:
library(future)
plan(multicore)
library(rJava)
.jinit()
end <- .jnew("java/lang/String", " World!")
f <- future({
  start <- .jnew("java/lang/String", "Hello")
  .jcall(start, "Ljava/lang/String;", "concat", end)
})
value(f)
[1] "Hello World!"

Read/write the same object from scripts executed by 2 separate threads in R

Problem Description:
I am trying to build an automated trading system in R connecting to Oanda REST API. My Operating System is Windows 10.
The program has two separate components that loop indefinitely via while (TRUE): the "trading engine" and the "tick data streaming engine".
The program is organised in such a way that the two components communicate through a queue object created using "R6" package.
The "tick data streaming engine" receives tick FX data from Oanda server.
It then creates a tick event object from the data and "push" the tick event to the queue using an instance of the queue class created using "R6" package.
The "trading engine" "pop" the queue object and analyses the event object that comes out.
If it is a Tick data event, it makes an analysis to see whether it meets the conditions set by the logic of the trading strategy.
If the popped tick event meets the conditions, the trading engine creates an order event object
which is "pushed" to the back of the queue using the same instance of the queue class created using "R6" package.
To this end, I want to run the "trading engine" using one thread and run the "tick data streaming engine" using another thread.
The two separate threads should be able to push to, and pop from the same event queue instance.
My understanding is that the event queue instance object should be a shared object for the two separate threads to have access to it.
Question:
My question is: how can I implement a shared object that can be dynamically modified (read/written) by code files running on two separate threads,
or is there any other construct that allows reading from and writing to the same object from two or more threads?
Could I use other packages, such as "mmap", for a shared-memory implementation, or any other package, to achieve my objective?
Attempts:
In order to test the feasibility of the program, this is what I tried:
For simplicity and reproducibility, I created a shared object called "sharedMatrix".
It is a 10 x 1 matrix which will play the role of the event queue instance in my actual Oanda API program.
I used the "bigmemory" R package to transform the initial matrix "m" into a big.matrix object "x" and attached it so that it could be a shared object: "sharedMatrix".
By doing this, I was expecting "sharedMatrix" to be "seen" and modified by each thread running the two separate code files.
# Codefile1.R
for (i in 1:5) {
  sharedMatrix[i, 1] <- i
}

# Codefile2.R
for (j in 6:10) {
  sharedMatrix[j, 1] <- j
}
I sourced the two code files using the "foreach" and "doParallel" R packages by executing the following code:
library(doParallel)
library(bigmemory)
library(foreach)
m <- matrix(nrow = 10)                    # Create 10 x 1 matrix
x <- as.big.matrix(m)                     # Convert m to a big.matrix
mdesc <- describe(x)                      # Get a description of the matrix
cl <- makeCluster(2, outfile = "Log.txt") # Create a cluster of two threads
                                          # with output file "Log.txt"
registerDoParallel(cl)
clusterExport(cl = cl, varlist = ls())    # Export input data to all cores
fileList <- list("Codefile1.R", "Codefile2.R") # A list of script files saved
                                               # in the current working directory
foreach(f = fileList, .packages = "bigmemory") %dopar% {
  sharedMatrix <- attach.big.matrix(mdesc) # Attach the matrix via shared memory
  source(f)                                # Source the script files for parallel execution
}
To my surprise this is the console output when the above code is executed:
Error in { : task 1 failed - "object 'sharedMatrix' not found"
After checking the content of sharedMatrix, I was expecting to see something like this:
sharedMatrix[]
1 2 3 4 5 6 7 8 9 10
However this is what I see:
sharedMatrix[]
Error: object 'sharedMatrix' not found
It seems to me that the worker threads do not "see" the shared object "sharedMatrix".
Any help will be very much appreciated. Thanks.
Use
library(doParallel)
library(bigmemory)
library(foreach)
m <- matrix(nrow = 10)                    # Create 10 x 1 matrix
x <- as.big.matrix(m)                     # Convert m to a big.matrix
mdesc <- describe(x)                      # Get a description of the matrix
cl <- makeCluster(2, outfile = "Log.txt") # Create a cluster of two threads
                                          # with output file "Log.txt"
registerDoParallel(cl)
clusterExport(cl = cl, varlist = ls())    # Export input data to all cores
fileList <- list("Codefile1.R", "Codefile2.R") # A list of script files saved
                                               # in the current working directory
foreach(f = fileList, .packages = "bigmemory") %dopar% {
  sharedMatrix <- attach.big.matrix(mdesc) # Attach the matrix via shared memory
  source(f, local = TRUE)                  # Source the script files for parallel execution
  NULL
}
parallel::stopCluster(cl)
parallel::stopCluster(cl)
Basically, you need the option local = TRUE in the source() function.
PS: Also, make sure to stop clusters.
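If the shared-memory writes succeed, reading the big.matrix back in the master session should now show both workers' updates, along the lines of:
x[, 1]
#>  [1]  1  2  3  4  5  6  7  8  9 10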

How can I run multiple independent and unrelated functions in parallel without larger code do-over?

I've been searching around the internet, trying to understand parallel processing.
What they all seem to assume is that I have some kind of loop function operating on e.g. every Nth row of a data set divided among N cores and combined afterwards, and I'm pointed towards a lot of parallelized apply() functions.
(Warning, ugly code below)
My situation, though, is that what I have is of the form
tempJob <- myFunction(filepath, string.arg1, string.arg2)
where the path is a file location, and the string arguments are various ways of sorting my data.
My current workflow is simply amassing a lot of
tempjob1 <- myFunction(args)
tempjob2 <- myFunction(other args)
...
tempjobN <- myFunction(some other args here)
# Make a list of all temporary outputs in the global environment
temp.list <- lapply(ls(pattern = "temp"), get)
# Stack them all
df <- rbindlist(temp.list)
# Remove all variables from workspace matching "temp"
rm(list=ls(pattern="temp"))
These jobs are entirely independent, and could in principle be run in 8 separate instances of R (although that would be a bother to manage, I guess). How can I farm the first 8 jobs out to 8 cores, so that whenever a core finishes its job and returns the processed dataset to the global environment, it simply takes whichever job is next in line?
With the future package (I'm the author) you can achieve what you want with a minor modification to your code - use "future" assignments %<-% instead of regular assignments <- for the code you want to run asynchronously.
library("future")
plan(multisession)
tempjob1 %<-% myFunction(args)
tempjob2 %<-% myFunction(other args)
...
tempjobN %<-% myFunction(some other args here)
temp.list <- lapply(ls(pattern = "temp"), get)
EDIT 2022-01-04: plan(multiprocess) -> plan(multisession) since multiprocess is deprecated and will eventually be removed.
Unless you are unfortunate enough to be using Windows, you could maybe try with GNU Parallel like this:
parallel Rscript ::: script1.R script2.R JOB86*.R
and that would keep 8 scripts running at a time, if your CPU has 8 cores. You can change it with -j 4 if you just want 4 at a time. The JOB86 part is just random - I made it up.
You can also add switches for a progress bar, for how to handle errors, for adding parameters and distributing jobs across multiple machines.
If you are on a Mac, you can install GNU Parallel with homebrew:
brew install parallel
I think the easiest way is to use one of the parallelized apply functions. Those will do all the fiddly work of separating out the jobs, taking whichever job is next in line, etc.
Put all your arguments into a list:
args <- list(
  list(filePath1, stringArgs11, stringArgs21),
  list(filePath2, stringArgs12, stringArgs22),
  ...
  list(filePath8, stringArgs18, stringArgs28)
)
Then do something like
library(parallel)
cl <- makeCluster(detectCores())
df <- parSapply(cl, args, myFunction)
I'm not sure about parSapply, and I can't check as R isn't working on my machine just now. If that doesn't work, use parLapply and then manipulate the result.
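In case passing each sub-list straight to myFunction doesn't spread the three arguments correctly, a hedged variant is to wrap the call in do.call(); myFunction and args are the placeholders from above, and rbindlist assumes data.table is installed.
library(parallel)
cl <- makeCluster(detectCores())
clusterExport(cl, "myFunction")   # make the function available on the workers
# Spread the elements of each sub-list into myFunction's arguments
res.list <- parLapply(cl, args, function(a) do.call(myFunction, a))
stopCluster(cl)
df <- data.table::rbindlist(res.list)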
