How to fork processes in R

I'm trying to understand the forking system implemented by R's multicore package. The package example is:
p <- fork()
if (inherits(p, "masterProcess")) {
  cat("I'm a child! ", Sys.getpid(), "\n")
  exit(,"I was a child")
}
cat("I'm the master\n")
unserialize(readChildren(1.5))
but it doesn't seem to work when pasted into the R interactive console. Does anyone have an example of using fork() with R's multicore or parallel packages?

The fork example in the multicore package 'works for me'; try example(fork). fork is only supported on non-Windows systems.
I think the equivalent functions in parallel are mcparallel() to fork and then evaluate an expression, and mccollect() to retrieve the result when done. So
id = mcparallel({ Sys.sleep(5); TRUE })
returns immediately but the process is running, and
mccollect(id)
will return TRUE after 5 seconds. There is no communication other than the collection between the forked process and the master; it would be interesting and not too challenging to implement two-way communication using, e.g., sockets.
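For instance, here is a minimal sketch of such two-way socket communication on a Unix-alike where mcparallel() can fork; the port number (11001) is an arbitrary assumption, and the one-second sleep simply gives the master time to start listening:
library(parallel)

port <- 11001  # arbitrary free TCP port; change if it is already in use

job <- mcparallel({
  Sys.sleep(1)  # give the master a moment to open the listening socket
  con <- socketConnection("localhost", port = port, blocking = TRUE, open = "r+")
  msg <- readLines(con, n = 1)   # receive a line from the master
  writeLines(toupper(msg), con)  # send a reply back
  close(con)
  TRUE
})

## Master: wait for the child to connect, send a message, read the reply
srv <- socketConnection(port = port, server = TRUE, blocking = TRUE, open = "r+")
writeLines("hello from the master", srv)
cat("child replied:", readLines(srv, n = 1), "\n")
close(srv)

mccollect(job)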

Related

'mc.cores' > 1 is not supported on Windows

I am new to programming and I am trying to use parallel processing in R on Windows, starting from existing code.
The following is a snippet of my code:
if (length(grep("linux", R.version$os)) == 1) {
  num_cores = detectCores()
  impact_list <- mclapply(len_a, impact_func, mc.cores = (num_cores - 1))
}
# else if (length(grep("mingw32", R.version$os)) == 1) {
#   num_cores = detectCores()
#   impact_list <- mclapply(len_a, impact_func, mc.cores = (num_cores - 1))
#
# }
else {
  impact_list <- lapply(len_a, impact_func)
}
return(sum(unlist(impact_list, use.names = F)))
This works fine: I am using R on Windows, so the code enters the 'else' branch and runs via lapply() rather than in parallel.
I added the 'else if' branch to make it work on Windows, but when I un-comment that block and run it, I get the error "'mc.cores' > 1 is not supported on Windows".
Please suggest how I can use parallel processing on Windows, so that the code takes less time to run.
Any help will be appreciated.
(Disclaimer: I'm the author of the future framework.)
The future.apply package provides parallel versions of R's built-in "apply" functions. It's cross-platform, i.e. it works on Linux, macOS, and Windows. It often allows you to simply replace an existing lapply() with a future_lapply() call, e.g.
library(future.apply)
plan(multisession)

your_fcn <- function(len_a) {
  impact_list <- future_lapply(len_a, impact_func)
  sum(unlist(impact_list, use.names = FALSE))
}
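For example, with a hypothetical stand-in for impact_func (your real function goes here), the call works just as before:
impact_func <- function(x) x^2  # hypothetical stand-in for the real function
your_fcn(1:10)
# [1] 385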
Regarding mclapply() per se: if you use parallel::mclapply() in your code, make sure there is always an option not to use it. The reason is that it is not guaranteed to work in all environments; that is, it might be unstable and crash R. In the R-devel thread 'mclapply returns NULLs on MacOS when running GAM' (https://stat.ethz.ch/pipermail/r-devel/2020-April/079384.html), the author of mclapply() wrote on 2020-04-28:
Do NOT use mcparallel() in packages except as a non-default option that user can set for the reasons Henrik explained. Multicore is intended for HPC applications that need to use many cores for computing-heavy jobs, but it does not play well with RStudio and more importantly you don't know the resource available so only the user can tell you when it's safe to use. Multi-core machines are often shared so using all detected cores is a very bad idea. The user should be able to explicitly enable it, but it should not be enabled by default.
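Following that advice, here is a minimal sketch of such an opt-in pattern; the option name "mypkg.cores" and the sequential default are assumptions for illustration, not anything prescribed by the parallel package:
run_impact <- function(len_a, impact_func) {
  cores <- getOption("mypkg.cores", 1L)  # defaults to sequential processing
  if (cores > 1L && .Platform$OS.type != "windows") {
    impact_list <- parallel::mclapply(len_a, impact_func, mc.cores = cores)
  } else {
    impact_list <- lapply(len_a, impact_func)
  }
  sum(unlist(impact_list, use.names = FALSE))
}

## The user opts in explicitly, e.g.:
# options(mypkg.cores = 4)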

Is an rJava object exportable in future (package for asynchronous computing in R)?

I'm trying to speed up my R code with the future package, using the multicore plan on Linux. In the future definition I'm creating a Java object and trying to pass it to .jcall(), but I'm getting a null value for the Java object in the future. Could anyone please help me resolve this? Below is sample code:
library("future")
plan(multicore)
library(rJava)
.jinit()
# preprocess is a user defined function
my_value <- preprocess(a = value){
# some preprocessing task here
# time consuming statistical analysis here
return(lreturn) # return a list of 3 components
}
obj=.jnew("java.custom.class")
f <- future({
.jcall(obj, "V", "CustomJavaMethod", my_value)
})
Basically I'm dealing with large streaming data. In the above code I'm sending the streaming-data string to a user-defined function for statistical analysis, which returns a list of 3 components. I then want to send this list to a custom Java class [java.custom.class] for further processing using a custom Java method [CustomJavaMethod].
Without future my code runs fine, but I receive 12 streaming records per minute and the code falls behind, with a noticeable delay in processing.
Currently I'm using Unix with 16 cores. With the future package the processing is fast, but I have traced the problem back to .jcall(), where something goes wrong.
Hope this clarifies my pain.
(Author of the future package here:)
Unfortunately, there are certain types of objects in R that cannot be sent to another R process for further processing. To clarify, this is a limitation of those types of objects, not of the parallel framework used (here the future framework). The simplest example of such an object may be a file connection, e.g. con <- file("my-local-file.txt", open = "wb"). I've documented some examples in Section 'Non-exportable objects' of the 'Common Issues with Solutions' vignette (https://cran.r-project.org/web/packages/future/vignettes/future-4-issues.html).
As mentioned in the vignette, you can set an option (*) such that the future framework looks for these types of objects and gives an informative error before attempting to launch the future ("early stopping"). Here is your example with this check activated:
library("future")
plan(multisession)
## Assert that global objects can be sent back and forth between
## the main R process and background R processes ("workers")
options(future.globals.onReference = "error")
library("rJava")
.jinit()
end <- .jnew("java/lang/String", " World!")
f <- future({
start <- .jnew("java/lang/String", "Hello")
.jcall(start, "Ljava/lang/String;", "concat", end)
})
# Error in FALSE :
# Detected a non-exportable reference ('externalptr') in one of the
# globals ('end' of class 'jobjRef') used in the future expression
So, yes, your example actually works when using plan(multicore). The reason is that 'multicore' uses forked processes (available on Unix and macOS, but not on Windows). However, I would try my best not to limit your software to parallelizing only on "forkable" systems; if you can find an alternative approach I would aim for that. That way your code will also work on, say, a huge cloud cluster.
(*) The reason these checks are not enabled by default is that (a) they are still in beta testing, and (b) they come with overhead, because we basically need to scan for non-supported objects among all the globals. Whether these checks will be enabled by default in the future or not will be discussed over at https://github.com/HenrikBengtsson/future.
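For instance, one way to avoid exporting the 'jobjRef' at all is to create the Java objects inside the future expression, so nothing non-exportable has to cross the process boundary. A rough sketch, which should then also work with plan(multisession), i.e. on non-forkable systems:
library("future")
plan(multisession)

f <- future({
  library("rJava")
  .jinit()
  start <- .jnew("java/lang/String", "Hello")
  end   <- .jnew("java/lang/String", " World!")
  .jcall(start, "Ljava/lang/String;", "concat", end)
})
value(f)
# [1] "Hello World!"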
The code in the question calls an unknown Method1 method, my_value is undefined, ... it's hard to know what you are really trying to achieve.
Take a look at the following example, maybe you can get inspiration from it:
library(future)
plan(multicore)
library(rJava)
.jinit()
end = .jnew("java/lang/String", " World!")
f <- future({
  start = .jnew("java/lang/String", "Hello")
  .jcall(start, "Ljava/lang/String;", "concat", end)
})
value(f)
# [1] "Hello World!"

Is there a way to combine Rmpi & mclapply?

I have some R code that applies a function to a list of objects. The function is simple but involves a bootstrapping calculation, which can be easily sped up using mclapply. When run on a single node, everything is fine.
However, I have a cluster and what I've been trying to do is to distribute the application of the function to the list of objects across multiple nodes. To do this I've been using Rmpi (0.6-6).
The code below runs fine
library(Rmpi)
cl <- parallel::makeCluster(10, type = 'MPI')
parallel::clusterExport(cl, varlist = c('as.matrix'), envir = environment())
descriptor <- parallel::parLapply(1:5, function(am) {
  val <- mean(unlist(lapply(1:120, function(x) mean(rnorm(1e7)))))
  return(c(val, Rmpi::mpi.universe.size()))
}, cl = cl)
print(do.call(rbind, descriptor))
snow::stopCluster(cl)
However, if I convert the lapply to mclapply and set mc.cores=10, MPI warns that forking will lead to bad things, and the job hangs.
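For concreteness, the modified call looks roughly like this (a sketch reconstructed from the description above, with the inner lapply() swapped for mclapply()):
descriptor <- parallel::parLapply(1:5, function(am) {
  val <- mean(unlist(parallel::mclapply(1:120,
                                        function(x) mean(rnorm(1e7)),
                                        mc.cores = 10)))
  return(c(val, Rmpi::mpi.universe.size()))
}, cl = cl)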
(In all cases jobs are being submitted via SLURM)
Based on the MPI warning, it seems that I should not be using mclapply within Rmpi jobs. Is this a correct assessment?
If so, does anybody have suggestions on how I can parallelize the function that is being run on each node?

Parallel reads from an Athena (AWS) database via R

I've got a largish dataset on an Athena database on AWS. I'd like to read from it in parallel, and I'm accustomed to the foreach package's approach to forking from within R.
I'm using RJDBC
Here's what I am trying:
out <- foreach(i = 1:length(fipsvec), .combine = rbind, .errorhandling = "remove") %dopar% {
  coni <- dbConnect(driver, "jdbc:awsathena://<<location>>/",
                    s3_staging_dir = "my_directory",
                    user = "...",
                    password = "...")
  print(paste0("starting ", i))
  sqlstring <- paste0("SELECT ",
                      "My_query_body",
                      fipsvec[i])
  row <- fetch(dbSendQuery(coni, sqlstring), -1, block = 999)
  print(i)
  dbDisconnect(coni)
  rm(coni)
  gc()
  return(row)
}
(Sorry I can't make this reproducible -- I obviously can't hand out the keys to the DB online.)
When I run this, the first c (= number of cores) iterations run fine, but then it hangs and does nothing, indefinitely as far as I can tell. htop shows no activity on any of the cores. When I change the loop to iterate over only c entries, the output is what I expect, and when I change from parallel to serial (%do% instead of %dopar%), it also works fine.
Does this have something to do with the connection not being closed properly, or somehow being defined redundantly? I've placed the connection within the parallel loop, so each core should have its own connection in its own environment. But I don't know enough about databases to tell whether this is sufficiently distinct.
I'd appreciate answers that help me understand what's going on under the hood here -- it's all voodoo to me at this point.
Are you passing the RJDBC package (and its dependencies: methods, DBI, and rJava) into the cluster anywhere?
If not, the first line of your code should look something like this:
results <- foreach(i = 1:length(fipsvec),
                   .combine = rbind,
                   .errorhandling = "remove",
                   .packages = c('methods', 'DBI', 'rJava', 'RJDBC')) %dopar% {
One thing that I suspect (but don't know) might make things a little hairier is that RJDBC uses a JVM to execute the queries. I'm not super knowledgeable about how rJava handles JVM initialization, whether each of the workers may be trying to re-use the same JVM simultaneously, or whether they have enough information about the external environment to properly initialize one in the first place.
Another troubleshooting step if the above doesn't work might be to move the assignment step for driver into the %dopar% environment.
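A rough sketch of that suggestion is below; the driver class name and JAR path are placeholders, since they depend on which Athena JDBC driver you have installed:
out <- foreach(i = 1:length(fipsvec),
               .combine = rbind,
               .errorhandling = "remove",
               .packages = c('methods', 'DBI', 'rJava', 'RJDBC')) %dopar% {
  ## Initialize the JDBC driver inside each worker; the class name and JAR
  ## path below are placeholders for whichever Athena driver you installed
  driver <- JDBC("com.example.athena.jdbc.Driver", "/path/to/AthenaJDBC.jar")
  coni <- dbConnect(driver, "jdbc:awsathena://<<location>>/",
                    s3_staging_dir = "my_directory",
                    user = "...",
                    password = "...")
  ## ... build sqlstring, fetch(), and dbDisconnect(coni) as in the question ...
}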
On another track, how many rows are in your result set? If the result set is in the million-plus row range and can be returned with a single query, I actually came across an opportunity for optimization within the RJDBC package and have an open pull request on GitHub (https://github.com/s-u/RJDBC/pull/50) that I haven't heard anything on but have been using myself for a couple of months. There's a basic benchmark documented in the pull request; I found the speedup to be substantial on the particular query I was running.
If it seems applicable you can install the branch with:
library(devtools)
devtools::install_github("msummersgill/RJDBC",ref = "harmonize", force = TRUE)

Cannot access parameters in C++ code in parallel code called from Snow

I am developing parallel R code using the snow package, but when calling C++ code via the Rcpp package the program just hangs and is unresponsive.
As an example...
I have the following R code that uses snow to split the work across a certain number of processes:
MyRFunction <- function(i) {
  n = i
  .Call("CppFunction", n, PACKAGE = "MyPackage")
}

if (mpi) {
  cl <- getMPIcluster()
  clusterExport(cl, list("set.user.Random.seed"))
  clusterEvalQ(cl, {library(Rcpp); NULL})
  out <- clusterApply(cl, 1:mc.cores, MyRFunction)
  stopCluster(cl)
} else {
  out <- parallel::mclapply(1:mc.cores, MyRFunction)
}
Whereas my C++ function looks like...
RcppExport SEXP CppFunction(SEXP n) {
  int n = as<int>(n);
}
If I run it with mpi=FALSE and mc.cores=[some number of threads], the program runs beautifully, but if I run it with mpi=TRUE, therefore using snow, the program just hangs at int n = as<int>(n).
On the other hand if I define the C++ function as...
RcppExport SEXP CppFunction(SEXP n) {
  CharacterVector nn(n);
  int n = boost::lexical_cast<int>(nn[0]);
}
The program runs perfectly on each MPI thread. The problem is that this works for integers, doubles, etc., but not for matrices.
Also, I must use lexical_cast from the Boost library to make it work, since as<>() does not.
Does anybody know why this is, and what I am missing here, so I can load my matrices as well?
It is not entirely clear from your question what you are doing, but I'd recommend the following:
- Simplify: snow certainly works, and works with Rcpp as it does with other packages.
- Trust packages: I have found parallel computing setups easier when all nodes have identical local package sets.
- Be careful with threading: if you have trouble with explicit threading in the snow context, try it first without it and then add it once the basic mechanics work; a minimal sanity check is sketched below.
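For instance, a minimal sketch of such a stripped-down test, assuming the workers have a working C++ toolchain; a plain socket cluster stands in for MPI here:
library(parallel)

cl <- makeCluster(2)                       # plain socket cluster, no MPI yet
clusterEvalQ(cl, { library(Rcpp); NULL })  # load Rcpp on every worker

## Compile and run a trivial C++ expression on each worker; if this works,
## add MPI and your own package back in, one step at a time
res <- clusterApply(cl, 1:2, function(i) Rcpp::evalCpp("2 + 2"))
print(unlist(res))

stopCluster(cl)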
Finally, the issue was resolved: the problem seems to lie with getMPIcluster(), which works perfectly fine for pure R code, but not as well with Rcpp, as explained above.
Using the makeMPIcluster() command instead:
mc.cores <- max(1, NumberOfNodes * CoresPerNode - 1)  # minus one for the master
cl <- makeMPIcluster(mc.cores)
cat(sprintf("Running with %d workers\n", length(cl)))
clusterCall(cl, function() { library(MyPackage); NULL })
out <- clusterApply(cl, 1:mc.cores, MyRFunction)
stopCluster(cl)
Works great! The problem is that you have to manually define the number of nodes and cores per node within the R code, instead of defining it using the mpirun command.

Resources