Errors running an R script on clusters

I'm running an R script using foreach() and %dopar% on a cluster. Without changing anything in my code, execution is interrupted each time with one of three different errors:
1.
Error in unserialize(socklist[[n]]) : error reading from connection
Calls: %dopar% ... recvOneData -> recvOneData.SOCKcluster -> unserialize
Execution halted
2.
Error in { :
task 1 failed - "unable to load shared object '/cluster/apps/r/3.5.1_openblas/x86_64/lib64/R/library/units/libs/units.so':
libudunits2.so.0: cannot open shared object file: No such file or directory"
Calls: %dopar% -> <Anonymous>
Execution halted
3.
TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with exit code 1
In the last scenario, I tried increasing the number of nodes and the requested memory in the submission command bsub -n 23 -W 20:00 -R "rusage[mem=4072]" "R --vanilla --slave <1.algorithm_function_part0_alternative.R> resultFunPart0Alt.out", but it still returns the same error or one of the previous two. I'm using R version 3.5.1.

Related

How to quit R script when an error occurred

I am writing an R script to run as a command in the terminal, and I found that if an error happens while running the script, the command still returns a normal exit status. So I cannot check the result by testing [ $? -ne 0 ]; it just reports success.
This is because R continues running the next command when it encounters an error in a previous command. Is there any way to handle this situation?
Best,
Shixiang
I combined tryCatch() and quit() to solve this problem. I first wrap my main function in a tryCatch() structure to detect whether an error occurred; once an error is detected, I print the error message and call quit("no", -1) to quit R with a non-zero exit status.
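A minimal sketch of that pattern (main_function() is a placeholder, not code from the original script):
result <- tryCatch(
  main_function(),   # placeholder for the real entry point
  error = function(e) {
    message("Error: ", conditionMessage(e))   # print the error message to stderr
    quit(save = "no", status = 1)             # non-zero status, so [ $? -ne 0 ] catches it
  }
)
The original answer uses quit("no", -1); any non-zero status works for the shell check.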

Negative binomial regression in R using brm causing error when using multiple cores

I am calculating a negative binomial regression using the brm function from the brms package. As this takes quite some time, I would like to use multiple cores as suggested in the documentation.
bfit_s <- brm(
  dep_var ~ ind_var +
    var1 +
    var2 +
    (1 | some_level1) + (1 | some_level2),
  data = my_df,
  family = negbinomial(link = "log", link_shape = "log"),
  cores = 4,
  control = list(adapt_delta = 0.999)
)
However, I am running into an error saying that the connection of all four workers failed:
Compiling the C++ model
Start sampling
starting worker pid=11603 on localhost:11447 at 14:13:56.193
starting worker pid=11601 on localhost:11447 at 14:13:56.193
starting worker pid=11602 on localhost:11447 at 14:13:56.198
starting worker pid=11604 on localhost:11447 at 14:13:56.201
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> -> slaveLoop -> makeSOCKmaster
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> -> slaveLoop -> makeSOCKmaster
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> -> slaveLoop -> makeSOCKmaster
Execution halted
Execution halted
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> -> slaveLoop -> makeSOCKmaster
Execution halted
The traceback says Error in makePSOCKcluster(names = spec, ...) : Cluster setup failed. 4 of 4 workers failed to connect.
I tried to understand the problem and read some questions on SO like this one, but couldn't figure out why I can't connect. I'm using macOS Mojave, and the problem is not that I'm requesting more cores than are available. Any suggestions on how I could get this to run on multiple cores?
Edit:
As sjp pointed out in his answer, there is an issue with RStudio. I thought I'd share the code to solve the problem right here in my question, so everyone stumbling across it can solve this without clicking (and reading) any further.
The problem is with the parallel package from R 4.0.0, but a workaround is provided by a user on the Stan forum. You can initialize clusters with setup_strategy = "sequential" like this:
cl <- parallel::makeCluster(2, setup_strategy = "sequential")
You can add a short snippet to your ~/.Rprofile to make this the default:
## WORKAROUND: https://github.com/rstudio/rstudio/issues/6692
## Revert to 'sequential' setup of PSOCK cluster in RStudio Console on macOS and R 4.0.0
if (Sys.getenv("RSTUDIO") == "1" && !nzchar(Sys.getenv("RSTUDIO_TERM")) &&
    Sys.info()["sysname"] == "Darwin" && getRversion() == "4.0.0") {
  parallel:::setDefaultClusterOptions(setup_strategy = "sequential")
}
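With that snippet in place, a quick way to confirm the workaround in a fresh RStudio session (this check is my own suggestion, not part of the original answer) is:
cl <- parallel::makeCluster(2)   # should now start without "Cluster setup failed"
parallel::stopCluster(cl)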
This is a known issue that has to do with RStudio. Check out these related posts on the Stan forums and GitHub.
GitHub: https://github.com/rstudio/rstudio/issues/6692
Stan forums: https://discourse.mc-stan.org/t/r-4-0-0-and-cran-macos-binaries/13989/13

R error SQLSatellite cannot read data chunk

I have an SSIS 2015 package that calls a Stored Proc in SQL Server 2016.
When I run the SSIS package I get these two messages:
Error: A 'R' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004004.
Error: STDERR message(s) from external script:
Error in eval(expr, envir, enclos) : bad allocation
Calls: source -> withVisible -> eval -> eval -> .Call
Execution halted
So I ran the stored proc in SSMS but got these messages:
A 'R' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004004.
STDERR message(s) from external script:
SqlSatellite cannot read data chunk. Error code:0x80004004.
Error in eval(expr, envir, enclos) : SqlSatellite cannot read data chunk. Error code:0x80004004.
Calls: source -> withVisible -> eval -> eval -> .Call
I have run the R script's input query in SSMS and it returns data, and I do not believe I am missing any columns in the R script, which I believe was working previously.
But being new to R I have no idea how to diagnose what may be causing the problem.
I did some more research on the errors and found some information at http://www.nielsberglund.com/2017/11/11/microsoft-sql-server-r-services-internals-xiii/.
The suspicion was that it might not be a code issue but rather a data issue. A considerable amount of testing indicated it was the amount of data I was analyzing with the R script; I was able to restrict the amount of data using some date parameters and finished the data loads.
Hope this helps others.
I had the same error using Python in SQL Server 2017. I figured out it was because my WITH RESULT SETS statement did not fit my OutputDataSet.

run Rmpi on cluster, specify library path

I'm trying to run an analysis in parallel on our computing cluster.
Unfortunately I've had to set up Rmpi myself and may not have done so properly.
Because I had to install all necessary packages into my home folder, I always have to call
.libPaths('/home/myfolder/Rlib');
before I can load packages.
However, it appears that doMPI attempts to load itself before I can set the library path.
.libPaths('/home/myfolder/Rlib');
cat("Step 1")
library(doMPI)
cl <- startMPIcluster()
registerDoMPI(cl)
cat("Step 2")
Children_mcmc1 = foreach(i = 1:2) %dopar% {
  cat("Step 3")
  .libPaths('/home/myfolder/Rlib');
  library(MCMCglmm)
  cat("Step 4")
  load("krmh_married.rdata")
  nitt = 1000; thin = 50; burnin = 100
  MCMCglmm(children ~ paternalage.factor,
           random = ~idParents,
           family = "poisson",
           data = krmh_married,
           pr = F, saveX = T, saveZ = T,
           nitt = nitt, thin = thin, burnin = burnin)
}
closeCluster(cl)
mpi.quit()
If I do
mpirun -H localhost -n 3 R --slave -f "3 - krmh mcmcglmm scc test 2.r"
I get (after removing some boilerplate messages)
During startup - Warning message:
Step 1
Step 1
Step 1
Step 2Error in { : task 2 failed - "cannot open the connection"
Calls: %dopar% ->
Execution halted
If I do
R --slave -f "3 - krmh mcmcglmm scc test 2.r"
I get
Step 1
Error in library(doMPI) : there is no package called 'doMPI'
Calls: local ... eval -> suppressMessages -> withCallingHandlers -> library
Execution halted
Error in library(doMPI) : there is no package called 'doMPI'
Calls: local ... eval -> suppressMessages -> withCallingHandlers -> library
Execution halted
I've tried installing doMPI on the fly, but even though Step 2 isn't printed, it seems as if the error results from the loop.
And of course, with all this I'm still testing on our frontend; I haven't made it to submitting the job to the intended cluster yet.
I tried to specify the .libPaths call in my .Rprofile, but I'm not sure it would get read on the cluster, and I can't even get it to be read on the frontend (I couldn't figure out where R looks for the file).
It's much easier to install R packages into a "personal library", since that library is used automatically and you don't have to call .libPaths in your scripts. You can determine which directory this is by executing:
> Sys.getenv('R_LIBS_USER')
This will automatically be the first directory returned by .libPaths if it exists, so you don't have to worry about calling .libPaths at all.
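For example, a one-time setup along these lines (run interactively on the cluster frontend; the package name is just an example) creates the personal library and installs into it:
# create the personal library if it doesn't exist yet, then install into it
dir.create(Sys.getenv("R_LIBS_USER"), recursive = TRUE, showWarnings = FALSE)
install.packages("doMPI", lib = Sys.getenv("R_LIBS_USER"),
                 repos = "https://cloud.r-project.org")
# in a fresh R session the personal library is then picked up automatically,
# so library(doMPI) works without any .libPaths() call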
Note that there's no point in calling .libPaths in the body of the foreach loop since doMPI must be loaded by the cluster workers before they can execute any tasks.
I'm not sure what's going wrong in your "mpirun" case, because mpirun is starting all of the workers, so the first four lines of your script are executed by all of them. That is why "Step 1" is displayed three times. But in your second case, the cluster workers are being spawned, so the doMPI package is loaded by the RMPIworker.R script, resulting in the error loading doMPI.
I suggest that you use the mpirun approach to solve the .libPaths problem, but call startMPIcluster with the verbose=TRUE option. That will create some files in your working directory named "MPI_*.log" which may contain some useful error messages that will provide a clue to the problem.
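In other words, the top of the script would look something like this (only the verbose argument is new; verbose is a documented startMPIcluster() option):
.libPaths('/home/myfolder/Rlib')
library(doMPI)
cl <- startMPIcluster(verbose = TRUE)   # writes MPI_*.log files with worker output
registerDoMPI(cl)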

"Cannot open the connection" - HPC in R with snow

I'm attempting to run a parallel job in R using snow. I've been able to run extremely similar jobs with no trouble on older versions of R and snow. R package dependencies prevent me from reverting.
What happens: My jobs terminate at the parRapply step, i.e., the first time the nodes have to do anything beyond reporting Sys.info(). The error message reads:
Error in checkForRemoteErrors(val) :
3 nodes produced errors; first error: cannot open the connection
Calls: parRapply ... clusterApply -> staticClusterApply -> checkForRemoteErrors
Specs: R 2.14.0, snow 0.3-8, RedHat Enterprise Linux Client release 5.6. The snow package has been built on the correct version of R.
Details:
The following code appears to execute fine:
cl <- makeCluster(3)
clusterEvalQ(cl,library(deSolve,lib="~/R/library"))
clusterCall(cl,function() Sys.info()[c("nodename","machine")])
I'm an end-user, not a system admin, but I'm desperate for suggestions and insights into what could be going wrong.
This cryptic error appeared because an input file that's requested during program execution wasn't actually present. Each node would attempt to load this file and then fail, but this would result only in a "cannot open the connection" message.
What this means is that almost anything can cause a "connection" error. Incredibly annoying!
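One way to narrow this down is to check directly on each worker whether the working directory and input files are what the code expects (the file name below is a hypothetical stand-in):
clusterCall(cl, getwd)                                     # where is each node actually running?
clusterCall(cl, function() file.exists("input_file.rda"))  # hypothetical input file name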
