I am running simulations in R on a cluster. Each R file contains 100 models. Each model analyses a different data set. Cluster commands are included in a slurm file, shown below.
A small percentage of the models apparently do not converge well enough to estimate the Hessian, and an error is generated for those models. The errors are placed in an error log file. However, I cannot tell from the parameter estimates, the error log file, or the output log file which of the 100 models are generating the errors.
Here is an example of an error message:
Error in chol.default(fit$hessian) :
the leading minor of order 3 is not positive definite
Calls: chol2inv -> chol -> chol.default
Parameter estimates are returned despite these errors. Some SEs are huge, but I think SEs can sometimes be large even when no error message is returned.
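For context, I believe the SEs come from inverting the Hessian (something like chol2inv(chol(fit$hessian)), as the call stack in the error suggests), and chol() is what fails when the Hessian is not positive definite. A toy illustration with made-up numbers, not from my models:
H_bad <- matrix(c(1.0, 0.9,
                  0.9, 0.8), nrow = 2)      # not positive definite
try(sqrt(diag(chol2inv(chol(H_bad)))))      # reproduces a "leading minor" error
H_ok <- matrix(c(1.0, 0.2,
                 0.2, 0.8), nrow = 2)       # positive definite
sqrt(diag(chol2inv(chol(H_ok))))            # SEs computed without complaint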
Is it possible to add a line to my slurm file below that will generate a single log file containing both the errors and the rest of the output, with each error shown in its original location (for example, the location in which it appears when I run the script on my Windows laptop)? That way I would be able to determine quickly which models were generating the errors by looking at the log file. I have been trying to think of a work-around, but have not been able to come up with anything so far.
Here is a slurm file:
#!/bin/bash
#SBATCH -J JS_N200_301_400_Oct31_17c.R
#SBATCH -n 1
#SBATCH -c 1
#SBATCH -N 1
#SBATCH -t 2000
#SBATCH -p community.q
#SBATCH -o JS_N200_301_400_Oct31_17c.out
#SBATCH -e JS_N200_301_400_Oct31_17c.err
#SBATCH --mail-user markwm#myuniversity.edu
#SBATCH --mail-type ALL
Rscript JS_N200_301_400_Oct31_17c.R
Not sure if this is what you want, but the R option error allows you to control what happens with errors that you don't otherwise catch. For instance, setting
options(error = function() {
  traceback(2L)
  dump.frames(dumpto = "last.dump", to.file = TRUE)
})
at the beginning of your *.R script, or in a .Rprofile startup script, will (a) output the traceback if there's an error and, more importantly, (b) dump the call stack to the file last.dump.rda, which you can load in a fresh R session as:
dump <- get(load("last.dump.rda"))
Note that get(load( is not a mistake: load() returns the name of the loaded object, and get() then retrieves it. Here dump is an object of class dump.frames, which allows you to inspect the call stack and its contents.
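For example (assuming the dump was created as above), you can browse the frames interactively with utils::debugger, which accepts a dump.frames object:
debugger(dump)   # pick a frame by number and inspect the variables that were live at the time of the error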
You can of course customize error to do other things.
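As one example, a variation (just a sketch of the idea) that also reports a timestamp and the error message via message() before dumping the frames:
options(error = function() {
  # report when and what failed, then dump the frames as before
  message(sprintf("[%s] %s", format(Sys.time()), geterrmessage()))
  traceback(2L)
  dump.frames(dumpto = "last.dump", to.file = TRUE)
})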
I learned from an IT person in charge of the cluster that I can have the error messages added to the output log simply by removing the reference to the error log from the slurm file (see below). It seems to be good enough.
I also plan to write the model number to the log at the beginning and the end of each model's output for added clarity (which I should have been doing from the start); a sketch of what I mean follows the slurm file below.
#!/bin/bash
#SBATCH -J JS_N200_301_400_Oct31_17c.R
#SBATCH -n 1
#SBATCH -c 1
#SBATCH -N 1
#SBATCH -t 2000
#SBATCH -p community.q
#SBATCH -o JS_N200_301_400_Oct31_17c.out
#SBATCH --mail-user markwm#myuniversity.edu
#SBATCH --mail-type ALL
Rscript JS_N200_301_400_Oct31_17c.R
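Here is roughly what I have in mind for the per-model logging (a sketch only: fit_one_model and data_sets are hypothetical stand-ins for whatever the real script does):
for (i in seq_along(data_sets)) {
  cat(sprintf("===== model %d: start =====\n", i))
  fit <- tryCatch(
    fit_one_model(data_sets[[i]]),   # hypothetical fitting call
    error = function(e) {
      # label the failure with the model number instead of leaving it anonymous
      cat(sprintf("model %d failed: %s\n", i, conditionMessage(e)))
      NULL
    }
  )
  cat(sprintf("===== model %d: end =====\n", i))
}
Each failure then appears between the start and end markers of the model that produced it.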
Related
I am submitting a job on the debug queue on Niagara (Slurm scheduler) and am getting the following error:
SALLOC: error: _fork_command: Unable to exec command "/mypath/test.sh": Permission denied
I have checked the permissions of the file test.sh and it is readable; in fact, I have been using the same file for serial jobs with no problems. I am trying to use mpirun for a parallel job, and that is when I get the error.
My shell script is as follows:
#!/bin/bash
#SBATCH --account= xxxx
#SBATCH --nodes=2
#SBATCH --ntasks=160
#SBATCH --time=3:30:00
#SBATCH --job-name "sNucRNASeq"
pushd /mypath/
mpirun --np 4 R --no-save < Rscript test.R
Rscript test.R
I have tried chmod -rwx test.sh, but it did not make a difference.
Am I missing something with regard to letting all the processors access the file? How can I bypass the error?
The test.R script referred to above is pretty simple:
library(pbdMPI)
init()
rank <- comm.rank()
size <- comm.size()
myfiles <- load("ListofFiles.RData")
y <- scatter(lapply(myfiles, readRDS))
comm.print(str(y))
finalize()
I am executing an R script on a cluster, using the following sbatch file:
#!/bin/bash
#SBATCH --job-name=ReconstructScans
#SBATCH --mem=4g
#SBATCH --account=XXXXXX
#SBATCH --array=1-1013%5
module load StdEnv/2020
module load gcc/9.3.0
module load r/4.0.2
srun Rscript ReconstructScans.R
ReconstructScans.R calls another R script, using
source("path_to_script.R", local=FALSE)
However, for some reason, the variables declared within path_to_script.R are not accessible to ReconstructScans.R: I get an "object 'DataPath' not found" error.
Why might this be happening? When I open an interactive R shell and source ReconstructScans.R, everything works fine. It's when ReconstructScans.R is executed through sbatch that the problem arises.
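In case it helps, here is the kind of diagnostic I could add at the top of ReconstructScans.R (just a sketch using the names above) to see where the script is looking when run under sbatch:
cat("working directory:", getwd(), "\n")
cat("script found:", file.exists("path_to_script.R"), "\n")
source("path_to_script.R", local = FALSE)
cat("DataPath defined:", exists("DataPath"), "\n")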
Is it a problem with how I'm executing it using:
srun Rscript ReconstructScans.R?
Thank you,
Mrinmayi
I am completely new to using Slurm to submit jobs to an HPC, and I am facing a peculiar problem that I am not able to resolve.
I have a job.slurm file that contains the following bash script
#!/bin/bash
#SBATCH --job-name singularity-mpi
#SBATCH -N 1 # total number of nodes
#SBATCH --time=00:05:00 # Max execution time
#SBATCH --partition=partition-name
#SBATCH --output=/home/users/r/usrname/slurm-reports/slurm-%j.out
module load GCC/9.3.0 Singularity/3.7.3-Go-1.14 CUDA/11.0.2 OpenMPI/4.0.3
binaryPrecision=600 #Temporary number
while getopts i:o: flag
do
case "${flag}" in
i) input=${OPTARG}
;;
o) output=${OPTARG}
;;
*) echo "Invalid option: -$flag" ;;
esac
done
mpirun --allow-run-as-root singularity exec --bind /home/users/r/usrname/scratch/points_and_lines/:/usr/local/share/sdpb/ sdpb_2.5.1.sif pvm2sdp $binaryPrecision /usr/local/share/sdpb/$input /usr/local/share/sdpb/$output
The command pvm2sdp is a C++ executable that converts an XML file to a JSON file.
If I submit the .slurm file as
sbatch ./job.slurm -i /home/users/r/usrname/scratch/points_and_lines/xmlfile.xml -o /home/users/r/usrname/scratch/points_and_lines/jsonfile.json
it works perfectly. However, if I instead submit it using srun as
srun ./job.slurm -i /home/users/r/usrname/scratch/points_and_lines/xmlfile.xml -o /home/users/r/usrname/scratch/points_and_lines/jsonfile.json
I get the following error -
--------------------------------------------------------------------------
A call to mkdir was unable to create the desired directory:
Directory: /scratch
Error: Read-only file system
Please check to ensure you have adequate permissions to perform
the desired operation.
--------------------------------------------------------------------------
I have no clue why this is happening or how to go about resolving the issue. I tried mounting /scratch as well, but that did not resolve it.
Any help would be greatly appreciated, since I need to use srun inside another .slurm file that contains multiple other MPI calls.
I generally use srun after salloc. Let's say I have to run a Python file on a GPU; I will use salloc to allocate a compute node.
salloc --nodes=1 --account=sc1901 --partition=accel_ai_mig --gres=gpu:2
Then I use this command to directly access the shell of the compute node.
srun --pty bash
Now you can type any command as you would on your own machine. You can try nvidia-smi, or run a Python file with python code.py.
In your case, you can simply load modules manually and then run your mpirun command after srun --pty bash. You don't need the job script.
One more thing: sbatch and srun are customised for each HPC, so we can't say exactly what is stopping you from running those commands.
At Swansea University, we are expected to use job scripts with sbatch only. Have a look at my university's HPC tutorial.
Read this article to learn the primary differences between the two.
I'm trying to run an MPI application with a specific task/node configuration. I need to run a total of 8 MPI tasks, 4 on one node and 4 on another.
This is the script file I'm using:
#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=4
module load autoload scalapack/2.0.2--intelmpi--2018--binary intel/pe-xe-2018--binary
srun <path_to_bin> <options>
I then run this with sbatch:
sbatch mpi_test.sh
but I continue to get this error:
sbatch: error: Batch job submission failed: Requested node
configuration is not available
How can I modify this script to make it run? I'm surely missing something, but I cannot figure out what.
I'm using Intel MPI and Slurm 20.02.
This can be caused by requesting a node configuration the cluster cannot provide. The issue is likely in the following lines:
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4
Together these request 4 tasks per node with 4 CPUs per task, i.e. 16 cores per node. If no node has at least 16 cores, the above error will be shown.
#SBATCH --ntasks-per-socket=1
As pointed out in the comment by damienfrancois, it can also be an issue with the number of sockets: --ntasks-per-socket=1 with 4 tasks per node requires 4 sockets per node, and if the nodes do not have 4 sockets, the same error will be shown.
As a simple first step, you can comment out the #SBATCH --ntasks-per-socket=1 line and rerun the batch script. If it still fails, the issue is likely an invalid mapping of tasks to CPUs.
More information about the environment is needed for further analysis.
I am trying to submit an R script to Slurm on CentOS 7, like this:
#!/bin/bash
#SBATCH -J test
#SBATCH -o test.out
#SBATCH -p compute
#SBATCH --qos=normal
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH --job-name=rtest
Rscript --vanilla Rhelp.R
The system then returns a job ID, but the R script does not work, even though I can assure you that it runs fine from the command line. In test.out I found this:
error while loading shared libraries: libicuuc.so.50:
cannot open shared object file: No such file or directory
I am new to Slurm and Linux. Thanks!
It looks like the libicu RPM package is not installed on the compute nodes.
Just because it may be installed on the head node doesn't mean it's installed on the compute node(s). You could run an ldconfig check (e.g. listing the cache with ldconfig -p and looking for libicuuc) inside a Slurm job and view the results to confirm that this is the case.
With the module avail command on the head node you can list all available modules; loaded modules are marked in a way that depends on your setup (for me they are marked with (L)). All you need to do is load the required modules from your script file, each with a line of the form module load path_to_module, where path_to_module is the name shown by module avail.
Alternatively, without resorting to module avail, you can use module list to see only the currently loaded modules.