Inconsistent mounting behaviour with SLURM and Singularity - mpi

I am completely new to using SLURM to submit jobs to a HPC and I am facing a peculiar problem that I am not able to resolve.
I have a job.slurm file that contains the following bash script
#!/bin/bash
#SBATCH --job-name singularity-mpi
#SBATCH -N 1 # total number of nodes
#SBATCH --time=00:05:00 # Max execution time
#SBATCH --partition=partition-name
#SBATCH --output=/home/users/r/usrname/slurm-reports/slurm-%j.out
module load GCC/9.3.0 Singularity/3.7.3-Go-1.14 CUDA/11.0.2 OpenMPI/4.0.3
binaryPrecision=600 #Temporary number
while getopts i:o: flag
do
case "${flag}" in
i) input=${OPTARG}
;;
o) output=${OPTARG}
;;
*) echo "Invalid option: -$flag" ;;
esac
done
mpirun --allow-run-as-root singularity exec --bind /home/users/r/usrname/scratch/points_and_lines/:/usr/local/share/sdpb/ sdpb_2.5.1.sif pvm2sdp $binaryPrecision /usr/local/share/sdpb/$input /usr/local/share/sdpb/$output
The command pvm2sdp is just some specific kind of C++ executable that converts a XML file to a JSON file.
If I submit the .slurm file as
sbatch ./job.slurm -i /home/users/r/usrname/scratch/points_and_lines/xmlfile.xml -o /home/users/r/usrname/scratch/points_and_lines/jsonfile.json
it works perfectly. However, if I instead submit it using srun as
srun ./job.slurm -i /home/users/r/usrname/scratch/points_and_lines/xmlfile.xml -o /home/users/r/usrname/scratch/points_and_lines/jsonfile.json
I get the following error -
--------------------------------------------------------------------------
A call to mkdir was unable to create the desired directory:
Directory: /scratch
Error: Read-only file system
Please check to ensure you have adequate permissions to perform
the desired operation.
--------------------------------------------------------------------------
I have no clue why this is happening and how I can go about resolving the issue. I tried to mount /scratch as well but that does not resolve the issue.
Any help would be greatly appreciated since I need to use the srun inside another .slurm file that contains multiple other MPI calls.

I generally use srun after salloc. Let's say I have to run a python file on a GPU. I will use salloc to allocate a compute node.
salloc --nodes=1 --account=sc1901 --partition=accel_ai_mig --gres=gpu:2
Then I use this command to directly access the shell of the compute node.
srun --pty bash
Now, you can type any command as would do on your pc. You can try nvidia-smi. You can run Python files python code.py.
In your case, you can simply load modules manually and then run your mpirun command after srun --pty bash. You don't need the job script.
One more thing, sbatch and srun are customised for each HPC, so we can't say what exactly is stopping you from running those commands.
At Swansea University, we are expected to use job scripts with the sbatch only. Have a look at my university's HPC tutorial.
Read this article to know the primary differences between both.

Related

Slurm: error in reading shell script when using mpirun. How can I fix this?

I am submitting a job on the debug queue on Niagara (Slurm scheduler) and am getting the following error:
SALLOC: error: _fork_command: Unable to exec command "/mypath/test.sh": Permission denied
I have checked the permissions of the file test.sh and it is readable, in fact I have been using the same file for serial jobs with no problems. I am trying to use mpirun for a parallel job and that's when I get the error.
My shell script is as follows:
#!/bin/bash
#SBATCH --account= xxxx
#SBATCH --nodes=2
#SBATCH --ntasks=160
#SBATCH --time=3:30:00
#SBATCH --job-name "sNucRNASeq"
pushd /mypath/
mpirun --np 4 R --no-save < Rscript test.R
Rscript test.R
I have tried chmod -rwx test.sh, it did not make a difference.
Am I missing something with regards to letting all the processors access the file? How can I by-pass the error?
The test.R script referred to above is pretty simple:
library(pbdMPI)
init()
rank<-comm.rank()
size<-comm.size()
myfiles<-load("ListofFiles.RData")
y <- scatter(lapply(myfiles, readRDS))
comm.print(str(y))
finalize()

Sourcing an R script within another script executed through SLURM

I am executing an R script on a cluster, using the following sbatch file:
#!/bin/bash
#SBATCH --job-name=ReconstructScans
#SBATCH --mem=4g
#SBATCH --account=XXXXXX
#SBATCH --array=1-1013%5
module load StdEnv/2020
module load gcc/9.3.0
module load r/4.0.2
srun Rscript ReconstructScans.R
ReconstructScans.R calls another R script, using
source("path_to_script.R", local=FALSE)
However, for some reason, the variables declared within path_to_script.R are not accessible to ReconstructScans.R. I get the object 'DataPath' not found error.
Why might this be happening? When I open an interactive R shell and source ReconstructScans.R, everything works fine. It's when ReconstructScans.R is executed through sbatch that the problem arises.
Is it a problem with how I'm executing it using:
srun Rscript ReconstructScans.R?
Thank you,
Mrinmayi

error while loading shared libraries: libicuuc.so.50

I try to submit a R script to SLURM in CentOS 7, like this:
#!/bin/bash
#SBATCH -J test
#SBATCH -o test.out
#SBATCH -p compute
#SBATCH --qos=normal
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH --job-name=rtest
Rscript --vanilla Rhelp.R
Then system will return a jobid, but the R script does not work. I can assure this script can run in command line. Then I have found in test.out, like this:
error while loading shared libraries: libicuuc.so.50:
cannot open shared object file: No such file or directory
I am a freshman in SLURM and Linux, thx!
It looks like the libicu RPM package is not installed on the compute nodes.
Just because it may be installed on the head node doesn't mean it's installed on the compute node(s). You could fire off the same ldconfig command in a Slurm job and view the results to confirm that's the case.
With the module avail command from the head node you list all available modules and loaded modules are marked somehow depending on your OS; for me they are marked with (L). All you need to do is load those loaded modules from your file script each of which is invoked with the line
module load path_to_module. Whereas, path_to_module is as is indicated with the previous command module avail.
Or without resorting to module avail you could use module list for currently loaded modules.

GNUPlot cannot be executed after mpirun command in PBS script

I have PBS command something like this
#PBS -N marcell_single_cell
#PBS -l nodes=1:ppn=1
#PBS -l walltime=20000:00:00
#PBS -e stderr.log
#PBS -o stdout.log
# Specific the shell types
#PBS -S /bin/bash
# Specific the queue type
#PBS -q dque
#uncomment this if you want to debug the process
#set -vx
cd $PBS_O_WORKDIR
ulimit -s unlimited
NPROCS=`wc -l < $PBS_NODEFILE`
#export PATH=$PBS_O_PATH
echo This job has allocated $NPROCS nodes
echo Cleaning old files...
rm -rf *.png *.plt *.log
echo Cleaning success
/opt/Lib/openmpi-2.1.3/bin/mpirun -np $NPROCS /scratch4/marcell/CellMLSimulator/bin/CellMLSimulator -ionmodel grandi2010 -solverType CVode -irepeat 4 -dt 0.01
gnuplot -p plotting.gnu
It got error something like this, thrown by the PBS error log.
/var/spool/torque/mom_priv/jobs/6265.node01.SC: line 28: gnuplot: command not found
I've already make sure that the path of GNUPlot is already been added to the PATH environment variable.
However, the strange part is, if I interchange the sequence of command, like gnuplot first and then mpirun, there isn't any error. I suspect that some commands after mpirun need some special configs, but I dunno how to do that
Already following this solution, but no avail.
sleep command not found in torque pbs but works in shell
EDITED:
it seems that the before and after mpirun still got error. and this is the which result:
which: no gnuplot in (/opt/intel/composer_xe_2011_sp1.9.293/bin/intel64:/opt/intel/composer_xe_2011_sp1.9.293/bin/intel64:/opt/pgi/linux86-64/9.0-4/bin:/opt/openmpi/bin:/usr/kerberos/bin:/prog/tools/grace/grace/bin:/home/prog/ansys_inc/v121/fluent/bin:/bin:/usr/bin:/opt/intel/composer_xe_2011_sp1.9.293/mpirt/bin/intel64:/opt/intel/composer_xe_2011_sp1.9.293/mpirt/bin/intel64:/scratch7/feber/jdk1.8.0_101:/scratch7/feber/code/apache-maven/bin:/usr/local/bin:/scratch7/cml/bin)
It's strange, since when I try to find the gnuplot, there is one in the /usr/local/bin
ls -l /usr/local/bin/gnuplot
-rwxr-xr-x 1 root root 3262113 Sep 18 2017 /usr/local/bin/gnuplot
moreover, if I run those commands without PBS, it seems executed as I expected:
/scratch4/marcell/CellMLSimulator/bin/CellMLSimulator -ionmodel grandi2010 -solverType CVode -irepeat 4 -dt 0.01
gnuplot -p plotting.gnu
It's very likely that your system has different "login/head nodes" and "compute nodes". This is a commonly used practice in many supercomputing clusters. While you build and launch your application from the head node, it gets executed on one or more compute nodes.
The compute nodes can have different hardware and software compared to the head nodes. In your case, gnuplot is installed only on the head node, as you can see from the different outputs of which gnuplot. To solve this, you have three approaches:
Request the system administrators to install gnuplot on the compute nodes.
Build and install your own version of gnuplot in a file-system accessible from the compute nodes. It could be your home directory or somewhere else depending on your cluster. In general, the filesystem where your application is will be available. In your case, anywhere under /scratch4/marcell/ would probably work.
Run gnuplot on the head node after the MPI jobs finish as a post-processing step. PBS/Torque does not provide a direct way to do this. You'll need to write a separate bash (not PBS) script to do this.

error on running mpi job

I'm trying to run a MPI job on a cluster with torque and openmpi 1.3.2 installed and I'm always getting the following error:
"mpirun was unable to launch the specified application as it could not find an executable:
Executable: -p
Node: compute-101-10.local
while attempting to start process rank 0."
I'm using the following script to do the qsub:
#PBS -N mphello
#PBS -l walltime=0:00:30
#PBS -l nodes=compute-101-10+compute-101-15
cd $PBS_O_WORKDIR
mpirun -npersocket 1 -H compute-101-10,compute-101-15 /home/username/mpi_teste/mphello
Any idea why this happens?
What I want is to run 1 process in each node (compute-101-10 and compute-101-15). What am I getting wrong here?
I've already tried several combinations of the mpirun command, but either the program runs on only one node or it gives me the above error...
Thanks in advance!
The -npersocket option did not exist in OpenMPI 1.2.
The diagnostics that OpenMPI reported
mpirun was unable to launch the specified application as it could not
find an executable: Executable: -p
is exactly what mpirun in OpenMPI 1.2 would say if called with this option.
Running mpirun --version will determine which version of OpenMPI is default on the compute nodes.
The problem is that the -npersocket flag is only supported by Open MPI 1.3.2 and the cluster where I'm running my code only has Open MPI 1.2 which doesn't support that flag.
A possible way around is to use the flag -loadbalance and specify the nodes where i want the code to run with the flag -H node1,node2,node3,... like this:
mpirun -loadbalance -H node1,node2,...,nodep -np number_of_processes program_name
that way each node will run number_of_processes/p processes, where p the number of nodes where the processes will be run.

Resources