Didn't get any GPU using sbatch when submitting a job script through Slurm - torch

I requested 4 GPUs and 1 compute node. My Slurm job script is as follows:
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-gpu=12
#SBATCH --mem-per-gpu=40G
#SBATCH --time=0:15:00
module use /ifs/opt_cuda/modulefiles
module load python/gcc/3.10
module load cuda11.1/toolkit cuda11.1/blas cuda11.1/fft cudnn8.0-cuda11.1 tensorrt-cuda11.1/7.2.3.4
# activate TF venv
source /ifs/groups/rweberGrp/venvs/py310-tf210/bin/activate
python -c "import torch;print(torch.cuda.device_count())"
So torch.cuda.device_count() should give me 4, but the actual output is 0:
0
I have no idea why this is happening. Does anyone have any idea? Thanks
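One quick sanity check, assuming the same venv and modules as in the script above, is to confirm inside the job that Slurm actually exposed the GPUs and that the installed torch wheel was built with CUDA support; a minimal sketch:
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"   # empty when Slurm has not granted any GPUs to the job
nvidia-smi -L                                       # lists the GPUs visible on the allocated node
python -c "import torch; print(torch.__version__, torch.version.cuda)"   # torch.version.cuda is None for a CPU-only build
If torch.version.cuda prints None, the venv contains a CPU-only torch build and no #SBATCH directive will change the count.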

Related

Executing all the scripts within a directory using a bash script

I'd like to run all the R scripts located in a directory called scripts using a bash script. How would you do it? My script so far (not working) looks as follows:
#!/usr/bin/bash
#SBATCH --job-name=name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
module load R/4.0.2-gnu9.1
module load sqlite/3.26.0
Rscript $R_SCRIPT
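One possible approach, sketched under the assumption that the scripts directory sits in the submission directory, is to loop over the files inside the batch script instead of relying on a single $R_SCRIPT variable:
# after the two module load lines above
for script in scripts/*.R; do
    Rscript "$script"        # runs the scripts one after another within the same allocation
done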

Slurm: error in reading shell script when using mpirun. How can I fix this?

I am submitting a job on the debug queue on Niagara (Slurm scheduler) and am getting the following error:
SALLOC: error: _fork_command: Unable to exec command "/mypath/test.sh": Permission denied
I have checked the permissions of the file test.sh and it is readable; in fact, I have been using the same file for serial jobs with no problems. I am trying to use mpirun for a parallel job, and that is when I get the error.
My shell script is as follows:
#!/bin/bash
#SBATCH --account= xxxx
#SBATCH --nodes=2
#SBATCH --ntasks=160
#SBATCH --time=3:30:00
#SBATCH --job-name "sNucRNASeq"
pushd /mypath/
mpirun --np 4 R --no-save < Rscript test.R
Rscript test.R
I have tried chmod -rwx test.sh, but it did not make a difference.
Am I missing something with regard to letting all the processors access the file? How can I bypass the error?
The test.R script referred to above is pretty simple:
library(pbdMPI)
init()
rank<-comm.rank()
size<-comm.size()
myfiles<-load("ListofFiles.RData")
y <- scatter(lapply(myfiles, readRDS))
comm.print(str(y))
finalize()
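A note on the chmod attempt: chmod -rwx removes read, write and execute permission rather than granting it, and a "Permission denied" when exec'ing a script usually means the execute bit is missing. A minimal sketch of the alternative (the path is the placeholder used above):
chmod u+rx /mypath/test.sh    # grant the owner read and execute permission
ls -l /mypath/test.sh         # the mode should now show an x, e.g. -rwxr--r--, before resubmitting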

Slurm won't run MPI job with specific number of nodes

I am currently trying to run calculations that require large amounts of memory per core on an HPC cluster.
I am using a single node/machine with 512 GB RAM. I have 28 cores per machine, but every process needs more than 512/28 GB of RAM.
I have no issue using 12 or 2 processes (i.e. I intentionally don't saturate the node), but whenever I try to use 6 or 7 I get:
srun: error: node058: tasks 3-5: Exited with exit code 255
The relevant part of my slurm script is:
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --tasks-per-node=6
#SBATCH --hint=nomultithread
#SBATCH --partition=mem512
#SBATCH --time=1008:00:00
#SBATCH --mail-type=NONE
#SBATCH --job-name=$NAME
#SBATCH --exclusive
#SBATCH --export=NONE
export SLURM_EXPORT_ENV=ALL
export I_MPI_DEBUG=5
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so.0
#set period as decimal point
export LC_NUMERIC=C
ulimit -s unlimited
ulimit -l hard
export TMPDIR=/scratch/$SLURM_JOB_USER/$SLURM_JOBID
srun --cpu-bind=cores some_program < input 1> $SLURM_SUBMIT_DIR/output 2>error
Thank you for reading,
Cheers!
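If the per-task memory limit turns out to be the culprit here (the script above does not request memory explicitly), one sketch is to make the request explicit, assuming the mem512 partition permits it:
#SBATCH --mem=0               # by Slurm convention, 0 requests all memory available on the node
# or cap it per task instead, e.g. for 6 single-CPU tasks on a 512 GB node:
# #SBATCH --mem-per-cpu=80G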

Running MPI job on multiple nodes with slurm scheduler

I'm trying to run an MPI application with a specific task/node configuration. I need to run a total of 8 MPI tasks, 4 on one node and 4 on another node.
This is the script file I'm using:
#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=4
module load autoload scalapack/2.0.2--intelmpi--2018--binary intel/pe-xe-2018--binary
srun <path_to_bin> <options>
I then run this with sbatch:
sbatch mpi_test.sh
but I continue to get this error:
sbatch: error: Batch job submission failed: Requested node configuration is not available
How can I modify this script to make it run? I'm surely missing something, but I cannot figure out what.
I'm using Intel MPI and Slurm 20.02.
This can be caused by requesting a node configuration the hardware cannot satisfy.
The potential issue could be in the following lines:
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4
If there are not enough CPUs to satisfy the requirement, i.e. if a single node has fewer than 16 cores (4 tasks × 4 CPUs per task), the above error will be shown.
#SBATCH --ntasks-per-socket=1
As pointed out in the comment by damienfrancois, it can also be an issue with the number of sockets: if the nodes do not have four sockets, the same error will be shown.
As a simple first step, you can comment out the "#SBATCH --ntasks-per-socket=1" line and run the batch script again. If it still fails, the issue is likely an invalid mapping of tasks to CPUs.
More information about the environment is needed for further analysis.
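A quick way to check what the nodes actually offer before adjusting these directives, sketched with standard Slurm commands (the node name is a placeholder):
sinfo -o "%P %D %c %X %Y %Z"     # partition, node count, CPUs per node, sockets per node, cores per socket, threads per core
scontrol show node node001       # prints Sockets=, CoresPerSocket= and ThreadsPerCore= for one node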

error while loading shared libraries: libicuuc.so.50

I am trying to submit an R script to Slurm on CentOS 7, like this:
#!/bin/bash
#SBATCH -J test
#SBATCH -o test.out
#SBATCH -p compute
#SBATCH --qos=normal
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH --job-name=rtest
Rscript --vanilla Rhelp.R
The system then returns a job ID, but the R script does not work. I can assure you that this script runs from the command line. I then found this in test.out:
error while loading shared libraries: libicuuc.so.50: cannot open shared object file: No such file or directory
I am new to Slurm and Linux, thanks!
It looks like the libicu RPM package is not installed on the compute nodes.
Just because it may be installed on the head node doesn't mean it's installed on the compute node(s). You could run an ldconfig check in a Slurm job and view the results to confirm that this is the case.
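A minimal sketch of such a check, reusing the compute partition from the script above (the rest of the job is an assumption):
#!/bin/bash
#SBATCH -p compute
#SBATCH -o ldcheck.out
ldconfig -p | grep libicu        # no output means the node's linker cache has no libicu
ldd "$(which Rscript)"           # shows which shared libraries the Rscript binary resolves, or reports "not found"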
With the module avail command on the head node you can list all available modules; loaded modules are marked in a way that depends on your setup (for me they are marked with (L)). All you need to do is load those modules from your job script, each with a line of the form module load path_to_module, where path_to_module is the name shown by module avail.
Alternatively, without resorting to module avail, you can use module list to see only the currently loaded modules.
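As a sketch, with module names borrowed from the earlier R script on this page purely for illustration:
module list                      # on the head node: note which modules are currently loaded
# then repeat them near the top of the job script:
module load R/4.0.2-gnu9.1
module load sqlite/3.26.0
Rscript --vanilla Rhelp.R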
