I am very confused about submitting a job on a multi-user cluster environment. I use a script with the following head
#BSUB -L /bin/bash
#BSUB -n 10
#BSUB -J jobname
#BSUB -oo log/output.%J
#BSUB -eo log/error.%J
#BSUB -q queue_name
#BSUB -P project_name
#BSUB -R "span[ptile=12]"
#BSUB -W 2:0
mpirun ./someexecutable
My intention is that this job should run on 10 processors (cores) and span 1 entire node (because each node on the machine has 12 cores), so the node is fully used by me and no other user interferes with my node. I have explicitly checked, and it looks like my code is using 10 cores at runtime.
Now I am talking with somebody and they are telling me that this way I am actually asking for 120 cores. I think this is not right, but maybe I have misunderstood the instructions:
https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_admin/span_string.html
Should I use this instead?
#BSUB -R "span[hosts=1]"
My intention is that this job should run on 10 processors (cores) and span 1 entire node
Yes, you want to use
#BSUB -n 10
#BSUB -R "span[hosts=1]"
which means: put the job on only 1 host.
and no other user interferes with my node
You can get exclusive access to the host with
#BSUB -x
FYI, you can think of
#BSUB -R "span[ptile=x]"
as: put at most x slots on a single host.
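Putting that together, here is a sketch of how your header might look with span[hosts=1] and exclusive use, keeping your original queue, project, and executable names; treat it as a starting point rather than a tested script:
#BSUB -L /bin/bash
#BSUB -n 10
#BSUB -J jobname
#BSUB -oo log/output.%J
#BSUB -eo log/error.%J
#BSUB -q queue_name
#BSUB -P project_name
#BSUB -R "span[hosts=1]"
#BSUB -x
#BSUB -W 2:0
mpirun ./someexecutable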
My script (13-4.sh) is:
#!/bin/sh
#PBS -N sample
#PBS -l nodes=4:ppn=64
#PBS -q batch
#PBS -o $HOME/qpms9-2/out/14-4.out
#PBS -e $HOME/qpms9-2/error/14-4.out
#PBS -l walltime=100:00:00
mpirun $HOME/qpms9-2/run_mpi $HOME/qpms9-2/14-4 -l 14 -d 4
When I run this command: qsub 13-4.sh
the answer is as follows:
qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max nodes requirement
My cluster has 10 nodes (64 cores per node).
This might be an issue with your scheduler. I don't know which one you have, but you should search the relevant documentation and settings to see if there is a setting capping the maximum number of nodes per user. If you type
/sbin/service pbs status
you can find out which scheduler you're using by checking which services are running. Popular schedulers are pbs_sched, maui, and moab.
I would also make sure that all 10 nodes are online. Depending on how your cluster is configured, you might be able to test this using
pbsnodes
You should also check that the batch queue exists and that there are no standing reservations.
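As a hedged sketch, assuming a TORQUE/PBS-style setup where the qmgr and pbsnodes tools are available, the following commands can help pinpoint a per-queue node cap or offline nodes:
# show the queue definition, including any resources_max.nodes limit
qmgr -c "print queue batch"
# list nodes that are currently down, offline, or unknown
pbsnodes -l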
My program uses MPI+pthreads, where n-1 MPI processes are pure MPI code and only one MPI process uses pthreads. That process contains only 2 threads (the main thread and one pthread). Suppose that the HPC cluster I want to run this program on consists of compute nodes, each of which has 12 cores. How should I write my batch script to maximise utilization of the hardware?
The following is the batch script I wrote. I use export OMP_NUM_THREADS=2 because the last MPI process has 2 threads, and I have to assume that the others have 2 threads each as well.
I then allocate 6 MPI processes per node, so each node can run 6 x OMP_NUM_THREADS = 12 threads (= the number of cores on each node), despite the fact that all MPI processes but one have 1 thread.
#BSUB -J LOOP.N200.L1000_SIMPLE_THREAD
#BSUB -o LOOP.%J
#BSUB -W 00:10
#BSUB -M 1024
#BSUB -N
#BSUB -a openmpi
#BSUB -n 20
#BSUB -m xxx
#BSUB -R "span[ptile=6]"
#BSUB -x
export OMP_NUM_THREADS=2
How can I write a better script for this ?
The following should work if you'd like the last rank to be the hybrid one:
#BSUB -n 20
#BSUB -R "span[ptile=12]"
#BSUB -x
$MPIEXEC $FLAGS_MPI_BATCH -n 19 -x OMP_NUM_THREADS=1 ./program : \
$FLAGS_MPI_BATCH -n 1 -x OMP_NUM_THREADS=2 ./program
If you'd like rank 0 to be the hybrid one, simply switch the two lines:
$MPIEXEC $FLAGS_MPI_BATCH -n 1 -x OMP_NUM_THREADS=2 ./program : \
$FLAGS_MPI_BATCH -n 19 -x OMP_NUM_THREADS=1 ./program
This utilises the ability of Open MPI to launch MIMD programs.
You mention that your hybrid rank uses POSIX threads and yet you are setting an OpenMP-related environment variable. If you are not really using OpenMP, you don't have to set OMP_NUM_THREADS at all and this simple mpiexec command should suffice:
$MPIEXEC $FLAGS_MPI_BATCH ./program
(in case my guess about the educational institution where you study or work turns out to be wrong, remove $FLAGS_MPI_BATCH and replace $MPIEXEC with mpiexec)
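For reference, a hedged sketch of the same MIMD launch without the site-specific wrapper variables, assuming Open MPI's mpiexec and its -x option for exporting environment variables:
mpiexec -n 19 -x OMP_NUM_THREADS=1 ./program : \
        -n 1 -x OMP_NUM_THREADS=2 ./program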
It's been a while since I've used LSF, so this might not be totally correct; you should experiment with it.
I read your request
#BSUB -n 20
#BSUB -R "span[ptile=6]"
as: a total of 20 tasks, with 6 tasks per node. That means you will get 4 nodes, which seems a waste, as you stated that each node has 12 cores.
How about using all the cores on the nodes, since you have requested exclusive hosts (-x):
#BSUB -x
#BSUB -n 20
#BSUB -R "span[ptile=12]"
export OMP_NUM_THREADS=2
This way you know that ranks
0..11 are on the first host
12..19 are on the second host
whereby the second host has spare slots to make use of the OpenMP'ness of rank 19.
Of course, if you are getting into even funnier placements, LSF allows you to shape the job placement using LSB_PJL_TASK_GEOMETRY.
Let's say you had 25 MPI tasks, with rank number 5 using 12 cores:
#BSUB -x
#BSUB -n 25
#BSUB -R "span[ptile=12]"
export LSB_PJL_TASK_GEOMETRY="{(0,1,2,3,4,6,7,8,9,10,11,12)\
(13,14,15,16,17,18,19,20,21,22,23,24)\
(5)}"
This way, task 5 gets its own node.
Evening,
I am running a lot of wget commands using xargs
cat urls.txt | xargs -n 1 -P 10 wget -q -t 2 --timeout 10 --dns-timeout 10 --connect-timeout 10 --read-timeout 20
However, once the file has been parsed, some of the wget instances 'hang'. I can still see them in the system monitor, and it can take about 2 minutes for them all to complete.
Is there any way I can specify that an instance should be killed after 10 seconds? I can re-download all the URLs that failed later.
In the system monitor, the wget instances are shown as sk_wait_data when they hang. xargs is there as 'do_wait', but wget seems to be the issue, as once I kill them, my script continues.
I believe this should do it:
wget -v -t 2 --timeout 10
According to the docs:
--timeout: Set the network timeout to seconds seconds. This is equivalent to specifying
--dns-timeout, --connect-timeout, and --read-timeout, all at the same time.
Check the verbose output too and see more of what it's doing.
Also, you can try:
timeout 10 wget -v -t 2
Or you can do what timeout does internally:
( cmdpid=$BASHPID; (sleep 10; kill $cmdpid) & exec wget -v -t 2 )
(As seen in: BASH FAQ entry #68: "How do I run a command, and have it abort (timeout) after N seconds?")
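If you'd rather keep your original xargs pipeline, a hedged sketch is to wrap each wget in coreutils timeout so a hung instance is killed outright (the 30-second hard limit here is an arbitrary choice; adjust to taste):
cat urls.txt | xargs -n 1 -P 10 timeout 30 wget -q -t 2 --timeout 10 --dns-timeout 10 --connect-timeout 10 --read-timeout 20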
GNU Parallel can download in parallel, and retry the process after a timeout:
cat urls.txt | parallel -j10 --timeout 10 --retries 3 wget -q -t 2
If the time it takes to fetch a URL changes (e.g. due to a faster internet connection), you can let GNU Parallel figure out the timeout:
cat urls.txt | parallel -j10 --timeout 1000% --retries 3 wget -q -t 2
This will make GNU Parallel record the median time for a successful job and set the timeout dynamically to 10 times that.
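Since you mentioned re-downloading failed URLs later, a hedged option is to record a job log and re-run only the failures (wget.log is just a file name chosen here):
cat urls.txt | parallel -j10 --timeout 10 --retries 3 --joblog wget.log wget -q -t 2
# later, re-run only the jobs that failed according to the log
parallel --retry-failed --joblog wget.log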
For a PBS script called with qsub, I want to know how many total CPUs have actually been allocated, in case the number defined in the PBS file is overridden by inputs from the command line. For example, with the following PBS script file:
jobscript.pbs:
#!/bin/bash
#PBS -N test_run
#PBS -l nodes=32
#PBS -l walltime=06:00:00
#PBS -j oe
#PBS -q normal
#PBS -o output.txt
cd $PBS_O_WORKDIR
module load gcc-openmpi-1.2.7
time mpiexec visct
This script could be run with just 16 CPUs (instead of 32) using the following command line:
$ qsub -l nodes=2:ppn=8 jobscript.pbs
So I would like a robust method for determining how many CPUs are actually available from within the script.
I was able to answer my own question with the following solution, which uses the $PBS_NODEFILE environment variable. It contains the path to a file listing the nodes allocated to the job, one line per allocated CPU:
jobscript.pbs:
#!/bin/bash
#PBS -N test_run
#PBS -l nodes=32
#PBS -l walltime=06:00:00
#PBS -j oe
#PBS -q normal
#PBS -o output.txt
# Count the lines in the node file: one line per allocated CPU slot
NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
echo "Total CPU count = $NP"
Thanks to "Source" after much online searching.
MasterHD, I know you have found your answer, but I thought I would share another way.
This code is longer, but it helps with my specific needs. I actually use pbsnodes commands. Below is a snippet of my code:
# sum the "pcpus" values reported by pbsnodes across all nodes
@nodes_whole = `pbsnodes -av -s $server | grep "pcpus"`;
$nodes_count = `pbsnodes -av -s $server | grep "pcpus" | wc -l`;
$i = 0;
$cpu_whole_count = 0;
while ($i < $nodes_count) {
    @cpu_present = split(/\s+/, $nodes_whole[$i]);  # e.g. "    pcpus = 16"
    $cpu_whole_count += $cpu_present[3];            # field 3 is the count
    $i++;
}
I do this because in my script I check things like the number of CPUs, which varies depending on the node: it may be 4, 8, or 16. Also, I have multiple clusters which are always changing size, and I don't want the script to have specific cluster or node info hard-coded. Mainly, I do this because when a user submits a job I check how many resources they can use. If, say, they want to use a queue and request 200 CPUs, and on cluster A their job will be queued, my script can tell them they will be queued there but would not be on cluster B or D. They then have the option to change before they submit.
I also use it to check for nodes down:
@nodes_down = `pbsnodes -l -s $server`;
I see what resources are in use:
@nodes_used = `pbsnodes -av -s $server | grep "resources_assigned.ncpus"`;
Also, in one case I have two clusters running off one head node while I wait for hardware. In that case I check which cluster a node is assigned to and then do a count based on the nodes assigned to that cluster. That way, all the users see is another cluster, and they use the script the way they would for any of the other clusters.
I just mention it because I have found a lot of useful ways to use pbsnodes, and it has worked well for my particular needs.
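If a Perl wrapper is more than you need, here is a hedged one-liner sketch in shell that sums the same pcpus fields, assuming pbsnodes -av prints lines of the form "pcpus = 16" as the snippet above expects:
pbsnodes -av | awk '$1 == "pcpus" { total += $3 } END { print "Total CPUs:", total }'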
I am doing MPI programming on a cluster with 8 nodes, each having an Intel Xeon hex-core processor. I need three processors for my MPI code.
I submit the job using qsub. When I check which processors the job is running on using "qstat -n", it says something like cn004/0*3.
So does this mean it is running on only one processor?
I ask because it is not speeding up compared to when I use a single processor (the domain size is the same in both cases).
The script I use for submitting is as follows:
#! /bin/bash
#PBS -o logfile.log
#PBS -e errorfile.err
#PBS -l cput=40:00:00
#PBS -lselect=1:ncpus=3:ngpus=3
#PBS -lplace=excl
cat $PBS_NODEFILE
cd $PBS_O_WORKDIR
mpicc -g -W -c -I /usr/local/cuda/include mpi1.c
mpicc -g -W mpi1.o -L /usr/local/cuda/lib64 -lOpenCL
mpirun -np 3 ./a.out
"qstat -n" it says something like cn004/0*3.
Q: So does this mean it is running it on only one processor ??
The short answer is "no". This does not mean that it runs on one processor.
"cn004/0*3" should be interpreted as "The job is allocated three cpu cores. And if we were to number the cores from 0 to 5 then the cores allocated would have numbers 0,1,and 2".
If another job were to run on the node it would receive the next three consecutive numbers "3,4, and 5". In the qstat -n output this would look like "cn004/3*3".
You use the directive place=excl to ensure that other jobs would not get the node, so essentially all the six cores are available.
Now for your second question:
Q: it is not speeding up compared to when I use a single processor
In order to answer this question, we need to know whether the algorithm is parallelized correctly.
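As a quick sanity check, a sketch assuming the mpirun already used in your script, you can verify from inside the job that three separate processes really get launched on the allocated slots:
# print the slots PBS handed to the job (one line per allocated CPU)
cat $PBS_NODEFILE
# launch 3 trivial processes; the node name should be printed three times
mpirun -np 3 hostname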