Can the load in an SGE node be more than the number of CPUs? - cpu-usage

I am running a job on a Sun Grid Engine (now known as Oracle Grid Engine) cluster. To see whether my job is slowing down because the node is overloaded, I tried to check the status of the node:
$ qstat -l hostname=hnode03 -f
queuename qtype resv/used/tot. load_avg arch states
--------------------------------------------------------------------------------- BP 0/0/0 103.41 lx24-amd64
highmem.q@hnode03.rnd.mycorp BP 0/37/40 103.41 lx24-amd64
977530 0.76963 runJob1 userme r 09/13/2013 17:53:26 2
threaded.q@hnode03.rnd.mycor BP 0/24/32 103.41 lx24-amd64
workflow.q@hnode03.rnd.mycor B 0/0/0 103.41 lx24-amd64
$ qhost -h hnode03
global - - - - - - -
hnode03 lx24-amd64 64 103.4 504.8G 122.9G 16.0G 58.0M
Now, the load_avg is 103.41, while the NCPU is only 64. Is this ever supposed to happen? Are some jobs using more CPU than the slots they are assigned?
Update: In response to queries, the configurations are uploaded to

Yes, it can.
Slots are not synonymous with cores (NCPU). A slot should be seen as "how many jobs can be scheduled in parallel on a node."
If you only want one job to run at a time, set the slot count for your machines to one.
As for the load: even if your job occupies only one slot, if it starts too many threads or subprocesses, all the cores will be busy and the load per core will go above 1.
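As a rough illustration of the point above, here is a minimal Python sketch that divides a load average by the CPU count. The helper name load_per_cpu is made up for this example; the figures come from the qhost output in the question. A result above 1 suggests that more tasks want to run than there are cores.

```python
import os


def load_per_cpu(load_avg, ncpu):
    """Return the load average divided by the number of CPUs."""
    return load_avg / ncpu


if __name__ == "__main__":
    # Figures from the qhost output in the question: load 103.4 on 64 CPUs.
    print(round(load_per_cpu(103.4, 64), 2))

    # The same check for the local machine (os.getloadavg is Unix-only).
    one_minute, _, _ = os.getloadavg()
    print(round(load_per_cpu(one_minute, os.cpu_count()), 2))
```

This only tells you that the node is oversubscribed, not which job is responsible; for that you would still have to look at per-process CPU usage on the node.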


Stress-ng stress memory with specific percentage

I am trying to stress an Ubuntu container's memory. Typing free in my command terminal provides the following result:
free -m
total used free shared buff/cache available
Mem: 7958 585 6246 401 1126 6743
Swap: 2048 0 2048
I want to stress exactly 10% of the total available memory. Per stress-ng manual:
-m N, --vm N
start N workers continuously calling mmap(2)/munmap(2) and writing to the allocated
memory. Note that this can cause systems to trip the kernel OOM killer on Linux
systems if not enough physical memory and swap is not available.
--vm-bytes N
mmap N bytes per vm worker, the default is 256MB. One can specify the size as % of
total available memory or in units of Bytes, KBytes, MBytes and GBytes using the
suffix b, k, m or g.
Now, on my target container I run two memory stressors to occupy 10% of my memory:
stress-ng -vm 2 --vm-bytes 10% -t 10
However, the memory usage on the container never reaches 10%, no matter how many times I run it. I tried different timeout values, with no result. The closest it gets is 8.9%; it never reaches 10%. I inspect memory usage on my container this way:
docker stats --no-stream kind_sinoussi
c3fc7a103929 kind_sinoussi 199.01% 638.4MiB / 7.772GiB 8.02% 1.45kB / 0B 0B / 0B 7
In an attempt to understand this behaviour, I tried running the same command with an exact size in bytes. In my case, I'll opt for 800 MB, since 7958m * 0.1 = 795.8m ~ 800m.
stress-ng -vm 2 --vm-bytes 800m -t 15
And, I get 10%!
docker stats --no-stream kind_sinoussi
c3fc7a103929 kind_sinoussi 198.51% 815.2MiB / 7.772GiB 10.24% 1.45kB / 0B 0B / 0B 7
Can someone explain why this is happening?
Another question, is it possible for stress-ng to stress memory usage to 100%?
stress-ng --vm-bytes 10% will use sysconf(_SC_AVPHYS_PAGES) to determine the available memory. This sysconf() call returns the number of pages that the application can use without hindering any other process, so the percentage is taken from approximately what the free command reports in the free column, not from the total.
Note that stress-ng will allocate the memory with mmap, so it may be that during run time mmap'd pages may not necessarily be physically backed at the time you check how much real memory is being used.
It may be worth trying to also use the --vm-populate option; this will try and ensure the pages are physically populated on the mmap'd memory that stress-ng is exercising. Also try --vm-madvise willneed to use the madvise() system call to hint that the pages will be required fairly soon.
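To make the arithmetic concrete, here is a small Python sketch. The helper names are invented for illustration, and it assumes a Linux system where os.sysconf exposes SC_AVPHYS_PAGES. It compares 10% of the "free" figure with 10% of the "total" figure from the free -m output in the question, and then shows the same calculation for the machine it runs on.

```python
import os


def percent_of(amount, percent):
    """Return the given percentage of an amount."""
    return amount * percent / 100


def available_physical_bytes():
    """Available physical pages times the page size (Linux sysconf names)."""
    return os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


if __name__ == "__main__":
    # Values in MiB taken from the free -m output in the question.
    print("10% of free :", percent_of(6246, 10), "MiB")
    print("10% of total:", percent_of(7958, 10), "MiB")

    # The same calculation on this machine, converted to MiB.
    print("10% of available here:",
          round(percent_of(available_physical_bytes(), 10) / 2**20, 1), "MiB")
```

Ten percent of the free figure is noticeably smaller than ten percent of the total, which is in line with docker stats reporting around 8% rather than 10%.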

perf_event_open and PERF_COUNT_HW_INSTRUCTIONS

I'm trying to profile an existing application with a quite complicated structure. For now I am using perf_event_open and the needed ioctl calls for enabling the events which are of my interest.
The manpage states that PERF_COUNT_HW_INSTRUCTIONS should be used carefully, so which event should be preferred on a Skylake processor? Maybe a specific Intel PMU event?
The perf_event_open manpage says:
PERF_COUNT_HW_INSTRUCTIONS Retired instructions. Be careful, these can be affected by various issues, most notably hardware interrupt counts.
I think this means that PERF_COUNT_HW_INSTRUCTIONS can be used (and it is supported almost everywhere), but the exact value for a given code fragment may differ slightly between runs because of noise from interrupts or other logic.
So it is safe to use the PERF_COUNT_HW_INSTRUCTIONS and PERF_COUNT_HW_CPU_CYCLES events on most CPUs. The perf_events subsystem in the Linux kernel maps these generic events to the raw events best suited to the current CPU and its PMU.
Depending on your goals you should try to get some statistics on PERF_COUNT_HW_INSTRUCTIONS values for your code fragment. You can also check stability of this counter with several runs of perf stat with some simple program:
perf stat -e cycles:u,instructions:u /bin/echo 123
perf stat -e cycles:u,instructions:u /bin/echo 123
perf stat -e cycles:u,instructions:u /bin/echo 123
Or use integrated repeat function of perf stat:
perf stat --repeat 10 -e cycles:u,instructions:u /bin/echo 123
I see a variation of about ±10 instructions events (less than 0.1%) out of roughly 200 thousand instructions executed, so this counter is very stable. For cycles I see about 5% variation, so it is really the cycles event that deserves the "be careful" warning.
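If you collect the counter values from several runs yourself, a quick way to judge stability is the spread relative to the mean. The sketch below is only an illustration: the function name is made up and the sample readings are hypothetical values chosen to resemble the variation described above, not measured data.

```python
from statistics import mean


def relative_spread(samples):
    """Return (max - min) / mean for a list of counter readings."""
    return (max(samples) - min(samples)) / mean(samples)


if __name__ == "__main__":
    # Hypothetical readings from repeated runs (not real measurements).
    instructions = [200_000, 200_010, 199_990]
    cycles = [300_000, 310_000, 295_000]

    print(f"instructions: {relative_spread(instructions):.4%}")
    print(f"cycles:       {relative_spread(cycles):.4%}")
```

A counter whose spread stays far below the effect you are trying to measure can be compared across runs directly; a noisier one needs more repetitions.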

Using MPI on Slurm, is there a way to send messages between separate jobs?

I'm new to using MPI (mpi4py) and Slurm. I need to run about 50000 tasks, so to obey the administrator-set limit of about 1000, I've been running them like this:
for i in {1..50}
do
    sbatch $i
    sleep 0.1
done
#SBATCH --job-name=mpi
#SBATCH --output=mpi_50000.out
#SBATCH --time=0:10:00
#SBATCH --ntasks=1000
srun --mpi=pmi2 --output=mpi_50k${1}.out python data_50000.pkl ${1} > ${1}py.out 2> ${1}.err
The Python script (irrelevant stuff omitted):
offset = (int(sys.argv[2])-1)*1000
k = comm.Get_rank()
d = data[k+offset]
# ... do something with d ...
allresults = comm.gather(result, root=0)
if k == 0:
Is this a sensible way to get around the limit of 1000 tasks?
Is there a better way to consolidate results? I now have 50 files I have to concatenate manually. Is there some comm_world that can exist between different jobs?
I think you need to make your application divide the work among 1000 tasks (MPI ranks) and then consolidate the results with MPI collective calls such as MPI_Reduce or MPI_Allreduce.
Trying to work around the limit won't help you, as the jobs you started will be queued one after another.
Job arrays will give behavior similar to what you did in the batch file you provided. So your application must still be able to process all data items given only N tasks (MPI ranks).
There is no need to poll to make sure all other jobs are finished; take a look at the Slurm job dependency parameter.
You can use a job dependency to create a new job that runs after all the other jobs finish; this job collects the results and merges them into one big file. I still believe you are overthinking the obvious solution: make rank 0 (the master) collect all results and save them to disk.
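One way to divide the work among a fixed number of ranks, sketched in Python under the assumption that the 50000 items are independent and indexed from 0. The function name items_for_rank is hypothetical. Each rank takes every size-th item (a round-robin split), so 1000 ranks handle 50 items each inside a single job.

```python
def items_for_rank(rank, size, n_items):
    """Indices handled by one rank under a round-robin split."""
    return range(rank, n_items, size)


if __name__ == "__main__":
    # With mpi4py, rank and size would come from comm.Get_rank() and
    # comm.Get_size(); fixed values are used here so the sketch runs anywhere.
    rank, size, n_items = 7, 1000, 50000
    mine = items_for_rank(rank, size, n_items)
    print(len(mine), list(mine)[:3])
```

Each rank would loop over its own indices, keep a list of results, and hand that list to the existing comm.gather call, so rank 0 ends up with everything and can write a single output file.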
This looks like a perfect candidate for job arrays. Each job in an array is identical with the exception of a $SLURM_ARRAY_TASK_ID environment variable. You can use this in the same way that you're using the command line variable.
(You'll need to check that MaxArraySize is set high enough by your sysadmin. Check the output of scontrol show config | grep MaxArraySize )
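A minimal sketch of how the script could pick up the array index instead of a command-line argument. It assumes the array is submitted with indices 1 to 50 (for example with sbatch --array=1-50) and that each array element covers 1000 items, as in the question; the names CHUNK and offset_for are invented for this example.

```python
import os

CHUNK = 1000  # items per array element, matching the question's limit


def offset_for(task_id, chunk=CHUNK):
    """Start index of the data slice for a 1-based array task id."""
    return (task_id - 1) * chunk


if __name__ == "__main__":
    # Slurm sets SLURM_ARRAY_TASK_ID for each array element; default to 1
    # so the sketch also runs outside of Slurm.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
    print(offset_for(task_id))
```

The returned offset plays the same role as the offset computed from sys.argv in the question's script.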
What do you mean by 50000 tasks?
1) Do you mean one MPI job with 50000 MPI tasks?
2) Or do you mean 50000 independent serial programs?
3) Or do you mean any combination where (number of MPI jobs) * (number of tasks per job) = 50000?
If 1), well, consult your system administrator. Of course you can allocate 50000 slots in multiple SLURM jobs, manually wait until they are all running at the same time, and then mpirun your app outside of SLURM. This is both ugly and inefficient, and you might get in trouble if it is seen as an attempt to circumvent the system limits.
If 2) or 3), then a job array is a good candidate. And if I understand your app correctly, you will need an extra post-processing step to concatenate/merge all your outputs into a single file.
And if you go with 3), you will need to find the sweet spot.
(Generally speaking, 50000 serial programs are more efficient than fifty 1000-task MPI jobs or one 50000-task MPI program, but merging 50000 files is less efficient than merging 50 files, or not merging anything at all.)
And if you cannot do the post-processing on a frontend node, then you can use a job dependency to start it after all the computations have completed.

vmem and maxvmem

I have a question about vmem and maxvmem.
I searched on the web but there are really many confusing explanations of the
two words.
What I did was to type:
qstat -j 1154926 | grep vmem
The output was:
cpu=00:05:25, mem=23.21121 GBs, io=2.70481, vmem=239.277M, maxvmem=351.359M
Can anyone help me to understand the meaning of the variables?
The vmem variable states the amount of virtual memory the job is using at the moment qstat is run. The maxvmem variable states the peak virtual memory the job has used so far during its lifetime.
I found this link to be helpful.
qstat can provide more detailed information about a running job by passing it a job id specified by the '-j' argument:
[jamesa@codon sge_test]$ qstat -j 3804867
job_number: 3804867
exec_file: job_scripts/3804867
submission_time: Wed Jun 29 11:04:02 2011
owner: jamesa
uid: 1001
group: bss-staff
gid: 50001
sge_o_home: /home/jamesa
sge_o_log_name: jamesa
sge_o_path: /usr/lib64/openmpi/bin:/usr/lib64/openmpi/1.4-gcc/bin:/opt/sge/6_2u5_1/bin/lx26-amd64:/usr/lib64/qt-3.3/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lib/jvm/java/bin:/usr/lib/jvm/jre/bin:/usr/biosoft/bin:/usr/biosoft/perl_modules/bioperl/current/bin:/usr/biosoft/packages/emboss/current/bin:/home/jamesa/bin
sge_o_shell: /bin/bash
sge_o_workdir: /home/jamesa/sge_test
sge_o_host: codon
account: sge
cwd: /home/jamesa/sge_test
merge: y
hard resource_list: h_rt=3600,h_vmem=2G
notify: FALSE
job_name: blast
jobshare: 0
usage 1: cpu=00:01:37, mem=78.06343 GBs, io=0.00002, vmem=925.562M, maxvmem=1.012G
scheduling info: There are no messages available
You can see the resources you requested at submission time on the hard resource_list line of the qstat output. The actual resources used by the job at the time qstat was executed are shown in the usage line.
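If you want to compare vmem and maxvmem programmatically, the usage line can be split into key/value pairs. The following Python sketch is only an illustration: the function names are made up, and it assumes that the K/M/G suffixes in the output are 1024-based multiples.

```python
import re

# Assumption: K/M/G suffixes in qstat output are binary (1024-based).
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_usage(line):
    """Split 'key=value, key=value' pairs from a usage line into a dict."""
    return {key: value.strip()
            for key, value in re.findall(r"(\w+)=([^,]+)", line)}


def to_bytes(size):
    """Convert a size such as '239.277M' to a number of bytes."""
    return float(size[:-1]) * _UNITS[size[-1].upper()]


if __name__ == "__main__":
    line = ("cpu=00:05:25, mem=23.21121 GBs, io=2.70481, "
            "vmem=239.277M, maxvmem=351.359M")
    usage = parse_usage(line)
    print(usage["vmem"], usage["maxvmem"])
    print(to_bytes(usage["maxvmem"]) >= to_bytes(usage["vmem"]))
```

Comparing maxvmem against the h_vmem value requested at submission shows how much headroom the job had.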
Email Reporting
A summary of the job's resource usage can be obtained by requesting an email report following the job's completion, using the -m e qsub directive (see advanced submission options for details). An example email report is shown below.
Job 3804869 (blast) Complete
User = jamesa
Queue =
Host =
Start Time = 06/29/2011 11:26:08
End Time = 06/29/2011 11:30:43
User Time = 00:03:51
System Time = 00:00:12
Wallclock Time = 00:04:35
CPU = 00:04:04
Max vmem = 1.012G
Exit Status = 0
The total memory used is indicated by the Max vmem field, while the total runtime is reported as Wallclock Time. Care should be taken when submitting a large number of jobs with email reporting enabled, since you may find yourself receiving several thousand reports in individual emails.

Who knows the history of Unix fork?

Fork is a great tool in Unix. We can use it to create a copy of our process and then change its behaviour. But I don't know the history of fork.
Can someone tell me the story?
Actually, unlike many of the basic UNIX features, fork was a relative latecomer (a).
The earliest existence of multiple processes within UNIX consisted of a few (fixed number of) processes, one per terminal that was attached to the PDP-7 machine (b).
The basic idea was that the shell process for a given terminal would accept a command from the user, locate the program file, load a small bootstrap program into high memory and jump to it, passing enough details for the bootstrap code to load the program file.
The bootstrap code, after loading the program into low memory (overwriting the shell), would then jump to it.
When the program was finished, it would call exit but it wasn't like the exit we know and love today. This exit would simply reload the shell and run it using pretty much the same method used to load the program in the first place.
So it was really more like a rudimentary exec command, the one that replaces your current program with another, in the same process space.
The shell would exec your program then, when your program was done, it would again exec the shell by calling exit.
This method was similar to that found in many other interactive systems at the time, including Multics, from which UNIX got its name.
From the two-way exec, it was actually not that big a leap to adding fork as a process duplicator to work in conjunction. While many systems run another program directly, it's this "just add what's needed" method which is responsible for the separation of duties between fork and exec in UNIX. It also resulted in a very simple fork function.
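For readers who have not used the pair directly, the separation of duties described above looks roughly like this in modern terms. This is a Python sketch using the os module on a Unix-like system, not the historical code; the function name run is made up, and it assumes the child exits normally.

```python
import os
import sys


def run(argv):
    """Fork a child, exec argv in it, and return the child's exit status."""
    pid = os.fork()                  # duplicate the current process
    if pid == 0:
        # Child: replace this process image with the requested program.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)            # only reached if exec failed
    # Parent: wait for the child, much as a shell waits for a command.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)    # assumes the child exited normally


if __name__ == "__main__":
    code = run([sys.executable, "-c", "import sys; sys.exit(3)"])
    print("child exited with", code)
```

fork only duplicates and exec only replaces; everything a shell wants to change for the child, such as redirections, can happen in the child between the two calls.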
If you're interested in the early history of various features(c) of Unix, you cannot go past the article The Evolution of the Unix Time-Sharing System by Dennis Ritchie, presented at a 1979 conference in Australia, and subsequently published by AT&T.
(a) Though I mean latecomer in the sense that the separation of the four fundamental forces in the universe was "late", happening some 0.00000000001 seconds after the big bang.</humour>.
(b) Since a question was raised in a comment as to how the shells were originally started off, there's a great resource holding very early source code for Unix over at The Unix Heritage Society, specifically the source code archives and, in particular, the first edition.
The init.s file from the first edition shows how the fixed number of shell processes were created (slightly reformatted):
mov $itab, r1 / address of table to r1
mov (r1)+, r0 / 'x, x=0, 1... to r0
beq 1f / branch if table end
movb r0, ttyx+8 / put symbol in ttyx
jsr pc, dfork / go to make new init for this ttyx
mov r0, (r1)+ / save child id in word offer '0, '1, etc
br 1b / set up next child
'0; ..
'1; ..
'2; ..
'3; ..
'4; ..
'5; ..
'6; ..
'7; ..
Here you can see the snippet which creates the processes for each connected terminal. These are the days of hard-coded values, no auto detection of terminal quantity involved. The zero-terminated table at itab is used to create a number of processes and hopefully the comments from the code explain how (the only possibly tricky bit is the labels - though there are multiple 1 labels, you branch to the nearest one in a given direction, hence 1b means the closest 1 label in the backwards direction).
The code shown simply processes the table, calling dfork to create a process for each terminal and start getty, the login prompt. The getty program, in turn, eventually started the shell. From that point, it's as I described in the main part of this answer.
(c) No paths (and use of temporary links to get around this limitation), limited processes, why there's a GECOS field in the password file, and all sorts of other trivia, generally interesting only to uber-geeks, of course.
