numactl --physcpubind processor migration - mpi

I'm trying to launch my MPI application (Open MPI 1.4.5) with numactl. Since load balancing with --cpunodebind apparently doesn't distribute my processes round-robin among the available nodes, I wanted to restrict my processes to a fixed set of CPUs. In this way I plan to ensure a balanced load between the nodes in terms of the number of threads running on each node. According to the numactl manual, --physcpubind seems to do the job.
The problem is, from what I could extract from this post, that with --physcpubind processes are allowed to migrate inside this CPU set. Another problem is that some CPUs from this set remain unused while others are assigned two or more processes, which then run at only 50% CPU usage or less. Why is this happening, and is there any workaround?
Kind regards

I think you can try this (It worked for me):
numactl --cpunodebind={cpu-core} chrt -r 98 {your-app}
The chrt command lets you set a scheduling policy; you can choose among the following:
Policy options:
  -b, --batch     set policy to SCHED_BATCH
  -d, --deadline  set policy to SCHED_DEADLINE
  -f, --fifo      set policy to SCHED_FIFO
  -i, --idle      set policy to SCHED_IDLE
  -o, --other     set policy to SCHED_OTHER
  -r, --rr        set policy to SCHED_RR (default)
EDIT: The number 98 is the priority; in my case I am running a time-critical process.
Also, you may need to isolate the CPUs you are using, to prevent the scheduler from assigning other processes to them or moving your processes off them.
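On the migration part of the question: an affinity mask with several CPUs only limits where a process may run; the kernel is still free to move it between the CPUs in that mask. One way around this is to give every process a mask with exactly one CPU. Below is a minimal Python sketch of that idea for Linux. The helper names are made up for illustration, and it assumes the launcher hands each process some per-node rank (for Open MPI this might come from an environment variable such as OMPI_COMM_WORLD_LOCAL_RANK, but check what your version actually exports).

```python
import os


def cpu_for_rank(local_rank, cpus):
    """Pick one CPU per process by cycling round-robin through the allowed set."""
    if not cpus:
        raise ValueError("cpus must not be empty")
    return cpus[local_rank % len(cpus)]


def pin_current_process(cpu):
    """Restrict the calling process to a single CPU.

    os.sched_setaffinity is only available on Linux, so this is a no-op
    on platforms that do not provide it.
    """
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu})  # pid 0 means "the calling process"


# Example (hypothetical CPU list): rank 5 on CPUs 0, 2, 4, 6 lands on CPU 2.
# pin_current_process(cpu_for_rank(5, [0, 2, 4, 6]))
```

With a one-CPU mask there is nothing left to migrate to, and as long as there are no more processes than CPUs in the list, no two processes share a CPU.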

Does a call to BPXBATCH from JCL use the priority of the batch job or is priority in OMVS independent?

I am calling a shell script that does some processing from JCL using BPXBATCH like this:
//STEP2 EXEC PGM=BPXBATCH,
// PARM='SH PATHTOSCRIPT.SH MYARGUMENT'
The JCL has the service class with the highest priority. However, the shell script enters a queue waiting for resources. Sometimes it runs quickly, and other times it waits a long time for resources. The priority of the JCL seems to be independent of the shell script. I read that using the "nice" command in Unix might increase the priority of the shell script.
I want to be sure first that the priority of a JCL job on z/OS doesn't affect the priority of a Unix process that was called from that JCL through BPXBATCH. I cannot find any documentation about it.
Short Answer
To answer your question first: BPXBATCH runs in one address space, and the shell runs in a second address space. Commands issued by the shell may run in the same address space as the shell, or may run in additional address spaces.
The BPXBATCH address space has a service class, and the shell address space(s) have a service class, probably a different one. Each service class has its own performance goal, and this tells the system how to manage that work.
Detailed Answer
The z/OS workload manager (WLM) is responsible for assigning work to a service class when it is presented with new work. Service classes specify performance goals and importance levels, not priorities. WLM manages all work in the system according to its performance goal, based on the importance of the goal.
There are several workload management subsystems that may start new work. Examples of such subsystems are:
JES, which manages batch work, i.e. batch jobs.
TSO, which manages interactive TSO user work (TSO login).
OMVS, which manages forked, and non-locally spawned z/OS UNIX work.
STC, which manages started job workload.
This list is not complete; I listed only the subsystems that I need to answer the question.
When JES2/3 receives a job that shall run on the system, it presents some job attributes to WLM, and WLM assigns the job to a service class. It does so using WLM classification rules for subsystem type JES, and the attributes given.
Everything that runs in this job, i.e. in the job's address space, will be managed towards the performance goal of the service class assigned. This includes z/OS UNIX work that is run in this very address space, i.e. work that is not started via UNIX fork() or non-local spawn().
When a z/OS UNIX process starts a new process via fork(), or via non-local spawn(), this new work is handled by the WLM subsystem OMVS. The OMVS subsystem presents some attributes of the new process to WLM, and WLM assigns the process to a service class. It does so using WLM classification rules for subsystem type OMVS, and the attributes given. This kind of work always runs in a separate, new address space.
BPXBATCH starts the (first) UNIX command it is told via PARM=, or //STDPARM, as a new process using either fork() or spawn(). The spawn() may be a local spawn() or a non-local spawn(). Which one is done depends on many factors, too complex to explain here.
The important point here is that, when running BPXBATCH with PARM='SH ...', the shell process will always run in a separate, new address space and will be classified via WLM subsystem OMVS.
The result is that BPXBATCH runs in one address space with its service class, and the shell runs in a second address space with its service class. The service classes may be the same, but usually they are different WLM definitions with different performance goals.
As a starter, have a look at z/OS MVS Planning: Workload Management
nice() on z/OS UNIX
nice() has no effect on z/OS UNIX unless the system has been set up to support it. There is a parameter PRIORITYGOAL(...) in the BPXPRMxx parmlib member to set up a list of up to 40 WLM service classes that will be used in conjunction with nice(). I have never heard of anyone having set this parameter.
See z/OS MVS Initialization & Tuning Reference for details about the BPXPRMxx member.

How can I configure yarn cluster for parallel execution of Applications?

When I run Spark jobs on a YARN cluster, the applications run one after another in a queue. How can I run a number of applications in parallel?
I suppose your YARN scheduler is set to FIFO. Please change it to the Fair Scheduler or the Capacity Scheduler. The Fair Scheduler attempts to allocate resources so that all running applications get the same share of resources.
The Capacity Scheduler allows sharing of a Hadoop cluster along
organizational lines, whereby each organization is allocated a certain
capacity of the overall cluster. Each organization is set up with a
dedicated queue that is configured to use a given fraction of the
cluster capacity. Queues may be further divided in hierarchical
fashion, allowing each organization to share its cluster allowance
between different groups of users within the organization. Within a
queue, applications are scheduled using FIFO scheduling.
If you are using the Capacity Scheduler, then:
In spark-submit, specify your queue with --queue queueName.
Try changing this Capacity Scheduler property:
yarn.scheduler.capacity.maximum-applications = any number
It caps how many applications can be active at the same time.
By default, Spark will acquire all available resources when it launches a job.
You can limit the amount of resources consumed for each job via the spark-submit command.
Add the option "--conf spark.cores.max=1" to spark-submit. You can change the number of cores to suit your environment. For example, if you have 100 total cores, you might limit a single job to 25 cores or 5 cores, etc.
You can also limit the amount of memory consumed: --conf spark.executor.memory=4g
You can change settings via spark-submit or in the file conf/spark-defaults.conf. Here is a link with documentation:
Spark Configuration
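To make the options from the answers above concrete, here is a small Python sketch that assembles a spark-submit command line with a queue and per-job resource limits. The function name, the application file and the queue name are placeholders, not part of any Spark API. Note that Spark's documentation describes spark.cores.max for standalone and Mesos clusters; on YARN the per-job size is usually set with keys such as spark.executor.instances and spark.executor.cores, so treat the keys in the example as something to adapt to your setup.

```python
def build_spark_submit(app, queue=None, conf=None):
    """Return the argument list for a spark-submit call on YARN.

    app   -- path to the application (placeholder value in the example)
    queue -- YARN queue name passed via --queue, if given
    conf  -- dict of Spark properties, each passed as --conf key=value
    """
    args = ["spark-submit", "--master", "yarn"]
    if queue:
        args += ["--queue", queue]
    for key, value in (conf or {}).items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


# Example: cap one job so that several jobs fit on the cluster side by side.
command = build_spark_submit(
    "my_job.py",                      # hypothetical application file
    queue="queueName",
    conf={
        "spark.executor.instances": 2,
        "spark.executor.cores": 2,
        "spark.executor.memory": "4g",
    },
)
print(" ".join(command))
```

The resulting list can be handed to subprocess.run(command) as is, and the same properties can instead go into conf/spark-defaults.conf if they should apply to every job.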

Is using the -L flag and a addprocs script the more powerful version of -p and --machinefile?

So I have a moderately complex set of requirements for my worker processes.
I want to use the master-slave topology and a non-default working directory.
I also want to mix both local and remote workers.
As far as I can tell from reading the --machine-file section of the documentation, it will not let me do that.
So I am looking at the -L <file> parameter:
>julia -h
...
-L, --load <file> Load <file> immediately on all processors
...
So if I do not use the -p or --machine-file flags, there is initially only one processor, so "all processors" just means the only processor.
So I tried this out
start_workers.jl
addprocs([
        ("cluster_c4_1", :auto),
        ("cluster_c4_2", :auto)
    ],
    dir="/mnt/",
    topology=:master_slave
)
addprocs(
    dir="/mnt/",
    topology=:master_slave
)
test.jl
println("*************")
println(workers())
println("-------------")
Running it:
>julia -L start_workers.jl test.jl
*************
[2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]
-------------
So it looks all good, got my 20 workers.
Have I done anything unreasonable? Is this the best way?
That's exactly how I'm deploying it on an HPC cluster under the Torque scheduler. In fact, I'm in the process of rewriting the cluster manager to support more options when adding processes through the Torque scheduling system in particular, so I've spent quite a bit of time looking into this.
You might also want to be aware of the ClusterManagers package, Pkg.add("ClusterManagers"), which extends the abilities of addprocs in a variety of environments, such as when you need to request the resources from a scheduler. It looks like passwordless ssh is possible for you, so the default cluster manager is sufficient in your case.
I don't believe there is any way of defining the extra topology and directory parameters on the command line, so your approach is correct.

How to see the process table in unix?

What's the UNIX command to see the process table? Remember that the table contains:
process status
pointers
process size
user ids
process ids
event descriptors
priority
etc
The "process table" as such lives in the kernel's memory. Some systems (such as AIX, Solaris and Linux--which is not "unix") have a /proc filesystem which makes those tables visible to ordinary programs. Without that, programs such as ps (on very old systems such as SunOS 4) required elevated privileges to read the /dev/kmem (kernel memory) special device, as well as having detailed knowledge about the kernel memory layout.
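On a Linux system with /proc mounted, a few of the fields from the question (process id, status, parent, priority, size) can be read without ps. The following is only a sketch: the function names are made up, and the field positions follow the layout documented in the Linux proc(5) man page for /proc/<pid>/stat, so it will not work as is on AIX or Solaris, whose /proc formats differ.

```python
def parse_stat(line):
    """Parse one line in Linux /proc/<pid>/stat format into a small dict.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')' characters, so we split around the LAST ')'.
    """
    left = line.index("(")
    right = line.rindex(")")
    rest = line[right + 1:].split()  # rest[0] is field 3 (state) of proc(5)
    return {
        "pid": int(line[:left]),
        "comm": line[left + 1:right],
        "state": rest[0],            # field 3: R, S, D, Z, ...
        "ppid": int(rest[1]),        # field 4: parent process id
        "priority": int(rest[15]),   # field 18
        "nice": int(rest[16]),       # field 19
        "vsize": int(rest[20]),      # field 23: virtual memory size in bytes
    }


def read_process(pid):
    """Read and parse /proc/<pid>/stat for a running process (Linux only)."""
    with open(f"/proc/{pid}/stat") as fh:
        return parse_stat(fh.read())
```

Calling read_process(os.getpid()) on Linux returns the entry for the current process; looping over the numeric directory names in /proc gives something close to a minimal ps.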
Your question is open-ended, and an answer to any specific question you may have had can be looked up in the man page, as @Alfasin suggests in his answer. A lot depends on what you are trying to do.
As @ThomasDickey points out in his response, in UNIX and most of its derivatives, the command for viewing processes running in the background or foreground is the ps command.
ps stands for 'process status', which answers your first bullet item. The command has over 30 options, and depending on what information you seek and the permissions granted to you by the system administrator, you can get various types of information from it.
For example, for the second bullet item on your list above, depending on what you are looking for, you can get information on 3 different types of pointers: the session pointer (with option 'sess'), the terminal session pointer (tsess), and the process pointer (uprocp).
The rest of the items that you have listed are mostly available in the standard output of the command.
Some UNIX variants implement a view of the system process table inside the file system to support the running of programs such as ps. This is normally mounted on /proc (see @ThomasDickey's response above).
Typical reasons for understanding how the command works include system-administration responsibilities such as tracking the origin of initiated processes, killing runaway or orphaned processes, examining the size of a process and setting limits where necessary, etc. UNIX developers can also use it in conjunction with IPC features, etc. An understanding of the process table and status will help with associated UNIX features such as the kvm interface to examine crash dumps, or to get or set the kernel state.
Hope this helps.

Spreading a job over different nodes of a cluster in sun grid engine (SGE)

I'm trying to get Sun Grid Engine (SGE) to run the separate processes of an MPI job over all of the nodes of my cluster.
What is happening is that each node has 12 processors, so SGE is assigning 12 of my 60 processes to each of 5 separate nodes.
I'd like it to assign 2 processes to each of the 30 nodes available, because with 12 processes (DNA sequence alignments) running on each node, the nodes are running out of memory.
So I'm wondering if it's possible to explicitly get SGE to assign the processes to a given node?
Thanks,
Paul.
Check out "allocation_rule" in the configuration of the parallel environment. Either by setting it appropriately, or by specifying $pe_slots for allocation_rule and then using the -pe option to qsub, you should be able to do what you ask for above.
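To see why the allocation rule matters for the 60-process job in the question, here is a toy Python model of the two placements. It is not SGE's scheduler, only the arithmetic: the function name and host names are invented, "filling up" stands for the behaviour described in the question, and the per_host argument stands for an integer allocation_rule, which fixes the number of processes per host.

```python
def allocate(total, hosts, slots_per_host, per_host=None):
    """Return {host: process_count} for a job of `total` processes.

    per_host=None  -- fill each host up to its slot count before moving on
    per_host=N     -- place at most N processes on each host (toy version
                      of an integer allocation_rule)
    """
    limit = slots_per_host if per_host is None else min(per_host, slots_per_host)
    placement = {}
    remaining = total
    for host in hosts:
        if remaining == 0:
            break
        take = min(limit, remaining)
        placement[host] = take
        remaining -= take
    if remaining:
        raise ValueError("not enough slots for the requested processes")
    return placement


hosts = [f"node{i:02d}" for i in range(1, 31)]   # 30 hypothetical hosts

filled = allocate(60, hosts, slots_per_host=12)
print(len(filled), "hosts,", max(filled.values()), "processes each")

spread = allocate(60, hosts, slots_per_host=12, per_host=2)
print(len(spread), "hosts,", max(spread.values()), "processes each")
```

Filling up puts 12 processes on each of 5 hosts, which is the memory problem described above; a fixed count of 2 per host spreads the same 60 processes over all 30 hosts.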
You can do it by creating a queue that is defined to use only 2 of the 12 processors on each node.
You can see the configuration of the current queue with the command
qconf -sq queuename
You will see the following in the queue configuration. This queue is configured so that it uses only 5 execution hosts with 4 slots (processors) each.
....
slots 1,[master=4],[slave1=4],[slave2=4],[slave3=4],[slave4=4]
....
Use the following command to change the queue configuration:
qconf -mq queuename
Then change those 4s into 2s.
From an admin host, run "qconf -msconf" to edit the scheduler configuration. It will bring up a list of configuration options in an editor. Look for one called "load_formula". Set the value to "-slots" (without the quotes).
This tells the scheduler that the machine is least loaded when it has the fewest slots in use. If your exec hosts have a similar number of slots each, you will get an even distribution. If you have some exec hosts that have more slots than the others, they will be preferred, but your distribution will still be more even than with the default value for load_formula (which I don't remember, having changed this in my cluster quite some time ago).
You may need to set the slots on each host. I have done this myself because I need to limit the number of jobs on a particular set of boxes to less than their maximum, since they don't have as much memory as some of the other ones. I don't know if it is required for this load_formula configuration, but if it is, you can add a slots consumable to each host. Do this with "qconf -me hostname", and add a value to "complex_values" that looks like "slots=16", where 16 is the number of slots you want that host to use.
This is what I learned from our sysadmin. Put this SGE resource request in your job script:
#$ -l nodes=30,ppn=2
Requests 2 MPI processes per node (ppn) and 30 nodes. I think there is no guarantee that this 30x2 layout will work on a 30-node cluster if other users also run lots of jobs but perhaps you can give it a try.
