Questions about checkpointing MPI jobs in Torque with BLCR

We're trying to use Torque to checkpoint MPI jobs, but it seems that Torque can only handle jobs running on a single node. I checked the code and found that when qhold is used to checkpoint a job, it sends a PBS_BATCH_HoldJob request to pbs_server, which relays the request to the master host; the master host then checkpoints the job processes running on itself with BLCR, but it does not forward the request to its sister nodes. So it seems that MPI jobs cannot be checkpointed in Torque.
Another problem: after the checkpoint succeeds (as reported by qhold), Torque sends signal 15 to the process on the master host to kill it, then copies the checkpoint file to pbs_server and removes all the local files. When qrls is used to restart the job, the scheduler allocates new nodes for it, copies the checkpoint file to those nodes, and restarts the job from the checkpoint file. Then the problems come:
Assume Torque could checkpoint the processes of an MPI job on every node. Our jobs usually use a huge amount of memory, so the checkpoint files are very large, and pbs_server does not have a disk large enough to hold them.
In our environment, before an MPI job starts we pull some large metadata from another cluster directly onto the nodes allocated to the job. After a checkpoint/restart, the job processes may resume on different nodes, where that metadata is missing.
If anyone can tell me how you checkpoint MPI jobs, I'd be grateful; and if solving this requires modifying the Torque code, I'm willing to do that too.
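For reference, the flow we are testing looks roughly like this (a sketch; the job script and job ID are placeholders, and it assumes a BLCR-enabled Torque build with checkpointing allowed at submission time):

    qsub -c enabled ./mpi_job.sh    # submit with checkpointing enabled (BLCR)
    qhold 1234.pbs_server           # checkpoint + hold; BLCR only runs on the mother superior
    qrls 1234.pbs_server            # release the hold; the job is rescheduled from the checkpoint file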
Thanks.

Related

Control specific node where an MPI process executes using PBS script?

The setup: A single-processor executable and two parallel mpi-based codes that can run on 100s of processors each. All on an HPC cluster that uses a PBS-based job scheduler.
The problem: Using shared-memory communication between the single-processor executable and the parallel codes requires that rank 0 of each parallel code be physically located on the same node as that executable, on an HPC cluster that uses a PBS job scheduler.
The question: Can a PBS script be written that specifies that rank 0 of the two parallel codes must start on a specific node (the same node the single-processor executable is running on)?
Example:
ExecA --Single processor
ExecB -- 100 processors
ExecC -- 100 processors
I want a situation where ExecA, ExecB (rank 0), and ExecC (rank 0) all start up on the same node. I need this so that the rank 0 processes can communicate with the single-processor executable using a shared-memory paradigm and then broadcast that information to the rest of their respective MPI processes.
From this post, it does appear that the number of cores a code uses can be controlled from the PBS script. From my reading of the MPI manual, it also appears that, given a hostfile, MPI will go down it sequentially until it has allocated all the processors that were requested. So, theoretically, if I had a hostfile/machinefile that listed a particular node with one processor to be used on it, I believe rank 0 would likely reside on that node.
I know that most cluster-based job schedulers provide node names that users can use to request execution on a particular node, but I can't quite determine whether it is generally possible to tell a scheduler, "Hey, for this parallel job, put the first process on this node and put the rest elsewhere."
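One sketch of the hostfile idea (untested; the exact mpirun flags differ between Open MPI and MPICH, the resource request is only an example, and the hostname match against $PBS_NODEFILE may need adjusting if your site uses fully qualified names): reorder the allocated node list so that the node running the job script comes first, which is where most MPI launchers place rank 0, and run ExecA on that same node.

    #!/bin/bash
    #PBS -l nodes=26:ppn=8          # example: 26 x 8 = 208 slots >= 1 + 100 + 100 processes
    cd $PBS_O_WORKDIR

    # Put the node this script runs on first in the machine file, so rank 0 of
    # each MPI code lands here; the remaining allocated slots follow.
    head=$(hostname)
    { grep "$head" $PBS_NODEFILE; grep -v "$head" $PBS_NODEFILE; } > machines.$PBS_JOBID

    ./ExecA &                       # single-processor code on this node
    mpirun -np 100 -machinefile machines.$PBS_JOBID ./ExecB &
    mpirun -np 100 -machinefile machines.$PBS_JOBID ./ExecC &
    wait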

Where is /tmp folder located in Airflow?

I know that we can share information between tasks by persisting data to the /tmp location. But since every task could run on a different machine, how is /tmp from one task available to another task in Airflow?
The sharing you mention is only possible if you use the LocalExecutor, because in that case all tasks run on the same machine.
If you use the Celery or Kubernetes executor, sharing data via /tmp will not work. It might work accidentally with the Celery executor if the tasks happen to be executed on the same machine, but this is not at all guaranteed. You could potentially put all the tasks that need such sharing on a single machine with the Celery executor by using queues: when you create a single Celery worker for one queue, all tasks assigned to that queue will be executed on that machine (so /tmp sharing will work), but this heavily impacts scalability and resilience.
You could also have an Airflow deployment where /tmp is put on a shared filesystem, but this would be terribly inefficient, as the /tmp folder is often used for local caching of files, and using a network filesystem there would severely impact performance.
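A minimal sketch of the queue approach (the queue name is just an example, and it assumes the tasks that need the shared /tmp are given queue="tmp_shared" on their operators): start exactly one Celery worker that listens on that queue, on the machine that should host the files.

    # On the single machine that should run all the /tmp-sharing tasks
    # (Airflow 2.x CLI; Airflow 1.10 used "airflow worker -q tmp_shared")
    airflow celery worker --queues tmp_shared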

Processes of a single threaded program

If a program is written in a single threaded language, does that mean that when it is executed only a single process exists for it at a time (no concurrent processes)?
A process is just a separate memory space. A thread is just a unit of execution within a process. A process can have multiple threads, but a thread cannot be shared between multiple processes.
When you run a single-threaded program (assuming the language runtime does not introduce any other threads), there is only one thread in the process. That doesn't mean there is only one process for that program, because multiple instances of the same program might be running.
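For example, on Linux you can see this with any single-threaded program (sleep is used here purely as an illustration): each running instance is its own process with exactly one thread (the NLWP column).

    sleep 300 &                     # first instance
    sleep 300 &                     # second instance of the same program
    ps -o pid,nlwp,comm -C sleep    # two PIDs, each with NLWP=1 (one thread per process)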

How can I configure a YARN cluster for parallel execution of applications?

When I run Spark jobs on a YARN cluster, the applications run one after another in a queue. How can I run a number of applications in parallel?
I suppose your YARN scheduler is set to FIFO. Please change it to the Fair Scheduler or the Capacity Scheduler. The Fair Scheduler attempts to allocate resources so that all running applications get the same share of resources.
The Capacity Scheduler allows sharing of a Hadoop cluster along organizational lines, whereby each organization is allocated a certain capacity of the overall cluster. Each organization is set up with a dedicated queue that is configured to use a given fraction of the cluster capacity. Queues may be further divided in hierarchical fashion, allowing each organization to share its cluster allowance between different groups of users within the organization. Within a queue, applications are scheduled using FIFO scheduling.
If you are using the Capacity Scheduler, specify your queue in spark-submit with --queue queueName. Also try changing the Capacity Scheduler property yarn.scheduler.capacity.maximum-applications; it controls how many applications can be active (running or pending) at once.
By default, Spark will acquire all available resources when it launches a job.
You can limit the amount of resources consumed for each job via the spark-submit command.
Add the option --conf spark.cores.max=1 to spark-submit. You can change the number of cores to suit your environment. For example, if you have 100 total cores, you might limit a single job to 25 cores or 5 cores, etc.
You can also limit the amount of memory consumed: --conf spark.executor.memory=4g
You can change settings via spark-submit or in the file conf/spark-defaults.conf. Here is a link with documentation:
Spark Configuration
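As a hedged example of such a capped submission (the numbers are arbitrary; note that spark.cores.max is honoured by standalone and Mesos masters, while on YARN a job's footprint is usually bounded per executor via --num-executors and --executor-cores):

    spark-submit \
      --master yarn \
      --conf spark.executor.memory=4g \
      --num-executors 5 \
      --executor-cores 5 \
      my_app.py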

Is it possible, and how, to get from Slurm a list of the cores on which my MPI job is running?

The question: Is it possible and if yes then how, to get the list of cores on which my mpi job is running at a given moment?
It is easy to list the nodes to which the job has been assigned, but after a few hours spent surveying the internet I am starting to suspect that Slurm does not expose the core list in any way (why wouldn't it, though?).
The thing is, I want to double-check whether the cluster I am working on is really spreading the processes of my job across nodes, cores (and, if possible, sockets) as I ask it to (call me paranoid if you will).
Please note that hwloc is not an answer to my question; I am asking whether it is possible to get this information from Slurm, not from inside my program (call me curious if you will).
Closely related to (but definitely not the same thing as) this other similar question.
Well, that depends on your MPI library (MPICH-based, Open MPI-based, other), on how you run your MPI app (via mpirun or direct launch via srun), and on your SLURM config.
If you use direct launch, SLURM is the one that may do the binding.
srun --cpu_bind=verbose ...
should report how each task is bound.
If you use mpirun, SLURM only spawns one proxy on each node.
In the case of Open MPI, the spawn command is
srun --cpu_bind=none orted ...
So unless SLURM is configured to restrict the available cores (for example, if you configured cpusets and nodes are not in exclusive mode), all the cores can be used by the MPI tasks.
It is then up to the MPI library to bind the MPI tasks within the available cores.
If you want to know what the available cores are, you can run
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none grep Cpus_allowed_list /proc/self/status
If you want to know how the tasks are bound, you can run
mpirun grep Cpus_allowed_list /proc/self/status
Or you can ask MPI to report that.
IIRC, with Open MPI you can run
mpirun --report-bindings ...
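Another option on the Slurm side (a sketch; the job ID is a placeholder, and whether the tasks are actually bound to those cores then depends on the binding discussed above) is to ask the controller for the detailed allocation of the job, which includes the CPU IDs assigned on each node:

    scontrol -d show job 1234567
    # look for per-node lines such as:  Nodes=node042 CPU_IDs=0-7 Mem=...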
