How to wait for one job in a set of Slurm jobs to finish

I have started n Slurm jobs and I would like a separate process to wait until at least one of them has finished. The waiting process should use as little CPU time as possible, so polling would not be ideal (unless there is no other way).
I know about scontrol wait_job, but as far as I can see it can only wait for exactly one job.

If you have sufficient privileges, you can use strigger.
Otherwise, you could use a workflow manager (for instance FireWorks). They typically do polling, but at a reasonable pace.
Note that if the action to take is to submit another job, you can also submit it right away and use the --dependency parameter to delay its execution until ready.
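If you do end up polling from your own process anyway, a long interval keeps the CPU cost negligible. Below is a minimal Python sketch of that fallback; it assumes sacct is available (i.e. job accounting is enabled) and the job IDs shown are placeholders:

import subprocess
import time

JOB_IDS = ["12345678", "12345679"]   # placeholder job IDs
DONE_STATES = ("COMPLETED", "FAILED", "CANCELLED", "TIMEOUT")

def first_finished_job(job_ids):
    # Ask Slurm accounting for the top-level state of each job
    # (-X: allocations only, -n: no header, -P: '|'-separated output).
    out = subprocess.run(
        ["sacct", "-n", "-X", "-P", "-o", "JobID,State", "-j", ",".join(job_ids)],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        job_id, _, state = line.partition("|")
        if state.startswith(DONE_STATES):
            return job_id
    return None

finished = first_finished_job(JOB_IDS)
while finished is None:
    time.sleep(60)   # a long sleep keeps CPU usage negligible
    finished = first_finished_job(JOB_IDS)
print("job", finished, "has finished")

squeue could be used instead of sacct, but finished jobs drop out of its output quickly, so the accounting view is easier to interpret.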

If you want to run something in the current session and wait for jobs you have already submitted, you can wait by submitting a job with a dependency that finishes immediately (it will still need to wait for the scheduler, but it doesn't require installing anything new and doesn't require a bunch of polling):
sbatch --time=<some short time> <usual account args> --wait --dependency afterok:<your jobs> /dev/stdin <<< $'#!/bin/bash\nsleep 1'
For example:
sbatch --time=0:15:00 --account my-account-abc --wait --dependency afterok:12345678:12345679 /dev/stdin <<< $'#!/bin/bash\nsleep 1'
I know this is kind of late, but hope this helps someone :)

Related

How to stop a workflow immediately when a session fails?

I have a workflow that has many sessions that run in parallel to each other. When one of the sessions fails, the workflow waits for the other sessions to complete and then the entire workflow gets failed. We have selected the option "fail parent if this task fails". But we want the workflow to fail and stop immediately if any of the sessions fails, without waiting for the other sessions to finish.
PS: We have a Unix shell script that calls all the workflows one by one, so if we can solve it using Unix shell scripting that would be fine as well.
Does anyone have any solution for it?
The best thing you can do in Informatica is use a Control Task to abort the workflow, and have it connected from all sessions with an OR condition. Something like:
start--S1--S2--S3
        \   \   \
         \---\---\-(OR)-CTL

How to launch a Dataflow job with Apache Airflow and not block other tasks?

Problem
Airflow tasks of the type DataflowTemplateOperator take a long time to complete. This means other tasks can be blocked by it (correct?).
When we run more of these tasks, that means we would need a bigger Cloud Composer cluster (in our case) to execute tasks that are essentially blocking while they shouldn't be (they should be async operations).
Options
Option 1: just launch the job and mark the Airflow task as successful
Option 2: write a wrapper as explained here and use a reschedule mode as explained here
Option 1 does not seem feasible as the DataflowTemplateOperator only has an option to specify the wait time between completion checks called poll_sleep (source).
For the DataflowCreateJavaJobOperator there is an option check_if_running to wait for completion of a previous job with the same name (see this code)
It seems that after launching a job, the wait_for_finish is executed (see this line), which boils down to an "incomplete" job (see this line).
For Option 2, I need Option 1.
Questions
Am I correct to assume that Dataflow tasks will block others in Cloud Composer/Airflow?
Is there a way to schedule a job without a "wait to finish" using the built-in operators? (I might have overlooked something)
Is there an easy way to write this myself? I'm thinking of just executing a bash launch script, followed by a task that looks if the job finished correctly, but in a reschedule mode.
Is there another way to avoid blocking other tasks while running dataflow jobs? Basically this is an async operation and should not take resources.
Answers
Am I correct to assume that Dataflow tasks will block others in Cloud Composer/Airflow?
A: Partly yes. Airflow has a parallelism option in its configuration which defines the number of tasks that can execute at a time across the system. Having a task hold one of these slots can slow down execution in the system, and this issue is bound to happen as you increase the number of tasks and DAGs. You can increase this value in the configuration depending on your needs.
Is there a way to schedule a job without a "wait to finish" using the built-in operators? (I might have overlooked something)
A: Yes. You can use a PythonOperator and, in the python_callable, use the Dataflow hook to launch the job in async mode (launch and don't wait).
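For illustration, a rough sketch of such a callable that launches a template via the Dataflow REST API and returns straight away; the project, region, template path and job name are placeholders, import paths assume Airflow 2, and DataflowHook could be used the same way:

# Fire-and-forget launch of a Dataflow template; all identifiers below are placeholders.
from airflow.operators.python import PythonOperator
from googleapiclient.discovery import build

def launch_dataflow_template(**context):
    service = build("dataflow", "v1b3", cache_discovery=False)
    response = service.projects().locations().templates().launch(
        projectId="my-project",                          # placeholder
        location="europe-west1",                         # placeholder
        gcsPath="gs://my-bucket/templates/my-template",  # placeholder
        body={"jobName": "my-job", "parameters": {}},
    ).execute()
    # Return the job id so a later task can check on it (pushed to XCom automatically).
    return response["job"]["id"]

# Inside a DAG definition:
launch_dataflow = PythonOperator(
    task_id="launch_dataflow",
    python_callable=launch_dataflow_template,
)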
Is there an easy way to write this myself? I'm thinking of just executing a bash launch script, followed by a task that looks if the job finished correctly, but in a reschedule mode.
A: When you say reschedule, I'm assuming that you are going to retry the task that checks whether the job finished correctly. If so, you can set that task to retry and configure the delay at which you want the retries to happen (a reschedule-style sensor sketch follows these answers).
Is there another way to avoid blocking other tasks while running dataflow jobs? Basically this is an async operation and should not take resources.
A: I think I answered this in the second question.
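To make the reschedule idea concrete, here is a rough sketch of a custom sensor that gives its worker slot back between checks. The class name, task ids, project and region are made up for the example, and import paths assume Airflow 2:

# Sketch: a sensor that checks Dataflow job state and releases its slot between pokes.
from airflow.exceptions import AirflowException
from airflow.sensors.base import BaseSensorOperator
from googleapiclient.discovery import build

class DataflowJobDoneSensor(BaseSensorOperator):
    def __init__(self, project_id, location, launch_task_id, **kwargs):
        super().__init__(**kwargs)
        self.project_id = project_id
        self.location = location
        self.launch_task_id = launch_task_id

    def poke(self, context):
        # The launch task returned the Dataflow job id; pull it from XCom.
        job_id = context["ti"].xcom_pull(task_ids=self.launch_task_id)
        service = build("dataflow", "v1b3", cache_discovery=False)
        job = service.projects().locations().jobs().get(
            projectId=self.project_id, location=self.location, jobId=job_id,
        ).execute()
        state = job.get("currentState")
        if state == "JOB_STATE_FAILED":
            raise AirflowException(f"Dataflow job {job_id} failed")
        return state == "JOB_STATE_DONE"

# Inside a DAG definition:
wait_for_dataflow = DataflowJobDoneSensor(
    task_id="wait_for_dataflow",
    project_id="my-project",           # placeholder
    location="europe-west1",           # placeholder
    launch_task_id="launch_dataflow",
    mode="reschedule",                 # release the worker slot between pokes
    poke_interval=300,
)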

Prevent automatic terminal closure/idle process in ZSH

Is there any way to make sure zsh doesn't exit when cluster administrators auto-kill all idle processes after some fairly short amount of time?
I have already figured out how to do it with tmux, but my zsh sessions keep dying and the admins won't budge on the policy. I can always leave a dumb while loop running, but doing that at the start of every single zsh session is tedious, and it is very frustrating to come back and find that my process has exited with no idea why, because the shell that hosted it was killed and the output history is gone.
I use oh-my-zsh, so if there is a module in there that can do this, that would be great too.
I just added a simple snippet to my .zshrc. It is an inelegant solution, but it does work.
if [[ "$HOSTNAME" =~ ^myhost ]]; then
  while true; do echo 'hi' > /dev/null; sleep 120; done &
fi
I am sure there is a better way, because this makes shell startup slower than I would like, but that is what I came up with for now.

Gearman Worker start with ulabox symfony plugin

Again, I'm stuck with Gearman. I was implementing the ulabox GearmanBundle, which works nicely, but there are two things I don't understand yet.
How do I start a worker?
In the documentation, I should first execute a worker and then start the code.
https://github.com/ulabox/GearmanBundle/blob/master/README.md
Open the first console and run:
$ php app/console gearman:worker:execute --worker=AcmeDemoBundle:AcmeWorker
Now open another console and run:
$ php app/console gearman:client:execute --client=UlaboxGearmanBundle:GearmanClient:hello_world --worker=AcmeDemoBundle:AcmeWorker --params="{\"foo\": \"bar\" }"
So if I don't start the worker manually, will the job get done by itself? If I start the worker, everything is fine. But it seems a bit strange to start it manually, even if an iteration count of x is set so that the worker kills itself after that amount of jobs.
So please, can anyone help me out of this?
Thanks in advance and kind regards,
Phil
Yes, to run tasks in the background you need not only Gearman running but also workers.
So you have the gearman server running, waiting for commands (e.g. "send email").
Additionally you have workers waiting.
When the Gearman server sees a new command, it looks for the first free worker and passes the command to it.
The worker then executes the command and, when finished, reports back to the Gearman server that it is done and ready to process a new command.
The more workers you have, the faster the commands in the queue get processed.
You can use supervisor to keep the workers running automatically.
Below are a few links with more information (and after them a rough code sketch of this flow):
http://www.daredevel.com/php-jobs-with-gearman-and-supervisor/
http://www.masnun.com/2011/11/02/gearman-php-and-supervisor-processing-background-jobs-with-sanity.html
Running Gearman Workers in the Background
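Although the bundle above is PHP/Symfony, the same client/worker flow is easier to see in a few lines of code. Here is a rough sketch using the third-party Python gearman package; the server address, task name and payload are all made up:

# Rough sketch of the Gearman client/worker flow; everything here is a placeholder.
import gearman

GEARMAN_SERVERS = ["localhost:4730"]   # assumed server address

# Worker process: register a function and block, handling jobs the server dispatches.
def run_worker():
    def send_email(worker, job):
        print("sending:", job.data)    # job.data is the payload submitted by the client
        return "done"

    worker = gearman.GearmanWorker(GEARMAN_SERVERS)
    worker.register_task("email_send", send_email)
    worker.work()                      # loops forever; supervisor can keep this alive

# Client process: queue a background job and return immediately.
def queue_email():
    client = gearman.GearmanClient(GEARMAN_SERVERS)
    client.submit_job("email_send", "hello@example.com", background=True)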

Pagodabox or PHPfog + Gearman

All,
I'm looking for a good way to do some job backgrounding through either of these two services.
I see PHPFog supports IronWorker, but I need something more real-time. Through these cloud based PaaS services, I'm not able to use popen(background.php --token=1234). So I'm thinking the best solution might be to try to kick off a gearman worker to handle the job. (Actually my preferred method would be to use websockets to keep a connection open and receive feedback from the job, rather than long polling a db table through AJAX, but neither of these guys supports websockets.)
Question 1 is, is there a better solution than using gearman to offload the job?
Question 2 is, http://help.pagodabox.com/customer/portal/articles/430779 I see pagodabox supports 'worker listeners' ... has anybody set this up with gearman? Would it work?
Thanks
I am using PagodaBox with a background worker in an application I am building right now. Basically, PagodaBox daemonizes a PHP process for you (meaning it will continually run in the background), so all you really have to do is create a script that checks a database table for tasks to run, runs them, and then sleeps a bit so it's not running too many queries against your database.
This is a simplified version of what I have running:
<?php
// Remove time limit so the daemon can run indefinitely
set_time_limit(0);
// Show ALL errors
error_reporting(-1);
// Run daemon
echo "--- Starting Daemon ---\n";
while (true) {
    // Query 'work_queue' table for new tasks
    // Loop over items and do whatever tasks are associated with them
    // Update row to mark task as completed
    // Wait a bit before polling again
    sleep(30);
}
A benefit to this approach is that it's easy to test via CLI:
php tasks.php
You will see all the echo statements come through in console as it's running, and of course it's much easier to do than a more complicated setup with other dependencies like Gearman.
So whenever you add a new task to the table, the maximum amount of time you'll wait for it to be picked up is 30 seconds (or whatever your sleep time is). This is preferable to cron jobs: if you set up a cron job to run every minute (the lowest possible interval) and the work takes longer than a minute, another cron process will start working on the same queue, and you can end up with a lot of duplicated task work that is hard to debug and troubleshoot. If you instead have either a single background worker that runs all tasks, or multiple background workers that each handle different task types, you will never run into this issue.
