I have a job_a and a job_b. job_b has a dependency on job_a: only if job_a succeeds should job_b start. I have a requirement to skip the next execution of both jobs, so I put both jobs ON ICE. However, when I put job_a on ice, job_b was executed, since the on-ice job satisfied job_b's starting condition.
How can I resolve this so that both job_a and job_b are skipped for the next execution?
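For reference, the dependency described above would look roughly like this in JIL (the job name, machine and command here are placeholders):

insert_job: job_b
job_type: cmd
machine: some_machine
command: /path/to/job_b.sh
condition: success(job_a)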
I received a request to put the jobs running on a particular machine on ice. I took the jobs from that machine and put them on ice. A few minutes later I received another request to take the jobs off ice, but by mistake I took all jobs off ice instead of only those on the specified machine.
So I mistakenly took other jobs off ice that were supposed to stay on ice.
How can I find out which jobs were on ice before that request?
Can anyone help me with this, please?
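For reference, the on-ice and off-ice operations described above are typically done with sendevent; a sketch with a placeholder job name:

sendevent -E JOB_ON_ICE -J some_job      # put a job on ice
sendevent -E JOB_OFF_ICE -J some_job     # take it off ice again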
I have an Airflow DAG where each step is an AWS EMR task. Once Airflow reaches one of the steps, it sends a SIGTERM signal, as shown in the following log:
{emr_step.py:73} INFO - Poking step XXXXXXXX on cluster XXXXXXXX
{emr_base.py:66} INFO - Job flow currently RUNNING
{local_task_job.py:199} WARNING - State of this instance has been externally set to failed. Terminating instance.
{process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 7632
This happens despite the fact that the EMR job is still running healthily. One major difference between the EMR job that Airflow fails and the rest of my EMR jobs is that it triggers another system and waits to hear back from it; in other words, it stays idle until it gets a response from that other system. My impression is that Airflow thinks the EMR job has failed, when in fact it is just waiting for the other system.
Is there any way to ask Airflow to wait longer for this EMR job?
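If the failure turns out to be a timeout (only a guess; the log above just shows the task being externally marked as failed), the sensor's own timeout and the DAG's dagrun_timeout are the knobs to check. A minimal sketch, assuming the step is watched with EmrStepSensor and that hypothetical upstream tasks create_cluster and add_step push the cluster and step IDs via XCom (the import path may differ by provider version):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.sensors.emr_step import EmrStepSensor

with DAG(
    dag_id="emr_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    dagrun_timeout=timedelta(hours=14),  # if a run timeout is set, it must exceed the expected wait
) as dag:
    watch_step = EmrStepSensor(
        task_id="watch_step",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
        step_id="{{ task_instance.xcom_pull(task_ids='add_step', key='return_value')[0] }}",
        poke_interval=300,       # poke the step every 5 minutes
        timeout=12 * 60 * 60,    # allow the sensor to wait up to 12 hours
        mode="reschedule",       # free the worker slot between pokes
    )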
I have multiple Salt states and commands that get executed while other jobs may still be running.
When that happens, I get an error for the new jobs, something like:
The function "state.apply" is running as PID 3869 and was started at 2017, Mar 23 10:19:32.691177 with jid 20170323101932691177
Is there a way to wait for other jobs to complete first or to run the job in parallel?
You can queue the execution of salt states:
salt minion_A state.apply some.state queue=True
This will queue the state if any other states are currently running. Keep in mind that this option starts a new thread for each queued state run, so use it sparingly (https://docs.saltstack.com/en/latest/ref/modules/all/salt.modules.state.html).
You could use the saltutil.running function to check whether a Salt job is currently running on a minion, for example:
salt 'MINION' saltutil.running
See https://docs.saltstack.com/en/latest/ref/modules/all/salt.modules.saltutil.html#salt.modules.saltutil.running
As of Salt version 2017.7.0, you can also add parallel=True to your states, which will attempt to execute them in parallel.
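In an SLS file, the per-state form of that option looks roughly like this (a sketch; the state IDs and sleep commands are just placeholders to show two states being forked side by side):

sleep_one:
  cmd.run:
    - name: sleep 10
    - parallel: True

sleep_two:
  cmd.run:
    - name: sleep 10
    - parallel: True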
I'm aware that one can suspend running jobs with the qmod -sj [jobid] command, and in principle that works: the jobs go into the suspended (s) state. Fine so far, but:
I expected that if I put all running jobs into the suspended state and qsub new ones to GE (or already have waiting jobs), those would get run, which is not the case.
Some searching on this topic led me to http://gridengine.org/pipermail/users/2011-February/000050.html, which in fact points in the direction that suspended jobs make GE free to run other ones.
See here:
In a workload manager with "built-in" preemption, like Platform LSF,
it works by temporarily relaxing the slot count limit on a node and
then resolving the oversubscription by bumping the lowest job on the
totem pole to get the number of jobs back under the slot count limit.
In Sun Grid Engine, the same thing happens, except that instead of the
scheduler temporarily relaxing the slot count limits, you as the
administrator configure the host with more slots than you need and a
set of rules that create an artificial lower limit on the job count
that is enforced by bumping the lowest priority jobs.
Slightly different topic, but it seems the same principle can apply: to run other jobs while keeping your suspended ones, temporarily increase the slot counts on the relevant nodes.
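A sketch of what that temporary slot bump can look like (queue name, host name and slot counts are placeholders for your own setup):

qconf -mattr queue slots 16 all.q@node01   # temporarily raise the slot count for that queue instance
qconf -mattr queue slots 8 all.q@node01    # restore the original value once the extra jobs are done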
We have multiple jobs configured in Oozie; some jobs are signal-free and some depend on signals. We gave the signal-free jobs a start time of 2 AM, and they used to start firing at that time. Over the last month we have noticed that those signal-free jobs start about one hour late, and we are not sure why this is happening.
Does anyone have an idea why the Oozie jobs have started executing with a delay?
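One way to start narrowing this down (a sketch; the coordinator ID is a placeholder, and the Oozie URL is assumed to come from OOZIE_URL or -oozie) is to compare each action's nominal time with when it was actually created:

oozie jobs -jobtype coordinator -filter status=RUNNING   # find the relevant coordinator job ID
oozie job -info 0000123-200101000000000-oozie-oozi-C     # list actions with their nominal and created times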