Airflow TriggerDagRunOperator is not running the target DAG

I have two DAGs: DAG A and DAG B.
In DAG A I used a TriggerDagRunOperator and passed it the target DAG ID, a task ID, and parameters.
The task that triggers the second DAG executed successfully, and the status of DAG B is "running", but the task in DAG B never got triggered. The schedule interval for DAG B is None.
Can someone help me resolve this issue?

Did you
start the scheduler (via airflow scheduler)?
enable (unpause) the DAG that is to be triggered (by default, DAGs are paused)?
Both are necessary conditions for the tasks to actually run; see the sketch below.
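A minimal sketch of the setup from the question, assuming Airflow 2.x import paths; the DAG names, task IDs, and bash command are made up for illustration. Note that the triggering task in dag_a can succeed even while dag_b is paused, which matches the symptom described above.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# DAG B: no schedule; it only runs when something triggers it, and it must be unpaused.
with DAG("dag_b", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag_b:
    BashOperator(task_id="task_in_dag_b", bash_command="echo 'running dag_b'")

# DAG A: creates a DagRun for dag_b; the scheduler then has to pick up dag_b's tasks.
with DAG("dag_a", start_date=datetime(2022, 1, 1), schedule_interval="@daily", catchup=False) as dag_a:
    TriggerDagRunOperator(
        task_id="trigger_dag_b",
        trigger_dag_id="dag_b",
        conf={"param": "value"},  # parameters handed to the triggered run
    )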

Related

How to force an Airflow Task to restart at the new scheduling date?

I have this simple Airflow DAG:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash import BashOperator

with DAG("Second Dag",
         start_date=datetime(2022, 1, 1),
         schedule_interval="0 5 * * *",
         catchup=False,
         max_active_runs=1
         ) as dag:

    task_a = BashOperator(
        task_id="ToRepeat",
        bash_command="cd /home/xdf/local/ && (env/bin/python workflow/test1.py)",
        retries=1,
    )
The task takes a variable amount of time from one run to the next, and I have no guarantee that it will finish by 5 A.M. the next day.
If the task is still running when a new run is scheduled to start, I need to kill the old one before the new one starts running.
How can I design the Airflow DAG to automatically kill the old task if it's still running when a new one is scheduled to start?
More details:
I am looking for something dynamic. The old run should be killed only when the new one is starting. If, for any reason, the new run does not start for one week, then the old one should be able to run for an entire week. That's why using a fixed timeout is sub-optimal.
You should set dagrun_timeout for your DAG.
dagrun_timeout: specify how long a DagRun should be up before timing out / failing, so that new DagRuns can be created. The timeout is only enforced for scheduled DagRuns.
Since your DAG runs daily, you can set the timeout to 24 hours:
with DAG("Second Dag",
start_date=datetime(2022,1,1),
schedule_interval="0 5 * * *",
catchup=False,
max_active_runs=1
dagrun_timeout=timedelta(hours=24)
) as dag:
If you want to set timeout on a specific task in your DAG you should use execution_timeout on your operator.
execution_timeout: max time allowed for the execution of this task instance, if it goes beyond it will raise and fail
Example:
MyOperator(task_id='task', execution_timeout=timedelta(hours=24))
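Applied to the task from the question, a minimal sketch looks like this (a per-task limit rather than a whole-DAG one):

task_a = BashOperator(
    task_id="ToRepeat",
    bash_command="cd /home/xdf/local/ && (env/bin/python workflow/test1.py)",
    retries=1,
    execution_timeout=timedelta(hours=24),  # kill this task instance if it runs longer than 24h
)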
If you really are looking for a dynamic solution, you can use the Airflow DagRun REST API together with XComs: push the current run_id to XCom, and in subsequent runs pull that XCom and call the API to check for, and kill, the DAG run with that run_id.
check_previous_dag_run_id >> kill_previous_dag_run >> push_current_run_id >> your_main_task
and your API-call task could look something like
...
kill_previous_dag_run = BashOperator(
    task_id="kill_previous_dag_run",
    bash_command="curl -X 'DELETE' \
        'http://<<your_webserver_dns>>/api/v1/dags/<<your_dag_name>>/dagRuns/<<url_encoded_run_id>>' \
        -H 'accept: */*' --user <<api_username>>:<<api_user_password>>",
    dag=dag
)
...
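A sketch of the XCom side of this chain, assuming Airflow 2.x (where the callable receives the context as keyword arguments); the callables and the XCom key are made up, only the task IDs come from the chain above:

from airflow.operators.python import PythonOperator

def push_run_id(ti, run_id, **_):
    # Store this run's run_id so the next run can look it up and kill it if it is still active.
    ti.xcom_push(key="previous_run_id", value=run_id)

def get_previous_run_id(ti, **_):
    # Pull the run_id stored by the previous DAG run; include_prior_dates looks past this run.
    # The returned value is pushed as this task's own XCom ("return_value") and can be
    # templated into the curl URL of the kill task.
    return ti.xcom_pull(
        task_ids="push_current_run_id",
        key="previous_run_id",
        include_prior_dates=True,
    )

push_current_run_id = PythonOperator(
    task_id="push_current_run_id",
    python_callable=push_run_id,
    dag=dag,
)

check_previous_dag_run_id = PythonOperator(
    task_id="check_previous_dag_run_id",
    python_callable=get_previous_run_id,
    dag=dag,
)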

Airflow - Skip future task instance without making changes to dag file

I have a DAG 'abc' scheduled to run every day at 7 AM CST and there is task 'xyz' in that DAG.
For some reason, I do not want to run the task 'xyz' in tomorrow's run.
How can I skip that particular task instance?
I do not want to make any changes to code as I do not have access to Prod code and the task is in Prod environment now.
Is there any way to do that from the command line?
Appreciate any help on this.
You can mark the unwanted tasks as succeeded using the run command. Tasks marked as succeeded will not be run again.
Assume there is a DAG with ID a_dag and three tasks with IDs dummy1, dummy2, and dummy3, and we want to skip the dummy3 task in the next DAG run.
First, we get the next execution date:
$ airflow next_execution a_dag
2020-06-12T21:00:00+00:00
Then we mark dummy3 as succeeded for this execution date (-m marks the task instance as successful without actually executing it; the remaining flags force the run and skip the dependency checks):
$ airflow run -fAIim a_dag dummy3 '2020-06-12T21:00:00+00:00'
To be sure, we can check the task state. For the skipped task it will be success:
$ airflow task_state a_dag dummy3 '2020-06-12T21:00:00+00:00'
...
success
For the rest of the tasks the state will be None:
$ airflow task_state a_dag dummy1 '2020-06-12T21:00:00+00:00'
...
None

Apache Airflow does not enforce dagrun_timeout

I am using Apache Airflow version 1.10.3 with the sequential executor, and I would like the DAG to fail after a certain amount of time if it has not finished. I tried setting dagrun_timeout in the example code below:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2019, 6, 1),
    'retries': 0,
}

dag = DAG('min_timeout', default_args=default_args, schedule_interval=timedelta(minutes=5),
          dagrun_timeout=timedelta(seconds=30), max_active_runs=1)

t1 = BashOperator(
    task_id='fast_task',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='slow_task',
    bash_command='sleep 45',
    dag=dag)

t2.set_upstream(t1)
slow_task alone takes more than the time limit set by dagrun_timeout, so my understanding is that airflow should stop DAG execution. However, this does not happen, and slow_task is allowed to run for its entire duration. After this occurs, the run is marked as failed, but this does not kill the task or DAG as desired. Using execution_timeout for slow_task does cause the task to be killed at the specified time limit, but I would prefer to use an overall time limit for the DAG rather than specifying execution_timeout for each task.
Is there anything else I should try to achieve this behavior, or any mistakes I can fix?
The Airflow scheduler runs a loop at least every SCHEDULER_HEARTBEAT_SEC (the default is 5 seconds).
Bear in mind the "at least": the scheduler performs some actions that may delay the next cycle of its loop.
These actions include:
parsing the dags
filling up the DagBag
checking the DagRuns and updating their state
scheduling next DagRun
In your example, the delayed task isn't terminated at dagrun_timeout because the scheduler only performs its next cycle after the task completes (you are on the sequential executor, so the running task blocks it).
According to Airflow documentation:
dagrun_timeout (datetime.timedelta) – specify how long a DagRun should be up before timing out / failing, so that new DagRuns can be created. The timeout is only enforced for scheduled DagRuns, and only once the # of active DagRuns == max_active_runs.
So dagrun_timeout is not enforced for non-scheduled DagRuns (e.g. manually triggered ones), nor while the number of active DagRuns is below the max_active_runs parameter.
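If per-task limits are acceptable, here is a sketch of applying execution_timeout to every task via default_args (an alternative to, not a replacement for, dagrun_timeout):

default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2019, 6, 1),
    'retries': 0,
    # Applied to every operator in the DAG; each task instance is killed if it runs longer.
    'execution_timeout': timedelta(seconds=30),
}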

How to add manual tasks in an Apache Airflow Dag

I'm using Apache Airflow to manage the data processing pipeline. In the middle of the pipeline, some data need to be reviewed before the next-step processing. e.g.
... -> task1 -> human review -> task2 -> ...
where task1 and task2 are data processing task. When task1 finished, the generated data by task1 needs to be reviewed by human. After the reviewer approved the data, task2 could be launched.
Human review tasks may take a very long time (e.g. several weeks).
I'm thinking of using an external database to store the human review result and a Sensor to poll for that result at some interval, but the sensor would occupy an Airflow worker until the review is done.
Any ideas?
Piggy-backing off of Freedom's answer and Robert Elliot's answer, here is a full working example that gives the user two weeks to review the results of the first task before failing permanently:
from datetime import timedelta

from airflow.models import DAG
from airflow import AirflowException
from airflow.operators.python_operator import PythonOperator

from my_tasks import first_task_callable, second_task_callable

TIMEOUT = timedelta(days=14)

def task_to_fail():
    raise AirflowException("Please change this step to success to continue")

dag = DAG(dag_id="my_dag")

first_task = PythonOperator(
    dag=dag,
    task_id="first_task",
    python_callable=first_task_callable
)

manual_sign_off = PythonOperator(
    dag=dag,
    task_id="manual_sign_off",
    python_callable=task_to_fail,
    retries=1,
    max_retry_delay=TIMEOUT
)

second_task = PythonOperator(
    dag=dag,
    task_id="second_task",
    python_callable=second_task_callable
)

first_task >> manual_sign_off >> second_task
A colleague suggested having a task that always fails, so the manual step is simply to mark it as a success. I implemented it like so:
def always_fail():
    raise AirflowException('Please change this step to success to continue')

manual_sign_off = PythonOperator(
    task_id='manual_sign_off',
    dag=dag,
    python_callable=always_fail
)

start >> manual_sign_off >> end
Your idea seems good to me. You can create a dedicated DAG that checks the progress of your approval process with a sensor, using a low timeout on the sensor and an appropriate schedule for that DAG, say every 6 hours. Adapt it to how often these tasks are approved and how soon you need to perform the downstream tasks.
Before 1.10, I used the retry feature of the operator to implement a ManualSignOffTask. The operator sets retries and retry_delay, so the task is rescheduled after it fails. Each time the task runs, it checks the database to see whether the sign-off is done (a sketch follows the list):
If the sign-off has not been done yet, the task fails, releasing the worker and waiting for the next retry.
If the sign-off has been done, the task succeeds and the DAG run proceeds.
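A sketch of that retry-based check; is_signed_off() is a hypothetical helper that queries your external review database, and only the retries/retry_delay mechanics come from the answer above:

def check_sign_off(**_):
    # Fail unless the hypothetical is_signed_off() helper reports an approved review.
    if not is_signed_off():
        raise AirflowException("Sign-off not done yet; failing so the task retries later")

manual_sign_off = PythonOperator(
    task_id="manual_sign_off",
    python_callable=check_sign_off,
    retries=28,                      # number of re-checks before giving up for good
    retry_delay=timedelta(hours=6),  # time between checks; the worker is free in between
    dag=dag,
)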
After 1.10, a new task-instance state, UP_FOR_RESCHEDULE, was introduced, and sensors run in reschedule mode natively support this kind of long-running check without holding a worker slot.
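A minimal sketch of a reschedule-mode sensor, assuming the Airflow 2.x import path and reusing the hypothetical is_signed_off helper:

from airflow.sensors.python import PythonSensor

wait_for_sign_off = PythonSensor(
    task_id="wait_for_sign_off",
    python_callable=is_signed_off,   # hypothetical: returns True once the review is approved
    mode="reschedule",               # frees the worker slot between pokes (UP_FOR_RESCHEDULE)
    poke_interval=6 * 60 * 60,       # re-check every 6 hours
    timeout=14 * 24 * 60 * 60,       # give reviewers up to two weeks
    dag=dag,
)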

Example DAG gets stuck in "running" state indefinitely

In my first foray into Airflow, I am trying to run one of the example DAGs that come with the installation. This is v1.8.0. Here are my steps:
$ airflow trigger_dag example_bash_operator
[2017-04-19 15:32:38,391] {__init__.py:57} INFO - Using executor SequentialExecutor
[2017-04-19 15:32:38,676] {models.py:167} INFO - Filling up the DagBag from /Users/gbenison/software/kludge/airflow/dags
[2017-04-19 15:32:38,947] {cli.py:185} INFO - Created <DagRun example_bash_operator @ 2017-04-19 15:32:38: manual__2017-04-19T15:32:38, externally triggered: True>
$ airflow dag_state example_bash_operator '2017-04-19 15:32:38'
[2017-04-19 15:33:12,918] {__init__.py:57} INFO - Using executor SequentialExecutor
[2017-04-19 15:33:13,229] {models.py:167} INFO - Filling up the DagBag from /Users/gbenison/software/kludge/airflow/dags
running
The dag state remains "running" for a long time (at least 20 minutes by now), although from a quick inspection of this task it should take a matter of seconds. How can I troubleshoot this? How can I see which step it is stuck on?
To run any DAGs, you need to make sure two processes are running:
airflow webserver
airflow scheduler
If you only have airflow webserver running, the UI will show DAGs as running, but if you click on the DAG, none of its tasks are actually running or scheduled; they are in a Null state.
What this means is that they are waiting to be picked up by airflow scheduler. If airflow scheduler is not running, you'll be stuck in this state forever, as the tasks are never picked up for execution.
Additionally, make sure that the toggle button in the DAGs view is switched to 'ON' for the particular DAG. Otherwise it will not get picked up by the scheduler if you trigger it manually.
I too recently started using Airflow and my DAGs kept endlessly running. Your DAG may be set to 'paused' without you realizing it; the scheduler will then not schedule new task instances, so when you trigger the DAG it just looks like it runs forever.
There are a few solutions:
1) In the Airflow UI, toggle the button left of the DAG from 'Off' to 'On'. Off means that the DAG is paused, so On will allow the scheduler to pick it up and complete the DAG. (This fixed my initial issue.)
2) In your airflow.cfg file, dags_are_paused_at_creation = True is the default, so all new DAGs you create are paused from the start. Change this to False and future DAGs will be good to go right away (I had to restart the webserver and scheduler for changes to airflow.cfg to be picked up).
3) Use the command line: $ airflow unpause [dag_id]
documentation: https://airflow.apache.org/cli.html#unpause
The below worked for me.
Make sure AIRFLOW_HOME is set
In AIRFLOW_HOME, have the folders dags and plugins, with read, write, and execute permissions for the airflow user.
Make sure you have at least one DAG in the dags/ folder.
pip install celery[redis]==4.1.1
I checked the above solution on Airflow 1.9.0.
I tried the same on Airflow 1.10 and it worked as well.
