Run Task on Success or Fail but not on Skipped - airflow

Is there a way to run a task if the upstream task succeeded or failed but not if the upstream was skipped?
I am familiar with trigger_rule with the all_done parameter, as mentioned in this other question, but that triggers the task when the upstream has been skipped. I only want the task to fire on the success or failure of the upstream task.

I don't believe there is a single trigger rule for "success or failed". What you could do is set up duplicate tasks, one with the trigger rule all_success and one with the trigger rule all_failed. That way, each duplicate runs only when the parent task ahead of it succeeds or fails, respectively, and neither runs when the parent is skipped.
I have included code below for you to test for expected results easily.
So, say you have three tasks.
task1 is the upstream task that succeeds or fails
task2 is your success-only task
task3 is your failure-only task
# dags/latest_only_with_trigger.py
import datetime as dt

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule

dag = DAG(
    dag_id='stackoverflowtest',
    schedule_interval=dt.timedelta(minutes=5),
    start_date=dt.datetime(2019, 2, 20),
)

# task1 is the upstream task whose outcome decides which duplicate runs.
task1 = DummyOperator(task_id='task1', dag=dag)
# task2 runs only if task1 succeeded.
task2 = DummyOperator(task_id='task2', dag=dag,
                      trigger_rule=TriggerRule.ALL_SUCCESS)
# task3 runs only if task1 failed.
task3 = DummyOperator(task_id='task3', dag=dag,
                      trigger_rule=TriggerRule.ALL_FAILED)

###### ORCHESTRATION ###
task2.set_upstream(task1)
task3.set_upstream(task1)
Hope this helps!
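To see why the all_success / all_failed pair avoids the skipped case while all_done does not, here is a minimal pure-Python model of the three trigger rules. It is only a sketch: the should_fire and pair_fires functions and the state strings are illustrative assumptions, not Airflow's own implementation, and it runs without Airflow installed.

```python
# Simplified model of three Airflow trigger rules (not Airflow's real code).
FINISHED = {"success", "failed", "skipped", "upstream_failed"}
FAILED = {"failed", "upstream_failed"}


def should_fire(rule, upstream_states):
    """Return True if a task with this trigger rule would run."""
    if rule == "all_success":
        return all(s == "success" for s in upstream_states)
    if rule == "all_failed":
        return all(s in FAILED for s in upstream_states)
    if rule == "all_done":
        return all(s in FINISHED for s in upstream_states)
    raise ValueError(f"unknown rule: {rule}")


def pair_fires(upstream_states):
    """True if either the all_success or the all_failed duplicate runs."""
    return (should_fire("all_success", upstream_states)
            or should_fire("all_failed", upstream_states))


# Compare all_done with the duplicate-task pair for each upstream outcome.
for state in ("success", "failed", "skipped"):
    print(state, should_fire("all_done", [state]), pair_fires([state]))
```

In this model all_done fires for a skipped upstream task, while the pair fires only for success or failure, which is the behaviour the question asks for.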

Related

Airflow Only works with the Celery, CeleryKubernetes or Kubernetes executors

I have this DAG; nevertheless, when I try to run it, it gets stuck in the Queued state. When I then try to run it manually I get this error:
Error:
Only works with the Celery, CeleryKubernetes or Kubernetes executors
Code:
from airflow import DAG
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.operators.python import PythonOperator
from datetime import datetime


def helloWorld():
    print('Hello World')


def take_clients():
    hook = PostgresHook(postgres_conn_id="postgres_robert")
    df = hook.get_pandas_df(sql="SELECT * FROM clients;")
    print(df)
    # do what you need with the df....


with DAG(dag_id="test",
         start_date=datetime(2021, 1, 1),
         schedule_interval="@once",
         catchup=False) as dag:
    task1 = PythonOperator(
        task_id="hello_world",
        python_callable=helloWorld)
    task2 = PythonOperator(
        task_id="get_clients",
        python_callable=take_clients)
    task1 >> task2
I guess you are trying to use the Run button from the UI.
This button is enabled only for executors that support it.
In your Airflow setup you are using an executor that doesn't support this command.
In newer Airflow versions the button is simply disabled if you are using an executor that doesn't support it.
I assume that what you are after is to create a new run; in that case you should use the Trigger Run button. If you are looking to re-run a specific task, use the Clear button.
You are running it with the LocalExecutor; you have to change your executor to Celery, CeleryKubernetes, Kubernetes or DaskExecutor.
If you are using docker-compose, add:
AIRFLOW__CORE__EXECUTOR: CeleryExecutor
Otherwise, see the Airflow Executor documentation.
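As a quick sanity check, the sketch below reads the same AIRFLOW__CORE__EXECUTOR environment variable and reports whether its value is one of the executors named in the error message. The helper name supports_run_button is made up for illustration, and the set is taken only from the error text above; Airflow's own check may differ.

```python
import os

# Executors listed in the error message quoted above.
SUPPORTED = {"CeleryExecutor", "CeleryKubernetesExecutor", "KubernetesExecutor"}


def supports_run_button(executor_name):
    """Hypothetical helper: is this executor named in the error message?"""
    return executor_name in SUPPORTED


# An unset variable is reported as "(unset)" rather than guessing a default.
current = os.environ.get("AIRFLOW__CORE__EXECUTOR", "(unset)")
print(current, supports_run_button(current))
```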

How to force a Airflow Task to restart at the new scheduling date?

I have this simple Airflow DAG:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash import BashOperator

with DAG("Second Dag",
         start_date=datetime(2022, 1, 1),
         schedule_interval="0 5 * * *",
         catchup=False,
         max_active_runs=1
         ) as dag:
    task_a = BashOperator(
        task_id="ToRepeat",
        bash_command="cd /home/xdf/local/ && (env/bin/python workflow/test1.py)",
        retries=1,
    )
The task takes a variable amount of time from one run to the next, and I have no guarantee that it will finish by 5 A.M. the next day.
If the task is still running when a new run is scheduled to start, I need to kill the old one before the new one starts.
How can I design the Airflow DAG to automatically kill the old task if it's still running when a new task is scheduled to start?
More details:
I am looking for something dynamic. The old DAG run should be killed only when the new DAG run is starting. If, for any reason, the new DAG run does not start for one week, then the old DAG run should be able to run for an entire week. That's why using a timeout is sub-optimal.
You should set dagrun_timeout for your DAG.
dagrun_timeout: specify how long a DagRun should be up before
timing out / failing, so that new DagRuns can be created. The timeout
is only enforced for scheduled DagRuns.
Since your DAG runs daily you can set 24 hours for timeout.
with DAG("Second Dag",
         start_date=datetime(2022, 1, 1),
         schedule_interval="0 5 * * *",
         catchup=False,
         max_active_runs=1,
         dagrun_timeout=timedelta(hours=24)
         ) as dag:
If you want to set timeout on a specific task in your DAG you should use execution_timeout on your operator.
execution_timeout: max time allowed for the execution of this task instance, if it goes beyond it will raise and fail
Example:
MyOperator(task_id='task', execution_timeout=timedelta(hours=24))
If you really are looking for a dynamic solution, you can use the Airflow DAG run REST API together with XComs: push the current DAG run's run_id to XCom, and in subsequent runs pull that XCom and use the Airflow API to check for and kill the DAG run with that run_id.
check_previous_dag_run_id >> kill_previous_dag_run >> push_current_run_id >> your_main_task
Your API call task should look something like this:
...
kill_previous_dag_run = BashOperator(
    task_id="kill_previous_dag_run",
    bash_command="curl -X 'DELETE' \
    'http://<<your_webserver_dns>>/api/v1/dags/<<your_dag_name>>/dagRuns/<<url_encoded_run_id>>' \
    -H 'accept: */*' --user <<api_username>>:<<api_user_password>>",
    dag=dag
)
...
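The <<url_encoded_run_id>> placeholder matters because run IDs can contain characters such as ":" and "+". The following sketch shows one way to build that URL with the standard library. The dag_run_url helper, the host, the DAG name and the run_id value are all hypothetical, chosen only for illustration; the path layout is the one used in the curl command above.

```python
from urllib.parse import quote


def dag_run_url(base_url, dag_id, run_id):
    """Build the dagRuns URL used by the curl command, percent-encoding IDs."""
    # safe="" also encodes "/" so an ID can never add extra path segments.
    return "{}/api/v1/dags/{}/dagRuns/{}".format(
        base_url.rstrip("/"), quote(dag_id, safe=""), quote(run_id, safe="")
    )


# Hypothetical host, DAG name, and run_id, for illustration only.
url = dag_run_url(
    "http://localhost:8080", "second_dag", "scheduled__2022-01-01T05:00:00+00:00"
)
print(url)
```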

Apache Airflow does not enforce dagrun_timeout

I am using Apache Airflow version 1.10.3 with the sequential executor, and I would like the DAG to fail after a certain amount of time if it has not finished. I tried setting dagrun_timeout in the example code
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2019, 6, 1),
    'retries': 0,
}

dag = DAG('min_timeout', default_args=default_args, schedule_interval=timedelta(minutes=5), dagrun_timeout=timedelta(seconds=30), max_active_runs=1)

t1 = BashOperator(
    task_id='fast_task',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='slow_task',
    bash_command='sleep 45',
    dag=dag)

t2.set_upstream(t1)
slow_task alone takes more than the time limit set by dagrun_timeout, so my understanding is that airflow should stop DAG execution. However, this does not happen, and slow_task is allowed to run for its entire duration. After this occurs, the run is marked as failed, but this does not kill the task or DAG as desired. Using execution_timeout for slow_task does cause the task to be killed at the specified time limit, but I would prefer to use an overall time limit for the DAG rather than specifying execution_timeout for each task.
Is there anything else I should try to achieve this behavior, or any mistakes I can fix?
The Airflow scheduler runs its loop at intervals of at least SCHEDULER_HEARTBEAT_SEC (the default is 5 seconds).
Note the "at least": the scheduler performs some actions that may delay the next cycle of its loop.
These actions include:
parsing the dags
filling up the DagBag
checking the DagRun and updating their state
scheduling next DagRun
In your example, the delayed task isn't terminated at the dagrun_timeout because the scheduler performs its next cycle after the task completes.
According to Airflow documentation:
dagrun_timeout (datetime.timedelta) – specify how long a DagRun should be up before timing out / failing, so that new DagRuns can be created. The timeout is only enforced for scheduled DagRuns, and only once the # of active DagRuns == max_active_runs.
So dagrun_timeout won't be enforced for non-scheduled DagRuns (e.g. manually triggered ones), or while the number of active DagRuns is below the max_active_runs parameter.
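The documented conditions can be written down as a small predicate. This is only a model of the quoted documentation, not the scheduler's code; the function name timeout_enforced and its parameters are assumptions made for this sketch. It also says nothing about killing a running task: as the question notes, the run is only marked as failed.

```python
from datetime import timedelta


def timeout_enforced(is_scheduled, active_runs, max_active_runs,
                     elapsed, dagrun_timeout):
    """Model of the documented conditions, not Airflow's scheduler code."""
    # No timeout configured, or a manually triggered run: never enforced.
    if dagrun_timeout is None or not is_scheduled:
        return False
    # Only enforced once the number of active runs equals max_active_runs.
    if active_runs != max_active_runs:
        return False
    return elapsed > dagrun_timeout


# Values from the question: 30 s timeout, max_active_runs=1, 45 s task.
print(timeout_enforced(True, 1, 1, timedelta(seconds=45), timedelta(seconds=30)))
```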

Airflow backfill only scheduling for START_DATE

I just started using airflow and I basically want to run my dag to load historical data. So I'm running this command
airflow backfill my_dag -s 2018-07-30 -e 2018-08-01
But Airflow runs my DAG only for 2018-07-30. I expected it to run for 2018-07-30, 2018-07-31 and 2018-08-01.
Here's part of my dag's code:
import airflow
import configparser
import os
from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator
from airflow.models import Variable
from datetime import datetime


def getConfFileFullPath(fileName):
    return os.path.join(os.path.abspath(os.path.dirname(__file__)), fileName)


config = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())
config.read([getConfFileFullPath('pipeline.properties')])

args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': datetime(2018, 7, 25),
    'end_date': airflow.utils.dates.days_ago(1)
}

dag_id = 'my_dag'
dag = DAG(
    dag_id=dag_id, default_args=args,
    schedule_interval=None, catchup=False)
...
So am I doing anything wrong with my dag configuration?
Problem: schedule_interval=None
In order to initiate multiple runs within your defined date range you need to set the schedule interval for the DAG. For example, try:
schedule_interval='@daily'
The start date, end date and schedule interval define how many runs will be initiated by the scheduler when backfill is executed.
Airflow scheduling and presets
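To make the relationship between start date, end date and interval concrete, here is a simplified pure-Python model. It is not how Airflow computes runs internally (Airflow also accepts cron expressions and presets); the backfill_dates function is a hypothetical helper, and the None branch simply mirrors the single run observed in the question.

```python
from datetime import date, timedelta


def backfill_dates(start, end, interval):
    """Model: one run per interval from start to end inclusive.

    With no interval there is nothing to step by, so only the start
    date is returned, matching what the question observed.
    """
    if interval is None:
        return [start]
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += interval
    return dates


# Daily interval over the range used in the backfill command.
print(backfill_dates(date(2018, 7, 30), date(2018, 8, 1), timedelta(days=1)))
# No interval: only the start date.
print(backfill_dates(date(2018, 7, 30), date(2018, 8, 1), None))
```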

How to add manual tasks in an Apache Airflow Dag

I'm using Apache Airflow to manage a data processing pipeline. In the middle of the pipeline, some data needs to be reviewed before the next processing step, e.g.
... -> task1 -> human review -> task2 -> ...
where task1 and task2 are data processing tasks. When task1 finishes, the data it generates needs to be reviewed by a human. After the reviewer approves the data, task2 can be launched.
Human review tasks may take a very long time (e.g. several weeks).
I'm thinking of using an external database to store the human review result, and a Sensor to poke for the review result at a fixed interval. But that would occupy an Airflow worker until the review is done.
Any ideas?
Piggybacking off of Freedom's answer and Robert Elliot's answer, here is a full working example that gives the user two weeks to review the results of the first task before failing permanently:
from datetime import timedelta

from airflow.models import DAG
from airflow.exceptions import AirflowException
from airflow.operators.python_operator import PythonOperator

from my_tasks import first_task_callable, second_task_callable

TIMEOUT = timedelta(days=14)


def task_to_fail():
    raise AirflowException("Please change this step to success to continue")


dag = DAG(dag_id="my_dag")

first_task = PythonOperator(
    dag=dag,
    task_id="first_task",
    python_callable=first_task_callable
)

manual_sign_off = PythonOperator(
    dag=dag,
    task_id="manual_sign_off",
    python_callable=task_to_fail,
    retries=1,
    # retry_delay (not max_retry_delay) sets the wait before the single retry,
    # which is the window the reviewer has to mark the task as success.
    retry_delay=TIMEOUT
)

second_task = PythonOperator(
    dag=dag,
    task_id="second_task",
    python_callable=second_task_callable
)

first_task >> manual_sign_off >> second_task
A colleague suggested having a task that always fails, so the manual step is simply to mark it as a success. I implemented it as so:
def always_fail():
    raise AirflowException('Please change this step to success to continue')


manual_sign_off = PythonOperator(
    task_id='manual_sign_off',
    dag=dag,
    python_callable=always_fail
)

start >> manual_sign_off >> end
Your idea seems good to me. You can create a dedicated DAG that checks the progress of your approval process with a sensor. Use a low timeout on the sensor and an appropriate schedule on this DAG, say every 6 hours, and adapt both to how often these tasks are approved and how soon you need to run the downstream tasks.
Before 1.10, I used the retry feature of the operator to implement the ManualSignOffTask. The operator sets retries and retry_delay, so the task is rescheduled after it fails. When the task runs, it checks the database to see if the sign-off is done:
If the sign-off has not been done yet, the task fails, releases the worker, and waits for the next retry.
If the sign-off has been done, the task succeeds and the DAG run proceeds.
Since 1.10, a new TI state, UP_FOR_RESCHEDULE, has been available and sensors natively support long-running tasks.
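The database check described above could look roughly like the sketch below. It uses an in-memory SQLite database as a stand-in for the external review store; the reviews table, its columns, the review ID and the sign_off_done function are all hypothetical names invented for this example.

```python
import sqlite3


def sign_off_done(conn, review_id):
    """Return True once the reviewer has approved this review."""
    row = conn.execute(
        "SELECT approved FROM reviews WHERE review_id = ?", (review_id,)
    ).fetchone()
    # A missing row or approved = 0 both count as "not signed off yet".
    return bool(row and row[0])


# In-memory database standing in for the external review store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (review_id TEXT PRIMARY KEY, approved INTEGER)")
conn.execute("INSERT INTO reviews VALUES ('batch-1', 0)")
print(sign_off_done(conn, "batch-1"))

# The reviewer approves the data.
conn.execute("UPDATE reviews SET approved = 1 WHERE review_id = 'batch-1'")
print(sign_off_done(conn, "batch-1"))
```

In the retry-based approach the task would raise an exception while this function returns False; with a sensor, the same boolean could be returned from the sensor's poke method.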
