Airflow - create task that runs once all other tasks ran successfully

Is there a way to add a task that runs only after all other tasks in the same DAG have run successfully? My current DAG is below.
My tasks currently run in the order shown below, and I want new_task to run once all of them have finished. If I leave it as written, new_task does not run:
for endpoint in ENDPOINTS:
    latest_only = (operator...)
    s3 = (operator...)
    etc ....
    latest_only >> s3 >> short_circuit
    short_circuit >> snowflake >> success
    short_circuit >> postgres >> success
    if endpoint.name == "io_lineitems":
        success >> il_io_lineitems_tables
        copy_monthly_billing >> load_io_monthly_billing_to_snowflake
        copy_monthly_billing >> load_io_monthly_billing_to_postgres
new_task

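Use trigger rules (the trigger_rule argument on an operator) to control when a task runs based on the state of its upstream tasks. With the default rule, all_success, a task runs only after every task directly upstream of it has succeeded, so new_task also has to be set downstream of the tasks it should wait for.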
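As a rough illustration of that wiring, the sketch below models the question's dependencies for a single endpoint as a plain dictionary (no Airflow import) and finds the tasks that have nothing downstream of them. The edges are copied from the question; the helper name leaf_tasks is hypothetical, and the model ignores the loop over ENDPOINTS.

```python
# Simplified model of the question's DAG for one endpoint.
# Keys are task ids, values are the task ids directly downstream.
downstream = {
    "latest_only": ["s3"],
    "s3": ["short_circuit"],
    "short_circuit": ["snowflake", "postgres"],
    "snowflake": ["success"],
    "postgres": ["success"],
    "success": ["il_io_lineitems_tables"],
    "copy_monthly_billing": [
        "load_io_monthly_billing_to_snowflake",
        "load_io_monthly_billing_to_postgres",
    ],
}


def leaf_tasks(downstream):
    """Return the ids of tasks with no downstream task (hypothetical helper)."""
    all_tasks = set(downstream)
    for children in downstream.values():
        all_tasks.update(children)
    return sorted(t for t in all_tasks if not downstream.get(t))


leaves = leaf_tasks(downstream)
print(leaves)
# In the DAG file, the matching wiring would be: [<those task objects>] >> new_task
```

Setting new_task downstream of exactly those leaf tasks means it waits for everything else. If the short-circuit skips a branch, a rule such as none_failed may fit better, because skipped upstream tasks cause an all_success task to be skipped as well.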

Related

Airflow: Can operators run on an external service, and communicate with Airflow to update the progress of the DAG?

I'm currently experimenting with a new concept where the operator will communicate with an external service to run the operator instead of running the operator locally, and the external service can communicate with Airflow to update the progress of the DAG.
For example, let's say we have a bash operator:
bash_task = BashOperator(
    task_id="bash_task",
    bash_command="echo \"This Message Shouldn't Run Locally on Airflow\"",
)
That is part of a DAG:
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG() as dag:
    t1 = BashOperator(
        task_id="bash_task1",
        bash_command="echo \"t1:This Message Shouldn't Run Locally on Airflow\""
    )
    t2 = BashOperator(
        task_id="bash_task2",
        bash_command="echo \"t2:This Message Shouldn't Run Locally on Airflow\""
    )
    t1 >> t2
Is there a method in the Airflow code that will allow an external service to tell the DAG that t1 has started/completed and that t2 has started/completed, without actually running the DAG on the Airflow instance?
Airflow has the concept of Executors, which are responsible for running tasks, in some cases on external services such as Kubernetes, Dask, or a Celery cluster.
https://airflow.apache.org/docs/apache-airflow/stable/executor/index.html
The worker process reports the task's progress back to Airflow, often via the metadata database.

How to force an Airflow task to restart at the new scheduling date?

I have this simple Airflow DAG:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash import BashOperator

with DAG("Second Dag",
         start_date=datetime(2022, 1, 1),
         schedule_interval="0 5 * * *",
         catchup=False,
         max_active_runs=1
         ) as dag:
    task_a = BashOperator(
        task_id="ToRepeat",
        bash_command="cd /home/xdf/local/ && (env/bin/python workflow/test1.py)",
        retries=1,
    )
The task takes a variable amount of time from one run to the next, and I have no guarantee that it will finish by 5 A.M. the next day.
If the task is still running when a new run is scheduled to start, I need to kill the old one before the new one starts.
How can I design the Airflow DAG to automatically kill the old task if it is still running when a new run is scheduled to start?
More details:
I am looking for something dynamic. The old DAG run should be killed only when the new DAG run is starting. If, for any reason, the new DAG run does not start for one week, then the old DAG run should be able to run for the entire week. That's why using a timeout is sub-optimal.
You should set dagrun_timeout for your DAG.
dagrun_timeout: specify how long a DagRun should be up before
timing out / failing, so that new DagRuns can be created. The timeout
is only enforced for scheduled DagRuns.
Since your DAG runs daily you can set 24 hours for timeout.
with DAG("Second Dag",
         start_date=datetime(2022, 1, 1),
         schedule_interval="0 5 * * *",
         catchup=False,
         max_active_runs=1,  # comma was missing here
         dagrun_timeout=timedelta(hours=24)
         ) as dag:
If you want to set timeout on a specific task in your DAG you should use execution_timeout on your operator.
execution_timeout: max time allowed for the execution of this task instance, if it goes beyond it will raise and fail
Example:
MyOperator(task_id='task', execution_timeout=timedelta(hours=24))
If you really need a dynamic solution, you can use the Airflow DAG run REST API together with XComs: push the current DAG run's run_id to XCom, and in each subsequent run pull that XCom and call the Airflow API to check for, and kill, the DAG run with that run_id.
check_previous_dag_run_id >> kill_previous_dag_run >> push_current_run_id >> your_main_task
Your API call task should look something like this:
...
kill_previous_dag_run = BashOperator(
    task_id="kill_previous_dag_run",
    bash_command="curl -X 'DELETE' \
    'http://<<your_webserver_dns>>/api/v1/dags/<<your_dag_name>>/dagRuns/<<url_encoded_run_id>>' \
    -H 'accept: */*' --user <<api_username>>:<<api_user_password>>",
    dag=dag
)
...
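The <<url_encoded_run_id>> placeholder matters because run ids usually contain characters such as ":" and "+" that are not safe in a URL path. A minimal sketch of building that URL with the standard library follows; the function name dag_run_url and the host, DAG id, and run id values are assumptions for illustration only, and the endpoint path is the one used in the curl command above.

```python
from urllib.parse import quote


def dag_run_url(base_url, dag_id, run_id):
    """Build the DAG-run endpoint URL, percent-encoding each path segment."""
    # safe="" also encodes "/", so an id can never add extra path segments.
    return "{}/api/v1/dags/{}/dagRuns/{}".format(
        base_url.rstrip("/"), quote(dag_id, safe=""), quote(run_id, safe="")
    )


# Example values only: host, DAG id, and run id are placeholders.
url = dag_run_url(
    "http://localhost:8080", "second_dag", "scheduled__2022-01-01T05:00:00+00:00"
)
print(url)
```

The resulting string can be passed to curl as in the task above, or to any HTTP client, with the run id pulled from XCom in place of the placeholder.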

Implementing branching in Airflow

I am trying to add alerts to my Airflow DAGs. The DAGs have multiple tasks, some up to 15.
I want to execute a bash script (a general script for all DAGs) if any task fails at any point.
For example, a DAG has tasks T1 to T5, chained as T1 >> T2 >> T3 >> T4 >> T5.
I want to trigger task A (representing alerts) if any of these fail.
It would be really helpful if anyone could help me with the task hierarchy.
You have two options, in my opinion: a failure callback or trigger rules.
Success / Failure Callback
Airflow tasks let you define what to do on failure or success. These are callbacks that run when a task instance reaches a specific state. Here are your options:
...
on_failure_callback=None,
on_success_callback=None,
on_retry_callback=None
...
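As a hedged sketch of the callback option, the function below runs a shared alert script when a task fails. Airflow calls a failure callback with the task's context dictionary, which includes the task instance under "task_instance". The script path and both function names are hypothetical, and the stand-in context at the end exists only so the command can be inspected without a scheduler.

```python
import subprocess
from types import SimpleNamespace

ALERT_SCRIPT = "/path/to/alert.sh"  # hypothetical path to the shared alert script


def build_alert_command(context):
    """Build the argument list for the alert script from the callback context."""
    ti = context["task_instance"]
    return ["bash", ALERT_SCRIPT, ti.dag_id, ti.task_id]


def alert_on_failure(context):
    """Failure callback: receives the failed task's context dict."""
    # check=False so a broken alert script does not raise inside the callback.
    subprocess.run(build_alert_command(context), check=False)


# Stand-in for the context Airflow would pass, for a dry run of the command only.
fake_context = {"task_instance": SimpleNamespace(dag_id="my_dag", task_id="T3")}
command = build_alert_command(fake_context)
print(command)
```

To use it, pass on_failure_callback=alert_on_failure to each operator, or put it in the DAG's default_args so that every task inherits it.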
Trigger Rules
Each Airflow task has a trigger rule that says which upstream states allow it to run; the default is ALL_SUCCESS. That means your main branch can stay as it is, and you can branch off to A from T1 like this:
from airflow.utils.trigger_rule import TriggerRule

T1 >> DummyOperator(
    dag=dag,
    task_id="task_a",
    trigger_rule=TriggerRule.ALL_FAILED
)
Alternatively, you can build your branch and include A as:
from airflow.utils.trigger_rule import TriggerRule

[T1, T2, T3, ...] >> DummyOperator(
    dag=dag,
    task_id="task_a",
    trigger_rule=TriggerRule.ONE_FAILED
)
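To see how the two variants differ, here is a simplified pure-Python model of how a trigger rule is evaluated once all upstream tasks are done. It is not Airflow's implementation: the function should_trigger is hypothetical, it covers only four rules, and it treats upstream_failed the same as failed.

```python
def should_trigger(rule, upstream_states):
    """Simplified model: does a task fire, given its finished upstream states?"""
    failed = [s for s in upstream_states if s in ("failed", "upstream_failed")]
    succeeded = [s for s in upstream_states if s == "success"]
    if rule == "all_success":
        return len(succeeded) == len(upstream_states)
    if rule == "all_failed":
        return len(failed) == len(upstream_states)
    if rule == "one_failed":
        return len(failed) >= 1
    if rule == "one_success":
        return len(succeeded) >= 1
    raise ValueError("rule not covered by this sketch: " + rule)


# T1 >> T2 >> T3 where T2 fails: T3 never runs and is marked upstream_failed.
states = {"T1": "success", "T2": "failed", "T3": "upstream_failed"}

# Alert task attached to T1 only, with all_failed: it misses the T2 failure.
only_t1 = should_trigger("all_failed", [states["T1"]])
# Alert task attached to every task, with one_failed: it catches the failure.
all_tasks = should_trigger("one_failed", list(states.values()))
print(only_t1, all_tasks)
```

Under this model, the first variant alerts only when T1 itself fails, while the second alerts when any listed task fails, which matches the "any task at any point" requirement in the question.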

Airflow - Skip future task instance without making changes to dag file

I have a DAG 'abc' scheduled to run every day at 7 AM CST and there is task 'xyz' in that DAG.
For some reason, I do not want to run one of the tasks 'xyz' for tomorrow's instance.
How can I skip that particular task instance?
I do not want to make any changes to code as I do not have access to Prod code and the task is in Prod environment now.
Is there any way to do that from the command line?
Appreciate any help on this.
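You can mark the unwanted task as succeeded using the run command. Tasks marked as succeeded will not be run again for that execution date. The commands below use the Airflow 1.x CLI; in Airflow 2 the equivalents are airflow dags next-execution, airflow tasks run, and airflow tasks state.
Assume there is a DAG with ID a_dag and three tasks with IDs dummy1, dummy2, and dummy3. We want to skip the dummy3 task in the next DAG run.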
First, we get the next execution date:
$ airflow next_execution a_dag
2020-06-12T21:00:00+00:00
Then we mark dummy3 as succeeded for this execution date:
$ airflow run -fAIim a_dag dummy3 '2020-06-12T21:00:00+00:00'
To be sure, we can check the task state. For the skipped task it will be success:
$ airflow task_state a_dag dummy3 '2020-06-12T21:00:00+00:00'
...
success
For the rest of the tasks the state will be None:
$ airflow task_state a_dag dummy1 '2020-06-12T21:00:00+00:00'
...
None

Airflow 'one_success' task not triggered

I'm running Airflow on a 4 CPU machine with LocalExecutor.
I've defined a task to trigger on one success:
create_spark_cluster_task = BashOperator(
    task_id='create_spark_cluster',
    trigger_rule='one_success',
    bash_command= ...,
    dag=dag)
...
download_bag_data_task >> create_spark_cluster_task
download_google_places_data_task >> create_spark_cluster_task
download_facebook_places_details_data_task >> create_spark_cluster_task
download_facebook_places_details_data_task_2 >> create_spark_cluster_task
download_facebook_places_details_data_task_3 >> create_spark_cluster_task
download_factual_data_task >> create_spark_cluster_task
download_dataoutlet_data_task >> create_spark_cluster_task
But even though some are clearly marked as success, the task does not trigger.
The download tasks do run in parallel, so that cannot be the issue.
Inspecting the tasks shows:
Dependency: Unknown
Reason: All dependencies are met but the task
instance is not running. In most cases this just means that the task
will probably be scheduled soon unless:
- The scheduler is down or under heavy load
- This task instance already ran and had it's state changed manually (e.g. cleared in the UI)
I've looked at the load and it's indeed pretty high:
load average: 2.45, 3.55, 3.71
CPU is at 50-60%
But the other tasks have already finished, so there should be resources free to start another task, right?
