Airflow set task instance status as skipped programmatically - airflow

I have list that I loop to create the tasks. The list are static as far as size.
for counter, account_id in enumerate(ACCOUNT_LIST):
task_id = f"bash_task_{counter}"
if account_id:
trigger_task = BashOperator(
task_id=task_id,
bash_command="echo hello there",
dag=dag)
else:
trigger_task = BashOperator(
task_id=task_id,
bash_command="echo hello there",
dag=dag)
trigger_task.status = SKIPPED # is there way to somehow set status of this to skipped instead of having a branch operator?
trigger_task
I tried this manually but cannot make the task skipped:
start = DummyOperator(task_id='start')
task1 = DummyOperator(task_id='task_1')
task2 = DummyOperator(task_id='task_2')
task3 = DummyOperator(task_id='task_3')
task4 = DummyOperator(task_id='task_4')
start >> task1
start >> task2
try:
start >> task3
raise AirflowSkipException
except AirflowSkipException as ase:
log.error('Task Skipped for task3')
try:
start >> task4
raise AirflowSkipException
except AirflowSkipException as ase:
log.error('Task Skipped for task4')

yes there you need to raise AirflowSkipException
from airflow.exceptions import AirflowSkipException
raise AirflowSkipException
For more information see the source code

Have a fixed number of tasks to execute per DAG. This is really fine and this is also planning how much max parallel task your system should handle without degrading downstream systems. Also, having fixed number of tasks makes it visible in the web UI and give you indication whether they are executed or skipped.
In the code below, I initialized the list with None items and then update the list with values based on returned data from the DB. In the python_callable function, check if the account_id is None then raise an AirflowSkipException, otherwise execute the function. In the UI, the tasks are visible and indicates whether executed or skipped(meaning there is no account_id)
def execute(account_id):
if account_id:
print(f'************Executing task for account_id:{account_id}')
else:
raise AirflowSkipException
def create_task(task_id, account_id):
return PythonOperator(task_id=task_id,
python_callable=execute,
op_args=[account_id])
list_from_dbhook = [1, 2, 3] # dummy list. Get records using DB Hook
# Need to have some fix size. Need to allocate fix resources or # of tasks.
# Having this fixed number of tasks will make this tasks to be visible in UI instead of being purely dynamic
record_size_limit = 5
ACCOUNT_LIST = [None] * record_size_limit
for index, account_id_val in enumerate(list_from_dbhook):
ACCOUNT_LIST[index] = account_id_val
for idx, acct_id in enumerate(ACCOUNT_LIST):
task = create_task(f"task_{idx}", acct_id)
task

Related

Airflow - Task-Group with Dynamic task - Can't trigger Downstream if one upstream is failed/skipped

I've an Airflow DAG where I've a task_group with a loop inside that generates two dynamic tasks. After the task_group I need to perform other actions. My problem is:
Inside the task_group I've a branching operators that validates if the last task should run or not. In case of one of the two flows are completed with success, I want to continue my process. For that I'm using the trigger_rule one_success. My code:
with DAG(
dag_id='hello_world',
schedule_interval=None,
start_date=datetime(2022, 8, 25),
default_args=default_args,
max_active_runs=1,
catchup = False,
concurrency = 1,
) as dag:
task_a = DummyOperator(task_id="task_a")
with TaskGroup(group_id='task_group') as my_group:
my_list = ['a','b']
for i in my_list:
task_b = PythonOperator(
task_id="task_a_".format(i),
python_callable=p_task_1)
var_to_continue = check_status(i)
is_running = ShortCircuitOperator(
task_id="is_{}_running".format(i),
python_callable=lambda x: x in [True],
op_args=[var_to_continue])
task_c = PythonOperator(
task_id="task_a_".format(i),
python_callable=p_task_2)
task_b >> is_running >> task_c
task_d = DummyOperator(task_id="task_c",trigger_rule=TriggerRule.ONE_SUCCESS)
task_a >> my_group >> task_d
My problem is: if one of the iterations return skipped the task_d is always skipped, even one of the flow return success.
Do you know how to resolve this?
Thanks!
After a deep search, I found the problem.
In fact, by default ShortCircuitOperator ignore all the downstream tasks trigger rules, if its value is False, it will cut the circuit, which means it will skip all the downstream tasks (its downstream tasks and their downstream tasks and their downstream tasks, ...).
In Airflow 2.3.0, in this PR, they added a new argument ignore_downstream_trigger_rules with default value True to ignore the downstream trigger rules, but you can stop that by providing a False value.
If you are using a version older than 2.3.0, you should replace the operator ShortCircuitOperator by another solution, for ex:
def check_condition():
if not condition: # add your logic
raise AirflowSkipException()
is_running = PythonOperator(..., python_callable=check_condition)
is_running >> task_c

Airflow: Get the status of the prior run for a task

I'm working with Airflow 2.1.4 and looking to find the status of the prior task run (Task Run, not Task Instance and not Dag Run).
I.e., DAG MorningWorkflow runs a 9:00am, and task ConditionalTask is in that dag. There is some precondition logic that will throw an AirflowSkipException in a number of situations (including timeframe of day and other context-specific information to reduce the likelihood of collisions with independent processes)
If ConditionalTask fails, we can fix the issue, clear the failed run, and re-run it without running the entire DAG. However, the skip logic reruns and will often now skip it, even though the original conditions were non-skipping.
So, I want to update the precondition logic to never skip if this taskinstance ran previously and failed. I can determine if the taskinstance ran previously using TaskInstance.try_number orTaskInstance.prev_attempted_tries, but this doesn't tell me whether it actually tried to run originally or if it skipped (i.e., if we clear the entire DagRun to rerun the whole workflow, we would want it to still skip).
An alternative would be to determine whether the first attempted run was skipped or not.
#Kevin Crouse
In order to answer your question, we can take advantage of from airflow.models import DagRun
To provide you with a complete, answer I have created two functions to assist you in resolving similar quandaries in the future.
How to return the overall state/success of a specific dag_id passed as a function arg?
def get_last_dag_run_status(dag_id):
""" Returns the status of the last dag run for the given dag_id
1. Utilise the find method of DagRun class
2. Step 1 returns a list, so we sort it by the last execution date
3. I have returned 2 examples for you to see a) the state, b) the last execution date, you can explore this further by just returning last_dag_run[0]
Args:
dag_id (str): The dag_id to check
Returns:
List - The status of the last dag run for the given dag_id
List - The last execution date of the dag run for the given dag_id
"""
last_dag_run = DagRun.find(dag_id=dag_id)
last_dag_run.sort(key=lambda x: x.execution_date, reverse=True)
return [last_dag_run[0].state, last_dag_run[0].execution_date]
How to return the status of a specific task_id, within a specific dag_id?
def get_task_status(dag_id, task_id):
""" Returns the status of the last dag run for the given dag_id
1. The code is very similar to the above function, I use it as the foundation for many similar problems/solutions
2. The key difference is that in the return statement, we can directly access the .get_task_instance passing our desired task_id and its state
Args:
dag_id (str): The dag_id to check
task_id (str): The task_id to check
Returns:
List - The status of the last dag run for the given dag_id
"""
last_dag_run = DagRun.find(dag_id=dag_id)
last_dag_run.sort(key=lambda x: x.execution_date, reverse=True)
return last_dag_run[0].get_task_instance(task_id).state
I hope this helps you in your journey to resolve your issues.
For posterity, here is a complete dummy Dag to demonstrate the 2 functions working.
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from airflow.models import DagRun
from datetime import datetime
def get_last_dag_run_status(dag_id):
""" Returns the status of the last dag run for the given dag_id
Args:
dag_id (str): The dag_id to check
Returns:
List - The status of the last dag run for the given dag_id
List - The last execution date of the dag run for the given dag_id
"""
last_dag_run = DagRun.find(dag_id=dag_id)
last_dag_run.sort(key=lambda x: x.execution_date, reverse=True)
return [last_dag_run[0].state, last_dag_run[0].execution_date]
def get_task_status(dag_id, task_id):
""" Returns the status of the last dag run for the given dag_id
Args:
dag_id (str): The dag_id to check
task_id (str): The task_id to check
Returns:
List - The status of the last dag run for the given dag_id
"""
last_dag_run = DagRun.find(dag_id=dag_id)
last_dag_run.sort(key=lambda x: x.execution_date, reverse=True)
return last_dag_run[0].get_task_instance(task_id).state
with DAG(
'stack_overflow_ans_1',
tags = ['SO'],
start_date = datetime(2022, 1, 1),
schedule_interval = None,
catchup = False,
is_paused_upon_creation = False
) as dag:
t1 = DummyOperator(
task_id = 'start'
)
t2 = PythonOperator(
task_id = 'get_last_dag_run_status',
python_callable = get_last_dag_run_status,
op_args = ['YOUR_DAG_NAME'],
do_xcom_push = False
)
t3 = PythonOperator(
task_id = 'get_task_status',
python_callable = get_task_status,
op_args = ['YOUR_DAG_NAME', 'YOUR_DAG_TASK_WITHIN_THE_DAG'],
do_xcom_push = False
)
t4 = DummyOperator(
task_id = 'end'
)
t1 >> t2 >> t3 >> t4

Airflow - Get last success instance of the task

I have a need to terminate and start the EMR cluster every 24 hours from Airflow.
I have implemented logic to check if the previous task execution date - current execution date =1, then terminate the cluster and create a new one. Otherwise, skip the execution of the EMR creation task.
But, I'm seeing a weird scenario where the DAG execution is getting marked as a success but no task is being executed!!!
So, to handle this scenario, I'm trying to check for the last success date of the EMR (XCOM will have cluster id) to terminate the cluster before I start a new one.
I'm not successful so far.. any help is appreciated.
DAG Image:
The Pink ones indicate skip. If you closely observe, the emr_termination task (last row in the image) after the blank boxes should be green just like for emr_creation task. But, it got skipped and the previous cluster was not terminated
Code:
def emr_termination_trigger(execution_date, prev_execution_date_success, prev_execution_date, **kwargs):
days_diff = (execution_date.date() - prev_execution_date.date()).days
provisioned_product_id, cluster_id = None, None
creation_tsk_status = 'N/A'
try:
ti = TaskInstance(emr_creation_tsk, execution_date)
if ti.previous_ti is None:
print("previous execution of emr creation is not available. feching last 2 execution")
ti = TaskInstance(emr_creation_tsk, prev_execution_date)
if ti.previous_ti is None:
print("last 2 executions are not available. Nothing to terminate")
creation_tsk_status = None
else:
creation_tsk_status = ti.previous_ti.state
provisioned_product_id, cluster_id = ti.previous_ti.xcom_pull(
task_ids=emr_creation_tsk.task_id)
else:
creation_tsk_status = ti.previous_ti.state
provisioned_product_id, cluster_id = ti.previous_ti.xcom_pull(task_ids=emr_creation_tsk.task_id)
except:
pass
print(f"days_diff:{days_diff} - provisioned_prd:{provisioned_product_id} - create_status:{creation_tsk_status}")
if days_diff == 1:
print("Inside Terminating cluster")
terminate_cluster = TerminateEMROperator(
task_id='terminate_cluster',
provisioned_product_id=provisioned_product_id,
airflow_conn_id=airflow_conn_id,
provide_context=True,
dag=dag)
try:
terminate_cluster.execute(context=kwargs)
except Exception as e:
print(f"Got exception while terminating the EMR cluster:{str(e)}")

Add downstream task to every task without downstream in a DAG in Airflow 1.9

Problem: I've been trying to find a way to get tasks from a DAG that have no downstream tasks following them.
Why I need it: I'm building an "on success" notification for DAGs. Airflow DAGs have an on_success_callback argument, but problem with that is that it gets triggered after every task success instead of just DAG. I've seen other people approach this problem by creating notification task and appending it to the end. Problem I have with this approach is that many DAGs we're using have multiple ends, and some are auto-generated.
Making sure that all ends are caught manually is tedious.
I've spent hours digging for a way to access data I need to build this.
Sample DAG setup:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime
default_args = {
'owner': 'airflow',
'start_date': datetime(2018, 7, 29)}
dag = DAG(
'append_to_end',
description='append a tast to all tasks without downstream',
default_args=default_args,
schedule_interval='* * * * *',
catchup=False)
task_1 = DummyOperator(dag=dag, task_id='task_1')
task_2 = DummyOperator(dag=dag, task_id='task_2')
task_3 = DummyOperator(dag=dag, task_id='task_3')
task_1 >> task_2
task_1 >> task_3
This produces following DAG:
What I want to achieve is an automated way to include a new task to a DAG that connects to all ends, like in an image below.
I know it's an old post, but I've had a similar need as the above posted.
You can add to your return function a statement that doesn't return your "final_task" id, and so it won't be added to the get_leaf_task return, something like:
def get_leaf_tasks(dag):
return [task for task_id, task in dag.task_dict.items() if len(task.downstream_list) == 0 and task_ids != 'final_task']
Additionally, you can change this part:
for task in leaf_tasks:
task >> final_task
to:
get_leaf_tasks(dag) >> final_task
Since it already gives you a list of task instances and the bitwise operator ">>" will do the loop for you.
What I've got to so far is code below:
def get_leaf_tasks(dag):
return [task for task_id, task in dag.task_dict.items() if len(task.downstream_list) == 0]
leaf_tasks = get_leaf_tasks(dag)
final_task = DummyOperator(dag=dag, task_id='final_task')
for task in leaf_tasks:
task >> final_task
It produces the result I want, but what I don't like about this solution is that get_leaf_tasks must be executed before final_task is created, or it will be included in leaf_tasks list and I'll have to find ways to exclude it.
I could wrap assignment in another function:
def append_to_end(dag, task):
leaf_tasks = get_leaf_tasks(dag)
dag.add_task(task)
for task in leaf_tasks:
task >> final_task
final_task = DummyOperator(task_id='final_task')
append_to_end(dag, final_task)
This is not ideal either, as caller must ensure they've created a final_task without DAG assigned to it.

How to setup Nagios Alerts for Apache Airflow Dags

Is it possible to setup Nagios alerts for airflow dags?
In case the dag is failed, I need to alert the respective groups.
You can add an "on_failure_callback" to any task which will call an arbitrary failure handling function. In that function you can then send an error call to Nagios.
For example:
dag = DAG(dag_id="failure_handling",
schedule_interval='#daily')
def handle_failure(context):
# first get useful fields to send to nagios/elsewhere
dag_id = context['dag'].dag_id
ds = context['ds']
task_id = context['ti'].task_id
# instead of printing these out - you can send these to somewhere else
logging.info("dag_id={}, ds={}, task_id={}".format(dag_id, ds, task_id))
def task_that_fails(**kwargs):
raise Exception("failing test")
task_to_fail = PythonOperator(
task_id='python_task_to_fail',
python_callable=task_that_fails,
provide_context=True,
on_failure_callback=handle_failure,
dag=dag)
If you run a test on this:
airflow test failure_handling task_to_fail 2018-08-10
You get the following in your log output:
INFO - dag_id=failure_handling, ds=2018-08-10, task_id=task_to_fail

Resources