Spawn multiple parallel Tasks from one DAG - airflow

Pseudo Python code:
def update_tags(product):
    # Update the tag based on the product's attributes
    pass

def products(p, h, **kwargs):
    # Get the products (e.g. via an API call using params p and headers h)
    products = []
    for product in products:
        update_tags(product)
with models.DAG(
    "fetchProducts",
    default_args=default_args,
    schedule_interval=None
) as dag:
    get_products = PythonOperator(
        task_id='get_products',
        python_callable=products,
        op_kwargs={'p': params, 'h': headers},
        provide_context=True
    )

    get_products >> ??
What I am trying to achieve:
When the products loop runs, I want each iteration to spawn an update_tags task in Airflow and then move on to the next iteration, i.e. not wait for the task to finish.
My airflow.cfg uses the LocalExecutor.
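A common way to get that behaviour (just a sketch, not a confirmed answer: it assumes the product list can be fetched when the DAG file is parsed, and fetch_products / update_tags below are hypothetical helpers) is to generate one task per product, so the LocalExecutor can run them in parallel:
import airflow
from airflow import models
from airflow.operators.python_operator import PythonOperator

default_args = {"owner": "airflow", "start_date": airflow.utils.dates.days_ago(1)}
params, headers = {}, {}  # placeholders for the request params/headers


def fetch_products(p, h):
    # Hypothetical helper: return the list of products, e.g. by calling
    # the products API with params p and headers h.
    return [{"id": 1}, {"id": 2}]


def update_tags(product, **kwargs):
    # Hypothetical helper: update the tags of a single product.
    print("updating tags for product", product["id"])


with models.DAG(
    "fetchProducts",
    default_args=default_args,
    schedule_interval=None,
) as dag:
    # One PythonOperator per product; the tasks have no dependencies on each
    # other, so the LocalExecutor can run them concurrently.
    for product in fetch_products(params, headers):
        PythonOperator(
            task_id="update_tags_{}".format(product["id"]),
            python_callable=update_tags,
            op_kwargs={"product": product},
        )
With this pattern the set of tasks is fixed at parse time; if the product list changes between runs, the shape of the DAG changes with it.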

Related

DAG failure during the initial run

I have two DAGs:
DAG_A and DAG_B.
DAG_A triggers DAG_B through the TriggerDagRunOperator.
My tasks in DAG_B:
with DAG(
    dag_id='DAG_B',
    default_args=default_args,
    schedule_interval='@once',
    description='ETL pipeline for processing users'
) as dag:
    start = DummyOperator(
        task_id='start')

    delete_xcom_task = PostgresOperator(
        task_id='clean_up_xcom',
        postgres_conn_id='postgres_default',
        sql="delete from xcom where dag_id='DAG_A' and task_id='TASK_A' ")

    end = DummyOperator(
        task_id='end')
        # trigger_rule='none_failed')

    # num_table is set by DAG_A. Will have an empty list initially.
    iterable_string = Variable.get("num_table", default_var="[]")
    iterable_list = ast.literal_eval(iterable_string)

    for index, table in enumerate(iterable_list):
        table = table.strip()

        read_src1 = PythonOperator(
            task_id=f'Read_Source_data_{table}',
            python_callable=read_src,
            op_kwargs={'index': index}
        )

        upload_file_to_directory_bulk1 = PythonOperator(
            task_id=f'ADLS_Loading_{table}',
            python_callable=upload_file_to_directory_bulk,
            op_kwargs={'index': index}
        )

        write_Snowflake1 = PythonOperator(
            task_id=f'Snowflake_Staging_{table}',
            python_callable=write_Snowflake,
            op_kwargs={'index': index}
        )

        task_sf_storedproc1 = DummyOperator(
            task_id=f'Snowflake_Processing_{table}'
        )

        start >> read_src1 >> upload_file_to_directory_bulk1 >> write_Snowflake1 >> task_sf_storedproc1 >> delete_xcom_task >> end
After running airflow db init and bringing up the webserver and scheduler, DAG_B fails with a failure in the delete_xcom_task task.
[2021-06-22 08:04:43,647] {taskinstance.py:871} INFO - Dependencies not met for <TaskInstance: Target_DIF.clean_up_xcom 2021-06-22T08:04:27.861718+00:00 [queued]>, dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success' requires all upstream tasks to have succeeded, but found 2 non-success(es). upstream_tasks_state={'total': 2, 'successes': 0, 'skipped': 0, 'failed': 0, 'upstream_failed': 0, 'done': 0}, upstream_task_ids={'Snowflake_Processing_products', 'Snowflake_Processing_inventories'}
[2021-06-22 08:04:43,651] {local_task_job.py:93} INFO - Task is not able to be run
But both DAGs succeed from the second run onwards.
Can anyone explain to me what is happening internally?
How can I avoid the failure during the first run?
Thanks.
I suspect that the problem is in schedule_interval='@once' for DAG_B: when you add the DAG for the first time, the schedule_interval tells the scheduler to run the DAG once. So DAG_B is triggered once by the scheduler and not by DAG_A. Any preparations that need to be done by DAG_A for DAG_B to run successfully have not been done yet, and therefore DAG_B fails.
Later on, DAG_A runs as scheduled and triggers DAG_B as expected. Both succeed.
To avoid DAG_B being triggered by the scheduler set schedule_interval=None.
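Applied to the DAG_B definition above, only the schedule_interval needs to change; a minimal sketch:
with DAG(
    dag_id='DAG_B',
    default_args=default_args,
    schedule_interval=None,  # never picked up by the scheduler; runs only when DAG_A triggers it
    description='ETL pipeline for processing users'
) as dag:
    ...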

How can I make some of the tasks in an Airflow DAG not run, and the rest run normally, depending on my input parameters?

I have defined an Airflow DAG with a number of different tasks, all of which depend on one parameter that is passed to the DAG-building function. When I pass different values for this parameter, I get different DAGs with the same task structure.
However, I have run into a problem: how can I make some of these tasks not run when I pass a certain parameter to the DAG, while for other parameters all of the tasks run normally?
For example, the dag code is like the following:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator  # needed for the start/end tasks
from libs.config.dag_config import Config
from etl.funcs import function1, function2, function3


def build_dag(id, schedule_interval):
    config = Config()
    config.add_params({'id': id})

    with DAG(f'{id}_etl', default_args=config.default_args, schedule_interval=schedule_interval) as dag:
        start = DummyOperator(task_id='start')
        end = DummyOperator(task_id='end')
        task1 = PythonOperator(task_id='task1',
                               python_callable=function1,
                               provide_context=True)
        task2 = PythonOperator(task_id='task2',
                               python_callable=function2,
                               provide_context=True)
        task3 = PythonOperator(task_id='task3',
                               python_callable=function3,
                               provide_context=True)
        tasks = [task1, task2, task3]
        start >> tasks >> end
    return dag


dag1 = build_dag(10001, '0 * * * *')
dag2 = build_dag(10002, '0 * * * *')
dag3 = build_dag(10003, '0 * * * *')
For dag1 and dag2, I want all three tasks to run normally, but for dag3 I only want task1 and task2 to run, not task3. How can I achieve this?
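One straightforward approach (a sketch only, reusing the imports from the DAG file above; the exclude_task3_ids set is a hypothetical name) is to decide at build time whether task3 is created at all, based on the id passed to build_dag:
# Hypothetical set of DAG ids for which task3 should not be created at all.
exclude_task3_ids = {10003}


def build_dag(id, schedule_interval):
    config = Config()
    config.add_params({'id': id})

    with DAG(f'{id}_etl', default_args=config.default_args, schedule_interval=schedule_interval) as dag:
        start = DummyOperator(task_id='start')
        end = DummyOperator(task_id='end')
        tasks = [
            PythonOperator(task_id='task1', python_callable=function1, provide_context=True),
            PythonOperator(task_id='task2', python_callable=function2, provide_context=True),
        ]
        # Only add task3 to the DAG when this id is not excluded.
        if id not in exclude_task3_ids:
            tasks.append(PythonOperator(task_id='task3', python_callable=function3, provide_context=True))
        start >> tasks >> end
    return dag
Because the decision is made when the DAG file is parsed, dag3 simply never contains a task3, rather than skipping it at runtime.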

Is there a way to configure different 'retries' for tasks in the same DAG

I have a DAG with many sub-tasks in it. In the middle of the DAG there is a validation task, and based on its result/return code I want to take two different paths. On success, one route (a sequence of tasks) should be followed, and on failure we would like to execute a different set of tasks. There are two problems with the current approach: first, the validation task executes many times (as per the configured retries) if the exit code is 1; second, there is no way to take different branches of execution.
To work around problem 1, we can use the retry number that is available from the task instance via the {{ task_instance }} macro. I would appreciate it if someone could point us to a cleaner approach; problem 2, taking different paths, remains unsolved.
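For illustration, reading that retry number inside the Python callable might look roughly like this (a sketch; it assumes the task runs with provide_context=True so the task instance is passed in the context):
def validate(**context):
    # try_number is 1 on the first attempt and increments on each retry.
    ti = context['task_instance']
    print("validation attempt number", ti.try_number)
    # ... actual validation logic goes here ...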
You can have retries at the task level.
run_this = BashOperator(
    task_id='run_after_loop',
    bash_command='echo 1',
    retries=3,
    dag=dag,
)

run_this_last = DummyOperator(
    task_id='run_this_last',
    retries=1,
    dag=dag,
)
Regarding your 2nd problem, there is a concept of Branching.
The BranchPythonOperator is much like the PythonOperator except that it expects a python_callable that returns a task_id (or list of task_ids). The task_id returned is followed, and all of the other paths are skipped. The task_id returned by the Python function has to be referencing a task directly downstream from the BranchPythonOperator task.
Example DAG:
import random

import airflow
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}

dag = DAG(
    dag_id='example_branch_operator',
    default_args=args,
    schedule_interval="@daily",
)

run_this_first = DummyOperator(
    task_id='run_this_first',
    dag=dag,
)

options = ['branch_a', 'branch_b', 'branch_c', 'branch_d']

branching = BranchPythonOperator(
    task_id='branching',
    python_callable=lambda: random.choice(options),
    dag=dag,
)
run_this_first >> branching

join = DummyOperator(
    task_id='join',
    trigger_rule='one_success',
    dag=dag,
)

for option in options:
    t = DummyOperator(
        task_id=option,
        dag=dag,
    )
    dummy_follow = DummyOperator(
        task_id='follow_' + option,
        dag=dag,
    )
    branching >> t >> dummy_follow >> join
Regarding your first problem, you can set task/Operator-specific retry options quite easily. Reference: baseoperator.py#L77.
For problem two, you can branch within a DAG easily with BranchPythonOperator (example usage: example_branch_operator.py). You will want to nest your validation task/logic within the BranchPythonOperator (you can define and execute operators within operators).
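A minimal sketch of that idea, assuming an existing dag object like the one in the example above; run_validation, choose_path, success_path and failure_path are hypothetical names:
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator


def run_validation():
    # Hypothetical validation logic; return True on success, False on failure.
    return True


def choose_path(**context):
    # Run the validation inside the branching callable and return the task_id
    # of the branch to follow; the other branch gets skipped by Airflow.
    return 'success_path' if run_validation() else 'failure_path'


validate_and_branch = BranchPythonOperator(
    task_id='validate_and_branch',
    python_callable=choose_path,
    provide_context=True,
    dag=dag,
)

success_path = DummyOperator(task_id='success_path', dag=dag)
failure_path = DummyOperator(task_id='failure_path', dag=dag)

validate_and_branch >> [success_path, failure_path]
Because the validation returns a result instead of raising on failure, the branching task itself does not fail (and so does not retry); its return value only decides which branch runs.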

DAG marked as "success" if one task fails, because of trigger rule ALL_DONE

I have the following DAG with 3 tasks:
start --> special_task --> end
The task in the middle can succeed or fail, but end must always be executed (imagine this is a task for cleanly closing resources). For that, I used the trigger rule ALL_DONE:
end.trigger_rule = trigger_rule.TriggerRule.ALL_DONE
Using that, end is properly executed if special_task fails. However, since end is the last task and succeeds, the DAG is always marked as SUCCESS.
How can I configure my DAG so that if one of the tasks failed, the whole DAG is marked as FAILED?
Example to reproduce
import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils import trigger_rule

dag = DAG(
    dag_id='my_dag',
    start_date=datetime.datetime.today(),
    schedule_interval=None
)

start = BashOperator(
    task_id='start',
    bash_command='echo start',
    dag=dag
)

special_task = BashOperator(
    task_id='special_task',
    bash_command='exit 1',  # force failure
    dag=dag
)

end = BashOperator(
    task_id='end',
    bash_command='echo end',
    dag=dag
)

end.trigger_rule = trigger_rule.TriggerRule.ALL_DONE

start.set_downstream(special_task)
special_task.set_downstream(end)
This post seems to be related, but the answer does not suit my needs, since the downstream task end must be executed (hence the mandatory trigger_rule).
I thought it was an interesting question and spent some time figuring out how to achieve it without an extra dummy task. It became a bit of a superfluous task, but here's the end result:
This is the full DAG:
import airflow
from airflow import AirflowException
from airflow.models import DAG, TaskInstance, BaseOperator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils.db import provide_session
from airflow.utils.state import State
from airflow.utils.trigger_rule import TriggerRule

default_args = {"owner": "airflow", "start_date": airflow.utils.dates.days_ago(3)}

dag = DAG(
    dag_id="finally_task_set_end_state",
    default_args=default_args,
    schedule_interval="0 0 * * *",
    description="Answer for question https://stackoverflow.com/questions/51728441",
)

start = BashOperator(task_id="start", bash_command="echo start", dag=dag)
failing_task = BashOperator(task_id="failing_task", bash_command="exit 1", dag=dag)


@provide_session
def _finally(task, execution_date, dag, session=None, **_):
    upstream_task_instances = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == dag.dag_id,
            TaskInstance.execution_date == execution_date,
            TaskInstance.task_id.in_(task.upstream_task_ids),
        )
        .all()
    )
    upstream_states = [ti.state for ti in upstream_task_instances]
    fail_this_task = State.FAILED in upstream_states

    print("Do logic here...")

    if fail_this_task:
        raise AirflowException("Failing task because one or more upstream tasks failed.")


finally_ = PythonOperator(
    task_id="finally",
    python_callable=_finally,
    trigger_rule=TriggerRule.ALL_DONE,
    provide_context=True,
    dag=dag,
)

succesful_task = DummyOperator(task_id="succesful_task", dag=dag)

start >> [failing_task, succesful_task] >> finally_
Look at the _finally function, which is called by the PythonOperator. There are a few key points here:
Annotate with @provide_session and add the argument session=None, so you can query the Airflow DB with session.
Query all upstream task instances for the current task:
upstream_task_instances = (
    session.query(TaskInstance)
    .filter(
        TaskInstance.dag_id == dag.dag_id,
        TaskInstance.execution_date == execution_date,
        TaskInstance.task_id.in_(task.upstream_task_ids),
    )
    .all()
)
From the returned task instances, get the states and check if State.FAILED is in there:
upstream_states = [ti.state for ti in upstream_task_instances]
fail_this_task = State.FAILED in upstream_states
Perform your own logic:
print("Do logic here...")
And finally, fail the task if fail_this_task=True:
if fail_this_task:
    raise AirflowException("Failing task because one or more upstream tasks failed.")
The end result: when an upstream task fails, the finally task raises as well, so the DAG run is marked as failed.
As @JustinasMarozas explained in a comment, a solution is to create a dummy task like:
dummy = DummyOperator(
    task_id='test',
    dag=dag
)
and bind it downstream to special_task:
special_task.set_downstream(dummy)
Thus, the DAG is marked as failed, and the dummy task is marked as upstream_failed.
I hope there will be an out-of-the-box solution, but until then, this solution does the job.
To expand on Bas Harenslak's answer, a simpler _finally function, which checks the state of all tasks (not only the upstream ones), can be:
from airflow.utils.state import State

def _finally(**kwargs):
    for task_instance in kwargs['dag_run'].get_task_instances():
        if task_instance.current_state() != State.SUCCESS and \
                task_instance.task_id != kwargs['task_instance'].task_id:
            raise Exception("Task {} failed. Failing this DAG run".format(task_instance.task_id))

clear an upstream task in airflow within the dag

I have a task in an Airflow DAG with three child tasks. Unfortunately, there are cases where this parent task will succeed, but two of the three children will fail (and a retry on the children won't fix them).
It requires the parent to retry (even though it didn't fail).
So I dutifully go into the graph view of the DAG run and 'clear' this parent task and all downstream tasks (+recursive).
Is there a way I can do this within the DAG itself?
If your tasks are part of a subdag, calling dag.clear() in the on_retry_callback of a SubDagOperator should do the trick:
SubDagOperator(
    subdag=subdag,
    task_id="...",
    on_retry_callback=lambda context: subdag.clear(
        start_date=context['execution_date'],
        end_date=context['execution_date']),
    dag=dag
)
We had a similar problem which we resolved by putting the task with dependencies that we want to repeat into a sub dag. Then when the sub dag retries we clear the sub dag tasks state using the on_retry_callback so that they all run again.
sub_dag = SubDagOperator(
    retry_delay=timedelta(seconds=30),
    subdag=create_sub_dag(),
    on_retry_callback=callback_subdag_clear,
    task_id=sub_dag_name,
    dag=dag,
)

def callback_subdag_clear(context):
    """Clears a sub-dag's tasks on retry."""
    dag_id = "{}.{}".format(
        context['dag'].dag_id,
        context['ti'].task_id,
    )
    execution_date = context['execution_date']
    sub_dag = DagBag().get_dag(dag_id)
    sub_dag.clear(
        start_date=execution_date,
        end_date=execution_date,
        only_failed=False,
        only_running=False,
        confirm_prompt=False,
        include_subdags=False
    )
(originally taken from here https://gist.github.com/nathairtras/6ce0b0294be8c27d672e2ad52e8f2117)
We opted for using the clear_task_instances function of the taskinstance module:
from airflow.models import taskinstance
from airflow.utils.db import provide_session


@provide_session
def clear_tasks_fn(tis, session=None, activate_dag_runs=False, dag=None) -> None:
    """
    Wrapper for `clear_task_instances` to be used in callback function
    (that accepts only `context`)
    """
    taskinstance.clear_task_instances(tis=tis,
                                      session=session,
                                      activate_dag_runs=activate_dag_runs,
                                      dag=dag)


def clear_tasks_callback(context) -> None:
    """
    Clears tasks based on list passed as `task_ids_to_clear` parameter
    To be used as `on_retry_callback`
    """
    all_tasks = context["dag_run"].get_task_instances()
    dag = context["dag"]
    task_ids_to_clear = context["params"].get("task_ids_to_clear", [])

    tasks_to_clear = [ti for ti in all_tasks if ti.task_id in task_ids_to_clear]

    clear_tasks_fn(tasks_to_clear, dag=dag)
You would need to provide the list of tasks you want cleared on the callback, e.g. on any child task:
DummyOperator('some_child',
              on_retry_callback=clear_tasks_callback,
              params=dict(
                  task_ids_to_clear=['some_child', 'parent']
              ),
              retries=1
)
It does not directly answer your question but I can suggest a better workaround:
from datetime import datetime, timedelta

default_args = {
    'start_date': datetime(2017, 12, 16),
    'depends_on_past': True,
    # retries and retry_delay are task-level settings, so they belong in
    # default_args rather than being passed to the DAG constructor
    'retries': 100,
    'retry_delay': timedelta(seconds=120),
}

dag = DAG(
    dag_id='main_dag',
    schedule_interval='@daily',
    default_args=default_args,
    max_active_runs=1,
)
Set depends_on_past to True in the DAG's default_args.
Then, in the tasks of this DAG, limit the retries using retries:
DummyOperator(
    task_id='bar',
    retries=0,
    dag=dag)
This way the DAG is marked as failed when any task fails. Then the DAG will be retried.

Resources