I have a DAG that needs to recompile customer lists for various brands. The script is called with two arguments: brand and list type.
I need the brands to run concurrently, but each list type must depend on the preceding list type, and I can't figure out how to do that in a loop. Can y'all help me out?
BrandsToRun = ['A', 'B', 'C']
ListTypes = ['1', '2', '3']

# Defining the DAG
################################################################################

with DAG(
    'MusterMaster',
    default_args=default_args,
    description='x',
    # schedule_interval=None
    schedule_interval='30 4 * * *',
    catchup=False
) as MusterMaster:

    for Brand in BrandsToRun:
        for ListType in ListTypes:
            ListLoad = BashOperator(
                task_id='Load_' + str(Brand) + '_' + str(ListType),
                bash_command="""python3 '/usr/local/bin/MusterMaster.py' {0} {1}""".format(Brand[0], ListType[0]),
                pool='logs'
            )

            ListLoad
I want the tasks to have a dependency structure like this, but I can't figure it out. Brands should run concurrently, but each ListType should depend on the preceding ListType:

Muster A 1 >> Muster A 2 >> Muster A 3
Muster B 1 >> Muster B 2 >> Muster B 3
Muster C 1 >> Muster C 2 >> Muster C 3

How can I best accomplish this?
You can do:

for Brand in BrandsToRun:
    tasks = []
    for ListType in ListTypes:
        tasks.append(BashOperator(
            task_id='Load_' + str(Brand) + '_' + str(ListType),
            bash_command="""python3 '/usr/local/bin/MusterMaster.py' {0} {1}""".format(Brand[0], ListType[0]),
            pool='logs'))
        # Chain each task to the one created just before it.
        if len(tasks) > 1:
            tasks[-2] >> tasks[-1]

Which will give you the per-brand chains you described.
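As an alternative to the manual indexing, you can use Airflow's chain helper to wire up each brand's list in order. A sketch of the same loop (chain lives in airflow.models.baseoperator on Airflow 2.x, airflow.utils.helpers on 1.10):

from airflow.models.baseoperator import chain

for Brand in BrandsToRun:
    # Build all tasks for this brand, then chain them 1 >> 2 >> 3.
    chain(*[
        BashOperator(
            task_id='Load_' + str(Brand) + '_' + str(ListType),
            bash_command="""python3 '/usr/local/bin/MusterMaster.py' {0} {1}""".format(Brand[0], ListType[0]),
            pool='logs')
        for ListType in ListTypes
    ])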
I have a scenario like below:

Dag A:
task 1,
task 2

Dag B:
Task 3,
Task 4

Now I want to trigger/run Task 3 (Dag B) only after the success of Task 1 (Dag A). Both DAGs are scheduled on the same day, but at different times. For example:

Dag A runs on 14 July, 8 AM
Dag B runs on 14 July, 2 PM

Is that doable? How?
Please help. Thanks
In DagB you should create a BranchPythonOperator that returns "task3" if the appropriate conditions occurred.
In this code example, I return "task3" only if DagA finished with state=success on the same day.

from datetime import datetime

from airflow.models import DagRun, TaskInstance
from airflow.operators.python import BranchPythonOperator, PythonOperator
from airflow.utils import timezone
from airflow.utils.trigger_rule import TriggerRule


def check_success_dag_a(**context):
    ti: TaskInstance = context['ti']
    dag_run: DagRun = context['dag_run']
    date: datetime = ti.execution_date
    # Midnight of the current execution date, timezone-aware.
    ts = timezone.make_aware(datetime(date.year, date.month, date.day, 0, 0, 0))
    # Look for a successful DagA run between midnight and now.
    dag_a = dag_run.find(
        dag_id='DagA',
        state="success",
        execution_start_date=ts,
        execution_end_date=ti.execution_date)
    if dag_a:
        return "task3"
    # Returning None otherwise makes the branch skip the downstream tasks.


check_success = BranchPythonOperator(
    task_id="check_success_dag_a",
    python_callable=check_success_dag_a,
)


def run(**context):
    ti = context['ti']
    print(ti.task_id)


task3 = PythonOperator(
    task_id="task3",
    python_callable=run,
    trigger_rule=TriggerRule.ONE_SUCCESS
)

task4 = PythonOperator(
    task_id="task4",
    python_callable=run,
    trigger_rule=TriggerRule.ONE_SUCCESS
)

check_success >> task3 >> task4
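For completeness: another common tool for this kind of cross-DAG dependency is ExternalTaskSensor, which makes task3 wait for task 1 of DagA. A minimal sketch, assuming Airflow 2.x import paths and the 6-hour offset between the two schedules from the question:

from datetime import timedelta

from airflow.sensors.external_task import ExternalTaskSensor

wait_for_dag_a = ExternalTaskSensor(
    task_id="wait_for_dag_a_task1",
    external_dag_id="DagA",
    external_task_id="task1",
    # DagB runs 6 hours after DagA (8 AM vs 2 PM), so look back
    # 6 hours to line up with DagA's execution date.
    execution_delta=timedelta(hours=6),
)

wait_for_dag_a >> task3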
Can we set different schedule_intervals for different tasks in the same DAG?
I.e. I have one DAG with three tasks, A >> B >> C. I want the upstream tasks A & B to run weekly, but the downstream task C to run daily. Is that possible? If so, what should the schedule_interval be for the DAG and the tasks?
There are two options: you can use ShortCircuitOperator or BranchDayOfWeekOperator.
1. Using BranchDayOfWeekOperator. This operator branches based on a specific day of the week:
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.weekday import BranchDayOfWeekOperator

with DAG('my_dag',
         schedule_interval='@daily'
         ) as dag:
    task1 = DummyOperator(task_id='TASK1')
    task2 = DummyOperator(task_id='TASK2')
    task3 = DummyOperator(task_id='TASK3')
    end_task = DummyOperator(task_id='end_task')

    branch = BranchDayOfWeekOperator(
        task_id="make_choice",
        follow_task_ids_if_true="TASK3",
        follow_task_ids_if_false="end_task",
        week_day="Monday",
    )

    task1 >> task2 >> branch >> [task3, end_task]
In this example task3 will be executed only on Monday, while task1 & task2 will run daily.
Note that this operator is available only for Airflow >= 2.1.0; however, you can copy the operator's source code and create a local version.
2. Using ShortCircuitOperator:

from datetime import date

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import ShortCircuitOperator


def func():
    # True only on Mondays (weekday 0); on any other day the operator
    # short-circuits and skips all downstream tasks.
    if date.today().weekday() == 0:
        return True
    return False


with DAG('my_dag',
         schedule_interval='@daily'
         ) as dag:
    task1 = DummyOperator(task_id='TASK1')
    task2 = DummyOperator(task_id='TASK2')
    task3 = DummyOperator(task_id='TASK3')
    verify = ShortCircuitOperator(task_id='check_day', python_callable=func)

    task1 >> task2 >> verify >> task3
I want to return 2 or more tasks from a function, to be run in sequence in the spot where they're inserted in the dependencies; see below.

t1 = PythonOperator()


def generate_tasks():
    t2 = PythonOperator()
    t3 = PythonOperator()
    return magic(t2, t3)  # magic needed here (preferably)


t1 >> generate_tasks()  # otherwise here

# desired result: t1 >> t2 >> t3
Is this doable? As I understand it, Airflow 2.0 seems to achieve this with a TaskGroup, but we're on Google's Composer, and 2.0 won't be available for a while.
Best workaround I've found:
t1 = PythonOperator()


def generate_tasks():
    t2 = PythonOperator()
    t3 = PythonOperator()
    return [t2, t3]


tasks = generate_tasks()
t1 >> tasks[0] >> tasks[1]
But I'd really like that to be abstracted away, as it more or less defeats the purpose of having multiple operators returned from a single function. We want it to be a single unit as far as the end user knows, even though it can be composed of 2 or more tasks.
How to do it with TaskGroup in Airflow 2.0:

class Encryptor:
    def encrypt_and_archive(self):
        with TaskGroup("archive_and_encrypt") as section_1:
            encrypt = DummyOperator(task_id="encrypt")
            archive = BashOperator(task_id="archive", bash_command='echo 1')
            encrypt >> archive
        return section_1


with DAG(dag_id="example_return_task_group", start_date=days_ago(2), tags=["example"]) as dag:
    start = DummyOperator(task_id="start")
    encrypt_and_archive = Encryptor().encrypt_and_archive()
    end = DummyOperator(task_id='end')
    # 👇 single variable, containing two tasks
    start >> encrypt_and_archive >> end

Which renders archive_and_encrypt as a single collapsible group in the Graph view.
Is something similar remotely doable before 2.0?
You didn't explain what magic(t2, t3) is.
TaskGroup is strictly a UI feature; it doesn't affect the DAG logic. From your description it seems that you are looking for specific logic (otherwise, what is magic?).
I believe this is what you are after:
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2021, 1, 24),
}


def generate_tasks():
    operator_list = []
    # Replace this loop with whatever logic you use to dynamically create tasks.
    for i in range(5):
        op = DummyOperator(task_id=f"t{str(i)}_task", dag=dag)
        # Chain each new task to the previous one.
        if i > 0:
            operator_list[i - 1] >> op
        operator_list.append(op)
    return operator_list


with DAG(
    dag_id='loop',
    default_args=default_args,
    schedule_interval=None,
) as dag:
    start_op = DummyOperator(task_id='start_task')
    end_op = DummyOperator(task_id='end_task')

    tasks = generate_tasks()
    start_op >> tasks[0]
    tasks[-1] >> end_op
You can replace the DummyOperator with any operator you'd like.
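And if you also want the wiring abstracted away, as in the original magic(t2, t3) idea, the function can chain the tasks internally and return only the endpoints. A minimal sketch (chain_as_unit is a made-up helper name, not an Airflow API):

def chain_as_unit(*tasks):
    # Wire the tasks in sequence: tasks[0] >> tasks[1] >> ...
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream
    # Expose only the endpoints to the caller.
    return tasks[0], tasks[-1]


def generate_tasks():
    t2 = PythonOperator(...)
    t3 = PythonOperator(...)
    return chain_as_unit(t2, t3)


first, last = generate_tasks()
t1 >> first  # gives t1 >> t2 >> t3
# last is available if something must run downstream of the unit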
I am reading from a table which contains the tasks to be executed, and I am also storing the dependencies in the same table. I am reading the table into a pandas data frame.
My task 3 is dependent on tasks 1 & 2, while task 4 is dependent on task 3 for completion.
for index, row in odf.iterrows():
    dag_id = row["DAG_ID"]
    task_id = row["TASK_ID"]
    task_name = row["TASK_NAME"]
    script_name = row["SCRIPT_NAME"]

    if row["DEPENDENT_ID"] is not None:
        dependents = row["DEPENDENT_ID"].split('|')
        print(dependents)
        t1 = OracleOperator(task_id=task_name,
                            oracle_conn_id='oracle_con',
                            sql='Begin %s; end;' % script_name, dag=dag)
        for d in dependents:
            for index, row in odf[odf["TASK_ID"] == int(d)].iterrows():
                t2 = OracleOperator(task_id=row["TASK_NAME"],
                                    oracle_conn_id='oracle_con',
                                    sql='Begin %s; end;' % script_name, dag=dag)
                t1.set_upstream(t2)
But my output is not coming out as expected.
I know that I could do something like this:
t1 = OracleOperator(task_id='run_proc_ihn_reference_raw',
                    oracle_conn_id='oracle_con',
                    sql='Begin proc.task1; end;', dag=dag)

t2 = OracleOperator(task_id='run_proc_aim_codelist_raw',
                    oracle_conn_id='oracle_con',
                    sql='Begin proc.task2; end;', dag=dag)

t3 = OracleOperator(task_id='run_proc_decline_reason_dim_build',
                    oracle_conn_id='oracle_con',
                    sql='Begin proc.task3; end;', dag=dag)

t4 = OracleOperator(task_id='run_proc_decline_reason_dim_load',
                    oracle_conn_id='oracle_con',
                    sql='Begin proc.task4; end;', dag=dag)

(t1, t2) >> t3 >> t4
But I might have more than 100 procedures, so I am looking for the DAG and its dependencies to be created from the table, as in the method above.
Need help with the same. Thank you
When dealing with large numbers of tasks involving complicated dependencies, I find that I usually end up repeating quite a bit of "task boilerplate", as you've shown in your example.
In these situations I like to let Python do the "heavy lifting" in creating the tasks and wiring them up:
default_args = {
    "oracle_conn_id": "oracle_con"
}

task_dict = {
    "ihn_reference_raw": {"proc": "task1"},
    "aim_codelist_raw": {"proc": "task2"},
    "decline_reason_dim_build": {"proc": "task3",
                                 "upstream": ["ihn_reference_raw",
                                              "aim_codelist_raw"]},
    "decline_reason_dim_load": {"proc": "task4",
                                "upstream": ["decline_reason_dim_build"]}
}

...

with DAG(
    ...,
    default_args=default_args
) as dag:
    # Iterate the details to create the tasks.
    for task_id, details in task_dict.items():
        OracleOperator(task_id=f"run_proc_{task_id}",
                       sql=f"BEGIN {details['proc']}; END;")

    # Iterate a second time to "wire up" the upstream tasks,
    # resolving each upstream name to its task object.
    for task_id, details in task_dict.items():
        if task_up := details.get("upstream"):  # requires Python 3.8+
            dag.get_task(f"run_proc_{task_id}").set_upstream(
                [dag.get_task(f"run_proc_{up}") for up in task_up])
(I've left out quite a bit for brevity, but the idea is there.)
The key is to find the portions of your process that are repetitive, store the things that are unique to each task (in task_dict in this example), and then loop to build.
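Since your tasks and dependencies already live in a table, task_dict doesn't have to be hard-coded: you can derive it from the DataFrame. A rough sketch, assuming the column names from your question, that DEPENDENT_ID holds pipe-separated TASK_IDs, and that TASK_NAME is the bare name without the run_proc_ prefix:

task_dict = {}
# Map TASK_ID -> TASK_NAME so dependent ids can be resolved to names.
id_to_name = dict(zip(odf["TASK_ID"], odf["TASK_NAME"]))

for _, row in odf.iterrows():
    entry = {"proc": row["SCRIPT_NAME"]}
    if row["DEPENDENT_ID"] is not None:
        entry["upstream"] = [id_to_name[int(d)]
                             for d in row["DEPENDENT_ID"].split('|')]
    task_dict[row["TASK_NAME"]] = entry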
I have an Apache Airflow DAG like so:

DAG_NAME = 'my_dag'

sections = ["0", "1", "2", "3"]

with DAG(DAG_NAME, default_args=default_args, schedule_interval=None) as dag:
    for s in sections:
        a = DummyOperator(task_id=f"section_{s}_start")
        b = SubDagOperator(task_id=f"init_{s}_subdag", subdag=init_section(DAG_NAME, f"init_{s}_subdag", default_args))
        c = SubDagOperator(task_id=f"process_{s}_subdag", subdag=process_section(DAG_NAME, f"process_{s}_subdag", default_args))
        d = SubDagOperator(task_id=f"update_{s}_subdag", subdag=update_section(DAG_NAME, f"update_{s}_subdag", default_args))
        e = DummyOperator(task_id=f"section_{s}_end")

        a >> b >> c >> d >> e
This code renders each section's chain, but the sections run in parallel. How can I make the sequence of tasks be:

section_0_start >> init_0_subdag >> process_0_subdag >> update_0_subdag >> section_0_end
section_0_end >> section_1_start
section_1_start >> init_1_subdag >> process_1_subdag >> update_1_subdag >> section_1_end
...

and so on, in sequence, from section 0 through to section 3's tasks?
Thanks
Modify the for-loop like this:

previous_e = None
for s in sections:
    a = ...
    ...
    e = ...

    # Link this section's start to the previous section's end.
    if previous_e:
        previous_e >> a

    a >> b >> c >> d >> e
    previous_e = e
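Put together with the operators from the question, the loop becomes (a sketch; init_section, process_section and update_section are the question's own helpers):

with DAG(DAG_NAME, default_args=default_args, schedule_interval=None) as dag:
    previous_e = None
    for s in sections:
        a = DummyOperator(task_id=f"section_{s}_start")
        b = SubDagOperator(task_id=f"init_{s}_subdag", subdag=init_section(DAG_NAME, f"init_{s}_subdag", default_args))
        c = SubDagOperator(task_id=f"process_{s}_subdag", subdag=process_section(DAG_NAME, f"process_{s}_subdag", default_args))
        d = SubDagOperator(task_id=f"update_{s}_subdag", subdag=update_section(DAG_NAME, f"update_{s}_subdag", default_args))
        e = DummyOperator(task_id=f"section_{s}_end")

        # Chain section_{s-1}_end >> section_{s}_start.
        if previous_e:
            previous_e >> a

        a >> b >> c >> d >> e
        previous_e = e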