Dynamic DAG creation based on dependencies from table - Airflow

I am reading from a table which contains the tasks to be executed, and I am also storing the dependencies in the same table. I read the table into a pandas DataFrame.
My task 3 is dependent on task 1 and task 2, while task 4 depends on task 3 for completion.
for index, row in odf.iterrows():
    dag_id = row["DAG_ID"]
    task_id = row["TASK_ID"]
    task_name = row["TASK_NAME"]
    script_name = row["SCRIPT_NAME"]
    if row["DEPENDENT_ID"] is not None:
        dependents = row["DEPENDENT_ID"].split('|')
        print(dependents)
        t1 = OracleOperator(task_id=task_name,
                            oracle_conn_id='oracle_con',
                            sql='Begin %s; end;' % script_name, dag=dag)
        for d in dependents:
            for index, row in odf[odf["TASK_ID"] == int(d)].iterrows():
                t2 = OracleOperator(task_id=row["TASK_NAME"],
                                    oracle_conn_id='oracle_con',
                                    sql='Begin %s; end;' % script_name, dag=dag)
                t1.set_upstream(t2)
But my output is not coming out as expected; below is what I see.
I know that I could do something like this:
t1 = OracleOperator(task_id='run_proc_ihn_reference_raw',
                    oracle_conn_id='oracle_con',
                    sql='Begin proc.task1; end;', dag=dag)
t2 = OracleOperator(task_id='run_proc_aim_codelist_raw',
                    oracle_conn_id='oracle_con',
                    sql='Begin proc.task2; end;', dag=dag)
t3 = OracleOperator(task_id='run_proc_decline_reason_dim_build',
                    oracle_conn_id='oracle_con',
                    sql='Begin proc.task3; end;', dag=dag)
t4 = OracleOperator(task_id='run_proc_decline_reason_dim_load',
                    oracle_conn_id='oracle_con',
                    sql='Begin proc.task4; end;', dag=dag)

[t1, t2] >> t3 >> t4
But I might have more than 100 procedures, so I am looking for the DAG and its dependencies to be created dynamically instead of writing them out as above.
Need help with the same. Thank you.

When dealing with large numbers of tasks involving complicated dependencies I find that I usually end up repeating quite a bit of "task boilerplate", as you've shown in your example.
In these situations I like to let Python do the "heavy lifting" in creating the tasks and wiring them up:
default_args = {
    "oracle_conn_id": "oracle_con"
}

task_dict = {
    "ihn_reference_raw": {"proc": "task1"},
    "aim_codelist_raw": {"proc": "task2"},
    "decline_reason_dim_build": {"proc": "task3",
                                 "upstream": ["ihn_reference_raw",
                                              "aim_codelist_raw"]},
    "decline_reason_dim_load": {"proc": "task4",
                                "upstream": ["decline_reason_dim_build"]}
}
...
with DAG(
    ...,
    default_args=default_args
) as dag:

    # Iterate the details to create the tasks
    for task_id, details in task_dict.items():
        OracleOperator(task_id=f"run_proc_{task_id}",
                       sql=f"BEGIN {details['proc']}; END;")
    # Iterate a second time to "wire up" the upstream tasks.
    for task_id, details in task_dict.items():
        if task_up := details.get("upstream"):
            # set_upstream() expects task objects, so resolve the upstream names first.
            dag.get_task(f"run_proc_{task_id}").set_upstream(
                [dag.get_task(f"run_proc_{up}") for up in task_up])
(I've left out quite a bit for brevity, but the idea is there)
The key is to find the portions of your process that are repetitive, store the things that are unique to each task (in our task_dict in this example) and then loop to build.
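Since your task metadata already lives in a table, one way to connect this back to your odf DataFrame is to build task_dict from the rows before the loops above. This is only a rough sketch, assuming the TASK_ID, TASK_NAME, SCRIPT_NAME and DEPENDENT_ID columns from your question; the id_to_name lookup is something I've added:

import pandas as pd

# Sketch: derive task_dict from the question's odf DataFrame.
# Assumes DEPENDENT_ID holds pipe-separated TASK_IDs (e.g. "1|2") or is empty/NaN.
id_to_name = dict(zip(odf["TASK_ID"], odf["TASK_NAME"]))

task_dict = {}
for _, row in odf.iterrows():
    entry = {"proc": row["SCRIPT_NAME"]}
    if pd.notna(row["DEPENDENT_ID"]):
        entry["upstream"] = [id_to_name[int(d)]
                             for d in str(row["DEPENDENT_ID"]).split("|")]
    task_dict[row["TASK_NAME"]] = entry
# If TASK_NAME already carries the "run_proc_" prefix, drop it either here
# or in the f-strings used when creating the operators above.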

Related

Airflow: Run tasks in parallel dynamically based on number of dates

Requirement: Run tasks in parallel dynamically based on the number of offset values, which are basically dates.
As below, it starts from the current date (0) and goes 4 days back (end_offset_days), so that one task can run in parallel for each date.
start_offset_days / end_offset_days can be dynamic; tomorrow it could be changed to 6 to run more past days.
I tried the below: date_list gives me a list of dates to be run in parallel. How do I pass it to the next tasks for looping?
with DAG(
    dag_id=dag_id,
    default_args=default_args,
    schedule_interval="0 * * * *",
    catchup=False,
    dagrun_timeout=timedelta(minutes=180),
    max_active_runs=1,
    params={},
) as dag:

    @task(task_id='datelist')
    def datelist(**kwargs):
        ti = kwargs['ti']
        import datetime
        date_list = [(datetime.date.today() - datetime.timedelta(days=x)).strftime('%Y-%m-%d') for x in range(0, 4)]
        return date_list

    for tss in date_list:
        jb = PythonOperator(
            task_id=jb,
            provide_context=True,
            python_callable=main_run,
            op_kwargs={
                "start_offset_days": 0,
                "end_offset_days": 4
            }
        )
        jb

return dag
Below are the XCom values from date_list.
Create a job_list and, inside the for loop, do job_list.append(jb).
Then the line before return dag should simply be: job_list.
Then Airflow will run all those jobs in parallel.
So the last part of your code should look like this:
job_list = []
for tss in date_list:
    jb = PythonOperator(
        # task_id must be a unique string; deriving it from the date is one option
        task_id=f"main_run_{tss}",
        provide_context=True,
        python_callable=main_run,
        op_kwargs={
            "start_offset_days": 0,
            "end_offset_days": 4
        }
    )
    job_list.append(jb)
job_list
return dag
Instead of running each jb inside the loop, appending them to a collection and running the entire collection will make them all run in parallel.
I would also replace the first part of the DAG. I don't think it has to run as a task. So instead of:
@task(task_id='datelist')
def datelist(**kwargs):
    ti = kwargs['ti']
    import datetime
    date_list = [(datetime.date.today() - datetime.timedelta(days=x)).strftime('%Y-%m-%d') for x in range(0, 4)]
    return date_list
I would simply do it like this:
import datetime
date_list = [(datetime.date.today() - datetime.timedelta(days=x)).strftime('%Y-%m-%d') for x in range(0, 4)]
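One thing neither snippet shows is actually handing each date to main_run. A sketch of the combined tail of the file (inside the with DAG(...) as dag: block) might look like the following, where the main_run_{tss} task naming and the run_date kwarg are assumptions of mine, since the original op_kwargs only carried the offsets:

import datetime

from airflow.operators.python import PythonOperator

# Built at DAG-parse time, not inside a task.
date_list = [(datetime.date.today() - datetime.timedelta(days=x)).strftime('%Y-%m-%d')
             for x in range(0, 4)]

job_list = []
for tss in date_list:
    jb = PythonOperator(
        task_id=f"main_run_{tss}",    # one uniquely named task per date
        python_callable=main_run,     # main_run is the callable from the existing DAG file
        op_kwargs={"run_date": tss},  # assumption: main_run accepts the date directly
    )
    job_list.append(jb)
# No dependencies are set between the tasks, so Airflow runs them all in parallel.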

Can we configure different schedule_interval for different tasks within a DAG?

Can we set different schedule_intervals for different tasks in the same DAG?
i.e. I have one DAG with three tasks, A >> B >> C. I want the upstream tasks A & B to run weekly, but the downstream task C to run daily. Is it possible? If so, what should the schedule_interval be for the DAG and the tasks?
There are two options: you can use ShortCircuitOperator or BranchDayOfWeekOperator.
1. Using BranchDayOfWeekOperator for that use case. This operator branches based on a specific day of the week:
with DAG('my_dag',
         schedule_interval='@daily'
         ) as dag:
    task1 = DummyOperator(task_id='TASK1')
    task2 = DummyOperator(task_id='TASK2')
    task3 = DummyOperator(task_id='TASK3')
    end_task = DummyOperator(task_id='end_task')

    branch = BranchDayOfWeekOperator(
        task_id="make_choice",
        follow_task_ids_if_true="TASK3",
        follow_task_ids_if_false="end_task",
        week_day="Monday",
    )

    task1 >> task2 >> branch >> [task3, end_task]
In this example task3 will be executed only on Monday, while task1 & task2 will run daily.
Note that this operator is available only for Airflow >= 2.1.0; however, you can copy the operator source code and create a local version.
2. Using ShortCircuitOperator:
from datetime import date

def func():
    if date.today().weekday() == 0:
        return True
    return False

with DAG('my_dag',
         schedule_interval='@daily'
         ) as dag:
    task1 = DummyOperator(task_id='TASK1')
    task2 = DummyOperator(task_id='TASK2')
    task3 = DummyOperator(task_id='TASK3')
    verify = ShortCircuitOperator(task_id='check_day', python_callable=func)

    task1 >> task2 >> verify >> task3

Return list of tasks from function that should be run in sequence in Airflow

I want to return 2 or more tasks from a function that should be run in sequence in the spot they're inserted in the dependencies; see below.
t1 = PythonOperator()

def generate_tasks():
    t2 = PythonOperator()
    t3 = PythonOperator()
    return magic(t2, t3)  # magic needed here (preferably)

t1 >> generate_tasks()  # otherwise here
# desired result: t1 >> t2 >> t3
Is this doable? As I understand it Airflow 2.0 seems to achieve this with a TaskGroup, but we're on Google's Composer, and 2.0 won't be available for a while.
Best workaround I've found:
t1 = PythonOperator()

def generate_tasks():
    t2 = PythonOperator()
    t3 = PythonOperator()
    return [t2, t3]

tasks = generate_tasks()
t1 >> tasks[0] >> tasks[1]
But I'd really like that to be abstracted away, as it more or less defeats the purpose of having multiple operators returned from a single function. We want it to be a single unit as far as the end user knows, even though it can be composed of 2 or more tasks.
How to do it with TaskGroup in Airflow 2.0:
class Encryptor:
    def encrypt_and_archive(self):
        with TaskGroup("archive_and_encrypt") as section_1:
            encrypt = DummyOperator(task_id="encrypt")
            archive = BashOperator(task_id="archive", bash_command='echo 1')
            encrypt >> archive
        return section_1

with DAG(dag_id="example_return_task_group", start_date=days_ago(2), tags=["example"]) as dag:
    start = DummyOperator(task_id="start")
    encrypt_and_archive = Encryptor().encrypt_and_archive()
    end = DummyOperator(task_id='end')
    # 👇 single variable, containing two tasks
    start >> encrypt_and_archive >> end
Which creates a graph where the archive_and_encrypt group (encrypt >> archive) sits between start and end.
Is something similar remotely doable before 2.0?
You didn't explain what magic(t2, t3) is.
TaskGroup is strictly a UI feature; it doesn't affect the DAG logic. According to your description it seems that you are looking for specific logic (otherwise what is magic?).
I believe this is what you are after:
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2021, 1, 24),
}

def generate_tasks():
    operator_list = []
    for i in range(5):  # Replace with whatever logic you need to dynamically create tasks
        op = DummyOperator(task_id=f"t{str(i)}_task", dag=dag)
        if i > 0:
            operator_list[i - 1] >> op
        operator_list.append(op)
    return operator_list

with DAG(
    dag_id='loop',
    default_args=default_args,
    schedule_interval=None,
) as dag:
    start_op = DummyOperator(task_id='start_task')
    end_op = DummyOperator(task_id='end_task')

    tasks = generate_tasks()
    start_op >> tasks[0]
    tasks[-1] >> end_op
You can replace the DummyOperator with any operator you'd like.
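Applied to the t1 >> t2 >> t3 wish from the question, a minimal sketch of the same idea might look like the following; task2_fn and task3_fn are hypothetical callables, and the 1.10-style import path is an assumption since the poster is on pre-2.0 Composer:

from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x import path

def generate_tasks():
    # Hypothetical callables stand in for the real work of t2 and t3.
    t2 = PythonOperator(task_id="t2", python_callable=task2_fn, dag=dag)
    t3 = PythonOperator(task_id="t3", python_callable=task3_fn, dag=dag)
    t2 >> t3              # wire the internal sequence here, not at the call site
    return [t2, t3]

tasks = generate_tasks()
t1 >> tasks[0]            # result: t1 >> t2 >> t3, with the chaining hidden from the caller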

Loop many times on many airflow tasks on one dag

I am creating one DAG that will have the following structure of tasks. This DAG will be scheduled to run every day at 1:00 AM UTC.
Get rows from database ---- loop on rows to run many tasks that require each row's data.
For example, I have a method in my DAG that calls a MySQL database and returns many rows. Each row's data I have to pass to 4 tasks as a parameter. I have followed some Google search docs but it is not running correctly.
return_db_result is a method to get results from Cloud SQL in GCP.
def return_result():
    db_engine_connection = create_cloud_sql_connection()
    session = get_db_session(db_engine_connection)
    result = session.query(Scheduled).filter(Scheduled.job_status == "Scheduled").all()
    session.commit()
    return result
I tried using a for loop, something like the following:
for row in return_result():
    op1 = operator({ param=row.id})
    op2 = operator({ param=row.id})
    op3 = operator({ param=row.id})
    op4 = operator({ param=row.id})
    op1 >> op2 >> op3 >> op4
But these tasks do not show up in the Airflow UI.
Based on your comments, assuming your operator is:
class MyOperator(BaseOperator):

    @apply_defaults
    def __init__(self,
                 input_id,
                 input_date,
                 input_status,
                 *args, **kwargs):
        super(MyOperator, self).__init__(*args, **kwargs)
        self.input_id = input_id
        self.input_date = input_date
        self.input_status = input_status

    def execute(self, context):
        pass
You can use it as follows:
start_op = DummyOperator(task_id='start_op')

for row in return_db_result():
    op1 = MyOperator(task_id=f"op1_{row.id}", input_id=row.id, input_date=row.date, input_status=row.status)
    op2 = MyOperator(task_id=f"op2_{row.id}", input_id=row.id, input_date=row.date, input_status=row.status)
    op3 = MyOperator(task_id=f"op3_{row.id}", input_id=row.id, input_date=row.date, input_status=row.status)
    op4 = MyOperator(task_id=f"op4_{row.id}", input_id=row.id, input_date=row.date, input_status=row.status)
    start_op >> op1 >> op2 >> op3 >> op4

Airflow DAG Task Dependency in a Loop

I have a DAG that needs to recompile customer lists for various brands. The script is called with two arguments, brand and listtype.
I need the brands to run concurrently, but the list types to be dependent on the preceding list type, and I can't figure out how to do that in a loop. Can y'all help me out?
BrandsToRun = ['A', 'B', 'C']
ListTypes = ['1', '2', '3']

# Defining the DAG
################################################################################
with DAG(
    'MusterMaster',
    default_args = default_args,
    description = 'x',
    # schedule_interval = None
    schedule_interval = '30 4 * * *',
    catchup = False
) as MusterMaster:

    for Brand in BrandsToRun:
        for ListType in ListTypes:
            ListLoad = BashOperator(
                task_id='Load_'+str(Brand)+'_'+str(ListType),
                bash_command = """python3 '/usr/local/bin/MusterMaster.py' {0} {1}""".format(Brand[0], ListType[0]),
                pool='logs'
            )
            ListLoad
I want the tasks to have a dependency structure like this, but I can't figure it out. Brands should run concurrently, but each ListType should depend on the preceding ListType.
Muster A 1 >> Muster A 2 >> Muster A 3
Muster B 1 >> Muster B 2 >> Muster B 3
Muster C 1 >> Muster C 2 >> Muster C 3
How can I best accomplish this?
You can do:
for Brand in BrandsToRun:
    task_list = []
    for ListType in ListTypes:
        task_list.append(BashOperator(
            task_id='Load_'+str(Brand)+'_'+str(ListType),
            bash_command = """python3 '/usr/local/bin/MusterMaster.py' {0} {1}""".format(Brand[0], ListType[0]),
            pool='logs'))
        if len(task_list) > 1:
            task_list[-2] >> task_list[-1]
Which will give you the three independent chains shown in the question: Muster A 1 >> A 2 >> A 3, and likewise for brands B and C.
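If you're on Airflow 2 (an assumption; on 1.10 the same helper lives in airflow.utils.helpers), the chain helper can express the same pairwise wiring a little more compactly:

from airflow.models.baseoperator import chain

for Brand in BrandsToRun:
    brand_tasks = [
        BashOperator(
            task_id=f"Load_{Brand}_{ListType}",
            bash_command=f"python3 '/usr/local/bin/MusterMaster.py' {Brand} {ListType}",
            pool='logs',
        )
        for ListType in ListTypes
    ]
    # chain(a, b, c) sets a >> b >> c, giving each brand its own sequential list-type chain.
    chain(*brand_tasks)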
