I am creating a DAG with the following structure of tasks. The DAG is scheduled to run every day at 1:00 AM UTC.
Get rows from database ---- loop over the rows to run several tasks that each need the row's data.
For example, my DAG has a method that queries a MySQL database and returns many rows. Each row's data has to be passed as a parameter to 4 tasks. I have followed some docs found via Google search, but it is not running correctly.
return_db_result is the method that fetches the result from Cloud SQL in GCP.
def return_result():
    db_engine_connection = create_cloud_sql_connection()
    session = get_db_session(db_engine_connection)
    result = session.query(Scheduled).filter(Scheduled.job_status == "Scheduled").all()
    session.commit()
    return result
I tried using a for loop, something like the following:
for row in return_result():
    op1 = operator({ param=row.id})
    op2 = operator({ param=row.id})
    op3 = operator({ param=row.id})
    op4 = operator({ param=row.id})
    op1 >> op2 >> op3 >> op4
But these tasks do not show up in the Airflow UI.
Based on your comments, and assuming your operator is:
class MyOperator(BaseOperator):

    @apply_defaults
    def __init__(self,
                 input_id,
                 input_date,
                 input_status,
                 *args, **kwargs):
        super(MyOperator, self).__init__(*args, **kwargs)
        self.input_id = input_id
        self.input_date = input_date
        self.input_status = input_status

    def execute(self, context):
        pass
You can use it as follows (note that each task needs a unique task_id, which is why the row id is embedded in it):
start_op = DummyOperator(task_id='start_op')

for row in return_db_result():
    op1 = MyOperator(task_id=f"op1_{row.id}", input_id=row.id, input_date=row.date, input_status=row.status)
    op2 = MyOperator(task_id=f"op2_{row.id}", input_id=row.id, input_date=row.date, input_status=row.status)
    op3 = MyOperator(task_id=f"op3_{row.id}", input_id=row.id, input_date=row.date, input_status=row.status)
    op4 = MyOperator(task_id=f"op4_{row.id}", input_id=row.id, input_date=row.date, input_status=row.status)

    start_op >> op1 >> op2 >> op3 >> op4
Related
I have several cron expressions that I need to apply to a single DAG. There is no way to express them with one single cron expression.
Airflow 2.2 introduced Timetable. Is there an implementation that takes a list of cron expressions?
I was looking for the same thing, but didn't find anything. It would be nice if a standard one came with Airflow.
Here's a 0.1 version that I wrote for Airflow 2.2.5.
# This file is <airflow plugins directory>/timetable.py

from typing import Any, Dict, List, Optional

import pendulum
from croniter import croniter
from pendulum import DateTime, Duration, timezone, instance as pendulum_instance

from airflow.plugins_manager import AirflowPlugin
from airflow.timetables.base import DagRunInfo, DataInterval, TimeRestriction, Timetable
from airflow.exceptions import AirflowTimetableInvalid


class MultiCronTimetable(Timetable):
    valid_units = ['minutes', 'hours', 'days']

    def __init__(self,
                 cron_defs: List[str],
                 timezone: str = 'Europe/Berlin',
                 period_length: int = 0,
                 period_unit: str = 'hours'):
        self.cron_defs = cron_defs
        self.timezone = timezone
        self.period_length = period_length
        self.period_unit = period_unit

    def infer_manual_data_interval(self, run_after: DateTime) -> DataInterval:
        """
        Determines the data interval for manually triggered runs.
        This is simply (now - period) to now.
        """
        end = run_after
        if self.period_length == 0:
            start = end
        else:
            start = self.data_period_start(end)
        return DataInterval(start=start, end=end)

    def next_dagrun_info(
            self,
            *,
            last_automated_data_interval: Optional[DataInterval],
            restriction: TimeRestriction) -> Optional[DagRunInfo]:
        """
        Determines when the DAG should be scheduled.
        """
        if restriction.earliest is None:
            # No start_date. Don't schedule.
            return None

        is_first_run = last_automated_data_interval is None

        if is_first_run:
            if restriction.catchup:
                scheduled_time = self.next_scheduled_run_time(restriction.earliest)
            else:
                scheduled_time = self.previous_scheduled_run_time()
                if scheduled_time is None:
                    # No previous cron time matched. Find one in the future.
                    scheduled_time = self.next_scheduled_run_time()
        else:
            last_scheduled_time = last_automated_data_interval.end

            if restriction.catchup:
                scheduled_time = self.next_scheduled_run_time(last_scheduled_time)
            else:
                scheduled_time = self.previous_scheduled_run_time()

                if scheduled_time is None or scheduled_time == last_scheduled_time:
                    # No previous cron time matched,
                    # or the matched cron time was the last execution time.
                    scheduled_time = self.next_scheduled_run_time()
                elif scheduled_time > last_scheduled_time:
                    # Matched cron time was after the last execution time, but before now.
                    # Use this cron time.
                    pass
                else:
                    # The last execution time is after the most recent matching cron time.
                    # The next scheduled run will be in the future.
                    scheduled_time = self.next_scheduled_run_time()

        if scheduled_time is None:
            return None

        if restriction.latest is not None and scheduled_time > restriction.latest:
            # Over the DAG's scheduled end; don't schedule.
            return None

        start = self.data_period_start(scheduled_time)
        return DagRunInfo(run_after=scheduled_time,
                          data_interval=DataInterval(start=start, end=scheduled_time))

    def data_period_start(self, period_end: DateTime):
        return period_end - Duration(**{self.period_unit: self.period_length})

    def croniter_values(self, base_datetime=None):
        if not base_datetime:
            tz = timezone(self.timezone)
            base_datetime = pendulum.now(tz)

        return [croniter(expr, base_datetime) for expr in self.cron_defs]

    def next_scheduled_run_time(self, base_datetime: DateTime = None):
        min_date = None
        tz = timezone(self.timezone)

        if base_datetime:
            base_datetime_localized = base_datetime.in_timezone(tz)
        else:
            base_datetime_localized = pendulum.now(tz)

        for cron in self.croniter_values(base_datetime_localized):
            next_date = cron.get_next(DateTime)
            if not min_date:
                min_date = next_date
            else:
                min_date = min(min_date, next_date)

        if min_date is None:
            return None

        return pendulum_instance(min_date)

    def previous_scheduled_run_time(self, base_datetime: DateTime = None):
        """
        Get the most recent time in the past that matches one of the cron schedules.
        """
        max_date = None
        tz = timezone(self.timezone)

        if base_datetime:
            base_datetime_localized = base_datetime.in_timezone(tz)
        else:
            base_datetime_localized = pendulum.now(tz)

        for cron in self.croniter_values(base_datetime_localized):
            prev_date = cron.get_prev(DateTime)
            if not max_date:
                max_date = prev_date
            else:
                max_date = max(max_date, prev_date)

        if max_date is None:
            return None

        return pendulum_instance(max_date)

    def validate(self) -> None:
        if not self.cron_defs:
            raise AirflowTimetableInvalid("At least one cron definition must be present")

        if self.period_unit not in self.valid_units:
            raise AirflowTimetableInvalid(f'period_unit must be one of {self.valid_units}')

        if self.period_length < 0:
            raise AirflowTimetableInvalid('period_length must not be less than zero')

        try:
            self.croniter_values()
        except Exception as e:
            raise AirflowTimetableInvalid(str(e))

    @property
    def summary(self) -> str:
        """A short summary for the timetable.

        This is used to display the timetable in the web UI. A cron expression
        timetable, for example, can use this to display the expression.
        """
        return ' || '.join(self.cron_defs) + f' [TZ: {self.timezone}]'

    def serialize(self) -> Dict[str, Any]:
        """Serialize the timetable for JSON encoding.

        This is called during DAG serialization to store timetable information
        in the database. This should return a JSON-serializable dict that will
        be fed into ``deserialize`` when the DAG is deserialized.
        """
        return dict(cron_defs=self.cron_defs,
                    timezone=self.timezone,
                    period_length=self.period_length,
                    period_unit=self.period_unit)

    @classmethod
    def deserialize(cls, data: Dict[str, Any]) -> "MultiCronTimetable":
        """Deserialize a timetable from data.

        This is called when a serialized DAG is deserialized. ``data`` will be
        whatever was returned by ``serialize`` during DAG serialization.
        """
        return cls(**data)


class CustomTimetablePlugin(AirflowPlugin):
    name = "custom_timetable_plugin"
    timetables = [MultiCronTimetable]
To use it, you provide a list of cron expressions, optionally a timezone string, optionally a period length and period unit.
For my use case I don't actually need the period length + unit, which are used to determine the DAG's data_interval. You can just leave them at the default value of 0 minutes, if your DAG doesn't care about the data_interval.
I tried to imitate standard schedule_interval behaviour. For example if catchup = False and the DAG could have potentially been triggered several times since the last run (for whatever reason, for example the DAG ran longer than expected, or the scheduler wasn't running, or it's the DAG's very first time being scheduled), then the DAG will be scheduled to run for the latest previous matching time.
I haven't really tested it with catchup = True, but in theory it would run for every matching cron time since the DAG's start_date (but only once per distinct time, for example with */30 * * * * and 0 * * * * the DAG would run twice per hour, not three times).
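To illustrate the catchup = False case, here is a small standalone sketch (not part of the plugin) of how previous_scheduled_run_time() effectively picks the most recent time that matches any of the cron expressions:

import pendulum
from croniter import croniter
from pendulum import DateTime

# Suppose the scheduler wakes up at 10:23 with these two cron expressions:
now = pendulum.datetime(2022, 6, 3, 10, 23, tz='Europe/Berlin')
crons = ['*/30 * * * *', '0 * * * *']

# The latest previous match across all crons is 10:00, so with catchup=False
# the DAG is scheduled once for 10:00, even though both expressions match it.
print(max(croniter(expr, now).get_prev(DateTime) for expr in crons))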
Example DAG file:
from time import sleep

import airflow
from airflow.operators.python import PythonOperator
import pendulum

from timetable import MultiCronTimetable


def sleepy_op():
    sleep(660)


with airflow.DAG(
        dag_id='timetable_test',
        start_date=pendulum.datetime(2022, 6, 2, tz=pendulum.timezone('America/New_York')),
        timetable=MultiCronTimetable(['*/5 * * * *', '*/3 * * * fri,sat', '1 12 3 * *'],
                                     timezone='America/New_York',
                                     period_length=10,
                                     period_unit='minutes'),
        catchup=False,
        max_active_runs=1) as dag:

    sleepy = PythonOperator(
        task_id='sleepy',
        python_callable=sleepy_op
    )
Requirement: run tasks in parallel dynamically, based on a number of offset values, which are basically dates.
As below, it starts from the current date (offset 0) and goes back 4 days (end_offset_days), so that a task can run in parallel for each date.
start_offset_days / end_offset_days can be dynamic; tomorrow it could be changed to 6 to run more past days.
I tried the below: date_list gives me a list of dates to be run in parallel. How do I pass it to the next tasks for looping?
with DAG(
    dag_id=dag_id,
    default_args=default_args,
    schedule_interval="0 * * * *",
    catchup=False,
    dagrun_timeout=timedelta(minutes=180),
    max_active_runs=1,
    params={},
) as dag:

    @task(task_id='datelist')
    def datelist(**kwargs):
        ti = kwargs['ti']
        import datetime
        date_list = [(datetime.date.today() - datetime.timedelta(days=x)).strftime('%Y-%m-%d') for x in range(0, 4)]
        return date_list

    for tss in date_list:
        jb = PythonOperator(
            task_id=jb,
            provide_context=True,
            python_callable=main_run,
            op_kwargs={
                "start_offset_days": 0,
                "end_offset_days": 4
            }
        )

        jb

    return dag
Below are the XCom values returned by date_list.
Create a job_list and inside the for loop do job_list.append(jb)
Then the line before return dag should simply be: job_list.
Then Airflow will run all those jobs in parallel.
So the last part of your code should look like this:
job_list = []

for tss in date_list:
    jb = PythonOperator(
        task_id=f"job_{tss}",  # task_id must be a unique string, not the undefined variable jb
        provide_context=True,
        python_callable=main_run,
        op_kwargs={
            "start_offset_days": 0,
            "end_offset_days": 4
        }
    )
    job_list.append(jb)

job_list

return dag
Instead of running each jb on its own inside the loop, appending them to a collection and referencing the entire collection will make them all run in parallel.
I would also replace the first part of the DAG. I don't think it has to run as a task. So instead of:
@task(task_id='datelist')
def datelist(**kwargs):
    ti = kwargs['ti']
    import datetime
    date_list = [(datetime.date.today() - datetime.timedelta(days=x)).strftime('%Y-%m-%d') for x in range(0, 4)]
    return date_list
I would simply do it like this:
import datetime
date_list = [(datetime.date.today() - datetime.timedelta(days=x)).strftime('%Y-%m-%d') for x in range(0, 4)]
I want to return 2 or more tasks from a function, and they should run in sequence at the spot where they're inserted in the dependencies; see below.
t1 = PythonOperator()

def generate_tasks():
    t2 = PythonOperator()
    t3 = PythonOperator()
    return magic(t2, t3)  # magic needed here (preferably)

t1 >> generate_tasks()  # otherwise here

# desired result: t1 >> t2 >> t3
Is this doable? As I understand it Airflow 2.0 seems to achieve this with a TaskGroup, but we're on Google's Composer, and 2.0 won't be available for a while.
Best workaround I've found:
t1 = PythonOperator()

def generate_tasks():
    t2 = PythonOperator()
    t3 = PythonOperator()
    return [t2, t3]

tasks = generate_tasks()
t1 >> tasks[0] >> tasks[1]
But I'd really like that to be abstracted away, as it more or less defeats the purpose of having multiple operators returned from a single function. We want it to be a single unit as far as the end user knows, even though it can be composed of 2 or more tasks.
How to do it with TaskGroup in Airflow 2.0:
class Encryptor:
    def encrypt_and_archive(self):
        with TaskGroup("archive_and_encrypt") as section_1:
            encrypt = DummyOperator(task_id="encrypt")
            archive = BashOperator(task_id="archive", bash_command='echo 1')
            encrypt >> archive
        return section_1


with DAG(dag_id="example_return_task_group", start_date=days_ago(2), tags=["example"]) as dag:
    start = DummyOperator(task_id="start")
    encrypt_and_archive = Encryptor().encrypt_and_archive()
    end = DummyOperator(task_id='end')

    # 👇 single variable, containing two tasks
    start >> encrypt_and_archive >> end
Which creates a graph where the grouped tasks appear as a single collapsed node between start and end.
Is something similar remotely doable before 2.0?
You didn't explain what magic(t2, t3) is.
TaskGroup is strictly a UI feature; it doesn't affect the DAG logic. According to your description it seems that you are looking for specific logic (otherwise, what is magic?).
I believe this is what you are after:
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2021, 1, 24),
}


def generate_tasks():
    operator_list = []
    for i in range(5):  # Replace with whatever logic you use to dynamically create tasks
        op = DummyOperator(task_id=f"t{str(i)}_task", dag=dag)
        if i > 0:
            operator_list[i - 1] >> op
        operator_list.append(op)
    return operator_list


with DAG(
    dag_id='loop',
    default_args=default_args,
    schedule_interval=None,
) as dag:
    start_op = DummyOperator(task_id='start_task')
    end_op = DummyOperator(task_id='end_task')

    tasks = generate_tasks()

    start_op >> tasks[0]
    tasks[-1] >> end_op
You can replace the DummyOperator with any operator you'd like.
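If you want the chained block to be consumed as a single unit at the call site, closer to what the question asks for, a possible variation (sketched from the generate_tasks above, not a built-in Airflow feature) is to return only the entry and exit tasks of the chain:

def generate_task_chain():
    operator_list = []
    for i in range(5):
        op = DummyOperator(task_id=f"t{str(i)}_task", dag=dag)
        if i > 0:
            operator_list[i - 1] >> op
        operator_list.append(op)
    # Expose only the ends of the chain.
    return operator_list[0], operator_list[-1]

# Inside the same `with DAG(...) as dag:` block as above:
first, last = generate_task_chain()
start_op >> first
last >> end_op

Callers then only ever touch first and last, even if the function internally builds dozens of tasks.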
I am reading from a table which contains the tasks to be executed, and I am also storing the dependencies in the same table. I am reading the table into a pandas DataFrame.
My task 3 is dependent on tasks 1 & 2, while task 4 is dependent on task 3 for completion.
for index, row in odf.iterrows():
    dag_id = row["DAG_ID"]
    task_id = row["TASK_ID"]
    task_name = row["TASK_NAME"]
    script_name = row["SCRIPT_NAME"]
    if row["DEPENDENT_ID"] is not None:
        dependents = row["DEPENDENT_ID"].split('|')
        print(dependents)
        t1 = OracleOperator(task_id=task_name,
                            oracle_conn_id='oracle_con',
                            sql='Begin %s; end;' % script_name, dag=dag)
        for d in dependents:
            for index, row in odf[odf["TASK_ID"] == int(d)].iterrows():
                t2 = OracleOperator(task_id=row["TASK_NAME"],
                                    oracle_conn_id='oracle_con',
                                    sql='Begin %s; end;' % script_name, dag=dag)
                t1.set_upstream(t2)
But my output is not coming out as expected.
I know that I could do something like this:
t1 = OracleOperator(task_id='run_proc_ihn_reference_raw',
                    oracle_conn_id='oracle_con',
                    sql='Begin proc.task1; end;', dag=dag)

t2 = OracleOperator(task_id='run_proc_aim_codelist_raw',
                    oracle_conn_id='oracle_con',
                    sql='Begin proc.task2; end;', dag=dag)

t3 = OracleOperator(task_id='run_proc_decline_reason_dim_build',
                    oracle_conn_id='oracle_con',
                    sql='Begin proc.task3; end;', dag=dag)

t4 = OracleOperator(task_id='run_proc_decline_reason_dim_load',
                    oracle_conn_id='oracle_con',
                    sql='Begin proc.task4; end;', dag=dag)

[t1, t2] >> t3 >> t4
But I might have more than 100 procedures, so I am looking for the DAG and its dependencies to be created from the table rather than by hand as above.
Need help with the same. Thank you.
When dealing with large numbers of tasks involving complicated dependencies I find that I usually end up repeating quite a bit of "task boilerplate", as you've shown in your example.
In these situations I like to let Python do the "heavy lifting" in creating the tasks and wiring them up:
default_args = {
    "oracle_conn_id": "oracle_con"
}

task_dict = {
    "ihn_reference_raw": {"proc": "task1"},
    "aim_codelist_raw": {"proc": "task2"},
    "decline_reason_dim_build": {"proc": "task3",
                                 "upstream": ["ihn_reference_raw",
                                              "aim_codelist_raw"]},
    "decline_reason_dim_load": {"proc": "task4",
                                "upstream": ["decline_reason_dim_build"]}
}
...
with DAG(
    ...,
    default_args=default_args
) as dag:

    # Iterate the details to create the tasks.
    for task_id, details in task_dict.items():
        OracleOperator(task_id=f"run_proc_{task_id}",
                       sql=f"BEGIN {details['proc']}; END;")

    # Iterate a second time to "wire up" the upstream tasks
    # (set_upstream expects task objects, so look them up on the DAG).
    for task_id, details in task_dict.items():
        if task_up := details.get("upstream"):
            dag.get_task(f"run_proc_{task_id}").set_upstream(
                [dag.get_task(f"run_proc_{up}") for up in task_up])
(I've left out quite a bit for brevity, but the idea is there)
The key is to find the portions of your process that are repetitive, store the things that are unique to each task (in our task_dict in this example) and then loop to build.
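Since the tasks and their dependencies already live in the table you load into pandas, the task_dict itself can be generated rather than hand-written. A rough sketch, assuming the column names from your snippet (TASK_ID, TASK_NAME, SCRIPT_NAME, DEPENDENT_ID), that DEPENDENT_ID holds '|'-separated TASK_IDs, and that TASK_NAME does not already contain the run_proc_ prefix (if it does, strip it here or drop the prefixing above):

# Map numeric TASK_IDs to task names so DEPENDENT_ID values can be resolved.
id_to_name = dict(zip(odf["TASK_ID"], odf["TASK_NAME"]))

task_dict = {}
for _, row in odf.iterrows():
    entry = {"proc": row["SCRIPT_NAME"]}
    if row["DEPENDENT_ID"] is not None:
        entry["upstream"] = [id_to_name[int(d)]
                             for d in row["DEPENDENT_ID"].split("|")]
    task_dict[row["TASK_NAME"]] = entry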
I need to execute the same Airflow task on the 12th day of the month and 2 days before the last day of the month.
I was trying with macros and execution_date as well. Not sure how to proceed further. Could you please help with this?
def check_trigger(execution_date, day_offset, **kwargs):
    target_date = execution_date - timedelta(days=day_offset)
    return target_date
I would approach it like below. twelfth_or_two_before is a Python function that simply checks the date and returns the task_id of the appropriate downstream task (a sketch of it follows the DAG snippet). That way, if the business needs ever change and you need to run the actual tasks on a different or additional day(s), you just modify that function.
with DAG( ... ) as dag:
    right_days = BranchPythonOperator(
        task_id="start",
        python_callable=twelfth_or_two_before,
    )
    do_nothing = DummyOperator(task_id="do_nothing")
    actual_task = ____Operator( ... )  # This is the Operator that does the actual work

    right_days >> [do_nothing, actual_task]
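A sketch of what twelfth_or_two_before could look like, assuming the branch decision is made on the run's execution_date and that the worker task's task_id is "actual_task" (both assumptions, since they aren't spelled out above):

import calendar

def twelfth_or_two_before(execution_date, **kwargs):
    # Number of days in the month this run falls in.
    last_day = calendar.monthrange(execution_date.year, execution_date.month)[1]
    if execution_date.day in (12, last_day - 2):
        return "actual_task"  # branch to the task that does the actual work
    return "do_nothing"       # every other day: branch to the no-op task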