I want to customize tasks to be weekday-dependent in the DAG file. It seems Airflow macros like {{ next_execution_date }} are not directly available in the Python DAG file. This is my DAG definition:
RG_TASKS = {
    'asia': {
        'start_date': pendulum.datetime(2021, 1, 1, 16, 0, tz='Asia/Tokyo'),
        'tz': 'Asia/Tokyo',
        'files': [
            '/path/%Y%m%d/asia_file1.%Y%m%d.csv',
            '/path/%Y%m%d/asia_file2.%Y%m%d.csv',
            ...],
    },
    'euro': {
        'start_date': pendulum.datetime(2021, 1, 1, 16, 0, tz='Europe/London'),
        'tz': 'Europe/London',
        'files': [
            '/path/%Y%m%d/euro_file1.%Y%m%d.csv',
            '/path/%Y%m%d/euro_file2.%Y%m%d.csv',
            ...],
    },
}

dag = DAG(..., start_date=pendulum.datetime(2021, 1, 1, 16, 0, tz='Asia/Tokyo'),
          schedule='00 16 * * 0-6')
for rg, t in RG_TASKS.items():
    tz = t['tz']
    h = t['start_date'].hour
    m = t['start_date'].minute
    target_time = f'{{{{ next_execution_date.replace(tzinfo="{tz}", hour={h}, minute={m}) }}}}'
    time_sensor = DateTimeSensor(dag=dag, task_id=f'wait_for_{rg}', target_time=target_time)
    bash_task = BashOperator(dag=dag, task_id=f'load_{rg}', trigger_rule='all_success',
                             depends_on_past=True, bash_command=...)
    for fname in t['files']:
        fpath = f'{{{{ next_execution_date.strftime("{fname}") }}}}'
        task_id = os.path.basename(fname).split('.')[0]
        file_sensor = FileSensor(dag=dag, task_id=task_id, filepath=fpath, ...)
        file_sensor.set_upstream(time_sensor)
        file_sensor.set_downstream(bash_task)
The above works: bash_task is triggered once all files are available, and it is set with depends_on_past=True. However, the files have slightly different schedules. {rg}_file1 is available 6 days a week (every day except Saturday), while the rest are available 7 days a week.
One option is to create two DAGs, one scheduled to run Sun-Fri and the other scheduled to run Sat only. But with this option, depends_on_past=True is broken on Saturday.
Is there a better way to keep depends_on_past=True 7 days a week? Ideally, in the files loop I could do something like:
for fname in t['files']:
    dt = ...
    if dt.weekday() == 5 and task_id == f'{rg}_file1':
        continue
Generally I think it's better to accomplish things in a single task when it is easy enough to do, and in this case it seems to me you can.
I'm not entirely sure why you are using a datetime sensor, but it does not seem necessary. As far as I can tell, you just want your process to run every day (ideally after the file is there) and skip once per week.
I think we can do away with the file sensor too.
Option 1: everything in bash
Check for the file's existence in your bash script and fail (with retries) if it's missing: just return a non-zero exit code when the file is missing.
Then, in your bash script, you can silently do nothing on the skip day.
On skip days, your bash task will be green even though it did nothing.
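A rough sketch of what that could look like, reusing the templated path idea from the question (the do_load command, the retries value, and the Saturday check via GNU date's %u are placeholders/assumptions, not the poster's actual command):

from airflow.operators.bash import BashOperator

# Sketch only: do_load stands in for the real load command, and %u from GNU
# date gives Mon=1 .. Sun=7, so Saturday is 6.
load_euro = BashOperator(
    dag=dag,
    task_id='load_euro',
    depends_on_past=True,
    retries=12,
    bash_command=(
        'f=/path/{{ next_ds_nodash }}/euro_file1.{{ next_ds_nodash }}.csv; '
        'if [ "$(date -d {{ next_ds }} +%u)" = "6" ]; then echo "skip day"; exit 0; fi; '
        '[ -f "$f" ] || { echo "missing $f"; exit 1; }; '
        'do_load "$f"'
    ),
)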
Option 2: subclass bash operator
Subclass BashOperator and add a skip_day parameter. Then your execute is like this:
def execute(self, context):
    next_execution_date = context['next_execution_date']
    if next_execution_date.day_of_week == self.skip_day:
        raise AirflowSkipException(f'we skip on day {self.skip_day}')
    super().execute(context)
With this option your bash script still needs to fail if the file is missing, but it doesn't need to deal with the skip logic. And you'll be able to see in the UI that the task skipped.
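For reference, a minimal sketch of the full subclass around that execute method (the class name SkipDayBashOperator is made up here, and skip_day should follow whatever integer convention your pendulum version uses for day_of_week):

from airflow.exceptions import AirflowSkipException
from airflow.operators.bash import BashOperator

class SkipDayBashOperator(BashOperator):
    # Illustrative subclass: skip_day is an integer matching pendulum's
    # day_of_week convention for the day on which the task should skip.
    def __init__(self, *, skip_day, **kwargs):
        super().__init__(**kwargs)
        self.skip_day = skip_day

    def execute(self, context):
        next_execution_date = context['next_execution_date']
        if next_execution_date.day_of_week == self.skip_day:
            raise AirflowSkipException(f'we skip on day {self.skip_day}')
        super().execute(context)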
Either way, no sensors.
Other note
You can simplify your filename templating.
files = [
    '/path/{{ next_ds_nodash }}/euro_file2.{{ next_ds_nodash }}.csv',
    ...
]
Then you don't need to mess with strftime.
Related
I have an Airflow DAG with a task_group that contains a loop generating two dynamic tasks. After the task_group I need to perform other actions. My problem is:
Inside the task_group I have a branching operator that validates whether the last task should run or not. If either of the two flows completes successfully, I want to continue my process. For that I'm using the trigger_rule one_success. My code:
with DAG(
    dag_id='hello_world',
    schedule_interval=None,
    start_date=datetime(2022, 8, 25),
    default_args=default_args,
    max_active_runs=1,
    catchup=False,
    concurrency=1,
) as dag:
    task_a = DummyOperator(task_id="task_a")
    with TaskGroup(group_id='task_group') as my_group:
        my_list = ['a', 'b']
        for i in my_list:
            task_b = PythonOperator(
                task_id="task_b_{}".format(i),
                python_callable=p_task_1)
            var_to_continue = check_status(i)
            is_running = ShortCircuitOperator(
                task_id="is_{}_running".format(i),
                python_callable=lambda x: x in [True],
                op_args=[var_to_continue])
            task_c = PythonOperator(
                task_id="task_c_{}".format(i),
                python_callable=p_task_2)
            task_b >> is_running >> task_c
    task_d = DummyOperator(task_id="task_d", trigger_rule=TriggerRule.ONE_SUCCESS)
    task_a >> my_group >> task_d
My problem is: if one of the iterations returns skipped, task_d is always skipped, even if one of the flows returns success.
Do you know how to resolve this?
Thanks!
After a deep search, I found the problem.
In fact, by default ShortCircuitOperator ignores all downstream tasks' trigger rules; if its callable returns False, it cuts the circuit, which means it skips all downstream tasks (its downstream tasks, their downstream tasks, and so on).
In Airflow 2.3.0, in this PR, they added a new argument ignore_downstream_trigger_rules with a default value of True, which ignores the downstream trigger rules, but you can turn that off by passing False.
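On 2.3.0+, that would just be an extra argument on the operator from the snippet above:

is_running = ShortCircuitOperator(
    task_id="is_{}_running".format(i),
    python_callable=lambda x: x in [True],
    op_args=[var_to_continue],
    # Do not skip past task_d; let its ONE_SUCCESS trigger rule decide.
    ignore_downstream_trigger_rules=False)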
If you are using a version older than 2.3.0, you should replace the ShortCircuitOperator with another solution, for example:
from airflow.exceptions import AirflowSkipException

def check_condition():
    if not condition:  # add your logic
        raise AirflowSkipException()

is_running = PythonOperator(..., python_callable=check_condition)
is_running >> task_c
I am facing this situation:
I have generated two dynamic branches. Each branch has multiple chained tasks.
This is what I need Airflow to create for me:
taskA1->taskB1 taskC1->taskD1
taskA2->taskB2... taskZ.. taskC2->taskD2
taskA3->taskB3 taskC3->taskD3
and here is my pseudocode:
def create_branch1(task_ids):
    source = []
    for task_id in task_ids:
        source += [
            Operator1(task_id='task_A{0}'.format(task_id)) >>
            Operator2(task_id='task_B{0}'.format(task_id))
        ]
    return source

def create_branch2(task_ids):
    source = []
    for task_id in task_ids:
        source += [
            Operator1(task_id='task_C{0}'.format(task_id)) >>
            Operator2(task_id='task_D{0}'.format(task_id))
        ]
    return source

create_branch1 >> dummyOperator(Z) >> create_branch2 >> end
However, what Airflow generates looks like this:
taskA1->taskB1 taskD1<-taskC1
taskA2->taskB2...taskZ...taskD2<-taskC2
taskA3->taskB3 taskD3<-taskC3
I mean, in the second branch, dummyOperator(Z) is connected to the last task of the chain (D) instead of to the first task of the chain in the second branch (C).
It seems that, no matter what, DummyOperator(task_Z) will connect to the last task of the chained branches.
Do you have any idea how to tackle this issue?
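For what it's worth, one likely cause: `a >> b` returns `b`, so `source += [Operator1(...) >> Operator2(...)]` collects only the tail tasks (the B/D operators), and wiring Z against that list connects Z to the D tasks. A sketch of one workaround, keeping heads and tails separately (Operator1/Operator2, task_ids and end are taken from the pseudocode above; task_z is assumed to be the DummyOperator Z):

def create_branch2(task_ids):
    heads, tails = [], []
    for task_id in task_ids:
        c = Operator1(task_id='task_C{0}'.format(task_id))
        d = Operator2(task_id='task_D{0}'.format(task_id))
        c >> d
        heads.append(c)
        tails.append(d)
    return heads, tails

heads2, tails2 = create_branch2(task_ids)
task_z >> heads2   # Z fans out to the C tasks (heads of the second branch)
tails2 >> end      # the D tasks converge on end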
I have a list that I loop over to create the tasks. The list is static in size.
for counter, account_id in enumerate(ACCOUNT_LIST):
    task_id = f"bash_task_{counter}"
    if account_id:
        trigger_task = BashOperator(
            task_id=task_id,
            bash_command="echo hello there",
            dag=dag)
    else:
        trigger_task = BashOperator(
            task_id=task_id,
            bash_command="echo hello there",
            dag=dag)
        trigger_task.status = SKIPPED  # is there way to somehow set status of this to skipped instead of having a branch operator?
    trigger_task
I tried this manually but cannot get the task marked as skipped:
start = DummyOperator(task_id='start')
task1 = DummyOperator(task_id='task_1')
task2 = DummyOperator(task_id='task_2')
task3 = DummyOperator(task_id='task_3')
task4 = DummyOperator(task_id='task_4')

start >> task1
start >> task2

try:
    start >> task3
    raise AirflowSkipException
except AirflowSkipException as ase:
    log.error('Task Skipped for task3')

try:
    start >> task4
    raise AirflowSkipException
except AirflowSkipException as ase:
    log.error('Task Skipped for task4')
Yes, there you need to raise AirflowSkipException:
from airflow.exceptions import AirflowSkipException
raise AirflowSkipException
For more information, see the source code.
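Note that the exception has to be raised inside the task while it runs (e.g. from a python_callable), not at DAG-definition time as in the try/except attempt above. A minimal sketch, using PythonOperator instead of DummyOperator so there is a callable to raise from:

from airflow.exceptions import AirflowSkipException
from airflow.operators.python import PythonOperator

def maybe_skip():
    # Raised at run time, this marks the task instance as skipped in the UI.
    raise AirflowSkipException('nothing to do for this run')

task3 = PythonOperator(task_id='task_3', python_callable=maybe_skip, dag=dag)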
Have a fixed number of tasks to execute per DAG. This is really fine, and it also lets you plan the maximum number of parallel tasks your system should handle without degrading downstream systems. Also, having a fixed number of tasks makes them visible in the web UI and gives you an indication of whether they were executed or skipped.
In the code below, I initialize the list with None items and then update the list with values based on data returned from the DB. In the python_callable function, I check whether the account_id is None and raise an AirflowSkipException if so; otherwise I execute the function. In the UI, the tasks are visible and indicate whether they were executed or skipped (skipped meaning there was no account_id).
from airflow.exceptions import AirflowSkipException
from airflow.operators.python import PythonOperator

def execute(account_id):
    if account_id:
        print(f'************Executing task for account_id:{account_id}')
    else:
        raise AirflowSkipException

def create_task(task_id, account_id):
    return PythonOperator(task_id=task_id,
                          python_callable=execute,
                          op_args=[account_id])

list_from_dbhook = [1, 2, 3]  # dummy list. Get records using DB Hook

# Need to have some fixed size. Need to allocate fixed resources or number of tasks.
# Having this fixed number of tasks makes the tasks visible in the UI instead of being purely dynamic
record_size_limit = 5
ACCOUNT_LIST = [None] * record_size_limit
for index, account_id_val in enumerate(list_from_dbhook):
    ACCOUNT_LIST[index] = account_id_val

for idx, acct_id in enumerate(ACCOUNT_LIST):
    task = create_task(f"task_{idx}", acct_id)
    task
I thought the macro prev_execution_date listed here would get me the execution date of the last DAG run, but looking at the source code it seems to only get the last date based on the DAG schedule.
prev_execution_date = task.dag.previous_schedule(self.execution_date)
Is there any way via macros to get the execution date of the DAG when it doesn't run on a schedule?
Yes, you can define your own custom macro for this as follows:
# custom macro function
def get_last_dag_run(dag):
    last_dag_run = dag.get_last_dagrun()
    if last_dag_run is None:
        return "no prev run"
    else:
        return last_dag_run.execution_date.strftime("%Y-%m-%d")

# add macro in user_defined_macros in dag definition
dag = DAG(dag_id="my_test_dag",
          schedule_interval='@daily',
          user_defined_macros={
              'last_dag_run_execution_date': get_last_dag_run
          })

# example of using it in practice
print_vals = BashOperator(
    task_id='print_vals',
    bash_command='echo {{ last_dag_run_execution_date(dag) }}',
    dag=dag
)
Note that dag.get_last_dagrun() is just one of the many functions available on the DAG object. Here's where I found it: https://github.com/apache/incubator-airflow/blob/v1-10-stable/airflow/models.py#L3396
You can also tweak the formatting of the date string, and what you want to output if there is no previous run.
You can define your own custom macro function and use the Airflow models to query the metadata database.
def get_last_dag_run(dag_id):
    # TODO: search the metadata DB
    return xxx

dag = DAG(
    'example',
    schedule_interval='0 1 * * *',
    user_defined_macros={
        'last_dag_run_execution_date': get_last_dag_run,
    }
)
Then use the macro's key in your template.
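A sketch of what that TODO could look like, using the DagRun model to find the most recent run (this assumes DagRun.find is available in your Airflow version):

from airflow.models import DagRun

def get_last_dag_run(dag_id):
    # Look up all runs for this dag_id in the metadata DB and
    # return the execution_date of the most recent one.
    dag_runs = DagRun.find(dag_id=dag_id)
    if not dag_runs:
        return "no prev run"
    dag_runs.sort(key=lambda run: run.execution_date)
    return dag_runs[-1].execution_date.strftime("%Y-%m-%d")

Since this version takes a dag_id rather than the DAG object, you would call it in the template as, for example, echo {{ last_dag_run_execution_date(dag.dag_id) }}.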
Problem: I've been trying to find a way to get tasks from a DAG that have no downstream tasks following them.
Why I need it: I'm building an "on success" notification for DAGs. Airflow DAGs have an on_success_callback argument, but the problem with that is that it gets triggered after every task success instead of just once for the DAG. I've seen other people approach this problem by creating a notification task and appending it to the end. The problem I have with this approach is that many of the DAGs we're using have multiple ends, and some are auto-generated.
Making sure that all ends are caught manually is tedious.
I've spent hours digging for a way to access data I need to build this.
Sample DAG setup:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime
default_args = {
'owner': 'airflow',
'start_date': datetime(2018, 7, 29)}
dag = DAG(
'append_to_end',
description='append a tast to all tasks without downstream',
default_args=default_args,
schedule_interval='* * * * *',
catchup=False)
task_1 = DummyOperator(dag=dag, task_id='task_1')
task_2 = DummyOperator(dag=dag, task_id='task_2')
task_3 = DummyOperator(dag=dag, task_id='task_3')
task_1 >> task_2
task_1 >> task_3
This produces the following DAG:
What I want to achieve is an automated way to add a new task to a DAG that connects to all ends, like in the image below.
I know it's an old post, but I've had a similar need to the one posted above.
You can add to your return function a condition that excludes your "final_task" id, so it won't be included in the get_leaf_tasks return, something like:
def get_leaf_tasks(dag):
    return [task for task_id, task in dag.task_dict.items()
            if len(task.downstream_list) == 0 and task_id != 'final_task']
Additionally, you can change this part:
for task in leaf_tasks:
    task >> final_task
to:
get_leaf_tasks(dag) >> final_task
Since get_leaf_tasks already gives you a list of tasks, and the ">>" operator will do the loop for you.
What I've got so far is the code below:
def get_leaf_tasks(dag):
    return [task for task_id, task in dag.task_dict.items() if len(task.downstream_list) == 0]

leaf_tasks = get_leaf_tasks(dag)
final_task = DummyOperator(dag=dag, task_id='final_task')
for task in leaf_tasks:
    task >> final_task
It produces the result I want, but what I don't like about this solution is that get_leaf_tasks must be executed before final_task is created, or it will be included in leaf_tasks list and I'll have to find ways to exclude it.
I could wrap assignment in another function:
def append_to_end(dag, task):
    leaf_tasks = get_leaf_tasks(dag)
    dag.add_task(task)
    for leaf in leaf_tasks:
        leaf >> task

final_task = DummyOperator(task_id='final_task')
append_to_end(dag, final_task)
This is not ideal either, as the caller must ensure they've created final_task without a DAG assigned to it.