Airflow not loading operator tasks from a file other than the DAG file

Normally we define the operators within the same Python file where our DAG is defined (see this basic example). I was doing the same. But my tasks are themselves big and use custom operators, so I wanted a modular DAG project structure where all tasks that use the same operator live in a separate file. For simplicity, let me give a very basic example: I have an operator x with several tasks. This is my project structure:
main_directory
├──tasks
| ├──operator_x
| | └──op_x.py
| ├──operator_y
| : └──op_y.py
|
└──dag.py
op_x.py has the following method:
def prepare_task():
    from main_directory.dag import dag
    t2 = BashOperator(
        task_id='print_inner_date',
        bash_command='date',
        dag=dag)
    return t2
and dag.py contains the following code:
from main_directory.tasks.operator_x import prepare_task

default_args = {
    'retries': 5,
    'retry_delay': dt.timedelta(minutes=5),
    'on_failure_callback': gen_email(EMAIL_DISTRO, retry=False),
    'on_retry_callback': gen_email(EMAIL_DISTRO, retry=True),
    'start_date': dt.datetime(2019, 5, 10)
}

dag = DAG('test_dag', default_args=default_args, schedule_interval=dt.timedelta(days=1))

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = prepare_task()
Now when I execute this in my Airflow environment and run airflow list_dags, I get the desired DAG named test_dag listed, but when I run airflow list_tasks -t test_dag I only get one task with ID print_date and NOT the one defined inside the subdirectory with ID print_inner_date. Can anyone help me understand what I am missing?

Your code would create cyclic imports. Instead, try the following:
op_x.py should have:
from airflow.operators.bash_operator import BashOperator


def prepare_task(dag):
    t2 = BashOperator(
        task_id='print_inner_date',
        bash_command='date',
        dag=dag)
    return t2
dag.py:
from main_directory.tasks.operator_x import prepare_task

default_args = {
    'retries': 5,
    'retry_delay': dt.timedelta(minutes=5),
    'on_failure_callback': gen_email(EMAIL_DISTRO, retry=False),
    'on_retry_callback': gen_email(EMAIL_DISTRO, retry=True),
    'start_date': dt.datetime(2019, 5, 10)
}

dag = DAG('test_dag', default_args=default_args, schedule_interval=dt.timedelta(days=1))

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = prepare_task(dag=dag)
Also make sure that main_directory is in your PYTHONPATH.
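If editing PYTHONPATH in the environment is not convenient, here is a minimal sketch of the same idea done inside dag.py (assuming dag.py sits directly under main_directory, so it is the parent of main_directory that must be importable; the sys.path tweak is only needed if that parent is not already on the path):

import os
import sys

# Hypothetical workaround: make the parent of main_directory importable at DAG-parse time.
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

from main_directory.tasks.operator_x import prepare_task

Depending on how the packages are laid out, the import may also need to be from main_directory.tasks.operator_x.op_x import prepare_task, or operator_x/__init__.py can re-export prepare_task.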

Related

Airflow: Dynamically generate tasks with TaskFlow API

Previously I used the following snippet to dynamically generate tasks:
dummy_start_task = PythonOperator(
    task_id="dummy_start",
    default_args=default_args,
    python_callable=dummy_start,
    dag=dag
)

make_images_tasks = list()
for n in range(WORKERS):
    globals()[f"make_images_{n}_task"] = PythonOperator(
        task_id=f'make_images_{n}',
        default_args=default_args,
        python_callable=make_images,
        op_kwargs={"n": n},
        dag=dag
    )
    make_images_tasks.append(globals()[f"make_images_{n}_task"])

dummy_collector_task = PythonOperator(
    task_id="dummy_collector",
    default_args=default_args,
    python_callable=dummy_collector,
    dag=dag
)

dummy_start_task >> make_images_tasks >> dummy_collector_task

# in collector_task I would use:
# items = task_instance.xcom_pull(task_ids=[f"make_images_{n}" for n in range(int(WORKERS))])
# to get the XComs from these dynamically generated tasks
How can I achieve that using the TaskFlow API? (Spawn multiple tasks and then get their XComs in the following collector-task)
Here's an example:
from datetime import datetime

from airflow import DAG
from airflow.decorators import task

with DAG(dag_id="example_taskflow", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:

    @task
    def dummy_start_task():
        pass

    tasks = []
    for n in range(3):

        @task(task_id=f"make_images_{n}")
        def images_task(i):
            return i

        tasks.append(images_task(n))

    @task
    def dummy_collector_task(tasks):
        print(tasks)

    dummy_start_task_ = dummy_start_task()
    dummy_start_task_ >> tasks
    dummy_collector_task(tasks)
Which gives the following DAG:
The make_images_* tasks take 0, 1, and 2 as input (and also use it in the tasks' id) and return the value. The dummy_collector_task takes all outputs from the make_images_* tasks and prints [0, 1, 2].
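If you prefer the collector to pull the XComs explicitly by task_id, as in the original snippet, here is a minimal sketch using get_current_context (the make_images_* task IDs and the range of 3 mirror the example above):

from airflow.decorators import task
from airflow.operators.python import get_current_context

@task
def collector_with_explicit_pull():
    # Pull the return values of the dynamically generated tasks by task_id,
    # mirroring the task_instance.xcom_pull(...) call from the old snippet.
    ti = get_current_context()["ti"]
    items = ti.xcom_pull(task_ids=[f"make_images_{n}" for n in range(3)])
    print(items)  # expected: [0, 1, 2]

Because nothing is passed to this task as an argument, the dependency has to be wired explicitly, e.g. tasks >> collector_with_explicit_pull() inside the with DAG(...) block.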

How to trigger a task in airflow if immediate parent task fails?

What I am mainly aiming for is that restore_denormalized_es_data should only get triggered when the load_denormalized_es_data task fails. If the load_denormalized_es_data task is successful, the flow should go to END. As you can see, my restore runs when archive fails and load is skipped or retrying, so I am getting wrong results.
I have stated the code I am using below:
import sys
import os
from datetime import datetime
# import files what u want to import

# Airflow level imports
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator, BranchPythonOperator
from airflow.operators.bash_operator import BashOperator
from airflow.utils.trigger_rule import TriggerRule

# Imported all the functions and the code is able to call the functions with ease

# Name of the Dag
DAG_NAME = "DAG"

# Default arguments
default_args = {
    "owner": "Mehul",
    "start_date": datetime.today().strftime("%Y-%m-%d"),
    "provide_context": True
}

# Define the dag object
dag = DAG(
    DAG_NAME,
    default_args=default_args,
    schedule_interval=None
)

archive_denormalized_es_data = PythonOperator(
    task_id="archive_denormalized_es_data",
    python_callable=archive_current_ES_data,
    trigger_rule=TriggerRule.ALL_SUCCESS,
    provide_context=False,
    dag=dag
)

load_denormalized_es_data = PythonOperator(
    task_id="load_denormalized_es_data",
    python_callable=es_load,
    provide_context=False,
    trigger_rule=TriggerRule.ALL_SUCCESS,
    dag=dag
)

restore_denormalized_es_data = PythonOperator(
    task_id="restore_denormalized_es_data",
    python_callable=restore_current_ES_data,
    trigger_rule=TriggerRule.ALL_FAILED,
    provide_context=False,
    dag=dag
)

END = DummyOperator(
    task_id="END",
    trigger_rule=TriggerRule.ALL_SUCCESS,
    dag=dag)

denormalized_data_creation >> archive_denormalized_es_data >> load_denormalized_es_data
load_denormalized_es_data << archive_denormalized_es_data << denormalized_data_creation
load_denormalized_es_data >> restore_denormalized_es_data
restore_denormalized_es_data << load_denormalized_es_data
load_denormalized_es_data >> END
END << load_denormalized_es_data
restore_denormalized_es_data >> END
END << restore_denormalized_es_data
Here is a picture of the pipeline referred to above.
If I understand correctly, you want to skip the rest of the pipeline if A fails.
ShortCircuitOperator will allow Airflow to short circuit (skip) the rest of the pipeline.
Here is an example that does what you outlined.
from datetime import datetime

from airflow.models import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator, ShortCircuitOperator
from airflow.utils.trigger_rule import TriggerRule
from airflow.utils.state import State


def proceed(**context):
    ti = context['dag_run'].get_task_instance(a.task_id)
    if ti.state == State.FAILED:
        return False
    else:
        return True


dag = DAG(
    dag_id="dag",
    start_date=datetime(2021, 4, 5),
    schedule_interval='@once',
)

with dag:
    a = PythonOperator(
        task_id='archive_denormalized_es_data',
        python_callable=lambda x: 1
    )
    gate = ShortCircuitOperator(
        task_id='gate',
        python_callable=proceed,
        trigger_rule=TriggerRule.ALL_DONE
    )
    b = PythonOperator(
        task_id='load_denormalized_es_data',
        python_callable=lambda: 1
    )
    c = DummyOperator(
        task_id='restore_denormalized_es_data',
        trigger_rule=TriggerRule.ALL_FAILED
    )
    d = DummyOperator(
        task_id='END',
        trigger_rule=TriggerRule.ONE_SUCCESS
    )
    a >> gate >> b >> c
    [b, c] >> d
If archive_denormalized_es_data fails, the rest of the pipeline is skipped, meaning Airflow does not run restore_denormalized_es_data.
If load_denormalized_es_data fails, restore_denormalized_es_data runs and continues to END.
If load_denormalized_es_data succeeds, restore_denormalized_es_data is skipped and the flow continues to END.
Your code is essentially missing the logic to skip when archive_denormalized_es_data fails, which the ShortCircuitOperator takes care of for you.
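If you need the same gate pattern in front of more than one task, the proceed callable can be parameterized instead of hard-coding a.task_id; here is a small sketch building on the answer above (make_gate is a hypothetical helper):

from airflow.operators.python import ShortCircuitOperator
from airflow.utils.trigger_rule import TriggerRule
from airflow.utils.state import State

def make_gate(watched_task_id, gate_task_id):
    # Build a ShortCircuitOperator that skips everything downstream of it
    # when the watched task instance ended up FAILED in this DAG run.
    def proceed(**context):
        ti = context['dag_run'].get_task_instance(watched_task_id)
        return ti.state != State.FAILED

    return ShortCircuitOperator(
        task_id=gate_task_id,
        python_callable=proceed,
        trigger_rule=TriggerRule.ALL_DONE
    )

Inside the with dag: block it would be used as a >> make_gate('archive_denormalized_es_data', 'gate') >> b.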

How to retry an upstream task?

task a > task b > task c
If C fails I want to retry A. Is this possible? There are a few other tickets which involve subdags, but I would like to just be able to clear A.
I'm hoping to use on_retry_callback in task C but I don't know how to call task A.
There is another question which does this in a subdag, but I am not using subdags.
I'm trying to do this, but it doesn't seem to work:
def callback_for_failures(context):
    print("*** retrying ***")
    if context['task'].upstream_list:
        context['task'].upstream_list[0].clear()
As other comments mentioned, I would use caution to make sure you aren't getting into an endless loop of clearing/retries. But you can call a bash command as part of your on_failure_callback and then specify which tasks you want to clear, and whether you want downstream/upstream tasks cleared, etc.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta


def clear_upstream_task(context):
    execution_date = context.get("execution_date")
    clear_tasks = BashOperator(
        task_id='clear_tasks',
        bash_command=f'airflow tasks clear -s {execution_date} -t t1 -d -y clear_upstream_task'
    )
    return clear_tasks.execute(context=context)


# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

with DAG('clear_upstream_task',
         start_date=datetime(2021, 1, 1),
         max_active_runs=3,
         schedule_interval=timedelta(minutes=5),
         default_args=default_args,
         catchup=False
         ) as dag:

    t0 = DummyOperator(
        task_id='t0'
    )
    t1 = DummyOperator(
        task_id='t1'
    )
    t2 = DummyOperator(
        task_id='t2'
    )
    t3 = BashOperator(
        task_id='t3',
        bash_command='exit 123',
        on_failure_callback=clear_upstream_task
    )

    t0 >> t1 >> t2 >> t3
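To address the endless-loop caveat above, one option is to give up after a few automatic clears; here is a minimal sketch (the threshold of 3 is arbitrary, and it assumes ti.try_number keeps growing across clears rather than being reset):

from airflow.operators.bash_operator import BashOperator

def clear_upstream_task_guarded(context):
    # Hypothetical guard: stop clearing upstream once this task has already
    # been attempted more than max_clear_attempts times.
    max_clear_attempts = 3
    if context["ti"].try_number > max_clear_attempts:
        print(f"Reached {max_clear_attempts} attempts, not clearing upstream again")
        return
    execution_date = context.get("execution_date")
    clear_tasks = BashOperator(
        task_id='clear_tasks',
        bash_command=f'airflow tasks clear -s {execution_date} -t t1 -d -y clear_upstream_task'
    )
    return clear_tasks.execute(context=context)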

Airflow Scheduling: how to run initial setup task only once?

If my DAG is this
[setup] -> [processing-task] -> [end].
How can I schedule this DAG to run periodically, while running [setup] task only once (on first scheduled run) and skipping it for all later runs?
Check out this post on Medium, which describes how to implement a "run once" operator. I have successfully used this approach several times.
Here is a way to do it without needing to create a new class. I found this simpler than the accepted answer, and it worked well for my use case.
Might be useful for others!
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator

with DAG(
    dag_id='your_dag_id',
    default_args={
        'depends_on_past': False,
        'email': ['you@email.com'],
        'email_on_failure': True,
        'email_on_retry': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    },
    description='Dag with initial setup task that only runs on start_date',
    start_date=datetime(2000, 1, 1),
    # Runs daily at 1 am
    schedule_interval='0 1 * * *',
    # catchup must be true if start_date is before datetime.now()
    catchup=True,
    max_active_runs=1,
) as dag:

    def branch_fn(**kwargs):
        # Have to make sure start_date will equal data_interval_start on first run
        # This dag is daily but since the schedule_interval is set to 1 am data_interval_start would be
        # 2000-01-01 01:00:00 when it needs to be
        # 2000-01-01 00:00:00
        date = kwargs['data_interval_start'].replace(hour=0, minute=0, second=0, microsecond=0)
        if date == dag.start_date:
            return 'initial_task'
        else:
            return 'skip_initial_task'

    branch_task = BranchPythonOperator(
        task_id='branch_task',
        python_callable=branch_fn,
        provide_context=True
    )

    initial_task = DummyOperator(
        task_id="initial_task"
    )

    skip_initial_task = DummyOperator(
        task_id="skip_initial_task"
    )

    next_task = DummyOperator(
        task_id="next_task",
        # This is important otherwise next_task would be skipped
        trigger_rule="one_success"
    )

    branch_task >> [initial_task, skip_initial_task] >> next_task
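As a worked check of the branch condition above (a sketch; pendulum is the datetime library Airflow uses, and the dates mirror the example):

import pendulum

# start_date declared on the DAG (midnight) versus the first scheduled
# data_interval_start, which falls on the first '0 1 * * *' tick: 1 am.
start_date = pendulum.datetime(2000, 1, 1)
first_data_interval_start = pendulum.datetime(2000, 1, 1, 1)

# branch_fn normalizes the 1 am interval start back to midnight before comparing.
normalized = first_data_interval_start.replace(hour=0, minute=0, second=0, microsecond=0)
print(normalized == start_date)  # True -> branch_fn returns 'initial_task'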

Airflow DAG does not skip tasks after BranchPythonOperator or ShortCircuitOperator

I am writing a DAG with a BranchPythonOperator to check whether or not data is available for download. If the data is there, the DAG should download and incorporate it into my PostgreSQL database. If it isn't there, all the processing tasks should be skipped and the branch should go to a DummyOperator. Unfortunately the DAG is not skipping all the tasks. It will skip up to 6 tasks, but then stops (the downstream tasks have an unknown status) and the DAG fails. I am not finding any error messages in the logs (because tasks are not failing).
Airflow version 1.8.1. I attached some screenshots below. In the following DAG example, I replaced sensitive file info with 'XXXXX'. I have also tried the ShortCircuitOperator, but only got it to skip the task directly downstream from the SCO.
Thank you!
from airflow import DAG
from airflow.contrib.operators.ssh_execute_operator import SSHExecuteOperator
from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.operators.postgres_operator import PostgresOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.email_operator import EmailOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 2),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}

dag = DAG(
    'Question_DAG',
    template_searchpath="/home/airflow/code",
    schedule_interval='0 10 * * *',
    catchup=True,
    default_args=default_args
)

data_check = SSHExecuteOperator(
    task_id='check_for_new_data_and_download',
    ssh_hook=SSHHook(conn_id='server'),
    bash_command='download_file.sh',
    xcom_push=True,
    dag=dag)


def identify_new_data(**kwargs):
    ''' If DATA has not been uploaded, the DAG does not continue.'''
    new_data_code = kwargs['task_instance'].xcom_pull(task_ids='check_for_new_data_and_download', key=None)
    filecount_type_conversion_success = True
    try:
        new_data_code = int(new_data_code)
    except ValueError:
        filecount_type_conversion_success = False
    # print new_data_code, type(new_data_code)
    # 1 means there is new data, therefore it should update the data tables.
    # 2 means source data was not uploaded
    if new_data_code == 1:
        return 'data_uploaded'
    elif new_data_code == 2 or new_data_code == 3:
        return 'no_data_uploaded'


identify_new_data = BranchPythonOperator(
    task_id='identify_new_data',
    python_callable=identify_new_data,
    trigger_rule="all_done",
    provide_context=True,
    dag=dag)

no_data_uploaded = DummyOperator(
    task_id="no_data_uploaded",
    trigger_rule='all_done',
    dag=dag)

data_uploaded = EmailOperator(
    task_id='data_uploaded',
    to='myemail@google',
    subject='File Downloaded',
    html_content='Hello, This is an auto-generated email to inform you that the monthly data has been downloaded. Thank you.',
    dag=dag)

################# create_raw_table ################################
create_raw_table = PostgresOperator(
    task_id='create_raw_table',
    postgres_conn_id='warehouse',
    sql='create_raw_table.sql',
    dag=dag)

################# Convert fixed width file to csv ################################
convert_fixed_width_csv = SSHExecuteOperator(
    task_id='convert_fixed_width_csv',
    ssh_hook=SSHHook(conn_id='server'),
    bash_command='convert_fixed_width_csv.sh',
    dag=dag)

################# Dedupe ##############
dedupe_on_id = PostgresOperator(
    task_id='dedupe_on_id',
    postgres_conn_id='warehouse',
    sql='dedupe.sql',
    dag=dag)

################# Date Insert ################################
date_insert = PostgresOperator(
    task_id='add_dates_raw',
    postgres_conn_id='warehouse',
    sql='add_dates.sql',
    dag=dag)

################ Client Insert ###########################
client_insert = PostgresOperator(
    task_id='client_insert',
    postgres_conn_id='warehouse',
    sql='client_insert.sql',
    dag=dag)

################# Months Insert ###########################
months_insert = PostgresOperator(
    task_id='months_insert',
    postgres_conn_id='warehouse',
    sql='months_insert.sql',
    dag=dag)

################# Eligibility Insert ######################
eligibility_insert = PostgresOperator(
    task_id='eligibility_insert',
    postgres_conn_id='warehouse',
    sql='eligibility_insert.sql',
    dag=dag)

################# Plan Insert ####################
plan_insert = PostgresOperator(
    task_id='plan_insert',
    postgres_conn_id='warehouse',
    sql='plan_insert.sql',
    dag=dag)

################# Codes ###################################
codes = PostgresOperator(
    task_id='codes',
    postgres_conn_id='warehouse',
    sql='codes.sql',
    dag=dag)

################# Update Dates ################################
update_dates = PostgresOperator(
    task_id='update_dates',
    postgres_conn_id='warehouse',
    sql='update_dates.sql',
    dag=dag)

################# Clients ################################
create_clients = PostgresOperator(
    task_id='create_clients',
    postgres_conn_id='warehouse',
    sql='clients.sql',
    dag=dag)

################# fix_addresses ############
fix_addresses = SSHExecuteOperator(
    task_id='fix_addresses',
    ssh_hook=SSHHook(conn_id='server'),
    bash_command='fix_addresses.sh',
    dag=dag)

################# Load data ############
load_data_command = """
cd data/
TASKDATE=`date +%Y%m`
cp XXXX.TXT /home/admin/data/XXX_loaded/XXX.TXT
"""

load_data = SSHExecuteOperator(
    task_id='load_data',
    ssh_hook=SSHHook(conn_id='server'),
    bash_command=load_data_command,
    dag=dag)

################# Update system status ################################
system_status = PostgresOperator(
    task_id='update_system_status',
    postgres_conn_id='warehouse',
    sql="SELECT update_system_status('new_info')",
    dag=dag)
data_check.set_downstream(identify_new_data)
identify_new_data.set_downstream(data_uploaded)
data_uploaded.set_downstream(create_raw_table)
create_raw_table.set_downstream(convert_fixed_width_csv)
convert_fixed_width_csv.set_downstream(dedupe_on_id)
dedupe_on_id.set_downstream(date_insert)
date_insert.set_downstream(client_insert)
client_insert.set_downstream(months_insert)
months_insert.set_downstream(eligibility_insert)
eligibility_insert.set_downstream(plan_insert)
plan_insert.set_downstream(codes)
codes.set_downstream(update_dates)
update_dates.set_downstream(create_clients)
create_clients.set_downstream(fix_addresses)
fix_addresses.set_downstream(load_data)
load_data.set_downstream(system_status)
The attached screenshots show the Tree View in the Airflow UI, where I was trying to troubleshoot which tasks were causing the DAG to fail instead of being skipped.
[Screenshot: DAG tasks not skipping]
[Screenshot: DAG tasks]
I believe you're running into the same issue described in AIRFLOW-1296. A fix was made for it in Airflow 1.8.2, so I would upgrade and see if you can still reproduce it. It worked for me, but as seen in the comments, there were some mixed results.
