Airflow - prevent DAG from running immediately during import

I have a DAG with the following steps:
1. Retrieve a list of items from an API call.
2. For each item in the list, spin up another task that prints the value.
Basically, step 2 is undetermined until the API call is made. I want the API call to be made only after I trigger a DAG run.
However, step 1 of the DAG is executed while the DAG file itself is being imported, and if the API call fails, Airflow reports the DAG as broken. The entire thing is supposed to be dynamic.
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
import requests

# Default args for the DAG
default_args = {
    'owner': 'me',
    'start_date': datetime(2025, 1, 1),
    'depends_on_past': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Create a DAG instance
dag = DAG(
    'my_dag_id',
    default_args=default_args,
    schedule=None,
)

def get_items():
    """
    Makes a HTTP request to an API,
    retrieves a list of items from the response,
    and returns the list
    """
    response = requests.get('https://api.example.com/items')
    items = response.json()['items']
    return items

def process_item(item):
    """
    Processes a single item
    """
    print(f'Processing item {item}')

# Create a PythonOperator to get the items
get_items_task = PythonOperator(
    task_id='get_items',
    python_callable=get_items,
    dag=dag,
)

# Create a PythonOperator to process each item
for item in get_items():
    task = PythonOperator(
        task_id=f'process_item_{item}',
        python_callable=process_item,
        op_args=[item],
        dag=dag,
    )
    task.set_upstream(get_items_task)
Notice that I have set the start date to the future and schedule=None.
As soon as I save this .py file in the /dags folder, it immediately executes get_items and reports that the DAG is broken because the API call returned an error.
How can I stop the task from being executed while the DAG is imported?
I want it to be dynamic, i.e., fetch the list of items only once the DAG is triggered, and only then create a task for each of those items.

You're calling get_items() in the global scope of the DAG file (statement for item in get_items():). This gets evaluated every time Airflow parses the DAG file.
To avoid get_items() getting executed in the global scope, you can place this functionality in a function, to only generate tasks at runtime. For this use case, dynamic task mapping was introduced in Airflow. This allows you to generate a varying number of tasks given a collection of items.
I've refactored your DAG to generate tasks in the process_item task given the output of get_items:
from datetime import datetime, timedelta
import requests
from airflow import DAG
from airflow.decorators import task

# Default args for the DAG
default_args = {
    "owner": "me",
    "depends_on_past": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# Create a DAG instance
with DAG(
    "my_dag_id",
    default_args=default_args,
    start_date=datetime(2025, 1, 1),
    schedule=None,
):
    @task
    def get_items():
        """
        Makes a HTTP request to an API,
        retrieves a list of items from the response,
        and returns the list
        """
        response = requests.get("https://api.example.com/items")
        items = response.json()["items"]
        return items

    @task
    def process_item(item):
        """
        Processes a single item
        """
        print(f"Processing item {item}")

    process_item.expand(item=get_items())
expand() generates a task for each element in the output of get_items(). The TaskFlow API (@task decorator) is convenient when dealing with dynamically generated tasks; read more about it in the docs.
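To make the parse-time versus run-time difference concrete, here is a minimal sketch that needs neither Airflow nor a network. It only assumes that parsing a DAG file means executing its top-level code; fake_get_items and the parse helper are hypothetical stand-ins, not Airflow APIs.

```python
import types

calls = []  # records every time the stand-in API function runs


def fake_get_items():
    """Stand-in for the API call (hypothetical; no network involved)."""
    calls.append("called")
    return ["a", "b"]


# A "DAG file" that calls the function at module level,
# like `for item in get_items():` in the question.
eager_source = """
for item in get_items():
    pass
"""

# A "DAG file" that only defines a function wrapping the call.
lazy_source = """
def build_tasks():
    return [item for item in get_items()]
"""


def parse(source):
    """Mimic a parse: execute the file's top-level code once."""
    module = types.ModuleType("fake_dag_file")
    module.get_items = fake_get_items
    exec(source, module.__dict__)
    return module


parse(eager_source)
calls_after_eager_parse = len(calls)  # the call already happened while parsing

lazy_module = parse(lazy_source)
calls_after_lazy_parse = len(calls)   # unchanged: nothing ran at parse time

lazy_module.build_tasks()
calls_after_run = len(calls)          # the call happens only when invoked
```

The eager file triggers the call as a side effect of being parsed, which is what happened with the original DAG file; the lazy file defers it until something explicitly runs the function.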

Related

Airflow state toggles between success and removed

I have an Airflow DAG whose tasks constantly toggle between success and removed, and back again.
I am not sure why the task state goes from success to removed.
My dag code is:
from airflow import DAG
import datetime
from datetime import timedelta
from tasks.user_space_tables_refresh import UserSpaceTablesRefresh

default_args = {
    'owner': 'Data Engineering',
    'depends_on_past': False,
    'start_date': datetime.datetime(2022, 2, 1),
}

user_space_dag = DAG(
    'user_space_snowflake_tables_refresh', default_args=default_args,
    schedule_interval="30 18 * * *", catchup=False)

with user_space_dag:
    users_task = UserSpaceTablesRefresh(
        task_id='ingest_USERS_data',
        source_table='USERS')
    saved_software_task = UserSpaceTablesRefresh(
        task_id='ingest_SAVED_SOFTWARE_data',
        source_table='SAVED_SOFTWARE')
tasks.user_space_tables_refresh file:
class UserSpaceTablesRefresh(BaseOperator):
    @apply_defaults
    def __init__(self, source_table, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.table = source_table

    def execute(self, context):
        try:
            sf_table = self.table
            ...
        except Exception as ex:
            print("Exception")
A task is marked as removed when it has disappeared from the DAG since the run started.
This can happen when you create tasks dynamically, for example:
- you have a condition that has changed since the start of the run
- you are using a remote configuration file that is inaccessible during the next DAG file processing
If the run history is not important to you, try removing the DAG completely from the Metastore and letting the Airflow DAG file processor re-create it. The cause may be a mismatch between versions of the serialized DAG, or a problem that appeared after an Airflow upgrade.
airflow dags delete <dag_id>
This command will delete the dag from the Metastore and all the dag runs and tasks information.
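As a rough illustration of the "disappears from the DAG" condition, the sketch below compares the task ids known when a run started with those produced by a later parse. This is only a simplified model of the behaviour described above, not Airflow's actual implementation; generate_task_ids and find_removed_tasks are hypothetical helpers.

```python
def generate_task_ids(tables):
    """Hypothetical dynamic task generation: one task id per configured table."""
    return [f"ingest_{table}_data" for table in tables]


def find_removed_tasks(task_ids_at_run_start, task_ids_after_reparse):
    """Return task ids that existed when the run started but are gone now."""
    return sorted(set(task_ids_at_run_start) - set(task_ids_after_reparse))


# Task ids present when the run started.
at_start = generate_task_ids(["USERS", "SAVED_SOFTWARE"])

# Assume the configuration was only partly readable on the next parse.
after_reparse = generate_task_ids(["USERS"])

removed = find_removed_tasks(at_start, after_reparse)
print(removed)
```

If the set of generated task ids is stable between parses, the difference is empty and no task should flip to removed for this reason.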

How to define a subdag task in Airflow from another dag.py file?

I want to make a parent DAG with a few child DAGs that get called via the SubDagOperator.
I can only find examples of how to create SubDAGs dynamically within the SubDagOperator task.
However, in this case I want standalone child DAGs that are already defined in a DAG.py file, and I want to stitch those together in a parent DAG.
If I set the SubDAGOperator task with just the Dag Name of the child dag:
task_1 = SubDagOperator(
    task_id="task_1",
    subdag=child_dag_name,
    dag=parent_dag
)
I get the following Error:
NameError: name 'child_dag_name' is not defined
This answer relies as much on knowledge of Python as it does on know-how of Airflow.
Recall that
- python: importing a module means that all top-level (indentation zero) code is executed immediately (during the import process)
- airflow: the scheduler / webserver only pick up DAG objects that occur at the top level (indentation zero) of a dag-definition file
Keeping the above 2 things in mind, here's what you can do
- create a helper / utility function in your child_dag.py file to instantiate and return a DAG object for the child DAG
- use that helper function both for instantiating the top-level child DAG and for creating the SubDagOperator task
dag_object_builder.py
from typing import Dict, Any
from airflow.models import DAG

def create_dag_object(dag_id: str, dag_params: Dict[str, Any]) -> DAG:
    dag: DAG = DAG(dag_id=dag_id, **dag_params)
    return dag
child_dag.py
from datetime import datetime
from typing import Dict, Any
from airflow.models import DAG
from src.main.subdag_example import dag_object_builder

default_args: Dict[str, Any] = {
    "owner": "my_owner",
    "email": ["my_username@my_domain.com"],
    "weight_rule": "downstream",
    "retries": 1
}
...

def create_child_dag_object(dag_id: str) -> DAG:
    my_dag: DAG = dag_object_builder.create_dag_object(
        dag_id=dag_id,
        dag_params={
            "start_date": datetime(year=2019, month=7, day=10, hour=21, minute=30),
            "schedule_interval": None,
            "max_active_runs": 1,
            "default_view": "graph",
            "catchup": False,
            "default_args": default_args
        }
    )
    return my_dag

my_child_dag: DAG = create_child_dag_object(dag_id="my_child_dag")
parent_dag.py
from datetime import datetime
from typing import Dict, Any
from airflow.models import DAG
from airflow.operators.subdag_operator import SubDagOperator
from src.main.subdag_example import child_dag
from src.main.subdag_example import dag_object_builder

default_args: Dict[str, Any] = {
    "owner": "my_owner",
    "email": ["my_username@my_domain.com"],
    "weight_rule": "downstream",
    "retries": 1
}

my_parent_dag: DAG = dag_object_builder.create_dag_object(
    dag_id="my_parent_dag",
    dag_params={
        "start_date": datetime(year=2019, month=7, day=10, hour=21, minute=30),
        "schedule_interval": None,
        "max_active_runs": 1,
        "default_view": "graph",
        "catchup": False,
        "default_args": default_args
    }
)
...
my_subdag_task: SubDagOperator = SubDagOperator(
    task_id="my_subdag_task",
    dag=my_parent_dag,
    subdag=child_dag.create_child_dag_object(dag_id="my_subdag")
)
If your intention is to link-up DAGs together and you don't have any particular requirement that necessitates using a SubDagOperator, then I would suggest using the TriggerDagRunOperator instead since SubDags have their share of nuisances.
Read more about it here: Wiring top-level DAGs together

Dynamic dags not getting added by scheduler

I am trying to create Dynamic DAGs and then get them to the scheduler. I tried the reference from https://www.astronomer.io/guides/dynamically-generating-dags/ which works well. I changed it a bit as in the below code. Need help in debugging the issue.
What I tried:
1. Test-running the file. The DAG gets executed and globals() prints all the DAG objects, but somehow they are not listed by list_dags or in the UI.
from datetime import datetime, timedelta
import requests
import json
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.http_operator import SimpleHttpOperator

def create_dag(dag_id,
               dag_number,
               default_args):
    def hello_world_py(*args):
        print('Hello World')
        print('This is DAG: {}'.format(str(dag_number)))

    dag = DAG(dag_id,
              schedule_interval="@hourly",
              default_args=default_args)
    with dag:
        t1 = PythonOperator(
            task_id='hello_world',
            python_callable=hello_world_py,
            dag_number=dag_number)
    return dag

def fetch_new_dags(**kwargs):
    for n in range(1, 10):
        print("=====================START=========\n")
        dag_id = "abcd_" + str(n)
        print (dag_id)
        print("\n")
        globals()[dag_id] = create_dag(dag_id, n, default_args)
        print(globals())

default_args = {
    'owner': 'diablo_admin',
    'depends_on_past': False,
    'start_date': datetime(2019, 8, 8),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
    'trigger_rule': 'none_skipped'
    #'schedule_interval': '0 * * * *'
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

dag = DAG('testDynDags', default_args=default_args, schedule_interval='*/1 * * * *')
#schedule_interval='*/1 * * * *'
check_for_dags = PythonOperator(dag=dag,
                                task_id='tst_dyn_dag',
                                provide_context=True,
                                python_callable=fetch_new_dags
                                )
check_for_dags
Expected to create 10 DAGs dynamically and added to the scheduler.
I guess doing the following would fix it:
- completely remove the global testDynDags dag and tst_dyn_dag task (instantiation and invocation)
- invoke your fetch_new_dags(..) method with the requisite arguments in global scope
Explanation
Dynamic dags / tasks merely means that, at the time of writing the dag-definition file, you have well-defined logic that can create tasks / dags with a known structure in a pre-defined fashion.
Without dynamic task mapping, you can NOT determine the structure of your DAG at runtime (task execution). So, for instance, you cannot add n identical tasks to your DAG if the upstream task returned an integer value n. But you can iterate over a YAML file containing n segments and generate n tasks / dags.
So clearly, wrapping dag-generation code inside an Airflow task itself makes no sense.
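The point about top-level code can be sketched without Airflow. The example below assumes, as described above, that discovery only looks at objects bound to top-level names once the file has been executed; FakeDag, discover_dags and parse are hypothetical stand-ins for illustration, not Airflow internals.

```python
import types


class FakeDag:
    """Stand-in for airflow.DAG (hypothetical; it only carries an id)."""

    def __init__(self, dag_id):
        self.dag_id = dag_id


def discover_dags(module):
    """Collect FakeDag objects bound to top-level names of a parsed module."""
    return sorted(
        obj.dag_id for obj in vars(module).values() if isinstance(obj, FakeDag)
    )


# Generation loop at module level: it runs while the file is parsed.
top_level_source = """
for n in range(1, 4):
    globals()[f"abcd_{n}"] = FakeDag(f"abcd_{n}")
"""

# Generation loop hidden in a function that only a task would call later.
inside_task_source = """
def fetch_new_dags():
    for n in range(1, 4):
        globals()[f"abcd_{n}"] = FakeDag(f"abcd_{n}")

controller = FakeDag("testDynDags")
"""


def parse(source):
    """Mimic a parse: execute the file's top-level code once."""
    module = types.ModuleType("fake_dag_file")
    module.FakeDag = FakeDag
    exec(source, module.__dict__)
    return module


found_top_level = discover_dags(parse(top_level_source))
found_inside_task = discover_dags(parse(inside_task_source))
print(found_top_level)
print(found_inside_task)
```

In the second file the generated objects would exist only in the process that eventually runs the function, so a parse sees just the controller DAG.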
UPDATE-1
From what is indicated in the comments, I infer that the requirement is to revise the external source that feeds inputs (how many dags or tasks to create) to your DAG / task-generation script. While this is indeed a complex use case, a simple way to achieve it is to create 2 separate DAGs:
- One DAG runs once in a while and generates the inputs, which are stored in an external resource such as an Airflow Variable (or any other external store like a file / S3 / database etc.)
- The second DAG is constructed programmatically by reading the same data source that the first DAG wrote
You can take inspiration from the Adding DAGs based on Variable value section
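A minimal sketch of that two-DAG split, using a JSON file as the external store in place of an Airflow Variable. The file name, the "dag_numbers" key and all function names are assumptions made for illustration only.

```python
import json
import tempfile
from pathlib import Path


def write_inputs(store_path, dag_numbers):
    """What a task in the first DAG would do: persist the generated inputs."""
    Path(store_path).write_text(json.dumps({"dag_numbers": dag_numbers}))


def read_inputs(store_path):
    """What the generator file would do at parse time: read the stored inputs.

    Falls back to an empty list so a missing store does not break parsing.
    """
    path = Path(store_path)
    if not path.exists():
        return []
    return json.loads(path.read_text())["dag_numbers"]


def planned_dag_ids(store_path):
    """Ids of the DAGs the generator file would create at top level."""
    return [f"abcd_{n}" for n in read_inputs(store_path)]


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "dag_inputs.json"
    ids_before = planned_dag_ids(store)  # store not written yet
    write_inputs(store, [1, 2, 3])       # the first DAG has run
    ids_after = planned_dag_ids(store)   # the next parse picks up the inputs

print(ids_before, ids_after)
```

The generator side only reads a local store and tolerates its absence, so parsing stays cheap and cannot fail because an upstream call is down.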

airflow trigger_rule using ONE_FAILED cause dag failure

What I want to achieve is a task that sends a notification if any one of the tasks in the DAG has failed. I am applying a trigger rule to the task:
batch11 = BashOperator(
    task_id='Error_Buzz',
    trigger_rule=TriggerRule.ONE_FAILED,
    bash_command='python /home/admin/pythonwork/home/codes/notifications/dagLevel_Notification.py',
    dag=dag,
    catchup = False
)
batch>>batch11
batch1>>batch11
The problem now is that when no other task has failed, the batch11 task does not execute because of its trigger_rule, which is what I want, but this results in a DAG failure since the default trigger_rule is ALL_SUCCESS. Is there a way to close this loophole so that the DAG run succeeds?
We do something similar in our Airflow deployment. The idea is to notify Slack when a task in a DAG fails. You can set on_failure_callback at the DAG level (via default_args), as documented at https://airflow.apache.org/code.html#airflow.models.BaseOperator
on_failure_callback (callable) – a function to be called when a task
instance of this task fails. a context dictionary is passed as a
single parameter to this function. Context contains references to
related objects to the task instance and is documented under the
macros section of the API.
Here is an example of how I use it. If any of the tasks fails or succeeds, Airflow calls the corresponding notify function and I can get a notification wherever I want.
import sys
import os
from datetime import datetime, timedelta
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from airflow.utils.dates import days_ago
# util.airflow_utils is the answerer's own helper module (not shown here)
from util.airflow_utils import AirflowUtils

schedule = timedelta(minutes=5)

args = {
    'owner': 'user',
    'start_date': days_ago(1),
    'depends_on_past': False,
    'on_failure_callback': AirflowUtils.notify_job_failure,
    'on_success_callback': AirflowUtils.notify_job_success
}

dag = DAG(
    dag_id='demo_dag',
    schedule_interval=schedule, default_args=args)

def task1():
    return 'Whatever you return gets printed in the logs!'

def task2():
    return 'cont'

task1 = PythonOperator(task_id='task1',
                       python_callable=task1,
                       dag=dag)
# Use the task2 function here: by this point the name task1 refers to the operator above
task2 = PythonOperator(task_id='task2',
                       python_callable=task2,
                       dag=dag)
task1 >> task2
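The AirflowUtils helpers are not shown in the answer, so the following is only a guess at what such a callback might look like. It assumes the context dictionary carries a "task_instance" entry with dag_id and task_id attributes and, for failures, an "exception" entry; the message format and function names are made up, and a print stands in for the real Slack or e-mail call.

```python
from types import SimpleNamespace


def build_failure_message(context):
    """Turn the callback context into a short, human-readable message."""
    ti = context["task_instance"]
    error = context.get("exception")
    return f"Task {ti.dag_id}.{ti.task_id} failed: {error}"


def notify_job_failure(context):
    """Hypothetical on_failure_callback: build the message and send it."""
    message = build_failure_message(context)
    print(message)  # stand-in for the real notification call
    return message


# A fake context for trying the callback outside Airflow.
fake_context = {
    "task_instance": SimpleNamespace(dag_id="demo_dag", task_id="task1"),
    "exception": ValueError("boom"),
}
sent = notify_job_failure(fake_context)
```

Keeping the message building separate from the sending makes the callback easy to exercise with a fake context like this before wiring it into default_args.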

Is it possible to have a pipeline in Airflow that does not tie to any schedule?

I need a pipeline that will be executed either manually or programmatically. Is this possible with Airflow? It looks like right now each workflow MUST be tied to a schedule.
Just set the schedule_interval to None when you create the DAG:
dag = DAG('workflow_name',
          template_searchpath='path',
          schedule_interval=None,
          default_args=default_args)
From the Airflow Manual:
Each DAG may or may not have a schedule, which informs how DAG Runs
are created. schedule_interval is defined as a DAG arguments, and
receives preferably a cron expression as a str, or a
datetime.timedelta object.
The manual then goes on to list some cron 'presets' one of which is None.
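Yes, this can be achieved by passing None as the schedule_interval argument of the DAG. Note that schedule_interval is an argument of the DAG itself; a value placed in default_args is passed on to the operators and does not set the DAG's schedule.
Check the documentation on DAG runs.
For example:
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 12, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
# 'workflow_name' is a placeholder DAG id
dag = DAG('workflow_name',
          default_args=default_args,
          schedule_interval=None)  # Check this line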
In Airflow, every DAG is required to have a start date and schedule interval*, for example hourly:
from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    dag_id='my_dag',
    schedule_interval=timedelta(hours=1),
    start_date=datetime(2018, 5, 23),
)
(Without a schedule how would it know when to run?)
As an alternative to a cron schedule, you can set the schedule to @once to run the DAG only once.
*One exception: You can omit the schedule for externally triggered DAGs because Airflow will not schedule them itself.
However, that said, if you omit the schedule, then you need to trigger the DAG externally somehow. If you want to be able to call a DAG programmatically, for instance, as a result of a separate condition occurring in another DAG, you can do that with the TriggerDagRunOperator. You might also hear this idea called externally triggered DAGs.
Here's a usage example from the Airflow Example DAGs:
File 1 - example_trigger_controller_dag.py:
"""This example illustrates the use of the TriggerDagRunOperator. There are 2
entities at work in this scenario:
1. The Controller DAG - the DAG that conditionally executes the trigger
2. The Target DAG - DAG being triggered (in example_trigger_target_dag.py)

This example illustrates the following features :
1. A TriggerDagRunOperator that takes:
  a. A python callable that decides whether or not to trigger the Target DAG
  b. An optional params dict passed to the python callable to help in
     evaluating whether or not to trigger the Target DAG
  c. The id (name) of the Target DAG
  d. The python callable can add contextual info to the DagRun created by
     way of adding a Pickleable payload (e.g. dictionary of primitives). This
     state is then made available to the TargetDag
2. A Target DAG : c.f. example_trigger_target_dag.py
"""
from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from datetime import datetime
import pprint

pp = pprint.PrettyPrinter(indent=4)

def conditionally_trigger(context, dag_run_obj):
    """This function decides whether or not to Trigger the remote DAG"""
    c_p = context['params']['condition_param']
    print("Controller DAG : conditionally_trigger = {}".format(c_p))
    if context['params']['condition_param']:
        dag_run_obj.payload = {'message': context['params']['message']}
        pp.pprint(dag_run_obj.payload)
        return dag_run_obj

# Define the DAG
dag = DAG(dag_id='example_trigger_controller_dag',
          default_args={"owner": "airflow",
                        "start_date": datetime.utcnow()},
          schedule_interval='@once')

# Define the single task in this controller example DAG
trigger = TriggerDagRunOperator(task_id='test_trigger_dagrun',
                                trigger_dag_id="example_trigger_target_dag",
                                python_callable=conditionally_trigger,
                                params={'condition_param': True,
                                        'message': 'Hello World'},
                                dag=dag)
File 2 - example_trigger_target_dag.py:
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from datetime import datetime
import pprint

pp = pprint.PrettyPrinter(indent=4)

# This example illustrates the use of the TriggerDagRunOperator. There are 2
# entities at work in this scenario:
# 1. The Controller DAG - the DAG that conditionally executes the trigger
#    (in example_trigger_controller.py)
# 2. The Target DAG - DAG being triggered
#
# This example illustrates the following features :
# 1. A TriggerDagRunOperator that takes:
#   a. A python callable that decides whether or not to trigger the Target DAG
#   b. An optional params dict passed to the python callable to help in
#      evaluating whether or not to trigger the Target DAG
#   c. The id (name) of the Target DAG
#   d. The python callable can add contextual info to the DagRun created by
#      way of adding a Pickleable payload (e.g. dictionary of primitives). This
#      state is then made available to the TargetDag
# 2. A Target DAG : c.f. example_trigger_target_dag.py

args = {
    'start_date': datetime.utcnow(),
    'owner': 'airflow',
}

dag = DAG(
    dag_id='example_trigger_target_dag',
    default_args=args,
    schedule_interval=None)

def run_this_func(ds, **kwargs):
    print("Remotely received value of {} for key=message".
          format(kwargs['dag_run'].conf['message']))

run_this = PythonOperator(
    task_id='run_this',
    provide_context=True,
    python_callable=run_this_func,
    dag=dag)

# You can also access the DagRun object in templates
bash_task = BashOperator(
    task_id="bash_task",
    bash_command='echo "Here is the message: '
                 '{{ dag_run.conf["message"] if dag_run else "" }}" ',
    dag=dag)
