I want to publish and trigger a DAG object from my own code, which lives outside the location the scheduler controls (i.e. the $AIRFLOW_HOME/dags folder).
My last resort would be to programmatically generate a .py file containing the DAG definition I want to publish and save it to the $AIRFLOW_HOME/dags folder.
I'm sure there must be an easier way than that.
Below is what I've tried.
from datetime import timedelta

from airflow import DAG
from airflow.models import DagPickle
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago
from airflow.utils.db import provide_session


@provide_session
def submit_dag(session=None):
    args = {
        'owner': 'airflow',
        'start_date': days_ago(2)
    }
    dag = DAG(
        dag_id='sample', default_args=args,
        schedule_interval=None, start_date=days_ago(2),
        dagrun_timeout=timedelta(minutes=60))
    task = DummyOperator(task_id='one', dag=dag)
    # DagPickle wraps the DAG itself, not an individual task
    dag_pickle = DagPickle(dag)
    session.add(dag_pickle)
    session.commit()


submit_dag()
The above code does create entries in the dag_pickle table, but how do I publish and later trigger this DAG?
I can do pickle.dump(dag, open(DAGS_FOLDER/pickled_dags, 'wb')) and have a file in the DAGS folder that does pickle.load(DAGS_FOLDER/pickled_dags).
Given the following DAG:
import logging
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator

dag = DAG(
    dag_id="dag_foo",
    start_date=datetime(2022, 2, 28),
    default_args={"owner": "Airflow", "params": {"param_a": "foo"}},
    schedule_interval="@once",
    catchup=False
)


def log_dag_param(param):
    logging.info(param)


with dag:
    DummyOperator(task_id="dummy") >> PythonOperator(
        # task_id is required for every operator
        task_id="log_dag_param",
        python_callable=log_dag_param, op_args=[dag.params["param_a"]]
    )
I'm wondering whether there is any way to overwrite an existing DAG parameter using the CLI. I'm aware of airflow.models.dagrun.DagRun.conf, the --conf parameter, and this approach, but I'm looking for a way to overwrite a DAG parameter rather than a conf value.
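As far as I know, values passed with --conf only replace DAG params of the same key when the core setting dag_run_conf_overrides_params is enabled, and only where the param is read through a template such as "{{ params.param_a }}". An expression like dag.params["param_a"] in the DAG file is evaluated at parse time and never sees the override. Under those assumptions, here is a small sketch that builds the trigger command. build_trigger_command is a hypothetical helper, and `airflow dags trigger` is the Airflow 2 form of the CLI.

```python
import json
import shlex


def build_trigger_command(dag_id, param_overrides):
    """Return the argv list for triggering dag_id with conf overrides.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    return [
        "airflow", "dags", "trigger", dag_id,
        "--conf", json.dumps(param_overrides),
    ]


cmd = build_trigger_command("dag_foo", {"param_a": "bar"})
# shlex.join quotes the JSON payload so it can be pasted into a shell
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run as is, which avoids shell quoting problems with the JSON payload.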
I am trying to trigger multiple external DAGs (Dataflow jobs) from a master DAG.
I plan to use TriggerDagRunOperator and ExternalTaskSensor. I have around 10 Dataflow jobs; some are to be executed in sequence and some in parallel.
For example: I want to execute the Dataflow DAGs A, B, C, etc. from the master DAG, and before execution moves on to the next task I want to ensure the previous DAG run has completed. But I am having issues importing the ExternalTaskSensor module.
Is there any alternative way to achieve this?
Note: each DAG (e.g. A/B/C) has 6-7 tasks. Can ExternalTaskSensor check whether the last task of DAG A has completed before DAG B or C starts?
I used the sample code below to run DAGs that use ExternalTaskSensor, and I was able to import the ExternalTaskSensor module successfully.
import time
from datetime import datetime, timedelta
from pprint import pprint

from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor
from airflow.utils.state import State

sensors_dag = DAG(
    "test_launch_sensors",
    schedule_interval=None,
    start_date=datetime(2020, 2, 14, 0, 0, 0),
    dagrun_timeout=timedelta(minutes=150),
    tags=["DEMO"],
)

dummy_dag = DAG(
    "test_dummy_dag",
    schedule_interval=None,
    start_date=datetime(2020, 2, 14, 0, 0, 0),
    dagrun_timeout=timedelta(minutes=150),
    tags=["DEMO"],
)


def print_context(ds, **context):
    pprint(context['conf'])


with dummy_dag:
    starts = DummyOperator(task_id="starts", dag=dummy_dag)
    empty = PythonOperator(
        task_id="empty",
        provide_context=True,
        python_callable=print_context,
        dag=dummy_dag,
    )
    ends = DummyOperator(task_id="ends", dag=dummy_dag)

    starts >> empty >> ends

with sensors_dag:
    trigger = TriggerDagRunOperator(
        task_id=f"trigger_{dummy_dag.dag_id}",
        trigger_dag_id=dummy_dag.dag_id,
        conf={"key": "value"},
        execution_date="{{ execution_date }}",
    )
    sensor = ExternalTaskSensor(
        task_id="wait_for_dag",
        external_dag_id=dummy_dag.dag_id,
        external_task_id="ends",
        poke_interval=5,
        timeout=120,
    )

    trigger >> sensor
In the sample code above, sensors_dag triggers dummy_dag using TriggerDagRunOperator. The sensor task in sensors_dag then waits until the specified external task ("ends") in dummy_dag has completed.
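One detail worth spelling out: by default ExternalTaskSensor looks for a run of the external DAG with the same execution date as its own run, which is why the trigger task passes execution_date="{{ execution_date }}". When execution_delta is set, it looks at its own execution date minus that delta instead. Here is a plain-Python sketch of that date arithmetic; the helper name is hypothetical and is not part of Airflow.

```python
from datetime import datetime, timedelta


def external_date_to_poke(execution_date, execution_delta=None):
    """Return the execution date the sensor would look for in the other DAG.

    Hypothetical helper that mirrors the documented behaviour of the
    execution_delta argument; it is not an Airflow function.
    """
    if execution_delta is None:
        # Default: the external run must share the sensor's execution date
        return execution_date
    return execution_date - execution_delta


own_run = datetime(2020, 2, 14, 0, 0, 0)
same_date = external_date_to_poke(own_run)
one_hour_earlier = external_date_to_poke(own_run, timedelta(hours=1))
```

If the dates of the two runs do not line up, the sensor keeps poking until its timeout, so this is the first thing to check when a sensor never succeeds.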
I am trying to generate Airflow DAGs from a template in Python code, using globals() as described here
to define the DAG object and save it. Below is my code:
import datetime as dt
import sys

import airflow
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

argumentList = sys.argv
owner = argumentList[1]
dag_name = argumentList[2]
taskID = argumentList[3]
bashCommand = argumentList[4]

default_args = {
    'owner': owner,
    'start_date': dt.datetime(2019, 6, 1),
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=5),
}


def dagCreate():
    with DAG(dag_name,
             default_args=default_args,
             schedule_interval=None,
             ) as dag:
        print_hello = BashOperator(task_id=taskID, bash_command=bashCommand)
    return dag


globals()[dag_name] = dagCreate()
I have kept this Python code outside the DAG folder, and I am executing it as follows:
python bash-dag-generator.py Airflow test_bash_generate auto_bash_task ls
But I don't see any DAG generated in the Airflow webserver UI. I am not sure where I am going wrong.
As per the official documentation:
DAGs are defined in standard Python files that are placed in Airflow’s DAG_FOLDER. Airflow will execute the code in each file to dynamically build the DAG objects. You can have as many DAGs as you want, each describing an arbitrary number of tasks. In general, each one should correspond to a single logical workflow.
So unless your code is actually inside the DAG_FOLDER, it will not be registered as a DAG.
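If the generator script has to stay outside the DAG folder, one option is to have it write a DAG file into that folder. Here is a minimal sketch of that approach. It renders a DAG definition from a template and saves it as a .py file, which the scheduler can pick up on its next scan. The helper names, the template, and the fixed start date are illustrative assumptions.

```python
from pathlib import Path
from string import Template

# Template for the generated DAG file (illustrative; adjust to your setup)
DAG_TEMPLATE = Template('''\
import datetime as dt

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id=$dag_id,
    start_date=dt.datetime(2019, 6, 1),
    schedule_interval=None,
) as dag:
    BashOperator(task_id=$task_id, bash_command=$bash_command)
''')


def render_dag_source(dag_id, task_id, bash_command):
    """Return Python source for a one-task DAG (hypothetical helper)."""
    # repr() turns each value into a safely quoted Python string literal
    return DAG_TEMPLATE.substitute(
        dag_id=repr(dag_id),
        task_id=repr(task_id),
        bash_command=repr(bash_command),
    )


def write_dag_file(dags_folder, dag_id, task_id, bash_command):
    """Write the rendered DAG into dags_folder and return the file path."""
    path = Path(dags_folder) / f"{dag_id}.py"
    path.write_text(render_dag_source(dag_id, task_id, bash_command))
    return path
```

For example, write_dag_file("/path/to/dags", "test_bash_generate", "auto_bash_task", "ls") would produce test_bash_generate.py in the given folder; the path here is a placeholder.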
I have been able to implement dynamic DAGs by using an Airflow Variable.
In the example below I have a CSV file containing a list of Bash commands such as ls, echo, etc. The read_file task stores the file location in an Airflow Variable. The part that reads the CSV file and loops through the commands is where the dynamic tasks get created.
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from airflow.models import Variable
from datetime import datetime, timedelta
import csv

'''
Orchestrate the Dynamic Tasks
'''


def read_file_task(**context):
    # provide_context=True passes the task context as keyword arguments
    print('I am reading a File and setting variables ')
    Variable.set('dynamic-dag-sample', '/home/bashoperator.csv')


with DAG('dynamic-dag-sample',
         start_date=datetime(2018, 11, 1)) as dag:
    read_file_task = PythonOperator(task_id='read_file_task',
                                    python_callable=read_file_task, provide_context=True,
                                    dag=dag)
    # default_var=None avoids a KeyError before the Variable has been set
    dynamic_dag_sample_file_path = Variable.get("dynamic-dag-sample", default_var=None)
    if dynamic_dag_sample_file_path is not None:
        with open(dynamic_dag_sample_file_path) as csv_file:
            reader = csv.DictReader(csv_file)
            for row in reader:
                # One BashOperator per CSV row, downstream of the read task
                bash_task = BashOperator(task_id=row['Taskname'], bash_command=row['Command'])
                read_file_task.set_downstream(bash_task)
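The CSV file itself is not shown. Judging by the column names used in the loop, it presumably has a header row with Taskname and Command. Here is a small sketch of that assumed layout and of how csv.DictReader turns it into rows; the sample contents and the helper name are made up for illustration.

```python
import csv
import io

# Assumed file layout: a header row followed by one task per line
SAMPLE_CSV = """Taskname,Command
list_files,ls
say_hello,echo hello
"""


def load_task_specs(csv_text):
    """Return (task_id, bash_command) pairs parsed from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["Taskname"], row["Command"]) for row in reader]


for task_id, command in load_task_specs(SAMPLE_CSV):
    print(task_id, "->", command)
```

Each Taskname value becomes a task_id, so the values need to be unique within the DAG.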
I want to build a parent DAG with a few child DAGs that get called via the SubDagOperator.
I can only find examples of how to dynamically create subDAGs inside the SubDagOperator task.
However, in this case I want standalone child DAGs that are already defined in a DAG .py file, and to stitch those together in a parent DAG.
If I set the SubDagOperator task with just the name of the child DAG:
task_1 = SubDagOperator(
    task_id="task_1",
    subdag=child_dag_name,
    dag=parent_dag
)
I get the following Error:
NameError: name 'child_dag_name' is not defined
This answer relies as much on knowledge of Python as it does on know-how of Airflow.
Recall that
python: importing a module immediately executes all top-level (indentation zero) statements as part of the import process
airflow: the scheduler / webserver only pick up DAG objects that occur at the top level (indentation zero) of a DAG-definition file
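The two points above can be illustrated without Airflow at all. The sketch below executes a small piece of source the way an import would and then collects the objects of a given class from the module's top-level names. FakeDag is a stand-in class invented for this example, and the scan is only a rough approximation of what Airflow does when it loads a DAG file.

```python
import types


class FakeDag:
    """Stand-in for a DAG class (hypothetical, for illustration only)."""

    def __init__(self, dag_id):
        self.dag_id = dag_id


SOURCE = '''
top_level_dag = FakeDag("visible")

def build():
    # Only runs when called, so this object never becomes a module global
    return FakeDag("hidden")
'''


def collect_dag_ids(source):
    """Execute source as a module and return the ids of top-level FakeDags."""
    module = types.ModuleType("dag_file")
    module.FakeDag = FakeDag
    # Like an import, exec runs every top-level statement immediately
    exec(source, module.__dict__)
    return sorted(
        obj.dag_id for obj in vars(module).values() if isinstance(obj, FakeDag)
    )


found = collect_dag_ids(SOURCE)
```

Only the object bound to a top-level name is found; the one created inside build() stays invisible until something calls the function and binds the result at module level.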
Keeping the above two things in mind, here's what you can do
create a helper / utility function in your child_dag.py file to instantiate and return a DAG object for the child DAG
use that helper function both for instantiating the top-level child DAG and for creating the SubDagOperator task
dag_object_builder.py
from typing import Dict, Any

from airflow.models import DAG


def create_dag_object(dag_id: str, dag_params: Dict[str, Any]) -> DAG:
    dag: DAG = DAG(dag_id=dag_id, **dag_params)
    return dag
child_dag.py
from datetime import datetime
from typing import Dict, Any

from airflow.models import DAG

from src.main.subdag_example import dag_object_builder

default_args: Dict[str, Any] = {
    "owner": "my_owner",
    "email": ["my_username@my_domain.com"],
    "weight_rule": "downstream",
    "retries": 1
}
...


def create_child_dag_object(dag_id: str) -> DAG:
    my_dag: DAG = dag_object_builder.create_dag_object(
        dag_id=dag_id,
        dag_params={
            "start_date": datetime(year=2019, month=7, day=10, hour=21, minute=30),
            "schedule_interval": None,
            "max_active_runs": 1,
            "default_view": "graph",
            "catchup": False,
            "default_args": default_args
        }
    )
    return my_dag


my_child_dag: DAG = create_child_dag_object(dag_id="my_child_dag")
parent_dag.py
from datetime import datetime
from typing import Dict, Any

from airflow.models import DAG
from airflow.operators.subdag_operator import SubDagOperator

from src.main.subdag_example import child_dag
from src.main.subdag_example import dag_object_builder

default_args: Dict[str, Any] = {
    "owner": "my_owner",
    "email": ["my_username@my_domain.com"],
    "weight_rule": "downstream",
    "retries": 1
}

my_parent_dag: DAG = dag_object_builder.create_dag_object(
    dag_id="my_parent_dag",
    dag_params={
        "start_date": datetime(year=2019, month=7, day=10, hour=21, minute=30),
        "schedule_interval": None,
        "max_active_runs": 1,
        "default_view": "graph",
        "catchup": False,
        "default_args": default_args
    }
)
...
my_subdag_task: SubDagOperator = SubDagOperator(
    task_id="my_subdag_task",
    dag=my_parent_dag,
    # SubDagOperator requires the subdag's dag_id to be "<parent_dag_id>.<task_id>"
    subdag=child_dag.create_child_dag_object(dag_id="my_parent_dag.my_subdag_task")
)
If your intention is to link-up DAGs together and you don't have any particular requirement that necessitates using a SubDagOperator, then I would suggest using the TriggerDagRunOperator instead since SubDags have their share of nuisances.
Read more about it here: Wiring top-level DAGs together
What I want to achieve is a task that sends a notification if any one of the tasks in the DAG fails. I am applying a trigger rule to that task:
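from airflow.operators.bash_operator import BashOperator
from airflow.utils.trigger_rule import TriggerRule

batch11 = BashOperator(
    task_id='Error_Buzz',
    # Run only when at least one upstream task has failed
    trigger_rule=TriggerRule.ONE_FAILED,
    bash_command='python /home/admin/pythonwork/home/codes/notifications/dagLevel_Notification.py',
    dag=dag,
    # catchup=False belongs on the DAG(...) call, not on an operator
)
batch >> batch11
batch1 >> batch11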
The problem is that when no other task fails, the batch11 task does not execute because of its trigger_rule, which is what I want. But the DAG run then ends up failed, since the default trigger_rule is ALL_SUCCESS. Is there a way to close this loophole so that the DAG run succeeds?
We do something similar in our Airflow deployment. The idea is to notify Slack when a task in a DAG fails. You can set on_failure_callback at the DAG level (through default_args), as documented at https://airflow.apache.org/code.html#airflow.models.BaseOperator
on_failure_callback (callable) – a function to be called when a task
instance of this task fails. a context dictionary is passed as a
single parameter to this function. Context contains references to
related objects to the task instance and is documented under the
macros section of the API.
Here is an example of how I use it. If any task fails or succeeds, Airflow calls the corresponding notify function and I can send the notification wherever I want.
from datetime import timedelta

from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from airflow.utils.dates import days_ago

# Project-specific helper module that holds the notification callbacks
from util.airflow_utils import AirflowUtils

schedule = timedelta(minutes=5)

args = {
    'owner': 'user',
    'start_date': days_ago(1),
    'depends_on_past': False,
    'on_failure_callback': AirflowUtils.notify_job_failure,
    'on_success_callback': AirflowUtils.notify_job_success
}

dag = DAG(
    dag_id='demo_dag',
    schedule_interval=schedule, default_args=args)


def task1():
    return 'Whatever you return gets printed in the logs!'


def task2():
    return 'cont'


# Operators get their own names so they do not shadow the callables above
t1 = PythonOperator(task_id='task1',
                    python_callable=task1,
                    dag=dag)

t2 = PythonOperator(task_id='task2',
                    python_callable=task2,
                    dag=dag)

t1 >> t2
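The AirflowUtils helper is not shown in this answer. As a rough sketch only, a failure callback could look something like the following, assuming the context dictionary carries the failed task instance under "task_instance" and the raised error under "exception". The function names and the send parameter are made up for illustration; send would be replaced by whatever posts to Slack or email.

```python
from types import SimpleNamespace


def format_failure_message(context):
    """Build a short message from the context passed to the callback."""
    ti = context["task_instance"]
    return f"Task {ti.dag_id}.{ti.task_id} failed: {context.get('exception')}"


def notify_job_failure(context, send=print):
    """Hypothetical callback: format the message and hand it to send()."""
    message = format_failure_message(context)
    send(message)
    return message


# Fake context for a dry run outside Airflow
fake_context = {
    "task_instance": SimpleNamespace(dag_id="demo_dag", task_id="task1"),
    "exception": ValueError("boom"),
}
sent = []
notify_job_failure(fake_context, send=sent.append)
```

Keeping the formatting separate from the sending makes the callback easy to try out without a running scheduler.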