Run code before all Airflow DAGs - airflow

I'm using the relatively new airflow project. I have a bunch of DAGs written and running. Now I want to integrate a bug reporting service, so that if any code in any of the DAGs raises an exception, the information will be sent to a certain API. I can put the API call in the on_failure_callback of each DAG, but I need to execute an initializing line like bug_reporter.init(bug_reporter_token) that just needs to run once.
Is there a place in Airflow for initializing code? Right now I'm initializing the bug tracker at the beginning of every DAG definition file. This seems to be redundant, but I can't find a place to write a file that runs before the DAGs are defined. I've tried reading about plugins, but it doesn't seem to be there.

In your DAG definition file, use your own subclass instead of DAG:
from airflow import DAG
from airflow.utils.decorators import apply_defaults
import bug_reporter
class DAGWithBugReporter(DAG):
    @apply_defaults
    def __init__(
            self,
            bug_reporter_token,
            *args, **kwargs):
        super(DAGWithBugReporter, self).__init__(*args, **kwargs)
        # Initialize the bug reporter whenever a DAG of this class is created.
        bug_reporter.init(bug_reporter_token)
Then in your dag definition:
dag = DAGWithBugReporter(
    dag_id='my_dag',
    schedule_interval=None,
    start_date=datetime(2017, 2, 26),
    bug_reporter_token=my_token_from_somewhere
)
t1 = PythonOperator(
    task_id='t1',
    provide_context=True,
    python_callable=my_callable,
    xcom_push=True,
    dag=dag)
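If the initializer really must run only once per Python process, however many DAG objects are created, a small guard around the call is one option. The following is a minimal sketch, not part of Airflow or of any bug-reporting library: `init_bug_reporter_once` is a hypothetical helper name, and `calls.append` merely stands in for the real `bug_reporter.init`.

```python
import threading

_init_lock = threading.Lock()
_initialized = False


def init_bug_reporter_once(init_fn, token):
    """Call init_fn(token) on the first invocation only; later calls are no-ops.

    Returns True if this invocation performed the initialization, else False.
    """
    global _initialized
    with _init_lock:
        if _initialized:
            return False
        init_fn(token)
        _initialized = True
        return True


# Stand-in for the real bug_reporter.init (hypothetical), to show the behaviour.
calls = []
first = init_bug_reporter_once(calls.append, "my-token")
second = init_bug_reporter_once(calls.append, "my-token")
```

The guard only holds within a single process, so each process that parses or runs the DAG files would still initialize the reporter once for itself.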

Related

airflow deferrable operator not triggering

Trying to implement a simple deferrable operator based on this example, nothing seems to appear after the manual triggering of my DAG (same case with the exact code of example).
class TestDefer(BaseOperator):
    def execute(self, context):
        print("--- execute --")
        self.defer(
            trigger=TimeDeltaTrigger(delta=timedelta(seconds=1)),
            method_name="func",
        )
    def func(self, context, event=None):
        print("--- func ----")
        pass
with DAG(
    "def_dag", schedule_interval=None, start_date=datetime.now(),
) as dag:
    t = TestDefer(task_id="defer_task")
and then :
airflow dags test def_dag now
airflow triggerer
Result : func is never called.
Thanks in advance for your help.
Your deferrable operator code is correct. I tested it with the DAG below in Airflow 2.5.1 (only changed the print statements to logs and the start_date because datetime.now() can lead to issues when scheduling, but it should work manually as you had it).
Is the issue the same when you run the DAG manually from the UI? Using airflow dags test... I get an output without "--- func ----" but manually running the DAG from the UI the line prints and the DAG works as expected. (might be loosely related to this issue).
If manually running from the UI does not work: what is the output of docker ps?
from airflow import DAG
from datetime import timedelta, datetime
from airflow.triggers.temporal import TimeDeltaTrigger
from airflow.models.baseoperator import BaseOperator
import logging
# get Airflow logger
log = logging.getLogger('airflow.task')
class TestDefer(BaseOperator):
    def execute(self, context):
        log.info("--- execute --")
        self.defer(
            trigger=TimeDeltaTrigger(delta=timedelta(seconds=1)),
            method_name="func",
        )
    def func(self, context, event=None):
        log.info("--- func ----")
        pass
with DAG(
    "def_dag",
    schedule_interval=None,
    start_date=datetime(2023, 1, 1),
    catchup=False
) as dag:
    t = TestDefer(task_id="defer_task")
After a few tests with Airflow 2.5.1, and with your advice, my deferrable operator works when I follow these steps:
launch airflow scheduler
launch airflow triggerer
run airflow dags test... or trigger the DAG from the UI
Thanks for the help.

DAGs triggered by TriggerDagRunOperator are stuck in queue

example_trigger_controller_dag.py
import pendulum
from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
with DAG(
    dag_id="example_trigger_controller_dag",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    schedule="@once",
    tags=["example"],
) as dag:
    trigger = TriggerDagRunOperator(
        task_id="test_trigger_dagrun",
        trigger_dag_id="example_trigger_target_dag",  # Ensure this equals the dag_id of the DAG to trigger
        conf={"message": "Hello World"},
    )
example_trigger_target_dag.py
import pendulum
from airflow import DAG
from airflow.decorators import task
from airflow.operators.bash import BashOperator
@task(task_id="run_this")
def run_this_func(dag_run=None):
    """
    Print the payload "message" passed to the DagRun conf attribute.
    :param dag_run: The DagRun object
    """
    print(f"Remotely received value of {dag_run.conf.get('message')} for key=message")
with DAG(
    dag_id="example_trigger_target_dag",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    schedule=None,
    tags=["example"],
) as dag:
    run_this = run_this_func()
    bash_task = BashOperator(
        task_id="bash_task",
        bash_command='echo "Here is the message: $message"',
        env={"message": '{{ dag_run.conf.get("message") }}'},
    )
The task in the controller DAG ends successfully, but the task in the target DAG is stuck in the queue. Any ideas about how to solve this problem?
I ran your DAGs (with both of them unpaused) and they work fine in a completely new environment (Airflow 2.5.0, Astro CLI Runtime 7.1.0). So the issue is most likely not with your DAG code.
Tasks stuck in the queue are often a scheduler issue, mostly with older Airflow versions. I suggest you:
make sure both DAGs are unpaused when the first DAG runs.
make sure all start_dates are in the past (though in this case usually the tasks don't even get queued)
restart your scheduler/Airflow environment
try running the DAGs while no other DAGs are running to check if the issue could be that the parallelism limit is reached. (if you are using K8s executor you should also check worker_pods_creation_batch_size and with the Celery Executor worker_concurrency and stalled_task_timeout)
take a look at your scheduler logs (at $AIRFLOW_HOME/logs/scheduler)
upgrade Airflow if you are running an older version.

How can I use a xcom value to configure max_active_tis_per_dag for the TriggerDagRunOperator in Airflow 2.3.x?

Dear Apache Airflow experts,
I am currently trying to make the parallel execution of Apache Airflow 2.3.x DAGs configurable via the DAG run config.
When executing below code the DAG creates two tasks - for the sake of my question it does not matter what the other DAG does.
Because max_active_tis_per_dag is set to 1, the two tasks will be run one after another.
What I want to achieve: I want to provide the result of get_num_max_parallel_runs (which checks the DAG config, if no value is present it falls back to 1 as default) to max_active_tis_per_dag.
I would appreciate any input on this!
Thank you in advance!
from airflow import DAG
from airflow.decorators import task
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from datetime import datetime
with DAG(
    'aaa_test_controller',
    schedule_interval=None,
    start_date=datetime(2021, 1, 1),
    catchup=False
) as dag:
    @task
    def get_num_max_parallel_runs(dag_run=None):
        return dag_run.conf.get("num_max_parallel_runs", 1)
    trigger_dag = TriggerDagRunOperator.partial(
        task_id="trigger_dependent_dag",
        trigger_dag_id="aaa_some_other_dag",
        wait_for_completion=True,
        max_active_tis_per_dag=1,
        poke_interval=5
    ).expand(conf=['{"some_key": "some_value_1"}', '{"some_key": "some_value_2"}'])

read cli input without calling python operator

We want to read, inside the DAG, the CLI input that is passed to the DAG from the UI when the DAG is triggered.
I tried the code below, but it is not working. Here I am passing the input {"kpi":"ID123"}
and I want to print this input value in my function get_data_from_bq.
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.operators.python_operator import PythonOperator
from airflow import models
from airflow.models import Variable
from google.cloud import bigquery
from airflow.configuration import conf
LOCATION = Variable.get("HDM_PROJECT_LOCATION")
PROJECT_ID = Variable.get("HDM_PROJECT_ID")
client = bigquery.Client()
kpi='{{ kpi}}'
# default arguments
default_dag_args = {
    'start_date':days_ago(0),
    'retries': 0,
    'project_id': PROJECT_ID
}
# Setting Airflow environment variable, getting hdm_batch_details data and updating it
def get_data_from_bq(**kwargs):
    print("op is:")
    print(kpi)
# DAG definition
with models.DAG(
        '00_test_sql1',
        schedule_interval=None,
        default_args=default_dag_args) as dag:
    v_run_sql_01 = PythonOperator(
        task_id='Run_SQL',
        provide_context=True,
        python_callable=get_data_from_bq,
        location=LOCATION,
        use_legacy_sql=False)
    v_run_sql_01
Note: I don't want to use any operator to read data passed from cli
This is impossible. A DAG run is only created when there are tasks to run.
You should understand that:
DAG + its top-level code: builds the DAG structure, which consists of tasks.
DAG run: a single instance of a DAG execution, which contains the task instances to be executed. A DAG run simply consists of the task instances that belong to it.
The configuration that you pass is "dag_run.conf", not "dag.conf", which means that it is specified only for the DagRun and is available only to the task instances that belong to it.
Only task instances have access to dag_run.conf.
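To illustrate the point, here is a minimal sketch of a task callable that reads the value from dag_run.conf at run time. It assumes the run was triggered with a payload such as {"kpi": "ID123"}. The `fake_context` object is only a stand-in for the context that Airflow passes to a task, so that the function can be tried without an Airflow installation. In a real DAG the function would be given to a PythonOperator as python_callable.

```python
from types import SimpleNamespace


def get_data_from_bq(**context):
    """Task callable: read the 'kpi' value from the triggering DAG run's conf."""
    dag_run = context["dag_run"]
    # Guard against a run that was triggered without any payload.
    conf = dag_run.conf or {}
    kpi = conf.get("kpi")
    print(f"kpi is: {kpi}")
    return kpi


# Stand-in for the context Airflow supplies at run time (assumption for local testing).
fake_context = {"dag_run": SimpleNamespace(conf={"kpi": "ID123"})}
result = get_data_from_bq(**fake_context)
```

Because the value is looked up inside the callable, it is resolved when the task instance runs, not when the DAG file is parsed.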

Airflow unpause dag programmatically?

I have a dag that we'll deploy to multiple different airflow instances and in our airflow.cfg we have dags_are_paused_at_creation = True but for this specific dag we want it to be turned on without having to do so manually by clicking on the UI. Is there a way to do it programmatically?
I created the following function to do so if anyone else runs into this issue:
import airflow.settings
from airflow.models import DagModel
def unpause_dag(dag):
    """
    A way to programmatically unpause a DAG.
    :param dag: DAG object
    :return: dag.is_paused is now False
    """
    session = airflow.settings.Session()
    try:
        qry = session.query(DagModel).filter(DagModel.dag_id == dag.dag_id)
        d = qry.first()
        d.is_paused = False
        session.commit()
    except Exception:
        # Undo the pending change if the update fails.
        session.rollback()
    finally:
        session.close()
The airflow-rest-api-plugin can also be used to pause DAGs programmatically.
Pauses a DAG
Available in Airflow Version: 1.7.0 or greater
GET - http://{HOST}:{PORT}/admin/rest_api/api?api=pause
Query Arguments:
dag_id - string - The id of the dag
subdir (optional) - string - File location or directory from which to
look for the dag
Examples:
http://{HOST}:{PORT}/admin/rest_api/api?api=pause&dag_id=test_id
See for more details:
https://github.com/teamclairvoyant/airflow-rest-api-plugin
Supply your dag_id and run this command on your command line:
airflow pause dag_id
For more information on the airflow command line interface: https://airflow.incubator.apache.org/cli.html
I think you are looking for unpause (not pause):
airflow unpause DAG_ID
The following cli command should work per the recent docs.
airflow dags unpause dag_id
https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#unpause
Airflow's REST API provides a way to do this through the DAG patch endpoint: update the DAG with the query parameter ?update_mask=is_paused and send the new is_paused value as a boolean in the request body.
Ref: https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html#operation/patch_dag
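As a rough sketch of what such a request could look like, the snippet below builds (but does not send) the PATCH call using only the standard library. It assumes the Airflow 2.x stable REST API served under /api/v1. The host, port and dag_id are placeholders, and `build_unpause_request` is a hypothetical helper name. Authentication is left out because it depends on the auth backend configured for the webserver.

```python
import json
import urllib.request
from urllib.parse import quote


def build_unpause_request(base_url, dag_id):
    """Build (but do not send) a PATCH request that sets is_paused to false."""
    url = (
        f"{base_url.rstrip('/')}/api/v1/dags/"
        f"{quote(dag_id, safe='')}?update_mask=is_paused"
    )
    body = json.dumps({"is_paused": False}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


# Placeholder host and dag_id; adjust to your own deployment.
req = build_unpause_request("http://localhost:8080", "my_dag")
# Sending requires credentials that match your auth backend, for example:
# urllib.request.urlopen(req)
```

Sending {"is_paused": true} in the same way would pause the DAG instead.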
airflow pause dag_id.
has been discontinued.
You will have to use:
airflow dags pause dag_id
You can do this from a PythonOperator in any DAG to pause and unpause DAGs programmatically. This is the best approach I found instead of using the CLI: just pass the list of DAGs and the rest is taken care of.
from airflow.models import DagModel
dag_id = "dag_name"
dag = DagModel.get_dagmodel(dag_id)
dag.set_is_paused(is_paused=False)
And if you want to check whether it is paused, this attribute holds a boolean:
dag.is_paused
