Can this warning be avoided in apache airflow 2.0? - airflow

I am using Airflow v2.0 on Windows 10 WSL (Ubuntu 20.04).
The warning message is:
/home/jainri/.local/lib/python3.8/site-packages/airflow/models/dag.py:1342: PendingDeprecationWarning: The requested task could not be added to the DAG because a task with task_id create_tag_template_field_result is already in the DAG. Starting in Airflow 2.0, trying to overwrite a task will raise an exception.
warnings.warn(
Done.
Due to this warning, the DAGs shown in the web UI also include some example DAGs bundled with Apache Airflow. I have set up **AIRFLOW_HOME**, and Airflow also picks up DAGs from there, but the list of example DAGs is displayed as well. I have also posted an image of the web UI.
This is the DAG that I am trying to run:
import datetime
import logging
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
#
# TODO: Define a function for the python operator to call
#
def greet():
    logging.info("Hello Rishabh!!")
dag = DAG(
    'lesson1.demo1',
    start_date=datetime.datetime.now()
)
#
# TODO: Define the task below using PythonOperator
#
greet_task = PythonOperator(
    task_id='greet_task',
    python_callable=greet,
    dag=dag
)
Also, the main issue is that the list of DAGs shown in the web UI includes the example DAGs. They show up as a huge list alongside my own DAGs, which makes it cumbersome to find mine.

I found the issue: the warning you are seeing comes from airflow/example_dags/example_complex.py, one of the example DAGs shipped with Airflow.
Disable loading of the example DAGs by setting AIRFLOW__CORE__LOAD_EXAMPLES=False as an environment variable, or by setting load_examples = False in the [core] section of airflow.cfg.
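As a rough illustration of how the two settings relate, the sketch below mimics the lookup order using only the standard library: an environment variable named AIRFLOW__{SECTION}__{KEY} takes precedence over the value in airflow.cfg. The helper name effective_load_examples and the inline config text are hypothetical, and this is not Airflow's own code.

```python
import configparser
import os

# Hypothetical stand-in for the relevant part of airflow.cfg.
CFG_TEXT = """
[core]
load_examples = True
"""


def effective_load_examples(cfg_text, environ):
    """Return the load_examples value, letting the environment variable win."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # Environment variables follow the AIRFLOW__{SECTION}__{KEY} pattern.
    env_value = environ.get("AIRFLOW__CORE__LOAD_EXAMPLES")
    if env_value is not None:
        # Reuse configparser's boolean parsing for the override.
        parser.set("core", "load_examples", env_value)
    return parser.getboolean("core", "load_examples")


if __name__ == "__main__":
    print(effective_load_examples(CFG_TEXT, os.environ))
    print(effective_load_examples(CFG_TEXT, {"AIRFLOW__CORE__LOAD_EXAMPLES": "False"}))
```

After changing either setting, the webserver and scheduler need to be restarted before the example DAGs stop being loaded.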

Related

Not able to find my DAG in airflow WEB UI even though the dag is in correct folder

I have been trying to resolve this for the past two days. I created a DAG Python script and saved it in the dags folder that "airflow.cfg" refers to. The other DAGs are getting updated, except for this one. I tried restarting the scheduler, and I also reset the Airflow database with airflow db reset and then ran airflow db init again, but the issue persists.
Some ideas on what you could check:
Do all of your DAGs have a unique dag_id? (I lost a few hours to this once: if two DAGs have the same dag_id, the scheduler will randomly pick one to display at every dag_dir_list_interval.)
If you are using the @dag decorator: are you calling the decorated function below its definition? Like so:
from airflow.decorators import dag, task
from pendulum import datetime
@dag(
    dag_id="unique_name",
    start_date=datetime(2022, 12, 10),
    schedule=None,
    catchup=False
)
def my_dag():
    @task
    def say_hi():
        return "hi"
    say_hi()
# without this line the DAG will not show up in the UI
my_dag()
What is the output of airflow dags list and airflow dags list-import-errors?
If you have a lot of DAGs in your environment, you might want to increase the dagbag_import_timeout.
Does your DAG work if thrown into a new Airflow instance? (The easiest way to check is by spinning up a project with the Astro CLI and putting the DAG into the dags folder created by astro dev init.)
Disclaimer: I work at Astronomer, who develops the Astro CLI as an OS project.
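To make the first check in the list above concrete, here is a minimal sketch that scans DAG source files for duplicate dag_id values without importing Airflow. It only sees ids written as string literals in DAG(...) calls or @dag(...) decorators, so ids built dynamically or defaulted from a function name are missed. The helper names find_dag_ids and duplicate_dag_ids, and the ./dags path in the usage part, are assumptions for illustration.

```python
import ast
from collections import defaultdict
from pathlib import Path


def find_dag_ids(source):
    """Collect literal dag_id values from DAG(...) calls and @dag(...) decorators."""
    ids = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name not in ("DAG", "dag"):
            continue
        # dag_id may be passed as a keyword or as the first positional argument.
        candidates = [kw.value for kw in node.keywords if kw.arg == "dag_id"]
        if not candidates and node.args:
            candidates = [node.args[0]]
        for value in candidates:
            if isinstance(value, ast.Constant) and isinstance(value.value, str):
                ids.append(value.value)
    return ids


def duplicate_dag_ids(sources):
    """Given {file name: source text}, return the dag_ids defined more than once."""
    seen = defaultdict(list)
    for file_name, source in sources.items():
        for dag_id in find_dag_ids(source):
            seen[dag_id].append(file_name)
    return {dag_id: files for dag_id, files in seen.items() if len(files) > 1}


if __name__ == "__main__":
    # Assumes the DAG files live in ./dags; adjust to your own dags folder.
    folder = Path("dags")
    sources = {p.name: p.read_text() for p in sorted(folder.rglob("*.py"))}
    for dag_id, files in duplicate_dag_ids(sources).items():
        print(f"{dag_id} is defined in: {', '.join(files)}")
```

Working on source text rather than imported modules keeps the check independent of whether the DAG files themselves import cleanly.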

How can I use a xcom value to configure max_active_tis_per_dag for the TriggerDagRunOperator in Airflow 2.3.x?

Dear Apache Airflow experts,
I am currently trying to make the parallel execution of Apache Airflow 2.3.x DAGs configurable via the DAG run config.
When executing below code the DAG creates two tasks - for the sake of my question it does not matter what the other DAG does.
Because max_active_tis_per_dag is set to 1, the two tasks will be run one after another.
What I want to achieve: I want to provide the result of get_num_max_parallel_runs (which checks the DAG config, if no value is present it falls back to 1 as default) to max_active_tis_per_dag.
I would appreciate any input on this!
Thank you in advance!
from airflow import DAG
from airflow.decorators import task
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from datetime import datetime
with DAG(
    'aaa_test_controller',
    schedule_interval=None,
    start_date=datetime(2021, 1, 1),
    catchup=False
) as dag:
    @task
    def get_num_max_parallel_runs(dag_run=None):
        return dag_run.conf.get("num_max_parallel_runs", 1)
    trigger_dag = TriggerDagRunOperator.partial(
        task_id="trigger_dependent_dag",
        trigger_dag_id="aaa_some_other_dag",
        wait_for_completion=True,
        max_active_tis_per_dag=1,
        poke_interval=5
    ).expand(conf=['{"some_key": "some_value_1"}', '{"some_key": "some_value_2"}'])

read cli input without calling python operator

We want to read the CLI input that is passed to the DAG from the UI when the DAG is triggered.
I tried the code below, but it is not working. Here I am passing the input as {"kpi":"ID123"}
and I want to print this input value in my function get_data_from_bq.
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.operators.python_operator import PythonOperator
from airflow import models
from airflow.models import Variable
from google.cloud import bigquery
from airflow.configuration import conf
LOCATION = Variable.get("HDM_PROJECT_LOCATION")
PROJECT_ID = Variable.get("HDM_PROJECT_ID")
client = bigquery.Client()
kpi='{{ kpi}}'
# default arguments
default_dag_args = {
    'start_date': days_ago(0),
    'retries': 0,
    'project_id': PROJECT_ID
}
# Setting airflow environment variable, getting hdm_batch_details data and updating it
def get_data_from_bq(**kwargs):
    print("op is:")
    print(kpi)
# DAG definition
with models.DAG(
        '00_test_sql1',
        schedule_interval=None,
        default_args=default_dag_args) as dag:
    v_run_sql_01 = PythonOperator(
        task_id='Run_SQL',
        provide_context=True,
        python_callable=get_data_from_bq,
        location=LOCATION,
        use_legacy_sql=False)
    v_run_sql_01
Note: I don't want to use any operator to read data passed from cli
This is not possible. A DAG run is only created when there are tasks to run.
You should understand that:
The DAG and its top-level code build the DAG structure, which consists of tasks.
A DAG run is a single instance of a DAG being run, and it contains the task instances to be executed. In other words, a DAG run simply consists of the task instances that belong to it.
The configuration that you pass is "dag_run.conf", not "dag.conf", which means that it is specified only for that DAG run and is valid only for the task instances that belong to it.
Only task instances have access to dag_run.conf.
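Since only task instances can see dag_run.conf, the value has to be read inside the callable at run time rather than at module level. A minimal sketch of that pattern follows, assuming Airflow 2, where PythonOperator passes context entries such as dag_run to a callable that accepts keyword arguments. The SimpleNamespace object is only a stand-in so the sketch runs without Airflow installed.

```python
from types import SimpleNamespace


def get_data_from_bq(**context):
    """Read the "kpi" value from the configuration passed when the DAG was triggered."""
    dag_run = context["dag_run"]
    # conf can be empty when the run was not triggered with a payload.
    conf = dag_run.conf or {}
    kpi = conf.get("kpi")
    print(f"kpi is: {kpi}")
    return kpi


if __name__ == "__main__":
    # Stand-in for the DagRun object that Airflow puts into the task context.
    fake_dag_run = SimpleNamespace(conf={"kpi": "ID123"})
    get_data_from_bq(dag_run=fake_dag_run)
```

In a real DAG the function would be passed unchanged as python_callable to a PythonOperator, and the run would be triggered with {"kpi": "ID123"} as its configuration.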

DagBag does not populate dags as expected

I want to test my DAGs to make sure they have certain default arguments, and also to make sure that none of them has import errors.
I am using DagBag to load the DAGs, and then I iterate through them and check that each DAG's values are what I want them to be.
Because DagBag can also fetch the example DAGs that are shipped with Airflow, I am passing the argument include_examples=False. However, when I do this, none of my DAGs are pulled into the DagBag.
Am I using DagBag wrongly, or is there a better way to load and inspect DAGs when testing?
My code
from airflow.models import DagBag
def test_no_import_errors():
    dag_bag = DagBag(include_examples=False)
    assert len(dag_bag.import_errors) == 0, "No Import Failures"
I was able to reproduce the problem: when creating the DagBag object, if you don't provide a value for the dag_folder parameter, no DAG is added to the collection.
So, as Jarek stated, this works:
def test_no_import_errors():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert len(dag_bag.import_errors) == 0, "No Import Failures"
This is the example I made to test it:
# python -m unittest test_dag_validation.py
import unittest
import logging
from airflow.models import DagBag
class TestDAGValidation(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        log = logging.getLogger()
        handler = logging.FileHandler("dag_validation.log", mode="w")
        handler.setLevel(logging.INFO)
        log.addHandler(handler)
        cls.log = log
    def test_no_import_errors(self):
        dag_bag = DagBag(dag_folder="dags/", include_examples=False)
        self.log.info(f"How Many DAGs?: {dag_bag.size()}")
        self.log.info(f"Import errors: {len(dag_bag.import_errors)}")
        assert len(dag_bag.import_errors) == 0, "No Import Failures"
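The question also asks about checking default arguments. One possible sketch of that check is below. It assumes the mapping passed in looks like dag_bag.dags (dag_id to DAG object, each with a default_args dict). The helper name dags_missing_default_args and the required keys are hypothetical, and SimpleNamespace objects stand in for DAGs so the sketch runs without Airflow.

```python
from types import SimpleNamespace


def dags_missing_default_args(dags, required_keys):
    """Return {dag_id: [missing keys]} for DAGs lacking any required default argument."""
    problems = {}
    for dag_id, dag in dags.items():
        default_args = dag.default_args or {}
        missing = [key for key in required_keys if key not in default_args]
        if missing:
            problems[dag_id] = missing
    return problems


if __name__ == "__main__":
    # Stand-ins for DAG objects so the sketch runs without Airflow installed.
    fake_dags = {
        "good_dag": SimpleNamespace(default_args={"owner": "data", "retries": 1}),
        "bad_dag": SimpleNamespace(default_args={"owner": "data"}),
    }
    print(dags_missing_default_args(fake_dags, ["owner", "retries"]))
```

Inside a test, the call would be made with dag_bag.dags and the result asserted to be empty.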
When you construct a DagBag object you can pass the folder where DagBag should look for the DAG files. I guess this is the problem.
By default, DagBag looks for DAGs in the AIRFLOW_HOME/dags folder.
This location is usually stored in the airflow.cfg file.
AIRFLOW_HOME defaults to the ~/airflow folder, but you can point it to another folder, such as your current working directory, by running:
export AIRFLOW_HOME=abs_path_of_your_folder
If you install Airflow into a Python virtual environment, make sure to export the AIRFLOW_HOME variable first, then activate the virtual environment, and finally install Airflow. This makes sure your path is written into the airflow.cfg file.
You can also check whether your folder was loaded properly while running the unittest. In the terminal, the file path is printed like this:
[2022-02-03 20:45:57,657] {dagbag.py:500} INFO - Filling up the DagBag from /Users/kehsihba19/Desktop/airflow-test/dags
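To see which folder that default works out to on a given machine, a small standard-library sketch can mirror the rule described above (AIRFLOW_HOME, falling back to ~/airflow, plus a dags subfolder). It deliberately ignores any dags_folder override in airflow.cfg, and the helper name default_dags_folder is an assumption.

```python
import os
from pathlib import Path


def default_dags_folder(environ=None):
    """Return the dags folder implied by AIRFLOW_HOME, falling back to ~/airflow."""
    environ = os.environ if environ is None else environ
    airflow_home = environ.get("AIRFLOW_HOME", "~/airflow")
    return Path(airflow_home).expanduser() / "dags"


if __name__ == "__main__":
    folder = default_dags_folder()
    print(f"DagBag would be filled from: {folder}")
    # In a test, the explicit path can then be passed on, for example:
    #   DagBag(dag_folder=str(folder), include_examples=False)
```

Passing the resolved path explicitly as dag_folder makes the test independent of the directory it is started from.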
An example file for checking import errors in DAGs, which also catches typos and cyclic task dependencies:
from airflow.models import DagBag
import unittest
class TestDags(unittest.TestCase):
    def test_DagBag(self):
        self.dag_bag = DagBag(include_examples=False)
        self.assertFalse(bool(self.dag_bag.import_errors))
if __name__ == "__main__":
    unittest.main()

Airflow unpause dag programmatically?

I have a DAG that we'll deploy to multiple Airflow instances. In our airflow.cfg we have dags_are_paused_at_creation = True, but we want this specific DAG to be turned on without having to do so manually by clicking in the UI. Is there a way to do it programmatically?
I created the following function to do so if anyone else runs into this issue:
import airflow.settings
from airflow.models import DagModel
def unpause_dag(dag):
    """
    A way to programmatically unpause a DAG.
    :param dag: DAG object
    :return: dag.is_paused is now False
    """
    session = airflow.settings.Session()
    try:
        qry = session.query(DagModel).filter(DagModel.dag_id == dag.dag_id)
        d = qry.first()
        d.is_paused = False
        session.commit()
    except Exception:
        # Undo the partial change if the lookup or commit fails
        session.rollback()
    finally:
        session.close()
The airflow-rest-api-plugin can also be used to pause DAGs programmatically.
Pauses a DAG
Available in Airflow Version: 1.7.0 or greater
GET - http://{HOST}:{PORT}/admin/rest_api/api?api=pause
Query Arguments:
dag_id - string - The id of the dag
subdir (optional) - string - File location or directory from which to look for the dag
Examples:
http://{HOST}:{PORT}/admin/rest_api/api?api=pause&dag_id=test_id
See for more details:
https://github.com/teamclairvoyant/airflow-rest-api-plugin
Supply your dag_id and run this command on your command line:
airflow pause dag_id
For more information on the Airflow command line interface: https://airflow.incubator.apache.org/cli.html
I think you are looking for unpause ( not pause)
airflow unpause DAG_ID
The following cli command should work per the recent docs.
airflow dags unpause dag_id
https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#unpause
Airflow's REST API provides a way through the DAG PATCH endpoint: update the DAG with the query parameter ?update_mask=is_paused and send the boolean in the request body.
Ref: https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html#operation/patch_dag
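A minimal sketch of such a request, built with the standard library and not sent, is shown below. It assumes the Airflow 2 stable REST API is served under /api/v1. The host, port, and dag_id are placeholders, and authentication is left out because it depends on how the API is configured.

```python
import json
import urllib.request


def build_unpause_request(base_url, dag_id):
    """Build (but do not send) a PATCH request that sets is_paused to false."""
    url = f"{base_url}/api/v1/dags/{dag_id}?update_mask=is_paused"
    body = json.dumps({"is_paused": False}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical host, port, and dag_id; authentication is left out on purpose.
    request = build_unpause_request("http://localhost:8080", "my_dag")
    print(request.get_method(), request.full_url)
    # Sending it would look like: urllib.request.urlopen(request)
```

Sending {"is_paused": true} with the same update_mask would pause the DAG again.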
The command airflow pause dag_id has been discontinued.
You will have to use:
airflow dags pause dag_id
You can do this from a PythonOperator in any DAG to pause and unpause DAGs programmatically. This is the best approach I have found instead of using the CLI: just pass the list of DAGs and the rest is taken care of.
from airflow.models import DagModel
dag_id = "dag_name"
dag = DagModel.get_dagmodel(dag_id)
dag.set_is_paused(is_paused=False)
If you just want to check whether it is paused, this attribute holds a boolean:
dag.is_paused
