DagBag does not populate dags as expected - airflow

I want to test my DAGs to make sure they have certain default arguments and that none of them has import errors.
I am using DagBag to load the DAGs, then iterating through them and checking that each DAG's values are what I want them to be.
Because DagBag also fetches the example DAGs that ship with Airflow, I am passing include_examples=False. However, when I do this, none of my DAGs is loaded into the DagBag.
Am I using DagBag incorrectly, or is there a better way to load and inspect DAGs when testing?
My code:
def test_no_import_errors():
    dag_bag = DagBag(include_examples=False)
    assert len(dag_bag.import_errors) == 0, "No Import Failures"

I was able to reproduce the problem. When creating the DagBag object, if you don't provide a value for the dag_folder parameter, no DAG is added to the collection.
So, as Jarek stated, this works:
def test_no_import_errors():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert len(dag_bag.import_errors) == 0, "No Import Failures"
This is the example I made to test it:
# python -m unittest test_dag_validation.py
import unittest
import logging
from airflow.models import DagBag
class TestDAGValidation(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        log = logging.getLogger()
        handler = logging.FileHandler("dag_validation.log", mode="w")
        handler.setLevel(logging.INFO)
        log.addHandler(handler)
        cls.log = log
    def test_no_import_errors(self):
        dag_bag = DagBag(dag_folder="dags/", include_examples=False)
        self.log.info(f"How Many DAGs?: {dag_bag.size()}")
        self.log.info(f"Import errors: {len(dag_bag.import_errors)}")
        assert len(dag_bag.import_errors) == 0, "No Import Failures"
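One thing to watch with dag_folder="dags/" is that a relative path is resolved against the directory the tests are started from, so the same test can find the DAGs from one directory and miss them from another. A minimal sketch of one way around that, assuming the dags folder sits next to the test file (resolve_dag_folder is a hypothetical helper name, not part of Airflow):

```python
from pathlib import Path


def resolve_dag_folder(test_file, folder_name="dags"):
    """Return the absolute path of `folder_name` located next to `test_file`.

    Hypothetical helper: it assumes the DAG folder lives in the same
    directory as the test module, which may not match your project layout.
    """
    return str(Path(test_file).resolve().parent / folder_name)


# Inside a test module this would typically be called as:
#     DagBag(dag_folder=resolve_dag_folder(__file__), include_examples=False)
example = resolve_dag_folder("/tmp/project/test_dag_validation.py")
```

With this, the path handed to DagBag no longer depends on where the test runner was launched.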

When you construct a DagBag object, you can pass the folder where DagBag should look for the DAG files (the dag_folder argument). I guess this is the problem.

By default, the Airflow DagBag looks for DAGs in the AIRFLOW_HOME/dags folder.
This location is usually stored in the airflow.cfg file.
AIRFLOW_HOME defaults to the ~/airflow folder, but you can point it at another directory, such as your current working directory, by running:
export AIRFLOW_HOME=abs_path_of_your_folder
If you are installing Airflow with Python, make sure to export the AIRFLOW_HOME variable first, then activate the virtual environment, and finally install Airflow. This makes sure your path is properly written to the airflow.cfg file.
You can also check whether your folder was loaded properly while running the unit test. The file path is printed in the terminal like this:
[2022-02-03 20:45:57,657] {dagbag.py:500} INFO - Filling up the DagBag from /Users/kehsihba19/Desktop/airflow-test/dags
An example file for checking import errors in DAGs, which covers typos and cyclic-task checks:
from airflow.models import DagBag
import unittest
class TestDags(unittest.TestCase):
    def test_DagBag(self):
        self.dag_bag = DagBag(include_examples=False)
        self.assertFalse(bool(self.dag_bag.import_errors))
if __name__ == "__main__":
    unittest.main()
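The question also asks about checking that every DAG carries certain default arguments, which the examples above do not cover. A minimal sketch of that check, written against any mapping of dag_id to an object with a default_args attribute (in a real test this would be dag_bag.dags). The helper name find_missing_default_args and the required keys are assumptions for illustration:

```python
from types import SimpleNamespace


def find_missing_default_args(dags, required_keys):
    """Map each dag_id to the required default_args keys it lacks."""
    problems = {}
    for dag_id, dag in dags.items():
        # Treat a missing or empty default_args as an empty dict
        default_args = getattr(dag, "default_args", None) or {}
        missing = sorted(set(required_keys) - set(default_args))
        if missing:
            problems[dag_id] = missing
    return problems


# Stand-ins for DAG objects so the sketch runs without Airflow installed;
# in a test, pass DagBag(...).dags instead.
fake_dags = {
    "good_dag": SimpleNamespace(default_args={"owner": "data", "retries": 1}),
    "bad_dag": SimpleNamespace(default_args={"owner": "data"}),
}
problems = find_missing_default_args(fake_dags, ["owner", "retries"])
```

A test can then assert that the returned dict is empty, and the failure message shows which DAG is missing which key.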

Related

read cli input without calling python operator

We want to read the input passed to the DAG from the UI when the DAG is triggered.
I tried the code below, but it is not working. Here I am passing the input {"kpi":"ID123"}
and I want to print this input value in my function get_data_from_bq.
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.operators.python_operator import PythonOperator
from airflow import models
from airflow.models import Variable
from google.cloud import bigquery
from airflow.configuration import conf
LOCATION = Variable.get("HDM_PROJECT_LOCATION")
PROJECT_ID = Variable.get("HDM_PROJECT_ID")
client = bigquery.Client()
kpi='{{ kpi}}'
# default arguments
default_dag_args = {
    'start_date':days_ago(0),
    'retries': 0,
    'project_id': PROJECT_ID
}
# Setting Airflow environment variable, getting hdm_batch_details data and updating it
def get_data_from_bq(**kwargs):
    print("op is:")
    print(kpi)
# DAG definition
with models.DAG(
    '00_test_sql1',
    schedule_interval=None,
    default_args=default_dag_args) as dag:
    v_run_sql_01 = PythonOperator(
        task_id='Run_SQL',
        provide_context=True,
        python_callable=get_data_from_bq,
        location=LOCATION,
        use_legacy_sql=False)
    v_run_sql_01
Note: I don't want to use any operator to read data passed from cli
"Note: I don't want to use any operator to read data passed from cli"
This is impossible. A DAG run is only created when there are tasks to run.
You should understand that:
DAG + its top-level code: builds the DAG structure consisting of tasks.
DAG run: a single instance of a DAG execution, which contains the task instances to be executed. A DAG run simply consists of the task instances that belong to it.
The configuration that you pass is "dag_run.conf", not "dag.conf", which means that it is specified only for the DAG run and is valid only for the task instances that belong to it.
Only task instances have access to dag_run.conf.
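Since only task instances see dag_run.conf, the value has to be read inside the task callable rather than at the top level of the DAG file. A minimal sketch of such a callable, assuming it is wired to a PythonOperator as python_callable and receives the task context as keyword arguments. The fake context at the bottom is a stand-in so the sketch runs without Airflow:

```python
from types import SimpleNamespace


def get_data_from_bq(**context):
    """Task callable: read the `kpi` key from the trigger configuration."""
    dag_run = context.get("dag_run")
    # conf can be empty or None when the run was triggered without a payload
    conf = (dag_run.conf or {}) if dag_run is not None else {}
    kpi = conf.get("kpi")
    print(f"kpi is: {kpi}")
    return kpi


# Stand-in for the context Airflow supplies at run time (assumption for
# illustration; in a real run Airflow passes the DagRun object itself).
fake_context = {"dag_run": SimpleNamespace(conf={"kpi": "ID123"})}
result = get_data_from_bq(**fake_context)
```

The same value is also available in templated operator fields as {{ dag_run.conf["kpi"] }}, but a module-level string such as kpi='{{ kpi}}' is never rendered.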

Airflow - DAG Integrity Testing - sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: variable

I am trying to write some DAG integrity tests in Airflow. The issue I am running into is that the DAG I am testing references Variables in some of its tasks,
e.g.: Variable.get("AIRFLOW_VAR_BLOB_CONTAINER")
I am getting the error:
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: variable
because when testing via pytest, those variables (and the variable table) don't exist. Does anyone know any workarounds or suggested methods to handle Variable/Connection references when running DAG integrity tests?
Thanks,
You can create a local metastore for testing. Running airflow db init without any other settings will create a SQLite metastore in your home directory which you can use during testing. My default additional settings for a local metastore for testing are:
AIRFLOW__CORE__LOAD_DEFAULT_CONNECTIONS=False (to ensure there are no defaults to make things magically work)
AIRFLOW__CORE__LOAD_EXAMPLES=False (to ensure there are no defaults to make things magically work)
AIRFLOW__CORE__UNIT_TEST_MODE=True (Set default test settings, skip certain actions, etc.)
AIRFLOW_HOME=[project root dir] (To avoid Airflow files in your home dir)
Running airflow db init with these settings results in three files in your project root dir:
unittests.db
unittests.cfg
webserver_config.py
It's probably a good idea to add those to your .gitignore. With this set up you can safely test against the local metastore unittests.db during your tests (ensure that when running pytest, the same env vars are set).
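One way to make sure pytest sees the same settings is to set them from a conftest.py at the project root before anything imports Airflow. A minimal sketch using the variables listed above (apply_test_env is a hypothetical helper name, and placing it in conftest.py is an assumption about your project layout):

```python
import os

# Settings taken from the list above
TEST_ENV = {
    "AIRFLOW__CORE__LOAD_DEFAULT_CONNECTIONS": "False",
    "AIRFLOW__CORE__LOAD_EXAMPLES": "False",
    "AIRFLOW__CORE__UNIT_TEST_MODE": "True",
}


def apply_test_env(project_root, environ=os.environ):
    """Write the test settings into `environ` and return it."""
    for key, value in TEST_ENV.items():
        environ[key] = value
    # Keep Airflow's files inside the project instead of the home directory
    environ["AIRFLOW_HOME"] = str(project_root)
    return environ


# In conftest.py this would be called at import time, for example:
#     apply_test_env(pathlib.Path(__file__).parent)
demo_env = apply_test_env("/tmp/project", environ={})
```

The variables must be in place before the first airflow import, because Airflow reads its configuration when the package is loaded.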
Alternatively, if you don't want a local metastore for reasons, you will have to resort to mocking to substitute the call Airflow makes to the metastore. This requires knowledge of the internals of Airflow. An example:
import datetime
from unittest import mock
from airflow.models import DAG
from airflow.operators.bash import BashOperator
def test_bash_operator(tmp_path):
    with DAG(dag_id="test_dag", start_date=datetime.datetime(2021, 1, 1), schedule_interval="@daily") as dag:
        with mock.patch("airflow.models.variable.Variable.get") as variable_get_mock:
            employees = ["Alice", "Bob", "Charlie"]
            variable_get_mock.return_value = employees
            output_file = tmp_path / "output.txt"
            test = BashOperator(task_id="test", bash_command="echo {{ var.json.employees }} > " + str(output_file))
            dag.clear()
            test.run(
                start_date=dag.start_date,
                end_date=dag.start_date,
                ignore_first_depends_on_past=True,
                ignore_ti_state=True,
            )
            variable_get_mock.assert_called_once()
            assert output_file.read_text() == f"[{', '.join(employees)}]\n"
These lines:
    with mock.patch("airflow.models.variable.Variable.get") as variable_get_mock:
        employees = ["Alice", "Bob", "Charlie"]
        variable_get_mock.return_value = employees
ensure that the function airflow.models.variable.Variable.get isn't actually called; instead, this list is returned: ["Alice", "Bob", "Charlie"]. Since test.run() doesn't return anything, I made the bash_command write to a file under tmp_path, and read the file to assert that the content is what I expected.
This avoids the need for a metastore entirely, but mocking can be a lot of work and complex once your tests grow beyond basic examples like these.

Can this warning be avoided in apache airflow 2.0?

I am using Airflow v2.0 on Windows 10 WSL (Ubuntu 20.04).
The warning message is:
/home/jainri/.local/lib/python3.8/site-packages/airflow/models/dag.py:1342: PendingDeprecationWarning: The requested task could not be added to the DAG because a task with task_id create_tag_template_field_result is already in the DAG. Starting in Airflow 2.0, trying to overwrite a task will raise an exception.
warnings.warn(
Done.
Because of this warning, the web UI also shows some example DAGs that ship with Apache Airflow. I have set AIRFLOW_HOME, and Airflow picks up my DAGs from there, but the list of example DAGs is displayed as well. I have also posted an image of the web UI:
[Image: WebUI]
This is the DAG that I am trying to run:
import datetime
import logging
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
#
# TODO: Define a function for the python operator to call
#
def greet():
    logging.info("Hello Rishabh!!")
dag = DAG(
    'lesson1.demo1',
    start_date=datetime.datetime.now()
)
#
# TODO: Define the task below using PythonOperator
#
greet_task = PythonOperator(
    task_id='greet_task',
    python_callable=greet,
    dag=dag
)
Also, the main issue is that the list of DAGs in the web UI includes the example DAGs. They show up as a huge list alongside my own DAGs, which makes it cumbersome to find my own.
I found the issue: the warning you are seeing comes from airflow/example_dags/example_complex.py (one of the example DAGs shipped with Airflow).
Disable loading of the example DAGs by setting AIRFLOW__CORE__LOAD_EXAMPLES=False as an environment variable, or set load_examples = False in the [core] section of airflow.cfg.

airflow plugins not getting picked up correctly

We are using Apache Airflow 1.9.0. I have written a Snowflake hook plugin and placed the hook in the $AIRFLOW_HOME/plugins directory.
$AIRFLOW_HOME
  +--plugins
    +--snowflake_hook2.py
snowflake_hook2.py
# This is the base class for a plugin
from airflow.plugins_manager import AirflowPlugin
# This is necessary to expose the plugin in the Web interface
from flask import Blueprint
from flask_admin import BaseView, expose
from flask_admin.base import MenuLink
# This is the base hook for connecting to a database
from airflow.hooks.dbapi_hook import DbApiHook
# This is the snowflake provided Connector
import snowflake.connector
# This is the default python logging package
import logging
class SnowflakeHook2(DbApiHook):
    """
    Airflow Hook to communicate with Snowflake
    This is implemented as a Plugin
    """
    def __init__(self, connname_in='snowflake_default', db_in='default', wh_in='default', schema_in='default'):
        logging.info('# Connecting to {0}'.format(connname_in))
        self.conn_name_attr = 'snowflake_conn_id'
        self.connname = connname_in
        self.superconn = super().get_connection(self.connname) #gets the values from Airflow
        {SNIP - Connection stuff that works}
        self.cur = self.conn.cursor()
    def query(self,q,params=None):
        """From jmoney's db_wrapper allows return of a full list of rows(tuples)"""
        if params is None: #no Params, so no insertion
            self.cur.execute(q)
        else: #make the parameter substitution
            self.cur.execute(q,params)
        self.results = self.cur.fetchall()
        self.rowcount = self.cur.rowcount
        self.columnnames = [colspec[0] for colspec in self.cur.description]
        return self.results
    {SNIP - Other class functions}
class SnowflakePluginClass(AirflowPlugin):
    name = "SnowflakePluginModule"
    hooks = [SnowflakeHook2]
    operators = []
So I went ahead and put some print statements in Airflow's plugins_manager to try to get a better handle on what is happening. After restarting the webserver and running airflow list_dags, these lines showed the "new module name" (and no errors):
SnowflakePluginModule [<class '__home__ubuntu__airflow__plugins_snowflake_hook2.SnowflakeHook2'>]
hook_module - airflow.hooks.snowflakepluginmodule
INTEGRATING airflow.hooks.snowflakepluginmodule
snowflakepluginmodule <module 'airflow.hooks.snowflakepluginmodule'>
As this is consistent with what the documentation says, I should be fine using this in my DAG:
from airflow import DAG
from airflow.hooks.snowflakepluginmodule import SnowflakeHook2
from airflow.operators.python_operator import PythonOperator
But the web UI throws this error:
Broken DAG: [/home/ubuntu/airflow/dags/test_sf2.py] No module named 'airflow.hooks.snowflakepluginmodule'
So the question is: what am I doing wrong? Or have I uncovered a bug?
You need to import as below:
from airflow import DAG
from airflow.hooks import SnowflakeHook2
from airflow.operators.python_operator import PythonOperator
OR
from airflow import DAG
from airflow.hooks.SnowflakePluginModule import SnowflakeHook2
from airflow.operators.python_operator import PythonOperator
I don't think that Airflow automatically goes through the folders in your plugins directory and runs everything underneath them. The way that I've set it up successfully is to have an __init__.py under the plugins directory which contains each plugin class. Have a look at the Astronomer plugins on GitHub; they provide some really good examples of how to set up your plugins.
In particular, have a look at how they've set up the MySQL plugin:
https://github.com/airflow-plugins/mysql_plugin
Also, someone has incorporated a Snowflake hook in one of the later versions of Airflow, which you might want to leverage:
https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/snowflake_hook.py

Airflow unpause dag programmatically?

I have a DAG that we'll deploy to multiple Airflow instances. In our airflow.cfg we have dags_are_paused_at_creation = True, but for this specific DAG we want it turned on without having to do so manually by clicking in the UI. Is there a way to do it programmatically?
I created the following function to do so if anyone else runs into this issue:
import airflow.settings
from airflow.models import DagModel
def unpause_dag(dag):
    """
    A way to programmatically unpause a DAG.
    :param dag: DAG object
    :return: dag.is_paused is now False
    """
    session = airflow.settings.Session()
    try:
        qry = session.query(DagModel).filter(DagModel.dag_id == dag.dag_id)
        d = qry.first()
        d.is_paused = False
        session.commit()
    except Exception:
        # Undo the partial change, then surface the error instead of hiding it
        session.rollback()
        raise
    finally:
        session.close()
The airflow-rest-api-plugin can also be used to programmatically pause DAGs.
Pauses a DAG
Available in Airflow Version: 1.7.0 or greater
GET - http://{HOST}:{PORT}/admin/rest_api/api?api=pause
Query Arguments:
  dag_id - string - The id of the dag
  subdir (optional) - string - File location or directory from which to look for the dag
Examples:
  http://{HOST}:{PORT}/admin/rest_api/api?api=pause&dag_id=test_id
For more details, see:
https://github.com/teamclairvoyant/airflow-rest-api-plugin
Supply your dag_id and run this command on your command line:
airflow pause dag_id
For more information on the Airflow command line interface: https://airflow.incubator.apache.org/cli.html
I think you are looking for unpause (not pause):
airflow unpause DAG_ID
The following CLI command should work per the recent docs:
airflow dags unpause dag_id
https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#unpause
Airflow's REST API provides a way to do this using the DAG patch API: update the DAG with the query parameter ?update_mask=is_paused and send the boolean in the request body.
Ref: https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html#operation/patch_dag
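A minimal sketch of what that PATCH request could look like, built with the standard library only. The host, DAG id, and credentials are placeholders, and basic authentication is an assumption: it only works if the Airflow instance has a matching API auth backend enabled. build_unpause_request is a hypothetical helper name:

```python
import base64
import json
import urllib.parse
import urllib.request


def build_unpause_request(base_url, dag_id, username, password):
    """Build (but do not send) a PATCH request that sets is_paused to false."""
    url = (
        f"{base_url.rstrip('/')}/api/v1/dags/"
        f"{urllib.parse.quote(dag_id, safe='')}?update_mask=is_paused"
    )
    body = json.dumps({"is_paused": False}).encode("utf-8")
    # Basic auth header (assumption: the instance accepts basic auth)
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


# Placeholder values for illustration only
request = build_unpause_request("http://localhost:8080", "my_dag", "admin", "admin")
# Sending it would be: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the sketch runnable without a live Airflow instance.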
The command
airflow pause dag_id
has been discontinued.
You will have to use:
airflow dags pause dag_id
You can do this from a PythonOperator in any DAG to pause and unpause DAGs programmatically. This is the best approach I found instead of using the CLI: just pass the list of DAGs and the rest is taken care of.
from airflow.models import DagModel
dag_id = "dag_name"
dag = DagModel.get_dagmodel(dag_id)
dag.set_is_paused(is_paused=False)
And if you want to check whether it is paused, this attribute holds a boolean:
dag.is_paused
