Airflow packaged DAGs seem like a great building block for a sane production Airflow deployment.
I have a DAG with dynamic subDAGs, driven by a config file, something like:
config.yaml:
imports:
  - project_foo
  - project_bar
which yields subdag tasks like imports.project_{foo|bar}.step{1|2|3}.
I've normally read in the config file using Python's open function, a la config = open(os.path.join(os.path.split(__file__)[0], 'config.yaml'))
Unfortunately, when using packaged DAGs, this results in an error:
Broken DAG: [/home/airflow/dags/workflows.zip] [Errno 20] Not a directory: '/home/airflow/dags/workflows.zip/config.yaml'
Any thoughts / best practices to recommend here?
It's a bit of a kludge, but I eventually just fell back on reading zip file contents via ZipFile.
import logging
import re
from zipfile import ZipFile

import yaml


def get_config(yaml_filename):
    """Parses and returns the given YAML config file.

    For packaged DAGs, gracefully handles unzipping.
    """
    # Split the path into an optional leading "*.zip" archive and the remainder.
    zip_path, post_zip = re.search(r'(.*\.zip)?(.*)', yaml_filename).groups()
    if zip_path:
        contents = ZipFile(zip_path).read(post_zip.lstrip('/'))
    else:
        contents = open(post_zip).read()
    result = yaml.safe_load(contents)
    logging.info('Parsed config: %s', result)
    return result
which works as you'd expect from the main dag.py:
get_config(os.path.join(os.path.split(__file__)[0], 'config.yaml'))
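As an alternative to the regex, here is a minimal sketch that walks up the path until it reaches an existing archive. It assumes the packaged DAG is an ordinary zip file; read_text_maybe_zipped is a hypothetical helper name, not part of Airflow. It returns raw text, which could then be handed to yaml.safe_load.

```python
import os
import tempfile
import zipfile


def read_text_maybe_zipped(path):
    """Return the text at `path`, even if a parent component is a zip archive."""
    if os.path.isfile(path):
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    # Walk up the path until an existing file (the candidate archive) is found.
    archive, inner = path, []
    while archive and not os.path.isfile(archive):
        archive, tail = os.path.split(archive)
        if not tail:
            break
        inner.insert(0, tail)
    if not (archive and zipfile.is_zipfile(archive)):
        raise FileNotFoundError(path)
    with zipfile.ZipFile(archive) as zf:
        # Zip member names always use forward slashes.
        return zf.read("/".join(inner)).decode("utf-8")


# Demo: a throwaway archive that mimics workflows.zip/config.yaml.
demo_dir = tempfile.mkdtemp()
archive_path = os.path.join(demo_dir, "workflows.zip")
with zipfile.ZipFile(archive_path, "w") as zf:
    zf.writestr("config.yaml", "imports:\n- project_foo\n- project_bar\n")

config_text = read_text_maybe_zipped(os.path.join(archive_path, "config.yaml"))
```

The same call works for an unpackaged DAG folder, because a plain file is read directly before any archive lookup happens.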
In my Airflow task, I create a file with open() inside the DAG, write records into it, and then send it by email within the same task. Will the file be deleted automatically, or will it remain on disk?
filename = to_report_name(context)+'_'+currentNextRunTime.strftime('%m.%d.%Y_%H-%M')+'_'+currentNextRunTime.tzname()+'.'+extension.lower()
with open(filename, "w+b") as file:
    file.write(download_response.content)
    print(file.name)
    send_report(context,file)
The file will not be deleted automatically. The code you execute is plain Python, so if you want the file to be deleted once the operation is done, use the tempfile module, which guarantees that the file is deleted once it is closed. Example:
import tempfile

# prefix/suffix control the generated file name; renaming the file on disk
# would stop tempfile from finding and deleting it on close
with tempfile.NamedTemporaryFile(prefix='my_custom_name_', suffix='.txt') as file:
    file.write(...)
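To make the lifetime concrete, here is a small sketch of the behaviour described above. The file name parts and the payload are made up for illustration; the point is that anything that needs the file (such as sending the mail) has to happen inside the with block.

```python
import os
import tempfile

with tempfile.NamedTemporaryFile(prefix="report_", suffix=".csv") as tmp:
    tmp.write(b"id,value\n1,42\n")
    tmp.flush()  # make sure the bytes are on disk before another reader opens it
    report_path = tmp.name
    existed_while_open = os.path.exists(report_path)
    # send the report here, while the file still exists

# Leaving the block closes the file, which also deletes it.
exists_after_close = os.path.exists(report_path)
```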
I'm trying to implement a custom XCom backend.
These are the steps I followed:
Created an "include" directory in the main Airflow directory (AIRFLOW_HOME).
Created this "custom_xcom_backend.py" file inside it:
from typing import Any

import pandas as pd
from airflow.models.xcom import BaseXCom


class CustomXComBackend(BaseXCom):
    @staticmethod
    def serialize_value(value: Any):
        if isinstance(value, pd.DataFrame):
            value = value.to_json(orient='records')
        return BaseXCom.serialize_value(value)

    @staticmethod
    def deserialize_value(result) -> Any:
        result = BaseXCom.deserialize_value(result)
        result = pd.read_json(result)
        return result
Set at config file:
xcom_backend = include.custom_xcom_backend.CustomXComBackend
When I restarted the webserver I got:
airflow.exceptions.AirflowConfigException: The object could not be loaded. Please check "xcom_backend" key in "core" section. Current value: "include.cust...
My guess is that it is not recognizing the "include" folder.
But how can I fix it?
*Note: There is no Docker. Airflow is installed directly on an Ubuntu machine.
Thanks!
So I solved it:
Put custom_xcom_backend.py into the plugins directory.
Set this in the config file:
xcom_backend = custom_xcom_backend.CustomXComBackend
Restart all Airflow-related services.
*Note: Do not store DataFrames that way (bad practice).
Sources I used:
https://www.youtube.com/watch?v=iI0ymwOij88
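The likely reason the move helps is that Airflow puts its plugins folder on sys.path, so a dotted path such as custom_xcom_backend.CustomXComBackend becomes importable. The sketch below illustrates that mechanism with a stand-in loader; load_dotted and demo_xcom_backend are hypothetical names, and this is not Airflow's own loading code.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def load_dotted(dotted_path):
    """Import 'module.ClassName' from a dotted string, as a config loader might."""
    module_name, _, attr = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


# Hypothetical stand-in for the plugins directory, holding one backend module.
plugins_dir = Path(tempfile.mkdtemp())
(plugins_dir / "demo_xcom_backend.py").write_text("class DemoBackend:\n    pass\n")

try:
    load_dotted("demo_xcom_backend.DemoBackend")
    found_before = True
except ModuleNotFoundError:
    found_before = False  # the directory is not on sys.path yet

sys.path.insert(0, str(plugins_dir))  # what happens for the real plugins folder
importlib.invalidate_caches()
backend_cls = load_dotted("demo_xcom_backend.DemoBackend")
```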
I am trying to write some DAG integrity tests in Airflow. The issue I am running into is that the DAG under test references Variables in some of its tasks.
eg: Variable.get("AIRFLOW_VAR_BLOB_CONTAINER")
I seem to be getting the error:
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: variable
This happens because, when testing via pytest, those variables (and the variable table) don't exist. Does anyone know any workarounds or suggested methods to handle Variable/Connection references when running DAG integrity tests?
Thanks,
You can create a local metastore for testing. Running airflow db init without any other settings will create a SQLite metastore in your home directory which you can use during testing. My default additional settings for a local metastore for testing are:
AIRFLOW__CORE__LOAD_DEFAULT_CONNECTIONS=False (to ensure there are no defaults to make things magically work)
AIRFLOW__CORE__LOAD_EXAMPLES=False (to ensure there are no defaults to make things magically work)
AIRFLOW__CORE__UNIT_TEST_MODE=True (Set default test settings, skip certain actions, etc.)
AIRFLOW_HOME=[project root dir] (To avoid Airflow files in your home dir)
Running airflow db init with these settings results in three files in your project root dir:
unittests.db
unittests.cfg
webserver_config.py
It's probably a good idea to add those to your .gitignore. With this setup you can safely test against the local metastore unittests.db during your tests (ensure that the same env vars are set when running pytest).
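One way to keep pytest and airflow db init on the same settings is to export them from Python before anything imports Airflow, for example at the top of a conftest.py. A minimal sketch; configure_airflow_test_env is a hypothetical helper, and it assumes the directory you pass in is the project root:

```python
import os
from pathlib import Path

TEST_SETTINGS = {
    "AIRFLOW__CORE__LOAD_DEFAULT_CONNECTIONS": "False",
    "AIRFLOW__CORE__LOAD_EXAMPLES": "False",
    "AIRFLOW__CORE__UNIT_TEST_MODE": "True",
}


def configure_airflow_test_env(project_root):
    """Export the test settings; call this before the first Airflow import."""
    settings = dict(TEST_SETTINGS, AIRFLOW_HOME=str(Path(project_root).resolve()))
    os.environ.update(settings)
    return settings


applied = configure_airflow_test_env(".")
```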
Alternatively, if you don't want a local metastore for whatever reason, you will have to resort to mocking to substitute the call Airflow makes to the metastore. This requires knowledge of the internals of Airflow. An example:
import datetime
from unittest import mock

from airflow.models import DAG
from airflow.operators.bash import BashOperator


def test_bash_operator(tmp_path):
    with DAG(dag_id="test_dag", start_date=datetime.datetime(2021, 1, 1), schedule_interval="@daily") as dag:
        with mock.patch("airflow.models.variable.Variable.get") as variable_get_mock:
            employees = ["Alice", "Bob", "Charlie"]
            variable_get_mock.return_value = employees
            output_file = tmp_path / "output.txt"
            test = BashOperator(task_id="test", bash_command="echo {{ var.json.employees }} > " + str(output_file))
            dag.clear()
            test.run(
                start_date=dag.start_date,
                end_date=dag.start_date,
                ignore_first_depends_on_past=True,
                ignore_ti_state=True,
            )

            variable_get_mock.assert_called_once()
            assert output_file.read_text() == f"[{', '.join(employees)}]\n"
These lines:
with mock.patch("airflow.models.variable.Variable.get") as variable_get_mock:
    employees = ["Alice", "Bob", "Charlie"]
    variable_get_mock.return_value = employees
ensure that the function airflow.models.variable.Variable.get isn't actually called; instead, this list is returned: ["Alice", "Bob", "Charlie"]. Since task.run() doesn't return anything, I made the bash_command write to a file under tmp_path, and read that file to assert that the content is what I expected.
This avoids the need for a metastore entirely, but mocking can be a lot of work and complex once your tests grow beyond basic examples like these.
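The patching mechanism itself can be tried without Airflow installed. In this sketch the Variable class is a hypothetical stand-in that fails the way a missing metastore would, and container_name is a made-up function under test:

```python
from unittest import mock


class Variable:
    """Hypothetical stand-in for airflow.models.Variable with no metastore."""

    @classmethod
    def get(cls, key):
        raise RuntimeError("no such table: variable")


def container_name():
    # Code under test: it would normally hit the metastore.
    return Variable.get("AIRFLOW_VAR_BLOB_CONTAINER")


with mock.patch.object(Variable, "get", return_value="my-container") as get_mock:
    patched_result = container_name()
    get_mock.assert_called_once_with("AIRFLOW_VAR_BLOB_CONTAINER")
```

Once the with block exits, the original method is restored, so the patch cannot leak into other tests.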
I want to test my DAGs to make sure they have certain default arguments and that none of them has import errors.
I am using DagBag to load the DAGs and then iterate over each DAG, checking that its values are what I want them to be.
Because DagBag can also fetch the example DAGs that ship with Airflow, I am passing the argument include_examples=False. However, when I do this, none of my DAGs is pulled into the DagBag.
Am I using DagBag wrongly, or is there a better way to pull and inspect DAGs when testing?
My code:
def test_no_import_errors():
    dag_bag = DagBag(include_examples=False)
    assert len(dag_bag.import_errors) == 0, "No Import Failures"
I was able to reproduce the problem: when creating the DagBag object, if you don't provide a value for the dag_folder parameter, no DAG is added to the collection.
So as Jarek stated, this works:
def test_no_import_errors():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert len(dag_bag.import_errors) == 0, "No Import Failures"
This is the example I made to test it:
# python -m unittest test_dag_validation.py
import unittest
import logging
from airflow.models import DagBag


class TestDAGValidation(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        log = logging.getLogger()
        handler = logging.FileHandler("dag_validation.log", mode="w")
        handler.setLevel(logging.INFO)
        log.addHandler(handler)
        cls.log = log

    def test_no_import_errors(self):
        dag_bag = DagBag(dag_folder="dags/", include_examples=False)
        self.log.info(f"How Many DAGs?: {dag_bag.size()}")
        self.log.info(f"Import errors: {len(dag_bag.import_errors)}")
        assert len(dag_bag.import_errors) == 0, "No Import Failures"
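A relative dag_folder such as "dags/" depends on the directory the tests are started from. A small sketch that resolves the folder relative to the test file instead; it assumes the dags directory sits next to the test file, and both helper names are hypothetical:

```python
from pathlib import Path


def dags_folder_for(test_file, relative="dags"):
    """Absolute path of the DAG folder located next to the given test file."""
    return str(Path(test_file).resolve().parent / relative)


def load_dag_bag(test_file):
    # Imported lazily so the path helper also works where Airflow is absent.
    from airflow.models import DagBag

    return DagBag(dag_folder=dags_folder_for(test_file), include_examples=False)


example_folder = dags_folder_for("/opt/project/test_dag_validation.py")
```

Inside a test module this would be called as load_dag_bag(__file__).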
When you construct a DagBag object you can pass the folder where DagBag should look for the DAG files. I guess this is the problem.
By default, Airflow's DagBag looks for DAGs inside the AIRFLOW_HOME/dags folder.
This location is usually stored in the airflow.cfg file.
By default AIRFLOW_HOME points to the ~/airflow folder, but you can point it to your current working directory by running:
export AIRFLOW_HOME=abs_path_of_your_folder
If you are installing Airflow with Python, make sure to export the AIRFLOW_HOME variable first, then activate the virtual environment, and finally install Airflow. This makes sure your path is properly written to the airflow.cfg file.
You can also check whether your folder was loaded properly while running the unit test. In the terminal, the file path is printed like this:
[2022-02-03 20:45:57,657] {dagbag.py:500} INFO - Filling up the DagBag from /Users/kehsihba19/Desktop/airflow-test/dags
An example file for checking import errors in DAGs, which also catches typos and cyclic task dependencies:
from airflow.models import DagBag
import unittest


class TestDags(unittest.TestCase):
    def test_DagBag(self):
        self.dag_bag = DagBag(include_examples=False)
        self.assertFalse(bool(self.dag_bag.import_errors))


if __name__ == "__main__":
    unittest.main()
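The default-argument part of the question can be handled with a plain helper that compares a DAG's default_args against a required set. A minimal sketch; the helper name and the required keys are assumptions chosen for illustration:

```python
def missing_default_args(default_args, required):
    """Return the required keys that are absent from a DAG's default_args."""
    return sorted(set(required) - set(default_args or {}))


# Hypothetical policy: every DAG must declare an owner and a retry count.
REQUIRED_KEYS = ("owner", "retries")

complete = missing_default_args({"owner": "data-team", "retries": 1}, REQUIRED_KEYS)
incomplete = missing_default_args({"owner": "data-team"}, REQUIRED_KEYS)
```

In a test this could be applied to each entry of dag_bag.dags, asserting that missing_default_args(dag.default_args, REQUIRED_KEYS) is empty and using the dag_id as the failure message.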
I am trying to run an Apache Beam pipeline (Python) in Google Cloud Dataflow, triggered by a DAG in Google Cloud Composer.
The structure of my dags folder in the respective GCS bucket is as follows:
/dags/
  dataflow.py <- DAG
  dataflow/
    pipeline.py <- pipeline
    setup.py
    my_modules/
      __init__.py
      commons.py <- the module I want to import in the pipeline
The setup.py is very basic, but follows the Apache Beam docs and answers on SO:
import setuptools
setuptools.setup(setuptools.find_packages())
In the DAG file (dataflow.py) I set the setup_file option and pass it to Dataflow:
default_dag_args = {
    ... ,
    'dataflow_default_options': {
        ... ,
        'runner': 'DataflowRunner',
        'setup_file': os.path.join(configuration.get('core', 'dags_folder'), 'dataflow', 'setup.py')
    }
}
Within the pipeline file (pipeline.py) I try to use
from my_modules import commons
but this fails. The log in Google Cloud Composer (Apache Airflow) says:
gcp_dataflow_hook.py:132} WARNING - b' File "/home/airflow/gcs/dags/dataflow/dataflow.py", line 11\n from my_modules import commons\n ^\nSyntaxError: invalid syntax'
The basic idea behind the setup.py file is documented here
Also, there are similar questions on SO which helped me:
Google Dataflow - Failed to import custom python modules
Dataflow/apache beam: manage custom module dependencies
I'm actually wondering why my pipeline fails with a SyntaxError and not a module-not-found kind of error...
I tried to reproduce your issue and then solve it, so I created the same folder structure you already have:
/dags/
  dataflow.py
  dataflow/
    pipeline.py -> pipeline
    setup.py
    my_modules/
      __init__.py
      commons.py
To make it work, the change I made was to copy these folders to a place where the instance running the code can find them, for example the /tmp/ folder of the instance.
So, my DAG would be something like this:
1 - First of all, I declare my arguments:
default_args = {
    'start_date': datetime(xxxx, x, x),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'dataflow_default_options': {
        'project': '<project>',
        'region': '<region>',
        'stagingLocation': 'gs://<bucket>/stage',
        'tempLocation': 'gs://<bucket>/temp',
        'setup_file': <setup.py>,
        'runner': 'DataflowRunner'
    }
}
2 - After this, I created the DAG. Before running the Dataflow task, I copy the whole folder structure created above into the /tmp/ folder of the instance (task t1), and then run the pipeline from the /tmp/ directory (task t2):
with DAG(
        'composer_df',
        default_args=default_args,
        description='dataflow dag',
        schedule_interval="xxxx") as dag:

    def copy_dependencies():
        process = subprocess.Popen(['gsutil', 'cp', '-r', 'gs://<bucket>/dags/*',
                                    '/tmp/'])
        process.communicate()

    t1 = python_operator.PythonOperator(
        task_id='copy_dependencies',
        python_callable=copy_dependencies,
        provide_context=False
    )

    t2 = DataFlowPythonOperator(task_id="composer_dataflow",
                                py_file='/tmp/dataflow/pipeline.py', job_name='job_composer')

    t1 >> t2
That's how I created the DAG file dataflow.py. Then, in pipeline.py, the package import looks like:
from my_modules import commons
It should work fine, since the VM can now resolve the folder structure.
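One caveat with Popen(...).communicate() is that a failed copy goes unnoticed, and the Dataflow task then runs against stale or missing files. A minimal sketch of a stricter variant, assuming gsutil is on the worker's PATH as in the DAG above; build_copy_command is a hypothetical helper and "my-bucket" is a placeholder:

```python
import subprocess


def build_copy_command(bucket, destination="/tmp/"):
    """Argument list for copying the bucket's dags folder to a local directory."""
    return ["gsutil", "cp", "-r", f"gs://{bucket}/dags/*", destination]


def copy_dependencies(bucket, destination="/tmp/"):
    # check=True raises CalledProcessError on a non-zero exit code, which makes
    # the Airflow task fail instead of silently continuing.
    subprocess.run(build_copy_command(bucket, destination), check=True)


example_command = build_copy_command("my-bucket")
```

With a PythonOperator, the bucket name could be supplied through op_kwargs so the callable stays free of hard-coded values.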