What is the relationship between airflow.cfg and airflow_local_settings.py?

Both of these files seem to define configuration. What is the difference between the two?

At the head of airflow_local_settings.py you can see:
from airflow.configuration import conf
This line causes Airflow to read the airflow.cfg file, because airflow/configuration.py runs:
conf = initialize_config() # configuration.py
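In other words, airflow.cfg is the declarative configuration file, while airflow_local_settings.py is ordinary Python that Airflow imports at startup (typically from ${AIRFLOW_HOME}/config/) and that can both read the parsed configuration and hook into Airflow's behaviour. A minimal sketch, using the documented task_policy cluster-policy hook; the retry floor of 1 is just an illustration, not an Airflow default:
# Hypothetical ${AIRFLOW_HOME}/config/airflow_local_settings.py
from airflow.configuration import conf  # importing this parses airflow.cfg

# Values below come from airflow.cfg (or the matching AIRFLOW__CORE__* env vars)
DAGS_FOLDER = conf.get("core", "dags_folder")


def task_policy(task):
    """task_policy is one of Airflow's cluster-policy hooks.

    This sketch just enforces a minimum of one retry on every task; the
    value is arbitrary and only meant to show that local settings are code.
    """
    task.retries = max(task.retries or 0, 1)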

Related

Airflow DeprecationWarning

I'm running a distributed Airflow 2.4.0 setup using the official Docker image. All the containers use the same .env file and same version of Airflow image. When I log into one of the Airflow containers I get this warning:
/home/airflow/.local/lib/python3.7/site-packages/airflow/configuration.py:545: DeprecationWarning: The sql_alchemy_conn option in [core] has been moved to the sql_alchemy_conn option in [database] - the old setting has been used, but please update your config.
option = self._get_environment_variables(deprecated_key, deprecated_section, key, section)
/home/airflow/.local/lib/python3.7/site-packages/airflow/configuration.py:545: DeprecationWarning: The auth_backend option in [api] has been renamed to auth_backends - the old setting has been used, but please update your config.
option = self._get_environment_variables(deprecated_key, deprecated_section, key, section)
/home/airflow/.local/lib/python3.7/site-packages/airflow/configuration.py:367: FutureWarning: The auth_backends setting in [api] has had airflow.api.auth.backend.session added in the running config, which is needed by the UI. Please update your config before Apache Airflow 3.0.
FutureWarning,
I checked the airflow.cfg inside the container and it has the up to date variables. Why do I still get the warning messages?
You are seeing these warnings because of the section. airflow.cfg is a configuration file with sections, and settings are expected to be in the proper section.
In your case your airflow.cfg has sql_alchemy_conn where you override the default value. Prior to 2.3.0 this setting was in the core section, and in 2.3.0 it was moved to the database section (see PR).
What you need to do is simply open airflow.cfg and move the setting to the proper section. For example:
[core]
sql_alchemy_conn = sqlite:///{AIRFLOW_HOME}/airflow.db
to:
[database]
sql_alchemy_conn = sqlite:///{AIRFLOW_HOME}/airflow.db
The reason why it's like that is also explained in the docs. Airflow references settings by environment variables of the format AIRFLOW__{SECTION}__{KEY}, so in this case it will be:
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN, so the section is needed to locate the variable.
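If you want to confirm which value Airflow actually ends up using (from [database], the deprecated [core] key, or an AIRFLOW__DATABASE__SQL_ALCHEMY_CONN environment variable), a quick sketch you can run from a Python shell inside the container:
# Check the effective configuration; conf.get() returns the value Airflow
# will use after applying env-var overrides and deprecated-key fallbacks.
from airflow.configuration import conf

print(conf.get("database", "sql_alchemy_conn"))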

How to output Airflow's scheduler log to stdout or S3 / GCS

We're running an Airflow cluster using the puckel/airflow Docker image with docker-compose. Airflow's scheduler container outputs its logs to /usr/local/airflow/logs/scheduler.
The problem is that the log files are not rotated and disk usage increases until the disk gets full. A DAG for cleaning up the log directory is available, but the DAG runs on a worker node, so the log directory on the scheduler container is not cleaned up.
I'm looking for a way to output the scheduler log to stdout or to an S3/GCS bucket, but have been unable to find one. Is there any way to do this?
Finally I managed to output the scheduler's log to stdout.
Here you can find how to use a custom logger with Airflow. The default logging config is available on GitHub.
What you have to do is:
(1) Create a custom logging config at ${AIRFLOW_HOME}/config/log_config.py:
# Setting processor (scheduler, etc..) logs output to stdout
# Referring https://www.astronomer.io/guides/logging
# This file is created following https://airflow.apache.org/docs/apache-airflow/2.0.0/logging-monitoring/logging-tasks.html#advanced-configuration
from copy import deepcopy
from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG
import sys
LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)
LOGGING_CONFIG["handlers"]["processor"] = {
"class": "logging.StreamHandler",
"formatter": "airflow",
"stream": sys.stdout,
}
(2) Set the logging_config_class option to config.log_config.LOGGING_CONFIG in airflow.cfg:
logging_config_class = config.log_config.LOGGING_CONFIG
(3) [Optional] Add $AIRFLOW_HOME to the PYTHONPATH environment variable:
export PYTHONPATH="${PYTHONPATH}:${AIRFLOW_HOME}"
Actually, you can set logging_config_class to any dotted path, as long as Python is able to import the module.
Setting the processor handler to airflow.utils.log.logging_mixin.RedirectStdHandler didn't work for me; it used too much memory.
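If you want to sanity-check the customized config before handing it to Airflow, you can load it with the standard library's dictConfig. This is just a local validation sketch and assumes $AIRFLOW_HOME is on PYTHONPATH:
# Local sanity check: confirm the customized dict is a valid logging config
# and that the processor logger now writes to stdout.
import logging.config

from config.log_config import LOGGING_CONFIG  # requires $AIRFLOW_HOME on PYTHONPATH

logging.config.dictConfig(LOGGING_CONFIG)
logging.getLogger("airflow.processor").info("processor logs now go to stdout")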
remote_logging=True in airflow.cfg is the key.
Please check the thread here for detailed steps.
You can extend the image with the following environment variables, or set the equivalent options in the [logging] section of airflow.cfg:
ENV AIRFLOW__LOGGING__REMOTE_LOGGING=True
ENV AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID=gcp_conn_id
ENV AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER=gs://bucket_name/AIRFLOW_LOGS
The connection referenced by gcp_conn_id should have the correct permissions to create/delete objects in GCS.
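To verify that the connection can actually reach the bucket before relying on it for logs, a quick sketch using the Google provider's GCSHook; the connection id and bucket name below are placeholders:
# Hypothetical permission check for the remote-logging bucket.
# Requires the apache-airflow-providers-google package.
from airflow.providers.google.cloud.hooks.gcs import GCSHook

hook = GCSHook(gcp_conn_id="gcp_conn_id")        # same connection id as above
print(hook.list(bucket_name="bucket_name")[:5])  # listing fails if permissions are missing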

Airflow on conda - folder structure

I have a bare-bones Airflow installation on conda. I managed to create custom operators by putting them in the path:
airflow/dags/operators/custom_operator.py
and then calling them from a DAG as:
from operators.custom_operator import CustomOperator
How can I instead achieve the folder structure:
airflow/operators/custom_operator.py
which would be called from a DAG as:
from airflow.operators.custom_operator import CustomOperator
In case you think that's a bad approach, please point it out in your answer/comment; I'm happy to tweak my approach if there are better design patterns.
Interestingly, the solution here is in airflow.cfg (your Airflow config file): move the dags_folder parameter one directory up, to $AIRFLOW_HOME. So instead of having:
....
[core]
dags_folder = /home/user/airflow/dags
....
Just make it:
....
[core]
dags_folder = /home/user/airflow
....
Airflow apparently looks for DAGs recursively and picks up only objects defined as DAGs, so you can keep a clean folder structure, with custom operators, utility functions, custom sensors, etc. outside the dags/ folder, as sketched below.
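A sketch of what that layout and import could look like; the module, class, and DAG names are only illustrative. Note that the operators package is still imported as operators.custom_operator rather than airflow.operators..., since it is the dags_folder (now /home/user/airflow) that Airflow puts on sys.path:
# Hypothetical layout:
#   /home/user/airflow/            <- dags_folder
#   |-- dags/example_dag.py
#   `-- operators/custom_operator.py
#
# /home/user/airflow/dags/example_dag.py
from datetime import datetime

from airflow import DAG
from operators.custom_operator import CustomOperator  # importable because dags_folder is on sys.path

with DAG(dag_id="custom_operator_example",
         start_date=datetime(2023, 1, 1),
         schedule_interval=None) as dag:
    CustomOperator(task_id="run_custom")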

How to point to the airflow unittest.cfg?

Airflow creates a unittest.cfg file in the AIRFLOW_HOME environment variable path.
My question is: how can I point to unittest.cfg in the same way that I point to airflow.cfg via the environment variable AIRFLOW_CONFIG?
The reason why I want to do this is because I don't want to have any config files in the AIRFLOW_HOME directory.
Also, if anyone knows, could you please explain what unittest.cfg is for, as there is no documentation I could find on it.
The unittest.cfg test configuration file is the default configuration file used when Airflow is running in test mode.
Test mode can be activated by setting the unit_test_mode configuration option in airflow.cfg, or the AIRFLOW__CORE__UNIT_TEST_MODE environment variable, to True.
The configuration values in the test configuration file overwrite those in airflow.cfg at runtime when test mode is activated.
# Source: https://github.com/apache/airflow/blob/1.10.5/airflow/configuration.py#L558,L561
def get_airflow_test_config(airflow_home):
    if 'AIRFLOW_TEST_CONFIG' not in os.environ:
        return os.path.join(airflow_home, 'unittests.cfg')
    return expand_env_var(os.environ['AIRFLOW_TEST_CONFIG'])
The AIRFLOW_TEST_CONFIG environment variable can be set to the path of your test configuration file.
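So, to keep config files out of AIRFLOW_HOME, you can point AIRFLOW_TEST_CONFIG at a file of your own before Airflow's configuration module is imported; a sketch (the path is a placeholder):
# Hypothetical test bootstrap (e.g. conftest.py). Must run before importing
# airflow, because airflow.configuration reads these variables at import time.
import os

os.environ["AIRFLOW__CORE__UNIT_TEST_MODE"] = "True"
os.environ["AIRFLOW_TEST_CONFIG"] = "/opt/configs/unittests.cfg"  # your own location

import airflow  # noqa: E402  imported after the env vars are set, on purpose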

How can I read a config file from airflow packaged DAG?

Airflow packaged DAGs seem like a great building block for a sane production airflow deployment.
I have a DAG with dynamic subDAGs, driven by a config file, something like:
config.yaml:
imports:
- project_foo
- project_bar
which yields subdag tasks like imports.project_{foo|bar}.step{1|2|3}.
I've normally read in the config file using Python's open function, a la config = open(os.path.join(os.path.split(__file__)[0], 'config.yaml'))
Unfortunately, when using packaged DAGs, this results in an error:
Broken DAG: [/home/airflow/dags/workflows.zip] [Errno 20] Not a directory: '/home/airflow/dags/workflows.zip/config.yaml'
Any thoughts / best practices to recommend here?
It's a bit of a kludge, but I eventually just fell back on reading zip file contents via ZipFile.
import yaml
from zipfile import ZipFile
import logging
import re


def get_config(yaml_filename):
    """Parses and returns the given YAML config file.

    For packaged DAGs, gracefully handles unzipping.
    """
    zip, post_zip = re.search(r'(.*\.zip)?(.*)', yaml_filename).groups()
    if zip:
        contents = ZipFile(zip).read(post_zip.lstrip('/'))
    else:
        contents = open(post_zip).read()
    result = yaml.safe_load(contents)
    logging.info('Parsed config: %s', result)
    return result
which works as you'd expect from the main dag.py:
get_config(os.path.join(os.path.split(__file__)[0], 'config.yaml'))
