I am new to Apache Airflow and I am trying to figure out how to unit/integration test my dags/tasks
Here is my directory structure
/airflow
/dags
/tests/dags
I created a simple DAG which has a task to reads data from a Postgres table
def read_files(ti):
sql = "select id from files where status='NEW'"
pg_hook = PostgresHook(postgres_conn_id="metadata")
connection = pg_hook.get_conn()
cursor = connection.cursor()
cursor.execute(sql)
files = cursor.fetchall()
ti.xcom_push(key="files_to_process", value=files)
with DAG(dag_id="check_for_new_files", schedule_interval=timedelta(minutes=30),
start_date=datetime(2022, 9, 1), catchup=False) as dag:
check_files = PythonOperator(task_id="read_files",
python_callable=read_files)
Is it possible to test this by mocking Airflow/Postgres connection etc
yes it is possible to do test in dags, here is an example of basic things you can do:
import unittest
from airflow.models import DagBag
class TestCheckForNewFilesDAG(unittest.TestCase):
"""Check Dag"""
def setUp(self):
self.dagbag = DagBag()
def test_task_count(self):
"""Check task count for a dag"""
dag_id='check_for_new_files'
dag = self.dagbag.get_dag(dag_id)
self.assertEqual(len(dag.tasks), 1)
def test_contain_tasks(self):
"""Check task contains in hello_world dag"""
dag_id='check_for_new_files'
dag = self.dagbag.get_dag(dag_id)
tasks = dag.tasks
task_ids = list(map(lambda task: task.task_id, tasks))
self.assertListEqual(task_ids, ['read_files'])
def test_dependencies_of_read_files_task(self):
"""Check the task dependencies of a taskin hello_world dag"""
dag_id='check_for_new_files'
dag = self.dagbag.get_dag(dag_id)
read_files_task = dag.get_task('read_files')
# to be use in case you have upstream task
upstream_task_ids = list(map(lambda task: task.task_id,
read_files_task.upstream_list))
self.assertListEqual(upstream_task_ids, [])
downstream_task_ids = list(map(lambda task: task.task_id,
read_files_task.downstream_list))
self.assertListEqual(downstream_task_ids, [])
suite = unittest.TestLoader().loadTestsFromTestCase(TestHelloWorldDAG)
unittest.TextTestRunner(verbosity=2).run(suite)
In case of verifying that manipulated data of files are moved correctly the documentations suggest:
https://airflow.apache.org/docs/apache-airflow/2.0.1/best-practices.html#self-checks
Self-Checks
You can also implement checks in a DAG to make sure the tasks are producing the results as expected. As an example, if you have a task that pushes data to S3, you can implement a check in the next task. For example, the check could make sure that the partition is created in S3 and perform some simple checks to determine if the data is correct.
I think this is an excellent and straightforward way to verify a specific task.
Here there are other useful links you can use:
https://www.youtube.com/watch?v=ANJnYbLwLjE
In the next ones, they talk about mock
https://www.astronomer.io/guides/testing-airflow/
https://medium.com/#montadhar/apache-airflow-testing-guide-7956a3f4bbf5
https://godatadriven.com/blog/testing-and-debugging-apache-airflow/
I follow this example
create the example timetable py file, and put it in the $Home/airflow/plugins
create the example dag file, and put it in $Home/airflow/dags
After restart scheduler and webserver, I get DAG import error. In the web UI, the last line of detailed error message:
airflow.exceptions.SerializationError: Failed to serialize DAG 'example_timetable_dag2': Timetable class 'AfterWorkdayTimetable' is not registered
But if I run airflow plugins, I can see the timetable is in the name and source list.
How to fix this error?
Detail of plugins/AfterWorkdayTimetable.py:
from datetime import timedelta
from typing import Optional
from pendulum import Date, DateTime, Time, timezone
from airflow.plugins_manager import AirflowPlugin
from airflow.timetables.base import DagRunInfo, DataInterval, TimeRestriction, Timetable
UTC = timezone("UTC")
class AfterWorkdayTimetable(Timetable):
def infer_data_interval(self, run_after: DateTime) -> DataInterval:
weekday = run_after.weekday()
if weekday in (0, 6): # Monday and Sunday -- interval is last Friday.
days_since_friday = (run_after.weekday() - 4) % 7
delta = timedelta(days=days_since_friday)
else: # Otherwise the interval is yesterday.
delta = timedelta(days=1)
start = DateTime.combine((run_after - delta).date(), Time.min).replace(tzinfo=UTC)
return DataInterval(start=start, end=(start + timedelta(days=1)))
def next_dagrun_info(
self,
*,
last_automated_data_interval: Optional[DataInterval],
restriction: TimeRestriction,
) -> Optional[DagRunInfo]:
if last_automated_data_interval is not None: # There was a previous run on the regular schedule.
last_start = last_automated_data_interval.start
last_start_weekday = last_start.weekday()
if 0 <= last_start_weekday < 4: # Last run on Monday through Thursday -- next is tomorrow.
delta = timedelta(days=1)
else: # Last run on Friday -- skip to next Monday.
delta = timedelta(days=(7 - last_start_weekday))
next_start = DateTime.combine((last_start + delta).date(), Time.min).replace(tzinfo=UTC)
else: # This is the first ever run on the regular schedule.
next_start = restriction.earliest
if next_start is None: # No start_date. Don't schedule.
return None
if not restriction.catchup:
# If the DAG has catchup=False, today is the earliest to consider.
next_start = max(next_start, DateTime.combine(Date.today(), Time.min).replace(tzinfo=UTC))
elif next_start.time() != Time.min:
# If earliest does not fall on midnight, skip to the next day.
next_day = next_start.date() + timedelta(days=1)
next_start = DateTime.combine(next_day, Time.min).replace(tzinfo=UTC)
next_start_weekday = next_start.weekday()
if next_start_weekday in (5, 6): # If next start is in the weekend, go to next Monday.
delta = timedelta(days=(7 - next_start_weekday))
next_start = next_start + delta
if restriction.latest is not None and next_start > restriction.latest:
return None # Over the DAG's scheduled end; don't schedule.
return DagRunInfo.interval(start=next_start, end=(next_start + timedelta(days=1)))
class WorkdayTimetablePlugin(AirflowPlugin):
name = "workday_timetable_plugin"
timetables = [AfterWorkdayTimetable]
Details of dags/test_afterwork_timetable.py:
import datetime
from airflow import DAG
from AfterWorkdayTimetable import AfterWorkdayTimetable
from airflow.operators.dummy import DummyOperator
with DAG(
dag_id="example_workday_timetable",
start_date=datetime.datetime(2021, 1, 1),
timetable=AfterWorkdayTimetable(),
tags=["example", "timetable"],
) as dag:
DummyOperator(task_id="run_this")
If I run airflow plugins:
name | source
==================================+==========================================
workday_timetable_plugin | $PLUGINS_FOLDER/AfterWorkdayTimetable.py
I had similar issue.
Either you need to add __init__.py file or you should try this to debug your issue:
Get all plugin manager objects:
from airflow import plugins_manager
plugins_manager.initialize_timetables_plugins()
plugins_manager.timetable_classes
I got this result: {'quarterly.QuarterlyTimetable': <class 'quarterly.QuarterlyTimetable'>}
Compare your result with exception message. If timetable_classes dictionary has a different plugin name you should either change plugin file path.
You could also try this inside DAG python file:
from AfterWorkdayTimetable import AfterWorkdayTimetable
from airflow import plugins_manager
print(plugins_manager.as_importable_string(AfterWorkdayTimetable))
This would help you find the name that airflow tries to use when searching through timetable_classes dictionary.
You need to register the timetable in "timetables" array via plugin interface. See:
https://airflow.apache.org/docs/apache-airflow/stable/plugins.html
Encountered the same issue.
These are the steps I followed.
Add the Timetable file(custom_tt.py) into plugins folder.
Make sure the plugin folder has _ _ init_ _.py file present in plugins folder.
Change the lazy_load_plugins in airflow.cfg to False.
lazy_load_plugins = False
Add import statement in dagfile as:
from custom_tt import CustomTimeTable
In DAG as
DAG(timetable=CustomTimeTable())
Restart the webserver and scheduler.
Problem fixed.
They have found the resolution to this but doesn't seem like they have updated the documentation to represent the fix just yet.
Your function
def infer_data_interval(self, run_after: DateTime) -> DataInterval:
should be
def infer_manual_data_interval(self, *, run_after: DateTime) -> DataInterval:
See reference:
Apache airflow.timetables.base Documentation
After updating the function with the correct name and extra parameter, everything else should work for you as it did for me.
I was running into this as well. airflow plugins reports that the plugin is registered, and running the DAG script on the command line works fine, but the web UI reports that the plugin is not registered. #Bakuchi's answer pointed me in the right direction.
In my case, the problem was how I was importing the Timetable - airflow apparently expects you to import it relative to the $PLUGINS_FOLDER, not from any other directory, even if that other directory is also on the PYTHONPATH.
For a concrete example:
export PYTHONPATH=/path/to/my/code:$PYTHONPATH
# airflow.cfg
plugins_folder = /path/to/my/code/airflow_plugins
# dag.py
import sys
from airflow_plugins.custom_timetable import CustomTimetable as Bad
from custom_timetable import CustomTimetable as Good
from airflow import plugins_manager
plugins_manager.initialize_timetables_plugins()
print(sys.path) # /path/to/my/code:...:/path/to/my/code/airflow_plugins
print(plugins_manager.as_importable_string(Bad)) # airflow_plugins.custom_timetable.CustomTimetable
print(plugins_manager.as_importable_string(Good)) # custom_timetable.CustomTimetable
print(plugins_manager.timetable_classes) # {'custom_timetable.CustomTimetable': <class 'custom_timetable.CustomTimetable'>}
A bad lookup in plugins_manager.timetable_classes is ultimately what ends up raising the _TimetableNotRegistered error, so the fix is to make the keys match by changing how the timetable is imported.
I submitted a bug report: https://github.com/apache/airflow/issues/21259
I'm using Hydra for training machine learning models. It's great for doing complex commands like python train.py data=MNIST batch_size=64 loss=l2. However, if I want to then run the trained model with the same parameters, I have to do something like python reconstruct.py --config_file path_to_previous_job/.hydra/config.yaml. I then use argparse to load in the previous yaml and use the compose API to initialize the Hydra environment. The path to the trained model is inferred from the path to Hydra's .yaml file. If I want to modify one of the parameters, I have to add additional argparse parameters and run something like python reconstruct.py --config_file path_to_previous_job/.hydra/config.yaml --batch_size 128. The code then manually overrides any Hydra parameters with those that were specified on the command line.
What's the right way of doing this?
My current code looks something like the following:
train.py:
import hydra
#hydra.main(config_name="config", config_path="conf")
def main(cfg):
# [training code using cfg.data, cfg.batch_size, cfg.loss etc.]
# [code outputs model checkpoint to job folder generated by Hydra]
main()
reconstruct.py:
import argparse
import os
from hydra.experimental import initialize, compose
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('hydra_config')
parser.add_argument('--batch_size', type=int)
# [other flags and parameters I may need to override]
args = parser.parse_args()
# Create the Hydra environment.
initialize()
cfg = compose(config_name=args.hydra_config)
# Since checkpoints are stored next to the .hydra, we manually generate the path.
checkpoint_dir = os.path.dirname(os.path.dirname(args.hydra_config))
# Manually override any parameters which can be changed on the command line.
batch_size = args.batch_size if args.batch_size else cfg.data.batch_size
# [code which uses checkpoint_dir to load the model]
# [code which uses both batch_size and params in cfg to set up the data etc.]
This is my first time posting, so let me know if I should clarify anything.
If you want to load the previous config as is and not change it, use OmegaConf.load(file_path).
If you want to re-compose the config (and it sounds like you do, because you added that you want override things), I recommend that you use the Compose API and pass in parameters from the overrides file in the job output directory (next to the stored config.yaml), but concatenate the current run parameters.
This script seems to be doing the job:
import os
from dataclasses import dataclass
from os.path import join
from typing import Optional
from omegaconf import OmegaConf
import hydra
from hydra import compose
from hydra.core.config_store import ConfigStore
from hydra.core.hydra_config import HydraConfig
from hydra.utils import to_absolute_path
# You can also use a yaml config file instead of this Structured Config
#dataclass
class Config:
load_checkpoint: Optional[str] = None
batch_size: int = 16
loss: str = "l2"
cs = ConfigStore.instance()
cs.store(name="config", node=Config)
#hydra.main(config_path=".", config_name="config")
def my_app(cfg: Config) -> None:
if cfg.load_checkpoint is not None:
output_dir = to_absolute_path(cfg.load_checkpoint)
original_overrides = OmegaConf.load(join(output_dir, ".hydra/overrides.yaml"))
current_overrides = HydraConfig.get().overrides.task
hydra_config = OmegaConf.load(join(output_dir, ".hydra/hydra.yaml"))
# getting the config name from the previous job.
config_name = hydra_config.hydra.job.config_name
# concatenating the original overrides with the current overrides
overrides = original_overrides + current_overrides
# compose a new config from scratch
cfg = compose(config_name, overrides=overrides)
# train
print("Running in ", os.getcwd())
print(OmegaConf.to_yaml(cfg))
if __name__ == "__main__":
my_app()
~/tmp$ python train.py
Running in /home/omry/tmp/outputs/2021-04-19/21-23-13
load_checkpoint: null
batch_size: 16
loss: l2
~/tmp$ python train.py load_checkpoint=/home/omry/tmp/outputs/2021-04-19/21-23-13
Running in /home/omry/tmp/outputs/2021-04-19/21-23-22
load_checkpoint: /home/omry/tmp/outputs/2021-04-19/21-23-13
batch_size: 16
loss: l2
~/tmp$ python train.py load_checkpoint=/home/omry/tmp/outputs/2021-04-19/21-23-13 batch_size=32
Running in /home/omry/tmp/outputs/2021-04-19/21-23-28
load_checkpoint: /home/omry/tmp/outputs/2021-04-19/21-23-13
batch_size: 32
loss: l2
I have a stored XCom value that I wanted to pass to another python function which is not called using PythonOperator.
def sql_file_template():
<some code which uses xcom variable>
def call_stored_proc(**kwargs):
#project = kwargs['row_id']
print("INSIDE CALL STORE PROC ------------")
query = """CALL `{0}.dataset_name.store_proc`(
'{1}' # source table
, ['{2}'] # row_ids
, '{3}' # pivot_col_name
, '{4}' # pivot_col_value
, 100 # max_columns
, 'MAX' # aggregation
);"""
query = query.format(kwargs['project'],kwargs['source_tbl'] ,kwargs['row_id'],kwargs['pivot_col'],kwargs['pivot_val'])
job = client.query(query, location="US")
for result in job.result():
task_instance = kwargs['task_instance']
task_instance.xcom_push(key='query_string', value=result)
print result
return result
bq_cmd = PythonOperator (
task_id= 'task1'
provide_context= True,
python_callable= call_stored_proc,
op_kwargs= {'project' : project,
'source_tbl' : source_tbl,
'row_id' : row_id,
'pivot_col' : pivot_col,
'pivot_val' : pivot_val
},
dag= dag
)
dummy_operator >> bq_cmd
sql_file_template()
The output of stored proc is a string which is captured using xcom.
Now I would like to pass this value to some python function sql_file_template without using PythonOperator.
As per Airflow documentation xcom can be accessed only between tasks.
Can anyone help on this?
If you have access to the Airflow installation you'd like to query (configuration, database access, and code) you can use Airflow's airflow.models.XCom:get_one class method:
from datetime import datetime
from airflow.models import XCom
execution_date = datetime(2020, 8, 28)
xcom_value = XCom.get_one(execution_date=execution_date,
task_id="the_task_id",
dag_id="the_dag_id")
So you want to access XCOM outside Airflow (probably a different project / module, without creating any Airflow DAGs / tasks)?
Airflow uses SQLAlchemy for mapping all it's models (including XCOM) to corresponding SQLAlchemy backend (meta-db) tables
Therefore this can be done in two ways
Leverage Airflow's SQLAlchemy model
(without having to create a task or DAG). Here's an untested code snippet for reference
from typing import List
from airflow.models import XCom
from airflow.settings import Session
from airflow.utils.db import provide_session
from pendulum import Pendulum
#provide_session
def read_xcom_values(dag_id: str,
task_id: str,
execution_date: Pendulum,
session: Optional[Session]) -> List[str]:
"""
Function that reads and returns 'values' of XCOMs with given filters
:param dag_id:
:param task_id:
:param execution_date: datetime object
:param session: Airflow's SQLAlchemy Session (this param must not be passed, it will be automatically supplied by
'#provide_session' decorator)
:return:
"""
# read XCOMs
xcoms: List[XCom] = session.query(XCom).filter(
XCom.dag_id == dag_id, XCom.task_id == task_id,
XCom.execution_date == execution_date).all()
# retrive 'value' fields from XCOMs
xcom_values: List[str] = list(map(lambda xcom: xcom.value, xcoms))
return xcom_values
Do note that since it is importing airflow packages, it still requires working airflow installation on python classpath (as well as connection to backend-db), but here we are not creating any tasks or dags (this snippet can be run in a standalone python file)
For this snippet, I have referred to views.py which is my favorite place to peek into Airflow's SQLAlchemy magic
Directly query Airflow's SQLAlchemy backend meta-db
Connect to meta db and run this query
SELECT value FROM xcom WHERE dag_id='' AND task_id='' AND ..
I'm running 5 DAG's which have generated a total of about 6GB of log data in the base_log_folder over a months period. I just added a remote_base_log_folder but it seems it does not exclude logging to the base_log_folder.
Is there anyway to automatically remove old log files, rotate them or force airflow to not log on disk (base_log_folder) only in remote storage?
Please refer https://github.com/teamclairvoyant/airflow-maintenance-dags
This plugin has DAGs that can kill halted tasks and log-cleanups.
You can grab the concepts and can come up with a new DAG that can cleanup as per your requirement.
We remove the Task logs by implementing our own FileTaskHandler, and then pointing to it in the airflow.cfg. So, we overwrite the default LogHandler to keep only N task logs, without scheduling additional DAGs.
We are using Airflow==1.10.1.
[core]
logging_config_class = log_config.LOGGING_CONFIG
log_config.LOGGING_CONFIG
BASE_LOG_FOLDER = conf.get('core', 'BASE_LOG_FOLDER')
FOLDER_TASK_TEMPLATE = '{{ ti.dag_id }}/{{ ti.task_id }}'
FILENAME_TEMPLATE = '{{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log'
LOGGING_CONFIG = {
'formatters': {},
'handlers': {
'...': {},
'task': {
'class': 'file_task_handler.FileTaskRotationHandler',
'formatter': 'airflow.job',
'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
'filename_template': FILENAME_TEMPLATE,
'folder_task_template': FOLDER_TASK_TEMPLATE,
'retention': 20
},
'...': {}
},
'loggers': {
'airflow.task': {
'handlers': ['task'],
'level': JOB_LOG_LEVEL,
'propagate': False,
},
'airflow.task_runner': {
'handlers': ['task'],
'level': LOG_LEVEL,
'propagate': True,
},
'...': {}
}
}
file_task_handler.FileTaskRotationHandler
import os
import shutil
from airflow.utils.helpers import parse_template_string
from airflow.utils.log.file_task_handler import FileTaskHandler
class FileTaskRotationHandler(FileTaskHandler):
def __init__(self, base_log_folder, filename_template, folder_task_template, retention):
"""
:param base_log_folder: Base log folder to place logs.
:param filename_template: template filename string.
:param folder_task_template: template folder task path.
:param retention: Number of folder logs to keep
"""
super(FileTaskRotationHandler, self).__init__(base_log_folder, filename_template)
self.retention = retention
self.folder_task_template, self.folder_task_template_jinja_template = \
parse_template_string(folder_task_template)
#staticmethod
def _get_directories(path='.'):
return next(os.walk(path))[1]
def _render_folder_task_path(self, ti):
if self.folder_task_template_jinja_template:
jinja_context = ti.get_template_context()
return self.folder_task_template_jinja_template.render(**jinja_context)
return self.folder_task_template.format(dag_id=ti.dag_id, task_id=ti.task_id)
def _init_file(self, ti):
relative_path = self._render_folder_task_path(ti)
folder_task_path = os.path.join(self.local_base, relative_path)
subfolders = self._get_directories(folder_task_path)
to_remove = set(subfolders) - set(subfolders[-self.retention:])
for dir_to_remove in to_remove:
full_dir_to_remove = os.path.join(folder_task_path, dir_to_remove)
print('Removing', full_dir_to_remove)
shutil.rmtree(full_dir_to_remove)
return FileTaskHandler._init_file(self, ti)
Airflow maintainers don't think truncating logs is a part of airflow core logic, to see this, and then in this issue, maintainers suggest to change LOG_LEVEL avoid too many log data.
And in this PR, we can learn how to change log level in airflow.cfg.
good luck.
I know it sounds savage, but have you tried pointing base_log_folder to /dev/null? I use Airflow as a part of a container, so I don't care about the files either, as long as the logger pipe to STDOUT as well.
Not sure how well this plays with S3 though.
For your concrete problems, I have some suggestions.
For those, you would always need a specialized logging config as described in this answer: https://stackoverflow.com/a/54195537/2668430
automatically remove old log files and rotate them
I don't have any practical experience with the TimedRotatingFileHandler from the Python standard library yet, but you might give it a try:
https://docs.python.org/3/library/logging.handlers.html#timedrotatingfilehandler
It not only offers to rotate your files based on a time interval, but if you specify the backupCount parameter, it even deletes your old log files:
If backupCount is nonzero, at most backupCount files will be kept, and if more would be created when rollover occurs, the oldest one is deleted. The deletion logic uses the interval to determine which files to delete, so changing the interval may leave old files lying around.
Which sounds pretty much like the best solution for your first problem.
force airflow to not log on disk (base_log_folder), but only in remote storage?
In this case you should specify the logging config in such a way that you do not have any logging handlers that write to a file, i.e. remove all FileHandlers.
Rather, try to find logging handlers that send the output directly to a remote address.
E.g. CMRESHandler which logs directly to ElasticSearch but needs some extra fields in the log calls.
Alternatively, write your own handler class and let it inherit from the Python standard library's HTTPHandler.
A final suggestion would be to combine both the TimedRotatingFileHandler and setup ElasticSearch together with FileBeat, so you would be able to store your logs inside ElasticSearch (i.e. remote), but you wouldn't store a huge amount of logs on your Airflow disk since they will be removed by the backupCount retention policy of your TimedRotatingFileHandler.
Usually apache airflow grab the disk space due to 3 reasons
1. airflow scheduler logs files
2. mysql binaly logs [Major]
3. xcom table records.
To make it clean up on regular basis I have set up a dag which run on daily basis and cleans the binary logs and truncate the xcom table to make the disk space free
You also might need to install [pip install mysql-connector-python].
To clean up scheduler log files I do delete them manually two times in a week to avoid the risk of logs deleted which needs to be required for some reasons.
I clean the logs files by [sudo rm -rd airflow/logs/] command.
Below is my python code for reference
'
"""Example DAG demonstrating the usage of the PythonOperator."""
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
from airflow.utils.dates import days_ago
from airflow.operators.bash import BashOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
args = {
'owner': 'airflow',
'email_on_failure':True,
'retries': 1,
'email':['Your Email Id'],
'retry_delay': timedelta(minutes=5)
}
dag = DAG(
dag_id='airflow_logs_cleanup',
default_args=args,
schedule_interval='#daily',
start_date=days_ago(0),
catchup=False,
max_active_runs=1,
tags=['airflow_maintenance'],
)
def truncate_table():
import mysql.connector
connection = mysql.connector.connect(host='localhost',
database='db_name',
user='username',
password='your password',
auth_plugin='mysql_native_password')
cursor = connection.cursor()
sql_select_query = """TRUNCATE TABLE xcom"""
cursor.execute(sql_select_query)
connection.commit()
connection.close()
print("XCOM Table truncated successfully")
def delete_binary_logs():
import mysql.connector
from datetime import datetime
date = datetime.today().strftime('%Y-%m-%d')
connection = mysql.connector.connect(host='localhost',
database='db_name',
user='username',
password='your_password',
auth_plugin='mysql_native_password')
cursor = connection.cursor()
query = 'PURGE BINARY LOGS BEFORE ' + "'" + str(date) + "'"
sql_select_query = query
cursor.execute(sql_select_query)
connection.commit()
connection.close()
print("Binary logs deleted successfully")
t1 = PythonOperator(
task_id='truncate_table',
python_callable=truncate_table, dag=dag
)
t2 = PythonOperator(
task_id='delete_binary_logs',
python_callable=delete_binary_logs, dag=dag
)
t2 << t1
'
I am surprized but it worked for me. Update your config as below:
base_log_folder=""
It is test in minio and in s3.
Our solution looks a lot like Franzi's:
Running on Airflow 2.0.1 (py3.8)
Override default logging configuration
Since we use a helm chart for airflow deployment it was easiest to push an env there, but it can also be done in the airflow.cfg or using ENV in dockerfile.
# Set custom logging configuration to enable log rotation for task logging
AIRFLOW__LOGGING__LOGGING_CONFIG_CLASS: "airflow_plugins.settings.airflow_local_settings.DEFAULT_LOGGING_CONFIG"
Then we added the logging configuration together with the custom log handler to a python module we build and install in the docker image. As described here: https://airflow.apache.org/docs/apache-airflow/stable/modules_management.html
Logging configuration snippet
This is only a copy on the default from the airflow codebase, but then the task logger gets a different handler.
DEFAULT_LOGGING_CONFIG: Dict[str, Any] = {
'version': 1,
'disable_existing_loggers': False,
'formatters': {
'airflow': {'format': LOG_FORMAT},
'airflow_coloured': {
'format': COLORED_LOG_FORMAT if COLORED_LOG else LOG_FORMAT,
'class': COLORED_FORMATTER_CLASS if COLORED_LOG else 'logging.Formatter',
},
},
'handlers': {
'console': {
'class': 'airflow.utils.log.logging_mixin.RedirectStdHandler',
'formatter': 'airflow_coloured',
'stream': 'sys.stdout',
},
'task': {
'class': 'airflow_plugins.log.rotating_file_task_handler.RotatingFileTaskHandler',
'formatter': 'airflow',
'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
'filename_template': FILENAME_TEMPLATE,
'maxBytes': 10485760, # 10MB
'backupCount': 6,
},
...
RotatingFileTaskHandler
And finally the custom handler which is just a merge of the logging.handlers.RotatingFileHandler and the FileTaskHandler.
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
"""File logging handler for tasks."""
import logging
import os
from pathlib import Path
from typing import TYPE_CHECKING, Optional
import requests
from airflow.configuration import AirflowConfigException, conf
from airflow.utils.helpers import parse_template_string
if TYPE_CHECKING:
from airflow.models import TaskInstance
class RotatingFileTaskHandler(logging.Handler):
"""
FileTaskHandler is a python log handler that handles and reads
task instance logs. It creates and delegates log handling
to `logging.FileHandler` after receiving task instance context.
It reads logs from task instance's host machine.
:param base_log_folder: Base log folder to place logs.
:param filename_template: template filename string
"""
def __init__(self, base_log_folder: str, filename_template: str, maxBytes=0, backupCount=0):
self.max_bytes = maxBytes
self.backup_count = backupCount
super().__init__()
self.handler = None # type: Optional[logging.FileHandler]
self.local_base = base_log_folder
self.filename_template, self.filename_jinja_template = parse_template_string(filename_template)
def set_context(self, ti: "TaskInstance"):
"""
Provide task_instance context to airflow task handler.
:param ti: task instance object
"""
local_loc = self._init_file(ti)
self.handler = logging.handlers.RotatingFileHandler(
filename=local_loc,
mode='a',
maxBytes=self.max_bytes,
backupCount=self.backup_count,
encoding='utf-8',
delay=False,
)
if self.formatter:
self.handler.setFormatter(self.formatter)
self.handler.setLevel(self.level)
def emit(self, record):
if self.handler:
self.handler.emit(record)
def flush(self):
if self.handler:
self.handler.flush()
def close(self):
if self.handler:
self.handler.close()
def _render_filename(self, ti, try_number):
if self.filename_jinja_template:
if hasattr(ti, 'task'):
jinja_context = ti.get_template_context()
jinja_context['try_number'] = try_number
else:
jinja_context = {
'ti': ti,
'ts': ti.execution_date.isoformat(),
'try_number': try_number,
}
return self.filename_jinja_template.render(**jinja_context)
return self.filename_template.format(
dag_id=ti.dag_id,
task_id=ti.task_id,
execution_date=ti.execution_date.isoformat(),
try_number=try_number,
)
def _read_grouped_logs(self):
return False
def _read(self, ti, try_number, metadata=None): # pylint: disable=unused-argument
"""
Template method that contains custom logic of reading
logs given the try_number.
:param ti: task instance record
:param try_number: current try_number to read log from
:param metadata: log metadata,
can be used for steaming log reading and auto-tailing.
:return: log message as a string and metadata.
"""
# Task instance here might be different from task instance when
# initializing the handler. Thus explicitly getting log location
# is needed to get correct log path.
log_relative_path = self._render_filename(ti, try_number)
location = os.path.join(self.local_base, log_relative_path)
log = ""
if os.path.exists(location):
try:
with open(location) as file:
log += f"*** Reading local file: {location}\n"
log += "".join(file.readlines())
except Exception as e: # pylint: disable=broad-except
log = f"*** Failed to load local log file: {location}\n"
log += "*** {}\n".format(str(e))
elif conf.get('core', 'executor') == 'KubernetesExecutor': # pylint: disable=too-many-nested-blocks
try:
from airflow.kubernetes.kube_client import get_kube_client
kube_client = get_kube_client()
if len(ti.hostname) >= 63:
# Kubernetes takes the pod name and truncates it for the hostname. This truncated hostname
# is returned for the fqdn to comply with the 63 character limit imposed by DNS standards
# on any label of a FQDN.
pod_list = kube_client.list_namespaced_pod(conf.get('kubernetes', 'namespace'))
matches = [
pod.metadata.name
for pod in pod_list.items
if pod.metadata.name.startswith(ti.hostname)
]
if len(matches) == 1:
if len(matches[0]) > len(ti.hostname):
ti.hostname = matches[0]
log += '*** Trying to get logs (last 100 lines) from worker pod {} ***\n\n'.format(
ti.hostname
)
res = kube_client.read_namespaced_pod_log(
name=ti.hostname,
namespace=conf.get('kubernetes', 'namespace'),
container='base',
follow=False,
tail_lines=100,
_preload_content=False,
)
for line in res:
log += line.decode()
except Exception as f: # pylint: disable=broad-except
log += '*** Unable to fetch logs from worker pod {} ***\n{}\n\n'.format(ti.hostname, str(f))
else:
url = os.path.join("http://{ti.hostname}:{worker_log_server_port}/log", log_relative_path).format(
ti=ti, worker_log_server_port=conf.get('celery', 'WORKER_LOG_SERVER_PORT')
)
log += f"*** Log file does not exist: {location}\n"
log += f"*** Fetching from: {url}\n"
try:
timeout = None # No timeout
try:
timeout = conf.getint('webserver', 'log_fetch_timeout_sec')
except (AirflowConfigException, ValueError):
pass
response = requests.get(url, timeout=timeout)
response.encoding = "utf-8"
# Check if the resource was properly fetched
response.raise_for_status()
log += '\n' + response.text
except Exception as e: # pylint: disable=broad-except
log += "*** Failed to fetch log file from worker. {}\n".format(str(e))
return log, {'end_of_log': True}
def read(self, task_instance, try_number=None, metadata=None):
"""
Read logs of given task instance from local machine.
:param task_instance: task instance object
:param try_number: task instance try_number to read logs from. If None
it returns all logs separated by try_number
:param metadata: log metadata,
can be used for steaming log reading and auto-tailing.
:return: a list of listed tuples which order log string by host
"""
# Task instance increments its try number when it starts to run.
# So the log for a particular task try will only show up when
# try number gets incremented in DB, i.e logs produced the time
# after cli run and before try_number + 1 in DB will not be displayed.
if try_number is None:
next_try = task_instance.next_try_number
try_numbers = list(range(1, next_try))
elif try_number < 1:
logs = [
[('default_host', f'Error fetching the logs. Try number {try_number} is invalid.')],
]
return logs, [{'end_of_log': True}]
else:
try_numbers = [try_number]
logs = [''] * len(try_numbers)
metadata_array = [{}] * len(try_numbers)
for i, try_number_element in enumerate(try_numbers):
log, metadata = self._read(task_instance, try_number_element, metadata)
# es_task_handler return logs grouped by host. wrap other handler returning log string
# with default/ empty host so that UI can render the response in the same way
logs[i] = log if self._read_grouped_logs() else [(task_instance.hostname, log)]
metadata_array[i] = metadata
return logs, metadata_array
def _init_file(self, ti):
"""
Create log directory and give it correct permissions.
:param ti: task instance object
:return: relative log path of the given task instance
"""
# To handle log writing when tasks are impersonated, the log files need to
# be writable by the user that runs the Airflow command and the user
# that is impersonated. This is mainly to handle corner cases with the
# SubDagOperator. When the SubDagOperator is run, all of the operators
# run under the impersonated user and create appropriate log files
# as the impersonated user. However, if the user manually runs tasks
# of the SubDagOperator through the UI, then the log files are created
# by the user that runs the Airflow command. For example, the Airflow
# run command may be run by the `airflow_sudoable` user, but the Airflow
# tasks may be run by the `airflow` user. If the log files are not
# writable by both users, then it's possible that re-running a task
# via the UI (or vice versa) results in a permission error as the task
# tries to write to a log file created by the other user.
relative_path = self._render_filename(ti, ti.try_number)
full_path = os.path.join(self.local_base, relative_path)
directory = os.path.dirname(full_path)
# Create the log file and give it group writable permissions
# TODO(aoen): Make log dirs and logs globally readable for now since the SubDag
# operator is not compatible with impersonation (e.g. if a Celery executor is used
# for a SubDag operator and the SubDag operator has a different owner than the
# parent DAG)
Path(directory).mkdir(mode=0o777, parents=True, exist_ok=True)
if not os.path.exists(full_path):
open(full_path, "a").close()
# TODO: Investigate using 444 instead of 666.
os.chmod(full_path, 0o666)
return full_path
Maybe a final note; the links in the airflow UI to the logging will now only open the latest logfile, not the older rotated files which are only accessible by means of SSH or any other interface to access the airflow logging path.
I don't think that there is a rotation mechanism but you can store them in S3 or google cloud storage as describe here : https://airflow.incubator.apache.org/configuration.html#logs