How to achieve the dynamic generation for tasks in airflow

How to achieve the dynamic generation for tasks in airflow - airflow

Basically, I want to achieve things like that.
I have no idea how to make sure the task_id can be called consistently. I used the PythonOperator and tried to call the kwargs.
One of my function is like:
def Dynamic_Structure(log_type, callableFunction, context, args):
# dyn_value = "{{ task_instance.xcom_pull(task_ids='Extract_{}_test'.format(log_type)) }}"
structure_logs = StrucFile(
task_id = 'Struc_{}_test'.format(log_type),
provide_context = True,
format_location = config.DATASET_FORMAT,
struc_location = config.DATASET_STR,
# log_key_hash_dict_location = dl_config.EXEC_WINDOWS_MIDDLE_KEY_HASH_DICT,
log_key_hash_dict_location = dl_config.DL_MODEL_DATASET/log_type/dl_config.EXEC_MIDDLE_KEY_HASH_DICT,
data_type = 'test',
xcom_push=True,
dag = dag,
op_kwargs=args,
log_sort = log_type,
python_callable = eval(callableFunction),
# the following parameters depend on the format of raw log
regex = config.STRUCTURE_REGEX[log_type],
log_format = config.STRUCTURE_LOG_FORMAT[log_type],
columns = config.STRUCTURE_COLUMNS[log_type]
)
return structure_logs
Then I call it like:
for log_type in log_types:
...
structure_logs_task = Dynamic_Structure(log_type, 'values_function', {'previous_task_id': 'Extract_{}_test'.format(log_type)})
But I can not get the performances like the chart.
I am confused about how to call pervious_task_id. Xcom_pull and Xcom_push would help, but I have no idea how to use it in PythonOperator.

The problem is not the code.
I tried to read environment from .env for the log_types. But I did not find the right way to read environment variable in docker images.
It should be:
In docker-compose.yml
version: "x"
services:
xxx:
build:
# there is the space between context and .
context: .
dockerfile: ./Dockerfile
args:
- PORT=${PORT}
volumes:
...
In Dockerfile
FROM xx
ARG PORT
ENV PORT "$PORT"
EXPOSE ${PORT}
...
In the root folder, you can define the ARG in .env.

Related

Airflow set DAG options conditionally

I'm implementing a python script to create a bunch of Airflow dag based on json config files. One json config file contains all the fields to be used in DAG(), and the last three fields are optional(will use global default if not set).
{
"owner": "Mike",
"start_date": "2022-04-10",
"schedule_interval": "0 0 * * *",
"on_failure_callback": "slack",
"is_paused_upon_creation": False,
"catchup": True
}
Now, my question is how to create the DAG conditionally? Since the last three option on_failure_callback, is_paused_upon_creation and catchup is optional, wonder what's the best way to use them in DAG()?
One solution_1 I tried is to use default_arg=optional_fields, and add optional fields into it with an if statement. However, it doesn't work. The DAG is not taking these three optional fields' values.
def create_dag(name, config):
# config is a dict that generate from the json config file
optional_fields = {
'owner': config['owner']
}
if 'on_failure_callback' in config:
optional_fields['on_failure_callback'] = partial(xxx(config['on_failure_callback']))
if 'is_paused_upon_creation' in config:
optional_fields['is_paused_upon_creation'] = config['is_paused_upon_creation']
dag = DAG(
dag_id=name,
start_date=datetime.strptime(config['start_date'], '%Y-%m-%d'),
schedule_interval=config['schedule_interval'],
default_args=optional_fields
)
Then, I tried solution_2 with **optional_fields, but got an error TypeError: __init__() got an unexpected keyword argument 'owner'
dag = DAG(
dag_id=name,
start_date=datetime.strptime(config['start_date'], '%Y-%m-%d'),
schedule_interval=config['schedule_interval'],
**optional_fields
)
Then solution_3 works as the following.
def create_dag(name, config):
# config is a dict that generate from the json config file
default_args = {
'owner': config['owner']
}
optional_fields = {}
if 'on_failure_callback' in config:
optional_fields['on_failure_callback'] = partial(xxx(config['on_failure_callback']))
if 'is_paused_upon_creation' in config:
optional_fields['is_paused_upon_creation'] = config['is_paused_upon_creation']
dag = DAG(
dag_id=name,
start_date=datetime.strptime(config['start_date'], '%Y-%m-%d'),
schedule_interval=config['schedule_interval'],
default_args=default_args
**optional_fields
)
However, I'm confused 1) which fields should be set in optional_fields vs default_args? 2) is there any other way to achieve it?

How to see the templated values in airflow?

When I do something like:
some_value = "{{ dag_run.get_task_instance('start').start_date }}"
print(f"some interpolated value: {some_value}")
I see this in the airflow logs:
some interpolated value: {{ dag_run.get_task_instance('start').start_date }}
but not the actual value itself. How can I easily see what the value is?

Everything in the DAG task run comes through as kwargs (before 1.10.12 you needed to add provide_context, but all context is provided after version 2).
To get something out of kwargs, do something like this in your python callable:
run_id = kwargs['run_id']
print(f'run_id = {run_id}')
Additional info:
To get the kwargs out, add them to your callable, so:
def my_func(**kwargs):
run_id = kwargs['run_id']
print(f'run_id = {run_id}')
And just call this from your DAG task like:
my_task = PythonOperator(
task_id='my_task'
, dag=dag
, python_callable=my_func)
I'm not sure what your current code structure is because you haven't provided more info I'm afraid.

Airflow how to set default values for dag_run.conf

I'm trying to setup an Airflow DAG that provides default values available from dag_run.conf. This works great when running the DAG from the webUI, using the "Run w/ Config" option. However when running on the schedule, the dag_run.conf dict is not present, and the task will fail, e.g.
jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'key1'
Below is an example job.
Is it possible to make it so that dag_run.conf always contains the dict defined by params here?
from airflow import DAG
from airflow.utils.dates import hours_ago
from airflow.operators.bash import BashOperator
from datetime import timedelta
def do_something(val1: str, val2: str) -> str:
return f'echo vars are: "{val1}, {val2}"'
params = {
'key1': 'def1',
'key2': 'def2',
}
default_args = {
'retries': 0,
}
with DAG(
'template_test',
default_args=default_args,
schedule_interval=timedelta(minutes=1),
start_date=hours_ago(1),
params = params,
) as dag:
hello_t = BashOperator(
task_id='example-command',
bash_command=do_something('{{dag_run.conf["key1"]}}', '{{dag_run.conf["key2"]}}'),
config=params,
)
The closest I've seen is in For Apache Airflow, How can I pass the parameters when manually trigger DAG via CLI?, however there they leverage Jinja and if/else - which would require defining these default parameters twice. I'd like to define them only once.

You could use DAG params to achieve what you are looking for:
params (dict) – a dictionary of DAG level parameters that are made accessible in templates, namespaced under params. These params can be overridden at the task level.
You can define params at DAG or Task levels and also add or modify them from the UI in the Trigger DAG w/ config section.
Example DAG:
default_args = {
"owner": "airflow",
}
dag = DAG(
dag_id="example_dag_params",
default_args=default_args,
schedule_interval="*/5 * * * *",
start_date=days_ago(1),
params={"param1": "first_param"},
catchup=False,
)
with dag:
bash_task = BashOperator(
task_id="bash_task", bash_command="echo bash_task: {{ params.param1 }}"
)
Output log:
[2021-10-02 20:23:25,808] {logging_mixin.py:104} INFO - Running <TaskInstance: example_dag_params.bash_task 2021-10-02T23:15:00+00:00 [running]> on host worker_01
[2021-10-02 20:23:25,867] {taskinstance.py:1302} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=***
AIRFLOW_CTX_DAG_ID=example_dag_params
AIRFLOW_CTX_TASK_ID=bash_task
AIRFLOW_CTX_EXECUTION_DATE=2021-10-02T23:15:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2021-10-02T23:15:00+00:00
[2021-10-02 20:23:25,870] {subprocess.py:52} INFO - Tmp dir root location:
/tmp
[2021-10-02 20:23:25,871] {subprocess.py:63} INFO - Running command: ['bash', '-c', 'echo bash_task: first_param']
[2021-10-02 20:23:25,884] {subprocess.py:74} INFO - Output:
[2021-10-02 20:23:25,886] {subprocess.py:78} INFO - bash_task: first_param
[2021-10-02 20:23:25,887] {subprocess.py:82} INFO - Command exited with return code 0
From the logs, notice that the dag_run is scheduled and the params are still there.
You can find a more extensive example on using parameters in this answer.
Hope that works for you!

How to configure environment variables in Azure DevOps pipeline?

I have an Azure Function (.NET Core) that is configured to read application settings from both a JSON file and environment variables:
var configurationBuilder = new ConfigurationBuilder()
.SetBasePath(_baseConfigurationPath)
.AddJsonFile("appsettings.json", optional: true)
.AddEnvironmentVariables()
.Build();
BuildAgentMonitorConfiguration configuration = configurationBuilder.Get<BuildAgentMonitorConfiguration>();
appsettings.json has the following structure:
{
"ProjectBaseUrl": "https://my-project.visualstudio.com/",
"ProjectName": "my-project",
"AzureDevOpsPac": ".....",
"SubscriptionId": "...",
"AgentPool": {
"PoolId": 38,
"PoolName": "MyPool",
"MinimumAgentCount": 2,
"MaximumAgentCount": 10
},
"ContainerRegistry": {
"Username": "mycontainer",
"LoginServer": "mycontainer.azurecr.io",
"Password": "..."
},
"ActiveDirectory": {
"ClientId": "...",
"TenantId": "...",
"ClientSecret": "..."
}
}
Some of these settings are configured as environment variables in the Azure Function. Everything works as expected:
The problem now is to configure some of these variables in a build pipeline, which are used in unit and integration tests. I've tried adding a variable group as follows and linking it to the pipeline:
But the environment variables are not being set and the tests are failing. What am I missing here?

I also have the same use case in which I want some environment variable to be set up using the azure build pipeline so that the test cases can access that environment variable to get the test passed.
Directly setting the env variable using the EXPORT,ENV command does not work for the subsequent task so to have the environment variable set up for subsequent task follow the syntax as mentioned on https://learn.microsoft.com/en-us/azure/devops/pipelines/process/variables?view=azure-devops&tabs=yaml%2Cbatch
ie the
task.set variable with the script tag
Correct way of setting ENV variable using build pipeline
- script: |
echo '##vso[task.setvariable variable=LD_LIBRARY_PATH]$(Build.SourcesDirectory)/src/Projectname/bin/Release/netcoreapp2.0/x64'
displayName: set environment variable for subsequent steps
Please be careful of the spaces as its is yaml. The above script tags set up the variable LD_LIBRARY_PATH (used in Linux to define path for .so files) to the directory defined.
This style of setting the environment variable works for subsequent task also , but if we set the env variable like mentioned below the enviroment variable will be set for the specefic shell instance and will not be applicable for subsequent tasks
Wrong way of setting env variable :
- script: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(Build.SourcesDirectory)/src/CorrectionLoop.HttpApi/bin/Release/netcoreapp2.0/x64
displayName: Set environment variable
You can use the similar syntax for the setting up your environment variable.

I ran into this as well when generating EF SQL scripts from a build task. According to the docs, the variables you define in the "Variables" tab are also provided to the process as environment variables.
Notice that variables are also made available to scripts through environment variables. The syntax for using these environment variables depends on the scripting language. Name is upper-cased, . replaced with _, and automatically inserted into the process environment
For my instance, I just had to load a connection string, but deal with the case difference of the key between the json file and the environment:
var config = new ConfigurationBuilder()
.AddJsonFile("appsettings.json", true, true)
.AddEnvironmentVariables()
.Build();
var connectionString = config["connectionString"] ?? config["CONNECTIONSTRING"];

If you are using bash then their example does not work, as they are referring incorrectly to the variables in the documentation. Instead it should be:
Store secret
#!/bin/bash
echo "##vso[task.setvariable variable=sauce]crushed tomatoes"
echo "##vso[task.setvariable variable=secret.Sauce;issecret=true]crushed tomatoes with garlic"
Retrieve secret
Wrong: Their example
#!/bin/bash
echo "No problem reading $1 or $SAUCE"
echo "But I cannot read $SECRET_SAUCE"
echo "But I can read $2 (but the log is redacted so I do not spoil the secret)"
Right:
#!/bin/bash
echo "No problem reading $(sauce)"
echo "But I cannot read $(secret.Sauce)"

Configure KeyVault Secrets in Variable Groups for FileTransform#1
The below will read KeyVault Secrets used in a Variable group and add them to the Environments Variables for FileTransform#1 to use
setup KeyVault
Create Variable Group and import the values you want to use for the Pipeline.
In this example we used:
- ConnectionStrings--Context
- Cloud--AuthSecret
- Compumail--ApiPassword
Setup names to match keyVault names: (you can pass these into the yml steps template)
#These parameters are here to support Library > Variable Groups > with "secrets" from a KeyVault
#KeyVault keys cannot contain "_" or "." as FileTransform1# wants
#This script takes "--" keys and replaces them with "." and adds them into "env:" variables so Transform can do it's thing.
parameters:
- name: apiSecretKeys
displayName: apiSecretKeys
type: object
default:
- ConnectionStrings--Context
- Cloud--AuthSecret
- Compumail--ApiPassword
stages:
- template: ./locationOfTemplate.yml
parameters:
apiSecretKeys: ${{ parameters.apiSecretKeys }}
... build api - publish to .zip file
Setup Variable groups on "job level"
variables:
#next line here for pipeline validation purposes..
- ${{if parameters.variableGroup}}:
- group: ${{parameters.variableGroup}}
#OR
#- VariableGroupNameContinaingSecrets
Template file: (the magic)
parameters:
- name: apiSecretKeys
displayName: apiSecretKeys
type: object
default: []
steps:
- ${{if parameters.apiSecretKeys}}:
- powershell: |
$envs = Get-childItem env:
$envs | Format-List
displayName: List Env Vars Before
- ${{ each key in parameters.apiSecretKeys }}:
- powershell: |
$writeKey = "${{key}}".Replace('--','.')
Write-Host "oldKey :" ${{key}}
Write-Host "value :" "$(${{key}})"
Write-Host "writeKey:" $writeKey
Write-Host "##vso[task.setvariable variable=$writeKey]$(${{key}})"
displayName: Writing Dashes To LowDash for ${{key}}
- ${{ each key in parameters.apiSecretKeys }}:
- powershell: |
$readKey = "${{key}}".Replace('--','.')
Write-Host "oldKey :" ${{key}}
Write-Host "oldValue:" "$(${{key}})"
Write-Host "readKey :" $readKey
Write-Host "newValue:" ('$env:'+"$($readKey)" | Invoke-Expression)
displayName: Read From New Env Var for ${{key}}
- powershell: |
$envs = Get-childItem env:
$envs | Format-List
name: List Env Vars After adding secrets to env
Run Task - task: FileTransform#1
Enjoy ;)

Removing Airflow task logs

I'm running 5 DAG's which have generated a total of about 6GB of log data in the base_log_folder over a months period. I just added a remote_base_log_folder but it seems it does not exclude logging to the base_log_folder.
Is there anyway to automatically remove old log files, rotate them or force airflow to not log on disk (base_log_folder) only in remote storage?

Please refer https://github.com/teamclairvoyant/airflow-maintenance-dags
This plugin has DAGs that can kill halted tasks and log-cleanups.
You can grab the concepts and can come up with a new DAG that can cleanup as per your requirement.

We remove the Task logs by implementing our own FileTaskHandler, and then pointing to it in the airflow.cfg. So, we overwrite the default LogHandler to keep only N task logs, without scheduling additional DAGs.
We are using Airflow==1.10.1.
[core]
logging_config_class = log_config.LOGGING_CONFIG
log_config.LOGGING_CONFIG
BASE_LOG_FOLDER = conf.get('core', 'BASE_LOG_FOLDER')
FOLDER_TASK_TEMPLATE = '{{ ti.dag_id }}/{{ ti.task_id }}'
FILENAME_TEMPLATE = '{{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log'
LOGGING_CONFIG = {
'formatters': {},
'handlers': {
'...': {},
'task': {
'class': 'file_task_handler.FileTaskRotationHandler',
'formatter': 'airflow.job',
'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
'filename_template': FILENAME_TEMPLATE,
'folder_task_template': FOLDER_TASK_TEMPLATE,
'retention': 20
},
'...': {}
},
'loggers': {
'airflow.task': {
'handlers': ['task'],
'level': JOB_LOG_LEVEL,
'propagate': False,
},
'airflow.task_runner': {
'handlers': ['task'],
'level': LOG_LEVEL,
'propagate': True,
},
'...': {}
}
}
file_task_handler.FileTaskRotationHandler
import os
import shutil
from airflow.utils.helpers import parse_template_string
from airflow.utils.log.file_task_handler import FileTaskHandler
class FileTaskRotationHandler(FileTaskHandler):
def __init__(self, base_log_folder, filename_template, folder_task_template, retention):
"""
:param base_log_folder: Base log folder to place logs.
:param filename_template: template filename string.
:param folder_task_template: template folder task path.
:param retention: Number of folder logs to keep
"""
super(FileTaskRotationHandler, self).__init__(base_log_folder, filename_template)
self.retention = retention
self.folder_task_template, self.folder_task_template_jinja_template = \
parse_template_string(folder_task_template)
#staticmethod
def _get_directories(path='.'):
return next(os.walk(path))[1]
def _render_folder_task_path(self, ti):
if self.folder_task_template_jinja_template:
jinja_context = ti.get_template_context()
return self.folder_task_template_jinja_template.render(**jinja_context)
return self.folder_task_template.format(dag_id=ti.dag_id, task_id=ti.task_id)
def _init_file(self, ti):
relative_path = self._render_folder_task_path(ti)
folder_task_path = os.path.join(self.local_base, relative_path)
subfolders = self._get_directories(folder_task_path)
to_remove = set(subfolders) - set(subfolders[-self.retention:])
for dir_to_remove in to_remove:
full_dir_to_remove = os.path.join(folder_task_path, dir_to_remove)
print('Removing', full_dir_to_remove)
shutil.rmtree(full_dir_to_remove)
return FileTaskHandler._init_file(self, ti)

Airflow maintainers don't think truncating logs is a part of airflow core logic, to see this, and then in this issue, maintainers suggest to change LOG_LEVEL avoid too many log data.
And in this PR, we can learn how to change log level in airflow.cfg.
good luck.

I know it sounds savage, but have you tried pointing base_log_folder to /dev/null? I use Airflow as a part of a container, so I don't care about the files either, as long as the logger pipe to STDOUT as well.
Not sure how well this plays with S3 though.

For your concrete problems, I have some suggestions.
For those, you would always need a specialized logging config as described in this answer: https://stackoverflow.com/a/54195537/2668430
automatically remove old log files and rotate them
I don't have any practical experience with the TimedRotatingFileHandler from the Python standard library yet, but you might give it a try:
https://docs.python.org/3/library/logging.handlers.html#timedrotatingfilehandler
It not only offers to rotate your files based on a time interval, but if you specify the backupCount parameter, it even deletes your old log files:
If backupCount is nonzero, at most backupCount files will be kept, and if more would be created when rollover occurs, the oldest one is deleted. The deletion logic uses the interval to determine which files to delete, so changing the interval may leave old files lying around.
Which sounds pretty much like the best solution for your first problem.
force airflow to not log on disk (base_log_folder), but only in remote storage?
In this case you should specify the logging config in such a way that you do not have any logging handlers that write to a file, i.e. remove all FileHandlers.
Rather, try to find logging handlers that send the output directly to a remote address.
E.g. CMRESHandler which logs directly to ElasticSearch but needs some extra fields in the log calls.
Alternatively, write your own handler class and let it inherit from the Python standard library's HTTPHandler.
A final suggestion would be to combine both the TimedRotatingFileHandler and setup ElasticSearch together with FileBeat, so you would be able to store your logs inside ElasticSearch (i.e. remote), but you wouldn't store a huge amount of logs on your Airflow disk since they will be removed by the backupCount retention policy of your TimedRotatingFileHandler.

Usually apache airflow grab the disk space due to 3 reasons
1. airflow scheduler logs files
2. mysql binaly logs [Major]
3. xcom table records.
To make it clean up on regular basis I have set up a dag which run on daily basis and cleans the binary logs and truncate the xcom table to make the disk space free
You also might need to install [pip install mysql-connector-python].
To clean up scheduler log files I do delete them manually two times in a week to avoid the risk of logs deleted which needs to be required for some reasons.
I clean the logs files by [sudo rm -rd airflow/logs/] command.
Below is my python code for reference
'
"""Example DAG demonstrating the usage of the PythonOperator."""
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
from airflow.utils.dates import days_ago
from airflow.operators.bash import BashOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
args = {
'owner': 'airflow',
'email_on_failure':True,
'retries': 1,
'email':['Your Email Id'],
'retry_delay': timedelta(minutes=5)
}
dag = DAG(
dag_id='airflow_logs_cleanup',
default_args=args,
schedule_interval='#daily',
start_date=days_ago(0),
catchup=False,
max_active_runs=1,
tags=['airflow_maintenance'],
)
def truncate_table():
import mysql.connector
connection = mysql.connector.connect(host='localhost',
database='db_name',
user='username',
password='your password',
auth_plugin='mysql_native_password')
cursor = connection.cursor()
sql_select_query = """TRUNCATE TABLE xcom"""
cursor.execute(sql_select_query)
connection.commit()
connection.close()
print("XCOM Table truncated successfully")
def delete_binary_logs():
import mysql.connector
from datetime import datetime
date = datetime.today().strftime('%Y-%m-%d')
connection = mysql.connector.connect(host='localhost',
database='db_name',
user='username',
password='your_password',
auth_plugin='mysql_native_password')
cursor = connection.cursor()
query = 'PURGE BINARY LOGS BEFORE ' + "'" + str(date) + "'"
sql_select_query = query
cursor.execute(sql_select_query)
connection.commit()
connection.close()
print("Binary logs deleted successfully")
t1 = PythonOperator(
task_id='truncate_table',
python_callable=truncate_table, dag=dag
)
t2 = PythonOperator(
task_id='delete_binary_logs',
python_callable=delete_binary_logs, dag=dag
)
t2 << t1
'

I am surprized but it worked for me. Update your config as below:
base_log_folder=""
It is test in minio and in s3.

Our solution looks a lot like Franzi's:
Running on Airflow 2.0.1 (py3.8)
Override default logging configuration
Since we use a helm chart for airflow deployment it was easiest to push an env there, but it can also be done in the airflow.cfg or using ENV in dockerfile.
# Set custom logging configuration to enable log rotation for task logging
AIRFLOW__LOGGING__LOGGING_CONFIG_CLASS: "airflow_plugins.settings.airflow_local_settings.DEFAULT_LOGGING_CONFIG"
Then we added the logging configuration together with the custom log handler to a python module we build and install in the docker image. As described here: https://airflow.apache.org/docs/apache-airflow/stable/modules_management.html
Logging configuration snippet
This is only a copy on the default from the airflow codebase, but then the task logger gets a different handler.
DEFAULT_LOGGING_CONFIG: Dict[str, Any] = {
'version': 1,
'disable_existing_loggers': False,
'formatters': {
'airflow': {'format': LOG_FORMAT},
'airflow_coloured': {
'format': COLORED_LOG_FORMAT if COLORED_LOG else LOG_FORMAT,
'class': COLORED_FORMATTER_CLASS if COLORED_LOG else 'logging.Formatter',
},
},
'handlers': {
'console': {
'class': 'airflow.utils.log.logging_mixin.RedirectStdHandler',
'formatter': 'airflow_coloured',
'stream': 'sys.stdout',
},
'task': {
'class': 'airflow_plugins.log.rotating_file_task_handler.RotatingFileTaskHandler',
'formatter': 'airflow',
'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
'filename_template': FILENAME_TEMPLATE,
'maxBytes': 10485760, # 10MB
'backupCount': 6,
},
...
RotatingFileTaskHandler
And finally the custom handler which is just a merge of the logging.handlers.RotatingFileHandler and the FileTaskHandler.
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
"""File logging handler for tasks."""
import logging
import os
from pathlib import Path
from typing import TYPE_CHECKING, Optional
import requests
from airflow.configuration import AirflowConfigException, conf
from airflow.utils.helpers import parse_template_string
if TYPE_CHECKING:
from airflow.models import TaskInstance
class RotatingFileTaskHandler(logging.Handler):
"""
FileTaskHandler is a python log handler that handles and reads
task instance logs. It creates and delegates log handling
to `logging.FileHandler` after receiving task instance context.
It reads logs from task instance's host machine.
:param base_log_folder: Base log folder to place logs.
:param filename_template: template filename string
"""
def __init__(self, base_log_folder: str, filename_template: str, maxBytes=0, backupCount=0):
self.max_bytes = maxBytes
self.backup_count = backupCount
super().__init__()
self.handler = None # type: Optional[logging.FileHandler]
self.local_base = base_log_folder
self.filename_template, self.filename_jinja_template = parse_template_string(filename_template)
def set_context(self, ti: "TaskInstance"):
"""
Provide task_instance context to airflow task handler.
:param ti: task instance object
"""
local_loc = self._init_file(ti)
self.handler = logging.handlers.RotatingFileHandler(
filename=local_loc,
mode='a',
maxBytes=self.max_bytes,
backupCount=self.backup_count,
encoding='utf-8',
delay=False,
)
if self.formatter:
self.handler.setFormatter(self.formatter)
self.handler.setLevel(self.level)
def emit(self, record):
if self.handler:
self.handler.emit(record)
def flush(self):
if self.handler:
self.handler.flush()
def close(self):
if self.handler:
self.handler.close()
def _render_filename(self, ti, try_number):
if self.filename_jinja_template:
if hasattr(ti, 'task'):
jinja_context = ti.get_template_context()
jinja_context['try_number'] = try_number
else:
jinja_context = {
'ti': ti,
'ts': ti.execution_date.isoformat(),
'try_number': try_number,
}
return self.filename_jinja_template.render(**jinja_context)
return self.filename_template.format(
dag_id=ti.dag_id,
task_id=ti.task_id,
execution_date=ti.execution_date.isoformat(),
try_number=try_number,
)
def _read_grouped_logs(self):
return False
def _read(self, ti, try_number, metadata=None): # pylint: disable=unused-argument
"""
Template method that contains custom logic of reading
logs given the try_number.
:param ti: task instance record
:param try_number: current try_number to read log from
:param metadata: log metadata,
can be used for steaming log reading and auto-tailing.
:return: log message as a string and metadata.
"""
# Task instance here might be different from task instance when
# initializing the handler. Thus explicitly getting log location
# is needed to get correct log path.
log_relative_path = self._render_filename(ti, try_number)
location = os.path.join(self.local_base, log_relative_path)
log = ""
if os.path.exists(location):
try:
with open(location) as file:
log += f"*** Reading local file: {location}\n"
log += "".join(file.readlines())
except Exception as e: # pylint: disable=broad-except
log = f"*** Failed to load local log file: {location}\n"
log += "*** {}\n".format(str(e))
elif conf.get('core', 'executor') == 'KubernetesExecutor': # pylint: disable=too-many-nested-blocks
try:
from airflow.kubernetes.kube_client import get_kube_client
kube_client = get_kube_client()
if len(ti.hostname) >= 63:
# Kubernetes takes the pod name and truncates it for the hostname. This truncated hostname
# is returned for the fqdn to comply with the 63 character limit imposed by DNS standards
# on any label of a FQDN.
pod_list = kube_client.list_namespaced_pod(conf.get('kubernetes', 'namespace'))
matches = [
pod.metadata.name
for pod in pod_list.items
if pod.metadata.name.startswith(ti.hostname)
]
if len(matches) == 1:
if len(matches[0]) > len(ti.hostname):
ti.hostname = matches[0]
log += '*** Trying to get logs (last 100 lines) from worker pod {} ***\n\n'.format(
ti.hostname
)
res = kube_client.read_namespaced_pod_log(
name=ti.hostname,
namespace=conf.get('kubernetes', 'namespace'),
container='base',
follow=False,
tail_lines=100,
_preload_content=False,
)
for line in res:
log += line.decode()
except Exception as f: # pylint: disable=broad-except
log += '*** Unable to fetch logs from worker pod {} ***\n{}\n\n'.format(ti.hostname, str(f))
else:
url = os.path.join("http://{ti.hostname}:{worker_log_server_port}/log", log_relative_path).format(
ti=ti, worker_log_server_port=conf.get('celery', 'WORKER_LOG_SERVER_PORT')
)
log += f"*** Log file does not exist: {location}\n"
log += f"*** Fetching from: {url}\n"
try:
timeout = None # No timeout
try:
timeout = conf.getint('webserver', 'log_fetch_timeout_sec')
except (AirflowConfigException, ValueError):
pass
response = requests.get(url, timeout=timeout)
response.encoding = "utf-8"
# Check if the resource was properly fetched
response.raise_for_status()
log += '\n' + response.text
except Exception as e: # pylint: disable=broad-except
log += "*** Failed to fetch log file from worker. {}\n".format(str(e))
return log, {'end_of_log': True}
def read(self, task_instance, try_number=None, metadata=None):
"""
Read logs of given task instance from local machine.
:param task_instance: task instance object
:param try_number: task instance try_number to read logs from. If None
it returns all logs separated by try_number
:param metadata: log metadata,
can be used for steaming log reading and auto-tailing.
:return: a list of listed tuples which order log string by host
"""
# Task instance increments its try number when it starts to run.
# So the log for a particular task try will only show up when
# try number gets incremented in DB, i.e logs produced the time
# after cli run and before try_number + 1 in DB will not be displayed.
if try_number is None:
next_try = task_instance.next_try_number
try_numbers = list(range(1, next_try))
elif try_number < 1:
logs = [
[('default_host', f'Error fetching the logs. Try number {try_number} is invalid.')],
]
return logs, [{'end_of_log': True}]
else:
try_numbers = [try_number]
logs = [''] * len(try_numbers)
metadata_array = [{}] * len(try_numbers)
for i, try_number_element in enumerate(try_numbers):
log, metadata = self._read(task_instance, try_number_element, metadata)
# es_task_handler return logs grouped by host. wrap other handler returning log string
# with default/ empty host so that UI can render the response in the same way
logs[i] = log if self._read_grouped_logs() else [(task_instance.hostname, log)]
metadata_array[i] = metadata
return logs, metadata_array
def _init_file(self, ti):
"""
Create log directory and give it correct permissions.
:param ti: task instance object
:return: relative log path of the given task instance
"""
# To handle log writing when tasks are impersonated, the log files need to
# be writable by the user that runs the Airflow command and the user
# that is impersonated. This is mainly to handle corner cases with the
# SubDagOperator. When the SubDagOperator is run, all of the operators
# run under the impersonated user and create appropriate log files
# as the impersonated user. However, if the user manually runs tasks
# of the SubDagOperator through the UI, then the log files are created
# by the user that runs the Airflow command. For example, the Airflow
# run command may be run by the `airflow_sudoable` user, but the Airflow
# tasks may be run by the `airflow` user. If the log files are not
# writable by both users, then it's possible that re-running a task
# via the UI (or vice versa) results in a permission error as the task
# tries to write to a log file created by the other user.
relative_path = self._render_filename(ti, ti.try_number)
full_path = os.path.join(self.local_base, relative_path)
directory = os.path.dirname(full_path)
# Create the log file and give it group writable permissions
# TODO(aoen): Make log dirs and logs globally readable for now since the SubDag
# operator is not compatible with impersonation (e.g. if a Celery executor is used
# for a SubDag operator and the SubDag operator has a different owner than the
# parent DAG)
Path(directory).mkdir(mode=0o777, parents=True, exist_ok=True)
if not os.path.exists(full_path):
open(full_path, "a").close()
# TODO: Investigate using 444 instead of 666.
os.chmod(full_path, 0o666)
return full_path
Maybe a final note; the links in the airflow UI to the logging will now only open the latest logfile, not the older rotated files which are only accessible by means of SSH or any other interface to access the airflow logging path.

I don't think that there is a rotation mechanism but you can store them in S3 or google cloud storage as describe here : https://airflow.incubator.apache.org/configuration.html#logs

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How to achieve the dynamic generation for tasks in airflow - airflow

Related

Airflow set DAG options conditionally

How to see the templated values in airflow?

Airflow how to set default values for dag_run.conf

How to configure environment variables in Azure DevOps pipeline?

Removing Airflow task logs

Categories

Resources