Airflow scheduler: how to configure the interval for checking new DAG files?

When I run the Airflow scheduler on the server, I see that it checks the DAGs folder for new files every 300 seconds:
Checking for new files in /home/ubuntu/airflow/dags every 300 seconds
How can I change this behavior? I can't find any option in airflow.cfg that controls it.

It's under the [scheduler] section as dag_dir_list_interval.
# How often (in seconds) to scan the DAGs directory for new files. Default to 5 minutes.
dag_dir_list_interval = 300
Source: https://github.com/apache/incubator-airflow/blob/1.10.1/airflow/config_templates/default_airflow.cfg#L438-L439
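If you would rather not edit airflow.cfg, the same option can be overridden through Airflow's standard AIRFLOW__{SECTION}__{KEY} environment-variable convention; a minimal sketch, where the 30-second value is only an illustration:
# scan the DAGs directory every 30 seconds instead of every 5 minutes
export AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL=30
airflow scheduler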

Related

DataFlow Job Startup Takes Too Long When triggered from Composer

I have a static pipeline with the following architecture:
main.py
setup.py
requirements.txt
module1/
    __init__.py
    functions.py
module2/
    __init__.py
    functions.py
dist/
    setup_tarball
The setup.py and requirements.txt specify the non-native PyPI packages and the local modules that the Dataflow worker nodes will use. The Dataflow options are written as follows:
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from module2.functions import function_to_use
dataflow_options = ['--extra_package=./dist/setup_tarball',
                    '--temp_location=<gcs_temp_location>',
                    '--runner=DataflowRunner',
                    '--region=us-central1',
                    '--requirements_file=./requirements.txt']
So then the pipeline will run something like this:
options = PipelineOptions(dataflow_options)
p = beam.Pipeline(options=options)
transform = (p | ReadFromText(gcs_url) | beam.Map(function_to_use) | WriteToText(gcs_output_url))
Running this locally, Dataflow takes around 6 minutes to complete, with most of the time going to worker startup. I then tried automating this code with Composer and rearranged the architecture as follows: the main (DAG) function went into the dags folder, the modules into plugins, and setup_tarball and requirements.txt into the data folder. So the only parameters that really changed are:
'--extra_package=/home/airflow/gcs/data/setup_tarball'
'--requirements_file=/home/airflow/gcs/data/requirements.txt'
When I run this modified code in Composer, it works, but it takes much longer: once the worker starts up, it takes anywhere from 20-30 minutes before actually running the pipeline (which itself takes only a few seconds). This is much longer than triggering Dataflow from my local code, which completed in only 6 minutes. I realize this question is very general, but since the code works, I don't think it's related to the Airflow task itself. Where would be a reasonable place to start troubleshooting this problem? What can be modified at the Airflow level? How does Composer (Airflow) interact with Dataflow, and what could be causing this bottleneck?
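For context, here is a minimal sketch of how such a pipeline might be invoked from a Composer DAG. The DAG id, schedule, and the run_pipeline wrapper are hypothetical; only the two Composer paths and the pipeline options come from the code above, and the GCS URLs remain placeholders:
from datetime import datetime

import apache_beam as beam
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from apache_beam.options.pipeline_options import PipelineOptions

def run_pipeline():
    # Same pipeline as above, with the paths adjusted for Composer's GCS mount.
    dataflow_options = ['--extra_package=/home/airflow/gcs/data/setup_tarball',
                        '--requirements_file=/home/airflow/gcs/data/requirements.txt',
                        '--temp_location=<gcs_temp_location>',
                        '--runner=DataflowRunner',
                        '--region=us-central1']
    options = PipelineOptions(dataflow_options)
    with beam.Pipeline(options=options) as p:
        (p
         | beam.io.ReadFromText('<gcs_url>')
         | beam.Map(lambda line: line)  # placeholder for function_to_use
         | beam.io.WriteToText('<gcs_output_url>'))

with DAG('dataflow_pipeline_example',  # hypothetical DAG id
         start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:
    PythonOperator(task_id='run_dataflow', python_callable=run_pipeline)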
It turns out that the problem was associated with Composer itself. The fix was to increase the capacity of the Composer environment, i.e., increase the vCPUs. I'm not sure why this would be the case, so if anyone has an idea of the reason behind this issue, your input would be much appreciated!

How to output Airflow's scheduler log to stdout or S3 / GCS

We're running Airflow cluster using puckel/airflow docker image with docker-compose. Airflow's scheduler container outputs its logs to /usr/local/airflow/logs/scheduler.
The problem is that the log files are not rotated, so disk usage increases until the disk gets full. A DAG for cleaning up the log directory is available, but that DAG runs on a worker node, so the log directory on the scheduler container is never cleaned up.
I'm looking for a way to send the scheduler log to stdout or to an S3/GCS bucket, but I couldn't find one. Is there any way to output the scheduler log to stdout or to an S3/GCS bucket?
Finally I managed to output the scheduler's log to stdout.
Here you can find how to use a custom logger in Airflow. The default logging config is available on GitHub.
What you have to do is:
(1) Create a custom logging config at ${AIRFLOW_HOME}/config/log_config.py:
# Setting processor (scheduler, etc.) logs output to stdout
# Referring https://www.astronomer.io/guides/logging
# This file is created following https://airflow.apache.org/docs/apache-airflow/2.0.0/logging-monitoring/logging-tasks.html#advanced-configuration
import sys
from copy import deepcopy

from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)

# Replace the default file-based "processor" handler with one that streams to stdout.
LOGGING_CONFIG["handlers"]["processor"] = {
    "class": "logging.StreamHandler",
    "formatter": "airflow",
    "stream": sys.stdout,
}
(2) Set logging_config_class property to config.log_config.LOGGING_CONFIG in airflow.cfg
logging_config_class = config.log_config.LOGGING_CONFIG
(3) [Optional] Add $AIRFLOW_HOME to the PYTHONPATH environment variable.
export PYTHONPATH="${PYTHONPATH}:${AIRFLOW_HOME}"
Actually, you can set the path of logging_config_class to anything, as long as Python is able to load the package.
Setting handler.processor to airflow.utils.log.logging_mixin.RedirectStdHandler didn't work for me. It used too much memory.
remote_logging=True in airflow.cfg is the key.
Please check the thread here for detailed steps.
You can extend the image with the following environment variables, or set the equivalent options in airflow.cfg:
ENV AIRFLOW__LOGGING__REMOTE_LOGGING=True
ENV AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID=gcp_conn_id
ENV AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER=gs://bucket_name/AIRFLOW_LOGS
The connection referenced by gcp_conn_id should have the correct permissions to create/delete objects in GCS.
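For reference, a sketch of the equivalent airflow.cfg entries; the connection id and bucket path are the same placeholders as in the ENV lines above (in older Airflow versions these options live under [core] rather than [logging]):
[logging]
remote_logging = True
remote_log_conn_id = gcp_conn_id
remote_base_log_folder = gs://bucket_name/AIRFLOW_LOGS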

Airflow logs not loading

When tasks in my DAG fail, I want to see their logs, but the log page just keeps showing the loading spinner and no logs are actually loaded.
Is there any other way I can check the logs, or a way to correct this issue?
There is a directory named 'logs' at the same location as the 'dags' directory,
default: /etc/airflow/logs
Inside it there are directories named after your DAGs, then tasks, then dates. You can look there.
example: /etc/airflow/logs/controller/trigger/2018-12-10T17:34:19.234871+00:00
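A sketch of how you might inspect those logs from a shell, using the example path above (attempt files are typically named 1.log, 2.log, ... per try):
# list the attempts for this task instance
ls /etc/airflow/logs/controller/trigger/2018-12-10T17:34:19.234871+00:00/
# print the log of the first attempt
cat /etc/airflow/logs/controller/trigger/2018-12-10T17:34:19.234871+00:00/1.log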

Airflow - How do I ignore succeeded tasks in a backfill?

I have added new tasks to my DAG and need to backfill them. At the moment, when I run airflow backfill it runs all the tasks (new ones and old ones), and I would like to skip the old tasks which have already succeeded.
Is there any way to skip the tasks with success state in a backfill?
As of Airflow version 1.8.1, successful tasks should not be scheduled by a backfill, see AIRFLOW-1124.
Note that you can also specify which tasks you want to run in a backfill:
-t TASK_REGEX, --task_regex TASK_REGEX
The regex to filter specific task_ids to backfill
(optional)
The ignore dependencies flag may also help you in case your new tasks depend on any old ones that may not have succeeded.
-i, --ignore_dependencies
Skip upstream tasks, run only the tasks matching the
regexp. Only works in conjunction with task_regex
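Putting both flags together, a sketch of what the command might look like; the DAG id, date range, and task regex are placeholders, and the date flags assume the usual -s/--start_date and -e/--end_date options:
airflow backfill my_dag \
    -s 2018-01-01 -e 2018-01-31 \
    -t "new_task_.*" \
    -i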

How to retrieve currently applied node configuration from Riak v2.0+

Showing currently applied configuration values
In v2.0+ of Riak there is a new command option: riak config effective
which I read as telling you the values Riak is currently running with.
At any time, you can get a snapshot of currently applied
configurations through the command line. For a listing of all of the
configs currently applied in the node
Config changes applied only on start of each node?
In multiple locations in Riak documentation there is reference like:
Remember that you must stop and then re-start each node when you
change storage backends or modify any other configuration
Problem:
However, when I make a change to a setting (I've tested this in both riak.conf and advanced.config), I see the newest value when running riak config effective. For example:
Start node: riak start
View current setting for log level: riak config effective | grep log.console.level
log.console.level = info
Change the level to debug (something that will output a lot to console.log)
Re-run: riak config effective | grep log.console.level, we get:
log.console.level = debug
Checking the console log file for debug output with cat /var/log/riak/console.log | grep debug gives no results (indicating the config change has not been applied)
So the question is, how can I retrieve and verify what config setting each Riak node is running under?
When Riak starts, it creates two files: 'app.config' and 'vm.config'. The default location is in a 'generated.configs' directory under the platform data directory (usually /var/lib/riak).
These files will contain the settings that were in place when Riak was started. The command riak config effective processes the current riak.conf and advanced.config files.
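A sketch of how you might check both views from a shell, assuming the default platform data directory mentioned above (/var/lib/riak):
# the snapshot of settings the node was actually started with
ls /var/lib/riak/generated.configs/
less /var/lib/riak/generated.configs/app.*.config
# what the current riak.conf / advanced.config files would produce
riak config effective | grep log.console.level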
