How to output Airflow's scheduler log to stdout or S3 / GCS - airflow

We're running an Airflow cluster using the puckel/airflow Docker image with docker-compose. Airflow's scheduler container writes its logs to /usr/local/airflow/logs/scheduler.
The problem is that the log files are never rotated, so disk usage grows until the disk is full. A DAG for cleaning up the log directory is available, but it runs on the worker node, so the log directory on the scheduler container is never cleaned up.
I'm looking for a way to send the scheduler log to stdout or to an S3/GCS bucket, but I haven't been able to find one. Is there any way to do this?

Finally I managed to output the scheduler's log to stdout.
Here you can find how to use a custom logging config with Airflow. The default logging config is available on GitHub.
What you have to do is:
(1) Create a custom logging config module at ${AIRFLOW_HOME}/config/log_config.py:
# Setting processor (scheduler, etc.) logs output to stdout
# Referring to https://www.astronomer.io/guides/logging
# This file is created following https://airflow.apache.org/docs/apache-airflow/2.0.0/logging-monitoring/logging-tasks.html#advanced-configuration
import sys
from copy import deepcopy

from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)

# Replace the file-based "processor" handler with one that streams to stdout
LOGGING_CONFIG["handlers"]["processor"] = {
    "class": "logging.StreamHandler",
    "formatter": "airflow",
    "stream": sys.stdout,
}
(2) Set logging_config_class property to config.log_config.LOGGING_CONFIG in airflow.cfg
logging_config_class = config.log_config.LOGGING_CONFIG
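If you configure Airflow through environment variables instead (e.g. in docker-compose), the equivalent setting for Airflow 2.x should be the following, assuming logging_config_class lives in the [logging] section:
AIRFLOW__LOGGING__LOGGING_CONFIG_CLASS=config.log_config.LOGGING_CONFIG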
(3) [Optional] Add ${AIRFLOW_HOME} to the PYTHONPATH environment variable so the config package can be imported:
export PYTHONPATH="${PYTHONPATH}:${AIRFLOW_HOME}"
Actually, you can put the module referenced by logging_config_class anywhere, as long as Python is able to import it.
Setting handlers.processor to airflow.utils.log.logging_mixin.RedirectStdHandler didn't work for me; it used too much memory.

remote_logging=True in airflow.cfg is the key.
Please check the thread here for detailed steps.

You can extend the image with the following environment variables, or set the equivalent options in airflow.cfg:
ENV AIRFLOW__LOGGING__REMOTE_LOGGING=True
ENV AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID=gcp_conn_id
ENV AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER=gs://bucket_name/AIRFLOW_LOGS
The connection referenced by gcp_conn_id must have permission to create and delete objects in GCS.
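For reference, the airflow.cfg equivalent of the environment variables above should look like the sketch below (the bucket path and connection id are the same placeholders as above):
[logging]
remote_logging = True
remote_log_conn_id = gcp_conn_id
remote_base_log_folder = gs://bucket_name/AIRFLOW_LOGS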

Related

Files created in airflow task deleted automatically or not

In my Airflow task I create a file with open() inside the DAG and write records into it, then send it by mail within the same task. Will the file be deleted automatically, or will it persist?
filename = to_report_name(context) + '_' + currentNextRunTime.strftime('%m.%d.%Y_%H-%M') + '_' + currentNextRunTime.tzname() + '.' + extension.lower()
with open(filename, "w+b") as file:
    file.write(download_response.content)
    print(file.name)
    send_report(context, file)
The file will not be deleted automatically. The code you execute is pure Python. If you want the file to be deleted once the operation is done, use the tempfile module, which guarantees the file is deleted once it's closed. Example:
import os
import tempfile

with tempfile.NamedTemporaryFile() as file:
    os.rename(file.name, '/tmp/my_custom_name.txt')  # use this if you want to rename the file
    file.write(...)
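If you want a custom file name but still want guaranteed cleanup, a variant is to write into a tempfile.TemporaryDirectory, which removes the directory and everything in it when the with block exits. A minimal sketch (the file name and content are placeholders standing in for the question's values):
import os
import tempfile

content = b"report bytes"  # placeholder for download_response.content from the question

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'my_custom_name.txt')  # hypothetical name
    with open(path, 'wb') as f:
        f.write(content)
    # send/attach the file here (e.g. the question's send_report) before the
    # block exits; the directory and the file are deleted afterwards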

How do I tell Dagit (the Dagster GUI) to run on an existing Dask cluster?

I'm using dagster 0.11.3 (the latest as of this writing)
I've created a Dagster pipeline (saved as pipeline.py) that looks like this:
from dagster import ModeDefinition, pipeline, solid
from dagster_dask import dask_executor

@solid
def return_a(context):
    return 12.34

@pipeline(
    mode_defs=[
        ModeDefinition(
            executor_defs=[dask_executor]  # Note: dask only!
        )
    ]
)
def the_pipeline():
    return_a()
I have the DAGSTER_HOME environment variable set to a directory that contains a file named dagster.yaml, which is an empty file. This should be ok because the defaults are reasonable based on these docs: https://docs.dagster.io/deployment/dagster-instance.
I have an existing Dask cluster running at "scheduler:8786". Based on these docs: https://docs.dagster.io/deployment/custom-infra/dask, I created a run config named config.yaml that looks like this:
execution:
  dask:
    config:
      cluster:
        existing:
          address: "scheduler:8786"
I have SUCCESSFULLY used this run config with Dagster like so:
$ dagster pipeline execute -f pipeline.py -c config.yaml
(I checked the Dask logs and made sure that it did indeed run on my Dask cluster)
My question is: How can I get Dagit to use this Dask cluster?
The only thing I have found that seems related is this:
https://docs.dagster.io/_apidocs/execution#executors
...but it doesn't even mention Dask as an option (it has dagster.in_process_executor and dagster.multiprocess_executor, which don't seem at all related to dask).
Probably I need to configure dagster-dask, which is documented here: https://docs.dagster.io/_apidocs/libraries/dagster-dask#dask-dagster-dask
...but where do I put that run config when using Dagit? There's no way to feed config.yaml to Dagit, for example.
Some options:
(1) you can manually plug the values from config.yaml into the Dagit playground
(2) you can bind the config directly to the executor if you never need to change it: https://docs.dagster.io/concepts/configuration/configured#configured-api
(3) you can create a preset from that config YAML: https://docs.dagster.io/tutorial/advanced-tutorial/pipelines#pipeline-config-presets
Given the context, I would recommend the configured API; a sketch follows below.
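For example, a sketch of the configured approach, binding the address from config.yaml directly to the Dask executor so Dagit needs no extra run config (the executor name here is hypothetical, and the exact configured signature may differ slightly between dagster versions):
from dagster import ModeDefinition, pipeline, solid
from dagster_dask import dask_executor

# Bind the run config from config.yaml onto the executor itself
dask_on_cluster = dask_executor.configured(
    {"cluster": {"existing": {"address": "scheduler:8786"}}},
    name="dask_on_cluster",  # hypothetical name
)

@solid
def return_a(context):
    return 12.34

@pipeline(mode_defs=[ModeDefinition(executor_defs=[dask_on_cluster])])
def the_pipeline():
    return_a()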

Simple python app deployed but not deployed

I just installed dokku for the first time and I'm struggling with an apparently very simple problem... I made a sample Python app that just logs an env variable:
import os
import time

API_TOKEN = os.getenv('API_TOKEN')

while True:
    print(f'API_TOKEN is {API_TOKEN}')
    time.sleep(1)
With a Procfile as this:
worker: python temp.py
The deploy looks normal and successful, but if I try to look at the logs, dokku says:
App <app name> has not been deployed.
Am I missing something very trivial?
Thanks in advance!
By default dokku only scales up the web process if it is present. Any workers or other processes are scaled to 0, otherwise known as "App <app name> has not been deployed".
To deploy your app you need to log onto the box and scale up the worker by running:
dokku ps:scale <app name> worker=1
change 1 to a larger number if you want more workers
If you often deploy the app to different dokku instances and have to search for this solution over and over again, you can instead create a file in the root of your app called DOKKU_SCALE. In it you can set the default scale of all the processes, like so:
worker=1
That reminds me, I need to go do that now. It is driving me nuts.

How to point to the airflow unittest.cfg?

Airflow creates a unittest.cfg file in the path given by the AIRFLOW_HOME environment variable.
My question is: how can I point to unittest.cfg in the same way that I point to airflow.cfg via the AIRFLOW_CONFIG environment variable?
The reason I want to do this is that I don't want any config files in the AIRFLOW_HOME directory.
Also, if anyone knows, could you please explain what unittest.cfg is for? I couldn't find any documentation on it.
The unittest.cfg test configuration file is the default configuration file used when Airflow runs in test mode.
Test mode can be activated by setting the unit_test_mode option in airflow.cfg, or the AIRFLOW__CORE__UNIT_TEST_MODE environment variable, to True.
When test mode is active, the values in the test configuration file override those in airflow.cfg at runtime.
# Source: https://github.com/apache/airflow/blob/1.10.5/airflow/configuration.py#L558,L561
def get_airflow_test_config(airflow_home):
    if 'AIRFLOW_TEST_CONFIG' not in os.environ:
        return os.path.join(airflow_home, 'unittests.cfg')
    return expand_env_var(os.environ['AIRFLOW_TEST_CONFIG'])
The AIRFLOW_TEST_CONFIG environment variable can be set to the path of your test configuration file.
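For example (the path here is just an illustration), in the same style as AIRFLOW_CONFIG:
export AIRFLOW_TEST_CONFIG=/path/outside/airflow_home/unittests.cfg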

Google Cloud Composer DataflowJavaOperator: 403 Forbidden When Creating Job in Another Project

I am trying to use DataflowJavaOperator on our testing composer environment, but I am running into a 403 forbidden error. My intention is to kick off a Dataflow Java job on a different project using the test composer environment.
t2 = DataFlowJavaOperator(
    task_id="run-java-dataflow-job",
    jar="gs://path/to/dataflow-jar.jar",
    dataflow_default_options=config_params["dataflow_default_options"],
    gcp_conn_id=config_params["gcloud_config"]["conn_id"],
    dag=dag
)
My default options look like this:
'dataflow_default_options': {
    'project': 'other-project',
    'input': 'other-project:dataset.table',
    'output': 'other-project:dataset.table'
    ...
}
I have tried creating a temporary Composer test environment in the same project as the Dataflow job, and this allows me to use DataflowJavaOperator as expected. Only when the Composer environment resides in a different project from the Dataflow job does DataflowJavaOperator not work as expected.
My current workaround is to use BashOperator, use "env" to set the GOOGLE_APPLICATION_CREDENTIALS as gcp_conn_id path, store the jar file in our test composer bucket, and just run this bash command:
java -jar /path/to/dataflow-jar.jar \
[... all Dataflow job options]
Is it possible to use DataflowJavaOperator to kick off Dataflow jobs on another project?
You need a separate GCP connection, created for Composer, that can interact with your second GCP project, and you need to pass that connection id to gcp_conn_id in DataFlowJavaOperator.
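For example, a sketch based on the operator call from the question; "other_project_conn" is a hypothetical connection id whose service account has the required Dataflow/GCS permissions in the other project, and the import path may differ depending on your Airflow version:
from airflow.contrib.operators.dataflow_operator import DataFlowJavaOperator

t2 = DataFlowJavaOperator(
    task_id="run-java-dataflow-job",
    jar="gs://path/to/dataflow-jar.jar",
    dataflow_default_options=config_params["dataflow_default_options"],  # from the question
    gcp_conn_id="other_project_conn",  # hypothetical connection for the second project
    dag=dag,
)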
