Airflow on conda - folder structure - unix

I have a bare-bones Airflow installation on conda. I managed to create custom operators by putting them under the path:
airflow/dags/operators/custom_operator.py
and then importing them in a DAG as:
from operators.custom_operator import CustomOperator
How can I instead achieve the folder structure:
airflow/operators/custom_operator.py
which would be imported from a DAG as:
from airflow.operators.custom_operator import CustomOperator
If you think that's a bad approach, please point it out in your answer/comment - I'm happy to tweak my approach if there are better design patterns.

Interestingly, the solution here is in airflow.cfg (your Airflow config file): move the dags_folder parameter one directory up, to $AIRFLOW_HOME. So instead of having:
....
[core]
dags_folder = /home/user/airflow/dags
....
Just make it:
....
[core]
dags_folder = /home/user/airflow
....
Airflow will look recursively for DAGs under that folder and pick up only the files that actually define DAGs, so you can keep a clean folder structure with custom operators, utility functions, custom sensors, etc. outside the dags/ folder.
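As a rough sketch of what this buys you (the DAG file name, dag_id, and CustomOperator arguments below are hypothetical): with dags_folder pointing at $AIRFLOW_HOME, that directory ends up on sys.path, so a DAG can import the operator from an operators/ package that sits next to dags/ rather than inside it. Note the import stays operators.custom_operator rather than the airflow.operators.* path asked about, since creating a local package named airflow would shadow the installed airflow package.
# /home/user/airflow/dags/example_dag.py -- hypothetical example
from datetime import datetime

from airflow import DAG
from operators.custom_operator import CustomOperator  # /home/user/airflow/operators/custom_operator.py

with DAG(dag_id="example_dag", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    task = CustomOperator(task_id="custom_task")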

Related

Dataflow Job Startup Takes Too Long When Triggered from Composer

I have a static pipeline with the following architecture:
main.py
setup.py
requirements.txt
module1/
    __init__.py
    functions.py
module2/
    __init__.py
    functions.py
dist/
    setup_tarball
The setup.py and requirements.txt contain the non-native PyPI packages and the local functions that will be used by the Dataflow worker nodes. The Dataflow options are written as follows:
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from module2.functions import function_to_use
dataflow_options = [
    '--extra_package=./dist/setup_tarball',
    '--temp_location=<gcs_temp_location>',
    '--runner=DataflowRunner',
    '--region=us-central1',
    '--requirements_file=./requirements.txt',
]
So then the pipeline will run something like this:
options = PipelineOptions(dataflow_options)
p = beam.Pipeline(options=options)
transform = (p | ReadFromText(gcs_url) | beam.Map(function_to_use) | WriteToText(gcs_output_url))
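For completeness, a minimal sketch of how the pipeline would then be submitted (this simply completes the snippet above; gcs_url and gcs_output_url are the placeholders already used there):
result = p.run()               # submit the job to the Dataflow service
result.wait_until_finish()     # block until the remote job completes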
Running this locally, Dataflow takes around 6 minutes to complete, with most of the time going to worker startup. I tried automating this code with Composer and re-arranged the architecture as follows: my main (DAG) function in the dags folder, the modules in plugins, and setup_tarball and requirements.txt in the data folder... So the only parameters that really changed are:
'--extra_package=/home/airflow/gcs/data/setup_tarball'
'--requirements_file=/home/airflow/gcs/data/requirements.txt'
When I try running this modified code in Composer, it works... but it takes much, much longer. Once the worker starts up, it takes anywhere from 20-30 minutes before actually running the pipeline (which itself takes only a few seconds). This is much longer than triggering Dataflow from my local code, which took only 6 minutes to complete. I realize this question is very general, but since the code works, I don't think it's related to the Airflow task itself. Where would be a reasonable place to start troubleshooting this problem? What can be modified at the Airflow level? How does Composer (Airflow) interact with Dataflow, and what could be causing this bottleneck?
It turns out that the problem was associated with Composer itself. The fix was to increase the capacity of Composer, i.e., increase the vCPUs. I'm not sure why this would be the case, so if anyone has an idea of the reason behind this issue, your input would be much appreciated!

Airflow - DAG Integrity Testing - sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: variable

I am trying to write some DAG integrity tests in Airflow. The issue I am coming across is that the DAG I am testing references Airflow Variables in some of its tasks.
eg: Variable.get("AIRFLOW_VAR_BLOB_CONTAINER")
I seem to be getting the error:
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: variable
because, when testing via pytest, those variables (and the variable table) don't exist. Does anyone know any workarounds or suggested methods for handling Variable/Connection references when running DAG integrity tests?
Thanks,
You can create a local metastore for testing. Running airflow db init without any other settings will create a SQLite metastore in your home directory which you can use during testing. My default additional settings for a local metastore for testing are:
AIRFLOW__CORE__LOAD_DEFAULT_CONNECTIONS=False (to ensure there are no defaults to make things magically work)
AIRFLOW__CORE__LOAD_EXAMPLES=False (to ensure there are no defaults to make things magically work)
AIRFLOW__CORE__UNIT_TEST_MODE=True (Set default test settings, skip certain actions, etc.)
AIRFLOW_HOME=[project root dir] (To avoid Airflow files in your home dir)
Running airflow db init with these settings results in three files in your project root dir:
unittests.db
unittests.cfg
webserver_config.py
It's probably a good idea to add those to your .gitignore. With this set up you can safely test against the local metastore unittests.db during your tests (ensure that when running pytest, the same env vars are set).
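As an illustration of a basic integrity test against that local metastore (the test file name and the dags/ path are assumptions about your project layout; the env vars above must be exported before pytest starts):
# test_dag_integrity.py
from airflow.models import DagBag

def test_dags_import_without_errors():
    # Parsing the DAG files triggers the Variable lookups, which now go to unittests.db
    dag_bag = DagBag(dag_folder="dags", include_examples=False)
    assert dag_bag.import_errors == {}
Any Variables your DAGs read at parse time still need to exist in that metastore, e.g. seeded with airflow variables set before the test run.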
Alternatively, if you don't want a local metastore for whatever reason, you will have to resort to mocking to substitute the call Airflow makes to the metastore. This requires knowledge of the internals of Airflow. An example:
import datetime
from unittest import mock

from airflow.models import DAG
from airflow.operators.bash import BashOperator


def test_bash_operator(tmp_path):
    with DAG(dag_id="test_dag", start_date=datetime.datetime(2021, 1, 1), schedule_interval="@daily") as dag:
        with mock.patch("airflow.models.variable.Variable.get") as variable_get_mock:
            employees = ["Alice", "Bob", "Charlie"]
            variable_get_mock.return_value = employees
            output_file = tmp_path / "output.txt"
            test = BashOperator(task_id="test", bash_command="echo {{ var.json.employees }} > " + str(output_file))
            dag.clear()
            test.run(
                start_date=dag.start_date,
                end_date=dag.start_date,
                ignore_first_depends_on_past=True,
                ignore_ti_state=True,
            )
            variable_get_mock.assert_called_once()
            assert output_file.read_text() == f"[{', '.join(employees)}]\n"
These lines:
with mock.patch("airflow.models.variable.Variable.get") as variable_get_mock:
employees = ["Alice", "Bob", "Charlie"]
variable_get_mock.return_value = employees
ensure that the function airflow.models.variable.Variable.get isn't actually called and that this list is returned instead: ["Alice", "Bob", "Charlie"]. Since test.run() doesn't return anything, I made the bash_command write to a tmp_path and read the file back to assert that the content is what I expected.
This avoids the need for a metastore entirely, but mocking can be a lot of work and becomes complex once your tests grow beyond basic examples like this one.

How do I tell Dagit (the Dagster GUI) to run on an existing Dask cluster?

I'm using dagster 0.11.3 (the latest as of this writing)
I've created a Dagster pipeline (saved as pipeline.py) that looks like this:
from dagster import ModeDefinition, pipeline, solid
from dagster_dask import dask_executor


@solid
def return_a(context):
    return 12.34


@pipeline(
    mode_defs=[
        ModeDefinition(
            executor_defs=[dask_executor]  # Note: dask only!
        )
    ]
)
def the_pipeline():
    return_a()
I have the DAGSTER_HOME environment variable set to a directory that contains a file named dagster.yaml, which is an empty file. This should be ok because the defaults are reasonable based on these docs: https://docs.dagster.io/deployment/dagster-instance.
I have an existing Dask cluster running at "scheduler:8786". Based on these docs: https://docs.dagster.io/deployment/custom-infra/dask, I created a run config named config.yaml that looks like this:
execution:
  dask:
    config:
      cluster:
        existing:
          address: "scheduler:8786"
I have SUCCESSFULLY used this run config with Dagster like so:
$ dagster pipeline execute -f pipeline.py -c config.yaml
(I checked the Dask logs and made sure that it did indeed run on my Dask cluster)
My question is: How can I get Dagit to use this Dask cluster?
The only thing I have found that seems related is this:
https://docs.dagster.io/_apidocs/execution#executors
...but it doesn't even mention Dask as an option (it has dagster.in_process_executor and dagster.multiprocess_executor, which don't seem at all related to dask).
Probably I need to configure dagster-dask, which is documented here: https://docs.dagster.io/_apidocs/libraries/dagster-dask#dask-dagster-dask
...but where do I put that run config when using Dagit? There's no way to feed config.yaml to Dagit, for example.
Some options:
you can manually plug the values from config.yaml into the Dagit playground
you can bind the config directly to the executor if you never need to change it: https://docs.dagster.io/concepts/configuration/configured#configured-api
you can create a preset from that config YAML (a sketch follows below): https://docs.dagster.io/tutorial/advanced-tutorial/pipelines#pipeline-config-presets
Given the context, I would recommend the configured API.
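For illustration, the preset route (the third option above) might look roughly like the following. This is only a sketch assuming Dagster 0.11's PresetDefinition API: the preset name is made up, the address comes from config.yaml, and the solid/pipeline bodies are copied from the question.
from dagster import ModeDefinition, PresetDefinition, pipeline, solid
from dagster_dask import dask_executor

# Bake the run config from config.yaml into a preset that Dagit's playground can select
dask_cluster_preset = PresetDefinition(
    name="existing_dask_cluster",
    mode="default",
    run_config={
        "execution": {
            "dask": {"config": {"cluster": {"existing": {"address": "scheduler:8786"}}}}
        }
    },
)

@solid
def return_a(context):
    return 12.34

@pipeline(
    mode_defs=[ModeDefinition(executor_defs=[dask_executor])],
    preset_defs=[dask_cluster_preset],
)
def the_pipeline():
    return_a()
In Dagit you can then pick the preset in the playground instead of pasting the YAML in by hand.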

How to output Airflow's scheduler log to stdout or S3 / GCS

We're running an Airflow cluster using the puckel/airflow Docker image with docker-compose. Airflow's scheduler container outputs its logs to /usr/local/airflow/logs/scheduler.
The problem is that the log files are not rotated, so disk usage grows until the disk is full. A DAG for cleaning up the log directory is available, but it runs on the worker node, so the log directory on the scheduler container is never cleaned up.
I'm looking for a way to send the scheduler log to stdout or to an S3/GCS bucket but haven't been able to find one. Is there any way to do this?
Finally I managed to output the scheduler's logs to stdout.
Here you can find how to use a custom logger in Airflow; the default logging config is available on GitHub.
What you have to do is:
(1) Create a custom logging config at ${AIRFLOW_HOME}/config/log_config.py:
# Send processor (scheduler, etc.) logs to stdout
# Referring to https://www.astronomer.io/guides/logging
# This file is created following https://airflow.apache.org/docs/apache-airflow/2.0.0/logging-monitoring/logging-tasks.html#advanced-configuration
import sys
from copy import deepcopy

from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)
LOGGING_CONFIG["handlers"]["processor"] = {
    "class": "logging.StreamHandler",
    "formatter": "airflow",
    "stream": sys.stdout,
}
(2) Set the logging_config_class property to config.log_config.LOGGING_CONFIG in airflow.cfg:
logging_config_class = config.log_config.LOGGING_CONFIG
(3) [Optional] Add $AIRFLOW_HOME to the PYTHONPATH environment variable:
export PYTHONPATH="${PYTHONPATH}:${AIRFLOW_HOME}"
Actually, you can set the path of logging_config_class to anything, as long as Python is able to load the package.
Setting the processor handler to airflow.utils.log.logging_mixin.RedirectStdHandler didn't work for me; it used too much memory.
remote_logging=True in airflow.cfg is the key.
Please check the thread here for detailed steps.
You can extend the image with the following, or set the equivalent options in airflow.cfg:
ENV AIRFLOW__LOGGING__REMOTE_LOGGING=True
ENV AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID=gcp_conn_id
ENV AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER=gs://bucket_name/AIRFLOW_LOGS
The gcp_conn_id connection should have the correct permissions to create/delete objects in GCS.
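For reference, the same settings expressed in airflow.cfg (assuming Airflow 2.x, where these options live under the [logging] section; the bucket path and connection id are the placeholders from above):
[logging]
remote_logging = True
remote_log_conn_id = gcp_conn_id
remote_base_log_folder = gs://bucket_name/AIRFLOW_LOGS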

Disable file output of hydra

I'm using hydra to log hyperparameters of experiments.
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_name="config", config_path="../conf")
def evaluate_experiment(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))
    ...
Sometimes I want to do a dry run to check something. For this I don't need any saved parameters, so I'm wondering how I can completely disable saving to the filesystem in this case.
The answer from Omry Yadan works well if you want to solve this using the CLI. However, you can also add these flags to your config file so that you don't have to type them every time you run your script. If you want to go this route, make sure you add the following items to your root config file:
defaults:
  - _self_
  - override hydra/hydra_logging: disabled
  - override hydra/job_logging: disabled

hydra:
  output_subdir: null
  run:
    dir: .
There is an enhancement request aimed at Hydra 1.1 to support disabling working directory management.
Working directory management does several things:
Creating a working directory for the run
Changing the working directory to the created dir.
There are other related features:
Saving log files
Saving files like config.yaml and hydra.yaml into .hydra in the working directory.
Different features have different ways to be disabled:
To prevent the creation of a working directory, you can override hydra.run.dir to . (the current directory).
To prevent saving the files into .hydra, override hydra.output_subdir to null.
To prevent the creation of logging files, you can disable logging output of hydra/hydra_logging and hydra/job_logging, see this.
A complete example might look like:
$ python foo.py hydra.run.dir=. hydra.output_subdir=null hydra/job_logging=disabled hydra/hydra_logging=disabled
Note that as always you can also override those config values through your config file.
