How to manage python packages between airflow dags? - airflow

If I have multiple airflow dags with some overlapping python package dependencies, how can I keep each project's dependencies decoupled? E.g. if I had projects A and B on the same server, I would run each of them with something like...
source /path/to/virtualenv_a/bin/activate
python script_a.py
deactivate
source /path/to/virtualenv_b/bin/activate
python script_b.py
deactivate
Basically, I would like to run dags in the same situation (e.g. each dag uses python scripts that may have overlapping package dependencies, which I would like to develop separately, i.e. not have to update all code using a package when I only want to update the package for one project). Note, I've been using the BashOperator to run python tasks like...
do_stuff = BashOperator(
    task_id='my_task',
    bash_command='python /path/to/script.py',
    execution_timeout=timedelta(minutes=30),
    dag=dag)
Is there a way to get this working? Is there some other best-practice way that airflow intends for people to address (or avoid) these kinds of problems?

Based on discussion from the apache-airflow mailing list, the simplest answer that addresses the modular way in which I am using various python scripts for tasks is to directly call each virtualenv's python interpreter binary for each script or module, e.g.
source /path/to/virtualenv_a/bin/activate
python script_a.py
deactivate
source /path/to/virtualenv_b/bin/activate
python script_b.py
deactivate
would translate to something like
do_stuff_a = BashOperator(
    task_id='my_task_a',
    bash_command='/path/to/virtualenv_a/bin/python /path/to/script_a.py',
    execution_timeout=timedelta(minutes=30),
    dag=dag)
do_stuff_b = BashOperator(
    task_id='my_task_b',
    bash_command='/path/to/virtualenv_b/bin/python /path/to/script_b.py',
    execution_timeout=timedelta(minutes=30),
    dag=dag)
in an airflow dag.
To the question of passing args to the Tasks: it depends on the nature of the args you want to pass in. In my case, there are certain args that depend on what a data table looks like on the day the dag is run (e.g. the highest timestamp record in the table, etc.). To add these args to the Tasks, I have a "config dag" that runs before this one. In the config dag, there is a Task that generates the args for the "real" dag as a python dict and writes them to a pickle file. The "config" dag then has a TriggerDagRunOperator Task that activates the "real" dag, which has initial logic to read the pickle file generated by the "config" dag (in my case, as a dict) and interpolate it into the bash_command string, like bash_command=f"python script.py {configs['arg1']}".
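For illustration, here is a minimal sketch of that pattern. The paths, DAG ids, and the args-building logic are assumptions, and the import paths shown are for Airflow 2.x:

import pickle
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

CONFIG_PATH = '/path/to/shared/configs.pkl'  # assumed location readable by both dags

def build_configs():
    # e.g. query the table for the highest timestamp record, etc.
    configs = {'arg1': 'some_value'}
    with open(CONFIG_PATH, 'wb') as f:
        pickle.dump(configs, f)

# The "config" dag: generate the args, then trigger the "real" dag.
with DAG('config_dag', start_date=datetime(2021, 1, 1), schedule_interval=None) as config_dag:
    generate_args = PythonOperator(task_id='generate_args', python_callable=build_configs)
    trigger_real = TriggerDagRunOperator(task_id='trigger_real_dag', trigger_dag_id='real_dag')
    generate_args >> trigger_real

# The "real" dag (in its own file): read the pickle at parse time and format it into
# bash_command. Note the pickle must already exist when this file is parsed.
with open(CONFIG_PATH, 'rb') as f:
    configs = pickle.load(f)

with DAG('real_dag', start_date=datetime(2021, 1, 1), schedule_interval=None) as real_dag:
    do_stuff = BashOperator(
        task_id='my_task',
        bash_command=f"/path/to/virtualenv_a/bin/python /path/to/script_a.py {configs['arg1']}",
        execution_timeout=timedelta(minutes=30),
    )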

You can use packaged DAGs, where each DAG is packaged together with its dependencies:
http://airflow.apache.org/concepts.html#packaged-dags
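In short, a packaged DAG is a zip dropped into the DAGS_FOLDER, with the DAG file at the root of the zip and any pure-Python dependencies bundled alongside it. A rough sketch of building one (file names here are assumptions):

import zipfile

# Build my_dag_package.zip and drop it into the DAGS_FOLDER; Airflow scans inside zip files.
with zipfile.ZipFile('my_dag_package.zip', 'w') as zf:
    zf.write('my_dag.py', arcname='my_dag.py')                     # DAG definition at the zip root
    zf.write('deps/helper_module.py', arcname='helper_module.py')  # bundled pure-Python dependency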

There are operators for running Python. There is a relatively new one, the PythonVirtualenvOperator, which will create an ephemeral virtualenv, install your dependencies, run your python callable, and then tear down the environment. This does create some per-task overhead, but it is a functional (if not ideal) approach to your dependency overlap issue.
https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html#pythonvirtualenvoperator
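A minimal sketch of how that looks (Airflow 2.x import path; the requirements pin and the callable body are assumptions for illustration):

from airflow.operators.python import PythonVirtualenvOperator

def callable_in_venv():
    # Imports must live inside the callable, since it executes in the ephemeral virtualenv.
    import pandas as pd
    print(pd.__version__)

do_stuff_venv = PythonVirtualenvOperator(
    task_id='my_task_in_venv',
    python_callable=callable_in_venv,
    requirements=['pandas==1.5.3'],  # installed into the throwaway environment on each run
    system_site_packages=False,
    dag=dag,
)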

Related

How to avoid DAG generation during task run

We have an Airflow python script which reads configuration files and then generates > 100 DAGs dynamically. When running the script in Airflow 2.4.1, we notice from the task run log that Airflow is trying to parse our python script for every task run.
https://github.com/apache/airflow/blob/2.4.1/airflow/task/task_runner/standard_task_runner.py#L91-L97
Is there any way to make Airflow deserialize DAGs from the DB instead?
I just found out that this is expected behavior:
https://medium.com/apache-airflow/airflows-magic-loop-ec424b05b629
https://medium.com/apache-airflow/magic-loop-in-airflow-reloaded-3e1bd8fb6671
However, the Python script can use the DAG parsing context to generate only the DAG that is actually needed:
https://github.com/apache/airflow/pull/25161
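Roughly, that looks like the following (get_parsing_context() is available in newer Airflow releases via the PR above; the config list and generated DAG ids here are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.utils.dag_parsing_context import get_parsing_context

list_of_configs = ['config_a', 'config_b']  # stand-in for reading your configuration files
current_dag_id = get_parsing_context().dag_id

for config_name in list_of_configs:
    dag_id = f'generated_{config_name}'
    if current_dag_id is not None and current_dag_id != dag_id:
        continue  # during a task run, skip generating every DAG except the one being executed
    with DAG(dag_id=dag_id, start_date=datetime(2022, 1, 1), schedule_interval=None):
        ...  # add the tasks for this config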

Airflow DAG using a custom operator not working when python code for operator is placed in separate file

I created a custom airflow operator, basically modifying some code related to run_id for TriggerDagRunOperator, and named it CustomTriggerDagRunOperator.
This new operator is working fine. When I place the operator class in my DAG's code, the dag runs fine and the modifications are performed as expected.
But then I created a separate python file for this operator, say my_custom_operator.py, placed it in the same folder as the DAG, and added the import statement from my_custom_operator import CustomTriggerDagRunOperator to the DAG. The airflow UI doesn't give any DAG error, but when I try to run the DAG it doesn't work, nor does it display any logs; even the tasks not related to this operator fail to execute. It is confusing, as I only shifted the operator code to a different file so that the custom operator could be used across all my DAGs. Need some suggestions.
Airflow Version: 2.1.3
Using Astronomer, hosted on Kubernetes
In order to import classes/methods from your module, you need to add the module's package to the python path; the DagFileProcessor will then be able to import the classes/methods when it processes the dag script.
DAGS_FOLDER/
    dag.py
    my_operators/
        operator1.py
In your scheduler and all the workers, you need to change the python path to PYTHONPATH=$PYTHONPATH:/path/to/DAGS_FOLDER, and in your dag script you need to import from the my_operators package and not from .:
from my_operators.operator1 import CustomTriggerDagRunOperator
For your development, you can mark the DAGS_FOLDER as a source folder for your project, which is similar to adding it to the python path.
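With that layout, dag.py might look roughly like this (the DAG/task ids and the operator's arguments are assumptions about your code; the operator is assumed to keep TriggerDagRunOperator's signature):

from datetime import datetime

from airflow import DAG
from my_operators.operator1 import CustomTriggerDagRunOperator

with DAG('uses_custom_trigger', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    trigger = CustomTriggerDagRunOperator(
        task_id='trigger_downstream',
        trigger_dag_id='downstream_dag',  # hypothetical target DAG id
    )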

Create folder in webserver and scheduler via Airflow

When running the following operator, a folder called my_folder is created on the Airflow worker. This makes sense, since it is the worker executing the task.
run_this = BashOperator(
    task_id='run_this',
    bash_command='mkdir my_folder'
)
However, I'd like to create this folder both in the webserver and scheduler. Is there an easy way to do this?
My idea is to update other dags from a dag by copying them from s3, but being able to do this is a first step.
One thing that comes to mind is to mount a shared volume that all components (worker/scheduler/webserver) can access, and to update your dag to create the folder in that shared volume, like so:
run_this = BashOperator(
    task_id='run_this',
    bash_command='mkdir /shared_volume/my_folder'
)
How you do this is totally dependent on how you're deploying airflow.

Set Airflow Env Vars at Runtime

If I set env vars corresponding to airflow config settings after the airflow binary has been executed, while DAG definitions are being loaded into memory, will this have the same effect as setting those env vars at the OS level before executing the binary?
I wasn't able to find any documentation on whether this would work as intended, and figured that if I had to read through the source to figure it out, it's probably not a good idea to be doing it in the first place.
Instead of setting environment variables at runtime, I've created two airflow.cfg files: airflow.prod.cfg and airflow.dev.cfg. I then created a shell script start.sh that copies (cp) the appropriate .cfg file to airflow.cfg prior to executing the airflow binary.
I don't love having to use the shell script to boot things up, but I'd prefer that to chancing any kind of spooky action as a result of setting env vars at runtime.
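To make the idea concrete, here is a rough Python equivalent of that start.sh, purely illustrative; the file names and argument handling are assumptions:

import os
import shutil
import sys

# Hypothetical usage: python start.py prod webserver
env = sys.argv[1]  # 'prod' or 'dev'
airflow_home = os.environ.get('AIRFLOW_HOME', os.path.expanduser('~/airflow'))

# Copy the environment-specific config into place before airflow reads it.
shutil.copyfile(os.path.join(airflow_home, f'airflow.{env}.cfg'),
                os.path.join(airflow_home, 'airflow.cfg'))

# Hand off to the airflow binary with the remaining arguments.
os.execvp('airflow', ['airflow'] + sys.argv[2:])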

How to run one airflow task and all its dependencies?

I suspected that
airflow run dag_id task_id execution_date
would run all upstream tasks, but it does not. It simply fails when it sees that not all dependent tasks have been run. How can I run a specific task and all its dependencies? I am guessing this is not possible because of an airflow design decision, but is there a way to get around it?
You can run a task independently by using the -i/-I/-A flags along with the run command.
But yes, the design of airflow does not permit running a specific task and all of its dependencies.
For testing purposes, you can backfill the dag after removing the non-related tasks from the DAG.
A bit of a workaround, but if you have named your task_ids consistently, you can try backfilling from the Airflow CLI (Command Line Interface):
airflow backfill -t TASK_REGEX ... dag_id
where TASK_REGEX corresponds to the naming pattern of the task you want to rerun and its dependencies (remember to add the rest of the command line options, like --start_date).
