We have an Airflow Python script which reads configuration files and then generates more than 100 DAGs dynamically. When running the script in Airflow 2.4.1, we noticed from the task run logs that Airflow tries to parse our Python script for every task run.
https://github.com/apache/airflow/blob/2.4.1/airflow/task/task_runner/standard_task_runner.py#L91-L97
Is there any way to make Airflow deserialize the DAGs from the database instead?
I just found out that this is expected behavior:
https://medium.com/apache-airflow/airflows-magic-loop-ec424b05b629
https://medium.com/apache-airflow/magic-loop-in-airflow-reloaded-3e1bd8fb6671
However, the Python script can use the DAG parsing context to load only the DAG that is actually needed:
https://github.com/apache/airflow/pull/25161
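As an illustration, here is a minimal sketch of how a dynamic DAG-generation script could use that context. The load_all_configs helper and the DAG naming are hypothetical; get_parsing_context is the API added by the PR above, available from Airflow 2.4:

from airflow import DAG
from airflow.utils.dag_parsing_context import get_parsing_context

# dag_id is set when a single task is being run; it is None during a full scheduler parse
current_dag_id = get_parsing_context().dag_id

for config in load_all_configs():  # hypothetical helper that reads your configuration files
    dag_id = f"generated_{config['name']}"
    if current_dag_id is not None and current_dag_id != dag_id:
        continue  # skip building DAGs that are not needed for this task run
    with DAG(dag_id=dag_id, schedule=config.get("schedule")) as dag:
        ...  # add this DAG's tasks here
    globals()[dag_id] = dag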
I created a custom Airflow operator, basically modifying some code related to run_id in TriggerDagRunOperator, and named it CustomTriggerDagRunOperator.
This new operator works fine: when I place the operator class directly in my DAG code, the DAG runs and the modifications behave as expected.
But when I moved the operator into a separate Python file, say my_custom_operator.py, placed that file in the same folder as the DAG, and added the import statement from my_custom_operator import CustomTriggerDagRunOperator to the DAG, the Airflow UI doesn't give any DAG error, yet when I try to run the DAG it doesn't work and doesn't display any logs; even the tasks not related to this operator fail to execute. It is confusing because I only moved the operator code into a different file so that the custom operator could be used across all my DAGs. Need some suggestions.
Airflow Version: 2.1.3
Using Astronomer, hosted on Kubernetes
In order to import classes/methods from your module, you need to add the module's package to the Python path; that way the DagFileProcessor will be able to import the classes/methods when it processes the DAG script.
DAGS_FOLDER/
    dag.py
    my_operators/
        operator1.py
On your scheduler and all the workers, you need to change the Python path to PYTHONPATH=$PYTHONPATH:/path/to/DAGS_FOLDER, and in your DAG script you need to import from the my_operators package and not from .:
from my_operators.operator1 import CustomTriggerDagRunOperator
For local development, you can mark the DAGS_FOLDER as a source folder in your IDE/project, which is similar to adding it to the Python path.
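As a minimal sketch of what the dag.py in the layout above could then look like (the DAG id, schedule, and trigger target are placeholders, and it assumes your custom operator keeps TriggerDagRunOperator's trigger_dag_id argument):

# DAGS_FOLDER/dag.py
from datetime import datetime

from airflow import DAG
# resolved thanks to PYTHONPATH=$PYTHONPATH:/path/to/DAGS_FOLDER
from my_operators.operator1 import CustomTriggerDagRunOperator

with DAG(
    dag_id="example_using_custom_operator",  # placeholder DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    trigger = CustomTriggerDagRunOperator(
        task_id="trigger_other_dag",
        trigger_dag_id="other_dag",  # placeholder target DAG
    )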
I'm running a custom backfill script that backfills a DAG serially. (If I run the backfill concurrently, I either run into a deadlock problem or a serializable-isolation problem.) As part of the process, I sometimes have failed DAG runs mixed in with dates that have no runs at all. To backfill a failed date, I need to clear it first. The problem is that a cleared DAG run automatically restarts and conflicts with the first date running in the backfill.
airflow clear dag_id -f -s 01-01-20 -e 01-12-20
Because of the way it is built, I'll have to rewrite it from scratch and clear each backfill date individually. If I could prevent a cleared DAG run from restarting, it would save me some time. Is there a way to do this in the CLI?
You could try setting the max_active_runs argument to 1 when creating the DAG object. This ensures that no more than one run is active at a time; that way you can clear as many runs as you'd like and let Airflow handle the rest.
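For example, a minimal sketch (the DAG id, dates, and schedule are placeholders):

from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="my_backfilled_dag",      # placeholder DAG id
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    max_active_runs=1,               # at most one DAG run executes at a time
    catchup=True,
)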
If I have multiple Airflow DAGs with some overlapping Python package dependencies, how can I keep each of these projects' dependencies decoupled? E.g. if I had project A and B on the same server, I would run each of them with something like...
source /path/to/virtualenv_a/activate
python script_a.py
deactivate
source /path/to/virtualenv_b/activate
python script_b.py
deactivate
Basically, I would like to run DAGs in the same situation (e.g. each DAG uses Python scripts that may have overlapping package dependencies, which I would like to develop separately, i.e. not have to update all code using a package when I want to update the package for just one project). Note, I've been using the BashOperator to run Python tasks like...
do_stuff = BashOperator(
    task_id='my_task',
    bash_command='python /path/to/script.py',
    execution_timeout=timedelta(minutes=30),
    dag=dag)
Is there a way to get this working? Is there some other best-practice way that Airflow intends for people to address (or avoid) these kinds of problems?
Based on discussion on the apache-airflow mailing list, the simplest answer that fits the modular way in which I am using various Python scripts for tasks is to directly call each virtualenv's Python interpreter binary for each script or module, e.g.
source /path/to/virtualenv_a/activate
python script_a.py
deactivate
source /path/to/virtualenv_b/activate
python script_b.py
deactivate
would translate to something like
do_stuff_a = BashOperator(
    task_id='my_task_a',
    bash_command='/path/to/virtualenv_a/bin/python /path/to/script_a.py',
    execution_timeout=timedelta(minutes=30),
    dag=dag)
do_stuff_b = BashOperator(
    task_id='my_task_b',
    bash_command='/path/to/virtualenv_b/bin/python /path/to/script_b.py',
    execution_timeout=timedelta(minutes=30),
    dag=dag)
in an airflow dag.
To the question of passing args to the tasks: it depends on the nature of the args you want to pass in. In my case, certain args depend on what a data table looks like on the day the DAG is run (e.g. the highest timestamp record in the table, etc.). To add these args to the tasks, I have a "config" DAG that runs before the "real" one. In the config DAG, a task generates the args for the "real" DAG as a Python dict and writes them to a pickle file. The config DAG then has a TriggerDagRunOperator task that activates the "real" DAG, whose initial logic reads the pickle file generated by the config DAG (in my case, as a dict), and I interpolate the values into the bash_command string, like bash_command=f"python script.py {configs['arg1']}".
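A rough sketch of that pattern, with hypothetical paths and values (the pickle location, the example arg value, and the surrounding dag object are placeholders):

from datetime import timedelta
import pickle

from airflow.operators.bash_operator import BashOperator  # airflow.operators.bash in Airflow 2+

CONFIG_PATH = "/tmp/real_dag_configs.pkl"  # hypothetical file written by the "config" DAG

# --- in the "config" DAG: a task's python_callable dumps the args ---
def write_configs():
    configs = {"arg1": "2020-01-01T00:00:00"}  # e.g. the highest timestamp found in the table
    with open(CONFIG_PATH, "wb") as f:
        pickle.dump(configs, f)

# --- in the "real" DAG: module-level code reads the args back at parse time ---
with open(CONFIG_PATH, "rb") as f:
    configs = pickle.load(f)

do_stuff = BashOperator(
    task_id="my_task",
    bash_command=f"/path/to/virtualenv_a/bin/python /path/to/script.py {configs['arg1']}",
    execution_timeout=timedelta(minutes=30),
    dag=dag,  # assumes a DAG object named dag exists in this file, as in the snippets above
)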
You can use packaged DAGs, where each DAG is packaged together with its dependencies:
http://airflow.apache.org/concepts.html#packaged-dags
There are operators for running Python. There is a relatively new one, the PythonVirtualenvOperator, which will create an ephemeral virtualenv, install your dependencies, run your Python callable, then tear down the environment. This does create some per-task overhead, but it is a functional (if not ideal) approach to your dependency-overlap issue.
https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html#pythonvirtualenvoperator
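A minimal sketch of how that could look for one of the tasks above (the requirement pin and the callable body are placeholders):

from airflow.operators.python import PythonVirtualenvOperator

def run_in_venv():
    # imports must live inside the callable: it runs in the ephemeral virtualenv
    import pandas as pd
    print(pd.__version__)

do_stuff_a = PythonVirtualenvOperator(
    task_id="my_task_a",
    python_callable=run_in_venv,
    requirements=["pandas==1.5.3"],  # placeholder pin: project A's dependencies
    system_site_packages=False,      # keep the venv isolated from the worker's packages
    dag=dag,                         # assumes a DAG object named dag, as in the snippets above
)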
I wrote something wrong in my sql_test.py. When I run python sql_test.py, the error is 'no module named xxx', and the web UI shows a red error: Broken DAG.
Then when I run airflow list_dags, the same error occurs again. This is strange and I don't know what's happening.
I tried to run airflow delete_dags sql_test, but there is no such ID.
How can I:
remove the warning in the web UI
get sql_test out of list_dags
There's a mistake (here, a syntax or import error) in your dag-definition file, resulting in a failure to parse the DAG. When Airflow fails to parse a DAG, several features break (like list_dags in your case).
Of course deleting the problematic dag-definition file would fix it, but that's not a solution. So here's how you can find out what's wrong and fix it:
From linux shell, go to Airflow's logs folder
cd $AIRFLOW_HOME/logs/scheduler/latest/
Run tree command to see directory structure
tree -I "__init__.py|__pycache__|*.pyc"
View the last few lines of the log file of your corresponding broken dag
tail -n 25 /path/to/my/broken-dag.py.log
This will give you the stack-trace that Airflow threw while trying to parse your broken dag file. That would hopefully help you diagnose the problem and patch it.
Once your dag-definition file is fixed:
the broken dag message would disappear from UI
DAG would appear in the UI (refresh it a few times)
list_dags command would also start working
If you don't want to repair the DAG and would rather ignore it, you can hide the unwanted DAG by listing its underlying file in an .airflowignore file.
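For example, an .airflowignore file placed in your DAGS_FOLDER with a single line matching the file would make Airflow skip it when parsing:

sql_test.py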
I suspected that
airflow run dag_id task_id execution_date
would run all upstream tasks as well, but it does not; it simply fails when it sees that not all upstream tasks have been run. How can I run a specific task and all of its dependencies? I am guessing this is not possible because of an Airflow design decision, but is there a way to get around this?
You can run a task on its own by using the -i/-I/-A flags along with the run command.
But yes, the design of Airflow does not permit running a specific task and all of its dependencies.
For testing purposes, you can also backfill the DAG after removing the unrelated tasks from it.
It's a bit of a workaround, but if you have given your tasks task_ids consistently, you can try backfilling from the Airflow CLI (Command Line Interface):
airflow backfill -t TASK_REGEX ... dag_id
where TASK_REGEX corresponds to the naming pattern of the task you want to rerun and its dependencies.
(remember to add the rest of the command line options, like --start_date).