Are there any best practices that are followed for deploying new dags to airflow?
I saw a couple of comments on the Google forum stating that the DAGs are kept in a Git repository which is periodically synced to a local location on the Airflow cluster. Regarding this approach, I have a couple of questions:
Do we maintain separate DAG files for separate environments (testing, production)?
How do we handle rolling back an ETL to an older version in case the new version has a bug?
Any help here is highly appreciated. Let me know if you need any further details.
Here is how we manage it for our team.
First, in terms of naming convention, each of our DAG file names matches the DAG Id from the content of the DAG itself (including the DAG version). This is useful because ultimately it's the DAG Id that you see in the Airflow UI, so you will know exactly which file is behind each DAG.
Example for a DAG like this:
from airflow import DAG
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2017, 12, 5, 23, 59),
    'email': ['me@mail.com'],
    'email_on_failure': True
}

dag = DAG(
    'my_nice_dag-v1.0.9',  # update the version whenever you change something
    default_args=default_args,
    schedule_interval="0,15,30,45 * * * *",
    dagrun_timeout=timedelta(hours=24),
    max_active_runs=1)
[...]
The name of the DAG file would be: my_nice_dag-v1.0.9.py
All our DAG files are stored in a Git repository (among other things)
Every time a merge request is merged into our master branch, our Continuous Integration pipeline starts a new build and packages our DAG files into a zip (we use Atlassian Bamboo, but there are other solutions like Jenkins, CircleCI, Travis...)
In Bamboo we configured a deployment script (shell) which unzips the package and places the DAG files on the Airflow server in the /dags folder (a simplified sketch of this step is shown below).
We usually deploy the DAGs in DEV for testing, then to UAT and finally PROD. The deployment is done with the click of a button in Bamboo UI thanks to the shell script mentioned above.
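For illustration, here is a minimal sketch of that unzip-and-deploy step, written in Python here rather than the shell script the team actually uses; the archive name and target folder are placeholders to adapt to your setup.

import zipfile
from pathlib import Path

# Placeholder paths: adjust to your CI artifact and Airflow installation.
PACKAGE = Path("/tmp/dags_package.zip")   # zip produced by the CI build
DAGS_FOLDER = Path("/opt/airflow/dags")   # Airflow's dags folder on the server

def deploy_dags(package=PACKAGE, dags_folder=DAGS_FOLDER):
    """Unzip the CI package into the DAGs folder."""
    dags_folder.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(package) as archive:
        # Versioned file names (e.g. my_nice_dag-v1.0.9.py) mean older
        # versions are never overwritten.
        archive.extractall(dags_folder)

if __name__ == "__main__":
    deploy_dags()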
Benefits
Because you have included the DAG version in your file name, the previous version of your DAG file is not overwritten in the DAG folder so you can easily come back to it
When your new DAG file is loaded in Airflow you can recognize it in the UI thanks to the version number.
Because your DAG file name = DAG Id, you could even improve the deployment script by adding a few Airflow CLI commands to automatically switch ON your new DAGs once they are deployed (see the sketch after this list).
Because every version of the DAGs is kept in Git history, we can always come back to previous versions if needed.
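A hypothetical sketch of that last idea, calling the Airflow CLI from the deployment script to unpause a freshly deployed DAG; the exact subcommand depends on your Airflow version (airflow unpause on 1.x, airflow dags unpause on 2.x).

import subprocess

def switch_on(dag_id):
    """Unpause a freshly deployed DAG via the Airflow CLI (2.x syntax shown)."""
    subprocess.run(["airflow", "dags", "unpause", dag_id], check=True)

switch_on("my_nice_dag-v1.0.9")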
As of yet, Airflow doesn't have its own support for versioning workflows (see this).
However, you can manage that yourself by keeping the DAGs in their own Git repository and pulling them into the Airflow repository as submodules. This way you always have a single Airflow version that contains sets of DAGs with specific versions. Watch more here
One best practice is written in the documentation:
Deleting a task
Never delete a task from a DAG. In case of deletion, the historical information of the task disappears from the Airflow UI. It is advised to create a new DAG in case the tasks need to be deleted
I believe this is why the versioning topic is not so easy to solve yet, and we have to plan some workarounds.
https://airflow.apache.org/docs/apache-airflow/2.0.0/best-practices.html#deleting-a-task
Related
I have Airflow 1.10.12 installed on a server and I've launched a DAG with a trigger DAG configuration, which is JSON. I know how to access the conf in my DAG's code, but I want to see it in the Airflow UI. Where should I look?
It's not very convenient, but you can view the DAG run conf via Browse -> DAG Runs -> view "Conf" column (and filter for your specific DAG run).
There is code on the main branch which adds a separate DAG run page, but that's not released yet: https://github.com/apache/airflow/pull/19705.
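For reference, the conf in question is the JSON payload passed at trigger time, e.g. airflow trigger_dag -c '{"key": "value"}' my_dag on 1.10.x. A minimal sketch of how task code reads it from the context, assuming a dag object defined elsewhere:

from airflow.operators.python_operator import PythonOperator

def show_conf(**context):
    # The JSON passed with `airflow trigger_dag -c ...` ends up here.
    conf = context["dag_run"].conf or {}
    print(conf)

show_conf_task = PythonOperator(
    task_id="show_conf",
    python_callable=show_conf,
    provide_context=True,  # required on Airflow 1.10.x
    dag=dag,               # assumes a `dag` object defined elsewhere
)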
I ran the Airflow webserver under one of my virtual environments (myenv). When I tried to add some new dummy DAGs, things didn't go as I expected.
Here is the story:
First, I created a new DAG which is literally a copy of "example_bash_operator" with another dag_id. Then I put this DAG in the same directory as the other example DAGs, which is "~/myenv/lib/python3.8/site-packages/airflow/example_dags". But when I opened the webserver UI, this newly created DAG wasn't shown.
I'm really confused. Should I change AIRFLOW_HOME? I did export AIRFLOW_HOME=~/airflow as the Airflow documentation indicates. What's more, why are the example DAGs collected under the virtual environment's site-packages directory instead of the Airflow home that I declared?
from datetime import timedelta

from airflow import DAG
from airflow.utils.dates import days_ago

# `args` is the same default_args dict used by example_bash_operator
with DAG(
    dag_id='my_test',
    default_args=args,
    schedule_interval='0 0 * * *',
    start_date=days_ago(2),
    dagrun_timeout=timedelta(minutes=60),
    tags=['some_tag'],
    params={"example_key": "example_value"},
) as dag:
🔼🔼 This is the only place that I changed from example_bash_operator.
Example DAGs are just example DAGs: they are "hard-coded" into the Airflow installation and shown only when you enable them in the config. They are mostly there so you can quickly see some examples.
Your own DAGs should be placed in ${AIRFLOW_HOME}/dags, not in the example_dags folder. Airflow only regularly scans the DAGs folder for changes because it does not expect example DAGs to change. Ever. It's generally a bad idea to modify files inside an installed Python package.
Just place your DAGs in ${AIRFLOW_HOME}/dags and, if they have no problems, they should show up quickly. You can also disable the examples in airflow.cfg and then you will have a cleaner list containing only your own DAGs from the "dags" folder.
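If in doubt, you can check from inside the same virtual environment which folder Airflow actually scans and whether the examples are enabled; a small sketch using Airflow's configuration API:

from airflow.configuration import conf

# The folder Airflow scans for your own DAGs, and whether the
# bundled example DAGs are enabled.
print("dags_folder   =", conf.get("core", "dags_folder"))
print("load_examples =", conf.getboolean("core", "load_examples"))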
I am running Airflow 2.0, setting up an Airflow DAG for the first time, and following the quick start tutorials.
After creating and running the .py file, I don't see the DAG; it doesn't get listed for me.
setting:
airflow.cfg:dags_folder = /Users/vik/src/airflow/dags
my python file is in this folder. There are no errors here.
I am able to see the example DAGs from example_dags.
I did airflow db init
airflow webserver
airflow scheduler
then tried to list the DAGs
I think I am missing something.
I don't know exactly how you installed everything, but I highly recommend the Astronomer CLI for simplicity and quick setup. With that you'll be able to set up a first DAG pretty quickly. Here is also the video tutorial that helps you understand how to install / set up everything.
A few things to try:
Make sure the scheduler is running (run airflow scheduler) or try to restart it.
Using the Airflow CLI, run airflow config list and make sure that the loaded config is in fact what you are expecting; check the value of dags_folder.
Try running airflow dags list from the CLI, and check the file path if your DAG is shown in the results.
If there was an error parsing your DAG, so it could not be loaded by the scheduler, you can find the logs in ${AIRFLOW_HOME}/logs/scheduler/${date}/your_dag_id.log
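In addition to the list above, you can reproduce what the scheduler sees by parsing the DAGs folder with a DagBag; a small sketch, to be run in the same environment where Airflow is installed:

from airflow.models import DagBag

# Parse the configured dags_folder the same way the scheduler does.
dag_bag = DagBag()

print("DAGs found:", sorted(dag_bag.dags))
for filepath, error in dag_bag.import_errors.items():
    print(f"Import error in {filepath}:\n{error}")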
I am using Cloud Composer to schedule multiple DAGs. These DAGs are built dynamically using this method and they use custom plugins.
I would like to know how to proceed when adding / modifying a plugin that concerns all DAGs (let's say it adds a new task to each DAG).
Do we need to pause all the running DAGs when doing so?
What I have done so far when adding / modifying a plugin is:
Upload the plugins into the plugins bucket of the Composer cluster (using gcloud composer command)
Do a dummy update in the Airflow config -> add a dummy value to the airflow.cfg (using gcloud composer commands)
I did that to force the DAGs to pause, and once the update is finished the DAGs are resumed, but with the new plugins and hence the new tasks (or, if it's not in this DAG run, then in the next one). Is this step useless?
Thanks if you can help.
As explained in the architecture diagram, the Airflow webserver where you view your DAG and plugin code runs in a Google-managed tenant project, whereas the Airflow workers which actually run your DAG and plugin code are directly in your project.
When a DAG/plugin is placed in the Composer bucket, the Airflow webserver (which falls under the tenant project) validates the code and updates any new scheduling changes in the Airflow database.
At the same time, the Airflow scheduler (in your project) asks the Airflow database for the next DAG to run and notifies the Airflow workers to perform the scheduled work. The Airflow workers (in your project) then grab the DAG/Plugin code from the Composer bucket and compile them in order to run that specific task.
Thus, any updates made to your DAG/plugin code are read separately by the Airflow webserver and the Airflow workers, at different times.
If you do not see your new code in the Airflow webserver, it should still be picked up by the Workers when they grab the code fresh on the new task run.
Therefore you shouldn't have to restart Composer for the workers to pick up the changes.
You cannot force a worker to grab and re-compile the new code mid task execution.
There are two ways to refresh the Airflow Webserver to see the Plugin code changes if it is not updating:
Set the reload_on_plugin_change property to True in the [webserver] section via the ‘AIRFLOW CONFIGURATIONS OVERRIDE’ tab in the Console.
OR, you can specifically add/remove/update a PyPI package via the ‘PYPI PACKAGES’ Console tab. Non-PyPI package changes will not trigger the webserver restart. Note this will also initiate an entire Composer environment restart, which may take ~20 minutes.
I am using some constants in my DAG which are being imported from another (configuration) Python file in the project directory.
Scenario
Airflow is running; I add a new DAG. I import the schedule_interval from that configuration file, along with another constant which I pass to a function called by the PythonOperator in my DAG.
I update the code base, so the new DAG gets added to the airflow_dag folder and the configuration file gets updated with the new constants.
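For concreteness, the layout is roughly like this (module, constant and DAG names are hypothetical; the PythonOperator import path shown is the 1.10-style one, on Airflow 2.x it lives in airflow.operators.python):

# configuration_file.py (hypothetical config module in the project directory)
SCHEDULE_INTERVAL = "0 6 * * *"
XYZ = "some_value"

# my_dag.py (the new DAG importing the constants above)
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from configuration_file import SCHEDULE_INTERVAL, XYZ

def use_constant():
    print(XYZ)

with DAG(
    dag_id="my_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval=SCHEDULE_INTERVAL,
) as dag:
    PythonOperator(task_id="use_constant", python_callable=use_constant)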
Problem
The schedule_interval does not work and the DAG does not get scheduled. It also throws an exception (import error) for any other constant being imported into the DAG.
In the web UI I can see the new DAG, but I can also see a red error label saying it could not find constant XYZ in configuration_file.py, while it's actually there.
The error does not go away no matter how long I wait.
Bad Solution
I go and restart the airflow scheduler (and webserver as well just in case), and everything starts working again.
Question
Is there a solution to this where I will not have to restart Airflow for these changes to be picked up?
Note: The solution proposed in the question Can airflow load dags file without restart scheduler did not work for me.