Add new DAG to airflow scheduler - airflow

I ran the Airflow webserver under one of my virtual environments (myenv). When I tried to add some new dummy DAGs, things didn't go as I expected.
Here is the story:
First, I created a new DAG which is literally a copy of "example_bash_operator" with another dag_id. Then I put this DAG in the same directory as the other example DAGs, which is "~/myenv/lib/python3.8/site-packages/airflow/example_dags". But when I opened the webserver UI, this newly created DAG wasn't shown.
I'm really confused. Should I change AIRFLOW_HOME? I did export AIRFLOW_HOME=~/airflow as the Airflow documentation indicates. What's more, why are the example DAGs collected under the virtual environment's site-packages directory instead of the Airflow home that I declared?
from datetime import timedelta

from airflow import DAG
from airflow.utils.dates import days_ago

args = {'owner': 'airflow'}  # same default_args as in example_bash_operator

with DAG(
    dag_id='my_test',
    default_args=args,
    schedule_interval='0 0 * * *',
    start_date=days_ago(2),
    dagrun_timeout=timedelta(minutes=60),
    tags=['some_tag'],
    params={"example_key": "example_value"},
) as dag:
    ...  # the tasks below this line are unchanged from example_bash_operator
🔼🔼 This is the only place that I changed from example_bash_operator.

Example DAGs are just example DAGs: they are "hard-coded" in the Airflow installation and shown only when you enable them in the config. They are mostly there so you can quickly see some examples.
Your own DAGs should be placed in ${AIRFLOW_HOME}/dags, not in the example_dags folder. Airflow only scans the DAGs folder regularly for changes because it does not expect example DAGs to change. Ever. Modifying files inside an installed Python package is a strange idea anyway.
Just place your DAG in ${AIRFLOW_HOME}/dags and, if it has no problems, it should show up quickly. You can also disable the examples in airflow.cfg, and then you will have a cleaner list containing only your DAGs from the "dags" folder.
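For reference, both of these are controlled from the [core] section of airflow.cfg; a minimal sketch (the dags_folder path below is just a placeholder for your own ${AIRFLOW_HOME}/dags):

[core]
# folder that the scheduler regularly scans for your own DAG files
dags_folder = /home/you/airflow/dags
# set to False to hide the bundled example DAGs
load_examples = False

Changes to load_examples only take effect after the webserver and scheduler are restarted.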

Related

Access airflow dag run conf in UI

I have Airflow 1.10.12 installed on a server and I've launched a DAG with a trigger configuration which is JSON. I know how to access the config in my DAG's code, but I want to see it in the Airflow UI. Where should I look?
It's not very convenient, but you can view the DAG run conf via Browse -> DAG Runs -> view "Conf" column (and filter for your specific DAG run).
There is code on the main branch which adds a separate DAG run page, but that's not released yet: https://github.com/apache/airflow/pull/19705.

Unable to create dag with apache airflow

I am running Airflow 2.0, setting up an Airflow DAG for the first time, and following the quick start tutorials.
After creating and running the .py file, I don't see the DAG created; it is not listed for me.
Setting:
airflow.cfg: dags_folder = /Users/vik/src/airflow/dags
My Python file is in this folder. There are no errors here.
I am able to see the example DAGs from example_dags.
I did airflow db init
airflow webserver
airflow scheduler
then tried to list the DAGs.
I think I am missing something.
I don't know exactly how you installed everything, but I highly recommend the Astronomer CLI for simplicity and quick setup. With that you'll be able to set up a first DAG pretty quickly. Here is also the video tutorial that helps you understand how to install / set up everything.
A few things to try (a consolidated sketch of these commands follows the list):
Make sure the scheduler is running (run airflow scheduler) or try restarting it.
Using the Airflow CLI, run airflow config list and make sure that the loaded config is in fact what you are expecting; in particular, check the value of dags_folder.
Try running airflow dags list from the CLI, and if your DAG is shown in the results, check its file path.
If there was an error parsing your DAG, so that it could not be loaded by the scheduler, you can find the logs in ${AIRFLOW_HOME}/logs/scheduler/${date}/your_dag_id.log
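Put together, the checks above look roughly like this (the log path assumes the default layout; adjust it to your installation):

# make sure a scheduler is running (this starts one in the foreground)
airflow scheduler
# confirm which config is loaded and where dags_folder points
airflow config list | grep dags_folder
# see whether your DAG, and the file path it was loaded from, shows up at all
airflow dags list
# if it is missing, look for parse errors in the scheduler logs
ls ${AIRFLOW_HOME}/logs/scheduler/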

Airflow dag file is in place but it's not showing up when I do airflow dags list

I placed a DAG file in the dags folder based on a tutorial, with slight modifications, but it doesn't show up in the GUI or when I run airflow dags list.
Answering my own question: check the Python file for exceptions by running it directly. It turns out that one exception in the DAG's Python script, caused by a missing import, made the DAG not show up in the list. I note this just in case another new user comes across this. To me, the moral of the story is that DAG files should be checked by running them with python directly whenever they are modified, because otherwise there won't be an obvious error; the DAG may just disappear from the list.
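In other words, something like this (assuming the default ~/airflow/dags folder and a hypothetical file name):

# any import error or other exception in the file will surface immediately
python ~/airflow/dags/my_dag.py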

Pick DAG related constants from configuration file WITHOUT restarting airflow

I am using some constants in my DAG which are being imported from another (configuration) Python file in the project directory.
Scenario
Airflow is running and I add a new DAG. I import the schedule_interval from that configuration file, as well as another constant, which I pass to a function called by the PythonOperator in my DAG (sketched below).
I update the code base, so the new DAG gets added to the Airflow DAGs folder and the configuration file gets updated with the new constants.
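A minimal sketch of the layout described above (file, constant, and task names are hypothetical, and the imports assume Airflow 2):

# configuration_file.py, sitting next to the DAG in the DAGs folder
SCHEDULE_INTERVAL = "0 6 * * *"
SOME_CONSTANT = "xyz"

# my_dag.py
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from configuration_file import SCHEDULE_INTERVAL, SOME_CONSTANT

def use_constant(value):
    print(value)

with DAG(
    dag_id="my_dag",
    schedule_interval=SCHEDULE_INTERVAL,
    start_date=datetime(2021, 1, 1),
) as dag:
    PythonOperator(
        task_id="use_constant",
        python_callable=use_constant,
        op_args=[SOME_CONSTANT],
    )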
Problem
The schedule_interval does not take effect and the DAG does not get scheduled. Airflow also throws an exception (import error) for any other constant that is imported in the DAG.
In the web UI I can see the new DAG, but I can also see a red error label saying that constant XYZ could not be found in configuration_file.py, while it is actually there.
This does not resolve itself no matter how long I wait.
Bad Solution
I go and restart the airflow scheduler (and webserver as well just in case), and everything starts working again.
Question
Is there a solution to this where I will not have to restart airflow and update those things?
Note: the solution proposed in the question Can airflow load dags file without restart scheduler (refreshing the DAG) did not work for me.

Efficient way to deploy dag files on airflow

Are there any best practices that are followed for deploying new dags to airflow?
I saw a couple of comments on the Google forum stating that the DAGs are kept inside a Git repository and synced periodically to a local location on the Airflow cluster. Regarding this approach, I have a couple of questions:
Do we maintain separate DAG files for separate environments? (testing, production)
How to handle rollback of an ETL to an older version in case the new version has a bug?
Any help here is highly appreciated. Let me know in case you need any further details.
Here is how we manage it for our team.
First, in terms of naming convention, each of our DAG file names matches the DAG id from the content of the DAG itself (including the DAG version). This is useful because ultimately it's the DAG id that you see in the Airflow UI, so you will know exactly which file is behind each DAG.
Example for a DAG like this:
from airflow import DAG
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2017, 12, 5, 23, 59),
    'email': ['me@mail.com'],
    'email_on_failure': True
}

dag = DAG(
    'my_nice_dag-v1.0.9',  # update version whenever you change something
    default_args=default_args,
    schedule_interval="0,15,30,45 * * * *",
    dagrun_timeout=timedelta(hours=24),
    max_active_runs=1)
[...]
The name of the DAG file would be: my_nice_dag-v1.0.9.py
All our DAG files are stored in a Git repository (among other things)
Every time a merge request is merged into our master branch, our continuous integration pipeline starts a new build and packages our DAG files into a zip (we use Atlassian Bamboo, but there are other solutions like Jenkins, Circle CI, Travis...)
In Bamboo we configured a deployment script (shell) which unzips the package and places the DAG files on the Airflow server in the /dags folder (a stripped-down sketch of such a script follows).
We usually deploy the DAGs in DEV for testing, then to UAT and finally PROD. The deployment is done with the click of a button in the Bamboo UI thanks to the shell script mentioned above.
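A hypothetical, minimal version of such a deployment script (the package name and target path are placeholders, not the actual Bamboo setup):

#!/bin/bash
set -euo pipefail
PACKAGE=dag-files.zip            # artifact produced by the CI build
DAGS_FOLDER=/opt/airflow/dags    # DAGs folder on the Airflow server
# unpack the new DAG files next to the existing (older) versions
unzip -o "$PACKAGE" -d "$DAGS_FOLDER"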
Benefits
Because you have included the DAG version in your file name, the previous version of your DAG file is not overwritten in the DAG folder so you can easily come back to it
When your new DAG file is loaded in Airflow you can recognize it in the UI thanks to the version number.
Because your DAG file name = DAG id, you could even improve the deployment script by adding an Airflow command line call to automatically switch ON your new DAGs once they are deployed (see the example after this list).
Because every version of the DAGs is kept in the Git history, we can always come back to previous versions if needed.
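For instance, with the Airflow 2 CLI the switch-on step could be as simple as the following (in Airflow 1.10 the equivalent command is airflow unpause):

# unpause the freshly deployed DAG so it starts getting scheduled
airflow dags unpause my_nice_dag-v1.0.9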
As of yet, Airflow doesn't have its own functionality for versioning workflows (see this).
However, you can manage that on your own by keeping the DAGs in their own Git repository and fetching its state into the Airflow repository as a submodule. In this way you always have a single Airflow version that contains a set of DAGs with specific versions. Watch more here
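A rough sketch of that submodule workflow (the repository URL and folder name are placeholders):

# inside the Airflow repository: track the DAG repository as a submodule
git submodule add https://example.com/your-org/airflow-dags.git dags
# later, move the pin to a newer state of the DAG repository
git submodule update --remote dags
git commit -am "Bump DAGs to the latest version"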
One best practice is written in the documentation:
Deleting a task
Never delete a task from a DAG. In case of deletion, the historical information of the task disappears from the Airflow UI. It is advised to create a new DAG in case the tasks need to be deleted.
I believe this is why the versioning topic is not so easy to solve yet, and we have to plan some workarounds.
https://airflow.apache.org/docs/apache-airflow/2.0.0/best-practices.html#deleting-a-task
