I'm learning airflow and I'm trying to figure out how to run the simplest possible example where a dag is defined and run end-to-end from Python.
I was originally following the doc tutorial
https://airflow.apache.org/docs/apache-airflow/stable/tutorial.html
But I'm noticing that all of these tutorials define DAGs in a global scope and then use an external bash command to run them. This seems far more complicated than it needs to be (in the simplest case where everything is local).
What I want is something that is self-contained. I want to import airflow, define the dag, and execute it immediately. I'm OK if I have to start some worker daemon / scheduler on my local machine before I run my code, but what seems not OK is needing to write a custom file just to define that DAG and then having to execute it somewhere outside of that file.
I'm not sure if this is possible, perhaps it runs contrary to the design philosophy of airflow, but it feels like it should be possible and that I'm just making a simple mistake, or perhaps I didn't configure airflow correctly before running this.
The example I would like to get working is as follows:
from airflow import DAG
from datetime import timezone
from datetime import datetime as datetime_cls
from airflow.operators.bash import BashOperator
now = datetime_cls.utcnow().replace(tzinfo=timezone.utc)
dag = DAG(
    'mycustomdag',
    start_date=now,
    catchup=False,
    tags=['example'],
)
t1 = BashOperator(task_id='task1', bash_command='date', dag=dag)
dag.run(verbose=True, local=True)
I'm running this directly in IPython, but the code might exist in some library function and get executed on the fly. The important feature is that no file whose sole purpose is defining this dag will exist on disk a priori. I want to dynamically define the dag and run it.
This seems like it should be fine. But this results in an exception:
BackfillUnfinished: Some task instances failed:
DAG ID       Task ID   Run ID                                       Try number
-----------  --------- -------------------------------------------  ------------
mycustomdag  task1     backfill__2022-06-20T19:51:38.686880+00:00   1
I'm OK if I need to pre-configure the runner. But I don't want to interact with this at all. I don't want to look at any UI. I just want to define and run the dag in Python. Is that possible, or is that just not how airflow works?
EDIT
I'm thinking this might not be possible (which is absolutely crazy to me from a design perspective). In these docs (https://airflow.apache.org/docs/apache-airflow/2.0.0/concepts.html#scope) it seems to state that dags must be defined in a global scope.
Is there any way around this? Can you locally define a dag and then export it to a file?
Related
I ran the airflow webserver under one of my virtual environments (myenv). When I tried to add some new dummy DAGs, things didn't go as I expected.
Here is the story:
First, I created a new DAG which is literally a copy of "example_bash_operator" with another dag_id. Then I put this dag in the same directory where the other example dags were, which is "~/myenv/lib/python3.8/site-packages/airflow/example_dags". But when I opened the web server UI, this newly created dag wasn't shown.
I'm really confused. Should I change AIRFLOW_HOME? I did export AIRFLOW_HOME=~/airflow as the Airflow documentation indicates. What's more, why are the example dags collected under the virtual environment's site-packages directory instead of the airflow home that I declared?
with DAG(
    dag_id='my_test',
    default_args=args,
    schedule_interval='0 0 * * *',
    start_date=days_ago(2),
    dagrun_timeout=timedelta(minutes=60),
    tags=['some_tag'],
    params={"example_key": "example_value"},
) as dag:
🔼🔼 This is the only place that I changed from example_bash_operator.
Example DAGs are just example DAGs - they are "hard-coded" in the airflow installation and shown only when you enable them in config. They are mostly there so you can quickly see some examples.
Your own dags should be placed in ${AIRFLOW_HOME}/dags, not in the example_dags folder. Airflow only regularly scans the DAGs folder for changes because it does not expect example dags to change. Ever. It's a strange idea to change files inside an installed python package.
Just place your dags in ${AIRFLOW_HOME}/dags and, if they have no problems, they should show up quickly. You can also disable the examples in airflow.cfg, and then you will have a cleaner list containing only your dags from the "dags" folder.
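For illustration, a minimal sketch of the kind of file you would drop into ${AIRFLOW_HOME}/dags (the file name, dag_id and bash command are made up for the example, assuming Airflow 2.x):
# ~/airflow/dags/my_first_dag.py  (hypothetical file under ${AIRFLOW_HOME}/dags)
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='my_first_dag',
    start_date=datetime(2022, 1, 1),
    schedule_interval='@daily',
    catchup=False,
    tags=['example'],
) as dag:
    BashOperator(task_id='print_date', bash_command='date')
To hide the bundled examples, set load_examples = False in the [core] section of airflow.cfg (or set AIRFLOW__CORE__LOAD_EXAMPLES=False) and restart the scheduler and webserver.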
I placed a dag file in the dags folder based on a tutorial, with slight modifications, but it doesn't show up in the GUI or when I run airflow dags list.
Answering my own question: check the python file for exceptions by running it directly. It turns out an exception in the dag's python script, caused by a missing import, made the dag not show up in the list. I note this just in case another new user comes across it. To me the moral of the story is that dag files should be checked by running them directly with python whenever they are modified, because there won't be an obvious error otherwise; the dag may simply disappear from the list.
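A related check you can do from Python is to load the dags folder into a DagBag and print any import errors; a small sketch, assuming Airflow 2.x (the dags folder path below is hypothetical):
# check_dags.py - hypothetical helper: run it with plain python to surface import errors
from airflow.models import DagBag

bag = DagBag(dag_folder='/home/me/airflow/dags', include_examples=False)
for path, error in bag.import_errors.items():
    print(f'{path}: {error}')
print(f'{len(bag.dags)} dag(s) imported successfully')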
I am using some constants in my DAG which are being imported from another (configuration) Python file in the project directory.
Scenario
Airflow is running, I add a new DAG. I am importing the schedule_interval from that configuration file and some other constant as well which I am passing to a function being called in the PythonOperator in my DAG.
I update the code base, so the new dag gets added to the airflow_dag folder and the configuration file gets updated with the new constants.
Problem
The schedule_interval does not work and the dag does not get scheduled. It also throws an exception (import error) for any other constant that is being imported into the dag.
In the web UI I can see the new dag, but I also see a red error label saying that constant XYZ could not be found in configuration_file.py, while it's actually there.
This does not resolve itself no matter how long I wait.
Bad Solution
I go and restart the airflow scheduler (and webserver as well just in case), and everything starts working again.
Question
Is there a solution where I will not have to restart airflow for it to pick up those changes?
Note: The solution proposed in the question Can airflow load dags file without restart scheduler did not work for me.
Are there any best practices that are followed for deploying new dags to airflow?
I saw a couple of comments on the Google forum stating that the dags are kept in a Git repository which is synced periodically to a local location on the airflow cluster. Regarding this approach, I have a couple of questions:
Do we maintain separate dag files for separate environments (testing, production)?
How to handle rollback of an ETL to an older version in case the new version has a bug?
Any help here is highly appreciated. Let me know in case you need any further details.
Here is how we manage it for our team.
First, in terms of naming convention, each of our DAG file names matches the DAG Id from the content of the DAG itself (including the DAG version). This is useful because ultimately it's the DAG Id that you see in the Airflow UI, so you will know exactly which file is behind each DAG.
Example for a DAG like this:
from airflow import DAG
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2017, 12, 5, 23, 59),
    'email': ['me@mail.com'],
    'email_on_failure': True
}
dag = DAG(
    'my_nice_dag-v1.0.9',  # update version whenever you change something
    default_args=default_args,
    schedule_interval="0,15,30,45 * * * *",
    dagrun_timeout=timedelta(hours=24),
    max_active_runs=1)
[...]
The name of the DAG file would be: my_nice_dag-v1.0.9.py
All our DAG files are stored in a Git repository (among other things)
Every time a merge request is merged into our master branch, our Continuous Integration pipeline starts a new build and packages our DAG files into a zip (we use Atlassian Bamboo but there are other solutions like Jenkins, Circle CI, Travis...)
In Bamboo we configured a deployment script (shell) which unzips the package and places the DAG files on the Airflow server in the /dags folder.
We usually deploy the DAGs in DEV for testing, then to UAT and finally PROD. The deployment is done with the click of a button in Bamboo UI thanks to the shell script mentioned above.
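For illustration only, a rough Python equivalent of that unzip-and-place step (the actual script is shell and is not shown in the answer; the artifact name and target folder below are hypothetical):
# deploy_dags.py - hypothetical sketch of the deployment step described above
import zipfile
from pathlib import Path

artifact = Path('/tmp/dag-package.zip')    # hypothetical CI build artifact
dags_folder = Path('/opt/airflow/dags')    # the Airflow server's dags folder

with zipfile.ZipFile(artifact) as archive:
    names = archive.namelist()
    archive.extractall(dags_folder)
print(f'deployed {len(names)} file(s) to {dags_folder}')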
Benefits
Because you have included the DAG version in your file name, the previous version of your DAG file is not overwritten in the DAG folder so you can easily come back to it
When your new DAG file is loaded in Airflow you can recognize it in the UI thanks to the version number.
Because your DAG file name = DAG Id, you could even improve the deployment script by adding some Airflow command line to automatically switch ON your new DAGs once they are deployed (see the sketch after this list).
Because every version of the DAGs is kept in Git history, we can always come back to previous versions if needed.
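A hedged sketch of that switch-ON automation, assuming the Airflow 2.x CLI and the dag_id from the example above (shown here as a small Python step rather than shell, to match the other examples):
# hypothetical final step of the deployment script: switch the freshly deployed DAG on
import subprocess

subprocess.run(['airflow', 'dags', 'unpause', 'my_nice_dag-v1.0.9'], check=True)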
As of yet, Airflow doesn't have its own functionality for versioning workflows (see this).
However, you can manage that on your own by keeping the DAGs in their own git repository and fetching its state into the airflow repository as a submodule. In this way you always have a single airflow version that contains sets of DAGs with specific versions. More on that here.
One best practice is written in the documentation:
Deleting a task
Never delete a task from a DAG. In case of deletion, the historical information of the task disappears from the Airflow UI. It is advised to create a new DAG in case the tasks need to be deleted.
I believe this is why the versioning topic is not so easy to solve yet, and we have to plan some workarounds.
https://airflow.apache.org/docs/apache-airflow/2.0.0/best-practices.html#deleting-a-task
These days I'm working on a new ETL project and I wanted to give Airflow a try as job manager.
My colleague and I are both working with Airflow for the first time, and we are following two different approaches: I decided to write python functions (operators like the ones included in the apache-airflow project), while my colleague uses airflow to call external python scripts through the BashOperator.
I'd like to know if there is something like a "good practice", whether the two approaches are equally good, or whether I should consider one over the other.
To me, the main differences are:
- with BashOperator you can call a python script using a specific python environment with specific packages
- with BashOperator the tasks are more independent and can be launched manually if airflow goes mad
- with BashOperator task to task communication is a bit harder to manage
- with BashOperator task errors and failures are harder to manage (how can a bash task know if the task before it failed or succeeded?).
What do you think?
My personal preference in these cases would be to use a PythonOperator over BashOperator. Here's what I do and why:
Single repo that contains all my DAGs. This repo also has a setup.py that includes airflow as a dependency, along with anything else my DAGs require. Airflow services are run from a virtualenv that installs these dependencies. This handles the python environment you mentioned regarding BashOperator.
I try to put all Python logic unrelated to Airflow in its own externally packaged python library. That code should have its own unit tests and also has its own main so it can be called on the command line independent of Airflow. This addresses your point about when Airflow goes mad!
If the logic is small enough that it doesn't make sense to separate into its own library, I drop it in a utils folder in my DAG repo, with unit tests still of course.
Then I call this logic in Airflow with the PythonOperator. The python callable can easily be unit tested, unlike a BashOperator template script. This also means you can do things like start an Airflow DB session, push multiple values to XCom, etc.
Like you mentioned, error handling is a bit easier with Python. You can catch exceptions and check return values easily. You can choose to mark the task as skipped by raising AirflowSkipException.
FYI for BashOperator, if the script exits with an error code, Airflow will mark the task as failed.
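A minimal sketch of that PythonOperator approach, assuming Airflow 2.x (the dag_id and the fetch_records helper are made up for the example):
from datetime import datetime

from airflow import DAG
from airflow.exceptions import AirflowSkipException
from airflow.operators.python import PythonOperator

def fetch_records():
    # Hypothetical stand-in for the real logic, which would live in its own tested package.
    return []

def process():
    records = fetch_records()
    if not records:
        # Mark the task as skipped instead of failed when there is nothing to do.
        raise AirflowSkipException('no records available at source')
    # The return value is pushed to XCom by default, so downstream tasks can read it.
    return len(records)

with DAG(
    dag_id='python_operator_sketch',
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id='process', python_callable=process)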
TaskA checks data availability at the source. TaskB processes it.
TaskA >> TaskB
Both tasks use BashOperator to call python scripts. I used to call sys.exit(1) (when there is no data at the source) from script1, triggered by TaskA, as a way to signal that TaskA failed because there is no data and there is no need to run TaskB.
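A sketch of that pattern, assuming Airflow 2.x (the script paths are hypothetical): a non-zero exit code from the bash command marks TaskA as failed, and with the default trigger_rule ('all_success') TaskB then does not run.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='source_check_sketch',
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # script1.py calls sys.exit(1) when there is no data, which fails this task.
    task_a = BashOperator(task_id='task_a', bash_command='python /opt/scripts/script1.py')
    # With the default trigger_rule ('all_success'), task_b does not run if task_a fails.
    task_b = BashOperator(task_id='task_b', bash_command='python /opt/scripts/script2.py')
    task_a >> task_b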