Airflow DAG Versioning - airflow

Is DAG versioning a thing? I can't find much on the subject with a few Google searches. I would like to look at the DAGs screen in Airflow and be sure which version of the DAG code is in the wild.
The simplest solution would be to include a version number as part of the dag_id, but I would appreciate knowing if anyone has a better alternative. Tags would work too and might look good in the UI, but they are designed for filtering, so I'm not sure whether there would be undesirable side effects.
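The two ideas in the question (a version in the dag_id, or a version tag) could be sketched roughly as follows. The helper names and the version string are hypothetical, not part of Airflow; only the `dag_id` and `tags` arguments of the DAG constructor are assumed to exist.

```python
# Hypothetical helpers: none of these names come from Airflow itself.
DAG_VERSION = "1.2.0"  # bumped by hand whenever the DAG code changes


def versioned_dag_id(base_id: str, version: str) -> str:
    """Return a dag_id with the version appended, e.g. 'daily_etl_v1_2_0'."""
    return f"{base_id}_v{version.replace('.', '_')}"


def version_tag(version: str) -> str:
    """Return a tag string that could be shown in the DAGs list."""
    return f"version:{version}"


dag_id = versioned_dag_id("daily_etl", DAG_VERSION)
tags = [version_tag(DAG_VERSION)]

# In a DAG file these values would be passed to the DAG constructor, e.g.
#   DAG(dag_id=dag_id, tags=tags, ...)
print(dag_id, tags)
```

One trade-off to keep in mind: Airflow treats a changed dag_id as a brand-new DAG, so run history stays with the old id, whereas a tag leaves the id and its history untouched.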

As the author of the DAG Versioning AIP, I can say that this work has been deferred until after 2.0, mainly so that we can support end-to-end DAG Versioning.
Originally, we (Airflow Core Committers) were planning a Webserver-only DAG Versioning, i.e. one that improves the visibility behaviour but not execution:
The scope of this AIP to make sure that the visibility behavior of
Airflow is correct, without changing the execution behaviour which
will continue to be based on the most recent version of the DAG.
This means you can go back to an old version of the DAG, view its shape as it was a few months back, and see the correct representation instead of "always-latest".
Currently, Airflow suffers from the issue where if you add/remove a task, it gets added/removed in all the previous DagRuns in the Webserver.
However, we have since decided to deliver Remote DAG Fetcher + DAG Versioning together and to enable versioning of DAGs on the worker side, so a user will be able to run a DAG with a previous version too.
We don't have a firm date yet, but we are planning to do it around the end of 2021.

The Airflow project has a draft feature open to support DAG versions. The answer currently is that Airflow does not support versions.
The first use case in the link describes a key limitation: log files from previous runs can only surface nodes from the current DAG.

As mentioned above, Airflow doesn't yet have its own functionality for versioning workflows. However, you can manage that yourself by keeping DAGs in their own git repository and pulling them into the Airflow repository as submodules. More on that:
https://www.youtube.com/watch?v=a-4yRne3ba4&lc=UgwiIO-ECVFSZPz1hOt4AaABAg

Related

Job Sensors in Databricks Workflows

At the moment we schedule our Databricks notebooks using Airflow. Due to dependencies between projects, there are dependencies between DAGs. Some DAGs wait until a task in a previous DAG is finished before starting (by using sensors).
We are now looking to use Databricks DBX. It is still new for us, but it seems that DBX' main added value is when you use Databricks workflows. It would be possible to run a Python wheel in a job that was created by DBX. My question is: is it possible to add dependencies between Databricks jobs? Can we create 2 different jobs using DBX, and make the second job wait until the first one is completed?
I am aware that I can have dependencies between tasks in one job, but in our case it is not possible to have only one job with all the tasks.
I was thinking about adding a notebook/python script before the wheel with the ETL logic. This notebook would then check whether the previous job is finished. Once this is the case, the task with the wheel will be executed. Does this make sense, or are there better ways? Is something like the ExternalTaskSensor in Airflow available within Databricks workflows?
Or is there a good way to use DBX without DB workflows?
author of dbx here.
TL;DR - dbx is not opinionated in terms of the orchestrator choice.
It is still new for us, but it seems that DBX' main added value is when you use Databricks workflows. It would be possible to run a Python wheel in a job that was created by DBX.
The short answer is yes, but it's done on the tasks level (read more here on the difference between workflow and task).
Another approach, if you still need (or want) to use Airflow, is the following:
1) Deploy and update your jobs from your CI/CD pipeline with dbx deploy commands.
2) In Airflow, use the Databricks Operator to launch the job (either by name or by id).
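The gate task proposed in the question (a script that waits for the previous job before the wheel task runs) could be sketched as a small polling loop. This is only an illustration under stated assumptions: `get_state` is a hypothetical callable that returns the state of the upstream job's latest run, and the `life_cycle_state` / `result_state` field names are assumed to match what the Databricks Jobs API reports.

```python
import time
from typing import Callable, Dict

# Assumed terminal life-cycle states; check them against the Jobs API docs.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def wait_for_run(
    get_state: Callable[[], Dict[str, str]],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
) -> None:
    """Block until the upstream run succeeds; raise if it fails or times out.

    `get_state` is a hypothetical callable supplied by the caller, for
    example a thin wrapper around a REST call that fetches the run state.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state.get("life_cycle_state") in TERMINAL_STATES:
            if state.get("result_state") == "SUCCESS":
                return
            raise RuntimeError(f"upstream run did not succeed: {state}")
        if time.monotonic() >= deadline:
            raise TimeoutError("upstream run did not finish in time")
        time.sleep(poll_seconds)


# Demo with canned states instead of a real API call.
fake_states = iter([
    {"life_cycle_state": "RUNNING"},
    {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"},
])
wait_for_run(lambda: next(fake_states), poll_seconds=0.0)
print("upstream finished, safe to start the wheel task")
```

Passing the state lookup in as a callable keeps the loop testable without a workspace; the real lookup would replace the canned iterator.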

Can you specify the number of threads for certain tasks in a DAG?

I'm very new to Airflow and while I have read the docs and some answers about Airflow's configuration regarding parallelism, it seems I have not yet found the answer to specifying threads used in a task.
My current case is that I have 5 tasks (each in the form of a Python script) that only make API calls (each to a different API service) and transform the data. Each task can make 1000+ calls, so I try to use multithreading in the script. Unfortunately, when I run the multithreaded script in Airflow, it doesn't seem to use the multithreading mechanism in the script. I feel like this is because some Airflow configuration overrides the child script, or am I wrong? Any help or answer is appreciated, thank you.
Run your script with a KubernetesPodOperator.
You can use a python base image and run your script as is. This should closely mimic how you are executing the script locally except now it's done in a kubernetes pod.
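For reference, a multithreaded script of the kind the question describes might look like the sketch below. It is not the asker's code: `fake_api_call` is a hypothetical stand-in for a real HTTP request, and the worker count is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def fetch_all(
    fetch_one: Callable[[int], dict],
    ids: Iterable[int],
    max_workers: int = 16,
) -> List[dict]:
    """Call fetch_one for every id on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, ids))


def fake_api_call(item_id: int) -> dict:
    """Hypothetical stand-in for one API request plus its transformation."""
    return {"id": item_id, "value": item_id * 2}


results = fetch_all(fake_api_call, range(100))
print(len(results))
```

Because the thread pool lives entirely inside the script, the same file can be run locally or as the command of a pod without changes.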

Airflow DAG "seems to be existing only locally. The master scheduler doesn't seem to be aware of its existence"

I use Airflow to manage a workflow of Spark jobs. After installation, I copied the DAG files into the DAGs folder set in airflow.cfg. I can backfill the DAG and the BashOperators run successfully, but there is always a warning like the one in the title. I didn't verify if the scheduling is fine, but I doubt scheduling can work, as the warning says the master scheduler doesn't know of my DAG's existence. How can I eliminate this warning and get scheduling to work? Has anybody run into the same issue who can help me out?
This is usually connected to the Scheduler not running or the refresh interval being too wide. There are no log entries present so we cannot analyze from there. Also, unfortunately the very cause might have been ignored, because this is usually the root of the problem:
I didn't verify if the scheduling is fine.
So first you should check if both of the following services are running:
airflow webserver
and
airflow scheduler
If that doesn't help, see this post for more reference: Airflow 1.9.0 is queuing but not launching tasks

Changing the airflow schedule through the GUI and incremental updates when the source is Oracle

We are evaluating Airflow for scheduling and data pipeline design. However, we have not been able to find out how to achieve the following two tasks:
(1) How to change the DAG schedule through the GUI?
(2) How to achieve incremental updates when the data source is Oracle or MySQL?
This is what we have tried:
(1) We tried changing the schedule of the DAG in the GUI, but it looks like that only changes the schedule of that particular instance.
(2) We tried to handle the incremental update programmatically by storing the last column value. Is there a better way of doing incremental updates?
1) You can't change the DAG schedule in the GUI; you have to do this in the Python code when you write the DAG.
2) How you do incremental updates is entirely up to you. However, I would use a combination of Airflow macros https://airflow.apache.org/code.html#macros and SQL files with Jinja templates https://airflow.apache.org/concepts.html#jinja-templating
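As a rough illustration of the incremental approach in point 2, the sketch below selects only the rows that fall inside one run's time window. Everything here is an assumption made for the example: the `orders` table and `updated_at` column are hypothetical, and sqlite3 stands in for Oracle or MySQL so the snippet runs anywhere. In Airflow the query would typically live in a templated SQL file, with the window bounds filled in from macros such as `{{ ds }}` rather than passed as Python arguments.

```python
import sqlite3

# Hypothetical table and column names; the bounds are bind parameters here,
# whereas a templated SQL file would receive them from Airflow macros.
INCREMENTAL_SQL = """
SELECT id, updated_at
FROM orders
WHERE updated_at >= :window_start
  AND updated_at <  :window_end
ORDER BY id
"""


def extract_window(conn: sqlite3.Connection, window_start: str, window_end: str):
    """Return only the rows changed inside [window_start, window_end)."""
    cursor = conn.execute(
        INCREMENTAL_SQL,
        {"window_start": window_start, "window_end": window_end},
    )
    return cursor.fetchall()


# In-memory demo data so the sketch is runnable without a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-01-02"), (3, "2024-01-03")],
)
print(extract_window(conn, "2024-01-02", "2024-01-03"))  # [(2, '2024-01-02')]
```

Tying the window to the run date means a rerun of an old DAG run re-extracts the same slice, so no separate "last value" bookkeeping is needed.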
Might be worth having a look through the Airflow documentation as it sounds like you're not entirely familiar with its concepts.

Apache Airflow Best Practice: (Python)Operators or BashOperators

These days I'm working on a new ETL project and I wanted to give Airflow a try as the job manager.
My colleague and I are both working with Airflow for the first time and we are following two different approaches: I decided to write Python functions (operators like the ones included in the apache-airflow project), while my colleague uses Airflow to call external Python scripts through BashOperator.
I'd like to know if there is something like a "good practice", and whether the two approaches are equally good or I should prefer one over the other.
To me, the main differences are:
- with BashOperator you can call a python script using a specific python environment with specific packages
- with BashOperator the tasks are more independent and can be launched manually if airflow goes mad
- with BashOperator task to task communication is a bit harder to manage
- with BashOperator task errors and failures are harder to manage (how can a bash task know if the task before it failed or succeeded?).
What do you think?
My personal preference in these cases would be to use a PythonOperator over BashOperator. Here's what I do and why:
Single repo that contains all my DAGs. This repo also has a setup.py that includes airflow as a dependency, along with anything else my DAGs require. Airflow services are run from a virtualenv that installs these dependencies. This handles the python environment you mentioned regarding BashOperator.
I try to put all Python logic unrelated to Airflow in its own externally packaged python library. That code should have its own unit tests and also has its own main so it can be called on the command line independent of Airflow. This addresses your point about when Airflow goes mad!
If the logic is small enough that it doesn't make sense to separate into its own library, I drop it in a utils folder in my DAG repo, with unit tests still of course.
Then I call this logic in Airflow with the PythonOperator. The python callable can easily be unit tested, unlike a BashOperator template script. This also means you can do things like start an Airflow DB session, push multiple values to XCom, etc.
Like you mentioned, error handling is a bit easier with Python. You can catch exceptions and check return values easily. You can choose to mark the task as skipped with raise AirflowSkipException.
FYI for BashOperator, if the script exits with an error code, Airflow will mark the task as failed.
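The layout described in this answer (logic kept free of Airflow, a thin callable for the PythonOperator, and a command-line entry point) could be sketched as below. All function names are hypothetical and the row-counting logic is only a placeholder for real ETL code.

```python
import sys
from typing import List, Optional


def count_valid_rows(rows: List[dict]) -> int:
    """Pure business logic: no Airflow imports, so it is easy to unit test."""
    return sum(1 for row in rows if row.get("amount") is not None)


def run_task(rows: List[dict]) -> int:
    """Callable that a PythonOperator could point at via python_callable.

    Raising an exception here is what would make Airflow fail the task.
    """
    valid = count_valid_rows(rows)
    if valid == 0:
        raise ValueError("no valid rows to process")
    return valid


def main(argv: Optional[List[str]] = None) -> int:
    """Command-line entry point; returns a process exit code."""
    args = sys.argv[1:] if argv is None else argv
    rows = [{"amount": float(a)} for a in args]
    try:
        print(run_task(rows))
    except ValueError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1  # non-zero exit: a BashOperator would mark the task failed
    return 0


# A standalone script would end with sys.exit(main()); here main is called
# directly with explicit arguments so the sketch can run inline.
exit_code = main(["10", "20"])
```

The same `run_task` serves both routes: Airflow calls it directly, while the command line reaches it through `main`, which is what keeps the logic usable when Airflow is unavailable.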
TaskA checks data availability at the source. TaskB processes it.
TaskA >> TaskB
Both tasks use BashOperator to call Python scripts. I used to call sys.exit(1) in script1 (triggered by TaskA) when there was no data at the source, as a way to signal that TaskA failed and that there is no need to run TaskB.

Resources