Apache Airflow environment setup

Can a single installation of Apache Airflow be used to handle multiple environments, e.g. Dev, QA1, QA2, and Production (if so, please guide), or do I need a separate install for each? What would be the best design, considering the maintenance of all environments?

You can do whatever you want. If you want to keep a single Airflow installation to handle different environments, you could switch connections or Airflow variables according to the environment.
At the end of the day, DAGs are written in Python, so you're really flexible.
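For example, a minimal sketch of that idea, assuming an Airflow 2.x install where an Airflow Variable named environment exists and connections follow a warehouse_<env> naming convention (both names are hypothetical):

```python
# Sketch: one DAG definition, environment-specific behaviour driven by an
# Airflow Variable and a connection-naming convention (names are hypothetical).
from datetime import datetime

from airflow import DAG
from airflow.hooks.base import BaseHook
from airflow.models import Variable
from airflow.operators.python import PythonOperator


def extract():
    # Resolve the environment and its connection at runtime, not at parse time.
    env = Variable.get("environment", default_var="dev")  # "dev", "qa1", "qa2", "prod"
    conn = BaseHook.get_connection(f"warehouse_{env}")
    print(f"Running in {env}, extracting from {conn.host}")


with DAG(
    dag_id="example_multi_env",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="extract", python_callable=extract)
```

The same pattern works for anything else that differs per environment (buckets, schemas, schedules).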

Related

Apache Airflow - Multiple deployment environments

When handling multiple environments (such as Dev/Staging/Prod), having separate (preferably identical) Airflow instances for each of these environments would be the best-case scenario.
I'm using the GCP-managed Airflow (GCP Cloud Composer), which is not very cheap to run, and having multiple instances would increase our monthly bill significantly.
So, I'd like to know if anyone has recommendations on using a single Airflow instance to handle multiple environments?
One approach I was considering was to have separate top-level folders within my dags folder corresponding to each environment (i.e. dags/dev, dags/prod, etc.)
and to copy my DAG scripts to the relevant folder through the CI/CD pipeline.
So, within my source code repository if my dag looks like:
airflow_dags/my_dag_A.py
During the CI stage, I could have a build step that creates 2 separate versions of this file:
airflow_dags/DEV/my_dag_A.py
airflow_dags/PROD/my_dag_A.py
I would follow a strict naming convention for my DAGs, Airflow Variables, etc. to reflect the environment name, so that the above build step can automatically rename them accordingly.
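A rough sketch of what such a build step could look like, assuming a placeholder token like __ENV__ in the source DAG files (the token, folder names, and environment list are all hypothetical):

```python
# CI build step sketch: copy each DAG into an environment folder and replace a
# placeholder token so DAG IDs, Variable names, etc. carry the environment name.
import pathlib

SOURCE_DIR = pathlib.Path("airflow_dags")
ENVIRONMENTS = ["DEV", "PROD"]
PLACEHOLDER = "__ENV__"  # e.g. dag_id="__ENV___my_dag_A" in the source file

for env in ENVIRONMENTS:
    target_dir = SOURCE_DIR / env
    target_dir.mkdir(parents=True, exist_ok=True)
    for dag_file in SOURCE_DIR.glob("*.py"):
        code = dag_file.read_text().replace(PLACEHOLDER, env)
        (target_dir / dag_file.name).write_text(code)
```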
I wanted to check if this is an approach others may have used, or whether there are any better/alternative suggestions?
Please let me know if any additional clarifications are needed.
Thank you in advance for your support. Highly appreciated.
I think a shared environment can be a good approach because it's cost-effective.
However, if you have a Composer cluster per environment, it's simpler to manage and allows for better separation.
If you stay on a shared environment, I think you are going in the right direction with a separation in the Composer DAG bucket and a folder per environment.
If you use Airflow Variables, you also have to deal with the environment in addition to the DAGs part.
You can then manage the access to each folder in the bucket.
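One way (not spelled out above) to let a single DAG source adapt to whichever environment folder it was copied into is to derive the environment from the DAG file's path and prefix Variable names with it. The folder layout and Variable naming below are hypothetical:

```python
# Sketch: derive the environment from the DAG file's location in the bucket,
# e.g. .../dags/dev/my_dag_A.py -> "dev", .../dags/prod/my_dag_A.py -> "prod".
import os

from airflow.models import Variable

ENV = os.path.basename(os.path.dirname(__file__))


def get_env_variable(name: str) -> str:
    # Variables stored as dev_bucket_name, prod_bucket_name, ... (hypothetical convention)
    return Variable.get(f"{ENV}_{name}")
```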
In my team, we chose another approach.
The newer version of Cloud Composer uses GKE in Autopilot mode and is more cost-effective than the previous one.
It's also easier to manage the environment size of the cluster and to play with different parameters (workers, CPU, webserver, ...).
In our case, we created a cluster per environment, but with a different configuration per environment (managed by Terraform):
For the dev and uat environments, we use a small sizing and an environment size of Small.
For the prod environment, we use a larger sizing and an environment size of Medium.
It's not perfect, but this gives us a compromise between cost and separation.

Apache Airflow: install two instances on the same local machine

I have an Airflow instance on a local Ubuntu machine. This instance doesn't work very well, so I would like to install it again. The problem is that I can't delete the current instance, because it is used by other people, so I would like to create a new Airflow instance on the same machine to put various DAGs there.
How could I do it? I created a different virtual environment, but I don't know how to install a second Airflow server in that environment that works in parallel with the current Airflow.
Thank you!
Use a different port for the webserver.
Use a different AIRFLOW_HOME variable.
Use a different sql_alchemy_conn (to point to a different database).
Copy the deployment you have to start/stop your Airflow components.
Depending on your deployment, you might somehow record the process IDs of your running Airflow (so-called pid files) or have some other way to determine which processes are running. But that is nothing Airflow-specific; it is specific to your deployment.
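A rough sketch of the points above for an Airflow 2.x install in a second virtualenv (paths, port, and the connection string are hypothetical; for Airflow older than 2.3 the database option is AIRFLOW__CORE__SQL_ALCHEMY_CONN instead):

```python
# Launch a second, independent Airflow instance: its own AIRFLOW_HOME, its own
# metadata database, and a non-default webserver port. Run this with the new
# virtualenv active so "airflow" resolves to the second installation.
import os
import subprocess

env = os.environ.copy()
env["AIRFLOW_HOME"] = "/home/me/airflow2"  # separate config, logs and dags folder
env["AIRFLOW__DATABASE__SQL_ALCHEMY_CONN"] = (
    "postgresql+psycopg2://airflow2:CHANGE_ME@localhost/airflow2"  # separate metadata DB
)

subprocess.run(["airflow", "db", "init"], env=env, check=True)         # one-time setup
subprocess.Popen(["airflow", "scheduler"], env=env)                    # second scheduler
subprocess.Popen(["airflow", "webserver", "--port", "8081"], env=env)  # non-default port
```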

Best practices for developing your own custom operators in Airflow 2.0

We are currently in the process of developing custom operators and sensors for our Airflow (>2.0.1) on Cloud Composer. We use the official Docker image for testing/developing.
As of Airflow 2.0, the recommended way is not to put them in the plugins directory of Airflow but to build them as a separate Python package. This approach, however, seems quite complicated when developing DAGs and testing them on the Docker Airflow.
To use Airflow's recommended approach, we would use two separate repos for our DAGs and the operators/sensors; we would then mount the custom operators/sensors package into Docker to quickly test it there and edit it on the local machine. For further use on Composer, we would need to publish our package to our private PyPI repo and install it on Cloud Composer.
The old approach, however, of putting everything in the local plugins folder, is quite straightforward and doesn't have these problems.
Based on your experience, what is your recommended way of developing and testing custom operators/sensors?
You can put the "common" code (custom operators and such) in the dags folder and exclude it from being processed by the scheduler via an .airflowignore file. This allows for rather quick iterations when developing.
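A minimal sketch of that layout, assuming a common/ subfolder inside the dags folder and an .airflowignore file containing the single line common (the folder, operator, and DAG names are hypothetical):

```python
# Assumed layout:
#   dags/.airflowignore       <- contains the line:  common
#   dags/common/__init__.py
#   dags/common/operators.py  <- first snippet below
#   dags/my_dag.py            <- second snippet below
# The .airflowignore entry stops the scheduler from parsing common/ as DAG
# files, but the package stays importable because the dags folder is on sys.path.

# --- dags/common/operators.py ---
from airflow.models import BaseOperator


class HelloOperator(BaseOperator):
    def __init__(self, *, name: str, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        self.log.info("Hello %s", self.name)


# --- dags/my_dag.py ---
from datetime import datetime

from airflow import DAG
from common.operators import HelloOperator  # resolved via the dags folder on sys.path

with DAG(
    dag_id="hello",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    HelloOperator(task_id="hello", name="world")
```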
You can still keep the DAGs and the "common code" in separate repositories to make things easier. You can easily use a "submodule" pattern for that (add the "common" repo as a submodule of the DAG repo); this way you will be able to check them out together. You can even keep different DAG directories (for different teams) with different versions of the common packages this way (just submodule-link them to different versions of the packages).
I think the "package" pattern is more of a production-deployment thing than a development one. Once you have developed the common code locally, it would be great to package it as a common package and version it accordingly (same as any other Python package). Then you can release it after testing, version it, etc.
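For reference, the packaging side can be as small as a minimal setup.py for the common code; the package name and version below are hypothetical:

```python
# setup.py sketch for the "common" operators/sensors package. Built and pushed
# to a private PyPI repo, it can then be installed on Cloud Composer through
# its PyPI-packages mechanism.
from setuptools import find_packages, setup

setup(
    name="acme-airflow-common",                        # hypothetical package name
    version="1.2.0",                                   # released/"frozen" version
    packages=find_packages(include=["acme_airflow_common*"]),
)
```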
In "development" mode you can check out the code with a recursive submodule update and add the "common" subdirectory to PYTHONPATH. In production, even if you use git-sync, you could deploy your custom operators via your ops team using a custom image (by installing the appropriate, released version of your package), while your DAGs would be git-synced separately WITHOUT the submodule checkout. The submodule would only be used for development.
It would also be worth running CI/CD on the DAGs you push to your DAG repo to see if they continue working with the "released" custom code in the "stable" branch, while running the same CI/CD with the common code synced via submodule in the "development" branch (this way you can check the latest development DAG code against the linked common code).
This is what I'd do. It allows for quick iteration during development while also turning the common code into "freezable" artifacts that provide a stable environment in production; your DAGs can still be developed and evolve quickly, and CI/CD helps keep the "stable" things really stable.

Is there any operator to move files across different hosts via SFTP using Airflow?

I'm a noob to the Airflow ecosystem. One of my first goals using Airflow is implementing workflows to move files across machines. In particular, I'm looking for ways to consolidate data from different Mac/Linux machines into a NAS (using SFTP).
I've been exploring the different Airflow operators, and most of the transfer ones copy data from the local machine to cloud services. I haven't seen any that copy files from host to host, nor one that moves them (or copies, then checks, then deletes). I assume I could use BashOperator to move files or use rsync. Is there any best practice on how to move files across different hosts via SFTP using Airflow? Any pattern to copy/check/delete safely?
There is not.
However, you can create your own operators as plugins:
https://airflow.apache.org/docs/stable/plugins.html
You may want to take advantage of the SFTPHook that already exists in Airflow to code your operator:
https://airflow.readthedocs.io/en/stable/_modules/airflow/contrib/hooks/sftp_hook.html
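A rough sketch of such an operator, written against the newer apache-airflow-providers-sftp package (the connection IDs, paths, and operator name are hypothetical; the hook methods used are retrieve_file, store_file, get_mod_time, and delete_file):

```python
import os
import tempfile

from airflow.models import BaseOperator
from airflow.providers.sftp.hooks.sftp import SFTPHook


class SFTPToSFTPOperator(BaseOperator):
    """Copy a file from one SFTP host to another, verify it, then delete the source."""

    template_fields = ("source_path", "dest_path")

    def __init__(self, *, source_conn_id, dest_conn_id, source_path, dest_path,
                 delete_source=True, **kwargs):
        super().__init__(**kwargs)
        self.source_conn_id = source_conn_id
        self.dest_conn_id = dest_conn_id
        self.source_path = source_path
        self.dest_path = dest_path
        self.delete_source = delete_source

    def execute(self, context):
        source_hook = SFTPHook(ssh_conn_id=self.source_conn_id)
        dest_hook = SFTPHook(ssh_conn_id=self.dest_conn_id)

        with tempfile.TemporaryDirectory() as tmp_dir:
            local_path = os.path.join(tmp_dir, os.path.basename(self.source_path))

            # Copy: source host -> local temp file -> destination host.
            source_hook.retrieve_file(self.source_path, local_path)
            dest_hook.store_file(self.dest_path, local_path)

        # Check: get_mod_time raises if the destination file does not exist.
        dest_hook.get_mod_time(self.dest_path)

        # Delete the source only after the copy has been verified.
        if self.delete_source:
            source_hook.delete_file(self.source_path)
```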

How to organize your projects and DAGs with Airflow

I am considering using Apache Airflow. I had a look at the documentation, and now I am trying to implement an already existing pipeline (a home-made framework) using Airflow.
All the given examples are simple one-module DAGs. But in real life you can have a versioned application that provides (complex) pipeline blocks, and DAGs use those blocks as tasks. Basically, the application package is installed in a dedicated virtual environment with its dependencies.
OK, so now how do you plug that into Airflow? Should Airflow be installed in the application virtualenv? Then there is a dedicated Airflow instance for this application's pipelines. But in this case, if you have 100 applications you have to have 100 Airflow instances... On the other side, if you have one single instance, it means you have installed all your application packages in the same environment and you can potentially have dependency conflicts...
Is there something I am missing? Are there best practices? Do you know of internet resources that may help, or GitHub repos using one pattern or the other?
Thanks
One instance with 100 pipelines. Each pipeline can easily be versioned, and its Python dependencies can be packaged.
We have 200+ very different pipelines and use one central Airflow instance. Folders are organized as follows (a sketch of how a DAG in this layout pulls in its own dependencies appears after the listing):
DAGs/
DAGs/pipeline_1/v1/pipeline_1_dag_1.0.py
DAGs/pipeline_1/v1/dependencies/
DAGs/pipeline_1/v2/pipeline_1_dag_2.0.py
DAGs/pipeline_1/v2/dependencies/
DAGs/pipeline_2/v5/pipeline_2_dag_5.0.py
DAGs/pipeline_2/v5/dependencies/
DAGs/pipeline_2/v6/pipeline_2_dag_6.0.py
DAGs/pipeline_2/v6/dependencies/
etc.
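A minimal sketch of what DAGs/pipeline_1/v1/pipeline_1_dag_1.0.py could look like in this layout; the sys.path tweak is one possible way (an assumption, not prescribed above) to make the version-local dependencies/ folder importable, and the module and task names are hypothetical:

```python
import os
import sys
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Make DAGs/pipeline_1/v1/dependencies/ importable for this DAG version only.
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "dependencies"))

from transform import run_transform  # hypothetical module inside dependencies/

with DAG(
    dag_id="pipeline_1_v1",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="transform", python_callable=run_transform)
```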
