How to organize your projects and DAGs with Airflow

I am considering using Apache Airflow. I had a look at the documentation, and now I am trying to implement an existing pipeline (built on a home-made framework) using Airflow.
All the examples given are simple one-module DAGs. But in real life you can have a versioned application that provides (complex) pipeline blocks, and DAGs use those blocks as tasks. Basically, the application package is installed in a dedicated virtual environment with its dependencies.
So how do you plug that into Airflow? Should Airflow be installed in the application's virtualenv? Then there is a dedicated Airflow instance for this application's pipelines. But in that case, if you have 100 applications you need 100 Airflow instances... On the other hand, if you have a single instance, all your application packages are installed in the same environment and you can potentially have dependency conflicts...
Is there something I am missing? Are there best practices? Do you know of any online resources that may help, or GitHub repos using one pattern or the other?
Thanks

One instance with 100 pipelines. Each pipeline can easily be versioned, and its Python dependencies can be packaged.
We have 200+ very different pipelines and use one central Airflow instance. Folders are organized as follows:
DAGs/
DAGs/pipeline_1/v1/pipeline_1_dag_1.0.py
DAGs/pipeline_1/v1/dependancies/
DAGs/pipeline_1/v2/pipeline_1_dag_2.0.py
DAGs/pipeline_1/v2/dependancies/
DAGs/pipeline_2/v5/pipeline_2_dag_5.0.py
DAGs/pipeline_2/v5/dependancies/
DAGs/pipeline_2/v6/pipeline_2_dag_6.0.py
DAGs/pipeline_2/v6/dependancies/
etc.
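With a layout like the one above, a small helper can report which versions of each pipeline are present in the DAGs folder. This is only a sketch: the function name list_pipeline_versions is hypothetical, and it assumes version folders are named "v" followed by a number, as in the listing.

```python
from pathlib import Path


def list_pipeline_versions(dags_root):
    """Map each pipeline folder to its version folders, sorted numerically."""
    versions = {}
    # Matches <dags_root>/<pipeline>/<vN>/<any DAG file>.py
    for dag_file in Path(dags_root).glob("*/v*/*.py"):
        version_dir = dag_file.parent
        # Skip folders such as "venv" that start with "v" but are not versions.
        if not version_dir.name[1:].isdigit():
            continue
        pipeline = version_dir.parent.name
        versions.setdefault(pipeline, set()).add(version_dir.name)
    return {
        name: sorted(found, key=lambda v: int(v[1:]))
        for name, found in sorted(versions.items())
    }
```

Sorting on the numeric part keeps v10 after v6, which a plain string sort would not.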

Related

Need to create custom s3KeySensor

I'm using airflow_docker and it seems it does not come with an s3KeySensor. It does come with a Sensor class. How can I create my own custom s3KeySensor? Do I just have to inherit from Sensor and override the poke method? Can I literally just copy the source code from the s3KeySensor? Here's the source code for the s3KeySensor
The reason I am using airflow_docker is that it can run in a container, and I can pass AWS role credentials to the task container so that it has exactly the permissions needed for an action, rather than using the worker container's role permissions.
I would recommend upgrading to the latest version of Airflow, or at least to Airflow 2.0.
In Airflow 2.0, the majority of the operators, sensors, and hooks that are not considered core functionality are categorized and separated into providers. Amazon has a provider, apache-airflow-providers-amazon, that includes the S3KeySensor.
If you want to stay on Airflow 1.10.5 (based on the links you sent), you can install the backport providers instead; the one you would need is apache-airflow-backport-providers-amazon.

Best practices for developing own custom operators in Airflow 2.0

We are currently developing custom operators and sensors for our Airflow (>2.0.1) on Cloud Composer. We use the official Docker image for testing and development.
As of Airflow 2.0, the recommended way is not to put them in Airflow's plugins directory but to build them as a separate Python package. However, this approach seems quite complicated when developing DAGs and testing them on the Docker Airflow.
To follow Airflow's recommended approach we would use two separate repos, one for our DAGs and one for the operators/sensors. We would then mount the custom operators/sensors package into Docker to test it quickly there while editing it on the local machine. For use on Composer we would need to publish our package to our private PyPI repo and install it on Cloud Composer.
The old approach, however, of putting everything in the local plugins folder is quite straightforward and doesn't have these problems.
Based on your experience, what is your recommended way of developing and testing custom operators/sensors?
You can put the "common" code (custom operators and such) in the dags folder and exclude it from being processed by the scheduler via an .airflowignore file. This allows for rather quick iterations when developing.
You can still keep the DAGs and the "common" code in separate repositories to make things easier. You can use a "submodule" pattern for that: add the "common" repo as a submodule of the DAG repo so that you can check them out together. This way you can even keep different DAG directories (for different teams) on different versions of the common packages, by submodule-linking each to a different version.
I think the "package" pattern is more of a production deployment thing than a development one. Once you have developed the common code locally, it would be great to package it as a common package and version it accordingly (same as any other Python package). Then you can release it after testing.
In "development" mode you can check out the code with a "recursive" submodule update and add the "common" subdirectory to PYTHONPATH. In production, even if you use git-sync, your ops team could deploy your custom operators via a custom image (by installing the appropriate released version of your package), while your DAGs are git-synced separately WITHOUT the submodule checkout. The submodule would only be used for development.
It would also be worth running CI/CD on the DAGs you push to your DAG repo, to see whether they continue working with the "released" custom code in the "stable" branch, while running the same CI/CD with the common code synced via submodule in the "development" branch (this way you can check the latest development DAG code against the linked common code).
This is what I'd do. It allows quick iteration during development while turning the common code into "freezable" artifacts that provide a stable environment in production. Your DAGs can still be developed and evolve quickly, and CI/CD helps keep the "stable" things really stable.
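For the development mode described above, setting PYTHONPATH in the shell is the usual route. As a rough in-process equivalent, the sketch below prepends the "common" subdirectory to sys.path so that DAG files can import the shared operators. The helper name add_common_to_path and the default folder name "common" are assumptions for illustration.

```python
import sys
from pathlib import Path


def add_common_to_path(dags_dir, subdir="common"):
    """Prepend <dags_dir>/<subdir> to sys.path once and return the path added."""
    common = str(Path(dags_dir).resolve() / subdir)
    # Avoid stacking duplicate entries when several DAG files call this helper.
    if common not in sys.path:
        sys.path.insert(0, common)
    return common
```

In production this helper would not be needed, since the released package is installed into the image instead.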

Flyway - Multiple DBs with Central Migration

Our company has 30+ applications written in different languages (Java, C#, Visual Basic, Node.js, etc.).
Our aim is to have development teams keep their database change SQL scripts in their own repositories, and to run the migrations from Jenkins, with the teams starting pipelines with a version number. Development teams don't have access to the Jenkins configuration; they can only run jobs that we created and configured.
How should we go about this? Do we have to keep a separate Flyway instance for each application? And what about the pre-production and production stages?
Basically, how should we, as the DevOps team, maintain Flyway to migrate different applications across different stages, without the development teams doing the migration part?
This should be possible with the Flyway CLI. You can tell Flyway where to look for migrations and how to connect to the database. See the docs about configuring the CLI.
You can configure Flyway in various ways - environment variables, command line arguments, and config files.
What you could do is allow each development team to specify a migrations directory and connection details for the Jenkins task. The task can then call the Flyway CLI, overriding the relevant config items via command line arguments. For example, the command line call to migrate a database:
flyway -url=jdbc:oracle:thin:@//<host>:<port>/<service> -locations=some-location migrate
Or you could allow your devs to specify environment variables, or provide a custom config file.
You can reuse a single Flyway instance since the commands are essentially stateless. The only bit of environmental state they need comes from the config file, which you have complete control over.
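As a sketch of how a Jenkins step could drive a single Flyway installation for many applications, the helper below builds the command line from per-application values and runs it. The function names and the idea of calling it from Python are assumptions; only the -url, -user, -password and -locations options and the migrate command come from the Flyway CLI.

```python
import subprocess


def flyway_migrate_command(url, locations, user=None, password=None):
    """Build the argv for `flyway ... migrate` with per-application overrides."""
    args = ["flyway", f"-url={url}", f"-locations={locations}"]
    if user is not None:
        args.append(f"-user={user}")
    if password is not None:
        args.append(f"-password={password}")
    # The command comes last, after all option overrides.
    args.append("migrate")
    return args


def run_migration(url, locations, **credentials):
    """Run the migration; a non-zero Flyway exit raises CalledProcessError."""
    command = flyway_migrate_command(url, locations, **credentials)
    subprocess.run(command, check=True)
```

Each application and stage would supply its own url and locations values. Credentials may be better passed through environment variables such as FLYWAY_PASSWORD than on the command line, where they can show up in process listings and build logs.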
Hope that helps

What is the recommended way to install Alfresco Process Services?

I want to install APS (Alfresco Process Services) for a production environment.
There are two ways to do this installation:
By using the installer file
By using WAR files
Which is the best way, and what are the issues with the other approach?
Thanks in Advance.
For a production installation you should deploy the WAR file into your own Java container.
The reason for this is that it gives you a lot more flexibility in configuring the APS properties outside the WAR file.
While the installer is easy, it makes many assumptions about your environment.
Another option, in case you weren't aware, is that APS is now available in the AWS environment as a quick start application. This makes setting up a hosted environment quick and easy.

Apache Airflow environment setup

Can a single installation of Apache Airflow be used to handle multiple environments, e.g. Dev, QA1, QA2, and Production (if so, please advise how), or do I need a separate install for each? What would be the best design, considering the maintenance of all environments?
You can do whatever you want. If you want to keep a single Airflow installation to handle different environments, you could switch connections or Airflow variables according to the environment.
At the end of the day DAGs are written in Python so you're really flexible.
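A minimal sketch of switching connections by environment, assuming the environment name arrives in an environment variable. The variable name DEPLOY_ENV, the connection ids, and the helper conn_id_for are all hypothetical; the matching connections would have to exist in Airflow.

```python
import os

# Hypothetical Airflow connection ids, one per environment.
CONN_ID_BY_ENV = {
    "dev": "warehouse_dev",
    "qa1": "warehouse_qa1",
    "qa2": "warehouse_qa2",
    "prod": "warehouse_prod",
}


def conn_id_for(env=None):
    """Return the connection id for `env`, defaulting to DEPLOY_ENV or "dev"."""
    env = (env or os.environ.get("DEPLOY_ENV", "dev")).lower()
    try:
        return CONN_ID_BY_ENV[env]
    except KeyError:
        raise ValueError(f"unknown environment: {env!r}") from None
```

A DAG file could call conn_id_for() once at the top and pass the result to whichever connection-id argument its operators or hooks take, so the same DAG code runs against each environment.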

Resources