I am using Airflow 1.10.12. Currently, I need to connect to Azure Blob Storage and copy some files to Google Cloud Storage. After some research I found that we can install backport providers for Azure and its dependencies (Google) on Airflow 1.10.12. So my question is: will any of the existing DAGs that import from airflow.contrib become invalid if I install this package?
Thanks
Providers are how Airflow decouples code that isn't core-related into separate packages. This allows a faster and independent release cycle for providers and for Airflow core.
In order to ease migration to Airflow 2, backport providers were released. These are provider classes that are compatible with Airflow 1.10.x; the classes in contrib aren't affected by this.
Providers are just additional packages that contain new classes and updated versions of the classes from contrib.
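For the concrete case in the question, here is a minimal sketch of what changes (and what doesn't) once the backport packages are installed, assuming apache-airflow-backport-providers-microsoft-azure and apache-airflow-backport-providers-google have been pip-installed:
# Existing contrib imports keep working on Airflow 1.10.12:
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
# The new provider-style imports simply become available alongside them:
from airflow.providers.microsoft.azure.hooks.wasb import WasbHook
from airflow.providers.google.cloud.hooks.gcs import GCSHook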
Note that there are no more releases of backport provider packages. Thus it's highly recommended that you upgrade to Airflow 2 as soon as possible so that you can use the regular provider packages.
I have installed Airflow version 2.1.0 recently, but I am unable to find an option to enable Airflow login authentication with a supplied backend like in older versions. Can anyone provide the steps to achieve this?
Since Airflow 2.0, basic auth for the UI is the default behavior. You can consult the following documentation for details:
https://airflow.apache.org/docs/apache-airflow/stable/security/webserver.html
If you want to use a different security backend, you will need to provide an additional configuration file in your $AIRFLOW_HOME directory: webserver_config.py. Flask-AppBuilder loads its configuration from this file, which gives you the ability to define various security backends.
Some links:
Airflow documentation on Other Security Methods
Flask-AppBuilder documentation
example webserver_config.py file from Airflow's repository
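For illustration, a minimal webserver_config.py along the lines of the example file linked above could look like this (the values are Flask-AppBuilder settings; adjust them to your chosen backend):
# $AIRFLOW_HOME/webserver_config.py
from flask_appbuilder.security.manager import AUTH_DB  # AUTH_LDAP, AUTH_OAUTH, ... are also available

WTF_CSRF_ENABLED = True
AUTH_TYPE = AUTH_DB  # username/password accounts stored in Airflow's metadata database
# For LDAP you would instead set AUTH_TYPE = AUTH_LDAP plus the AUTH_LDAP_* settings.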
I'm using airflow_docker and it seems it does not come with an S3KeySensor. It does come with a Sensor class. How can I create my own custom S3KeySensor? Do I just have to inherit from Sensor and override the poke method? Can I literally just copy the source code of the S3KeySensor? Here's the source code for the S3KeySensor.
The reason I am using airflow_docker is that tasks run in containers, so I can pass AWS role credentials to the task container and give it the exact permissions it needs for an action, rather than relying on the worker container's role permissions.
I would recommend upgrading to the latest version of Airflow, or at least starting at Airflow 2.0.
In Airflow 2.0, a majority of the operators, sensors, and hooks that are not considered core functionality are categorized and separated into providers. Amazon has a provider, apache-airflow-providers-amazon, that includes the S3KeySensor.
You can also install backport providers, so the one you would need is apache-airflow-backport-providers-amazon if you want to stay on Airflow 1.10.5 (based on the links you sent).
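If you do decide to roll your own instead, the general pattern is indeed to inherit from the base sensor and override poke. A rough sketch using boto3 (the class name and arguments are illustrative, not the provider's implementation; on Airflow 1.10.x the base class lives in airflow.sensors.base_sensor_operator instead):
import boto3
from botocore.exceptions import ClientError
from airflow.sensors.base import BaseSensorOperator  # Airflow 2.x import path


class MyS3KeySensor(BaseSensorOperator):
    """Returns True from poke() once the given S3 key exists."""

    def __init__(self, bucket_name, key, **kwargs):
        super().__init__(**kwargs)
        self.bucket_name = bucket_name
        self.key = key

    def poke(self, context):
        # boto3 picks up whatever role credentials are available to the task container
        s3 = boto3.client("s3")
        try:
            s3.head_object(Bucket=self.bucket_name, Key=self.key)
            return True
        except ClientError:
            return False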
We are currently in the process of developing custom operators and sensors for our Airflow (>2.0.1) on Cloud Composer. We use the official Docker image for testing/developing.
As of Airflow 2.0, the recommended way is not to put them in Airflow's plugins directory but to build them as a separate Python package. However, this approach seems quite complicated when developing DAGs and testing them on the Dockerized Airflow.
To follow Airflow's recommended approach we would use two separate repos for our DAGs and the operators/sensors. We would then mount the custom operators/sensors package into Docker to quickly test it there while editing it on the local machine. For further use on Composer we would need to publish our package to our private PyPI repo and install it on Cloud Composer.
The old approach, however, of putting everything in the local plugins folder, is quite straightforward and doesn't have these problems.
Based on your experience, what is your recommended way of developing and testing custom operators/sensors?
You can put the "common" code (custom operators and such) in the dags folder and exclude it from being processed by the scheduler via an .airflowignore file. This allows for rather quick iterations when developing.
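For example, with a shared package living next to the DAG files (the names below are illustrative), a one-line .airflowignore keeps the scheduler from trying to parse the shared code as DAG files, while the DAG files can still import from it because the dags folder is on the Python path:
dags/
dags/.airflowignore        (contains the single line: common_package)
dags/common_package/       (custom operators/sensors; DAG files can do "import common_package")
dags/my_dag.py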
You can still keep the DAGs and the "common code" in separate repositories to make things easier. You can use a "submodule" pattern for that: add the "common" repo as a submodule of the DAG repo so that you can check them out together. You can even keep different DAG directories (for different teams) with different versions of the common packages this way, by submodule-linking them to different versions of the packages.
I think the "package" pattern if more of a production deployment thing rather than development. Once you developed the common code locally, it would be great if you package it together in common package and version accordingly (same as any other python package). Then you can release it after testing, version it etc. etc..
In the "development" mode you can checkout the code with "recursive" submodule update and add the "common" subdirectory to PYTHONPATH. In production - even if you use git-sync, you could deploy your custom operators via your ops team using custom image (by installing appropriate, released version of your package) where your DAGS would be git-synced separately WITHOUT the submodule checkout. The submodule would only be used for development.
It would also be worth running CI/CD on the DAGs you push to your DAG repo to see whether they continue working with the "released" custom code in the "stable" branch, while running the same CI/CD with the common code synced via submodule in the "development" branch (this way you can check the latest development DAG code against the linked common code).
This is what I'd do. It allows for quick iteration during development while turning the common code into "freezable" artifacts that provide a stable environment in production. Your DAGs can still be developed and evolve quickly, and CI/CD helps keep the "stable" things really stable.
I am considering using Apache Airflow. I had a look at the documentation and now I am trying to implement an already existing pipeline (home-made framework) using Airflow.
All the given examples are simple one-module DAGs. But in real life you can have a versioned application that provides (complex) pipeline blocks, and DAGs use those blocks as tasks. Basically, the application package is installed in a dedicated virtual environment with its dependencies.
OK, so now how do you plug that into Airflow? Should Airflow be installed in the application virtualenv? Then there is a dedicated Airflow instance for this application's pipelines, but in that case if you have 100 applications you have to have 100 Airflow instances... On the other hand, if you have one single instance it means you have installed all your application packages in the same environment, and potentially you can have dependency conflicts...
Is there something I am missing? Are there best practices? Do you know of internet resources that may help, or GitHub repos using one pattern or the other?
Thanks
One instance with 100 pipelines. Each pipeline can easily be versioned and its Python dependencies can be packaged.
We have 200+ very different pipelines and use one central Airflow instance. Folders are organized as follows:
DAGs/
DAGs/pipeline_1/v1/pipeline_1_dag_1.0.py
DAGs/pipeline_1/v1/dependencies/
DAGs/pipeline_1/v2/pipeline_1_dag_2.0.py
DAGs/pipeline_1/v2/dependencies/
DAGs/pipeline_2/v5/pipeline_2_dag_5.0.py
DAGs/pipeline_2/v5/dependencies/
DAGs/pipeline_2/v6/pipeline_2_dag_6.0.py
DAGs/pipeline_2/v6/dependencies/
etc.
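In a layout like this, each DAG file can make its sibling dependencies/ folder importable by putting it on sys.path. A sketch (the helper module name is hypothetical):
# DAGs/pipeline_1/v1/pipeline_1_dag_1.0.py
import os
import sys

# make ./dependencies importable for this pipeline version only
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "dependencies"))

from my_helpers import build_tasks  # hypothetical module living in dependencies/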
I have an Artifactory server with a bunch of remote repositories.
We are planning to upgrade from 5.11.0 to 5.11.6 to take advantage of a security patch in that version.
Questions are:
Do all repositories need to be on exactly the same version?
Is there anything else I need to think about when upgrading multiple connected repositories? (There is nothing specific about this in the manual.)
Do I need to do a system-level export just on the primary server, or should I be doing it on all of the remote repository servers?
Lastly, our repositories are huge... a full System Export for backup will take too long...
Is it enough to just take the config files/dirs?
Do I get just the config files/dirs by hitting "Exclude Content"?
If you have an Artifactory instance that points to other Artifactory instances via smart remote repositories, then you will not have to upgrade all of the instances as they will be able to communicate with each other even if they are not on the same version. With that said, it is always recommended to use the latest version of Artifactory (for all of your instances) in order to enjoy all the latest features and bug fixes and best compatibility between instances. You may find further information about the upgrade process in this wiki page.
In addition, it is always recommended to keep backups of your Artifactory instance, especially when attempting an upgrade. You may use the built-in backup mechanism, or you may manually back up your filestore (by default located in $ARTIFACTORY_HOME/data/filestore) and take database snapshots.
What do you mean by
do all repositories need to be on exactly the same version?
Are you asking about Artifactory instances? Artifactory HA nodes?
Regarding the full system export:
https://www.jfrog.com/confluence/display/RTF/Managing+Backups
https://jfrog.com/knowledge-base/how-should-we-backup-our-data-when-we-have-1tb-of-files/
For more info, you might want to contact JFrog's support.