Is there any operator to move files across different hosts via SFTP using Airflow?

I'm new to the Airflow ecosystem. One of my first goals with Airflow is to implement workflows that move files across machines. In particular, I'm looking for ways to consolidate data from different Mac/Linux machines onto a NAS (using SFTP).
I've been exploring the various Airflow operators, and most of the transfer operators copy data from the local machine to cloud services. I haven't found one that copies files from host to host, nor one that moves them (copy, then verify, then delete). I assume I could use BashOperator to move the files or to run rsync. Is there a best practice for moving files between hosts via SFTP with Airflow? Is there a pattern for copying, verifying, and deleting safely?

There is not.
However, you can create your own operators as plugins:
https://airflow.apache.org/docs/stable/plugins.html
You may want to take advantage of the SFTPHook that already exists in Airflow to code your operator:
https://airflow.readthedocs.io/en/stable/_modules/airflow/contrib/hooks/sftp_hook.html
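For illustration, here is a minimal sketch of such an operator built on SFTPHook (the connection IDs, paths, and the size check are assumptions, and it stages the file through the worker; the provider import path applies to Airflow 2+, the contrib path to Airflow 1.10):

import os
import tempfile

from airflow.models import BaseOperator
# On Airflow 1.10: from airflow.contrib.hooks.sftp_hook import SFTPHook
from airflow.providers.sftp.hooks.sftp import SFTPHook


class SFTPToSFTPOperator(BaseOperator):
    """Copy a file from one SFTP host to another, verify it, then optionally delete the source."""

    template_fields = ("source_path", "dest_path")

    def __init__(self, source_conn_id, source_path, dest_conn_id, dest_path,
                 delete_source=True, **kwargs):
        super().__init__(**kwargs)
        self.source_conn_id = source_conn_id
        self.source_path = source_path
        self.dest_conn_id = dest_conn_id
        self.dest_path = dest_path
        self.delete_source = delete_source

    def execute(self, context):
        source_hook = SFTPHook(ssh_conn_id=self.source_conn_id)
        dest_hook = SFTPHook(ssh_conn_id=self.dest_conn_id)

        # Stage the file on the worker, then push it to the destination host.
        with tempfile.TemporaryDirectory() as tmp_dir:
            local_path = os.path.join(tmp_dir, os.path.basename(self.source_path))
            source_hook.retrieve_file(self.source_path, local_path)
            dest_hook.store_file(self.dest_path, local_path)

            # Basic check: compare file sizes before touching the source.
            remote_size = dest_hook.get_conn().stat(self.dest_path).st_size
            if remote_size != os.path.getsize(local_path):
                raise ValueError(f"Size mismatch after uploading {self.dest_path}")

        if self.delete_source:
            source_hook.delete_file(self.source_path)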

Related

Apache Airflow - Multiple deployment environments

When handling multiple environments (such as Dev/Staging/Prod), having separate (preferably identical) Airflow instances for each of them would be the best-case scenario.
I'm using GCP-managed Airflow (GCP Cloud Composer), which is not cheap to run, and having multiple instances would increase our monthly bill significantly.
So, I'd like to know if anyone has recommendations on using a single Airflow instance to handle multiple environments?
One approach I was considering was to have separate top-level folders within my dags folder corresponding to each environment (i.e. dags/dev, dags/prod, etc.)
and to copy my DAG scripts to the relevant folder through the CI/CD pipeline.
So, within my source code repository if my dag looks like:
airflow_dags/my_dag_A.py
During the CI stage, I could have a build step that creates 2 separate versions of this file:
airflow_dags/DEV/my_dag_A.py
airflow_dags/PROD/my_dag_A.py
I would follow a strict naming convention for my DAGs, Airflow Variables, etc. to reflect the environment name, so that the above build step can rename them automatically, as sketched below.
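A hypothetical version of that build step (the paths, the dag_id convention, and the naive string rewrite are assumptions, not an established pattern):

import pathlib
import re
import shutil

SOURCE_DIR = pathlib.Path("airflow_dags")
ENVIRONMENTS = ["DEV", "PROD"]

for dag_file in SOURCE_DIR.glob("*.py"):
    for env in ENVIRONMENTS:
        target_dir = SOURCE_DIR / env
        target_dir.mkdir(exist_ok=True)
        target = target_dir / dag_file.name
        shutil.copy(dag_file, target)
        # Suffix dag_id="my_dag_A" with the environment, e.g. dag_id="my_dag_A_dev".
        # A real pipeline might template the dag_id instead of rewriting source.
        text = target.read_text()
        text = re.sub(r'dag_id="([^"]+)"', rf'dag_id="\g<1>_{env.lower()}"', text)
        target.write_text(text)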
I wanted to check whether this is an approach others have used, or whether there are better/alternative suggestions.
Please let me know if any additional clarifications are needed.
Thank you in advance for your support. Highly appreciated.
I think a shared environment can be a good approach because it's cost effective.
However, if you have a Composer cluster per environment, it's simpler to manage and it allows better separation.
If you stay on a shared environment, I think you are heading in the right direction with a split of the Composer DAG bucket into a folder per environment.
If you use Airflow Variables, you also have to handle them per environment, in addition to the DAGs.
You can then manage access to each folder in the bucket.
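As a sketch of what that can look like (the folder names, the variable prefix, and the dag_id scheme below are assumptions), each DAG can derive its environment from the folder it lives in and prefix its Variable lookups accordingly:

import pathlib
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# The file is deployed to dags/dev/ or dags/prod/, so the parent folder name
# tells us which environment this copy of the DAG belongs to.
ENV = pathlib.Path(__file__).parent.name  # "dev" or "prod"

with DAG(
    dag_id=f"my_dag_A_{ENV}",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="echo_target",
        # Jinja defers the Variable lookup to runtime, e.g. dev_target_bucket.
        bash_command="echo 'env=" + ENV + " bucket={{ var.value." + ENV + "_target_bucket }}'",
    )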
In my team, we chose another approach.
Cloud Composer now runs on GKE in Autopilot mode and is more cost effective than the previous version.
It's also easier to manage the environment size of the cluster and to tune different parameters (workers, CPU, webserver...).
In our case, we created a cluster per environment, but with a different configuration per environment (managed by Terraform):
For the dev and uat environments, we use small sizing, with the environment size set to Small.
For the prod environment, we use larger sizing, with the environment size set to Medium.
It's not perfect, but it gives us a reasonable compromise between cost and separation.

how to organize your projects and dags with airflow

I am considering using Apache Airflow. I had a look at the documentation and am now trying to implement an already existing pipeline (from a home-made framework) using Airflow.
All the given examples are simple one-module DAGs. But in real life you can have a versioned application that provides (complex) pipeline blocks, and DAGs use those blocks as tasks. Basically, the application package is installed in a dedicated virtual environment with its dependencies.
OK, so now how do you plug that into Airflow? Should Airflow be installed in the application virtualenv? Then there is a dedicated Airflow instance for this application's pipelines, but in that case, if you have 100 applications you have to have 100 Airflow instances... On the other hand, if you have one single instance, it means you have installed all your application packages in the same environment, and you can potentially have dependency conflicts...
Is there something I am missing? Are there best practices? Do you know of internet resources that may help, or GitHub repos using one pattern or the other?
Thanks
One instance with 100 pipelines. Each pipeline can easily be versioned, and Python dependencies can be packaged.
We have 200+ very different pipelines and use one central Airflow instance. Folders are organized as follows:
DAGs/
DAGs/pipeline_1/v1/pipeline_1_dag_1.0.py
DAGs/pipeline_1/v1/dependancies/
DAGs/pipeline_1/v2/pipeline_1_dag_2.0.py
DAGs/pipeline_1/v2/dependancies/
DAGs/pipeline_2/v5/pipeline_2_dag_5.0.py
DAGs/pipeline_2/v5/dependancies/
DAGs/pipeline_2/v6/pipeline_2_dag_6.0.py
DAGs/pipeline_2/v6/dependancies/
etc.
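For illustration, one of those versioned DAG files might look roughly like this (the transform module and task are hypothetical; the point is that each version only adds its own dependancies/ folder to sys.path):

# DAGs/pipeline_1/v2/pipeline_1_dag_2.0.py  -- sketch only
import os
import sys
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Make this version's private dependancies/ importable without touching
# other pipelines or other versions of the same pipeline.
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "dependancies"))

import transform  # hypothetical module in DAGs/pipeline_1/v2/dependancies/

with DAG(
    dag_id="pipeline_1_v2",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="transform", python_callable=transform.run)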

Apache Airflow environment setup

Can a single installation of Apache Airflow be used to handle multiple environments, e.g. Dev, QA1, QA2, and Production (if so, please guide), or do I need a separate install for each? What would be the best design, considering the maintenance of all environments?
You can do whatever you want. If you want to keep a single Airflow installation to handle different environments, you could switch Connections or Airflow Variables according to the environment.
At the end of the day, DAGs are written in Python, so you're really flexible.
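As a sketch of that idea (the DEPLOY_ENV environment variable and the per-environment connection naming are assumptions), a single DAG definition can pick its Connection based on which environment the installation serves:

import os
from datetime import datetime

from airflow import DAG
from airflow.hooks.base import BaseHook
from airflow.operators.python import PythonOperator

# Which environment this Airflow installation serves; set on the scheduler
# and workers. The variable name is an assumption.
ENV = os.environ.get("DEPLOY_ENV", "dev")  # dev | qa1 | qa2 | prod

def show_target():
    # Connections are named per environment, e.g. warehouse_dev, warehouse_prod.
    conn = BaseHook.get_connection(f"warehouse_{ENV}")
    print(f"Would connect to {conn.host} as {conn.login}")

with DAG(
    dag_id=f"load_warehouse_{ENV}",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="show_target", python_callable=show_target)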

use julia language without internet connection (mirror?)

Problem:
I would like to make Julia available to our developers on our corporate network, which has no internet access at all (not even a proxy) due to sensitive data.
As far as I understand, Julia is designed to use GitHub.
For instance, julia> Pkg.init() tries to access:
git://github.com/JuliaLang/METADATA.jl
Example:
I solved this problem for R by creating a local CRAN repository (rsync) and setting up a local webserver.
I also solved this problem for Python the same way, by creating a local PyPI repository (bandersnatch) plus a webserver.
Question:
Is there a way to create a local repository for metadata and packages for julia?
Thank you in advance.
Roman
Yes, one of the benefits from using the Julia package manager is that you should be able to fork METADATA and host it anywhere you'd like (and keep a branch where you can actually check new packages before allowing your clients to update). You might be one of the first people to actually set up such a system, so expect that you will need to submit some issues (or better yet; pull requests) in order to get everything working smoothly.
See the extra arguments to Pkg.init() where you specify the METADATA repo URL.
If you want a simpler solution to manage, I would also consider a two-tier setup where you install packages on one system (connected to the internet) and then copy the resulting ~/.julia directory to the restricted system. If the packages you use have binary dependencies, you might run into problems if you don't have similar systems on both sides, or if some of the dependencies are installed globally, but Pkg.build("Pkgname") might be helpful.
This is how I solved it (for now), using the second suggestion by ivarne. I use a two-tier setup: two networks, one connected to the internet (office network) and one air-gapped network (development network).
System information: openSuSE-13.1 (both networks), julia-0.3.5 (both networks)
Tier one (office network)
installed julia on an NFS share, /sharename/local/julia.
soft linked /sharename/local/bin/julia to /sharename/local/julia/bin/julia
appended /sharename/local/bin/ to $PATH using a script in /etc/profile.d/scriptname.sh
created /etc/gitconfig on all office network machines: [url "https://"] insteadOf = git:// (to solve proxy server problems with github)
now every user on the office network can simply run # julia
Pkg.add("PackageName") is then used to install various packages.
The two networks are connected periodically (with certain security measures ssh, firewall, routing) for automated data exchange for a short period of time.
Tier two (development network)
installed julia on NFS share equal to tier one.
When the networks are connected I use a shell script with rsync -avz --delete to synchronize the .julia directory of tier one to tier two for every user.
Conclusion (so far):
It seems to work reasonably well.
As ivarne suggested, there are problems if a package is installed AND something more than just file copying (compilation?) happens on tier one; the package won't run on tier two. But this can be resolved with Pkg.build("Pkgname").
PackageCompiler.jl seems like the best tool for using modern Julia (v1.8) on secure systems. The following approach requires a build server with the same architecture as the deployment server, something your institution probably already uses for developing containers, etc.
Build a sysimage with PackageCompiler's create_sysimage()
Upload the build (sysimage and depot) along with the Julia binaries to the secure system
Alias a script to julia, similar to the following example:
#!/bin/bash
set -Eeu -o pipefail
# Point Julia at the bundled project, depot, and sysimage, and keep the
# package manager offline so it never tries to reach the internet.
unset JULIA_LOAD_PATH
export JULIA_PROJECT=/Path/To/Project
export JULIA_DEPOT_PATH=/Path/To/Depot
export JULIA_PKG_OFFLINE=true
/Path/To/julia -J/Path/To/sysimage.so "$@"
I've been able to run a research pipeline on my institution's secure system, for which there is a public version of the approach.

How to keep different versions of my WSDLs in different git branches of my workflow

I have a (web) project that interacts heavily with a service layer. The client has different staging servers for the deployment of the project and the service layer, like this:
Servers from A0..A9 for development,
Servers from B0..B9 for data migration tests,
Servers from C0..C9 for integration test,
Servers from D0..D9 for QA,
Servers from E0..E9 for production
The WSDLs I consume on the website to interact with the service layer change from one group of servers to the other.
How can I keep different versions of the WSDLs in the different branches using a git workflow with three branches (master, dev, qa)?
As you explained in your comment, the branches will be merged, and the WSDL files will conflict. When that happens, you have to resolve the conflict by keeping the right version of the WSDL file.
For example, if you are on qa, merging from dev, and there is a conflict on a WSDL file, you can resolve it with:
git checkout HEAD file.wsdl
This will restore the WSDL file as it was before the merge, and you can commit the merge.
However, if there are changes in the WSDL file but there are no conflicts, then git merge will automatically merge them. If that's not what you want, and you really want to preserve the file without merging, then you could merge like this:
git merge dev --no-commit --no-ff
git checkout HEAD file.wsdl
git commit
UPDATE
To make this easier, see the first answer to this other question:
Git: ignore some files during a merge (keep some files restricted to one branch)
It offers 3 different solutions, make sure to consider all of them.
I think branching is the wrong tool to use. Branches are highly useful for (more or less) independent development that needs to take place in parallel and in isolation. Your use case of multiple deployment environments doesn't appear to fit that model. It also doesn't scale very well if you need multiple branches for e.g. releases and always have to create and maintain the deployment-specific child branches.
You would probably be better served by a single branch that defines multiple configurations (either for deployment or building, depending on what consumes the WSDL file).
This assumes that the WSDL file is the only difference between the branches, which is the impression I get from the question.
