Apache Airflow - Multiple deployment environments

When handling multiple environments (such as Dev/Staging/Prod), having separate (preferably identical) Airflow instances for each environment would be the ideal scenario.
I'm using GCP's managed Airflow (Cloud Composer), which is not very cheap to run, and having multiple instances would increase our monthly bill significantly.
So, I'd like to know if anyone has recommendations on using a single Airflow instance to handle multiple environments.
One approach I was considering was to have separate top-level folders within my dags folder corresponding to each environment (e.g. dags/dev, dags/prod, etc.)
and to copy my DAG scripts to the relevant folder through the CI/CD pipeline.
So, within my source code repository, my DAG looks like:
airflow_dags/my_dag_A.py
During the CI stage, I could have a build step that creates 2 separate versions of this file:
airflow_dags/DEV/my_dag_A.py
airflow_dags/PROD/my_dag_A.py
I would follow a strict naming convention for naming my DAGs, Airflow Variables, etc. to reflect the environment name, so that the above build step can automatically rename them accordingly.
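To illustrate the idea, here is a minimal sketch of what one of the copied files could look like; the dag_id, paths, and Airflow 2.x import paths are only placeholders, and it derives the environment name from the folder it was copied into so the naming convention is applied without rewriting the file:

# A minimal sketch (hypothetical names and paths), assuming Airflow 2.x:
# the same file copied to dags/DEV/ and dags/PROD/ derives its environment
# from its parent folder, so the CI step only needs to copy it, not rewrite it.
import os
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

ENV = os.path.basename(os.path.dirname(__file__)).lower()  # "dev" or "prod"

with DAG(
    dag_id=f"{ENV}_my_dag_A",  # e.g. dev_my_dag_A / prod_my_dag_A
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule_interval=None,
    tags=[ENV],
) as dag:
    BashOperator(
        task_id="print_env",
        bash_command=f"echo running in {ENV}",
    )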
I wanted to check whether this is an approach others may have used, or whether there are any better/alternative suggestions.
Please let me know if any additional clarifications are needed.
Thank you in advance for your support. Highly appreciated.

I think a shared environment can be a good approach because it's cost-effective.
However, if you have a Composer cluster per environment, it's simpler to manage and it allows better separation.
If you stay on a shared environment, I think you are headed in the right direction with a separation in the Composer DAGs bucket and a folder per environment.
If you use Airflow variables, you also have to handle the environment there, in addition to the DAGs.
You can then manage access to each folder in the bucket.
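For the Variables part, a tiny sketch of what an environment-prefixed convention could look like (the variable names and the way the environment is obtained are assumptions):

from airflow.models import Variable

env = "dev"  # e.g. derived from the DAG's folder or from the DAG id naming convention

# one Variable per environment: dev_gcs_bucket / prod_gcs_bucket, dev_api_endpoint / prod_api_endpoint, ...
gcs_bucket = Variable.get(f"{env}_gcs_bucket")
api_endpoint = Variable.get(f"{env}_api_endpoint")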
In my team, we chose another approach.
Cloud Composer uses GKE in Autopilot mode, and it's more cost-effective than the previous version.
It's also easier to manage the environment size of the cluster and to play with different parameters (workers, CPU, webserver...).
In our case, we created a cluster per environment, but with a different configuration per environment (managed by Terraform):
For the dev and UAT environments, we use small worker sizing and a Small environment size
For the prod environment, we use larger sizing and a Medium environment size
It's not perfect, but this gives us a compromise between cost and separation.

Related

How to do "artefact promotion" for NextJS?

I am building a CI/CD pipeline for a product and am confused about a few things.
I have so far worked in a system where I used to do "code promotion" for environment progression, i.e. each branch pointed towards a certain env, with PRs between the branches. Very recently I read about "artifact promotion", and I feel like it is a sensible thing to do and want to give it a try.
Now for my microservices, I am able to manage it by keeping a Docker image for each env, free from any env-specific variables, and supplying env variables to my pods directly. It all works. But for my frontend, I am hosting it using S3 & CloudFront with the Next.js framework, and the way env variables work in Next.js is that they need to be supplied at build time and get embedded in the export/dist.
How do I do "artifact promotion" in such cases, especially when the env variables are different for each environment?
PS: I know this question is very specific to my use case. Apologies if I am asking it in the wrong place!

Firebase Cloud Functions -- package all in a single VS Code project, or create multiple VS Code projects?

I am new to cloud functions and a little unclear about the way they are "containerized" after they are written and deployed to my project.
I have two quite different sets of functions. One set deals with image storage and Firebase, the other deals with some time-consuming computations. The two sets of functions (let's call them A and B) use different Node modules and have no dependencies on each other, except that they both use Firestore.
My question is whether it matters if I put all the functions in a single VS Code project, or if I should split them up into separate projects. One part of the question is about deployment: it seems like you deploy all the functions in the project when you run firebase deploy, even if some of the functions haven't changed. But probably more important is whether functions which don't need sharp or other image-manipulation packages are "containerized" together with other functions which may need stats and math-related packages, and whether it makes any difference how they are organized into projects.
I realize this question is high level and not about specific code, but it's not so clear to me from the various resources what the appropriate way is to bundle these two sets of unrelated cloud functions so as not to waste a lot of unnecessary loading once they are deployed.
A Visual Studio Code project is simply a way to package your code. You can create 2 folders in your project, one for each set of functions, each with its own Firebase configuration.
Only the source repository can be a constraint here, especially if 2 different teams work on the code base and each one doesn't need to see the code of the other set of functions.
In addition, if you open a VS Code project with the 2 sets of functions, it will take more time to load them and to lint them.
On the Google Cloud side, each function is deployed in its own container. Of course, because the packaging engine (Buildpacks) doesn't know which code each function actually uses, the whole code base is added inside the container. When the app starts, the whole code is loaded: the more code you have, the longer the init takes.
If you segregate your sets of functions into different folders in your project, only the code for set A will be embedded in the container of functions A, and the same for B.
Now, of course, if you put all the functions at the same level even though they don't use the same data, the same code and so on, it's:
A mess for understanding which function does what
A mess in the container, which loads too many things
So it's not a great code base design, but that's beyond the "Google Cloud" topic; it's an engineering choice.
Initially I was really confused about GCP projects vs VS Code IDE projects...
On the question of how cloud functions are "grouped" into containers during deployment: I strongly believe that each cloud function "image" is "deployed" into its own dedicated and separate container in GCP. I think Guillaume described it absolutely correctly. At the same time, the "source" code packed into an "image" might have a lot of redundancies, and there may be plenty of resources which are not used by the given cloud function. It may be a good idea to minimize that.
I would also like to suggest that neither the development nor the deployment process should depend on the client-side IDE, and ideally the deployment should not happen from the client machine at all, to eliminate any local configuration/version variability between different developers. If we work together, I may use vi, you VS Code, and Guillaume GoLand, for example. There should not be any difference in deployment, as the deployment process should take all code from the (origin/remote) git repository rather than from the local machine.
In terms of "packaging", for every cloud function it may be useful to "logically" consolidate all required code (and other files), so that all required files are archived together on deployment and pushed into a dedicated GCS bucket, excluding from such "archives" any unused (not required) files. In that case we might have many "archives", one per cloud function. The deployment process should redeploy only the modified "archives" and leave unmodified cloud functions untouched.

Is there any operator to move files across different hosts via SFTP using Airflow?

I'm a noob to the Airflow ecosystem. One of my first goals using Airflow is implementing workflows to move files across machines. In particular, I'm looking for ways to consolidate data from different Mac/Linux machines into a NAS (using SFTP).
I've been exploring the different Airflow operators, and most of the transfer ones copy data from the local machine to cloud services. I haven't seen one to copy files from host to host, nor one to move them (i.e. copy, then check, then delete). I assume I could use the BashOperator to move files or use rsync. Is there any best practice on how to move files across different hosts via SFTP using Airflow? Any pattern to copy/check/delete safely?
There is not.
However, you can create your own operators as plugins:
https://airflow.apache.org/docs/stable/plugins.html
You may want to take advantage of the SFTPHook that exists in Airflow to code your operator:
https://airflow.readthedocs.io/en/stable/_modules/airflow/contrib/hooks/sftp_hook.html
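As a rough sketch of such an operator built on the SFTPHook (using the newer apache-airflow-providers-sftp import path rather than the contrib module above; the connection ids, paths and the size check are illustrative assumptions, not a vetted implementation):

# Pull a file from one SFTP host and push it to another via a local staging
# directory, verifying the size before optionally deleting the source.
import os
import tempfile

from airflow.models.baseoperator import BaseOperator
from airflow.providers.sftp.hooks.sftp import SFTPHook


class SFTPToSFTPOperator(BaseOperator):
    def __init__(self, *, source_conn_id, dest_conn_id, source_path, dest_path,
                 delete_source=False, **kwargs):
        super().__init__(**kwargs)
        self.source_conn_id = source_conn_id
        self.dest_conn_id = dest_conn_id
        self.source_path = source_path
        self.dest_path = dest_path
        self.delete_source = delete_source

    def execute(self, context):
        source_hook = SFTPHook(ssh_conn_id=self.source_conn_id)
        dest_hook = SFTPHook(ssh_conn_id=self.dest_conn_id)

        with tempfile.TemporaryDirectory() as tmp_dir:
            local_path = os.path.join(tmp_dir, os.path.basename(self.source_path))

            # copy: source host -> local staging -> destination host
            # (assumes the destination directory already exists on the NAS)
            source_hook.retrieve_file(self.source_path, local_path)
            dest_hook.store_file(self.dest_path, local_path)

            # check: compare the source size with what was actually downloaded
            source_size = source_hook.get_conn().stat(self.source_path).st_size
            copied_size = os.path.getsize(local_path)
            if source_size != copied_size:
                raise ValueError(
                    f"Size mismatch for {self.source_path}: "
                    f"{source_size} != {copied_size}"
                )

            # delete: only remove the source once the copy has been verified
            if self.delete_source:
                source_hook.delete_file(self.source_path)

If a full custom operator feels like too much, the same copy/check/delete steps could instead live in a PythonOperator callable that uses the same hook.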

Apache Airflow environment setup

Can a single installation of Apache Airflow be used to handle multiple environments, e.g. Dev, QA1, QA2, and Production (if so, please guide me), or do I need a separate install for each? What would be the best design, considering maintenance of all environments?
You can do whatever you want. If you want to keep a single Airflow installation to handle different environments, you could switch connections or Airflow variables according to the environment.
At the end of the day, DAGs are written in Python, so you're really flexible.
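For example, a rough sketch of that switching in a single installation (the dag ids, connection naming and schedule are hypothetical): one parametrised loop generates a DAG per environment, each wired to environment-specific connections and Variables.

import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

for env in ("dev", "qa1", "qa2", "prod"):
    with DAG(
        dag_id=f"my_pipeline_{env}",
        start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
        schedule_interval=None,
        tags=[env],
    ) as dag:
        BashOperator(
            task_id="extract",
            # swap in env-specific resources, e.g. an "sftp_dev" vs "sftp_prod" connection
            bash_command=f"echo extracting via connection sftp_{env}",
        )
    globals()[dag.dag_id] = dag  # expose each generated DAG to the scheduler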

How to keep different versions of my WSDLs in different git branches of my workflow

I have a web project that interacts heavily with a service layer. The client has different staging servers for the deployment of the project and the service layer, like this:
Servers A0..A9 for development,
Servers B0..B9 for data migration tests,
Servers C0..C9 for integration tests,
Servers D0..D9 for QA,
Servers E0..E9 for production
The WSDLs I'm consuming on the website to interact with the service layer change from one group of servers to the other.
How can I keep different versions of the WSDLs in the different branches using a git workflow with three branches (master, dev, qa)?
As you explained in your comment, the branches will be merged, and the WSDL files will conflict. When that happens, you have to resolve the conflict by keeping the right version of the WSDL file.
For example, if you are on qa, merging from dev, and there is a conflict on a WSDL file, you can resolve it with:
git checkout HEAD file.wsdl
This will restore the WSDL file as it was before the merge, and you can commit the merge.
However, if there are changes in the WSDL file but there are no conflicts, then git merge will automatically merge them. If that's not what you want, and you really want to preserve the file without merging, then you could merge like this:
git merge dev --no-commit --no-ff
git checkout HEAD file.wsdl
git commit
UPDATE
To make this easier, see the first answer to this other question:
Git: ignore some files during a merge (keep some files restricted to one branch)
It offers 3 different solutions; make sure to consider all of them.
I think branching is the wrong tool to use. Branches are highly useful for (more or less) independent development that needs to take place in parallel and in isolation. Your use case of multiple deployment environments doesn't appear to fit that model. It also doesn't scale very well if you need multiple branches, e.g. for releases, and always have to create and maintain the deployment-specific child branches.
You would probably be better served by a single branch that defines multiple configurations (either for deployment or building, depending on what consumes the WSDL file).
This assumes that the WSDL file is the only difference between the branches, which is the impression I get from the question.
