Cloud composer rosbag BashOperator - airflow

I have an Airflow DAG running in a VM, but in order to facilitate event-driven triggering I'm trying to set up Cloud Composer in GCP. However, I only see an option in Cloud Composer to install PyPI packages.
I need the rosbag package in order to run my bash script. Is there any way to do that in Cloud Composer? Or is it better to run Airflow in a VM or in a container with Kubernetes?

You can add your own requirements in Cloud Composer
https://cloud.google.com/composer/docs/how-to/using/installing-python-dependencies
However, knowing rosbag pretty well (I've been a robotics engineer using ROS for quite some time), it might not be easy to work out the right set of dependencies. Airflow has > 500 dependencies overall, and it is highly likely that some of them will clash with the particular version of ROS you need.
Also, ROS has its own specific way of initialization: setting environment variables and sourcing certain scripts, which you would have to emulate yourself, modifying PYTHONPATH and possibly doing some extra setup.
I'd say your best bet is to use the DockerOperator and run ROS from a Docker image. This can be done even with GPU support if needed (been there, done that), and it provides the right level of isolation: both Airflow and ROS lean heavily on Python and their own dependencies, so keeping them in separate containers is probably the simplest way.
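As an illustration, here is a minimal sketch of that approach, assuming a stock ros:noetic image and a hypothetical bag file at /data/input.bag; the image name, command and Docker connection details are placeholders you would adapt to your setup (and you need the apache-airflow-providers-docker package installed):

from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="rosbag_processing",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    process_bag = DockerOperator(
        task_id="process_bag",
        image="ros:noetic",  # any ROS image that ships rosbag
        # source the ROS setup script so rosbag and its env vars are available
        command='bash -c "source /opt/ros/noetic/setup.bash && rosbag info /data/input.bag"',
        docker_url="unix://var/run/docker.sock",  # assumes a reachable Docker daemon
        auto_remove=True,
    )

Note that on Cloud Composer there is no Docker daemon you can reach directly, so if you stay on Composer the closer equivalent is the KubernetesPodOperator pointing at the same ROS image.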

Related

Connect dbt to Airflow using EKS

We have deployed Airflow on AWS EKS via a Helm chart and want to trigger dbt models from it.
A few questions:
1. What would be the ideal way to deploy dbt? I am thinking of either deploying a separate container for dbt only, or installing dbt via pip or brew in the same container that runs Airflow.
2. If the ideal way to run dbt is in its own container, how do I connect Airflow to dbt?
Please feel free to add any relevant information!
I think you should consider switching to the official chart that the Apache Airflow community released recently: https://airflow.apache.org/docs/helm-chart/stable/index.html - it is prepared and maintained by the same community that builds Airflow.
One of the best descriptions of how to integrate dbt can be found in this Astronomer blog post: https://www.astronomer.io/blog/airflow-dbt-1
Summarising it: if you do not want to use dbt Cloud, you can install dbt as a pip package and either run it via a Bash script or use the dedicated dbt operators. If you already run Airflow from an image, connecting a separate dbt image to it, while technically possible, is a bit challenging and likely not worth the hassle.
You should simply extend the Airflow image and add dbt as a pip package. You can learn how to extend or customize the Airflow image here: https://airflow.apache.org/docs/docker-stack/build.html
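For illustration, a minimal sketch of the Bash-script route, assuming dbt has been added to the extended image and a hypothetical project checked out at /opt/dbt/my_project (paths, schedule and profiles directory are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_daily",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # dbt is on PATH because it was pip-installed into the extended Airflow image
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt/my_project --profiles-dir /opt/dbt",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --project-dir /opt/dbt/my_project --profiles-dir /opt/dbt",
    )
    dbt_run >> dbt_test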
Small follow-up: not sure if you have seen the talk from last week's Airflow Summit, but I highly recommend it: https://airflowsummit.org/sessions/2021/building-a-robust-data-pipeline-with-the-dag-stack/ - it might give you a few more answers :)

Best practices for developing own custom operators in Airflow 2.0

We are currently in the process of developing custom operators and sensors for our Airflow (>2.0.1) on Cloud Composer. We use the official Docker image for testing/development.
As of Airflow 2.0, the recommended way is not to put them in the plugins directory of Airflow but to build them as a separate Python package. This approach, however, seems quite complicated when developing DAGs and testing them against the Dockerized Airflow.
To use Airflow's recommended approach, we would use two separate repos for our DAGs and the operators/sensors; we would then mount the custom operators/sensors package into Docker to quickly test it there while editing it on the local machine. For further use on Composer we would need to publish our package to our private PyPI repo and install it on Cloud Composer.
The old approach, however (putting everything in the local plugins folder), is quite straightforward and doesn't have these problems.
Based on your experience, what is your recommended way of developing and testing custom operators/sensors?
You can put the "common" code (custom operators and such) in the dags folder and exclude it from being processed by the scheduler via an .airflowignore file. This allows for rather quick iteration when developing.
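A minimal sketch of that layout, assuming a hypothetical dags/common_package/ directory and a dags/.airflowignore file containing the single line common_package:

# dags/common_package/hello_operator.py
from airflow.models.baseoperator import BaseOperator


class HelloOperator(BaseOperator):
    """Toy custom operator kept next to the DAGs for fast iteration."""

    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        self.log.info("Hello %s", self.name)
        return self.name


# a DAG file such as dags/my_dag.py can then simply do:
# from common_package.hello_operator import HelloOperator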
You can still keep the DAGs and the "common" code in separate repositories to make things easier. You can use a "submodule" pattern for that: add the "common" repo as a submodule of the DAG repo, so you can check them out together. You can even keep different DAG directories (for different teams) with different versions of the common packages this way, by pointing the submodule at different versions of the package.
I think the "package" pattern is more of a production deployment thing than a development one. Once you have developed the common code locally, it is best to bundle it into a common package and version it accordingly (same as any other Python package). Then you can test it, release it, version it and so on.
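For example, a minimal sketch of such a package's setup.py, with a hypothetical distribution name and version you would adapt:

from setuptools import find_packages, setup

setup(
    name="my-company-airflow-common",  # hypothetical distribution name
    version="0.1.0",
    packages=find_packages(include=["common_package", "common_package.*"]),
    install_requires=[
        "apache-airflow>=2.0.1",  # pin to whatever your Composer environment runs
    ],
)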
In "development" mode you can check out the code with a recursive submodule update and add the "common" subdirectory to PYTHONPATH. In production, even if you use git-sync, your ops team could deploy the custom operators via a custom image (by installing the appropriate released version of your package), while your DAGs would be git-synced separately WITHOUT the submodule checkout. The submodule would only be used for development.
It would also be worth running CI/CD on the DAGs you push to your DAG repo, to see whether they keep working with the "released" custom code in the "stable" branch, while running the same CI/CD with the common code synced via submodule in the "development" branch (this way you can check the latest development DAG code against the linked common code).
This is what I'd do. It allows for quick iteration during development while also turning the common code into "freezable" artifacts that provide a stable environment in production; your DAGs can still be developed and evolve quickly, and CI/CD helps keep the "stable" things really stable.

Reduce latency when rebuilding/updating Google Cloud Composer?

I am working with Google Cloud Composer, and whenever I change an environment variable or any other setting in Airflow, it triggers a rebuild of the Airflow environment.
I was thinking there may be a way to resolve package dependencies ahead of time (i.e. upload Python packages) to cut back on the rebuild latency.
I ask because the rebuild can take anywhere from 2-15 mins.
Has anyone had any luck with reducing this build time (with or without increasing costs)?
Environment updates can generally take between 5 and 30 minutes in Cloud Composer, and this is working as intended at the moment. Please check this public issue tracker for more insight; you can click on +1 to make it more visible to the Cloud Composer engineering team.
Please note that Composer needs to take care of a lot of resources: it deploys Airflow within Google Kubernetes Engine and App Engine, which forces rebuilding the container images, updating the Airflow webserver, and so on.
I suggest you take a look at the Cloud Composer architecture documentation, where you can find all the components that need to be updated with each change.
You can also check whether any of your pods are in the Evicted state, which means your nodes are low on resources; in that case, consider using a machine type with more resources than the standard n1-standard-2.
I hope you find the above information useful.

Apache Airflow environment setup

Can a single installation of Apache Airflow be used to handle multiple environments, e.g. Dev, QA1, QA2, and Production (if so, please guide), or do I need a separate install for each? What would be the best design, considering maintenance of all environments?
You can do whatever you want. If you want to keep a single Airflow installation to handle different environments, you could switch connections or Airflow variables according to the environment.
At the end of the day, DAGs are written in Python, so you're really flexible.
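As an illustration, a minimal sketch of that idea, assuming hypothetical Airflow connections named warehouse_dev, warehouse_qa1, warehouse_qa2 and warehouse_prod; one DAG is generated per environment from the same code:

from datetime import datetime

from airflow import DAG
from airflow.hooks.base import BaseHook
from airflow.operators.python import PythonOperator


def load(conn_id: str):
    # the connection holds the environment-specific host/credentials
    conn = BaseHook.get_connection(conn_id)
    print(f"Loading into {conn.host}")


for env in ("dev", "qa1", "qa2", "prod"):
    with DAG(
        dag_id=f"load_warehouse_{env}",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="load",
            python_callable=load,
            op_kwargs={"conn_id": f"warehouse_{env}"},
        )
    globals()[f"load_warehouse_{env}"] = dag  # make each generated DAG discoverable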

Apache Karaf deployment in a Jenkins build pipeline

Currently I am trying to improve automation in my test environment which uses Apache Karaf 2.4 and Jenkins.
My goal is, if Jenkins successfully builds the project (and updates a Maven repository), to deploy the new version in Karaf. Karaf and Jenkins run on different machines.
My current solution is to use a feature descriptor, but I am facing a lot of trouble with it: I found no easy way to update feature bundles in Karaf 2.4. As far as I know, there is no single command that updates all bundles of an existing feature.
My current approach is to grep the output of the list command for a special pattern, find all bundle IDs, and then run update for each ID. This approach tends to create bugs (it may include bundles that are not part of the feature but match the same naming pattern). I was wondering if there is a better way to automatically update all my feature bundles in one go?
There is another problem as well: when a new feature gets added to or removed from the feature file, I found no elegant way to update it. My current approach is to first uninstall all associated bundles (with grep again...), then remove the repository from Karaf and then place the new version of the feature file in the deployment folder. As this is very cumbersome, I was wondering if there is a better way to do this in Karaf 2.4?
I think deployment got better in Karaf 3, but I am unable to use it because of this bug (link). As I am in a testing environment and my software tends to be extremely unstable, I often have to kill the Karaf process and restart it, and this is a big problem in Karaf 3. Hopefully it will be fixed in Karaf 4, but I do not want to wait until it is released.
Do you have any suggestions as to how my deployment process could be improved? Maybe there are better ways than my solution (I really hope so!), but I haven't seen them yet.
This is a nice question, because I've been working on something similar just recently. Basically I've got the same setup as you.
You just need to install the latest Jolokia OSGi servlet on top of Karaf to make all JMX commands accessible via REST.
For a showcase I created a Maven plugin (I'm going to publish the sources in a couple of weeks; it might even get into the Karaf Maven plugin) that installs the current bundle on top of Karaf via a REST request; it is currently also able to check for existing bundles.
Basically you need to issue the following rest POST:
{
  "type": "EXEC",
  "mbean": "org.apache.karaf:type=bundle,name=root",
  "operation": "install(java.lang.String,boolean)",
  "arguments": ["mvn:${project.groupId}/${project.artifactId}/${project.version}", true]
}
You'll need to use the latest Jolokia snapshot version to get around the role-based authentication used in the latest Karaf versions.
For this you'll also need to create a configuration file in etc, called org.jolokia.osgi.cfg, which contains the following entries:
org.jolokia.user=karaf
org.jolokia.realm=karaf
org.jolokia.authMode=jaas
For more details also check the issue for it.
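For illustration, a minimal sketch of issuing that request from a build step, assuming Jolokia is reachable at a hypothetical http://karaf-host:8181/jolokia with the default karaf/karaf credentials and a placeholder Maven coordinate:

import requests

payload = {
    "type": "EXEC",
    "mbean": "org.apache.karaf:type=bundle,name=root",
    "operation": "install(java.lang.String,boolean)",
    # install the freshly built artifact straight from the Maven repo and start it
    "arguments": ["mvn:com.example/my-bundle/1.0.0-SNAPSHOT", True],
}

response = requests.post(
    "http://karaf-host:8181/jolokia",
    json=payload,
    auth=("karaf", "karaf"),
    timeout=30,
)
response.raise_for_status()
print(response.json())  # Jolokia returns the result of the operation under "value"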

Resources