We have deployed Airflow on AWS EKS via Helm and want to trigger dbt models from it.
A few questions:
1. What would be the ideal way to deploy dbt? I am considering either deploying a separate container just for dbt, or installing dbt (via pip or brew) in the same container that runs Airflow.
2. If the ideal way to run dbt is in its own container, how do I connect Airflow to it?
Please feel free to add any relevant information!
I think you should consider switching to the official chart that the Apache Airflow community released recently: https://airflow.apache.org/docs/helm-chart/stable/index.html - it is prepared and maintained by the same community that builds Airflow.
I think one of the best descriptions of how to integrate dbt can be found in this Astronomer blog post: https://www.astronomer.io/blog/airflow-dbt-1
To summarise: if you do not want to use dbt Cloud, you can install dbt as a pip package and either run it via a Bash script or use the dedicated dbt operators. If you already run Airflow from an image, wiring it up to invoke dbt in a separate image is technically possible but a bit challenging and likely not worth the hassle.
You should simply extend the Airflow image and add dbt as a pip package. You can learn how to extend or customize the Airflow image here: https://airflow.apache.org/docs/docker-stack/build.html
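For illustration, here is a minimal sketch of a DAG that triggers dbt via the BashOperator, assuming dbt has been installed in the extended image and that the dbt project is baked into it at a hypothetical /opt/airflow/dbt path:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical location of the dbt project inside the extended Airflow image.
DBT_PROJECT_DIR = "/opt/airflow/dbt"

with DAG(
    dag_id="dbt_run_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run the models, then the tests, using the dbt CLI installed via pip.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"cd {DBT_PROJECT_DIR} && dbt run",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"cd {DBT_PROJECT_DIR} && dbt test",
    )
    dbt_run >> dbt_test
```

If you prefer the dedicated dbt operators mentioned above, the airflow-dbt package provides equivalents such as DbtRunOperator and DbtTestOperator that replace these Bash calls.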
Small follow-up: not sure if you have seen the talk from last week's Airflow Summit, but I highly recommend it: https://airflowsummit.org/sessions/2021/building-a-robust-data-pipeline-with-the-dag-stack/ - it might give you some more answers :)
Related
I have an Airflow DAG running in a VM, but in order to facilitate event-driven triggering I'm trying to set up Cloud Composer in GCP. However, I only see an option in Cloud Composer to install PyPI packages.
I need the rosbag package in order to run my bash script. Is there any way to do that in Cloud Composer? Or is it better to run Airflow in a VM, or in a container on Kubernetes?
You can add your own requirements in Cloud Composer:
https://cloud.google.com/composer/docs/how-to/using/installing-python-dependencies
However, knowing rosbag pretty well (I've been a robotics engineer using ROS for quite some time), it might not be easy to work out the right set of dependencies. Airflow has > 500 dependencies overall, and it is highly likely some of them will clash with your particular version of ROS.
Also, ROS has its own specific way of initialising: setting up environment variables and sourcing certain scripts. You would have to emulate that yourself, modify PYTHONPATH, and possibly do some extra initialisation.
I'd say your best bet is to use the DockerOperator and run ROS from a Docker image. This can be done even with GPU support if needed (been there, done that), and it provides the right level of isolation: both Airflow and ROS lean heavily on Python dependencies, and this is likely the simplest way to keep them apart.
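For illustration, a minimal sketch of such a DAG, assuming a hypothetical ROS image (my-registry/ros-rosbag:latest) that already contains your bash script; the import path is for the Docker provider in Airflow 2.x and argument names can differ slightly between provider versions:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="rosbag_processing",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered externally / event-driven
    catchup=False,
) as dag:
    process_bag = DockerOperator(
        task_id="process_rosbag",
        # Hypothetical image containing ROS, rosbag and your bash script.
        image="my-registry/ros-rosbag:latest",
        # Source the ROS environment before running the script; the distro
        # path is an assumption and depends on your ROS version.
        command="bash -c 'source /opt/ros/noetic/setup.bash && ./process_bags.sh'",
        docker_url="unix://var/run/docker.sock",
        auto_remove=True,
    )
```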
We are currently in the process of developing custom operators and sensors for our Airflow (>2.0.1) on Cloud Composer. We use the official Docker image for testing and development.
As of Airflow 2.0, the recommended way is not to put them in Airflow's plugins directory but to build them as a separate Python package. This approach, however, seems quite complicated when developing DAGs and testing them against the Docker image.
To use Airflow's recommended approach, we would keep two separate repos: one for our DAGs and one for the operators/sensors. We would then mount the custom operators/sensors package into Docker so we can edit it on the local machine and test it there quickly. For further use on Composer, we would need to publish the package to our private PyPI repo and install it on Cloud Composer.
The old approach, however - putting everything in the local plugins folder - is quite straightforward and doesn't run into these problems.
Based on your experience, what is your recommended way of developing and testing custom operators/sensors?
You can put the "common" code (custom operators and such) in the dags folder and exclude it from being processed by the scheduler via an .airflowignore file. This allows for rather quick iteration while developing.
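As a rough illustration of that layout (the paths, file names, and operator below are purely hypothetical):

```python
# dags/.airflowignore would contain a line such as:
#     common/
# so the scheduler does not try to parse the shared code as DAG files.

# dags/common/operators/my_operator.py -- the shared code.
# The dags folder is on sys.path, so "common" is importable from DAGs;
# you may need __init__.py files depending on your setup.
from airflow.models.baseoperator import BaseOperator


class MyCustomOperator(BaseOperator):
    """A placeholder custom operator used only to illustrate the layout."""

    def execute(self, context):
        self.log.info("running custom logic")


# dags/my_dag.py would then simply do:
# from common.operators.my_operator import MyCustomOperator
```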
You can still keep the DAGs and the "common code" in separate repositories to make things easier. A "submodule" pattern works well for that: add the "common" repo as a submodule of the DAG repo, so you can check them out together. You can even keep different DAG directories (for different teams) with different versions of the common packages this way - just submodule-link each one to a different version of the package.
I think the "package" pattern is more of a production deployment thing than a development one. Once you have developed the common code locally, it is worth bundling it into a common package and versioning it accordingly (the same as any other Python package). Then you can test it, release it, version it, and so on.
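For example, a minimal packaging sketch for such a common package could look like this (the package name and layout are just placeholders):

```python
# setup.py for a hypothetical shared package of custom operators/sensors.
from setuptools import find_packages, setup

setup(
    name="my-company-airflow-common",  # placeholder name
    version="0.1.0",
    packages=find_packages(include=["common", "common.*"]),
    install_requires=[
        # Pin only what the operators genuinely need;
        # Airflow itself is provided by the Composer environment.
    ],
)
```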
In "development" mode you can check out the code with a recursive submodule update and add the "common" subdirectory to PYTHONPATH. In production - even if you use git-sync - your ops team could deploy the custom operators via a custom image (installing the appropriate released version of your package), while your DAGs are git-synced separately WITHOUT the submodule checkout. The submodule would only be used for development.
In this case it would also be worth running CI/CD on the DAGs you push to your DAG repo: check that they keep working with the "released" custom code on the "stable" branch, while running the same CI/CD on the "development" branch with the common code synced via the submodule (this way you can check the latest development DAG code against the linked common code).
This is what I'd do. It allows for quick iteration during development while turning the common code into "freezable" artifacts that provide a stable environment in production. Your DAGs can still be developed and evolve quickly, and CI/CD helps keep the "stable" things really stable.
I have previously used Airflow running on an Ubuntu VM. I was able to log into the VM via WinSCP and PuTTY to run commands and edit Airflow-related files as required.
But this is the first time I have come across Airflow running on an AWS ECS cluster. I am new to ECS, so I am trying to find the best possible way to:
run commands like "airflow db init", stop/start the webserver and scheduler, etc.
install new Python packages ("pip install ...")
view and edit Airflow files in the ECS cluster
I was reading about the AWS CLI and the ECS CLI - could they be helpful? Or is there any other good way to do the actions mentioned above?
Thanks in Advance.
Kind Regards
P.
There are many articles that describe how to run Airflow on ECS:
https://tech.gc.com/apache-airflow-on-aws-ecs/
https://towardsdatascience.com/building-an-etl-pipeline-with-airflow-and-ecs-4f68b7aa3b5b
https://aws.amazon.com/blogs/containers/running-airflow-on-aws-fargate/
and many more.
[Note: Fargate allows you to run containers via ECS without needing to manage EC2 instances yourself - worth reading up on if you want additional background on what Fargate is.]
Also, the AWS CLI is a generic CLI that maps to all AWS APIs (mostly 1:1). You may consider it for your deployment, but it should not be your starting point.
The ECS CLI is a more abstracted CLI that exposes higher-level constructs and workflows specific to ECS. Note that the ECS CLI has been superseded by AWS Copilot, an opinionated CLI for deploying workload patterns on ECS - perhaps too opinionated to be a good fit for deploying Airflow.
My suggestion is to go through the blogs above and get inspired.
We use dbt on GCP for transformations in BigQuery, and the simplest approach to scheduling our daily dbt run seems to be a BashOperator in Airflow. Currently we have two separate directories / GitHub projects, one for dbt and another for Airflow. To schedule dbt to run with Airflow, it seems like our entire dbt project would need to be nested inside our Airflow project so that we can point to it from our dbt run bash command?
Is it possible to trigger our dbt run and dbt test without moving our dbt directory inside our Airflow directory? With the airflow-dbt package, maybe it is possible to point the dir in the default_args to the GitHub link for the dbt project?
My advice would be to leave your dbt and Airflow codebases separate.
There is indeed a better way:
Dockerise your dbt project in a simple Python-based image into which you COPY the codebase.
Push that image to Docker Hub, ECR, or any other Docker registry you are using.
Use the DockerOperator in your Airflow DAG to run that Docker image with your dbt code.
I'm assuming here that you use the Airflow LocalExecutor and that you want to execute your dbt run workload on the server where Airflow is running. If that's not the case and you have access to a Kubernetes cluster, I would suggest using the KubernetesPodOperator instead.
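For illustration, a minimal sketch of the KubernetesPodOperator variant, assuming a hypothetical dbt image (my-registry/dbt-project:latest) with the project and profiles baked in; the import path is for the cncf-kubernetes provider in Airflow 2.x and may differ between provider versions:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="dbt_on_kubernetes",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_run = KubernetesPodOperator(
        task_id="dbt_run",
        name="dbt-run",
        namespace="airflow",                     # assumed namespace
        image="my-registry/dbt-project:latest",  # hypothetical dbt image
        cmds=["dbt"],
        arguments=["run"],
        get_logs=True,
        is_delete_operator_pod=True,
    )
```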
I accepted the other answer based on the consensus via upvotes and the supporting comment; however, I'd like to post a second solution that I'm currently using:
The dbt and Airflow repos/directories sit next to each other.
In our Airflow docker-compose.yml, we've added the dbt directory as a volume so that Airflow has access to it.
In our Airflow Dockerfile, we install dbt and copy our dbt code.
We use the BashOperator to run dbt run and dbt test.
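For illustration, with the dbt project mounted at a hypothetical /opt/dbt path inside the Airflow containers, the BashOperator tasks could look roughly like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_daily",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Point dbt at the mounted project and profiles directories
    # instead of nesting the dbt repo inside the Airflow repo.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt --profiles-dir /opt/dbt/profiles",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --project-dir /opt/dbt --profiles-dir /opt/dbt/profiles",
    )
    dbt_run >> dbt_test
```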
Since you're on GCP, another completely serverless option is to run dbt with Cloud Build instead of Airflow. You can also add Workflows on top of that if you want more orchestration. If you want a detailed description, there's a post describing it: https://robertsahlin.com/serverless-dbt-on-google-cloud-platform/
What do people find is the best way to distribute code (DAGs) to the Airflow webserver / scheduler + workers? I am trying to run Celery on a large cluster of workers, so any manual updates are impractical.
I am deploying Airflow on Docker and using s3fs right now, and it is crashing on me constantly and creating weird core.### files. I am exploring other solutions (e.g. StorageMadeEasy, Dropbox, EFS, a cron job to update from git...) but would love a little feedback as I explore them.
Also, how do people typically make updates to DAGs and distribute that code? If one uses a shared volume like s3fs, do you restart the scheduler every time you update a DAG? Is editing the code in place on something like Dropbox asking for trouble? Any best practices on how to update DAGs and distribute the code would be much appreciated.
I can't really tell you what the "best" way of doing it is but I can tell you what I've done when I needed to distribute the workload onto another machine.
I simply set up an NFS share on the Airflow master for both the DAGS and PLUGINS folders and mounted this share on the worker. I had an issue once or twice where the NFS mount point broke for some reason, but after re-mounting, the jobs continued.
To distribute the DAG and plugin code to the Airflow cluster, I just deploy it to the master (I do this with a bash script on my local machine which SCPs the folders up from my local git branch) and NFS handles the replication to the worker. I always restart everything after a deploy, and I don't deploy while a job is running.
A better way to deploy would be to have git on the Airflow master server check out a branch from a git repository (test or master, depending on the Airflow environment) and then replace the DAGs and plugins with the ones in the repository. I'm experimenting with deployments like this at the moment using Ansible.