We use dbt with Airflow for transformations in our databases, which are deployed on our K8s cluster.
We push the dbt project to Azure DevOps/GitLab.
A DAG then pulls the dbt files onto the dbt VM and runs them from a directory on that VM.
To schedule and run dbt commands, it seems we always need to copy the dbt files first.
Is it possible to run dbt without moving our dbt directory inside of our Airflow directory?
The only solution I found is to dockerise the project into an image, which is not an option for us since we do not work that way. Maybe merging the projects into a single Azure/GitLab project?
Can someone provide a YAML file of the same mentioned above? I need it for a project.
I am trying to execute my tasks in parallel on each core of the workers, since that gives the tasks a performance boost.
To achieve this, I want to execute my Airflow tasks directly on a Dask cluster. My project requires Airflow to run in Docker, but I couldn't find any docker-compose.yaml file for Airflow with the DaskExecutor.
Dask generally has a scheduler and some workers in its cluster.
Apart from this, I've tried to achieve task parallelism with the airflow-provider-ray library from the Astronomer registry. I followed its documentation to set this up in Docker, but I am facing OSError: Connection timeout. In this setup I am running Airflow in Docker and the Ray cluster in my local Python environment.
Secondly, I've tried the same with a Dask cluster. Here, Airflow runs in Docker with the CeleryExecutor, and in another Docker setup there is a Dask scheduler, two workers, and a notebook. I am able to connect the two, but I keep getting the error ModuleNotFoundError: No module named 'unusual_prefix_2774d32fcb40c2ba2f509980b973518ede2ad0c3_dask_client'.
A solution to any of these problems would be appreciated.
We currently have Airflow deployed via Helm on AWS EKS and want to trigger dbt models from it.
A few questions:
1. What would be the ideal way to deploy dbt? I am considering either deploying a separate container just for dbt, or installing dbt via pip or brew in the same container that runs Airflow.
2. If the ideal way to run dbt is in its own container, how do I connect Airflow to dbt?
Please feel free to add any relevant information!
I think you should consider switching to the official chart that the Apache Airflow community released recently: https://airflow.apache.org/docs/helm-chart/stable/index.html - it is prepared and maintained by the same community that builds Airflow.
One of the best descriptions of how to integrate dbt can be found in this Astronomer blog post: https://www.astronomer.io/blog/airflow-dbt-1
To summarise it: if you do not want to use dbt Cloud, you can install dbt as a pip package and either run it via a Bash script or use the dedicated dbt operators. If you already run Airflow from an image, connecting a separate dbt image to it, while technically possible, is a bit challenging and likely not worth the hassle.
You should simply extend the Airflow image and add dbt as a pip package. You can learn how to extend or customize the Airflow image here: https://airflow.apache.org/docs/docker-stack/build.html
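For illustration, a DAG using dbt installed this way could look roughly like the sketch below; the project path /opt/airflow/dbt and the DAG/task names are just placeholders for your own setup:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Assumes dbt is installed in the extended Airflow image and the dbt project
# is available at /opt/airflow/dbt (placeholder path).
DBT_DIR = "/opt/airflow/dbt"

with DAG(
    dag_id="dbt_daily",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
    )

    # Run the models first, then the tests.
    dbt_run >> dbt_test
```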
Small follow-up. Not sure if you have seen the talk from last week's Airflow Summit but I highly recommend it: https://airflowsummit.org/sessions/2021/building-a-robust-data-pipeline-with-the-dag-stack/ - it might give you a bit more answers :)
At the moment I use the LocalExecutor with Airflow. My DAG uses Docker images via the DockerOperator that ships with Airflow, so the Docker images must be present on the machine. If I want to use a distributed executor like the CeleryExecutor or KubernetesExecutor, must the Docker images be present on all the machines that are part of the Celery or Kubernetes cluster?
Regards
Oli
That is correct. Since Airflow workers run tasks locally, the Docker images (or any other resources) need to be available locally on each worker. You could set up a local Docker registry to serve the images, which saves you the effort of maintaining them on each machine manually.
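As a rough sketch of that approach, a task can simply reference an image hosted in the shared registry, and each worker pulls it at run time; the registry address, image name and command here are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

# Workers pull this image from the shared registry at run time, so no manual
# image distribution is needed. Registry, image and command are placeholders.
with DAG(
    dag_id="docker_task_from_registry",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_job = DockerOperator(
        task_id="run_job",
        image="registry.internal:5000/my-job:latest",
        command="python /app/job.py",
        docker_url="unix://var/run/docker.sock",
        auto_remove=True,
    )
```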
We use dbt on GCP for transformations in BigQuery, and the simplest approach to scheduling our daily dbt run seems to be a BashOperator in Airflow. Currently we have two separate directories / GitHub projects, one for dbt and another for Airflow. To schedule dbt to run with Airflow, it seems like our entire dbt project would need to be nested inside our Airflow project so that we can point to it in our dbt run bash command?
Is it possible to trigger our dbt run and dbt test without moving our dbt directory inside our Airflow directory? With the airflow-dbt package, maybe the dir in the default_args could point to the GitHub link for the dbt project?
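For reference, this is roughly how the airflow-dbt operators mentioned above are configured today, with dir in the default_args pointing at a local directory; the path below is just a placeholder:

```python
from datetime import datetime

from airflow import DAG
from airflow_dbt.operators.dbt_operator import DbtRunOperator, DbtTestOperator

# 'dir' points at a local checkout of the dbt project; the path is a placeholder.
default_args = {
    "dir": "/srv/dbt-project",
    "start_date": datetime(2021, 1, 1),
}

with DAG(dag_id="dbt_daily", default_args=default_args, schedule_interval="@daily") as dag:
    dbt_run = DbtRunOperator(task_id="dbt_run")
    dbt_test = DbtTestOperator(task_id="dbt_test", retries=0)

    dbt_run >> dbt_test
```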
My advice would be to leave your dbt and airflow codebases separated.
There is indeed a better way:
1. dockerise your dbt project in a simple Python-based image where you COPY the codebase
2. push that to DockerHub or ECR or any other Docker repository that you are using
3. use the DockerOperator in your Airflow DAG to run that Docker image with your dbt code
I'm assuming that you use the Airflow LocalExecutor here and that you want to execute your dbt run workload on the server where Airflow is running. If that's not the case and you have access to a Kubernetes cluster, I would suggest using the KubernetesPodOperator instead.
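If you do go the Kubernetes route, the task might look roughly like this sketch; the image name, namespace and profiles path are placeholders for your own dockerised dbt image, and the import path may differ slightly depending on your cncf-kubernetes provider version:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

# The image is the dockerised dbt project pushed to your registry (placeholder
# name); namespace and profiles directory are placeholders as well.
with DAG(
    dag_id="dbt_on_kubernetes",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_run = KubernetesPodOperator(
        task_id="dbt_run",
        name="dbt-run",
        namespace="data",
        image="my-registry/dbt-project:latest",
        cmds=["dbt"],
        arguments=["run", "--profiles-dir", "/dbt/profiles"],
        get_logs=True,
        is_delete_operator_pod=True,
    )
```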
I accepted the other answer based on the consensus via upvotes and the supporting comment; however, I'd like to post a second solution that I'm currently using:
1. The dbt and Airflow repos / directories sit next to each other.
2. In our Airflow docker-compose.yml, we've added our dbt directory as a volume so that Airflow has access to it.
3. In our Airflow Dockerfile, we install dbt and copy our dbt code.
4. We use the BashOperator to run dbt run and dbt test.
Since you’re on GCP, another option that is completely serverless is to run dbt with Cloud Build instead of Airflow. You can also add workflows on top of that if you want more orchestration. If you want a detailed description, there’s a post describing it: https://robertsahlin.com/serverless-dbt-on-google-cloud-platform/
What do people find is the best way to distribute code (DAGs) to the Airflow webserver / scheduler + workers? I am trying to run Celery on a large cluster of workers, so any manual updates are impractical.
I am deploying Airflow on Docker and using s3fs right now, and it is constantly crashing and creating weird core.### files. I am exploring other solutions (e.g. StorageMadeEasy, Dropbox, EFS, a cron job to update from git...) but would love a little feedback as I explore options.
Also, how do people typically make updates to DAGs and distribute that code? If one uses a shared volume like s3fs, do you restart the scheduler every time you update a DAG? Is editing the code in place on something like Dropbox asking for trouble? Any best practices on how to update DAGs and distribute the code would be much appreciated.
I can't really tell you what the "best" way of doing it is but I can tell you what I've done when I needed to distribute the workload onto another machine.
I simply set up an NFS share on the Airflow master for both the DAGs and the plugins folders and mounted this share onto the worker. I had an issue once or twice where the NFS mount point would break for some reason, but after re-mounting the jobs continued.
To distribute the DAG and plugin code to the Airflow cluster, I just deploy it to the master (I do this with a bash script on my local machine that SCPs the folders up from my local git branch) and NFS handles the replication to the worker. I always restart everything after a deploy, and I don't deploy while a job is running.
A better way to deploy would be to have Git on the Airflow master server check out a branch from a Git repository (test or master, depending on the Airflow server?) and then replace the DAGs and plugins with the ones in the repository. I'm experimenting with deployments like this at the moment using Ansible.