I am running dbt (v0.19.0) on Apache Airflow (v2.0.2) using the KubernetesPodOperator and sending alerts to Slack on failures. Within dbt there are multiple models from multiple owners, with cross-dependencies between them, and they are all run together with a single run command.
For example:
KubernetesPodOperator(
    dbt_command="run",
    task_id="run_models",
    on_failure_callback=send_slack_alert,
)
I am trying to make sure that each model owner gets the alert that belongs to them in their relevant channel.
To explain the problem better, let's say there are two models within dbt: Model-A and Model-B. Model-A is owned by the A-Team and Model-B is owned by the B-Team. With this approach (because there is one dbt run command), a failure of Model-A appears in the logs shared by Model-A and Model-B. Let's also assume that both the A-Team and the B-Team have their own alert channels. However, because dbt is run with a single command, all the alerts are sent to one common channel.
Now imagine having plenty of models (Model-A, Model-B, ..., Model-Z). How can I improve the existing process so that failures in Model-A are sent to the A-Team alert channel, failures in Model-B are sent to the B-Team alert channel, and so on?
How do I dispatch errors from dbt (running within Airflow) to the relevant owner to make alerts actionable?
I'd suggest you're likely to end up with n models owned by m teams.
Your simplest change would be to tag each dbt model with its owning team, and then to run each team's models in a separate task whose failure callback alerts that team, e.g.
KubernetesPodOperator(
    dbt_command="run -m tag:team1",
    task_id="run_team1_models",
    on_failure_callback=send_slack_alert_team1,
)
You could consider passing args to your alerts rather than custom callbacks (Pass other arguments to on_failure_callback).
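For example, a single callback can take the channel as an argument and be bound per team with functools.partial. This is only a sketch: it uses a BashOperator instead of the question's KubernetesPodOperator for brevity, and the tag and channel names are assumptions.

from datetime import datetime
from functools import partial

from airflow import DAG
from airflow.operators.bash import BashOperator

def send_slack_alert(context, channel):
    # context is the task context Airflow passes to on_failure_callback;
    # replace the print with a real Slack post (webhook, Slack provider hook, etc.).
    print(f"{context['task_instance'].task_id} failed; alerting {channel}")

with DAG("dbt_per_team", start_date=datetime(2021, 1, 1), schedule_interval="@daily") as dag:
    run_team1_models = BashOperator(
        task_id="run_team1_models",
        bash_command="dbt run -m tag:team1",  # hypothetical per-team tag
        # Bind the team channel so one generic callback serves every team.
        on_failure_callback=partial(send_slack_alert, channel="#team1-alerts"),
    )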
This will work as long as you are happy to run your models in owner groups, but you could run into issues if there are dependencies between models owned by different teams.
You can go further and compose a dynamic DAG from your models, running one model per Airflow task, e.g. as described here: https://www.astronomer.io/blog/airflow-dbt-1.
You could then assign the appropriate Slack callback to each task in that same dynamic loop.
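A rough sketch of that idea, assuming a compiled target/manifest.json is available to the scheduler and that ownership is recorded as a team tag on each model; the channel mapping here is made up.

import json
from datetime import datetime
from functools import partial

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical mapping from an ownership tag to that team's Slack channel.
OWNER_CHANNELS = {"team_a": "#team-a-alerts", "team_b": "#team-b-alerts"}

def send_slack_alert(context, channel):
    print(f"{context['task_instance'].task_id} failed; alerting {channel}")

with open("target/manifest.json") as f:
    manifest = json.load(f)

with DAG("dbt_dynamic", start_date=datetime(2021, 1, 1), schedule_interval="@daily") as dag:
    tasks = {}
    for node_id, node in manifest["nodes"].items():
        if node["resource_type"] != "model":
            continue
        # Pick the first tag we recognise as a team; fall back to a shared channel.
        owner = next((t for t in node.get("tags", []) if t in OWNER_CHANNELS), None)
        tasks[node_id] = BashOperator(
            task_id=f"run_{node['name']}",
            bash_command=f"dbt run -m {node['name']}",
            on_failure_callback=partial(
                send_slack_alert, channel=OWNER_CHANNELS.get(owner, "#data-alerts")
            ),
        )
    # Wire up inter-model dependencies from the manifest so ordering is preserved.
    for node_id, node in manifest["nodes"].items():
        if node_id not in tasks:
            continue
        for upstream in node["depends_on"]["nodes"]:
            if upstream in tasks:
                tasks[upstream] >> tasks[node_id]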
Related
I'm working for a startup and am setting up our analytics tech stack from scratch. As a result of limited resources, we're focusing on using third-party tools rather than building custom pipelines.
Our stack is as follows:
ELT tool: either Fivetran or Hevo
Data warehouse: BigQuery
Transformations: dbt cloud
Reverse ETL: Hightouch (if we go with Fivetran; Hevo has built-in reverse ETL)
BI Tool: Tableau
The problem I'm having is:
With either Fivetran or Hevo there's a break in the workflow below, whereby we have to switch tools, and there's no integration within the tools themselves to trigger jobs sequentially based on the completion of the previous job.
Use case (workflow): load data into the warehouse -> transform using dbt -> reverse-ETL data back out of the warehouse into a tool like Mailchimp for marketing purposes (e.g. a list of user IDs who haven't performed certain actions and to whom we therefore want to send a prompt email; the list is produced by a dbt job that runs daily)
Here's how these workflows would look in the respective tools (E = Extract, L = Load, T = Transform)
Hevo: E+L (Hevo) -> break in workflow -> T: dbt job (unable to be triggered from within the Hevo UI) -> break in workflow -> reverse E+L: can be done within the Hevo UI but can't be triggered by a dbt job
Fivetran: E+L (Fivetran) -> T: dbt job (can be triggered within the Fivetran UI) -> break in workflow -> reverse E+L: Fivetran partners with a company called Hightouch, but there's no way of triggering the Hightouch job based on the completion of the Fivetran/dbt job.
We can of course just sync these up on a time-based schedule, but that means that if a previous job fails, subsequent jobs still run and incur unnecessary cost. It would also be good to be able to re-trigger the whole workflow from the last break point once you've debugged it.
From reading online I think something like Apache Airflow could be used for this type of use case, but that's all I've got thus far.
Thanks in advance.
You're looking for a data orchestrator. Airflow is the most popular choice, but Dagster and Prefect are newer challengers with some very nice features that are built specifically for managing data pipelines (vs. Airflow, which was built for task pipelines that don't necessarily pass data).
All three of these tools are open source, but orchestrators can get complex very quickly, and unless you're comfortable deploying Kubernetes and managing complex infrastructure you may want to consider a hosted (paid) solution. (Hosted Airflow is available under the brand name Astronomer.)
Because of this complexity, you should ask yourself if you really need an orchestrator today, or if you can wait to implement one. There are hacky/brittle ways to coordinate these tools (e.g., cron, GitHub Actions, having downstream tools poll for fresh data, etc.), and at a startup's scale (one-person data team) you may actually be able to move much faster with a hacky solution for some time. Does it really impact your users if there is a 1-hour delay between loading data and transforming it? How much value is added to the business by closing that gap vs. spending your time modeling more data or building more reporting? Realistically for a single person new to the space, you're probably looking at weeks of effort until an orchestrator is adding value; only you will know if there is an ROI on that investment.
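To make "hacky" concrete, a glue script run from cron or a GitHub Actions workflow can be as small as the sketch below. It assumes the dbt Cloud v2 API for triggering a job and polling a run; the account ID, job ID, and token are placeholders, and the downstream reverse-ETL trigger is left as a stub.

import os
import time

import requests

ACCOUNT_ID = os.environ["DBT_CLOUD_ACCOUNT_ID"]   # placeholder credentials/IDs
JOB_ID = os.environ["DBT_CLOUD_JOB_ID"]
HEADERS = {"Authorization": f"Token {os.environ['DBT_CLOUD_API_TOKEN']}"}
BASE = f"https://cloud.getdbt.com/api/v2/accounts/{ACCOUNT_ID}"

# 1. Trigger the dbt Cloud job.
run = requests.post(
    f"{BASE}/jobs/{JOB_ID}/run/",
    headers=HEADERS,
    json={"cause": "Triggered by nightly glue script"},
).json()["data"]

# 2. Poll the run until it finishes (10 = success, 20 = error, 30 = cancelled).
while True:
    run = requests.get(f"{BASE}/runs/{run['id']}/", headers=HEADERS).json()["data"]
    if run["status"] in (10, 20, 30):
        break
    time.sleep(60)

# 3. Only kick off the downstream reverse-ETL sync if the transform succeeded.
if run["status"] == 10:
    pass  # call the Hightouch/Census sync-trigger API here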
I use Dagster for orchestrating multiple dbt projects or dbt models with other data pipeline processes (e.g. database initialization, pyspark, etc.)
Here is a more detailed description and demo code:
Three dbt models (bronze, silver, gold) that are executed in sequence
Dagster orchestration code
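As a rough sketch of what that orchestration can look like, here is a plain Dagster job that shells out to the dbt CLI; the bronze/silver/gold selectors follow the list above, and a real setup might use the dagster-dbt integration instead.

import subprocess

from dagster import job, op

def run_dbt(select: str) -> None:
    # Shell out to the dbt CLI; a non-zero exit fails the op
    # and therefore everything downstream of it.
    subprocess.run(["dbt", "run", "-m", select], check=True)

@op
def run_bronze():
    run_dbt("bronze")

@op
def run_silver(start):  # the unused input only enforces ordering
    run_dbt("silver")

@op
def run_gold(start):
    run_dbt("gold")

@job
def bronze_silver_gold():
    # Chain the three dbt layers so they execute in sequence.
    run_gold(run_silver(run_bronze()))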
You could try the following workflow. You'd need a couple of additional tools, but it shouldn't require any custom engineering effort on orchestration.
E+L (Fivetran) -> T: use Shipyard to trigger a dbt Cloud job -> reverse E+L: trigger a Hightouch or Census sync on completion of the dbt Cloud job
This should run your entire pipeline in a single flow.
I am trying to see if Airflow is the right tool for some functionality I need for my project. We are trying to use it as a scheduler for running a sequence of jobs that start at a particular time (or possibly on demand).
The first "task" is to query the database for the list of job id's to sequence through.
For each job in the sequence send a REST request to start the job
Wait until job completes or fails (via REST call or DB query)
Go to next job in sequence.
I am looking for recommendations on how to break down the functionality discussed above into an Airflow DAG. So far my approach would be to:
create a Hook for the database and another for the REST server
create a custom operator that handles the start and monitoring of the "job" (steps 2 and 3)
use a sensor to handle waiting for the job to complete (a rough sketch of all three pieces is below)
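Roughly, I'm imagining something like this (a sketch only; the JobApiHook name, the REST endpoints, and the status values are made up for illustration):

import requests

from airflow.exceptions import AirflowException
from airflow.hooks.base import BaseHook
from airflow.models import BaseOperator
from airflow.sensors.base import BaseSensorOperator

class JobApiHook(BaseHook):
    """Hypothetical hook wrapping the job REST server."""

    def __init__(self, conn_id="job_api_default"):
        super().__init__()
        self.base_url = self.get_connection(conn_id).host

    def start_job(self, job_id):
        # Assumed endpoint: POST /jobs/{id}/start
        requests.post(f"{self.base_url}/jobs/{job_id}/start").raise_for_status()

    def get_status(self, job_id):
        # Assumed endpoint: GET /jobs/{id} returning {"status": "..."}
        return requests.get(f"{self.base_url}/jobs/{job_id}").json()["status"]

class StartJobOperator(BaseOperator):
    """Starts one job via the REST API (step 2)."""

    def __init__(self, job_id, **kwargs):
        super().__init__(**kwargs)
        self.job_id = job_id

    def execute(self, context):
        JobApiHook().start_job(self.job_id)

class JobCompleteSensor(BaseSensorOperator):
    """Polls until the job completes, failing the task if the job failed (step 3)."""

    def __init__(self, job_id, **kwargs):
        super().__init__(**kwargs)
        self.job_id = job_id

    def poke(self, context):
        status = JobApiHook().get_status(self.job_id)
        if status == "FAILED":
            raise AirflowException(f"Job {self.job_id} failed")
        return status == "COMPLETED"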
Thanks
When using Google Cloud Tasks, how can I prematurely run a task that is in the queue? I have a need to run the task before it's scheduled to run. For example, the user chooses to navigate away from the page and they are prompted. If they accept the prompt to move away from that page, I need to clear the queued task item programmatically.
I will be running this with a Firebase Function on the backend.
Looking at the API for Cloud Tasks found here it seems we have primitives to:
list - get a list of tasks that are queued to run
delete - delete a task that is queued to run
run - forces a task to run now
Based on these primitives, we seem to have all the "bits" necessary to achieve your ask.
For example:
To run a task now that is scheduled to run in the future:
List all the tasks
Find the task that you want to run now
Delete the task
Run a task (now) using the details of the retrieved task
We appear to have a REST API as well as client libraries for the popular languages.
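For example, with the Python client library the recipe above looks roughly like this (the project, location, queue, and task id are placeholders; a Firebase Function would use the Node.js client analogously):

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
# Placeholder project/location/queue values.
parent = client.queue_path("my-project", "us-central1", "my-queue")

# 1. List the queued tasks and find the one we care about.
for task in client.list_tasks(parent=parent):
    if task.name.endswith("/tasks/my-task-id"):  # hypothetical task id
        # 2a. Force it to run immediately...
        client.run_task(name=task.name)
        # 2b. ...or drop it entirely if the user navigated away:
        # client.delete_task(name=task.name)
        break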
I am trying to figure out if Airflow can be used to express a workflow where multiple instances of the same task need to be started based on the output of a parent task. Airflow supports multiple workers, so I naively expect that Airflow can be used to orchestrate workflows involving batch processing. So far I have failed to find any recipe/direction that would fit this model. What is the right way to leverage Airflow for a batch processing workflow like the one below? Assume there is a pool of Airflow workers.
Example of a workflow:
1. Start Task A to produce multiple files
2. For each file start an instance of Task B (might be another workflow)
3. Wait for all instances of Task B, then start Task C
As a hack to parallelize processing of input data in Airflow, I use a custom operator that splits the input into a predetermined number of partitions. The downstream operator gets replicated for each partition and if needed the result can be merged again. For local files, the operator runs the split command. In Kubernetes, this works nicely with the cluster autoscaling.
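A stripped-down sketch of that pattern with plain BashOperators follows; the file paths, the my_batch_job command, and the partition count are placeholders.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

N_PARTITIONS = 4  # predetermined number of partitions (an assumption)

with DAG("split_process_merge", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    # Task A: split the input into N line-aligned chunks (chunk_00, chunk_01, ...).
    split = BashOperator(
        task_id="split_input",
        bash_command=f"split -n l/{N_PARTITIONS} -d /data/input.csv /data/chunk_",
    )

    # Task C: merge the processed partitions back together.
    merge = BashOperator(
        task_id="merge_output",
        bash_command="cat /data/processed_* > /data/output.csv",
    )

    # Task B: one processing task per partition, fanned out between split and merge.
    for i in range(N_PARTITIONS):
        process = BashOperator(
            task_id=f"process_partition_{i:02d}",
            bash_command=f"my_batch_job /data/chunk_{i:02d} > /data/processed_{i:02d}",
        )
        split >> process >> merge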
We are currently using Apache Mesos with Marathon and Chronos to schedule long-running and batch processes.
It would be great if we could create more complex workflows, as with Oozie: say, for example, kicking off a job when a file appears in a location, or when a certain application completes or calls an API.
While it seems we could do this with Marathon/Chronos or Singularity, there seems to be no readily available interface for this.
You can use Chronos' /scheduler/dependency endpoint to specify "all jobs which must run at least once before this job will run." Do this on each of your Chronos jobs, and you can build arbitrarily complex workflow DAGs.
https://airbnb.github.io/chronos/#Adding%20a%20Dependent%20Job
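For example, a dependent job can be registered with a POST like the sketch below; the job names, command, and Chronos host are placeholders, and the full job schema is in the linked docs.

import requests

# A dependent job that runs only after "load-raw-data" has run at least once.
dependent_job = {
    "name": "transform-data",
    "command": "run_transformations.sh",  # placeholder command
    "parents": ["load-raw-data"],         # jobs this one depends on
    "owner": "data-team@example.com",
}

requests.post(
    "http://chronos.example.com:4400/scheduler/dependency",
    json=dependent_job,
).raise_for_status()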
Chronos currently only schedules jobs based on time or dependency triggers. Other events like file update, git push, or email/tweet could be modeled as a wait-for-X job that your target job would then depend on.