Creating sequential workflows using an ETL tool (Fivetran/Hevo), dbt and a reverse ETL tool (Hightouch) - airflow

I'm working for a startup and am setting up our analytics tech stack from scratch. As a result of limited resources we're focussing on using 3rd party tools rather than building custom pipelines.
Our stack is as follows:
ELT tool: either Fivetran or Hevo
Data warehouse: BigQuery
Transformations: dbt cloud
Reverse ETL: Hightouch (if we go with Fivetran - hevo has built in reverse ETL)
BI Tool: Tableau
The problem I'm having is:
With either Fivetran or Hevo there's a break in the below workflow whereby we have to switch tools and there's no integration within the tools themselves to trigger jobs sequentially based on the completion of the previous job.
Use case (workflow): load data into the warehouse -> transform using dbt -> reverse ETL data back out of the warehouse into a tool like Mailchimp for marketing purposes (e.g. a list of user IDs for users who haven't performed certain actions and whom we therefore want to send a prompt email; the list is produced by a dbt job that runs daily).
Here's how these workflows would look in the respective tools (E = Extract, L = Load, T = Transform)
Hevo: E+L (Hevo) -> break in workflow -> T: dbt job (unable to be triggered within the Hevo UI) -> break in workflow -> reverse E+L: can be done within the Hevo UI but can't be triggered by a dbt job
Fivetran: E+L (Fivetran) -> T: dbt job (can be triggered within the Fivetran UI) -> break in workflow -> reverse E+L: Fivetran partners with a company called Hightouch, but there's no way of triggering the Hightouch job based on the completion of the Fivetran/dbt job.
We can of course just schedule these on a time basis, but then if a previous job fails, subsequent jobs still run and incur unnecessary cost. It would also be good to be able to re-trigger the whole workflow from the last break point once you've debugged it.
From reading online I think something like Apache Airflow could be used for this type of use case, but that's all I've got thus far.
Thanks in advance.

You're looking for a data orchestrator. Airflow is the most popular choice, but Dagster and Prefect are newer challengers with some very nice features that are built specifically for managing data pipelines (vs. Airflow, which was built for task pipelines that don't necessarily pass data).
All 3 of these tools are open source, but orchestrators can get complex very quickly, and unless you're comfortable deploying Kubernetes and managing complex infrastructure you may want to consider a hosted (paid) solution. (Hosted Airflow is under the brand name Astronomer).
Because of this complexity, you should ask yourself if you really need an orchestrator today, or if you can wait to implement one. There are hacky/brittle ways to coordinate these tools (e.g., cron, GitHub Actions, having downstream tools poll for fresh data, etc.), and at a startup's scale (one-person data team) you may actually be able to move much faster with a hacky solution for some time. Does it really impact your users if there is a 1-hour delay between loading data and transforming it? How much value is added to the business by closing that gap vs. spending your time modeling more data or building more reporting? Realistically for a single person new to the space, you're probably looking at weeks of effort until an orchestrator is adding value; only you will know if there is an ROI on that investment.
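To make concrete what an orchestrator buys you over time-based scheduling, here is a minimal sketch in plain Python, not tied to Airflow, Dagster or Prefect. The `run_pipeline` function and the step names are hypothetical; each step would wrap whatever call starts the real load, dbt job or reverse ETL sync. It shows the two behaviours asked about in the question: downstream steps are skipped after a failure, and the run can be resumed from the failed step.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Callable[[], None]


def run_pipeline(
    steps: Dict[str, Step], start_from: Optional[str] = None
) -> Tuple[List[str], Optional[str]]:
    """Run steps in insertion order and stop at the first failure.

    Returns (names of completed steps, name of the failed step or None).
    Pass start_from to resume at a given step after fixing a failure.
    """
    names = list(steps)
    if start_from is not None:
        # Skip everything before the step we are resuming from.
        names = names[names.index(start_from):]

    completed: List[str] = []
    for name in names:
        try:
            steps[name]()
        except Exception as exc:
            # Fail fast: nothing downstream runs, so no wasted cost.
            print(f"{name} failed ({exc}); skipping downstream steps")
            return completed, name
        completed.append(name)
    return completed, None
```

If the second of three steps fails, the call returns the first step as completed and names the second as failed; calling it again with `start_from` set to that name reruns only the remainder. A real orchestrator adds scheduling, retries, logging and a UI on top of this idea.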

I use Dagster for orchestrating multiple dbt projects or dbt models with other data pipeline processes (e.g. database initialization, pyspark, etc.)
Here is a more detailed description and demo code:
Three dbt models (bronze, silver, gold) that are executed in sequence
Dagster orchestration code

You could try the following workflow. It needs a couple of additional tools, but it shouldn't require any custom engineering effort on orchestration.
E+L (fivetran) -> T: Use Shipyard to trigger a dbt cloud job -> Reverse E+L: Trigger a Hightouch or Census sync on completion of a dbt cloud job
This should run your entire pipeline in a single flow.
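For reference, the two triggers in that flow boil down to HTTP requests roughly like the ones below. This is only a sketch: the endpoint paths and auth header formats are assumptions based on the public dbt Cloud v2 and Hightouch REST APIs and should be checked against the current docs, and the account, job and sync IDs are hypothetical. The functions only build the requests; nothing is sent.

```python
import json
import urllib.request

# Assumptions: default hosts; your account may use a different base URL.
DBT_CLOUD_BASE = "https://cloud.getdbt.com/api/v2"
HIGHTOUCH_BASE = "https://api.hightouch.com/api/v1"


def dbt_job_run_request(account_id: int, job_id: int, token: str) -> urllib.request.Request:
    """Build (but do not send) a request that triggers a dbt Cloud job run."""
    url = f"{DBT_CLOUD_BASE}/accounts/{account_id}/jobs/{job_id}/run/"
    body = json.dumps({"cause": "Triggered after load completed"}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Authorization": f"Token {token}", "Content-Type": "application/json"},
    )


def hightouch_sync_request(sync_id: int, token: str) -> urllib.request.Request:
    """Build (but do not send) a request that triggers a Hightouch sync."""
    url = f"{HIGHTOUCH_BASE}/syncs/{sync_id}/trigger"
    return urllib.request.Request(
        url,
        data=b"{}",
        method="POST",
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    )
```

Passing either request to `urllib.request.urlopen` would send it; a coordinating tool would additionally poll the dbt run until it finishes before triggering the sync.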

Related

Correct Approach For Airflow DAG Project

I am trying to see if Airflow is the right tool for some functionality I need for my project. We are trying to use it as a scheduler for running a sequence of jobs
that start at a particular time (or possibly on demand).
The first "task" is to query the database for the list of job id's to sequence through.
For each job in the sequence send a REST request to start the job
Wait until job completes or fails (via REST call or DB query)
Go to next job in sequence.
I am looking for recommendations on how to break down the functionality discussed above into an Airflow DAG. So far my approach would be to:
create a Hook for the database and another for the REST server.
create a custom operator that handles the start and monitoring of the "job" (steps 2 and 3)
use a sensor to poll while waiting for the job to complete
Thanks
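The start-then-wait loop described above can be sketched independently of Airflow as follows. Everything here is a hypothetical stand-in: `start_job` would wrap the REST call, `get_status` the REST call or DB query, and the status strings `SUCCEEDED` and `FAILED` as well as the `stop_on_failure` flag are assumptions, since the question doesn't say what should happen after a failed job.

```python
import time
from typing import Callable, Iterable, List, Tuple

TERMINAL_STATES = ("SUCCEEDED", "FAILED")  # assumed status names


def run_jobs_in_sequence(
    job_ids: Iterable[str],
    start_job: Callable[[str], None],
    get_status: Callable[[str], str],
    poll_seconds: float = 5.0,
    stop_on_failure: bool = True,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[str, str]]:
    """Start each job, poll until it reaches a terminal state, then move on."""
    results: List[Tuple[str, str]] = []
    for job_id in job_ids:
        start_job(job_id)  # step 2: kick the job off
        status = get_status(job_id)
        while status not in TERMINAL_STATES:  # step 3: wait for completion
            sleep(poll_seconds)
            status = get_status(job_id)
        results.append((job_id, status))
        if status == "FAILED" and stop_on_failure:
            break  # do not start the remaining jobs
    return results
```

In an Airflow DAG the same split would map onto an operator for the start call and a sensor for the polling loop.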

Can Airflow persist access to metadata of short-lived dynamically generated tasks?

I have a DAG that, whenever there are files detected by FileSensor, generates tasks for each file to (1) move the file to a staging area, (2) trigger a separate DAG to process the file.
FileSensor -> Move(File1) -> TriggerDAG(File1) -> Done
|-> Move(File2) -> TriggerDAG(File2) -^
In the DAG definition file, the middle tasks are generated by iterating over the directory that FileSensor is watching, a bit like this:
# def generate_move_task(f: Path) -> BashOperator
# def generate_dag_trigger(f: Path) -> TriggerDagRunOperator
with dag:
    for filepath in Path(WATCH_DIR).glob("*"):
        sensor_task >> generate_move_task(filepath) >> generate_dag_trigger(filepath)
The Move task moves the files that lead to the task generation, so the next DAG run won't have FileSensor re-trigger either Move or TriggerDAG tasks for this file. In fact, the scheduler won't generate the tasks for this file at all, since after all files go through Move, the input directory has no contents to iterate over anymore.
This gives rise to two problems:
After execution, the task logs and renderings are no longer available. The Graph View only shows the DAG as it is now (empty), not as it was at runtime. (The Tree View shows the tasks' runs and states, but clicking on a "square" and picking any details leads to an Airflow error.)
The downstream tasks can be memory-holed due to a race condition. The first task is to move the originating file to a staging area. If that takes longer than the scheduler polling period, the scheduler no longer collects the downstream TriggerDAG(File1) task, which means that task is not scheduled to be executed even though the upstream task ran successfully. It's as if the downstream task never existed.
The race condition issue is solved by changing the task sequence to Copy(File1) -> TriggerDAG(File1) -> Remove(File1), but the broader problem remains: is there a way to persist dynamically generated tasks, or at least a way to consistently access them through the Airflow interface?
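The reordering works because the source file stays in the watched directory until after the trigger has run, so the tasks generated from it are still part of the DAG definition when the scheduler looks. A minimal sketch of that ordering in plain Python (not Airflow operators; `copy_trigger_remove` and the `trigger` callback are hypothetical names):

```python
import shutil
from pathlib import Path
from typing import Callable


def copy_trigger_remove(
    src: Path, staging_dir: Path, trigger: Callable[[Path], None]
) -> Path:
    """Copy the file to staging, trigger processing, and only then remove the original."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    staged = staging_dir / src.name
    shutil.copy2(src, staged)  # Copy(File1): original still present
    trigger(staged)            # TriggerDAG(File1): original still present
    src.unlink()               # Remove(File1): the watched directory empties last
    return staged
```

If `trigger` raises, the original file is left in place, so the next run can pick it up again.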
While it isn't clear, I'm assuming that the downstream DAG(s) that you trigger via your orchestrator DAG are NOT dynamically generated for each file (like your Move & TriggerDAG tasks); in other words, unlike your Move tasks that keep appearing and disappearing (based on files), the downstream DAGs are static and always stay there.
You've already built a relatively complex workflow that does advanced stuff like generating tasks dynamically and triggering external DAGs. I think with slight modification to your DAGs structure, you can get rid of your troubles (which also are quite advanced IMO)
Relocate the Move task(s) from your upstream orchestrator DAG to the downstream (per-file) process DAG(s)
Make the upstream orchestrator DAG do two things
Sense / wait for files to appear
For each file, trigger the downstream processing DAG (which in effect you are already doing).
For the orchestrator DAG, you can do it either way
have a single task that does file sensing + triggering downstream DAGs for each file
have two tasks (I'd prefer this)
first task senses files and when they appear, publishes their list in an XCOM
second task reads that XCOM and, for each file, triggers its corresponding DAG
but whatever way you choose, you'll have to replicate the relevant bits of code from
FileSensor (to be able to sense file and then publish their names in XCOM) and
TriggerDagRunOperator (so as to be able to trigger multiple DAGs with single task)
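The two-task variant can be sketched in plain Python as below. This is not Airflow code: a dict stands in for XCom, `trigger_dag` is a hypothetical callback standing in for the trigger logic, and the key name `detected_files` is made up. The point is that both tasks are fixed, and only the data passed between them varies per run.

```python
from pathlib import Path
from typing import Callable, Dict, List


def sense_files(watch_dir: Path, xcom: Dict[str, List[str]]) -> bool:
    """Task 1: check for files and, if any exist, publish their paths."""
    files = sorted(str(p) for p in watch_dir.glob("*") if p.is_file())
    if files:
        xcom["detected_files"] = files  # stand-in for an XCom push
    return bool(files)  # a sensor would keep poking until this is True


def trigger_for_each(
    xcom: Dict[str, List[str]], trigger_dag: Callable[[str], None]
) -> int:
    """Task 2: read the published list and trigger processing once per file."""
    files = xcom.get("detected_files", [])  # stand-in for an XCom pull
    for path in files:
        trigger_dag(path)
    return len(files)
```

Because the file list lives in per-run metadata rather than in the DAG structure, it remains visible on the task instance after the files are gone.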
here's a diagram depicting the two tasks approach
The short answer to the title question is, as of Airflow 1.10.11, no, this doesn't seem possible as stated. To render DAG/task details, the Airflow webserver always consults the DAGs and tasks as they are currently defined and collected to DagBag. If the definition changes or disappears, tough luck. The dashboard just shows the log entries in the table; it doesn't probe the logs for prior logic (nor does it seem to store much of it other than the headline).
y2k-shubham provides an excellent solution to the unspoken question of "how can I write DAGs/tasks so that the transient metadata are accessible". The subtext of his solution: convert the transient metadata into something Airflow stores per task run, but keep the tasks themselves fixed. XCom is the solution he uses here, and it does show up in the task instance details / logs.
Will Airflow implement persistent interface access to fleeting one-time tasks whose definition disappears from the DagBag? It's possible but unlikely, for two reasons:
It would require the webserver to probe the historical logs instead of just the current DagBag when rendering the dashboard, which would require extra infrastructure to keep the web interface snappy, and could make the display very confusing.
As y2k-shubham notes in a comment to another question of mine, fleeting and changing tasks/DAGs are an Airflow anti-pattern. I'd imagine that would make this a tough sell as the next feature.

For Hangfire, is there any sample code for non-simple tasks; and how should recurring tasks be handled when re-publishing?

I am considering using Hangfire https://www.hangfire.io to replace an older home-grown scheduling ASP.NET web site/app.
I have created a simple test project using Hangfire. I am able to start the project with Hangfire, submit (in code) a couple of very simple single and recurring tasks, view the dashboard, etc.
I'm looking for more suggestions for creating a little more complex code (and classes) for tasks to be scheduled, and I have a question about what happens with permanently scheduled tasks when re-publishing a Hangfire site to production.
I have read some of the documentation on the Hangfire site, reviewed the 2 tutorials, scanned the Hangfire forums, and searched StackOverflow and the web a bit. A lot of what I have seen shows you how to schedule something very simple (like Console.WriteLine), but nothing more complex. The "Highlighter" tutorial was useful, but that essentially shows how to schedule a single instance of a (slightly longer-running) task in response to an interactive user input. I understand how useful that can be, but I'm more interested in recurring tasks that are submitted and then run every day (or every hour, etc.) and don't need to be submitted again. These tasks could be for something like sending a batch of emails to users each night, batch processing some data, importing a nightly feed of external data, periodically calling a web service to perform some processing, etc.
Is there any sample code available that shows some examples like this, or any guidance on the most appropriate approach for structuring such code in an interface and class(es)?
Secondly, in my case, most of the tasks would be "permanent" (always existing as a recurring task). If I set up code to add these as recurring tasks shortly after starting the Hangfire application in production, how should I handle it when publishing updates to production (when this same initialization would run again)? Should I just call "AddOrUpdate" with the same ID and Hangfire will take care of it? Should I first call "RemoveIfExists" and then add the recurring task again? Is there some other approach that should be used?
One example would be a log janitor, which would run every weekday at 5:00 PM to remove logs that are older than 5 days.
public void Schedule()
{
    RecurringJob.AddOrUpdate<LogJanitor>(
        "Janitor - Old Logs",
        j => j.OnSchedule(null),
        "0 17 * * 1,2,3,4,5",
        TimeZoneInfo.FindSystemTimeZoneById("CST"));
}
Then we would handle it this way
public void OnSchedule(
    PerformContext context)
{
    DateTime timeStamp = DateTime.Today.AddDays(-5);
    _logRepo.FindAndDelete(from: DateTime.MinValue, to: timeStamp);
}
These two methods are declared inside LogJanitor class. When our application starts, we get an instance of this class then call Schedule().

Jenkins - How to stall a job until a notification is received?

Is there any way that a Jenkins job can be paused until a notification is received, ideally with a payload as well?
I have a "test" job which runs a whole bunch of remote tests, and I'd like it to wait until the tests are done, at which point I send an HTTP notification via curl with a payload including a test success code.
Is this possible with any default Jenkins plugins?
If Jenkins 2.x is an option for you, I'd consider taking a look at writing a pipeline job.
See https://jenkins.io/doc/book/pipeline/
Perhaps you could create a pipeline with multiple stages, where:
The first batch of work (your test job) is launched by the first pipeline stage.
That stage is configured (via Groovy code) to wait until your tests are complete before continuing. This is of course easy if the command to run your tests blocks, but if your tests launch and then detach without providing an easy way to determine when they exit, you can probably add extra Groovy code to your stage to make it poll the machine where the tests are running, to discover whether the work is complete.
Subsequent stages can be run once the first stage exits.
As for passing a payload from one stage to another, that's possible too - for exit codes and strings, you can use Groovy variables, and for files, I believe you can have a stage archive a file as an artifact; subsequent stages can then access the artifact.
Or, as Hani mentioned in a comment, you could create two Jenkins jobs, and have your tests (launched by the first job) use the Jenkins API to launch the second job when they complete successfully.
As you suggested, curl can be used to trigger jobs via the API, or you can use a Jenkins API wrapper package for your preferred language (I've had success using the Python jenkinsapi package for this sort of work: http://pythonhosted.org/jenkinsapi/)
If you need to pass parameters from your API client code to the second Jenkins job, that's possible by adding parameters to the second job using the Parameterized Build features built into Jenkins: https://wiki.jenkins-ci.org/display/JENKINS/Parameterized+Build
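As a rough illustration of the two-job approach, the helper below builds the URL that a test run could POST to (with curl or any HTTP client) when it finishes. It assumes the standard `buildWithParameters` remote-access endpoint of a parameterized job; the host, the job name `post-test` and the parameter name `TEST_RESULT` are hypothetical, and authentication (user plus API token) is left out.

```python
from urllib.parse import quote, urlencode


def build_with_parameters_url(base_url: str, job_name: str, params: dict) -> str:
    """Return the URL that triggers a parameterized Jenkins job via the remote API."""
    query = urlencode(params)  # e.g. TEST_RESULT=0
    return f"{base_url.rstrip('/')}/job/{quote(job_name)}/buildWithParameters?{query}"


# Hypothetical example: report a success code of 0 to the follow-up job.
url = build_with_parameters_url(
    "https://jenkins.example.com/", "post-test", {"TEST_RESULT": "0"}
)
print(url)
```

The payload travels as build parameters, so the second job can read the success code like any other parameter.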

Apache Mesos Workflows - Event Driven Scheduler

We are currently using Apache Mesos with Marathon and Chronos to schedule long running and batch processes.
It would be great if we could create more complex workflows like with Oozie. Say for example kicking off a job when a file appears in a location or when a certain application completes or calls an API.
While it seems we could do this with Marathon/Chronos or Singularity, there seems no readily available interface for this.
You can use Chronos' /scheduler/dependency endpoint to specify "all jobs which must run at least once before this job will run." Do this on each of your Chronos jobs, and you can build arbitrarily complex workflow DAGs.
https://airbnb.github.io/chronos/#Adding%20a%20Dependent%20Job
Chronos currently only schedules jobs based on time or dependency triggers. Other events like file update, git push, or email/tweet could be modeled as a wait-for-X job that your target job would then depend on.
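A wait-for-X job plus a dependent job could be described with payloads along these lines. This is a sketch under assumptions: the field names (`name`, `command`, `schedule`, `parents`, `owner`) follow the Chronos job JSON as I understand it and should be verified against the docs linked above, and the job names, file path and owner address are hypothetical. The first payload would go to the scheduled-job endpoint and the second to /scheduler/dependency.

```python
import json
from typing import Dict, List


def scheduled_job(name: str, command: str, schedule: str,
                  owner: str = "data-team@example.com") -> Dict[str, object]:
    """Payload for a time-triggered job (ISO 8601 repeating interval)."""
    return {"name": name, "command": command, "schedule": schedule, "owner": owner}


def dependent_job(name: str, command: str, parents: List[str],
                  owner: str = "data-team@example.com") -> Dict[str, object]:
    """Payload for a job that runs only after all of its parents have run."""
    if not parents:
        raise ValueError("a dependent job needs at least one parent")
    return {"name": name, "command": command, "parents": list(parents), "owner": owner}


# The "event" is modelled as a job that blocks until the file shows up.
wait_for_file = scheduled_job(
    "wait-for-file",
    "while [ ! -f /data/incoming/ready ]; do sleep 30; done",
    "R/2024-01-01T00:00:00Z/P1D",
)
process_file = dependent_job("process-file", "echo processing", [wait_for_file["name"]])
print(json.dumps(process_file, indent=2))
```

Chaining more `dependent_job` payloads off `process-file` extends the workflow into a larger DAG.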
