We are currently using Apache Mesos with Marathon and Chronos to schedule long-running and batch processes.
It would be great if we could create more complex workflows, as with Oozie: for example, kicking off a job when a file appears in a location, or when a certain application completes or calls an API.
While it seems we could do this with Marathon/Chronos or Singularity, there seems to be no readily available interface for this.
You can use Chronos' /scheduler/dependency endpoint to specify "all jobs which must run at least once before this job will run." Do this on each of your Chronos jobs, and you can build arbitrarily complex workflow DAGs.
https://airbnb.github.io/chronos/#Adding%20a%20Dependent%20Job
Chronos currently only schedules jobs based on time or dependency triggers. Other events like file update, git push, or email/tweet could be modeled as a wait-for-X job that your target job would then depend on.
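A minimal sketch of how such a chain could be described, assuming the job fields shown in the Chronos documentation (`name`, `command`, `parents`, `owner`, `schedule`). The job names, commands, file path, and owner address are hypothetical, and the code only builds the JSON bodies; each dependent job would then be POSTed to the /scheduler/dependency endpoint mentioned above.

```python
import json


def dependent_job(name, command, parents, owner="ops@example.com"):
    """Build the JSON body for a job that runs only after all `parents` have run."""
    if not parents:
        raise ValueError("a dependent job needs at least one parent")
    return {"name": name, "command": command, "parents": list(parents), "owner": owner}


# Hypothetical wait-for-X job: it is time-scheduled and simply blocks until
# the file appears, so its completion acts as the "file arrived" event.
wait_for_file = {
    "name": "wait-for-input",
    "command": "while [ ! -e /data/incoming/input.csv ]; do sleep 30; done",
    "schedule": "R/2024-01-01T00:00:00Z/P1D",
    "owner": "ops@example.com",
}

# Downstream jobs form the DAG through their `parents` lists.
process = dependent_job("process-input", "python process.py", ["wait-for-input"])
report = dependent_job("build-report", "python report.py", ["process-input"])

if __name__ == "__main__":
    for job in (wait_for_file, process, report):
        print(json.dumps(job))
```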
I'm working for a startup and am setting up our analytics tech stack from scratch. Because of limited resources, we're focussing on using 3rd party tools rather than building custom pipelines.
Our stack is as follows:
ELT tool: either Fivetran or Hevo
Data warehouse: BigQuery
Transformations: dbt cloud
Reverse ETL: Hightouch (if we go with Fivetran; Hevo has built-in reverse ETL)
BI Tool: Tableau
The problem I'm having is:
With either Fivetran or Hevo there's a break in the workflow below: we have to switch tools, and there's no integration within the tools themselves to trigger jobs sequentially based on the completion of the previous job.
Use case (workflow): load data into the warehouse -> transform using dbt -> reverse ETL data back out of the warehouse into a tool like Mailchimp for marketing purposes (e.g. a list of user IDs who haven't performed certain actions and to whom we therefore want to send a prompt email; the list is produced by a dbt job which runs daily)
Here's how these workflows would look in the respective tools (E = Extract, L = Load, T = Transform)
Hevo: E+L (Hevo) -> break in workflow -> T: dbt job (can't be triggered within the Hevo UI) -> break in workflow -> reverse E+L: can be done within the Hevo UI but can't be triggered by a dbt job
Fivetran: E+L (Fivetran) -> T: dbt job (can be triggered within the Fivetran UI) -> break in workflow -> reverse E+L: Fivetran partners with a company called Hightouch, but there's no way of triggering the Hightouch job based on the completion of the Fivetran/dbt job.
We can of course just sync these up on a time basis, but this means that if a previous job fails, subsequent jobs still run and incur unnecessary cost. It would also be good to be able to re-trigger the whole workflow from the last break point once you've debugged it.
From reading online, I think something like Apache Airflow could be used for this type of use case, but that's all I've got thus far.
Thanks in advance.
You're looking for a data orchestrator. Airflow is the most popular choice, but Dagster and Prefect are newer challengers with some very nice features that are built specifically for managing data pipelines (vs. Airflow, which was built for task pipelines that don't necessarily pass data).
All 3 of these tools are open source, but orchestrators can get complex very quickly, and unless you're comfortable deploying Kubernetes and managing complex infrastructure you may want to consider a hosted (paid) solution. (Hosted Airflow is sold under the brand name Astronomer.)
Because of this complexity, you should ask yourself if you really need an orchestrator today, or if you can wait to implement one. There are hacky/brittle ways to coordinate these tools (e.g., cron, GitHub Actions, having downstream tools poll for fresh data, etc.), and at a startup's scale (one-person data team) you may actually be able to move much faster with a hacky solution for some time. Does it really impact your users if there is a 1-hour delay between loading data and transforming it? How much value is added to the business by closing that gap vs. spending your time modeling more data or building more reporting? Realistically for a single person new to the space, you're probably looking at weeks of effort until an orchestrator is adding value; only you will know if there is an ROI on that investment.
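To make the trade-off concrete, here is a minimal sketch of the kind of hand-rolled coordination described above: steps run in order, downstream steps are skipped when one fails, and a later run can resume from the failed step. The function name `run_pipeline` and the step names are hypothetical; in practice each step's callable would call out to the relevant tool's API or CLI.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], None]]


def run_pipeline(steps: List[Step], resume_from: Optional[str] = None) -> Optional[str]:
    """Run steps in order; return the name of the failed step, or None on success."""
    names = [name for name, _ in steps]
    # Start at the named step when resuming, otherwise at the beginning.
    start = names.index(resume_from) if resume_from else 0
    for name, action in steps[start:]:
        try:
            action()
        except Exception as exc:
            # Stop here so downstream steps do not run (and cost money) on bad data.
            print(f"step {name!r} failed: {exc}; skipping downstream steps")
            return name
    return None


if __name__ == "__main__":
    demo = [
        ("load", lambda: print("loading")),
        ("transform", lambda: print("transforming")),
        ("reverse_etl", lambda: print("syncing out")),
    ]
    failed = run_pipeline(demo)
    print("failed step:", failed)
```

This is brittle in exactly the ways an orchestrator is not (no retries, no scheduling, no run history), which is the point of the comparison.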
I use Dagster for orchestrating multiple dbt projects or dbt models with other data pipeline processes (e.g. database initialization, pyspark, etc.)
Here is a more detailed description and demo code:
Three dbt models (bronze, silver, gold) that are executed in sequence
Dagster orchestration code
You could try the following workflow. It requires a couple of additional tools, but it shouldn't require any custom engineering effort on orchestration.
E+L (fivetran) -> T: Use Shipyard to trigger a dbt cloud job -> Reverse E+L: Trigger a Hightouch or Census sync on completion of a dbt cloud job
This should run your entire pipeline in a single flow.
Problem
Airflow tasks of the type DataflowTemplateOperator take a long time to complete. This means other tasks can be blocked by it (correct?).
When we run more of these tasks, that means we would need a bigger Cloud Composer cluster (in our case) to execute tasks that are essentially blocking while they shouldn't be (they should be async operations).
Options
Option 1: just launch the job and let the Airflow task succeed immediately
Option 2: write a wrapper as explained here and use a reschedule mode as explained here
Option 1 does not seem feasible as the DataflowTemplateOperator only has an option to specify the wait time between completion checks called poll_sleep (source).
For the DataflowCreateJavaJobOperator there is an option check_if_running to wait for completion of a previous job with the same name (see this code)
It seems that after launching a job, the wait_for_finish is executed (see this line), which boils down to an "incomplete" job (see this line).
For Option 2, I need Option 1.
Questions
Am I correct to assume that Dataflow tasks will block others in Cloud Composer/Airflow?
Is there a way to schedule a job without a "wait to finish" using the built-in operators? (I might have overlooked something)
Is there an easy way to write this myself? I'm thinking of just executing a bash launch script, followed by a task that looks if the job finished correctly, but in a reschedule mode.
Is there another way to avoid blocking other tasks while running dataflow jobs? Basically this is an async operation and should not take resources.
Answers
Am I correct to assume that Dataflow tasks will block others in Cloud Composer/Airflow?
A: Partly, yes. Airflow has a parallelism option in its configuration that defines the maximum number of tasks that can run at the same time across the system. A task that holds one of these slots might slow down execution, but that is bound to happen anyway as you increase the number of tasks and DAGs. You can raise the limit in the configuration depending on your needs.
Is there a way to schedule a job without a "wait to finish" using the built-in operators? (I might have overlooked something)
A: Yes. You can use PythonOperator and in the python_callable you can use the dataflow hook to launch the job in async mode (launch and don't wait).
Is there an easy way to write this myself? I'm thinking of just executing a bash launch script, followed by a task that looks if the job finished correctly, but in a reschedule mode.
A: By reschedule, I assume you mean retrying the task that checks whether the job finished correctly. If so, you can enable retries on that task and set the delay between retries.
Is there another way to avoid blocking other tasks while running dataflow jobs? Basically this is an async operation and should not take resources.
A: I think I answered this in the second question.
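A minimal sketch of the "launch, then check later" idea from the answers above, written as plain Python so it runs without Airflow. In Airflow, logic like this would typically sit in a sensor's `poke` method with `mode="reschedule"`, so the worker slot is released between checks. The `get_state` callable is hypothetical (it stands in for whatever looks up the job state, for example through the Dataflow hook); the state names are the ones Dataflow reports for jobs.

```python
TERMINAL_FAILURES = {"JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}


def poke_dataflow_job(get_state, job_id):
    """One non-blocking check: True when done, False to try again later, raise on failure."""
    state = get_state(job_id)
    if state == "JOB_STATE_DONE":
        return True
    if state in TERMINAL_FAILURES:
        # Failing the check lets downstream tasks be skipped or retried.
        raise RuntimeError(f"Dataflow job {job_id} ended in {state}")
    # Still pending or running: return immediately instead of sleeping in place.
    return False
```

The key design choice is that the function never sleeps: each call returns at once, and the scheduler decides when to call it again.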
When using Google Cloud Tasks, how can I prematurely run a task that is in the queue? I need to run the task before it is scheduled to run. For example, the user chooses to navigate away from the page and is prompted. If they accept the prompt to leave the page, I need to clear the queued task item programmatically.
I will be running this with a firebase-function on the backend.
Looking at the API for Cloud Tasks found here it seems we have primitives to:
list - get a list of tasks that are queued to run
delete - delete a task that is queued to run
run - forces a task to run now
Based on these primitives, we seem to have all the "bits" necessary to achieve your ask.
For example:
To run a task now that is scheduled to run in the future.
List all the tasks
Find the task that you want to run now
Delete the task
Run a task (now) using the details of the retrieved task
We appear to have a REST API as well as language bound libraries for the popular languages.
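A minimal sketch of the list-find-run part of the steps above, assuming a client shaped like the Python `google.cloud.tasks_v2.CloudTasksClient` (with `list_tasks(parent=...)` and `run_task(name=...)`). The helper name `run_task_now` and the `matches` predicate are hypothetical. Note that this sketch calls run directly on the task it finds and leaves out the delete step, which is an assumption about what is wanted; to cancel a queued task instead, the same loop could call `delete_task(name=...)`.

```python
def run_task_now(client, queue_path, matches):
    """Find the first queued task for which matches(task) is true and force it to run.

    Returns the task name that was run, or None if no task matched.
    """
    for task in client.list_tasks(parent=queue_path):
        if matches(task):
            client.run_task(name=task.name)
            return task.name
    return None
```

Because the client is passed in, the helper can be exercised with a stub object before pointing it at a real queue.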
A GitHub repository is one of the resources for my pipeline. I have 3 parallel jobs in my Concourse pipeline which get triggered whenever there is a check-in to the GitHub repository. The other jobs in the pipeline run in sequence. I am having the issues below:
1) I want the pipeline to complete a full execution before a new run starts. I am using a pool resource to make sure the execution completes before a new run is triggered. Is there a better way to do this?
2) If there are multiple check-ins while the pipeline is in progress, is there a way to execute the pipeline only on the last check-in? For example, the 1st instance of the pipeline is running, and before it completes there are 6 check-ins to the repository. Can the pipeline pick up only the 6th version of the repository and purge the runs for the previous five check-ins?
Using the lock pool resource is almost the perfect option but, as you have rightly noticed, there will be a trigger for each git commit and jobs will start to queue.
It sounds like you want this pipeline to be serialised. Have you considered serial_groups? http://concourse-ci.org/single-page.html#job-serial-groups
My level of experience with the product is basic at best, but I'm expected to be a developer; I have a basic understanding of many things.
Right now my job is to investigate canceling lines in Purchase Orders. We have a workflow set up to handle those, and I'm trying to duplicate the scenario in my dev instance. Whenever a user cancels a line, the workflow is supposed to engage, and I've found that a batch job is what triggers that workflow to work (maybe that's the case with all workflows, but I don't know that for sure).
I've set up my personal Dev AX Instance under System Configuration => System => Server Configuration to use my personal Dev AOS server that my client is also running under, but when I go to System Configuration => Batch Jobs => Batch Jobs, then find the Batch Job I've been looking for and set the status to Waiting, the Batch Job never runs.
On our Test instance, the job is configured exactly the same way, except that it uses the AOS Server allotted for that instance.
I ran a SQL script to change the batch job to use my personal Dev AOS Server, then restarted the Dynamics AX Servers.
There must be something I'm doing wrong for my personal dev instance. I've been reading some things from here about what may be going on and following down the list, but I'm pretty sure the problem is even stupider => https://www.daxrunbase.com/2017/07/02/troubleshooting-batch-jobs-in-ax/
First of all, do you have all 3 workflow jobs set up?
Workflow message processing
Workflow due date processing
Workflow line-item notifications
They can be set up from System administration > Setup > Workflow > Workflow infrastructure configuration.
Secondly, it is OK for the periodic batch jobs to have status Waiting. They will be in status Executing for a short time and then they will be Waiting for the next run. If the Scheduled start date/time value in this batch job is in the past, that could be a problem. Otherwise everything is OK.
Lastly, if you have already ticked the Is batch server check-box in System administration > Setup > System > Server configuration, please also make sure to move the workflow batch group in the Batch server groups section in the same form from Remaining groups to Selected groups.
The batch jobs should start at Scheduled start date/time - or a bit later, you'd need to wait a minute and refresh the grid.