I have been using Netflix Conductor for workflow orchestration before, and Apache Airflow is new to me. In Conductor, workflow execution works in these steps:
A workflow is started via a REST API call
Each separate worker (service) polls for its own tasks by repeatedly calling Conductor's REST API
After completing or failing a task, each worker calls the REST API to update the task and workflow status
Each of these task workers is a separate service, and they are implemented in different programming languages (a rough sketch of such a polling loop is shown below).
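For illustration, this is roughly the kind of polling loop each worker runs (sketched in Python for brevity; the base URL, task type, and endpoint paths are placeholders based on Conductor's documented REST API, so treat this as a sketch rather than our actual worker code):

```python
import time
import requests

CONDUCTOR_API = "http://conductor-server:8080/api"  # placeholder server URL
TASK_TYPE = "encode_video"                          # hypothetical task type
WORKER_ID = "encode-worker-1"

while True:
    # Poll Conductor for a pending task of this worker's type
    resp = requests.get(
        f"{CONDUCTOR_API}/tasks/poll/{TASK_TYPE}",
        params={"workerid": WORKER_ID},
        timeout=30,
    )
    if resp.status_code != 200 or not resp.text:
        time.sleep(5)  # nothing scheduled yet, back off and poll again
        continue

    task = resp.json()
    try:
        output = {"result": "ok"}  # the actual work happens here
        status = "COMPLETED"
    except Exception as exc:
        output = {"error": str(exc)}
        status = "FAILED"

    # Report the result back so Conductor can advance the workflow
    requests.post(
        f"{CONDUCTOR_API}/tasks",
        json={
            "workflowInstanceId": task["workflowInstanceId"],
            "taskId": task["taskId"],
            "status": status,
            "outputData": output,
        },
        timeout=30,
    )
```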
I can't seem to find any examples of how to apply these concepts in Apache Airflow, and relying on BashOperator everywhere seems like a very poor solution to me.
Are there any examples that show how to use workers, some of which are not written in Python, to listen for and execute their tasks as defined in DAGs?
I have a web API that runs 24/7 on a LAN (for data security reasons it is not connected to the internet). I have to call one method inside it each morning around 07:00. I've read that Timers are not reliable, and I don't own the server, so I don't have access to the Task Scheduler. How can I achieve this?
Do you think I should tell them that I have to install the software? Would they let me use the Task Scheduler? Is there any way to do this without the Task Scheduler?
There are a couple of options you should look into:
Native .NET Core solution through IHostedService
Hangfire
Personally, I like that you can configure a dashboard and see what happened during the execution of your scheduled tasks.
As suggested by @Lei Yang in the comments, Quartz.NET
Hangfire is still adding in-memory support, so you're currently required to have SQL Server to store all the information about the job (triggers, states, etc.).
This can have a huge impact on your decision. Quartz.NET, on the other hand, does support an in-memory store.
My team is working on orchestrating our data pipeline with Airflow. Since our pipeline steps are complex, we were thinking about having different DAGs / workflows, each defined in its own file. Each of the workflows can trigger more than one downstream workflow, and each workflow will output data to an S3 bucket at the end of its execution. Broadly, the following options for orchestration between DAGs seem to be available:
Using TriggerDagRunOperator at the end of each workflow to decide which downstream workflows to trigger (see the sketch after this list)
Using ExternalTaskSensor at the beginning of each workflow to run it once the last task of the parent workflow has completed successfully
Using SubDagOperator and orchestrating all workflows inside a primary DAG, in a similar way to how tasks would be orchestrated; child workflows would be created using a factory. It looks like this operator is being deprecated because it had performance and functional issues
Using TaskGroups: this looks like the successor of the SubDagOperator. It groups tasks in the UI, but the tasks still belong to the same DAG, so we would not have multiple DAGs. Since DAGs are coded in Python, though, we could still put each TaskGroup in an independent file, as mentioned here: https://stackoverflow.com/a/68394010
Using S3KeySensor to trigger a workflow once a file has been added to an S3 bucket (since all our workflows end up storing a dataset in S3)
Using S3 Events to trigger a Lambda whenever a file is added to S3; that Lambda would then use TriggerDagRunOperator to start the downstream workflows
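To make the first two options concrete, this is roughly what we have in mind (DAG ids, task ids, and schedules below are made up for illustration; in practice we would pick one of the two mechanisms, they are only shown side by side here):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.sensors.external_task import ExternalTaskSensor

# Option 1: the parent workflow explicitly triggers its downstream workflow(s)
with DAG("parent_workflow", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as parent_dag:
    produce_dataset = EmptyOperator(task_id="produce_dataset")  # writes to S3
    trigger_child = TriggerDagRunOperator(
        task_id="trigger_child_workflow",
        trigger_dag_id="child_workflow",  # hypothetical downstream DAG id
    )
    produce_dataset >> trigger_child

# Option 2: the child workflow instead waits for the parent's last task
with DAG("child_workflow", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as child_dag:
    wait_for_parent = ExternalTaskSensor(
        task_id="wait_for_parent",
        external_dag_id="parent_workflow",
        external_task_id="produce_dataset",
        mode="reschedule",   # free the worker slot between checks
        poke_interval=300,
        timeout=6 * 60 * 60,
    )
    consume_dataset = EmptyOperator(task_id="consume_dataset")
    wait_for_parent >> consume_dataset
```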
From my research, it looks like:
ExternalTaskSensor and the S3 sensors add overhead to the workflow, since the sensor tasks have to be rescheduled constantly, creating a large task queue that introduces a lot of delay
Using S3 Events to trigger a Lambda that in turn starts the child workflows would not have this issue; however, we would need to keep configuration in these Lambdas about which child workflows to trigger, and it adds extra components that complicate our architecture
SubDagOperator seemed the way to go, since we could have several DAGs and manage all the dependencies between them in a primary DAG in a single Python file; however, it is being deprecated because of its functional and performance issues
TaskGroups are the SubDagOperator's successor; however, since all tasks still belong to the same DAG, I have concerns about how to perform operations such as backfilling an individual TaskGroup, rerunning individual TaskGroups, or maybe scheduling them with different intervals in the future (see the file-layout sketch right after this list)
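For reference, the file layout we are considering for the TaskGroup option would look roughly like this (module names, group ids, and task ids are made up; the two files are shown in one snippet only for brevity):

```python
# ingestion_groups.py -- hypothetical module holding one TaskGroup per "workflow"
from airflow.operators.empty import EmptyOperator
from airflow.utils.task_group import TaskGroup


def build_ingestion_group(dag):
    """Return a TaskGroup wrapping the ingestion steps of the pipeline."""
    with TaskGroup(group_id="ingestion", dag=dag) as group:
        extract = EmptyOperator(task_id="extract", dag=dag)
        load = EmptyOperator(task_id="load_to_s3", dag=dag)
        extract >> load
    return group


# pipeline_dag.py -- the single DAG file that wires the groups together
from datetime import datetime

from airflow import DAG
# from ingestion_groups import build_ingestion_group  # imported from its own file

with DAG("data_pipeline", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    ingestion = build_ingestion_group(dag)
    # transformation = build_transformation_group(dag)  # another group, another file
    # ingestion >> transformation
```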
If anyone has experience with any of these approaches and could share some insights it would be greatly appreciated.
Thanks!
We are planning to deploy the Netflix Conductor WAR in PCF and then create a Conductor client in Java that will communicate with the server and load the JSON (task and workflow definitions) on startup.
Can we create the JSONs and load them at client startup? I have tried googling but have been unable to find a sample Conductor client that can create workflows, etc.
Any help pointing me in this direction would be appreciated.
Thanks in advance
Clients are like listeners with the capability of doing the work (workers) and reporting the status of the task back to Conductor. This listening only happens after a task of the corresponding type has been scheduled by Conductor. For all of this to happen, the task definitions and the workflow definition (the metadata) first have to be fed into Conductor, followed by the workflow execution, through the REST endpoints.
Why do you want to load the workflow and task definitions at startup? It looks like you are using in-memory mode for your POC. If you configure an external Dynomite and Elasticsearch instead, the workflow definitions and tasks persist, so you are not required to load them every time you restart your server.
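If you do want to push the definitions from your client at startup, it comes down to a few calls against the metadata and workflow REST endpoints; the Java client wraps the same endpoints (MetadataClient, WorkflowClient). Here is a rough sketch of the raw calls, shown with Python requests purely for illustration (the URL, names, and payload fields below are placeholders, so check them against your Conductor version):

```python
import requests

CONDUCTOR_API = "http://localhost:8080/api"  # placeholder server URL

# 1. Register the task definitions (this endpoint accepts a list)
task_defs = [
    {"name": "fetch_data", "retryCount": 3, "timeoutSeconds": 300},
    {"name": "store_data", "retryCount": 3, "timeoutSeconds": 300},
]
requests.post(f"{CONDUCTOR_API}/metadata/taskdefs", json=task_defs).raise_for_status()

# 2. Register the workflow definition that wires those tasks together
workflow_def = {
    "name": "sample_workflow",
    "version": 1,
    "schemaVersion": 2,
    "tasks": [
        {"name": "fetch_data", "taskReferenceName": "fetch", "type": "SIMPLE"},
        {"name": "store_data", "taskReferenceName": "store", "type": "SIMPLE"},
    ],
}
requests.post(f"{CONDUCTOR_API}/metadata/workflow", json=workflow_def).raise_for_status()

# 3. Start an execution of that workflow; the response body is the workflow id
run = requests.post(f"{CONDUCTOR_API}/workflow/sample_workflow", json={"someInput": "value"})
run.raise_for_status()
print("started workflow:", run.text)
```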
Hope this helps.
In my project I was required to write some background jobs for scheduled processing. I did this using the Quartz scheduler with Spring, but quite often I needed to execute the tasks ad hoc, outside the schedule. So later I pulled the tasks out of Quartz and created web endpoints for them (exposed internally).
To run the tasks on their regular schedule, I created Unix cron jobs that hit the web endpoints using curl.
My question is: why wouldn't this approach always work? Even if you don't want to expose web endpoints, you can always execute standalone tasks using Unix cron. Is there any particular advantage to using the Quartz scheduler over Unix cron jobs?
You may still opt for using Quartz if:
An event needs to be scheduled as part of activity that happens within the Java application itself, for example when a user subscribes to a newsletter.
You have a listener object that needs to be notified when the job completes.
You are using JTA transactions in your scheduled job
You want to keep the history of job executions or load job and trigger definitions from a file or a database
You are running on an application server and require load balancing and failover
You are not running in a UNIX/Linux environment (i.e. you want platform independence)
I have a long-running workflow that takes 3-5 minutes; I want to give the end user the flexibility to cancel/stop/abort the workflow while it is executing. How can I do this?
It depends on your workflow execution environment. If you are using WorkflowApplication, it has methods to control a workflow. If you are using WorkflowServiceHost, there is a Workflow Control Endpoint with a client that will let you do so. See this answer for the WCF option.