My team is working on orchestrating our data pipeline with Airflow. Since our pipeline steps are complex, we were thinking about having different DAGs / workflows, each defined in its own file. Each workflow can trigger more than one downstream workflow, and each workflow writes data to an S3 bucket at the end of its execution. Broadly, it looks like the following options for orchestration between DAGs are available (the first two are sketched in code after this list):
Using TriggerDagRunOperator at the end of each workflow to decide which downstream workflows to trigger
Using ExternalTaskSensor at the beginning of each workflow to run it once the last task of its parent workflow has completed successfully
Using SubDagOperator, and orchestrating all workflows inside a primary DAG much like tasks would be orchestrated, with child workflows created by a factory. It looks like this operator is being deprecated because it has performance and functional issues
Using TaskGroups: this looks like the successor of the SubDagOperator. It groups tasks in the UI, but the tasks still belong to the same DAG, so we would not have multiple DAGs. Since DAGs are coded in Python, though, we could still put each TaskGroup in an independent file, as mentioned here: https://stackoverflow.com/a/68394010
Using S3KeySensor to trigger a workflow once a file has been added to an S3 bucket (since all our workflows end up storing a dataset in S3)
Using S3 Events to trigger a Lambda whenever a file is added to S3; that Lambda would then use the TriggerDagRunOperator to start the downstream workflows
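To make the first two options concrete, here is a minimal sketch against Airflow 2.x; the DAG ids, task ids, schedules and placeholder operators are all made up and would be replaced by the real workflows:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # placeholder for the real tasks (recent Airflow 2.x)
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.sensors.external_task import ExternalTaskSensor

# Option 1: the parent DAG explicitly triggers a downstream DAG as its last step.
# The hypothetical "child_workflow" would live in its own file, typically with no
# schedule of its own so it only runs when triggered.
with DAG(
    dag_id="parent_workflow",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as parent_dag:
    produce_dataset = EmptyOperator(task_id="produce_dataset")

    trigger_child = TriggerDagRunOperator(
        task_id="trigger_child_workflow",
        trigger_dag_id="child_workflow",           # hypothetical downstream DAG id
        conf={"triggered_by": "parent_workflow"},  # optional payload for the child run
        wait_for_completion=False,
    )

    produce_dataset >> trigger_child

# Option 2 (alternative wiring): the downstream DAG runs on its own schedule and
# waits for a task in the parent DAG instead of being triggered by it.
with DAG(
    dag_id="sensor_based_child",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as child_dag:
    wait_for_parent = ExternalTaskSensor(
        task_id="wait_for_parent",
        external_dag_id="parent_workflow",
        external_task_id="produce_dataset",
        mode="reschedule",     # release the worker slot between pokes
        poke_interval=300,
        timeout=6 * 60 * 60,
        # Note: the sensor matches runs by logical date, so both DAGs need aligned
        # schedules (or an execution_delta / execution_date_fn).
    )

    consume_dataset = EmptyOperator(task_id="consume_dataset")
    wait_for_parent >> consume_dataset
```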
From my research, it looks like:
ExternalTaskSensor and the S3 sensors add overhead to the workflow, since the sensor tasks have to poke and reschedule constantly, creating a large task queue and lots of delays
Using S3 Events to trigger a Lambda that in turn starts child workflows would not have this issue; however, we would need to keep configuration in these Lambdas about which child workflows to trigger, and we would be adding components that complicate our architecture
SubDagOperator seemed like the way to go: we could have several DAGs and manage all the dependencies between them in a primary DAG in a Python file. However, it is being deprecated because of its functional and performance issues
TaskGroups are the successor to SubDagOperator, but since all tasks would still belong to the same DAG, I have concerns about operations such as backfilling an individual TaskGroup, rerunning individual TaskGroups, or maybe scheduling them at different intervals in the future (a sketch of the per-file TaskGroup layout follows this list)
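As a point of comparison, here is a rough sketch of the TaskGroup-per-file layout from the answer linked above; the module, group and task names are made up, and each factory function could live in its own module and be imported into the single DAG file:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # placeholder for the real tasks
from airflow.utils.task_group import TaskGroup


# In practice this factory could live in e.g. groups/extract.py and be imported here.
def build_extract_group() -> TaskGroup:
    with TaskGroup(group_id="extract") as group:
        pull = EmptyOperator(task_id="pull_source_data")
        store = EmptyOperator(task_id="store_to_s3")
        pull >> store
    return group


# ...and this one in e.g. groups/transform.py.
def build_transform_group() -> TaskGroup:
    with TaskGroup(group_id="transform") as group:
        EmptyOperator(task_id="build_dataset")
    return group


# The single DAG file just stitches the groups together.
with DAG(
    dag_id="pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = build_extract_group()
    transform = build_transform_group()
    extract >> transform
```

Rerunning or backfilling a single group then comes down to clearing that group's tasks (their task ids are prefixed with the group id, e.g. extract.store_to_s3), but giving groups genuinely different schedules would still require splitting them into separate DAGs.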
If anyone has experience with any of these approaches and could share some insights it would be greatly appreciated.
Thanks!
Rust seems to have 3 different types of tasks,
std::task
core::task
tokio::task
Why do these three tasks exist?
Those are modules, so the fact that they all coexist and have the same name doesn't really imply anything. Any arbitrary crate can create a task module (or type, or trait, or...). That's why most programming languages have namespaces to start with: so that name collisions like this are not a problem.
std::task is core::task, re-exported under a different name. This contains the building blocks for creating futures themselves and the executors that drive them. A very small handful of people will need to use these types.
tokio::task allows creating Tokio tasks: "asynchronous green-threads". These are the semantic equivalent of threads for an asynchronous world. See the Tokio website's section on spawning tasks for more detail.
async_std::task is the same thing but for a different executor. async-std tasks are distinct from Tokio tasks and are not interchangeable.
futures::task is kind of a mish-mash between the standard library's module and those of the executors. This is because of its history — the futures crate was the implementation of futures before they were moved into the standard library. Now it contains a re-export of the standard library's types plus some further tools for creating an executor as well as traits for spawning tasks on the executor provided by the futures library.
See also:
What is the difference between `alloc::rc::Rc` and `std::rc::Rc`?
std::ops::Add or core::ops::Add?
How do I execute an async/await function without using any external dependencies?
I would like to use Azure Application Insights in my console application to track some operations.
We currently track some traces, dependencies and exceptions, but I would like to link them so I can better see the context of these events and how they relate to each other (a timeline, maybe).
Reading the topic specific to long-running tasks, I wonder whether it would be possible to have individual operations for each background task, given that these tasks run in parallel?
In my case I have a single instance of TelemetryClient that is injected into these worker threads. The way the code looks now, once you start an operation, everything that comes after it is tracked as being part of the same operation.
Any ideas? Would I need multiple instances of TelemetryClient?
You can and should use individual operations for each background task, i.e. every background task's code should be wrapped inside tc.StartOperation ... tc.StopOperation. All telemetry generated within that task will be correlated together. Have you followed the example fully? If not, please share your snippet.
You don't need multiple instances of TelemetryClient. If you use the WorkerService package from the link below, retrieve the TelemetryClient instance from DI.
https://learn.microsoft.com/en-us/azure/azure-monitor/app/worker-service#net-corenet-framework-console-application
I have been using Netflix Conductor for workflow orchestration before, and Apache Airflow is new to me. In Conductor, the execution of a workflow works in these steps:
Workflow starts via a REST API call
Each separate worker (service) polls for its own tasks by constantly calling Conductor's REST API methods
After completing or failing these tasks, each worker calls the REST API to change the status of the workflow
Each of these task workers is a separate service, and they are implemented in different programming languages.
I can't seem to find any examples of how to use these concepts in Apache Airflow. Constantly using BashOperator seems like a very bad solution to me.
Are there any examples that show how to use workers, some of which are not written in Python, to listen for and execute their tasks as defined in DAGs? (For reference, the REST-based start from step 1 above is sketched below.)
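For the first step, a minimal sketch of the closest Airflow analogue (starting a workflow via a REST call), assuming the Airflow 2 stable REST API with a basic-auth backend enabled; the URL, DAG id, and credentials below are placeholders:

```python
import requests

AIRFLOW_URL = "http://localhost:8080"   # placeholder webserver address
DAG_ID = "my_pipeline"                  # hypothetical DAG id

# POST /api/v1/dags/{dag_id}/dagRuns starts a new run, optionally passing a conf payload.
response = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
    auth=("airflow", "airflow"),                           # basic-auth user assumed
    json={"conf": {"triggered_by": "external-service"}},   # optional payload
)
response.raise_for_status()
print(response.json()["dag_run_id"])
```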
I'm building a simple task manager that, at this moment, executes tasks in a serial manner. I have been reading about threads in Flex and it seems the platform is not really prepared for real threads.
What I'm looking for at this moment is a way to execute a method at the beginning or end of a Flash Builder update. This method will be the one responsible for starting the tasks added during the previous update. The removal of finished tasks will be done through event notification (the task will notify that it has finished); the scheduler will then remove it and re-dispatch the message to let the outside world know the task is over.
A rough workflow of the system would be:
1) Add tasks to the scheduler, and listen to the tasks' events (finished, etc.).
2) At the beginning/end of a Flex update (I don't know if this really happens), start the waiting tasks, and run tasks that have a runnable method once per update.
3) When a task finishes, it notifies the scheduler; the scheduler removes it from its queue and re-dispatches the event to let the outside world know the task finished.
Could anybody suggest the correct place to have a method like this? Any suggestions for the scheduler?
Thanks in advance,
Aaron.
Based on your description, you don't seem to be doing anything new or that unique. I'd start by researching existing task and concurrency solutions. If they won't do what you want, extending their code will probably still be easier than starting from scratch.
Get familiar first with Cairngorm 3 Tasks and/or Parsley Tasks.
Also take a look at the callLater() method.
Finally there is the GreenThreads project.
I want each sequence inside a ForEach<T> activity to run in a different thread. Is this possible with WF 4.0? If not, how can I achieve multithreading in WF 4.0?
It depends on the kind of work you are doing. By default the workflow scheduler will only execute a single activity in a workflow at a time; there is no way around that. The parallel activities schedule multiple child activities at the same time, but they don't execute in parallel.
The big exception to the rule is AsyncCodeActivity-based activities. The scheduler will execute another activity as soon as they start doing their asynchronous work. This works best with IO-bound work like database access or network IO, but that is not a requirement.
So to achieve true parallelism in your workflows, you need to use one of the parallel activities in combination with activities deriving from AsyncCodeActivity.
To achieve parallel execution of a foreach, use ParallelForEach.