Airflow Dynamic Task Mapping - limits and scheduler - airflow

I have a question about Dynamic Task Mapping, more specifically on how to "control" and not impact other DAGs.
Case:
There are several DAGs that query an API. Each DAG needs to make multiple calls to this API's endpoints. For example, DAG A might make 50 calls, with each task generated by dynamic task mapping making one call. In total I would have about 10 DAGs calling the same API, and each of these DAGs can have from 50 to 2000 mapped tasks to execute. I have doubts about the impact that having many tasks being dynamically mapped can have on the scheduler.
I know there is a max_map_length parameter that caps the number of tasks that can be mapped, and I've seen that I might have to increase it for one of the DAGs, but it is one way to limit dynamic mapping.
Another parameter I saw is max_active_tis_per_dag, which will limit the parallelism of tasks within a DAG. So I can put 16 here, for example, so that during the entire execution there are no more than 16 tasks running in a DAG.
I also considered using a pool: since several DAGs make requests to the same API, I believe it is one more way to control all of this.
However, I have doubts about the impact that having many tasks being dynamically mapped can have on the scheduler, even using these parameters that I mentioned. It's not very clear to me what else I would have to pay attention to in order not to impact the task scheduling of all the other DAGs in Airflow. I'm using Airflow 2.3.4 on Cloud Composer 1.20.3.
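One possible way to reduce the number of mapped task instances the scheduler has to track is to map over batches of calls instead of single calls. The sketch below is plain Python, not Airflow API: `chunk_calls`, `max_mapped_tasks`, and the endpoint paths are hypothetical names used only for illustration. It assumes each mapped task can loop over a small list of endpoints.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_calls(calls: Sequence[T], max_mapped_tasks: int) -> List[List[T]]:
    """Split `calls` into at most `max_mapped_tasks` batches of near-equal size."""
    if max_mapped_tasks < 1:
        raise ValueError("max_mapped_tasks must be at least 1")
    if not calls:
        return []
    # Ceiling division: the smallest batch size that stays within the cap.
    batch_size = -(-len(calls) // max_mapped_tasks)
    return [list(calls[i:i + batch_size]) for i in range(0, len(calls), batch_size)]


# Hypothetical example: 2000 endpoint calls squeezed into at most 50 mapped tasks.
endpoints = [f"/items/{n}" for n in range(2000)]
batches = chunk_calls(endpoints, max_mapped_tasks=50)
print(len(batches), len(batches[0]))  # prints "50 40"
```

In a DAG, the list of batches would be the value passed to `expand()`, so the mapped task count stays at the chosen cap regardless of how many calls are needed. The trade-off is coarser retries: a failed batch reruns all of its calls.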

Related

Running jobs independent of each other's failure/success in a single dataflow pipeline

I am trying to load data in Avro format from GCS to Big Query, using a single pipeline. There are 10 tables for instance that I am trying to load, which means 10 parallel jobs in a single pipeline.
Now if the 3rd job fails, all the subsequent jobs fail. How can I make the other jobs run independent of the failure/success of one?
You cannot isolate different steps within a single Dataflow pipeline without implementing custom logic (for example, custom DoFn/ParDo implementations). Some I/O connectors such as BigQuery offer a way to send failed requests to a dead-letter queue in some write modes, but this might not give you what you want. If you want full isolation, you should run separate jobs and combine them into a workflow using an orchestration framework such as Apache Airflow.
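The isolation idea can be sketched without any orchestration framework: launch one job per table and record each outcome separately, so that one failure does not stop the others. This is a minimal illustration only. `run_independently` and `load_table` are hypothetical, and `load_table` stands in for whatever actually submits a separate job per table.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Tuple


def run_independently(
    names: Iterable[str], job: Callable[[str], str]
) -> Tuple[Dict[str, str], Dict[str, Exception]]:
    """Run `job` once per name and return (successes, failures) keyed by name."""
    successes: Dict[str, str] = {}
    failures: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {name: pool.submit(job, name) for name in names}
        for name, future in futures.items():
            try:
                successes[name] = future.result()
            except Exception as exc:  # a failed job is recorded, not propagated
                failures[name] = exc
    return successes, failures


def load_table(table: str) -> str:
    """Hypothetical stand-in for launching one load job for one table."""
    if table == "table_3":
        raise RuntimeError(f"load failed for {table}")
    return f"{table} loaded"


tables = [f"table_{n}" for n in range(1, 11)]
done, failed = run_independently(tables, load_table)
print(sorted(failed), len(done))  # prints "['table_3'] 9"
```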

Can I restrict max_active_runs of a DAG only for runs with the same dag_run.conf?

I have a DAG that runs in a multi-tenant scenario. The tenant ID gets set in the dag_run.conf when the DAG is triggered. I want to ensure that there is at most one active run per tenant at a time, but potentially many active runs simultaneously across all tenants.
So far I have found the max_active_runs setting, but that would require me to actually set up one DAG per tenant, which I am trying to avoid.
Is there a way to achieve this in airflow or am I approaching the problem in the wrong way?
You are using dag_run.conf, which means that you are triggering your DAGs manually. Currently there is a bug (Airflow 2.0.1): max_active_runs isn't respected for manual runs (see GitHub issue).

Scheduling thousands of tasks with Airflow

We are considering using Airflow for a project that needs to make thousands of calls a day to external APIs in order to download external data, where each call might take many minutes.
One option we are considering is to create a task for each distinct API call, however this will lead to thousands of tasks. Rendering all those tasks in UI is going to be challenging. We are also worried about the scheduler, which may struggle with so many tasks.
The other option is to have just a few parallel long-running tasks and then implement our own scheduler within those tasks. We can add custom code to a PythonOperator, which will query the database and decide which API to call next.
Perhaps Airflow is not well suited for such a use case and it would be easier and better to implement such a system outside of Airflow? Does anyone have experience with running thousands of tasks in Airflow and can shed some light on pros and cons on the above use case?
One task per call would kill Airflow as it still needs to check on the status of each task at every heartbeat - even if the processing of the task (worker) is separate e.g. on K8s.
I'm not sure where you plan on running Airflow, but if it is on GCP and a download takes no longer than 9 minutes, you could use the following:
task (PythonOperator) -> pubsub -> cloud function (to retrieve) -> pubsub -> function (to save result to backend).
The latter function may not be required but we (re)use a generic and simple "bigquery streamer".
Finally, you query in a downstream AF task (PythonSensor) the number of results in the backend and compare with the number of requests published.
We do this quite efficiently for 100K API calls to a third-party system we host on GCP, as we maximize parallelism. The nice thing about GCF is that you can tweak the architecture and concurrency, instead of provisioning a VM or container to run the tasks.
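The final completion check described above can be sketched as a small function that returns True once the backend holds as many results as requests were published. A function of this shape could be supplied as the `python_callable` of a PythonSensor. The names `make_completion_check` and `count_results` are hypothetical, and the in-memory list only stands in for the real backend query.

```python
from typing import Callable, List


def make_completion_check(
    expected: int, count_results: Callable[[], int]
) -> Callable[[], bool]:
    """Build a poke function that is True once all published requests have results."""

    def poke() -> bool:
        return count_results() >= expected

    return poke


# Hypothetical in-memory backend standing in for the results table.
results: List[str] = []
check = make_completion_check(expected=3, count_results=lambda: len(results))

first = check()  # nothing stored yet
results.extend(["a", "b", "c"])
second = check()  # all three results stored
print(first, second)  # prints "False True"
```

Using `>=` rather than `==` is a deliberate choice here: with at-least-once delivery, a duplicate result should not leave the sensor waiting forever.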

Can I use wildcards when deleting Google Cloud Tasks?

I'm very new to Google Cloud Tasks.
I'm wondering, is there a way to use wildcards when deleting a task? For example, if I potentially had 3 tasks in queue using the following ID naming structure...
id-123-task-1
id-123-task-2
id-123-task-3
Could I simply delete id-123-task-* to delete all 3, or would I have to delete all 3 specific ID's every time? I guess I'm trying to limit the number of required API invocations to delete everything related to 'id-123'.
As of today, wildcards are not supported in Google Cloud Tasks, so I cannot confirm that passing a task ID such as id-123-task-* would delete all of those tasks.
Nonetheless, if you are creating tasks with a specific purpose in mind, you could create a separate queue for that kind of task.
Not only will this help you organize your tasks, but when you want to delete them all, you only need to purge the queue, which takes a single API invocation.
Here you can see how to purge all tasks from a specified queue, and also how to delete tasks and queues.
I have also attached the API documentation in case you need further information about purging queues in Cloud Tasks.
As stated here, take into account that if you purge all the tasks from a queue:
Do not create new tasks immediately after purging a queue. Wait at least a second. Tasks created in close temporal proximity to a purge call will also be purged.
Also, if you are using named tasks, as stated here:
You can assign your own name to a task by using the name parameter. However, this introduces significant performance overhead, resulting in increased latencies and potentially increased error rates associated with named tasks. These costs can be magnified significantly if tasks are named sequentially, such as with timestamps.
As a consequence, if you are using named tasks, the documentation recommends using a well-distributed prefix for task names, such as a hash of the contents.
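A minimal sketch of that recommendation, assuming the task ID may be built freely from a base ID and the task payload. `distributed_task_name` and the sample payloads are hypothetical, and the result is only the final ID segment, not the full resource path of the task.

```python
import hashlib


def distributed_task_name(base_id: str, payload: bytes, prefix_len: int = 8) -> str:
    """Prefix `base_id` with a short hash of the payload to spread names evenly."""
    digest = hashlib.sha256(payload).hexdigest()
    return f"{digest[:prefix_len]}-{base_id}"


name_1 = distributed_task_name("id-123-task-1", b'{"step": 1}')
name_2 = distributed_task_name("id-123-task-2", b'{"step": 2}')
print(name_1)
print(name_2)
```

Because the prefix is derived from the contents, the same payload always yields the same name, while sequential base IDs no longer sort next to each other.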
I think this is the best solution if you would like to limit the amount of API calls.
I hope it helps.

How do I configure a flow to send to multiple PODs in K8S?

I have a single Camunda job that is configured as a multi-instance call to another process. At present, multi instance asynchronous before, multi instance asynchronous after, and multi instance exclusive are all checked. We have multiple PODs deployed to handle the calls (1k at a time), and right now when I try to run this, it seems that no matter what I do, it runs them serially, or close to it. What is needed to actually send all 1000 elements to multiple instances of the child process?
Tried configuring the multi instance asynch settings
Multi Instance
Loop Cardinality: blank
Collection: builtJobList
Element Variable: builtRequestObject
I then have all three multi instance values checked. The Asynch Continuations are not checked.
Camunda BPM will only run a single thread (execution) within a given process instance at a time by default. You can change that behavior for a given task/activity by checking the "Asynchronous Before" and/or "Asynchronous After" checkboxes - thus electing to use the Job Executor - and deselecting the "Exclusive" checkbox. (This also applies to the similar checkboxes for multi-instance activities.) If you do that, beware that the behavior may not be what you want; specifically:
You will likely receive OptimisticLockingExceptions if you have a decent number of threads running simultaneously on a single instance. These are thrown when the individual threads attempt to update the information in the relational database for the process instance and discover that the data has been modified while they were performing their processing.
If those OptimisticLockingExceptions occur, the Job Executor will automatically retry the Job without decrementing the available retries. This will re-execute the Job, re-executing any included integration logic as well. This may not be desirable.
Although Camunda BPM has been proven to be fantastic at executing large numbers of process instances in parallel, it isn't designed to execute multiple threads simultaneously within an individual process instance. If you want that behavior within a given process instance, I would suggest that you handle the threading yourself within an individual Service Task, fire-and-forget launching the threads you need and letting the Service Task complete within Camunda immediately after launching them... of course if that's feasible given your application's desired behavior.

Resources