Scheduling thousands of tasks with Airflow

We are considering using Airflow for a project that needs to make thousands of calls a day to external APIs in order to download external data, where each call might take many minutes.
One option we are considering is to create a task for each distinct API call; however, this will lead to thousands of tasks. Rendering all those tasks in the UI is going to be challenging, and we are also worried that the scheduler may struggle with so many tasks.
The other option is to have just a few parallel long-running tasks and implement our own scheduling within them: custom code in a PythonOperator that queries the database and decides which API to call next.
Perhaps Airflow is not well suited to such a use case and it would be easier and better to implement such a system outside of Airflow? Does anyone have experience running thousands of tasks in Airflow who can shed some light on the pros and cons of the options above?

One task per call would kill Airflow, as the scheduler still needs to check the status of every task at each heartbeat, even if the actual processing happens on separate workers, e.g. on K8s.
Not sure where you plan on running Airflow, but if it's on GCP and a download takes no longer than 9 minutes, you could use the following:
task (PythonOperator) -> Pub/Sub -> Cloud Function (to retrieve) -> Pub/Sub -> Cloud Function (to save the result to the backend).
The latter function may not be required, but we (re)use a generic and simple "BigQuery streamer".
Finally, a downstream Airflow task (PythonSensor) queries the number of results in the backend and compares it with the number of requests published.
We do this quite efficiently for 100K API calls to a third-party system we host on GCP, as we maximize parallelism. The nice thing about Cloud Functions is that you can tweak the resources and concurrency instead of provisioning a VM or container to run the tasks.
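A rough sketch of that fan-out pattern (my own illustration, not the answerer's code; project, topic, and table names are placeholders):

    import json
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.sensors.python import PythonSensor
    from google.cloud import bigquery, pubsub_v1

    # Placeholder list of API calls to fan out; in practice this would be built
    # from your own configuration or database.
    REQUESTS = [{"endpoint": "orders", "page": i} for i in range(1000)]

    def publish_requests():
        # One Pub/Sub message per API call; a Cloud Function subscribed to the
        # topic performs the actual download.
        publisher = pubsub_v1.PublisherClient()
        topic = publisher.topic_path("my-project", "api-download-requests")
        futures = [publisher.publish(topic, json.dumps(r).encode("utf-8")) for r in REQUESTS]
        for f in futures:
            f.result()  # wait until every message has been accepted by Pub/Sub

    def all_results_arrived():
        # Compare the number of rows landed in the backend with the number of
        # requests published (table name is a placeholder).
        client = bigquery.Client()
        rows = client.query(
            "SELECT COUNT(*) AS n FROM `my-project.my_dataset.api_results`"
        ).result()
        return next(iter(rows)).n >= len(REQUESTS)

    with DAG("api_fanout_example", start_date=datetime(2023, 1, 1),
             schedule_interval=None, catchup=False) as dag:
        publish = PythonOperator(task_id="publish_requests", python_callable=publish_requests)
        wait = PythonSensor(task_id="wait_for_results", python_callable=all_results_arrived,
                            poke_interval=60, mode="reschedule")
        publish >> wait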

Related

Airflow Dynamic Task Mapping - limits and scheduler

I have a question about Dynamic Task Mapping, more specifically about how to "control" it so that it doesn't impact other DAGs.
Case:
There are several DAGs that query the same API, and each DAG needs to make multiple calls to its endpoints. DAG A, for example, might make 50 calls, with each dynamically mapped task performing one of them. In total there would be about 10 DAGs calling the same API, and each of these DAGs can have from 50 to 2,000 mapped tasks to execute.
I know there is a max_map_length parameter that regulates the maximum number of tasks that can be mapped, and I've seen that I may have to increase it for one of the DAGs, but at least it's a way to control the dynamic limits.
Another parameter I saw is max_active_tis_per_dag, which limits how many instances of a mapped task can run at once. I could set it to 16, for example, so that at no point during the execution are more than 16 of these tasks running in a DAG.
I also evaluated using a pool; since several DAGs make requests to the same API, I believe that is one more way to control all of this.
However, I still have doubts about the impact that many dynamically mapped tasks can have on the scheduler, even with the parameters I mentioned. It's not clear to me what else I would have to pay attention to so as not to impact the task scheduling of all the other DAGs in Airflow. I'm using Airflow 2.3.4 on Cloud Composer 1.20.3.
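For what it's worth, a minimal sketch (not from the original post) of how these knobs fit together, i.e. dynamic task mapping combined with max_active_tis_per_dag and a shared pool; the DAG id, the endpoint list, and the "external_api" pool are placeholders, and the pool has to be created beforehand in the Airflow UI or CLI:

    from datetime import datetime

    from airflow.decorators import dag, task

    @dag(schedule_interval=None, start_date=datetime(2023, 1, 1), catchup=False)
    def api_calls_example():
        @task
        def list_endpoints():
            # Placeholder: in the real DAG this list would come from config or a query.
            return [f"https://api.example.com/resource/{i}" for i in range(50)]

        @task(
            max_active_tis_per_dag=16,  # at most 16 mapped instances of this task at once
            pool="external_api",        # shared pool throttling every DAG that hits this API
        )
        def call_endpoint(url):
            ...  # perform the API call here

        call_endpoint.expand(url=list_endpoints())

    api_calls_example()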

Running jobs independent of each other's failure/success in a single dataflow pipeline

I am trying to load data in Avro format from GCS to BigQuery using a single pipeline. There are, for instance, 10 tables that I am trying to load, which means 10 parallel jobs in a single pipeline.
Now if the 3rd job fails, all the subsequent jobs fail. How can I make the other jobs run independently of the failure/success of any one of them?
You cannot isolate different steps within a single Dataflow pipeline without implementing custom logic (for example, custom DoFn/ParDo implementations). Some I/O connectors such as BigQuery offer a way to send failed requests to a dead-letter queue in some write modes, but this might not give you what you want. If you want full isolation, you should run separate jobs and combine them into a workflow using an orchestration framework such as Apache Airflow.
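If you do go the orchestration route, a hedged sketch of what "separate jobs combined in Airflow" could look like: one independent load task per table, with no dependencies declared between them, so one table's failure does not stop the others (bucket, dataset, and table names are placeholders):

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

    TABLES = ["customers", "orders", "payments"]  # placeholder; extend to all 10 tables

    with DAG("avro_loads_example", start_date=datetime(2023, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:
        for table in TABLES:
            # No dependencies between these tasks, so each load succeeds or fails on its own.
            GCSToBigQueryOperator(
                task_id=f"load_{table}",
                bucket="my-landing-bucket",
                source_objects=[f"avro/{table}/*.avro"],
                destination_project_dataset_table=f"my_project.my_dataset.{table}",
                source_format="AVRO",
                write_disposition="WRITE_TRUNCATE",
            )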

Cron every 10 seconds (Firebase Functions)

I need to run a function in Firebase every 10 seconds (I call an external API and process data).
With a normal cron I can't, because it is limited to at most once a minute. Using setTimeout is also not convenient, since Functions charges per second of use.
My only idea so far is to use Cloud Tasks, but I don't know whether that tool is the most suitable for my purpose.
On Google Cloud Platform, Cloud Tasks is almost certainly the best/only way to get this job done without using other products in ways they weren't intended for. Cloud Functions alone is not a good fit, for the reasons you mentioned. With Cloud Tasks, you will need to have each task schedule the next one upon completion, as they will not repeat automatically like cron.
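As a rough illustration of the "each task schedules the next one" idea, here is a sketch using the Python client for Cloud Tasks; the project, queue, and handler URL are placeholders, and the handler itself would do the API call and then call schedule_next_run() again:

    import datetime

    from google.cloud import tasks_v2
    from google.protobuf import timestamp_pb2

    client = tasks_v2.CloudTasksClient()
    QUEUE = client.queue_path("my-project", "us-central1", "every-10-seconds")  # placeholders

    def schedule_next_run(delay_seconds=10):
        # Create an HTTP task that fires the handler roughly delay_seconds from now.
        run_at = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(seconds=delay_seconds)
        schedule_time = timestamp_pb2.Timestamp()
        schedule_time.FromDatetime(run_at)

        client.create_task(
            parent=QUEUE,
            task={
                "http_request": {
                    "http_method": tasks_v2.HttpMethod.POST,
                    "url": "https://example.com/processData",  # placeholder handler URL
                },
                "schedule_time": schedule_time,
            },
        )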

Can I use wildcards when deleting Google Cloud Tasks?

I'm very new to Google Cloud Tasks.
I'm wondering, is there a way to use wildcards when deleting a task? For example, if I potentially had 3 tasks in queue using the following ID naming structure...
id-123-task-1
id-123-task-2
id-123-task-3
Could I simply delete id-123-task-* to delete all 3, or would I have to delete all 3 specific IDs every time? I guess I'm trying to limit the number of API invocations required to delete everything related to 'id-123'.
Can I use wildcards when deleting Google Cloud Tasks?
As of today, wildcards are not supported in Google Cloud Tasks, so I cannot confirm that passing a task ID such as id-123-task-*, as you mentioned, would delete all the matching tasks.
Nonetheless, if you are creating tasks with a specific purpose in mind, you could create a separate queue for that kind of task.
Not only will you gain in terms of organizing your tasks, but when you want to delete them all, you will only need to purge all tasks from that queue, which takes just one API invocation.
Here you can see how to purge all tasks from a specified queue, and also how to delete tasks and queues.
I have also attached the API documentation in case you need further information about purging queues in Cloud Tasks.
As stated here, take into account that if you purge all the tasks from a queue:
Do not create new tasks immediately after purging a queue. Wait at least a second. Tasks created in close temporal proximity to a purge call will also be purged.
Also, if you are using named tasks, as stated here:
You can assign your own name to a task by using the name parameter. However, this introduces significant performance overhead, resulting in increased latencies and potentially increased error rates associated with named tasks. These costs can be magnified significantly if tasks are named sequentially, such as with timestamps.
As a consequence, if you are using named tasks, the documentation recommends using a well-distributed prefix for task names, such as a hash of the contents.
I think this is the best solution if you would like to limit the number of API calls.
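For reference, the one-call cleanup with the Python client looks roughly like this (project, location, and queue names are placeholders):

    from google.cloud import tasks_v2

    client = tasks_v2.CloudTasksClient()
    queue = client.queue_path("my-project", "us-central1", "id-123-queue")  # placeholders

    # Removes every task in the queue with a single API invocation.
    client.purge_queue(name=queue)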
I hope it helps.

How to invoke other Cloud Firebase Functions from a Cloud Function

Let's say I have a Firebase Cloud Function, called by a cron job, that produces 30+ tasks every time it's invoked.
These tasks are quite slow (5-6 seconds each on average) and I can't process them directly in the original function because it would time out.
So the solution would be invoking another "worker" function, once per task, to complete the tasks independently and write the results to a database. So far I can think of three strategies:
Pub/Sub messages. That would be amazing, but it seems that you can only listen for Pub/Sub messages from within a Cloud Function, not create one. Resorting to external solutions, like having a GAE instance, is not an option for me.
Call the HTTP-triggered worker Firebase Cloud Function from the first one. That won't work, I think, because I would need to wait for a response from all the invoked worker functions after they finish, and my original function would time out.
Append tasks to a Realtime Database list, then have a worker function triggered by each database change. The worker has to delete the task from the queue afterwards. That would probably work, but it feels like there are a lot of moving parts for a simple problem. For example, what if the worker throws? Another cron to "clean" the DB would be needed, etc.
Another solution that comes to mind is firebase-queue, but its README explicitly states:
"There may continue to be specific use-cases for firebase-queue, however if you're looking for a general purpose, scalable queueing system for Firebase then it is likely that building on top of Google Cloud Functions for Firebase is the ideal route"
It's not officially supported, and they're practically saying that we should use Functions instead (which is what I'm trying to do). I'm a bit nervous about using in production a library that might be abandoned tomorrow (if it isn't already) and would like to avoid going down that route.
Sending Pub/Sub messages from Cloud Functions
Cloud Functions run in a fairly standard Node.js environment. Given the breadth of the Node/npm ecosystem, the range of things you can do in Cloud Functions is quite broad.
it seems that you can only listen on pubsub messages from within a Cloud Function, not create one
You can publish new messages to Pub/Sub topics from within Cloud Functions using the regular Node.js module for Pub/Sub. See the Cloud Pub/Sub documentation for an example.
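The answer refers to the Node.js client; just to illustrate the pattern, roughly the same thing with the Python Pub/Sub client (project and topic names are placeholders):

    import json

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic = publisher.topic_path("my-project", "worker-tasks")  # placeholder topic

    def fan_out(tasks):
        # Publish one message per task; a Pub/Sub-triggered worker function
        # handles each one independently.
        futures = [publisher.publish(topic, json.dumps(t).encode("utf-8")) for t in tasks]
        for f in futures:
            f.result()  # wait until every message has been accepted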
Triggering new actions from Cloud Functions through Database writes
This is also a fairly common pattern. I usually have my subprocesses/workers clean up after themselves at the same moment they write their result back to the database. This works fine in my simple scenarios, but your mileage may of course vary.
If you're having a concrete cleanup problem, post the code that reproduces the problem and we can have a look at ways to make it more robust.
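As a sketch of the "clean up at the same moment the result is written" approach, using the firebase_admin Python SDK purely for illustration (the database URL and the results/ and queue/ paths are hypothetical):

    import firebase_admin
    from firebase_admin import db

    # Placeholder database URL; credentials are picked up from the environment.
    firebase_admin.initialize_app(options={"databaseURL": "https://my-project.firebaseio.com"})

    def finish_task(task_id, result):
        # Write the worker's output, then immediately remove its queue entry so
        # no separate cleanup cron is needed for the happy path.
        db.reference(f"results/{task_id}").set(result)
        db.reference(f"queue/{task_id}").delete()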
