Can I use wildcards when deleting Google Cloud Tasks? - google-cloud-tasks

I'm very new to Google Cloud Tasks.
I'm wondering, is there a way to use wildcards when deleting a task? For example, if I potentially had 3 tasks in queue using the following ID naming structure...
id-123-task-1
id-123-task-2
id-123-task-3
Could I simply delete id-123-task-* to delete all 3, or would I have to delete all 3 specific IDs every time? I'm trying to limit the number of API invocations required to delete everything related to 'id-123'.

As of today, wildcards are not supported in Google Cloud Tasks, so passing a task ID pattern such as id-123-task-* will not delete all the matching tasks.
Nonetheless, if you are creating tasks with a specific purpose in mind, you could create a separate queue for that kind of task.
Not only does this keep your tasks organized, but when you want to delete them all, you only need to purge that queue, which is a single API invocation.
Here you can see how to purge all tasks from a specified queue, and also how to delete individual tasks and queues.
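For reference, here is a minimal sketch of purging a queue with the Node.js client library; the project, location, and queue identifiers are placeholders.

```typescript
import { CloudTasksClient } from '@google-cloud/tasks';

// Placeholder identifiers: replace with your own project, region and queue.
const project = 'my-project';
const location = 'us-central1';
const queue = 'id-123-queue';

async function purgeAllTasks(): Promise<void> {
  const client = new CloudTasksClient();

  // Fully qualified queue name:
  // projects/<project>/locations/<location>/queues/<queue>
  const name = client.queuePath(project, location, queue);

  // One API call deletes every task currently in the queue.
  await client.purgeQueue({ name });
  console.log(`Purged all tasks from ${name}`);
}

purgeAllTasks().catch(console.error);
```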
I have also linked the API documentation in case you need further information about purging queues in Cloud Tasks.
As stated here, take into account that if you purge all the tasks from a queue:
Do not create new tasks immediately after purging a queue. Wait at least a second. Tasks created in close temporal proximity to a purge call will also be purged.
Also, if you are using named tasks, as stated here:
You can assign your own name to a task by using the name parameter. However, this introduces significant performance overhead, resulting in increased latencies and potentially increased error rates associated with named tasks. These costs can be magnified significantly if tasks are named sequentially, such as with timestamps.
As a consequence, if you are using named tasks, the documentation recommends using a well-distributed prefix for task names, such as a hash of the contents.
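As a rough illustration of that recommendation (the queue identifiers and helper below are made up for the example), you could derive each task ID from a short hash of its payload instead of a sequential value:

```typescript
import { createHash } from 'crypto';
import { CloudTasksClient } from '@google-cloud/tasks';

const client = new CloudTasksClient();

// Hypothetical helper: prefix the task ID with a short content hash so that
// names are well distributed rather than sequential.
function buildTaskName(
  project: string,
  location: string,
  queue: string,
  payload: string
): string {
  const prefix = createHash('sha256').update(payload).digest('hex').slice(0, 8);
  return client.taskPath(project, location, queue, `${prefix}-id-123-task`);
}

// Two different payloads yield non-sequential, well-distributed task names.
console.log(buildTaskName('my-project', 'us-central1', 'id-123-queue', '{"n":1}'));
console.log(buildTaskName('my-project', 'us-central1', 'id-123-queue', '{"n":2}'));
```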
I think the separate queue is the best solution if you would like to limit the number of API calls.
I hope it helps.

Related

Cloud Tasks - waiting for a result

My application needs front-end searching. It searches an external API, for which I'm limited to a few calls per second.
So, I wanted to keep ALL queries related to this external API on the same Cloud Tasks queue, so I could guarantee the number of calls per second.
That means the user would most likely have to wait a second or two when searching.
However, using Google's const { CloudTasksClient } = require('@google-cloud/tasks') library, I can create a task, but when I go to check its status using .getTask() it says:
The task no longer exists, though a task with this name existed recently.
Is there any way to poll a task until it's complete and retrieve response data? Or any other recommended methods for this? Thanks in advance.
No. GCP Cloud Tasks provides no way to gather information on the body of requests that successfully completed.
(Which is a shame, because it seems quite natural. I just wrote an email to my own GCP account rep asking about this possible feature. If I get an update I'll put it here.)

How to force test a Firestore parallel Transaction?

Transactions are used for atomic changes and when two clients may change the same data at the same time.
I want to test in the dev env whether my transaction behaves as expected when there is a parallel transaction running from multiple client requests. It runs only in my Cloud Functions. I can't let any undesired behavior of this nature happen in the prod env, so I want to check in dev that everything is alright when it happens, even if it's unlikely.
Is it possible to force this test case?
Using JS/TS.
In the case of a concurrent edit, Cloud Firestore runs the entire transaction again. For example, if a transaction reads documents and another client modifies any of those documents, Cloud Firestore retries the transaction. This feature ensures that the transaction runs on up-to-date and consistent data. Refer to this documentation regarding updating data with transactions.
You can check this post, which discusses concurrent reads and writes at the same time. There is also another link with an example of how to create and run a transaction using Node.js.
Lastly, you can consider creating two FirebaseApp instances, running the same transaction in both, and then synchronizing between the two in the single process they run in. Or, use testing tools that support parallel tests, like Node TAP.
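Here is a minimal sketch of that last idea with the Admin SDK; the project ID, document path, and field name are invented, and in practice you would point both apps at the Firestore emulator or your dev project:

```typescript
import * as admin from 'firebase-admin';

// Two independent app instances acting as two separate "clients".
const app1 = admin.initializeApp({ projectId: 'my-dev-project' }, 'client1');
const app2 = admin.initializeApp({ projectId: 'my-dev-project' }, 'client2');

// Hypothetical document both transactions contend on.
const DOC_PATH = 'counters/shared';

async function incrementCounter(app: admin.app.App): Promise<void> {
  const db = app.firestore();
  const ref = db.doc(DOC_PATH);
  await db.runTransaction(async (tx) => {
    const snap = await tx.get(ref);
    const current = (snap.data()?.count as number | undefined) ?? 0;
    // If the other client commits between this read and the write below,
    // the SDK retries the whole transaction function (up to its attempt limit).
    tx.set(ref, { count: current + 1 });
  });
}

// Run both transactions concurrently to force contention.
Promise.all([incrementCounter(app1), incrementCounter(app2)])
  .then(() => console.log('Both transactions committed'))
  .catch(console.error);
```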

Scheduling thousands of tasks with Airflow

We are considering using Airflow for a project that needs to make thousands of calls a day to external APIs in order to download external data, where each call might take many minutes.
One option we are considering is to create a task for each distinct API call; however, this will lead to thousands of tasks. Rendering all those tasks in the UI is going to be challenging. We are also worried about the scheduler, which may struggle with so many tasks.
The other option is to have just a few parallel long-running tasks and then implement our own scheduler within those tasks. We can add custom code to a PythonOperator, which will query the database and decide which API to call next.
Perhaps Airflow is not well suited for such a use case, and it would be easier and better to implement such a system outside of Airflow? Does anyone have experience with running thousands of tasks in Airflow who can shed some light on the pros and cons of the above use case?
One task per call would kill Airflow, as it still needs to check on the status of each task at every heartbeat, even if the processing of the task (the worker) is separate, e.g. on K8s.
Not sure where you plan on running Airflow, but if it's on GCP and a download takes no longer than 9 minutes, you could use the following:
task (PythonOperator) -> pubsub -> cloud function (to retrieve) -> pubsub -> function (to save result to backend).
The latter function may not be required but we (re)use a generic and simple "bigquery streamer".
Finally, in a downstream Airflow task (a PythonSensor), you query the number of results in the backend and compare it with the number of requests published.
We do this quite efficiently for 100K API calls to a third-party system we host on GCP, as we maximize parallelism. The nice thing about GCF is that you can tweak the architecture and concurrency to use, instead of provisioning a VM or container to run the tasks.
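For illustration, here is a rough Node.js/TypeScript sketch of the middle "retrieve" function in that chain, assuming a Pub/Sub trigger via the Functions Framework; the topic name and message shape are made up:

```typescript
import * as functions from '@google-cloud/functions-framework';
import { PubSub } from '@google-cloud/pubsub';

const pubsub = new PubSub();
// Hypothetical topic that the "save result to backend" function listens on.
const resultsTopic = pubsub.topic('download-results');

// Triggered by the Pub/Sub topic that the Airflow PythonOperator publishes to.
functions.cloudEvent('retrieve', async (cloudEvent: any) => {
  // Pub/Sub message data arrives base64-encoded inside the CloudEvent payload.
  const raw = cloudEvent.data.message.data;
  const request = JSON.parse(Buffer.from(raw, 'base64').toString());

  // Call the external API (global fetch, Node 18+; the url field is made up).
  const response = await fetch(request.url);
  const payload = await response.json();

  // Hand the result to the next topic for the "save" function to persist.
  await resultsTopic.publishMessage({ json: { request, payload } });
});
```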

Can you programmatically detect re-execution in Google Cloud Dataflow?

In Google Cloud Dataflow (streaming pipeline), your data "bundles" can be re-executed because of failure or speculative execution. Is there any way of knowing that the current bundle/element is a re-execution?
This would be very useful to provide conditional behavior for side-effects (in our case: to help make a datastore update operation (read/write) idempotent).
I don't believe this is something that is offered through the Beam API, but you can avoid the need to know this information through the following mechanisms.
If writes to the external datastore are idempotent, simply introduce a fusion break by adding a Reshuffle transform before the write step. This will make sure that the data to be written is not re-generated when there are failures.
If writes to the external datastore are not idempotent (for example, files, BigQuery), the usual mechanism is to combine (1) with writing to a temporary location first. When all (parallel) writes to the temporary location are finished, results can be finalized in an idempotent and failure-safe way from a single worker.
Many Beam sinks utilize these mechanisms to write to external data stores in an idempotent manner. For streaming, usually, these operations are performed per window.

How to invoke other Cloud Firebase Functions from a Cloud Function

Let's say I have a Cloud Firebase Function - called by a cron job - that produces 30+ tasks every time it's invoked.
These tasks are quite slow (5-6 seconds each on average) and I can't process them directly in the original function because it would time out.
So, the solution would be invoking another "worker" function, once per task, to complete the tasks independently and write the results to a database. So far I can think of three strategies:
Pubsub messages. That would be amazing, but it seems that you can only listen for pubsub messages from within a Cloud Function, not create one. Resorting to external solutions, like having a GAE instance, is not an option for me.
Call the worker http-triggered Firebase Cloud Function from the first one. That won't work, I think, because I would need to wait for a response from all the invoked worker functions after they finish and send their responses, and my original Function would time out.
Append tasks to a real time database list, then have a worker function triggered by each database change. The worker has to delete the task from the queue afterwards. That would probably work, but it feels like there are a lot of moving parts for a simple problem. For example, what if the worker throws? Another cron to "clean" the db would be needed, etc.
Another solution that comes to mind is firebase-queue, but its README explicitly states:
"There may continue to be specific use-cases for firebase-queue,
however if you're looking for a general purpose, scalable queueing
system for Firebase then it is likely that building on top of Google
Cloud Functions for Firebase is the ideal route"
It's not officially supported, and they're practically saying that we should use Functions instead (which is what I'm trying to do). I'm a bit nervous about using in prod a library that might be abandoned tomorrow (if it's not already) and would like to avoid going down that route.
Sending Pub/Sub messages from Cloud Functions
Cloud Functions run in a fairly standard Node.js environment. Given the breadth of the Node/NPM ecosystem, the range of things you can do in Cloud Functions is quite broad.
it seems that you can only listen on pubsub messages from within a Cloud Function, not create one
You can publish new messages to Pub/Sub topics from within Cloud Functions using the regular Node.js module for Pub/Sub. See the Cloud Pub/Sub documentation for an example.
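For example, here is a rough sketch using the Firebase Functions SDK together with the Pub/Sub client; the topic name, schedule, and task shape are made up. The cron-style function fans the work out as messages, and each worker invocation handles a single task:

```typescript
import * as functions from 'firebase-functions';
import { PubSub } from '@google-cloud/pubsub';

const pubsub = new PubSub();
const TOPIC = 'worker-tasks'; // hypothetical topic name

// Cron-style entry point: fan the 30+ tasks out as individual Pub/Sub messages.
export const dispatchTasks = functions.pubsub
  .schedule('every 60 minutes')
  .onRun(async () => {
    const tasks = buildTasks(); // however you produce your 30+ tasks
    await Promise.all(
      tasks.map((task) => pubsub.topic(TOPIC).publishMessage({ json: task }))
    );
  });

// Worker: one invocation per published message, so slow tasks run independently.
export const worker = functions.pubsub.topic(TOPIC).onPublish(async (message) => {
  const task = message.json;
  const result = await processTask(task); // the slow 5-6 s of work
  await saveResult(result);               // write the result to your database
});

// Placeholders for the application-specific pieces.
function buildTasks(): object[] { return []; }
async function processTask(task: unknown): Promise<unknown> { return task; }
async function saveResult(result: unknown): Promise<void> {}
```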
Triggering new actions from Cloud Functions through Database writes
This is also a fairly common pattern. I usually have my subprocesses/workers clean up after themselves at the same moment they write their result back to the database. This works fine in my simple scenarios, but your mileage may of course vary.
If you're having a concrete cleanup problem, post the code that reproduces the problem and we can have a look at ways to make it more robust.
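As a sketch of that pattern with a Realtime Database trigger (the paths and helper are invented): the worker processes the queued task, then writes its result and removes the queue entry in a single update, cleaning up after itself:

```typescript
import * as functions from 'firebase-functions';
import * as admin from 'firebase-admin';

admin.initializeApp();

// Fires once for every task the cron function pushes under /taskQueue.
export const taskWorker = functions.database
  .ref('/taskQueue/{taskId}')
  .onCreate(async (snapshot, context) => {
    const task = snapshot.val();
    const result = await processTask(task); // the slow 5-6 s of work

    // Write the result and delete the queue entry in one multi-path update,
    // so the worker cleans up after itself.
    await admin.database().ref().update({
      [`results/${context.params.taskId}`]: result,
      [`taskQueue/${context.params.taskId}`]: null,
    });
  });

// Placeholder for the application-specific processing.
async function processTask(task: unknown): Promise<unknown> {
  return task;
}
```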
