Can you programmatically detect re-execution in Google Cloud Dataflow? - google-cloud-datastore

In Google Cloud Dataflow (streaming pipeline), your data "bundles" can be re-executed because of failure or speculative execution. Is there any way of knowing that the current bundle/element is a re-execution?
This would be very useful to provide conditional behavior for side-effects (in our case: to help make a datastore update operation (read/write) idempotent).

I don't believe this is something that is offered through the Beam API but you can avoid the need to know this information through following mechanisms.
If writes to the external datastore are idempotent, simply introduce a fusion break by adding a Reshuffle transform before the write step. This will make sure that data to be written is not re-generated when there are failures.
If writes to external datastore are not idempotent (for example, files, BigQuery), usual mechanism is to combine (1) with writing to a temporary location first. When all (parallel) writes to temporary location are finished, results can be finalized in an idempotent and a failure safe way from a single worker.
Many Beam sinks utilize these mechanisms to write to external data stores in an idempotent manner. For streaming, usually, these operations are performed per window.

Related

Does it make sense to use Firestore Realtime as a way to interact with functions?

I have a good old fashioned piece of code that takes some inputs, runs some code and produces outputs for a variety of tasks. This code runs in a function, and it should be triggered by a UI action. It is relevant to store a history of these runs per user, to list them in a specific view of the UI, and, on returning users, put them back where they left.
Since the Firestore documentation and videos strongly encourage us to default to it for data exchange with the backend, I of course decided to entertain the idea before just moving on to simply implementing a callable.
First, it made sense to me to model the data as follows:
/users
id
/taskRuns
id
inputs
outputs
/history
id
date
...outputs
and make the data flow as follows:
The UI writes to the inputs field and it listens to the document.
The unction is triggered. It runs happily and it writes to outputs.
A second function listening to /assets puts copies over the inputs and outputs onto its corresponding /history subcollection.
The UI reacts to the change.
At this point, a couple of red flags are raised on my mind:
Since the UI has to both read and write /assets, a malicious user could technically trigger infinite loops here. I'm aware that the same thing is possible with a callable, but it somehow seems even easier in this scenario.
I lose the ability to do timeout and error handling the way one would with a simple API call.
However, I also understand that:
My function needs to write to the database anyway to store the outputs even if it was returning it à la API.
Logic for who's responsible for what has somewhat been broken down into pieces.
So - Am I looking at it wrong? Is there a world in which this approach would make sense over the callable?
The approach you're describing uses the database as a queue/cache/proxy for your backend functionality. I regularly use it myself with either Realtime Database or Firestore to expose backend functionality, because it:
Works more gracefully when there's intermittent loss of internet connection. Where a direct call to Cloud Functions would have to implement its own retries, the database SDKs already handle this scenario.
Uses the database as a proxy/cache for the result, so that repeated loads of the data don't cause additional calls to my Cloud Functions.
By setting a maximum number of instances on my Cloud Function, I can prevent overloading any legacy infrastructure my code calls, and the database becomes my queue of pending requests.
The pattern is also used in this video from Cloud Next 2019: Serverless in real life: a case study in the travel industry and this talk from Google I/O: Architecting for Data Contention in a Realtime World with Firebase.
That said, many of my team mates only ever use callable functions, so it's definitely a matter of preference and the requirements of your use-case.

Callback to cloud functions/cloud run if timeout happens?

I have some cloud run and cloud functions that serve to parse a large number of files that users upload. Sometimes users upload an exceedingly large number of files, and that causes these functions to timeout even when I set them to their maximum runtime limits (15 minutes for Cloud Run and 9 minutes for Cloud Functions respectively.) I have a loading icon corresponding to a database entry that shows the progress of processing each batch of files that's been uploaded, and so if the function times out currently, the loading icon gets stuck for that batch in perpetuity, as the database is not updated after the function is killed.
Is there a way for me to create say a callback function to the Cloud Run/Functions to update the database and indicate that the parsing process failed if the Cloud Run/Functions timed out? There is currently no way for me to know a priori if the batch of files is too large to process, and clearly I cannot use a simple try/catch here as the execution environment itself will be killed.
One popular method is to have a public-facing API location that you can invoke by passing on the remaining queued information. You should assume that this API location is compromised so some sort of OTP should be used. This does depend on some factors, such as how these files are uploaded or the cloud trigger was handled which may require you to store that information in a database location to be retrieved.
You can set a flag on the db before you start processing, then after processing, clear/delete the flag. Then have another function regularly check for the status.
No such callback functionality exists for either product.
Serverless products are generally not meant to be used for batch processing where the batches can easily be larger than the limits of the system. They are meant for small bits of discrete work, such as simple API calls.
If you want to process larger amounts of data, considering first uploading that to Cloud Storage (which will accept files of any size), then sending a pubsub message upon completion to a backend compute product that can handle the processing requirements (such as Compute Engine).
Direct answer. For example, you might be able to achieve that by filtering and creating a sink in the relevant StackDriver logs (where a cloud function timeout crash is to be recorded), so that the relevant log records are pushed into some PubSub topic. On the other side of that topic you may have some other cloud function, which can implement the desired functionality.
Indirect answer. Without context, scope and requirement details - it is difficult to provide a good suggestion... but, based on some guesses - I am not sure that the design is optimal. Serverless services are supposed to be used for handling independent and relatively small chunks of data. If are have something large - you might like to use the first, let's say cloud function, to divide it into reasonably small chunks, so they can be processed independently by, let's say the second cloud function. In your case - can you have a cloud function per file, for example? If a file is too large (a few Gb, or dozen Gb) - can it be saved to a cloud storage and read/processed in chunks, so that the cloud functions are triggered from he cloud storage? And so on. That approach should help, but has a drawback - complexity is increased, as you have to coordinate and control how the process is going...

Can I use wildcards when deleting Google Cloud Tasks?

I'm very new to Google Cloud Tasks.
I'm wondering, is there a way to use wildcards when deleting a task? For example, if I potentially had 3 tasks in queue using the following ID naming structure...
id-123-task-1
id-123-task-2
id-123-task-3
Could I simply delete id-123-task-* to delete all 3, or would I have to delete all 3 specific ID's every time? I guess I'm trying to limit the number of required API invocations to delete everything related to 'id-123'.
Can I use wildcards when deleting Google Cloud Tasks?
As of today, wildcards are not supported within Google Cloud Tasks. I can not confirm that you could pass the Google Cloud Task's ID as you mentioned id-123-task-* will delete all the tasks.
Nonetheless, if you are creating tasks for an specific purpose in mind, you could create a separate queue for this kind of tasks.
Not only you will win in terms of organizing your tasks, but when you would like to delete all, you will only need to purge all tasks from the specified queue making only 1 API invocation.
Here you could see how to purge all tasks from the specified queue, and also how to delete tasks and queues.
Also, I attached the API documentation in case you need further information about purging queues in Cloud Tasks.
As stated here, take into account that if you purge all the tasks from a queue:
Do not create new tasks immediately after purging a queue. Wait at least a second. Tasks created in close temporal proximity to a purge call will also be purged.
Also, if you are using named tasks, as stated here:
You can assign your own name to a task by using the name parameter. However, this introduces significant performance overhead, resulting in increased latencies and potentially increased error rates associated with named tasks. These costs can be magnified significantly if tasks are named sequentially, such as with timestamps.
As a consequence, if you are using named tasks, the documentation recommends using a well-distributed prefix for task names, such as a hash of the contents.
I think this is the best solution if you would like to limit the amount of API calls.
I hope it helps.

How to minimise Firebase Function Latency

As per the documentation, Firebase Functions are currently supported for 4 regions only - “us-central1”, “us-east1", “europe-west1”, “asia-northeast1"
That means locations further away would incur more latency, and often that translates to lower performance.
How can this limitation be worked around?
1) Choosing a location that is closest to you. You can set up test cloud functions in different regions, and test the round-trip latency. Only you can discover the specifics about your location.
2) Focus your software architecture on infrastructure that is locally available.
Use the client-side Firestore library directly as much as possible. It supports offline data, queueing data to send out later if you don't have internet, and caching read data locally - you can't get faster latency than that! So make sure you use Firestore for CRUD operations.
3) Architect to use CloudFunctions for batch and background processesing. If any business-logic processing is required, write the data to Firestore (using client libraries), and have a FF trigger to do some processing upon the write data-event. Have that trigger update that record with the additional processing, and state. I believe that if you're using the client-side libraries there is a way to have the updated data automatically pushed back to the client-side. (edited)
You also have the bonus benefit of being able to control authorisation with Firestore Auth, where Functions don't have an admin-level authorisation control.
4) Reduce chatter - minimising the amount of CloudFunction calls overall, and ensuring your CloudFunctions themselves do more in one go and return more complete data in one go.

Event Driven Architecture - Service Contract Design

I'm having difficulty conceptualising a requirement I have into something that will fit into our nascent SOA/EDA
We have a component I'll call the Data Downloader. This is a facade for an external data provider that has both high latency and a cost associated with every request. I want to take this component and create a re-usable service out of it with a clear contract definition. It is up to me to decide how that contract should work, however its responsibilities are two-fold:
Maintain the parameter list (called a Download Definition) for an upcoming scheduled download
Manage the technical details of the communication to the external service
Basically, it manages the 'how' of the communication. The 'what' and the 'when' are the responsibilities of two other components:
The 'what' is managed by 'Clients' who are responsible for
determining the parameters for the download.
The 'when' is managed by a dedicated scheduling component. Because of the cost associated with the downloads we'd like to batch the requests intraday.
Hopefully this sequence diagram explains the responsibilities of the services:
Because each of the responsibilities are split out in three different components, we get all sorts of potential race conditions with async messaging. For instance when the Scheduler tells the Downloader to do its work, because the 'Append to Download Definition' command is asynchronous, there is no guarantee that the pending requests from Client A have actually been serviced. But this all screams high-coupling to me; why should the Scheduler necessarily know about any 'prerequisite' client requests that need to have been actioned before it can invoke a download?
Some potential solutions we've toyed with:
Make the 'Append to Download Definition' command a blocking request/response operation. But this then breaks the perf. and scalability benefits of having an EDA
Build something in the Downloader to ensure that it only runs when there are no pending commands in its incoming request queue. But that then introduces a dependency on the underlying messaging infrastructure which I don't like either.
Makes me think I'm thinking about this problem in a completely backward way. Or is this just a classic case of someone trying to fit a synchronous RPC requirement into an async event-driven architecture?
The thing I like most about EDA and SOA, is that it almost completely eliminates the notion of race condition. As long as your events are associated with some association key (e.g. downloadId), the problem you describe can be addressed with several solutions of different complexities - depending on your needs. I'm not sure I totally understand the described use-case but I will try my best
Out of the top of my head:
DataDownloader maintains a list of received Download Definitions and a list of triggered downloads. When a definition is received it is checked against the triggers list to see if the associated download has already been triggered, and if it was, execute the download. When a TriggerDownloadCommand is recieved, the definitions list is checked against a definition with the associated downloadId.
For more complex situation, consider using the Saga pattern, which is implemented by some 3rd party messaging infrastructures. With some simple configuration, it will handle both messages, and initiate the actual download when the required condition is satisfied. This is more appropriate for distributed systems, where an in-memory collection is out of the question.
You can also configure your scheduler (or the trigger command handler) to retry when an error is signaled (e.g. by an exception), in order to avoid that race condition, and ultimately give up after a specified timeout.
Does this help?

Resources