I know that Firebase/Google Cloud Functions can have their timeout increased up to 9 minutes, but I have a Pub/Sub-triggered function in which a single request takes around 20-30 seconds to complete (document conversion).
async function processDocuments() {
  // code...
  // const convertedDoc = await convertDocument();
  // ... do something with convertedDoc
}
With the maximum timeout of 9 minutes, that gives me up to 18 documents I can process in a single invocation.
My question is: if, after the 15th document conversion, I called the Pub/Sub function again while finishing the previous invocation, would the timeout timer start over for the new invocation? Of course, I would need to pass all the remaining data to it, but is that a viable way to do it? Something like a recursive Pub/Sub of sorts?
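Roughly what I have in mind, as a sketch (the topic name, the 15-document cutoff, and convertDocument are placeholders):

const functions = require('firebase-functions');
const { PubSub } = require('@google-cloud/pubsub');

const pubsub = new PubSub();

exports.convertDocs = functions.pubsub.topic('convert-docs').onPublish(async (message) => {
  const docs = message.json.docs;
  for (const doc of docs.slice(0, 15)) {
    const convertedDoc = await convertDocument(doc); // ~20-30 s each
    // ... do something with convertedDoc
  }
  const remaining = docs.slice(15);
  if (remaining.length > 0) {
    // Re-publish the leftovers so a fresh invocation (with a fresh
    // timeout timer) picks up where this one left off.
    await pubsub.topic('convert-docs').publishMessage({ json: { docs: remaining } });
  }
});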
If you don't mind, I would suggest a slightly different approach.
Let's divide the whole process into 2 steps.
The first step: a cloud function which "collects" all documents (I mean their id, reference, or metadata, enough to uniquely distinguish one from the others) into a list, and then sends one message per document into a Pub/Sub topic. Each message contains a unique identifier/handle/hash of the document, so it can be fetched/processed later.
The Pub/Sub topic triggers (push) the second cloud function. This cloud function is deployed with a maximum-instances argument of a few dozen (or hundred), depending on the context and requirements. Thus, many cloud function instances are executed in parallel, but each instance is triggered by a message with a unique document id.
Each cloud function instance performs the processing you described, which presumably takes 20 or 30 seconds. As many instances run in parallel, the overall processing time can be much shorter than if everything were done sequentially.
In addition, you might like to keep the state of the process in a Firestore database, using a document id as the Firestore record id. That way, each record reflects the progress of handling one particular document. By doing that, any possible duplication can be eliminated, and a self-healing process can be organised.
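A minimal sketch of this fan-out, assuming a 'doc-processing' topic and hypothetical collectDocumentIds/processDocument helpers:

const functions = require('firebase-functions');
const admin = require('firebase-admin');
const { PubSub } = require('@google-cloud/pubsub');

admin.initializeApp();
const pubsub = new PubSub();

// Step 1: collect the document ids and fan out one message per document.
exports.collectDocs = functions.pubsub
  .schedule('every 24 hours')
  .onRun(async () => {
    const docIds = await collectDocumentIds(); // however you enumerate them
    await Promise.all(docIds.map((id) =>
      pubsub.topic('doc-processing').publishMessage({ json: { id } })
    ));
  });

// Step 2: one message per instance; cap the parallelism with maxInstances.
exports.processDoc = functions
  .runWith({ maxInstances: 50 })
  .pubsub.topic('doc-processing')
  .onPublish(async (message) => {
    const { id } = message.json;
    // Track state per document id, so duplicates can be spotted and
    // stuck documents retried (the self-healing part).
    const stateRef = admin.firestore().collection('processing').doc(id);
    await stateRef.set({ startedAt: admin.firestore.FieldValue.serverTimestamp() });
    await processDocument(id); // the 20-30 second conversion
    await stateRef.update({ status: 'done' });
  });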
The Cloud Functions documentation for Firestore triggers mentions the following limitation:
"Events are delivered at least once, but a single event may result in multiple function invocations. Avoid depending on exactly-once mechanics, and write idempotent functions."
I am writing a system that checks the length of an array, and when it reaches a certain length (100), it is cleared, and some processing is done on the data. I think this is what it means by an "exactly-once" mechanic.
My question is this: is checking

if (change.after.data().array.length === change.before.data().array.length) {
  return;
}

a sufficient way to prevent multiple executions?
Is checking the following a sufficient way to prevent multiple executions?
if (change.after.data().array.length === change.before.data().array.length) { return; }
The answer is no. The Cloud Function could be run multiple times in the case where the array length was 99 and is now 100 (this being the single event mentioned in the doc you referred to).
There is a Blog article on Building idempotent functions which explains that a common way to make a Cloud Function idempotent is "to use the event ID which remains unchanged across function retries for the same event".
You could use this method by saving the event ID when you do the "data processing" you mention in your question, and, in case there is another execution, checking whether that ID was already saved.
Concretely, you can create a Firestore document with the Cloud Function event ID as the document ID and, in your Cloud Function, check within a transaction whether this document exists. If it does not exist, you can proceed: the function was never executed for this event ID.
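A minimal sketch of that check (the processedEvents collection name and the trigger path are assumptions for illustration):

const functions = require('firebase-functions');
const admin = require('firebase-admin');

admin.initializeApp();

exports.onArrayChange = functions.firestore
  .document('items/{itemId}')
  .onUpdate(async (change, context) => {
    const db = admin.firestore();
    // context.eventId stays the same across retries of the same event.
    const eventRef = db.collection('processedEvents').doc(context.eventId);

    await db.runTransaction(async (tx) => {
      const seen = await tx.get(eventRef);
      if (seen.exists) return; // this event was already handled

      // ... do the "data processing" here (ideally via tx writes) ...

      tx.set(eventRef, { processedAt: admin.firestore.FieldValue.serverTimestamp() });
    });
  });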
A simpler solution for your case could be to ensure that, if the Cloud Function is executed multiple times, it results in exactly the same situation in terms of data.
In other words, if the result of the "data processing" is exactly the same when the Cloud Function runs several times for array.length == 100, then your function is idempotent ("the operation results remain unchanged when an operation is applied more than once").
Of course this highly depends on the business logic of your "data processing". If, for example, it involves the time of execution, obviously the end result could be different between two executions.
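For example, with hypothetical Firestore references docRef and logsRef, the first write below is idempotent while the second is not:

async function processArray(array, docRef, logsRef) {
  // Idempotent: running this twice leaves exactly the same data behind.
  await docRef.set({ sum: array.reduce((a, b) => a + b, 0) });

  // Not idempotent: every execution appends one more record.
  await logsRef.add({ processedAt: Date.now() });
}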
I am building an application that must react when the timestamp for a certain Firestore document becomes older than the current time. Is there a way to setup this type of query listener as a Cloud Function, or otherwise achieve the desired goal of reacting to a document when its timestamp crosses the current time?
From what I can tell reading the Firestore and Cloud Functions documentation, query listeners may not be possible to set up as Cloud Functions. Furthermore, this is not just a regular query listener: the query criterion (time) is dynamic, so it isn't the typical static structure ("is A < 5") but a moving one ("is T < now", where "now" changes every moment).
If it's true this is not possible as a query listener, I'd certainly appreciate any suggestions on how to achieve this goal through another means. One idea I had was to create a Cloud Function that triggers every 60 seconds and runs the queries based on the time at that moment, but this would not allow constant listening (and 60 seconds is unfortunately too long for our usage). Thank you so much in advance
Firestore queries can only filter on literal values that are explicitly stored in the documents they return. There's no way to perform a calculation in a query, so any time you need a now in the query, that timestamp will be calculated at the moment the query is created.
There are two common ways to implement the time-to-live type functionality that you describe:
Set up a process that runs periodically (e.g. a time-based Cloud Function), and every time the process runs, perform a query to determine which documents have expired; see the sketch after this list.
As a variant of this, you could start a listener for updates each time the Cloud Function triggers and keep it active for slightly less than the interval until the next trigger.
Create a Cloud Task for each document, scheduled to fire when that document needs to be processed. While this may seem more complex, it actually ends up being simpler because your callbacks now trigger for individual documents.
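A minimal sketch of the first option, assuming each document stores its expiry in an expiresAt timestamp field (the field and collection names are my own):

const functions = require('firebase-functions');
const admin = require('firebase-admin');

admin.initializeApp();

exports.sweepExpired = functions.pubsub
  .schedule('every 1 minutes')
  .onRun(async () => {
    // "now" is fixed at the moment the query is created, as noted above.
    const now = admin.firestore.Timestamp.now();
    const expired = await admin.firestore()
      .collection('items')
      .where('expiresAt', '<=', now)
      .get();

    await Promise.all(expired.docs.map(async (doc) => {
      // ... react to the expired document here ...
      // Clear the timestamp so the next sweep no longer matches this doc.
      await doc.ref.update({ expiresAt: admin.firestore.FieldValue.delete() });
    }));
  });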
Also see: Is there any TTL (Time To Live) for Documents in Firebase Firestore, which includes a link to Doug's excellent article on How to schedule a Cloud Function to run in the future with Cloud Tasks (to build a Firestore document TTL).
I created a function that is triggered when new data is created in the Realtime Database. The problem is that the activity in the nodes can be very frequent, so the function could be called every second. Is there a way to limit the number of triggers within a period? For example, if 100 new records are created consecutively within a minute, the function would be triggered only once, by the last record created, and the other 99 would never be processed. I know I could schedule the function to run every minute instead of triggering it on new data in the db, but this is not efficient when some nodes have very rare activity while others are very frequent.
There is no way to implement what you're describing. Cloud Functions will always trigger for every matching event from the source that it tracks. There is no way to suspend or suppress events. If this is going to be too frequent (by whatever measure you use for frequency), you will need to change the way you write to the database, or come up with some other solution.
We have 20 functions that must run every day. Each of these functions does something different based on inputs from the previous function.
We tried calling all the functions in one function, but it hits the timeout error as these 20 functions take more than 9 minutes to execute.
How can we trigger these multiple functions sequentially, or avoid timeout error for one function that executes each of these functions?
There is no configuration or easy way to get this done. You will have to set up a fair amount of code and infrastructure to get this done.
The most straightforward solution involves chaining together calls using Pub/Sub-triggered functions. You can send a message to a Pub/Sub topic to trigger the next function in line. The payload of the message can carry the parameters that the next function should use to determine how it should operate. If the payload is too big, or more complex data is required to make that decision, you can use a database to store intermediate data that the next function can query and use.
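A rough sketch of one link in such a chain (topic names, payload shape, and doStep1Work are placeholders):

const functions = require('firebase-functions');
const { PubSub } = require('@google-cloud/pubsub');

const pubsub = new PubSub();

exports.step1 = functions.pubsub.topic('step-1').onPublish(async (message) => {
  const params = message.json;            // parameters from the previous step
  const result = await doStep1Work(params);
  // Kick off the next function in the chain, handing it its parameters.
  await pubsub.topic('step-2').publishMessage({ json: result });
});

// step2, step3, ... repeat the pattern, each listening on its own topic.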
Since we don't have any more specific details about how your functions actually work, nothing more specific can be said. If you run into problems with a specific detail of this scheme, please post again describing what specifically you're trying to do and what's not working the way you expect.
There is a variant of Doug's solution. At the end of the function, instead of publishing a message to Pub/Sub, simply write a specific log entry (for example "<function name> end").
Then, go to Stackdriver Logging, search for this specific log trace (turn on advanced filters), and configure a sink of this log entry into a Pub/Sub topic. Thereby, every time the log is detected, a Pub/Sub message is published with the log content.
Finally, plug your next function into this Pub/Sub topic.
If you need to pass values from one function to another, you can simply add these values to the log trace at the end of the function and parse them at the beginning of the next one.
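A rough sketch of the two ends of that hand-off (the marker text, topic name, and doStepAWork are my own; the log sink itself is configured in Stackdriver, not in code):

const functions = require('firebase-functions');

// End of the first function: emit a parseable marker with the values to pass.
exports.stepA = functions.pubsub.topic('step-a').onPublish(async (message) => {
  const result = await doStepAWork(message.json);
  console.log(`stepA end ${JSON.stringify(result)}`);
});

// A Stackdriver sink filtered on "stepA end" exports matching entries to the
// 'step-a-done' topic; the exported message is the LogEntry as JSON.
exports.stepB = functions.pubsub.topic('step-a-done').onPublish(async (message) => {
  const logEntry = message.json;
  const values = JSON.parse(logEntry.textPayload.replace('stepA end ', ''));
  // ... continue the chain with `values` ...
});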
Chaining functions is not an easy thing to do. Things are coming; maybe Google Cloud Next will announce new products to help with this task.
If you simply want the functions to execute in order, and you don't need to pass the result of one directly to the next, you could wrap them in a scheduled function (docs) that spaces them out with enough time for each to run.
Sketch below with 3-minute spacing:

exports.myScheduler = functions.pubsub
  .schedule('every 3 minutes from 22:00 to 23:00')
  .timeZone('Etc/UTC') // keep the schedule and the check below in the same zone
  .onRun(context => {
    // context.timestamp is an ISO 8601 string; grab the HH:MM part.
    const time = new Date(context.timestamp).toISOString().substring(11, 16);
    if (time === '22:00') func1of20();
    else if (time === '22:03') func2of20();
    // etc. through func20of20()
  });
If you do need to pass the results of each function to the next, func1 could store its result in a DB entry; func2 then starts by reading that result and ends by overwriting it with its own, so func3 can read it when fired 3 minutes later, and so on. Though perhaps in this case, the other solutions are more tailored to your needs.
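A minimal sketch of that hand-off through the Realtime Database (the path and doStep2Work are placeholders):

const admin = require('firebase-admin');

async function func2of20() {
  const resultRef = admin.database().ref('pipeline/latestResult');
  // Read the result func1 stored three minutes ago...
  const previous = (await resultRef.once('value')).val();
  const result = await doStep2Work(previous);
  // ...and overwrite it so func3 can pick it up on the next tick.
  await resultRef.set(result);
}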
Is there a performance difference between the following two options for Realtime Database-triggered functions?
One cloud function that listens to all subnodes and decides what to execute based on the path
An entirely separate cloud function for each subnode.
This is assuming total number of function executions stays equal.
If there are multiple events happening at the same time, it may be a problem (from https://cloud.google.com/functions/docs/concepts/exec):
Cloud Functions may start multiple function instances to scale your function up to meet the current load. These instances run in parallel, which results in having more than one parallel function execution. However, each function instance handles only one concurrent request at a time. This means while your code is processing one request, there is no possibility of a second request being routed to the same function instance, and the original request can use the full amount of resources (CPU and memory) that you requested.
Adding to this, the logic for separate cloud functions should be a lot simpler than having one monolithic function checking for each trigger.
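For illustration, the two options might look like this (paths and handler names are invented):

const functions = require('firebase-functions');

// Option 1: one function listening broadly, dispatching on the path.
exports.onAnyCreate = functions.database
  .ref('/{topNode}/{id}')
  .onCreate((snapshot, context) => {
    switch (context.params.topNode) {
      case 'orders': return handleOrder(snapshot);
      case 'users': return handleUser(snapshot);
      default: return null;
    }
  });

// Option 2: a dedicated function per subnode; each stays small and focused.
exports.onOrderCreate = functions.database
  .ref('/orders/{id}')
  .onCreate((snapshot) => handleOrder(snapshot));

exports.onUserCreate = functions.database
  .ref('/users/{id}')
  .onCreate((snapshot) => handleUser(snapshot));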