I have some Cloud Run services and Cloud Functions that parse a large number of files that users upload. Sometimes users upload an exceedingly large number of files, which causes these functions to time out even when I set them to their maximum runtime limits (15 minutes for Cloud Run and 9 minutes for Cloud Functions, respectively). I have a loading icon, backed by a database entry, that shows the progress of processing each batch of uploaded files, so if the function times out, the loading icon for that batch gets stuck in perpetuity, because the database is never updated after the function is killed.
Is there a way for me to create, say, a callback that the Cloud Run/Cloud Functions services invoke to update the database and mark the parsing process as failed when they time out? There is currently no way for me to know a priori whether a batch of files is too large to process, and clearly I cannot use a simple try/catch here, as the execution environment itself gets killed.
One popular method is to expose a public-facing API endpoint that can be invoked with the remaining queued information. You should assume that this endpoint can be abused, so some sort of one-time token should be used to authenticate calls to it. The details depend on factors such as how the files are uploaded and how the cloud trigger is handled, which may require you to store that information in the database so it can be retrieved later.
You can set a flag in the database before you start processing, then clear/delete the flag once processing completes. Then have another function regularly check the status and mark any batch whose flag has been set for longer than the maximum runtime as failed.
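Here is a minimal sketch of that flag + watchdog idea, assuming a Firestore collection called batches with status and startedAt fields (both names invented for this example); the equality-plus-range query would also need a composite index:

```ts
import * as admin from "firebase-admin";
import * as functions from "firebase-functions";

admin.initializeApp();
const db = admin.firestore();

// Before processing: mark the batch as in progress.
export async function markProcessing(batchId: string): Promise<void> {
  await db.doc(`batches/${batchId}`).set(
    {
      status: "processing",
      startedAt: admin.firestore.FieldValue.serverTimestamp(),
    },
    { merge: true }
  );
}

// Scheduled watchdog: anything still "processing" after the maximum
// possible runtime must have been killed, so mark it as failed.
export const reapStaleBatches = functions.pubsub
  .schedule("every 5 minutes")
  .onRun(async () => {
    const cutoff = admin.firestore.Timestamp.fromMillis(
      Date.now() - 16 * 60 * 1000 // a bit longer than the 15-minute limit
    );
    const stale = await db
      .collection("batches")
      .where("status", "==", "processing")
      .where("startedAt", "<", cutoff)
      .get();
    // Batched writes handle up to 500 documents per commit.
    const batch = db.batch();
    stale.docs.forEach((doc) => batch.update(doc.ref, { status: "failed" }));
    await batch.commit();
  });
```

The successful path clears the flag itself at the end of processing; the watchdog only ever touches batches that were abandoned by a killed instance.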
No such callback functionality exists for either product.
Serverless products are generally not meant to be used for batch processing where the batches can easily be larger than the limits of the system. They are meant for small bits of discrete work, such as simple API calls.
If you want to process larger amounts of data, consider first uploading it to Cloud Storage (which accepts very large files), then sending a Pub/Sub message upon completion to a backend compute product that can handle the processing requirements (such as Compute Engine).
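For example, a lightweight trigger can fire when an upload to Cloud Storage finishes and publish a Pub/Sub message that a Compute Engine worker pulls and processes at its own pace. The topic name "parse-jobs" below is an assumption for this sketch:

```ts
import * as functions from "firebase-functions";
import { PubSub } from "@google-cloud/pubsub";

const pubsub = new PubSub();

// Fires once per finished upload; the heavy work happens elsewhere.
export const enqueueParseJob = functions.storage
  .object()
  .onFinalize(async (object) => {
    await pubsub.topic("parse-jobs").publishMessage({
      json: {
        bucket: object.bucket,
        name: object.name,
        size: object.size,
      },
    });
  });
```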
Direct answer. You might be able to achieve this by creating a filtered sink on the relevant Stackdriver (Cloud Logging) logs, where a Cloud Function timeout crash is recorded, so that the matching log entries are pushed into a Pub/Sub topic. On the other side of that topic you can have another Cloud Function that implements the desired functionality, such as marking the batch as failed.
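A sketch of that consumer side, assuming a sink with a filter along the lines of resource.type="cloud_function" AND textPayload:"finished with status: timeout" (verify the exact wording against your own logs) routed to a topic named function-timeouts, and an executions/{executionId} -> batchId mapping that you record yourself at the start of each run (both names invented here):

```ts
import * as admin from "firebase-admin";
import * as functions from "firebase-functions";

admin.initializeApp();
const db = admin.firestore();

export const onFunctionTimeout = functions.pubsub
  .topic("function-timeouts")
  .onPublish(async (message) => {
    const entry = message.json; // the exported LogEntry
    const executionId = entry?.labels?.execution_id;
    if (!executionId) return;

    // Look up which batch this execution was working on, then mark it failed.
    const mapping = await db.doc(`executions/${executionId}`).get();
    const batchId = mapping.get("batchId");
    if (batchId) {
      await db.doc(`batches/${batchId}`).update({ status: "failed" });
    }
  });
```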
Indirect answer. Without context, scope and requirement details it is difficult to provide a good suggestion, but based on some guesses, I am not sure the design is optimal. Serverless services are meant for handling independent and relatively small chunks of data. If you have something large, you might use a first cloud function to divide it into reasonably small chunks, so they can be processed independently by a second cloud function. In your case, can you have a cloud function invocation per file, for example? If a file is too large (a few GB, or dozens of GB), can it be saved to Cloud Storage and read/processed in chunks, with the cloud functions triggered from Cloud Storage? And so on. That approach should help, but has a drawback: complexity increases, as you have to coordinate and track how the overall process is going.
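A fan-out sketch of the "divide into small chunks" idea: a first function splits the batch into one Pub/Sub message per file, and a second function (not shown) parses a single file per invocation. The topic and field names are assumptions:

```ts
import * as functions from "firebase-functions";
import { PubSub } from "@google-cloud/pubsub";

const pubsub = new PubSub();

export const splitBatch = functions.https.onCall(async (data) => {
  // GCS object names the client already uploaded.
  const files: string[] = data.files ?? [];
  await Promise.all(
    files.map((name) =>
      pubsub.topic("parse-single-file").publishMessage({
        json: { batchId: data.batchId, name },
      })
    )
  );
  return { enqueued: files.length };
});
```

Each per-file invocation then comfortably fits inside the function timeout, at the cost of having to track when all of a batch's files are done.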
Related
I recently started a seemingly similar topic here, but I feel that maybe I implied too much in my question by asking how to implement something instead of asking how to solve a specific problem. So here I go, asking from a different angle:
a third-party API (most likely a webhook) sends a .csv file and .docx files (data and template), and a response is sent back as soon as those files are uploaded (no waiting until the documents are processed)
the server merges that data and, whenever the result is ready, sends a response with a download link to the user-specified endpoint
I want to use Firebase products to achieve that
it has to be compatible with typical automation tools like Zapier, Pabbly etc. (it just has to work like a typical webhook)
In my previous question I got quite an interesting answer suggesting Pub/Sub (I almost tried it, but got an error while installing it), but I'm wondering: maybe there is an easier way to solve this?
Like I wrote in my last comment on your other question, if you plan to send heavy files to Cloud Functions, bear in mind that the size limit for data sent to an HTTP Cloud Function is 10 MB (see doc). There is the same limit for the size of messages you can push to Pub/Sub (see doc).
One approach would be to upload the files (data and template) to Cloud Storage and pass their references to the HTTP Cloud Function and, from this one, pass them in the payload of the Pub/Sub message (as explained in the other answer). Then in the Pub/Sub Cloud Function, you read the files from Cloud Storage.
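A sketch of that "pass references, not content" idea: the Pub/Sub-triggered function receives only bucket/object names and downloads the actual files from Cloud Storage, so the 10 MB payload limit is never hit. Topic and field names are assumptions:

```ts
import * as functions from "firebase-functions";
import { Storage } from "@google-cloud/storage";

const storage = new Storage();

export const mergeDocuments = functions.pubsub
  .topic("merge-requests")
  .onPublish(async (message) => {
    const { bucket, csvPath, templatePath } = message.json;
    // Only references travelled through Pub/Sub; the bytes come from GCS.
    const [csvBuffer] = await storage.bucket(bucket).file(csvPath).download();
    const [templateBuffer] = await storage
      .bucket(bucket)
      .file(templatePath)
      .download();
    // ... merge csvBuffer into templateBuffer, upload the result,
    // then notify the user-specified endpoint with a download link.
  });
```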
Another solution to overcome the size limit is to use streams in Cloud Functions. Depending on your application you could stream your data directly back to the client (assuming you are using an HTTP Cloud Function) or to a bucket. If you do this, your Cloud Function will only use a couple of MB of memory. We did that with quite large zip files averaging 2-3 GB.
So this should work in your case.
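A minimal sketch of the streaming approach with an HTTP function: pipe the object from Cloud Storage straight to the response, so only small chunks are held in memory at any time. Bucket and object names are assumptions:

```ts
import * as functions from "firebase-functions";
import { Storage } from "@google-cloud/storage";

const storage = new Storage();

export const downloadResult = functions.https.onRequest((req, res) => {
  const file = storage
    .bucket("my-results-bucket")
    .file(String(req.query.name));
  res.setHeader("Content-Type", "application/zip");
  // The stream pipes chunk by chunk; the function never buffers the whole file.
  file
    .createReadStream()
    .on("error", (err) => res.status(500).end(err.message))
    .pipe(res);
});
```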
I'm trying to figure out an approach that will guarantee a correct count of user-uploaded file sizes from a web client to Firebase Storage.
The core requirements are these:
it must be stable: it must guarantee to count every uploaded file
it must be scalable: I can't read "all stored files" to calculate size at once as the number of files can be huge
it must be secure: I can't rely on the browser; the calculations must be performed on the server.
So far the approach is this:
user of my web app can upload multiple files
those files are uploaded using firebase web client sdk
on the server I listen to functions.storage.object().onFinalize() to get file size for each file
then I update a dedicated document in firestore to add the new size to the total (let's call it totalStorageSizeDoc)
This approach addresses security and scalability, but the problem I see is that when a user uploads a bunch of small files, this can easily trigger lots of functions.storage.object().onFinalize() invocations within one second, and roughly one sustained write per second to a single document is a (soft, but still real) limit in Firestore. At that point the writes to totalStorageSizeDoc will also come in too fast, with the risk of some of those writes being rejected.
Is there a way to easily queue the writes from onFinalize() to ensure totalStorageSizeDoc is not overwhelmed? Or should I take a completely different approach? Maybe there are best practices out there for counting the used storage size that I've missed?
Any advice is much appreciated.
The easiest thing to do would be to enable retries for your function, so that if it fails for whatever reason (like exceeding some limit), then the system will simply retry it until it succeeds.
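A sketch of that retry idea combined with an atomic increment, assuming a document path of meta/totalStorageSize (invented for this example); the failurePolicy option is how a 1st-gen background function opts into retries, so double-check it against the firebase-functions version you use:

```ts
import * as admin from "firebase-admin";
import * as functions from "firebase-functions";

admin.initializeApp();

export const trackUploadSize = functions
  .runWith({ failurePolicy: true }) // retry this background function on failure
  .storage.object()
  .onFinalize(async (object) => {
    const size = Number(object.size ?? 0);
    // The increment is applied server-side, so concurrent invocations
    // don't clobber each other; rejected writes are simply retried.
    await admin
      .firestore()
      .doc("meta/totalStorageSize")
      .set(
        { totalBytes: admin.firestore.FieldValue.increment(size) },
        { merge: true }
      );
  });
```

Note that each increment still counts as a write to that single document, so under very heavy bursts you may also want to shard the counter rather than rely on retries alone.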
In Google Cloud Dataflow (streaming pipeline), your data "bundles" can be re-executed because of failure or speculative execution. Is there any way of knowing that the current bundle/element is a re-execution?
This would be very useful to provide conditional behavior for side-effects (in our case: to help make a datastore update operation (read/write) idempotent).
I don't believe this is something that is offered through the Beam API, but you can avoid the need to know this information through the following mechanisms.
(1) If writes to the external datastore are idempotent, simply introduce a fusion break by adding a Reshuffle transform before the write step. This will make sure that the data to be written is not re-generated when there are failures.
(2) If writes to the external datastore are not idempotent (for example, files or BigQuery), the usual mechanism is to combine (1) with writing to a temporary location first. When all (parallel) writes to the temporary location are finished, the results can be finalized in an idempotent and failure-safe way from a single worker.
Many Beam sinks utilize these mechanisms to write to external data stores in an idempotent manner. For streaming, usually, these operations are performed per window.
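This is not Beam code (Beam pipelines are usually written in Java or Python); it only illustrates what "idempotent write" means for option (1): derive the key deterministically from the element, so a re-executed bundle overwrites the same entity instead of creating a duplicate. Kind and field names are invented for the example:

```ts
import { Datastore } from "@google-cloud/datastore";

const datastore = new Datastore();

export async function upsertResult(element: {
  id: string;
  total: number;
}): Promise<void> {
  // Deterministic key: re-running the same element writes the same entity.
  const key = datastore.key(["ParseResult", element.id]);
  await datastore.save({ key, data: { total: element.total } }); // save() is an upsert
}
```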
As per the documentation, Firebase Functions are currently supported in 4 regions only: us-central1, us-east1, europe-west1 and asia-northeast1.
That means locations further away would incur more latency, and often that translates to lower performance.
How can this limitation be worked around?
1) Choose a location that is closest to you. You can set up test Cloud Functions in different regions and test the round-trip latency. Only you can discover the specifics about your location.
2) Focus your software architecture on infrastructure that is locally available.
Use the client-side Firestore library directly as much as possible. It supports offline data, queueing writes to send later if you don't have internet, and caching read data locally (you can't get lower latency than that), so make sure you use Firestore for CRUD operations. See the sketch after this list.
3) Architect to use Cloud Functions for batch and background processing. If any business-logic processing is required, write the data to Firestore (using the client libraries) and have a Cloud Functions trigger do some processing upon the write event. Have that trigger update the record with the result of the additional processing and its state. If you're using the client-side libraries, the updated data is automatically pushed back to the client through a snapshot listener (also shown in the sketch after this list).
You also get the bonus of being able to control authorisation with Firebase Auth and Firestore security rules, whereas Cloud Functions run with admin-level access and don't have that declarative authorisation control.
4) Reduce chatter: minimise the number of Cloud Function calls overall, and ensure your Cloud Functions do more in one go and return more complete data per call.
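A sketch tying points 2) and 3) together with the v9 web SDK: the client enables offline persistence, writes a "job" document, and listens for the Cloud Function to attach its result. Collection and field names are assumptions:

```ts
import { initializeApp } from "firebase/app";
import {
  getFirestore,
  enableIndexedDbPersistence,
  collection,
  addDoc,
  onSnapshot,
  serverTimestamp,
} from "firebase/firestore";

const app = initializeApp({ /* your Firebase config */ });
const db = getFirestore(app);

// Local cache + offline queueing: reads and queued writes stay fast even on a
// poor connection, regardless of which region the backend lives in.
enableIndexedDbPersistence(db).catch(() => {
  /* persistence unavailable (e.g. multiple tabs); the app still works online */
});

export async function submitJob(payload: unknown): Promise<void> {
  // Write the data; a Firestore-triggered Cloud Function picks it up in the
  // background, does the heavy processing, and updates the same document.
  const ref = await addDoc(collection(db, "jobs"), {
    payload,
    state: "pending",
    createdAt: serverTimestamp(),
  });

  // The updated document (state, result) is pushed back automatically.
  onSnapshot(ref, (snap) => {
    if (snap.get("state") === "done") {
      console.log("result:", snap.get("result"));
    }
  });
}
```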
My application needs to build a couple of large hashmaps before processing a user's request. Ideally I want to store these hashmaps in memory on the machine, which means it never has to redo the expensive processing and can handle any incoming request quickly.
But this doesn't work with Firebase because there's a chance a user triggers a new (cold) instance, which sets off the very time-consuming preprocessing step again.
So, I tried designing my application to use the firebase database, and get only the data it needs from the database each time instead of holding all the data in-memory. But, since the cloud functions are downloading loads of data from the database, I have now triggered over 1.7 GB in download for this month, just by myself from testing. This goes over the quota.
There must be something I'm missing; all I want is permanent in-memory storage of some hashmaps, so that they are ready by the time the function is called with a request. It seems like such a simple requirement; how come there is no way to do this?
If you want to store data in the container that runs your Cloud Functions, you can use its local /tmp file system, which is actually a tmpfs kept in memory. But this will disappear when the container is recycled, which happens when your function hasn't been accessed for a while. So this local file system will have to be rebuilt whenever a new container spins up.
If you want permanent storage of values you generate, consider using Google Cloud Storage. It is probably a more cost-effective option, and definitely the most scalable one.
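A sketch of the "build once per container" pattern that combines both points: keep the hashmaps in a module-level variable and lazily load a prebuilt snapshot from Cloud Storage on cold start, so warm invocations pay nothing and cold starts pay one download instead of re-reading the database. Bucket and object names are assumptions:

```ts
import * as functions from "firebase-functions";
import { Storage } from "@google-cloud/storage";

const storage = new Storage();

// Survives between invocations for as long as this container instance lives.
let maps: Record<string, Record<string, string>> | null = null;

async function getMaps(): Promise<Record<string, Record<string, string>>> {
  if (!maps) {
    // Cold start: fetch the precomputed snapshot once.
    const [contents] = await storage
      .bucket("my-precomputed-data")
      .file("hashmaps.json")
      .download();
    maps = JSON.parse(contents.toString());
  }
  return maps!;
}

export const lookup = functions.https.onRequest(async (req, res) => {
  const m = await getMaps();
  res.json({ value: m["someMap"]?.[String(req.query.key)] ?? null });
});
```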