I recently started a seemingly similar topic here, but I feel that I implied too much in my question by asking how to implement something instead of asking how to solve a specific problem. So here I go, asking from a different angle:
a third-party API (most likely a webhook) sends a .csv file and a .docx file (data and template) and gets a response as soon as those files are uploaded (no waiting until the documents are processed)
the server merges that data and, whenever the result is ready, sends a response with a download link to the user-specified endpoint
I want to use Firebase products to achieve that
it has to be compatible with typical automation tools like Zapier, Pabbly etc. (it just has to work like typical webhook)
In my previous question I got quite an interesting answer suggesting Pub/Sub (I almost tried it, but got an error while installing it), but I'm wondering whether there is some easier way to solve this.
As I wrote in my last comment on your other question, if you plan to send heavy files to Cloud Functions, bear in mind that the size limit for data sent to an HTTP Cloud Function is 10MB (See doc). The same limit applies to the size of messages you can push to Pub/Sub. (See doc).
One approach would be to upload the files (data and template) to Cloud Storage and pass their references to the HTTP Cloud Function and, from this one, pass them in the payload of the Pub/Sub message (as explained in the other answer). Then in the Pub/Sub Cloud Function, you read the files from Cloud Storage.
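A minimal sketch of that idea, written in Python purely for illustration: the message carries only references to the uploaded files, so it stays far below the 10MB limit. The field names, bucket name, paths and callback URL are all hypothetical, and the actual upload and publish calls are left out.

```python
import json

PUBSUB_MAX_BYTES = 10 * 1024 * 1024  # the 10MB message limit mentioned above


def build_merge_message(bucket, data_path, template_path, callback_url):
    """Encode a message that references the uploaded files instead of embedding them."""
    message = {
        "bucket": bucket,              # hypothetical field names
        "data": data_path,
        "template": template_path,
        "callbackUrl": callback_url,
    }
    encoded = json.dumps(message).encode("utf-8")
    if len(encoded) > PUBSUB_MAX_BYTES:
        raise ValueError("message exceeds the Pub/Sub size limit")
    return encoded


# Hypothetical values, for illustration only.
payload = build_merge_message(
    "my-bucket", "uploads/data.csv", "uploads/template.docx", "https://example.com/hook"
)
```

The Pub/Sub-triggered function would then decode the message and read both objects from Cloud Storage.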
Another way to get around the file size limit is to use streams in Cloud Functions. Depending on your application, you could stream your data directly back to the client (assuming you are using an HTTP Cloud Function) or to a bucket. If you do this, your Cloud Function will only use a couple of MB of memory. We did that with quite a large zip file, 2-3 GB on average.
So this should work in your case.
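To illustrate why streaming keeps memory usage low, here is a small sketch in Python using in-memory streams as stand-ins for the real source and destination (an HTTP response or a bucket writer). The chunk size is an arbitrary assumption; only one chunk is held in memory at a time.

```python
import io

CHUNK_SIZE = 256 * 1024  # assumed chunk size, not a documented value


def stream_copy(source, sink, chunk_size=CHUNK_SIZE):
    """Copy source to sink chunk by chunk and return the number of bytes copied."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the source is exhausted
            break
        sink.write(chunk)
        total += len(chunk)
    return total


# In-memory stand-ins for a large file and its destination.
source = io.BytesIO(b"x" * (1024 * 1024))
sink = io.BytesIO()
copied = stream_copy(source, sink)
```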
My company has a larger customer using one of our HTTP Cloud Functions to fetch a large body of data in the response for his Microsoft PowerBI integration. The trouble is, I routinely run into the 10MB response size restriction for this guy and soon others.
Do I need to set up a dedicated API server since, with a low HTTP body length requirement documented at https://firebase.google.com/docs/functions/quotas, it seems that server-less API deployment doesn't cover my use case...?
We are heavily invested into Firebase and so far Cloud Functions have been working gloriously for every requirement we have needed to deliver on. I'm far more into programming apps and software than I am a full devops fellow, so any direction the community could steer me in would be very valuable because I'm not sure what new service to spin up on GCP and rebuild these endpoints into.
Thank you!
If you need to return a payload larger than Cloud Functions' maximum response size, consider writing that data to a file in Cloud Storage, and then returning the path to that file to the caller from the Cloud Function call.
Also see:
How to increase the max http request size limit for HTTP triggers in Cloud Functions
Cloud Function - Getting file contents more than 10 MB
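A rough sketch of the "return a path instead of the data" suggestion, in Python for illustration. The `upload` callable stands in for whatever writes to Cloud Storage, and the object path and response shape are hypothetical; a plain dict plays the role of the bucket here.

```python
import json

MAX_RESPONSE_BYTES = 10 * 1024 * 1024  # the 10MB limit discussed above


def build_response(rows, upload, object_path, max_bytes=MAX_RESPONSE_BYTES):
    """Return the rows inline when small enough, otherwise store them and return the path."""
    body = json.dumps(rows).encode("utf-8")
    if len(body) <= max_bytes:
        return {"inline": True, "data": rows}
    upload(object_path, body)  # stand-in for a Cloud Storage write
    return {"inline": False, "path": object_path}


fake_bucket = {}  # dict used as a stand-in for a bucket
small = build_response([{"id": 1}], fake_bucket.__setitem__, "exports/report.json")
large = build_response(
    [{"id": i} for i in range(100)],
    fake_bucket.__setitem__,
    "exports/report.json",
    max_bytes=64,  # artificially low limit to exercise the fallback
)
```

The caller then checks the `inline` flag and, when it is false, downloads the file from the returned path.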
I have some cloud run and cloud functions that serve to parse a large number of files that users upload. Sometimes users upload an exceedingly large number of files, and that causes these functions to timeout even when I set them to their maximum runtime limits (15 minutes for Cloud Run and 9 minutes for Cloud Functions respectively.) I have a loading icon corresponding to a database entry that shows the progress of processing each batch of files that's been uploaded, and so if the function times out currently, the loading icon gets stuck for that batch in perpetuity, as the database is not updated after the function is killed.
Is there a way for me to create say a callback function to the Cloud Run/Functions to update the database and indicate that the parsing process failed if the Cloud Run/Functions timed out? There is currently no way for me to know a priori if the batch of files is too large to process, and clearly I cannot use a simple try/catch here as the execution environment itself will be killed.
One popular method is to have a public-facing API location that you can invoke by passing on the remaining queued information. You should assume that this API location is compromised so some sort of OTP should be used. This does depend on some factors, such as how these files are uploaded or the cloud trigger was handled which may require you to store that information in a database location to be retrieved.
You can set a flag on the db before you start processing, then after processing, clear/delete the flag. Then have another function regularly check for the status.
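The flag idea could look roughly like this, sketched in Python with a plain dict standing in for the database. The field names and the 15 minute threshold are assumptions; the threshold should be at least as long as the function's own timeout so that a batch still running is not marked as failed.

```python
STALE_AFTER_SECONDS = 15 * 60  # assumed: the longest runtime limit mentioned in the question


def mark_stale_batches_failed(batches, now, stale_after=STALE_AFTER_SECONDS):
    """Mark batches still 'processing' after the threshold as failed; return their ids."""
    failed = []
    for batch_id, batch in batches.items():
        if batch["status"] == "processing" and now - batch["started_at"] > stale_after:
            batch["status"] = "failed"
            failed.append(batch_id)
    return failed


# Dict stand-in for the database entries behind the loading icons.
batches = {
    "batch-a": {"status": "processing", "started_at": 0},
    "batch-b": {"status": "processing", "started_at": 900},
    "batch-c": {"status": "done", "started_at": 0},
}
failed = mark_stale_batches_failed(batches, now=1000)
```

The checking function would run on a schedule, separately from the function that does the parsing.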
No such callback functionality exists for either product.
Serverless products are generally not meant to be used for batch processing where the batches can easily be larger than the limits of the system. They are meant for small bits of discrete work, such as simple API calls.
If you want to process larger amounts of data, consider first uploading that to Cloud Storage (which will accept files of any size), then sending a Pub/Sub message upon completion to a backend compute product that can handle the processing requirements (such as Compute Engine).
Direct answer. For example, you might be able to achieve that by filtering and creating a sink in the relevant StackDriver logs (where a cloud function timeout crash is to be recorded), so that the relevant log records are pushed into some PubSub topic. On the other side of that topic you may have some other cloud function, which can implement the desired functionality.
Indirect answer. Without context, scope and requirement details it is difficult to provide a good suggestion, but based on some guesses I am not sure that the design is optimal. Serverless services are supposed to be used for handling independent and relatively small chunks of data. If you have something large, you might like to use a first cloud function to divide it into reasonably small chunks, so that they can be processed independently by a second cloud function. In your case, can you have a cloud function invocation per file, for example? If a file is too large (a few GB, or a dozen GB), can it be saved to Cloud Storage and read/processed in chunks, so that the cloud functions are triggered from Cloud Storage? And so on. That approach should help, but it has a drawback: complexity increases, as you have to coordinate and control how the process is going.
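The "divide it into reasonably small chunks" step could be as simple as the following Python sketch. The chunk size is an assumption that would have to be tuned so that one chunk comfortably fits inside a single invocation's time limit.

```python
def split_into_chunks(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


# Hypothetical file names; each chunk would go to one worker invocation.
files = [f"file-{n}.csv" for n in range(7)]
chunks = split_into_chunks(files, 3)
```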
As per the documentation, Firebase Functions are currently supported in 4 regions only: "us-central1", "us-east1", "europe-west1", "asia-northeast1"
That means locations further away would incur more latency, and often that translates to lower performance.
How can this limitation be worked around?
1) Choosing a location that is closest to you. You can set up test cloud functions in different regions, and test the round-trip latency. Only you can discover the specifics about your location.
2) Focus your software architecture on infrastructure that is locally available.
Use the client-side Firestore library directly as much as possible. It supports offline data, queueing data to send out later if you don't have internet, and caching read data locally - you can't get faster latency than that! So make sure you use Firestore for CRUD operations.
3) Architect to use Cloud Functions for batch and background processing. If any business-logic processing is required, write the data to Firestore (using client libraries), and have a Firebase Functions trigger do some processing upon the write event. Have that trigger update that record with the additional processing and state. I believe that if you're using the client-side libraries there is a way to have the updated data automatically pushed back to the client side.
You also have the bonus benefit of being able to control authorisation with Firestore Auth, where Functions don't have an admin-level authorisation control.
4) Reduce chatter - minimising the amount of CloudFunction calls overall, and ensuring your CloudFunctions themselves do more in one go and return more complete data in one go.
Let's say I have a Cloud Firebase Function - called by a cron job - that produces 30+ tasks every time it's invoked.
These tasks are quite slow (5-6 seconds each on average) and I can't process them directly in the original function because it would time out.
So, the solution would be invoking another "worker" function, once per task, to complete the tasks independently and write the results in a database. So far I can think of three strategies:
Pubsub messages. That would be amazing, but it seems that you can only listen on pubsub messages from within a Cloud Function, not create one. Resorting to external solutions, like having a GAE instance, is not an option for me.
Call the worker HTTP-triggered Firebase Cloud Function from the first one. That won't work, I think, because I would need to wait for a response from all the invoked worker functions after they finish, and my original Function would time out.
Append tasks to a real time database list, then have a worker function triggered by each database change. The worker has to delete the task from the queue afterwards. That would probably work, but it feels there are a lot of moving parts for a simple problem. For example, what if the worker throws? Another cron to "clean" the db would be needed etc.
Another solution that comes to mind is firebase-queue, but its README explicitly states:
"There may continue to be specific use-cases for firebase-queue, however if you're looking for a general purpose, scalable queueing system for Firebase then it is likely that building on top of Google Cloud Functions for Firebase is the ideal route"
It's not officially supported and they're practically saying that we should use Functions instead (which is what I'm trying to do). I'm a bit nervous about using a library in production that might be abandoned tomorrow (if it isn't already), and I would like to avoid going down that route.
Sending Pub/Sub messages from Cloud Functions
Cloud Functions are run in a fairly standard Node.js environment. Given the breadth of the Node/NPM ecosystem, the amount of things you can do in Cloud Functions is quite broad.
it seems that you can only listen on pubsub messages from within a Cloud Function, not create one
You can publish new messages to Pub/Sub topics from within Cloud Functions using the regular Node.js module for Pub/Sub. See the Cloud Pub/Sub documentation for an example.
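As a language-neutral illustration of the fan-out (sketched in Python rather than Node.js), the scheduling function only needs to publish one small message per task and can return without waiting for the workers. The `publish` callable is a stand-in for the Pub/Sub client's publish call, and the task shape is hypothetical.

```python
import json


def fan_out(tasks, publish):
    """Publish one message per task and return how many were published."""
    for task in tasks:
        publish(json.dumps(task).encode("utf-8"))  # publish is an injected stand-in
    return len(tasks)


# A list plays the role of the topic; each entry would trigger one worker.
outbox = []
tasks = [{"taskId": n} for n in range(30)]
published = fan_out(tasks, outbox.append)
```

Each published message then triggers its own worker invocation, so the 5-6 second tasks no longer run inside the cron-triggered function.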
Triggering new actions from Cloud Functions through Database writes
This is also a fairly common pattern. I usually have my subprocesses/workers clean up after themselves at the same moment they write their result back to the database. This works fine in my simple scenarios, but your mileage may of course vary.
If you're having a concrete cleanup problem, post the code that reproduces the problem and we can have a look at ways to make it more robust.
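One way to read "clean up at the same moment they write their result" is a single multi-path update that stores the result and removes the queue entry together. In the Realtime Database, writing null to a path in an update deletes it. The Python sketch below imitates that with a flat dict keyed by path; the `queue/` and `results/` paths are hypothetical.

```python
def apply_update(db, update):
    """Imitate a multi-path update: None deletes the path, anything else sets it."""
    for path, value in update.items():
        if value is None:
            db.pop(path, None)
        else:
            db[path] = value


def complete_task(db, task_id, result):
    """Write the result and remove the queued task in one update."""
    apply_update(db, {
        f"results/{task_id}": result,
        f"queue/{task_id}": None,  # None plays the role of null here
    })


# Flat dict stand-in for the database.
db = {"queue/t1": {"url": "https://example.com"}, "queue/t2": {"url": "https://example.org"}}
complete_task(db, "t1", {"ok": True})
```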
We have an application that uses base64 encoded content to transmit attachments to backend. Backend then moves the content to Storage after some manipulation. This way we can enjoy world class offline support and sync and at the same time use the much cheaper Storage to store the files in the end.
Initially we used updateChildren to set the content in one go. This works fairly well, but then users started to upload bigger and more files at the same time, resulting in silent freezing of the database in the end user devices.
We then changed the code to write the files one by one using FirebaseDatabase.getInstance().getReference("/full/uri").setValue(base64stuff), and then using updateChildren to only set the metadata.
This allowed a seemingly endless number of files (provided that each is chopped into chunks of at most 9 MB), but now we're facing another problem.
Our backend uses Firebase listener to start working once new content is available. The trigger waits for the metadata and then starts to process the attachments. It seems that even though the client device writes the files before we set the metadata, the backend usually receives the metadata before the content from the files is available. This forced us to change backend code to stop processing and check later again if the attachment base64 data is available.
This works, but is not elegant and wastes cpu cycles and increases latencies.
I haven't found anything in the docs on whether Firebase guarantees anything about the order in which the data is received by the backend. It seems that everything written in one go (using setValue or updateChildren) is available in the backend as one atomic unit.
Is this correct? Can I depend on that as a fact that will not change in the future?
The way I'm going to go about this (if the assumptions are correct above) is to write metadata first using updateChildren in the client like this
"/uri/of/metadata/uid/attachments/attachment_uid1" = "per attachment metadata"
"/uri/of/metadata/uid/attachments/attachment_uid2" = "per attachment metadata"
and then each base64 chunk using updateChildren with following payload:
"/uri/of/metadata/uid/uploaded_attachments/attachment_uid2" = true
"/uri/of/base64/content/attachment_uid" = "base64content"
I can't use setValue for any of the data, to prevent accidental overwrites depending on the order in which the writes happen in the end.
This would allow me to listen to /uri/of/base64/content and try to start handling the metadata package every time a new attachment finishes loading. The only thing needed to determine whether all files have already been uploaded is to grab the metadata and check that all attachment uids found under /attachments/ are also present under /uploaded_attachments/.
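The completeness check described above could be sketched like this in Python, with a dict standing in for the metadata node. The key names follow the paths in the question; everything else is an assumption.

```python
def all_attachments_uploaded(metadata):
    """True when every uid under 'attachments' is flagged under 'uploaded_attachments'."""
    expected = set(metadata.get("attachments", {}))
    uploaded = {
        uid for uid, done in metadata.get("uploaded_attachments", {}).items() if done
    }
    # An empty attachment list is treated as "not ready" rather than "complete".
    return bool(expected) and expected <= uploaded


partial = {
    "attachments": {"attachment_uid1": "meta", "attachment_uid2": "meta"},
    "uploaded_attachments": {"attachment_uid2": True},
}
complete = {
    "attachments": {"attachment_uid1": "meta", "attachment_uid2": "meta"},
    "uploaded_attachments": {"attachment_uid1": True, "attachment_uid2": True},
}
```

The backend listener would run this check each time a new content chunk arrives and start processing only when it returns true.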
Writes from a single Firebase Database client are delivered to the server in the same order as they are executed on the client. They are also broadcast out to any listening clients in the same order.
There is no chance that another client will see the results of write B without seeing the results from write A (unless A was rejected by security rules)