I'm trying to think of the best (read: automated, cheapest, and easiest to use) way to back up Firestore data for a production app.
I'm aware I could automate exports through a scheduled Cloud Function and send them to a Google Cloud Storage bucket. The problem I have with this approach is that it does not allow for "incremental updates of the new and updated documents", only for backing up entire collections. This means that most of the data is backed up every single time, even though it hasn't changed since the last backup, which drives the cost up for no reason.
The approach that came to mind was having a cloud function in "my-app" project that would listen to each and every change in the Firestore, and perform the same change in the Firestore of the "my-app-backup" project.
This way, I only back up the changed data. Furthermore, backed up data would never become stale (as it's backed up in real-time), unlike the first approach where automated backups happen e.g. daily or weekly.
Is it even possible for a single Cloud Function in the first Firebase project to write data into another Firebase project? If not, could it write the data elsewhere (not in another Firebase project)? Does the approach make sense, or do you have a better suggestion?
If you want to export only updated documents, you can store an updatedAt field on each document and query where("updatedAt", ">", lastExportTime), where lastExportTime is the timestamp of the previous export. You can then periodically run a Cloud Function to export these documents. This should cost only N reads (N = number of updated documents) each time the function runs.
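The selection step described above can be sketched as follows. This is a minimal in-memory model, not Firestore code: `BackupDoc`, `selectChangedSince`, and `runIncrementalExport` are hypothetical names, and `updatedAt` is assumed to be stored as milliseconds since the epoch.

```typescript
// Hypothetical in-memory stand-in for the documents of one collection.
interface BackupDoc {
  id: string;
  updatedAt: number; // assumed: milliseconds since epoch, set on every write
  data: Record<string, unknown>;
}

// Keep only documents changed after the previous export,
// mirroring where("updatedAt", ">", lastExportTime).
function selectChangedSince(docs: BackupDoc[], lastExportTime: number): BackupDoc[] {
  return docs.filter((d) => d.updatedAt > lastExportTime);
}

// One export run: pick the changed documents and advance the checkpoint.
function runIncrementalExport(
  docs: BackupDoc[],
  lastExportTime: number
): { exported: BackupDoc[]; nextExportTime: number } {
  const exported = selectChangedSince(docs, lastExportTime);
  // Advance the checkpoint to the newest timestamp actually exported,
  // so documents written while the run is in progress are not skipped.
  const nextExportTime = exported.reduce(
    (latest, d) => Math.max(latest, d.updatedAt),
    lastExportTime
  );
  return { exported, nextExportTime };
}

const sampleDocs: BackupDoc[] = [
  { id: "a", updatedAt: 1000, data: { title: "old" } },
  { id: "b", updatedAt: 2500, data: { title: "edited" } },
  { id: "c", updatedAt: 3000, data: { title: "new" } },
];
const exportRun = runIncrementalExport(sampleDocs, 2000);
console.log(exportRun.exported.map((d) => d.id), exportRun.nextExportTime);
```

One limitation of this approach is that deleted documents never match the query, so deletions would have to be tracked separately (for example with a soft-delete flag).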
Regarding "Furthermore, backed up data would never become stale (as it's backed up in real-time)":
This works too, but it can also get expensive if document updates are very frequent, since every change triggers a function invocation and a write to the backup.
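To make the real-time mirroring idea concrete, here is a minimal sketch of the logic such a trigger would run. It models the backup as an in-memory `Map` rather than a second Firestore project; `MirrorChange` and `applyToBackup` are hypothetical names, and the `before`/`after` shape is only an assumption about what a change event carries.

```typescript
// Hypothetical change event: `after` is undefined when the document was deleted,
// `before` is undefined when it was just created.
interface MirrorChange {
  path: string;
  before?: Record<string, unknown>;
  after?: Record<string, unknown>;
}

// Replay one change against the backup store.
function applyToBackup(
  backup: Map<string, Record<string, unknown>>,
  change: MirrorChange
): void {
  if (change.after === undefined) {
    // Document was deleted in the source: remove it from the backup too.
    backup.delete(change.path);
  } else {
    // Created or updated: overwrite the backup copy with the latest state.
    backup.set(change.path, { ...change.after });
  }
}

const backupStore = new Map<string, Record<string, unknown>>();
applyToBackup(backupStore, { path: "users/u1", after: { name: "Ann" } });
applyToBackup(backupStore, {
  path: "users/u1",
  before: { name: "Ann" },
  after: { name: "Ann B." },
});
console.log(backupStore.get("users/u1"));
```

Each source write costs one invocation plus one backup write in this model, which is why frequent updates make it expensive.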
I have some Cloud Run services and Cloud Functions that parse a large number of files that users upload. Sometimes users upload an exceedingly large number of files, and that causes these functions to time out even when I set them to their maximum runtime limits (15 minutes for Cloud Run and 9 minutes for Cloud Functions, respectively). I have a loading icon tied to a database entry that shows the progress of processing each uploaded batch of files, so if the function times out, the loading icon stays stuck for that batch in perpetuity, because the database is not updated after the function is killed.
Is there a way for me to create, say, a callback for Cloud Run/Functions that updates the database and indicates that the parsing process failed if the Cloud Run service or Cloud Function timed out? There is currently no way for me to know a priori whether the batch of files is too large to process, and clearly I cannot use a simple try/catch here, as the execution environment itself is killed.
One popular method is to have a public-facing API endpoint that the function can invoke, passing on the remaining queued work. You should assume that this endpoint may be compromised, so some sort of OTP should be used. This does depend on some factors, such as how the files are uploaded or how the cloud trigger is handled, which may require you to store that information in a database location to be retrieved later.
You can set a flag on the db before you start processing, then after processing, clear/delete the flag. Then have another function regularly check for the status.
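The flag-and-check approach above could look roughly like this. It is a sketch over plain objects rather than a real database; `BatchJob`, `BatchState`, and `markTimedOutJobs` are hypothetical names, and the periodic checker is assumed to run with a `maxRuntimeMs` slightly above the function's timeout.

```typescript
type BatchState = "processing" | "done" | "failed";

// Hypothetical database entry: the flag is set to "processing" before work
// starts and changed to "done" when the function finishes normally.
interface BatchJob {
  id: string;
  state: BatchState;
  startedAt: number; // milliseconds since epoch
}

// Run periodically by a separate function: any job still flagged as
// "processing" after the maximum runtime must have been killed, so mark it failed.
function markTimedOutJobs(jobs: BatchJob[], now: number, maxRuntimeMs: number): BatchJob[] {
  return jobs.map((job): BatchJob =>
    job.state === "processing" && now - job.startedAt > maxRuntimeMs
      ? { ...job, state: "failed" }
      : job
  );
}

const NINE_MINUTES_MS = 9 * 60 * 1000;
const checkedJobs = markTimedOutJobs(
  [
    { id: "batch-1", state: "processing", startedAt: 0 },
    { id: "batch-2", state: "processing", startedAt: 8 * 60 * 1000 },
  ],
  10 * 60 * 1000,
  NINE_MINUTES_MS
);
console.log(checkedJobs.map((j) => j.id + ":" + j.state));
```

With this in place, the loading icon can switch to an error state as soon as the checker marks the batch as failed.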
No such callback functionality exists for either product.
Serverless products are generally not meant to be used for batch processing where the batches can easily be larger than the limits of the system. They are meant for small bits of discrete work, such as simple API calls.
If you want to process larger amounts of data, consider first uploading it to Cloud Storage (which will accept files of any size), then sending a Pub/Sub message upon completion to a backend compute product that can handle the processing requirements (such as Compute Engine).
Direct answer. For example, you might be able to achieve that by filtering and creating a sink in the relevant StackDriver logs (where a cloud function timeout crash is to be recorded), so that the relevant log records are pushed into some PubSub topic. On the other side of that topic you may have some other cloud function, which can implement the desired functionality.
Indirect answer. Without context, scope, and requirement details, it is difficult to provide a good suggestion, but based on some guesses, I am not sure that the design is optimal. Serverless services are supposed to be used for handling independent and relatively small chunks of data. If you have something large, you might use a first cloud function to divide it into reasonably small chunks, so that they can be processed independently by a second cloud function. In your case, can you have one cloud function invocation per file, for example? If a file is too large (a few GB, or a dozen GB), can it be saved to Cloud Storage and read/processed in chunks, so that the cloud functions are triggered from Cloud Storage? And so on. That approach should help, but it has a drawback: complexity increases, as you have to coordinate and control how the process is going.
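The "divide it into reasonably small chunks" step can be sketched with a small helper. `splitIntoChunks` is a hypothetical name, and the chunk size of 3 is purely illustrative; in practice it would be tuned so that one chunk comfortably fits inside the function's time limit.

```typescript
// Split a list of work items (for example, uploaded file names) into
// fixed-size chunks that can each be handled by a separate invocation.
function splitIntoChunks<T>(items: T[], chunkSize: number): T[][] {
  if (chunkSize < 1 || Math.floor(chunkSize) !== chunkSize) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    chunks.push(items.slice(i, i + chunkSize));
  }
  return chunks;
}

const uploadedFiles = ["f1", "f2", "f3", "f4", "f5", "f6", "f7"];
const fileChunks = splitIntoChunks(uploadedFiles, 3);
console.log(fileChunks.length, fileChunks.map((c) => c.length));
```

Recording the number of chunks alongside a count of completed chunks is one possible way to drive a progress indicator, at the cost of the extra coordination mentioned above.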
I need a bit of expert advice. I'm using Firebase Cloud Functions to automate a few things (using this brilliant Node.js package: "https://github.com/jdgamble555/adv-firestore-functions").
It runs on the onWrite trigger. As I understand it, onWrite fires the function whenever a document, or a child node within a document, is created, updated, or deleted. The package takes care of many things, but my concern is: does executing the functions many times do any harm? I'm already using condition checks to make sure Firestore is not hit unless required. So all the functions execute, as I can see in the log, but they do not write to or update the Firestore database unless required.
I'm worried that if all functions execute all the time, I will reach my limits quickly (right now I'm testing on the Firebase emulator), especially when the user base increases.
Can anything be done to reduce these calls, or is this normal?
As per the Firebase documentation, the first 2 million invocations per month are free; after that, it is $0.40 for every million invocations. Also, there is a resource limit for each function call. https://cloud.google.com/functions/pricing#free_tier
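As a rough worked example of the invocation pricing quoted above, the following sketch uses only the two figures from this answer (2 million free invocations, $0.40 per additional million). It covers invocation count only; compute time, memory, and networking are billed separately, and current prices should be checked on the pricing page. `monthlyInvocationCostUsd` is a hypothetical helper name.

```typescript
// Figures taken from the answer above; they may have changed since.
const FREE_INVOCATIONS_PER_MONTH = 2000000;
const PRICE_PER_MILLION_USD = 0.4;

// Estimated monthly charge for invocations alone.
function monthlyInvocationCostUsd(invocations: number): number {
  const billable = Math.max(0, invocations - FREE_INVOCATIONS_PER_MONTH);
  return (billable / 1000000) * PRICE_PER_MILLION_USD;
}

// 5 million invocations: 3 million are billable, roughly $1.20.
console.log(monthlyInvocationCostUsd(5000000).toFixed(2));
```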
To my knowledge and in my experience, this is normal. Just make sure that your code does not cause infinite function calls or unbounded database reads/writes.
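One common way to avoid the infinite-call problem is to make the trigger's write idempotent: compute what the derived field should be and skip the write when it already has that value. The sketch below shows only that decision logic; `ProfileDoc`, `displayNameLower`, and `computeDerivedUpdate` are hypothetical names, not part of the package mentioned in the question.

```typescript
// Hypothetical document with one derived field maintained by the trigger.
interface ProfileDoc {
  displayName: string;
  displayNameLower?: string;
}

// Returns the fields to write, or null when no write is needed.
// Returning null on the second pass is what stops the trigger from
// re-firing itself forever.
function computeDerivedUpdate(after: ProfileDoc | undefined): Partial<ProfileDoc> | null {
  if (after === undefined) {
    return null; // document was deleted: nothing to maintain
  }
  const expected = after.displayName.toLowerCase();
  if (after.displayNameLower === expected) {
    return null; // already up to date: skip the write
  }
  return { displayNameLower: expected };
}

console.log(computeDerivedUpdate({ displayName: "Ada Lovelace" }));
console.log(computeDerivedUpdate({ displayName: "Ada Lovelace", displayNameLower: "ada lovelace" }));
```

The function still gets invoked for the second write, so the invocation is counted, but it exits without touching the database.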
I'm also using cloud functions for my social media platform which also uses triggers to execute & write to the database based on conditions. It never went beyond the free quota.
I'm trying to understand how Firebase Realtime Database uses its cache. The documentation doesn't clarify some cases of cache handling. For Flutter especially, there is no documentation, and online sources are not enough. There are two different scenarios that I'm confused about.
First of all, I start with setting the cache for both scenarios:
await FirebaseDatabase.instance.setPersistenceEnabled(true);
await FirebaseDatabase.instance.setPersistenceCacheSizeBytes(10000000);
Scenario 1: I listen to the value of a specific user. I want to download the user data once, then always use the cache and download only updates, if there are any:
final stream = FirebaseDatabase().reference().child("users").child("some_id").onValue; // onValue is a getter, not a method
It's my understanding that Firebase will download the node first and use the cache later if there is no update. This won't change even if the app restarts.
Scenario 2: I want to query the posts that are created only after the date:
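final date = DateTime(2020, 6, 20);
// Assumes createdAt is stored as milliseconds since epoch; startAt takes a primitive value, not a DateTime.
final data = await FirebaseDatabase().reference().child("posts").orderByChild("createdAt").startAt(date.millisecondsSinceEpoch).once();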
For Scenario 2, I'm not sure how caching works. If Firebase Realtime Database caches the query, will it download everything again when a new post is created after the date? Or will it download only the new post and get the others from the cache?
If there is a change to a location/query that you have a listener on, Firebase performs a so-called delta-sync on that data. In this delta-sync, the client calculates hashes on subtrees of its internal version of the data, and sends those to the server. The server compares those hashes with those of its own subtrees and only sends back the subtrees where the hashes are different. This is usually quite a bit smaller than the full data, but not necessarily the minimal delta.
Note that Firebase will always perform a delta sync between the data it has in memory already for the query/location and the data on the server, regardless of whether you enable disk persistence. Having disk persistence enabled just means the in-memory copy will initially be populated from disk, but after that the delta-sync works the same for both cases.
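The hash comparison described above can be illustrated with a toy model. This is purely conceptual and is not Firebase's actual protocol or hash function: the fingerprint (an FNV-1a hash plus the serialized length), the one-level comparison, and the names `SyncTree`, `fingerprint`, and `changedChildren` are all assumptions made for the sketch.

```typescript
type SyncLeaf = string | number | boolean | null;
interface SyncTree {
  [key: string]: SyncTree | SyncLeaf;
}
type SyncNode = SyncTree | SyncLeaf;

// Serialize with sorted keys so equal subtrees always produce equal text.
function canonical(node: SyncNode): string {
  if (node === null || typeof node !== "object") {
    return JSON.stringify(node);
  }
  const tree: SyncTree = node;
  const parts: string[] = [];
  for (const key of Object.keys(tree).sort()) {
    parts.push(JSON.stringify(key) + ":" + canonical(tree[key]));
  }
  return "{" + parts.join(",") + "}";
}

// Toy fingerprint: 32-bit FNV-1a hash of the canonical text, plus its length.
function fingerprint(node: SyncNode): string {
  const text = canonical(node);
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16) + ":" + text.length;
}

// The "client" sends one fingerprint per child; the "server" answers with
// the keys of children whose fingerprint differs or that the client lacks.
// Children deleted on the server are not handled in this sketch.
function changedChildren(clientTree: SyncTree, serverTree: SyncTree): string[] {
  const clientPrints = new Map<string, string>();
  for (const key of Object.keys(clientTree)) {
    clientPrints.set(key, fingerprint(clientTree[key]));
  }
  return Object.keys(serverTree)
    .filter((key) => clientPrints.get(key) !== fingerprint(serverTree[key]))
    .sort();
}

const cachedPosts: SyncTree = { p1: { title: "a" }, p2: { title: "b" } };
const serverPosts: SyncTree = { p1: { title: "a" }, p2: { title: "b2" }, p3: { title: "c" } };
console.log(changedChildren(cachedPosts, serverPosts));
```

In this model only `p2` and `p3` would be re-sent, while the unchanged `p1` stays in the client's copy, which is the behavior the answer describes for a query with an attached listener.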
I discovered a very useful function named get(Source source). If I pass CACHE, I can get data only from cache. But how about set(Source source)? I cannot find something similar. I need to save data locally and push it to the server only when needed. How to solve this? Or any other alternatives? Thanks
What you're trying to do is not supported by the Firestore client libraries. The only kind of write you can perform is one that will be synchronized to the server at the earliest opportunity. There is no operation that lets you write first, then decide to synchronize later.
What you should do instead is write data to some other local storage (perhaps a database), then write those records to Firestore when you're ready.
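A minimal sketch of that "write locally first, push later" pattern follows. It keeps pending writes in memory only, whereas a real app would persist them in local storage as suggested above; `PendingWrite` and `LocalOutbox` are hypothetical names, and the actual Firestore write is left to the caller.

```typescript
// One deferred write: where it should go and what it should contain.
interface PendingWrite {
  path: string;
  data: Record<string, unknown>;
}

// Holds writes locally until the app decides to synchronize.
class LocalOutbox {
  private pending: PendingWrite[] = [];

  // Record a write without contacting the server.
  save(path: string, data: Record<string, unknown>): void {
    this.pending.push({ path, data: { ...data } });
  }

  count(): number {
    return this.pending.length;
  }

  // Hand over all queued writes (oldest first) and clear the queue.
  // The caller is expected to send each one to Firestore and to
  // save() again any write that fails.
  drain(): PendingWrite[] {
    const writes = this.pending;
    this.pending = [];
    return writes;
  }
}

const outbox = new LocalOutbox();
outbox.save("notes/n1", { text: "first draft" });
outbox.save("notes/n2", { text: "second note" });
// Later, when the app is ready to sync:
for (const write of outbox.drain()) {
  console.log("would write", write.path, write.data);
}
```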
My application needs to build a couple of large hashmaps before processing a user's request. Ideally I want to store these hashmaps in-memory on the machine, which means it never has to do any expensive processing and can process any incoming requests quickly.
But this doesn't work for firebase because there's a chance a user triggers a new instance which sets off the very time-consuming preprocessing step.
So, I tried designing my application to use the firebase database, and get only the data it needs from the database each time instead of holding all the data in-memory. But, since the cloud functions are downloading loads of data from the database, I have now triggered over 1.7 GB in download for this month, just by myself from testing. This goes over the quota.
There must be something I'm missing; all I want is a permanent memory storage of some hashmaps. All I want is for those hashmaps to be ready by the time the function is called with a request. It seems like such a simple requirement; how come there is no way to do this?
If you want to store data in the container that runs your Cloud Functions, you can use its local tmpfs, which is actually kept in memory. But this will disappear when the container is recycled, which happens when your function hasn't been accessed for a while. So this local file system will have to be rebuilt whenever the container spins up.
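A related pattern for container-local state is to build the hashmap lazily and keep it in a module-level variable, so that it is built once per container and reused by later requests served from the same warm instance. The sketch below illustrates only that memoization; `buildLookup`, `getLookup`, and `handleRequest` are hypothetical names, and the map contents are placeholder data. Like tmpfs, the cached value is lost when the container is recycled.

```typescript
// Lives for as long as this container instance stays warm.
let cachedLookup: Map<string, number> | null = null;
let buildCount = 0; // only here to show how often the expensive step runs

// Stand-in for the expensive preprocessing step.
function buildLookup(): Map<string, number> {
  buildCount++;
  const lookup = new Map<string, number>();
  for (let i = 0; i < 1000; i++) {
    lookup.set("key" + i, i * i);
  }
  return lookup;
}

// Build on first use, then reuse the same map for later requests.
function getLookup(): Map<string, number> {
  if (cachedLookup === null) {
    cachedLookup = buildLookup();
  }
  return cachedLookup;
}

// Hypothetical request handler body.
function handleRequest(key: string): number | undefined {
  return getLookup().get(key);
}

console.log(handleRequest("key3"), handleRequest("key10"), buildCount);
```

A cold start still pays the build cost once, so this reduces repeated work rather than removing the preprocessing step entirely.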
If you want permanent storage of values you generate, consider using Google Cloud Storage. It is probably a more cost-effective option, and definitely the most scalable one.