Deleting very large collections in Firestore - firebase

I need to delete very large collections in Firestore.
Initially I used client-side batch deletes, but then the documentation changed and started to discourage that, with these comments:
Deleting collections from an iOS client is not recommended.
Deleting collections from a Web client is not recommended.
Deleting collections from an Android client is not recommended.
https://firebase.google.com/docs/firestore/manage-data/delete-data?authuser=0
I switched to a cloud function as recommended in the docs. The cloud function gets triggered when a document is deleted and then deletes all documents in a subcollection as proposed in the above link in the section on "NODE.JS".
The problem I am running into now is that the cloud function seems to manage only around 300 deletes per second. With a maximum Cloud Function runtime of 9 minutes (540 seconds), I can delete at most 300 × 540 = 162,000 documents this way. But the collection I want to delete currently holds 237,560 documents, so the cloud function times out about halfway through.
I cannot trigger the cloud function again with an onDelete trigger on the parent document, as this one has already been deleted (which triggered the initial call of the function).
So my question is: What is the recommended way to delete large collections in Firestore? According to the docs it's not client side but server side, but the recommended solution does not scale for large collections.
Thanks!

When you have more work than can be performed in a single Cloud Function execution, you will need to either find a way to shard that work across multiple invocations, or continue the work in subsequent invocations after the first. This is not trivial, and you have to put some thought and work into constructing the best solution for your particular situation.
For a sharding solution, you will have to figure out how to split up the document deletes ahead of time, and have your master function kick off subordinate functions (probably via pubsub), passing each of them the arguments it needs to figure out which shard to delete. For example, you might kick off one function whose sole purpose is to delete documents whose IDs begin with 'a', another for 'b', and so on, each one querying for its documents and then deleting them.
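A minimal sketch of the sharding idea in TypeScript. It only plans the shards and the messages a master function might publish; the Pub/Sub publish and the Firestore query/delete calls are deliberately left out. The names `ShardMessage`, `planShards`, `shardFor` and the path "parents/p1/items" are hypothetical, and the sketch assumes document IDs start with one of the listed lowercase letters.

```typescript
// Hypothetical message a master function could publish (for example via Pub/Sub)
// to tell one subordinate function which slice of the collection to delete.
interface ShardMessage {
  collectionPath: string;
  startAt: string;          // inclusive lower bound on the document ID
  endBefore: string | null; // exclusive upper bound; null means "no upper bound"
}

// Create one shard per prefix: [prefix, nextPrefix).
function planShards(collectionPath: string, prefixes: string[]): ShardMessage[] {
  const sorted = prefixes.slice().sort();
  return sorted.map((prefix, i) => ({
    collectionPath,
    startAt: prefix,
    endBefore: i + 1 < sorted.length ? sorted[i + 1] : null,
  }));
}

// Find the shard responsible for a document ID (used here only to check the plan).
function shardFor(shards: ShardMessage[], docId: string): ShardMessage | undefined {
  for (const s of shards) {
    if (docId >= s.startAt && (s.endBefore === null || docId < s.endBefore)) {
      return s;
    }
  }
  return undefined; // ID falls outside every planned shard
}

const shardPlan = planShards("parents/p1/items", ["a", "b", "c"]);
```

Each subordinate function would then run a range query on the document ID using its `startAt` and `endBefore` bounds. IDs that sort before the first prefix (for example uppercase letters or digits) are not covered by this plan, so a real prefix list has to match the actual ID alphabet.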
For a continuation solution, you might just start deleting documents from the beginning, go for as long as you can before timing out, remember where you left off, then kick off a subordinate function to pick up where the prior stopped.
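A rough sketch of the continuation idea in TypeScript, run against an in-memory counter instead of Firestore. `deleteUntilDeadline` and `fakeDeleteBatch` are hypothetical names, and the "one batch of 300 per second" timing is an assumption taken from the numbers in the question. Because deleted documents no longer match the query, "where you left off" is implicit here: the next run simply starts querying from the beginning again.

```typescript
// Result of one invocation: how many documents were deleted and whether any remain.
interface DeleteRunResult {
  deleted: number;
  finished: boolean;
}

// Delete in batches until the collection is empty or the time budget is used up.
// `deleteBatch` stands in for "query up to batchSize documents and delete them"
// and returns the number of documents it removed. `now` is injectable for testing.
function deleteUntilDeadline(
  deleteBatch: (batchSize: number) => number,
  batchSize: number,
  budgetMs: number,
  now: () => number
): DeleteRunResult {
  const deadline = now() + budgetMs;
  let deleted = 0;
  while (now() < deadline) {
    const n = deleteBatch(batchSize);
    deleted += n;
    if (n < batchSize) {
      return { deleted, finished: true }; // short batch: nothing is left
    }
  }
  return { deleted, finished: false }; // out of time: schedule another run
}

// Simulation with the question's numbers and a fake clock.
let remainingDocs = 237560;
let fakeClockMs = 0;
const fakeDeleteBatch = (batchSize: number): number => {
  const n = Math.min(batchSize, remainingDocs);
  remainingDocs -= n;
  fakeClockMs += 1000; // assumption: each batch of 300 takes one second
  return n;
};

const runs: DeleteRunResult[] = [];
let allDone = false;
while (!allDone) {
  // In a real deployment this loop body would be one function invocation,
  // and "looping" would mean triggering the next invocation.
  const result = deleteUntilDeadline(fakeDeleteBatch, 300, 540000, () => fakeClockMs);
  runs.push(result);
  allDone = result.finished;
}
```

In a real function the budget should be noticeably shorter than the configured timeout, so that there is time left to kick off the next invocation before the platform stops the current one.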
You should be able to use one of these strategies to limit the amount of work done per function, but the implementation details are entirely up to you to work out.
If, for some reason, neither of these strategies is viable, you will have to manage your own server (perhaps via App Engine), and message it (via pubsub) from a Cloud Function so that it performs a single unit of long-running work.

Related

Firestore : Maintaining the count of a collection. Trigger function vs transaction

Let's say I have a collection called persons and another collection called cities with a field population. When a Person is created in a City, I would like to increment the population field in the corresponding city.
I have two options.
1. Create an onCreate trigger function. Find the city document and increment using FieldValue.increment(1).
2. Create an HTTPS callable cloud function to create the person. The cloud function executes a transaction in which the person is created and the population is incremented.
The first one is simpler and I am using it right now. But, I am wondering if there could be cases where the onCreate is not called due to some glitch...
I am thinking of moving to the second option. I am wondering if there are any disadvantages. Does HTTPS callable function cost more?
The only problem I see with HTTPS callables is that if something fails, you need to handle that on the client side. That would be (at least for me) a little too much logic for the client.
What I can recommend, after almost 4 years of experience with exactly that problem, is a solution with a virtual queue. I had a long discussion on that topic here, and even with the Firebase people at the last in-person Google I/O and Firebase Summit.
Our problem was that those glitches did occur, and even when the triggers did fire, the changes and transactions sometimes failed because of too many requests. After trying every official recommendation, such as sharded counters, we ended up creating a virtual queue: each onCreate adds an entry to a Firestore or Realtime Database list/collection, and another function, run either by cron or by another trigger (it doesn't matter which), processes it. That cloud function handles the entries in the queue one by one and starts again for each of them to avoid timeouts and memory limits. We made sure that a single handler/calculation is small enough for one function invocation to handle.
This method was the only bulletproof one that could handle thousands of new entries per second without an issue. The only downside is that it takes more time than a usual trigger, because each entry is calculated one by one. If your calculations are small, you could do them in batches (that is how we started, too).
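A minimal in-memory sketch of such a virtual queue in TypeScript. It is not the answerer's actual implementation: `pendingQueue`, `enqueuePersonCreated` and `drainQueue` are hypothetical names, and plain arrays and objects stand in for the Firestore or Realtime Database collections.

```typescript
// One queued unit of work: "the population of this city changes by delta".
interface QueueEntry {
  cityId: string;
  delta: number;
}

const pendingQueue: QueueEntry[] = [];
const cityPopulation: { [cityId: string]: number } = {};

// What the onCreate trigger would do: only record that work is needed.
function enqueuePersonCreated(cityId: string): void {
  pendingQueue.push({ cityId, delta: 1 });
}

// What the worker (cron or trigger driven) would do per run: take at most
// `maxEntries` entries, apply them, and report how many are still waiting
// so that the caller knows whether another run is needed.
function drainQueue(maxEntries: number): number {
  const taken = pendingQueue.splice(0, maxEntries);
  for (const entry of taken) {
    cityPopulation[entry.cityId] = (cityPopulation[entry.cityId] || 0) + entry.delta;
  }
  return pendingQueue.length;
}

// Simulate a burst of creations followed by several small worker runs.
for (let i = 0; i < 5; i++) enqueuePersonCreated("berlin");
for (let i = 0; i < 2; i++) enqueuePersonCreated("paris");

let workerRuns = 0;
let entriesLeft = pendingQueue.length;
while (entriesLeft > 0) {
  entriesLeft = drainQueue(3); // batch size 3; use 1 for strict one-by-one handling
  workerRuns += 1;
}
```

Since only the worker writes to the counters, bursts of creations contend on cheap queue inserts rather than on the city documents themselves, which is the property the answer relies on.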

Can I use Google CloudFunctions for reliable application purposes?

I remember reading an article which explained that Cloud Functions are not guaranteed to be executed, and especially not in the right order. I can't find any sources on this anymore.
Is this information still current?
I am aware that the start of a function can take a couple seconds, especially when cold starting the function.
Could I reliably increment a number each time a document is created in a specific Firestore collection without getting my numbers mixed up? I know this is done often but I've never seen information on whether or not it is safe to do.
Following up on question one, are there red flags when using Cloud Functions for payment backend services?
Can I be sure that Cloud Functions are executed in the order that they were triggered i.e. are they queued or executed in parallel?
Could I reliably increment a number each time a document is created in a specific Firestore collection without getting my numbers mixed up?
You can certainly write code to do that. You will need to keep track of a running count of documents in another document, and use a transaction to keep it up to date.
I don't recommend doing this. It's kind of an anti-pattern in Firestore to impose sequentially increasing numbers on documents in a collection. If you want time-based ordering, you should consider using a timestamp instead.
Can I be sure that Cloud Functions are executed in the order that they were triggered i.e. are they queued or executed in parallel?
Cloud Functions provides absolutely no guarantee that function invocations will happen in any particular order. They are asynchronous and can execute in parallel on multiple server instances, depending on the load applied to the function.
I strongly suggest reading through the documentation to understand the execution environment provided by Cloud Functions.

Firebase Firestore, Delete Collection with a Callable Cloud Function

The documentation at https://firebase.google.com/docs/firestore/solutions/delete-collections
says the following:
Consistency - the code above deletes documents one at a time. If you
query while there is an ongoing delete operation, your results may
reflect a partially complete state where only some targeted documents
are deleted. There is also no guarantee that the delete operations
will succeed or fail uniformly, so be prepared to handle cases of
partial deletion.
so how to handle this correctly?
Does this mean "preventing users from accessing this collection while deletion is in progress"?
Or does it mean "if the work is interrupted partway through, call the function again from the point of failure until the deletion is complete"?
so how to handle this correctly?
It's suggesting that you should check for failures, and retry until there are no documents remaining (or at least until you are satisfied with the result).
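A small sketch of "check for failures and retry until nothing remains", in TypeScript. `deleteWithRetries` is a hypothetical helper, and `attemptDelete` stands in for one call to whatever delete function you deploy; it is assumed to return the number of documents still left and to throw when a run fails partway through.

```typescript
interface RetryOutcome {
  attempts: number;
  remaining: number;
}

// Keep calling `attemptDelete` until it reports zero remaining documents
// or the attempt limit is reached. A thrown error counts as a partial failure.
function deleteWithRetries(attemptDelete: () => number, maxAttempts: number): RetryOutcome {
  let remaining = Infinity; // unknown until the first successful report
  let attempts = 0;
  while (attempts < maxAttempts && remaining > 0) {
    attempts += 1;
    try {
      remaining = attemptDelete();
    } catch (err) {
      // Some documents may already be gone; just try again.
    }
  }
  return { attempts, remaining };
}

// Simulated collection of 10 documents; the second call fails after deleting 2.
let docsLeft = 10;
let callCount = 0;
const flakyDelete = (): number => {
  callCount += 1;
  if (callCount === 2) {
    docsLeft -= 2;
    throw new Error("simulated failure mid-delete");
  }
  docsLeft = Math.max(0, docsLeft - 4);
  return docsLeft;
};

const outcome = deleteWithRetries(flakyDelete, 10);
```

If `outcome.remaining` is still greater than zero after the last attempt, the caller has to decide whether to schedule more retries or to accept the partial result.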

Downside of using transactions in google firestore

I'm developing a Flutter App and I'm using the Firebase services. I'd like to stick only to using transactions as I prefer consistency over simplicity.
// Option 1: plain update
await Firestore.instance.collection('user').document(id).updateData({'name': 'new name'});
// Option 2: the same update wrapped in a transaction
await Firestore.instance.runTransaction((transaction) async {
  transaction.update(Firestore.instance.collection('user').document(id), {'name': 'new name'});
});
Are there any (major) downsides to transactions? For example, are they more expensive (Firebase billing, not computationally)? After all there might be changes to the data on the Firestore database which will result in up to 5 retries.
For reference: https://firebase.google.com/docs/firestore/manage-data/transactions
"You can also make atomic changes to data using transactions. While
this is a bit heavy-handed for incrementing a vote total, it is the
right approach for more complex changes."
https://codelabs.developers.google.com/codelabs/flutter-firebase/#10
With the specific code samples you're showing, there is little advantage to using a transaction. If your document update makes a static change to a document, without regard to its existing data, a transaction doesn't make sense. The transaction you're proposing is actually just a slower version of the update, since it has to round-trip with the server twice in order to make the change. A plain update just uses a single round trip.
Transactions are intended to be used when you have changes to make to a document (or, typically, multiple documents) that must take the current values of its fields into account before making the final write. This prevents two clients from clobbering each other's changes when they should actually work in tandem.
For example, if you want to append data to a string, two clients might overwrite each other's changes, depending on when they each read the document. Using a transaction, you can be sure that each append is going to take effect, no matter when the append was executed, since the transaction will be retried with updated data in the face of concurrency.
Typically, you should strive to get your work done without transactions if possible. For example, prefer to use FieldValue.increment() outside of a transaction instead of manually incrementing within a transaction.
Please read more about transactions in the documentation to better understand how they work. It is not quite like SQL transactions.
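To illustrate the append example, here is a small TypeScript model of a read-modify-write that is retried when the document changed in between. It is an in-memory sketch of optimistic retrying, not the Firestore SDK: `VersionedDoc`, `tryCommit` and `appendInTransaction` are hypothetical names, and the limit of 5 attempts is just the figure mentioned in the question.

```typescript
interface VersionedDoc {
  value: string;
  version: number;
}

const sharedDoc: VersionedDoc = { value: "", version: 0 };

// The commit succeeds only if nobody changed the document since it was read.
function tryCommit(doc: VersionedDoc, readVersion: number, newValue: string): boolean {
  if (doc.version !== readVersion) return false;
  doc.value = newValue;
  doc.version += 1;
  return true;
}

// Read, compute the new value from what was read, then try to commit.
// `interfere` lets the demo simulate another client writing between
// our read and our commit. Returns the number of attempts that were needed.
function appendInTransaction(
  doc: VersionedDoc,
  suffix: string,
  maxAttempts: number,
  interfere?: () => void
): number {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const snapshot = { value: doc.value, version: doc.version }; // read
    if (interfere) interfere();
    if (tryCommit(doc, snapshot.version, snapshot.value + suffix)) {
      return attempt;
    }
  }
  throw new Error("transaction failed after " + maxAttempts + " attempts");
}

// Client A appends "A"; client B sneaks in an append of "B" during A's first attempt.
let interfered = false;
const attemptsUsed = appendInTransaction(sharedDoc, "A", 5, () => {
  if (!interfered) {
    interfered = true;
    appendInTransaction(sharedDoc, "B", 5);
  }
});
```

Client A's first attempt is rejected because B committed in between, so A re-reads and appends to B's result. Both appends survive, whereas a blind overwrite based on A's stale read would have lost B's change.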
Are there any (major) downsides to transactions?
I don't know of any downsides.
For example, are they more expensive (Firebase billing, not computationally)?
No, a transaction costs the same as any other write operation. For example, if you create a transaction to increase a counter, you'll be charged for only one write operation.
I'm not sure I understand your last question completely, but if a transaction fails because of a concurrent change, Cloud Firestore retries the transaction.

firebase database equivalent of MySQL transaction

I'm looking for a way to thread a single object through multiple updates to multiple firebase.database.References and then commit it at the end, so that if the commit is unsuccessful, no changes are made to any of my Firebase References.
Does this exist? I thought firebase.database.Transaction would be similar, since it is an atomic update and it takes a callback that reports whether it has been committed. But the update function, I believe, applies only to a single object, and the function doesn't seem to return a transactionId or anything else I could pass to other firebase.database.Transactions.
UPDATE
This transaction's update seems to return a Transaction which would lend itself to perhaps chaining: https://firebase.google.com/docs/reference/js/firebase.firestore.Transaction
however this is different from the other Transaction:
Firebase Database transactions perform an update to a single location based on the current value of that same location. They explicitly do not work across multiple locations, since that would limit their scalability. Sometimes developers work around this by performing a transaction higher up in their JSON tree (at the first common point of the locations). I'd recommend against that, as that would limit the scalability even further.
The only way to efficiently update multiple locations with one API call, is with a multiple location update. This does however not have reading of the current value built-in.
So if you want to update multiple locations based on their current value, you'll have to perform the read operation in your application code, turn that into a multi-location update, and then use security rules to ensure all of those updates follow your application rules. This is quite a non-trivial approach, so I hardly ever see it done in practice. See my answer here for an example: Is the way the Firebase database quickstart handles counts secure?
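A rough TypeScript sketch of that read, multi-location update, and validation flow, using an in-memory tree instead of the Realtime Database. The paths ("posts/.../starCount", "posts/.../stars/...") and the names `buildFanOut`, `applyAtomically` and `starRule` are hypothetical, and the validator is only a stand-in for what security rules would enforce on the server.

```typescript
type JsonTree = { [key: string]: any };
type PathUpdates = { [path: string]: any };

// Build one flat update object: each key is a full path, each value the new data.
function buildFanOut(postId: string, uid: string, newStarCount: number): PathUpdates {
  const updates: PathUpdates = {};
  updates["posts/" + postId + "/starCount"] = newStarCount;
  updates["posts/" + postId + "/stars/" + uid] = true;
  return updates;
}

// All-or-nothing application: validate every path first, then write every path.
function applyAtomically(
  tree: JsonTree,
  updates: PathUpdates,
  isValid: (path: string, value: any) => boolean
): boolean {
  const paths = Object.keys(updates);
  for (const p of paths) {
    if (!isValid(p, updates[p])) return false; // reject the whole update
  }
  for (const p of paths) {
    const parts = p.split("/");
    let node = tree;
    for (let i = 0; i < parts.length - 1; i++) {
      if (typeof node[parts[i]] !== "object" || node[parts[i]] === null) {
        node[parts[i]] = {};
      }
      node = node[parts[i]];
    }
    node[parts[parts.length - 1]] = updates[p];
  }
  return true;
}

const dbTree: JsonTree = { posts: { p1: { starCount: 2 } } };

// Rule stand-in: a new starCount must be exactly the stored count plus one.
const starRule = (path: string, value: any): boolean =>
  path.indexOf("starCount") === -1 || value === dbTree.posts.p1.starCount + 1;

// The read happens in application code, then goes into the multi-location update.
const currentStars: number = dbTree.posts.p1.starCount;
const okUpdate = applyAtomically(dbTree, buildFanOut("p1", "alice", currentStars + 1), starRule);

// An update built from a wrong count is rejected as a whole.
const badUpdate = applyAtomically(dbTree, buildFanOut("p1", "bob", 99), starRule);
```

The rejected update writes neither the count nor the per-user star, which is the all-or-nothing behaviour a multi-location update combined with rules is meant to give you.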
