Deleting a very large collection in Firestore from the Firebase console

I have a very large collection of approximately 2 million documents, all of them outdated and needing to be deleted.
I only need to do this operation once; the new data has a TTL (time to live), so I won't run into this problem again.
Should I use the Firestore console UI to delete them, or is there a better way to do this? Is it possible to do it in one shot, or should I split it up?

There's no single approach that is decidedly better here.
The simplest option is probably to delete the documents from the console, but I often also use the Firebase CLI's firestore:delete command, and writing your own logic through the API is equally fine. Any of these can work; all of them need to read the documents before deleting them, and none is going to be significantly faster than the others.
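If you go the API route, here's a minimal sketch using the Firebase Admin SDK for Node.js; the collection name and batch size of 250 are illustrative assumptions (a Firestore batch allows at most 500 operations):

import * as admin from "firebase-admin";

admin.initializeApp();
const db = admin.firestore();

// Repeatedly read a page of documents and delete them in a single batch.
// Stops when a read returns no documents.
async function deleteCollection(path: string, batchSize = 250): Promise<void> {
  const collectionRef = db.collection(path);
  while (true) {
    const snapshot = await collectionRef.limit(batchSize).get();
    if (snapshot.empty) break;

    const batch = db.batch();
    snapshot.docs.forEach((doc) => batch.delete(doc.ref));
    await batch.commit();
    console.log(`Deleted ${snapshot.size} documents from ${path}`);
  }
}

deleteCollection("outdatedCollection").catch(console.error); // collection name is hypothetical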

Related

Firestore unused Index limitation?

I have an application that uses Firebase Firestore as a database. I have run out of (composite) indexes in the database: it has reached the limit of 200, and I can't add more.
I have manually deleted some indexes, but since my application is pretty large, it is very difficult to find the unused indexes by hand, and it takes time to recheck the same queries in multiple parts of the application.
I am looking for a solution: either a better way to identify the unused indexes than searching manually, or an option to extend the index limit.
You may file a support ticket with Google Cloud Support or Firebase Support to get the list of unused indexes; there is no automated way to do this.
Additionally, there is no way to increase the limit.
Changing the way your data is modeled so that you need fewer composite indexes to support your queries might also be an alternative, as mentioned here.

How to delete a large amount of data in "Realtime Database"

I want to delete all data in the "Realtime Database" without increasing its "Usage Load".
Any idea how to delete that data? There are 420,000+ records in the Realtime Database.
If you can help me, it would be very useful.
There is support for deleting large nodes built into the Firebase CLI these days as explained in this blog How to Perform Large Deletes in the Realtime Database:
If you want to delete a large node, the new recommended approach is to use the Firebase CLI (> v6.4.0). The CLI automatically detects a large node and performs a chunked delete efficiently.
$ firebase database:remove /path/to/delete
My initial write-up is below. 👇 I'm pretty sure the CLI mentioned above implements precisely this approach, so that's likely a faster way to accomplish this, but I'm still leaving this explanation as it may be useful as background.
Deleting data is a write operation, so it's by definition going to put load on the database. Deleting a lot of data causes a lot of load, either as a spike over a short period or (if you spread it out) as elevated load over a longer period. Spreading the load out is the best way to minimize the impact on your regular users.
The best way to delete a long, flat list of keys (as you seem to have) is to:
Get a list of those keys, either from a backup of your database (which happens out of band) or by using the shallow parameter on the REST API.
Delete the data in reasonable batches, where "reasonable" depends on the amount of data you store per key. If each key holds just a few properties, you could start by deleting 100 keys per batch and check how that impacts the load, to determine whether you can ramp up to more keys per batch.
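Here's a rough sketch of that approach against the REST API; the database URL, path, and batch size of 100 are assumptions you'd adjust for your data:

// shallow=true returns only the top-level keys, not the data under them.
const DB_URL = "https://your-project-id-default-rtdb.firebaseio.com"; // hypothetical URL

async function deleteInBatches(path: string, batchSize = 100): Promise<void> {
  const res = await fetch(`${DB_URL}/${path}.json?shallow=true`);
  const keys = Object.keys((await res.json()) ?? {});

  for (let i = 0; i < keys.length; i += batchSize) {
    // A multi-path PATCH with null values deletes each key in one write.
    const update: Record<string, null> = {};
    for (const key of keys.slice(i, i + batchSize)) update[key] = null;
    await fetch(`${DB_URL}/${path}.json`, {
      method: "PATCH",
      body: JSON.stringify(update),
    });
    console.log(`Deleted ${Math.min(i + batchSize, keys.length)} of ${keys.length} keys`);
  }
}

deleteInBatches("items").catch(console.error); // path is hypothetical

You could also pause between batches, or watch the usage graphs, to keep the load acceptable for your regular users.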

Changing data structure after app has been published - Firestore

I have just published an app that uses Firestore as a backend.
I want to change how the data is structured:
for example, if some documents are stored in subcollections like 'PostsCollection/userId/SubcollectionPosts/postIdDocument', I want to move each of those postIdDocument documents into the top-level 'PostsCollection'.
Obviously, doing so would prevent users of the previous app version from reading and writing the right collection, and all their data would be lost.
Since I don't know how to approach this issue, I want to ask what the best approach is, the one that big companies also use when changing the data structure of their projects.
So the approach I have used is document versioning. There is an explanation here.
You basically version your documents so that when your app reads them, it knows how to update them to the desired version. So in your case, documents would start with no version, and getting to version 1 means reading the sub-collection document into the top-level collection and removing the sub-collection before working with the document.
Yes, it is more work, but it allows an iterative approach to document changes. And sometimes a script is written to update everything to the desired state and new code is deployed 😛, which usually happens when someone wants it done yesterday. With many documents, that can have its own issues.
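For illustration, a minimal sketch of what the versioning logic might look like; the version field, types, and upgrade steps are assumptions, not the linked article's code:

type Post = { version?: number; [key: string]: unknown };

const CURRENT_VERSION = 1;

// One upgrade function per step; index i upgrades a document from version i to i + 1.
const upgrades: Array<(post: Post) => Post> = [
  // v0 -> v1: this is where you'd fold the old subcollection document's
  // fields into the top-level document and drop the subcollection reference.
  (post) => ({ ...post, version: 1 }),
];

function upgradeToCurrent(post: Post): Post {
  let current = post;
  for (let v = current.version ?? 0; v < CURRENT_VERSION; v++) {
    current = upgrades[v](current);
  }
  return current;
}

Each document is migrated lazily the first time it is read, and writing the upgraded document back (with the new version stamped) ensures the migration runs only once per document.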

How to find which kinds are not being used in Google Datastore

Is there any way to list the kinds in Google's Datastore that are not being used by our App Engine app, without having to look into our code and/or logic? :)
I'm not talking about indexes, which I can list by issuing a
gcloud datastore indexes list
and then compare with the datastore-indexes.xml or index.yaml.
I tried to check datastore kinds statistics and other metadata but I could not find anything useful to help me on this matter.
Should I give up on finding a way for Datastore to provide me useful stats, and instead write something to keep collecting Datastore statistics (like data size) over a long period, to get at least a clue about which kinds are not being used, and only then look into our app code to see whether the kind's model was removed?
Example:
select bytes from __Stat_Kind__
Store it somewhere and keep updating it for a period. If a kind's byte size does not change, then the kind is probably not being used anymore.
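A small sketch of that polling idea with the Node.js Datastore client; the stat properties (kind_name, bytes) follow the documented __Stat_Kind__ schema, while storing and comparing the snapshots is left to you:

import { Datastore } from "@google-cloud/datastore";

const datastore = new Datastore();

// Read the per-kind size stats; compare the result across runs to spot
// kinds whose byte size never changes.
async function snapshotKindSizes(): Promise<Record<string, number>> {
  const [stats] = await datastore.runQuery(datastore.createQuery("__Stat_Kind__"));
  const sizes: Record<string, number> = {};
  for (const stat of stats) {
    sizes[stat.kind_name] = stat.bytes;
  }
  return sizes;
}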
The idea is to do some cleaning in datastore.
I would like to find which kinds are not being used anymore, or have not been used for a long time, or were created manually to be used once. You know, like a table in Oracle that no one knows the purpose of, where the statistics show it was only used once, five years ago. I'm trying to achieve the same in Datastore: I want to know which kinds are not being used anymore, or were last used a while ago, then ask around and back them up / delete them if no owner is found.
It's an interesting question.
I think you would be best placed to audit your code and instill an organizational practice that requires this documentation to be maintained in future as a business/technical pre-production requirement.
IIRC, Datastore doesn't automatically timestamp Entities and keys (rightly) aren't incremental. So there appears no intrinsic mechanism to track changes short of taking a snapshot (expensive) and comparing your in-flight and backup copies for changes (also expensive and inconclusive).
One challenge with identifying a Kind that appears to be non-changing is that it could be referenced (rarely) by another Kind and so, while it does not change, it is required.
Auditing your code and documenting it for posterity should not only provide you with a definitive answer (and identify owners), it also pays off a significant technical debt and avoids this problem recurring when future (e.g. GDPR-like) requirements arise.
Assuming you are referring to records being created/updated, I can think of the following options:
Via the Cloud Console (Datastore > Dashboard) - this lists all your Kinds and the number of records in each. Theoretically, you can take a screenshot and compare the counts over time, so that you know which Kinds have grown and which have not.
Created/LastModified date columns - I usually add these two columns to most of my Datastore tables. If you have them, you can write a function that queries them: for each Kind, run a query sorted in descending order of creation (or last-modified) date and pull only the first record. That tells you the last time a record was created or modified in that Kind; see the sketch below.
I would write such a function as part of my app, put it behind a page that requires admin privileges (so only the app creator can run it), and then just clicking a link in my app would give me the information.
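As a hedged sketch of option 2, assuming each kind carries an indexed lastModified property (the kind names below are placeholders):

import { Datastore } from "@google-cloud/datastore";

const datastore = new Datastore();

// The newest entity is enough to date the kind's last write.
async function lastActivity(kind: string): Promise<Date | undefined> {
  const query = datastore
    .createQuery(kind)
    .order("lastModified", { descending: true })
    .limit(1);
  const [entities] = await datastore.runQuery(query);
  return entities[0]?.lastModified as Date | undefined;
}

async function main() {
  for (const kind of ["Post", "UserProfile", "AuditLog"]) { // hypothetical kinds
    console.log(kind, await lastActivity(kind));
  }
}

main().catch(console.error);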

Deleting very large collections in Firestore

I need to delete very large collections in Firestore.
Initially I used client-side batch deletes, but then the documentation changed and started to discourage that with the comments
Deleting collections from an iOS client is not recommended.
Deleting collections from a Web client is not recommended.
Deleting collections from an Android client is not recommended.
https://firebase.google.com/docs/firestore/manage-data/delete-data?authuser=0
I switched to a cloud function as recommended in the docs. The cloud function gets triggered when a document is deleted and then deletes all documents in a subcollection as proposed in the above link in the section on "NODE.JS".
The problem I am running into now is that the cloud function seems to manage around 300 deletes per second. With the maximum cloud function runtime of 9 minutes, that caps out at 162,000 deletes (300 × 540 seconds), but the collection I want to delete currently holds 237,560 documents, which makes the cloud function time out about halfway through.
I cannot trigger the cloud function again with an onDelete trigger on the parent document, as this one has already been deleted (which triggered the initial call of the function).
So my question is: What is the recommended way to delete large collections in Firestore? According to the docs it's not client side but server side, but the recommended solution does not scale for large collections.
Thanks!
When you have more work than can be performed in a single Cloud Function execution, you will need to either find a way to shard that work across multiple invocations, or continue the work in subsequent invocations after the first. This is not trivial, and you have to put some thought and work into constructing the best solution for your particular situation.
For a sharding solution, you will have to figure out how to split up the document deletes ahead of time, and have your master function kick off subordinate functions (probably via pubsub), passing each one the arguments it needs to figure out which shard to delete. For example, you might kick off one function whose sole purpose is to delete documents whose IDs begin with 'a', another for 'b', and so on, each querying for its documents and then deleting them.
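As a rough sketch of that fan-out, sharding by document-ID prefix; the topic name and hex prefix set are assumptions:

import * as admin from "firebase-admin";
import { PubSub } from "@google-cloud/pubsub";

const pubsub = new PubSub();

// Master: publish one message per shard; each worker deletes only its prefix.
export async function fanOutDeletes(path: string): Promise<void> {
  for (const prefix of "0123456789abcdef") {
    await pubsub.topic("delete-shard").publishMessage({ json: { path, prefix } });
  }
}

// Worker side: select only documents whose IDs start with the given prefix.
export function shardQuery(db: admin.firestore.Firestore, path: string, prefix: string) {
  return db
    .collection(path)
    .orderBy(admin.firestore.FieldPath.documentId())
    .startAt(prefix)
    .endAt(prefix + "\uf8ff") // the usual high-codepoint trick for prefix ranges
    .limit(250);
}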
For a continuation solution, you might just start deleting documents from the beginning, go for as long as you can before timing out, remember where you left off, then kick off a subordinate function to pick up where the prior one stopped.
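A minimal sketch of that continuation pattern with a Pub/Sub-triggered function; the topic name, time budget, and batch size are assumptions (and because deleted documents vanish from the query, no explicit cursor is needed here):

import * as functions from "firebase-functions";
import * as admin from "firebase-admin";
import { PubSub } from "@google-cloud/pubsub";

admin.initializeApp();
const db = admin.firestore();
const pubsub = new PubSub();

const TOPIC = "delete-collection"; // illustrative topic name
const BUDGET_MS = 8 * 60 * 1000;   // stop well before the 9-minute limit

export const deleteChunk = functions.pubsub
  .topic(TOPIC)
  .onPublish(async (message) => {
    const { path } = message.json as { path: string };
    const deadline = Date.now() + BUDGET_MS;

    while (Date.now() < deadline) {
      const snap = await db.collection(path).limit(250).get();
      if (snap.empty) return; // done: nothing left to delete

      const batch = db.batch();
      snap.docs.forEach((doc) => batch.delete(doc.ref));
      await batch.commit();
    }

    // Out of time but not out of documents: hand off to the next invocation.
    await pubsub.topic(TOPIC).publishMessage({ json: { path } });
  });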
You should be able to use one of these strategies to limit the amount of work done per function invocation, but the implementation details are entirely up to you to work out.
If, for some reason, neither of these strategies is viable, you will have to manage your own server (perhaps via App Engine) and message it (via pubsub) to perform a single unit of long-running work in response to a Cloud Function.
