I have to do some pretty heavy work once a month.
I have a "user" collection with over 100,000 documents.
One document is approximately 1 KB.
And this collection is in sync with BigQuery.
I have to update all documents in this "user" collection once a month.
Users earn "points" as they use the service.
Each month, every user is assigned a new grade based on these points.
Grades are awarded by fixed percentage.
(For example, the top 10% is gold, 30% silver, and so on.)
This is the method I came up with, given my limited experience:
A Cloud Function queries BigQuery, sorts all documents in the "user" collection by "point", and finds the minimum "point" value required for each grade.
Using these "point" values, the Cloud Function then updates all Firestore documents.
Assuming I use this method, how should I write the Cloud Function code?
I don't need all 100,000 documents updated in seconds;
I don't mind if it takes a few minutes.
For example, I think it could work like this:
Update 1,000 documents.
When that completes, update the next 1,000.
Then the next 1,000...
But wouldn't the function be terminated (time out) in that case?
How should I proceed?
Additionally, there is no restriction that I must use Firebase Functions.
Please help!
Thank you sincerely :)
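The threshold step described above can be sketched as a pure helper. This is a minimal sketch, assuming the sorted point values come back from the BigQuery query; the function names, tier names, and percentages are all illustrative, not part of any real schema:

```javascript
// Sketch: given every user's "point" value sorted in descending order,
// find the minimum point value required for each grade tier.
// Tier names/percentages are illustrative (top 10% gold, next 20% silver).
function gradeThresholds(sortedPointsDesc, tiers) {
  const n = sortedPointsDesc.length;
  const thresholds = {};
  let cumulativePercent = 0;
  for (const { name, percent } of tiers) {
    cumulativePercent += percent;
    // Index of the last user that still falls inside this tier.
    const idx = Math.min(n - 1, Math.ceil((n * cumulativePercent) / 100) - 1);
    thresholds[name] = sortedPointsDesc[idx];
  }
  return thresholds;
}

// Assign a grade from the thresholds, checking tiers from best to worst;
// anyone below the last cutoff gets the lowest named tier.
function gradeFor(points, thresholds, tierOrder) {
  for (const name of tierOrder) {
    if (points >= thresholds[name]) return name;
  }
  return tierOrder[tierOrder.length - 1];
}
```

A Cloud Function (or any other worker, since Firebase Functions isn't required here) could then page through the user collection with a cursor query and commit the grade updates in Firestore batched writes of up to 500 operations each; at that pace, 100,000 documents finish comfortably within a few minutes.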
How to efficiently count the documents in a large Firestore collection.
Obviously, I do not want to fetch the entire collection and count it on the front end, as the cost would go through the roof. Is there really not a simple API such as db.collection('someCollection').count() or similar, or do we need to hack around it?
(2022-10-20) Edit:
As of now, counting the documents in a collection, or the documents returned by a query, is actually possible without keeping a counter. You can count the documents using the new count() method, which:
Returns a query that counts the documents in the result set of this query.
This new feature was announced at this year's Firebase summit. Keep in mind that this feature doesn't read the actual documents. So according to the official documentation:
For aggregation queries such as count(), you are charged one document read for each batch of up to 1000 index entries matched by the query. For aggregation queries that match 0 index entries, there is a minimum charge of one document read.
For example, count() operations that match between 0 and 1000 index entries are billed for one document read. A count() operation that matches 1500 index entries is billed 2 document reads.
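The billing rule quoted above works out to one read per started batch of 1,000 matched index entries, with a minimum of one. A quick sanity-check helper (the function name is made up for illustration):

```javascript
// Document reads billed for a count() aggregation, per the documented
// rule: one read per batch of up to 1,000 matched index entries, with
// a minimum charge of one read even for an empty result.
function countQueryReads(matchedIndexEntries) {
  return Math.max(1, Math.ceil(matchedIndexEntries / 1000));
}
```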
Is there really not a simple api such as db.collection('someCollection').count() or similar
No, there is not.
but we need to hack around it
Yes, we can use a workaround for counting the number of documents in a collection: keep a separate counter that is updated every time a document is added to or removed from the collection.
This counter can be stored as a field in a Firestore document. However, if documents in the collection are added or deleted very frequently, this solution can become costly, in which case I highly recommend using the Realtime Database instead. There you pay nothing to update the counter, only to read (download) it, and since it's just a number, that costs almost nothing. I even wrote an article a couple of years ago about solutions for counting documents in Firestore:
How to count the number of documents in a Firestore collection?
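The counter workaround can be modeled on plain objects. This is only a sketch of the bookkeeping: in a real app the two writes below would go through a single Firestore transaction or batched write so the counter can never drift, and all names here are illustrative:

```javascript
// Model of the counter workaround: every add/remove of a document also
// adjusts a separate counter, so reading the count costs a single
// document read instead of reading the whole collection.
function addDocument(state, id, doc) {
  // In Firestore, both writes would share one transaction/batch.
  return {
    docs: { ...state.docs, [id]: doc },
    count: state.count + 1,
  };
}

function removeDocument(state, id) {
  const { [id]: _removed, ...rest } = state.docs;
  return { docs: rest, count: state.count - 1 };
}
```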
Let's say I have a multi-restaurant food-ordering app.
I'm storing orders in Firestore as documents.
Each order object/document contains:
total: double
deliveredByUid: str
restaurantId: str
I wanna see anytime during the day, the totals of every Driver to each Restaurant like so:
robert:
    mcdonalds: 10
    kfc: 20
alex:
    mcdonalds: 35
    kfc: 10
What is the best way of calculating the totals of all the orders?
I'm currently thinking of the following:
The safest and easiest method, but expensive: each time I need the totals, I query all of that day's documents and sum them one by one.
Cloud Functions method: each time an order is added/removed, modify a value at a specific Realtime Database child: /totals/driverId/placeId
Manual totals: each time a driver completes an order and writes their id to the order object, make another write to the specific Realtime Database child.
Edit: added the whole order object because I was asked to.
What I would most likely do is make sure orders are completely atomic (or as atomic as they can be). Most likely, I'd perform the order on the client within a transaction or batch write (both are atomic) that would not only create this document in question but also update the delivery driver's document by incrementing their running total. Depending on how extensible I wanted to get, I may even create subcollections within the user's document that represented chunks of time if I wanted to be able to record totals by month or year, or whatever. You really want to think this one through now.
The reason I'd advise against your suggested pattern is because it's not atomic. If the operation succeeds on the client, there is no guarantee it will succeed in the cloud. If you make both writes part of the same transaction then they could never be out of sync and you could guarantee that the total will always be accurate.
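The atomic pattern suggested above can be modeled on plain objects. In Firestore, the two writes would share one batched write or transaction (with FieldValue.increment for the running total) so they commit or fail together; the helper name and field names below are illustrative:

```javascript
// Model: creating an order and incrementing the driver's running total
// for that restaurant must succeed or fail as a unit. In Firestore this
// would be one batched write / transaction touching both documents.
function placeOrder(state, order) {
  const { deliveredByUid, restaurantId, total } = order;
  const driverTotals = state.totals[deliveredByUid] || {};
  const prev = driverTotals[restaurantId] || 0;
  return {
    orders: [...state.orders, order],
    totals: {
      ...state.totals,
      [deliveredByUid]: { ...driverTotals, [restaurantId]: prev + total },
    },
  };
}
```

Because the total is maintained on every write, reading "robert's total at mcdonalds" never requires re-reading the day's orders.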
I'm creating a React Firebase website with a collection of documents, each containing a rating from 1 to 10. All of these documents have an author attached. The average rating of all of an author's documents should be calculated and presented.
Here are my current two solutions:
Calculate the average from all the documents with the same author
Add the statistic to the author himself, so that every time the author adds a new document, his statistic is updated
My thinking behind the second option is that the website doesn't have to recalculate the average rating each time it's requested. Would this be a bad idea, or is there actually no problem with reading all the documents and calculating on demand?
Your second approach is in fact a best practice when working with NoSQL databases. If you calculate the average on demand across a dynamic number of documents, the cost of that operation will grow as you add more documents to the database.
For this reason you'll want to calculate all aggregates on write and store them in the database. With that approach, looking up an aggregate value is a single document read.
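The aggregate-on-write approach can keep a running average cheaply: store the author's current average and rating count, and fold each new rating in at write time. A minimal sketch, where the field names are assumptions rather than a real schema:

```javascript
// Update an author's stored aggregate when a new rated document is added.
// avg and count would live on the author document; only that one
// document is read and written, no matter how many rated docs exist.
function foldRating(authorStats, newRating) {
  const count = authorStats.count + 1;
  const avg = (authorStats.avg * authorStats.count + newRating) / count;
  return { avg, count };
}
```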
Also see:
The Firebase documentation on aggregation queries
The Firebase documentation on distributed counters
How to get a count of number of documents in a collection with Cloud Firestore
Leaderboard ranking with Firebase
The article about Best practices for Cloud Firestore states that we should keep the rate of write operations for an individual collection under 1,000 operations/second.
But at the same time, the Firebase team says in Choose a data structure that root-level collections "offer the most flexibility and scalability".
What if I have a root-level collection (e.g. "messages") which expects to have more than 1,000 write operations/second?
That limitation of 1,000 operations/second is quite generous, but if you find yourself needing more than that, you should consider changing your database schema to spread writes across multiple collections. Having a single messages collection to which every user adds messages is not a good approach, since you can hit that limitation very quickly. In that case, you should split the collection into multiple collections. A possible schema is the one I explain in the following video:
https://www.youtube.com/watch?v=u3KwKQddPoo
See, at the end of that video there is a collection named messages, which in turn contains a roomId document. That document contains a subcollection named roomMessages, which holds as documents all the messages of a chat room. With this schema, you're very unlikely to hit that limitation.
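Since the write guidance applies per collection, the per-room schema described above spreads the load across one roomMessages subcollection per room. A tiny path helper illustrates the layout (the helper name is made up; the collection names follow the video):

```javascript
// Firestore path for a message under the per-room schema:
// messages/{roomId}/roomMessages/{messageId}
// Each room's messages land in their own subcollection, so no single
// collection has to absorb every user's writes.
function messagePath(roomId, messageId) {
  return `messages/${roomId}/roomMessages/${messageId}`;
}
```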
But at the same time, the Firebase team says in Choose a data structure that root-level collections "offer the most flexibility and scalability".
But also remember: Firestore can look up a collection at level 100 as quickly as at level 1, so you don't need to worry about that.
The limit of 1,000 ops/sec per collection only applies to realtime updates, so as long as you don't have a snapshot listener attached, you should be okay.
I asked the question on the Cloud Firestore Google Groups
The limit is 10,000 writes per second if no other limits apply first:
https://firebase.google.com/docs/firestore/quotas#writes_and_transactions
Also, keep in mind the best practices for scaling Cloud Firestore.
I'm wondering what's the better structure for my Firestore database.
I want to create a sort of appointment manager for employees, where I can show each employee their appointments for a given date. I have thought of these two options:
Option 1:
Every employee has a collection Appointments where I save all the upcoming appointments. The appointment documents would have a column date.
When I want to load all appointments for a date I would have to query all appointments by this date.
Option 2:
Every employee has a collection Workdays with documents for each day. These workday documents would have the column date. And then a collection with Appointments where I save the appointments for a workday.
When I want to load all appointments, I would have to query the Workdays collection for the correct date and then load all its Appointments.
I expect an average workday to contain 10-20 appointments, and let's say I save appointments for the next 30 days. For option 1, I would then be querying across 300-600 documents to get 10-20 back.
For option 2, I would query 30 documents down to 1, then load that workday's 10-20 appointments.
So in option 2 I would have to query fewer documents, but I would have to wait until the query is finished and then load 10-20 further documents. While for option 1, I would have to query more documents but once this query is finished I wouldn't have to load any more documents.
I'm wondering what option is the faster for my use case - any thoughts?
Documents are not necessarily tabular (columnar). Keep it simple, follow their documentation (Choose a data structure), and do not overthink search optimization. Leave query optimization to the Firebase platform/team, as several search approaches might be implemented depending on the type of data you are querying for. Examples (source: Wikipedia) include Dijkstra's algorithm, Kruskal's algorithm, the nearest-neighbour algorithm, and Prim's algorithm.
Again, provided you basically follow their data-structure guideline, the optimal search approach should be baked into the Firebase/Firestore platform and may be optimized by them when possible. In short, the speed of the platform will amaze you. Focus on the higher-level tasks of your particular app.
If the total number of documents read in each case is the same, the faster option will be the one that reduces the number of round trips between the client and server. So, fewer total queries would be better. The total number of documents in the collection is not going to affect the results very much.
With Firestore, performance of queries scales with the size of the result set (total number of documents read), not with the total size of the collection.
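Putting rough numbers on the two options, under the answer's point that reads scale with the result set and the question's assumed 10-20 appointments per day (function names are hypothetical):

```javascript
// Rough cost model for loading one day's appointments.
// Option 1: one query on Appointments filtered by date; reads only the
// matching docs (the result set), in a single round trip.
// Option 2: one query for the Workday doc, then a second query for its
// Appointments subcollection: one extra read and one extra round trip.
function optionOneCost(appointmentsPerDay) {
  return { reads: appointmentsPerDay, roundTrips: 1 };
}

function optionTwoCost(appointmentsPerDay) {
  return { reads: 1 + appointmentsPerDay, roundTrips: 2 };
}
```

Under this model the collection's total size (the 300-600 figure) never enters the cost; option 2 only adds a round trip and a document read.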
The first option is rather straightforward, and is definitely the way you'd do it with a relational database. The date column could become a natural foreign key to any potential workday table, assuming one is even needed.
The second option is more complicated because there are three data cases:
Workday does not exist
Workday does exist but has no appointments in the list
Workday exists and has appointments
In terms of performance, they are not likely to differ much, but if there is a significant gap, I'd wager option 1 is the more efficient.