I have a Cloud Function in Python 3.7 to write/update small documents to Firestore. Each document has a user_id as its document ID, and two fields: a timestamp and a map (a dictionary) with three key-value pairs, all of them very small.
This is the code I'm using to write/update Firestore:
from datetime import datetime
from google.cloud import firestore
db = firestore.Client()
doc_ref = db.collection(u'my_collection').document(user['user_id'])
date_last_seen = datetime.combine(date_last_seen, datetime.min.time())
doc_ref.set({u'map_field': map_value, u'date_last_seen': date_last_seen})
My goal is to call this function one time every day, and write/update ~500K documents. I have tried the following tests, for each one I include the execution time:
Test A: Process the output to 1000 documents. Don't write/update Firestore -> ~ 2 seconds
Test B: Process the output to 1000 documents. Write/update Firestore -> ~ 1 min 3 seconds
Test C: Process the output to 5000 documents. Don't write/update Firestore -> ~ 3 seconds
Test D: Process the output to 5000 documents. Write/update Firestore -> ~ 3 min 12 seconds
My conclusion here: writing/updating Firestore is consuming more than 99% of my compute time.
Question: How can I write/update ~500K documents every day efficiently?
It's not possible to prescribe a single course of action without knowing details about the data you're actually trying to write. I strongly suggest you read the documentation about best practices for Firestore. It will give you a sense of what things you can do to avoid problems with heavy write loads.
Basically, you will want to avoid these situations, as described in that doc:
High read, write, and delete rates to a narrow document range

Avoid high read or write rates to lexicographically close documents, or your application will experience contention errors. This issue is known as hotspotting, and your application can experience hotspotting if it does any of the following:

Creates new documents at a very high rate and allocates its own monotonically increasing IDs. (Cloud Firestore allocates document IDs using a scatter algorithm. You should not encounter hotspotting on writes if you create new documents using automatic document IDs.)
Creates new documents at a high rate in a collection with few documents.
Creates new documents with a monotonically increasing field, like a timestamp, at a very high rate.
Deletes documents in a collection at a high rate.
Writes to the database at a very high rate without gradually increasing traffic.
I won't repeat all the advice in that doc. What you do need to know is this: because of the way Firestore is built to scale massively, there are limits on how quickly you can write data into it. The requirement to ramp up traffic gradually (the doc's "500/50/5" rule: start at no more than 500 operations per second and increase by 50% every 5 minutes) is probably going to be your main constraint, and it can't really be worked around.
I achieved my needs with batched writes. But according to the Firestore documentation there is an even faster way:
Note: For bulk data entry, use a server client library with parallelized individual writes. Batched writes perform better than serialized writes but not better than parallel writes. You should use a server client library for bulk data operations and not a mobile/web SDK.
I also recommend taking a look at this post on Stack Overflow with examples in Node.js.
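For reference, here is a minimal sketch of chunked batched writes with the Python server client library. It reuses the collection name from the question above; the helper name and the data shape of each entry are illustrative, and the 500 comes from Firestore's per-batch operation limit:

from google.cloud import firestore

db = firestore.Client()

def write_users(users):
    # Firestore allows at most 500 operations per batch
    for i in range(0, len(users), 500):
        batch = db.batch()
        for user in users[i:i + 500]:
            doc_ref = db.collection(u'my_collection').document(user['user_id'])
            batch.set(doc_ref, {u'map_field': user['map_value'],
                                u'date_last_seen': user['date_last_seen']})
        batch.commit()  # one round trip per 500 documents

For the parallelized individual writes that the note recommends, newer versions of the Python client also expose a BulkWriter (db.bulk_writer()), which queues, parallelizes, and retries individual writes for you.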
Related
In my application there will be users (tens/hundreds). Each user will have many documents of the same type (typeA). I will need to read all of these documents for the current user. I plan to use the following option:
root collection: typeACollection
|
nested collections for users: user1Collection, user2Collection, user3Collection ....
|
all documents for a specific user
An alternative to this solution is to create a separate root collection for each user and store documents of this type in it:

user1typeACollection, user2typeACollection, user3typeACollection ....

But I do not like this solution - the structure would be unclear.
In your opinion, which of the options is preferable (performance/price) - the first or the second?
There is no single best structure here; it all depends on the use-cases of your app.
The performance of a read operation in Firestore purely depends on the amount of data retrieved, and not on the size of the database. So it makes no difference if you read 20 user documents from a collection of 100 documents in total, or if there are 100 million documents in there - the performance will be the same.
What does make a marginal difference is the number of API calls you need to make. So loading 20 user documents with 20 calls will be slower than loading them with 1 call. But if you use a single collection group query to load the documents from multiple collections of the same name, the performance is the same again - you're loading 20 documents with a single API call.
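As a rough sketch of that last point in Python, assuming the per-user documents live in subcollections that all share the same ID (the name typeA here is just borrowed from the question):

# a single API call that streams documents from every collection named 'typeA'
for doc in db.collection_group(u'typeA').stream():
    print(doc.id, doc.to_dict())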
The cost is also going to be the same, as you pay for the number of documents read and the bandwidth consumed by those documents, which is the same in these scenarios.
I highly recommend watching the Getting to know Cloud Firestore video series to learn more about data modeling considerations and pricing when using Firestore.
I need to keep track of the number of photos I have in a Photos collection. So I want to implement an Aggregate Query as detailed in the linked article.
My plan is to have a Cloud Function that runs whenever a Photo document is created or deleted, and then increment or decrement the aggregate counter as needed.
This will work, but I worry about running into the 1 write/document/second limit. Say that a user adds 10 images in a single import action. That is 10 executions of the Cloud Function at more-or-less the same time, and thus 10 writes to the aggregate document at more-or-less the same time.
Looking around I have seen several mentions (like here) that the 1 write/doc/sec limit is for sustained periods of constant load, not short bursts. That sounds reassuring, but it isn't really reassuring enough to convince an employer that your choice of DB is a safe and secure option if all you have to go on is that 'some guy said it was OK on Google Groups'. Are there any official sources stating that short write bursts are OK, and if so, how is a 'short burst' defined?
Or are there other ways to maintain an aggregate query result document without effectively subjecting the whole set of aggregated documents to a very restrictive 1 write/second limitation?
If you think you'll see a sustained write rate of more than once per second, consider dividing the aggregation up into shards. In this scenario you have N aggregation docs, and each client/function picks one at random to write to. When a client needs the aggregate, it reads all of these shard documents and adds them up client-side. This approach is explained well in the Firebase documentation on distributed counters, and is also the approach used in the distributed counter Firebase Extension.
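A minimal Python sketch of that sharded approach (the shard count and the collection/document names are placeholders, not a prescribed schema):

import random
from google.cloud import firestore

db = firestore.Client()
NUM_SHARDS = 10  # size this to your expected peak write rate

counter_ref = db.collection(u'counters').document(u'photos')

def increment_count():
    # write to a random shard so concurrent increments hit different documents
    shard_id = str(random.randrange(NUM_SHARDS))
    counter_ref.collection(u'shards').document(shard_id).set(
        {u'count': firestore.Increment(1)}, merge=True)

def get_count():
    # read every shard and sum them client-side
    return sum(shard.to_dict().get(u'count', 0)
               for shard in counter_ref.collection(u'shards').stream())

Each shard document then only needs to sustain about 1 write per second on its own, so ten shards comfortably absorb a burst of ten near-simultaneous increments.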
I saw this doc (https://cloud.google.com/firestore/docs/best-practices#hotspots) and it says:
Avoid high read or write rates to lexicographically close documents, or your application will experience contention errors. This issue is known as hotspotting, and your application can experience hotspotting if it does any of the following:

Creates new documents at a very high rate and allocates its own monotonically increasing IDs. (Cloud Firestore allocates document IDs using a scatter algorithm. You should not encounter hotspotting on writes if you create new documents using automatic document IDs.)
Creates new documents at a high rate in a collection with few documents.
Creates new documents with a monotonically increasing field, like a timestamp, at a very high rate.
Deletes documents in a collection at a high rate.
Writes to the database at a very high rate without gradually increasing traffic.
Does a high rate occur when a lot of users create documents at once? Or is it talking about creating documents by running a for or while loop?
Does a high rate occur when a lot of users create documents at once? Or is it talking about creating documents by running a for or while loop?
Either of those can trigger hotspotting when the write rate is high. More important than where the writes come from is how fast the writes come in, how you assign document IDs, and whether or not you're writing monotonically increasing or decreasing fields.
This article goes into more detail on the timestamp case and describes a workaround:
https://cloud.google.com/firestore/docs/solutions/shard-timestamp
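The gist of that workaround, as a hedged sketch (the collection and field names here are made up): store a random shard value next to the monotonically increasing timestamp, so consecutive writes don't all land in the same index range.

import random
from google.cloud import firestore

db = firestore.Client()
MAX_SHARDS = 5  # hypothetical; higher write rates need more shards

db.collection(u'events').document().set({
    u'timestamp': firestore.SERVER_TIMESTAMP,
    u'shard': random.randrange(MAX_SHARDS),  # spreads writes across the index
})

Queries over the timestamp then fan out across the shard values and merge results client-side, as the linked article describes.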
I understand Firestore charges based on read/write operations.
But I notice that Firestore reads from the server on every app launch, which will produce a big read count if many users open the app frequently.
Q1: Can I limit users to reading from the server only on first login, so that each subsequent app launch only reads the documents that have been updated?
For example there's a chat app group.
100 users
100 message
100 app launch / user / day
Will that become 1,000,000 reads per day? That is ridiculously high.
Q2: Reads are counted per document, regardless of whether it's in a root collection or a subcollection, right?
For example, if I read from a root collection that contains 10 subcollections, each with 10 documents, that will result in a read count of 100, am I right?
Thanks.
That’s correct, Cloud Firestore cares less about the amount of downloaded data and more about the number of performed operations.
Since Cloud Firestore's pricing depends on the number of reads, writes, and deletes that you perform, consider that if you had 100 users communicating within one chat room, each of the users would get an update every time someone sends a message in that chat, increasing the number of read operations accordingly.
Since the number of read operations would be very much affected by the number of people in the same chatroom, Cloud Firestore suits best (price-wise) for a person-to-person chat app.
However, you could structure your app to have more chat rooms in order to decrease the volume of reads. Here you can see how to store different chat rooms, while the following link will guide you to the best practices on how to optimize your Cloud Firestore realtime updates.
Please keep in mind that Cloud Firestore itself does not have any rate limiting by default. However, Google Cloud Platform has configurable billing alerts that apply to your entire project.
You can also limit the billing to $25/month by using the Flame plan, and if there is anything unclear in your bill, you can always contact Firebase support for help.
Regarding your second question, a read occurs any time a client gets data from a document. Remember, only the documents that are retrieved are counted - Cloud Firestore does searching through indexes, not the documents themselves.
By using subcollections, you can still retrieve data from a single document, which counts as only 1 read, or you can use a collection group query that retrieves all the documents within the subcollections, counting as multiple reads depending on the number of documents (in the example you gave, it would be 10 x 10 = 100).
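For example, in Python (all the names here are illustrative):

# fetching one document from a subcollection costs exactly 1 read
doc = db.collection(u'rooms').document(u'room1').collection(u'messages').document(u'msg1').get()

# a collection group query costs 1 read per document returned:
# 10 subcollections x 10 documents each = 100 reads
for message in db.collection_group(u'messages').stream():
    print(message.to_dict())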
Firebase's Cloud Firestore places limits on the number of document writes and reads (and deletes). For example, the Spark plan (free) allows 50K reads and 20K writes a day. Estimating how many writes and reads your app will perform is obviously important when developing it, as you will want to know the potential costs incurred.
Part of this estimation is knowing exactly what counts as a document read/write. This part is somewhat unclear from searching online.
One document can contain many different fields, so if an app is designed such that the user actions in a session only update fields within a single document, would it be more cost-efficient to update all the fields in one single document write at the end of the session, rather than writing the document every single time the user updates one field?
Similarly, would it not make sense to read the document once at the start of a session, getting the values of all fields, rather than reading them when each is needed?
I appreciate that this method will lead to the user seeing slightly out-of-date field values, and admittedly to the database lagging behind, but if such things aren't too much of a concern to you, couldn't such a method reduce your reads/writes by a large factor?
This all depends on what counts as a document write/read (does writing 20 fields within the same document in one go count as 20 writes?).
The cost of a write operation has no bearing on the number of fields you write. It's purely based on the number of times you call update() or set() on a document reference, whether independently, in a transaction, or in a batch.
If you choose to write N fields using N separate updates, you will be charged N writes. If you choose to write N fields using 1 update, you will be charged 1 write.
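For example (the field names are illustrative):

# one write operation, regardless of how many fields it touches
doc_ref.update({u'field_a': 1, u'field_b': 2, u'field_c': 3})

# three write operations, and therefore three billed writes
doc_ref.update({u'field_a': 1})
doc_ref.update({u'field_b': 2})
doc_ref.update({u'field_c': 3})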