I'm working on an app where users create certain events in a calendar.
I was thinking on structuring the calendar events data as follows:
allEventsEver/{yearId}/months/{monthId}/events/{eventId}
I understand that
Firestore is optimized for storing large collections of small documents
but the structure above would mean that this would be an ever-growing collection. Is this something I should worry about? Would it be better to create a new collection for each year, e.g.:
2022/months/{monthId}/events/{eventId}
2023/months/{monthId}/events/{eventId}
Also, should I avoid using year/month value as document id (e.g. 2022) as those would be considered sequential ids that could cause hotspots that impact latency? If yes, what other approach do you suggest?
The most important/unique performance guarantee Firestore gives is that its query performance is independent of the number of documents in the collection. Query performance only depends on how much data you return, not on how much data needs to be considered.
So an ever-growing collection is not a concern on Firestore. As long put a limit on how many results your query can return, you'll have an upper bound on how much time it will take.
Related
I am new to Firestore and building an event planning app but I am unsure what the best way to structure the data is taking into account the speed of queries and Firestore costs based on reads etc. In both options I can think of, I have a users collection and an events collection
Option 1:
In the users collection, each user has an array of eventIds for events they are hosting and also events they are attending. Then I query the events collection for those eventIds of that user so I can list the appropriate events to the user
Option 2:
For each event in the events collection, there is a hostId and an array of attendeeIds. So I would query the events collection for events where the hostID === user.id and where attendeeIds.includes(user.id)
I am trying to figure out which is best from a performance and a costs perspective taking into account there could be thousands of events to iterate through. Is it better to search events collections by an eventId as it will stop iterating when all events are found or is that slow since it will be searching for one eventId at a time? Maybe there is a better way to do this than I haven't mentioned above. Would really appreciate the feedback.
In addition to #Dharmaraj answer, please note that none of the solutions is better than the other in terms of performance. In Firestore, the query performance depends on the number of documents you request (read) and not on the number of documents you are searching. It doesn't really matter if you search 10 documents in a collection of 100 documents or in a collection that contains 100 million documents, the response time will always be the same.
From a billing perspective, yes, the first solution will imply an additional document to read, since you first need to actually read the user document. However, reading the array and getting all the corresponding events will also be very fast.
Please bear in mind, that in the NoSQL world, we are always structuring a database according to the queries that we intend to perform. So if a query returns the documents that you're interested in, and produces the fewest reads, then that's the solution you should go ahead with. Also remember, that you'll always have to pay a number of reads that is equal to the number of documents the query returns.
Regarding security, both solutions can be secured relatively easily. Now it's up to you to decide which one works better for your use case.
I would recommend going with option 2 because it might save you some reads:
You won't have to query the user's document in the first place and then run another query like where(documentId(), "in", [...userEvents]) or fetch each of them individually if you have many.
When trying to write security rules, you can directly check if an event belongs to the user trying to update the event by resource.data.hostId == request.auth.uid.
When using the first option, you'll have to query the user's document in security rules to check if this eventID is present in that events array (that may cost you another read). Checkout the documentation for more information on billing.
I have users collection & transactions collection.
I need to get the user's balance by calculating his/her transactions.
And I heard that you are allowed to make duplicates and denormalize your database to achieve less document read in one request. (reading many docs cost more)
My approaches:
set transaction collection as a "subcollection" in the user document, so that you only get a user's documentation and compute the values need on the client-side.
make those collections as TOP level collections separately and somehow make "JOIN" queries to get his/her transactions then compute the value on the client-side.
Just make a field named "balance" in the user's document and update it every time they make transactions. (But this seems not quite adaptable to changes that might be made in the future)
Which approach is efficient? Or Maybe are there totally different approaches?
Which approach is efficient?
The third one.
Or Maybe are there totally different approaches?
Of course, there are, but by far the third is the best and cheapest one. Every time a new transaction is performed simply increment the "balance" field using:
What is the recommended way of saving durations in Firestore?
Assume I am designing a new Firestore database. Assume I like the idea of a hierarchical design and, as a contrived example, each Year has a sequence of child Weeks of which each has Days.
What's the most performance efficient way to retrieve a single document for today? i.e. 2021-W51-Thursday
Answers are permitted to include changes to the model, e.g. "denormalizing" the day model such that it includes year, week and dayName fields (and querying them).
Otherwise a simple document reference may be the fastest way, like:
DocumentReference ref = db
.Collection("years").Document("2021")
.Collection("weeks").Document("51")
.Collection("days").Document("Thursday");
Thanks.
Any query that identifies a single document to fetch is equally performant to any other query that does the same at the scale that Firestore operates. The organization of collections or documents does not matter at al at scale. You might see some fluctuations in performance at small scale, depending on your data set, but that's not how Firestore is optimized to work.
All collections and all subcollections each have at least one index on the ID of the document that works the same way, independent of each other collection and index. If you can identify a unique document using its path:
/db/XXXX/weeks/YY/days/ZZZZ
Then it scales the same as a document stored using a more flat structure:
/db/XXXXYYZZZZ
It makes no difference at scale, since indexes on collections scale to an infinite number of documents with no theoretical upside limit. That's the magic of Firestore: if the system allows the query, then it will always perform well. You don't worry about scaling and performance at all. The indexes are automatically sharded across computing resources to optimize performance (vs. cost).
All of the above is true for fields of a document (instead of a document ID). You can think of a document ID as a field of a document that must be unique within a collection. Each field has its own index by default, and it scales massively.
With NoSQL databases like Firestore, you should structure your data in such a way that eases your queries, as long as those queries can be supported by indexes that operate at scale. This stands in contrast with SQL databases, which are optimized for query flexibility rather than massive scalability.
I am considering storing multiple tenants in a single Firebase Firestore database. There will only be one collection per tenant and a few shared collections. Some will have more data than others. Some tenants may have a few million records while others may end up with a few billion. I want to confirm that the size of data in one collection will not impact the performance or storage of another collection in the same database.
I couldn't find much in the documentation about how the data is physically stored. Is all the data in Firestore stored in a single blob/file? If so, this could be a problem when there are hundreds of tenants with billions of records each. In an ideal world, each collection would be a physically separate file, and the server orchestration would separate the collections onto multiple servers so that a single server is not sharing the load between a very heavy tenant, and a very light tenant. This scenario would mean that a heavy tenant would slow down a light tenant.
My basic question is: can a single Firestore database infinitely scale up in size assuming that no single collection is bigger than a few billion records?
I know that there are two types of databases: native and datastore. Which of these seems more appropriate, and is the answer to my question different depending on which of these I select?
If the answer is that Firestore cannot scale infinitely in this way, what is the alternative approach? Should I be using Bigtable instead? Cassandra? Or, is there another way to physically divide my Firestore database other than collections?
Some tenants may have a few million records while others may end up with a few billion. I want to confirm that the size of data in one collection will not impact the performance or storage of another collection in the same database.
The performance in Firestore isn't related to the number of documents that exist in a collection. In terms of speed, it doesn't matter if you perform a query on:
A top-level (root-level) collection.
A sub-collection, which basically represents a collection that is nested under a document.
A collection group, which actually means querying collections and sub-collections that exist across the entire database.
The speed will always be the same, as long as the query returns the same number of documents. This is happening because the query performance depends on the number of documents you request and not on the number of documents you search. So it doesn't really matter if you query a collection with 1 MILLION documents or even 1 BILLION documents, the time for getting the same results will be the same.
I couldn't find much in the documentation about how the data is physically stored. Is all the data in Firestore stored in a single blob/file? If so, this could be a problem when there are hundreds of tenants with billions of records each.
In Cloud Firestore, the unit of storage is the document. Documents live in collections, which are simply containers for documents. Please note that Firestore is optimized for storing large collections of small documents. And when I say large, I mean extremely large. So when you perform a query against a collection of 1 MILLION documents, the speed depends on the number of results you return and it does not depend on the number of the documents in which you search, or on the number of documents that exist in other collections in which you aren't performing a search.
Can a single Firestore database infinitely scale up in size assuming that no single collection is bigger than a few billion records?
While when using the Firebase Realtime Database you had to scale using multiple databases, in Firestore this practice is not necessary. However, the are some techniques that are really good explained in the official docs:
Building scalable applications with Firestore
If the answer is that Firestore cannot scale infinitely in this way, what is the alternative approach?
I can definitely massively scale.
See the Firestore best practices and security rules.
You may conceptualize Firestore as being one service being shared by all of Google's customers. Just as Google's attempts to ensure that one customer's (so-called "noisy neighbor") impact on the service does not affect others, you don't want to be a noisy neighbor to yourself.
You need to consider more than just performance.
Security. E.g.see security rules as a mechanism that you may be able to use to help enforce segregation of your tenants' data. You will want to understand fully how to keep different customers' data separated securely. Your customers will want to understand what measures you're employing to ensure their data is keep separate too.
Multitenancy. Google Cloud Platform has no intrinsic (platform-wide) multitenant capabilities and, often, a way to manifest tenancy has been to use different Google Projects for different customers. This is because Projects provide a well-defined security perimeter. You may want to investigate whether (some subset of your customers) would benefit from being one customer, one project.
Quota. Another important consideration is quota. Every Cloud Platform method is constrained by some quota. You will want to be careful in ensuring that quota is distributed fairly across customers so that some customers don't consume all the quota denying other customers access to the service.
I have read in the documentation that the amount of time for retreiving data will be the same for querying a collection of 6 documents and a collection of 60M.
So is it safe to save all of the data of a specific kind (like users) under the same collection? Will I never have to split them into separate collections for getting better performance?
It is definitely possible to have slow-performing queries on Firestore, but the performance will not be related to the number of documents in the collection that you're querying. A common cause of slow reads is for example having documents that contain way more data than the application needs, which means that it takes more time to download that data to the client than is necessary for the use-case.
In your example: it is indeed normal to store all user profiles in a single collection. Querying 6 users out of that collection will always take the same amount of time, even if you app grows to millions or hundreds of millions of users.