Maxing out document storage in Firestore - firebase

I'm working on some posting forum projects and trying to figure out the ideal Firestore database structure.
I read that documents have a max size of 1 mg but what are the pros and cons to maxing out the storage space of each document by having multiple posts stored in a document rather than using a single document for each post?
I think it would be cheaper. Assuming that the app would make use of all the data in a document, the bandwidth costs would be the same but rather than multiple reads, I would be charged for only one document. Does this make sense?
Would it also be faster?

You can likely store many posts in a single document, and depending on your application, there may be good reasons for doing so. Just keep a few things in mind:
Firestore always reads complete documents. So if you store 100 posts in a single 1MB document, to only display 10 of those posts, you may have reduced the read operations by 10x, but you've increased the bandwidth consumption by 10x. And your mobile users will likely also pay for that bandwidth.
Implementing your own sharding strategy is not always hard, but it's seldom related to application functionality.
My guidelines when modeling data in any NoSQL database is:
model application screens in your database
I tend to model the data in my database after the screens that I have in my application. So if you typically show a list of headlines of recent articles when a user starts the app, I might actually create a document that contains just the headlines of recent articles. That way the app only has to read a single document with just the headlines, instead of having to read each individual post. This reduces not only the number of documents the app needs to read, but also the bandwidth it consumes.
don't be afraid to duplicate data
This goes hand-in-hand with the previous guideline, and is very normal across all NoSQL databases, but goes against the core of what many of us have learned from relational databases. It is sometimes also referred to as denormalizing, as it counters the database normalization of relations database models.
Continuing the example from before: you'll probably have a separate document for each post, just to make sure that each post has its own single point of definition. But you'll store parts of that post in many other places, such as in the document-of-recent-headlines that we had before. This means that we'll have to duplicate the data for each new post into that document, and possibly multiple other places. This process is known as fan-out, and there are some common strategies for updating this denormalized data.
I find that this duplication leads to no concerns, as long as it is clear what the main point of definition for each entity is. So in our example: if there ever is a difference between the headline of a post in the post-document itself, and the document-of-recent-headlines, I know that I should update the document-of-recent-headlines, since the post-document itself is my point-of-definition for the post.
The result of all this is that I often see my database as part actual data storage, part prerendered fragments of application screens. As long as the points of definition are clear, that works quite well and allows me to define data models that scale efficiently both for users of the applications that consume the data and for the cost to operate them.
To learn more about NoSQL data modeling:
NoSQL data modeling
Getting to know Cloud Firestore, which contains many more examples of these prerendered application screens.

Related

Is this pattern valid for communicating among multiple firestore databases?

I'm currently brainstorming and wondering if it's possible to easily communicate among multiple firestore databases. If so, I could isolate collections and therefore also isolate writes/updates on those collections from competing with other services reducing the risk that I hit the 10,000 write limit p/second on a given database.
Conceptually, I figure I can capture the necessary information from one document in DB_A (including the doc_id) in a read and then set that document in DB_B with the matching doc_id.
In a working example, perhaps one page has a lot of content (documents) that I need to generate and I don't want those writes to compete with writes used in other services on my app. When a user visits this page, we show those documents from DB_A and if the user is interested in one of those documents, we can take that document that we've effectively already read, and now write it into DB_B where user-specific content lives. It seems practical enough. Are there any indexing problems / other problems that could come out of this solution that I'm not seeing?
In the example you give the databases themselves are not communicating, but your app is communicating with multiple database instances. That is indeed possible. Since you can only have one Firestore instance per project, you will need to add multiple projects to your app.
What you're describing is known as sharding, as each database becomes a shard of (a subset of) your entire data set.
Note that it is quite uncommon to have shards to Firestore. If you predict such a high volume of writes, also have a look at Firebase's Realtime Database - as that is typically better suited for use-cases with more, small writes. Firestore is more suited for use-cases that have fewer larger writes, and many more readers. While you may also still to shard (and possibly shard more to reach the same read capacity) with Realtime Database, it can have multiple database instances per project - making the process easier to manage.

Are CosmosDB attachments (still) a good way to store payloads larger than the document limit of 2MB?

I need to save CosmosDB documents that contain a large list - larger than the document limit of 2 MB.
I'd like to split that list into an 'attachment' associated to the "main document".
But this documentation page briefly mentions that
Attachment feature is being depreciated [sic]
What's the deprecation plan? Will newly created collections (in the future) stop supporting attachments?
The same page of documentation mentions a limit of 2GB for "Maximum attachment size per Account".
What does that mean? 2GB per attachment? 2 GB total for all attachments?
I recommend not taking a dependency on attachments. We are still planning on deprecating them but have not started in earnest on this.
Depending on your access patterns for this data you may want to break this up as individual documents or modeled in some other way. CRUD operations on large documents can be very costly and will experience high latency because of the large payload in each request.
If you have an unbounded array these should definitely be stored as individual documents or modeled such that increasing size does not cause eventual performance issues. If your data is updated frequently it should be modeled such that the frequently updated portions are separate from properties that are static.
This article here describes scenarios and considerations when modeling data in Cosmos and may help you come up with a more efficient model.
https://learn.microsoft.com/en-us/azure/cosmos-db/modeling-data
Hope this is helpful.

Firestore Realtime Updates 1M Limit

When using Firestore and subscribing to document updates, it states a limit of 1M concurrent mobile/web connections per database.
https://firebase.google.com/docs/firestore/quotas#realtime_updates
Is that a hard limit (enforced/throttled in code)? Or is it a theoretical limit (like you're safe up to 1M, then things get dicey)? Is it possible to get an uplift?
Trying to understand how to support a large user base without needing to shard the database (which is one of the advantages of Firestore). Even at 5M users, it seems you would start having problems because you'd probably hit times when >20% of those users were on your app simultaneously.
As you already noticed, the maximum size of a single document in Firestore is 1 Megabyte. Trying to store large number of objects (maps) that may exceed this limitation, is generally considered a bad design.
You should reconsider the logic of you app and think at the reson why you need to have more than 1Mib in single a document, rather than each object being their own document. So to be able to use Firestore, you should change the way you are holding the data from within a single documents to a collection. In case of collections, there are no limitations. You can add as many documents as you want. According to the official documentation regarding Cloud Firestore Data model:
Cloud Firestore is optimized for storing large collections of small documents.
IMHO, you should take advantage of this feature.
For details, I recommend you see my answer from this post where I have explained some practices regarding storing data in arrays (documents), maps or collections.
Edit:
Without sharding, I'm affraid it is not an option. So in this case, sharding will work for sure. So in my opinion, that's certainly a reasonable option.

Firestore database model for Notion-like modules [duplicate]

I have seen videos and read the documentation of Cloud firestore, from Google Firebase service, but I can't figure this out coming from realtime database.
I have this web app in mind in which I want to store my providers from different category of products. I want perform a search query through all my products to find what providers I have for such product, and eventually access that provider info.
I am planning to use this structure for this purpose:
Providers ( Collection )
Provider 1 ( Document )
Name
City
Categories
Provider 2
Name
City
Products ( Collection )
Product 1 ( Document )
Name
Description
Category
Provider ID
Product 2
Name
Description
Category
Provider ID
So my question is, is this approach the right way to access the provider info once I get the product I want?
I know this is possible in the realtime database, using the provider ID I could search for that provider in the providers section, but with Firestore I am not sure if its possible or if this is right approach.
What is the correct way to structure this kind of data in Firestore?
You need to know that there is no "perfect", "the best" or "the correct" solution for structuring a Cloud Firestore database. The best and correct solution is the solution that fits your needs and makes your job easier. Bear also in mind that there is also no single "correct data structure" in the world of NoSQL databases. All data is modeled to allow the use-cases that your app requires. This means that what works for one app, may be insufficient for another app. So there is not a correct solution for everyone. An effective structure for a NoSQL type database is entirely dependent on how you intend to query it.
The way you are structuring your data looks good to me. In general, there are two ways in which you can achieve the same thing. The first one would be to keep a reference of the provider in the product object (as you already do) or to copy the entire provider object within the product document. This last technique is called denormalization and is a quite common practice when it comes to Firebase. So we often duplicate data in NoSQL databases, to suit queries that may not be possible otherwise. For a better understanding, I recommend you see this video, Denormalization is normal with the Firebase Database. It's for Firebase Realtime Database but the same principles apply to Cloud Firestore.
Also, when you are duplicating data, there is one thing that needs to keep in mind. In the same way, you are adding data, you need to maintain it. In other words, if you want to update/delete a provider object, you need to do it in every place that it exists.
You might wonder now, which technique is best. In a very general sense, the best way in which you can store references or duplicate data in a NoSQL database is completely dependent on your project's requirements.
So you should ask yourself some questions about the data you want to duplicate or simply keep it as references:
Is the static or will it change over time?
If it does, do you need to update every duplicated instance of the data so they all stay in sync? This is what I have also mentioned earlier.
When it comes to Firestore, are you optimizing for performance or cost?
If your duplicated data needs to change and stay in sync in the same time, then you might have a hard time in the future keeping all those duplicates up to date. This will also might imply you spend a lot of money keeping all those documents fresh, as it will require a read and write for each document for each change. In this case, holding only references will be the winning variant.
In this kind of approach, you write very little duplicated data (pretty much just the Provider ID). So that means that your code for writing this data is going to be quite simple and quite fast. But when reading the data, you will need to load the data from both collections, which means an extra database call. This typically isn't a big performance issue for reasonable numbers of documents, but definitely does require more code and more API calls.
If you need your queries to be very fast, you may want to prefer to duplicate more data so that the client only has to read one document per item queried, rather than multiple documents. But you may also be able to depend on local client caches makes this cheaper, depending on the data the client has to read.
In this approach, you duplicate all data for a provider for each product document. This means that the code to write this data is more complex, and you're definitely storing more data, one more provider object for each product document. And you'll need to figure out if and how to keep up to date on each document. But on the other hand, reading a product document now gives you all information about the provider document in one read.
This is a common consideration in NoSQL databases: you'll often have to consider write performance and disk storage vs. reading performance and scalability.
For your choice of whether or not to duplicate some data, it is highly dependent on your data and its characteristics. You will have to think that through on a case-by-case basis.
So in the end, remember that both are valid approaches, and neither of them is pertinently better than the other. It all depends on what your use-cases are and how comfortable you are with this new technique of duplicating data. Data duplication is the key to faster reads, not just in Cloud Firestore or Firebase Realtime Database but in general. Any time you add the same data to a different location, you're duplicating data in favor of faster read performance. Unfortunately in return, you have a more complex update and higher storage/memory usage. But you need to note that extra calls in Firebase real-time database, are not expensive, in Firestore are. How much duplication data versus extra database calls is optimal for you, depends on your needs and your willingness to let go of the "Single Point of Definition mindset", which can be called very subjective.
After finishing a few Firebase projects, I find that my reading code gets drastically simpler if I duplicate data. But of course, the writing code gets more complex at the same time. It's a trade-off between these two and your needs that determines the optimal solution for your app. Furthermore, to be even more precise you can also measure what is happening in your app using the existing tools and decide accordingly. I know that is not a concrete recommendation but that's software development. Everything is about measuring things.
Remember also, that some database structures are easier to be protected with some security rules. So try to find a schema that can be easily secured using Cloud Firestore Security Rules.
Please also take a look at my answer from this post where I have explained more about collections, maps and arrays in Firestore.

Modeling document data and query performance

I have an aggerate data model (think a Customer entity with Widgets that belong to them as a list of embedded entities).
When I search for customers (e.g DocumentDBRepository.GetItemsAsync) That will be hydrating the customer data model along with the widgets for each. For efficiency reasons, I don’t really need the customer search to consider the widgets.
Are there any strategies for this in document dbs (such as a “LiteCustomer” entity)? I suspect not as that is just the nature of the “schema-less” data I’ve told it to store in the first place, but interested to hear thoughts.
Is this simply a ‘non issue’?
First, disclaimer: data modeling is hard. There are many nuances and a SO question can never cover entire business and everything left unsaid in both Q and A. There's no silver bullets. Regardless..
"LiteCustomer"
Perfectly fine to have such model in your client code. Your main Customer model may and will have many representations, most of them simple subsets of full model. Similarly to relational sql, select only what you need. Don't fetch data to client which you don't need.
The SQL API provides quite cool SQL tools to compose json for return documents for you.
physical storage model may differ from domain model
Consider your usage scenarios. If many scenarios happen to work with customer without widgets (or vice versa) then consider having widgets as separate document(s) in storage model.
In DocDB, the question is often not so much in querying logic but what your application expects on modification logic. Querying which is indexed is fast and every sql query can easily do transformations (though cross-doc joining is troublesome). For C(R)UD - you have less options - it's always by full document. Having too large documents will end up with higher RU costs and complex code.
Questions to consider:
How often customer changes without widget count/details changing?
How often widgets change without customer changing?
Do widgets on customer change independently or always as a set?
When do you need transactional updates on customer+widget changes?
How would queries look like? Can they be indexed?
Test.
True, changing model later is cumbersome in DocDB, but don't try to fix something before you know it's broken. If you are not sure you have an issue or not, then most likely fixing the maybe-issue is costlier than not fixing it.
If in doubt, generate loads of data and test it out.

Resources