Firestore subcollection vs array - firebase

First off, I know how Firestore works and have spent a lot of time evaluating different approaches for a good structure. Still, I am considering the following scenario:
There is a database of known recipes. Users can add recipes, but they have to be confirmed to be real recipes and not just variations. So every user can choose recipes from the user-generated list of recipes to state that they know how to cook them (or add new ones).
Now I want users to share their list of recipes with others, but this is where I am not sure how this can be best accomplished using Firestore. The trick is that I want to show all the recipes at once and don't want to paginate them.
I am currently evaluating two possibilities:
Subcollections
Whenever a user shares their list, the user looking at that list will have to load every recipe in it, which can result in a high number of document reads (I suppose realistically ~50, in very rare cases maybe 1000).
Pros:
More natural structure
Easier to maintain (e.g. deleting a recipe, checking if a specific one exists)
Easier to add fields (e.g. timeOfCreation, comment, personalRating, ...)
Cons:
Can result in a high number of reads in the long run
Arrays
I could save every known recipe (the id and an imageURL) inside the user's document (or as a single subdocument "KnownRecipes") within an array. This array could be in form of
recipesKnown: [
  {rid: "293ndwa", imageURL: "image1.com", timeAdded: 8371201332},
  {rid: "9012831", imageURL: "image1.com", timeAdded: 8371201871},
  {rid: "jd812da", imageURL: "image1.com", timeAdded: 8371201118},
  ...
]
Pros:
I only need one document read whenever someone wants to see another user's list
Reading a user's list is probably faster
Cons:
It's hard to update a specific recipe (e.g. someone wants to change the imageURL: I need to change the list locally and send the entire document as an update to the server - since I cannot just change a single element in the array)
When a user decides to have around 1000 recipes (this will maybe never happen, but it could), Firestore's 1 MiB document size limit could be reached. A possible workaround would be to create a separate document and split the array across the two documents.
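To put the 1 MiB concern in perspective, here is a rough back-of-the-envelope sketch in plain Python. It assumes Firestore's documented storage-size rules (a string costs its UTF-8 length plus 1 byte, a number costs 8 bytes, a map costs its field names plus values, an array costs the sum of its elements) and ignores per-document overhead, so treat the result as an approximation. The function names are made up for this illustration.

```python
# Approximate size of the recipesKnown array, NOT an official calculator.
# Assumption: string = UTF-8 bytes + 1, number = 8 bytes, boolean = 1 byte,
# map = field names + values, array = sum of its elements.
MAX_DOCUMENT_BYTES = 1_048_576  # 1 MiB document limit


def string_size(text: str) -> int:
    return len(text.encode("utf-8")) + 1


def entry_size(entry: dict) -> int:
    """Approximate stored size of one map inside the array."""
    total = 0
    for name, value in entry.items():
        total += string_size(name)
        if isinstance(value, bool):
            total += 1
        elif isinstance(value, str):
            total += string_size(value)
        elif isinstance(value, (int, float)):
            total += 8
        else:
            raise TypeError(f"unsupported value type: {type(value).__name__}")
    return total


def array_size(entries: list) -> int:
    return sum(entry_size(entry) for entry in entries)


# One entry shaped like the example in the question.
sample_entry = {"rid": "293ndwa", "imageURL": "image1.com", "timeAdded": 8371201332}

used = array_size([sample_entry] * 1000)
print(f"{used} bytes, {used / MAX_DOCUMENT_BYTES:.1%} of the limit")
```

With entries this small, 1000 recipes stay far below the limit. Real image URLs are longer, so it is worth re-running the estimate with representative values.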
For me, the idea with Subcollections seems to be the more "clean" solution to this problem, but maybe I am missing some arguments on why one of those solutions would be superior over the other.
My most common queries are as follows (ordered descending by importance):
Which recipes can a user cook
Add a recipe a user can cook to the user's list
Who can cook a specific recipe (there is a Recipe -> Cooks subcollection)
Update an existing recipe a user can cook

The answer to your question depends on the level of scalability you want to achieve.
If by design the amount of sub-data you want to store is limited and very low, you should use arrays, since you reduce the number of document reads, which means lower costs.
If your sub-data is supposed to increase "unlimitedly" over time, you should use sub-collections.
If you're building a database which is not supposed to scale in any direction (Proof of concept, very small business, etc.) just go with what you feel more comfortable with.
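The cost argument above can be made concrete with simple arithmetic. This is a minimal sketch; the traffic numbers (10,000 list views per month, 50 recipes per list) are assumptions chosen only for illustration, and the function names are hypothetical.

```python
# Compare document reads for the two layouts. No prices are assumed here,
# only read counts; plug in your own numbers.


def reads_per_view_array() -> int:
    # The whole list lives in one document.
    return 1


def reads_per_view_subcollection(recipes_in_list: int) -> int:
    # One read per recipe document; a query is billed at least one read.
    return max(1, recipes_in_list)


def monthly_reads(views_per_month: int, reads_per_view: int) -> int:
    return views_per_month * reads_per_view


views = 10_000  # assumed list views per month
array_total = monthly_reads(views, reads_per_view_array())
subcollection_total = monthly_reads(views, reads_per_view_subcollection(50))
print(array_total, subcollection_total)
```

The gap grows linearly with list length, which is why the array layout only pays off when the list is bounded and small.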

I'm researching the same question...
One of the questions is whether the data held in the document will ever go past 1 MB, which is the limit for a document. Researching a bit on how much plain text fits in 1 MB: it's a hell of a lot. Still, if the data were to grow much bigger, it would eventually hit that limit. So if you are thinking big, use sub-collections.
If we had to use the Firebase element logic, the answer would be sub-collections.
Still, I guess the major point is the data pulled. If you read the user document, you directly pull that whole MB of data. With a sub-collection, that data is not loaded along with the parent document, and even when you do load it you can still lazy-load.
I guess for the kind of setup you are doing, sub-collections are the way to go.

A key is an additional pro/con of a collection:
a key could help to avoid duplicates, but this requires thinking about how a duplicate is defined (which might change);
an array's no-key behavior could be emulated via auto-ID.
P.S. @Thomas's list of pros/cons in the question has been quite helpful.

Related

Should I create a duplicate collection/document for each use-case? (Firebase/Firestore)

I'm trying to build an ecommerce app with firebase on the backend. I have a collection of 1000+ products, each of which is stored as a separate document, which have product specific info such as price, title etc.
document:{
title: 'Some Title',
price: '$99.99',
genres: ['Horror', 'Action']
}
So in my app I need to display these products in many places, such as product carousels(similar to a bookshelf with arrow buttons at the ends), and also in a search results page.
At any given page, I assume that I will need to display at least 50 products, either as search results, or multiple carousels. I understand that I can use queries to get this data from firebase. But since each document I retrieve counts as (at least)one firestore read, I assume that a typical user session would run into 100+ reads, if not thousands.
It seems a little inefficient to me that I need to read multiple documents to get this data, when I could just store all that data in a single array, as its own document. That would mean I get charged for one document read, not 50, per page.
Is this how it is expected to be done? Should I create a new document containing the data I need for each specific use case?
P.S. I'm pretty new to backend dev, let alone firebase.
TL;DR Yes, you should create a new document with the needed data for each specific use case, but it’s not recommended to make it as a document with nested objects like arrays with 1000+ elements.
From a technical point of view, Cloud Firestore is optimized for storing large collections of small documents.
Depending on the use case, you can select the most appropriate Cloud Firestore data structure.
For example, the 10 best-selling books of the month can be a document with nested complex objects like arrays or maps. This structure could be useful for use cases with a small or predefined number of elements, but as the Firestore documentation notes, if your data expands over time with larger or growing lists, the document also grows, which can lead to slower document retrieval times.
With more than a thousand records, a better choice can be to structure your data as subcollections. That is, you can create collections within documents when you have data that might expand over time, with the main advantage that, as your lists grow, the size of the parent document doesn't change.
Cloud Firestore also has several features to help you manage queries that return a large number of results:
Cursors, which allow you to resume a long-running query.
Page tokens, which help you paginate the query results.
Limits, which specify how many results to retrieve.
Offsets, which allow you to skip a fixed number of documents.
There are no additional costs for using cursors, page tokens, and limits. In fact, these features can help you save money by reading only the documents that you actually need.
As a best practice, do not use offsets. Instead, use cursors. Using an offset only avoids returning the skipped documents to your application, but these documents are still retrieved internally. The skipped documents affect the latency of the query, and your application is billed for the read operations required to retrieve them.
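The billing difference between offsets and cursors can be simulated without touching Firestore. The sketch below is an in-memory model only: `page_by_offset` and `page_by_cursor` are hypothetical helpers that mimic the billing rule described above (skipped documents are still read), not SDK calls.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Page:
    docs: list
    billed_reads: int


def page_by_offset(sorted_keys: list, offset: int, limit: int) -> Page:
    docs = sorted_keys[offset:offset + limit]
    # Skipped documents are retrieved internally, so they are billed too.
    return Page(docs, offset + len(docs))


def page_by_cursor(sorted_keys: list, start_after, limit: int) -> Page:
    # start_after is the sort key of the last document on the previous page.
    start = 0 if start_after is None else bisect.bisect_right(sorted_keys, start_after)
    docs = sorted_keys[start:start + limit]
    return Page(docs, len(docs))


keys = list(range(1000))  # stand-in for 1000 documents ordered by one field
limit = 50
page_count = len(keys) // limit

offset_reads = sum(
    page_by_offset(keys, index * limit, limit).billed_reads
    for index in range(page_count)
)

cursor_reads = 0
last_key = None
for _ in range(page_count):
    page = page_by_cursor(keys, last_key, limit)
    cursor_reads += page.billed_reads
    last_key = page.docs[-1]

print(offset_reads, cursor_reads)
```

Paging through the same 1000 documents costs roughly ten times more reads with offsets in this model, and the ratio worsens as the collection grows.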

Is there a best practice limitation of how many items I should keep in a single DynamoDB table?

I am setting up a Serverless application for a system and I am wondering the following:
Say that my table handles Companies. Each Company can have Invoices. Each company has roughly 6,000-8,000 Invoices. Say that I have 14 Companies; that results in roughly 112,000 items in my table.
Is it "okay" to handle it this way? I will only pay for each Get request I do, and I can query a lot of items into the same get request.
I will not fetch every single item each time I write or get items.
So, is there a recommendation for how many items I should max have in a table? I could bake some items together, but I mainly want a general recommendation.
There is no practical limit to the number of items you can have in a table. How you split each invoice into items depends on your application's access patterns. You need to ask: what data does your app need, when does it need that data, how large is the data, and how often is the item updated? For example, if all the data in one item comes in under the 1 KB write capacity unit (WCU) and 4 KB read capacity unit (RCU) sizes, you do not write to it often, and you need all of the data in the item whenever you read it, then perhaps put it all in one item. If the data is larger, or part of it gets written to more often, then perhaps split it up.
An example might be a package tracking app. You have the initial information about the package, size, weight, source address, destination address, etc. That could be a lot of data. When that package enters a sorting facility it is checked in. Do you want to update that entire item you already wrote? Or do you just write an item that has the same PK (item collection), but a different SK and then the info that it made it to the sorting facility? When it leaves the sorting facility, you want to write to the DB that it left, which truck it was on, etc. Same questions.
Now when you need to present the shipping information by tracking ID number, the PK, you can do a query to DynamoDB and get the entire item collection for that tracking ID number. Therefore you get all items with that ID as your app presents much of that information on the tracking web site for the customer.
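The package-tracking layout can be sketched with a small in-memory model. `InMemoryTable` is a hypothetical stand-in written for this example (it is not the boto3 API), and the key formats such as `TRACK#123` and `EVENT#...` are assumptions for illustration.

```python
class InMemoryTable:
    """Toy model of a table keyed by partition key (PK) and sort key (SK)."""

    def __init__(self):
        self._items = {}  # (pk, sk) -> item attributes

    def put_item(self, pk: str, sk: str, **attributes) -> None:
        # Writing a new SK under an existing PK adds to the item collection
        # instead of rewriting the large initial item.
        self._items[(pk, sk)] = {"PK": pk, "SK": sk, **attributes}

    def query(self, pk: str) -> list:
        # Return the whole item collection for one PK, ordered by SK.
        return [self._items[key] for key in sorted(self._items) if key[0] == pk]


table = InMemoryTable()
table.put_item("TRACK#123", "METADATA", weight_kg=2.5, destination="Berlin")
table.put_item("TRACK#123", "EVENT#2024-05-01T08:00#ARRIVED", facility="Hub A")
table.put_item("TRACK#123", "EVENT#2024-05-01T17:30#DEPARTED", truck="T-42")
table.put_item("TRACK#999", "METADATA", weight_kg=1.0, destination="Oslo")

for item in table.query("TRACK#123"):
    print(item["SK"])
```

Each facility event is a small write of its own, while a single query by tracking ID still returns everything the tracking page needs.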
So again, it really depends on the app and your access patterns, but you want to TRY to only read and write the items your app needs, when you need them, how you need them, and no more...within reason (there is such a thing as over slicing your data). That is how, in my opinion, you will make a NoSQL database like DynamoDB be the most performant and most cost effective.
DynamoDB won't even notice 100K entries...
As mentioned by LifeOfPi, each item should be smaller than 400 KB.
The question indicates a distinct lack of understanding of what/why/how to use DDB. I suggest you do some more learning. The AWS Reinvent videos around DDB are quite useful.
In a standard RDBMS, you need to know the structure from the beginning. Accessing that data is then very flexible.
DDB is the opposite: you need to understand how you'll need to access your data; the structure is not important. You should end up with something like so:
For 100K items and for most applications, you may find Aurora serverless to be an easier fit for your needs; especially if you have complicated searching and/or sorting needs.

Firestore, Array vs documents list

Assuming a user has a thousand friends, but when calling a friend list on a specific screen, bringing in a thousand documents is expensive and time consuming. Even if pagination is performed, there will be a speed delay due to additional requests.
And according to the official documentation, you can put 1MB in documents, that is, about 1 million characters. However, what I worry about when using Arrays is that there will be situations where things get complicated in many ways.
Are there any exact standards?
You seem to be fully aware of the limitations of Firestore in this case. There is no new information that will help you here.
If you have a list of things to store in Firestore, and that list could exceed the max size of a document (1MB, as you correctly state), then you are going to have a problem. On the other hand, if you put all of those items in other documents, you're going to have to pay for all of those reads. That's the tradeoff -- you will have to decide which problem is worse. Typically, people choose to use separate documents so they are not limited by the size of a document. But that's your call to make.
You could try to shard that list across multiple documents somehow, but that will add much complexity to your code. Again, your call to make.

Determining number of Firebase reads for nested sub-collection

I have a mobile solution (iOS) that is using Firebase to aid in syncing of data between a user's devices. What I have works and allows me to keep clients in sync as I wanted to. However, from testing, my reads are a bit out of control for larger data sets and I need to do some optimization. To that end, I wanted to make sure that my understanding of how reads are counted was correct (I am still a newbie at Firebase).
My data is structured like this:
It's a bit nested, I agree, but for all the use cases it seems to be the best way to do things to minimize redundancy, e.g. there are relationships between Cats and Dogs and Birds, but I only store one copy of each, not multiple. In addition, each user's data is segregated from the other users' and I need the ability to version the data. Put that all together with the requirement to alternate collections and documents, and you get what you see.
Based on this structure, I can create queries like this:
Firestore.firestore()
    .collection("userid1").document("data")
    .collection("version0").document("Cats")
    .collection("data")
    .whereField("modifiedDate", isGreaterThanOrEqualTo: someDoubleValue)
    .getDocuments(completion: completionCallback)
This gets me the data I need and seems to only return the number of items I think it should. However, am I correct in saying that if there are 100 Cat type documents (Cat1...Cat100), but only 3 of them have a modifiedDate that is greater than my query parameter, when the data is returned to me, I will only be "charged" for 3 reads? Or have I done something completely silly here and I am getting charged for all 100 even though I only get 3 documents back in the callback?
The billing doesn't work any different for subcollections than it does for top-level collections. You are only billed for the documents transferred, not the entire set of documents in the collection (unless you do request every document).
Cloud Firestore scales massively, and it's expected that you might have a massive number of documents in a collection. Billing a read for each and every document in a collection for each query against that collection would be insanely expensive.
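The billing rule in this answer can be written down as a tiny model. This is only a sketch of the rule as described (reads follow the documents returned, with a minimum of one read per query even when nothing matches); `run_query` is a hypothetical helper, not an SDK function.

```python
def run_query(docs: list, modified_since: float):
    """Return the matching documents and the reads the query would be billed."""
    matches = [doc for doc in docs if doc["modifiedDate"] >= modified_since]
    # Billing follows the result set, not the size of the collection.
    billed_reads = max(1, len(matches))
    return matches, billed_reads


# 100 "Cat" documents with modifiedDate values 1.0 .. 100.0
cats = [{"id": f"Cat{i}", "modifiedDate": float(i)} for i in range(1, 101)]

matches, billed = run_query(cats, modified_since=98.0)
print(len(matches), billed)
```

In this model the 97 documents that fail the filter cost nothing, which matches the scenario in the question.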

Riak solution for querying data by books or unique pages

Consider a set of data called Library, which contains a set of Books and each book contains a set of Pages.
Let's say you are using Riak to store this data, and you need to access the data in two possible ways:
- Query for a particular page (with a unique id)
- Query for all pages in a particular book (with a unique name)
Additionally, you need to be able to easily update and delete pages of a particular Book.
What would be the best way to accomplish this in Riak?
Obviously Riak Search will do the trick, but maybe is inefficient for what I am trying to do. I am wondering if it makes sense to set up buckets where each bucket can be a Book (which would make for potentially millions of "Book" buckets). Maybe that is a bad idea...
Can this be accomplished with secondary indexes?
I am trying to keep this simple...
I am new to Riak and I am trying to find the best way to accomplish something that is probably relatively simple. I would appreciate any help from the Stack Overflow community. Thanks!
A common way to model master-detail relationships in Riak is to have the master record contain a list of detail record IDs, possibly together with some information about the detail record that may be useful when deciding which detail records to retrieve.
In your example, you could have two buckets called 'books' and 'pages'. The master record in the 'books' bucket will contain metadata and information about the book as a whole together with a list of pages that are included in the book. Each page would contain the ID of the 'pages' record holding the page data as well as the corresponding page number. If you e.g. wanted to be able to query by chapter, you could also add information about which chapters a certain page belongs to.
The 'pages' bucket would contain the text of the page and possibly links to images and other media data that are included on that page. This data could be stored in yet another bucket.
In order to get a specific page or a range of pages, one would first retrieve the master record from the 'books' bucket and then, based on the contents of the record, the appropriate pages. Even though this requires several GET operations, they are all direct lookups based on keys, which is the most efficient and scalable way to retrieve data from Riak, so it will perform and scale well.
This approach also makes it simple to change the order of pages and/or chapters as only the master record needs to be updated. Adding, deleting or modifying pages would however require both the master record as well as one or more detail records to be updated, added or deleted.
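The master-detail layout described above can be sketched with two dictionaries standing in for the 'books' and 'pages' buckets. This is an in-memory illustration only: none of these helpers are Riak client calls, and the record shapes are assumptions.

```python
books = {}  # stand-in for the 'books' bucket: book name -> master record
pages = {}  # stand-in for the 'pages' bucket: page id -> page record


def put_book(name: str, title: str) -> None:
    books[name] = {"title": title, "pages": []}


def add_page(book_name: str, page_id: str, number: int, text: str) -> None:
    # Two writes: the detail record and the updated master record.
    pages[page_id] = {"text": text}
    master = books[book_name]
    master["pages"].append({"id": page_id, "number": number})
    master["pages"].sort(key=lambda entry: entry["number"])


def get_page(page_id: str) -> dict:
    # Direct key lookup, no master record needed.
    return pages[page_id]


def get_book_pages(book_name: str) -> list:
    # One lookup for the master record, then one key lookup per page.
    master = books[book_name]
    return [pages[entry["id"]] for entry in master["pages"]]


def delete_page(book_name: str, page_id: str) -> None:
    master = books[book_name]
    master["pages"] = [entry for entry in master["pages"] if entry["id"] != page_id]
    pages.pop(page_id, None)


put_book("moby-dick", "Moby Dick")
add_page("moby-dick", "p-002", 2, "Some years ago...")
add_page("moby-dick", "p-001", 1, "Call me Ishmael.")
print([page["text"] for page in get_book_pages("moby-dick")])
```

Because the two writes in `add_page` and `delete_page` are not atomic in a real cluster, an application would need its own recovery strategy for a failure between them.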
You can most certainly also solve this problem by adding secondary indexes to the objects and querying based on them. Secondary index queries in Riak do, however, have to include processing on a covering set (generally ring size / n_val) of partitions in order to fulfil the request, and therefore put a bit more load on the system and generally result in higher latencies than retrieving a single object containing keys through a direct key lookup (which only needs to involve the partitions where the object is actually stored).
Although maintaining a separate object containing indexes adds a bit of extra work when inserting or deleting pages/entries, this approach will generally result in more efficient reads, as only direct key lookups are required. If your application is heavy on reads, it probably makes sense to use this approach, while secondary indexes could be more efficient for a write heavy application as inserts and modifications are made cheaper at the expense of more expensive reads. You can however always add secondary indexes just in case in order to keep your options open.
In cases like this I would usually recommend performing some benchmarks to test the solutions and check which solution best matches your particular performance and scaling requirements.
The most efficient way will be to store the whole book as one object, and duplicate its pages as separate objects.
Pros:
you will be able to select any object by its key (the cheapest operation in Riak is a KV query)
every query will have predictable latency
this is the natural way of storing data in Riak
Cons:
If you need to update any page, you must update the whole book and then the page. As Riak doesn't have atomic operations, you must think about how to recover from any failure situation (for example: the book was updated, but the page was not).
Riak is about availability and predictable latency, so if you use something like 2i to collect results, the query time becomes unpredictable and will grow with the number of pages.

Resources