Many tiny documents in CosmosDB - azure-cosmosdb

I have many (order of 100s) pieces of data that I want to associate with a document in CosmosDB. Each piece of data is small (order of 100s of bytes).
My first solution was to store the data as an array inside the document. This works okay, but in order to append a new item to the array I need to read the document from CosmosDB, add the element, then replace the document back into CosmosDB.
Instead of doing this I would like to store each piece of data as its own document in the same partition. What are the drawbacks of having many tiny documents vs the one aggregated document?

What are the drawbacks of having many tiny documents vs the one aggregated document?
I would suggest storing each piece of data as its own document instead of one aggregated document.
Reason 1: As you mentioned in your question, to add an element you need to read the document from Cosmos DB and then replace it, because partial updates are not supported by Cosmos DB so far. (Please refer to this feedback item and follow it if you need the feature: https://feedback.azure.com/forums/263030-azure-cosmos-db/suggestions/6693091-be-able-to-do-partial-updates-on-document) That is a lot of tedious work.
Reason 2: If you store the pieces of data as separate documents, you can query them flat (SELECT * FROM c).
With one aggregated array document, you need a JOIN to access the nested properties (SELECT a FROM c JOIN a IN c.array).
Reason 3: If you store the pieces of data separately, you can spread them across different partitions. Even if you don't need that now, why not keep the option open for the future.
Reason 4: As to cost, it all depends on RUs; storage and every request to Cosmos DB consume RUs. If you store the pieces of data separately, you only need to access the specific documents you want, which I think is more economical.
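To make the difference concrete, here is a minimal sketch with the @azure/cosmos JavaScript SDK (the endpoint, key, database, container, and property names are placeholders, not from the question). Appending to the aggregated document takes a read plus a full replace, while the per-item approach is a single create in the same logical partition:

    import { CosmosClient } from "@azure/cosmos";

    const client = new CosmosClient({ endpoint: "<endpoint>", key: "<key>" });
    const container = client.database("mydb").container("items");

    // Aggregated document: read-modify-replace (two round trips, whole document rewritten)
    const { resource: doc } = await container.item("parent-id", "partition-1").read();
    doc.entries.push({ createdAt: Date.now(), payload: "..." });
    await container.item("parent-id", "partition-1").replace(doc);

    // One tiny document per piece of data: a single create in the same partition
    await container.items.create({
      id: "entry-0001",
      pk: "partition-1", // partition key property (placeholder name)
      createdAt: Date.now(),
      payload: "...",
    });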

Depends on your use case.
For frequent add operations, you first read the document and then write the updated version back (2 operations), which will cost you more than creating a new document (1 operation).
However, if the documents have some sort of relationship (like foreign keys in traditional SQL), getting the data will require multiple queries if you go with the many-small-documents approach (more cost); with the aggregated document you get the complete data in a single query (lower cost).
I'd recommend going through this post and this one, which will give you better insight into which approach to choose.

I'm facing this question right now, so I want to add my contribution here. I have to store some statuses; each status is a metric that I get once per hour, so I had two options:
Create a register per status -> 24 registers per day
Create a register per day and add each status inside it -> 1 register per day with 24 statuses inside an array
I chose the second one because:
Both options result in the same number of operations on the database
I'm using this data in Power BI, and after doing some tests the data from the second option had a smaller size after import
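For illustration only (the field names below are placeholders, not from the original post), the per-day register could look roughly like this:

    // one register per day, with the 24 hourly statuses nested in an array
    const dailyRegister = {
      id: "2023-05-01", // one document per day
      statuses: [
        { hour: 0, value: 42 },
        { hour: 1, value: 40 },
        // ... up to hour 23
      ],
    };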

Related

Large arrays in Firestore Database (Best practices)

I am populating a series of dates and temperatures that I was thinking of storing in a Firestore Database to later be consumed by the front-end with the following structure:
{
  date: ['1920-01-01', '1920-01-02', '1920-01-03', '1920-01-04', '1920-01-05', ...],
  values: [20, 18, 19.5, 20.5, ...]
}
The arrays may cover many years, so they turn huge, with thousands of entries. Firestore started complaining with the too many index entries for entity error, and even if I get the data uploaded, the user interface (Firebase -> Firebase Database -> Panel View) collapses. That happens even with arrays of fewer than 3,000 entries.
The fact is that the data is consumed in the front-end with an array structure very similar to the one described above (I want to plot it using the Echarts library). For that reason, I found this structure to be the most natural, as any other alternative would require rebuilding the arrays in the front-end.
Nevertheless, I see that Firestore very clearly does not like this structure. What should I do? What is the best practice for dealing with this kind of data in Firestore?
The indexes required for the most basic queries in Firestore are automatically created for you. However, there are some limits involved. So you're getting the following error:
too many index entries for entity
Because you hit the maximum number of index entries for a document, which is 40,000. If you add too many elements into an array or you add too many fields to a document, then you can reach the maximum limit.
So most likely the number of elements that exist in the date array + the number of elements that exist in the values array is bigger than 40k, hence the error.
To solve this, you might consider creating two separate documents, one for each array. If you still hit the maximum limit, then you might consider creating a document for each hour, and not for an entire day. In this way, you'll drastically reduce the number of elements that exist in an array.
If you don't find these solutions useful, then you have to set some "Single-field index exemptions" to avoid the above error.
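As a rough sketch of the splitting idea, using the Firebase Admin SDK (the collection and document names are placeholders): storing one small document per day, or per chunk of the series, keeps every array far below the index-entry limit.

    import { initializeApp } from "firebase-admin/app";
    import { getFirestore } from "firebase-admin/firestore";

    initializeApp();
    const db = getFirestore();

    // one small document per day instead of one giant pair of arrays
    await db.collection("temperatures").doc("1920-01-01").set({
      values: [20, 18, 19.5, 20.5], // that day's readings only
    });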
Firestore is not the best tool to deal with time series. The best solution I found in Firestore was creating an independent document for each day in my data. Nevertheless, that raises the number of documents I need to fetch from the front-end side and, therefore, the costs.
By using large arrays in Firestore, you easily reach the index limit and are forced to remove the index, which I feel is a big red flag suggesting you should check out another tool.
The solution I found, in case it is useful for anyone, was building my own API in Flask with MongoDB as the database. Although it takes more effort than just using Firestore, it deals better with time series and brings more flexibility.

The best way to calculate total money from multiple orders

Let's say I have a multi-restaurant food ordering app.
I'm storing orders in Firestore as documents.
Each order object/document contains:
total: double
deliveredByUid: str
restaurantId: str
I want to see, at any time during the day, the totals of every driver for each restaurant, like so:
robert:
  mcdonalds: 10
  kfc: 20
alex:
  mcdonalds: 35
  kfc: 10
What is the best way of calculating the totals of all the orders?
I'm currently thinking of the following:
The safest and easiest method, but expensive: each time I need to know the totals, I just query all the documents for that day and calculate them one by one.
Cloud Functions method: each time an order is added/removed, modify a value at a specific Realtime Database child: /totals/driverId/placeId
Manual totals: each time a driver completes an order and writes their id to the order object, make another write to the specific Realtime Database child.
Edit: added the whole order object because I was asked to.
What I would most likely do is make sure orders are completely atomic (or as atomic as they can be). Most likely, I'd perform the order on the client within a transaction or batch write (both are atomic) that would not only create this document in question but also update the delivery driver's document by incrementing their running total. Depending on how extensible I wanted to get, I may even create subcollections within the user's document that represented chunks of time if I wanted to be able to record totals by month or year, or whatever. You really want to think this one through now.
The reason I'd advise against your suggested pattern is because it's not atomic. If the operation succeeds on the client, there is no guarantee it will succeed in the cloud. If you make both writes part of the same transaction then they could never be out of sync and you could guarantee that the total will always be accurate.
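Here is a minimal sketch of that idea with the Firebase Admin SDK (the collection names and totals layout are assumptions, not from the question): the order document and the driver's running total are written in one atomic batch, so they can never get out of sync.

    import { initializeApp } from "firebase-admin/app";
    import { getFirestore, FieldValue } from "firebase-admin/firestore";

    initializeApp();
    const db = getFirestore();
    const batch = db.batch();

    // 1) create the order document
    const orderRef = db.collection("orders").doc();
    batch.set(orderRef, { total: 10, deliveredByUid: "robert", restaurantId: "mcdonalds" });

    // 2) bump the driver's running total for that restaurant in the same atomic write
    const totalRef = db.collection("totals").doc("robert");
    batch.set(totalRef, { mcdonalds: FieldValue.increment(10) }, { merge: true });

    await batch.commit();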

Should I create a duplicate collection/document for each use-case? (Firebase/Firestore)

I'm trying to build an e-commerce app with Firebase on the backend. I have a collection of 1000+ products, each of which is stored as a separate document with product-specific info such as price, title, etc.
document: {
  title: 'Some Title',
  price: '$99.99',
  genres: ['Horror', 'Action']
}
So in my app I need to display these products in many places, such as product carousels (similar to a bookshelf with arrow buttons at the ends) and on a search results page.
On any given page, I assume that I will need to display at least 50 products, either as search results or in multiple carousels. I understand that I can use queries to get this data from Firebase. But since each document I retrieve counts as (at least) one Firestore read, I assume that a typical user session would run into 100+ reads, if not thousands.
It seems a little inefficient to me that I need to read multiple documents to get this data, when I could just put all that data in a single array, as its own document. That would mean I get charged for one document read per page, not 50.
Is this how it is expected to be done? Should I create a new document containing the data I need for each specific use case?
P.S. I'm pretty new to backend dev, let alone firebase.
TL;DR Yes, you should create a new document with the needed data for each specific use case, but it's not recommended to make it a document with nested objects such as arrays with 1000+ elements.
From a technical point of view, Cloud Firestore is optimized for storing large collections of small documents.
Depending on the use case, you can select the most appropriate Cloud Firestore data structure.
For example, the 10 best-selling books of the month can be a document with nested complex objects like arrays or maps. This structure could be useful for use cases with a small or predefined number of elements, but as stated here, if your data expands over time with larger or growing lists, the document also grows, which can lead to slower document retrieval times.
With a thousand or more records, a better choice is to structure your data as subcollections. That is, you can create collections within documents when you have data that might expand over time, with the main advantage that, as your lists grow, the size of the parent document doesn't change.
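A minimal sketch of the subcollection approach with the Firebase Admin SDK (the collection and field names are placeholders): each new element becomes its own document in a subcollection, so the parent document never grows.

    import { initializeApp } from "firebase-admin/app";
    import { getFirestore } from "firebase-admin/firestore";

    initializeApp();
    const db = getFirestore();

    // the parent product document stays small; the growing list lives in a subcollection
    await db.collection("products").doc("prod-123")
      .collection("reviews")
      .add({ rating: 5, text: "Great book", createdAt: new Date() });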
Cloud Firestore also has several features to help you manage queries that return a large number of results:
Cursors, which allow you to resume a long-running query.
Page tokens, which help you paginate the query results.
Limits, which specify how many results to retrieve.
Offsets, which allow you to skip a fixed number of documents.
There are no additional costs for using cursors, page tokens, and limits. In fact, these features can help you save money by reading only the documents that you actually need.
As a best practice, do not use offsets. Instead, use cursors. Using an offset only avoids returning the skipped documents to your application, but these documents are still retrieved internally. The skipped documents affect the latency of the query, and your application is billed for the read operations required to retrieve them.
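For example, here is a cursor-based pagination sketch with the Firebase Admin SDK (the collection and field names are placeholders); each page reads, and bills, only the documents it actually returns.

    import { initializeApp } from "firebase-admin/app";
    import { getFirestore } from "firebase-admin/firestore";

    initializeApp();
    const db = getFirestore();

    // first page
    const first = await db.collection("products").orderBy("title").limit(50).get();

    // next page: resume after the last document of the previous page (a cursor)
    const last = first.docs[first.docs.length - 1];
    const next = await db.collection("products")
      .orderBy("title")
      .startAfter(last)
      .limit(50)
      .get();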

Can we avoid scan in dynamodb

I am new to NoSQL data modelling, so please excuse me if my question is trivial. One piece of advice I found for DynamoDB is to always supply the partition key while querying; otherwise it will scan the whole table. But there could be cases where we need to list our items, for instance on an e-commerce website where we need to list our products on a listing page (with pagination).
How should we perform this listing while avoiding a scan, or at least using it efficiently?
Basically, there are three ways of reading data from DynamoDB:
GetItem – Retrieves a single item from a table. This is the most efficient way to read a single item, because it provides direct access to the physical location of the item.
Query – Retrieves all of the items that have a specific partition key. Within those items, you can apply a condition to the sort key and retrieve only a subset of the data. Query provides quick, efficient access to the partitions where the data is stored.
Scan – Retrieves all of the items in the specified table. (This operation should not be used with large tables, because it can consume large amounts of system resources.)
And that's it. As you see, you should always prefer GetItem (BatchGetItem) to Query, and Query to Scan.
You could use queries if you add a sort key to your data. I.e. you can use category as a hash key and product name as a sort key, so that the page showing items for a particular category could query by that category and product name. But that design is fragile, as you may need other keys for other pages; for example, you may need a vendor + price query if the user looks for a particular mobile phone. Indexes can help here, but they come with their own tradeoffs and limitations.
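As a rough illustration with the AWS SDK for JavaScript v3 (the table and attribute names are assumptions): with category as the partition key and productName as the sort key, a category listing page becomes a Query against one partition instead of a Scan over the whole table.

    import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
    import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

    const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

    // read only the "phones" partition instead of scanning the whole table
    const { Items } = await ddb.send(new QueryCommand({
      TableName: "Products",
      KeyConditionExpression: "category = :c",
      ExpressionAttributeValues: { ":c": "phones" },
      Limit: 20, // page size; use LastEvaluatedKey / ExclusiveStartKey for pagination
    }));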
Moreover, filtering by arbitrary expressions is applied after the query / scan operation completes but before you get the results, so you're charged for the whole query / scan. It's literally like filtering the data yourself in the application and not on the database side.
I would say that DynamoDB is just not intended for many kinds of workloads. Probably, it's not suited for your case either. Think of it as a rich key-value (key-to-object) store, and not a "classic" RDBMS where indexes come at a lower cost and with fewer limitations and which provides developers rich querying capabilities.
There is a good article describing potential issues with DynamoDB, take a look. It contains an awesome decision tree that guides you through the question of whether DynamoDB fits. I'm pasting it here, but please note that the original author is Forrest Brazeal.
Another article worth reading.
Finally, check out this short answer on SO about DynamoDB usecases and issues.
P.S. There is nothing criminal in doing scans (I actually run them on a schedule once per day in one of my projects), but that's an exceptional case, and I regret the decision to use DynamoDB there. It's not efficient in terms of speed, money, support, and "dirtiness". I had to increase the capacity before the job and reduce it afterwards, but that's another story…

Work around firestore document size limit?

I need to store a large number of fields, like for a star rating system, but firestore only allows 20,000 fields per document. Is there a known way around this? Right now I am going to 'shard' the fields in multiple documents, and keep the size of each document in a documentSizeTracker document that I use to determine which document to shard to (and add to the counter with a transaction). Is this the correct approach? Any problems with this?
Sharding certainly could work. It's hard to say without knowing exactly what kind of data you'll need from your document, and when, but that's certainly a reasonable option. You could also consider having a parent "summary" doc that contains fields you might want to search on and then split all of your data into several documents inside a subcollection of that parent.
One important nuance here: the limit isn't 20,000 fields, but 20,000 indexed fields. So if you're storing a bunch of data inside your document, but you know that you're not going to be searching on all of them, another alternative is to mark some of your fields as unindexed (which you can now do in the Firebase console in the "Exemptions" section).
If you're dealing with thousands of fields, though, you probably won't want to exempt them all one at a time, so a better alternative might be to place your data as a map inside a container field (named something like "allOfMyData"), then just mark that one field as unindexed. That will automatically remove all indexes from any fields contained inside that map.
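A minimal sketch of that idea with the Firebase Admin SDK (the collection, document, and field names are placeholders): all of the values live inside one container map field, which you then exempt from indexing in the console.

    import { initializeApp } from "firebase-admin/app";
    import { getFirestore } from "firebase-admin/firestore";

    initializeApp();
    const db = getFirestore();

    // thousands of values nested under a single container field;
    // add a single-field exemption for "allOfMyData" so none of them are indexed
    await db.collection("ratings").doc("shard-0").set({
      allOfMyData: { user_001: 5, user_002: 4, user_003: 3 /* ... */ },
    });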
Actually, I ran into a similar problem with read and write issues in Firebase. So here is my conclusion:
# if something small needs to be written & read very often, then use Firebase Realtime Database
Firebase Realtime database allows fast writes, but limits concurrent users to 100,000
Firebase Firestore allows a maximum of 1 write per second per document
It's very expensive to read a document that only contains a rating for example in Firestore
# if something (larger) needs to be read very often with writes usually more than 1 second in between then use Firestore
Firestore allows up to 1,000,000 concurrent users at current Beta release (they might make it more)
It's cheaper to read a large document (less than 1 MiB limit) in Firestore than Firebase Realtime database
# If your model doesn't fit into these two choices, then you should modify your model and split them into 2 models:
1 very small model to store in Firebase Real Database (ratings for example)
1 larger model to store in Firestore
Note: You could use both Firebase Realtime Database and Firestore in the same project. Don't forget to take into account the billing differences between the two databases and their different limits. I believe it's best to combine them and use the strengths of each instead of trying to force all solutions into one of them.
Note 2: I really didn't like the sharding idea suggested in the Firestore solution and workaround above.
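For completeness, a minimal sketch of combining both databases in one project with the Firebase Admin SDK (the paths and names are placeholders): the small, write-heavy value goes to the Realtime Database, while the larger model stays in Firestore.

    import { initializeApp } from "firebase-admin/app";
    import { getDatabase } from "firebase-admin/database";
    import { getFirestore } from "firebase-admin/firestore";

    initializeApp();

    // small value that is written and read very often -> Realtime Database
    await getDatabase().ref("ratings/item-123/stars").set(4);

    // larger model that is read often but written rarely -> Firestore
    await getFirestore().collection("items").doc("item-123").set({
      title: "Item 123",
      description: "A larger document, well under the 1 MiB limit",
    });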
