CosmosDB Partition and Update attributes

I have a container holding JSON documents of ~2 KB each, with a synthetic partition key constructed as TenantId-Division-UserId.
The documents contain the following updatable attributes:
UpdateDate
DeactivateInd
The TPS on the UPDATE operation is ~400, and the volume of documents is ~30 million.
Question:
Does the Cosmos DB data model need to be split into two collections, one static collection and another lookup-type collection, for quick updates with fewer RUs? Any suggestions on a high-performance data model with low RU consumption?
Ref: http://www.mnazureusergroup.com/2018/10/25/azure-cosmos-db-partitioning-design-patterns-part-1/

If you have access patterns where small portions of a document are updated frequently, then yes, you should consider shredding it and putting those portions into a different document. I would not put them into a different container, because you then cannot query for the entire document in a single operation. Doing updates on smaller documents is much cheaper than doing a replace on a large one.
Before committing to this, however, you should measure the impact on RU/s to ensure doing so is the less expensive option.
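As a rough sketch of what shredding might look like with the .NET SDK v3 (the UserStatus class, its property names, and the container and status variables below are hypothetical, reusing the two attributes from the question):
using Microsoft.Azure.Cosmos;

// Hypothetical shape: the two frequently updated attributes live in their
// own small document, stored in the same container and logical partition
// as the static profile document so both can still be read together.
public class UserStatus
{
    public string id { get; set; }            // e.g. "user123-status"
    public string partitionKey { get; set; }  // same TenantId-Division-UserId value
    public DateTime UpdateDate { get; set; }
    public bool DeactivateInd { get; set; }
}

// Replacing this tiny document at ~400 TPS costs far fewer RUs than
// replacing the full ~2 KB document on every update.
await container.ReplaceItemAsync(status, status.id,
    new PartitionKey(status.partitionKey));
Because both documents carry the same partition key value, one in-partition query can still return them together whenever the full view is needed.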

Related

What partition key to choose in cosmos db with little data and one customer per database?

We're developing a personnel management system based on Blazor and Cosmos DB serverless. There will be one customer per database and around 30 "docTypes". The biggest categories by number and data volume are "users" and "employees". When we query, we get all data for users and employees at once, so it can be several thousand documents. The other docTypes are much smaller and less frequently queried.
The volume of data per customer will not exceed 5 GB. The most frequent queries are against 3 docTypes.
Would it make more sense to use customerId (so all data is in one partition) or docType as a partition key?
Thanks.
Based on the information you supplied, it sounds like docType is a good property to use as the partition key, since it can be used to avoid cross-partition queries, especially since you state it will often be used in your queries. With the maximum size you stated, it is also unlikely to cause you issues, as a single logical partition can contain up to 20 GB of data.
One thing to watch out for is hot partitioning. You state that your users partition might be a lot bigger than the others. That can result in one partition doing all of the lifting while the others sit mostly idle, which reduces the efficiency of your total throughput.
On the other hand, it won't really matter for your use case: since none of the databases will exceed 5 GB, you'll always stay within a single partition. It's good to think about it beforehand though, as situations may change and you could end up with a database that does split across partitions.
Lastly, I would never use a single partition for all data; it has no benefits. If you have no properties that could serve as the partition key, then id is the better choice (so one logical partition per document). It won't hit storage limitations and it evenly distributes throughput between partitions.
I would highly recommend you first take a look at this segment of the Data Modelling & Partitioning presentation by Thomas Weiss, Cosmos DB program manager. In my view it's one of the best resources for understanding how to think about partitioning.
I agree with David Makogon that you didn't provide enough data. For instance, we know there are 30 doc types per database; given that a Cosmos DB database holds containers, I would actually expect each docType to have its own container, contrary to what you wrote:
Would it make more sense to use customerId (so all data is in one partition) or docType as a partition key?
This suggests you want to use a single container for all your data. I wouldn't keep users and employees as documents in the same container; they are separate domains and deserve their own containers.
See the Azure docs page on partitioning strategy and the subsequent paragraph about access patterns. The recommendation is to:
Choose a partition key that enables access patterns to be evenly spread across logical partitions.
In the access patterns section, the good practice mentioned is to separate data into hot, medium, and cold data and place each into its own container. One caveat: according to this page, the maximum number of containers per database with shared throughput is 25.
If that is not possible and all data has to end up in a single container, then docType seems to be the right partition key, because your queries will get data by docType, if I understood correctly.
As 404 wrote, you want to avoid hot partitioning, i.e. jamming most of the documents in a container into a single logical partition or just a few. Therefore you want to choose a partition key based on your most frequent operations.
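For illustration, a minimal sketch (assuming the .NET SDK; the database variable and the container name are made up) of creating a container partitioned on docType and writing a single-partition query:
using Microsoft.Azure.Cosmos;

// Create a container partitioned on /docType (database is assumed to be
// an existing Microsoft.Azure.Cosmos.Database).
Container container = await database.CreateContainerIfNotExistsAsync(
    id: "customer-data",
    partitionKeyPath: "/docType");

// A query that filters on the partition key stays within one logical
// partition and avoids a cross-partition fan-out.
QueryDefinition query = new QueryDefinition(
    "SELECT * FROM c WHERE c.docType = @docType")
    .WithParameter("@docType", "employee");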

Google Firestore - Efficiently fetch a single document, perform a point query within a subcollection

Assume I am designing a new Firestore database. Assume I like the idea of a hierarchical design and, as a contrived example, each Year has a sequence of child Weeks of which each has Days.
What's the most performance-efficient way to retrieve a single document for today, i.e. 2021-W51-Thursday?
Answers are permitted to include changes to the model, e.g. "denormalizing" the day model such that it includes year, week and dayName fields (and querying them).
Otherwise a simple document reference may be the fastest way, like:
DocumentReference docRef = db
    .Collection("years").Document("2021")
    .Collection("weeks").Document("51")
    .Collection("days").Document("Thursday");
Thanks.
Any query that identifies a single document to fetch is equally performant to any other query that does the same, at the scale that Firestore operates. The organization of collections and documents does not matter at all at scale. You might see some fluctuations in performance at small scale, depending on your data set, but that's not how Firestore is optimized to work.
All collections and subcollections each have at least one index on the document ID, which works the same way independently of every other collection and index. If you can identify a unique document using its path:
/db/XXXX/weeks/YY/days/ZZZZ
Then it scales the same as a document stored using a more flat structure:
/db/XXXXYYZZZZ
It makes no difference at scale, since indexes on collections scale to an infinite number of documents with no theoretical upper limit. That's the magic of Firestore: if the system allows the query, then it will always perform well. You don't have to worry about scaling and performance at all. The indexes are automatically sharded across computing resources to optimize performance (vs. cost).
All of the above is also true for fields of a document (instead of the document ID). You can think of a document ID as a field of a document that must be unique within a collection. Each field has its own index by default, and it scales massively.
With NoSQL databases like Firestore, you should structure your data in such a way that eases your queries, as long as those queries can be supported by indexes that operate at scale. This stands in contrast with SQL databases, which are optimized for query flexibility rather than massive scalability.
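To make that concrete, here is a minimal sketch using the Google.Cloud.Firestore .NET SDK (db is an assumed FirestoreDb instance, and the collection names are the contrived ones from the question) showing both shapes of point read:
using Google.Cloud.Firestore;

// Hierarchical: a single point read through nested subcollections.
DocumentSnapshot nested = await db
    .Collection("years").Document("2021")
    .Collection("weeks").Document("51")
    .Collection("days").Document("Thursday")
    .GetSnapshotAsync();

// Flat: a single point read against a denormalized document ID.
// At scale both perform the same, because every collection's ID index
// is independent and sharded automatically.
DocumentSnapshot flat = await db
    .Collection("days").Document("2021-W51-Thursday")
    .GetSnapshotAsync();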

Return a partial document in CosmosDB

I have a large document with many fields and I would just like to return 1-2 fields from the object to preserve throughput. Is this possible in Cosmos DB, or do I need to return the entire object every time?
This is not possible with a point read using ReadItemAsync(). The only way to do it is with a query, including only the properties you want in the SELECT statement.
That said, a query is unlikely to save a ton of RU/s, because the engine still has to retrieve the item from the data store and then project the properties you want before returning the response.
If you have a large document with lots of properties and asymmetric access patterns, meaning you only read or update a small number of properties with high concurrency, then the better solution is to shred the document in two, with the high-concurrency properties in one document and the more static properties in another.
This will provide the most efficiency.
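As an illustration, a sketch of the projection query against the .NET SDK (the property names are placeholders borrowed from the first question, and container is an assumed Microsoft.Azure.Cosmos.Container):
using Microsoft.Azure.Cosmos;

// Project only the two properties you need. The engine still loads the
// full item from the data store, so the RU saving is modest.
QueryDefinition query = new QueryDefinition(
    "SELECT c.UpdateDate, c.DeactivateInd FROM c WHERE c.id = @id")
    .WithParameter("@id", "user123");

FeedIterator<dynamic> feed = container.GetItemQueryIterator<dynamic>(
    query,
    requestOptions: new QueryRequestOptions
    {
        // Scoping to a single partition keeps the query cheap.
        PartitionKey = new PartitionKey("tenant1-divA-user123")
    });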

I want to increase the number of records read using queryPage in DynamoDB

I have a requirement where I need to get only a certain attribute from the matching records when querying a DynamoDB table. I have used withSelect(Select.SPECIFIC_ATTRIBUTES).withProjectionExpression(<attribute_name>) to get that attribute, but the number of records read by the queryPage operation is the same in both cases (1. using withSelect and 2. without it). The only advantage is that with withSelect, the operations are processed more quickly; this in turn causes a lot of DynamoDB reads. Is there any way I can read more records in a single query, thereby reducing my number of DB reads?
You are seeing the same number of reads because projection expressions are applied after each item is retrieved from the storage nodes, but before it is collected into the response object. The net benefit of projection expressions is to save network bandwidth, which in turn can reduce latency, but they will not save consumed capacity.
If you want to save consumed capacity and be able to retrieve more items per request, your only options are:
create an index projecting only the attributes you need to query; this can be a local secondary index or a global secondary index, depending on whether you need to change the partition key for the index
try to optimize the schema of your data stored in the table; perhaps you can compress your items, or just generally work out encodings that result in smaller documents
Some things to keep in mind if you do decide to go with an index (see the sketch below): a local secondary index would probably work best in your example, but you would need to create a new table for that (local secondary indexes can only be created when you create the table); a global secondary index would also work, but only if your application can tolerate eventually consistent reads on the index (and, of course, there is a higher cost associated with these).
Read more about using indexes with DynamoDB here: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes.html
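To keep the examples in one language, here is a hedged sketch of the index option using the AWS SDK for .NET rather than the Java client from the question (table, index, key, and attribute names are all hypothetical, and client is an assumed AmazonDynamoDBClient):
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

// Query a GSI that projects only the needed attribute. Index items are
// small, so each 4 KB read unit covers many more of them than reads
// against the base table would.
var request = new QueryRequest
{
    TableName = "MyTable",
    IndexName = "MyAttributeIndex",
    KeyConditionExpression = "pk = :pk",
    ExpressionAttributeValues = new Dictionary<string, AttributeValue>
    {
        [":pk"] = new AttributeValue { S = "partition-1" }
    },
    ProjectionExpression = "myAttribute"
};
QueryResponse response = await client.QueryAsync(request);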

Is it ok to store a user id as the key of a field in a Firestore document?

Firestore charges for the number of indexes used. If I have a structure with a massive list of ratings that different users gave, with the user ID as the key and the rating as the value, will that take up too many auto-created indexes? Is there a good structure for this?
For example, in the collection 'ratings', I shard the individual ratings each user gives into different documents using a complex sharding mechanism I made that fills a document up to the maximum document size of around 20k, then starts filling another document. Say I have 5 documents, each filled with 20k fields. One of those docs would look like this:
uid1: 3.3
uid2: 5
uid3: 1.234
...
Is there another structure I should be using to store loads of individual 'fields' in Firestore? I don't want to use a separate document for each rating either, as that is too expensive. Arrays aren't big enough to store loads of ratings either.
Arrays aren't big enough to store loads of ratings either
The problem isn't the arrays; the problem is that documents have limits on how much data you can put into them. According to the official documentation regarding usage and limits:
Maximum size for a document: 1 MiB (1,048,576 bytes)
As you can see, you are limited to 1 MiB of data in a single document. When it comes to storing text, you can store quite a lot, but as your array gets bigger, be careful about this limitation.
According to the official documentation regarding modeling data in Cloud Firestore:
Cloud Firestore is optimized for storing large collections of small documents.
So trying to shard a collection by filling up documents one by one is not such a good idea.
If you are trying to add ratings from multiple users to a single document, in other words trying to store a large amount of data in a single document that can be updated by lots of users, there is another limitation to be aware of: you are limited to 1 write per second on every document. So if a lot of users all try to write data to the same document at once, you might start to see some of those writes fail. Be careful about this limitation too.
My recommendation is to store those ratings in an array if you think the size of the document will stay within the 1 MiB limit; otherwise, use a separate collection of ratings for each object.
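As a sketch of the one-document-per-rating alternative (collection and document names are hypothetical, db is an assumed FirestoreDb):
using Google.Cloud.Firestore;

// One small document per rating, keyed by user ID. Writes from different
// users land on different documents, so the 1 write/second/document
// limit no longer bites, and no document approaches the 1 MiB cap.
DocumentReference ratingRef = db
    .Collection("items").Document("item42")
    .Collection("ratings").Document("uid1");
await ratingRef.SetAsync(new Dictionary<string, object>
{
    ["value"] = 3.3
});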