I created two empty DocumentDB collections: 1) single-partition and 2) multi-partition. Next, I inserted a single document into each collection and ran a scan (select * from c). I found that the single-partition collection consumed ~2 RUs whereas the multi-partition one took about ~50 RUs. It's not just the RUs: the read latency was about 20x slower with multi-partition. So does multi-partition always have high read latency when queried across partitions?
You can get the same latency for multi-partition collections as single-partition collections. Let's take the example of scans:
If the collections are non-empty, the performance will be the same, since data is read from one partition at a time: it is read from the first partition, then paginated across the remaining partitions in order.
If you use the MaxDegreeOfParallelism option, you'll get the same low latencies. Note that query execution is serial by default, in order to optimize for queries over larger datasets; with the parallelism option enabled, the cross-partition query will have the same low latency.
If you scan with a filter on partition key = value, you'll get the same performance even without parallelism.
It is true that there is a small RU overhead for each partition touched during a query (~2 RU per partition for query parsing). Note that this doesn't increase with query size, i.e., even if your query returned, say, 1000 documents costing 1000 RUs, the query against a partitioned collection would cost 1000 + 2*P RUs (P = partitions touched) instead of 1000 RUs. You can of course eliminate this overhead by including a filter on the partition key.
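For reference, a minimal sketch with the v3 .NET SDK (where the parallelism knob is QueryRequestOptions.MaxConcurrency rather than the older MaxDegreeOfParallelism); the container variable and the c.partitionKey property are illustrative, and the snippet is assumed to run inside an async method:

    using System;
    using Microsoft.Azure.Cosmos;

    // Cross-partition scan with parallel execution enabled.
    FeedIterator<dynamic> scan = container.GetItemQueryIterator<dynamic>(
        "SELECT * FROM c",
        requestOptions: new QueryRequestOptions
        {
            MaxConcurrency = -1, // let the SDK parallelize across partitions
            MaxItemCount = 100
        });
    while (scan.HasMoreResults)
    {
        FeedResponse<dynamic> page = await scan.ReadNextAsync();
        Console.WriteLine($"Page of {page.Count} docs cost {page.RequestCharge} RU");
    }

    // Single-partition query: the partition key filter avoids the per-partition overhead.
    FeedIterator<dynamic> filtered = container.GetItemQueryIterator<dynamic>(
        new QueryDefinition("SELECT * FROM c WHERE c.partitionKey = @key")
            .WithParameter("@key", "key-value"),
        requestOptions: new QueryRequestOptions { PartitionKey = new PartitionKey("key-value") });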
Context & goal
I need to periodically create snapshots of cosmosDB partitions. That is:
export all documents from a single CosmosDB partition (ca. 100-10k docs per partition, 1 KB-200 KB each, entire partition JSON usually < 50 MB)
each document must be handled separately, with id known.
Host the process in an Azure Function app on the Consumption plan (so memory/CPU/duration matter).
And run this for thousands of partitions.
Using Microsoft.Azure.Cosmos v3 C# API.
What I've tried
I can skip deserialization by using the Container.*StreamAsync() tools in the API and avoid parsing the document contents. This should notably reduce the CPU/memory needs and also avoids accidentally touching the documents to be exported with a serialization round trip. The tricky part is how to combine it with having 10k documents per partition.
Query individually x 10k
I could query the item ids per partition using SQL and just send separate ReadItemStreamAsync(id) requests.
This skips deserialization, keeps the ids, lets me control how many docs are in memory at a given time, etc.
It would work, but it smells too chatty, sending 1+10k requests to CosmosDB per partition, which is a lot = millions of requests. Also, in my experience, SQL-querying large documents is usually RU-wise cheaper than loading those documents by point reads, which would add up at this scale. It would be nice to be able to pull N documents with a single (SQL query) request.
Query all as stream x 1.
There is Container.GetItemQueryStreamIterator() to which I could just pass select * from c where c.partition = #key. It would be simpler, use fewer RUs, I could control batch size with MaxItemCount, and it sends just a minimal number of requests to Cosmos (query + continuations). All is good, except ..
.. it would return a single JSON array for all documents in the batch and I would need to deserialize it all to split it into individual documents and map them to their ids. Defeating the purpose of loading them as a Stream.
Similarly, ReadManyItemsStreamAsync(..) would return the items as a single response stream.
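For illustration, a minimal sketch of that approach (the container variable, the c.partition property and the @key parameter are placeholders); each page of results comes back as one stream holding a JSON envelope with a Documents array, which is exactly the splitting problem described above:

    FeedIterator iterator = container.GetItemQueryStreamIterator(
        new QueryDefinition("SELECT * FROM c WHERE c.partition = @key")
            .WithParameter("@key", partitionKeyValue),
        requestOptions: new QueryRequestOptions
        {
            PartitionKey = new PartitionKey(partitionKeyValue),
            MaxItemCount = 100
        });

    while (iterator.HasMoreResults)
    {
        using ResponseMessage page = await iterator.ReadNextAsync();
        // page.Content is ONE stream per page: a JSON envelope like
        // { "_rid": "...", "Documents": [ ... ], "_count": n },
        // so splitting it into per-document streams still means parsing the JSON.
    }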
Question
Does the CosmosDB API provide a better way to download a lot of individual raw JSON documents without deserializing?
Preferably with some control over how much data is buffered in the client.
While I agree that designing the solution around streaming documents with the change feed is promising and might have better scalability and cost characteristics on the Cosmos DB side, to answer the original question ..
The chattiness of the "Query individually x 10k" solution could be reduced with Bulk mode.
That is:
Prepare a bulk CosmosClient with AllowBulkExecution option
query document ids to export (select c.id from c where c.partition = #key)
(Optionally) split the ids to batches of desired size to limit the number of documents loaded in memory.
For each batch:
Load all documents in the batch concurrently using ReadItemStreamAsync(id, partition); this avoids deserialization but retains the link to the id.
Write all documents to destination before starting next batch to release memory.
Since all reads are to the same partition, Bulk mode will internally merge the underlying requests to CosmosDB, reducing the network "chattiness" and trading it for some internal (hidden) complexity and a slight increase in latency.
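A minimal sketch of that flow, assuming a suitable connection string, database/container names, a partition key value in key, and a hypothetical WriteToDestinationAsync export step (batching here uses .NET 6's Enumerable.Chunk):

    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;
    using Microsoft.Azure.Cosmos;

    CosmosClient bulkClient = new CosmosClient(connectionString,
        new CosmosClientOptions { AllowBulkExecution = true });
    Container container = bulkClient.GetContainer("exportDb", "exportContainer");
    PartitionKey pk = new PartitionKey(key);

    // 1) Query only the ids in the partition.
    List<string> ids = new List<string>();
    FeedIterator<string> idIterator = container.GetItemQueryIterator<string>(
        new QueryDefinition("SELECT VALUE c.id FROM c WHERE c.partition = @key")
            .WithParameter("@key", key),
        requestOptions: new QueryRequestOptions { PartitionKey = pk });
    while (idIterator.HasMoreResults)
        ids.AddRange(await idIterator.ReadNextAsync());

    // 2) Point-read the raw document streams in batches; the bulk-enabled
    //    client coalesces concurrent reads to the same partition.
    foreach (string[] batch in ids.Chunk(100))
    {
        ResponseMessage[] responses = await Task.WhenAll(
            batch.Select(id => container.ReadItemStreamAsync(id, pk)));
        foreach (ResponseMessage response in responses)
        {
            using (response)
            {
                // response.Content is the raw JSON document stream; write it
                // out before starting the next batch to release memory.
                await WriteToDestinationAsync(response.Content); // hypothetical export step
            }
        }
    }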
It's worth noting that:
It is still doing the 1+10k requests to CosmosDB and paying their RU cost; they are just compacted on the network.
Batching the ids and waiting for batch completion is needed, as otherwise Bulk would send all internal batches concurrently (see: Bulk support improvements for Azure Cosmos DB .NET SDK). Or don't, if you prefer to max out throughput instead and don't care about the memory footprint. In this case the partitions are smallish enough that it does not matter much.
Bulk has a separate internal batch size. Most likely it's best to use the same value. This seems to be 100, which is a rather good chunk of data to process anyway.
Bulk may add latency to requests while waiting for an internal batch to fill up before dispatching (100 ms). Imho this is largely negligible in this case and could be avoided by fully filling the internal Bulk batch bucket where possible.
This solution is not optimal, for example due to the burst load it puts on CosmosDB, but the main benefit is simplicity of implementation, and the logic can be run in any host, on demand, with no infra setup required.
There isn't anything out of the box that provides a way to do on-demand, per-partition batch copying of data from Cosmos to blob storage.
However, before looking at other ways to do this as a batch job, another approach you may consider is to stream your data from Cosmos to blob storage using Change Feed. The reason is that, for a database like Cosmos, throughput (and cost) is measured on a per-second basis. The more you can amortize the cost of some set of operations over time, the less expensive it is. One other major thing to point out: doing this on a per-partition basis means the throughput (and cost) required for the batch operation will be the needed throughput multiplied by the number of physical partitions in your container. This is because when you increase throughput on a container, the throughput is allocated evenly across all physical partitions, so if I need 10k RU of additional throughput to do some work on data in a container with 10 physical partitions, I will need to provision 100k RU/s to do the same work in the same amount of time.
Streaming data is often a less expensive solution when the amount of data involved is significant. Streaming effectively amortizes cost over time, reducing the amount of throughput required to move that data elsewhere. In scenarios where the data is being moved to blob storage, how quickly it gets there is often not important, because blob storage is very cheap (~0.01 USD/GB) compared to Cosmos (~0.25 USD/GB).
As an example, if I have 1M (1 KB) documents in a container with 10 physical partitions and I need to copy them from one container to another, the RU needed for the entire job is roughly 1 RU per document to read (1M RU total), then approximately 10 RU per document (with no indexes) to write into the other container.
Here is the breakdown for how much incremental throughput I would need and the cost for that throughput (in USD), if I ran this as a batch job over that period of time. Keep in mind that Cosmos DB charges you for the maximum throughput per hour.
Complete in 1 second = 11M RU/s, $880 USD * 10 partitions = $8800 USD
Complete in 1 minute = 183K RU/s, $14 USD * 10 partitions = $140 USD
Complete in 10 minutes = 18.3K RU/s, $1 USD * 10 partitions = $10 USD
However, if I streamed this job over the course of a month, the incremental throughput required would be only about 4 RU/s, which can be done without any additional RU at all. Another benefit is that it is usually less complex to stream data than to handle it as a batch; handling exceptions and dead-letter queuing are easier to manage. Although, because the data is streamed over time, you will need to first look up the document in blob storage and then replace it.
There are two simple ways you can stream data from Cosmos DB to blob storage. The easiest is Azure Data Factory. However, it doesn't really give you the ability to capture costs on a per logical partition basis as you're looking to do.
To do this you'd need to write your own utility using the change feed processor. Then within the utility, as you read in and write each item, you can capture the amount of throughput used to read the data (usually ~1 RU per item) and can calculate the cost of writing it to blob storage based upon the per-unit cost of whatever your monthly hosting cost is for the Azure Function that hosts the process.
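A minimal sketch of such a utility with the v3 .NET SDK change feed processor, assuming a monitored container, a lease container, and a hypothetical WriteToBlobAsync export step (all names are illustrative, and this is only the skeleton, not a complete program):

    using System.Collections.Generic;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Azure.Cosmos;

    // Processor that receives every inserted/updated document and copies it to blob storage.
    ChangeFeedProcessor processor = monitoredContainer
        .GetChangeFeedProcessorBuilder<ExportedDoc>("exportToBlob", HandleChangesAsync)
        .WithInstanceName("export-worker-1")
        .WithLeaseContainer(leaseContainer)   // separate container holding leases/checkpoints
        .Build();
    await processor.StartAsync();

    async Task HandleChangesAsync(
        IReadOnlyCollection<ExportedDoc> changes, CancellationToken cancellationToken)
    {
        foreach (ExportedDoc doc in changes)
        {
            // Look up / overwrite the blob for this document id.
            await WriteToBlobAsync(doc, cancellationToken); // hypothetical export step
        }
    }

    public class ExportedDoc { public string id { get; set; } }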
As I prefaced, this is only a suggestion. But given the amount of data and the fact that it is on a per-partition basis, it may be worth exploring.
We have recently started testing Cosmos DB as a persistent storage for our Orleans Cluster.
Our scenario for Cosmos DB is as a key-value store with low latency and a high volume of Replaces, point reads and Creates. Because of this we have made the Document ID and the Partition Key the same, so essentially 1 document per logical partition.
However, in the Cosmos DB Insights we are seeing that we are reaching 100% Normalized RU Consumption:
When we dig deeper we see a heat map that puts 100% RU Consumption on PartitionKeyRangeID 0 with no other PartitionKeyRanges available.
It is my understanding that since we have as many partitions as we have documents, we should not be running into this problem. I am also not sure what PartitionKeyRangeID 0 signifies, as we should have at least a few thousand partitions.
PartitionKeyRangeID corresponds to a physical partition.
Your logical partition key is hashed and the hash space is divided up into ranges that are allocated to each physical partition. For example in a collection with two partitions Partition 1 might take the range from 0x00000000000000 to 0x7FFFFFFFFFFFFF and Partition 2 the range from 0x80000000000000 to 0xFFFFFFFFFFFFFF.
In this case it looks like you have a single physical partition and this is maxed out.
Each physical partition supports up to 10k RU per second, so you would see additional physical partitions if you were to scale the collection above that, forcing a split (or if a split is required due to per-partition storage capacity limits).
The RU budget from your provisioned throughput is divided between the physical partitions. The heat map is useful when you have multiple physical partitions so you can identify cases where some physical partitions are maxed out and others are idle (which would be likely due to hot logical partitions).
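As a quick check, the v3 .NET SDK can list the container's feed ranges, which currently map to its physical partition key ranges (the container variable is assumed):

    using System;
    using System.Collections.Generic;
    using Microsoft.Azure.Cosmos;

    IReadOnlyList<FeedRange> ranges = await container.GetFeedRangesAsync();
    Console.WriteLine($"Physical partition (feed range) count: {ranges.Count}");
    // With a single physical partition this prints 1, matching
    // the lone PartitionKeyRangeID 0 seen in the heat map.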
The DynamoDB documentation describes how table partitioning works in principle, but it's very light on specifics (i.e., numbers). Exactly how, and when, does DynamoDB table partitioning take place?
I found this presentation produced by Rick Houlihan (Principal Solutions Architect, DynamoDB) from the AWS Loft San Francisco on 20th January 2016.
The presentation is also on YouTube.
One slide in particular provides the important detail on how/when table partitioning occurs, and below I have generalised the equations so you can plug in your own values.
Partitions by capacity = (RCUs/3000) + (WCUs/1000)
Partitions by size = TableSizeInGB/10
Total Partitions = Take the largest of your Partitions by capacity and Partitions by size. Round this up to an integer.
In summary, a partition can support a maximum of 3000 RCUs and 1000 WCUs and hold up to 10 GB of data. Once partitions are created, RCUs, WCUs and data are spread evenly across them.
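As a worked example (hypothetical numbers): a table provisioned with 7,500 RCUs and 3,000 WCUs holding 60 GB of data gives Partitions by capacity = 7500/3000 + 3000/1000 = 5.5 and Partitions by size = 60/10 = 6. The larger value, rounded up, is 6 partitions, so each partition would receive 1,250 RCUs, 500 WCUs and roughly 10 GB of the data.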
Note that, to the best of my knowledge, once you have created partitions, lowering RCUs, WCUs and removing data will not result in the removal of partitions. I don't currently have a reference for this.
Regarding the "removal of partitions" point Stu mentioned.
You don't directly control the number of partitions, and once the partitions are created they cannot be deleted => this behaviour can cause performance issues that are often unexpected.
Consider you have a Table with 500 WCUs assigned. For this example, consider you have 15 GB of data stored in this Table. This means we have reached a data-size cap (10 GB per partition), so we currently have 2 partitions between which the RCUs and WCUs are split (each partition can use 250 WCUs).
Soon there will be an enormous increase (let's say Black Friday) in users that need to write data to the Table. So what you would do is increase the WCUs to 10000 to handle the load, right? Well, what happens behind the scenes is that DynamoDB reaches another cap - WCU capacity per partition (max 1000) - so it creates 10 partitions across which the data are spread by the hashing function on our Table.
Once Black Friday is over, you decide to decrease the WCUs back to 500 to save cost. What will happen is that even though you decreased the WCUs, the number of partitions will not decrease => now you have to SPLIT those 500 WCUs between 10 partitions (so effectively every partition can only use 50 WCUs).
You see the problem? This is often forgotten and can bite you if you are not planning properly how the data will be used in your application.
TLDR: Always understand how your data will be used and plan your database design properly.
We are using Cosmos DB SQL API and here's a collection XYZ with:
Size: Unlimited
Throughput: 50000 RU/s
PartitionKey: Hashed
We are inserting 200,000 records, each ~2.1 KB in size and having the same value for the partition key column. To our knowledge, all docs with the same partition key value are stored in the same logical partition, and a logical partition should not exceed the 10 GB limit whether we are on a fixed or unlimited-size collection.
Clearly our total data is not even 0.5 GB. However, in the metrics blade of Azure Cosmos DB (in portal), it says:
Collection XYZ has 5 partition key ranges. Provisioned throughput is
evenly distributed across these partitions (10000 RU/s per partition).
This does not match with what we have studied so far from the MSFT docs. Are we missing something? Why are these 5 partitions created?
When using the Unlimited collection size, by default you will be provisioned 5 physical partition key ranges. This number can change, but as of May 2018, 5 is the default. You can think of each physical partition as a "server". So your data will be spread amongst 5 physical "servers". As your data size grows, your data will automatically be redistributed across more physical partitions. That's why getting the partition key correct upfront in your design is so important.
The problem in your scenario of having the same partition key (PK) for all 200k records is that you will have hot spots. You have 5 physical "servers" but only one will ever be used; the other 4 will sit idle, and the result is that you'll get less performance for the same price point. You're paying for 50k RU/s but will only ever be able to use 10k RU/s. Change your PK to something that is more uniformly distributed. The right choice of course depends on how you read the data; if you give more detail about the docs you're storing, we may be able to give a recommendation. If you're simply doing point lookups (calling ReadDocumentAsync() by each Document ID) then you can safely partition on the ID field of the document. This will spread all 200k of your docs across all 5 physical partitions and your 50k RU/s throughput will be maximized. Once you do this, you will probably find that you can reduce the provisioned RU/s to something much lower and save a lot of money. With only 200k records, each at 2.1 KB, you could probably go as low as 2,500 RU/s (1/20th of the cost you're paying now).
*Server is in quotes because each physical partition is actually a collection of many servers that are load-balanced for high availability and also throughput (depending on your consistency level).
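To illustrate the point-lookup pattern with the current v3 .NET SDK (ReadItemAsync is the successor to the ReadDocumentAsync call mentioned above; the container variable and documentId are assumptions): when the container is partitioned on /id, each point read supplies the id as both the item id and the partition key, so reads spread evenly across all physical partitions.

    using System;
    using Microsoft.Azure.Cosmos;

    // Container partitioned on /id: the id doubles as the partition key value.
    ItemResponse<dynamic> response =
        await container.ReadItemAsync<dynamic>(documentId, new PartitionKey(documentId));
    Console.WriteLine($"Point read cost: {response.RequestCharge} RU");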
From "How does partitioning work":
In brief, here's how partitioning works in Azure Cosmos DB:
You provision a set of Azure Cosmos DB containers with T RU/s (requests per second) throughput.
Behind the scenes, Azure Cosmos DB provisions the physical partitions needed to serve T requests per second. If T is higher than the maximum throughput per physical partition t, then Azure Cosmos DB provisions N = T/t physical partitions. The value of maximum throughput per partition (t) is configured by Azure Cosmos DB; this value is assigned based on total provisioned throughput and the hardware configuration used.
.. and more importantly:
When you provision throughput higher than t*N, Azure Cosmos DB splits one or more of your physical partitions to support the higher throughput.
So, it seems your requested RU throughput of 50k is higher than that t mentioned above. Considering the numbers, it seems t is ~10k RU/s.
Regarding the actual value of t, CosmosDB team member Aravind Krishna R. has said in another SO post:
[---] the reason this value is not explicitly mentioned is because it will be changed (increased) as the Azure Cosmos DB team changes hardware, or rolls out hardware upgrades. The intent is to show that there is always a limit per partition (machine), and that partition keys will be distributed across these partitions.
You can discover the current value by saturating the writes for a single partition key at maximum throughput.
I have a question...
If I have 1000 items having the same partition key in a table, and I make a query on this partition key with Limit 10, does it consume read capacity units for 1000 items or for just 10 items?
Please clear up my doubt.
I couldn't find the exact point in the DynamoDB documentation. From my experience it uses only the returned limit for the consumed capacity, which is 10 (not 1000).
You can also quickly verify this yourself: specify the ReturnConsumedCapacity parameter in a Query request to obtain this information.
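For instance, a minimal sketch using the AWS SDK for .NET (the table and attribute names are made up); running it with and without Limit shows the capacity reported for the request:

    using System;
    using System.Collections.Generic;
    using Amazon.DynamoDBv2;
    using Amazon.DynamoDBv2.Model;

    var client = new AmazonDynamoDBClient();
    var request = new QueryRequest
    {
        TableName = "MyTable",                 // hypothetical table
        KeyConditionExpression = "pk = :pk",
        ExpressionAttributeValues = new Dictionary<string, AttributeValue>
        {
            [":pk"] = new AttributeValue { S = "some-partition-key" }
        },
        Limit = 10,
        ReturnConsumedCapacity = ReturnConsumedCapacity.TOTAL
    };

    QueryResponse response = await client.QueryAsync(request);
    Console.WriteLine(
        $"Items returned: {response.Count}, capacity units: {response.ConsumedCapacity.CapacityUnits}");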
The Limit option limits the number of items evaluated (and so, without filters, the number returned). The capacity consumed depends on the size of the items and on how many of them are accessed to produce the results (I say accessed because, if you have filters in place, items that are read but then filtered out still consume capacity even though they are not returned).
The reason I mention this is because, for queries, each 4 KB of data read is equivalent to 1 read capacity unit.
Why is this important? Because if your items are small, then for each capacity unit consumed you could return multiple items.
For example, if each item is 200 bytes in size, you could be returning up to 20 items for each capacity unit.
According to the AWS documentation:
The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off.
It seems to me that this means it will not consume read capacity units for all the items with the same partition key. In your example, the consumed capacity units will be for just your 10 items.
However, since I did not test it I cannot be sure; that is just how I understand the documentation.