I've been using the Azure SDK for .NET (with .NET Core 3.1) client to query collections by calling GetItemQueryIterator() on a container.
I've observed that the FeedResponses produced by the FeedIterator that GetItemQueryIterator returns seem to correspond to physical partitions, but I haven't found any confirmation of this in the docs.
Can someone confirm that:
If a partition key value isn’t specified in the query, the FeedIterator will return a FeedResponse for each physical partition in the collection?
And that if a partition key value is specified, the FeedIterator will return only one FeedResponse with results from the physical partition holding the specified logical partition?
If the above statements aren't true, are there any guarantees at all about the relationship between partitions (logical and/or physical) and FeedIterators/FeedResponses?
Thanks!
TL;DR - It is a safe bet that if you don't specify a partition key, your results will be an aggregate of the results from a given range of partitions.
It's important to remember that, for the most part, the logical-to-physical partition mapping is treated as an implementation detail, though in practice that's a grey zone (a grey zone most folks can safely ignore). Additionally, how queries work mechanically changes a good bit over time (though functionally they should not change) as improvements are made, but the code is open source on the SDK side of things, so I can at least describe the current point in time, which you can see for yourself in any of the SDK implementations.
If you provide a partition key, the answer is easy: your results will come from a single physical partition at a time. It might be a different partition each time (though in practice it will be the same range), because partitions can split between queries. Your query might also be served by a different replica each time. The above is true for all flavors of single-partition-targeted queries.
Cross-partition queries get more interesting. By and large, cross-partition queries go through a pipeline which merges results depending on the query plan. So ORDER BY queries will go from physical partition to physical partition, grabbing pages until they have enough to be assured ordering is maintained. Aggregates do similar things, etc. This of course comes with the overhead of having to talk to multiple partitions before you can serve results, plus more overhead on the resources the client consumes to serve the request, so we don't recommend heavy cross-partition queries in hot-path code (save the results as a materialized view, etc.). There are a few cases where the SDK skips this pipeline and each response corresponds to a page served over the network, but those are usually just plain WHERE clauses.
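To make the two cases concrete, here is a minimal sketch using the v3 .NET SDK. The account, database, container, query, and partition key value are placeholders of mine, not anything from the question. Supplying a PartitionKey in QueryRequestOptions makes it a single-partition query; omitting it makes the SDK fan out across partition key ranges, and each page you read back is simply whatever the merge pipeline hands you.

using System;
using Microsoft.Azure.Cosmos;

// Placeholder account, database, container, and partition key value.
var client = new CosmosClient("https://<account>.documents.azure.com:443/", "<key>");
Container container = client.GetContainer("mydb", "orders");

var query = new QueryDefinition("SELECT * FROM c WHERE c.status = @status")
    .WithParameter("@status", "open");

// Single-partition query: served by whichever physical partition currently
// hosts the logical partition "customer-42".
FeedIterator<dynamic> single = container.GetItemQueryIterator<dynamic>(
    query,
    requestOptions: new QueryRequestOptions { PartitionKey = new PartitionKey("customer-42") });

// Cross-partition query: the SDK fans out over partition key ranges and merges
// pages per the query plan, so don't assume one FeedResponse per physical partition.
FeedIterator<dynamic> cross = container.GetItemQueryIterator<dynamic>(
    query,
    requestOptions: new QueryRequestOptions { MaxConcurrency = -1 });

while (cross.HasMoreResults)
{
    FeedResponse<dynamic> page = await cross.ReadNextAsync();
    Console.WriteLine($"{page.Count} items, {page.RequestCharge} RU");
}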
Related
I have read that a single partition of DynamoDB has a size limit of 10 GB. Does this mean that if all my data is smaller than 10 GB, I have only one partition?
There is also a limit of 3,000 RCUs or 1,000 WCUs on a single partition. Does this mean that this is also the limit for a small database which has only one partition?
I use the billing mode PAY_PER_REQUEST. On the database there are short usage peaks of approximately 50 MB of data, and then there is nothing for hours. How can I design the database to get the best peak performance? Or is DynamoDB a bad option for this use case?
How to design a database for best performance, and how to pick the right database... these are deep questions.
DynamoDB works well for a wide variety of use cases. On the back end it uses partitions. You rarely have to think about partitions until you're at the high end of scale. Are you?
Partition keys are used as a way to map data to partitions, but it's not 1 to 1. If you don't follow best practice guidance and use only one PK value, the database may still split the items across back-end partitions to spread the load. Just don't use a Local Secondary Index (LSI), since that prevents this splitting. The details of the mapping depend on your usage pattern.
One physical partition will be 10 GB or less, and has the 3,000 read unit and 1,000 write unit limits, which is why the database will spread load across partitions. If you use a lot of PK values you make it more straightforward for the database to do this.
If you're at a high enough scale to hit the performance limits, you'll have an AWS account manager you can ask to hook you up with a DynamoDB specialist.
A given partition key can't receive more than 3k RCUs / 1k WCUs worth of requests at any given time, and can't store more than 10 GB in total if you're using an LSI (if not using an LSI, you can store more than 10 GB, assuming you're using a sort key). If your data definitely fits within those limits, there's no reason you can't use DDB with a single partition key value (and thus a single partition). It'd still be better to plan on a design that could scale.
The right design for you will depend on what your data model and access patterns look like. Given what you've described of some kind of periodic job, a timestamp could be used (although it has hotspot issues you should be careful of). If you've got some other kind of unique id, like user_id or device_id, etc., that would be a better choice. There is some great documentation on that in the DynamoDB developer guide.
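As a hedged sketch of that kind of design with the AWS SDK for .NET (the table name and attribute names are my own placeholders, not from the question): a table keyed on device_id with a timestamp sort key and on-demand billing, so short bursts don't need provisioned-capacity planning.

using System.Collections.Generic;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

var client = new AmazonDynamoDBClient();

var request = new CreateTableRequest
{
    TableName = "DeviceEvents",                 // hypothetical table
    BillingMode = BillingMode.PAY_PER_REQUEST,  // on-demand, matching the question's billing mode
    AttributeDefinitions = new List<AttributeDefinition>
    {
        new AttributeDefinition("device_id", ScalarAttributeType.S),
        new AttributeDefinition("event_time", ScalarAttributeType.S)
    },
    KeySchema = new List<KeySchemaElement>
    {
        new KeySchemaElement("device_id", KeyType.HASH),   // partition key: many distinct values spread load
        new KeySchemaElement("event_time", KeyType.RANGE)  // sort key: time-ordered items within a device
    }
};

await client.CreateTableAsync(request);

The point is simply that the partition key carries many distinct values while the sort key handles the time dimension, which keeps any one partition from becoming a hotspot.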
We're developing a personnel management system based on Blazor and Cosmos DB serverless. There will be one customer per database and around 30 "docTypes". The biggest categories by count and data volume are "users" and "employees". When we query, we get all users and employees at once, so it can be several thousand documents. The other docTypes are much smaller and less frequently queried.
The volume of data per customer will not exceed 5 GB. The most frequent queries are against 3 docTypes.
Would it make more sense to use customerId (so all data is in one partition) or docType as a partition key?
thanks
Based on the information you supplied, it sounds like docType is a good property to use as the partition key, since it can be used to avoid cross-partition queries, especially since you state it will often be used in your queries. With the max size you stated, it is also unlikely to cause you issues, as a single logical partition can contain up to 20 GB of data.
One thing to watch out for is hot partitioning. You state that your users partition might be a lot bigger than the others. That can result in one partition doing all of the heavy lifting while the others sit mostly idle, which reduces the efficiency of your total throughput.
On the other hand, it won't really matter much for your use case. Since none of the databases will exceed 5 GB, you'll always stay within a single physical partition. It's still good to think about it beforehand, though, as situations may change and you could end up with a database that does split into multiple partitions.
Lastly, I would never use a single partition for all data. It has no benefits. If you have no property that could serve as a partition key, then id is the better choice (so a logical partition per document): it won't hit storage limitations and it distributes throughput evenly between partitions.
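For what it's worth, a minimal sketch of that with the .NET SDK (account, database, and container names are placeholders): create the container with /docType as the partition key path, or fall back to /id when nothing better fits.

using Microsoft.Azure.Cosmos;

var client = new CosmosClient("https://<account>.documents.azure.com:443/", "<key>");
Database database = await client.CreateDatabaseIfNotExistsAsync("customer-db");

// Partition by docType so queries that filter on docType stay within one logical partition.
Container byDocType = await database.CreateContainerIfNotExistsAsync(
    new ContainerProperties("data", "/docType"));

// Alternative when no natural property exists: partition by id, i.e. one logical partition per document.
Container byId = await database.CreateContainerIfNotExistsAsync(
    new ContainerProperties("data-by-id", "/id"));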
I would highly recommend you first take a look at this segment of the Data Modelling & Partitioning presentation by Thomas Weiss, Cosmos DB program manager. In my view it's one of the best resources to understand how to think about partitioning.
I do agree with David Makogon that you didn't provide enough data. For instance, we know there are 30 doc types per single database. Given that a Cosmos DB database uses containers, I would actually expect each docType to have its own container, contrary to what you wrote:
Would it make more sense to use customerId (so all data is in one partition) or docType as a partition key?
Which suggests you want to use a single container for all your data. I wouldn't keep users and employees as documents in the same container. They are separate domains and deserve their own container.
See the Azure docs page on partitioning strategy and the subsequent paragraph about access patterns. The recommendation is to:
Choose a partition key that enables access patterns to be evenly spread across logical partitions.
In the access patterns section, the good practice mentioned is to separate data into hot, medium, and cold data and place each into its own container. One caveat is that, according to this page, the max number of containers per database with shared throughput is 25.
If that is not possible and all data has to end up in a single container, then docType seems to be the right partition key, because your queries will get data by docType, if I understood correctly.
As 404 wrote, you want to avoid hot partitioning, i.e. jamming most of the documents in a container into a single logical partition or just a few of them. Therefore you want to choose a partition key based on your most frequent operations.
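If it helps, here is a rough sketch of the container-per-domain layout with the .NET SDK. The names, partition key paths, and throughput value are my own assumptions; with the serverless tier you would simply omit the throughput argument, and the 25-container caveat only applies to shared database throughput.

using Microsoft.Azure.Cosmos;

var client = new CosmosClient("https://<account>.documents.azure.com:443/", "<key>");

// Shared throughput at the database level (subject to the 25-container caveat above).
Database database = await client.CreateDatabaseIfNotExistsAsync("customer-db", throughput: 1000);

// One container per domain, each with a partition key suited to how it is queried.
Container users = await database.CreateContainerIfNotExistsAsync(
    new ContainerProperties("users", "/userId"));
Container employees = await database.CreateContainerIfNotExistsAsync(
    new ContainerProperties("employees", "/employeeId"));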
Based on this article, I have a question of strategy:
https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
A) Should I be structuring my partition keys so that my queries (ideally) end up at one partition? E.g. PartitionKey = CustomerId
OR
B) Does DocumentDB still handle queries that cross multiple (many) partitions efficiently? E.g. PartitionKey = "CustomerId+ContextName+TypeName"
We currently have "A" implemented, but have discussed "B" because the article has this quote in it:
It is a best practice to have a partition key with many distinct values (100s-1000s at a minimum).
Emphasis on "at a minimum". Our CustomerIds will not be of a volume to produce more than 2-300 partition keys. Should we add more information to the key ("B"), knowing that one query may then hit 30-50 partitions (i.e. the "TypeId" addition specifically)?
SELECT * FROM c
WHERE (c.MyPartition = "1+ContextA+TypeA"
OR c.MyPartition = "1+ContextA+TypeB"
OR c.MyPartition = "1+ContextA+TypeC"
...)
AND <some other conditions>
The scenarios laid out in the article seem to presume that customer or user will generate plenty of keys. This isn't going to be true for us.
The DocDB SDK makes parallel calls when you run a cross-partition query.
If you check the network traffic, you will notice that it first queries the physical partition key ranges and then makes individual calls to each partition key range.
It does this in parallel, and it lets you control MaxDegreeOfParallelism, etc. (sketched below).
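As a rough sketch of what that looks like with the (older) DocumentDB SDK; the account, database, collection, and query are placeholders of mine, and the point is the FeedOptions controlling the cross-partition fan-out.

using System;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Linq;

var client = new DocumentClient(new Uri("https://<account>.documents.azure.com:443/"), "<key>");
var collectionUri = UriFactory.CreateDocumentCollectionUri("mydb", "mycoll");

var options = new FeedOptions
{
    EnableCrossPartitionQuery = true, // required when no partition key is supplied
    MaxDegreeOfParallelism = 8,       // how many partition key ranges to query concurrently
    MaxBufferedItemCount = 1000       // how much the SDK may prefetch
};

var query = client.CreateDocumentQuery<dynamic>(
        collectionUri,
        "SELECT * FROM c WHERE c.type = 'order'",
        options)
    .AsDocumentQuery();

while (query.HasMoreResults)
{
    var page = await query.ExecuteNextAsync<dynamic>();
    Console.WriteLine($"Page of {page.Count} items, {page.RequestCharge} RU");
}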
Having said that, there are two aspects to consider:
Volume of the data
If your volume is, say, 1 TB, that would require at least 100 physical partitions (each partition being 10 GB), hence it would make at least 100 calls.
If your data volume grows higher, making more calls might start to hurt performance.
Querying aggregations
If you are using aggregations (DocDB currently supports SUM/AVG/COUNT/MIN/MAX), note that these cannot be performed across partitions.
Say I have:
My data for all of my tenants (i.e. multiple tenants) stored in a single DocumentDB collection.
I configured the collection in such a way that all of my data is distributed uniformly across all partitions.
But the partitioning is NOT by tenant; I use some other scheme.
Because of this, data for a particular tenant is distributed across multiple partitions.
Here are my questions:
Is this the right thing to do to maximize performance for both reading and writing data?
What if I want to query for a particular tenant? What are the caveats in writing this query?
Any other things that I need to consider?
I would avoid queries across partitions; they come with quite a cost (basically, multiply the index and parsing costs by the number of partitions, which defaults to 25). It's fairly easy to try out.
I would prefer a solution where one can query on a specific partition, typically partitioning by tenant ID.
Remember that with partitioned collections, there are still limits on each partition (10K RU and 10 GB). I have written about it here: http://blog.ulriksen.net/notes-on-documentdb-partitioning/
It depends upon your usage patterns as well as the variation in tenant size.
In general for multi-tenant systems, 99% of all operations are within a single tenant. If you make the tenantID your partition key, then those operations will only touch a single partition. This won't make a single operation any faster (latency) but could provide huge throughput gains when under load by multiple tenants. However, if you only have 5 tenants and 1 of them is 10x bigger than all the others, then using the tenantID as your key will lead to a very unbalanced system.
We use the tenantID as the partition key for our system and it seems to work well. We've talked about what we would do if it became very unbalanced, and one idea is to make the partition key be the tenantID plus a suffix, to split the large tenants up (roughly along the lines of the sketch below). We haven't had to do that yet, though, so we haven't worked out all of the details to know whether it would actually be possible and performant, but we think it would work.
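As a rough sketch of that idea (the helper, the hash choice, and the bucket count are assumptions of mine, not something we have actually built):

using System;

// Hypothetical helper: spread a large tenant across a fixed number of buckets by
// deriving the partition key from the tenant id plus a deterministic hash of the item id.
static string BuildPartitionKey(string tenantId, string itemId, int buckets = 1)
{
    if (buckets <= 1)
        return tenantId; // small tenants: one partition key per tenant

    // Deterministic hash (FNV-1a) so the same item always maps to the same bucket,
    // even across processes and deployments.
    uint hash = 2166136261;
    foreach (char c in itemId)
        hash = (hash ^ c) * 16777619;

    return $"{tenantId}-{hash % (uint)buckets}";
}

// Example: a big tenant spread over 4 buckets.
Console.WriteLine(BuildPartitionKey("tenant-42", "invoice-9001", buckets: 4));

The trade-off is that queries for a bucketed tenant then have to fan out across its buckets, so you would only do this for tenants that are genuinely too big for one partition key value.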
What you have described is a sensible solution, where you avoid data skews and load-balance across partitions well. Since the query for a particular tenant needs to touch all partitions, please remember to set FeedOptions.EnableCrossPartitionQuery to true (x-ms-documentdb-query-enablecrosspartition in the REST API).
The DocumentDB site also has an excellent article on partitioned collections and tips for choosing a partition key in general: https://azure.microsoft.com/en-us/documentation/articles/documentdb-partition-data/
Using DynamoDB, suppose two independent clients try to write to the same item at the same time, using conditional writes, and both try to change the value that the condition references. Obviously, one of these writes is doomed to fail the condition check; that's OK.
Suppose during the write operation, something bad happens, and some of the various DynamoDB nodes fail or lose connectivity to each other. What happens to my write operations?
Will they both block or fail (sacrifice of "A" in the CAP theorem)? Will they both appear to succeed and only later it turns out that one of them actually was ignored (sacrifice of "C")? Or will they somehow both work correctly due to some magic (consistent hashing?) going on in the DynamoDB system?
It just seems like a really hard problem, but I can't find anything discussing the possibility of availability issues with conditional writes (unlike with, for instance, consistent reads, where the possibility of availability reduction is explicit).
There is a lack of clear information in this area, but we can make some pretty strong inferences. Many people assume that DynamoDB implements all of the ideas from its predecessor "Dynamo", but that doesn't seem to be the case, and it is important to keep the two separated in your mind. The original Dynamo system was carefully described by Amazon in the Dynamo Paper. In thinking about these questions, it also helps to be familiar with the distributed databases based on the Dynamo ideas, like Riak and Cassandra; in particular, Apache Cassandra provides a full range of trade-offs with respect to CAP.
By comparing DynamoDB, which is clearly distributed, to the options available in Cassandra, I think we can see where it is placed in the CAP space. According to Amazon, "DynamoDB maintains multiple copies of each item to ensure durability. When you receive an 'operation successful' response to your write request, DynamoDB ensures that the write is durable on multiple servers. However, it takes time for the update to propagate to all copies." (Data Read and Consistency Considerations). Also, DynamoDB does not require the application to do conflict resolution the way Dynamo does. Assuming they want to provide as much availability as possible, and since they say they are writing to multiple servers, writes in DynamoDB are equivalent to Cassandra's QUORUM level. It would also seem that DynamoDB does not support hinted handoff, because that can lead to situations requiring conflict resolution. For maximum availability, an inconsistent read would only have to be at the equivalent of Cassandra's ONE level. However, to get a consistent read given the quorum writes would require a QUORUM-level read (following R + W > N for consistency; e.g. with N = 3 replicas, W = 2 and R = 2 gives R + W = 4 > 3). For more information on levels in Cassandra see About Data Consistency in Cassandra.
In summary, I conclude that:
Writes are "Quorum", so a majority of the nodes the row is replicated to must be available for the write to succeed
Inconsistent Reads are "One", so only a single node with the row need be available, but the data returned may be out of date
Consistent Reads are "Quorum", so a majority of the nodes the row is replicated to must be available for the read to succeed
So writes have the same availability as a consistent read.
To specifically address your question about two simultaneous conditional writes: one or both will fail, depending on how many nodes are down. However, there will never be an inconsistency. I think the availability of the writes really has nothing to do with whether they are conditional or not. (A typical conditional write and its failure handling are sketched below.)
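For reference, this is the kind of conditional write being discussed; the table and attribute names are made up for illustration (AWS SDK for .NET). Whichever writer loses the race gets a ConditionalCheckFailedException rather than a silent overwrite.

using System;
using System.Collections.Generic;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

// Hypothetical table "Counters" with partition key "pk" and a numeric "version" attribute.
var client = new AmazonDynamoDBClient();

var request = new UpdateItemRequest
{
    TableName = "Counters",
    Key = new Dictionary<string, AttributeValue> { ["pk"] = new AttributeValue { S = "counter-1" } },
    // Only apply the update if the value we previously read is still current.
    UpdateExpression = "SET #v = :next",
    ConditionExpression = "#v = :expected",
    ExpressionAttributeNames = new Dictionary<string, string> { ["#v"] = "version" },
    ExpressionAttributeValues = new Dictionary<string, AttributeValue>
    {
        [":expected"] = new AttributeValue { N = "1" },
        [":next"] = new AttributeValue { N = "2" }
    }
};

try
{
    await client.UpdateItemAsync(request);
    Console.WriteLine("Write succeeded.");
}
catch (ConditionalCheckFailedException)
{
    // The other writer got there first; re-read the item and retry, or give up.
    Console.WriteLine("Condition failed; someone else updated the item.");
}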