If we are partitioning a container in the Cosmos DB SQL API, is it OK to have a partition key that is unique in each document? That is, each document in the container will have its own logical partition, and each logical partition will hold only one document. We only need to query on the unique key, so only one partition/document will be hit. Is there still any downside to such modelling in terms of performance or storage?
If you are using Cosmos DB SQL API as a key/value store and only reading using ReadItemAsync() there is no downside to doing this.
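A minimal sketch of that key/value pattern, written in Python rather than .NET. In the Python SDK the counterpart of ReadItemAsync() is read_item(item, partition_key). The InMemoryContainer class and the sample document are hypothetical stand-ins so the snippet runs without a Cosmos DB account; with the real SDK you would pass a container client instead.

```python
from typing import Any, Dict, Tuple


def read_document(container: Any, doc_id: str) -> Dict[str, Any]:
    """Point read: the document id doubles as the partition key value."""
    return container.read_item(item=doc_id, partition_key=doc_id)


class InMemoryContainer:
    """Hypothetical stand-in for a container, for illustration only."""

    def __init__(self) -> None:
        # Keyed by (partition key value, id), which is how a point read is addressed.
        self._items: Dict[Tuple[str, str], Dict[str, Any]] = {}

    def upsert_item(self, body: Dict[str, Any]) -> None:
        # Assumption: the partition key path is /id, so both parts of the key match.
        self._items[(body["id"], body["id"])] = body

    def read_item(self, item: str, partition_key: str) -> Dict[str, Any]:
        return self._items[(partition_key, item)]


store = InMemoryContainer()
store.upsert_item({"id": "order-42", "total": 19.99})
doc = read_document(store, "order-42")
```

Because both the id and the partition key value are known up front, the lookup never has to fall back to a query.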
Is there some operation of the Scan API or the Query API that allows performing a lookup on a table with a composite key (pk/sk), varying only the pk, to optimize a Scan of the table?
Let me introduce a use case:
Suppose I have a partition key defined by the id of a project, and within each project I have a huge number of records (sk).
Now, I need to solve the query "return all projects". So I don't have a partition key and I have to perform a scan.
I know that I could create a GSI that solves this problem, but let's assume that this is not the case.
Is there any way to perform a scan that "hops" from one pk to the next, ignoring the remaining sk elements?
In other words, I want to collect only the first record of each partition key.
DynamoDB is a NoSQL database, as you already know. It is optimized for lookups, and practices you are used to from SQL databases or other (low-scale) databases are not always available in DynamoDB.
The concept of a partition key is to put records that are part of the same partition together, sorted by the sort key. The other side of it is that records that don't have the same partition key are stored in other locations. It is not one long list (or tree) of records that you can scan over.
When you design your schema in a NoSQL database, you need to consider the access pattern to that data. If you need a list of all the projects, you need to maintain an index that will allow it.
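One way to read "maintain an index" is to write a small marker item per project into a dedicated partition whenever a record is stored. The sketch below models that idea with plain dictionaries. It does not call DynamoDB, and the names PROJECTS, PROJECT#, put_record and list_projects are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical in-memory model of a table: partition key -> {sort key -> item}.
table: Dict[str, Dict[str, dict]] = defaultdict(dict)

PROJECT_INDEX_PK = "PROJECTS"  # assumed name for the index partition


def put_record(project_id: str, record_id: str, data: dict) -> None:
    """Store a record and register its project in the index partition."""
    table[f"PROJECT#{project_id}"][record_id] = data
    # Idempotent: the same project id always overwrites the same index entry.
    table[PROJECT_INDEX_PK][project_id] = {"project_id": project_id}


def list_projects() -> List[str]:
    """Read only the index partition, ordered by sort key."""
    return sorted(table[PROJECT_INDEX_PK])


put_record("beta", "r1", {"v": 1})
put_record("alpha", "r1", {"v": 2})
put_record("alpha", "r2", {"v": 3})
projects = list_projects()
```

Listing all projects then reads a single partition instead of scanning every record of every project.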
Until now I used Linq to SQL to query my CosmosDB database, which worked fine, and I did not have to pass the partition key. However, I now have to write a more complex query that searches for a product on multiple fields, so I decided to write a stored procedure, and here I have to pass the partition key to execute it.
Why is passing the partition key mandatory in some cases and not in others?
In my use case, I have a collection containing product objects, which all have a supplierId property, which is the partition key, and a catalogId property, which contains an array of all catalogs where the product is available.
In my API, I require the catalogId to search for a product but not the supplier, as it is redundant. Of course I could retrieve the supplierId using the catalogId first and then pass it to the method calling CosmosDB, but I don't really like that, as it would mean that my application layer has to be aware of the way the infrastructure works.
Do you have some advice on how to manage the dependency on the partition key? Or maybe I did not model my data layer in the best way according to CosmosDB best practices?
Linq may be able to deduce the partition key if it is sent as a filter predicate (where), which is why you didn't need to specify it. But if you're not passing it, Linq will happily run a fan-out query, which at large scale is slow and inefficient and should definitely be avoided at high request volumes.
Stored procs are scoped to a partition key so require it to be passed.
If you're doing a query here I would not use stored procedures, as they only execute on the primary replica and so can only access 1/4 of the provisioned throughput. Regular queries using the SDK can access any of the 4 replicas and so make better use of throughput. This is especially important for high-concurrency queries, but no matter what you should aim to be efficient.
So if this is indeed a cross-partition query, and you are not passing supplierId for a query that is executed very frequently, you may want to look at your partition strategy and analyze your data access patterns to ensure that you are designing a database that will scale and be efficient.
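The difference between a targeted query and a fan-out can be illustrated with a toy model. This is a simplified sketch, not SDK code: the real service fans out across physical partitions, whereas here each supplierId is treated as its own partition. The sample data and the find_by_catalog helper are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical data: supplierId (partition key) -> products in that partition.
partitions: Dict[str, List[dict]] = {
    "supplier-1": [{"id": "p1", "catalogId": ["c1", "c2"]}],
    "supplier-2": [{"id": "p2", "catalogId": ["c2"]}],
    "supplier-3": [{"id": "p3", "catalogId": ["c3"]}],
}


def find_by_catalog(
    catalog_id: str, supplier_id: Optional[str] = None
) -> Tuple[List[str], int]:
    """Return matching product ids and the number of partitions visited."""
    # Without a partition key every partition has to be visited (fan-out).
    targets = [supplier_id] if supplier_id is not None else list(partitions)
    matches = [
        product["id"]
        for key in targets
        for product in partitions.get(key, [])
        if catalog_id in product["catalogId"]
    ]
    return matches, len(targets)


fan_out = find_by_catalog("c2")                 # visits all 3 partitions
targeted = find_by_catalog("c2", "supplier-2")  # visits 1 partition
```

In this model the work done by the fan-out grows with the number of partitions, while the targeted query stays constant.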
I am new to CosmosDB and exploring partition keys. I understand that a partition key helps with faster retrieval. In my case, suppose I have customer data that has custId, offerCode, offerId and some other properties. I am planning to use offerId as the partition key. My question is: when fetching data, do I need to fetch by offerId for better performance, or can I fetch by any other property in the collection? Does it impact performance? Below is the schema of my items:
{
  "custId": "abc12345",
  "offers": [
    {
      "offerId": "offer123",
      "offerCode": "offerCode1"
    },
    {
      "offerId": "offer123",
      "offerCode": "offerCode2"
    }
  ]
}
David is an expert in CosmosDB and, as David said, what you need to understand is the partition key. Here are some docs:
the official documentation, and this answer on Stack Overflow.
In my opinion, if your database won't contain much data (less than 50 GB; a physical partition can store up to 50 GB), then all the logical partitions (logical partitions are defined by the partition key) live in one physical partition, so queries won't cross physical partitions. In that case you could even use the item ID as the partition key to ensure evenly balanced RU consumption.
By the way, as I see it, the partition key plays the role of a 'group'. If you have a large database spread over several physical partitions, fetching with the partition key helps locate the data efficiently, because all items of one logical partition are stored together in one physical partition. You should also know that if you need to change your partition key, you have to move your data to a new container with the new partition key.
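The 'group' idea can be sketched as a hash that maps each partition key value to one physical partition. This is an illustration only: Cosmos DB uses its own internal hashing, and the MD5 hash, the count of four physical partitions and the function name below are assumptions.

```python
import hashlib

PHYSICAL_PARTITIONS = 4  # assumed count, for illustration only


def physical_partition_for(partition_key: str) -> int:
    """Map a partition key value to a physical partition index (illustrative hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PHYSICAL_PARTITIONS


# Every item that shares a partition key value lands on the same physical partition.
first = physical_partition_for("offer123")
second = physical_partition_for("offer123")
```

A query that supplies the partition key can be routed to that one physical partition, while a query without it has to ask all of them.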
In general, if the data size is small, you don't even need to worry about the partition key: you can use the ID as the partition key, and fetching data with or without the partition key won't affect performance. If the data size is huge, you need to choose a property as the partition key following the principles below:
We are using Cosmos DB for our data storage, and there is a case where I have to do a cross-partition query because I don't know the specific partition key. But I will know a part of it.
To elaborate, my partition key is a combination of multiple strings, let's say A-B,
and let's say I only know A but not B. Is there any way to do a wildcard search on the partition key?
Would that optimize the query, or is it not possible? Depending on that, I will decide whether to put A in the partition key at all.
Based on my research and the "Partitioning in Azure Cosmos DB" documentation, nothing mentions that the Cosmos DB partition key supports wildcard searching. Only the indexing policy supports wildcard settings: https://learn.microsoft.com/en-us/azure/cosmos-db/index-policy#including-and-excluding-property-paths
So, for your situation, since you don't know B, I'd suggest you consider setting the partition key to A. Besides, you could vote up this thread: https://github.com/PlagueHO/CosmosDB/issues/153
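With A alone as the partition key, B becomes an ordinary property that can be filtered inside the partition. A minimal sketch follows, assuming the Python SDK's query_items(query, parameters, partition_key) call and hypothetical property names a and b. FakeContainer is a stand-in so the snippet runs locally; it mimics the filter rather than parsing SQL.

```python
from typing import Any, Dict, List


def find_in_partition(container: Any, a_value: str, b_prefix: str) -> List[dict]:
    """Query one logical partition (A) and filter on B inside it."""
    items = container.query_items(
        query="SELECT * FROM c WHERE STARTSWITH(c.b, @prefix)",
        parameters=[{"name": "@prefix", "value": b_prefix}],
        partition_key=a_value,
    )
    return list(items)


class FakeContainer:
    """Hypothetical stand-in for a container client, for illustration only."""

    def __init__(self, rows: List[Dict[str, str]]) -> None:
        self.rows = rows

    def query_items(self, query: str, parameters: list, partition_key: str):
        prefix = parameters[0]["value"]
        return [
            row for row in self.rows
            if row["a"] == partition_key and row["b"].startswith(prefix)
        ]


rows = [
    {"a": "A1", "b": "B-north"},
    {"a": "A1", "b": "C-south"},
    {"a": "A2", "b": "B-north"},
]
found = find_in_partition(FakeContainer(rows), "A1", "B")
```

The query stays within a single logical partition because A is known, and the unknown part is handled by the index on b.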
I'm migrating a very simple MongoDB database (a couple hundred entries) to Azure Cosmos DB. My app is based on Node.js, so I'm using mongoose as a mapper. Before, it was really simple: define schema, query collection, finished.
Now, when setting up a collection in Cosmos DB, I was asked about a partition key and a shard key. The first one I could ignore, but the last one was required. After quickly reading up on the topic and understanding that it is a kind of partitioning (again, which I do not need or want), I just chose _id as the shard key.
Of course something does not work.
Find queries work just fine, but updating or inserting records fails. Below is the error:
MongoError: query in command must target a single shard key
Cosmos db (with the mongo API) was advertised to me as a drop-in replacement. Which clearly is not the case because I never needed to worry about such things in mongo, especially for such a small scale app/db.
So, can I disable sharding somehow? Alternatively, how can I define shard key and not worry about it at all going forward?
Cheers
You could create a CosmosDB collection with a maximum fixed storage of 10 GB. In that case the collection will not have to be sharded, because the storage is not scalable, and you will not receive errors from CosmosDB. However, due to the minimum throughput of 400 RU/s, you might have slightly higher costs.
1. Can I disable sharding somehow?
Based on the statements in the official MongoDB documentation, this can't be done:
MongoDB provides no method to deactivate sharding for a collection
after calling shardCollection. Additionally, after shardCollection,
you cannot change shard keys or modify the value of any field used in
your shard key index.
So, you can't deactivate or disable the shard key.
2. Alternatively, how can I define a shard key and not worry about it at all going forward?
According to this link, you can set the shardKey option in the schema used for insert/update operations on your collection:
new Schema({ .. }, { shardKey: { tag: 1, name: 1 }})
Please note that Mongoose does not send the shardcollection command for you. You must configure your shards yourself.
BTW, setting _id as the shard key might not be an appropriate decision. You can find some advice about choosing a shard key here. If you want to change or remove the shard key, please refer to this case: How to change the shard key
It is required that each collection has a partition key and there is no way you can disable this requirement. Inserting a document into your collection without a field that targets the partition key results in an error. As stated in the documentation:
The Partition Key is used to automatically partition data among
multiple servers for scalability. Choose a JSON property name that has
a wide range of values and is likely to have evenly distributed access
patterns.