I'm migrating a very simple MongoDB database (a couple of hundred entries) to Azure Cosmos DB. My app is based on Node.js, so I'm using Mongoose as a mapper. Before, it was really simple: define a schema, query the collection, finished.
Now, when setting up a collection in Cosmos DB, I was asked about a partition key and a shard key. The first one I could ignore, but the last one was required. After quickly reading up on the topic and understanding that it is a kind of partitioning (which, again, I neither need nor want), I just chose _id as the shard key.
Of course something does not work.
Find queries work just fine, but updating or inserting records fails with the error below:
MongoError: query in command must target a single shard key
Cosmos DB (with the Mongo API) was advertised to me as a drop-in replacement, which clearly is not the case, because I never needed to worry about such things in MongoDB, especially for such a small-scale app/db.
So, can I disable sharding somehow? Alternatively, how can I define shard key and not worry about it at all going forward?
Cheers
You could create a Cosmos DB collection with fixed storage, capped at 10 GB. Such a collection does not have to be sharded because its storage is not scalable, so you will not receive this error from Cosmos DB. However, because of the minimum throughput of 400 RU/s, your costs might be slightly higher.
1. Can I disable sharding somehow?
According to the official MongoDB documentation, this is not possible:
MongoDB provides no method to deactivate sharding for a collection
after calling shardCollection. Additionally, after shardCollection,
you cannot change shard keys or modify the value of any field used in
your shard key index.
So, you can't deactivate or disable the shard key.
2. Alternatively, how can I define shard key and not worry about it at all going forward?
According to this link, you can set the shardKey option in your schema so that it is used for insert/update operations on your collection.
new Schema({ .. }, { shardKey: { tag: 1, name: 1 }})
Please note that Mongoose does not send the shardcollection command for you. You must configure your shards yourself.
BTW, setting _id as the shard key might not be an appropriate decision. You can find some advice about choosing a shard key here. If you want to change the shard key or just remove it, please refer to this case: How to change the shard key
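As a rough sketch of what "supplying the shard key on insert/update" can look like on the application side: the error above usually means the filter of a write did not include the shard key, so one option is to always merge the shard key fields into the filter. The helper name withShardKey, the sample document, and the choice of tag and name as shard key fields are assumptions for illustration only; nothing here is part of Mongoose or Cosmos DB.

```javascript
// Hypothetical helper (not a Mongoose or Cosmos DB API): copies the shard key
// fields from a document into a query filter so a write targets one shard.
function withShardKey(filter, doc, shardKeyFields) {
  const result = { ...filter };
  for (const field of shardKeyFields) {
    if (doc[field] === undefined || doc[field] === null) {
      throw new Error(`Missing shard key field "${field}"`);
    }
    result[field] = doc[field];
  }
  return result;
}

// Assumption: the collection is sharded on "tag" and "name",
// matching the shardKey option shown above.
const doc = { _id: 42, tag: 'news', name: 'alpha', views: 10 };
const filter = withShardKey({ _id: doc._id }, doc, ['tag', 'name']);

// The filter now carries the shard key and could be handed to a Mongoose call
// such as Model.updateOne(filter, { $set: { views: 11 } }).
console.log(filter);
```

Failing early when a shard key field is missing keeps the problem in your own code instead of surfacing as a MongoError from the server.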
It is required that each collection has a partition key and there is no way you can disable this requirement. Inserting a document into your collection without a field that targets the partition key results in an error. As stated in the documentation:
The Partition Key is used to automatically partition data among
multiple servers for scalability. Choose a JSON property name that has
a wide range of values and is likely to have evenly distributed access
patterns.
Related
I wanted to prevent any document create/update in Cosmos db with empty or null PartitionKey. Is this possible from a db level?
AFAIK, it is not possible to enforce this constraint from the database level. You would need to implement this from the application side only.
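A minimal sketch of such an application-side check, assuming the partition key is configured as a path like /tenantId (the path, the function names, and the sample documents are all made up for illustration):

```javascript
// Hypothetical guard, not part of any Cosmos DB SDK.
// Resolves a partition key path such as "/tenantId" or "/address/city".
function readPath(doc, partitionKeyPath) {
  return partitionKeyPath
    .split('/')
    .filter(Boolean)
    .reduce((value, segment) => (value == null ? undefined : value[segment]), doc);
}

// Throws if the document has a missing, null, or empty partition key value.
function assertPartitionKey(doc, partitionKeyPath) {
  const value = readPath(doc, partitionKeyPath);
  if (value === undefined || value === null || value === '') {
    throw new Error(`Document has no usable value at ${partitionKeyPath}`);
  }
  return doc;
}

// Call this right before every create/upsert in your data layer.
assertPartitionKey({ id: '1', tenantId: 'contoso' }, '/tenantId');
```

Putting the call in one shared data-access function means every writer goes through it, which is about as close to a "db level" rule as you can get here.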
If your collection is set up with a partition key, I don't see how you can update it using any SDK without specifying the key. How else would Cosmos DB know which partition to put the document in?
Of course if you create a null partition key that is a different story.
Until now I used LINQ to SQL to query my Cosmos DB database, which worked fine, and I did not have to pass the partition key. However, I now have to write a more complex query which searches for a product on multiple fields, so I decided to write a stored procedure, and here I have to pass the partition key to execute it.
Why is passing the partition key mandatory in some cases and not in others?
In my use case, I have a collection containing products objects which all have a supplierId property which is the partition key, and catalogId property which contains an array of all catalogs where the product is available.
In my API, I require the catalogId to search for a product but not the supplier as it is redundant. Of course I could retrieve the supplierId using the catalogId first and then pass it to the method calling Cosmosdb but I don't really like it as it would mean that my application layer should be aware of the way the infrastructure works.
Do you have some advice on how to manage the dependency on the partition key? Or maybe I did not model my data layer in the best way according to Cosmos DB best practices?
LINQ may be able to deduce the partition key if it is sent as a filter predicate (where), which is why you didn't need to specify it. But if you're not passing it, LINQ will happily run a fan-out query which, at large scale, is slow and inefficient and should definitely be avoided at high request volumes.
Stored procs are scoped to a partition key so require it to be passed.
If you're doing a query here I would not use stored procedures as they only execute on the primary replica so can only access 1/4 of the throughput provisioned. Regular queries using the SDK can access any of the 4 replicas so better utilize throughput. This is especially important for high concurrency queries but no matter what you should aim to be efficient.
So if this is indeed a cross-partition query, and you are not passing supplierId for a query which is executed very frequently, you may want to look at your partition strategy and analyze the access patterns to your data to ensure that you are designing a database that will scale and be efficient.
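To make the single-partition versus fan-out difference concrete, here is a small in-memory sketch. It is not Cosmos DB code: the ProductRepository class, the catalog-to-supplier lookup, and the sample data are all assumptions. It also shows one way to keep the supplierId lookup inside the data layer so the application layer only ever passes a catalogId.

```javascript
// In-memory model only; each Map entry stands in for one logical partition.
class ProductRepository {
  constructor(partitions, catalogToSupplier) {
    this.partitions = partitions;               // Map: supplierId -> products[]
    this.catalogToSupplier = catalogToSupplier; // Map: catalogId -> supplierId
    this.partitionsRead = 0;                    // crude cost counter
  }

  findByCatalog(catalogId) {
    const supplierId = this.catalogToSupplier.get(catalogId);
    // Known supplier: read one partition. Unknown: fan out over all of them.
    const targets = supplierId !== undefined
      ? [this.partitions.get(supplierId) || []]
      : [...this.partitions.values()];
    this.partitionsRead += targets.length;
    return targets.flat().filter((p) => p.catalogIds.includes(catalogId));
  }
}

const repo = new ProductRepository(
  new Map([
    ['s1', [{ id: 'p1', catalogIds: ['c1'] }]],
    ['s2', [{ id: 'p2', catalogIds: ['c2'] }, { id: 'p3', catalogIds: ['c2', 'c9'] }]],
  ]),
  new Map([['c1', 's1'], ['c2', 's2']])
);

const targeted = repo.findByCatalog('c2'); // resolves to s2, reads 1 partition
const fannedOut = repo.findByCatalog('c9'); // no mapping, reads every partition
console.log(targeted.length, fannedOut.length, repo.partitionsRead);
```

Whether a catalog really maps to a single supplier depends on your data; if it does not, that is a hint that supplierId may not be the right partition key for this access pattern.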
I have an application on AWS using DynamoDB, with users sending messages to each other. I am not familiar with AWS and I am lacking best-practice knowledge.
My application has now started to get slow to retrieve messages for a user because I have more and more data in my database.
I am thinking that it is because of my primary key and I wonder what could be a good primary key in this case.
Currently I am using a random guid as a primary key.
To retrieve all messages corresponding to a user, I am doing a scan operation.
I would like to use a composite value based on the username as a primary key, but I wonder if it will be any better. For instance, if I need to retrieve the number of messages for a user and increment it in order to create the primary key, the request will probably take even longer.
What would be a good primary key here?
Thanks!
It will be better since it appears you often query based on the userid. Scans are expensive and should be avoided where possible. AWS has a great article on best practices for choosing a partition key (primary key). The key takeaway is the following:
You should evaluate various approaches based on your data ingestion and access pattern, then choose the most appropriate key with the least probability of hitting throttling issues.
Using a GUID for the partition/primary key is a waste if you never query the data using it. Since the query operation (as opposed to scan) requires querying by the partition/primary key (and optionally the sort key), you want to choose a value that you often use to retrieve the data and that also has sufficient cardinality to ensure your data is distributed across a reasonable number of partitions.
What other access patterns do you have in your application? From what you've mentioned so far, userid seems to be a reasonable choice.
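For illustration, this is roughly what the request could look like once userId is the partition key. The table name Messages, the attribute name userId, and the assumption of a timestamp-like sort key are mine, not from your setup; the parameter names themselves are the standard DynamoDB Query parameters.

```javascript
// Builds the parameters for a DynamoDB Query (not a Scan).
// Assumptions: table "Messages", partition key "userId", and a sort key
// such as a sent-at timestamp so results come back in time order.
function buildMessagesQuery(userId, limit = 50) {
  return {
    TableName: 'Messages',
    KeyConditionExpression: '#uid = :uid',
    ExpressionAttributeNames: { '#uid': 'userId' },
    ExpressionAttributeValues: { ':uid': userId },
    ScanIndexForward: false, // descending by sort key, i.e. newest first
    Limit: limit,
  };
}

const params = buildMessagesQuery('alice', 20);
console.log(params);
```

The resulting object is the shape accepted by the DocumentClient query call in the AWS SDK for JavaScript. Unlike a scan, it only reads the items stored under that one user.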
We are using Cosmos DB for our data storage, and there is a case where I have to do a cross-partition query because I don't know the specific partition key. But I will know a part of it.
To elaborate, my partition key is a combination of multiple strings, let's say A-B,
and let's say I only know A but not B. Is there any way to do a wildcard search on the partition key?
Would that optimize the query, or is it not possible? Depending on that, I will consider whether to put A in the partition key at all.
Based on my research and the Partitioning in Azure Cosmos DB documentation, nothing indicates that Cosmos DB partition keys support wildcard searches. Only the indexing policy supports wildcard paths: https://learn.microsoft.com/en-us/azure/cosmos-db/index-policy#including-and-excluding-property-paths
So, since you don't know B, I'd suggest using A alone as the partition key. You could also vote on this thread: https://github.com/PlagueHO/CosmosDB/issues/153
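A toy illustration of why knowing only A does not help when the key is A-B: the partition is chosen from a hash of the whole value, so keys sharing a prefix can land anywhere. The hash below (sum of character codes modulo 4) is deliberately simplistic and is not the hash Cosmos DB uses; the function name and partition count are assumptions.

```javascript
// Toy hash partitioner, for illustration only.
function toyPartition(partitionKey, partitionCount = 4) {
  let sum = 0;
  for (const ch of partitionKey) {
    sum += ch.charCodeAt(0);
  }
  return sum % partitionCount;
}

// Composite key "A-B": the same prefix ends up in different partitions,
// so a query that only knows "A" still has to visit all of them.
const compositePartitions = ['A-1', 'A-2', 'A-3'].map((k) => toyPartition(k));

// Key "A" alone: every document with that value shares one partition,
// and B can be filtered as an ordinary property within it.
const prefixPartition = toyPartition('A');

console.log(compositePartitions, prefixPartition);
```

The trade-off is that all documents for one value of A then share a single logical partition, so this only works well if A has enough distinct values to spread the load.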
I'm an MSSQL developer who was recently tasked with building a new application using DynamoDB, since we use AWS and we wanted a highly scalable database service.
My biggest concern is data integrity. For example, I have a table for all my users where every row needs to have a username, email, and name field, all strings, with a verified field that's an int. Is there any way to require all entries in that table to have those fields and to be of that particular type?
Since the application is in PHP, I'm using Kettle as my ORM, which should prevent me from messing up the data integrity, but another developer voiced a concern about what happens if we ever add another application or if someone manually changes some types via the console.
https://github.com/inouet/kettle
Currently, no, you are responsible for maintaining the integrity of your items with respect to the existence of attributes that are not keys on the base table. However, you can use LSI and GSI to enforce data types of attributes (notwithstanding my qualm that this is not a recommended pattern, as it could cause partition heat especially for attributes whose range of values is small). For example, verified seems like it might take only 0 or 1 as a value, so if you create a GSI with PK=verified where verified is a Number, writes to the base table may get throttled by the verified GSI.
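Since the integrity checks have to live in the application, here is a minimal sketch of one. It is written in JavaScript rather than PHP purely for illustration, and the schema table and function name are assumptions; the field list is the one from the question.

```javascript
// Expected attributes for a user item, taken from the question.
const USER_SCHEMA = {
  username: 'string',
  email: 'string',
  name: 'string',
  verified: 'integer',
};

// Returns a list of problems; an empty list means the item may be written.
function validateUser(item) {
  const errors = [];
  for (const [field, expected] of Object.entries(USER_SCHEMA)) {
    const value = item[field];
    const ok = expected === 'integer'
      ? Number.isInteger(value)
      : typeof value === 'string' && value.length > 0;
    if (!ok) {
      errors.push(`${field}: expected ${expected}`);
    }
  }
  return errors;
}

console.log(validateUser({ username: 'jdoe', email: 'j@example.com', name: 'J', verified: 1 }));
console.log(validateUser({ username: 'jdoe', email: 5, name: 'J', verified: '1' }));
```

A check like this only protects writes that go through it. A second application or a manual edit in the console can still bypass it, so it is worth sharing the same validation across every writer.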