Until now I have used LINQ to SQL to query my Cosmos DB database, which worked fine, and I did not have to pass the partition key. However, I now have to write a more complex query that searches for a product on multiple fields, so I decided to write a stored procedure, and there I have to pass the partition key to execute it.
Why is passing the partition key mandatory in some cases and not in others?
In my use case, I have a collection containing product objects. Each has a supplierId property, which is the partition key, and a catalogId property containing an array of all catalogs in which the product is available.
In my API, I require the catalogId to search for a product but not the supplierId, as it is redundant. Of course, I could retrieve the supplierId using the catalogId first and then pass it to the method calling Cosmos DB, but I don't really like that approach, as it would mean my application layer has to be aware of how the infrastructure works.
Do you have any advice on how to manage the dependency on the partition key? Or maybe I did not model my data layer according to Cosmos DB best practices?
LINQ may be able to deduce the partition key if it is sent as a filter predicate (a where clause), which is why you didn't need to specify it. But if you're not passing it, LINQ will happily run a fan-out query, which at large scale is slow and inefficient and should definitely be avoided at high request volumes.
Stored procedures are scoped to a single partition key, so it must be passed.
If you're doing a query here, I would not use stored procedures: they only execute on the primary replica, so they can access only 1/4 of the provisioned throughput. Regular queries using the SDK can access any of the 4 replicas and therefore utilize throughput better. This is especially important for high-concurrency queries, but no matter what, you should aim to be efficient.
So if this is indeed a cross-partition query that is executed very frequently without supplierId, you may want to revisit your partitioning strategy and analyze your data access patterns, to ensure you are designing a database that will scale and be efficient.
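To make the difference concrete, here is a minimal sketch of a single-partition query spec, assuming the supplierId and catalog-array properties from the question (container and values are made up):

```javascript
// Illustrative sketch only: the parameterized query a Cosmos DB SDK
// client could send for a single-partition product search. Property
// names (supplierId, catalogIds) follow the question; values are made up.
const querySpec = {
  query:
    "SELECT * FROM p " +
    "WHERE p.supplierId = @supplierId " +
    "AND ARRAY_CONTAINS(p.catalogIds, @catalogId)",
  parameters: [
    { name: "@supplierId", value: "supplier-42" },
    { name: "@catalogId", value: "catalog-7" },
  ],
};

// With the JavaScript SDK (@azure/cosmos) this would run roughly as:
//   container.items.query(querySpec, { partitionKey: "supplier-42" });
// Supplying partitionKey pins the query to a single logical partition;
// omitting it lets the SDK fan out across all partitions.
```

Including the partition key both in the filter and in the request options is what lets the service route the query to one partition instead of all of them.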
Is there any operation in the Scan or Query API that allows performing a lookup on a table with a composite key (pk/sk) while varying only the pk, to optimize the Scan operation on the table?
Let me introduce a use case:
Suppose I have a partition key defined by the id of a project, and within each project I have a huge number of records (sk values).
Now I need to solve the query "return all projects". Since I don't have a partition key, I have to perform a scan.
I know that I could create a GSI that solves this problem, but let's assume that this is not the case.
Is there any way to perform a scan that "hops" between each pk, ignoring the elements of the sk's?
In other words, I will collect the information of the first record of each partition key.
DynamoDB is a NoSQL database, as you already know. It is optimized for lookups, and practices you were used to in SQL databases or other (low-scale) databases are not always available in DynamoDB.
The point of a partition key is to store records belonging to the same partition together, sorted by the sort key. The flip side is that records with different partition keys are stored in other locations. The table is not one long list (or tree) of records that you can scan over.
When you design your schema in a NoSQL database, you need to consider the access patterns to that data. If you need a list of all the projects, you need to maintain an index that provides it.
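One common GSI-free way to maintain such an index is to write one lightweight "marker" item per project under a fixed partition key, so that "return all projects" becomes a single Query. A sketch of the request shapes (table and attribute names are assumptions):

```javascript
// Sketch, not a drop-in solution: alongside each project's records,
// write one marker item under a fixed partition key that acts as the
// "all projects" index. Table and attribute names are made up.
const putProjectMarker = {
  TableName: "Projects",
  Item: {
    pk: { S: "PROJECT_LIST" },      // fixed partition acting as the index
    sk: { S: "PROJECT#proj-123" },  // one marker per project id
  },
};

// Listing all projects is then a Query on that one partition, not a Scan.
const listAllProjects = {
  TableName: "Projects",
  KeyConditionExpression: "pk = :list",
  ExpressionAttributeValues: { ":list": { S: "PROJECT_LIST" } },
};
```

Keep in mind that a single fixed partition key can become a hot spot under heavy write traffic, so this pattern suits a moderate number of projects.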
Background: I have a relational db background and have never built anything for DynamoDB that wasn't just used for fast writes with very few reads. I am trying to learn DynamoDB patterns by migrating one of my help desk apps from MySQL to DynamoDB.
The application is a fairly simple one from a data storage perspective. A user submits a request and that request generates 1 or more tickets.
Setup: I have screens where people see initial requests and each request's tickets, plus search views that allow support staff to query on a bunch of ticket attributes (user's last name, ticket status, ticket use case, user's phone number, user's dept). This design is pretty straightforward in a SQL db, but in Dynamo I'm really being thrown for a loop on how to structure primary/sort keys and secondary indexes (if necessary).
I created one collection for requests and one for tickets. Each request has an array of the ticket ids that belong to it, and each ticket item has an attribute storing the request id so that I can search in that direction. But what I am hung up on is: how do I support searching on a ticket's or request's attributes without having to do a full scan?
I read about composite keys and perhaps creating a composite sort key similar to: ## so that I can search on each of those fields directly without having to know the primary key (ticket id).
Question: How do you design dynamo collections/tables that require querying a lot of different attribute values without relying on a primary key?
This is typically something that DynamoDB is not good at, which is not to say it cannot be done. DynamoDB's strength and speed come from having well-known access patterns and designing your schema for those patterns. In general, if you don't know what your users will search for, or there are many different possible queries, it's better to look at something like RDS or a native SQL DB. That being said, a possible direction would be to create multiple lists for each of the fields and duplicate the data. This could all be done in the same table.
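As a sketch of the composite sort key idea the question mentions (attribute names here are made up), concatenating searchable fields in a fixed order lets a Query use a begins_with prefix instead of a full Scan:

```javascript
// Sketch of a composite sort key (attribute names are illustrative).
// Field order matters: a Query can only constrain a prefix of the key,
// so place the attributes you always filter on first.
function ticketSortKey(status, useCase, lastName) {
  return [status, useCase, lastName].join("#");
}

const sk = ticketSortKey("OPEN", "PASSWORD_RESET", "Smith");
// sk === "OPEN#PASSWORD_RESET#Smith"

// A Query against an index keyed on (requestId, sk) could then use:
//   KeyConditionExpression: "requestId = :r AND begins_with(sk, :p)"
// with :p = "OPEN#" to fetch all open tickets for a request.
```

Each search pattern you need dictates one such key layout, which is why duplicating the data across several keyed copies is the usual trade-off.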
I am working with Azure CosmosDB, and more specifically with the Gremlin API, and I am a little bit stuck as to what to select as a partition key.
Indeed, since I'm using graph data, not all vertices follow the same schema. If I select a property that not all vertices have in common, Azure won't let me store vertices that don't have a value for the partition key. The problem is that the only property they all have in common is /id, but Azure doesn't allow this property to be used as a partition key.
Does that mean I need to create a property that all my vertices have in common? Doesn't that somewhat defeat the purpose of graph data? Or is there something I'm missing?
For example, in my case, I want to model an object and its parts. Each object and each part has a property /identificationNumber. Would it be better to use this property as a partition key, or to create a new property /partitionKey dedicated to partitioning? My concern is that if I select /identificationNumber as the partition key and my data model has to evolve, so that I have to model new objects without an /identificationNumber, I will have to artificially add this property to those objects in the data model, which might lead to some confusion.
Creating a dedicated property to use as a synthetic partition key is a good practice when there isn't an obvious existing property to use. This approach can mitigate cases where some objects don't have an /identificationNumber, since you can assign some other value as the partitionKey in those cases. It also gives you flexibility to refactor /identificationNumber in the future, since partitionKey is what needs to remain unchanged.
We shouldn't be concerned about an "artificial property", because it is inherent to using a partitioned database. It doesn't need to be exposed to users, but developers need to understand that Cosmos is somewhat different from traditional DBs. In the worst case of regret down the road, it's also possible to migrate to a new partition key by copying all data to a new container. It's probably best to start the project with a best guess, see how things work, and perhaps iterate on different ideas to compare performance.
I have an application on AWS using DynamoDB, with users sending messages to each other. I am not familiar with AWS, and I am lacking best-practice knowledge.
My application has now started to become slow at retrieving messages for a user because there is more and more data in my database.
I am thinking that it is because of my primary key and I wonder what could be a good primary key in this case.
Currently I am using a random guid as a primary key.
Since I am looking to retrieve all messages corresponding to a user, I am doing a scan operation.
I would like to use a composite value based on the username as the primary key, but I wonder whether it would actually be better. For instance, if I need to retrieve the number of messages for a user and increment it, it will probably take even longer to build the request that creates the primary key.
What would be a good primary key here?
Thanks!
It will be better, since it appears you often query based on the userId. Scans are expensive and should be avoided where possible. AWS has a great article on best practices for choosing a partition key (primary key). The key takeaway is the following:
You should evaluate various approaches based on your data ingestion and access pattern, then choose the most appropriate key with the least probability of hitting throttling issues.
Using a GUID as the partition/primary key is a waste if you never query the data by it. Since the Query operation (as opposed to Scan) requires specifying the partition/primary key (and optionally the sort key), you want to choose a value that you frequently use to retrieve the data and that also has sufficient cardinality to ensure your data is distributed across a reasonable number of partitions.
What other access patterns do you have in your application? From what you've mentioned so far, userid seems to be a reasonable choice.
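For instance, with userId as the partition key and a message timestamp as the sort key, fetching a user's messages becomes a Query rather than a Scan. A sketch of the request parameters (table and attribute names are assumptions):

```javascript
// Sketch of the Query this key schema enables: userId as partition key,
// a sentAt timestamp as sort key. Only the "alice" partition is read;
// no table scan is involved. Table and attribute names are made up.
const recentMessages = {
  TableName: "Messages",
  KeyConditionExpression: "userId = :u AND sentAt >= :since",
  ExpressionAttributeValues: {
    ":u": { S: "alice" },
    ":since": { S: "2023-01-01T00:00:00Z" },
  },
  ScanIndexForward: false, // newest messages first
};
```

For the per-user message count, one option is a separate counter item updated with UpdateItem's ADD action, which avoids counting items at read time.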
I'm migrating a very simple Mongo DB (a couple hundred entries) to Azure Cosmos DB. My app is based on Node.js, so I'm using mongoose as a mapper. Before, it was really simple: define schema, query collection, finished.
Now, when setting up a collection in Cosmos DB, I was asked about a partition key and a shard key. The first one I could ignore, but the last one was required. After quickly reading up on the topic and understanding it to be a kind of partitioning (again, which I do not need and do not want), I just chose _id as the shard key.
Of course something does not work.
While find queries work just fine, updating or inserting records fails with the error below:
MongoError: query in command must target a single shard key
Cosmos DB (with the Mongo API) was advertised to me as a drop-in replacement, which clearly is not the case, because I never needed to worry about such things in Mongo, especially for such a small-scale app/db.
So, can I disable sharding somehow? Alternatively, how can I define shard key and not worry about it at all going forward?
Cheers
You could create a Cosmos DB collection with a maximum fixed storage of 10 GB. In that case the collection will not have to be sharded, because the storage is not scalable, and you will not receive errors from Cosmos DB. However, due to the minimum throughput of 400 RU/s, you might have slightly higher costs.
1. Can I disable sharding somehow?
Based on the statements in the official MongoDB documentation, it can't be done:
MongoDB provides no method to deactivate sharding for a collection after calling shardCollection. Additionally, after shardCollection, you cannot change shard keys or modify the value of any field used in your shard key index.
So, you can't deactivate or disable the shard key.
2. Alternatively, how can I define the shard key and not worry about it at all going forward?
According to this link, you can set the shardKey option in your Schema so that it is applied when you perform insert/update operations on your collection:
new Schema({ .. }, { shardKey: { tag: 1, name: 1 }})
Please note that mongoose does not send the shardCollection command for you; you must configure your shards yourself.
By the way, setting _id as the shard key might not be an appropriate decision. You can find some advice about choosing a shard key here. If you want to change the shard key, or just remove it, please refer to this case: How to change the shard key
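For completeness, since the question chose _id as the shard key, the corresponding mongoose schema options would be along these lines (a config sketch; the rest of the schema is omitted):

```javascript
// Config sketch: schema options telling mongoose which shard key to
// include in update/remove filters, which is what the error
// "query in command must target a single shard key" requires.
// Pass this as the second argument to new mongoose.Schema({ ... }, schemaOptions).
const schemaOptions = { shardKey: { _id: 1 } };
```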
It is required that each collection has a partition key, and there is no way to disable this requirement. Inserting a document into your collection without a field matching the partition key results in an error. As stated in the documentation:
The Partition Key is used to automatically partition data among multiple servers for scalability. Choose a JSON property name that has a wide range of values and is likely to have evenly distributed access patterns.