Converting a document collection without a partition key to one with a partition key - azure-cosmosdb

We have a production Cosmos DB collection WITHOUT a PARTITION KEY, and we have now decided to introduce a PARTITION KEY. We understand this requires creating a new collection with a partition key and migrating the data, with production downtime. All our collections have an /id property that is unique within a given collection. The question is: would /id be an ideal candidate for the partition key? If so, what are the pros and cons? Please suggest.

First of all, you should have a look at the official documentation on choosing a partition key.
If you are going to use id as the partition key, you need to check whether you query data on any property other than id, because those queries will be forced to run cross-partition. If not, id is a good choice of partition key.
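To make the trade-off concrete, here is a minimal sketch using the azure-cosmos Python SDK (v4); the account URL, key, database, container, and property names are placeholders:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("orders")

# With /id as the partition key, a lookup by id is a cheap point read:
# both the item id and the partition key value are the same known string.
item = container.read_item(item="order-42", partition_key="order-42")

# A filter on any other property has no partition key value to target, so it
# fans out across all physical partitions (more RUs, higher latency).
results = list(container.query_items(
    query="SELECT * FROM c WHERE c.customerId = @cid",
    parameters=[{"name": "@cid", "value": "cust-1"}],
    enable_cross_partition_query=True,
))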

Related

Multi partition key search operation in DynamoDB

Is there some operation in the Scan API or the Query API that allows performing a lookup on a table with a composite key (pk/sk) that varies only in the pk, in order to optimize the Scan operation on the table?
Let me introduce a use case:
Suppose I have a partition key defined by the id of a project, and within each project I have a huge number of records (sk).
Now I need to solve the query "return all projects". Since I don't have a partition key, I have to perform a scan.
I know that I could create a GSI that solves this problem, but let's assume that this is not the case.
Is there any way to perform a scan that "hops" between partition keys, ignoring the sk items?
In other words, I want to collect the first record of each partition key.
DynamoDB is a NoSQL database, as you already know. It is optimized for lookups, and practices you were used to in SQL databases or other (lower-scale) databases are not always available in DynamoDB.
The idea of a partition key is that records sharing the same partition key are stored together, sorted by the sort key. The flip side is that records with different partition keys are stored in other locations. The table is not one long list (or tree) of records that you can scan over.
When you design your schema in a NoSQL database, you need to consider the access patterns for that data. If you need a list of all the projects, you need to maintain an index that allows it.
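As a hedged sketch of that last point with boto3: one common pattern is to give every "project header" item a constant attribute that serves as the partition key of a GSI. The table, index, and attribute names below are hypothetical:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("projects")  # placeholder table name

# Every "project header" item carries a constant attribute
# (e.g. entityType = "PROJECT") that is the partition key of a GSI. Querying
# that GSI returns one item per project without scanning the per-project
# records stored under each pk/sk pair.
resp = table.query(
    IndexName="entityType-index",                      # hypothetical GSI
    KeyConditionExpression=Key("entityType").eq("PROJECT"),
)
projects = resp["Items"]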

Would a cross-partition query in Cosmos DB be helpful if we know one part of the partition key?

We are using Cosmos DB for our data storage, and there is a case where I have to do a cross-partition query because I don't know the specific partition key, but I will know part of it.
To elaborate, my partition key is a combination of multiple strings, let's say A-B,
and let's say I only know A but not B. So is there any way to do wildcard searching on the partition key?
Would that optimize the query, or is it not possible? Depending on that, I will decide whether to put A in the partition key at all.
Based on my research and Partitioning in Azure Cosmos DB, nothing mentions that the Cosmos DB partition key supports wildcard searching. Only the indexing policy supports wildcard settings: https://learn.microsoft.com/en-us/azure/cosmos-db/index-policy#including-and-excluding-property-paths
So, for your situation, since you don't know B, I'd suggest considering setting the partition key to A. Besides, you could vote up this thread: https://github.com/PlagueHO/CosmosDB/issues/153
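For illustration, a rough sketch with the azure-cosmos Python SDK of what each choice looks like at query time; the property names and values are made up:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("items")

# If the partition key stays "A-B" and only A is known, no partition key value
# can be supplied, so the query fans out to every partition and filters with
# STARTSWITH on the stored key property (here assumed to be c.partitionKey).
fan_out = list(container.query_items(
    query="SELECT * FROM c WHERE STARTSWITH(c.partitionKey, @prefix)",
    parameters=[{"name": "@prefix", "value": "A-"}],
    enable_cross_partition_query=True,
))

# If the partition key is just A, the same lookup targets a single partition.
single_partition = list(container.query_items(
    query="SELECT * FROM c WHERE c.b = @b",
    parameters=[{"name": "@b", "value": "B"}],
    partition_key="A",
))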

Can we have a non-primary key as the partition key in Azure Cosmos DB?

I am creating a collection in Cosmos DB that can exceed 10 GB. The collection does not have any primary key. Is it a good idea to make the non-primary-key field that I will be querying frequently the partition key?
Is it a good idea to make the non-primary-key field that I will be querying frequently the partition key?
Actually, the choice of partition key deserves careful weighing. The best partition key is one that provides even distribution and high cardinality.
Since your current plan is to use the non-primary-key field that you will query frequently, I will just discuss some of the possible pros and cons for your reference.
Firstly, the primary key is the safest and probably the most appropriate choice for a partition key.
It guarantees uniqueness of the value, which, apart from unique keys, is the only way to achieve this. The distribution will be even, and because the primary key is also your partition key, you will be able to retrieve a document with a point read instead of a query, which is faster and cheaper.
In terms of performance, if the field you query frequently is not the partition key, those queries will cross partitions, which definitely reduces query performance. The larger the amount of data, the greater the effect.
In terms of cost, Cosmos DB charges primarily for storage and RU consumption. As you said, choosing the non-primary key as the partition key may lead to less index storage, and if most queries are not cross-partition, it also saves RU consumption.
In terms of stored procedures, triggers, and UDFs: you can't run cross-partition transactions via stored procedures and triggers. They are partitioned, so you need to specify a single partition key value (cardinality of one) when you invoke them.
Just note that once a partition key is set, it cannot be deleted or modified later. So consider it carefully before you choose, and make sure you back up your data.
For more details, refer to the official docs.
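A small sketch of the stored-procedure and point-read points above, using the azure-cosmos Python SDK; the procedure name, ids, and key values are placeholders:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("orders")

# Stored procedures and triggers run as a transaction inside one logical
# partition, so a single partition key value must be supplied at call time.
result = container.scripts.execute_stored_procedure(
    sproc="bulkUpdateStatus",        # hypothetical stored procedure
    partition_key="customer-123",    # the one partition it runs against
    params=["shipped"],
)

# A point read also needs both the id and the partition key value; if the field
# you query most often is the partition key, these reads stay fast and cheap.
doc = container.read_item(item="order-1", partition_key="customer-123")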

Primary key scheme

I'm just getting started with DynamoDB and am not sure how to generate the primary key. Based on my research I see two good schemes.
UUID (or GUID)
Timestamp combined with a random number
Are these good options? What are the advantages and disadvantages of the two options?
In DynamoDB, the primary key is the partition key (optionally combined with a sort key). Both options are good, but it really depends on your specific case. Here are two good posts on DynamoDB partition keys: https://aws.amazon.com/pt/blogs/database/choosing-the-right-dynamodb-partition-key/ and https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
Both of the options mentioned above are subject to your use case. The optimal usage of a DynamoDB table's provisioned throughput depends mainly on two factors:
The primary key selection
The workload patterns on individual items
Primary keys can be simple (a partition key only) or composite (a partition key and a sort key). Think of it as blocks being created in the database per partition key, with items arranged within each block by sort key.
Now, as per the official DynamoDB guide, structure the primary key elements so that no single "hot" (heavily requested) partition key value slows overall performance. If your application has many users, then USER_ID can be a good partition key, which will distribute load and optimize performance.
Consider a table with a composite primary key, with the date as the partition key. All records created on a particular day, say 2018-01-12, will be written to the same partition, with the sort key as the item identifier.
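For illustration, a quick boto3 sketch of the two key schemes from the question; the table and attribute names are placeholders:

import random
import time
import uuid

import boto3

table = boto3.resource("dynamodb").Table("events")  # placeholder table name

# Option 1: a UUID partition key -- uniformly distributed, carries no meaning.
table.put_item(Item={"pk": str(uuid.uuid4()), "payload": "..."})

# Option 2: timestamp plus a random suffix -- still reasonably distributed,
# and the key itself encodes a rough creation time.
ts_key = f"{int(time.time() * 1000)}-{random.randint(0, 9999):04d}"
table.put_item(Item={"pk": ts_key, "payload": "..."})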

DynamoDB: Keys and what they mean

I'm confused as to how to use DynamoDB table keys. The documentation mentions HASH keys (which also seem to be referred to as partition keys) and RANGE (or SORT?) keys. I'm trying to roughly align these with my previous understanding of database indexing theory.
My current, mostly guess-based understanding is that a HASH key is essentially a primary key - it must be unique and is automatically indexed for fast reads - and a RANGE key is basically something you should apply to any other field you plan on querying (either in a WHERE-like or sorting context).
This is then somewhat confused by the introduction of Local and Global Secondary Indexes. How do they play into things?
If anyone could nudge me in the right direction, bearing in mind my current, probably flawed understanding has come from the docs, I'd be super grateful.
Thanks!
Basically, a DynamoDB table is partitioned based on the partition key (otherwise called the hash key).
1) If the table has only a partition key, then it has to be unique. DynamoDB table performance depends heavily on the partition key. A good partition key is a well-scattered value (not a sequence number like an RDBMS primary key in legacy systems).
2) If the table has both a partition key and a sort key (otherwise called the RANGE key), then the combination of them needs to be unique. It is a kind of concatenated key in RDBMS terms.
However, the usage differs in DynamoDB. DynamoDB has no sorting functionality (i.e. ORDER BY clause) across partition keys. For example, if you have 10 items with the same partition key value and different sort key values, you can sort the result by the sort key attribute. You can't sort by any other attribute, including the partition key.
All sort key values for a given partition key are maintained in the same partition for better performance (i.e. physically co-located).
LSI - You can define up to five LSIs on a table, and they must be defined when you create the table. An LSI is a kind of alternate sort key for the table (it shares the table's partition key).
GSI - In order to understand GSIs, you need to understand the difference between the Scan and Query APIs in DynamoDB.
SCAN - used when you don't know the partition key (i.e. a full table scan to get the items)
QUERY - used when you know the partition key (the sort key is optional)
As DynamoDB pricing is based on read/write capacity units, and for performance reasons, a scan is not the best option for most use cases. So there is the option to create a GSI with an alternate partition key based on your query access pattern (QAP).
GSI Example
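A rough boto3 sketch of Query vs Scan, and of a Query against a GSI; the table, index, and attribute names are hypothetical:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("orders")  # placeholder table name

# QUERY on the base table: the partition key value is known, sort key optional.
by_customer = table.query(
    KeyConditionExpression=Key("customerId").eq("cust-1")
    & Key("orderDate").begins_with("2018-01"),
)

# SCAN: no partition key known, so every item is read and filtered -- expensive.
all_items = table.scan()

# QUERY on a GSI whose alternate partition key serves a different access pattern.
by_status = table.query(
    IndexName="status-index",                    # hypothetical GSI
    KeyConditionExpression=Key("status").eq("SHIPPED"),
)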
