Is there an operation in the Scan API or the Query API that performs a lookup on a table with a composite key (pk/sk) but iterates only over the distinct pk values, so that a Scan of the table can be optimized?
Let me introduce a use case:
Suppose I have a partition key defined by the id of a project, and within each project I have a huge number of records (sk).
Now, I need to solve the query "return all projects". So I don't have a partition key and I have to perform a scan.
I know that I could create a GSI that solves this problem, but let's assume that this is not an option.
Is there any way to perform a scan that "hops" from one pk to the next, ignoring the items under each sk?
In other words, I want to collect the first record of each partition key.
DynamoDB is a NoSQL database, as you already know. It is optimized for key lookups, and practices you may be used to from SQL databases or other (lower-scale) databases are not always available in DynamoDB.
The idea of a partition key is to store records that belong to the same partition together, sorted by the sort key. The other side of this is that records that don't share the same partition key are stored in other locations. The partition keys do not form one long list (or tree) of records that you can scan over.
When you design your schema in a NoSQL database, you need to consider the access patterns for that data. If you need a list of all the projects, you need to maintain an index that allows it.
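One way to maintain such an index without a GSI is to write a marker item for each project under a single, fixed partition key, so that "return all projects" becomes a Query rather than a Scan. The following is only a minimal sketch under stated assumptions: it expects a boto3-style `Table` object with `put_item` and `query`, and the attribute names `pk`/`sk`, the `PROJECTS` partition value, and the helper names are all hypothetical. A small in-memory stand-in replaces the real table so the example runs without AWS.

```python
class InMemoryTable:
    """Tiny stand-in for a boto3 Table; implements only what this sketch needs."""

    def __init__(self):
        self.items = {}  # (pk, sk) -> item

    def put_item(self, Item):
        self.items[(Item["pk"], Item["sk"])] = Item

    def query(self, KeyConditionExpression, ExpressionAttributeValues, **kwargs):
        # Only supports the single condition "pk = :pk" used below.
        assert KeyConditionExpression == "pk = :pk"
        wanted = ExpressionAttributeValues[":pk"]
        matches = [item for (pk, _), item in sorted(self.items.items()) if pk == wanted]
        return {"Items": matches}


PROJECT_INDEX_PK = "PROJECTS"  # hypothetical fixed partition holding one marker per project


def put_record(table, project_id, record_id, payload):
    """Store a record and make sure its project is listed in the index partition."""
    table.put_item(Item={"pk": project_id, "sk": record_id, **payload})
    # Idempotent: re-writing the same marker simply overwrites it.
    table.put_item(Item={"pk": PROJECT_INDEX_PK, "sk": project_id})


def list_projects(table):
    """Return all project ids with a Query on the index partition, following pagination."""
    project_ids = []
    kwargs = {
        "KeyConditionExpression": "pk = :pk",
        "ExpressionAttributeValues": {":pk": PROJECT_INDEX_PK},
    }
    while True:
        page = table.query(**kwargs)
        project_ids.extend(item["sk"] for item in page["Items"])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return project_ids
        kwargs["ExclusiveStartKey"] = last_key


table = InMemoryTable()
put_record(table, "project-a", "rec-1", {"value": 1})
put_record(table, "project-a", "rec-2", {"value": 2})
put_record(table, "project-b", "rec-1", {"value": 3})
print(list_projects(table))
```

A single index partition concentrates those reads and writes on one key, so whether this layout suits a given workload is a judgment call.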
I am new to Cosmos DB and exploring partition keys. I understand that a partition key helps with faster retrieval. In my case, suppose I have customer data with custId, offerCode, offerId and some other properties, and I am planning to use offerId as the partition key. My question is: when fetching data, do I need to fetch by offerId for better performance, or can I fetch by another property from the collection? Does that impact performance? Below is a sample item:
{
  "custId": "abc12345",
  "offers": [
    {
      "offerId": "offer123",
      "offerCode": "offerCode1"
    },
    {
      "offerId": "offer123",
      "offerCode": "offerCode2"
    }
  ]
}
David is an expert in Cosmos DB, and as @David said, what you need to understand is the partition key. Here is some documentation:
The official documentation, and this question on Stack Overflow.
In my opinion, if your database won't contain much data (less than 50 GB; a physical partition can store up to 50 GB), then all the logical partitions (logical partitions are defined by the partition key) typically fit in one physical partition, so queries won't cross physical partitions. In that case you could even use the item ID as the partition key, which helps balance RU consumption evenly.
By the way, as I see it, the partition key plays the role of a 'group'. If you have a large database that really does span several physical partitions, fetching with the partition key helps locate the data efficiently, because all items of one logical partition are stored together in one physical partition. You should also know that if you need to change your partition key, you have to move your data to a new container with the new partition key.
In general, if the data size is small, you don't even need to worry about the partition key: you can use the ID as the partition key, and fetching data with or without the partition key won't noticeably affect performance. If the data size is huge, you need to choose a property as the partition key following the principles below:
We are using Cosmos DB for our data storage, and there is a case where I have to do a cross-partition query because I don't know the specific partition key. But I will know a part of it.
To elaborate, my partition key is a combination of multiple strings, let's say A-B,
and let's say I only know A but not B. Is there any way to do a wildcard search on the partition key?
Would that optimize the query, or is it not possible? Depending on the answer, I will decide whether to put A in the partition key at all.
Based on my research and on "Partitioning in Azure Cosmos DB", nothing mentions that the Cosmos DB partition key supports wildcard searching. Only the indexing policy supports wildcard settings: https://learn.microsoft.com/en-us/azure/cosmos-db/index-policy#including-and-excluding-property-paths
So, for your situation, since you don't know B, I'd suggest you consider setting the partition key to A. Besides, you could vote up this thread: https://github.com/PlagueHO/CosmosDB/issues/153
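With A alone as the partition key, a query that knows A can be scoped to a single logical partition. Here is a minimal sketch, assuming the azure-cosmos Python SDK, where `ContainerProxy.query_items` accepts `query` and `partition_key` keyword arguments. The property name `a`, the function name, and the `FakeContainer` stand-in are hypothetical and exist only so the example runs offline.

```python
class FakeContainer:
    """Stand-in for azure.cosmos.ContainerProxy so the sketch runs offline."""

    def __init__(self, items):
        self.items = items

    def query_items(self, query, parameters=None, partition_key=None, **kwargs):
        # Only mimics partition scoping; the query text itself is ignored here.
        return iter([item for item in self.items if item["a"] == partition_key])


def find_all_for_a(container, a_value):
    """Return every item in the logical partition whose key is `a_value`."""
    results = container.query_items(
        query="SELECT * FROM c",
        partition_key=a_value,  # scopes the query to one logical partition
    )
    return list(results)


container = FakeContainer([
    {"id": "1", "a": "A1", "b": "B1"},
    {"id": "2", "a": "A1", "b": "B2"},
    {"id": "3", "a": "A2", "b": "B1"},
])
print([item["id"] for item in find_all_for_a(container, "A1")])
```

B then becomes an ordinary property that can be filtered on in the query text once it is known.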
I'm confused as to how to use DynamoDB table keys. The documentation mentions HASH (which seems to also be referred to as Partition) keys and RANGE (or SORT?) keys. I'm trying to roughly align these with my previous understanding of database indexing theories.
My current, mostly guess-based understanding is that a HASH key is essentially a primary key - it must be unique and is automatically indexed for fast-reading - and a RANGE key is basically something you should apply to any other field you plan on querying on (either in a WHERE-like or sorting context).
This is then somewhat confused by the introductions of Local and Global Secondary Indexes. How do they play into things?
If anyone could nudge me in the right direction, bearing in mind my current, probably flawed understanding has come from the docs, I'd be super grateful.
Thanks!
Basically, a DynamoDB table is partitioned based on the partition key (also called the hash key).
1) If the table has only a partition key, then it has to be unique. DynamoDB table performance depends largely on the partition key. A good partition key should be a well-scattered value (it should not be a sequence number, like an RDBMS primary key in legacy systems).
2) If the table has both a partition key and a sort key (also called the RANGE key), then the combination of the two needs to be unique. It is a kind of composite key in RDBMS terms.
However, the usage differs in a DynamoDB table. DynamoDB doesn't have sorting functionality (i.e. an ORDER BY clause) across partition keys. For example, if you have 10 items with the same partition key value and different sort key values, then you can sort the result by the sort key attribute. You can't sort on any other attribute, including the partition key.
All sort key values of a partition key will be maintained in the same partition for better performance (i.e. physically co-located).
LSI - A table can have up to five LSIs. They must be defined when you create the table. An LSI is a kind of alternate sort key for the table.
GSI - In order to understand GSI, you need to understand the difference between SCAN and QUERY API in DynamoDB.
SCAN - is used when you don't know the partition key (i.e. full table scan to get the item)
QUERY - is used when you know the partition key (i.e. sort key is optional)
Because DynamoDB cost is based on read/write capacity units, and for better performance, Scan is not the best option for most use cases. So there is the option of creating a GSI with an alternate partition key based on the Query Access Pattern (QAP).
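The difference between Scan and Query, and why an index keyed on another attribute helps, can be illustrated with a toy in-memory model. This is only a sketch and not DynamoDB itself: `ToyTable`, the `orders` data, and the `status` attribute are hypothetical, and the second table is written by hand where a real GSI would be maintained by the service.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: items grouped by partition key, sorted by sort key within each group."""

    def __init__(self):
        self.partitions = defaultdict(dict)  # pk -> {sk: item}

    def put(self, pk, sk, **attrs):
        self.partitions[pk][sk] = {"pk": pk, "sk": sk, **attrs}

    def query(self, pk):
        """Read only one partition; returns (items, items_read)."""
        partition = self.partitions.get(pk, {})
        items = [partition[sk] for sk in sorted(partition)]
        return items, len(items)

    def scan(self, predicate):
        """Read every item in every partition, then filter; returns (items, items_read)."""
        read, matches = 0, []
        for partition in self.partitions.values():
            for item in partition.values():
                read += 1
                if predicate(item):
                    matches.append(item)
        return matches, read


orders = ToyTable()            # base table keyed on customer
orders_by_status = ToyTable()  # plays the role of a GSI keyed on "status"
for customer, order_id, status in [
    ("c1", "o1", "OPEN"), ("c1", "o2", "SHIPPED"),
    ("c2", "o3", "OPEN"), ("c3", "o4", "SHIPPED"),
]:
    orders.put(customer, order_id, status=status)
    orders_by_status.put(status, order_id, customer=customer)

# Without the index: every item is read and most are discarded.
_, scan_reads = orders.scan(lambda item: item["status"] == "OPEN")
# With the index: only the matching partition is read.
open_orders, query_reads = orders_by_status.query("OPEN")
print(scan_reads, query_reads)
```

In this model the scan reads all four items to find two matches, while the query on the index reads only the two it returns.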
GSI Example
I have a DynamoDB table.
It has Primary partition key - IdType (String) and Primary sort key - Id (String)
As it's a hash-range schema, IdType is not unique and one value can appear multiple times. I need to find all the unique IdType values.
How do we find that? One possible solution is to get all IdType values using Scan, process them client-side, and find the unique ones in our own code. But Scan is expensive, and a Scan returns at most 1 MB of data per call, so it does not seem feasible: the table already holds more than 1 MB of data and will keep growing.
Is there any other way to do this? Any help would be appreciated.
PS: There are no indexes
The short answer is no. To query a DynamoDB table, the first thing you need is the hash key, which rules out every Query-based option.
As far as I know, DynamoDB does not have any built-in feature for finding the distinct values of a key.
If you want to achieve this, you can do one of the following:
1) Scan the table as you mentioned and filter at the application level.
2) If your data is not updated frequently, store the data in a cache and retrieve the desired information from there.
3) Use another AWS service, CloudSearch, to achieve the desired result (at extra cost).
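Option 1 is not blocked by the 1 MB limit. Each Scan call returns at most 1 MB, but the response includes a `LastEvaluatedKey` that can be passed back as `ExclusiveStartKey` to fetch the next page. The sketch below assumes a boto3-style `Table` object with a `scan` method; the function name and the `PagedFakeTable` stand-in (which pages by a made-up `offset` key instead of a real primary key) are hypothetical and only there so the example runs without AWS.

```python
class PagedFakeTable:
    """Stand-in for a boto3 Table that returns two items per Scan call."""

    def __init__(self, items, page_size=2):
        self.items, self.page_size = items, page_size

    def scan(self, ExclusiveStartKey=None, **kwargs):
        start = 0 if ExclusiveStartKey is None else ExclusiveStartKey["offset"]
        end = start + self.page_size
        page = {"Items": self.items[start:end]}
        if end < len(self.items):
            page["LastEvaluatedKey"] = {"offset": end}  # more pages remain
        return page


def distinct_id_types(table):
    """Collect the set of IdType values by scanning every page of the table."""
    seen = set()
    kwargs = {"ProjectionExpression": "IdType"}  # return only the attribute we need
    while True:
        page = table.scan(**kwargs)
        seen.update(item["IdType"] for item in page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return seen
        # Resume the scan where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key


table = PagedFakeTable([
    {"IdType": "email", "Id": "1"}, {"IdType": "phone", "Id": "2"},
    {"IdType": "email", "Id": "3"}, {"IdType": "ssn", "Id": "4"},
    {"IdType": "phone", "Id": "5"},
])
print(sorted(distinct_id_types(table)))
```

The scan still reads every item, so the cost concern in the question remains; pagination only removes the size ceiling.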
If you manage to achieve this with another method, please share it.
Hope that helps
I think I understand the concept of not having hot hashKeys so that you use all the partitions in provisioning throughput. But do UUID hashKeys do a better job of distributing across the partitions than numerically sequenced ones? In both cases is a hashcode generated from the key and that value used to assign to a partition? If so, how do the hashcodes from two strings like: "100444" and "100445" differ? Are they close?
"100444" and "100445" are not any more likely to be in the same partition than a completely different number, like "12345" for example. Think of a DynamoDB table as a big hash table, where the hash key of the table is the key into the hash table. The underlying hash table is organized by the hash of the key, not by the key itself. You'll find that numbers and strings (UUIDs) both distribute fine in DynamoDB in terms of their distribution across partitions.
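The effect can be illustrated with any well-mixing hash. DynamoDB's internal hash function is not public, so the sketch below substitutes SHA-256 and a made-up count of 8 partitions purely as assumptions; `toy_partition` and `spread` are hypothetical helper names.

```python
import hashlib
from collections import Counter


def toy_partition(key, partitions=8):
    """Map a key to a partition number via SHA-256 (an assumption for illustration only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def spread(keys, partitions=8):
    """Count how many keys land in each toy partition."""
    return Counter(toy_partition(key, partitions) for key in keys)


# Neighbouring keys get unrelated digests, so they are no more likely to share a partition.
for key in ["100444", "100445", "12345"]:
    print(key, hashlib.sha256(key.encode("utf-8")).hexdigest()[:12], toy_partition(key))

# 8000 strictly sequential keys still spread roughly evenly over the 8 toy partitions.
sequential = spread(str(n) for n in range(100000, 108000))
print(sorted(sequential.items()))
```

The point is only that the placement follows the hash of the key, not the key's numeric order.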
UUIDs are useful in DynamoDB because sequential numbers are difficult to generate in a scalable way for primary keys. Random numbers work well for primary keys, but sequential values are hard to generate without gaps and in a way that scales to the level of throughput that you can provision in a DynamoDB table. When you insert new items into a DynamoDB table, you can use conditional writes to ensure an item doesn't already exist with that primary key value.
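The conditional write mentioned above can be sketched as follows, assuming a boto3-style `Table` whose `put_item` accepts `ConditionExpression` and `ExpressionAttributeNames`. The attribute name `id`, the function name, and the fake classes are hypothetical; the fakes only imitate the shape of the error that boto3 raises, so the example runs without AWS.

```python
import uuid


class FakeConditionalFailure(Exception):
    """Mimics the shape of botocore's ClientError for a failed condition."""
    response = {"Error": {"Code": "ConditionalCheckFailedException"}}


class FakeTable:
    """Stand-in for a boto3 Table that enforces the not-exists condition in memory."""

    def __init__(self):
        self.items = {}

    def put_item(self, Item, ConditionExpression, ExpressionAttributeNames):
        key = Item[ExpressionAttributeNames["#k"]]
        if key in self.items:
            raise FakeConditionalFailure()
        self.items[key] = Item


def create_if_absent(table, item, key_attr="id"):
    """Insert `item` only if no item with the same key exists; return True on success."""
    try:
        table.put_item(
            Item=item,
            # The write is rejected if an item with this key already exists.
            ConditionExpression="attribute_not_exists(#k)",
            ExpressionAttributeNames={"#k": key_attr},
        )
        return True
    except Exception as exc:
        # boto3 raises botocore.exceptions.ClientError, which carries a `response` dict.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return False
        raise


table = FakeTable()
new_id = str(uuid.uuid4())
first = create_if_absent(table, {"id": new_id, "name": "first"})
second = create_if_absent(table, {"id": new_id, "name": "second"})
print(first, second)  # True False
```

In code that imports botocore, catching `ClientError` directly would be the more usual choice; the generic `except` here only keeps the sketch free of third-party imports.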
(Note: this question is also cross-posted in this AWS Forums post and discussed there as well).