Indexing synthetic partition key - azure-cosmosdb

I have a Cosmos container using a synthetic partition key /pk, where the value is only used for partitioning and never used in a query clause. Is there any reason not to exclude this path from indexing, given that the value is provided as the partition key with all operations? It seems like it should be excluded from indexing by definition, but I'm not sure whether partitioning and indexing somehow interact.

You should always include the partition key in your queries, because queries without the partition key in the WHERE clause are fan-out queries, which you really should try to avoid.
It is also recommended to keep the partition key in the index. Quite a few people are confused by this. We are soon going to start indexing it automatically, even if you exclude it, so people don't suffer from not indexing it.
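
As a rough illustration of the advice above, the sketch below builds an indexing policy that drops any exclusion aimed at the partition key path. The dict follows the general shape of a Cosmos DB indexing policy; the helper name build_indexing_policy and the path /largePayload/* are hypothetical and not from the original answer.

```python
def build_indexing_policy(partition_key_path, excluded_paths):
    """Return an indexing policy dict that never excludes the partition key path."""
    pk_path = partition_key_path.rstrip("/")
    kept = [
        path for path in excluded_paths
        # Strip the trailing "/?" or "/*" wildcard before comparing with the key path.
        if path.rstrip("/?*") != pk_path
    ]
    return {
        "indexingMode": "consistent",
        "includedPaths": [{"path": "/*"}],
        "excludedPaths": [{"path": path} for path in kept],
    }


# "/pk/?" is silently dropped from the exclusions; the hypothetical large field stays excluded.
policy = build_indexing_policy("/pk", ["/pk/?", "/largePayload/*"])
```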

Related

How to select a partition key for a Graph database in Azure CosmosDB

I am working with Azure CosmosDB, and more specifically with the Gremlin API, and I am a little bit stuck as to what to select as a partition key.
Indeed, since I'm using graph data, not all vertices follow the same data schema. If I select a property that not all vertices have in common, Azure won't let me store vertices which don't have a value for the partition key. The problem is, the only property they all have in common is /id, but Azure doesn't allow for this property to be used as a partition key.
Does that mean I need to create a property that all my vertices will have in common? Doesn't that defeat the purpose of graph data a little? Or is there something I'm missing?
For example, in my case, I want to model an object and its parts. Each object and each part has a property /identificationNumber. Would it be better to use this property as the partition key, or to create a new property /partitionKey dedicated to partitioning? My concern is that, if I select /identificationNumber as the partition key and my data model has to evolve in the future to include new objects without an /identificationNumber, I will have to artificially add this property to those objects in the data model, which might lead to some confusion.
Creating a dedicated property to use as a synthetic partition key is a good practice if there isn't an obvious existing property to use. This approach can mitigate cases where you don't have an /identificationNumber in some objects, since you can assign some other value as the partitionKey in those cases. This also allows flexibility around refactoring /identificationNumber in the future, since partitionKey is what needs to be unchanging.
We shouldn't be concerned about an "artificial property" because this is inherent with using a partitioned database. It doesn't need to be exposed to users, but devs need to understand Cosmos is somewhat different than traditional DBs. It's also possible to migrate to a new partition key by copying all data to a new container, in the worst case of regret down the road. It's probably best to start working on the project with a best guess and seeing how things work, and perhaps iterating on different ideas to compare performance etc.
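
A minimal sketch of that idea, assuming each vertex is a plain dict: the dedicated partitionKey property takes /identificationNumber when it exists and falls back to another stable value otherwise. The fallback rule (label plus id) and the function names are assumptions for illustration only.

```python
def synthetic_partition_key(vertex):
    """Pick a partition key value for a vertex (hypothetical rule)."""
    # Prefer the natural identifier when the vertex has one.
    ident = vertex.get("identificationNumber")
    if ident:
        return str(ident)
    # Otherwise fall back to some other stable value; here, label plus id.
    return f"{vertex['label']}-{vertex['id']}"


def with_partition_key(vertex):
    """Return a copy of the vertex with the dedicated partitionKey property set."""
    return {**vertex, "partitionKey": synthetic_partition_key(vertex)}


part = with_partition_key({"id": "v1", "label": "part", "identificationNumber": "A-100"})
note = with_partition_key({"id": "v2", "label": "note"})
```

Whatever rule is chosen, the value has to stay fixed for the life of the vertex, which is why it is computed once at write time.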

Scan Vs BatchGetItems in Dynamo-db

If I know the primary keys of the items, which approach is best?
Scan with FilterExpression with IN Operator
BatchGetItem with all keys in request parameter
Please recommend the solution in terms of both latency and partitions impact.
Probably neither. Of course it all depends on the key schema and the data in the table, but you probably want to create a Global Secondary Index for your most frequently used queries.
Having said that, performing scans is highly discouraged, especially when working with large volumes of data. So if you know the primary keys of the items you're interested in, go for BatchGetItem over doing a scan.
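
A small sketch of preparing BatchGetItem calls when the keys are known. A single BatchGetItem request accepts at most 100 keys, so the keys are chunked first. The table name "Items" and the numeric hash key attribute "Id" are assumptions; the dicts use the low-level attribute-value format.

```python
BATCH_GET_LIMIT = 100  # BatchGetItem accepts at most 100 keys per request


def batch_get_requests(table_name, ids):
    """Yield BatchGetItem request parameters, at most 100 keys at a time."""
    for start in range(0, len(ids), BATCH_GET_LIMIT):
        chunk = ids[start:start + BATCH_GET_LIMIT]
        yield {
            "RequestItems": {
                # Low-level format: numbers are sent as strings under "N".
                table_name: {"Keys": [{"Id": {"N": str(i)}} for i in chunk]}
            }
        }


requests = list(batch_get_requests("Items", list(range(250))))
```

Each dict could then be passed to a boto3 client as client.batch_get_item(**params); the caller would still need to retry anything returned under UnprocessedKeys.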

Maintain unique value for DynamoDB partition key

I'm new to "DynamoDB" and want to know the best practice for maintaining unique partition key values when you add records to a table.
With my existing experience related to SQL, primary keys are normally maintained by the system with identity columns or via a trigger. I've searched through various forums and "AWS" documentation, but did not find any specifics. Do you manually determine the existence of partition key value or am I missing something obvious?
In DynamoDB, querying flexibility is limited compared to SQL. So the schema, as well as the partition key / sort key, should be designed to make the most common and important queries as fast as possible. You can find some generic best practices here:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/best-practices.html
https://aws.amazon.com/blogs/database/choosing-the-right-dynamodb-partition-key/
If you can provide more context on the use case for which you are trying to use DynamoDB, you should get a more pointed answer.

Retrieve all items with a column beginning with specified text on DynamoDB

I have a table in DynamoDB:
Id: int, hash key
Name: string
(there are many more columns, but I omitted them)
Typically I just pull out and update items by their Id, and this schema works fine for that.
However, one of the requirements is to have an auto-completing drop down box based on the name. I want to be able to query all items in this DynamoDB table for Name columns starting with a query string.
The SQL way of solving this would be to just add an index on Name and write a query like SELECT Id FROM table WHERE Name LIKE 'query%', but I can't figure out a DynamoDB-friendly way of doing this.
I have considered a few ways to solve this:
Scan the table. This is the easiest option, but least efficient. There's a bit more data in this table than I would be comfortable frequently scanning.
Scan + cache it in memory. But then I have to worry about cache invalidation etc.
Make Name a range key, which supports a begins_with function on the query. However, I'd still have to Scan the table since I want to retrieve results for every single hash key, so this doesn't really work.
Make a global secondary index and query it only with the range key. This also doesn't appear to be possible. I could have a column with a static value and use that as the hash key for the GSI, but that seems like a really ugly hack.
Use a full text search engine like CloudSearch, but this seems like massive overkill for my use case.
Is there a simple solution to this issue?
The use case you described is not directly supported by DynamoDB's Query operation today: DynamoDB typically requires you to specify a hash key and then query on the range key accordingly.
However, there is a popular scatter-gather technique that is commonly used for use cases such as yours. In this case, you would add an attribute bucket_id and create a global secondary index with bucket_id as the hash key and Name as the range key.
The bucket_id refers to a fixed range of IDs or numbers, with enough cardinality to ensure your global secondary index is well-distributed. For instance, bucket_id could range from 0 to 99. Then when updating your base table, whenever a new entry is added, a random bucket_id between 0 and 99 is assigned to it.
During your autocomplete query, the application would send 100 separate queries (scatter) for each bucket_id value (0 to 99) and use BEGINS_WITH on the range key Name. After the results are retrieved, the application would have to combine the 100 sets of responses and re-sort as necessary (gather).
The above process may seem a bit cumbersome, but it allows your system/table to scale well by ensuring the load is evenly distributed over a fixed key range. You can increase the bucket_id range as appropriate. To save cost, you can choose to project KEYS_ONLY onto your global secondary index, so cost of querying is minimized.
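
The scatter-gather flow can be sketched with an in-memory stand-in for the global secondary index. This is only a simulation of the technique under the assumptions above (100 buckets, random assignment on write); the class and method names are made up, and a real implementation would issue one DynamoDB Query per bucket instead of filtering a list.

```python
import random
from collections import defaultdict

NUM_BUCKETS = 100  # bucket_id ranges from 0 to 99


class BucketedNameIndex:
    """In-memory stand-in for a GSI keyed by (bucket_id, Name)."""

    def __init__(self, rng=None):
        self._rng = rng or random.Random()
        self._buckets = defaultdict(list)  # bucket_id -> [(Name, Id), ...]

    def put(self, item_id, name):
        # Assign a random bucket on write so entries spread evenly over the index.
        bucket_id = self._rng.randrange(NUM_BUCKETS)
        self._buckets[bucket_id].append((name, item_id))

    def query_bucket(self, bucket_id, prefix):
        # Stands in for one Query: bucket_id = X AND begins_with(Name, prefix).
        return [(n, i) for n, i in self._buckets[bucket_id] if n.startswith(prefix)]

    def autocomplete(self, prefix):
        results = []
        for bucket_id in range(NUM_BUCKETS):  # scatter: one query per bucket
            results.extend(self.query_bucket(bucket_id, prefix))
        return sorted(results)  # gather: merge and re-sort


index = BucketedNameIndex(random.Random(0))
for item_id, name in enumerate(["alice", "alan", "bob", "albert"]):
    index.put(item_id, name)
matches = index.autocomplete("al")
```

In practice the 100 per-bucket queries would normally be sent concurrently rather than in a loop, since they are independent of each other.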
The problem is that DynamoDB is essentially a key-value store with support for operations against a single key, and you are trying to search across all values, which doesn't work well. The "simplest" solution to this is to have a known hash key, so that you can Query it directly and specify conditions.
For example, you could query with hash_key='name_search' and range_key=begins_with(myText) or other_key=begins_with(myText) and get the use case you are describing. This will work fine for small sets of data that do not require a large amount of provisioned RCUs.
The problem is that this does not scale, because you are not following any of the DynamoDB best practices (in fact, this is an anti-pattern). Take a look at the Understand Partition Behavior documentation.
My suggestion would be to use a different service/solution to accomplish this rather than trying to squeeze DynamoDB into this use case.

Query a range of primary keys in dynamodb

I want to make sure I get this right.
Based on what I've read so far, you can NOT query a range of primary keys in DynamoDB.
For example, if you have a primary key which is a number, like the phone numbers of your customers, you cannot get items with primary keys larger than 3010000000 or between 3010000000 and 3020000000.
To make it clear, I am not talking about the range key; my question is about the primary key itself.
So if this is true, there are lots of use cases, like items between dates, users registered after some point, and so on, that require table scans.
Is this correct?
EDIT: OK, one solution that comes to mind would be to use only one dummy hash_key for the primary key and insert the real key (like the phone numbers above) as range keys. Does this work?
Yes, you cannot get a range of hash_key values with DynamoDB. But this does not mean you are stuck with your use case.
Let's take the 'dates' use case and say you are building a logging application. You are likely to get lots of records each day.
If you use the day as the hash_key, you can put the full timestamp as the range_key. This way, you can split your query into per-day chunks and get what you want.
Of course, to get optimal results, you will need to know your queries well. For example, what is the typical range? With DynamoDB, as with other key-value stores, you mostly model your data with the queries in mind, unlike SQL, where you model with only the data in mind.
Of course, if your items span a larger or shorter range, just adapt this system.
Putting everything under the same dummy hash_key sounds like a terrible idea, sorry. I am not a hundred percent sure how it really works, but I know DynamoDB shards data across so-called partitions, and I believe 1 hash_key <=> 1 partition. Moreover, if you read the documentation closely, you'll notice that the provisioned throughput is split evenly between the partitions, so each partition is only allocated a fraction of what you pay for.
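
A minimal sketch of the chunking step, assuming the hash_key is the day formatted as YYYY-MM-DD and the range_key is an ISO 8601 timestamp. Both formats and the function name are assumptions. Each returned tuple describes one Query: a single hash_key plus a BETWEEN condition on the range_key.

```python
from datetime import datetime, timedelta


def day_chunks(start, end):
    """Split [start, end] into per-day (hash_key, range_start, range_end) queries."""
    chunks = []
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    while day <= end:
        next_day = day + timedelta(days=1)
        # Clamp each chunk to the requested range and to the end of that day.
        lo = max(start, day)
        hi = min(end, next_day - timedelta(microseconds=1))
        chunks.append((day.strftime("%Y-%m-%d"), lo.isoformat(), hi.isoformat()))
        day = next_day
    return chunks


# A range spanning three calendar days becomes three single-key queries.
chunks = day_chunks(datetime(2024, 1, 30, 22, 0), datetime(2024, 2, 1, 3, 30))
```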
Without modifying the keys of your primary DynamoDB table, you can add a GSI with a constant partition key and your primary table's partition key as its sort key.
This will enable you to query on the index's sort key and use the resulting partition keys to get the data you're looking for.
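
As a sketch, the Query parameters for such an index might look like the following, in the low-level attribute-value format. The index name "all-by-phone", the constant attribute gsi_pk with value "ALL", and the sort key attribute phone are all hypothetical names chosen for illustration.

```python
def range_query_params(table_name, low, high):
    """Build Query parameters for a GSI with a constant partition key."""
    return {
        "TableName": table_name,
        "IndexName": "all-by-phone",  # hypothetical GSI name
        # Equality on the constant partition key, range condition on the sort key.
        "KeyConditionExpression": "gsi_pk = :pk AND phone BETWEEN :lo AND :hi",
        "ExpressionAttributeValues": {
            ":pk": {"S": "ALL"},  # the same constant value is written on every item
            ":lo": {"N": str(low)},
            ":hi": {"N": str(high)},
        },
    }


params = range_query_params("Customers", 3010000000, 3020000000)
```

The dict could be passed to a boto3 client as client.query(**params). Because every item shares one index partition key value, the concern raised above about concentrating throughput on a single partition applies to this index as well.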
