Using Cosmos DB how do I query just on the partition key

We have a group of related documents that all share the same partition key. The thinking is that grouping them should simply be a case of querying on the partition key and stitching the results together. What am I missing?
So
Select * from c where c.CustomerId = "500"
would return, say, 3 documents (Address, Sales and Invoices), which all have a property named CustomerId with a value of 500.
I appreciate it's not the primary key, and I am purposely omitting a row key.
Perhaps not splitting the documents is the answer, but the different documents have different TTLs and this would then become problematic, wouldn't it?
CustomerId is the partition key.
The MS docs say this is possible (citing a city = seattle example, where their partition key is city).
So what am I missing? A complete misunderstanding of querying in Cosmos? (I know a partition key is used to break up related data into partitions.) I didn't know this made it an unqueryable aspect.
Also, I can query with partition key and row key with no problem.
EDIT 2:
This works:
SELECT * FROM c WHERE c.CustomerId > "499" AND c.CustomerId < "501"

OK, so the range query working was a bit of a lead.
Custom indexing on the collection was causing issues.
For now I have removed the custom indexing entirely; I will build it back up and then post a more specific answer.
What I did read was that the partition key is implicitly indexed anyway. There was ALSO an explicit index on it, so maybe this was causing the odd behaviour.
Indexing Policies CosmosDB

Maybe I'm not getting it at all, but you have to be explicit about the type of the value you are looking for. I think these are not the same:
c.CustomerId = "500"
VS
c.CustomerId = 500
because one looks for text and the other for a number. Review how your data is stored; the type has to be the same if you want to query on that value (keeping in mind that CustomerId is the partition key).
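To make the string-versus-number point concrete, here is a minimal plain-Python illustration. It is not Cosmos DB code: the documents and the filter_by_customer helper are made up, and it assumes (as the answer suggests) that the equality comparison does not convert between text and numbers.

```python
# Hypothetical in-memory documents: CustomerId is stored as text in one
# document and as a number in the other.
docs = [
    {"id": "address-1", "CustomerId": "500"},
    {"id": "sales-1", "CustomerId": 500},
]


def filter_by_customer(documents, value):
    """Keep documents whose CustomerId matches `value` in both type and value."""
    return [
        d for d in documents
        if type(d["CustomerId"]) is type(value) and d["CustomerId"] == value
    ]


# The literal "500" only matches the document that stored text,
# and the literal 500 only matches the one that stored a number.
string_matches = filter_by_customer(docs, "500")
number_matches = filter_by_customer(docs, 500)
```

If a query on the partition key unexpectedly returns nothing, checking the stored type of the property is a cheap first step.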

Related

Choosing Primary key for DynamoDB

A bit of context: I am trying to build an inventory to list my AWS resources in various accounts and I am planning to use DynamoDB to store the data. These will be the columns for my table: ResourceARN, ResourceName, ResourceType, StandardTag, IsDeleted, LastUpdateTime and ResourceCreationDate ( this field is available only for a few resource types like Ec2).
Question: I want to query my DDB table using account ID, resource type and tag name. I am stumped on choosing the primary key for the table. Since the primary key has to be unique and I need a 1:many relationship, I cannot use a combination of resourceType and account ID. Nor can I use resourceArn as my primary key, since it is a 1:1 relationship. Using the resourceARN as the sort key does not make sense to me either. I understand that I can use a simple scan operation, but that is very costly and will take more time as I add more data to my DDB.
I would appreciate any suggestions or guidance over the same.

Short answer
Partition key: Account ID
Sort key: <resource type>/<resource ID>
Rationale
It's a common pattern for a sort key to be a string concatenating multiple attributes. Since sort keys can be queried by prefix, you can leverage this in your queries:
Get all account resources: query all sort keys on the Account ID partition key
Get all EC2 instances of an account: query with partition key = <your account ID> and sort key begins_with('ec2-instance').
You may notice that ARNs follow such a hierarchy as well (which is probably not a coincidence). This would effectively be using a subset of the ARN as the sort key.
Some notes:
DynamoDB is about attributes as much as about columns. You don't need to include ResourceCreationDate in the records which don't have it, and doing so will save you space (see next point).
Attribute names count as storage for every record, which impacts cost and also throughput. It's common to use shorthand for names for this reason (rct instead of ResourceCreationTime for example).
You can use LSIs (Local Secondary Indexes) to order by creation and update times if you need this.
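As a rough sketch of the key design above, the following builds the composite sort key and the parameters for a DynamoDB Query request. The table name Inventory and the attribute names AccountId and ResourceKey are assumptions made for illustration; the dictionaries follow the low-level Query request shape and would still need to be passed to a DynamoDB client.

```python
def sort_key(resource_type, resource_id):
    """Build the composite sort key: <resource type>/<resource ID>."""
    return f"{resource_type}/{resource_id}"


def query_params(account_id, type_prefix=None, table="Inventory"):
    """Build Query parameters for one account, optionally narrowed by type.

    `Inventory`, `AccountId` and `ResourceKey` are hypothetical names.
    """
    params = {
        "TableName": table,
        "KeyConditionExpression": "AccountId = :account",
        "ExpressionAttributeValues": {":account": {"S": account_id}},
    }
    if type_prefix is not None:
        # Sort keys can be matched by prefix with begins_with.
        params["KeyConditionExpression"] += " AND begins_with(ResourceKey, :prefix)"
        params["ExpressionAttributeValues"][":prefix"] = {"S": type_prefix}
    return params


all_resources = query_params("123456789012")
ec2_only = query_params("123456789012", type_prefix="ec2-instance/")
```

Including the trailing slash in the prefix avoids accidentally matching another resource type whose name merely starts with the same characters.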

DynamoDB NOT EQUALS on GSI sort key

As the title suggests, I'm in a situation where I need to fetch all records from a DynamoDB table GSI, given that I know the hash key and I know the sort key that I want to avoid.
The table looks like this:
Id - Primary Key,
AId - GSI hash key,
BId - GSI sort key
I need an efficient query to get records by a query like this
AId = 1 and BId != 2.
DynamoDB doesn't support <> operator when querying on hash and sort keys, it's only present on filter expressions, but those are not allowed on any of the primary key fields either.
So what would be the solution here? Scanning is probably not a good idea, unless it would be possible to scan on a partition, but that doesn't seem to be supported either.
So the only solution that is obvious to me at this point is querying by the partition key and then filtering it out client side.

Assuming that your sort key is actually numeric, as shown in your example, your best option would be to issue two separate queries:
AId = 1 and BId < 2
AId = 1 and BId > 2
Actually, as I write this, I think it would work regardless of the type of the sort key.
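The two-query approach can be sketched as a helper that builds both sets of Query parameters. The table name MyTable and the index name AId-BId-index are hypothetical, and the sketch assumes numeric keys as in the example; the results of the two queries would be concatenated by the caller.

```python
def not_equal_queries(a_id, excluded_b_id, table="MyTable", index="AId-BId-index"):
    """Return parameters for two Query calls that together cover BId != excluded_b_id.

    Table and index names are hypothetical; keys are assumed to be numbers.
    """
    values = {
        ":a": {"N": str(a_id)},
        ":b": {"N": str(excluded_b_id)},
    }
    return [
        {
            "TableName": table,
            "IndexName": index,
            # One query for everything below the excluded key, one for everything above.
            "KeyConditionExpression": f"AId = :a AND BId {op} :b",
            "ExpressionAttributeValues": values,
        }
        for op in ("<", ">")
    ]


below, above = not_equal_queries(1, 2)
```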

DynamoDB how to search for a list of values

I have a DynamoDB instance with a partition key and sort key. Let's say that they are organisation (hash key) and employee id (sort key).
I want to retrieve all employees whose IDs are in a list. They all work for the same organisation, but they are not all of the employees of that organisation.
In SQL I'd do something like:
select * from table where organisation_id = 'org' and employee_id in [list of ids]
There does not seem to be an equivalent in DynamoDB.
My choices seem to be:
1) Iterate over all employee IDs using a Query OR
2) Use BatchGetItems and provide organisation_id:employee_id for all items
The first seems like it will be slower as it involves multiple requests while the second is a single request but may consume more RCUs.
Which of these is the preferred solution to this problem? Or am I missing a better third way?

I would iterate your list using GetItem, adding each employee found to a collection. This approach isn't slow - DynamoDB is designed specifically for getting lots of items fast using their keys.
There is no need to use Query as you have both the partition key and range key. You would only use a Query if say you wanted all employees of one organisation.
If your list is particularly large you could use BatchGetItem, which can retrieve items in parallel and therefore reduce latency. You won't find much of a difference, though, unless you have a lot of items to get.
By the way, DynamoDB does have an 'IN' operator, but you can't use it in KeyConditions.
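If BatchGetItem is used, the key list has to be split up, because a single BatchGetItem request accepts at most 100 keys. A minimal sketch of that chunking follows; the table name Employees and the attribute names organisation_id and employee_id are assumptions, and both keys are assumed to be strings.

```python
def batch_get_requests(organisation_id, employee_ids, table="Employees", batch_size=100):
    """Split a list of employee IDs into BatchGetItem request payloads.

    `Employees`, `organisation_id` and `employee_id` are hypothetical names.
    """
    keys = [
        {
            "organisation_id": {"S": organisation_id},
            "employee_id": {"S": employee_id},
        }
        for employee_id in employee_ids
    ]
    return [
        {"RequestItems": {table: {"Keys": keys[i:i + batch_size]}}}
        for i in range(0, len(keys), batch_size)
    ]


requests = batch_get_requests("org", [f"emp-{n}" for n in range(250)])
```

Each response can also contain UnprocessedKeys, which the caller needs to retry; that loop is left out here.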

How to fetch multiple rows from DynamoDB using a non primary key

select * from tableName where columnName="value";
How can I fetch a similar result in DynamoDB using java, without using primary key as my attribute (Need to group data based on a value for a particular column).
I have gone through articles regarding getbatchitems, QuerySpec but all these require me to pass the primary key.
Can someone give a lead here?

Short answer is you can't. Whenever you use the Query or GetItem operations in DynamoDB you must always supply the table or index primary key.
You have two options:
Perform a Scan operation on the table and filter by columnName="value". However this requires DynamoDB to look at every item in the table so it is likely to be slow and expensive.
Add a Global Secondary Index to your table. This will require you to define a primary key for the index that contains the columnName you want to query.
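For the second option, a Query against the index looks much like a Query against the table, with the index name added. The sketch below only builds the request parameters; the index name columnName-index is hypothetical, and it assumes the index uses columnName as its partition key with string values.

```python
def gsi_query_params(table, value, index="columnName-index"):
    """Build Query parameters that look items up through a Global Secondary Index.

    The index name is hypothetical; `columnName` is assumed to be the
    index partition key and to hold strings.
    """
    return {
        "TableName": table,
        "IndexName": index,
        "KeyConditionExpression": "#col = :value",
        # A placeholder keeps the expression valid even if the attribute
        # name collides with a reserved word.
        "ExpressionAttributeNames": {"#col": "columnName"},
        "ExpressionAttributeValues": {":value": {"S": value}},
    }


params = gsi_query_params("tableName", "value")
```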

Encode PartitionKey into Document Id?

I have set the partition key of one of my Cosmos DBs to /partition.
For example: we have a Chat document that contains a list of Subscribers, and then we have ChatMessages that contain a text, a reference to the author and some other properties. Both documents have a partition property that contains the type 'chat' and the chat's id.
Chat example:
{
"id" : "955f3eca-d28d-4f83-976a-f5ff26d0cf2c",
"name" : "SO questions",
"isChat" : true,
"partition" : "chat_955f3eca-d28d-4f83-976a-f5ff26d0cf2c",
"subscribers" : [
...
]
}
We then have Message documents like this:
{
"id" : "4d1c7b8c-bf89-47e0-83e1-a8cf0d71ce5a",
"authorId" : "some guid",
"isMessage" : true,
"partition" : "chat_955f3eca-d28d-4f83-976a-f5ff26d0cf2c",
"text" : "What should I do?"
}
It is now very convenient to return all messages for a specific chat, I just need to query all documents of the partition chat_955f3eca-d28d-4f83-976a-f5ff26d0cf2c with the property isMessage = true. All good...
But if I now want to query my db for a specific message by id, I usually just know the id but not the partition, and therefore have to run a slow cross-partition query. That led me to the question of whether I should add the partition key to the message id, so I can split the id when querying the db for a faster lookup. I saw that the _rid property of a document looks like a combination of the id of the db, the id of the collection, and then a document-specific id. What I mean by this is (simplified):
Chat.Id = "abc"
Chat.Partition = "chat_abc" //[type]_[chatId]
Message.Id = "chat_abc|123" //[Chat.Partition]|[Message.Id]
Message.Partition = chat_abc //[Chat.Partition]
Let's assume that I now want to get the Message document by its id: I just split the id at the | symbol and then query the document with the first part of the id as the partition and the full id as the key.
Does that make sense? Are there better ways to do this? Should I just always pass the partition key of a document along as well, not just its id? Should I just use the _rid properties instead?
Any experience is highly appreciated!
UPDATE
I have found the following answer here:
Some applications encode partition key as part of the ID, e.g.
partition key would be customer ID, and ID = "customer_id.order_id",
so you can extract the partition key from the ID value.
I have also asked the Cosmos team by email whether this is a recommended pattern and will post an answer if I get a reply.

Yes, your proposal to extract partition key from id (via a convention like a prefix/delimiter) makes sense. This is common among applications that have a single key and want to refactor it to use Cosmos DB from a different storage system.
If you're building your application from scratch, you should consider wiring the composite key (partition key + item key ("id")) through your API/application.
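The prefix/delimiter convention from the question can be sketched as a pair of helpers. The function names are made up for illustration, and the sketch assumes the | separator never appears inside the partition key itself.

```python
SEPARATOR = "|"  # delimiter used in the question's example ids


def make_message_id(partition_key, local_id):
    """Build an id of the form [partition]|[local id], e.g. "chat_abc|123"."""
    if SEPARATOR in partition_key:
        raise ValueError("partition key must not contain the separator")
    return f"{partition_key}{SEPARATOR}{local_id}"


def split_message_id(message_id):
    """Recover (partition_key, local_id) from an id built by make_message_id."""
    partition_key, found, local_id = message_id.partition(SEPARATOR)
    if not found or not partition_key or not local_id:
        raise ValueError(f"not a composite id: {message_id!r}")
    return partition_key, local_id


message_id = make_message_id("chat_abc", "123")
partition_key, local_id = split_message_id(message_id)
```

A lookup would then use partition_key as the partition and the full message_id as the document id, which avoids the cross-partition query.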

First, if you know your data (and index) size will remain within the 10 GB limit and your RU/sec limit is OK, then a fixed partition-less collection will bypass this problem. The OP has probably knowingly decided that partitioning is required, but it is an important consideration to note for generalization purposes. If possible, KISS ;)
If partitioning is a must, then AFAIK you cannot avoid crosspartition split and its overhead unless you know the partition key.
Imho the OP suggestion of merging the duplicated partition key into id field is a rather ugly solution, because:
The name id implies it is a unique key; the partition key is not part of it, nor necessary for its uniqueness. Anyone using this key upstream would incur the forced excess cost of a longer key, be blocked from using the simpler Guid type, etc.
It will become a mess should your partitioning key change in future.
The internal structure of the merged id would not be intuitive without documentation: its parts are not named, and even if they appear to follow a pattern, new devs would not know for sure what's going on without finding external documentation.
Your data model does not require this duplication on semantic level, it would be for your application querying comfort and hence such hacks should belong to your application code, not data model. Such leaking concerns should be avoided if possible.
Data duplication within document would unnecessarily increase document size, bandwidth, etc. (may or may not be notable, depending on scale and usage). in-document duplication is necessary at times, but imho not necessarily in this case.
A better design would be to ensure the partition key is always present in the logic context and can be passed along to lookups. If you don't have it available, then maybe you should refactor your application code (not the data design) to explicitly pass the chatId around along with the id where needed. That is, WITHOUT merging them together into some opaque string format.
Also, I don't see a good way to use _rid for this as if I remember correctly, it did not contain any internal reference to a partition or partition key.
Disclaimer: I don't have any access or deep insight into internal CosmosDB index design or _rid logic on partitioned collections. I may have misunderstood how it works.
