Cosmos DB composite index best practices? - azure-cosmosdb

I've got some pretty high Cosmos usage right now that I'd like to reduce, and I think the way to do that is through composite indices, but I'm a little confused about the best approach.
My actual queries get more complex than this, but let's say I have 2 queries that look like this:
SELECT TOP 100 * FROM c WHERE c.partitionkey=n AND c.data.subdata1="str1" ORDER BY c._ts DESC
SELECT TOP 100 * FROM c WHERE c.partitionkey=n AND c.data.subdata1="str1" AND c.data.subdata2="str2" ORDER BY c._ts DESC
If I create a composite index that looks like this, will it help? Should I create two separate indices, one for each query? Should I put the partitionkey into the composite index, even though I'll only ever be searching on a single partition?
"compositeIndexes":[
[
{
"path":"/data/subdata1",
"order":"ascending"
},
{
"path":"/_ts",
"order":"descending"
}
]
]

In Cosmos DB, composite indexes have a performance benefit for queries that have multiple filters, or both a filter and an ORDER BY clause. So in your case, I think the composite index will help.
Should I put the partitionkey into the composite index, even though
I'll only ever be searching on a single partition?
I believe that putting the partitionkey into the composite index will improve the performance of your query, even though you only search on a single partition.
The best practice is to test your query with different composite indexes in the Azure Cosmos DB Emulator and decide which one to use based on the Query Stats.

I think you should have 2 composite indexes
First one should have partitionkey, subdata1 and _ts
Second one should have partitionkey, subdata2 and _ts
If your data is too large and you don't want to re-index, I would suggest removing the ORDER BY at the database level and sorting in your code instead.
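As a rough sketch of the two-index suggestion above, the indexing policy could be assembled like this. The helper `composite_index` is a hypothetical name, the property paths are taken from the queries in the question, and whether the partition key path actually helps is something to verify against the query metrics rather than assume.

```python
import json

def composite_index(*paths):
    """Build one composite index from (path, order) pairs (hypothetical helper)."""
    return [{"path": path, "order": order} for path, order in paths]

# Assumption: the partition key property is literally named "partitionkey",
# as in the queries from the question.
indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [{"path": "/*"}],
    "compositeIndexes": [
        composite_index(("/partitionkey", "ascending"),
                        ("/data/subdata1", "ascending"),
                        ("/_ts", "descending")),
        composite_index(("/partitionkey", "ascending"),
                        ("/data/subdata2", "ascending"),
                        ("/_ts", "descending")),
    ],
}

# Print the policy as JSON so it can be pasted into the container settings.
print(json.dumps(indexing_policy, indent=2))
```

Comparing the request charge of each query before and after applying a policy like this is the only reliable way to tell which variant is worth keeping.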

Related

DynamoDB Best practice to select all items from a table with pagination (Without PK)

I simply want to get a paginated list of products back from my table. The pagination part is relatively clear with last_evaluated_key, but all the examples query on the PK or SK, and in my case I just want paginated results sorted by createdAt.
My product id (a unique UUID) is not very useful in this case. Is the only remaining solution to scan the whole table?
Yes, you will use Scan. DynamoDB has two types of read operation, Query and Scan. You can Query for one-and-only-one Partition Key (and optionally a range of Sort Key values if your table has a compound primary key). Everything else is a Scan.
A Scan operation reads every item, returning at most 1 MB per request, optionally filtered. Filters are applied after the read. Results are unsorted.
The SDKs have pagination helpers like paginateScan to make life easier.
Re: Cost. Ask yourself: "is Scan returning many MB of data I don't actually need?" If the answer is "No", you are fine. The more you are overfetching, however, the greater the cost benefit of Query over Scan.
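A minimal sketch of the pagination loop, assuming a callable shaped like boto3's `Table.scan` (it returns `Items` and, when more pages remain, `LastEvaluatedKey`, and accepts `ExclusiveStartKey`). The `fake_scan` function and its `offset` key are hypothetical stand-ins so the example runs without AWS. Because Scan results are unsorted, the sort by createdAt happens in application code.

```python
def scan_all(scan, **kwargs):
    """Collect every item by following LastEvaluatedKey across pages."""
    items = []
    while True:
        page = scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return items
        # Resume the next request where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key

# Hypothetical in-memory stand-in for Table.scan, two items per page.
_PRODUCTS = [
    {"id": "c", "createdAt": "2021-03-01"},
    {"id": "a", "createdAt": "2021-01-01"},
    {"id": "b", "createdAt": "2021-02-01"},
]

def fake_scan(ExclusiveStartKey=None, page_size=2):
    start = 0 if ExclusiveStartKey is None else ExclusiveStartKey["offset"]
    page = {"Items": _PRODUCTS[start:start + page_size]}
    if start + page_size < len(_PRODUCTS):
        page["LastEvaluatedKey"] = {"offset": start + page_size}
    return page

# Scan returns items in no particular order, so sort after collecting.
products = sorted(scan_all(fake_scan), key=lambda p: p["createdAt"])
print([p["id"] for p in products])
```

With a real table, `scan_all(table.scan)` would follow the same loop, but it reads the whole table on every call, so it only makes sense when the table is small or the result is cached.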

How to choose indexing keys, should we consider keys from GROUPBY also with WHERE keys or just with WHERE keys?

Assume document sample schema
{
"wherek1":"",
"pk":"",
"groubby1":"",
"groupby2":"",
"count": 0
}
Assume select query
SELECT SUM(f.count) as outCount FROM TEST f WHERE f.pk='testk' and f.wherek1='hi' GROUP BY f.groubby1, f.groubby2
For the above query, the fields that should be indexed are:
> pk/*
> wherek1/*
> groubby1/*
> groubby2/*
is my understanding correct?
Thanks
Indexing the GROUP BY properties will not yield any benefit. Only properties used in WHERE, ORDER BY or JOIN clauses benefit from being indexed.
One thing to keep in mind is that containers with high cardinality on properties used in group by can get expensive. If you plan on running queries like this frequently you may want to consider building a materialized view using Change Feed.
Some links that may be helpful.
Queries with Group By
Lenni Lobel's blog on Change Feed
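To illustrate the materialized-view idea, here is a minimal in-memory sketch of the aggregation a change feed consumer might maintain. It assumes each change carries the latest full version of the document, including its `id`; the class name `GroupSumView` is hypothetical, and deletes are not handled.

```python
from collections import defaultdict

class GroupSumView:
    """Running SUM(count) per (pk, wherek1, groubby1, groubby2)."""

    def __init__(self):
        self.sums = defaultdict(int)
        self._last = {}  # document id -> (group key, count) last applied

    def apply(self, doc):
        """Apply the latest version of a document to the view."""
        previous = self._last.get(doc["id"])
        if previous is not None:
            # Remove the contribution of the older version first.
            old_key, old_count = previous
            self.sums[old_key] -= old_count
        key = (doc["pk"], doc["wherek1"], doc["groubby1"], doc["groubby2"])
        self.sums[key] += doc["count"]
        self._last[doc["id"]] = (key, doc["count"])

view = GroupSumView()
base = {"pk": "testk", "wherek1": "hi", "groubby1": "a", "groubby2": "x"}
view.apply({"id": "1", "count": 2, **base})
view.apply({"id": "2", "count": 3, **base})
view.apply({"id": "1", "count": 5, **base})  # document 1 was updated
print(view.sums[("testk", "hi", "a", "x")])
```

In a real deployment the sums would be written back to a separate container, so the frequent query becomes a cheap point read instead of a GROUP BY over many documents.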
thanks.

DynamoDB how to search for a list of values

I have a DynamoDB instance with a partition key and sort key. Let's say that they are organisation (hash key) and employee id (sort key).
I want to retrieve all employees whose IDs are in a list. They all work for the same organisation, but they are not all of the employees of that organisation.
In SQL I'd do something like:
select * from table where organisation_id = 'org' and employee_id in [list of ids]
There does not seem to be an equivalent in DynamoDB.
My choices seem to be:
1) Iterate over all employee IDs using a Query OR
2) Use BatchGetItem and provide organisation_id:employee_id for all items
The first seems like it will be slower as it involves multiple requests while the second is a single request but may consume more RCUs.
Which of these is the preferred solution to this problem? Or am I missing a better third way?
I would iterate your list using GetItem, adding each employee found to a collection. This approach isn't slow - DynamoDB is designed specifically for getting lots of items fast using their keys.
There is no need to use Query as you have both the partition key and range key. You would only use a Query if say you wanted all employees of one organisation.
If your list is particularly large you could use BatchGetItem, which retrieves the items in parallel and therefore reduces latency. You won't find much of a difference, though, unless you have a lot of items to get.
By the way, DynamoDB does have an 'IN' operator, but you can't use it in KeyConditions.
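A small sketch of how the key list could be split for BatchGetItem, which accepts at most 100 keys per request. The table name `Employees`, the attribute names `organisation` and `employee_id`, and the string key types are assumptions for illustration; the function only builds `RequestItems` payloads and does not call AWS.

```python
def batch_get_requests(table_name, organisation_id, employee_ids, batch_size=100):
    """Yield RequestItems payloads, each holding at most batch_size keys."""
    for start in range(0, len(employee_ids), batch_size):
        chunk = employee_ids[start:start + batch_size]
        yield {
            table_name: {
                "Keys": [
                    # Assumed attribute names and string ("S") key types.
                    {"organisation": {"S": organisation_id},
                     "employee_id": {"S": employee_id}}
                    for employee_id in chunk
                ]
            }
        }

ids = [f"emp-{n}" for n in range(250)]
requests = list(batch_get_requests("Employees", "org", ids))
print([len(r["Employees"]["Keys"]) for r in requests])
```

Each payload would be passed as `RequestItems` to the low-level client's `batch_get_item`; any keys that come back under `UnprocessedKeys` need to be retried.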

Using Cosmos DB how do I query just on the partition key

We have a group of related documents all sharing the same partition key. The thinking is that grouping these up should simply be a case of querying on the partition key and stitching them together. What am I missing?
So
Select * from c where c.CustomerId = "500"
Would return, say, 3 documents (Address, Sales and Invoices) which all have a property named CustomerId with a value of 500.
I appreciate it's not the primary key, and I am purposely omitting a row key.
Perhaps not splitting the documents is the answer, but the different documents have different TTLs, and this would then become problematic, wouldn't it?
CustomerId is the partition key.
The MS docs say this is possible (citing a city = seattle example, where their partition key is city).
So, what am I missing? Is this a complete misunderstanding of querying in Cosmos? (I do know that a partition key is used to break up related data into partitions.) I didn't know this made it an unqueryable aspect.
Also, I can query with the partition key and row key with no problem.
EDIT 2:
This works:
SELECT * FROM c WHERE c.CustomerId > "499" AND c.CustomerId < "501"
Ok,
So the range query working was a bit of a lead.
Custom indexing on the collection was causing issues.
At this moment, I have removed the custom indexing entirely and will build back up and then post a more specific answer.
What I did read was that the partition key is implicitly indexed anyway. There was ALSO an explicit index on it, so maybe this was causing the odd behaviour.
Indexing Policies CosmosDB
Maybe I'm not getting it at all, but you have to be explicit about the type of the value that you are looking for. I think these are not the same:
c.CustomerId = "500"
VS
c.CustomerId = 500
because one is looking for text and the other for a number. Review how your data is stored; the type has to match if you want to query on that value (bearing in mind that CustomerId is the partition key).
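The type mismatch can be seen directly in the JSON. The sketch below parses two sample documents and builds a parameterized query in the name/value shape that Cosmos DB query parameters use; `customer_query` is a hypothetical helper and nothing here contacts a database.

```python
import json

doc_with_string = json.loads('{"CustomerId": "500"}')
doc_with_number = json.loads('{"CustomerId": 500}')

def customer_query(customer_id):
    """Build a query spec; the parameter keeps the Python type it is given."""
    return {
        "query": "SELECT * FROM c WHERE c.CustomerId = @customerId",
        "parameters": [{"name": "@customerId", "value": customer_id}],
    }

# A string "500" and a number 500 are different JSON values.
print(doc_with_string["CustomerId"] == doc_with_number["CustomerId"])
print(customer_query(doc_with_string["CustomerId"]))
```

Passing the value with the same type as the stored property avoids a filter that silently matches nothing.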

Dynamodb query expression

Team,
I have a DynamoDB table with a given hash key (userid) and sort key (ages). Let's say we want to retrieve, for each hash key (userid), the item with the smallest age. What would be the query and filter expression for the DynamoDB query?
Thanks!
I don't think you can do it in a query. You would need to do a full table scan. If you have a list of hash keys somewhere, then you can do N queries (in parallel) instead.
[Update] Here is another possible approach:
Maintain a second table with just a hash key (userID). This table will contain one record with the smallest age for each user. To achieve that, make sure that every time you update the main table you also update the second one if the new age is less than the current age in the second table. You can use a conditional update for that. The update can be done either by the application itself or by an AWS Lambda function listening to the DynamoDB stream. Now if you need the smallest age for each user, you still do a full table scan of the second table, but this scan will only read relevant records, so it will be optimal.
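The conditional update described above might look like the following sketch. `min_age_update` builds keyword arguments in the shape boto3's `Table.update_item` accepts; the attribute names `UserId` and `MinAge` are assumptions, and `apply_locally` is a hypothetical in-memory stand-in that mirrors the condition so the example runs without AWS.

```python
def min_age_update(user_id, new_age):
    """Arguments for a write that only succeeds when new_age is smaller."""
    return {
        "Key": {"UserId": user_id},
        "UpdateExpression": "SET MinAge = :age",
        # Write if no minimum is stored yet, or the stored one is larger.
        "ConditionExpression": "attribute_not_exists(MinAge) OR MinAge > :age",
        "ExpressionAttributeValues": {":age": new_age},
    }

def apply_locally(table, user_id, new_age):
    """In-memory equivalent of the conditional write above."""
    current = table.get(user_id)
    if current is None or current > new_age:
        table[user_id] = new_age
        return True
    return False

min_ages = {}
results = [apply_locally(min_ages, "u1", age) for age in (30, 25, 40)]
print(results, min_ages)
```

Against a real table, a failed condition is reported as an error from the write call rather than a `False` return, so the caller would catch and ignore it.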
There are two ways to achieve that:
If you don't need to get this data in real time, you can export your data into other AWS systems, like EMR or Redshift, and perform complex analytics queries there. With this you can write SQL expressions using joins and GROUP BY operators.
You can even perform EMR Hive queries on DynamoDB data, but they perform scans, so it's not very cost efficient.
Another option is use DynamoDB streams. You can maintain a separate table that stores:
Table: MinAges
UserId - primary key
MinAge - regular numeric attribute
On every update/delete/insert in the original table, you can query the minimum age for the affected user and store it in the MinAges table.
Another option is to write something like this:
storeNewAge(userId, newAge)
def smallestAge = getSmallestAgeFor(userId)
storeSmallestAge(userId, smallestAge)
But since DynamoDB does not have native transaction support, it's dangerous to run code like that, since you may end up with inconsistent data. You can use the DynamoDB transactions library, but these transactions are expensive. If you use streams instead, you will have consistent data at a very low price.
You can do it for a single hash key using ScanIndexForward:
YourEntity requestEntity = new YourEntity();
requestEntity.setHashKey(hashkey);
DynamoDBQueryExpression<YourEntity> queryExpression = new DynamoDBQueryExpression<YourEntity>()
    .withHashKeyValues(requestEntity)
    .withConsistentRead(false);
queryExpression.setIndexName(IndexName); // only if you are querying an index
// true = ascending sort key order, so the first item has the smallest age
queryExpression.setScanIndexForward(true);
queryExpression.setLimit(1);
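For comparison, the same single-user lookup can be sketched as parameters for the low-level Query API. A Query returns items in ascending sort key order when `ScanIndexForward` is true, so the first item carries the smallest age. The table name `UserAges` and the string type of `userid` are assumptions; the function only builds the request and does not call AWS.

```python
def smallest_age_query(table_name, user_id):
    """Request parameters for the item with the smallest sort key."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "#u = :u",
        "ExpressionAttributeNames": {"#u": "userid"},
        "ExpressionAttributeValues": {":u": {"S": user_id}},  # assumed string key
        "ScanIndexForward": True,  # ascending by sort key (ages)
        "Limit": 1,                # stop after the first matching item
    }

params = smallest_age_query("UserAges", "user-1")
print(params["ScanIndexForward"], params["Limit"])
```

This still answers the question for one userid at a time; covering every user means one such query per known hash key.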

Resources