Find smallest missing value in a sequence of numbers in Cosmos DB - azure-cosmosdb

Assuming I have documents which consistently have this structure:
{
id: <string_integer>,
partitionKey: ...
}
And I have ~1 million documents, and 'id' is a unique key sequenced from 1 upward. Is there any way to perform a basic SQL-style query to get the lowest/smallest value not in the sequence?
Thanks!

This requirement can't be achieved through SQL in Cosmos DB. You need to do this with code on your client side.
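A minimal client-side sketch, assuming the ids have already been read into memory (for example with a projection such as `SELECT VALUE c.id FROM c`) and that every id parses as an integer. The function name `smallest_missing_id` is hypothetical and not part of any SDK.

```python
from typing import Iterable


def smallest_missing_id(ids: Iterable[str], start: int = 1) -> int:
    """Return the smallest integer >= start that is absent from ids."""
    # The documents store ids as strings, so convert them to integers first.
    present = {int(value) for value in ids}
    candidate = start
    # Walk upward from the start of the sequence until a gap is found.
    while candidate in present:
        candidate += 1
    return candidate


print(smallest_missing_id(["1", "2", "3", "5", "6"]))  # prints 4
```

With about a million ids the set fits comfortably in memory, and the scan stops at the first gap.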

Related

Cosmos DB composite index best practices?

I've got some pretty high Cosmos usage right now that I'd like to reduce, and I think the way to do that is through composite indices, but I'm a little confused about the best approach.
My actual queries get more complex than this, but let's say I have 2 queries that look like this:
SELECT TOP 100 * FROM c WHERE c.partitionkey=n AND c.data.subdata1="str1" ORDER BY c._ts DESC
SELECT TOP 100 * FROM c WHERE c.partitionkey=n AND c.data.subdata1="str1" AND c.data.subdata2="str2" ORDER BY c._ts DESC
If I create a composite index that looks like this, will it help? Should I create two separate indices, one for each query? Should I put the partitionkey into the composite index, even though I'll only ever be searching on a single partition?
"compositeIndexes":[
  [
    {
      "path":"/data/subdata1",
      "order":"ascending"
    },
    {
      "path":"/_ts",
      "order":"descending"
    }
  ]
]
In Cosmos DB, composite indexes give a performance benefit for queries that have multiple filters, or both a filter and an ORDER BY clause. So in your case, I think the composite index will help.
Should I put the partitionkey into the composite index, even though
I'll only ever be searching on a single partition?
I believe that putting the partitionkey into the composite index will improve the performance of your query, even though you search on a single partition.
The best practice is to test your SQL with different composite indexes in the Azure Cosmos DB Emulator, and to use the Query Stats (such as the request charge) to decide which one to use.
I think you should have 2 composite indexes
First one should have partitionkey, subdata1 and _ts
Second one should have partitionkey, subdata2 and _ts
If your data is too large and you don't want to re-index, I would suggest removing ORDER BY at the database level and moving the sort into your code.
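Moving the sort into code could look like the sketch below. It assumes the query has been changed to return every matching document, because TOP 100 without ORDER BY no longer guarantees the newest 100. The helper name `newest_documents` is hypothetical.

```python
import heapq
from typing import Any, Dict, Iterable, List


def newest_documents(docs: Iterable[Dict[str, Any]], limit: int = 100) -> List[Dict[str, Any]]:
    """Return up to `limit` documents with the largest _ts, newest first."""
    # heapq.nlargest keeps only `limit` candidates in memory while it
    # iterates, and returns them sorted in descending order of the key.
    return heapq.nlargest(limit, docs, key=lambda doc: doc["_ts"])
```

Whether this is cheaper overall depends on how many documents match the filter, since all of them now have to be read and transferred.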

dynamodb query multiple values from GSI

I have a DynamoDB table (id (pk), name (sk), email, date, itemId (number))
and a GSI on (itemId (pk), date (sk)).
I'm trying to query for an array of itemIds [1,2,3,4] but get an error when using the IN statement in KeyConditionExpression with
aws.DocClient.query
const idsArray = [1, 2, 3, 4, 5];
const query = {
  IndexName: 'accountId-createdAt-index',
  // The IN operator on the key attribute is what triggers the error.
  KeyConditionExpression: 'itemId IN (:a1,:a2,:a3)',
  ExpressionAttributeValues: {
    ':a1': 1,
    ':a2': 2,
    .......
  },
  ScanIndexForward: false,
};
Is it possible to query for multiple values on a GSI in DynamoDB?
You're trying to query for multiple different partition keys in a GSI. This can only be done by running multiple individual queries (one per key value). With a GSI, a single partition key lookup can also return multiple items, so it's better to query each "itemId" partition key individually.
See the following for reference:
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html#DDB-Query-request-KeyConditionExpression
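The one-query-per-key approach can be sketched as follows. This assumes `table` behaves like a boto3 DynamoDB Table resource (only its `query` method is used); the function name `query_each_key` and the index name passed in are placeholders.

```python
from typing import Any, Dict, Iterable, List


def query_each_key(table: Any, index_name: str, item_ids: Iterable[int]) -> List[Dict[str, Any]]:
    """Run one Query per itemId against the GSI and concatenate the results."""
    items: List[Dict[str, Any]] = []
    for item_id in item_ids:
        kwargs: Dict[str, Any] = {
            "IndexName": index_name,
            "KeyConditionExpression": "itemId = :id",
            "ExpressionAttributeValues": {":id": item_id},
            "ScanIndexForward": False,  # descending by the index sort key
        }
        # A single Query call returns one page; follow LastEvaluatedKey
        # until the partition has been read completely.
        while True:
            page = table.query(**kwargs)
            items.extend(page.get("Items", []))
            last_key = page.get("LastEvaluatedKey")
            if last_key is None:
                break
            kwargs["ExclusiveStartKey"] = last_key
    return items
```

Results are ordered within each itemId but not across them, so a global ordering by date would need one more sort on the client.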
It's not possible to use IN to match multiple partition key values in a Query, but BatchGetItem can fetch multiple items in one request, and those reads are resolved in parallel. This is close to the IN solution you want, with one caveat: BatchGetItem needs the full primary key of each item in the base table and cannot read from a GSI.
The result will be a list of the elements in the table.
There are limits: the result set is capped at 16 MB, and a single request can ask for at most 100 items.
Please check this document for details :
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchGetItem.html
Referring to this answer https://stackoverflow.com/a/70494101/7706503, you could try PartiQL to construct a similar statement for querying the GSI with multiple keys:
select * from table."gsi_index_name" where partition_key in [key1,key2]
Then you can send the statement with the low-level API in one shot; in .NET, for example, the method is called ExecuteStatementAsync.
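As a rough illustration of the PartiQL route, the sketch below builds a parameterised statement plus the typed parameter list that the low-level API expects. The helper name `build_in_statement` is hypothetical, and it assumes the key values are numbers.

```python
from typing import Any, Dict, List, Tuple


def build_in_statement(table_name: str, index_name: str, key_name: str,
                       values: List[int]) -> Tuple[str, List[Dict[str, Any]]]:
    """Build a parameterised PartiQL SELECT with one '?' per key value."""
    if not values:
        raise ValueError("at least one key value is required")
    placeholders = ", ".join("?" for _ in values)
    statement = (
        f'SELECT * FROM "{table_name}"."{index_name}" '
        f'WHERE "{key_name}" IN [{placeholders}]'
    )
    # The low-level API takes typed attribute values; numbers travel as strings.
    parameters = [{"N": str(value)} for value in values]
    return statement, parameters
```

With boto3, the pair would then be passed to the low-level client, along the lines of `client.execute_statement(Statement=statement, Parameters=parameters)`.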

Dynamodb GetBatchItem vs query

Currently I use table.query to get items by matching the partition key, sorted by the sort key. Now the new requirement is to handle batch queries: a couple of hundred partition keys to match, hopefully still sorted by sort key within each partition key's result. I found BatchGetItem, which can handle up to 100 items per request, but it looks like it does no sorting. Is one item here one row in DDB, or all rows in one partition key?
From a performance (query speed) and price perspective, which one should I use? And do I have to sort the results myself if I use BatchGetItem? Ideally I'd like a solution that is fast, cost effective and returns results sorted by sort key within each partition key, but the first two are the top priority and I can do the sorting if I have to. Thanks
Query() is cheaper...
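BatchGetItem() runs as individual GetItem() calls, each costing 1 RCU for a strongly consistent read (assuming your item is no larger than 4 KB, since each item is rounded up to the next 4 KB on its own).
Let's say your item is 100 bytes: Query() can return 40 of them for 1 RCU, because it adds up the item sizes (about 4 KB in total) before rounding, whereas returning 40 via BatchGetItem() will cost 40 RCU.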
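The difference comes from where the rounding happens. The sketch below works through the arithmetic, assuming strongly consistent reads at 1 RCU per 4 KB (eventually consistent reads cost half as much) and a Query result that fits in a single request. Both function names are illustrative only.

```python
import math
from typing import List

READ_UNIT_BYTES = 4 * 1024  # one RCU covers a strongly consistent read of up to 4 KB


def batch_get_rcu(item_sizes: List[int]) -> int:
    """BatchGetItem: each item is rounded up to 4 KB on its own, then summed."""
    return sum(math.ceil(size / READ_UNIT_BYTES) for size in item_sizes)


def query_rcu(item_sizes: List[int]) -> int:
    """Query: item sizes are summed first, then rounded up to 4 KB once."""
    return math.ceil(sum(item_sizes) / READ_UNIT_BYTES)


sizes = [100] * 40  # forty 100-byte items
print(batch_get_rcu(sizes))  # prints 40
print(query_rcu(sizes))      # prints 1
```

The gap narrows as items grow: forty 10 KB items come to 120 RCU through BatchGetItem and 100 RCU through Query under the same assumptions.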

Dynamodb query expression

Team,
I have a DynamoDB table with a given hash key (userid) and sort key (ages). Let's say we want to retrieve, for each hash key (userid), the smallest age. What would be the query and filter expression for the DynamoDB query?
Thanks!
I don't think you can do it in a query. You would need to do full table scan. If you have a list of hash keys somewhere, then you can do N queries (in parallel) instead.
[Update] Here is another possible approach:
Maintain a second table that has just a hash key (userID). This table will contain one record per user, holding that user's smallest age. To achieve that, make sure that every time you update the main table you also update the second one if the new age is less than the current age in the second table. You can use a conditional update for that. The update can either be done by the application itself, or you can have an AWS Lambda function listening to the DynamoDB stream. Now if you need the smallest age for each user, you still do a full table scan of the second table, but this scan will only read relevant records, so it will be optimal.
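The conditional update could be sketched like this, assuming `min_ages_table` behaves like a boto3 DynamoDB Table resource and that the second table uses the attribute names `UserId` and `MinAge` (both are assumptions for illustration, as is the function name).

```python
from typing import Any


def record_min_age(min_ages_table: Any, user_id: str, age: int) -> bool:
    """Store `age` for the user only if it is lower than the stored minimum.

    Returns True when the table was updated, False when the stored value
    was already smaller or equal.
    """
    try:
        min_ages_table.update_item(
            Key={"UserId": user_id},
            UpdateExpression="SET MinAge = :age",
            # Write when no minimum exists yet or the new age is smaller.
            ConditionExpression="attribute_not_exists(MinAge) OR MinAge > :age",
            ExpressionAttributeValues={":age": age},
        )
        return True
    except Exception as exc:
        # boto3 raises botocore.exceptions.ClientError, which carries the
        # service error code in exc.response; anything else is re-raised.
        response = getattr(exc, "response", None) or {}
        if response.get("Error", {}).get("Code") == "ConditionalCheckFailedException":
            return False
        raise
```

Because the comparison runs inside DynamoDB, two writers racing on the same user cannot overwrite a smaller value with a larger one.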
There are two ways to achieve that:
If you don't need to get this data in real time, you can export your data into other AWS systems, like EMR or Redshift, and perform complex analytics queries there. With this you can write SQL expressions using joins and group by operators.
You can even perform EMR Hive queries on DynamoDB data, but they perform scans, so it's not very cost efficient.
Another option is use DynamoDB streams. You can maintain a separate table that stores:
Table: MinAges
UserId - primary key
MinAge - regular numeric attribute
On every update/delete/insert on the original table you can query the minimum age for the updated user and store it in the MinAges table.
Another option is to write something like this:
storeNewAge(userId, newAge)
def smallestAge = getSmallestAgeFor(userId)
storeSmallestAge(userId, smallestAge)
But because these are separate calls that are not wrapped in a transaction, it's dangerous to run code like that, since you may end up with inconsistent data. You can wrap them in a transaction (for example with the DynamoDB transactions library), but transactions are expensive. With streams, on the other hand, you get consistent data at a very low price.
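You can do it for a single hash key using ScanIndexForward. With ascending order (true) and a limit of 1, the first item returned is the one with the smallest sort key:
YourEntity requestEntity = new YourEntity();
requestEntity.setHashKey(hashkey);
DynamoDBQueryExpression<YourEntity> queryExpression = new DynamoDBQueryExpression<YourEntity>()
.withHashKeyValues(requestEntity)
.withConsistentRead(false);
queryExpression.setIndexName(IndexName); // only if you are querying an index
queryExpression.setScanIndexForward(true); // ascending: smallest sort key first (false would return the largest)
queryExpression.setLimit(1); // read one item per page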
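If the list of hash keys is known, the same idea can be repeated per user. A minimal sketch, assuming `table` behaves like a boto3 DynamoDB Table resource and that the key attributes are named `userid` and `ages` as in the question; the function name is hypothetical.

```python
from typing import Any, Dict, Iterable, Optional


def smallest_age_per_user(table: Any, user_ids: Iterable[str]) -> Dict[str, Optional[Any]]:
    """Query each user's partition once and keep the first (smallest) sort key."""
    result: Dict[str, Optional[Any]] = {}
    for user_id in user_ids:
        page = table.query(
            KeyConditionExpression="userid = :uid",
            ExpressionAttributeValues={":uid": user_id},
            ScanIndexForward=True,  # ascending sort key, so the smallest age comes first
            Limit=1,                # only the first item is needed
        )
        items = page.get("Items", [])
        # Users without any items are reported as None.
        result[user_id] = items[0]["ages"] if items else None
    return result
```

Each call reads a single item, so the cost grows with the number of users and not with the size of the table.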

How can I Scan an index in reverse in DynamoDB?

I am currently using DynamoDB and having a problem scanning. I am able to get paged results in forward order by using the ExclusiveStartKey. However, regardless of whether I set ScanIndexForward to true or false, I get results in forward order from my Scan operation. How can I get results in reverse order from a Scan in DynamoDB?
ScanIndexForward is the correct way to get items in descending order by the range key of the table or index you are querying. From the AWS API Reference:
A value that specifies ascending (true) or descending (false)
traversal of the index. DynamoDB returns results reflecting the
requested order determined by the range key. If the data type is
Number, the results are returned in numeric order. For type String,
the results are returned in order of ASCII character code values. For
type Binary, DynamoDB treats each byte of the binary data as unsigned
when it compares binary values.
Based on the docs for Scan, I conclude that there is no way to Scan in reverse. However, I would say that you are not using DynamoDB correctly if you need to do that. When designing a schema for a database like DynamoDB, you should plan the schema based on your expected queries to ensure that almost all application queries have a good index. Scans are meant more for sysadmin operations or for feeding into MapReduce or analytics. "A Scan operation always scans the entire table, then filters out values to provide the desired result, essentially adding the extra step of removing data from the result set." (Query and Scan Performance) That can lead to performance problems and other issues.
Using DynamoDB is fundamentally different from working with a traditional relational database and requires a big change in the way you think about using it. You need to decide whether DynamoDB's advantages in storage, performance, reliability and availability are worth accepting its limitations.
As of now, a DynamoDB Scan cannot return sorted results.
You need to use a query with a new global secondary index (GSI) with a hashkey and range field. The trick is to use a hashkey which is assigned the same value for all data in your table.
I recommend making a new field for all data and calling it "Status" and set the value to "OK", or something similar.
Then your query to get all the results sorted would look like this:
{
  TableName: "YourTable",
  IndexName: "Status-YourRange-index",
  KeyConditions: {
    Status: {
      ComparisonOperator: "EQ",
      AttributeValueList: [
        "OK"
      ]
    }
  },
  ScanIndexForward: false
}
The docs for how to write GSI queries are found here: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.Querying
