I am querying DynamoDB from Python and would like to specify the maximum ReadCapacityUnits that the query should use.
For example, my table has 100 ReadCapacityUnits, I would like to use only 5% of it which is 20.
Below is my query. How can I specify ReadCapacityUnits in this query?
paginator = ddb_client.get_paginator('query')
response_iterator = paginator.paginate(
    TableName=table_name,
    IndexName=INDEX_GSI,
    KeyConditionExpression=condition,
    ExpressionAttributeNames={ATTR_NAME: HASH_KEY},
    ExpressionAttributeValues={
        PLACEHOLDER: {'S': str(value)},
    },
    ConsistentRead=False,
    ScanIndexForward=False,
    PaginationConfig={"PageSize": 25})
You can't.
How big are your records?
Since an RCU is a read of up to 4KB of data (per second), and you specified a page size of 25, you'd have to have records larger than about 160 bytes for your query to consume more than 1 RCU.
Let's say your records are 1KB each, so for each RCU you can read 4 of them. Since your page size is 25, a page would take only 7 RCU (25 / 4, rounded up).
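The arithmetic above can be sketched as a small helper. It assumes the commonly documented rule that a Query adds up the sizes of the items it reads and rounds up to the next 4 KB, and that an eventually consistent read is charged at half that rate. The function name and parameters are made up for illustration.

```python
import math

RCU_BLOCK_BYTES = 4 * 1024  # one strongly consistent RCU covers up to 4 KB


def estimate_query_rcu(item_size_bytes, items_read, consistent_read=True):
    """Rough RCU estimate for one Query page (an approximation, not a bill)."""
    total_bytes = item_size_bytes * items_read
    units = max(1, math.ceil(total_bytes / RCU_BLOCK_BYTES))
    # Eventually consistent reads are charged at half the rate.
    return units if consistent_read else units / 2


print(estimate_query_rcu(1024, 25))         # 25 items of 1 KB, strongly consistent
print(estimate_query_rcu(1024, 25, False))  # same page with ConsistentRead=False
print(estimate_query_rcu(160, 25))          # 25 items of 160 bytes fit in one block
```

For real numbers, it is safer to pass ReturnConsumedCapacity='TOTAL' on the request and read ConsumedCapacity from the response than to rely on an estimate.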
The key thing here is that you're not using a filter expression. (Good!)
With a filter expression, you still pay for the data that is read, even if it's not returned.
Also note that DDB will read at most 1MB of data before returning (even if all records are filtered out). This is why Scan(), as opposed to Query(), can eat up your RCU.
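Although no request parameter caps RCU, a client can approximate a budget by asking DynamoDB to report what each page consumed (ReturnConsumedCapacity is an existing Query parameter) and pausing between pages. The helper below is only a sketch of that idea; throttle_pages and max_rcu_per_second are hypothetical names, and the pacing is an average, not a hard limit.

```python
import time


def throttle_pages(pages, max_rcu_per_second, sleep=time.sleep):
    """Yield pages, pausing after each one so the average RCU/s stays near the budget."""
    for page in pages:
        yield page
        # ConsumedCapacity is only present when ReturnConsumedCapacity was requested.
        consumed = page.get("ConsumedCapacity", {}).get("CapacityUnits", 0.0)
        if consumed > 0:
            sleep(consumed / max_rcu_per_second)


# Usage sketch (assumes the paginator and variables from the question):
# pages = paginator.paginate(TableName=table_name,
#                            KeyConditionExpression=condition,
#                            ReturnConsumedCapacity="TOTAL",
#                            PaginationConfig={"PageSize": 25})
# for page in throttle_pages(pages, max_rcu_per_second=5):
#     handle(page["Items"])  # handle() is a placeholder for your own code
```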
Related
DynamoDB limits a single query to at most 1MB of data. How can I get the total size of the items returned by a query I send to DynamoDB?
If it's the item count you want, you can use scanResult.getCount().
If you want the size of each scanRequest, you could write each request to a text file and, when the scan is complete, check the file sizes.
You can't.
The best you could do would be to estimate it by counting how many calls to Query() you make until LastEvaluatedKey is empty in the response, which indicates that the last page of data has been returned.
If all your DDB entries are the same size, you could total up Count/ScannedCount as you make the repeated calls to Query().
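The estimate described above could look roughly like this. It is a sketch that takes any client object with a boto3-style query method; total_query_counts is a hypothetical helper name, and multiplying the totals by an item size only works if the items really are uniform.

```python
def total_query_counts(client, **query_kwargs):
    """Follow LastEvaluatedKey and total Count / ScannedCount over all pages."""
    pages = count = scanned = 0
    kwargs = dict(query_kwargs)
    while True:
        response = client.query(**kwargs)
        pages += 1
        count += response.get("Count", 0)
        scanned += response.get("ScannedCount", 0)
        last_key = response.get("LastEvaluatedKey")
        if not last_key:
            # No LastEvaluatedKey means the last page has been returned.
            return {"pages": pages, "count": count, "scanned_count": scanned}
        kwargs["ExclusiveStartKey"] = last_key
```

With uniform items, scanned_count multiplied by the item size gives an approximate number of bytes read.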
I have a collection in Azure Cosmos DB with IoT messages (called DeviceEvents). The partition key is the application id. I want to query by device id (each device belongs to exactly one application). So I have a query like this:
SELECT VALUE root
FROM root
WHERE root["ApplicationId"] = 69 AND root["DeviceId"] = 2978
AND root["TimeStamp"] >= "2021-01-30T20:30:05.1635579Z"
AND root["TimeStamp"] <= "2021-02-19T20:30:05.1635969Z"
ORDER BY root["TimeStamp"] DESC OFFSET 0 LIMIT 30
When I execute the query like this I get Request Charge 10.96 RUs, Index lookup time 2.22 ms, Document load time 0.41 ms and Query engine execution time 0.24 ms.
When I execute the query without the partition key
SELECT VALUE root
FROM root
WHERE root["DeviceId"] = 2978
AND root["TimeStamp"] >= "2021-01-30T20:30:05.1635579Z"
AND root["TimeStamp"] <= "2021-02-19T20:30:05.1635969Z"
ORDER BY root["TimeStamp"] DESC OFFSET 0 LIMIT 30
When I execute the query like this I get Request Charge 10.45 RUs, Index lookup time 1.91 ms, Document load time 0.5 ms and Query engine execution time 0.24 ms.
While the numbers vary, the query with the partition key consistently consumes more RUs and has a higher index lookup time.
I don't have enough data for Cosmos DB to create different physical partitions right now but I will probably need it in the future. My relevant indexing policy is this
"compositeIndexes": [
    [
        {
            "path": "/DeviceId",
            "order": "ascending"
        },
        {
            "path": "/TimeStamp",
            "order": "descending"
        }
    ]
]
So my questions are
Do I need the partition key in the query?
Do I need the partition key in the index definition?
The reason you're getting confusing query stats is that the amount of data is too small to provide meaningful results.
With a small amount of data (approx 20GB or less) you'll only be on a single physical partition. Cross-partition queries run just as fast as partitioned queries when on the same physical partition.
Where things start to blow up is when the database grows (scales). If you design your database to have a high number of cross-partition queries your database, by design, will not scale. So you definitely need (or should try as much as possible) to use the partition key in your queries, especially high volume queries.
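As a rough illustration of keeping the partition key in the query, here is a sketch that assumes the azure-cosmos Python SDK, whose container query_items method accepts query, parameters and partition_key arguments. The function name is hypothetical, and the property names are taken from the question.

```python
def query_device_events(container, application_id, device_id, start, end, limit=30):
    """Run the device query scoped to one logical partition (the application id)."""
    query = (
        'SELECT VALUE root FROM root '
        'WHERE root["DeviceId"] = @deviceId '
        'AND root["TimeStamp"] >= @start AND root["TimeStamp"] <= @end '
        f'ORDER BY root["TimeStamp"] DESC OFFSET 0 LIMIT {int(limit)}'
    )
    parameters = [
        {"name": "@deviceId", "value": device_id},
        {"name": "@start", "value": start},
        {"name": "@end", "value": end},
    ]
    # Passing partition_key keeps the query on a single partition
    # instead of fanning out across all of them.
    return list(container.query_items(
        query=query,
        parameters=parameters,
        partition_key=application_id,
    ))
```

The container argument would typically come from database.get_container_client("DeviceEvents"); that name is from the question and the rest of the setup is not shown here.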
I would also add TimeStamp in both an ascending and descending composite index.
The other thing you mentioned is every device belongs to the same applicationId. If that is the case then your container cannot grow larger than 20GB. If every device in this app has applicationId of 69 then you should redesign this container and find a new partition key. If your queries are always by device Id then that would make a much better partition key.
I have a doubt about Limit on query/scans on DynamoDB.
My table has 1000 records, and a query over all of them returns 50 values. If I set a Limit of 5, that doesn't mean the query will return the first 5 matching values; it only means the query looks at 5 items in the table (in any order, so they could be very old items or new ones), so it's possible that I get 0 items back. How can I actually get the latest 5 items of a query? I need to set a Limit of 5 (the numbers are examples) because it would be too expensive to query/scan more items than that.
The query has this input
{
TableName: 'transactionsTable',
IndexName: 'transactionsByUserId',
ProjectionExpression: 'origin, receiver, #valid_status, createdAt, totalAmount',
KeyConditionExpression: 'userId = :userId',
ExpressionAttributeValues: {
':userId': 'user-id',
':payment_gateway': 'payment_gateway'
},
ExpressionAttributeNames: {
'#valid_status': 'status'
},
FilterExpression: '#valid_status = :payment_gateway',
Limit: 5
}
The index of my table is like this:
Should I use a second index or something similar to sort the items by the createdAt field? And if so, how can I be sure that the query will look at all the items?
If I set a Limit of 5, that doesn't mean the query will return the first 5 matching values; it only means the query looks at 5 items in the table (in any order, so they could be very old items or new ones), so it's possible that I get 0 items back. How can I actually get the latest 5 items of a query?
You are correct in your observation, and unfortunately there is no Query option or any other operation that can guarantee 5 items in a single request. To understand why this is the case (it's not just laziness on Amazon's side), consider the following extreme case: you have a huge database with one billion items, but do a very specific query which has just 5 matching items, and now you make the request you wished for: "give me back 5 items". Such a request would need to read the entire database of a billion items before it could return anything, and the client would surely give up by then. So this is not how DynamoDB's Limit works. It limits the amount of work that DynamoDB needs to do before responding. So if Limit = 100, DynamoDB will internally read 100 items, which takes a bounded amount of time. But you are right that you have no idea whether it will respond with 100 items (if all of them matched the filter) or 0 items (if none of them matched the filter).
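Because Limit bounds the items read and not the items returned, the only way to collect a fixed number of filtered results with the current data model is to keep requesting pages. A minimal sketch, assuming a boto3-style client passed in by the caller (query_until is a hypothetical name):

```python
def query_until(client, wanted, **query_kwargs):
    """Keep querying until `wanted` filtered items are collected or the data ends."""
    items = []
    kwargs = dict(query_kwargs)
    while len(items) < wanted:
        response = client.query(**kwargs)
        # A page may contain anywhere from 0 to Limit items after filtering.
        items.extend(response.get("Items", []))
        last_key = response.get("LastEvaluatedKey")
        if not last_key:
            break  # nothing left to read
        kwargs["ExclusiveStartKey"] = last_key
    return items[:wanted]
```

Every page still consumes read capacity for the items that were read and then filtered out, so this bounds neither cost nor latency.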
So to do what you want to do efficiently, you'll need to think of a different way to model your data - i.e., how to organize the partition and sort keys. There are different ways to do it, each has its own benefits and downsides, you'll need to consider your options for yourself. Since you asked about GSI, I'll give you some hints about how to use that option:
The pattern you are looking for is called filtered data retrieval. As you noted, if you create a GSI with createdAt as the sort key, you can retrieve the newest items first. But you still need to apply a filter, and you still don't know how to stop after 5 filtered results (and not 5 pre-filtering results). The solution is to ask DynamoDB to only put into the GSI, in the first place, items which pass the filter. In your example, it seems you always use the same filter: "status = payment_gateway". DynamoDB doesn't have an option to run a generic filter function when building the GSI, but it has a different trick up its sleeve to achieve the same thing: any time you set "status = payment_gateway", also set another attribute "status_payment_gateway", and when status is set to something else, delete the "status_payment_gateway" attribute. Now, create the GSI with "status_payment_gateway" as the partition key. DynamoDB will only put items in the GSI if they have this attribute, thereby achieving exactly the filtering you want.
You can also have multiple mutually-exclusive filtering criteria in one GSI by setting the partition key attribute to multiple different values, and you can then do a Query on each of these values separately (using KeyConditionExpression).
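The write side of this trick can be sketched as a helper that builds the arguments for an UpdateItem call. The helper name and the shape of the key are assumptions, since the table's primary key is not shown; the marker attribute follows the "status_payment_gateway" naming used above.

```python
def status_update_kwargs(table_name, key, new_status, indexed_status="payment_gateway"):
    """Build update_item arguments that keep the sparse-index marker in sync."""
    names = {"#s": "status", "#m": "status_" + indexed_status}
    values = {":s": {"S": new_status}}
    if new_status == indexed_status:
        # Item should appear in the GSI: set the marker attribute.
        expression = "SET #s = :s, #m = :m"
        values[":m"] = {"S": new_status}
    else:
        # Item should drop out of the GSI: remove the marker attribute.
        expression = "SET #s = :s REMOVE #m"
    return {
        "TableName": table_name,
        "Key": key,
        "UpdateExpression": expression,
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": values,
    }


# Usage sketch; the key attribute "transactionId" is hypothetical:
# client.update_item(**status_update_kwargs(
#     "transactionsTable", {"transactionId": {"S": "tx-1"}}, "payment_gateway"))
```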
Currently I use table.query to get items by matching the partition key, sorted by the sort key. The new requirement is to handle a batch query: a couple of hundred partition keys to match, with results ideally still sorted by sort key within each partition key. I found BatchGetItem, which can handle up to 100 items per request, but it looks like it does no sorting. Is one item here one row in DDB, or all rows in one partition key?
From a performance (query speed) and price perspective, which one should I use? And do I have to sort the results myself if I use BatchGetItem? Ideally I would like a solution that is fast, cost effective and returns results sorted by sort key within each partition key, but the first two are the top priority and I can do the sorting if I have to. Thanks
Query() is cheaper...
BatchGetItem() runs as individual GetItem() calls, each costing 1 RCU (assuming your item is smaller than 4KB).
Let's say your item is 100 bytes: Query() can return 40 of them for 1 RCU, whereas returning 40 via BatchGetItem() will cost 40 RCU.
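The difference comes from where the rounding happens: a Query rounds the combined size of the items it reads up to 4 KB once, while BatchGetItem rounds each item up separately. A small sketch of that arithmetic for strongly consistent reads (the function names are illustrative, and 40 items of 100 bytes is just an example input):

```python
import math

BLOCK = 4 * 1024  # bytes covered by one strongly consistent RCU


def rcu_for_query(item_sizes):
    """Query: sum the item sizes first, then round up to 4 KB blocks."""
    return max(1, math.ceil(sum(item_sizes) / BLOCK))


def rcu_for_batch_get(item_sizes):
    """BatchGetItem: every item is rounded up to 4 KB on its own."""
    return sum(max(1, math.ceil(size / BLOCK)) for size in item_sizes)


sizes = [100] * 40  # 40 small items in one partition
print(rcu_for_query(sizes), rcu_for_batch_get(sizes))
```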
I have to list all the records from a DynamoDB table, without any filter expression.
I want to limit the number of records, so I am using DynamoDBScanExpression with setLimit.
DynamoDBScanExpression scanExpression = new DynamoDBScanExpression();
....
// Set ExclusiveStartKey
....
scanExpression.setLimit(10);
However, the scan operation always returns more than 10 results!
Is this the expected behaviour and if so how?
Python Answer
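A scan does accept a limit, but the limit caps the number of items read per request (per page), not the total number of results, so a client that keeps fetching pages will return more than the limit. If you want a bounded, ordered result, a query is the better fit.
A query reads the items (the rows in the database) under a single partition key. It starts at the top or bottom of that partition, in sort key order, and finds items based on the key condition. You must have a partition key and a sort key to do this.
A scan, on the other hand, reads through the ENTIRE table rather than one partition and, as a result, is NOT ordered.
Since a query works within one partition in sort key order while a scan walks the ENTIRE table, only a query can combine a limit with a meaningful order (for example, the latest N items).
To answer OP's question: the limit applies to each page of the scan, and the mapper's scan keeps loading further pages as you iterate, so you see more than 10 results.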
Here is an example of a query with a limit, using the CLIENT syntax. (This is the more advanced syntax. Sorry, I don't have a simpler example that uses the resource syntax, but you can search for one.)
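def retrieve_latest_item(self):
    # Read the partition in descending sort key order and stop after 3 items.
    result = self.dynamodb_client.query(
        TableName="cleaning_company_employees",
        KeyConditionExpression="works_night_shift = :value",
        # A BOOL value must be a Python bool, not the string "True".
        # Key attributes must be of type S, N or B, so match this to your key schema.
        ExpressionAttributeValues={':value': {"BOOL": True}},
        ScanIndexForward=False,
        Limit=3
    )
    return result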
Here are the DynamoDB module docs.
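For the scan case in the question, a cap on the total number of results has to be enforced by the caller, because Limit on a Scan request applies to each page. The following sketch assumes a boto3-style client passed in by the caller and no filter expression; scan_first_n is a hypothetical helper name.

```python
def scan_first_n(client, table_name, n):
    """Collect at most n items from a scan by following LastEvaluatedKey."""
    items = []
    kwargs = {"TableName": table_name}
    while len(items) < n:
        # Ask only for as many items as are still missing.
        kwargs["Limit"] = n - len(items)
        response = client.scan(**kwargs)
        items.extend(response.get("Items", []))
        last_key = response.get("LastEvaluatedKey")
        if not last_key:
            break  # the whole table has been read
        kwargs["ExclusiveStartKey"] = last_key
    return items[:n]
```

The items come back in no particular order, so this returns some n items, not the latest n.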