From AWS Docs:
A single Query operation can retrieve a maximum of 1 MB of data. This limit applies before any FilterExpression or ProjectionExpression
is applied to the results. If LastEvaluatedKey is present in the
response and is non-null, you must paginate the result set.
I have been working with DynamoDB for some time now, and when I increase the limit of a query it always gives me more records. So what is the closest meaning of Limit = 2? Returning 2 items (or a maximum of 1 MB, which we know for a fact), right? So would Limit = 1000 return 1000 items, or 1000 MB of data? Or 1000 records with no effect on data size? Or something else?
The limit parameter only affects the number of items that are returned.
Limit = 2 means at most 2 items will be returned by that call.
Depending on the item size, you may not get all the records that you specify with the limit parameter, because at most 1 MB of data is read from the table per request.
That means if every item in your table is 400 KB (the maximum item size) and you set the limit parameter to 5, you will still get at most 2 items back, because of the 1 MB limit.
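For illustration, here is a minimal sketch of a paginated Query in Python with boto3, assuming a hypothetical table named "my-table" with partition key "pk" (both names are assumptions, not from the question). It shows how Limit caps items per request while LastEvaluatedKey drives pagination:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical table name

items = []
last_key = None
while True:
    kwargs = {
        "KeyConditionExpression": Key("pk").eq("user#123"),
        "Limit": 100,  # at most 100 items per request; the 1 MB read cap still applies
    }
    if last_key:
        kwargs["ExclusiveStartKey"] = last_key
    response = table.query(**kwargs)
    items.extend(response["Items"])
    last_key = response.get("LastEvaluatedKey")
    if not last_key:
        break  # no more pages

print(f"Fetched {len(items)} items across all pages")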
In a query, the Limit is the number of items that will be returned by the query (i.e. what is available in your SDK response).
So if you make a query that would normally return 15 items but set a limit of 2, you get the first 2 based on their sort key (and if there is no sort key, the first two that come back; I believe the oldest items, but don't quote me on that).
The 1 MB limit is a hard cap on the total size of the data returned by the query call. So if you have 100 items and, serialized, they exceed 1 MB of data, then only the first 1 MB worth of entries (whole entries) will be returned, along with a pagination key (LastEvaluatedKey, which you pass back as ExclusiveStartKey) that can be used in the next query to pick up where the previous one ended (pagination).
It's very important to realize how the Limit keyword and the 1 MB hard cap interact: Limit applies per request, not to the query as a whole. A request stops as soon as it has evaluated Limit items or read 1 MB, whichever comes first, and LastEvaluatedKey points at the last item actually evaluated in that request.
So if your query would match 15 items and you set a limit of 5, the first call returns items 1-5 and a LastEvaluatedKey at item 5; passing that key back in the next call returns items 6-10, and so on, until a response comes back without a LastEvaluatedKey.
In general, there is little reason to use Limit; instead, your partition key/sort key combination should be set up along your access patterns so you are only retrieving the actual items you need on any given call. Sort key conditions (>, <, =, between, begins_with) are a better way to limit the number of results than Limit. The only major use case I can usually find for Limit is literally needing just the latest item out of potentially multiple items after a specific date. But even then, it's usually better just to take the entire query result and pick the first item yourself in your code (index 0), so you don't accidentally lose items from the limit/query combination.
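As a hedged sketch of that "latest item" idea in boto3 (the table name "readings" and the attributes "device_id"/"timestamp" are assumptions for illustration): a sort key condition narrows the range, ScanIndexForward=False reads newest-first, and you can either use Limit=1 or just take index 0 of the result.

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("readings")  # hypothetical table name

# Latest reading after a given time: newest-first order, single item.
response = table.query(
    KeyConditionExpression=Key("device_id").eq("sensor-42")
    & Key("timestamp").gt("2023-01-01T00:00:00Z"),
    ScanIndexForward=False,  # descending sort key order (newest first)
    Limit=1,
)
latest = response["Items"][0] if response["Items"] else None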
Related
I have a use case where I want to always know and be able to look up DynamoDB items by their last read time. What is the easiest way to do this (I would prefer not to use any other services).
You can recall items by their last read time in DynamoDB by using a combination of the UpdateItem API, Query API and GSIs.
Estimate how long your application will take to randomly read 1 MB worth of items from the DynamoDB table. Let's assume we are working with small items of <=1 KB each; then, if the RPS on the table is 100, it will take 10 seconds for your application to randomly read 1 MB of data.
Create a GSI that projects all attributes and is keyed on (PK=read_time_bucket, SK=read_time). The WPS on the GSI should equal the RPS of the base table in the case of small items.
Use the UpdateItem API with the following parameters to “read” each item. The UpdateItemResult will contain the item with the updated last read time and bucket:
(
ReturnValues=ALL_NEW,
UpdateExpression="SET read_time_bucket = :bucket, read_time = :time",
ExpressionAttributeValues={
":bucket": <a partial timestamp indicating the 10-second-long bucket of time that corresponds to now>,
":time": <epoch millis>
}
)
You can use the Query API to look up items by last read time on the GSI, using key conditions on read_time_bucket and read_time.
You will need to adjust your time bucket size and throughput settings depending on item size and read/write patterns on the base table. If item size is prohibitively large, restrict the GSI projection to INCLUDE (only selected attributes) or KEYS_ONLY.
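A rough boto3 sketch of the UpdateItem “read” and the GSI lookup described above, assuming a hypothetical table "items" with partition key "pk" and a GSI named "read-time-index" keyed on (read_time_bucket, read_time); the 10-second bucket follows the estimate above, and all names are assumptions:

import time
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("items")  # hypothetical table name

def read_item(pk_value):
    # "Read" the item by stamping its last-read attributes; ALL_NEW returns the updated item.
    now_ms = int(time.time() * 1000)
    bucket = now_ms // 10_000  # 10-second-long time bucket
    result = table.update_item(
        Key={"pk": pk_value},
        UpdateExpression="SET read_time_bucket = :bucket, read_time = :time",
        ExpressionAttributeValues={":bucket": bucket, ":time": now_ms},
        ReturnValues="ALL_NEW",
    )
    return result["Attributes"]

def items_read_between(bucket, start_ms, end_ms):
    # Look up items by last read time within one bucket on the GSI.
    response = table.query(
        IndexName="read-time-index",  # hypothetical GSI name
        KeyConditionExpression=Key("read_time_bucket").eq(bucket)
        & Key("read_time").between(start_ms, end_ms),
    )
    return response["Items"]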
My application has a search option, and searching takes almost 60 seconds. I just want the search function to work like the MakeMyTrip search feature.
At this point the only thing I can do is recommend a few technologies to achieve this:
SQL Server -> full-text indexing, searching with CONTAINS, 'CASE WHEN ... THEN' in the query, stored procedures, limiting the results for paging using the 'OFFSET 10 ROWS FETCH NEXT 10 ROWS ONLY' method, grouping to generate the total result count within the query, and @@ROWCOUNT to get the total number of results. I don't know if this helps.
I want to keep track of a multi-stage processing job.
I likely just need the following fields:
batchId (guid) | eventId (guid) | statusId (int) | timestamp | message (string)
There are relatively few events per batch.
I want to be able to easily query events that have a statusId less than n (still being processed or didn't finish processing).
Would using multiple rows for each status change, and querying for the latest status, be the best approach? I would use a global secondary index, but statusId does not seem like a good candidate for a hash key (fewer than 10 statuses).
Instead of using multiple rows for every status change, if you updated the same event row instead, you could use a technique described in the DynamoDB documentation in the section 'Use a Calculated Value'. Basically this would involve adding another attribute (say 'derivedStatusId') which would be derived by appending a random number to statusId at the time of writing to DynamoDB. For example, for a statusId of 2, derivedStatusId could be one of {"2-00", "2-01", .. "2-99"}. Setting up a Global Secondary Index on derivedStatusId would give you some fan-out that will help in preventing the index from becoming hot.
If you are sure that you will use this index only for unfinished events, then removing the derivedStatusId attribute from the record when it transitions to a finished status will remove it from the index as well - which may be a good property if events are expected to finish processing eventually, even if the items themselves stay around forever. This technique is called a "Sparse Index" and is described in more detail here.
From your question, it seems like keeping a status history is a desired property (I assume this because you want to have multiple rows for status changes). Consider putting this historical information in the same row: DynamoDB supports list data types and has a generous 400 KB item limit, which may just allow you to capture all the desired history in the same record.
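A hedged boto3 sketch of the calculated-value and sparse-index ideas above, assuming a hypothetical table "batch-events" keyed on (batchId, eventId) with a GSI "status-index" on derivedStatusId; the fan-out width of 100 and all names are assumptions:

import random
import time
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("batch-events")  # hypothetical table name
FAN_OUT = 100  # derivedStatusId values look like "2-00" .. "2-99"

def record_status(batch_id, event_id, status_id, message):
    derived = f"{status_id}-{random.randrange(FAN_OUT):02d}"  # e.g. "2-37"
    table.put_item(Item={
        "batchId": batch_id,
        "eventId": event_id,
        "statusId": status_id,
        "derivedStatusId": derived,  # present only while the event is unfinished
        "timestamp": int(time.time() * 1000),
        "message": message,
    })

def mark_finished(batch_id, event_id, final_status):
    # Removing derivedStatusId drops the item out of the sparse GSI.
    table.update_item(
        Key={"batchId": batch_id, "eventId": event_id},
        UpdateExpression="SET statusId = :s REMOVE derivedStatusId",
        ExpressionAttributeValues={":s": final_status},
    )

def events_with_status(status_id):
    # Query every fan-out shard of the GSI for one status.
    items = []
    for shard in range(FAN_OUT):
        resp = table.query(
            IndexName="status-index",  # hypothetical GSI name
            KeyConditionExpression=Key("derivedStatusId").eq(f"{status_id}-{shard:02d}"),
        )
        items.extend(resp["Items"])
    return items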
I am currently using ref.endAt().limit(n).on(...) to get the 'last' n values.
All the .priority values are null, so the list is sorted by name, which is a zero-padded timestamp.
It seemed that if I also set the .priority of each item to the timestamp, it would take more storage. Does it?
Regardless of whether or not it takes more storage, is there a significant performance difference for retrieving the last n sorted items if .priority is all null (so name sort is used) or if .priority are all unique and that .priority sort is used?
I am currently designing for it to work well with 10,000-ish items in a list. Is .priority or name sort better when a list gets over 1,000,000 items?
What about using ref.startAt(null, timeStart).endAt(null, timeEnd).on(...)?
I could profile, but how would I know that server load or network delays are or are not affecting it?
There should be no performance difference between using priority or key names to sort items. Firebase first looks for a priority to sort items, and if it doesn't exist, sorts items by key name. There might be a very small performance gain from using priority instead of key name, but I expect it to be very small.
This is not a big deal, but I wanted to throw it out there - if you had 100 items with incremental priorities, you would have a list like this:
item#1: { .priority: 1 }
...
item#100: { .priority: 100 }
The first item is #1 with priority 1, and the last item is #100 with priority 100. Now if you LIMIT() the list to 3 items, like this:
firebaseRef.limit(3).once(...)
Rather than being returned items 1-3, you would be returned items 98-100. Do most people expect that? It's the opposite of how limits generally work in other environments. In SQL, for example, you start at the beginning of the set and stop when you hit the limit.
Now this isn't a technical limitation or anything (I believe), because we can actually get records 1-3 pretty easily by using STARTAT() on the first item:
firebaseRef.startAt(1).limit(3).once(...)
In fact, when LIMIT() is used without STARTAT() or ENDAT() it actually behaves like you specified ENDAT() with the last item. For example these produce the same results:
firebaseRef.limit(3).once(...)
firebaseRef.endAt(100).limit(3).once(...)
Doesn't it seem like the default behavior should be to mimic STARTAT() from the first position, rather than ENDAT() from the last position, if only LIMIT() is specified?
You are absolutely right in concluding that the default behavior works as if endAt() was used (and thus the latest items will be returned). This is because in the most common use cases you'd want to display the latest data, not the oldest; e.g. chat history or notifications.