DynamoDB: Fetch all items that match one of possible keys

Say I have the following table —
itemA | itemB | relationScore | type
----------------------------------------
"s" | "a" | 1.0 | "foo"
"s" | "b" | 1.0 | "bar"
...
Here, itemA + itemB is unique per row, and I've made them the hash key and range key respectively.
My query, however, needs me to fetch all items such that —
itemA is equal to one of (A,B,C,D ....) (i.e. a list of options)
type is equal to "foo"
How do I build indexes to be able to do this without using scans / requiring multiple queries?
Note: I don't want to query on just type and filter it later in memory because type has very low cardinality, and would end up returning a humongous list back.

Maybe...
If you created a GSI with
hash key: type
range key: itemA
Then you could query using the GSI and (type = "foo", itemA between "A" and "D")
But obviously that requires itemA values to be a contiguous range. Which seems to be the case for your example, but may not be the case for the actual data.
EDIT
Since the itemA values aren't actually contiguous, and DDB doesn't support IN in a key condition, you're stuck with multiple queries.
This isn't the end of the world, as you could do the queries in parallel. In that case, I'd probably have the GSI with
hash key: itemA
range key: type
thus ensuring that each query is partition specific. (Even if your data or I/O requirements are low enough that DDB doesn't actually create individual partitions.)
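A minimal sketch of the parallel-query approach, assuming a boto3-style low-level client (anything with a `query(**params)` method) and a GSI keyed on itemA (hash) and type (range). The function names, the index name passed in, and the worker count are illustrative assumptions, not part of the original answer. Placeholders are used for the attribute names in case they collide with reserved words.

```python
from concurrent.futures import ThreadPoolExecutor


def query_one(client, table, index, item_a, type_value):
    """Query a single itemA partition of the GSI, following pagination."""
    params = {
        "TableName": table,
        "IndexName": index,
        "KeyConditionExpression": "#a = :a AND #t = :t",
        "ExpressionAttributeNames": {"#a": "itemA", "#t": "type"},
        "ExpressionAttributeValues": {
            ":a": {"S": item_a},
            ":t": {"S": type_value},
        },
    }
    items = []
    while True:
        page = client.query(**params)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items
        params["ExclusiveStartKey"] = last_key


def query_many(client, table, index, item_a_values, type_value, max_workers=8):
    """Run one Query per itemA value in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(
            lambda a: query_one(client, table, index, a, type_value),
            item_a_values,
        )
        return [item for items in results for item in items]
```

Each call still costs one request per itemA value; the thread pool only hides the latency.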

For your first access pattern, DynamoDB provides a BatchGetItem operation which can return up to 16 MB of data and contain as many as 100 items.
Your second access pattern can be accessed by creating a secondary index on the type field. This index would logically group all items of the same type into the same partition, which you could retrieve with a single query operation.
Edit: I misinterpreted the question and thought we were discussing two separate access patterns, not a single access pattern. As described, the existing data model won't support a single query operation.
The only way to search across partition keys (condition 1) is by a scan operation, or multiple queries. If you want to do this in a single query, you'll need to store your data such that the list of options and type is grouped together in a single partition. This is effectively "pre joining" your data.
Unless we can exploit features of your data that are not obvious via your example (e.g. itemA is lexicographically sortable), you are going to be stuck with multiple queries or a scan operation.

Related

Getting most recent item as decided by sort key for a set of hash keys

In my DynamoDB table my primary key is composed of a partition key (documentId - string) and sort key (revision - string).
documentId | revision | details (JSON)
A | 5 | { title: "Where's Wally New" }
A | 2 | { title: "Where's Wally" }
B | 3 | { title: "The Grapes of Wrath" }
C | 4 | { title: "The Great Gatsby" }
For a set of documentIds, I want to grab the latest revisions of those documents, as defined by the sort key. For example, I want to get the details of the latest revisions for documentId (A, B). This should return ("Where's Wally New", "The Grapes of Wrath").
I've managed to find people confirming you can do this efficiently if you are just looking up one hash key/documentId at a time (e.g. NoSQL: Getting the latest values from tables DynamoDB/Azure Table Storage), but is this possible if I want to avoid making multiple read queries?
You’re looking for a batch query. It doesn’t exist (at least today). See a previous question on this at DynamoDB batch execute QueryRequests
One comment there suggested PartiQL could help. But no. According to https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/workbench.querybuilder.partiql.html
“As with the BatchGetItem operation, only singleton read operations are supported. Scan and query operations are not supported.”
You can sort by Sort Key.
By default results are sorted in ascending order, either numeric order or UTF-8 order.
As stated in the docs:
Query results are always sorted by the sort key value. If the data type of the sort key is Number, the results are returned in numeric order; otherwise, the results are returned in order of UTF-8 bytes. By default, the sort order is ascending. To reverse the order, set the ScanIndexForward parameter to false.
To reverse that and have it sorted in descending order, you need to set "ScanIndexForward": false in your query.
Now to only receive the top of the list - which will be the most recent revision ie. the highest revision number of that documentId - you can limit the results to one via
"Limit": 1.
However, since you're using strings for your sort key, you will have issues with, say, "9" and "10": the string "10" sorts before the string "9" because it starts with a "1".
I'd recommend switching to numbers for the revision number to resolve that issue.
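As a rough sketch of the above, assuming a boto3-style low-level client and the documentId/revision key schema from the question (the function names are made up for illustration): one Query per documentId with ScanIndexForward set to false and Limit set to 1. The two assertions at the end show the string-ordering pitfall in plain Python.

```python
def latest_revision(client, table, document_id):
    """Return the item with the highest sort key for one documentId, or None."""
    page = client.query(
        TableName=table,
        KeyConditionExpression="documentId = :d",
        ExpressionAttributeValues={":d": {"S": document_id}},
        ScanIndexForward=False,  # descending sort-key order
        Limit=1,                 # stop after the first (highest) item
    )
    items = page.get("Items", [])
    return items[0] if items else None


def latest_revisions(client, table, document_ids):
    """One Query per documentId, since there is no batch Query."""
    return {d: latest_revision(client, table, d) for d in document_ids}


# Strings compare character by character, so "10" sorts before "9".
assert sorted(["9", "10"]) == ["10", "9"]
# Numbers compare by value.
assert sorted([9, 10]) == [9, 10]
```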
Cheers!

DynamoDB Best practice to select all items from a table with pagination (Without PK)

I simply want to get a paginated list of products back from my table. The pagination part is relatively clear with last_evaluated_key; however, all the examples rely on the PK or SK, and in my case I just want paginated results sorted by createdAt.
My product id (a unique UUID) is not very useful in this case. Is scanning the whole table the only solution?
Yes, you will use Scan. DynamoDB has two types of read operation, Query and Scan. You can Query for one-and-only-one Partition Key (and optionally a range of Sort Key values if your table has a compound primary key). Everything else is a Scan.
Scan operations read every item, up to 1 MB per request, optionally filtered. Filters are applied after the read. Results are unsorted.
The SDKs have pagination helpers like paginateScan to make life easier.
Re: Cost. Ask yourself: "is Scan returning lots of data MB I don't actually need?" If the answer is "No", you are fine. The more you are overfetching, however, the greater the cost benefit of Query over Scan.
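For illustration, a hand-rolled version of what such a pagination helper does, assuming a boto3-style low-level client and that createdAt is stored as an ISO 8601 string (both are assumptions; the function names are hypothetical). Because Scan results are unsorted, the sort happens on the client after all pages are read.

```python
def scan_all(client, table):
    """Yield every item in the table, following LastEvaluatedKey across pages."""
    params = {"TableName": table}
    while True:
        page = client.scan(**params)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        params["ExclusiveStartKey"] = last_key


def products_by_created_at(client, table):
    """Scan results are unsorted, so sort by createdAt on the client."""
    return sorted(scan_all(client, table), key=lambda item: item["createdAt"]["S"])
```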

Random Sampling of size N in Dynamo DB without full Table scan

I am new to DynamoDB and am having trouble finding a way to get items at random without a full table scan; most of the algorithms I have found rely on full table scans.
I am also considering the case where we have no additional information about the table (the columns and column types are unknown).
Is there a way to do this?
You can randomly sample by using a randomly generated exclusive start key for the scan or query operation. The exclusive start key does not have to match a record in the table. It just needs to follow the key structure of the table/index.
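A minimal sketch of that idea, assuming a boto3-style low-level client and a table whose only key attribute is a string partition key. The key name is passed in by the caller, and the function name is hypothetical. It relies on the behaviour described above, that the start key need not match an existing item.

```python
import uuid


def random_sample_page(client, table, key_name, n):
    """Scan up to n items starting after a randomly generated key."""
    page = client.scan(
        TableName=table,
        Limit=n,
        # A made-up key that only has to follow the table's key structure.
        ExclusiveStartKey={key_name: {"S": str(uuid.uuid4())}},
    )
    return page.get("Items", [])
```

The page can come back with fewer than n items if the random start position lands near the end of the table, so a caller may need to repeat the call.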
As with most questions about queries in DynamoDB, how you structure your data depends on how you want to query it.
For something like random sampling, you have to make it conform to the following core constraints of DynamoDB:
You have to provide a partition key
You can provide a sort key
So with a "single table" type design, you could structure your data something like this:
PK | SK | myVal
----------------------------------------
my_dict | 6caaf1e3-eb8d-404a-a2ae-97d6682b0224 | foo
my_dict | 1c5496e8-c660-4b4e-980f-4abfb1942863 | bar
my_dict | 56551340-fff8-4824-a5be-70fcaece2e1a | baz
my_other_dict | 520a7b37-233c-49dd-87da-77d871d98c92 | test1
my_other_dict | 65ccd54e-72c3-499d-a3a7-0cd989252607 | test2
The PK is the identifier for your collection of random things to look up. The SK is a random UUID. And myVal contains the value you want to be returned.
You can query this db the following way:
SELECT * FROM "my-table" WHERE PK = 'my_dict' AND SK < '06a04e20-b239-48f2-a205-552eb61fef35'
By querying with an UUID as the SK, you'll get the first item in the table with an UUID close to the one you query for. By using a random uuid each time you query, you'll get a random result back.
The particular query above actually returns nothing, so you need to retry until you get a result.
Also, I haven't done the math (who has?), but I'd imagine that periodic queries like this won't generate perfectly random distributions, especially for small data sets.
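One possible way to avoid the retry loop, sketched under the same PK/SK layout and assuming a boto3-style low-level client: probe with `SK > random UUID` and Limit 1, and if nothing sorts after the probe, wrap around to the first item in the partition. This is a variation on the query above, not the answer's exact method; the function name and the `sk_probe` parameter are hypothetical.

```python
import uuid


def random_item(client, table, pk, sk_probe=None):
    """Return the first item whose SK sorts after a random probe, wrapping around."""
    probe = sk_probe or str(uuid.uuid4())
    page = client.query(
        TableName=table,
        KeyConditionExpression="#pk = :pk AND #sk > :sk",
        ExpressionAttributeNames={"#pk": "PK", "#sk": "SK"},
        ExpressionAttributeValues={":pk": {"S": pk}, ":sk": {"S": probe}},
        Limit=1,
    )
    if not page.get("Items"):
        # Nothing sorts after the probe: wrap around to the start of the partition.
        page = client.query(
            TableName=table,
            KeyConditionExpression="#pk = :pk",
            ExpressionAttributeNames={"#pk": "PK"},
            ExpressionAttributeValues={":pk": {"S": pk}},
            Limit=1,
        )
    items = page.get("Items", [])
    return items[0] if items else None
```

The distribution caveat still applies: an item that follows a large gap between UUIDs is picked more often than its neighbours.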

DynamoDB Limit on query

I have a doubt about Limit on query/scans on DynamoDB.
My table has 1000 records, and a query over all of them returns 50 values. But if I set a Limit of 5, that doesn't mean the query will return the first 5 values; it just means the query looks at 5 items in the table (in any order, so they could be very old items or new ones), so it's possible that I get 0 items back. How can I actually get the latest 5 items of a query? I need to set a Limit of 5 (numbers are examples) because it would be too expensive to query/scan for more items than that.
The query has this input
{
  TableName: 'transactionsTable',
  IndexName: 'transactionsByUserId',
  ProjectionExpression: 'origin, receiver, #valid_status, createdAt, totalAmount',
  KeyConditionExpression: 'userId = :userId',
  ExpressionAttributeValues: {
    ':userId': 'user-id',
    ':payment_gateway': 'payment_gateway'
  },
  ExpressionAttributeNames: {
    '#valid_status': 'status'
  },
  FilterExpression: '#valid_status = :payment_gateway',
  Limit: 5
}
The index of my table is like this:
Should I use a second index or something, to sort them with the field createdAt but then, how I'm sure that the query will look into all the items?
if I put a Limit of 5, that doesn't mean that the query will return the first 5 values, it just say that query for 5 Items on the table (in any order, so they could be very old items or new ones), so it's possible that I got 0 items on the query. How can actually get the latest 5 items of a query?
You are correct in your observation, and unfortunately there is no Query option or any other operation that can guarantee 5 items in a single request. To understand why this is the case (it's not just laziness on Amazon's side), consider the following extreme case: you have a huge database with one billion items, but do a very specific query which has just 5 matching items, and now make the request you wished for: "give me back 5 items". Such a request would need to read the entire database of a billion items before it could return anything, and the client would surely give up by then. So this is not how DynamoDB's Limit works. It limits the amount of work that DynamoDB needs to do before responding. So if Limit = 100, DynamoDB will internally read 100 items, which takes a bounded amount of time. But you are right that you have no idea whether it will respond with 100 items (if all of them matched the filter) or 0 items (if none of them matched the filter).
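To make the consequence concrete, here is a minimal client-side loop, assuming a boto3-style low-level client: it keeps issuing the same filtered Query, following LastEvaluatedKey, until enough items have passed the filter or the partition is exhausted. The function name and `wanted` parameter are illustrative. Every item read along the way is still billed, which is why the data-model changes discussed next are usually the better fix.

```python
def query_until(client, params, wanted):
    """Page through a filtered Query until `wanted` items pass the filter."""
    params = dict(params)  # do not mutate the caller's dict
    matches = []
    while len(matches) < wanted:
        page = client.query(**params)
        # Items holds only what survived the filter; it may be empty.
        matches.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break  # the whole partition has been read
        params["ExclusiveStartKey"] = last_key
    return matches[:wanted]
```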
So to do what you want to do efficiently, you'll need to think of a different way to model your data - i.e., how to organize the partition and sort keys. There are different ways to do it, each has its own benefits and downsides, you'll need to consider your options for yourself. Since you asked about GSI, I'll give you some hints about how to use that option:
The pattern you are looking for is called filtered data retrieval. As you noted, if you create a GSI with createdAt as the sort key, you can retrieve the newest items first. But you still need to apply a filter, and you still don't know how to stop after 5 filtered results (and not 5 pre-filtering results). The solution is to ask DynamoDB to only put in the GSI, in the first place, items which pass the filtering. In your example, it seems you always use the same filter: "status = payment_gateway". DynamoDB doesn't have an option to run a generic filter function when building the GSI, but it has a different trick up its sleeve to achieve the same thing: any time you set "status = payment_gateway", also set another attribute "status_payment_gateway", and when status is set to something else, delete the "status_payment_gateway" attribute. Now, create the GSI with "status_payment_gateway" as the partition key. DynamoDB will only put items in the GSI if they have this attribute, thereby achieving exactly the filtering you want.
You can also have multiple mutually-exclusive filtering criteria in one GSI by setting the partition key attribute to multiple different values, and you can then do a Query on each of these values separately (using KeyConditionExpression).
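A sketch of how the marker attribute could be kept in sync on every status change, written as a pure function that builds UpdateItem parameters for a boto3-style low-level client. The function name, the shape of `key`, and the choice to store the status value itself in the marker are assumptions made for illustration.

```python
def status_update_params(table, key, new_status, indexed_status="payment_gateway"):
    """Build UpdateItem parameters that keep the marker attribute in sync."""
    params = {
        "TableName": table,
        "Key": key,
        "ExpressionAttributeNames": {
            "#s": "status",
            "#m": "status_payment_gateway",
        },
        "ExpressionAttributeValues": {":s": {"S": new_status}},
    }
    if new_status == indexed_status:
        # Item should appear in the GSI: set the marker alongside the status.
        params["UpdateExpression"] = "SET #s = :s, #m = :s"
    else:
        # Item should drop out of the GSI: remove the marker.
        params["UpdateExpression"] = "SET #s = :s REMOVE #m"
    return params
```

The result would be passed as `client.update_item(**params)`. Because the status and the marker change in one request, they cannot drift apart.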

DynamoDB ordered list

I'm trying to store a List as a DynamoDB attribute but I need to be able to retrieve the list order. At the moment the only solution I have come up with is to create a custom hash map by appending a key to the value and converting the complete value to a String and then store that as a list.
eg. key = position1, value = value1, String to be stored in the DB = "position1#value1"
To use the list I then need to filter out, organise, substring and reconvert to the original type. It seems like a long way round, but at the moment it's the only solution I can come up with.
Does anybody have any better solutions or ideas?
The List type in the newly added Document Types should help.
Document Data Types
DynamoDB supports List and Map data types, which can be nested to represent complex data structures.
A List type contains an ordered collection of values.
A Map type contains an unordered collection of name-value pairs.
Lists and maps are ideal for storing JSON documents. The List data type is similar to a JSON array, and the Map data type is similar to a JSON object. There are no restrictions on the data types that can be stored in List or Map elements, and the elements do not have to be of the same type.
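For illustration, this is roughly what an ordered list of strings looks like in DynamoDB's low-level attribute-value format, where a List is tagged "L" and each string element "S". The helper names are hypothetical, and the sketch assumes every element is a string.

```python
def to_dynamodb_list(values):
    """Encode a Python list of strings as a DynamoDB List attribute value."""
    return {"L": [{"S": v} for v in values]}


def from_dynamodb_list(attr):
    """Decode a DynamoDB List of strings back into a Python list, keeping order."""
    return [element["S"] for element in attr["L"]]
```

Element order is preserved in both directions, so the "position1#value1" encoding from the question is not needed.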
I don't believe it is possible to store an ordered list as an attribute, as DynamoDB only supports single-valued and (unordered) set attributes. However, the performance overhead of storing a string of comma-separated values (or some other separator scheme) is probably pretty minimal, given that all the attributes for a row must together be under 64 KB (the limit at the time; the current item size limit is 400 KB).
(source: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/DataModel.html)
Add a range attribute to your primary keys.
Composite Primary Key for Range Queries
A composite primary key enables you to specify two attributes in a table that collectively form a unique primary index. All items in the table must have both attributes. One serves as a “hash partition attribute” and the other as a “range attribute.” For example, you might have a “Status Updates” table with a composite primary key composed of “UserID” (hash attribute, used to partition the workload across multiple servers) and a “Time” (range attribute). You could then run a query to fetch either: 1) a particular item uniquely identified by the combination of UserID and Time values; 2) all of the items for a particular hash “bucket” – in this case UserID; or 3) all of the items for a particular UserID within a particular time range. Range queries against “Time” are only supported when the UserID hash bucket is specified.
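As a sketch of option 3 in that description, the parameters for a time-range Query against the "Status Updates" example might look like the following in a boto3-style low-level call. The table name, the function name, and storing Time as a Number are assumptions. Placeholders are used for the attribute names in case they collide with reserved words.

```python
def status_updates_query(user_id, start_time, end_time, table="StatusUpdates"):
    """Build Query parameters for one UserID within a Time range (inclusive)."""
    return {
        "TableName": table,
        "KeyConditionExpression": "#u = :u AND #t BETWEEN :start AND :end",
        "ExpressionAttributeNames": {"#u": "UserID", "#t": "Time"},
        "ExpressionAttributeValues": {
            ":u": {"S": user_id},
            ":start": {"N": str(start_time)},
            ":end": {"N": str(end_time)},
        },
    }
```

The hash attribute appears as an equality condition, matching the rule that range queries on Time are only supported when the UserID is specified.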

Resources