Performing a conditional expression query on GSI in dynamodb - amazon-dynamodb

I know the query below is not supported in DynamoDB since you must use an equality expression on the HASH key.
query({
TableName,
IndexName,
KeyConditionExpression: 'purchases >= :p',
ExpressionAttributeValues: { ':p': 6 }
});
How can I organize my data so I can efficiently make a query for all items purchased >= 6 times?
Right now I only have 3 columns, orderID (Primary Key), address, confirmations (GSI).
Would it be better to use a different type of database for this type of query?

You would probably want to use the DynamoDB streams feature to perform aggregation into another DynamoDB table. The streams feature will publish events for each change to your data, which you can then process with a Lambda function.
I'm assuming in your primary table you would be tracking each purchase, incrementing a counter. A simple example of the logic might be on each update, you check the purchases count for the item, and if it is >= 6, add the item ID to a list attribute itemIDs or similar in another DynamoDB table. Depending on how you want to query this statistic, you might create a new entry every day, hour, etc.
Bear in mind DynamoDB has a 400KB limit per attribute, so this may not be the best solution depending on how many items you would need to capture in the itemIDs attribute for a given time period.
You would also need to consider how you reset your purchases counter (this might be a scheduled batch job where you reset purchase count back to zero every x time period).
Alternatively you could capture the time period in your primary table and create a GSI that is partitioned based upon time period and has purchases as the sort key. This way you could efficiently query (rather than scan) based upon a given time period for all items that have purchase count of >= 6.

You dont need to reorganise your data, just use a scan instead of a query
scan({
TableName,
IndexName,
FilterExpression: 'purchases >= :p',
ExpressionAttributeValues: { ':p': 6 }
});

Related

Using millisecond timestamp as the global secondary index in DynamodDb for range queries?

We have a Dynamodb table Events with about 50 million records that look like this:
{
"id": "1yp3Or0KrPUBIC",
"event_time": 1632934672534,
"attr1" : 1,
"attr2" : 2,
"attr3" : 3,
...
"attrN" : N,
}
The Partition Key=id and there is no Sort Key. There can be a variable number of attributes other than id (globally unique) and event_time, which are required.
This setup works fine for fetching by id but now we'd like to efficiently query against event_time and pull ALL attributes for records that match within that range (could be a million or two items). The criteria would be equal to something like WHERE event_date between 1632934671000 and 1632934672000, for example.
Without changing any existing data or transforming it through an external process, is it possible to create a Global Secondary Index using event_date and projecting ALL attributes that could allow a range query? By my understanding of DynamoDB this isn't possible but maybe there's another configuration I'm overlooking.
Thanks in advance.
(Edit: I rewrote the answer because the OP's comment clarified that the requirement is to query event_time ranges ignoring id. OP knows the table design is not ideal and is trying to make the best of a bad situation).
Is it possible to create a Global Secondary Index using event_date and projecting ALL attributes that could allow a range query?
Yes. You can add a Global Secondary Index to an existing table and choose which attributes to project. You cannot add an LSI to an existing table or change the table's primary key.
Without changing any existing data or transforming it through an external process?
No. You will need to manipulate the attibutes. Although arbitrary range queries are not its strength, DynamoDB has a time series pattern that can be adapted to your query pattern.
Let's say you query mostly by a limitied number of days. You would add a GSI with yyyy-mm-dd PK (Partition Key). Rows are made unique by a SK (Sort Key) that concatenates the timestamp with the id: event_time#id. PK and SK together are the Index's Composite Primary Key.
GSIPK1 = yyyy-mm-dd # 2022-01-20
GSISK1 = event_time#id # 1642709874551#1yp3Or0KrPUBIC
Querying for a single day needs 1 query operation, for a calendar week range needs 7 operations.
GSI1PK = "2022-01-20" AND GSI1SK > ""
Query a range within a day by adding a SK between condition:
GSI1PK = "2022-01-20" AND GSI1SK BETWEEN "1642709874" AND "16427098745"
It seems like one can create a global secondary index at any point.
Below is an excerpt from the Managing Global Secondary Indexes documentation which can be found here https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.OnlineOps.html
To add a global secondary index to an existing table, use the UpdateTable operation with the GlobalSecondaryIndexUpdates parameter.

DynamoDB Limit on query

I have a doubt about Limit on query/scans on DynamoDB.
My table has 1000 records, and the query on all of them return 50 values, but if I put a Limit of 5, that doesn't mean that the query will return the first 5 values, it just say that query for 5 Items on the table (in any order, so they could be very old items or new ones), so it's possible that I got 0 items on the query. How can actually get the latest 5 items of a query? I need to set a Limit of 5 (numbers are examples) because it will to expensive to query/scan for more items than that.
The query has this input
{
TableName: 'transactionsTable',
IndexName: 'transactionsByUserId',
ProjectionExpression: 'origin, receiver, #valid_status, createdAt, totalAmount',
KeyConditionExpression: 'userId = :userId',
ExpressionAttributeValues: {
':userId': 'user-id',
':payment_gateway': 'payment_gateway'
},
ExpressionAttributeNames: {
'#valid_status': 'status'
},
FilterExpression: '#valid_status = :payment_gateway',
Limit: 5
}
The index of my table is like this:
Should I use a second index or something, to sort them with the field createdAt but then, how I'm sure that the query will look into all the items?
if I put a Limit of 5, that doesn't mean that the query will return the first 5 values, it just say that query for 5 Items on the table (in any order, so they could be very old items or new ones), so it's possible that I got 0 items on the query. How can actually get the latest 5 items of a query?
You are correct in your observation, and unfortunately there is no Query options or any other operation that can guarantee 5 items in a single request. To understand why this is the case (it's not just laziness on Amazon's side), consider the following extreme case: you have a huge database with one billion items, but do a very specific query which has just 5 matching items, and now making the request you wished for: "give me back 5 items". Such a request would need to read the entire database of a billion items, before it can return anything, and the client will surely give up by then. So this is not how DyanmoDB's Limit works. It limits the amount of work that DyanamoDB needs to do before responding. So if Limit = 100, DynamoDB will read internally 100 items, which takes a bounded amount of time. But you are right that you have no idea whether it will respond with 100 items (if all of them matched the filter) or 0 items (if none of them matched the filter).
So to do what you want to do efficiently, you'll need to think of a different way to model your data - i.e., how to organize the partition and sort keys. There are different ways to do it, each has its own benefits and downsides, you'll need to consider your options for yourself. Since you asked about GSI, I'll give you some hints about how to use that option:
The pattern you are looking for is called filtered data retrieval. As you noted, if you do a GSI with the sort key being createdAt, you can retrieve the newest items first. But you still need to do a filter, and still don't know how to stop after 5 filtered results (and not 5 pre-filtering) results. The solution is to ask DynamoDB to only put in the GSI, in the first place, items which pass the filtering. In your example, it seems you always use the same filter: "status = payment_gateway". DynamoDB doesn't have an option to run a generic filter function when building the GSI, but it has a different trick up its sleeve to achieve the same thing: Any time you set "status = payment_gateway", also set another attribute "status_payment_gateway", and when status is set to something else, delete the "status_payment_gateway". Now, create the GSI with "status_payment_gateway" as the partition key. DynamoDB will only put items in the GSI if they have this attribute, thereby achieving exactly the filtering you want.
You can also have multiple mutually-exclusive filtering criteria in one GSI by setting the partition key attribute to multiple different values, and you can then do a Query on each of these values separately (using KeyConditionExpression).

Recommended Schema for DynamoDB calendar/event like structure

I'm pretty new to DynamoDB design and trying to get the correct schema for my application. In this app different users will enter various attributes about their day. For example "User X, March 1st 12:00-2:00, Tired". There could be multiple entries for a given time, or overlapping times (e.g. tired from 12-2 and eating lunch from 12-1).
I'll need to query based on user and time ranges. Common queries:
Give me all the "actions" for user X between time t1 and t2
Give me all the start times for action Z for user X
My initial thought was that the partition key would be userid and range key for the start time, but that wont work because of duplicate start times right?
A second thought:
UserID - Partition Key
StartTime - RangeKey
Action - JSON document of all actions for that start time
[{ action: "Lunch", endTime:"1pm"},{action:tired, endTime:"2pm"}]
Any recommendation on a proper schema?
This doesn't really have a one solution. And you will need to evaluate multiple options depending on your use case how much data you have/how often would you query and by which fields etc.
But one good solution is to partition your schema like this.
Generated UUID as partition key
UserID
Start time (in unix epoch time or ISO8601 time format)
Advantages
Can handle multiple time zones
Can easily query for userID and start date (you will need secondary index with primary key userID and sort key start time)
More even distribution and less hot keys of your data across dynamoDB partitions because of randomly generated primary key.
Disadvantages
More data for every item (because of UUID) (+16 bytes)
Additional cost for new secondary index, note scanning the data in table is generally much more expensive than having secondary index.
This is pretty close to your initial thought, in order to get a bit more precise answer we will need a lot more information about how many writes and reads are you planning, and what kind of queries you will need.
You are right in that UserID as Partition key and StartTime as rangeKey would be the obvious choice, if it wasn't for the fact of your overlapping activities.
I would consider going for
UserID - Partition Key
StartTime + uuid - RangeKey
StartTime - Plain old attribute
Datetimes in DynamoDB just get stored as strings anyway. So the idea here is that you have StartTime + some uuid as your rangekey, which gives you a sortable table based on datetime whilst also assuring you have unique primary keys. You could then store the StartTime in a separate attribute or have a function for adding/removing the uuid from the StartTime + uuid attribute.

DynamoDB NoSQL design for queries

I am looking to store a log of user events. It is going to be a lot of entries so I thought DynamoDB would be good as everything else is hosted there.
I need to query these events in two ways, totalt of events for a user for a date (range) and occasionally all the events for a date.
I was thinking to store it in one table as user id (key), sequence number (key), date, time and duration.
Should it be multiple tables? How can this be done most efficient?
For a small amount of data this structure is ok.
Keep in mind that the sequence number (your range key) has to be provided by you. It seems a good idea to choose the date as a unix timestamp with a milliseconds accuracy as a sort key.
There is no need for extra tables.
However your structure depends largely on the read write capacity that you want to achieve, and the data size.
Supposing your user_id is your partition key.
For every distinct partition key value, the total sizes of all table and index items cannot exceed 10 GB.
A single partition can support a maximum of 3,000 read capacity units or 1,000 write capacity units.
You need to create your partition keys by taking into consideration these limitations.
For example a very active user has many events thus you need more than 1000 write capacity units. Unfortunately you have choosen as a partition the user id.
In this case you are limited to 1000 write capacity units therefore you might have failures.
You need to have a different structure. For example a partition name like
user_id_1 user_id_2 etc. Therefore a partition naming mechanism spreading the data to partitions according to your application's needs.
Check these links on dynamodb limitations.
Tables guidance,
Partition distribution
I would suggest the following structure for your events table:
user id -- hash key
event date/time (timestamp with milliseconds) -- range key
duration
Having event timestamp as a range key should be sufficient to provide uniqueness for an event (unless a user can have multiple events right in the same millisecond), so you don't need a sequence number.
Having such a schema, you can get all events for a user for a date by using simple query.
Unfortunately, DynamoDB do not support aggregate queries, so you can't get a total number of events for a user quickly (you would have to query all records and calculate total manually).
So I would suggest creating a separate table for user events statistics like this:
user id -- hash key
date -- range key
events_cnt (total number of events for a user for a date)
So, after you add a new record into your events table, you have to increment events counter for the user in statistics table like shown below:
var dynamodbDoc = new AWS.DynamoDB.DocumentClient();
var params = {
TableName : "user_events_stats",
Key: {
userId: "65716110-f4df-11e6-bc64-92361f002671" ,
date: "2017-02-17",
},
UpdateExpression: "SET #events_cnt = if_not_exists(#events_cnt, :zero) + :one",
ExpressionAttributeNames: {
"#events_cnt": "events_cnt",
},
ExpressionAttributeValues: {
":one": 1,
":zero": 0,
},
};
dynamodbDoc.update(params, function(err, data) {
});

Whats the best way to query DynamoDB based on date range?

As part of migrating from SQL to DynamoDB I am trying to create a DynamoDB table. The UI allows users to search based on 4 attributes start date, end date, name of event and source of event.
The table has 6 attributes and the above four are subset of it with other attributes being priority and location. The query as described above makes it mandatory to search based on the above four values. whats the best way to store the information in DynamoDB that will help me in querying based on start date and end date fairly easy.
I thought of creating a GSI with hashkey as startdate, rangekey as end date and GSI on the rest two attributes ?
Inshort:
My table in DynamoDB will have 6 attributes
EventName, Location, StartDate, EndDate, Priority and source.
Query will have 4 mandatory attributes
StartDate, EndDate, Source and Event Name.
Thanks for the help.
You can use greater than/less than comparison operators as part of your query http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html
So you could try to build a table with schema:
(EventName (hashKey), "StartDate-EndDate" (sortKey), other attributes)
In this case the sort-key is basically a combination of start and end date allowing you to use >= (on the first part) and <= (on the second part)... dynamodb uses ASCII based alphabetical ordering... so lets assume your sortKey looks like the following: "73644-75223" you could use >= "73000-" AND <= "73000-76000" to get the given event.
Additionally, you could create a GSI on your table for each of your remaining attributes that need to be read via query. You then could project data into your index that you want to fetch with the query. In contrast to LSI, queries from GSI do not fetch attributes that are not projected. Be aware of the additional costs (read/write) involved by using GSI (and LSI)... and the additional memory required by data projections...
Hope it helps.

Resources