We have a table like this:
user_id | video_id | timestamp
--------|----------|----------
1       | 2        | 3
1       | 3        | 4
1       | 3        | 5
2       | 1        | 1
And we need to query the latest timestamp for each video viewed by a specific user.
Currently it's done like this:
response = self.history_table.query(
    KeyConditionExpression=Key('user_id').eq(int(user_id)),
    IndexName='WatchHistoryByTimestamp',
    ScanIndexForward=False,
)
This queries every timestamp for every video the specified user has watched, which puts a huge load on the database, because a user can have thousands of timestamps across thousands of videos.
I tried to find a solution on the Internet, but all the SQL solutions I can find use GROUP BY, and DynamoDB has no such feature.
There are 2 ways I know of doing this:
Method 1: GSI (Global Secondary Index)
GROUP BY is sort of like a partition in DynamoDB (but not really). Your partition key is currently user_id, I assume, but here you want video_id as the partition key and timestamp as the sort key. You can do that by creating a new GSI with partition key video_id and sort key timestamp. This gives you the ability to query, for a given video, the latest timestamp; the query will use only 1 RCU and be super fast if you add --max-items 1 --page-size 1 (with ScanIndexForward=False so the newest item comes first). But you will need to supply the video_id.
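For example, here is a minimal boto3 sketch of Method 1, assuming the new GSI is named VideoTimestampIndex with partition key video_id and sort key timestamp (the index and table names are placeholders, not from the original post):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
history_table = dynamodb.Table('WatchHistory')  # table name is an assumption

def latest_view_of_video(video_id):
    # Query the hypothetical GSI keyed on video_id (partition) and timestamp (sort).
    # ScanIndexForward=False returns the newest item first; Limit=1 keeps the read cheap.
    response = history_table.query(
        IndexName='VideoTimestampIndex',  # placeholder GSI name
        KeyConditionExpression=Key('video_id').eq(video_id),
        ScanIndexForward=False,
        Limit=1,
    )
    items = response['Items']
    return items[0] if items else None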
Method 2: Sparse Index
The problem with Method 1 is that you need to supply an ID, whereas you might just want a list of videos with their latest timestamp. There are a couple of ways to do this; one I like is a sparse index. Add an attribute, say latest, and set it only on the item holding the latest timestamp for each video. You can then create a GSI that uses that latest attribute as a key, and only the flagged items will appear in the index. Note that you will have to set and unset this value yourself, either in your application or via DynamoDB Streams and Lambda.
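A rough sketch of how that could look for this table, assuming a sparse GSI named LatestByUser with partition key user_id and sort key latest (a GSI key attribute must be a string, number, or binary, so latest would hold something like the timestamp rather than a boolean; all names here are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
history_table = dynamodb.Table('WatchHistory')  # table name is an assumption

def latest_timestamp_per_video(user_id):
    # Items without the 'latest' attribute are simply absent from the sparse GSI,
    # so this returns one item per video: the most recent view for this user.
    response = history_table.query(
        IndexName='LatestByUser',  # placeholder sparse GSI: PK=user_id, SK=latest
        KeyConditionExpression=Key('user_id').eq(int(user_id)),
    )
    return response['Items']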
That may seem weird, but this is how NoSQL works as opposed to SQL. I'm battling with it myself on a current project, where I'm having to use some of these techniques; each time it just doesn't feel right, but hopefully we'll get used to it.
Related
I have this DynamoDB table:
ID | customer_id | product_code | date_expire
---|-------------|--------------|------------
3  | 12          | TRE65GF      | 2023-11-15
5  | 12          | WDD2         | 2023-11-15
4  | 44          | BT4D         | 2023-06-23
What is the best way, in DynamoDB, to update the "date_expire" field for all items with the same customer_id?
For example, I want to set date_expire to "2023-04-17" for all items with customer_id = "12".
Should I do a scan of the table to extract all the "IDs" and then a WriteRequestBatch?
Or is there a quicker way, like a normal SQL query (UPDATE table SET field=value WHERE condition=xx)?
If this is a common use-case, then I would suggest creating a GSI with a partition key of customer_id:
customer_id | product_code | date_expire | ID
------------|--------------|-------------|---
12          | TRE65GF      | 2023-11-15  | 3
12          | WDD2         | 2023-11-15  | 5
44          | BT4D         | 2023-06-23  | 4
SELECT * FROM mytable.myindex WHERE customer_id = 12
First you do a Query on customer_id (against the GSI) to get back all of that customer's data; then you have a choice of how to update it:
UpdateItem
Depending on how many items are returned, it may be best to just iterate over them and call UpdateItem on each one. UpdateItem is better than PutItem or BatchWriteItem because it is an upsert rather than an overwrite, which means you are less likely to corrupt your data through conflicts or consistency issues.
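A rough boto3 sketch of that loop, using the GSI from the example query above (the table name, index name, and the assumption that ID is the table's partition key all come from the example, not from a known schema):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('mytable')  # placeholder table name

# 1. Query the GSI keyed on customer_id to find every item for the customer
#    (pagination omitted for brevity).
items = table.query(
    IndexName='myindex',  # placeholder GSI name from the example above
    KeyConditionExpression=Key('customer_id').eq(12),
)['Items']

# 2. Update each item by its table primary key (ID), touching only date_expire.
for item in items:
    table.update_item(
        Key={'ID': item['ID']},
        UpdateExpression='SET date_expire = :d',
        ExpressionAttributeValues={':d': '2023-04-17'},
    )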
BatchWriteItem
If you have a large number of items for a customer, BatchWriteItem may be best for speed, since you can write batches of up to 25 items. But as mentioned above, you would be overwriting data, which can be dangerous when all you want to do is update.
TransactWriteItems
Transactions give you the ability to update batches of up to 100 items at a time, but the caveat is that the batch is ACID compliant, meaning if one item update fails for any reason, they all fail. However, based on your use-case, this may be what you intend to happen.
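For completeness, a hedged sketch of the transactional variant using the low-level client (same placeholder table name as above; the IDs and their string type are assumptions, and result sets larger than 100 items would need to be chunked):

import boto3

client = boto3.client('dynamodb')

# Update a batch of previously queried IDs atomically: either every
# date_expire changes or none of them do.
ids_for_customer = ['3', '5']  # example IDs for customer_id = 12
client.transact_write_items(
    TransactItems=[
        {
            'Update': {
                'TableName': 'mytable',  # placeholder table name
                'Key': {'ID': {'S': item_id}},
                'UpdateExpression': 'SET date_expire = :d',
                'ExpressionAttributeValues': {':d': {'S': '2023-04-17'}},
            }
        }
        for item_id in ids_for_customer
    ]
)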
Examples
PHP examples are available here.
We have a DynamoDB table Events with about 50 million records that look like this:
{
  "id": "1yp3Or0KrPUBIC",
  "event_time": 1632934672534,
  "attr1": 1,
  "attr2": 2,
  "attr3": 3,
  ...
  "attrN": N
}
The Partition Key=id and there is no Sort Key. There can be a variable number of attributes other than id (globally unique) and event_time, which are required.
This setup works fine for fetching by id, but now we'd like to efficiently query against event_time and pull ALL attributes for records that fall within a range (could be a million or two items). The criteria would be something like WHERE event_time BETWEEN 1632934671000 AND 1632934672000, for example.
Without changing any existing data or transforming it through an external process, is it possible to create a Global Secondary Index using event_time and projecting ALL attributes that could allow a range query? By my understanding of DynamoDB this isn't possible, but maybe there's another configuration I'm overlooking.
Thanks in advance.
(Edit: I rewrote the answer because the OP's comment clarified that the requirement is to query event_time ranges ignoring id. OP knows the table design is not ideal and is trying to make the best of a bad situation).
Is it possible to create a Global Secondary Index using event_time and projecting ALL attributes that could allow a range query?
Yes. You can add a Global Secondary Index to an existing table and choose which attributes to project. You cannot add an LSI to an existing table or change the table's primary key.
Without changing any existing data or transforming it through an external process?
No. You will need to manipulate the attributes. Although arbitrary range queries are not its strength, DynamoDB has a time-series pattern that can be adapted to your query pattern.
Let's say you query mostly by a limited number of days. You would add a GSI with a yyyy-mm-dd PK (Partition Key). Rows are made unique by an SK (Sort Key) that concatenates the timestamp with the id: event_time#id. PK and SK together form the index's composite primary key.
GSI1PK = yyyy-mm-dd # 2022-01-20
GSI1SK = event_time#id # 1642709874551#1yp3Or0KrPUBIC
Querying a single day takes 1 Query operation; a calendar-week range takes 7.
GSI1PK = "2022-01-20" AND GSI1SK > ""
Query a range within a day by adding a BETWEEN condition on the SK:
GSI1PK = "2022-01-20" AND GSI1SK BETWEEN "1642709874000" AND "1642709874999"
It seems like one can create a global secondary index at any point.
Below is an excerpt from the Managing Global Secondary Indexes documentation, which can be found at https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.OnlineOps.html:
To add a global secondary index to an existing table, use the UpdateTable operation with the GlobalSecondaryIndexUpdates parameter.
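As a rough illustration, this is roughly what that UpdateTable call looks like in boto3 (the index and attribute names are only examples, matching the pattern from the previous answer):

import boto3

client = boto3.client('dynamodb')

# Add a GSI to an existing table; DynamoDB backfills it in the background
# and it becomes queryable once its status is ACTIVE.
client.update_table(
    TableName='Events',
    AttributeDefinitions=[
        {'AttributeName': 'GSI1PK', 'AttributeType': 'S'},
        {'AttributeName': 'GSI1SK', 'AttributeType': 'S'},
    ],
    GlobalSecondaryIndexUpdates=[
        {
            'Create': {
                'IndexName': 'GSI1',  # placeholder index name
                'KeySchema': [
                    {'AttributeName': 'GSI1PK', 'KeyType': 'HASH'},
                    {'AttributeName': 'GSI1SK', 'KeyType': 'RANGE'},
                ],
                'Projection': {'ProjectionType': 'ALL'},
                # For provisioned-mode tables, add a 'ProvisionedThroughput' entry here.
            }
        }
    ],
)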
I have my DynamoDB table as follows:
HashKey (Date), RangeKey (timestamp)
The DB stores the data of each day (hash key) and timestamp (range key).
Now I want to query the data of the last 7 days.
Can I do this in one query, or do I need to call DynamoDB 7 times, once per day? The order of the data does not matter, so can someone suggest an efficient query for this?
I think you have a few options here.
BatchGetItem - The BatchGetItem operation returns the attributes of one or more items from one or more tables. You identify requested items by primary key. You could specify all 7 primary keys and fire off a single request.
7 calls to DynamoDB. Not ideal, but it'd get the job done.
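A simple sketch of that option, assuming the hash key attribute is called date and stored as yyyy-mm-dd (the table name is a placeholder):

from datetime import date, timedelta

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('mytable')  # placeholder table name

def last_seven_days():
    # One Query per day partition; results are stitched together client-side.
    items = []
    today = date.today()
    for offset in range(7):
        day = (today - timedelta(days=offset)).isoformat()  # e.g. '2021-02-08'
        response = table.query(KeyConditionExpression=Key('date').eq(day))
        items.extend(response['Items'])
    return items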
Introduce a global secondary index that projects your data into the shape your application needs. For example, you could introduce an attribute that represents an entire week by using a truncated timestamp:
2021-02-08 (represents the week of 02/08/21T00:00:00 - 02/14/21T23:59:59)
2021-02-15 (represents the week of 02/15/21T00:00:00 - 02/21/21T23:59:59)
I call this a "truncated timestamp" because I am effectively ignoring the HH:MM:SS portion of the timestamp. When you create a new item in DDB, you could introduce a truncated timestamp that represents the week it was inserted. Therefore, all items inserted in the same week will show up in the same item collection in your GSI.
Depending on the volume of data you're dealing with, you might also consider separate tables to segregate ranges of data. AWS has an article describing this pattern.
As part of migrating from SQL to DynamoDB, I am trying to create a DynamoDB table. The UI allows users to search based on 4 attributes: start date, end date, name of event, and source of event.
The table has 6 attributes; the above four are a subset of them, the others being priority and location. The query described above makes it mandatory to search on those four values. What's the best way to store the information in DynamoDB so that querying on start date and end date is fairly easy?
I thought of creating a GSI with StartDate as the hash key and EndDate as the range key, plus a GSI on the remaining two attributes?
In short:
My table in DynamoDB will have 6 attributes
EventName, Location, StartDate, EndDate, Priority, and Source.
Query will have 4 mandatory attributes
StartDate, EndDate, Source, and EventName.
Thanks for the help.
You can use greater-than/less-than comparison operators as part of your query: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html
So you could try to build a table with schema:
(EventName (hashKey), "StartDate-EndDate" (sortKey), other attributes)
In this case the sort key is basically a combination of start and end date, allowing you to use >= and <= on it. DynamoDB uses ASCII-based alphabetical ordering, so let's assume your sort key looks like "73644-75223": a condition such as >= "73000" AND <= "76000" would match that event, because the comparison effectively ranges over the start-date prefix of the string.
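A small boto3 sketch of that schema in use; the table name, the combined attribute name StartEndDate, and the sample values are illustrative only:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
events = dynamodb.Table('Events')  # placeholder table name

# The sort key stores "StartDate-EndDate" as one string, e.g. "73644-75223".
response = events.query(
    KeyConditionExpression=(
        Key('EventName').eq('concert')
        & Key('StartEndDate').between('73000', '76000')  # ranges on the start-date prefix
    ),
)
print(response['Items'])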
Additionally, you could create a GSI on your table for each of the remaining attributes that need to be read via a Query, and project into each index the data you want the query to fetch. In contrast to an LSI, queries against a GSI cannot fetch attributes that are not projected. Be aware of the additional read/write costs involved in using GSIs (and LSIs), and of the additional storage required by the projections.
Hope it helps.
I'm new to DynamoDB - I already have an application where the data gets inserted, but I'm getting stuck on extracting the data.
Requirement:
There must be a unique table per customer
Insert documents into the table (each doc has a unique ID and a timestamp)
Get X number of documents based on timestamp (ordered ascending)
Delete individual documents based on unique ID
So far I have created a table with a composite key (S:id, N:timestamp). However, when I come to query it, I realise that since my id is unique and I can't do a wildcard search on it, I won't be able to extract a range of items...
So, how should I design my table to satisfy this scenario?
Edit: Here's what I'm thinking:
Primary index will be composite: (S:customer_id, N:timestamp), where the customer ID is the same within a table. This will enable me to extract data based on a time range.
Secondary index will be a hash key (S:unique_doc_id), with which I will be able to delete items.
Does this sound like the correct solution? Thank you in advance.
You can satisfy the requirements like this:
Your primary key will be hash: customer_id and range: unique_id. This makes sure all the items in the table have different keys.
You will also have an attribute for timestamp and will have a Local Secondary Index on it.
You will use the LSI for requirement 3 and the BatchWriteItem API call to batch-delete for requirement 4.
This solution doesn't require (1): all the customers can stay in the same table. (Heads up: there is a default limit of 256 tables per account before you have to contact AWS to raise it.)
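A hedged sketch of how that design could be used with boto3; the table name Documents and the LSI name TimestampIndex are assumptions:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
docs = dynamodb.Table('Documents')  # placeholder table name

# Requirement 3: get the first X documents for a customer, ordered by timestamp.
def get_docs_by_time(customer_id, limit):
    response = docs.query(
        IndexName='TimestampIndex',  # placeholder LSI: hash=customer_id, range=timestamp
        KeyConditionExpression=Key('customer_id').eq(customer_id),
        ScanIndexForward=True,  # ascending by timestamp
        Limit=limit,
    )
    return response['Items']

# Requirement 4: delete individual documents by their full primary key.
def delete_docs(customer_id, unique_ids):
    with docs.batch_writer() as batch:
        for unique_id in unique_ids:
            batch.delete_item(Key={'customer_id': customer_id, 'unique_id': unique_id})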