As part of migrating from SQL to DynamoDB I am trying to create a DynamoDB table. The UI allows users to search based on 4 attributes: start date, end date, name of event and source of event.
The table has 6 attributes, and the above four are a subset of them, the other attributes being priority and location. The query as described above makes it mandatory to search based on the above four values. What's the best way to store the information in DynamoDB so that I can query based on start date and end date fairly easily?
I thought of creating a GSI with hash key as start date and range key as end date, plus a GSI on the remaining two attributes?
In short:
My table in DynamoDB will have 6 attributes:
EventName, Location, StartDate, EndDate, Priority and Source.
The query will have 4 mandatory attributes:
StartDate, EndDate, Source and EventName.
Thanks for the help.
You can use greater than/less than comparison operators as part of your query: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html
So you could try to build a table with schema:
(EventName (hashKey), "StartDate-EndDate" (sortKey), other attributes)
In this case the sort key is a combination of start and end date. DynamoDB compares string keys byte by byte (ASCII ordering), so a range condition on this key is effectively driven by the start-date prefix; zero-pad the dates to a fixed width so the lexicographic order matches the numeric order. For example, if your sort key is "73644-75223", the condition >= "73000" AND <= "74000" matches that event, because its start date falls in the range. A constraint on the end date would have to be applied separately, e.g. with a filter expression on a plain EndDate attribute.
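For illustration, here is a minimal sketch of that query with the AWS SDK for JavaScript; the table name, event name and key values are assumptions, not something from the original question:

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// Fetch events named "Concert" whose "StartDate-EndDate" sort key falls
// in the desired range. The comparison is lexicographic, so the date
// parts must be zero-padded to a fixed width.
docClient.query({
  TableName: 'Events',
  KeyConditionExpression: 'EventName = :name AND #sk BETWEEN :from AND :to',
  ExpressionAttributeNames: { '#sk': 'StartDate-EndDate' },
  ExpressionAttributeValues: {
    ':name': 'Concert',
    ':from': '73000',
    ':to': '74000'
  }
}, (err, data) => {
  if (err) console.error(err);
  else console.log(data.Items);
});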
Additionally, you could create a GSI on your table for each of the remaining attributes that need to be read via query, projecting into each index the attributes you want to fetch. In contrast to an LSI, a query on a GSI cannot fetch attributes that are not projected. Be aware of the additional read/write costs involved in using a GSI (and LSI), and the additional storage required by the projections.
Hope it helps.
We have a Dynamodb table Events with about 50 million records that look like this:
{
"id": "1yp3Or0KrPUBIC",
"event_time": 1632934672534,
"attr1" : 1,
"attr2" : 2,
"attr3" : 3,
...
"attrN" : N,
}
The Partition Key=id and there is no Sort Key. There can be a variable number of attributes other than id (globally unique) and event_time, which are required.
This setup works fine for fetching by id, but now we'd like to efficiently query against event_time and pull ALL attributes for records that match within a range (could be a million or two items). The criteria would be something like WHERE event_time BETWEEN 1632934671000 AND 1632934672000, for example.
Without changing any existing data or transforming it through an external process, is it possible to create a Global Secondary Index using event_time and projecting ALL attributes, such that it would allow a range query? By my understanding of DynamoDB this isn't possible, but maybe there's another configuration I'm overlooking.
Thanks in advance.
(Edit: I rewrote the answer because the OP's comment clarified that the requirement is to query event_time ranges ignoring id. OP knows the table design is not ideal and is trying to make the best of a bad situation).
Is it possible to create a Global Secondary Index using event_time and projecting ALL attributes that could allow a range query?
Yes. You can add a Global Secondary Index to an existing table and choose which attributes to project. You cannot add an LSI to an existing table or change the table's primary key.
Without changing any existing data or transforming it through an external process?
No. You will need to manipulate the attributes. Although arbitrary range queries are not its strength, DynamoDB has a time-series pattern that can be adapted to your query pattern.
Let's say you query mostly by a limited number of days. You would add a GSI with a yyyy-mm-dd PK (partition key). Rows are made unique by an SK (sort key) that concatenates the timestamp with the id: event_time#id. PK and SK together form the index's composite primary key.
GSI1PK = yyyy-mm-dd # 2022-01-20
GSI1SK = event_time#id # 1642709874551#1yp3Or0KrPUBIC
Querying a single day needs 1 query operation; a calendar week range needs 7.
GSI1PK = "2022-01-20" AND GSI1SK > ""
Query a range within a day by adding a BETWEEN condition on the SK (string comparison against the millisecond timestamp prefix, here covering one whole second):
GSI1PK = "2022-01-20" AND GSI1SK BETWEEN "1642709874" AND "1642709875"
It seems like one can create a global secondary index at any point.
Below is an excerpt from the Managing Global Secondary Indexes documentation, which can be found here: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.OnlineOps.html
To add a global secondary index to an existing table, use the UpdateTable operation with the GlobalSecondaryIndexUpdates parameter.
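A sketch of that call with the AWS SDK for JavaScript; the index and attribute names are the assumed ones from the previous answer, and with provisioned capacity you would also supply ProvisionedThroughput for the index:

const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();

dynamodb.updateTable({
  TableName: 'Events',
  // Key attributes of the new index must be declared here.
  AttributeDefinitions: [
    { AttributeName: 'GSI1PK', AttributeType: 'S' },
    { AttributeName: 'GSI1SK', AttributeType: 'S' }
  ],
  GlobalSecondaryIndexUpdates: [{
    Create: {
      IndexName: 'GSI1',
      KeySchema: [
        { AttributeName: 'GSI1PK', KeyType: 'HASH' },
        { AttributeName: 'GSI1SK', KeyType: 'RANGE' }
      ],
      Projection: { ProjectionType: 'ALL' }
    }
  }]
}, (err) => {
  if (err) console.error(err);
  else console.log('Index creation started; it backfills in the background.');
});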
I have a DynamoDB table which stores creation_date as an epoch timestamp in string format. This date is neither the hash key nor the sort key. The ultimate goal is querying creation_date for a range, i.e. I need all the ids in the given time range.
The table schema is:
id, version, creation_date, info.
id is the hash key and version is the sort key.
I was thinking of creating a CloudSearch domain and linking that to the DynamoDB table. Is it possible to use a range query in CloudSearch from Java if the date is in string format? If yes, how?
Here’s how you can accomplish this in DynamoDB using a GSI with a hash key of creation_y_m and a GSI range key of creation_date.
When you’re querying for a range of creation dates, you need to do a bit of date manipulation to find out all of the months in between your two dates, but then you can query your GSI with a key condition expression like this one.
creation_y_m = 2019-02 AND creation_date BETWEEN 2019-02-05T12:00:00Z AND 2019-02-18T06:00:00Z
Given that most of your queries cover a two-week range, you will usually have to make only one or two queries to get all of the items.
You may need to backfill the creation_y_m field, but it’s fairly straightforward to do that by scanning your table and updating each item to have the new attribute.
There are, of course, many variations on this. You could tweak how granular your hash key is (maybe you want just year, maybe you want year-month-day). You could use epoch time instead of ISO 8601 strings.
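A sketch of the month-spanning version in JavaScript, assuming ISO 8601 strings; the table name MyTable and the index name creation_y_m-index are assumptions:

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// List every yyyy-mm bucket between the two dates.
function monthsBetween(fromIso, toIso) {
  const months = [];
  const d = new Date(fromIso.slice(0, 7) + '-01T00:00:00Z');
  const end = new Date(toIso);
  while (d <= end) {
    months.push(d.toISOString().slice(0, 7)); // e.g. "2019-02"
    d.setUTCMonth(d.getUTCMonth() + 1);
  }
  return months;
}

// Query each month bucket; pagination via LastEvaluatedKey is omitted.
async function queryRange(fromIso, toIso) {
  const items = [];
  for (const month of monthsBetween(fromIso, toIso)) {
    const res = await docClient.query({
      TableName: 'MyTable',
      IndexName: 'creation_y_m-index',
      KeyConditionExpression:
        'creation_y_m = :m AND creation_date BETWEEN :from AND :to',
      ExpressionAttributeValues: { ':m': month, ':from': fromIso, ':to': toIso }
    }).promise();
    items.push(...res.Items);
  }
  return items;
}

queryRange('2019-02-05T12:00:00Z', '2019-02-18T06:00:00Z')
  .then(items => console.log(items.length));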
I'm pretty new to DynamoDB design and trying to get the correct schema for my application. In this app different users will enter various attributes about their day. For example "User X, March 1st 12:00-2:00, Tired". There could be multiple entries for a given time, or overlapping times (e.g. tired from 12-2 and eating lunch from 12-1).
I'll need to query based on user and time ranges. Common queries:
Give me all the "actions" for user X between time t1 and t2
Give me all the start times for action Z for user X
My initial thought was that the partition key would be userid and the range key the start time, but that won't work because of duplicate start times, right?
A second thought:
UserID - Partition Key
StartTime - RangeKey
Action - JSON document of all actions for that start time
[{ "action": "Lunch", "endTime": "1pm" }, { "action": "Tired", "endTime": "2pm" }]
Any recommendation on a proper schema?
There isn't really one solution to this. You will need to evaluate multiple options depending on your use case: how much data you have, how often you will query, by which fields, and so on.
But one good solution is to structure your schema like this:
Generated UUID as partition key
UserID
Start time (in unix epoch time or ISO8601 time format)
Advantages
Can handle multiple time zones
Can easily query by userID and start date (you will need a secondary index with partition key userID and sort key start time; see the sketch after this list)
More even distribution and fewer hot keys across DynamoDB partitions, because of the randomly generated partition key
Disadvantages
More data for every item (+16 bytes for the UUID)
Additional cost for the new secondary index; note that scanning the data in the table is generally much more expensive than maintaining a secondary index
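For the first query ("all actions for user X between t1 and t2"), a minimal sketch against that secondary index; the table name, index name and times are assumptions:

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// All actions for one user in a time window, via the userId/startTime index.
docClient.query({
  TableName: 'UserActions',
  IndexName: 'userId-startTime-index',
  KeyConditionExpression: 'userId = :u AND startTime BETWEEN :t1 AND :t2',
  ExpressionAttributeValues: {
    ':u': 'userX',
    ':t1': '2021-03-01T12:00:00Z',
    ':t2': '2021-03-01T14:00:00Z'
  }
}).promise().then(res => console.log(res.Items));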
This is pretty close to your initial thought; to give a more precise answer we would need a lot more information about how many writes and reads you are planning, and what kind of queries you will need.
You are right that UserID as partition key and StartTime as range key would be the obvious choice, if it weren't for your overlapping activities.
I would consider going for
UserID - Partition Key
StartTime + uuid - RangeKey
StartTime - Plain old attribute
Datetimes in DynamoDB are stored as strings anyway. So the idea here is that you use StartTime plus some uuid as your range key, which gives you a table sortable by datetime while also ensuring unique primary keys. You can then store the StartTime in a separate attribute, or have a function for adding/removing the uuid from the StartTime + uuid attribute.
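A minimal sketch of that idea; the table name, attribute names and times are assumptions:

const AWS = require('aws-sdk');
const { randomUUID } = require('crypto'); // Node 14.17+
const docClient = new AWS.DynamoDB.DocumentClient();

const startTime = '2021-03-01T12:00:00Z';

// Write: the range key is the datetime plus a uuid, so two actions with
// the same start time do not collide.
docClient.put({
  TableName: 'UserActions',
  Item: {
    userId: 'userX',
    startTimeId: `${startTime}#${randomUUID()}`, // range key
    startTime,                                   // plain attribute
    action: 'Lunch',
    endTime: '2021-03-01T13:00:00Z'
  }
}).promise();

// Read: the uuid is only a suffix, so string comparison on the composite
// key still selects by time range.
docClient.query({
  TableName: 'UserActions',
  KeyConditionExpression: 'userId = :u AND startTimeId BETWEEN :t1 AND :t2',
  ExpressionAttributeValues: {
    ':u': 'userX',
    ':t1': '2021-03-01T00:00:00Z',
    ':t2': '2021-03-02T00:00:00Z'
  }
}).promise().then(res => console.log(res.Items));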
I know the query below is not supported in DynamoDB since you must use an equality expression on the HASH key.
query({
TableName,
IndexName,
KeyConditionExpression: 'purchases >= :p',
ExpressionAttributeValues: { ':p': 6 }
});
How can I organize my data so I can efficiently make a query for all items purchased >= 6 times?
Right now I only have 3 columns, orderID (Primary Key), address, confirmations (GSI).
Would it be better to use a different type of database for this type of query?
You would probably want to use the DynamoDB streams feature to perform aggregation into another DynamoDB table. The streams feature will publish events for each change to your data, which you can then process with a Lambda function.
I'm assuming that in your primary table you track each purchase by incrementing a counter. A simple version of the logic might be: on each update, check the item's purchase count, and if it is >= 6, add the item ID to a list attribute (itemIDs or similar) in another DynamoDB table. Depending on how you want to query this statistic, you might create a new entry every day, hour, etc.
Bear in mind DynamoDB has a 400KB limit per item, so this may not be the best solution depending on how many items you would need to capture in the itemIDs attribute for a given time period.
You would also need to consider how you reset your purchases counter (this might be a scheduled batch job where you reset purchase count back to zero every x time period).
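As a rough illustration of the streams approach, here is a hypothetical Lambda handler; the aggregate table name PopularItems, the attribute names, and a stream configured to include NEW_IMAGE are all assumptions:

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event) => {
  for (const record of event.Records) {
    if (record.eventName !== 'MODIFY') continue;
    // Convert the stream's wire format into a plain JS object.
    const item = AWS.DynamoDB.Converter.unmarshall(record.dynamodb.NewImage);
    if (item.purchases >= 6) {
      // Add the order ID to today's aggregate entry (a string set).
      await docClient.update({
        TableName: 'PopularItems',
        Key: { period: new Date().toISOString().slice(0, 10) },
        UpdateExpression: 'ADD itemIDs :id',
        ExpressionAttributeValues: { ':id': docClient.createSet([item.orderID]) }
      }).promise();
    }
  }
};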
Alternatively you could capture the time period in your primary table and create a GSI that is partitioned based upon time period and has purchases as the sort key. This way you could efficiently query (rather than scan) based upon a given time period for all items that have purchase count of >= 6.
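A sketch of that query shape, with hypothetical period and purchases attributes on a GSI named period-purchases-index:

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// Items in one time period with at least 6 purchases, via the GSI.
// purchases is the index sort key, so >= is a key condition, not a filter.
docClient.query({
  TableName: 'Orders',
  IndexName: 'period-purchases-index',
  KeyConditionExpression: '#p = :period AND purchases >= :min',
  ExpressionAttributeNames: { '#p': 'period' },
  ExpressionAttributeValues: { ':period': '2021-09', ':min': 6 }
}).promise().then(res => console.log(res.Items));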
You don't need to reorganise your data; just use a scan instead of a query:
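// Note: FilterExpression is applied after items are read, so this scan
// still consumes read capacity for every item in the index.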
scan({
TableName,
IndexName,
FilterExpression: 'purchases >= :p',
ExpressionAttributeValues: { ':p': 6 }
});
In DynamoDB, I have a table where each record has two date attributes, create_date and last_modified_date. These dates are in ISO-8601 format e.g. 2016-01-22T16:19:52.464Z.
I need to have a way of querying them based on the create_date and last_modified_date e.g.
get all records where create_date > [some_date]
get all records where last_modified_date < [some_date]
In general, I need to get all records where [date_attr] [comparison_op] [some_date].
One way of doing it is to insert a dummy fixed attribute with each record and create an index with the dummy attribute as the partition key and create_date as the sort key (likewise for last_modified_date).
Then I'll be able to query it by providing the fixed dummy attribute as the partition key and the date attribute as the sort key, with any comparison operator: <, >, <=, >=, and so on.
But this doesn't seem good and looks like a hack instead of a proper solution/design. Are there any better solutions?
There are some things that NoSQL DBs are not good at, but you can solve this with one of the following approaches:
Move this table's data to a SQL database for search purposes: this can be effective because you will be able to query exactly as you require, but it can be tedious because you need to keep the two databases synchronized.
Integrate with Amazon CloudSearch: you can integrate this table with CloudSearch and then query CloudSearch rather than your DynamoDB table.
Integrate with Elasticsearch: Elasticsearch is similar to CloudSearch, although each has its pros and cons; the end result is the same - rather than querying DynamoDB, you query Elasticsearch.
As you have mentioned in your question, add GSIs, as sketched below.
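For completeness, a minimal sketch of that GSI approach, using a hypothetical constant attribute gsi_pk = 'ALL' as the index partition key and create_date as its sort key (all names are assumptions; the single constant partition key is exactly why the question calls this a hack):

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// All records with create_date after a given ISO 8601 timestamp.
docClient.query({
  TableName: 'Records',
  IndexName: 'create_date-index',
  KeyConditionExpression: 'gsi_pk = :all AND create_date > :d',
  ExpressionAttributeValues: {
    ':all': 'ALL',
    ':d': '2016-01-22T16:19:52.464Z'
  }
}).promise().then(res => console.log(res.Items));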