I'm pretty new to DynamoDB design and trying to get the correct schema for my application. In this app different users will enter various attributes about their day. For example "User X, March 1st 12:00-2:00, Tired". There could be multiple entries for a given time, or overlapping times (e.g. tired from 12-2 and eating lunch from 12-1).
I'll need to query based on user and time ranges. Common queries:
Give me all the "actions" for user X between time t1 and t2
Give me all the start times for action Z for user X
My initial thought was that the partition key would be userid and range key for the start time, but that wont work because of duplicate start times right?
A second thought:
UserID - Partition Key
StartTime - RangeKey
Action - JSON document of all actions for that start time
[{ action: "Lunch", endTime:"1pm"},{action:tired, endTime:"2pm"}]
Any recommendation on a proper schema?
This doesn't really have a one solution. And you will need to evaluate multiple options depending on your use case how much data you have/how often would you query and by which fields etc.
But one good solution is to partition your schema like this.
Generated UUID as partition key
UserID
Start time (in unix epoch time or ISO8601 time format)
Advantages
Can handle multiple time zones
Can easily query for userID and start date (you will need secondary index with primary key userID and sort key start time)
More even distribution and less hot keys of your data across dynamoDB partitions because of randomly generated primary key.
Disadvantages
More data for every item (because of UUID) (+16 bytes)
Additional cost for new secondary index, note scanning the data in table is generally much more expensive than having secondary index.
This is pretty close to your initial thought, in order to get a bit more precise answer we will need a lot more information about how many writes and reads are you planning, and what kind of queries you will need.
You are right in that UserID as Partition key and StartTime as rangeKey would be the obvious choice, if it wasn't for the fact of your overlapping activities.
I would consider going for
UserID - Partition Key
StartTime + uuid - RangeKey
StartTime - Plain old attribute
Datetimes in DynamoDB just get stored as strings anyway. So the idea here is that you have StartTime + some uuid as your rangekey, which gives you a sortable table based on datetime whilst also assuring you have unique primary keys. You could then store the StartTime in a separate attribute or have a function for adding/removing the uuid from the StartTime + uuid attribute.
Related
I'm new to DynamoDB and would like some help on how to best structure things, and whether it's the right tool for the job.
Let's say I have thousands of users signed up to receive messages. They can choose to receive messages every half hour, hour, couple of hours or every 4 hours. So essentially there is a schedule attribute for each user's message. Users can also specify a time window for when they receive these messages, e.g. 09:00 - 17:00 and also toggle an active state.
I want to be able to easily get the messages to send to the various users at the right time.
If done in SQL this would be really easy, with something like:
Select * from UserMessageSchedules
where
now() > startTime
and now() < endTime
and userIsActive
and schedule = 'hourly'
But I'm struggling to do something similar in DynamoDB. At first I thought I'd have the following schema:
userId (partion Key)
messageId (sort key)
schedule (one of half_hour, hour, two_hours, four_hours)
startTime_userId
endTime
I'd create a Global Secondary Index with the 'schedule' attribute being the partition key, and startTime + userId being the sort key.
I could then easily query for messages that need sending after a startTime.
But I'd still have to check endTime > now() within my lambda. Also, i'd be reading in most of the table, which seems inefficient and may lead to throughput issues?
And with the limited number of schedules, would I get hot partitions on the GSI?
So I then thought rather than sending messages from a table designed to store users preferences, I could process this table when an entry is made/edited and populate a toSend table, which would look like this:
timeSlot (pk) timeSlot_messageId (sk)
00:30 00:30_Message1_Id
00:30 00:30_Message2_Id
01:00 01:00_Message1_Id
Finding the messages to send at certain time would be nice and fast as I'd just query on the timeSlot
But again I'm worried about hot spots and throughput. Is it ok for each partition to have 1000's rows and for just that partition to be read? Are there any other problems with this approach?
Another possibility would be to have different tables (rather than partitions) for each half hour when something could be sent
e.g, toSendAt_00:30, toSendAt_01:00, toSendAt_01:30
and these would have the messageId as the primary key and would contain the data needing to be sent. I'd just scan the table. Is this overkill?
Rather than do big reads of data every half an hour, would I be better duplicating the data into Elastic Search and querying this?
Thanks!
I have a dynamodb table which stores creation_date epoch in string format. This date is neither hash key nor sort key. Ultimate goal is querying the creation_date for a range i.e. I need all the ids in the give time range.
The table schema is:
id, version, creation_date, info.
id is hash key and version is sort key.
I was thinking of creating a cloudsearch domain and link that to dynamodb table. Is it possible to use a range query in cloudsearch using java if the date is in string format? If yes how?
Here’s how you can accomplish this in DynamoDB using a GSI with a hash key of creation_y_m and a GSI range key of creation_date.
When you’re querying for a range of creation dates, you need to do a bit of date manipulation to find out all of the months in between your two dates, but then you can query your GSI with a key condition expression like this one.
creation_y_m = 2019-02 AND creation_date BETWEEN 2019-02-05T12:00.00Z AND 2019-02-18T06:00:00Z
Given that most of your queries are a two week range, you will usually only have to make only one or two queries to get all of the items.
You may need to backfill the creation_y_m field, but it’s fairly straightforward to do that by scanning your table and updating each item to have the new attribute.
There are, of course, many variations on this. You could tweak how granular your hash key is (maybe you want just year, maybe you want year-month-day). You could use epoch time instead of ISO 8601 strings.
I am looking to store a log of user events. It is going to be a lot of entries so I thought DynamoDB would be good as everything else is hosted there.
I need to query these events in two ways, totalt of events for a user for a date (range) and occasionally all the events for a date.
I was thinking to store it in one table as user id (key), sequence number (key), date, time and duration.
Should it be multiple tables? How can this be done most efficient?
For a small amount of data this structure is ok.
Keep in mind that the sequence number (your range key) has to be provided by you. It seems a good idea to choose the date as a unix timestamp with a milliseconds accuracy as a sort key.
There is no need for extra tables.
However your structure depends largely on the read write capacity that you want to achieve, and the data size.
Supposing your user_id is your partition key.
For every distinct partition key value, the total sizes of all table and index items cannot exceed 10 GB.
A single partition can support a maximum of 3,000 read capacity units or 1,000 write capacity units.
You need to create your partition keys by taking into consideration these limitations.
For example a very active user has many events thus you need more than 1000 write capacity units. Unfortunately you have choosen as a partition the user id.
In this case you are limited to 1000 write capacity units therefore you might have failures.
You need to have a different structure. For example a partition name like
user_id_1 user_id_2 etc. Therefore a partition naming mechanism spreading the data to partitions according to your application's needs.
Check these links on dynamodb limitations.
Tables guidance,
Partition distribution
I would suggest the following structure for your events table:
user id -- hash key
event date/time (timestamp with milliseconds) -- range key
duration
Having event timestamp as a range key should be sufficient to provide uniqueness for an event (unless a user can have multiple events right in the same millisecond), so you don't need a sequence number.
Having such a schema, you can get all events for a user for a date by using simple query.
Unfortunately, DynamoDB do not support aggregate queries, so you can't get a total number of events for a user quickly (you would have to query all records and calculate total manually).
So I would suggest creating a separate table for user events statistics like this:
user id -- hash key
date -- range key
events_cnt (total number of events for a user for a date)
So, after you add a new record into your events table, you have to increment events counter for the user in statistics table like shown below:
var dynamodbDoc = new AWS.DynamoDB.DocumentClient();
var params = {
TableName : "user_events_stats",
Key: {
userId: "65716110-f4df-11e6-bc64-92361f002671" ,
date: "2017-02-17",
},
UpdateExpression: "SET #events_cnt = if_not_exists(#events_cnt, :zero) + :one",
ExpressionAttributeNames: {
"#events_cnt": "events_cnt",
},
ExpressionAttributeValues: {
":one": 1,
":zero": 0,
},
};
dynamodbDoc.update(params, function(err, data) {
});
Coming from a SQL background, I understand the high-level concepts on NoSQL but still having troubles trying to translating some basic usage scenario. I am hoping someone can help.
My application simply record a location, a timestamp, and tempature for every second of the day. So we end up having 3 basic columns:
1) location
2) timestamp
3) and temperature
(All field are numbers and I'm storing the timestamp as an epoch for easy range querying)
I setup dynamodb with the location as the primary key, and the timestamp as the sortkey and temp as an attribute. This results in a composite key on location and timestamp which allows each location to have its own unique timestamp but not allow any individual location to have more than one identical timestamp.
Now comes the real-world queries:
Query each site for a time range (Works fine)
Query for any particular time-range return all temps for all locations (won't work)
So how would you account for the 2nd scenario? This is were I get hung up... Is this were we get into secondary indexes and things like that? For those of you smarter than me, how would you deal with this?
Thanks in advance for you help!
-D
you cant query for range of values in dynamodb. you can query for a range of values (range keys) that belongs to a certain value (hash key)
its not matter if this is table key, local secondary index key, or global secondary index (secondary index are giving you another query options..)
lets back to your scenario:
if timestamp is in seconds and you want to get all records between 2 timestamps then you can add another field 'min_timestamp'.
this field can be your global secondary hash key, and timestamp will be your global secondary range key.
now you can get all records that logged in a certain minute.
if you want a range of minutes, then you need to perform X queries (if X its the range of minutes)
you can also add another field 'hour_timestamp' (that hash key contains all records in a certain hour) and goes on... - but this approach is very dangerous - you going to update many records with the same hash key in the same point of time, and you can get many throughput errors...
As part of migrating from SQL to DynamoDB I am trying to create a DynamoDB table. The UI allows users to search based on 4 attributes start date, end date, name of event and source of event.
The table has 6 attributes and the above four are subset of it with other attributes being priority and location. The query as described above makes it mandatory to search based on the above four values. whats the best way to store the information in DynamoDB that will help me in querying based on start date and end date fairly easy.
I thought of creating a GSI with hashkey as startdate, rangekey as end date and GSI on the rest two attributes ?
Inshort:
My table in DynamoDB will have 6 attributes
EventName, Location, StartDate, EndDate, Priority and source.
Query will have 4 mandatory attributes
StartDate, EndDate, Source and Event Name.
Thanks for the help.
You can use greater than/less than comparison operators as part of your query http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html
So you could try to build a table with schema:
(EventName (hashKey), "StartDate-EndDate" (sortKey), other attributes)
In this case the sort-key is basically a combination of start and end date allowing you to use >= (on the first part) and <= (on the second part)... dynamodb uses ASCII based alphabetical ordering... so lets assume your sortKey looks like the following: "73644-75223" you could use >= "73000-" AND <= "73000-76000" to get the given event.
Additionally, you could create a GSI on your table for each of your remaining attributes that need to be read via query. You then could project data into your index that you want to fetch with the query. In contrast to LSI, queries from GSI do not fetch attributes that are not projected. Be aware of the additional costs (read/write) involved by using GSI (and LSI)... and the additional memory required by data projections...
Hope it helps.