I'm new to DynamoDB and would like some help on how to best structure things, and whether it's the right tool for the job.
Let's say I have thousands of users signed up to receive messages. They can choose to receive messages every half hour, every hour, every two hours, or every four hours, so essentially there is a schedule attribute for each user's message. Users can also specify a time window for when they receive these messages, e.g. 09:00 - 17:00, and can toggle an active state.
I want to be able to easily get the messages to send to the various users at the right time.
If done in SQL this would be really easy, with something like:
SELECT * FROM UserMessageSchedules
WHERE now() > startTime
  AND now() < endTime
  AND userIsActive
  AND schedule = 'hourly'
But I'm struggling to do something similar in DynamoDB. At first I thought I'd have the following schema:
userId (partition key)
messageId (sort key)
schedule (one of half_hour, hour, two_hours, four_hours)
startTime_userId
endTime
I'd create a Global Secondary Index with the 'schedule' attribute being the partition key, and startTime + userId being the sort key.
I could then easily query for messages that need sending after a startTime.
But I'd still have to check endTime > now() within my Lambda. Also, I'd be reading in most of the table, which seems inefficient and may lead to throughput issues?
And with the limited number of schedules, would I get hot partitions on the GSI?
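As a sketch, the GSI query described above could be expressed with the low-level DynamoDB Query parameters below (the table, index, and attribute names are assumptions matching this post's schema, not confirmed names). A FilterExpression can push the endTime and active checks server-side instead of doing them in the Lambda, though filtered-out items still consume read capacity:

```python
from datetime import datetime, timezone

# Hypothetical GSI name for the index described above
GSI_NAME = "schedule-startTime-index"


def build_due_messages_query(schedule: str, now_hhmm: str) -> dict:
    """Build Query parameters for one schedule bucket.

    Key condition: partition on `schedule`, sort key `startTime_userId`
    starting at or before `now`. The endTime/active checks go into a
    FilterExpression (which still consumes read capacity for the
    filtered-out items).
    """
    return {
        "TableName": "UserMessageSchedules",  # assumed table name
        "IndexName": GSI_NAME,
        "KeyConditionExpression": "#s = :sched AND startTime_userId <= :now",
        "FilterExpression": "endTime >= :hhmm AND userIsActive = :active",
        "ExpressionAttributeNames": {"#s": "schedule"},
        "ExpressionAttributeValues": {
            # "\uffff" sorts after any userId suffix in the composite sort key
            ":now": {"S": now_hhmm + "_\uffff"},
            ":sched": {"S": schedule},
            ":hhmm": {"S": now_hhmm},
            ":active": {"BOOL": True},
        },
    }


params = build_due_messages_query(
    "hourly", datetime.now(timezone.utc).strftime("%H:%M")
)
```

A boto3 client could then run `boto3.client("dynamodb").query(**params)`, paginating with `LastEvaluatedKey`.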
So I then thought that, rather than sending messages from a table designed to store user preferences, I could process this table whenever an entry is made or edited and populate a toSend table, which would look like this:
timeSlot (pk)    timeSlot_messageId (sk)
00:30            00:30_Message1_Id
00:30            00:30_Message2_Id
01:00            01:00_Message1_Id
Finding the messages to send at a certain time would be nice and fast, as I'd just query on the timeSlot.
But again I'm worried about hot spots and throughput. Is it OK for each partition to have thousands of rows, and for just that partition to be read? Are there any other problems with this approach?
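For reference, the per-slot read described above is a single-partition Query, sketched here with assumed table and attribute names. A partition with thousands of items is fine in itself; the concerns are the per-partition throughput ceiling (roughly 3,000 RCU / 1,000 WCU) and the 1 MB page size, which forces pagination:

```python
def build_timeslot_query(time_slot: str) -> dict:
    """Query the hypothetical `toSend` table for one half-hour slot.

    Reads only that slot's item collection; results come back in 1 MB
    pages, so a large slot needs LastEvaluatedKey pagination.
    """
    return {
        "TableName": "toSend",  # assumed table name
        "KeyConditionExpression": "timeSlot = :slot",
        "ExpressionAttributeValues": {":slot": {"S": time_slot}},
    }


query = build_timeslot_query("00:30")
# With boto3, a paginated read would look roughly like:
# client = boto3.client("dynamodb")
# for page in client.get_paginator("query").paginate(**query):
#     for item in page["Items"]:
#         ...  # send the message
```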
Another possibility would be to have different tables (rather than partitions) for each half hour when something could be sent
e.g, toSendAt_00:30, toSendAt_01:00, toSendAt_01:30
and these would have the messageId as the primary key and would contain the data needing to be sent. I'd just scan the table. Is this overkill?
Rather than doing big reads of data every half hour, would I be better off duplicating the data into Elasticsearch and querying that?
Thanks!
I have an asp.net application deployed in azure. This generates plenty of logs, some of which are exceptions. I do have a query in Log Analytics Workspace that picks up exceptions from logs.
I would like to know what is the best and/or cheapest way to detect anomalies in the exception count over a time period.
For example, if the average number of exceptions for every hour is N (based on information collected over the past 1 month or so), and if average goes > N+20 at any time (checked every 1 hour or so), then I need to be notified.
N would be dynamically changing based on trend.
I would like to know what is the best and/or cheapest way to detect anomalies in the exception count over a time period.
Yes, you can achieve this with the following steps:
Store the average value in a stored query result in Azure, using the .set stored_query_result command.
There are some limitations on how long the result is kept; refer to the Microsoft documentation for detailed information.
Note: a stored query result is only available for 24 hours.
The workaround follows:
1. Set the stored query result:
// Here I am using a stored query result to hold the average trace message count per 5-hour bin
.set stored_query_result average <|
traces
| summarize events = count() by bin(timestamp, 5h)
| summarize avg(events)
2. Once the query result is set, you can use it in another KQL query (the stored value stays available for 24 hours):
// Retrieve the stored query result
stored_query_result(<StoredQueryResultName>)
| ... // query follows as per your need
3. Schedule the alert.
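Putting the steps together, the alert condition itself could be sketched as a KQL query like the following. This is a sketch under assumptions: the stored query result is named `average` as in the snippet above, and `summarize avg(events)` produces a column named `avg_events`; the fixed `+ 20` threshold comes from the question.

```kusto
// Sketch: fire when the last hour's count exceeds the stored average by more than 20.
// Assumes the stored query result created above is named `average`
// and exposes the column avg_events.
let baseline = toscalar(stored_query_result("average") | project avg_events);
traces
| where timestamp > ago(1h)
| summarize events = count()
| where events > baseline + 20
```

A log search alert rule scheduled hourly on this query, firing when it returns any rows, gives the notification behaviour described in the question.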
I have my dynamo db table as follows:
HashKey (Date), RangeKey (timestamp)
The table stores each day's data under the date (hash key) with a timestamp (range key).
Now I want to query data of last 7 days.
Can I do this in one query, or do I need to call DynamoDB 7 times, once for each day? The order of the data does not matter, so can someone suggest an efficient query to do that?
I think you have a few options here.
BatchGetItem - The BatchGetItem operation returns the attributes of one or more items from one or more tables, identified by primary key. You could specify all 7 primary keys and fire off a single request. Note that with a composite key you must supply both the hash and range key for every item, so this only works if you already know the exact timestamps you want.
7 calls to DynamoDB. Not ideal, but it'd get the job done.
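The 7-call option can be sketched like this. The table name, key attribute names, and date format are assumptions; the boto3 Query calls themselves are shown only as comments since they need live credentials, but the per-day key computation is plain Python:

```python
from datetime import date, timedelta


def last_7_day_keys(today: date) -> list[str]:
    """Hash-key values (ISO dates) for the last 7 days, today inclusive."""
    return [(today - timedelta(days=i)).isoformat() for i in range(7)]


keys = last_7_day_keys(date(2021, 2, 14))
# One Query per day; with boto3 this would be roughly:
# table = boto3.resource("dynamodb").Table("Events")  # assumed table name
# items = [item
#          for k in keys
#          for item in table.query(
#              KeyConditionExpression=Key("Date").eq(k))["Items"]]
```

The 7 Query calls are independent, so they can also be issued in parallel to cut latency.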
Introduce a global secondary index that projects your data into the shape your application needs. For example, you could introduce an attribute that represents an entire week by using a truncated timestamp:
2021-02-08 (represents the week of 02/08/21T00:00:00 - 02/14/21T23:59:59)
2021-02-15 (represents the week of 02/15/21T00:00:00 - 02/21/21T23:59:59)
I call this a "truncated timestamp" because I am effectively ignoring the HH:MM:SS portion of the timestamp. When you create a new item in DDB, you could introduce a truncated timestamp that represents the week it was inserted. Therefore, all items inserted in the same week will show up in the same item collection in your GSI.
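One way to compute such a truncated timestamp at insert time, assuming weeks start on Monday as in the example dates above:

```python
from datetime import datetime, timedelta


def week_bucket(ts: datetime) -> str:
    """Truncate a timestamp to the Monday of its week.

    The result is the GSI partition key, so every item written in the
    same Mon-Sun week lands in the same item collection.
    """
    monday = ts.date() - timedelta(days=ts.weekday())
    return monday.isoformat()
```

Querying "last 7 days" then touches at most two weekly partitions (the current week and the previous one), instead of 7 daily ones.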
Depending on the volume of data you're dealing with, you might also consider separate tables to segregate ranges of data. AWS has an article describing this pattern.
I'm pretty new to DynamoDB design and trying to get the correct schema for my application. In this app different users will enter various attributes about their day. For example "User X, March 1st 12:00-2:00, Tired". There could be multiple entries for a given time, or overlapping times (e.g. tired from 12-2 and eating lunch from 12-1).
I'll need to query based on user and time ranges. Common queries:
Give me all the "actions" for user X between time t1 and t2
Give me all the start times for action Z for user X
My initial thought was that the partition key would be userid and the range key the start time, but that won't work because of duplicate start times, right?
A second thought:
UserID - Partition Key
StartTime - RangeKey
Action - JSON document of all actions for that start time
[{"action": "Lunch", "endTime": "1pm"}, {"action": "Tired", "endTime": "2pm"}]
Any recommendation on a proper schema?
There isn't really one single solution to this, and you will need to evaluate multiple options depending on your use case: how much data you have, how often you will query it, by which fields, and so on.
But one good solution is to structure your schema like this:
Generated UUID as partition key
UserID
Start time (in unix epoch time or ISO8601 time format)
Advantages
Can handle multiple time zones
Can easily query by userID and start date (you will need a secondary index with partition key userID and sort key start time)
More even distribution of your data across DynamoDB partitions, and fewer hot keys, because of the randomly generated partition key
Disadvantages
More data for every item (because of UUID) (+16 bytes)
Additional cost for the new secondary index; note that scanning the table is generally much more expensive than maintaining a secondary index
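A sketch of an item under this suggested schema. The attribute names, the UTC normalization, and the idea of serving the per-user queries from a GSI (partition key UserID, sort key StartTime) are assumptions layered on the answer above:

```python
import uuid
from datetime import datetime, timezone


def make_action_item(user_id: str, start: datetime,
                     end: datetime, action: str) -> dict:
    """Item with a random UUID partition key.

    UserID and StartTime are plain attributes here; a GSI on
    (UserID, StartTime) -- assumed name UserId-StartTime-index --
    would serve "all actions for user X between t1 and t2".
    """
    return {
        "id": str(uuid.uuid4()),  # partition key, evenly distributed
        "UserID": user_id,
        "StartTime": start.astimezone(timezone.utc).isoformat(),
        "EndTime": end.astimezone(timezone.utc).isoformat(),
        "Action": action,
    }
```

Storing times normalized to UTC in ISO8601 keeps the GSI sort key lexicographically ordered regardless of the user's time zone.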
This is pretty close to your initial thought; to give a more precise answer, we would need a lot more information about how many writes and reads you are planning, and what kinds of queries you will need.
You are right that UserID as partition key and StartTime as range key would be the obvious choice, if it weren't for your overlapping activities.
I would consider going for
UserID - Partition Key
StartTime + uuid - RangeKey
StartTime - Plain old attribute
Datetimes in DynamoDB are stored as strings anyway. So the idea here is to use StartTime plus some uuid as your range key, which gives you a table sortable by datetime while also guaranteeing unique primary keys. You could then store the StartTime in a separate attribute, or have a function that adds/removes the uuid from the StartTime + uuid attribute.
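A minimal sketch of composing and decomposing such a range key; the `#` separator and the helper names are assumptions:

```python
import uuid


def make_range_key(start_time_iso: str) -> str:
    """ISO8601 StartTime plus a uuid suffix: two actions can share a
    StartTime yet still have unique primary keys. ISO8601 strings sort
    lexicographically, so BETWEEN-style key conditions still work."""
    return f"{start_time_iso}#{uuid.uuid4()}"


def start_time_of(range_key: str) -> str:
    """Strip the uuid suffix to recover the plain StartTime."""
    return range_key.split("#", 1)[0]
```

A range query for a time window then uses the bare timestamps as bounds, e.g. a key condition of `StartTime BETWEEN "2021-03-01T12:00:00" AND "2021-03-01T14:00:00#\uffff"`.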
Coming from a SQL background, I understand the high-level concepts of NoSQL but I'm still having trouble translating some basic usage scenarios. I'm hoping someone can help.
My application simply records a location, a timestamp, and a temperature for every second of the day. So we end up having 3 basic columns:
1) location
2) timestamp
3) and temperature
(All fields are numbers, and I'm storing the timestamp as an epoch for easy range querying.)
I set up DynamoDB with the location as the partition key, the timestamp as the sort key, and temp as an attribute. This results in a composite key on location and timestamp, which allows each location to have its own timestamps but prevents any individual location from having duplicate timestamps.
Now comes the real-world queries:
Query each site for a time range (Works fine)
Query for any particular time range, returning all temps for all locations (won't work)
So how would you account for the 2nd scenario? This is where I get hung up... Is this where we get into secondary indexes and things like that? For those of you smarter than me, how would you deal with this?
Thanks in advance for your help!
-D
You can't query for a range of values across the whole table in DynamoDB; you can only query for a range of values (range keys) that belong to a certain value (hash key).
It doesn't matter whether this is the table key, a local secondary index key, or a global secondary index key (secondary indexes just give you additional query options).
Let's get back to your scenario:
If the timestamp is in seconds and you want all records between 2 timestamps, you can add another field, 'min_timestamp'.
This field can be your global secondary hash key, with timestamp as your global secondary range key.
Now you can get all records logged in a certain minute.
If you want a range of minutes, you need to perform X queries (where X is the number of minutes in the range).
You could also add another field, 'hour_timestamp' (a hash key that covers all records in a certain hour), and so on - but this approach is dangerous: you would be writing many records with the same hash key at the same point in time, and you could get many throughput errors...
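The minute-bucket idea above can be sketched as follows; the function names are hypothetical:

```python
def minute_bucket(epoch_seconds: int) -> int:
    """Truncate an epoch-seconds timestamp to the start of its minute,
    giving the proposed min_timestamp GSI hash key."""
    return epoch_seconds - (epoch_seconds % 60)


def minute_buckets(start: int, end: int) -> list[int]:
    """All minute buckets covering [start, end]: one Query per bucket,
    which is the X queries mentioned above."""
    return list(range(minute_bucket(start), minute_bucket(end) + 60, 60))
```

Each bucket value becomes one GSI Query (hash key `min_timestamp = bucket`), optionally with a timestamp range condition on the first and last buckets to trim the edges of the window.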
There are several date/time fields in this view, and I can never wrap my head around which column (and which secondary column) to order by in order to retrieve a list of SQL statements in the order in which they were executed on the server.
StartTime - The timestamp associated with when the query was submitted to Teradata for parsing.
FirstStepTime - The timestamp associated with when the first step of the query was executed.
FirstRespTime - The timestamp associated with when the first row was returned to the client.
The gap in time between StartTime and FirstStepTime includes parsing time and any workload throttle delay enforced by Teradata's Dynamic Workload Manager. For the sake of keeping things simple here, I will defer to an excellent article written by Teradata's Carrie Ballinger on dealing with delay time here.
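Given those definitions, ordering by FirstStepTime (when work actually began) with StartTime as the tie-breaker is one reasonable reading of "order of execution". A sketch against the DBQL view, assuming the standard DBC.QryLogV view and that delayed or aborted queries may have a NULL FirstStepTime:

```sql
-- Sketch: list statements in the order they began executing.
-- COALESCE handles queries that never reached their first step.
SELECT QueryID, StartTime, FirstStepTime, FirstRespTime, QueryText
FROM DBC.QryLogV
ORDER BY COALESCE(FirstStepTime, StartTime), StartTime;
```

Ordering by StartTime instead would give submission order, which differs whenever the workload manager delays a query.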