Data Modeling with NoSQL (DynamoDB) - amazon-dynamodb

Coming from a SQL background, I understand the high-level concepts of NoSQL but am still having trouble translating some basic usage scenarios. I am hoping someone can help.
My application simply records a location, a timestamp, and a temperature for every second of the day. So we end up with 3 basic columns:
1) location
2) timestamp
3) and temperature
(All fields are numbers, and I'm storing the timestamp as an epoch for easy range querying.)
I set up DynamoDB with the location as the partition key, the timestamp as the sort key, and the temperature as an attribute. This results in a composite key on location and timestamp, which lets each location have its own set of timestamps but does not allow any individual location to have more than one item with the same timestamp.
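In boto3 terms, that setup looks roughly like this (the table name and throughput values are arbitrary, just to make the sketch complete):

import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="Temperatures",  # arbitrary name for this example
    AttributeDefinitions=[
        {"AttributeName": "location", "AttributeType": "N"},
        {"AttributeName": "timestamp", "AttributeType": "N"},  # epoch seconds
    ],
    KeySchema=[
        {"AttributeName": "location", "KeyType": "HASH"},    # partition key
        {"AttributeName": "timestamp", "KeyType": "RANGE"},  # sort key
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)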
Now comes the real-world queries:
Query each site for a time range (Works fine)
Query for any particular time range and return all temps for all locations (won't work)
So how would you handle the 2nd scenario? This is where I get hung up... Is this where we get into secondary indexes and things like that? For those of you smarter than me, how would you deal with this?
Thanks in advance for your help!
-D

You can't query for a range of values across the whole table in DynamoDB. You can only query for a range of values (range keys) that belong to a certain value (hash key).
It doesn't matter whether that key is the table key, a local secondary index key, or a global secondary index key (secondary indexes just give you additional query options).
Back to your scenario:
If the timestamp is in seconds and you want to get all records between 2 timestamps, you can add another field, 'min_timestamp' (the timestamp truncated to the minute).
This field can be your global secondary index hash key, and timestamp will be your global secondary index range key.
Now you can get all records logged in a certain minute.
If you want a range of minutes, then you need to perform X queries, where X is the number of minutes in the range.
You can also add another field, 'hour_timestamp' (a hash key that groups all records in a certain hour), and so on. But this approach is risky: you will be writing many records with the same hash key at the same point in time, and you can get a lot of throughput (throttling) errors from that hot key.
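A rough boto3 sketch of that per-minute GSI pattern (the table name, index name, and field values are placeholders I chose; it assumes the GSI is defined with min_timestamp as its hash key and timestamp as its range key):

import time
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Temperatures")  # hypothetical table name

# On write, also store the timestamp truncated to the minute as the GSI hash key.
ts = int(time.time())
table.put_item(Item={
    "location": 42,
    "timestamp": ts,
    "temperature": 21,
    "min_timestamp": ts - (ts % 60),  # the minute this second falls into
})

# To read a range of seconds across all locations, query the GSI once per minute.
def temps_between(start_ts, end_ts, index_name="min_timestamp-index"):
    items = []
    minute = start_ts - (start_ts % 60)
    while minute <= end_ts:
        resp = table.query(
            IndexName=index_name,
            KeyConditionExpression=Key("min_timestamp").eq(minute)
            & Key("timestamp").between(start_ts, end_ts),
        )
        items.extend(resp["Items"])  # ignores per-query pagination for brevity
        minute += 60
    return items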

Related

Dynamodb range query on timestamp

We have a DDB table with a column: Timestamp (string).
Essentially we need to query data between a range of the Timestamp column.
What is the best way to do this?
I tried creating a GSI in my test environment using the Timestamp column but quickly realized that I will not be able to query a range of this column.
Edit: What I found as the best way to do this so far:
GSI on an event_type that we know will always be equal to Update
Added the Timestamp column as the sort key instead, so I am able to query on a range of the timestamp
Do let me know if you know a better way to do this. Thanks.
Your approach is good. DynamoDB supports dates and you can do your query using "BETWEEN".
A much better, but situational, approach is to include the time range in the partition key. If your time ranges are always the same size, for example 1 day, you can do something like:
PK
EVENT_TIME_RANGE#(start <-> end)
Then retrieve all of the entries in this time range with a very efficient PK lookup query.
If you can't do that but need to optimize for time-range lookups, you can copy the data into this "home made" time range index.
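As a sketch of the GSI approach from the edit above (a constant event_type partition key plus Timestamp as the sort key; the index name and dates are made up), a BETWEEN query would look roughly like this in boto3:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Events")  # hypothetical table name

resp = table.query(
    IndexName="event_type-Timestamp-index",  # the GSI described in the edit
    KeyConditionExpression=Key("event_type").eq("Update")
    & Key("Timestamp").between("2023-01-01T00:00:00Z", "2023-01-31T23:59:59Z"),
)
items = resp["Items"]

Keep in mind that a single constant partition key puts every item of the index into one item collection, which is why the time-range partition key above is the better option when your ranges have a fixed size.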

Recommended Schema for DynamoDB calendar/event like structure

I'm pretty new to DynamoDB design and trying to get the correct schema for my application. In this app different users will enter various attributes about their day. For example "User X, March 1st 12:00-2:00, Tired". There could be multiple entries for a given time, or overlapping times (e.g. tired from 12-2 and eating lunch from 12-1).
I'll need to query based on user and time ranges. Common queries:
Give me all the "actions" for user X between time t1 and t2
Give me all the start times for action Z for user X
My initial thought was that the partition key would be the user ID and the range key the start time, but that won't work because of duplicate start times, right?
A second thought:
UserID - Partition Key
StartTime - RangeKey
Action - JSON document of all actions for that start time
[{"action": "Lunch", "endTime": "1pm"}, {"action": "Tired", "endTime": "2pm"}]
Any recommendation on a proper schema?
This doesn't really have one solution. You will need to evaluate multiple options depending on your use case: how much data you have, how often you will query, which fields you query by, and so on.
But one good option is to model your schema like this:
Generated UUID as partition key
UserID
Start time (in unix epoch time or ISO8601 time format)
Advantages
Can handle multiple time zones
Can easily query by userID and start date (you will need a secondary index with partition key userID and sort key start time)
More even distribution of your data across DynamoDB partitions and fewer hot keys, because of the randomly generated partition key
Disadvantages
More data for every item (because of UUID) (+16 bytes)
Additional cost for the new secondary index; note that scanning the data in the table is generally much more expensive than maintaining a secondary index.
This is pretty close to your initial thought; to give a more precise answer we would need a lot more information about how many writes and reads you are planning, and what kind of queries you will need.
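A rough sketch of that UUID-keyed option (the table name, index name, and attribute names are my own placeholders):

import uuid
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("DayEntries")  # hypothetical table name

# Write: a random UUID as the partition key spreads items evenly across partitions.
table.put_item(Item={
    "id": str(uuid.uuid4()),          # partition key
    "userId": "user-x",
    "startTime": "2021-03-01T12:00:00",
    "action": "Tired",
    "endTime": "2021-03-01T14:00:00",
})

# Read: query the secondary index (partition key userId, sort key startTime).
resp = table.query(
    IndexName="userId-startTime-index",  # hypothetical GSI name
    KeyConditionExpression=Key("userId").eq("user-x")
    & Key("startTime").between("2021-03-01T00:00:00", "2021-03-02T00:00:00"),
)
actions = resp["Items"]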
You are right that UserID as partition key and StartTime as range key would be the obvious choice, if it weren't for your overlapping activities.
I would consider going for
UserID - Partition Key
StartTime + uuid - RangeKey
StartTime - Plain old attribute
Datetimes in DynamoDB just get stored as strings anyway. So the idea here is that you use StartTime plus some UUID as your range key, which gives you a table sortable by datetime while also ensuring unique primary keys. You could then store the StartTime in a separate attribute, or have a function for adding/removing the UUID from the StartTime + uuid attribute.
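A rough sketch of that composite range key (the table and attribute names are just my choice):

import uuid
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("UserActions")  # hypothetical table name

# Write: the range key is an ISO8601 start time with a UUID suffix to keep keys unique.
start_time = "2021-03-01T12:00:00"
table.put_item(Item={
    "UserID": "user-x",
    "StartTimeId": f"{start_time}#{uuid.uuid4()}",  # range key
    "StartTime": start_time,                        # plain old attribute
    "Action": "Lunch",
    "EndTime": "2021-03-01T13:00:00",
})

# Read: ISO8601 strings sort lexicographically, so a time-range query still works
# even though the range key carries a UUID suffix.
resp = table.query(
    KeyConditionExpression=Key("UserID").eq("user-x")
    & Key("StartTimeId").between("2021-03-01T00:00:00", "2021-03-02T00:00:00"),
)
actions = resp["Items"]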

Table design for storing immutable time series telemetry/sensor data?

I'm looking for some advice on a DynamoDB table design to store telemetry data streaming in from 1000's of sensor hubs. The sensor hubs send up to 15,000 messages per day each, containing the following:
timestamp (unix time)
station_id (uuid)
sensor_type (string)
sensor_data (json)
I've looked into best practices for storing time series data, and will adopt a table partitioning strategy where a new "hot data" table is created each month (adjusting RCUs and WCUs accordingly for older, "cooler" tables).
What I'm not sure about is picking a suitable hash key and sort key, as well as setting up indexes, etc.
The majority of the queries to data will be: Give me messages where station_id = "foo" and sensor_type = "bar", and timestamp is between x and y.
At a guess, I'm assuming I would use station_id as the hash key and timestamp as the sort key, but how do I query for messages with a particular sensor_type without resorting to filters? Would I be best to combine station_id and sensor_type as the hash key?
Judging from the query example that you've provided, I would create the following table:
stationId_sensorType (String, partition key) - a combined attribute that contains concatenated values for station id and for sensor type
timestamp (Number, range key) - a UNIX timestamp that you can use to sort by timestamp or to find only records with timestamps in a given range.
This will allow you to get all values for a pair of (stationId, sensorType).
You can also store stationId and sensorType as separate fields in your items, and then create a GSI on them to support other queries, like getting all values for a stationId.
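A rough boto3 sketch of that layout (the table name and the "#" delimiter are my assumptions):

import json
import time
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Telemetry")  # hypothetical (e.g. the current monthly table)

# Write: the partition key concatenates station id and sensor type.
table.put_item(Item={
    "stationId_sensorType": "station-foo#bar",
    "timestamp": int(time.time()),
    "stationId": "station-foo",  # kept separate so a GSI can serve per-station queries
    "sensorType": "bar",
    "sensorData": json.dumps({"value": 20.1}),
})

# Read: messages where station_id = "foo", sensor_type = "bar", timestamp between x and y.
resp = table.query(
    KeyConditionExpression=Key("stationId_sensorType").eq("station-foo#bar")
    & Key("timestamp").between(1560000000, 1560086400),
)
messages = resp["Items"]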

limit offset, sorting and aggregation challenges in DynamoDB

I am using DynamoDB to store my device events (in JSON format) in a table for further analysis, and I am using the scan API to display the result set in a UI, which requires:
Defining a limit/offset for records, say 10 records per page, meaning the result set should be paginated (e.g. page 1 has records 1-10, page 2 has records 11-20, and so on). I found an API like scanRequest.withLimit(10), but its limit has a different meaning than a limit/offset. Does the DynamoDB API come with support for limit/offset?
I also need to sort the result set based on user-input fields, like sorting on Date, Serial Number, etc., but I still haven't found any sorting/order-by APIs.
I may also look for aggregation, e.g. on Device Name, Date, etc., which also doesn't seem to be available in DynamoDB.
The above situation has led me to consider other NoSQL database solutions. Please advise me on the issues mentioned above.
The right way to think about DynamoDB is as a key-value store with support for indexes.
"Amazon DynamoDB supports key-value data structures. Each item (row) is a key-value pair where the primary key is the only required attribute for items in a table and uniquely identifies each item. DynamoDB is schema-less. Each item can have any number of attributes (columns). In addition to querying the primary key, you can query non-primary key attributes using Global Secondary Indexes and Local Secondary Indexes."
https://aws.amazon.com/dynamodb/details/
A table can have 2 types of keys:
Hash Type Primary Key - the primary key is made of one attribute, a hash attribute. DynamoDB builds an unordered hash index on this primary key attribute. Each item in the table is uniquely identified by its hash key value.
Hash and Range Type Primary Key - the primary key is made of two attributes. The first attribute is the hash attribute and the second one is the range attribute. DynamoDB builds an unordered hash index on the hash primary key attribute, and a sorted range index on the range primary key attribute. Each item in the table is uniquely identified by the combination of its hash and range key values. It is possible for two items to have the same hash key value, but those two items must have different range key values.
What kind of primary key have you set up for your Device Events table? I would suggest that you denormalize your data (i.e. pull specific attributes out of the json) and build additional indexes on those attributes that you want to sort and aggregate on: Date, Serial Number, etc. If I know what kind of primary key you have set up on your table, I can point you in the right direction to build these indices so that you can get what you need via the query method. The scan method will be inefficient for you because it reads every row in the table.
Lastly, with regard to your "limit offset" question, I think that you're looking for ExclusiveStartKey, which you set from the LastEvaluatedKey that DynamoDB returns in the response to your query or scan.
ExclusiveStartKey is what lets you do pagination. Whenever a Query or Scan stops before the end of the result set, either because the 1 MB response size limit was hit or because your Limit page size was reached, the response includes a LastEvaluatedKey; pass that value back as ExclusiveStartKey on the next request to continue where the previous page left off. DynamoDB has no numeric offset, but you can also construct an ExclusiveStartKey yourself from known key values, so you are not strictly dependent on LastEvaluatedKey.
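A minimal sketch of that pagination loop (boto3; the table and attribute names are hypothetical, and it assumes the events are keyed by a deviceId partition key so Query can be used instead of Scan):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("DeviceEvents")  # hypothetical table name

def fetch_page(device_id, page_size=10, start_key=None):
    kwargs = {
        "KeyConditionExpression": Key("deviceId").eq(device_id),
        "Limit": page_size,
    }
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key
    resp = table.query(**kwargs)
    # LastEvaluatedKey is present whenever more items remain; feed it back in
    # as ExclusiveStartKey to fetch the next page.
    return resp["Items"], resp.get("LastEvaluatedKey")

# Usage: page through all events for one device, 10 at a time.
items, next_key = fetch_page("device-1")
while next_key:
    page, next_key = fetch_page("device-1", start_key=next_key)
    items.extend(page)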

How to design DynamoDB table to facilitate searching by time ranges, and deleting by unique ID

I'm new to DynamoDB - I already have an application where the data gets inserted, but I'm getting stuck on extracting the data.
Requirement:
There must be a unique table per customer
Insert documents into the table (each doc has a unique ID and a timestamp)
Get X number of documents based on timestamp (ordered ascending)
Delete individual documents based on unique ID
So far I have created a table with a composite key (S:id, N:timestamp). However, when I come to query it, I realise that since my ID is unique and I can't do a wildcard search on it, I won't be able to extract a range of items...
So, how should I design my table to satisfy this scenario?
Edit: Here's what I'm thinking:
Primary index will be composite: (s:customer_id, n:timestamp), where the customer ID will be the same within a table. This will enable me to extract data based on a time range.
Secondary index will be a hash on (s:unique_doc_id), which I will be able to use to find and delete items.
Does this sound like the correct solution? Thank you in advance.
You can satisfy the requirements like this:
Your primary key will be h:customer_id and r:unique_id. This makes sure all the elements in the table have different keys.
You will also have an attribute for timestamp and will have a Local Secondary Index on it.
You will use the LSI for requirement 3, and the BatchWriteItem API call to do a batch delete for requirement 4.
This solution doesn't require (1): all the customers can stay in the same table. (Heads up: there is a limit of 256 tables per account before you have to contact AWS.)
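A rough boto3 sketch of that layout (the table, index, and attribute names are my own placeholders; it assumes a partition key customer_id, a sort key unique_id, and a Local Secondary Index "by-timestamp" on (customer_id, timestamp)):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Documents")  # hypothetical table name

# Requirement 3: get the X oldest documents for a customer, ordered by timestamp.
resp = table.query(
    IndexName="by-timestamp",
    KeyConditionExpression=Key("customer_id").eq("cust-1"),
    ScanIndexForward=True,  # ascending by the LSI sort key (timestamp)
    Limit=10,
)
oldest = resp["Items"]

# Requirement 4: delete individual documents by unique ID, batched via BatchWriteItem.
with table.batch_writer() as batch:
    for doc in oldest:
        batch.delete_item(Key={"customer_id": "cust-1", "unique_id": doc["unique_id"]})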
