Range query on cloudsearch date - amazon-dynamodb

I have a DynamoDB table which stores a creation_date epoch in string format. This date is neither the hash key nor the sort key. The ultimate goal is to query creation_date for a range, i.e. I need all the ids in a given time range.
The table schema is:
id, version, creation_date, info.
id is hash key and version is sort key.
I was thinking of creating a cloudsearch domain and link that to dynamodb table. Is it possible to use a range query in cloudsearch using java if the date is in string format? If yes how?

Here’s how you can accomplish this in DynamoDB using a GSI with a hash key of creation_y_m and a GSI range key of creation_date.
When you’re querying for a range of creation dates, you need to do a bit of date manipulation to find out all of the months in between your two dates, but then you can query your GSI with a key condition expression like this one.
creation_y_m = "2019-02" AND creation_date BETWEEN "2019-02-05T12:00:00Z" AND "2019-02-18T06:00:00Z"
Given that most of your queries span a two-week range, you will usually have to make only one or two queries to get all of the items.
You may need to backfill the creation_y_m field, but it’s fairly straightforward to do that by scanning your table and updating each item to have the new attribute.
There are, of course, many variations on this. You could tweak how granular your hash key is (maybe you want just year, maybe you want year-month-day). You could use epoch time instead of ISO 8601 strings.
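The bucketing above can be sketched in Python (a minimal sketch; `month_buckets` is a hypothetical helper name, and the key condition assumes creation_date is stored as an ISO 8601 string):

```python
from datetime import datetime

def month_buckets(start_iso, end_iso):
    """Return every creation_y_m value ("YYYY-MM") spanned by [start, end]."""
    start = datetime.strptime(start_iso[:7], "%Y-%m")
    end = datetime.strptime(end_iso[:7], "%Y-%m")
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets

# One Query per bucket; the BETWEEN condition on the sort key stays the same.
for bucket in month_buckets("2019-02-05T12:00:00Z", "2019-03-18T06:00:00Z"):
    key_condition = (
        f'creation_y_m = "{bucket}" AND creation_date '
        f'BETWEEN "2019-02-05T12:00:00Z" AND "2019-03-18T06:00:00Z"'
    )
```

A two-week window yields one or two buckets, matching the one-or-two-query estimate above.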

Related

Dynamodb range query on timestamp

We have a DDB with a column : Timestamp (string)
Essentially we need to query data between a range of the Timestamp column.
What is the best way to do this?
I tried creating a GSI in my test environment using the Timestamp column but quickly realized that I will not be able to query a range of this column.
Edit: What I found as the best way to do this so far:
GSI on an event_type that we know will always be equal to Update
Added sort key as the Timestamp column instead, so I am able to query on a range of the timestamp
Do let me know if you know a better way to do this. Thanks.
Your approach is good. DynamoDB stores dates as strings, and ISO 8601 strings sort lexicographically, so you can do your range query using "BETWEEN".
A much better, but situational, approach is to include the time range in the partition key. If your time ranges are always the same size, for example 1 day, you can do something like
PK
EVENT_TIME_RANGE#(start <-> end)
Then retrieve all of the entries in this time range with a very efficient PK lookup query.
If you can't do that but need to optimize for time-range lookups, you can copy the data into this "home made" time-range index.
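The fixed-size (1 day) bucket idea can be sketched like this (a hypothetical helper, assuming ISO 8601 timestamps; the `EVENT_TIME_RANGE#` prefix follows the PK shape shown above):

```python
from datetime import datetime, timedelta

def day_bucket_pk(timestamp_iso):
    """Partition key for the fixed-size (1 day) time range containing the timestamp."""
    return f"EVENT_TIME_RANGE#{timestamp_iso[:10]}"  # keep only "YYYY-MM-DD"

def bucket_pks(start_iso, end_iso):
    """All day-bucket partition keys needed to cover [start, end]."""
    start = datetime.strptime(start_iso[:10], "%Y-%m-%d")
    end = datetime.strptime(end_iso[:10], "%Y-%m-%d")
    return [day_bucket_pk((start + timedelta(days=i)).isoformat())
            for i in range((end - start).days + 1)]
```

Each returned PK can then be looked up with a plain key equality query, one per day in the range.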

Recommended Schema for DynamoDB calendar/event like structure

I'm pretty new to DynamoDB design and trying to get the correct schema for my application. In this app different users will enter various attributes about their day. For example "User X, March 1st 12:00-2:00, Tired". There could be multiple entries for a given time, or overlapping times (e.g. tired from 12-2 and eating lunch from 12-1).
I'll need to query based on user and time ranges. Common queries:
Give me all the "actions" for user X between time t1 and t2
Give me all the start times for action Z for user X
My initial thought was that the partition key would be userid and the range key the start time, but that won't work because of duplicate start times, right?
A second thought:
UserID - Partition Key
StartTime - RangeKey
Action - JSON document of all actions for that start time
[{"action": "Lunch", "endTime": "1pm"}, {"action": "Tired", "endTime": "2pm"}]
Any recommendation on a proper schema?
This doesn't really have one single solution. You will need to evaluate multiple options depending on your use case: how much data you have, how often you query, which fields you query by, and so on.
But one good solution is to partition your schema like this.
Generated UUID as partition key
UserID
Start time (in unix epoch time or ISO8601 time format)
Advantages
Can handle multiple time zones
Can easily query by userID and start date (you will need a secondary index with partition key userID and sort key start time)
More even data distribution and fewer hot keys across DynamoDB partitions, because of the randomly generated partition key.
Disadvantages
More data for every item (because of UUID) (+16 bytes)
Additional cost for the new secondary index; note that scanning the table is generally much more expensive than maintaining a secondary index.
This is pretty close to your initial thought; to give a more precise answer we would need a lot more information about how many writes and reads you are planning, and what kind of queries you will need.
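The UUID-keyed schema above might look like this as an item builder (a sketch; attribute names are illustrative, and the userID/startTime lookup would go through the secondary index described above):

```python
import uuid

def make_entry(user_id, start_time_iso, action):
    """Item with a random UUID partition key for even distribution.

    userID and startTime are plain attributes here; queries by user and
    time range go through a secondary index (partition key userID,
    sort key startTime).
    """
    return {
        "id": str(uuid.uuid4()),       # randomly generated partition key
        "userID": user_id,             # secondary-index partition key
        "startTime": start_time_iso,   # ISO 8601 sorts chronologically
        "action": action,
    }
```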
You are right that UserID as partition key and StartTime as range key would be the obvious choice, if it weren't for your overlapping activities.
I would consider going for
UserID - Partition Key
StartTime + uuid - RangeKey
StartTime - Plain old attribute
Datetimes in DynamoDB are just stored as strings anyway. So the idea here is that you use StartTime + some uuid as your range key, which gives you a table sortable by datetime while also ensuring unique primary keys. You could then store the StartTime in a separate attribute, or have a function for adding/removing the uuid from the StartTime + uuid attribute.
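A pair of helper functions for that composite range key might look like this (a sketch; the "#" separator is an assumption, any character that cannot appear in the timestamp works):

```python
import uuid

def make_range_key(start_time_iso):
    """StartTime + uuid: sorts chronologically, yet is unique per item."""
    return f"{start_time_iso}#{uuid.uuid4()}"

def start_time_of(range_key):
    """Strip the uuid suffix to recover the plain StartTime."""
    return range_key.split("#", 1)[0]
```

Two overlapping activities with the same StartTime get distinct range keys that still sort next to each other.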

Table design for storing immutable time series telemetry/sensor data?

I'm looking for some advice on a DynamoDB table design to store telemetry data streaming in from 1000's of sensor hubs. The sensor hubs send up to 15,000 messages per day each, containing the following:
timestamp (unix time)
station_id (uuid)
sensor_type (string)
sensor_data (json)
I've looked into best practices for storing time series data, and will adopt a table partitioning strategy where a new "hot data" table is created each month (adjusting RCUs and WCUs accordingly for older, "cooler" tables).
What I'm not sure about is picking a suitable hash key and sort key, as well as setting up indexes, etc.
The majority of the queries to data will be: Give me messages where station_id = "foo" and sensor_type = "bar", and timestamp is between x and y.
At a guess, I'm assuming I would use station_id as the hash key and timestamp as the sort key, but how do I query for messages with a particular sensor_type without resorting to filters? Would I be best to combine station_id and sensor_type as the hash key?
Judging from the query example that you've provided, I would create the following table:
stationId_sensorType (String, partition key) - a combined attribute that contains concatenated values for station id and for sensor type
timestamp (Number, range key) - a UNIX timestamp that you can use to sort by timestamp or to find only records with timestamps in a range.
This will allow you to get all values for a pair of (stationId, sensorType).
You can also store stationId and sensorType as separate fields in your items and then you can create GSI on them to support other queries, like, get all values for a stationId.
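The combined key and the range query might be sketched like this (key and attribute names are assumptions; `#ts` is an expression attribute name placeholder, since TIMESTAMP is a DynamoDB reserved word):

```python
def telemetry_key(station_id, sensor_type):
    """Concatenated partition key for the (stationId, sensorType) access pattern."""
    return f"{station_id}#{sensor_type}"

# Hypothetical Query parameters for "messages where station_id = foo,
# sensor_type = bar, and timestamp is between x and y":
key_condition = "stationId_sensorType = :pk AND #ts BETWEEN :x AND :y"
values = {":pk": telemetry_key("foo", "bar"), ":x": 1609459200, ":y": 1612137600}
```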

DynamoDB query/sort based on timestamp

In DynamoDB, I have a table where each record has two date attributes, create_date and last_modified_date. These dates are in ISO-8601 format e.g. 2016-01-22T16:19:52.464Z.
I need to have a way of querying them based on the create_date and last_modified_date e.g.
get all records where create_date > [some_date]
get all records where last_modified_date < [some_date]
In general, I need to get all records where [date_attr] [comparison_op] [some_date].
One way of doing it is to insert a dummy fixed attribute with each record and create an index with the dummy attribute as the partition key and the create_date as the sort key (likewise for last_modified_date.)
Then I'll be able to query it as such by providing the fixed dummy attribute as partition key, the date attributes as the sort key and use any comparison operators <, >, <=, >=, and so on.
But this doesn't seem good and looks like a hack instead of a proper solution/design. Are there any better solutions?
There are some things that NoSQL DBs are not good at, but you can solve this with the following solutions:
Move this table's data to a SQL database for search purposes: this can be effective because you can query exactly as your requirements demand, but it can be tedious because you need to keep the data synchronized between two different databases.
Integrate with Amazon CloudSearch: you can connect this table to CloudSearch and then query CloudSearch rather than your DynamoDB table.
Integrate with Elasticsearch: Elasticsearch is similar to CloudSearch, although each has its pros and cons; the end result is the same - rather than querying DynamoDB, you query Elasticsearch.
As you have mentioned in your question, add GSI indexes
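The dummy-partition-key GSI the question describes might be sketched like this (a hypothetical helper; the attribute name `dummy` and the fixed value "ALL" are assumptions):

```python
def dummy_index_condition(date_attr, op, some_date):
    """Key condition for a GSI whose partition key is a fixed dummy value
    and whose sort key is the date attribute being compared."""
    return (f"dummy = :d AND {date_attr} {op} :v",
            {":d": "ALL", ":v": some_date})

expr, vals = dummy_index_condition("create_date", ">", "2016-01-22T16:19:52.464Z")
```

Note that a single fixed dummy value funnels every item into one GSI partition; this is the hot-key concern that the bucketing approaches in the earlier answers are designed to avoid.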

What's the best way to query DynamoDB based on a date range?

As part of migrating from SQL to DynamoDB I am trying to create a DynamoDB table. The UI allows users to search based on 4 attributes start date, end date, name of event and source of event.
The table has 6 attributes, and the above four are a subset of them, the other attributes being priority and location. The query described above makes it mandatory to search on those four values. What's the best way to store the information in DynamoDB so that querying by start date and end date is fairly easy?
I thought of creating a GSI with hash key as StartDate, range key as EndDate, and GSIs on the remaining two attributes?
Inshort:
My table in DynamoDB will have 6 attributes
EventName, Location, StartDate, EndDate, Priority and source.
Query will have 4 mandatory attributes
StartDate, EndDate, Source and Event Name.
Thanks for the help.
You can use greater than/less than comparison operators as part of your query http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html
So you could try to build a table with schema:
(EventName (hashKey), "StartDate-EndDate" (sortKey), other attributes)
In this case the sort key is basically a combination of start and end date, and DynamoDB compares strings using ASCII-based lexicographic ordering. So, assuming your sortKey looks like "73644-75223", a condition such as BETWEEN "73000" AND "76000" would match it, because the comparison on the combined string effectively filters on the StartDate prefix.
Additionally, you could create a GSI on your table for each of the remaining attributes that need to be queried. You can then project into each index the attributes you want the query to fetch; in contrast to an LSI, queries on a GSI cannot fetch attributes that are not projected. Be aware of the additional read/write costs involved in using GSIs (and LSIs), and the additional storage required by the projections.
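The lexicographic trick on the combined sort key can be checked locally (a sketch; `in_start_range` just mimics what BETWEEN does on the string key):

```python
def sort_key(start_date, end_date):
    """Combined "StartDate-EndDate" sort key."""
    return f"{start_date}-{end_date}"

def in_start_range(key, lo, hi):
    """Simulates BETWEEN on the combined key: plain lexicographic comparison,
    which effectively filters on the StartDate prefix."""
    return lo <= key <= hi
```

Note this only constrains the StartDate portion; fixed-width (zero-padded) date strings are needed for the comparison to behave.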
Hope it helps.
