We have a DynamoDB table with a Timestamp column (string).
Essentially we need to query data within a range of values in the Timestamp column.
What is the best way to do this?
I tried creating a GSI in my test environment using the Timestamp column, but quickly realized that I would not be able to query a range of this column.
Edit: the best way I've found to do this so far:
GSI with a partition key of event_type, an attribute that we know will always be equal to Update
Added the Timestamp column as the sort key, so I am able to query on a range of the timestamp
Do let me know if you know a better way to do this. Thanks.
Your approach is good. DynamoDB compares string sort keys lexicographically, so as long as your Timestamp format sorts chronologically (ISO 8601 does), you can do your query using BETWEEN.
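For illustration, here is a minimal boto3 sketch of that query. The table name (Events), index name (EventTypeTimestampIndex), and the ISO 8601 timestamp format are assumptions:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Assumed layout: GSI "EventTypeTimestampIndex" with partition key
# event_type (always "Update") and sort key Timestamp (ISO 8601 string).
table = boto3.resource("dynamodb").Table("Events")

response = table.query(
    IndexName="EventTypeTimestampIndex",
    KeyConditionExpression=(
        Key("event_type").eq("Update")
        & Key("Timestamp").between("2022-01-01T00:00:00Z", "2022-01-31T23:59:59Z")
    ),
)
items = response["Items"]
```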
A much better, but situational, approach is to include the time range in the partition key. If your time ranges are always the same size, for example one day, you can do something like:
PK
EVENT_TIME_RANGE#(start <-> end)
Then retrieve all of the entries in this time range with a very efficient PK lookup query.
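A minimal sketch of that lookup, assuming one-day buckets and a partition key formatted as EVENT_TIME_RANGE#&lt;day&gt; (the table name and key format are assumptions):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Events")

# Every event in the 2022-01-20 bucket shares one partition key,
# so a single Query (no index needed) returns the whole range.
response = table.query(
    KeyConditionExpression=Key("PK").eq("EVENT_TIME_RANGE#2022-01-20")
)
items = response["Items"]
```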
If you can't do that but need to optimize for time range lookups, you can copy the data into this "home-made" time range index.
We have a DynamoDB table Events with about 50 million records that look like this:
{
  "id": "1yp3Or0KrPUBIC",
  "event_time": 1632934672534,
  "attr1": 1,
  "attr2": 2,
  "attr3": 3,
  ...
  "attrN": N
}
The Partition Key=id and there is no Sort Key. There can be a variable number of attributes other than id (globally unique) and event_time, which are required.
This setup works fine for fetching by id, but now we'd like to efficiently query against event_time and pull ALL attributes for records that fall within a range (could be a million or two items). The criteria would be something like WHERE event_time BETWEEN 1632934671000 AND 1632934672000, for example.
Without changing any existing data or transforming it through an external process, is it possible to create a Global Secondary Index using event_time and projecting ALL attributes that could allow a range query? By my understanding of DynamoDB this isn't possible, but maybe there's another configuration I'm overlooking.
Thanks in advance.
(Edit: I rewrote the answer because the OP's comment clarified that the requirement is to query event_time ranges ignoring id. OP knows the table design is not ideal and is trying to make the best of a bad situation).
Is it possible to create a Global Secondary Index using event_time and projecting ALL attributes that could allow a range query?
Yes. You can add a Global Secondary Index to an existing table and choose which attributes to project. You cannot add an LSI to an existing table or change the table's primary key.
Without changing any existing data or transforming it through an external process?
No. You will need to manipulate the attributes. Although arbitrary range queries are not its strength, DynamoDB has a time series pattern that can be adapted to your query pattern.
Let's say you query mostly by a limited number of days. You would add a GSI with a yyyy-mm-dd PK (Partition Key). Items are made unique by an SK (Sort Key) that concatenates the timestamp with the id: event_time#id. PK and SK together are the index's Composite Primary Key.
GSI1PK = yyyy-mm-dd # 2022-01-20
GSI1SK = event_time#id # 1642709874551#1yp3Or0KrPUBIC
Querying for a single day needs 1 query operation; a calendar week range needs 7 operations.
GSI1PK = "2022-01-20" AND GSI1SK > ""
Query a range within a day by adding a BETWEEN condition on the SK (this is a string comparison, so keep the operands the same length as the stored timestamps):
GSI1PK = "2022-01-20" AND GSI1SK BETWEEN "1642709874000" AND "1642709875000"
It seems like one can create a global secondary index at any point.
Below is an excerpt from the Managing Global Secondary Indexes documentation, which can be found here: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.OnlineOps.html
To add a global secondary index to an existing table, use the UpdateTable operation with the GlobalSecondaryIndexUpdates parameter.
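As a rough boto3 sketch of that call (the table name and index name are assumptions; the new key attribute must also be declared in AttributeDefinitions):

```python
import boto3

client = boto3.client("dynamodb")

client.update_table(
    TableName="Events",
    AttributeDefinitions=[
        {"AttributeName": "event_time", "AttributeType": "N"},
    ],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "EventTimeIndex",
                "KeySchema": [
                    {"AttributeName": "event_time", "KeyType": "HASH"},
                ],
                "Projection": {"ProjectionType": "ALL"},
                # Tables in provisioned mode also need a
                # ProvisionedThroughput entry here.
            }
        }
    ],
)
```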
I have a DynamoDB table which stores a creation_date epoch in string format. This date is neither the hash key nor the sort key. The ultimate goal is querying creation_date for a range, i.e. I need all the ids in a given time range.
The table schema is:
id, version, creation_date, info.
id is the hash key and version is the sort key.
I was thinking of creating a CloudSearch domain and linking that to the DynamoDB table. Is it possible to use a range query in CloudSearch using Java if the date is in string format? If yes, how?
Here’s how you can accomplish this in DynamoDB using a GSI with a hash key of creation_y_m and a GSI range key of creation_date.
When you’re querying for a range of creation dates, you need to do a bit of date manipulation to find out all of the months in between your two dates, but then you can query your GSI with a key condition expression like this one.
creation_y_m = 2019-02 AND creation_date BETWEEN 2019-02-05T12:00.00Z AND 2019-02-18T06:00:00Z
Given that most of your queries cover a two-week range, you will usually only have to make one or two queries to get all of the items.
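A sketch of that lookup, enumerating the months between the two bounds and issuing one query per month. The table and index names are assumptions, and creation_date is assumed to be stored as an ISO 8601 string:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("MyTable")

def months_between(start: str, end: str) -> list:
    """Return the 'YYYY-MM' strings covering [start, end] ISO timestamps."""
    y, m = int(start[:4]), int(start[5:7])
    end_y, end_m = int(end[:4]), int(end[5:7])
    months = []
    while (y, m) <= (end_y, end_m):
        months.append(f"{y:04d}-{m:02d}")
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return months

start, end = "2019-02-05T12:00:00Z", "2019-02-18T06:00:00Z"
items = []
for month in months_between(start, end):
    response = table.query(
        IndexName="CreationDateIndex",
        KeyConditionExpression=(
            Key("creation_y_m").eq(month)
            & Key("creation_date").between(start, end)
        ),
    )
    items.extend(response["Items"])
```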
You may need to backfill the creation_y_m field, but it’s fairly straightforward to do that by scanning your table and updating each item to have the new attribute.
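And a rough backfill sketch, again assuming creation_date is an ISO 8601 string from which the year-month prefix can be sliced:

```python
import boto3

table = boto3.resource("dynamodb").Table("MyTable")

# Scan every item once and derive creation_y_m from creation_date.
scan_kwargs = {}
while True:
    page = table.scan(**scan_kwargs)
    for item in page["Items"]:
        table.update_item(
            Key={"id": item["id"], "version": item["version"]},
            UpdateExpression="SET creation_y_m = :ym",
            ExpressionAttributeValues={":ym": item["creation_date"][:7]},
        )
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```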
There are, of course, many variations on this. You could tweak how granular your hash key is (maybe you want just year, maybe you want year-month-day). You could use epoch time instead of ISO 8601 strings.
I'm looking for some advice on a DynamoDB table design to store telemetry data streaming in from thousands of sensor hubs. The sensor hubs send up to 15,000 messages per day each, containing the following:
timestamp (unix time)
station_id (uuid)
sensor_type (string)
sensor_data (json)
I've looked into best practices for storing time series data, and will adopt a table partitioning strategy where a new "hot data" table is created each month (adjusting RCUs and WCUs accordingly for older, "cooler" tables).
What I'm not sure about is picking a suitable hash key and sort key, as well as setting up indexes, etc.
The majority of the queries will be: give me messages where station_id = "foo" and sensor_type = "bar", and timestamp is between x and y.
At a guess, I'm assuming I would use station_id as the hash key and timestamp as the sort key, but how do I query for messages with a particular sensor_type without resorting to filters? Would I be best to combine station_id and sensor_type as the hash key?
Judging from the query example you've provided, I would create the following table:
stationId_sensorType (String, partition key) - a combined attribute that contains the concatenated values of the station id and the sensor type
timestamp (Number, range key) - a UNIX timestamp that you can use to sort by timestamp or to find only records with timestamps in a range
This will allow you to get all values for a pair of (stationId, sensorType).
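A sketch of the main access pattern under that design; the table name and the "#" separator are assumptions:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("SensorMessages")

# "Give me messages where station_id = foo, sensor_type = bar,
# and timestamp is between x and y."
response = table.query(
    KeyConditionExpression=(
        Key("stationId_sensorType").eq("foo#bar")
        & Key("timestamp").between(1632934671, 1632934672)
    ),
)
messages = response["Items"]
```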
You can also store stationId and sensorType as separate fields in your items, and then you can create GSIs on them to support other queries, like getting all values for a stationId.
Coming from a SQL background, I understand the high-level concepts of NoSQL but am still having trouble trying to translate some basic usage scenarios. I am hoping someone can help.
My application simply records a location, a timestamp, and a temperature for every second of the day. So we end up having 3 basic columns:
1) location
2) timestamp
3) and temperature
(All fields are numbers, and I'm storing the timestamp as an epoch for easy range querying.)
I set up DynamoDB with the location as the primary key, the timestamp as the sort key, and temp as an attribute. This results in a composite key on location and timestamp, which allows each location to have its own series of timestamps but prevents any individual location from having more than one item with an identical timestamp.
Now comes the real-world queries:
Query each site for a time range (Works fine)
Query for any particular time range and return all temps for all locations (won't work)
So how would you account for the 2nd scenario? This is where I get hung up... Is this where we get into secondary indexes and things like that? For those of you smarter than me, how would you deal with this?
Thanks in advance for your help!
-D
You can't query for an arbitrary range of values in DynamoDB. You can only query for a range of values (range keys) that belong to a certain value (hash key).
It doesn't matter whether this is the table's key, a local secondary index key, or a global secondary index key (secondary indexes just give you additional query options).
Let's get back to your scenario:
If the timestamp is in seconds and you want to get all records between two timestamps, you can add another field, min_timestamp (the timestamp truncated to the minute).
This field can be your global secondary hash key, and timestamp will be your global secondary range key.
Now you can get all records that were logged in a certain minute.
If you want a range of minutes, you need to perform X queries (where X is the number of minutes in the range), as sketched below.
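A sketch of that minute-bucket pattern, assuming min_timestamp holds the epoch timestamp rounded down to the minute (the table and index names are assumptions):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Temperatures")

def query_range(start_ts: int, end_ts: int) -> list:
    """One query per minute bucket between two epoch-second timestamps."""
    items = []
    for minute in range(start_ts // 60 * 60, end_ts + 1, 60):
        response = table.query(
            IndexName="MinuteIndex",
            KeyConditionExpression=(
                Key("min_timestamp").eq(minute)
                & Key("timestamp").between(start_ts, end_ts)
            ),
        )
        items.extend(response["Items"])
    return items
```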
You can also add another field, hour_timestamp (a hash key that groups all records in a certain hour), and so on. But this approach is risky: you would be writing many records with the same hash key at the same point in time, and you can run into throughput errors from a hot partition.
I just started figuring out DynamoDB.
I have a simple table that has a date attribute (e.g. 20160101) as the HASH key and a created_at attribute (e.g. 20160101185332) as the RANGE key.
I'd like to get latest N items from the table.
First, the SCAN command does not have a ScanIndexForward option, so I think it's not possible with SCAN.
Next, the QUERY command. It seems to work if I repeat the QUERY command several times to get enough items (since I don't know how many items share the same key value). For example, I can query using today's date first and repeat for the day before if the result does not give enough items.
How can I do the job more efficiently? Or, can I query without a KEY value?
With the table as you described it, you can't do it more efficiently, and you can't query DynamoDB without a KEY (hash) value.
Look at the answer here:
dynamodb get earliest inserted distinct values from a table
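For completeness, a sketch of the day-by-day approach described in the question, querying each date partition newest-first until N items are collected (the table name and the 30-day cutoff are assumptions):

```python
from datetime import date, timedelta

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("MyTable")

def latest_n(n: int, max_days_back: int = 30) -> list:
    """Walk backwards one date partition at a time, newest first."""
    items = []
    for delta in range(max_days_back):
        day = (date.today() - timedelta(days=delta)).strftime("%Y%m%d")
        response = table.query(
            KeyConditionExpression=Key("date").eq(day),
            ScanIndexForward=False,  # newest created_at first
            Limit=n - len(items),
        )
        items.extend(response["Items"])
        if len(items) >= n:
            break
    return items
```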