I have the following table in DynamoDB.
I want to extract data using the following filters:
This Month data: the set of records from the 1st of this month to today. (I think I can achieve this using the BEGINS_WITH filter, but I am not sure whether this is the correct approach.)
This Quarter data: the set of records that belong to this quarter, i.e. from 1st of April 2021 to 30th June 2021.
This Year data: the set of records that belong to this entire year.
Question: How can I filter/query the data using the date column from the above table to get these 3 types (Month, Quarter, Year) of data?
Other Details
Table Size : 25 GB
Item Count : 4,081,678
It looks like you have time-based access patterns (e.g. fetch by month, quarter, year, etc).
Because your sort key starts with a date, you can implement your access patterns using the between condition on your sort key. For example (in pseudo code):
Fetch User 1 data for this month
query where user_id = 1 and date between 2021-06-01 and 2021-06-30
Fetch User 1 data for this quarter
query where user_id = 1 and date between 2021-04-01 and 2021-06-30
Fetch User 1 data for this year
query where user_id = 1 and date between 2021-01-01 and 2021-12-31
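The pseudo code above could be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names period_bounds and build_query_kwargs are hypothetical, the attribute names user_id and date are taken from the pseudo code, and it assumes the sort key is a plain ISO date string. If the sort key carries a suffix after the date, the upper bound would need padding so that items on the last day are still included.

```python
from datetime import date, timedelta


def period_bounds(today, period):
    """Return inclusive (start, end) ISO dates for the month, quarter,
    or year that contains `today`."""
    if period == "month":
        start = today.replace(day=1)
        # Jump past the end of the month, then snap back to day 1.
        next_start = (start.replace(day=28) + timedelta(days=4)).replace(day=1)
    elif period == "quarter":
        first_month = 3 * ((today.month - 1) // 3) + 1
        start = date(today.year, first_month, 1)
        if first_month == 10:
            next_start = date(today.year + 1, 1, 1)
        else:
            next_start = date(today.year, first_month + 3, 1)
    elif period == "year":
        start = date(today.year, 1, 1)
        next_start = date(today.year + 1, 1, 1)
    else:
        raise ValueError("period must be 'month', 'quarter' or 'year'")
    end = next_start - timedelta(days=1)
    return start.isoformat(), end.isoformat()


def build_query_kwargs(user_id, start, end):
    """Build keyword arguments for a DynamoDB Query call.

    ASSUMPTION: partition key `user_id`, sort key `date` (names from the
    pseudo code). `date` is aliased as #d because it is a reserved word.
    """
    return {
        "KeyConditionExpression": "user_id = :uid AND #d BETWEEN :start AND :end",
        "ExpressionAttributeNames": {"#d": "date"},
        "ExpressionAttributeValues": {":uid": user_id, ":start": start, ":end": end},
    }


start, end = period_bounds(date(2021, 6, 15), "quarter")
kwargs = build_query_kwargs(1, start, end)
```

The resulting dictionary is meant to be passed to whatever Query call your client exposes; the date arithmetic is kept separate so it can be checked without touching DynamoDB.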
If you need to fetch across all users, you could use the same approach using the scan operation. While scan is commonly considered wasteful/inefficient, it's a fine approach if you run this type of query infrequently.
However, if this is a common access pattern, you might want to consider re-organizing your data to make this operation more efficient.
As mentioned in the above answer by @Seth Geoghegan, the table design above is not ideal; you should think carefully before choosing your Partition Key and Sort Key. Still, for people like me who already have this kind of scenario, here are the steps I followed to mitigate my issue.
Enabled DynamoDB Streams
Re-triggered the data so that it passes through the DDB Streams (I added one additional column, updated_dttm, to all of my records using a script)
Processed the stream records; in my case I broke the date column above down into three more columns, event_date, category and sub_category respectively, and wrote them back to the original record using a Lambda
Then I was able to query my data using the event_date column; I can also create an index over the event_date column to make my queries more efficient
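The stream-processing step above might look roughly like the sketch below. It is only an illustration under stated assumptions: the original date column format is not shown in this thread, so the "event_date#category#sub_category" layout, the function names, and the shape of the returned update list are all hypothetical. Only the record layout (Records, eventName, dynamodb.Keys, dynamodb.NewImage) follows the DynamoDB Streams event format.

```python
def split_date_key(raw):
    """Split a combined key into its parts.

    ASSUMPTION: the combined value looks like "2021-06-15#books#fiction".
    Returns None if the value does not have exactly three parts.
    """
    parts = raw.split("#", 2)
    if len(parts) != 3:
        return None
    event_date, category, sub_category = parts
    return {
        "event_date": event_date,
        "category": category,
        "sub_category": sub_category,
    }


def handler(event, context=None):
    """Collect the attribute updates implied by a batch of stream records.

    The actual write-back (e.g. an UpdateItem call per entry) is left out
    so the sketch stays runnable without AWS access.
    """
    updates = []
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"].get("NewImage", {})
        raw = image.get("date", {}).get("S")
        # Skip items without the column, and items that were already
        # processed: the write-back itself produces a new stream record.
        if raw is None or "event_date" in image:
            continue
        attributes = split_date_key(raw)
        if attributes is None:
            continue
        updates.append({"Key": record["dynamodb"]["Keys"], "Attributes": attributes})
    return updates
```

The "already processed" check matters because updating an item puts another record on the stream; without it the function would keep re-triggering itself.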
Points to Consider
Cost for updating records so that they can go to DDB Streams
Cost of reprocessing the records
Cost for updating records back to DDB
Related
We have a Dynamodb table Events with about 50 million records that look like this:
{
"id": "1yp3Or0KrPUBIC",
"event_time": 1632934672534,
"attr1" : 1,
"attr2" : 2,
"attr3" : 3,
...
"attrN" : N,
}
The Partition Key=id and there is no Sort Key. There can be a variable number of attributes other than id (globally unique) and event_time, which are required.
This setup works fine for fetching by id, but now we'd like to efficiently query against event_time and pull ALL attributes for records that fall within a range (could be a million or two items). The criteria would be something like WHERE event_time between 1632934671000 and 1632934672000, for example.
Without changing any existing data or transforming it through an external process, is it possible to create a Global Secondary Index using event_time and projecting ALL attributes that could allow a range query? By my understanding of DynamoDB this isn't possible, but maybe there's another configuration I'm overlooking.
Thanks in advance.
(Edit: I rewrote the answer because the OP's comment clarified that the requirement is to query event_time ranges ignoring id. OP knows the table design is not ideal and is trying to make the best of a bad situation).
Is it possible to create a Global Secondary Index using event_time and projecting ALL attributes that could allow a range query?
Yes. You can add a Global Secondary Index to an existing table and choose which attributes to project. You cannot add an LSI to an existing table or change the table's primary key.
Without changing any existing data or transforming it through an external process?
No. You will need to manipulate the attributes. Although arbitrary range queries are not its strength, DynamoDB has a time series pattern that can be adapted to your query pattern.
Let's say you mostly query a limited number of days. You would add a GSI with a yyyy-mm-dd PK (Partition Key). Rows are made unique by an SK (Sort Key) that concatenates the timestamp with the id: event_time#id. PK and SK together are the index's Composite Primary Key.
GSI1PK = yyyy-mm-dd # 2022-01-20
GSI1SK = event_time#id # 1642709874551#1yp3Or0KrPUBIC
Querying for a single day needs 1 query operation, for a calendar week range needs 7 operations.
GSI1PK = "2022-01-20" AND GSI1SK > ""
Query a range within a day by adding a SK between condition:
GSI1PK = "2022-01-20" AND GSI1SK BETWEEN "1642709874" AND "16427098745"
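To make the key layout concrete, here is a small Python sketch that derives the two index attributes from an item and lists the day partitions a time range touches. The function names are hypothetical, and it assumes event_time is epoch milliseconds interpreted in UTC, which the original answer does not state explicitly.

```python
from datetime import datetime, timedelta, timezone


def gsi_keys(event_time_ms, item_id):
    """Return (GSI1PK, GSI1SK) for one item.

    ASSUMPTION: event_time is epoch milliseconds and days are cut in UTC.
    """
    day = datetime.fromtimestamp(event_time_ms // 1000, tz=timezone.utc)
    return day.strftime("%Y-%m-%d"), f"{event_time_ms}#{item_id}"


def day_partitions(start_ms, end_ms):
    """List every yyyy-mm-dd partition key touched by [start_ms, end_ms].

    Each entry corresponds to one Query operation against the index.
    """
    current = datetime.fromtimestamp(start_ms // 1000, tz=timezone.utc).date()
    last = datetime.fromtimestamp(end_ms // 1000, tz=timezone.utc).date()
    days = []
    while current <= last:
        days.append(current.isoformat())
        current += timedelta(days=1)
    return days


pk, sk = gsi_keys(1642709874551, "1yp3Or0KrPUBIC")
week = day_partitions(1642709874551, 1642709874551 + 6 * 86_400_000)
```

Because the sort key is a string, comparisons are lexicographic; that matches numeric order here only as long as every timestamp has the same number of digits, which holds for 13-digit millisecond values.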
It seems like one can create a global secondary index at any point.
Below is an excerpt from the Managing Global Secondary Indexes documentation which can be found here https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.OnlineOps.html
To add a global secondary index to an existing table, use the UpdateTable operation with the GlobalSecondaryIndexUpdates parameter.
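As a rough illustration of what such an UpdateTable request could contain, the sketch below only builds the parameter dictionary. The helper name build_gsi_update is hypothetical, and the index and attribute names (GSI1, GSI1PK, GSI1SK) are assumptions borrowed from the answer above. On a table with provisioned capacity the Create entry would also need a ProvisionedThroughput setting, which is omitted here.

```python
def build_gsi_update(table_name, index_name, pk_attr, sk_attr):
    """Build UpdateTable parameters that add a GSI projecting all attributes.

    ASSUMPTION: both key attributes are strings ("S") and the table uses
    on-demand capacity, so no ProvisionedThroughput is specified.
    """
    return {
        "TableName": table_name,
        # Attributes used in the new key schema must be declared here.
        "AttributeDefinitions": [
            {"AttributeName": pk_attr, "AttributeType": "S"},
            {"AttributeName": sk_attr, "AttributeType": "S"},
        ],
        "GlobalSecondaryIndexUpdates": [
            {
                "Create": {
                    "IndexName": index_name,
                    "KeySchema": [
                        {"AttributeName": pk_attr, "KeyType": "HASH"},
                        {"AttributeName": sk_attr, "KeyType": "RANGE"},
                    ],
                    "Projection": {"ProjectionType": "ALL"},
                }
            }
        ],
    }


params = build_gsi_update("Events", "GSI1", "GSI1PK", "GSI1SK")
```

Only items that actually carry the index key attributes appear in the index, so existing items still have to be updated with those attributes before they become queryable through it.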
I have a table which is partitioned on a date column.
TRANS_DT BETWEEN '2010-01-01' AND '2030-12-31' EACH INTERVAL '1' MONTH.
A view is created on the table where TRANS_DT is cast as CAST(CAST(TRANS_DT AS DATE FORMAT 'MMM-YYYY') AS VARCHAR(8)) to show the date as, for example, 'MAR-2021'.
When SQL runs against the view with the condition TRANS_DT='MAR-2021', the explain plan shows a full table scan instead of a partition-level scan.
I have my DynamoDB table as follows:
HashKey(Date), RangeKey(timestamp)
The table stores the data for each day (hash key) and timestamp (range key).
Now I want to query the data of the last 7 days.
Can I do this in one query, or do I need to call DynamoDB 7 times, once for each day? The order of the data does not matter. Can someone suggest an efficient query to do that?
I think you have a few options here.
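BatchGetItem - The BatchGetItem operation returns the attributes of one or more items from one or more tables. You identify requested items by primary key, so you could specify the keys you need and fire off a single request. Note that with a composite key each entry must include both the hash key and the range key, so this only helps if you already know the timestamps you want.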
7 calls to DynamoDB. Not ideal, but it'd get the job done.
Introduce a global secondary index that projects your data into the shape your application needs. For example, you could introduce an attribute that represents an entire week by using a truncated timestamp:
2021-02-08 (represents the week of 02/08/21T00:00:00 - 02/14/21T23:59:59)
2021-02-15 (represents the week of 02/15/21T00:00:00 - 02/21/21T23:59:59)
I call this a "truncated timestamp" because I am effectively ignoring the HH:MM:SS portion of the timestamp. When you create a new item in DDB, you could introduce a truncated timestamp that represents the week it was inserted. Therefore, all items inserted in the same week will show up in the same item collection in your GSI.
Depending on the volume of data you're dealing with, you might also consider separate tables to segregate ranges of data. AWS has an article describing this pattern.
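The truncated-timestamp idea above can be sketched in a few lines of Python. The function names are hypothetical, and the choice of Monday as the start of the week is an assumption that happens to match the 2021-02-08 example.

```python
from datetime import date, timedelta


def week_bucket(day):
    """Return the ISO date of the Monday that starts the week of `day`.

    ASSUMPTION: weeks start on Monday.
    """
    monday = day - timedelta(days=day.weekday())
    return monday.isoformat()


def last_n_days(today, n):
    """Return the ISO dates of the last `n` days, ending with `today`."""
    return [(today - timedelta(days=offset)).isoformat() for offset in range(n - 1, -1, -1)]


def buckets_for_last_n_days(today, n):
    """Return the distinct week buckets that cover the last `n` days."""
    days = [today - timedelta(days=offset) for offset in range(n)]
    return sorted({week_bucket(d) for d in days})


bucket = week_bucket(date(2021, 2, 10))
weeks = buckets_for_last_n_days(date(2021, 2, 10), 7)
```

A rolling 7-day window usually straddles two calendar weeks, so with weekly buckets it takes up to two queries plus a range condition on the timestamp, compared with seven queries when the partition key is the day.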
I want to store data in MySQL and query it based on the current day. I want to know what is the best practice to do so.
I want to store data totals for each day, so queries total data will be quick. I thought about modeling my table as follows:
TotalsByCountry
- Year
- Month
- Day
- countryId
- totalNumber
When I query the totals for a specific day and for specific country, I will query the table based on 4 columns, the Year, Month, Day and countryId.
I wanted to know if this is good practice, or whether there is a better way to do it, such as using one column that holds the month, day and year, and querying only two columns: the datetime column and the countryId.
I need your help in choosing the right way to model the table. I also want to make another table that stores totals based on gender, so take that into consideration too.
The data will need to be accessed frequently, maybe in real time, because I want to show data changes as they happen. I will be developing the web app in ASP.NET and will probably use WebSockets to keep a constant connection that pushes updates to the user, so when data changes it is reflected on the user's webpage in real time. That's why I need a table model that is ready for many queries. I will use caching for a few seconds so it won't stress the db too much.
I hope I provided enough information, if not, please comment and I will reply.
Having three separate columns to store each individual element of a date (year/month/day) will add unnecessary overhead to your database in terms of insert performance and disk space.
What you will want to do is simply have a single DATETIME column to store the date and time, and have a composite index set up on (countryId, datetime_col).
Even if you wanted to query all rows based on a specific day or month, MySQL will still be able to utilize indexes on the DATETIME field, provided that you write your queries in the right way and never wrap the DATETIME column within a function when you perform your conditional check.
Here is how you can write your query so that it will still be able to utilize indexes:
-- Get the sum of totalNumber of all rows based on current day
-- where countryId = 1
SELECT SUM(totalNumber) AS totalsum
FROM tbl
WHERE countryId = 1 AND
datetime_col >= CAST(CURDATE() AS DATETIME) AND
datetime_col < CAST(CURDATE() + INTERVAL 1 DAY AS DATETIME)
By making the comparison on the bare DATETIME column, the query remains sargable (i.e. able to utilize index range scans) and MySQL will be able to use indexes to quickly look up rows.
On the other hand, if you were to try to wrap the DATETIME column within a function to make the comparison:
-- Get the sum of totalNumber of all rows based on current day
-- where countryId = 1
SELECT SUM(totalNumber) AS totalsum
FROM tbl
WHERE countryId = 1 AND
DATE(datetime_col) = CURDATE()
...It would be quite inefficient because the DATE() function that wraps the column effectively renders the query as non-sargable, and any kind of index you have set up containing the DATETIME column will not be utilized.
You can also efficiently query for the total sum of all rows in the current month:
-- Get the sum of totalNumber of all rows based on current month
-- where countryId = 1
SELECT SUM(totalNumber) AS monthsum
FROM tbl
WHERE countryId = 1 AND
datetime_col >= CAST(CONCAT(YEAR(NOW()), '-', MONTH(NOW()), '-01') AS DATETIME) AND
datetime_col < CAST(CONCAT(YEAR(NOW()), '-', MONTH(NOW()), '-01') AS DATETIME) + INTERVAL 1 MONTH
And within the current year:
-- Get the sum of totalNumber of all rows based on current year
-- where countryId = 1
SELECT SUM(totalNumber) AS yearsum
FROM tbl
WHERE countryId = 1 AND
datetime_col >= CAST(CONCAT(YEAR(NOW()), '-01-01') AS DATETIME) AND
datetime_col < CAST(CONCAT(YEAR(NOW()), '-01-01') AS DATETIME) + INTERVAL 1 YEAR
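If you prefer to compute the range boundaries in application code and pass them as query parameters, the same half-open comparison can be sketched in Python as below. The function name half_open_range is hypothetical, and the %s placeholder style is an assumption that fits common MySQL drivers; the table and column names are the ones used in the queries above.

```python
from datetime import datetime, timedelta


def half_open_range(now, period):
    """Return (start, end) so that start <= datetime_col < end covers the
    current day, month, or year."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if period == "day":
        start = midnight
        end = start + timedelta(days=1)
    elif period == "month":
        start = midnight.replace(day=1)
        if start.month == 12:
            end = start.replace(year=start.year + 1, month=1)
        else:
            end = start.replace(month=start.month + 1)
    elif period == "year":
        start = midnight.replace(month=1, day=1)
        end = start.replace(year=start.year + 1)
    else:
        raise ValueError("period must be 'day', 'month' or 'year'")
    return start, end


# The column is compared bare, so the (countryId, datetime_col) index
# can still be used for a range scan.
SQL = (
    "SELECT SUM(totalNumber) AS totalsum FROM tbl "
    "WHERE countryId = %s AND datetime_col >= %s AND datetime_col < %s"
)

start, end = half_open_range(datetime(2021, 12, 15, 10, 30), "month")
params = (1, start, end)
```

Using an exclusive upper bound avoids having to know the last representable instant of the period, which is the same reason the SQL versions use < rather than BETWEEN.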
My argument is:
If you want fast database lookups, you need well-built queries that use indexes.
Your approach requires 4 indexes (which means slower inserts); with a single date column you need just two. The query complexity will also increase if you ever need to search for date ranges.
I have an application that writes temperature values to a MySQL table every second. Each row consists of the temperature and a datetime field.
I need to pull these values out of the table at specific intervals, every second, minute, hour etc.
So for example I will need to pull out values between 2 datetime fields and show the temperature at the hour for each of them.
One option I've considered is to create a temporary table that holds a list of the time intervals generated using MySQL INTERVAL and then joining that against the main table.
I'm just wondering if there's some time and date functions in MySQL that I'm overlooking that would make this easier?
Thanks.
You could use BETWEEN for your date range, and then add a condition to the WHERE clause using time() that looks at the time portion of the timestamp. If it ends in :00:00 (for instance, 16:00:00), take the row; if not, leave it.
Example (untested):
SELECT temp, date
FROM temperatures
WHERE (date BETWEEN '2009/01/03 12:00:00' AND '2009/01/04 12:00:00')
AND (time(date) LIKE '%:00:00')
ORDER BY date ASC
LIMIT 10
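For the other idea mentioned in the question, generating the list of interval timestamps and matching them against the table, the list itself is easy to build in application code. The sketch below is only one possible approach with a hypothetical function name; how the marks are then used (a temporary table, an IN list, or a join) is left open.

```python
from datetime import datetime, timedelta


def hour_marks(start, end):
    """Return every whole-hour datetime within [start, end], inclusive."""
    mark = start.replace(minute=0, second=0, microsecond=0)
    if mark < start:
        # start was not on the hour, so begin at the next whole hour.
        mark += timedelta(hours=1)
    marks = []
    while mark <= end:
        marks.append(mark)
        mark += timedelta(hours=1)
    return marks


marks = hour_marks(datetime(2009, 1, 3, 12, 0, 0), datetime(2009, 1, 3, 15, 30, 0))
```

Swapping timedelta(hours=1) for minutes or seconds would give the other intervals mentioned in the question.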