Scenario:
Two tables (table1, table2) with this format:
id:short; timestamp:short; price:single
The data in both tables is the same except for the timestamp.
The timestamp is given as Unix time in ms.
Question:
What is the minimum time difference between the timestamps of the same record in table1 and table2?
In theory, the minimum time difference would be 0 or even 1 ms. In practice, wouldn't the time difference be a function of what it takes to put the data into the table, along with other external performance factors?
One question I would have about the tables is why the same record is being stored in two places.
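If you just want to measure it, here is a minimal sketch of the comparison (pandas purely as an illustration, assuming both tables can be loaded as DataFrames and matched on id, with timestamps in Unix milliseconds; the sample values are made up):
# Illustrative only: join the two tables on id and compare their timestamps.
# In practice you would load table1 and table2 from wherever they live.
import pandas as pd
table1 = pd.DataFrame({
    "id": [1, 2, 3],
    "timestamp": [1700000000000, 1700000000500, 1700000001000],
    "price": [10.5, 11.0, 9.75],
})
table2 = pd.DataFrame({
    "id": [1, 2, 3],
    "timestamp": [1700000000004, 1700000000503, 1700000001007],
    "price": [10.5, 11.0, 9.75],
})
# "Same record" = same id; compare the two timestamps for each matched pair.
merged = table1.merge(table2, on="id", suffixes=("_t1", "_t2"))
diff_ms = (merged["timestamp_t1"] - merged["timestamp_t2"]).abs()
print("min difference (ms):", diff_ms.min())  # smallest observed gap
print("max difference (ms):", diff_ms.max())  # worst-case observed gap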
I need to create a table with the following fields:
place, date, status
My keys are: partition key - place, sort key - date
Status can be either 0 or 1
The table has approximately 300k rows per day and about 3 days' worth of data at any given time, so about 1 million rows. I have a service that is continuously populating data into this DDB.
I need to run the following queries (only) once per day:
#1 Return count of all places with date = current_date-1
#2 Return count and list of all places with date= current_date-1 and status = 0
Questions:
As date is already a sort key, is query #1 bound to be quick?
Do we need to create indexes on sort key fields?
If the answer to the above question is yes: for query #2, do I need to create a GSI on date and status, with date as the partition key and status as the sort key?
Creating a GSI vs. using a filter expression on status for query #2 - which of the two is recommended?
Running analytical queries (such as counts) is the wrong usage of a NoSQL database such as DynamoDB, which is designed for scalable LOOKUP use cases.
Even if you get the SCAN to work with one design or another, it will be more expensive and slower than it should be.
A better option is to export the table data from DynamoDB into S3, and then run an Athena query over that data. That is much more flexible for running various analytical queries.
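For reference, a minimal sketch of kicking off such an export with boto3; the table ARN, bucket, and prefix are placeholders, and the table needs point-in-time recovery enabled for this API. The exported files can then be queried from Athena once a table is defined over them:
# Hypothetical sketch: export the DynamoDB table to S3 so Athena can query it.
import boto3
dynamodb = boto3.client("dynamodb")
dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/places",  # placeholder ARN
    S3Bucket="my-analytics-bucket",                                   # placeholder bucket
    S3Prefix="ddb-exports/places/",
    ExportFormat="DYNAMODB_JSON",
)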
The easiest thing for you to do is a full table scan once per day, filtering by yesterday's date, and as part of that keep your own client-side count of whether the status was 0 or 1. The filter is not index-optimized, so it will be a true full table scan.
Why not an export to S3? Because you're really just doing one query. If you follow the export route you'll have to do a new export every day to keep the data fresh, and the cost of the export in dollar terms (plus complexity) is more than a single full scan. If you were going to do repeated queries against the data then the export makes more sense.
Why not use GSIs? They would make the table scan more efficient by minimizing what's scanned. However, there's a cost (plus complexity) in keeping them current.
Short answer: a once-per-day full table scan is both simple to implement and as fast as you want (parallel scan is an option), plus it's not really costly.
How much would it cost? A million rows at 100 bytes each is a 100 MB table. That's 25,000 read units to fully scan, which is halved to 12,500 with eventual consistency. On-demand pricing is $0.25 per million read units, so 12,500 / 1,000,000 * $0.25 = $0.003 per scan. Less than a dime a month. It'd be cheaper still if you run provisioned capacity.
Just do the scan. :)
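For what it's worth, here is a minimal boto3 sketch of that once-per-day scan. The table name is a placeholder, place/date/status are the fields from the question, and it assumes the date attribute is stored as an ISO yyyy-mm-dd string:
# Hedged sketch: one daily scan that answers both query #1 and query #2 client-side.
import boto3
from boto3.dynamodb.conditions import Attr
from datetime import date, timedelta
table = boto3.resource("dynamodb").Table("places")   # placeholder table name
yesterday = (date.today() - timedelta(days=1)).isoformat()
total = 0           # query #1: count of all places with date = current_date - 1
status_zero = []    # query #2: places with date = current_date - 1 and status = 0
scan_kwargs = {"FilterExpression": Attr("date").eq(yesterday)}
while True:
    page = table.scan(**scan_kwargs)
    for item in page["Items"]:
        total += 1
        if item.get("status") == 0:
            status_zero.append(item["place"])
    if "LastEvaluatedKey" not in page:   # keep paginating until the scan is done
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
print(total, len(status_zero))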
I am wondering if anyone knows a good way to store time series data of different time resolutions in DynamoDB.
For example, I have devices that send data to DynamoDB every 30 seconds. The individual readings are stored in a Table with the unique device ID as the Hash Key and a timestamp as the Range Key.
I want to aggregate this data over various time steps (30 mins, 1 hr, 1 day, etc.) using a Lambda and store the aggregates in DynamoDB as well. I then want to be able to grab data at any resolution for any particular range of time: 48 30-minute aggregates for the last 24 hours, for instance, or each daily aggregate for this month last year.
I am unsure if each new resolution should have its own table (data_30min, data_1hr, etc.) or if a better approach would be something like making a composite hash key by combining the resolution with the device ID and storing all aggregate data in a single table.
For instance, if the device ID is abc123, all 30-minute data could be stored with the hash key abc123_30m and the 1-hour data could be stored with the hash key abc123_1h, and each would still use a timestamp as the range key.
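Roughly, the single-table idea would look something like this (the table name "device_aggregates" and the attribute names are just placeholders):
# Rough sketch of the single-table layout: composite hash key "<deviceId>_<resolution>"
# with the aggregate bucket's start timestamp as the range key.
import boto3
from boto3.dynamodb.conditions import Key
table = boto3.resource("dynamodb").Table("device_aggregates")
def put_aggregate(device_id, resolution, bucket_start_ms, stats):
    # e.g. device_id="abc123", resolution="30m" -> hash key "abc123_30m"
    # (numeric values in stats must be int/Decimal for the resource API)
    table.put_item(Item={
        "device_res": f"{device_id}_{resolution}",  # hash key
        "ts": bucket_start_ms,                      # range key
        **stats,
    })
def get_aggregates(device_id, resolution, start_ms, end_ms):
    # e.g. the 48 most recent 30-minute buckets, or a month of daily buckets
    resp = table.query(
        KeyConditionExpression=Key("device_res").eq(f"{device_id}_{resolution}")
        & Key("ts").between(start_ms, end_ms)
    )
    return resp["Items"]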
What are some pros and cons to each of these approaches and is there a solution I am not thinking of which would be useful in this situation?
Thanks in advance.
I'm not sure if you've seen this page from the tech docs regarding Best Practices for storing time series data in DynamoDB. It talks about splitting your data into time periods such that you only have one "hot" table where you're writing and many "cold" tables that you only read from.
Regarding the primary/sort key selection, you should probably use a coarse timestamp value as the primary key and the actual timestamp as a sort key. Otherwise, if your periods are coarse enough, or each device only produces a relatively small amount of data, then your idea of using the device ID as the hash key could work as well.
Generating pre-aggregates and storing them in DynamoDB would certainly work, though you should definitely consider having separate tables for the different granularities you want to support. Beware of mutating data. As long as all your data arrives in order and you don't need to recompute old data, storing pre-aggregated time series is fine; but if data can mutate, or if you have to account for out-of-order/late-arriving data, then things get complicated.
You may also consider a relational database for the "hot" data (i.e. the last 7 days, or whatever period makes sense) and then running a batch process to pre-aggregate and move the data into cold, read-only DynamoDB tables, with DAX, etc.
I have to prepare a table where I will keep weekly results for some aggregated data. The table will have 30 fields (10 CHARACTERs, 20 DECIMALs), and I think I will have 250k rows weekly.
In my head I can see two scenarios:
A SET table, relying on Teradata to prevent duplicate rows - it should skip duplicate entries while inserting new data.
A MULTISET table with a UPI - it will give an error upon inserting a duplicate row.
The INSERT statement is going to be executed through VBA in Excel, where handling possible Teradata errors is not a problem.
Which scenario will be faster to run in a year's time, when there will be circa 14 million rows?
Is there any other way to have it done?
Regards
At a high level, since you will have a comparatively high row count in your table, it is advisable not to use SET tables; rather, go with a MULTISET table.
For more info you can refer to this link
http://www.dwhpro.com/teradata-multiset-tables/
Why do you care about Duplicate Rows? When you store weekly aggregates there should be no duplicates at all. And Duplicate Rows are not the same as duplicate Primary Key values.
Simply choose a PI which best fits your join/access pattern (maybe partition by date). To avoid any potential duplicates, you might simply use MERGE instead of INSERT.
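For illustration only, a hedged sketch of such a MERGE with made-up table and column names (week_dt plus category_cd as the UPI, one DECIMAL measure). It is shown through the teradatasql Python driver purely as a stand-in; the same statement can be sent from VBA/ODBC just as well:
# Hypothetical upsert of one weekly aggregate row: a re-run of the load then
# updates the existing row instead of failing or silently skipping it.
import datetime
import teradatasql
MERGE_SQL = """
MERGE INTO weekly_results AS tgt
USING VALUES (?, ?, ?) AS src (week_dt, category_cd, metric_a)
ON tgt.week_dt = src.week_dt AND tgt.category_cd = src.category_cd
WHEN MATCHED THEN UPDATE SET metric_a = src.metric_a
WHEN NOT MATCHED THEN INSERT (week_dt, category_cd, metric_a)
     VALUES (src.week_dt, src.category_cd, src.metric_a)
"""
with teradatasql.connect(host="tdhost", user="loader", password="***") as con:  # placeholders
    with con.cursor() as cur:
        cur.execute(MERGE_SQL, [datetime.date(2024, 1, 7), "A", 123.45])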
I have created a table with a Start Date/Time and an End Date/Time. Now I am trying to create a query that will calculate the difference between the start date and the end date in Days, Hours, and Minutes. I am using Access 2013. I would like to do this in the table because, in the end, I need to store the Days, Hours, and Minutes in a permanent record. However, I understand it's not good programming to do the calculation in the table. Any help would be appreciated.
To do the calculations "in the table" you could use a Before Change data macro. For a table with
Date/Time fields named [Started] and [Ended] to hold the start/end values, and
Long Integer fields named [DurationDays], [DurationHours], and [DurationMinutes] to hold the calculated values,
the following macro might do the trick:
For more information on Data Macros see
Create a data macro
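The macro itself is built in the Access macro designer, but the arithmetic its SetField expressions have to express is just a split of the total minutes between [Started] and [Ended] (DateDiff("n", [Started], [Ended]) in Access). A sketch of that split, in Python purely for clarity:
# Illustrative only: decompose a total duration in minutes into whole days,
# leftover hours, and leftover minutes - the values the three Long Integer
# fields would hold.
from datetime import datetime
def split_duration(started, ended):
    total_minutes = int((ended - started).total_seconds() // 60)
    days = total_minutes // (24 * 60)           # -> [DurationDays]
    hours = (total_minutes % (24 * 60)) // 60   # -> [DurationHours]
    minutes = total_minutes % 60                # -> [DurationMinutes]
    return days, hours, minutes
print(split_duration(datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 3, 9, 45)))  # (2, 1, 45)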
I have a table which has the primary key on one column and is partitioned by a date column. This is a sample format of the DDL:
CREATE MULTISET TABLE DB.TABLE_NAME,
NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
( FIRST_KEY DECIMAL(20,0) NOT NULL,
SECOND_KEY DECIMAL(20,0) ,
THIRD_COLUMN VARCHAR(5),
DAY_DT DATE FORMAT 'YYYY-MM-DD')
PRIMARY INDEX TABLE_NAME_IDX_PR (FIRST_KEY)
PARTITION BY RANGE_N(DAY_DT BETWEEN DATE '2007-01-06'
AND DATE '2016-01-02' EACH INTERVAL '1' DAY );
COLLECT STATS ON DB.TABLE_NAME COLUMN(FIRST_KEY);
The incoming data can be around 30 million rows each day, and I have loaded the data for 2012-04-11. Now I have to collect stats for only the '2012-04-11' partition instead of the whole table.
Is there any way to collect stats on the partition for a particular day?
You can simply collect stats on the system column PARTITION, and it should update the histograms relating to the partitioned column.
COLLECT STATS ON {databasename}.{tablename} COLUMN (PARTITION);
This can be collected on both partitioned and non-partitioned tables. It helps provide the optimizer with the cardinality of the table and of the partitions (if they exist). It will update the statistics for all the partitions on the table. Collecting stats on the PARTITION column is a low-CPU-cost, short wall-clock process. It is significantly less expensive than collecting stats on a physical column or the entire table. (Even for tables with millions, tens of millions, or more records.)
If you want to determine whether the optimizer recognizes the refreshed statistics, there is no direct way as of TD 13.10 (not sure about TD 14.x). However, if you run an EXPLAIN on your query you can tell if the optimizer has high confidence on the step in which the criteria against the partitioned column are included. If you specify a single date, such as DATE '2012-04-11', you should see in the EXPLAIN that partition elimination has taken place on a single partition.
If you need help with digesting the EXPLAIN, edit your original question with the EXPLAIN plan for the query and I will help you digest it.