I am currently working on a project that collects customers' demographics weekly and stores the delta (from the previous week) as a new record. This process will encompass 160 variables and a couple hundred million people (my management and a consulting firm require this, although ~100 of the variables are seemingly useless). These variables will be collected from 9 different tables in our Teradata warehouse.
I am planning to split this into 2 tables.
Table with often-used demographics (~60 variables sourced from 3 tables)
Normalized (one customer id and add date for each demographic variable)
Table with rarely used or unused demographics (~100 variables sourced from 6 tables)
Normalized (one customer id and add date for each demographic variable)
Multi-value compression (MVC) is used to save as much space as possible, as the database this will live on is limited in size due to backup limitations. (Note: the customer id currently consumes 30% (3.5 GB) of table 1's size, so each additional table would add roughly that storage cost.)
The table(s) will be accessed by finding the most recent record in relation to the date the Analyst has selected:
SELECT cus_id, demo
FROM db1.demo_test
WHERE (cus_id, add_dt) IN (
    SELECT cus_id, MAX(add_dt)
    FROM db1.dt_test
    WHERE add_dt <= '2013-03-01' -- Analyst-selected point-in-time date
    GROUP BY 1)
GROUP BY 1,2
This data will be used for modeling purposes, so a reasonable SELECT speed is acceptable.
Does this approach seem sound for storage and querying?
Is any individual table too large?
Is there a better suggested approach?
My concerns with splitting further are:
Space, due to incompressible fields such as dates and customer ids
Speed when joining 2-3 tables (I suspect an inner join may use very few resources.)
Please excuse my ignorance in this matter. I usually work with large tables that do not persist for long (I am a Data Analyst by profession) or the tables I build for long term data collection only contain a handful of columns.
In addition to Rob's remarks:
What is your current PI/partitioning?
Is the current performance unsatisfactory?
How do the analysts access the data besides the point-in-time date; are there any other common conditions?
Depending on your needs, a (prev_dt, add_dt) pair might be better than a single add_dt. More overhead to load, but querying might be as simple as date ... BETWEEN prev_dt AND add_dt.
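For example, a point-in-time lookup against such a row-validity pair might look like this (a rough sketch, assuming prev_dt and add_dt bracket each row's validity):
SELECT cus_id, demo
FROM db1.demo_test
WHERE DATE '2013-03-01' BETWEEN prev_dt AND add_dt -- Analyst-selected point-in-time date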
A Join Index on (cus_id), (add_dt) might be helpful, too.
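One possible reading of that suggestion is a single-table join index carrying just the point-in-time columns; a sketch, reusing the names from the query above (the exact column choice and primary index are assumptions):
CREATE JOIN INDEX db1.demo_ji AS
SELECT cus_id, add_dt, demo
FROM db1.demo_test
PRIMARY INDEX (cus_id);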
You might replace the MAX subquery with a RANK (MAX is usually slower; only when cus_id is the PI might RANK be worse):
SELECT *
FROM db1.demo_test
WHERE add_dt <= '2013-03-01' -- Analyst-selected point-in-time date, as in the original query
QUALIFY RANK() OVER (PARTITION BY cus_id ORDER BY add_dt DESC) = 1
In TD14 you might split your single table into two row containers of a column-partitioned table.
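A very rough sketch of what that could look like, with made-up column names (the exact column-grouping syntax should be verified against the TD14 documentation):
CREATE TABLE db1.demo_cp (
   cus_id  INTEGER NOT NULL,
   add_dt  DATE NOT NULL,
   demo_a  VARCHAR(50),  -- often-used group
   demo_b  VARCHAR(50)   -- rarely-used group
)
NO PRIMARY INDEX
PARTITION BY COLUMN (ROW(cus_id, add_dt, demo_a), ROW(demo_b));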
...
The width of the table at 160 columns, sparsely populated, is not necessarily an incorrect physical implementation (normalized in 3NF or slightly de-normalized). I have also seen situations where attributes that are not regularly accessed are moved to a documentation table. If you elect to implement the latter in your physical design, it would be in your best interest for each table to share the same primary index. This allows the join of these two tables (60 attributes and 100 attributes) to be AMP-local on Teradata.
If access to the table(s) will also include the add_dt column, you may wish to create a partitioned primary index on this column. This will allow the optimizer to eliminate the other partitions from being scanned when the add_dt column is included in the WHERE clause of a query. Another option would be to test the behavior of a value-ordered secondary index on the add_dt column.
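For illustration, a hedged sketch of both options (the table definition and the monthly interval are assumptions based on the query in the question):
CREATE TABLE db1.demo_test (
   cus_id  INTEGER NOT NULL,
   add_dt  DATE NOT NULL,
   demo    VARCHAR(50)
)
PRIMARY INDEX (cus_id)
PARTITION BY RANGE_N(add_dt BETWEEN DATE '2012-01-01' AND DATE '2020-12-31' EACH INTERVAL '1' MONTH);

-- Alternative to test: a value-ordered secondary index on add_dt
CREATE INDEX (add_dt) ORDER BY VALUES (add_dt) ON db1.demo_test;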
We are designing an application which will use DynamoDB as its storage system.
We identified the different access patterns and, after reviewing the Global Secondary Indexes documentation, we got stuck on making a decision about which approach to use: index overloading or having 2 sparse indexes.
To give more context, our application stores orders; we can have internal or external orders. Based on that, they will be linked to a Customer or a Warehouse:
As we would like to search by customer and/or warehouse, we thought about 2 solutions.
The first solution would be to keep the above data structure and create 2 indexes:
GSI1 - Customer (PK)
GSI2 - Warehouse (PK)
The second solution is to overload another column, like:
So only 1 index is required: Destination (PK), and queries apply a prefix condition.
The question is: "Is there any benefit between index overloading over having 2 different sparse Global Secondary Indexes?" (Cost saving on capacity provisioning, data transport, query times, data complexity...)
As I didn't get any answer yet, I'll add my own opinion.
There is no big difference between the 2 approaches: in both cases all items will end up being indexed and similar attributes stored.
Some benefits I could find are:
Benefits using 2 GSI
Data schema is easier to understand (no overloading)
More flexibility for evolving the schema: if requirements change, an order can be assigned to both a customer and a warehouse.
Ability to fine-tune projections (not always applicable, but you may only need 2 fields for the Customer access pattern and 3 for the Warehouse one)
Smaller indexes perform better
Benefits using 1 GSI
No need to worry about capacity units; they can mirror the main table's. When using 2 indexes, you need an estimate of how many records will fall under each of them; otherwise you need to over-provision them.
Example: if you give each index 50% of the main table's RCU and WCU, but 70% of the orders are for customers, some requests will be throttled.
In summary, even though using 2 indexes allows a more precise configuration, it may end up costing more and requiring you to revisit the index configuration from time to time to match actual access-pattern usage.
I will incrementally insert rows into a table. This table stores sales facts and has some columns that will be used to define an identifier: business id (int), product name (string), product price (float). E.g. <1, heineken, 1.0>, <1, heineken, 22.99>.
Certainly, these values will be used in joins. When thinking the SQL way, I would create a hashed column from those columns so I could optimize some queries.
How about a data lake and U-SQL? Should I calculate the hash on insert? Should I leave it as is? Should I simply concatenate the values into one big string?
Thanks in advance.
While U-SQL supports clustering and distribution schemes on multiple columns, you probably could gain some additional performance in your joins if you find an efficient value to do your equi-join comparison. So you could calculate a hash or concatenate.
However, I think finding the right distribution scheme and clustering is the better "bang for your buck".
And, more importantly, please do not incrementally insert small numbers of rows; use bulk insertion of many rows at the same time (e.g., daily or weekly). And regularly rebuild the table or table partition to avoid table fragmentation, which will have a much bigger impact on your query performance.
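A minimal U-SQL sketch of both points (dbo.Sales, its columns, and the chosen clustering/distribution keys are assumptions for illustration, not a recommendation for your exact data):
CREATE TABLE IF NOT EXISTS dbo.Sales
(
    BusinessId int,
    ProductName string,
    ProductPrice double,
    INDEX cl_idx CLUSTERED (BusinessId, ProductName)
    DISTRIBUTED BY HASH (BusinessId, ProductName)  // cluster/distribute on the join columns instead of adding a computed hash column
);

// Rebuild periodically after bulk loads to reduce fragmentation
ALTER TABLE dbo.Sales REBUILD;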
First, let me explain my problem:
This is a table that will contain approximately 5,000,000 records per year, and these records will be kept for at least 10 years (not yet finally defined). We are talking about production-machine events. I generate a report plus a dashboard displaying various relatively complex information (average number of events per 10 minutes over a month, graphs, ...) and I also want to be able to view the records themselves. The data displayed will, in the large majority of cases, come from the last 2 months; viewing the rest of the data must always remain possible, but a lower access speed is acceptable.
I work on MariaDB v10.1.12.
The idea was to create a partition covering the last 3 months. I realize now that this is not so easy: I have not found any solution for such a partition. In fact, it is impossible to define a partition based on NOW(), CURRENT_DATE(), etc., either directly or indirectly via another calculated column.
Do you have any ideas for me? Perhaps another solution than a partition.
Thank you in advance.
I recommend PARTITION BY RANGE(TO_DAYS(...)). If you are only now breaking the table into partitions, I would recommend annual partitions for data before this year, then quarterly or monthly partitions henceforth. Yes, that, in theory, leads to an ever-growing number of partitions, but I predict that you will revamp the data structure within a few years.
20-50 partitions is a good number. More than that leads to inefficiencies due to the multitude of partitions; fewer than that leads to asking "why bother".
Use InnoDB. Design the PRIMARY KEY carefully, since it may be useful as the primary index into the data.
Usually it is best to put the date/timestamp column last in any indexes. Putting it first would be redundant since partition pruning comes first.
More on partitioning.
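A rough sketch of such a layout (table and column names are made up for illustration; adjust the ranges to your own data):
CREATE TABLE machine_event (
    machine_id INT UNSIGNED NOT NULL,
    event_ts   DATETIME NOT NULL,
    event_type VARCHAR(40) NOT NULL,
    PRIMARY KEY (machine_id, event_ts)   -- timestamp last, as noted above
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(event_ts)) (
    PARTITION p_old     VALUES LESS THAN (TO_DAYS('2016-01-01')),  -- annual bucket for older rows
    PARTITION p2016_01  VALUES LESS THAN (TO_DAYS('2016-02-01')),  -- then monthly (or quarterly)
    PARTITION p2016_02  VALUES LESS THAN (TO_DAYS('2016-03-01')),
    PARTITION p_future  VALUES LESS THAN MAXVALUE
);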
It sounds like a main purpose of the table is to summarize the data for graphing, etc. In that case, it may be very beneficial to build and maintain "summary table(s)" of counts and subtotals over selected time intervals. Do roughly 100 rows get added per 10-minute interval? If so, a summary table based on 10-minute intervals would have 1/100th as many rows, and the queries would be much faster. Plus, you could 'denormalize' the summary tables to make them even simpler.
More on Summary tables.
It might be worth it to gather data for 10 minutes into a staging table, then summarize it into the summary table. And also throw the raw data into the big table.
Or, if the summary tables have everything you need, you could abandon the big table. Or, as a compromise, keep 12 months' worth of data (partitioned by month) and DROP PARTITION for older data. Meanwhile, the summary tables can continue to grow (although they will be much smaller).
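A hedged sketch of such a 10-minute summary table and its rollup (the table names and the bucketing expression are just one way to do it):
CREATE TABLE event_summary_10min (
    machine_id INT UNSIGNED NOT NULL,
    slot_start DATETIME NOT NULL,      -- start of the 10-minute interval
    event_cnt  INT UNSIGNED NOT NULL,
    PRIMARY KEY (machine_id, slot_start)
) ENGINE=InnoDB;

-- Roll up freshly staged raw rows into 10-minute buckets
INSERT INTO event_summary_10min (machine_id, slot_start, event_cnt)
SELECT machine_id,
       FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(event_ts) / 600) * 600),
       COUNT(*)
FROM staging_events
GROUP BY 1, 2
ON DUPLICATE KEY UPDATE event_cnt = event_cnt + VALUES(event_cnt);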
Table partitioning is an advanced feature; it is not indexing, but a rearrangement of the table's data. So it is not a "duplicate": new data will be stored according to the predefined partitioning ranges.
You must still specify the month-range criteria in your queries as usual, and you MUST create an index if those columns are not used as the partition range. When you run a SELECT, the algorithm associated with the partitioned table will handle any merging (if required) in the background. So you can treat a partitioned table exactly like a typical table.
For more details, please check the MariaDB partitioning overview
I have to prepare a table where I will keep weekly results for some aggregated data. The table will have 30 fields (10 CHARACTERs, 20 DECIMALs); I think I will have 250k rows weekly.
In my head I can see two scenarios:
A SET table, relying on Teradata to prevent duplicate rows - it should silently skip duplicate entries while inserting new data
A MULTISET table with a UPI - it will raise an error upon inserting a duplicate row
The INSERT statement is going to be executed through VBA in Excel, where handling possible Teradata errors is not a problem.
Which scenario will be faster to run in a year's time, when there will be circa 14 million rows?
Is there any other way to do this?
Regards
At a high level, since your table will have a comparatively high row count, it is advisable not to use SET tables; rather, go with a MULTISET table.
For more info you can refer to this link
http://www.dwhpro.com/teradata-multiset-tables/
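For illustration, a minimal sketch of the MULTISET-with-UPI option (the column names are placeholders for your 30 fields, and the UPI choice is an assumption):
CREATE MULTISET TABLE db1.weekly_results (
   result_key  INTEGER NOT NULL,
   week_dt     DATE NOT NULL,
   metric_1    DECIMAL(18,2)
   -- ... remaining CHARACTER/DECIMAL columns ...
)
UNIQUE PRIMARY INDEX (result_key, week_dt);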
Why do you care about Duplicate Rows? When you store weekly aggregates there should be no duplicates at all. And Duplicate Rows are not the same as duplicate Primary Key values.
Simply choose a PI which best fits your join/access pattern (maybe partition by date). To avoid any potential duplicates, you might simply use MERGE instead of INSERT.
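A MERGE sketch along those lines (db1.weekly_results, the staging table db1.weekly_stage, and the key columns are assumptions; in Teradata the ON clause has to equate the target's primary index columns):
MERGE INTO db1.weekly_results tgt
USING db1.weekly_stage src
  ON tgt.result_key = src.result_key
 AND tgt.week_dt = src.week_dt
WHEN MATCHED THEN
  UPDATE SET metric_1 = src.metric_1
WHEN NOT MATCHED THEN
  INSERT (result_key, week_dt, metric_1)
  VALUES (src.result_key, src.week_dt, src.metric_1);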
Are materialized views in Oracle (11g) good practice for aggregated tables in data warehousing?
We have DW processes that replace 2 months of data each day. Sometimes that means a few gigabytes for each month (~100K rows).
On top of them are materialized views that get refreshed after the nightly data-transfer cycle.
My question is: would it be better to create aggregate tables instead of the MVs?
I think that one case where aggregated tables might be beneficial is where the aggregation can be effectively combined with the atomic-level data load, best illustrated with an example.
Let's say that you load a large volume of data into a fact table every day via a partition exchange. A materialized view refresh using partition change tracking is going to be triggered during or after the partition exchange, and it is going to scan the modified partitions and apply the changes to the MVs.
It is possible that, as part of populating the table(s) that you are going to exchange with the fact table partitions, you could also compute aggregates at various levels using CUBE/ROLLUP, and use a multitable insert to load up tables that you can then partition-exchange into one or more aggregation tables. Not only might this be inherently more efficient by avoiding rescanning the atomic-level data, but your aggregates are also computed prior to the fact table partition exchange, so if anything goes wrong you can suspend the modification of the fact table itself.
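As a hedged illustration of that idea (all table and column names here are hypothetical):
-- Load the exchange-ready detail grain and a day-level aggregate in one pass over the staged data
INSERT ALL
  WHEN gid = 0 THEN
    INTO stg_sales_exchange (sale_dt, prod_id, amt) VALUES (sale_dt, prod_id, amt)
  WHEN gid = 1 THEN
    INTO stg_sales_by_day (sale_dt, amt) VALUES (sale_dt, amt)
SELECT sale_dt,
       prod_id,
       SUM(amt) AS amt,
       GROUPING_ID(sale_dt, prod_id) AS gid  -- 0 = (sale_dt, prod_id) grain, 1 = prod_id rolled up
FROM stg_sales_load
GROUP BY ROLLUP (sale_dt, prod_id);
Both targets can then be partition-exchanged into the fact table and an aggregation table respectively.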
Other thoughts might occur later ... I'll open the answer up as a community wiki if others have ideas.