Are Oracle (11g) materialized views good practice for aggregated tables in data warehousing?
We have DW processes that replace 2 months of data each day. Sometimes that means a few gigabytes for each month (~100K rows).
On top of them are materialized views that get refreshed after the nightly data-transfer cycle.
My question is: would it be better to create aggregated tables instead of the MVs?
I think that one case where aggregated tables might be beneficial is where the aggregation can be effectively combined with the atomic-level data load, best illustrated with an example.
Let's say that you load a large volume of data into a fact table every day via a partition exchange. A materialized view refresh using partition change tracking is going to be triggered during or after the partition exchange, and it's going to scan the modified partitions and apply the changes to the MVs.
It is possible that as part of the population of the table(s) that you are going to exchange with the fact table partitions you could also compute aggregates at various levels using CUBE/ROLLUP, and use multitable insert to load up tables that you can then partition exchange into one or more aggregation tables. Not only might this be inherently more efficient through avoiding rescanning the atomic-level data, your aggregates are computed prior to the fact table partition exchange so if anything goes wrong you can suspend the modification of the fact table itself.
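As a rough illustration of the ordering described above (aggregate first, touch the fact table last), here is a minimal in-memory sketch in Python. It is not Oracle code: the lists and dictionaries merely stand in for the staged table, the fact table and the aggregation tables, it appends rather than exchanging partitions, and the names `cube_aggregates` and `load_batch` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cube_aggregates(rows, dims, measure):
    """Sum `measure` for every subset of `dims` in one pass (CUBE-style).

    Returns {grouping_columns: {key_values: total}}.
    """
    groupings = [g for size in range(len(dims) + 1)
                 for g in combinations(dims, size)]
    results = {g: defaultdict(float) for g in groupings}
    for row in rows:  # single pass over the staged rows
        for g in groupings:
            results[g][tuple(row[d] for d in g)] += row[measure]
    return {g: dict(totals) for g, totals in results.items()}


def load_batch(fact_table, agg_tables, staged_rows, dims, measure):
    """Aggregate the staged rows first; modify the tables only on success."""
    new_aggs = cube_aggregates(staged_rows, dims, measure)  # may raise
    # Nothing has been modified yet, so a failure above leaves every table intact.
    fact_table.extend(staged_rows)
    for grouping, totals in new_aggs.items():
        target = agg_tables.setdefault(grouping, {})
        for key, total in totals.items():
            target[key] = target.get(key, 0.0) + total
```

If the aggregation step raises, `load_batch` exits before the fact table is touched, which mirrors the "suspend the modification of the fact table" point.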
Other thoughts might occur later ... I'll open the answer up as a community Wiki if others have ideas.
Is there any good documentation on how query times change for a DynamoDB table based on equal read capacity and differing row sizes? I've been reading through the documentation and can't find anything, was wondering if anybody has done any studies into this?
My use case is that I'm putting a million rows into a table a week. These records are referenced quite a bit as they're entered but as time goes on the frequency at which I query those rows decreases. Can I leave those records in the table indefinitely with no detrimental effect on query time, or should I rotate them out so the newer data that is requested more frequently returns faster?
Please don't keep the old data indefinitely. It is advised to archive the data for better performance.
A few points on design and testing:
Design a proper hash key so that the data is distributed across the partitions
Understand access patterns for time series data
Test your application at scale to avoid problems with "hot" keys when your table becomes larger
Suppose you design a table to track customer behavior on your site,
such as URLs that they click. You might design the table with a
composite primary key consisting of Customer ID as the partition key
and date/time as the sort key. In this application, customer data
grows indefinitely over time; however, the applications might show
uneven access pattern across all the items in the table where the
latest customer data is more relevant and your application might
access the latest items more frequently and as time passes these items
are less accessed, eventually the older items are rarely accessed. If
this is a known access pattern, you could take it into consideration
when designing your table schema. Instead of storing all items in a
single table, you could use multiple tables to store these items. For
example, you could create tables to store monthly or weekly data. For
the table storing data from the latest month or week, where data
access rate is high, request higher throughput and for tables storing
older data, you could dial down the throughput and save on resources.
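A minimal sketch of the table-per-month idea from the quoted guidance. It only computes table names and a capacity setting and does not call any AWS API; the `clicks` prefix and the capacity numbers are made-up placeholders, not recommendations.

```python
from datetime import date


def table_for(day, prefix="clicks"):
    """Name of the monthly table that holds items for `day` (hypothetical naming)."""
    return f"{prefix}_{day.year:04d}_{day.month:02d}"


def read_capacity_for(table_month, today, hot=100, cold=5):
    """Higher read capacity for the current month's table, lower for older ones.

    `hot` and `cold` are illustrative values only.
    """
    is_current = (table_month.year, table_month.month) == (today.year, today.month)
    return hot if is_current else cold
```

Writes and the frequent reads would go to `table_for(today)`, while older tables keep serving occasional queries at the lower setting.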
Time Series Data Access Pattern
Guidelines for table partitions
First, let me explain my problem:
This is a table that will contain approximately 5,000,000 records per year, and these records will be kept for at least 10 years (the retention period is not yet defined). The records are production-machine events. I generate a report and a dashboard that display various relatively complex information (average number of events per 10 minutes over a month, charts, ...), and I also want to view the records themselves. The large majority of the data displayed will be from the last 2 months; viewing the rest of the data must always be possible, but slower access is acceptable.
I work on MariaDB v10.1.12.
The idea was to make a partition for the last 3 months. I realize now that this is not so easy. I have not found any way to define such a partition; in fact, it seems impossible to partition on now(), current_date(), etc., either directly or indirectly via a calculated column.
Do you have any ideas for me? Perhaps a solution other than partitioning?
Thank you in advance.
I recommend PARTITION BY RANGE(TO_DAYS(...)). If you are only now breaking the table into partitions, I would recommend annual partitions for data before this year, then quarterly or monthly partitions henceforth. Yes, that, in theory, leads to an infinite number of partitions, but I predict that you will revamp the data structure within a few years.
20-50 partitions is a good number. More than that leads to inefficiencies due to the multitude of partitions; less than that leads to asking "why bother".
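To make the PARTITION BY RANGE(TO_DAYS(...)) suggestion concrete, here is a small Python helper that generates such a clause with monthly partitions. It is only a sketch: the column name `event_time` used in the usage note, the partition naming and the catch-all `p_future` partition are assumptions, and the column is assumed to be DATE or DATETIME.

```python
from datetime import date


def next_month(d):
    """First day of the month following `d`."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)


def monthly_partition_clause(column, first_month, count):
    """Build a PARTITION BY RANGE (TO_DAYS(column)) clause with `count` monthly partitions."""
    parts = []
    start = date(first_month.year, first_month.month, 1)
    for _ in range(count):
        end = next_month(start)
        # Each partition holds rows strictly before the first day of the next month.
        parts.append(
            f"  PARTITION p{start:%Y_%m} VALUES LESS THAN (TO_DAYS('{end:%Y-%m-%d}'))"
        )
        start = end
    # Catch-all so that inserts beyond the last defined month do not fail.
    parts.append("  PARTITION p_future VALUES LESS THAN MAXVALUE")
    return f"PARTITION BY RANGE (TO_DAYS({column})) (\n" + ",\n".join(parts) + "\n)"
```

Calling `monthly_partition_clause("event_time", date(2016, 11, 1), 3)` yields partitions `p2016_11`, `p2016_12`, `p2017_01` and `p_future`; the text would be appended to a CREATE TABLE or ALTER TABLE statement.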
Use InnoDB. Design the PRIMARY KEY carefully, since it may be useful as the primary index into the data.
Usually it is best to put the date/timestamp column last in any indexes. Putting it first would be redundant since partition pruning comes first.
More on partitioning.
It sounds like a main purpose for the table is to summarize the data for graphing, etc. In that case, it may be very beneficial to build and maintain "Summary table(s)" of counts and subtotals over selected time intervals. 100 rows get added up for a 10-minute interval? If so, then the summary table based on 10-minute intervals would have 1/100th as many rows, and the queries would be much faster. Plus, you could 'denormalize' the summary tables to make them even simpler.
More on Summary tables.
It might be worth it to gather data for 10 minutes into a staging table, then summarize it into the summary table. And also throw the raw data into the big table.
Or, if the summary tables have everything you need, you could abandon the big table. Or, as a compromise, keep 12 months' worth of data (partitioned by month), and DROP PARTITION for older data. Meanwhile, the summary tables can continue to grow (although they will be much smaller).
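The staging-then-summarize flow described above can be sketched as follows. This uses Python's built-in sqlite3 module purely so the example is runnable; the table and column names are hypothetical, timestamps are assumed to be epoch seconds, and each flush is assumed to cover whole 10-minute intervals (a bucket summarized twice would violate the primary key and fail loudly).

```python
import sqlite3


def create_event_tables(conn):
    """Create the staging, raw and summary tables (hypothetical schema)."""
    conn.executescript("""
        CREATE TABLE staging_events (machine_id INTEGER NOT NULL, event_ts INTEGER NOT NULL);
        CREATE TABLE events         (machine_id INTEGER NOT NULL, event_ts INTEGER NOT NULL);
        CREATE TABLE event_summary (
            machine_id   INTEGER NOT NULL,
            bucket_start INTEGER NOT NULL,  -- epoch seconds, start of the 10-minute interval
            event_count  INTEGER NOT NULL,
            PRIMARY KEY (machine_id, bucket_start)
        );
    """)


def flush_staging(conn):
    """Summarize staged rows per 10-minute bucket, then move them to the big table."""
    with conn:  # commit on success, roll back if any statement fails
        conn.execute("""
            INSERT INTO event_summary (machine_id, bucket_start, event_count)
            SELECT machine_id, (event_ts / 600) * 600, COUNT(*)
            FROM staging_events
            GROUP BY machine_id, (event_ts / 600) * 600
        """)
        conn.execute("INSERT INTO events SELECT machine_id, event_ts FROM staging_events")
        conn.execute("DELETE FROM staging_events")
```

Dashboard queries would then read `event_summary`, which has one row per machine per interval instead of one row per event.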
Table partitioning is an advanced feature. It is not indexing but a rearrangement of the table's data, so nothing is duplicated; new data is stored according to the predefined partitioning ranges.
You still specify the month range criteria in your queries as usual. You must create an index on those columns if they are not used as the partition range. When you run a SELECT, the partitioning engine handles any merging across partitions (if required) in the background, so you can treat a partitioned table exactly like a typical table.
For more details, please check the MariaDB partitioning overview.
I'm building an application in ASP.NET (VB) with an MS SQL database. It is a search tool for cars that has a list of every car and all of its attributes (colors, # of doors, gas mileage, mfg. year, etc.). The tool outputs the results in a GridView, and users can perform advanced searches and filtering. The filtering needs to be very fine-grained (gas mileage range, color(s), mfg. year range, etc.), and I cannot seem to find a good way to do this filtering without a large SQL WHERE clause that is going to greatly impact SQL performance and page load. I feel like I'm missing something very obvious here. Thank you for any help. I'm not sure what other details would be helpful.
This is not an OLTP database you're building--it's really an analytics database. There really isn't a way around the problem of having to filter. The question is whether the organization of the data will allow seeks most of the time, or will it require scans; and also whether the resulting JOINs can be done efficiently or not.
My recommendation is to go ahead and create the data normalized and all, as you are doing. Then, build a process that spins it into a data warehouse, denormalizing like crazy as needed, so that you can do filtering by WHERE clauses that have to do a lot less work.
For every single possible search result, you have a row in a table that doesn't require joining to other tables (or only a few fact tables).
You can reduce complexity a bit for some values such as gas mileage, by striping the mileage into bands of, say, 5 mpg. (10-19, 20-24, 25-29, etc.)
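A minimal sketch of the banding idea, assuming uniform 5 mpg bands (the function name and label format are made up for illustration). The band label would be computed once when the data warehouse is loaded and stored as its own column, so a search can filter on equality instead of a range.

```python
def mpg_band(mpg, width=5):
    """Return the label of the fixed-width band that `mpg` falls into."""
    low = int(mpg // width) * width
    return f"{low}-{low + width - 1}"


def cars_in_bands(cars, bands):
    """Filter (name, mpg) pairs to those whose band label is in `bands`."""
    return [name for name, mpg in cars if mpg_band(mpg) in bands]
```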
As you need to add to the data and change it, your data-warehouse-loading process (that runs once a day perhaps) will keep the data warehouse up to date. If you want more frequent loading that doesn't keep clients offline, you can build the data warehouse to an alternate node, then swap them out. Let's say it takes 2 hours to build. You build for 2 hours to a new database, then swap to the new database, and all your data is only 2 hours old. Then you wipe out the old database and use the space to do it again.
I am designing a database for a fleet management system.
I will be getting n records every 3 seconds. Obviously, there will be millions of records in the current_location table, where I am going to store the current information for each vehicle. Performance is a big issue here.
To solve this, I received the following suggestions:
Create a separate table for each vehicle.
Here a table will be created at run time as soon as I click on "create new table". All the data related to that vehicle will be inserted into and retrieved from that particular table.
Go for partitioning.
Please answer the following questions about these solutions.
What is difference between the two?
Which is best and why?
At what point will the number of rows in the tables cause performance issues?
Are there any other solutions?
Now, if I go for range partitioning in SQL Server 2008, how should I partition on a varchar(20) column?
I am planning to partition based on the vehicle number, e.g. MH30 q 1234.
In a vehicle number such as MH30 q 1234, only the 30 and the q are going to change, so my question is how I should write the partition function.
Note: this question was originally asked for MySQL; I have since moved to SQL Server, with the same question.
Definitely use partitioning. Why go to all of the hassle of figuring out which table to use to answer a question when MySQL will do it for you? And good luck finding the current location of all of your trucks if you're not using partitioning!
Partitioning gives you the performance benefits of multiple tables, but with automatic pruning (selection of just the tables needed to answer the query).
Nothing is ever "best". The question is: what is best for your problem?
This is impossible to answer. You will just have to monitor your system for performance issues and adjust server settings or scale as necessary.
At least as far as MySQL is concerned, none as good as partitioning!
Don't bother with partitioning for 28,800 rows per day.
We don't (yet) with over 5 million per day. (The "yet" means we have no business input on what data retention policy they want)
There should be very little performance difference between making a separate table for each vehicle, and making the vehicle ID the first field in the primary key. You get the same grouping on disk either way, and mysql should have no trouble with millions of rows in a table.
Partitions are only useful if you have multiple disks on your machine and want to spread the load across disks.
So I guess my answer is do neither. Designing this in a priori seems overkill.
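A minimal sketch of the single-table layout with the vehicle ID as the first field of the primary key. It uses Python's sqlite3 module only to stay runnable (SQLite's WITHOUT ROWID stands in for a clustered primary key); the table name, columns and sample vehicle numbers are hypothetical.

```python
import sqlite3


def create_location_table(conn):
    """One table for all vehicles, clustered by (vehicle_no, recorded_at)."""
    conn.execute("""
        CREATE TABLE vehicle_location (
            vehicle_no  TEXT    NOT NULL,
            recorded_at INTEGER NOT NULL,  -- epoch seconds
            lat         REAL    NOT NULL,
            lon         REAL    NOT NULL,
            PRIMARY KEY (vehicle_no, recorded_at)
        ) WITHOUT ROWID
    """)


def latest_positions(conn):
    """Most recent row per vehicle, answered from the one shared table."""
    return conn.execute("""
        SELECT v.vehicle_no, v.recorded_at, v.lat, v.lon
        FROM vehicle_location AS v
        JOIN (SELECT vehicle_no, MAX(recorded_at) AS recorded_at
              FROM vehicle_location
              GROUP BY vehicle_no) AS last
          ON last.vehicle_no = v.vehicle_no
         AND last.recorded_at = v.recorded_at
        ORDER BY v.vehicle_no
    """).fetchall()
```

With one table per vehicle, the same "where is every vehicle right now" question would need one query per table.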
One thing I want to point out is that having one table (which you can partition later when you need to) will be much easier to maintain both in the database and in terms of querying the data.
I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is to store an entry in a statistics table each time a user enters a specific zone in my DB (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users as I stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever a user sees a product and the lead submission form.
The problem is that after a month, my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought about writing a service that would somehow parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks to anyone who helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data, and instead of calculating them "online" when the page that displays them loads, you calculate them offline, either via a nightly batch process or incrementally when each log record is written.
A simple enhancement would be to store counts per user/session, instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor in the order of the hits per session. Of course it would increase processing costs when inserting log entries.
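A minimal sketch of storing a running count per session and zone instead of one row per hit, using Python's sqlite3 module so it can be run as-is. The `zone_hits` table and its columns are hypothetical; the same insert-then-increment pattern applies to other databases.

```python
import sqlite3


def create_hits_table(conn):
    """One row per (session, zone) holding a running hit count."""
    conn.execute("""
        CREATE TABLE zone_hits (
            session_id TEXT    NOT NULL,
            zone       TEXT    NOT NULL,
            hits       INTEGER NOT NULL,
            PRIMARY KEY (session_id, zone)
        )
    """)


def record_hit(conn, session_id, zone):
    """Increment the counter at write time instead of storing every hit."""
    with conn:  # both statements commit together
        # Create the counter row on the first hit; do nothing if it already exists.
        conn.execute(
            "INSERT OR IGNORE INTO zone_hits (session_id, zone, hits) VALUES (?, ?, 0)",
            (session_id, zone),
        )
        conn.execute(
            "UPDATE zone_hits SET hits = hits + 1 WHERE session_id = ? AND zone = ?",
            (session_id, zone),
        )
```

Each write now costs two statements, which is the extra insert-time processing mentioned above, but reports read one row per session and zone.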
Another kind of aggregation is called online analytical processing, which only aggregates along some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This act will improve the reporting response time along with the aggregation ideas mentioned earlier.
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" -- choosing the partition based on the range into which a value falls. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month -- depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data that is outside that range will not even be considered; that can lead to very significant time savings, even better than an index (an index has to consider every row, so it will grow with your data; a partition is one per day).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.
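To illustrate why a date-range query only touches some partitions, here is a toy in-memory model in Python. It is not how any particular database implements partitioning; `DayPartitionedLog` is a made-up class that keeps one bucket per day and reports how many buckets a query had to read.

```python
from collections import defaultdict
from datetime import date, timedelta


class DayPartitionedLog:
    """Toy model of a table range-partitioned by day."""

    def __init__(self):
        self._partitions = defaultdict(list)  # day -> rows stored for that day

    def insert(self, day, row):
        self._partitions[day].append(row)

    def query(self, start, end):
        """Return (rows, partitions_scanned) for start <= day <= end."""
        rows, scanned = [], 0
        day = start
        while day <= end:
            if day in self._partitions:  # days outside the range are never looked at
                scanned += 1
                rows.extend(self._partitions[day])
            day += timedelta(days=1)
        return rows, scanned
```

However many days of data are stored, a one-week query reads at most seven buckets.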