De-normalize live data for the sake of reports - Good or Bad? - report

What are the pros/cons of de-normalizing an enterprise application database because it will make writing reports easier?
Pro - designing reports in SSRS will probably be "easier" since no joins will be necessary.
Con - developing/maintaining the app to handle de-normalized data will become more difficult due to duplication of data and synchronization.
Others?

Denormalization for the sake of reports is Bad, m'kay.
Creating views, or a denormalized data warehouse is good.
Views have solved most of my reporting related needs. Data warehouses are great when users will be generating reports almost constantly or when your views start to slow down.
This is why you want to normalize your database
To free the collection of relations from undesirable insertion, update and deletion dependencies;
To reduce the need for restructuring the collection of relations as new types of data are introduced, and thus increase the life span of application programs;
To make the relational model more informative to users;
To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
—E.F. Codd, "Further Normalization of the Data Base Relational Model" via wikipedia

The only time you should consider de-normaliozation is when the time it takes the report to generate is not acceptable. De-normalization will cause consistentcy issues that are sometimes impossible to determine especially in large datasets

Don't denormalize just to get rid of complexity in reporting, it can cause huge problems in the rest of the application. Either you don't enforce the rules resulting in bad data or if you do then inserts, deletes and updates can be seriously slowed for everyone not just the two or three people who run reports.
If the reports truly can't run well, then create a data warehouse that is denormalized and populate it in a nightly or weekly feed. The kind of reports that typically need this do not generally care if the data is up-to-the minute as they are usually monthly, quarterly, or annual reports that process (and especially aggregate) large amounts of data after the fact.

You can do both... let the normalized database for applications.
Then create a denormalized database for reports, and create an application which regulary copy data from one database to the other.
After all, reports don't always need to have the latest updated data, most of the time you can easily launch an update every 1 hour on the reporting database, and only once a day day.

Beyond the data warehouse and views solutions provided in other answers, which are good in some ways, if you are willing to sacrifice some performance to get a good to the last second data, but still want a normalized database, you could use on Oracle a Materialized View with fast refresh on commit, or in Sql Server, you could use clustered indexes for a view.

Another Con is that the data is likely not to be real-time as there is some time moving around the data to go from a normalized form to a de-normalized. If someone wants the report to be up to the very second it was requested, that can be tough to do in this situation.
If this is a duplication of the synchronization in the original post, sorry I didn't quite see it that way.

Related

Modeling document data and query performance

I have an aggerate data model (think a Customer entity with Widgets that belong to them as a list of embedded entities).
When I search for customers (e.g DocumentDBRepository.GetItemsAsync) That will be hydrating the customer data model along with the widgets for each. For efficiency reasons, I don’t really need the customer search to consider the widgets.
Are there any strategies for this in document dbs (such as a “LiteCustomer” entity)? I suspect not as that is just the nature of the “schema-less” data I’ve told it to store in the first place, but interested to hear thoughts.
Is this simply a ‘non issue’?
First, disclaimer: data modeling is hard. There are many nuances and a SO question can never cover entire business and everything left unsaid in both Q and A. There's no silver bullets. Regardless..
"LiteCustomer"
Perfectly fine to have such model in your client code. Your main Customer model may and will have many representations, most of them simple subsets of full model. Similarly to relational sql, select only what you need. Don't fetch data to client which you don't need.
The SQL API provides quite cool SQL tools to compose json for return documents for you.
physical storage model may differ from domain model
Consider your usage scenarios. If many scenarios happen to work with customer without widgets (or vice versa) then consider having widgets as separate document(s) in storage model.
In DocDB, the question is often not so much in querying logic but what your application expects on modification logic. Querying which is indexed is fast and every sql query can easily do transformations (though cross-doc joining is troublesome). For C(R)UD - you have less options - it's always by full document. Having too large documents will end up with higher RU costs and complex code.
Questions to consider:
How often customer changes without widget count/details changing?
How often widgets change without customer changing?
Do widgets on customer change independently or always as a set?
When do you need transactional updates on customer+widget changes?
How would queries look like? Can they be indexed?
Test.
True, changing model later is cumbersome in DocDB, but don't try to fix something before you know it's broken. If you are not sure you have an issue or not, then most likely fixing the maybe-issue is costlier than not fixing it.
If in doubt, generate loads of data and test it out.

Statistics on large table presented on the web

We have a large table of data with about 30 000 0000 rows and growing each day currently at 100 000 rows a day and that number will increase over time.
Today we generate different reports directly from the database (MS-SQL 2012) and do a lot of calculations.
The problem is that this takes time. We have indexes and so on but people today want blazingly fast reports.
We also want to be able to change timeperiods, different ways to look at the data and so on.
We only need to look at data that is one day old so we can take all the data from yesterday and do something with it to speed up the queries and reports.
So do any of you got any good ideas on a solution that will be fast and still on the web not in excel or a BI tool.
Today all the reports are in asp.net c# webforms with querys against MS SQL 2012 tables..
You have an OLTP system. You generally want to maximize your throughput on a system like this. Reporting is going to require latches and locks be taken to acquire data. This has a drag on your OLTP's throughput and what's good for reporting (additional indexes) is going to be detrimental to your OLTP as it will negatively impact performance. And don't even think that slapping WITH(NOLOCK) is going to alleviate some of that burden. ;)
As others have stated, you would probably want to look at separating the active data from the report data.
Partitioning a table could accomplish this if you have Enterprise Edition. Otherwise, you'll need to do some hackery like Paritioned Views which may or may not work for you based on how your data is accessed.
I would look at extracted the needed data out of the system at a regular interval and pushing it elsewhere. Whether that elsewhere is a different set of tables in the same database or a different catalog on the same server or an entirely different server would depend a host of variables (cost, time to implement, complexity of data, speed requirements, storage subsystem, etc).
Since it sounds like you don't have super specific reporting requirements (currently you look at yesterday's data but it'd be nice to see more, etc), I'd look at implementing Columnstore Indexes in the reporting tables. It provides amazing performance for query aggregation, even over aggregate tables with the benefit you don't have to specify a specific grain (WTD, MTD, YTD, etc). The downside though is that it is a read-only data structure (and a memory & cpu hog while creating the index). SQL Server 2014 is going to introduce updatable columnstore indexes which will be giggity but that's some time off.

vertica for non-analytics

I have a big analytics module in my system and plan to use vertica for it.
Someone suggested that we also use vertica in the rest of our app (standard crud app with models from our domain) so not to manage multiple databases.
would vertica fit this dual scenario?
High frequency UPDATEs is probably where Vertica lags behind the worst. I would avoid using it for such data models.
Alec - I would like to respectfully challenge your comments on Vertica. In no way do you need to denormalize or sort data before loading. Vertica also holds the record for fastest loading of data over all databases.
You also talk about Vertica not being able to do complex analytics as well as an RDBMS. Vertica IS an RDBMS and can do analytics faster than any other RDBMS and they prove it over and over.
As far as your numbers, in my use case I load roughly 5 million records per second into my Vertica cluster and have 100's of billions of records.
So Yaron - I would highly recommend you look at Vertica before you rule it out based on this information.
As is often the case these days, a meaningful answer depends on what you need to do. In a general sense, 'big data' solutions have grown from large data volume deficiencies in RDBMS systems. No 'big data' solution can compete with the core capabilities of RDBMS systems, ie complex analytics, but RDBMS systems are poor (expensive) solutions for large data volume procesing. Practical solutions for now have to be hybrid solutions. Vertica can be good once data is loaded, but I believe (not an expert) it requires denormalisation of data and pre-sorting before loading to perform at it's best. For large data volumes this may add significantly to the required resources. There is a definite benefit to using one system for all your needs, but there are also benefits to keeping your options open.
The approach I take is to store and index new data and then provide specific feeds to various reporting/analytic engines as required. This separates the collection and storage of raw data from the complex analytic processing. I am happy to provide more details if you are interested. This separation addresses a core problem which has always been present in database systems. In the past you used to hear 'store fast, report slowly or store slowly, report fast, but you cannot do both'. The search for a complete solution has, in the last few years, spawned the many NoSQL offerings which typically address the 'store fast' task. Some systems also provide impressive query performance by storing data in memory or cache but this requires many servers for large data volumes. I believe NoSQL and SQL solutions can, and will be, integrated, but this is till down the track.
To give you some context, I work with scenarios where at least 1 billion records a day are loaded. If you are dealing with say 100 million records a day (big is relative), then your Vertica approach will probably suffice, otherwise I think you need to expand your options.
Test it. Each use case is different. Assuming Vertica is a solution for every use case is almost as bad as using MongoDB for every use case.
Vertica is a high performance analytics database, column oriented, designed to analyze incredibly large datasets and scale horizontally. It's also expensive, hard to administer, and documentation is spotty. The payoff in the right environment can be easily worth the work, obviously
MySQL is a traditional RDBMS, row oriented, designed to model relationships between structured data, and works well on a single node scale (though many companies have retrofitted it to great success, exemplar gratia, Facebook). It's incredibly well documented and seemingly works on any platform, language, or framework and can be used by anyone.
My guess is using Vertica for an employee address book database is like showing up to a blue collar job in a $3000 suit. Sure it works, but is it the right tool for the job? Maybe if you already have a Vertica license and your applications already have the requisite data adaptors/ORM/etc..., go ahead and give it a shot. It's still a SQL database so it should work fine in those situations. If your goal is minimal programming as opposed to optimal performance, then why use Vertica at all? Sounds like something simpler would be more ideal. Vertica may or may not give better performance in a regular CRUD application environment since it's not optimized for that, but you can always test both and see.
Vertiy have many issues with high concurrency (Many small transaction per minute )
In MPP systems , the data is segmented across the cluster and any time there is need to take cluster level lock ( mainly in commit time ) , so many commits many cluster level X locks .
high concurrency is less the use case in DWH and reporting , so vertica is perfect for that .
In most of the cases OLTP solutions ( like CRM and etc ) required to provide high concurrency for that very is bad choice
Thanks

Alternatives of Datatable

In my web application, I have a dynamic query that returns huge data to datatable, and this query is often recalled with different parameters. So database is exhausted.
I want to get all record with no parameters to an object, and perform queries (may be with linq) on this object. So database will not be exthausted.
Which objects can be used instead of datatable?
This is one of my pet peeves - people who return all the data from the database.
There is absolutely no need for this unless you are doing reporting.
If you are doing reporting, then you need to increase your hardware capability so that the database can cope. This may also include tuning your database, rearranging tables, reindexing, regular rebuilding of indexes, updating statistics, archiving out old data, etc.
If you are NOT doing reporting, then start limiting how much data can be queried at any one time. Users DO NOT need to see massive quantities of data all at once. They need to see discrete amounts of data presented in a manageable and coherent way.
Another rule of thumb i like to observe is: let your database server do the work, it is made to manipulate lots of data, it is what it is good at, and it should have the power to do it. Pulling back loads of data to the client, and then trying to manipulate that data on the client is a foolish thing to do. If your client machines are more powerful than the database server then you have issues.
Never ever perform this(except cache)!!!
You are trying to implement DB mechanisms, like
persistent storage
index search and query strategy
replication
and so on
Spend your time on db optimization(optimal scheme, indexes, query, partitioning).

How to handle large amounts of data for a web statistics module

I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is - to store a database entry in a statistics table - each time a user enters a specific zone in my DB (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users as I stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever user sees a product and the lead submission form.
Problem is after a month, my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slow.
I thought maybe writing a service that will somehow parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How large scale data parsing applications - like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks for anyone that helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain function calculated over your data and instead of calculating the data "online" when starting up the displaying website, you calculate them offline, either via a batch process in the night or incrementally when the log record is written.
A simple enhancement would be to store counts per user/session, instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor in the order of the hits per session. Of course it would increase processing costs when inserting log entries.
Another kind of aggregation is called online analytical processing, which only aggregates along some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This act will improve the reporting response time along with the aggregation ideas mentioned earlier.
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" -- choosing the partition based on a range into which a value falls into. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month -- depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data that is outside that range will not even be considered; that can lead to very significant time savings, even better than an index (an index has to consider every row, so it will grow with your data; a partition is one per day).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.

Resources