Storage engine optimization for data warehousing - MariaDB

We've been deploying a midsize data warehouse with daily updates: a few fact tables, many dimensions, and even more on-demand reports programmed in a custom-built PHP framework. We've already tuned the indexes to near-optimal levels.
Now we wonder whether selectively moving fact and dimension tables to different storage engines would help. Most of the tables are InnoDB; some log tables use CSV. Would it be beneficial to move the dimensions to Aria? The fact tables are fairly large, though none exceeds 200 billion records; the dimensions have fewer than 1,000 records each. We are happy with performance now, but the fact data grows daily.
Any general thoughts?
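For context, this is the kind of experiment we have in mind: check which engine each table currently uses, copy one small dimension into an Aria table, and compare report timings against it (the schema and table names below are just examples):

    -- Check which engine each table currently uses
    SELECT table_name, engine, table_rows
    FROM information_schema.tables
    WHERE table_schema = 'dwh';   -- 'dwh' is a placeholder schema name

    -- Trial: copy one small dimension to Aria and point a few test reports at it
    CREATE TABLE dim_customer_aria LIKE dim_customer;
    ALTER TABLE dim_customer_aria ENGINE = Aria;
    INSERT INTO dim_customer_aria SELECT * FROM dim_customer;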

Related

Reuse of DynamoDB table

Coming from an SQL background, it's easy to fall into a SQL pattern and mindset when designing NoSQL databases like DynamoDB. However, many best practices rely on merging different kinds of data with different schemas in the same table. This can make querying related data very efficient, in lieu of SQL joins.
However, if I have two distinct types of data with different schemas that are never queried together, then since the introduction of on-demand pricing for DynamoDB, is there any reason to merge data of different types into one table simply to keep the number of tables down? Prior to on-demand, you had to pay for capacity units per hour, so limiting the number of tables was reasonable. But with on-demand, is there any reason not to create 100 tables if you have 100 unrelated data schemas?
I would say that the answer is "no, but":
On-demand pricing is significantly more expensive than provisioned pricing. So unless you're just starting out with DynamoDB with a low volume of requests, or have extremely fluctuating demand, you are unlikely to use only on-demand pricing. Amazon has an interesting blog post titled Amazon DynamoDB auto scaling: Performance and cost optimization at any scale, where they explain how you can reserve some capacity for a year, then automatically provision additional capacity in 15-minute intervals (so-called auto scaling), and use on-demand pricing only for demand exceeding that. In such a setup, the cheapest prices are the long-term (yearly, and even 3-year) reservations. And having two separate tables may complicate that reservation.
The benefit of having one table would be especially pronounced if your application's usage of the two different tables fluctuates up and down over the day. The sum of the two demands will usually be flatter than each of the two demands, allowing the cheaper capacity to be used more and on-demand to be used less.
The reason I answered "no, but" rather than "yes" is that it's not clear how important these effects are in real applications, or how much you can actually save in practice by using one table instead of two. If the number of tables is not two but ten, or the number of tables changes as the application evolves, the savings may be even greater.

Firebase BigQuery Export Schema Size Difference

We have migrated all of our old Firebase BigQuery events tables to the new schema using the provided script. One thing we noticed was that the size of the daily tables increased dramatically.
For example, the data from 4/1/18 in the old schema was 3.5MM rows and 8.7 GB. Once migrated, the new table for the same date is 32.3MM rows and 27 GB: nearly 10 times as many rows and more than 3x the storage.
Can someone tell me why the same data is so much larger in the new schema?
The result is that we are getting charged significantly more in BigQuery query costs when reading the tables from the new schema versus the old schema.
firebaser here
While increasing the size of the exported data definitely wasn't a goal, it is an expected side-effect of the new schema.
In the old storage format the events were stored in bundles. While I don't know exactly how the events were bundled, it was always a bunch of events, each with its own unique properties as well as some shared properties. This meant that you frequently had to unnest the data in your query, or cross join the table with itself, to get to the raw data, and then combine and group it again to fit your requirements.
In the new storage format, each event is stored separately. This definitely increases the storage size, since properties that were shared between events in a bundle are now duplicated for each event. But the queries you write against the new format should be easier to read and can process the data faster, since they don't have to unnest it first.
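To make that concrete, here is a rough sketch of the same count-by-event-name query against the two formats. The dataset and table names are illustrative, and the old export is assumed to use the repeated event_dim field while the new one stores one row per event:

    -- Old schema: events arrive bundled per row, so you UNNEST first
    SELECT e.name AS event_name, COUNT(*) AS events
    FROM `project.dataset.app_events_20180401` AS t,
         UNNEST(t.event_dim) AS e
    GROUP BY e.name;

    -- New schema: one row per event, so the query stays flat
    SELECT event_name, COUNT(*) AS events
    FROM `project.dataset.events_20180401`
    GROUP BY event_name;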
So the larger storage size should come with slightly faster processing. But I can totally imagine the sticker shock when you see the difference and realize the improved speed doesn't always make up for it. I apologize if that is the case, and have been assured that there are no other big schema changes planned from here on.

Saiku/Mondrian performance degrades with large amounts of data

We are using a Mondrian OLAP schema with Saiku to analyse our records, with a star schema model. We have one fact table containing around 3,000,000 records and four dimension tables: timestamp, rank, path and domain. The timestamp is almost unique for each entry. After deploying the schema in Saiku, analysis takes a very long time to return results: it takes 10 minutes to fetch 3,000 records, and if the number of records exceeds 50,000 Saiku dies. Please suggest what I should do to boost the performance of Saiku and Mondrian.
You can easily figure out whether this is a database issue or a Saiku/Mondrian problem:
1. Enable the SQL logging facility in saiku-server/tomcat/webapps/saiku/WEB-INF/classes/log4j.xml (uncomment the section below the "Special Log File specifically for Mondrian SQL Statements" text).
2. Restart the server.
3. Do a couple of typical analyses in Saiku.
4. Take the queries that were used from the log.
5. Analyze the performance of those queries directly in the database (e.g. for PostgreSQL there is the EXPLAIN ANALYZE command; see the sketch below).
If the queries are as slow as in Saiku, then you have identified your problem.
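For the last step, a minimal sketch of what that looks like in PostgreSQL, using a made-up query of the shape Mondrian typically generates (table and column names are illustrative):

    -- Take a query from the Mondrian SQL log and prefix it with EXPLAIN ANALYZE
    EXPLAIN ANALYZE
    SELECT d.day, SUM(f.amount) AS total
    FROM fact_table AS f
    JOIN dim_timestamp AS d ON d.id = f.timestamp_id
    GROUP BY d.day;
    -- Sequential scans on the fact table, or row estimates that are far off,
    -- usually point at missing indexes or stale statistics.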
Btw, if you really have a dimension on the timestamp (down to the second?), you should consider splitting it into two dimensions: days and seconds.
It's hard to tell what your particular problem is.
Two things helped us when we struggled with Saiku performance problems:
1. Indexes on all fields (and sometimes combinations of fields) that may be used as dimensions - this helps here just as it does anywhere else in a database (see the sketch below).
2. Avoiding joins with other tables by denormalizing our data.
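For the first point, a rough sketch of the kind of indexes meant, with made-up fact and dimension key names:

    -- Index every fact-table column used as a dimension key,
    -- plus combinations that often appear together in analyses
    CREATE INDEX idx_fact_domain      ON fact_table (domain_id);
    CREATE INDEX idx_fact_path        ON fact_table (path_id);
    CREATE INDEX idx_fact_domain_rank ON fact_table (domain_id, rank_id);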

Statistics on large table presented on the web

We have a large table with about 300 000 000 rows, currently growing by about 100 000 rows a day, and that number will increase over time.
Today we generate different reports directly from the database (MS-SQL 2012) and do a lot of calculations.
The problem is that this takes time. We have indexes and so on, but people today want blazingly fast reports.
We also want to be able to change time periods, look at the data in different ways, and so on.
We only need data that is at least one day old, so we can take everything up to yesterday and preprocess it in some way to speed up the queries and reports.
So do any of you have good ideas for a solution that will be fast and still delivered on the web, not in Excel or a BI tool?
Today all the reports are ASP.NET C# WebForms with queries against MS SQL 2012 tables.
You have an OLTP system, and you generally want to maximize throughput on a system like that. Reporting requires latches and locks to be taken to acquire data, which drags on your OLTP throughput, and what's good for reporting (additional indexes) is detrimental to your OLTP workload because it slows down writes. And don't even think that slapping WITH(NOLOCK) on everything is going to alleviate some of that burden. ;)
As others have stated, you would probably want to look at separating the active data from the report data.
Partitioning a table could accomplish this if you have Enterprise Edition. Otherwise, you'll need to do some hackery like Partitioned Views, which may or may not work for you depending on how your data is accessed.
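If it helps, a partitioned view is roughly this pattern, assuming the data can be split on a date column (table and column names are illustrative):

    -- Each member table holds one month; the CHECK constraint on the date
    -- lets the optimizer skip member tables a query cannot touch
    CREATE TABLE dbo.Report_201301 (
        ReportDate date  NOT NULL CHECK (ReportDate >= '20130101' AND ReportDate < '20130201'),
        Amount     money NOT NULL
    );
    CREATE TABLE dbo.Report_201302 (
        ReportDate date  NOT NULL CHECK (ReportDate >= '20130201' AND ReportDate < '20130301'),
        Amount     money NOT NULL
    );
    GO
    CREATE VIEW dbo.ReportAll
    AS
    SELECT ReportDate, Amount FROM dbo.Report_201301
    UNION ALL
    SELECT ReportDate, Amount FROM dbo.Report_201302;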
I would look at extracting the needed data out of the system at a regular interval and pushing it elsewhere. Whether that elsewhere is a different set of tables in the same database, a different catalog on the same server, or an entirely different server would depend on a host of variables (cost, time to implement, complexity of data, speed requirements, storage subsystem, etc.).
Since it sounds like you don't have super-specific reporting requirements (currently you look at yesterday's data, but it'd be nice to see more, etc.), I'd look at implementing Columnstore Indexes in the reporting tables. They provide amazing performance for query aggregation, even over aggregate tables, with the benefit that you don't have to commit to a specific grain (WTD, MTD, YTD, etc.). The downside is that it is a read-only data structure (and a memory and CPU hog while creating the index). SQL Server 2014 is going to introduce updatable columnstore indexes, which will be giggity, but that's some time off.
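On SQL Server 2012 that looks roughly like this (table and column names are illustrative); since the nonclustered columnstore index makes the table read-only, it suits a reporting copy that is dropped and rebuilt after each nightly load:

    -- Nonclustered columnstore index on the reporting copy of the fact table
    CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_FactReport
    ON dbo.FactReport (ReportDate, ProductId, CustomerId, Quantity, Amount);

    -- A typical aggregation that benefits: only the referenced columns are scanned
    SELECT ReportDate, SUM(Amount) AS TotalAmount
    FROM dbo.FactReport
    GROUP BY ReportDate;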

Is it possible to prototype server/SQL performance on paper for various loads?

I am trying to figure out whether a web development project is feasible. So far I have learned that the proposed database (30 million rows, 5 columns, about 3 GB of storage) is well within my budget in terms of storage requirements, but because of the anticipated large number of user queries against the database, I am not sure whether the server load can be kept manageable enough to provide adequate performance within my budget.
I will be using this grid (a live demo of performance benchmarks for 300,000 rows - http://demos.telerik.com/aspnet-ajax/grid/examples/performance/linq/defaultcs.aspx). Inserting a search term in the "product name" box and pressing enter takes 1.6 seconds from query to rendered results. It seems to me (a newbie) that if 300,000 rows take 1.6 seconds, 30 million rows must take much longer, so I am trying to figure out:
1. how much the time increases as more rows are added, up to 30 million;
2. how much the time increases for each additional 1,000 people using the search grid at the same time;
3. what hardware is necessary to reduce the delays to an acceptable level.
Hopefully, if I can figure that out, I can get a more realistic assessment of feasibility. FYI: the database does not need to be updated very often; it is mostly for read-only purposes.
Can this problem be prototyped on paper for these 3 points?
Even as a wide ballpark estimate, without considering optimisation: am I talking hundreds of dollars, thousands, or tens of thousands of dollars for 5,000 users to have searches below 10 seconds each?
[It will be the ASP.NET RadControls for AJAX grid, on one of these cloud-hosted servers: 4,096 MB RAM, 160 GB disk space, and either Microsoft SQL Server 2008 R2 or SQL Server 2012.]
The database need not be updated very regularly, it is more for readonly purposes.
Your search filters allow for substring searches, so database indexes are not going to help you and the search will go row by row.
It looks like your data would probably fit in about 5 GB of memory. I would store the whole thing in memory and search there.
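To illustrate the substring point, this is the shape of query a "product name contains ..." filter produces (table and column names are made up); the leading wildcard rules out an index seek, so every search scans all 30 million rows, which is why an in-memory (or full-text) search is suggested instead:

    DECLARE @SearchTerm nvarchar(100) = N'chai';

    -- The leading '%' prevents an index seek, so this scans the whole table
    SELECT ProductId, ProductName
    FROM dbo.Products
    WHERE ProductName LIKE '%' + @SearchTerm + '%';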
