Saiku/Mondrian performance degrades with large amounts of data - OLAP

We are using a Mondrian OLAP schema with Saiku to analyse our records. We use a star schema model with one fact table containing around 3,000,000 records and four dimension tables: timestamp, rank, path and domain. The timestamp is almost unique for each entry. After deploying the schema in Saiku, analyses take a very long time to return results: it takes 10 minutes to fetch 3,000 records, and if the number of records is more than 50,000 Saiku dies. Please suggest what I should do to boost the performance of Saiku and Mondrian.

You can easily figure out whether this is a database issue or a Saiku/Mondrian problem:
Enable the SQL logging facility in saiku-server/tomcat/webapps/saiku/WEB-INF/classes/log4j.xml (uncomment the section below the "Special Log File specifically for Mondrian SQL Statements" text).
Restart the server.
Run a couple of typical analyses in Saiku.
Collect the generated SQL queries from the log.
Analyze the performance of those queries directly in the database (e.g. PostgreSQL has the EXPLAIN ANALYZE command); see the sketch after these steps.
If the queries are as slow in the database as they are in Saiku, you have identified your problem.
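For example, in PostgreSQL the check in the last step could look like the sketch below. The table and column names (fact_records, dim_domain, etc.) are placeholders rather than your actual schema; in practice you would paste the real query copied from the Mondrian SQL log.

EXPLAIN ANALYZE
SELECT d.domain_name,
       COUNT(*) AS record_count
FROM   fact_records f
JOIN   dim_domain d ON d.domain_id = f.domain_id
GROUP  BY d.domain_name;
-- In the resulting plan, look for sequential scans over the fact table and
-- large sort/hash steps; those usually point to missing indexes or to a
-- dimension with far too many members (such as a per-second timestamp).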
Btw, if you really have a dimension for the timestamp (down to the second?), you should consider splitting it into two dimensions, one for days and one for seconds.
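A rough sketch of that split, with made-up table names: instead of one near-unique timestamp dimension, the fact table references two small dimension tables.

-- Hypothetical sketch: a date dimension with one row per calendar day
CREATE TABLE dim_date (
    date_id   INTEGER PRIMARY KEY,
    full_date DATE NOT NULL
);

-- ...and a time-of-day dimension with one row per second (86,400 rows, fixed)
CREATE TABLE dim_time_of_day (
    time_id       INTEGER PRIMARY KEY,
    second_of_day INTEGER NOT NULL
);

-- The fact table then carries date_id and time_id instead of a foreign key
-- into a timestamp dimension with millions of nearly unique members.

Both dimensions stay tiny, so the joins and Mondrian's member caches stay cheap.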

It's hard to tell what your particular problem is.
Two things helped us when we struggled with Saiku performance problems:
indexes on all fields (and sometimes combinations of them) that may be
used as dimensions - this helps pretty much everywhere in a database
we avoided joins with other tables by denormalizing our data
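As a rough illustration of both points (table and column names here are invented, adjust them to your schema; the exact syntax varies slightly by database):

-- Index every fact-table column that is joined to or filtered by a dimension.
CREATE INDEX idx_fact_domain ON fact_records (domain_id);
CREATE INDEX idx_fact_rank   ON fact_records (rank_id);
CREATE INDEX idx_fact_path   ON fact_records (path_id);
-- A composite index helps when two dimensions are usually combined in analyses.
CREATE INDEX idx_fact_domain_rank ON fact_records (domain_id, rank_id);

-- Denormalization: copy a frequently used dimension attribute into the fact
-- table so it can be used as a degenerate dimension without a join.
ALTER TABLE fact_records ADD COLUMN domain_name VARCHAR(255);
UPDATE fact_records
SET    domain_name = (SELECT d.domain_name
                      FROM   dim_domain d
                      WHERE  d.domain_id = fact_records.domain_id);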

Related

Google BigQuery Optimization Strategies

I am querying data from Google Analytics Premium using Google BigQuery. At the moment, I have one single query which I use to calculate some metrics (like total visits or conversion rate). This query contains several nested JOIN clauses and nested SELECTs. While querying just one table I am getting the error:
Error: Resources exceeded during query execution.
Using GROUP EACH BY and JOIN EACH does not seem to solve this issue.
One solution to be adopted in the future involves extracting only the relevant data needed for this query and exporting it into a separate table (which will then be queried). This strategy works in principle; I already have a working prototype for it.
However, I would like to explore additional optimization strategies for this query that work on the original table.
In the presentation "You might be paying too much for BigQuery", some of them are suggested, namely:
Narrowing the scan (already doing it)
Using query cache (does not apply)
The book "Google BigQuery Analytics" mentions also adjusting query features, namely:
GROUP BY clauses generating a large number of distinct groups (already did this)
Aggregation functions requiring memory proportional to the number of input values (probably does not apply)
Join operations generating a greater number of outputs than inputs (does not seem to apply)
Another alternative is just splitting this query into its component sub-queries, but at this moment I cannot opt for this strategy.
What else can I do to optimize this query?
Why does BigQuery have errors?
BigQuery is a shared, distributed resource, and as such it is expected that jobs will fail at some point in time. This is why the only solution is to retry the job with exponential backoff. As a golden rule, jobs should be retried a minimum of 5 times, and as long as a job does not remain unable to complete for more than 15 minutes, the service is within the SLA [1].
What can be the causes?
I can think of two causes that may be affecting your queries:
Data skewing [2]
Unoptimized queries
Data Skewing
Regarding the first situation, this happens when data is not evenly distributed. Because BigQuery's internal mechanics use a version of MapReduce, if you have, for example, a music or video file with millions of hits, the workers aggregating that data will exhaust their resources, while the other workers won't be doing much at all because the videos or songs they are processing have little to no hits.
If this is the case, the recommendation is to uniformly distribute your data.
Unoptimized queries
If you don't have the ability to modify the data, the only solution is to optimize the queries. Optimized queries follow these general rules:
When using a SELECT, make sure you select only the columns you strictly need, as this reduces the cardinality of the requests (avoid using SELECT *, for example)
Avoid using ORDER BY clauses on large sets of data
Avoid using GROUP BY clauses as they create a barrier to parallelism
Avoid using JOINs, as these are extremely heavy on the workers' memory and may cause resource starvation and resource errors (as in not enough memory).
Avoid using analytic functions [3]
If possible, run your queries on partitioned tables [4].
Following any of these strategies should help your queries produce fewer errors and improve their overall running time.
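As a rough illustration of the column-narrowing and partitioned-table points, a query shaped like the sketch below scans only the needed columns and only one month of partitions. The project/dataset/table and field names are invented, and it assumes a day-partitioned table queried with standard SQL, so treat it as illustrative only.

-- Hypothetical: `myproject.mydataset.sessions` is an ingestion-time
-- partitioned table; visitor_id and visits are placeholder columns.
SELECT
  visitor_id,                       -- only the columns the metric needs
  SUM(visits) AS total_visits
FROM `myproject.mydataset.sessions`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01')
                         AND TIMESTAMP('2016-01-31')
GROUP BY visitor_id;
-- Compared with SELECT * over the full table, this narrows both the columns
-- read and the partitions scanned, which reduces the memory each worker needs.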
Additional
You can't really understand BigQuery unless you understand MapReduce first. For this reason I strongly recommend you have a look at Hadoop tutorials, like the one on tutorialspoint:
https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
For a similar but open-source version of BigQuery (and less optimized in every single way), you can also check Apache Hive [5]. If you understand why Apache Hive fails, you understand why BigQuery fails.
[1] https://cloud.google.com/bigquery/sla
[2] https://www.mathsisfun.com/data/skewness.html
[3] https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-functions
[4] https://cloud.google.com/bigquery/docs/partitioned-tables
[5] https://en.wikipedia.org/wiki/Apache_Hive
Google's BigQuery has a lot of quirks because it is not ANSI compatible. These quirks are also its advantages. That said, you will waste a lot of time writing queries against BigQuery directly. You should either use an API/SDK or a tool such as Looker (https://looker.com/blog/big-query-launch-blog) that generates the SQL for you at execution time, giving you a resource estimate before you spend your money.

Statistics on large table presented on the web

We have a large table of data with about 300 000 000 rows, currently growing at 100 000 rows a day, and that number will increase over time.
Today we generate different reports directly from the database (MS-SQL 2012) and do a lot of calculations.
The problem is that this takes time. We have indexes and so on but people today want blazingly fast reports.
We also want to be able to change time periods, look at the data in different ways, and so on.
We only need to look at data that is one day old, so we can take all of yesterday's data and do something with it to speed up the queries and reports.
So, does anyone have any good ideas for a solution that will be fast and still on the web, not in Excel or a BI tool?
Today all the reports are ASP.NET C# WebForms with queries against MS SQL 2012 tables.
You have an OLTP system. You generally want to maximize throughput on a system like this. Reporting requires latches and locks to be taken to acquire data, which drags down your OLTP throughput, and what's good for reporting (additional indexes) is going to be detrimental to your OLTP because it will negatively impact performance. And don't even think that slapping WITH(NOLOCK) on everything is going to alleviate some of that burden. ;)
As others have stated, you would probably want to look at separating the active data from the report data.
Partitioning a table could accomplish this if you have Enterprise Edition. Otherwise, you'll need to do some hackery like partitioned views, which may or may not work for you depending on how your data is accessed; a sketch follows.
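A minimal sketch of a partitioned view, with invented table names: each member table carries a CHECK constraint on the partitioning column so the optimizer can skip the tables a query doesn't need.

-- Hypothetical member tables, one per month, each constrained to its range.
CREATE TABLE dbo.Readings_2013_01 (
    ReadingDate DATE  NOT NULL CHECK (ReadingDate >= '20130101' AND ReadingDate < '20130201'),
    Amount      MONEY NOT NULL
);
CREATE TABLE dbo.Readings_2013_02 (
    ReadingDate DATE  NOT NULL CHECK (ReadingDate >= '20130201' AND ReadingDate < '20130301'),
    Amount      MONEY NOT NULL
);
GO
-- The view unions the members; a query filtered on ReadingDate only touches
-- the tables whose CHECK constraint can contain that date.
CREATE VIEW dbo.Readings AS
SELECT ReadingDate, Amount FROM dbo.Readings_2013_01
UNION ALL
SELECT ReadingDate, Amount FROM dbo.Readings_2013_02;
GO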
I would look at extracting the needed data out of the system at a regular interval and pushing it elsewhere. Whether that elsewhere is a different set of tables in the same database, a different catalog on the same server, or an entirely different server depends on a host of variables (cost, time to implement, complexity of data, speed requirements, storage subsystem, etc.).
Since it sounds like you don't have super-specific reporting requirements (currently you look at yesterday's data but it'd be nice to see more, etc.), I'd look at implementing columnstore indexes on the reporting tables. They provide amazing performance for query aggregation, even over aggregate tables, with the benefit that you don't have to commit to a specific grain (WTD, MTD, YTD, etc.). The downside is that a columnstore index is a read-only data structure (and a memory and CPU hog while the index is being created). SQL Server 2014 is going to introduce updatable columnstore indexes, which will be giggity, but that's some time off.
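On SQL Server 2012 that is a one-liner against the reporting table (the table and column names below are hypothetical). Because the index makes the table read-only, the daily load has to drop or disable it, insert yesterday's rows, and rebuild it.

-- Hypothetical reporting fact table refreshed once a day from the OLTP system.
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_ReportFact_ColumnStore
    ON dbo.ReportFact (ReportDate, CustomerId, ProductId, Amount, Quantity);
-- Daily refresh pattern: disable the index, insert the new rows, then rebuild it.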

Calculate at runtime vs Lookup from SQL Server Table

I have an MVC application that needs to run several trillion calculations. Of those, I am interested in only about 8 million results. I have to do this work because I need to find an overall high and low score. I will save this data and store it in a single table of 16 floats, with a few indexes on the table for lookups. So far I have only processed 5% of my data.
As users enter data into my website, I have to do calculations based on their data to determine the best and worst outcomes. This is only about 4 million calculations, and right now it takes about a second or less on my local PC. Alternatively, it is a simple query that always returns 2 records from my stored data: the best and the worst. Right now the query is the same speed as or faster than calculating the result, but I don't have all 8 million records yet, and I am worried that the DB will get slow.
I was thinking I would use the Database Lookup, and if performance became an issue, switch to runtime calculation.
QUESTION: Should I just save myself the trouble and do the runtime calculation anyway?
I am not sure which option is more scalable. I don't expect a large user base for this website.
The site needs to be snappy.
Your question is a little vague to provide a clear-cut answer, but my guess is that using the DB to calculate the totals will be far more efficient than writing the code yourself on the website. SQL Server will attempt to optimize the query to use as much of the server's resources as possible to make it more efficient; your code won't do that unless you specifically write it to.
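For instance, if the precomputed results live in one table, getting the best and worst rows is a cheap indexed lookup. The table and column names below are made up; in practice a WHERE clause on the user's inputs would narrow the search further.

-- Hypothetical: dbo.Outcomes holds the precomputed results with a Score column.
CREATE INDEX IX_Outcomes_Score ON dbo.Outcomes (Score);

SELECT TOP (1) * FROM dbo.Outcomes ORDER BY Score DESC;  -- best outcome
SELECT TOP (1) * FROM dbo.Outcomes ORDER BY Score ASC;   -- worst outcome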
I would start by loading the data and doing tests before settling on an optimization strategy. You have no idea where the real bottlenecks of the system will be until you load data that is remotely close to what you are actually going to have to deal with.
If I understand the question, performing the calculation is more scalable, as it operates only on that single data set. As you add data to a table, lookups will get slower even with indexes, and the indexes themselves increase the table size and the time required to insert a record.
If I've understood you correctly, this is a question about caching - should you calculate on the fly, or lookup the results in a cache?
In most web architectures, your SQL database is a brilliant cache, right up to the point where it becomes a terrible cache. Scaling your (SQL) database is notoriously tricky - introducing clustering, sharding etc. becomes a production in its own right.
My - very general - advice is to use your relational database for managing transactional data, and to use caching technology for caching. 8 million records should fit into RAM on a decent server these days - and you can add web servers far more cheaply than scaling your database.

SQLite3 Optimization: Store external file name in DB? Or just have a huge number of rows?

I am a newbie with no comp-sci background, so please forgive me for whatever dumb stuff I may say. I am working on a solar power monitoring project to monitor the power output of the solar power systems my company installs. I am writing a client that will query the inverter of each of our monitoring customers every 15 minutes (for power output, voltage output, current output, system errors/faults, etc., which together constitute one "reading") for as long as they have their system, which means roughly 35k readings per year per customer. So I was thinking of organizing my SQLite3 database in one of the two following ways.
(1) Have the database be two tables: one with regular customer information (name, email, etc.) and another, much bigger table where each row represents one reading and includes the customer id and the timestamp of the reading as identifiers. This means roughly 35k rows will be added to this bigger table per customer per year. (Data more than two years old will be pared down and archived.)
OR
(2) Store all readings in a CSV file (one CSV file per customer) and store the CSV file name in my table with regular customer information.
This database will serve a website (built on Rails, if that makes any difference for the options) where customers will be able to view their power output data. I want to minimize the time it takes to load their output data on login. I basically don't have a clear idea of how long it would take my computer to open and read lines from a text file versus open a huge SQLite3 table, look up a customer id and read the data in, so I'm having trouble judging between the two options above. I'm also having trouble gauging the limits within which SQLite3 works well, despite having read a bit about it (I don't think I have the background to understand what I read, because it seems to say hundreds of millions of rows are just fine, while other people's comments seem to say the opposite). I am also open to a completely different option, as I'm not married to anything right now. Whatever makes things load faster. Thanks so much in advance!
Storing the parsed data in SQLite would definitely be a time saver if you're doing any kind of repeated data mining on it. The overhead of parsing the CSV files would almost instantly eat up any space/time savings you'd gain by skipping the database.
As for efficiency, you'd have to test it. There's no hard and fast rule that says "use this database" or "use that database"; it always depends on the scenario. SQLite may be perfect for you in this case but useless for someone else with a slightly different workload.
SQL databases in general do very well with large data sets, as long as the columns being queried are indexed. You should keep the readings in the same database: it will take far less time to fetch the data from the database than to parse CSV files. Databases are built for storing and retrieving data; CSV files are not.
I use MySQL databases with tens of millions of rows per table and queries return results in fractions of a second. SQLite might be faster.
Just make sure you create indexes on the columns you will be searching on.
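For option (1), a sketch of the bigger table and its index might look like this (the names and columns are illustrative, not a prescription):

-- Hypothetical readings table for option (1); customers(id) is the other table.
CREATE TABLE readings (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    taken_at    TEXT    NOT NULL,   -- ISO-8601 timestamp, e.g. '2012-06-01 12:15:00'
    power_w     REAL,
    voltage_v   REAL,
    current_a   REAL,
    fault_code  TEXT
);

-- One composite index covers the typical "all readings for this customer in
-- this period" lookup the website will run when a customer logs in.
CREATE INDEX idx_readings_customer_time ON readings (customer_id, taken_at);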
I would do option 1, but use a database server such as PostgreSQL instead of SQLite.
SQLite locks the database on writes, so you may run into locking issues if you read from and write to the table a lot. SQLite is better suited to single-user applications on the desktop or in a smartphone.
You can easily have millions of rows without it causing any problems.

How many rows can an SQLite table hold before queries become time consuming

I'm setting up a simple SQLite database to hold sensor readings. The tables will look something like this:
sensors
- id (pk)
- name
- description
- units
sensor_readings
- id (pk)
- sensor_id (fk to sensors)
- value (actual sensor value stored here)
- time (date/time the sensor sample was taken)
The application will be capturing about 100,000 sensor readings per month from about 30 different sensors, and I'd like to keep all sensor readings in the DB as long as possible.
Most queries will be in the form
SELECT * FROM sensor_readings WHERE sensor_id = x AND time > y AND time < z
This query will usually return about 100-1000 results.
So the question is: how big can the sensor_readings table get before the above query becomes too time-consuming (more than a couple of seconds on a standard PC)?
I know that one fix might be to create a separate sensor_readings table for each sensor, but I'd like to avoid this if it is unnecessary. Are there any other ways to optimize this DB schema?
If you're going to be using time in the queries, it's worthwhile adding an index to it. That would be the only optimization I would suggest based on your information.
100,000 insertions per month equates to about 2.3 per minute, so another index won't be too onerous and it will speed up your queries. I'm assuming that's 100,000 insertions across all 30 sensors, not 100,000 for each sensor, but even if I'm mistaken, 70 insertions per minute should still be okay.
If performance does become an issue, you have the option of offloading older data to a historical table (say, sensor_readings_old) and only running your queries against the non-historical table (sensor_readings).
Then you at least have all the data available without affecting the normal queries. If you really want to get at the older data, you can do so, but you'll be aware that those queries may take a while longer.
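The offload can be a small periodic job along these lines; the one-year cutoff is arbitrary, and it assumes sensor_readings_old has the same columns as sensor_readings and that time is stored as an ISO-8601 text timestamp so string comparison matches chronological order.

-- Move old rows to the historical table, then delete them from the active
-- table, inside a single transaction so readers never see a half-done move.
BEGIN TRANSACTION;
INSERT INTO sensor_readings_old
SELECT * FROM sensor_readings
WHERE  time < datetime('now', '-1 year');

DELETE FROM sensor_readings
WHERE  time < datetime('now', '-1 year');
COMMIT;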
Are you setting indexes properly? Beyond that and reading http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html, the only answer is 'you'll have to measure it yourself', especially since this will depend heavily on the hardware, on whether you're using an in-memory database or one on disk, and on whether you wrap inserts in transactions.
That being said, I've hit noticeable delays after a couple of tens of thousands of rows, but that was absolutely non-optimized; from reading a bit I get the impression that there are people with hundreds of thousands of rows, with proper indexes etc., who have no problems at all.
SQLite now supports R-tree indexes (http://www.sqlite.org/rtree.html), which are ideal if you intend to do a lot of time-range queries.
Tom
I know I am coming to this late, but I thought this might be helpful for anyone who comes looking at this question later:
SQLite tends to be relatively fast at reading as long as it is only serving a single application/user at a time. Concurrency and blocking can become issues when multiple users or applications access it at the same time, and more robust databases like MS SQL Server tend to work better in a high-concurrency environment.
As others have said, I would definitely index the table if you are concerned about the speed of read queries. For your particular case, I would probably create one index that includes both sensor_id and time, as sketched below.
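Given the query shape above (equality on sensor_id, range on time), a composite index with the equality column first lets SQLite answer it with a single index range scan:

-- Equality column (sensor_id) first, range column (time) second.
CREATE INDEX idx_readings_sensor_time ON sensor_readings (sensor_id, time);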
You may also want to pay attention to write speed. Individual insertions can be fast, but commits are slow, so you probably want to batch many insertions together into one transaction before committing. This is discussed here: http://www.sqlite.org/faq.html#q19
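A sketch of the batching pattern (the values are placeholders):

-- Without an explicit transaction every INSERT is its own commit (and its own
-- disk sync); wrapping a batch in one transaction pays that cost only once.
BEGIN TRANSACTION;
INSERT INTO sensor_readings (sensor_id, value, time) VALUES (1, 20.5, '2009-06-01 12:00:00');
INSERT INTO sensor_readings (sensor_id, value, time) VALUES (2, 18.3, '2009-06-01 12:00:00');
-- ... many more rows ...
COMMIT;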
