What is the use of OLAP?

How is it that OLAP data access can be faster than OLTP?

OLAP makes data access very quick by using a multidimensional data model.
If you have a huge amount of data and report generation takes extremely long (e.g. several hours), you could use OLAP to prepare the report. Then each request against the already processed data would be fast.

OLAP is fundamentally for read-only data stores. Classic OLAP is a Data Warehouse or Data Mart, and we work with either as an OLAP cube. Conceptually you can think of an OLAP cube like a huge Excel PivotTable: a structure with sides (dimensions) and data intersections (facts) that has NO JOINS.
The data structure is one of the reasons that OLAP is so much faster to query than OLTP. Another reason is the concept of aggregations, which are stored intersections at a level higher than the leaf (bottom). An example would be as follows:
You may load a cube with facts about sales (i.e. how much in dollars, how many in units, etc.) with one row (or fact) for each sales amount, by dimensions such as time, products and customers. The level at which you load each dimension, for example sales by EACH day and by EACH customer, is the leaf data. Of course you will often want to query aggregated values, that is sales by MONTH, by customers in a certain CITY, etc.
Those aggregations can be calculated at query time, or they can be pre-aggregated and stored at cube load. At query time, OLAP cubes use a combination of stored and calculated aggregations. Unlike OLTP indexes, PARTIAL aggregations can be used.
In addition to this, most OLAP cubes have extensive caching set up by default and most also allow for very granular cache tuning (pre-loading).
Another consideration is that, relatively recently, in-memory BI (or OLAP) has been offered by more and more vendors. Obviously, if more of the OLAP data is in memory, the resulting queries will be EVEN faster than traditional OLAP. To see an example of an in-memory cube, take a look at my slide deck about SQL Server 2012 BISM.
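To make the aggregation idea concrete, here is a minimal SQL sketch (the table and column names are hypothetical) of a stored month/product aggregation that a cube could compute once at load time instead of re-scanning the daily leaf rows for every query:

-- Leaf data: one row per day/product/customer in the hypothetical fact table
-- Stored aggregation: roll the leaf rows up to month/product once, at load time
SELECT
    DATEPART(year, sale_date)  AS sale_year,
    DATEPART(month, sale_date) AS sale_month,
    product_id,
    SUM(sales_amount) AS sales_amount,   -- "how much in dollars"
    SUM(sales_units)  AS sales_units     -- "how many in units"
INTO sales_agg_month_product             -- the pre-aggregated structure later queries can hit
FROM sales_fact
GROUP BY DATEPART(year, sale_date), DATEPART(month, sale_date), product_id;

A monthly sales query then reads this much smaller result; intersections that were not pre-aggregated can still be calculated at query time from the leaf rows, which is the mix of stored and calculated aggregations described above.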

You need to do some research on what OLAP is and why/when you need to use it. Try starting by searching Google for OLAP, and read this Wikipedia article:
http://en.wikipedia.org/wiki/Online_analytical_processing

Related

Custom partitioning in ClickHouse

I have several questions about custom partitioning in ClickHouse. Background: I am trying to build a TSDB on top of ClickHouse. We need to support very large batch writes and complicated OLAP reads.
Let's assume we use the standard partitioning by month, and we have 20 nodes in our ClickHouse cluster. Will the data from the same month all flow to the same node, or will ClickHouse do some internal balancing and put data from the same month on several nodes?
If all the data from the same month is written to the same node, that will be very bad for our scenario. I will probably consider partitioning by (timestamp, tags), where tags are the different tags that define the data source. Our monitoring system writes data to the TSDB every 30 seconds. Our read pattern is usually a single-table range scan or several tables joined on a column. Any advice on how I should customize my partitioning strategy?
Since ClickHouse does not support secondary indexes, and we will run selection queries on columns, I think I should put those important columns into the primary key, so my primary key will probably look like (timestamp, ip, port, ...). Any advice on this design, or a good reason why ClickHouse does not support secondary indexes (like bitmap indexes) on other non-primary columns?
In ClickHouse, partitioning and sharding are two independent mechanisms. Partitioning by month means that data from different months will never be merged into the same file on the filesystem; it has nothing to do with data placement between nodes (which is controlled by how exactly you set up your tables and run your INSERT INTO queries).
Partitioning by months or weeks usually works fine; for choosing a primary key see the official documentation: https://clickhouse.yandex/docs/en/operations/table_engines/mergetree/#selecting-the-primary-key
There are no fundamental issues with adding those; for example, bloom filter index development is in progress: https://github.com/yandex/ClickHouse/pull/4499
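To illustrate the partitioning and primary-key advice above, here is a minimal ClickHouse sketch; the table name and columns are hypothetical, based on the (timestamp, ip, port, ...) key mentioned in the question:

-- Hypothetical TSDB table: monthly partitions, multi-column primary key
CREATE TABLE metrics
(
    timestamp DateTime,
    ip        String,
    port      UInt16,
    value     Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, ip, port);

The PARTITION BY clause only controls how data is organized into parts on each node; spreading data across the 20 nodes is still determined by how you set up your tables and route your INSERT INTO queries, as described above.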

Should bulk data be included in the graph?

I have been using ArangoDB for a while now for smaller system requirements and love it. We have recently been tasked by a client to analyze a large amount of financial data which is currently housed in SQL, but I was hoping to query the data more efficiently in ArangoDB.
One of the simpler requirements is to roll up GL entry amounts to determine account totals across their general ledger. There are approximately 2200 accounts in their general ledger, with a maximum depth of approximately 10. The number of GL entries is approximately 150 million, and I was wondering what the most efficient method of aggregating account totals would be.
I plan on using a graph to manage the account hierarchy/structure, but should edges be created for 150 million GL entries, or is it more efficient to traverse the inbound relationships and run sub-queries on the GL entry collections to calculate the total amounts?
I would normally just run the tests myself but I am struggling with simply loading the data in my local instance of arango and thought I would get some insight while I work at loading the data.
Thanks in advance!
What is the benefit you're looking to gain by moving the data into a graph model? If it's to build connections between accounts, customers, GLs, and such, then it might be best to go with a hybrid model.
It's possible to build a hierarchical graph-style relationship between your accounts and GLs, but then store your GL entries in a flat document collection.
This way you can use AQL graph queries to quickly determine relationships between accounts and GLs. If you need to SUM entries in a GL, you can have queries that identify the GL._id's and then sum over the flat collections whose foreign keys reference the associated GL._id.
By adding indexes on your foreign keys you will speed up queries, and by using Foxx microservices you can provide a layer of abstraction between a REST-style query and the actual data model you are using. That way, if you find you need to change your database model under the covers, you can update your Foxx microservices and the consumer doesn't need to be aware of those changes.
I can't answer your question on performance; you'll just need to ensure your hardware is appropriately spec'ed.

OLAP vs In-Memory

I am working with Big Data and all my backend logic is written in PHP. So for faster output, which of the following technologies would be efficient and good for my product?
OLAP.
In-Memory Database.
Well, when we talk about Big Data, I would choose an OLAP database. But let's take a closer look at the technologies:
OLAP (= Online Analytical Processing)
... has the basic technological idea of pre-aggregating data at dimension levels.
Let's say you want to query a sales order table with thousands of orders per day, month and year.
You define dimensions like order date, sales channel and ship-to country, and measures like turnover, number of orders and shipping time.
Usually, you would answer the following questions with an OLAP database:
How many sales orders did we have in June 2016?
What was the turnover (aggregated amount of sales orders) in 2016 with sales channel SHOP, shipped to the USA?
How long did it take on average to ship a sales order per week/month?
... or more technical:
You can answer any question where you have an aggregation in the SELECT clause and a dimension in the WHERE clause:
SELECT
    SUM(amount) AS Turnover,
    AVG(shipping_time) AS avg_shipping_time
FROM sales_orders
WHERE DATEPART(year, order_date) = 2016 AND sales_channel = 'SHOP'
The more the OLAP system can aggregate, the better the performance. Therefore it would be a bad approach to use the sales order number or postal addresses as dimensions. The OLAP idea is to eliminate data (or rows), and that requires standardized data.
The following questions are better answered by a relational database (data warehouse):
Which were the Top 50 sales orders of September 2016?
Tell me the customer address of the sales orders of January 2017
etc.
So what is In-Memory?
The idea of In Memory is that it is faster to query data in RAM than on your disk. But RAM is also expensive.
In-memory features in relational databases are actually built more for OLTP (Online Transaction Processing) systems, i.e. systems where users make transactions and do their daily work, not for analysis.
Actually, enterprise OLAP systems like SQL Server Analysis Services today also use in-memory technology after aggregating the data (OLAP technology). You just don't see it.
--
So OLAP is the right thing, or...?
Let's also think about something else: an OLAP database is something different from a relational database, and sometimes it is oversized for the job (e.g. when you just have this one huge table).
An OLAP database needs to be processed (aggregated and prepared for use). That is, most of the time, done at night when no one is working (OK, you can do it every second if you want :-) ).
If you are new to Big Data and just want to fix this one thing in your application, and don't have a clue about OLAP, I recommend you try to fix it in your application code, unless you want to dig into a new world with new terms and languages like MDX instead of SQL.
The complexity depends on the OLAP database you choose. But in fact, you can easily develop your own "OLAP" aggregation level in your application; it just might not be as flexible as an OLAP database.
Possible solutions in your applications might be:
use SQL Server indexed views, or similar features in other DBs (see the sketch after this list)
use SQL table triggers
use a cron job to aggregate data and write it into a table
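For the first option, here is a minimal T-SQL sketch of an indexed view that pre-aggregates the sales_orders table from the example above (it assumes the table lives in the dbo schema and that amount is NOT NULL; AVG is not allowed in indexed views, so only SUM and COUNT_BIG are stored):

-- A pre-aggregated view that SQL Server keeps in sync with the base table
CREATE VIEW dbo.v_sales_by_month
WITH SCHEMABINDING
AS
SELECT
    DATEPART(year, order_date)  AS order_year,
    DATEPART(month, order_date) AS order_month,
    sales_channel,
    SUM(amount)    AS turnover,      -- assumes amount is NOT NULL
    COUNT_BIG(*)   AS order_count    -- required in an indexed view with GROUP BY
FROM dbo.sales_orders
GROUP BY DATEPART(year, order_date), DATEPART(month, order_date), sales_channel;
GO
-- Materializing the view: the unique clustered index stores the aggregated rows
CREATE UNIQUE CLUSTERED INDEX ix_v_sales_by_month
    ON dbo.v_sales_by_month (order_year, order_month, sales_channel);

Queries like the one above can then be answered from the much smaller aggregated structure (Enterprise Edition matches it automatically; on other editions query the view WITH (NOEXPAND)).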

Statistics on large table presented on the web

We have a large table of data with about 30 000 0000 rows, growing each day, currently at 100 000 rows a day, and that number will increase over time.
Today we generate different reports directly from the database (MS-SQL 2012) and do a lot of calculations.
The problem is that this takes time. We have indexes and so on but people today want blazingly fast reports.
We also want to be able to change time periods, look at the data in different ways, and so on.
We only need to look at data that is one day old so we can take all the data from yesterday and do something with it to speed up the queries and reports.
So, do any of you have good ideas for a solution that will be fast and still on the web, not in Excel or a BI tool?
Today all the reports are in ASP.NET C# WebForms with queries against MS SQL 2012 tables.
You have an OLTP system. You generally want to maximize your throughput on a system like this. Reporting is going to require latches and locks to be taken to acquire data. This puts a drag on your OLTP throughput, and what's good for reporting (additional indexes) is going to be detrimental to your OLTP as it will negatively impact performance. And don't even think that slapping WITH (NOLOCK) on your queries is going to alleviate some of that burden. ;)
As others have stated, you would probably want to look at separating the active data from the report data.
Partitioning a table could accomplish this if you have Enterprise Edition. Otherwise, you'll need to do some hackery like partitioned views, which may or may not work for you depending on how your data is accessed.
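For reference, a minimal sketch of the partitioned-view approach, with hypothetical table and column names: per-period tables carry CHECK constraints on the date column and are combined with UNION ALL, so queries filtered on that column only touch the relevant tables.

-- Hypothetical per-month tables with CHECK constraints on the partitioning column
CREATE TABLE dbo.report_2013_01 (
    report_date DATE NOT NULL
        CONSTRAINT ck_report_2013_01 CHECK (report_date >= '20130101' AND report_date < '20130201'),
    amount      MONEY NOT NULL
);
CREATE TABLE dbo.report_2013_02 (
    report_date DATE NOT NULL
        CONSTRAINT ck_report_2013_02 CHECK (report_date >= '20130201' AND report_date < '20130301'),
    amount      MONEY NOT NULL
);
GO
-- The partitioned view: the optimizer can skip tables outside the requested date range
CREATE VIEW dbo.report_all AS
SELECT report_date, amount FROM dbo.report_2013_01
UNION ALL
SELECT report_date, amount FROM dbo.report_2013_02;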
I would look at extracting the needed data out of the system at a regular interval and pushing it elsewhere. Whether that elsewhere is a different set of tables in the same database, a different catalog on the same server, or an entirely different server would depend on a host of variables (cost, time to implement, complexity of data, speed requirements, storage subsystem, etc.).
Since it sounds like you don't have super-specific reporting requirements (currently you look at yesterday's data, but it'd be nice to see more, etc.), I'd look at implementing columnstore indexes on the reporting tables. They provide amazing performance for query aggregation, even over aggregate tables, with the benefit that you don't have to specify a specific grain (WTD, MTD, YTD, etc.). The downside is that it is a read-only data structure (and a memory and CPU hog while creating the index). SQL Server 2014 is going to introduce updatable columnstore indexes, which will be giggity, but that's some time off.
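As a rough sketch with hypothetical table and column names, a nonclustered columnstore index on a SQL Server 2012 reporting table could look like this; because it is read-only, it gets disabled and rebuilt around the daily load:

-- Columnstore over the columns used for aggregation (names are illustrative)
CREATE NONCLUSTERED COLUMNSTORE INDEX ix_cs_report_facts
    ON dbo.report_facts (report_date, customer_id, amount, quantity);
GO
-- Daily refresh pattern for the read-only 2012 columnstore
ALTER INDEX ix_cs_report_facts ON dbo.report_facts DISABLE;   -- before loading yesterday's rows
-- ... bulk load yesterday's data here ...
ALTER INDEX ix_cs_report_facts ON dbo.report_facts REBUILD;   -- after the load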

Oracle materialized views or aggregated tables in datawarehouse

Are Oracle (11g) materialized views good practice for aggregated tables in data warehousing?
We have DW processes that replace 2 months of data each day. Sometimes that means a few gigs for each month (~100K rows).
On top of them are materialized views that get refreshed after the nightly data transfer cycle.
My question is: would it be better to create aggregated tables instead of the MVs?
I think that one case where aggregated tables might be beneficial is where the aggregation can be effectively combined with the atomic-level data load, best illustrated with an example.
Let's say that you load a large volume of data into a fact table every day via a partition exchange. A materialized view refresh using partition change tracking is going to be triggered during or after the partition exchange, and it's going to scan the modified partitions and apply the changes to the MVs.
It is possible that, as part of populating the table(s) that you are going to exchange with the fact table partitions, you could also compute aggregates at various levels using CUBE/ROLLUP and use a multitable insert to load tables that you can then partition-exchange into one or more aggregation tables. Not only might this be inherently more efficient by avoiding rescanning the atomic-level data, but your aggregates are also computed prior to the fact table partition exchange, so if anything goes wrong you can suspend the modification of the fact table itself.
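A minimal sketch of that idea, with hypothetical table, column and partition names: compute the aggregates from the staging table with ROLLUP, route each grouping level to its own staging table with a multitable INSERT, then exchange those into the aggregation tables alongside the fact-table exchange.

-- Aggregate the staging data once, routing each grouping level to its own staging table
INSERT FIRST
  WHEN product_id IS NOT NULL THEN
    INTO sales_agg_month_product_stg (order_month, product_id, amount)
    VALUES (order_month, product_id, amount)
  WHEN order_month IS NOT NULL THEN
    INTO sales_agg_month_stg (order_month, amount)
    VALUES (order_month, amount)
SELECT TRUNC(order_date, 'MM') AS order_month,
       product_id,
       SUM(amount)             AS amount
FROM   sales_stage
GROUP  BY ROLLUP (TRUNC(order_date, 'MM'), product_id);

-- Swap the prepared aggregate in during the same job that exchanges the fact partition
ALTER TABLE sales_agg_month_product
  EXCHANGE PARTITION p_2017_01 WITH TABLE sales_agg_month_product_stg;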
Other thoughts might occur to me later ... I'll open this answer up as a community wiki if others have ideas.
