OLAP vs In-Memory

I am working with Big Data and all of my backend logic is written in PHP. For faster output, which of the following technologies would be efficient and a good fit for my product?
OLAP.
In-Memory Database.

Well, when we talk about Big Data, I would choose an OLAP database. But let's take a closer look at the technologies:
OLAP (= Online Analytical Processing)
... is built on the basic technological idea of pre-aggregating data at dimension levels.
Let's say you want to query a sales order table with thousands of orders per day, month, and year.
You define dimensions like order date, sales channel, and ship-to country, and measures like turnover, number of orders, and shipping time.
Usually, you would answer the following questions with an OLAP database:
How many sales orders did we have in June 2016?
What was the turnover (aggregated amount of sales orders) in 2016 for sales channel SHOP shipped to the USA?
How long did it take on average to ship a sales order per week/month?
... or, more technically:
You can answer every question that has an aggregation in the SELECT clause and dimensions in the WHERE clause:
SELECT
SUM(amount) AS Turnover,
AVG(shipping_time) AS avg_shipping_time
FROM sales_orders
WHERE DATEPART(year,order_date) = 2016 AND sales_channel = 'SHOP'
The more the OLAP system can aggregate, the better the performance. Therefore it would be a bad approach to use the sales order number or postal addresses as dimensions. The OLAP idea is to eliminate data (or rows), and that requires standardized data.
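To make that concrete, here is a rough relational sketch of the effect (this is not how an OLAP engine stores data internally, just the idea of collapsing rows at the dimension level; the aggregate table name and the ship_to_country / shipping_time columns are assumed from the example above):
-- Thousands of order rows per day collapse into one row per
-- (day, sales channel, ship-to country). SQL Server syntax.
SELECT
    CAST(order_date AS DATE)  AS order_day,
    sales_channel,
    ship_to_country,
    SUM(amount)               AS turnover,
    COUNT(*)                  AS no_of_orders,
    AVG(shipping_time)        AS avg_shipping_time
INTO sales_orders_agg         -- hypothetical materialized day-level aggregate
FROM sales_orders
GROUP BY CAST(order_date AS DATE), sales_channel, ship_to_country;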
The following questions would be better answered by a relational database (data warehouse):
Which were the Top 50 sales orders of September 2016?
Tell me the customer addresses for the sales orders of January 2017
etc.
So what is In-Memory?
The idea of In-Memory is that querying data in RAM is faster than querying it on disk. But RAM is also expensive.
In-Memory features in relational databases are actually built more for OLTP (Online Transaction Processing) systems - systems where users create transactions and work interactively - not for analysis.
Actually, enterprise OLAP systems like SQL Server Analysis Services today also use In-Memory technology after aggregating the data (OLAP technology). You just don't see it.
--
So OLAP is the right thing, or...?
Let's also think about something else: an OLAP database is something different from a relational database, and sometimes an OLAP database is simply oversized for the job (e.g. when you just have this one huge table).
An OLAP database needs to be processed (aggregated and prepared for use). That is - most of the time - done at night when no one is working (OK, you can do it every second if you want :-) ).
If you are new to Big Data, just want to fix this one thing in your application, and don't have a clue about OLAP, my recommendation is: try to fix it in your application code - unless you want to dig into a new world with new terms and languages like MDX instead of SQL.
The complexity depends on the OLAP database you choose. But in fact, you can easily develop your own "OLAP" aggregation layer in your application... it just might not be as flexible as an OLAP database.
Possible solutions in your application might be:
use SQL Server indexed views (sketched below) - or similar features in other DBs
use SQL table triggers
use a cron job to aggregate data and write it into a table
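As an illustration of the first option, here is a minimal sketch of an indexed view in SQL Server, reusing the sales_orders example from above (it assumes order_date is a DATE column and amount is NOT NULL, since SUM over a nullable column is not allowed in indexed views):
CREATE VIEW dbo.v_sales_per_day
WITH SCHEMABINDING
AS
SELECT
    order_date,
    sales_channel,
    SUM(amount)  AS turnover,
    COUNT_BIG(*) AS no_of_orders   -- COUNT_BIG(*) is mandatory in an indexed view
FROM dbo.sales_orders
GROUP BY order_date, sales_channel;
GO
-- Materializing the view: the engine now maintains the aggregates itself,
-- and day- or channel-level questions can be answered without scanning the base table.
CREATE UNIQUE CLUSTERED INDEX ix_v_sales_per_day
ON dbo.v_sales_per_day (order_date, sales_channel);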

Related

Which DB to use for comparing courses of data by days?

I'm currently thinking about a little "Big Data" project where I want to record some utilization values every 10 minutes and write them to a DB over several months or years.
I then want to analyze the data e.g. in these ways:
Which time of the day is best (in terms of a low utilization)?
What are the differences in utilization between normal weekdays and days on the weekend?
At what time does the utilization start to rise on a normal Monday?
For this I obviously need to be able to build averaged graphs for, e.g., all Mondays recorded so far.
For a first proof of concept I set up InfluxDB and Grafana, which works quite well for seeing the data being written to the DB, but the more I research on the internet, the more I see that InfluxDB is not made for what I want to do (or cannot do it yet).
So which database would be best for recording and analyzing data like that? Or is it more a question of which tool to use to analyze the data? Which tool could that be?
InfluxDB's query language is not flexible enough for your kind of questions.
The SQL databases supported by Grafana (MySQL, Postgres, TimescaleDB, ClickHouse) seem to fit better. The choice depends on your preferences and the amount of data. For smaller datasets plain MySQL or Postgres may be enough. For higher loads consider TimescaleDB. For billions of data points ClickHouse is probably the better fit.
If you want a lightweight but scalable NoSQL timeseries solution have a look at VictoriaMetrics.
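To give an idea of why plain SQL fits these questions, here is a minimal Postgres sketch (the utilization table and its ts / value columns are assumed, not taken from your setup); it averages the recorded values per hour of day, Mondays only:
-- Assumed table: utilization(ts timestamptz, value double precision)
SELECT
    EXTRACT(hour FROM ts) AS hour_of_day,
    AVG(value)            AS avg_utilization
FROM utilization
WHERE EXTRACT(isodow FROM ts) = 1      -- ISO day of week: 1 = Monday
GROUP BY EXTRACT(hour FROM ts)
ORDER BY hour_of_day;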

MS SQL product list with filtering

I'm building an application in ASP.NET (VB) with an MS SQL database. It is a search tool for cars that has a list of every car and all of their attributes (colors, # of doors, gas mileage, mfg. year, etc.). This tool outputs the results in a gridview, and users have the ability to perform advanced searches and filtering. The filtering needs to be very fine-grained (range of gas mileage, color(s), mfg. year range, etc.) and I cannot seem to find the best way to do this filtering without a large SQL WHERE statement that is going to greatly impact SQL performance and page load. I feel like I'm missing something very obvious here; thank you for any help. I'm not sure what other details would be helpful.
This is not an OLTP database you're building--it's really an analytics database. There really isn't a way around the problem of having to filter. The question is whether the organization of the data will allow seeks most of the time or will require scans, and whether the resulting JOINs can be done efficiently or not.
My recommendation is to go ahead and create the data normalized and all, as you are doing. Then, build a process that spins it into a data warehouse, denormalizing like crazy as needed, so that you can do filtering by WHERE clauses that have to do a lot less work.
For every single possible search result, you have a row in a table that doesn't require joining to other tables (or only a few fact tables).
You can reduce complexity a bit for some values, such as gas mileage, by splitting the mileage into bands of, say, 5 mpg (10-14, 15-19, 20-24, etc.).
As you need to add to the data and change it, your data-warehouse-loading process (that runs once a day perhaps) will keep the data warehouse up to date. If you want more frequent loading that doesn't keep clients offline, you can build the data warehouse to an alternate node, then swap them out. Let's say it takes 2 hours to build. You build for 2 hours to a new database, then swap to the new database, and all your data is only 2 hours old. Then you wipe out the old database and use the space to do it again.
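As a rough illustration (all table and column names here are invented for the example, not taken from your schema), the denormalized search table with a pre-computed mileage band might look like this:
-- Hypothetical denormalized search table: one row per car, no joins at query time.
CREATE TABLE dbo.car_search (
    car_id    INT          NOT NULL PRIMARY KEY,
    color     VARCHAR(30)  NOT NULL,
    doors     TINYINT      NOT NULL,
    mfg_year  SMALLINT     NOT NULL,
    mpg       DECIMAL(5,1) NOT NULL,
    mpg_band  SMALLINT     NOT NULL   -- lower edge of the 5 mpg band, e.g. 15 means 15-19 mpg
);
-- Filtering then becomes a set of simple, index-friendly predicates:
SELECT car_id, color, doors, mfg_year, mpg
FROM dbo.car_search
WHERE mpg_band IN (20, 25)                 -- 20-29 mpg
  AND mfg_year BETWEEN 2010 AND 2015
  AND color IN ('Red', 'Blue');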

Statistics on large table presented on the web

We have a large table of data with about 300 000 000 rows, currently growing at 100 000 rows a day, and that number will increase over time.
Today we generate different reports directly from the database (MS-SQL 2012) and do a lot of calculations.
The problem is that this takes time. We have indexes and so on but people today want blazingly fast reports.
We also want to be able to change time periods, look at the data in different ways, and so on.
We only need to look at data that is one day old so we can take all the data from yesterday and do something with it to speed up the queries and reports.
So do any of you have any good ideas for a solution that will be fast and still web-based, not in Excel or a BI tool?
Today all the reports are in ASP.NET C# WebForms with queries against MS SQL 2012 tables.
You have an OLTP system. You generally want to maximize your throughput on a system like this. Reporting is going to require latches and locks to be taken to acquire data. This puts a drag on your OLTP throughput, and what's good for reporting (additional indexes) is going to be detrimental to your OLTP as it will negatively impact performance. And don't even think that slapping WITH(NOLOCK) on everything is going to alleviate that burden. ;)
As others have stated, you would probably want to look at separating the active data from the report data.
Partitioning a table could accomplish this if you have Enterprise Edition. Otherwise, you'll need to do some hackery like partitioned views, which may or may not work for you based on how your data is accessed.
I would look at extracting the needed data out of the system at a regular interval and pushing it elsewhere. Whether that elsewhere is a different set of tables in the same database, a different catalog on the same server, or an entirely different server depends on a host of variables (cost, time to implement, complexity of the data, speed requirements, storage subsystem, etc.).
Since it sounds like you don't have super specific reporting requirements (currently you look at yesterday's data, but it'd be nice to see more, etc.), I'd look at implementing columnstore indexes in the reporting tables. They provide amazing performance for query aggregation, even over aggregate tables, with the benefit that you don't have to commit to a specific grain (WTD, MTD, YTD, etc.). The downside, though, is that it is a read-only data structure (and a memory and CPU hog while creating the index). SQL Server 2014 is going to introduce updatable columnstore indexes, which will be giggity, but that's some time off.
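For reference, a nonclustered columnstore index (the only flavor available in SQL Server 2012) is created like this; dbo.report_facts and its columns are just placeholders for whatever your reporting table looks like:
-- In SQL Server 2012 a nonclustered columnstore index makes the table read-only,
-- so it is typically dropped before the daily load and recreated afterwards.
CREATE NONCLUSTERED COLUMNSTORE INDEX ix_cs_report_facts
ON dbo.report_facts (report_date, customer_id, product_id, amount, quantity);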

Vertica for non-analytics

I have a big analytics module in my system and plan to use Vertica for it.
Someone suggested that we also use Vertica in the rest of our app (a standard CRUD app with models from our domain) so as not to manage multiple databases.
Would Vertica fit this dual scenario?
High-frequency UPDATEs are probably where Vertica lags behind the worst. I would avoid using it for such data models.
Alec - I would like to respectfully challenge your comments on Vertica. In no way do you need to denormalize or sort data before loading. Vertica also holds the record for the fastest data loading of any database.
You also talk about Vertica not being able to do complex analytics as well as an RDBMS. Vertica IS an RDBMS and can do analytics faster than any other RDBMS, and they prove it over and over.
As for your numbers, in my use case I load roughly 5 million records per second into my Vertica cluster and have hundreds of billions of records.
So Yaron - I would highly recommend you look at Vertica before you rule it out based on this information.
As is often the case these days, a meaningful answer depends on what you need to do. In a general sense, 'big data' solutions have grown out of the large-data-volume deficiencies of RDBMS systems. No 'big data' solution can compete with the core capabilities of RDBMS systems, i.e. complex analytics, but RDBMS systems are poor (expensive) solutions for large data volume processing. Practical solutions for now have to be hybrid solutions. Vertica can be good once the data is loaded, but I believe (not an expert) it requires denormalisation of data and pre-sorting before loading to perform at its best. For large data volumes this may add significantly to the required resources. There is a definite benefit to using one system for all your needs, but there are also benefits to keeping your options open.
The approach I take is to store and index new data and then provide specific feeds to various reporting/analytic engines as required. This separates the collection and storage of raw data from the complex analytic processing. I am happy to provide more details if you are interested. This separation addresses a core problem which has always been present in database systems. In the past you used to hear 'store fast, report slowly, or store slowly, report fast, but you cannot do both'. The search for a complete solution has, in the last few years, spawned the many NoSQL offerings, which typically address the 'store fast' task. Some systems also provide impressive query performance by storing data in memory or cache, but this requires many servers for large data volumes. I believe NoSQL and SQL solutions can, and will, be integrated, but this is still down the track.
To give you some context, I work with scenarios where at least 1 billion records a day are loaded. If you are dealing with say 100 million records a day (big is relative), then your Vertica approach will probably suffice, otherwise I think you need to expand your options.
Test it. Each use case is different. Assuming Vertica is a solution for every use case is almost as bad as using MongoDB for every use case.
Vertica is a high-performance analytics database, column oriented, designed to analyze incredibly large datasets and scale horizontally. It's also expensive, hard to administer, and its documentation is spotty. The payoff in the right environment can easily be worth the work, obviously.
MySQL is a traditional RDBMS, row oriented, designed to model relationships between structured data, and works well at single-node scale (though many companies have retrofitted it to great success, exempli gratia, Facebook). It's incredibly well documented, seemingly works on any platform, language, or framework, and can be used by anyone.
My guess is using Vertica for an employee address book database is like showing up to a blue collar job in a $3000 suit. Sure it works, but is it the right tool for the job? Maybe if you already have a Vertica license and your applications already have the requisite data adaptors/ORM/etc..., go ahead and give it a shot. It's still a SQL database so it should work fine in those situations. If your goal is minimal programming as opposed to optimal performance, then why use Vertica at all? Sounds like something simpler would be more ideal. Vertica may or may not give better performance in a regular CRUD application environment since it's not optimized for that, but you can always test both and see.
Vertica has many issues with high concurrency (many small transactions per minute).
In MPP systems, the data is segmented across the cluster, and from time to time a cluster-level lock has to be taken (mainly at commit time), so many commits mean many cluster-level X locks.
High concurrency is less of a use case in DWH and reporting, so Vertica is perfect for that.
In most cases, OLTP solutions (like CRM etc.) are required to provide high concurrency, and for that Vertica is a bad choice.
Thanks

What is the use of OLAP?

How is it that OLAP data access can be faster than OLTP?
OLAP makes data access very quick by using a multidimensional data model.
If you have a huge amount of data and report generation takes extremely long (e.g. several hours), you can use OLAP to prepare the report in advance. Each request against the already-processed data is then fast.
OLAP is fundamentally for read-only data stores. Classic OLAP is a Data Warehouse or Data Mart and we work with either as an OLAP cube. Conceptually you can think of an OLAP Cube like a huge Excel PivotTable. That is a structure with sides (dimensions) and data intersections (facts) that has NO JOINS.
The data structure is one of the reasons that OLAP is so much faster to query than OLTP. Another reason is the concept of aggregations, which are stored intersections at a level higher than the leaf (bottom) level. An example would be as follows:
You may load a cube with facts about sales (i.e. how much in dollars, how many in units, etc.) with one row (or fact) for each sales amount, by the following dimensions: time, products, customers, etc. The level at which you load each dimension, for example sales by EACH day and by EACH customer, is the leaf data. Of course you will often want to query aggregated values, that is sales by MONTH, by customers in a certain CITY, etc.
Those aggregations can be calculated at query time, or they can be pre-aggregated and stored at cube load. At query time, OLAP cubes use a combination of stored and calculated aggregations. Unlike OLTP indexes, PARTIAL aggregations can be used.
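A rough relational analogy for those stored aggregations (this is plain SQL, not cube syntax, and the day-level table is hypothetical): if sales are already aggregated per day, a month-level question only has to combine roughly 30 stored rows per month instead of millions of leaf facts.
-- Hypothetical stored day-level aggregate: sales_per_day(sales_day, turnover, units)
SELECT
    DATEPART(year, sales_day)  AS sales_year,
    DATEPART(month, sales_day) AS sales_month,
    SUM(turnover)              AS turnover,
    SUM(units)                 AS units
FROM sales_per_day
GROUP BY DATEPART(year, sales_day), DATEPART(month, sales_day);
-- Note: additive measures (SUM, COUNT) roll up like this; an average would have to be
-- recomputed from stored sums and counts, not averaged again.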
In addition to this, most OLAP cubes have extensive caching set up by default and most also allow for very granular cache tuning (pre-loading).
Another consideration is that relatively recently in-memory BI (or OLAP) is being offered by more and more vendors. Obviously, if more of the OLAP data is in memory, then resulting queries will be EVEN faster than traditional OLAP. To see an example of an in-memory cube, take a look at my slide deck about SQL Server 2012 BISM.
You need to do some research on what OLAP is and why/when you need to use it. Try starting by searching Google for OLAP, and read this wikipedia article:
http://en.wikipedia.org/wiki/Online_analytical_processing
