Can you recommend some popular software to implement OLAP? It will be much better if there are extra related links.
There are two main categories of OLAP systems, MOLAP and ROLAP:
MOLAP (multi-dimensional online analytical processing) = MOLAP stores the data in an optimized multi-dimensional store. The word optimized is important here: data needs to be pre-processed before it can be stored in such special data stores, but reading is then fast because the stored data is laid out for analytical queries.
ROLAP (relational online analytical processing) = ROLAP does not require pre-computation of analytical information. It stores and accesses the data in a relational database and generates SQL queries to obtain analytical information (aggregates are not pre-computed, but results can be cached after they are computed the first time).
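To make the ROLAP idea concrete, here is a minimal, self-contained sketch (the tiny star schema and all names are invented for illustration) of the kind of aggregate SQL a ROLAP engine generates and runs at query time instead of reading a precomputed cube:

```python
import sqlite3

# Toy star schema: one fact table plus one dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (product_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO fact_sales  VALUES (1, 10.0), (1, 15.0), (2, 40.0);
""")

# A ROLAP engine would translate a cube query ("total sales by category")
# into SQL like this and execute it against the relational DWH on demand.
rows = conn.execute("""
    SELECT p.category, SUM(f.amount) AS total_sales
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.category
""").fetchall()

for category, total in rows:
    print(category, total)   # e.g. Books 25.0, Games 40.0
```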
OLAP systems consist of cubes, dimensions and measures. The cube metadata is typically created from a star schema or snowflake schema. I recommend taking a look at The Data Warehouse Toolkit, 3rd Edition.
Basic BI solutions usually consist of five parts. I will describe them using an open source ROLAP stack that I can recommend:
Source systems = Various databases, web services and files that hold the data to be analyzed; their data is loaded into the DWH.
OLAP DWH (data warehouse) = Database for storing current and historical analytical data in multi-dimensional schemas (usually star schemas). For ROLAP you can choose any RDBMS:
Row oriented: PostgreSQL, MySQL, etc.
Column oriented (optimal for OLAP): MonetDB, Vertica, Amazon Redshift
ETL (extract, transform, load) = Process of extracting data from source systems, transforming it, and loading it into the DWH (a minimal sketch follows this list).
E.g. Pentaho Data Integration (Kettle)
OLAP server = Builds OLAP cubes on top of the DWH; the cubes can be queried using the multidimensional query language MDX and accessed by BI front-end applications.
There is an open source ROLAP server called Mondrian OLAP; it caches MDX results (use the Schema Workbench tool to create the Mondrian OLAP schema).
BI Analysis tool = Front-end application for analyzing data:
Ad-hoc analysis: Saiku Analysis application (you can preview the Saiku demo online)
Reporting, dashboards: Pentaho BI Server CE (CDF dashboards with the CCC chart portfolio, maps, etc.)
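As referenced in the ETL item above, here is a minimal sketch of the extract-transform-load step in Python (the file, connection string, table and column names are made up; a tool such as Pentaho Data Integration does the same thing graphically and at larger scale):

```python
import pandas as pd
from sqlalchemy import create_engine

# Extract: pull raw rows from a source system (a CSV export in this sketch).
raw = pd.read_csv("sales_export.csv")                    # hypothetical source file

# Transform: clean and reshape the data to the grain of the star schema.
raw["order_date"] = pd.to_datetime(raw["order_date"])
fact = (raw.groupby(["order_date", "product_id"], as_index=False)
           .agg(amount=("amount", "sum")))

# Load: append the transformed rows into a fact table in the DWH.
engine = create_engine("postgresql://user:password@dwh-host/dwh")   # placeholder DSN
fact.to_sql("fact_sales", engine, if_exists="append", index=False)
```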
MOLAP solutions:
There is an open source MOLAP server: Palo. Most of the others are commercial: Jedox, icCube, etc.
Other ROLAP solutions:
Microsoft Analysis Services (aggregated OLAP data is precalculated in a processing step when the cubes are built), Oracle Business Intelligence Suite EE, etc.
I'm implementing a simple workflow in which I have three different data sources (API, parquet file and PostgreSQL database). The goal is to gather the data from all the different sources and store it in a PostgreSQL warehouse.
The task flow I planned looks like:
Create PostgreSQL DW >> [Get data from source 1, Get data from source 2, Get data from source 3] >> Insert data into PostgreSQL DW
In order for this to work, I would have to share the data from the "Get Data" tasks to the "Insert Data" task.
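For reference, a skeleton of the DAG I have in mind looks roughly like this (operator choices, callables and names are placeholders, assuming Airflow 2.4+):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def create_dw():
    ...  # create the PostgreSQL DW tables (placeholder)

def get_source(source_name):
    ...  # pull data from one source: API, parquet file or Postgres (placeholder)

def insert_into_dw():
    ...  # write the gathered data into the DW (placeholder)

with DAG("load_dw", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    create = PythonOperator(task_id="create_dw", python_callable=create_dw)
    gets = [
        PythonOperator(task_id=f"get_{name}", python_callable=get_source, op_args=[name])
        for name in ("api", "parquet", "postgres")
    ]
    insert = PythonOperator(task_id="insert_into_dw", python_callable=insert_into_dw)

    create >> gets >> insert
```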
My questions are:
Is sharing data between tasks a bad/wrong thing to do?
Should I approach this any other way?
If I implement a single task that gets the data from the source and then inserts it into another database, wouldn't that task be non-idempotent?
Airflow is primarily an orchestrator. Sharing small snippets of data between tasks is encouraged with XComs, but large amounts of data are not supported. XComs are stored in the Airflow metadata database, which would quickly fill up if you used this pattern often.
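For small values, such as a row count or a file path, XComs work fine. A minimal sketch (task names and the path are invented):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def get_data():
    # The return value is pushed to XCom automatically (key "return_value").
    return "s3://bucket/staging/source1.parquet"   # a small pointer, not the data itself

def insert_data(ti):
    # Pull the small value pushed by the upstream task.
    path = ti.xcom_pull(task_ids="get_source_1")
    print(f"loading {path} into the warehouse")

with DAG("xcom_example", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    get_1 = PythonOperator(task_id="get_source_1", python_callable=get_data)
    ins = PythonOperator(task_id="insert_into_dw", python_callable=insert_data)
    get_1 >> ins
```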
Your use case sounds more suited to Apache Beam, which is designed to process data in parallel at scale. It's much more common to use Airflow to schedule your Beam pipelines, which do the actual work of ETL.
There is an Airflow Operator for Apache Beam. Depending on the size of your data you can process it locally on the Airflow workers with the DirectRunner. Or if you need to process large amounts of data you can offload the pipeline execution to a cloud solution like GCP's Dataflow using Beam's DataflowRunner.
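A hedged sketch of what that looks like with the Beam provider (the pipeline file and bucket are placeholders; check the operator's arguments against the provider version you have installed):

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator

with DAG("beam_etl", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    # Runs the Beam pipeline defined in etl_pipeline.py (hypothetical file) on the
    # Airflow worker via the DirectRunner; switch runner to "DataflowRunner" (and add
    # a dataflow_config) to offload execution to GCP Dataflow for larger datasets.
    run_etl = BeamRunPythonPipelineOperator(
        task_id="run_etl_pipeline",
        py_file="/opt/airflow/dags/etl_pipeline.py",
        runner="DirectRunner",
        pipeline_options={"temp_location": "gs://my-bucket/tmp"},  # placeholder bucket
    )
```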
The Airflow + Beam pattern is much more common and a powerful combination when dealing with data. Even if your datasets are small this pattern will let you scale with no further effort required if you need to in the future.
We "need" to feed our SSAS Tabular models with Snowflake data (through ODBC because we have no Snowflake connector).
We tried with SQL Server 2016 and SQL Server 2017 and get atrocious performance (a few rows per second...).
Under PowerBI, there is a Snowflake connector and it's fast.
I came across a thread from someone having an apparently similar speed problem when trying to feed Snowflake data to SAS.
He seems to have solved his problem by specifying some (ODBC?) parameters:
dbcommit = 10000
autocommit = no
readbuff = 200
insertbuff = 200
Are these parameters specific to Snowflake? Or just ODBC?
Thanks
Eric
These are SAS options, not ODBC or Snowflake.
For the load issue, row-by-row inserts over ODBC are typically slow.
Is there an option for you to export the data to S3 or Blob storage and then bulk load it into SSAS before running your models, or alternatively could you push the queries down to Snowflake itself?
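If you go the export route, a rough sketch of bulk-unloading from Snowflake with the Python connector (the stage, table and credentials are placeholders; the files landed in S3/Blob can then be bulk loaded into your SSAS source):

```python
import snowflake.connector

# Placeholder credentials / object names - adjust to your account.
conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",
    warehouse="LOAD_WH", database="ANALYTICS", schema="PUBLIC",
)

# COPY INTO <stage> unloads the table (or a query result) as compressed files
# to an external stage pointing at S3/Azure Blob, instead of row-by-row ODBC reads.
conn.cursor().execute("""
    COPY INTO @my_external_stage/fact_sales/
    FROM (SELECT * FROM fact_sales)
    FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
    OVERWRITE = TRUE
""")
conn.close()
```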
Instructions for SSAS Tabular models in Azure Analysis Services are here. Everything except the On Premises Data Gateway should apply to you.
Make sure that tracing=0 in your ODBC data source, as tracing=6 killed performance for me.
I want to build a Kafka connector in order to retrieve records from a database in near real time. My database is Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 and the tables have millions of records. First of all, I would like to add the minimum load to my database, using CDC. Secondly, I would like to retrieve records based on a LastUpdate field whose value is after a certain date.
Searching the Confluent site, the only open source connector I found was “Kafka Connect JDBC”. I think this connector doesn't have a CDC mechanism, and it isn't practical to retrieve millions of records when the connector starts for the first time. The alternative solution I thought of is Debezium, but there is no Debezium Oracle connector on the Confluent site, and I believe it is still in beta.
Which solution would you suggest? Is something wrong with my assumptions about Kafka Connect JDBC or the Debezium connector? Is there any other solution?
For query-based CDC, which is less efficient, you can use the JDBC source connector.
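A hedged example of such a connector, registered through the Kafka Connect REST API (host, table and credential values are placeholders; the property names are from the Confluent JDBC source connector, so verify them against your connector version):

```python
import json
import requests

# Query-based CDC: poll the table and pick up rows whose LastUpdate column moved forward.
connector = {
    "name": "oracle-jdbc-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:oracle:thin:@//db-host:1521/ORCL",   # placeholder
        "connection.user": "kafka_user",
        "connection.password": "***",
        "mode": "timestamp",                    # query-based CDC on a timestamp column
        "timestamp.column.name": "LASTUPDATE",
        "table.whitelist": "SALES.ORDERS",      # placeholder schema.table
        "topic.prefix": "oracle-",
        "poll.interval.ms": "10000",
    },
}

# Register the connector with a Kafka Connect worker (default REST port 8083).
resp = requests.post("http://localhost:8083/connectors",
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(connector))
resp.raise_for_status()
```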
For log-based CDC I am aware of a couple of options; however, some of them require a license:
1) Attunity Replicate, which allows users to use a graphical interface to create real-time data pipelines from producer systems into Apache Kafka without having to do any manual coding or scripting. I have been using Attunity Replicate for Oracle -> Kafka for a couple of years and have been very satisfied.
2) Oracle GoldenGate, which requires a license.
3) Oracle LogMiner, which does not require any license and is used by both Attunity and kafka-connect-oracle, a Kafka source connector that captures all row-based DML changes from an Oracle database and streams them to Kafka. Its change data capture logic is based on the Oracle LogMiner solution.
We have numerous customers using IBM's IIDR (InfoSphere Data Replication) product to replicate data from Oracle databases (as well as Z mainframe, IBM i, SQL Server, etc.) into Kafka.
Regardless of which of the sources used, data can be normalized into one of many formats in Kafka. An example of an included, selectable format is...
https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/tasks/kcopauditavrosinglerow.html
The solution is highly scalable and has been measured replicating changes at rates in the hundreds of thousands of rows per second.
We also have a proprietary ability to reconstitute data written in parallel to Kafka back into its original source order. So, despite data having been written to numerous partitions and topics, the original total order can be known. This functionality is known as the TCC (transactionally consistent consumer).
See the video and slides here...
https://kafka-summit.org/sessions/exactly-once-replication-database-kafka-cloud/
I have about 100GB of data in BigQuery, and I'm fairly new to using data analysis tools. I want to grab about 3000 extracts using a programmatic series of SQL queries, and then run some statistical analysis to compare kurtosis across those extracts.
Right now my workflow is as follows:
running on my local machine, use BigQuery Python client APIs to grab the data extracts and save them locally
running on my local machine, run kurtosis analysis over the extracts using scipy
The second one of these works fine, but it's pretty slow and painful to save all 3000 data extracts locally (network timeouts, etc).
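In rough outline (the query, column and project names below are placeholders), each extract currently goes through something like this:

```python
from google.cloud import bigquery
from scipy.stats import kurtosis

client = bigquery.Client()   # uses local application-default credentials

# One of the ~3000 programmatic queries (placeholder SQL and parameter).
sql = "SELECT value FROM `my_project.my_dataset.my_table` WHERE segment = @seg"
job = client.query(sql, job_config=bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("seg", "STRING", "segment_42")]
))
df = job.to_dataframe()      # the slow, timeout-prone download step

print(kurtosis(df["value"]))
```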
Is there a better way of doing this? Basically I'm wondering if there's some kind of cloud tool where I could quickly run the calls to get the 3000 extracts, then run the Python to do the kurtosis analysis.
I had a look at https://cloud.google.com/bigquery/third-party-tools but I'm not sure if any of those do what I need.
So far Cloud Datalab is your best option
https://cloud.google.com/datalab/
It is in beta so some surprises are possible
Datalab is built on top of the Jupyter/IPython option below and runs entirely in the cloud.
Another option is Jupyter/IPython Notebook
http://jupyter-notebook-beginner-guide.readthedocs.org/en/latest/
Our data science team started with the second option long ago with great success and is now moving toward Datalab.
For the rest of the business (prod, BI, ops, sales, marketing, etc.), though, we had to build our own workflow/orchestration tool, as nothing available proved good or relevant enough.
Two easy ways:
1: If your issue is the network, as you say, use a Google Compute Engine machine to do the analysis in the same region as your BigQuery tables (US, EU, etc.). It will not have network issues getting data from BigQuery and will be very fast.
The machine will only cost you for the minutes you use it. Save a snapshot of your machine to reuse the setup anytime (the snapshot also has a monthly cost, but a much lower one than keeping the machine up).
2: Use Google Cloud Datalab (beta as of Dec. 2015), which supports BigQuery sources and gives you all the tools you need to do the analysis and later share it with others:
https://cloud.google.com/datalab/
from their docs: "Cloud Datalab is built on Jupyter (formerly IPython), which boasts a thriving ecosystem of modules and a robust knowledge base. Cloud Datalab enables analysis of your data on Google BigQuery, Google Compute Engine, and Google Cloud Storage using Python, SQL, and JavaScript (for BigQuery user-defined functions)."
You can check out Cooladata
It allows you to query BQ tables as external data sources.
What you can do is either schedule your queries and export the results to Google Cloud Storage, where you can pick them up from there, or use the powerful built-in reporting tool to answer your 3000 queries.
It will also provide all the BI tools you will need for your business.
My source data is in Oracle and the target data is in Teradata. Can you please suggest a quick and easy way to validate the data? There are 900 tables. If possible, can you provide syntax too?
There is a product available known as the Teradata Gateway that works with Oracle and allows you to access Teradata in a "heterogeneous" manner. This may not be the most effective way to compare the data.
Ultimately, your requirements sound more process driven; to do this effectively, the source data would need to be compared/validated against stage tables in the Teradata environment after your ETL/ELT process has completed.
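As a starting point for a lightweight check across the 900 tables, here is a hedged sketch that compares row counts between the two systems (driver choice, credentials and the table list are placeholders; a real validation would also compare checksums or column aggregates on the staged tables):

```python
import cx_Oracle      # Oracle client driver
import teradatasql    # Teradata SQL Driver for Python

TABLES = ["CUSTOMERS", "ORDERS", "ORDER_ITEMS"]   # placeholder subset of the 900 tables

ora = cx_Oracle.connect("src_user", "***", "ora-host:1521/ORCLPDB")         # placeholder DSN
td = teradatasql.connect(host="td-host", user="tgt_user", password="***")   # placeholder

def count_rows(conn, table):
    # Run the same simple validation query on both sides.
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    return cur.fetchone()[0]

for table in TABLES:
    src, tgt = count_rows(ora, table), count_rows(td, table)
    status = "OK" if src == tgt else "MISMATCH"
    print(f"{table}: oracle={src} teradata={tgt} {status}")
```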