I am trying to reconstruct a Cognos Transformer cube in Snowflake.
1. Do we have an option to build an OLAP cube in Snowflake (like SSAS or Cognos Transformer)?
2. Any recommendations on the approach or the steps to follow?
Currently there is no option similar to an SSAS cube in Snowflake. Once data is loaded into the database, Snowflake lets you query it much as you would a traditional OLTP database.
For data aggregation, Snowflake offers a rich set of built-in functions. Together with UDFs, stored procedures, and materialized views, we can build custom solutions for precomputed data aggregations.
For data analysis we still have to rely on third-party tools. Snowflake provides a variety of connectors to access its database objects from other analytical tools.
There are plans to introduce an integrated tool for data aggregation, analysis, and reporting in the near future.
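As a minimal sketch of the precomputed-aggregation idea, the snippet below creates a materialized view rollup and then queries it. The table, column, and DSN names are invented for illustration, the SQL is sent from R through DBI/odbc only because that fits the rest of this thread, and materialized views are only available on higher Snowflake editions.

    # Hypothetical sketch: precompute daily sales totals in Snowflake.
    # Assumes an ODBC DSN named "snowflake" and a SALES table; adjust to your setup.
    library(DBI)
    con <- dbConnect(odbc::odbc(), dsn = "snowflake")

    dbExecute(con, "
      CREATE MATERIALIZED VIEW SALES_DAILY_MV AS
      SELECT ORDER_DATE, PRODUCT_ID, SUM(AMOUNT) AS TOTAL_AMOUNT
      FROM SALES
      GROUP BY ORDER_DATE, PRODUCT_ID")

    # Reports then read the precomputed rollup instead of scanning SALES.
    daily <- dbGetQuery(con, "SELECT * FROM SALES_DAILY_MV WHERE ORDER_DATE >= '2023-01-01'")
    dbDisconnect(con)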
Use TM1 to build your OLAP cube, then run Cognos over the top of the TM1 cube.
TM1 will have no problem shaping your Snowflake data into an OLAP structure.
Snowflake is not a multidimensional database; it offers analytical statements such as GROUP BY CUBE, as Oracle does, but the result is more like a matrix of aggregations. There is no drill-down or drill-up of the kind that SSAS cubes, PowerCubes, and other multidimensional databases (MDDBs) offer.
One option could be to simulate OLAP by creating ad hoc aggregations and using JavaScript to drill down/up. But in my experience, drill-like operations often take more than 10 seconds unless extremely large resources are available. Snowflake is probably not the best solution for such use cases.
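To illustrate what GROUP BY CUBE gives you: all aggregation combinations come back in one flat result set, and any drill behaviour has to be built on top of it in the client. A rough sketch with made-up table and column names, issued from R:

    # Hypothetical sketch: all grouping combinations of REGION and PRODUCT at once.
    library(DBI)
    con <- dbConnect(odbc::odbc(), dsn = "snowflake")   # assumed DSN

    cube <- dbGetQuery(con, "
      SELECT REGION, PRODUCT, SUM(AMOUNT) AS TOTAL_AMOUNT
      FROM SALES
      GROUP BY CUBE (REGION, PRODUCT)")

    # Rows where REGION or PRODUCT is NULL are the higher-level subtotals;
    # a UI would filter these client-side to mimic drill up/down.
    head(cube)
    dbDisconnect(con)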
I'm thinking of re-architecting an RDS model to a DynamoDB one, and it appears mostly to be working using a single-table design. We have, however, a log table that can contain 5-10 million rows and is queried on many attributes.
Is there any pattern that might be applicable when migrating to DynamoDB, or is this a case where full scans would be required and we would be better off keeping the log data as a relational table?
Thanks in advance,
Nik
Those keywords and phrases "log" and "queried on many attributes" suggest to me that DynamoDB is not the best solution for your log data. If the number of distinct queries is fairly limited and well known in advance, you might be able to design your keys to fit your access patterns.
For example, if you commonly query on Color and Quantity attributes, you could design a key like COLOR#Red#QTY#25. And you could similarly use local or global secondary indexes for queries involving other attributes.
But it is not a great solution if you have many attributes that you need to query arbitrarily.
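A rough sketch of that key-design idea (table and attribute names are invented; this uses the paws R package, but any AWS SDK works the same way):

    # Hypothetical sketch: a composite key built from the attributes you filter on.
    library(paws)
    ddb <- dynamodb()

    ddb$put_item(
      TableName = "ProductLog",                 # assumed table
      Item = list(
        PK      = list(S = "COLOR#Red#QTY#25"), # partition key encodes the query attributes
        SK      = list(S = "2023-06-01T10:15:00"),
        payload = list(S = "...")
      )
    )

    # "All red items with quantity 25" becomes a single key lookup, not a scan.
    res <- ddb$query(
      TableName = "ProductLog",
      KeyConditionExpression = "PK = :pk",
      ExpressionAttributeValues = list(":pk" = list(S = "COLOR#Red#QTY#25"))
    )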
Alternative Solution: Another serverless option to consider is storing your log data in S3 and using Athena to query it using SQL.
You will likely be trading away a bit of latency and speed by taking this approach compared to RDS and DynamoDB. But queries against log data often don't need millisecond response times, so it can cover a lot of use cases.
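A minimal sketch of that S3 + Athena alternative, again via paws; the database, table, and bucket names are assumptions, and the polling is deliberately crude:

    # Hypothetical sketch: query log files in S3 with Athena instead of DynamoDB.
    library(paws)
    ath <- athena()

    q <- ath$start_query_execution(
      QueryString = "SELECT level, COUNT(*) AS n
                     FROM app_logs
                     WHERE event_date = DATE '2023-06-01'
                     GROUP BY level",
      QueryExecutionContext = list(Database = "logs_db"),                       # assumed Glue database
      ResultConfiguration   = list(OutputLocation = "s3://my-athena-results/")  # assumed bucket
    )

    # Crude wait; a real job would poll get_query_execution() for the status.
    Sys.sleep(5)
    rows <- ath$get_query_results(QueryExecutionId = q$QueryExecutionId)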
Data modelling for DynamoDB
Write down all of your access patterns, in order of priority/most used
Research models which are similar to your use-case
Download NoSQL Workbench and create test models where you can visualize your ideas
Run commands against DynamoDB Local and test that your access patterns are fulfilled.
Access Patterns
Your access patterns will ultimately decide whether DynamoDB suits your needs. If you need to query on multiple fields, you can have up to 20 global secondary indexes, which gives you some flexibility; but if you find yourself needing more than 8-10 indexes, DynamoDB may not be a good choice, or the schema is badly designed.
Use smart designs with sort-key and index-key overloading; this lets you group the data better and makes your access patterns more efficient.
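A small sketch of what sort-key overloading can look like in practice (the entity prefixes, table name, and attributes are made up; paws is used here as in the earlier sketch):

    # Hypothetical sketch: one table holds several entity types per customer,
    # distinguished by a prefix in the overloaded sort key.
    library(paws)
    ddb <- dynamodb()

    # An order item and a log item live under the same partition key.
    ddb$put_item(TableName = "AppData", Item = list(
      PK = list(S = "CUST#42"), SK = list(S = "ORDER#2023-06-01#991")))
    ddb$put_item(TableName = "AppData", Item = list(
      PK = list(S = "CUST#42"), SK = list(S = "LOG#2023-06-01T10:15:00")))

    # One query returns only this customer's log entries.
    logs <- ddb$query(
      TableName = "AppData",
      KeyConditionExpression = "PK = :pk AND begins_with(SK, :prefix)",
      ExpressionAttributeValues = list(
        ":pk"     = list(S = "CUST#42"),
        ":prefix" = list(S = "LOG#")
      )
    )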
Log Data Use-case
Storing log data is a pretty common use-case for DynamoDB, and many AWS customers use it for that sole purpose. But I can't overemphasize the importance of understanding your access patterns and working backwards from them to create your model.
Alternatives
If you require rich query capability or free-text search, you could use DynamoDB's integrations with OpenSearch (via Lambda/EventBridge, for example), with OpenSearch providing the flexibility for your queries.
Doesn't seem like a good use case. I have done it and wasn't at all happy with the result; now I load "log-like" data into Elasticsearch and am much happier with the outcome.
In my case, I insert the data into DynamoDB to archive it, but also feed it into ES; if I ever kill my ES cluster, I can reload all or some of the data from DynamoDB.
I created a multidimensional data model in SAP HANA as a calculation view of type Cube with star join. In this calculation view I only used calculation views of type Dimension, which include the dimension tables and the necessary changes I made to them (e.g. building hierarchies).
I now need to present a conceptual data model with all the dependencies. In PowerDesigner it is possible to reverse engineer physical data models, but when I follow the procedure described by SAP, I get the physical tables as a result, without the connections between them. I imported all calculation views and the necessary tables.
Does this happen because I did not connect the tables themselves, only the views, and is there a way to solve this?
Thank you very much for reading this. :)
SAP PowerDesigner can read SAP HANA information models; see the online help: Calculation Views (HANA).
This allows for impact analysis, i.e. the dependencies on source tables and views are captured.
However, the SAP HANA information views are usually not considered part of a logical data model, as they are rather part of analytical applications.
As for the lack of join conditions in the reverse-engineered data model: if the model is reversed from the database runtime objects, that is, the tables and views currently in the database, then you won't commonly find foreign-key relationships implemented as database constraints.
Instead, SAP products implement the join definition either in the application layer (SAP NetWeaver dictionary) or in the repository via view definitions and CDS associations.
See PowerDesigner and HANA for details on this.
Currently I'm building a Shiny app using several queries to a PostgreSQL database (mainly SELECT and INSERT statements). The application works, but I'm trying to make it faster. When I compare the execution times of the same query using the RPostgreSQL package and a DB client like Postico, it takes 8 times longer with the RPostgreSQL package.
Any ideas for boosting performance or other ways of connecting to a PostgreSQL database from R?
Thanks
Have you ever heard of the package dbplyr (with the b)?
I would recommend it because this package enables your dplyr (with no b) code to be used with SQL databases.
There are many advantages, since the way you interact with your databases shifts from pulling all of the data into R and then analysing it in memory, to pushing the computation to the database and collecting only the results you need.
This contrast is illustrated in a great article entitled "Databases using R" by Edgar Ruiz (2017). You should take a look at it HERE for more details.
The main advantages presented by Mr. Ruiz are, and I quote:
"
1) Run data exploration over all of the data - Instead of coming up with a plan to decide what data to import, we can focus on analyzing the data inside the database, which in turn should yield faster insights.
2) Use the SQL Engine to run the data transformations - We are, in effect, pushing the computation to the database because dplyr is sending SQL queries to the database.
3) Collect a targeted dataset - After becoming familiar with the data and choosing the data points that will either be shared or modeled, a final query can then be used to bring back only that data into memory in R.
4) All your code is in R! - Because we are using dplyr to communicate with the database, there is no need to change language, or tools, to perform the data exploration. "
So, you will probably gain the speed you are looking for with dbplyr/dplyr.
You should give it a try.
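For instance, a minimal sketch of what that looks like; the table, columns, and connection details are placeholders, and the RPostgres DBI backend is used here (RPostgreSQL also works through DBI):

    # Hypothetical sketch: dplyr verbs are translated to SQL and run in PostgreSQL;
    # only the summarised result is pulled back into R.
    library(DBI)
    library(dplyr)
    library(dbplyr)

    con <- dbConnect(RPostgres::Postgres(),
                     dbname = "mydb", host = "localhost",
                     user = "shiny", password = "secret")   # assumed credentials

    sales_db <- tbl(con, "sales")          # a lazy reference, nothing is downloaded yet

    monthly <- sales_db %>%
      filter(year == 2023) %>%
      group_by(month) %>%
      summarise(total = sum(amount, na.rm = TRUE))

    monthly %>% show_query()               # inspect the SQL that will be sent
    result <- monthly %>% collect()        # execute in the database, fetch the result

    dbDisconnect(con)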
You can find more information about it and how to establish the connection with your PostgreSQL Server using the DBI package at:
https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html
and
https://rviews.rstudio.com/2017/05/17/databases-using-r/
For example, let's say I wish to analyze a month's worth of company data for trends. I plan on doing regression analysis and classification using an MLP.
A month's worth of data has ~10 billion data points (rows).
There are 30 dimensions to the data.
12 features are numeric (integer or float; continuous).
The rest are categorical (integer or string).
Currently the data is stored in flat files (CSV) and is processed and delivered in batches. Data analysis is carried out in R.
I want to:
change this to stream processing (rather than batch processing).
offload the computation to a Spark cluster
house the data in a time-series database to facilitate easy read/write and query. In addition, I want the cluster to be able to query data from the database when loading the data into memory.
I have an Apache Kafka system that can publish the feed for the processed input data. I can write a Go module to interface this into the database (via CURL, or a Go API if it exists).
There is already a development Spark cluster available to work with (assume that it can be scaled as necessary, if and when required).
But I'm stuck on the choice of database. There are many solutions (here is a non-exhaustive list) but I'm looking at OpenTSDB, Druid and Axibase Time Series Database.
Other time-series databases I have looked at briefly (InfluxDB, Riak TS, and Prometheus) seem to be optimised more for handling metrics data.
"Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3." - Apache Spark website
In addition, the time-series database should store the data in a fashion that exposes it directly to Spark (as this is time-series data, it should be immutable and therefore satisfy the requirements of an RDD, so it can be loaded natively by Spark into the cluster).
Once the data (or the data with dimensionality reduced by dropping categorical elements) is loaded, regression and classification models can be developed and experimented with using sparklyr (an R interface for Spark) and Spark's machine learning library (MLlib; this cheatsheet provides a quick overview of the functionality), as sketched below.
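A rough sparklyr sketch under those assumptions; the Spark master URL, HDFS path, and column names are placeholders:

    # Hypothetical sketch: load data into Spark and fit an MLlib model via sparklyr.
    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "spark://cluster:7077")   # assumed cluster URL

    events <- spark_read_csv(sc, name = "events", path = "hdfs:///data/events/*.csv")

    # Keep the numeric features for a first pass (categorical columns dropped).
    numeric_events <- events %>% select(y, f1, f2, f3)

    splits <- sdf_random_split(numeric_events, training = 0.8, test = 0.2, seed = 42)

    fit   <- ml_linear_regression(splits$training, y ~ f1 + f2 + f3)
    preds <- ml_predict(fit, splits$test)   # classification would use e.g. ml_logistic_regression()

    spark_disconnect(sc)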
So, my questions:
Does this seem like a reasonable approach to working with big data?
Is my choice of database solutions correct? (I am set on working with columnar store and time-series database, please do not recommend SQL/Relational DBMS)
If you have prior experience working with data analysis on clusters, from both an analytics and systems point of view (as I am doing both), do you have any advice/tips/tricks?
Any help would be greatly appreciated.
How is it that OLAP data access can be faster than OLTP?
OLAP makes data access very quick by using a multidimensional data model.
If you have a huge amount of data and report generation takes extremely long (e.g. several hours), you can use OLAP to prepare the report in advance. Each subsequent request against the already-processed data is then fast.
OLAP is fundamentally for read-only data stores. Classic OLAP is a data warehouse or data mart, and we work with either as an OLAP cube. Conceptually you can think of an OLAP cube as a huge Excel PivotTable: a structure with sides (dimensions) and data intersections (facts) that has NO JOINS.
The data structure is one of the reasons that OLAP is so much faster to query than OLTP. Another reason is the concept of aggregations, which are stored intersections at a level higher than the leaf (bottom). An example would be as follows:
You may load a cube with facts about sales (i.e. how much in dollars, how many in units, etc.) with one row (or fact) for each sales amount, by the following dimensions: time, products, customers, etc. The level at which you load each dimension, for example sales by EACH day and by EACH customer, is the leaf data. Of course you will often want to query aggregated values, that is, sales by MONTH, or by customers in a certain CITY, etc.
Those aggregations can be calculated at query time, or they can be pre-aggregated and stored at cube load. At query time, OLAP cubes use a combination of stored and calculated aggregations. Unlike OLTP indexes, PARTIAL aggregations can be used (for example, a stored month-level aggregate can help answer a quarter-level query).
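To make the aggregation idea concrete, here is a tiny sketch with invented columns of leaf-level facts being rolled up once and then reused, in the same spirit as a cube's stored aggregations:

    # Hypothetical sketch: leaf-level facts (one row per day per customer) are
    # pre-aggregated to month/city once; later queries read the small rollup.
    library(dplyr)

    facts <- tibble::tibble(
      day      = as.Date("2023-06-01") + 0:2,
      customer = c("A", "B", "A"),
      city     = c("Oslo", "Oslo", "Bergen"),
      units    = c(3, 5, 2),
      dollars  = c(30, 50, 20)
    )

    # "Stored aggregation": computed once at load time.
    sales_by_month_city <- facts %>%
      mutate(month = format(day, "%Y-%m")) %>%
      group_by(month, city) %>%
      summarise(units = sum(units), dollars = sum(dollars), .groups = "drop")

    # A query for June sales in Oslo now touches one rollup row instead of all leaf rows.
    filter(sales_by_month_city, month == "2023-06", city == "Oslo")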
In addition to this, most OLAP cubes have extensive caching set up by default and most also allow for very granular cache tuning (pre-loading).
Another consideration is that, relatively recently, in-memory BI (or OLAP) has been offered by more and more vendors. Obviously, if more of the OLAP data is in memory, the resulting queries will be EVEN faster than traditional OLAP. To see an example of an in-memory cube, take a look at my slide deck about SQL Server 2012 BISM.
You need to do some research on what OLAP is and why/when you need to use it. Try starting by searching Google for OLAP, and read this Wikipedia article:
http://en.wikipedia.org/wiki/Online_analytical_processing