In the documentation for partitioning: http://www.iccube.com/support/documentation/user_guide/reference/partitioning_edition.php
it says that partitioning is only supported for DB tables. Is it possible to have partitioning for flat file data sources?
thanks,
John
Flat files do not support table partitioning (see the edit below).
Note that table partitioning allows for:
1) faster reads, as several partitions can be read at the same time
2) the internal cube data being partitioned
With flat files, 1) does not make much sense, and 2) can be achieved with "level" partitioning, I guess.
You can contact icCube for more details and/or to request this feature.
EDIT: This feature will be supported in an upcoming version of icCube.
I am trying to reconstruct a Cognos Transformer cube in Snowflake.
1. Do we have an option to build an OLAP cube in Snowflake (like SSAS, Cognos Transformer)?
2. Any recommendations of what the approach should be or steps to be followed?
Currently there is no option similar to an SSAS cube in Snowflake. Once data is loaded into the database, Snowflake allows you to query it in a way similar to traditional OLTP databases.
For data aggregations, Snowflake offers a rich set of built-in functions. Together with UDFs, stored procedures, and materialized views, we can build custom solutions for precomputed data aggregations.
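For illustration, here is a minimal sketch (not the poster's setup) of the materialized-view approach, driven from Python via the snowflake-connector-python package. The connection parameters, table, and column names are placeholders, and Snowflake's restrictions on materialized views still apply to the real definition.

```python
# Sketch only: precompute an aggregate with a materialized view in Snowflake.
# Connection parameters, table and column names are placeholders; note that
# materialized views are not available in every Snowflake edition.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",     # placeholder
    user="my_user",           # placeholder
    password="my_password",   # placeholder
    warehouse="my_wh",
    database="my_db",
    schema="public",
)
cur = conn.cursor()

# Precompute daily sales per product once; Snowflake keeps the view maintained,
# so reports read the aggregate instead of rescanning the base table.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS daily_product_sales AS
    SELECT product_id, sale_date, SUM(amount) AS total_amount
    FROM sales
    GROUP BY product_id, sale_date
""")

cur.execute("SELECT * FROM daily_product_sales WHERE product_id = 42")
print(cur.fetchall())

cur.close()
conn.close()
```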
For data analysis we still have to rely upon third party tools. Snowflake provides a variety of different connectors to access its database objects from other analytical tools.
There are plans in the near future to introduce an integrated tool for data aggregations, analysis, and reporting.
Use TM1 to build your OLAP cube, then run Cognos over the top of the TM1 cube.
TM1 will have no problem shaping your Snowflake data into an OLAP structure.
Snowflake is not a multidimensional database; it offers analytical statements like GROUP BY CUBE, as Oracle does, but the result is more like a matrix of aggregations. There is no drill-down or drill-up of the kind SSAS cubes, PowerCubes, and other multidimensional databases (MDDBs) offer.
An option could be to simulate OLAP by creating ad hoc aggregations and using JavaScript to drill down / drill up. But in my experience, operations equivalent to drilling will often take more than 10 seconds (unless extremely large resources are available). Snowflake is probably not the best solution for such use cases.
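To make the "matrix of aggregations" point concrete, here is a hypothetical GROUP BY CUBE query run from Python; the connection parameters, table, and column names are placeholders.

```python
# Hypothetical GROUP BY CUBE query run via snowflake-connector-python;
# connection parameters, table and column names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user",
                                   password="my_password", warehouse="my_wh",
                                   database="my_db", schema="public")
cur = conn.cursor()
cur.execute("""
    SELECT region, product, SUM(amount) AS total
    FROM sales
    GROUP BY CUBE (region, product)   -- every subtotal combination, incl. the grand total
    ORDER BY region NULLS LAST, product NULLS LAST
""")
for region, product, total in cur.fetchall():
    # NULL in region/product marks a rolled-up (subtotal) row; the result is a
    # flat matrix of aggregates, not a navigable drill-down structure.
    print(region, product, total)
conn.close()
```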
I created a multidimensional data model in SAP HANA as a calculation view of type Cube with star join. In this calculation view I only used calculation views of type Dimension, which include the dimension tables and the necessary changes I made to them (e.g. building hierarchies).
I now need to present a conceptual data model with all the dependencies. In PowerDesigner it is possible to reverse engineer physical data models, but when I follow SAP's description I get the physical tables as a result, without the connections between them. I imported all calculation views and the necessary tables.
Does this happen because I did not connect the tables themselves, only the views, and is there a way to solve this?
Thank you very much for reading this. :)
SAP PowerDesigner can read the SAP HANA information models (see the online help: Calculation Views (HANA)).
This allows for impact analysis, i.e. the dependencies on source tables and views are captured.
However, the SAP HANA information views are usually not considered part of a logical data model, as they are rather part of analytical applications.
As for the lack of join conditions in the reverse engineered data model: if the model is reversed from the database runtime objects, that is, the tables and views currently in the database, then you commonly won't find foreign key constraints implemented as database constraints.
Instead, SAP products implement the join definitions either in the application layer (SAP NetWeaver dictionary) or in the repository via view definitions and CDS associations.
See PowerDesigner and HANA for details on this.
We are moving to AWS EMR/S3 and using R for analysis (the sparklyr library). We have 500 GB of sales data in S3 containing records for multiple products. We want to analyze data for a couple of products and want to read only a subset of the data into EMR.
So far my understanding is that spark_read_csv will pull in all the data. Is there a way in R/Python/Hive to read data only for the products we are interested in?
In short, the format you have chosen (CSV) sits at the wrong end of the efficiency spectrum for this.
Using data that is
1) partitioned by the column of interest (the partitionBy option of the DataFrameWriter, or the right directory structure), or
2) clustered by the column of interest (the bucketBy option of the DataFrameWriter together with a persistent metastore)
can help to narrow the search down to particular partitions in some cases (a partition-pruning sketch follows at the end of this answer). But if filter(product == p1) is highly selective, then you're likely looking at the wrong tool.
Depending on the requirements, a proper database or a data warehouse on Hadoop might be a better choice.
You should also consider choosing a better storage format (like Parquet).
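For illustration, here is a minimal PySpark sketch of the partitioning idea (the question mentions sparklyr, but the same approach applies there). The S3 paths and column names are assumptions, not the poster's actual layout.

```python
# Minimal PySpark sketch: convert the CSV once to Parquet partitioned by product,
# so that later reads filtering on product only touch the matching partitions.
# Paths and column names are assumptions, not the poster's actual layout.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sales-subset").getOrCreate()

# One-off conversion job (scans everything once).
(spark.read
      .option("header", "true")
      .csv("s3://my-bucket/sales-csv/")            # placeholder path
      .write
      .partitionBy("product")                      # one directory per product value
      .parquet("s3://my-bucket/sales-parquet/"))   # placeholder path

# Analysis jobs: the equality filter on the partition column is pushed down, so
# Spark lists and reads only .../product=p1/ instead of the full 500 GB.
subset = (spark.read
               .parquet("s3://my-bucket/sales-parquet/")
               .filter(F.col("product") == "p1"))
subset.groupBy("product").count().show()
```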
I was looking for the best way to capture historical data in HANA for master data tables that do not have VALID_FROM and VALID_TO fields.
From my understanding, we have 2 options here:
1) Create a custom history table and run a stored procedure that populates this history table from the original table. Here we compromise the real-time reporting capability on top of this table.
2) Enable the history table flag in SLT for this table so that SLT creates it as a history table, which solves this problem.
Option 2 looks like a clear winner to me but I would like your thoughts on this as well.
Let me know.
Thanks,
Shyam
You asked for thoughts...
I would not use history tables for modeling time-dependent master data. That's not the way history tables work. Think of them as system-versioned temporal tables using commit IDs for the validity range. There are several posts on this topic in the SAP community.
Most applications I know need application-time validity ranges instead (or sometimes both). Therefore I would rather model the time dependency explicitly using valid-from / valid-to columns. This gives you the opportunity, e.g., to model temporal joins in calculation views or to query the data using "standard" SQL. ETL tools like EIM SDI or BODS also have options for populating such time-dependent tables, using special transformations like "table comparison" or "history preserving". Just search the web for "slowly changing dimensions" for the concepts.
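As a small, hypothetical illustration of the explicit valid-from/valid-to approach, the sketch below resolves the correct master-data version with a standard BETWEEN join. The table and column names are made up, and SQLite is used only to keep the example self-contained; the SQL itself is standard and works the same way on HANA.

```python
# Hypothetical illustration of explicit application-time validity (valid_from/valid_to),
# i.e. a type-2 slowly changing dimension. SQLite is used only to keep the sketch
# self-contained; the SQL itself is standard.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customer_hist (
    customer_id INTEGER,
    region      TEXT,
    valid_from  TEXT,
    valid_to    TEXT           -- '9999-12-31' marks the currently valid version
);
-- Two versions of the same customer: the region changed on 2023-06-01.
INSERT INTO customer_hist VALUES (1, 'EMEA', '2020-01-01', '2023-05-31');
INSERT INTO customer_hist VALUES (1, 'APAC', '2023-06-01', '9999-12-31');

CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_date TEXT, amount REAL);
INSERT INTO orders VALUES (10, 1, '2022-03-15', 100.0);   -- should pick up EMEA
INSERT INTO orders VALUES (11, 1, '2024-01-10', 200.0);   -- should pick up APAC
""")

# Temporal join: pick the dimension version that was valid on the order date.
cur.execute("""
    SELECT o.order_id, c.region, o.amount
    FROM orders o
    JOIN customer_hist c
      ON c.customer_id = o.customer_id
     AND o.order_date BETWEEN c.valid_from AND c.valid_to
""")
print(cur.fetchall())   # [(10, 'EMEA', 100.0), (11, 'APAC', 200.0)]
conn.close()
```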
In the future, temporal tables as defined in SQL:2011 could be an option as well, but I do not know when those will be available in HANA.
We are considering moving reporting away from a transactional database to an offline/reporting database for our Java web-based project, with some ETL (like Kettle) to load full/delta updates into the offline database.
The reasons are obvious: to reduce the load on the transactional DB and to improve reporting performance.
Our outstanding questions are related to designing the offline database, as I do not have much knowledge of OLAP. The requirement is to have reports run by a report engine like Jasper/Pentaho, and to develop analytics and dashboards.
What is the best way to design the offline/reporting database?
1) One big flat table? I am sure this is a very bad idea.
2) Multiple flat tables, i.e. multiple denormalized tables. The idea is to denormalize the related tables and link the denormalized tables together to get detailed reports.
And any idea how we can handle summaries?
3) Star schema, facts and dimensions.
One dumb question here: can we have all the other detail columns in a fact table along with additive measures (summaries or aggregated data)?
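For reference, here is a minimal, hypothetical sketch of option 3 with made-up table and column names, showing where the detail columns and the additive measures usually sit; SQLite is used only to keep the example self-contained.

```python
# A minimal, hypothetical sketch of option 3 (star schema); table and column names
# are made up. SQLite keeps the example self-contained; the layout applies to any RDBMS.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
-- Dimension tables hold the descriptive "detail" columns.
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);

-- The fact table keeps foreign keys to the dimensions plus additive measures;
-- descriptive detail attributes normally live in the dimensions rather than here.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    revenue     REAL
);

INSERT INTO dim_date VALUES (20240101, '2024-01-01', 2024, 1);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware');
INSERT INTO fact_sales VALUES (20240101, 1, 10, 99.90);
""")

# A summary/report query aggregates the additive measures grouped by dimension attributes.
cur.execute("""
    SELECT d.year, p.category, SUM(f.quantity) AS qty, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.year, p.category
""")
print(cur.fetchall())   # [(2024, 'Hardware', 10, 99.9)]
conn.close()
```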
Is there a tool to denormalize data from a set of normalized tables?
Thanks in advance.
Pradeep