Time dependent Master data via History tables in SAP HANA - hierarchy

I was looking for the best way to capture historical data in HANA for master data tables without the VALID_TO and VALID_FROM fields.
From my understanding, we have two options here.
Option 1: Create a custom history table and run a stored procedure that populates this history table from the original table. Here we compromise the real-time reporting capability on top of this table.
Option 2: Enable the History table flag in SLT for this table so that SLT creates it as a history table, which avoids that problem.
Option 2 looks like a clear winner to me but I would like your thoughts on this as well.
Let me know.
Thanks,
Shyam

You asked for thoughts...
I would not use history tables for modeling time-dependent master data. That's not how history tables work. Think of them as system-versioned temporal tables that use commit IDs for the validity range. There are several posts on this topic in the SAP community.
Most applications I know need application-time validity ranges instead (or sometimes both). Therefore I would rather model the time dependency explicitly using valid from / valid to. This gives you the option, for example, to model temporal joins in calculation views or to query the data using "standard" SQL. ETL tools such as EIM SDI or BODS also have options for populating such time-dependent tables, using special transformations like "table comparison" or "history preserving". Just search the web for "slowly changing dimensions" for the concepts.
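For illustration, a minimal sketch of that explicit modeling in plain SQL (table and column names are made up for the example):
-- Master data with an application-time validity range
CREATE TABLE CUSTOMER_HIST (
    CUSTOMER_ID   INTEGER      NOT NULL,
    CUSTOMER_NAME VARCHAR(100) NOT NULL,
    REGION        VARCHAR(40),
    VALID_FROM    DATE         NOT NULL,
    VALID_TO      DATE         NOT NULL,  -- e.g. '9999-12-31' for the current version
    PRIMARY KEY (CUSTOMER_ID, VALID_FROM)
);

-- "Temporal join": pick the master data version that was valid on the transaction date
SELECT s.SALES_ID, s.SALES_DATE, c.CUSTOMER_NAME, c.REGION
FROM SALES s
JOIN CUSTOMER_HIST c
  ON c.CUSTOMER_ID = s.CUSTOMER_ID
 AND s.SALES_DATE BETWEEN c.VALID_FROM AND c.VALID_TO;
This is roughly what a temporal join in a calculation view expresses for you declaratively.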
In the future, temporal tables as defined in SQL:2011 could be an option as well, but I do not know when those will be available in HANA.

Related

How to keep track of database changes

I'm working with Progress 11.6 appBuilder and procedure editor (and Data Dictionary).
We regularly make modifications to the customer's database. There are two types of modifications:
Modifications of the structure: these are done using the interactive GUI of the Data Dictionary.
Modifications of the data: these are done using the Procedure Editor.
An example of a data modification in the procedure typically looks like this:
FOR EACH Table1 WHERE Table1.Field1 = <value>:
    CREATE Table2.
    Table2.Field1 = <value>.
    Table2.Field2 = <some-other-value>.
END.
This completely contradicts one of the basics of software delivery quality, repeatability: there is no way to return to the previous situation!
Therefore I'm looking for ways to do this in an (automatable) repeatable way, hence my questions:
What can we use instead of the interactive GUI of the Data Dictionary (which has no undo feature) in order to perform/undo database structure modifications?
What can we do in order to undo database data modifications? (Is there something like an Oracle redo log or archive log in Progress?)
In case you are thinking "What are you talking about? You can do 'Undo transaction' in the Data Dictionary", I mean the following:
I perform a transaction using the Data Dictionary, I leave the Data Dictionary, and a day later the customer complains. When I open the Data Dictionary at that moment, the "Undo transaction" feature is disabled.
At a high level you should be creating "df files" (DDL scripts) and applying those to the customer database rather than manually making changes. There are many ways to create those files and you can automate the entire process with the appropriate tooling.
One of the most common ways to create a df file is to create whatever new schema you need in your development database and then use the "create an incremental df" facility in the data dictionary tool. This tool compares the development database schema to the target schema and builds a "df file" (DDL script) of the differences. You could connect directly to the target db for this process or you could have an empty skeleton db that you use for this.
How to create an incremental df file
(If you then reverse the comparison you can also create a reversing df file to undo the changes.)
Most df files consist of additions - new tables, new fields, new indexes. These can all be added online and that can all be completely scripted. And, of course, the individual df files and all of the supporting scripts can (and should) be stored in a repository (like git or whatever).
As for the data change scripts... there's no reason that those programs cannot be written as actual programs and saved in a repository. You can enclose the whole update in a transaction and UNDO it if that is appropriate. For what it is worth, I personally do not think that is a very good idea. Especially when large amounts of data are involved you really don't want to be creating monstrous multi-gigabyte undo logs. You're better off with a second "reversing transaction" script that will roll things back piecemeal. A side benefit is that you can still use that if you decide to back out the change a day or three afterwards.
The really gory details are going to depend on your development process, the customer's change management process, and the tooling available. It kind of sounds like there is not much process or tooling at either end of this relationship, so you probably have a lot of adventures ahead of you!

Tables with data that will never be deleted or changed

This is a more in-depth follow-up to a question I asked yesterday about storing historical data (Storing data in a side table that may change in its main table), and I'm trying to narrow down my question.
If you have a table that represents a data object at the application level and you need that table for historical purposes, is it considered bad practice to set it up so that the information can't be deleted? Basically, I have a table representing safety requirements for a worker, and I want to make it so that these requirements can never be deleted or changed. So if a change needs to be made, a new record is created.
Is this not a good idea? What are the best practices for dealing with data like this? I have a table with historical safety training data, and it points to the table with requirement data (as well as some other key tables), so I can't let the requirements be changed or the historical table will be pointing to the wrong information.
Is this not a good idea?
Your scenario sounds perfectly valid to me. If you have historical data that you need to keep, there are various ways to meet that requirement.
Option 1:
Store all historical data and current data in one table (make sure you store a creation date so you know what's old and what's new). When you need to retrieve the most recent record for someone, just base it on the most recent date that exists in the table.
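For illustration, a minimal sketch of that lookup (table and column names are made up for the example):
-- One table holds every version; the creation date tells old from new
SELECT r.*
FROM SafetyRequirement r
WHERE r.WorkerId = @WorkerId
  AND r.CreatedAt = (SELECT MAX(r2.CreatedAt)
                     FROM SafetyRequirement r2
                     WHERE r2.WorkerId = r.WorkerId);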
Option 2:
Store all historical data in a separate table and keep current data in another. This might be beneficial if you're working with millions of records so you don't degrade performance of any applications built on top of it. Either at the time of creating a new record or through some nightly job you can move old data into the other table to keep your current table lightweight.
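A rough sketch of what such a nightly move could look like in T-SQL (hypothetical table and column names; assumes no concurrent writes while the job runs):
BEGIN TRANSACTION;

-- Copy every superseded version for each worker into the history table...
INSERT INTO SafetyRequirementHistory (RequirementId, WorkerId, Description, CreatedAt)
SELECT r.RequirementId, r.WorkerId, r.Description, r.CreatedAt
FROM SafetyRequirement AS r
WHERE r.CreatedAt < (SELECT MAX(r2.CreatedAt)
                     FROM SafetyRequirement r2
                     WHERE r2.WorkerId = r.WorkerId);

-- ...then remove those same rows from the current table
DELETE r
FROM SafetyRequirement AS r
WHERE r.CreatedAt < (SELECT MAX(r2.CreatedAt)
                     FROM SafetyRequirement r2
                     WHERE r2.WorkerId = r.WorkerId);

COMMIT;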
Here is one alternative that is not necessarily "better" but is something to keep in mind...
You could have separate "active" and "historical" tables, then create a trigger so whenever a row in the active table is modified or deleted, the old row values are copied to the historical table, together with the timestamp.
This way, the application can work with the active table in a natural way, while the accurate history of changes is automatically generated in the historical table. And since this works at the DBMS level, you'll be more resistant to application bugs.
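For illustration, a minimal T-SQL sketch of such a trigger (table and column names are invented for the example):
CREATE TRIGGER trg_SafetyRequirement_History
ON SafetyRequirement
AFTER UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- "deleted" holds the pre-change row values for both UPDATE and DELETE
    INSERT INTO SafetyRequirementHistory (RequirementId, WorkerId, Description, ChangedAt)
    SELECT d.RequirementId, d.WorkerId, d.Description, SYSDATETIME()
    FROM deleted AS d;
END;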
Of course, things can get much messier if you need to maintain a history of the whole graph of objects (i.e. several tables linked via FOREIGN KEYs). Probably the simplest option is to forgo referential integrity for the historical tables and keep it only for the active tables.
If that's not enough for your project's needs, you'll have to somehow represent a "snapshot" of the whole graph at the moment of change. One way to do it is to treat the connections as versioned objects too. Alternatively, you could just copy all the connections with each version of the endpoint object. Either case will complicate your logic significantly.

Microsoft BI Report with Input

I need to create a report produced with Microsoft BI that will display actual data retrieved from a database. But I also want users to be able to fill in forecast data for the next month in a column beside the reported actual data, and have the inputted data loaded into a forecast cube.
Is it possible to do this? What is the right strategy?
Thanks for your input :) !
MS BI products are based on reporting, so they don't have any direct way to interact with the data like you're asking. There are several options that start with an SSRS report, though (or a PowerPivot workbook, etc.).
SharePoint Workflow - allows a lot of additional control over the input process. Pick up the list with an ETL package.
You could also do a similar thing with a simple web app. Link to either one through a report action.
ETL - Make the report exportable to Excel and leave blank columns for user input. Re-absorb it through an ETL process that reads the modified Excel file. This can be a manually triggered job, not necessarily part of the DM/DW nightly ETL.
You could also just have a manual script. I would think and hope that the forecast data isn't changed that often, so a simpler solution would probably be best.

How to set up a data model for a customizable application

I have an ASP.NET data entry application that is used by multiple clients. The application consists of multiple data entry modules that are common to all clients.
I now have multiple clients that want their own custom module added which will typically consist of a dozen or so data points. Some values will be text, others numeric, some will be dropdown selections, etc.
I'm in need of suggestions for handling the data model for this. I have two thoughts on how to handle it. The first would be to create a new table for each new module for each client. This is pretty clean, but I don't particularly like it. My other thought is to have one table with columns for each custom data point for each client. This table would end up with a lot of columns and a lot of NULL values. I don't really like either solution and suspect there's a better way to do this, so any feedback you have will be appreciated.
I'm using SQL Server 2008.
As always with these questions, "it depends".
The dreaded key-value table.
This approach relies on a table which lists the fields and their values as individual records.
CustomFields(clientId int, fieldName sysname, fieldValue varbinary)
Benefits:
Infinitely flexible
Easy to implement
Easy to index
Non-existing values take no space
Disadvantage:
Showing a list of all records with complete field list is a very dirty query
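To illustrate that point, flattening the key-value rows back into one row per client takes something like this (field names are made up; the varbinary values also need casting back to their real types):
-- One row per client, with each field pivoted back into its own column
SELECT cf.clientId,
       MAX(CASE WHEN cf.fieldName = 'FavoriteColor' THEN CAST(cf.fieldValue AS varchar(50)) END) AS FavoriteColor,
       MAX(CASE WHEN cf.fieldName = 'ShoeSize'      THEN CAST(cf.fieldValue AS varchar(50)) END) AS ShoeSize
FROM CustomFields cf
GROUP BY cf.clientId;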
The Microsoft way
The Microsoft way of handling this kind of problem is "sparse columns" (introduced in SQL Server 2008).
Benefits:
Blessed by the people who design SQL Server
records can be queried without having to apply fancy pivots
Fields without data don't take space on disk
Disadvantages:
Many technical restrictions
A new field requires DDL
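A minimal sketch of the sparse-column approach (illustrative names), including the DDL that adding a new field requires:
CREATE TABLE ClientCustomData (
    clientId      int NOT NULL PRIMARY KEY,
    FavoriteColor varchar(50)  SPARSE NULL,
    ShoeSize      decimal(4,1) SPARSE NULL
);

-- Adding a field for a new client means altering the table (the DDL drawback above)
ALTER TABLE ClientCustomData ADD PreferredLanguage varchar(20) SPARSE NULL;

-- Plain queries work, no pivoting needed
SELECT clientId, FavoriteColor
FROM ClientCustomData
WHERE FavoriteColor IS NOT NULL;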
The xml tax
You can add an xml field to the table which will be used to store all the "extra" fields.
Benefits:
unlimited flexibility
can be indexed
storage efficient (when it fits in a page)
With some xpath gymnastics the fields can be included in a flat recordset.
schema can be enforced with schema collections
Disadvantages:
not clearly visible what's in the field
xquery support in SQL Server has gaps which makes getting your data a real nightmare sometimes
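A small sketch of the xml approach and the XQuery needed to read the values back (names are made up for the example):
CREATE TABLE ClientEntry (
    entryId     int IDENTITY(1,1) PRIMARY KEY,
    clientId    int NOT NULL,
    ExtraFields xml NULL
);

INSERT INTO ClientEntry (clientId, ExtraFields)
VALUES (42, '<fields><FavoriteColor>green</FavoriteColor><ShoeSize>9.5</ShoeSize></fields>');

-- Pulling the "extra" fields into a flat recordset
SELECT clientId,
       ExtraFields.value('(/fields/FavoriteColor)[1]', 'varchar(50)')  AS FavoriteColor,
       ExtraFields.value('(/fields/ShoeSize)[1]',      'decimal(4,1)') AS ShoeSize
FROM ClientEntry;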
There are maybe more solutions, but to me these are the main contenders. Which one to choose:
Key-value seems appropriate when the number of extra fields is limited (say, no more than 10-20 or so).
Sparse columns are more suitable for data with many properties which are filled out infrequently. They sound more appropriate when you can have many extra fields.
The xml column is very flexible, but a pain to query. It is appropriate for solutions that write rarely and query rarely, i.e. don't run aggregates etc. on the data stored in this field.
I'd suggest you go with the first option you described. I wouldn't overthink it. The second option you outlined would be a bad idea in my opinion.
If there are fields common to all the modules you're adding to the system you should consider keeping those in a single table then have other tables with the fields specific to a particular module related back to the primary key in the common table. This is basically table inheritance (http://www.sqlteam.com/article/implementing-table-inheritance-in-sql-server) and will centralize the common module data and make it easier to query across modules.
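For illustration, a rough sketch of that table-inheritance layout (module and column names are hypothetical):
-- Common fields shared by every module entry
CREATE TABLE ModuleEntry (
    entryId    int IDENTITY(1,1) PRIMARY KEY,
    clientId   int         NOT NULL,
    moduleType varchar(30) NOT NULL,
    createdAt  datetime2   NOT NULL DEFAULT SYSDATETIME()
);

-- Fields specific to one client's custom module, keyed back to the common table
CREATE TABLE SafetyAuditEntry (
    entryId       int NOT NULL PRIMARY KEY
        REFERENCES ModuleEntry (entryId),
    inspectorName varchar(100),
    passed        bit,
    score         decimal(5,2)
);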

Simulate records in database without entering any

I've nearly finished the development of a project and would like to test its performance, especially the database query calls. I'm using Linq to SQL to search via usernames, but I've only got around 10 'users' in my database, so I can't really get a decent speed reading. How can I simulate thousands/millions of users in the database without actually creating new records? I've read about Selenium, but it seems that is good for repeat actions (simulating concurrent users?). Are there any other tools I should look into, or are there any options in VS 2008 (Professional Edition)?
Thanks
You can "trick" SQL Server into thinking there are more records than there actually are in a table using the approach outlined in this article. See the section on False SQL Server Statistics
e.g.
UPDATE STATISTICS TableName WITH ROWCOUNT=100000
will create statistics for the table as if it had 100,000 rows in it. You can then see what effect this has on the execution plan. But note that this is undocumented functionality, so it may give quirky behaviour.
You could also just populate your table with sample data. There are various tools available to help out with that, like Red Gate's SQL Data Generator. I prefer actually having large data volumes, as I think that will be more accurate.
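If you do populate real rows, one quick way to generate a large number of throwaway records without extra tooling is a cross join over a system catalog view, for example (assuming a simple Users table with a UserName column):
-- Insert ~1,000,000 fake users by cross-joining a catalog view with itself
INSERT INTO Users (UserName)
SELECT TOP (1000000) 'user_' + CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS varchar(10))
FROM sys.all_objects a
CROSS JOIN sys.all_objects b;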
