Processing dimensions on incremental load - icCube

I'm performing an incremental load into my cube (from a CSV data source), but if some of the incremental data in my fact tables is associated with new dimension members, I'm not able to figure out how to also incrementally load the new dimension-related data. Is it possible to do this?

Yes, it is possible. Just as with the facts, dimensions can be loaded incrementally (see this page). You need to define the incremental load strategy that applies to the tables used to load the dimensions. In case you cannot ensure table consistency (fact rows referencing existing dimension members), have a look at the unresolved rows policy.
Hope that helps.

Related

Complex Synthetic Data - Create manually or use a package/tool?

The data I work with consists of multiple tables (around 5-10), with single tables containing up to 10 million entries. So, overall I'd describe it as a large data set, but not too large to work with on a 'normal' computer. I'm in need of a synthetic data set with the same structure and internal dependencies, i.e. a dummy data set. I can't use the data I work with, as the data contains sensitive information.
I did research on synthetic data and came across different solutions. The first would be online providers where one uploads the original data and synthetic data is created based on the given input. This sounds like a nice solution, but I'd rather not share the original data with any external sources, so this is currently not an option for me.
The second solution I came across is the synthpop package in R. I tried that; however, I encountered two problems: the first being that for larger tables (as the tables in the original data sets are) it takes a very long time to execute. The second being that I only got it working for a single table, whereas I need to keep the dependencies between the tables, otherwise the synthetic data doesn't make any sense.
The third option would be to do the whole data creation myself. I have solid knowledge of the domain and the data, so I would be able to define the internal constraints formally and then write a script to follow them. The problem I see here is that it would obviously be a lot of work, and as I'm no expert in synthetic data creation, I might still overlook something important.
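For illustration only, here is a minimal Python sketch of what such a hand-rolled generator could look like for two linked tables. The table names, columns and distributions are invented; a real script would encode the actual domain constraints instead:

```python
import random
import string

random.seed(42)  # reproducible dummy data


def random_name(length=8):
    return "".join(random.choices(string.ascii_lowercase, k=length))


# Parent table: synthetic "customers" (hypothetical structure).
customers = [
    {"customer_id": i, "name": random_name(), "segment": random.choice(["A", "B", "C"])}
    for i in range(1, 1001)
]

# Child table: synthetic "orders" that only ever reference existing customers,
# so the foreign-key dependency between the two tables is preserved.
orders = [
    {
        "order_id": i,
        "customer_id": random.choice(customers)["customer_id"],
        "amount": round(random.lognormvariate(3, 1), 2),  # made-up distribution
    }
    for i in range(1, 10001)
]
```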
So basically I have two questions:
Is there a good package/solution you can recommend (preferably in R, but ultimately the programming language doesn't matter so much) for automatically creating synthetic (and private) data based on original input data consisting of multiple tables?
Would you recommend the manual approach, or would you recommend spending more time on the synthpop package, for example, and trying that approach?

What strategy to manage the EF Core change tracker when a large number of entities are tracked

We are using Entity Framework Core 6 in an ASP.NET Core accounting software application.
A few operations consist of importing a large number of different entities into the database (these are the backup restore process and an XML import from another software). The amount of data in these source files can be quite large (several tens of thousands of entities).
Since the number of entities is too large to handle in a single transaction, we have a batching system that calls "SaveChanges" on the db context every few hundred inserts (otherwise, the final "SaveChanges" simply wouldn't work).
We're running into a performance problem: when the change tracker contains many entities (a few thousand or more), every call to DetectChanges takes a loooooong time (several seconds), and so the whole import process becomes almost exponentially slower as the dataset size grows.
We are experimenting with the possibility of creating new, short-lived contexts to save some of the more numerous entities instead of loading them in the initial db context, but that is a process that is rather hard to code properly: there are many objects that we need to copy (in part or in full) and pass to the calling context to be able to rebuild the data structure properly.
So, I was wondering if there was another approach. Maybe a way to tell the change tracker that a set of entities should be kept around for reference but not be serialized anymore (and, of course, skipped by the change detection process).
Edit: I was asked for a specific business case, so here it is: accounting data is stored per fiscal year.
Each fiscal year contains the data itself but also all the configuration options necessary for the software to work. This data is actually a rather complex set of relationships: accounts contain references to tax templates (to be used when creating entry lines for this account), which themselves contain several references to accounts (for referencing which accounts should be used to create entry lines recording the tax amount). There are many such circular relationships in the model.
The load process therefore needs to load the accounts first and record, for each one, which tax template it references. Then we load the tax templates, fill in the references to the accounts, and then have to process the accounts again to enter the IDs of the newly created taxes.
We're using an ORM because the data model is defined by the class model: saving data directly to the database certainly would be possible, but every time we changed the model, we'd have to manually adjust all those methods as well. I'm trying to limit the number of ways my (small) team can shoot themselves in the foot when improving our model (which is evolving fast), and having a single reference for the data model seems like the way to go.
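For reference, here is a minimal sketch of the batching-plus-clearing pattern described above, written with Python/SQLAlchemy purely to illustrate the shape of the idea rather than the actual EF Core code; in EF Core 5+ the analogous call would be ChangeTracker.Clear(), which detaches everything the context is tracking:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Account(Base):
    __tablename__ = "accounts"
    id = Column(Integer, primary_key=True)
    name = Column(String)


engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

BATCH_SIZE = 500  # hypothetical batch size

with Session(engine) as session:
    for i in range(10_000):
        session.add(Account(name=f"account {i}"))
        if (i + 1) % BATCH_SIZE == 0:
            session.flush()        # push the current batch to the database
            session.expunge_all()  # drop tracked instances so the identity map
                                   # (the "change tracker" here) stays small
    session.commit()
```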

AnalysisServices: Cannot query internal supporting structures for column because they are not processed. Please refresh or recalculate the table

I'm getting the following error when trying to connect Power BI to my tabular model in AS:
AnalysisServices: Cannot query internal supporting structures for column 'table'[column] because they are not processed. Please refresh or recalculate the table 'table'
It is not a calculated column and the connection seems to work fine on the local copy. I would appreciate any help with this!
This would depend on how you are processing the data within your model. If you have just done a Process Data, then the accompanying meta objects such as relationships have not yet been built.
Every column of data that you load needs to also be processed in this way regardless of whether it is a calculated column or not.
This can be achieved by running a Process Recalc on the database, or by loading your tables or table partitions with a Process Full/Process Default (which automatically runs a Process Recalc once the data is loaded) rather than just a Process Data.
If you have a lot of calculated columns and tables that result in a Process Recalc taking a long time, you will need to factor this in to your refreshes and model design.
If you run a Process Recalc on your database or a Process Full/Process Default on your table now, you will no longer have those errors in Power BI.
A more in-depth discussion of this can be found here: http://bifuture.blogspot.com/2017/02/ssas-processing-tabular-model.html
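As a rough sketch, the same Process Recalc can also be issued as a TMSL refresh command against the server. Here it is only assembled as JSON from Python; the database name is a placeholder, and how you submit it (for example through an XMLA/TMSL query window in SSMS) depends on your setup:

```python
import json

# TMSL "refresh" command equivalent to a Process Recalc on the whole database.
# "MyTabularDatabase" is a placeholder name.
recalc_command = {
    "refresh": {
        "type": "calculate",
        "objects": [
            {"database": "MyTabularDatabase"}
        ],
    }
}

# Paste the resulting JSON into a TMSL/XMLA query window, or submit it
# through whatever deployment tooling you already use.
print(json.dumps(recalc_command, indent=2))
```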

How to properly implement different view options of the same data set

I'm currently building a model/view architecture and came across an issue I can't find information on anywhere on the internet.
I have one set of complex data that I want to show to the user in two (or more) different fashions:
full data is shown
only selected (partial) information is shown
The way this data is displayed is, to me, irrelevant, but if it helps, it's either in a table view (basic information) or a column view (full information). Those two classes come from the Qt model/view framework.
Now I have thought about two options to implement this and wonder which one I should use.
Option 1
I build my data structure,
include it in a custom model
specialize (subclass) the view classes in order to only print what I'm interested in.
Option 2
I build my data structure,
specialize my models to only provide access to relevant data
use a standard view to print it on screen.
I would honestly go for option 2, but seeing the number of cases across the internet where option 1 is used, I started to wonder if I'm doing it right. (I never found any example of dual models of the same data, whereas multiple views of one model appear to be quite frequent.)
Placing data-relevant handling inside the view classes seems wrong to me, but duplicating models of the data leads to either duplicated data (which also seems wrong) or shared data (and then the models no longer 'hold' the data).
I also had a look at Qt delegates, but those classes are mostly meant to change the appearance of the data. I didn't find a way, using delegates, to ignore the data that is not relevant for one view.
You are completely right in thinking that it's wrong to use views for filtering data. The only reasons for reimplementing a view are to have a different presentation of the same data or special processing of user events.
So there are two ways to filter out the data:
1. Create two models which share the data. It's a standard, recommended approach not to keep the data in the models themselves.
2. Create one model providing all the data, and a proxy model inheriting from QSortFilterProxyModel to filter it.
You will need to reimplement the filterAcceptsColumn method to filter out columns and filterAcceptsRow to filter out rows.
Then use View-Model to show all the data, or View-Proxy-Model to show only part of it.
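A minimal sketch of the proxy approach (point 2 above), using the Python bindings (PySide6) for brevity; the same classes and virtual methods exist in C++. The source model and the column whitelist below are invented for illustration:

```python
from PySide6.QtCore import (QAbstractTableModel, QModelIndex,
                            QSortFilterProxyModel, Qt)


class FullDataModel(QAbstractTableModel):
    """Single model holding the complete data set."""

    HEADERS = ["Name", "Summary", "Internal details"]

    def __init__(self, rows, parent=None):
        super().__init__(parent)
        self._rows = rows

    def rowCount(self, parent=QModelIndex()):
        return 0 if parent.isValid() else len(self._rows)

    def columnCount(self, parent=QModelIndex()):
        return 0 if parent.isValid() else len(self.HEADERS)

    def data(self, index, role=Qt.DisplayRole):
        if index.isValid() and role == Qt.DisplayRole:
            return self._rows[index.row()][index.column()]
        return None


class PartialProxy(QSortFilterProxyModel):
    """Proxy exposing only a whitelisted subset of the source columns."""

    def __init__(self, visible_columns, parent=None):
        super().__init__(parent)
        self._visible = set(visible_columns)

    def filterAcceptsColumn(self, source_column, source_parent):
        # Hide every column that is not in the whitelist; rows could be
        # filtered the same way by reimplementing filterAcceptsRow.
        return source_column in self._visible


full = FullDataModel([("item a", "short text", "long technical details")])
partial = PartialProxy(visible_columns={0, 1})
partial.setSourceModel(full)
# A view attached to `full` shows all three columns; a view attached to
# `partial` shows only "Name" and "Summary", while the data lives in one place.
```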

Tables with data that will never be deleted or changed

This is a more in depth follow up to a question I asked yesterday about storing historical data ( Storing data in a side table that may change in its main table ) and I'm trying to narrow down my question.
If you have a table that represents a data object at the application level and need that table for historical purposes, is it considered bad practice to set it up so that the information can't be deleted? Basically, I have a table representing safety requirements for a worker, and I want to make it so that these requirements can never be deleted or changed. So if a change needs to be made, a new record is created.
Is this not a good idea? What are the best practices for dealing with data like this? I have a table with historical safety training data, and it points to the table with requirement data (as well as to some other key tables), so I can't let the requirements be changed or the historical table will end up pointing to the wrong information.
Is this not a good idea?
Your scenario sounds perfectly valid to me. If you have historical data that you need to keep, there are various ways of meeting that requirement.
Option 1:
Store all historical data and current data in one table (make sure you store a creation date so you know what's old and what's new). When you need to retrieve the most recent record for someone, just base it on the most recent date that exists in the table.
Option 2:
Store all historical data in a separate table and keep current data in another. This might be beneficial if you're working with millions of records so you don't degrade performance of any applications built on top of it. Either at the time of creating a new record or through some nightly job you can move old data into the other table to keep your current table lightweight.
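As a minimal sketch of Option 1, using Python's built-in sqlite3 with hypothetical table and column names: requirements are only ever inserted, never updated, and the "current" one is simply the most recent row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE safety_requirement (
        id          INTEGER PRIMARY KEY,
        worker_id   INTEGER NOT NULL,
        description TEXT    NOT NULL,
        created_at  TEXT    NOT NULL DEFAULT (datetime('now'))
    )
""")

# A "change" is recorded as a brand-new row; nothing is ever updated or deleted.
conn.execute("INSERT INTO safety_requirement (worker_id, description) "
             "VALUES (1, 'Hard hat required')")
conn.execute("INSERT INTO safety_requirement (worker_id, description) "
             "VALUES (1, 'Hard hat and goggles required')")

# The current requirement is the most recent row for that worker, while older
# rows keep their ids so historical training records can still reference the
# exact version they were based on.
current = conn.execute("""
    SELECT description FROM safety_requirement
    WHERE worker_id = 1
    ORDER BY created_at DESC, id DESC
    LIMIT 1
""").fetchone()
print(current[0])
```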
Here is one alternative, that is not necessarily "better" but is something to keep in mind...
You could have separate "active" and "historical" tables, then create a trigger so whenever a row in the active table is modified or deleted, the old row values are copied to the historical table, together with the timestamp.
This way, the application can work with the active table in a natural way, while the accurate history of changes is automatically generated in the historical table. And since this works at the DBMS level, you'll be more resistant to application bugs.
Of course, things can get much messier if you need to maintain a history of the whole graph of objects (i.e. several tables linked via FOREIGN KEYs). Probably the simplest option is to forgo referential integrity for the historical tables and keep it only for the active tables.
If that's not enough for your project's needs, you'll have to somehow represent a "snapshot" of the whole graph at the moment of change. One way to do it is to treat the connections as versioned objects too. Alternatively, you could just copy all the connections with each version of the endpoint object. Either case will complicate your logic significantly.
