I am new to Informatica data integration. We are building a common data layer to ingest data from multiple sources (RDBMS and file storage) into a target DB.
We intend to ingest only the common entities and their respective columns/attributes. For example, I have a product table that is available in all the sources, but with a different name and column count in each. Please refer to the table below.
In the above table, I have two data sources containing product information. However, their structures are not the same: the column names differ from source to source, and so does the column count.
Now I am trying to standardize this and would like to build a generic data pipeline in Informatica Data Integration Hub.
I have explored their Schema Drift option, but that is only available in Mass Ingestion.
They also have a Java Activity block for doing the column mapping manually, but the problem is that the number of data sources will grow in the future, and that would mean ongoing manual intervention.
Any suggestions?
We are using Entity Framework Core 6 in an ASP.NET Core accounting software application.
A few operations consist of importing a large number of different entities into the database (a backup restore process and an XML import from another piece of software). The amount of data in these source files can be quite large (several tens of thousands of entities).
Since the number of entities is too large to handle in a single transaction, we have a batching system that calls "SaveChanges" on the db context every few hundred inserts (otherwise, the final "SaveChanges" simply wouldn't work).
We're running into a performance problem: when the change tracker contains many entities (a few thousand or more), every call to DetectChanges takes a very long time (several seconds), and so the whole import process becomes almost exponentially slower as the dataset size grows.
We are experimenting with creating new, short-lived contexts to save some of the more numerous entities instead of loading them into the initial db context, but that is rather hard to code properly: there are many objects that we need to copy (in part or in full) and pass back to the calling context in order to rebuild the data structure properly.
So, I was wondering if there is another approach. Maybe a way to tell the change tracker that a set of entities should be kept around for reference but not saved anymore (and, of course, skipped by the change detection process)?
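For reference, here is a minimal sketch of the kind of batched insert loop described above, with the two change-tracking knobs that seem to matter (AutoDetectChangesEnabled and, on EF Core 5+, ChangeTracker.Clear()). The names and batch size are made up, not our real code:

```csharp
using System.Collections.Generic;
using Microsoft.EntityFrameworkCore;

public static class BatchedImport
{
    // A sketch only: insert a large set of entities in fixed-size batches,
    // keeping the change tracker small so DetectChanges stays cheap.
    public static void InsertInBatches<TEntity>(DbContext context,
                                                IEnumerable<TEntity> entities,
                                                int batchSize = 500)
        where TEntity : class
    {
        // Add() sets the state to Added explicitly, so the automatic
        // DetectChanges calls are not needed for plain inserts.
        context.ChangeTracker.AutoDetectChangesEnabled = false;

        var pending = 0;
        foreach (var entity in entities)
        {
            context.Set<TEntity>().Add(entity);
            if (++pending % batchSize == 0)
            {
                context.SaveChanges();
                // EF Core 5+: detach everything that was just saved so the
                // tracker does not keep growing. This runs into exactly the
                // trade-off described above: detached entities can no longer
                // be used for later reference fix-up.
                context.ChangeTracker.Clear();
            }
        }

        context.SaveChanges();
        context.ChangeTracker.AutoDetectChangesEnabled = true;
    }
}
```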
Edit: I was asked for a specific business case, so here it is: accounting data is stored per fiscal year.
Each fiscal year contains the data itself but also all the configuration options necessary for the software to work. This data is actually a rather complex set of relationships: accounts contain references to tax templates (to be used when creating entry lines for that account), which themselves contain several references to accounts (indicating which accounts should be used to create the entry lines that record the tax amount). There are many such circular relationships in the model.
The load process therefore needs to load the accounts first and record, for each one, which tax template it references. Then we load the tax templates, fill in their references to the accounts, and then have to process the accounts again to enter the IDs of the newly created taxes.
We're using an ORM because the data model is defined by the class model: saving data directly to the database would certainly be possible, but every time we changed the model we would have to manually adjust all of those methods as well. I'm trying to limit the number of ways my (small) team can shoot themselves in the foot when improving our model (which is evolving fast), and having a single reference for the data model seems like the way to go.
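To make those circular references a bit more concrete, here is a rough sketch of the load order, using hypothetical Account/TaxTemplate shapes rather than the real model:

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.EntityFrameworkCore;

// Hypothetical stand-ins for the real model: an account that defaults to a
// tax template, and a tax template that posts to a specific account.
public class Account
{
    public int Id { get; set; }
    public string Number { get; set; } = "";
    public TaxTemplate? DefaultTaxTemplate { get; set; }
}

public class TaxTemplate
{
    public int Id { get; set; }
    public string Code { get; set; } = "";
    public Account? TaxAccount { get; set; }
}

public class AccountingContext : DbContext
{
    public AccountingContext(DbContextOptions<AccountingContext> options) : base(options) { }
    public DbSet<Account> Accounts => Set<Account>();
    public DbSet<TaxTemplate> TaxTemplates => Set<TaxTemplate>();
}

// Raw rows as they come out of the backup/XML file (illustrative only).
public record ImportedAccount(string Number, string TaxTemplateCode);
public record ImportedTaxTemplate(string Code, string TaxAccountNumber);

public static class FiscalYearLoader
{
    public static void Load(AccountingContext context,
                            IReadOnlyList<ImportedAccount> sourceAccounts,
                            IReadOnlyList<ImportedTaxTemplate> sourceTemplates)
    {
        // Pass 1: create the accounts, remembering which template code each referenced.
        var pendingTemplateCodes = new Dictionary<Account, string>();
        foreach (var src in sourceAccounts)
        {
            var account = new Account { Number = src.Number };
            context.Accounts.Add(account);
            pendingTemplateCodes[account] = src.TaxTemplateCode;
        }
        context.SaveChanges();

        // Pass 2: create the tax templates, wiring their account references.
        var accountsByNumber = context.Accounts.Local.ToDictionary(a => a.Number);
        foreach (var src in sourceTemplates)
        {
            context.TaxTemplates.Add(new TaxTemplate
            {
                Code = src.Code,
                TaxAccount = accountsByNumber[src.TaxAccountNumber]
            });
        }
        context.SaveChanges();

        // Pass 3: revisit the accounts and fill in the now-persisted templates.
        var templatesByCode = context.TaxTemplates.Local.ToDictionary(t => t.Code);
        foreach (var pair in pendingTemplateCodes)
            pair.Key.DefaultTaxTemplate = templatesByCode[pair.Value];
        context.SaveChanges();
    }
}
```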
Context: We store historical data in Azure Data Lake as versioned parquet files from our existing Databricks pipeline where we write to different Delta tables. One particular log source is about 18 GB a day in parquet. I have read through the documentation and executed some queries using Kusto.Explorer on the external table I have defined for that log source. In the query summary window of Kusto.Explorer I see that I download the entire folder when I search it, even when using the project operator. The only exception to that seems to be when I use the take operator.
Question: Is it possible to prune columns to reduce the amount of data being fetched from external storage? Whether during external table creation or using an operator at query time.
Background: The reason I ask is that in Databricks it is possible to use a SELECT statement to fetch only the columns I'm interested in. This reduces the query time significantly.
As David wrote above, the optimization does happen on the Kusto side, but there is a bug with the "Downloaded Size" metric: it presents the total data size, regardless of the selected columns. We'll fix it. Thanks for reporting.
I have multiple flat files (CSV), each with multiple records, and the files will be received in random order. I have to combine their records using unique ID fields.
How can I combine them if there is no unique field common to all the files and I don't know which file will be received first?
Here are some example files:
In reality there are 16 files.
There are many more fields and records than in this example.
I would avoid trying to do this purely in XSLT/BizTalk orchestrations/C# code. These are fairly simple flat files. Load them into SQL, and create a view to join your data up.
You can still use BizTalk to pick up/load the files. You can also still use BizTalk to execute the view or stored procedure that joins the data up and sends your final message.
There are a few questions that might help guide how this would work here:
When do you want to join the data together? What triggers that (a time of day, a certain number of messages received, a certain type of message, a particular record, etc)? How will BizTalk know when it's received enough/the right data to join?
What does a canonical version of this data look like? Does all of the data from all of these files truly get correlated into one entity (e.g. a "Trade" or a "Transfer" etc.)?
I'd probably start by defining my canonical entity, and then work towards getting a "complete" picture of that canonical entity by using SQL for this kind of case.
I am new to data warehousing and am currently working on this project.
Is there any way to insert new transactional data into an existing cube? With tools, or with an MDX query maybe?
MDX is usually just a read-only language.
With an OLAP cube you have two options to change the data:
UPDATE/INSERT to the underlying SQL data mart yourself, and then rebuild the cube
Use something called WRITEBACK where you set numbers directly in the cube, and it decides how to save these back to the data mart (which is tricky if you set a number at the top level, and it has to decide how to split that value up between all the members down to the bottom level)
Usually there is an ETL (Extract, Transform, Load) tool like Pentaho (open source) or Informatica that populates a data warehouse. The data warehouse itself may use a regular database, and a product like Mondrian is used to hold the data in cubes. Jasper Server, for example, has Mondrian packaged with it. Data from the transactional system is populated into the data warehouse, and then the cube is 'refreshed'. There may be other possible approaches.
I am thinking about ways to manage a table naming issue for an application I'm planning.
Currently there is a table that contains raw materials (approximately 150 rows), and each row has about 20 columns, mainly float-value numbers used for calculations. For reference, let's call this table 'materials'.
As these number columns are periodically updated, and some users may also wish to create custom versions, I will create a management utility that allows copying/editing of this materials table but, crucially, will only allow "save as" under a new table name so that data is never destroyed.
Thus, using an automated routine, I'll create 'materials_1', 'materials_2' and so on. When a calculation is done in another part of the website, I will store the name of the materials table that was used for the calculation, so that future audits can trace where a result came from.
Now... is it possible for me to do this in an Entity Framework / unit of work / repository way (like John Papa's codecamper example) without having to continually edit the data context and add the newly created table names? I can't see how I can pass a variable containing the specific table name into the repository interface, given that it will always be a 'materials' entity even though the physical table name varies.
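One shape I can imagine (only a sketch, not the codecamper pattern itself): keep a single Material entity for every physical copy and let the repository accept the table name, dropping down to raw SQL for the versioned tables. Everything below is hypothetical, and since a table name cannot be parameterized it has to be validated against a known list first:

```csharp
using System;
using System.Collections.Generic;
using System.Data.Entity;   // EF6-style, to match the codecamper-era stack
using System.Linq;

// One Material POCO shared by every physical materials_N table (hypothetical shape).
public class Material
{
    public int Id { get; set; }
    public string Name { get; set; }
    public double Density { get; set; }   // ...plus the other float-value columns
}

public interface IMaterialsRepository
{
    IList<Material> GetAll(string tableName);
}

public class MaterialsRepository : IMaterialsRepository
{
    private readonly DbContext _context;

    // A table name cannot be passed as a SQL parameter, so it must be checked
    // against a known list (in the real app this would come from wherever the
    // automated routine records which materials_N copies exist).
    private static readonly HashSet<string> KnownTables =
        new HashSet<string> { "materials", "materials_1", "materials_2" };

    public MaterialsRepository(DbContext context)
    {
        _context = context;
    }

    public IList<Material> GetAll(string tableName)
    {
        if (!KnownTables.Contains(tableName))
            throw new ArgumentException("Unknown materials table", nameof(tableName));

        // SqlQuery materializes rows into the shared Material POCO no matter
        // which physical copy is being read; results are not change-tracked.
        return _context.Database
                       .SqlQuery<Material>("SELECT * FROM [" + tableName + "]")
                       .ToList();
    }
}
```

The obvious downside is that these results are read-only as far as the change tracker is concerned, so edits and the "save as" copy would still need either the mapped 'materials' table or more raw SQL.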