I have another case where I can't find a solution with BizTalk.
I have these two flat files (in reality there are 9 files to combine), and the output must look like the one shown in the picture:
How can I combine files whose IDs repeat several times in the main file?
In the picture below, the main file is "People". Is there a way to do this without writing any code in BizTalk, or must I store the data in a SQL database and then join it with a stored procedure?
Can you help me lay out the steps I need to take? I know how to combine files together, but only when the IDs are not repeated.
Related
I have some data contained in a CSV file. I need to access that information efficiently and want to import it into my existing database.
I am wondering if I can make a pre-loaded database with the tables I need and then build the rest of the database on top of it (or make a second separate connection), or load the database from the CSV files on first startup.
What would be the preferred method, and either way, how can I achieve it efficiently?
P.S. Two of the files are about 1000 lines long and 2 columns wide, which seems fairly small... and the others really shouldn't be more than 10 lines long and 6-7 columns wide.
Edit: I realised I have a bunch of tables that need to be updated yearly, so any approach that puts the user's input data at risk is unacceptable, which means using the existing DB is not an option...
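Since the question doesn't name a platform, here is a rough sketch of the load-on-first-startup option in R with SQLite; the database file, table, and CSV names are all hypothetical. Keeping the seeded lookup tables separate from anything the user writes to also makes the yearly refresh safe: drop and re-seed them without touching user data.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "app.db")  # hypothetical DB file

# Seed the static lookup table from CSV only when it is missing,
# i.e. on first startup (or after being dropped for a yearly refresh).
if (!dbExistsTable(con, "lookup_table")) {
  dbWriteTable(con, "lookup_table", read.csv("lookup_table.csv"))
}

dbDisconnect(con)
```

At roughly 1000 rows per file the import is quick either way, so the choice is less about speed than about keeping the seeded tables isolated from user input.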
I have two CSV files. In the first one I have: first_name, last_name and in the second I have: email, phone. The two files connect by line index (same number of records). I need to save all data in parquet format.
First option: merge the two schemas into one and save everything in one parquet file.
Second option: save the two schemas separately (as two parquet files).
Given my use case, there is a high probability I'll take the second option (two files). In the end I need to query the data using various tools, most often Presto.
Question 1: is it possible to pull data from the two parquet files in a single query (say, select first_name, email)?
Question 2: will there be a difference in run times?
I have run some tests, but cannot come to an accurate conclusion...
You can pull data from those two tables, but you need some join key to combine the records. If there isn't one, you might have to use row_number(), assuming the data are in the same order in both tables. Data size also matters here.
In the big data world, a denormalized format is the recommendation if you have to join those tables very frequently in your queries. This approach will give you better performance.
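As an illustration of that row-order join, here is a minimal R sketch using the arrow and dplyr packages; the file names are hypothetical, and it assumes both files really do keep their records in the same order. In Presto itself the equivalent would be a SQL join on a row_number() computed over each table.

```r
library(arrow)
library(dplyr)

# Hypothetical file names; columns as described in the question.
names_df    <- read_parquet("names.parquet")     # first_name, last_name
contacts_df <- read_parquet("contacts.parquet")  # email, phone

# There is no shared key, so synthesize one from the row order.
names_df$row_id    <- seq_len(nrow(names_df))
contacts_df$row_id <- seq_len(nrow(contacts_df))

combined <- inner_join(names_df, contacts_df, by = "row_id") %>%
  select(first_name, last_name, email, phone)

# The denormalized option: one wide file, no join at query time.
write_parquet(combined, "combined.parquet")
```

Writing the joined result back out, as in the last line, is exactly the denormalized layout recommended above when the join happens in most queries.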
I have two R data frames. For example, orders and customers. If I write them to file with saveRDS(), they take up a certain amount of space. If I join them, I'll end up with one big data frame. If I save that to file, the file is much larger than the initial two. However, no new data has actually been created. I think R is treating each row as completely unique and independent. If a customer has 10 orders, their info is just repeated 10 times instead of stored as a single entity. Is there a way to optimize this? Is the only option to just save the two tables and join them every time?
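A minimal sketch of the two-file approach, with hypothetical tables sharing a customer_id column: save the normalized tables with saveRDS() and rebuild the join at read time.

```r
# Hypothetical tables: one row per customer, one row per order.
customers <- data.frame(customer_id = 1:3,
                        name        = c("Ann", "Bob", "Cleo"))
orders    <- data.frame(order_id    = 1:6,
                        customer_id = c(1, 1, 1, 2, 2, 3),
                        amount      = c(10, 25, 5, 40, 12, 8))

# Save the normalized tables: each customer's info is stored once.
saveRDS(customers, "customers.rds")
saveRDS(orders,    "orders.rds")

# Rebuild the joined view only when it is needed, at read time.
joined <- merge(readRDS("orders.rds"), readRDS("customers.rds"),
                by = "customer_id")
```

Note that saveRDS() compresses by default, so the repeated customer fields in a saved joined table compress reasonably well, but the two normalized files will usually still be smaller than the joined one.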
I have multiple flat files (CSV), each with multiple records, and the files will be received in random order. I have to combine their records using unique ID fields.
How can I combine them if there is no common unique field across all the files, and I don't know which one will be received first?
Here are some example files:
In reality there are 16 files.
There are many more fields and records than in this example.
I would avoid trying to do this purely in XSLT/BizTalk orchestrations/C# code. These are fairly simple flat files. Load them into SQL, and create a view to join your data up.
You can still use BizTalk to pick up/load the files. You can also still use BizTalk to execute the view or procedure that joins the data up and sends your final message.
There are a few questions that might help guide how this would work here:
When do you want to join the data together? What triggers that (a time of day, a certain number of messages received, a certain type of message, a particular record, etc)? How will BizTalk know when it's received enough/the right data to join?
What does a canonical version of this data look like? Does all of the data from all of these files truly get correlated into one entity (e.g. a "Trade" or a "Transfer" etc.)?
I'd probably start by defining my canonical entity, and then work towards getting a "complete" picture of that entity, using SQL for this kind of case.
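To make the "load into SQL, join with a view" idea concrete, here is a minimal sketch; the DSN, table, and column names are all hypothetical, and the calls are shown through R's DBI/odbc packages only because the thread doesn't fix the tooling. In a BizTalk setup the CREATE VIEW would simply live in your staging database, and BizTalk would execute the view or a wrapping procedure as described above.

```r
library(DBI)

con <- dbConnect(odbc::odbc(), dsn = "StagingDb")  # hypothetical DSN

# One view joins the staging tables; repeated IDs fall out naturally,
# producing one output row per matching pair.
dbExecute(con, "
  CREATE VIEW dbo.CombinedPeople AS
  SELECT p.ID, p.Name, o.OrderNo, o.Amount
  FROM dbo.People AS p
  JOIN dbo.Orders AS o ON o.ID = p.ID
")

combined <- dbGetQuery(con, "SELECT * FROM dbo.CombinedPeople")
dbDisconnect(con)
```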
I have a database that is used to store transactional records: these records are created, and another process picks them up and then removes them. Occasionally this process breaks down and the number of records builds up. I want to set up a (semi-)automated way to monitor things, and as my tool set is limited and I have an R-shaped hammer, this looks like an R-shaped nail problem.
My plan is to write a short R script that will query the database via ODBC, and then write a single record with the datetime, the number of records in the query, and the datetime of the oldest record. I'll then have a separate script that will process the data file and produce some reports.
What's the best way to create my data file? At the moment my options are:
Load a data frame, add the record, and then resave it
Append a row to a text file (i.e. a CSV file)
Any alternatives, or a recommendation?
I would be tempted by the second option because, from a semantic point of view, you don't need the old entries in order to write the new ones, so there is no reason to reload all the data each time. Doing so would consume more time and resources.
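As a minimal sketch of the append approach, with a hypothetical DSN, query, and column names:

```r
library(DBI)

con <- dbConnect(odbc::odbc(), dsn = "TransactionsDb")  # hypothetical DSN

# One snapshot: how many records are waiting, and how old is the oldest.
snapshot <- dbGetQuery(con, "
  SELECT COUNT(*) AS n_records, MIN(created_at) AS oldest
  FROM transactions
")
dbDisconnect(con)

row <- data.frame(checked_at = Sys.time(),
                  n_records  = snapshot$n_records,
                  oldest     = snapshot$oldest)

log_file <- "queue_monitor.csv"
if (!file.exists(log_file)) {
  write.csv(row, log_file, row.names = FALSE)        # first run: header
} else {
  write.table(row, log_file, sep = ",", append = TRUE,
              row.names = FALSE, col.names = FALSE)  # later runs: append
}
```

Each run writes only the new line, so the cost stays constant as the log grows; the reporting script can then read the whole CSV separately.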