Handling multiple data file formats (JSON, XML, CSV) in one pipeline

Data arrives in various file formats in a single object storage bucket. Should this be handled with one single pipeline? What is the best practice?

It will depend on whether your requirements include joining or merging data from different formats.
Say you have multiple sources, each reading data in one file format, and you then want to apply a Flatten to merge the resulting PCollections and run aggregations; in that case you need a single pipeline (see the sketch after the references below).
You can also check [1], [2], [3].
There is also [4], which shows how Beam SQL converts a text file into Rows.
[1] https://beam.apache.org/documentation/pipelines/design-your-pipeline/#multiple-sources
[2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java
[3] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java
[4] https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/text/TextTable.java#L68
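For illustration, here is a minimal sketch of that shape: one source per format, merged with a Flatten, then aggregated. It uses the Python SDK (the references above are to the Java SDK), and the bucket paths, field names, and CSV/JSON layout are assumptions, not part of the original question.

```python
import csv
import json

import apache_beam as beam


def parse_csv(line):
    # Assumed CSV layout: user_id,amount
    user_id, amount = next(csv.reader([line]))
    return {"user_id": user_id, "amount": float(amount)}


def parse_json(line):
    # Assumed newline-delimited JSON with the same fields.
    record = json.loads(line)
    return {"user_id": record["user_id"], "amount": float(record["amount"])}


with beam.Pipeline() as p:
    csv_rows = (
        p
        | "ReadCSV" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv",
                                            skip_header_lines=1)
        | "ParseCSV" >> beam.Map(parse_csv)
    )
    json_rows = (
        p
        | "ReadJSON" >> beam.io.ReadFromText("gs://my-bucket/input/*.json")
        | "ParseJSON" >> beam.Map(parse_json)
    )

    # One branch per format, merged into a single PCollection, then aggregated.
    totals = (
        (csv_rows, json_rows)
        | "Merge" >> beam.Flatten()
        | "KeyByUser" >> beam.Map(lambda r: (r["user_id"], r["amount"]))
        | "SumPerUser" >> beam.CombinePerKey(sum)
    )

    totals | "Write" >> beam.io.WriteToText("gs://my-bucket/output/totals")
```

An XML source would follow the same pattern, with its own read-and-parse branch feeding the same Flatten.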

Related

How to convert data types from a CSV before loading into DynamoDB

I want to load a CSV file into DynamoDB but I can't find a way to specify the type for each column of my file.
Take the following data from my CSV file:
"discarded","query","uuid","range_key"
false,"How can I help you?","h094dfd9e-a604-4187-99ff--mmxk","log#en#MISMATCH#2021-04-30T12:00:00.000Z"
The discarded column should be considered as a BOOL but DynamoDB imports it as a String.
Is there any way I can specify a type before importing the CSV, or should I process the data with a script to handle the conversions myself?
AWS does not currently provide any tools to simplify this kind of operation other than the REST API.
However, Dynobase, a third-party application developed to make managing DynamoDB easier, allows you to import/export data in CSV and JSON formats.
Its import tool lets you select each attribute's type before insertion.
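If you do go the script route mentioned in the question, the conversion itself is small, since the boto3 DynamoDB resource API serializes native Python types (a Python bool becomes a DynamoDB BOOL, a str becomes S). A minimal sketch; the table name and the assumption that only "discarded" needs converting are illustrative.

```python
import csv

import boto3

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical table name

with open("data.csv", newline="") as f, table.batch_writer() as batch:
    for row in csv.DictReader(f):
        # csv returns every value as a string; convert before writing so
        # DynamoDB stores a BOOL instead of the string "false"/"true".
        row["discarded"] = row["discarded"].strip().lower() == "true"
        batch.put_item(Item=row)
```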

Parquet: concatenate or split two schemas

I have two CSV files. The first one has first_name and last_name, and the second has email and phone. The two files are linked by line index (they have the same number of records). I need to save all the data in Parquet format.
First option: combine the two schemas into one and save everything in a single Parquet file.
Second option: save the two schemas separately (as two Parquet files).
Given my use case, there is a high probability I will take the second option (two files). In the end I need to query the data with various tools, most often Presto.
Question 1: is it possible to pull data from two Parquet files (let's say select first_name, email)?
Question 2: will there be a difference in run times?
I have run some tests, but cannot come to an accurate conclusion...
You can pull data from those two tables, but you need some join key to combine the records. If there isn't one, you might have to use row_number(), assuming the data are in the same order in both tables. Data size also matters here.
In the big data world, a denormalized format is the recommendation if you have to join those tables very frequently in your queries; that approach will give you better performance. Both routes are sketched below.
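As a rough sketch of both routes with pandas (pyarrow is assumed to be installed for the Parquet writes; the file names are made up from the question):

```python
import pandas as pd

names_df = pd.read_csv("names.csv")        # first_name, last_name
contacts_df = pd.read_csv("contacts.csv")  # email, phone

# Option 1: denormalize into a single Parquet file (the recommendation above
# if the columns are usually queried together).
pd.concat([names_df, contacts_df], axis=1).to_parquet("people.parquet", index=False)

# Option 2: keep two Parquet files, but materialize the line index as a real
# column so Presto can join on it instead of relying on row order.
names_df.rename_axis("row_id").reset_index().to_parquet("names.parquet", index=False)
contacts_df.rename_axis("row_id").reset_index().to_parquet("contacts.parquet", index=False)
```

With the second option, Presto has no notion of joining two tables by implicit row position, so an explicit row_id column (or some other join key) is what lets you select first_name and email together across the two files.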

Writing and appending to a compound table in an HDF5 file in Julia

How can I write and append data to a compound table in an HDF5 file with a column of variable-length strings and other columns of various standard types (Int64, Float64, Bool)?
The basics exist in Julia in one form or another: HDF5.jl uses the HDF5 Group's C interface, and JLD2.jl writes custom HDF5 files implemented entirely in Julia, but I haven't found a way of creating, writing to, and appending to such a compound table yet.
My goal is to have a file which stores data from a number of instruments which is clearly annotated. As more data comes in it will periodically be appended to these HDF5 files. A binary file is needed to keep the files to a manageable size and a common standard is needed for portability amongst the programming languages used in our group. Databases aren't practical for our use case.
I think you will be able to do what you want, but the best way forward for you is probably to read the HDF5 User's Guide to see how HDF5 works. The question posted here is so broad that it's akin to asking "How do I store data in a relational database?"
A few things to point you in the right direction, though:
HDF5 is not a relational database and even though PyTables maps a tabular interface onto HDF5, it is semantically incorrect to refer to HDF5 tables and columns. Instead, there are HDF5 datasets which can store elements of a particular type. Those elements can be of compound type, which are roughly equivalent to C structs, and have 1 or more fields, which are also of particular types.
If you want to store heterogeneous data, you are probably going to want multiple datasets. If you have multiple instruments, particularly if they have different data rates, I would probably store their data in different datasets. You could probably also use a giant dataset with a huge compound type that stores all your data for all your instruments but that would almost certainly be awkward, have worse performance, and not compress as well.
I would avoid HDF5 variable-length types. They are awkward to use, the data can't be compressed, and performance is poor since they break locality (the dataset just stores a reference to a separate file location, where the real data are kept). Instead, think about flattening your data, either by concatenating the data and storing a separate dataset of indexes into the start/end points of the concat dataset, or by storing a fixed number of data points that is big enough to hold typical data. For example, if you need to store strings that are never going to be more than 100 characters long, just make an n x 100 dataset and compression will probably handle the extra nulls. Most people who think they need HDF5 variable-length types do not, in fact, need variable-length types. In fact, I would say that a majority of people who are new to HDF5 and inquire about vlen types actually just need extendable datasets.
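To make the "fixed-length strings plus extendable dataset" idea concrete, here is a rough sketch with h5py (Python); the question is about Julia, but the underlying HDF5 concepts (a compound datatype with a fixed-length string field, a chunked dataset with an unlimited dimension, append by resize-then-write) should translate to HDF5.jl, which wraps the same C library. The field names and sizes are made up for illustration.

```python
import h5py
import numpy as np

# Compound element type: a fixed 100-byte string instead of a variable-length one.
row_dtype = np.dtype([
    ("label", "S100"),
    ("count", np.int64),
    ("value", np.float64),
    ("valid", np.bool_),
])

with h5py.File("instruments.h5", "a") as f:
    if "readings" not in f:
        # Chunked dataset with an unlimited first dimension so it can grow,
        # and gzip compression to absorb the padding nulls in the strings.
        f.create_dataset("readings", shape=(0,), maxshape=(None,),
                         dtype=row_dtype, chunks=(1024,), compression="gzip")

    ds = f["readings"]
    new_rows = np.array([(b"sensor-a", 1, 3.14, True),
                         (b"sensor-b", 2, 2.72, False)], dtype=row_dtype)

    # Append: grow the dataset, then write into the newly added tail region.
    start = ds.shape[0]
    ds.resize((start + len(new_rows),))
    ds[start:] = new_rows
```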
The HDF5 User's Guide is located here:
https://support.hdfgroup.org/HDF5/doc/index.html

Combine file records with repeated IDs in BizTalk

I have another case for which I can't figure out a solution in BizTalk.
I have these two flat files (in reality there are 9 files to combine), and the output must be as shown in the picture:
How can I combine files whose ID repeats several times in the main file?
In the picture below, the main file is "People". Is there a way to do this without writing any code in BizTalk, or must I store this data in a SQL database and then join it with a stored procedure?
Can you help me lay out the steps I need to take? I know how to combine files, but only without the repeated IDs.

How to combine multiple files in BizTalk?

I have multiple flat files (CSV), each with multiple records, and the files will be received in random order. I have to combine their records using unique ID fields.
How can I combine them if there is no common unique field across all files, and I don't know which one will be received first?
Here are some example files:
In reality there are 16 files.
The fields and records are much more numerous than in this example.
I would avoid trying to do this purely in XSLT/BizTalk orchestrations/C# code. These are fairly simple flat files. Load them into SQL, and create a view to join your data up.
You can still use BizTalk to pickup/load the files. You can also still use BizTalk to execute the view or procedure that joins the data up and sends your final message.
There are a few questions that might help guide how this would work here:
When do you want to join the data together? What triggers that (a time of day, a certain number of messages received, a certain type of message, a particular record, etc)? How will BizTalk know when it's received enough/the right data to join?
What does a canonical version of this data look like? Does all of the data from all of these files truly get correlated into one entity (e.g. a "Trade" or a "Transfer" etc.)?
I'd probably start by defining my canonical entity, and then work toward building a "complete" picture of that entity, using SQL for this kind of case.
