Unable to create individual Delta tables from Delta-format snappy.parquet files

I have multiple Parquet files in a storage account and converted them all into Delta format. Now I need to save the result as an individual Delta table for each file.
df = spark.read.option("mergeSchema", "true") \
    .format("parquet").load("/mnt/testadls/*.parquet")
df.write.format("delta").save("/mnt/testadls/delta")
This dataframe is written out as multiple snappy.parquet files (the Delta data files).
Now, when I try to create a separate Delta table from an individual snappy.parquet file, I am not able to do it; I get the partition error below:
A partition path fragment should be the form like part1=foo/part2=bar.
%sql
create table deltatable using delta location '/mnt/testadls/delta/part-001-pid-5372710096-b67676465-b62f-45b5-a5c9-51626727-6264-1-c000.snappy.parquet'
delta file name example = part-001-pid-5372710096-b67676465-b62f-45b5-a5c9-51626727-6264-1-c000.snappy.parquet

A Delta table != a Parquet file. You cannot read a single Parquet file as a Delta table. A Delta table = Parquet data files + the _delta_log directory. If you save all of the data into one Delta table, there will be only one Delta table, and you cannot read each Parquet file separately as its own table.
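If the goal really is one Delta table per original source file, each source file has to be written out to its own Delta location (each getting its own _delta_log), and a table can then be registered on top of each location. A rough PySpark sketch of that idea, assuming a Databricks environment where dbutils is available; the target paths and table names below are made up for illustration:
# list the original source Parquet files (not the part files inside the Delta folder)
src_files = [f.path for f in dbutils.fs.ls("/mnt/testadls/") if f.path.endswith(".parquet")]
for i, src in enumerate(src_files):
    target = f"/mnt/testadls/delta_tables/table_{i}"  # illustrative per-file Delta location
    # each write creates a separate Delta table with its own _delta_log
    spark.read.format("parquet").load(src) \
        .write.format("delta").mode("overwrite").save(target)
    # register a metastore table on top of that location
    spark.sql(f"CREATE TABLE IF NOT EXISTS deltatable_{i} USING DELTA LOCATION '{target}'")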

Related

Convert the time columns to DolphinDB TIME when importing live market data

When importing a CSV file containing live market data into DolphinDB, how do I convert the time columns to the DolphinDB TIME type? For example, the CSV file contains a time value 093000000, which is parsed as a string if I import the file with loadText. How can I convert it to 09:30:00.000, the TIME data type?
If you have imported the data into DolphinDB, use temporalParse to convert strings to temporal types in DolphinDB.
temporalParse( "093000000", "HHmmssSSS")
If you haven't yet imported the CSV file, use loadText. Use the schema parameter to specify the time format of the temporal column.
Check the code sample below:
schema = extractTextSchema("yourFilePath");
update schema set type="TIME" where name = "yourTimeColumnName"  // modify the data type of the specified column
update schema set format=""  // add a format column to the schema variable
update schema set format="HHmmssSSS" where name = "yourTimeColumnName"  // set the time format of the specified TIME column
loadText("yourFilePath",,schema)

I want to use READ_NOS to read a file from S3 and get all rows returned, but it only returns some rows

I want to use READ_NOS to read a file from S3 and get all rows returned, but it only returns some rows.
I created a foreign table for the Parquet file, but this is the result: https://imgur.com/a/E0KLNJT
Using Studio gives the same result: https://imgur.com/a/d8UP9uH
How do I get all of the rows returned?
The first SQL statement (COUNT(*)) shows the number of records; the second one shows the number of Parquet files. So on average each file holds 6.470 records.
There is a Teradata Orange Book dedicated to the use of NOS, with some background as well as example SQL. Chapter 5 of it focuses on Parquet files.
It looks like RETURNTYPE ('NOSREAD_PARQUET_SCHEMA') is important in the combination of READ_NOS and Parquet.

Is it possible to import a CSV file to an existing table without the headers being included?

I'm trying to import a CSV file to a table that is empty but already exists in an SQLite database. For example:
sqlite> CREATE TABLE data (...);
sqlite> .mode csv
sqlite> .import mydata.csv data
I have created the table in advance because I'd like to specify a primary key, data types, and foreign key constraints. This process works as expected, but it unfortunately includes the header row from the CSV file in the table.
Here's what I've learned from the SQLite docs regarding CSV imports:
There are two cases to consider: (1) Table "tab1" does not previously exist and (2) table "tab1" does already exist.
In the first case, when the table does not previously exist, the table is automatically created and the content of the first row of the input CSV file is used to determine the name of all the columns in the table. In other words, if the table does not previously exist, the first row of the CSV file is interpreted to be column names and the actual data starts on the second row of the CSV file.
For the second case, when the table already exists, every row of the CSV file, including the first row, is assumed to be actual content. If the CSV file contains an initial row of column labels, that row will be read as data and inserted into the table. To avoid this, make sure that table does not previously exist.
So basically, I get extra data because I've created the table in advance. Is there a flag to change this behavior? If not, what's the best workaround?
The sqlite3 command-line shell has no such flag.
If you have a sufficiently advanced OS, you can use an external tool to split off the first line:
sqlite> .import "|tail -n +2 mydata.csv" data
You can also use the --skip 1 option with .import, as documented on the sqlite3 website and in this SO answer. So you can use the following command:
.import --csv --skip 1 mydata.csv data
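If you are driving the import from a script instead of the shell, the same header-skipping idea works with Python's built-in sqlite3 and csv modules. A minimal sketch, assuming the example table data and file mydata.csv from above and a three-column table (adjust the placeholder count to your schema):
import csv
import sqlite3

conn = sqlite3.connect("mydatabase.db")  # assumed database file name
with open("mydata.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    # the number of ? placeholders must match the number of columns in the table
    conn.executemany("INSERT INTO data VALUES (?, ?, ?)", reader)
conn.commit()
conn.close()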

U-SQL: How can I get the current filename being processed to add to my extract output?

I need to add metadata about the row being processed; the filename needs to be added as a column. I looked at the ambulance demos in the Git repo but can't figure out how to implement this.
You can use the U-SQL features called 'file sets' and 'virtual columns'. In my simple example, I have two files in my input directory; I use a file set and refer to the virtual columns in the EXTRACT statement, e.g.:
// File set with virtual columns
@q =
    EXTRACT rowId int,
            filename string,
            extension string
    FROM "/input/filesets example/{filename}.{extension}"
    USING Extractors.Tsv();

@output =
    SELECT filename,
           extension,
           COUNT(*) AS records
    FROM @q
    GROUP BY filename,
             extension;

OUTPUT @output
TO "/output/output.csv"
USING Outputters.Csv();
Read more about both features here:
https://msdn.microsoft.com/en-us/library/azure/mt621320.aspx

Unix file fetch by timestamp

I have a list of files that get added to my work stream. They are CSV files with a datetime stamp indicating when they were created. I need to pick up each file in the order of the datetime in the file name and process it. Here is a sample of the list I get:
Workprocess_2016_11_11T02_00_12.csv
Workprocess_2016_11_11T06_50_45.csv
Workprocess_2016_11_11T10_06_18.csv
Workprocess_2016_11_11T14_23_00.csv
How would I compare the files to find the oldest one and work toward the chronologically newest file? The files are all dumped on the same day, so I can only go by the timestamp in the file name.
The nice property of that date-time format is that it sorts the same lexically and chronologically. So all you need is:
for file in *.csv; do
    mv "$file" xyz    # glob expansion is already sorted lexically, i.e. chronologically here
    process xyz
done
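If you would rather make the ordering explicit than rely on the shell's glob expansion, the same idea works as a short Python sketch (process_file stands in for whatever your work stream actually does with each file):
from pathlib import Path

def process_file(path):
    # placeholder for the real processing step
    print("processing", path.name)

# the timestamp in the name sorts the same lexically and chronologically,
# so a plain sort yields oldest-to-newest order
for path in sorted(Path(".").glob("Workprocess_*.csv")):
    process_file(path)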
