xdmp:document-load XQuery Command - xquery

When I ingest a CSV file containing multiple XML records using mlcp, I use an options file so that MarkLogic outputs multiple XML documents instead of a single CSV document. How do I script this with the xdmp:document-load command in Query Console?

I don't think xdmp:document-load provides an option for that. Instead, use xdmp:document-get, split with XPath, then xdmp:document-insert.
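For instance, a minimal sketch of that approach in Query Console might look like the following, assuming a plain CSV on the server's file system where each data row should become one XML document; the file path, header handling, URIs, and element names are all placeholders, and if the rows actually hold XML fragments you would parse them with xdmp:unquote and split with XPath instead:

xquery version "1.0-ml";

(: read the file from the server's file system as text :)
let $raw := xdmp:document-get("/tmp/input.csv",
  <options xmlns="xdmp:document-get"><format>text</format></options>)
let $lines := fn:tokenize(fn:string($raw), "\r?\n")[fn:normalize-space(.)]
(: skip the header row and insert one document per data row :)
for $line at $i in fn:subsequence($lines, 2)
(: naive comma split; does not handle quoted commas :)
let $fields := fn:tokenize($line, ",")
return xdmp:document-insert(
  fn:concat("/records/", $i, ".xml"),
  <record>
    <id>{$fields[1]}</id>
    <name>{$fields[2]}</name>
  </record>)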

Related

How to drop a delimiter in a Hive table

I download the file using the download.file function in R
Save it locally
Move it to HDFS
My Hive table is mapped to this file's location in HDFS. Now, this file has , as the delimiter/separator, which I cannot change using the download.file function from R.
There is a field that has , in its content. How do I tell Hive to drop this delimiter if it is found anywhere within the field contents?
I understand we can change the delimiter, but is there really a way to drop it, as Sqoop allows us to do?
My R script is as simple as
download.file()
hdfs.put
Is there a workaround?

How to use read.csv to read only those lines that match some regular expression?

I want to read a large file using read.csv in R. One way to fetch lines matching some pattern is to read all lines into a data frame first and then filter out only the required lines. The problem with this approach is that the file is too large, and all of the data may not fit in memory on some machines. So is there any way to use grep or something similar along with read.csv to fetch only the few lines that are of interest?
You can't use read.table and its derivatives for this purpose. You can, however, use readLines to read in data in chunks and apply your regular expression to each element, which corresponds to a line.
Another alternative would be to use a database-like framework. The sqldf package can read a CSV file into a SQL database, and you can then use an SQL query to read only the desired lines.

Hadoop InputFormat for Excel

I need to create a MapReduce program that reads an Excel file from HDFS, does some analysis on it, and then stores the output as an Excel file. I know that TextInputFormat is used to read a .txt file from HDFS, but which method or which InputFormat should I use?
Generally, Hadoop is overkill for this scenario, but here are some relevant solutions:
1. Parse the file externally and convert it to a Hadoop-compatible format.
2. Read the complete file as a single record (see this answer).
3. Use two chained jobs: the first, as in option 2, reads the file in bulk and emits each record as input for the next job.

Run Saxon XQuery over a batch of XML files and produce one output file for each input file

How do I run XQuery using Saxon HE 9.5 on a directory of files using the built-in command line? I want to take one file as input and produce one file as output.
This sounds very obvious, but I can't figure it out without using Saxon extensions that are only available in PE and EE.
I can read in the files in a directory using fn:collection() or using input parameters. But then I can only produce one output file.
To keep things simple, let's say I have a directory "input" with my files 01.xml, 02.xml, ... 99.xml. Then I have an "output" directory where I want to produce the files with the same names -- 01.xml, 02.xml, ... 99.xml.
Any ideas?
My real data set is large enough (tens of thousands of files) that I don't want to fire up the JVM for every file, so writing a shell script that calls the Saxon command line once per file is out of the question.
If there are no built-in command-line options, I may just write my own quick Java class.
The capability to produce multiple output files from a single query is not present in the XQuery language (only in XSLT), and the capability to process a batch of input files is not present in Saxon's XQuery command line (only in the XSLT command line).
You could call a single-document query repeatedly from Ant, XProc, or xmlsh (or of course from Java), or you could write the code in XSLT instead.
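For example, the per-file query might be no more than the sketch below, where the driver (an Ant or xmlsh loop, an XProc pipeline, or a small Java program using Saxon's s9api) binds each input document as the context item and writes the result to the matching output file, e.g. java net.sf.saxon.Query -q:per-file.xq -s:input/01.xml -o:output/01.xml; the element names are placeholders:

(: per-file query: the driver supplies one input document as the context item
   and redirects the serialized result to the corresponding output file :)
<results>{
  for $rec in /*/record   (: placeholder path; adapt to the real document structure :)
  return <summary id="{ $rec/@id }">{ fn:string($rec/title) }</summary>
}</results>

Keeping the loop in the driver rather than in the query side-steps both the missing multi-output capability in XQuery and the per-file JVM start-up, since Ant, xmlsh, and s9api can all reuse a single JVM.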

Export From Teradata Table to CSV

Is it possible to transfer the data from a Teradata table into a .csv file directly?
The problem is that my table has more than 18 million rows.
If yes, please tell me the process.
For a table of that size I would suggest using the FastExport utility. It does not natively support a CSV export, but you can mimic the behavior.
Teradata SQL Assistant will export to a CSV but it would not be appropriate to use with a table of that size.
BTEQ is another alternative that may be acceptable for a one-time dump of the table.
Do you have access to any of these?
It's actually possible to change the delimiter of exported text files within Teradata SQL Assistant, without needing any separate applications:
Go to Tools > Options > Export/Import. From there, you can change the 'Use this delimiter between columns' option from {Tab} to ','.
You might also want to set the 'Enclose column data in' option to 'Double Quote', so that any commas in the data itself don't upset the file structure.
From there, you use the regular text export: File > Export Results, run the query, and select one of the Delimited Text types.
Then you can just use your operating system to manually change the file extension from .txt to .csv.
These instructions are from SQL Assistant version 16.20.0.7.
I use the following code to export data from a Teradata table into a .csv file directly.
CREATE EXTERNAL TABLE database_name.table_name   -- table to be created
SAMEAS database_name.table_name                  -- existing table whose data is to be exported
USING (DATAOBJECT ('C:\Data\file_name.csv')
       DELIMITER '|' REMOTESOURCE 'ODBC');
You can use the FastExport utility from Teradata Studio to export the table in CSV format. You can define the delimiter as well.
Very simple.
The basic idea is to first export the table as a TXT file and then convert the TXT to CSV using R: read.table() followed by write.csv().
Below are the steps to export the TD table as a TXT file:
Select the export option from the File menu
Select all records from the table you want to export
Save it as a TXT file
Then use R to convert the TXT file to CSV (set the working directory to the location where you saved your big TXT file):
my_table<-read.table("File_name.txt", fill = TRUE, header = TRUE)
write.csv(my_table,file = "File_name.csv")
This worked for a table of 15 million records. Hope it helps.
