Hadoop InputFormat for Excel

I need to create a MapReduce program which reads an Excel file from HDFS, does some analysis on it, and then stores the output as an Excel file. I know that TextInputFormat is used to read a .txt file from HDFS, but which method or which InputFormat should I use?

Generally, Hadoop is overkill for this scenario, but there are some relevant solutions:
1. Parse the file externally and convert it to a Hadoop-compatible format (a minimal sketch follows this list).
2. Read the complete file as a single record (see this answer).
3. Use two chained jobs: the first, as in option 2, reads the file in bulk and emits each record as input for the next job.
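A minimal sketch of option 1 in R (the language used elsewhere on this page), assuming the readxl package is available: convert the workbook to CSV locally, then push the CSV to HDFS where TextInputFormat can read it. The file names and the HDFS path are placeholders, not part of the original question.
library(readxl)
sheet <- read_excel("report.xlsx", sheet = 1)        # hypothetical input workbook
write.csv(sheet, "report.csv", row.names = FALSE)    # Hadoop-compatible format
system("hdfs dfs -put report.csv /data/report.csv")  # push to HDFS via the CLI client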

Related

How to avoid/disable .crc files for writing csv files in sparklyr?

I am writing a Spark data frame to the local file system as a CSV file using the spark_write_csv function. In the output directory, there is one .crc file for each part file.
I am looking for any function or property of Hadoop/Spark that avoids generation of these .crc files.
flights_tbl <- copy_to(sc, flights, "flights")
spark_write_csv(flights_tbl, path = "xxx", mode = "overwrite")
This is the output I get:
.part-00000-365d53be-1946-441a-8e25-84cb009f2f45-c000.csv.crc
part-00000-365d53be-1946-441a-8e25-84cb009f2f45-c000
It is not possible. Checksum files are generated for all Spark data sources and the built-in legacy RDD API, and the behavior is not configurable.
To avoid them completely you'd have to:
Implement your own Hadoop InputFormat, or
Implement your own data source (v1 or v2) which doesn't depend on Hadoop input formats,
and then add sparklyr wrappers to expose it in R.
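For output written to the local file system, one possible cleanup (not part of the answer above, and not a way to disable checksums, just a sketch that removes the hidden .crc files after spark_write_csv finishes) reuses the placeholder path "xxx" from the question:
crc_files <- list.files("xxx", pattern = "\\.crc$", all.files = TRUE,
                        full.names = TRUE, recursive = TRUE)
file.remove(crc_files)   # removes only the checksum files, not the CSV part files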

.wav file length/duration without reading in the file

Is there a way to extract the information about a .wav file's length/duration without having to read in the file in R? I have thousands of those files and it would take a long time if I had to read in every single one to find its duration. Windows File Explorer gives you an option to turn on the Length field so you can see the file duration, but is there a way to extract that information to be able to use it in R?
This is what I tried and would like to avoid doing since reading in tens of thousands of audio files in R will take a long time:
library(tuneR)
audio <- readWave("AudioFile.wav")
round(length(audio@left) / audio@samp.rate, 2)
You can use the readWave function from the tuneR package with header = TRUE. This will read only the header metadata of the file, not the entire file.
library(tuneR)
audio <- readWave("AudioFile.wav", header = TRUE)
round(audio$samples / audio$sample.rate, 2)
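A usage sketch for the batch case in the question: read only the header of every .wav file in a folder and collect the durations in seconds. The directory name is a placeholder.
library(tuneR)
wavs <- list.files("audio_dir", pattern = "\\.wav$", full.names = TRUE)
durations <- sapply(wavs, function(f) {
  h <- readWave(f, header = TRUE)       # header only, the samples are not loaded
  round(h$samples / h$sample.rate, 2)
})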

R Converting large CSV files to HDFS

I am currently using R to carry out analysis.
I have a large number of CSV files, all with the same headers, that I would like to process using R. I had originally read each file sequentially into R and row-bound them together before carrying out the analysis on the combined data.
The number of files that need to be read in is growing and so keeping them all in memory to carry out manipulations to the data is becoming infeasible.
I can combine all of the CSV files together without using R and thus without keeping them in memory. This leaves a huge CSV file. Would converting it to HDFS make sense in order to be able to carry out the relevant analysis? Or would it make more sense to carry out the analysis on each CSV file separately and then combine the results at the end?
I am thinking of perhaps using a distributed file system and a cluster of machines on Amazon to carry out the analysis efficiently.
Looking at rmr here, it converts data to HDFS, but apparently it's not great for really big data. How would one convert the CSV in a way that allows efficient analysis?
You can build a composite CSV file in HDFS. First, create an empty HDFS folder. Then pull each CSV file separately into that folder. In the end, you will be able to treat the folder as a single HDFS file.
In order to pull the files into HDFS, you can either use a for loop in the terminal, the rhdfs package, or load your files in memory and use to.dfs (although I don't recommend the last option); a sketch of the rhdfs route follows below. Remember to strip the header from the files first.
Using rmr2, I advise you to first convert the CSV into the native HDFS format, then perform your analysis on it. You should be able to deal with large data volumes.
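A hedged sketch of the rhdfs route, assuming a working Hadoop client and the rhdfs package; the HADOOP_CMD value and all paths are placeholders to adjust for your installation.
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")   # location of the hadoop binary (assumption)
library(rhdfs)
hdfs.init()
hdfs.mkdir("/user/me/csv_data")              # the empty HDFS folder
for (f in list.files("local_csv_dir", pattern = "\\.csv$", full.names = TRUE)) {
  hdfs.put(f, "/user/me/csv_data")           # push one headerless CSV at a time
}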
HDFS is a file system, not a file format. HDFS actually doesn't handle small files well, as it usually has a default block size of 64MB, which means any file from 1B to 63MB will take 64MB of space.
Hadoop works best on HUGE files! So it would be best for you to concatenate all your small files into one giant file on HDFS, which your Hadoop tools should have an easier time handling.
hdfs dfs -cat myfiles/*.csv | hdfs dfs -put - myfiles_together.csv

run saxon xquery over batch of xml files and produce one output file for each input file

How do I run XQuery using Saxon HE 9.5 on a directory of files using the built-in command line? I want to take one file as input and produce one file as output.
This sounds very obvious, but I can't figure it out without using saxon extensions that are only available in PE and EE.
I can read in the files in a directory using fn:collection() or using input parameters. But then I can only produce one output file.
To keep things simple, let's say I have a directory "input" with my files 01.xml, 02.xml, ... 99.xml. Then I have an "output" directory where I want to produce the files with the same names -- 01.xml, 02.xml, ... 99.xml.
Any ideas?
My real data set is large enough (tens of thousands of files) that I don't want to fire up the JVM for each file, so writing a shell script that calls the Saxon command line once per file is out of the question.
If there are no built-in command-line options, I may just write my own quick java class.
The capability to produce multiple output files from a single query is not present in the XQuery language (only in XSLT), and the capability to process a batch of input files is not present in Saxon's XQuery command line (only in the XSLT command line).
You could call a single-document query repeatedly from Ant, XProc, or xmlsh (or of course from Java), or you could write the code in XSLT instead.

Merging EBCDIC converted files and pdf files into a single file and pushing to mainframes

I have two PDF files and two text files which are converted into EBCDIC format. The two text files act as cover files for the PDF files, containing details like PDF name, number of pages, etc. in a fixed format.
Cover1.det, Firstpdf.pdf, Cover2.det, Secondpdf.pdf
Format of the cover file could be:
Firstpdf.pdf|22|03/31/2012
that is
pdfname|page num|date generated
which is then converted into EBCDIC format.
I want to merge all these files in a single file in the order first text file, first pdf file, second text file, second pdf file.
The idea is to then push this single merged file into mainframes using scp.
1) How do I merge the above-mentioned four files into a single file?
2) Do I need to convert the PDF files to EBCDIC format as well? If yes, how?
3) As far as I know, mainframe files also need record length details during transit. How do I find out the record length of the file if I succeed in merging them into a single file?
I remember reading somewhere that it could be done using put and append in ftp. However since I have to use scp, I am not sure how to achieve this merging.
Thanks for reading.
1) Why not use something like pkzip?
2) I don't think converting the PDF files to EBCDIC is necessary or even possible. The files need to be transferred in binary mode.
3) Using pkzip and scp you will not need the record length.
File merging can easily be achieved with the cat command in Unix using the > and >> redirection/append operators.
Also, if the next file should start on a new line (as was my case), a blank echo can be inserted between the files.
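For completeness, the same byte-for-byte concatenation sketched in R (the language used elsewhere on this page); the output file name is a placeholder, and the four input names come from the question.
parts <- c("Cover1.det", "Firstpdf.pdf", "Cover2.det", "Secondpdf.pdf")
out <- "merged.dat"                      # hypothetical output file name
file.create(out)                         # start with an empty output file
for (p in parts) {
  stopifnot(file.append(out, p))         # append the raw bytes of p to out
}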
