I wrote my own UDF to load files into Pig. It works well for loading text files; however, I now also need to be able to read .gz files. I know I could unzip the files and then process them, but I want to read the .gz files directly, without unzipping.
My UDF extends LoadFunc, and my custom input format MyInputFile extends TextInputFormat. I also implemented MyRecordReader. I'm wondering whether extending TextInputFormat is the problem? I tried FileInputFormat as well and still cannot read the file. Has anyone written a UDF that reads data from .gz files before?
TextInputFormat handles gzip files as well. Have a look at its RecordReader's (LineRecordReader's) initialize() method, where the proper CompressionCodec is set up. Also note that gzip files aren't splittable (even if they are located on S3), so you might need to use either a splittable format (e.g. LZO) or uncompressed data to get the desired level of parallelism.
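As a quick sanity check that the codec handling works on your file, you can let the hadoop CLI decompress it for you (the path here is hypothetical):
hadoop fs -text /data/data.gz | head
hadoop fs -text recognizes the gzip extension and prints the decompressed content, so if this works, the input format side should be fine too.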
If your gzipped data is stored locally, you can uncompress and copy it to HDFS in one step as described here. Or, if it's already on HDFS,
hadoop fs -cat /data/data.gz | gzip -d | hadoop fs -put - /data/data.txt
would be more convenient.
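For the local case, a minimal sketch of the same one-step idea (assuming the gzipped file is in your current directory; the target path is illustrative):
gzip -dc data.gz | hadoop fs -put - /data/data.txt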
I'm currently trying to unzip some specific files within an ARR file. This ARR file is inside a tar.gz file.
Is it possible to unzip these files without an intermediate step, in a one-liner? It's important that the outer tar.gz is not unpacked first.
Thanks!
You can try something like:
gzip -dc input_file.tar.gz | tar xf - path/to/file/you/want/to/extract
This decompresses and untars the archive in memory, and has the advantage of running faster.
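If you don't know the exact path inside the archive, you can list the contents first without unpacking anything (same hypothetical file name as above):
gzip -dc input_file.tar.gz | tar tf -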
I have to convert an .xls or .xlsx file to a .csv file without using plugins or tools, using Unix commands.
Is there any way to do this?
I tried the approach below, but it's not working:
Change the character set of the .xls file to UTF-8 encoding.
Then create the file again with the extension changed:
cp temp.xls temp.csv
It is possible, but you need to realise that an *.xlsx file is a zipped directory structure (just unzip such a file, using Winzip or 7-Zip; the older binary *.xls format is a different beast). The unzipping can also be done using UNIX commands.
But what then? The directory structure is quite complicated to understand, and creating a script or program which can do this (without using any external tools) is a tremendous amount of work, so I'd propose that you either use external tools anyway, or make sure the files you receive are already in CSV format.
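If you want to see that structure for yourself, a quick sketch using the standard unzip tool (report.xlsx is a hypothetical file name; the inner path is the usual Office Open XML layout):
unzip -l report.xlsx
unzip -p report.xlsx xl/worksheets/sheet1.xml | head -c 300
The first command lists the zipped directory structure; the second dumps the raw XML of the first worksheet, which is what any converter would have to parse.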
I have a couple of .csv files in C:\Users\USER_NAME\Documents which are more than 2 GB in size. I want to use Apache Spark to read the data from them in R. I am using Microsoft R Open 3.3.1 with Spark 2.0.1.
I am stuck on reading the .csv files with the function spark_read_csv(...) defined in the sparklyr package. It is asking for a file path which starts with file://. I want to know the proper file path for my case, starting with file:// and ending with the file name, which is in the ...\Documents directory.
I had a similar problem. In my case it was necessary to put the .csv file into HDFS before reading it with spark_read_csv.
I think you probably have a similar problem.
If your cluster is also running HDFS, you need to use:
hdfs dfs -put
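For example (the local and HDFS paths here are hypothetical):
hdfs dfs -mkdir -p /user/USER_NAME
hdfs dfs -put /home/USER_NAME/Documents/data.csv /user/USER_NAME/data.csv
Then point spark_read_csv at the HDFS location instead of a file:// path.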
Best,
Felix
I have a number of gzipped files located in a single folder, e.g.:
file1.gz
file2.gz
file3.gz
file4.gz
I'm looking for a way of automatically unzipping these using a batch job into a similarly named folder structure, so that, for example, the contents of file1.gz end up in a folder named file1.
I have been told that 7zip would address my issue but can't figure out how to go about it.
Any help is greatly appreciated.
Which OS are you using? This is something you'd do using the shell's capabilities. You could write:
for A in *.gz ; do gunzip "$A" ; done
I'm using gunzip here because .gz is actually gzip. But you can use the 7-Zip CLI tool as well, of course. If you're on Windows, then I recommend installing a real shell (the standard cmd.exe cannot really be considered a shell, IMHO).
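If you need each archive's contents in its own similarly named folder, as asked, a sketch along these lines should do it (gunzip -c writes to stdout and leaves the original .gz untouched):
for A in *.gz ; do
  d="${A%.gz}"
  mkdir -p "$d"
  gunzip -c "$A" > "$d/$d"
done
So file1.gz ends up as file1/file1, and so on.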
I need to regularly send a collection of log files that can grow quite large, so I would like to send only the last n lines of each of the files.
for example:
/usr/local/data_store1/file.txt (500 lines)
/usr/local/data_store2/file.txt (800 lines)
Given a file with a list of needed files named files.txt, I would like to create an archive (tar or zip) with the last 100 lines of each of those files.
I can do this by creating a separate directory structure with the tail-ed files, but that seems like a waste of resources when there's probably some piping magic that can accomplish it. The full directory structure must also be preserved, since files can have the same names in different directories.
I would like the solution to be a shell script if possible, but Perl (without added modules) is also acceptable (this is for Solaris machines that don't have ruby/python/etc. installed on them).
You could try
tail -n 10 your_file.txt | while read -r line; do zip /tmp/a.zip "$line"; done
where a.zip is the zip file and 10 is n, or
tail -n 10 your_file.txt | xargs tar -czvf test.tar.gz --
for a tar.gz.
You are focusing on a specific implementation instead of looking at the bigger picture.
If the final goal is to have an exact copy of the files on the target machine while minimizing the amount of data transferred, what you should use is rsync, which automatically sends only the parts of the files that have changed, and can also compress while sending and decompress while receiving.
Running rsync doesn't require any daemons on the target machine beyond the standard sshd, and to set up automatic transfers without passwords you just need to use public key authentication.
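A minimal sketch, assuming SSH access to the target machine (host and paths are hypothetical):
rsync -az /usr/local/data_store1/ user@target:/backups/data_store1/
-a preserves the directory structure and file attributes, and -z compresses the data on the wire.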
There is no piping magic for that; you will have to create the folder structure you want and zip that.
mkdir tmp
for i in /usr/local/*/file.txt; do
  mkdir -p "$(dirname "tmp/${i:1}")"
  tail -n 100 "$i" > "tmp/${i:1}"
done
zip -r zipfile tmp/*
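If the list of files comes from files.txt, as in the question, the same idea driven by that list would look something like this:
mkdir tmp
while IFS= read -r f; do
  mkdir -p "tmp/$(dirname "${f#/}")"
  tail -n 100 "$f" > "tmp/${f#/}"
done < files.txt
zip -r zipfile tmp/*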
Use logrotate.
Have a look inside /etc/logrotate.d for examples.
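A minimal sketch for the example paths from the question, to drop into /etc/logrotate.d/ (the values are illustrative):
/usr/local/data_store1/file.txt /usr/local/data_store2/file.txt {
    weekly
    rotate 4
    compress
    missingok
}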
Why not put your log files in SCM?
Your receiver creates a repository on their machine, from which they retrieve the files by checking them out.
You send the files just by committing them. Only the diff will be transmitted.
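A minimal sketch of that workflow using git (host names and paths are hypothetical):
# sender: put the logs under version control once
cd /usr/local
git init
git add data_store1/file.txt data_store2/file.txt
git commit -m "log snapshot"
# receiver: clone once, then pull to receive only the diffs
git clone user@sender:/usr/local logs
git -C logs pull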