Ingest bzip2 files with Apache Flume - flume-ng

I need to ingest compressed files in bzip2. Is it possible using flume?
I have tried it with spooling directory and BlobDeserializer, but it is unreadable at the sink.
Thanks in advance!

As of now, Flume does not support compressed spool data out of the box. If you want, you can use a custom interceptor to decompress the data on the fly.
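As a rough illustration, an interceptor along these lines could do the decompression. This is only a minimal sketch: the class name Bzip2DecompressInterceptor is made up, it assumes each event body is one complete bzip2 stream (e.g. one BLOB per file from BlobDeserializer), and it needs Apache Commons Compress on the Flume classpath.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;

import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Hypothetical interceptor that replaces each bzip2-compressed event body
// with its decompressed contents.
public class Bzip2DecompressInterceptor implements Interceptor {

    @Override
    public void initialize() {
        // no state to set up
    }

    @Override
    public Event intercept(Event event) {
        try (BZip2CompressorInputStream in =
                 new BZip2CompressorInputStream(new ByteArrayInputStream(event.getBody()))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            event.setBody(out.toByteArray());
        } catch (IOException e) {
            // leave the body untouched if it is not valid bzip2 data
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event e : events) {
            intercept(e);
        }
        return events;
    }

    @Override
    public void close() {
        // nothing to release
    }

    // Builder referenced from the agent configuration.
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new Bzip2DecompressInterceptor();
        }

        @Override
        public void configure(Context context) {
            // no configuration options in this sketch
        }
    }
}

It would then be wired in with something like agent.sources.spool.interceptors = i1 and agent.sources.spool.interceptors.i1.type = com.example.Bzip2DecompressInterceptor$Builder (names hypothetical). Keep in mind that decompressing a large file into a single event body can put pressure on memory.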

Related

In Pentaho, can I load a file and process the data directly into an Oracle database?

As of now I am downloading the file from SFTP to the local machine and then loading it into the database. I want to remove that extra step of downloading the file to the machine.
The Text file input step is based on Apache VFS, which can read from an SFTP server. So the solution is to define the Filename/Directory with the appropriate syntax:
sftp://[username[:password]@]hostname[:port][/relative-path]
The supported file systems are listed on the Apache Commons VFS web page.
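If it helps to see the same mechanism outside Pentaho, here is a minimal Java sketch that reads a file over SFTP through Commons VFS; the host, credentials, and path are placeholders, and this is only an illustration of the URI syntax above, not Pentaho API code.

import java.io.InputStream;

import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.FileSystemManager;
import org.apache.commons.vfs2.VFS;

public class VfsSftpRead {
    public static void main(String[] args) throws Exception {
        FileSystemManager fsManager = VFS.getManager();
        // Same URI shape the Text file input step accepts in its Filename field
        FileObject remote = fsManager.resolveFile("sftp://user:secret@sftp.example.com:22/inbox/data.csv");
        try (InputStream in = remote.getContent().getInputStream()) {
            System.out.println("first byte: " + in.read());
        } finally {
            remote.close();
        }
    }
}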

How to write a custom flume-ng source for creating Avro files on an HDFS sink?

I'm trying to write a custom source that can create Avro files on an HDFS sink, but I can't figure it out.
I would like to see some guidelines or an example.
This will help you get the details:
https://flume.apache.org/FlumeUserGuide.html
But at a high level, to create a custom Flume source:
1. Change the Flume config: add your source configuration and point its type at the fully-qualified class name of MyCustomSource.
2. Write the source logic in MyCustomSource.java (as sketched below), build a jar, copy it to the Flume nodes, and restart the agents.
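As a rough sketch of step 2, a pollable source could look like the following; the class body is illustrative only, and a real source would read from whatever external system you are ingesting.

import java.nio.charset.StandardCharsets;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

public class MyCustomSource extends AbstractSource implements Configurable, PollableSource {

    private String prefix;

    @Override
    public void configure(Context context) {
        // Read custom properties from the agent configuration
        prefix = context.getString("prefix", "demo");
    }

    @Override
    public Status process() throws EventDeliveryException {
        try {
            // In a real source this is where you would read from your external system
            Event event = EventBuilder.withBody(
                    (prefix + ": hello from MyCustomSource").getBytes(StandardCharsets.UTF_8));
            getChannelProcessor().processEvent(event);
            return Status.READY;
        } catch (Exception e) {
            return Status.BACKOFF;
        }
    }

    // Required by the PollableSource interface in Flume 1.6 and later
    @Override
    public long getBackOffSleepIncrement() {
        return 1000L;
    }

    @Override
    public long getMaxBackOffSleepInterval() {
        return 5000L;
    }
}

In the agent configuration (step 1) the source type then points at the fully-qualified class name, e.g. agent.sources.src1.type = com.example.MyCustomSource (package name hypothetical). For the Avro part of the question, the HDFS sink can write Avro container files with its built-in avro_event serializer (hdfs.fileType = DataStream, serializer = avro_event), as described in the user guide.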

SFTP polling using Java

My scenario is as follows:
One Java program is uploading some random files to an SFTP location.
My requirement is that as soon as a file is uploaded by that program, I need to download it using Java. The files can be around 100 MB in size. I am searching for a Java API that can help with this. I don't even know the names of the files, but I can match them with a regular expression. The same file can be uploaded by the previous program periodically. Since the file size is large, I need to wait until the complete file has been uploaded.
I have used JSch to download files, but I can't figure out how to poll with it.
Polling
All you can do is keep listing the remote directory periodically until you find a new file. There's no better way with SFTP. For that you obviously use ChannelSftp.ls().
Regarding selecting files matching certain pattern, see:
JSch ChannelSftp.ls - pass match patterns in java
Waiting until the upload is complete
Again, there's no support for this in widespread implementations of SFTP.
For details, see my answer at:
SFTP file lock mechanism.
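To make the polling part concrete, a minimal JSch sketch could look like this. The host, credentials, directory, and the *.csv pattern are placeholders, and waiting for the size to stop changing between polls is only a crude heuristic for "upload finished"; as explained in the linked answer, there is no reliable lock mechanism.

import java.util.HashMap;
import java.util.Map;
import java.util.Vector;

import com.jcraft.jsch.ChannelSftp;
import com.jcraft.jsch.ChannelSftp.LsEntry;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class SftpPoller {
    public static void main(String[] args) throws Exception {
        JSch jsch = new JSch();
        Session session = jsch.getSession("user", "sftp.example.com", 22);
        session.setPassword("secret");
        session.setConfig("StrictHostKeyChecking", "no"); // acceptable for a sketch, not for production
        session.connect();

        ChannelSftp sftp = (ChannelSftp) session.openChannel("sftp");
        sftp.connect();

        Map<String, Long> lastSeenSize = new HashMap<>();

        while (true) {
            // ls() accepts glob-style wildcards, so a simple pattern can stand in for a regex
            @SuppressWarnings("unchecked")
            Vector<LsEntry> entries = sftp.ls("/upload/*.csv");
            for (LsEntry entry : entries) {
                String name = entry.getFilename();
                long size = entry.getAttrs().getSize();
                Long previous = lastSeenSize.put(name, size);
                // Download only once the size is stable across two polls
                if (previous != null && previous == size) {
                    sftp.get("/upload/" + name, "/local/download/" + name);
                    // Forget the file so a later re-upload is picked up again; an unchanged
                    // file left on the server will therefore also be downloaded again
                    lastSeenSize.remove(name);
                }
            }
            Thread.sleep(30_000); // poll every 30 seconds
        }
    }
}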

SFTP Files Timestamp

We have a client SFTP server from which we receive our files. The problem is that their SFTP server behaves in a weird way: all the folders and files on it have the same timestamp. Even if a file is posted today, it shows the timestamp as 01/01/2013.
How is this possible?
If I download the same files to my local machine, they show a different timestamp (which I believe is the original timestamp of the files).
The left side is my local machine and the right side is the SFTP server.
Is there any way to identify which files were posted today directly from SFTP under these conditions (the timestamps not being updated)?
We are looking for a way to automate this, but we are not sure how to do it in this scenario.
Any help will be much appreciated.
If you're using a command-line sftp client you can just use the -p configuration flag to preserve timestamps either when starting the sftp client or on download.
For example, doing this downloads all files in a directory and sets their timestamp to now:
sftp> mget *
Using the -p flag preserves the timestamps of the source, however:
sftp> mget -p *
I assume the graphical client you're using has something similar.
Typically, for automated file transfers such as the one you described here, I would recommend using rsync. rsync has the ability to transfer only what has changed and can preserve timestamps. It has many options. It comes with Linux and macOS, and there are rsync builds for Windows if needed.

How to decompress zlib files created with ByteArray.compress?

I work on a Flex application that creates compressed files and uploads them to a server. The files are created with the ByteArray.compress method, which is zlib compression. I can decompress them using the Python API on the server, but I prefer to keep the files compressed there. I want to be able to download and decompress the files later; however, WinZip and WinRAR fail to decompress them. When I google for a zlib utility, I only find the zlib DLL library. I need a simple application for Windows (and/or Linux) that is capable of decompressing zlib files.
So, zlib compression will certainly compress the data down, but it doesn't include the file headers that make it a "ZIP" file that can be opened with apps like Windows Explorer, WinZip, or WinRAR.
Adobe has some documents that explain how to READ a zip file, including information about the header. If you want to WRITE a zip file, just use this information to write out the header with the data.
Good luck!
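If you only need something quick on the desktop instead of writing ZIP headers, a raw zlib stream can also be inflated with a few lines of Java using java.util.zip.InflaterInputStream (the file names below are placeholders):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.zip.InflaterInputStream;

public class ZlibInflate {
    public static void main(String[] args) throws Exception {
        // input.bin is the zlib stream produced by ByteArray.compress(); output.bin gets the raw data
        try (InflaterInputStream in = new InflaterInputStream(new FileInputStream("input.bin"));
             FileOutputStream out = new FileOutputStream("output.bin")) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}

The default Inflater expects exactly the zlib format (two-byte header plus Adler-32 checksum) that ByteArray.compress() writes, so no extra options are needed.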
