NiFi forward/duplicate TCP Stream


I'm supposed to duplicate a binary TCP Stream.
So I set up a NiFi 1.9.0 server, put in a ListenTCP processor and a PutTCP processor, configured the proper IPs and Ports and connected them.
So far so good: the packets were received by the ListenTCP processor and also forwarded by the PutTCP processor.
But NiFi seems to alter the data somehow: the sent packets aren't exactly the same as the received ones. I expected NiFi to just forward everything 1:1, but something is happening and I cannot find out what.
I've been playing around with the Character Set, Max Batch Size and Batching Message Delimiter settings on the ListenTCP processor, and also with the Outgoing Message Delimiter and Character Set on the PutTCP processor.
I also messed around with a MergeContent processor but didn't get it to work properly.
Here you can see the difference between received (red) and sent data (captured using tcpflow).
Link to picture
Another problem is that I don't really know the data I'm processing, it says in the documentation:
These log files are in the machine-readable binary format that is described by the XML file called ebm.xml.
and
The streamed events are in the TCP-based binary format.
I do have access to ebm.xml file, but not sure how I can make use of it.
Does anyone have an idea how I can get NiFi to simply forward everything?
I'm new to NiFi, so I might have missed some possibilities...

The ListenTCP processor reads data from the stream using a new-line character as a logical message separator. For example, if the stream had:
<chunk1><new-line><chunk2><new-line><chunk3><new-line>
It would result in reading chunk1, chunk2, and chunk3 into an internal queue.
When it writes them back out it uses the outgoing message delimiter. So the outgoing flow file would be:
<chunk1><outgoing-delim><chunk2><outgoing-delim><chunk3><outgoing-delim>
Unfortunately it is more geared towards receiving textual data such as logs which are typically line-delimited. The chunks should be passing through unaltered as byte[], but typically binary data wouldn't have these logical new-line boundaries, so I'm not sure how well it works for that.
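To make the effect on binary data concrete, here is a minimal Python sketch that mimics the split-and-rejoin behaviour described above. It is not NiFi code: the function name split_and_rejoin and the sample bytes are made up for illustration, and what NiFi does with trailing bytes that lack a final new-line is not modelled.

```python
def split_and_rejoin(stream: bytes, outgoing_delim: bytes) -> bytes:
    """Mimic the behaviour described above: cut the input at every
    new-line byte (0x0A), then write each chunk followed by the
    outgoing delimiter."""
    chunks = stream.split(b"\n")
    # A stream that ends in 0x0A leaves an empty last element; drop it.
    if chunks and chunks[-1] == b"":
        chunks.pop()
    return b"".join(chunk + outgoing_delim for chunk in chunks)


# Hypothetical binary record in which 0x0A is ordinary data, not a line end.
binary = bytes([0x00, 0x0A, 0xFF, 0x10, 0x0A])

same = split_and_rejoin(binary, b"\n")       # delimiter equals the separator
changed = split_and_rejoin(binary, b"\r\n")  # any other delimiter rewrites bytes
```

In this model the bytes only survive when the outgoing delimiter is exactly the new-line byte; with any other delimiter every 0x0A that happened to occur in the binary data is replaced, which would be consistent with the kind of difference seen in the capture.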

Related

Figuring out what kind of payload is carried by a packet

I'm working with Scapy to parse a set of .pcap files. I would like to understand what kind of payload those packets are carrying. If, for example, I have a pcap file with many UDP packets whose payloads all start with the same bytes, and those first values keep repeating in other packets, I still don't know what kind of encoding was used. Is there any program or Python library that could identify, or at least guess, the encoding (for example, whether it is an RTP payload, an MPEG one, and so on)?
UPDATE
I was able to use nDPI on those pcap files and it gave me satisfying results for all the flows except for a set of them that it was not able to recognize. I'm going to share with you the first part of the hex representation of the data:
f1d00404d1002d7c484830320000020080073804610d00007b09040000000000010f000000000000000000000000000000000000000000000000000121e002a22e537fcccb815afafce2361b
The first part, f1d004, does not change between consecutive packets. I have already tried to decode them as different protocols using Wireshark's "Decode As" feature: RTP, RTCP, RTSP, JSON and MPEG. In case it is useful, the capture comes from a camera, which is why I tried those protocols.

recv() data of unknown size with Berkeley Sockets

I have some C++ code in which I use recv() from Berkeley Sockets to receive data from a remote host. The issue is that I do not know the size of the data (which is variable), so I probably need some kind of timeout option to make this work.
Since I'm new to socket programming, I was wondering how, for example, a web client handles responses from a server (e.g. a server sends HTML data to the client). Does it use some kind of timeout, since it doesn't know how big the page is? Same with an FTP client.

When your data is of variable length, that data is typically framed within another container. That is to say, there's a header preceding the actual data block that tells the receiver how much data it should accept.
For example, HTTP uses line endings to delimit its header. If there's a variable-length message body, the header includes a "Content-Length:" field that indicates exactly how many bytes to read once the entire header has been received (the header ends at the first blank line, i.e. two consecutive line endings).
It is perfectly fine to read 4 bytes from the socket, work out how much data follows, then do another receive and read the rest. Just be careful: when you ask for 4 bytes, the socket might give you anywhere between 1 and 4 bytes, so anything less than 4 means you need to go back and ask for the remaining bytes. This is a very common mistake. In a dev environment you will almost always get 4 bytes when asking for 4, but once you deploy your app, somewhere on some machine you will get random crashes because the network behaves differently there.
Generally, it is a bad approach to rely on timeouts to determine when you have reached the end of the data. With a timeout, you might get things "reliably" working in a well-controlled dev environment, but it is a very flaky solution. Any CPU/disk/network hiccup might cause your app to stop receiving prematurely. You are also limiting your data throughput and responsiveness, since your app is sleeping for some time interval instead of doing work.
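The length-prefix approach from this answer can be sketched as follows. This is a Python illustration rather than the asker's C++, and the frame layout (a 4-byte big-endian length followed by the payload) as well as the names recv_exact, recv_message and send_message are assumptions made for the example. The same loop applies to recv() in C or C++.

```python
import socket
import struct


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Keep calling recv() until exactly n bytes have arrived."""
    buf = bytearray()
    while len(buf) < n:
        part = sock.recv(n - len(buf))  # may return fewer bytes than asked for
        if not part:  # empty result: the peer closed the connection
            raise ConnectionError(f"peer closed after {len(buf)} of {n} bytes")
        buf.extend(part)
    return bytes(buf)


def recv_message(sock: socket.socket) -> bytes:
    """Read one frame: 4-byte big-endian length, then that many payload bytes."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def send_message(sock: socket.socket, payload: bytes) -> None:
    """Write one frame in the same (assumed) format."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)
```

No timeout is needed to find the end of a message: the receiver always knows how many bytes are still outstanding, and a closed connection shows up as an explicit error instead of a silent short read.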

Web server - how to parse requests? Asynchronous Stream Tokenizer?

I'm attempting to create a simple webserver in C# in asynchronous socket programming style. The purpose is very narrow - a Comet server (http long-polling).
I've got the windows service running, accepting connections, dumping request info to the Console and returning simple fixed content to the client.
Now, I can't figure out a manageable strategy for parsing the request data asynchronously and safely. I've written synchronous LL(1) parsers before. I'm not sure if an LL(1) parser is appropriate or necessary for HTTP. I don't know how to tokenize the input stream asynchronously. All I can think of is having an input buffer per client, reading into that, then copying that to a StringBuilder and periodically checking to see if I have a complete request. But that seems inefficient and might lead to code that is difficult to debug and maintain.
Also, the connection has two phases: receiving the request in full and then sending a response (in this case, after some delay). Only once the request is validated and actionable am I planning to enroll the connection in the long-polling manager. However, a misbehaving client could continue to send data and fill up a buffer, so I think I need to continue to monitor and empty the input stream during the response phase, right?
Any guidance on this is appreciated.
I guess the first step is knowing whether it is possible to efficiently tokenize a network stream asynchronously and without a large intermediate buffer. Even without a proper parser, the same challenges of creating a tokenizer apply to reading "lines" of input at a time, or even reading until double blank lines (one big token). I don't want to read one byte at a time from the network, but neither do I want to read too many bytes and have to store them in some intermediate buffer, right?

For HTTP the best way is to read the headers into memory completely (until you receive \r\n\r\n), then simply split by \r\n to get the individual header lines, and split each line at the first : to separate name and value.
There's no need to use a complex parser for that.
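As a rough illustration of the splitting described in this answer, here is a minimal Python sketch (the question is about C#, but the steps are the same). The function name parse_head and the choice to lower-case header names are assumptions for the example; repeated headers simply overwrite each other here.

```python
def parse_head(buffer: bytes):
    """Return (request_line, headers, rest) once the buffer holds a full
    header block, or None if the terminating blank line has not arrived."""
    head, sep, rest = buffer.partition(b"\r\n\r\n")
    if not sep:
        return None  # incomplete: keep reading from the socket
    lines = head.decode("iso-8859-1").split("\r\n")
    request_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split at the first colon only; values may contain colons themselves.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return request_line, headers, rest
```

The caller appends each received block to a per-connection buffer and calls parse_head again until it stops returning None. Capping the size of that buffer is one way to deal with a client that never sends the blank line.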

Is there anything in the FTP protocol like the HTTP Range header?

Suppose I want to transfer just a portion of a file over FTP - is it possible using a standard FTP protocol?
In HTTP I could use a Range header in the request to specify the data range of the remote resource. If it's a 1mb file, I could ask for the bytes from 600k to 700k.
Is there anything like that in FTP? I am reading the FTP RFC, don't see anything, but want to make sure I'm not missing anything.
There's a Restart command in FTP - would that work?
Addendum
After getting Brian Bondy's answer below, I wrote a read-only Stream class that wraps FTP. It supports Seek() and Read() operations on a resource that is read via FTP, based on the REST verb.
Find it at http://cheeso.members.winisp.net/srcview.aspx?dir=streams&file=FtpReadStream.cs
It's pretty slow to Seek(), because setting up the data socket takes a long time. Best results come when you wrap that stream in a BufferedStream.

Yes you can use the REST command.
REST sets the point at which a subsequent file transfer should start. It is usually used for restarting interrupted transfers. The command must come right before a RETR or STOR, and therefore after a PORT or PASV.
From FTP's RFC 959:
RESTART (REST) The argument field represents the server marker at which file transfer is to be restarted. This command does not cause file transfer but skips over the file to the specified data checkpoint. This command shall be immediately followed by the appropriate FTP service command which shall cause file transfer to resume.
Read more:
http://www.faqs.org/rfcs/rfc959.html#ixzz0jZp8azux
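A hedged sketch of how REST can emulate a byte range, using Python's standard ftplib. The function names fetch_range and read_up_to are made up for this example. Since REST only sets a start offset, the sketch stops the transfer by closing the data connection early, and how a server replies to that varies, so treat the reply handling as an assumption rather than guaranteed behaviour.

```python
import ftplib


def read_up_to(conn, length: int, blocksize: int = 8192) -> bytes:
    """Read at most `length` bytes from a socket-like object with recv()."""
    buf = bytearray()
    while len(buf) < length:
        part = conn.recv(min(blocksize, length - len(buf)))
        if not part:  # end of file reached before `length` bytes
            break
        buf.extend(part)
    return bytes(buf)


def fetch_range(ftp: ftplib.FTP, path: str, start: int, length: int) -> bytes:
    """Fetch `length` bytes of `path` beginning at byte offset `start`."""
    ftp.voidcmd("TYPE I")  # binary mode, so offsets count raw bytes
    # transfercmd() sends "REST <start>" immediately before the RETR.
    conn = ftp.transfercmd(f"RETR {path}", rest=start)
    try:
        data = read_up_to(conn, length)
    finally:
        conn.close()  # no end offset in FTP: stop by closing the data socket
    try:
        ftp.voidresp()  # collect the server's final reply for the transfer
    except ftplib.error_temp:
        pass  # assumption: a 4xx reply such as 426 just reports the early close
    return data
```

With a logged-in ftplib.FTP object, fetch_range(ftp, "big.bin", 600_000, 100_000) would correspond to the 600k to 700k example in the question (the file name is hypothetical).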

You should check out how GridFTP does parallel transfers. That's using the sort of techniques that you want (and might actually be code that it is better to borrow rather than implementing from scratch yourself).

TCP/IP programming, data in more than one packet

I am writing an application in C, using libpcap. My program listens for new packets and parses them according to a grammar. The payload is actually XML.
Sometimes one packet is not enough for an XML file, so the XML buffer is split across several packets.
I want to add logic to handle these cases. However, I don't know in advance whether a packet contains the whole data. How do I know that more data will be sent in a following packet? How do I recognize that a new packet contains the rest of the data?
Do I have to use the TH_FIN flag? Could you please explain it to me?

There's nothing in TCP that defines packets, that's up to the higher layers to define if they need to - TCP is just a stream.
If this is raw XML over a TCP stream, you actually need to parse the xml - you'll know when you have a whole xml document when you've received the end of the document element.
If it's XML packaged over HTTP , you might be able to parse out the Content-Length: header which should contain the length of the body.
Note that reassembling a TCP stream from captured packets is a very hard problem with a lot of corner cases: you'd need to handle retransmissions, out-of-sequence TCP segments and many more. http://libnids.sourceforge.net/ might help you.
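To show what "parse until the end of the document element" can look like on an already reassembled byte stream, here is a small Python sketch using the standard library's incremental XMLPullParser (the question uses C, so this only illustrates the idea). The class name XmlDocumentDetector is invented for the example, and it handles a single document; bytes belonging to a following document are not dealt with.

```python
import xml.etree.ElementTree as ET


class XmlDocumentDetector:
    """Feed stream chunks; reports when the root element has been closed."""

    def __init__(self):
        self._parser = ET.XMLPullParser(events=("start", "end"))
        self._depth = 0
        self.root = None  # set once the document element is complete

    def feed(self, chunk: bytes) -> bool:
        """Feed one chunk; return True once the whole document has arrived."""
        self._parser.feed(chunk)
        # Newer Expat versions may defer parsing of partial tokens;
        # flush(), where this Python version provides it, forces parsing now.
        flush = getattr(self._parser, "flush", None)
        if flush is not None:
            flush()
        for event, elem in self._parser.read_events():
            if event == "start":
                self._depth += 1
            else:
                self._depth -= 1
                if self._depth == 0:  # the end tag of the root element
                    self.root = elem
        return self.root is not None
```

Chunk boundaries can fall anywhere, including in the middle of a tag, because the parser keeps the unfinished part until more bytes arrive.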

As Anon says, use a higher-level stream library.
But even then you need to know the chunk size before you start handling it, as you will read from the stream in blocks of n bytes.
So you want to first send, in binary, the number of bytes that follow, then send that many bytes, and repeat. That way, when you are receiving the chunks via select/read, you know when you have all of chunk one and can pass it to the processor.

If you're using TCP, use a TCP library that gives you the data as a stream instead of trying to handle the packets yourself.

Stream is good. Another option is to store the incoming data in a buffer (e.g. char*) and search for the application's message-framing characters or, in the case of XML, the root end tag. Once you've found a complete XML message at the front of the buffer, pull it out and process it.

The XMPP instant messaging protocol, used by Jabber, has a means of moving XML chunks over a TCP stream. I don't know exactly how it is done myself, but RFC 3920 is the protocol definition. You should be able to work it out from that.
