How to efficiently split pcap files based on TCP stream?

I am trying to split large pcap files containing hundreds of TCP streams into separate files. My current approach (see below) seems quite inefficient to me. My question is: What is the most efficient way of splitting pcap files into separate files by TCP stream?
Current approach
In my current approach, I first use tshark to find out which TCP streams are in the file. Next, for each of these TCP streams, I read the original file and extract the given stream. The code snippet below shows my approach:
#!/bin/bash
# Pass 1: list every TCP stream index present in the capture.
for stream in $(tshark -r "$file" -T fields -e tcp.stream | sort -nu)
do
# Passes 2..N: extract the given stream and write it to its own file.
tshark -r "$file" -Y "tcp.stream eq $stream" -w "$file.$stream.pcap"
done
However, this approach seems inefficient, as tshark has to read the pcap file several times (once per stream, plus once to list the streams). I would ideally like a solution that goes over the original pcap file once and, upon finding a packet belonging to a specific connection, appends it to that connection's file.
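One variant I have been considering (a sketch, untested) would confine the expensive dissection to a single pass and leave the per-stream extraction to editcap, which copies packets by frame number without dissecting them. It still re-reads the file once per stream, but those passes are cheap:
#!/bin/bash
# Single dissection pass: record which frame numbers belong to which
# TCP stream.
file="$1"
tmpdir=$(mktemp -d)
tshark -r "$file" -Y tcp -T fields -e tcp.stream -e frame.number |
while read -r stream frame; do
    echo "$frame" >> "$tmpdir/$stream"
done
# Cheap per-stream passes: editcap -r keeps only the listed frames.
# (Very long streams may exceed the argument-length limit, in which
# case the frame lists would need to be collapsed into ranges.)
for list in "$tmpdir"/*; do
    stream=$(basename "$list")
    editcap -r "$file" "$file.$stream.pcap" $(cat "$list")
done
rm -rf "$tmpdir"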
Other approaches
I have looked around for other approaches as well, but they do not seem to suit my situation:
PcapPlusPlus' PcapSplitter has a slightly different definition of a TCP connection. They define a 'connection' as the same (protocol, source IP, destination IP, source port, destination port) tuple, which might show weird behaviour if multiple TCP streams share the same tuple. I believe Wireshark/tshark actually base their TCP streams on the SYN/SYN-ACK and FIN/FIN-ACK flags (but please correct me if I am wrong).
Python's Scapy has the same problem as PcapSplitter in that it does not provide any way of splitting TCP streams apart from the 5-tuple described above. (Of course, I could write this myself, but that would be beyond the scope of my current work.)
Also for both of these solutions, I am not entirely sure whether they are able to correctly handle erroneous captures.
Question
Therefore, I would like to have some suggestions on how to split pcap files into separate files based on TCP stream in the most efficient way.

Have you looked into TraceWrangler? It's Windows software, but the documentation does mention that it can run under Wine.
That's probably the best tool I can think of, but you might want to have a look at some others listed on the Wireshark wiki Tools page.

The most efficient way (from a performance point of view) is obviously a dedicated program for the task.
The libpcap library provides the functions needed to implement one:
pcap_open_offline for opening a pcap file for reading
pcap_dump_open for opening pcap files for writing
pcap_dump for writing a packet to a target file
And a bunch of functions for filtering and handling the input.

You can use pkt2flow:
https://github.com/caesar0301/pkt2flow
Usage: ./pkt2flow [-huvx] [-o outdir] pcapfile
Options:
    -h    print this help and exit
    -u    also dump (U)DP flows
    -v    also dump the in(v)alid TCP flows without the SYN option
    -x    also dump non-UDP/non-TCP IP flows
    -o    (o)utput directory
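For example, to split capture.pcap (a placeholder name) into per-flow files under ./flows, including TCP flows that were captured mid-connection:
./pkt2flow -v -o flows capture.pcap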

Related

Figuring out what kind of payload is carried by a packet

I'm working with Scapy to parse a set of .pcap files. I would like to understand what kind of payload those packets are carrying. If I have, for example, a pcap file with a lot of UDP packets whose payloads share the same starting bytes, I don't know what kind of encoding was used, and the first values keep repeating in other packets. Is there any program or Python library that could allow me to figure out, or try to guess, what kind of encoding was used (for example, whether it is an RTP payload or an MPEG one, and so on)?
UPDATE
I was able to use nDPI on those pcap files and it gave me satisfying results for all the flows except for a set of them that it was not able to recognize. I'm going to share with you the first part of the hex representation of the data:
f1d00404d1002d7c484830320000020080073804610d00007b09040000000000010f000000000000000000000000000000000000000000000000000121e002a22e537fcccb815afafce2361b
The first part f1d004 does not change between previous and successive packets. I have already tried to decode the packets with different protocols using Wireshark's "Decode As" feature. I have tried RTP, RTCP, RTSP, JSON and MPEG. If it helps, this capture comes from a camera, which is why I tried those protocols.
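For reference, I ran nDPI through its bundled ndpiReader example tool, along these lines (exact flags may differ between versions):
# -i takes a pcap file (or an interface); -v raises verbosity so the
# per-flow protocol guesses are printed.
ndpiReader -i capture.pcap -v 1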

Wireshark / Tshark pcap reassemble code

Does anybody know where within tshark or Wireshark the code is that I could use to reassemble pcap files? I am working on an app and need to reassemble pcap files, but don't need the other functionality of Wireshark/tshark, so I am hoping to use that code as guidance.
Thanks.
If it's a classic tcpdump/pcap file (not the pcapng format), you can throw away the first 24 bytes (the file header) of the second file and append the rest to the first file; then do the same for each of the other files.
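A minimal sketch of that trick, assuming classic pcap files that all share the same link-layer type and snap length:
# The global pcap file header is 24 bytes, so `tail -c +25` emits
# everything from byte 25 onwards (i.e. just the packet records).
cp first.pcap merged.pcap
for f in second.pcap third.pcap; do
    tail -c +25 "$f" >> merged.pcap
done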
mergecap (from the Wireshark suite) will merge two or more pcap files.
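For example:
# Writes a single merged capture, ordering packets chronologically.
mergecap -w merged.pcap one.pcap two.pcap three.pcap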

tshark not able to read icmp6 fields

I am able to read/replay all the headers and fields with tshark up to and including the IPv6 header (Ethernet header & IPv6 header), but when I replay the pcap files to read the ICMPv6 fields, none of those fields are displayed.
Is this a tshark bug? Is there an alternative tool that can read all the fields in all headers of a packet?
The version of tshark I am using is 1.2.11
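For reference, the kind of query that comes back empty looks like this (icmpv6.type and icmpv6.code are the field names registered by newer Wireshark releases; a 1.2.x build may simply not know them):
tshark -r trace.pcap -T fields -e icmpv6.type -e icmpv6.code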
Bro is a network traffic analysis tool with full IPv6 support, whereas tshark seems to struggle with IPv6. In Bro, you can get connection summaries by running it on a trace as follows:
bro -C -r trace.pcap
and inspect the resulting file conn.log in the same directory. You may find the accompanying tool bro-cut helpful to extract only a subset of the columns, e.g.,
bro-cut id.orig_h id.resp_h id.orig_p id.resp_p proto < conn.log
would extract the connection 5-tuple and print it to STDOUT, so that you can continue processing it with your favorite text munching tool.

Is it safe to pipe the output of several parallel processes to one file using >>?

I'm scraping data from the web, and I have several processes of my scraper running in parallel.
I want the output of each of these processes to end up in the same file. As long as lines of text remain intact and don't get mixed up with each other, the order of the lines does not matter. On UNIX, can I just redirect the output of each process to the same file using the >> operator?
No. It is not guaranteed that lines will remain intact. They can become intermingled.
From searching based on liori's answer I found this:
Write requests of {PIPE_BUF} bytes or less shall not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than {PIPE_BUF} bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, whether or not the O_NONBLOCK flag of the file status flags is set.
So lines longer than {PIPE_BUF} bytes are not guaranteed to remain intact.
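You can query the limit for a given filesystem with getconf; on Linux it is typically 4096 bytes:
getconf PIPE_BUF /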
One possibly interesting thing you could do is use GNU parallel: http://www.gnu.org/s/parallel/ For example, if you were spidering the sites:
stackoverflow.com, stackexchange.com, fogcreek.com
you could do something like this
(echo stackoverflow.com; echo stackexchange.com; echo fogcreek.com) | parallel -k your_spider_script
and the output is buffered by parallel and, because of the -k option, returned to you in the order of the site list above. A real example (basically copied from the second parallel screencast):
~ $ (echo stackoverflow.com; echo stackexchange.com; echo fogcreek.com) | parallel -k ping -c 1 {}
PING stackoverflow.com (64.34.119.12): 56 data bytes
--- stackoverflow.com ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss
PING stackexchange.com (64.34.119.12): 56 data bytes
--- stackexchange.com ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss
PING fogcreek.com (64.34.80.170): 56 data bytes
64 bytes from 64.34.80.170: icmp_seq=0 ttl=250 time=23.961 ms
--- fogcreek.com ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 23.961/23.961/23.961/0.000 ms
Anyway, ymmv
Generally, no.
On Linux this might be possible, as long as two conditions are met: each line is written in one operation, and the line is no longer than PIPE_BUF (usually the same as PAGE_SIZE, typically 4096 bytes). But... I wouldn't count on that; this behaviour might change.
It is better to use some kind of real logging mechanism, like syslog.
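A minimal sketch of the syslog route, with scrape.sh standing in for your scraper (logger ships with util-linux):
# Tag each process so its lines can be told apart later; syslog
# serializes the writes for you.
./scrape.sh part1 | logger -t scraper-1 &
./scrape.sh part2 | logger -t scraper-2 &
wait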
Definitely not. I had a log-management script where I assumed this worked, and it did work, until I moved it to a production server under load. Not a good day... But basically, you sometimes end up with completely mixed-up lines.
If I'm trying to capture from multiple sources, it is much simpler (and easier to debug) to keep a multiple-file 'paper trail' and, if I need an overall log file, concatenate based on timestamps (you are using timestamps, right?) or, as liori said, use syslog.
Use temporary files and concatenate them together. It's the only safe way to do what you want to do, and there will (probably) be negligible performance loss that way. If performance is really a problem, try making sure that your /tmp directory is a RAM-based filesystem and putting your temporary files there. That way the temporary files are stored in RAM instead of on a hard drive, so reading/writing them is near-instant.
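A sketch of that pattern, with worker.sh standing in for the scraper process:
# Give each worker a private output file, wait for all of them to
# finish, then concatenate the results in one go.
for i in 1 2 3 4; do
    ./worker.sh "$i" > "/tmp/scrape.$i" &
done
wait
cat /tmp/scrape.* >> results.txt
rm -f /tmp/scrape.*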
You'll need to ensure that you're writing whole lines in single write operations (so if you're using some form of stdio, you'll need to set it to line buffering for at least the length of the longest line that you can output). Since the shell uses O_APPEND for the >> redirection, all your writes will automatically append to the file with no further action on your part.
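A quick, non-rigorous way to convince yourself (each printf issues one short write, well below {PIPE_BUF}):
# Two writers appending short single-write lines; every line should
# come out intact, though the interleaving order is arbitrary.
for i in $(seq 500); do printf 'A %s\n' "$i"; done >> demo.log &
for i in $(seq 500); do printf 'B %s\n' "$i"; done >> demo.log &
wait
wc -l demo.log   # expect 1000 intact lines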
Briefly, no. >> makes no guarantees about line integrity when multiple processes write to the same file.
In addition to the idea of using temporary files, you could also use some kind of aggregating process, although you would still need to make sure your writes are atomic.
Think Apache2 with piped logging (with something like spread on the other end of the pipe if you're feeling ambitious). That's the approach it takes, with multiple threads/processes sharing a single logging process.
As mentioned above, it's quite a hack, but it works pretty well =)
( ping stackoverflow.com & ping stackexchange.com & ping fogcreek.com ) | cat
same thing with '>>' :
( ping stackoverflow.com & ping stackexchange.com & ping fogcreek.com ) >> log
and with exec on the last one you save one process:
( ping stackoverflow.com & ping stackexchange.com & exec ping fogcreek.com ) | cat

TCP flow extraction

I need to extract TCP flows, with their content, from a dump file, and then save each flow into a separate file.
You definitely want to use Bro, more specifically, its contents.bro policy. For example, given a trace that contains HTTP requests, running the following ...
bro -r http.trace -f 'tcp and port 80' contents
... produces files
contents.[senderIP].[senderPort]-[destIP].[destPort]
contents.[destIP].[destPort]-[senderIP].[senderPort]
for each connection, each containing the unidirectional content of the flow.
The flow reassembly is highly robust, the process scales to very large files, and everything is customizable to your needs.
If you're only doing a few, Wireshark can do this.
Steps:
Open up the capture in Wireshark.
Click on a packet from the TCP connection you're interested in
Analyze -> Follow TCP Stream
Click 'Raw'
Select (from the popup menu) one of 'Entire Conversation' or one of the two directions.
Click 'Save As'
Alternate steps, for HTTP only:
Open up the capture
Select File -> Export -> Objects -> HTTP
A dialog will open showing all the HTTP objects in the capture. You can save some or all of them.
This is with Wireshark 1.2.1 on Linux/GTK. The 'Follow TCP Stream' option has moved around between versions, so it may be somewhere else if you have an older version, but it has always been called Follow TCP Stream, so you should be able to find it.
Quick searching also reveals several other options if Wireshark doesn't work for you: ngrep, tcpick, chaosreader, and tcpflow.
tcpflow -r my_dump_file.pcap -o output_dir/
It will extract each TCP flow into its own file under output_dir.
Here's the manpage with more options
Wireshark, maybe? It can be used to filter sessions, and I think you can then save them separately.
You could also have a look at NetFlow and related tools.

Resources