Is it safe to pipe the output of several parallel processes to one file using >>?

I'm scraping data from the web, and I have several processes of my scraper running in parallel.
I want the output of each of these processes to end up in the same file. As long as lines of text remain intact and don't get mixed up with each other, the order of the lines does not matter. In UNIX, can I just pipe the output of each process to the same file using the >> operator?

No. It is not guaranteed that lines will remain intact. They can become intermingled.
From searching based on liori's answer I found this:
Write requests of {PIPE_BUF} bytes or less shall not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than {PIPE_BUF} bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, whether or not the O_NONBLOCK flag of the file status flags is set.
So lines longer than {PIPE_BUF} bytes are not guaranteed to remain intact.
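If you want to know the limit on your own system, getconf can report it (the path argument is required because the value can vary per file system); on Linux this typically prints 4096:
getconf PIPE_BUF /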

One possibly interesting thing you could do is use GNU parallel: http://www.gnu.org/s/parallel/ For example, if you were spidering the sites:
stackoverflow.com, stackexchange.com, fogcreek.com
you could do something like this
(echo stackoverflow.com; echo stackexchange.com; echo fogcreek.com) | parallel -k your_spider_script
and the output is buffered by parallel and, because of the -k option, returned to you in the order of the site list above. A real example (basically copied from the second parallel screencast):
~ $ (echo stackoverflow.com; echo stackexchange.com; echo fogcreek.com) | parallel -k ping -c 1 {}
PING stackoverflow.com (64.34.119.12): 56 data bytes
--- stackoverflow.com ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss
PING stackexchange.com (64.34.119.12): 56 data bytes
--- stackexchange.com ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss
PING fogcreek.com (64.34.80.170): 56 data bytes
64 bytes from 64.34.80.170: icmp_seq=0 ttl=250 time=23.961 ms
--- fogcreek.com ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 23.961/23.961/23.961/0.000 ms
Anyway, ymmv

Generally, no.
On Linux this might be possible, as long as two conditions are met: each line is written in one operation, and the line is no longer than PIPE_BUF (usually 4096 bytes, the same as PAGE_SIZE on most systems). But... I wouldn't count on that; this behaviour might change.
It is better to use some kind of real logging mechanism, like syslog.
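For example, a minimal sketch of that approach, assuming a hypothetical scraper.sh and that the logger utility is available; syslog then serialises the lines into one log for you:
./scraper.sh site1 | logger -t scraper &
./scraper.sh site2 | logger -t scraper &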

Definitely not. I had a log-management script where I assumed this worked, and it did work, until I moved it to a production server under load. Not a good day... But basically you sometimes end up with completely mixed-up lines.
If I'm trying to capture from multiple sources, it is much simpler (and easier to debug) to keep a separate 'paper trail' file per source; if I need an overall log file, I concatenate them based on timestamp (you are using timestamps, right?) or, as liori said, use syslog.

Use temporary files and concatenate them together. It's the only safe way to do what you want to do, and there will (probably) be negligible performance loss that way. If performance is really a problem, try making sure that your /tmp directory is a RAM-based filesystem and putting your temporary files there. That way the temporary files are stored in RAM instead of on a hard drive, so reading/writing them is near-instant.
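A minimal sketch of that pattern, with scraper.sh as a hypothetical placeholder for your own scraper:
tmpdir=$(mktemp -d)
./scraper.sh site1 > "$tmpdir/1.out" &
./scraper.sh site2 > "$tmpdir/2.out" &
wait                                # let all scrapers finish
cat "$tmpdir"/*.out >> combined.out
rm -r "$tmpdir"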

You'll need to ensure that you're writing whole lines in single write operations (so if you're using some form of stdio, you'll need to set it to line buffering with a buffer at least as long as the longest line you can output). Since the shell opens the file with O_APPEND for the >> redirection, all your writes will then automatically append to the file with no further action on your part.
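If you can't change the program itself, coreutils' stdbuf can often force line buffering from the outside (it only helps programs that use stdio's default buffering); a hedged sketch, again with scraper.sh as a placeholder:
stdbuf -oL ./scraper.sh site1 >> combined.log &
stdbuf -oL ./scraper.sh site2 >> combined.log &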

Briefly, no. >> appends, but it does not protect concurrent writes from multiple processes against interleaving.

In addition to the idea of using temporary files, you could also use some kind of aggregating process, although you would still need to make sure your writes are atomic.
Think Apache2 with piped logging (with something like spread on the other end of the pipe if you're feeling ambitious). That's the approach it takes, with multiple threads/processes sharing a single logging process.
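A rough sketch of that shape, with agg_pipe and scraper.sh as hypothetical names; note that writes into the fifo are still only atomic up to PIPE_BUF:
mkfifo agg_pipe
cat agg_pipe >> combined.log &      # a single process does all the file writes
./scraper.sh site1 > agg_pipe &
./scraper.sh site2 > agg_pipe &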

As mentioned above it's quite a hack, but works pretty well =)
( ping stackoverflow.com & ping stackexchange.com & ping fogcreek.com ) | cat
same thing with '>>' :
( ping stackoverflow.com & ping stackexchange.com & ping fogcreek.com ) >> log
and with exec on the last one you save one process:
( ping stackoverflow.com & ping stackexchange.com & exec ping fogcreek.com ) | cat

Related

How to efficiently split pcap files based on TCP stream?

I am trying to split large pcap files containing hundreds of TCP streams into separate files. My current approach (see below) seems quite inefficient to me. My question is: What is the most efficient way of splitting pcap files into separate files by TCP stream?
Current approach
In my current approach, I first use tshark to find out which TCP streams are in the file. Next, for each of these TCP streams, I read the original file and extract the given stream. The code snippet below shows my approach:
#!/bin/bash
# Get all TCP stream numbers present in the capture
for stream in $(tshark -r "$file" -T fields -e tcp.stream | sort -n | uniq)
do
    # Extract the specified stream from $file and write it to a separate file.
    tshark -r "$file" -Y "tcp.stream eq $stream" -w "$file.$stream.pcap"
done
However, this approach seems inefficient, as tshark has to read the pcap file several times (once for each stream). I would ideally like a solution that goes over the original pcap file once and, upon finding a packet belonging to a specific connection, appends it to the file for that connection.
Other approaches
I have looked around for other approaches as well, but they do not seem to suit my situation:
PcapPlusPlus' PcapSplitter has a slightly different definition of a TCP connection. They define 'connection' as the same (protocol, source ip, destination ip, source port, destination port)-tuple, which might show weird behaviour if multiple TCP streams have the same tuple. I believe wireshark/tshark actually base their TCP streams on the SYN:SYN-ACK and FIN:FIN-ACK flags (but please correct me if I am wrong).
Python's Scapy has the same problem as PcapSplitter in that it does not provide any way of splitting TCP streams apart from the 5-tuple described above. (Of course I could write this myself, but that would be beyond the scope of my current work.)
Also for both of these solutions, I am not entirely sure whether they are able to correctly handle erroneous captures.
Question
Therefore, I would like to have some suggestions on how to split pcap files into separate files based on TCP stream in the most efficient way.
Have you looked into Tracewrangler? It's for Windows but the documentation does mention that it can run under wine.
That's probably the best tool I can think of, but you might want to have a look at some others listed on the Wireshark wiki Tools page.
An efficient way (from a performance point of view) is obviously a dedicated program for the task.
The libpcap library has the functions needed to implement one:
pcap_open_offline for opening a pcap file for reading
pcap_dump_open for opening a pcap file for writing
pcap_dump for writing a packet to the target file
And a bunch of functions for filtering/handling the input.
You can use pkt2flow:
https://github.com/caesar0301/pkt2flow
Usage: ./pkt2flow [-huvx] [-o outdir] pcapfile
Options:
-h print this help and exit
-u also dump (U)DP flows
-v also dump the in(v)alid TCP flows without the SYN option
-x also dump non-UDP/non-TCP IP flows
-o (o)utput directory
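For example, assuming the pkt2flow binary has been built in the current directory and a hypothetical capture file capture.pcap:
./pkt2flow -u -v -x -o flows capture.pcap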

nmap host discovery and data-length option

I am doing host discovery only (the -sn option), trying to determine which hosts are up and running.
My first command was:
nmap -sn -PS21,22,25,53,80,443,3389,8000,8080,42000 -PA80,443,8080,42000 -PU53 xxx.xxx.xxx.xxx/27
I am scanning public IPs, and the above command produces a result stating that 18 hosts are up.
However, when I run the above command with the --data-length option (either 32 or 56), it produces a result with only 8 hosts up.
I was expecting to see more hosts, if anything... but not fewer. (The --data-length option adds bytes of data to every packet to simulate the ping tool, and it may help evade firewall rules set to drop 0-byte packets.)
I am reading Fyodor's book, however I am having trouble understanding the behavior above.
Any ideas?
Thanks
--data-length adds data to every packet. Your TCP discovery options (-PS, -PA) are sending packets that do not usually contain data. In this case, these packets are more likely to be dropped or ignored since they are unusual. The case where --data-length is useful is for the -PE (ICMP Echo Request) discovery option. ICMP Echo Request datagrams are usually sent with some data payload, but Nmap defaults to empty probes, so IDS products like Snort will sometimes block or alert on these probes.
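So a sketch of where --data-length actually helps would pair it with the ICMP echo discovery probe (target range copied from the question):
nmap -sn -PE --data-length 56 xxx.xxx.xxx.xxx/27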

gawk to read last bit of binary data over a pipe without timeout?

I have a program already written in gawk that downloads a lot of small bits of info from the internet. (A media scanner and indexer)
At present it launches wget to get the information. This is fine, but I'd like to simply reuse the connection between invocations. It's possible a run of the program might make between 200 and 2000 calls to the same API service.
I've just discovered that gawk can do networking and found geturl
However, the advice at the bottom of that page is well heeded: I can't find an easy way to read the last line and keep the connection open.
As I'm mostly reading JSON data, I can set RS="}" and exit when the body length reaches the expected content-length. This might break with any trailing white space, though. I'd like a more robust approach. Does anyone have a nicer way to implement sporadic HTTP requests in awk that keep the connection open? Currently I have the following structure...
con="/inet/tcp/0/host/80";
send_http_request(con);
RS="\r\n";
read_headers();
# now read the body - but do not close the connection...
RS="}"; # for JSON
while ( con |& getline bytes ) {
body = body bytes RS;
if (length(body) >= content_length) break;
print length(body);
}
# Do not close con here - keep open
It's a shame this one little thing seems to be spoiling all the potential here. Also, in case anyone asks :) ..
awk was originally chosen for historical reasons - there were not many other language options on this embedded platform at the time.
Gathering up all of the URLs in advance and passing to wget will not be easy.
re-implementing in perl/python etc is not a quick solution.
I've looked at piping URLs through a named pipe into wget -i -, but that doesn't work: the data gets buffered and unbuffer is not available - also, I think wget gathers up all the URLs until EOF before processing.
The data is small so lack of compression is not an issue.
The problem with connection reuse comes from the HTTP 1.0 standard, not gawk. To reuse the connection you must either use HTTP 1.1 or try some other non-standard solutions for HTTP 1.0. Don't forget to add the Host: header to your HTTP/1.1 request, as it is mandatory.
You're right about the lack of robustness when reading the response body. For line-oriented protocols this is not an issue. Moreover, even when using HTTP 1.1, if your script locks up waiting for more data when it shouldn't, the server will, again, close the connection due to inactivity.
As a last resort, you could write your own HTTP retriever in whatever language you like, which reuses connections (all to the same remote host, I presume) and also inserts a special record separator for you. Then you could control it from the awk script.
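As a rough illustration of the request format only (not gawk code), here is a minimal sketch using bash's /dev/tcp, with api.example.com and /path as hypothetical placeholders:
exec 3<>/dev/tcp/api.example.com/80       # open one connection and keep fd 3 around for reuse
printf 'GET /path HTTP/1.1\r\nHost: api.example.com\r\nConnection: keep-alive\r\n\r\n' >&3
IFS= read -r status_line <&3              # read the status line of the response
echo "$status_line"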

Method to track lost packets source in FreeBSD

I have a FreeBSD host (some sort of HTTP proxy) with spikes in the number of retransmitted packets. Is there any way to track where the host is losing them (per incoming connection)?
I usually capture a bunch of them with tcpdump or similar, and then post-process them elsewhere. In your case that should not be hard, as you just need the headers.
Something like tcpdump with a small snap length (an -s value of under 200 bytes is enough for the headers) would do on the target machine.
Then compress/move this file off to a desktop machine to work on it. I'd start with something like Wireshark (simply use the filters).
Beyond that, simple grep-ing/wc-counting or a small Perl script may be called for. To save you re-inventing histograms, consider http://snippets.aktagon.com/snippets/62-How-to-generate-a-histogram-with-Perl or do a quick Google.
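A hedged sketch of that capture step (the interface name em0 and the output path are placeholders); in Wireshark, the display filter tcp.analysis.retransmission then isolates the retransmissions:
tcpdump -i em0 -s 200 -w /tmp/retrans.pcap tcp    # headers only, thanks to the small snap length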

Prevent FIFO from closing / reuse closed FIFO

Consider the following scenario:
A FIFO named test is created. In one terminal window (A) I run cat <test and in another (B) cat >test. It is now possible to write in window B and get the output in window A. It is also possible to terminate process A, relaunch it, and still be able to use this setup as expected. However, if you terminate the process in window B, B will (as far as I know) send an EOF through the FIFO to process A and terminate it as well.
In fact, if you run a process that does not terminate on EOF, you still won't be able to use the FIFO you redirected to that process, which I think is because the FIFO is considered closed.
Is there any way to work around this problem?
The reason I ran into this problem is that I'd like to send commands to my minecraft server running in a screen session, for example: echo "command" >FIFO_to_server. This is probably possible to do with screen by itself, but I'm not very comfortable with screen, and I think a solution using only pipes would be simpler and cleaner.
A is reading from a file. When it reaches the end of the file, it stops reading. This is normal behavior, even if the file happens to be a fifo. You now have four approaches.
Change the code of the reader to make it keep reading after the end of the file. That's saying the input file is infinite, and reaching the end of the file is just an illusion. Not practical for you, because you'd have to change the minecraft server code.
Apply unix philosophy. You have a writer and a reader who don't agree on protocol, so you interpose a tool that connects them. As it happens, there is such a tool in the unix toolbox: tail -f. tail -f keeps reading from its input file even after it sees the end of the file. Make all your clients talk to the pipe, and connect tail -f to the minecraft server:
tail -n +1 -f client_pipe | minecraft_server &
As mentioned by jilles, use a trick: pipes support multiple writers, and only become closed when the last writer goes away. So make sure there's a client that never goes away.
while true; do sleep 999999999; done >client_pipe &
The problem is that the server is fundamentally designed to handle a single client. To handle multiple clients, you should change to using a socket. Think of sockets as “meta-pipes”: connecting to a socket creates a pipe, and once the client disconnects, that particular pipe is closed, but the server can accept more connections. This is the clean approach, because it also ensures that you won't have mixed-up data if two clients happen to connect at the same time (using pipes, their commands could be interspersed). However, it requires changing the minecraft server.
Start a process that keeps the fifo open for writing and keeps running indefinitely. This will prevent readers from seeing an end-of-file condition.
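A minimal sketch of that idea applied to the question, with server_input and minecraft_server as hypothetical names:
mkfifo server_input
while true; do sleep 999999999; done > server_input &   # dummy writer keeps the fifo open
minecraft_server < server_input &
echo "command" > server_input                            # clients can now come and go freely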
From this answer -
On some systems like Linux, <> on a named pipe (FIFO) opens the named pipe without blocking (without waiting for some other process to open the other end), and ensures the pipe structure is left alive.
So you could do:
cat <>up_stream >down_stream
# the cat pipeline keeps running
echo 1 > up_stream
echo 2 > up_stream
echo 3 > up_stream
However, I can't find documentation about this behavior, so it could be an implementation detail specific to some systems. I tried the above on macOS and it works.
You can feed multiple inputs into a pipe by grouping the commands you need in parentheses, separated by semicolons, and redirecting the group to the FIFO you created with mkfifo yourpipe:
(cat file1; cat file2; ls -l;) > yourpipe
