cat file | ... vs ... <file - unix

Is there a case of ... or context where cat file | ... behaves differently than ... <file?

When reading from a regular file, cat is in charge of reading the data; it does so however it sees fit and may change the way the data is delivered to the pipeline. The contents themselves are obviously preserved, but everything else can be affected, for example the block size and the timing with which the data arrives. Additionally, the pipe itself isn't neutral: it serves as an additional buffer between the input and ....
Quick and easy way to make the block size issue apparent:
$ cat large-file | pv >/dev/null
5,44GB 0:00:14 [ 393MB/s] [ <=> ]
$ pv <large-file >/dev/null
5,44GB 0:00:03 [1,72GB/s] [=================================>] 100%

Besides what other users have pointed out: when using input redirection from a file, standard input is the file itself, but when piping the output of cat to the input, standard input is a stream carrying the contents of the file. When standard input is the file, the program is able to seek within it, but a pipe does not allow seeking. You can see this by finding a zip file and running the following commands:
zipinfo /dev/stdin < thezipfile.zip
and
cat thezipfile.zip | zipinfo /dev/stdin
The first command shows the contents of the zip file, while the second shows an error, though it is a misleading one because zipinfo does not check the result of the seek call and only fails later on.
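If you want to watch the seek fail directly, strace (on Linux) makes it visible; a rough sketch, where thezipfile.zip is any zip archive:
$ strace -e trace=lseek zipinfo /dev/stdin < thezipfile.zip
$ cat thezipfile.zip | strace -e trace=lseek zipinfo /dev/stdin
In the first invocation the lseek calls succeed; in the second they fail with ESPIPE (Illegal seek), because a pipe has no file position to seek to.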

A useless use of cat is always to be avoided. It's like driving with the handbrake on. It wastes CPU cycles for nothing, with the OS constantly context-switching between the cat process and the next one in the pipe. If all the world's useless cats were gone and stopped being invented, reinvented, and passed on from father to son, we wouldn't have global warming because we could easily live on the 1.21 gigawatts of power saved.
Thanks. I feel better now. Please join me in my crusade to stamp out the useless use of cat on stackoverflow. This site is, as far as I can see, a major contributor to the proliferation of useless cats. I don't blame the newbies, but I do want to teach them. Workers and newbies of the world, loosen the handbrakes and save the planet!!!1!

cat will allow you to pipe multiple files in, one after another, in sequence. Otherwise, < redirection and cat file | behave essentially the same.
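For example (the file names here are just placeholders):
$ cat part1.log part2.log part3.log | grep ERROR
sends all three files through the pipe in order, which a single < redirection cannot do.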

Pipes cause the command on the right-hand side to run in a subshell (at least in bash). Any shell variables set or modified inside that subshell are lost when it exits, whereas a loop fed by input redirection runs in the current shell, so its changes persist:
cat foo | while read line
do
    ...
done
echo "$line"
versus
while read line
do
    ...
done < foo
echo "$line"
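A concrete way to see the variable loss in bash is to set a counter inside the loop (foo here is just a two-line test file):
$ printf 'a\nb\n' > foo
$ count=0; cat foo | while read line; do count=$((count+1)); done; echo "$count"
0
$ count=0; while read line; do count=$((count+1)); done < foo; echo "$count"
2
The first pipeline increments count only inside the subshell, so the change is gone by the time echo runs; the redirected loop runs in the current shell, so the change survives.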

One further difference is behavior on a blocking open() of the input file.
For example, assuming input is a FIFO with no writers, one invocation will not spawn any child programs until the input file is opened, while the other will spawn two processes:
prog ... < a_fifo # 'prog' not launched until shell can open file
cat a_fifo | prog ... # 'prog' and 'cat' are running (latter may block on open)
In practice this rarely matters except in contrived circumstances. prog might periodically log or do some cleanup work while waiting for input, for example, which you might want to happen even if no input is available. (Why wouldn't prog be sophisticated enough to open its own input fifo nonblocking?)
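To make this concrete, here is a small sketch using wc -l as a stand-in for prog (assumes mkfifo and pgrep are available):
$ mkfifo a_fifo
$ wc -l < a_fifo &
$ pgrep -l wc              # nothing: the shell is still blocked in open(), wc was never started
$ echo one line > a_fifo   # a writer opens the FIFO; wc now starts, reads, and prints 1
$ cat a_fifo | wc -l &
$ pgrep -l wc              # wc (and cat) exist right away; this time cat is the one blocked in open()
$ echo one line > a_fifo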

cat file | starts up another program (cat) that doesn't have to be started in the redirection case. It also makes things more confusing if you want to use "here documents". Otherwise, the two should behave the same.

Related

How does a program "know" to terminate early when its output is piped to head?

Say I need to run a UNIX command-line program on some big file. If I want to see a bit of the output, I can pipe the output to head:
$ cat big_file.txt | head
For cat and many similar programs, this command runs extremely fast. Since only the first few lines are needed, program execution stops as soon as the requirement on the other end of the pipe is satisfied. So we all know and love this kind of pipe command for quickly exploring what's in a file or what kind of output a program will write for us.
How does this work at the code level? Does some kind of signal get sent to cat telling it that its output isn't needed anymore, and it can stop running now?
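Essentially yes: nothing is negotiated up front, but once head exits, the read end of the pipe is closed, and the next time the writer calls write() it receives SIGPIPE (or, if it ignores the signal, the write fails with EPIPE); the default action of SIGPIPE terminates the writer. You can see the effect in bash via PIPESTATUS, since a process killed by SIGPIPE exits with status 128 + 13 = 141 (a sketch assuming bash and a fast writer such as seq):
$ seq 1 100000000 | head -n 1
1
$ echo "${PIPESTATUS[@]}"
141 0
Here 141 is seq's status (killed by SIGPIPE) and 0 is head's normal exit status.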

Ring buffer log file on unix

I'm trying to come up with a unix pipeline of commands that will allow me to log only the most recent n lines of a program's output to a text file.
The text file should never be more than n lines long. (it may be less when it is first filling the file)
It will be run on a device with limited memory/resources, so keeping the filesize small is a priority.
I've tried stuff like this (n=500):
program_spitting_out_text > output.txt
cat output.txt | tail -500 > recent_output.txt
rm output.txt
or
program_spitting_out_text | tee output.txt | tail -500 > recent_output.txt
Obviously neither works for my purposes...
Anyone have a good way to do this in a one-liner? Or will I have to write a script/utility?
Note: I don't want anything to do with dmesg and must use standard BSD unix commands. The "program_spitting_out_text" prints out about 60 lines every second.
Thanks in advance!
If program_spitting_out_text runs continuously and keeps its file open, there's not a lot you can do.
Even deleting the file won't help, since the program will continue to write to the now "hidden" file (the data still exists but there is no directory entry for it) until it closes it, at which point it will really be removed.
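You can watch this happen with lsof, for instance (assuming lsof is installed; program_spitting_out_text is the placeholder from the question):
$ program_spitting_out_text > output.txt &
$ rm output.txt
$ lsof -p $! | grep deleted
On Linux, the entry for output.txt is marked "(deleted)", yet the file keeps growing until the writer closes it.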
If it closes and reopens the log file periodically (every line or every ten seconds or whatever), then you have a relatively easy option.
Simply monitor the file until it reaches a certain size, then roll the file over, something like:
while true; do
    sleep 5
    lines=$(wc -l <file.log)
    if [[ $lines -ge 5000 ]]; then
        rm -f file2.log
        mv file.log file2.log
        touch file.log
    fi
done
This script will check the file every five seconds and, if it's 5000 lines or more, will move it to a backup file. The program writing to it will continue to write to that backup file (since it has the open handle to it) until it closes it, then it will re-open the new file.
This means you will always have (roughly) between five and ten thousand lines in the log file set, and you can search them with commands that combine the two:
grep ERROR file2.log file.log
Another possibility is if you can restart the program periodically without affecting its function. By way of example, a program which looks for the existence of a file once a second and reports on that, can probably be restarted without a problem. One calculating PI to a hundred billion significant digits will probably not be restartable without impact.
If it is restartable, then you can basically do the same trick as above. When the log file reaches a certain size, kill off the current program (which you will have started as a background task from your script), do whatever magic you need in rolling over the log files, then restart the program.
For example, consider the following (restartable) program prog.sh which just continuously outputs the current date and time:
#!/usr/bin/bash
while true; do
    date
done
Then, the following script will be responsible for starting and stopping the other script as needed, by checking the log file every five seconds to see if it has exceeded its limits:
#!/usr/bin/bash
exe=./prog.sh
log1=prog.log
maxsz=500
pid=-1
touch ${log1}
log2=${log1}-prev
while true; do
    if [[ ${pid} -eq -1 ]]; then
        lines=${maxsz}
    else
        lines=$(wc -l <${log1})
    fi
    if [[ ${lines} -ge ${maxsz} ]]; then
        if [[ $pid -ge 0 ]]; then
            kill $pid >/dev/null 2>&1
        fi
        sleep 1
        rm -f ${log2}
        mv ${log1} ${log2}
        touch ${log1}
        ${exe} >> ${log1} &
        pid=$!
    fi
    sleep 5
done
And this output (from an every-second wc -l on the two log files) shows what happens at the time of switchover, noting that it's approximate only, due to the delays involved in switching:
474 prog.log 0 prog.log-prev
496 prog.log 0 prog.log-prev
518 prog.log 0 prog.log-prev
539 prog.log 0 prog.log-prev
542 prog.log 0 prog.log-prev
21 prog.log 542 prog.log-prev
Now keep in mind that's a sample script. It's relatively intelligent but probably needs some error handling so that it doesn't leave the executable running if you shut down the monitor.
And, finally, if none of that suffices, there's nothing stopping you from writing your own filter program which takes standard input and continuously outputs that to a real ring buffer file.
Then you would simply do:
program_spitting_out_text | ringbuffer 4096 last4k.log
That program could be a true ring buffer in that it treats the 4k file as a circular character buffer but, of course, you'll need a special marker in the file to indicate the write-point, along with a program that can turn it back into a real stream.
Or, it could do much the same as the scripts above, rewriting the file so that it's always below the size desired.
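As a rough illustration of that last variant, here is a sketch of such a filter done purely in shell; ringbuffer.sh and its arguments are made up for this example, and it trims by rewriting the file whenever it grows to twice the requested length rather than maintaining a true circular buffer:
#!/bin/sh
# usage: program_spitting_out_text | ringbuffer.sh 500 recent_output.txt
n=$1
out=$2
: > "$out"                               # start with an empty log file
while IFS= read -r line; do
    printf '%s\n' "$line" >> "$out"
    # once the file reaches 2*n lines, rewrite it with only the last n
    if [ "$(wc -l < "$out")" -ge $((2 * n)) ]; then
        tail -n "$n" "$out" > "$out.tmp" && mv "$out.tmp" "$out"
    fi
done
A reader of recent_output.txt may momentarily see up to 2n lines, but the file never grows beyond that, and the tail/mv rewrite only happens once every n lines instead of on every line.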
Since this basic feature (a circular file) apparently does not exist on GNU/Linux, and because I needed it to track logs on my Raspberry Pi with limited storage, I just wrote the code as suggested above!
Behold: circFS
Unlike other tools mentioned in this post and similar ones, the maximum size is arbitrary and limited only by the available storage.
It does not rotate across several files; everything is kept in a single file, which is rewritten on "release".
You can have as many log files as needed in the virtual directory.
It is a single C file (~600 lines including comments), and it builds with a single compile line after having installed fuse development dependencies.
This first version is very basic (see the README); if you want to improve it with some of the TODOs (see the TODO), pull requests are welcome.
As a joke, this is my first "write only" fuse driver! :-)

In what order does cat choose files to display?

I have the following line in a bash script:
find . -name "paramsFile.*" | xargs -n131072 cat > parameters.txt
I need to make sure the order the files are concatenated in does not change when I use this command. For example, if I run this command twice on the same set of paramsFile.*, parameters.txt should be the same both times. My question is, is this the case? And if it isn't, how can I make sure it is?
Thanks!
Edit: the same question goes for xargs: would that change how the files are fed to cat?
Edit2: as William Pursell pointed out, this question is actually about find. Does find always return files in the same order?
From the description in man cat:
The cat utility reads files sequentially, writing them to the standard output. The file operands are processed in command-line order.
If file is a single dash (`-') or absent, cat reads from the standard input. If file is a UNIX domain socket, cat connects to it and then reads it until EOF. This complements the UNIX domain binding capability available in inetd(8).
So yes, as long as you pass the files to cat in the same order every time, you'll be OK.
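If you want to make the order deterministic regardless of how find happens to traverse the directory tree, sort the names first; a sketch, assuming none of the file names contain newlines:
find . -name "paramsFile.*" | sort | xargs -n131072 cat > parameters.txt
xargs preserves the order of its input, so even if it splits the list and runs cat more than once, the files are still concatenated in sorted order.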

Piping program output to less does not display beginning of the output

I am trying to make a bunch of files in my directory, but the files are generating ~200 lines of errors, so they fly past my terminal screen too quickly and I have to scroll up to read them.
I'd like to pipe the output that displays on the screen to a pager that will let me read the errors starting at the beginning. But when I try
make | less
less does not display the beginning of the output - it displays the end of the output that's usually piped to the screen, and then tells me the output is 1 line long. When I try typing Gg, the only line on the screen is the line of the makefile that executed, and the regular screen output disappears.
Am I using less incorrectly? I haven't really ever used it before, and I'm having similar problems with something like, sh myscript.sh | less where it doesn't immediately display the beginning of the output file.
The errors from make appear on the standard error stream (stderr in C), which is not redirected by normal pipes. If you want to have it redirected to less as well, you need either make |& less (csh, etc.) or make 2>&1 | less (sh, bash, etc.).
Error output is sent to a slightly different place which isn't caught by normal pipelines, since you often want to see errors but not have them intermixed with data you're going to process further. For things like this you use a redirection:
$ make 2>&1 | less
In bash and zsh (and csh/tcsh, which is where they borrowed it from) this can be shortened to
$ make |& less
With things like make, which are prone to producing lots of errors that I'll want to inspect later, I generally capture the output to a file and then less that file:
$ make |& tee make.log
$ less make.log

How do you resolve issues with named pipes?

I have a binary program* which takes the contents of a supplied file, processes it, and prints the result on the screen through stdout. For an automation script, I would like to use a named pipe to send data to this program and process the output myself. After trying to get the script to work I realized that there is an issue with the binary program accepting data from the named pipe. To illustrate the problem I have outlined several tests using the unix shell.
It is easy to show that the program works by processing an actual data file.
$ binprog file.txt > output.txt
This will result in output.txt containing the processed information from file.txt.
The named pipe (pipe.txt) works as seen by this demonstration.
$ cat pipe.txt > output.txt
$ cat file.txt > pipe.txt
This will result in output.txt containing the data from file.txt after it has been sent through the pipe.
When the binary program is reading from the named pipe instead of the file, things do not work correctly.
$ binprog pipe.txt > output.txt
$ cat file.txt > pipe.txt
In this case output.txt contains no data even after cat and binprog terminate. Using top and ps, I can see binprog "running" and seemingly doing work. Everything executes with no errors.
Why is there no output produced by binprog in this third example?
What are some things I could try to get this working?
[*] The program in question is svm-scale from libsvm. I chose to generalize the examples to keep them clean and simple.
Are you sure the program will work with a pipe? If it needs random access to the input file it won't work. The program will get an error whenever it tries to seek in the input file.
If you know the program is designed to work with pipes, and you're using bash, you can use process substitution to avoid having to explicitly create the named pipe.
binprog <(cat file.txt) > output.txt
Does binprog also accept input on stdin? If so, this might work for you.
cat pipe.txt | binprog > output.txt
cat file.txt > pipe.txt
Edit: Briefly scanned the manpage for svm-scale. Give this a whirl instead:
cat pipe.txt | svm-scale - > output.txt
If binprog is not working well with anything other than a terminal as an input, maybe you need to give it a (pseudo-)terminal (pty) for its input. That is harder to organize, but the expect program is one way of doing that relatively easily. There are discussions of programming with pty's in
Advanced Programming in the Unix Environment, 3rd Edn by W Richard Stevens and Stephen A Rago, and in Advanced Unix Programming, 2nd Edn by Marc J Rochkind.
Something else to look at is the output of truss or strace or the local equivalent. These programs log all the system calls made by a process. On Solaris, I'd run:
truss -o binprog.truss binprog
interactively, and see what it does. Then I'd try it with i/o redirection, and then with i/o redirection from the named pipe; there may be some significant differences between what it does, or you may see the system call that is hanging. If you see forks in the truss log file, you would need to add a '-f' flag to follow children.
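On Linux, the strace equivalent would be along these lines (the log file name is arbitrary):
$ strace -f -o binprog.strace binprog pipe.txt
where -f follows child processes; the resulting binprog.strace should show which call blocks or which seek fails when the input is the named pipe.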

Resources