Unix: What are stdin/out/err REALLY?

Assuming the following are correct...
stdin, stdout, and stderr are streams
streams are file descriptors
file descriptors are numbers/indexes in the kernel representing open files
Questions:
a. Does it follow by transitivity that stdin/out/err involve open files? So if I do ls /dir, does ls output the results to a file referred to by stdout(2)?
b. Where does the above file live? In /proc/<pid>/? Or is that where the FD lives?
c. What is /dev/stdout? If I do vim /dev/stdout, vim tells me it is not a file. I see there's a series of links that lead to /dev/pts/27. What is going on? I tried to cat /dev/stdout but nothing happens.
d. In general, how is it that "files" in Linux are actually NOT files?

Some of your assumptions are incorrect. For example, stdin is of type FILE*; it's not a "file descriptor".
stdin, stdout, and stderr are macros defined in <stdio.h>. (Yes, they're required to be macros, not just variable names). They expand to expressions of type FILE*, and they point to the FILE objects associated with the standard input, output, and error streams.
A "file descriptor" is a small integer value representing a POSIX stream. On UNIX-like systems, FILE* values are generally associated with file descriptors (you can use the fileno and fdopen functions to go from one to the other), but they're not the same thing.
Basically, there are two distinct I/O systems, one built on top of the other. The lower level system uses numeric file descriptors, manipulated via the open, read, write, and close functions, and so forth. The higher level, as defined by the ISO C standard, uses pointers of type FILE*, manipulated with fopen, fread, fwrite, fprintf, putchar, fclose, and so forth.
As I mentioned, on UNIX-like systems, the C standard layer is generally implemented on top of the POSIX layer. On non-POSIX systems (like MS Windows), the C standard layer may be implemented on top of some other system-specific interface.
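To make the two layers concrete, here is a minimal sketch (POSIX assumed) that looks at the same output stream from both sides, using the fileno and fdopen bridging functions mentioned above:

/* Minimal sketch, POSIX assumed: the same stream seen through both layers.
 * fileno() maps a FILE* to its underlying descriptor; fdopen() wraps an
 * existing descriptor in a new FILE*. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = fileno(stdout);                  /* usually 1 */
    printf("stdout is backed by file descriptor %d\n", fd);
    fflush(stdout);                           /* drain the stdio buffer first */

    write(fd, "written with write(2)\n", 22); /* lower, POSIX layer */

    FILE *again = fdopen(dup(fd), "w");       /* wrap a duplicate fd in a FILE* */
    if (again != NULL) {
        fprintf(again, "written through a FILE* made by fdopen()\n");
        fclose(again);
    }
    return 0;
}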
Linux and other UNIX-like systems try (incompletely) to follow an "everything is a file" philosophy. There are a number of file-like entities under /proc. These are not physical files stored on disk; they're entities that can be accessed using either the POSIX or ISO C I/O layers. Neither layer requires the "files" it deals with to be actual disk files, so there's nothing inconsistent about this.
man proc for more information on what's under the /proc directory (there's far more detail than I can put in this answer).
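As a small illustration of that last point, here is a sketch (Linux assumed) that reads one of those /proc entries with the ordinary ISO C stream functions, exactly as if it were a regular file:

/* Sketch, Linux assumed: /proc/self/status is not a file on disk, yet it can
 * be opened and read like one. */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    char line[256];
    while (fgets(line, sizeof line, f) != NULL)
        fputs(line, stdout);    /* text generated by the kernel on the fly */
    fclose(f);
    return 0;
}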

Related

In a UNIX pipeline how to get the user-tool interaction of the first stage piped into the next stage?

Page 32 of the book "The UNIX Programming Environment" makes this profound statement about UNIX pipes:
The programs in a pipeline actually run at the same time, not one after another. This means that the programs in a pipeline can be interactive; the kernel looks after whatever scheduling and synchronization is needed to make it all work.
Wow! By using pipes I can get parallel processing for free!
"I've got to illustrate this awesome capability to my colleagues" I thought. I will implement a demo: create a simple interactive tool that mimics what I type in at the command window. I'll name that tool mimic. Use a pipe to connect it to another tool which counts the number of lines and characters. I'll name that tool wc (sadly, I am working on Windows, which doesn't have the UNIX wc program so I must implement my own). The two tools will run on a command line like this:
mimic | wc
The output of mimic is piped into wc.
Well, that's the idea.
I implemented the mimic tool with a very simple Flex lexer, which I show below. Compiling it generated mimic.exe.
When I run mimic.exe from the command line it does indeed mimic what I type:
> mimic
hello world
hello world
greetings
greetings
ctrl-c
I implemented wc using AWK. The AWK program (wc.awk) is shown below. When I run wc from the command line it does indeed count the lines and characters:
> echo Hello World | awk -f wc.awk
lines 1 chars 13
However, when I put them together with a pipe, they don't work as I imagined they would. Here's a sample run:
> mimic | awk -f wc.awk
hello world
greetings
ctrl-c
Hmm, nothing. No mimicking. No line counts. No char counts.
How come it's not working? What am I doing wrong, please?
What can I do to make it work as I expected? I expected it to work like this: I type something in at the command line. mimic repeats it, sending it to the pipe which sends it to wc which reports the number of lines and characters. I type in the next thing at the command line. mimic repeats it, sending it to the pipe which sends it to wc which reports the number of lines and characters. And so forth. That's the behavior I thought I was implementing.
Here is my simple Flex lexer (mimic):
%option noyywrap
%option always-interactive
%%
%%
int main(int argc, char *argv[])
{
    yyin = stdin;
    yylex();
    return 0;
}
Here is my simple AWK program (wc):
{ nchars = nchars + length($0) + 1 }
END { printf("lines %-10d chars %d\n", NR, nchars) }
This question has little or nothing to do with Flex, Bison, Awk, and really not so much about Unix, either (since you're experimenting with Windows).
I don't have Windows handy, but the underlying issue is basically about stdio buffering, so it's reproducible on Unix as well.
To simplify, I only implemented mimic, which I did directly rather than using Flex (which is clearly overkill):
#include <stdio.h>

int main(void) {
    /* Copy stdin to stdout, one character at a time. */
    for (int ch; (ch = getchar()) != EOF; ) putchar(ch);
    return 0;
}
Since you use %option always-interactive, which forces Flex to read one character at a time with fgetc(), this program makes essentially the same sequence of standard library calls as yours, except that the one-byte fwrite is simplified to the equivalent putchar.
It certainly has the same execution characteristic:
$ ./mimic
Here we go round the mulberry busy,
Here we go round the mulberry busy,
the mulberry bush, the mulberry bush.
the mulberry bush, the mulberry bush.
In the above, I signalled end-of-input by typing Ctrl-D for the third line. On Windows, I would have had to type Ctrl-Z followed by Enter to get the same effect. If I kill the execution by typing Ctrl-C instead, I get roughly the same result (other than the fact that Ctrl-C shows up in the console):
$ ./mimic
Here we go round the mulberry bush,
Here we go round the mulberry bush,
the mulberry bush, the mulberry bush.
the mulberry bush, the mulberry bush.
^C
Now, since mimic just copies stdin to stdout, I might expect to be able to pipe it into itself and get the same result. But the output is a little different:
$ ./mimic | ./mimic
Here we go round the mulberry bush,
the mulberry bush, the mulberry bush.
Here we go round the mulberry bush,
the mulberry bush, the mulberry bush.
Again, I signaled end of input by typing Ctrl-D. And it was only after I typed the Ctrl-D that any output appeared; at that point, both lines were echoed. And if I terminate the program abruptly with Ctrl-C, I don't get any output at all:
$ ./mimic | ./mimic
Here we go round the mulberry bush,
the mulberry bush, the mulberry bush.
^C
OK, that's the data. Now, we need an explanation. And the explanation has to do with the way C standard library streams are buffered, which you can read about in the setvbuf manpage (and in many places on the web). I'll summarise:
The C standard specifies that all input functions execute "as if" implemented by repeated calls to fgetc, and all output functions "as if" implemented by repeated calls to fputc. Note that this means that there is nothing special about the chunk of bytes written by a single printf or fwrite. The library does not do anything to ensure that the sequence of calls is atomic. If you have two processes both writing to stdout, the messages can get interleaved, and that will happen from time to time.
The data written by fputc (and, consequently, all stdio output functions) is actually placed into an output buffer. This buffer is not part of the Operating System (which may add another layer of buffering). It's strictly part of the stdio library functions, which are ordinary userland functions. You could write them yourself (and it's a pretty good exercise to do so). From time to time, the contents of this buffer are sent to the appropriate operating system interface in order to be transferred to the output device.
"From time to time" is deliberately unspecific. There are three standard buffering modes, although the standard doesn't require them all to be used by a particular library implementation, and it also doesn't restrict the library implementation from using different buffering modes (although the three specified modes do basically cover the useful possibilities). However, most C library implementations you're likely to use do implement all three, pretty well as described in the standard. So take the following as a description of a common implementation technique, but be aware that on certain idiosyncratic platforms, it might not be accurate.
The three buffering modes are:
Unbuffered. In this mode, each byte written is transferred to the operating system (and, it is hoped, to the actual output device) as soon as possible.
Fully buffered. In this mode, there is a buffer of a predetermined size (often 8 kilobytes, but different library implementations have different defaults for different platforms). If you want to, you can supply your own buffer (of an arbitrary size) for a particular output stream, using the setvbuf standard library function (q.v.). Fully-buffered output might stay in the output buffer until it is full (although a given implementation may release output earlier). The buffer will, however, be sent to the operating system if you call fflush or fclose on the stream, or if fclose is called automatically when main returns. (It's not sent if the process dies, though.)
Line-buffered. In this mode, the stream again has an output buffer of a predetermined size, which is usually exactly the same as the buffer used in "fully-buffered" mode. The difference is that the buffer is also sent to the operating system when a new line character ('\n') is written. (If the buffer gets full before an end-of-line character is written, then it is sent to the operating system, just as in Fully-Buffered mode. But most of the time, lines will be fairly short, so the buffer will be sent to the OS at the end of each line.)
Finally, you need to know that stdout is fully-buffered by default unless the standard library can determine that stdout is connected to some kind of console device, in which case it is not fully-buffered. (On Unix, it is typically line-buffered.) By contrast, stderr is not fully-buffered. (On Unix, it is typically unbuffered.) You can change the buffering mode, and the buffer size if relevant, by calling setvbuf before the first write operation to the stream.
The above-mentioned defaults are one of the reasons you are encouraged to write error messages to stderr rather than stdout: since stderr is unbuffered, the error message will appear as soon as possible. Also, you should normally put \n at the end of an output line; if standard output is a terminal and therefore line-buffered, the \n will ensure that the line is actually output, rather than languishing in the output buffer.
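For reference, here is a small sketch of how those modes are requested in code; the particular streams and messages are just illustrative:

/* Sketch: selecting buffering modes with setvbuf before the first write. */
#include <stdio.h>

int main(void) {
    static char buf[BUFSIZ];

    /* Fully buffered: output can sit in buf until it fills, or until
     * fflush/fclose. (_IOLBF would request line buffering the same way.) */
    setvbuf(stdout, buf, _IOFBF, sizeof buf);
    printf("held in the stdio buffer for now\n");
    fflush(stdout);                     /* push it to the OS explicitly */

    /* Unbuffered: every byte is handed to the OS as soon as possible. */
    setvbuf(stderr, NULL, _IONBF, 0);
    fprintf(stderr, "appears immediately\n");
    return 0;
}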
You can see all of this in action in the above examples. When I just ran ./mimic, leaving stdout mapped to the terminal, the output showed up each time I entered a line. (That also has to do with the way terminal input is handled by the terminal driver, which is another kettle of fish.)
But when I piped mimic into itself, the first mimic's standard output is redirected to a pipe. A pipe is not a terminal, so that mimic's stdout is fully-buffered by default. Since the buffer is longer than the total input, the entire program runs without sending anything to stdout, until the buffer is flushed when stdout is implicitly closed by main returning.
Moreover, if I kill the process (by typing Ctrl-C, for example, or by sending it a SIGKILL signal), then the output buffer is never sent to the operating system, and nothing appears on the console.
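Following from that explanation, one way to make the original mimic | wc pipeline behave more like the question expected is to flush stdout whenever a full line has been copied; here is a sketch of the copying loop with that change (calling setvbuf(stdout, NULL, _IOLBF, 0) before the loop would be another option):

/* Sketch: flush after each completed line so it crosses the pipe at once. */
#include <stdio.h>

int main(void) {
    for (int ch; (ch = getchar()) != EOF; ) {
        putchar(ch);
        if (ch == '\n')
            fflush(stdout);     /* hand the finished line to the pipe now */
    }
    return 0;
}

Note that even with this change, the AWK counter in the question only prints in its END block, so the line and character counts will still appear only after mimic's input ends.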
If you're writing console apps using standard C library calls, it's very important to understand how stdio output buffering affects the sequence of outputs you see. That's just as true on Windows as on Unix. (Of course, it doesn't apply if you use native Windows or Posix I/O interfaces.)

Looping through the content of a file in Zsh

I'm trying to loop through the contents of a file in zsh. In my loop I want to get user input. Going off of this answer for Bash, I'm attempting to do:
while read -u 10 line; do
  echo $line;
  # TODO read from stdin here, etc.
done 10<myfile.txt
However I get an error:
zsh: parse error near `10'
Referring to the 10 after the done. Obviously I'm not getting the file descriptor syntax right, but I'm having trouble figuring out the docs.
Use a file descriptor number less than 10. If you want to hard-code file descriptor numbers, stick to the range 3-9 (plus 0-2 for stdin, stdout, and stderr). When zsh needs file descriptors itself, it uses them in the 10+ range.
If you're even getting close to needing more than the 7 available hard-coded file descriptors, you should really think about using variables to name them. Syntax like exec {myfd}<myfile.txt will open a file with zsh allocating a file descriptor greater than 10 and assigning it to $myfd.
Bourne shell syntax is not entirely unambiguous when file descriptor numbers reach 10 and over, and even in bash, I'd advise against using them. I'm not entirely sure how bash avoids conflicts if it needs to open any for internal use - I guess it never needs to leave any open. This may look like a zsh limitation at first sight, but it is actually a sensible feature.

unix: can i write to the same file in parallel without missing entries?

I wrote a script that executes commands in parallel. I let them all write an entry to the same log file. It does not matter if the order is wrong or entries are interleaved, but I noticed that some entries are missing. I should probably lock the file before writing; however, is it true that if multiple processes try to write to a file simultaneously, it will result in missing entries?
Yes, if different processes independently open and write to the same file, it may result in overlapping writes and missing data. This happens because each process gets its own file pointer, which advances only with its own writes.
Instead of locking, a better option might be to open the log file once in an ancestor of all the worker processes, have it inherited across fork(), and have them all use it for logging. That way there is a single shared file pointer, which advances whenever any of the processes writes a new entry.
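Here is a minimal sketch of that arrangement (POSIX assumed; the log file name and messages are invented for the example), combined with the append mode discussed in the next answer:

/* Sketch: open the log once in the parent, let the workers inherit the
 * descriptor across fork(), and write each entry with a single write(). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int logfd = open("job.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (logfd < 0) {
        perror("open");
        return 1;
    }
    for (int i = 0; i < 4; i++) {
        if (fork() == 0) {                   /* worker process */
            char entry[64];
            int n = snprintf(entry, sizeof entry, "worker %d finished\n", i);
            write(logfd, entry, n);          /* one write per log entry */
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;                                    /* wait for all workers */
    close(logfd);
    return 0;
}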
In a script you should use ">> file" (double greater-than) to append output to that file; the shell opens the destination in append mode. If your program itself wants to append, follow the directives below:
Open a text file in "append" mode ("a+") and give preference to printing only full lines (don't do multiple 'print' calls followed by a final 'println'; print the entire line with a single 'println').
The fopen documentation states this:
DESCRIPTION
The fopen() function opens the file whose pathname is the string pointed to by filename, and associates a stream with it.
The argument mode points to a string beginning with one of the following sequences:
r or rb            Open file for reading.
w or wb            Truncate to zero length or create file for writing.
a or ab            Append; open or create file for writing at end-of-file.
r+ or rb+ or r+b   Open file for update (reading and writing).
w+ or wb+ or w+b   Truncate to zero length or create file for update.
a+ or ab+ or a+b   Append; open or create file for update, writing at end-of-file.
The character b has no effect, but is allowed for ISO C standard conformance (see standards(5)). Opening a file with read mode (r as the first character in the mode argument) fails if the file does not exist or cannot be read.
Opening a file with append mode (a as the first character in the mode argument) causes all subsequent writes to the file to be forced to the then current end-of-file, regardless of intervening calls to fseek(3C). If two separate processes open the same file for append, each process may write freely to the file without fear of destroying output being written by the other. The output from the two processes will be intermixed in the file in the order in which it is written.
It is because of this intermixing that you want to emit each entry with a single 'println' (or its equivalent), so that each entry is written out in one piece.
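To tie the two pieces of advice together, here is a sketch of a logging helper that opens the file in append mode and emits each entry with a single call; the file name and message format are made up for the example:

/* Sketch: append-mode logging, one call per complete line. */
#include <stdio.h>

static void log_entry(const char *who, const char *what) {
    FILE *log = fopen("run.log", "a");       /* "a" sets up append-at-EOF writes */
    if (log == NULL)
        return;
    fprintf(log, "[%s] %s\n", who, what);    /* build the whole line in one call */
    fclose(log);                             /* flushes the buffered line */
}

int main(void) {
    log_entry("worker-1", "started");
    log_entry("worker-1", "finished");
    return 0;
}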

Are there any equivalent of C/C++ __FILE__ and __LINE__ macros in R?

I'm trying to get the equivalent of the __FILE__ or __LINE__ macros in C or C++ in R (or S+). Any ideas?
__FILE__ The presumed name of the current source file (a character string literal).
__LINE__ The presumed line number (within the current source file) of the current source line (an integer constant).
As for context - I have log messages being flushed to the console from different sections of the code, and given that the messages themselves are built at run-time, it is often very difficult to find out where a log message is coming from (with the size of the R code growing to many thousands of lines and running on a distributed grid). However, if I could dump __FILE__ and __LINE__ along with the log messages, it would be much easier to trace the logs...
Use the #line directive. The structure is #line nn "filename". See Duncan Murdoch's article on source references for more.

What is the general syntax of a Unix shell command?

In particular, why is it that the options to some commands are sometimes preceded by a + sign and sometimes by a - sign?
for example:
sort -f
sort -nr
sort +4n
sort +3nr
These days, the POSIX convention implemented by getopt() (aka getopt(3)) is widely used as a standard notation, but in the early days, people were experimenting. On some machines, the sort command no longer supports the + notation. However, various commands (notably ar and tar) accept controls without any prefix character - and dd (alluded to by Alok in a comment) uses another convention altogether.
The GNU convention of using '--' for long options (supported by getopt_long(3)) replaced an earlier convention of using '+'. Of course, the X11 software uses a single dash before multi-character options. So, the whole thing is a collection of historic relics as people experimented with how best to handle it.
POSIX documents the Utility Conventions that it works to, except where historical precedent is stronger.
What styles of option handling are there?
[At one time, SO 367309 contained the following material as my answer. It was originally asked 2008-12-15 02:02 by FerranB, but was subsequently closed and deleted.]
How many different types of options do you recognize? I can think of many, including:
Single-letter options preceded by single dash, groupable when there is no argument, argument can be attached to option letter or in next argument (many, many Unix commands; most POSIX commands).
Single-letter options preceded by single dash, grouping not allowed, arguments must be attached (RCS).
Single-letter options preceded by single dash, grouping not allowed, arguments must be separate (pre-POSIX SCCS, IIRC).
Multi-letter options preceded by single dash, arguments may be attached or in next argument (X11 programs; also Java and many programs on Mac OS X with a NeXTSTEP heritage).
Multi-letter options preceded by single dash, may be abbreviated (Atria Clearcase).
Multi-letter options preceded by single plus (obsolete).
Multi-letter options preceded by double dash; arguments may follow '=' or be separate (GNU utilities).
Options without prefix/suffix, some names have abbreviations or are implied, arguments must be separate. (AmigaOS Shell)
For options taking an optional argument, sometimes the argument must be attached (co -p1.3 rcsfile.c), sometimes it must follow an '=' sign. POSIX doesn't support optional arguments meaningfully (the POSIX getopt() only allows them for the last option on the command line).
All sensible option systems use an option consisting of double-dash ('--') alone to mean "end of options" — the following arguments are "non-option arguments" (usually file names; POSIX calls them 'operands') even if they start with a dash. (I regard supporting this notation as an imperative. Be aware that if the -- is preceded by an option requiring an argument, the -- will be treated as the argument to the option, not as the 'end of options' marker.)
Many but not all programs accept single dash as a file name to mean standard input (usually) or standard output (occasionally). Sometimes, as with GNU 'tar', both can be used in a single command line:
... | tar -cf - -F - | ...
The first solo dash means 'write to stdout'; the second means 'read file names from stdin'.
Some programs use other conventions — that is, options not preceded by a dash. Many of these are from the oldest days of Unix. For example, 'tar' and 'ar' both accept options without a dash, so:
tar cvzf /tmp/somefile.tgz some/directory
The dd command uses opt=value exclusively:
dd if=/some/file of=/another/file bs=16k count=200
Some programs allow you to interleave options and other arguments completely; the C compiler, make and the GNU utilities run without POSIXLY_CORRECT in the environment are examples. Many programs expect the options to precede the other arguments.
Note that git and other VCS commands often use a hybrid system:
git commit -m 'This is why it was committed'
There is a sub-command as one of the arguments. Often, there will be optional 'global' options that can be specified between the command and the sub-command. There are examples of this in POSIX; the sccs command is in this category; you can argue that some of the other commands that run other commands are also in this category: nice and xargs spring to mind from POSIX; sudo is a non-POSIX example, as are svn and cvs.
I don't have strong preferences between the different systems. When there are few enough options, then single letters with mnemonic value are convenient. GNU supports this, but recommends backing it up with multi-letter options preceded by a double-dash.
There are some things I do object to. One of the worst is the same option letter being used with different meanings depending on what other option letters have preceded it. In my book, that's a no-no, but I know of software where it is done.
Another objectionable behaviour is inconsistency in style of handling arguments (especially for a single program, but also within a suite of programs). Either require attached arguments or require detached arguments (or allow either), but do not have some options requiring an attached argument and others requiring a detached argument. And be consistent about whether '=' may be used to separate the option and the argument.
As with many, many (software-related) things — consistency is more important than the individual decisions. Using tools that automate and standardize the argument processing helps with consistency.
Whatever you do, please, read the TAOUP's Command-Line Options and consider Standards for Command Line Interfaces. (Added by J F Sebastian — thanks; I agree.)
It's completely arbitrary; the command may implement all of the option handling in its own special way or it might call out to some other convenience functions. The getopt() family of functions is pretty popular, so most software written even remotely recently follows the conventions set by those routines. There are always exceptions, of course!
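For reference, here is a minimal sketch of the convention those routines implement; the option letters are invented for the example:

/* Sketch: classic getopt() handling: single-letter options, grouping,
 * "-o value" or "-ovalue", and "--" to end option processing. */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    int opt, verbose = 0;
    const char *outfile = NULL;

    while ((opt = getopt(argc, argv, "vo:")) != -1) {
        switch (opt) {
        case 'v': verbose = 1; break;
        case 'o': outfile = optarg; break;
        default:
            fprintf(stderr, "usage: %s [-v] [-o file] [operand...]\n", argv[0]);
            return 2;
        }
    }
    printf("verbose=%d, outfile=%s\n", verbose, outfile ? outfile : "(none)");
    for (int i = optind; i < argc; i++)
        printf("operand: %s\n", argv[i]);    /* non-option arguments */
    return 0;
}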
It's left to apps to parse options, hence the inconsistency. Expanding on your sort example, these are all equivalent for coreutils:
sort -k3
sort --k 3
sort --key 3
sort --key=3
_POSIX2_VERSION=199209 sort +2
A shell command is just a program, and it is free to interpret its command line any way it likes.
Unix never had anything like Apple's interface police to make sure that the command-line interface was consistent across applications. As a result, there is inconsistency, especially in older commands.
Peering into my crystal ball, I think command-line tools will slowly migrate toward GNU standards, double dashes and all. (I grew up with single dashes and still find the double dash very awkward, but it is consistent.)
