Page 32 of the book "The UNIX Programming Environment" makes this profound statement about UNIX pipes:
The programs in a pipeline actually run at the same time, not one
after another. This means that the programs in a pipeline can be
interactive; the kernel looks after whatever scheduling and
synchronization is needed to make it all work.
Wow! By using pipes I can get parallel processing for free!
"I've got to illustrate this awesome capability to my colleagues" I thought. I will implement a demo: create a simple interactive tool that mimics what I type in at the command window. I'll name that tool mimic. Use a pipe to connect it to another tool which counts the number of lines and characters. I'll name that tool wc (sadly, I am working on Windows, which doesn't have the UNIX wc program so I must implement my own). The two tools will run on a command line like this:
mimic | wc
The output of mimic is piped into wc.
Well, that's the idea.
I implemented the mimic tool with a very simple Flex lexer, which I show below. Compiling it generated mimic.exe.
When I run mimic.exe from the command line it does indeed mimic what I type:
> mimic
hello world
hello world
greetings
greetings
ctrl-c
I implemented wc using AWK. The AWK program (wc.awk) is shown below. When I run wc from the command line it does indeed count the lines and characters:
> echo Hello World | awk -f wc.awk
lines 1 chars 13
However, when I put them together with a pipe, they don't work as I imagined they would. Here's a sample run:
> mimic | awk -f wc.awk
hello world
greetings
ctrl-c
Hmm, nothing. No mimicking. No line counts. No char counts.
How come it's not working? What am I doing wrong, please?
What can I do to make it work as I expected? I expected it to work like this: I type something in at the command line. mimic repeats it, sending it to the pipe which sends it to wc which reports the number of lines and characters. I type in the next thing at the command line. mimic repeats it, sending it to the pipe which sends it to wc which reports the number of lines and characters. And so forth. That's the behavior I thought I was implementing.
Here is my simple Flex lexer (mimic):
%option noyywrap
%option always-interactive
%%
%%
int main(int argc, char *argv[])
{
yyin = stdin;
yylex();
return 0;
}
Here is my simple AWK program (wc):
{ nchars = nchars + length($0) + 1 }
END { printf("lines %-10d chars %d\n", NR, nchars) }
This question has little or nothing to do with Flex, Bison, or Awk, and not much to do with Unix either (since you're experimenting with Windows).
I don't have Windows handy, but the underlying issue is basically about stdio buffering, so it's reproducible on Unix as well.
To simplify, I only implemented mimic, which I did directly rather than using Flex (which is clearly overkill):
#include <stdio.h>
int main(void) {
for (int ch; (ch = getchar()) != EOF; ) putchar(ch);
return 0;
}
Since you use %option always-interactive, which forces Flex to always read one character at a time with fgetc(), that has basically the same sequence of standard library calls as your program, except that it simplifies an fwrite of one byte to the equivalent putchar.
It certainly has the same execution characteristic:
$ ./mimic
Here we go round the mulberry busy,
Here we go round the mulberry busy,
the mulberry bush, the mulberry bush.
the mulberry bush, the mulberry bush.
In the above, I signaled end-of-input by typing Ctrl-D for the third line. On Windows, I would have had to type Ctrl-Z followed by Enter to get the same effect. If I kill the execution by typing Ctrl-C instead, I get roughly the same result (other than the fact that Ctrl-C shows up in the console):
$ ./mimic
Here we go round the mulberry bush,
Here we go round the mulberry bush,
the mulberry bush, the mulberry bush.
the mulberry bush, the mulberry bush.
^C
Now, since mimic just copies stdin to stdout, I might expect to be able to pipe it into itself and get the same result. But the output is a little different:
$ ./mimic | ./mimic
Here we go round the mulberry bush,
the mulberry bush, the mulberry bush.
Here we go round the mulberry bush,
the mulberry bush, the mulberry bush.
Again, I signaled end of input by typing Ctrl-D. And it was only after I typed the Ctrl-D that any output appeared; at that point, both lines were echoed. And if I terminate the program abruptly with Ctrl-C, I don't get any output at all:
$ ./mimic | ./mimic
Here we go round the mulberry bush,
the mulberry bush, the mulberry bush.
^C
OK, that's the data. Now, we need an explanation. And the explanation has to do with the way C standard library streams are buffered, which you can read about in the setvbuf manpage (and in many places on the web). I'll summarise:
The C standard specifies that all input functions execute "as if" implemented by repeated calls to fgetc, and all output functions "as if" implemented by repeated calls to fputc. Note that this means that there is nothing special about the chunk of bytes written by a single printf or fwrite. The library does not do anything to ensure that the sequence of calls is atomic. If you have two processes both writing to stdout, the messages can get interleaved, and that will happen from time to time.
The data written by fputc (and, consequently, all stdio output functions) is actually placed into an output buffer. This buffer is not part of the Operating System (which may add another layer of buffering). It's strictly part of the stdio library functions, which are ordinary userland functions. You could write them yourself (and it's a pretty good exercise to do so). From time to time, the contents of this buffer are sent to the appropriate operating system interface in order to be transferred to the output device.
"From time to time" is deliberately unspecific. There are three standard buffering modes, although the standard doesn't require them all to be used by a particular library implementation, and it also doesn't restrict the library implementation from using different buffering modes (although the three specified modes do basically cover the useful possibilities). However, most C library implementations you're likely to use do implement all three, pretty well as described in the standard. So take the following as a description of a common implementation technique, but be aware that on certain idiosyncratic platforms, it might not be accurate.
The three buffering modes are:
Unbuffered. In this mode, each byte written is transferred to the operating system (and, it is hoped, to the actual output device) as soon as possible.
Fully buffered. In this mode, there is a buffer of a predetermined size (often 8 kilobytes, but different library implementations have different defaults for different platforms). If you want to, you can supply your own buffer (of an arbitrary size) for a particular output stream, using the setvbuf standard library function (q.v.). Fully-buffered output might stay in the output buffer until it is full (although a given implementation may release output earlier). The buffer will, however, be sent to the operating system if you call fflush or fclose on the stream, or if fclose is called automatically when main returns. (It's not sent if the process dies, though.)
Line-buffered. In this mode, the stream again has an output buffer of a predetermined size, which is usually exactly the same as the buffer used in "fully-buffered" mode. The difference is that the buffer is also sent to the operating system when a newline character ('\n') is written. (If the buffer gets full before a newline character is written, then it is sent to the operating system, just as in fully-buffered mode. But most of the time, lines will be fairly short, so the buffer will be sent to the OS at the end of each line.)
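To make the three modes concrete, here is a toy model of a userland output buffer. It is only a sketch: every name in it (toy_stream, toy_putc, and so on) is made up for illustration, the 8-byte buffer is deliberately tiny, and "sending to the operating system" is simulated by bumping two counters rather than calling a real system interface.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a stdio-style output buffer. All names are hypothetical. */
enum toy_mode { TOY_UNBUFFERED, TOY_LINE, TOY_FULL };

#define TOY_BUFSIZE 8  /* deliberately tiny; real buffers are usually kilobytes */

struct toy_stream {
    enum toy_mode mode;
    char buf[TOY_BUFSIZE];
    size_t used;       /* bytes currently waiting in buf */
    size_t flushes;    /* how many times data was handed to the "OS" */
    size_t delivered;  /* total bytes handed to the "OS" */
};

static void toy_init(struct toy_stream *s, enum toy_mode mode)
{
    s->mode = mode;
    s->used = 0;
    s->flushes = 0;
    s->delivered = 0;
}

/* Analogue of fflush. A real library would pass buf to the OS here;
 * this sketch only records that a transfer happened. */
static void toy_flush(struct toy_stream *s)
{
    if (s->used == 0)
        return;
    s->delivered += s->used;
    s->flushes++;
    s->used = 0;
}

/* Analogue of fputc: store the byte, then flush if the mode calls for it. */
static void toy_putc(struct toy_stream *s, char ch)
{
    assert(s->used < TOY_BUFSIZE);
    s->buf[s->used++] = ch;
    if (s->mode == TOY_UNBUFFERED
        || (s->mode == TOY_LINE && ch == '\n')
        || s->used == TOY_BUFSIZE)
        toy_flush(s);
}

/* Analogue of fputs, written "as if" by repeated toy_putc calls. */
static void toy_puts(struct toy_stream *s, const char *text)
{
    for (; *text != '\0'; text++)
        toy_putc(s, *text);
}
```

Writing the six bytes of "ab\ncd\n" through this model hands data over six times in unbuffered mode, twice in line-buffered mode, and not at all in fully-buffered mode until toy_flush is called explicitly.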
Finally, you need to know that stdout is fully-buffered by default unless the standard library can determine that stdout is connected to some kind of console device, in which case it is not fully-buffered. (On Unix, it is typically line-buffered.) By contrast, stderr is not fully-buffered. (On Unix, it is typically unbuffered.) You can change the buffering mode, and the buffer size if relevant, by calling setvbuf before any other operation on the stream.
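As a minimal sketch of that setvbuf call, the helper below requests line buffering for a stream. The name use_line_buffering is made up for this example; setvbuf, _IOLBF and BUFSIZ are standard C.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: request line buffering on an already-open stream.
 * It has to run before any other operation on that stream.
 * Returns 0 on success, nonzero if the request could not be honoured. */
static int use_line_buffering(FILE *stream)
{
    assert(stream != NULL);
    /* A NULL buffer lets the library allocate its own storage. */
    return setvbuf(stream, NULL, _IOLBF, BUFSIZ);
}
```

Calling use_line_buffering(stdout) as the first statement of main should make each completed line leave the process even when stdout is a pipe. One caveat, as far as I know: Microsoft's C runtime treats _IOLBF the same as _IOFBF, so on Windows it is probably safer to pass _IONBF instead or to call fflush explicitly.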
The above-mentioned defaults are one of the reasons you are encouraged to write error messages to stderr rather than stdout: since stderr is unbuffered, the error message will appear as soon as possible. Also, you should normally put \n at the end of each output line; if standard output is a terminal and therefore line-buffered, the \n will ensure that the line is actually output, rather than languishing in the output buffer.
You can see all of this in action in the above examples. When I just ran ./mimic, leaving stdout mapped to the terminal, the output showed up each time I entered a line. (That also has to do with the way terminal input is handled by the terminal driver, which is another kettle of fish.)
But when I piped mimic into itself, the first mimic's standard output is redirected to a pipe. A pipe is not a terminal, so that mimic's stdout is fully-buffered by default. Since the buffer is larger than the total input, the entire program runs without sending anything to stdout, until the buffer is flushed when stdout is implicitly closed by main returning.
Moreover, if I kill the process (by typing Ctrl-C, for example, or by sending it a SIGKILL signal), then the output buffer is never sent to the operating system, and nothing appears on the console.
If you're writing console apps using standard C library calls, it's very important to understand how stdio output buffering affects the sequence of outputs you see. That's just as true on Windows as on Unix. (Of course, it doesn't apply if you use native Windows or Posix I/O interfaces.)
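One way to get the line-at-a-time behaviour the question was after is to flush explicitly whenever a line is complete. The following is a sketch of the copy loop from the simplified mimic with that change; the function name copy_flushing_lines is made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Copy `in` to `out` one character at a time, flushing `out` after every
 * newline so that each line leaves the stdio buffer as soon as it is complete.
 * Returns the number of bytes copied, or -1 on a write error. */
static long copy_flushing_lines(FILE *in, FILE *out)
{
    long copied = 0;
    int ch;

    assert(in != NULL && out != NULL);
    while ((ch = getc(in)) != EOF) {
        if (putc(ch, out) == EOF)
            return -1;
        copied++;
        if (ch == '\n' && fflush(out) == EOF)
            return -1;
    }
    /* Push out a trailing partial line, if there is one. */
    if (fflush(out) == EOF)
        return -1;
    return copied;
}
```

A main that simply calls copy_flushing_lines(stdin, stdout) would stand in for mimic. Note that this only fixes the producer side: the wc.awk script in the question prints its totals in its END block, so the counts would still appear only once the input ends.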
A semi-newbie to UNIX piping, so apologies if I'm asking anything obvious here. I'm using a program called CCExtractor to grab the closed captions from a video file. It has the option to receive a file from stdin, and it works great if I do the following:
./ccextractor -stdin < myvideofile.wtv
However, I want to try using it "live" - as a video is recording, it'll transcribe the subtitles. From my understanding, < won't do that, as it'll stop as soon as it reaches the current end of the file. Following this answer on Stack Overflow, it seems like:
tail -c +1 -f myvideofile.wtv | ./ccextractor -stdin
should work - but it doesn't process any part of the video at all (it should, at least, work as well as the previous command and parse the existing data). I figured I'd take a step back and use a simple cat:
cat myvideofile.wtv | ./ccextractor -stdin
and that doesn't work either. I was of the belief that the first and third commands ought to be roughly equivalent, but that's obviously not the case. What are the differences, and how could I get this to work?
I'm new to Stack Overflow so forgive me if I do something wrong. I'm trying to understand how a simple server would work in Haskell. I think I'm missing something very simple or fundamental about how hGetContents works.
import Network
import System.IO
main = withSocketsDo $ do
socket <- listenOn $ PortNumber 5002
(h, _, _) <- accept socket
c <- hGetContents h
-- putStrLn c -- doesn't work
-- putStrLn $ head $ lines c -- works!
-- putStrLn $ unlines $ take 2 $ lines c -- works!
-- putStrLn $ unlines $ take 3 $ lines c -- works!
-- putStrLn $ unlines $ take 6 $ lines c -- works!
putStrLn $ unlines $ take 10 $ lines c -- doesn't work
hPutStr h $ "HTTP/1.0 200 OK\r\nContent-Length: 5\r\n\r\nHello!\r\n"
hClose h
After running the program, I navigate via web browser to http://localhost:5002. The problem seems to be that, depending on how much I've parsed the handle contents, I eventually am unable to send a response. I'd like to be able to parse the request before I send a response. I've commented in the code the cases that work and the cases that don't. Hoogle says that for hGetContents (lazy) the handle is "semi-closed" as it is being read. Am I misunderstanding the laziness or should I consider the handle closed once I begin parsing its contents?
The error I get is "hPutChar: resource vanished (Broken pipe)." Thanks for any help.
I tried to reproduce your problem. For that I executed your code and sent it a request using nc:
printf "1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n11" | nc localhost 5002
As expected, the server (code from your question) printed out the first 10 lines and exited without any error. The client (nc) printed:
HTTP/1.0 200 OK
Content-Length: 5
Hello!
and also exited without an error.
So, at first I couldn't understand what your problem was, but then I tried to send a smaller request:
printf "1\n2\n3\n4\n5\n6\n" | nc localhost 5002
The server printed the first 6 lines and didn't exit. The client also didn't exit, so I interrupted it with Ctrl-C, and after that the server exited with a "resource vanished" error.
It took some thinking, and then it started making sense to me. I don't understand lazy IO too well, so if my explanation isn't clear or correct, it would be helpful if someone with a better understanding would improve it.
Let's follow your code. First:
(h, _, _) <- accept socket
c <- hGetContents h
You open a handle and read its content. Note that the read is lazy, and so the content that you get back is lazy too. When we say that something is lazy we mean that it can be passed around without being evaluated (this is often described as 'call by need', as opposed to 'call by value').
Now:
putStrLn $ unlines $ take 10 $ lines c
Here you pass your lazy, unevaluated content on to take 10. take 10 will try to evaluate the first 10 elements of a list and return them; if there are fewer than 10 elements in the list, it simply returns all of them. After take 10 we have putStrLn and unlines, which are both perfectly compatible with laziness.
Now let's say that the client sends an input that is only 6 lines long and then starts waiting for the response. Our server lazily receives the content and tries to print the first 10 lines. First, take 10 happily consumes the first 6 lines and passes them over to putStrLn . unlines. What happens then? take 10 can't just finish its output, because there is absolutely no indication that the input has ended. The handle is still open and bytes can still be flowing from client to server, so it just waits for more input.
This behaviour can be observed by running:
nc localhost 5002
and manually typing 10 lines there. The input will appear on the server line by line as you type. After you type the 10th line, the server will respond with the "Hello" message.
P.S.: I guess that the behaviour you described happens because your web browser sends 6 to 9 lines of something with the request.
To test, debug and analyze this kind of low-level server, you should use simple tools like nc and curl instead of your web browser :)
When you initiate a lazy read on a handle, you give up the right to do anything much else with the handle until the contents string is fully forced, or you close the handle manually (at which point attempting to force any more of the contents string will lead to bad behavior or an error).
TL;DR
This is not a situation where lazy I/O is appropriate. The situations where a lazy read on a socket is appropriate can probably be counted on zero fingers. You can use regular strict I/O if you like, or conduit, or pipes, or some Haskell web framework like Yesod or Scotty or various other competitors.
Calling hGetContents puts the handle into a "semi-closed" state. You should not perform any operations on the handle after that point. You should only use the string returned from hGetContents.
Put simply, don't use lazy I/O here. You need to manually read and write individual strings one at a time, since the timing matters.
In general, lazy I/O is kind of neat, but it doesn't work well for anything much beyond toy examples.
Will using the "yes" command waste a lot of CPU cycles?
I have a long-running script (the script's code is not in my control) which accepts something as input just once. Then the script runs for a long time.
To automate this, I use the "yes" command to feed it the input:
yes hello | myscript
Will the yes command steal or waste a lot of CPU cycles? As per the docs I read, it keeps printing the string argument to the piped program.
I ran the top command, and I didn't see "yes" there.
yes will write the string "hello" whenever the pipe has room for it. Once the pipe's buffer is full, which happens almost immediately if your script is not reading, the write blocks and yes sleeps until the receiving end (your script) reads some more input. So: no, yes does not take any CPU while the receiving end is not reading; the process is blocked.
Check the state of the yes process in the output of ps auxf for confirmation; it should normally show as sleeping (S).
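The finite pipe buffer can be observed directly. The sketch below (POSIX only; the name fill_pipe is made up for this example) switches the write end of a pipe to non-blocking mode and writes until the kernel refuses more data. On an ordinary blocking descriptor, which is what yes uses, the same write would put the process to sleep at that point instead of failing with EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write to `wfd` in non-blocking mode until the pipe accepts no more data.
 * Returns the number of bytes the pipe took, or -1 on an unexpected error. */
static long fill_pipe(int wfd)
{
    char chunk[4096] = {0};
    long total = 0;
    int flags;

    assert(wfd >= 0);
    flags = fcntl(wfd, F_GETFL);
    if (flags == -1 || fcntl(wfd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;

    for (;;) {
        ssize_t n = write(wfd, chunk, sizeof chunk);
        if (n > 0) {
            total += n;
            continue;
        }
        /* EAGAIN: the pipe is full. A blocking writer would sleep here. */
        if (n == -1 && errno == EAGAIN)
            return total;
        return -1;
    }
}
```

The value returned is the pipe's capacity, which differs between systems. Nothing more can be written until the reading side consumes some of the data.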
I am writing a program with a named pipe with multiple readers and multiple writers. The idea is to use that named pipe to create pairs of reader/writer. That is:
A reads the pipe
B writes in the pipe
(vice versa)
Pair A-B created!
In order to ensure that only one process is reading and one is writing, I have used 2 locks with flock. Just like this.
Reader Code:
echo "[JOB $2, Part $REMAINING] Taking next machine..."
VMTAKEN=$((
flock -x 200;
cat $VMPIPE;
)200>$JOINQUEUELOCK)
echo "[JOB $2, Part $REMAINING] Machine $VMTAKEN taken..."
Writer Code:
((
flock -x 200;
echo "[MACHINE $MACHINEID] I am inside the critical section"
echo "$MACHINEID" > $VMPIPE;
echo "[MACHINE $MACHINEID] Going outside the critical section"
)200>$VMQUEUELOCK)
echo "[MACHINE $MACHINEID] Got new Job"
I sometimes get the following problem:
[MACHINE 3] I am inside the critical section
[JOB 1, Part 249] Taking next machine...
[MACHINE 3] Going outside the critical section
[MACHINE 1] I am inside the critical section
[MACHINE 1] Going outside the critical section
[MACHINE 1]: Got new Job
[MACHINE 3]: Got new Job
[JOB 1, Part 249] Machine 3
1 taken...
As you can see, another writer wrote before the reader finished reading. What can I do to get rid of this problem? Should I use an ACK pipe or something?
Thank you in advance
This would be a typical use for semaphores:
Create 2 semaphores: one for reading processes, the other for writing processes. Set each semaphore to the value 1.
Reading processes call sem_wait(3) on the readers' semaphore, which blocks until the semaphore is > 0 and then lowers it to zero.
Writing processes do the same with the semaphore intended for them.
A controlling process (which may also set up the semaphores initially) can check whether both semaphores are zero and assign the pair.
The reader and writer then release the semaphores (increasing them by 1 again) so the next reader or writer can get the semaphore.
For passing information between reader and writer, shared memory may be used...