Pipe and fork in unix |? - unix

I was reading what's a pipe in an operating system and there's something I don't understand.
Take a sequence of unix pipe-character-separated commands like
cat file | grep "something"
what happens when the pipe | is processed? I understand that a unix pipe is opened through the pipe() function, but I don't see how a 'fork' would take place here in any of the processes involved.
What happens and how is a fork involved (if any) ?

For your case there are actually two fork calls being made: One for the cat command and one for the grep command.
The fork calls are needed, because the only standard way in POSIX to execute programs is with the exec family of calls, and those replace the current process image with the image from the loaded program. And if the shell doesn't fork a new process for each command it executes, the first program run would replace the shell, and the shell will not exist anymore.
The pipe is set up so that the write-end of the pipe will be connected to standard output of the first child process (the cat command), and the read-end of the pipe will be connected to standard input of the second child process (the grep command).
All of this is happening behind the scenes in the shell.

Related

How do I run a command(with its output being piped to a file) in the background?

Say I have a program foo, which prints a gazillion lines to the console.
How do I run it in the background, while piping its output to a file?
I tried this
./foo | output.txt&
Doesn't seem to work
Take a look at the nohup utility, it allows to detach a command from the tty:
nohup sh -c "./foo 2>&1 > output.txt" &
Piping the output of a command to a file actually does not work, you can only redirect it: that is the > output.txt. Piping makes sense if what follows is a command again which accepts input from its standard input, but not for a passive file. The additional 2>&1 redirects the commands standard error output into the standard output, so that you have only one single output pipe, otherwise potential errors would still spill out to the controlling tty. The actual command here is a shell invoked, that is because piping will break the sequence otherwise.

Write to one output file from a few parallel LSF bsub jobs, avoiding writing at the same time

I have developed a code that composed of two files:
An 'envelop bash file', which does a few things and writes to a log-file, and then at some point runs into a for loop in which within it it executes one job at a time using bsub.
And 'an internal bash file', which gets as input the name of the log-file (in addition to other input values that necessary for its execution), and executes process X (using the input values it received from the 'envelop file'.
Once process X is finished, the 'internal script' writes to the log-file that process X (with its specific serial number) has been completed.
Since the for-loop of the envelop file loops 10 times, there are at least 10 parallel processes that being executed and run in parallel, and they all being executed with bsub given the SAME log-file name. The idea is that they would all report to the same log-file once they completed their execution of Process X.
The general procedure works well, and in each case process X is being executed, and the log-file accumulates as required all the notifications regarding the completion of process X. However, in some incidences we see that the writing to the log-file get disturbed and output lines of two parallel runs are running into each other.
I would like to lock the log-file in such manner that would allow it to receive text only from one parallel run at a time. The idea is to avoid cases where the text becomes mixed due to two processes that write by chance to the log-file exactly at the same time.
Here is the part of my envelop file which call to the bsub (I reduced the content to the minimum necessary):
for ((i=1;i<=$batchesnumber; i++));
do
bsub -J $SerialName -q normal "bash FetchFasta.bash $genome_fa ${SerialFileName}".bed" $logfile"
done
Here is the part of my internal file that echo to the log-file:
(
echo "~~~~~~~~~~~~~~~~~~"
echo "^^^^^^^^^^^^^^^^^^"
echo -n "Completed running "; bedtools -version
echo "bedtools getfasta -s -fi $genome_fasta -bed $mySerialFile -fo ${mySerialFile%.*}".fa" "
echo "Run's completion time is: $timedate"
echo -e "~~~~~~~~~~~~~~~~~~\n"
) >> $logfile
I would appreciate any useful solution!
There's a couple of ways I can think of going about this:
Have each job write its output to a different file (use $LSB_JOBID inside each job to name the file). Then use another "cleanup" job to concatenate all of the ouptut into a single file. You can use job dependencies (bsub -w) to make sure the cleanup job runs after all the other jobs are done.
Implement a lock inside your "internal" job to make sure only one of them writes to a file at a time. This is a lot simpler than it might sound, one way to do it is to have each job try to create the same directory with mkdir before writing to the file and then delete the directory after its done. If they fail to create the directory it's because another one of the jobs got to it first and is currently writing to the file.
Here's a snippet illustrating #2 in bash:
# Try to get the lock every second
while ! mkdir lock &> /dev/null ; do
sleep 1
done
# Got the lock, write to the logfile
echo blahblahblah >> $logfile
# Release the lock
rmdir lock
I should mention an important caveat here though: if one of your jobs dies while it's "holding the lock" (say someone sends it a kill signal at the wrong time) then it'll never remove the directory and all the other jobs won't be able to create it, so they'll just keep sleeping forever.

command pipe into subshell

What is the difference between
cat dat | tee >(wc -l ) | some other command
and
cat dat | tee file | wc -l
in terms of what is happening under the hood?
I can understand the second one as tee is forking the stream into a file and also to a pipe. But I am confused with the first one.
The first notation is the process substitution of Bash 4.x (not in 3.x, or not all versions of 3.x).
As far as tee is concerned, it is given a file name (such as /dev/fd/64) to which it writes as well as to standard output; it is actually a file descriptor for the write end of a pipe. As far as wc is concerned, it reads its standard input (which is the read end of the pipe that is connected to /dev/fd/64 for tee), and writes its answer to the standard output of the shell invoking the pipeline (not the standard output of tee which goes down the pipeline).
Since >( is process substitiution of bash,
the first line says:
send the contents of file 'dat' into some other command
while process 'wc' is run with its input or output
connected to a pipe which also sends the content of 'dat'
check "Process Substitution" of bash manpage.

How do I use the nohup command without getting nohup.out?

I have a problem with the nohup command.
When I run my job, I have a lot of data. The output nohup.out becomes too large and my process slows down. How can I run this command without getting nohup.out?
The nohup command only writes to nohup.out if the output would otherwise go to the terminal. If you have redirected the output of the command somewhere else - including /dev/null - that's where it goes instead.
nohup command >/dev/null 2>&1 # doesn't create nohup.out
Note that the >/dev/null 2>&1 sequence can be abbreviated to just >&/dev/null in most (but not all) shells.
If you're using nohup, that probably means you want to run the command in the background by putting another & on the end of the whole thing:
nohup command >/dev/null 2>&1 & # runs in background, still doesn't create nohup.out
On Linux, running a job with nohup automatically closes its input as well. On other systems, notably BSD and macOS, that is not the case, so when running in the background, you might want to close input manually. While closing input has no effect on the creation or not of nohup.out, it avoids another problem: if a background process tries to read anything from standard input, it will pause, waiting for you to bring it back to the foreground and type something. So the extra-safe version looks like this:
nohup command </dev/null >/dev/null 2>&1 & # completely detached from terminal
Note, however, that this does not prevent the command from accessing the terminal directly, nor does it remove it from your shell's process group. If you want to do the latter, and you are running bash, ksh, or zsh, you can do so by running disown with no argument as the next command. That will mean the background process is no longer associated with a shell "job" and will not have any signals forwarded to it from the shell. (A disowned process gets no signals forwarded to it automatically by its parent shell - but without nohup, it will still receive a HUP signal sent via other means, such as a manual kill command. A nohup'ed process ignores any and all HUP signals, no matter how they are sent.)
Explanation:
In Unixy systems, every source of input or target of output has a number associated with it called a "file descriptor", or "fd" for short. Every running program ("process") has its own set of these, and when a new process starts up it has three of them already open: "standard input", which is fd 0, is open for the process to read from, while "standard output" (fd 1) and "standard error" (fd 2) are open for it to write to. If you just run a command in a terminal window, then by default, anything you type goes to its standard input, while both its standard output and standard error get sent to that window.
But you can ask the shell to change where any or all of those file descriptors point before launching the command; that's what the redirection (<, <<, >, >>) and pipe (|) operators do.
The pipe is the simplest of these... command1 | command2 arranges for the standard output of command1 to feed directly into the standard input of command2. This is a very handy arrangement that has led to a particular design pattern in UNIX tools (and explains the existence of standard error, which allows a program to send messages to the user even though its output is going into the next program in the pipeline). But you can only pipe standard output to standard input; you can't send any other file descriptors to a pipe without some juggling.
The redirection operators are friendlier in that they let you specify which file descriptor to redirect. So 0<infile reads standard input from the file named infile, while 2>>logfile appends standard error to the end of the file named logfile. If you don't specify a number, then input redirection defaults to fd 0 (< is the same as 0<), while output redirection defaults to fd 1 (> is the same as 1>).
Also, you can combine file descriptors together: 2>&1 means "send standard error wherever standard output is going". That means that you get a single stream of output that includes both standard out and standard error intermixed with no way to separate them anymore, but it also means that you can include standard error in a pipe.
So the sequence >/dev/null 2>&1 means "send standard output to /dev/null" (which is a special device that just throws away whatever you write to it) "and then send standard error to wherever standard output is going" (which we just made sure was /dev/null). Basically, "throw away whatever this command writes to either file descriptor".
When nohup detects that neither its standard error nor output is attached to a terminal, it doesn't bother to create nohup.out, but assumes that the output is already redirected where the user wants it to go.
The /dev/null device works for input, too; if you run a command with </dev/null, then any attempt by that command to read from standard input will instantly encounter end-of-file. Note that the merge syntax won't have the same effect here; it only works to point a file descriptor to another one that's open in the same direction (input or output). The shell will let you do >/dev/null <&1, but that winds up creating a process with an input file descriptor open on an output stream, so instead of just hitting end-of-file, any read attempt will trigger a fatal "invalid file descriptor" error.
nohup some_command > /dev/null 2>&1&
That's all you need to do!
Have you tried redirecting all three I/O streams:
nohup ./yourprogram > foo.out 2> foo.err < /dev/null &
You might want to use the detach program. You use it like nohup but it doesn't produce an output log unless you tell it to. Here is the man page:
NAME
detach - run a command after detaching from the terminal
SYNOPSIS
detach [options] [--] command [args]
Forks a new process, detaches is from the terminal, and executes com‐
mand with the specified arguments.
OPTIONS
detach recognizes a couple of options, which are discussed below. The
special option -- is used to signal that the rest of the arguments are
the command and args to be passed to it.
-e file
Connect file to the standard error of the command.
-f Run in the foreground (do not fork).
-i file
Connect file to the standard input of the command.
-o file
Connect file to the standard output of the command.
-p file
Write the pid of the detached process to file.
EXAMPLE
detach xterm
Start an xterm that will not be closed when the current shell exits.
AUTHOR
detach was written by Robbert Haarman. See http://inglorion.net/ for
contact information.
Note I have no affiliation with the author of the program. I'm only a satisfied user of the program.
Following command will let you run something in the background without getting nohup.out:
nohup command |tee &
In this way, you will be able to get console output while running script on the remote server:
sudo bash -c "nohup /opt/viptel/viptel_bin/log.sh $* &> /dev/null" &
Redirecting the output of sudo causes sudo to reask for the password, thus an awkward mechanism is needed to do this variant.
If you have a BASH shell on your mac/linux in-front of you, you try out the below steps to understand the redirection practically :
Create a 2 line script called zz.sh
#!/bin/bash
echo "Hello. This is a proper command"
junk_errorcommand
The echo command's output goes into STDOUT filestream (file descriptor 1).
The error command's output goes into STDERR filestream (file descriptor 2)
Currently, simply executing the script sends both STDOUT and STDERR to the screen.
./zz.sh
Now start with the standard redirection :
zz.sh > zfile.txt
In the above, "echo" (STDOUT) goes into the zfile.txt. Whereas "error" (STDERR) is displayed on the screen.
The above is the same as :
zz.sh 1> zfile.txt
Now you can try the opposite, and redirect "error" STDERR into the file. The STDOUT from "echo" command goes to the screen.
zz.sh 2> zfile.txt
Combining the above two, you get:
zz.sh 1> zfile.txt 2>&1
Explanation:
FIRST, send STDOUT 1 to zfile.txt
THEN, send STDERR 2 to STDOUT 1 itself (by using &1 pointer).
Therefore, both 1 and 2 goes into the same file (zfile.txt)
Eventually, you can pack the whole thing inside nohup command & to run it in the background:
nohup zz.sh 1> zfile.txt 2>&1&
You can run the below command.
nohup <your command> & > <outputfile> 2>&1 &
e.g.
I have a nohup command inside script
./Runjob.sh > sparkConcuurent.out 2>&1

killall on process of same name

I would like to use killall on a process of the same name from which killall will be executed without killing the process spawning the killall.
So in more detail, say I have process foo, and process foo is running. I want to be able to run "foo -k", and have the new foo kill the old foo, without killing itself.
pgrep foo | grep -v $$ | xargs kill
If you don't have pgrep, you'll have to come up with some other way of generating the list of PIDs of interest. Some options are:
Use ps with appropriate options, followed by some combination of grep, sed and/or awk to match the processes and extract the PIDs.
killall can send a signal 0 instead of SIGTERM; the standard semantics of this is that it doesn't send a signal, but just determines if the process is alive or not. Perhaps you can use killall to select the process list and get it to print the PIDs of the matching ones that are alive. This would also probably require a bit of post-processing with sed.
There may be something along the lines of Linux's /proc filesystem with pseudo-files holding system data that you could grovel through. Again, grep/awk/sed are your friends here.
If you truly need particular details on how to do this, comment or send me mail, and I'll try expanding some of these options in more detail.
[Edits: added further options for those without pgrep.]
This seems to work on OS X:
killall -s foo | perl -ne 'system $_ unless /\b'$PPID'\b/'
killall -s lists what it would do, one PID at a time. Do what it would do except for killing yourself.
The usual way to solve this is to have foo write its process ID to a file, say something like /var/run/foo.pid when it is run in daemon mode. Then you can have the non-daemon version read the PID from the PID file and call kill(2) on it directly. This is usually how apache and the like handle it. Of course the newer OSX daemons go through launchd(8) instead, but there are still a few that use good old fashioned signals.

Resources