Why not pipe a list of file names into cat?

What is the design rationale behind cat not taking a list of file names from piped input? Why did the designers decide that the following should not work?
ls *.txt | cat
Instead, they chose to require that the file names be passed to cat as arguments:
ls *.txt | xargs cat

When you say ls *.txt | cat doesn't work, what you really mean is that it doesn't work the way you expect. In fact, it works exactly the way it was designed to work.
From man:
cat - Concatenate FILE(s), or standard input, to standard output
Suppose ls produces the following output:
$ ls *.txt
file1.txt
file2.txt
... the input to cat will be:
file1.txt
file2.txt
...and that is exactly what cat writes to standard output.
In some shells, it's equivalent to:
cat <(ls *.txt)
or
ls *.txt > tmpfile; cat tmpfile
So cat really is working the way its designers intended.
What you are expecting, on the other hand, is for cat to interpret its input as a set of filenames, to be read and their contents concatenated. But when you pipe into cat, that input is treated as the content of a single, nameless file.

In short, cat is a command which, like echo, cp, and a few others, cannot convert a pipe-redirected input stream into arguments.
That is why xargs is used: it passes the input stream to the command as arguments.
More details here: http://en.wikipedia.org/wiki/Xargs
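To see the difference concretely, here is a small, hypothetical session (two throwaway files in an otherwise empty directory):
$ echo hello > a.txt
$ echo world > b.txt
$ ls *.txt | cat
a.txt
b.txt
$ ls *.txt | xargs cat
hello
world
The first pipeline prints the file names themselves, because cat just copies its stdin; the second prints the files' contents, because xargs turned those names into arguments.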
As a former Unix SA, and now a Python developer, I would compare xargs to StringIO/CStringIO in Python, as it helps in a similar way.
As for your actual question, why didn't they allow stream input, here is what I think:
Nobody but the original designers could really answer this.
I believe, however, that cat was meant to print the content of a file to stdout, while echo was meant to print the content of a string to stdout.
Each of these commands had a specific role when it was created.

Related

Unix Text Processing - how to remove part of a file name from the results?

I'm searching through text files using grep and sed commands and I also want the file names displayed before my results. However, I'm trying to remove part of the file name when it is displayed.
The file names are formatted like this: aja_EPL_1999_03_01.txt
I want to have only the date without the beginning letters and without the .txt extension.
I've been searching for an answer and it seems like it's possible to do that with a sed or a grep command by using something like this to look forward and back and extract between _ and .txt:
(?<=_)\d+(?=\.)
But I must be doing something wrong, because it hasn't worked for me and I possibly have to add something as well, so that it doesn't extract only the first number, but the whole date. Thanks in advance.
Edit: I'm also adding the working command I've used, just in case. I imagine whatever command is needed would have to go at the beginning?
sed '/^$/d' *.txt | grep -P '(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)' *.txt --colour -A 1
The results look like this:
aja_EPL_1999_03_02.txt:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
A desired output would be this:
1999_03_02:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
First off, you might want to think about your regular expression. While you say the one you have works, I wonder whether it could be simplified. You told us:
(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)
It looks to me as if this is intended to match lines that start with a case insensitive "PALL", possibly preceded by any number of other characters that start with a capital letter, and that lines must not end in a backslash or a dot. So valid lines might be any of:
PALLILENNUD : korraga üritavad etc etc
Õlu on kena. Do I have appalling speling?
Peeter Pall is a limnologist at EMU!
If you'd care to narrow down this description a little and perhaps provide some examples of lines that should be matched or skipped, we may be able to do better. For instance, your outer parentheses are probably unnecessary.
Now, let's clarify what your pipe isn't doing.
sed '/^$/d' *.txt
This reads all your .txt files as an input stream, deletes any empty lines, and prints the output to stdout.
grep -P 'regex' *.txt --otheroptions
This reads all your .txt files, and prints any lines that match regex. It does not read stdin.
So, in the command line you're using right now, your sed command is utterly ignored, as its output is not being read by grep. You COULD instruct grep to read from both files and stdin:
$ echo "hello" > x.txt
$ echo "world" | grep "o" x.txt -
x.txt:hello
(standard input):world
But that's not what you're doing.
By default, when grep reads from multiple files, it will precede each match with the name of the file from which that match originated. That's also what you're seeing in my example above -- two inputs, one being x.txt and the other being - (a.k.a. stdin), each separated by a colon from the match it supplied.
While grep does include the most minuscule capability for filtering (with -o, or GNU grep's \K with the optional Perl-compatible RE), it does NOT provide you with any options for formatting the filename. Since you can't do anything about the filename within grep itself, you're limited to either parsing the output you've got or using some other tool.
Parsing is easy if your filenames are predictably structured, as they seem to be from the two examples you've provided.
For this, we can ignore the fact that these lines contain a filename and data; for the purpose of the filter, they are simply a stream that follows a pattern. It looks like you want to strip all characters from the beginning of each line up to, but not including, the first digit. You can do this by piping through sed:
sed 's/^[^0-9]*//'
Or you can achieve the same effect by using grep's minimal filtering to return every match starting from the first digit:
grep -o '[0-9].*'
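Applied to one of your output lines, the sed variant would look something like this. Note that the .txt before the colon survives either filter, so if you also want it gone (assuming the extension is always .txt), a second substitution can drop it:
$ echo 'aja_EPL_1999_03_02.txt:PALLILENNUD : korraga üritavad' | sed 's/^[^0-9]*//; s/\.txt:/:/'
1999_03_02:PALLILENNUD : korraga üritavad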
If this kind of pipe-fitting is not to your liking, you may want to replace your entire grep with something in awk that combines functionality:
$ awk '
/[\.]$/ {next} # skip lines ending in backslash or dot
/^([A-ZÖÄÜÕŠŽ].*)?PALL/ { # lines to match
f=FILENAME
sub(/^[^0-9]*/,"",f) # strip unwanted part of filename, like sed
printf "%s:%s\n", f, $0
getline # simulate the "-A 1" from grep
printf "%s:%s\n", f, $0
}' *.txt
Note that I haven't tested this, because I don't have your data to work with.
Also, awk doesn't include any of the fancy terminal-dependent colourization that GNU grep provides through the --colour option.

Median Calculation in Unix

I need to calculate the median value for the input file below. It works fine for an odd number of rows but not for an even number. Below are the input file and the script used. Could you please check what is wrong with this command and correct it?
Input file:
col1,col2
AR,2.52
AR,3.57
AR,1.29
AR,6.66
AR,3.05
AR,5.52
Desired Output:
AR,3.31
Unix command:
cat test.txt | sort -t"," -k2n,2 | awk '{arr[NR]=$1} END { if (NR%2==1) print arr[(NR+1)/2]; else print (arr[NR/2]+arr[NR/2+1])/2}'
Don't forget that your input file has an additional line, containing the header. You need to take an additional step in your awk script to skip the first line.
Also, because you're using the default field separator, $1 will contain the whole line, so your code (arr[NR/2]+arr[NR/2+1])/2 is never going to work. I would suggest changing it so that awk splits the input on a comma, then using the second field, $2.
sort -t, -k2n,2 file | awk -F, 'NR>1{a[++i]=$2}END{if(i%2==1)print a[(i+1)/2];else print (a[i/2]+a[i/2+1])/2}'
I also removed your useless use of cat. Most tools, including sort and awk, are capable of reading in files directly, so you don't need to use cat with them.
Testing it out:
$ cat file
col1,col2
AR,2.52
AR,3.57
AR,1.29
AR,6.66
AR,3.05
AR,5.52
$ sort -t, -k2n,2 file | awk -F, 'NR>1{a[++i]=$2}END{if(i%2==1)print a[(i+1)/2];else print (a[i/2]+a[i/2+1])/2}'
3.31
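To see where that 3.31 comes from: after sorting, the second column reads 1.29, 2.52, 3.05, 3.57, 5.52, 6.66. With six values (an even count), the median is the average of the 3rd and 4th, i.e. (3.05 + 3.57) / 2 = 3.31.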
It shouldn't be too difficult to modify the script slightly to change the output to whatever you want.
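For instance, if you want the AR,3.31 form from your question, and the first column is the same on every data row (an assumption on my part), one untested tweak is to remember that column and print it back out:
sort -t, -k2n,2 file | awk -F, 'NR>1{a[++i]=$2; k=$1} END{if(i%2==1)print k","a[(i+1)/2]; else print k","(a[i/2]+a[i/2+1])/2}'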

How to cat using part of a filename in terminal?

I'm using terminal on OS 10.X. I have some data files of the format:
mbh5.0_mrg4.54545454545_period0.000722172513951.params.dat
mbh5.0_mrg4.54545454545_period0.00077271543854.params.dat
mbh5.0_mrg4.59090909091_period-0.000355232058085.params.dat
mbh5.0_mrg4.59090909091_period-0.000402015664015.params.dat
I know that there will be some files with similar numbers after mbh and mrg, but I won't know ahead of time what the numbers will be or how many similarly numbered ones there will be. My goal is to cat all the data from all the files with similar numbers after mbh and mrg into one data file. So from the above I would want to do something like...
cat mbh5.0_mrg4.54545454545*dat > mbh5.0_mrg4.54545454545.dat
cat mbh5.0_mrg4.5909090909*dat > mbh5.0_mrg4.5909090909.dat
I want to automate this process because there will be many such files.
What would be the best way to do this? I've been looking into sed, but I don't have a solution yet.
for file in *.params.dat; do
prefix=${file%_*}
cat "$file" >> "$prefix.dat"
done
This part, ${file%_*}, removes the last underscore and everything after it from the end of $file, and the result is saved in the prefix variable. (Ref: http://www.gnu.org/software/bash/manual/bashref.html#Shell-Parameter-Expansion)
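For example, with one of the file names from the question (a quick sketch using plain POSIX parameter expansion):
file=mbh5.0_mrg4.54545454545_period0.000722172513951.params.dat
echo "${file%_*}"    # prints mbh5.0_mrg4.54545454545
so the loop above appends that file's data to mbh5.0_mrg4.54545454545.dat, which is the grouping you described.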
It's not 100% clear to me what you're trying to achieve here, but if you want to aggregate files into a file named after the number that follows "mbh5.0_mrg4.", then you can do the following.
ls -l mbh5.0_mrg4* | awk '{print "cat " $9 " > mbh5.0_mrg4." substr($9,12,11) ".dat" }' | /bin/bash
The "ls -s" lists the file and the "awk" takes the 9th column from the result of the ls. With some string concatenation the result is passed to /bin/bash to be executed.
This is a Linux bash one-liner, so it assumes you have /bin/bash; I'm not 100% familiar with OS X. This script also assumes that the number you're grouping on is always in the same place in the filename. I think you can change /bin/bash to almost any shell you have installed.

Unix Pipes for Command Argument [duplicate]

I am looking for insight as to how pipes can be used to pass standard output as the arguments for other commands.
For example, consider this case:
ls | grep Hello
The structure of grep follows the pattern: grep SearchTerm PathOfFileToBeSearched. In the case I have illustrated, the word Hello is taken as the SearchTerm and the result of ls is used as the file to be searched. But what if I want to switch it around? What if I want the standard output of ls to be the SearchTerm, with the argument following grep being PathOfFileToBeSearched? In a general sense, I want to have control over which argument the pipe fills with the standard output of the previous command. Is this possible, or does it depend on how the script for the command (e.g., grep) was written?
Thank you so much for your help!
grep itself is built such that if you've not specified a file name, it will read stdin (and thus get the output of ls). There's no real generic mechanism here - merely convention.
If you want the output of ls to be the search term, you can do this via the shell. Make use of a subshell and substitution thus:
$ grep $(ls) filename.txt
In this scenario ls is run in a subshell, and its stdout is captured and inserted in the command line as an argument for grep. Note that if the ls output contains spaces, this will cause confusion for grep.
There are basically two options for this: shell command substitution and xargs. Brian Agnew has just written about the former. xargs is a utility which takes its stdin and turns it into arguments of a command to execute. So you could run
ls | xargs -n1 -J % grep -- % PathOfFileToBeSearched
and it would, for each file name output by ls, run grep -- filename PathOfFileToBeSearched, i.e. grep for that file name within the other file you specify. This is an unusual xargs invocation; usually it's used to add one or more arguments at the end of a command, while here it should add exactly one argument in a specific place, so I've used the -n and -J options to arrange that. The more common usage would be something like
ls | xargs grep -- term
to search all of the files output by ls for term. Although of course, if you just want files in the current directory, you can do this more simply without a pipeline:
grep -- term *
and likewise in your reversed arrangement,
for filename in *; do
grep -- "$#" PathOfFileToBeSearched
done
There's one important xargs caveat: whitespace characters in the filenames generated by ls won't be handled too well. To handle those, provided you have GNU utilities, you can use find instead:
find . -mindepth 1 -maxdepth 1 -print0 | xargs -0 -n1 -J % grep -- % PathOfFileToBeSearched
to use NUL characters to separate filenames instead of whitespace.
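A portability note: -J is a BSD xargs extension and is not in GNU xargs, so if your xargs lacks it, the -I option (supported by both GNU and BSD implementations) is a rough equivalent:
find . -mindepth 1 -maxdepth 1 -print0 | xargs -0 -I{} grep -- {} PathOfFileToBeSearched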

Difference between grep vs. cat and grep

I would like to know the difference between the two commands below. I understand that 2) should be used, but I want to know the exact sequence of what happens in 1) and 2).
Suppose filename has 200 characters in it.
1) cat filename | grep regex
2) grep regex filename
Functionally (in terms of output), those two are the same. The first one actually creates a separate cat process, which simply sends the contents of the file to standard output; that shows up on the standard input of grep, because the shell has connected the two with a pipe.
In that sense grep regex <filename is also equivalent but with one less process.
Where you'll start seeing the difference is in variants when the extra information (the file names) is used by grep, such as with:
grep -n regex filename1 filename2
The difference between that and:
cat filename1 filename2 | grep -n regex
is that the former knows about the individual files whereas the latter sees it as one file (with no name).
While the former may give you:
filename1:7:line with regex in 10-line file
filename2:2:another regex line
the latter will be more like:
7:line with regex in 10-line file
12:another regex line
Another executable that acts differently if it knows the file names is wc, the word-count program:
$ cat qq.in
1
2
3
$ wc -l qq.in # knows file so prints it
3 qq.in
$ cat qq.in | wc -l # does not know file
3
$ wc -l <qq.in # also does not know file
3
First one:
cat filename | grep regex
Normally cat opens a file and prints its contents line by line to stdout. Here, however, its output goes into the pipe '|'. grep then reads from that pipe (the pipe becomes its stdin) and prints every line that matches the regex to stdout. One detail: grep runs as a separate process, so the pipe forwards cat's output to that process's input.
Second one:
grep regex filename
Here grep reads directly from the file (above it was reading from the pipe), matches the regex, and prints each matching line to stdout.
If you want to check the actual execution time difference, first create a file with 100000 lines:
user@server ~ $ for i in $(seq 1 100000); do echo line${i} >> test_f; done
user@server ~ $ wc -l test_f
100000 test_f
Now measure:
user@server ~ $ time grep line test_f
#...
real 0m1.320s
user 0m0.101s
sys 0m0.122s
user@server ~ $ time cat test_f | grep line
#...
real 0m1.288s
user 0m0.132s
sys 0m0.108s
As we can see, the difference is not too big...
Actually, though the outputs are the same:
$ cat filename | grep regex
This command reads the content of the file "filename" first, then searches for regex in it; while
$ grep regex filename
This command searches for regex directly in the file "filename".
Functionally they are equivalent; however, the shell will fork two processes for cat filename | grep regex and connect them with a pipe.

Resources