How do I pipe comm outputs to a file? - unix

I've used the comm command to compare two files, but I'm unable to pipe it to a third file:
comm file1 file2 > file3
comm: file 1 is not in sorted order
comm: file 2 is not in sorted order
How do I do this? The files are sorted already.
(comm file1 file2 works and prints it out)
sample input:
file1:
21
24
31
36
40
87
105
134
...
file2:
10
21
31
36
40
40
87
103
...
comm file1 file2: works
comm file1 file2 > file3
comm: file 1 is not in sorted order
comm: file 2 is not in sorted order

You've sorted numerically; comm works on lexically sorted files.
For instance, in file2, the line 103 is dramatically out of order with the lines 21..87. Your files must be 'plain sort sorted'.
If you've got bash (4.x), you can use process substitution:
comm <(sort file1) <(sort file2)
This runs the two commands and ensures that the comm process gets to read their standard output as if they were files.
Failing that:
(
sort -o file1 file1 &
sort -o file2 file2 &
wait
comm file1 file2
)
This uses parallelism to sort both files at the same time. The sub-shell (the ( ... )) ensures that wait only waits for these two sort jobs, rather than for any other background processes you may have started earlier.
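As a minimal sketch of the fix, assuming hypothetical scratch files built from the sample numbers in the question, the following sorts copies lexically and then redirects comm's output to a third file. Forcing LC_ALL=C is an added assumption here, so that sort and comm agree on collation order; -12 is used only to keep the result short (lines common to both files).

```shell
# Hypothetical scratch files mirroring the question's sample numbers.
tmp=$(mktemp -d)
printf '%s\n' 21 24 31 36 40 87 105 134 > "$tmp/file1"
printf '%s\n' 10 21 31 36 40 40 87 103 > "$tmp/file2"

# Sort copies lexically; the C locale keeps sort and comm consistent.
LC_ALL=C sort "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort "$tmp/file2" > "$tmp/file2.sorted"

# -12 suppresses columns 1 and 2, leaving only lines present in both files.
LC_ALL=C comm -12 "$tmp/file1.sorted" "$tmp/file2.sorted" > "$tmp/file3"
cat "$tmp/file3"
```

Dropping -12 gives the full three-column output; the redirection itself works the same way.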

Your sample data is NOT sorted lexicographically (like in a dictionary), which is what commands like comm and sort (without the -n option) expect, where for example 100 should be before 20.
Are you sure that you aren't simply not noticing the error message when you don't redirect the output, since the error would be intermixed with the output lines on the terminal?
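To check this, the two streams can be captured separately. The sketch below uses hypothetical two-line files that are in numeric but not lexical order. Whether comm prints a warning at all depends on the implementation (GNU comm does by default, some others do not), so the script only reports what it saw.

```shell
# Hypothetical inputs: numerically sorted, but 105 and 103 sort before 21 lexically.
tmp=$(mktemp -d)
printf '%s\n' 21 105 > "$tmp/a"
printf '%s\n' 21 103 > "$tmp/b"

# Regular output goes to one file, diagnostics to another.
status=0
LC_ALL=C comm "$tmp/a" "$tmp/b" > "$tmp/out" 2> "$tmp/err" || status=$?

echo "exit status:  $status"
echo "stdout lines: $(wc -l < "$tmp/out")"
echo "stderr lines: $(wc -l < "$tmp/err")"
```

With a plain > file3 redirect only standard output is captured, so any warning still shows up on the terminal, which makes it easy to notice.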

You have to sort the files first with the sort program.

Try :
sort -o file1 file1
sort -o file2 file2
comm file1 file2 > file3

I don't get the same results as you, but perhaps your version of comm is complaining that the files are not sorted lexically. Using the input you provided (the ... makes it interesting, I know it's not a part of your actual files.)
$ comm file[12]
10
21
24
31
36
40
40
87
103
...
105
134
...
I was surprised that ... wasn't in the third column, so I tried:
$ comm <(sort file1) <(sort file2)
...
10
103
105
134
21
24
31
36
40
40
87
That's better, but 105 > 24, right?
$ comm <(sort -n file1) <(sort -n file2)
...
10
21
24
31
36
40
40
87
103
105
134
I think those were the results you are looking for. The two 40s are also interesting. If you want to eliminate these:
$ comm <(sort -nu file1) <(sort -nu file2)
...
10
21
24
31
36
40
87
103
105
134

I ran into a similar issue, where comm was complaining even though I had run sort. The problem was that I was running Cygwin, and sort pointed to some MSDOS version (I guess). By using the specific path (C:\Cygwin\bin\sort in my case), it worked.

I had a similar issue when I had sorted files but was getting the same error with
comm -23 16-unique.log 23-unique.log > 16-only.log
but I figured the redirection wasn't working properly so I tried
(comm -23 16-unique.log 23-unique.log ) > 16-only.log
but using sort to ensure the inputs were sorted was what did the trick:
comm -23 <(sort 16-unique.log) <(sort 23-unique.log) > 16-only.log
[As an aside, the -23 switch suppresses columns 2 and 3, so only the lines unique to the first file appear in the output; see also man comm.]
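For reference, here is a small sketch of the three column-suppression combinations, using hypothetical, already-sorted sample files (the file names and contents are made up for illustration):

```shell
# Hypothetical, already lexically sorted inputs.
tmp=$(mktemp -d)
printf '%s\n' apple banana cherry > "$tmp/first.log"
printf '%s\n' banana cherry date > "$tmp/second.log"

# -23: hide columns 2 and 3, keeping lines only in the first file.
only_first=$(LC_ALL=C comm -23 "$tmp/first.log" "$tmp/second.log")
# -13: hide columns 1 and 3, keeping lines only in the second file.
only_second=$(LC_ALL=C comm -13 "$tmp/first.log" "$tmp/second.log")
# -12: hide columns 1 and 2, keeping lines common to both files.
in_both=$(LC_ALL=C comm -12 "$tmp/first.log" "$tmp/second.log")

printf 'only first:  %s\n' "$only_first"
printf 'only second: %s\n' "$only_second"
printf 'in both:     %s\n' "$(echo $in_both)"
```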

Related

Awk program to compare number of fields by space of each line

I am trying to check if each line has a same length(or number of fields) in a file.
I am doing the following but it seems not to work.
NR==1 {length=NF}
NR>1 && NF!=length {print}
Can this be done by a one-liner awk? or a program is fine.
A sample of input would be:
12 34 54 56
12 89 34 33
12
29 56 42 42
My expected output would be "yes" or "no" if they have the same number of fields or not.
You could try this command which checks the number of fields in each line and compares it to the number of fields of the first line:
awk 'NR==1{a=NF; b=0} (NR>1 && NF!=a){print "No"; b=1; exit 1}END{if (b==0) print "Yes"}' test.txt
Checking stops at the first line whose number of fields differs from that of the first line of input.
For input
12 43 43
12 32
you will get "No"
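One likely reason the original attempt fails is that length is the name of a built-in awk function, so it cannot be used as a variable. A minimal sketch with the variable renamed (nf is an arbitrary name, and the scratch file is hypothetical) prints the lines whose field count differs from the first line:

```shell
# Hypothetical copy of the sample input from the question.
tmp=$(mktemp -d)
printf '%s\n' '12 34 54 56' '12 89 34 33' '12' '29 56 42 42' > "$tmp/input.txt"

# Same logic as the original attempt, but with a variable name that is
# not an awk built-in.
mismatches=$(awk 'NR==1 {nf=NF} NF!=nf {print}' "$tmp/input.txt")
printf '%s\n' "$mismatches"
```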
Try:
awk 'BEGIN{a="yes"} last!="" && NF!=last{a="no"; exit} {last=NF} END{print a}' file
How it works
BEGIN{a="yes"}
This initializes the variable a to yes. (We assume all lines have the same number of fields until proven otherwise.)
last!="" && NF!=last{a="no"; exit}
If last has been assigned a value and the number of fields on the current line is not the same as last, then set a to no and exit.
{last=NF}
Update last to the number of fields on the current line.
END{print a}
Before exiting, print a.
Examples
$ cat file1
2 34 54 56
12 89 34 33
12
29 56 42 42
$ awk 'BEGIN{a="yes"} last!="" && NF!=last{a="no"; exit} {last=NF} END{print a}' file1
no
$ cat file2
2 34 54 56
12 89 34 33
29 56 42 42
$ awk 'BEGIN{a="yes"} last!="" && NF!=last{a="no"; exit} {last=NF} END{print a}' file2
yes
I am assuming that you want to check whether all lines have the same number of fields; if that is the case, try the following.
awk '
FNR==1{
value=NF
count++
next
}
{
count=NF==value?++count:count
}
END{
if(count==FNR){
print "All lines are of same fields"
}
else{
print "All lines are NOT of same fields."
}
}
' Input_file
Additional (only if required): if you also want to print the contents of the file when all of its lines have the same number of fields, along with the message, try the following.
awk '
{
val=val?val ORS $0:$0
}
FNR==1{
value=NF
count++
next
}
{
count=NF==value?++count:count
}
END{
if(count==FNR){
print "All lines are of same fields" ORS val
}
else{
print "All lines are NOT of same fields."
}
}
' Input_file
this should do
$ awk 'NR==1{p=NF} p!=NF{s=1; exit} END{print s?"No":"Yes"}' file
however, setting the exit status would be better if this will be part of a workflow.
Since equality is transitive, there is no need to keep NF for any line other than the first; and because the flag s is unset (false) by default, the success case needs no initialization.
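The exit-status idea could look roughly like the sketch below. The function name same_field_count and the scratch files are hypothetical; the awk program sets a flag on the first mismatch and returns it from the END block, so the caller can branch on the result without parsing any output.

```shell
# Hypothetical helper: succeeds (exit 0) if every line has as many
# fields as the first line, fails (exit 1) otherwise.
same_field_count() {
    awk 'NR==1 {n=NF} NF!=n {bad=1; exit} END {exit bad}' "$1"
}

tmp=$(mktemp -d)
printf '%s\n' '12 34 54 56' '12 89 34 33' '29 56 42 42' > "$tmp/even.txt"
printf '%s\n' '12 34 54 56' '12' '29 56 42 42' > "$tmp/uneven.txt"

for f in "$tmp/even.txt" "$tmp/uneven.txt"; do
    if same_field_count "$f"; then echo "yes"; else echo "no"; fi
done
```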
An efficient "even fields" shell function: it uses sed to construct a regex from the first line of input and feeds it to GNU grep, which looks for lines whose field count differs (-x makes the pattern match whole lines, so lines with extra fields are caught as well):
# Usage: ef filename
ef() { sed '1s/[^ ]*/[^ ]*/g;q' "$1" | grep -x -v -m 1 -q -f - "$1" \
&& echo no || echo yes ; }
For files with uneven fields grep -m 1 quits after the first non-uniform line -- so if the file is a million lines long, but the mismatch occurs on line #2, grep only needs to read two lines, not a million. On the other hand, if there's no mismatch grep would have to read a million lines.

Print required lines from 3 files in Unix

There are 3 files in a directory. How can I print the 1st line of the first file, the 3rd line of the second file and the 4th line of the third file using a UNIX command?
I tried cat filename.txt | sed -n 1p, but that works for only one file. How can I handle all three files at once?
Using awk: at the beginning of each file, f is incremented to keep track of which file we're dealing with; we then combine that with the required record number within each file (FNR):
$ awk 'FNR==1 {f++} f==1&&FNR==1 || f==2&&FNR==3 || f==3&&FNR==4' 1 2 3
11
23
34
Contents of the first file (the others are similar):
$ cat 1
11
12
13
14
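For comparison, here is a sketch that produces the same three lines with one sed call per file, next to the awk one-liner from above. The scratch files are hypothetical and follow the pattern of the file shown.

```shell
# Hypothetical files named 1, 2 and 3, each with four lines.
tmp=$(mktemp -d)
printf '%s\n' 11 12 13 14 > "$tmp/1"
printf '%s\n' 21 22 23 24 > "$tmp/2"
printf '%s\n' 31 32 33 34 > "$tmp/3"

# One sed call per file: -n suppresses default output, Np prints line N.
picked=$(
    sed -n '1p' "$tmp/1"
    sed -n '3p' "$tmp/2"
    sed -n '4p' "$tmp/3"
)

# The single awk call from the answer, for comparison.
awk_picked=$(awk 'FNR==1 {f++} f==1&&FNR==1 || f==2&&FNR==3 || f==3&&FNR==4' \
    "$tmp/1" "$tmp/2" "$tmp/3")

printf '%s\n' "$picked"
```

The sed version is easier to read for a fixed handful of files; the awk version needs only one process.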

UNIX (AIX) Command Help - Sed & Awk

I'm running this on an AIX 6.1.
The intended purpose of this command is to display the following information in the following format:
GetUsedRAM:GetUsedSwap:CPU_0_System:CPU_0_User:…CPU_N_System:CPU_N_User
The command is composed of several sub commands:
echo `vmstat 1 2 | tr -s ' ' ':' | cut -d':' -f4,5,14-15 | tail -1 | sed 's/\([0-9]*:[0-9]*:\)\([0-9]*:[0-9]*\)/\1/'``mpstat -a 1 1 | tr -s ' ' '|' | head -8 | tail -4 | cut -d'|' -f 25,27 | awk -F "|" '{printf "%.0f:%.0f:",$2,$1}' | sed '$s/.$//'| sed -e "s/ \{1,\}$//"| awk '{int a[10];split($1, a,":");printf("%d:%d:%d:%d:%d:%d:%d:%d",a[0],a[1],a[2],a[3],a[4],a[5],a[6],a[7])}'`
Which I'll reformat for clarity:
echo \
`vmstat 1 2 |
tr -s ' ' ':' |
cut -d':' -f4,5,14-15 |
tail -1 |
sed 's/\([0-9]*:[0-9]*:\)\([0-9]*:[0-9]*\)/\1/' \
` \
`mpstat -a 1 1 |
tr -s ' ' '|' |
head -8 |
tail -4 |
cut -d'|' -f 25,27 |
awk -F "|" '{printf "%.0f:%.0f:",$2,$1}' |
sed '$s/.$//' |
sed -e "s/ \{1,\}$//" |
awk '{int a[10];split($1, a,":");printf("%d:%d:%d:%d:%d:%d:%d:%d",a[0],a[1],a[2],a[3],a[4],a[5],a[6],a[7])}' \
`
I understand all of the tr, cut, head tail, and (roughly) vmstat/mpstat commands. The first sed is where I get lost, I've tried running the command in smaller segments and not quite sure why it seems to work as a whole but not when I truncate the command before the next tr.
I'm also not so sure on the awk command although I understand the premise vaguely, as a function allowing formatted output.
Similarly, I have a vague understanding of sed being a command allowing certain strings/characters being replaced in some file.
I'm not able to make out what this specific implementation in the above case is.
Could anyone provide some clarity or direction as to exactly what is happening at each sed and awk step within the context of the entire command?
Thanks for your help.
Simplification
These two simpler commands produce exactly the same output:
# GetUsedRAM:GetUsedSwap:CPU_0_System:CPU_0_User:…CPU_N_System:CPU_N_User
# Select fields 4,5 of last line, and format with :
comm1=`vmstat 1 2 |
awk '$4~/[0-9]/{avm=$4;fre=$5} END{printf "%s:%s",avm,fre}'
`
# Select fields 27 (sy) and 25 (us) for four cpu, print as decimal.
comm2=`mpstat -A 1 1 |
awk -v firstline=6 -v cpus=4 '
BEGIN{start=firstline-1; end=firstline+cpus;}
NR>start && NR<end {printf( ":%d:%d", $27,$25)}'
`
echo "${comm1}${comm2}"
Description.
Description of original commands
The whole command is the concatenation of two commands.
The first command:
The output of the vmstat is shown in this link.
The columns 4 and 5 are 'avm' and 'fre'. The output in columns 14 and 15,
seem to be 'us' (user) and 'sy' (system). And I say seem as no output
from the user is available to confirm.
The first command
`vmstat 1 2 | # Execute the command vmstat.
tr -s ' ' ':' | # convert all spaces to colon (:).
cut -d':' -f4,5,14-15 | # select fields 4,5,14,and 15
tail -1 | # select last line.
sed 's/\([0-9]*:[0-9]*:\)\([0-9]*:[0-9]*\)/\1/' \ # See below.
`
The sed command captures, inside the first pair of escaped parentheses \( ... \),
a run of digits [0-9]* followed by a colon, twice. The second pair captures the
same again (without the last colon). That's the whole string in two parts:
« (dd:dd:)(dd:dd) » (d means digit).
Finally, it replaces the whole matched string with what was captured by
the first group, /\1/.
All this complexity just removes fields 14 and 15 as selected by cut.
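The effect of that substitution can be seen in isolation with a made-up input string (the numbers below are hypothetical, not real vmstat output):

```shell
# Hypothetical four-field string, shaped like the output of the cut step.
sample='1234:5678:3:7'

# Group 1 captures "1234:5678:", group 2 captures "3:7";
# the replacement keeps only group 1.
trimmed=$(printf '%s\n' "$sample" |
    sed 's/\([0-9]*:[0-9]*:\)\([0-9]*:[0-9]*\)/\1/')
printf '%s\n' "$trimmed"
```

Note that the trailing colon of the first group is kept, which is what lets the output of the second command be appended directly.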
A simpler command with exactly the same output is:
Select fields 4,5 of last line, and format with (:).
`vmstat 1 2 | awk '
$4~/[0-9]/{avm=$4;fre=$5} END{printf "%s:%s:",avm,fre}'
`
The second command:
The output of mpstat -A is similar to this one from Linux.
And also similar to this AIX mpstat -d output.
However, the exact output of AIX 6.1 for mpstat -a (ALL) on the computer
used could have several variations. Anyway, guided by the intended final
output desired: CPU_0_System:CPU_0_User:…CPU_N_System:CPU_N_User.
It seems that the columns to be selected should be us (user) and sy
(sys) percent of time that used the cpu for all cpu in use,
which seem to be four on the computer measured.
The manual for AIX 6.1 mpstat is here.
It has a list of all the 40 columns that are presented when the option
-a ALL is used:
CPU min maj mpcs mpcr dev soft dec ph cs ics bound rq push
S3pull S3grd S0rd S1rd S2rd S3rd S4rd S5rd S3hrd S4hrd S5hrd
sysc us sy wa id pc %ec ilcs vlcs lcs %idon %bdon %istol %bstol %nsp
us and sy are listed as the fields 27 and 28, however the command presented
by the user selects fields number 25 and 27. Close but not the same. The
only way to confirm would be to receive the output of the command from the user.
For testing I will be using the output of mpstat 5 1 from here.
# mpstat 5 1
System configuration: lcpu=4 ent=1.0 mode=Uncapped
cpu min maj mpc int cs ics rq mig lpa sysc us sy wt id pc %ec lcs
0 4940 0 1 632 685 268 0 320 100 263924 42 55 0 4 0.57 35.1 277
1 990 0 3 1387 2234 805 0 684 100 130290 28 47 0 25 0.27 16.6 649
2 3943 0 2 531 663 223 0 389 100 276520 44 54 0 3 0.57 34.9 270
3 1298 0 2 1856 2742 846 0 752 100 82141 31 40 0 29 0.22 13.4 650
ALL 11171 0 8 4406 6324 2142 0 2145 100 752875 39 51 0 10 1.63 163.1 1846
The second command
`mpstat -a 1 1 | # execute command
tr -s ' ' '|' | # replace all spaces with (|).
head -8 | # select 8 first lines.
tail -4 | # select last four lines.
cut -d'|' -f 25,27 | # select fields 25 and 27
awk -F "|" '{printf "%.0f:%.0f:",$2,$1}' | # print the fields as integers.
sed '$s/.$//' | # on the last line ($), substitute the last character (.$) by nothing.
sed -e "s/ \{1,\}$//" | # remove trailing space(s).
awk '{
int a[10];
split($1, a,":");
printf("%d:%d:%d:%d:%d:%d:%d:%d",a[0],a[1],a[2],a[3],a[4],a[5],a[6],a[7])
}' \
`
About the int: in older versions of awk, calling a function without parentheses is equivalent to calling it on $0. So int is equivalent to int($0), whose result is neither printed nor used. The same goes for the value of a[10].
The split sets each value of the command in a[i]. Then, all values of a[i] are printed as decimals.
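One detail worth keeping in mind when reading that printf: awk's split() fills the array starting at index 1, so a[0] is never assigned and prints as 0 with %d. A small sketch with a hypothetical colon-separated string shows this:

```shell
# Hypothetical input shaped like the colon-separated CPU values.
fields=$(echo '55:42:47:28' | awk '{
    n = split($1, a, ":")      # fills a[1]..a[n]; a[0] stays unset
    printf "%d|%d|%d|%d\n", n, a[0], a[1], a[n]
}')
printf '%s\n' "$fields"
```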
The equivalent, and way simpler is:
Command #2
`mpstat -A 1 1 |
awk -v firstline=6 -v cpus=4 '
BEGIN{start=firstline-1; end=firstline+cpus;}
NR>start && NR<end {printf( ":%d:%d", $27,$25)}'
`

Finding common elements from one file in a column of another file and output the entire row of the latter

I needed to extract all hits from one list (list.txt) which can be found in one of the columns of another (here in Data.txt) into a third (output.txt).
Data.txt (tab delimited)
some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2
T 3 Whizz 13 3
List.txt
Gee
Whiz
Lol
Ideally output.txt looks like
some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2
So I tried a shell script
for ids in List.txt
do
grep $ids Data.txt >> output.txt
done
except I typed out everything (cut and paste actually) in List.txt in said script.
Unfortunately it gave me an output.txt including the last line, I assume as 'Whizz' contains 'Whiz'.
I also tried cat Data.txt | egrep -F "List.txt" and that resulted in grep: conflicting matchers specified -- I suppose that was too naive of me. The actual files: List.txt contains a sorted list of 985 words, Data.txt has 115576 rows with 17 columns.
Some help/guidance would be much appreciated thanks.
Try something like this:
while IFS= read -r ids
do
grep "[TAB;]$ids[TAB;]" Data.txt >> output.txt
done < List.txt
But it has two drawbacks:
"Data.txt" is scanned multiple times.
You can get the same line multiple times.
If that is a problem, try the two-step version:
sed -e "s/.*/[TAB;]&[TAB;]/" List.txt > List_mod.txt
grep -f List_mod.txt Data.txt > output.txt
Note:
TAB stands for a literal tab character. On the command line it can be inserted with Ctrl-V followed by the Tab key; in an editor, with the Tab key. Check that your editor does not convert the tab into a series of spaces.
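To avoid typing a literal tab, the character can be produced with printf and stored in a variable. The sketch below runs the two-step version on hypothetical scratch copies of the sample data; the exact column layout of the rows is an assumption based on the question.

```shell
# Hypothetical scratch copies of the sample files (tab-delimited).
tmp=$(mktemp -d)
{
    printf 'some_data\tmore_data\tother_data\there\tyet_more_data\tetc\n'
    printf 'A\tB\t2\tGee;Whiz;Hello\t13\t12\n'
    printf 'A\tB\t2\tGee;Whizz;Hi\t56\t32\n'
    printf 'E\t\t4\tBtm;Lol\t16\t2\n'
    printf 'T\t\t3\tWhizz\t13\t3\n'
} > "$tmp/Data.txt"
printf '%s\n' Gee Whiz Lol > "$tmp/List.txt"

# A real tab character, so none has to be typed by hand.
tab=$(printf '\t')

# Step 1: wrap every word in "[<tab>;]word[<tab>;]".
sed "s/.*/[${tab};]&[${tab};]/" "$tmp/List.txt" > "$tmp/List_mod.txt"
# Step 2: a single pass over the data with all patterns at once.
grep -f "$tmp/List_mod.txt" "$tmp/Data.txt" > "$tmp/output.txt"
cat "$tmp/output.txt"
```

The header row is not matched by any pattern, so it would have to be copied separately if it is wanted in the output.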
The UNIX tool for general text processing is "awk":
awk '
NR==FNR { list[$0]; next }
{
for (word in list) {
if ($0 ~ "[\t;]" word "[\t;]") {
print
next
}
}
}
' List.txt Data.txt > output.txt
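With 985 words and over 100,000 rows, looping over every word for every row may be slow. A possible variant, sketched here under the assumption that the semicolon-separated words always sit in column 4, splits that column and looks each piece up in the list directly. The scratch files are hypothetical.

```shell
# Hypothetical scratch copies of the sample files (tab-delimited).
tmp=$(mktemp -d)
{
    printf 'some_data\tmore_data\tother_data\there\tyet_more_data\tetc\n'
    printf 'A\tB\t2\tGee;Whiz;Hello\t13\t12\n'
    printf 'A\tB\t2\tGee;Whizz;Hi\t56\t32\n'
    printf 'E\t\t4\tBtm;Lol\t16\t2\n'
    printf 'T\t\t3\tWhizz\t13\t3\n'
} > "$tmp/Data.txt"
printf '%s\n' Gee Whiz Lol > "$tmp/List.txt"

awk -F '\t' '
NR==FNR { list[$0]; next }          # first file: remember every word
{
    n = split($4, parts, ";")       # assumption: the words are in column 4
    for (i = 1; i <= n; i++) {
        if (parts[i] in list) {     # exact match, so Whizz does not match Whiz
            print
            next
        }
    }
}
' "$tmp/List.txt" "$tmp/Data.txt" > "$tmp/output.txt"
cat "$tmp/output.txt"
```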

Count total number of lines in a project excluding certain folders or files

Using the command:
wc -l `find . -name \* -print`
You can get the total number of lines of all files inside a folder.
But imagine you have some folders (for example, libraries) whose lines you don't want to count because you didn't write them.
So, how would you count the lines in a project excluding certain folders?
cloc has always been a great friend whenever I need to count lines of src-code. Using 2.6.29 linux kernel as an example:
$ cloc .
26667 text files.
26357 unique files.
2782 files ignored.
http://cloc.sourceforge.net v 1.50 T=168.0 s (140.9 files/s, 58995.0 lines/s)
--------------------------------------------------------------------------------
Language                      files          blank        comment           code
--------------------------------------------------------------------------------
C                             11435        1072207        1141803        5487594
C/C++ Header                  10033         232559         368953        1256555
Assembly                       1021          35605          41375         223098
make                           1087           4802           5388          16542
Perl                             25           1431           1648           7444
yacc                              5            447            318           2962
Bourne Shell                     50            464           1232           2922
C++                               1            205             58           1496
lex                               5            222            246           1399
HTML                              2             58              0            378
NAnt scripts                      1             85              0            299
Python                            3             62             77            277
Bourne Again Shell                4             55             22            265
Lisp                              1             63              0            218
ASP                               1             33              0            136
awk                               2             14              7             98
sed                               1              0              3             29
XSLT                              1              0              1              7
--------------------------------------------------------------------------------
SUM:                          23678        1348312        1561131        7001719
--------------------------------------------------------------------------------
With find, you can also "negate" matching conditions with !. For example, if I want to list all the .java files in a directory, excluding those containing Test:
find . -name "*.java" ! -name "*Test*"
Hope this helps!
Edit:
By the way, the -name predicate only filters file names. If you want to filter paths (so you can filter directories), use -path:
find . -path "*.java" ! -path "*Test*"
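Putting the exclusion together with wc, a rough sketch might look like this. The function name count_lines, the directory names and the scratch files are all hypothetical; the files are concatenated with cat so that wc reports a single total.

```shell
# Hypothetical helper: count lines of all regular files under $1,
# skipping everything inside the sub-directory $1/$2.
count_lines() {
    find "$1" -type f ! -path "$1/$2/*" -exec cat {} + | wc -l
}

# Hypothetical project layout: own code in src/, third-party code in lib/.
tmp=$(mktemp -d)
mkdir -p "$tmp/src" "$tmp/lib"
printf '%s\n' a b c > "$tmp/src/main.txt"
printf '%s\n' 1 2 3 4 5 > "$tmp/lib/vendor.txt"

# Arithmetic expansion strips any padding that wc may add.
total=$(( $(count_lines "$tmp" lib) ))
echo "lines excluding lib: $total"
```

For large trees, -prune may be preferable, since ! -path still descends into the excluded directory before discarding its files.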
You could also restrict which files are counted by listing them with a glob pattern;
for example,
*.txt will include only .txt files, and so on.
I made an NPM package specifically for this use case. It provides a CLI tool that takes the directory path and the folders/files to ignore.
Install it with:
npm i -g @quasimodo147/countlines
to get the countlines command in your terminal;
then you can run
countlines . node_modules build dist
