grep string without anything surrounding it - count

I have a directory with multiple files, let's call these 1.txt, 2.txt, etc. Each file consists of 3 columns: an ID, a lat, and a lon. Unfortunately, some of the IDs may also be numeric (e.g., 1346, 248, 67912, etc.). I am trying to count the number of times each station ID occurs among all files (1.txt, 2.txt, etc.) based on a master file (masterfile.txt). So far I have:
while IFS='' read -r line || [[ -n "$line" ]]
do
cat * | grep -w -c "$line" >> counting_filename.txt
done < masterfile.txt
Which works great. However, if the lat and/or lon contain the particular ID, that is also counted. For example, if I am looking for the station ID 4575, a lat of '47.4575' or a lon of '-77.4575' will also go towards the count. There are two solutions I can think of but can't figure out how to implement:
1) Just grep the first column for instances, or
2) grep while NOT including the leading '.'
For example:
1.txt
4575 39.4575 -77.51
5010 38.3498 -78.4575
LAMS 38.4444 -78.3126
2.txt
3124 39.1010 -79.4575
4575 39.4575 -77.5010
PAOQ 39.2222 -78.0032
If I ran the above command, I would get a count of 6 for 4575, 2 for 5010, 1 for LAMS, 1 for 3124, and 1 for PAOQ.
What is desired is: 2 for 4575, 1 for 5010, 1 for LAMS, 1 for 3124, and 1 for PAOQ.
Any thoughts?

You're using the wrong tools - a small, simple awk script will be far more robust, efficient, and portable than a mixture of shell loops, read, grep, etc.
It's not clear what masterfile.txt is for, as from your example it looks like this is all you need:
$ awk '{cnt[$1]++} END{for (id in cnt) print id, cnt[id]}' 1.txt 2.txt
LAMS 1
PAOQ 1
3124 1
4575 2
5010 1
If you need masterfile.txt to list a specific set of IDs rather than just producing counts for all IDs as above then you can do that too:
$ cat masterfile.txt
4575
3124
PAOQ
BLAH
$ awk 'NR==FNR{ids[$1];next} $1 in ids{cnt[$1]++} END{for (id in cnt) print id, cnt[id]}' masterfile.txt 1.txt 2.txt
PAOQ 1
3124 1
4575 2
$ awk 'NR==FNR{ids[$1];next} $1 in ids{cnt[$1]++} END{for (id in ids) print id, cnt[id]+0}' masterfile.txt 1.txt 2.txt
BLAH 0
PAOQ 1
3124 1
4575 2
I added BLAH to show the different options you have of handling an ID from masterfile.txt that doesn't appear in your other files.
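If you do want to stay closer to the original grep approach, option 1 from the question (matching only the first column) can be approximated by anchoring the pattern to the start of the line. A minimal sketch, assuming the ID is the first whitespace-delimited column (the id variable name is mine):
while IFS='' read -r id || [[ -n "$id" ]]
do
    # ^ anchors the match to the start of the line, and [[:space:]]
    # requires a field separator right after the ID, so a lat/lon
    # such as 39.4575 can never match
    cat * | grep -c "^${id}[[:space:]]" >> counting_filename.txt
done < masterfile.txt
This keeps the question's output format (one bare count per line), but the awk versions above are still preferable since they read the data files only once.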


Merge and remove redundant lines *among* files

I need to merge several files, removing redundant lines among files, while keeping redundant lines within files. A schematic representation of my files is the following:
File1.txt
1
2
3
3
4
5
6
File2.txt
6
7
8
8
9
File3.txt
9
10
10
11
The desired output would be:
1
2
3
3
4
5
6
7
8
8
9
10
10
11
I would prefer to get a solution in awk, in bash, or in R. I searched the web for solutions and, though there were plenty of them (please find some examples below), they were all removing duplicated lines regardless of whether the lines were located within or across files.
Thanks in advance.
Arturo
Examples of previous solutions removing redundant lines both within and across files:
https://unix.stackexchange.com/questions/50103/merge-two-lists-while-removing-duplicates
https://unix.stackexchange.com/questions/457320/combine-text-files-and-delete-duplicate-lines
https://unix.stackexchange.com/questions/350520/awk-combine-two-big-files-and-remove-duplicated-lines
https://unix.stackexchange.com/questions/257467/merging-2-files-and-keeping-the-one-duplicate
With your shown samples, could you please try the following. This will NOT remove redundant lines within files, but will remove duplicates across files.
awk '
FNR==1{
  for(key in current){
    total[key]
  }
  delete current
}
!($0 in total)
{
  current[$0]
}
' File1.txt File2.txt File3.txt
Explanation: Adding a detailed explanation for the above.
awk '                    ##Starting awk program from here.
FNR==1{                  ##Checking condition: if this is the first line of a file then do the following.
  for(key in current){   ##Traverse through the current array here.
    total[key]           ##Placing each index of the current array into the total (for all files) one.
  }
  delete current         ##Deleting the current array here.
}
!($0 in total)           ##A bare pattern: if the current line is NOT present in total, the default action prints it.
{                        ##A separate action that runs for every line:
  current[$0]            ##place the current line into the current array.
}
' File1.txt File2.txt File3.txt ##Mentioning the Input_file names here.
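For reference, the same program condensed to a single line (my condensation; it relies on the same bare-pattern default-print behavior, with the rules separated by a semicolon):
awk 'FNR==1{for (k in cur) tot[k]; delete cur} !($0 in tot); {cur[$0]}' File1.txt File2.txt File3.txt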
Here's a trick building on https://stackoverflow.com/a/15385080/3358272, using diff and its output format. There is likely a presumption here that the files are sorted; untested.
out=$(mktemp -p .)
tmpout=$(mktemp -p .)
trap 'rm -f "${out}" "${tmpout}"' EXIT
for F in "$@" ; do
  { cat "${out}" ;
    diff --changed-group-format='%>' --unchanged-group-format='' "${out}" "${F}" ;
  } > "${tmpout}"
  mv "${tmpout}" "${out}"
done
cat "${out}"
Output:
$ ./question.sh F*
1
2
3
3
4
5
6
7
8
8
9
10
10
11
$ diff <(./question.sh F*) Output.txt
(Per markp-fuso's comment, if File3.txt had two 9s, this would preserve both.)

Convert specific column of file into upper case in unix (without using awk and sed)

My file (file name = test) is as below:
1 abc
2 xyz
3 pqr
How can I convert the second column of the file to upper case without using awk or sed?
You can use tr to transform lowercase to uppercase; cut will extract the individual columns, and paste will combine the separated columns again.
Assumption: Columns are delimited by tabs.
paste <(cut -f1 file) <(cut -f2 file | tr '[:lower:]' '[:upper:]')
Replace file with your file name (that is test in your case).
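The sample file actually looks space-delimited rather than tab-delimited; if so, the same idea works by giving cut and paste an explicit space delimiter (my adaptation of the command above):
paste -d' ' <(cut -d' ' -f1 test) <(cut -d' ' -f2 test | tr '[:lower:]' '[:upper:]')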
In pure bash (the ${var^^} uppercasing expansion requires bash 4+):
#!/bin/bash
while read -r col1 col2
do
    printf '%s %s\n' "$col1" "${col2^^}"
done < file > output-file
Input-file
$ cat file
1 abc
2 xyz
3 pqr
Output-file
$ cat output-file
1 ABC
2 XYZ
3 PQR

Compare 2 files in unix: file1 (2M numbers/rows/lines), file2 (2,000,480 numbers/rows/lines)

How can I compare these 2 big files in Unix?
I've already tried 'grep -Fxvf file1.txt file2.txt | wc -l', but the output is 2,000,480, and when switching file1 and file2 the output is 1,999,999.
How can I get an output of '480'? That's what I am expecting.
I've also tried the diff/cmp commands, but the output is too complicated.
I think you want the absolute value of the difference in line counts between the 2 files. You can achieve it easily with awk and get a decent result: read the line counts into an array and subtract the array values in the END block (see the sketch at the end of this answer). In pure shell it'd have to get more complex. Imagine you get some test data generated (10- and 14-line files):
$ seq 1 10 > ten
$ seq 1 14 > fourteen
And then you do (note that the -J flag is specific to BSD xargs):
$ ( wc -l ten ; wc -l fourteen ) | awk '{ print $1}' | sort -rn | xargs -J % echo % - p | dc
The result:
4
But a much better way would be to just do it in 3 steps: get the line count of file1, then of file2, and then subtract.
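For completeness, a sketch of the awk approach described above (my own rendering of it): the last FNR seen for each file is its line count, and END subtracts the two:
awk '{ n[FILENAME] = FNR }               # last record number per file = its line count
     END {
         d = n[ARGV[1]] - n[ARGV[2]]     # ARGV[1] and ARGV[2] are the two file names
         print (d < 0 ? -d : d)          # absolute difference
     }' ten fourteen
Run against the test files above, this prints 4.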

Finding common elements from one file in a column of another file and output the entire row of the latter

I needed to extract all hits from one list (List.txt) that can be found in one of the columns of another file (here, Data.txt) into a third file (output.txt).
Data.txt (tab delimited)
some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2
T 3 Whizz 13 3
List.txt
Gee
Whiz
Lol
Ideally output.txt looks like
some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2
So I tried a shell script
for ids in List.txt
do
grep $ids Data.txt >> output.txt
done
except I typed out everything (cut and paste actually) in List.txt in said script.
Unfortunately it gave me an output.txt including the last line, I assume because 'Whizz' contains 'Whiz'.
I also tried cat Data.txt | egrep -F "List.txt", which resulted in grep: conflicting matchers specified - I suppose that was too naive of me. The actual files: List.txt contains a sorted list of 985 words; Data.txt has 115576 rows with 17 columns.
Some help/guidance would be much appreciated, thanks.
Try something like this:
while read -r ids
do
    grep "[TAB;]$ids[TAB;]" Data.txt >> output.txt
done < List.txt
But it has two drawbacks:
"Data.txt" is scanned multiple times
You can get one line multiple times.
If that is a problem, try a two-step version:
sed -e "s/.*/[TAB;]&[TAB;]/" List.txt > List_mod.txt
grep -f List_mod.txt Data.txt > output.txt
Note:
The TAB character can be inserted with Ctrl-V followed by the Tab key on the command line, and as a literal Tab character in an editor. You have to check that your editor does not change tabs to a series of spaces.
The UNIX tool for general text processing is "awk":
awk '
NR==FNR { list[$0]; next }
{
for (word in list) {
if ($0 ~ "[\t;]" word "[\t;]") {
print
next
}
}
}
' List.txt Data.txt > output.txt
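Note that the desired output keeps the header row of Data.txt, which none of the commands above print. A minimal tweak to the awk version (my addition) passes line 1 of Data.txt through unconditionally:
awk '
NR==FNR { list[$0]; next }   # first file: remember each word
FNR==1  { print; next }      # Data.txt only: always keep the header row
{
    for (word in list)
        if ($0 ~ ("[\t;]" word "[\t;]")) { print; next }
}
' List.txt Data.txt > output.txt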

need code in awk unix or use substr

I want to print the words between "ctr{" and "}" and count how often the same words occur in a file.
I tried:
sed -n 's/.*ctr{\(.[^}]*\).*/\1/p' file
but it didn't find all the words, just one word.
The file is:
962796057604|mar0101|0|00000107A20E00000A6C331650B920340C00|0|0|400019FD7DBFBF7F|1001|962796057604|0 |01001|||-1|795971936| 00962795971936|16||-1| 00962795971936|-1|0|2|0|416019000659493|0||||||0|0|2012.12.01 00:07:09|12|30|0|516|16|1|2012.12.01 00:06:39|1|0||202|20001||0B12F1001104697209100300000000000000|1|1|11000|0|0||0881006972091003F000||0 714F610045584E6|000000000000|3|1|0000000000000000|0|140|0|0|0|0|0|0|||0|2|||||||||||||||||||||0|||0| |0|1|143|acf{0}cif{0}fcf{0}con{0}cuf{0}ctr{**Mo7afazat**}cgpa{962796057604}vlr{0096279001300}cff{0}roaf{0}mpty{0}ftksn{JMT}ftksr{0001}ftktp{CallTicketCPOCS} ||
1|34|2012.12.01 00:08:35|12|4|921-*203-0000000000-962796298894|mar0101|0|000001028225AE4AD868A8B750B900980C00|1|0|4000018001000002||962796298894|||||-1|||||-1||-1|0||||0||||||-1|-1|||-1|0|-1|-1|-1|2012.12.01 00:08:35|1|0||-1|0|||||||||||||0|0|||3797|0|12|-2147483648|-2147483648|-2147483648|-2147483648|||||||||||||||||||||||||0|||0||1|6|244|tid{111210532409329884}pfid{20}gob{1}rid{globitel} afid{}uid1{962796298894}aid1{1}ar1{0}uid2{globitel}aid2{-1}pid{1234}pur{!GDRC COMMIT AMOUNT 0}ratinf{}rec{0}rots{0}tda{}mid{}exd{0}reqa{0}ctr{**JaishanaIN**}ftksn{JMT}ftksr{0001}ftktp{PayCallTicket}||
1|34|2012.12.01 00:08:35|12|4|100-50-0-962796605155|mar0101|0|00000102A20400000A6A439D50B920520C00|0|0|400019FD7DBFBF7F|1001|962796605155|1 6||||-1|b116c||16||-1||-1|0|0|0|416017002233360|0||||||0|0|1970.01.01 02:00:00|0|0|0|220|0|1|1970.01.01 02:00:00|1|0||194|0||000000000000000000000000000000000000|0|0||0|0||00000000000000000000||0000000000 000000|000000000000|0|0|0000000000000000|0|370|0|0|0|0|0|0|||0|0|||||||||||||||||||||0|||0||0|1|70|a cf{3}ussd{1}ctr{**ZainElKul**}ftksn{JMT}ftksr{0001}ftktp{CallTicketCPOCS}||
1|34|2012.12.01 00:08:35|12|4|100-10-0
1|34|2012.12.01 00:08:35|12|4|921-*203-0000000000-962797611253|mar0101|0|0000010282B54BD015FF4C4B50B8F96E0C00|1|0|4000018001000002||962797611253|||||-1|||||-1||-1|0||||0||||||-1|-1|||-1|0|-1|-1|-1|2012.12.01 00:08:35|1|0||-1|0|||||||||||||0|0|||885|0|12|-2147483648|-2147483648|-2147483648|-2147483648|||||||||||||||||||||||||0|||0||1|6|243|tid{111220371293561120}pfid{20}gob{1}rid{globitel} afid{}uid1{962797611253}aid1{1}ar1{0}uid2{globitel}aid2{-1}pid{1234}pur{!GDRC COMMIT AMOUNT 0}ratinf{}rec{0}rots{0}tda{}mid{}exd{0}reqa{0}ctr{**ZainElKul**}ftksn{JMT}ftksr{0001}ftktp{PayCallTicket}||
-962795292027|mar0101|0|00000101A20200000A6A96B750B920300C00|0|0|400019FD7DBFBF7F|1001|962795292027|0 |01004|||-1|797196452| 00962797196452|16||-1| 00962797196452|-1|0|2|0|416018002276781|0||||||0|0|2012.12.01 00:07:09|12|12|23|516|16|1|2012.12.01 00:06:34|1|0||202|1||0B12F1001104697209100300000000000000|1|1|11000|0|0||0881006972091003F000||0714F 6100455AD67|000000000000|3|1|0000000000000000|0|30|0|0|0|0|0|0|||0|0|||||||||||||||||||||0|||0||0|1| 171|acf{0}cif{0}fcf{0}con{0}cuf{0}ctr{ZainUnlimited}cgpa{962795292027}vlr{0096279001300}cff{0}roaf{0}mpty{0}cacc{1;0;30}cquo{1;230;}ftksn{JMT}ftksr{000 1}ftktp{CallTicketCPOCS}||
1|34|2012.12.01 00:08:35|12|4|921-*203-0000000000-962796012818|mar0101|0|0000010882218115085D5F9150B920520C00|0|0|4000018001000002||962796012818|||||-1|||||-1||-1|0||||0||||||-1|-1|||-1|0|-1|-1|-1|2012.12.01 00:08:35|1|0||-1|1|||||||||||||0|0|||70|0|0|-2147483648|-2147483648|-2147483648|-2147483648|||||||||||||||||||||||||0|||0||1|6|258|tid{111221366974701289}pfid{17}gob{1}rid{globitel} afid{}uid1{962796012818}aid1{1}ar1{-2147483648}uid2{}aid2{-1}pid{DEFAULT_DECISION}pur{!GDRC Balance Check}ratinf{}rec{0}rots{0}tda{}mid{}exd{0}reqa{0}ctr{**AlBarakehNew**}ftksn{JMT}ftksr{0001}ftktp{PayCallTicket}||
1|34|2012.12.01 00:08:35|12|4|921-*203-0000000000-962797251349|mar0101|0|0000010282A451483EDFCFD350B920400C00|1|0|4000018001000002||962797251349|||||-1|||||-1||-1|0||||0||||||-1|-1|||-1|0|-1|-1|-1|2012.12.01 00:08:35|1|0||-1|0|||||||||||||0|0|||440|0|12|-2147483648|-2147483648|-2147483648|-2147483648|||||||||||||||||||||||||0|||0||1|6|245|tid{111211342745325133}pfid{20}gob{1}rid{globitel} afid{}uid1{962797251349}aid1{1}ar1{0}uid2{globitel}aid2{-1}pid{1234}pur{!GDRC COMMIT AMOUNT 0}ratinf{}rec{0}rots{0}tda{}mid{}exd{0}reqa{0}ctr{**ZainElKulSN**}ftksn{JMT}ftksr{0001}ftktp{PayCallTicket}||
1|34|2012.12.01 00:08:35|12|4|921-*203-0000000000-
It looks like you are just missing the counts. The easiest way to get them is to pipe your output through sort and uniq -c:
$ sed -n 's/.*ctr{\(.[^}]*\).*/\1/p' file | sort | uniq -c
1 **AlBarakehNew**
1 **JaishanaIN**
1 **Mo7afazat**
2 **ZainElKul**
1 **ZainElKulSN**
1 ZainUnlimited
Another way, using only GNU awk (the three-argument match() is a gawk extension):
$ awk 'match($0,".*ctr{([^}]*)}.*",m){a[m[1]]++}END{for(i in a) print i,a[i]}' file
ZainUnlimited 1
**ZainElKulSN** 1
**Mo7afazat** 1
**ZainElKul** 2
**JaishanaIN** 1
**AlBarakehNew** 1
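Since the question title mentions substr, here is a sketch in portable POSIX awk (no gawk extensions), using index and substr to walk each line; this is my own variant, not from the answers above:
awk '{
    s = $0
    while ((i = index(s, "ctr{")) > 0) {   # locate each "ctr{" on the line
        s = substr(s, i + 4)               # skip past "ctr{"
        j = index(s, "}")                  # find the closing brace
        if (j) cnt[substr(s, 1, j - 1)]++  # count the word inside the braces
    }
}
END { for (w in cnt) print cnt[w], w }' file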
When searching for matches in files, grep is the best choice more often than not.
Using grep with a positive lookbehind and uniq -c:
$ grep -Po "(?<=ctr{)[^}]+" file | uniq -c
1 Mo7afazat
1 JaishanaIN
2 ZainElKul
1 ZainUnlimited
1 AlBarakehNew
1 ZainElKulSN
From man uniq:
Note: 'uniq' does not detect repeated lines unless they are adjacent.
For files where the duplicates are not adjacent, pipe to sort first; however, the order in which each match is found in the original file will be lost:
grep -Po "(?<=ctr{)[^}]+" file | sort | uniq -c
1 AlBarakehNew
1 JaishanaIN
1 Mo7afazat
2 ZainElKul
1 ZainElKulSN
1 ZainUnlimited
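If you need counts for non-adjacent duplicates without losing the original order, one option (my sketch, assuming GNU grep for -P as above) is to do the tallying in awk instead of uniq:
grep -Po '(?<=ctr{)[^}]+' file |
awk '!($0 in cnt) { order[++n] = $0 }   # remember first-seen order
     { cnt[$0]++ }                      # tally every occurrence
     END { for (i = 1; i <= n; i++) print cnt[order[i]], order[i] }'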
