I'm using the following command to get the counts of how many times unique IPs hit my website.
Search a log file for the total count of unique IPs:
zcat *file* | awk '{print $1}' | sort | uniq -c | sort -n
This gives me a list of IPs and their occurrence counts.
1001 109.165.113.xxx
1001 178.137.88.xxx
1001 178.175.13.xxx
1001 81.4.217.xxx
1060 74.122.180.xxx
1103 67.201.52.xxx
1203 81.144.138.xxx
1670 54.240.158.xxx
1697 54.239.137.xxx
2789 39.183.147.xxx
4630 93.158.143.xxx
What I want to find out is simple: can this be done on a single command line?
I just want the count of this list, so from the above example I want the output to tell me 11. I thought I could use a second awk command to count the unique occurrences in the second output, but I guess you cannot use awk twice in a single command line.
Obviously I can output the above to a log file and then run a second awk command to count the unique occurrences of the 2nd field (IPs), but I was hoping to get this done in a single command.
You might want:
zcat ... |
awk '{cnt[$1]++} END{for (ip in cnt) {unq++; print cnt[ip], ip}; print unq+0}'
If you have GNU awk you can add BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"} at the front to get the loop output sorted; see http://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Scanning.
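For example, the complete command might look like this (a sketch, assuming GNU awk since PROCINFO["sorted_in"] is gawk-only):
zcat *file* | awk '
  BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" }  # gawk-only: traverse the array in ascending numeric index order
  { cnt[$1]++ }                                     # tally hits per IP
  END { for (ip in cnt) { unq++; print cnt[ip], ip }; print unq+0 }'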
Here is the awk code to get the total count of unique IPs:
zcat *file* | awk '{a[$1]} END {print length(a)}'
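Note that length() on an array is an extension that not every awk implementation supports (gawk and most modern awks do); a more portable sketch that counts keys explicitly:
# count distinct values of the first field without relying on length(array)
zcat *file* | awk '!seen[$1]++ { n++ } END { print n+0 }'
Appending | wc -l to the original sort | uniq -c pipeline would give the same number.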
Related
I have a CSV file like the one below:
05032020
Col1|col2|col3|col4|col5
Infosys
Tcs
Wipro
Accenture
Deloitte
I want the record count, skipping the date and header lines.
O/p: Record count 5, including line numbers
cat FF_Json_to_CSV_MAY03.txt
05032020
requestId|accountBranch|accountNumber|guaranteeGuarantor|accountPriority|accountRelationType|accountType|updatedDate|updatedBy
0000000001|5BW|52206|GG1|02|999|CHECKING|20200503|BTCHLCE
0000000001|55F|80992|GG2|02|1999|IRA|20200503|0QLC
0000000001|55F|24977|CG|01|3999|CERTIFICAT|20200503|SRIKANTH
0000000002|5HJ|03349|PG|01|777|SAVINGS|20200503|BTCHLCE
0000000002|5M8|999158|GG3|01|900|CORPORATE|20200503|BTCHLCE
0000000002|5LL|49345|PG|01|999|CORPORATE|20200503|BTCHLCE
0000000002|5HY|15786|PG|01|999|CORPORATE|20200503|BTCHLCE
0000000003|55F|34956|CG|01|999|CORPORATE|20200503|SRIKANTH
0000000003|5BY|14399|GG10|03|10|MONEY MARK|20200503|BTCHLCE
0000000003|5PE|32100|PG|04|999|JOINT|20200503|BTCHLCE
0000000003|5LB|07888|GG25|02|999|BROKERAGE|20200503|BTCHLCE
0000000004|55F|36334|CG|02|999|JOINT|20200503|BTCHLCE
0000000005|55F|06739|GG9|02|999|SAVINGS|20200503|BTCHLCE
0000000005|5CP|39676|PG|01|999|SAVINGS|20200503|BTCHLCE
0000000006|55V|62452|CG|01|10|CORPORATE|20200503|SRIKANTH
0000000007|55V|H9889|CG|01|999|SAVINGS|20200503|BTCHLCE
0000000007|5L2|03595|PG|02|999|CORPORATE|20200503|BTCHLCE
0000000007|55V|C1909|GG8|01|10|JOINT|20200503|BTCHLCE
I need the line numbers starting from 00000000001.
There are two ways to solve your issue:
Count only the records you want to count
Count all records and remove the ones you don't want to count
From your example, it's not possible to know how to do it, but let me give you some ideas:
Imagine that your file starts with 3 header lines, then you can do something like:
wc -l inputfile | awk '{print $1-3}'
Imagine that the lines you want to count all start with a number followed by a dot; then you can do something like:
grep "^[0-9][0-9]*\." inputfile | wc -l
I have a file named tt.txt, and the contents of this file are as follows:
fdgs
jhds
fdgs
I am trying to get the repeated rows as the output in a text file.
My expected output is:
fdgs
fdgs
To do so, I used this command:
uniq -u tt.txt > output.txt
But it returns:
fdgs
jhds
fdgs
Do you know how to fix it?
If by "similar row" you mean rows with the same content:
According to the uniq man page, uniq only filters adjacent matching lines, so repeated lines are only detected when they sit next to each other. You therefore need to sort the input first and use the -D option to print all duplicated lines, as shown below. Note that -D is limited to the GNU implementation, and doing this prints the output in a different order from the input.
sort tt.txt | uniq -D
If you want the output to be in the same order as the input, you need to keep the original line numbers and sort by them again at the end, like this:
cat -n tt.txt | sort -k 2 | uniq -f 1 -D | sort -k 1,1 | sed -E 's/^\s*[0-9]+\s//'
cat -n prints the content with the line number prepended
sort -k 2 sorts the input starting at the 2nd column
uniq -f 1 -D ignores the first column when comparing and prints all duplicated lines
sort -k 1,1 sorts the output back by the original line number
sed -E 's/^\s*[0-9]+\s//' deletes the leading line-number column
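Running the full pipeline on tt.txt then prints the duplicated lines in their original order (a sketch; -D, -E and \s assume GNU uniq and GNU sed):
$ cat -n tt.txt | sort -k 2 | uniq -f 1 -D | sort -k 1,1 | sed -E 's/^\s*[0-9]+\s//'
fdgs
fdgs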
The uniq -u command outputs only the lines that are not repeated, which is the exact opposite of what you want.
One in awk:
$ awk '++seen[$0]==2;seen[$0]>1' file
fdgs
fdgs
I'm trying to count the number of times a service is called by different users in a log file.
I was thinking about using uniq -c, but almost all lines are unique thanks to the timestamp. What I want is to ignore the parts of the line I don't need and just focus on the service name and the call id which identifies each separate call.
The log format is something like this:
27/02/2017 00:00:00 [useremail#email.com] [sessioninfo(**callId**)] **serviceName**
callId and serviceName being the strings I want to filter on.
My required output would be the count of the different callIds found on the same line as the service call.
For example, for the input:
27/02/2017 00:00:00 [useremail#email.com] [sessioninfo(12345)] service1
27/02/2017 00:00:01 [useremail1#email.com] [sessioninfo(12346)] service1
27/02/2017 00:00:02 [useremail2#email.com] [sessioninfo(12347)] service1
27/02/2017 00:00:00 [useremail#email.com] [sessioninfo(12345)] service1
The output would be 3, because one of the lines uses the same callId as another.
Is there any way I could achieve this with grep, or would I need to create a more advanced script to do the job?
You may use the following awk:
awk -F '[\\(\\)\\]]+' '{ print $3 " " $4 }' somelog.log
You may combine it later with sort and then uniq to get the distinct callId/service pairs:
awk -F '[\\(\\)\\]]+' '{ print $3 " " $4 }' somelog.log | sort | uniq
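To reduce that to a single number (the count of distinct callId/service pairs), you could append wc -l; a sketch using the same field separator:
awk -F '[\\(\\)\\]]+' '{ print $3 " " $4 }' somelog.log | sort -u | wc -l
Here sort -u is equivalent to sort | uniq, and for the four sample lines above this prints 3.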
What I want is to ignore the parts of the line I don't need.
In your case, what you need is the -f option of uniq:
-f num Ignore the first num fields in each input line when doing comparisons. A
field is a string of non-blank characters separated from adjacent fields
by blanks. Field numbers are one based, i.e., the first field is field one.
So you would sort the log file, find unique lines (discounting the first three fields) with uniq -f3 and then find the number of such lines with wc -l.
i.e.
sort out.log | uniq -f 3 | wc -l
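If you also want per-callId counts rather than just the total, a variant of the same field-skipping idea (assuming the log is in out.log as above):
# count occurrences of each distinct "sessioninfo(callId) service" combination
sort out.log | uniq -f 3 -c
# total number of distinct combinations (prints 3 for the four sample lines)
sort out.log | uniq -f 3 | wc -l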
I have a text file with 2 fields separated by ':':
i3583063:b3587412
i3583064:b3587412
i3583065:b3587412
i3583076:b3587421
i3583077:b3587421
i3583787:b3587954
i3584458:b3588416
i3584459:b3588416
i3584460:b3588416
i3584461:b3588416
i3584462:b3588416
i3584463:b3588416
i3584464:b3588416
i3584465:b3588416
Field 1 is always unique, but field 2 is not; it can be repeated. How can I identify the 1st, 2nd, 3rd, etc. occurrence of field 2? Can I use a count?
Thanks
I don't know if I've ever heard of a standard Unix count utility, but you can do this with Awk. Here's an Awk script that adds the count as a third column:
awk -F: 'BEGIN {OFS=":"} {$3=++count[$2]; print}' input.txt
It should generate the output:
i3583063:b3587412:1
i3583064:b3587412:2
i3583065:b3587412:3
i3583076:b3587421:1
i3583077:b3587421:2
i3583787:b3587954:1
i3584458:b3588416:1
i3584459:b3588416:2
i3584460:b3588416:3
i3584461:b3588416:4
i3584462:b3588416:5
i3584463:b3588416:6
i3584464:b3588416:7
i3584465:b3588416:8
The heart of the script {$3=++count[$2]; print} simply increments a counter indexed by the value of the second field, stores it in a new third field, and then outputs the line with this new field. Awk is a great little language and still well worth learning.
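If you instead want to pull out only a particular occurrence of each field-2 value, a sketch building on the same counter idea (same input.txt as above):
# keep only the first line for each value of the second field
awk -F: '!seen[$2]++' input.txt
# keep only the 3rd occurrence of each value of the second field
awk -F: '++count[$2] == 3' input.txt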
You can use the sort command with the -u parameter. This way redundant lines are removed.
sort -u filename.txt
If you want to count occurrences
sort -u filename.txt | wc -l
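If what you actually need is the number of distinct field-2 values, a sketch (same filename.txt as above):
# extract field 2, de-duplicate, count
cut -d: -f2 filename.txt | sort -u | wc -l
For the sample input this prints 4 (b3587412, b3587421, b3587954, b3588416).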
I want to grep a file for content matching particular text, save all the records that match that text to a new file, and also make sure the matched content is removed from the original file.
296949657|QL|163744584|163744581|20441||
292465754|RE|W757|3012|301316469|00|
296950717|RC|7264|00001|013|27082856203|
292465754|QL|191427266|191427266|16405||
296950717|RC|7264|AETNAACTIVE|HHRPPO|27082856203|
299850356|RC|7700|153447|0891185100102-A|W19007007201|
292465754|RE|W757|3029|301316469|00|
299850356|RC|7700|153447|0891185100104-A|W19007007201|
293695591|QL|743559415|743559410|18452||
297348183|RC|6602|E924|0048|CD101699303|
297348183|RC|6602|E924|0051|CD101699303|
108327882|QL|613440276|613440275|17435||
I have written an awk command and it works as expected for small files, but for larger files it is not working as expected... I am sure that I have missed something...
awk '{print $0 > ($0~/RC/?"RC_RECORDS":"TEST.DAT")}' TEST.DAT
Any thoughts on how to fix this?
Update 1
Now, in the above file, I always want to check whether the value of column two is |RC|; if it matches, that record should move to the RC_RECORDS file, and if the value is |RE|, it should move to RE_RECORDS. How can this be done?
Case 1:
So for example if i have records as
108327882|RE|613440276|613440275|RC||
then it should go to RE_RECORDS file.
Case 2:
108327882|RC|613440276|613440275|RE||
then it should go to RC_RECORDS
Case 3:
108327882|QL|613440276|613440275|RC||
then it should not go to either RE_RECORDS or RC_RECORDS
Case 4:
108327882|QL|613440276|613440275|RE||
then it should not go to either RE_RECORDS or RC_RECORDS
My hunch is something like this:
awk '/\|RC\|/ {print > "RC_RECORDS.DAT";next} {print > "NEWTEST.DAT"}' TEST.DAT | awk '$2 == "RC"'
awk '/\|RE\|/ {print > "RE_RECORDS.DAT";next} {print > "FINAL_NEWTEST.DAT"}' NEWTEST.DAT | awk '$2 == "RE"'
but I wanted to check if there's a better and quicker solution out there that can be used.
I think this is what you want:
Option 1
awk -F'|' '
$2=="RC" {print >"RC_RECORDS.TXT";next}
$2=="RE" {print >"RE_RECORDS.TXT";next}
{print >"OTHER_RECORDS.TXT"}' file
You can put it all on one line if you prefer, like this:
awk -F'|' '$2=="RC"{print >"RC_RECORDS.TXT";next} $2=="RE"{print >"RE_RECORDS.TXT";next}{print >"OTHER_RECORDS.TXT"}' file
Option 2
Or you can see how grep compares for speed/readability:
grep -E "^[[:alnum:]]+\|RC\|" file > RC_RECORDS.TXT &
grep -E "^[[:alnum:]]+\|RE\|" file > RE_RECORDS.TXT &
grep -vE "^[[:alnum:]]+\|R[CE]" file > OTHER_RECORDS.TXT &
wait
Option 3
This solution uses 2 awk processes and maybe achieves better parallelism in the I/O. The first awk extracts the RC records to a file and passes the rest onwards. The second awk extracts the RE records to a file and passes the rest on to be written to the OTHER_RECORDS.TXT file.
awk -F'|' '$2=="RC"{print >"RC_RECORDS.TXT";next} 1' file | awk -F'|' '$2=="RE"{print >"RE_RECORDS.TXT";next} 1' > OTHER_RECORDS.TXT
I created an 88M record file (3 GB) and ran some tests on a desktop iMac, as follows:
Option 1: 65 seconds
Option 2: 92 seconds
Option 3: 53 seconds
Your mileage may vary.
My file looks like this, i.e. 33% RE records, 33% RC records, and the rest junk:
00000000|RE|abcdef|ghijkl|mnopq|rstu
00000001|RC|abcdef|ghijkl|mnopq|rstu
00000002|XX|abcdef|ghijkl|mnopq|rstu
00000003|RE|abcdef|ghijkl|mnopq|rstu
00000004|RC|abcdef|ghijkl|mnopq|rstu
00000005|XX|abcdef|ghijkl|mnopq|rstu
00000006|RE|abcdef|ghijkl|mnopq|rstu
00000007|RC|abcdef|ghijkl|mnopq|rstu
00000008|XX|abcdef|ghijkl|mnopq|rstu
00000009|RE|abcdef|ghijkl|mnopq|rstu
Sanity Check
wc -l *TXT
29333333 OTHER_RECORDS.TXT
29333333 RC_RECORDS.TXT
29333334 RE_RECORDS.TXT
88000000 total