get the top N counts based on the value of another column - unix

I have a file with two columns that looks like this:
a 3
a 7
b 6
a 6
b 1
b 8
c 1
b 1
For each value in the first column, I'd like to find the top N counts from the second column. Using this example, I'd like to find the top 2 values in column 2 for each string in column 1. The desired output would be:
a 7
a 6
b 8
b 6
c 1
I was trying to do something like this with awk, but I am not that familiar with it. This gives the max, not top N:
awk '$2>max[$1]{max[$1]=$2; row[$1]=$0} END{for (i in row) print row[i]}'

Please try the following sort + awk solution:
sort -k2 -s -nr Input_file | awk '++array[$1]<=2' | sort -k1,1 -k2,2nr
OR
sort -k2 -s -nr Input_file | sort -k1,1 -k2,2nr | awk '++array[$1]<=2'
Brief explanation: the sort commands put the data in the right order (first by the 2nd field numerically in descending order, with -s keeping the sort stable, and then by the 1st and 2nd fields to match the OP's expected output), and awk keeps only the first 2 occurrences of each value in the 1st field.
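To see how the first pipeline behaves, assuming the sample data above is saved as Input_file, the intermediate output after the stable descending sort on the 2nd field is:
$ sort -k2 -s -nr Input_file
b 8
a 7
b 6
a 6
a 3
b 1
c 1
b 1
awk '++array[$1]<=2' then keeps only the first 2 lines seen for each value in the 1st field (b 8, a 7, b 6, a 6, c 1), and the final sort -k1,1 -k2,2nr regroups the result by the 1st field.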

$ sort -k1,1 -k2,2rn file | awk -v n=2 '(++cnt[$1])<=n'
a 7
a 6
b 8
b 6
c 1
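The n variable makes the cutoff easy to adjust; for example, assuming the same sample file (file), asking for the top 3 per key gives:
$ sort -k1,1 -k2,2rn file | awk -v n=3 '(++cnt[$1])<=n'
a 7
a 6
a 3
b 8
b 6
b 1
c 1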

Related

Extract two consecutive lines that have non-consecutive strings

I have a very large text file with 2 columns and more than 10 million lines.
In most lines, the number in column 2 is the number in column 2 of the previous line plus 1. However, a few thousand lines behave differently (see the example below).
Input file:
A 1
A 2
A 3
A 10
A 11
A 12
A 40
A 41
I would like to extract the pair of two lines that do not respect the +1 increment in column 2.
Desired output file:
A 3
A 10
A 12
A 40
Is there (preferentially) an awk command that allows to do that?
I tried several approaches comparing column 2 of two consecutive lines, but so far without success (see the code below).
awk 'FNR==1 {print; next} $2==p2+1 {print p $0; p=""; next} {p=$0 ORS; p2=$2}' input.txt > output.txt
Thanks for your help. Best,
Would you please try the following:
awk 'NR>1 {if ($2!=p2+1) print p ORS $0} {p=$0; p2=$2}' input.txt > output.txt
Output:
A 3
A 10
A 12
A 40
The variable names are similar to yours: p holds the previous line and p2 holds the second column of the previous line.
The condition NR>1 suppresses printing on the 1st line, where there is no previous line yet.
if ($2!=p2+1) print p ORS $0 prints the pair of lines that breaks the +1 increment.
The block {p=$0; p2=$2} saves the current line's values for the next iteration.
I like perl for text processing that needs arithmetic.
$ perl -ane 'print $p, $_ if $. > 1 && $F[1] != $fp + 1; $fp = $F[1]; $p = $_' input.txt
A 3
A 10
A 12
A 40
This uses -a to autosplit each line into @F (so the 2nd column is $F[1]).
The comparison is skipped on the first line, where there is no previous value yet: $. > 1
On subsequent lines, it prints the previous line and the current line if the 2nd field isn't exactly one more than the prior 2nd field: print $p, $_ if $F[1] != $fp + 1
It then saves the 2nd field as $fp and the entire line as $p: $fp = $F[1]; $p = $_
Assumptions:
columns are tab-delimited
the 1st column may contain white space (this isn't demonstrated in the sample provided by OP but it also hasn't been ruled out)
lines of interest must have the same value in the 1st column (ie, if the values in the 1st column differ then we don't bother with comparing the values in the 2nd column and instead proceed to the next input line)
if 3 consecutive lines meet the criteria, the 2nd/middle line is only printed once
Setup:
$ cat input.txt
A 1
A 2
A 3 # match
A 10 # match
A 11
A 12 # match
A 23 # match
A 40 # match
A 41
X to Z 101
X to Z 102 # match
X to Z 104 # match
X to Z 105
NOTE: comments only added here to highlight the lines that match the search criteria
One awk idea:
awk -F'\t' '
FNR==1 { prevline=$0 }
FNR>1  { if ($1 == prev1 && $2+0 != prev2+1) {
             if (prevline) print prevline
             print
             prevline=""        # make sure this line is not printed again if next line also meets criteria
         }
         else
             prevline=$0
       }
       { prev1=$1; prev2=$2 }
' input.txt
This generates:
A 3
A 10
A 12
A 23
A 40
X to Z 102
X to Z 104
This might work for you (GNU sed):
sed -nE 'N;h
s/.*\s+(.*)\n.*(\s.*)/echo "$((\1+1))\2"/e;/^(.*)\s\1$/!{x;p;x};x;D' file
Open a two line window throughout the length of the file.
Make a copy of the window and increment the 2nd column of the first line by one. If this amended value is not equal to the 2nd column of the second line, print both unadulterated lines.
Delete the first line and repeat.
N.B. This may print the second of these lines twice if the following line meets the same criteria.
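If that matters, the repeated copies are adjacent in the output, so piping the result through uniq should collapse them (a sketch, not part of the original answer):
sed -nE 'N;h
s/.*\s+(.*)\n.*(\s.*)/echo "$((\1+1))\2"/e;/^(.*)\s\1$/!{x;p;x};x;D' file | uniq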

Count the number of times a word appears in a specific order (Unix)

I want to count how many times A and B show up in this example file:
Ex:
1,2,3,A
2,3,1,A
3,1,2,A
1,2,3,B
1,3,2,B
Expected Output should be:
A 3
B 2
So far I have:
grep -cw "*A" <file>
with output:
3
Which only displays the number of occurrences.
Try this:
mayankp@mayank:~$ cat t1.txt
1,2,3,A
2,3,1,A
3,1,2,A
1,2,3,B
1,3,2,B
mayankp@mayank:~$ awk -F, 'NR{arr[$4]++}END{for (a in arr) print a, arr[a]}' t1.txt
A 3
B 2
Let me know if this helps.
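As a side note, the NR pattern in that command is always true, so it can simply be dropped; a minimal equivalent would be:
awk -F, '{arr[$4]++} END{for (a in arr) print a, arr[a]}' t1.txt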

Performing calculations based on customer ID in comma-separated file [duplicate]

This question already has an answer here:
Use awk to sum or average for each unique ID
(1 answer)
Closed 6 years ago.
I have a file that contains several comma-separated columns, including a customer ID in the first column.
One customer ID may occur on several rows, but always refers to the same real customer.
How do I run basic calculations in a shell script based on this ID column? For example, calculating the sum of the mileages (the 5th field) for the given customer ID.
102,305,Jin,Kerala,40
104,308,Paul,US,45
105,350,Nina,AUS,50
102,390,Jin,Kerala,10
104,395,Paul,US,35
102,399,Jin,Kerala,35
5th field is the mileage, 1st field is the customer ID.
This is a simple awk script that will sum up the mileages and print the customer IDs together with the sums at the end:
#!/usr/bin/awk -f
BEGIN { FS = "," }
{
    customer_id = $1;
    mileage = $5;
    total_mileage[customer_id] += mileage;
}
END {
    for (customer_id in total_mileage) {
        print customer_id, total_mileage[customer_id];
    }
}
To run (after making it executable with chmod +x script.awk):
$ ./script.awk data.in
102 85
104 80
105 50
Alternatively, as a "one-liner":
$ awk -F, '{t[$1]+=$5} END {for (c in t){print c,t[c]}}' data.in
102 85
104 80
105 50
While I agree with @wilx that using a database might be smarter, this sample awk script should get you started:
awk -v FS=',' '{miles[$1] += $5}
END { for (customerid in miles) {
print customerid, miles[customerid]; } }' customers
You can get a list of unique IDs using something like (assuming the first column is the ID):
awk '{print $1}' inputFile | sort -u
This outputs the first field of every single line in the input file inputFile, sorts them and removes duplicates.
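Note that the question's data is comma-separated, so for that file (data.in in the earlier answers) you would add -F',' so that the ID really is the first field; the test data further below is whitespace-separated, so the default field splitting is fine there:
awk -F',' '{print $1}' data.in | sort -u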
You can then use that method with a bash loop to process each of the unique IDs with another awk command to perform some action on them. In the following snippet, I print out the matching lines for each ID:
for id in $(awk '{print $1}' inputFile | sort -u) ; do
    echo "${id}:"
    awk -v id="${id}" '$1==id {print " "$0}' inputFile
done
In that code, for each individual ID, it first outputs the ID then uses awk to only process lines matching that ID. The action carried out is to output the full line with indentation.
Of course, you can do anything you wish with the lines matching each ID. Below is an example more closely matching your requirements.
First, here's an input file I used for testing - we can assume field 1 is the customer ID and field 2 the mileage:
$ cat inputFile
a 1
b 2
c 3
a 4
b 5
c 6
a 7
b 8
c 9
b 10
c 11
c 12
And here's a command-line transcript of the proposed method (note that $ and + are the input prompt and continuation prompt respectively; they are not part of the actual commands):
$ for id in $(awk '{print $1}' inputFile | sort -u) ; do
+ awk -vid=${id} '
+ $1==id {print $0; sum += $2 }
+ END {print "Total: "sum; print }
+ ' inputFile
+ done
a 1
a 4
a 7
Total: 12
b 2
b 5
b 8
b 10
Total: 25
c 3
c 6
c 9
c 11
c 12
Total: 41
Keep in mind that, for non-huge data sets, it's also possible to do this in a single-pass awk script, using associative arrays to store the totals and then outputting all the data in the END block. I tend to prefer the multi-pass approach since it minimises the possibility of running out of memory. The trade-off, of course, is that it will take longer since you're processing the file more than once.
For a single-pass solution, you can use something like:
$ awk '{sum[$1] += $2} END {for (key in sum) print key": "sum[key]}' inputFile
which gives you:
a: 12
b: 25
c: 41

Count lines that contain a repeated pattern

I have a file (File.txt) that contains 10 lines, which may contain a repeated pattern at a certain position (columns 13-16):
fsdf sfkljkl4565
fjjf lmlkfdm1235
fkljfgdfgdfg6583
eretjioijolj6933
ioj ijijsfoi4565
dgodiiopkpok6933
fsj opkjfiej4565
ihfzejijjijf4565
dfsdkfjlfeff1235
dijdijojijdz4565
The desired output counts the lines that contain each pattern:
#occurences pattern
5 4565
2 1235
1 6583
2 6933
I have tried to filter the file
cat File.txt | cut -c14-17 | sort -n -K1,1-1,3 >> File_Filtered.txt
I need your help to add the first column (#occurences)
To get a count of repeats, use uniq -c. Thus, try:
$ cut -c13-17 File.txt | sort -n | uniq -c | sort -nr
5 4565
2 6933
2 1235
1 6583
The above was tested using Linux with GNU utilities. (Judging by your sample code, you may be using different tools.)
Including a header
The following includes the header and uses column -t to ensure that everything lines up nicely:
$ { echo '#occurences pattern'; cut -c13-17 File.txt | sort -n | uniq -c | sort -nr; } | column -t
#occurences pattern
5 4565
2 6933
2 1235
1 6583
$ awk '{cnt[substr($0,13)]++} END{for (i in cnt) print cnt[i], i}' file
2 6933
1 6583
5 4565
2 1235
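The order of for (i in cnt) is unspecified, so if you want the output ordered by count like the uniq -c version, one option is to append a sort:
$ awk '{cnt[substr($0,13)]++} END{for (i in cnt) print cnt[i], i}' file | sort -rn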

Make a file counting instances in sets of 5 million

I have a file that looks like this:
1 rs531842 503939 61733 G A
1 rs10494103 35025 114771 C T
1 rs17038458 254490 21116837 G A
1 rs616378 525783 21127670 T C
1 rs3845293 432526 21199392 A C
2 rs16840461 233620 157112959 A G
2 rs1560628 224228 157113214 T C
2 rs17200880 269314 257145829 C T
2 rs10497165 35844 357156412 C T
2 rs7607531 624696 457156575 T C
...with column 1 stretching on to 22, and several thousand entries in total.
I want to create a file that lists the 5-million-wide bins of column 4 which have data, split by column 1.
Basically, all but columns 1 and 4 can be discarded. A simple input would look like this:
InputChr1:
61733
114771
21116837
21127670
21199392
InputChr2:
157112959
157113214
257145829
357156412
457156575
So, for the example above, I would want to get two files that look like this:
OutputChr1.txt
Start End Occurrences
1 5000000 2
20000001 25000000 3
OutputChr2.txt
Start End Occurrences
155000001 160000000 2
255000001 260000000 1
355000001 360000000 1
455000001 460000000 1
Any ideas? It seems like something that should be doable with lapply in R, but I can't get the for loops to work...
EDIT: Actually, I made this look much harder than it needed to be - basically, I want to split the original file by column 1, extract the data in column 4, and then count the instances in bins of 5 million.
(Apologies for slightly random tags, just trying to think of which tools might be best!)
Well, this happened to be quite challenging. I couldn't find a way to do it with a single awk command, though.
awk -v const=5000000 -v max=150 '{a[$1,int($4/const)]++; b[$1]}
    END{for (i in b)
          {for (j=0; j<max; j++)
              print i, j*const +1, (j+1)*const, a[i,j]
          }
    }' file
And then to get only the results:
awk 'NF==4'
Explanation
-v const=5000000 -v max=150 set the variables. const is the 5-million bin width used to split the results. max is the largest bin index up to which we look for data in the END block.
a[$1,int($4/const)]++ builds an array indexed by (1st field, bin of the 4th field). Note that the second part of the index, int($4/const), maps 23432 --> 0, 6000000 --> 1, etc. That is, it tells us which block of values each 4th-column entry falls into.
b[$1] keeps track of the first-column values that have been seen.
END{for (i in b) {for (j=0; j<max; j++) print i, j*const +1, (j+1)*const, a[i,j]}}' prints the values.
awk 'NF==4' prints only those lines that have 4 columns. This way it outputs only the bins in which there were matches.
In case you want to store the values into a new file, you can do
awk 'NF==4 {print > ("OutputChr" $1 ".txt")}'
Sample output
$ awk -v const=5000000 -v max=150 '{a[$1,int($4/const)]++; b[$1]} END{for (i in b) {for (j=0; j<max; j++) print i, j*const +1, (j+1)*const, a[i,j]}}' a | awk 'NF==4'
1 1 5000000 2
1 20000001 25000000 3
2 155000001 160000000 2
2 255000001 260000000 1
2 355000001 360000000 1
2 455000001 460000000 1
All in one
awk '{ v=int($4/const)
       a[$1 FS v]++
       min[$1]=min[$1]<v?min[$1]:v   # track the minimum bin of column $4 for group $1
       max[$1]=max[$1]>v?max[$1]:v   # track the maximum bin of column $4 for group $1
     }
     END{ for (i in min)
            for (j=min[i]; j<=max[i]; j++)   # loop from the minimum to the maximum bin of each group
              if (a[i FS j]!="") print j*const+1, (j+1)*const, a[i FS j] > ("OutputChr" i ".txt")   # if data exists for this bin, append it to file "OutputChr" i ".txt"
     }' const=5000000 file
result:
$ cat OutputChr1.txt
1 5000000 2
20000001 25000000 3
$ cat OutputChr2.txt
155000001 160000000 2
255000001 260000000 1
355000001 360000000 1
455000001 460000000 1
