Remove lines duplicated more than n times - unix

I want to remove lines duplicated more than 3 times (or 4 times) in the first 3 columns. The main goal is to remove lines where the genomic coordinates are duplicated more than 3 or 4 times.
Input file.tsv
chr    position  position2  ref  alt
chr21  10464942  10464942   T    C
chr21  10464942  10464942   T    C
chr21  10464961  10464961   A    G
chr21  10464961  10464961   C    G
chr21  10464961  10464961   A    G
chr21  10464961  10464961   T    C
chr21  10465086  10465086   T    C
Desired output if n=3
chr    position  position2  ref  alt
chr21  10464942  10464942   T    C
chr21  10464942  10464942   T    C
chr21  10465086  10465086   T    C
I tried awk '{if(!seen[$1,$2,$3]++) {if(++count[$1,$2,$3]<=3) print} }' and some sort and uniq combinations, but they don't get me the output I want.

Annotating with a dup count lets us easily solve this.
Python will be more convenient than awk.
import csv

import typer


def get_annotated_rows(sheet, prefix_length=3):
    """Generates (count, row) tuples.

    Count for a row will be 1 if this is the first time we've seen it,
    and increments with each duplicate row.
    We assess duplicates by examining just an initial prefix of each row.
    """
    prev = None
    count = 0
    for row in sheet:
        prefix = row[:prefix_length]
        if prefix != prev:
            prev = prefix
            count = 1
        else:
            count += 1
        yield count, row


def main(infile: str = "input_file.tsv", n: int = 4):
    with open(infile) as fin:
        sheet = csv.reader(fin, delimiter="\t")
        for count, row in get_annotated_rows(sheet):
            if count <= n:
                print("\t".join(row))


if __name__ == '__main__':
    typer.run(main)
install:
$ pip install typer
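Then, assuming you save the script as dedupe.py (the file name is arbitrary), you can run it with your own threshold:
$ python dedupe.py --infile file.tsv --n 3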

A common shell scripting trick is to reformat the data so it can be processed easily by the standard *nix utilities. The troublesome utility here is uniq, whose -f (skip fields) option only skips fields at the front of the record. Often you wish you could skip fields at the end of the record instead, so we rely on awk to reformat the data so that the skippable fields come last:
#!/bin/bash
awk 'NR>1{print $4 " " $5 " " $1 "_" $2 "_" $3 }' data.txt \
  | sort -k3 | uniq -cf2 \
  | awk '$1<3{
        split($4,arr,"_")
        for (i=1;i<=$1;i++) {
            print arr[1]"\t" arr[2]"\t" arr[3]"\t" $2 " " $3
        }
    }'
Output
chr21 10464942 10464942 T C
chr21 10464942 10464942 T C
chr21 10465086 10465086 T C
You can change the field separators in the print statement as needed to match the needs of your consuming system.
(And this code can be folded up onto one line, giving the much desired (if misvalued) "oneliner" (-: ).
IHTH
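If you would rather stay entirely in awk, here is a minimal two-pass sketch (assuming tab-separated input and n=3): the first pass counts each coordinate key, the second pass prints only lines whose key occurs at most 3 times (the header passes through because its key occurs once):
awk -F'\t' 'NR==FNR {cnt[$1,$2,$3]++; next} cnt[$1,$2,$3] <= 3' file.tsv file.tsv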

Related

Drop or remove column using awk

I want to drop the first 3 columns.
This is my data:
DETAIL 02032017
Name    Gender  State   School  Class
A       M       Melaka  SS      D
B       M       Johor   BB      E
C       F       Pahang  AA      F
EOF 3
I want my data like this:
DETAIL 02032017
School Class
SS D
BB E
AA F
EOF 3
This is my current command, from which I get mycommandoutput:
awk -v date="$(date +"%d%m%Y")" -F\| 'NR==1 {h=$0; next}
{file="TEST_"$1"_"$2"_"date".csv";
print (a[file]++?"": "DETAIL"date"" ORS h ORS) $0 > file} END{for(file in a) print "EOF " a[file] > file}' testing.csv
Can anyone help me?
Thank you :)
I want to remove the first three columns.
If you just want to remove the first three columns, you can just set them to empty strings, leaving alone those that don't have three columns, something like:
awk 'NF>=3 {$1=""; $2=""; $3=""; print; next}{print}'
That has the potentially annoying habit of still having the field separators between those empty fields but, since modifying columns will reformat the line anyway, I assume that's okay:
DETAIL 02032017
School Class
SS D
BB E
AA F
EOF 3
If awk is the only tool being used to process them, the spacing won't matter. If you do want to preserve formatting (meaning that the columns are at very specific locations on the line), you can just get a substring of the entire line:
awk '{if (NF>=3) {$0 = substr($0,25)}; print}'
Since that doesn't modify individual fields, it won't trigger a recalculation of the line that would change its format:
DETAIL 02032017
School  Class
SS      D
BB      E
AA      F
EOF 3
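Going back to the first approach: if those leftover separators do bother you (and you don't need to preserve the fixed-width layout), a small hedged variation rebuilds the line and strips the leading blanks after blanking the fields:
awk 'NF>=3 {$1=$2=$3=""; sub(/^[ \t]+/,"")} {print}'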

Performing calculations based on customer ID in comma-separated file [duplicate]

This question already has an answer here: Use awk to sum or average for each unique ID.
I have a file that contains several comma-separated columns, including a customer ID in the first column.
One customer ID may occur on several rows, but always refers to the same real customer.
How do I run basic calculations in a shell script based on this ID column? For example, calculating the sum of the mileages (the 5th field) for the given customer ID.
102,305,Jin,Kerala,40
104,308,Paul,US,45
105,350,Nina,AUS,50
102,390,Jin,Kerala,10
104,395,Paul,US,35
102,399,Jin,Kerala,35
5th field is the mileage, 1st field is the customer ID.
This is a simple awk script that will sum up the mileages and print the customer IDs together with the sums at the end:
#!/usr/bin/awk -f
BEGIN { FS = "," }
{
    customer_id = $1;
    mileage = $5;
    total_mileage[customer_id] += mileage;
}
END {
    for (customer_id in total_mileage) {
        print customer_id, total_mileage[customer_id];
    }
}
To run (after making it executable with chmod +x script.awk):
$ ./script.awk data.in
102 85
104 80
105 50
Alternatively, as a "one-liner":
$ awk -F, '{t[$1]+=$5} END {for (c in t){print c,t[c]}}' data.in
102 85
104 80
105 50
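If you also want the average per customer (the linked duplicate asks about sums or averages), a hedged variation of the same one-liner keeps a per-ID row count as well:
$ awk -F, '{t[$1]+=$5; n[$1]++} END {for (c in t) print c, t[c], t[c]/n[c]}' data.in
This prints the ID, the total mileage and the mean mileage for each customer.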
While I agree with @wilx that using a database might be smarter, this sample awk script should get you started:
awk -v FS=',' '{miles[$1] += $5}
END { for (customerid in miles) {
print customerid, miles[customerid]; } }' customers
You can get a list of unique IDs using something like (assuming the first column is the ID):
awk '{print $1}' inputFile | sort -u
This outputs the first field of every single line in the input file inputFile, sorts them and removes duplicates.
You can then use that method with a bash loop to process each of the unique IDs with another awk command to perform some action on them. In the following snippet, I print out the matching lines for each ID:
for id in $(awk '{print $1}' inputFile | sort -u) ; do
    echo "${id}:"
    awk -vid=${id} '$1==id {print " "$0}' inputFile
done
In that code, for each individual ID, it first outputs the ID then uses awk to only process lines matching that ID. The action carried out is to output the full line with indentation.
Of course, you can do anything you wish with the lines matching each ID. Below is an example more closely matching your requirements.
First, here's an input file I used for testing - we can assume field 1 is the customer ID and field 2 the mileage:
$ cat inputFile
a 1
b 2
c 3
a 4
b 5
c 6
a 7
b 8
c 9
b 10
c 11
c 12
And here's a command-line transcript of the method proposed (note that $ and + are the input prompt and continuation prompt respectively; they are not part of the actual commands):
$ for id in $(awk '{print $1}' inputFile | sort -u) ; do
+ awk -vid=${id} '
+ $1==id {print $0; sum += $2 }
+ END {print "Total: "sum; print }
+ ' inputFile
+ done
a 1
a 4
a 7
Total: 12
b 2
b 5
b 8
b 10
Total: 25
c 3
c 6
c 9
c 11
c 12
Total: 41
Keep in mind that, for non-huge data sets, it's also possible to do this in a single-pass awk script, using associative arrays to store the totals and then outputting all the data in the END block. I tend to prefer the multi-pass approach myself since it minimises the possibility of running out of memory. The trade-off, of course, is that it will no doubt take longer since you're processing the file more than once.
For a single-pass solution, you can use something like:
$ awk '{sum[$1] += $2} END {for (key in sum) { print key": "sum[key]}}' inputFile
which gives you:
a: 12
b: 25
c: 41

Unix print pattern between the Strings

I have a file with content like the below. START and STOP delimit each block.
START
X | 123
Y | abc
Z | +=-
STOP
START
X | 456
Z | +%$
STOP
START
X | 789
Y | ghi
Z | !##
STOP
I would like to get the values of X and Y printed in the format below for each block:
123 ~~ abc
456 ~~
789 ~~ ghi
If there were a single occurrence of START/STOP, sed -n '/START/,/STOP/p' would have helped. Since the blocks repeat, I need your help.
Based on my own solution to How to select lines between two marker patterns which may occur multiple times with awk/sed:
awk -v OFS=" ~~ " '
/START/{flag=1;next}
/STOP/{flag=0; print first, second; first=second=""}
flag && $1=="X" {first=$3}
flag && $1=="Y" {second=$3}' file
Test
$ awk -v OFS=" ~~ " '/START/{flag=1;next}/STOP/{flag=0; print first, second; first=second=""} flag && $1=="X" {first=$3} flag && $1=="Y" {second=$3}' a
123 ~~ abc
456 ~~
789 ~~ ghi
Sed is always the wrong choice for any problem that involves processing multiple lines. All of sed's arcane constructs for doing so became obsolete in the mid-1970s when awk was invented.
Whenever you have name-value pairs in your input, I find it useful to create an array that maps each name to its value and then access the array by the names. In this case, using GNU awk for multi-char RS and delete array:
$ cat tst.awk
BEGIN {
    RS = "\nSTOP\n"
    OFS = " ~~ "
}
{
    delete n2v
    for (i=2; i<=NF; i+=3) {
        n2v[$i] = $(i+2)
    }
    print n2v["X"], n2v["Y"]
}
$ gawk -f tst.awk file
123 ~~ abc
456 ~~
789 ~~ ghi
Because I like brain teasers (not because this sort of thing is practical to do in sed), a possible sed solution is
sed -n '/START/,/STOP/ { //!H; // { g; /^$/! { s/.*\nX | \([^\n]*\).*/\1 ~~/; ta; s/.*/~~/; :a G; s/\n.*Y | \([^\n]*\).*/ \1/; s/\n.*//; p; s/.*//; h } } }'
This works as follows:
/START/,/STOP/ {                  # between two start and stop lines
  //! H                           # assemble the lines in the hold buffer
                                  # note that // repeats the previously
                                  # matched pattern, so // matches the
                                  # start and end lines, //! all others.
  // {                            # At the end
    g                             # That is: When it is one of the
    /^$/! {                       # boundary lines and the hold buffer
                                  # is not empty
      s/.*\nX | \([^\n]*\).*/\1 ~~/   # isolate the X value, append ~~
      ta                              # if there is no X value, just use ~~
      s/.*/~~/
      :a
      G                               # append the hold buffer to that
      s/\n.*Y | \([^\n]*\).*/ \1/     # and isolate the Y value so that
                                      # the pattern space contains X ~~ Y
      s/\n.*//                        # Cutting off everything after a newline
                                      # is important if there is no Y value
                                      # and the previous substitution did
                                      # nothing
      p                               # print the result
      s/.*//                          # and make sure the hold buffer is
      h                               # empty for the next block.
    }
  }
}

AWK to use multiple spaces as delimiter

I am using the below command to join two files on the first two columns.
awk 'NR==FNR{a[$1,$2]=substr($0,3);next} ($1,$2) in a{print $0, a[$1,$2] > "br0102_3.txt"}' br01.txt br02.txt
Now, by default awk uses whitespace as the separator, but my file may contain a single space between two words, e.g.
File 1:
ABCD    TEXT1 TEXT2   123123112312312312312312312312312312
BCDEFG  TEXT3TEXT4    133123123123123123123123123125423423
QWERT   TEXT5TEXT6    123123123123125456678786789698758567
File 2:
ABCD    TEXT1 TEXT2   12312312312312312312312312312
BCDEFG  TEXT3TEXT4    31242342342342342342342342343
MNHT    TEXT8 TEXT9   31242342342342342342342342343
I want the result file as:
ABCD    TEXT1 TEXT2   123123112312312312312312312312312312 12312312312312312312312312312
BCDEFG  TEXT3TEXT4    133123123123123123123123123125423423 31242342342342342342342342343
QWERT   TEXT5TEXT6    123123123123125456678786789698758567
MNHT    TEXT8 TEXT9   31242342342342342342342342343
Any hints?
awk supports a regular expression as the value of FS so you can specify a regular expression that matches at least two spaces. Something like -F '[[:space:]][[:space:]]+'.
$ awk '{print NF}' File2
4
3
4
$ awk -F '[[:space:]][[:space:]]+' '{print NF}' File2
3
3
3
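Applied to your join command, that might look something like the following untested sketch (it keeps your matching logic and output redirection and only swaps in the two-or-more-spaces field separator so that $3 is the whole numeric column):
awk -F '[[:space:]][[:space:]]+' 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2] > "br0102_3.txt"}' br01.txt br02.txt
Note that, like your original command, this only emits matching rows; producing the unmatched rows from both files as in your desired result would need extra handling.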
You are using fixed-width fields, so you should be using GNU awk's FIELDWIDTHS (or similar) to separate the fields, e.g. if the 2nd field is the 15 chars from char 8 to char 22 inclusive in this file:
$ cat file
abc    def ghi        klm
AAAAAAAB C D E F G H IJJJJ
 abc    def ghi       klm
$ awk -v FIELDWIDTHS="7 15 4" '{print "<" $2 ">"}' file
<def ghi        >
<B C D E F G H I>
< def ghi       >
Any solution that relies on a certain number of spaces between fields will fail when you have 1 or zero spaces between your fields.
If you want to strip leading/trailing blanks from your target field(s):
$ awk -v FIELDWIDTHS="7 15 4" '{gsub(/^\s+|\s+$/,"",$2); print "<" $2 ">"}' file
<def ghi>
<B C D E F G H I>
<def ghi>
awk automatically detects multiple spaces if the field separator is set to " ".
Thus, this simply works:
awk -F' ' '{ print $2 }'
to get the second column if you have a table like the one mentioned.
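For example (a quick illustration; note that FS=" " also splits on single spaces, which is exactly the problem described in the question):
$ printf 'ABCD   TEXT1 TEXT2   123\n' | awk -F' ' '{ print $2 }'
TEXT1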

command to identify distinct field value

Could anyone help me with a command to identify distinct values from a particular column?
For example, my input is like:
Column1 Column2 Column3
a 11 abc
a 22 abc
b 33 edf
c 44 ghi
I require an output like:
Column1
a
b
c
My input file has a header, so I need a command to which we can pass Column1 as a parameter.
Run the following command, with an input file:
$ head -1 input.file | awk '{ print $1}'; awk '{ if (NR > 1) print $1 }' input.file | uniq
Column1
a
b
c
Or just:
$ awk '{print $1 }' input.file | uniq
Column1
a
b
c
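Since you want to pass the column name as a parameter, here is a hedged sketch (assuming whitespace-separated columns, the header on the first line, and values that are grouped together as in your sample, so uniq suffices) that looks the name up in the header first:
$ awk -v col=Column1 'NR==1 {for (i=1;i<=NF;i++) if ($i==col) c=i} {print $c}' input.file | uniq
Column1
a
b
c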
File distinct.pl:
#!/usr/bin/perl
$_ = <STDIN>;
@F = split;
map $col{$F[$_]} = $_, (0..$#F);     # map column names to numbers
while (<STDIN>)
{
    @F = split;
    $val{$F[$col{$ARGV[0]}]} = undef # implement the set of values
}
$, = "\n";
print $ARGV[0], "";                  # output the column parameter
print sort(keys %val), ""            # output sorted set of values
Example command: distinct.pl Column2 <input
Note: Non-existent column names yield the values from the first column.
