Excluding options for AWK, separating breadXI from breadX - unix

I'm manipulating lines from a .vcf file where bread is listed 1 through 20 in Roman numerals. I only want the lines corresponding to bread 10, so I've used
awk '/breadX/ {print}' file.vcf > Test.txt
to output the lines containing "breadX" to Test.txt. That works, but it also includes "breadXI" through "breadXX" in the list. Is there an option to exclude those, given that "breadX" is out of order and somewhere in the middle of the file (XIV...X...XX), and that each line contains more information after the name? I only want lines that start with bread 10, and not any of the others. Any help would be appreciated.

Without a definitive data sample showing what can follow breadX, just exclude every case where another Roman numeral symbol (I, V, X, L, C, D, M) follows it:
$ awk '/^breadX([^IVXLCDM]|$)/' file
Sample test file:
$ cat file
breadX
breadXI
breadX2
3
Test it:
$ awk '/^breadX([^IVXLCDM]|$)/' file
Output:
breadX
breadX2

If breadX is a word, you can use word boundaries to limit your search (\< and \> are GNU awk extensions, so this needs gawk).
cat file
test breadXI more
hi breadX yes
cat home breadXX
awk '/\<breadX\>/' file
hi breadX yes
\< start of word
\> end of word
PS: you do not need the print, since printing the line is the default action when the test is true.
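If the name is always the first whitespace-separated field of the line (an assumption, since no real sample of the .vcf was posted), an exact string comparison on that field sidesteps the regex question entirely. A minimal sketch with made-up sample data:

```shell
# Hypothetical sample: the bread name is the first whitespace-separated field.
tmp=$(mktemp)
printf 'breadXIV\t10\tA\nbreadX\t100\tC\nbreadXX\t300\tG\n' > "$tmp"

# Exact comparison on field 1, so breadXIV and breadXX cannot match.
result=$(awk '$1 == "breadX"' "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

This works in any awk, not only gawk, because it uses no regex extensions.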

Related

Median Calculation in Unix

I need to calculate median value for the below input file. It is working fine for odd occurrences but not for even occurrences. Below is the input file and the script used. Could you please check what is wrong with this command and correct the same.
Input file:
col1,col2
AR,2.52
AR,3.57
AR,1.29
AR,6.66
AR,3.05
AR,5.52
Desired Output:
AR,3.31
Unix command:
cat test.txt | sort -t"," -k2n,2 | awk '{arr[NR]=$1} END { if (NR%2==1) print arr[(NR+1)/2]; else print (arr[NR/2]+arr[NR/2+1])/2}'
Don't forget that your input file has an additional line, containing the header. You need to take an additional step in your awk script to skip the first line.
Also, because you're using the default field separator, $1 contains the whole line, so your expression (arr[NR/2]+arr[NR/2+1])/2 is never going to work. I suggest changing it so that awk splits the input on a comma, and then using the second field, $2.
sort -t, -k2n,2 file | awk -F, 'NR>1{a[++i]=$2}END{if(i%2==1)print a[(i+1)/2];else print (a[i/2]+a[i/2+1])/2}'
I also removed your useless use of cat. Most tools, including sort and awk, are capable of reading in files directly, so you don't need to use cat with them.
Testing it out:
$ cat file
col1,col2
AR,2.52
AR,3.57
AR,1.29
AR,6.66
AR,3.05
AR,5.52
$ sort -t, -k2n,2 file | awk -F, 'NR>1{a[++i]=$2}END{if(i%2==1)print a[(i+1)/2];else print (a[i/2]+a[i/2+1])/2}'
3.31
It shouldn't be too difficult to modify the script slightly to change the output to whatever you want.
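As one possible modification, here is a sketch that prints the key together with the median, matching the desired AR,3.31 format. It assumes the file holds a single key in column 1, and it drops the header with tail before sorting rather than relying on where the header lands in the sorted output:

```shell
# Sample data copied from the question.
tmp=$(mktemp)
printf 'col1,col2\nAR,2.52\nAR,3.57\nAR,1.29\nAR,6.66\nAR,3.05\nAR,5.52\n' > "$tmp"

# Drop the header, sort numerically on column 2, then pick the middle value(s).
result=$(tail -n +2 "$tmp" | sort -t, -k2,2n | awk -F, '
  { key = $1; a[++i] = $2 }
  END {
    if (i == 0) exit   # nothing to report for an empty file
    m = (i % 2) ? a[(i + 1) / 2] : (a[i / 2] + a[i / 2 + 1]) / 2
    printf "%s,%.2f\n", key, m
  }')
printf '%s\n' "$result"

rm -f "$tmp"
```

With several keys in column 1 you would need one array per key, which this sketch does not attempt.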

copying everything from line number xxxxx to line number zzzzzz in vi editor

I would like to copy a couple of screenfuls of lines using the vi editor: everything from line number xxxx to line number zzzzz.
Then, I want to write these lines into another file.
In the command mode (hit <ESC>) type:
:X,Zy
Where X is the first line and Z is the last line.
Example
Copy lines 3 to 500:
:3,500y
To insert, go to the line after which you want to insert the copied lines and hit p (lowercase).
If you want to insert the lines before a particular line, hit P (uppercase).
If you want to do this in vi then you can use:
:XXX,ZZZy<enter>
However, it looks like you want to store these lines in another file. In that case, awk comes in handy:
awk 'NR==XXX,NR==ZZZ' a > new_file
If the numbers happen to be variables, use them as this:
awk -v first="$first" -v last="$last" 'NR==first,NR==last' a > new_file
Test
Let's print a sequence of 50 numbers in the file a, each one in a different line:
$ seq 50 > a
Then, we produce the output:
$ awk 'NR==5,NR==7' a > new_file
$ cat new_file
5
6
7
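The same extraction can also be sketched with sed. The variable names first and last are placeholders, and the q command is only an optimisation that stops reading once the last wanted line has been printed:

```shell
# Build the same 50-line sample file used in the test above.
tmp=$(mktemp)
seq 50 > "$tmp"

first=5
last=7
# -n suppresses the default output, "p" prints the range, "q" quits at the last line.
result=$(sed -n "${first},${last}p;${last}q" "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

On a large file the early quit avoids scanning the remaining lines.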

Filter file based on internal value within a column

In UNIX, I would like to filter my 3-column file based on the "DP" values within the 3rd column.
I'd like to obtain only rows that have DP values higher than 7.
A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
I'm using "|" here to separate my three columns.
Here's one simple solution
$ echo "A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;" \
| awk '{dpVal=$0;sub(/.*DP=/, "", dpVal);sub(/;.*$/,"", dpVal); if (dpVal+0>7) print}'
output
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
This makes a copy of each line ($0), then strips away everything up to and including DP=, and everything from the ; that ends that field onward, leaving just the value for DP. That value is tested, and if the test is true the whole line is printed (by default awk's print prints the whole line, but you can tell it to print anything you like, maybe print "Found it:" $0, or any of a zillion variants).
edit
I would like to keep all the first 53 lines intact and save them as well to my Output.txt file.
Yes, very easy, you're on the right track. With awk it is very easy to have multiple conditions process different parts of a file. Try this:
awk 'FNR <= 53 {print}
FNR > 53 {
dpVal=$0;sub(/.*DP=/, "", dpVal);sub(/;.*$/,"", dpVal)
# +0 forces a numeric comparison, so DP=10 is not compared as the string "10"
if (dpVal+0>7) print
}' File.vcf > Output.txt
(i don't have a file to test with, so let me know if this isn't right).
IHTH
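An alternative sketch splits the line on "|" and then splits the third column on ";", so only a key that is exactly DP is examined and the comparison is numeric. The sample rows are adapted from the question (the last one changed to DP=10 to show a two-digit value), and the variable name min is made up for the example:

```shell
# Sample rows adapted from the question; the DP=10 row is hypothetical.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=10;Dels=0.00;
EOF

result=$(awk -F'|' -v min=7 '
{
  n = split($3, kv, ";")            # key=value pairs of the third column
  for (j = 1; j <= n; j++)
    if (kv[j] ~ /^DP=/) {
      dp = substr(kv[j], 4) + 0     # text after "DP=", forced to a number
      if (dp > min) print
      break
    }
}' "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

Lines without a DP key are simply skipped, so header lines would still need the FNR <= 53 rule shown above.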

grep: how to show the next lines after the matched one until a blank line [not possible!]

I have a dictionary (not python dict) consisting of many text files like this:
##Berlin
-capital of Germany
-3.5 million inhabitants
##Earth
-planet
How can I show one entry of the dictionary with the facts?
Thank you!
You can't. grep doesn't have a way of showing a variable amount of context. You can use -A to show a set number of lines after the match, such as -A3 to show three lines after a match, but it can't be a variable number of lines.
You could write a quick Perl program to read from the file in "paragraph mode" and then print blocks that match a regular expression.
As Andy Lester pointed out, you can't have grep show a variable amount of context, but a short awk statement might do what you're hoping for.
If your example file were named file.dict:
awk -v term="earth" 'BEGIN{IGNORECASE=1}{if($0 ~ "##"term){loop=1} if($0 ~ /^$/){loop=0} if(loop == 1){print $0}}' *.dict
returns:
##Earth
-planet
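Just change the variable term to the entry you're looking for. Note that IGNORECASE is specific to GNU awk; other awks treat it as an ordinary variable and match case-sensitively.
This assumes two things: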
dictionary files have same extension (.dict for example purposes)
dictionary files are all in same directory (where command is called)
If your grep supports perl regular expressions, you can do it like this:
grep -iPzo '(?s)##Berlin.*?\n(\n|$)'
See this answer for more on this pattern.
You could also do it with GNU sed like this:
query=berlin
sed -n "/$query/I"'{ :a; $p; N; /\n$/!ba; p; }'
That is, when case-insensitive $query is found, print until an empty line is found (/\n$/) or the end of file ($p).
Output in both cases (minor difference in whitespace):
##Berlin
-capital of Germany
-3.5 million inhabitants
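Another way to sketch this is awk's paragraph mode: with RS set to the empty string, each blank-line-separated block becomes one record. This assumes entries are separated by blank lines, as the question title suggests; the sample file and the variable name term are made up for the example:

```shell
# Hypothetical dictionary file with a blank line between entries.
tmp=$(mktemp)
printf '##Berlin\n-capital of Germany\n-3.5 million inhabitants\n\n##Earth\n-planet\n' > "$tmp"

term=earth
# RS= (empty) turns on paragraph mode; compare the first line of each entry
# to "##term" case-insensitively and print the whole entry on a match.
result=$(awk -v RS= -v term="$term" '
  { split($0, line, "\n") }
  tolower(line[1]) == ("##" tolower(term))
' "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

Because it relies on tolower rather than IGNORECASE, the sketch does not depend on gawk.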

Difference between linenumbers of cat file | nl and wc -l file

I have a file with e.g. 9818 lines. When I use wc -l file, I see 9818 lines. When I vi the file, I see 9818 lines. When I :set number, I see 9818 lines. But when I cat file | nl, I see the final line number is 9750 (e.g.). Basically I'm asking why the line numbers from cat file | nl and wc -l file do not match.
wc -l: count all lines
nl: count all (nonempty) lines
try
nl -ba: count all lines
nl(1) says the default is for header and footer lines to not be numbered (-hn -fn), and those sections are marked by lines consisting of repeated \: delimiters (\:\:\: starts a header, \:\: a body, \: a footer). Perhaps your input file includes some of these?
I suggest reading the output of nl line by line against cat -n output and see where things diverge. Or use diff -u if you want to take the fun out of reading 9818 lines. :)
nl does not number blank lines, so this is almost certainly the reason. If you can point us to the file, we can confirm that, but I suspect this is the case.
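A quick way to check the blank-line explanation is to compare the counts directly. This is a sketch on a made-up five-line file; on the real file, the gap between wc -l and the last number printed by nl should equal the number of empty lines if this is the cause:

```shell
# Hypothetical sample: five lines, two of them empty.
tmp=$(mktemp)
printf 'one\n\ntwo\n\nthree\n' > "$tmp"

total=$(wc -l < "$tmp" | tr -d ' ')                              # all lines
blank=$(grep -c '^$' "$tmp")                                     # empty lines only
last_default=$(nl "$tmp" | awk 'NF { n = $1 } END { print n }')  # last number from plain nl
last_all=$(nl -ba "$tmp" | awk 'NF { n = $1 } END { print n }')  # last number from nl -ba

printf 'wc -l: %s, empty: %s, nl: %s, nl -ba: %s\n' \
  "$total" "$blank" "$last_default" "$last_all"

rm -f "$tmp"
```

If the numbers still disagree after accounting for empty lines, section delimiter lines are the next thing to look for.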
