awk $4 column if column = value with characters thereafter - unix

I have a file containing data like the following, for example:
20 V 70000003d120f88 1 2
20 V 70000003d120f88 2 2
20x00 V 70000003d120f88 2 2
10020 V 70000003d120f88 1 5
I want to get the sum of the 4th column data.
Using the command below I can achieve this, but the row 20x00 is excluded. I want every row whose first column starts with 20 to be summed, with nothing allowed before the 20, so 20* for example:
cat testdata.out | awk '{if ($1 == '20') print $4;}' | awk '{s+=$1}END{printf("%.0f\n", s)}'
The output value must be:
5
How can I achieve this using awk? My attempt below does not work either:
cat testdata.out | awk '$1 ~ /'20'/ {print $4;}' | awk '{s+=$1}END{printf("%.0f\n", s)}'

There is no need to use 3 processes; everything can be done by one AWK process. Check it out:
awk '$1 ~ /^20/ { a+=$4 } END { print a }' testdata.out
explanation:
$1 ~ /^20/ checks to see if $1 starts with 20
if yes, we add $4 in the variable a
finally, we print the variable a
result 5
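A small sketch of why the ^ anchor matters, using the four sample rows from the question. The function names sample, sum_anchored and sum_unanchored are made up for this illustration; only the awk patterns come from the thread.

```shell
# Sample rows from the question, written to stdout.
sample() {
  printf '%s\n' \
    '20 V 70000003d120f88 1 2' \
    '20 V 70000003d120f88 2 2' \
    '20x00 V 70000003d120f88 2 2' \
    '10020 V 70000003d120f88 1 5'
}

# Anchored: only rows whose first field starts with 20.
sum_anchored() { awk '$1 ~ /^20/ { a += $4 } END { print a + 0 }'; }

# Unanchored: also matches 10020, because 20 occurs inside it.
sum_unanchored() { awk '$1 ~ /20/ { a += $4 } END { print a + 0 }'; }

sample | sum_anchored     # prints 5
sample | sum_unanchored   # prints 6
```

Without the anchor the 10020 row is counted as well, which is why the unanchored attempt in the question gives the wrong total.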
EDIT:
Ed Morton rightly points out that the result should always be printed as a number (an unset variable would print as an empty string), which can be solved by adding 0 to the result.
You can also set the exit status if it is necessary to distinguish whether the result 0 is due to no matches
(exit status 0) or to matches that only contain zero values (exit status 1).
The exit status for different input data can be checked with e.g. echo $?
The code would look like this:
awk '$1 ~ /^20/ { a+=$4 } END { print a+0; exit(a!="") }' testdata.out
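To see the two exit statuses side by side, here is a minimal sketch. The helper names sum20 and report are hypothetical, and the input rows are made up. Note that this convention is the reverse of the usual shell meaning of 0 as "success".

```shell
# Same program as above, reading from stdin.
sum20() { awk '$1 ~ /^20/ { a += $4 } END { print a + 0; exit (a != "") }'; }

# Hypothetical helper: run sum20 on stdin and report its output and exit status.
report() {
  out=$(sum20) && st=0 || st=$?
  echo "sum=$out status=$st"
}

printf '20 V x 0 2\n' | report   # a row matched, its value is 0: sum=0 status=1
printf '30 V x 7 2\n' | report   # no row matched at all:          sum=0 status=0
```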

Figured it out:
cat testdata.out | awk '$1 ~ /'^20'/ {print $4;}' | awk '{s+=$1}END{printf("%.0f\n", s)}'
The above might not work for all cases, but below will suffice:
i=20
cat testdata.out | awk '{if ($1 == "'"$i"'" || $1 == ""'"${i}"'"x00") print $4;}' | awk '{s+=$1}END{printf("%.0f\n", s)}'
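As an alternative to splicing the shell variable into the awk program with nested quotes, the value can be handed over with -v. The sketch below is one possible way to do that, not the poster's method: the function name sum_prefix is made up, and it treats the prefix as literal text by using index() instead of a regex. One caveat: awk processes backslash escapes in -v values, so this assumes the prefix contains no backslashes.

```shell
# Sum column 4 for rows whose first column starts with the given literal prefix.
# $1 (shell): the prefix, e.g. 20
sum_prefix() {
  awk -v p="$1" 'index($1, p) == 1 { s += $4 } END { print s + 0 }'
}

i=20
printf '%s\n' \
  '20 V 70000003d120f88 1 2' \
  '20 V 70000003d120f88 2 2' \
  '20x00 V 70000003d120f88 2 2' \
  '10020 V 70000003d120f88 1 5' | sum_prefix "$i"   # prints 5
```

index($1, p) == 1 is true only when the prefix sits at the very start of the field, so 10020 is not counted.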

Related

Transposing multiple columns in multiple rows keeping one column fixed in Unix

I have one file that looks like below
1234|A|B|C|10|11|12
2345|F|G|H|13|14|15
3456|K|L|M|16|17|18
I want the output as
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M
I have tried with the below script.
awk -F"|" '{print $1","$2","$3","$4}' file.dat | awk -F"," '{OFS=RS;$1=$1}1'
But the output is generated as below.
1234
A
B
C
2345
F
G
H
3456
K
L
M
Any help is appreciated.
What about a single simple awk process such as this:
$ awk -F\| '{print $1 "|" $2 "\n" $1 "|" $3 "\n" $1 "|" $4}' file.dat
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M
No messing with RS and OFS.
If you want to do this dynamically, then you could pass in the number of fields that you want, and then use a loop starting from the second field.
In the script, you might first check that the number of fields is greater than or equal to the number you pass into the script (in this case n=4):
awk -F\| -v n=4 '
NF >= n {
for(i=2; i<=n; i++) print $1 "|" $i
}
' file
Output
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M
# perl -lne'($a,@b)=((split/\|/)[0..3]);foreach (@b){print join"|",$a,$_}' file.dat
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M

Filter a file using a column value greater than a number (awk not working)

I am trying to filter a file, keeping the rows whose value in the 8th column is >= 10. I am using awk but for some reason it doesn't work. Am I doing something wrong? What am I missing?
head df_TPM.csv
LQNS02136402.1_14821_3p,12680.71611,11346.42368,11686.28693,9067.797819,7429.467928,5551.660333,3246.956281
LQNS02000137.1_325_3p,8342.540984,5905.726173,4503.363041,3616.191278,3142.965662,3678.829299,6288.621969
LQNS02278148.1_40791_3p,4921.502758,2461.882836,429.824973,261.273116,132.0239748,68.6191655,70.8815385
LQNS02278089.1_34112_3p,4246.71324,4584.529009,8687.922574,7570.83746,5801.384953,2870.020801,734.3131465
LQNS02278075.1_32377_5p,4143.547577,4093.91803,10804.12323,10062.99269,7925.240969,4712.484455,1080.915573
LQNS02138569.1_14892_3p,2668.27957,2160.173542,837.2584183,233.2310273,84.62362925,64.6037895,23.456714
LQNS02278075.1_32324_5p,2331.608924,491.8868983,1527.312199,881.8683105,747.1474225,347.397634,74.07259175
LQNS02278075.1_32382_3p,2140.686095,2439.122353,10837.38169,12569.95295,9385.530878,6022.323737,1705.900969
LQNS02000138.1_777_5p,1819.275149,1762.009649,8565.396754,33280.90019,32176.07604,15849.37306,11872.99383
LQNS02278186.1_47223_3p,1687.843418,728.4288968,1328.048172,1306.424238,2102.27342,14.78892225,9.92647375
#Extract column 1 and 8 and print if $8>=10
cat df_TPM.csv |awk -F"," '{print $1, $8}' | grep -E "^LQN" | awk -F " " '$2>= 10'
LQNS02276925.1_23356_5p 5.352369
LQNS02277221.1_25158_5p 2.82778125
LQNS02277812.1_29775_3p 11.1090745
LQNS02278074.1_32154_3p 6.124789
LQNS02278139.1_39525_5p 22.6656355
#As you can see lots of numbers shouldn't be there (ex: 2.82778125 < 10)
Going by the OP's comment: if you want the lines that do not start with LQN and whose 8th column is greater than or equal to 10, try the following (to select the lines that do start with LQN instead, remove the ! from the commands below).
awk -F"," '$8+0 >= 10 && !/^LQN/{print $1, $8}' df_TPM.csv
Or, to get the total number of matching lines, the counting can be done in the same single awk:
awk -F"," '$8+0 >= 10 && !/^LQN/{count++} END{print count}' df_TPM.csv
Explanation: Adding detailed explanation for above.
awk -F"," ' ##Starting awk program from here.
$8+0 >= 10 && !/^LQN/{ ##Checking condition if 8th field is greater than or equal to 10 and line does NOT start with LQN.
count++ ##Increasing count with 1 here.
}
END{ ##Starting END block of this awk program from here.
print count ##Printing count value here.
}
' df_TPM.csv ##Mentioning Input_file name here.
To strip carriage-return (Control-M) characters within the awk code itself, assuming you do not want them in your Input_file, try:
awk -F"," '{gsub(/\r/,"")} $8 >= 10 && !/^LQN/{count++} END{print count}' df_TPM.csv
You need to tell awk to coerce $8 into a number by computing $8+0. It is recommended to make sure you have GNU awk installed to avoid issues. You may also want to run dos2unix on the files first to normalize the line endings.
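A minimal sketch of the two fixes combined, assuming the file has Windows (CRLF) line endings so that the last field carries a trailing carriage return. The two rows and the function names crlf_rows and filter_ge10 are made up for illustration.

```shell
# Two CSV rows with CRLF line endings (made-up names, values taken from the question).
crlf_rows() {
  printf 'LQN_a,1,2,3,4,5,6,5.352369\r\n'
  printf 'LQN_b,1,2,3,4,5,6,22.6656355\r\n'
}

# Strip the trailing CR from the record, then force a numeric comparison with +0.
filter_ge10() {
  awk -F, '{ sub(/\r$/, "") } /^LQN/ && $8 + 0 >= 10 { print $1, $8 }'
}

crlf_rows | filter_ge10   # prints: LQN_b 22.6656355
```

Adding 0 makes the comparison numeric regardless of how a particular awk classifies a field with trailing junk, and the sub() call keeps the carriage return out of the printed output.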
The whole command can be written as
awk -F"," '/^LQN/ && $8+0 >= 10 {print $1, $8}' df_TPM.csv
NOTE: To only count these lines, use
awk -F, '/^LQN/ && $8+0 >= 10 {cnt++} END{print cnt}' df_TPM.csv
To find the lines that do not start with LQN, just add the negation operator ! before /^LQN/:
awk -F, '!/^LQN/ && $8+0 >= 10 {cnt++} END{print cnt}' df_TPM.csv
Details
-F"," (= -F,) - set the field separator to a comma
/^LQN/ && $8+0 >= 10 - if the current line starts with LQN and the eighth field is equal or larger than 10
!/^LQN/ && $8+0 >= 10 - if the current line does not start with LQN and the eighth field is equal or larger than 10
{print $1, $8} - print Field 1 and 8
{cnt++} - increment the cnt variable
END{print cnt} - print the cnt variable once the awk finishes processing lines.

Stop awk search on a condition

At the moment I have my R function generate an awk script to load, selectively, a subset of a csv into fread.
The resulting awk string looks something like this:
tail -n +2 ../data/faults_main_only_dp_1_shopFlag.csv | parallel -k -q --block 500M --pipe awk -F , ' $5 > \"2013-01-01\" && $5 < \"2015-11-17\" && $2 ~ /^F59PHI$|^GP20ECO$|^GT42CU-ACE$/ && $20 ~ /^Disregard$|^EMD Work Item$|^Pending$|^Pre-Work Item$|^Road Failure$|^Unit Shopped$|^Watch$|^Work Item$|^NA$/ {print $2 \",\" $88 \",\" $17 \",\" $5 \",\" $9 \",\" $22 \",\" $3 \",\" $15 \",\" $14 } '
The thing is: my csv has recently been ordered by date ($5) in descending order, so if the user enters a specific lower-bound date and awk reaches that line, it makes sense for it to stop. (I am not sure how that would work with the parallelization I am doing above. Maybe there is a way to select only the part of the csv that is "above" the lower bound of the date and then pass the resulting csv into the awk script.) Is there a way to do that?
Generate an awk program such as this, which explicitly exits on a condition (here, if field 3 exceeds 4):
$2 > 3 && $2 < 10
$3 > 4 { exit}
Put the above in a file called myprog.awk, say, and assuming default separators run it with this (your awk may be called something else):
gawk -f myprog.awk mydata.dat
or put it on the command line but you will have to be careful regarding quoting depending on the shell you use:
gawk '$2 > 3 && $2 < 10; $3 > 4 { exit }' mydata.dat
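Applied to the date case in the question, the same idea could look like the sketch below. It assumes the date is in field 1 of a comma-separated file sorted in descending order; the rows, the function names rows and in_range, and the variables lo and hi are all made up. Dates in YYYY-MM-DD form compare correctly as plain strings.

```shell
# Made-up rows: date in field 1 (descending order), payload in field 2.
rows() {
  printf '%s\n' '2015-11-01,a' '2014-06-15,b' '2013-02-01,c' '2012-12-31,d' '2012-01-01,e'
}

# $1 (shell): exclusive lower bound, $2 (shell): exclusive upper bound.
in_range() {
  awk -F, -v lo="$1" -v hi="$2" '
    $1 <= lo { exit }        # descending order: no later row can match, stop reading
    $1 < hi  { print $2 }    # inside the range: print the payload
  '
}

rows | in_range 2013-01-01 2015-11-17   # prints a, b and c on separate lines
```

With parallel --pipe, each block is presumably handled by its own awk process, so an exit would only end the processing of that block rather than the whole run.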

UNIX copy lines to new file IF one column matches AND another has a value below 5x10^-8

Similar question to many previous ones (including mine) but I can't find the solution. This is purely a syntax error and I cannot figure out how to make it work.
I have two files in Unix. In file1 I have 5 columns and about 6000 rows. I am trying to match rows in file2 to rows in file1 IF column 1 matches exactly AND if the value in column 5 of file1 is less than 0.00000005 for said row.
file1:
SNPs Context Intergenic Risk Allele Frequency p-Value
rs9747992 Intergenic 1 0.086 2.00E-07
rs2059865 Intron 0 0.235 3.00E-07
rs117020818 Intergenic 1 0.046 7.00E-07
rs1074145 Intergenic 1 0.162 4.00E-09
file2:
snpid hg18chr bp a1 a2 zscore pval CEUmaf
rs3131972 1 742584 A G 0.289 0.7726 .
rs3131969 1 744045 A G 0.393 0.6946 .
rs3131967 1 744197 T C 0.443 0.658 .
rs1048488 1 750775 T C -0.289 0.7726 .
I can do the first part BUT it keeps outputting a file that is larger than the first two. I am unsure if this is a real result file or just full of duplicates? I also cannot do the 'less than' command. I have tried putting it into the command as a second pattern and also piping it, as below:
awk 'FNR==NR{a[$1]=$0;next}{if ($1 in a) {print $0}}' file1 file2 > output | awk '{if (a[$5] < 0.00000005)}'
and
awk 'FNR==NR{a[$1]=$0;next}{if ($1 in a && $5 < 0.00000005)} {print $0}}' file1 file2 > output
Both times it's giving me the same size file which is much larger than either file1 or file2. If you want examples of the tables please just say.
Tentative solution:
A tentative solution I am using is to just make a new file containing only lines from file1 which have that <0.00000005 value. This works though I would like to know my original answer for posterity.
awk '$5<=0.00000005' file1 > file11
Per my comments above, if you're using file2 as a filter list, you need to load it into the a[] array.
I've made up a small sample of how that works, the test for $28 < .000005 should be easy to add as you have it in your code.
With file data1
1 2 3 4 5 6 7
2 3 4 5 6 7 8
4 5 8 7 8 9 10
and file searchList
3
Then
awk 'FNR==NR{a[$0]=$0;next}
FNR!=NR{ if ($2 in a) print $0}
#dbg END{for (x in a) print "x="x " a[x]=" a[x]}
' searchList data1
gives output
2 3 4 5 6 7 8
edit Per our conversation in comments, my best guess without seeing your required output would be
I've added an extra record in file1 so there can be match
rs3131972 Intergenic 1 0.086 2.00E-07
awk '( FNR==NR && (sprintf("%.07f",$5) < .000000005) ) {
a[$1]=$0
#dbg print "a["$1"]="a[$1]
next
}
FNR!=NR{
#dbg print "$1="$1
if ($1 in a)print "Matched:" $0
}' file1 file2
The output is now
Matched:rs3131972 1 742584 A G 0.289 0.7726 .
IHTH
Shellter's answer is good. Mine is more about what you did wrong. Your first attempt
> awk 'FNR==NR{a[$1]=$0;next}{if ($1 in a) {print $0}}
' file1 file2 > output | awk '{if (a[$5] < 0.00000005)}'
fails because your pipeline is wrong. You need to pipe awk | awk > output not awk >output | awk. The latter will receive no input and produce no output from the last step of the pipeline. Also, the second Awk instance has no knowledge of the variables you used in the first.
Furthermore, you seem to have a recurring problem with spurious braces in Awk. The general syntax is awk "condition1 { action1 } condition2 { action2 }..." where you can omit a condition to do an action unconditionally, or omit the action part (with the braces) to perform the default action { print $0 }. But here, you have only an action, which is however actually a condition, with no side effects such as printing anything. You want to remove the braces and the if wrapper.
So you need
awk 'FNR==NR{a[$1]=$0;next}{if ($1 in a) {print $0}}' file1 file2 |
awk '$5 < 0.00000005' >output
which (in accordance with the rules for omitting a condition or an action, and with some refactoring) can be much simplified to
awk 'FNR==NR{a[$1]=$0;next}
$1 in a' file1 file2 |
awk '$5 < 0.00000005' >output
Your second attempt is closer;
> awk 'FNR==NR{a[$1]=$0;next}
{if ($1 in a && $5 < 0.00000005)} {print $0}}' file1 file2 > output
but again, you have too many braces. The closing brace after the if ruins it all! So you have effectively "if (condition)" then nothing (maybe this should be a syntax error!), followed by a new block with an unconditional print. But overall, this is much better.
awk 'FNR==NR{a[$1]=$0;next}
{if ($1 in a && $5 < 0.00000005) print $0}' file1 file2 > output
which of course can be simplified to
awk 'FNR==NR{a[$1]=$0;next}
($1 in a) && $5 < 0.00000005' file1 file2 > output
Answer that worked, based on Shellter's assistance:
awk -F $'\t' 'NR==FNR{if ($5 < 0.00000005){a[$1]=$0}} NR!=FNR{if ($1 in a) print $0}' file1 file2 > output
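The two-pass logic of the working command can be tried on a tiny made-up data set. In this sketch the function name significant_matches, the array name keep, and the sample rows are illustrative assumptions. It uses whitespace-separated fields rather than tabs and adds +0 to force a numeric comparison of the p-value.

```shell
# $1: file with the p-value in column 5, $2: file to filter by ID in column 1.
significant_matches() {
  awk 'NR == FNR { if ($5 + 0 < 0.00000005) keep[$1]; next }   # pass 1: remember significant IDs
       $1 in keep                                               # pass 2: print rows with a remembered ID
  ' "$1" "$2"
}

dir=$(mktemp -d)
printf '%s\n' 'rs1 Intergenic 1 0.086 2.00E-07' 'rs2 Intron 0 0.235 4.00E-09' > "$dir/file1"
printf '%s\n' 'rs1 1 742584 A G' 'rs2 1 744045 A G' 'rs3 1 744197 T C' > "$dir/file2"

significant_matches "$dir/file1" "$dir/file2"   # prints: rs2 1 744045 A G
rm -rf "$dir"
```

Because the threshold is tested while file1 is being loaded, $5 refers to file1's p-value column; testing $5 in the second pass would look at column 5 of file2 instead.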
Thanks

How to print the last but one record of a file using awk?

How to print the last but one record of a file using awk?
Something like:
awk '{ prev_line=this_line; this_line=$0 } END { print prev_line }' < file
Essentially, keep a record of the line before the current one, until you hit the end of the file, then print the previous line.
edit to respond to comment:
To just extract the second field in the penultimate line:
awk '{ prev_f2=this_f2; this_f2=$2 } END { print prev_f2 }' < file
You can do it with awk but you may find that:
tail -2 inputfile | head -1
will be a quicker solution - it grabs the last two lines of the complete set then the first of those two.
The following transcript shows how this works:
pax$ echo '1
> 2
> 3
> 4
> 5' | tail -2 | head -1
4
If you must use awk, you can use:
pax$ echo '1
2
3
4
5' | awk '{last = this; this = $0} END {print last}'
4
It works by keeping the last and current line in variables last and this and just printing out last when the file is finished.
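One edge case with this approach is a file with fewer than two lines, where the END block would print an empty line. A small sketch of a guard, with penultimate as a made-up function name:

```shell
# Print the last-but-one line of stdin, or nothing if there are fewer than two lines.
penultimate() {
  awk '{ prev = cur; cur = $0 } END { if (NR >= 2) print prev }'
}

printf '1\n2\n3\n4\n5\n' | penultimate   # prints 4
printf 'only\n' | penultimate            # prints nothing
```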
