arguments in awk command - unix

for var in 1 2 3 4 5 6 7 8 9 10
do
echo $var
export lower_bound=$((($var - 1) * 10))
echo $lower_bound
export upper_bound=$(($var * 10))
echo $upper_bound
awk '$2 >= $(($lower_bound)) && $2 < $(($upper_bound))' shubham_test.txt > file_$var.txt
#awk -v var1="$lower_bound" -v var2="$upper_bound" '{$2 >= $var1 && $2 < $var2}' shubham_test.txt > file_$var.txt
done
I am trying to split a file based on the value of a field in a text file in unix. The awk works when I pass hardcoded values to compare, but does not work properly if I pass variables for the comparison (upper_bound and lower_bound).
I looked it up and even replaced the awk command with the following:
awk -v var1="$lower_bound" -v var2="$upper_bound" '{$2 >= $var1 && $2 < $var2}' shubham_test.txt > file_$var.txt
so that it takes arguments. But this is not working either. Can anyone help?

Your second format should work, but without the $ on the awk variable names, and without the braces, so the default action (print) fires when the condition matches:
awk -v var1="$lower_bound" -v var2="$upper_bound" '$2 >= var1 && $2 < var2' shubham_test.txt
Your first version fails because the single quotes mean the unix shell does not expand the $lower_bound and $upper_bound variables, but instead passes the literal string to awk. If you're writing shell scripts, you need to know how to use single, double and back quotes; take a look at http://www.tutorialspoint.com/unix/unix-quoting-mechanisms.htm for example.
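Putting it together, the loop might look something like this (a sketch, assuming the same shubham_test.txt input as in the question):
for var in 1 2 3 4 5 6 7 8 9 10
do
lower_bound=$((($var - 1) * 10))
upper_bound=$(($var * 10))
awk -v var1="$lower_bound" -v var2="$upper_bound" '$2 >= var1 && $2 < var2' shubham_test.txt > file_$var.txt
done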

Related

Filter a file using a column value greater than a number (awk not working)

I am trying to filter a file using values in the 8th column that are >= 10. I am using awk, but for some reason it doesn't work. Am I doing something wrong? What am I missing?
head df_TPM.csv
LQNS02136402.1_14821_3p,12680.71611,11346.42368,11686.28693,9067.797819,7429.467928,5551.660333,3246.956281
LQNS02000137.1_325_3p,8342.540984,5905.726173,4503.363041,3616.191278,3142.965662,3678.829299,6288.621969
LQNS02278148.1_40791_3p,4921.502758,2461.882836,429.824973,261.273116,132.0239748,68.6191655,70.8815385
LQNS02278089.1_34112_3p,4246.71324,4584.529009,8687.922574,7570.83746,5801.384953,2870.020801,734.3131465
LQNS02278075.1_32377_5p,4143.547577,4093.91803,10804.12323,10062.99269,7925.240969,4712.484455,1080.915573
LQNS02138569.1_14892_3p,2668.27957,2160.173542,837.2584183,233.2310273,84.62362925,64.6037895,23.456714
LQNS02278075.1_32324_5p,2331.608924,491.8868983,1527.312199,881.8683105,747.1474225,347.397634,74.07259175
LQNS02278075.1_32382_3p,2140.686095,2439.122353,10837.38169,12569.95295,9385.530878,6022.323737,1705.900969
LQNS02000138.1_777_5p,1819.275149,1762.009649,8565.396754,33280.90019,32176.07604,15849.37306,11872.99383
LQNS02278186.1_47223_3p,1687.843418,728.4288968,1328.048172,1306.424238,2102.27342,14.78892225,9.92647375
#Extract column 1 and 8 and print if $8>=10
cat df_TPM.csv |awk -F"," '{print $1, $8}' | grep -E "^LQN" | awk -F " " '$2>= 10'
LQNS02276925.1_23356_5p 5.352369
LQNS02277221.1_25158_5p 2.82778125
LQNS02277812.1_29775_3p 11.1090745
LQNS02278074.1_32154_3p 6.124789
LQNS02278139.1_39525_5p 22.6656355
#As you can see lots of numbers shouldn't be there (ex: 2.82778125 < 10)
Based on OP's comment, in case you want lines that do not start with LQN and whose 8th field is greater than or equal to 10, try the following (to keep lines that do start with LQN instead, remove the ! from the code below).
awk -F"," '$8+0 >= 10 && !/^LQN/{print $1, $8}' df_TPM.csv
Or, to get the total number of matching lines, the counting can be done within a single awk itself:
awk -F"," '$8+0 >= 10 && !/^LQN/{count++} END{print count}' df_TPM.csv
Explanation: a detailed breakdown of the above command.
awk -F"," ' ##Starting awk program from here.
$8+0 >= 10 && !/^LQN/{ ##Checking condition if 8th field is greater than 10 and NOT LQN.
count++ ##Increasing count with 1 here.
}
END{ ##Starting END block of this awk program from here.
print count ##Printing count value here.
}
' df_TPM.csv ##Mentioning Input_file name here.
To handle control M (carriage return) characters in the awk code itself, assuming you don't want them in your Input_file, try:
awk -F"," '{gsub(/\r/,"")} $8 >= 10 && !/^LQN/{count++} END{print count}' df_TPM.csv
You need to tell awk to coerce $8 into a number by computing $8+0. It is also recommended to use GNU awk (gawk) to avoid portability issues, and you may want to run dos2unix on the files first to normalize the line endings.
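As a quick illustration of why the coercion matters, here is a small sketch with a made-up line whose last field ends in a carriage return (exact behaviour can vary between awk implementations):
printf 'id,1,2,3,4,5,6,9.9\r\n' | awk -F, '{print ($8 >= 10), ($8+0 >= 10)}'
The first comparison may print 1, because "9.9\r" is compared against "10" as a string; the coerced comparison correctly prints 0.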
The whole command can be written as
awk -F"," '/^LQN/ && $8+0 >= 10 {print $1, $8}' df_TPM.csv
See the online awk demo.
NOTE: To only count these lines, use
awk -F, '/^LQN/ && $8+0 >= 10 {cnt++} END{print cnt}' df_TPM.csv
To find the lines that do not start with LQN, just add the negation operator ! before /^LQN/:
awk -F, '!/^LQN/ && $8+0 >= 10 {cnt++} END{print cnt}' df_TPM.csv
Details
-F"," (= -F,) - set the field separator to a comma
/^LQN/ && $8+0 >= 10 - if the current line starts with LQN and the eighth field is equal or larger than 10
!/^LQN/ && $8+0 >= 10 - if the current line does not start with LQN and the eighth field is equal or larger than 10
{print $1, $8} - print Field 1 and 8
{cnt++} - increment the cnt variable
END{print cnt} - print the cnt variable once awk has finished processing all lines.

Define specific output count in EXPR command

I have a scenario wherein I want to keep a 9-character count in the output of expr.
I have sample code which is:
var1=012345678 #this is 9 characters
sum=`expr $var1 + 1`
echo "$sum"
Here is the result:
./sample.sh : 12345679 #this is only 8 characters
My expected output:
./sample.sh : 012345679
Any help on this?
The leading zero is removed when doing the math.
You can force a 9-character, zero-padded output using printf "%09d" 123.
When you try to use the syntax ((sum=${var1} + 1 )) you have another problem: when the first digit is 0, bash interprets the number in a different radix (octal).
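For example, a run like this may fail because 8 is not a valid octal digit (hypothetical error output):
var1=012345678
((sum=${var1} + 1))
bash: ((: 012345678: value too great for base (error token is "012345678")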
You can remove the first 0 with
var1=012345678
echo "${var1#0}"
This only helps with your sample input, not with something like 00012 (multiple leading zeroes).
The leading zeroes can be stripped by forcing base 10 with echo $((10#$var1)); combining this with the addition and the padded output:
var1=00012345678
((sum=$((10#$var1)) + 1))
printf "%09d\n" $sum
This can be solved more easily with
var1=00012345678
echo "${var1} 1" |awk '{ printf("%09d\n", $1 + $2) }'
You can avoid the echo with
awk -v var1=$var1 'BEGIN { printf("%09d\n", var1 + 1) }'
The BEGIN block is used so awk runs without needing an input file.
The -v option is a clean way to pass a shell variable into an awk script.
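For instance, the padded result can be captured back into a shell variable (a small sketch, assuming bash):
var1=012345678
sum=$(awk -v var1="$var1" 'BEGIN { printf("%09d", var1 + 1) }')
echo "$sum"    # 012345679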
Do not try to splice shell variables into the awk program with quotes; one day it will shoot you in the foot:
# Don't do this
awk 'BEGIN { printf("%09d\n", '${var1}' + 1) }' # Just do not do it

Stop awk search on a condition

At the moment I have my R function generate an awk script to load, selectively, a subset of a csv into fread.
The resulting awk string looks something like this:
tail -n +2 ../data/faults_main_only_dp_1_shopFlag.csv | parallel -k -q --block 500M --pipe awk -F , ' $5 > \"2013-01-01\" && $5 < \"2015-11-17\" && $2 ~ /^F59PHI$|^GP20ECO$|^GT42CU-ACE$/ && $20 ~ /^Disregard$|^EMD Work Item$|^Pending$|^Pre-Work Item$|^Road Failure$|^Unit Shopped$|^Watch$|^Work Item$|^NA$/ {print $2 \",\" $88 \",\" $17 \",\" $5 \",\" $9 \",\" $22 \",\" $3 \",\" $15 \",\" $14 } '
The thing is: my csv has recently been ordered by date ($5), in descending order, so if the user enters a specific lower-bound date and awk gets to that line, it makes sense for it to stop. (I am not sure how that would work with the parallelization I am doing above. Maybe there is a way to select only the part of the csv that is "above" the lower bound of the date and then pass the resulting csv into the awk script.) Is there a way to do that?
Generate an awk program such as this, which explicitly exits on a condition (here, if field 3 exceeds 4):
$2 > 3 && $2 < 10
$3 > 4 { exit}
Put the above in a file called myprog.awk, say, and, assuming default separators, run it with this (your awk may be called something else):
gawk -f myprog.awk mydata.dat
or put it on the command line but you will have to be careful regarding quoting depending on the shell you use:
$2 > 3 & $2 < 10; $3 > 4 { exit}

print duplicate entries without deleting unix/linux

Let's say I have a file like this with 2 columns
56-cde
67-cde
56-cao
67-cgh
78-xyz
456-hhh
456-jjjj
45678-nnmn
45677-abdc
45678-aief
I am trying to get an output like this:
56-cde
56-cao
67-cde
67-cgh
456-hhh
456-jjjj
45678-aief
45678-nnmn
So basically instead of printing out the unique values I need to print the duplicates:
I tried to accomplish this using awk like this :
cat input.txt | awk -F"-" '{print $1,$2}' | sort -n | uniq -w 2 -D
This does show me which values in column 1 are duplicated, along with their respective column 2 values. But since I am hardcoding the comparison width to 2 bytes, it only displays the duplicates among the 2-digit numbers in column one. Is there a way to do this using awk?
Thanks in advance.
See if your uniq has a -D option. My cygwin version does:
cat input.txt | sort | uniq -w 2 -D
Another awk solution, without arrays (but with a presort):
sort -n file | awk -F- '
NR==1{p=$1; a=$0; c++; next}
p==$1{a=a RS $0; c++; next}
c{print a}
{a=$0; p=$1; c=0}
END{if(c) print a}'
This is what I came up with (just an awk program, no external sort, uniq etc.):
BEGIN { FS = "-" }
{ arr[$1] = arr[$1] "-" $2 }
END {
for (i in arr) {
if ((n = split(arr[i], a)) < 3) continue
for (j = 2; j <= n; ++j)
print i"-"a[j]
}
}
It collects all numbers along with the different strings attached in arr (assuming the strings won't contain dashes -).
With gawk, you could use arrays of arrays in order to avoid the concatenation and splitting with dashes.
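A minimal sketch of that gawk variant (assumes GNU awk 4.0 or later for true arrays of arrays, and again that the suffixes contain no dashes; output order is not guaranteed):
gawk -F- '
{ seen[$1][$2] = 1 }              # remember every suffix seen for each number
END {
    for (i in seen)
        if (length(seen[i]) > 1)  # only numbers that occur with more than one suffix
            for (j in seen[i])
                print i "-" j
}' input.txt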
I would handle the varying-number-of-digits case by pre-conditioning the data so that the number field is a fixed large width (and use that width in uniq):
cat input.txt | awk -F- '{printf "%12d-%s\n",$1,$2}'| sort | uniq -w 12 -D
If you need the output left-justified as well, just tack on this post-conditioning step:
| awk '{print $1}'
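In other words, the full pipeline would look something like:
cat input.txt | awk -F- '{printf "%12d-%s\n",$1,$2}' | sort | uniq -w 12 -D | awk '{print $1}'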
Using Perl
$ cat two_cols.txt
56-cde
67-cde
56-cao
67-cgh
78-xyz
456-hhh
456-jjjj
45678-nnmn
45677-abdc
45678-aief
$ perl -F"-" -lane ' #t=#{$kv{$F[0]}}; push(#t,$_); $kv{$F[0]}=[#t]; END { while(($x,$y)=each(%kv)){ print join("\n",#{$y}) if scalar #{$y}>1 }} ' two_cols.txt
67-cde
67-cgh
56-cde
56-cao
456-hhh
456-jjjj
45678-nnmn
45678-aief
$

While read line, awk $line and write to variable

I am trying to split a file into different smaller files depending on the value of the fifth field. A very nice way to do this was already suggested and also here.
However, I am trying to incorporate this into a .sh script for qsub, without much success.
The problem is that in the section where the output file for each line is specified,
i.e., f = "Alignments_" $5 ".sam"; print > f
, I need to pass a variable declared earlier in the script, which specifies the directory where the file should be written. I need to do this with a variable which is built for each task when I send out the array job for multiple files.
So say $output_path = ./Sample1
I need to write something like
f = $output_path "/Alignments_" $5 ".sam" print > f
But it does not seem to like having a $variable that is not a $field belonging to awk. I don't even think it likes having two "strings" before and after the $5.
The error I get back shows that awk takes the first line of the file to be split (little.sam) and tries to use it as the name of f, followed by /Alignments_" $5 ".sam" (those last three pieces put together correctly). It says, naturally, that the name is too long.
How can I write this so it works?
Thanks!
awk -F '[:\t]' ' # read the list of numbers in Tile_Number_List
FNR == NR {
num[$1]
next
}
# process each line of the .BAM file
# any lines with an "unknown" $5 will be ignored
$5 in num {
f = "Alignments_" $5 ".sam" print > f
} ' Tile_Number_List.txt little.sam
UPDATE, AFTER ADDING -V TO AWK AND DECLARING THE VARIABLE OPATH
input=$1
outputBase=${input%.bam}
mkdir -v $outputBase\_TEST
newdir=$outputBase\_TEST
samtools view -h $input | awk 'NR >= 18' | awk -F '[\t:]' -v opath="$newdir" '
FNR == NR {
num[$1]
next
}
$5 in num {
f = newdir"/Alignments_"$5".sam";
print > f
} ' Tile_Number_List.txt -
mkdir: created directory `little_TEST'
awk: cmd. line:10: (FILENAME=- FNR=1) fatal: can't redirect to `/Alignments_1101.sam' (Permission denied)
awk variables are like C variables - just reference them by name to get their value, no need to stick a "$" in front of them like you do with shell variables:
awk -F '[:\t]' ' # read the list of numbers in Tile_Number_List
FNR == NR {
num[$1]
next
}
# process each line of the .BAM file
# any lines with an "unknown" $5 will be ignored
$5 in num {
output_path = "./Sample1/"
f = output_path "Alignments_" $5 ".sam"
print > f
} ' Tile_Number_List.txt little.sam
To pass the value of the shell variable such as $output_path to awk you need to use the -v option.
$ output_path=./Sample1/
$ awk -F '[:\t]' -v opath="$output_path" '
# read the list of numbers in Tile_Number_List
FNR == NR {
num[$1]
next
}
# process each line of the .BAM file
# any lines with an "unknown" $5 will be ignored
$5 in num {
f = opath"Alignments_"$5".sam"
print > f
} ' Tile_Number_List.txt little.sam
Also, you still have the error from your previous question left in your script.
EDIT:
The awk variable created with -v is opath, but you use newdir; what you want is:
input=$1
outputBase=${input%.bam}
mkdir -v $outputBase\_TEST
newdir=$outputBase\_TEST
samtools view -h "$input" | awk -F '[\t:]' -v opath="$newdir" '
FNR == NR && NR >= 18 {
num[$1]
next
}
$5 in num {
f = opath"/Alignments_"$5".sam" # <-- opath is the awk variable not newdir
print > f
}' Tile_Number_List.txt -
You should also move the NR >= 18 filter into the main awk script rather than using a separate awk invocation in the pipeline.
