Appending whitespace to a variable in AWK script - unix

I have an AWK script, which receives an input variable from another script.
The length of the input variable is checked: if the length is 3, two spaces should be added in front of the variable; if the length is 4, one space should be added in front. I can compare the length, but I am not able to prepend the spaces.
I tried the following in AWK script
if (length(input_variable) == 3) {
    input_variable = "  " input_variable
} else if (length(input_variable) == 4) {
    input_variable = " " input_variable
}
print input_variable
Output: No value is getting printed. Please help me

You should use printf:
awk '{printf "%5s", $1}'
The "%5s" format pads with spaces on the left to the desired width, so there is no need to reinvent it.
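Since the question passes the value in from another script, here is a minimal sketch of the same idea using a variable instead of $1. The helper name pad_left is made up for illustration, and the BEGIN block is an assumption on my part: it lets awk print the variable without reading any input.

```shell
# pad_left is a hypothetical helper name used only for this sketch.
# It right-aligns its argument in a field of 5 characters.
pad_left() {
    awk -v input_variable="$1" 'BEGIN { printf "%5s\n", input_variable }'
}

pad_left abc     # padded with two leading spaces
pad_left abcd    # padded with one leading space
pad_left abcde   # already 5 characters, printed unchanged
```

Values longer than 5 characters are printed in full, because the width in "%5s" is only a minimum.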

Related

How to replace a CRLF char from a variable length file in the middle of a string in Unix?

My sample file is variable length, without any field delimiters. Lines have a minimum length of 18 characters, and the 'CRLF' is potentially (not always) between columns 11-15. How do I replace it with a space only when the line break ('CRLF') occurs in the middle (columns 11-15)? I still want to keep the true end of record.
Sample data:
Input:
1123xxsdfdsfsfdsfdssa
1234ddfxxyff
frrrdds
1123dfdffdfdxxxxxxxxxas
1234ydfyyyzm
knsaaass
1234asdafxxfrrrfrrrsaa
1123werwetrretttrretertre
Expected output:
1123xxsdfdsfsfdsfdssa
1234ddfxxyff frrrdds
1123dfdffdfdxxxxxxxxxas
1234ydfyyyzm knsaaass
1234asdafxxfrrrfrrrsaa
1123werwetrretttrretertre
What I tried:
sed '/^.\{15\}$/!N;s/./ /11' filename
But the above code just adds a space; it does not remove the 'CRLF'.
Given your sample data, this seems to produce the desired output:
$ awk 'length($0) < 18 { getline x; $0 = $0 " " x} { print }' data
1123xxsdfdsfsfdsfdssa
1234ddfxxyff frrrdds
1123dfdffdfdxxxxxxxxxas
1234ydfyyyzm knsaaass
1234asdafxxfrrrfrrrsaa
1123werwetrretttrretertre
$
However, if the input contained CRLF line endings, things would not be so happy; it would be best to filter out the CR characters altogether (Unix files don't normally contain CR and certainly do not normally have CRLF line endings).
$ tr -d '\r' < data | awk 'length($0) < 18 { getline x; $0 = $0 " " x} { print }'
1123xxsdfdsfsfdsfdssa
1234ddfxxyff frrrdds
1123dfdffdfdxxxxxxxxxas
1234ydfyyyzm knsaaass
1234asdafxxfrrrfrrrsaa
1123werwetrretttrretertre
$
If you really need DOS-style CRLF input and output, you probably need to use a program such as utod or unix2dos (or some other similar tool) to convert from Unix line endings to DOS.
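If no conversion tool is installed, one possible alternative is to let awk write the CRLF endings itself by setting ORS. This is only a sketch: join_short_lines is a made-up helper name, and the guard around getline is my addition so that a short last line is printed unchanged rather than joined with stale data.

```shell
# join_short_lines is a hypothetical helper name for this sketch.
# It reads CRLF or LF input on stdin, joins each short line (< 18 chars)
# with the line that follows it, and writes CRLF-terminated output.
join_short_lines() {
    tr -d '\r' |
    awk -v ORS='\r\n' '
        # getline returns 1 only when another line was actually read
        length($0) < 18 { if ((getline x) > 0) $0 = $0 " " x }
        { print }
    '
}

# od -c makes the \r \n pairs in the output visible
printf '1234ddfxxyff\r\nfrrrdds\r\n1123xxsdfdsfsfdsfdssa\r\n' | join_short_lines | od -c
```

The escape sequences in the -v assignment are interpreted by awk, so ORS ends up as a carriage return followed by a newline.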

unix split FASTA using a loop, awk and split

I have a long list of data organised as below (INPUT).
I want to split the data up so that I get an output as below (desired OUTPUT).
The code below first identifies all the lines containing ">gi" and saves the linecount of those lines in an array called B.
Then, in a new file, it should replace those lines from array B with the shortened version of the text following the ">gi"
I figured the easiest way would be to split at "|"; however, this does not work (no separation happens with my code if I replace " " with "|").
My code is below. It splits nicely at " " if I replace the "|" with " " in the INPUT, but I run into trouble when I want to get the text between the [ ] brackets, which is NOT always there and is not always exactly 2 words:
B=$( grep -n ">gi" 1VAO_1DII_5fxe_all_hits_combined.txt | cut -d : -f 1)
awk <1VAO_1DII_5fxe_all_hits_combined.txt >seqIDs_1VAO_1DII_5fxe_all_hits_combined.txt -v lines="$B" '
BEGIN {split(lines, a, " "); for (i in a) change[a[i]]=1}
NR in change {$0 = ">" $4}
1
'
let me know if more explanations are needed!
INPUT:
>gi|9955361|pdb|1E0Y|A:1-560 Chain A, Structure Of The D170sT457E DOUBLE MUTANT OF VANILLYL- Alcohol Oxidase
MSKTQEFRPLTLPPKLSLSDFNEFIQDIIRIVGSENVEVISSKDQIVDGSYMKPTHTHDPHHVMDQDYFLASAIVA
>gi|557721169|dbj|GAD99964.1|:1-560 hypothetical protein NECHADRAFT_63237 [Byssochlamys spectabilis No. 5]
MSETMEFRPMVLPPNLLLSEFNGFIRETIRLVGCENVEVISSKDQIHDGSYMDPRHTHDPHHIMEQDYFLASAIVAPRNV
desired OUTPUT:
>1E0Y
MSKTQEFRPLTLPPKLSLSDFNEFIQDIIRIVGSENVEVISSKDQIVDGSYMKPTHTHDPHHVMDQDYFLASAIVAPRNV
>GAD99964.1 Byssochlamys spectabilis No. 5
MSETMEFRPMVLPPNLLLSEFNGFIRETIRLVGCENVEVISSKDQIHDGSYMDPRHTHDPHHIMEQDYFLASAIVA
This can be done in one step with awk (gnu awk):
awk -F'|' '/^>gi/{a=1;match($NF,/\[([^]]*)]/, b);print ">"$4" "b[1];next}a{print}!$0{a=0}' input > output
In a more readable way:
/^>gi/ {                          # when the line starts with ">gi"
    a=1;                          # set flag "a" to 1
    # extract the optional part between brackets in the last field
    match($NF,"\\[([^]]*)]", b);
    print ">"$4" "b[1];           # display the shortened header
    next                          # jump to the next record
}
a   { print }                     # while "a" is set (allowed block), display the line
!$0 { a=0 }                       # when the line is empty, set "a" to 0 to stop the display
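The three-argument match() used above is a GNU awk extension. If gawk is not available, a rough equivalent can be sketched with the two-argument match() and the RSTART/RLENGTH variables, which any POSIX awk should provide. The helper name shorten_headers is made up, and this version is simplified: it drops the "a" flag and prints every non-header line unchanged.

```shell
# shorten_headers is a hypothetical helper name for this sketch.
# It reads FASTA text from the named files, or from stdin if none are given.
shorten_headers() {
    awk -F'|' '
    /^>gi/ {
        name = ""
        # 2-argument match() sets RSTART and RLENGTH for the "[...]" part
        if (match($NF, /\[[^]]*]/))
            name = " " substr($NF, RSTART + 1, RLENGTH - 2)   # strip the brackets
        print ">" $4 name
        next
    }
    { print }                     # sequence lines pass through unchanged
    ' "$@"
}

printf '%s\n' \
    '>gi|557721169|dbj|GAD99964.1|:1-560 hypothetical protein NECHADRAFT_63237 [Byssochlamys spectabilis No. 5]' \
    'MSETMEFRPMVLPPNLLLSEFNGFIRETIRLVGCENVEVISSKDQIHDGSYMDPRHTHDPHHIMEQDYFLASAIVA' |
    shorten_headers
```

When a header has no bracketed part, only ">" plus the fourth field is printed, without a trailing space.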

How to split and replace strings in columns using awk

I have a tab-delim text file with only 4 columns as shown below:
GT:CN:CNL:CNP:CNQ:FT .:2:a:b:c:PASS .:2:c:b:a:PASS .:2:d:c:a:FAIL
If the string "FAIL" is found in any column from column 2 to column N (the elements within a column are separated by ":"), then the second element in that column needs to be replaced with "-1". Sample output is shown below:
GT:CN:CNL:CNP:CNQ:FT .:2:a:b:c:PASS .:2:c:b:a:PASS .:-1:d:c:a:FAIL
Any help using awk?
With any awk:
$ awk 'BEGIN{FS=OFS="\t"} {for (i=2;i<=NF;i++) if ($i~/:FAIL$/) sub(/:[^:]+/,":-1",$i)} 1' file
GT:CN:CNL:CNP:CNQ:FT .:2:a:b:c:PASS .:2:c:b:a:PASS .:-1:d:c:a:FAIL
To split a string in awk you can use split().
Its general form is:
split(string, array, separator)
string is the string you want to split,
array is the array that receives the pieces,
and separator is the character you want to split on.
e.g.
string="hello:world"
result=$(echo "$string" | awk '{ split($1, ARR, ":"); printf("%s", ARR[1]); }')
In this case result would be equal to hello, because we split the string on the ":" character and printed the first element of ARR. If we printed the second element instead (printf("%s", ARR[2])), result would be "world".
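split() also returns the number of pieces it produced, which is handy when you do not know in advance how many there are. A small sketch (list_parts is just a made-up helper name for the example):

```shell
# list_parts is a hypothetical helper name for this sketch.
# It prints every ":"-separated piece of its argument with its index.
list_parts() {
    printf '%s\n' "$1" | awk '{
        n = split($0, parts, ":")        # n is the number of pieces
        for (i = 1; i <= n; i++)         # array indices start at 1
            printf "%d=%s\n", i, parts[i]
    }'
}

list_parts "hello:world"
```

Looping from 1 to n keeps the pieces in order, whereas the order of "for (i in parts)" is not guaranteed.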
With gawk:
awk '{$0=gensub(/[^:]*(:[^:]*:[^:]*:[^:]:FAIL)/,"-1\\1", "g" , $0)};1' File
with sed:
sed 's/[^:]*\(:[^:]*:[^:]*:[^:]:FAIL\)/-1\1/g' File
If you are using GNU awk, you can take advantage of the RT feature [1] and split the records at tabs and newlines. OFS must be set to ":" as well, because assigning to $2 rebuilds the record using OFS:
awk '$NF == "FAIL" { $2 = "-1"; } { printf "%s", $0 RT }' RS='[\t\n]' FS=':' OFS=':' infile
Output:
GT:CN:CNL:CNP:CNQ:FT .:2:a:b:c:PASS .:2:c:b:a:PASS .:-1:d:c:a:FAIL
[1] RT holds the record separator that follows the current record.
Your requirements are somewhat vague, but I'm pretty sure this does what you want with bog standard awk (no gnu-awk extensions):
awk '/FAIL/{$2=-1}1' ORS=\\t RS=\\t FS=: OFS=: input

Print if all values are higher

I have a file like:
A 50.40,60.80,56.60,67.80,51.20,78.40,63.80,64.2
B 37.40,37.40,38.40,38.80,58.40,58.80,45.00,44.8
.
.
.
I want to print only those lines in which all values in column 2 are greater than 50.
output:
A 50.40,60.80,56.60,67.80,51.20,78.40,63.80,64.2
I tried:
cat file | tr ',' '\t' | awk '{for (i=2; i<=NF; i++){if($i<50) continue; else print $i}}'
I hope you meant that r tag you added to your question.
tab <- read.table("file")
splt <- strsplit(as.character(tab[[2]]), ",")
rows <- unlist(lapply(splt, function(a) all(as.numeric(a) > 50)))
tab[rows,]
This will read your file as a space-separated table, split the second column into individual values (resulting in a list of character vectors), then compute a logical value for each such row depending on whether or not all values are > 50. These results are combined to a logical vector which is then used to subset your data.
The field separator can be any regular expression, so if you include commas in FS your approach works:
awk '{ for(i=2; i<=NF; i++) if($i<=50) next } 1' FS='[ \t,]+' infile
Output:
A 50.40,60.80,56.60,67.80,51.20,78.40,63.80,64.2
Explanation
The for-loop runs through the comma-separated values in the second column, and if any of them is lower than or equal to 50, next is executed, i.e. awk skips to the next line. If the loop finishes without hitting next, the 1 is reached, which evaluates to true and triggers the default action: { print $0 }.
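If you would rather leave FS alone (for example because other columns may contain commas), the same test can be sketched with split() on the second column only. The helper name all_above and the idea of passing the threshold as an argument are my own additions for illustration.

```shell
# all_above is a hypothetical helper name; the threshold is passed as $1.
# It prints the lines whose column-2 values are all greater than the threshold.
all_above() {
    awk -v min="$1" '{
        n = split($2, v, ",")
        for (i = 1; i <= n; i++)
            if (v[i] + 0 <= min + 0) next    # + 0 forces a numeric comparison
        print
    }'
}

printf '%s\n' \
    'A 50.40,60.80,56.60,67.80,51.20,78.40,63.80,64.2' \
    'B 37.40,37.40,38.40,38.80,58.40,58.80,45.00,44.8' |
    all_above 50
```

With the two sample lines, only the A line is printed, because B contains values below 50.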

Printing all lines that have/are duplicates of first field in Unix shell script

I have a list of semicolon-delimited data:
TR=P561;dir=o;day=sa;TI=16:30;stn=south station;Line=worcester
I need to take this file and print out only the lines with TR values that occur more than once. I would like the first occurrence and all duplicates listed.
Thanks
If you are willing to make 2 passes on the file:
awk '!/TR=/ { next } # Ignore lines that do not set TR
{t=$0; sub( ".*TR=", "", t ); sub( ";.*", "", t ) } # Get TR value
FNR == NR { a[t] +=1 } # Count the number of times this value of TR seen
FNR != NR && a[t] > 1 # print those lines whose TR value is seen more than once
' input-file input-file
This uses a common awk idiom of checking FNR to see which file we are using. By passing the input-file as an argument twice, it becomes a way to run one command on the first pass, and a different command on the second.
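If the TR=... entry is always the first semicolon-separated field (an assumption, based on the single sample line in the question), the same two-pass idiom can be sketched more compactly by using ";" as the field separator. show_repeated is a made-up helper name.

```shell
# show_repeated is a hypothetical helper name; it takes one file name.
# Assumes the TR=... entry is always the first ";"-separated field.
show_repeated() {
    awk -F';' '
    FNR == NR { count[$1]++; next }   # first pass: count each first field
    count[$1] > 1                     # second pass: print lines seen more than once
    ' "$1" "$1"
}

# Small demonstration with a temporary file
tmp=$(mktemp)
printf '%s\n' 'TR=P561;dir=o;day=sa' 'TR=P562;dir=i;day=sa' 'TR=P561;dir=i;day=su' > "$tmp"
show_repeated "$tmp"
rm -f "$tmp"
```

The next in the first rule keeps the second rule from running during the first pass, so nothing is printed until the counts are complete.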
