awk -- adding a new delimiter to the default space delimiter - unix

awk's default delimiter treats any run of whitespace between two fields as a single separator.
echo "1  2"|awk '{for (i=1;i<=NF;i++) print $i}'
#which gives the result (two spaces between 1 and 2)
1
2
How can I add "=" to this existing delimiter? I tried the following, but now every single space character counts as a delimiter, which spoils the result above.
echo "1  2"|awk -F"[ |=]" '{for (i=1;i<=NF;i++) print $i}'
#which gives the result (note the empty field between 1 and 2)
1

2
How can I give any amount of space as a delimiter here? Thanks in advance.

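You can specify a regular expression as the delimiter. The + makes one or more consecutive delimiter characters count as a single separator:
echo "1  2"|awk -F"[ =]+" '{for (i=1;i<=NF;i++) print $i}'
Note that the | is not needed: inside a bracket expression it is a literal pipe character rather than alternation, so [ |=] would also split on "|".
It also means
echo "1 2 3==5"|awk -F"[ =]+" '{for (i=1;i<=NF;i++) print $i}'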
would print
1
2
3
5
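One caveat: unlike the default delimiter, a regular-expression delimiter does not ignore leading or trailing separators, so a line that starts with a space gets an empty first field. Below is a minimal sketch of one way around that, trimming the record before looping over the fields. The name split_fields is a hypothetical helper chosen for illustration, not something awk provides.

```shell
# split_fields is a hypothetical helper: it splits stdin on runs of
# spaces and "=" and prints one field per line.
split_fields() {
    awk -F'[ =]+' '{
        # Trim leading/trailing separators; changing $0 re-splits the fields.
        sub(/^[ =]+/, "")
        sub(/[ =]+$/, "")
        for (i = 1; i <= NF; i++) print $i
    }'
}

# Leading and trailing spaces no longer produce empty fields.
printf '  1  2 3==5 \n' | split_fields
```

Without the two sub() calls, the same input would print an empty line first, because the leading spaces separate an empty field from the 1.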

Related

Transposing multiple columns in multiple rows keeping one column fixed in Unix

I have one file that looks like below
1234|A|B|C|10|11|12
2345|F|G|H|13|14|15
3456|K|L|M|16|17|18
I want the output as
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M
I have tried with the below script.
awk -F"|" '{print $1","$2","$3","$4}' file.dat | awk -F"," '{OFS=RS;$1=$1}1'
But the output is generated as below.
1234
A
B
C
2345
F
G
H
3456
K
L
M
Any help is appreciated.
What about a single simple awk process such as this:
$ awk -F\| '{print $1 "|" $2 "\n" $1 "|" $3 "\n" $1 "|" $4}' file.dat
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M
No messing with RS and OFS.
If you want to do this dynamically, then you could pass in the number of fields that you want, and then use a loop starting from the second field.
In the script, you might first check that the number of fields is greater than or equal to the number you pass in (in this case n=4):
awk -F\| -v n=4 '
NF >= n {
for(i=2; i<=n; i++) print $1 "|" $i
}
' file
Output
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M
$ perl -lne '($a,@b)=((split/\|/)[0..3]);foreach (@b){print join"|",$a,$_}' file.dat
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M

OFS when using if-else statement in awk

I have a simple text file, delimited by multiple spaces, and with a different number of columns (6 or 5).
What I am trying to do is, for the rows with more than 5 columns, combine the 2 last columns in one, doing:
cat data.txt | awk '{if(NF>5) print $1,$2,$3,$4,$5"_"$6; else print $0} OFS="," ' > data.csv
The problem is that the OFS is not working for the else statement.
Example - input:
a d e t er ap
b q j n mm
Output that I am getting:
a,d,e,t,er_ap
b q j n mm
Desirable output:
a,d,e,t,er_ap
b,q,j,n,mm
Any suggestions?
Set your OFS in the BEGIN block so that it's a comma before any processing happens. Also, when you print $0 without modifying the line in any way, awk outputs the line as-is, with whatever delimiters are in the source file, because $0 is only rebuilt with OFS when a field is assigned. Personally I think that's dumb, but that's awk. As a workaround, just set one column equal to itself, then print:
awk 'BEGIN{OFS=","}{if(NF>5) print $1,$2,$3,$4,$5"_"$6; else {$1=$1;print $0}}' data.txt
If you anticipate more than 6 columns, you can have it join everything from column 5 onward with underscores, using some printf trickery:
awk '{for (i=1;i<=NF;i++){printf (i==NF)?"%s\n":(i>=5)?"%s_":"%s,", $i}}' data.txt
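To see the $1=$1 workaround in isolation, here is a small sketch that prints the same record before and after the self-assignment. The function name rebuild_demo and the "before:"/"after:" labels are made up for this illustration.

```shell
# rebuild_demo is a hypothetical helper: it shows that $0 keeps its
# original spacing until a field is assigned, and only then is rebuilt
# using OFS.
rebuild_demo() {
    awk 'BEGIN { OFS = "," } {
        print "before: " $0   # untouched record, original delimiters
        $1 = $1               # assigning any field rebuilds $0 with OFS
        print "after:  " $0
    }'
}

printf 'b   q j  n mm\n' | rebuild_demo
```

The first line printed keeps the original runs of spaces, while the second one is comma separated.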

Unix - print distinct field lengths

I would like to print the total number of times each length occurs in a field.
The column type is varChar and the strings in that field are either 9, 10, or 15 characters long. I want to know how many exist of each length.
My code:
awk -F'|'
'NR>1 $61!="" &&
if /length($61)=15/ then {a++}
elif /length($61)=10/ then {b++}
else /length($61)=9/ then {c++}
fi {print a ", " b ", " c}'
ERROR:
awk -F'|' 'NR>1 $61!="" && if /length($61)=15/ then {a++} elif /length($61)=10/ then {b++} else /length($61)=9/ then {c++} fi {print a ", " b ", " c}'
Syntax Error The source line is 1.
The error context is
NR>1 >>> $61!= <<<
awk: 0602-500 Quitting The source line is 1.
INPUT
A pipe delimited .sqf file with 1.2 million rows and column 61 is varChar 15.
Based on your pseudo-code, I guess you want:
awk -F'|' -v OFS=', ' 'NR>1 {count[length($61)]++}
END {print count[15],count[10],count[9]}' file
You'll also have the counts for any other lengths in the array, which is handy as a data quality check.
If you want 0 instead of an empty string for missing counts, change to count[n]+0 as suggested in the comments.
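Put together, a sketch of the count[n]+0 variant might look like the following. It also keeps the $61!="" test from the question. The function name count_lengths, its column argument, and the tiny two-column sample are assumptions made so the example can run without the real 61-column file.

```shell
# count_lengths is a hypothetical helper: it counts how often each
# string length occurs in one pipe-delimited column, skipping the
# header row and empty values.
count_lengths() {
    awk -F'|' -v OFS=', ' -v col="${1:-61}" '
        NR > 1 && $col != "" { count[length($col)]++ }
        # Adding 0 turns a missing count into 0 instead of an empty string.
        END { print count[15] + 0, count[10] + 0, count[9] + 0 }'
}

# Sample data uses column 2 in place of column 61.
printf 'id|code\n1|123456789\n2|1234567890\n3|123456789\n4|\n' | count_lengths 2
```

For the real file the call would presumably be count_lengths 61 < file, with the counts printed in the order 15, 10, 9.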

Define specific output count in EXPR command

I have a scenario wherein I want to have 9 character count in expr.
I have sample code which is:
var1=012345678 #this is 9 characters
sum=`expr $var1 + 1`
echo "$sum"
Here is the result:
./sample.sh : 12345679 #this is only 8 characters
My expected output:
./sample.sh : 012345679
Any help on this?
The leading zero is removed when doing the math.
You can force a 9-digit, zero-padded output using printf "%09d" 123.
When you try to use the syntax ((sum=${var1} + 1 )) you have another problem: when the first digit is 0, bash treats the number as octal, and 012345678 is not a valid octal number because it contains the digit 8.
You can remove the first 0 with
var1=012345678
echo "${var1#0}"
This only helps with your input, not with 00012.
Leading zeroes can be handled in general by forcing base 10 with the 10# prefix, as in echo $((10#$var1)). Computing and printing the sum then looks like this:
var1=00012345678
((sum=$((10#$var1)) + 1))
printf "%09d\n" $sum
This can be solved more easily with awk:
var1=00012345678
echo "${var1} 1" |awk '{ printf("%09d\n", $1 + $2) }'
You can avoid the echo with
awk -v var1="$var1" 'BEGIN { printf("%09d\n", var1 + 1) }'
The BEGIN block runs before any input is read, so awk needs no input file here.
The option -v is a clean way to pass a shell variable into an awk script.
Do not try to splice the variable in with quotes; one day you will shoot yourself in the foot:
# Don't do this
awk 'BEGIN { printf("%09d\n", '${var1}' + 1) }' # Just do not do it
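For shells that lack the bash-specific 10# syntax, the leading zeroes can also be stripped with plain parameter expansion before doing the arithmetic. This is only a sketch under that assumption; strip_zeros and increment_padded are hypothetical helper names, and the width of 9 is taken from the question.

```shell
# strip_zeros is a hypothetical helper: it removes every leading zero
# from a decimal string using POSIX parameter expansion only.
strip_zeros() {
    n=$1
    # Drop one leading zero at a time until none is left.
    while [ "${n#0}" != "$n" ]; do
        n=${n#0}
    done
    # An all-zero input ends up empty, so fall back to 0.
    printf '%s\n' "${n:-0}"
}

# increment_padded is a hypothetical helper: it adds 1 and pads the
# result back to 9 digits.
increment_padded() {
    printf '%09d\n' "$(( $(strip_zeros "$1") + 1 ))"
}

increment_padded 012345678
increment_padded 00012
```

Because the zeroes are removed before the shell sees the number, the octal interpretation never comes into play.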

How to convert multiple lines into fixed column lengths

To convert rows into tab-delimited, it's easy
cat input.txt | tr "\n" " "
But I have a long file with 84046468 lines. I wish to convert this into a file with 1910147 rows and 44 tab-delimited columns. The first column is a text string such as chrXX_12345_+ and the other 43 columns are numerical strings. Is there a way to perform this transformation?
There are NAs present, so I guess sed and substituting "\n" for "\t" if the string preceding is a number doesn't work.
sample input.txt
chr10_1000103_+
0.932203
0.956522
1
0.972973
1
0.941176
1
0.923077
1
1
0.909091
0.9
1
0.916667
0.8
1
1
0.941176
0.904762
1
1
1
0.979592
0.93617
0.934783
1
0.941176
1
1
0.928571
NA
1
1
1
0.941176
1
0.875
0.972973
1
1
NA
0.823529
0.51366
chr10_1000104_-
0.952381
1
1
0.973684
sample output.txt
chr10_1000103_+ 0.932203 (numbers all tab-delimited)
chr10_1000104_- etc
(sorry, a lot of numbers to type manually)
sed '
# use a delimiter
s/^/M/
:Next
# put a counter
s/^/i/
# test counter
/^\(i\)\{44\}/ !{
$ !{
# not 44 line or end of file, add the next line
N
# loop
b Next
}
}
# remove marker and counter
s/^i*M//
# replace new line by tab (\t in the replacement needs GNU sed)
s/\n/\t/g' YourFile
Some sed implementations hit a limit above 255 here (so 44 is OK).
Here's the right approach using 4 columns instead of 44:
$ cat file
chr10_1000103_+
0.932203
0.956522
1
chr10_1000104_-
0.952381
1
1
$ awk '{printf "%s%s", $0, (NR%4?"\t":"\n")}' file
chr10_1000103_+ 0.932203 0.956522 1
chr10_1000104_- 0.952381 1 1
Just change 4 to 44 for your real input.
If you are seeing control-Ms in your output, it's because they are present in your input, so use dos2unix or similar to remove them before running the tool, or with GNU awk you could just set -v RS='\r\n'.
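If neither dos2unix nor GNU awk is available, a portable variation is to strip a trailing carriage return inside the script itself. The sketch below assumes that approach; the function name reshape and its column-count argument are made up for illustration.

```shell
# reshape is a hypothetical helper: it joins every n input lines into
# one tab-delimited row, dropping a trailing carriage return if present.
reshape() {
    awk -v n="${1:-44}" '{
        sub(/\r$/, "")   # remove the control-M left by DOS line endings
        # A tab follows every line except the n-th, which ends the row.
        printf "%s%s", $0, (NR % n ? "\t" : "\n")
    }'
}

# Sample with DOS line endings and 4 columns instead of 44.
printf 'chr10_1000103_+\r\n0.932203\r\nNA\r\n1\r\nchr10_1000104_-\r\n0.952381\r\n1\r\n1\r\n' | reshape 4
```

For the real input the call would be reshape 44 < input.txt. NA values pass through unchanged, since nothing here depends on the lines being numeric.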
When posting questions it's important to make it as clear, simple, and brief as possible so that as many people as possible will be interested in helping you.
BTW, cat input.txt | tr "\n" " " is a UUOC and should just be tr "\n" " " < input.txt
Not the best solution, but should work:
line="nonempty"; while [ ! -z "$line" ]; do for i in $(seq 44); do read line; echo -n "$line "; done; echo; done < input.txt
If there is an empty line in the file, it will terminate. For a more permanent solution I'd try perl.
edit:
If you are concerned with efficiency, just use awk.
awk '{ printf "%s\t", $1 } NR%44==0{ print "" }' < input.txt
You may want to strip the trailing tab character with | sed 's/\t$//' or make the awk script more complicated.
This might work for you (GNU sed):
sed '/^chr/!{H;$!d};x;s/\n/\t/gp;d' file
If a line does not begin with chr, append it to the hold space and then delete it unless it is the last line. If the line does start with chr, or it is the last line, swap to the hold space, replace all newlines with tabs and print the result.
N.B. the start of the next line will be left untouched in the pattern space which becomes the new hold space.
