OFS when using if-else statement in awk - unix

I have a simple text file, delimited by multiple spaces, and with a different number of columns (6 or 5).
What I am trying to do is, for the rows with more than 5 columns, combine the 2 last columns in one, doing:
cat data.txt | awk '{if(NF>5) print $1,$2,$3,$4,$5"_"$6; else print $0} OFS="," ' > data.csv
The problem is that the OFS is not working for the else statement.
Example - input:
a d e t er ap
b q j n mm
Output that I am getting:
a,d,e,t,er_ap
b q j n mm
Desirable output:
a,d,e,t,er_ap
b,q,j,n,mm
Any suggestions?

Set your OFS in the BEGIN block so that it's a comma before any processing happens. Also when you do print $0 without manipulating the line in any way, awk will just spit out the line as-is with whatever delimiters are in place in the source file. Personally I think that's dumb, but that's awk. As a workaround, just set one column equal to itself, then print:
awk 'BEGIN{OFS=","}{if(NF>5) print $1,$2,$3,$4,$5"_"$6; else {$1=$1;print $0}}' data.txt
If you anticipate more than 6 columns you can just have it toss underscores for all of them after column 5 with some printf trickery too
awk '{for (i=1;i<=NF;i++){printf (i==NF)?"%s\n":(i>=5)?"%s_":"%s,", $i}}' data.txt

Related

Transposing multiple columns in multiple rows keeping one column fixed in Unix

I have one file that looks like below
1234|A|B|C|10|11|12
2345|F|G|H|13|14|15
3456|K|L|M|16|17|18
I want the output as
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M
I have tried with the below script.
awk -F"|" '{print $1","$2","$3","$4"}' file.dat | awk -F"," '{OFS=RS;$1=$1}1'
But the output is generated as below.
1234
A
B
C
2345
F
G
H
3456
K
L
M
Any help is appreciated.
What about a single simple awk process such as this:
$ awk -F\| '{print $1 "|" $2 "\n" $1 "|" $3 "\n" $1 "|" $4}' file.dat
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M
No messing with RS and OFS.
If you want to do this dynamically, then you could pass in the number of fields that you want, and then use a loop starting from the second field.
In the script, you might first check if the number of fields is equal or greater than the number you pass into the script (in this case n=4)
awk -F\| -v n=4 '
NF >= n {
for(i=2; i<=n; i++) print $1 "|" $i
}
' file
Output
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M
# perl -lne'($a,#b)=((split/\|/)[0..3]);foreach (#b){print join"|",$a,$_}' file.dat
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M

Removing blank spaces in specific column in pipe delimited file in AIX

Good morning. Long time reader, first time emailer so please be gentle.
I'm working on AIX 5.3 and have a 42 column pipe delimited file. There are telephone numbers in columns 15 & 16 (land|mobile) which may or may not contain spaces depending on who has keyed in the data.
I need to remove these space from columns 15 & 16 only ie
Column 15 | Column 16 **Currently**
01942 665432|07865346122
01942756423 |07855 333567
Column 15 | Column 16 **Needs to be**
01942665432|07865346122
01942756423|07855333567
I have a quick & dirty script which unfortunately is proving to be anything but quick because it's a while loop reading every single line, cutting the field on the pipe delimiter, assigning it to a variable, using sed on column 15 & 16 only to strip blank spaces then writing it out to a new file ie
cat $file | while read
output
do
.....
fourteen=$( echo $output | cut -d'|' -f14 )
fifteen=$( echo $output | cut -d'|' -f15 | sed 's/ //g' )
echo ".....$fourteen|$fifteen..." > $new_file
done
I know there must be a better way to do this, probably using AWK, but am open to any kind of suggestion anyone can offer as the script as it stands is taking half an hour plus to process 176,000 records.
Thanks in advance.
Yes, awk is better suited here
$ cat ip.txt
a|foo bar|01942 665432|07865346122|123
b|i j k |01942756423 |07855 333567|90870
$ awk 'BEGIN{FS=OFS="|"} {gsub(" ","",$3); gsub(" ","",$4)} 1' ip.txt
a|foo bar|01942665432|07865346122|123
b|i j k |01942756423|07855333567|90870
BEGIN{FS=OFS="|"} set | as input and output field separator
gsub(" ","",$3) replace all spaces with nothing only for column 3
gsub(" ","",$4) replace all spaces with nothing only for column 4
1 idiomatic way to print the input record (including any modification done )
Change 3 and 4 to whatever field you need
In case first line should not be affected, add a condition
awk 'BEGIN{FS=OFS="|"} NR>1{gsub(" ","",$3); gsub(" ","",$4)} 1' ip.txt

How to split and replace strings in columns using awk

I have a tab-delim text file with only 4 columns as shown below:
GT:CN:CNL:CNP:CNQ:FT .:2:a:b:c:PASS .:2:c:b:a:PASS .:2:d:c:a:FAIL
If the string "FAIL" is found in a specific column starting from column2 to columnN (all the strings are separated by ":") then it would need to replace the second element in that column to "-1". Sample output is shown below:
GT:CN:CNL:CNP:CNQ:FT .:2:a:b:c:PASS .:2:c:b:a:PASS .:-1:d:c:a:FAIL
Any help using awk?
With any awk:
$ awk 'BEGIN{FS=OFS="\t"} {for (i=2;i<=NF;i++) if ($i~/:FAIL$/) sub(/:[^:]+/,":-1",$i)} 1' file
GT:CN:CNL:CNP:CNQ:FT .:2:a:b:c:PASS .:2:c:b:a:PASS .:-1:d:c:a:FAIL
In order to split in awk you can use "split".
An example of it would be the following:
split(1,2,"3");
1 is the string you want to split
2 is the array you want to split it into
and 3 is the character that you want to be split on
e.g
string="hello:world"
result=`echo $string | awk '{ split($1,ARR,":"); printf("%s ",ARR[1]);}'`
In this case the result would be equal to hello, because we split the string to the " : " character and we printed the first half of the ARR, if we would print the second half (so printf("%s ",ARR[2])) of the ARR then it would be returned to result the "world".
With gawk:
awk '{$0=gensub(/[^:]*(:[^:]*:[^:]*:[^:]:FAIL)/,"-1\\1", "g" , $0)};1' File
with sed:
sed 's/[^:]*\(:[^:]*:[^:]*:[^:]:FAIL\)/-1\1/g' File
If you are using GNU awk, you can take advantage of the RT feature1 and split the records at tabs and newlines:
awk '$NF == "FAIL" { $2 = "-1"; } { printf "%s", $0 RT }' RS='[\t\n]' FS=':' infile
Output:
GT:CN:CNL:CNP:CNQ:FT .:2:a:b:c:PASS .:2:c:b:a:PASS .:-1:d:c:a:FAIL
1 The record separator that follows the current record.
Your requirements are somewhat vague, but I'm pretty sure this does what you want with bog standard awk (no gnu-awk extensions):
awk '/FAIL/{$2=-1}1' ORS=\\t RS=\\t FS=: OFS=: input

Remove all lines from file with duplicate value in field, including the first occurrence

I would like to remove all the lines in my data file that contain a value in column 2 that is repeated in column 2 in other lines.
I've sorted by the value in column 2, but can't figure out how to use uniq for just the values in one field as the values are not necessarily of the same length.
Alternately, I can remove lines with the duplicate using an awk one-liner like
awk -F"[,]" '!_[$2]++'
but this retains the line with the first incidence of the repeated value in col 2.
As an example, if my data is
a,b,c
c,b,a
d,e,f
h,i,j
j,b,h
I would like to remove ALL lines (including the first) where b occurs in the second column.
Like this:
d,e,f
h,i,j
Thanks for any advice!!
If the order is not important then the following should work:
awk -F, '
!seen[$2]++ {
line[$2] = $0
}
END {
for(val in seen)
if(seen[val]==1)
print line[val]
}' file
Output
h,i,j
d,e,f
Solution with grep:
grep -v -E '\b,b,\b' text.txt
Content of the file:
$ cat text.txt
a,b,c
c,b,a
d,e,f
h,i,j
j,b,h
a,n,b
b,c,f
$ grep -v -E '\b,b,\b' text.txt
d,e,f
h,i,j
a,n,b
b,c,f
Hope it helps
Some different awk:
awk -F, '
BEGIN {f=0}
FNR==NR {_[$2]++;next}
f==0 {
f=1
for(j in _)if(_[j]>1)delete _[j]
}
$2 in _
' file file
Explanation
The awk passes through the file twice - that's why it appears twice at the end. On the first pass (when FNR==NR) I count the number of times each column 2 appears in array _[]. At the end of the first pass, I then delete all elements of _[] where that element has been seen more than once. Then, on the second pass, I print lines whose second field appears in _[].

How do I get the maximum value for a text in a UNIX file as shown below?

I have a unix file with the following contents.
$cat myfile.txt
abc:1
abc:2
hello:3
hello:6
wonderful:1
hai:2
hai:4
hai:8
How do I get the max value given for each text in the file above.
'abc' value 2
'hello' value 6
'hai' value 8
'womderful' 1
Based on the current example in your question, minus the first line of expected output:
awk -F':' '{arr[$1]=$2 ; next} END {for (i in arr) {print i, arr[i]} } ' inputfile
You example input and expected output are very confusing.... The reason I posted this is to get feedback from the OP forthcoming
This assumes the data is unsorted, but also works with sorted data (New):
sort -t: -k2n inputfile | awk -F':' '{arr[$1]=$2 ; next} END {for (i in arr) {print i, arr[i]} } '

Resources