Find a pattern and replace the pattern by adding a newline in front of it - unix

I'd like to replace a pattern with a newline while preserving the pattern.
Sample input:
:16R:ABC:20C::CORP:30E::ABC
I'd like to insert a newline in front of each match of the pattern ":[0-9][0-9]" while keeping the matched text itself.
Output
:16R:ABC
:20C::CORP
:30E::ABC
This is what I currently came up with:
echo ":16R:ABC:20C::CORP:30E::ABC" | sed 's/[:][0-9][0-9]/\
:/g;/^$/!P;D'
which produces:
:R:ABC
:C::CORP
:E::ABC
Expected Output:
:16R:ABC
:20C::CORP
:30E::ABC
It's not preserving the pattern. Any suggestions?

A straightforward sed solution (\n in the replacement is a GNU sed extension; for strict POSIX, use a backslash followed by a literal newline, as in the question's own attempt):
sed 's/\([A-Z]\)\(:[0-9][0-9][A-Z]\)/\1\n\2/g'
Capturing the letter before the match puts it back instead of swallowing it, which is what preserves the surrounding text.
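Applied to the sample input:
echo ':16R:ABC:20C::CORP:30E::ABC' | sed 's/\([A-Z]\)\(:[0-9][0-9][A-Z]\)/\1\n\2/g'
:16R:ABC
:20C::CORP
:30E::ABC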
If you can use awk and have the GNU variant available, you can call patsplit() to split on the pattern :[0-9][0-9][A-Z] and start replacing from the 2nd occurrence onwards:
awk '{
    n = patsplit($0, arr, /[:][0-9][0-9][A-Z]/)
    for (iter = 2; iter <= n; iter++)
        sub(arr[iter], ORS arr[iter])
}1'
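A quick check against the sample input:
echo ':16R:ABC:20C::CORP:30E::ABC' | gawk '{
    n = patsplit($0, arr, /[:][0-9][0-9][A-Z]/)
    for (iter = 2; iter <= n; iter++)
        sub(arr[iter], ORS arr[iter])
}1'
:16R:ABC
:20C::CORP
:30E::ABC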
Or with any POSIX awk:
awk '{
    n = split($0, arr, ":")
    for (iter = 3; iter <= n; iter++)
        if (match(arr[iter], /[0-9][0-9][a-zA-Z]/))
            sub(":" arr[iter], ORS ":" arr[iter])
}1'

Related

How to calculate max and min of multiple columns (row wise) using awk

This might be simple - I have a file as below:
df.csv
col1,col2,col3,col4,col5
A,2,5,7,9
B,6,10,2,3
C,3,4,6,8
I want to compute max(col2,col4) - min(col3,col5) and write the result in a new column, but I get an error when using max and min in awk. So the desired output should look like:
col1,col2,col3,col4,col5,New_col
A,2,5,7,9,2
B,6,10,2,3,3
C,3,4,6,8,2
I used the code below but it does not work - how can I solve this?
awk -F, '{print $1,$2,$3,$4,$5,$(max($7,$9)-min($8,$10))}'
Thank you.
$ cat tst.awk
BEGIN { FS=OFS="," }
{ print $0, (NR>1 ? max($2,$4) - min($3,$5) : "New_col") }
function max(a,b) {return (a>b ? a : b)}
function min(a,b) {return (a<b ? a : b)}
$ awk -f tst.awk file
col1,col2,col3,col4,col5,New_col
A,2,5,7,9,2
B,6,10,2,3,3
C,3,4,6,8,2
If your actual "which is larger" calculation is more involved than just using >, e.g. if you were comparing dates in some non-alphabetic format or people's names where you have to compare the surname before the forename and handle titles, etc., then you'd write the functions as:
function max(a,b) {
# some algorithm to compare the 2 strings
}
function min(a,b) {return (max(a,b) == a ? b : a)}
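For instance, here is a minimal sketch of such a max(), assuming zero-padded DD/MM/YYYY date strings (the format is just an assumption for illustration, not from the question):
# sketch only: compare dates given as zero-padded DD/MM/YYYY
function max(a, b,    pa, pb, ka, kb) {
    split(a, pa, "/"); ka = pa[3] pa[2] pa[1]   # rebuild as YYYYMMDD
    split(b, pb, "/"); kb = pb[3] pb[2] pb[1]
    return (ka > kb ? a : b)                    # fixed-width strings compare correctly
}
function min(a,b) {return (max(a,b) == a ? b : a)}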
You may use this awk:
awk 'BEGIN{FS=OFS=","} NR==1 {print $0, "New_col"; next} {print $0, ($2 > $4 ? $2 : $4) - ($3 < $5 ? $3 : $5)}' df.csv
col1,col2,col3,col4,col5,New_col
A,2,5,7,9,2
B,6,10,2,3,3
C,3,4,6,8,2
A more readable version:
awk '
BEGIN { FS = OFS = "," }
NR == 1 {
    print $0, "New_col"
    next
}
{
    print $0, ($2 > $4 ? $2 : $4) - ($3 < $5 ? $3 : $5)
}' df.csv
"get an error using max and min in awk and write the result in a new column."
No such functions are available in awk, but for two values you can harness the ternary operator. So in place of
max($7,$9)
try
($7>$9)?$7:$9
and in place of
min($8,$10)
try
($8<$10)?$8:$10
The above exploits ?:, which can be read as check ? value_if_true : value_if_false. As a simple example, let the content of file.txt be
100,100
100,300
300,100
300,300
then
awk 'BEGIN{FS=","}{print ($1>$2)?$1:$2}' file.txt
output
100
300
300
300
Also, are you sure about the first $ in $(max($7,$9)-min($8,$10))? By writing that, you instruct awk to fetch the value of the n-th column, where n is the result of the computation inside (...).
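A quick illustration of that field indirection:
echo 'a,b,c' | awk -F, '{ print $(1+2) }'
c
The expression inside (...) evaluates to 3, so awk prints the third field rather than the arithmetic result itself.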

print duplicate entries without deleting unix/linux

Let's say I have a file like this with 2 columns
56-cde
67-cde
56-cao
67-cgh
78-xyz
456-hhh
456-jjjj
45678-nnmn
45677-abdc
45678-aief
I am trying to get an output like this:
56-cde
56-cao
67-cde
67-cgh
456-hhh
456-jjjj
45678-aief
45678-nnmn
So basically, instead of printing out the unique values, I need to print the duplicates.
I tried to accomplish this using awk like this:
cat input.txt | awk -F"-" '{print $1,$2}' | sort -n | uniq -w 2 -D
This does show me which values in column 1 are duplicated, and it displays those duplicated column 1 values along with their respective column 2 values. But since I am hardcoding the number of bytes to 2, it only displays the duplicates for the 2-digit numbers in column one. Is there a way to do this using awk?
Thanks in advance.
See if your uniq has a -D option. My cygwin version does:
cat input.txt | sort | uniq -w 2 -D
Another awk solution without arrays (but with a presort):
sort -n file | awk -F- '
NR==1 { p=$1; a=$0; next }       # remember the first key and line
p==$1 { a=a RS $0; c++; next }   # same key: accumulate and count dupes
c     { print a }                # key changed: print the group if it had dupes
      { a=$0; p=$1; c=0 }        # start a new group
END   { if (c) print a }'
This is what I came up with (just an awk program, no external sort, uniq etc.):
BEGIN { FS = "-" }
{ arr[$1] = arr[$1] "-" $2 }
END {
    for (i in arr) {
        if ((n = split(arr[i], a)) < 3) continue
        for (j = 2; j <= n; ++j)
            print i "-" a[j]
    }
}
It collects all numbers along with the different strings attached
in arr (assuming the strings won't contain dashes -).
With gawk, you could use arrays of arrays in order to avoid the concatenation and splitting with dashes.
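A minimal sketch of that gawk-only variant (note that the for (k in ...) iteration order is unspecified):
gawk -F- '
{ grp[$1][NR] = $0 }                # group whole lines under the numeric key
END {
    for (k in grp)
        if (length(grp[k]) > 1)     # keep only keys seen more than once
            for (i in grp[k])
                print grp[k][i]
}' input.txt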
I would handle the varying-number-of-digits case by pre-conditioning the data so that the number field is a fixed large width (and use that width in uniq):
cat input.txt | awk -F- '{printf "%12d-%s\n",$1,$2}'| sort | uniq -w 12 -D
If you need the output left-justified as well, just tack on this post-conditioning step:
| awk '{print $1}'
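Putting the pre- and post-conditioning together on the sample input:
cat input.txt | awk -F- '{printf "%12d-%s\n",$1,$2}' | sort | uniq -w 12 -D | awk '{print $1}'
56-cao
56-cde
67-cde
67-cgh
456-hhh
456-jjjj
45678-aief
45678-nnmn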
Using Perl
$ cat two_cols.txt
56-cde
67-cde
56-cao
67-cgh
78-xyz
456-hhh
456-jjjj
45678-nnmn
45677-abdc
45678-aief
$ perl -F"-" -lane ' @t=@{$kv{$F[0]}}; push(@t,$_); $kv{$F[0]}=[@t]; END { while(($x,$y)=each(%kv)){ print join("\n",@{$y}) if scalar @{$y}>1 }} ' two_cols.txt
67-cde
67-cgh
56-cde
56-cao
456-hhh
456-jjjj
45678-nnmn
45678-aief
$

Removing all occurrences of duplicates in a file on Unix

I would like to remove both occurrences of duplicates from a file, based on a number of columns. Here is a toy example:
I would like to delete all records that are not unique across the first 4 columns. So applying the awk script to:
BLUE,CAR,RED,HOUSE,40
BLUE,CAR,BLACK,HOUSE,20
BLUE,CAR,GREEN,HOUSE,10
BLUE,TRUCK,RED,HOUSE,40
BLUE,TRUCK,GREEN,HOUSE,40
BLUE,TRUCK,RED,HOUSE,40
Should result in
BLUE,CAR,RED,HOUSE,40
BLUE,CAR,BLACK,HOUSE,20
BLUE,CAR,GREEN,HOUSE,10
BLUE,TRUCK,GREEN,HOUSE,40
I have tried:
awk -F"," -v OFS="," '{cnt[$1,$2,$3,$4]++} END {for (rec in cnt) if (cnt[rec] == 1) print rec}' ss.txt
This successfully removes both dupes, but it does not apply the correct delimiter or print the whole record, resulting in:
BLUECARREDHOUSE
BLUETRUCKGREENHOUSE
BLUECARBLACKHOUSE
BLUECARGREENHOUSE
I prefer an awk solution but any portable solution is welcomed.
Given that you want the whole record for the records that are unique in the first 4 columns, this would do the job:
awk -F',' '{cnt[$1,$2,$3,$4]++;line[$1,$2,$3,$4] = $0}
END {for (rec in cnt) if (cnt[rec] == 1) print line[rec]}' \
ss.txt
Save the lines as well as the counts; you get back what you entered. This gets painful if you have gigabyte files; there are ways to save only the unique lines if you want. The following only saves the first version of each line, and deletes an entry when it is known to be non-unique. (Untested - but I believe it should work. Modified per a comment from Ed Morton.)
awk -F',' '{ if (cnt[$1,$2,$3,$4]++ == 0)
                 line[$1,$2,$3,$4] = $0
             else
                 delete line[$1,$2,$3,$4]
           }
           END {for (rec in line) print line[rec]}' \
    ss.txt
If you only wanted the 4 key columns, then this just saves the 4 columns in the comma-separated format that you'll print:
awk -F',' '{cnt[$1,$2,$3,$4]++;line[$1,$2,$3,$4] = $1 "," $2 "," $3 "," $4}
END {for (rec in cnt) if (cnt[rec] == 1) print line[rec]}' \
ss.txt
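If the original input order matters (the for (rec in ...) loops above print in an unspecified order), a two-pass variant counts on the first read of the file and prints on the second:
awk -F',' 'NR==FNR { cnt[$1,$2,$3,$4]++; next }
           cnt[$1,$2,$3,$4] == 1' ss.txt ss.txt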

How to replace second existing pattern in unix file

I want to replace the second occurrence of a pattern in unix.
Input File:-
12345|45345|TaskID|dksj|kdjfdsjf|TaskID|12
1245|425345|TaskID|dksj|kdjfdsjf|TaskID|12
1234|25345|TaskID|dksj|TaskID|kdjfdsjf|12|TaskID
123425|65345|TaskID|dksj|kdjfdsjf|12|TaskID
123425|15325|TaskID|dksj|kdjfdsjf|12
Sample Output file:-
12345|45345|TaskID|dksj|kdjfdsjf|TaskID1|12
1245|425345|TaskID2|dksj|kdjfdsjf|TaskID3|12
1234|25345|TaskID|dksj|TaskID1|kdjfdsjf|12|TaskID2
123425|65345|TaskID3|dksj|kdjfdsjf|12|TaskID4
123425|15325|TaskID|dksj|kdjfdsjf|12
Your example does not match your question, so I'll only show how to replace every second match of the given pattern.
Use awk; it's a very powerful tool for command-line text processing.
replace.sh is as follows:
cat | awk -v search="$1" -v repl="$2" '
BEGIN {
    flag = 0
}
{
    # split the line on the search pattern; the pieces between
    # matches land in a[1..len]
    split($0, a, search)
    len = length(a)
    for (f = 1; f < len; f += 1) {
        # flag persists across lines, so the keep/replace alternation
        # continues through the whole input stream
        printf "%s%s", a[f], (flag % 2 == 0 ? search : repl)
        flag += 1
    }
    printf "%s%s", a[len], ORS
}
'
cat input.txt | ./replace.sh TaskID TaskID1
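Because flag carries over between lines, on the question's input the command above prints:
12345|45345|TaskID|dksj|kdjfdsjf|TaskID1|12
1245|425345|TaskID|dksj|kdjfdsjf|TaskID1|12
1234|25345|TaskID|dksj|TaskID1|kdjfdsjf|12|TaskID
123425|65345|TaskID1|dksj|kdjfdsjf|12|TaskID
123425|15325|TaskID1|dksj|kdjfdsjf|12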

awk field separator, when the separator shows up in double quotes

I am trying to use awk to read the field at position 3, $3; field 3 is a string.
awk -F'","' '{print $1}' input.txt
my file input.txt looks like this
field1,field2,field3,field4,field5
The problem is that these fields are separated by commas; some of them are double-quoted while others are not. And field 5 is double-quoted and contains all kinds of symbols. Example:
imfield1,imfield2,"imfield3",imfield4,"im"",""fi"",el,""d5"
Can awk handle a situation like this? More generally, how can I get the whole string by typing $5?
You can use Lorance Stinson's Awk CSV parser, in which case it's as simple as:
function parse_csv(..) {
    ..
}
{
    num_fields = parse_csv($0, csv, ",", "\"", "\"", "\\n", 1);
    print csv[2]
}
If you're not hell-bent on Awk, Python also comes with a nice CSV parser (Python 2 syntax here):
import csv, sys
for row in csv.reader(sys.stdin):
    print row[2]
Or from the command line (bit tricky in one line):
python -c 'import csv,sys;[sys.stdout.write(row[2]+"\n") for row in csv.reader(sys.stdin)]' < input.txt
The separator is a simple comma, not a comma between quotes. If the fields do not contain commas, then awk may be up to the task:
awk -F , '
{
    if ($3 ~ /^".*"$/) {
        $3 = substr($3, 2, length($3)-2);   # strip the surrounding quotes
        gsub(/""/, "\"", $3);               # turn doubled quotes into literal ones
    }
    print $3;
}' input.txt
This is already getting pretty complicated. If there can be commas inside fields, use a proper CSV parser, for example in Perl or Python. See https://unix.stackexchange.com/questions/7425/is-there-a-robust-command-line-tool-for-processing-csv-files
You can parse the line in awk by setting a null field separator, which makes every character its own field (a common awk extension, e.g. in gawk). Instead of printf("%s", $i) you could also append $i to a variable and print it out when inda == 0.
#echo "\"AAA,BBB\",\"CCC\",\"DDD, EEE, FFF\"" > uno
awk 'BEGIN { FS = "" }
{
    for (i = 1; i <= NF; i++) {      # i <= NF, or the last character would be dropped
        if ($i == "\"")
            inda = !inda             # toggle the inside-quotes flag
        if ($i == "," && inda == 0)
            $i = "|"                 # rewrite only the unquoted commas
        printf("%s", $i)
    }
    printf("\n")
}' uno
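Running this on the uno file created by the commented-out echo gives:
"AAA,BBB"|"CCC"|"DDD, EEE, FFF"
Only the unquoted, field-separating commas become |; the commas inside quotes survive.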
