How to replace the second occurrence of a pattern in a Unix file - unix

I want to replace the second occurrence of a pattern in a file on Unix.
Input File:-
12345|45345|TaskID|dksj|kdjfdsjf|TaskID|12
1245|425345|TaskID|dksj|kdjfdsjf|TaskID|12
1234|25345|TaskID|dksj|TaskID|kdjfdsjf|12|TaskID
123425|65345|TaskID|dksj|kdjfdsjf|12|TaskID
123425|15325|TaskID|dksj|kdjfdsjf|12
Sample Output file:-
12345|45345|TaskID|dksj|kdjfdsjf|TaskID1|12
1245|425345|TaskID2|dksj|kdjfdsjf|TaskID3|12
1234|25345|TaskID|dksj|TaskID1|kdjfdsjf|12|TaskID2
123425|65345|TaskID3|dksj|kdjfdsjf|12|TaskID4
123425|15325|TaskID|dksj|kdjfdsjf|12

Your example does not match your question,
so I'll only show how to replace every second match of the given pattern.
Use awk; it's a very powerful tool for command-line text processing.
replace.sh is as follows:
#!/bin/sh
awk -v search="$1" -v repl="$2" '
{
    # split() treats search as a regular expression and returns the number of pieces
    len = split($0, a, search)
    for (f = 1; f < len; f += 1) {
        # flag counts matches across the whole input: even = keep, odd = replace
        printf "%s%s", a[f], (flag % 2 == 0 ? search : repl)
        flag += 1
    }
    printf "%s%s", a[len], ORS
}
'
./replace.sh TaskID TaskID1 < input.txt
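If the goal is instead to replace only the second occurrence on each line, as the question's title suggests, a minimal sketch could look like the following. The function name replace_second is hypothetical, and index() is used so that the pattern is treated as a literal string rather than a regular expression.

```shell
# Hypothetical helper: replace_second PATTERN REPLACEMENT < input
# Replaces only the second literal occurrence of PATTERN on each line.
replace_second() {
    awk -v search="$1" -v repl="$2" '
    {
        first = index($0, search)
        if (first > 0) {
            start = first + length(search)      # position just past the first match
            rest = substr($0, start)
            second = index(rest, search)        # second match, relative to rest
            if (second > 0) {
                $0 = substr($0, 1, start - 1) \
                     substr(rest, 1, second - 1) \
                     repl \
                     substr(rest, second + length(search))
            }
        }
        print
    }'
}
```

Usage: replace_second TaskID TaskID1 < input.txt. Lines with fewer than two matches are printed unchanged.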

Related

Explain the AWK syntax in detail (NR, FNR, NF)

I am learning file comparison using awk.
I found syntax like the following:
awk '
(NR == FNR) {
s[$0]
next
}
{
for (i=1; i<=NF; i++)
if ($i in s)
delete s[$i]
}
END {
for (i in s)
print i
}' tests tests2
I couldn't understand the syntax. Can you please explain it in detail?
What exactly does it do?
awk ' # use awk
(NR == FNR) { # process first file
s[$0] # hash the whole record to array s
next # process the next record of the first file
}
{ # process the second file
for (i=1; i<=NF; i++) # for each field in record
if ($i in s) # if field value found in the hash
delete s[$i] # delete the value from the hash
}
END { # after processing both files
for (i in s) # all leftover values in s
print i # are output
}' tests tests2
For example, for files:
tests:
1
2
3
tests2:
1 2
4 5
program would output:
3
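To see why NR == FNR is true only while the first file is being read, it can help to print the counters for every record. This is a small sketch; the function name show_counters is made up for illustration.

```shell
# Hypothetical helper: print the file name and the built-in counters per record.
# NR  = record number across all input files
# FNR = record number within the current file (restarts at 1 for each file)
# NF  = number of fields in the current record
show_counters() {
    awk '{ printf "%s NR=%d FNR=%d NF=%d\n", FILENAME, NR, FNR, NF }' "$@"
}
```

Running show_counters tests tests2 on the files above should print NR=4 FNR=1 for the first line of tests2, which is where NR and FNR stop being equal.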

Compare 2 lines and print column numbers if matching and NULL if not

I need your help with the scenario below.
File 1: Is a config file
a|b|c|d|e|f|g
File 2: Input file
a|c|d|g
I have to compare "File 1" and "File 2" and print the following from File 2:
a||c|d|||g
So basically I need to compare both records: for the matching fields I have to print the value from File 2, and where there is no match I have to put NULLs.
I think I can interpret what you're asking: you want to "remap" File2 onto the headers of File1.
$ cat File1
first|last|email|phone|address|colour|size
$ cat File2
first|last|phone|size
Glenn|Jackman|555-555-1212|L
$ awk -F '|' '
NR == FNR {
n = NF
for (i=1; i<=NF; i++) head[i] = $i
print
next
}
FNR == 1 {
for (i=1; i<=NF; i++) f2head[i] = $i
next
}
{
split("", data) # clear values left over from the previous record
for (i=1; i<=NF; i++) data[f2head[i]] = $i
for (i=1; i<=n; i++)
printf "%s%s", data[head[i]], (i<n ? FS : ORS)
}
' File1 File2
first|last|email|phone|address|colour|size
Glenn|Jackman||555-555-1212|||L
From the first block, the head array will be:
{1:"first",2:"last",3:"email",4:"phone",5:"address",6:"colour",7:"size"}
From the second block, the f2head array will be:
{1:"first",2:"last",3:"phone",4:"size"}
In the third block, the data array will be
{"first":"Glenn","last":"Jackman","phone":"555-555-1212","size":"L"}
Another awk approach:
$ awk 'BEGIN {FS=OFS="|"};
NR==1 {n=split($0,ht);
for(i=1;i<=n;i++) h[ht[i]]=i; next}
NR==2 {n2=split($0,h2)}
{split($0,t); $0="";
for(i=1;i<=n2;i++) $(h[h2[i]])=t[i]}1' file1 file2
It seems cryptic, but in essence it does the following:
it finds the index of each header2 (h2) value in header1 (h) and sets the corresponding field ($(h[h2[i]])) to the value from file2 (t[i]).
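For the literal scenario in the question, where File 2 has no separate header line and its values are themselves the things to match, a shorter sketch may be enough. The function name remap_fields is hypothetical, and the sketch assumes File 1 is non-empty and holds a single config line.

```shell
# Hypothetical helper: remap_fields CONFIG_FILE INPUT_FILE
# For each input line, print the config columns in order, leaving a column
# empty when its value does not occur anywhere in the input line.
remap_fields() {
    awk -F '|' '
    NR == FNR {                              # first file: remember the config columns
        n = NF
        for (i = 1; i <= NF; i++) head[i] = $i
        next
    }
    {
        split("", seen)                      # portable way to empty an array
        for (i = 1; i <= NF; i++) seen[$i] = 1
        line = ""
        for (i = 1; i <= n; i++)
            line = line (i > 1 ? FS : "") ((head[i] in seen) ? head[i] : "")
        print line
    }' "$1" "$2"
}
```

With File 1 containing a|b|c|d|e|f|g and File 2 containing a|c|d|g, remap_fields File1 File2 should print a||c|d|||g.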

How to match a list of strings in two different files using a loop structure?

I have a file processing task that I need a hand with. I have two files (matched_sequences.list and multiple_hits.list).
INPUT FILE 1 (matched_sequences.list):
>P001 ID
ABCD .... (very long string of characters)
>P002 ID
ABCD .... (very long string of characters)
>P003 ID
ABCD ... ( " " " " )
INPUT FILE 2 (multiple_hits.list):
ID1
ID2
ID3
....
What I want to do is match the second column (ID2, ID4, etc.) with a list of IDs stored in multiple_hits.list. Then create a new matched_sequences file similar to the original but which excludes all IDs found in multiple_hits.list (about 60 out of 1000). So far I have:
#!/bin/bash
X=$(cat matched_sequences.list | awk '{print $2}')
Y=$(cat multiple_hits.list | awk '{print $1}')
while read matched_sequences.list
do
[ $X -ne $Y ] && (cat matched_sequences.list | awk '{print $1" "$2}') > new_matched_sequences.list
done
I get the following error raised:
-bash: read: `matched_sequences.list': not a valid identifier
Many thanks in advance!
EXPECTED OUTPUT (new_matched_sequences.list):
Same as INPUT FILE 1 with all IDs in multiple_hits.list excluded
#!/usr/bin/awk -f
function chomp(s) {
sub(/^[ \t]*/, "", s)
sub(/[ \t\r]*$/, "", s)
return s
}
BEGIN {
file = ARGV[--ARGC]
while ((getline line < file) > 0) {
a[chomp(line)]++
}
RS = ""
FS = "\n"
ORS = "\n\n"
}
{
id = chomp($1)
sub(/^.* /, "", id)
}
!(id in a)
Usage:
awk -f script.awk matched_sequences.list multiple_hits.list > new_matched_sequences.list
A shorter awk answer is possible, with a tiny script that first reads the file with the IDs to exclude and then the file containing the sequences. The script is as follows (the comments make it look long, but there are really just three useful lines):
BEGIN { grab_flag = 0 }
# grab_flag will be used when we are reading the sequences file
# (not absolutely necessary to set here, though, because we expect the file will start with '>')
FNR == NR { hits[$1] = 1 ; next } # command executed for all lines of the first file: record IDs stored in multiple_hits.list
# otherwise we are reading the second file, containing the sequences:
/^>/ { if (hits[$2] == 1) grab_flag = 0 ; else grab_flag = 1 } # sets the flag indicating whether we have to output the sequence or not
grab_flag == 1 { print }
And if you call this script exclude.awk, you will invoke it this way:
awk -f exclude.awk multiple_hits.list matched_sequences.list
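As for the error in the question itself: read expects variable names, and matched_sequences.list is not a valid shell identifier, so the file has to be supplied by redirection instead. A minimal sketch of that loop form follows; the function name print_ids is hypothetical, and it assumes header lines start with '>' as in the sample.

```shell
# Hypothetical helper: print_ids FILE
# Reads FILE line by line and prints the second column of each header line.
print_ids() {
    while read -r header id; do
        case $header in
            '>'*) printf '%s\n' "$id" ;;     # header line such as ">P001 ID"
        esac
    done < "$1"                              # the file is fed to the loop by redirection
}
```

This only shows how to read the file correctly; the filtering itself is still easier with one of the awk scripts above.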

Unix comparing more than 2 files

I have 4 files I need to compare, and I want the output to display the numbers in file X that are not present in the other files.
The files consist of 1 column, where on each row we have a number.
I can't use the diff command because it is not available.
I was trying to use comm, but the output was wrong every time.
Thanks for your help.
It's not an easy task actually. You can try this. It uses GNU awk.
gawk 'function report() { printf("File \"%s\" has different number of lines: %d\nIt should be %d\n", FILENAME, FNR, o); exit } NR == FNR { a[NR] = $0; ++o; next } FNR > o { report() } a[FNR] != $0 { printf("Line %d of file \"%s\" is different from line in file \"%s\": %s\nIt should be: %s\n", FNR, FILENAME, ARGV[1], $0, a[FNR]); exit } ENDFILE { if (ARGIND > 1 && FNR != o) report() }' file1 file2 file3
Script version:
#!/usr/bin/gawk -f
function report() {
printf("File \"%s\" has different number of lines: %d\nIt should be %d\n",
FILENAME, FNR, o)
exit 1
}
NR == FNR {
a[NR] = $0
++o
next
}
FNR > o {
report()
}
a[FNR] != $0 {
printf("Line %d of file \"%s\" is different from line in file \"%s\": %s\nIt should be: %s\n",
FNR, FILENAME, ARGV[1], $0, a[FNR])
exit 1
}
ENDFILE {
if (ARGIND > 1 && FNR != o)
report()
}
Usage:
gawk -f script.awk file1 file2 file3 ...
Suppose we have 3 files, N1, N2 and N3, to compare.
Combine all the values from the files into a new file called out, with duplicates removed. Then compare each file against out: the lines found only in out are the values missing from that file. comm needs sorted input, so each file is sorted first. Finally, merge the Missing files into one, sorted and with duplicates removed.
sort -u N1 N2 N3 > out
sort N1 | comm -2 -3 out - > Missing1
sort N2 | comm -2 -3 out - > Missing2
sort N3 | comm -2 -3 out - > Missing3
sort -u Missing1 Missing2 Missing3 > Missing.txt
Thanks for the help :)
If you have vim installed, run this command:
vim -d file1 file2 file3 ...
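For the question as stated (values in file X that appear in none of the other files), a plain awk sketch along the lines of the NR == FNR idiom might look like this. The function name only_in_first is hypothetical, and the sketch assumes file X is passed first and is not empty.

```shell
# Hypothetical helper: only_in_first X OTHER1 OTHER2 ...
# Prints the lines of X that occur in none of the other files, in X's order.
only_in_first() {
    awk '
    NR == FNR {                              # first file: remember each distinct line
        if (!($0 in want)) { want[$0] = 1; order[++n] = $0 }
        next
    }
    { delete want[$0] }                      # later files: drop lines seen there
    END {
        for (i = 1; i <= n; i++)
            if (order[i] in want) print order[i]
    }' "$@"
}
```

Unlike comm, this does not need sorted input, at the cost of holding the distinct lines of X in memory.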

awk field separator when the separator shows up inside double quotes

I am trying to use awk to read the input at field position 3 ($3); field 3 is a string.
awk -F'","' '{print $1}' input.txt
My file input.txt looks like this:
field1,field2,field3,field4,field5
The problem is that these fields are separated by commas, and some of them are double-quoted while others are not. Field 5 is double-quoted and contains every type of symbol. Example:
imfield1,imfield2,"imfield3",imfield4,"im"",""fi"",el,""d5"
Can awk handle a situation like this?
More generally, how can I get the whole string by typing $5?
You can use Lorance Stinson's Awk CSV parser, in which case it's as simple as:
function parse_csv(..) {
..
}
{
num_fields = parse_csv($0, csv, ",", "\"", "\"", "\\n", 1);
print csv[2]
}
If you're not hell-bent on Awk, Python also comes with a nice CSV parser:
import csv, sys
for row in csv.reader(sys.stdin):
    print(row[2])
Or from the command line (bit tricky in one line):
python -c 'import csv,sys;[sys.stdout.write(row[2]+"\n") for row in csv.reader(sys.stdin)]' < input.txt
The separator is a simple comma, not a comma between quotes. If the fields do not contain commas, then awk may be up to the task:
awk -F , '
{
if ($3 ~ /^".*"$/) {
$3 = substr($3, 2, length($3)-2);
gsub(/""/, "\"", $3); # a doubled quote stands for one literal quote
}
print $3;
}' input.txt
This is already getting pretty complicated. If there can be commas inside fields, use a proper CSV parser, for example in Perl or Python. See https://unix.stackexchange.com/questions/7425/is-there-a-robust-command-line-tool-for-processing-csv-files
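You can parse the line in awk by setting a null field separator, so that each character becomes its own field (an extension found in gawk and mawk, not guaranteed by POSIX). Instead of calling printf("%s",$i) you can append $i to a variable and print it out when inda==0.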
#echo "\"AAA,BBB\",\"CCC\",\"DDD, EEE, FFF\"" > uno
awk 'BEGIN { FS="" }
{
    for (i = 1; i <= NF; i++) {
        if ($i == "\"")
            inda = !inda        # toggle: are we inside a quoted field?
        if ($i == "," && inda == 0)
            $i = "|"            # a comma outside quotes is a real separator
        printf("%s", $i)
    }
    printf("\n")
}' uno
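The same character-by-character idea can be written with substr() so that it does not rely on an empty FS, and extended to unescape doubled quotes. This is only a sketch: the function name csv_field is hypothetical, and it assumes that no field contains an embedded newline.

```shell
# Hypothetical helper: csv_field N < input.csv
# Prints the Nth comma-separated field of each line, honouring double quotes
# and treating "" inside a quoted field as one literal quote.
csv_field() {
    awk -v want="$1" '
    {
        n = 1; field = ""; inq = 0; len = length($0)
        for (i = 1; i <= len; i++) {
            c = substr($0, i, 1)
            if (inq) {
                if (c != "\"") {
                    field = field c
                } else if (substr($0, i + 1, 1) == "\"") {
                    field = field "\""       # escaped quote
                    i++
                } else {
                    inq = 0                  # closing quote
                }
            } else if (c == "\"") {
                inq = 1                      # opening quote
            } else if (c == ",") {
                if (n == want) break         # the wanted field is complete
                n++
                field = ""
            } else {
                field = field c
            }
        }
        print (n == want ? field : "")
    }'
}
```

Usage: csv_field 5 < input.txt. For anything beyond simple files, a proper CSV parser as suggested above remains the safer choice.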

Resources