Awk/Sed to replace word after specific string - unix

I have a file like below
<DATABASE name="ABC" url="jdbc:sybase:Tds:eqprod3:5060/ABC01" driver="com.sybase.jdbc2.jdbc.SybDriver" user="user" pwd="password" minConnections="10" maxConnections="10" maxConnectionLife="1440000" startDate="01/01/2014" endDate="01/30/2014" type="dev"/>
<DATABASE name="XYZ" url="jdbc:sybase:Tds:eqprod2:5050/XYZ01" driver="com.sybase.jdbc2.jdbc.SybDriver" user="user" pwd="password" minConnections="10" maxConnections="10" maxConnectionLife="1440000" startDate="02/01/2014" endDate="02/02/2014" type="dev"/>
Now I want to find the line whose url attribute contains ABC01 and change the value of the startDate attribute that follows it, let's say to 02/01/2014.
Could you please help me get the required output?

With sed :
sed '/ABC01/ s/startDate="[^"]*"/startDate="02\/01\/2014"/g' your.file
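If the escaped slashes in the date get hard to read, the same substitution can be written with a different delimiter. This is only a sketch: the function name set_start_date and the shortened sample lines are made up for illustration, and the database token is used as a plain sed regex, so it must not contain regex metacharacters or slashes.

```shell
# Hypothetical helper (not from the original answer).
# $1 = token to look for on the line (e.g. ABC01), $2 = new date; input on stdin.
# "|" is used as the s-command delimiter so the "/" in the date needs no escaping.
set_start_date() {
    sed "/$1/ s|startDate=\"[^\"]*\"|startDate=\"$2\"|"
}

# Shortened sample lines, for illustration only.
printf '%s\n' \
  '<DATABASE name="ABC" url="jdbc:sybase:Tds:eqprod3:5060/ABC01" startDate="01/01/2014" type="dev"/>' \
  '<DATABASE name="XYZ" url="jdbc:sybase:Tds:eqprod2:5050/XYZ01" startDate="02/01/2014" type="dev"/>' |
  set_start_date ABC01 02/01/2014
```

Only the line mentioning ABC01 is changed; the XYZ01 line passes through untouched.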

This script works for both single- and multi-line DATABASE elements. It uses gensub(), so it needs GNU awk (gawk).
awk '
# search for ABC01 in url attribute
/url="[^"]*ABC01[^"]*"/{
# set the flag
f = 1;
# current line number
i1 = NR;
}
# save line if the flag is set
(f){lines[NR] = $0}
# output line otherwise
(!f){print $0}
# if the flag is set, search for startDate attribute
(f && $0 ~ /startDate="/){
# replace value of startDate attribute with 02/01/2014
s = gensub(/(startDate=")[^"]+/, "\\102/01/2014", 1, $0)
# current line number
i2 = NR;
# output non-modified lines (no output if i2 == i1)
for (i = i1; i < i2; i++){print lines[i]};
# output modified line
print s;
# unset the flag
f = 0;
}' your.file

Related

how to use awk to pull fields and make them variables to do data calculation

I am using awk to compute the Mean Fractional Bias from a data file. How can I turn the data points into variables that I can use in my equation?
Input.....
col1 col2
row #1 Yavg: 14.87954
row #2 Xavg: 20.83804
row #3 Ystd: 7.886613
row #4 Xstd: 8.628519
I am looking to feed into this equation....
MFB = .5 * (Yavg-Xavg)/[(Yavg+Xavg)/2]
output....
col1 col2
row #1 Yavg: 14.87954
row #2 Xavg: 20.83804
row #3 Ystd: 7.886613
row #4 Xstd: 8.628519
row #5 MFB: (computed value)
I am currently trying the following code, but it is not working:
var= 'linear_reg-County119-O3-2004-Winter2013-2018XYstats.out.out'
val1=$(awk -F, OFS=":" "NR==2{print $2; exit}" <$var)
val2=$(awk -F, OFS=":" "NR==1{print $2; exit}" <$var)
#MFB = .5*((val2-val1)/((val2+val1)/2))
awk '{ print "MFB :" .5*((val2-val1)/((val2+val1)/2))}' >> linear_regCounty119-O3-2004-Winter2013-2018XYstats-wMFB.out
Try running: awk -f mfb.awk input.txt where
mfb.awk:
BEGIN { FS = OFS = ": " } # set the separators
{ v[$1] = $2; print } # store each value in the array "v", keyed by its label, and print the line
END {
MFB = 0.5 * (v["Yavg"] - v["Xavg"]) / ((v["Yavg"] + v["Xavg"]) / 2)
print "MFB", MFB
}
input.txt:
Yavg: 14.87954
Xavg: 20.83804
Ystd: 7.886613
Xstd: 8.628519
Output:
Yavg: 14.87954
Xavg: 20.83804
Ystd: 7.886613
Xstd: 8.628519
MFB: -0.166823
Alternatively, mfb.awk can be the following, resembling your original code:
BEGIN { FS = OFS = ": " }
{ print }
NR == 1 { Yavg = $2 } NR == 2 { Xavg = $2 }
END {
MFB = 0.5 * (Yavg - Xavg) / ((Yavg + Xavg) / 2)
print "MFB", MFB
}
Note that you don't usually toss variables back and forth between the shell and Awk (at least when you deal with a single input file).
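For the cases where the numbers really do live in shell variables, the usual route is awk's -v option rather than trying to expand them inside the quoted program. A minimal sketch, assuming the two averages are already known to the shell; the function name mfb and the variable names y and x are made up for illustration:

```shell
# Hypothetical helper: compute MFB from two values passed in by the shell.
# $1 = Yavg, $2 = Xavg. -v copies each shell value into an awk variable
# before the program starts, so it can be used inside BEGIN.
mfb() {
    awk -v y="$1" -v x="$2" \
        'BEGIN { printf "MFB: %.6f\n", 0.5 * (y - x) / ((y + x) / 2) }'
}

mfb 14.87954 20.83804
```

With the sample values this prints MFB: -0.166823, the same figure the scripts above produce.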

Remove certain characters at the beginning of a line with awk

I would like to use awk to remove some characters that may appear at the beginning of a line. The characters I would like to remove are # and/or =
Here is a file example :
#word <= Remove #
=word <= Remove =
#=word <= Remove # AND =
=#word <= Remove = AND #
At the moment, I use sub(/^\#/, "", $0) to remove # at the beginning of a line. How can I edit this line so it removes #, = and both if they are present together ?
With awk:
awk '{sub(/^[#=]+/, "")}1' File
With sed:
sed -r 's/^[#=]+//' File
In awk:
$ awk 'gsub(/^[=#]+/,"")||1' file
word <= Remove #
word <= Remove =
word <= Remove # AND =
word <= Remove = AND #
Explained:
awk '
gsub(/^[=#]+/,"") || 1 # replace all leading =s and #s with "" and print nevertheless
' file
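Because the regex is anchored with ^, it can match at most once per line, so sub and gsub behave the same here, and any # or = further along the line is left alone. A small sketch to check that; the function name strip_lead and the sample lines are made up for illustration:

```shell
# Hypothetical helper: strip any run of leading "#" and "=" characters.
# The trailing 1 is an always-true pattern, so every line is printed.
strip_lead() {
    awk '{ sub(/^[#=]+/, "") } 1'
}

# The second line has "#" and "=" only in the middle, so it is unchanged.
printf '%s\n' '#=word' 'a#b=c' '==#x' | strip_lead
```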

Merging the rows in a file using awk

Can somebody explain the meaning of the below script please?
awk -F "," 's != $1 || NR ==1{s=$1;if(p){print p};p=$0;next}
{sub($1,"",$0);p=p""$0;}
END{print p}' file
The file has the following data:
2,"URL","website","aaa.com"
2,"abc","text","some text"
2,"Password","password","12345678"
3,"URL","website","10.0.10.75"
3,"uname","text","some text"
3,"Password","password","password"
3,"ulang","text","some text"
4,"URL","website","192.168.2.110"
4,"Login","login","admin"
4,"Password","password","blah-blah"
and the output is:
2,"URL","website","aaa.com","abc","text","some text","Password","password","12345678"
3,"URL","website","10.0.10.75","uname","text","some text","Password","password","password","ulang","text","some text"
4,"URL","website","192.168.2.110","Login","login","admin","Password","password","blah-blah"
awk has this structure
pattern {action}
For your script, let's go through the elements one by one. First pattern:
s != $1 || NR == 1 # if the variable s is not equal to first field
# or we're operating on first row
First action:
s = $1 # remember the first field in the variable s
if (p) { # if p is not empty, print
print p
}
p = $0 # assign the current line to p variable
next # move to next, skip the rest
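The next pattern is missing, so the action applies to every row that reaches it (rows that start a new group were already handled and skipped by next)
sub($1, "", $0) # treat $1 as a regex and delete its first match in the line: the leading key goes, the comma after it stays
p = ((p "") $0) # append the rest of the line to p
The last pattern is the reserved word END; its action runs only after all rows have been consumed (its counterpart BEGIN runs before any input is read)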
END {
print p # print the value of p
}
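One thing to keep in mind: sub($1, "", $0) uses the key as a regular expression, which is fine for keys like 2 or 3 but could misbehave if a key ever contained regex metacharacters. A sketch of the same merge that cuts the key off by length instead; the function name merge_rows and the variable names key and merged are made up for illustration, and it assumes rows with the same key are adjacent, as in the sample file:

```shell
# Hypothetical helper: merge consecutive rows that share the same first field.
merge_rows() {
    awk -F, '
        $1 != key || NR == 1 {          # first row, or the key has changed
            if (NR > 1) print merged    # flush the previous group
            key = $1
            merged = $0                 # start a new group with the whole line
            next
        }
        # same key: append everything after the key (the leading comma included)
        { merged = merged substr($0, length($1) + 1) }
        END { if (NR > 0) print merged }   # flush the last group
    '
}

printf '%s\n' \
  '2,"URL","website","aaa.com"' \
  '2,"abc","text","some text"' \
  '3,"URL","website","10.0.10.75"' | merge_rows
```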

How to match a list of strings in two different files using a loop structure?

I have a file processing task that I need a hand in. I have two files (matched_sequences.list and multiple_hits.list).
INPUT FILE 1 (matched_sequences.list):
>P001 ID
ABCD .... (very long string of characters)
>P002 ID
ABCD .... (very long string of characters)
>P003 ID
ABCD ... ( " " " " )
INPUT FILE 2 (multiple_hits.list):
ID1
ID2
ID3
....
What I want to do is match the second column (ID2, ID4, etc.) with a list of IDs stored in multiple_hits.list. Then create a new matched_sequences file similar to the original but which excludes all IDs found in multiple_hits.list (about 60 out of 1000). So far I have:
#!/bin/bash
X=$(cat matched_sequences.list | awk '{print $2}')
Y=$(cat multiple_hits.list | awk '{print $1}')
while read matched_sequenes.list
do
[ $X -ne $Y ] && (cat matched_sequences.list | awk '{print $1" "$2}') > new_matched_sequences.list
done
I get the following error raised:
-bash: read: `matched_sequences.list': not a valid identifier
Many thanks in advance!
EXPECTED OUTPUT (new_matched_sequences.list):
Same as INPUT FILE 1 with all IDs in multiple_hits.list excluded
#!/usr/bin/awk -f
function chomp(s) {
sub(/^[ \t]*/, "", s)
sub(/[ \t\r]*$/, "", s)
return s
}
BEGIN {
file = ARGV[--ARGC]
while ((getline line < file) > 0) {
a[chomp(line)]++
}
RS = ""
FS = "\n"
ORS = "\n\n"
}
{
id = chomp($1)
sub(/^.* /, "", id)
}
!(id in a)
Usage:
awk -f script.awk matched_sequences.list multiple_hits.list > new_matched_sequences.list
A shorter awk answer is possible, with a tiny script that first reads the file with the IDs to exclude and then the file containing the sequences. The script would be as follows (the comments make it look long; there are really only three useful lines):
BEGIN { grab_flag = 0 }
# grab_flag will be used when we are reading the sequences file
# (not absolutely necessary to set here, though, because we expect the file will start with '>')
FNR == NR { hits[$1] = 1 ; next } # command executed for all lines of the first file: record IDs stored in multiple_hits.list
# otherwise we are reading the second file, containing the sequences:
/^>/ { if (hits[$2] == 1) grab_flag = 0 ; else grab_flag = 1 } # sets the flag indicating whether we have to output the sequence or not
grab_flag == 1 { print }
And if you call this script exclude.awk, you will invoke it this way:
awk -f exclude.awk multiple_hits.list matched_sequences.list
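The same idea can be squeezed a bit further by using awk's in operator instead of comparing against 1. This is a sketch under the assumption that each header line looks like ">P001 ID1" with the ID in the second field and that the hits file is not empty; the function name exclude_hits, the variable keep and the sample data are made up for illustration:

```shell
# Hypothetical helper: $1 = file of IDs to drop, $2 = sequences file.
exclude_hits() {
    awk 'FNR == NR { hits[$1]; next }        # first file: remember each ID
         /^>/      { keep = !($2 in hits) }  # header line: keep or drop this record
         keep' "$1" "$2"                     # print lines of records being kept
}

# Made-up sample data in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'ID2' > "$tmp/hits"
printf '%s\n' '>P001 ID1' 'AAAA' '>P002 ID2' 'CCCC' '>P003 ID3' 'GGGG' > "$tmp/seqs"
exclude_hits "$tmp/hits" "$tmp/seqs"
rm -rf "$tmp"
```

The P002 record, whose ID is listed in the hits file, is dropped together with its sequence line.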

awk count and sum based on slab:

I would like to take every line of the first file (a gzipped CSV, Input.csv.gz) whose 4th field falls within a range defined in the second file (Slab.csv),
where the first field is the start of the range and the second field is the end, and then report, per slab, the row count and the sums of the 4th and 5th fields of the first file.
Input.csv.gz (GunZip)
Desc,Date,Zone,Duration,Calls
AB,01-06-2014,XYZ,450,3
AB,01-06-2014,XYZ,642,3
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,205,3
AB,01-06-2014,XYZ,98,1
AB,01-06-2014,XYZ,455,1
AB,01-06-2014,XYZ,120,1
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,193,1
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,161,2
Slab.csv
StartRange,EndRange
0,0
1,10
11,100
101,200
201,300
301,400
401,500
501,10000
Expected Output:
StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,0,0
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3
I am using the two commands below to get the above output, except for the "NotFound" cases.
awk -F, 'NR==FNR{s[NR]=$1;e[NR]=$2;c[NR]=$0;n++;next} {for(i=1;i<=n;i++) if($4>=s[i]&&$4<=e[i]) {print $0,","c[i];break}}' Slab.csv <(gzip -dc Input.csv.gz) >Op_step1.csv
cat Op_step1.csv | awk -F, '{key=$6","$7;++a[key];b[key]=b[key]+$4;c[key]=c[key]+$5} END{for(i in a)print i","a[i]","b[i]","c[i]}' >Op_step2.csv
Op_step2.csv
101,200,3,474,4
501,10000,1,642,3
0,0,3,0,0
401,500,2,905,4
11,100,1,98,1
201,300,1,205,3
Any suggestions for a one-liner command that produces the Expected Output? I don't have access to Perl or Python.
Here is another option using perl, which takes advantage of multi-dimensional arrays and hashes.
perl -F, -lane'
BEGIN {
$x = pop;
## Create array of arrays from start and end ranges
## $range = ( [0,0] , [1,10] ... )
(undef, @range)= map { chomp; [split /,/] } <>;
@ARGV = $x;
}
## Skip the first line
next if $. ==1;
## Create hash of hash
## $line = '[0,0]' => { "count" => counts , "sum4" => sum_of_col4 , "sum5" => sum_of_col5 }
for (@range) {
if ($F[3] >= $_->[0] && $F[3] <= $_->[1]) {
$line{"@$_"}{"count"}++;
$line{"@$_"}{"sum4"} +=$F[3];
$line{"@$_"}{"sum5"} +=$F[4];
}
}
}{
print "StartRange,EndRange,Count,Sum-4,Sum-5";
print join ",", @$_,
$line{"@$_"}{"count"} //"NotFound",
$line{"@$_"}{"sum4"} //"NotFound",
$line{"@$_"}{"sum5"} //"NotFound"
for @range
' slab input
StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,0,0
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3
Here is one way using awk and sort:
awk '
BEGIN {
FS = OFS = SUBSEP = ",";
print "StartRange,EndRange,Count,Sum-4,Sum-5"
}
FNR == 1 { next }
NR == FNR {
ranges[$1,$2]++;
next
}
{
for (range in ranges) {
split(range, tmp, SUBSEP);
if ($4 >= tmp[1] && $4 <= tmp[2]) {
count[range]++;
sum4[range]+=$4;
sum5[range]+=$5;
next
}
}
}
END {
for(range in ranges)
print range, ((range in count) ? count[range] : "NotFound"), ((range in count) ? sum4[range] : "NotFound"), ((range in count) ? sum5[range] : "NotFound") | "sort -t, -nk1,2"
}' slab input
StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,0,0
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3
Set the Input, Output Field Separators and SUBSEP to ,. Print the Header line.
Skip the first (header) line of each file.
Load every range from the slab file into an array called ranges.
For every range in the ranges array, split the field to get start and end range. If the 4th column is in the range, increment the count array and add the value to sum4 and sum5 array appropriately.
In the END block, iterate through the ranges and print them.
Pipe the output to sort to get the output in order.
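The steps above can also be done without the external sort by remembering the slabs in the order they are read. This is a sketch, not the original answer's method: the function name slab_summary and the array names lo and hi are made up for illustration, both files are assumed to start with a header line, and the slabs are assumed not to overlap. It tests for a match with "i in count", so a slab whose sums are legitimately 0 still prints 0 and not NotFound.

```shell
# Hypothetical helper: $1 = slab file, $2 = input file (both with a header line).
slab_summary() {
    awk -F, -v OFS=, '
        FNR == 1  { next }                                    # skip each header
        NR == FNR { lo[++n] = $1 + 0; hi[n] = $2 + 0; next }  # slabs, in file order
        {
            for (i = 1; i <= n; i++)
                if ($4 >= lo[i] && $4 <= hi[i]) {
                    count[i]++; sum4[i] += $4; sum5[i] += $5
                    break                                     # at most one slab per row
                }
        }
        END {
            print "StartRange,EndRange,Count,Sum-4,Sum-5"
            for (i = 1; i <= n; i++) {
                if (i in count) {
                    print lo[i], hi[i], count[i], sum4[i], sum5[i]
                } else {
                    print lo[i], hi[i], "NotFound", "NotFound", "NotFound"
                }
            }
        }' "$1" "$2"
}

# Made-up, shortened sample data in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'StartRange,EndRange' '0,0' '1,10' '11,100' > "$tmp/slab"
printf '%s\n' 'Desc,Date,Zone,Duration,Calls' \
  'AB,01-06-2014,XYZ,0,0' 'AB,01-06-2014,XYZ,98,1' 'AB,01-06-2014,XYZ,0,0' > "$tmp/input"
slab_summary "$tmp/slab" "$tmp/input"
rm -rf "$tmp"
```

In bash, the gzipped input from the question could be passed as the second argument with process substitution, for example slab_summary Slab.csv <(gzip -dc Input.csv.gz).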
