finding first and last occurrence of a string using awk or sed - unix

I couldn't find what I am looking for online so I hope someone can help me here. I have a file with the following lines:
CON/Type/abc.sql
CON/Type/bcd.sql
CON/Table/last.sql
CON/Table/first.sql
CON/Function/abc.sql
CON/Package/foo.sql
What I want to do is to find the first occurrence of Table, print a new string and then find last occurrence and print another string. For example, output should look like this:
CON/Type/abc.sql
CON/Type/bcd.sql
set define on
CON/Table/last.sql
CON/Table/first.sql
set define off
CON/Function/abc.sql
CON/Package/foo.sql
As you can see, after finding first occurrence of Table I printed "set define on" before the first occurrence. For the last occurrence I printed "set define off" after last match of Table. Can someone help me write an awk script? Using sed would be okay too.
Note: The lines with Table can appear in the first line of the file or middle or last. In this case they appear in the middle of the rest of the lines.

$ awk -F/ '$2=="Table"{if (!f)print "set define on";f=1} f && $2!="Table"{print "set define off";f=0} 1' file
CON/Type/abc.sql
CON/Type/bcd.sql
set define on
CON/Table/last.sql
CON/Table/first.sql
set define off
CON/Function/abc.sql
CON/Package/foo.sql
How it works
-F/
Set the field separator to /
$2=="Table"{if (!f)print "set define on";f=1}
If the second field is Table, then do the following: (a) if flag f is zero, then print set define on; (b) set flag f to one (true).
f && $2!="Table"{print "set define off";f=0}
If flag f is true and the second field is not Table, then do the following: (a) print set define off; (b) set flag f to zero (false).
1
Print the current line.
Alternate Version
As suggested by Etan Reisner, the following does the same thing with the logic slightly reorganized, eliminating the need for the if statement:
awk -F/ '$2=="Table" && !f {print "set define on";f=1} $2!="Table" && f {print "set define off";f=0} 1' file

Related

Need of awk command explanation

I want to know how the below command is working.
awk '/Conditional jump or move depends on uninitialised value/ {block=1} block {str=str sep $0; sep=RS} /^==.*== $/ {block=0; if (str!~/oracle/ && str!~/OCI/ && str!~/tuxedo1222/ && str!~/vprintf/ && str!~/vfprintf/ && str!~/vtrace/) { if (str!~/^$/){print str}} str=sep=""}' file_name.txt >> CondJump_val.txt
I'd also like to know how to check the texts Oracle, OCI, and so on from the second line only. 
The first step is to write it so it's easier to read
awk '
    /Conditional jump or move depends on uninitialised value/ {block=1}
    block {
        str=str sep $0
        sep=RS
    }
    /^==.*== $/ {
        block=0
        if (str!~/oracle/ && str!~/OCI/ && str!~/tuxedo1222/ && str!~/vprintf/ && str!~/vfprintf/ && str!~/vtrace/) {
            if (str!~/^$/) {
                print str
            }
        }
        str=sep=""
    }
' file_name.txt >> CondJump_val.txt
It accumulates the lines from the one matching "Conditional jump ..." through the one matching "==...== " into a variable str.
If the accumulated string does not match several patterns, the string is printed.
I'd also like to know how to check the texts Oracle, OCI, and so on from the second line only.
What does that mean? I assume you don't want to see the "Conditional jump..." line in the output. If that's the case then use the next command to jump to the next line of input.
/Conditional jump or move depends on uninitialised value/ {
    block=1
    next
}
Perhaps consolidate those regexes into a single alternation?
if (str !~ "oracle|OCI|tuxedo1222|v[f]?printf|vtrace") {
    print str
}
There are two idiomatic awkisms to understand.
The first can be simplified to this:
$ seq 100 | awk '/^22$/{flag=1}
/^31$/{flag=0}
flag'
22
23
...
30
Why does this work? In awk, a variable like flag can be tested even before it has been assigned, which is what the stand-alone flag pattern does: the input line is printed only when flag is true. flag=1 is executed only after the regex /^22$/ matches, and the condition of flag being true ends with the regex /^31$/ in this simple example.
This is an awk idiom for executing code between two regex matches on different lines.
In your case, the two regex's are:
/Conditional jump or move depends on uninitialised value/ # start
# in-between, block is true and collect the input into str separated by RS
/^==.*== $/ # end
The other 'awkism' is this:
block {str=str sep $0; sep=RS}
When block is true, append $0 to str. The first time through, sep is empty, so no separator is prepended; after that, sep is RS, so subsequent lines are joined by the record separator. The result is:
str="first lineRSsecond lineRSthird lineRS..."
Both idioms depend on awk being able to use an undefined variable without error.
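Putting the two idioms together, here is a minimal sketch (the /START/ and /END/ regexes are placeholders, not from the question) that collects each block between two markers into str and prints it:
awk '
    /START/ {block=1}                 # first idiom: raise the flag at the start regex
    block {str=str sep $0; sep=RS}    # second idiom: join lines with RS after the first
    /END/ {block=0; print str; str=sep=""}
' file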

Unix: Using filename from another file

A basic Unix question.
I have a script which counts the number of records in a delta file.
awk '{
    n++
} END {
    if(n >= 1000) print "${completeFile}"; else print "${deltaFile}";
}' <${deltaFile} >${fileToUse}
Then, depending on the IF condition, I want to process the appropriate file:
cut -c2-11 < ${fileToUse}
But how do I use the contents of the file as the filename itself?
And if there are any tweaks to be made, feel free.
Thanks in advance
Cheers
Simon
To use as a filename the contents of a file which is itself identified by a variable (as asked):
cut -c2-11 <"$( cat "$fileToUse" )"
# or in zsh just
cut -c2-11 <"$( < "$fileToUse" )"
unless the filename in the file ends with one or more newline character(s), which people rarely do because it's quite awkward and inconvenient; then something like:
read -rdX var <"$fileToUse"; cut -c2-11 <"${var%?}"
# where X is a character that doesn't occur in the filename,
# maybe something like $'\x1f'
Tweaks: your awk prints the variable reference ${completeFile} or ${deltaFile} (because they're inside the single-quoted awk script), not the value of either variable. If you actually want the value, as I'd expect from your description, you should pass the shell variables to awk variables like this:
awk -vf="$completeFile" -vd="$deltaFile" '{n++} END{if(n>=1000)print f; else print d}' <"$deltaFile"
# the " around $var can be omitted if the value contains no whitespace and no glob chars
# people _often_ but not always choose filenames that satisfy this
# and they must not contain backslash in any case
or export the shell vars as env vars (if they aren't already) and access them like
awk '{n++} END{if(n>=1000) print ENVIRON["completeFile"]; else print ENVIRON["deltaFile"]}' <"$deltaFile"
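For the ENVIRON variant to see them, the variables must actually be exported first, e.g.
export completeFile deltaFile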
Also, you don't need your own counter; awk already counts input records:
awk -vf=... -vd=... 'END{if(NR>=1000)print f;else print d}' <...
or more briefly
awk -vf=... -vd=... 'END{print (NR>=1000?f:d)}' <...
or using a file argument instead of redirection so the name is available to the script
awk -vf="$completeFile" 'END{print (NR>=1000?f:FILENAME)}' "$deltaFile" # no <
and, barring trailing newlines as above, you don't need an intermediate file at all, just:
cut -c2-11 <"$( awk -vf="$completeFile" 'END{print (NR>=1000?f:FILENAME)}' "$deltaFile" )"
Or you don't really need awk at all; wc can do the counting and any POSIX or classic shell can do the comparison:
if [ $(wc -l <"$deltaFile") -ge 1000 ]; then c="$completeFile"; else c="$deltaFile"; fi
cut -c2-11 <"$c"

How do I replace empty strings in a tsv with a value?

I have a tsv, file1, that is structured as follows:
col1 col2 col3
1 4 3
22 0 8
3 5
so that the last line would look something like 3\t\t5, if it was printed out. I'd like to replace that empty string with 'NA', so that the line would then be 3\tNA\t5. What is the easiest way to go about this using the command line?
awk is designed for this scenario (among a million others ;-) )
awk -F"\t" -v OFS="\t" '{
for (i=1;i<=NF;i++) {
if ($i == "") $i="NA"
}
print $0
}' file > file.new && mv file.new file
-F="\t" indicates that the field separator (also known as FS internally to awk) is the tab character. We also set the output field separator (OFS) to "\t".
NF is the number of fields on a line of data. $i gets evaluated as $1, $2, $3, ... for each value between 1 and NF.
We test if the $i th element is empty with if ($i == "") and when it is, we change the $i th element to contain the string "NA".
For each line of input, we print the line's ($0) value.
Outside the awk script, we write the output to a temp file (file > file.new). The && tests that the awk script exited without errors and, if so, moves file.new over the original file. Depending on the safety and security requirements of your project, you may not want to "destroy" your original file.
IHTH.
A straightforward approach (using GNU sed, which understands \t) is
sed -i 's/^\t/NA\t/;s/\t$/\tNA/;:0 s/\t\t/\tNA\t/;t0' file
sed -i edits the file in place;
s/a/b/ replaces a with b;
s/^\t/NA\t/ replaces a \t at the beginning of the line with NA\t
(the first column becomes NA);
s/\t$/\tNA/ does the same for the last column;
s/\t\t/\tNA\t/ inserts NA between \t\t;
:0 s///; t0 repeats the s/// as long as it made a replacement; the loop is needed because each substitution consumes both tabs, so consecutive missing values in a line take several passes.
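To sanity-check the command before using -i (assuming GNU sed, since \t in patterns is a GNU extension), feed it sample data on stdin:
printf '1\t4\t3\n22\t0\t8\n3\t\t5\n' | sed 's/^\t/NA\t/;s/\t$/\tNA/;:0 s/\t\t/\tNA\t/;t0'
The last line comes out as 3, NA, 5 separated by tabs.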

grep string from a TCL variable

I want to grep a certain amount of string from a TCL variable and use that in my tool command. Example:
${tcl_Var} - this contains string like VEG_0_1/ABC
I want to grep from the above string up to the point it hits the first forward slash, so in the above case it would be VEG_0_1. And then replace it in my command. Example:
VEG_0_1/REST_OF_THE_COMMAND.
Don't think in terms of grep, think about "string manipulation" instead.
Use regsub for "search and replace":
% set tcl_Var VEG_0_1/ABC
VEG_0_1/ABC
% set newvar [regsub {/.+} $tcl_Var {/REST_OF_THE_COMMAND}]
VEG_0_1/REST_OF_THE_COMMAND
Alternately, your problem can be solved by splitting the string on /, taking the first component, then appending the "rest of the command":
% set newvar "[lindex [split $tcl_Var /] 0]/REST_OF_THE_COMMAND"
VEG_0_1/REST_OF_THE_COMMAND
Or using string indices:
% set newvar "[string range $tcl_Var 0 [string first / $tcl_Var]]REST_OF_THE_COMMAND"
VEG_0_1/REST_OF_THE_COMMAND
You can do this with regular expressions using the TCL regsub command. There is no need to run the external program grep. See more info here: http://www.tcl.tk/man/tcl8.4/TclCmd/regsub.htm
If you are new to regular expressions, read a TCL-specific tutorial about them.
set tcl_Var VEG_0_1/ABC
set varlist [split $tcl_Var "/"]
set newvar [lindex $varlist 0]/REST_OF_THE_COMMAND
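If you only need the part before the first slash by itself, regexp with a capture also works; a small sketch (head is just an illustrative variable name):
% set tcl_Var VEG_0_1/ABC
VEG_0_1/ABC
% regexp {^[^/]+} $tcl_Var head
1
% set newvar "$head/REST_OF_THE_COMMAND"
VEG_0_1/REST_OF_THE_COMMAND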

How to delete partial duplicate lines with AWK?

I have files with this kind of duplicate line, where only the last field is different:
OST,0202000070,01-AUG-09,002735,6,0,0202000068,4520688,-1,0,0,0,0,0,55
ONE,0208076826,01-AUG-09,002332,316,3481.055935,0204330827,29150,200,0,0,0,0,0,5
ONE,0208076826,01-AUG-09,002332,316,3481.055935,0204330827,29150,200,0,0,0,0,0,55
OST,0202000068,01-AUG-09,003019,6,0,0202000071,4520690,-1,0,0,0,0,0,55
I need to remove the first occurrence of the line and leave the second one.
I've tried:
awk '!x[$0]++ {getline; print $0}' file.csv
but it's not working as intended, as it's also removing non-duplicate lines.
#!/bin/awk -f
{
    # key = everything before the last comma-separated field
    s = substr($0, 1, match($0, /,[^,]+$/) - 1)
    if (!seen[s]) {
        # note: this keeps the FIRST line of each group
        print $0
        seen[s] = 1
    }
}
If your near-duplicates are always adjacent, you can just compare to the previous entry and avoid creating a potentially huge associative array.
#!/bin/awk -f
{
    s = substr($0, 1, match($0, /,[^,]*$/) - 1)
    # when the key changes, the previous line was the last of its group
    # (the NR > 1 guard avoids printing a spurious empty line for the first record)
    if (NR > 1 && s != prev) {
        print prev0
    }
    prev = s
    prev0 = $0
}
END {
    print $0
}
Edit: Changed the script so it prints the last one in a group of near-duplicates (no tac needed).
As a general strategy (I'm not much of an AWK pro despite taking classes with Aho) you might try:
1. Concatenate all the fields except the last.
2. Use this string as a key to a hash.
3. Store the entire line as the value to a hash.
4. When you have processed all lines, loop through the hash printing out the values.
This isn't AWK specific and I can't easily provide any sample code, but this is what I would first try; a sketch follows below.
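A minimal awk sketch of that strategy (assumptions: fields are comma-separated as in the sample, later duplicates overwrite earlier ones so the last line of each group survives, and awk's for-in loop does not guarantee output order):
awk '
    {
        key = $0
        sub(/,[^,]*$/, "", key)   # drop the last field to form the key
        line[key] = $0            # later duplicates overwrite earlier ones
    }
    END {
        for (k in line) print line[k]   # output order is not guaranteed
    }
' file.csv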
