How to find a pattern using sed? - unix

How can I combine multiple filters using sed?
Here's my data set
sex,city,age
male,london,32
male,manchester,32
male,oxford,64
female,oxford,23
female,london,33
male,oxford,45
I want to identify all lines which contain MALE AND OXFORD. Here's my approach:
sed -n '/male/,/oxford/p' file
Thanks

You can associate a block with the first check and put the second in there. For example:
sed -n '/male/ { /oxford/ p; }' file
Or invert the check and action:
sed '/male/!d; /oxford/!d' file
However, since (as #Jotne points out) lines that contain female also contain male and you probably don't want to match them, the patterns should at least be amended to contain word boundaries:
sed -n '/\<male\>/ { /\<oxford\>/ p; }' file
sed '/\<male\>/!d; /\<oxford\>/!d' file
But since that looks like comma-separated data and the check is probably not meant to test whether someone went to male university, it would probably be best to use a stricter check with awk:
awk -F, '$1 == "male" && $2 == "oxford"' file
This checks not only if a line contains male and oxford but also if they are in the appropriate fields. The same can be achieved, somewhat less prettily, with sed by using
sed '/^male,oxford,/!d' file

A single sed command command can be used to solve this. Let's look at two variations of using sed:
$ sed -e 's/^\(male,oxford,.*\)$/\1/;t;d' file
male,oxford,64
male,oxford,45
$ sed -e 's/^male,oxford,\(.*\)$/\1/;t;d' file
64
45
Both have the essentially the same regex:
^male,oxford,.*$
The interesting features are the capture group placement (either the whole line or just the age portion) and the use of ;t;d to discard non matching lines.
By doing it this way, we can avoid the requirement of using awk or grep to solve this problem.

You can use awk
awk -F, '/\<male\>/ && /\<oxford\>/' file
male,oxford,64
male,oxford,45
It uses the word anchor to prevent hit on female.

Related

Scraping 5 characters off a webpage using sed or something better?

I have a downloaded webpage I would like to scrape using sed or awk. I have been told by a colleague that what I'm trying to achieve isn't possible with sed and maybe this is probably correct seeing as he is a bit of a linux guru.
What I am trying to achieve:
I am trying to scrape a locally stored html document for every value within this html label which apppears hundreds of times on a webpage..
For example:
<label class="css-l019s7">54321</label>
<label class="css-l019s7">55555</label>
This label class never changes, so it seems the perfect point to do scraping and get the values:
54321
55555
There are hundreds of occurences of this data and I need to get a list of them all.
As sed probably isn't capable of this, I would be forever greatful if someone could demonstrate AWK or something else?
Thank you.
Things I've tried:
sed -En 's#(^.*<label class=\"css-l019s7\">)(.*)(</label>.*$)#\2#gp' TS.html
This code above managed to extract about 40 of the numbers out of 320. There must be a little bug in this sed command for it to work partially.
Use a parser like xmllint:
xmllint --html --recover --xpath '//label[#class="css-l019s7"]/text()' TS.html
As an interest in sed was expressed (note that html can use newlines instead of spaces, for example, so this is not very robust):
sed 't a
s/<label class="css-l019s7">\([^<]*\)/\
\1\
/;D
:a
P;D' TS.html
Using awk:
awk '$1~/^label class="css-l019s7"$/ { print $2 }' RS='<' FS='>' TS.html
or:
awk '$1~/^[\t\n\r ]*label[\t\n\r ]/ &&
$1~/[\t\n\r ]class[\t\n\r ]*=[\t\n\r ]*"css-l019s7"[\t\n\r ]*([\t\n\r ]|$)/ {
print $2
}' RS='<' FS='>' TS.html
someone could demonstrate AWK or something else?
This task seems for me as best fit for using CSS selector. If you are allowed to install tools you might use Element Finder for this following way:
elfinder -s "label.css-l019s7"
which will search for labels with class css-l019s7 in files in current directory.
with grep you can get the values
grep -Eo '([[:digit:]]{5})' file
54321
55555
with awk you can concrete where the values are, here in the lines with label at the beginning and at the end:
awk '/^<label|\/label>$/ {if (match($0,/[[:digit:]]{5}/)) { pattern = substr($0,RSTART,RLENGTH); print pattern}}' file
54321
55555
using GNU awk and gensub:
awk '/label class/ && /css-l019s7/ { str=gensub("(<label class=\"css-l019s7\">)(.*)(</label>)","\\2",$0);print str}' file
Search for lines with "label class" and "css=1019s7". Split the line into three sections and substitute the line for the second section, reading the result into a variable str. Print str.

Removing comments from a datafile. What are the differences?

Let's say that you would like to remove comments from a datafile using one of two methods:
cat file.dat | sed -e "s/\#.*//"
cat file.dat | grep -v "#"
How do these individual methods work, and what is the difference between them? Would it also be possible for a person to write the clean data to a new file, while avoiding any possible warnings or error messages to end up in that datafile? If so, how would you go about doing this?
How do these individual methods work, and what is the difference
between them?
Yes, they work same though sed and grep are 2 different commands. Your sed command simply substitutes all those lines which having # with NULL. On other hand grep will simply skip or ignore those lines which will skip lines which have # in it.
You could get more information on these by man page as follows:
man grep:
-v, --invert-match
Invert the sense of matching, to select non-matching lines. (-v is specified by POSIX.)
man sed:
s/regexp/replacement/
Attempt to match regexp against the pattern space. If successful, replace that portion matched with replacement. The
replacement may
contain the special character & to refer to that portion of the pattern space which matched, and the special escapes \1
through \9 to
refer to the corresponding matching sub-expressions in the regexp.
Would it also be possible for a person to write the clean data to a
new file, while avoiding any possible warnings or error messages to
end up in that datafile?
yes, we could re-direct the errors by using 2>/dev/null in both the commands.
If so, how would you go about doing this?
You could try like 2>/dev/null 1>output_file
Explanation of sed command: Adding explanation of sed command too now. This is only for understanding purposes and no need to use cat and then use sed you could use sed -e "s/\#.*//" Input_file instead.
sed -e " ##Initiating sed command here with adding the script to the commands to be executed
s/ ##using s for substitution of regexp following it.
\#.* ##telling sed to match a line if it has # till everything here.
//" ##If match found for above regexp then substitute it with NULL.
That grep -v will lose all the lines that have # on them, for example:
$ cat file
first
# second
thi # rd
so
$ grep -v "#" file
first
will drop off all lines with # on it which is not favorable. Rather you should:
$ grep -o "^[^#]*" file
first
thi
like that sed command does but this way you won't get empty lines. man grep:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.

Sed line match plus line below

I can find my lines with this pattern, but in some case the info is on the line after the match. How can I also get the line following my match line?
sed -n '/SQL3227W Record token/p' /log/PLAN_2015-08-16*.MSG >ERRORS.txt
Firstly, this looks like a job for grep:
grep -A 1 'SQL3227W Record token' /log/PLAN_2015-08-16*.MSG >ERRORS.txt
(-A 1 means to print an additional 1 line After the match).
Secondly, if you're using GNU sed, you can use a second address of +1 thus:
sed -n '/SQL3227W Record token/,+1p' /log/PLAN_2015-08-16*.MSG >ERRORS.txt
Otherwise, (if you really must use non-Gnu sed), then each time you match, append the following line to your pattern space. Delete the first line, before continuing loop (in case the second line is also a match).
Untested code:
#!/bin/sed -nf
/SQL3227W Record token/{
N
P
D
}
sed is for simple substitutions on individual lines, that is all. For anything even slightly more interesting just use awk:
awk '/SQL3227W Record token/{c=2} c&&c--' file
See Printing with sed or awk a line following a matching pattern for other related idioms.

Why sed removes last line?

$ cat file.txt
one
two
three
$ cat file.txt | sed "s/one/1/"
1
two
Where is the word "three"?
UPDATED:
There is no line after the word "three".
As Ivan suggested, your text file is missing the end of line (EOL) marker on the final line. Since that's not present, three is printed out by sed but then immediately over-written by your prompt. You can see it if you force an extra line to be printed.
sed 's/one/1/' file.txt && echo
This is a common problem since people incorrectly think of the EOL as an indication that there's a following line (which is why it's commonly called a "newline") and not as an indication that the current line has ended.
Using comments from other posts:
older versions of sed do not process the last line of a file if no EOL or "new line" is present.
echo can be used to add a new line
Then, to solve the problem you can re-order the commands:
( cat file.txt && echo ) | sed 's/one/1/'
I guess there is no new line character after last line. sed didn't find line separator after last line and ignore it.
Update
I suggest you to rewrite this in perl (if you have it installed):
cat file.txt | perl -pe 's/one/1/'
Instead of cat'ing the file and piping into sed, run sed with the file name as an argument after the substitution string, like so:
sed "s/one/1/" file.txt
When I did it this way, I got the "three" immediately following by the prompt:
1
two
three$
A google search shows that the man page for some versions of sed (not the GNU or BSD versions, which work as you'd expect) indicate that it won't process an incomplete line (one that's not newline-terminated) at the end of a file. The solution is to ensure your files end with a newline, install GNU sed, or use awk or perl instead.
here's an awk solution
awk '{gsub("one","1")}1' file.txt

excluding first and last lines from sed /START/,/END/

Consider the input:
=sec1=
some-line
some-other-line
foo
bar=baz
=sec2=
c=baz
If I wish to process only =sec1= I can for example comment out the section by:
sed -e '/=sec1=/,/=[a-z]*=/s:^:#:' < input
... well, almost.
This will comment the lines including "=sec1=" and "=sec2=" lines, and the result will be something like:
#=sec1=
#some-line
#some-other-line
#
#foo
#bar=baz
#
#=sec2=
c=baz
My question is: What is the easiest way to exclude the start and end lines from a /START/,/END/ range in sed?
I know that for many cases refinement of the "s:::" claws can give solution in this specific case, but I am after the generic solution here.
In "Sed - An Introduction and Tutorial" Bruce Barnett writes: "I will show you later how to restrict a command up to, but not including the line containing the specified pattern.", but I was not able to find where he actually show this.
In the "USEFUL ONE-LINE SCRIPTS FOR SED" Compiled by Eric Pement, I could find only the inclusive example:
# print section of file between two regular expressions (inclusive)
sed -n '/Iowa/,/Montana/p' # case sensitive
This should do the trick:
sed -e '/=sec1=/,/=sec2=/ { /=sec1=/b; /=sec2=/b; s/^/#/ }' < input
This matches between sec1 and sec2 inclusively and then just skips the first and last line with the b command. This leaves the desired lines between sec1 and sec2 (exclusive), and the s command adds the comment sign.
Unfortunately, you do need to repeat the regexps for matching the delimiters. As far as I know there's no better way to do this. At least you can keep the regexps clean, even though they're used twice.
This is adapted from the SED FAQ: How do I address all the lines between RE1 and RE2, excluding the lines themselves?
If you're not interested in lines outside of the range, but just want the non-inclusive variant of the Iowa/Montana example from the question (which is what brought me here), you can write the "except for the first and last matching lines" clause easily enough with a second sed:
sed -n '/PATTERN1/,/PATTERN2/p' < input | sed '1d;$d'
Personally, I find this slightly clearer (albeit slower on large files) than the equivalent
sed -n '1,/PATTERN1/d;/PATTERN2/q;p' < input
Another way would be
sed '/begin/,/end/ {
/begin/n
/end/ !p
}'
/begin/n -> skip over the line that has the "begin" pattern
/end/ !p -> print all lines that don't have the "end" pattern
Taken from Bruce Barnett's sed tutorial http://www.grymoire.com/Unix/Sed.html#toc-uh-35a
I've used:
sed '/begin/,/end/{/begin\|end/!p}'
This will search all the lines between the patterns, then print everything not containing the patterns
you could also use awk
awk '/sec1/{f=1;print;next}f && !/sec2/{ $0="#"$0}/sec2/{f=0}1' file

Resources