I have a downloaded webpage I would like to scrape using sed or awk. A colleague told me that what I'm trying to achieve isn't possible with sed, and he is probably right, seeing as he is a bit of a Linux guru.
What I am trying to achieve:
I am trying to scrape a locally stored HTML document for every value inside this label, which appears hundreds of times on the page.
For example:
<label class="css-l019s7">54321</label>
<label class="css-l019s7">55555</label>
This label class never changes, so it seems the perfect point to do scraping and get the values:
54321
55555
There are hundreds of occurrences of this data and I need to get a list of them all.
As sed probably isn't capable of this, I would be forever grateful if someone could demonstrate awk or something else.
Thank you.
Things I've tried:
sed -En 's#(^.*<label class=\"css-l019s7\">)(.*)(</label>.*$)#\2#gp' TS.html
This code above managed to extract about 40 of the numbers out of 320. There must be a little bug in this sed command for it to work partially.
Use a parser like xmllint:
xmllint --html --recover --xpath '//label[@class="css-l019s7"]/text()' TS.html
As an interest in sed was expressed (note that html can use newlines instead of spaces, for example, so this is not very robust):
sed 't a
s/<label class="css-l019s7">\([^<]*\)/\
\1\
/;D
:a
P;D' TS.html
Using awk:
awk '$1~/^label class="css-l019s7"$/ { print $2 }' RS='<' FS='>' TS.html
or:
awk '$1~/^[\t\n\r ]*label[\t\n\r ]/ &&
$1~/[\t\n\r ]class[\t\n\r ]*=[\t\n\r ]*"css-l019s7"[\t\n\r ]*([\t\n\r ]|$)/ {
print $2
}' RS='<' FS='>' TS.html
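To see why splitting records at "<" copes with several labels on one line, where a greedy .* in sed can keep at most one value per line, here is a minimal sketch. The sample markup and the function name extract_labels are made up for illustration; the real TS.html may be laid out differently.

```shell
# Hypothetical sample: two labels share the first line.
sample='<div><label class="css-l019s7">54321</label><label class="css-l019s7">55555</label></div>
<label class="css-l019s7">77777</label>'

# extract_labels is an illustrative name, not part of any tool.
extract_labels() {
  # RS="<" starts a new record at every tag; FS=">" separates the tag
  # from the text that follows it, so $1 is the tag and $2 its text.
  awk -v RS='<' -F'>' '$1 == "label class=\"css-l019s7\"" { print $2 }'
}

printf '%s\n' "$sample" | extract_labels
```

Because every tag becomes its own record, it does not matter how many labels share a line.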
someone could demonstrate AWK or something else?
This task seems to me the best fit for a CSS selector. If you are allowed to install tools, you could use Element Finder as follows:
elfinder -s "label.css-l019s7"
which will search for labels with class css-l019s7 in files in current directory.
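With grep you can get the values (this assumes every value is exactly five digits and that no other five-digit runs appear in the file):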
grep -Eo '([[:digit:]]{5})' file
54321
55555
With awk you can pin down where the values are, here in lines that begin with <label or end with /label>:
awk '/^<label|\/label>$/ {if (match($0,/[[:digit:]]{5}/)) { pattern = substr($0,RSTART,RLENGTH); print pattern}}' file
54321
55555
Using GNU awk and gensub:
awk '/label class/ && /css-l019s7/ { str=gensub("(<label class=\"css-l019s7\">)(.*)(</label>)","\\2",$0);print str}' file
Search for lines containing "label class" and "css-l019s7". The regular expression splits the line into three sections; gensub replaces the match with the second section and the result is stored in the variable str. Print str.
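Since gensub is specific to GNU awk, a rough portable sketch of the same extraction can use index and substr, which also picks up several labels on one line. The function name extract_with_index and the sample line are hypothetical.

```shell
# extract_with_index is an illustrative name; the tags are hard-coded.
extract_with_index() {
  awk '{
    line = $0
    otag = "<label class=\"css-l019s7\">"
    ctag = "</label>"
    # Walk along the line, printing the text between each tag pair.
    while ((start = index(line, otag)) > 0) {
      line = substr(line, start + length(otag))
      stop = index(line, ctag)
      if (stop == 0) break
      print substr(line, 1, stop - 1)
      line = substr(line, stop + length(ctag))
    }
  }'
}

printf '%s\n' '<label class="css-l019s7">54321</label><label class="css-l019s7">55555</label>' |
  extract_with_index
```

Like the other line-based approaches, this assumes a label never spans a line break.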
I need to insert '\N' between every two consecutive commas in a line, like below:
"abc,,,,5,,,3.2,,"
to:
"abc,\N,\N,\N,5,\N,\N,3.2,\N,"
Also, the number of consecutive commas is not fixed; it may be 6, 7 or more. I need a flexible way to handle it.
I didn't find a clear solution on Google.
You can just use the following sed command:
sed 's/,,/,\\N,/g;s/,,/,\\N,/g;'
Demo:
$ echo 'abc,,,,5,,,3.2,,' | sed 's/,,/,\\N,/g;s/,,/,\\N,/g'
abc,\N,\N,\N,5,\N,\N,3.2,\N,
Explanations:
s/,,/,\\N,/g replaces ,, with ,\N, globally on the string. Matches cannot overlap, however, so the comma that ends one match cannot also start the next one. You therefore need two passes over the pattern space to be sure that all the replacements took place, which gives the commands: s/,,/,\\N,/g;s/,,/,\\N,/g;.
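A small sketch of why one pass is not enough, using a made-up input with four consecutive commas. The function names fill_once and fill_twice are only for illustration.

```shell
# One pass: the second comma of each match is consumed, so the pair
# formed by the 2nd and 3rd comma is never seen.
fill_once()  { sed 's/,,/,\\N,/g'; }

# Two passes: the second pass fills the gaps the first one left.
fill_twice() { sed 's/,,/,\\N,/g;s/,,/,\\N,/g'; }

printf '%s\n' 'a,,,,b' | fill_once    # prints a,\N,,\N,b
printf '%s\n' 'a,,,,b' | fill_twice   # prints a,\N,\N,\N,b
```

After the first pass every remaining ,, sits between two inserted \N values, so the second pass cannot leave any behind.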
Additional notes:
To answer your doubts about this approach not being flexible, I have prepared the following input file.
$ cat input_comma.txt
abc,,,,5,,,3.2,,
,,,,,,def,
1,,,,,,1.2
6commas,,,,,,
7commas,,,,,,,
As you can see, it does not matter how many successive commas are present in the input:
$ sed 's/,,/,\\N,/g;s/,,/,\\N,/g' input_comma.txt
abc,\N,\N,\N,5,\N,\N,3.2,\N,
,\N,\N,\N,\N,\N,def,
1,\N,\N,\N,\N,\N,1.2
6commas,\N,\N,\N,\N,\N,
7commas,\N,\N,\N,\N,\N,\N,
With awk a similar approach in 2 passes can be implemented in the same way:
$ echo "test,,,mmm,,,,aa,," | awk '{gsub(/\,\,/,",\\N,");gsub(/\,\,/,",\\N,")} 1'
test,\N,\N,mmm,\N,\N,\N,aa,\N,
Could you please try the following:
awk '{gsub(/\,\,/,",\\N,");gsub(/\,\,/,",\\N,")} 1' Input_file
With perl:
perl -pe '1 while s/,,/,\\N,/g'
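The same repeat-until-nothing-changes idea as the perl one-liner can be sketched in sed with a label and the t command. This is an illustrative variant, not something the answers above proposed, and fill_all is a made-up name.

```shell
fill_all() {
  # ":a" defines a label; "ta" jumps back to it whenever the preceding
  # s command replaced something, so the loop ends on the first pass
  # that finds no ",," at all.
  sed -e ':a' -e 's/,,/,\\N,/g' -e 'ta'
}

printf '%s\n' '7commas,,,,,,,' | fill_all   # prints 7commas,\N,\N,\N,\N,\N,\N,
```

For this particular substitution the loop body runs at most twice with a replacement, so it behaves like the two-pass command while making the intent explicit.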
I have config.xml. I need to retrieve the value of the element at the XPath
/domain/server/name
I can only use grep/sed/awk. Need Help
The content of the xml is below where I need to retrieve the Server Name only.
<domain>
<server>
<name>AdminServer</name>
<port>1234</port>
</server>
<server>
<name>M1Server</name>
<port>5678</port>
</server>
<machine>
<name>machine01</name>
</machine>
<machine>
<name>machine02</name>
</machine>
</domain>
The output should be :
AdminServer
M1Server
I tried to do,
sed -ne '/<\/name>/ { s/<[^>]*>(.*)<\/name>/\1/; p }' config.xml
sed is only for simple substitutions on individual lines, doing anything else with sed is strictly for mental exercise, not for real code. That's not what you are trying to do so you shouldn't even be considering sed. Just use awk:
$ awk -F'[<>]' 'p=="server" && $2=="name"{print $3} {p=$2}' file
AdminServer
M1Server
That will work with any awk on any UNIX box. If that's not all you need then edit your question to provide more truly representative sample input and expected output.
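For readers unfamiliar with the idiom, here is the same one-liner spelled out with comments, as a sketch. It assumes, as the sample does, that each <name> sits on the line directly after its <server>; the function name server_names and the variable prev are illustrative.

```shell
config='<domain>
  <server>
    <name>AdminServer</name>
    <port>1234</port>
  </server>
  <server>
    <name>M1Server</name>
    <port>5678</port>
  </server>
  <machine>
    <name>machine01</name>
  </machine>
</domain>'

server_names() {
  # Splitting on < and > makes $2 the first tag on the line
  # and $3 the text that follows it.
  awk -F'[<>]' '
    prev == "server" && $2 == "name" { print $3 }  # name right after server
    { prev = $2 }                                  # remember the tag just seen
  '
}

printf '%s\n' "$config" | server_names
```

The machine names are skipped because the tag seen before them is machine, not server.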
Try this command, supplying your XML file as input:
awk '/<server>/,/<\/server>/' < name.xml | grep "name" | cut -d ">" -f2 | cut -d "<" -f1
Output:
AdminServer
M1Server
Based on the sample input shown, could you please try the following:
awk -F"[><]" '/<\/server>/{a="";next} /<server>/{a=1;next} a && /<name>/{print $3}' Input_file
sed -n '/<server>/{n;s/\s*<[^>]*>//gp}'
For example, for the first match:
1. /<server>/
Matches a line that contains "<server>"; the pattern space is now " <server>".
2. n
The "n" command moves on to the next line; after it runs, the pattern space holds " <name>AdminServer</name>".
3. s/\s*<[^>]*>//gp
Replaces every match of "\s*<[^>]*>" with the empty string, then prints the pattern space.
Type "info sed" for more on sed commands.
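The steps above can be tried on a cut-down copy of the sample. This sketch swaps the GNU-specific \s for the POSIX class [[:space:]] and adds a ";" before the closing brace, which some non-GNU seds require; the function name server_names_sed is made up.

```shell
config='<domain>
  <server>
    <name>AdminServer</name>
    <port>1234</port>
  </server>
  <server>
    <name>M1Server</name>
    <port>5678</port>
  </server>
</domain>'

server_names_sed() {
  # On a <server> line: fetch the next line (n), strip leading blanks
  # and every tag from it, then print what is left.
  sed -n '/<server>/{n;s/[[:space:]]*<[^>]*>//gp;}'
}

printf '%s\n' "$config" | server_names_sed
```

This only works while <name> is the first child line of <server>; if <port> came first, the port number would be printed instead.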
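You can get the desired output with just sed, as long as the substitution is restricted to the <server> blocks so that the machine names are skipped:
sed -n '/<server>/,/<\/server>/s:.*<name>\(.*\)</name>.*:\1:p' config.xml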
I feel dirty parsing XML in awk.
The following finds the correct depth of entry with the right tag name. It does not verify the path, though it depends on the elements you specified. While this works on your example data, it makes certain ugly assumptions and it's not guaranteed to work elsewhere:
awk -F'[<>]' '$2~/^(domain|server|name)$/{n++} n==3&&$2=="name"{print $3} $2~/^\/(domain|server|name)$/||$4~/^\/(domain|server|name)$/{n--}' input.xml
A better solution would be to parse the XML itself.
$ awk -F'[<>]' -v check="domain.server.name" '$2~/^[a-z]/ { path=path "." $2 } $2~/^[a-z]/ && substr(path,2)==check { print path " = " $3 } { for (i=2; i<=NF; i+=2) if ($i~/^\//) sub(/\.[^.]*$/,"",path) }' input.xml
.domain.server.name = AdminServer
.domain.server.name = M1Server
Here it is split out for easier commenting.
$ awk -F'[<>]' -v check="domain.server.name" '
# Split fields around pointy brackets. Supply a path to check.
$2~/^[a-z]/ { # If the line starts with an open tag,
path=path "." $2 # append the current tag to our path.
}
$2~/^[a-z]/ && substr(path,2)==check { # If we match the given path,
print path " = " $3 # print the result.
}
{ # For every close tag on the line (tags are the even-numbered fields),
for (i=2; i<=NF; i+=2) if ($i~/^\//) sub(/\.[^.]*$/,"",path) # truncate the path.
}
' input.xml
Note that this solution barfs horribly if you feed it badly formatted XML. The recognition of tags could be improved, but may be sufficient if you have consistently formatted XML. It may barf horribly for other reasons too. Do not do this. Install the correct tools to parse XML properly.
I have one question. Suppose I am using "=" as the field separator, and my string contains, for example:
abc=def\=jkl
If I use = as the field separator, it will be split into 3 fields:
abc def\ jkl
but as I have escaped the 2nd "=", my output should be:
abc def\=jkl
Can anyone please suggest how I can achieve this?
Thanks in advance.
I find it simplest to just convert the offending string to some other string or character that doesn't appear in your input records (I tend to use RS if it's not a regexp* since that cannot appear within a record, or the awk builtin SUBSEP otherwise since if that appears in your input you have other problems) and then process as normal other than converting back within each field when necessary, e.g.:
$ cat file
abc=def\=jkl
$ awk -F= '{
gsub(/\\=/,RS)
for (i=1; i<=NF; i++) {
gsub(RS,"\\=",$i)
print i":"$i
}
}' file
1:abc
2:def\=jkl
* The issue with using RS if it is an RE (i.e. multiple characters) is that the gsub(RS...) within the loop could match a string that didn't get resolved to a record separator initially, e.g.
$ echo "aa" | gawk -v RS='a$' '{gsub(RS,"foo",$1); print "$1=<"$1">"}'
$1=<afoo>
When the RS is a single character, e.g. the default newline, that cannot happen so it's safe to use.
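As a sketch of the SUBSEP variant mentioned above: the escaped "=" is swapped for SUBSEP before the fields are used, and each field is rebuilt afterwards. Rebuilding with split and string concatenation instead of gsub is my own choice here, to sidestep how different awks treat backslashes in replacement text; split_escaped is a made-up name.

```shell
split_escaped() {
  awk -F= '{
    gsub(/\\=/, SUBSEP)              # hide each escaped "="; $0 is split again
    for (i = 1; i <= NF; i++) {
      n = split($i, part, SUBSEP)    # cut the field at each placeholder
      field = part[1]
      for (j = 2; j <= n; j++)
        field = field "\\=" part[j]  # put the escaped "=" back
      print i ":" field
    }
  }'
}

printf '%s\n' 'abc=def\=jkl=foo\=bar=baz' | split_escaped
```

This assumes the SUBSEP character (by default \034) never occurs in the input.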
If your data looks like the example in your question, it can be done.
awk doesn't support look-around in regular expressions, so it would be difficult to get what you want just by setting FS.
If I were you, I would do some preprocessing to make the data easier for awk to handle. Or you could read the line and use other awk functions, e.g. gensub(), to remove the "=" characters you don't want treated as separators, and then split... But I guess you want to achieve this through the field-splitting mechanism, so I won't go into those solutions.
However, it can be done with the FPAT variable (GNU awk only):
awk -vFPAT='\\w*(\\\\=)?\\w*' '...' file
This will work for your example. I am not sure if it will work for your real data.
Let's take an example and split this string: "abc=def\=jkl=foo\=bar=baz"
kent$ echo "abc=def\=jkl=foo\=bar=baz"|awk -vFPAT='\\w*(\\\\=)?\\w*' '{for(i=1;i<=NF;i++)print $i}'
abc
def\=jkl
foo\=bar
baz
I think you want that result, don't you?
My awk version:
kent$ awk --version|head -1
GNU Awk 4.0.2