Scraping 5 characters off a webpage using sed or something better?

I have a downloaded webpage I would like to scrape using sed or awk. A colleague told me that what I'm trying to achieve isn't possible with sed, and he is probably right, seeing as he is a bit of a Linux guru.
What I am trying to achieve:
I am trying to scrape a locally stored HTML document for every value inside this label, which appears hundreds of times on the page.
For example:
<label class="css-l019s7">54321</label>
<label class="css-l019s7">55555</label>
This label class never changes, so it seems the perfect point to do scraping and get the values:
54321
55555
There are hundreds of occurrences of this data and I need to get a list of them all.
As sed probably isn't capable of this, I would be forever grateful if someone could demonstrate awk or something else.
Thank you.
Things I've tried:
sed -En 's#(^.*<label class=\"css-l019s7\">)(.*)(</label>.*$)#\2#gp' TS.html
This code above managed to extract about 40 of the numbers out of 320. There must be a little bug in this sed command for it to work partially.

Use a parser like xmllint:
xmllint --html --recover --xpath '//label[@class="css-l019s7"]/text()' TS.html
As an interest in sed was expressed (note that html can use newlines instead of spaces, for example, so this is not very robust):
sed 't a
s/<label class="css-l019s7">\([^<]*\)/\
\1\
/;D
:a
P;D' TS.html
Using awk:
awk '$1~/^label class="css-l019s7"$/ { print $2 }' RS='<' FS='>' TS.html
or:
awk '$1~/^[\t\n\r ]*label[\t\n\r ]/ &&
$1~/[\t\n\r ]class[\t\n\r ]*=[\t\n\r ]*"css-l019s7"[\t\n\r ]*([\t\n\r ]|$)/ {
print $2
}' RS='<' FS='>' TS.html
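The record-separator approach above can be tried on a small sample. The following is a minimal sketch, not a robust HTML parser: it assumes the opening tag is written exactly as in the question, and the function name extract_labels is made up for illustration. It shows that several labels on one line are all found, which is the case a one-match-per-line sed command misses.

```shell
# extract_labels [FILE...]: print the text inside every
# <label class="css-l019s7"> element, one value per line.
# Reads standard input when no file is given.
extract_labels() {
    # RS='<' starts a new record at every tag; FS='>' then splits each
    # record into the tag itself ($1) and the text that follows it ($2).
    awk -v RS='<' -v FS='>' '$1 == "label class=\"css-l019s7\"" { print $2 }' "$@"
}

# Two labels on the same line of input.
printf '%s\n' '<div><label class="css-l019s7">54321</label><label class="css-l019s7">55555</label></div>' |
    extract_labels
```

On the real file this would be called as: extract_labels TS.html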

someone could demonstrate AWK or something else?
This task seems to me best suited to a CSS selector. If you are allowed to install tools, you could use Element Finder like this:
elfinder -s "label.css-l019s7"
which will search for labels with class css-l019s7 in files in current directory.

With grep you can get the values:
grep -Eo '([[:digit:]]{5})' file
54321
55555
With awk you can be more specific about where the values are; here, only lines that start with <label or end with /label> are searched:
awk '/^<label|\/label>$/ {if (match($0,/[[:digit:]]{5}/)) { pattern = substr($0,RSTART,RLENGTH); print pattern}}' file
54321
55555

using GNU awk and gensub:
awk '/label class/ && /css-l019s7/ { str=gensub("(<label class=\"css-l019s7\">)(.*)(</label>)","\\2",1,$0);print str}' file
Search for lines containing "label class" and "css-l019s7". The regular expression splits the label into three groups (opening tag, value, closing tag), and gensub replaces the whole match with the second group. The result is stored in the variable str and then printed.

Related

Remove duplicate lines based on starting pattern using bash

I'm trying to remove duplicates in a list of Jira tickets that follow the following syntax:
XXXX-12345: a description
where 12345 is a pattern like [0-9]+ and the XXXX is constant. For example, the following list:
XXXX-1111: a description
XXXX-2222: another description
XXXX-1111: yet another description
should get cleaned up like this:
XXXX-1111: a description
XXXX-2222: another description
I've been trying with sed, but while what I had worked on Mac, it didn't on Linux. I think it would be easier with awk, but I'm not an expert in either of them.
I tried:
sed -r '$!N; /^XXXX-[0-9]+\n\1/!P; D' file
This simple awk should get the output:
awk '!seen[$1]++' file
XXXX-1111: a description
XXXX-2222: another description
If the digits are the only thing defining a dup, you could do:
awk -F: '{split($1,arr,/-/); if (seen[arr[2]]++) next} 1' file
If the XXXX is always the same, you can simplify to:
awk -F: '!seen[$1]++' file
Either prints:
XXXX-1111: a description
XXXX-2222: another description
This might work for you (GNU sed):
sed -nE 'G;/^([^:]*:).*\n\1/d;P;h' file
-nE turn on explicit printing and extended regexps.
G append unique lines from the hold space to the current line.
/^([^:]*:).*\n\1/d If the current line key already exists, delete it.
P otherwise, print the current line and
h store unique lines in the hold space
N.B. Your sed solution would work (not as is but with some tweaking) but only if the file(s) were sorted by the key.
sed -E 'N;/^([^:]*:).*\n\1/!P;D' file
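To make the awk idiom above concrete: seen[$1]++ evaluates to 0 (false) the first time a key appears and to a positive number afterwards, so the negation is true only for the first line carrying each key, and awk's default action prints that line. A small sketch, with the function name dedupe_by_key chosen here purely for illustration:

```shell
# dedupe_by_key [FILE...]: keep only the first line for each ticket key,
# where the key is everything before the first colon.
dedupe_by_key() {
    awk -F: '!seen[$1]++' "$@"
}

printf '%s\n' \
    'XXXX-1111: a description' \
    'XXXX-2222: another description' \
    'XXXX-1111: yet another description' |
    dedupe_by_key
```

Unlike the sorted-input sed variant, this does not need the duplicates to be adjacent, at the cost of keeping every key in memory.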

How to extract a substring of a filename using different types of delimiters in a shell script?

I'm learning shell script. Let's say abcd-2.1.1.4.jar is file name. I want to extract the version i.e. "2.1.1.4". I tried with "cut" syntax.
"abcd-2.1.1.4.jar" | cut -d'-' -f 2 returns abcd-2.1.1.4.jar as output. I can't use different types of delimiters.
Is there any other way to achieve that?
Thank you.
You'd better try sed: echo abcd-2.1.1.4.jar | sed 's/.*\-\([0-9\.]\+\)\.jar/\1/'
Hoping you're using GNU sed, otherwise it might be a little different.
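If GNU sed is not available, the same result can be had without sed at all. The sketch below swaps in POSIX shell parameter expansion, a different technique from the answer above; it assumes the version follows the last hyphen and the name ends in .jar. The function name jar_version is hypothetical.

```shell
# jar_version NAME: print the version part of a name like abcd-2.1.1.4.jar.
jar_version() {
    version=${1##*-}          # drop everything up to and including the last "-"
    version=${version%.jar}   # drop the trailing ".jar"
    printf '%s\n' "$version"
}

jar_version abcd-2.1.1.4.jar
```

This avoids both the pipeline and the dependence on a particular sed dialect.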

How to find a pattern using sed?

How can I combine multiple filters using sed?
Here's my data set
sex,city,age
male,london,32
male,manchester,32
male,oxford,64
female,oxford,23
female,london,33
male,oxford,45
I want to identify all lines which contain MALE AND OXFORD. Here's my approach:
sed -n '/male/,/oxford/p' file
Thanks
You can associate a block with the first check and put the second in there. For example:
sed -n '/male/ { /oxford/ p; }' file
Or invert the check and action:
sed '/male/!d; /oxford/!d' file
However, since (as @Jotne points out) lines that contain female also contain male and you probably don't want to match them, the patterns should at least be amended to contain word boundaries:
sed -n '/\<male\>/ { /\<oxford\>/ p; }' file
sed '/\<male\>/!d; /\<oxford\>/!d' file
But since that looks like comma-separated data and the check is probably not meant to test whether someone went to male university, it would probably be best to use a stricter check with awk:
awk -F, '$1 == "male" && $2 == "oxford"' file
This checks not only if a line contains male and oxford but also if they are in the appropriate fields. The same can be achieved, somewhat less prettily, with sed by using
sed '/^male,oxford,/!d' file
A single sed command can be used to solve this. Let's look at two variations of using sed:
$ sed -e 's/^\(male,oxford,.*\)$/\1/;t;d' file
male,oxford,64
male,oxford,45
$ sed -e 's/^male,oxford,\(.*\)$/\1/;t;d' file
64
45
Both have essentially the same regex:
^male,oxford,.*$
The interesting features are the capture group placement (either the whole line or just the age portion) and the use of ;t;d to discard non matching lines.
By doing it this way, we can avoid the requirement of using awk or grep to solve this problem.
You can use awk
awk -F, '/\<male\>/ && /\<oxford\>/' file
male,oxford,64
male,oxford,45
It uses the word anchors \< and \> (a GNU awk extension) to avoid matching female.
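For comparison, the attempt in the question, /male/,/oxford/p, is a range address: it prints from a line matching male through the next line matching oxford, rather than lines matching both. The sketch below runs it next to an anchored filter on the question's data; the function names range_attempt and by_fields are made up for illustration.

```shell
# The question's attempt: a range from a "male" line to the next "oxford" line.
range_attempt() { sed -n '/male/,/oxford/p' "$@"; }

# Anchored filter: first field is exactly male, second is exactly oxford.
by_fields() { sed -n '/^male,oxford,/p' "$@"; }

data='sex,city,age
male,london,32
male,manchester,32
male,oxford,64
female,oxford,23
female,london,33
male,oxford,45'

echo '--- range attempt ---'
printf '%s\n' "$data" | range_attempt   # prints all six data rows
echo '--- anchored filter ---'
printf '%s\n' "$data" | by_fields       # prints only the two male,oxford rows
```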

Copy a four-digit number and paste it at the end of each line with text before it

Basically, I've been trying to figure out a way to take the four-digit number from each line and paste it at the end of that line with the word pass in front. That is,
take this file:
Home1234 10.10.10.1
Home1248 10.10.10.2
Home0934 10.10.10.3
Home0047 10.10.10.4
And after should look like:
Home1234 10.10.10.1 pass1234
Home1248 10.10.10.2 pass1248
Home0934 10.10.10.3 pass0934
Home0047 10.10.10.4 pass0047
You can try with this:
awk '{a=substr($1,5,4); print $0" pass"a}' YOUR_FILE
substr($1,5,4) gets NNNN from HomeNNNN and stores in var a
print $0" pass"a print all the line plus var a
If the text is not always Home but a prefix of variable length, you can use:
awk '{a=substr($1,length($1)-3,4); print $0" pass"a}' YOUR_FILE
This awk one-liner may help you:
awk '{n=substr($1,length($1)-3);$0=$0" pass"n}1' file
Your problem is not defined very well (what do you want to do with an input like: 1234 5678 3333 10.3.5.5?), but perhaps:
sed '/^\([^ ]*\([0-9]\{4\}\).*\)/s//\1 pass\2/' input
Another sed variation using the & (paste everything matched):
sed -i.bak -e 's/^.*\([0-9]\{4\}\).*/& pass\1/' yourfile
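A quick way to check the "last four characters of the first field" approach is to run it on names of different lengths. This is a sketch under the assumption that the number always sits at the end of the first whitespace-separated field; the function name append_pass and the Office0047 sample line are invented for the example.

```shell
# append_pass [FILE...]: append " pass" plus the last four characters of
# the first field to every line.
append_pass() {
    awk '{ n = substr($1, length($1) - 3); print $0 " pass" n }' "$@"
}

printf '%s\n' 'Home1234 10.10.10.1' 'Office0047 10.10.10.4' | append_pass
```

Because it writes to standard output, the original file is left untouched until the result has been inspected.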

How do you split a file based on a token?

Let's say you have a file containing texts (from 1 to N) separated by a $.
How can I split the file so the end result is N files?
text1 with newlines $
text2 $etc... $
textN
I'm thinking something with awk or sed but is there any available unix app that already perform that kind of task?
awk 'BEGIN{RS="$"; ORS=""} { textNumber++; print $0 > ("text" textNumber ".out") }' fileName
Thanks to Bill Karwin for the idea.
Edit: Added ORS="" to avoid printing a newline at the end of each file.
Maybe split -p pattern?
Hmm. That may not be exactly what you want. It doesn't split a line, it only starts a new file when it sees the pattern. And it seems to be supported only on BSD-related systems.
You could use something like:
awk 'BEGIN {RS = "$"} { ... }'
edit: You might find some inspiration for the { ... } part here:
http://www.gnu.org/manual/gawk/html_node/Split-Program.html
edit: Thanks to comment from dmckee, but csplit also seems to copy the whole line on which the pattern occurs.
If I'm reading this right, the UNIX cut command can be used for this.
cut -d $ -f 1- filename
I might have the syntax slightly off, but that should tell cut that you're using $ separated fields and to return fields 1 through the end.
You may need to escape the $.
awk -vRS="$" '{ print $0 > ("text" t++ ".out") }' ORS="" file
Using the split command we can split on strings,
but the csplit command will also let you split files based on regular expressions.
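The awk answers above can be wrapped into a small function for testing. This is only a sketch: the name split_on_dollar, the output-directory argument, and the close() call (added so that a large N does not run into the limit on open files) are assumptions of this example rather than part of the original answers.

```shell
# split_on_dollar INPUT OUTDIR: write each $-separated chunk of INPUT to
# OUTDIR/text1.out, OUTDIR/text2.out, and so on. OUTDIR must already exist.
split_on_dollar() {
    awk -v dir="$2" 'BEGIN { RS = "$"; ORS = "" }
        {
            out = dir "/text" NR ".out"   # one output file per record
            print $0 > out
            close(out)                    # release the file handle right away
        }' "$1"
}

# Demonstration in a scratch directory.
workdir=$(mktemp -d)
printf 'text1 with newlines $\ntext2 $etc... $\ntextN\n' > "$workdir/input.txt"
split_on_dollar "$workdir/input.txt" "$workdir"
ls "$workdir"
rm -rf "$workdir"
```

Newlines inside a chunk are kept as they are, since only $ is treated as a record separator.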
