Scraping a web page & formatting it

I need some pointers on how to go about solving this problem:
I have more than 10K simple HTML web pages which all have the same format. When I say "same format", I mean that they all have the same h1 tag at the beginning (but with varying text), followed by a table, then a link, and so on. So the basic HTML skeleton of the 10K+ pages is the same; only the text keeps varying.
I have a way to iterate through all those 10K pages. However, I do not know how to copy specific text from each page into an XLS/CSV file, column-wise. Once I can achieve this, I will import the spreadsheet into MySQL and do further processing.
I know PHP to a certain extent. So, this is what I can think of:
$html = file_get_contents("http://www.SomeWebsite.com/");
I can then use some regex to extract the data I need. However, I do not know how to handle redirects.
This is all I can think of, but is there anything better? Maybe an existing tool or a better-suited scripting language?

You may use HTQL to extract the HTML content. It has Python and COM interfaces; see http://htql.net/
To extract the <h1> tag, simply use "<h1>" as the query.

You could do this with PHP, though I recommend XPath instead of regular expressions.
Personally I use Python with lxml and this webscraping library.
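For instance, here is a minimal sketch of the lxml/XPath route. The URL list, the exact XPath expressions, and the use of requests for fetching are assumptions you would adapt to your pages; requests follows redirects by default, which covers the redirect concern from the question.

import csv
import requests                      # assumed HTTP client; follows redirects by default
from lxml import html

urls = ["http://www.SomeWebsite.com/page1.html"]   # placeholder: your 10K+ page URLs

with open("pages.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "heading", "first_cell", "link"])
    for url in urls:
        tree = html.fromstring(requests.get(url).content)
        heading = tree.xpath("string((//h1)[1])").strip()            # the varying <h1> text
        first_cell = tree.xpath("string((//table)[1]//td[1])").strip()
        links = tree.xpath("(//table)[1]/following::a[1]/@href")     # first link after the table
        writer.writerow([url, heading, first_cell, links[0] if links else ""])

The resulting CSV can then be imported into MySQL as planned.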


Julia: website scraping?

I have been trying for days to move forward with this little piece of code for getting the headlines and links of the news items from a journal website.
using HTTP

function website_parser(website_url::AbstractString)
    # Fetch the page and read the response body into a String
    r = String(HTTP.get(website_url).body)
    # Split the raw HTML into lines (the last expression is what the function returns)
    splitted = split(r, "\n")
end

website_parser("https://www.nature.com/news/newsandviews")
The problem is that I could not figure out how to proceed once I have the text from the website. How can I retrieve specific elements (in this case, the headline and link of each news item)?
Any help is very much appreciated, thank you
You need some kind of HTML parsing. For extracting only the headers, you can probably get away with regular expressions, which are built into Julia.
If it gets more complicated than that, regular expressions don't generalize, and you should use a full-fledged HTML parser. Gumbo.jl seems to be state of the art in Julia and has a rather simple interface.
In the latter case, it's unnecessary to split the document; in the former, it at least makes things more complicated, since you then have to think about line breaks. So it is better to parse first, then split.
Specific elements can be extracted using the Cascadia library (see its Git repo).
For instance, elements with a given class attribute can be extracted via qs = eachmatch(Selector(".classID"), h.root), so that all elements such as <div class="classID"> get selected into the returned query result (qs).

Easiest way to scrape webpages to save to .csv

There is a page I want to scrape, you can pass it variables in the URL and it generates specific content. All the content is in a giant HTML table.
I am looking for a way to write a script that can go through 180 of these different pages, extract specific information from certain columns in the table, do some math, and then write them to a .csv file. That way I can do further analysis myself on the data.
What is the easiest way to scrape webpages, parse HTML and then store the data to a .csv file?
I have done similar things in Python and PHP; parsing HTML is not the easiest or cleanest thing to do. Are there other routes that are easier?
If you have some experience with Python, I would recommend something like BeautifulSoup; in PHP you can use phpQuery.
Once you know how to use the HTML parser, you can create a "pipes-and-filters" program to do the math and dump the results to a CSV file.
Have a look at this question for more info on a Python solution.
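For example, here is a rough BeautifulSoup sketch of that pipeline, with a hypothetical parameterized URL and column positions that you would adjust to the real table layout:

import csv
import requests                       # assumed for fetching the pages
from bs4 import BeautifulSoup

BASE_URL = "http://example.com/report?page={}"    # hypothetical URL with a variable

with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["page", "name", "total"])
    for page in range(1, 181):                    # the 180 pages mentioned above
        soup = BeautifulSoup(requests.get(BASE_URL.format(page)).text, "html.parser")
        for row in soup.select("table tr")[1:]:   # skip the header row
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            if len(cells) < 3:
                continue
            total = float(cells[1]) + float(cells[2])   # "do some math" on two columns
            writer.writerow([page, cells[0], total])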

Use an Excel template and update it programmatically

I found this tool, but I wonder whether it is still the right way to go nowadays with .NET 4.0, or whether there are any straightforward out-of-the-box alternatives.
I just need to add columns and update Excel content programmatically. There are many ways to do this, but I need to keep the original document as a template. The link above explains exactly what the requirements are and why they created the "ExcelPackage" library.
A quick look at the link you provided suggests that it will in fact keep the original template intact and just return a populated version of that template. This is a pretty common way to create and populate Excel documents using Open XML, since it helps minimize the amount of code you have to write. If you did not specify the layout, styles, formats, etc. in a template, you would be forced to define those in code, which could lead to some bloated code. Overall, a project like this, or using the Open XML SDK 2.0 to create the documents, is the way to go.
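As a small illustration of the same template-based idea (sketched here in Python with openpyxl rather than the ExcelPackage library discussed above; the file names and values are placeholders):

from openpyxl import load_workbook

# Open the original workbook as a template; it is never overwritten
wb = load_workbook("template.xlsx")
ws = wb.active

# Add a new column header and populate it programmatically
new_col = ws.max_column + 1
ws.cell(row=1, column=new_col, value="Generated total")
for row in range(2, ws.max_row + 1):
    ws.cell(row=row, column=new_col, value=row * 10)   # placeholder values

# Save under a new name so the template itself stays intact
wb.save("populated.xlsx")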

Aggregating from various sources

It may be a project well beyond my skills right now, but I've got around one full month to spend on it, so I think I can do it. What I want to build is this: gather news about a specific subject from various sources. Easy, right? Just get the RSS feeds and display them on a page. Well, I want something more advanced: duplicates removed and a customized presentation (that is, being able to define/change the format in which the news headlines are displayed).
I've played a bit with Yahoo Pipes and some other tools and I am facing two big problems:
Some sources don't provide RSS feeds. How do I create one?
What's the best method to find and remove duplicates? I thought about comparing the headlines and checking whether there is a match greater than, say, 50%. Is that good practice, though?
Please add any other things (problems, suggestions, whatever) I might not have considered.
Duplication is a nasty issue. What I eventually ended up doing:
1. Strip out all HTML tags except for links (Although I started using regex, I was burned. I eventually moved to custom parsing to remove tags)
2. Strip out all whitespace
3. Case-desensitize
4. Hash all that with MD5.
Here's why you leave the link in:
A comment might be as simple as "Yes, this sucks". "Yes, this sucks" could be a common comment. BUT if the text "this sucks" is linked to different things, then it is not a duplicate comment (see the sketch below).
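A minimal Python sketch of those steps (strip tags except links, strip whitespace, lowercase, MD5-hash); the tag-stripping regex is only illustrative, since, as noted above, regex-based stripping eventually got replaced by custom parsing:

import hashlib
import re

def fingerprint(comment_html):
    # Keep <a>/</a> tags (links distinguish otherwise identical text), drop the rest
    text = re.sub(r"<(?!/?a\b)[^>]*>", "", comment_html)
    # Strip out all whitespace
    text = re.sub(r"\s+", "", text)
    # Case-desensitize
    text = text.lower()
    # Hash the normalized text
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Same text, different link targets -> different fingerprints, so not duplicates
print(fingerprint('Yes, <a href="http://a.example">this sucks</a>') !=
      fingerprint('Yes, <a href="http://b.example">this sucks</a>'))   # True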
Additionally, you will find that HTML escaping is weird in RSS feeds. You would think that a stray < in the text would be double-encoded as &amp;lt;, but it is not: it arrives encoded simply as &lt;. But so too do actual HTML tags, e.g. a <p> tag arrives as &lt;p&gt;.
I eventually copied all the known HTML tags as parsed by Mozilla Firefox and manually recognized those tags.
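To see that single encoding in practice, one pass of unescaping in Python already turns the feed text back into markup, which is why a literal < and a real tag become indistinguishable (a small illustration, not part of the original workflow):

import html

raw = "Yes, &lt;p&gt;this sucks&lt;/p&gt; and 1 &lt; 2"
print(html.unescape(raw))   # Yes, <p>this sucks</p> and 1 < 2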
Creating an RSS feed from HTML is quite nasty and I can only point you to services such as Spinn3r, which are fantastic at de-duplication and content extraction. These services typically use probability-based algorithms that are above me. I know of one provider that got away with regexing pages (They had to know that a certain page was MySpace-based or Blogger-based) but they did not perform admirably.
You might want to try using the YQL module to scrape a webpage that doesn't provide RSS. Here's a sample of a YQL statement to scrape HTML.
About duplicates, take a look at this pipe.
Customized presentation: if you want it truly customized, you'll have to manipulate the pipe results yourself, e.g. get them as JSON and manipulate them with JavaScript, or process them server-side.

Interpreting Search Results

I am tasked with writing a program that, given a search term and the HTML source of a page representing search results from some unknown search engine (it can really be anything: a blog, a shop, Google, eBay, ...), needs to build a data structure of the results containing "what's in the results": a title for each result, the "details" link, the position within the results, etc. It is not known whether the results page contains any of the data at all, or whether there are any search results. The goal is to feed the data structure into another program that extracts meaning.
What I am looking for is not BeautifulSoup or a RegExp but rather some clever ideas or algorithms on how to interpret the HTML source. What do I do to find out what part of the page constitutes a single result item? How do I filter the markup noise to extract the important bits? What would you do? Pointers to fields of research covering what I am trying to do are also greatly appreciated.
Thanks, Simon
I doubt that there exists a silver-bullet algorithm that, without any training, will just work on any arbitrary search-result output.
However, this task can be solved, and is actually solved in many applications, just with a different approach. First you have to define the general structure of a single search result item based on what you are actually going to do with it (it could be name, date, link, description snippet, etc.), and then write a number of HTML parsers that extract the necessary fields from the search result output of particular web sites.
I know it is not a super sexy solution, but it is probably the only one that works. And it is not rocket science. Writing parsers is actually extremely simple; you can make a dozen per day. If you look into the HTML source of search results, you will notice that the output is typically very structured and marked with specific div sections or class attributes, so it is very easy to find in the document. You don't even have to use a complicated HTML parsing library for that; something grep-like will be enough.
For example, on this particular page your question starts with <div class="post-text"> and ends with </div>. Everything in between is the post text, with some HTML formatting that you may want to remove along with extra spaces and "\n". And this <div class="post-text"> appears on the page only once.
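As a sketch of such a per-site parser (written here in Python with BeautifulSoup rather than a grep-like approach; the class names are hypothetical and would be adjusted for each target site):

from dataclasses import dataclass
from bs4 import BeautifulSoup

@dataclass
class SearchResult:
    position: int
    title: str
    link: str

def parse_example_site(html_source):
    # Per-site parser: "div.result-item" and "a.result-title" are made-up selectors
    soup = BeautifulSoup(html_source, "html.parser")
    results = []
    for position, item in enumerate(soup.select("div.result-item"), start=1):
        title_tag = item.select_one("a.result-title")
        if title_tag is None:
            continue   # markup changed; a self-test (see below) should flag this
        results.append(SearchResult(position=position,
                                    title=title_tag.get_text(strip=True),
                                    link=title_tag.get("href", "")))
    return results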
Once your retrieval application goes to a large scale, you will find that there is not that big a variety of search engines across different sites, and you will be able to re-use already-created parsers for sites using similar search engines.
The only thing you have to remember is built-in self-testing. Sites tend to get upgraded and redesigned from time to time. If your application is going to live for a while, you will need to include in your parsers some logic that checks the validity of their results and notifies you every time the search output has changed and is no longer compatible with your parser. Then you will have to modify the particular parser or write a new one.
Hope this helps.
