Julia: website scraping?

I have been trying for days to make progress with this little piece of code for getting the headers and the links of the news from a journal website.
using HTTP
function website_parser(website_url::AbstractString)
    r = HTTP.get(website_url)       # fetch the page
    body = String(r.body)           # response body as a String
    return split(body, "\n")        # one String per line of HTML
end
website_parser("https://www.nature.com/news/newsandviews")
The problem is that I could not figure out how to proceed once I have the text from the website. How can I retrieve specific elements (the header and link of each news item, in this case)?
Any help is very much appreciated, thank you

You need some kind of HTML parsing. For extracting only the headers, you can probably get away with regular expressions, which are built into Julia.
If it gets more complicated than that, regular expressions don't generalize well, and you should use a full-fledged HTML parser. Gumbo.jl seems to be the state of the art in Julia and has a rather simple interface.
In the latter case it's unnecessary to split the document; in the former, splitting at least makes things more complicated, since you then have to think about line breaks. So: better to parse first, then split.
Specific elements can be extracted using the Cascadia.jl library (git repo).
For instance, elements with a given class attribute can be selected via qs = eachmatch(Selector(".classID"), h.root) (where h is the document parsed with Gumbo), so that all elements of that class, such as <div class="classID">, end up in the returned collection of matches (qs).

Related

How do you test a function that just retrieves a template output?

I have a template class that grabs HTML and basically returns HTML to the caller. How do I test the caller using PHPUnit? Do I just assertTrue(is_string(call_function()))? It seems like a stupid test, and I thought I might be testing it improperly.
Is the returned HTML supposed to be well-formed? If so, you could validate it.
And/or, if there is always supposed to be a certain node or string of text present, you could check for its existence, using strpos, regexes, or a proper DOM parser.
This StackOverflow question gives you some ideas for ways to parse and query your HTML: How do you parse and process HTML/XML in PHP?
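For instance, a minimal presence check along those lines (the node name here is just an illustration):
$html = call_function();
$this->assertTrue(strpos($html, '<div id="results">') !== false);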
More generally, the way I usually approach testing a function that returns a string is to use:
$html=call_function();
$this->assertEquals("dummy",$html);
Then it fails, but tells me the correct output, so I paste that in:
$html=call_function();
$expected=<<<EOD
<html>
...
</html>
EOD;
$this->assertEquals($expected,$html);
If it fails again, I then study the differences between the two correct answers I have. If this is a good unit test, should they really even be different? Do I want to use a mock object to replace some uncontrollable aspect of the system? (E.g. if the HTML it is returning is Google search results, then maybe I want a mock object to simulate calling Google, but always return exactly the same search results page.)
If the only differences are timestamps, I might use regexes to hunt-and-destroy them, to give me a string that should always be the same, e.g.
$html=preg_replace('/\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}/','[TIMESTAMP]',$html);
ADDITION
If the HTML string is very big, one alternative is to use md5() to reduce it to a short string. This will still warn you when something breaks, but the (big) downside is that when it breaks you won't know where. If you are concerned about that, it is better to use the DOM approach (or its poor cousin, regexes) to cherry-pick just a few key parts of the HTML to test.
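A sketch of that cherry-picking approach with PHP's DOM (the XPath expressions and the expected title are placeholders, not real values from your template):
$html = call_function();
$dom = new DOMDocument();
@$dom->loadHTML($html);                      // @ silences warnings about sloppy markup
$xpath = new DOMXPath($dom);
// Check a few key parts instead of the whole string:
$this->assertEquals(1, $xpath->query('//h1')->length);
$this->assertEquals('Expected Title', trim($xpath->query('//title')->item(0)->textContent));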

t() function doesn't add the string to the translation interface

I use a custom field with PHP code inside one of my Views to translate a string, since Views 2.x is bad at localization. I use the following PHP code:
echo t('Watch Video');
But the string does not appear in the "Translate interface" section.
Thanks for your help.
Lukas
The accepted answer is wrong, as the localization script is not scanning anything. The string is registered in the translate interface as soon as it gets passed through the t() function for the first time in a non-default language.
Therefore, for translation it doesn't matter if the code you are writing is eval'd (interpreted from the database) or exists in the source. Obviously good practice would be to keep code in files where it belongs.
This blog post describes what needs to be done to get your strings into the translate interface.
The localisation database is built by scanning the source code, looking for instances of the t() function (and Drupal.t() in Javascript).
If the code in question has been entered into a text box in the Drupal admin area, then it isn't in the source code, so it won't be picked up by the localisation process.
For this reason (and others), you should put as little code as possible into the admin text boxes. There is usually an alternative way to achieve the same thing, but even if there isn't, you should reduce the code to a minimum -- best practice would be to have nothing there except a single line function call: have it call a function, and write the function code in your module or theme. That way it will be parsed when you run the localisation.
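For example, a rough sketch of that split (the module and function names are made up for illustration):
// In the admin text box, nothing but a one-line call:
print mymodule_watch_video_label();

// In mymodule.module (or your theme's template.php):
function mymodule_watch_video_label() {
  return t('Watch Video');
}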

Scraping a web page & Formatting it

I need some pointers on how to go about solving this problem:
I have more than 10K simple HTML web pages which all have the same format. When I say "same format", I mean that they all have the same h1 tag at the beginning, but with varying text, followed by a table, and then followed by a link, etc. So the basic HTML skeleton of the 10K+ pages is the same; only the text keeps varying.
I have a way to iterate through all those 10K pages. However, I do not know how I can copy specific text from each page into an XLS/CSV file column-wise. Once I can achieve this, I will import the spreadsheet into MySQL and do further processing.
I know PHP to a certain extent. So, this is what I can think of:
$html = file_get_contents("http://www.SomeWebsite.com/");
I can then use some regex to extract the data I need. However, I do not know how to handle redirects.
This is what I can think of, but is there anything better? Maybe an existing tool or a better scripting language?
You may use HTQL to extract the HTML content. It has Python and COM interfaces; see http://htql.net/.
To extract the <h1> tag, simply use "<h1>" as the query.
You could do this with PHP, though I recommend XPath instead of regular expressions.
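A rough sketch of the XPath route in PHP (the XPath expressions are assumptions about your page layout; note that file_get_contents() follows HTTP redirects by default):
$html = file_get_contents('http://www.SomeWebsite.com/');
$dom = new DOMDocument();
@$dom->loadHTML($html);                                   // tolerate sloppy markup
$xpath = new DOMXPath($dom);
$heading = $xpath->query('//h1')->item(0)->textContent;   // the varying <h1> text
$link = $xpath->query('//a/@href')->item(0)->nodeValue;   // first link on the page
// Append one CSV row per page; the file can later be imported into MySQL.
$fh = fopen('pages.csv', 'a');
fputcsv($fh, array($heading, $link));
fclose($fh);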
Personally I use Python with lxml and this webscraping library.

Aggregating from various sources

It could be a project well beyond my skills right now, but I've got around one full month to spend on it, so I think I can do it. What I want to build is this: gather news about a specific subject from various sources. Easy, right? Just get the RSS feeds and display them on a page. Well, I want something more advanced: duplicates removed and a customized presentation (that is, being able to define/change the format in which the news headlines are displayed).
I've played a bit with Yahoo Pipes and some other tools and I am facing two big problems:
Some sources don't provide rss feeds. How do I create one?
What's the best method to find and remove duplicates? I thought about comparing the headlines and checking whether there is a match bigger than, say, 50%. Is that good practice, though?
Please add any other things (problems, suggestions, whatever) I might not have considered.
Duplication is a nasty issue. What I eventually ended up doing:
1. Strip out all HTML tags except for links (I started with regexes but got burned, and eventually moved to custom parsing to remove the tags)
2. Strip out all whitespace
3. Case-desensitize
4. Hash all that with MD5.
Here's why you leave the link in:
A comment might be as simple as "Yes, this sucks". "Yes, this sucks" could be a common comment. BUT if the text "this sucks" is linked to different things, then it is not a duplicate comment.
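A minimal PHP sketch of those four steps, using strip_tags() with <a> in the allow list as a stand-in for the custom tag stripping described above (the example URLs are made up):
function comment_fingerprint($html) {
    $text = strip_tags($html, '<a>');              // 1. drop every tag except links
    $text = preg_replace('/\s+/', '', $text);      // 2. strip out all whitespace
    $text = strtolower($text);                     // 3. case-desensitize
    return md5($text);                             // 4. hash the result
}
// Same words, different link targets => different fingerprints:
// comment_fingerprint('Yes, <a href="http://a.example">this sucks</a>');
// comment_fingerprint('Yes, <a href="http://b.example">this sucks</a>');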
Additionally, you will find that HTML escaping is weird in RSS feeds. You would think that a stray < would be double-encoded as (I think) &amp;lt;, but it is not: it is encoded as &lt;. But so are actual HTML tags, e.g. &lt;p&gt;.
I eventually copied all the known HTML tags as parsed by Mozilla Firefox and manually recognized those tags.
Creating an RSS feed from HTML is quite nasty and I can only point you to services such as Spinn3r, which are fantastic at de-duplication and content extraction. These services typically use probability-based algorithms that are above me. I know of one provider that got away with regexing pages (They had to know that a certain page was MySpace-based or Blogger-based) but they did not perform admirably.
You might want to try to use the YQL module to scrape a webpage that doesn't provide RSS. Here's a sample of a YQL statement to scrape HTML.
About duplicates, take a look at this pipe.
Customized presentation: if you want it truly customized, you'll have to manipulate the pipe results yourself, e.g. get them as JSON and manipulate them with JavaScript, or process them server-side.

Interpreting Search Results

I am tasked with writing a program that, given a search term and the HTML source of a page representing search results from some unknown search engine (it can really be anything: a blog, a shop, Google, eBay, ...), needs to build a data structure of the results containing "what's in the results": a title for each result, the "details" link, the position within the results, etc. It is not known whether the results page contains any of this data at all, or whether there are any search results. The goal is to feed the data structure into another program that extracts meaning.
What I am looking for is not BeautifulSoup or a RegExp but rather some clever ideas or algorithms on how to interpret the HTML source. What do I do to find out which part of the page constitutes a single result item? How do I filter out the markup noise to extract the important bits? What would you do? Pointers to fields of research covering what I am trying to do are also greatly appreciated.
Thanks, Simon
I doubt that there exists a silver-bullet algorithm that, without any training, will just work on any arbitrary search result output.
However, this task can be solved, and is actually solved in many applications, but with a different approach. First you have to define the general structure of a single search result item based on what you actually want to do with it (it could be a name, date, link, description snippet, etc.), and then write a number of HTML parsers that extract the necessary fields from the search result output of particular web sites.
I know it is not a super sexy solution, but it is probably the only one that works, and it is not rocket science. Writing parsers is actually extremely simple; you can write a dozen per day. If you look into the HTML source of a search results page, you will notice that the results are typically very structured and marked up with specific div sections or class attributes, so it is very easy to find them in the document. You don't even have to use a complicated HTML parsing library for that; something grep-like will be enough.
For example, on this particular page your question starts with <div class="post-text"> and ends with </div>. Everything in between is the post text, with some HTML formatting that you may want to remove along with extra spaces and "\n". And this <div class="post-text"> appears only once on the page.
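To make the grep-like idea concrete, a sketch in PHP (the marker strings are specific to this example page, and the search for the closing tag is deliberately naive):
function extract_post_text($page_html) {
    $start_marker = '<div class="post-text">';
    $start = strpos($page_html, $start_marker);
    if ($start === false) {
        return null;                                  // marker not found on this page
    }
    $start += strlen($start_marker);
    $end = strpos($page_html, '</div>', $start);      // naive: assumes no nested <div>
    return trim(substr($page_html, $start, $end - $start));
}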
Once you go to large scale with your retrieval application, you will find out that there is not that big a variety of search engines across different sites, and you will be able to reuse already-created parsers for sites that use similar search engines.
The only thing you have to remember is built-in self-testing. Sites tend to upgrade and change their design from time to time. If your application is going to live for some time, you will need to include in your parsers some logic that checks the validity of their results and notifies you whenever the search output has changed and is no longer compatible with your parser. Then you will have to modify the particular parser or write a new one.
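One simple way to build that in (the field names and exception messages are illustrative; each per-site parser is passed in as a callback):
function checked_parse($page_html, $site_parser) {
    $results = call_user_func($site_parser, $page_html);
    // Sanity checks: complain loudly instead of silently returning garbage.
    if (empty($results)) {
        throw new Exception('No results parsed -- the site layout may have changed.');
    }
    foreach ($results as $r) {
        if (empty($r['title']) || empty($r['link'])) {
            throw new Exception('Result is missing title/link -- the parser needs updating.');
        }
    }
    return $results;
}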
Hope this helps.
