Web-crawler for unstructured data - web-scraping

Are there any web-crawlers adapted for parsing many unstructured websites (news, articles) and extracting a main block of content from them without previously defined rules?
I mean when I'm parsing a news feed, I want to extract the main content block from each article to do some NLP stuff. I have a lot of websites and it will take forever to look into their DOM model and write rules for each of them.
I was trying to use Scrapy to get all of the text placed in the body, without tags and scripts, but it includes a lot of irrelevant stuff, like menu items, ad blocks, etc.
site_body = selector.xpath('//body').extract_first()
But doing NLP over that kind of content will not be very precise.
So are there any other tools or approaches for doing such tasks?
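For reference, the attempt described above, sketched as a minimal Scrapy spider (the spider name and URL are placeholders, not from the original post):

import scrapy

class ArticleSpider(scrapy.Spider):
    # illustrative spider mirroring the approach described above
    name = 'articles'
    start_urls = ['http://example.com/news']  # placeholder feed URL

    def parse(self, response):
        # grab the raw markup of <body>; this still drags in menus, ads, footers, etc.
        site_body = response.xpath('//body').extract_first()
        yield {'body': site_body}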

I tried to solve this with pattern matching: you annotate the source of the webpage itself and use it as the sample that gets matched, so you do not need to write special rules.
For example, if you look in the source of this page, you see:
<td class="postcell">
<div>
<div class="post-text" itemprop="text">
<p>Are there any web-crawlers adapted for parsing many unstructured websites (news, articles) and extracting a main block of content from them without previously defined rules?</p>
Then you remove your text and add {.} to mark the place as relevant and get:
<td class="postcell">
<div>
<div class="post-text" itemprop="text">
{.}
(usually you need closing tags as well, but for a single element it is not necessary)
Then you pass that as a pattern to Xidel (SO seems to block the default user agent, so it needs to be changed):
xidel 'http://stackoverflow.com/questions/36066030/web-crawler-for-unstructured-data' --user-agent "Mozilla/5.0 (compatible; Xidel)" -e '<td class="postcell"><div><div class="post-text" itemprop="text">{.}'
and it outputs your text
Are there any web-crawlers adapted for parsing many unstructured websites (news, articles) and extracting a main block of content from them without previously defined rules?
I mean when I'm parsing a news feed, I want to extract the main content block from each article to do some NLP stuff. I have a lot of websites and it will take forever to look into their DOM model and write rules for each of them.
I was trying to use Scrapy to get all of the text placed in the body, without tags and scripts, but it includes a lot of irrelevant stuff, like menu items, ad blocks, etc.
site_body = selector.xpath('//body').extract_first()
But doing NLP over that kind of content will not be very precise.
So are there any other tools or approaches for doing such tasks?

You can use Beautiful Soup inside your parse() callback and call get_text():
from bs4 import BeautifulSoup

# inside your spider's parse(self, response) callback:
soup = BeautifulSoup(response.body, 'html.parser')
yield {'body': soup.get_text()}
You can also manually remove stuff you don't want (though you might keep some markup: e.g. <h1>'s or <b>'s can be useful signals):
# Remove tags that never contain visible text
for i in soup.findAll(lambda tag: tag.name in ['script', 'link', 'meta']):
    i.extract()
You could do a similar thing to whitelist a few tags.
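A rough sketch of that whitelist idea (the helper name and the tag list are illustrative, not part of the answer above):

from bs4 import BeautifulSoup

ALLOWED = {'h1', 'h2', 'p', 'b'}  # hypothetical whitelist of tags worth keeping

def keep_whitelisted(html):
    soup = BeautifulSoup(html, 'html.parser')
    # drop tags that never carry visible text
    for tag in soup.find_all(['script', 'style', 'link', 'meta']):
        tag.extract()
    # unwrap everything else that is not whitelisted (the text survives, the tag does not)
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED:
            tag.unwrap()
    return str(soup)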

Related

When using apoc.load.html, is it possible to return the full HTML rather than only text?

Let's say I want to scrape the Neo4j RefCard found at: https://neo4j.com/docs/cypher-refcard/current/
I would like to fetch a 'code' example along with its styling; my target is one of the code listings on that page, which has CSS treatment (font, color, ...).
So in Neo4j I call the apoc.load.html procedure, and it has no problem finding the content.
It returns a map with three keys: tagName, attributes, and text.
The text is the issue for me. It's stripped of all styling. I would like for it to let me know more about the styling of the different parts of this text.
The actual HTML in the webpage contains all of these span class tags: cm-string, cm-node, cm-atom, etc. Note that this was not generated by Neo4j's apoc.load.html procedure; it came straight from my Chrome browser's inspect console.
I don't need the actual fonts and colors, just the tag names.
I can see in the documentation that there is an optional config map you can supply, but there's no explanation of what can be configured there. It would be lovely if I could configure it to return, say, HTML rather than text.
The library that Neo4j uses for CSS selection here is jsoup.
So I am hoping either not to strip the <span> tags, or otherwise to extract their class names for each segment of text.
Could you not generate the HTML yourself from the properties in your object? It looks like they are all span tags with 3 different classes, depending on whether you're using the property name, property value, or property delimiter.
That is probably how they are generating the HTML themselves.
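To illustrate that comment (not Cypher, and the exact shape of the returned maps is an assumption based on the tagName/attributes/text keys mentioned in the question), rebuilding the markup yourself could look roughly like this:

# each entry is assumed to look like:
#   {"tagName": "span", "attributes": {"class": "cm-string"}, "text": "'Alice'"}
def rebuild_markup(tags):
    parts = []
    for t in tags:
        cls = (t.get("attributes") or {}).get("class", "")
        parts.append('<{0} class="{1}">{2}</{0}>'.format(t["tagName"], cls, t["text"]))
    return ''.join(parts)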
Okay, two years later I revisited this question I posted, and did find a solution. I'll keep it short.
The APOC procedure CALL apoc.load.html is using the scraping library Jsoup, which is not a full-fledged browser. When it visits a page it reads the html sent by the server but ignores any javascript. As a result, if a page uses javascript for inserting content or even just formatting the content, then Jsoup will miss the html that the javascript would have generated had it run.
So I have just tried out the service at prerender.com. It's simple to use: it takes your URL as an argument, fetches that page itself, executes the page's javascript as it does so, and returns the final result as static HTML.
So if I call apoc.load.html through prerender.com, the Jsoup library simply asks it for the html and this time gets the fully rendered html. :)
You can try the following two queries and see the difference pre-rendering makes. The span tags in this page are rendered only by javascript, so if we call it asking for its span tags without pre-rendering, we get nothing back.
CALL apoc.load.html("https://neo4j.com/docs/cypher-refcard/current/", {target:".listingblock pre:contains(age: 38) span"}) YIELD value
UNWIND value.target AS spantags
RETURN spantags
...but if we call it via the pre-rendering service, we get back a bunch of span tags and their content.
CALL apoc.load.html("https://service.prerender.cloud/https://neo4j.com/docs/cypher-refcard/current/", {target:".listingblock pre:contains(age: 38) span"}) YIELD value
UNWIND value.target AS spantags
RETURN spantags

How to elegantly modify html to inject html element after x-th paragraph on the server side?

I need to modify html coming from an external file (server side) before I render it, and inject a quote 'component' into it.
The component needs to be injected after the 2nd paragraph, and I'm planning to use HtmlAgilityPack. Any examples? Is the HtmlNode.InsertAfter() method a good choice once I've found the right paragraph, which should be trivial?
Another question: would it be possible to inject a Sitecore placeholder, or even a user control that will render my quote, instead of pure html? I feel it should be, but I'm not sure what a good approach would be.
Thanks
I can suggest two possible approaches here:
1) Use snippets with some customisation. Snippets allow users to insert pre-defined chunks of HTML into a RTE field. You could have a pre-defined piece of HTML which might have some identifier to indicate it should use custom processing (I would suggest some data-xxx style attribute which would not conflict with any CSS or JavaScript). Then you could create a new renderField pipeline processor which would detect the data-xxx attribute within the content of a rich text field - you would use HtmlAgilityPack for this and then replace that snippet with the contents of your server-side file.
-or-
2) Split your text content into two separate chunks and have two instances of a "HtmlText" rendering within the placeholder, with a rendering for your quote text between them in the same placeholder.
I would advise that having a rule to insert text after the second paragraph would be quite 'brittle', as it would be very reliant on content editors setting the rich text field contents in quite a precise way, e.g. always ensuring two or more paragraphs and always breaking text with paragraphs - they might decide to use a load of line breaks instead to split their text. That said, if you did do this, you would create a new renderField pipeline processor.

How can I strip html and word formatted text from a text box

I have a multiline textbox and need to restrict users from adding html input or any other scripts, and from copy-pasting from Word or any other word processor.
But I need to allow bullets for the input.
I thought it would be a simple thing to do since it looks like a common problem.
But I could not find a good solution on the web; please help.
I am using the Telerik toolkit as well.
If you need to strip out HTML then HTML Agility Pack is your friend. It will deal with all manner of malformed html. As a bonus it is included in Sitecore already.
If you want to use something with a friendlier syntax then consider CSQuery or Fizzler, both of which provide you with a jQuery-type syntax from within C#.
If you need to build a whitelist then take a look at this post on how to add one:
public void RemoveNotInWhiteList(HtmlNode pNode, IEnumerable<string> pWhiteList)
{
    if (!pWhiteList.Contains(pNode.Name))
    {
        pNode.Remove();
        return;
    }

    pNode.Attributes
        .Where(att => !pWhiteList.Contains(att.Name))
        .ToList()
        .ForEach(att => att.Remove());

    pNode.ChildNodes
        .ToList()
        .ForEach(att => RemoveNotInWhiteList(att, pWhiteList));
}
You could create a Validation rule, I reckon (in /sitecore/System/Settings/Validation Rules). Put the allowed HTML in a whitelist somewhere (possibly a Sitecore item) and, when validating, run through that whitelist. If any other HTML tags appear in the field, make it invalid.
This doesn't stop them from putting it in, but it will stop the item from being published.
You could even create a custom item:saved event handler which strips out all HTML tags apart from the whitelisted ones. Again, it doesn't stop them from putting the HTML tags in, but as soon as the item is saved they will be removed. Going a step further, I think it would also be possible to use the Rules Engine for this - this article by John West shows how to use the Rules Engine to modify item names, but you could adapt it to read specific text fields instead.
Neither option here will stop users from inputting HTML, but the HTML tags will automatically be removed when the item is saved.

Parsing page data into sidebar - wordpress

What would be the proper procedure for accessing the current page's html, picking up all instances of a certain tag, and throwing them into the sidebar as links?
I'm not sure of your proficiency with PHP, but I'll give you an overview of what you'd probably want to do.
First, you need the HTML. I'm assuming you're running this on a page (in a page.php file, single.php file, or similar), which means you have access to the global variable $post, which contains the html of the page. To access it you can use the helper function get_the_content(), which returns the html being displayed.
Next you need to parse through this to get the h2 tags. A simple regex can handle this, something like <h2[^>]*>(.*)</h2>. It's important to remember that this regex is very picky, so format your html correctly: no multiline h2s.
So now you have the html, and have parsed it with a regex to get the h2s. Now you need to generate the list from the results, and prepend it to the top of the content of the page. There are a ton of ways to do this, the easiest being just running the code in the right spot in the template file.
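The regex step itself is language-agnostic; as a rough sketch of the idea (the real template code would be PHP, and the markup below is made up to stand in for what get_the_content() would return):

import re

# made-up page markup standing in for the page content
html = '<h2>First topic</h2><p>...</p><h2 class="faq">Second topic</h2><p>...</p>'
headings = re.findall(r'<h2[^>]*>(.*?)</h2>', html)  # non-greedy variant of the regex above
# turn each heading into a sidebar link pointing at an anchor of the same name
items = ''.join('<li><a href="#{0}">{0}</a></li>'.format(h) for h in headings)
sidebar_html = '<ul>{}</ul>'.format(items)
print(sidebar_html)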
Of course there are probably better ways of doing this; I'd recommend you look at, say, a FAQ plugin (if that's what this is for), or do the lists manually (as this system can be broken), or possibly use a custom post type. But for your question, that's how I'd do it.

Symfony2 translation of long texts

A few weeks ago I started playing with Symfony2.
It seems a very powerful framework, but there are some things I still cannot understand.
In the documentation I see that i18n (i.e. translations) is managed by the Translator service. If I understood correctly, the main way to have a website translated is to put the collection of messages I want to translate inside the different files messages.XX.yml (XX = en, fr, it, etc.), one for each language.
This could be perfect for short texts, which possibly do not include any HTML markup. But how do you deal with long text? For instance, how can I manage the translation of a Terms Of Service or an About page?
I guess I should include different templates for each locale I want to use. Am I right?
Thanks for your help!
You can have long texts in the .yml translation file, as well as html tags. Put your Terms of Service text in the messages.xx.yml file like this:
TermsOfServiceText: >
  <p>Here goes my Terms of Service text</p>
  <p>It can be put in several lines and <strong>can include html tags!</strong></p>
  <p>It can also include links</p>
  <p>Just make sure that you put the '>' sign after your translation keyword, like in the first line of this example,
  and start your message on the next line with double-space indentation</p>
Now, in your Twig template, call the translation with this:
{{ 'TermsOfServiceText'|trans|raw }}
raw is used to skip escaping html tags.
I don't think that different templates are a good solution, but feel free to choose what you prefer. I'd go with https://github.com/stof/StofDoctrineExtensionsBundle, in particular with the Translatable behaviour.
