Best visible content extractor available - web-scraping

My application needs the visible content of a given URL: just the text part, with no HTML and no header or footer data. At the moment I am using BeautifulSoup and boilerpipe for this, but in some rare cases I don't get enough data, or not the right data. So I was wondering: is there any other competitor? Programming language is not a barrier.

I would recommend using XPath or CSS extractors directly for content extraction; both kinds of selectors have simple implementations in the parsel module (a minimal sketch follows below).
For a complete web-crawling + content-extraction suite, scrapy would be my preferred option.
And if you want to visually select which parts of the HTML to extract, I would recommend portia.
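For illustration, here is a minimal parsel sketch; the URL and the selectors are placeholders, not something taken from your application:

    import requests
    from parsel import Selector

    # Fetch the page and wrap it in a parsel Selector.
    html = requests.get("https://example.com/article").text
    sel = Selector(text=html)

    # CSS selector: the visible text of every paragraph in the main article.
    paragraphs = sel.css("article p::text").getall()

    # Equivalent XPath that also skips <script> and <style> content.
    text_nodes = sel.xpath(
        "//body//text()[not(ancestor::script or ancestor::style)]"
    ).getall()

    print(" ".join(t.strip() for t in text_nodes if t.strip()))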
Hope that helped.

How to style content coming from a Headless CMS?

Last month I read about Headless CMSes for the first time, and I just fell in love with the approach.
But right afterwards, I wondered how I could format and/or style the content if I ever worked with this technology.
By styling the content, I mean words within a title, a paragraph and so on; not a whole paragraph, which can quite obviously be done.
It seems to me that it is impossible, since you only get JSON with no HTML whatsoever, just raw text. So it looks like this is the major downside of consuming content through a Headless CMS from a front-end perspective.
Formatting text is just fundamental, especially when dealing with large content. And I am sure I cannot be the first one concerned about not being able to add some bold and/or italics to a text to emphasize its important parts.
But I can't find any website discussing this topic, just "how to model the content" and whatnot.
Does really no one care about it?
I would appreciate it if anyone could shed some light on this question.
Diving into the Headless CMS #RicoHancock pointed out, I've learnt that it is completely feasible to store rich text and structured content within JSON that can be converted to HTML, following some specifications I wasn't aware of.
In the particular case of DatoCMS, they use a specification called dast.
To learn more about it, visit their docs (the following link contains very illustrative code examples):
https://www.datocms.com/docs/structured-text/dast
Paraphrasing their own words:
Structured Text format adheres to the Unified collective, which offers a big ecosystem of utilities to parse, transform, manipulate, convert and serialize content of any kind.
The "Unified collective" is a collective of free and open source packages to work with content as structured data with plugins. In order to create the syntax trees, Unified uses UNIST nodes.
UNIST is a specification, and stands for "UNiversal Syntax Tree".
More info about the UNIST spec and the Unified ecosystem:
https://github.com/syntax-tree/unist
https://unifiedjs.com/learn/guide/introduction-to-unified/
https://unifiedjs.com/learn/guide/using-unified/
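To make this concrete, here is a minimal Python sketch of turning such a structured-text JSON tree into HTML. The node shape loosely follows dast's span nodes and marks, but it is illustrative only; check the docs linked above for the real format:

    # A dast-like structured-text tree: plain JSON nodes, no HTML.
    doc = {
        "type": "paragraph",
        "children": [
            {"type": "span", "value": "Formatting is "},
            {"type": "span", "marks": ["strong"], "value": "fundamental"},
            {"type": "span", "value": "."},
        ],
    }

    def to_html(node):
        if node["type"] == "span":
            text = node["value"]
            for mark in node.get("marks", []):
                tag = {"strong": "strong", "emphasis": "em"}.get(mark)
                if tag:  # only two marks handled in this sketch
                    text = f"<{tag}>{text}</{tag}>"
            return text
        children = "".join(to_html(c) for c in node.get("children", []))
        return f"<p>{children}</p>" if node["type"] == "paragraph" else children

    print(to_html(doc))  # <p>Formatting is <strong>fundamental</strong>.</p>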
TLDR: Markdown.
The company I work for uses DatoCMS. We have a blog, and each blog post is created in our CMS by our copywriting team. DatoCMS allows us (the developers) to create "blocks" that make up the blog post. We have image blocks and content blocks that are rendered by a template file on our frontend. The content blocks support Markdown, so italics, bold, and links work. When our copywriting/marketing team want to make a new blog post, they go to the CMS, create a new post, add a title, slug, and blocks, and then save.
I don't have much experience with other Headless CMSes, so I'm not sure whether Markdown will work there, but I don't see why it wouldn't; Markdown is all over the internet. (In fact, this answer is Markdown XD)
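As a quick illustration of that workflow, a content block stored as Markdown can be rendered to HTML by the frontend template. Here is a minimal Python sketch using the third-party markdown package (any Markdown library would do, and the block text is invented):

    import markdown  # pip install markdown

    # A content block as it might come out of the CMS JSON.
    block = "Our new post is *short* and **bold**. [Read more](https://example.com)"

    # Convert the Markdown to HTML for the blog template.
    print(markdown.markdown(block))
    # <p>Our new post is <em>short</em> and <strong>bold</strong>. ...</p>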

Parse Onlineshop - Onlineshop Data

I am searching for a solution for crawling and parsing a whole website (an online shop) automatically and saving all products, as product name and product price, in a CSV.
Extracting data from a website can be extremely simple or the complete opposite; it depends on how the website is made. A shop tends to be a complex website, and thus its DOM (the HTML structure) is mostly unique to that site. It is very unlikely that someone else has tried the exact same thing you want for that page, so you have to write code yourself and extract the necessary pieces.
This will be our example product: http://www.thomann.de/gb/focusrite_scarlett_2i2.htm
HTML uses classes to tell the CSS (for styling) how to design or render a certain element. You can use this behaviour to find an element, such as the one containing the price, by its class. In this example it is .tr-prod-price.
Every major browser has an inspect-element function that can be used to find the class of an element that appears on screen: right-click on your text (the price or the title) and press Q (Firefox only).
Now you've got closer to parsing your data, and it is time to write code. You could use Python, Java or even JavaScript. JavaScript in conjunction with Node.js could be very easy, because JS has the built-in methods we need. A short sketch follows below.
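For example, here is a minimal Python sketch using requests and BeautifulSoup. The .tr-prod-price class is the one found above; whether it still matches the live page, and the h1 guess for the title element, are assumptions:

    import csv
    import requests
    from bs4 import BeautifulSoup

    url = "http://www.thomann.de/gb/focusrite_scarlett_2i2.htm"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # Use the classes/tags discovered in the browser's inspector.
    price = soup.select_one(".tr-prod-price").get_text(strip=True)
    name = soup.select_one("h1").get_text(strip=True)

    with open("products.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["product_name", "product_price"])
        writer.writerow([name, price])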
You may need a search engine to find the detail pages of the products. Google can list all results for a query like site:thomann.de/gb. But of course Google does not provide an easy way (an API) to get this information, and if you start writing your own parser for that, I am not sure about the legal consequences. The legal side also needs to be addressed for your main intention.

How do I output HTML form data to PDF?

I need to collect data from a visitor in an HTML form and then have them print a document with the appropriate fields pre-populated. They'll need to have a couple of signatures on the document, so it has to be printed.
The paper form already exists, so one idea was to scan it in, with nothing filled out, as an image. I would then have the HTML form data print out using CSS for positioning and using the blank scanned form as a background image.
A better option, I would think, would be to automatically generate the PDF with this data, but I'm not sure how to accomplish either.
Suggestions and ideas would be greatly appreciated! =)
I would have to respectfully disagree with Osvaldo. Using CSS to align things on a printed document would take ages to get working efficiently across browsers. Plus, if Microsoft comes out with a new browser, you're going to have to keep updating for it.
If you know any PHP (and if you know JavaScript and HTML, basic PHP is very simple), here's a good library you can use: FPDF.
Thankfully, PHP doesn't deprecate a whole lot of methods and the total code is less than 10 lines if you have to go in and change things around.
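FPDF itself is a PHP library, but it also has Python ports (e.g. the fpdf2 package). Here is a minimal sketch of dropping form data onto a page; the field names and coordinates are invented and would need to be measured against the real paper form:

    from fpdf import FPDF  # pip install fpdf2

    # Form data as it might arrive from the HTML form.
    form_data = {"name": "Jane Doe", "date": "2024-01-15"}

    pdf = FPDF(format="letter")
    pdf.add_page()
    pdf.set_font("Helvetica", size=12)

    # Place each field at a fixed position matching the paper form.
    pdf.text(40, 60, f"Name: {form_data['name']}")
    pdf.text(40, 75, f"Date: {form_data['date']}")

    pdf.output("filled_form.pdf")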
You can control printed documents acceptably well with CSS, so I would suggest trying that option first, because it's easier.
dompdf is actually a great PHP library for converting HTML to PDF documents: http://code.google.com/p/dompdf/. There are many demos available on the site.
XSL-FO is what I would recommend. XSL-FO (along with XSLT and XPath) is a sub-standard of XSL that was designed to be an abstract representation of a formatted document (containing text, graphic elements, fonts, styles, etc.).
XSL-FO documents are valid XML documents, and there exist tools and APIs that let you convert an XSL-FO document to MS Word, PDF, RTF, etc. Depending on the technology you use, a quick Google search will tell you what is available.
Here are a few links to help you get started with XSL-FO:
http://en.wikipedia.org/wiki/XSL_Formatting_Objects
http://www.w3schools.com/xslfo/xslfo_intro.asp
http://www.w3.org/TR/xsl11/
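For instance, Apache FOP is a widely used open-source XSL-FO processor. Assuming its command-line tool is installed, the conversion step could be scripted like this (a sketch; document.fo is a placeholder name):

    import subprocess

    # Render an XSL-FO document to PDF with Apache FOP.
    subprocess.run(["fop", "-fo", "document.fo", "-pdf", "document.pdf"], check=True)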

What are appropriate markup languages for users with disabilities?

Suppose you're developing a web site and blind users will be a significant chunk of your target market. If the web site includes document editing functionality, what would be appropriate WYSIWYM tools? Are languages like Markdown, Textile and Wiki Formatting really accessible or are they inconvenient to blind users?
I'm a blind programmer, and while I haven't used most of the languages you mention, I've found that any markup language is fairly easy to use if you have the desire to learn it. I've had no problem using either HTML or several markup languages for wikis. Part of it will depend on how invested the users are in your site: if it's a site that will be visited infrequently or for short periods of time, it's much less likely that a user will take the time to learn the required markup, whether they are blind or not. Unfortunately, I have not found an accessible JavaScript WYSIWYG editor, but I find it easier to manually enter the markup, so I haven't looked very hard.
The first question is: how important is semantic structure? Could you get away with plain text? You could do simple parsing, like treating blank lines as paragraph markers, treating a series of lines beginning with * as a bulleted list, identifying URLs and turning them into links, etc.
As a blind developer myself, I have no problem in understanding languages like Markdown. But if it's a syntax I'm unfamiliar with, I'll only learn it if I expect to use the site very often, or care deeply about the content.
Two final thoughts come to mind: while I certainly experience some accessibility challenges using TinyMCE, you could develop something much simpler - provide fewer than 10 formatting options, like inserting hyperlinks, making lists, centering text, setting the style (such as heading), etc.
And lastly, when I talk to non-technical blind people, they often just write their content in Word and paste into a wiki or blog post. This sounded strange when I first heard it, but it does make sense. So an ideal solution would accept pasted in content.
In closing - it depends how important this is, and how much effort you want to expend. Maybe a Markdown editor with a live preview (like on this site), buttons for inserting simple formatting like URLs, and the ability to paste in rich text would tick all boxes :-)
On a web page, the most accessible embedded text editor for blind users is one that uses standard HTML, such as a <textarea> element, with a corresponding <label> element:
<label for="editor">Enter your text here using wiki markup:</label>
<textarea id="editor"></textarea>
If a WYSIWYM tool is built using standard accessible HTML, then blind users can easily enter text into it, with full confidence that they're entering text in the right place. Then the question becomes: which is the better markup language? They all require memorization, but some may be more intuitive than others. One way to find out which is best would be to do some usability testing with a wide variety of target users. Also be sure to provide easy, accessible access to syntax help.
Picture yourself working on a pure-text 80x4 display (just open a console and resize it appropriately), then use vi/emacs/ed, and you'll soon realize what markup will get in the way.
Try to do as much of the work as possible with plain text; otherwise, use a light markup like POD. Finally, things like AsciiDoc are very powerful but need training.
I don't know about WYSIWYG/WYSIWYM tools, but I do know that complying with W3C standards (especially the HTML5 and CSS3 drafts) while writing your own editor code will help a lot.
In CSS you can specify the speed and intonation of speech. In HTML you can specify alternative text (the alt attribute on many elements) that screen readers support. Be sure to know when to use the abbr and the acronym elements: use the former when you want the screen reader to read out the meaning of an abbreviation, and the latter when the acronym should be read as a word (e.g. ASAP, NATO and OS). (Note that HTML5 obsoletes acronym in favour of abbr.)
For the editor itself, I recommend creating a WYSIWYG editor that uses divs and spans; screen readers will easily understand the structure of the document. For the current line, use a text box; for every other line that is not being edited, convert the contents immediately to valid HTML.
If you find a good tool, be sure to post it here. I'm looking for one too. :-)

What is an efficient and accurate way of parsing web HTML into a text field, with formatting?

I want to parse web HTML into a Flex text field. I have three approaches in mind:
1. Use string functions (splice etc.) to replace tags with ones the text field can understand. This gives me maximum control to make the changes I need, but isn't it too complex, with a processing overhead that reduces efficiency?
2. Parse the HTML to XML and then use that as the text input for the text field. What about its efficiency and the control it offers? That is what I need to know from this question.
3. Regular expressions.
Which one will be the most suitable for parsing web text?
Help required.
Regards.
I'd suggest completely changing your approach. If you just need a simple HTML page displayed, you can parse it directly in Flash/Flex. In other cases, when you have complex HTML page(s), you should use an IFrame. That's a really simple and efficient approach and could be very useful in your case. Here you can find it described:
Flex and Iframe, the Flex IFrame project site.
Good luck!
