How to clean HTML before parsing it with HtmlUnit - web-scraping

I am scraping HTML using HtmlUnit, but the HTML is malformed, with a few tags left unclosed, so HtmlUnit gives wrong results. I need to clean it before passing it to HtmlUnit.
How can I do that?
A short code snippet or tutorial would be appreciated.

I believe you could do this by implementing your own WebConnectionWrapper. Then you'll have to find an HTML library that repairs the markup properly (if possible). All you then need to do is make sure the wrapper passes the content through that library, so that by the time it reaches HtmlUnit's parser the HTML has already been cleaned.

Related

Web Component that can request another HTML URL and inject it into its shadow DOM

I spent some time today with Lit trying to make a simple WebComponent that makes an HTTP GET to a URI, which returns a fully formed HTML document, and I want to inject said HTML document into the WebComponent's shadow DOM; basically this WebComponent acts as a simple proxy for embedding an externally hosted (but trusted) web snippet on my web page. I ran into a few problems:
Lit considers all HTML unsafe, so I had to mark it with Lit's unsafeHTML directive.
Then, I noticed none of the script or link tags in the injected HTML were being followed, so I parsed the incoming HTML as a HtmlDocument, located all the script/link tags, and "re-created" them using document.createElement(...) and returned them in my render(). I'm now noticing that images aren't showing up either.
I don't like scraping scripts/links and re-creating them and jamming them into my web component anyhow, but I'm curious - what's the right way to approach this syndicating/consuming syndicated HTML pages/fragments?
Is this a solved problem w/ oEmbed already?
Is this simpler to do with a different WebComponent library?
This seems way harder than it should be at this point.
I think that it has little to do with WebComponents but rather with the HTML5 specs. lit-html uses innerHTML to create elements.
Script elements inserted using innerHTML do not execute when they are inserted
There are still ways to execute JS but this has little to do with your question.
unsafeHTML(`<img src="triggerError" onerror="alert('BOOM')">`)
Regarding the images, it may be a path issue?
This should work:
unsafeHTML(`<img src='http://placehold.it/350x350'>`)
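Since scripts that arrive through innerHTML stay inert, one way to get them to run is the approach the question already describes: replace each one with a fresh element made by createElement. The following is only a minimal sketch of that idea in plain JavaScript. `reviveScripts` is a hypothetical helper name, and `container` is assumed to be the element (or shadow root) that already holds the injected markup. Execution order, `async`/`defer` and module scripts are not handled.

```javascript
// Hypothetical helper (not part of Lit): swap each inert <script> found in
// `container` for a freshly created one, which the browser runs when it is
// inserted. `doc` is whatever object provides createElement, normally the
// global `document`.
function reviveScripts(container, doc) {
  const revived = [];
  // Copy the list first because the loop mutates the tree.
  for (const inert of Array.from(container.querySelectorAll("script"))) {
    const fresh = doc.createElement("script");
    // Carry over src, type and any other attributes.
    for (const attr of Array.from(inert.attributes)) {
      fresh.setAttribute(attr.name, attr.value);
    }
    // Carry over inline code, if any.
    fresh.textContent = inert.textContent;
    inert.replaceWith(fresh);
    revived.push(fresh);
  }
  return revived;
}
```

In a Lit component this could presumably be called after rendering, for example `reviveScripts(this.renderRoot, document)` from `updated()`; that call site is an assumption and has not been tested here.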

How do I script html that is not well formed to be well formed using classic asp and vbscript?

I am trying to parse some HTML to switch out the values of various element attributes. I decided that the most reliable way to parse the HTML was to use an XML parser (MSXML).
The problem is that the HTML I'm trying to parse contains attributes like:
<param name="flashvars" value="autoplay=false&brand=embed&cid=97%2Ftest&locale=en_US"/>
Which causes the xml parser to blow up. I figured out that I need to server.htmlencode() the value attribute in order for the xml parser to load it properly. How do I approach this?
I feel like the problem is a vicious circle. I couldn't use regexes because HTML is not regular enough, and now I can't use XML parsers because the HTML isn't "well formed".
Help! How do I approach this issue? I want to be able to change attribute values with VBScript.
Is your HTML well formed? If so you could simply use an XML DomDocument. Use XPath to find the attributes you want to replace.
You can actually use JScript server-side in ASP as well, which might give you access to HTML DOM libraries you could use.
You should probably have a look at one of the libraries for cleaning up HTML, something like HTML Tidy http://www.w3.org/People/Raggett/tidy/
Your main problem is that you need to do a replace on the ampersands; they need to be &amp; in well-formed XML/XHTML.
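As a rough illustration of that ampersand replacement, here is a small sketch in plain JavaScript. It escapes only ampersands that do not already begin an entity reference, so markup that is already escaped is not escaped twice. `escapeBareAmpersands` is a made-up name, and whether the code runs unchanged under classic ASP's JScript engine has not been checked. Named entities that XML does not predefine (such as `&nbsp;`) are left alone and would still be rejected by an XML parser.

```javascript
// Escape every "&" that does not already start an entity reference
// (named such as &amp;, decimal such as &#38;, or hex such as &#x26;).
function escapeBareAmpersands(markup) {
  const bareAmpersand =
    /&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/g;
  return markup.replace(bareAmpersand, "&amp;");
}

// The attribute from the question, run through the helper.
const param =
  '<param name="flashvars" value="autoplay=false&brand=embed&cid=97%2Ftest&locale=en_US"/>';
console.log(escapeBareAmpersands(param));
```

The same pattern could be applied to the whole document string before handing it to the XML parser, on the assumption that bare ampersands are the only well-formedness problem.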

How should ajax request html files be formatted?

I'm using jQuery UI's tabs with Ajax.
I was wondering whether the files that the Ajax calls retrieve are supposed to start with <html>, or should be just the minimal HTML possible, since they will be injected into an already formatted, valid XHTML file. I hope I'm making myself clear.
Thanks in advance.
If you're going to inject what you receive from the server directly into the DOM, you'll want an HTML snippet. Something like
<div>This is something <strong>injected</strong></div>
is preferred over
<html><body><div>This is something <strong>injected</strong></div></body></html>
Minimal html. All the examples on the jquery UI tabs page use HTML shards.
You should be able to spit out the HTML exactly as you would want it dropped in to place (i.e. enclosing tags are not necessary).

Move generated javascript out of rendered html

One piece of SEO advice we got was to move all JavaScript to external files, so the code could be removed from the text. For fixed scripts this is not a problem, but some scripts need to be generated because they depend on a ClientID that is generated by ASP.NET.
Can I use the ScriptManager (from asp.net Ajax or from Telerik) to send this script to the browser or do I need to write my own component for that?
I found only ways to combine fixed files and/or embedded resources (also fixed).
How about registering the ClientIDs in an inline Javascript array/hash, and have your external JS file iterate through that?
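The registry idea might look roughly like this: the page emits only a small inline map from logical names to generated ClientIDs, and all real logic lives in the external file. The names `pageClientIds` and `bindRegisteredControls`, as well as the example ID values, are invented for illustration. How the server writes the map into the page is left out.

```javascript
// --- inline, generated per page by the server (values are invented) ---
const pageClientIds = {
  searchBox: "ctl00_Main_txtSearch",
  submitButton: "ctl00_Main_btnSubmit",
};

// --- external, static .js file ---
// Look up each registered element and pass it to the handler registered
// under the same logical name. Returns the names that were bound.
function bindRegisteredControls(registry, handlers, doc) {
  const bound = [];
  for (const [name, clientId] of Object.entries(registry)) {
    const element = doc.getElementById(clientId);
    const handler = handlers[name];
    if (element && handler) {
      handler(element);
      bound.push(name);
    }
  }
  return bound;
}
```

In the browser the call could be something like `bindRegisteredControls(pageClientIds, { searchBox: (el) => el.focus() }, document);`. Passing `document` in as a parameter is just a design choice that keeps the external file free of page-specific globals.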
Spiderbots do not read JavaScript blocks. This advice is plain wrong.
Some javascript can break W3C validators (and possibly cause issues with some spiderbots)
You can reduce this by placing this code around your javascript:
< !-- no script
... your javascript code and functions ...
// -->
Note: remove the space between "<" and "!" as it seems to comment out the example here :-)

Are there any tools out there to compare the structure of 2 web pages?

I receive HTML pages from our creative team, and then use those to build aspx pages. One challenge I frequently face is getting the HTML I spit out to match theirs exactly. I almost always end up screwing up the nesting of <div>s between my page and the master pages.
Does anyone know of a tool that will help in this situation -- something that will compare 2 pages and output the structural differences? I can't use a standard diff tool, because IDs change from what I receive from creative, text replaces lorem ipsum, etc..
You can use HTMLTidy to convert the HTML to well-formed XML so you can use XML Diff, as Gulzar suggested.
tidy -asxml index.html
If you output XML-compliant HTML, or at least translate your HTML output into XML-compliant form, you could then run an XSL transformation over your output to remove the content and the id attributes. Apply the same transformation to their HTML, and then compare.
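The strip-then-compare idea can also be tried without XSL. The sketch below swaps in a different technique: a small regex tokenizer in plain JavaScript that reduces a page to an indented list of tag names, dropping text, ids and every other attribute, so two pages can be fed to an ordinary diff tool. It is not a real parser. It assumes reasonably well-formed input (for example tidy output) and will be confused by ">" inside attribute values or by tags inside comments and scripts. `structuralSkeleton` and `VOID_TAGS` are made-up names.

```javascript
// Elements that never have a closing tag in HTML.
const VOID_TAGS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img", "input",
  "link", "meta", "param", "source", "track", "wbr",
]);

// Reduce markup to one indented tag name per line, ignoring text content
// and all attributes (including id).
function structuralSkeleton(markup) {
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(\/?)>/g;
  const lines = [];
  let depth = 0;
  let match;
  while ((match = tagPattern.exec(markup)) !== null) {
    const isClosing = match[1] === "/";
    const name = match[2].toLowerCase();
    const isVoid = VOID_TAGS.has(name);
    if (isClosing) {
      // A stray closing tag for a void element does not change the depth.
      if (!isVoid) depth = Math.max(0, depth - 1);
      continue;
    }
    lines.push("  ".repeat(depth) + name);
    if (!isVoid && match[3] !== "/") depth += 1;
  }
  return lines.join("\n");
}

const theirs = '<div id="hero"><p>Lorem ipsum</p><img src="a.png"></div>';
const mine = '<div id="ctl00_hero"><p>Real copy</p><img src="b.png"/></div>';
console.log(structuralSkeleton(theirs) === structuralSkeleton(mine));
```

Writing each skeleton to a file and comparing the two files with a standard diff tool should then show only nesting differences, under the assumptions above.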
I was thinking along the lines of XML Diff, since HTML can be represented as an XML document.
The challenge with HTML is that it might not always be well formed. Found one more here showing how to use the XMLDiff class.
A copy of my own answer from here.
What about DaisyDiff (Java and PHP versions available)?
The following features are really nice:
Works with badly formed HTML that can be found "in the wild".
The diffing is more specialized in HTML than XML tree differs. Changing part of a text node will not cause the entire node to be changed.
In addition to the default visual diff, HTML source can be diffed coherently.
Provides easy to understand descriptions of the changes.
The default GUI allows easy browsing of the modifications through keyboard shortcuts and links.
WinMerge is a good visual diff program.
