Create XML object from poorly formatted HTML - apache-flex

I want to make an XML document from an HTML one so I can use the XML parsing tools. My problem is that my HTML is not guaranteed to be valid XHTML. How can I avoid the exceptions? In the string below, <p> is not terminated, nor are <br> and <meta>.
var poorHtml:String = "<html><meta content=\"stuff\" name=\"description\"><p>Hello<br></html>";
var html:XML = new XML(poorHtml);
TypeError: Error #1085: The element type "meta" must be terminated by the matching end-tag "</meta>".

I did some searching and couldn't come up with anything; this doesn't really seem possible. The major issue is deciding how a parser should correct the markup when the format is not valid.
In the case of browsers, each one decides based on its own rules what should happen when a closing tag isn't found (insert one wherever it would still produce a valid XML document and DOM tree, self-terminate the tag, or remove the tag entirely), how to handle a closing tag with no matching opening tag, what to do with unterminated attributes, and so on.
Unfortunately I don't know of anything in the specification that explains what should be done in this case. With XHTML, which is how Flex treats the input, these are fatal errors that result in no functionality, rather than HTML4's more forgiving behaviour with its quirks mode and transitional DTD options.
To avoid the error or give better error messaging you can use this:
var poorHtml:String = "<html><meta content=\"stuff\" name=\"description\"><p>Hello<br></html>";
try
{
var html:XML = new XML(poorHtml);
}
catch(e:TypeError)
{
trace("error caught")
}
but you'll likely be best off using some sort of server-side script to validate or correct the XML before passing it over to the client.
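If the server side happens to be Java, a library like jsoup can do that correction step: it parses tag soup leniently and can re-serialize it with XML syntax, so the Flex client can load the result into an XML object. A minimal sketch under that assumption (the class and method names are illustrative, not from the question):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlToXml {
    // Parse lenient HTML and re-serialize it with XML output syntax
    // (all tags closed, attributes quoted). jsoup will also normalize
    // the result into an html/head/body structure.
    public static String tidy(String poorHtml) {
        Document doc = Jsoup.parse(poorHtml);
        doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
        return doc.outerHtml();
    }

    public static void main(String[] args) {
        String poorHtml = "<html><meta content=\"stuff\" name=\"description\"><p>Hello<br></html>";
        System.out.println(tidy(poorHtml));
    }
}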

There is probably an implementation of HTML Tidy in just about any language you might happen to be working with. This looks promising for your situation: http://code.google.com/p/as3htmltidylib/
If you don't want to drag in a whole library (I wouldn't), you could just write your own XML parser that handles errors in whatever way suits you (I'd suggest auto-closing tags until the document makes sense again, ignoring end tags with no matching start tags, and maybe un-closing certain special tags such as "body" and "html"); a rough sketch of the auto-closing idea follows below. This has the added advantage that you can optimize it for whatever jobs you need it for, e.g. by storing a list of all elements with the attribute "href" as you come to them.
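For illustration only, here is a very rough sketch (in Java rather than ActionScript) of that auto-closing idea: scan tags with a regex, keep a stack of open elements, drop end tags that were never opened, and close whatever is still open at the end. It assumes attribute values never contain '>' and ignores comments, CDATA and doctypes, so treat it as a starting point rather than a real parser:
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveTagCloser {
    // HTML void elements that never take a closing tag.
    private static final Set<String> VOID_TAGS =
            Set.of("br", "img", "meta", "hr", "input", "link");
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>");

    public static String repair(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());
            boolean isEndTag = !m.group(1).isEmpty();
            String name = m.group(2).toLowerCase();
            boolean selfClosed = !m.group(3).isEmpty();
            if (isEndTag) {
                // Close intervening unclosed tags; ignore end tags that were never opened.
                if (open.contains(name)) {
                    while (!open.peek().equals(name)) {
                        out.append("</").append(open.pop()).append(">");
                    }
                    open.pop();
                    out.append(m.group());
                }
            } else if (VOID_TAGS.contains(name) || selfClosed) {
                // Re-emit void/self-closing tags in XML form, e.g. <br> becomes <br/>.
                out.append(m.group().replaceFirst("/?>$", "/>"));
            } else {
                open.push(name);
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(html.substring(last));
        // Auto-close anything still open at the end of the document.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append(">");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String poor = "<html><meta content=\"stuff\" name=\"description\"><p>Hello<br></html>";
        System.out.println(repair(poor));
        // <html><meta content="stuff" name="description"/><p>Hello<br/></p></html>
    }
}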

You could try to pass your HTML through HTML Tidy on the server before loading it. I believe that HTML Tidy does a good job at cleaning up broken HTML.


How to resolve PCDATA invalid Char value 27 [9] error when parsing XML in R [duplicate]

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against some actual customer data, and it looks like the other product is allowing input from users that should be considered invalid. Anyway, I still have to try to figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder and I'm getting an error on input that looks like the following.
<xml>
...
<description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
...
</xml>
As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). Now, this description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...)
I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?
That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.
An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
Options, most desirable first:
Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)
Use a tolerant markup parser to clean up the problem ahead of parsing as XML:
Standalone: xmlstarlet has robust recovery and repair capabilities (credit: RomanPerekhrest):
xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.
Python: Beautiful Soup is Python-based. See notes in the Differences between parsers section. See also answers to this question for more suggestions for dealing with not-well-formed markup in Python, including especially lxml's recover=True option. See also this answer for how to use codecs.EncodedFile() to clean up illegal characters.
Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup (a TagSoup sketch follows this list).
.NET:
XmlReaderSettings.CheckCharacters can be disabled to get past illegal XML character problems.
@jdweng notes that XmlReaderSettings.ConformanceLevel can be set to ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.
@jdweng also reports that XmlReader.ReadToFollowing() can sometimes be used to work around XML syntactical issues, but note the rule-breaking warning in #3 below.
Microsoft.Language.Xml.XMLParser is said to be “error-tolerant”.
Go: Set Decoder.Strict to false as shown in this example by @chuckx.
PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See a nice example here.
Ruby: Nokogiri supports “Gentle Well-Formedness”.
R: See htmlTreeParse() for fault-tolerant markup parsing in R.
Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
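For the Java entry above, a common pattern is to let TagSoup (which exposes a SAX XMLReader that always recovers) feed an identity Transformer, so the repaired event stream is re-serialized as well-formed XML. This is a sketch of that usage, not code from the answer:
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class TagSoupRepair {
    public static String toWellFormedXml(String brokenMarkup) throws Exception {
        // TagSoup's parser never reports fatal errors; it always emits a
        // well-formed SAX event stream, which the identity transform serializes.
        XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        StringWriter out = new StringWriter();
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.transform(
                new SAXSource(reader, new InputSource(new StringReader(brokenMarkup))),
                new StreamResult(out));
        return out.toString();
    }
}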
Process the data as text manually, using a text editor or programmatically using character/string functions. Doing this programmatically can range from tricky to impossible, as what appears to be predictable often is not -- rule breaking is rarely bound by rules.
For invalid character errors, use regex to remove/replace invalid characters:
PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
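A Java equivalent of the character-range filters above (same XML 1.0 ranges as the PHP example; this is an addition, not from the original answer):
import java.util.regex.Pattern;

public final class XmlCharFilter {
    // Characters legal in XML 1.0: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD
    // (supplementary planes are omitted here, matching the PHP one-liner above).
    private static final Pattern ILLEGAL =
            Pattern.compile("[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD]+");

    public static String clean(String s) {
        return ILLEGAL.matcher(s).replaceAll(" ");
    }
}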
For ampersands, use regex to replace matches with &amp; (credit: blhsin, demo):
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
Note that the above regular expressions won't take comments or CDATA sections into account.
A standard XML parser will NEVER accept invalid XML, by design.
Your only option is to pre-process the input to remove the "predictably invalid" content, or wrap it in CDATA, prior to parsing it.
The accepted answer is good advice, and contains very useful links.
I'd like to add that this, and many other cases of not-well-formed and/or DTD-invalid XML, can be repaired using SGML, the ISO-standardized superset of HTML and XML. In your case, what works is to declare the bogus THIS-IS-PART-OF-DESCRIPTION element as an SGML empty element and then use e.g. the osx program (part of the OpenSP/OpenJade SGML package) to convert it to XML. For example, if you supply the following to osx
<!DOCTYPE xml [
<!ELEMENT xml - - ANY>
<!ELEMENT description - - ANY>
<!ELEMENT THIS-IS-PART-OF-DESCRIPTION - - EMPTY>
]>
<xml>
<description>blah blah
<THIS-IS-PART-OF-DESCRIPTION>
</description>
</xml>
it will output well-formed XML for further processing with the XML tools of your choice.
Note, however, that your example snippet has another problem in that element names starting with the letters xml or XML or Xml etc. are reserved in XML, and won't be accepted by conforming XML parsers.
IMO these cases should be solved by using JSoup.
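To make that concrete, here is a small sketch of how jsoup's lenient XML parser swallows the bogus tag without throwing (the root element is renamed to data here, since names starting with "xml" are reserved, as noted just above):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class LenientParse {
    public static void main(String[] args) {
        String broken = "<data><description>Example:Description:"
                + "<THIS-IS-PART-OF-DESCRIPTION></description></data>";
        // jsoup's XML parser never throws on stray or unclosed tags;
        // it simply builds whatever tree it can.
        Document doc = Jsoup.parse(broken, "", Parser.xmlParser());
        // The bogus element ends up as an empty child of <description>;
        // text() still returns the surrounding character data.
        System.out.println(doc.select("description").text());
    }
}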
Below is not really an answer for this specific case, but it's something I found on the web (thanks to inuyasha82 on Coderwall). This bit of code inspired me for another similar problem while dealing with malformed XML, so I share it here.
Please do not edit what is below, as it is reproduced as-is from the original website.
To be valid, an XML document requires a unique root element declared in the document.
So, for example, a valid XML document is:
<root>
<element>...</element>
<element>...</element>
</root>
But if you have a document like:
<element>...</element>
<element>...</element>
<element>...</element>
<element>...</element>
This will be considered malformed XML, and many XML parsers will just throw an exception complaining that there is no root element.
This example shows a way to solve that problem and successfully parse the malformed XML above.
Basically, what we will do is programmatically add a root element.
So first of all you have to open the resource that contains your "malformed" XML (e.g. a file):
File file = new File(pathtofile);
Then open a FileInputStream:
FileInputStream fis = new FileInputStream(file);
If we try to parse this stream with any XML library at this point, we will get the malformed-document exception.
Now we create a list of InputStream objects with three elements:
A ByteArrayInputStream that contains the string: <root>
Our FileInputStream
A ByteArrayInputStream with the string: </root>
So the code is:
List<InputStream> streams =
Arrays.asList(
new ByteArrayInputStream("<root>".getBytes()),
fis,
new ByteArrayInputStream("</root>".getBytes()));
Now using a SequenceInputStream, we create a container for the List created above:
InputStream cntr =
new SequenceInputStream(Collections.enumeration(streams));
Now we can use any XML parser library on cntr, and it will be parsed without any problem (checked with the StAX library).
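To round this off, a minimal sketch of handing the wrapped stream to DocumentBuilder, the parser mentioned in the question (cntr is the SequenceInputStream built above; the class and method names are illustrative):
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class WrappedParse {
    // cntr is the SequenceInputStream from the snippet above:
    // "<root>" + the original file + "</root>".
    public static Document parseWrapped(InputStream cntr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(cntr);
        // Every original top-level <element> is now a child of the synthetic <root>.
        return doc;
    }
}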

How to remove cross site scripting? [duplicate]

I have a website that allows users to enter HTML through a TinyMCE rich editor control. Its purpose is to allow users to format text using HTML.
This user-entered content is then output to other users of the system.
However, this means someone could insert JavaScript into the HTML in order to perform an XSS attack on other users of the system.
What is the best way to filter out JavaScript code from an HTML string?
Performing a regular-expression check for <SCRIPT> tags is a good start, but an evildoer could still attach JavaScript to the onclick attribute of a tag.
Is there a fool-proof way to strip out all JavaScript code, whilst leaving the rest of the HTML untouched?
For my particular implementation, I'm using C#
Microsoft have produced their own anti-XSS library, Microsoft Anti-Cross Site Scripting Library V4.0:
The Microsoft Anti-Cross Site Scripting Library V4.0 (AntiXSS V4.0) is an encoding library designed to help developers protect their ASP.NET web-based applications from XSS attacks. It differs from most encoding libraries in that it uses the white-listing technique -- sometimes referred to as the principle of inclusions -- to provide protection against XSS attacks. This approach works by first defining a valid or allowable set of characters, and encodes anything outside this set (invalid characters or potential attacks). The white-listing approach provides several advantages over other encoding schemes. New features in this version of the Microsoft Anti-Cross Site Scripting Library include:
- A customizable safe list for HTML and XML encoding
- Performance improvements
- Support for Medium Trust ASP.NET applications
- HTML Named Entity Support
- Invalid Unicode detection
- Improved Surrogate Character Support for HTML and XML encoding
- LDAP Encoding Improvements
- application/x-www-form-urlencoded encoding support
It uses a whitelist approach to strip out potential XSS content.
Here are some relevant links related to AntiXSS:
Anti-Cross Site Scripting Library
Microsoft Anti-Cross Site Scripting Library V4.2 (AntiXSS V4.2)
Microsoft Web Protection Library
Peter, I'd like to introduce you to two concepts in security;
Blacklisting - Disallow things you know are bad.
Whitelisting - Allow things you know are good.
While both have their uses, blacklisting is insecure by design.
What you are asking for is, in fact, blacklisting. If there is an alternative to <script> (such as <img src="bad" onerror="hack()"/>), you won't be able to avoid the issue.
Whitelisting, on the other hand, allows you to specify the exact conditions you are allowing.
For example, you would have the following rules:
allow only these tags: b, i, u, img
allow only these attributes: src, href, style
That is just the theory. In practice, you must parse the HTML accordingly, hence the need for a proper HTML parser.
If you want to allow some HTML but not all, you should use something like OWASP AntiSamy, which allows you to build a whitelisted policy over which tags and attributes you allow.
HTMLPurifier might also be an alternative.
It's of key importance that it is a whitelist approach, as new attributes and events are added to HTML5 all the time, so any blacklisting would fail within short time, and knowing all "bad" attributes is also difficult.
Edit: Oh, and regex is a bit hard to do here. HTML can come in lots of different forms: tags can be unclosed, attributes can start with or without quotes (single or double), and you can have line breaks and all kinds of spaces within the tags, to name a few issues. I would rely on a well-tested library like the ones I mentioned above.
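To make the whitelist idea concrete, here is a small sketch using jsoup's sanitizer (Java; jsoup is mentioned elsewhere in this thread, and in older releases the Safelist class is named Whitelist). The tag and attribute lists mirror the example rules above, with <a> added so href has somewhere to apply; allowing style at all is itself a judgment call:
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class WhitelistSanitizer {
    public static String sanitize(String untrustedHtml) {
        // Start from an empty safelist and allow only what we explicitly trust.
        Safelist safelist = Safelist.none()
                .addTags("b", "i", "u", "img", "a")
                .addAttributes("img", "src")
                .addAttributes("a", "href")
                .addAttributes(":all", "style")
                // Restrict URL attributes to http/https so javascript: links are dropped.
                .addProtocols("img", "src", "http", "https")
                .addProtocols("a", "href", "http", "https");
        return Jsoup.clean(untrustedHtml, safelist);
    }

    public static void main(String[] args) {
        String dirty = "<b>ok</b>"
                + "<img src=\"http://example.com/x.png\" onerror=\"hack()\">"
                + "<script>evil()</script>";
        // onerror and <script> are stripped; only whitelisted markup survives.
        System.out.println(sanitize(dirty));
    }
}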
Regular expressions are the wrong tool for the job; you need a real HTML parser, or things will turn bad. You need to parse the HTML string and then remove all elements and attributes but the allowed ones (whitelist approach; blacklists are inherently insecure). You can take the lists used by Mozilla as a starting point. There you also have a list of attributes that take URL values - you need to verify that these are either relative URLs or use an allowed protocol (typically only http:, https: or ftp:, and in particular not javascript: or data:). Once you've removed everything that isn't allowed, you serialize your data back to HTML - now you have something that is safe to insert on your web page.
I try to replace the tag format like this:
using System.Text.RegularExpressions;
public class Utility
{
public static string PreventXSS(string sInput) {
if (sInput == null)
return string.Empty;
string sResult = string.Empty;
// Insert a space after every '<' so that tags such as <script> become "< script>"
// and are no longer interpreted as markup by the browser.
sResult = Regex.Replace(sInput, "<", "< ");
sResult = Regex.Replace(sResult, @"<\s*", "< ");
return sResult;
}
}
Usage before saving to the database:
string sResultNoXSS = Utility.PreventXSS(varName);
I have tested it with input data like:
<script>alert('hello XSS')</script>
it will be run on browser. After I add Anti XSS the code above will be:
< script>alert('hello XSS')< /script>
(There is a space after <)
As a result, the script won't be run in the browser.

Retrieve data from a website via Visual Basic

There is this website that we purchase widgets from that provides details for each of their parts on its own webpage. Example: http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND. I have to find all of their parts that are in our database, and add Manufacturer and Manufacturer Part Number values to their fields.
I was told that there is a way for Visual Basic to access a webpage and extract information. If someone could point me in the right direction on where to start, I'm sure I can figure this out.
Thanks.
How to scrape a website using HTMLAgilityPack (VB.Net)
I agree that HtmlAgilityPack is the easiest way to accomplish this. It is less error-prone than just using Regex. The following is how I deal with scraping.
After downloading the HtmlAgilityPack DLL, create a new application, add HtmlAgilityPack via NuGet, and reference it. If you can use Chrome, it will allow you to inspect the page to get information about where your information is located. Right-click on a value you wish to capture and look for the table that it is found in (follow the HTML up a bit).
The following example will extract all the values from that page within the "pricing" table. We need to know the XPath value for the table (this value is used to instruct HtmlAgilityPack what to look for) so that the document we create looks for our specific values. This can be achieved by finding whatever structure your values are in, right-clicking it, and choosing Copy XPath. From this we get...
//*[@id="pricing"]
Please note that sometimes the XPath you get from Chrome may be rather large. You can often simplify it by finding something unique about the table your values are in. In this example it is "id", but in other situations, it could easily be headings or class or whatever.
This XPath value looks for something with the id equal to pricing; that is our table. When we look further in, we see that our values are within tbody, tr and td tags. HtmlAgilityPack doesn't work well with tbody, so ignore it. Our new XPath is...
//*[@id='pricing']/tr/td
This XPath says look for the pricing id within the page, then look for text within its tr and td tags. Now we add the code...
Dim Web As New HtmlAgilityPack.HtmlWeb
Dim Doc As New HtmlAgilityPack.HtmlDocument
Doc = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND")
For Each table As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")
Next
To extract the values, we simply reference the table variable that was created in our loop and its InnerText member.
Dim Web As New HtmlAgilityPack.HtmlWeb
Dim Doc As New HtmlAgilityPack.HtmlDocument
Doc = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND")
For Each table As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")
MsgBox(table.InnerText)
Next
Now we have message boxes that pop up the values...you can switch the message box for an arraylist to fill or whatever way you wish to store the values. Now simply do the same for whatever other tables you wish to get.
Please note that the Doc variable that was created is reusable, so if you want to cycle through a different table on the same page, you do not have to reload the page. This is a good idea, especially if you are making many requests: you don't want to slam the website, and if you are automating a large number of scrapes, it puts some time between requests.
Scraping is really that easy. That's the basic idea. Have fun!
Html Agility Pack is going to be your friend!
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).
Looking at the source of the example page you provided, they are using HTML5 Microdata in their markup. I searched some more on CodePlex and found a microdata parser which may help too: MicroData Parser

Are there any anti-XSS libraries for ASP.Net?

I was reading some questions trying to find a good solution to preventing XSS in user-provided URLs (which get turned into links). I've found one for PHP but I can't seem to find anything for .NET.
To be clear, all I want is a library which will make user-provided text safe (including Unicode gotchas?) and make user-provided URLs safe (used in a or img tags).
I noticed that StackOverflow has very good XSS protection, but sadly that part of their Markdown implementation seems to be missing from MarkdownSharp. (and I use MarkdownSharp for a lot of my content)
Microsoft has the Anti-Cross Site Scripting Library; you could start by taking a look at it and determining if it fits your needs. They also have some guidance on how to avoid XSS attacks that you could follow if you determine the tool they offer is not really what you need.
There are a few things to consider here. Firstly, you've got ASP.NET Request Validation, which will catch many of the common XSS patterns. Don't rely exclusively on this, but it's a nice little value-add.
Next up you want to validate the input against a whitelist, and in this case your whitelist is all about conforming to the expected structure of a URL. Try using Uri.IsWellFormedUriString for compliance against RFC 2396 and RFC 2732:
var sourceUri = UriTextBox.Text;
if (!Uri.IsWellFormedUriString(sourceUri, UriKind.Absolute))
{
// Not a valid URI - bail out here
}
AntiXSS has Encoder.UrlEncode which is great for encoding string to be appended to a URL, i.e. in a query string. Problem is that you want to take the original string and not escape characters such as the forward slashes otherwise http://troyhunt.com ends up as http%3a%2f%2ftroyhunt.com and you've got a problem.
As the context you're encoding for is an HTML attribute (it's the "href" attribute you're setting), you want to use Encoder.HtmlAttributeEncode:
MyHyperlink.NavigateUrl = Encoder.HtmlAttributeEncode(sourceUri);
What this means is that a string like http://troyhunt.com/<script> will get escaped to http://troyhunt.com/&lt;script&gt; - but of course Request Validation would catch that one first anyway.
Also take a look at the OWASP Top 10 Unvalidated Redirects and Forwards.
I think you can do it yourself by creating an array of the characters and another array with the codes;
if you find characters from the array, replace them with the corresponding code. This will help you! [but definitely not 100%]
character array
<
>
...
Code Array
& lt;
& gt;
...
I rely on HtmlSanitizer. It is a .NET library for cleaning HTML fragments and documents from constructs that can lead to XSS attacks.
It uses AngleSharp to parse, manipulate, and render HTML and CSS.
Because HtmlSanitizer is based on a robust HTML parser, it can also shield you from deliberate or accidental "tag poisoning", where invalid HTML in one fragment can corrupt the whole document, leading to broken layout or style.
Usage:
var sanitizer = new HtmlSanitizer();
var html = @"<script>alert('xss')</script><div onload=""alert('xss')"""
+ @" style=""background-color: test"">Test<img src=""test.gif"""
+ @" style=""background-image: url(javascript:alert('xss')); margin: 10px""></div>";
var sanitized = sanitizer.Sanitize(html, "http://www.example.com");
Assert.That(sanitized, Is.EqualTo(@"<div style=""background-color: test"">"
+ @"Test<img style=""margin: 10px"" src=""http://www.example.com/test.gif""></div>"));
There's an online demo, plus there's also a .NET Fiddle you can play with.
(copy/paste from their readme)

parsing simple xml with jquery from asp.net webservice

I'm breaking my head over this for a while now and I have no clue what I do wrong.
The scenario is as follows: I'm using swfupload to upload files with a progress bar via a webservice. The webservice needs to return the name of the generated thumbnail.
This all goes well, and though I'd prefer to get the returned data as JSON (I might change it later in the swfupload js files), the default XML data is fine too.
So when an upload completes the webservice returns the following xml as expected (note I removed the namespace in webservice):
<?xml version="1.0" encoding="utf-8"?>
<string>myfile.jpg</string>
Now I want to parse this result with jquery and thought the following would do it:
var xml = response;
alert($(xml).find("string").text());
But I cannot get the string value. I've tried lots of combinations (.html(), .innerhtml(), response.find("string").text()) but nothing seems to work. This is my first time trying to parse XML via jQuery, so maybe I'm doing something fundamentally wrong. The 'response' is populated with the XML.
I hope someone can help me with this.
Thanks for your time.
Kind regards,
Mark
I think $(xml) is looking for a dom object with a selector that matches the string value of XML, so I guess it's coming back null or empty?
The first plugin mentioned below (xmldom) looks pretty good, but if your returned XML really is as simple as your example above, a bit of string parsing might be quicker, something like:
var start = xml.indexOf('<string>') + 8;
var end = xml.indexOf('</string>');
var resultstring = xml.substring(start, end);
From this answer to this question: How to query an XML string via DOM in jQuery
Quote:
There are 2 ways to approach this.
Convert the XML string to DOM, parse it using this plugin or follow this tutorial
Convert the XML to JSON using this plugin.
jQuery cannot parse XML. If you pass a string full of XML content into the $ function it will typically try to parse it as HTML instead using standard innerHTML. If you really need to parse a string full of XML you will need browser-specific and not-globally-supported methods like new DOMParser and the XMLDOM ActiveXObject, or a plugin that wraps them.
But you almost never need to do this, since an XMLHttpRequest should return a fully-parsed XML DOM in the responseXML property. If your web service is correctly setting a Content-Type response header to tell the browser that what's coming back is XML, then the data argument to your callback function should be an XML Document object and not a string. In that case you should be able to use your example with find() and text() without problems.
If the server-side does not return an XML Content-Type header and you're unable to fix that, you can pass the option type: 'xml' in the ajax settings as an override.
