Parsing XHTML with DTD using XDocument - xhtml

I need to get plain text from XHTML documents.
I am sure I read somewhere here that XDocument on WP7 does not support DTDs, but I cannot find it now. When I try to parse XHTML with a DTD using XDocument, it throws NotSupportedException. The last call in the stack trace is System.Xml.XmlTextReaderImpl.ParseDoctypeDecl().
The result is exactly the same even if I supply a dummy XmlResolver; it never gets called (following the answer in this question).
So I assume that WP7 really doesn't support it.
Still, I need to parse XHTML documents. So far I have come up with two (more or less real) solutions:
I can parse the document if I remove the DTD declaration. But the XHTML can contain character entities, and an exception is thrown if an entity is not one of the predefined XML entities.
So that solution works only for some XHTML documents.
I thought of using Regex. It is quite easy to remove all the HTML tags, but the 'entity problem' remains, as I don't think doing a replace for every entity is a real/good solution.
Anyone faced/solved this? Can you give me some advice or correct me if I am wrong on something?
Thanks.

HTML Agility Pack is a library for parsing HTML documents. According to this forum discussion, it has a version for WP7:
http://htmlagilitypack.codeplex.com/discussions/225113
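The reason a lenient HTML parser helps here is that it ignores the DOCTYPE and resolves named character entities itself. The following is only a Python sketch of that idea using the standard library's html.parser, not the HTML Agility Pack API or anything WP7-specific; the class name TextExtractor and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; tags and the DOCTYPE are simply skipped."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode &eacute;, &nbsp;, etc.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


sample = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    "<html><body><p>Caf&eacute;&nbsp;&amp; bar</p></body></html>"
)
text = plain_text(sample)
```

Note that this sketch also keeps the contents of script and style elements, which a real extractor would probably want to drop.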

Related

ruby nokogiri parse within parsed

I'm just learning how to program in Ruby using the nokogiri gem.
doc.xpath("//*[@class='someclass']//@href")
will return all href values under the "someclass" class somewhere in the HTML.
doc.xpath("//*[@class='someclass']").xpath("//@href")
will return all href values in the entire HTML.
Could someone explain how to use an equivalent of //@ in XPath within already-parsed data, so that something like
doc.xpath("//*[@class='someclass']").xpath(grab all the href within previously parsed)
is possible?
Using * and @ seems quite powerful, but I can't seem to narrow the search down; wherever I use them, they search through the entire HTML.
As a beginner, I just thought it would be intuitive to be able to use "grab from everywhere" syntax limited to what has been parsed previously, to narrow down my target, so that I can do something like
xpath(whatever).css(whatever).xpath(whatever)
Maybe this is not good practice? Maybe with a better understanding of parsing concepts I would never have to do this? Sometimes I find using both XPath and CSS easier.
Hopefully someone can enlighten me.
Try changing your second expression from
doc.xpath("//*[@class='someclass']").xpath("//@href")
to
doc.xpath("//*[@class='someclass']").xpath(".//@href")
// at the beginning of an XPath expression means "descendants of the root of the document," whereas .// means "descendants of the context node(s)."
You're right that XPath is powerful, and some major aspects of it are intuitive... but there are significant pieces that aren't intuitive, or depend on how your intuition is trained. Careful study reaps dividends, especially if you are going to use XPath much!
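The same context-relative idea can be sketched outside Ruby. This is a Python illustration using xml.etree.ElementTree, which supports only a subset of XPath: it has no attribute-selecting step, so the sketch selects elements that carry an href and reads the attribute afterwards. The sample markup and the helper name hrefs_under are invented for the example.

```python
import xml.etree.ElementTree as ET

DOC = """
<html><body>
  <div class="someclass"><a href="/inside-1">one</a><p><a href="/inside-2">two</a></p></div>
  <div class="other"><a href="/outside">three</a></div>
</body></html>
"""

root = ET.fromstring(DOC)


def hrefs_under(context):
    # The leading "." makes the search relative to the context element,
    # so only its descendants are visited.
    return [el.get("href") for el in context.findall(".//*[@href]")]


matches = root.findall(".//*[@class='someclass']")
scoped = [h for node in matches for h in hrefs_under(node)]
everything = hrefs_under(root)
```

Here scoped only contains the links inside the "someclass" element, while everything also picks up the link in the other div.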

ASP.NET + MVC4 - "faking" a model? working without a datatable

I'm not an ASP.NET programmer, but, as it happens in life, I had to do some minor projects using it. Now came another one in which I have to implement some custom solutions and I haven't figured it out yet - I need some tip or maybe a piece of advice like "don't go that way" ;)
Previously it was simple: there was a table in the DB, there was a matching model and a view that worked with it, and it worked like a charm. Now it's a little more complicated.
The "site" is going to contain, roughly speaking, a survey, but unfortunately a fully configurable one. In another product there's going to be a configuration manager that will allow the user to define pages, block types, questions, steps and so on, and will generate an XML document.
For the time being, in accordance with the specification, the site's database will have only one table, containing just a key and the XML generated by the configurator (and maybe some additional, unimportant information). Now I need to parse this XML and build the site containing pages and other elements corresponding to it.
That would not be a problem in itself, but I don't really know how to work that way using ASP.NET + MVC and can't find any advice that helps. Should I create an object that would somehow fake being a model and allow me to work, for example, on a dataset generated from the XML? Or just create a model of the mentioned table and work with the XML directly in the view (I don't even like the idea)? Or, since I have to do something like that, just give up on MVC and use only "plain" ASP.NET? Or maybe something else?
I'll be very grateful for any help.
And I hope I described what I need understandably ;)
If the XML documents have a schema defined, then you can easily generate a class that matches the document using the xsd.exe tool. The document can then be deserialized into an instance of that class using existing functionality in the .NET Framework. Just google .NET XML serialization :-)
Now, if you don't have a schema, you could create one if you are sure that you know the format of the XML. Alternatively, you could create a class that matches the format you expect to get and then parse the XML manually. This last option is much more work, so I wouldn't recommend it.
In any case, the class you end up with should contain all the data you need from the XML document and can then be used as the Model in your MVC page. As long as you can use the standard XML deserialization technique, this should be quite easy and painless.
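The shape of that approach (XML in, plain model object out) can be sketched in a language-neutral way. The following Python example is not the xsd.exe/.NET deserialization route described above; it is the "parse manually into a class" variant, and the survey format, element names and class names are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Question:
    id: str
    text: str


@dataclass
class Page:
    title: str
    questions: List[Question] = field(default_factory=list)


@dataclass
class Survey:
    key: str
    pages: List[Page] = field(default_factory=list)


def parse_survey(key, xml_text):
    """Build a Survey model from the (hypothetical) configurator XML."""
    root = ET.fromstring(xml_text)
    pages = [
        Page(
            title=p.get("title", ""),
            questions=[
                Question(id=q.get("id", ""), text=(q.text or "").strip())
                for q in p.findall("question")
            ],
        )
        for p in root.findall("page")
    ]
    return Survey(key=key, pages=pages)


CONFIG = """
<survey>
  <page title="About you">
    <question id="q1">What is your name?</question>
    <question id="q2">Where do you live?</question>
  </page>
  <page title="Feedback">
    <question id="q3">Any comments?</question>
  </page>
</survey>
"""
survey = parse_survey("demo", CONFIG)
```

The resulting survey object plays the role of the model: the view only walks pages and questions and never touches the raw XML.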

Displaying XML using CSS: How to handle &nbsp?

I'm dealing with a lot of .xml files (millions: an .xml-formatted dump of Wikipedia), and they're a lot more unreadable than I imagined.
For the time being, I've written a .css file to display them in a readable manner in a browser, and wrote a script to plug a reference to this .css into all the files.
(I know there are other solutions, like XSLT, but all the information I found made it seem document-level, which didn't suit. I'm really trying not to expand the size of these files if possible.)
The .css works fine for some of the files, but many contain entities like &nbsp and I get errors like:
"XML Parsing Error: undefined entity" with a nice little illustration pointing to &nbsp or its kin within a quote.
There is an articles.dtd file, which seems like it should connect the dots ( keyword -> Unicode ) for the browser. It is referenced in each file like:
<!DOCTYPE article SYSTEM "../article.dtd">
and contains a lot of entries like:
<!ENTITY nbsp "&#160;"> <!-- no-break space = non-breaking space,
U+00A0 ISOnum -->
but either I'm entirely misunderstanding what this file is for, or it's not working correctly.
In any case, how can I make these documents display? Either by:
displaying the entities as plain text (like "&nbsp")
removing the entities altogether (by any means other than a linear search/removal of them in the actual files)
interpreting the entities as Unicode, as they were intended
Naturally, the last option is preferable; ideally, by referencing some sort of external file that maps entities to Unicode (if that's not what the articles.dtd file is for...).
EDIT: I'm not working with a powerful machine here.. extracting the .rars took days. Any sort of edits to each file would take a very long time.
This is not a very good way, just a workaround: try replacing &nbsp; with the numeric character reference &#160;.
I've since solved my problem. In case it helps anyone in the future:
It turned out the core of my problem was that external .dtd files are effectively deprecated.
The function of the .dtd was in fact to declare the entities I was having trouble with (&nbsp; etc.), as I thought. But because external .dtd files are no longer supported by browsers (the browsers simply don't fetch or parse them, and the only way to force them to depends on files in the browser's install on the client machine), the entities went undeclared.
I had sourced an XML collection that was simply too old to be up to standards, without realizing it.
The best solution for my circumstances turned out to be lazy processing of each file as it was requested, with a simple flag to differentiate processed files from unprocessed ones.
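The per-file processing step is not shown in the post. One way it could look, as a hedged Python sketch: rewrite named entities to numeric character references, which need no DTD. The function name to_numeric is invented, and the entity table is the HTML one shipped with Python (html.entities.name2codepoint), which is an assumption about which entities the dump uses.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# These five are predefined by XML itself and must be left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric(text):
    """Replace known named entities with numeric character references."""

    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave predefined or unknown names untouched
        return "&#%d;" % name2codepoint[name]

    return ENTITY.sub(repl, text)


sample = "<article><p>Tom&nbsp;&amp;&nbsp;Jerry &copy; 1940</p></article>"
converted = to_numeric(sample)
parsed_text = ET.fromstring(converted).find("p").text
```

A plain regular expression like this would also rewrite entity-looking text inside CDATA sections, so it is only a starting point.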

Data storage hazards and problems when using XmlDataSource as persistence layer in an ASP.NET web application

This is a question about how to solve a possible, real problem that occurs when deciding to use XML files to store data in a web application.
The scenario
Suppose you want to build a web application to run a personal blog.
Naturally, this application needs to store data, and the text is formatted using particular tools (Markdown, for example), where the written content is converted to HTML.
So my XML files will have to store HTML tags....
The problem
In my XML files, how can I store HTML data without violating the specified XSD?
For example, if I try to store HTML tags... the XML validation will surely fail, I suppose...
But the one I mentioned is only one of the possible problems that occur when using XML (for example, will data binding suffer from problems because of what I mentioned before?).
Can you tell me what the general approach to this problem is (patterns and best practices)?
Thank you.
To answer the portion of your question about storing HTML in XML: in your XML file, wrap your HTML in a CDATA section.
<![CDATA[
html
]]>
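To make the CDATA approach concrete, here is a small Python sketch using xml.dom.minidom to write the section and xml.etree.ElementTree to read it back. The element names post and body are hypothetical, and the question's XmlDataSource setup is not modelled.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

html = "<p>Hello <b>world</b> &amp; friends</p>"

# Build <post><body><![CDATA[...]]></body></post>
doc = minidom.Document()
post = doc.createElement("post")
doc.appendChild(post)
body = doc.createElement("body")
post.appendChild(body)
body.appendChild(doc.createCDATASection(html))

xml_text = doc.documentElement.toxml()

# Reading it back: the parser hands over the CDATA content as plain text,
# so the stored HTML comes out unchanged.
restored = ET.fromstring(xml_text).find("body").text
```

To the schema, the body element then only contains character data, so the HTML tags inside it are not validated as XML elements. One limitation: the stored HTML must not itself contain the sequence ]]>, since that would end the section early.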

Character entity references - numeric or not?

So, I know that I can represent an ampersand as &amp; or &#38;.
I have found that at least one method of parsing XML does not allow for the abbreviation-based style - only numeric. Is there a best-practice? I want to instruct my team to use the numeric versions because of my experience, but one instance hardly seems like enough reason to convince them.
Which method should we favor?
XML only has a small set of these symbolic entities, for amp, quot, gt and lt.
The symbolic names we're familiar with, such as &copy;, exist because of their appearance in the HTML DTD, here http://www.w3.org/TR/html4/sgml/entities.html (although I think most browsers have this baked in).
Therefore, if you are using (X)HTML, get your doctype right, and then follow the links on w3.org to XHTML to see the entities available.
As far as best practices, most people find the symbolic names easier to understand and will use them when available. I would recommend that.
The only reason not to is that there used to be cases in very old browsers where entities wouldn't work, but I don't believe this is the case any more.
If you mean other HTML entities, with pure XML, only the entities amp, lt, gt, quot, and apos are pre-defined (apos is not available in HTML, but amp indeed should be).
However, all other HTML entities (such as nbsp) will not be available unless defined in the DOCTYPE, so in such a case, using numeric entities may indeed be preferable.
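This difference is easy to check with a parser that sees no DTD. The following is a small Python sketch using xml.etree.ElementTree; the helper name parses is made up for the example.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if the text is accepted by a plain XML parser with no DTD."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


predefined_named = parses("<p>a&amp;b</p>")   # amp is predefined in XML
html_named = parses("<p>a&nbsp;b</p>")        # nbsp is not declared anywhere
numeric = parses("<p>a&#160;b</p>")           # numeric references need no declaration
```

Only the middle case is rejected, which matches the experience described in the question.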
