XSLT. load xml document that contains escape characters - asp.net

I use XSLT to transform an XML document, which I then load on an ASP.NET website. However, if the XML contains '<' characters, the XML becomes malformed.
<title><b> < left arrows <b></title>
If I use disable-output-escaping="yes", the XML cannot be loaded and I get the error "Name cannot begin with the '' character".
If I do not disable output escaping the escaped characters are disregarded and the text appears as it is:
<title><b> < left arrows <b></title>
I want the bold tags to work, but I also want to escape the '<' character. Ideally
<b>&lt; left arrows</b>
is what I want to achieve. Is there any solution for this?

The XML should contain the escaped sequence for the less-than sign (&lt;), not the literal < character. With a literal < in the text, the XML is malformed and any XML parser must reject it.
In XSLT you could generate that sequence like this:
<xsl:text>&lt;</xsl:text>

From what I understand, the input contains HTML and literal < characters. In that case, disable-output-escaping="yes" will preserve the HTML tags but produce invalid XML and setting it to no means the HTML tags will be escaped.
What you need to do is leave disable-output-escaping set to "no" (which is the default, so you don't actually have to add it) and add an XSLT rule that copies the HTML tags. For instance:
<xsl:template match="*">
<xsl:copy>
<xsl:copy-of select="@*" />
<xsl:apply-templates />
</xsl:copy>
</xsl:template>

I came up with a solution, inspired by the last answer by Josh. Thanks, Josh. I tried to use the match template, but the HTML tags are placed within CDATA, so I had difficulty matching them. There might be a way to do it, but I gave up on that.
What I did was use test="contains($text, $replace)", where $replace is the '<' character, and I added a condition that checks whether the substring after the '<' begins a relevant HTML tag, i.e. whether it is actually <b> or </b>. If it is just a '<' character that does not belong to any HTML tag, I convert it to its escaped form, &lt;. That solved my problem. I hope this is useful to anyone who encounters the same problem.
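The check described above can be sketched outside XSLT as well. The following is a minimal Python illustration, not the asker's stylesheet: it assumes that only <b> and </b> count as real markup, and the function name escape_stray_lt is made up for this example. It does not handle stray '&' characters.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: only <b> and </b> are treated as real markup.
# The pattern matches any '<' that does NOT start one of those two tags.
STRAY_LT = re.compile(r"<(?!/?b>)")

def escape_stray_lt(text: str) -> str:
    """Replace every '<' that does not open <b> or </b> with '&lt;'."""
    return STRAY_LT.sub("&lt;", text)

raw = "<b>< left arrows</b>"
fixed = escape_stray_lt(raw)
print(fixed)  # <b>&lt; left arrows</b>

# The escaped string is well-formed, so an XML parser accepts it.
title = ET.fromstring("<title>" + fixed + "</title>")
print(title.find("b").text)  # < left arrows
```

A whitelist of tags keeps the rule simple; a longer list of allowed tags would need a correspondingly longer lookahead.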

Related

xdmp:document-load throws XDMP-DOCDUPATTR for malformed html document, even with repair=full

I used this expression to load an html document:
xdmp:document-load("http://example.com/index.html",
<options xmlns="xdmp:document-load" xmlns:http="xdmp:http">
<uri>/documents/content.xml</uri>
<repair>full</repair>
<format>xml</format>
</options>)
The repair full option works well with unclosed tags. But one of the tags has two attributes with the same name, and this causes an error XDMP-DOCDUPATTR.
Is there a way to avoid this error?
You can try getting the document as text and then applying tidy -- there's an example at the end of:
http://docs.marklogic.com/xdmp:tidy
Hoping that helps
You could also load HTML documents as flat text: <format>text</format> instead of <format>xml</format>. The document will be a single text node. All the HTML will be preserved, but there will be no XML structure so XPath won't be useful.

Is there a way to escape non-alphanumeric characters in Nokogiri css?

I have an anchor tag:
file.html#stuff-morestuff-CHP-1-SECT-2.1
Trying to pull the referenced content in Nokogiri:
documentFragment.at_css('#stuff-morestuff-CHP-1-SECT-2.1')
fails with the error:
unexpected '.1' after '[#<Nokogiri::CSS::Node:0x007fd1a7df9b40 @type=:CONDITIONAL_SELECTOR, @value=[#<Nokogiri::CSS::Node:0x007fd1a7df9b90 @type=:ELEMENT_NAME, @value=["*"]>, #<Nokogiri::CSS::Node:0x007fd1a7df9cd0 @type=:ID, @value=["#unixnut4-CHP-1-SECT-2"]>]>]' (Nokogiri::CSS::SyntaxError)
Just trying to talk through this: I think Nokogiri is complaining about the .1 in the selector ID, because . is not valid in an HTML id.
I don't own the content, so I really don't want to go through and fix all the bad IDs if it is avoidable. Is there a way to escape non-alphanumeric selectors in a nokogiri .css() call?
Assuming your HTML looks something like this:
<div id='stuff-morestuff-CHP-1-SECT-2.1'>foo</div>
The string in question, stuff-morestuff-CHP-1-SECT-2.1, is a valid HTML ID, but it isn’t a valid CSS selector — the . character isn’t valid there.
You should be able to escape the . with a backslash character, i.e. this is a valid CSS selector:
#stuff-morestuff-CHP-1-SECT-2\.1
Unfortunately this doesn't seem to work in Nokogiri; there may be a bug in its CSS-to-XPath translation. (It does work in the browser.)
You can get around this by just checking the id attribute directly:
documentFragment.at_css('*[id="stuff-morestuff-CHP-1-SECT-2.1"]')
Even if backslash escaping worked, you would probably have to check the id attribute like this if its value started with a digit, which is valid in HTML but cannot be (as far as I can tell) expressed as a CSS selector, even with escaping.
You could also use XPath, which has an id function that you can use here:
documentFragment.xpath("id('stuff-morestuff-CHP-1-SECT-2.1')")

What is cstyle in XSLT?

My XSLT is shown below.
aic is a namespace.
What is cstyle?
Is it a built-in XSLT element/function?
Or an element within the expected input xml?
<xsl:stylesheet exclude-result-prefixes="aic"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:aic="http://ns.adobe.com/AdobeInCopy/2.0/" >
<xsl:template match="/">
</xsl:template>
<xsl:template match="aic:cstyle[contains(@name,'bold')]">
</xsl:template>
</xsl:stylesheet>
It is an element within the expected input XML. The XPaths in an XSLT's match attributes are generally applied to contents from the input XML.
Exactly as in my answer to your previous question, aic:cstyle is a selector that matches elements whose local name is cstyle and whose namespace URI is http://ns.adobe.com/AdobeInCopy/2.0/ (the URI bound to the aic prefix in the xsl:stylesheet element). Thus
<xsl:template match="aic:cstyle[contains(@name,'bold')]">
is a template that will apply to any {http://ns.adobe.com/AdobeInCopy/2.0/}cstyle element that has a name attribute that contains the substring bold. (So, to answer your question directly: the expression in question will match elements in the input streams for which the stylesheet was written.)
As with any new programming language, I would strongly recommend that you find a decent tutorial and work through that to get comfortable with the syntax and idioms of the language through simple examples before you start trying to decode a large and complex XSLT that you've inherited from elsewhere.
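To make the matching rule concrete, here is a small Python sketch that selects the same elements the match pattern describes: elements named cstyle in the Adobe InCopy namespace whose name attribute contains "bold". The sample document, including the aic:story wrapper element, is invented for illustration; only the namespace URI comes from the stylesheet.

```python
import xml.etree.ElementTree as ET

AIC_NS = "http://ns.adobe.com/AdobeInCopy/2.0/"

# Hypothetical input document; the element names other than cstyle
# are assumptions made for this example.
sample = """
<aic:story xmlns:aic="http://ns.adobe.com/AdobeInCopy/2.0/">
  <aic:cstyle name="bold-italic">one</aic:cstyle>
  <aic:cstyle name="plain">two</aic:cstyle>
  <cstyle name="bold">three</cstyle>
</aic:story>
"""

root = ET.fromstring(sample)

# "{namespace-uri}local-name" is the expanded-name form used above.
matches = [
    el for el in root.iter("{%s}cstyle" % AIC_NS)
    if "bold" in el.get("name", "")
]
print([el.text for el in matches])  # ['one']
```

The third element is skipped even though its name attribute contains "bold", because it is not in the namespace bound to the aic prefix.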

Convert copy-of output to string and escape XML special characters (like less than (<) and greater than (>) symbols)

I am trying this in an XQuery (assume that doc('input:instance') does indeed return a valid XML document) which is generated using XSLT
let $a:= <xsl:text>"<xsl:copy-of select="doc('input:instance')//A" />"</xsl:text>
let $p := <xsl:text>"<xsl:copy-of select="doc('input:instance')//P" />"</xsl:text>
let $r := <xsl:text>"<xsl:copy-of select="doc('input:instance')//R" />"</xsl:text>
But I get the error:
xsl:text must not contain child elements
How do I retrieve XML results using the XPath in xsl:copy-of and then encode the special characters received in the result while formatting the result as string? I would be happy to use CDATA section if that's possible (if I do that instead of xsl:text above, xsl:copy-of is not evaluated since it becomes part of CDATA section).
Obviously I am a newcomer to XSL...
What you need here is the ability to serialize an XML document (here the document returned by doc()) using the XML serialization, into a string.
Various XQuery implementations have extension functions for this purpose. For example, if you are using Saxon:
saxon:serialize(document, 'xml')
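As a rough illustration of what "serialize, then escape" means, here is a Python sketch using only the standard library. It is not Saxon and not XQuery; the input document is made up, and the element name A is borrowed from the question.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical stand-in for the document returned by doc('input:instance').
doc = ET.fromstring("<root><A>x &lt; y</A><P>text</P></root>")

# Step 1: serialize the selected element back to XML markup as a string.
a = doc.find(".//A")
serialized = ET.tostring(a, encoding="unicode")
print(serialized)  # <A>x &lt; y</A>

# Step 2: escape the markup so it can be embedded as plain text.
escaped = escape(serialized)
print(escaped)  # &lt;A&gt;x &amp;lt; y&lt;/A&gt;
```

Note that the already-escaped &lt; inside the element becomes &amp;lt; after the second step, which is what you want when the whole serialized element is treated as a string.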
This has nothing to do with XQuery (you could be building the XSLT stylesheet with any language, even XSLT itself!).
From http://www.w3.org/TR/xslt20/#xsl-text
<!-- Category: instruction -->
<xsl:text
[disable-output-escaping]? = "yes" | "no">
<!-- Content: #PCDATA -->
</xsl:text>
[...] The content of the xsl:text
element is a single text node whose
value forms the string value of the
new text node.

Regular Expression in XSLT

I have an XSLT function that takes a regular expression as a parameter but the XSLT parser does not like it.
Here is the code:
<xsl:value-of select='ns:RegexReplace($variable, "", "style=\"\w+\:\s\w+;\"")' disable-output-escaping='yes' />
I found this:
http://www.xml.com/pub/a/2003/06/04/tr.html <-- but it is using what I am and seems to work (for them). Do I just have a rubbish parser??
Is there any way of doing this?
Or, a way of forcing an element to ignore inline style via a CSS trick?
You seem to be trying to include quotes in a quote-delimited XPath string literal by escaping them with a backslash. That does not work.
In XPath 1.0 (XSLT 1), there is no nice way to do this. You may need to resort to tricks like defining a variable which holds a single quote character and using the concat function to create your string:
<xsl:variable name='quot' select="'&quot;'"/>
<xsl:value-of select='concat("a string with a quote ", $quot, " character")'/>
In XPath 2.0 (XSLT 2), you can escape a quote with another quote:
<xsl:value-of select='"a string with a quote "" character"'/>
It occurs to me that you may be trying to remove style attributes. If that is the case, then string replacement is not going to help you.
You can remove style attributes for example by writing a template which matches them and outputs nothing:
<xsl:template match="@style"/>
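The point that style attributes are better removed structurally than by string replacement can be shown with a short Python sketch. This is only an analogy to the template above, using a made-up fragment; it walks the parsed tree and drops each style attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with inline styles.
markup = '<div style="color: red;"><p style="margin: 0;" class="x">hi</p></div>'
root = ET.fromstring(markup)

# Visit every element and remove its style attribute if it has one.
for el in root.iter():
    el.attrib.pop("style", None)

result = ET.tostring(root, encoding="unicode")
print(result)  # <div><p class="x">hi</p></div>
```

Because the attribute is removed from the parsed tree, the quoting and spacing of the original style value do not matter, which is where a regular expression tends to go wrong.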

Resources