What characters must be escaped in XML documents, or where could I find such a list?
If you use an appropriate class or library, it will do the escaping for you. Many XML issues are caused by building documents through string concatenation.
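As a rough illustration of letting a library do the escaping, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The function name build_element and the sample strings are made up for this example; any XML library in your own language should behave along the same lines.

```python
import xml.etree.ElementTree as ET


def build_element(text, attr_value):
    """Build a <valid> element and let the library escape text and attribute."""
    elem = ET.Element("valid")
    elem.set("attribute", attr_value)
    elem.text = text
    # The serializer inserts entity references wherever they are needed.
    return ET.tostring(elem, encoding="unicode")


xml_string = build_element('a < b & "c"', 'x > y & "z"')
print(xml_string)

# Parsing the result gives back the original, unescaped strings.
parsed = ET.fromstring(xml_string)
print(parsed.text)
print(parsed.get("attribute"))
```

The point is that the raw strings never have to be escaped by hand, which is what tends to go wrong with concatenation.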
XML escape characters
There are only five:
" &quot;
' &apos;
< &lt;
> &gt;
& &amp;
Escaping characters depends on where the special character is used.
The examples can be validated at the W3C Markup Validation Service.
Text
The safe way is to escape all five characters in text. However, the three characters ", ' and > needn't be escaped in text:
<?xml version="1.0"?>
<valid>"'></valid>
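If you do have to escape text yourself, the difference between "minimal" and "all five" can be sketched in Python with xml.sax.saxutils.escape, which by default handles only &, < and >. The two wrapper names below are hypothetical, chosen just for this example.

```python
from xml.sax.saxutils import escape


def escape_minimal(text):
    """Escape &, < and > only, which is enough for element text."""
    return escape(text)


def escape_all_five(text):
    """Escape all five characters, the 'safe way' described above."""
    return escape(text, {'"': "&quot;", "'": "&apos;"})


sample = "\"'<>&"
print(escape_minimal(sample))
print(escape_all_five(sample))
```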
Attributes
The safe way is to escape all five characters in attributes. However, the > character needn't be escaped in attributes:
<?xml version="1.0"?>
<valid attribute=">"/>
The ' character needn't be escaped in attributes if the quotes are ":
<?xml version="1.0"?>
<valid attribute="'"/>
Likewise, the " needn't be escaped in attributes if the quotes are ':
<?xml version="1.0"?>
<valid attribute='"'/>
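The quote-style rule for attributes is what Python's xml.sax.saxutils.quoteattr implements, so it makes a handy sketch: it picks whichever quote character avoids escaping and only falls back to &quot; when the value contains both kinds. The helper make_tag is a made-up name for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def make_tag(value):
    """Return an empty <valid> element whose attribute holds `value`."""
    # quoteattr escapes the value and adds the surrounding quotes itself.
    return "<valid attribute=" + quoteattr(value) + "/>"


print(make_tag("'"))   # double quotes are chosen, the apostrophe stays literal
print(make_tag('"'))   # single quotes are chosen, the double quote stays literal

# A value containing both quote characters plus & and < still round-trips.
tricky = "both ' and \" & <"
round_tripped = ET.fromstring(make_tag(tricky)).get("attribute")
print(round_tripped == tricky)
```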
Comments
None of the five special characters should be escaped in comments:
<?xml version="1.0"?>
<valid>
<!-- "'<>& -->
</valid>
CDATA
None of the five special characters should be escaped in CDATA sections:
<?xml version="1.0"?>
<valid>
<![CDATA["'<>&]]>
</valid>
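One caveat with CDATA is that the text itself must not contain ]]>. A common workaround, sketched here in Python, is to split that sequence across two adjacent CDATA sections. The function name to_cdata is hypothetical, and the round trip uses the standard library parser only to check the result.

```python
import xml.etree.ElementTree as ET


def to_cdata(text):
    """Wrap text in CDATA, splitting any ']]>' so it cannot end the section early."""
    # "]]>" becomes "]]" + end of section + start of new section + ">"
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


original = "\"'<>& and ]]> too"
document = "<valid>" + to_cdata(original) + "</valid>"
print(document)

# The parser joins the adjacent CDATA sections back into a single string.
round_tripped = ET.fromstring(document).text
print(round_tripped == original)
```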
Processing instructions
None of the five special characters should be escaped in XML processing instructions:
<?xml version="1.0"?>
<?process <"'&> ?>
<valid/>
XML vs. HTML
HTML has its own set of escape codes which cover a lot more characters.
Perhaps this will help:
List of XML and HTML character entity references:
In SGML, HTML and XML documents, the
logical constructs known as character
data and attribute values consist of
sequences of characters, in which each
character can manifest directly
(representing itself), or can be
represented by a series of characters
called a character reference, of which
there are two types: a numeric
character reference and a character
entity reference. This article lists
the character entity references that
are valid in HTML and XML documents.
That article lists the following five predefined XML entities:
quot "
amp &
apos '
lt <
gt >
According to the specifications of the World Wide Web Consortium (W3C), there are five characters that must not appear in their literal form in an XML document, except when used as markup delimiters or within a comment, a processing instruction, or a CDATA section. In all other cases, these characters must be replaced using either the corresponding entity or the numeric reference, according to the following table:
Original character    XML entity replacement    XML numeric replacement
<                     &lt;                      &#60;
>                     &gt;                      &#62;
"                     &quot;                    &#34;
&                     &amp;                     &#38;
'                     &apos;                    &#39;
Notice that the aforementioned entities can also be used in HTML, with the exception of &apos;, which was introduced with XHTML 1.0 and is not declared in HTML 4. For this reason, and to ensure backward compatibility, the XHTML specification recommends the use of &#39; instead.
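To check that the entity form and the numeric form really are interchangeable, a small Python sketch like the following can help. The REFERENCES table and the decode helper are names invented for this example; the parser is the standard library's ElementTree.

```python
import xml.etree.ElementTree as ET

# character -> (entity replacement, numeric replacement)
REFERENCES = {
    "<": ("&lt;", "&#60;"),
    ">": ("&gt;", "&#62;"),
    '"': ("&quot;", "&#34;"),
    "&": ("&amp;", "&#38;"),
    "'": ("&apos;", "&#39;"),
}


def decode(reference):
    """Parse a one-element document and return the decoded text content."""
    return ET.fromstring("<r>" + reference + "</r>").text


for char, (entity, numeric) in REFERENCES.items():
    print(char, decode(entity) == char, decode(numeric) == char)
```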
New, simplified answer to an old, commonly asked question...
Simplified XML Escaping (prioritized, 100% complete)
Always (90% important to remember)
Escape < as &lt; unless < is starting a <tag/> or other markup.
Escape & as &amp; unless & is starting an &entity;.
Attribute Values (9% important to remember)
attr=" 'Single quotes' are ok within double quotes."
attr=' "Double quotes" are ok within single quotes.'
Escape " as &quot; and ' as &apos; otherwise.
Comments, CDATA, and Processing Instructions (0.9% important to remember)
<!-- Within comments --> nothing has to be escaped but no -- strings are allowed.
<![CDATA[ Within CDATA ]]> nothing has to be escaped, but no ]]> strings are allowed.
<?PITarget Within PIs ?> nothing has to be escaped, but no ?> strings are allowed.
Esoterica (0.1% important to remember)
Escape control codes in XML 1.1 via Base64 or Numeric Character References.
Escape ]]> as ]]&gt; unless ]]> is ending a CDATA section. (This rule applies to character data in general, even outside a CDATA section.)
Escaping characters is different for tags and attributes.
For tags:
< &lt;
> &gt; (only for compatibility, read below)
& &amp;
For attributes:
" &quot;
' &apos;
From Character Data and Markup:
The ampersand character (&) and the left angle bracket (<) must not
appear in their literal form, except when used as markup delimiters,
or within a comment, a processing instruction, or a CDATA section. If
they are needed elsewhere, they must be escaped using either numeric
character references or the strings " &amp; " and " &lt; "
respectively. The right angle bracket (>) may be represented using the
string " &gt; ", and must, for compatibility, be escaped using either
" &gt; " or a character reference when it appears in the string " ]]>
" in content, when that string is not marking the end of a CDATA
section.
To allow attribute values to contain both single and double quotes,
the apostrophe or single-quote character (') may be represented as "
&apos; ", and the double-quote character (") as " &quot; ".
In addition to the commonly known five characters [<, >, &, ", and '], I would also escape the vertical tab character (0x0B). It is valid UTF-8, but not valid XML 1.0, and even many libraries (including the highly portable (ANSI C) library libxml2) miss it and silently output invalid XML.
Abridged from: XML, Escaping
There are five predefined entities:
&lt; represents "<"
&gt; represents ">"
&amp; represents "&"
&apos; represents '
&quot; represents "
"All permitted Unicode characters may be represented with a numeric character reference." For example:
&#20013; (represents 中)
Most of the control characters and other Unicode ranges are specifically excluded, meaning (I think) they can't occur either escaped or direct:
Valid characters in XML
The accepted answer is not correct. It is best to use a library for escaping XML.
As mentioned in this other question
"Basically, the control characters and characters out of the Unicode ranges are not allowed. This means also that calling for example the character entity is forbidden."
If you only escape the five characters, you can still run into problems such as "An invalid XML character (Unicode: 0xc) was found".
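For what it's worth, here is a rough Python sketch of filtering out characters that XML 1.0 does not allow at all, based on the Char production of the XML 1.0 specification (tab, line feed, carriage return, and the ranges U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF). The function name is made up, and silently dropping characters is only one possible policy; you may prefer to raise an error or encode the data (for example as Base64) instead.

```python
import re

# Matches anything outside the XML 1.0 "Char" production.
_INVALID_XML10_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Remove characters that cannot appear in an XML 1.0 document, even escaped."""
    return _INVALID_XML10_CHARS.sub("", text)


dirty = "a\x0bb\x0cc\td\n"   # contains a vertical tab (0x0B) and a form feed (0x0C)
clean = strip_invalid_xml_chars(dirty)
print(repr(clean))
```

Run this before the usual five-character escaping, since escaping alone does not make these characters legal.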
It depends on the context. For the content, it is < and &, and ]]> (though a string of three instead of one character).
For attribute values, it is <, &, ", and '.
For CDATA, it is ]]>.
Only < and & are required to be escaped if they are to be treated as character data and not markup:
2.4 Character Data and Markup
I am working on an Ext.NET project. I need to validate a password field so that blank spaces are not allowed before or after the password, and so that the password is no longer than 15 characters, including blank spaces. This is what I have so far, but it does not work.
The issue is that it treats a space between words as invalid.
E.g. it does not allow "Pass word"; what I want is to disallow " password" and "password ".
<ext:TextField ID="txtConfirmPwd" AllowBlank="false" InputType="Password" Name="txtConfirmPwd" runat="server" StyleSpec="width:96%;" Regex="^[^\s.^\s]{1,15}$" InvalidClass="invalidClass" Validator="ComparePwd" IDMode="Static">
<Listeners>
<Valid Handler="InvalidClass(this,true);" />
<Invalid Handler="InvalidClass(this,false);" />
</Listeners>
</ext:TextField>
You may use
Regex="^\S(?:.{0,13}\S)?$"
Details:
^ - start of string
\S - a nonwhitespace
(?:.{0,13}\S)? - 1 or 0 sequences of:
.{0,13} - any zero to thirteen chars
\S - a nonwhitespace symbol
$ - end of string.
This means, the first char must be a char other than whitespace and then there can be any up to 14 chars with the last one being a nonwhitespace.
You may actually use lookaheads to achieve the same, ^(?!\s)(?!.*\s$).{1,15}$. The (?!\s) is a negative lookahead that fails the match if the first (as the pattern is immediately following ^) char is a whitespace char and (?!.*\s$) fails the match if the whitespace appears right at the end of the string. However, it is unnecessarily complex for the current task.
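Both patterns can be sanity-checked quickly. The sketch below uses Python's re module purely as a test harness, which is an assumption on my part since Ext.NET evaluates the Regex property in the browser; for these particular patterns the syntax is the same. The names TRIMMED_MAX_15, LOOKAHEAD_VARIANT and is_valid_password are invented for the example.

```python
import re

TRIMMED_MAX_15 = re.compile(r"^\S(?:.{0,13}\S)?$")
LOOKAHEAD_VARIANT = re.compile(r"^(?!\s)(?!.*\s$).{1,15}$")


def is_valid_password(value, pattern=TRIMMED_MAX_15):
    """True if value has 1-15 chars and no leading or trailing whitespace."""
    return pattern.fullmatch(value) is not None


samples = ["Pass word", " password", "password ", "a", "a" * 15, "a" * 16, ""]
for sample in samples:
    print(repr(sample),
          is_valid_password(sample),
          is_valid_password(sample, LOOKAHEAD_VARIANT))
```

Both patterns should agree on every sample: inner spaces are fine, leading or trailing spaces and anything over 15 characters are rejected.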
I need to validate the scenario in Regex, I'm using RegularExpression validation in ASP.NET.
Shouldn't start or end with SPACE
Doesn't contain only SPACE
whole string shouldn't contain two special char "#" & "?"
Valid:
"as#d qwe2", "&^%$$(&+_", "12#$.p"
InValid:
" ", "asd ", " asd#", "ksdhf?kh", "asdf#asd"
I'm trying with this:
<asp:RegularExpressionValidator ID="RegularExpressionValidator1" runat="server" ControlToValidate="TextBox1"
ErrorMessage="RegularExpressionValidator"
ValidationExpression="^[^\s]+(\s+[^#?]+)*[^\s]$">Error</asp:RegularExpressionValidator>
Untested RegEx: ^([^ #?][^#?]*[^ #?]|[^ #?])$
There are a couple of problems with the regex you are currently using:
^[^\s]+ is matching almost all of the string if it does not start with a space character (thus matching # or ?).
Due to the way the regex is constructed, you can only input strings of length 2 and above. This is a minor setback, but can be avoided.
I would suggest using negative lookaheads since there are multiple 'checks' to do on the first character:
^(?!\s|.*\s$)[^?#]+$
(?!\s|.*\s$) will prevent a match if the string begins with \s or ends with \s and [^?#]+ matches all characters but ? and #. Space only strings will automatically be rejected because space only strings will have to begin with a space.
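Here is a quick way to try the lookahead pattern against sample inputs. Python's re module is used only as a convenient harness (the validator itself runs in .NET and JavaScript, so treat this as an approximation), and the names PATTERN and is_valid are made up. Note that this pattern rejects every string containing # or ?.

```python
import re

PATTERN = re.compile(r"^(?!\s|.*\s$)[^?#]+$")


def is_valid(value):
    """True if value has no leading/trailing whitespace and no '#' or '?'."""
    return PATTERN.fullmatch(value) is not None


accepted = ["asd qwe2", "&^%$$(&+_", "a"]
rejected = [" ", "asd ", " asd#", "ksdhf?kh", "asdf#asd", ""]

for sample in accepted + rejected:
    print(repr(sample), is_valid(sample))
```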
I have an AppSetting in web.config.
<add key="key" value="\n|\r"/>
When I read it with ConfigurationManager.AppSettings["key"] it gives "\\n|\\r".
Why?
In the debugger, because the backslash is a special character used for things like tabs (\t) and line endings (\n), it has to be escaped with another backslash. Hence any text that contains an actual \ will be displayed as \\. If you print it out to a file or use it in any other way, you will find your string contains only the one \.
This isn't ConfigurationManager doing anything.
The backslash escaping syntax is only recognized inside of string literals by the C# compiler. Since your string is being read from an XML file at runtime, you need to use XML-compatible escaping (character references) in order to include those characters in your string. Thus, your app settings entry should look like the following:
<add key="key" value="&#10;|&#13;"/>
Because 10 and 13 are the decimal code points for linefeed and carriage return, respectively (&#xA; and &#xD; are the hexadecimal equivalents).
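The difference between backslash sequences and XML character references is easy to see with any XML parser. The sketch below uses Python's ElementTree rather than .NET's ConfigurationManager, so it only illustrates the XML side of the behaviour; read_setting is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def read_setting(xml_text):
    """Return the parsed 'value' attribute of a single <add> element."""
    return ET.fromstring(xml_text).get("value")


# Numeric character references are decoded by the XML parser.
with_references = read_setting('<add key="key" value="&#10;|&#13;"/>')

# Backslash sequences mean nothing to XML: they stay as two characters each.
with_backslashes = read_setting(r'<add key="key" value="\n|\r"/>')

print(repr(with_references))
print(repr(with_backslashes))
```

The first value holds a real line feed and carriage return, while the second holds literal backslashes, which is what the debugger displays as "\\n|\\r".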
Like cjk said, the extra slash is being inserted by the debugger to indicate that it is seeing a literal slash and not an escape sequence.
I solved the same problem with a string replacement.
Not beautiful, but it works!
ConfigurationManager.AppSettings["Key"].Replace("\\n", "\n")
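string str = "\n";   // regular literal: means \n (a single newline character)
string str1 = @"\n"; // verbatim literal: means \\n (a backslash followed by 'n')
When you read the key's value from AppSettings, it seems to behave as if it had been written as a verbatim (@) string. Escape sequences are processed by the compiler in string literals, not at runtime.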
What is the escape sequence for &-sign in string literals in web.config?
& -> &amp;
here: What characters do I need to escape in XML documents?
Use "&amp;" instead of "&".
I'm afraid the & doesn't work in web.config, it must have stricter rules than normal XML:
<add key="ST_CF_AppType_ipad" value="http://itunes.apple.com/se/app/itunes-u/id490217893?l=en&mt=8" />
Gives an error occurred while parsing EntityName.
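A parse error of this kind is what any conforming XML parser reports for a bare & inside an attribute value. The sketch below reproduces it with Python's ElementTree as a stand-in for the .NET configuration reader; the example.com URL and the helper read_value are placeholders made up for the illustration.

```python
import xml.etree.ElementTree as ET

RAW = '<add key="k" value="http://example.com/app?l=en&mt=8" />'
ESCAPED = '<add key="k" value="http://example.com/app?l=en&amp;mt=8" />'


def read_value(xml_text):
    """Return the 'value' attribute, or None if the XML is not well-formed."""
    try:
        return ET.fromstring(xml_text).get("value")
    except ET.ParseError:
        return None


print(read_value(RAW))      # the bare & starts an entity reference that never ends
print(read_value(ESCAPED))  # the parser turns &amp; back into a plain &
```

So the application still receives a plain & in the URL; the &amp; exists only in the file.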