We have this asp.net app that is pure Arabic. Most fields in this application accept only Arabic characters, not a single English letter should be there. So I was thinking that if I strip all the English characters by whitelisting all Arabic characters and any punctuation I want XSS can't slip into those fields. This is basically an advantage of having a non Latin characters inputs.
Right? Why?
Chances for XSS are rare, but you probably still need to substitute special characters (some of the punctuations, like ", ', <, >, etc) for two reasons.
Security Concerns
Malicious javascript code could be written WITHOUT ENGLISH CHARS. For instance, the following javascript code equals to alert(document.cookie), you could try it in your browser console:
[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]][([][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]]+[])[!+[]+!+[]+!+[]]+(!![]+[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]])[+!+[]+[+[]]]+([][[]]+[])[+!+[]]+(![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[+!+[]]+([][[]]+[])[+[]]+([][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]])[+!+[]+[+[]]]+(!![]+[])[+!+[]]]((![]+[])[+!+[]]+(![]+[])[!+[]+!+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]+(!![]+[])[+[]]+(![]+[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]])[!+[]+!+[]+[+[]]]+([][[]]+[])[!+[]+!+[]]+(!![]+[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]])[+!+[]+[+[]]]+([][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]]+[])[!+[]+!+[]+!+[]]+([][[]]+[])[+[]]+((+[])[([][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]]+[])[!+[]+!+[]+!+[]]+(!![]+[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]])[+!+[]+[+[]]]+([][[]]+[])[+!+[]]+(![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[+!+[]]+([][[]]+[])[+[]]+([][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]])[+!+[]+[+[]]]+(!![]+[])[+!+[]]]+[])[+!+[]+[+!+[]]]+(!![]+[])[!+[]+!+[]+!+[]]+([][[]]+[])[+!+[]]+(!![]+[])[+[]]+(+(+!+[]+[+!+[]]+(!![]+[])[!+[]+!+[]+!+[]]+[!+[]+!+[]]+[+[]])+[])[+!+[]]+([][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]]+[])[!+[]+!+[]+!+[]]+(!![]+[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]])[+!+[]+[+[]]]+(!![]+[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]])[+!+[]+[+[]]]+(+(!+[]+!+[]+[+[]]))[(!![]+[])[+[]]+(!![]+[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]])[+!+[]+[+[]]]+(+![]+([]+[])[([][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]]+[])[!+[]+!+[]+!+[]]+(!![]+[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]])[+!+[]+[+[]]]+([][[]]+[])[+!+[]]+(![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[+!+[]]+([][[]]+[])[+[]]+([][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]])[+!+[]+[+[]]]+(!![]+[])[+!+[]]])[+!+[]+[+[]]]+(!![]+[])[+[]]+(!![]+[])[+!+[]]+([![]]+[][[]])[+!+[]+[+[]]]+([][[]]+[])[+!+[]]+(+![]+[![]]+([]+[])[([][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]]+[])[!+[]+!+[]+!+[]]+(!![]+[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]])[+!+[]+[+[]]]+([][[]]+[])[+!+[]]+(![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[+!+[]]+([][[]]+[])[+[]]+([][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]])[+!+[]+[+[]]]+(!![]+[])[+!+[]]])[!+[]+!+[]+[+[]]]](!+[]+!+[]+[+!+[]])+([![]]+[][[]])[+!+[]+[+[]]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]])[!+[]+!+[]+[+[]]])()
Though I don't think that it's possible to let the browser execute that code as javascript without any English characters, it really depends on how you are going to display the data users input. If they manage to execute codes like this, they could steal users' cookies. But they can't do this if those chars to substituted with things like [ It's nice to keep safer and eliminate the possibilities.
Page Layouts and Display
Your HTML might be broken and the layouts could be messed up if you don't substitute special chars (some of the punctuations). For instance, " should be substituted with " in order that your page can display normally (other characters you probably need to substitute include <(<),>(>), etc.), otherwise your HTML might be messed up (e.g. <input type="text" name="test" value="He recited that, "Shall I compare thee to a summer's day", and continued..."><br> English as an example, but it's basically the same with other languages, only "He recited that, " will be treated as the value, and the page would be messed up.) This could lead to wrong display, layouts, and contents.
Hope this answers your concerns ^_^
Related
I'm implementing a simple search on a website, and right now I'm working on sanitizing the input. My plan is to make a whitelist of allowed characters. I'm using PHP, and so far I've got the current regex:
preg_replace('/[^a-z0-9 -]/i', '', $s);
So, I'm removing anything that's not alphanumeric or a space or a hyphen.
Is there a generally accepted whitelist for this sort of thing, or does it just depend on the application? I'm going to be searching on book titles, author names and book blurbs.
What about 2010 (A space odyssey)? What about Giscard d`Estaing's autobiography? ... This is really impossible to answer generally, it will depend on your application and data structures.
You want to look into the fulltext search functions of the database of your choice, or even specialized search appliances like Sphinx.
Clarify what engine you will use first to actually perform your search, and the rules on what you need to strip out will become much clearer.
Google has some pretty advanced rules for searches, but their basic rule is this:
Generally, punctuation is ignored, including ##$%^&*()=+[]\ and other special characters.
However, Google makes exceptions for common search terms, like C++, C#, or $100.
If you want a search as sophisticated as Google's, you can make rules against the above punctuation and have some exceptions. However, for a simple search, just ignore the characters that Google generally ignores.
There's not a generic regular expression to solve this problem. Your code strips out a lot of things you might want to keep, like commas, exclamation points, (semi-)colons, and non-English letters. If you have a full list of all of the titles in your database, you should be able to write a script that will construct a list of all characters found in all of your titles. If your regular expression strips out any of those characters, then you risk having problems (although passing this test doesn't mean that you won't run into problems).
Depending on how the rest of your search is implemented, you may be able to strip out valid characters and still return relevant search results. In this case, you would want your expression to allow non-English characters (since you don't want to split a word) but you might be able to remove all punctuation marks that aren't inside of a quote-delimited phrase. For example, searching for red haired should give you all of the results you would get from searching for red-haired plus a few extra.
Although it is strongly recommended (W3C source, via Wikipedia) for web servers to support semicolon as a separator of URL query items (in addition to ampersand), it does not seem to be generally followed.
For example, compare
http://www.google.com/search?q=nemo&oe=utf-8
http://www.google.com/search?q=nemo;oe=utf-8
results. (In the latter case, semicolon is, or was at the time of writing this text, treated as ordinary string character, as if the url was: http://www.google.com/search?q=nemo%3Boe=utf-8)
Although the first URL parsing library i tried, behaves well:
>>> from urlparse import urlparse, query_qs
>>> url = 'http://www.google.com/search?q=nemo;oe=utf-8'
>>> parse_qs(urlparse(url).query)
{'q': ['nemo'], 'oe': ['utf-8']}
What is the current status of accepting semicolon as a separator, and what are potential issues or some interesting notes? (from both server and client point of view)
The W3C Recommendation from 1999 is obsolete. The current status, according to the 2014 W3C Recommendation, is that semicolon is now illegal as a parameter separator:
To decode application/x-www-form-urlencoded payloads, the following algorithm should be used. [...] The output of this algorithm is a sorted list of name-value pairs. [...]
Let strings be the result of strictly splitting the string payload on U+0026 AMPERSAND characters (&).
In other words, ?foo=bar;baz means the parameter foo will have the value bar;baz; whereas ?foo=bar;baz=sna should result in foo being bar;baz=sna (although technically illegal since the second = should be escaped to %3D).
As long as your HTTP server, and your server-side application, accept semicolons as separators, you should be good to go. I cannot see any drawbacks. As you said, the W3C spec is on your side:
We recommend that HTTP server implementors, and in particular, CGI implementors support the use of ";" in place of "&" to save authors the trouble of escaping "&" characters in this manner.
I agree with Bob Aman. The W3C spec is designed to make it easier to use anchor hyperlinks with URLs that look like form GET requests (e.g., http://www.host.com/?x=1&y=2). In this context, the ampersand conflicts with the system for character entity references, which all start with an ampersand (e.g., "). So W3C recommends that web servers allow a semicolon to be used as a field separator instead of an ampersand, to make it easier to write these URLs. But this solution requires that writers remember that the ampersand must be replaced by something, and that a ; is an equally valid field delimiter, even though web browsers universally use ampersands in the URL when submitting forms. That is arguably more difficult that remembering to replace the ampersand with an & in these links, just as would be done elsewhere in the document.
To make matters worse, until all web servers allow semicolons as field delimiters, URL writers can only use this shortcut for some hosts, and must use & for others. They will also have to change their code later if a given host stops allowing semicolon delimiters. This is certainly harder than simply using &, which will work for every server forever. This in turn removes any incentive for web servers to allow semicolons as field separators. Why bother, when everyone is already changing the ampersand to & instead of ;?
In short, HTML is a big mess (due to its leniency), and using semicolons help to simplify this a LOT. I estimate that when i factor in the complications that i've found, using ampersands as a separator makes the whole process about three times as complicated as using semicolons for separators instead!
I'm a .NET programmer and to my knowledge, .NET does not inherently allow ';' separators, so i wrote my own parsing and handling methods because i saw a tremendous value in using semicolons rather than the already problematic system of using ampersands as separators. Unfortunately, very respectable people (like #Bob Aman in another answer) do not see the value in why semicolon usage is far superior and so much simpler than using ampersands. So i now share a few points to perhaps persuade other respectable developers who don't recognize the value yet of using semicolons instead:
Using a querystring like '?a=1&b=2' in an HTML page is improper (without HTML encoding it first), but most of the time it works. This however is only due to most browsers being tolerant, and that tolerance can lead to hard-to-find bugs when, for instance, the value of the key value pair gets posted in an HTML page URL without proper encoding (directly as '?a=1&b=2' in the HTML source). A QueryString like '?who=me+&+you' is problematic too.
We people can have biases and can disagree about our biases all day long, so recognizing our biases is very important. For instance, i agree that i just think separating with ';' looks 'cleaner'. I agree that my 'cleaner' opinion is purely a bias. And another developer can have an equally opposite and equally valid bias. So my bias on this one point is not any more correct than the opposite bias.
But given the unbiased support of the semicolon making everyone's life easier in the long run, cannot be correctly disputed when the whole picture is taken into account. In short, using semicolons does make life simpler for everyone, with one exception: a small hurdle of getting used to something new. That's all. It's always more difficult to make anything change. But the difficulty of making the change pales in comparison to the continued difficulty of continuing to use &.
Using ; as a QueryString separator makes it MUCH simpler. Ampersand separators are more than twice as difficult to code properly than if semicolons were used. (I think) most implementations are not coded properly, so most implementations aren't twice as complicated. But then tracking down and fixing the bugs leads to lost productivity. Here, i point out 2 separate encoding steps needed to properly encode a QueryString when & is the separator:
Step 1: URL encode both the keys and values of the querystring.
Step 2: Concatenate the keys and values like 'a=1&b=2' after they are URL encoded from step 1.
Step 3: Then HTML encode the whole QueryString in the HTML source of the page.
So special encoding must be done twice for proper (bug free) URL encoding, and not just that, but the encodings are two distinct, different encoding types. The first is a URL encoding and the second is an HTML encoding (for HTML source code). If any of these is incorrect, then i can find you a bug. But step 3 is different for XML. For XML, then XML character entity encoding is needed instead (which is almost identical). My point is that the last encoding is dependent upon the context of the URL, whether that be in an HTML web page, or in XML documentation.
Now with the much simpler semicolon separators, the process is as one wud expect:
1: URL encode the keys and values,
2: concatenate the values together. (With no encoding for step 3.)
I think most web developers skip step 3 because browsers are so lenient. But this leads to bugs and more complications when hunting down those bugs or users not being able to do things if those bugs were not present, or writing bug reports, etc.
Another complication in real use is when writing XML documentation markup in my source code in both C# and VB.NET. Since & must be encoded, it's a real drag, literally, on my productivity. That extra step 3 makes it harder to read the source code too. So this harder-to-read deficit applies not only to HTML and XML, but also to other applications like C# and VB.NET code because their documentation uses XML documentation. So the step #3 encoding complication proliferates to other applications too.
So in summary, using the ; as a separator is simple because the (correct) process when using the semicolon is how one wud normally expect the process to be: only one step of encoding needs to take place.
Perhaps this wasn't too confusing. But all the confusion or difficulty is due to using a separation character that shud be HTML encoded. Thus '&' is the culprit. And semicolon relieves all that complication.
(I will point out that my 3 step vs 2 step process above is usually how many steps it would take for most applications. However, for completely robust code, all 3 steps are needed no matter which separator is used. But in my experience, most implementations are sloppy and not robust. So using semicolon as the querystring separator would make life easier for more people with less website and interop bugs, if everyone adopted the semicolon as the default instead of the ampersand.)
I am using ^[\w-\.\+]+#([\w-]+\.)+[\w-]{2,4}$ to validate email address, when I use it from .aspx.cs it works fine to validate IDN email but when I use it from aspx page directly it doesn't work.
return Regex.IsMatch(
email,
#"^[\w-\.\+]+#([\w-]+\.)+[\w-]{2,4}$",
RegexOptions.Singleline);
the ID that I would like to validate looks like pelai#ÖßÜÄÖ.com
I am too bad at regex do you guys know what am I doing wrong?
You may want to take a look at regexlib.com, they have a fantastic selection of user-created content to do these extremely commont types of matches.
http://regexlib.com/Search.aspx?k=email
First the correct validation of an e-mail address is somewhat more complex as regex. But that apart, the Regex is not at fault, but probably rather how you use it.
Edit (after seeing your code): do you make sure that the string to be tested has no whitespace and such in it? Put a breakpoint on it right there and inspect the string, that might give you an idea of what is going wrong.
You should escape dash (-) within the first char class and no need for dot and plus :
[\w\-.+]
or
[\w.+-]
no need to escape dash if it is the last char.
With "directly from aspx page" you probably mean in a regularexpression validator?
Then you need to be aware that the regex is used by a different system: javascript which has it's own implementation of regex. This means that regexes that work in .Net directly, might fail in js.
The implementations are not too different, the basics are identical. But there might be differences in details (as how an unescaped - is handled) and js lacks some "advanced features" (although your regex doesn't look too "advanced" ;-) ).
Do you see any error messages in the browser?
The problem is those non-ASCII characters in your test address, ÖßÜÄÖ (which you only ever mentioned in a comment to #HansKesting's answer). In .NET, \w matches all Unicode letters and digits, and even several characters besides _ that are classified as connecting punctuation, but in JavaScript it only matches [A-Za-z0-9_].
JavaScript also lacks support for Unicode properties (like \p{L} for letters) and blocks (\p{IsLatin}), so you would have to list any non-ASCII characters you want to allow by their Unicode escapes (\uXXXX). If you just want to support Latin1 letters, I suppose you could use [\w\u00C0-\u00FF], but IDN is supposed to support more than just Latin1, isn't it?
By the way, JavaScript also doesn't support Singleline mode, and even if it did you wouldn't be able to use it. JS does support Multiline and IgnoreCase modes, but there's no way to set them on both the server and client side. The inline modifiers, (?i) and (?m), don't work in JS, and the RegexOptions argument only works server-side.
Fortunately, you don't really need Singleline mode anyway; it allows the . metacharacter to match linefeeds, but the only dots in your regex are matching literal dots.
I need to prevent the characters that cause vulnerabilities in the URL.
My sample URL is http://localhost/add.aspx?id=4;req=4.
Please give the list of characters that I need block.
I am using an ASP.NET web page. I am binding the information from an SQL Server database.
I just want to list the characters to stay away from hackers to enter unwanted strings in the URL.
Depending on what technology you're using, there is usually a built-in function that will handle this for you.
ASP.NET (VB) & Classic ASP
myUrl = Server.UrlEncode(myUrl)
ASP.NET (C#)
myUrl = Server.UrlEncode(myUrl);
PHP
$myUrl = urlencode($myurl);
If you simply would like to remove unsafe characters, you would need a regular expression. RFC 1738 defines what characters are unsafe for URLs:
Unsafe:
Characters can be unsafe for a
number of reasons. The space
character is unsafe because
significant spaces may disappear and
insignificant spaces may be introduced
when URLs are transcribed or
typeset or subjected to the treatment
of word-processing programs. The
characters "<" and ">" are unsafe
because they are used as the
delimiters around URLs in free text;
the quote mark (""") is used to
delimit URLs in some systems. The
character "#" is unsafe and should
always be encoded because it is used
in World Wide Web and in other
systems to delimit a URL from a
fragment/anchor identifier that might
follow it. The character "%" is
unsafe because it is used for
encodings of other characters. Other
characters are unsafe because
gateways and other transport agents
are known to sometimes modify such
characters. These characters are "{",
"}", "|", "\", "^", "~", "[", "]",
and "`".
I need to prevent the characters that cause vulnerabilities
Well, of course you need to URL encode, as the answers have said. But does not URL encoding cause vulnerabilities? Well, normally not directly; mostly it just makes your application break when unexpected characters are input.
If we're talking about web ‘vulnerabilities’, the most common ones today are:
Server-side code injection, compromising your server
SQL injection, compromising your database
HTML injection, allowing cross-site scripting (XSS) attacks against your users
Unvalidated actions, allowing cross-site request forgery (XSRF) attacks against your users
These are in order of decreasing seriousness and increasing commonness. (Luckily few web site authors are stupid enough to be passing user input to system() these days, but XSS and XSRF vulnerabilities are rife.)
Each of these vulnerabilities requires you to understand the underlying problem and cope with it deliberately. There is no magic list of “strings you need to block” that will protect your application if it is playing naïve about security. There are some add-ons that do things like blocking the string ‘<script>’ when submitted, but all they give you is a false sense of security since they can only catch a few common cases, and are usually easy to code around.
They'll also stop those strings being submitted when you might genuinely want them. For example, some (stupid) PHP authors refuse all incoming apostrophes as an attempt to curb SQL-injection; result is you can't be called “O'Reilly”. D'oh. Don't block; encode properly.
For example, to protect against SQL injection make sure to SQL-escape any strings that you are making queries with (or use parameterised queries to do this automatically); to protect against HTML injection, HTML-encode all text strings you output onto the page (or use a templating/MVC scheme that will do this automatically).
My sample URL http://localhost/add.aspx?id=4;req=4
Is there supposed to be something wrong with that URL? It's valid to separate two query parameters with a ‘;’ instead of the more common ‘&’, but many common web frameworks lamentably still don't understand this syntax by default (including Java Servlet and ASP.NET). So you'd have to go with ‘id=4&req=4’ — or, if you really wanted that to be a single parameter with a literal semicolon in it, ‘id=4%3Breq%3D4’.
http://en.wikipedia.org/wiki/Query_string#URL_encoding
See also: https://www.rfc-editor.org/rfc/rfc3986#section-2.2
I wrote this, for pretty URLs, but it's of course not complete:
""",´,’,·,‚,*,#,?,=,;,:,.,/,+,&,$,<,>,#,%,{,(,),},|,,^,~,[,],—,–,-',,"
And then I translate spaces " " and repeating spaces for "-".
A better thing is to do it or combine it with a regular expression.
Inside an asp.net page, should I use
<html><title>My page's title from México</title></html>
Or
<html><title>My page’s title from México</title></html>
Both examples have the same output. Since asp.net encodes all my pages to utf-8, there is no need to use html entities, is that right?
The ASCII table is set of characters, arguable the first standardized set of characters back in the days when you could only spare 1 byte per character. http://asciitable.com/ But I did some looking around at the extended character set of ASCII and it appears that the character you are referencing is an ASCII character. So there really isn't a problem which ever way you choose to display your title.
My revised answer is go for less expensive one according to space (i.e. the first one)
The second example will ensure compatibility with ASCII standards of HTML transmition. So my vote is for the second example, so you don't have to ensure the HTML is output and encoded as UTF-8 all the way through all the proxy servers and any other kind of caching and translation that might occur.
You're correct; As long as there's unicode at both ends of the pipe, it really doesn't matter. Personally, I would use the first simply because it's more readable.
And, honestly, unicode has been widespread for some time. I personally believe that it's time to leave anyone who can't handle UTF-8 behind.