HTTP query and URI encoding doubts [closed]

Recently I was researching HTTP query strings while wondering about the possibilities for a web service access API, and the area seems very underspecified.
In fact RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax) doesn't say anything about the format of the query string component and stops at defining which characters are allowed and how to encode other characters. (I will return to this later.)
The only thing I found was the HTML specification on how form data is serialized into a query string (HTML 4.01; 17.13.4 Form content types, application/x-www-form-urlencoded). The HTML 5 algorithm seems close enough (4.10.22.5 URL-encoded form data).
This might seem fine. After all, why would anyone want to impose a query string format on everyone else? But are there any other well-established standards (besides HTML)? Is anyone else using a different format?
A side question here is how to deal with [] in form field names. PHP uses that convention to make all occurrences of a repeated field available in the $_GET superglobal. (Otherwise only the last occurrence is present.)
But from RFC 3986 it seems that neither [ nor ] is allowed in a query string. Yet my experiments with various browsers suggested that no browser encodes those characters; they appear in the URI just like that...
Is this real-life practice, or am I testing it incorrectly? I tested with PHP 5.3.17 on IIS 7, using Internet Explorer, Firefox and Chrome, and then compared what ends up in $_SERVER['QUERY_STRING'] and $_GET.
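Not part of the original post, but a small sketch of the comparison being described (assuming Python 3's urllib rather than the PHP/IIS setup from the question): strict RFC 3986 percent-encoding of [] versus the raw form browsers commonly send, and how a query-string parser treats both.

# Sketch: strict encoding of "[]" vs. the raw spelling browsers tend to send.
from urllib.parse import quote, parse_qs

field = "item[]"
print(quote(field, safe=""))                  # item%5B%5D -- strict RFC 3986 encoding

# Both spellings decode to the same field name:
print(parse_qs("item[]=1&item[]=2"))          # {'item[]': ['1', '2']}
print(parse_qs("item%5B%5D=1&item%5B%5D=2"))  # {'item[]': ['1', '2']}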
Another question is real-life support for semicolon separation.
The HTML 4.01 specification (B.2.2 Ampersands in URI attribute values) recommends that HTTP servers accept the semicolon (;) as a parameter separator (as opposed to the ampersand &).
Is any server supporting this? Is anyone using it? Is it worth bothering with (when considering allowed query string formats for a web service)?
Then what about non-ASCII character support?
The HTML 4.01 specification (B.2.1 Non-ASCII characters in URI attribute values) restates clearly what the URI RFCs stated in the first place: non-ASCII characters are not allowed in URIs. Yet the specification takes existing practice (the use of illegal URIs) into account and advises converting such characters to UTF-8 and then percent-encoding each byte in the standard URI way.
From my tests it seems that, for example, Chrome and Firefox do so. But Internet Explorer did not and just sent those characters as they were. PHP partially coped with that: $_SERVER['QUERY_STRING'] and $_GET contained those characters, but $_SERVER['REQUEST_URI'] contained ? instead.
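As a rough illustration (my addition, assuming Python 3), this is what the "encode as UTF-8, then percent-encode each byte" practice from HTML 4.01 B.2.1 produces:

from urllib.parse import quote, unquote

name = "łąka"                # arbitrary example containing non-ASCII letters
encoded = quote(name)        # UTF-8 bytes, each percent-encoded
print(encoded)               # %C5%82%C4%85ka
print(unquote(encoded))      # łąka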
Are there any standards or practices for how to approach such cases?
And another connected question is how authors should then publish (by URI) resources whose names contain non-ASCII (for example national) characters. Considering all the parties involved (the HTML code, the browser sending the request, the browser saving the file to disk, the server receiving and processing the request, and the server storing the file), it seems nearly impossible to get this working consistently. Or at least I never managed.
When it comes to web pages I'm already used to this and always replace national characters with the corresponding base Latin characters. But when it comes to external files (PDFs, images, ...) it somehow "feels wrong" to "downgrade" the names, especially if one expects users to save those files to disk. How should I deal with this issue?

Have you checked the HTTP specification (RFC 2616)?
Take a look at those parts:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html#sec5.1.2
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.2
The practical advice would be to use Base64 to encode the fields that you expect to contain risky characters, and decode them later on your backend.
By the way, your question is really long, which decreases the chance that someone will dig into it.
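A minimal sketch of that suggestion (mine, not the commenter's; using Python on both ends is an assumption): Base64-encode the risky value before putting it in the query string and decode it on the backend. URL-safe Base64 avoids "+" and "/", which would otherwise still need percent-encoding.

import base64

risky = "name=ż & x[0]=1;y"   # value full of characters that are awkward in a query string
token = base64.urlsafe_b64encode(risky.encode("utf-8")).decode("ascii")
query = "data=" + token       # note: trailing "=" padding may still warrant percent-encoding

# Backend side:
decoded = base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")
assert decoded == risky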

In fact RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax) doesn't say anything about the format of the query string component
Yes, it does, in Section 3.4:
query = *( pchar / "/" / "?" )
pchar is defined in Section 3.3:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
and stops at defining which characters are allowed and how to encode other characters.
Exactly. That is defining the format of the query string fragment.
But from RFC 3986 it seems that neither [ nor ] are allowed in query string.
Officially, yes. But not all browsers encode them, and that is broken behavior on their part. All official specs I have seen (and 3986 is not the only one in play) say those characters must be percent-encoded.
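To make that concrete, here is a small sketch (my addition) that builds the query character set from the ABNF above and confirms that the brackets fall outside it:

# Characters allowed literally in a query per RFC 3986
# (unreserved / sub-delims / ":" / "@" / "/" / "?"), plus pct-encoded triplets.
import string

unreserved = set(string.ascii_letters + string.digits + "-._~")
sub_delims = set("!$&'()*+,;=")
query_chars = unreserved | sub_delims | set(":@/?")

print("[" in query_chars, "]" in query_chars)   # False False -> must be percent-encoded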
Then how about non-ASCII characters support?
Non-ASCII characters are not allowed in URIs. They must be charset-encoded and percent-encoded. The actual charset used is server-specific, there is no spec that allows a URI to specify the charset used. Various specs recommend UTF-8, but do not require UTF-8, and some foreign servers indeed do not use UTF-8.
The IRI spec (RFC 3987) extends URIs to the full Unicode character set, but IRIs are still relatively new and many servers do not support them yet. However, the RFC does define algorithms for converting IRIs to URIs and vice versa.
When in doubt, percent-encode everything you are not sure about. Servers are required to decode percent-encoded data when present, before processing the decoded data as needed.

Related

Is there some way to set character encoding in SCORM 2004?

I'm trying to record some text values (cmi.interactions.n.learner_response, and cmi.interactions.n.description) on the backend. I'm sending them in a post response from a JS object that uses JSON.stringify.
Inspecting the response in PHP, accented characters äöå (and spaces) are recorded as underscores in learner_response, and in description, they are omitted altogether. Inspecting the response string, it appears to be an ASCII encoded string.
Is it possible to set encoding in SCORM 2004 so that I can see accented characters in the response? My client would like to record the interactions more thoroughly. The content was created in Adobe Captivate.
Thanks.
Essentially, no. SCORM's scope limits it to what is happening in the runtime layer that is implemented as the JavaScript API that the SCORM player (the thing launching the content) provides. So the transfer mechanism between that runtime environment and the storage layer (whether that is on a server, local, etc.) is outside the scope of the spec and is therefore implementation specific.
There is reference to ISO-10646-1 which will take you down a path that likely leads to not a lot more information. Essentially it is a character set without including specifics about how to handle those elements, which for this use case probably boils down to JavaScript string.
Having said all of that you should seek support from the SCORM player to see if they have the ability to adjust that so that larger ranges of characters can be supported.

Custom HTTP headers : naming conventions

Several of our users have asked us to include data relative to their account in the HTTP headers of requests we send them, or even responses they get from our API.
What is the general convention for adding custom HTTP headers, in terms of naming, format, etc.?
Also, feel free to post any smart usage of these that you have stumbled upon on the web; we're trying to implement this using the best of what's out there as a target :)
The recommendation was (historically) to start their name with "X-", e.g. X-Forwarded-For, X-Requested-With. This is also mentioned in, among others, section 5 of RFC 2047.
Update 1: In June 2011, the first IETF draft was posted to deprecate the recommendation of using the "X-" prefix for non-standard headers. The reason is that when non-standard headers prefixed with "X-" become standard, removing the "X-" prefix breaks backwards compatibility, forcing application protocols to support both names (e.g., x-gzip and gzip are now equivalent). So, the official recommendation is to just name them sensibly without the "X-" prefix.
Update 2: In June 2012, the deprecation of the recommendation to use the "X-" prefix became official as RFC 6648. Below are the relevant citations:
3. Recommendations for Creators of New Parameters
...
SHOULD NOT prefix their parameter names with "X-" or similar
constructs.
4. Recommendations for Protocol Designers
...
SHOULD NOT prohibit parameters with an "X-" prefix or similar
constructs from being registered.
MUST NOT stipulate that a parameter with an "X-" prefix or
similar constructs needs to be understood as unstandardized.
MUST NOT stipulate that a parameter without an "X-" prefix or
similar constructs needs to be understood as standardized.
Note that "SHOULD NOT" ("discouraged") is not the same as "MUST NOT" ("forbidden"), see also RFC 2119 for another spec on those keywords. In other words, you can keep using "X-" prefixed headers, but it's not officially recommended anymore and you may definitely not document them as if they are public standard.
Summary:
the official recommendation is to just name them sensibly without the "X-" prefix
you can keep using "X-" prefixed headers, but it's not officially recommended anymore and you should not document them as if they were a public standard
The question bears re-reading. The actual question asked is not similar to vendor prefixes in CSS properties, where future-proofing and thinking about vendor support and official standards is appropriate. The actual question asked is more akin to choosing URL query parameter names. Nobody should care what they are. But name-spacing the custom ones is a perfectly valid -- and common, and correct -- thing to do.
Rationale:
It is about conventions among developers for custom, application-specific headers -- "data relevant to their account" -- which have nothing to do with vendors, standards bodies, or protocols to be implemented by third parties, except that the developer in question simply needs to avoid header names that may have other intended use by servers, proxies or clients. For this reason, the "X-Gzip/Gzip" and "X-Forwarded-For/Forwarded-For" examples given are moot. The question posed is about conventions in the context of a private API, akin to URL query parameter naming conventions. It's a matter of preference and name-spacing; concerns about "X-ClientDataFoo" being supported by any proxy or vendor without the "X" are clearly misplaced.
There's nothing special or magical about the "X-" prefix, but it helps to make it clear that it is a custom header. In fact, RFC-6648 et al help bolster the case for use of an "X-" prefix, because -- as vendors of HTTP clients and servers abandon the prefix -- your app-specific, private-API, personal-data-passing-mechanism is becoming even better-insulated against name-space collisions with the small number of official reserved header names. That said, my personal preference and recommendation is to go a step further and do e.g. "X-ACME-ClientDataFoo" (if your widget company is "ACME").
IMHO the IETF spec is insufficiently specific to answer the OP's question, because it fails to distinguish between completely different use cases: (A) vendors introducing new globally-applicable features like "Forwarded-For" on the one hand, vs. (B) app developers passing app-specific strings to/from client and server. The spec only concerns itself with the former, (A). The question here is whether there are conventions for (B). There are. They involve grouping the parameters together alphabetically, and separating them from the many standards-relevant headers of type (A). Using the "X-" or "X-ACME-" prefix is convenient and legitimate for (B), and does not conflict with (A). The more vendors stop using "X-" for (A), the more cleanly-distinct the (B) ones will become.
Example:
Google (who carry a bit of weight in the various standards bodies) are -- as of today, 20141102 in this slight edit to my answer -- currently using "X-Mod-Pagespeed" to indicate the version of their Apache module involved in transforming a given response. Is anyone really suggesting that Google should use "Mod-Pagespeed", without the "X-", and/or ask the IETF to bless its use?
Summary:
If you're using custom HTTP Headers (as a sometimes-appropriate alternative to cookies) within your app to pass data to/from your server, and these headers are, explicitly, NOT intended ever to be used outside the context of your application, name-spacing them with an "X-" or "X-FOO-" prefix is a reasonable, and common, convention.
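For what it's worth, a tiny sketch of that convention (the header name and URL are made up, and the choice of Python's urllib is mine):

import urllib.request

req = urllib.request.Request("https://api.example.com/widgets")
# Private, app-specific header, namespaced as the answer suggests:
req.add_header("X-ACME-ClientDataFoo", "opaque-account-token")
# urllib.request.urlopen(req)  # would send the request with the custom header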
The format for HTTP headers is defined in the HTTP specification. I'm going to talk about HTTP 1.1, for which the specification is RFC 2616. In section 4.2, 'Message Headers', the general structure of a header is defined:
message-header = field-name ":" [ field-value ]
field-name = token
field-value = *( field-content | LWS )
field-content = <the OCTETs making up the field-value
and consisting of either *TEXT or combinations
of token, separators, and quoted-string>
This definition rests on two main pillars, token and TEXT. Both are defined in section 2.2, 'Basic Rules'. Token is:
token = 1*<any CHAR except CTLs or separators>
In turn resting on CHAR, CTL and separators:
CHAR = <any US-ASCII character (octets 0 - 127)>
CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>
separators = "(" | ")" | "<" | ">" | "#"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
TEXT is:
TEXT = <any OCTET except CTLs,
but including LWS>
Where LWS is linear white space, whose definition I won't reproduce, and OCTET is:
OCTET = <any 8-bit sequence of data>
There is a note accompanying the definition:
The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 [14].
So, two conclusions. Firstly, it's clear that the header name must be composed from a subset of ASCII characters - alphanumerics, some punctuation, not a lot else. Secondly, there is nothing in the definition of a header value that restricts it to ASCII or excludes 8-bit characters: it's explicitly composed of octets, with only control characters barred (note that CR and LF are considered controls). Furthermore, the comment on the TEXT production implies that the octets are to be interpreted as being in ISO-8859-1, and that there is an encoding mechanism (which is horrible, incidentally) for representing characters outside that encoding.
So, to respond to @BalusC in particular, it's quite clear that according to the specification, header values are in ISO-8859-1. I've sent high-8859-1 characters (specifically, some accented vowels as used in French) in a header out of Tomcat, and had them interpreted correctly by Firefox, so to some extent this works in practice as well as in theory (although this was a Location header, which contains a URL, and these characters are not legal in URLs, so this was actually illegal, but under a different rule!).
That said, I wouldn't rely on ISO-8859-1 working across all servers, proxies, and clients, so I would stick to ASCII as a matter of defensive programming.
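As a side note (my example, not the answerer's): Python's standard http.client takes the same ISO-8859-1 reading of RFC 2616 and encodes str header values as Latin-1, so a value outside that character set fails outright.

# Latin-1 can represent the accented vowels mentioned above...
print("café".encode("latin-1"))      # b'caf\xe9'

# ...but not characters outside ISO-8859-1, which is why a raw 'š' in a header is unsafe:
try:
    "š".encode("latin-1")
except UnicodeEncodeError as exc:
    print(exc)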
RFC6648 recommends that you assume that your custom header "might become standardized, public, commonly deployed, or usable across multiple implementations." Therefore, it recommends not to prefix it with "X-" or similar constructs.
However, there is an exception "when it is extremely unlikely that [your header] will ever be standardized." For such "implementation-specific and private-use" headers, the RFC says a namespace such as a vendor prefix is justified.
Modifying, or more correctly, adding additional HTTP headers is a great code debugging tool if nothing else.
When a URL request returns a redirect or an image there is no html "page" to temporarily write the results of debug code to - at least not one that is visible in a browser.
One approach is to write the data to a local log file and view that file later. Another is to temporarily add HTTP headers reflecting the data and variables being debugged.
I regularly add extra HTTP headers like X-fubar-somevar: or X-testing-someresult: to test things out - and have found a lot of bugs that would have otherwise been very difficult to trace.
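A sketch of that debugging trick (mine; the header names are just the placeholders from the answer), written as a minimal WSGI app since any server-side stack would do:

def app(environ, start_response):
    some_debug_value = environ.get("QUERY_STRING", "")   # whatever you need to inspect
    headers = [
        ("Location", "/next"),
        ("X-fubar-somevar", some_debug_value),   # visible in the browser's network panel
    ]
    start_response("302 Found", headers)
    return [b""]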
The header field name registry is defined in RFC3864, and there's nothing special with "X-".
As far as I can tell, there are no guidelines for private headers; when in doubt, avoid them. Or have a look at the HTTP Extension Framework (RFC 2774).
It would be interesting to understand more of the use case; why can't the information be added to the message body?

Semicolon as URL query separator

Although it is strongly recommended (W3C source, via Wikipedia) that web servers support the semicolon as a separator of URL query items (in addition to the ampersand), this recommendation does not seem to be generally followed.
For example, compare
        http://www.google.com/search?q=nemo&oe=utf-8
        http://www.google.com/search?q=nemo;oe=utf-8
results. (In the latter case, the semicolon is, or was at the time of writing this text, treated as an ordinary string character, as if the URL were: http://www.google.com/search?q=nemo%3Boe=utf-8)
Although the first URL parsing library I tried behaves well:
>>> from urlparse import urlparse, parse_qs
>>> url = 'http://www.google.com/search?q=nemo;oe=utf-8'
>>> parse_qs(urlparse(url).query)
{'q': ['nemo'], 'oe': ['utf-8']}
What is the current status of accepting semicolon as a separator, and what are potential issues or some interesting notes? (from both server and client point of view)
The W3C Recommendation from 1999 is obsolete. The current status, according to the 2014 W3C Recommendation, is that semicolon is now illegal as a parameter separator:
To decode application/x-www-form-urlencoded payloads, the following algorithm should be used. [...] The output of this algorithm is a sorted list of name-value pairs. [...]
Let strings be the result of strictly splitting the string payload on U+0026 AMPERSAND characters (&).
In other words, ?foo=bar;baz means the parameter foo will have the value bar;baz; whereas ?foo=bar;baz=sna should result in foo being bar;baz=sna (although technically illegal since the second = should be escaped to %3D).
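For comparison with the urlparse session quoted in the question: recent Python 3 releases follow the same ampersand-only rule by default, and a separator argument has to be passed explicitly to get the old behaviour (a small sketch, assuming Python 3.9.2 or later):

from urllib.parse import parse_qs

print(parse_qs("q=nemo;oe=utf-8"))                  # {'q': ['nemo;oe=utf-8']}
print(parse_qs("q=nemo;oe=utf-8", separator=";"))   # {'q': ['nemo'], 'oe': ['utf-8']}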
As long as your HTTP server, and your server-side application, accept semicolons as separators, you should be good to go. I cannot see any drawbacks. As you said, the W3C spec is on your side:
We recommend that HTTP server implementors, and in particular, CGI implementors support the use of ";" in place of "&" to save authors the trouble of escaping "&" characters in this manner.
I agree with Bob Aman. The W3C spec is designed to make it easier to use anchor hyperlinks with URLs that look like form GET requests (e.g., http://www.host.com/?x=1&y=2). In this context, the ampersand conflicts with the system for character entity references, which all start with an ampersand (e.g., &quot;). So W3C recommends that web servers allow a semicolon to be used as a field separator instead of an ampersand, to make it easier to write these URLs. But this solution requires that writers remember that the ampersand must be replaced by something, and that ; is an equally valid field delimiter, even though web browsers universally use ampersands in the URL when submitting forms. That is arguably more difficult than remembering to replace the ampersand with &amp; in these links, just as would be done elsewhere in the document.
To make matters worse, until all web servers allow semicolons as field delimiters, URL writers can only use this shortcut for some hosts, and must use &amp; for others. They will also have to change their code later if a given host stops allowing semicolon delimiters. This is certainly harder than simply using &amp;, which will work for every server forever. This in turn removes any incentive for web servers to allow semicolons as field separators. Why bother, when everyone is already changing the ampersand to &amp; instead of ;?
In short, HTML is a big mess (due to its leniency), and using semicolons helps to simplify this a LOT. I estimate that when I factor in the complications that I've found, using ampersands as a separator makes the whole process about three times as complicated as using semicolons for separators instead!
I'm a .NET programmer and to my knowledge, .NET does not inherently allow ';' separators, so I wrote my own parsing and handling methods because I saw a tremendous value in using semicolons rather than the already problematic system of using ampersands as separators. Unfortunately, very respectable people (like @Bob Aman in another answer) do not see the value in why semicolon usage is far superior and so much simpler than using ampersands. So I now share a few points to perhaps persuade other respectable developers who don't yet recognize the value of using semicolons instead:
Using a querystring like '?a=1&b=2' in an HTML page is improper (without HTML encoding it first), but most of the time it works. This however is only due to most browsers being tolerant, and that tolerance can lead to hard-to-find bugs when, for instance, the value of the key value pair gets posted in an HTML page URL without proper encoding (directly as '?a=1&b=2' in the HTML source). A QueryString like '?who=me+&+you' is problematic too.
We people can have biases and can disagree about our biases all day long, so recognizing our biases is very important. For instance, I agree that I just think separating with ';' looks 'cleaner'. I agree that my 'cleaner' opinion is purely a bias. And another developer can have an equally opposite and equally valid bias. So my bias on this one point is not any more correct than the opposite bias.
But the unbiased case that the semicolon makes everyone's life easier in the long run cannot reasonably be disputed when the whole picture is taken into account. In short, using semicolons does make life simpler for everyone, with one exception: a small hurdle of getting used to something new. That's all. It's always more difficult to make anything change. But the difficulty of making the change pales in comparison to the continued difficulty of continuing to use &.
Using ; as a QueryString separator makes it MUCH simpler. Ampersand separators are more than twice as difficult to code properly as semicolons. (I think) most implementations are not coded properly, so most implementations aren't twice as complicated. But then tracking down and fixing the bugs leads to lost productivity. Here, I point out the two separate encoding steps needed to properly encode a QueryString when & is the separator:
Step 1: URL encode both the keys and values of the querystring.
Step 2: Concatenate the keys and values like 'a=1&b=2' after they are URL encoded from step 1.
Step 3: Then HTML encode the whole QueryString in the HTML source of the page.
So special encoding must be done twice for proper (bug free) URL encoding, and not just that, but the encodings are two distinct, different encoding types. The first is a URL encoding and the second is an HTML encoding (for HTML source code). If any of these is incorrect, then i can find you a bug. But step 3 is different for XML. For XML, then XML character entity encoding is needed instead (which is almost identical). My point is that the last encoding is dependent upon the context of the URL, whether that be in an HTML web page, or in XML documentation.
Now with the much simpler semicolon separators, the process is as one would expect:
1: URL encode the keys and values,
2: concatenate the keys and values together. (There is no third encoding step.)
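Here is a short sketch (mine, in Python rather than the answerer's .NET) of the two processes just described, side by side:

from urllib.parse import quote_plus
from html import escape

pairs = {"who": "me & you", "q": "4"}
encoded = [quote_plus(k) + "=" + quote_plus(v) for k, v in pairs.items()]

# Ampersand separator: URL-encode, join with "&", then HTML-encode for the page source.
amp_qs = "&".join(encoded)
print(escape("?" + amp_qs, quote=True))   # ?who=me+%26+you&amp;q=4

# Semicolon separator: URL-encode and join; no further encoding is needed before
# writing the link into HTML, because no raw "&" remains in the query string.
semi_qs = ";".join(encoded)
print("?" + semi_qs)                      # ?who=me+%26+you;q=4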
I think most web developers skip step 3 because browsers are so lenient. But this leads to bugs and further complications: hunting down those bugs, users being unable to do things that would work if those bugs were not present, writing bug reports, and so on.
Another complication in real use is when writing XML documentation markup in my source code in both C# and VB.NET. Since & must be encoded, it's a real drag, literally, on my productivity. That extra step 3 makes it harder to read the source code too. So this harder-to-read deficit applies not only to HTML and XML, but also to other applications like C# and VB.NET code because their documentation uses XML documentation. So the step #3 encoding complication proliferates to other applications too.
So in summary, using the ; as a separator is simple because the (correct) process when using the semicolon is how one would normally expect the process to be: only one step of encoding needs to take place.
Perhaps this wasn't too confusing. But all the confusion or difficulty is due to using a separation character that should be HTML encoded. Thus '&' is the culprit, and the semicolon relieves all that complication.
(I will point out that my 3 step vs 2 step process above is usually how many steps it would take for most applications. However, for completely robust code, all 3 steps are needed no matter which separator is used. But in my experience, most implementations are sloppy and not robust. So using semicolon as the querystring separator would make life easier for more people with less website and interop bugs, if everyone adopted the semicolon as the default instead of the ampersand.)

What characters are unsafe in query strings?

I need to prevent the characters that cause vulnerabilities in the URL.
My sample URL is http://localhost/add.aspx?id=4;req=4.
Please give the list of characters that I need block.
I am using an ASP.NET web page. I am binding the information from an SQL Server database.
I just want a list of the characters to block, so that attackers cannot enter unwanted strings in the URL.
Depending on what technology you're using, there is usually a built-in function that will handle this for you.
ASP.NET (VB) & Classic ASP
myUrl = Server.UrlEncode(myUrl)
ASP.NET (C#)
myUrl = Server.UrlEncode(myUrl);
PHP
$myUrl = urlencode($myurl);
If you simply would like to remove unsafe characters, you would need a regular expression. RFC 1738 defines what characters are unsafe for URLs:
Unsafe:
Characters can be unsafe for a number of reasons. The space character is unsafe because significant spaces may disappear and insignificant spaces may be introduced when URLs are transcribed or typeset or subjected to the treatment of word-processing programs. The characters "<" and ">" are unsafe because they are used as the delimiters around URLs in free text; the quote mark (""") is used to delimit URLs in some systems. The character "#" is unsafe and should always be encoded because it is used in World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it. The character "%" is unsafe because it is used for encodings of other characters. Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".
I need to prevent the characters that cause vulnerabilities
Well, of course you need to URL encode, as the answers have said. But does not URL encoding cause vulnerabilities? Well, normally not directly; mostly it just makes your application break when unexpected characters are input.
If we're talking about web ‘vulnerabilities’, the most common ones today are:
Server-side code injection, compromising your server
SQL injection, compromising your database
HTML injection, allowing cross-site scripting (XSS) attacks against your users
Unvalidated actions, allowing cross-site request forgery (XSRF) attacks against your users
These are in order of decreasing seriousness and increasing commonness. (Luckily few web site authors are stupid enough to be passing user input to system() these days, but XSS and XSRF vulnerabilities are rife.)
Each of these vulnerabilities requires you to understand the underlying problem and cope with it deliberately. There is no magic list of “strings you need to block” that will protect your application if it is playing naïve about security. There are some add-ons that do things like blocking the string ‘<script>’ when submitted, but all they give you is a false sense of security since they can only catch a few common cases, and are usually easy to code around.
They'll also stop those strings being submitted when you might genuinely want them. For example, some (stupid) PHP authors refuse all incoming apostrophes as an attempt to curb SQL-injection; result is you can't be called “O'Reilly”. D'oh. Don't block; encode properly.
For example, to protect against SQL injection make sure to SQL-escape any strings that you are making queries with (or use parameterised queries to do this automatically); to protect against HTML injection, HTML-encode all text strings you output onto the page (or use a templating/MVC scheme that will do this automatically).
My sample URL http://localhost/add.aspx?id=4;req=4
Is there supposed to be something wrong with that URL? It's valid to separate two query parameters with a ‘;’ instead of the more common ‘&’, but many common web frameworks lamentably still don't understand this syntax by default (including Java Servlet and ASP.NET). So you'd have to go with ‘id=4&req=4’ — or, if you really wanted that to be a single parameter with a literal semicolon in it, ‘id=4%3Breq%3D4’.
http://en.wikipedia.org/wiki/Query_string#URL_encoding
See also: https://www.rfc-editor.org/rfc/rfc3986#section-2.2
I wrote this, for pretty URLs, but it's of course not complete:
""",´,’,·,‚,*,#,?,=,;,:,.,/,+,&,$,<,>,#,%,{,(,),},|,,^,~,[,],—,–,-',,"
And then I replace spaces " " (and runs of spaces) with "-".
A better approach is to do this with, or combine it with, a regular expression.

using non-latin characters in a URL

I'm working on a site which the client has had translated into Croatian and Slovenian. In keeping with our existing URL patterns we have generated URL rewriting rules that mimic the layout of the application, which has led to having many non-ASCII characters in the URLs.
Examples š ž č
Some links are triggered from Flash using getURL, some are standard HTML links. Some are programmatic Response.Redirects and some are done by adding 301 status codes and Location headers to the response. I'm testing in IE6, IE7 and Firefox 3 and, intermittently, the browsers display the non-Latin characters URL-encoded.
š = %c5%a1
ž = %c5%be
č = %c4%8d
I'm guessing this is something to do with IIS and the way it handles Response.Redirect and AddHeader("Location ...
Does anyone know of a way of forcing IIS not to URL-encode these characters, or is my best bet to replace them with non-diacritic characters?
Thanks
Ask yourself if you really want them non-url encoded. What happens when a user that does not have support for those characters installed comes around? I have no idea, but I wouldn't want to risk making large parts of my site unavailable to a large part of the world's computers...
Instead, focus on why you need this feature. Is it to make the URLs look nice? If so, using a regular z instead of ž will do just fine. Do you use the URLs for user input? If so, URL-encode everything before putting it into link output, and URL-decode it before using the input. But don't use ž and other local letters in URLs...
As a side note, in Sweden we have å, ä and ö, but no one ever uses them in URLs - we use a, a and o, because browsers won't support the URLs otherwise. This doesn't surprise the users, and very few are unable to understand what words we're aiming at just because the ring in å is missing in the URL. The text will still show correctly on the page, right? ;)
Does anyone know of a way of forcing IIS to not URL encode
You must URL-encode. Passing a raw ‘š’ (\xC5\xA1) in an HTTP header is invalid. A browser might fix the error up to ‘%C5%A1’ for you, but if so the result won't be any different to if you'd just written ‘%C5%A1’ in the first place.
Including a raw ‘š’ in a link is not wrong as such, the browser is supposed to encode it to UTF-8 and URL-encode as per the IRI spec. But to make sure this actually works you should ensure that the page with the link in is served as UTF-8 encoded. Again, manual URL-encoding is probably safest.
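A minimal sketch of that "manual URL-encoding" approach (the file name is hypothetical, and Python is my choice rather than the OP's ASP.NET):

from urllib.parse import quote

path = "/datoteke/priročnik.pdf"          # hypothetical Slovenian file name
location = quote(path, safe="/")          # UTF-8 bytes, percent-encoded; "/" kept literal
print(location)                           # /datoteke/priro%C4%8Dnik.pdf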
I've had no trouble with UTF-8 URLs, can you link to an example that is not working?
do you have a link to a reference where it details what comprises a valid HTTP header?
Canonically, RFC 2616. However, in practice it is somewhat unhelpful. The critical passage is:
Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 only when encoded according to the rules of RFC 2047.
The problem is that according to the rules of RFC 2047, only ‘atoms’ can accommodate a 2047 ‘encoded-word’. TEXT, in most situations it is included in HTTP, cannot be contrived to be an atom. Anyway RFC 2047 is explicitly designed for RFC 822-family formats, and though HTTP looks a lot like an 822 format, it isn't in reality compatible; it has its own basic grammar with subtle but significant differences. The reference to RFC 2047 in the HTTP spec gives no clue for how one might be able to interpret it in any consistent way and is, as far as anyone I know can work out, a mistake.
In any case no actual browser attempts to find a way to interpret RFC 2047 encoding anywhere in its HTTP handling. And whilst non-ASCII bytes are defined by RFC 2616 to be in ISO-8859-1, in reality browsers can use a number of other encodings (such as UTF-8, or whatever the system default encoding is) in various places when handling HTTP headers. So it's not safe to rely even on the 8859-1 character set! Not that that would have given you ‘š’ anyhow...
Those characters should be valid in a URL. I did the URL SEO stuff on a large travel site and that's when I learnt that. When you force diacritics to ascii you can change the meaning of words if you're not careful. There often is no translation as diacritics only exist in their context.
