ampersand in URL of RSS Feed - asp.net

As part of our app, user can save some data as XML on server which becomes RSS feed for them.
Now some of the file user created have & in file name as BB&T_RSS.xml.
So when user point this to http://example.com/BB&T.xml, they won't get this.
How to stop this? I tried BB%26T.xml, BB&T.xml without any success with IE, Chrome

use an
%26
for an
&
http://example.com/BB%26T.xml,
http://www.w3schools.com/tags/ref_urlencode.asp
then use
HttpServerUtility.UrlDecode Method
to get the file from the url again
URL encoding ensures that all browsers will correctly transmit text in URL strings. Characters such as a question mark (?), ampersand (&), slash mark (/), and spaces might be truncated or corrupted by some browsers. As a result, these characters must be encoded in tags or in query strings where the strings can be re-sent by a browser in a request string.
Many URL schemes reserve certain characters for a special meaning:
their appearance in the scheme-specific part of the URL has a
designated semantics. If the character corresponding to an octet is
reserved in a scheme, the octet must be encoded. The characters ";",
"/", "?", ":", "#", "=" and "&" are the characters which may be
reserved for special meaning within a scheme. No other characters may
be reserved within a scheme. (src)

Related

Should the plus in tel URIs be encoded?

In a URI, spaces can be encoded as +. Since this is the case, should the leading plus be encoded when creating tel URIs with international prefix?
Which is better? Do both work in practice?
Call me
Call me
No.
From section 3 of RFC 3966 (The tel URI for Telephone Numbers):
If the reserved characters "+", ";", "=", and "?" are used as delimiters between components of the "tel" URI, they MUST NOT be percent encoded.
You would only percent-encode a + if it’s part of a parameter value:
These characters ["+", ";", "=", and "?"] MUST be percent encoded if they appear in tel URI parameter values.
I’m not sure if the leading +, which indicates that it’s a global number, counts as delimiter, but the definition of a global number says:
Globally unique numbers are identified by the leading "+" character.
So it refers to +, not to something percent-encoded.
And also the examples make clear that it’s not supposed to be percent-encoded, e.g.:
tel:+1-201-555-0123
Note that spaces in tel URIs (e.g., in parameter values) may not be encoded with a +. Using + instead of %20 for a space character is not something that may be done in any URI; it’s only possible in URIs whose URI scheme explicitly defines that.
The tel: URI scheme doesn't have a provision for encoding spaces - see RFC 3966:
5.1.1. Separators in Phone Numbers
...
even though ITU-T E.123 [E.123] recommends the use of space
characters as visual separators in printed telephone numbers, "tel"
URIs MUST NOT use spaces in visual separators to avoid excessive
escaping.
The plus sign encodes a space specifically only in application/x-www-form-urlencoded (default content type for form submission - see W3C info re: forms). There's no valid way to encode a space in tel: URIs. See again RFC 3966 (page 5) for valid visual separators.

Response.Redirect Ampersand Encoding

Failing to escape an "&" character in HTML markup creates an entity. It is often done inadvertently when linking URLs in a document, and W3C's Markup Validation Service will consider this an error.
I'm wondering, does ASP.NET's Response.Redirect method expect ampersands to be escaped in its url parameter? From reading its MSDN description, I honestly can't tell.
Pass the URL exactly as it should appear in the address bar in the web browser. For example, if you're trying to redirect to http://example.com/?foo=bar&baz=quux, then pass that exact string as-is to Response.Redirect.
try UrlEncode The UrlEncode(String) method can be used to encode the entire URL, including query-string values. If characters such as blanks and punctuation are passed in an HTTP stream without encoding, they might be misinterpreted at the receiving end. URL encoding converts characters that are not allowed in a URL into character-entity equivalents; URL decoding reverses the encoding. For example, when the characters < and > are embedded in a block of text to be transmitted in a URL, they are encoded as %3c and %3e. URLEncode
System.Web.HttpUtility.UrlEncode(string url)

Characters in URL after host transmitted as-is?

If I visit the URL:
http://foo/bar
Than the web browser will connect to host foo by TCP on port 80 and transmit a GET request:
GET /bar HTTP...
Clearly not all characters in the bar part will work (be transmitted verbatim). For example the space character (#20).
Of the 256 possible bytes, which will a standard web browser transmit verbatim (as is, without special encoding) from a URL entered into the address bar to the GET request, and which will it not?
RFC-3986 talks about reserved and unreserved characters. Unreserved characters can be used without special encoding; everything else must be url encoded, using the %xx notation.
The unreserved characters include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.
The browser is smart enough to automatically escape space and other characters before opening a socket connection to the server.
EDIT
Reserved characters need not be encoded when used for their intended purpose. But, when they are used as part of query string or path components, they must be url-encoded.
The reserved characters are ! * ' ( ) ; : # & = + $ , / ? # [ ]
For example, here is a url
http://example.com:8090/email/tigerwoods%40gmail.com?folder=sport%2Fgolf
Over here, the / and ? are not encoded when they serve their normal roles. However, the # symbol is encoded as part of the email. The / is also encoded as part of the query string parameter.

What's the correct encoding of HTTP get request strings?

Does the HTTP standard or something define which encoding should be used on special characters before they are encoded in url with %XXs? If it doesn't define is there a way define which encoding is used? It seems that most browsers send the data in utf-8.
Does the HTTP standard or something define which encoding should be used on special characters before they are encoded in url with %XXs?
The HTTP standard, no. But another standard, IRI, can come into play.
URIs are explicitly (once %-decoded) byte sequences. What Unicode characters those bytes map onto is not specified by the URI standard or the HTTP standard for http:-scheme URIs.
Specifically for query parameters: web browsers will use the encoding of the originating page to make a form submission GET URL, so if you have a page in ISO-8859-1 and you put ‘é’ in a search box you'll get ‘?search=%E9’, but if you do the same in a page encoded as UTF-8 you'll get ‘?search=%C3%E9’. If you don't serve your form page with any particular charset the browser will guess, which you don't want as it'll make it impossible to guess what format the submission is going to come in as.
For the other parts of a URL, a browser won't generate them itself, but if you supply it with non-ASCII characters in links it will usually encode them as UTF-8. This is not reliable as it depends on browser and locale settings, so it's best not to use this at the moment.
The standard that properly allows non-ASCII characters in links is IRI. IRI converts to URI by UTF-8-%-encoding most of the URL, but the hostname is converted using Punycode instead. For compatibility it is best not to rely on browsers understanding IRIs in links yet. Instead, UTF-8-then-%-encode your path and parameter characters yourself. They will still appear as the right characters in the address bar in modern browsers; unfortunately IE won't display the decoded-character IRI form in all cases, depending on language settings.
The Wiki IRI for the Greek gamma character is:
http://en.wikipedia.org/wiki/Γ
Encoded into a URI, it is:
http://en.wikipedia.org/wiki/%CE%93
Per RFC 2616,
CHAR = <any US-ASCII character (octets 0 - 127)>
and
token = 1*<any CHAR except CTLs or separators>
separators = "(" | ")" | "<" | ">" | "#"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
and URIs are tokens with various specific separators. So, in theory, nothing but US-ASCII should be there. (In practice, since the ISO-8859-1 extension to US-ASCII is used in many other spots in the HTTP specs, it's not unusual to find HTTP implementations which support ISO-8859-1 rather than just US-ASCII, but strictly speaking that's not standards-compliant HTTP).
As far as I'm aware, there is no way to define it, though I've always assumed that it is ASCII, since that is what DNS is (currently, though localised DNS is coming, with all the problems that entails).
Note: UTF8 is "ASCII compatible" unless you try to use extended characters. This probably plays some small part in the reasoning behind why some browsers might send their GET data UTF8 encoded.
EDIT: From your comment, it seems like you don't know how the % encoding works at all, so here goes.
Given the following string query string, "?foo=Hello World!", the "Hello World!" part needs URL encoding. The way this works is any 'special' characters get their ASCII value taken and converted to hex prefixed by a '%'. So the above string would convert to "?foo=Hello%20World%21".

when assigning location.href, please explain url encoding (in asp.net and firefox)

In some javascript, I have:
var url = "find.aspx?" + "location=" + encodeURIComponent( address );
alert( url );
location.href = url;
where the value of address is the string "Seattle, WA".
In the alert I see
find.aspx?Seattle%2C%20WA
as I expect.
But on the server side, when I look at Request.Url, the relevant substring I see is
find.aspx?Seattle, WA
And in the Firefox url window I see
find.aspx?location=Seattle%2C WA
So I'm getting three different representations whereas I would expect that in all three places I should see what I see in the alert. My expectation is that the url I assign to location.href should show up as-is in the browser url window, and should be passed as-is to the server in Request.Url (and I would need to decode the values on the server before using them). What's happening?
Firefox converts certain encoded characters into their literal forms as a way to be friendly to users. It will also convert spaces typed into the address bar into %20 for the server.
Update: The reason Firefox doesn't display the comma unencoded is because commas are allowed in URLs, but spaces are not, so it knows that a space is going to be unambiguously interpreted, whereas the pre-encoded comma is different from a non-encoded comma to some servers. see: Can I use commas in a URL?
ASP is probably trying to help you out by auto-un-encoding the string for you.
Update: It looks like ASP.NET unencodes Request.Url for you by default, as mentioned here: QueryString malformed after URLDecode They also mention that you can use HttpRequest.Url.Query to access the un-decoded version.
The alert is the only thing not doing any "magic" for you.
For the alert, you are doing the encoding yourself. Perhaps it looks the same as on the server-side if you removed encodeURIComponent.
On the server side, ASP.NET will always show you the unencoded form. This is to make it easier to directly map to files that also have text that needed to be (un)encoded.
Note that you can replace every letter for its UTF8 representation in URL Encoding. It will still be the same URL. I.e., type the following in the browser window and it will still work: %66%59%6E%64.aspx?location=Seattle%2C%20WA. To only encode the necessary chars, use UrlEncode on the server side if you create a link yourself.
URL encoding can become fairly tricky. You ask to explain it. To know the correct escape of a certain character, you need to know how that character looks in UTF8. The hexadecimal value of the UTF-8 bytes then become the %XX%YY value of your letter. Sometimes it's one %XX, but it can be up to six byte sequences in total (some Chinese characters for instance).
URL Encoding works one way only. Never double-encode or double-unencode. This is prohibited by the specification. Also, because you can encode any character, it is not always possible (as you found out) to do roundtrip encoding/unencoding. If you unencode and re-encode again, it is well possible that the resulting string is different, but syntactically the same.
In HTML, URL Encoding is sometimes interspersed with HTML Encoding. I.e., the ampersand is valid in HTML, but not in HTML. find.aspx?city=A&name=B becomes find.aspx?city=A&name=B in and HTML URL. However, browsers are lenient and will accept wrongly HTML-encoded strings.
Finally, a not on the browser: if you type in a space in a link, even inside an <a> tag, it will escape the space (or other character) for you. Likewise, it will nowadays show the odd characters (é, ï etc) in the address bar, but when it sends it over HTTP, the browser will correctly do the encoding for you.
Update: about anwering your question of needing a "definitive" reference or proof.
While I couldn't find any on the internet, I decided to look for it myself using Reflector. Going through the methods that set, for instance, the HttpRequest.QueryString, you quickly encounter the private method HttpRequest.FillInQueryStringCollection which then calls HttpValueCollection.FillfromEncodedBytes. Somewhat near the end of that method, HttpUtility.UrlDecode is called for the values. Conclusion: do not call it yourself, to prevent double decoding.
You can see this for yourself when you download Reflector and disassemble the .NET libs of System.Web.
For your example you can change this line
var url = "find.aspx?" + "location=" + encodeURIComponent( address );
to
var url = "find.aspx?" + "location=" + address;
and see the address as it is. Bu if address variable contains any '&' character your variable will be corrupt. So you are using encodeURIComponent to encode these things url.
On the Server side all these encoded strings are decoded back. It means encodeURIComponent is just for sending the address variable (whether it contains & character or not) to server side correctly.

Resources