HttpServerUtility.UrlPathEncode vs HttpServerUtility.UrlEncode - asp.net

What's the difference between HttpServerUtility.UrlPathEncode and HttpServerUtility.UrlEncode? And when should I choose one over the other?

UrlEncode is useful for query string values (so to the left or especially, right, of each =).
In this url, foo, fooval, bar, and barval should EACH be UrlEncode'd separately:
http://www.example.com/whatever?foo=fooval&bar=barval
UrlEncode encodes everything, such as ?, &, =, and /, accented or other non-ASCII characters, etc, into %-style encoding, except space which it encodes as a +. This is form-style encoding, and is best for something you intend to put in the querystring (or maybe between two slashes in a url) as a parameter without it getting all jiggy with the url's control characters (like &). Otherwise an unfortunately placed & or = in a user's form input or db value value could break things.
EDIT: Uri.EscapeDataString is a very close match to UrlEncode, and may be preferable, though I don't know the exact differences.
UrlPathEncode is useful for the rest of the query string, it affects everything to the left of the ?.
In this url, the entire url (from http to barval) should be run through UrlPathEncode.
http://www.example.com/whatever?foo=fooval&bar=barval
UrlPathEncode does NOT encode ?, &, =, or /. It DOES, however, like UrlEncode, encode accented/non-ASCII characters with % notation, and space also becomes %20. This is useful to make sure the url is valid, since spaces and accented characters are not. It won't touch your querystring (everything to the right of ?), so you have to encode that with UrlEncode, above.

Update: as of 4.5, per MSDN reference, Microsoft recommends to only use UrlEncode. Also, the information previously listed in MSDN does not fully describe behavior of the two methods - see comments.
The difference is all in the space escaping - UrlEncode escapes them into + sign, UrlPathEncode escapes into %20. + and %20 are only equivalent if they are part of QueryString portion per W3C. So you can't escape whole URL using + sign, only querystring portion. Bottom line is that UrlPathEncode is always better imho
You can encode a URL using with the UrlEncode() method or the UrlPathEncode() method. However, the methods return different results. The UrlEncode() method converts each space character to a plus character (+). The UrlPathEncode() method converts each space character into the string "%20", which represents a space in hexadecimal notation. Use the UrlPathEncode() method when you encode the path portion of a URL in order to guarantee a consistent decoded URL, regardless of which platform or browser performs the decoding.
http://msdn.microsoft.com/en-us/library/4fkewx0t.aspx

To explain it as simply as possible:
HttpUtility.UrlPathEncode("http://www.foo.com/a b/?eggs=ham&bacon=1")
becomes
http://www.foo.com/a%20b/?eggs=ham&bacon=1
and
HttpUtility.UrlEncode("http://www.foo.com/a b/?eggs=ham&bacon=1")
becomes
http%3a%2f%2fwww.foo.com%2fa+b%2f%3feggs%3dham%26bacon%3d1

Related

How should I escape image URLs for CSS?

I have a div which will receive a CSS background image from user chosen URL, like so:
background-image: url("/* user specified URL here*/")
How should I escape the URL so that it's safe to embed in the CSS? Is escaping the quotes enough?
If you are setting the background url through JS, then the correct and safe ways is using encodeURI() and wrapping in quotes.
node.style.backgroundImage = 'url("' + encodeURI(url) + '")';
Is escaping the quotes enough?
No, you also should worry about backslashes and newlines.
Here is the CSS grammar for a double quoted URI:
http://www.w3.org/TR/CSS21/grammar.html#scanner
"([^\n\r\f\\"]|\\{nl}|{escape})"
where {nl} is
\n|\r\n|\r|\f
and {escape} is a backslash-escaped character. So a trailing backslash will break your CSS. A non-escaped newline likewise.
I would strongly recommend to remove all whitespace and finally escape " and \
Since the user data that you need to insert into CSS can be treated like a URL, and not just a string, you only need to ensure that it is properly URL-encoded.
This is safe because a well-formed URL does not contain any characters that are unsafe in CSS strings; except for apostrophe ('), which is not a problem as long as you use double quotes for your CSS string: url("...")
A simple way to do this is to URL-encode all characters that are not "reserved" or "unreserved" in URLs. According to RFC 3986, that would be all characters except for these:
A-Z a-z 0-9 ; , / ? : # & = + $ - _ . ! ~ * ' ( ) # [ ]
That is what encodeURI() does in Mārtiņš Briedis's JavaScript answer. (With one exception: encodeURI() encodes [ and ], which is mostly inconsequential.)
In addition to that, you might consider only allowing URLs that begin with https: or data:. By doing this you can prevent mixed content warnings if the page is served over HTTPS, and also avoid the javascript: issue Alexander O'Mara commented on.
There might be other URL parsing and validation that you want to do, but that is outside the scope of this question.
If you need to insert user data into a CSS string that cannot be treated like a URL, then you would need to do CSS backslash escaping. See user123444555621's answer for more on that.
const style = "background-image: url(\"" + CSS.escape(imageUrl) + "\")";
See https://developer.mozilla.org/en-US/docs/Web/API/CSS/escape
It is an experimental new thing, but it seems to be quite well supported (as of 2021).

Why do URL parameters use %-encoding instead of a simple escape character

For example, in Unix, a backslash (\) is a common escape character. So to escape a full stop (.) in a regular expression, one does this:
\.
But with % encoding URL parameters, we have an escape character, %, and a control code, so an ampersand (&) doesn't become:
%&
Instead, it becomes:
%26
Any reason why? Seems to just make things more complicated, on the face of it, when we could just have one escape character and a mechanism to escape itself where necessary:
%%
Then it'd be:
simpler to remember; we just need to know which characters to escape, not which to escape and what to escape them to
encoding-agnostic, as we wouldn't be sending an ASCII or Unicode representation explicitly, we'd just be sending them in the encoding the rest of the URL is going in
easy to write an encoder: s/[!\*'();:#&=+$,/?#\[\] "%-\.<>\\^_`{|}~]/%&/g (untested!)
better because we could switch to using \ as an escape character, and life would be simpler and it'd be summer all year long
I might be getting carried away now. Someone shoot me down? :)
EDIT: replaced two uses of "delimiter" with "escape character".
Percent encoding happens not only to escape delimiters, but also so that you can transport bytes that are not allowed inside URIs (such as control characters or non-ASCII characters).
I guess it's because the URL Specification and specifically the HTTP part of it, only allow certain characters so to escape those one must replace them with characters that are allowed.
Also some allowed characters have special meanings like & and ? etc
so replacing them with a control code seems the only way to solve it
If you find it hard to recognize them, bookmark this page
http://www.w3schools.com/tags/ref_urlencode.asp

RegEx for Client-Side Validation of FileUpload

I'm trying to create a RegEx Validator that checks the file extension in the FileUpload input against a list of allowed extensions (which are user specified). The following is as far as I have got, but I'm struggling with the syntax of the backward slash (\) that appears in the file path. Obviously the below is incorrect because it just escapes the (]) which causes an error. I would be really grateful for any help here. There seems to be a lot of examples out there, but none seem to work when I try them.
[a-zA-Z_-s0-9:\]+(.pdf|.PDF)$
To include a backslash in a character class, you need to use a specific escape sequence (\b):
[a-zA-Z_\s0-9:\b]+(\.pdf|\.PDF)$
Note that this might be a bit confusing, because outside of character classes, \b represents a word boundary. I also assumed, that -s was a typo and should have represented a white space. (otherwise it shouldn't compile, I think)
EDIT: You also need to escape the dots. Otherwise they will be meta character for any character but line breaks.
another EDIT: If you actually DO want to allow hyphens in filenames, you need to put the hyphen at the end of the character class. Like this:
[a-zA-Z_\s0-9:\b-]+(\.pdf|\.PDF)$
You probably want to use something like
[a-zA-Z_0-9\s:\\-]+\.[pP][dD][fF]$
which is same as
[\w\s:\\-]+\.[pP][dD][fF]$
because \w = [a-zA-Z0-9_]
Be sure character - to put as very first or very last item in the [...] list, otherwise it has special meaning for range or characters, such as a-z.
Also \ character has to be escaped by another slash, even inside of [...].

Why does the encoding's of a URL and the query string part differ?

I was researching why my query parameters have plus + signs in it instead of %20 and why they have strings like %C3%BC instead of a ü (UTF-8) as an encoded URL does.
After 2 hours of thinking my webapp is not compatible to the URL encoding standard I found that the encoding scheme of a query string is not the same as the encoding of a URL (here i mean the part without the query string).
Examples:
URL:
whitespace encodes to %20
UTF-8 chars stays UTF-8 chars
Query params:
whitespace encodes to +
UTF-8 chars encodes to the hex representation
So can someone tell me why do the encoding schemes differ, since the query parameters are a part of the URL?
See:
wiki Percent-encoding
wiki: Query String
URIs originated in RFC 1630, with percent-encoding as a method to allow "unsafe" characters to be represented. This original version actually mentioned the ISO Latin 1 character set as the encoding for non-ASCII characters. RFC 1738 later that year removed this reference to Latin-1 in defining URLs.
The query string format is actually a different but related encoding, application/x-www-form-urlencoded, defined in RFC 1866 along with HTML 2.0. It was based on RFC 1738, but specified that spaces (not all whitespace, just the character with ASCII code 0x20) are replaced by '+' and that line breaks are to be encoded as CRLF (i.e. %0D%0A). The former is likely because that saves 2 bytes for a very common character in form submissions at the expense of using an extra 2 bytes for a much less common character, and the latter is to avoid problems when transferring between systems using different end-of-line codings. Non-ASCII characters were left unconsidered.
UTF-8 coding in URIs came over a decade later, in RFC 3986, although individual protocols may have specified this or another encoding of non-ASCII characters earlier. To maintain backwards compatibility, all UTF-8 octets must be percent-encoded. The companion RFC 3987 defines "Internationalized Resource Identifiers" (IRIs) which are basically "URIs with most codepoints 160 and above allowed to appear unencoded", but many protocols still require URIs. Note that your statement above is incorrect, as a URL may not contain an unencoded ü or any other non-ASCII character.
application/x-www-form-urlencoded has been internationalized in a different manner. The HTML5 specification of application/x-www-form-urlencoded explicitly allows that any ASCII-compatible character set may be used for characters in the query string, and in fact different fields may use different character sets, but all non-ASCII octets must still be percent-encoded. When used in the query part of an IRI, it is possible that these characters could be represented unencoded if properly-normalized UTF-8 is being used as the character set, since conversion back to a URI would result in correct application/x-www-form-urlencoded data.
They don't necessarily have to differ, a + is a valid path character and a ü is a valid search character (per RFC 3987). You're probably seeing browsers or some other preconceived encoding scheme making assumptions that are either outdated or overly cautious.
There is no difference between + and %20 when it comes to Query string parameters:
SPACE is encoded as '+' or '%20'
Quote reference

How exactly does an array get urlencoded?

Many languages allow one to pass an array of values through the url. I need to , for various reasons, directly construct the url by hand. How is an array of values urlencoded?
It looks like the content in the form of MIME-Type: application/x-www-form-urlencoded.
This is the default content type. Forms submitted with this content type must be encoded as follows:
Control names and values are escaped. Space characters are replaced by +, and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by %HH, a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., %0D%0A).
The control names/values are listed in the order they appear in the document. The name is separated from the value by = and name/value pairs are separated from each other by &.
Which is used for the POST. To do it for the GET, you'll have to append a ? after your URL, and the rest is almost equal. In the comments, mdma states, that the URL may not contain a + for a space character. Instead use %20.
So an array of values:
http://localhost/someapp/?0=zero&1=valueone%20withspace&2=etc&3=etc
Often there is some functionality in libraries that will do the URL encoding for you (point 1). Point two is easily implementable by looping over your array, building the string, appending the index, =, the URL encoded value and when it's not the last entry an &.

Resources