How to determine if a URI is escaped? - uri

I am using Apache Commons HttpClient to download web resources. The URIs for these resources come from third parties; I do not generate them.
The commons httpclient requires a URI object to be given to the GetMethod object.
The URI constructor takes a string (for the uri) and a boolean specifying if it is escaped or not.
Currently, I am doing the following to determine if the original url I am given is already escaped...
boolean isEscaped = URIUtil.getPathQuery(originalUrl).contains("%");
m.setURI(new URI(originalUrl, isEscaped));
Is this the correct way to determine if a uri is already escaped?
Update...
According to Wikipedia ( http://en.wikipedia.org/wiki/Percent-encoding ), percent is a reserved character and should always be encoded. I am quoting verbatim here:
Percent-encoding the percent character
Because the percent ("%")
character serves as the indicator for percent-encoded octets, it must
be percent-encoded as "%25" for that octet to be used as data within a
URI.
Doesn't this mean that you can never have a naked '%' character in a valid URI?
Also, the uri(s) come from various sources so I cannot be sure if they are escaped or unescaped.

This wouldn't work. It's possible that the unencoded string already has a % in it.
For example:
https://www.google.com/#q=like%25&safe=off
is the URL for a Google search for like%. In unescaped form it would be https://www.google.com/#q=like%&safe=off
Whoever supplies the URIs should let you know whether they are escaped or not.
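One thing that can be checked mechanically is the opposite direction: a string in which a % is not followed by two hex digits cannot be a correctly escaped URI. The following is a minimal sketch of that check; the class and method names (EscapeCheck, hasMalformedEscape) are hypothetical and not part of Commons HttpClient. A result of false does not prove the string is escaped, it only means the string is consistent with being escaped.

```java
final class EscapeCheck {

    // True for the ASCII hex digits 0-9, a-f, A-F.
    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    // Returns true if some '%' in s is not followed by two hex digits,
    // i.e. the string cannot be a correctly percent-encoded URI.
    static boolean hasMalformedEscape(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) != '%') {
                continue;
            }
            boolean complete = i + 2 < s.length()
                    && isHex(s.charAt(i + 1))
                    && isHex(s.charAt(i + 2));
            if (!complete) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // '%' followed by "&s": definitely not escaped -> prints true
        System.out.println(hasMalformedEscape("https://www.google.com/#q=like%&safe=off"));
        // "%25" is a well-formed escape -> prints false
        System.out.println(hasMalformedEscape("https://www.google.com/#q=like%25&safe=off"));
    }
}
```

A string such as like%25 still remains ambiguous: it could be the escaped form of like% or the unescaped text "like%25", which is why the information has to come from the source of the URI.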

Related

Servlet stripping parameter values because of # character

My URL is http://175.24.2.166/download?a=TOP#0;ONE=1;TWO2.
How should I encode the parameter so that when I print the parameter in the Servlet, I get the value in its entirety? Currently when I print the value by using request.getParameter("a") I get the output as TOP instead of TOP#0;ONE=1;TWO2.
You should encode it like this: http://175.24.2.166/download?a=TOP%230%3BONE%3D1%3BTWO2 . There are many encoders available in Java; you can try URLEncoder, or an online encoder for experiments.
The part starting at # is known as the "fragment identifier".
As mentioned on Wikipedia:
The fragment identifier introduced by a hash mark # is the optional last part of a URL for a document. It is typically used to identify a portion of that document.
The part after the # is information for the client and is not sent to the server. Put everything your client needs here.
You need to encode your query string.
In JavaScript, you can use the encodeURIComponent() function, which encodes a URI component, including special characters.
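On the Java side, the URLEncoder suggestion can be sketched as follows. This assumes Java 10 or later for the encode(String, Charset) overload; the class name QueryParamEncoding is hypothetical. Note that URLEncoder performs form encoding (a space becomes +), which is usually acceptable for query parameter values.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class QueryParamEncoding {

    // Percent-encodes a single query parameter value using UTF-8.
    static String encodeValue(String raw) {
        return URLEncoder.encode(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String value = "TOP#0;ONE=1;TWO2";
        // Only the value is encoded; the rest of the URL is left untouched.
        String url = "http://175.24.2.166/download?a=" + encodeValue(value);
        // prints http://175.24.2.166/download?a=TOP%230%3BONE%3D1%3BTWO2
        System.out.println(url);
    }
}
```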

User info in URI without password

I know that URI supports the following syntax:
http://[user]:[password]@[domain.tld]
When there is no password or if the password is empty, is there a colon?
In other words, should I accept this:
http://[user]:@[domain.tld]
Or this:
http://[user]@[domain.tld]
Or are they both valid?
The current URI standard (STD 66) is RFC 3986, and the relevant section is 3.2.1. User Information.
There it’s defined that the userinfo subcomponent (which gets followed by @) can contain any combination of
the character :,
percent-encoded characters, and
characters from the sets unreserved and sub-delims.
So this means that both of your examples are valid.
However, note that the format user:password is deprecated. The RFC still gives recommendations on how applications should handle such URIs, i.e., everything after the first : character should not be displayed by applications, unless
the data after the colon is the empty string (indicating no password).
So according to this recommendation, the userinfo subcomponent user: indicates that there is the username "user" and no password.
This is more a matter of convenience; both are valid. I would go with http://[user]@[domain.tld] (and prompt for a password) because it is simple and unambiguous. It does not leave the user wondering whether something has to be added after the :
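To accept both forms, an application can split the userinfo at the first colon and treat "no colon" and "nothing after the colon" as distinct cases. The sketch below illustrates this under the assumption that the userinfo is the part of the authority before the @ sign; the class and method names (UserInfoSplit, split) are hypothetical, and example.com is a placeholder host.

```java
import java.net.URI;

final class UserInfoSplit {

    // Splits a userinfo string at the first ':'.
    // Returns {user, password}; password is null when there is no ':' at all,
    // and the empty string when the ':' is present but nothing follows it.
    static String[] split(String userInfo) {
        int colon = userInfo.indexOf(':');
        if (colon < 0) {
            return new String[] { userInfo, null };
        }
        return new String[] { userInfo.substring(0, colon), userInfo.substring(colon + 1) };
    }

    public static void main(String[] args) {
        for (String s : new String[] { "http://user:@example.com/", "http://user@example.com/" }) {
            String userInfo = URI.create(s).getUserInfo();
            String[] parts = split(userInfo);
            System.out.println(s + " -> user=" + parts[0] + ", password=" + parts[1]);
        }
    }
}
```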

Is a URL with // in the path-section valid?

I have a question regarding URLs:
I've read the RFC 3986 and still have a question about one URL:
If a URI contains an authority component, then the path component
must either be empty or begin with a slash ("/") character. If a URI
does not contain an authority component, then the path cannot begin
with two slash characters ("//"). In addition, a URI reference
(Section 4.1) may be a relative-path reference, in which case the
first path segment cannot contain a colon (":") character. The ABNF
requires five separate rules to disambiguate these cases, only one of
which will match the path substring within a given URI reference. We
use the generic term "path component" to describe the URI substring
matched by the parser to one of these rules.
I know that //server.com:80/path/info is valid (it is a scheme-relative URL).
I also know that http://server.com:80/path//info is valid.
But I am not sure whether the following one is valid:
http://server.com:80//path/info
The problem behind my question is that a cookie is not sent to http://server.com:80//path/info when it was created by the URI http://server.com:80/path/info with a restriction to /path.
See url with multiple forward slashes, does it break anything?, Are there any downsides to using double-slashes in URLs?, What does the double slash mean in URLs? and RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax.
Consensus: browsers will send the request as-is; they will not alter it. The / character is the path separator, and the path is defined as:
path-abempty = *( "/" segment )
segment = *pchar
Because a segment may be empty, the slash after http://example.com/ can be directly followed by another slash, ad infinitum. Servers might ignore it, but browsers don't, as you have figured out.
The phrase:
If a URI does not contain an authority component, then the path cannot begin
with two slash characters ("//").
This rule exists because a leading // would otherwise be read as the start of an authority (as in protocol-relative URLs). It applies only when no authority is present, and your example has one (server.com:80).
So: yes, it is valid, no, don't use it.
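The cookie symptom from the question can be illustrated with a simplified path-match in the spirit of RFC 6265, section 5.1.4. This is a sketch only: the class and method names (CookiePathMatch, pathMatches) are hypothetical, and real browsers may apply additional rules.

```java
import java.net.URI;

final class CookiePathMatch {

    // Simplified cookie path-match: the request path matches if it equals the
    // cookie path, or starts with it and the boundary falls on a '/'.
    static boolean pathMatches(String requestPath, String cookiePath) {
        if (requestPath.equals(cookiePath)) {
            return true;
        }
        if (!requestPath.startsWith(cookiePath)) {
            return false;
        }
        return cookiePath.endsWith("/") || requestPath.charAt(cookiePath.length()) == '/';
    }

    public static void main(String[] args) {
        String single = URI.create("http://server.com:80/path/info").getPath();
        String doubled = URI.create("http://server.com:80//path/info").getPath();
        // The doubled path begins with "//", so "/path" is not a prefix of it.
        System.out.println(single + " matches /path: " + pathMatches(single, "/path"));
        System.out.println(doubled + " matches /path: " + pathMatches(doubled, "/path"));
    }
}
```

Under this matching rule the two URLs are syntactically valid but different, which is consistent with the cookie not being sent for the double-slash variant.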

.Net Uri Encoding RFC 2396 vs RFC 3986

First, some quick background... As part of an integration with a third party vendor, I have a C# .Net web application that receives a URL with a bunch of information in the query string. That URL is signed with an MD5 hash and a shared secret key. Basically, I pull in the query string, remove their hash, perform my own hash on the remaining query string, and make sure mine matches the one that was supplied.
I'm retrieving the Uri in the following way...
Uri uriFromVendor = new Uri(Request.Url.ToString());
string queryFromVendor = uriFromVendor.Query.Substring(1); //Substring to remove question mark
My issue stems from query strings that contain special characters like an umlaut (ü). The vendor is calculating their hash based on the RFC 2396 representation, which is %FC. My C# .Net app is calculating its hash based on the RFC 3986 representation, which is %C3%BC. Needless to say, our hashes don't match, and I throw my errors.
Strangely, the documentation for the Uri class in .Net says that it should follow RFC 2396 unless otherwise set to RFC 3986, but I don't have the entry in my web.config file that they say is required for this behavior.
How can I force the Uri constructor to use the RFC 2396 convention?
Failing that, is there an easy way to convert the RFC 3986 octet pairs to RFC 2396 octets?
Nothing to do with your question, but why are you creating a new Uri here? You can just do string queryFromVendor = Request.Url.Query.Substring(1); – atticae
+1 for atticae! I went back to try removing the extraneous Uri I was creating and suddenly, the string had the umlaut encoded as UTF-8 instead of UTF-16.
At first, I didn't think this would work. Somewhere along the line, I had tried retrieving the url using Request.QueryString, but this was causing the umlaut to come through as %ufffd which is the � character. In the interest of taking a fresh perspective, I tried atticae's suggestion and it worked.
I'm pretty sure the answer has to do with something I read here.
C# uses UTF-16 in all its strings, with tools to encode when it comes to dealing with streams and files that bring us onto...
ASP.NET uses UTF-8 by default, and it's hard to think of a time when it isn't a good choice...
My problems stemmed from here...
Uri uriFromVendor = new Uri(Request.Url.ToString());
By taking the Request.Url uri and creating another uri, it was encoding as the C# standard UTF-16. By using the original uri, it remained in the .Net standard UTF-8.
Thanks to all for your help.
I'm wondering if this is a bit of a red herring:
I say this because FC is the UTF-16 representation of the u with umlaut; C3BC is the UTF-8 representation.
I wonder if one of the System.Text.Encoding methods to convert the source data into a normal .Net string might help.
This question might be of interest too: Encode and Decode rfc2396 URLs
I don't know about the standard encoding for Uri constructors, but if everything else fails you could always decode the URL yourself and encode it in whatever encoding you like.
The HttpUtility class has UrlDecode() and UrlEncode() methods, which let you specify the System.Text.Encoding as the second parameter.
For example:
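// Decode using the default (UTF-8), then re-encode as ISO-8859-1 so that ü becomes a single byte (%fc).
// "utf-16" would emit two bytes per character (ü -> %fc%00), which would not match the vendor's form.
string decodedQueryString = HttpUtility.UrlDecode(Request.Url.Query.Substring(1));
string encodedQueryString = HttpUtility.UrlEncode(decodedQueryString, System.Text.Encoding.GetEncoding("iso-8859-1"));
// calc hash here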
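The mismatch in this thread comes down to which byte encoding is applied before percent-encoding. The sketch below shows the two results for ü; it is written in Java rather than C#, purely for illustration, and assumes Java 10 or later. The class name UmlautEncoding is hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class UmlautEncoding {

    // Percent-encodes s after converting it to ISO-8859-1 bytes.
    static String latin1(String s) {
        return URLEncoder.encode(s, StandardCharsets.ISO_8859_1);
    }

    // Percent-encodes s after converting it to UTF-8 bytes.
    static String utf8(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String umlaut = "\u00fc"; // ü, written as an escape to avoid source-encoding issues
        System.out.println(latin1(umlaut)); // prints %FC
        System.out.println(utf8(umlaut));   // prints %C3%BC
    }
}
```

Both sides of a signed-URL exchange need to agree on one of these forms before hashing; otherwise the hashes differ even though the decoded text is identical.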

Query string: Can a query string contain a URL that also contains query strings?

Example:
http://foo.com/generatepdf.aspx?u=http://foo.com/somepage.aspx?color=blue&size=15
I added the iis tag because I am guessing it also depends on what server technology you use?
The server technology shouldn't make a difference.
When you pass a value to a query string you need to url encode the name/value pair. If you want to pass in a value that contains a special character such as a question mark (?) you'll just need to encode that character as %3F. If you then needed to recursively pass another query string to the encoded url, you'll need to double/triple/etc encode the url resulting in the original ? turning into %253F, %25253F, etc.
You'll probably want to UrlEncode the URL that is in the query string.
As reported in http://en.wikipedia.org/wiki/Query_string
W3C recommends that all web servers support semicolon separators in
addition to ampersand separators (link reported on that wiki page) to allow
application/x-www-form-urlencoded query strings in URLs within HTML
documents without having to entity escape ampersands.
So I suppose the answer to the question is yes, and you have to replace the "&" ampersand usually used as the key=value separator with a ";" semicolon.
Yes it can, as far as I can tell, according to RFC 3986: Uniform Resource Identifier (URI): Generic Syntax (from year 2005):
This is the BNF for the query string:
query = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
The spec says:
The characters slash ("/") and question mark ("?") may represent data within the query component.
as query components are often used to carry identifying information in the form of "key=value" pairs and one frequently used value is a reference to another URI, it is sometimes better for usability to avoid percent-encoding those characters
(But I suppose your server framework might or might not follow the specification exactly.)
No, but you can encode the url and decode it later.
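The encode-then-decode approach mentioned in the answers can be sketched in Java as follows, assuming Java 10 or later for the Charset overloads. The class name NestedUrlParam is hypothetical. Encoding the inner URL keeps its own ? and & from being read as part of the outer query string.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class NestedUrlParam {

    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String inner = "http://foo.com/somepage.aspx?color=blue&size=15";
        String outer = "http://foo.com/generatepdf.aspx?u=" + encode(inner);
        // prints http://foo.com/generatepdf.aspx?u=http%3A%2F%2Ffoo.com%2Fsomepage.aspx%3Fcolor%3Dblue%26size%3D15
        System.out.println(outer);

        // The receiving side decodes the parameter value to recover the inner URL.
        System.out.println(decode(encode(inner)).equals(inner)); // prints true

        // Each further level of nesting encodes the '%' again: ? -> %3F -> %253F
        System.out.println(encode(encode("?"))); // prints %253F
    }
}
```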

Resources