This question already has an answer here:
Is an email mailto link a valid URL?
(1 answer)
Closed 6 years ago.
While writing a library code to generate URI strings, I got confused about mailto. According to RFC 3986 authority for a URI must be preceded by //. Authority is the part of URI where the userinfo and host resides in userinfo#host syntax. According to this RFC the format should be: mailto://me#host.com. However, it is used as mailto:me#host.com not just in the wild but also shown like that in RFC 2368 and RFC 6068.
The only way mailto being a URI is that the email is appended as path, which doesn't make much sense. Is this assumption correct or is there another point I am missing.
It seems that mailto is a URN, even if it feels a bit strange. Thus in mailto:me#host.com, me#host.com is indeed path for URI as described in RFC 3986
Related
What RFC defines the passing arrays over HTTP? Most web application platforms allow you to supply an array of arguments over GET or POST. The following URL is an example:
http://localhost/?var[1]=one&var[2]=two&var[3]=three
RFC1738 defines URLs, however the bracket is missing from the Backus–Naur Form(BNF) definition of the URL. Also this RFC doesn't cover POST. Ideally I would like to get the BNF for this feature as defined in the RFC.
According to Wikipedia, there is no single spec:
While there is no definitive standard, most web frameworks allow multiple values to be associated with a single field (eg. field1=value1&field1=value2&field2=value3)
That Wikipedia article links to the following Stack Overflow post, which covers a similar question: Authoritative position of duplicate HTTP GET query keys
The issue here is that form parameters can be whatever you want them to be. Some web frameworks have settled on key[number]=value for arrays, others haven't. Interestingly, RFC1866 section 8.2.4, page 48 (note: this RFC is historical and not current) shows an example with the same key used twice in a form POST:
name=John+Doe
&gender=male
&family=5
&city=kent
&city=miami
&other=abc%0D%0Adef
&nickname=J%26D
On the W3C side of things, HTML 4.01 has some information about how to encode form parameters. Sadly this doesn't cover arrays.
At the time of writing, I don't think there is a correct answer to your question - no IETF RFC or W3C spec defines the behavior that you're interested in.
(As a side note, the W3C HTML JSON form submission draft spec covers posting arrays, thank goodness.)
URIs are defined by RFC 3986.
However, what you're asking about is encoding of form parameters. You need to look up the HTML spec for that.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
Recently I was researching HTTP query strings while wondering about possibilities on web service access interface API. And it seems very underspecified.
In fact RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax) doesn’t say anything about format of the query string fragment and ends on defining which characters are allowed and how to encode other characters. (I will return to this later.)
The only thing I found was HTML specification on how forms are mangled into query string (HTML 4.01; 17.13.4 Form content types, application/x-www-form-urlencoded). HTML 5 algorithm seems close enough (4.10.22.5 URL-encoded form data).
This might seem OK. After all why would anyone want to set a query string format for everyone else. What for? But are there any other (than HTML) well established standards? Is anyone else using a different format?
A side question here is dealing with [] in form fields names. PHP uses that to ensure that multiple occurrences of a field are all present in $_GET superglobal variable. (Otherwise only last occurrence is present.)
But from RFC 3986 it seems that neither [ nor ] are allowed in query string. Yet my experiments with various browsers suggested that no browser encodes those characters and they are there in the URI just like that...
Is this real life practice? Or am I testing it incorrectly? I tested with PHP 5.3.17 on IIS 7. Using Internet Explorer, Firefox and Chrome. Then I compared what is in $_SERVER['QUERY_STRING'] and $_GET.
Another question is real life support for semicolon separation.
HTML 4.01 specification (B.2.2 Ampersands in URI attribute values) recommends HTTP servers to accept semicolon (;) as parameter separator (opposed to ampersand &).
Is any server supporting it? Is anyone using this? Is it worth to bother with that (when considering allowed formats of query string for a web service)?
Then how about non-ASCII characters support?
HTML 4.01 specification (B.2.1 Non-ASCII characters in URI attribute values) restates clearly what URI describing RFCs stated in the first place: non-ASCII characters are not allowed in URI. Yet specification takes into account existing practice (of use of illegal URIs) and advices to change such characters into UTF-8 encoding and then treat each byte with URI-standard hex encoding.
From my tests is seems that for example Chrome and Firefox do so. But Internet Explorer did not and just sent those characters like they were. PHP partially coped with that. $_SERVER['QUERY_STRING'] and $_GET contained those characters. But $_SERVER['REQUEST_URI'] contained ? instead.
Are there any standards or practices how to approach such cases?
And another connected question is how then should authors publish (by URI) resources with names containing non-ASCII (for example national) characters? Considering all the various parties (HTML code, browser sending request, browser saving file do disk, server receiving and processing request and server storing the file) it seems nearly impossible to have it working consistently. Or at least I never managed.
When it comes to web pages I’m already used to that and always replace national characters with corresponding Latin base characters. But when it comes to external files (PDFs, images, …) it somehow “feels wrong” to “downgrade” the names. Especially if one expects users to save those files on disk.. How to deal with this issue?
Have you checked HTTP specyfication (RFC2616)?
Take a look at those parts:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html#sec5.1.2
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.2
The practical advice would be to use Base64 to encode the fields that you expect to contain risky characters and later on decode them on your backend.
Btw. Your question is really long. It decreases the chance that someone will dig into it.
In fact RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax) doesn’t say anything about format of the query string fragment
Yes, it does, in Section 3.4:
query = *( pchar / "/" / "?" )
pchar is defined in Section 3.3:
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
and ends on defining which characters are allowed and how to encode other characters.
Exactly. That is defining the format of the query string fragment.
But from RFC 3986 it seems that neither [ nor ] are allowed in query string.
Officially, yes. But not all browsers do it, and that is broken behavior on their part. All official specs I have seen (and 3986 is not the only one in play) say those characters must be percent-encoded.
Then how about non-ASCII characters support?
Non-ASCII characters are not allowed in URIs. They must be charset-encoded and percent-encoded. The actual charset used is server-specific, there is no spec that allows a URI to specify the charset used. Various specs recommend UTF-8, but do not require UTF-8, and some foreign servers indeed do not use UTF-8.
The IRI spec (RFC 3987), which replaces the URL/URI specs, supports the full Unicode charset, but IRIs are still relatively new and many servers do not support them yet. However, The RFC does define algorithms for converting IRIs to URIs and vice versa.
When in doubt, percent-encode everything you are not sure about. Servers are required to support an decode them when present, before then processing the decoded data as needed.
I would like to know why designers of the URI standard chose to have // in the definition of URIs like http://.
Why make it so complex? Why not just use http:?
Here's the answer (The Web’s Inventor Regrets One Small Thing).
In hindsight Tim Berners-Lee would remove it as well.
The reason it was included:
The double slash, though a programming convention at the time, turned out to not be really necessary.
RFC 2396 covers this, FWIW.
http://www.ietf.org/rfc/rfc2396.txt
The pseudocode in part 7 of section 5.2 in particular best answers your question, that the "//" is there to denote that what follows it is the authority part of the URI (since the pseudocode also makes it clear that it's not a required part of the URI).
if authority is defined then
append "//" to result
append authority to result
In addition, it's spelled out a bit more in RFC 3986 section 3.
When authority is not present, the path cannot begin with two
slash characters ("//"). These restrictions result in five
different ABNF rules for a path (Section 3.3), only one of which
will match any given URI reference.
According to https://www.rfc-editor.org/rfc/rfc3986 and http://en.wikipedia.org/wiki/Uniform_resource_identifier, a URI may or may not contain a double slash following the scheme identifier. This makes "urn:issn:1535-3613" a valid URI just as "http://stackoverflow.com".
Is there a strict/formal need to include the double slash or is it optional and in any case, what is the reason/semantics? When answering, please provide a conclusive answer - Don't just report how you browser/library/... handles it.
It's in the RFC you linked: If there is a //, it means that what follows that is the authority. See Section 3. So if the scheme uses an authority, it will use the // after the colon (either requiring it, if authority is required in that scheme, or having it be optional if authority is optional in that scheme). mailto doesn't use an authority in the URI sense, so mailto URIs don't include a //.
Besides the RFC which thoroughly explains the answer, I thought you might like this quote straight from the inventor of the World Wide Web himself.
When [Sir Tim Berners-Lee] was asked what he would have done
differently, the answer was easy. "I would have got rid of the slash
slash after the colon. You don't really need it. It just seemed like a
good idea at the time."
Source: http://www.wired.co.uk/news/archive/2014-02/06/tim-berners-lee-reclaim-the-web
Well, if you want a "conclusive answer", I think nothing is more conclusive than the official HTTP RFC document (see point 3.2.2 which talks about the HTTP URL scheme).
This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
What's the correct encoding of HTTP get request strings?
One of my clients sent me they require HTTP requests to be encoded in ISO-8859-2,
so I wonder about what charset is used for HTTP communication, and if this request is somehow technicaly right.
It depends. A "smart" server will always use percent-escaped UTF-8, but you can't rely on that.
Pure ASCII is all that's allowed in HTTP headers. But, as far as HTTP is concerned, anything goes in the request body of a POST. The headers and body are always separated by a blank line. A set of headers will normally identify the format of the content/body. Responses work the same way. However, HTML has some additional rules regarding what normally goes in a POST.
EDIT: Sorry, I missed the word 'GET' in your title. Might be nice to duplicate that in the body of your question.
At any rate, I believe I am correct in saying ONLY ASCII (ANSI X3.4-1986) is allowed in the headers of any HTTP request, GET or POST. So no, ISO-8859-2 requests are not strictly valid HTTP. That said, there's probably a way to escape the desired special characters in the query string if that's what you're really asking for here.
SOURCE: https://www.rfc-editor.org/rfc/rfc2616