Web API maximum header value length? - asp.net

I've created a Web API project in ASP.Net, and am having some trouble getting the authentication working.
The API is expecting a token to be submitted in the Authorization header in each request. The code that checks to see if the header is set checks if the
HttpRequestMessage.Headers.Authorization
property is null. The first few times I tested this, I discovered that this property was always null, but the strange part is that if you checked the HttpRequestMessage.Headers enumerable, the Authorization header WAS set correctly (also if you did HttpRequestMessage.Headers.ToString(), it would appear there too).
Stranger still, I found that if I removed some of the attributes that are sent in the token, I could get it to work as expected. So it was as though the Authorization property wasn't being set if the header value's character length was too long. Unfortunately, even when manually removing some of the text from the token, it would then proceed to fail on a digest check, as it should!
I can't find any documentation that mentions this, so I was wondering if anyone else has come across it? I don't think the header is too long for IIS, because the header value appears in HttpRequestMessage.Headers.ToString(), so it IS being received, but for some reason it's not being assigned to the Authorization property.
Unfortunately I can't re-write the code that checks this property (this seems the easy solution) because it's apart of the Thinktecture library (ie not written by ourselves).

If you are passing the parameters on a GET, you will be limited to 2100 characters. The RFC spec will be different between implementations. Most of the browsers limit you to 2083 characters. You can definitely get away with 1000 characters.
Microsoft
Pretty much everybody else
If you are passing the parameters on a POST, you should have virtually unlimited lengths.

Related

SNS Subscription Filter Policies do not seem to work when a Binary Message Attribute present

I am having a strange problem.
My SNS -> SQS subscription with filter policy (anything-but) has been working fine.
However when I added a new attribute, initially as a String (JSON Object stringified) things stopped working, and seemed due to this new attribute as if I remove it, things work again. From what I can see in the Cloudwatch Metrics it is due to NumberOfNotificationsFilteredOut-InvalidAttributes.
I tried to change it to send as a Binary type, but the same problem, the filtering again fails on NumberOfNotificationsFilteredOut-InvalidAttributes.
I seemed to be able to work around it, by Base64 encoding the value myself and sending as a String type.
What might be happening here? From what I understand, a filter policy should ignore attributes it doesn't care about, and always ignore binary attributes. However this doesn't seem to be my experience.
Amazon SNS ignores message attributes with the Binary data type.
https://docs.aws.amazon.com/sns/latest/dg/subscription-filter-policy-constraints.html

Script exploits in ASP.NET - Is setting validateRequest="true" good advice?

I was reading about ASP.NET Script Exploits, and one of the suggestions is: (emphasis is mine; and the suggestion is #3 in section "Guarding Against Scripting Exploits
" in the web page)
If you want your application to accept some HTML (for example, some formatting instructions from users), you should encode the HTML at the client before it is submitted to the server. For more information, see How to: Protect Against Script Exploits in a Web Application by Applying HTML Encoding to Strings.
Isn't that really bad advice? I mean, an exploiter could send the HTML via curl or something similar, and the HTML would then be sent un-encoded to the server, which can't be good(?)
Am I missing something here or mis-interpreting the statement?
Microsoft is not wrong in their sentence, but on the other hand far from complete, and their sentence is dangerous.
Since by default, validateRequest == true, you indeed should encode special HTML characters in the client in order for them to get into the server in the first place and bypass validateRequest.
But - they should have emphasized that this is certainly not a replacement for server side filtering and validation.
Specifically, if you must accept HTML, the strongest advice is to use white-listing instead of black filtering (i.e. allow very specific HTML tags and eliminate all the others). Use of Microsoft AntiXSS library is highly recommended for strong user input filtering. It's far better than "re-inventing the wheel" yourself.
I don't think that advice is good...
From my experience I would totally agree with your thought and replace that advice with the following:
all input has to be checked server-side first thing on arrival
all input that can possibly contain "active content" (like HTML, JavaScript...) has to be escaped on arrival and never be sent to any client till full sanitazion took place
I would never trust the client to send trusted data. As you stated there are simply too many ways that data can be submitted. Even non-malicious users may be able to bypass the system on the client if they have JavaScript disabled.
However on the link from that item it becomes clear what they mean with point 3:
You can help protect against script exploits in the following ways:
Perform parameter validation on form variables, query-string
variables, and cookie values. This validation should include two types
of verification: verification that the variables can be converted to
the expected type (for example, convert to an integer, convert to
date-time, and so on), and verification of expected ranges or
formatting. For example, a form post variable that is intended to be
an integer should be checked with the Int32.TryParse method to verify
the variable really is an integer. Furthermore, the resulting integer
should be checked to verify the value falls within an expected range
of values.
Apply HTML encoding to string output when writing values back out
to the response. This helps ensure that any user-supplied string input
will be rendered as static text in the browsers instead of executable
script code or interpreted HTML elements.
HTML encoding converts HTML elements using HTML–reserved characters so
that they are displayed rather than executed.
I think that this is just a case of a misplaced word because there is no way you can perform this level of validation on the client and in the examples contained in the link it is clearly server side code being presented without any mention of the client.
Edit:
You also have request validation enabled by default right? So clearly the focus of protecting content is on the server as far as Microsoft is concerned.
I think the author of the article misspoke. If you go to the linked web page, it talks about encoding data before it's sent back to the client, not the other way around. I think this is just an editing error by the author and he intended to say the opposite.. to encode it before it's returned to the client.

Is the ASP.NET cryptographic vulnerability work around a BIG LIE?

This question is somewhat of a follow up to How serious is this new ASP.NET security vulnerability and how can I workaround it? So if my question seems to be broken read over this question and its accepted solution first and then take that into the context of my question.
Can someone explain why returning the same error page and same status code for custom errors matters? I find this to be immaterial especially if this is advocated as part of the work around to it.
Isn't it just as easy for the script/application to execute this attack and not specifically care whether or not it gets a http status code and more on the outcome? Ie doing this 4000 times you get redirected to an error page where on 4001 you stay on the same page because it didn't invalidate the padding?
I see why adding the delay to the error page is somewhat relevant but doesn't this also just add another layer to fool the script into thinking the site is an invalid target?
What could be done to prevent this if the script takes into account that since the site is asp.net it's running the AES encryption that it ignores the timing of error pages and watches the redirection or lack of redirection as the response vector? If a script does this will that mean there's NO WAY to stop it?
Edit: I accept the timing attack reduction but the error page part is what really seems bogus. This attack vector puts their data into viewstate. There's only 2 cases. Pass. Fail.
Either Fail, they're on a page and the viewstate does not contain their data. No matter what you do here there is no way to remove the fail case because the page just will never contain their inserted data unless they successfully cracked the key. This is why I can't justify the custom errors usage having ANY EFFECT AT ALL.
Or Pass, they're on a page and the viewstate contains their inserted data.
Summary of this vulnerability
The cipher key from the WebResoure.axd / ScriptResource.axd is taken and the first guess of the validation key is used to generate a value of potential key with the ciphered text.
This value is passed to the WebResource.axd / ScriptResource.axd at this point if the decryption key was guessed correctly their response will be accepted but since the data is garbage that it's looking for the WebResource.axd / ScriptResource.axd will return a 404 error.
If the decryption key was not successfully guessed it will get a 500 error for the padding invalid exception. At this point the attack application knows to increment the potential decryption key value and try again repeating until it finds the first successful 404 from the WebResource.axd / ScriptResource.axd
After having successfully deduced the decryption key this can be used to exploit the site to find the actual machine key.
re:
How does this have relevance on whether they're redirected to a 200, 404 or 500? No one can answer this, this is the fundamental question. Which is why I call shenanigans on needing to do this tom foolery with the custom errors returning a 200. It just needs to return the same 500 page for both errors.
I don't think that was clear from the original question, I'll address it:
who said the errors need to return 200? that's wrong, you just need all the errors to return the same code, making all errors return 500 would work as well. The config proposed as a work around just happened to use 200.
If you don't do the workaround (even if its your own version that always returns 500), you will see 404 vs. 500 differences. That is particularly truth in webresource.axd and scriptresource.axd, since the invalid data decrypted is a missing resource / 404.
Just because you don't know which feature had the issue, doesn't mean there aren't features in asp.net that give different response codes in different scenarios that relate to padding vs. invalid data. Personally, I can't be sure if there is any other feature that gives different response code as well, I just can tell you those 2 do.
Can someone explain why returning the same error page and same status code for custom errors matters? I find this to be immaterial especially if this is advocated as part of the work around to it.
Sri already answered that very clearly in the question you linked to.
Its not about hiding than an error occurred, is about making sure the attacker can't tell the different between errors. Specifically is about making sure the attacker can't determine if the request failed because it couldn't decrypt /padding was invalid, vs. because the decrypted data was garbage.
You could argue: well but I can make sure it isn't garbage to the app. Sure, but you'd need to find a mechanism in the app that allows you to do that, and the way the attack works you Always need at least a tiny bit of garbage in the in message. Consider these:
ScriptResource and WebResource both throw, so custom error hides it.
View state is by default Not encrypted, so by default its Not involved the attack vector. If you go through the trouble of turning the encryption on, then you very likely set it to sign / validate it. When that's the case, the failure to decrypt vs. the failure to validate is the same, so the attacker again can't know.
Auth ticket also signs, so its like the view state scenario
Session cookies aren't encrypted, so its irrelevant
I posted on my blog how the attack is getting so far like to be able to forge authentication cookies.
Isn't it just as easy for the script/application to execute this attack and not specifically care whether or not it gets a http status code and more on the outcome? Ie doing this 4000 times you get redirected to an error page where on 4001 you stay on the same page because it didn't invalidate the padding?
As mentioned above, you need to find a mechanism that behaves that way i.e. decrypted garbage stays on the same page instead of throwing an exception / and thus getting you to the same error page.
Either Fail, they're on a page and the viewstate does not contain their data. No matter what you do here there is no way to remove the fail case because the page just will never contain their inserted data unless they successfully cracked the key. This is why I can't justify the custom errors usage having ANY EFFECT AT ALL.
Or Pass, they're on a page and the viewstate contains their inserted data.
Read what I mentioned about the view state above. Also note that the ability to more accurately re-encrypt is gained After they gained the ability to decrypt. That said, as mentioned above, by default view state is not that way, and when its on its usually accompanied with signature/validation.
I am going to elaborate on my answer in the thread you referenced.
To pull off the attack, the application must respond in three distinct ways. Those three distinct ways can be anything - status codes, different html content, different response times, redirects, or whatever creative way you can think of.
I'll repeat again - the attacker should be able to identify three distinct responses without making any mistake, otherwise the attack won't work.
Now coming to the proposed solution. It works, because it reduces the three outcomes to just two. How does it do that? The catch-all error page makes the status code/html/redirect all look identical. The random delay makes it impossible to distinguish between one or the other solely on the basis of time.
So, its not a lie, it does work as advertised.
EDIT : You are mixing things up with brute force attack. There is always going to be a pass/fail response from the server, and you are right it can't be prevented. But for an attacker to use that information to his advantage will take decades and billions of requests to your server.
The attack that is being discussed allows the attacker to reduce those billions of requests into a few thousands. This is possible because of the 3 distinct response states. The workaround being proposed reduces this back to a brute-force attack, which is unlikely to succeed.
The workaround works because:
You do not give any indication about "how far" the slightly adjusted took you. If you get another error message, that is information you can learn from.
With the delay you hide how long the actual calculation took. So you do not get information, that shows if you got deeper into the system, that you can learn from.
No, it isn't a big lie. See this answer in the question you referenced for a good explanation.

GET vs. POST does it really really matter?

Ok, I know the difference in purpose. GET is to get some data. Make a request and get data back. POST should be used for CRUD operations other than read I believe. But when it comes down to it, does the server really care if it's receiving a GET vs. POST in the end?
According to the HTTP RFC, GET should not have any side-effects, while POST may have side-effects.
The most basic example of this is that GET is not appropriate for anything like a purchase-transaction or posting an article to a blog, while POST is appropriate for actions-that-have-consequences.
By the RFC, you can hold a user responsible for actions done by POST (such as a purchase), but not for GET actions. 'Bots always use GET for this reason.
From the RFC 2616, 9.1.1:
9.1.1 Safe Methods
Implementors should be aware that the
software represents the user in
their interactions over the Internet,
and should be careful to allow the
user to be aware of any actions they
might take which may have an
unexpected significance to themselves
or others.
In particular, the convention has
been established that the GET and
HEAD methods SHOULD NOT have the
significance of taking an action
other than retrieval. These methods
ought to be considered "safe". This
allows user agents to represent other
methods, such as POST, PUT and
DELETE, in a special way, so that the
user is made aware of the fact that
a possibly unsafe action is being
requested.
Naturally, it is not possible to
ensure that the server does not
generate side-effects as a result of
performing a GET request; in fact,
some dynamic resources consider that a
feature. The important distinction
here is that the user did not request
the side-effects, so therefore
cannot be held accountable for them.
It does if a search engine is crawling the page, since they will be making GET requests but not POST. Say you have a link on your page:
http://www.example.com/items.aspx?id=5&mode=delete
Without some sort of authorization check performed before the delete, it's possible that Googlebot could come in and delete items from your page.
Since you're the one writing the server software (presumably), then it cares if you tell it to care. If you handle POST and GET data identically, then no, it doesn't.
However, the browser definitely cares. Refreshing or clicking back to a page you got as a response to a POST pops up the little "Are you sure you want to submit data again" prompt, for example.
GET has data limit restrictions based on the sending browser:
The spec for URL length does not dictate a minimum or maximum URL length, but implementation varies by browser. On Windows: Opera supports ~4050 characters, IE 4.0+ supports exactly 2083 characters, Netscape 3 -> 4.78 support up to 8192 characters before causing errors on shut-down, and Netscape 6 supports ~2000 before causing errors on start-up
If you use a GET request to alter back-end state, you run the risk of bad things happening if a webcrawler of some kind traverses your site. Back when wikis first became popular, there were horror stories of whole sites being deleted because the "delete page" function was implemented as a GET request, with disastrous results when the Googlebot came knocking...
"Use GET if: The interaction is more like a question (i.e., it is a safe operation such as a query, read operation, or lookup)."
"Use POST if: The interaction is more like an order, or the interaction changes the state of the resource in a way that the user would perceive (e.g., a subscription to a service), or the user be held accountable for the results of the interaction."
source
You be aware of a few subtle security differences. See my question
GET versus POST in terms of security?
Essentially the important thing to remember is that GET will go into the browser history and will be transmitted through proxies in plain text, so you don't want any sensitive information, like a password in a GET.
Obvious maybe, but worth mentioning.
By HTTP specifications, GET is safe and idempotent and POST is neither. What this means is that a GET request can be repeated multiple times without causing side effects.
Even if your server doesn't care (and this is unlikely), there may be intermediate agents between your client and the server, all of whom have this expectation. For example proxies to cache data at your ISP or other providers for improved performance. THe same expectation is true for accelerators, for example, a prefetching plugin for your browser.
Thus a GET request can be cached (based on certain parameters), and if it fails, it can be automatically repeated without any expecation of harmful effects. So, really your server should strive to fulfill this contract.
On the other hand, POST is not safe, not idempotent and every agent knows not to cache the results of a POST request, or retry a POST request automatically. So, for example, a credit card transaction would never, ever be a GET request (you don't want accounts being debited multiple times because of network errors, etc).
That's a very basic take on this. For more information, you might consider the "RESTful Web Services" book by Ruby and Richardson (O'Reilly press).
For a quick take on the topic of REST, consider this post:
http://www.25hoursaday.com/weblog/2008/08/17/ExplainingRESTToDamienKatz.aspx
The funny thing is that most people debate the merits of PUT v POST. The GET v POST issue is, and always has been, very well settled. Ignore it at your own peril.
GET has limitations on the browser side. For instance, some browsers limit the length of GET requests.
I think a more appropriate answer, is you can pretty much do the same things with both. It is not so much a matter of preference, however, but a matter of correct usage. I would recommend you use you GETs and POSTs how they were intended to be used.
Technically, no. All GET does is post the stuff in the first line of the HTTP request, and POST posts stuff in the body.
However, how the "web infrastructure" treats the differences makes a world of difference. We could write a whole book about it. However, I'll give you some "best practises":
Use "POST" for when your HTTP request would change something "concrete" inside the web server. Ie, you're editing a page, making a new record, and so on. POSTS are less likely to be cached, or treated as something that's "repeatable without side-effects"
Use "GET" for when you want to "look at an object". Now, such a look might change something "behind the scenes" in terms of caching or record keeping, but it shouldn't change anything "substantial". Ie, I could repeat my GET over and over and nothing bad would happen, except for inflated hit counts. GETs should be easily bookmarkable, so a user can go back to that same object later on.
The parameters to the GET (the stuff after the ?, traditionally) should be considered "attributes to the view" or "what to view" and so on. Again, it shouldn't actually change anything: use POST for that.
And, a final word, when you POST something (for example, you're creating a new comment), have the processing for the post issue a 302 to "redirect" the user to a new URL that views that object. Ie, a POST processes the information, then redirects the browser to a GET statement to view the new state. Displaying information as a result of a POST can also cause problems. Doing the redirection is often used, and makes things work better.
Should the user be able to bookmark the resulting page? Another thing to think about is some browsers/servers incorrectly limit the GET URI length.
Edit: corrected char length restriction note - thanks ars!
It depends on the software at the server end. Some libraries, like CGI.pm in perl handles both by default. But there are situations where you more or less have to use POST instead of GET, at least for pushing data to the server. Large amounts of data (where the corresponding GET url would become too long), binary data (to avoid lots of encoding/decoding trouble), multipart files, non-parsed headers (for continuous updates pre-AJAX style...) and similar.
The server technically couldn't care one way or the other about what kind of request it receives. It will blindly execute any request coming across the wire.
Which is the problem. If you have an action that destroys or modifies data in a GET action, Google will tear your site up as it crawls through indexing.
The server usually doesn't care. But it's mostly for following good practices, as you mentioned. The client side also matter - as mentioned you cannot bookmark a POST'd page usually, and some browsers have limits on the length of the URL for really long GET queries.
Since GET is intended for specifying resource you wanna get, depending on exact software on the server side, the web server (or the load balancer in front of it) may have a size limit on GET requests to prevent Denial Of Service attacks...
Be aware that browsers may cache GET requests but will generally not cache POST requests.
Yes, it does matter. GET and POST are quite different, really.
You are right in that normally, GET is for "getting" data from the server and displaying a page, while POST is for "posting" data back to the server. Internally, your scripts get the same data whether it's GET or POST, so no, the server doesn't really care.
The main difference is GET parameters are specified in URLs, while POST is not. This is why POST is used for signup and login forms - you don't want your password in a URL. Similarly, if you're viewing different pages or displaying a specific view of some data, you normally want a unique URL.
It really does matter. I have gathered like 11 things you should know abut them.
11 things you should know about GET vs POST
No, they shouldn't except for #jbruce2112 answer and uploading files require POST.

Paging in a Rest Collection

I'm interested in exposing a direct REST interface to collections of JSON documents (think CouchDB or Persevere). The problem I'm running into is how to handle the GET operation on the collection root if the collection is large.
As an example pretend I'm exposing StackOverflow's Questions table where each row is exposed as a document (not that there necessarily is such a table, just a concrete example of a sizable collection of 'documents'). The collection would be made available at /db/questions with the usual CRUD api GET /db/questions/XXX, PUT /db/questions/XXX, POST /db/questions is in play. The standard way to get the entire collection is to GET /db/questions but if that naively dumps each row as a JSON object, you'll get a rather sizeable download and a lot of work on the part of the server.
The solution is, of course, paging. Dojo has solved this problem in its JsonRestStore via a clever RFC2616-compliant extension of using the Range header with a custom range unit items. The result is a 206 Partial Content that returns only the requested range. The advantage of this approach over a query parameter is that it leaves the query string for...queries (e.g. GET /db/questions/?score>200 or somesuch, and yes that'd be encoded %3E).
This approach completely covers the behavior I want. The problem is that RFC 2616 specifies that on a 206 response (emphasis mine):
The request MUST have included a Range header field (section 14.35)
indicating the desired range, and MAY have included an If-Range
header field (section 14.27) to make the request conditional.
This makes sense in the context of the standard usage of the header but is a problem because I'd like the 206 response to be the default to handle naive clients/random people exploring.
I've gone over the RFC in detail looking for a solution but have been unhappy with my solutions and am interested in SO's take on the problem.
Ideas I've had:
Return 200 with a Content-Range header! - I don't think that this is wrong, but I'd prefer if a more obvious indicator that the response is only Partial Content.
Return 400 Range Required - There is not a special 400 response code for required headers, so the default error has to be used and read by hand. This also makes exploration via web browser (or some other client like Resty) more difficult.
Use a query parameter - The standard approach, but I'm hoping to allow queries a la Persevere and this cuts into the query namespace.
Just return 206! - I think most clients wouldn't freak out, but I'd rather not go against a MUST in the RFC
Extend the spec! Return 266 Partial Content - Behaves exactly like 206 but is in response to a request that MUST NOT contain the Range header. I figure that 266 is high enough that I shouldn't run into collision issues and it makes sense to me but I'm not clear on whether this is considered taboo or not.
I'd think this is a fairly common problem and I'd like to see this done in a sort of de facto fashion so I or someone else isn't reinventing the wheel.
What's the best way to expose a full collection via HTTP when the collection is large?
I don't really agree with some of you guys. I've been working for weeks on this features for my REST service. What I ended up doing is really simple. My solution only makes a sense for what REST people call a collection.
Client MUST include a "Range" header to indicate which part of the collection he needs, or otherwise be ready to handle a 413 REQUESTED ENTITY TOO LARGE error when the requested collection is too large to be retrieved in a single round-trip.
Server sends a 206 PARTIAL CONTENT response, with the Content-Range header specifying which part of the resource has been sent, and an ETag header to identify the current version of the collection. I usually use a Facebook-like ETag {last_modification_timestamp}-{resource_id}, and I consider that the ETag of a collection is that of the most recently modified resource it contains.
To request a specific part of a collection, the client MUST use the "Range" header, and fill the "If-Match" header with the ETag of the collection obtained from previously performed requests to acquire other parts of the same collection. The server can therefore verify that the collection hasn't changed before sending the requested portion. If a more recent version exists, a 412 PRECONDITION FAILED response is returned to invite the client to retrieve the collection from scratch. This is necessary because it could mean that some resources might have been added or removed before or after the currently requested part.
I use ETag/If-Match in tandem with Last-Modified/If-Unmodified-Since to optimize cache. Browsers and proxies might rely on one or both of them for their caching algorithms.
I think that a URL should be clean unless it's to include a search/filter query. If you think about it, a search is nothing more than a partial view of a collection. Instead of the cars/search?q=BMW type of URLs, we should see more cars?manufacturer=BMW.
My gut feeling is that the HTTP range extensions aren't designed for your use case, and thus you shouldn't try. A partial response implies 206, and 206 must only be sent if the client asked for it.
You may want to consider a different approach, such as the one use in Atom (where the representation by design may be partial, and is returned with a status 200, and potentially paging links). See RFC 4287 and RFC 5005.
You can still return Accept-Ranges and Content-Ranges with a 200 response code. These two response headers give you enough information to infer the same information that a 206 response code provides explicitly.
I would use Range for pagination, and have it simply return a 200 for a plain GET.
This feels 100% RESTful and doesn't make browsing any more difficult.
Edit:
I wrote a blog post about this: http://otac0n.com/blog/2012/11/21/range-header-i-choose-you.html
If there is more than one page of responses, and you don't want to offer the whole collection at once, does that mean there are multiple choices?
On a request to /db/questions, return 300 Multiple Choices with Link headers that specify how to get to each page as well as a JSON object or HTML page with a list of URLs.
Link: <>; rel="http://paged.collection.example/relation/paged"
Link: <>; rel="http://paged.collection.example/relation/paged"
...
You'd have one Link header for each page of results (an empty string means the current URL, and the URL is the same for each page, just accessed with different ranges), and the relationship is defined as a custom one per the upcoming Link spec. This relationship would explain your custom 266, or your violation of 206. These headers are your machine-readable version, since all of your examples require an understanding client anyway.
(If you stick with the "range" route, I believe your own 2xx return code, as you described it, would be the best behavior here. You're expected to do this for your applications and such ["HTTP status codes are extensible."], and you have good reasons.)
300 Multiple Choices says you SHOULD also provide a body with a way for the user agent to pick. If your client is understanding, it should use the Link headers. If it's a user manually browsing, perhaps an HTML page with links to a special "paged" root resource that can handle rendering that particular page based on the URL? /humanpage/1/db/questions or something hideous like that?
The comments on Richard Levasseur's post remind me of an additional option: the Accept header (section 14.1). Back when the oEmbed spec came out, I wondered why it hadn't been done entirely using HTTP, and wrote up an alternative using them.
Keep the 300 Multiple Choices, the Link headers and the HTML page for an initial naive HTTP GET, but rather than use ranges, have your new paging relationship define the use of the Accept header. Your subsequent HTTP request might look like this:
GET /db/questions HTTP/1.1
Host: paged.collection.example
Accept: application/json;PagingSpec=1.0;page=1
The Accept header allows you to define an acceptable content type (your JSON return), plus extensible parameters for that type (your page number). Riffing on my notes from my oEmbed writeup (can't link to it here, I'll list it in my profile), you could be very explicit and provide a spec/relation version here in case you need to redefine what the page parameter means in the future.
Edit:
After thinking about it a bit more, I'm inclined to agree that Range headers aren't appropriate for pagination. The logic being, the Range header is intended for the server's response, not the applications. If you served 100 megabytes of results, but the server (or client) could only process 1 megabyte at a time, well, thats what the Range header is for.
I'm also of the opinion that a subset of resources is its own resource (similar to relational algebra.), so it deserve representation in the URL.
So basically, I recant my original answer (below) about using a header.
I think you answered your own question, more or less - return 200 or 206 with content-range and optionally use a query parameter. I would sniff the user agent and content type and, depending on those, check for a query parameter. Otherwise, require the range headers.
You essentially have conflicting goals - let people use their browser to explore (which doesn't easily allow custom headers), or force people to use a special client that can set headers (which doesn't let them explore).
You could just provide them with the special client depending on the request - if it looks like a plain browser, send down a small ajax app that renders the page and sets the necessary headers.
Of course, there is also the debate about whether the URL should contain all the necessary state for this sort of thing. Specifying the range using headers can be considered "un-restful" by some.
As an aside, it would be nice if servers could respond with a "Can-Specify: Header1, header2" header, and web browsers would present a UI so users could fill in values, if they desired.
You might consider using a model something like the Atom Feed Protocol since it has a sane HTTP model of collections and how to manipulate them (where insane means WebDAV).
There's the Atom Publishing Protocol which defines the collection model and REST operations plus you can use RFC 5005 - Feed Paging and Archiving to page through big collections.
Switching from Atom XML to JSON content should not affect the idea.
I think the real problem here is that there is nothing in the spec that tells us how to do automatic redirects when faced with 413 - Requested Entity Too Large.
I was struggling with this same problem recently and I looked for inspiration in the RESTful Web Services book. Personally I don't think 206 is appropriate due to the header requirement. My thoughts also led me to 300, but I thought that was more for different mime-types, so I looked up what Richardson and Ruby had to say on the subject in Appendix B, page 377. They suggest that the server just pick the preferred representation and send it back with a 200, basically ignoring the notion that it should be a 300.
That also jibes with the notion of links to next resources that we have from atom. The solution I implemented was to add "next" and "previous" keys to the json map I was sending back and be done with it.
Later on I started thinking maybe the thing to do is send a 307 - Temporary Redirect to a link that would be something like /db/questions/1,25 - that leaves the original URI as the canonical resource name, but it gets you to an appropriately named subordinate resource. This is behavior I'd like to see out of a 413, but 307 seems a good compromise. Haven't actually tried this in code yet though. What would be even better is for the redirect to redirect to a URL containing the actual IDs of the most recently asked questions. For example if each question has an integer ID, and there are 100 questions in the system and you want to show the ten most recent, requests to /db/questions should be 307'd to /db/questions/100,91
This is a very good question, thanks for asking it. You confirmed for me that I'm not nuts for having spent days thinking about it.
One of the big problems with range headers is that a lot of corporate proxies filter them out. I'd advise to use a query parameter instead.
With the publication of rfc723x, unregistered range units do go against an explicit recommendation in the spec. Consider rfc7233 (deprecating rfc2616):
"New range units ought to be registered with IANA" (along with a reference to a HTTP Range Unit Registry).
You can detect the Range header, and mimic Dojo if it is present, and mimic Atom if it is not. It seems to me that this neatly divides the use cases. If you are responding to a REST query from your application, you expect it to be formatted with a Range header. If you are responding to a casual browser, then if you return paging links it will let the tool provide an easy way to explore the collection.
Seems to me that the best way to do this is to include range as query parameters. e.g., GET /db/questions/?date>mindate&date<maxdate. Upon a GET to the /db/questions/ with no query parameters, return 303 with Location: /db/questions/?query-parameters-to-retrieve-the-default-page. Then provide a different URL by which whomever is consuming your API to get statistics about the collection (e.g., what query parameters to use if s/he wants the entire collection);
While its possible to use the Range header for this purpose, I don't think that was the intent. It seems to have been designed for handling flaky connections as well as limiting the data (so the client can request part of the request if something was missing or the size was too large to process). You are hacking pagination into something that is likely used for other purposes at the communication layer.
The "proper" way to handle pagination is with the types you return. Rather than returning questions object, you should be returning a new type instead.
So if questions is like this:
<questions>
<question index=1></question>
<question index=2></question>
...
</questions>
The new type could be something like this:
<questionPage>
<startIndex>50</startIndex>
<returnedCount>10</returnedCount>
<totalCount>1203</totalCount>
<questions>
<question index=50></question>
<question index=51></question>
..
</questions>
<questionPage>
Of course you control your media types, so you can make your "pages" a format that suits your needs. If you make is something generic, you can have a single parser on the client to handle paging the same for all types. I think that is more in the spirit of the HTTP specification, rather than fudging the Range parameter for something else.

Resources