I create a ZIP archive on the fly (using Node) from existing material that is already compressed; the archive's length is unknown in advance. In the ZIP archive, files just get stored; the ZIP is only used to have a single container. That's why caching the created ZIP files makes no sense - there's no real computation involved.
So far, OK. Now I want to permit resuming downloads, and I'm reading about the Accept-Ranges, Range, and Content-Range HTTP headers. A client with a broken download would ask for an open-ended range, say: Range: bytes=8000000-.
How do I answer that? My answer must include a Content-Range header, and there, according to RFC 2616 § 14.16:
Unlike byte-ranges-specifier values (see section 14.35.1), a byte-range-resp-spec MUST only specify one range, and MUST contain absolute byte positions for both the first and last byte of the range.
So I cannot just send "everything starting from position X"; I must specify the last byte sent, too - either by sending only a part of known size, or by calculating the length in advance. Neither idea is convenient in my situation. Is there any other possibility?
Answering myself: it looks like I have to choose between (1) chunked encoding of a file of as-yet-unknown length, or (2) knowing its Content-Length (or at least the size of the current part), which allows for resuming downloads (as well as for progress bars).
I can live with that - for each of my ZIP files, the length will always be the same, so I can store it somewhere and reuse it for subsequent downloads. I'm just surprised the HTTP protocol does not allow for resuming downloads of unknown length.
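For the record, a minimal sketch of what this looks like in Node (hypothetical: knownLength is the total ZIP size cached from a previous generation of the same archive, and the actual byte production is elided):

import * as http from "http";

// Hypothetical: the total archive length, cached when the ZIP was first built.
const knownLength = 123456789;

http.createServer((req, res) => {
  // Match an open-ended range request such as "bytes=8000000-".
  const m = /^bytes=(\d+)-$/.exec(req.headers.range ?? "");
  if (m && Number(m[1]) < knownLength) {
    const start = Number(m[1]);
    res.writeHead(206, {
      "Accept-Ranges": "bytes",
      "Content-Type": "application/zip",
      // Both last-byte-pos and total length are known, as the RFC demands.
      "Content-Range": `bytes ${start}-${knownLength - 1}/${knownLength}`,
      "Content-Length": String(knownLength - start),
    });
    // ...regenerate the archive, skip `start` bytes, stream the rest, then res.end()...
  } else {
    res.writeHead(200, { "Content-Type": "application/zip" });
    // ...stream the whole archive, then res.end()...
  }
}).listen(8080);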
Respond with the "multipart/byteranges" Content-Type, including a Content-Range field for each part.
Reasoning:
When replying to requests with a "Range" header, successful partial responses should report the 206 HTTP status code (section 14.35.1, Byte Ranges)
A 206 response calls for either a "Content-Range" header or a "multipart/byteranges" Content-Type (section 10.2.7, 206 Partial Content)
A "Content-Range" header cannot be added to the response, as it does not allow omitting the end position, so the only way left is to use the "multipart/byteranges" Content-Type
If I wanted to configure my personal server so that the response for a certain request is sent according to the chunked-encoding rules: what size should each of the server's response chunks have?
For example, let's say that the chunked response is a long HTML page or a file.
How would you handle these two cases?
From the RFC:
This allows dynamically produced content to be transferred...
In other words: Transfer-Encoding: chunked is needed when the length of content is unknown.
The length of your content may be as big as 10 TB... but it can also be as small as 10 bytes. It doesn't matter. The chunks' sizes depend solely on the algorithms you are using to generate them and to read them.
Let's say you generate a stream of messages of different lengths, one character per second. In this case you can decide to send one-byte chunks to the client. This way the client will be able to use the data as soon as it arrives. But if your client has no use for partial messages, then you should probably save the bandwidth and send a chunk the moment you've finished generating the next message. And again, it doesn't matter how big or small the message is. It can be 2 characters or it can be 1000.
On second thought, there are some use cases for Transfer-Encoding: chunked with data of known size. But then your question becomes too broad to answer. It depends on your client code, server code, network conditions, data properties, desired user experience, etc.
And if by any chance you are asking about the optimal size from the network perspective, then just send the whole file - that's the best bet. And support Content-Range on your server instead of Transfer-Encoding: chunked.
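As a sketch of the earlier "one chunk per finished message" idea in Node (Node switches to Transfer-Encoding: chunked automatically whenever you write a response without setting Content-Length):

import * as http from "http";

http.createServer((req, res) => {
  // No Content-Length is set, so Node sends Transfer-Encoding: chunked.
  res.writeHead(200, { "Content-Type": "text/plain" });
  let n = 0;
  const timer = setInterval(() => {
    n += 1;
    res.write(`message ${n}\n`); // each finished message goes out as a chunk
    if (n === 5) {
      clearInterval(timer);
      res.end(); // sends the terminating zero-length chunk
    }
  }, 1000);
}).listen(8080);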
I was trying to create a tool to grab frames from an MJPEG stream that is transmitted over HTTP. I did not find any specification, so I looked at what Wikipedia says here:
In response to a GET request for a MJPEG file or stream, the server streams the sequence of JPEG frames over HTTP. A special mime-type content type multipart/x-mixed-replace;boundary=<boundary-name> informs the client to expect several parts (frames) as an answer delimited by <boundary-name>. This boundary name is expressly disclosed within the MIME-type declaration itself.
But this doesn't seem to be very accurate in practice. I dumped some streams to find out how they behave. Most streams have the following format (where CRLF is a carriage return plus line feed, and a "partial header" is some header fields without a status line):
1. Status line (e.g. HTTP/1.0 200 OK) CRLF
2. Header fields (e.g. Cache-Control: no-cache) CRLF
3. Content-Type header field (e.g. Content-Type: multipart/x-mixed-replace; boundary=--myboundary) CRLF
4. CRLF (denotes that the header is over)
5. Boundary (denotes that the first frame is over) CRLF
6. Partial header fields (mostly: Content-Type: image/jpeg) CRLF
7. CRLF (denotes that this "partial header" is over)
8. Actual frame data CRLF
9. (Sometimes an optional CRLF here)
10. Boundary
11. Starting again at the partial header (line 6)
The first frame never contained actual image data.
All of the analyzed streams had the Content-Type header, with the type set to multipart/x-mixed-replace.
But some of the streams get things wrong here:
Two servers claimed boundary="MOBOTIX_Fast_Serverpush" but then used --MOBOTIX_Fast_Serverpush as the frame delimiter.
This irritated me quite a bit, so I thought of another approach to get the frames.
Since each JPEG starts with 0xFF 0xD8 as the Start of Image marker and ends with 0xFF 0xD9, I could just start looking for these. This seems to be a very dirty approach and I don't really like it, but it might be the most robust one.
Before I start implementing this, are there some points I missed about MJPEG over HTTP? Is there any real specification of transmitting MJPEG over HTTP?
What are the caveats when just watching for the Start and End markers of a JPEG instead of using the boundary to delimit frames?
this doesn't seem to be very accurate in practice.
It is very accurate in practice. You are just not handling it correctly.
The first frame never contained actual image data.
Yes, it does. There is always a starting boundary before the first MIME entity (as MIME can contain prologue data before the first entity). You are thinking that MIME boundaries exist only after each MIME entity, but that is simply not true.
I suggest you read the MIME specification, particularly RFC 2045 and RFC 2046. MIME works fine in this situation, you are just not interpreting the results correctly.
Actual frame data CRLF
(Sometimes here is an optional CRLF)
Boundary
Actually, that last CRLF is NOT optional; it is part of the next boundary that follows a MIME entity's data (see RFC 2046 section 5). MIME boundaries must appear on their own lines, so a CRLF is artificially inserted after the entity data, which is especially important for data types (like images) that are not naturally terminated by their own CRLF.
Two Servers claimed boundary="MOBOTIX_Fast_Serverpush" but then used --MOBOTIX_Fast_Serverpush as frame delimiter
That is how MIME is supposed to work. The boundary specified in the Content-Type header is always prefixed with -- in the actual entity stream, and the terminating boundary after the last entity is also suffixed with -- as well.
For example:
Content-Type: multipart/x-mixed-replace; boundary="MOBOTIX_Fast_Serverpush"
--MOBOTIX_Fast_Serverpush
Content-Type: image/jpeg
<jpeg bytes>
--MOBOTIX_Fast_Serverpush
Content-Type: image/jpeg
<jpeg bytes>
--MOBOTIX_Fast_Serverpush
... and so on ...
--MOBOTIX_Fast_Serverpush--
This irritated me quite a bit so I thought of another approach to get the frames.
What you are thinking of will not work, and is not as robust as you think. You really need to process the MIME stream correctly instead.
When processing multipart/x-mixed-replace, what you are supposed to do is:
1. Read and discard the HTTP response body until you reach the first MIME boundary, as specified by the Content-Type response header.
2. Then read a MIME entity's headers and data until you reach the next matching MIME boundary.
3. Then process the entity's data as needed, according to its headers (for instance, displaying an image/jpeg entity onscreen).
4. If the connection has not been closed, and the last boundary read is not the termination boundary, go back to step 2; otherwise stop processing the HTTP response.
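A rough sketch of that loop in Node (simplified: the boundary is hard-coded here rather than parsed out of the Content-Type header, frames are buffered in memory, and handleFrame is a made-up consumer). Note that the boundary is matched together with the CRLF that precedes it, for the reason explained above:

import * as http from "http";

// The dashed boundary, including the CRLF that MIME prepends to it.
const BOUNDARY = Buffer.from("\r\n--myboundary");

http.get("http://camera.example/stream", (res) => {
  let buf = Buffer.alloc(0);
  res.on("data", (chunk: Buffer) => {
    buf = Buffer.concat([buf, chunk]);
    let idx: number;
    while ((idx = buf.indexOf(BOUNDARY)) !== -1) {
      const part = buf.subarray(0, idx);            // partial headers + frame
      buf = buf.subarray(idx + BOUNDARY.length);
      const headerEnd = part.indexOf("\r\n\r\n");   // end of the partial header
      if (headerEnd !== -1) {
        const frame = part.subarray(headerEnd + 4); // the raw JPEG bytes
        if (frame.length > 0) handleFrame(frame);
      }
    }
  });
});

function handleFrame(jpeg: Buffer): void {
  console.log(`got a frame of ${jpeg.length} bytes`);
}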
As per the HTTP/1.1 spec for the Range header (http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35), it is stated that:
Byte range specifications in HTTP apply to the sequence of bytes in the entity-body (not necessarily the same as the message-body).
My question: suppose I am requesting to download a binary file of size 1 GB that consists of multiple encrypted blocks of 128 MB each. Since an HTTP byte range refers not to the size of the file but to the HTTP entity, how do I download these blocks in parallel from the server without breaking the block boundaries? Please note that I don't want to reassemble the file; I want to process the blocks separately to decrypt them. Which Range header would be most suitable, and how do I derive the correct values to send in that Range header?
The Range header applies not to the full HTTP entity but only to the entity-body of that entity. The HTTP message RFC (http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html) says:
The message-body (if any) of an HTTP message is used to carry the entity-body associated with the request or response. The message-body differs from the entity-body only when a transfer-coding has been applied, as indicated by the Transfer-Encoding header field (section 14.41).
Another good reference to read is http://www.ietf.org/rfc/rfc3229.txt (section 4 - The HTTP message-generation sequence), which explains how the HTTP response is generated. Conceptually, when both a Range header and a transfer-coding apply, the Range is applied first for message response generation, and then the transfer-coding is applied. I think most HTTP servers should conform to this, so we can apply the Range header with respect to the message content length.
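Assuming no transfer-coding is applied (so entity-body offsets equal file offsets), a sketch of deriving one Range header per block (TypeScript; the 128 MB blocks are assumed to be binary megabytes, i.e. 134217728 bytes):

const BLOCK = 128 * 1024 * 1024;  // 134217728 bytes per encrypted block
const TOTAL = 1024 * 1024 * 1024; // 1 GB file

const ranges: string[] = [];
for (let start = 0; start < TOTAL; start += BLOCK) {
  const end = Math.min(start + BLOCK, TOTAL) - 1; // last byte is inclusive
  ranges.push(`bytes=${start}-${end}`);
}
// ranges[0] === "bytes=0-134217727"
// ranges[1] === "bytes=134217728-268435455"
// ...one request per entry, each aligned to a block boundary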
I have written a mini-minimalist HTTP server prototype (heavily inspired by the Boost.Asio examples), and for the moment I haven't put any HTTP headers in the server response, only the HTML string content. Surprisingly, it works just fine.
In that question, the OP wonders about the necessary fields in an HTTP response, and one of the comments states that they may not be really important from the server's side.
I have not yet tried to respond with binary image files or gzip-compressed files, in which cases I suppose it is mandatory to have an HTTP header.
But for text-only responses (HTML, CSS, and XML outputs), would it be OK to never include the HTTP header in my server responses? What risks/errors are possible?
At a minimum, you must provide a header with a status line and a date.
As someone who has written many protocol parsers, I am begging you, on my digital metaphoric knees, please oh please oh please don't just totally ignore the specification just because your favorite browser lets you get away with it.
It is perfectly fine to create a program that is minimally functional, as long as the data it produces is correct. This should not be a major burden, since all you have to do is add three lines to the start of your response. And one of those lines is blank! Please take a few minutes to write the two glorious lines of code that will bring your response data into line with the spec.
The headers you really should supply are:
the status line (required)
a date header (required)
content-type (highly recommended)
content-length (highly recommended), unless you're using chunked encoding
if you're returning HTTP/1.1 status lines, and you're not providing a valid content-length or using chunked encoding, then add Connection: close to your headers
the blank line to separate header from body (required)
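For instance, a minimal conforming response might look like this (the date and body are, of course, just examples):

HTTP/1.1 200 OK
Date: Tue, 05 May 2015 12:00:00 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 38

<html><body>Hello, world</body></html>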
You can choose not to send a content-type with the response, but you have to understand that the client might not know what to do with the data. The client has to guess what kind of data it is. A browser might decide to treat it as a downloaded file instead of displaying it. An automated process (someone's bash/curl script) might reasonably decide that the data isn't of the expected type so it should be thrown away.
From the HTTP/1.1 specification, section 3.1.1.5 (Content-Type):
A sender that generates a message containing a payload body SHOULD generate a Content-Type header field in that message unless the intended media type of the enclosed representation is unknown to the sender. If a Content-Type header field is not present, the recipient MAY either assume a media type of "application/octet-stream" ([RFC2046], Section 4.5.1) or examine the data to determine its type.
In HTTP there are two ways to POST data: application/x-www-form-urlencoded and multipart/form-data. I understand that most browsers are only able to upload files if multipart/form-data is used. Is there any additional guidance when to use one of the encoding types in an API context (no browser involved)? This might e.g. be based on:
data size
existence of non-ASCII characters
existence of (unencoded) binary data
the need to transfer additional data (like filename)
I basically found no formal guidance on the web regarding the use of the different content-types so far.
TL;DR
Summary: if you have binary (non-alphanumeric) data (or a significantly sized payload) to transmit, use multipart/form-data. Otherwise, use application/x-www-form-urlencoded.
The MIME types you mention are the two Content-Type headers for HTTP POST requests that user-agents (browsers) must support. The purpose of both of those types of requests is to send a list of name/value pairs to the server. Depending on the type and amount of data being transmitted, one of the methods will be more efficient than the other. To understand why, you have to look at what each is doing under the covers.
For application/x-www-form-urlencoded, the body of the HTTP message sent to the server is essentially one giant query string -- name/value pairs are separated by the ampersand (&), and names are separated from values by the equals symbol (=). An example of this would be:
MyVariableOne=ValueOne&MyVariableTwo=ValueTwo
According to the specification:
[Reserved and] non-alphanumeric characters are replaced by `%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character
That means that for each non-alphanumeric byte that exists in one of our values, it's going to take three bytes to represent it. For large binary files, tripling the payload is going to be highly inefficient.
That's where multipart/form-data comes in. With this method of transmitting name/value pairs, each pair is represented as a "part" in a MIME message (as described by other answers). Parts are separated by a particular string boundary (chosen specifically so that this boundary string does not occur in any of the "value" payloads). Each part has its own set of MIME headers like Content-Type, and particularly Content-Disposition, which can give each part its "name." The value piece of each name/value pair is the payload of each part of the MIME message. The MIME spec gives us more options when representing the value payload -- we can choose a more efficient encoding of binary data to save bandwidth (e.g. base 64 or even raw binary).
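To make that concrete, a simplified, hypothetical multipart/form-data request carrying one text field and one file might look like this on the wire (AaB03x is the boundary; real clients generate a long random one, and the length placeholder is deliberately elided):

POST /upload HTTP/1.1
Content-Type: multipart/form-data; boundary=AaB03x
Content-Length: ...

--AaB03x
Content-Disposition: form-data; name="MyVariableOne"

ValueOne
--AaB03x
Content-Disposition: form-data; name="upload"; filename="photo.jpg"
Content-Type: image/jpeg

...raw JPEG bytes...
--AaB03x--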
Why not use multipart/form-data all the time? For short alphanumeric values (like most web forms), the overhead of adding all of the MIME headers is going to significantly outweigh any savings from more efficient binary encoding.
READ AT LEAST THE FIRST PARA HERE!
I know this is 3 years too late, but Matt's (accepted) answer is incomplete and will eventually get you into trouble. The key here is that, if you choose to use multipart/form-data, the boundary must not appear in the file data that the server eventually receives.
This is not a problem for application/x-www-form-urlencoded, because there is no boundary. x-www-form-urlencoded can also always handle binary data, by the simple expedient of turning one arbitrary byte into three 7-bit bytes. Inefficient, but it works (and note that the comment about not being able to send filenames as well as binary data is incorrect; you just send the filename as another key/value pair).
The problem with multipart/form-data is that the boundary separator must not be present in the file data (see RFC 2388; section 5.2 also includes a rather lame excuse for not having a proper aggregate MIME type that avoids this problem).
So, at first sight, multipart/form-data is of no value whatsoever in any file upload, binary or otherwise. If you don't choose your boundary correctly, then you will eventually have a problem, whether you're sending plain text or raw binary - the server will find a boundary in the wrong place, and your file will be truncated, or the POST will fail.
The key is to choose an encoding and a boundary such that your selected boundary characters cannot appear in the encoded output. One simple solution is to use base64 (do not use raw binary). In base64 3 arbitrary bytes are encoded into four 7-bit characters, where the output character set is [A-Za-z0-9+/=] (i.e. alphanumerics, '+', '/' or '='). = is a special case, and may only appear at the end of the encoded output, as a single = or a double ==. Now, choose your boundary as a 7-bit ASCII string which cannot appear in base64 output. Many choices you see on the net fail this test - the MDN forms docs, for example, use "blob" as a boundary when sending binary data - not good. However, something like "!blob!" will never appear in base64 output.
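A sketch of that idea in Node (the field name, filename, and the "!blob!" boundary are illustrative, and the server must of course expect base64-encoded content):

const boundary = "!blob!"; // cannot occur in base64 output [A-Za-z0-9+/=]
const fileBytes = Buffer.from([0xff, 0xd8, 0xff, 0xd9]); // any binary data

const body =
  `--${boundary}\r\n` +
  `Content-Disposition: form-data; name="file"; filename="data.bin"\r\n` +
  `Content-Type: application/octet-stream\r\n` +
  `Content-Transfer-Encoding: base64\r\n` +
  `\r\n` +
  fileBytes.toString("base64") + `\r\n` +
  `--${boundary}--\r\n`;

// Send with the header:
// Content-Type: multipart/form-data; boundary="!blob!"

Because the encoded payload is pure base64, the boundary can never appear inside it, so the server will always split the parts correctly.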
I don't think HTTP POST is limited to multipart or x-www-form-urlencoded. The Content-Type header is orthogonal to the HTTP POST method (you can use whichever MIME type suits you). This is also the case for typical HTML-representation-based webapps (e.g. JSON payloads became very popular for transmitting payloads for ajax requests).
Regarding RESTful APIs over HTTP, the most popular content types I have come in touch with are application/xml and application/json.
application/xml:
data size: XML is very verbose, but usually not an issue when using compression, and considering that the write-access case (e.g. through POST or PUT) is much rarer than read access (in many cases it is <3% of all traffic). There were rarely cases where I had to optimize write performance
existence of non-ASCII chars: you can use UTF-8 as the encoding in XML
existence of binary data: you would need to use base64 encoding
filename data: you can encapsulate this inside a field in XML
application/json:
data size: more compact than XML; still text, but you can compress
non-ASCII chars: JSON is UTF-8
binary data: base64 (also see the json-binary-question)
filename data: encapsulate as its own field section inside the JSON
binary data as its own resource:
I would try to represent binary data as its own asset/resource. It adds another call but decouples things better. An example with images:
POST /images
Content-type: multipart/mixed; boundary="xxxx"
... multipart data
201 Created
Location: http://imageserver.org/../foo.jpg
In later resources you could simply inline the binary resource as a link:
<main-resource>
...
<link href="http://imageserver.org/../foo.jpg"/>
</main-resource>
I agree with much of what Manuel has said. In fact, his comments refer to this URL...
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4
... which states:
The content type "application/x-www-form-urlencoded" is inefficient for sending large quantities of binary data or text containing non-ASCII characters. The content type "multipart/form-data" should be used for submitting forms that contain files, non-ASCII data, and binary data.
However, for me it would come down to tool/framework support.
What tools and frameworks do you expect your API users to be building their apps with?
Do they have frameworks or components they can use that favour one method over the other?
If you get a clear idea of your users, and how they'll make use of your API, then that will help you decide. If you make the upload of files hard for your API users, they'll move away, or you'll spend a lot of time supporting them.
Secondary to this would be the tool support YOU have for writing your API, and how easy it is for you to accommodate one upload mechanism over the other.
Just a little hint from my side for uploading HTML5 canvas image data:
I am working on a project for a print shop and had some problems uploading images to the server that came from an HTML5 canvas element. I was struggling for at least an hour and could not get the image to save correctly on my server.
Once I set the contentType option of my jQuery ajax call to application/x-www-form-urlencoded, everything went the right way and the base64-encoded data was interpreted correctly and successfully saved as an image.
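Roughly what that call looked like (the URL and field name here are made up):

$.ajax({
  url: "/upload-image",                             // hypothetical endpoint
  type: "POST",
  contentType: "application/x-www-form-urlencoded", // the crucial option
  data: { imgdata: canvas.toDataURL("image/png") }, // base64 data URL
  success: function () { /* ... */ }
});

With this content type, jQuery percent-encodes the payload, so the + and / characters inside the base64 data survive the trip intact.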
Maybe that helps someone!
If you need to use Content-Type: application/x-www-form-urlencoded, then DO NOT use FormDataCollection as the parameter: in ASP.NET Core 2+, FormDataCollection has no default constructor, which is required by the formatters. Use IFormCollection instead:
public IActionResult Search([FromForm]IFormCollection type)
{
return Ok();
}
In my case, the issue was that the request's contentType was application/x-www-form-urlencoded but it actually contained JSON as the body of the request. In Django, when we access request.data, it cannot be converted properly, so access request.body instead.
Refer to this answer for a better understanding:
Exception: You cannot access body after reading from request's data stream