Serving large PDFs, should I set Content-Length? - http

I create PDF documents dynamically and want to serve them from my handler. I set the Content-Type to application/pdf and it works fine. I run my server behind an nginx proxy.
My problem is that some requests generate a lot of additional requests for the same document. I looked at the headers and saw that the response uses chunked transfer encoding.
My solution was to set the Content-Length, and it seems to work fine.
I wonder if that's enough, and why I never had to do this with simple HTML pages.

A comment in the source code says:
If the handler didn't declare a Content-Length up front, we either go into chunking mode or, if the handler finishes running before the chunking buffer size, we compute a Content-Length and send that in the header instead.
If you want to avoid chunking, then set the content length. Setting the content length for a large response does reduce the amount of data transferred and can reduce copying within the HTTP server.
As a rule of thumb, set the content length if the length is known in advance of producing the response body.
Your simple HTML pages may be smaller than the chunking buffer size. If so, they were not chunked.
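
Assuming the server is Go's net/http (the chunking-buffer comment quoted above is from its server code), a minimal sketch looks like this; generatePDF is a hypothetical stand-in for the real generator:

```go
package main

import (
	"net/http"
	"strconv"
)

// generatePDF is a hypothetical stand-in for the real dynamic PDF generator;
// it returns the finished document as a byte slice.
func generatePDF() []byte {
	return []byte("%PDF-1.4\n% ...generated content...")
}

func pdfHandler(w http.ResponseWriter, r *http.Request) {
	pdf := generatePDF()

	w.Header().Set("Content-Type", "application/pdf")
	// Declaring the length up front lets the server send a Content-Length
	// header instead of falling back to chunked transfer encoding.
	w.Header().Set("Content-Length", strconv.Itoa(len(pdf)))
	w.Write(pdf)
}

func main() {
	http.HandleFunc("/report.pdf", pdfHandler)
	http.ListenAndServe(":8080", nil)
}
```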

Related

How to drop extra data based on the "Content-Length" header in nginx

I have a custom application deployed on an IIS instance which, among other things, acts as an HTTP file server. For various reasons (bugs), many files are corrupted, in the sense that they have an additional byte at the end of the binary content. Fortunately, I have the exact content length of each file saved in a database, and when my application returns a file, it sets the Content-Length header correctly (both for corrupted and correct files).
So I have situations where the Content-Length in the response header says 100, while the body of the same response actually contains 101 bytes (100+1).
For some internal reason I cannot change the behavior of the application.
Calling the application directly from the browser (a direct call to IIS), there seem to be no obvious problems, but this situation seems to confuse my nginx (version 1.15.7), behind which the application is exposed in production. Note that the file is served, but it comes out corrupted (they are Excel files), while files downloaded via a direct call to IIS are fine.
I think there is a problem with some internal buffer, because it always seems to discard the last 8192 bytes, and the error log shows this warning: upstream sent more data than specified in "Content-Length" header while reading response header from upstream.
I tried adding the directive proxy_buffering off; but the result does not change (only the warning disappears from the error log).
My question is: is there any way to trim the response body based on the Content-Length value provided by my upstream? Obviously, only if this value is present in the headers.
Thanks,
AleBer

Using "Content-Encoding":"GZIP"

I want to send a large amount of JSON over HTTP to a server.
If I use "Content-Encoding":"GZIP" in my httpClient, does it automatically convert the request body to the compressed format?
No. RFC 7231 describes content encoding: if you send Content-Encoding, you need to make sure the content actually is in that encoding.
If you send Content-Encoding: gzip but the message is plain text, you will (quite rightly) receive an HTTP 400. The body of a gzip message always starts with the bytes 0x1f 0x8b, and if the server does not find them in the POST body, it is right to complain.
Another reason for this is that you need an appropriate Content-Length header. This will not be the length of the original JSON; it must be the length (in bytes) of the gzipped JSON.
You need to gzip the JSON before sending anything, since you need to know what to put in Content-Length beforehand.
Extra note: if the JSON is that huge (e.g. several gigabytes), you will probably need Transfer-Encoding: chunked, which comes with its own complications. (You do not send Content-Length; instead, the length of each chunk is written into the body itself.)
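
As a sketch of the order of operations described above (compress first, then set the headers), here is a hypothetical Go client; the URL and payload are placeholders:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"net/http"
)

func main() {
	payload := []byte(`{"example": "a large JSON body"}`) // placeholder JSON

	// Compress the body first, so the final (compressed) length is known.
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(payload); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}

	req, err := http.NewRequest("POST", "https://example.com/api", &buf)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Content-Encoding", "gzip")
	// Content-Length must be the compressed size, not the size of the
	// original JSON. (For a *bytes.Buffer body, net/http fills this in
	// automatically; it is set explicitly here to make the point.)
	req.ContentLength = int64(buf.Len())

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```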
Whether it does this automatically depends entirely on which HTTP client you are using and whether it is implemented that way. Usually, setting a header will not automatically encode the body, at least in the clients I regularly use.

Transfer-Encoding: chunked

I was trying to understand more about Transfer-Encoding: chunked. I referred to some articles:
http://zoompf.com/blog/2012/05/too-chunky and "Transfer-Encoding: chunked" header in PHP.
I still didn't get a very clear picture. I understand that this encoding allows the server to send content to the browser in chunks, causing partial rendering of the content as it arrives, which makes the web site feel responsive.
I have a web application that serves dynamic content (e.g. a JSF-based web app) hosted on IBM WAS, where most of the web pages are designed to serve rich static content with lots of CSS and JS files plus dynamic content. How can I set Transfer-Encoding: chunked for my pages? Or in other words:
How do you decide which pages will have Transfer-Encoding: chunked, and how do you set it for those pages?
Your personal experience will certainly be valuable for my understanding.
Transfer-Encoding: chunked isn't needed for progressive rendering. However, it is needed when the total content length is unknown before the first bytes are sent.
When the server needs to send a large amount of data, it uses chunked encoding because it does not know exactly how long the data is going to be. In HTTP terms, the Content-Length header is omitted from the response. Instead, the server writes the length of the current chunk in hexadecimal, followed by \r\n, then the chunk itself, followed by \r\n (each chunk begins with its size in hex, followed by the chunk data).
This feature can be used for progressive rendering; however, the server needs to flush the data as often as possible so that the client can render the content progressively (in the case of HTML, CSS, etc.).
This feature is often used when the server pushes data to the client in large amounts, usually megabytes or gigabytes.
Mozilla Documentation
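
The question is about a JSF app on WAS, but the mechanics are the same in any server: write the response piecewise, flush, and never declare a Content-Length. A minimal sketch in Go (not WAS-specific):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Because the handler never sets Content-Length and flushes as it goes,
// the response is sent with Transfer-Encoding: chunked. On the wire each
// piece looks like: <size in hex>\r\n<chunk data>\r\n, ending with 0\r\n\r\n.
func streamHandler(w http.ResponseWriter, r *http.Request) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	for i := 1; i <= 5; i++ {
		fmt.Fprintf(w, "<p>part %d of the page</p>\n", i)
		flusher.Flush() // push this piece to the client immediately
		time.Sleep(500 * time.Millisecond)
	}
}

func main() {
	http.HandleFunc("/stream", streamHandler)
	http.ListenAndServe(":8080", nil)
}
```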

HTTP: How should I respond to "Range: bytes=" when Range is unsupported?

What is the correct response to a GET request with the header field Range: bytes=278528- if Range is not supported?
Reading the HTTP header definitions (http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html), I think I should at least set Accept-Ranges: none, but the spec clearly states that
Clients MAY generate byte-range requests without having received this header for the resource involved.
So, if a client requests a range, should I:
Reply with the whole file from byte 0?
Reply with some error status? (400/406/416/501) See: http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
You may ignore it, as the spec says. To be precise:
If you support it, you return a status code of 206 Partial Content and include the proper headers like Content-Range.
If you don’t support it, you return a 200 OK as normal.
I have not tested this, but the spec seems pretty clear, and I have seen it work: using wget or curl to resume an interrupted download will simply restart from the beginning if the server does not support the Range header.
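
As an illustration, in Go the standard library's http.ServeContent gives you the "support it" branch for free: it answers a valid Range request with 206 Partial Content and a Content-Range header, and sends a normal 200 otherwise. A sketch with a placeholder file name:

```go
package main

import (
	"net/http"
	"os"
)

func main() {
	http.HandleFunc("/file", func(w http.ResponseWriter, r *http.Request) {
		f, err := os.Open("large.pdf") // placeholder file
		if err != nil {
			http.Error(w, "not found", http.StatusNotFound)
			return
		}
		defer f.Close()

		fi, err := f.Stat()
		if err != nil {
			http.Error(w, "stat failed", http.StatusInternalServerError)
			return
		}
		// ServeContent answers a valid Range request with 206 Partial Content
		// plus Content-Range; without a Range header it sends a normal 200.
		http.ServeContent(w, r, fi.Name(), fi.ModTime(), f)
	})
	http.ListenAndServe(":8080", nil)
}
```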
RFC2616 section 14.35.2 says:
A server MAY ignore the Range header.
One possibility is to check the HTTP headers and, if there is a Range string, parse it into ranges, compute the skip and take positions, open a file stream for the URL, seek to the skip position, take 'take' bytes, build the response from them, send the response, and finally close the stream.
Do not forget to include the Content-Range header in the response.
Do not ignore Range, especially when you are working with big streams.
If you are using nanohttp, I can help you out with an example.
Ignoring range requests can make playing content (which is huge) via AirPlay, or a similar service, unstable or unusable. I know that HTTP is not the right protocol for transferring video, but try sending video to AirPlay from a server that does not accept ranges....
AirPlay uses range requests...
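
A rough sketch of the manual approach described above, again in Go and handling only the open-ended "bytes=start-" form (the file name is a placeholder):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strconv"
	"strings"
)

func rangeHandler(w http.ResponseWriter, r *http.Request) {
	f, err := os.Open("video.mp4") // placeholder file
	if err != nil {
		http.Error(w, "not found", http.StatusNotFound)
		return
	}
	defer f.Close()

	fi, _ := f.Stat()
	size := fi.Size()

	rng := r.Header.Get("Range")
	if !strings.HasPrefix(rng, "bytes=") || !strings.HasSuffix(rng, "-") {
		// No Range header (or a form this sketch does not parse):
		// send the whole file with a plain 200.
		w.Header().Set("Content-Length", strconv.FormatInt(size, 10))
		io.Copy(w, f)
		return
	}

	// Parse the start offset out of "bytes=NNN-".
	start, err := strconv.ParseInt(strings.TrimSuffix(strings.TrimPrefix(rng, "bytes="), "-"), 10, 64)
	if err != nil || start < 0 || start >= size {
		w.Header().Set("Content-Range", fmt.Sprintf("bytes */%d", size))
		w.WriteHeader(http.StatusRequestedRangeNotSatisfiable) // 416
		return
	}

	// Seek to the requested offset and send the remainder as 206.
	f.Seek(start, io.SeekStart)
	w.Header().Set("Content-Range", fmt.Sprintf("bytes %d-%d/%d", start, size-1, size))
	w.Header().Set("Content-Length", strconv.FormatInt(size-start, 10))
	w.WriteHeader(http.StatusPartialContent) // 206
	io.Copy(w, f)
}

func main() {
	http.HandleFunc("/video", rangeHandler)
	http.ListenAndServe(":8080", nil)
}
```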

Will sending an HTTP Header with Accept: text/html only download text from the page?

I'm writing a simple crawler, and ideally, to save bandwidth, I'd like to download only the text and links on a page. Can I do that using HTTP headers? I'm confused about how they work.
You're on the right track to solving the problem.
I'm not sure how much you already know about HTTP headers, but basically an HTTP header is just formatted text sent to a web server; it follows a protocol and is pretty straightforward in that respect. You write a request and receive a response. The requests look like what you see in the Firefox plugin LiveHTTPHeaders at https://addons.mozilla.org/en-US/firefox/addon/3829/.
I wrote a small post at my site, http://blog.gnucom.cc/2010/write-http-request-to-web-server-with-php/, that shows how you can write a request to a web server and then read back the response. If you only accept text/html, you'll only accept a subset of what is available on the web (so yes, it will "optimize" your script to an extent). Note that this example is really low level; if you're going to write a spider, you may want to use an existing library like cURL, or whatever other tools your implementation language offers.
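
The linked post uses PHP; the same low-level idea sketched in Go (open a socket, write the request text including the Accept header, and read back whatever the server returns):

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
)

func main() {
	// Open a plain TCP connection to the web server.
	conn, err := net.Dial("tcp", "example.com:80")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// An HTTP request is just lines of text: a request line, headers,
	// then a blank line.
	fmt.Fprint(conn, "GET / HTTP/1.1\r\n"+
		"Host: example.com\r\n"+
		"Accept: text/html\r\n"+
		"Connection: close\r\n"+
		"\r\n")

	// Dump the raw response: status line, headers, then the body.
	io.Copy(os.Stdout, conn)
}
```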
Yes, by using Accept: text/html you should only get HTML as a valid response. That's at least how it ought to be.
But in practice there is a huge difference between the standards and the actual implementations, and proper content negotiation (that's what Accept is for) is one of the things that is barely supported.
An HTML page contains just the text plus some tag markup.
Images, scripts and stylesheets are (usually) external files that are referenced from the HTML markup. This means that if you request a page, you will already receive just the text (without the images and other stuff).
Since you are writing the crawler, you should make sure it doesn't follow URLs from images, scripts or stylesheets.
I'm not 100% sure, but I believe that GET /foobar.png will return the image even if you send Accept: text/html. For this reason I believe you should just filter what kind of URLs you crawl.
In addition, you could have the crawler read the response headers and close the connection before reading the body if the Content-Type is not text/html. That might be worthwhile for large, unwanted files.
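
A minimal sketch of that header check in Go; the URL is a placeholder and fetchHTML is just an illustrative helper name:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func fetchHTML(url string) (string, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Accept", "text/html")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	// Headers are available before the body is read; bail out early for
	// images, PDFs, and anything else that is not HTML.
	if !strings.HasPrefix(resp.Header.Get("Content-Type"), "text/html") {
		return "", fmt.Errorf("skipping %s: content type %q", url, resp.Header.Get("Content-Type"))
	}

	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	page, err := fetchHTML("https://example.com/")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(len(page), "bytes of HTML")
}
```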
