I was trying to understand more about Transfer-Encoding: chunked. I referred to some articles:
http://zoompf.com/blog/2012/05/too-chunky and "Transfer-Encoding: chunked" header in PHP.
I still don't have a very clear picture. I understand that this encoding allows the server to send content to the browser in chunks, so the browser can render content partially as it arrives, which makes the site feel responsive.
If I have a web application that serves dynamic content (e.g., a JSF-based web app) hosted on IBM WAS, where most of the pages are designed to serve rich static content with lots of CSS and JS files plus dynamic content, how can I set Transfer-Encoding: chunked for my pages? Or in other words:
How do you decide which page will have 'Transfer-Encoding: chunked' and how do you set it for that page?
Your personal experience will certainly be valuable for my understanding.
Transfer-Encoding: chunked isn't needed for progressive rendering. However, it is needed when the total content length is unknown before the first bytes are sent.
Chunked encoding is used when the server needs to send data without knowing in advance exactly how long the response will be. In HTTP terms, the server omits the Content-Length header from the response. Instead, for each chunk, it writes the length of that chunk in hexadecimal followed by \r\n, then the chunk data itself, followed by another \r\n (so the content begins with the chunk size in hex, followed by the chunk). A final chunk of length 0 marks the end of the response.
This feature can be used for progressive rendering; however, the server needs to flush the data as early and as often as possible so that the client can render the content progressively (in the case of HTML, CSS, etc.).
It is also often used when the server pushes large amounts of data to the client, typically megabytes or gigabytes in size.
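To make the flushing point concrete, here's a minimal sketch in Go (just to illustrate the mechanics; it isn't WAS/JSF-specific). Go's net/http switches to chunked encoding automatically when a handler starts writing before any Content-Length is known, and http.Flusher sends each chunk to the client immediately:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func handler(w http.ResponseWriter, r *http.Request) {
	// No Content-Length is set, so net/http falls back to
	// Transfer-Encoding: chunked as soon as we flush.
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	for i := 1; i <= 5; i++ {
		fmt.Fprintf(w, "<p>chunk %d</p>\n", i)
		flusher.Flush() // send this chunk to the client now
		time.Sleep(500 * time.Millisecond)
	}
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```

Each Flush() emits one chunk on the wire; `curl --raw -i http://localhost:8080/` shows the hex chunk sizes between the HTML fragments.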
Mozilla Documentation
I create PDF documents dynamically and want to serve them from my handler. I set the Content-Type to application/pdf and it works fine. I run my server behind an nginx proxy.
My problem is that some requests generate a lot of other requests for the same document. I looked at the headers and saw that it wants chunked transfer encoding.
My solution was to set the Content-Length, and it seems to work fine.
I wonder if that's enough, and why I never had to do it with a simple HTML page.
A comment in the source code says:
If the handler didn't declare a Content-Length up front, we either go into chunking mode or, if the handler finishes running before the chunking buffer size, we compute a Content-Length and send that in the header instead.
If you want to avoid chunking, then set the content length. Setting the content length for a large response does reduce the amount of data transferred and can reduce copying within the HTTP server.
As a rule of thumb, set the content length if the length is known in advance of producing the response body.
Your simple HTML pages may be smaller than the chunking buffer size. If so, they were not chunked.
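To illustrate, here is a minimal Go sketch of that rule of thumb: buffer the whole document, then declare its length before writing (generatePDF is a hypothetical stand-in for your document generator):

```go
package main

import (
	"net/http"
	"strconv"
)

// generatePDF is a hypothetical stand-in for your document generator.
func generatePDF() []byte { return []byte("%PDF-1.4 ...") }

func servePDF(w http.ResponseWriter, r *http.Request) {
	pdf := generatePDF()
	w.Header().Set("Content-Type", "application/pdf")
	// Declaring the length up front makes net/http send Content-Length
	// instead of switching to Transfer-Encoding: chunked.
	w.Header().Set("Content-Length", strconv.Itoa(len(pdf)))
	w.Write(pdf)
}

func main() {
	http.HandleFunc("/doc.pdf", servePDF)
	http.ListenAndServe(":8080", nil)
}
```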
I am new to the world of coding as well as PHP and XHTML. I was just going through the details of meta tags, and I do not understand the http-equiv property, what charset is used for, or what the value UTF-8 refers to, e.g. <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />.
There's a really good article by Joel Spolsky that discusses charsets. Take a look:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
As for HTTP headers, a quick Google search for "understanding HTTP headers" turned up quite a few good articles that describe them. Here's one: HTTP Headers for Dummies.
In short, HTTP headers are basically small messages that get sent to the user's web browser that tell the browser about the output it's going to receive (i.e., is it a web page, a file, or an image; should it be cached; etc.), or for things like cookies that should be saved by the client's browser.
HTTP headers are also sent from the user's web browser back to the server. The most obvious example is, again, cookies: every cookie that is saved by the browser is supposed to be sent back to the server on each HTTP request.
In your case, you're probably talking about a specific HTTP header that defines the character set for the page. The <meta http-equiv=""> tags are used to simulate HTTP headers.
For example, if you had a static HTML page and wanted to take advantage of a particular HTTP header, but couldn't configure it in the web server, you could use a <meta http-equiv=""> tag to achieve the same result.
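For instance, when you do control the server, you can send the real header and keep the meta tag as a fallback. A minimal Go sketch of the idea (PHP's header() does the same job):

```go
package main

import "net/http"

func page(w http.ResponseWriter, r *http.Request) {
	// The real HTTP header; the <meta http-equiv> tag merely imitates
	// this for cases where the server cannot be configured.
	w.Header().Set("Content-Type", "text/html; charset=UTF-8")
	w.Write([]byte("<html><head>" +
		`<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />` +
		"</head><body>café, with its UTF-8 bytes interpreted correctly</body></html>"))
}

func main() {
	http.HandleFunc("/", page)
	http.ListenAndServe(":8080", nil)
}
```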
@Ryan: Great link. I wish I had read that article 12 years ago, before I embarked on a large international CMS implementation.
In simple terms, when the browser receives an HTTP response stream (the web page that was requested), all it sees is a sequence of bytes. Because bytes can mean different characters, depending on the encoding that was used by the server, it is very important that the browser uses the encoding specified in the meta tag when interpreting the byte stream. If the server used a different encoding than what it put in the meta tag, you will see gibberish.
Conversely, the HTTP request associated with a web page POST also has an encoding that is provided transparently by the browser, because the server needs to know how to interpret any form data that was sent.
I'm writing a simple crawler, and ideally to save bandwidth, I'd only like to download the text and links on the page. Can I do that using HTTP Headers? I'm confused about how they work.
You're on the right track to solving the problem.
I'm not sure how much you already know about HTTP headers, but basically an HTTP header is just a line of text exchanged with a web server according to the protocol, and it is pretty straightforward in that respect. You write a request and receive a response. The requests look like the things you see in the Firefox plugin LiveHTTPHeaders at https://addons.mozilla.org/en-US/firefox/addon/3829/.
I wrote a small post at my site http://blog.gnucom.cc/2010/write-http-request-to-web-server-with-php/ that shows you how you can write a request to a web server and then later read the response. If you only accept text/html you'll only accept a subset of what is available on the web (so yes, it will "optimize" your script to an extent). Note this example is really low level, and if you're going to write a spider you may want to use an existing library like cURL or whatever other tools your implementation language offers.
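As a rough illustration of the same low-level idea in Go (example.com is just a placeholder host), this writes a raw request to a socket and prints whatever comes back:

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
)

func main() {
	conn, err := net.Dial("tcp", "example.com:80")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer conn.Close()

	// A raw HTTP/1.1 request: each header line ends in \r\n and a
	// blank line terminates the header block.
	fmt.Fprint(conn, "GET / HTTP/1.1\r\n"+
		"Host: example.com\r\n"+
		"Accept: text/html\r\n"+
		"Connection: close\r\n"+
		"\r\n")

	// Dump the raw response: status line, headers, then the body.
	io.Copy(os.Stdout, conn)
}
```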
Yes, by using Accept: text/html you should only get HTML as a valid response. That's at least how it ought to be.
But in practice there is a huge difference between the standards and the actual implementations. And proper content negotiation (that’s what Accept is for) is one of the things that are barely supported.
An HTML page contains just the text plus some tag markup.
Images, scripts and stylesheets are (usually) external files that are referenced from the HTML markup. This means that if you request a page, you will already receive just the text (without the images and other stuff).
Since you are writing the crawler, you should make sure it doesn't follow URLs from images, scripts or stylesheets.
I'm not 100% sure, but I believe that GET /foobar.png will return the image even if you send Accept: text/html. For this reason I believe you should just filter what kind of URLs you crawl.
In addition, you can read the response headers in the crawler and close the connection before reading the body if the Content-Type is not text/html. That can be worthwhile for avoiding the download of large, unwanted files.
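Here's a sketch of that approach in Go: ask for HTML, then inspect the Content-Type response header and skip the body when it isn't HTML (fetchHTML is an illustrative name, not a library function):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// fetchHTML returns the body only when the server says it is HTML.
func fetchHTML(url string) ([]byte, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	// Ask for HTML; servers are free to ignore this.
	req.Header.Set("Accept", "text/html")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// Inspect the headers before touching the body, so large images
	// or PDFs are never downloaded.
	if !strings.HasPrefix(resp.Header.Get("Content-Type"), "text/html") {
		return nil, nil // not HTML: skip it
	}
	return io.ReadAll(resp.Body)
}

func main() {
	body, err := fetchHTML("http://example.com/")
	fmt.Println(len(body), err)
}
```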
I use PHP to generate dynamic Web pages. As stated on the following tutorial (see link below), the MIME type of XHTML documents should be "application/xhtml+xml" when $_SERVER['HTTP_ACCEPT'] allows it. Since you can serve the same page with 2 different MIMEs ("application/xhtml+xml" and "text/html") you should set the "Vary" HTTP header to "Accept". This will help the cache on proxies.
Link:
http://keystonewebsites.com/articles/mime_type.php
Now I'm not sure of the implication of:
header('Vary: Accept');
I'm not really sure of what 'Vary: Accept' will precisely do...
The only explanation I found is:
After the Content-Type header, a Vary header is sent to (if I understand it correctly) tell intermediate caches, like proxy servers, that the content type of the document varies depending on the capabilities of the client which requests the document.
http://www.456bereastreet.com/archive/200408/content_negotiation/
Can anyone give me a "real" explanation of this header (with that value)? I think I understand things like:
Vary: Accept-Encoding
where the cache on proxies could be based on the encoding of the page served, but I don't understand:
Vary: Accept
The cache-control header is the primary mechanism for an HTTP server to tell a caching proxy the "freshness" of a response (i.e., whether and for how long to store the response in the cache).
In some situations, cache-control directives are insufficient. A discussion from the HTTP working group is archived here, describing a page that changes only with language. This is not the correct use case for the vary header, but the context is valuable for our discussion. (Although I believe the Vary header would solve the problem in that case, there is a Better Way.) From that page:
Vary is strictly for those cases where it's hopeless or excessively complicated for a proxy to replicate what the server would do.
RFC2616 "Header-Field Definitions" describes the header usage from the server perspective, RFC2616 "Caching Negotiated Responses" from a caching proxy perspective. It's intended to specify a set of HTTP request headers that determine uniqueness of a request.
A contrived example:
Your HTTP server has a large landing page. You have two slightly different pages at the same URL, depending on whether the user has been there before. You distinguish between requests and a user's "visit count" based on cookies. But since your server's landing page is so large, you want intermediary proxies to cache the response if possible.
The URL, Last-Modified and Cache-Control headers are insufficient to give this insight to a caching proxy, but if you add Vary: Cookie, the cache engine will add the Cookie header to its caching decisions.
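A minimal Go sketch of that contrived example, assuming a "visited" cookie carries the state and the two page bodies are placeholders:

```go
import "net/http"

var firstVisit = []byte("<html>...welcome...</html>")     // placeholder markup
var returning = []byte("<html>...welcome back...</html>") // placeholder markup

func landing(w http.ResponseWriter, r *http.Request) {
	// Tell caches that this response depends on the Cookie request
	// header: requests with different cookies are cached separately.
	w.Header().Set("Vary", "Cookie")
	w.Header().Set("Cache-Control", "public, max-age=3600")

	if _, err := r.Cookie("visited"); err != nil {
		// No "visited" cookie yet: this is a first visit.
		http.SetCookie(w, &http.Cookie{Name: "visited", Value: "1"})
		w.Write(firstVisit)
	} else {
		w.Write(returning)
	}
}
```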
Finally, for small traffic, dynamic web sites -- I have always found the simple Cache-Control: no-cache, no-store and Pragma: no-cache sufficient.
Edit -- to more precisely answer your question: the HTTP request header 'Accept' defines the Content-Types a client can process. If you have two copies of the same content at the same URL, differing only in Content-Type, then using Vary: Accept could be appropriate.
Update 11 Sep 12:
I'm including a couple links that have appeared in the comments since this answer was originally posted. They're both excellent resources for real-world examples (and problems) with Vary: Accept; if you're reading this answer you need to read those links as well.
The first, from the outstanding EricLaw, on Internet Explorer's behavior with the Vary header and some of the challenges it presents to developers: Vary Header Prevents Caching in IE. In short, IE (pre IE9) does not cache any content that uses the Vary header because the request cache does not include HTTP Request headers. EricLaw (Eric Lawrence in the real world) is a Program Manager on the IE team.
The second is from Eran Medan, and is an ongoing discussion of Vary-related unexpected behavior in Chrome: Chrome doesn't handle the Vary header correctly. It's related to IE's behavior, except the Chrome devs took a different approach -- though it doesn't appear to have been a deliberate choice.
Vary: Accept simply says that the response was generated based on the Accept header in the request. A request with a different Accept header might get a different response.
(You can see that the linked PHP code looks at $HTTP_ACCEPT. That's the value of the Accept request header.)
To HTTP caches, this means that the response must be cached with extra care. It is only going to be a valid match for later requests with exactly the same Accept header.
Now this only matters if the page is cacheable in the first place. By default, PHP pages aren't. A PHP page can mark the output as cacheable by sending certain headers (Expires, for example). But whether and how to do that is a different question.
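To tie this back to the question, here's a sketch in Go of the negotiation the linked PHP tutorial performs, together with the matching Vary header (the markup variable is a placeholder; the same principle applies to PHP's header() calls):

```go
import (
	"net/http"
	"strings"
)

var markup = []byte("<html>...</html>") // placeholder: identical markup either way

func page(w http.ResponseWriter, r *http.Request) {
	// The response depends on the Accept request header, so caches
	// must key on it: that is exactly what Vary: Accept declares.
	w.Header().Set("Vary", "Accept")

	if strings.Contains(r.Header.Get("Accept"), "application/xhtml+xml") {
		w.Header().Set("Content-Type", "application/xhtml+xml; charset=UTF-8")
	} else {
		w.Header().Set("Content-Type", "text/html; charset=UTF-8")
	}
	w.Write(markup)
}
```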
This Google webmaster video has a very good explanation of the HTTP Vary header.
There are actually a number of new features coming soon (and already in Chrome) that make the Vary header extremely useful, for example Client Hints. When used with images, client hints allow a server to optimize resources such as images depending on:
Image Width
Viewport Width
Type of encoding supported by browser (think WebP)
Downlink (essentially network speed)
So a server which supports those features would set the Vary header to indicate that.
Chrome advertises WebP support by including "image/webp" in the Accept header of each request. A server might then rewrite an image as WebP if the browser supports it, so a proxy needs to check the header in order not to cache a WebP image and then serve it to a browser that doesn't support WebP. Obviously, if your server doesn't do that, it wouldn't matter. Since the server's response varies on the Accept request header, the response must declare that so as not to confuse proxies:
Vary: Accept
Another example is image width. On a mobile browser the Width header might be quite small for a responsive image, compared with what it would be in a desktop browser. In that case, adding Width to the Vary header is essential so that a proxy does not cache the small mobile version and serve it to desktop browsers, or vice versa. The header might then include:
Vary: Accept, Width
Or in the case that a server supported all of the client hinting specs, the header would be something like:
Vary: Accept, DPR, Width, Save-Data, Downlink
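A hedged Go sketch of such a handler, reading the draft client-hint request headers and declaring them in Vary (pickImage is a hypothetical variant selector, not a real API):

```go
import (
	"net/http"
	"strings"
)

func image(w http.ResponseWriter, r *http.Request) {
	// Draft client-hint request headers; browsers only send them after
	// the server has advertised support via Accept-CH.
	width := r.Header.Get("Width")
	dpr := r.Header.Get("DPR")
	wantsWebP := strings.Contains(r.Header.Get("Accept"), "image/webp")

	// Every request header that influenced this response goes into
	// Vary, so caches keep the variants apart.
	w.Header().Set("Vary", "Accept, DPR, Width")

	img, ctype := pickImage(width, dpr, wantsWebP) // hypothetical variant selector
	w.Header().Set("Content-Type", ctype)
	w.Write(img)
}
```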
In the header exchange below I see that the server is returning the page Gzipped but I don't see where my browser ever indicated that it could accept GZip. How did the server know?
The content you have reproduced here is not what was sent by your browser; the "general" part is a mix of some of the request data and some of the response data. If you want to see the actual request and response, use something like Wireshark.
Coincidentally, it is worth noting that some so-called security products interfere with the browser's request; a common "enhancement" is to remove or mangle the header asking for compression. Your web server will honour such requests in the absence of specific configuration to force compression. Google works around this behaviour by delivering a compressed JavaScript file to the client anyway: if it executes on the client, the client evidently does support compression, and Google starts sending compressed content despite the mangled header. There are Apache config snippets on the web which can detect and override some such tampering.
But there's no evidence here to suggest that is the case with your setup. You're just not seeing the request headers.
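If you want to see the headers your own client really sends without firing up Wireshark, here's a small Go sketch; Go's transport quietly adds Accept-Encoding: gzip in much the same way a browser does:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
)

func main() {
	req, _ := http.NewRequest("GET", "http://example.com/", nil)

	// DumpRequestOut shows the request as the transport actually sends
	// it, including headers it adds itself (Accept-Encoding: gzip,
	// User-Agent, Host).
	dump, _ := httputil.DumpRequestOut(req, false)
	fmt.Printf("%s\n", dump)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	// Like a browser, Go decompresses transparently when it asked for
	// gzip itself; Uncompressed records that this happened.
	fmt.Println("transparently decompressed:", resp.Uncompressed)
}
```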