I am new to coding, including PHP and XHTML. I was reading about meta tags and do not understand the http-equiv attribute, what charset is used for, or what the value UTF-8 refers to, e.g. <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />.
There's a really good article by Joel Spolsky that discusses charsets. Take a look:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
As for HTTP headers, a quick Google search for "understanding HTTP headers" turned up quite a few good articles that describe them. Here's one: HTTP Headers for Dummies.
In short, HTTP headers are small messages sent to the user's web browser that describe the output it is about to receive (i.e., whether it is a web page, a file, or an image; whether it should be cached; etc.), or that carry things like cookies that the client's browser should save.
HTTP headers are also sent from the user's web browser back to the server. The most obvious example is, again, cookies: every cookie the browser has saved for that site is supposed to be sent back to the server on each HTTP request.
In your case, you're probably talking about a specific HTTP header that defines the character set for the page. The <meta http-equiv=""> tags are used to simulate HTTP headers.
For example, if you had a static HTML page and wanted to take advantage of a particular HTTP header, but couldn't configure it in the web server, you could use a <meta http-equiv=""> tag to achieve the same result.
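To make the "simulated header" idea concrete, here is a minimal Python sketch that reads a page and collects its <meta http-equiv> tags as if they were header name/value pairs. The class and function names (HttpEquivParser, extract_http_equiv) are made up for illustration; a real browser does considerably more than this.

```python
from html.parser import HTMLParser


class HttpEquivParser(HTMLParser):
    """Collects <meta http-equiv="..." content="..."> pairs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.pseudo_headers = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("http-equiv")
        if name is not None:
            self.pseudo_headers[name] = attributes.get("content", "")


def extract_http_equiv(html_text):
    """Return the http-equiv pairs found in html_text as a dict."""
    parser = HttpEquivParser()
    parser.feed(html_text)
    return parser.pseudo_headers


page = (
    "<html><head>"
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
    "</head><body>Hello</body></html>"
)
print(extract_http_equiv(page))
```

The dictionary it prints has the same name and value that a real Content-Type response header would carry.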
@Ryan: Great link. I wish I had read the article 12 years ago before I embarked on a large international CMS implementation.
In simple terms, when the browser receives an HTTP response stream (the web page that was requested), all it sees is a sequence of bytes. Because bytes can mean different characters, depending on the encoding that was used by the server, it is very important that the browser uses the encoding specified in the meta tag when interpreting the byte stream. If the server used a different encoding than what it put in the meta tag, you will see gibberish.
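A small Python sketch of the "gibberish" effect, assuming the server encoded the text as UTF-8 and the reader wrongly assumes Latin-1 (the sample word is arbitrary):

```python
text = "café"

# What the server actually puts on the wire: a sequence of bytes.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)                    # b'caf\xc3\xa9'

# Decoding with the encoding that was really used gives the text back.
print(utf8_bytes.decode("utf-8"))    # café

# Decoding the same bytes with a different encoding gives gibberish.
print(utf8_bytes.decode("latin-1"))  # cafÃ©
```

The bytes never change; only the interpretation does, which is why the declared charset has to match the one the server really used.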
Conversely, the HTTP request associated with a web page POST also has an encoding that is provided transparently by the browser, because the server needs to know how to interpret any form data that was sent.
Related
I was trying to understand more about Transfer-Encoding: chunked. I referred to some articles:
http://zoompf.com/blog/2012/05/too-chunky and "Transfer-Encoding: chunked" header in PHP.
I still didn't get a very clear picture. I understand that setting this encoding allows the server to send content to the browser in chunks, causing partial rendering of the content a piece at a time, which makes the web site feel responsive.
If I have a web application that serves dynamic content (e.g. a JSF-based web app) hosted on IBM WAS, most of the web pages are designed to serve rich static content with lots of CSS and JS files plus dynamic content. How can I set transfer-encoding 'chunked' for my pages? Or in other words:
How do you decide which page will have 'Transfer-Encoding: chunked' and how do you set it for that page?
Your personal experience will certainly be valuable for my understanding.
Transfer-Encoding: chunked isn't needed for progressive rendering. However, it is needed when the total content length is unknown before the first bytes are sent.
When the server needs to send a large amount of data, it uses chunked encoding because it does not know in advance exactly how long the data is going to be. In HTTP terms, the server omits the Content-Length header from the response. Instead, for each chunk, the server writes the length of the chunk in hexadecimal followed by \r\n, and then the chunk itself followed by \r\n.
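The framing described above can be sketched in a few lines of Python. The function name encode_chunked is made up for illustration. One detail not mentioned above but part of the HTTP/1.1 spec: the body ends with a zero-length chunk (0\r\n\r\n), so the sketch appends it.

```python
def encode_chunked(chunks):
    """Frame an iterable of byte strings using HTTP/1.1 chunked transfer coding."""
    body = b""
    for chunk in chunks:
        if not chunk:
            continue  # an empty chunk would be read as the end-of-body marker
        size_line = format(len(chunk), "x").encode("ascii")  # chunk length in hex
        body += size_line + b"\r\n" + chunk + b"\r\n"
    body += b"0\r\n\r\n"  # zero-length chunk terminates the body
    return body


print(encode_chunked([b"Hello, ", b"world!"]))
# b'7\r\nHello, \r\n6\r\nworld!\r\n0\r\n\r\n'
```

In practice your web server or framework does this framing for you once you start writing a response without a known length; you rarely build it by hand.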
This feature can be used for progressive rendering; however, the server needs to flush the data as often as possible so that the client can render the content progressively (in the case of HTML, CSS, etc.).
This feature is often used when the server pushes large amounts of data to the client, typically megabytes or gigabytes in size.
Mozilla Documentation
I've been working a lot with HTTP related things - HTTP requests, HTTP responses, HTTP methods etc., but I'm not really sure I understand what the protocol itself looks like. Is it a document like a specification?
Hyper Text Transfer Protocol (HTTP) provides a pattern to interact with Resources (e.g. webpages on a webserver). Essentially it boils down to a Request (typically from a browser) and a Response (typically from a webserver).
The request highlighted red above identifies an action verb such as GET, POST, DELETE, or PUT (there are other verbs too) and a resource (URI/URL) to perform the action on. The request above depicts a browser request to view the Wikipedia main page.
The server then responds to the request with the blue and green sections above; they represent the response header and the response body. The response header contains a lot of optional information about the server but the important fields are the status code (200 OK), the content length (54218) and the content type (text/html).
Since the content type is html the browser will try to render the html inside the response body. If the content type were something else such as a word doc then the browser would probably open a save dialog box. There are a plethora of content types that the body could represent, but not all browsers support each of the content types.
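For a rough idea of what the request and response look like as plain text, here is a minimal Python sketch. The host and path are assumptions standing in for the Wikipedia main page, the function names are made up, and the sample response is hand-written using the status code, content length, and content type mentioned above.

```python
def build_request(method, path, host):
    """Build the text of a minimal HTTP/1.1 request (no body)."""
    return f"{method} {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"


def parse_response(raw):
    """Split raw response text into status code, reason, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")  # a blank line ends the header
    status_line, *header_lines = head.split("\r\n")
    _version, status_code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(status_code), reason, headers, body


# Assumed host/path, for illustration only.
print(build_request("GET", "/wiki/Main_Page", "en.wikipedia.org"))

# Hand-written sample; the body is shortened, so it is not really 54218 bytes.
sample_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 54218\r\n"
    "\r\n"
    "<html>...</html>"
)
status, reason, headers, body = parse_response(sample_response)
print(status, reason, headers["content-type"], headers["content-length"])
```

Real clients and servers handle many more cases (repeated headers, binary bodies, chunked bodies), so treat this as a picture of the format rather than a parser to rely on.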
Is it a document like a specification?
Yes, HTTP is a protocol over TCP/IP defined in the following specification: http://www.w3.org/Protocols/rfc2616/rfc2616.html
This protocol is for example implemented by web servers and client browsers.
I'm writing a simple crawler, and ideally to save bandwidth, I'd only like to download the text and links on the page. Can I do that using HTTP Headers? I'm confused about how they work.
You're on the right track to solving the problem.
I'm not sure how much you already know about HTTP headers, but basically an HTTP header is just a formatted string exchanged with a web server; it follows a protocol and is pretty straightforward in that respect. You write a request and receive a response. The requests look like the things you see in the Firefox plugin LiveHTTPHeaders at https://addons.mozilla.org/en-US/firefox/addon/3829/.
I wrote a small post at my site http://blog.gnucom.cc/2010/write-http-request-to-web-server-with-php/ that shows you how you can write a request to a web server and then later read the response. If you only accept text/html you'll only accept a subset of what is available on the web (so yes, it will "optimize" your script to an extent). Note this example is really low level, and if you're going to write a spider you may want to use an existing library like cURL or whatever other tools your implementation language offers.
Yes, by sending Accept: text/html you should only get HTML as a valid response. That’s at least how it ought to be.
But in practice there is a huge difference between the standards and the actual implementations. And proper content negotiation (that’s what Accept is for) is one of the things that are barely supported.
An HTML page contains just the text plus some tag markup.
Images, scripts and stylesheets are (usually) external files that are referenced from the HTML markup. This means that if you request a page, you will already receive just the text (without the images and other stuff).
Since you are writing the crawler, you should make sure it doesn't follow URLs from images, scripts or stylesheets.
I'm not 100% sure, but I believe that GET /foobar.png will return the image even if you send Accept: text/html. For this reason I believe you should just filter what kind of URLs you crawl.
In addition, you may try to read the response headers in the crawler and close the connection before you read the body if the Content-Type is not text/html. It might be worthwhile for undesired larger files.
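A minimal Python sketch of that last idea, using only the standard library. The function names is_html and fetch_if_html are made up for illustration, and the timeout value is an arbitrary assumption.

```python
import urllib.request


def is_html(content_type_header):
    """True if a Content-Type header value names text/html (parameters ignored)."""
    if not content_type_header:
        return False
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


def fetch_if_html(url, timeout=10):
    """Return the body bytes if the response is HTML, otherwise None."""
    request = urllib.request.Request(url, headers={"Accept": "text/html"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # The headers are available here before the body has been read.
        if not is_html(response.headers.get("Content-Type")):
            return None  # leaving the with-block closes the connection
        return response.read()


print(is_html("text/html; charset=UTF-8"))  # True
print(is_html("image/png"))                 # False
```

Some bytes of an unwanted body may already have arrived by the time the headers are inspected, so the saving is mostly noticeable for larger files.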
I've read the w3.org spec on the 'HEAD' verb, and I guess I'm missing something. I can't see how it would be useful.
Is the HTTP 'HEAD' verb useful in web development?
If so, how?
From RFC2616:
"This method (HEAD) can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification."
HEAD is preferred to GET because the response has no message body, which makes it useful in scenarios where you want to determine whether the content has changed at all; a change in the last-modified time or content length usually signifies this.
Also, a HEAD request will provide some information about the server setup (whether it is IIS/Apache etc.), unless the server was masked; of course, this is available in all responses, but HEAD is preferred especially when you don't know the size of the response. HEAD is also the easiest way to determine if a site is up or down; again the irrelevance of the message body makes HEAD the ideal candidate.
I'm not sure about this, but RSS/ATOM feed readers would use HEAD over GET to ascertain if the contents of the feed have changed.
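A minimal Python sketch of the change-detection idea, assuming the server actually sends Last-Modified and/or Content-Length (not all do). The function names fetch_head and looks_changed are made up for illustration.

```python
import urllib.request


def fetch_head(url, timeout=10):
    """Send a HEAD request and return the response headers with lowercased names."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return {name.lower(): value for name, value in response.headers.items()}


def looks_changed(previous_headers, current_headers):
    """Compare two header dicts (lowercased names) from successive HEAD requests."""
    for name in ("last-modified", "content-length"):
        if previous_headers.get(name) != current_headers.get(name):
            return True
    return False


# Hand-written header values, for illustration only.
earlier = {"last-modified": "Mon, 01 Jan 2024 00:00:00 GMT", "content-length": "100"}
later = {"last-modified": "Tue, 02 Jan 2024 00:00:00 GMT", "content-length": "120"}
print(looks_changed(earlier, earlier))  # False
print(looks_changed(earlier, later))    # True
```

You would store the result of fetch_head(url) from one visit and pass it as previous_headers on the next; only when looks_changed returns True is a full GET worth doing.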
HTTP HEAD can also be used to pre-authenticate to a web server before you do an HTTP PUT/POST of some large data. Without the initial HEAD request, you would be sending the large data to the web server twice (because the first request would return a 401 Unauthorized response with a WWW-Authenticate header).
It's mainly for browsers and proxies to determine whether they can use a cached copy of the web document without having to download the whole thing (which would rather defeat the purpose of a cache).
Hey guys, I have a RESTful XML service where the client passes the current version of the HTML they are viewing. If the version on the server is the same as the client's, I just respond with the current server version in XML. Example: <Response ServerHTMLVersion="1" />
However, if the server HTML version is greater than the current client version, I still return the same kind of response, like <Response ServerHTMLVersion="2" />. The problem is that my client application then needs to make a separate HTTP request to download the HTML file whenever the version in the response XML is greater than the client's version.
For performance reasons, I want to cut out this extra HTTP request, and I want to know the best way to do this. Should I simply encode the HTML to make it XML-safe and append it to the XML response? The problem with this is that HTML is FAT and encoding makes it even fatter.
OR
Is there a better way of managing this? Note that I am already gzipping my response for both XML and HTML right now.
I want to know how to do this with performance in mind. The RESTful XML service is implemented with ASP.NET 3.5 and IIS 7.
Have you thought about using HTTP headers? Since really the primary data here is the HTML, and the ServerHTMLVersion is a sort of "meta data" about that html, it should work.
Personally, I'd make the response to the request 1) blank when the versions match and 2) the HTML for non-matching versions; then, use the Pragma HTTP header to send something like Pragma: "ServerHTMLVersion=2". By doing this, you can easily check if the client and server versions differ, and just grab the full response if they're different.
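Your service is ASP.NET, but the logic is language-independent, so here is a rough Python sketch of the idea. The function names are made up, and headers are shown as a plain dict rather than any particular framework's response object.

```python
def build_version_response(client_version, server_version, html):
    """Return (headers, body): empty body on a match, the HTML otherwise."""
    headers = {"Pragma": f"ServerHTMLVersion={server_version}"}
    if server_version > client_version:
        headers["Content-Type"] = "text/html; charset=utf-8"
        return headers, html
    return headers, ""  # versions match: nothing to download


def parse_server_version(headers):
    """Client side: read the version number back out of the Pragma header."""
    _, _, value = headers["Pragma"].partition("=")
    return int(value)


headers, body = build_version_response(1, 2, "<html>new page</html>")
print(parse_server_version(headers), body)  # 2 <html>new page</html>

headers, body = build_version_response(2, 2, "<html>new page</html>")
print(parse_server_version(headers), repr(body))  # 2 ''
```

The client makes one request either way: it reads the version from the header, and if the body is non-empty it already has the new HTML.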
Some people would debate the idea of returning HTML from a REST service, but I personally would consider this totally valid, and a nice, clean way of separating your metadata from the actual user data.
-Jerod