Do search engines respect the HTTP header field “Content-Location”? - http

I was wondering whether search engines respect the HTTP header field Content-Location.
This could be useful, for example, when you want to remove the session ID parameter from the URL:
GET /foo/bar?sid=0123456789 HTTP/1.1
Host: example.com
…
HTTP/1.1 200 OK
Content-Location: http://example.com/foo/bar
…
Clarification:
I don’t want to redirect the request, as removing the session ID would lead to a completely different request and thus probably also a different response. I just want to state that the enclosed response is also available under its “main URL”.
Maybe my example was not a good representation of the intent of my question. So please take a look at What is the purpose of the HTTP header field “Content-Location”?.

I think Google just announced the answer to my question: the canonical link relation for declaring the canonical URL.
Maile Ohye from Google wrote:
MickeyC said...
You should have used the Content-Location header instead, as per:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
"14.14 Content-Location"
@MickeyC: Yes, from a theoretical standpoint that makes sense and we certainly considered it. A few points, however, led us to choose rel="canonical":
Our data showed that the "Content-Location" header is configured improperly on many web sites. Sometimes webmasters provide long, ugly URLs that aren’t even duplicates -- it's probably unintentional. They're likely unaware that their webserver is even sending the Content-Location header.
It would've been extremely time consuming to contact site owners to clean up the Content-Location issues throughout the web. We realized that if we started with a clean slate, we could provide the functionality more quickly. With Microsoft and Yahoo! on-board to support this format, webmasters need to only learn one syntax.
Often webmasters have difficulty configuring their web server headers, but can more easily change their HTML. rel="canonical" seemed like a friendly attribute.
http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html?showComment=1234714860000#c8376597054104610625
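For reference, the markup announced there is a link element in the head of each duplicate page pointing at the preferred URL (example.com stands in for your own site):
<link rel="canonical" href="http://example.com/foo/bar">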

Most decent crawlers do follow Content-Location. So, yes, search engines respect the Content-Location header, although that is no guarantee that the URL containing the sid parameter will not appear on the results page.

In 2009 Google started honoring URIs marked with rel=canonical in the response body.
It looks like, since 2011, links formatted per RFC 5988 are also parsed from the Link: header field. This is also clearly mentioned in the Webmaster Tools FAQ as a valid option.
I guess this is the most up-to-date way of providing search engines some extra hypermedia breadcrumbs to follow, thus allowing you to keep them out of the response body when you don't actually need to serve them as content.
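Applied to the question's example, the RFC 5988 header variant would look like this (a sketch; the URL is the one from the question):
HTTP/1.1 200 OK
Link: <http://example.com/foo/bar>; rel="canonical"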

In addition to using 'Location' rather than 'Content-Location', use the proper HTTP status code in your response, depending on your reason for the redirect. Search engines tend to favor a permanent redirect (301) over a temporary one (302).
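A minimal sketch of such a redirect response, reusing the URL from the question:
HTTP/1.1 301 Moved Permanently
Location: http://example.com/foo/bar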

Try the "Location:" header instead.

Related

itemtype with http or better https?

I use something like:
itemtype="http://schema.org/ImageObject"
but a request to http://schema.org/ImageObject is redirected to https://schema.org/ImageObject.
If I change to itemtype="https://schema.org/ImageObject", the Google SDTT shows no problem, but nearly all of Google's examples about structured data use http.
Which is best or recommended to use for itemtype: http://schema.org or https://schema.org?
From Schema.org’s FAQs:
Q: Should we write https://schema.org or http://schema.org in our markup?
There is a general trend towards using https more widely, and you can already write https://schema.org in your structured data. Over time we will migrate the schema.org site itself towards using https: as the default version of the site and our preferred form in examples. However http://schema.org-based URLs in structured data markup will remain widely understood for the foreseeable future and there should be no urgency about migrating existing data. This is a lengthy way of saying that both https://schema.org and http://schema.org are fine.
tl;dr: Both variants are possible.
The purpose of itemtype URIs
Note that the URIs used for itemtype are primarily identifiers; they typically don’t get dereferenced:
If a Microdata consumer doesn’t know what the URI in itemtype="http://schema.org/ImageObject" stands for, this consumer "must not automatically dereference" it.
If a Microdata consumer does know what the URI stands for, this consumer has no need to dereference this URI in the first place.
So, there is no technical reason to prefer the HTTPS variant. User agents won’t dereference this URI (in contrast to URIs specified in href/src attributes), and users can’t click on it. I think there is only one case where the HTTPS variant is useful: if a visitor looks into the source code and copy-pastes the URI to check what the type is about.
I would recommend sticking with the HTTP variant until Schema.org has switched everything to HTTPS, most importantly the URI in RDFa’s initial context.
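For illustration, a minimal Microdata snippet using the HTTPS variant (contentUrl and name are actual ImageObject properties; the image file name is made up):
<div itemscope itemtype="https://schema.org/ImageObject">
  <img itemprop="contentUrl" src="photo.jpg" alt="A photo">
  <span itemprop="name">A photo</span>
</div>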
The Schema.org specification for the type ImageObject indicates:
Canonical URL: http://schema.org/ImageObject
It is probably useful to refer to the canonical URL because it is the “preferred” version of the web page.

Documentation for Rebol2's read/custom?

I've been trying to update Ross-Gill's Twitter API for REBOL2 to support uploading media. From looking at its source, the REBOL cookbook, the codeconscious site, and other questions here, my understanding is that read/custom is the preferred way to POST data to websites.
However, I haven't been able to find any real documentation on read/custom. For example:
Does it support sending multipart/form-data? (I've managed to work around this by manually composing each part, but it doesn't seem to work for all image files on Twitter's end and is a bit of a hack.)
Does read/custom only return text on an HTTP/1.0 200 OK response? (It appears so, which is problematic when I receive HTTP/1.0 202 Accepted and need to read the resulting data.)
Is there a reason that read/custom/binary doesn't appear to send binary data correctly without converting the data using to-string?
TL;DR: Is there good documentation on REBOL2's read/custom somewhere? Alternatively, is read/custom only meant for basic POSTs and I should be using ports and handling the HTTP responses manually?
You guessed right: read/custom is meant for simple HTTP posts, handling web form data only (that is why it fails on binary data). There is no official documentation for it, but that is not an issue, as you can access the source code of the HTTP implementation:
probe system/schemes/HTTP
There you can see that the /custom refinement supports two keywords, post and header (for setting custom HTTP headers). It also appears that even if you use both keywords, Content-Type will be forced to application/x-www-form-urlencoded no matter what (which is probably the reason why your binary data gets rejected by the server, as the provided MIME type is wrong).
In order to work around that, you can save the HTTP object, modify its implementation to fit your needs and reload it.
Saving:
save %http-scheme.r system/schemes/HTTP
Reloading:
system/schemes/HTTP: do load %http-scheme.r
If you just disable the hard-coded Content-Type setting in the HTTP code, and then provide your own using the header keyword, it should work fine, even with binary data:
read/custom <url> [header [Content-Type: <...>] post <data>]
Hope this helps.

Is there any downside to using POST instead of GET?

I know the difference between POST and GET; however, if I used POST instead of GET, is there any downside besides not being up to W3C standards?
Any inefficiency, insecurity, or anything else?
See the answer from deceze:
POST requests can't be bookmarked.
In all the interviews I've done, all the teaching I've done, this is the best place to start. There's a lot more, but start with this.
Ignore anything anyone says about security. A good hacker can change POST to GET easily.
If you get this far, know that POST changes data (adds a membership, or charges a credit card), whereas GET only fetches data (searches for red shirts). The makers of browsers make their browsers behave differently for the results of POST vs GET. The results of POST have side effects that you may not want to repeat (such as adding another membership or double charging a credit card).
If you understand THIS, then read about the POST-Redirect-GET pattern, and understand it well. (Then know that GET has a URL length limit, and that you may need to resort to POST in this case.)
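A sketch of the pattern with made-up paths: the POST performs the side effect, the server answers with a redirect (303, though 302 is common in practice), and the browser follows with a safe GET that can be refreshed or bookmarked without repeating the side effect:
POST /memberships HTTP/1.1
Host: example.com
Content-Type: application/x-www-form-urlencoded

name=Alice

HTTP/1.1 303 See Other
Location: /memberships/42

GET /memberships/42 HTTP/1.1
Host: example.com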
Never use POST requests for normal view-only pages. POST requests can't be bookmarked, sent in an email, or otherwise reused. They screw up proper navigation using the browser's back/forward buttons. Only ever use them for sending data to the server in one unique operation, and (usually) have the server answer with a redirect.
Other than that, they're not more or less efficient or secure than GET requests, they're just for a different purpose.

What is the difference between GET and POST in the context of creating an AJAX request?

I have an AJAX request that sends a GET: 'getPendingList'. This request should return a JSON string indicating a list of pending requests that need to be approved. I'm a little confused about whether I should be using a GET or POST here.
From this website:
GET requests can be cached
GET requests can remain in the browser history
GET requests can be bookmarked
GET requests can be distributed & shared
GET requests can be hacked (ask Jakob!)
So I'm thinking: I don't want the results of this GET to be cached because the pending list could change. On the other hand, using POST doesn't seem to make much sense either.
How should I think about GET and POST? I've been told that GET is the same as a 'read'; it doesn't (or shouldn't) change anything on the server side. This makes sense. What doesn't make sense is the caching part; it wouldn't work for me if someone else cached my GET request because I'm expecting the data to change.
Yahoo's best practices might be worth reading over. They recommend using GET primarily for retrieving information and using POST for updating information. In a separate item, they also recommend that you make AJAX requests cacheable where it makes sense. Check it out, it's a good read.
In short, GET requests should be idempotent; POST requests are not.
If you are altering state, use POST - otherwise use GET.
And don't forget, when talking about caching with GET/POST, that this means browser caching.
Nothing stops you from caching the data server-side.
Also, in general, JSON calls should be POST (here's why).
So, after some IRC'ing, it looks like the best way to do this is to use GET (in this particular instance), but to prevent caching. There are two ways to do this:
1) Append a random string to your GET request.
This seems like a hacky way to do this but it sounds like it might be the only solution for IE: Prevent browser caching of jQuery AJAX call result.
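For illustration, the trick is simply a throwaway query parameter whose value changes on every call; jQuery's cache: false option does this by appending a _ parameter (the timestamp value below is arbitrary):
GET /getPendingList?_=1634567890123 HTTP/1.1
Host: example.com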
2) In your response from the server, set the headers to no-cache.
It's not clear what the definitive behavior is on this. Some folks (see the previous link) claim that IE doesn't respect the no-cache directives. Other folks seem to think that this works: Internet Explorer 7 Ajax links only load once.
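For reference, a typical combination of response headers for disabling caching (Pragma and Expires are legacy headers kept for older clients) looks like this:
HTTP/1.1 200 OK
Content-Type: application/json
Cache-Control: no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: 0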

RESTful URLs and folders

On the Microformats spec for RESTful URLs:
GET /people/1
return the first record in HTML format
GET /people/1.html
return the first record in HTML format
and /people returns a list of people
So is /people.html the correct way to return a list of people in HTML format?
If you just refer to the URL path extension, then, yes, that scheme is the recommended behavior for content negotiation:
path without extension is a generic URL (e.g. /people for any accepted format)
path with extension is a specific URL (e.g. /people.json as a content-type-specific URL for the JSON data format)
With such a scheme the server can use content negotiation when the generic URL is requested and respond with a specific representation when a specific URL is requested.
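A sketch of how that plays out on the wire, using the paths from the question (the Content-Location header ties the generic URL to the specific representation that was served):
GET /people HTTP/1.1
Host: example.com
Accept: application/json

HTTP/1.1 200 OK
Content-Type: application/json
Content-Location: /people.json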
Documents that recommend this scheme are among others:
Cool URIs don't change
Cool URIs for the Semantic Web
Content Negotiation: why it is useful, and how to make it work
You have the right idea. Both /people and /people.html would return HTML-formatted lists of people, and /people.json would return a JSON-formatted list of people.
There should be no confusion about this with regard to applying data-type extensions to "folders" in the URLs. In the list of examples, /people/1 is itself used as a folder for various other queries.
It says that GET /people/1.json should return the first record in JSON format, which makes sense.
URIs and how you design them have nothing to do with being RESTful or not.
It is a common practice to do what you ask, since that's how the Apache web server works. Let's say you have foo.txt, foo.html and foo.pdf, and request GET /foo with no preference (i.e. no Accept: header). A 300 MULTIPLE CHOICES response would be returned with a listing of the three files so the user could pick. Because browsers do such marvelous content negotiation, it's hard to link to an example, but here goes: an example shows what it looks like, except that the reason you see the page in the first place is the different case of the file name ("XSLT" vs "xslt").
This Apache behaviour is echoed in conventions and different tools, but really it isn't important. You could have people_html or people?format=html or people.html or sandwiches or 123qweazrfvbnhyrewsxc6yhn8uk as the URI which returns people in HTML format. The client doesn't know any of these URIs up front; it's supposed to learn them from other resources. A human could see the result of All People (HTML format) and understand what happens, while ignoring the strange-looking URI.
On a closing note, the microformats URL conventions page is absolutely not a spec for RESTful URLs, it's merely guidance on making URIs that apparently are easy to consume by various HTTP libraries for some reason or another, and has nothing to do with REST at all. The guidelines are all perfectly OK, and following them makes your URIs look sane to other people that happen to glance on the URIs (/sandwiches is admittedly odd). But even the cited AtomPub protocol doesn't require entries to live "within" the collection...
