HTTP method/standard for obtaining a list of files

Is there any protocol or method specifically for listing files (or a list of file meta-data) in a directory?
I saw nothing obvious at http://www.iana.org/assignments/http-methods/http-methods.xhtml . The closest I could find was SEARCH, which might be used for this purpose but has no defined semantics. Are there no standards built on top of HTTP that allow for this?

What you're looking for is PROPFIND, the WebDAV method defined in RFC 4918 (http://greenbytes.de/tech/webdav/rfc4918.html#METHOD_PROPFIND).
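As a rough illustration of how a PROPFIND directory listing might look from the client side, here is a minimal Python sketch using only the standard library. The function names, the chosen properties, and the example URL are assumptions made for illustration; it builds the request without sending it and parses a 207 Multi-Status body into a simple list.

```python
import xml.etree.ElementTree as ET
from urllib.request import Request

DAV_NS = {"d": "DAV:"}

# Request body asking for a few standard WebDAV properties of each member.
PROPFIND_BODY = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<d:propfind xmlns:d="DAV:">'
    "<d:prop><d:getcontentlength/><d:getlastmodified/><d:resourcetype/></d:prop>"
    "</d:propfind>"
)


def build_propfind(url):
    """Build (but do not send) a PROPFIND request for a collection and its direct children."""
    return Request(
        url,
        data=PROPFIND_BODY.encode("utf-8"),
        method="PROPFIND",
        # "Depth: 1" limits the listing to the collection itself and its immediate members.
        headers={"Depth": "1", "Content-Type": "application/xml; charset=utf-8"},
    )


def parse_multistatus(xml_text):
    """Turn a 207 Multi-Status body into a list of {href, size, is_dir} dicts."""
    root = ET.fromstring(xml_text)
    entries = []
    for response in root.findall("d:response", DAV_NS):
        href = response.findtext("d:href", default="", namespaces=DAV_NS)
        size = response.findtext(".//d:getcontentlength", default=None, namespaces=DAV_NS)
        is_dir = response.find(".//d:resourcetype/d:collection", DAV_NS) is not None
        entries.append({"href": href, "size": int(size) if size else None, "is_dir": is_dir})
    return entries
```

Passing the object from build_propfind("http://example.com/dav/") (a made-up URL) to urllib.request.urlopen against a WebDAV-enabled server should yield a body that parse_multistatus can read; servers differ in which properties they return, so treat the parsing as a starting point.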

Related

itemtype with http or better https?

I currently use:
itemtype="http://schema.org/ImageObject"
but a request to http://schema.org/ImageObject is redirected to https://schema.org/ImageObject.
If I change to itemtype="https://schema.org/ImageObject", the Google SDTT shows no problem, but nearly all examples about structured data from Google are with http.
Which is recommended for itemtype: http://schema.org or https://schema.org?
From Schema.org’s FAQs:
Q: Should we write https://schema.org or http://schema.org in our markup?
There is a general trend towards using https more widely, and you can already write https://schema.org in your structured data. Over time we will migrate the schema.org site itself towards using https: as the default version of the site and our preferred form in examples. However http://schema.org -based URLs in structured data markup will remain widely understood for the forseeable future and there should be no urgency about migrating existing data. This is a lengthy way of saying that both https://schema.org and http://schema.org are fine.
tl;dr: Both variants are possible.
The purpose of itemtype URIs
Note that the URIs used for itemtype are primarily identifiers, they typically don’t get dereferenced:
If a Microdata consumer doesn’t know what the URI in itemtype="http://schema.org/ImageObject" stands for, this consumer "must not automatically dereference" it.
If a Microdata consumer does know what the URI stands for, this consumer has no need to dereference this URI in the first place.
So, there is no technical reason to prefer the HTTPS variant. User agents won’t dereference this URI (in contrast to URIs specified in href/src attributes), and users can’t click on it. I think there is only one case where the HTTPS variant is useful: if a visitor looks into the source code and copy-pastes the URI to check what the type is about.
I would recommend sticking with the HTTP variant until Schema.org has switched everything to HTTPS, most importantly the URI in RDFa’s initial context.
The Schema.org specification for the type ImageObject states:
Canonical URL: http://schema.org/ImageObject
It is probably useful to refer to the canonical URL because it is the “preferred” version of the web page.

Is there a preferred canonical form for the path part of URLs?

All of these URLs are equivalent:
http://rbutterworth.nfshost.com/Me
http://rbutterworth.nfshost.com/Me/
http://rbutterworth.nfshost.com/Me/.
http://rbutterworth.nfshost.com/Me/index
http://rbutterworth.nfshost.com/Me/index.html
The "rel='canonical'" link allows me to specify whichever I want.
Is one of those forms considered "better" or "more standard" than the others?
As a maintainer, I personally prefer the first one, as it allows me the freedom to change "Me" to be "Me.php", or change "index.html" to be "index.shtml", or some other form should I ever need to, without having to define redirects, or to change any existing links to this URL. (This isn't specific to "index"; it could be for any web page.)
I.e. using that simplest form avoids publishing what is only an implementation detail that is best hidden from the users.
Unfortunately, of all the forms, my preferred choice is the only one that web servers don't like; they return "HTTP/1.1 301 Moved Permanently" and add the trailing "/".
For directories, is incurring this redirection penalty worth it?
For non-directories, is there any reason I shouldn't continue omitting the suffix?
Added after receiving the answer:
It's nice to know I'm not the only one that thinks omitting suffixes is a good idea.
And I just realized that my problem with directories goes away if I use "directoryname/index" as the canonical form.
Thanks.
For directories, is incurring this redirection penalty worth it?
No.
"The canonical URL for this resource is a 301 redirect to another URL" doesn't make sense.
For non-directories, is there any reason I shouldn't continue omitting the suffix?
No.
There is a reason to omit the suffix: the suffix leaks information about the technologies used to build the site and makes it harder to change them. For example, if you moved away from static HTML files to a PHP-based system, you would need to redirect all your old URLs, or configure your server to process files with a .html extension as PHP (which is possible, but confusing).
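To make the "implementation detail" point concrete, here is a small Python sketch of how a server-side lookup might map a suffix-free URL path onto whatever file currently implements it. The suffix list, the function name, and the idea of passing in a set of existing file paths are all assumptions for illustration; real servers do this through their own configuration (for example rewrite rules or MultiViews).

```python
# Hypothetical list of implementation suffixes, in order of preference.
SUFFIXES = (".html", ".shtml", ".php")


def resolve(url_path, files):
    """Map a suffix-free URL path to the first matching entry in `files`.

    `files` is a set of paths that exist on the server. Returns None if
    nothing matches.
    """
    path = url_path.rstrip("/")
    # Try "<path><suffix>" first, then "<path>/index<suffix>" for directories.
    candidates = [path + suffix for suffix in SUFFIXES]
    candidates += [path + "/index" + suffix for suffix in SUFFIXES]
    for candidate in candidates:
        if candidate in files:
            return candidate
    return None
```

With this kind of lookup, the published URL /Me keeps working whether the site currently contains /Me.php or /Me/index.html, so no redirects or link changes are needed when the technology behind it changes.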

RESTful URLs and folders

On the Microformats spec for RESTful URLs:
GET /people/1
return the first record in HTML format
GET /people/1.html
return the first record in HTML format
and /people returns a list of people
So is /people.html the correct way to return a list of people in HTML format?
If you just refer to the URL path extension, then, yes, that scheme is the recommended behavior for content negotiation:
path without extension is a generic URL (e.g. /people for any accepted format)
path with extension is a specific URL (e.g. /people.json as a content-type-specific URL for the JSON data format)
With such a scheme the server can use content negotiation when the generic URL is requested and respond with a specific representation when a specific URL is requested.
Documents that recommend this scheme are among others:
Cool URIs don't change
Cool URIs for the Semantic Web
Content Negotiation: why it is useful, and how to make it work
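The generic-versus-specific URL scheme described above could be sketched roughly as follows in Python. The REPRESENTATIONS table and the function names are hypothetical, and the Accept parsing is deliberately simplified (it orders media ranges by q value only and ignores the specificity rules of full HTTP content negotiation).

```python
import posixpath

# Hypothetical mapping of URL extensions to the media types this server can produce.
REPRESENTATIONS = {".html": "text/html", ".json": "application/json"}


def parse_accept(accept_header):
    """Return media ranges from an Accept header, most preferred first (q=0 dropped)."""
    ranges = []
    for position, item in enumerate(accept_header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        q = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if parts[0] and q > 0:
            # Sort key: higher q first, then original order in the header.
            ranges.append((-q, position, parts[0]))
    return [media for _, _, media in sorted(ranges)]


def choose_representation(path, accept_header="*/*"):
    """Return (resource_path, media_type); media_type is None if nothing is acceptable."""
    base, ext = posixpath.splitext(path)
    if ext in REPRESENTATIONS:
        # Specific URL: the extension fixes the format, no negotiation needed.
        return base, REPRESENTATIONS[ext]
    # Generic URL: negotiate using the client's Accept header.
    for media_range in parse_accept(accept_header):
        for media_type in REPRESENTATIONS.values():
            main_type = media_type.split("/")[0]
            if media_range in ("*/*", main_type + "/*", media_type):
                return path, media_type
    return path, None
```

For example, choose_representation("/people.json") ignores the Accept header entirely, while choose_representation("/people", "application/json, text/html;q=0.5") negotiates and picks JSON.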
You have the right idea. Both /people and /people.html would return HTML-formatted lists of people, and /people.json would return a JSON-formatted list of people.
There should be no confusion about this with regard to applying data-type extensions to "folders" in the URLs. In the list of examples, /people/1 is itself used as a folder for various other queries.
It says that GET /people/1.json should return the first record in JSON format. - Which makes sense.
URIs and how you design them have nothing to do with being RESTful or not.
It is a common practice to do what you ask, since that's how the Apache web server works. Let's say you have foo.txt and foo.html and foo.pdf, and ask to GET /foo with no preference (i.e. no Accept: header). A 300 MULTIPLE CHOICES would be returned with a listing of the three files so the user could pick. Because browsers do such thorough content negotiation, it's hard to link to an example, but here goes: this example shows what it looks like, except that the reason you see the page in the first place is the different case of the file name ("XSLT" vs "xslt").
This Apache behaviour is echoed in conventions and various tools, but it isn't really important. You could have people_html or people?format=html or people.html or sandwiches or 123qweazrfvbnhyrewsxc6yhn8uk as the URI which returns people in HTML format. The client doesn't know any of these URIs up front; it's supposed to learn them from other resources. A human could see the result of All People (HTML format) and understand what happens, while ignoring the strange-looking URI.
On a closing note, the microformats URL conventions page is absolutely not a spec for RESTful URLs. It's merely guidance on making URIs that are apparently easy to consume by various HTTP libraries for some reason or another, and it has nothing to do with REST at all. The guidelines are all perfectly OK, and following them makes your URIs look sane to other people who happen to glance at them (/sandwiches is admittedly odd). But even the cited AtomPub protocol doesn't require entries to live "within" the collection...

ASP.NET response filtering and post-cache substitution are not compatible

According to this article http://support.microsoft.com/kb/2014472 you can't use Response filters and Substitution controls together. Has anyone found a workaround for this? I am trying to process the complete HTML response just before it's written to the client, and I use substitution controls widely.
Here's an official "answer" from MS Dev Support on this issue.
Question:
What is the alternative to response filtering in ASP.NET for modifying HTML rendered by another process when:
1. The other process cannot be modified
2. Post-cache substitution must be supported
Answer:
"Yes, you question is clear as blue sky and this is officially claimed to be not support. As Post-cache substitution would combine certain substitution chunks to the response bytes while response filtering expects to filter the raw bytes of the response(not modified). So the previously combined substitution chunks cannot be preserved anymore.
There is not an alternative from Microsoft so far."
The page you reference has the solution:
Disable output caching on pages that are using substitution blocks.
Edit
Possible solution:
Create master pages of all non-dynamic content. Cache that. Don't cache the changing content.

What is the best way to determine the mime type of an http file upload?

Assume you have an html form with an input tag of type 'file'. When the file is posted to the server it will be stored locally, along with relevant metadata.
I can think of three ways to determine the mime type:
Use the mime type supplied in the 'multipart/form-data' payload.
Use the file name supplied in the 'multipart/form-data' payload and look up the mime type based on the file extension.
Scan the raw file data and use a mime type guessing library.
None of these solutions are perfect.
Which is the most accurate solution?
Is there another, better option?
If you are using PHP then you can use
http://pecl.php.net/package/Fileinfo
Which will inspect many aspects of the file. For Python you can use
http://pypi.python.org/pypi/python-magic/0.1
These are bindings for libmagic on Linux/Unix, and possibly Windows, systems. See:
man magic
man libmagic
on Linux. libmagic uses magic-number tests to try to determine the MIME types of files.
I like the magic number method because it can catch wrong extensions and a lot of trickery if you are handling files on a webserver that are uploaded. These tests are generally one-offs, so the performance hit of reading through the file is negligible.
I don't think you can rely on any one of these as being the definite "I am mime type x". The problem with the first two is that the content type supplied may be incorrect, because of issues with the client (browser or otherwise) or a misleading request (various hack attempts etc...) from various clients.
So you should probably try and combine information from each type and work out some sort of confidence level. If the file extension says .doc and the mime type is application/msword then there's a pretty good chance it's a Word document, but run it through a mime type detection utility just to make sure.
There should be a solution available for mime magic detection with the language you're using - you didn't mention which one though. They all generally work by looking at the first few bytes/characters of the file and match them against a lookup table of mime types. Some also remove the BOM from the file to help with this. Often they fall back to plain text if the mime type can't be detected.
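As a rough sketch of the "combine the signals" idea, the following Python example compares the declared type, the type guessed from the extension, and a type sniffed from the first bytes. The signature table is a tiny illustrative sample and the function names are made up for this example; a real deployment would rely on a proper detection library such as libmagic rather than this list.

```python
import mimetypes

# A tiny, illustrative signature table; real libraries know far more formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]

UTF8_BOM = b"\xef\xbb\xbf"


def sniff(data):
    """Guess a MIME type from the leading bytes, or return None if unknown."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]  # skip a UTF-8 BOM before matching
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    return None


def assess_upload(filename, declared_type, data):
    """Return (best_guess, agreeing_signals) for an uploaded file.

    The sniffed type is trusted most, then the extension, then the
    client-declared type. agreeing_signals counts how many of the three
    signals match the best guess, as a crude confidence level.
    """
    by_extension, _ = mimetypes.guess_type(filename)
    by_content = sniff(data[:16])
    best = by_content or by_extension or declared_type
    signals = (declared_type, by_extension, by_content)
    agreeing = sum(1 for signal in signals if signal and signal == best)
    return best, agreeing
```

A result such as ("application/pdf", 1) for a file named photo.png declared as image/png is a sign that the extension and the client-supplied type should not be trusted for that upload.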
If you want a platform independent approach to this then take a look at the various Java libraries that exist:
http://code.google.com/p/mimemagic/
http://sourceforge.net/projects/jmimemagic/

Resources