What is the correct terminology for breaking up a URI into its component parts? - http

Suppose we have a string "http://www.example.com/feed". I am breaking this string up into three pieces for use with Apache's URI class:
1. "http"
2. "www.example.com"
3. "/feed"
Is there a proper term for this process of breaking down a URI into its component pieces?

A URI can be parsed into its component parts.
The following are two examples from RFC 3986:
  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment
   |   _____________________|__
  / \ /                        \
  urn:example:animal:ferret:nose
A URI can be either a URL or a URN.
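As a rough illustration of what "parsing" means here, the sketch below uses Python's standard urllib.parse.urlsplit to pull out the five components from the RFC 3986 examples above. The helper name uri_components is made up for this example, and urllib calls the authority "netloc".

```python
from urllib.parse import urlsplit


def uri_components(uri):
    """Split a URI into the five generic components named in RFC 3986."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,  # urllib's name for the authority
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


for example in (
    "foo://example.com:8042/over/there?name=ferret#nose",
    "urn:example:animal:ferret:nose",
):
    print(uri_components(example))
```

For the URN, everything after the scheme ends up in the path, which matches the lower half of the diagram.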

Split or parse? I think it's really semantics, and there's not an agreed upon term.

I would always use the term parsing.

Related

What do we call the combined path, query, and fragment in a URI?

A URI is composed of several parts: scheme:[//[user[:password]@]host[:port]][/path][?query][#fragment]. Often, I find myself wanting to refer to the entire part of the URI to the right of the host and port - the part that would be considered the URI in an HTTP request:
GET /path?query#fragment
Host: example.com
As a short-hand, I normally call this the "path", but that's not quite accurate, as the path is only part of it. This is essentially the inverse of What do you call the entire first part of a URL?
Is there an agreed-upon name for this?
Within a full HTTP URI, there doesn’t seem to be a term that denotes everything coming after the authority.
If you only have the part in question as a URI reference (e.g., in a HTTP GET request), it’s called a relative reference:
relative-ref = relative-part [ "?" query ] [ "#" fragment ]
But this term also includes network-path references (often called protocol-relative URIs), e.g. //example.com/path?query#fragment. To exclude this case, you could use the terms for the other two cases:
absolute-path reference (begins with a single /, e.g. /path?query#fragment)
relative-path reference (doesn’t begin with a /, e.g., path?query#fragment)¹
¹ If the first path segment contains a :, you have to begin the relative-path reference with ./ (e.g., ./pa:th?query#fragment).
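To make the three kinds of relative reference concrete, here is a small Python sketch. The function names after_authority and reference_kind are hypothetical, and reference_kind assumes its input is already a relative reference (no scheme); it only looks at the leading slashes, as the definitions above do.

```python
from urllib.parse import urlsplit


def after_authority(uri):
    """Return path, query, and fragment of a URI joined back into one string."""
    parts = urlsplit(uri)
    result = parts.path
    if parts.query:
        result += "?" + parts.query
    if parts.fragment:
        result += "#" + parts.fragment
    return result


def reference_kind(relative_ref):
    """Name the kind of relative reference, judged by its leading slashes."""
    if relative_ref.startswith("//"):
        return "network-path"
    if relative_ref.startswith("/"):
        return "absolute-path"
    return "relative-path"


ref = after_authority("http://example.com/path?query#fragment")
print(ref, "->", reference_kind(ref))
```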
RFC 7230 says:
request-line = method SP request-target SP HTTP-version CRLF
I personally prefer to use the terms
origin for scheme and authority (where) and
resource for path, query string and fragment (what).
I'm not aware of any term for that portion of a URI.
RFC3986 says this
  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment
   |   _____________________|__
  / \ /                        \
  urn:example:animal:ferret:nose

What is the name of this URL part?

If I have a URL:
http://mysite.com/part1/page.aspx
What is the name of the part1 part in that URL?
Justin is correct - it's one part of the path. Adding a reference to the URI RFC, for the pretty picture contained therein, which illustrates (in section 3) each part of an example URI:
  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment
Further on, in the section devoted to the path:
A path consists of a sequence of path segments separated by a slash
("/") character. A path is always defined for a URI, though the
defined path may be empty (zero length). Use of the slash character
to indicate hierarchy is only required when a URI will be used as the
context for relative references. For example, the URI
mailto:fred@example.com has a path of "fred@example.com", whereas
the URI foo://info.example.com?fred has an empty path. (emphasis mine)
And so "path segment" might be the term you're looking for.
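If it helps to see the term in action, this Python sketch splits the path of the URL from the question into its segments. path_segments is a name invented for this example; it drops the empty string that a leading slash would otherwise produce.

```python
from urllib.parse import urlsplit


def path_segments(uri):
    """Return the path segments of a URI as a list of strings."""
    path = urlsplit(uri).path
    if not path:
        return []  # an empty path has no segments
    segments = path.split("/")
    # A leading "/" produces an empty first element; skip it.
    return segments[1:] if path.startswith("/") else segments


print(path_segments("http://mysite.com/part1/page.aspx"))
```

Here "part1" is the first path segment and "page.aspx" is the second.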
That would be the first part of the path. There's no term to describe that part alone.
Might it possibly be named "the protocol"?

How to be HTTP (cache) friendly on an index page that presents random content?

Let's say I have a quoting site, that accepts new quotes and presents a random one every time you visit it.
I would have resources like:
| URL                   | Method | What it does         |
|-----------------------|--------|----------------------|
| /                     | GET    | Shows a random quote |
| /quote                | POST   | Create a new quote   |
| /quote/slang-of-quote | GET    | Presents a quote     |
Resources could be presented as HTML, JSON, XML, image/png, etc.
Proper headers for cache control would be sent on relevant resources (probably just on /quote/another-quote URLs).
What is the best approach (for the index page to give a user a random quote) other than issuing a 307 to a real quote resource? Is that even nice to HTTP/REST?!
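For reference, the 307 approach mentioned in the question could be sketched like this in Python. This is only an illustration of the redirect itself, not a recommendation: random_quote_redirect and the slug list are hypothetical, and marking the redirect as "no-store" is an assumption about how one might keep the index from being cached while the quote pages stay cacheable.

```python
import random


def random_quote_redirect(slugs):
    """Build a 307 response that points the index page at a random quote."""
    slug = random.choice(slugs)
    headers = {
        "Location": "/quote/" + slug,
        # Assumption: the redirect itself should not be cached, so that
        # every visit to "/" can land on a different quote.
        "Cache-Control": "no-store",
    }
    return 307, headers


status, headers = random_quote_redirect(["slang-of-quote", "another-quote"])
print(status, headers["Location"])
```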

regex for URL starting with http/https or www

The user should be able to write it in any of the below formats
http://www.microsoft.com
or
https://www.microsoft.com
or
www.microsoft.com
Programming Language : C#
This should work for most regex processors:
/((?:https?\:\/\/|www\.)(?:[-a-z0-9]+\.)*[-a-z0-9]+.*)/i
What this matches:
Anything that starts with http://, https://, or www.
Followed by one or more valid domain characters (a-z, 0-9, or -)
Matches without case sensitivity (/i)
It does not enforce white space, so it will match this: blahwww.domain.com, and return www.domain.com. If you want to enforce a space, add \s to the beginning, but then you have to ensure that the string to match begins with a space.
The (?:) blocks are non-capturing groups. They prevent those specific groups of characters from being assigned a number. They can be replaced with capturing groups () if your regex processor has trouble. Group 1 is always the entire URL.
It's not terribly strict, but it matches all standard domain names (but might let slip through some invalid ones).
Also, next time, you might want to include the programming language or context, because regex processors vary greatly in feature support.
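The pattern above can be tried out quickly in Python, as a sketch; the same expression should carry over to C#'s Regex class with the IgnoreCase option, though that is not tested here. The slashes are left unescaped because Python patterns have no / delimiters, and find_url is a hypothetical helper name.

```python
import re

# Same pattern as above, compiled case-insensitively (the /i flag).
URL_PATTERN = re.compile(
    r"((?:https?://|www\.)(?:[-a-z0-9]+\.)*[-a-z0-9]+.*)",
    re.IGNORECASE,
)


def find_url(text):
    """Return the first URL-like substring (group 1), or None if none is found."""
    match = URL_PATTERN.search(text)
    return match.group(1) if match else None


for candidate in (
    "http://www.microsoft.com",
    "https://www.microsoft.com",
    "www.microsoft.com",
    "blahwww.domain.com",
    "microsoft.com",
):
    print(candidate, "->", find_url(candidate))
```

The last input has neither a scheme nor a www. prefix, so it is not matched.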
Without using http | https | ftp:
var url = /^((www|WWW)\.){1}?([a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(\.[a-zA-Z0-9]+)*)$/;
Using http | https | ftp:
var url = /^((http|ftp|https):\/\/|((www|WWW)\.)){1}?([a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(\.[a-zA-Z0-9]+)*)$/;

Force refresh of cached CSS data

Is it possible to force the browser to refresh the cached CSS?
This is not as simple as every request. We have a site that has had stable CSS for a while.
Now we need to make some major updates to the CSS; however, browsers that have cached the CSS will not receive the new CSS for a couple of days causing rendering issues.
Is there a way to force refresh of the CSS or are we better just opting for version specific CSS URLs?
TL;DR
Change the file name or query string
Use a change that only occurs once per release
File renaming is preferable to a query string change
Always set HTTP headers to maximize the benefits of caching
There are several things to consider and a variety of ways to approach this. First, the spec
What are we trying to accomplish?
Ideally, a modified resource will be unconditionally fetched the first time it is requested, and then retrieved from a local cache until it expires with no subsequent server interaction.
Observed Caching Behavior
Keeping track of the different permutations can be a bit confusing, so I created the following table. These observations were generated by making requests from Chrome against IIS and observing the response/behavior in the developer console.
In all cases, a new URL will result in HTTP 200. The important thing is what happens with subsequent requests.
+---------------------+--------------------+-------------------------+
| Type | Cache Headers | Observed Result |
+---------------------+--------------------+-------------------------+
| Static filename | Expiration +1 Year | Taken from cache |
| Static filename | Expire immediately | Never caches |
| Static filename | None | HTTP 304 (not modified) |
| | | |
| Static query string | Expiration +1 Year | HTTP 304 (not modified) |
| Static query string | Expire immediately | HTTP 304 (not modified) |
| Static query string | None | HTTP 304 (not modified) |
| | | |
| Random query string | Expiration +1 Year | Never caches |
| Random query string | Expire immediately | Never caches |
| Random query string | None | Never caches |
+---------------------+--------------------+-------------------------+
However, remember that browsers and web servers don't always behave the way we expect. A famous example: in 2012 mobile Safari began caching POST requests. Developers weren't pleased.
Query String
Examples in ASP.Net MVC Razor syntax, but applicable in nearly any server processing language.
...since some applications have traditionally used GETs and HEADs with
query URLs (those containing a "?" in the rel_path part) to perform
operations with significant side effects, caches MUST NOT treat
responses to such URIs as fresh unless the server provides an explicit
expiration time. This specifically means that responses from HTTP/1.0
servers for such URIs SHOULD NOT be taken from a cache.
Appending a random parameter to the end of the CSS URL included in your HTML will force a new request and the server should respond with HTTP 200 (not 304, even if it hasn't been modified).
<link href="normalfile.css?random=@Environment.TickCount" />
Of course, if we randomize the query string with every request, this will defeat caching entirely. This is rarely/never desirable for a production application.
If you are only maintaining a few URLs, you might manually modify them to contain a build number or a date:
@using System.Reflection
@{
    var assembly = Assembly.GetEntryAssembly();
    var name = assembly.GetName();
    var version = name.Version;
}
<link href="normalfile.css?build=@version.MinorRevision" />
This will cause a new request the first time the user agent encounters the URL, but subsequent requests will mostly return 304s. This still causes a request to be made, but at least the whole file isn't served.
Path Modification
A better solution is to create a new path. With a little effort, this process can be automated to rewrite the path with a version number (or some other consistent identifier).
This answer shows a few simple and elegant options for non-Microsoft platforms.
Microsoft developers can use a HTTP module which intercepts all requests for a given file type(s), or possibly leverage an MVC route/controller combo to serve up the correct file (I haven't seen this done, but I believe it is feasible).
Of course, the simplest (not necessarily the quickest or the best) method is to just rename the files in question with each release and reference the updated paths in the link tags.
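One way to automate the renaming is to derive the new file name from the file's content, so the path changes exactly when the CSS changes. The Python sketch below is an assumption about how that could look, not something the approaches above prescribe: it uses a truncated SHA-256 digest as the "consistent identifier", and fingerprinted_name is a made-up helper.

```python
import hashlib
from pathlib import PurePosixPath


def fingerprinted_name(filename, content):
    """Insert a short content hash before the extension: site.css -> site.<hash>.css."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


old_css = b"body { color: black; }"
new_css = b"body { color: navy; }"
print(fingerprinted_name("css/site.css", old_css))
print(fingerprinted_name("css/site.css", new_css))
```

The two printed paths differ, so a link tag pointing at the new one is fetched fresh, while unchanged files keep their old name and stay cached.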
I think renaming the CSS file is a far better idea. It might not suit all applications but it'll ensure the user only has to load the CSS file once. Adding a random string to the end will ensure they have to download it every time.
The same goes for the JavaScript method and the Apache methods above.
Sometimes the simple answer can be the most effective.
Another solution is:
<FilesMatch "\.(js|css)$">
Header set Cache-Control "max-age=86400, public"
</FilesMatch>
This limits the maximum cache age to 1 day or 86400 seconds.
Please go read Tim Medora's answer first, as this is a really knowledgeable and great effort post.
Now I'll tell you how I do it in PHP. I don't want to bother with the traditional versioning or trying to maintain 1000+ pages but I want to ensure that the user always gets the latest version of my CSS and caches that.
So I use the query string technique and PHP filemtime() which is going to return the last modified timestamp.
This function returns the time when the data blocks of a file were being written to, that is, the time when the content of the file was changed.
In my webapps I use a config.php file to store my settings, so in here I'll make a variable like this:
$siteCSS = "/css/standard.css?" .filemtime($_SERVER['DOCUMENT_ROOT']. "/css/standard.css");
and then in all of my pages I will reference my CSS like this:
<link rel="stylesheet" type="text/css" media="all" href="<?php echo $siteCSS?>" />
This has been working great for me so far on PHP/IIS.
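For anyone not on PHP, the same idea can be sketched in Python, with os.path.getmtime playing the role of filemtime(). The function name css_url_with_mtime and its two parameters are assumptions made for this example.

```python
import os


def css_url_with_mtime(document_root, url_path):
    """Append the file's last-modified timestamp to its URL as a query string."""
    file_path = os.path.join(document_root, url_path.lstrip("/"))
    modified = int(os.path.getmtime(file_path))  # seconds since the epoch
    return f"{url_path}?{modified}"
```

Calling css_url_with_mtime(document_root, "/css/standard.css") returns something like "/css/standard.css?<timestamp>", and the value only changes when the file is saved again.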
You might be able to do it in Apache...
<FilesMatch "\.(html|htm|js|css)$">
FileETag None
<IfModule mod_headers.c>
Header unset ETag
Header set Cache-Control "max-age=0, no-cache, no-store, must-revalidate"
Header set Pragma "no-cache"
Header set Expires "Wed, 11 Jan 1984 05:00:00 GMT"
</IfModule>
</FilesMatch>

Resources