Official name for URL "scheme plus authority" - http

Assume an absolute http or https URL. I'm looking for an "official" or generally accepted name for the part of the URL that comes before the path.
http://foo:bar@example.com:8042/over/there?name=ferret#nose
\_____________________________/
               |
           this part
RFC 3986 defines the URL syntax parts as follows:
http://foo:bar@example.com:8042/over/there?name=ferret#nose
\__/   \______________________/\_________/ \_________/ \__/
 |                |                 |           |        |
scheme        authority            path       query   fragment
RFC 6454 defines the origin (as in "same origin") of a URL as the triple (scheme, host, port):
http://foo:bar@example.com:8042/over/there?name=ferret#nose
\__/           \______________/
  \________________/
          |
        origin
As such, neither term is appropriate. Is there a good term for the part I'm looking at, or am I stuck with "scheme (plus ://) plus authority"?
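To make the difference between the two decompositions concrete, here is a minimal Python sketch using the standard library's urllib.parse. The function names scheme_and_authority and origin_triple are hypothetical, chosen only for illustration. The port is reported only when it is written in the URL; this sketch does not fill in a scheme's default port.

```python
from urllib.parse import urlsplit


def scheme_and_authority(url: str) -> str:
    """Return everything before the path: scheme, '://', and authority."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def origin_triple(url: str) -> tuple:
    """Return a (scheme, host, port) triple; userinfo is not part of it."""
    parts = urlsplit(url)
    # parts.port is None when the URL has no explicit port (assumption:
    # default ports are left unresolved in this sketch).
    return (parts.scheme, parts.hostname, parts.port)


url = "http://foo:bar@example.com:8042/over/there?name=ferret#nose"
print(scheme_and_authority(url))  # http://foo:bar@example.com:8042
print(origin_triple(url))         # ('http', 'example.com', 8042)
```

The first result still carries the userinfo (foo:bar@), which is exactly why it is not the same thing as the origin.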

The name in practice and per the current URL standard for the part of a URL that comes before the path is in fact just origin.
The :// part of a URL is just a syntactic (or lexical?) artifact that there’s never any real need to mention in discussions about actual behavior of anything that consumes or processes URLs (other than low-level parsers of course).
The username-password part is a non-conforming misfeature that’s now only useful to discuss as a historical error. The relevant part of the current URL standard has this to say about it:
There is no conforming way to express a username or password of a URL
record within a URL string.
So again in practice for any normal discussions of URLs that align with how the current standards define URLs, it’s sufficient to speak about a URL simply in terms of its highest-level parts being just four parts: its origin, its path, its query (part), and its fragment (part).
Certainly that is at least what the current URL standard itself limits it to.

It would have to just be "scheme plus authority". Bear in mind that you can't have a valid URI that just has a scheme plus authority, so the combination doesn't come up much as a unit to discuss, and so it never ended up with a name.
Note also that userinfo has never been allowed in HTTP URIs; particular schemes can prohibit or restrict the values of particular portions. Some browsers had a design flaw where they would accept userinfo and base authentication headers on it, but most now will at least warn about this being done, if they allow it at all.


Cognitive Services Translation and Profanity Filtering

Issue Description
I use the Cognitive Services TranslateArray endpoint to translate my users' comments. One of the advantages of this service is that we can use ProfanityAction to mark every profane word in the destination language. I also make use of the automatic language detection, so that I do not have to identify the content's language before sending it in.
When I get my translation back for a destination language that matches the source language, the profanity is not marked. Is there another endpoint I could or should hit, or a parameter I do not know about, or is this a possible improvement to the service?
Corresponding Documentation
Follow the Cognitive Services protocol to hit the TranslateArray endpoint, with an English sentence containing profanities, with the ProfanityAction: Marked behavior: http://docs.microsofttranslator.com/text-translate.html#!/default/post_TranslateArray
Reproduction Steps
Send an English sentence with profanities
Translate to fr, notice correctly marked profanities
Translate to en, notice the missing profanity tags
Expected Behavior
Profanities should be marked even if no translation occurred.
Actual Results
I obtained the unmodified sentence back.
There is nothing in the documentation that specifies what happens if the source and target language are the same. My guess is that if it sees that they match it will do nothing.
However, there is a specific API that detects profanity for any given language: Content Moderation for Text. The API docs are here.
The Text - Screen function does it all – scans the incoming text (maximum 1024 characters) for profanity, autocorrects text, and extracts Personally Identifiable Information (PII), all while matching against custom lists of terms.
Your observation that the Translator API does nothing if the source and target languages are the same is correct. Not an answer, just a clarification.

What is 'uniform' about URI, URL and URN [Uniform Resource Identifier, Uniform Resource Location, Uniform Resource Name]?

I have read about the differences of the URI, URN and URL here and here but the answers talk of the differences of the last letter, that is, the differences amongst identifier, name and location respectively.
What I have not understood is why all these terms have the word 'uniform' and what is uniform about them. This Wikipedia section doesn't mention much about the reason why the change was made from 'universal' to 'uniform'.
I would like to find the missing explanation and not just memorize the terms as they are without fully understanding them.
Based on Tim Berners-Lee’s own account, as published in his book Weaving the Web:
At an IETF meeting, Tim Berners-Lee tried to form a working group that would create an Internet standard for what he suggested to be named universal document identifiers.
About the meeting, in his words (page 61):
[…] there was a strong reaction against the "arrogance" of calling something a universal document identifier. How could I be so presumptuous as to define my creation as "universal"? If I wanted the UDI addresses to be standardized, then the name "uniform document identifiers" would certainly suffice.
While he didn’t agree (it was an issue of whether the Web could be something "universal"), there wasn’t much time and so (page 62):
I was willing to compromise so I could get to the technical details. So universal became uniform, and document became resource.
They formed "a uniform resource identifier working group". (And this group then decided that "identifier" wasn’t a good label, and they chose "locator" instead, forming "URL" – which he also didn’t agree with.)
The current URI Internet standard (RFC 3986) describes the meaning of "Uniform", "Resource" and "Identifier" in section 1.1.

UTF8 component in URL, should it be case sensitive?

I understand that URLs should be case-sensitive, e.g.
http://www.example.com/test.php
http://www.example.com/TEST.php
should be two different things.
But should the percent-encoded UTF-8 part also be case-sensitive, e.g.
http://zh.wikipedia.org/wiki/%E8%A7%82%E6%B5%8B%E5%A4%A9%E6%96%87%E5%AD%A6
vs
http://zh.wikipedia.org/wiki/%e8%a7%82%e6%b5%8b%e5%a4%a9%e6%96%87%e5%ad%a6
Should they be equal?
The reason I ask is that Googlebot keeps using the uppercase variant of the URL, although my site uses lowercase URLs throughout.
I can't speak with 100% authority on this question, but if you stop to think about how the URL would be stored in a search index, or table of urls, or any of the myriad data stores Google uses, I can't imagine that the URLs would not be normalized in some fashion.
Any kind of normalization should decode the URL into a string, so there should be no difference. I would be surprised if Google stored URLs with the % encodings. They can store text in UTF-8; the percent signs are there to make things visible to humans.
Google's "use" of the uppercase variant is simply for display and reporting purposes, I would guess. I do not think the URL encodings are stored at all.
Since they are supposed to be pairs of hexadecimal characters, the lowercase and uppercase variants of encoded characters should be considered equivalent (e.g. 0xab and 0xAB are the same value).
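As a small illustration of that equivalence, the following Python sketch decodes both variants from the question and also normalizes the hex digits to uppercase, which is the form RFC 3986 recommends when normalizing percent-encodings. The helper name normalize_percent_case is hypothetical, and the sketch says nothing about what Googlebot actually does internally.

```python
import re
from urllib.parse import unquote

# Matches one percent-encoded octet such as %e8 or %E8.
_PERCENT_OCTET = re.compile(r"%[0-9A-Fa-f]{2}")


def normalize_percent_case(url: str) -> str:
    """Uppercase the hex digits of every percent-encoded octet.

    The rest of the URL is left untouched, so test.php and TEST.php
    remain two different paths.
    """
    return _PERCENT_OCTET.sub(lambda match: match.group(0).upper(), url)


upper = "http://zh.wikipedia.org/wiki/%E8%A7%82%E6%B5%8B%E5%A4%A9%E6%96%87%E5%AD%A6"
lower = "http://zh.wikipedia.org/wiki/%e8%a7%82%e6%b5%8b%e5%a4%a9%e6%96%87%e5%ad%a6"

print(unquote(upper) == unquote(lower))        # True: same decoded string
print(normalize_percent_case(lower) == upper)  # True: same normalized URL
```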
When it comes to Googlebot, it's hard to predict where it's getting its information from. Even if you only link to the page with the lowercase URL and it's in your XML sitemaps in lowercase, there could be someone out there linking to your site with the uppercase variant.
You are correct that Google treats URLs as case sensitive, which is why they support the rel=canonical specification. If you're using rel=canonical correctly, I wouldn't worry that Googlebot is accessing the URLs in uppercase. When they process the page for indexing, it will consolidate all the page "value" to the canonical URL.
If you want to be even more explicit about it, you can 301 redirect all uppercase requests to the lowercase version, so Googlebot will follow the 301s to the lowercase version.
Also note, even if you have a 301 and rel=canonical ... you'll see that Googlebot will continue to crawl URLs with all caps. This will happen even if these URLs 404 or 410. Basically Googlebot never forgets a URL, and from time to time it'll try old URLs it knows existed at one point, or have links still pointing to it ... even if they're years old and long gone.
URLs that differ only in the case of their percent-encoded parts should be normalized into the same URL. This Wikipedia page should give you all the answers ;)
http://en.wikipedia.org/wiki/URL_normalization

URL hash is persisting between redirects

For some reason, non-IE browsers seem to persist a URL hash (if present) when a server-side redirect is sent (using the Location header). Example:
// a simple redirect using Response.Redirect("http://www.yahoo.com");
Test.aspx
If I visit:
Test.aspx#foo
In Firefox/Chrome, I'm taken to:
http://www.yahoo.com#foo
Can anyone explain why this happens? I've tried this with various server side redirects in different platforms as well (all resulting in the Location header, though) and this always seems to happen. I don't see it anywhere in the HTTP spec, but it really seems to be a problem with the browsers themselves. The URL hash (as expected) is never sent to the server, so the server redirect isn't polluted by it, the browsers are just persisting it for some reason.
Any ideas?
I suggest that this is the correct behaviour. The 302 and 307 status codes indicate that the resource is to be found elsewhere. #bookmark is a location within the resource.
Once the resource (html document) has been located it is for the browser to locate the #bookmark within the document.
The analogy is this: You want to look something up in a book in chapter 57, so you go to the library to get the book. But there is a note on the shelf saying the book has moved, it is now in the other building. So you go to the new location. You still want chapter 57 - it is irrelevant where you got the book.
This is an aspect that was not covered by previous HTTP specifications but has been addressed in the later HTTP development:
If the server returns a response code of 300 ("multiple choice"), 301
("moved permanently"), 302 ("moved temporarily") or 303 ("see
other"), and if the server also returns one or more URIs where the
resource can be found, then the client SHOULD treat the new URIs as
if the fragment identifier of the original URI was added at the end.
The exception is when a returned URI already has a fragment
identifier. In that case the original fragment identifier MUST NOT be
not added to it.
So the fragment of the original URI should also be used for the redirection URI unless it also contains a fragment.
Although this was just a draft that expired in 2000, the behavior described above seems to be the de facto standard among today's web browsers.
@Julian Reschke or @Mark Nottingham probably know more/better about this.
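The rule from the quoted draft can be sketched in a few lines of Python. This is only an illustration of the rule as quoted (carry the original fragment over unless the redirect target has its own), not a description of any particular browser's implementation; the function name redirect_target is hypothetical.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def redirect_target(original_url: str, location: str) -> str:
    """Compute where a client ends up after following a Location header."""
    # Resolve a possibly relative Location value against the original URL.
    target = urlsplit(urljoin(original_url, location))
    original_fragment = urlsplit(original_url).fragment
    # Inherit the original fragment only if the target has none of its own.
    if not target.fragment and original_fragment:
        target = target._replace(fragment=original_fragment)
    return urlunsplit(target)


print(redirect_target("http://example.com/Test.aspx#foo", "http://www.yahoo.com"))
# http://www.yahoo.com#foo
print(redirect_target("http://example.com/a#foo", "/b#bar"))
# http://example.com/b#bar
```

The first call reproduces the behaviour described in the question; the second shows the exception for a Location value that already carries a fragment.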
From what I have found, it doesn't seem clear what the exact behaviour should be. There are plenty of people having problems with this; some of them want to keep the bookmark through the redirect, and some want to get rid of it.
Different browsers handle this differently, so in practice it's not useful to rely on either behaviour.
It definitely is a browser issue. The browser never sends the bookmark part of the URL to the server, so there is nothing that the server could do to find out if there is a bookmark or not, and nothing that could be done about it reliably.
When I put the full URL in the action attribute of the form, the browser keeps the hash. But when I use just the query string, it drops the hash. E.g.,
Keeps the hash:
https://example.com/edit#alrighty
<form action="https://example.com/edit?ok=yes">
Drops the hash:
https://example.com/edit
<form action="?ok=yes">

Advice needed on REST URL to be given to 3rd parties to access my site

Important: This question isn't really an ASP.NET question. Anyone who knows anything about URLs can answer it. I just happen to be using ASP.NET routing, so I included that detail.
In a nutshell my question is:
"What future-proof URL format should I design that I can give to external parties to get to a specific place on my site? [I'm new to creating these 'REST' URLs]."
I need an ASP.NET routing URL that will be given to a third party for tracking marketing campaigns. It is essentially a 'gateway' URL that redirects the user to a specific page on our site which may be the homepage, a special contest or a particular product.
In addition to trying to capture the referrer I will need to receive a partnerId, a campaign number and possibly other parameters. I want to provide a route to do this, BUT I want to get it right the first time because obviously I can't easily change it once it's being used externally.
How does something like this look?
routes.MapRoute(
    "3rd-party-campaign-route",
    "campaign/{destination}/{partnerid}/{campaignid}/{custom}",
    new
    {
        controller = "Campaign",
        action = "Redirect",
        custom = (string)null // optional so we need to set it null
    }
);
campaign : I possibly don't want the word 'campaign' in the actual link, since users will see it in the URL bar. I might change this to something cryptic like 'c'.
destination : dictates which page on our site the link will take the user to. For instance, PR to direct the user to the products page.
partnerid : the ID that we've assigned to the company, such as SO for Stack Overflow.
campaignid : campaign ID such as 123, unique to each partner. I think I'd prefer for the third-party company to be able to manage the campaign IDs themselves rather than us providing a website to 'create a campaign'. I'm not completely sure about this yet, though.
custom : custom data (optional). I can add further custom data parameters without breaking existing URLs.
Note: the reason I have 'destination' is that the campaign ID is decided upon by the client, so they also need to tell us where the destination of that campaign is. Alternatively they could 'register' a campaign with us. This may be a better solution to avoid people putting in random campaign IDs, but I'm not overly concerned about that and I think this system gives more flexibility.
In addition we perhaps want to know which image they used to link to us (so we can track which banner works the best). I THINK this is a candidate for a new campaignid as opposed to a custom data field, but I'm not sure.
Currently I am using a very primitive URL such as http://example.com?cid=123. In this case the campaign ID needs to be issued to the third party and it just isn't a very flexible system. I want to move immediately to a new system for new clients.
Any thoughts on future-proofing this system? What may I have missed? I know I can always add new formats, but I want to use this format as much as possible if that is a good idea.
This URL:
"campaign/{destination}/{partnerid}/{campaignid}/{custom}",
...doesn't look like a resource to me, it looks like a remote method call. There is a lot of business logic here which is likely to change in the future. Also, it's complicated. My gut instinct when designing URLs is that simpler is generally better. This goes double when you are handing the URL to an external partner.
Uniform Resource Locators are supposed to specify, well, resources. The destination is certainly a resource (but more on this in a moment), and I think you could consider the campaign a resource. The partner is not a resource you serve. Custom is certainly not a resource, as it's entirely undefined.
I hear what you're saying about not wanting to have to tell the partners to "create a campaign," but consider that you're likely to eventually have to go down this road anyway. As soon as the campaign has any properties other than the partner identifier, you pretty much have to do this.
So my first two conclusions are that you should probably get rid of the partner ID and derive it from the campaign. Get rid of custom, too, and use query string parameters instead, should it be necessary. It is appropriate to use query string parameters to specify how to return a resource (as opposed to the identity of the resource).
Removing those yields:
"campaign/{destination}/{campaignid}",
OK, that's simpler, but it still doesn't look right. What's destination doing in between campaign and campaign ID? One approach would be to rearrange things:
"campaign/{campaignid}/{destination}",
Another would be to use Astoria-style indexing:
"campaign({campaignid})/{destination}",
For some reason, this looks odd to a lot of people, but it's entirely legal. Feel free to use other legal characters to separate campaign from the ID; the point here is that a / is not the only choice, and may not be the appropriate choice.
However...
One question we haven't covered yet is what should happen if/when the user submits a valid destination, but an invalid campaign or partner ID. If the correct response is that the user should see an error, then all of the above is still valid. If, on the other hand, the correct response is that the user should be silently taken to the destination page anyway, then the campaign ID is really a query string parameter, not a part of the resource. Perhaps some partners wouldn't like being given a URL with a question mark in it, but from a purely REST point of view, I think that's the right approach, if the campaign ID's validity does not determine where the user ends up. In this case, the URL would be:
"campaign/{destination}",
...and you would add a query string parameter with the campaign ID.
I realize that I haven't given you a definite answer to your question. The trouble is that most of this rests on business considerations which you are probably aware of, but I'm certainly not. So I'm more trying to cover the philosophy of a REST-ful URL, rather than attempting to explain your business to you. :)
I think the URL rewriting is getting out of hand a little bit lately. Not everything belongs to the URL. After all, a URL is supposed to describe a resource that can be searched for, discovered or manipulated and it seems to me that at least the partner ID and the custom fields from above are not part of the resource.
Not to mention that at some point you would like to keep the partner ID constant across multiple campaigns, which means that it is orthogonal to the particular places they need to visit. If you keep these as parameters, you will allow your partners to uniformly access multiple resources on your website while still reliably identifying themselves, so you can track their participation in any of your campaigns.
It looks like you've covered all of your bases. The only suggestion I have is to change
{custom}
to
{*custom}
That way, if you ever need to accept further parameters, you don't have to take the chance that old URLs will get a 404. For example:
If you have a URL that looks like:
campaign/PR/SO/123
and you decide in the future that you would like to accept a fourth and fifth parameter:
campaign/PR/SO/123/blah/foo
then the first URL will still be valid, because you're using a wildcard character in {*custom}. "blah/foo" would be passed as a string to your action. To get those extra two parameters, you would simply split the custom argument in your action by '/'. Add some friendly error handling if they don't exist and you've successfully changed the amount of information you can receive with a campaign URL without completely breaking URLs already in the wild.
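The splitting step can be illustrated outside ASP.NET as well. The following is a minimal Python sketch of the idea, not the routing engine's behaviour: the function parse_campaign_path and the dictionary keys are hypothetical names, and the error handling is reduced to a single ValueError.

```python
def parse_campaign_path(path: str) -> dict:
    """Split a campaign path into its fixed parts and a catch-all tail."""
    segments = [segment for segment in path.strip("/").split("/") if segment]
    if len(segments) < 4 or segments[0] != "campaign":
        raise ValueError(f"not a campaign path: {path!r}")
    # Everything after the third parameter lands in the catch-all list,
    # mirroring what splitting {*custom} by '/' would give you.
    _, destination, partner_id, campaign_id, *custom = segments
    return {
        "destination": destination,
        "partnerid": partner_id,
        "campaignid": campaign_id,
        "custom": custom,
    }


print(parse_campaign_path("campaign/PR/SO/123")["custom"])           # []
print(parse_campaign_path("campaign/PR/SO/123/blah/foo")["custom"])  # ['blah', 'foo']
```

Old three-parameter URLs keep working because the tail is simply empty for them.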
Why not use URL encoded variables instead of routes? They're a lot more flexible - you can add any new features in the future while still maintaining 100% backwards compatibility. Admittedly, it's a little more trouble to type manually, but if there's all those parameters anyway, it's already no picnic.
http://mysite.com/page?campaign=1&dest=products&pid=15&cid=25
To me, this is much more indicative of what is really going on. Using paths implies that a resource exists at that location. But really you're just providing a web service with various parameters, and this model captures that much more clearly. And in the future, you can add more parameters effortlessly. You can also default parameters if they are missing without messing anything up.
Not sure of the code in ASP, but it should be trivial to implement.
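For illustration, here is roughly what reading such parameters with defaults looks like, sketched in Python rather than ASP.NET. The default values and the function name read_campaign_params are assumptions made up for this example; only the parameter names dest, pid and cid come from the URL above.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical defaults used when a partner omits a parameter.
DEFAULTS = {"dest": "home", "pid": None, "cid": None}


def read_campaign_params(url: str) -> dict:
    """Read known query parameters, falling back to defaults."""
    query = parse_qs(urlsplit(url).query)
    params = dict(DEFAULTS)
    for name in DEFAULTS:
        if name in query:
            # parse_qs returns a list per name; take the first value.
            params[name] = query[name][0]
    return params


print(read_campaign_params("http://mysite.com/page?campaign=1&dest=products&pid=15&cid=25"))
# {'dest': 'products', 'pid': '15', 'cid': '25'}
```

Unknown parameters are ignored and missing ones are defaulted, so new parameters can be introduced later without invalidating links that are already in circulation.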
I think I'd look at doing it the way that SO does its questions.
"campaign/{campaign-id}/friendly-name-of-campaign"
Create a mapping in your database when the campaign is created that associates all the data you need with an automatically generated id. The friendly name could be assigned basically the same way as a question is on SO -- by the user -- but you could also have an approval process that makes sure that it meets your requirements and is distinct from any existing campaign names. Your tracking company can track by the id and you can correlate that with your associated data with a simple look up.
What you have looks good for your needs. The other posts here make good points, but they may not be suitable for you. One thing that you could consider for future-proofing your links is to put a version number somewhere in there.
"campaign/{version}/{destination}/{partnerid}/{campaignid}/{custom}"
This way if you decide to completely change your format you can up the version to 2.0 (or whatever) and still keep track of the old links coming in.
I would do
/c/{destination}/{partnerid}/{campaignid}/?customvar=s
You should think about the hierarchy of the first parameters; you already have that managed quite well. Path segments should be used only where there is a hierarchy.
From your description, destination seems to be the broadest parameter, partnerid only works with destination, and campaignid is specific to a partner.
When you really need to add custom parameters I would go for query variables (they are not forbidden in REST), because these are not part of the hierarchy.
You also shouldn't try to be too RESTful here. After all, it's for a campaign and for redirecting to a final resource. So the URL you want to design here is not really a specific resource in the terms of REST.
Create a URL called http://mysite.com/gateway
Return an HTML form, tell your partners to fill in the form and POST it. Redirect based on the form values.
You could easily provide your partners with the javascript to do the GET and POST. Should be trivial.
The most important thing I have learned about REST URLs, which is usually buried deep in some book or article:
The URL should point to a resource, and the following ?querystring should have all the scoping information needed. DON'T mix those two or you will have a design that's very hard to work with.
Other than that, I fully agree with Craig Stuntz.
