Schemeless URLs and misbehaving crawlers

I am using schemeless URLs to load a few external libraries:
//ajax.googleapis.com/ajax/libs/jquery.....
The problem I am facing is that a few crawlers treat them as relative URLs:
www.mydomain.com//ajax.googleapis.com/ajax/libs/jquery.....
How can I handle such links for crawlers?
I am using an Nginx server, but I am fairly new to Nginx.
Is some kind of rewrite possible?

Your URL is actually valid; it's the crawler's fault for not handling this case while crawling. I would just ignore it.
The 404 response from your server is also valid, because the crawler is requesting www.example.com//ajax.googleapis.com/.. which really doesn't exist.
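To see why the crawler is at fault, here is a minimal Python sketch of how a scheme-relative reference is meant to resolve against the page URL. The page URL and the library file name (jquery.min.js) are hypothetical stand-ins, since the real link is elided in the question, and naive_join only imitates what the misbehaving crawler appears to be doing.

```python
from urllib.parse import urljoin

# Hypothetical page and script reference; the real path is elided in the question.
PAGE_URL = "http://www.mydomain.com/index.html"
SCRIPT_REF = "//ajax.googleapis.com/ajax/libs/jquery.min.js"


def resolve(base: str, ref: str) -> str:
    """Resolve a reference against its base URL using the standard rules."""
    return urljoin(base, ref)


def naive_join(base_host: str, ref: str) -> str:
    """Imitate a crawler that treats every reference as a path on the same host."""
    return base_host + ref


correct = resolve(PAGE_URL, SCRIPT_REF)
broken = naive_join("www.mydomain.com", SCRIPT_REF)
print(correct)  # the scheme is inherited from the page, the host from the reference
print(broken)   # the bogus URL that ends up as a 404 in the server logs
```

A correct client keeps only the scheme of the page and takes the host from the reference, so nothing needs to change on the server side.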

Related

IIS to handle request transfer to other server (without url rewrite)

I want my website to serve static content from a different server and, if possible, JavaScript files from another server as well. I know it can be achieved using URL rewrites, but that gives 301 responses (i.e. an extra round trip to the browser). I asked a similar question here but did not get a satisfactory response, as it seems my question was not correct.
Can someone suggest how to achieve this without the extra round trip to the browser? I have seen other teams achieve this with Apache, but I have not been able to find out how to do it in IIS.

Have you tried "Rewrite" instead of "Redirect"? These are two distinct options when rewriting: Redirect sends a 301/302 to the browser, while Rewrite operates silently on the server.
From your other question, it seems that you tried "Redirect", which explains the 301. Change Redirect to Rewrite.
The only important thing is that if you want your rewrites to work between different servers, you will need ARR (Application Request Routing) installed on a front-end server. URL Rewrite on its own handles only local rewrites, whereas ARR can act as a transparent proxy to other servers.
ARR is a free add-on to IIS:
http://www.iis.net/downloads/microsoft/application-request-routing

Asp.Net SecuritySwitch & SSL Load balancer

I am trying to use SecuritySwitch (http://code.google.com/p/securityswitch/) on an ASP.NET website.
On secure pages I get the error: This web page has a redirect loop
I assume that SecuritySwitch is trying to detect whether the request is https and redirecting if it is not. My host uses some kind of SSL load balancer, which uses the HTTP header HTTP_X_FORWARDED_PROTO to indicate whether the request is https, so I think SecuritySwitch is not recognising the request as secure and is trying to redirect again.
Can anyone think of a way to solve this?

There was no solution to this at the time, but the good people on the project released a new version within 2 days of a request being made, making it possible.
SecuritySwitch (http://code.google.com/p/securityswitch/)
Highly recommended!
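For anyone wondering why the loop happens, here is a rough Python sketch of the logic. It is not SecuritySwitch's actual implementation; the WSGI-style environ dictionary and the trust_proxy_header flag are assumptions made purely for illustration. The balancer terminates SSL and talks plain http to the application, so a check that looks only at the connection itself never sees https.

```python
def is_secure(environ: dict, trust_proxy_header: bool) -> bool:
    """Return True if the original client request used HTTPS."""
    if environ.get("wsgi.url_scheme") == "https":
        return True
    if trust_proxy_header:
        # Header set by the load balancer (assumed to be trustworthy here).
        return environ.get("HTTP_X_FORWARDED_PROTO", "").lower() == "https"
    return False


def needs_https_redirect(environ: dict, trust_proxy_header: bool) -> bool:
    """A secure page redirects to https whenever the request looks insecure."""
    return not is_secure(environ, trust_proxy_header)


# What the application sees when the balancer has already handled SSL.
behind_balancer = {"wsgi.url_scheme": "http", "HTTP_X_FORWARDED_PROTO": "https"}

loops = needs_https_redirect(behind_balancer, trust_proxy_header=False)
fixed = needs_https_redirect(behind_balancer, trust_proxy_header=True)
print(loops, fixed)
```

With the header ignored, every request to a secure page is redirected again, which is the loop the browser reports. Only trust the header when the application can be reached solely through the balancer.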

How does URL rewriting work?

How does a web server implement the URL rewrite mechanism and change the address bar of browsers?
I'm not asking for specific information on configuring Apache, nginx, lighttpd or others!
I would like to know what kind of information is sent to clients when a server wants to rewrite a URL.

There are two types of behaviour.
One is rewrite, the other is redirect.
Rewrite
The server performs the substitution itself, so that a URL like http://example.org/my/beautiful/page is understood as http://example.org/index.php?page=my-beautiful-page
With a rewrite, the client does not see anything; the change is internal only. The URL in the browser does not change; the server just interprets it differently.
Redirect
The server detects that the requested address is not the one it wants to serve. For example, http://example.org/page1 has moved to http://example.org/page2, so it tells the browser with an HTTP 3xx code what the new address is. The client then requests that page instead. Therefore the address in the browser changes!
Process
The process remains the same in both cases. [diagram not available]
Remark: every rewrite/redirect triggers a new pass through the rewrite rules (with exceptions, IIRC), so
RewriteCond %{REDIRECT_URL} !^$
RewriteRule .* - [L]
can be useful to stop loops, since it performs no further rewrite once one has already happened.
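The difference between the two behaviours can be sketched in a few lines of Python. This is a toy dispatcher, not how any particular server implements it; the RULES table and the handle function are made up for illustration, using the example URLs from above.

```python
import re

# Hypothetical rule table: (pattern, target, kind).
RULES = [
    (re.compile(r"^/my/beautiful/page$"), "/index.php?page=my-beautiful-page", "rewrite"),
    (re.compile(r"^/page1$"), "/page2", "redirect"),
]


def handle(path: str):
    """Return (status, headers, body) for a requested path."""
    for pattern, target, kind in RULES:
        if pattern.match(path):
            if kind == "redirect":
                # The client is told to ask again; its address bar changes.
                return 302, {"Location": "http://example.org" + target}, ""
            # Internal rewrite: the client never learns about the target.
            return 200, {}, "content served from " + target
    return 200, {}, "content served from " + path


print(handle("/my/beautiful/page"))
print(handle("/page1"))
```

Only the redirect case puts anything about the new URL on the wire; the rewrite case looks like an ordinary 200 response to the client.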

Are you talking about server-side rewrites (like Apache mod_rewrite)? For those, the address bar does not generally change (unless a redirection is performed).
Or are you talking about redirections? These are done by having the server respond with an HTTP status code (301, 302 or 307) and the new location in the Location HTTP header.

There are two forms of "URL rewrite": those done purely within the server and those that are redirections.
If it's purely within the server, it's an internal matter and only matters with respect to the dispatch mechanism implemented in the server. In Apache HTTPD, mod_rewrite can do this, for example.
If it's a redirection, a status code implying a redirection is sent in the response, along with a Location header indicating to which URL the browser should be redirected (this should be an absolute URL). mod_rewrite can also do this, with the [R] flag.
The status code is usually 302 (Found), but it could be configured for other codes (e.g. 301 or 307).
Another quite common use (often unnoticed because it's usually on by default in Apache HTTPD) is the redirection to the URL with a trailing slash on a directory. This is implemented by mod_dir:
A "trailing slash" redirect is issued when the server receives a request for a URL http://servername/foo/dirname where dirname is a directory. Directories require a trailing slash, so mod_dir issues a redirect to http://servername/foo/dirname/.
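As a rough illustration of that trailing-slash behaviour (not mod_dir's real code), here is a small Python sketch. The DIRECTORIES set stands in for a check against the file system, and the 301 status is my understanding of what Apache sends, so treat both as assumptions.

```python
from typing import Optional, Tuple

# Hypothetical set of paths that are directories on the server.
DIRECTORIES = {"/foo/dirname"}


def trailing_slash_redirect(host: str, path: str) -> Optional[Tuple[int, str]]:
    """Return (status, location) if the path is a directory lacking its slash."""
    if not path.endswith("/") and path in DIRECTORIES:
        return 301, "http://" + host + path + "/"
    return None  # no redirect needed; serve the request normally


print(trailing_slash_redirect("servername", "/foo/dirname"))
print(trailing_slash_redirect("servername", "/foo/dirname/"))
```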

Jeff Atwood had a great post about this: http://www.codinghorror.com/blog/2007/02/url-rewriting-to-prevent-duplicate-urls.html
"How does a web server implement the URL rewrite mechanism and change the address bar of browsers?"
URL rewriting and forwarding are two completely different things. A server has no control over your browser so it can't change the URL of your browser, but it can ask your browser to go to a different URL. When your browser gets a response from a server it's entirely up to your browser to determine what to do with that response: it can follow the redirect, ignore it or be really mean and spam the server until the server gives up. There is no "mechanism" that the server uses to change the address, it's simply a protocol (HTTP 1.1) that the server abides by when a particular resource has been moved to a different location, thus the 3xx responses.

URL rewriting can transform URLs purely on the server side. This gives web application developers the ability to make web resources accessible from multiple URLs.
For example, the user might request http://www.example.com/product/123 but, thanks to rewriting, is actually served a resource from http://www.example.com/product?id=123. Note that there is no need for the address displayed in the browser to change.
The address can be changed if so desired. For this, a mapping similar to the one above happens on the server, but rather than rendering the resource back to the client, the server sends a redirect (301 or 302 HTTP code) back to the client for the rewritten URL.
For the example above this might look like:
Client request
GET /product/123 HTTP/1.1
Host: www.example.com
Server response
HTTP/1.1 302 Found
Location: http://www.example.com/product?id=123
At this point, the browser will issue a new GET request for the URL in the Location header.
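What the browser does with that response can be sketched in Python as follows. This is a simplified model, not a real HTTP client: next_request is a hypothetical helper that parses only the status line and headers of the example response and builds the follow-up request text.

```python
from typing import Optional
from urllib.parse import urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)

RESPONSE = (
    "HTTP/1.1 302 Found\r\n"
    "Location: http://www.example.com/product?id=123\r\n"
    "\r\n"
)


def next_request(raw_response: str) -> Optional[str]:
    """Build the follow-up GET for a redirect response, or None if not a redirect."""
    head = raw_response.split("\r\n\r\n", 1)[0]
    status_line, *header_lines = head.split("\r\n")
    status = int(status_line.split()[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    if status in REDIRECT_CODES and "location" in headers:
        target = urlsplit(headers["location"])
        path = target.path + ("?" + target.query if target.query else "")
        return "GET " + path + " HTTP/1.1\r\nHost: " + target.netloc + "\r\n\r\n"
    return None


print(next_request(RESPONSE))
```

The server only supplies the status code and the Location header; composing and sending the second request is entirely the client's job.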

Is it possible to redirect non-HTML files with HTTP? And chaining redirects?

I have been thinking about a neat way of load balancing, and one thing that would be required is being able to load an image on an HTML page from multiple locations without rewriting the URL (on each load).
So what I need is one URL which is the "static" URL, such as http://example.com/myimage.png. The image is not actually hosted on example.com, though, so example.com sends a 302, 301 or 307 HTTP response to cause a redirect to 2.example.com. How do browsers handle this with images in a situation like this? Also, how do browsers handle multiple redirections, for instance if 2.example.com also didn't have the image and redirected to 3.example.com? (Note, I am asking this because I've never seen a 301 redirect on anything but an HTML page.)
Also, which status code would be best to use? 301 means "moved permanently", and this "move" isn't permanent, so I don't want it cached. Should I use 307? Is that supported by search engines and modern browsers?

Redirect is an HTTP concept and applies to any resource that can be delivered over HTTP, not only HTML. Chained redirects and non-HTML redirects work just fine in most modern browsers.
If you want a temporary redirect, use 302, unless you want to redirect POSTs and PUTs as well. The problem is that most implementations will issue a GET for the new resource address after a POST or PUT that got a 302.
Note that 303 and 307 are specific to HTTP/1.1.
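The method-changing behaviour can be summarised in a small Python function. This is a sketch modelled on how current browsers are generally described as behaving, as far as I know; clients differ, particularly for methods other than POST, so treat the exact rules as an assumption rather than a guarantee.

```python
def method_after_redirect(status: int, method: str) -> str:
    """Return the HTTP method a typical browser uses for the follow-up request."""
    method = method.upper()
    if status == 303 and method not in ("GET", "HEAD"):
        return "GET"  # 303 means: fetch the other resource with GET
    if status in (301, 302) and method == "POST":
        return "GET"  # historical behaviour: POST is downgraded to GET
    return method     # 307 and 308 keep the original method (and body)


for code in (301, 302, 303, 307):
    print(code, method_after_redirect(code, "POST"))
```

So if the original method must survive the redirect, 307 is the safer choice of temporary redirect.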

I would advise against load balancing like this. Load balancing is not what 3xx responses are intended for.
The HTTP protocol has caching capabilities which can help reduce server load. There are also server technologies for load balancing. These technologies are well developed and will be more stable and reusable.

As Benedict C says, I think you're barking up the wrong tree.
If you want to do load balancing, then do load balancing. Round-robin DNS is the simplest method (and is more effective in a number of regards than more expensive solutions). If you must try to load balance across servers with different FQDNs, then generate the URL client-side in JavaScript.
The remnants of your post are applicable to other questions about redirection. There's a lot of bad advice published about SEO. Google has approx 92% of the world market and publishes quite detailed specs about how it spiders and ranks sites. Redirection within your domain should not affect your rankings in any competent search engine. Redirection outside your domain will only improve the ranking of the target.
Yes, browsers implement a limit on the number of redirections followed for a single request - but it varies by browser.
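To illustrate the chained case from the question together with the redirect limit, here is a small Python simulation. The REDIRECTS table and the max_hops value are invented for the example; real browsers each pick their own limit.

```python
from typing import List

# Hypothetical redirect table mirroring the example in the question.
REDIRECTS = {
    "http://example.com/myimage.png": "http://2.example.com/myimage.png",
    "http://2.example.com/myimage.png": "http://3.example.com/myimage.png",
}


class TooManyRedirects(Exception):
    """Raised when a chain is longer than the client is willing to follow."""


def follow(url: str, max_hops: int = 5) -> List[str]:
    """Follow redirects from url and return every URL visited, in order."""
    visited = [url]
    while url in REDIRECTS:
        if len(visited) > max_hops:
            raise TooManyRedirects(url)
        url = REDIRECTS[url]
        visited.append(url)
    return visited


print(follow("http://example.com/myimage.png"))
```

Each hop costs the client an extra round trip, which is one more reason the chain is a poor substitute for real load balancing.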

URL redirects; for general purpose use, which is better: server-side or client-side?

Take a very simple case as an example, say I have this URL:
http://www.example.com/65167.html
and I wish to serve that content under:
http://www.example.com/about
UPDATE: Note that the 'bad' URL is the canonical one (it's produced by a CMS which uses it internally for linking), so "/about" is just a way of polishing it.
I have two broad options: a server-side redirect or a client-side one. I always thought that server-side would be preferable since it's more efficient, i.e. HTTP traffic is approximately halved. However, SEO techniques tend to favour a single URL for a resource, thus client-side is to be preferred.
How do you resolve this conflict, and are there other factors I've omitted?

Apache HTTPD's mod_rewrite can leave a browser showing an SEO-friendly URL in its location bar while internally rewriting to the numeric URL on the server:
RewriteEngine on
RewriteRule ^/about$ /65167.html [L]

A 301 is the wrong approach for this problem if you're redirecting from /about to /65167.html. Your CMS will only understand the 65167.html request, but a 301 is basically telling Google that /about no longer exists and that it should index the 65167.html page instead.
Ignacio is correct. You need to implement either mod_rewrite or something similar, depending on your platform, and hide the CMS, assuming that you can actually rewrite all your CMS-generated links to something more friendly.
A client-side redirect is probably too complex to implement, and a server-side redirect will cause two requests to the server.

I'm pretty sure Google understands 301 Moved Permanently.
