Disallow directory contents, but Allow directory page in robots.txt

Will this work to disallow pages under a directory, but still allow the page at that directory URL?
Allow: /special-offers/$
Disallow: /special-offers/
to allow:
www.mysite.com/special-offers/
but block:
www.mysite.com/special-offers/page1
www.mysite.com/special-offers/page2.html
etc.

Having looked at Google's very own robots.txt file, I can see they are doing exactly what I was asking about.
At lines 136-137 they have:
Disallow: /places/
Allow: /places/$
So they are blocking anything under /places/, but allowing the root /places/ URL. The only difference from my syntax is the order, with the Disallow coming first.
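To sanity-check that, here is a minimal Python sketch of Google-style matching (assuming the documented longest-rule-wins, Allow-wins-ties semantics; it is an illustration, not a real robots.txt parser). It reports the directory page as allowed and the sub-pages as blocked, regardless of rule order:
import re

# Google-style matching sketch: '*' matches any characters, '$' anchors the end
# of the path; the longest matching rule wins, and Allow wins ties.
RULES = [
    ("Allow", "/special-offers/$"),
    ("Disallow", "/special-offers/"),
]

def rule_to_regex(pattern):
    # Escape regex metacharacters, then restore the two supported wildcards.
    return re.compile("^" + re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$"))

def is_allowed(path):
    matches = [(len(pattern), kind) for kind, pattern in RULES
               if rule_to_regex(pattern).match(path)]
    if not matches:
        return True  # nothing matched, so crawling is allowed
    matches.sort(key=lambda m: (m[0], m[1] == "Allow"), reverse=True)
    return matches[0][1] == "Allow"

for path in ["/special-offers/", "/special-offers/page1", "/special-offers/page2.html"]:
    print(path, "->", "allowed" if is_allowed(path) else "blocked")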

Standards
According to the HTML 4.01 specification, Appendix B.4.1, the only values allowed in Disallow (no pun intended) are partial URIs (representing partial or full paths):
The "Disallow" field specifies a partial URI that is not to be visited. This can be a full path, or a partial path; any URI that starts with this value will not be retrieved. For example,
Disallow: /help disallows both /help.html and /help/index.html, whereas
Disallow: /help/ would disallow /help/index.html but allow /help.html.
I don't think anything has changed since then, since current HTML5 Specification Drafts don't mention robots.txt at all.
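For what it's worth, Python's standard-library parser implements this original prefix-matching behaviour, so the quoted example can be reproduced in a few lines (example.com is just a placeholder host):
import urllib.robotparser

# Reproducing the quoted /help vs /help/ example with the stdlib parser,
# which follows the original prefix-matching rules.
def allowed(robots_lines, url):
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch("*", url)

print(allowed(["User-agent: *", "Disallow: /help"], "http://example.com/help.html"))        # False
print(allowed(["User-agent: *", "Disallow: /help"], "http://example.com/help/index.html"))  # False
print(allowed(["User-agent: *", "Disallow: /help/"], "http://example.com/help.html"))       # True
print(allowed(["User-agent: *", "Disallow: /help/"], "http://example.com/help/index.html")) # False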
Extensions
However, in practice, many Robot Engines (such as Googlebot) are more flexible in what they accept. If you use, for instance:
Disallow: /*.gif$
then Googlebot will skip any file with the gif extension. I think you could do something like this to disallow all files under a folder, but I'm not 100% sure (you could test them with Google Webmaster Tools):
Disallow: /special-offers/*.*$
Other options
Anyway, you shouldn't rely on this too much (since each search engine might behave differently), so if possible it would be preferable to use meta tags or HTTP headers instead. For instance, you could configure your web server to include this header in all responses that should not be indexed (or followed):
X-Robots-Tag: noindex, nofollow
Search for the best way of doing it in your particular webserver. Here's an example in Apache, combining mod_rewrite with mod_headers to conditionally set some headers depending on the URL pattern. Disclaimer: I haven't tested it myself, so I can't tell how well it works.
# all /special-offers/ sub-URLs set the env var ROBOTS=none
RewriteRule ^/special-offers/.+$ - [E=ROBOTS:none]
# if the env var ROBOTS is set, add the response header X-Robots-Tag: $ENV{ROBOTS}
Header set X-Robots-Tag %{ROBOTS}e env=ROBOTS
(Note: none is equivalent to noindex, nofollow)
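Once the header is in place you can verify it with a short script. A quick standard-library check using the hypothetical URLs from the question (the sub-page should carry the header, the directory index should not):
from urllib.request import urlopen

# Print the X-Robots-Tag header (or None) for each URL.
for url in ("http://www.mysite.com/special-offers/",
            "http://www.mysite.com/special-offers/page1"):
    with urlopen(url) as resp:
        print(url, "->", resp.headers.get("X-Robots-Tag"))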

Related

Contradictory rules in robots.txt

I'm attempting to scrape a website, and these two rules in its robots.txt seem contradictory:
User-agent: *
Disallow: *
Allow: /
Does Allow: / mean that I can scrape the entire website, or just the root? If it means I can scrape the entire site, then it directly contradicts the previous rule.
If you are following the original robots.txt standard:
The * in the disallow line would be treated as a literal rather than a wildcard. That line would disallow URL paths that start with an asterisk. All URL paths start with a /, so that rule disallows nothing.
The Allow Rule isn't in the specification, so that line would be ignored.
Anything that isn't specifically disallowed is allowed to be crawled.
Verdict: You can crawl the site.
Google and a few other crawlers support wildcards and allows. If you are following Google's extensions to robots.txt, here is how Google would interpret this robots.txt:
Both Allow: / and Disallow: * match any specific path on the site.
In the case of such a conflict, the more specific (i.e. longer) rule wins. / and * are each one character long, so neither is considered more specific than the other.
In the case of a tie for specificity, the least restrictive rule wins. Allow is considered less restrictive than Disallow.
Verdict: You can crawl the site.
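If you want to double-check the original-standard reading programmatically, Python's standard-library parser behaves roughly that way (example.com is a placeholder host):
import urllib.robotparser

# The stdlib parser treats the '*' in the Disallow line as a literal, so it
# matches no real path and everything stays crawlable.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: *",
    "Allow: /",
])
print(rp.can_fetch("*", "http://example.com/"))          # True
print(rp.can_fetch("*", "http://example.com/any/page"))  # True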

IIS setup affecting bots, which is impacting search results

This is a tricky one to explain. I believe Googlebot is getting confused because of the way IIS and the sites are set up. The actual issue is that when searching Google, and the result is www.someSiteURL.com, the description underneath is:
A description for this result is not available because of this site's robots.txt – learn more.
I think the reason the issue exists is fairly clear. Using the example above, there is no page content at www.someSiteURL.com/default.asp. At that level there is a default.asp file with a whole bunch of redirects that take the user to the correct physical directory where the sites live. The sites all live under one root 'Site' in IIS like so:
siteOneDir
siteTwoDir
siteThreeDir
default.asp (this is the page with the redirects)
How do you overcome this without changing the site setup or the use of IP addresses?
Here is the robots.txt file:
User-agent: *
Allow: /default.asp
Allow: /siteOneDir/
Allow: /siteTwoDir/
Allow: /siteThreeDir/
Disallow: /
BTW, Google Webmaster Tools says this is valid. I know some clients may not recognize 'Allow', but Google and Bing do, so I don't care about that. I would rather disallow everything and then allow only specific sites, instead of only disallowing specific sites.
If I use the Google Webmaster Tools Crawl > Fetch as Google feature and type in www.someSiteURL.com/default.asp, it shows a status of 'Redirected' and the response status is HTTP/1.1 302 Found.
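You can also reproduce that check outside Webmaster Tools; a quick standard-library sketch (using the placeholder host from the question) that shows the raw redirect status without following it:
import http.client

# HEAD request to see the status code and redirect target as the server sends them.
conn = http.client.HTTPConnection("www.someSiteURL.com")
conn.request("HEAD", "/default.asp")
resp = conn.getresponse()
print(resp.status, resp.reason, resp.getheader("Location"))
conn.close()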
I believe the order of the items in robots.txt matters. Try putting the Disallow first, i.e. change it to:
User-agent: *
Disallow: /
Allow: /default.asp
Allow: /siteOneDir/
Allow: /siteTwoDir/
Allow: /siteThreeDir/

New to cross-domain CORS

I am new to this, so there are some questions I wanted to ask after looking at a bunch of sites related to CORS.
First of all, let's say I have http://domain1.com making an AJAX call to http://domain2.com. Looking at http://enable-cors.org/server.html, it says that I will have to add
Access-Control-Allow-Origin: *
as a response header to my page, or add this setting to the web.config in the root directory of my application. But I was confused: should I add the response header to the domain1 application or the domain2 application? My guess was domain2, but I cannot be sure because I don't have the means to test it.
Furthermore, what if domain2.com were on HTTPS, meaning I am calling from HTTP to HTTPS? Will this work?
And how about IE?
You should add it on http://domain2.com, because Access-Control-Allow-Origin is the permission that lets http://domain1.com read information from http://domain2.com.
Note that the * value means the domain allows access from anywhere, so you need to be careful with that. A better option would be:
Access-Control-Allow-Origin: http://domain1.com
It works fine for IE as well. Note, however, that Access-Control-Allow-Origin accepts only a single origin (or *), so to support both the HTTP and HTTPS versions of domain1 the server should check the request's Origin header against an allow list and echo back the matching origin, rather than listing several origins in one header.
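As a concrete sketch of that echo-the-origin approach on the domain2 side, here is a minimal standard-library Python example (the allow list, port, and response body are assumptions; in an IIS/ASP.NET setup the equivalent logic would live in the application or web.config):
from wsgiref.simple_server import make_server

# Hypothetical allow list: origins permitted to read responses from domain2.
ALLOWED_ORIGINS = {"http://domain1.com", "https://domain1.com"}

def app(environ, start_response):
    origin = environ.get("HTTP_ORIGIN", "")
    headers = [("Content-Type", "application/json")]
    if origin in ALLOWED_ORIGINS:
        # Echo back the single matching origin instead of sending '*'.
        headers.append(("Access-Control-Allow-Origin", origin))
        headers.append(("Vary", "Origin"))
    start_response("200 OK", headers)
    return [b'{"message": "hello from domain2"}']

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()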

How widely supported are scheme-relative URIs in HTTP 301 redirects

I want to have requests for the www subdomain or for alternate top-level domains redirected to one canonical URL.
To avoid HTTP/HTTPS issues, I figured the easiest way would be to just send a scheme-relative URI in the Location header, like so:
HTTP/1.1 301 Moved Permanently
Location: //example.com/
This seems to work fine in browsers, but the toy »validator« at http://no-www.org/ does not handle it correctly. Is this just a single badly written script, or is this behavior actually more common in scripts, crawlers, etc. out there?
Location expects an absolute URI:
[…] The field value consists of a single absolute URI.
Location = "Location" ":" absoluteURI
Although most user agents will also accept relative URIs, you should stick to the specification and provide an absolute URI.
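For what it's worth, a client that does implement standard reference resolution (RFC 3986) resolves the scheme-relative form against the originally requested URL, inheriting its scheme; a quick standard-library illustration:
from urllib.parse import urljoin

# The redirect target inherits the scheme of the request URL.
print(urljoin("http://www.example.com/some/page", "//example.com/"))   # http://example.com/
print(urljoin("https://www.example.com/some/page", "//example.com/"))  # https://example.com/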

Should I use 301 for in-site redirects?

We would like to redirect to a localized version of our entry web page if the visitor's IP is detected to be from a certain country. We are using ASP.NET and the GeoLite Country DB (a very small, 1 MB downloadable database at the time of writing this question).
So, most users would get English content, but if they come from the local country, they would get local content served by default. Of course, they would be able to change the preferred language at any time.
The question is: if www.example.com by default displays default.aspx, should we (if we detect the IP to be "local"):
Use "301 Moved Permanently" and redirect it to, say, www.example.com/local.aspx, or
Simply render the appropriate content inside default.aspx?
We would like to know: are there any SEO side effects or similar issues with either approach?
Using a 301 may not be the best solution here.
According to Wikipedia, 300 is the code meant for offering different languages:
http://en.wikipedia.org/wiki/URL_redirection
http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3.1
The HTTP standard defines several status codes for redirection:
* 300 multiple choices (e.g. offer different languages)
* 301 moved permanently
* 302 found (originally temporary redirect, but now commonly used to specify redirection for unspecified reason)
* 303 see other (e.g. for results of cgi-scripts)
* 307 temporary redirect
I would just deliver the localized contents of local.aspx and send an appropriate Content-Location header referring to local.aspx along with it.
Or, if you want a redirect, use the status code 307 to indicate a temporary redirect.
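A rough Python sketch of those two options in a generic WSGI-style handler (is_local_visitor() stands in for the GeoLite country lookup; the .aspx paths are the ones from the question):
# Option 1: serve localized content with a Content-Location header.
# Option 2: issue a 307 temporary redirect to the localized page.
USE_REDIRECT = False  # flip to use option 2 (assumption for illustration)

def app(environ, start_response, is_local_visitor=lambda env: False):
    if not is_local_visitor(environ):
        start_response("200 OK", [("Content-Type", "text/html")])
        return [b"<html>default (English) content</html>"]

    if USE_REDIRECT:
        start_response("307 Temporary Redirect",
                       [("Location", "http://www.example.com/local.aspx")])
        return [b""]

    # Serve the localized content directly and point at its canonical URL.
    start_response("200 OK", [("Content-Type", "text/html"),
                              ("Content-Location", "/local.aspx")])
    return [b"<html>localized content</html>"]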
