Contradictory rules in robots.txt - web-scraping

I'm attempting to scrape a website, and these two rules in its robots.txt seem contradictory:
User-agent: *
Disallow: *
Allow: /
Does Allow: / mean that I can scrape the entire website, or just the root? If it means I can scrape the entire site, then it directly contradicts the previous rule.

If you are following the original robots.txt standard:
The * in the disallow line would be treated as a literal rather than a wildcard. That line would disallow URL paths that start with an asterisk. All URL paths start with a /, so that rule disallows nothing.
The Allow rule isn't in the original specification, so that line would be ignored.
Anything that isn't specifically disallowed is allowed to be crawled.
Verdict: You can crawl the site.
Google and a few other crawlers support wildcards and Allow rules. If you are following Google's extensions to robots.txt, here is how Google would interpret this file:
Both Allow: / and Disallow: * match any specific path on the site.
In the case of such a conflict, the more specific (i.e. longer) rule wins. / and * are each one character long, so neither is considered more specific than the other.
In a case of a tie for specificity, the least restrictive rule wins. Allow is considered less restrictive than Disallow.
Verdict: You can crawl the site.
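You can sanity-check this with Python's urllib.robotparser, which implements roughly the original standard (no wildcard support), so it should agree with both verdicts. A minimal sketch, with a made-up user agent and URL:
import urllib.robotparser
# The rules from the question, parsed from a string rather than fetched from a site.
ROBOTS_TXT = """\
User-agent: *
Disallow: *
Allow: /
"""
rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())
# Expected to print True: the "*" path is treated as a literal and matches nothing,
# while the Allow: / line matches every path.
print(rp.can_fetch("MyBot", "https://example.com/some/page"))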

Interpreting robots.txt vs. terms of use

I'm interested in scraping craigslist, solely for the purpose of data analysis for a blog post (i.e., no commercial or financial gain, no posting/emailing, no personal data collection, no sharing of data scraped). Their robots.txt file is the following:
User-agent: *
Disallow: /reply
Disallow: /fb/
Disallow: /suggest
Disallow: /flag
Disallow: /mf
Disallow: /eaf
I intend to visit none of these directories, only to view posts and then collect the text from the postbody. This does not seem to be disallowed by the robots.txt file. However, Craigslist's terms of use contain the following entry:
USE. You agree not to use or provide software (except for general purpose web browsers and email clients, or software expressly licensed by us) or services that interact or interoperate with CL, e.g. for downloading, uploading, posting, flagging, emailing, search, or mobile use. Robots, spiders, scripts, scrapers, crawlers, etc. are prohibited, as are misleading, unsolicited, unlawful, and/or spam postings/email. You agree not to collect users' personal and/or contact information ("PI").
So should I assume that my bot is forbidden across the entire site, or just forbidden in the Disallowed directories in robots.txt? If it's the former, then what am I misunderstanding about the robots.txt file? If it's the latter, then may I assume that they will not ban my IP given that I abide by robots.txt?
They provide data in RSS format. At the bottom right there is an RSS link that will take you to ?format=rss
For example: https://losangeles.craigslist.org/search/sss?format=rss
My guess would be that that sort of thing is really not allowed if you're redistributing the post content, collecting emails to spam, etc. It probably depends on how you use the data. If you're only gathering statistical information, maybe it's acceptable, but I really don't know. It's probably a better question for a lawyer.
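If you do go the RSS route, here's a minimal sketch using only the Python standard library (untested against the live feed; the URL is the one from the answer above, and the User-Agent string is just an illustration):
import urllib.request
import xml.etree.ElementTree as ET
FEED_URL = "https://losangeles.craigslist.org/search/sss?format=rss"
def fetch_titles(url):
    # Identify the client; throttling and robots.txt checks are up to the caller.
    req = urllib.request.Request(url, headers={"User-Agent": "blog-data-analysis"})
    with urllib.request.urlopen(req) as resp:
        tree = ET.parse(resp)
    titles = []
    # Match elements by local name so this works for plain RSS 2.0 items
    # as well as namespaced RDF/RSS 1.0 feeds.
    for elem in tree.iter():
        if elem.tag.rsplit("}", 1)[-1] == "item":
            for child in elem:
                if child.tag.rsplit("}", 1)[-1] == "title" and child.text:
                    titles.append(child.text.strip())
    return titles
for title in fetch_titles(FEED_URL):
    print(title)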

HTTP/FTP: Does trailing slash in URL mean another resource

I have two URLs:
http://example.com/foo
and
http://example.com/foo/
Are they different URLs or the same? The same question applies to the FTP protocol (ftp://example.com/foo[/]).
In the URI standard, the relevant section is Normalization and Comparison:
After doing a simple string comparison, these URIs are not equivalent.
After applying syntax-based normalization, these URIs are not equivalent.
For scheme-based normalization, you have to refer to the specifications of the http/https and ftp URI schemes, and check if any scheme-specific rules are defined:
For http/https, these rules are in the section http and https URI Normalization and Comparison, and there don’t seem to be any for your case.
For ftp, no normalization/comparison rules seem to be defined.
For protocol-based normalization, you have to take something like redirects into account (in case of http).
tl;dr: The URIs are not equivalent.
Note that this is not the case for an empty path in HTTP(S) URIs, as the section linked above defines:
[…] an empty path component is equivalent to an absolute path of "/" […]
So the following URIs are equivalent:
http://example.com/
http://example.com
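A simplified sketch of that scheme-based comparison for HTTP(S) (it ignores default ports and percent-encoding normalization) illustrates both points:
from urllib.parse import urlsplit
def http_equivalent(a, b):
    # Scheme and host compare case-insensitively; an empty path counts as "/".
    def norm(url):
        parts = urlsplit(url)
        return (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query)
    return norm(a) == norm(b)
print(http_equivalent("http://example.com", "http://example.com/"))          # True
print(http_equivalent("http://example.com/foo", "http://example.com/foo/"))  # False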
By the way, for the protocol-based normalization, the standard gives your case as an example:
[…] For example, if they observe that a URI such as
http://example.com/data
redirects to a URI differing only in the trailing slash
http://example.com/data/
they will likely regard the two as equivalent in the future. […]
Yes, they are different resources.
It's particularly important in HTML.
If you have a relative link bar (as in <a href="bar">blah</a>):
On https://www.example.com/foo, that link resolves to https://www.example.com/bar,
while on https://www.example.com/foo/, it resolves to https://www.example.com/foo/bar.
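You can verify that resolution with urljoin from Python's standard library:
from urllib.parse import urljoin
print(urljoin("https://www.example.com/foo", "bar"))   # https://www.example.com/bar
print(urljoin("https://www.example.com/foo/", "bar"))  # https://www.example.com/foo/bar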
But HTTP servers will usually redirect https://www.example.com/foo to https://www.example.com/foo/ when foo is a folder, to avoid this confusion.
With FTP, it's probably client-specific, as the FTP protocol itself does not work with URLs.
So how ftp://www.example.com/foo behaves when foo is actually a folder depends on the FTP client. The "FTP client" in this case typically means a web browser, since browsers work with URLs; dedicated FTP clients usually do not work with URLs at all.

IIS setup affecting bots, which is impacting search results

This is a tricky one to explain. I believe the Google bot is getting confused because of the way IIS/sites are set up. The actual issue is that when searching Google and the result is www.someSiteURL.com, the description underneath is:
A description for this result is not available because of this site's robots.txt – learn more.
I think the reason the issue exists is fairly clear. Using the example above, there is no page content at www.someSiteURL.com/default.asp. At this level there is a default.asp file with a whole bunch of redirects to take the user to the correct physical dir where the sites are. The sites all live under one root 'Site' in IIS, like so:
siteOneDir
siteTwoDir
siteThreeDir
default.asp (this is the page with the redirects)
How do you overcome this without changing the site setup/use of IP addresses?
Here is the robots.txt file:
User-agent: *
Allow: /default.asp
Allow: /siteOneDir/
Allow: /siteTwoDir/
Allow: /siteThreeDir/
Disallow: /
BTW, Google Webmaster Tools says this is valid. I know some clients may not recognize 'Allow', but Google and Bing do, so I don't care about that. I would rather disallow everything and then allow only specific sites, rather than use this file only to disallow specific sites.
If I use the Google Webmaster Tools Crawl > Fetch as Google and type in www.someSiteURL.com/default.asp, it has a status of 'Redirected' and the response is HTTP/1.1 302 Found.
I believe the order of the items in robots.txt matters. Try putting the disallow first, i.e. change to:
User-agent: *
Disallow: /
Allow: /default.asp
Allow: /siteOneDir/
Allow: /siteTwoDir/
Allow: /siteThreeDir/
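As a quick local check of the redirect that 'Fetch as Google' reported, a plain HEAD request shows the raw status without following it (the hostname is the placeholder from the question):
import http.client
# http.client never follows redirects, so the 302 and its Location target stay visible.
conn = http.client.HTTPConnection("www.someSiteURL.com")
conn.request("HEAD", "/default.asp")
resp = conn.getresponse()
print(resp.status, resp.reason, resp.getheader("Location"))
conn.close()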

Disallow directory contents, but Allow directory page in robots.txt

Will this work to disallow pages under a directory, but still allow the page at that directory URL?
Allow: /special-offers/$
Disallow: /special-offers/
to allow:
www.mysite.com/special-offers/
but block:
www.mysite.com/special-offers/page1
www.mysite.com/special-offers/page2.html
etc
Looking at Google's own robots.txt file, they are doing exactly what I was asking about.
At line 136-137 they have:
Disallow: /places/
Allow: /places/$
So they are blocking anything under /places/, but allowing the root /places/ URL. The only difference from my syntax is the order, with the Disallow coming first.
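Under Google's longest-match rule the order shouldn't actually matter; here is a toy re-implementation of that resolution (my own sketch, not Google's code) applied to the two rules:
import re
# The two rules from the question; under longest-match their order is irrelevant.
RULES = [("allow", "/special-offers/$"), ("disallow", "/special-offers/")]
def to_regex(pattern):
    # robots.txt patterns: "*" matches any characters, a trailing "$" anchors the end.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return "^" + regex
def allowed(path):
    matches = [(len(p), verdict == "allow")
               for verdict, p in RULES if re.match(to_regex(p), path)]
    if not matches:
        return True                # nothing matched: crawling is allowed
    matches.sort(reverse=True)     # longest pattern wins; Allow wins a length tie
    return matches[0][1]
print(allowed("/special-offers/"))            # True  (the $-anchored Allow is longer)
print(allowed("/special-offers/page1"))       # False
print(allowed("/special-offers/page2.html"))  # False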
Standards
According to the HTML 4.01 specification, Appendix B.4.1, the only values allowed in Disallow (no pun intended) are partial URIs (representing partial or full paths):
The "Disallow" field specifies a partial URI that is not to be visited. This can be a full path, or a partial path; any URI that starts with this value will not be retrieved. For example,
Disallow: /help disallows both /help.html and /help/index.html, whereas
Disallow: /help/ would disallow /help/index.html but allow /help.html.
I don't think anything has changed since then, since current HTML5 Specification Drafts don't mention robots.txt at all.
Extensions
However, in practice, many Robot Engines (such as Googlebot) are more flexible in what they accept. If you use, for instance:
Disallow: /*.gif$
then Googlebot will skip any file with the gif extension. I think you could do something like this to disallow all files under a folder, but I'm not 100% sure (you could test them with Google Webmaster Tools):
Disallow: /special-offers/*.*$
Other options
Anyway, you shouldn't rely on this too much (since each search engine might behave differently), so if possible it would be preferable to use meta tags or HTTP headers instead. For instance, you could configure your web server to include this header in all responses that should not be indexed (or followed):
X-Robots-Tag: noindex, nofollow
Search for the best way of doing it in your particular webserver. Here's an example in Apache, combining mod_rewrite with mod_headers to conditionally set some headers depending on the URL pattern. Disclaimer: I haven't tested it myself, so I can't tell how well it works.
# all /special-offers/ sub-urls set env var ROBOTS=none
RewriteRule ^/special-offers/.+$ - [E=ROBOTS:none]
# if env var ROBOTS is set then send the response header X-Robots-Tag: <value of ROBOTS>
Header set X-Robots-Tag %{ROBOTS}e env=ROBOTS
(Note: none is equivalent to noindex, nofollow)

Should I use 301 for in-site redirects?

We would like to redirect to a localized version of our entry page if the IP is detected to be from a certain country. We are using ASP.NET and the GeoLite Country DB (a very small, ~1 MB download at the time of writing this question).
So most users would get English content, but if they come from a local place, they would get local content served by default. Of course, they would be able to change the preferred language at any time.
The question is: if www.example.com by default displays default.aspx, should we (if we detect the IP to be "local"):
Use "301 Moved Permanently" and redirect it to, say, www.example.com/local.aspx, or
Simply render the appropriate content inside default.aspx?
We would like to know whether there are any SEO side effects or similar issues with either approach.
This may not be the best solution. Wikipedia's article on URL redirection suggests using 300 for offering different languages:
http://en.wikipedia.org/wiki/URL_redirection
http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3.1
The HTTP standard defines several status codes for redirection:
* 300 multiple choices (e.g. offer different languages)
* 301 moved permanently
* 302 found (originally temporary redirect, but now commonly used to specify redirection for unspecified reason)
* 303 see other (e.g. for results of cgi-scripts)
* 307 temporary redirect
I would just deliver the localized contents of local.aspx and send an appropriate Content-Location referring to local.aspx along with it.
Or, if you want a redirect, use the status code 307 to indicate a temporary redirect.
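Concretely, the two suggestions would look roughly like this on the wire (paths taken from the question, everything else illustrative). Serving the localized content directly:
HTTP/1.1 200 OK
Content-Location: /local.aspx
...localized markup...
Or redirecting temporarily:
HTTP/1.1 307 Temporary Redirect
Location: http://www.example.com/local.aspx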
