Interpreting robots.txt vs. terms of use - web-scraping

I'm interested in scraping craigslist, solely for the purpose of data analysis for a blog post (i.e., no commercial or financial gain, no posting/emailing, no personal data collection, no sharing of data scraped). Their robots.txt file is the following:
User-agent: *
Disallow: /reply
Disallow: /fb/
Disallow: /suggest
Disallow: /flag
Disallow: /mf
Disallow: /eaf
I intend to visit none of these directories, only to view posts and collect the text from the post body. That does not seem to be disallowed by the robots.txt file. However, Craigslist's terms of use contain the following clause (the relevant part is the sentence about robots and scrapers):
USE. You agree not to use or provide software (except for general purpose web browsers and email clients, or software expressly licensed by us) or services that interact or interoperate with CL, e.g. for downloading, uploading, posting, flagging, emailing, search, or mobile use. Robots, spiders, scripts, scrapers, crawlers, etc. are prohibited, as are misleading, unsolicited, unlawful, and/or spam postings/email. You agree not to collect users' personal and/or contact information ("PI").
So should I assume that my bot is forbidden across the entire site, or just forbidden in the Disallowed directories in robots.txt? If it's the former, then what am I misunderstanding about the robots.txt file? If it's the latter, then may I assume that they will not ban my IP given that I abide by robots.txt?
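As a sanity check, those rules can be tested against example paths with Python's standard-library robots.txt parser; a minimal sketch (the post path below is made up):
# Test example paths against the rules quoted above using the
# standard-library parser. The post path is a made-up placeholder.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /reply",
    "Disallow: /fb/",
    "Disallow: /suggest",
    "Disallow: /flag",
    "Disallow: /mf",
    "Disallow: /eaf",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "/lac/apa/d/example-post/1234.html"))  # True: post pages are not disallowed
print(rp.can_fetch("*", "/reply/whatever"))                    # False: matches Disallow: /reply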

They provide data in RSS format. At the bottom right there is an RSS link that will take you to ?format=rss.
For example: https://losangeles.craigslist.org/search/sss?format=rss
My guess would be that sort of thing is really not allowed if you're redistributing the post content, collecting emails to spam, etc. It probably depends on how you use the data. If you're only gathering statistical information, maybe it's acceptable, but I really don't know. It's probably a better question for a lawyer.
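If the RSS route is acceptable, the feed can be pulled and parsed with nothing but the standard library; a rough sketch that prints item titles from the feed URL above (tags are matched by local name, so the exact RSS dialect doesn't matter):
# Fetch the RSS feed linked above and print the titles it contains,
# using only the standard library. Tags are matched by local name so
# the exact RSS dialect/namespace doesn't matter.
import urllib.request
import xml.etree.ElementTree as ET

url = "https://losangeles.craigslist.org/search/sss?format=rss"
req = urllib.request.Request(url, headers={"User-Agent": "blog-research-script"})
with urllib.request.urlopen(req) as resp:
    root = ET.fromstring(resp.read())

for el in root.iter():
    local_name = el.tag.rsplit("}", 1)[-1]   # strip any XML namespace prefix
    if local_name == "title" and el.text:
        print(el.text.strip())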

IIS setup affecting bots, which is impacting search results

This is a tricky one to explain. I believe Googlebot is getting confused because of the way the IIS sites are set up. The actual issue is that when you search Google and the result is www.someSiteURL.com, the description underneath it is:
A description for this result is not available because of this site's robots.txt – learn more.
I think the reason the issue exists is fairly clear. Using the example above, there is no page content at www.someSiteURL.com/default.asp. At this level there is a default.asp file with a whole bunch of redirects that take the user to the correct physical directory where the sites live. The sites all live under one root 'Site' in IIS, like so:
siteOneDir
siteTwoDir
siteThreeDir
default.asp (this is the page with the redirects)
How do you overcome this without changing the site setup or the use of IP addresses?
Here is the robots.txt file:
User-agent: *
Allow: /default.asp
Allow: /siteOneDir/
Allow: /siteTwoDir/
Allow: /siteThreeDir/
Disallow: /
BTW, Google Webmaster Tools says this is valid. I know some clients may not recognize 'Allow', but Google and Bing do, so I don't care about that. I would rather disallow everything and then allow specific sites than use this only to disallow specific sites.
If I use the Google Webmaster Tools Crawl > Fetch as Google feature and type in www.someSiteURL.com/default.asp, it shows a status of 'Redirected', and the response is HTTP/1.1 302 Found.
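For reference, the redirect status can also be reproduced outside of Webmaster Tools; a minimal sketch with Python's http.client, using the placeholder hostname from above:
# Ask for /default.asp without following redirects and show the status
# code and Location header. The hostname is the question's placeholder.
import http.client

conn = http.client.HTTPConnection("www.someSiteURL.com")
conn.request("HEAD", "/default.asp", headers={"User-Agent": "redirect-check"})
resp = conn.getresponse()
print(resp.status, resp.reason)        # expected: 302 Found
print(resp.getheader("Location"))      # where the redirect points
conn.close()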
I believe the order of the items in robots.txt matters. Try putting the Disallow first, i.e. change it to:
User-agent: *
Disallow: /
Allow: /default.asp
Allow: /siteOneDir/
Allow: /siteTwoDir/
Allow: /siteThreeDir/
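To see how much ordering can matter in practice, the two versions can be compared with Python's urllib.robotparser, which applies the first matching rule in file order; Googlebot instead honours the most specific (longest) matching rule regardless of order, so Google's robots.txt Tester remains the authoritative check. A rough sketch:
# Compare the two orderings with the standard-library parser, which
# applies the first matching rule in file order, so the two files give
# different answers; Googlebot resolves Allow/Disallow conflicts by the
# longest matching rule, so verify with Google's robots.txt Tester.
from urllib.robotparser import RobotFileParser

allow_rules = [
    "Allow: /default.asp",
    "Allow: /siteOneDir/",
    "Allow: /siteTwoDir/",
    "Allow: /siteThreeDir/",
]
original  = ["User-agent: *"] + allow_rules + ["Disallow: /"]
reordered = ["User-agent: *", "Disallow: /"] + allow_rules

def can_fetch(lines, url):
    rp = RobotFileParser()
    rp.parse(lines)
    return rp.can_fetch("*", url)

for url in ("/siteOneDir/somePage.asp", "/somethingElse/"):
    print(url, "original:", can_fetch(original, url),
               "reordered:", can_fetch(reordered, url))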

Identifying users accessing hidden links on a website

Recently I put some hidden links in a website in order to trap web crawlers. (I used the CSS visibility: hidden style so that human users would not access them.)
Anyway, I found that there were plenty of HTTP requests, with browser-like user-agent strings, that accessed the hidden links.
E.g.: "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31"
So now my problems are:
(1) Are these web crawlers, or what else could they be?
(2) Are they malicious?
(3) Is there a way to profile their behaviour?
I searched the web but couldn't find any useful information. Can you please point me to some resources? Any help would be appreciated.
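As a concrete starting point for profiling, the hits on a trap URL can be tallied straight from the access log; a minimal sketch, assuming a combined-format log file named access.log and a hypothetical trap path /hidden-link.html:
# Count requests to the hidden trap URL per client IP and user agent,
# from a combined-format access log. The log file name and trap path
# are hypothetical placeholders.
import re
from collections import Counter

TRAP_PATH = "/hidden-link.html"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

hits = Counter()
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        m = LOG_LINE.match(line)
        if m and m.group("path") == TRAP_PATH:
            hits[(m.group("ip"), m.group("agent"))] += 1

for (ip, agent), count in hits.most_common(20):
    print(count, ip, agent)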
That is an HTTP user-agent string. It is not malicious in itself; it simply follows the usual pattern, e.g. Mozilla/<version> and so on. A browser is a user agent, for example. However, user agents can be spoofed by attackers, and this can be identified by looking for anomalies. You can read this paper.
The Hypertext Transfer Protocol (HTTP) identifies the client software
originating the request, using a "User-Agent" header, even when the
client is not operated by a user.
The answers to your questions are, in order:
They are not web crawlers; they are user agents (a common term among web developers).
Generally they aren't malicious, but they can be; as I suggested, look at the paper.
I don't understand what you mean by profiling their behaviour; they aren't malware!

Disallow directory contents, but Allow directory page in robots.txt

Will this work to disallow pages under a directory, but still allow the page at that directory's URL?
Allow: /special-offers/$
Disallow: /special-offers/
to allow:
www.mysite.com/special-offers/
but block:
www.mysite.com/special-offers/page1
www.mysite.com/special-offers/page2.html
etc
Having looked at Google's own robots.txt file, I can see they are doing exactly what I was asking about.
At lines 136-137 they have:
Disallow: /places/
Allow: /places/$
So they are blocking anything under /places/, but allowing the root /places/ URL. The only difference from my syntax is the order, with the Disallow coming first.
Standards
According to the HTML 4.01 specification, Appendix B.4.1, the only values allowed in Disallow (no pun intended) are partial URIs (representing partial or full paths):
The "Disallow" field specifies a partial URI that is not to be visited. This can be a full path, or a partial path; any URI that starts with this value will not be retrieved. For example,
Disallow: /help disallows both /help.html and /help/index.html, whereas
Disallow: /help/ would disallow /help/index.html but allow /help.html.
I don't think anything has changed since then, since current HTML5 Specification Drafts don't mention robots.txt at all.
Extensions
However, in practice, many Robot Engines (such as Googlebot) are more flexible in what they accept. If you use, for instance:
Disallow: /*.gif$
then Googlebot will skip any file with the gif extension. I think you could do something like this to disallow all files under a folder, but I'm not 100% sure (you could test them with Google Webmaster Tools):
Disallow: /special-offers/*.*$
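One way to test such patterns without guessing is to approximate the wildcard matching Google documents: * matches any sequence of characters, a trailing $ anchors the end of the URL, and when Allow and Disallow both match, the longer rule wins. A rough sketch (an approximation, not Googlebot itself):
# Rough approximation of Googlebot-style matching: '*' matches any
# characters, a trailing '$' anchors the end of the URL, and the longest
# matching rule decides between Allow and Disallow.
import re

def to_regex(pattern):
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

def allowed(path, rules):
    best_kind, best_len = "allow", -1      # no matching rule means allowed
    for kind, pattern in rules:
        if to_regex(pattern).match(path) and len(pattern) > best_len:
            best_kind, best_len = kind, len(pattern)
    return best_kind == "allow"

rules = [("allow", "/special-offers/$"), ("disallow", "/special-offers/")]
for path in ("/special-offers/", "/special-offers/page1", "/special-offers/page2.html"):
    print(path, "allowed" if allowed(path, rules) else "blocked")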
Other options
Anyway, you shouldn't rely on this too much (since each search engine might behave differently), so if possible it would be preferable to use meta tags or HTTP headers instead. For instance, you could configure your webserver to include this header in all responses that should not be indexed (or followed):
X-Robots-Tag: noindex, nofollow
Search for the best way of doing it in your particular webserver. Here's an example in Apache, combining mod_rewrite with mod_headers to conditionally set some headers depending on the URL pattern. Disclaimer: I haven't tested it myself, so I can't tell how well it works.
# for all /special-offers/ sub-URLs, set the env var ROBOTS=none
RewriteRule ^/special-offers/.+$ - [E=ROBOTS:none]
# if the ROBOTS env var is set, send the response header X-Robots-Tag: <value of ROBOTS>
Header set X-Robots-Tag %{ROBOTS}e env=ROBOTS
(Note: none is equivalent to noindex, nofollow)
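Whichever mechanism is used, it is worth confirming the header actually reaches clients; a minimal check against one of the example URLs from the question (placeholder hostname):
# Print the status code and the X-Robots-Tag header (if any) for one of
# the URLs that should not be indexed. The hostname is the placeholder above.
import urllib.request

url = "http://www.mysite.com/special-offers/page1"
with urllib.request.urlopen(url) as resp:
    print(resp.status, resp.headers.get("X-Robots-Tag"))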

How does StackExchange handle invalid characters in route URLs?

Scott Hanselman's post on using wacky characters in a request URL explains how IIS and ASP.NET security features can be circumvented to allow invalid characters to be passed in a URL... but I am sure StackExchange is doing it differently, as his methodology would leave the site wide open to nasty attacks and bugs.
StackExchange has links to tags, like C#, that are sent to the web server in a GET request, encoded like this:
// C#
http://stackoverflow.com/questions/tagged/c%23
// C++
http://stackoverflow.com/questions/tagged/c%2b%2b
The trick is... they are sent as request path values (i.e. route parameters), not as values in a query string...
If you read Hanselman's article, he suggests this is only possible by turning off several other security features beyond RequestValidation (the latter allows encoded characters in the query-string portion of a URL).
Questions
How does StackExchange accomplish this?
If it is done the same way Hanselman illustrates in his blog, what extra steps do they take to protect themselves?
They don't accept just any character. They use slugs.
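To illustrate the slug idea (a common pattern, not StackExchange's actual code): the text is reduced to a restricted character set before it ever becomes part of a URL, so nothing "wacky" has to survive request validation.
# A typical slug routine (illustrative only): reduce a title to
# lowercase ASCII letters, digits and hyphens before building the URL,
# so the route never sees arbitrary characters.
import re
import unicodedata

def slugify(title):
    ascii_title = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")

print(slugify("How does StackExchange handle invalid characters in route URLs?"))
# -> how-does-stackexchange-handle-invalid-characters-in-route-urls
For the tag URLs above, the same idea presumably applies in reverse: the route only has to match a fixed list of known tag slugs, so the percent-encoded value is looked up rather than treated as arbitrary user input.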

Custom HTTP Headers with old proxies

Is it true that some old proxies/caches will not honor some custom HTTP headers? If so, can you prove it with sections from the HTTP spec or some other information online?
I'm designing a REST API. For versioning, I'm debating whether to put the version in the URL (/path1/path2/v1 or /path1/path2?ver=1) or to use a custom header (e.g. an X-Version header).
I was just reading in O'Reilly's Even Faster Web Sites about how mainly internet security software, but really anything that has to inspect the contents of a page, might filter the Accept-Encoding header in order to reduce the CPU time spent decompressing and reading the file. The book cites that about 15% of users have this issue.
However, I see no reason why other, custom headers would be filtered. On the other hand, there isn't really any reason to send it as a header rather than as a GET parameter, is there? It's not really part of the HTTP protocol; it's just your API.
Edit: Also, see the actual section of the book I mention.
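To make the two candidates concrete, here is a minimal client-side sketch of both styles (the hostname api.example.com and the header name X-Version are placeholders, not part of any standard):
# The two versioning styles under discussion, seen from the client side.
# api.example.com and X-Version are placeholders.
import urllib.request

# 1. Version in the URL path: survives any intermediary that can pass a URL through.
url_style = urllib.request.Request("https://api.example.com/path1/path2/v1")

# 2. Version in a custom request header: relies on every proxy/cache
#    forwarding the header untouched.
header_style = urllib.request.Request("https://api.example.com/path1/path2",
                                      headers={"X-Version": "1"})

for req in (url_style, header_style):
    print(req.full_url, dict(req.header_items()))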
