IIS setup affecting bots and impacting search results - asp-classic

This is a tricky one to explain. I believe the Google bot is getting confused because of the way IIS and the sites are set up. The actual issue is that when searching Google and the result is www.someSiteURL.com, the description underneath is:
A description for this result is not available because of this site's robots.txt – learn more.
I think the reason the issue exists is fairly clear. Using the example above, there is no page content at www.someSiteURL.com/default.asp. At this level there is a default.asp file with a whole bunch of redirects to take the user to the correct physical directory where the sites are. The sites all live under one root 'Site' in IIS like so:
siteOneDir
siteTwoDir
siteThreeDir
default.asp (this is the page with the redirects)
How do you overcome this without changing the site setup or using separate IP addresses?
Here is the robots.txt file:
User-agent: *
Allow: /default.asp
Allow: /siteOneDir/
Allow: /siteTwoDir/
Allow: /siteThreeDir/
Disallow: /
BTW, Google Webmaster Tools says this robots.txt is valid. I know some clients may not recognize 'Allow', but Google and Bing do, so I don't care about that. I would rather disallow everything and then only allow the sites I want than use the file solely to disallow specific sites.
If I use the Google Webmaster Tools Crawl > Fetch as Google feature and type in www.someSiteURL.com/default.asp, it does show a status of 'Redirected' with HTTP/1.1 302 Found.
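For a quick local sanity check outside Webmaster Tools, the same file can be run through Python's urllib.robotparser. This is only a sketch: the host and sample paths are placeholders from the question, and the standard-library parser uses first-match semantics rather than Googlebot's longest-match rule, so treat the result as a rough check.

from urllib.robotparser import RobotFileParser

# robots.txt exactly as listed above; parse it locally and test a few paths.
rules = """\
User-agent: *
Allow: /default.asp
Allow: /siteOneDir/
Allow: /siteTwoDir/
Allow: /siteThreeDir/
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

for path in ("/default.asp", "/siteOneDir/somePage.asp", "/otherDir/"):
    print(path, rp.can_fetch("*", "http://www.someSiteURL.com" + path))
# The two allowed paths print True; /otherDir/ prints False.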

I believe the order of the items in robots.txt matters. Try putting the Disallow first, i.e. change it to:
User-agent: *
Disallow: /
Allow: /default.asp
Allow: /siteOneDir/
Allow: /siteTwoDir/
Allow: /siteThreeDir/

Related

Contradictory rules in robots.txt

I'm attempting to scrape a website, and these two rules in its robots.txt seem contradictory:
User-agent: *
Disallow: *
Allow: /
Does Allow: / mean that I can scrape the entire website, or just the root? If it means I can scrape the entire site, then it directly contradicts the previous rule.
If you are following the original robots.txt standard:
The * in the disallow line would be treated as a literal rather than a wildcard. That line would disallow URL paths that start with an asterisk. All URL paths start with a /, so that rule disallows nothing.
The Allow rule isn't in the original specification, so that line would be ignored.
Anything that isn't specifically disallowed is allowed to be crawled.
Verdict: You can crawl the site.
Google and a few other crawlers support wildcards and Allow lines. If you are following Google's extensions to robots.txt, here is how Google would interpret this robots.txt:
Both Allow: / and Disallow: * match every path on the site.
In the case of such a conflict, the more specific (i.e. longer) rule wins. / and * are each one character, so neither is considered more specific than the other.
In a case of a tie for specificity, the least restrictive rule wins. Allow is considered less restrictive than Disallow.
Verdict: You can crawl the site.
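As a rough sketch of that resolution logic (not Google's actual matcher): collect the rules that match a path, prefer the longest pattern, and let Allow win a length tie. Here both rules are taken as matching everything, per the interpretation above.

# Sketch only: both "Disallow: *" and "Allow: /" are treated as matching
# every path, as described above; the tie-break then decides the outcome.
rules = [("Disallow", "*"), ("Allow", "/")]

def winner(matching_rules):
    # Most specific (longest) pattern wins; on a length tie, Allow beats Disallow.
    return max(matching_rules, key=lambda r: (len(r[1]), r[0] == "Allow"))

print(winner(rules))  # ('Allow', '/') -> the site may be crawled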

Google Search Console - Verification Failed

I am trying to add my website to Google Search Console, but it fails and returns:
The connection to your server timed out
The file is there and I can open it in a normal browser, the meta tags are set to index,all, and robots.txt is in place with User-agent: * Disallow: allowing everything to be crawled.
But it seems I can't get Search Console to check the verification file. I have tried the HTML file verification, HTML tag verification, Google Analytics verification, and Google Tag verification, but all of them return the same connection time-out error.
Is there anything else I have to do to verify this?
Thank you
Do you have it on two lines, like so?
User-agent: *
Disallow:
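Since the reported error is a connection timeout rather than anything robots.txt related, it is also worth confirming that the verification file and robots.txt can actually be reached from outside your own network (a firewall or hosting filter can let your browser through while blocking other clients). A minimal check, with a placeholder domain and verification file name, might look like this when run from an external machine:

from urllib.request import urlopen

# Placeholder domain and verification file name; substitute your own values.
for url in ("https://www.example.com/google1234567890abcdef.html",
            "https://www.example.com/robots.txt"):
    try:
        with urlopen(url, timeout=10) as resp:
            print(url, resp.status)
    except OSError as exc:  # URLError and socket timeouts both end up here
        print(url, "failed:", exc)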

Should I consider robots.txt when a URL is redirected to another domain?

I want to crawl some sites hosted on medium.com under custom domains
(e.g. https://uber-developers.news/).
These sites always redirect to medium.com, which then redirects back to the site. But there is a problem: the intermediate medium.com URL is disallowed by its robots.txt.
Here is the redirect chain:
https://uber-developers.news/
https://medium.com/m/global-identity?redirectUrl=https://uber-developers.news/
https://uber-developers.news/?gi=e0f8caa9844c
The problem is the second URL above, https://medium.com/m/global-identity?redirectUrl=https://uber-developers.news/, which is disallowed by this robots.txt:
https://medium.com/robots.txt
User-Agent: *
Disallow: /m/
Disallow: /me/
Disallow: /#me$
Disallow: /#me/
Disallow: /*/*/edit
Allow: /_/
Allow: /_/api/users/*/meta
Allow: /_/api/users/*/profile/stream
Allow: /_/api/posts/*/responses
Allow: /_/api/posts/*/responsesStream
Allow: /_/api/posts/*/related
Sitemap: https://medium.com/sitemap/sitemap.xml
Should I respect the robots.txt of the second URL?
Thanks for reading.
robots.txt files only indicate to crawlers what they should do; they can by no means force crawlers to behave differently. What Medium does will only stop polite and respectful crawlers.
You need to follow the redirects (if you are using curl, for example, there is an option for this) and you will reach the page you want. But if you do this on a massive scale, Medium might not be happy about it.
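If you script this rather than reach for curl -L, a minimal sketch using the third-party requests library (which follows redirects by default) shows the chain from the question resolving to the final page:

import requests

# Follow the redirect chain programmatically (the scripted equivalent of curl -L).
resp = requests.get("https://uber-developers.news/", timeout=10)
for hop in resp.history:           # the intermediate 3xx responses
    print(hop.status_code, hop.url)
print(resp.status_code, resp.url)  # final URL after all redirects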

Interpreting robots.txt vs. terms of use

I'm interested in scraping craigslist, solely for the purpose of data analysis for a blog post (i.e., no commercial or financial gain, no posting/emailing, no personal data collection, no sharing of data scraped). Their robots.txt file is the following:
User-agent: *
Disallow: /reply
Disallow: /fb/
Disallow: /suggest
Disallow: /flag
Disallow: /mf
Disallow: /eaf
I intend to visit none of these directories, only to view posts and then collect the text from the post body. This does not seem to be disallowed in the robots.txt file. However, the Craigslist terms of use have the following entry (relevant bit in bold):
USE. You agree not to use or provide software (except for general purpose web browsers and email clients, or software expressly licensed by us) or services that interact or interoperate with CL, e.g. for downloading, uploading, posting, flagging, emailing, search, or mobile use. Robots, spiders, scripts, scrapers, crawlers, etc. are prohibited, as are misleading, unsolicited, unlawful, and/or spam postings/email. You agree not to collect users' personal and/or contact information ("PI").
So should I assume that my bot is forbidden across the entire site, or just forbidden in the Disallowed directories in robots.txt? If it's the former, then what am I misunderstanding about the robots.txt file? If it's the latter, then may I assume that they will not ban my IP given that I abide by robots.txt?
They provide data in RSS format. At the bottom right there is an RSS link that will take you to ?format=rss.
For example: https://losangeles.craigslist.org/search/sss?format=rss
My guess would be that that sort of thing is really not allowed if you're redistributing the post content, collecting emails to spam, etc. It probably depends on how you use the data. If you're only gathering statistical information, maybe it's acceptable, but I really don't know. Probably a better question for a lawyer.
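For what it's worth, here is a minimal sketch of reading that feed with the third-party feedparser package (assuming the feed is still served at that URL), rather than scraping the HTML pages:

import feedparser

# Read the RSS results feed mentioned above.
feed = feedparser.parse("https://losangeles.craigslist.org/search/sss?format=rss")
for entry in feed.entries[:10]:    # first ten listings
    print(entry.title)
    print(entry.link)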

Disallow directory contents, but Allow directory page in robots.txt

Will this work to disallow pages under a directory, but still allow the directory's own URL?
Allow: /special-offers/$
Disallow: /special-offers/
to allow:
www.mysite.com/special-offers/
but block:
www.mysite.com/special-offers/page1
www.mysite.com/special-offers/page2.html
etc
Having looked at Google's very own robots.txt file, they are doing exactly what I was asking about.
At lines 136-137 they have:
Disallow: /places/
Allow: /places/$
So they are blocking anything under /places/ but allowing the root /places/ URL. The only difference from my syntax is the order, with the Disallow coming first.
Standards
According to the HTML 4.01 specification, Appendix B.4.1, the only values allowed in Disallow (no pun intended) are partial URIs (representing partial or full paths):
The "Disallow" field specifies a partial URI that is not to be visited. This can be a full path, or a partial path; any URI that starts with this value will not be retrieved. For example,
Disallow: /help disallows both /help.html and /help/index.html, whereas
Disallow: /help/ would disallow /help/index.html but allow /help.html.
I don't think anything has changed since then, since current HTML5 Specification Drafts don't mention robots.txt at all.
Extensions
However, in practice, many Robot Engines (such as Googlebot) are more flexible in what they accept. If you use, for instance:
Disallow: /*.gif$
then Googlebot will skip any file with the gif extension. I think you could do something like this to disallow all files under a folder, but I'm not 100% sure (you could test them with Google Webmaster Tools):
Disallow: /special-offers/*.*$
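Short of the Webmaster Tools tester, here is a rough local sketch of how Googlebot-style patterns behave: * is treated as "any run of characters", a trailing $ anchors the end of the URL, and the longest matching rule wins, with Allow winning a length tie. It only approximates Google's documented behaviour, so check against the real tools before relying on it.

import re

def to_regex(pattern):
    # "*" matches any run of characters; a trailing "$" anchors the end of the URL.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body.replace(r"\$", "$"))

def allowed(path, rules):
    # rules: list of (directive, pattern) tuples; longest match wins, Allow wins ties.
    matches = [r for r in rules if to_regex(r[1]).match(path)]
    if not matches:
        return True
    directive, _ = max(matches, key=lambda r: (len(r[1]), r[0] == "Allow"))
    return directive == "Allow"

rules = [("Allow", "/special-offers/$"), ("Disallow", "/special-offers/")]
for path in ("/special-offers/", "/special-offers/page1", "/special-offers/page2.html"):
    print(path, allowed(path, rules))
# Only /special-offers/ itself comes out as allowed.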
Other options
Anyway, you shouldn't rely on this too much (since each search engine might behave differently), so if possible it would be preferable to use meta tags or HTTP headers instead. For instance, you could configure your web server to include this header in all responses that should not be indexed (or followed):
X-Robots-Tag: noindex, nofollow
Search for the best way of doing it on your particular web server. Here's an example in Apache, combining mod_rewrite with mod_headers to conditionally set some headers depending on the URL pattern. Disclaimer: I haven't tested it myself, so I can't tell how well it works.
# all /special-offers/ sub-urls set env var ROBOTS=none
RewriteRule ^/special-offers/.+$ - [E=ROBOTS:none]
# if the env var ROBOTS is set, add the response header X-Robots-Tag: $ENV{ROBOTS}
# (Header, not RequestHeader: crawlers only see response headers)
Header set X-Robots-Tag "%{ROBOTS}e" env=ROBOTS
(Note: none is equivalent to noindex, nofollow)
