Reading robots.txt file? - web-scraping

I am trying to webscrape a website and their robots.txt file says this:
(where zoeksuggestie is search suggestion in english)
User-agent: *
# Miscellaneous
Disallow: /mijn/
Disallow: /*/print/*
Disallow: /koop/zoeksuggestie/
Disallow: /huur/zoeksuggestie/
Disallow: /nieuwbouw/zoeksuggestie/
Disallow: /recreatie/zoeksuggestie/
Disallow: /europe/zoeksuggestie/
Disallow: /*/brochure/download/
Disallow: */uitgebreid-zoeken/*
Disallow: /makelaars/*/woningaanbod/*
Disallow: /zoekwidget/*
Allow: /zoekwidget/$
Disallow: /relatedobjects
Disallow: /mijn/huis/wonen/toevoegen/
Disallow: /*/woningrapport/
# Prevent bots from indexing combinations of locations
Disallow: /koop/*,*
Disallow: /huur/*,*
Disallow: /nieuwbouw/*,*
Disallow: /recreatie/*,*
Disallow: /europe/*,*
Does this mean I can't scrape any link that matches /koop/*,*? What does the *,* mean? I really need to get data from this website for a project, but I keep getting blocked when using Scrapy/Beautiful Soup.

The robots.txt file is part of the “Robots exclusion standard”. Whenever a well-behaved bot visits a website, it checks the robots.txt file to see what it may not access. Google uses it to decide which URLs not to crawl (a blocked URL can still show up in results if other pages link to it).
Complying with robots.txt is, however, not mandatory.
The * is a wildcard, so /koop/*,* matches any path that starts with /koop/ and has a comma somewhere after it, i.e. /koop/[wildcard],[wildcard].
Here is a great guide on wildcards in robots.txt: https://geoffkenyon.com/how-to-use-wildcards-robots-txt/
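To make the wildcard rule concrete, here is a rough sketch of how such a pattern can be checked in Python. It assumes the Google-style semantics (* matches any run of characters, a trailing $ pins the end of the path, everything else is a literal prefix match). The function names and the example paths are made up for illustration and are not taken from the site in question.

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a regex anchored at the start.

    '*' becomes '.*'; a trailing '$' anchors the end; the rest is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' where '*' stood.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path: str, pattern: str) -> bool:
    """True if the path matches the Disallow pattern."""
    return robots_pattern_to_regex(pattern).match(path) is not None


# Hypothetical paths: a combined-location search versus a single location.
print(is_blocked("/koop/amsterdam,utrecht/", "/koop/*,*"))  # True
print(is_blocked("/koop/amsterdam/", "/koop/*,*"))          # False
print(is_blocked("/zoekwidget/", "/zoekwidget/$"))          # True
```

So only the /koop/ URLs containing a comma fall under that rule; a plain /koop/ listing page does not.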
You mentioned Scrapy not working. That is most likely because projects created with scrapy startproject obey robots.txt by default (the generated settings.py sets ROBOTSTXT_OBEY = True). This can be disabled in the settings; that question has been answered here: getting Forbidden by robots.txt: scrapy
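For reference, the Scrapy setting that controls this is ROBOTSTXT_OBEY. A minimal sketch, assuming a project laid out by scrapy startproject (so the line lives in that project's settings.py):

```python
# settings.py of a hypothetical Scrapy project.
# With this set to False, Scrapy no longer filters requests against
# the site's robots.txt before downloading them.
ROBOTSTXT_OBEY = False
```

Note that this only stops Scrapy from filtering its own requests. If the site itself is blocking you, this setting will not change that.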

Related

Google index: robots.txt to stop wp uploads indexing

I have a WordPress site that is being indexed by Google, but Google is picking up images as search results, i.e. if I do site:mysite.com I see loads of results which, when clicked on, just go to images from wp-content/uploads/
How do I stop these from coming up in search results, whilst still allowing them in google images?
I've made changes to my robots.txt so the first bit reads:
User-agent:*
Noindex: /product-tag/*
Noindex: /product-tag/
Noindex: /wp-content/uploads/*
Noindex: /forum/profile/*
Noindex: /my-account/*
Noindex: /my-account/
Noindex: /?s=*
Noindex: /tag/*
Disallow: /wp-admin/
Disallow: /wp-content/uploads/*
Disallow: /product-tag/*
Disallow: /product-tag/
Disallow: /forum/profile/*
Disallow: /my-account/*
Disallow: /my-account/
Disallow: /?s=*
Disallow: /tag/*
Allow: /shop/*
Allow: /product-category/*
User-agent: Googlebot-image
Allow: /
Disallow: /wp-admin/
I guess my question is, is this ok or am I doing something wrong? If it is right, how do I get google to realize that some results shouldn't be in the index any more?
I'm aware that I can request removal of pages individually but there is a large amount so I'd rather re-index my entire site if that's the right way to go.
Answer:
User-agent: Googlebot-Image
Disallow: /*.gif$
Disallow: /*.png$
The error is in your robots.txt: you allowed Googlebot-image to index your images:
User-agent: Googlebot-image
Allow: /
Disallow: /wp-admin/
Refer to this: https://support.google.com/webmasters/answer/35308?hl=en

Stylesheet Blocked by Google after Rendering

When rendering my website in Google Search Console (Fetch as Google), I discovered that my site's stylesheets are blocked. The fetch status says "Partial". I suspect this is the reason my site is not showing up in Google Search results.
However, after adding my website to the Bing Search engine, it shows up in the search results after an hour. Hence, my website doesn't face any indexing issues in Bing.
How do I allow Googlebot to access my site's CSS files?
My site runs on Magento. Here are the fetch details from Search Console:
http://www.cherryconcept.com/skin/frontend/default/ma_sahara_fashion5/css/styles.css3.php?url=http://www.cherryconcept.com/skin/frontend/default/ma_sahara_fashion5/ Style Sheet Blocked robots.txt Tester
http://www.cherryconcept.com/skin/frontend/default/ma_sahara_fashion5/css/bootstrap.css Style Sheet Blocked robots.txt Tester
http://www.cherryconcept.com/skin/frontend/default/ma_sahara_fashion5/css/bootstrap-theme.css Style Sheet Blocked robots.txt Tester
http://www.cherryconcept.com/skin/frontend/default/ma_sahara_fashion5/css/font-awesome.css Style Sheet Blocked robots.txt Tester
http://www.cherryconcept.com/skin/frontend/default/ma_sahara_fashion5/css/font-awesome.min.css Style Sheet Blocked robots.txt Tester
http://www.cherryconcept.com/skin/frontend/default/ma_sahara_fashion5/css/styles.css Style Sheet Blocked robots.txt Tester
http://www.cherryconcept.com/skin/frontend/base/default/css/widgets.css Style Sheet Blocked robots.txt Tester
http://www.cherryconcept.com/skin/frontend/default/ma_sahara_fashion5/magentothem/fancybox/jquery.fancybox.css Style Sheet Blocked robots.txt Tester
http://www.cherryconcept.com/skin/frontend/default/ma_sahara_fashion5/magentothem/ajaxcartsuper/ajax_cart_super.css Style Sheet Blocked robots.txt Tester
http://www.cherryconcept.com/skin/frontend/default/ma_sahara_fashion5/magentothem/css/categorytabsliders.css Style Sheet Blocked robots.txt Tester
http://www.cherryconcept.com/skin/frontend/default/ma_sahara_fashion5/magentothem/css/custommenu.css Style Sheet Blocked robots.txt Tester
http://www.cherryconcept.com/skin/frontend/default/ma_sahara_fashion5/magentothem/imagerotator/effect.css Style Sheet Blocked robots.txt Tester
http://www.cherryconcept.com/skin/frontend/default/ma_sahara_fashion5/layerednavigationajax/jquery-ui.css Style Sheet Blocked robots.txt Tester
http://www.cherryconcept.com/skin/frontend/default/ma_sahara_fashion5/magentothem/css/ma.upsellslider.css Style Sheet Blocked robots.txt Tester
http://www.cherryconcept.com/skin/frontend/default/ma_sahara_fashion5/magentothem/css/ma.verticalmenu.css Style Sheet Blocked robots.txt Tester
http://www.cherryconcept.com/skin/frontend/default/ma_sahara_fashion5/magentothem/css/ma.banner7.css Style Sheet Blocked robots.txt Tester
http://www.cherryconcept.com/skin/frontend/default/ma_sahara_fashion5/magentothem/css/ma.brandslider.css Style Sheet Blocked robots.txt Tester
http://www.cherryconcept.com/skin/frontend/default/ma_sahara_fashion5/css/print.css Style Sheet Blocked robots.txt Tester
http://www.cherryconcept.com/skin/frontend/default/ma_sahara_fashion5/images/logo.png Image Blocked robots.txt Tester
And here is my robots.txt:
# Google Image Crawler Setup
User-agent: Googlebot-Image
Disallow:
# Crawlers Setup
# Directories
Disallow: /404/
Disallow: /app/
Disallow: /cgi-bin/
Disallow: /downloader/
Disallow: /errors/
Disallow: /includes/
#Disallow: /js/
#Disallow: /lib/
Disallow: /magento/
#Disallow: /media/
Disallow: /pkginfo/
Disallow: /report/
Disallow: /scripts/
Disallow: /shell/
Disallow: /skin/
Disallow: /stats/
Disallow: /var/
# Paths (clean URLs)
Disallow: /index.php/
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/
Disallow: /catalogsearch/
#Disallow: /checkout/
Disallow: /control/
Disallow: /contacts/
Disallow: /customer/
Disallow: /customize/
Disallow: /newsletter/
Disallow: /poll/
Disallow: /review/
Disallow: /sendfriend/
Disallow: /tag/
Disallow: /wishlist/
Disallow: /catalog/product/gallery/
# Files
Disallow: /cron.php
Disallow: /cron.sh
Disallow: /error_log
Disallow: /install.php
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /STATUS.txt
# Paths (no clean URLs)
Disallow: /*.php$
Disallow: /*?SID=
The robots.txt you posted doesn't seem to be the one you're currently using. Your current robots.txt contains:
User-agent: Googlebot
Allow: /blocked-folder/css/
Allow: /blocked-folder/java/
...
Disallow: /skin/
So you're explicitly telling Googlebot not to fetch files in the /skin directory.
Edit: You also have the following rule:
Disallow: /*.php$
which disallows every URL ending in .php. Remove these lines and you'll be fine.
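If you want to check a rule like Disallow: /skin/ without waiting for Search Console, Python's standard urllib.robotparser can evaluate it locally. This is only a sketch: the host is a placeholder, the rules are reduced to the one line discussed above, and as far as I know this parser does plain prefix matching, so it will not interpret wildcard rules such as /*.php$ the way Googlebot does.

```python
from urllib.robotparser import RobotFileParser

# Reduced rule set: just the line that blocks the theme's CSS.
rules = """\
User-agent: Googlebot
Disallow: /skin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Placeholder host; the path is one of the blocked stylesheets from the question.
css_url = "http://www.example.com/skin/frontend/base/default/css/widgets.css"
home_url = "http://www.example.com/"

print(parser.can_fetch("Googlebot", css_url))   # False: under /skin/
print(parser.can_fetch("Googlebot", home_url))  # True: no rule matches
```

Once the Disallow: /skin/ line is removed from the rules, the first call should return True as well.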

Website Duplicate content detected with google webmaster

We have a website that is based on CodeIgniter, with a WordPress blog in a subdirectory, /blog.
Through Google Webmaster Tools and search results, we are seeing duplicate content, mainly for our home page, with the following shown after the domain name.
So, for example, a Google search for site:domainname.com shows:
domainname.com/?author=1
domainname.com/?author=2
domainname.com/?cat=1
domainname.com/?cat=3
domainname.com/?cat=4
/?feed=rss2&tag=drinking-establishments
/?feed=rss2&tag=fun
/?feed=rss2&tag=introduction
These all appear to be generated by the WordPress blog, and we are not sure how to fix this.
You could use a robots.txt file to tell Google what they should (and shouldn't) be looking for on your site.
A robots.txt file should live here: example.com/robots.txt
An example robots.txt as taken from the WordPress Codex:
http://codex.wordpress.org/Search_Engine_Optimization_for_WordPress
Sitemap: http://www.example.com/sitemap.xml
# Google Image
User-agent: Googlebot-Image
Disallow:
Allow: /*
# Google AdSense
User-agent: Mediapartners-Google
Disallow:
# digg mirror
User-agent: duggmirror
Disallow: /
# global
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Disallow: /trackback/
Disallow: /feed/
Disallow: /comments/
Disallow: /category/*/*
Disallow: */trackback/
Disallow: */feed/
Disallow: */comments/
Disallow: /*?
Allow: /wp-content/uploads/
Background reading:
http://en.wikipedia.org/wiki/Robots_exclusion_standard

How to set up robots.txt file for WordPress

[UPDATE 2013]
I can't find an authoritative page with a format for the robots.txt file for WordPress. I promise to maintain one on my site, but I want one here on Stack Overflow.
If you know what you're doing, please check the current draft here:
http://mast3rpee.tk/?p=127
Everyone else comment on this:
robots.txt
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Disallow: /trackback/
Disallow: /feed/
Disallow: /comments/
Sitemap: http://domain.com/sitemap.xml
Crawl-delay: 4
User-agent: *
Allow: /
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content
Disallow: /e/
Disallow: /show-error-*
Disallow: /xmlrpc.php
Disallow: /trackback/
Disallow: /comment-page-
Allow: /wp-content/uploads/
User-agent: Mediapartners-Google
Allow: /
User-agent: Adsbot-Google
Allow: /
User-agent: Googlebot-Image
Allow: /
User-agent: Googlebot-Mobile
Allow: /
Sitemap: http://yoursite.com/sitemap.xml
I think this is a solid robots.txt file. Just go to public_html, create a file named robots.txt, and paste the code above.
You can create it in Notepad: copy the code above and paste it into Notepad, but remember that the file name should be robots.txt, then upload it to your public_html.
As with all things SEO, things change. I think that the current advice is to have a very minimal robots.txt file.
Disallowing wp-admin, wp-includes, wp-content, etc. may prevent Google from rendering pages correctly, which it doesn't like.
Check out this article by Yoast: https://yoast.com/wordpress-robots-txt-example/.
Create robots.txt in Notepad and upload it to public_html in cPanel.
*Remember to rename the Notepad file to robots.txt before you upload it to public_html.
It's not safe to block much in your robots.txt nowadays, since Google tries to load all assets to determine "mobile friendliness." At minimum you can block /wp-admin. Here's a more detailed, current answer to the question at the StackExchange forum for WordPress.

Duplicated content in Google. SEO for Drupal

I have a Drupal site that is up and running. The site is not properly optimized for SEO, and there is a lot of duplicate content that gets generated in Google because of /category, /taxonomy, etc.
The structure is:
/var/www/appname/ This contains a custom built application
/var/www/appname/drup This contains my drupal installation
I went through the site results in a Google search for site:appname.com and saw that there is a lot of duplicated content because of /content, /taxonomy, /node, etc.
My robots.txt in /var/www/appname already contains the following, but I am surprised that the pages are still getting indexed. Please advise.
User-agent: *
Crawl-delay: 10
Allow: /
Allow: /drup/
# Directories
Disallow: /drup/includes/
Disallow: /drup/misc/
Disallow: /drup/modules/
Disallow: /drup/profiles/
Disallow: /drup/scripts/
Disallow: /drup/themes/
# Files
Disallow: /drup/CHANGELOG.txt
Disallow: /drup/cron.php
Disallow: /drup/INSTALL.mysql.txt
Disallow: /drup/INSTALL.pgsql.txt
Disallow: /drup/install.php
Disallow: /drup/INSTALL.txt
Disallow: /drup/LICENSE.txt
Disallow: /drup/MAINTAINERS.txt
Disallow: /drup/update.php
Disallow: /drup/UPGRADE.txt
Disallow: /drup/xmlrpc.php
# Paths (clean URLs)
Disallow: /drup/admin/
Disallow: /drup/comment/reply/
Disallow: /drup/contact/
Disallow: /drup/logout/
Disallow: /drup/node/add/
Disallow: /drup/search/
Disallow: /drup/user/register/
Disallow: /drup/user/password/
Disallow: /drup/user/login/
# Paths (no clean URLs)
Disallow: /drup/?q=admin/
Disallow: /drup/?q=comment/reply/
Disallow: /drup/?q=contact/
Disallow: /drup/?q=logout/
Disallow: /drup/?q=node/add/
Disallow: /drup/?q=search/
Disallow: /drup/?q=user/password/
Disallow: /drup/?q=user/register/
Disallow: /drup/?q=user/log
You just need an XML sitemap that tells Google where all the pages are, rather than letting Google crawl it on its own.
In fact, when Stack Overflow was in beta, they tried to let the crawler work its magic. However, on highly dynamic sites, it's almost impossible to get adequate results in this fashion.
Thus, with the XML sitemap you tell Google where each page is and what its priority is and how often it changes.
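As a rough illustration of what such a sitemap contains, here is a small Python sketch that writes one using the standard sitemaps.org format (loc, changefreq, priority). The URLs and values are placeholders, and build_sitemap is just a helper name for this example. On a real Drupal site you would more likely let a module generate the file.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, changefreq, priority) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, changefreq, priority in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "changefreq").text = changefreq
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Placeholder pages: list each canonical URL once, not its duplicates.
pages = [
    ("http://www.example.com/drup/", "daily", 1.0),
    ("http://www.example.com/drup/node/1", "weekly", 0.8),
]
print(build_sitemap(pages))
```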
There are several modules that take care of SEO and duplicated content.
I would first advise installing and going over http://drupal.org/project/seo_checklist
For duplicated content you may check http://drupal.org/project/globalredirect
Anyway, /taxonomy and /content are just lists; instead of disallowing them, you may want to override their paths with some sort of custom content and let crawlers know what they are looking at.
You can disallow the directories that are showing duplicate content. As you explained, /content, /taxonomy, and /node are showing duplicate content.
Add the following lines to the Directories section of the robots.txt file to restrict search engines' access to these directories.
Disallow: /drup/content/
Disallow: /drup/taxonomy/
Disallow: /drup/node/
Do you have the ability to verify ownership of the site with Google Webmaster Tools at:
http://www.google.com/webmasters/tools
If so, I'd recommend doing that, then trying "Fetch as Googlebot" under the "Diagnostics" category for that site. Your "Fetch Status" will indicate "Denied by robots.txt" if your robots.txt is working as expected.
Indexed pages can hang around for a while and display in Google search results after you've changed the robots.txt. But Fetch as Googlebot gives you a real-time indication of what's happening when Googlebot comes knocking.
If the URLs that you don't want indexed are retrieved without a problem, then you'll need to focus on problems with robots.txt...where it's at, syntax, paths listed, etc. I always suggest people retrieve it manually in the browser (at the root of their web site) to double-check against obvious goofs.
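On the "where it's at" point: crawlers look for robots.txt at the root of the host, not inside a subdirectory, which is why the rules above carry the /drup/ prefix. A tiny sketch of how to derive the location to check from any page URL (the function name and the example page are made up for illustration):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the host serving page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical Drupal page under the /drup/ subdirectory.
print(robots_txt_url("http://appname.com/drup/node/12"))
# http://appname.com/robots.txt
```

If opening that URL in the browser does not show the file you edited, the crawler is not seeing it either.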

Resources