Why is Google Webmaster Tools completely misreading my robots.txt file? - wordpress

Below is the entire content of my robots.txt file.
User-agent: *
Disallow: /marketing/wp-admin/
Disallow: /marketing/wp-includes/
Sitemap: http://mywebsite.com/sitemap.xml.gz
It is apparently the one generated by WordPress; I haven't manually created one.
Yet when I signed up for Google Webmaster Tools today, this is the content that Google Webmaster Tools reports seeing:
User-agent: *
Disallow: /
...so ALL my URLs are blocked!
In WordPress, Settings > Reading > Search engine visibility: "Discourage search engines from indexing this site" is not checked. I unchecked it fairly recently. (Google Webmaster Tools tells me it downloaded my robots.txt file on Nov 13, 2013.)
...So why is it still reading the old version where all my pages are disallowed, instead of the new version?
Does it take a while? Should I just be patient?
Also, what is the ".gz" at the end of my sitemap line? I'm using the Yoast All-in-One SEO pack plugin. I'm thinking the plugin added the ".gz", whatever that is.

You can ask Googlebot to crawl again after you've changed your robots.txt. See Ask Google to crawl a page or site for information.
The Sitemap file tells Googlebot more about the structure of your site, and allows it to crawl more effectively. See About Sitemaps for more info.
The .gz just means the generated sitemap file is gzip-compressed; Googlebot reads compressed sitemaps without any problem.
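If you want to confirm what the compressed sitemap contains, a short script can fetch and decompress it. This is a minimal sketch, assuming Python is available; the URL is the placeholder from the question, so substitute your real domain:

import gzip
import urllib.request

# Placeholder sitemap URL from the question; substitute your real domain.
SITEMAP_URL = "http://mywebsite.com/sitemap.xml.gz"

with urllib.request.urlopen(SITEMAP_URL) as response:
    compressed = response.read()

# .gz means gzip compression; decompressing gives the plain sitemap XML.
xml_text = gzip.decompress(compressed).decode("utf-8")
print(xml_text[:500])  # show the start of the sitemap

If the decompressed XML lists your URLs as expected, the sitemap itself is fine, and the only thing left is waiting for Google to re-fetch your robots.txt.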

A WordPress discussion on this topic can be found here: https://wordpress.org/support/topic/robotstxt-wordpress-and-google-webmaster-tools?replies=5

Related

What is the meaning of this on a site's Robots.txt page?

I've been trying to scrape a website's data to build a game out of the database, and I'm frequently getting blocked with a CAPTCHA request. When I checked the robots.txt file for the site, I saw this:
Disallow: /a/
Disallow: /contact-us/
What is the meaning of this?
According to Google's docs:
A robots.txt file tells search engine crawlers which pages or files the crawler can or can't request from your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google.
In this case, Disallow: /a/ and Disallow: /contact-us/ simply ask compliant crawlers not to request URLs under those two paths. They don't technically block anyone, and they are unrelated to the CAPTCHA you are hitting, which is a separate anti-scraping measure.
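To see concretely what those two lines do, you can test URLs against the rules with Python's standard-library robots.txt parser. A small sketch, with example.com and the paths standing in for the real site:

import urllib.robotparser

# Stand-in rules matching the two Disallow lines from the question.
rules = """User-agent: *
Disallow: /a/
Disallow: /contact-us/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler may fetch the homepage, but nothing under /a/
# or /contact-us/.
print(parser.can_fetch("*", "https://example.com/"))             # True
print(parser.can_fetch("*", "https://example.com/a/page.html"))  # False
print(parser.can_fetch("*", "https://example.com/contact-us/"))  # False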

robots.txt file being overridden / injected from external source?

We have a couple of WordPress sites with this same issue. They appear to have a "robots.txt" file with the following contents:
User-Agent: *
Crawl-Delay: 300
User-Agent: MJ12bot
Disallow: /
User-agent: MegaIndex.ru
Disallow: /
User-agent: megaindex.com
Disallow: /
We have absolutely no idea where this robots.txt file is coming from.
We have looked and there is definitely no "robots.txt" file in the public_html root folder or any sub-folder that we can see.
We have deactivated every single plugin on the site and even changed themes, but the robots.txt file remains exactly the same. It seems as though it is somehow being injected into the site from an external source!
We have been assured that it couldn't be coming from Google Tag Manager.
Just wondering if anyone happens to recognise the above robots.txt contents and knows how it is getting onto our sites?
You have a few possibilities.
Some security plugins (Wordfence, iThemes, etc.) actually add files to your site. These files don't generally go away when you just "disable" the plugins; they need to be actually removed/uninstalled, and sometimes you have to go through and delete them manually.
WordPress will generate a virtual robots.txt when no physical file exists.
If Google has cached an old copy, you can go in and tell Google to fetch the robots.txt again.
You should also be able to override it by creating your own robots.txt file and putting it in the site root, or by using another plugin to do it.
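One way to narrow down where the file is coming from is to compare what the web server actually returns for /robots.txt with whatever is (or isn't) on disk. A rough sketch in Python; the site URL and the docroot path are placeholders for your own values:

import os
import urllib.request

SITE = "https://example.com"                # placeholder: your site's URL
LOCAL = "/path/to/public_html/robots.txt"   # placeholder: your docroot

with urllib.request.urlopen(SITE + "/robots.txt") as response:
    served = response.read().decode("utf-8", errors="replace")

on_disk = ""
if os.path.exists(LOCAL):
    with open(LOCAL, encoding="utf-8") as fh:
        on_disk = fh.read()

if served.strip() == on_disk.strip():
    print("Served robots.txt matches the file on disk.")
elif not on_disk:
    print("No physical robots.txt: the response is generated elsewhere")
    print("(WordPress's virtual robots.txt, a plugin, or the server config).")
else:
    print("Served robots.txt differs from the file on disk, so something")
    print("in front of the site (server config or proxy) is overriding it.")

If there is no file on disk and the served contents are not something WordPress or your plugins would generate, the remaining suspect is the server configuration itself.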
It turns out it was a generic robots.txt file that our server administrator had set up to be injected into every site on our server, to prevent the server being overloaded by those particular bots (which we had been having trouble with).

Keeping robots.txt blank

I have a couple of WordPress sites, and with the current Google SEO algorithm update a site should be mobile-friendly (here).
My query is as follows. Currently I have written rules in robots.txt to disallow crawling of the URLs containing wp-:
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /feed
Disallow: /*/feed
Disallow: /wp-login.php
I don't want Google to crawl the above URLs. Earlier it was working fine, but now, with the recent Google algorithm update, disallowing these URLs causes errors in the mobile-friendly test (here), because all my CSS and JS are behind the wp- URLs. I am wondering how I can fix this.
Any suggestions appreciated.
If you keep the crawler away from those files, your page may look and work differently for Google than it does for your visitors. This is what Google wants to avoid.
There is no problem in allowing Google to access the CSS or JS files: anyone who can open your HTML source and follow the links can access them anyway.
Google therefore explicitly wants to access the CSS and JS files used on your page:
https://developers.google.com/webmasters/mobile-sites/mobile-seo/common-mistakes/blocked-resources?hl=en
Those files are needed to render your pages.
If your site’s robots.txt file disallows crawling of these assets, it directly harms how well our algorithms render and index your content. This can result in suboptimal rankings.
If you are dependent on mobile rankings, you must follow Google's guidelines. If not, feel free to block the crawler.
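You can check which asset URLs the rules above block, and verify a possible fix, with Python's standard-library robots.txt parser. This is only a sketch: the host and the stylesheet path are made up, and the Allow lines show one common way of reopening the JS/CSS directories while keeping the admin area blocked.

import urllib.robotparser

def blocked(rules, url, agent="Googlebot"):
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(rules.splitlines())
    return not parser.can_fetch(agent, url)

# Rules from the question: everything under /wp-includes is blocked,
# including the CSS and JS that Google needs to render the page.
original = """User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
"""

# One possible fix: explicitly allow the asset directories while
# leaving the admin area disallowed.
relaxed = """User-agent: *
Allow: /wp-includes/js/
Allow: /wp-includes/css/
Allow: /wp-content/plugins/
Disallow: /wp-admin
Disallow: /wp-includes
"""

css = "https://example.com/wp-includes/css/style.css"  # made-up asset URL
print(blocked(original, css))  # True  -> the renderer cannot load this CSS
print(blocked(relaxed, css))   # False -> the stylesheet is crawlable again

After changing robots.txt, re-run Google's mobile-friendly test (or Fetch as Google) to confirm the blocked-resources warnings disappear.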

Robots.txt: ALLOW Google Fonts

I've been testing my website with Google Webmaster Tools, and when I tried to "fetch it as Googlebot" I got a "Partial" status and a note that three EXTERNAL CSS files, namely 3 Google Fonts stylesheets, had been blocked for some reason by robots.txt.
Now, here's my file:
User-agent: *
Disallow:
Disallow: /cgi-bin/
Sitemap: http://example.com/sitemapindex.xml
Is there something wrong with it that might be preventing access to said files?
Thanks!
If robots.txt is blocking external CSS files, then it will be the robots.txt for the server hosting those files, not the one for your main hostname.
I don't know why you would worry about Googlebot being unable to read your stylesheets though.
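If you want to confirm which robots.txt is responsible, you can test the blocked URL against the robots.txt of the host that serves it. A rough sketch; the font URL is a made-up example of the kind of external stylesheet Webmaster Tools reports:

import urllib.robotparser
from urllib.parse import urlparse

# Made-up example of an external stylesheet reported as blocked.
blocked_url = "https://fonts.googleapis.com/css?family=Open+Sans"

# Check the robots.txt of the host serving that file, not your own site's.
parts = urlparse(blocked_url)
parser = urllib.robotparser.RobotFileParser()
parser.set_url(parts.scheme + "://" + parts.netloc + "/robots.txt")
parser.read()

print(parser.can_fetch("Googlebot", blocked_url))

If this prints False, the block comes from the third-party host's own robots.txt, and nothing you change in your file will affect it.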

Getting all posts from a blog (wordpress or blogger)

This is assuming that direct access to an API is not available. Since I am requesting ALL posts, I am not sure RSS would help much.
I considered a simple system that would loop through each year and month and download each HTML archive page, changing the following URL for each year/month pair. This works for WordPress and Blogger blogs.
http://www.lostincheeseland.com/2011/05
However, is there a way to use the following search function provided by Blogger to return all posts? I have played around with it, but the documentation seems sparse.
http://www.lostincheeseland.com/search?updated-max=2012-08-17T09:44:00%2B02:00&max-results=6
Are there other methods I have not considered?
What you're looking for is a sitemap.
First of all, you're writing a bot so it's good manners to check the blog's robots.txt file. And lo and behold, you'll often find a sitemap mentioned there. Here's an example from the Google blog:
User-agent: Mediapartners-Google
Disallow:
User-agent: *
Disallow: /search
Allow: /
Sitemap: http://googleblog.blogspot.com/feeds/posts/default?orderby=UPDATED
In this case, you can visit the Sitemap URL to get an xml sitemap.
For WordPress, the same applies, but it's not built in as standard, so not all blogs will have it. Have a look at this plugin, which is the most popular way to create these sitemaps in WordPress. For example, my blog uses it, and you can find the sitemap at /sitemap.xml (the standard location).
In short:
Check robots.txt
Follow the Sitemap URL if it's present
Otherwise, check for /sitemap.xml
Also: be a good Internet citizen! If you're going to write a bot, make sure it obeys the robots.txt file (like where blogspot tells you explicitly not to use /search!)
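Putting those steps together, the discovery logic might look roughly like this in Python. The blog URL is the one from the question; the /sitemap.xml fallback and the crude tag scraping are assumptions rather than a full sitemap client, and note that on Blogger the Sitemap line may point at the posts feed rather than a classic XML sitemap, as in the example above.

import re
import urllib.request

BLOG = "http://www.lostincheeseland.com"   # blog URL from the question

def fetch(url):
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

# Step 1: check robots.txt for Sitemap: lines.
robots = fetch(BLOG + "/robots.txt")
sitemaps = re.findall(r"(?im)^sitemap:\s*(\S+)", robots)

# Step 2: fall back to the conventional location if none are declared.
if not sitemaps:
    sitemaps = [BLOG + "/sitemap.xml"]

# Step 3: pull the post URLs out of each sitemap (crude <loc> scrape;
# a real bot should parse the XML properly and honour the Disallow rules).
for sitemap_url in sitemaps:
    xml = fetch(sitemap_url)
    for loc in re.findall(r"<loc>(.*?)</loc>", xml):
        print(loc)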
