Keeping robots.txt blank - wordpress

I have a couple of WordPress sites, and with the current Google SEO algorithm update a site should be mobile-friendly.
My question is as follows. Currently I have a rule set in robots.txt that disallows crawling of the URLs with wp-:
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /feed
Disallow: /*/feed
Disallow: /wp-login.php
I don't want Google to crawl the above URLs. Earlier this worked fine, but since the recent Google algorithm update, disallowing these URLs produces errors in the mobile-friendly test, because all my CSS and JS are behind the wp- URLs. How can I fix this?
Any suggestions appreciated.

If you keep the crawler away from those files, your page may look and work differently for Google than it does for your visitors. This is what Google wants to avoid.
There is no harm in allowing Google to access the CSS or JS files, since anyone who can open your HTML source and read the links can access them too.
Therefore Google definitely wants to access the CSS and JS files used on your page:
https://developers.google.com/webmasters/mobile-sites/mobile-seo/common-mistakes/blocked-resources?hl=en
Those files are needed to render your pages.
If your site’s robots.txt file disallows crawling of these assets, it directly harms how well our algorithms render and index your content. This can result in suboptimal rankings.
If you depend on mobile rankings, you must follow Google's guidelines. If not, feel free to block the crawler.
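To see which assets the rules from the question actually block, a minimal sketch using Python's standard urllib.robotparser may help. That module only does prefix matching, so the wildcard rule /*/feed is left out here, and the asset paths and the helper name blocked_paths are hypothetical, chosen for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Prefix rules copied from the question. The wildcard rule is omitted
# because urllib.robotparser does not implement wildcards.
RULES = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /feed
Disallow: /wp-login.php
"""

# Hypothetical asset paths, not taken from the question.
ASSETS = [
    "/wp-includes/js/jquery/jquery.js",
    "/wp-content/plugins/example-plugin/style.css",
    "/wp-content/themes/example-theme/style.css",
]


def blocked_paths(rules: str, paths: list[str], agent: str = "Googlebot") -> list[str]:
    """Return the paths that the given robots.txt rules disallow for agent."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return [path for path in paths if not parser.can_fetch(agent, path)]


if __name__ == "__main__":
    for path in blocked_paths(RULES, ASSETS):
        print("blocked:", path)
```

Dropping the /wp-includes and /wp-content/plugins lines from RULES makes the list come back empty, which is the change the answer is asking for.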

Related

Disallow URL with specific querystring from crawl using robots.txt

My client has an ASP.NET MVC web application that also has a WordPress blog in a subfolder.
https://www.example.com/
https://www.example.com/wordpress
The WordPress site is loaded with some social sharing links that I do not want crawlers to index. For example:
https://www.example.com/wordpress/some-post/?share=pinterest
First thing, should there be a robots.txt in the / folder and also one in the /wordpress folder? Or just a single one in the / folder? I've tried both without any success.
In my robots.txt file I've included the following:
User-agent: Googlebot
Disallow: ?share=pinterest$
I've also tried several variations like:
Disallow: /wordpress/*/?share=pinterest
No matter what rule I have in robots.txt, I'm not able to get crawlers to stop trying to index these social sharing links. The plugin that creates these sharing links is also making them "nofollow noindex noreferer", but since they are all internal links it causes issues due to blocking internal "link juice".
How do I write a Disallow rule that stops crawlers from indexing any link inside this site that ends with ?share=pinterest?
Should both sites have a robots.txt or only one in the main/root folder?
robots.txt should only be at the root of the domain. https://example.com/robots.txt is the correct URL for your robots.txt file. Any robots.txt file in a subdirectory will be ignored.
By default, robots.txt rules are all "starts with" rules. Only a few major bots such as Googlebot support wildcards in Disallow: rules. If you use wildcards, the rules will be obeyed by the major search engines but ignored by most less sophisticated bots.
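As a rough sketch of how a wildcard-aware crawler might evaluate a single rule, assuming Google-style semantics where * matches any run of characters and a trailing $ anchors the end of the URL. The function name rule_matches is hypothetical, and this is not Googlebot's actual implementation.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Check one Disallow/Allow pattern against a URL path plus query string.

    Assumed semantics: the pattern is a "starts with" match, '*' matches
    any sequence of characters, and a trailing '$' pins the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each wildcard.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, which gives the "starts with" behaviour.
    return re.match(regex, path) is not None


if __name__ == "__main__":
    rule = "/wordpress/*/?share=pinterest"
    print(rule_matches(rule, "/wordpress/some-post/?share=pinterest"))
    print(rule_matches(rule, "/wordpress/some-post/"))
```

A bot that ignores wildcards would instead treat the * as a literal character, so the same rule would match nothing on this site.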
Using nofollow on those links isn't really going to affect your internal link juice. Those links are all going to be external redirects that will either pass PageRank out of your site, or if you block that PageRank somehow, it will evaporate. Neither external linking nor PageRank evaporation hurts the SEO of the rest of your site, so it doesn't really matter from an SEO perspective what you do. You can allow those links to be crawled, use nofollow on those links, or disallow those links in robots.txt. It won't change how the rest of your site is ranked.
robots.txt also has the disadvantage that search engines occasionally index disallowed pages. robots.txt blocks crawling, but it doesn't always prevent indexing. If any of those URLs get external links, Google may index the URL with the anchor text of the links it finds to them.
If you really want to hide the social sharing from search engine bots, you should have the functionality handled with onclick events. Something like:
<a onclick="pintrestShare()">Share on Pinterest</a>
Where pintrestShare is a JavaScript function that uses location.href to set the URL of the page to the Pinterest share URL for the current URL.
To directly answer your question about robots.txt, this rule is correct:
User-agent: *
Disallow: /wordpress/*/?share=pinterest
You can use Google's robots.txt testing tool to verify that it blocks your URL.
You have to wait 24 hours after making robots.txt changes before bots start obeying the new rules. Bots often cache your old robots.txt for a day.
You may have to wait weeks for new results to show in your webmaster tools and search console accounts. Search engines won't report new results until they get around to re-crawling pages, realize the requests are blocked, and that information makes it back to their webmaster information portals.

What is the meaning of this on a site's Robots.txt page?

I've been trying to scrape a website's data to build a game out of the database and I'm frequently getting blocked with a CAPTCHA request. When I checked the Robots.txt file for the site, I see this:
Disallow: /a/
Disallow: /contact-us/
What is the meaning of this?
According to the Google docs:
A robots.txt file tells search engine crawlers which pages or files
the crawler can or can't request from your site. This is used mainly
to avoid overloading your site with requests; it is not a mechanism
for keeping a web page out of Google.
Here, the two Disallow lines ask crawlers not to request any URL whose path starts with /a/ or /contact-us/.

Robots.txt: ALLOW Google Fonts

I've been testing my website with Google Webmaster Tools, and when I tried to "fetch it as Googlebot" I got a "Partial" status and a note that three EXTERNAL CSS files, namely three Google fonts, had been blocked for some reason by robots.txt.
Now, here's my file:
User-agent: *
Disallow:
Disallow: /cgi-bin/
Sitemap: http://example.com/sitemapindex.xml
Is there something wrong with it that might be preventing access to said files?
Thanks!
If robots.txt is blocking external CSS files, then it will be the robots.txt for the server hosting those files, not the one for your main hostname.
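To find out which robots.txt applies to a given resource, one can derive it from the resource's own host. A small sketch, where the helper name robots_url_for and the font stylesheet URL are assumptions made for illustration (the question does not give the actual URLs):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(resource_url: str) -> str:
    """Return the robots.txt URL that governs resource_url.

    robots.txt is looked up at the root of the host serving the resource,
    so only the scheme and host of the original URL are kept.
    """
    parts = urlsplit(resource_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # Hypothetical external stylesheet URL, used only as an example.
    print(robots_url_for("https://fonts.googleapis.com/css?family=Open+Sans"))
    print(robots_url_for("http://example.com/some/page.html"))
```

The first call points at the font host's robots.txt rather than the one on example.com, which is why editing your own file has no effect on those three stylesheets.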
I don't know why you would worry about Googlebot being unable to read your stylesheets though.

Why is Google Webmaster Tools completely misreading my robots.txt file?

Below is the entire content of my robots.txt file.
User-agent: *
Disallow: /marketing/wp-admin/
Disallow: /marketing/wp-includes/
Sitemap: http://mywebsite.com/sitemap.xml.gz
It is the one apparently generated by Wordpress. I haven't manually created one.
Yet when I signed up for Google Webmaster Tools today, this is the content that Google Webmaster Tools is seeing:
User-agent: *
Disallow: /
... So ALL my urls are blocked!
In Wordpress, settings > reading > search engine visibility: "Discourage search engines from indexing this site" is not checked. I unchecked it fairly recently. (Google Webmaster tools is telling me it downloaded my robots.txt file on Nov 13, 2013.)
...So why is it still reading the old version where all my pages are disallowed, instead of the new version?
Does it take a while? Should I just be patient?
Also what is the ".gz" on the end of my sitemap line? I'm using the Yoast All-in-One SEO pack plugin. I'm thinking the plugin added the ".gz", whatever that is.
You can ask Googlebot to crawl again after you've changed your robots.txt. See Ask Google to crawl a page or site for information.
The Sitemap file tells Googlebot more about the structure of your site, and allows it to crawl more effectively. See About Sitemaps for more info.
The .gz is just telling Googlebot that the generated sitemap file is compressed.
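If you want to look inside such a file yourself, a minimal sketch using Python's standard gzip module could look like this. The sample sitemap content and the helper name read_sitemap are made up for illustration; the real file generated by the plugin will contain your own URLs.

```python
import gzip

# Made-up sitemap content, standing in for what the plugin generates.
SITEMAP_XML = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    b"  <url><loc>http://mywebsite.com/</loc></url>\n"
    b"</urlset>\n"
)


def read_sitemap(data: bytes) -> bytes:
    """Return the sitemap XML, decompressing it first if it is gzipped."""
    # gzip streams begin with the two magic bytes 0x1f 0x8b.
    if data[:2] == b"\x1f\x8b":
        return gzip.decompress(data)
    return data


if __name__ == "__main__":
    compressed = gzip.compress(SITEMAP_XML)  # what a sitemap.xml.gz would hold
    print(len(compressed), "bytes compressed")
    print(read_sitemap(compressed).decode("utf-8"))
```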
A WordPress discussion on this topic can be found here: https://wordpress.org/support/topic/robotstxt-wordpress-and-google-webmaster-tools?replies=5

Getting all posts from a blog (wordpress or blogger)

This is assuming that direct access to an api is not available. Since I am requesting ALL posts, I am not sure RSS would help much.
I considered a simple system that would loop through each year and month and download each HTML file by changing the following URL for each year/month pair. This works for WordPress and Blogger blogs.
http://www.lostincheeseland.com/2011/05
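The year/month loop described above could be sketched as follows. The function name monthly_archive_urls and the date range are hypothetical; only the /YYYY/MM URL shape comes from the example URL.

```python
def monthly_archive_urls(base: str, first: tuple[int, int], last: tuple[int, int]) -> list[str]:
    """Build one archive URL per month from first to last, inclusive.

    first and last are (year, month) pairs.
    """
    year, month = first
    urls = []
    while (year, month) <= last:
        urls.append(f"{base.rstrip('/')}/{year:04d}/{month:02d}")
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return urls


if __name__ == "__main__":
    for url in monthly_archive_urls("http://www.lostincheeseland.com", (2011, 11), (2012, 2)):
        print(url)
```

Each URL would then be fetched and parsed separately. Months with many posts may be paginated, which this sketch does not handle.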
However, is there a way to use the following search function provided by Blogger to return all posts? I have played around with it, but documentation seems sparse.
http://www.lostincheeseland.com/search?updated-max=2012-08-17T09:44:00%2B02:00&max-results=6
Are there other methods I have not considered?
What you're looking for is a sitemap.
First of all, you're writing a bot so it's good manners to check the blog's robots.txt file. And lo and behold, you'll often find a sitemap mentioned there. Here's an example from the Google blog:
User-agent: Mediapartners-Google
Disallow:
User-agent: *
Disallow: /search
Allow: /
Sitemap: http://googleblog.blogspot.com/feeds/posts/default?orderby=UPDATED
In this case, you can visit the Sitemap URL to get an xml sitemap.
For WordPress, the same applies, but it's not built in as standard, so not all blogs will have it. Have a look at this plugin, which is the most popular way to create these sitemaps in WordPress. For example, my blog uses this and you can find the sitemap at /sitemap.xml (the standard location).
In short:
Check robots.txt
Follow the Sitemap url if it's present
Otherwise, check for /sitemap.xml
Also: be a good Internet citizen! If you're going to write a bot, make sure it obeys the robots.txt file (like where blogspot tells you explicitly not to use /search!)
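The three steps above can be sketched in Python roughly like this. The function works on robots.txt text that has already been downloaded, so no network access is shown; the name sitemap_urls is hypothetical, and comment handling is deliberately left out to keep the sketch short.

```python
from urllib.parse import urljoin

# The robots.txt example quoted above.
ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow:
User-agent: *
Disallow: /search
Allow: /
Sitemap: http://googleblog.blogspot.com/feeds/posts/default?orderby=UPDATED
"""


def sitemap_urls(robots_txt: str, site_url: str) -> list[str]:
    """Return Sitemap URLs listed in robots_txt, else the /sitemap.xml fallback."""
    found = []
    for line in robots_txt.splitlines():
        # Split "Sitemap: <url>" at the first colon only, so the URL stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    # No Sitemap line: fall back to the conventional location at the site root.
    return found or [urljoin(site_url, "/sitemap.xml")]


if __name__ == "__main__":
    print(sitemap_urls(ROBOTS_TXT, "http://googleblog.blogspot.com/"))
    print(sitemap_urls("", "http://www.lostincheeseland.com/2011/05"))
```

The fallback URL is only a guess at where a sitemap might live, so a 404 there simply means the blog does not publish one.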

Resources