robots.txt file being overridden / injected from external source? - wordpress

We have a couple of WordPress sites with this same issue. They appear to have a "robots.txt" file with the following contents:
User-Agent: *
Crawl-Delay: 300
User-Agent: MJ12bot
Disallow: /
User-agent: MegaIndex.ru
Disallow: /
User-agent: megaindex.com
Disallow: /
We have absolutely no idea where this robots.txt file is coming from.
We have looked and there is definitely no "robots.txt" file in the public_html root folder or any sub-folder that we can see.
We have deactivated every single plugin on the site and even changed themes, but the robots.txt file remains exactly the same. It seems as though it is being injected into the site from an external source!
We have been assured that it couldn't be coming from Google Tag Manager.
Just wondering if anyone recognises the above robots.txt contents and knows how it comes to exist on our sites?

You have a few possibilities.
Some security plugins (Wordfence, iThemes Security, etc.) actually add files to your site. These files don't generally go away when you just "disable" the plugins. The plugins need to be removed/uninstalled, and sometimes you have to go through and delete the leftover files manually.
WordPress will generate a virtual robots.txt when no physical file exists.
Google may have cached an old copy. If so, you can go in and tell Google to look at the robots.txt again.
You should also be able to overwrite it by creating your own by just making a robots.txt file and putting it in the root or using another plugin to do it.

Turns out it was a generic robots.txt file that our server administrator had set up to be injected into every site on our server to prevent our server being attacked and overloaded by those particular bots (which we had been having trouble with).
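For anyone wondering what the file quoted in the question actually tells crawlers, Python's standard-library `urllib.robotparser` can interpret it offline. This is only a sketch: the `example.com` URL and the `ExampleBot` agent name are placeholders, not anything taken from the affected sites.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt body quoted in the question, one directive per line.
INJECTED_ROBOTS_TXT = """\
User-Agent: *
Crawl-Delay: 300
User-Agent: MJ12bot
Disallow: /
User-agent: MegaIndex.ru
Disallow: /
User-agent: megaindex.com
Disallow: /
"""

parser = RobotFileParser()
parser.parse(INJECTED_ROBOTS_TXT.splitlines())

# The three named bots are shut out of the whole site.
mj12_allowed = parser.can_fetch("MJ12bot", "https://example.com/")
megaindex_allowed = parser.can_fetch("MegaIndex.ru", "https://example.com/")

# Any other bot (placeholder name) may crawl, but is asked to pause
# for the Crawl-Delay value between requests.
other_allowed = parser.can_fetch("ExampleBot", "https://example.com/")
other_delay = parser.crawl_delay("ExampleBot")

print(mj12_allowed, megaindex_allowed, other_allowed, other_delay)
```

Read this way, the file is consistent with the explanation above: it throttles everything and bans a few specific crawlers, which is what a server-wide protective default would look like.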

Related

Disallow URL with specific querystring from crawl using robots.txt

My client has an ASP.NET MVC web application that also has a WordPress blog in a subfolder.
https://www.example.com/
https://www.example.com/wordpress
The WordPress site is loaded with some social sharing links that I do not want crawlers to index. For example:
https://www.example.com/wordpress/some-post/?share=pinterest
First thing, should there be a robots.txt in the / folder and also one in the /wordpress folder? Or just a single one in the / folder? I've tried both without any success.
In my robots.txt file I've included the following:
User-agent: Googlebot
Disallow: ?share=pinterest$
I've also tried several variations like:
Disallow: /wordpress/*/?share=pinterest
No matter what rule I have in robots.txt, I'm not able to get crawlers to stop trying to index these social sharing links. The plugin that creates these sharing links is also making them "nofollow noindex noreferer", but since they are all internal links it causes issues due to blocking internal "link juice".
How do I form a rule to Disallow crawlers to index any link inside this site that ends with ?share=pinterest?
Should both sites have a robots.txt or only one in the main/root folder?
robots.txt should only be at the root of the domain. https://example.com/robots.txt is the correct URL for your robots.txt file. Any robots.txt file in a subdirectory will be ignored.
By default, robots.txt rules are all "starts with" rules. Only a few major bots such as Googlebot support wildcards in Disallow: rules. If you use wildcards, the rules will be obeyed by the major search engines but ignored by most less sophisticated bots.
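The difference between "starts with" matching and wildcard matching can be sketched in a few lines of Python. This is an illustrative approximation of the `*` and `$` conventions that Googlebot documents, not Google's actual implementation; the function name `rule_matches` is made up for this example, and the `/*?share=pinterest$` pattern is an extra variant added here for comparison, not one from the question.

```python
import re


def rule_matches(rule: str, url_path: str) -> bool:
    """Return True if a robots.txt path pattern matches url_path.

    "*" matches any run of characters, a trailing "$" anchors the end
    of the URL, and everything else is a plain "starts with" comparison.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces, then rejoin them with ".*" for each "*".
    pattern = ".*".join(re.escape(piece) for piece in rule.split("*"))
    if anchored:
        pattern += r"\Z"
    # re.match anchors at the start, which gives the "starts with" behaviour.
    return re.match(pattern, url_path) is not None


share_url = "/wordpress/some-post/?share=pinterest"

# Wildcard rule from the question: matches the share link.
print(rule_matches("/wordpress/*/?share=pinterest", share_url))
# First attempt from the question: never matches, because URL paths
# start with "/" and the rule starts with "?".
print(rule_matches("?share=pinterest$", share_url))
# Site-wide variant (an assumption, not from the question).
print(rule_matches("/*?share=pinterest$", share_url))
# The post itself stays crawlable.
print(rule_matches("/wordpress/*/?share=pinterest", "/wordpress/some-post/"))
```

A bot that ignores wildcards would compare the rule text literally against the start of the path, so none of the wildcard patterns would block anything for it.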
Using nofollow on those links isn't really going to affect your internal link juice. Those links are all going to be external redirects that will either pass PageRank out of your site, or if you block that PageRank somehow, it will evaporate. Neither external linking nor PageRank evaporation hurts the SEO of the rest of your site, so it doesn't really matter from an SEO perspective what you do. You can allow those links to be crawled, use nofollow on those links, or disallow those links in robots.txt. It won't change how the rest of your site is ranked.
robots.txt also has the disadvantage that search engines occasionally index disallowed pages. robots.txt blocks crawling, but it doesn't always prevent indexing. If any of those URLs get external links, Google may index the URL with the anchor text of the links it finds to them.
If you really want to hide the social sharing from search engine bots, you should have the functionality handled with onclick events. Something like:
<a onclick="pinterestShare()">Share on Pinterest</a>
Where pinterestShare is a JavaScript function that uses location.href to set the URL of the page to the Pinterest share URL for the current URL.
To directly answer your question about robots.txt, this rule is correct:
User-agent: *
Disallow: /wordpress/*/?share=pinterest
You can use Google's robots.txt testing tool to verify that it blocks your URL.
Allow up to 24 hours after making robots.txt changes before bots start obeying the new rules. Bots often cache your old robots.txt for a day.
You may have to wait weeks for new results to show in your webmaster tools and search console accounts. Search engines won't report new results until they get around to re-crawling pages, realize the requests are blocked, and that information makes it back to their webmaster information portals.

What is the meaning of this on a site's Robots.txt page?

I've been trying to scrape a website's data to build a game out of the database and I'm frequently getting blocked with a CAPTCHA request. When I checked the robots.txt file for the site, I saw this:
Disallow: /a/
Disallow: /contact-us/
What is the meaning of this?
According to Google's documentation:
A robots.txt file tells search engine crawlers which pages or files
the crawler can or can't request from your site. This is used mainly
to avoid overloading your site with requests; it is not a mechanism
for keeping a web page out of Google.
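In practical terms, each `Disallow` line asks crawlers not to request any URL whose path begins with the given prefix. A scraper can check this itself before fetching. The sketch below uses Python's standard `urllib.robotparser`; the `User-agent: *` line, the `example.com` host, the `MyScraper` agent name, and the sample paths are all assumptions, since the question shows only the two `Disallow` lines.

```python
from urllib.robotparser import RobotFileParser

# The two rules from the question; the "User-agent: *" line is assumed.
rules = [
    "User-agent: *",
    "Disallow: /a/",
    "Disallow: /contact-us/",
]

parser = RobotFileParser()
parser.parse(rules)


def may_fetch(path: str, agent: str = "MyScraper") -> bool:
    """Check a path on the placeholder host example.com against the rules."""
    return parser.can_fetch(agent, "https://example.com" + path)


# Paths under /a/ and /contact-us/ are off limits; everything else is not.
for path in ("/a/some-item", "/contact-us/", "/a", "/games/list"):
    print(path, may_fetch(path))
```

Note that robots.txt is only a request to well-behaved bots. It does not itself produce the CAPTCHA, which comes from whatever bot detection or rate limiting the site runs.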

Keeping robots.txt blank

I have a couple of WordPress sites, and with the current Google SEO algorithm update a site should be mobile friendly.
My question is as follows. Currently I have written rules in robots.txt to disallow crawling of the wp- URLs:
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /feed
Disallow: /*/feed
Disallow: /wp-login.php
I don't want Google to crawl the above URLs. Earlier this was working fine, but with the recent Google algorithm update, disallowing these URLs causes errors in the mobile-friendly test, because all my CSS and JS are behind the wp- URLs. How can I fix this?
Any suggestions appreciated.
If you keep the crawler away from those files, your page may look and work differently to Google than it does to your visitors. This is what Google wants to avoid.
There is no problem in allowing Google to access the CSS or JS files, since anyone who can open your HTML source and read the links can access them too.
Therefore Google definitely wants to access the CSS and JS files used on your page:
https://developers.google.com/webmasters/mobile-sites/mobile-seo/common-mistakes/blocked-resources?hl=en
Those files are needed to render your pages.
If your site’s robots.txt file disallows crawling of these assets, it directly harms how well our algorithms render and index your content. This can result in suboptimal rankings.
If you depend on mobile rankings, you must follow Google's guidelines. If not, feel free to block the crawler.
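To see which page assets the rules from the question would hide from a crawler, you could run a quick offline check with Python's standard `urllib.robotparser`. This is a rough sketch: the asset paths, the plugin and theme names, and the `example.com` host are hypothetical, and this parser treats `*` inside a path literally, so the `/*/feed` line is not evaluated the way Google would evaluate it.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /feed
Disallow: /*/feed
Disallow: /wp-login.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical asset paths of the kind a WordPress page might link to.
assets = [
    "/wp-includes/js/jquery/jquery.js",
    "/wp-content/plugins/example-plugin/style.css",
    "/wp-content/themes/example-theme/style.css",
]

# Collect every asset that Googlebot would not be allowed to request.
blocked = [
    path
    for path in assets
    if not parser.can_fetch("Googlebot", "https://example.com" + path)
]
print(blocked)
```

Any stylesheet or script that ends up in `blocked` is one the crawler cannot load when rendering the page, which is the situation the mobile-friendly test complains about.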

Robots.txt: ALLOW Google Fonts

I've been testing my website with Google Webmaster Tools and when I tried to "fetch it as Googlebot" I got a "Partial" status and a note that three EXTERNAL css files, namely 3 Google fonts, had been blocked for some reason by robots.txt.
Now, here's my file:
User-agent: *
Disallow:
Disallow: /cgi-bin/
Sitemap: http://example.com/sitemapindex.xml
Is there something wrong with it that might be preventing access to said files?
Thanks!
If robots.txt is blocking external CSS files, then it will be the robots.txt for the server hosting those files, not the one for your main hostname.
I don't know why you would worry about Googlebot being unable to read your stylesheets though.
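Because robots.txt is looked up per host, the file that governs an external resource lives on that resource's own server. A small Python sketch of that lookup follows; the helper name `robots_txt_url` and both sample URLs are assumptions for illustration, since the question does not show the blocked font URLs.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(resource_url: str) -> str:
    """Return the robots.txt URL that applies to resource_url's host."""
    parts = urlsplit(resource_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A font stylesheet on a third-party host is governed by that host's file.
font_rules = robots_txt_url("https://fonts.googleapis.com/css?family=Open+Sans")
# A stylesheet on your own host is governed by your file.
own_rules = robots_txt_url("http://example.com/wp-content/themes/example/style.css")

print(font_rules)
print(own_rules)
```

So nothing in the robots.txt shown in the question can block or unblock files served from another hostname.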

Does robots.txt apply to files/directories only, or URLs too?

I can use robots.txt to stop a folder of images/html files getting indexed. But what about dynamic pages, e.g. preventing indexing of certain WordPress pages?
The robots.txt syntax doesn't care whether a page is dynamic or not: rules are matched against the URL path, not against files on disk.
If you are using a permalink structure like
example.com/blog/year/month/slug
you should be able to exclude single pages like so:
user-agent: *
disallow: /blog/2011/09/this-is-a-test-entry
You could use Google's Webmaster Tools to verify that this works as intended.
Remember that WordPress stores static content like images and PDF documents in /wp-content - you can't block those this way unless you want to block all resources in that directory.
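One thing to keep in mind with a single-page rule like the one above is that it is a prefix match, so it also covers any other URL that begins with the same text. A minimal check with Python's standard `urllib.robotparser` illustrates this; the `example.com` host, the `ExampleBot` agent name, and the two extra slugs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "user-agent: *",
    "disallow: /blog/2011/09/this-is-a-test-entry",
])


def allowed(path: str) -> bool:
    """Check a path on the placeholder host example.com against the rule."""
    return parser.can_fetch("ExampleBot", "http://example.com" + path)


# The targeted post is blocked.
print(allowed("/blog/2011/09/this-is-a-test-entry"))
# So is a different post whose slug merely starts with the same text.
print(allowed("/blog/2011/09/this-is-a-test-entry-part-2"))
# Other posts from the same month are unaffected.
print(allowed("/blog/2011/09/another-entry"))
```

If that overlap matters, pick slugs that do not share a prefix, or use the `$` end anchor on crawlers that support it.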

Resources