Robots.txt: ALLOW Google Fonts CSS

I've been testing my website with Google Webmaster Tools, and when I tried to "fetch it as Googlebot" I got a "Partial" status and a note that three external CSS files, namely three Google Fonts stylesheets, had been blocked for some reason by robots.txt.
Now, here's my file:
User-agent: *
Disallow:
Disallow: /cgi-bin/
Sitemap: http://example.com/sitemapindex.xml
Is there something wrong with it that might be preventing access to said files?
Thanks!

If robots.txt is blocking external CSS files, then it will be the robots.txt for the server hosting those files, not the one for your main hostname.
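For example, if the font CSS is loaded from fonts.googleapis.com, then the file Googlebot consults for those URLs is http://fonts.googleapis.com/robots.txt. If that file (purely hypothetically) contained:
User-agent: *
Disallow: /
then the stylesheets would be reported as blocked no matter what your own robots.txt says, and there would be nothing you could change on your side to allow them.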
I don't know why you would worry about Googlebot being unable to read your stylesheets though.

Related

robots.txt file being overridden / injected from external source?

We have a couple of Wordpress sites with this same issue. They appear to have a "robots.txt" file with the following contents:
User-Agent: *
Crawl-Delay: 300
User-Agent: MJ12bot
Disallow: /
User-agent: MegaIndex.ru
Disallow: /
User-agent: megaindex.com
Disallow: /
We have absolutely no idea where this robots.txt file is coming from.
We have looked and there is definitely no "robots.txt" file in the public_html root folder or any sub-folder that we can see.
We have deactivated every single plugin on the site and even changed themes, but the robots.txt file remains exactly the same. It seems as though it is being injected into the site from some external source!
We have been assured that it couldn't be coming from Google Tag Manager.
Just wondering if anyone happens to recognise the above robots.txt contents and knows how it is getting onto our sites?
You have a few possibilities.
Some security plugins (Wordfence, iThemes Security, etc.) actually add files to your site. These files don't generally go away when you just "disable" the plugins; they need to be actually removed/uninstalled, and sometimes you have to go through and do it manually.
WordPress will generate a virtual robots.txt. If Google has cached that, you can tell Google to fetch the robots.txt again.
You should also be able to override it by creating your own robots.txt file and putting it in the root, or by using another plugin to do it.
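For example, a minimal physical file dropped into the web root (the contents below are just an illustration; adjust the rules to your own needs) will be served instead of the WordPress-generated virtual one:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php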
It turns out it was a generic robots.txt file that our server administrator had set up to be injected into every site on our server, to prevent the server from being attacked and overloaded by those particular bots (which we had been having trouble with).

Keeping robots.txt blank

I have a couple of WordPress sites, and with the current Google SEO algorithm update a site should be mobile-friendly (here).
My query is as follows. Currently I have written rules in robots.txt to disallow crawling of the URLs starting with wp-:
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /feed
Disallow: /*/feed
Disallow: /wp-login.php
I don't want Google to crawl the above URLs. Earlier this was working fine, but now, with the recent Google algorithm update, disallowing these URLs causes errors in the mobile-friendly test (here), since all my CSS and JS files sit behind the wp- URLs. I am wondering how I can fix this.
Any suggestions appreciated.
If you keep the crawler away from those files, your page may look and work differently for Google than it does for your visitors. This is what Google wants to avoid.
There is no problem in allowing Google to access the CSS or JS files, since anyone who can open your HTML source and read the links can access them anyway.
Therefore Google definitely wants to access the CSS and JS files used on your page:
https://developers.google.com/webmasters/mobile-sites/mobile-seo/common-mistakes/blocked-resources?hl=en
Those files are needed to render your pages.
If your site’s robots.txt file disallows crawling of these assets, it directly harms how well our algorithms render and index your content. This can result in suboptimal rankings.
If you are dependent on mobile rankings, you must follow Google's guidelines. If not, feel free to block the crawler.
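One way to resolve the conflict is to keep the Disallow rules but add longer, more specific Allow rules for the static assets, for example (a sketch only; the exact paths depend on your theme and plugins, and the "longest match wins" precedence described here is how Googlebot resolves conflicting rules; other crawlers may behave differently):
User-agent: *
Allow: /wp-includes/*.css
Allow: /wp-includes/*.js
Allow: /wp-content/plugins/*.css
Allow: /wp-content/plugins/*.js
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Because each Allow pattern is longer than the Disallow rule it overlaps with, Googlebot fetches the CSS and JS files while still skipping everything else under those directories.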

How to block libwww-perl access in .NET?

I was testing my website with online tools and one of the tools gave me this warning:
Your server appears to allow access from User-agent Libwww-perl. Botnet scripts that automatically look for vulnerabilities in your software are sometimes identified as User-Agent libwww-perl. By blocking access from libwww-perl you can eliminate many simpler attacks. Read more on blocking Libwww-perl access and improving your website's security.
My web site is an ASP.NET MVC 5 site and I've simply added these lines to my "robots.txt" file.
User-agent: *
Disallow: /
User-Agent: bingbot
Allow: /
However, the tool still reports the warning. What is the problem? I'm blocking all bots and only allowing bingbot.
Unless you give the URL or name of the online scanning tool, I can only guess that it tried to crawl your pages while sending a User-Agent: libwww-perl header and checked whether the requests were actually refused, not whether you block this agent in your robots.txt.
The background for this is that robots.txt contains rules for well-behaved search engines, not for malware. From http://www.robotstxt.org/robotstxt.html:
robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.
I assume that to "fix" this warning you must deny all requests for any page, image or file whose HTTP headers contain User-Agent: libwww-perl. See this question on configuring IIS to deny these requests without modifying your website.
Personally, I would not deny these requests, as it is not worth the hassle. It is easy to change the User-Agent within a scanning tool, and most already let you mimic widely used browsers, so the security gain would be very small. On the other hand, there may be a good/legitimate tool that cannot be used because it does not fake its identity.
Try adding this to your file:
User-agent: Libwww-perl
Disallow: /
Add a robots.txt file to your application and place it in the root directory of your application. Add the following to that robots.txt file:
User-agent: Libwww-perl
Disallow: /
For more information, please check the link given below:
http://www.iis.net/learn/application-frameworks/install-and-configure-php-applications-on-iis/translate-htaccess-content-to-iis-webconfig
It can be blocked via an .htaccess Rewrite:
# Return 403 Forbidden for any request whose User-Agent contains libwww-perl
RewriteCond %{HTTP_USER_AGENT} libwww-perl [NC]
RewriteRule .* - [F,L]
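Since the question is about an ASP.NET site on IIS rather than Apache, a roughly equivalent rule can be expressed in web.config with the IIS URL Rewrite module (a sketch, assuming the URL Rewrite module is installed; it goes inside the <configuration> element):
<system.webServer>
  <rewrite>
    <rules>
      <!-- Return 403 Forbidden when the User-Agent contains libwww-perl -->
      <rule name="Block libwww-perl" stopProcessing="true">
        <match url=".*" />
        <conditions>
          <add input="{HTTP_USER_AGENT}" pattern="libwww-perl" />
        </conditions>
        <action type="CustomResponse" statusCode="403" statusReason="Forbidden" statusDescription="Request blocked" />
      </rule>
    </rules>
  </rewrite>
</system.webServer>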

Why is Google Webmaster Tools completely misreading my robots.txt file?

Below is the entire content of my robots.txt file.
User-agent: *
Disallow: /marketing/wp-admin/
Disallow: /marketing/wp-includes/
Sitemap: http://mywebsite.com/sitemap.xml.gz
It is the one apparently generated by WordPress. I haven't manually created one.
Yet when I signed up for Google Webmaster Tools today, this is the content that Google Webmaster Tools is seeing:
User-agent: *
Disallow: /
...so ALL my URLs are blocked!
In WordPress, under Settings > Reading, "Discourage search engines from indexing this site" is not checked. I unchecked it fairly recently. (Google Webmaster Tools tells me it downloaded my robots.txt file on Nov 13, 2013.)
...So why is it still reading the old version where all my pages are disallowed, instead of the new version?
Does it take a while? Should I just be patient?
Also what is the ".gz" on the end of my sitemap line? I'm using the Yoast All-in-One SEO pack plugin. I'm thinking the plugin added the ".gz", whatever that is.
You can ask Googlebot to crawl again after you've changed your robots.txt. See Ask Google to crawl a page or site for information.
The Sitemap file tells Googlebot more about the structure of your site, and allows it to crawl more effectively. See About Sitemaps for more info.
The .gz is just telling Googlebot that the generated sitemap file is gzip-compressed.
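For reference, the uncompressed sitemap is just an XML file following the sitemaps.org protocol; a hypothetical minimal example looks like this, and the .gz version is the same content gzip-compressed:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://mywebsite.com/sample-page/</loc>
  </url>
</urlset>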
A WordPress discussion on this topic can be found here: https://wordpress.org/support/topic/robotstxt-wordpress-and-google-webmaster-tools?replies=5

robots.txt disallow /variable_dir_name/directory

I need to disallow /variable_dir_name/directory via robots.txt
I use:
Disallow: */directory
Noindex: */directory
is that correct?
The following should work in your robots.txt:
User-Agent: *
Disallow: /*/directory
Further reading from Google: Block or remove pages using a robots.txt file
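To illustrate how the wildcard rule matches (hypothetical URLs):
/some_dir_name/directory        blocked
/some_dir_name/directory/page   blocked (robots.txt rules are prefix matches)
/directory                      not blocked (the pattern requires a path segment before /directory)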
Indeed, Googlebot used to allow the use of these directives in robots.txt:
Noindex
Nofollow
Crawl-delay
But as announced on the Google Webmaster Central Blog, these (0.001% used) directives are no longer supported as of September 2019. So to be safe for the future, you should only use meta tags for these on your pages.
What you really should do is the following:
Disallow via robots.txt and
Noindex already indexed documents via Google Search Console
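For reference, the noindex meta tag mentioned above looks like this and goes into the <head> of each page you want removed from the index (note that Google must still be able to crawl the page in order to see the tag):
<meta name="robots" content="noindex">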
