I was testing my website with online tools and one of the tools gave me this warning:
Your server appears to allow access from User-agent Libwww-perl. Botnet scripts that automatically look for vulnerabilities in your software are sometimes identified as User-Agent libwww-perl. By blocking access from libwww-perl you can eliminate many simpler attacks. Read more on blocking Libwww-perl access and improving your website's security.
My web site is an ASP.NET MVC 5 site and I've simply added these lines to my "robots.txt" file.
User-agent: *
Disallow: /
User-Agent: bingbot
Allow: /
However, the tool still reports the warning. What is the problem? I'm blocking all bots and just set bingbot to allow.
Unless you give the URL or name of the online scanning tool, I can only guess that it tried to crawl your pages while sending a User-Agent: libwww-perl header, and that it does not care whether you block that agent in your robots.txt.
The background for this is that robots.txt contains rules for well-behaved search engines, not for malware. From http://www.robotstxt.org/robotstxt.html:
robots can ignore your /robots.txt. Especially malware robots that
scan the web for security vulnerabilities, and email address
harvesters used by spammers will pay no attention.
I assume that to "fix" this warning you must deny all requests for any page, image, or file whose HTTP headers contain User-Agent: libwww-perl. See this question on configuring IIS to deny these requests without modifying your website.
Personally, I would not deny these requests, as it is not worth the hassle. It is easy to change the User-Agent within a scanning tool, and most already allow you to mimic widely used browsers, so the security gain would be very small. On the other hand, there may be a good, legitimate tool that cannot be used because it does not fake its identity.
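For reference, on an IIS/ASP.NET site this kind of blocking is typically done in web.config with the URL Rewrite module. This is only a rough sketch, assuming the URL Rewrite module is installed; the rule name and response text are arbitrary:
<configuration>
  <system.webServer>
    <rewrite>
      <rules>
        <!-- Return 403 for any request whose User-Agent contains libwww-perl -->
        <rule name="BlockLibwwwPerl" stopProcessing="true">
          <match url=".*" />
          <conditions>
            <add input="{HTTP_USER_AGENT}" pattern="libwww-perl" />
          </conditions>
          <action type="CustomResponse" statusCode="403" statusReason="Forbidden" statusDescription="Requests from libwww-perl are not allowed" />
        </rule>
      </rules>
    </rewrite>
  </system.webServer>
</configuration>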
Try adding this to your robots.txt file:
User-agent: Libwww-perl
Disallow: /
Add a robots.txt file to your application and place it in the root directory of your application. Add the following to that robots.txt file:
User-agent: Libwww-perl
Disallow: /
For more information, please see the link below:
http://www.iis.net/learn/application-frameworks/install-and-configure-php-applications-on-iis/translate-htaccess-content-to-iis-webconfig
It can be blocked via an .htaccess rewrite rule:
# Return 403 Forbidden when the User-Agent contains libwww-perl
RewriteCond %{HTTP_USER_AGENT} libwww-perl
RewriteRule .* - [F,L]
I need help with this robots.txt question. My default file looks something like this:
User-agent: *
Disallow:
Sitemap: https://mywebsite.com/sitemap_index.xml
The problem is that with this configuration, Google has deindexed almost all of my URLs (at the time of this writing).
Is it correct to leave the disallow field blank?
Yes, it's technically correct.
This means that all user agents, including search engines, can access your website's pages.
The asterisk after User-agent means the rules apply to all user agents.
Nothing is listed after Disallow, which means there are no restrictions at all, so this file is not what caused the deindexing.
We have a couple of WordPress sites with this same issue. They appear to have a "robots.txt" file with the following contents:
User-Agent: *
Crawl-Delay: 300
User-Agent: MJ12bot
Disallow: /
User-agent: MegaIndex.ru
Disallow: /
User-agent: megaindex.com
Disallow: /
We have absolutely no idea where this robots.txt file is coming from.
We have looked and there is definitely no "robots.txt" file in the public_html root folder or any sub-folder that we can see.
We have deactivated every single plugin on the site and even changed themes, but the robots.txt file remains exactly the same. It seems as though it is somehow being injected into the site from an external source!
We have been assured that it couldn't be coming from Google Tag Manager.
Just wondering if anyone happens to recognise the above robots.txt contents and knows how it came to exist on our sites?
You have a few possibilities.
Some security plugins (Wordfence, iThemes, etc.) actually add files to your site. These files generally don't go away when you just "disable" the plugins; they need to be removed/uninstalled, and sometimes you have to go through and do it manually.
WordPress will generate a virtual robots.txt.
If Google has cached that, you can go in and tell Google to look at the robots.txt again.
You should also be able to override it by creating your own: just make a robots.txt file and put it in the root, or use another plugin to do it.
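As an illustration of how a "virtual" robots.txt can exist with no file on disk: WordPress serves /robots.txt dynamically and lets any plugin or must-use plugin append to it via the robots_txt filter. A minimal sketch (the rules here are placeholders matching the ones you are seeing) would be:
<?php
// Sketch: append extra rules to WordPress's virtual robots.txt.
// A plugin or mu-plugin doing something like this would produce
// robots.txt content without any physical file existing.
add_filter( 'robots_txt', function ( $output, $public ) {
    $output .= "User-Agent: MJ12bot\nDisallow: /\n";
    return $output;
}, 10, 2 );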
It turns out it was a generic robots.txt file that our server administrator had set up to be injected into every site on our server, to prevent the server from being attacked and overloaded by those particular bots (which we had been having trouble with).
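For anyone wondering how an administrator can do that without touching each site: one common approach (a sketch only; the paths are hypothetical and the exact mechanism depends on the server setup) is a server-wide Apache alias that answers /robots.txt for every virtual host:
# In the global Apache configuration (not per-site), e.g. conf.d/robots.conf
Alias /robots.txt /var/www/shared/robots.txt
<Directory /var/www/shared>
    Require all granted
</Directory>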
I want to add nofollow and noindex to my site whilst it's being built. The client has requested that I use these rules.
I am aware of
<meta name="robots" content="noindex,nofollow">
But I only have access to the robots.txt file.
Does anyone know the correct format I can use to apply noindex, nofollow rules via the robots.txt file?
noindex and nofollow mean you do not want search engines to crawl or index your site,
so simply put this in robots.txt:
User-agent: *
Disallow: /
That effectively acts as noindex and nofollow.
There is a non-standard Noindex field, which Google (and likely no other consumer) supported as an experimental feature.
Following the robots.txt specification, you can disallow neither indexing nor the following of links with robots.txt.
For a site that is still in development, has not been indexed yet, and doesn’t get backlinks from pages which may be crawled, using robots.txt should be sufficient:
# no bot may crawl
User-agent: *
Disallow: /
If pages from the site are already indexed, and/or if other pages that may be crawled link to it, you have to use noindex, which can be specified not only in the HTML but also as an HTTP header:
X-Robots-Tag: noindex, nofollow
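If you can add server configuration, this header can be sent site-wide; for example, a sketch for Apache with mod_headers in an .htaccess file (assuming that module is available) would be:
<IfModule mod_headers.c>
    Header set X-Robots-Tag "noindex, nofollow"
</IfModule>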
Noindex tells search engines not to include pages in search results, but they can still follow the links (and those links can still pass PA and DA).
Nofollow tells bots not to follow the links. We can also combine noindex with follow on pages we don't want indexed but whose links we do want followed.
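In HTML, using the same meta tag format mentioned above, that combination would look like this:
<meta name="robots" content="noindex, follow">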
I just read this thread and thought I'd add an idea.
If you want to keep a site that is under construction or development from being viewable by unauthorized users, I think this idea is safe, although a bit of IT proficiency is required.
Every operating system has a "hosts" file that works as a manual repository of DNS entries, overriding the online DNS servers.
On Windows it is at C:\Windows\System32\drivers\etc\hosts, and the Linux distros I know (Android, too) have it at /etc/hosts. macOS uses /etc/hosts as well.
The idea is to add an entry like
xxx.xxx.xxx.xxx anyDomain.tld
to that file.
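For instance, assuming (purely as an illustration) that your server's IP address is 192.0.2.10 and the new domain is example.com, the entry would look like:
# hypothetical example entry
192.0.2.10    example.com    www.example.com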
It is important that the domain is set up at your server/hosting provider, but not yet published to the DNS servers.
What happens: since the domain is configured on the server, it will respond to requests for that domain, but nobody else on the internet (no browsers) will know the IP address of your site, besides the computers whose hosts file you have added the above entry to.
In this situation, you can give the entry to anyone who is interested in seeing your site (and has your authorization), and no one else will be able to see it. No crawler will see it until you publish the DNS records online.
I even use it for a private file server that my family share.
Here you can find a thorough explanation on how to edit the hosts file:
https://www.howtogeek.com/howto/27350/beginner-geek-how-to-edit-your-hosts-file/
I have a couple of WordPress sites, and with the current Google SEO algorithm update a site should be mobile-friendly (here).
My query is as follows. Currently I have written rules in robots.txt to disallow crawling of the URLs starting with wp-:
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /feed
Disallow: /*/feed
Disallow: /wp-login.php
I don't want Google to crawl the above URLs. Earlier it was working fine, but now, with the recent Google algorithm update, disallowing these URLs causes errors in the mobile-friendly test (here), as all my CSS and JS are behind the wp- URLs. I am wondering how I can fix this.
Any suggestions appreciated.
If you keep the crawler away from those files, your page may look and work differently for Google than it does for your visitors. This is what Google wants to avoid.
There is no problem in allowing Google to access the CSS or JS files, as anyone else who can open your HTML source and read the links can access them anyway.
Therefore Google definitely wants to access the CSS and JS files used on your page:
https://developers.google.com/webmasters/mobile-sites/mobile-seo/common-mistakes/blocked-resources?hl=en
Those files are needed to render your pages.
If your site’s robots.txt file disallows crawling of these assets, it directly harms how well our algorithms render and index your content. This can result in suboptimal rankings.
If you depend on mobile rankings, you must follow Google's guidelines. If not, feel free to block the crawler.
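One way to do that, as a sketch (assuming your assets live under the usual wp-includes and wp-content paths; adjust to wherever your theme and plugins actually keep their CSS and JS), is to keep the wp- directories disallowed but explicitly allow the static assets, since Google resolves conflicts in favour of the most specific matching rule:
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /feed
Disallow: /*/feed
Disallow: /wp-login.php
Allow: /wp-includes/js/
Allow: /wp-includes/css/
Allow: /wp-content/plugins/*.css$
Allow: /wp-content/plugins/*.js$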
I've been testing my website with Google Webmaster Tools, and when I tried to "fetch it as Googlebot" I got a "Partial" status and a note that three EXTERNAL CSS files, namely 3 Google fonts, had been blocked for some reason by robots.txt.
Now, here's my file:
User-agent: *
Disallow:
Disallow: /cgi-bin/
Sitemap: http://example.com/sitemapindex.xml
Is there something wrong with it that might be preventing access to said files?
Thanks!
If robots.txt is blocking external CSS files, then it will be the robots.txt for the server hosting those files, not the one for your main hostname.
I don't know why you would worry about Googlebot being unable to read your stylesheets though.
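If you want to confirm where the block comes from, you can fetch the robots.txt of the host that serves the font CSS; for Google Fonts that is typically fonts.googleapis.com (an assumption here; check the actual URLs the tool reported):
curl https://fonts.googleapis.com/robots.txt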