Robots.txt Disallow file - should this be left empty? - wordpress

Need help with this robots.txt question. My default file looks something like this:
User-agent: *
Disallow:
Sitemap: https://mywebsite.com/sitemap_index.xml
The problem is that with this configuration, Google has deindexed almost all of my URLs (as of this writing).
Is it correct to leave the disallow field blank?

Yes, it's technically correct.
This means that all user agents, including search engines, can access all of your website's pages.
The asterisk after User-agent: means the record applies to all user agents.
Nothing is listed after Disallow:, which means there are no restrictions at all.
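For comparison, here is a quick sketch of the two forms; only the second would actually block crawling:
# empty value: nothing is disallowed, everything may be crawled
User-agent: *
Disallow:
# a single slash disallows the entire site
User-agent: *
Disallow: /
An empty Disallow therefore cannot, by itself, cause pages to be deindexed.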

Related

Disallow URL with specific querystring from crawl using robots.txt

My client has an ASP.NET MVC web application that also has a WordPress blog in a subfolder.
https://www.example.com/
https://www.example.com/wordpress
The WordPress site is loaded with some social sharing links that I do not want crawlers to index. For example:
https://www.example.com/wordpress/some-post/?share=pinterest
First thing, should there be a robots.txt in the / folder and also one in the /wordpress folder? Or just a single one in the / folder? I've tried both without any success.
In my robots.txt file I've included the following:
User-agent: Googlebot
Disallow: ?share=pinterest$
I've also tried several variations like:
Disallow: /wordpress/*/?share=pinterest
No matter what rule I have in robots.txt, I'm not able to get crawlers to stop trying to index these social sharing links. The plugin that creates these sharing links also marks them "nofollow noindex noreferrer", but since they are all internal links, this causes issues by blocking internal "link juice".
How do I form a rule to disallow crawlers from indexing any URL on this site that ends with ?share=pinterest?
Should both sites have a robots.txt or only one in the main/root folder?
robots.txt should only be at the root of the domain. https://example.com/robots.txt is the correct URL for your robots.txt file. Any robots.txt file in a subdirectory will be ignored.
By default, robots.txt rules are all "starts with" rules. Only a few major bots such as Googlebot support wildcards in Disallow: rules. If you use wildcards, the rules will be obeyed by the major search engines but ignored by most less sophisticated bots.
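To illustrate the difference, a quick sketch (not the exact rules you need):
# "starts with" rule, obeyed by virtually every bot:
# blocks /wordpress, /wordpress/some-post/, and even /wordpress-anything
Disallow: /wordpress
# wildcard rule, obeyed only by the major search engines:
Disallow: /wordpress/*/?share=pinterest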
Using nofollow on those links isn't really going to affect your internal link juice. Those links are all going to be external redirects that will either pass PageRank out of your site, or if you block that PageRank somehow, it will evaporate. Neither external linking nor PageRank evaporation hurts the SEO of the rest of your site, so it doesn't really matter from an SEO perspective what you do. You can allow those links to be crawled, use nofollow on those links, or disallow those links in robots.txt. It won't change how the rest of your site is ranked.
robots.txt also has the disadvantage that search engines occasionally index disallowed pages. robots.txt blocks crawling, but it doesn't always prevent indexing. If any of those URLs get external links, Google may index the URL with the anchor text of the links it finds to them.
If you really want to hide the social sharing from search engine bots, you should have the functionality handled with onclick events. Something like:
<a onclick="pintrestShare()">Share on Pinterest</a>
Where pintrestShare is a JavaScript function that uses location.href to set the URL of the page to the Pinterest share URL for the current page.
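A minimal sketch of that approach, assuming Pinterest's standard share endpoint (the function body below is an assumption, not the plugin's code):
<a onclick="pintrestShare()">Share on Pinterest</a>
<script>
// Sketch only: send the visitor to Pinterest's share screen for the current page,
// without exposing a crawlable ?share=pinterest URL on your own site.
function pintrestShare() {
  location.href = 'https://www.pinterest.com/pin/create/button/?url=' +
    encodeURIComponent(location.href);
}
</script>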
To directly answer your question about robots.txt, this rule is correct:
User-agent: *
Disallow: /wordpress/*/?share=pinterest
You can use Google's robots.txt testing tool to verify that it blocks your URL.
You may have to wait up to 24 hours after making robots.txt changes before bots start obeying the new rules. Bots often cache your old robots.txt for a day.
You may have to wait weeks for new results to show in your webmaster tools and search console accounts. Search engines won't report new results until they get around to re-crawling pages, realize the requests are blocked, and that information makes it back to their webmaster information portals.

robots.txt file being overridden / injected from external source?

We have a couple of WordPress sites with this same issue. They appear to have a "robots.txt" file with the following contents:
User-Agent: *
Crawl-Delay: 300
User-Agent: MJ12bot
Disallow: /
User-agent: MegaIndex.ru
Disallow: /
User-agent: megaindex.com
Disallow: /
We have absolutely no idea where this robots.txt file is coming from.
We have looked and there is definitely no "robots.txt" file in the public_html root folder or any sub-folder that we can see.
We have deactivated every single plugin on the site and even changed themes, but the robots.txt file remains exactly the same. It seems as though it is somehow being injected into the site from an external source!
We have been assured that it couldn't be coming from Google Tag Manager.
Just wondering if anyone happens to recognise the above robots.txt contents and knows how it is ending up on our sites?
You have a few possibilities.
Some security plugins (Wordfence, iThemes, etc.) actually add files to your site. These files don't generally go away when you just "disable" the plugin; they need to be properly removed/uninstalled, and sometimes you have to go through and delete them manually.
WordPress will also generate a virtual robots.txt if no physical file exists.
If Google has cached that, you can tell Google to fetch the robots.txt again.
You should also be able to override it by creating your own: just make a robots.txt file and put it in the root, or use another plugin to do it.
It turns out it was a generic robots.txt file that our server administrator had set up to be injected into every site on our server, to prevent the server being attacked and overloaded by those particular bots (which we had been having trouble with).
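For anyone chasing the same mystery: a server-level injection like that can be as simple as a web-server rule answering every /robots.txt request from one shared file. A minimal sketch assuming nginx (the file path is invented):
location = /robots.txt {
    # serve one shared robots.txt for every site on this server
    alias /etc/nginx/shared/robots.txt;
}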

How to add `nofollow, noindex` to all pages in robots.txt?

I want to add nofollow and noindex to my site whilst it's being built. The client has requested that I use these rules.
I am aware of
<meta name="robots" content="noindex,nofollow">
But I only have access to the robots.txt file.
Does anyone know the correct format I can use to apply noindex, nofollow rules via the robots.txt file?
noindex and nofollow mean you do not want search engines to crawl or index your site.
So simply put this in robots.txt:
User-agent: *
Disallow: /
It means noindex and nofollow.
There is a non-standard Noindex field, which Google (and likely no other consumer) supported as an experimental feature.
Following the robots.txt specification, you can't disallow indexing or the following of links with robots.txt.
For a site that is still in development, has not been indexed yet, and doesn’t get backlinks from pages which may be crawled, using robots.txt should be sufficient:
# no bot may crawl
User-agent: *
Disallow: /
If pages from the site are already indexed, and/or if other pages which may be crawled link to it, you have to use noindex, which can be specified not only in the HTML, but also as an HTTP header:
X-Robots-Tag: noindex, nofollow
Noindex tells search engines not to include pages in search results, but they may still follow the links (which can also transfer PA and DA).
Nofollow tells bots not to follow the links. We can also combine noindex with follow on pages we don't want indexed but whose links we do want followed.
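If the site happens to run on Apache and you can edit .htaccess (an assumption; the question only mentions robots.txt access), a sketch of sending the X-Robots-Tag header for every page would be:
# requires mod_headers; applies to every response served from this directory down
Header set X-Robots-Tag "noindex, nofollow"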
I just read this thread, and thought to add an idea.
In case one wants to place a site under construction or development out of view of unauthorized users, I think this idea is safe, although a bit of IT proficiency is required.
There is a "hosts" file on every operating system that works as a manual repository of DNS entries, overriding the online DNS servers.
In Windows it is at C:\Windows\System32\drivers\etc\hosts, and the Linux distros I know (Android, too) have it at /etc/hosts. macOS uses /etc/hosts as well.
The idea is to add an entry like
xxx.xxx.xxx.xxx anyDomain.tld
to that file.
It is important that the domain is created at your server/provider, but not yet published to the DNS servers.
What happens: because the domain is created on the server, the server will respond to requests for that domain, but no one else on the internet (no browsers) will know the IP address of your site, besides the computers whose hosts file you have added the above entry to.
In this situation, you can give the change to anyone interested in seeing your site (and who has your authorization), and no one else will be able to see it. No crawler will see it until you publish the DNS online.
I even use it for a private file server that my family share.
Here you can find a thorough explanation on how to edit the hosts file:
https://www.howtogeek.com/howto/27350/beginner-geek-how-to-edit-your-hosts-file/

robots.txt disallow /variable_dir_name/directory

I need to disallow /variable_dir_name/directory via robots.txt
I use:
Disallow: */directory
Noindex: */directory
is that correct?
The following should work in your robots.txt:
User-Agent: *
Disallow: /*/directory
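Assuming a bot that supports wildcards (such as Googlebot), that pattern matches as a prefix once the wildcard is expanded, for example:
# blocked:     /some-dir/directory
# blocked:     /a/b/directory/page.html
# not blocked: /directory (there is no path segment in front of it)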
Further reading from Google: Block or remove pages using a robots.txt file
Indeed, Googlebot used to allow the use of these directives in robots.txt:
Noindex
Nofollow
Crawl-delay
But as announced on the Google webmaster blog, they no longer support those (used by only 0.001% of sites) directives as of September 2019. So to be safe for the future, you should only use meta tags for these on your pages.
What you really should do is the following:
Disallow via robots.txt and
Noindex already indexed documents via Google Search Console

Getting all posts from a blog (wordpress or blogger)

This is assuming that direct access to an API is not available. Since I am requesting ALL posts, I am not sure RSS would help much.
I considered a simple system that would loop through each year and month and download each HTML file, changing the following URL for each year/month pair. This works for WordPress and Blogger blogs.
http://www.lostincheeseland.com/2011/05
However, is there a way to use the following search function provided by Blogger to return all posts? I have played around with it, but documentation seems sparse.
http://www.lostincheeseland.com/search?updated-max=2012-08-17T09:44:00%2B02:00&max-results=6
Are there other methods I have not considered?
What you're looking for is a sitemap.
First of all, you're writing a bot so it's good manners to check the blog's robots.txt file. And lo and behold, you'll often find a sitemap mentioned there. Here's an example from the Google blog:
User-agent: Mediapartners-Google
Disallow:
User-agent: *
Disallow: /search
Allow: /
Sitemap: http://googleblog.blogspot.com/feeds/posts/default?orderby=UPDATED
In this case, you can visit the Sitemap URL to get an XML sitemap.
For WordPress, the same applies, but it's not built in as standard, so not all blogs will have it. Have a look at this plugin, which is the most popular way to create these sitemaps in WordPress. For example, my blog uses it and you can find the sitemap at /sitemap.xml
(the standard location)
In short:
Check robots.txt
Follow the Sitemap url if it's present
Otherwise, check for /sitemap.xml
Also: be a good Internet citizen! If you're going to write a bot, make sure it obeys the robots.txt file (like where blogspot tells you explicitly not to use /search!)
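To make those steps concrete, here is a rough Node.js sketch (Node 18+ for the built-in fetch; the parsing is deliberately naive, and a Blogger "sitemap" may really be an Atom feed that deserves a proper parser):
// Discover a blog's sitemap via robots.txt, then list the URLs it contains.
const blog = 'https://googleblog.blogspot.com'; // host from the example above

async function getPostUrls(base) {
  // 1. check robots.txt for a Sitemap: line
  const robots = await (await fetch(base + '/robots.txt')).text();
  const match = robots.match(/^sitemap:\s*(\S+)/im);
  // 2. follow the Sitemap URL if present, 3. otherwise fall back to /sitemap.xml
  const sitemapUrl = match ? match[1] : base + '/sitemap.xml';
  const xml = await (await fetch(sitemapUrl)).text();
  // naive <loc> extraction; a real bot should parse the XML (or Atom) properly
  return [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map(m => m[1]);
}

getPostUrls(blog).then(urls => console.log(urls));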
