I am facing an issue with my robots.txt file - WordPress

I am using WordPress. Google does not crawl all the resources on my page; it shows "Page partially loaded". I have already tried many times to solve this issue with the robots.txt file. My website returns a bad gateway error.
Here's a screenshot.
My website link: https://www.alphaclick.in
My robots.txt file:
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /linkout/
Disallow: /recommended/
Disallow: /comments/feed/
Disallow: /trackback/
Disallow: /index.php
Disallow: /xmlrpc.php
User-agent: NinjaBot
Allow: /
User-agent: Mediapartners-Google*
Allow: /
User-agent: Googlebot-Image
Allow: /wp-content/uploads/
User-agent: Adsbot-Google
Allow: /
User-agent: Googlebot-Mobile
Allow: /
Sitemap: https://www.alphaclick.in/sitemap_index.xml
Sitemap: https://www.alphaclick.in/post-sitemap.xml

Delete the line Disallow: /index.php. It can block the website for bots, since it matches every URL whose path starts with /index.php. More information about the robots.txt file is available here.
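To see which URLs a single rule affects, one option is Python's standard urllib.robotparser module. The sketch below is only an approximation: the one-group file and the sample paths are made up for illustration, and Python's parser does plain prefix matching, so Googlebot's own matcher may behave differently in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file containing only the rule under discussion.
ROBOTS_TXT = """\
User-agent: *
Disallow: /index.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def blocked(path: str) -> bool:
    """Return True if a generic crawler is blocked from this path."""
    return not parser.can_fetch("*", path)


# Sample paths are assumptions, not taken from the site in the question.
for path in ["/", "/index.php", "/index.php/sample-post/", "/sample-post/"]:
    print(path, "blocked" if blocked(path) else "allowed")
```

Whether the rule hurts a given site therefore depends on which paths its pages and resources are actually served from.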

Related

Yoast SEO: how to allow crawler bots

I was given the job of getting rid of "No information is available for this page" on a website. The website uses Yoast SEO, but it was disabled, so I re-enabled it and got a basic robots.txt like this:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
I applied those settings about six hours ago and tried searching for the site in Google, but nothing has changed yet. I feel anxious now.
Is this enough for the crawlers to read the website? Am I missing something? Do I need to mess with .htaccess? I have zero experience in SEO, so any help would be much appreciated.

Copy and paste this into your robots.txt:
User-agent: Googlebot
Disallow:
User-agent: googlebot-image
Disallow:
User-agent: googlebot-mobile
Disallow:
User-agent: MSNBot
Disallow:
User-agent: Slurp
Disallow:
User-agent: Teoma
Disallow:
User-agent: Gigabot
Disallow:
User-agent: Robozilla
Disallow:
User-agent: Nutch
Disallow:
User-agent: ia_archiver
Disallow:
User-agent: baiduspider
Disallow:
User-agent: naverbot
Disallow:
User-agent: yeti
Disallow:
User-agent: yahoo-mmcrawler
Disallow:
User-agent: psbot
Disallow:
User-agent: yahoo-blogs/v3.9
Disallow:
User-agent: *
Disallow:
Sitemap: https://www.yoursitename.com/sitemap.xml
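An empty Disallow value means "nothing is disallowed", so every group in the list above grants full access. As a rough check, the sketch below uses Python's urllib.robotparser to suggest that a single catch-all group gives the same result for several of those bots. The URL is a placeholder, and real crawlers may parse the file differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# A single catch-all group with an empty Disallow value.
ROBOTS_TXT = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A few of the bot names from the list above; the URL is a placeholder.
BOTS = ["Googlebot", "Googlebot-Image", "MSNBot", "Slurp", "baiduspider"]
results = {
    bot: parser.can_fetch(bot, "https://www.example.com/any/page/")
    for bot in BOTS
}
print(results)
```

Under that assumption, the per-bot groups add length to the file but no extra permissions.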

Correct syntax for robots.txt file?

What's below is in my robots.txt file.
I want particular search engines (Google, MSN, Bing, Yahoo, DuckDuckBot) to have access to the site, except for a few key areas such as the admin section, the wp-content area, and a folder that does not exist, and I want to disallow everyone else. Is the syntax below correct for that?
User-agent: Googlebot
Allow: *
Disallow: /wp-admin/*
Disallow: /wp-content/*
Disallow: /docs/*
User-agent: MSNBot
Allow: *
Disallow: /wp-admin/*
Disallow: /wp-content/*
Disallow: /docs/*
User-agent: Bingbot
Allow: *
Disallow: /wp-admin/*
Disallow: /wp-content/*
Disallow: /docs/*
User-agent: Slurp
Allow: *
Disallow: /wp-admin/*
Disallow: /wp-content/*
Disallow: /docs/*
User-agent: DuckDuckBot
Allow: *
Disallow: /wp-admin/*
Disallow: /wp-content/*
Disallow: /docs/*
User-agent: Google (+https://developers.google.com/+/web/snippet/)
Allow: *
Disallow: /wp-admin/*
Disallow: /wp-content/*
Disallow: /docs/*
User-agent: Googlebot-Image/1.0
Allow: *
Disallow: /wp-admin/*
Disallow: /wp-content/*
Disallow: /docs/*
User-agent: Googlebot-Video/1.0
Allow: *
Disallow: /wp-admin/*
Disallow: /wp-content/*
Disallow: /docs/*
User-agent: SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
Allow: *
Disallow: /wp-admin/*
Disallow: /wp-content/*
Disallow: /docs/*
User-agent: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Allow: *
Disallow: /wp-admin/*
Disallow: /wp-content/*
Disallow: /docs/*
User-agent: *
Disallow: *

The syntax is correct, but the approach is wrong.
1. Never block your content
Google (and many other search engines) fully renders your page. If you block access to images, Google may lower your position in search results, just in case: Googlebot cannot tell whether your page is full of broken image links or not.
This is a quote from Maile Ohye, Google Developer Programs Tech Lead:
“We recommend making sure Googlebot can access any embedded resource that meaningfully contributes to your site’s visible content or its layout”
2. Do not block /wp-admin/admin-ajax.php
When you block access to /wp-admin/ entirely, no AJAX content is available to robots. That is why the standard robots.txt that WordPress generates on the fly is as follows:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
3. Do not block other bots
The list of search bots is longer than the one in your question, and it grows from time to time. Googlebot-Mobile, for example, is not present in your list. The last statement in your file blocks this bot, with evident results for mobile search.
It is better not to reinvent the wheel, but to use the standard WordPress robots.txt settings shown above, or the broader settings from the Yoast SEO plugin (1+ million installs).
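The reason the Allow line in point 2 can override the broader Disallow is the "most specific rule wins" convention: the longest matching path decides, and a tie goes to Allow. The function below is a minimal sketch of that convention for plain path prefixes only (no wildcards). The name is_allowed and the sample paths are made up for illustration, and it is not any crawler's actual implementation.

```python
def is_allowed(rules, path):
    """Decide access by the longest matching prefix.

    rules is a list of (directive, path_prefix) pairs. If nothing
    matches, the path is allowed. On equal length, Allow wins.
    """
    best = None  # (prefix_length, is_allow) of the best match so far
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        candidate = (len(prefix), directive.lower() == "allow")
        # Tuples compare by length first, then True (allow) > False.
        if best is None or candidate > best:
            best = candidate
    return True if best is None else best[1]


# The standard WordPress rules quoted in point 2.
WORDPRESS_DEFAULT = [
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]

for path in ["/wp-admin/admin-ajax.php", "/wp-admin/options.php", "/sample-post/"]:
    print(path, "allowed" if is_allowed(WORDPRESS_DEFAULT, path) else "blocked")
```

Parsers that stop at the first matching line instead may give a different answer for admin-ajax.php, which is one more reason to test a file against the crawler you care about.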

Googlebot robots.txt tester not working

The robots.txt tester is not working in my case. I have the lines below in robots.txt.
But when I test wp-admin in the tester, the tool shows it as allowed. I don't know why. Please help me disallow wp-admin.
User-Agent: Googlebot
Allow: *.css*
Allow: *.js*
Allow: /*.jpg
Allow: /*.gif
Allow: /*.png
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /category
Disallow: /tag
Disallow: /page
Disallow: /author
Disallow: /trackback
Disallow: /*trackback
Disallow: /*trackback*
Disallow: /*/trackback
Disallow: /*?*
Disallow: /*.html/$
Disallow: /*feed*
# Google Image
User-agent: Googlebot-Image
Disallow:
Allow: /*
# Google AdSense
User-agent: Mediapartners-Google*
Disallow:
Allow: /*

If you remove the trailing slash, the test will pass. If you put a page after wp-admin in the tester, such as /wp-admin/admin.php, you will also see that the rule passes (it blocks the bots).
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin
Disallow: /recommended/
Disallow: /comments/feed/
Disallow: /trackback/
Disallow: /index.php
Disallow: /xmlrpc.php
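Another thing that may be worth checking, offered here as a possibility rather than a confirmed diagnosis: a crawler that finds a group naming it specifically follows that group and ignores the User-agent: * group. The file in the question has a separate Googlebot group, so the Disallow lines under * might not apply to Googlebot at all. The sketch below illustrates the group-selection behaviour with Python's urllib.robotparser, using a simplified two-group file and a placeholder URL.

```python
from urllib.robotparser import RobotFileParser

# Simplified, hypothetical file: one Googlebot group, one catch-all group.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /wp-content/uploads/

User-agent: *
Disallow: /wp-admin
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

URL = "https://www.example.com/wp-admin/"  # placeholder domain

# Googlebot matches its own group, which has no rule for /wp-admin.
googlebot_allowed = parser.can_fetch("Googlebot", URL)
# A bot without its own group falls back to the catch-all group.
other_bot_allowed = parser.can_fetch("SomeOtherBot", URL)

print("Googlebot:", googlebot_allowed)
print("SomeOtherBot:", other_bot_allowed)
```

If that is what is happening, repeating the Disallow lines inside the Googlebot group would be one way to make them apply to Googlebot as well.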

How to allow or restrict Googlebot from indexing or crawling certain things in WordPress?

Well, I have a problem with Googlebot: it takes 700 MB of bandwidth daily. I mention this for those who will obviously ask why I want to do this.
I know about robots.txt and that I can stop bots from indexing some folders.
But what about WordPress? I am using post-name permalinks, so the permalinks for posts and pages are just /page or /post.
I searched for a plugin that restricts bots to indexing only a few tags and a few categories, but didn't find one.
I want to allow sticky posts, a few categories, and a few tags.
Can this be done? How?
I have an update on this question.
I decided to go with robots.txt rules.
User-agent: *
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: AhrefsBot/3.1
Disallow: /
User-agent: Yahoo-slurp
Disallow: /
User-agent: Msnbot
Disallow: /
User-agent: Googlebot
Allow: /
Disallow: /category
Disallow: /video
Disallow: /author
Disallow: /?s=
Disallow: /feed/
Disallow: /xmlrpc.php
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Disallow: /wp-login.php
Disallow: /wp-register.php
Disallow: /tag
Allow: /tag/marry
Allow: /tag/john
Will the last two tags be indexed?
And is there something more to hide in WordPress?

If you want to allow particular posts but disallow everything else, use Allow directives. For example:
User-agent: Googlebot
Allow: /post/foo
Allow: /page/bar
Disallow: *
So the bot can crawl the pages you specify, but not anything else.
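The allow-list pattern above can be sketched as a small generator plus a check. The helper name build_allowlist_group is hypothetical, and the sketch ends the group with Disallow: / rather than the Disallow: * shown above, because Python's urllib.robotparser does not interpret wildcards. The tag paths come from the question. Listing the Allow lines first keeps the result the same for parsers that stop at the first matching rule and for parsers that pick the longest match.

```python
from urllib.robotparser import RobotFileParser


def build_allowlist_group(user_agent, allowed_paths):
    """Build one robots.txt group: Allow lines first, then block the rest."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Allow: {path}" for path in allowed_paths]
    lines.append("Disallow: /")  # assumption: '/' instead of '*'
    return "\n".join(lines) + "\n"


group = build_allowlist_group("Googlebot", ["/tag/marry", "/tag/john"])
print(group)

parser = RobotFileParser()
parser.parse(group.splitlines())


def allowed(path: str) -> bool:
    """Return True if Googlebot may fetch the path under this group."""
    return parser.can_fetch("Googlebot", path)


for path in ["/tag/marry", "/tag/john", "/tag/other", "/category/news/"]:
    print(path, "allowed" if allowed(path) else "blocked")
```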

Duplicate website content detected by Google Webmaster Tools

We have a website based on CodeIgniter with a WordPress blog in a subdirectory, /blog.
Using Google Webmaster Tools and the search results, we are seeing duplicate content, mainly for our home page, with the following shown after the domain name.
For example, a Google search for site:domainname.com shows:
domainname.com/?author=1
domainname.com/?author=2
domainname.com/?cat=1
domainname.com/?cat=3
domainname.com/?cat=4
/?feed=rss2&tag=drinking-establishments
/?feed=rss2&tag=fun
/?feed=rss2&tag=introduction
These all appear to be generated by the WordPress blog, and we are not sure how to fix this.

You could use a robots.txt file to tell Google what they should (and shouldn't) be looking for on your site.
A robots.txt file should live here: example.com/robots.txt
An example robots.txt as taken from the WordPress Codex:
http://codex.wordpress.org/Search_Engine_Optimization_for_WordPress
Sitemap: http://www.example.com/sitemap.xml
# Google Image
User-agent: Googlebot-Image
Disallow:
Allow: /*
# Google AdSense
User-agent: Mediapartners-Google
Disallow:
# digg mirror
User-agent: duggmirror
Disallow: /
# global
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Disallow: /trackback/
Disallow: /feed/
Disallow: /comments/
Disallow: /category/*/*
Disallow: */trackback/
Disallow: */feed/
Disallow: */comments/
Disallow: /*?
Allow: /wp-content/uploads/
Background reading:
http://en.wikipedia.org/wiki/Robots_exclusion_standard
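The rule in the example that targets the duplicate URLs from the question is Disallow: /*?, which matches any path containing a question mark. The sketch below shows one way to read such wildcard patterns, assuming the common convention that * matches any run of characters and a trailing $ anchors the end of the URL. The function name matches is made up for illustration, and this is not a crawler's actual matcher.

```python
import re


def matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches the URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with '.*' for each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, like a path prefix match.
    return re.match(regex, path) is not None


# URLs reported in the question, written as path plus query string.
DUPLICATE_URLS = ["/?author=1", "/?cat=3", "/?feed=rss2&tag=fun"]

for url in DUPLICATE_URLS + ["/blog/hello-world/"]:
    print(url, "blocked" if matches("/*?", url) else "not matched")
```

Note that this pattern would match every URL with a query string on the domain, not only the ones generated by the blog, so it is worth checking that nothing on the main site depends on query parameters being crawled.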