This is assuming that direct access to an API is not available. Since I am requesting ALL posts, I am not sure RSS would help much.
I considered a simple system that would loop through each year and month and download each HTML file by changing the following URL for each year/month pair. This works for WordPress and Blogger blogs.
http://www.lostincheeseland.com/2011/05
However, is there a way to use the following search function provided by Blogger to return all posts? I have played around with it, but documentation seems sparse.
http://www.lostincheeseland.com/search?updated-max=2012-08-17T09:44:00%2B02:00&max-results=6
Are there other methods I have not considered?
What you're looking for is a sitemap.
First of all, you're writing a bot so it's good manners to check the blog's robots.txt file. And lo and behold, you'll often find a sitemap mentioned there. Here's an example from the Google blog:
User-agent: Mediapartners-Google
Disallow:
User-agent: *
Disallow: /search
Allow: /
Sitemap: http://googleblog.blogspot.com/feeds/posts/default?orderby=UPDATED
In this case, you can visit the Sitemap URL to get an XML sitemap.
For WordPress, the same applies, but sitemaps aren't built in as standard, so not all blogs will have one. Have a look at this plugin, which is the most popular way to create these sitemaps in WordPress. For example, my blog uses it and you can find the sitemap at /sitemap.xml (the standard location).
In short:
Check robots.txt
Follow the Sitemap url if it's present
Otherwise, check for /sitemap.xml
Also: be a good Internet citizen! If you're going to write a bot, make sure it obeys the robots.txt file (like where blogspot tells you explicitly not to use /search!)
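The steps above can be sketched in a few lines. This is only a rough sketch of the sitemap-discovery part (`findSitemaps` is a hypothetical helper name); it relies on the convention that `Sitemap:` lines may appear anywhere in robots.txt and are not tied to a `User-agent` group:

```javascript
// Extract Sitemap URLs from the text of a robots.txt file.
// The field name is case-insensitive and Sitemap lines can appear
// anywhere in the file, so we simply scan every line.
function findSitemaps(robotsTxt) {
  const sitemaps = [];
  for (const line of robotsTxt.split("\n")) {
    const match = line.match(/^\s*sitemap\s*:\s*(\S+)/i);
    if (match) sitemaps.push(match[1]);
  }
  return sitemaps;
}
```

If `findSitemaps` returns nothing, fall back to requesting /sitemap.xml directly, as suggested above.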
My client has an ASP.NET MVC web application that also has a WordPress blog in a subfolder.
https://www.example.com/
https://www.example.com/wordpress
The WordPress site is loaded with some social sharing links that I do not want crawlers to index. For example:
https://www.example.com/wordpress/some-post/?share=pinterest
First thing, should there be a robots.txt in the / folder and also one in the /wordpress folder? Or just a single one in the / folder? I've tried both without any success.
In my robots.txt file I've included the following:
User-agent: Googlebot
Disallow: ?share=pinterest$
I've also tried several variations like:
Disallow: /wordpress/*/?share=pinterest
No matter what rule I have in robots.txt, I'm not able to get crawlers to stop trying to index these social sharing links. The plugin that creates these sharing links also marks them "nofollow noindex noreferrer", but since they are all internal links this causes issues by blocking internal "link juice".
How do I form a rule to disallow crawlers from indexing any link on this site that ends with ?share=pinterest?
Should both sites have a robots.txt or only one in the main/root folder?
robots.txt should only be at the root of the domain. https://example.com/robots.txt is the correct URL for your robots.txt file. Any robots.txt file in a subdirectory will be ignored.
By default, robots.txt rules are all "starts with" rules. Only a few major bots such as Googlebot support wildcards in Disallow: rules. If you use wildcards, the rules will be obeyed by the major search engines but ignored by most less sophisticated bots.
Using nofollow on those links isn't really going to affect your internal link juice. Those links are all going to be external redirects that will either pass PageRank out of your site, or if you block that PageRank somehow, it will evaporate. Neither external linking nor PageRank evaporation hurts the SEO of the rest of your site, so it doesn't really matter from an SEO perspective what you do. You can allow those links to be crawled, use nofollow on those links, or disallow those links in robots.txt. It won't change how the rest of your site is ranked.
robots.txt also has the disadvantage that search engines occasionally index disallowed pages. robots.txt blocks crawling, but it doesn't always prevent indexing. If any of those URLs get external links, Google may index the URL with the anchor text of the links it finds to them.
If you really want to hide the social sharing from search engine bots, you should have the functionality handled with onclick events. Something like:
<a onclick="pinterestShare()">Share on Pinterest</a>
Where pinterestShare is a JavaScript function that uses location.href to set the URL of the page to the Pinterest share URL for the current URL.
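A minimal sketch of such a function. The pin/create/button endpoint below is the widely used Pinterest share URL, but verify it against Pinterest's current documentation; `buildPinterestShareUrl` is a hypothetical helper name:

```javascript
// Build the Pinterest share URL for a given page URL.
function buildPinterestShareUrl(pageUrl) {
  return "https://www.pinterest.com/pin/create/button/?url=" +
         encodeURIComponent(pageUrl);
}

// Click handler: navigate to the share URL for the current page.
// Because the share target is computed in JavaScript, there is no
// crawlable ?share=pinterest link in the HTML at all.
function pinterestShare() {
  location.href = buildPinterestShareUrl(location.href);
}
```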
To directly answer your question about robots.txt, this rule is correct:
User-agent: *
Disallow: /wordpress/*/?share=pinterest
You can use Google's robots.txt testing tool to verify that it blocks your URL.
You have to wait 24 hours after making robots.txt changes before bots start obeying the new rules. Bots often cache your old robots.txt for a day.
You may have to wait weeks for new results to show in your webmaster tools and search console accounts. Search engines won't report new results until they get around to re-crawling pages, realize the requests are blocked, and that information makes it back to their webmaster information portals.
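As a rough illustration of the "starts with, plus wildcards" matching described above (`robotsRuleMatches` is a sketch of the Googlebot-style rules, not Google's actual implementation):

```javascript
// Test whether a path matches a robots.txt rule the way major bots
// interpret it: the rule matches from the start of the path, '*'
// matches any sequence, and a trailing '$' anchors at the end.
function robotsRuleMatches(rule, path) {
  const anchored = rule.endsWith("$");
  const body = anchored ? rule.slice(0, -1) : rule;
  // Escape regex metacharacters, then turn '*' into '.*'.
  const escaped = body
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&")
    .replace(/\*/g, ".*");
  const re = new RegExp("^" + escaped + (anchored ? "$" : ""));
  return re.test(path);
}
```

This also shows why a rule like `Disallow: ?share=pinterest$` never matches anything: rules are compared from the start of the path, and no path starts with `?`.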
Need help on this robots.txt question. My default file looks something like this
User-agent: *
Disallow:
Sitemap: https://mywebsite.com/sitemap_index.xml
Problem is that with this configuration, Google deindexed almost all of my URLs (as of this writing).
Is it correct to leave the disallow field blank?
Yes, it's technically correct.
This means that all user agents, including search engines, can access your website's pages.
The asterisk after User-agent means that the rule applies to all user agents.
Nothing is listed after Disallow, which means there are no restrictions at all.
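For contrast, the two forms differ only by a slash but mean opposite things (a sketch; inline `#` comments are allowed in robots.txt):

```
User-agent: *
Disallow:      # empty value: nothing is blocked, crawl everything

# versus, in a different file:
User-agent: *
Disallow: /    # blocks crawling of the entire site
```

So your file is not the cause of the deindexing; an allow-all robots.txt cannot deindex pages.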
I want to add nofollow and noindex to my site whilst it's being built. The client has request I use these rules.
I am aware of
<meta name="robots" content="noindex,nofollow">
But I only have access to the robots.txt file.
Does anyone know the correct format I can use to apply noindex, nofollow rules via the robots.txt file?
noindex and nofollow mean you do not want search engines to index your site or follow its links. robots.txt cannot express those directives directly, but the closest you can get is to block crawling entirely:
User-agent: *
Disallow: /
This stops compliant bots from crawling the site at all, which for an unlaunched site usually keeps it out of search results.
There is a non-standard Noindex field, which Google supported as an experimental feature (and likely no other consumer did).
Following the robots.txt specification, you can't disallow indexing or following links with robots.txt.
For a site that is still in development, has not been indexed yet, and doesn’t get backlinks from pages which may be crawled, using robots.txt should be sufficient:
# no bot may crawl
User-agent: *
Disallow: /
If pages from the site are already indexed, and/or if other pages which may be crawled link to it, you have to use noindex, which can be specified not only in the HTML but also as an HTTP header:
X-Robots-Tag: noindex, nofollow
Noindex tells search engines not to include pages in search results, but they can still follow the links (and also transfer PA and DA).
Nofollow tells bots not to follow the links. We can also combine noindex with follow on pages we don't want indexed but whose links we do want followed.
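One way to send that header, sketched here as a small Node.js helper (a minimal sketch; in practice you would more commonly set the header in your Apache or nginx config, and `blockRobots` is a hypothetical name):

```javascript
// Attach the X-Robots-Tag header to an HTTP response so search
// engines neither index the page nor follow its links.
function blockRobots(res) {
  res.setHeader("X-Robots-Tag", "noindex, nofollow");
  return res;
}

// Example use inside a bare Node.js server:
// const http = require("http");
// http.createServer((req, res) => {
//   blockRobots(res).end("under construction");
// }).listen(8080);
```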
I just read this thread, and thought to add an idea.
In case one wants to place a site under construction or development out of view of unauthorized users, I think this idea is safe, although a bit of IT proficiency is required.
There is a "hosts" file on any operating system that works as a manual repository of DNS entries, overriding an online DNS server.
In Windows it is at C:\Windows\System32\drivers\etc\hosts; the Linux distros I know (Android, too) have it at /etc/hosts, and macOS uses the same path.
The idea is to add an entry like
xxx.xxx.xxx.xxx anyDomain.tld
to that file.
It is important that the domain is created on your server/provider, but not yet published to the DNS servers.
What happens: since the domain is configured on the server, the server will respond to requests for that domain, but no one else on the internet (no browsers) will know the IP address of your site, apart from the computers where you have added the above snippet to the hosts file.
In this situation, you can give the entry to anyone interested in seeing your site (and who has your authorization), and no one else will be able to see it. No crawler will see it until you publish the DNS records.
I even use it for a private file server that my family share.
Here you can find a thorough explanation on how to edit the hosts file:
https://www.howtogeek.com/howto/27350/beginner-geek-how-to-edit-your-hosts-file/
I have a couple of WordPress sites, and with the current Google SEO algorithm update a site should be mobile friendly (here).
My query is as follows: currently I have a rule in robots.txt to disallow crawling of the URLs with wp-
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /feed
Disallow: /*/feed
Disallow: /wp-login.php
I don't want Google to crawl the above URLs. Earlier it was working fine, but with the recent Google algorithm update, when I disallow these URLs it starts giving errors in the mobile-friendly test (here), as all my CSS and JS are behind the wp- URLs. I am wondering how I can fix this one.
Any suggestions appreciated.
If you keep the crawler away from those files, your page may look and work differently to Google than it does to your visitors. This is what Google wants to avoid.
There is no problem in allowing Google to access the CSS or JS files, as anyone else who can open your HTML source and read links can access them anyway.
Therefore Google definitely wants to access the CSS and JS files used on your page:
https://developers.google.com/webmasters/mobile-sites/mobile-seo/common-mistakes/blocked-resources?hl=en
Those files are needed to render your pages.
If your site’s robots.txt file disallows crawling of these assets, it directly harms how well our algorithms render and index your content. This can result in suboptimal rankings.
If you are dependent on mobile rankings, you must follow Google's guidelines. If not, feel free to block the crawler.
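One way to unblock just the assets while keeping the directories disallowed is to add Allow rules for the file types Google needs to render the page. This is a sketch; wildcards and Allow are honored by the major engines (Google picks the longest matching rule), so verify it against your own file layout in the robots.txt testing tool:

```
User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Allow: /wp-includes/*.css$
Allow: /wp-includes/*.js$
Allow: /wp-content/plugins/*.css$
Allow: /wp-content/plugins/*.js$
```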
Below is the entire content of my robots.txt file.
User-agent: *
Disallow: /marketing/wp-admin/
Disallow: /marketing/wp-includes/
Sitemap: http://mywebsite.com/sitemap.xml.gz
It is the one apparently generated by WordPress; I haven't manually created one.
Yet when I signed up for Google Webmaster Tools today, this is the content that Google Webmaster Tools is seeing:
User-agent: *
Disallow: /
... So ALL my urls are blocked!
In Wordpress, settings > reading > search engine visibility: "Discourage search engines from indexing this site" is not checked. I unchecked it fairly recently. (Google Webmaster tools is telling me it downloaded my robots.txt file on Nov 13, 2013.)
...So why is it still reading the old version where all my pages are disallowed, instead of the new version?
Does it take a while? Should I just be patient?
Also what is the ".gz" on the end of my sitemap line? I'm using the Yoast All-in-One SEO pack plugin. I'm thinking the plugin added the ".gz", whatever that is.
You can ask Googlebot to crawl again after you've changed your robots.txt. See Ask Google to crawl a page or site for information.
The Sitemap file tells Googlebot more about the structure of your site, and allows it to crawl more effectively. See About Sitemaps for more info.
The .gz just tells Googlebot that the generated sitemap file is compressed with gzip.
A WordPress discussion on this topic can be found here: https://wordpress.org/support/topic/robotstxt-wordpress-and-google-webmaster-tools?replies=5