Web scraping search results

I need help solving the following issue:
I need to validate URLs cached by the Google search engine for a particular site. If a URL returns a 404, or the page does not render certain necessary HTML elements (considered broken), I need to log those URLs so I can later 301 redirect them to the correct URLs. I know PHP and a little bit of Python, but I'm not sure what approach to use to scrape all URLs from the search engine results for a given site.

http://simplehtmldom.sourceforge.net/ - a simple HTML parser. There is an example on that page; I'm not sure if it still works with Google's instant search etc.
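If it helps, here is a minimal sketch of the validation step in PHP (the part that runs after you have collected the URLs, whether from a site: query or your own sitemap), using PHP's built-in cURL and DOMDocument rather than simplehtmldom. The file names and the XPath check are placeholders you would adapt to your site:

<?php
// check_urls.php - sketch: flag URLs that 404 or lack a required HTML element.
// The input/output file names and the XPath check are placeholders for illustration.
$urls = file('cached_urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$log  = fopen('broken_urls.log', 'a');

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 15,
        CURLOPT_USERAGENT      => 'link-checker/1.0',
    ));
    $html = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    $broken = ($html === false || $code === 404);

    if (!$broken) {
        // "Broken" also covers pages that render without a required element.
        $dom = new DOMDocument();
        @$dom->loadHTML($html);                       // silence warnings from messy real-world markup
        $xpath  = new DOMXPath($dom);
        $broken = $xpath->query('//div[@id="content"]')->length === 0; // placeholder check
    }

    if ($broken) {
        fwrite($log, $code . "\t" . $url . PHP_EOL);  // input for the later 301 mapping
    }
}
fclose($log);

The resulting log file then becomes the input for whatever 301 redirect map you build later.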

Related

Does WordPress list all pages for crawlers?

I created a page on a WordPress site that was for internal use only and triggers some backend code. Within a few days I started seeing hits on that page from "bingbot".
I'm not using any kind of sitemap plugin. How are crawlers finding this page?
I know the robots.txt file can block them, but I want to make sure the page doesn't show up for crawlers that don't respect it. I still want the page to be publicly accessible if someone types in the URL.
What needs to be done in WordPress to make sure a page can't be discovered except by typing in the URL?
Any given URL is potentially "discovered" once the post is published, particularly if there is a link to it from elsewhere on your site. There's no guaranteed way to prevent search engines from finding and indexing a URL.
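If you also want a belt-and-braces noindex hint on that one page (in addition to robots.txt), here is a small sketch for the theme's functions.php; the slug 'internal-trigger-page' is a placeholder, and like robots.txt this only helps with crawlers that behave:

// functions.php - ask crawlers not to index one specific page.
// 'internal-trigger-page' is a placeholder slug; use your page's slug or ID.
add_action('wp_head', function () {
    if (is_page('internal-trigger-page')) {
        echo '<meta name="robots" content="noindex, nofollow">' . "\n";
    }
});

The page stays reachable for anyone who types the URL; well-behaved crawlers simply won't list it.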

From WordPress to a single-page application - SEO issues

I need to test, in terms of SEO, a new site that was migrated from WordPress to a single-page application. I want to keep my rank in search results, but unfortunately I am not sure how to do it properly.
Could you please give me any advice?
I know that I need to verify the sitemap and check for broken links. Could you recommend any tools to do this automatically?
The most important thing is to keep the same URLs you had on your old website, or to 301 redirect the old URLs to the new ones.
Regarding one-page sites: they are not ideal for SEO, since all your content sits on a single page and you can't target many keywords.
You can still rank a one-page website for many keywords with advanced techniques, but that tends to hurt the UX.
Regarding the XML sitemap: since your new website has only one page, you have to 301 redirect all the old pages to it, or you will end up with many 404s in Webmaster Tools.
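A minimal sketch of that 301 mapping in plain PHP, for example at the top of the SPA's index.php; the old paths and the fragment anchors are placeholders for illustration:

<?php
// Redirect known old WordPress URLs to the single-page app with a 301.
// The paths and fragments below are placeholders.
$map = array(
    '/about-us/'     => '/#about',
    '/contact/'      => '/#contact',
    '/blog/my-post/' => '/#blog',
);

$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

if (isset($map[$path])) {
    header('Location: ' . $map[$path], true, 301);
    exit;
}
// ...otherwise fall through and serve the SPA as usual.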

WordPress SEO by Yoast: changing the sitemap XML URL

I want to change the URL of the XML sitemap generated by the Yoast SEO plugin:
from: http://example.com/sitemap_index.xml
to: http://example.com/sitemap-something-unique-aqw65643.xml
I know exactly how to change it using a rewrite rule, but my problem is that I don't know whether this is safe.
Can anyone explain what the negative effects might be?
Using a URL rewrite rule for something like this should not have any negative side effects. You are simply directing traffic from one URL to another. There are even some SEO benefits to using rewrites, but those mostly pertain to your page URLs. Since you are using WordPress, I'm sure that's not the advice you're looking for.
There is a thread on Google Support here that talks about URL mapping when preparing to migrate to a new domain; some of it might be helpful for your situation. I would lean toward a 301 permanent redirect for a sitemap that I was submitting to Google or another search engine.
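For what it's worth, the WordPress side of such a rewrite could look roughly like the sketch below. The 'index.php?sitemap=1' target assumes that is the query Yoast maps sitemap_index.xml to in your version, so check the rewrite rules your install registers before relying on it:

// functions.php - map an obfuscated sitemap URL onto Yoast's sitemap index.
// Assumption: your Yoast version routes sitemap_index.xml to index.php?sitemap=1.
add_action('init', function () {
    add_rewrite_rule(
        'sitemap-something-unique-aqw65643\.xml$',
        'index.php?sitemap=1',
        'top'
    );
});
// Flush rewrite rules once afterwards (e.g. by re-saving the Permalinks settings).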

How to mechanically identify all broken links in a Drupal site

We have just moved to Drupal and are trying to proactively identify all broken external web links (http://, https://).
I've seen some references to link validation, but I wasn't sure whether that only meant validating the syntax of a link, as opposed to checking whether the link actually works (e.g. doesn't return a 404).
What is the easiest way to go through all the web links on a Drupal site and identify the broken external ones? This is something we'd like to automate and schedule every day or week.
As someone else mentioned, use the Link Checker module. It's a great tool.
In addition, you can check the Crawl Errors report in Google Webmaster Tools for URLs that return 404. Clicking any URL there shows you where it was linked from, so you can update any broken internal links; using canonical URLs helps you avoid them in the first place.
Make sure you're using a proper internal linking strategy to avoid broken internal links going forward, too: http://www.daymuse.com/blogs/drupal-broken-internal-link-path-module-tutorial
Essentially: use canonical, relative links so that internal links don't break when you change aliases in the future. In simple Drupal terms, be sure you're linking to "node/23" instead of "domain.ext/content/my-node-title", since several parts of the latter might change later.
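In Drupal 7 code that looks roughly like this (node 23 is just the example from above); links built from the internal path keep working even if the alias changes:

// Drupal 7: build links from the internal path, not the current alias or domain.
$link = l(t('My node title'), 'node/23');            // renders the current alias automatically
$href = url('node/23', array('absolute' => FALSE));  // or just the path, if you only need the URL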
I have not found a Drupal-based approach for this. The best free piece of software I've found for finding bad links on sites is the Screaming Frog SEO Spider Tool:
http://www.screamingfrog.co.uk/seo-spider/

Google URL crawl error 404 - domain appended to the end of URLs

I recently built and published my WordPress site at www.kernelops.com and submitted it to the Google index and Webmaster Tools. Today I logged into Webmaster Tools and found 60 URL errors, all with the same type of issue: the base domain www.kernelops.com is being appended to all of my site's page, category, and post URLs. An example of a failing URL looks like this:
http://www.kernelops.com/blog/www.kernelops.com
Google Webmaster Tools indicates that this weird link originates from the base URL "http://www.kernelops.com/blog", which obviously means the issue is on my end. My WordPress permalink settings are set to use the post name; I'm not sure if that could be causing this, i.e.:
http://www.kernelops.com/sample-post/
I can't seem to find anything helpful on this weird issue through Google searches, and I thought someone here might be able to point me in the right direction.
The WordPress plugins that could potentially affect the site's URLs are the following:
All in One SEO
XML-Sitemap
But I can't see any setting within these plugins that would cause this type of issue.
Any ideas would be greatly appreciated - thanks in advance!
This is a long shot, but it may be happening because the Google crawler picks up a link that looks like a relative path and appends it to the current directory. It's highly unlikely that Google would have such a bug, but it's not impossible either.
The closest thing I could find that might be considered a relative path is this:
<div class="copyright">
...
Kernel, Inc.
...
</div>
I doubt that this is the problem, but it may be worth fixing it.
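For illustration only (this markup is hypothetical, not taken from the actual page source), this is how a scheme-less href produces exactly the kind of URL Google is reporting:

<!-- No scheme, so it is resolved relative to the current directory: on
     http://www.kernelops.com/blog it becomes
     http://www.kernelops.com/blog/www.kernelops.com (the 404 Google reports). -->
<a href="www.kernelops.com">Kernel, Inc.</a>

<!-- With the scheme (or a root-relative "/") it resolves to the homepage as intended. -->
<a href="http://www.kernelops.com">Kernel, Inc.</a>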
Now, there is yet another possibility: the website serves slightly different content depending on the User-Agent string. When Google presents your website with its User-Agent string, the SEO plugin detects it and tries to optimize things to improve your ranking (I'm not familiar with those plugins, so I don't know exactly what they do). There may be a bug in the SEO plugin that makes the www.kernelops.com URL look like a relative path, or that actually constructs that faulty URL somehow.
You can test this by setting the user-agent string in your browser (e.g. with Firefox's User Agent Switcher extension) to Googlebot's user-agent string and seeing what happens when you visit your website. Look at the page source you receive for any links that resemble the one Google is finding.
However, if the SEO tool is smart enough, it will "realize" that your IP doesn't match one of Googlebot's valid IP ranges and will not make the modifications.
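A quick way to run that check with PHP and cURL is sketched below; the user-agent string is Googlebot's commonly published desktop string, and the regex is just a rough filter for hrefs that lack a scheme or leading slash:

<?php
// Fetch the blog page while identifying as Googlebot, then list suspicious hrefs.
$ch = curl_init('http://www.kernelops.com/blog');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
));
$html = curl_exec($ch);
curl_close($ch);

// Any href that is not absolute, root-relative, an anchor, or mailto: is a candidate.
if (preg_match_all('#href="(?!https?://|/|\#|mailto:)([^"]+)"#i', $html, $matches)) {
    print_r($matches[1]);
}

As noted above, if the plugin also verifies Googlebot's IP range, this test may not reproduce the behaviour.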
