How to mechanically identify all broken links in a Drupal site

We have just moved to Drupal and are trying to proactively identify all broken external web (http://, https://) links.
I've seen some references to link validation, but I wasn't sure whether that meant only validating the syntax of a link, as opposed to checking whether the link actually works (e.g. doesn't return a 404).
What is the easiest way to go through all web links in a Drupal site and identify the broken external ones? This is something we'd like to automate and schedule every day/week.

As someone else mentioned, use the Link Checker module. It's a great tool.
In addition, you can check the Crawl Errors report in Google Webmaster Tools for links that return 404.
Clicking any URL in that report shows you where it was linked from, so you can update any broken internal links. Be sure to use canonical URLs to avoid creating such broken links in the first place.
Make sure you're using a proper internal linking strategy to avoid broken internal links in the first place, too: http://www.daymuse.com/blogs/drupal-broken-internal-link-path-module-tutorial
Essentially: use canonical, relative links so that internal links don't break when you later change aliases. In simple Drupal terms, be sure you're linking to "node/23" rather than "domain.ext/content/my-node-title", since several parts of the latter might change in the future.

I have not found a Drupal-based approach for this. The best free piece of software I've found for finding bad links on sites is the Screaming Frog SEO Spider Tool.
http://www.screamingfrog.co.uk/seo-spider/
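If you do want to automate the check yourself and schedule it with cron, here is a minimal sketch of the idea (not a Drupal module, just a standalone script). It assumes the site exposes an XML sitemap at a hypothetical SITEMAP_URL and that the third-party requests and beautifulsoup4 packages are installed; it visits each page, extracts external http/https links, and reports any that fail or return a 4xx/5xx status.

from urllib.parse import urlparse
from xml.etree import ElementTree
import requests                   # third-party: pip install requests
from bs4 import BeautifulSoup     # third-party: pip install beautifulsoup4

SITE = "https://example.com"             # hypothetical site root
SITEMAP_URL = SITE + "/sitemap.xml"      # hypothetical sitemap location

def page_urls(sitemap_url):
    # Pull every <loc> entry out of the XML sitemap.
    resp = requests.get(sitemap_url, timeout=10)
    resp.raise_for_status()
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ElementTree.fromstring(resp.content)
    return [loc.text for loc in root.findall(".//sm:loc", ns)]

def external_links(page_url):
    # Collect href values that point off-site over http/https.
    html = requests.get(page_url, timeout=10).text
    site_host = urlparse(SITE).netloc
    links = set()
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        parsed = urlparse(a["href"])
        if parsed.scheme in ("http", "https") and parsed.netloc != site_host:
            links.add(a["href"])
    return links

def status_of(url):
    # Try HEAD first; some servers reject HEAD, so fall back to GET.
    try:
        resp = requests.head(url, allow_redirects=True, timeout=10)
        if resp.status_code >= 400:
            resp = requests.get(url, allow_redirects=True, timeout=10)
        return resp.status_code
    except requests.RequestException:
        return None  # DNS failure, timeout, connection refused, ...

if __name__ == "__main__":
    for page in page_urls(SITEMAP_URL):
        for link in sorted(external_links(page)):
            status = status_of(link)
            if status is None or status >= 400:
                print(page, link, status)

Running something like this weekly from cron and mailing the output is usually enough; dedicated tools such as Link Checker or Screaming Frog handle throttling, retries and reporting far more thoroughly.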

Related

Prevent CMS identification

Just a small curiosity: I often use extensions like Wappalyzer to try to work out which CMS certain sites are built on, and in this regard I have a question: is there a way to prevent extensions like the one mentioned above from identifying the CMS used?
You can try, but I don't think you can successfully hide the CMS behind your site, and the attempt is largely a waste of time. For Drupal, check this page:
https://www.drupal.org/node/766404
Most CMSs can be run as a headless CMS, decoupling the backend from the frontend. They either render a static set of HTML pages, or a JSON feed with all the site's data which is then rendered by some kind of JavaScript app. This way no one can figure out where the data is coming from, or whether a CMS was involved at all.
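As a rough illustration of that decoupled idea, here is a minimal sketch of a build step that pulls content as JSON from the CMS backend and writes plain static HTML, so the published pages carry no CMS-specific markup, headers, or asset paths for a fingerprinting tool to match. The API_URL endpoint and the JSON shape (slug/title/body) are hypothetical, and the third-party requests package is assumed to be installed.

import html
import pathlib
import requests  # third-party: pip install requests

API_URL = "https://cms.internal.example/api/articles"  # hypothetical, non-public backend endpoint

TEMPLATE = (
    "<!doctype html><html><head><title>{title}</title></head>"
    "<body><h1>{title}</h1>{body}</body></html>"
)

def build(output_dir="public"):
    out = pathlib.Path(output_dir)
    out.mkdir(exist_ok=True)
    # Assumed response shape: a JSON list of {"slug": ..., "title": ..., "body": ...}
    for item in requests.get(API_URL, timeout=10).json():
        page = TEMPLATE.format(title=html.escape(item["title"]), body=item["body"])
        (out / (item["slug"] + ".html")).write_text(page, encoding="utf-8")

if __name__ == "__main__":
    build()  # serve the "public" directory with any static web server

Because visitors only ever see the generated static files, the backend (whatever CMS it is) never has to be reachable from the public internet at all.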

How can I hide my platform (CMS)?

I have Joomla and Drupal sites, but I don't want others to find out what platform (CMS) I'm running.
I want to prevent detection from tools like Wappalyzer or similar tools. (as seen in this screenshot: http://i43.tinypic.com/2evc6qo.png)
I've heard it has to do with meta tags, but I'm not sure.
There is no way to hide the fact that you're using Joomla. If you inspect the source code of a website built with WordPress, for example, you will see wp-includes within the URLs of the CSS and JS file includes.
When using Joomla, you can type /administrator at the end of the URL; even if the admin URL is hidden, though, inspecting the source can again give it away.
This might be of some help:
How to disable right-click context-menu in javascript
For Drupal, see the community wiki page "Hide, obscure, or remove clues that a site runs on Drupal":
The short answer is:
You can't. Do not try.
You can get pretty far in trying to hide the fact that your site runs on Drupal, but at some point you'll probably no longer be running Drupal at all ;-)
Have a look …
at our sister site, Drupal SE: How can I obscure the fact my site uses Drupal?
at drupalscout.com: Hiding the fact your site runs Drupal OR Fingerprinting a Drupal Site
There is a way to hide Joomla from bots.
You can use the jomDefender plugin. It removes the word "Joomla" from all pages, changes the admin page, and adds a few anti-bot tricks.
It's not perfect, but it still adds quite a bit of extra security to your Joomla site, such as a file integrity check, which can be quite useful if a file gets hacked.

Does automatic redirection/geo-location have an impact on my SEO? - Detecting whether it's a spider that is accessing the site

I have a site whose search ranking has plummeted. It should be quite SEO-friendly because it's built using XHTML/CSS and has been run against the SEO Toolkit.
The only things I can think of that may be annoying Google are:
The keywords are the same across the whole site rather than being page-specific (I can't see why this would be a massive deal).
Another URL has been set up that simply points to my site, without redirecting (again, no big deal).
Non-UK users are automatically forwarded to the US version of the site, which is a different brand. I guess this could be the problem: if Google spiders my site from the US, it will never see the UK version.
So the question is: does geo-redirection affect my SEO? And is it possible to detect whether whoever is accessing the site is actually a search engine spidering it? In that case I don't want to do any geo-location.
Do not use the same keywords across the entire site; try to use keywords that are specific to each page.
Do not let several URLs point directly to the same site, since this causes the inlinks from the different domains to be treated as belonging to different domains. If you point the extra URLs at your site via redirects instead, all inlinks are credited to the target domain and thus increase its "inlink score".
To detect whether a request is from a crawler, you can use the browsercaps project: http://owenbrady.net/browsercaps/
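The underlying idea is simple enough to sketch without browsercaps: inspect the request's User-Agent header for well-known crawler tokens and skip the geo-redirect for those requests. A minimal, illustrative Python sketch follows; the token list is far from exhaustive, user agents can be spoofed, and the function names are made up for the example.

# Substrings that identify the major search engine crawlers (illustrative list).
CRAWLER_TOKENS = ("googlebot", "bingbot", "slurp", "duckduckbot", "baiduspider", "yandex")

def is_crawler(user_agent):
    # Case-insensitive substring match against known bot tokens.
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)

def should_geo_redirect(user_agent, visitor_country):
    # Only redirect ordinary non-UK visitors; never redirect crawlers,
    # so a spider always sees the UK site it actually requested.
    return visitor_country != "GB" and not is_crawler(user_agent)

Major engines also document reverse-DNS verification of their crawlers' IP addresses, which is more reliable than trusting the User-Agent string alone.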

How to completely hide a website from search engines?

What's the recommended way to hide my staging website from search engines? I Googled it and found that some say I should add a meta tag, while others say I should put a text file inside my website's directory; I want to know the standard way.
My current website is in ASP.NET, though I believe there must be a common approach for any website, whatever its programming language.
Use a robots.txt file.
See here: http://www.robotstxt.org/robotstxt.html
You could also use your server's robots.txt:
User-agent: *
Disallow: /
Google's crawler actually respects these settings.
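If you want to sanity-check the rules once they are deployed, the Python standard library can parse a live robots.txt; a quick sketch (the staging hostname below is a placeholder):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://staging.example.com/robots.txt")  # placeholder staging host
rp.read()

# With "User-agent: *" and "Disallow: /" this prints False for any path,
# i.e. a compliant crawler is told not to fetch anything.
print(rp.can_fetch("Googlebot", "https://staging.example.com/any/page"))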
Really easy answer: password-protect it. If it's a staging site then it quite likely is not intended to be publicly facing (most likely a private audience only). Trying to keep it out of search engines is only treating a symptom when the real problem is that you haven't appropriately secured it.
Keep in mind that you can't hide a public-facing unprotected web site from a search engine. You can ask that bots not index it (through the robots.txt that my fine colleagues have brought up), and the people who write the bots may choose not to index your site based on that, but there's got to be at least one guy out there who is indexing all the things people ask him not to index. At the very least one.
If keeping automated crawlers out is a hard requirement, some kind of CAPTCHA solution might work for you.
http://www.robotstxt.org/robotstxt.html
There are search engines and bookmarking services which do not use robots.txt. If you really don't want the site to turn up anywhere, I'd suggest requiring a CAPTCHA just to navigate to it.
What's the recommended way to hide my staging website from search engines?
Simple: don't make it public. If that doesn't work, then only make it public long enough to validate that it is ready to post live and then take it down.
However, all that said, a more fundamental question is, "Why care?". If the staging site is really supposed to be the live site one step before pushing live, then it shouldn't matter if it is indexed.

What does it mean when I see some IPs look at hundreds of pages on my website?

What should I do when I see some IP in my logs scrolling through hundreds of pages on my site? I have a WordPress blog, and it seems like this isn't a real person. This happens almost daily, with different IPs.
UPDATE: Oh, I forgot to mention: I'm pretty sure it's not a search engine spider. The hostname doesn't belong to a search engine; it's some random host in India (it ends in '.in').
What I'm concerned about is: if it is a scraper, is there anything I can do? Or could it possibly be something worse than a scraper, e.g. a hacker?
It's a spider/crawler. Search engines use these to compile their listings, researchers use them to figure out the structure of the internet, the Internet Archive uses them to download the contents of the Internet for future generations, spammers use them to search for e-mail addresses, and many more such situations.
Checking out the user agent string in your logs may give you more information on what they're doing. Well-behaved bots will generally indicate who/what they are - Google's search bots, for example, are called Googlebot.
If you're concerned about script kiddies, I suggest checking your error logs. The scripts often look for things you may not have; e.g. on one system I run, I don't have ASP, yet I can tell when a script kiddie has probed the site because I see lots of attempts to find ASP pages in my error logs.
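As a hedged sketch of that kind of log review in Python: parse an access log in the common Apache/Nginx "combined" format (the path below is a placeholder, and real formats vary, so the regex may need adjusting), count hits per IP, and list the User-Agent strings each busy IP presented.

import re
from collections import Counter, defaultdict

LOG_PATH = "/var/log/apache2/access.log"  # placeholder; point this at your own log

# Combined log format: ip ident user [time] "request" status size "referer" "user-agent"
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

hits = Counter()
agents = defaultdict(set)

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LINE_RE.match(line)
        if not match:
            continue  # line in a different format; skip it
        ip, user_agent = match.groups()
        hits[ip] += 1
        agents[ip].add(user_agent)

# Show the ten busiest IPs and what they claim to be.
for ip, count in hits.most_common(10):
    print(ip, count, sorted(agents[ip]))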
Probably some script kiddie looking to take advantage of an exploit in your blog (or server). That, or some web crawler.
It's probably a spider/bot indexing your site. The "User-Agent" might give it away. It is easy to rack up hundreds of GET requests for a dynamically generated WordPress site if the traffic isn't all blog pages but also includes things like CSS, JS, and images.
