What does it mean when I see some IPs looking at hundreds of pages on my website? - wordpress

What should I do when I see an IP in my logs scrolling through hundreds of pages on my site? I have a WordPress blog, and it seems like this isn't a real person. This happens almost daily with different IPs.
UPDATE: Oh, I forgot to mention, I'm pretty sure it's not a search engine spider. The hostname is not a search engine's, but some random host from India (it ends in '.in').
What concerns me is: if it is a scraper, is there anything I can do? Or could it be something worse than a scraper, e.g. a hacker?

It's a spider/crawler. Search engines use these to compile their listings, researchers use them to figure out the structure of the internet, the Internet Archive uses them to download the contents of the web for future generations, spammers use them to harvest e-mail addresses, and so on.
Checking out the user agent string in your logs may give you more information on what they're doing. Well-behaved bots will generally indicate who/what they are - Google's search bots, for example, are called Googlebot.
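For example, a quick way to see who is hitting the site is to tally the user-agent strings in the access log. Here is a minimal Python sketch, assuming a combined-format Apache/nginx log; the log path is an assumption you would adjust:

import re
from collections import Counter

LOG_PATH = "/var/log/apache2/access.log"  # assumption: adjust to your server's log location

# In the combined log format the user agent is the last quoted field on the line.
ua_pattern = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"\s*$')

counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = ua_pattern.search(line)
        if match:
            counts[match.group("ua")] += 1

# The most common user agents make obvious bots (Googlebot, bingbot, ...) stand out.
for agent, hits in counts.most_common(20):
    print(f"{hits:6d}  {agent}")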

If you're concerned about script kiddies, I suggest checking your error logs. Their scripts often look for things you may not have; for example, on one system I run, I don't have ASP, but I can tell when a script kiddie has probed the site because I see lots of attempts to find ASP pages in my error logs.
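As a rough illustration of that idea (not the original poster's setup), here is a small Python sketch that flags log lines mentioning things the site doesn't actually serve; the log path and marker list are assumptions to adapt:

LOG_PATH = "/var/log/apache2/error.log"  # assumption: use whichever log your server writes
PROBE_MARKERS = (".asp", ".aspx", "cgi-bin", "phpmyadmin")  # things this particular site doesn't serve

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        lowered = line.lower()
        if any(marker in lowered for marker in PROBE_MARKERS):
            # A request for software you don't run is a good sign of an automated probe.
            print(line.rstrip())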

Probably some script kiddie looking to take advantage of an exploit in your blog (or server). That, or some web crawler.

It's probably a spider-bot indexing your site. The "User-Agent" might give it away. A dynamically generated WordPress site can easily produce hundreds of GET requests if the count isn't just blog pages but also includes things like CSS, JS, and images.


"Wait 10 seconds while your download file is being generated" - why do sites do this, and where can I get the code?

I've noticed that many download websites create these kinds of download links. For example, go to any post and try to download the product: you are redirected to another page before they show you the real download links.
https://null-24.com
http://thewpclub.net/
They might be doing this to reduce bounce rates, but I'm not sure. Can anybody help me understand this and point me to the code? It might be a WordPress plugin or custom code; I'm not sure.
Thanks
Not 100 percent sure if this is what you're asking, but...
They do it this way because they do not want another site to parse their links.
Another possibility is that the content is paid for, and this prevents you from sending the links to hundreds of your friends.
Lastly, if you pay for streaming and have, say, a 24-hour right to stream, they will invalidate your link after that time.
If you're asking how this can be done, there are several possibilities (a small sketch of the expiring-link approach follows below).
If the content is small, say a phone ringtone, you can copy it into a download directory and then delete it after some time.
If the content is big and you are on Linux, you can create a symlink or hardlink and then delete it.
You can also serve the file via PHP, Java, or whatever technology you are using, but that is far from the best way and not recommended at all.
Finally, you can use a server like nginx or Apache; both have modules for this kind of expiring link. You can then host your website on one server while downloads are served from a totally separate server, even located in a different datacenter. The remote server needs nothing running on it except Apache or nginx. This is the only realistic way to stream large content (movies, music, 3D, games) over the internet.
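To make the expiring-link idea concrete, here is a minimal Python sketch of a signed, time-limited download URL; it is roughly the same idea that nginx's secure_link module implements. The secret key, the download domain, and the 24-hour lifetime are assumptions, and your own download handler would have to call the validation function before serving the file.

import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET_KEY = b"change-me"        # assumption: keep this on the server only
LINK_LIFETIME = 24 * 60 * 60     # seconds the link stays valid (24 h, as in the streaming example)

def sign(path, expires):
    # The token binds the file path to an expiry timestamp with an HMAC.
    message = f"{path}:{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()

def make_link(path):
    expires = int(time.time()) + LINK_LIFETIME
    query = urlencode({"expires": expires, "token": sign(path, expires)})
    return f"https://downloads.example.com{path}?{query}"  # assumption: separate download host

def is_valid(path, expires, token):
    # Reject links that are past their expiry or whose signature doesn't match.
    if expires < time.time():
        return False
    return hmac.compare_digest(token, sign(path, expires))

print(make_link("/files/theme.zip"))  # hand this URL to the visitor; the bare path is never exposed

The same scheme works whether the file is ultimately served by PHP, a separate nginx host, or anything else; only the secret has to be shared between the link generator and the download server.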

How to mechanically identify all broken links in a Drupal site

We have just moved to Drupal and are trying to proactively identify all broken external web (http://, https://) links.
I've seen some references to validation of links but wasn't sure if it only meant validation of the syntax of the link as opposed to whether these web links work or not (e.g. 404).
What is the easiest way to go through all web links in a Drupal site and identify all of the broken external web links? This is something we'd like to automate and schedule every day or week.
As someone else mentioned, use the Link Checker module. It's a great tool.
In addition, you can check the Crawl Errors report in Google Webmaster Tools for 404'd links.
Clicking any URL in that report will show you where it was linked from, so you can update any internal broken links. Be sure to use canonical URLs to avoid that in the future.
Make sure you're using a proper internal linking strategy to avoid broken internal links in the first place, too: http://www.daymuse.com/blogs/drupal-broken-internal-link-path-module-tutorial
Essentially: use canonical, relative links to avoid broken internal links in the future when you change aliases. In simple Drupal terms, be sure you're linking to "node/23" instead of "domain.ext/content/my-node-title" since multiple parts of that might change in the future.
I have not found a Drupal-based approach for this. The best free piece of software I've found for finding bad links on sites is the Screaming Frog SEO Spider tool.
http://www.screamingfrog.co.uk/seo-spider/
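If you would rather script the check yourself (for example from a nightly cron job), something along the lines of this Python sketch could work. The sitemap URL is an assumption (e.g. the output of Drupal's XML Sitemap module), it only looks at anchor hrefs, and a real run should throttle requests and handle redirects more carefully.

import urllib.error
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SITEMAP_URL = "https://example.com/sitemap.xml"  # assumption: your site's XML sitemap

class ExternalLinkCollector(HTMLParser):
    # Collects absolute http(s) hrefs from a page's anchor tags.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith(("http://", "https://")):
                    self.links.append(value)

def fetch(url):
    # Returns (status_code, body); status 0 means the request failed outright.
    request = urllib.request.Request(url, headers={"User-Agent": "link-check/0.1"})
    try:
        with urllib.request.urlopen(request, timeout=15) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        return err.code, b""
    except OSError:
        return 0, b""

# 1. Read page URLs from the sitemap, 2. collect external links, 3. report anything not 200.
_, sitemap = fetch(SITEMAP_URL)
namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
pages = [loc.text for loc in ET.fromstring(sitemap).findall(".//sm:loc", namespace)]

for page in pages:
    _, html = fetch(page)
    collector = ExternalLinkCollector()
    collector.feed(html.decode("utf-8", errors="replace"))
    for link in collector.links:
        status, _ = fetch(link)
        if status != 200:
            print(f"{page}: {link} -> {status or 'unreachable'}")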

Does Google Analytics count the visit if someone references an image from my site?

Well, the question is in the title. I searched SO (obviously) but nothing similar came up. Additional reading material (if you happen to know of any) would be helpful for solving this mystery for me.
No, not by default at least.
It is technically possible to contrive a server-side solution that measures referenced assets. But usually (i.e. when you use the JavaScript tracking code) Google Analytics will only measure documents that have the tracking code embedded. Since you cannot embed JavaScript code in image files, they will not be tracked.
If you want to see which images have been requested from other domains, you can instead look at your web server's access logs, which record every request to your server and usually include the address of the referring site.
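As a rough sketch of that log-based approach, here is a small Python script assuming a combined-format access log; the log path, your domain, and the regex are assumptions to adapt:

import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # assumption: your server's access log
MY_DOMAIN = "example.com"               # assumption: your own domain
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg")

# Matches the request line, status, size, and the quoted referrer field.
line_pattern = re.compile(r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" \d+ \S+ "(?P<referrer>[^"]*)"')

hotlinks = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = line_pattern.search(line)
        if not match:
            continue
        path, referrer = match.group("path"), match.group("referrer")
        if path.lower().endswith(IMAGE_EXTENSIONS) and referrer and MY_DOMAIN not in referrer:
            hotlinks[referrer] += 1  # image fetched with an external referring page

for referrer, hits in hotlinks.most_common():
    print(f"{hits:6d}  {referrer}")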

Google Analytics shows me weird links for one of my visitors

I have a website which is registered with Google Analytics so I can see its statistics. The problem is that sometimes it shows me this link:
website.com/www.bndv521.cf/
or:
website.com/admin
I do not know if this is a hacker trying to hack me or something, but I don't think anybody would try to access my admin page for a good reason.
Can you help me understand what this link refers to?
Consider checking for malicious code included on your pages. And yes, it's likely that someone is trying to access those pages, but the attempt may not do anything because the path is invalid. You should consider blocking such IP addresses after checking your logs.
Although trying to reach an admin page seems suspicious, on our website we come across this issue about once in every ten thousand requests.
We think that a browser extension or a virus-like program tries to change the URL or add this keyword to it, not for hacking purposes but to redirect visitors to their advertising website.
Very similar issue here: Weird characters in URL

How to completely hide a website from search engines?

What's the recommended way to hide my staging website from search engines? I Googled it and some say I should put a meta tag on the pages, while others say I should put a text file inside my website directory; I want to know the standard way.
My current website is in ASP.NET, although I believe there must be a common way that works for any website, whatever its programming language.
Use a robots.txt file.
See here: http://www.robotstxt.org/robotstxt.html
You could also use your server's robots.txt:
User-agent: *
Disallow: /
Google's crawler actually respects these settings.
Really easy answer: password-protect it. If it's a staging site then it quite likely is not intended to be publicly facing (private audience only, most likely). Trying to keep it out of search engines is only treating a symptom when the real problem is that you haven't appropriately secured it.
Keep in mind that you can't hide a public-facing unprotected web site from a search engine. You can ask that bots not index it (through the robots.txt that my fine colleagues have brought up), and the people who write the bots may choose not to index your site based on that, but there's got to be at least one guy out there who is indexing all the things people ask him not to index. At the very least one.
If keeping automated crawlers out is a big requirement, some kind of CAPTCHA solution might work for you.
http://www.robotstxt.org/robotstxt.html
There are search engines and bookmarking services which do not respect robots.txt. If you really don't want the site to turn up at all, I'd suggest requiring a CAPTCHA just to navigate to it.
What's the recommended way to hide my staging website from search engines
Simple: don't make it public. If that doesn't work, then only make it public long enough to validate that it is ready to post live and then take it down.
However, all that said, a more fundamental question is, "Why care?". If the staging site is really supposed to be the live site one step before pushing live, then it shouldn't matter if it is indexed.
