For the security of my website, is there any way I can distinguish between bots and human visitors on my website?
Not really. If a bot WANTS to be recognized as a bot, yes you can. Example: search engine bots, like Googlebot.
BUT it's extremely easy for a bot to identify itself as a normal browser; then you're stuck.
If you want a list of bots, here you go: http://www.robotstxt.org/db.html
The only way to do this might be to check the User-Agent header sent in the HTTP request by the current client.
Some bots do not specify one at all, while others send a specific string such as Googlebot (Googlebot, Mozilla/5.0) or Baidu Spider.
There is also a list maintained by useragentstring which covers the known user agents used by various bots, automated scripts and browsers.
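To make the idea concrete, here is a minimal Python sketch of such a User-Agent check; the signature list is only an example, and, as said above, any bot can fake its user agent, so treat this as a best-effort filter.

from typing import Optional

# Example substrings of declared crawlers; real lists (robotstxt.org,
# useragentstring) are much longer.
KNOWN_BOT_SIGNATURES = ("googlebot", "bingbot", "baiduspider", "yandexbot")

def is_known_bot(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent header looks like a declared bot."""
    if not user_agent:
        # Many automated scripts send no User-Agent at all.
        return True
    ua = user_agent.lower()
    return any(sig in ua for sig in KNOWN_BOT_SIGNATURES)

print(is_known_bot("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True
print(is_known_bot("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"))              # False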
Related
There are a lot of resources here and on the web explaining how to avoid biased statistics coming from referrals such as Darobar, semalt, iloveitaly, etc. and how to block these malicious bots.
My question is not about how to prevent it from happening. I don't understand WHY I'm getting spammed. What is the interest of these companies / entities in flooding my stats? They must have a pretty big infrastructure (either servers or infected slave computers) to visit so many websites so many times. But what is the purpose of all of this? Is it financial? Malicious? Just for fun?
What are the risks for myself or my company? Can I be disqualified by AdSense or another online advertising program?
Those bots don't generate ad traffic, and even if they did, Google and ad companies detect them (I used to work in ad serving). Google, Yahoo and the major ad serving systems take precautions to prevent fake traffic.
Those bots basically search for things like email addresses and contact information, in short any kind of information. Don't forget Google uses bots to crawl the internet for its search engine.
Some bots place comments on higher ranking sites for SEO work.
This is just a big business.
If you want to avoid them, take a look here: http://www.robotstxt.org/faq/prevent.html
However, these are just standards and some folks don't care about them. But then I wouldn't really worry that much.
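For example, a robots.txt at the root of your site that asks every crawler to stay away entirely looks like this (only the well-behaved bots honor it, as noted above):

User-agent: *
Disallow: /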
Spammers are trying to get traffic to their sites. Very often curious webmasters visit the "referring" websites, and spammers can then show them advertising, or redirect them to sites like amazon.com or alibaba.com to set an affiliate cookie (and earn revenue in case their targets buy something later).
Is there a way to prevent faked Google Analytics statistics by using PhantomJS and/or a ruby crawler like Anemone?
Our monitoring tool (which is based on both of them) crawls our clients' sites and updates the status of each link in a specific domain.
The problem is that this simulates huge traffic.
Is there a way to say something like "I'm a robot, don't track me" with a cookie, header or something?
( adding crawler IPs to Google Analytics [as a filter] may not be the best solution )
Thanks in advance
Joe, try setting up an advanced exclude filter -- use the field Browser and in "Filter Pattern" put the name of your user agent for PhantomJS (or any other user agent -- look up the desired name in your Technology -> Browser and OS report).
I found a quick solution for this specific problem. The easiest way to exclude your crawler which executes JS (like PhantomJS) from all Google Analytics statistics is to simply block the Google Analytics domain through /etc/hosts.
127.0.0.1 www.google-analytics.com
127.0.0.1 google-analytics.com
It's the easiest way to prevent fake data. This way, you don't have to add a filter for each of your clients.
( thanks for other answers )
IP filtering might not be sufficient, but maybe filtering by user agent string (which can be set arbitrarily with PhantomJS)? That would be the "Browser" field in the filters.
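If the crawler is driven from Python through Selenium's older PhantomJS driver, a sketch of giving it a recognizable user agent could look like the following; the name MyLinkChecker/1.0 is made up, and this assumes a Selenium version that still ships the PhantomJS driver.

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Give PhantomJS a recognizable User-Agent so it can be excluded in
# Google Analytics with an advanced filter on the Browser field.
caps = DesiredCapabilities.PHANTOMJS.copy()
caps["phantomjs.page.settings.userAgent"] = "MyLinkChecker/1.0"  # example name

driver = webdriver.PhantomJS(desired_capabilities=caps)
driver.get("http://example.com/")
print(driver.execute_script("return navigator.userAgent"))  # MyLinkChecker/1.0
driver.quit()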
I want to make a site that uses QR codes, and for user analytics I want to see how many people access it through the code, but I don't want to use a GET parameter for this because I don't want URL guessing to give me incorrect data. Is there any way to detect QR code access?
No.
Besides something in the URL, you only really have the headers. For example, if you assume all mobile traffic to the URL is from a QR code reader, you could sniff for a mobile User-Agent header. That doesn't seem very robust, though.
Not really, but you can use a URL shortener such as bit.ly to collect analytics.
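If you prefer to self-host that shortener idea, a minimal Flask sketch could count the scans and redirect to the real page; the paths and the target URL below are made up for illustration.

from flask import Flask, redirect

app = Flask(__name__)
qr_hits = 0  # in production you would persist this (database, log line, ...)

@app.route("/qr")          # point the QR code at this path
def qr_entry():
    global qr_hits
    qr_hits += 1
    return redirect("https://example.com/landing-page", code=302)

@app.route("/qr-stats")    # quick way to read the counter
def qr_stats():
    return {"hits": qr_hits}

if __name__ == "__main__":
    app.run()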
What's the best way to track how many times items in your RSS have been accessed?
Assuming your RSS is served from a webserver, the server logs would be the obvious place to gather statistics from. There are numerous packages for parsing and interpreting webserver logs.
AWStats is a popular (free) package, and Wikipedia keeps a fairly comprehensive list.
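As a rough illustration of the log-based approach (not a replacement for a full package like AWStats), here is a Python sketch that counts how many times the feed was requested per day; the log path and feed path are assumptions.

import re
from collections import Counter

LOG_FILE = "/var/log/nginx/access.log"  # example path
FEED_PATH = "/feed.xml"                 # example feed URL

# common/combined log lines look like:
# 1.2.3.4 - - [10/Oct/2023:13:55:36 +0000] "GET /feed.xml HTTP/1.1" 200 ...
line_re = re.compile(r'\[(\d{2}/\w{3}/\d{4}):.*?\] "GET (\S+) ')

hits_per_day = Counter()
with open(LOG_FILE, encoding="utf-8", errors="replace") as f:
    for line in f:
        m = line_re.search(line)
        if m and m.group(2) == FEED_PATH:
            hits_per_day[m.group(1)] += 1

for day, count in hits_per_day.items():
    print(day, count)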
If you serve your feeds through something like FeedBurner, then you can also get stats from there, including clicks.
You could use Google Analytics, but you would need a service to make the correct requests to the Google Analytics API or redirect to it. There are two APIs you can use:
the __utm.gif "API"
the Measurement Protocol API
To use the latter (you need Universal Analytics), which is much better in my opinion, you would need to make a request or redirect to something like:
http://www.google-analytics.com/collect?z=<randomnumber>&t=pageview&dh=<domainname>&cid=<unique-client-uuid>&tid=<propertyid>&v=1&dp=<path>
Where:
<randomnumber> is a random number to avoid caches (especially if you do redirects)
<domainname> is the domain name you see in your tracking code
<propertyid> is the property id you see in your tracking code (eg: UA-123456)
<path> is the path of the page you want to register the pageview for. Note that it must be URL-encoded. E.g., for /path/to/page you would need to send %2Fpath%2Fto%2Fpage
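For illustration, here is a small Python sketch that sends such a pageview hit directly with the parameters described above; the property id, client id, hostname and path are placeholders.

import random
import requests

# All ids and paths below are placeholders.
params = {
    "v": "1",                            # protocol version
    "t": "pageview",                     # hit type
    "tid": "UA-123456-1",                # your property id
    "cid": "35009a79-1a05-49d7-b876-2b884d0f825b",  # a stable client UUID
    "dh": "example.com",                 # document hostname
    "dp": "/rss/item/42",                # document path (URL-encoded automatically)
    "z": str(random.randint(1, 10**9)),  # cache buster
}
resp = requests.get("http://www.google-analytics.com/collect", params=params)
print(resp.status_code)  # the collect endpoint answers 200 even for malformed hits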
I've implemented a simple redirector service that does exactly that here (explained at length here)
If you're stuck with the Classic Analytics then you would need to use nojsstats or the older implementation
I have a service with a large share of requests with an empty value for HTTP_REFERER. I'd like to interpret this correctly and wonder about the most common reasons for that.
I understand that HTTP_REFERER is an optional header field, but most browsers with default settings seem to send it.
Common reasons I have found so far:
proxies
robots
JavaScript links (All of them? Is this browser dependent?)
request from bookmarks or as browser startup page
user entered URL manually
Flash links
link from a different app like email client
browser settings or privacy browser add-ons
some personal firewalls filter referrers
no referrer is sent by most browsers if the redirect happens via the semi-official Refresh HTTP header
referrer fakers like this
What's missing|irrelevant|wrong?
Is it possible to put percentages behind these items? Or maybe sort the list and point out the proportions?
A percentage will depend on what your website is and why people may want to fake their referrer. Also, some people just crack open a new tab without a homepage, or land via something other than the browser (such as an add-on or a chat link, whatever).
If your functionality relies on the referrer, use a cookie or rethink the design, because you can't rely on it.
Basically, it is all page requests that do not involve the user clicking a link on a webpage.
It all depends, and we don't have enough information to say which of the causes is most likely. I'd say robots, but you have to analyse the data (assuming you have server logs) and interpret it. I have no idea how popular your site is or what its purpose is, so robots may not be the number one reason.
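If you do have combined-format access logs, a quick Python sketch of that analysis, splitting traffic into empty and non-empty referrers, could look like this; the log path is an assumption.

import re
from collections import Counter

LOG_FILE = "/var/log/nginx/access.log"  # example path
# combined log format lines end with: ... "REFERER" "USER-AGENT"
tail_re = re.compile(r'"([^"]*)" "[^"]*"$')

counts = Counter()
with open(LOG_FILE, encoding="utf-8", errors="replace") as f:
    for line in f:
        m = tail_re.search(line.rstrip())
        if m:
            counts["empty" if m.group(1) in ("", "-") else "with referrer"] += 1

total = sum(counts.values()) or 1
for key, n in counts.items():
    print(f"{key}: {n} ({100.0 * n / total:.1f}%)")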
In some cases 301 redirects are the cause of losing referral information.