Recently I have seen a huge increase in referral traffic in GA coming from spammy domains like bidvertiser . com, easyhits4u . com or trafficswirl . com. This messes up the data in GA badly, triggering a sudden drop in conversion rate and rendering the data unusable.
You can easily see which referrals are bad because they share a few characteristics:
high bounce rate
low time spent on pages (and very few pageviews per user)
0 conversions (if you measure such a thing)
Looking in the logs, I found lines like these:
52.33.56.250 - - [10/May/2017:08:39:05 +0000] "GET / HTTP/1.0" 200 18631 "http://ptp4all.com/ptp/promote.aspx?id=628" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0; .NET4.0E; .NET4.0C; .NET CLR 3.5.30729; .NET CLR 2.0.50727; .NET CLR 3.0.30729; MALCJS)"
74.73.253.77 - - [10/May/2017:08:39:05 +0000] "GET / HTTP/1.0" 200 18631 "http://secure.bidvertiser.com/performance/bdv_rd.dbm?enparms2=7523,1871496,2463272,7474,7474,8973,7684,0,0,7478,0,1870757,475406,91376,112463629579,78645910,nlx.lwlgwre&ioa=0&ncm=1&bd_ref_v=www.bidvertiser.com&TREF=1&WIN_NAME=&Category=1000&ownid=627368&u_agnt=&skter=vgzouvw%2B462c%2B40v10h%2Bghru%2Bmlir%2Bhoveizn%2Bsxgzd&skwdb=ooz_wvvu" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
How to handle this?
There are two main things you need to do:
1. Server level - you must block the spammy requests from the beginning.
I thought it would be best to prepare dynamic filters that block requests from the specific IPs generating the spammy traffic.
I am using fail2ban for this, but there is no rule that does it out of the box. First you need to create a new jail filter (I am using Plesk, so here is how to do that in Plesk: https://docs.plesk.com/en-US/onyx/administrator-guide/server-administration/protection-against-brute-force-attacks-fail2ban/fail2ban-jails-management.73382/). For those who do not use Plesk and work over SSH, have a look here: https://www.fail2ban.org/wiki/index.php/MANUAL_0_8
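If you manage the server over SSH instead of Plesk, the filter is just a file under fail2ban's filter.d directory, for example (the file name referral-spam.conf is my own placeholder, not something fail2ban requires):
# standard fail2ban filter location; "referral-spam.conf" is just an example name
touch /etc/fail2ban/filter.d/referral-spam.conf
nano /etc/fail2ban/filter.d/referral-spam.conf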
An empty filter definition looks like this:
[Definition]
failregex =
ignoreregex =
Be sure to include ignoreregex as well, otherwise it will not save.
After that, search your access log for the domains you see in Google Analytics. You will find a lot of requests like the ones above.
Once you identify a domain, add rules like this:
failregex = <HOST>.+bidvertiser\.com
<HOST>.+easyhits4u\.com
<HOST> is a keyword that tells fail2ban where to pick up the IP address in the log line.
Note the ".+" - this lets fail2ban skip any text on the line until it finds the domain you are looking for.
bidvertiser.com is the domain causing the trouble, with the "." escaped as "\.".
Each new line (new domain) must start with a TAB character before the rule, otherwise it will not save.
My rule looks like this:
[Definition]
failregex = <HOST>.+bidvertiser\.com
<HOST>.+easyhits4u\.com
<HOST>.+sitexplosion\.com
<HOST>.+ptp4all\.com
<HOST>.+trafficswirl\.com
<HOST>.+bdv_rd\.dbm
ignoreregex =
Note the bdv_rd\.dbm entry. That is not a domain but the script they use to produce the spam, so it would be easy for them to change the domain and keep using the same script. This adds an extra layer of filtering, because fail2ban will match any line containing that pattern.
Note 1: make sure your patterns do not match your own website's URLs, because that would block legitimate users, and you do not want that.
Note 2: you can test your regex over SSH like this:
fail2ban-regex path/to/log/access_log "<HOST>.+bidvertiser\.com"
This should produce the following output:
Running tests
=============
Use failregex line : <HOST>.+bidvertiser\.com
Use log file : access_log
Use encoding : UTF-8
Results
=======
Failregex: 925 total
|- #) [# of hits] regular expression
| 1) [925] <HOST>.+bidvertiser\.com
`-
Ignoreregex: 0 total
Date template hits:
|- [# of hits] date format
| [4326] Day(?P<_sep>[-/])MON(?P=_sep)Year[ :]?24hour:Minute:Second(?:\.Microseconds)?(?: Zone offset)?
`-
Lines: 4326 lines, 0 ignored, 925 matched, 3401 missed
[processed in 3.14 sec]
Missed line(s): too many to print. Use --print-all-missed to print all 3401 lines
This means your filter found 925 requests matching that domain (a lot, if you ask me), which will translate into 925 hits from the referral bidvertiser.com in your Google Analytics.
You can verify this by downloading the log and searching it with a tool like Notepad++.
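If you prefer to check on the server itself, a simple count over the same log gives the same sanity check:
# count the lines in the access log that mention the spam domain
grep -c 'bidvertiser\.com' path/to/log/access_log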
Now that your filter definition is ready, you should add a jail and a rule.
I use the definition above with an action that blocks all ports for the offending IP for 24 hours; a minimal jail along those lines is sketched below.
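Something like this (the jail name, filter file name and log path are placeholders to adapt to your own setup; iptables-allports is the stock fail2ban action that bans the IP on every port, and 86400 seconds is 24 hours):
[referral-spam]
enabled  = true
# the filter file created earlier (placeholder name)
filter   = referral-spam
# adjust to where your vhost access log really lives
logpath  = /var/www/vhosts/example.com/logs/access_log
# ban on the first match, on all ports, for 24 hours
action   = iptables-allports[name=referral-spam]
maxretry = 1
findtime = 600
bantime  = 86400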
Within just a few hours of installing this, I had close to 850 blocked IPs. Some are in the Amazon AWS network, so I filed an abuse complaint here: https://aws.amazon.com/forms/report-abuse
You can use the service https://ipinfo.io/ to find the owner of an IP.
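For example, straight from the shell, using one of the IPs from the log excerpt above:
# look up the owner/network of one of the spamming IPs
curl https://ipinfo.io/52.33.56.250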
2. Google Analytics level
Here you have a few options that I will not describe in detail, because this is not the place and there are already well-written resources on the subject:
https://moz.com/blog/how-to-stop-spam-bots-from-ruining-your-analytics-referral-data
https://www.optimizesmart.com/geek-guide-removing-referrer-spam-google-analytics/
A few notes:
These guides use .htaccess blocking in some places. That is an option as well (see the sketch after these notes); I did not use it here because my filters also match script names, not only domains.
Fail2Ban uses iptables to block any further request from these IPs, not only on the HTTP/HTTPS ports.
The first request will always get through and create one hit in Analytics; once the ban expires, you may get another hit, depending on whether the script is still accessing your website.
You can use the recidive filter to permanently ban repeat offenders: https://wiki.meurisse.org/wiki/Fail2Ban#Recidive
The Analytics filters will not filter out historic data.
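For completeness, if you do want the .htaccess route, the usual approach is to reject requests whose Referer header matches one of the spam domains. A rough sketch, assuming Apache with mod_rewrite enabled (reusing the same domains as in the fail2ban filter above):
# return 403 Forbidden when the Referer matches one of the spam domains
RewriteEngine On
RewriteCond %{HTTP_REFERER} bidvertiser\.com [NC,OR]
RewriteCond %{HTTP_REFERER} easyhits4u\.com [NC,OR]
RewriteCond %{HTTP_REFERER} trafficswirl\.com [NC]
RewriteRule .* - [F,L]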
Related
I have a WordPress site that I manage. I recently received a Qualys vulnerability security scan (non-authenticated scan) that has a large number of "Path Based Vulnerability" findings. Almost all of the paths listed follow this format:
https://www.example.com/search/SomeString
https://www.example.com/search/1/feed/rss2
Some examples include:
https://www.example.com/search/errors
https://www.example.com/search/admin
https://www.example.com/search/bin
When I go to these URLs, I get an appropriate search page response stating, for example, "Search for Admin produced no results".
But, if I go to https://www.example.com/search/ without a string parameter, I get a 404 error (custom error page) stating the page could not be found. All this works like I would expect it to. No sensitive data/pages are being shown.
An example of the Qualys finding is:
150004 Path-Based Vulnerability
URL: https://www.example.com/search/1/feed/rss2/
Finding #: 8346060(130736429)
Severity: Confirmed Vulnerability - Level 2
Unique #: redacted
Group: Path Disclosure
Detection Date: 22 Mar 2021 18:16 GMT-0400
CWE: CWE-22
OWASP: A5 Broken Access Control
WASC: WASC-15 APPLICATION MISCONFIGURATION, WASC-16 DIRECTORY INDEXING, WASC-17 IMPROPER FILESYSTEM PERMISSIONS
CVSS V3 Base: 5.3
CVSS V3 Temporal: 5
CVSS V3 Attack Vector: Network
Details
Threat A potentially sensitive file, directory, or directory listing was discovered on the Web server.
Impact The contents of this file or directory may disclose sensitive information.
Solution Verify that access to this file or directory is permitted. If necessary, remove it or apply access controls to it.
Detection Information
Parameter No param has been required for detecting the information.
Authentication In order to detect this vulnerability, no authentication has been required.
Access Path Here is the path followed by the scanner to reach the exploitable URL: https://www.example.com
https://www.example.com/?s=1
Payloads
#1
#1 Request
GET https://www.example.com/search/tools/
Referer: https://www.example.com
Cookie: [removed in case its sensitive];
caosLocalGa= [removed in case its sensitive];
Host: https://www.example.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1
Safari/605.1.15
Accept: */*
Based on the findings, this seems to be a false positive. But my CIO insists that I prove it as such. First, is there any documentation on this that might be helpful? Second, does anyone know of any updates to WP that could hide/remove these findings?
(I'd comment, but my rep isn't high enough.)
I can partially answer this, as I am fighting the same battle right now with a different web app. If you run the request in a browser with the developer tools open, I'll bet you'll see that the response code from the server is 200 even though it is actually doing a redirect.
The scanner sees that the response code is OK and assumes the request succeeded as-is, when it really didn't. You have to return a different response code when doing a "silent" redirect.
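One way to show this (and hopefully satisfy the CIO) is to fetch the flagged URLs from the command line and capture the real status code plus any redirect target. A quick sketch with curl, using one of the paths from the question:
# show the real status code and any redirect target for a flagged path
curl -s -o /dev/null -w 'status: %{http_code}  redirect: %{redirect_url}\n' https://www.example.com/search/tools/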
I am exploring the Google PageSpeed Insights API, and in the response I see a field called:
{
...
lighthouse.userAgent:'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/84.0.4147.140 Safari/537.36'
...
}
Docs mention userAgent="The user agent that was used to run this LHR."
https://developers.google.com/speed/docs/insights/rest/v5/pagespeedapi/runpagespeed#LighthouseResultV5
What does that mean? How is this performance aggregated by running on all browsers?
PS: This is for the Desktop version.
What does that mean?
This lets you know what browser was used to run the test.
It is useful if you believe there is an issue with Lighthouse (a bug in your report), because you can test directly in the same browser that Lighthouse uses.
There is also the "environment" object, which records how Lighthouse presented itself to the website being tested (it sent a header saying "treat me like this browser"): lighthouseResult.environment.networkUserAgent.
This is useful for checking that your server isn't blocking requests with that user agent, and for finding the requests Lighthouse made in your server logs.
See the Wikipedia page on user agents for more information.
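If you want to see both values for your own page, you can call the API directly and pull the two fields out of the (large) JSON response; a rough sketch, with the target URL as a placeholder:
# run PageSpeed Insights for a page and grep out the two user agent fields
curl -s 'https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url=https://example.com&strategy=desktop' \
  | grep -E '"(userAgent|networkUserAgent)":'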
How is this performance aggregated by running on all browsers?
As for your second question, it doesn't quite make sense as asked: the result is not aggregated across browsers, it comes from the single headless browser identified above. The user agent string itself has no impact on performance unless your server does something different for that particular user agent, if that is what you mean.
I've been trying to connect to the WooCommerce REST API (using HTTP Basic Auth) but fail to do so.
I'm probably doing stuff wrong (first-timer with REST APIs), but here is what I've been doing:
I'm using a GET with an url consisting of: https://example.com/wc-api/v2/
I'm using an Authorization header with the consumer key and secret base64 encoded
I've enabled the REST API in the WooCommerce settings and enabled secure checkout. I've also put some products in the shop. But whenever I try to request the URL as described above, the connection is simply refused.
I do not receive an error, but it looks like the page cannot even be reached. Can someone help me out?
I've followed the docs (http://woothemes.github.io/woocommerce-rest-api-docs/#requestsresponses) up to the Authentication-section, but that's where I've been stuck up till now.
The complete url I'm using is:
http://[MYDOMAIN]/wc-api/v2/orders
With the HTTP-header looking like:
GET /wc-api/v2/ HTTP/1.1
Authorization: Basic [BASE64 encoded_key:BASE64 encoded_secret]
Host: [MYDOMAIN]
Connection: close
User-Agent: Paw/2.1.1 (Macintosh; OS X/10.10.2) GCDHTTPRequest
Then after I run the request I'm getting: [screenshot not included - the request fails and the server cannot be reached]
Given the screenshot that you posted, it seems that the server is not responding on HTTPS. So you'll need to configure your webserver to respond to HTTPS requests, and to do that you'll need to install an SSL certificate.
You can either generate one yourself, which is free, but won't work for the general public. Or you can buy one - most domain registrars and hosts will let you buy a certificate, and they usually start at around $50 per year.
I'm using a GET with an url consisting of: https://example.com/wc-api/v2/
In this example, you're using HTTPS. Is that where you're trying to connect?
I highly recommend going straight to an HTTPS connection; it's a thousand times easier to accomplish. Documentation for authentication over HTTPS can be found here (follow the directions for "OVER HTTPS"). From there you can use something like Postman to test, if you'd like.
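Once HTTPS is working, a minimal request from the command line looks like the sketch below. Over HTTPS, Basic Auth just takes the consumer key as the username and the consumer secret as the password; curl does the base64 encoding of key:secret for you (the keys shown are placeholders, and the legacy wc-api/v2 path is the one from the question):
# placeholders: ck_... / cs_... are the consumer key and secret from WooCommerce
curl https://example.com/wc-api/v2/orders \
  -u ck_your_consumer_key:cs_your_consumer_secret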
Recently I put some hidden links in a web site in order to trap web crawlers. (I used the CSS visibility:hidden style so that human users would not access them.)
Anyway, I found plenty of HTTP requests to the hidden links whose user agent strings identify them as ordinary browsers.
E.g : "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31"
So now my problems are:
(1) Are these web crawlers? If not, what else could they be?
(2) Are they malicious?
(3) Is there a way to profile their behaviour?
I searched the web but couldn't find any valuable information. Can you please point me to some resources? Any help would be appreciated.
This is an HTTP user agent string. It is not malicious at all in itself; it simply follows the usual pattern, for example Mozilla/<version> and so on. A browser is a user agent, for example. However, user agents can be used by attackers, and this can be identified by looking for anomalies. You can read this paper.
The Hypertext Transfer Protocol (HTTP) identifies the client software
originating the request, using a "User-Agent" header, even when the
client is not operated by a user.
The answers to your questions, in order:
They are not web crawlers as such; what you pasted is the user agent string the client sent ("user agent" is a common term in web development).
Generally they aren't malicious, but they can be; as I suggested, have a look at the paper.
I don't understand what you mean by profiling their behaviour; they aren't malware!
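That said, if the goal is simply to see which clients follow the trap links, you can pull the user agent strings for requests to that URL straight out of your access log. A rough sketch, assuming a combined log format; both the trap path and the log file name are placeholders:
# list the user agents that requested the trap URL, most frequent first
grep '/hidden-trap/' access_log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn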
Is it possible to obtain raw logs from Google Analytics? Is there any tool that can generate the raw logs from GA?
No, you can't get the raw logs, but there's nothing stopping you from getting the exact same data logged to your own web server logs. Have a look at the Urchin code and borrow that, changing the following two lines to point to your web server instead:
var _ugifpath2="http://www.google-analytics.com/__utm.gif";
if (_udl.protocol=="https:") _ugifpath2="https://ssl.google-analytics.com/__utm.gif";
You'll want to create a __utm.gif file so that they don't show up in the logs as 404s.
Obviously you'll need to parse the variables out of the hits into your web server logs. The log line in Apache looks something like this. You'll have lots of "fun" parsing out all the various stuff you want from that, but everything Google Analytics gets from the basic JavaScript tagging comes in like this.
127.0.0.1 - - [02/Oct/2008:10:17:18 +1000] "GET /__utm.gif?utmwv=1.3&utmn=172543292&utmcs=ISO-8859-1&utmsr=1280x1024&utmsc=32-bit&utmul=en-us&utmje=1&utmfl=9.0%20%20r124&utmdt=My%20Web%20Page&utmhn=www.mydomain.com&utmhid=979599568&utmr=-&utmp=/urlgoeshere/&utmac=UA-1715941-2&utmcc=__utma%3D113887236.511203954.1220404968.1222846275.1222906638.33%3B%2B__utmz%3D113887236.1222393496.27.2.utmccn%3D(organic)%7Cutmcsr%3Dgoogle%7Cutmctr%3Dsapphire%2Btechnologies%2Bsite%253Arumble.net%7Cutmcmd%3Dorganic%3B%2B HTTP/1.0" 200 35 "http://www.mydomain.com/urlgoeshere/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/0.2.153.1 Safari/525.19"
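As a starting point for that parsing, something like this pulls the requested page paths (the utmp parameter) out of all __utm.gif hits and ranks them by frequency; it assumes a combined-format access_log like the line above:
# rank the page paths (utmp parameter) seen in __utm.gif hits
grep '__utm\.gif' access_log | grep -o 'utmp=[^&" ]*' | sort | uniq -c | sort -rn | head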
No. But why don't you just use your webserver's logs? The value of GA is not in the data they collect, but the aggregation/analysis. That's why it's not called Google Raw Data.
Please have a look at this article, which explains a hack to get Google Analytics data:
http://blogoscoped.com/archive/2008-01-17-n73.html
Also, if you can wait for some time, the official Google Analytics blog says they are working on a data export API, but it is currently in a private beta:
http://analytics.blogspot.com/2008/10/more-enterprise-class-features-added-to.html
Not exactly the same as raw vs aggregated, but it seems that "unsampled" data is only available to Premium accounts:
"Unsampled Reports are only available in Premium accounts using the latest version of Google Analytics."
http://support.google.com/analytics/bin/answer.py?hl=en&answer=2601061
You can get the Analytics data, but it'll take a bit of hacking.
In any analytics report, click the 'email' button at the top of the screen. Set up the email to go to your address (or a new address on your server) and change the format to csv or xml.
Then, you can use php (or another language) to check the email account, parse the email and import the attachment to your system.
There's an article entitled 'Incoming mail and PHP' on evolt.org: http://evolt.org/incoming_mail_and_php
No, but there are other paid services like Mixpanel and KISSmetrics that have data export APIs. Much easier than trying to build your own analytics service, but costs money.