I am trying to identify what data is actually collected by the default script of Google Analytics. What seems like an easy question turns out to have no clear answer.
I know that it collects (for example) the IP address, screen resolution, operating system and so forth, but I simply cannot find a complete list. I also have a list of all the possible dimensions and metrics that can be collected, but not one for the "default" analytics script.
I am looking for a list of all the data collected by default by Google Analytics.
... identify what data is actually collected by the default script .... I also have a list of all the possible dimensions and metrics that can be collected
Just to be clear, GA collects more information than what they share with Analytics consumers. While their client-side script may allow for additional data to be collected (like custom query string parameters), most of the data they collect seems to be similar on every site, regardless of what the analytics user chooses to consume (with the exception of a few configuration items such as "anonymizeIp").
Google's policies are cleverly worded to indicate that turning on "Advertising Features" doesn't necessarily change what they collect with GA, other than the fact that a new cookie might be present:
By enabling the Advertising Features, you enable Google Analytics to
collect data about your traffic via Google advertising cookies and
identifiers
Knowing what GA collects (even when you don't ask it to) is particularly important given the ambiguity around whether GA is really GDPR compliant (the GDPR treats IP addresses, cookie identifiers, and GPS locations as "personal data").
Looking at the source code
Google Analytics is a moving target, but there is value in having a snapshot of the identifying information about the client and browser that was being leaked to Google Analytics at a given point in time.
Even though it's a bit outdated, this analysis was done using a Manually Deobfuscated Google Analytics javascript file, snapshot taken Mar 27, 2018.
1. Data available in Document and Window Objects
Some key objects to look for in the analytics JS: DOCUMENT, WINDOW, NAVIGATOR, SCREEN, LOCATION
Here are the items that are utilized by GA (this doesn't necessarily mean the data is sent back to Google in a raw form).
Data Utilized | Code Snippet
------------- | ------------
Url | LOCATION.protocol + "//" + LOCATION.hostname + LOCATION.pathname + LOCATION.search
ReferringPage | DOCUMENT.referrer
PageTitle | DOCUMENT.title
HowLongIsPageVisible | DOCUMENT.visibilityState .. DOCUMENT,"visibilitychange"
DocumentSize | DOCUMENT.documentElement .clientWidth && .clientHeight
ScreenResolution | SCREEN.width SCREEN.height
ScreenColors | SCREEN.colorDepth + "-bit"
ClientSize | e = document.body; e.clientWidth && e.clientHeight
ViewportSize | ca = [documentEl.clientWidth .... : ca = [e.clientWidth .... ca.join("x")
FlashVersion | getFlashVersion
Encoding | characterSet || DOCUMENT.charset
JSONAvailable | window.JSON
JavaEnabled | NAVIGATOR.javaEnabled()
Language | NAVIGATOR.language || NAVIGATOR.browserLanguage
UserAgent | NAVIGATOR.userAgent
Timezone/LocalTime | c.getTimezoneOffset(), c.getYear(), c.getDate(), c.getHours(), c.getMinutes()
PerformanceData | WINDOW.performance || WINDOW.webkitPerformance ... loadEventStart,domainLookupEnd,domainLookupStart,connectStart,responseStart,requestStart,responseEnd,responseStart,fetchStart,domInteractive,domContentLoadedEventStart
Plugins | NAVIGATOR.plugins
SignalUserLeaving | navigator.sendBeacon() // how long the user was on the page
HistoryLength | WINDOW.history.length // number of pages viewed with this browser tab
IsTopSiteForUser | navigator.loadPurpose // "Top Sites" section of Safari
NameOfPage (JS) | WINDOW.name
IsFrame | WINDOW.top != WINDOW
IsEmbedded | WINDOW.external
RandomData | WINDOW.crypto.getRandomValues // because of the try/catch, it doesn't appear to leak anything other than random values
ScriptTags | getElementsByTagName("script"); // probably for Ads, AutoLink decorating [https://support.google.com/analytics/answer/4627488?hl=en] and cross-domain tracking [https://developers.google.com/analytics/devguides/collection/analyticsjs/cross-domain]
Cookies (JS) | DOCUMENT.cookie.split(";") // limited to cookies not marked as server only
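To make the table above more concrete, here is a minimal sketch of reading a handful of those properties. It is not Google's code: the function name collectClientHints and the fakeWindow stand-in are made up for illustration, and the function takes a window-like object as a parameter so that it can also run outside a browser.

```javascript
// Sketch only: reads a few of the properties listed in the table above.
// `win` is any window-like object (pass the real `window` in a browser).
function collectClientHints(win) {
  const doc = win.document || {};
  const nav = win.navigator || {};
  const scr = win.screen || {};
  const loc = win.location || {};
  const root = doc.documentElement || {};

  return {
    url: loc.protocol + "//" + loc.hostname + loc.pathname + loc.search,
    referrer: doc.referrer || "",
    title: doc.title || "",
    screenResolution: scr.width + "x" + scr.height,
    screenColors: scr.colorDepth + "-bit",
    viewportSize: root.clientWidth + "x" + root.clientHeight,
    encoding: doc.characterSet || doc.charset || "",
    language: nav.language || nav.browserLanguage || "",
    userAgent: nav.userAgent || "",
    timezoneOffsetMinutes: new Date().getTimezoneOffset(),
    isFrame: win.top !== win,
  };
}

// Hypothetical stand-in for a browser window, for illustration only.
const fakeWindow = {
  document: {
    referrer: "https://example.org/",
    title: "Home",
    characterSet: "UTF-8",
    documentElement: { clientWidth: 1280, clientHeight: 720 },
  },
  navigator: { language: "en-US", userAgent: "ExampleBrowser/1.0" },
  screen: { width: 1920, height: 1080, colorDepth: 24 },
  location: { protocol: "https:", hostname: "www.example.com", pathname: "/a", search: "?x=1" },
};
fakeWindow.top = fakeWindow; // a top-level page is its own `top`

const hints = collectClientHints(fakeWindow);
console.log(hints);
```

In a browser console you would call collectClientHints(window) instead of using the stand-in object.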
2. Data available from the QueryString and Hash
By default, GA seems to only explicitly collect querystring parameters that are documented as specific to Google Analytics. But keep in mind that they also have the entire URL available to extract this data server-side, querystring and hash included:
_ga
_gac
gclid
gclsrc
dclid
utm_id
utm_campaign
utm_source
utm_medium
utm_term
utm_content
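As a rough illustration of how only the parameters listed above could be picked out of a URL, here is a small sketch. The function name extractGaParams and the sample URL are assumptions for the example. It uses the standard URL API and does not reflect how the GA script itself is written.

```javascript
// The query string parameters listed above.
const GA_PARAMS = [
  "_ga", "_gac", "gclid", "gclsrc", "dclid",
  "utm_id", "utm_campaign", "utm_source", "utm_medium", "utm_term", "utm_content",
];

// Return only the GA-specific parameters present in the given URL.
function extractGaParams(href) {
  const params = new URL(href).searchParams;
  const found = {};
  for (const name of GA_PARAMS) {
    if (params.has(name)) {
      found[name] = params.get(name);
    }
  }
  return found;
}

// Hypothetical landing page URL; "ref" is not a GA parameter and is skipped.
const gaParams = extractGaParams(
  "https://www.example.com/page?utm_source=newsletter&utm_medium=email&ref=abc&gclid=XYZ"
);
console.log(gaParams);
```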
3. Data available in the HTTP Header
They can choose to capture anything on the request header from the browser. Most notably:
Cookies (Google) | for the google analytics domain, to track the user between sites
IP Address | (parameter "anonymizeIp" claims to anonymize the IP address)
Browser w/ version |
Operating system |
Device Type |
Referer | (in this context, only the url of the page the client is currently on)
X-Forwarded-For | Is a proxy being used? And, if not used for privacy, the actual IP address
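What Google actually does with the request headers server-side is not visible from the outside. Purely as a sketch of what any receiving server could derive from them, assume a headers object with lower-cased names (as Node's http module provides) and a hypothetical helper called describeRequest:

```javascript
// Sketch: summarize what a server can read from the request headers alone.
// `headers` uses lower-cased names; `remoteAddress` is the socket's peer address.
function describeRequest(headers, remoteAddress) {
  // X-Forwarded-For is a comma-separated list; the first entry is the
  // address the first proxy saw (it can be spoofed by the client).
  const forwarded = (headers["x-forwarded-for"] || "")
    .split(",")
    .map((part) => part.trim())
    .filter(Boolean);

  return {
    clientIp: forwarded.length > 0 ? forwarded[0] : remoteAddress,
    viaProxy: forwarded.length > 0,
    userAgent: headers["user-agent"] || "",
    referer: headers["referer"] || "",
    hasCookies: Boolean(headers["cookie"]),
  };
}

// Hypothetical request that passed through one proxy.
const requestInfo = describeRequest(
  {
    "x-forwarded-for": "203.0.113.7, 198.51.100.2",
    "user-agent": "ExampleBrowser/1.0",
    "referer": "https://www.example.com/a",
    "cookie": "_ga=GA1.2.123.456",
  },
  "198.51.100.2"
);
console.log(requestInfo);
```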
4. Other inferred data
Javascript enabled
Cookies enabled
Other identifying information they don't appear to track/utilize
Some other metrics that are readily available, but GA doesn't appear to access:
Canvas Supported
CPU Architecture
CPU Number of cores
AudioContext Supported
Bluetooth Supported
Battery Status
Memory (RAM)
Number of speakers
Number of microphones
Number of webcams
Device Orientation
Device input is Touchscreen
System Fonts
LocalStorage Data
IndexedDB Data
WebRTC Supported
WebGL Supported
WebSocket Supported
Misc Hacks
They don't appear to use any known hacks to extract additional unique user information, such as finding the video card model of the current machine using Canvas and GL. This is not too surprising, since Google can just expose any data they want in chromium/webkit.
However, their control of 70% of the browser market gives them the power to manipulate otherwise innocuous functions (like the random number generator) to leak data for user tracking, if they so desire.
Summary
What you choose to see from the Google Analytics portal does not necessarily impact what they collect.
GA helps Google determine how well a site performs for Search Ranking, and creates a User Fingerprint to track what each internet user looks at and for how long. The latter helps them select ads, which is where they make the bulk of their money. Much of the data they touch in their script doesn't get sent back in raw form, but rather, is used to create said fingerprint.
If you dig deeper you'll find plenty of literature on Google Analytics architecture.
According to the official documentation:
Google Analytics works by the inclusion of a block of JavaScript code
on pages in your website. When users to your website view a page, this
JavaScript code references a JavaScript file which then executes the
tracking operation for Analytics. The tracking operation retrieves
data about the page request through various means and sends this
information to the Analytics server via a list of parameters attached
to a single-pixel image request.
Source: How Does Google Analytics Collect Data?
Additional reading: Google Analytics Features
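The "list of parameters attached to a single-pixel image request" can be pictured roughly like this. The sketch uses a few parameter names from the Measurement Protocol (v, tid, cid, t, dl, dt, sr, ul). The function name buildPixelUrl, the tracking ID and the client ID are placeholders, and the real script sends many more parameters than shown here.

```javascript
// Sketch: assemble a pageview hit as a query string on the collect endpoint.
function buildPixelUrl(hit) {
  const params = new URLSearchParams({
    v: "1",                   // protocol version
    tid: hit.trackingId,      // property ID (placeholder below)
    cid: hit.clientId,        // anonymous client ID, normally read from the _ga cookie
    t: "pageview",            // hit type
    dl: hit.location,         // document location (full URL)
    dt: hit.title,            // document title
    sr: hit.screenResolution, // screen resolution
    ul: hit.language,         // user language
  });
  return "https://www.google-analytics.com/collect?" + params.toString();
}

// Placeholder values for illustration only.
const pixelUrl = buildPixelUrl({
  trackingId: "UA-XXXXX-Y",
  clientId: "555.666",
  location: "https://www.example.com/a?x=1",
  title: "Home page",
  screenResolution: "1920x1080",
  language: "en-us",
});
console.log(pixelUrl);
```

In a browser the script would then request this URL, for example by assigning it to the src of an Image object. That part is left out so the sketch does not send anything.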
I think to find out what information GA collects, it is better to take a look at Google's general policy:
" We collect information to provide better services to all of our users – from figuring out basic stuff like which language you speak, to more complex things like which ads you’ll find most useful, the people who matter most to you online, or which YouTube videos you might like.
We collect information in two ways:
Information you give us. For example, many of our services require you to sign up for a Google Account. When you do, we’ll ask for personal information, like your name, email address, telephone number or credit card. If you want to take full advantage of the sharing features we offer, we might also ask you to create a publicly visible Google Profile, which may include your name and photo.
Information we get from your use of our services. We collect information about the services that you use and how you use them, like when you watch a video on YouTube, visit a website that uses our advertising services, or you view and interact with our ads and content...."
Source : http://www.google.com/policies/privacy/#infocollect
Related
I have a requirement where I need to track whether a user clicked a link in a PDA email where the link included in the email is >900 characters.
I'm not sure if Google Analytics supports tracking in PDA.
If anyone has ever done this, please help me out.
Thanks
I seem to have misunderstood the question, so here is an update. Google will usually track any valid URLs. The two exceptions I can think of are more theoretical than practical concerns.
Some old browsers (I think IE6 and similar vintages) have a character limit for GET requests (2048 bytes IIRC), so very long links will not work, and thus will not be tracked correctly. For all practical purposes these browsers should be extinct by now.
A Google Analytics request is limited to 8192 bytes. The request has to transmit the document location as part of the payload, so if your URL is really massively oversized (technically 8000 characters is ">900"), it would not be tracked. Again, this is hardly a practical concern (unless there is a lot of other data in that request, e.g. Enhanced E-Commerce product impressions).
Old (and probably irrelevant) answer:
Google Analytics typically does not track actions within emails, since email clients do not usually support JavaScript (there are implementations of email open tracking via "web bugs" linked to a script that makes a Measurement Protocol request, but even that does not work particularly well).
If this is a link that points to your homepage the typical way to track this would be via utm parameters - i.e. you do not track the action within the email itself, but the result (the visit to your homepage).
UTM parameters (or "campaign parameters") are
utm_medium - the kind of traffic (if it's paid advertising, banner ads, or in your case email)
utm_source - the specific vendor (e.g. "google" if the link is from a paid Google Ad, or in your case it could be the name of the department that sent out the mail)
utm_campaign - your advertising campaign; in the case of a periodic newsletter this could be e.g. the number of the newsletter
utm_term - you usually would not use that in an email, that's reserved for when a link is a result of a search (then you would insert the search term)
utm_content - if you have multiple links with the same link target and campaign info you can add additional information (e.g. if you have the same link at the top and the bottom of your mail you could indicate the position here)
You cannot do anything dynamic, though - if you want to mark links with a specific character count you would have to do this within your newsletter program and insert the number. GA would then be able to pick this up from the campaign parameters.
E.g. for your use case you might construct a target URL like
www.example.com?utm_medium=email&utm_source=my_department&utm_campaign=pda_mail&utm_content=<number of characters>
and then get the information from the Acquisition reports in Google Analytics.
If the links do not point to your own homepage you would need to set up an intermediate page that tracks the utm_parameters before it redirects to the intended destination.
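For illustration, a target URL like the one above could be generated in the newsletter tooling along these lines. This is a minimal sketch: the function name buildCampaignUrl is made up, and the value 912 stands in for whatever character count you want to record.

```javascript
// Sketch: append utm_* campaign parameters to a landing page URL.
function buildCampaignUrl(baseUrl, campaign) {
  const url = new URL(baseUrl);
  for (const [key, value] of Object.entries(campaign)) {
    url.searchParams.set("utm_" + key, String(value));
  }
  return url.toString();
}

const campaignLink = buildCampaignUrl("https://www.example.com/", {
  medium: "email",
  source: "my_department",
  campaign: "pda_mail",
  content: 912, // hypothetical character count of the link
});
console.log(campaignLink);
```

Using the URL API rather than string concatenation means that values containing spaces or ampersands are encoded correctly.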
I am searching for a way to track user behavior on my website. I want to know if it is possible to get a table with data looking something like this:
+------+---------------+-----------------+------+---------+
| time | ip or user_id | user_session_id | link | actions |
+------+---------------+-----------------+------+---------+
(Link - where user came from)
I want to track different user actions by session. Is this possible using Google Analytics, or should I look for other tools? My site is currently set up to track events, but in my Analytics account I only get the number of events that occurred. I want to track what a specific user does on my site.
tl;dr: if you must do this use Mixpanel or similar software.
Time based dimensions are already available (date, hour, minutes and datetime). "link" would be referrer. Actions in Google Analytics are basically pageviews, events and transactions, so you have that, too.
IP and user id are big no-gos. Storing anything that identifies a person is a violation of Google's Terms of Service and, depending on your location, might be a violation of national laws. And if by user_id you mean the Google Analytics feature of the same name, Google says you may set it for logged-in users and have to unset it for users that log out, so by extension storing it in GA would probably be a violation of their TOS.
The GA session id is not exposed via the interface. You may read it from the cookie and store it in a custom dimension (I'm not sure if this is allowed within the TOS, on the other hand GA premium customers get this via a BigQuery export in any case, so it should be allowed).
If you simply want to tell different users apart, you might generate a string in the UUID format and store that in a custom dimension. If you want to actually identify users (by name, address etc.), well, you are not allowed to, and Google will terminate your account if they find out.
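A minimal sketch of the UUID idea, assuming analytics.js and a custom dimension already created in the property: the dimension index 1, the helper names and the fakeGa recorder are assumptions for the example, and Math.random is used only to keep the sketch dependency-free (it is not cryptographically strong).

```javascript
// Sketch: generate a random string in UUID v4 format.
function makeUuidV4() {
  return "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx".replace(/[xy]/g, (c) => {
    const r = Math.floor(Math.random() * 16);
    const v = c === "x" ? r : (r & 0x3) | 0x8; // "y" must be 8, 9, a or b
    return v.toString(16);
  });
}

// Store the ID in a custom dimension via the analytics.js command queue.
// `ga` is passed in so the function can be tried without the real library.
function tagVisitor(ga, dimensionIndex) {
  const id = makeUuidV4();
  ga("set", "dimension" + dimensionIndex, id);
  return id;
}

// Stand-in for the real ga() function: it just records the calls.
const gaCalls = [];
const fakeGa = (...args) => { gaCalls.push(args); };

const visitorId = tagVisitor(fakeGa, 1);
console.log(visitorId, gaCalls);
```

On a real page you would pass the global ga function and keep the generated ID in a cookie, so that the same visitor gets the same value on later visits.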
Not to mention that it completely eludes me why so many people want to track individual users. You must not use GA information to target individuals, and simply looking at individual user paths will not help you (I wrote an article about that, although I do not expect that this will convince you).
For technical and legal reasons, Google Analytics is not a good tool for tracking individual users; if you need to do this, use software that is made for this purpose. Mixpanel is often mentioned in that context, but I'm sure there are many other solutions.
There are a lot of very good articles and answers out there that explain (in detail) why we get spam referrals messing up Analytics data. Example results: how to automatically stop spam traffic in google analytics
What I want is a definitive solution...
If you only manage one analytics account, it would not be unreasonable to manually filter suspect domains, but even this is not sustainable. If you, like me, manage over 30 accounts and counting, it gets ridiculous. What is the long term solution?
Analytics data is important for making business decisions.
Is there some service that, like antivirus software, keeps updating its 'definitions' and constantly filters spam traffic?
How can we fight back?
And how can it be (or is it already) automated in a one-click solution?
You can try these 2 options to decrease referral spam:
Option 1 - Filter bots
Turn on the "Bot Filtering" option in your Google Analytics View Settings. You will need to do this for all of your views.
And also create a filter to exclude referral sources. To do this, create a new filter with options:
Filter Type: Custom
Exclude
Filter Field: Campaign Source
Filter Pattern: youporn-forum.uni.me|free-share-buttons.com|Get-Free-Traffic-Now.com|event-tracking.com|darodar.com
--> Also add other spam sources that you have
You can reuse the filter for multiple views.
It's also recommended not to apply these filters to your main view. Instead, create a copy of the main view and use it to analyse your data.
It's not a permanent solution (you will need to add new spam sources from time to time).
Option 2 - Segment real users
You can also create a segment that includes only users who viewed at least one page (this filters out the spam):
Create a new segment
Advanced > Conditions
Filter | Sessions | Include
Screen Views | per session | > | 0
Then, when you are analyzing your data, use this segment to see only real users.
Not all spammers are blocked by Google Analytics (only up to 75% of bots can be blocked by Google Analytics). By following the steps below you can largely automate removing referrer spam:
Go to Acquisition > All Traffic > Referrals; the report lists the sources your website gets traffic from.
Select all the sites that have a 0% or 100% bounce rate, copy them, and build a regular expression from them. The method for building the expression is given below.
Put "\." (an escaped dot) between the parts of each domain and use the "|" (pipe) symbol to separate the domains. E.g. if your blog is hit by the two spammy URLs "ads123.abc59055xxb896.comtom" and "dd54.xy789z.usjpa", the resulting expression is:
"ads123\.abc59055xxb896\.comtom|dd54\.xy789z\.usjpa"
If you have any trouble building the regular expression, click here to see the complete process.
Select the Add Filter option in the Admin tab, choose Exclude (an Include filter would keep only the spam traffic), paste the whole expression into the filter pattern field and click Save.
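Building the expression by hand gets tedious with a long list, so here is a small sketch that escapes the dots and joins the domains with pipes. The function names are made up for the example, and the two domains are the ones used above.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Join the escaped domains with "|" to get a filter pattern.
function buildSpamPattern(domains) {
  return domains.map(escapeRegex).join("|");
}

const spamPattern = buildSpamPattern([
  "ads123.abc59055xxb896.comtom",
  "dd54.xy789z.usjpa",
]);
console.log(spamPattern);

// Quick local check of the pattern before pasting it into the filter.
const spamRegex = new RegExp(spamPattern, "i");
console.log(spamRegex.test("dd54.xy789z.usjpa")); // a listed spam domain
console.log(spamRegex.test("www.example.com"));   // a legitimate referrer
```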
I have a website that allows people to create an account (that is the conversion I wish to track).
I wish to know where a specific person is coming from. I have google analytics installed and have set up the registration page as a goal, but the reporting tells me traffic sources as an aggregated pie chart. It doesn't report down to the user account level to say that 'person with email xyz' came from 'facebook' for example.
What custom variables or mark up would I need to add to GA to report at that detailed level, if that is at all possible?
Otherwise, I will just have to record the first http_referer in a cookie and stick it in a database during the registration process.
Any advice?
Firstly I must ask you, how actionable do you think it is to look at data at that granular of a level? Finding out what % of people who registered came from facebook or some other place is actionable, because it helps you do things like determine where to focus marketing efforts. But individual users? How is this actionable to you? (hint: it's not)
However, if you are still determined to know this, you should first note that it is against Google's ToS to record personally identifiable data either directly (recording the actual value in GA) or indirectly (e.g. recording a unique id that you can tie to personal info stored within your own system). If this is something you don't want to risk, I suggest moving to another analytics tool that does not have this sort of thing in its ToS (e.g. Adobe SiteCatalyst, which costs money, or perhaps you may instead prefer an "in-house" approach, like Piwik).
If you are still determined to follow through with this and hope not to get caught or whatever, Google Analytics doesn't record data like what info a visitor filled out in a form (like their email address) unless you populate that data in a custom field/dimension/metric/event to be sent along with the request. Usually you would populate this on the form "thank you" page (which is usually the same page you use as your goal url or goal event if you're popping and using an event for your goal). So you would populate the email address in one of those custom variables and then have it as a dimension to break down the http referrer by.
After being stumped by an earlier question: SO google-analytics-domain-data-without-filtering
I've been experimenting with a very basic analytics system of my own.
MySQL table:
hit_id, subsite_id, timestamp, ip, url
The subsite_id lets me drill down to a folder (as explained in the previous question).
I can now get the following metrics:
Page Views - Grouped by subsite_id and date
Unique Page Views - Grouped by subsite_id, date, url, IP (not necessarily how Google does it!)
The usual "most visited page", "likely time to visit" etc etc.
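As a sketch of the two groupings described above, assume the table rows have been loaded as plain objects with an ISO-formatted timestamp string. The function name summarizeHits and the sample rows are made up for illustration.

```javascript
// Sketch: page views per (subsite_id, date) and unique page views,
// where "unique" means a distinct (subsite_id, date, url, ip) combination.
function summarizeHits(hits) {
  const pageViews = new Map();
  const uniquePageViews = new Map();
  const seen = new Set();

  for (const hit of hits) {
    const day = hit.timestamp.slice(0, 10); // "YYYY-MM-DD" part of the timestamp
    const group = hit.subsite_id + "|" + day;

    pageViews.set(group, (pageViews.get(group) || 0) + 1);

    const uniqueKey = [group, hit.url, hit.ip].join("|");
    if (!seen.has(uniqueKey)) {
      seen.add(uniqueKey);
      uniquePageViews.set(group, (uniquePageViews.get(group) || 0) + 1);
    }
  }
  return { pageViews, uniquePageViews };
}

// Hypothetical rows from the hits table.
const sampleHits = [
  { subsite_id: 1, timestamp: "2018-03-27T10:00:00", ip: "203.0.113.1", url: "/x" },
  { subsite_id: 1, timestamp: "2018-03-27T11:00:00", ip: "203.0.113.1", url: "/x" },
  { subsite_id: 1, timestamp: "2018-03-27T12:00:00", ip: "203.0.113.2", url: "/x" },
  { subsite_id: 2, timestamp: "2018-03-28T09:00:00", ip: "203.0.113.1", url: "/y" },
];

const summary = summarizeHits(sampleHits);
console.log(summary.pageViews, summary.uniquePageViews);
```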
I've now compared my data to that in Google Analytics and found that Google has lower values for each metric, i.e. my own setup is counting more hits than Google.
So I've started discounting IPs from various web crawlers: Google, Yahoo & Dotbot so far.
Short Questions:
Is it worth me collating a list of all major crawlers to discount, is any list likely to change regularly?
Are there any other obvious filters that Google will be applying to GA data?
What other data would you collect that might be of use further down the line?
What variables does Google use to work out entrance search keywords to a site?
The data is only going to be used internally for our own "subsite ranking system", but I would like to show my users some basic data (page views, most popular pages etc) for their reference.
Lots of people block Google Analytics for privacy reasons.
Under-reporting by the client-side rig versus server-side seems to be the usual outcome of these comparisons.
Here's how I've tried to reconcile the disparity when I've come across these studies:
Data Sources recorded in server-side collection but not client-side:
hits from mobile devices that don't support javascript (this is probably a significant source of disparity between the two collection techniques; e.g., a Jan 07 comScore study showed that 19% of UK Internet users access the Internet from a mobile device)
hits from spiders, bots (which you mentioned already)
Data Sources/Events that server-side collection tends to record with greater fidelity (far fewer false negatives) compared with javascript page tags:
hits from users behind firewalls, particularly corporate firewalls: firewalls block the page tag, plus some are configured to reject/delete cookies.
hits from users who have disabled javascript in their browsers: five percent, according to the W3C Data
hits from users who exit the page before it loads. Again, this is a larger source of disparity than you might think. The most frequently cited study to support this was conducted by Stone Temple Consulting, which showed that the difference in unique visitor traffic between two identical sites configured with the same web analytics system, but which differed only in that the js tracking code was placed at the bottom of the pages in one site and at the top of the pages in the other, was 4.3%
FWIW, here's the scheme I use to remove/identify spiders, bots, etc.:
monitor requests for our robots.txt file, then of course filter all other requests from the same IP address + user agent (not all spiders will request robots.txt of course, but with minuscule error, any request for this resource is probably a bot).
compare user agents and IP addresses against published lists: iab.net and user-agents.org publish the two lists that seem to be the most widely used for this purpose
pattern analysis: nothing sophisticated here; we look at (i) page views as a function of time (i.e., clicking a lot of links with 200 msec on each page is probative); (ii) the path by which the 'user' traverses our site: is it systematic and complete or nearly so (like following a back-tracking algorithm); and (iii) precisely timed visits (e.g., 3 am each day).
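Two of these heuristics (the robots.txt check and the timing check) could be sketched roughly as follows. The request shape, the function names and the 200 ms threshold are assumptions taken from the description above, and a single fast pair of requests is a deliberately crude signal.

```javascript
// Sketch: flag IP + user agent pairs that requested robots.txt or that
// requested two pages within `minGapMs` of each other.
function findLikelyBots(requests, minGapMs = 200) {
  const bots = new Set();
  const lastSeen = new Map();
  const ordered = [...requests].sort((a, b) => a.timeMs - b.timeMs);

  for (const req of ordered) {
    const key = req.ip + "|" + req.userAgent;
    if (req.path === "/robots.txt") {
      bots.add(key);
    }
    const previous = lastSeen.get(key);
    if (previous !== undefined && req.timeMs - previous <= minGapMs) {
      bots.add(key);
    }
    lastSeen.set(key, req.timeMs);
  }
  return bots;
}

// Remove every request (earlier ones included) made by a flagged pair.
function dropBotRequests(requests) {
  const bots = findLikelyBots(requests);
  return requests.filter((req) => !bots.has(req.ip + "|" + req.userAgent));
}

// Hypothetical log entries; timeMs is milliseconds since some start point.
const sampleRequests = [
  { ip: "192.0.2.1", userAgent: "CrawlerX", path: "/robots.txt", timeMs: 0 },
  { ip: "192.0.2.2", userAgent: "Browser", path: "/a", timeMs: 1000 },
  { ip: "192.0.2.3", userAgent: "Scraper", path: "/a", timeMs: 2000 },
  { ip: "192.0.2.3", userAgent: "Scraper", path: "/b", timeMs: 2100 },
  { ip: "192.0.2.1", userAgent: "CrawlerX", path: "/a", timeMs: 5000 },
  { ip: "192.0.2.2", userAgent: "Browser", path: "/b", timeMs: 31000 },
];

const humanRequests = dropBotRequests(sampleRequests);
console.log(humanRequests);
```

Comparing against the published lists would be a third check on the same IP + user agent key. It is left out here because it depends on the format of the list you download.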
The biggest reasons are that users have to have JavaScript enabled and must load the entire page, as the code is often in the footer. AWStats and other server-side solutions like yours will get everything. Plus, Analytics does a really good job of identifying bots and scrapers.