I have a special situation where the site's visitors can access the page from a certain domain but no others. So HTML and assets are no problem as long as they are stored on the server. Google Analytics, on the other hand, requires a download of analytics.js from Google's servers, which is impossible.
So I'm looking for a way to proxy this. The webserver itself has internet access and could relay the traffic. To report to Google about my page view, a single pixel GIF is downloaded from Google, described here: https://developers.google.com/analytics/resources/concepts/gaConceptsTrackingOverview
I think it would be kind of easy to get all the parameters in the GIF and use the Measurement Protocol to report to Google from the server - but the hard bit is to get all this info to the server. Downloading analytics.js and modifying it to go to my own server seems to me like a hack that ain't future proof at all. Just getting the current page from the user to the server is not a big deal, but we would like to get the user id, browser version and everything you get with Analytics.
How would you do it? Did you find a solution for this?
Update: Google has since released server-side GTM, which allows you to proxy requests and scripts through a custom domain. In most use cases I can imagine, this would be a much better solution than a DIY proxy.
As pointed out in my comment, the utm.gif is no longer used. Google Analytics has completely switched to the Measurement Protocol, and data is now sent to the Measurement Protocol endpoint at google-analytics.com/collect. This actually still returns a transparent pixel, since requesting an image with parameters is a proven way of transmitting information across domain boundaries.
Now, you could just use the Measurement Protocol to implement your own Google Analytics tracker.
To quote myself:
Each call includes at least the ID of the account you want to send
data to, a client id that allows to group interactions into sessions
(so it should be unique per visitor, but it must not identify a user
personally), an interaction type (pageview, event, timing etc., some
interactions types require additional parameters) and the version of
the protocol you are using (at the moment there is only one version).
So the most basic example to record a pageview would look like this:
www.google-analytics.com/collect?v=1&tid=UA-XXXXY&cid=555&t=pageview&dp=%2Fmypage
You would probably want to add the user's IP (will be anonymized automatically) and the user agent.
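To make the structure of such a hit more concrete, here is a minimal sketch of assembling the URL in code. The function name buildPageviewUrl and its option names are made up for illustration; only the parameter names (v, tid, cid, t, dp, uip, ua) come from the Measurement Protocol as described above.

```javascript
// Sketch only: buildPageviewUrl and its option names are hypothetical helpers.
function buildPageviewUrl(opts) {
  var params = {
    v: '1',               // protocol version
    tid: opts.trackingId, // ID of the account/property
    cid: opts.clientId,   // anonymous client ID, unique per visitor
    t: 'pageview',        // interaction type
    dp: opts.path         // document path
  };
  // Optional overrides for the visitor's IP and user agent
  if (opts.userIp) { params.uip = opts.userIp; }
  if (opts.userAgent) { params.ua = opts.userAgent; }

  var query = Object.keys(params).map(function (key) {
    return key + '=' + encodeURIComponent(params[key]);
  }).join('&');

  return 'https://www.google-analytics.com/collect?' + query;
}

// Example call with placeholder values
var url = buildPageviewUrl({ trackingId: 'UA-XXXXX-Y', clientId: '555', path: '/mypage' });
```

Requesting the resulting URL from the server (for example with Node's https.get) would then record the hit.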
However it sounds like you prefer to use the standard Analytics code to collect the data and relay the tracking call via your own server. While I haven't used the following in production I don't see any reason why it wouldn't work.
First you need the analytics.js file. Self-hosting the file is discouraged, but the given reason is that the code is updated sometimes by Google and if you host it yourself you might miss the updates. This can be remedied by setting up a cron job that downloads the file regularly to your server so you always have a current version.
Next you'd adapt the GA bootstrap function to load the code from your own server:
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.myserver.com/analytics.js','ga');
Now you have the code, but the tracking call will still be sent to the Analytics Server (i.e. in your case it won't be sent at all). So you need to re-route the call via your server.
To make this possible the Google (Universal) Analytics Code has a feature called "tasks". Tasks are functions within the tracking code in which the tracking call is being assembled.
It is possible to modify tasks by using the "set" function of the tracker object, using the taskname as parameter and passing a function that overwrites/overloads the task function.
The following is pretty much the example from the Google documentation (except I omitted the part where data is still being sent to Google - you don't need this at this point):
ga('create', 'UA-XXXXX-Y', 'auto');
ga(function(tracker) {
  tracker.set('sendHitTask', function(model) {
    var payLoad = model.get('hitPayload');
    var gifRequest = new XMLHttpRequest();
    var gifPath = "/__ua.gif";
    gifRequest.open('get', gifPath + '?' + payLoad, true);
    gifRequest.send();
  });
});
ga('send', 'pageview');
Now this sends the data to a file called __ua.gif at your own server (if you need to send data cross-domain you can simply do a var ua = new Image; ua.src = gifPath + '?' + payLoad to create an image request).
The model parameter to the sendHitTask function contains (apart from a lot of overhead) the payload, that is, the assembled query string that contains the analytics data. You can then make your __ua.gif a script that proxies the request to google-analytics.com/collect.
At this point the user agent will be your script and the IP address will be that of your server, so you need to include the &uip (User IP override) and &ua (User agent override) parameters (https://groups.google.com/forum/#!msg/google-analytics-measurement-protocol/8TAp7_I1uTk/KNjI5IGwT58J) to get geo and technical information.
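As a rough illustration of what the script behind /__ua.gif could do, here is a sketch for a Node-style server. The helper names buildForwardUrl and forwardUrlForRequest are hypothetical, and the actual outgoing request to Google (for example via https.get) is left out. It assumes the incoming request object looks like Node's http.IncomingMessage (url, headers, socket.remoteAddress).

```javascript
// Sketch only: helper names are hypothetical, not part of any GA library.

// Append the visitor's IP and user agent to the payload assembled by analytics.js.
function buildForwardUrl(payload, clientIp, userAgent) {
  return 'https://www.google-analytics.com/collect?' + payload +
    '&uip=' + encodeURIComponent(clientIp) +
    '&ua=' + encodeURIComponent(userAgent);
}

// Derive the forward URL from an incoming request to /__ua.gif.
function forwardUrlForRequest(req) {
  var payload = req.url.split('?')[1] || '';  // query string sent by sendHitTask
  var clientIp = req.socket.remoteAddress;    // the visitor's IP as seen by your server
  var userAgent = req.headers['user-agent'] || '';
  return buildForwardUrl(payload, clientIp, userAgent);
}
```

If your server sits behind a reverse proxy, the visitor's IP would have to be taken from whatever header that proxy sets rather than from the socket.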
If you are feeling more adventurous you can override the buildHitTask instead and try and add the additional parameters there (more hassle probably since you'd need to get the IP address from somewhere).
For additional parameters see the reference for analytics.js and the Measurement Protocol.
In the case of HTTP requests like HEAD / GET / POST etc., what client information does the server receive?
I know some of the info includes the client IP, which can be used to block a user in case of, let's say, too many requests.
Another useful piece of information would be the user-agent, which is different for browsers, scripts, curl, Postman etc. (Of course the client can change the default by setting request headers, but that's alright.)
I want to know which other parameters can be used to identify a client (or define some properties). Does the server get the MAC address somehow?
So, is there a possibility that just by the request, it is identifiable that this request is being done by a "bot" (python or java code, eg.) vs a genuine user?
Assume there is no token or any such secret shared between client-server so there is no session...each subsequent request is independent.
The technique you are describing is generally called fingerprinting - the article covers properties and techniques. Depending on the use there are many criticisms of it, as it bypasses a user's intention of being anonymous. In all cases it is a statistical technique - like most analytics.
Putting your domain behind a service like Cloudflare might help prevent some of those bots from hitting your server. Other than a service like that, setting up a reCAPTCHA would block bots from accessing any pages behind it.
It would be hard to detect bots using solely HTTP because they can send you whatever headers they want. These services use other techniques to try and detect and filter out the bots, while allowing real users to access the site.
I don't think you can rely on any HTTP request header, because a client might not send it to the server, and/or there might be proxies between the client and the server that strip or alter the request headers.
If you just want to associate a unique ID to an HTTP request, you could generate an ID on your backend. For example, the JavaScript framework Hapi.js computes a request ID using this code:
new Date() + '-' + process.pid + '-' + Math.floor(Math.random() * 0x10000)
You might not even need to generate an ID manually. For example, if your app is on AWS and there is an Application Load Balancer in front of your backend, the incoming request will have the custom header X-Amzn-Trace-Id.
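A minimal sketch of combining both ideas, assuming a Node backend (which lowercases incoming header names): reuse the load balancer's trace header when it is present and fall back to a generated ID otherwise. The function name requestIdFor is made up for this example.

```javascript
// Sketch only: requestIdFor is a hypothetical helper.
const { randomUUID } = require('crypto');

function requestIdFor(headers) {
  // Node exposes header names in lowercase
  return headers['x-amzn-trace-id'] || randomUUID();
}
```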
As for distinguishing between requests made by human clients and bots, I think you could adopt a "time trap" approach like the one described in this answer about honeypots for spambots.
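The core of a time trap is just a comparison of two timestamps. Here is a small sketch; the function name, the 3 second threshold, and the idea of recording the render time on the server are all assumptions for illustration, not something taken from the linked answer.

```javascript
// Sketch only: name and threshold are hypothetical.
var MIN_FILL_MS = 3000; // assumed minimum time a human needs to fill in the form

function looksAutomated(renderedAtMs, submittedAtMs, minFillMs) {
  var threshold = minFillMs === undefined ? MIN_FILL_MS : minFillMs;
  // A submission that arrives faster than the threshold is treated as suspicious
  return (submittedAtMs - renderedAtMs) < threshold;
}
```

Like every heuristic in this thread it is statistical: a fast human or a patient bot will be misclassified.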
HTTP request headers are not a good way to track users of your site. This is because users can edit these headers and the server has no way to verify their authenticity. Also, in the case of the IP address, it can change during a session if, for example, a user is on a mobile network.
My suggestion is to use a cookie with a unique, random id, given to the user the first time they land on a page of your site. Keep in mind that the user can still edit/remove this cookie, so it isn't a perfect method. If you can force the user to log in, then you could track the user with their session token.
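A sketch of the cookie approach for a Node backend: read the id from the Cookie header if it is there, otherwise generate one and return the Set-Cookie value to send back. The cookie name visitor_id, the function name, and the cookie attributes are assumptions for this example.

```javascript
// Sketch only: cookie name, helper name and attributes are hypothetical choices.
const nodeCrypto = require('crypto');

function getOrCreateVisitorId(cookieHeader) {
  // Look for an existing visitor_id cookie in the raw Cookie header
  var match = /(?:^|;\s*)visitor_id=([^;]+)/.exec(cookieHeader || '');
  if (match) {
    return { id: match[1], setCookie: null };
  }
  // First visit: create a random 128-bit id and ask the browser to store it
  var id = nodeCrypto.randomBytes(16).toString('hex');
  return {
    id: id,
    setCookie: 'visitor_id=' + id + '; Path=/; HttpOnly; SameSite=Lax'
  };
}
```

When setCookie is not null, the server would send it as a Set-Cookie response header.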
Let's say I created a Google Sheet to capture users' email addresses.
On my website there is a small form, and once the submit button is clicked, an AJAX request is fired to a Google web app that writes the data to a sheet:
// Let's select and cache all the fields
var $inputs = $form.find("input, select, button, textarea");
// Serialize the data in the form
var serializedData = $form.serialize();
// Fire off the request (the URL must be a quoted string)
request = $.ajax({
  url: "https://script.google.com/macros/s/longURLcode/exec",
  type: "post",
  data: serializedData
});
In the Google script you now use doPost(e) or doGet(e) to handle any incoming HTTP request. To allow this to work as a sign-up mechanism, permissions for the web app have to be set to (I think):
Execute the app as: Me (myemail#gmail.com)
Who has access to the app: Anyone, even anonymous
Given that everything on the Google script side is set up properly, this works like a charm. So what's wrong?
Problem:
Anyone can either look into the source code of my webpage or use the dev tools to extract the url to the google web app after clicking submit. This url can now in theory be used to flood the sheet with countless (undefined) entries.
Questions:
1) Is there a way to limit accepted HTTP requests to certain origins? I tried to do this by accessing the HTTP headers within doPost() but there seems to be no way to do so.
2) Is "Who has access to the app: Anyone, even anonymous" the wrong approach? I thought this was necessary since you can only choose Google users here and some URL (mywebsite.com) seemed to fall within the anonymous category.
3) I don't think this is possible but maybe I missed an option: Is there a way to NOT expose the Google web app URL to anyone? I guess not, because you can monitor any requests with dev tools.
4) Is using Sheets for capturing that kind of data just a terrible idea in general (partly for the above reasons), and should I find another solution asap?
Our website is a vertical search engine and we refer a lot of traffic offsite to partners sites.
We recently switched our website over to serve all traffic via HTTPS. We realised this might confuse some of our partners if they were looking at referrer stats and saw a drop in traffic attributed to us. Therefore at the same time, we added the content-security-policy:referrer origin header and we can see that the referrer is correctly passed along by the browser.
Generally this is working fine but we have had complaints from users of Adobe SiteCatalyst (previously Omniture) who are no longer able to attribute traffic as being referred from us. We don't have access to SiteCatalyst to test this out. How does SiteCatalyst track referral traffic and is there a way to view all traffic split by different sources/referrers?
I don't know if this accounts for everything, since I don't have full context on either your end or your users' end, but here is some info / thoughts that might help.
By default, Adobe Analytics tracks referrer from document.referrer. This can be overridden by setting s.referrer.
In general, depending on how your site directs visitors to the other site and on browser security/privacy settings, document.referrer may or may not have a value. For example, Internet Explorer's default security/privacy settings suppress document.referrer on dynamically generated popup windows (e.g. window.open() calls).
So, and again, this is just speculation because I don't know the full context, you may need to work something out w/ your users, e.g. explicitly passing the referring URL as a query param to the target page, and having your users populate s.referrer with it if it exists. Something along the lines of:
if ( !document.referrer ) {
  s.referrer=s.Util.getQueryParam( 'refURL' );
}
Note: s.Util.getQueryParam is a utility function in the Adobe Analytics AppMeasurement library that returns the value of the specified query param, or an empty string if it doesn't exist. If your users are still using legacy H code, they should use the s.getQueryParam plugin instead. Or use whatever homebrewed method of getting a query param from the URL; older browsers have no built-in function for it, though modern ones provide URLSearchParams.
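For the homebrewed route, a small sketch using the standard URLSearchParams API. The function name getQueryParam and the decision to return an empty string for a missing param (mirroring the behaviour described for s.Util.getQueryParam) are my own choices here, not Adobe code.

```javascript
// Sketch only: a hypothetical stand-in, not the Adobe implementation.
function getQueryParam(name, search) {
  // URLSearchParams ignores a leading '?' and decodes percent-encoding
  var value = new URLSearchParams(search).get(name);
  return value === null ? '' : value;
}
```

In a browser you would call it as getQueryParam('refURL', window.location.search).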
While trying to understand how the Google Keyword Tool requests data, I found that it makes a request for a .gif file with GET arguments. For example:
https://ssl.google-analytics.com/__utm.gif?utmwv=&utms=&utmn=&utmhn=&utmt=&utme=&utmcs=&utmsr=&utmvp=&utmsc=&utmul=&utmje=&utmfl=&utmdt=&utmhid=&utmr=&utmp=&utmac=&utmcc=&utmu=
(I have omitted all argument data)
Can someone please explain?
Although it's a request, its purpose is to send analytics data in the query string parameters. For a good explanation, see Why does Google Analytics use __utm.gif?.
For more detail on what the actual parameters on the GIF request are, see: https://developers.google.com/analytics/resources/articles/gaTrackingTroubleshooting#gifParameters
This GET request gets handled by Google's analytics servers. It probably doesn't just directly serve __utm.gif from somewhere on the filesystem; it probably executes a script that takes all the parameters, does some processing, and logs that request in their analytics database, and then serves a 1x1 transparent GIF.
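To illustrate the idea (this is a guess at the general shape, not Google's actual implementation), here is a sketch of such a handler for Node. The function name handlePixelRequest and the in-memory log array are made up; the embedded bytes are a common 1x1 transparent GIF.

```javascript
// Sketch only: handler name and log storage are hypothetical.
// A 1x1 transparent GIF, base64-encoded
var TRANSPARENT_GIF = Buffer.from(
  'R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7', 'base64');

function handlePixelRequest(requestUrl, log) {
  // Everything after '?' carries the analytics data
  var query = requestUrl.split('?')[1] || '';
  var hit = {};
  new URLSearchParams(query).forEach(function (value, key) {
    hit[key] = value;
  });
  log.push(hit); // stand-in for writing to an analytics database

  // The response itself is just the tiny image
  return {
    status: 200,
    headers: { 'Content-Type': 'image/gif', 'Cache-Control': 'no-store' },
    body: TRANSPARENT_GIF
  };
}
```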
Just trying to understand why they didn't use a REST API.
In REST, clients initiate requests to servers for resources; servers process those requests and return appropriate responses.
The utm.gif is not involved in server-to-client data transfer, but instead it's involved in moving data in the other direction.
Of course REST has HTTP methods for the client to communicate with servers (GET and POST) and indeed, Google Analytics directs the client's browser to send all analytics data to the GA servers via a GET Request. More precisely, a GET Request is comprised of a Request URL and Request Headers (e.g., Referer and User-Agent Headers).
All GA data, every single item, is assembled and packed into the Request URL's query string (everything after the '?'). But in order for that data to go from the client (where it is created) to the GA server (where it is logged and aggregated) there must be an HTTP Request. So ga.js (the Google Analytics script, which the client downloads, unless it's cached, as a result of a function called when the page loads) directs the client to assemble all of the analytics data (e.g., cookies, location bar, request headers), concatenate it into a single string, and append it as a query string to a URL (http://www.google-analytics.com/__utm.gif?); that becomes the Request URL.
Of course there can't be an HTTP Request without a resource; so what resource is the client requesting from the server? It doesn't need anything from the server; instead it wants to send information to the server. So the actual server resource requested by the client is purely pretextual: the resource isn't even needed by the client, it's solely requested to comply with the transmission protocol. Therefore, it makes sense to make that resource as small and as unobtrusive as possible, which is why it's a 1 x 1 transparent pixel in GIF format. It is the smallest possible size and the least dense image format (bytes/pixel); I think it's a little over 30 bytes. A 1 x 1 image in the other common formats (e.g., JPEG, PNG, TIFF) is larger.
This general scheme for transferring data between a client and a server has been around forever; there could very well be a better way of doing this, but it's the only way I know of (that satisfies the constraints imposed by a hosted analytics service).
(Google Analytics does indeed have two APIs--"Data Export" and "Management"--which are both RESTful Web Services.)
You can use __utm.gif in browsers that don't support javascript using the <noscript> tag (with some work on the server), as well as in email messages (with some work before sending the email).
How are you gonna make a REST request in an email message?
Because it's an image, you can stick it anywhere you can use an image tag, even if you can't execute JS. Many years back Google pushed this for tracking of email campaigns. You could stick this formatted string in an HTML email message, and then any client that displays the message will send that request to the GA servers. You will get at a minimum IP info (which gets you geolocation also), and depending on the client you may also get OS, language and all the other browser settings. You don't get all the fancy analytics you get from the modern JS tracking scripts, but it still has its uses.
Here is a site that will help you format the request string and also has some more details.
Google pixel generator
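As a rough sketch of what such a generator produces, here is a helper that turns a set of parameters into an image tag for an HTML email. The function name buildPixelTag is made up, and the two parameters in the example (utmac, utmp) are simply taken from the parameter list quoted earlier; a real __utm.gif request expects more of them.

```javascript
// Sketch only: buildPixelTag is a hypothetical helper.
function buildPixelTag(baseUrl, params) {
  var query = Object.keys(params).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(params[key]);
  }).join('&');
  // Inside an HTML attribute, '&' has to be written as '&amp;'
  var src = (baseUrl + '?' + query).replace(/&/g, '&amp;');
  return '<img src="' + src + '" width="1" height="1" alt="">';
}

// Example with placeholder values
var tag = buildPixelTag('https://ssl.google-analytics.com/__utm.gif', {
  utmac: 'UA-XXXXX-Y',
  utmp: '/newsletter'
});
```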